| ykarel | thanks corvus | 04:46 |
|---|---|---|
| ykarel | Hi https://review.opendev.org/ looks down, can someone check | 06:02 |
| tonyb | ykarel: on it | 06:08 |
| ykarel | thx tonyb | 06:09 |
| tonyb | #status log review.opendev.org was shut off approximately an hour ago. Server restarted. Working on Gerrit restart | 06:21 |
| opendevstatus | tonyb: finished logging | 06:21 |
| mnasiadka | Oh boy | 07:00 |
| tonyb | mnasiadka: Yeah. | 07:01 |
| tonyb | #status log review.opendev.org service restored, albeit a little slow while caches rebuild | 08:42 |
| opendevstatus | tonyb: finished logging | 08:42 |
| tonyb | mnasiadka: I kinda recall there was something you wanted to talk about at the last OpenDev meeting? If so it's in about 11 hours | 08:46 |
| mnasiadka | tonyb: that got sorted out outside of OpenDev meeting | 08:50 |
| mnasiadka | I know it’s in 11 hours, because it’s just after the TC one | 08:51 |
| mnasiadka | I’ll come - maybe there will be something I can help with in OpenDev - sort of looking to be more active outside of Kolla, so anywhere I can help - just shout ;) | 08:51 |
| tonyb | mnasiadka: oh there are plenty of places to help | 08:52 |
| tonyb | we'll sort something out during the meeting | 08:52 |
| opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Fix ensure-rust wrapper script https://review.opendev.org/c/zuul/zuul-jobs/+/966020 | 10:11 |
| fungi | tonyb: i guess there were no errors in the server show output? if not, then this sounds identical to the outage at the summit, so oom killer taking out the vm or similar i guess | 13:53 |
| fungi | anyway, i confirm the service is up now, so thanks for taking care of it quickly | 13:53 |
| clarkb | re slowness starting up I believe that deleting the large h2 cache files will address that | 14:45 |
| clarkb | (this should be done when gerrit is stopped) | 14:45 |
| clarkb | infra-root I made a note on the meeting agenda that we should file a ticket with vexxhost. If anyone has done that before or wants to figure out how to do it, I think that is our next step | 14:46 |
| clarkb | something along the lines of "we noticed the server shutdown on november 11 and october whatever day it was. We don't know why this is happening as the instance doesn't seem to have any records of problems in its logs and nova doesn't show any errors. Talking to the nova team we believe that this could potentially happen if the hypervisor runs out of memory, but that is a hunch at | 14:47 |
| clarkb | this point and would require the cloud to debug" | 14:47 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Drop /opt/project-config/gerrit/projects.ini on review https://review.opendev.org/c/opendev/system-config/+/966083 | 15:21 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update Gerrit images to 3.10.9 and 3.11.7 https://review.opendev.org/c/opendev/system-config/+/966084 | 15:21 |
| clarkb | mnasiadka: right now the things I'm looking at are the etherpad 2.5.2 upgrade, gitea 1.25.0 upgrade, and the gerrit 3.11 upgrade. Those first two have held nodes that you can interact with and test to see if they work for you (or not). That is good feedback to have pre upgrade and I can dig up ip addresses for you if you are interested in spot checking things. | 16:35 |
| clarkb | mnasiadka: the gerrit upgrade just went back to limbo due to the changes above. We probably need to get gerrit sorted out in terms of uptime and restarts and being on latest bugfixes before we dig into upgrading too much | 16:36 |
| clarkb | beyond that I also need to look at bootstrapping matrix for opendev comms | 16:37 |
| mnasiadka | Sure - can help with etherpad and gitea testing - if you can pass the ips of the held nodes I can have a look | 16:37 |
| clarkb | https://etherpad.opendev.org/p/opendev-running-todo-list | 16:38 |
| mnasiadka | Gerrit - probably the cloud hosting that VM has some issues (if it’s OOM that it’s hitting) | 16:38 |
| clarkb | this is our more high level backlog/todo/wants list | 16:38 |
| clarkb | ya need to figure out filing a ticket | 16:38 |
| clarkb | mnasiadka: https://213.32.78.118:3081/opendev/system-config is the held gitea 1.25.0 | 16:40 |
| clarkb | 50.56.157.144 is the held etherpad. you have to put this one in /etc/hosts for etherpad.opendev.org due to how redirects work. there is a clarkb-test pad but feel free to make new ones too | 16:42 |
| mnasiadka | Any known gitea or etherpad regressions I should keep an eye on? | 16:43 |
| clarkb | mnasiadka: for etherpad the root page at / had css issues where the open pad by name button didn't render nicely in versions 2.5.0 and 2.5.1. We believe 2.5.2 fixes that (and appears to in our screenshots and my local testing) but good to check with your browser too | 16:45 |
| clarkb | then for gitea no known issues yet, they publish a large changelog here: https://github.com/go-gitea/gitea/blob/v1.25.0/CHANGELOG.md | 16:46 |
| fungi | mnasiadka: at this point, hypervisor host oom is just a likely guess because we use a very-high-ram flavor for that vm and the outward symptoms (limited as they may be) are consistent with an oom kill | 16:47 |
| fungi | it's something the cloud operators would need to confirm anyway when they (hopefully) look into it | 16:48 |
| fungi | guilhermesp___, ricolin_ and mnaser are the folks who would usually check that, if they're around | 16:48 |
| fungi | but if we open a ticket about it, they may be able to track it better once they have some time | 16:49 |
| mnaser | yes, a ticket would fare much better than an irc ping, at least it goes in the right queues =) | 16:50 |
| mnaser | i get pulled into a million things every second | 16:50 |
| clarkb | mnaser: understood. fwiw I've logged into horizon and don't see a link to submit a ticket from there. I assume we need to login via vexxhost id to your main website and do it from there but it looks like we only ever got keystone credentials so I'm not sure how to do that | 16:51 |
| clarkb | mnaser: is there a workaround for that (maybe I'm missing a link or maybe we can login using keystone creds there too? or maybe we can send email to a special address?) | 16:52 |
| mnaser | yeah i think you are a bit of a unicorn with an openstack account only, send an email to support@vexxhost.com and life will be easy -- just call out this is for the openinfra ci account | 16:52 |
| clarkb | mnaser: thank you! I'll get that done soonish | 16:53 |
| clarkb | tuesday morning is my morning of meetings but I should be able to draft something during the meeting blocks | 16:53 |
| mnasiadka | fungi: well, from what tonyb said on #openstack-nova - it was in shutoff state - so a lot of possibilities - from OOM/libvirt killing the VM to a power off from inside the VM (but that’s rather impossible without it being logged) - but basically everything outside of Nova’s hands | 16:54 |
| fungi | right | 16:54 |
| mnaser | it is prooooobably oom, i'll have to see why, we do have a fair bit of reserved memory | 16:54 |
| clarkb | mnaser: that's what we figured. I'll get an email sent with the two latest occurrences and their timestamps as well as info about the host (name, uuid) | 16:55 |
| fungi | it's happened a couple of times in the past month, so figuring it out in order to avoid it better in the future is of interest to us | 16:55 |
| clarkb | fungi: you don't happen to have rough timelines for the occurrence during the summit do you? | 16:57 |
| clarkb | I should be able to dig it up from logs in irc if nothing else | 16:57 |
| fungi | i would just end up digging it out of irc history myself, sorry | 16:57 |
| fungi | i know i captured and dug into the approximate time the outage started, just don't remember the details (and wouldn't trust my memory anyway since i was on a different continent at the time) | 16:58 |
| clarkb | ya I'll find it in the logs | 16:59 |
| corvus | https://matrix.to/#/!GXiijUJAGqDLBuZqwV:matrix.org/$7agBEPLFoFifgX_5wjyiiWcLaREDZWTKcPi6jXF1khk?via=matrix.org&via=matrix.uhurutec.com&via=ubuntu.com | 16:59 |
| corvus | sunday october 19 | 16:59 |
| fungi | oh, also it may have been relayed by someone else because my shell server was offline due to an unrelated incident in rackspace flex, so i wasn't in irc at the time | 16:59 |
| clarkb | we still have syslog logs for both so I'm going off of those timestamps | 17:08 |
| mnasiadka | clarkb: I think I’ve been in all dark corners of gitea and changelog doesn’t list anything that looks weird | 17:10 |
| clarkb | mnasiadka: thank you for checking: maybe drop a note on https://review.opendev.org/c/opendev/system-config/+/965960 to record that? | 17:11 |
| mnasiadka | done | 17:14 |
| clarkb | thanks | 17:14 |
| clarkb | ok email sent to vexxhost support | 17:14 |
| clarkb | I cc'd opendev admins on it too | 17:14 |
| fungi | thanks! | 17:25 |
| mnasiadka | clarkb: done the same with etherpad, if everything works even in Safari, then I guess it should be fine | 17:40 |
| clarkb | mnasiadka: yup I think we can probably proceed with that upgrade at this point. My testing all looked good as did tonyb's | 17:41 |
| clarkb | I've got my large block of meetings then lunch (and maybe a bike ride if weather cooperates) so either later today or first thing tomorrow on etherpad | 17:41 |
| clarkb | it's good you checked safari as I am not able to :) | 17:41 |
| Clark[m] | As expected gitea 1.25.1 just released | 20:06 |
| clarkb | ok I have about a 2 or 3 hour window before the next atmospheric river is supposed to arrive so I'm going to take it | 20:37 |
| clarkb | but when I get back we can decide if we are comfortable with upgrading etherpad and I'll update the gitea change and swap its hold for 1.25.1 | 20:37 |
| *** | mnaser[m] is now known as mnaser | 20:46 |
| mnaser | Did https://docs.opendev.org/opendev/infra-specs/latest/specs/matrix_for_opendev.html ever end up happening or it's still mostly IRC only? | 20:48 |
| mnaser | I'm setting up Matrix for me and just wondering :) | 20:48 |
| Clark[m] | mnaser it hasn't happened yet but hopefully will soon | 20:49 |
| fungi | mnaser: we just talked about it in the meeting an hour ago, tonyb will work on setting up the channel on our homeserver soonish, corvus is planning to work on the new statusbot for it, and mnasiadka was interested in helping with some aspects of the move too | 20:50 |
| mnaser | sounds good! | 20:50 |
| fungi | there is already a #openstack-ops:opendev.org channel on our homeserver that might be of interest in the meantime though, in addition to #zuul:opendev.org | 20:51 |
| Clark[m] | I need to clean up after my bike ride but I think I have enough time to upgrade etherpad when done | 22:26 |
| tonyb | thanks Clark[m] | 22:34 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/956593 has been approved. I will monitor it | 22:48 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.25 https://review.opendev.org/c/opendev/system-config/+/965960 | 22:50 |
| opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 22:50 |
| clarkb | that updates to 1.25.1 and I'll recycle the autohold too | 22:51 |
| opendevreview | Merged opendev/system-config master: Upgrade etherpad to 2.5.2 https://review.opendev.org/c/opendev/system-config/+/956593 | 23:22 |
| clarkb | the deployment job for ^ failed. I have logged into the server and it appears to be running the old image and the new image isn't even present yet so we're good for now. I'll look at job logs momentarily | 23:28 |
| clarkb | ERROR: for etherpad toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit | 23:29 |
| clarkb | I'm going to put the docker hub ipv4 addresses in /etc/hosts on etherpad.o.o to force it to use ipv4 then reenqueue the deployment jobs. The only question I have about doing that is the image promotion job is part of the deployment buildset. corvus do you know if we can safely rerun the promotion job? I think yes? | 23:30 |
| clarkb | but I'm not sure if it will "safely" fail because the promotion is already done and then the deployment itself won't run because the promotion failed... | 23:30 |
| clarkb | I've forgotten how annoying these rate limits are now that most things are on quay | 23:31 |
| clarkb | actually we just do a docker-compose pull and then docker-compose up -d. I'll just do that rather than reenqueue the deploy buildset as that is easy | 23:36 |
| clarkb | ok etherpad reports it is up and healthy now | 23:41 |
| corvus | clarkb: ack. lmk if you have further questions but easy sounds good | 23:41 |
| corvus | meanwhile we broke the logjam on zuul changes, so the multi-provider fix merged and was just promoted | 23:41 |
| corvus | i'll restart the launchers | 23:41 |
| clarkb | https://etherpad.opendev.org/p/gerrit-upgrade-3.10 loads for me which has a lot of formatting | 23:41 |
| clarkb | corvus: thanks | 23:41 |
| clarkb | arg the main page css formatting is still a little odd but only on firefox | 23:42 |
| clarkb | despite all the testing it still comes out that way for some reason | 23:42 |
| corvus | what's odd? | 23:43 |
| clarkb | corvus: the text tries to escape the button block | 23:43 |
| clarkb | it doesn't do that on chrom*, and I reported it upstream and they fixed some stuff, and when I tested with the held node it didn't do that, so now I'm slightly confused | 23:43 |
| clarkb | it's not critical | 23:43 |
| clarkb | but I was trying to get it fixed before we upgraded and thought it was, so I'm annoyed that it isn't | 23:44 |
| corvus | can you screenshot what you see that's wrong? | 23:44 |
| clarkb | yes one sec | 23:45 |
| corvus | #status log restarted zuul launchers with multi-provider ready node fix | 23:45 |
| opendevstatus | corvus: finished logging | 23:46 |
| corvus | ykarel: ^ fyi thanks and sorry :) | 23:46 |
| clarkb | corvus: I shared it with you on matrix because I wasn't sure how else to do it (I guess imgur?) | 23:47 |
| corvus | oh that main page | 23:47 |
| corvus | yeah i see that too that's weird | 23:47 |
| clarkb | it doesn't do that on chrom* | 23:48 |
| corvus | sorry for not understanding the words you very clearly typed -- i just got fixated on the actual pad. :) | 23:48 |
| clarkb | I'm going to remove the /etc/hosts overrides on the prod server then I'm going to recheck the held server node | 23:48 |
| clarkb | it's completely functional, just looks odd | 23:48 |
| clarkb | so I don't think we need to rollback | 23:48 |
| clarkb | ya the held server at 50.56.157.144 doesn't do it. I wonder if apache is caching things maybe? (I'm doing my testing in an incognito tab so theoretically my browser isn't caching it) | 23:52 |
| clarkb | but I don't see any explicit caching config in the vhost | 23:52 |
| clarkb | corvus: do you think it is worth restarting apache to see if it's maybe doing some implicit caching? | 23:52 |
| clarkb | I'm going to try and grab css files first to compare | 23:53 |
| corvus | yeah that doesn't sound like it should be a thing, but it's easy and low impact, so why not | 23:53 |
| clarkb | another idea that I hate to entertain is that the v2.5.2 tag moved between when I held the node and when we just built etherpad | 23:55 |
| clarkb | ok I have confirmed that the css files are different | 23:57 |
| clarkb | now to restart apache2 and see if that changes | 23:57 |
| clarkb | nope that didn't change anything | 23:58 |
| clarkb | so now I'm going to check if the tag moved | 23:58 |
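For reference on clarkb's 14:45 note about slow restarts: a minimal sketch of clearing Gerrit's persistent h2 caches while the service is stopped. The compose directory and site path below are assumptions, not the actual layout on review.opendev.org; Gerrit rebuilds these cache files at runtime, which is why the service runs slowly for a while afterwards.

```sh
# Minimal sketch, assuming Gerrit runs via docker-compose and its site lives
# at /home/gerrit2/review_site (both paths are assumptions; adjust as needed).
cd /etc/gerrit-compose                            # hypothetical compose directory
docker-compose down                               # gerrit must be stopped first
rm -f /home/gerrit2/review_site/cache/*.h2*.db    # persistent h2 cache files
docker-compose up -d                              # caches repopulate as gerrit runs
```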
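The /etc/hosts override clarkb describes at 16:42 for spot-checking the held etherpad node amounts to one line on the tester's own machine, since the service redirects to its canonical hostname:

```sh
# Point etherpad.opendev.org at the held node, on your workstation only
# (IP taken from the log above); remove the entry when done testing.
echo '50.56.157.144 etherpad.opendev.org' | sudo tee -a /etc/hosts
```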
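A sketch of the manual recovery clarkb walks through between 23:29 and 23:36. Forcing pulls over IPv4 can help because Docker Hub's unauthenticated pull limit is counted per IPv4 address but reportedly shared across whole IPv6 ranges. The registry address shown is a placeholder (resolve the real ones rather than copying it), and the compose directory is an assumption:

```sh
dig +short A registry-1.docker.io      # find Docker Hub's current IPv4 addresses
# Add one line per address to /etc/hosts, e.g. (placeholder address):
#   203.0.113.10 registry-1.docker.io
cd /etc/etherpad-docker                # hypothetical compose directory
docker-compose pull && docker-compose up -d
# Remove the /etc/hosts overrides afterwards, as clarkb does at 23:48.
```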
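Finally, the tag-drift hypothesis clarkb raises at 23:55 is cheap to test without rebuilding anything, assuming the image is built from the upstream ether/etherpad-lite repository:

```sh
# Compare the object the tag points at now against the one used for the held
# node's build; if they differ, the tag moved.
git ls-remote https://github.com/ether/etherpad-lite refs/tags/v2.5.2
# If the tag is annotated, dereference it to the tagged commit:
git ls-remote https://github.com/ether/etherpad-lite 'refs/tags/v2.5.2^{}'
```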