Tuesday, 2025-07-15

mnasiadkaGerrit seems to be down04:54
tonybYes it does.   I'm not near my laptop ATM.04:59
tonybit's probably too early for frickler 04:59
ajaiswal[m]is https://review.opendev.org/ gerrit down05:50
fricklerssh: connect to host review03.opendev.org port 22: No route to host07:03
fricklermnaser: ricolin: vexxhost API says it is shutoff since 2025-07-15T04:24:21Z , any known issues?07:05
fricklergoing to start it now, but not sure what recovery steps need to be taken if it doesn't boot normally07:06
mnaserI can't have a look right now but no issues. If it's shut off that may be some sort of oom07:09
fricklerthe vm is started again now, no shutdown messages in the journal from before the outage, so oom or other hypervisor issue seems likely07:22
fricklerand the dockers really do not start automatically, so I'll now need to dig into our docs on how to recover this07:22
frickler#status notice the gerrit service (https://review.opendev.org) is currently down, please be patient while we work on restoring it07:32
opendevstatusfrickler: sending notice07:32
-opendevstatus- NOTICE: the gerrit service (https://review.opendev.org) is currently down, please be patient while we work on restoring it07:32
fricklerseems we don't have any recovery docs that I could find. so I'll rather wait for some other infra-root to show up than possibly breaking even more07:34
opendevstatusfrickler: finished sending notice07:36
tonybfrickler: I finally have access to my laptop.   do you need any help.    obviously clarkb or fungi would be better than me but I'm here 08:17
fricklertonyb: well the thing is I don't know how safe it is to simply start the gerrit containers or whether some cleanup should be done first. feel free to proceed if you feel confident enough08:39
tonybI'm not sure we have a better option?09:52
tonybI'm reading about Gerrit recovery to be sure09:52
mnasiadkafrickler: looking at the gerrit role in system-config - I don’t see any magic required - unless you want to rebuild indexes - but I’ve never seen Gerrit besides being a user :)10:46
*** iurygregory_ is now known as iurygregory11:15
*** hjensas_ is now known as hjensas11:25
tonybinfra-root I've restarted gerrit and the site seems to be up, diffs are missing as I didn't truncate the cache I expect they'll come along soon11:39
tonybI see diffs11:41
tonybCorrection I see some diffs11:45
tonyband I see what looks to be a crawler of some description11:45
tonyb#status notice the gerrit service (https://review.opendev.org) is back up.  We believe the restoration is complete.  If you notice any issues please report them in #opendev ASAP11:53
opendevstatustonyb: sending notice11:53
-opendevstatus- NOTICE: the gerrit service (https://review.opendev.org) is back up. We believe the restoration is complete. If you notice any issues please report them in #opendev ASAP11:54
fricklerI'm seeing no comments in the UI, e.g. https://review.opendev.org/c/openstack/requirements/+/90766511:54
fricklerhmm, some minutes later the comments have appeared without me touching the page in the meantime. so just slowish somehow11:57
opendevstatustonyb: finished sending notice11:57
tonybYeah kinda slow, 12:00
tonybinfra-root: I have a screen session open as root on review03 feel free to join12:00
tonybI'm not seeing anything in the error_log12:00
tonybIt seems better to me12:09
fungisorry, at the keyboard now, and yes i would have just started the containers as-is too13:17
fungigerrit is sometimes sort of slow when it first starts due to background tasks13:18
fungioh geez, i about had a heart attack looking at the syslog until it dawned on me that podman makes up random container names13:27
fungii was seeing commands logged for containers called things like wizardly_mcnulty and nifty_khayyam and thought the server was compromised13:28
fungii can't believe someone thought that was a feature13:28
fungiso looks like the last thing recorded to syslog before it died was an innocuous sysstat collection run at 2025-07-15T04:20:0013:30
fungimnasiadka reported the outage here at 04:54, i'll see if i can narrow it down better but there should have been another sysstat collection logged at 04:40 so it had presumably already died by then13:32
fungilast log entry from apache before the reboot was 15/Jul/2025:04:23:57 so probably happened moments after that13:34
fricklerfungi: yes, as mentioned above the vexxhost API said 2025-07-15T04:24:21Z for the shutoff update13:36
funginot finding anything thanks, so that's a pretty strong correlation. if the server lost the ability to write to /var before it went offline entirely, it was at most 24 seconds13:37
fungiassuming timestamps in the os and on the cloud backend are relatively accurate13:37
fricklerlast systemd journal entry was 04:23:47, which maybe syncs a bit less often than direct syslog writes, so that matches, too13:38
fricklers/syslog/apache log/13:39
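The correlation above can be checked with a little arithmetic. The sketch below subtracts each "last seen" timestamp quoted in the channel from the shutoff time reported by the vexxhost API. It assumes every timestamp is UTC and that the clocks involved are roughly in sync, as fungi notes; the names `shutoff`, `last_seen` and `gaps_before` are made up for illustration.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Shutoff time reported by the vexxhost API (quoted in the log above).
shutoff = datetime(2025, 7, 15, 4, 24, 21, tzinfo=UTC)

# Last entries found in guest-side logs; assumed here to be UTC as well.
last_seen = {
    "sysstat (syslog)": datetime(2025, 7, 15, 4, 20, 0, tzinfo=UTC),
    "systemd journal": datetime(2025, 7, 15, 4, 23, 47, tzinfo=UTC),
    "apache log": datetime(2025, 7, 15, 4, 23, 57, tzinfo=UTC),
}


def gaps_before(shutoff_time, entries):
    """Return whole seconds between each last log entry and the shutoff."""
    return {
        name: int((shutoff_time - ts).total_seconds())
        for name, ts in entries.items()
    }


gaps = gaps_before(shutoff, last_seen)
for name, seconds in gaps.items():
    print(f"{name}: {seconds}s before shutoff")
```

The apache log gap is the 24 seconds fungi mentions; the journal entry is 10 seconds older, which fits frickler's guess that the journal syncs a bit less often.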
fungiyeah, so at this point it doesn't look like we have anything in guest-side logs to indicate any reason for it inexplicably going offline. mnaser's suggestion of something like an oom on the hypervisor host would fit13:42
fungiit is a very massive vm, after all13:43
fungi128gb ram13:43
slittle1_Looking for advice ... I'm trying to bring a feature branch up to date relative to the master branch via git merge across all the starlingx repos14:26
slittle1_the result is this series of reviews ... https://review.opendev.org/q/topic:%22merge-master-as-of-20250711T041000Z%2214:27
slittle1_zuul is unhappy with about a dozen of them, even though it allowed the same updates into master14:28
slittle1_my theory is that there are a lot of cross repo dependencies that were met when the content was delivered in small chunks and in the correct order.  However with the merge, some components are being tested vs other repos in their unmerged state14:30
slittle1_is there a way around this?14:31
slittle1_Can I either make zuul test the whole set of updates across ~100 repos together?   Can zuul be overridden in some way?14:33
fungislittle1_: you can push a change to disable testing in that branch on each repository and then enable testing again once you have all the merges completed?14:35
fungizuul does have an "atomic merge" functionality when setting circular dependencies between changes, but 1. i don't know how well that would work with proposing interdependent merge commits across 100 different repos at once and 2. it's a tenant-wide setting but starlingx isn't configured as a separate tenant it's still sharing a tenant with lots of other projects14:37
fungianother option would be to temporarily grant push --force permissions to you on those repositories so you can bypass code review and then you can try to sort out any remaining broken tests after the fact, but be aware that would allow you to accidentally create a lot of mess if you pushed the wrong thing somewhere14:40
slittle1_Can disabling the zuul tests on a branch be done en-masse through the project config?  Or would I have to modify each project one at a time?14:44
fungiit depends on where the jobs are added to your project pipelines. if they're added in .zuul.yaml files in each project's feature branch you'd need to disable them there. if they're added to the project pipelines in central configuration that's where you'd temporarily limit what branches they apply to14:46
fungiodds are you have a mix of both, but you probably only need to disable the jobs that are failing14:46
slittle1_k, thanks14:49
clarkbtonyb: frickler: fungi: reindexing changes after a gerrit restart can help prevent duplicate change id problems if a change is created in a project and not indexed due to shutting down before it gets into the index14:57
clarkbbut that is a relatively minor problem if it occurs. And yes if the server comes up and the filesystems look happy then starting services is the next step14:57
clarkbthe main concern and why we don't auto start things is that we want to ensure that the filesystem is in place properly before starting the service14:58
clarkbslittle1_: fungi  when you push a merge commit if each of the commits that are new to the branch have already been reviewed in gerrit for another branch I thought it didn't make you review each of them. It only makes you review the merge commit itself15:01
clarkbbut I guess the issue here is the branch exists across a bunch of repos and it's the coordination there that is the problem? Not unexpected reviews for changes in a single repo15:02
slittle1_correct, it's just the merge commit.   However the merge commits are throwing zuul errors as each git is accessed vs the unmerged state of all the rest of our repos.15:03
fungii imagine it's that some of the integration tests depend on newer functionality across different repos15:03
fungislittle1_: when you say "zuul errors" you mean they're not being tested and instead you're seeing messages about missing job definitions and the like? or jobs are running but returning failing results?15:04
slittle1_they run and return failure results15:04
slittle1_eg https://review.opendev.org/c/starlingx/distcloud/+/95501115:05
clarkbya so another option is to set up depends on between the changes if the issue is simply dependency ordering15:05
slittle1_I've no idea what the correct order is.15:06
fungiright, depends on how many of the changes are failing jobs and whether their interdependencies can be untangled easily15:06
fricklerclarkb: so "filesystem is in place" refers to /home/gerrit2 mostly? maybe add that to the docs somewhere, then? or place a check into the container startup?15:15
clarkbfrickler: yes the /home/gerrit2 volume. I'm not sure what sort of check would be appropriate for automation. We would probably have to write our own unit either way to do anything useful15:16
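One possible shape for such a check, as a minimal sketch only: refuse to proceed unless the volume path is a mount point whose contents can be listed. This is not OpenDev's actual tooling; `volume_ready` and `check` are hypothetical helpers, and the only detail taken from the discussion is the `/home/gerrit2` path.

```python
import os
import sys


def volume_ready(path):
    """Return True only if `path` is a mount point whose contents can be listed."""
    if not os.path.ismount(path):
        # Missing directory or a plain directory on the root filesystem.
        return False
    try:
        os.listdir(path)
    except OSError:
        # Mounted but unreadable (e.g. I/O error or permission problem).
        return False
    return True


def check(path="/home/gerrit2"):
    """Print a verdict and return a process exit status (0 = OK, 1 = not ready)."""
    if volume_ready(path):
        print(f"{path} is mounted; OK to start services")
        return 0
    print(f"{path} is not a usable mounted volume; not starting", file=sys.stderr)
    return 1
```

A wrapper script could call `sys.exit(check())` before the containers are started, so that a missing volume blocks startup instead of letting the service come up against an empty directory. Whether that is strict enough, and how it would be wired into a custom unit, is left open here just as it is in the discussion.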
opendevreviewAntoine Musso proposed opendev/git-review master: Command to delete applied local branches  https://review.opendev.org/c/opendev/git-review/+/95509417:48
opendevreviewMerged openstack/project-config master: DCO enforcement for all OpenInfra Foundation repos  https://review.opendev.org/c/openstack/project-config/+/95467217:58
opendevreviewJeremy Stanley proposed openstack/project-config master: Remove CLA enforcement from all projects and lock  https://review.opendev.org/c/openstack/project-config/+/95437418:05
fungiokay, i'm going to disappear for a few minutes to grab a bite to eat, should be back in under an hour19:27
