mnasiadka | Gerrit seems to be down | 04:54 |
tonyb | Yes it does. I'm not near my laptop ATM. | 04:59 |
tonyb | it's probably too early for frickler | 04:59 |
ajaiswal[m] | is https://review.opendev.org/ gerrit down | 05:50 |
frickler | ssh: connect to host review03.opendev.org port 22: No route to host | 07:03 |
frickler | mnaser: ricolin: vexxhost API says it is shutoff since 2025-07-15T04:24:21Z , any known issues? | 07:05 |
frickler | going to start it now, but not sure what recovery steps need to be taken if it doesn't boot normally | 07:06 |
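A minimal sketch of the API-side check and start frickler describes, assuming sourced credentials for the hosting tenant and the standard openstackclient; the instance name here is taken from the hostname in the log and is an assumption:

```bash
# Confirm the instance state and when it last changed.
openstack server show review03.opendev.org -f value -c status -c updated

# If it reports SHUTOFF, start it again.
openstack server start review03.opendev.org
```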
mnaser | I can't have a look right now, but there are no known issues. If it's shut off that may be some sort of OOM | 07:09 |
frickler | the vm is started again now, no shutdown messages in the journal from before the outage, so oom or other hypervisor issue seems likely | 07:22 |
frickler | and the docker containers really do not start automatically, so I'll now need to dig into our docs on how to recover this | 07:22 |
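The manual start frickler alludes to presumably looks something like the sketch below; the compose directory path is an assumption, since the real location depends on how system-config deploys the service:

```bash
# Hypothetical compose directory for the gerrit deployment.
cd /etc/gerrit-compose

# Check what the compose file defines, then bring the containers up detached.
docker-compose ps
docker-compose up -d
```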
frickler | #status notice the gerrit service (https://review.opendev.org) is currently down, please be patient while we work on restoring it | 07:32 |
opendevstatus | frickler: sending notice | 07:32 |
-opendevstatus- | NOTICE: the gerrit service (https://review.opendev.org) is currently down, please be patient while we work on restoring it | 07:32 |
frickler | seems we don't have any recovery docs that I could find, so I'd rather wait for some other infra-root to show up than possibly break even more | 07:34 |
opendevstatus | frickler: finished sending notice | 07:36 |
tonyb | frickler: I finally have access to my laptop. do you need any help? obviously clarkb or fungi would be better than me, but I'm here | 08:17 |
frickler | tonyb: well the thing is I don't know how safe it is to simply start the gerrit containers or whether some cleanup should be done first. feel free to proceed if you feel confident enough | 08:39 |
tonyb | I'm not sure we have a better option? | 09:52 |
tonyb | I'm reading about Gerrit recovery to be sure | 09:52 |
mnasiadka | frickler: looking at the gerrit role in system-config - I don't see any magic required - unless you want to rebuild indexes - but I've only ever seen Gerrit as a user :) | 10:46 |
*** | iurygregory_ is now known as iurygregory | 11:15 |
*** | hjensas_ is now known as hjensas | 11:25 |
tonyb | infra-root I've restarted gerrit and the site seems to be up. diffs are missing as I didn't truncate the cache; I expect they'll come along soon | 11:39 |
tonyb | I see diffs | 11:41 |
tonyb | Correction I see some diffs | 11:45 |
tonyb | and I see what looks to be a crawler of some description | 11:45 |
tonyb | #status notice the gerrit service (https://review.opendev.org) is back up. We believe the restoration is complete. If you notice any issues please report them in #opendev ASAP | 11:53 |
opendevstatus | tonyb: sending notice | 11:53 |
-opendevstatus- | NOTICE: the gerrit service (https://review.opendev.org) is back up. We believe the restoration is complete. If you notice any issues please report them in #opendev ASAP | 11:54 |
frickler | I'm seeing no comments in the UI, e.g. https://review.opendev.org/c/openstack/requirements/+/907665 | 11:54 |
frickler | hmm, some minutes later the comments have appeared without me touching the page in the meantime. so just slowish somehow | 11:57 |
opendevstatus | tonyb: finished sending notice | 11:57 |
tonyb | Yeah, kinda slow. | 12:00 |
tonyb | infra-root: I have a screen session open as root on review03 feel free to join | 12:00 |
tonyb | I'm not seeing anything in the error_log | 12:00 |
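A sketch of the post-restart checks tonyb describes, assuming the Gerrit site lives under /home/gerrit2 (mentioned later in the log); the review_site path is an assumption:

```bash
# Follow Gerrit's application log for errors while it warms up.
tail -f /home/gerrit2/review_site/logs/error_log

# Confirm the containers are running.
docker ps
```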
tonyb | It seems better to me | 12:09 |
fungi | sorry, at the keyboard now, and yes i would have just started the containers as-is too | 13:17 |
fungi | gerrit is sometimes sort of slow when it first starts due to background tasks | 13:18 |
fungi | oh geez, i about had a heart attack looking at the syslog until it dawned on me that podman makes up random container names | 13:27 |
fungi | i was seeing commands logged for containers called things like wizardly_mcnulty and nifty_khayyam and thought the server was compromised | 13:28 |
fungi | i can't believe someone thought that was a feature | 13:28 |
fungi | so looks like the last thing recorded to syslog before it died was an innocuous sysstat collection run at 2025-07-15T04:20:00 | 13:30 |
fungi | mnasiadka reported the outage here at 04:54, i'll see if i can narrow it down better, but there should have been another sysstat collection logged at 04:40, so it had presumably already died by then | 13:32 |
fungi | last log entry from apache before the reboot was 15/Jul/2025:04:23:57 so probably happened moments after that | 13:34 |
frickler | fungi: yes, as mentioned above the vexxhost API said 2025-07-15T04:24:21Z for the shutoff update | 13:36 |
fungi | i'm not finding anything else, thanks, so that's a pretty strong correlation. if the server lost the ability to write to /var before it went offline entirely, the gap was at most 24 seconds | 13:37 |
fungi | assuming timestamps in the os and on the cloud backend are relatively accurate | 13:37 |
frickler | last systemd journal entry was 04:23:47, which maybe syncs a bit less often than direct syslog writes, so that matches, too | 13:38 |
frickler | s/syslog/apache log/ | 13:39 |
fungi | yeah, so at this point it doesn't look like we have anything in guest-side logs to indicate any reason for it inexplicably going offline. mnaser's suggestion of something like an oom on the hypervisor host would fit | 13:42 |
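A sketch of the guest-side log checks described above, assuming standard Debian/Ubuntu log locations; the apache log filename is a guess:

```bash
# Last syslog entries written before the crash window.
grep -a 'Jul 15 04:2' /var/log/syslog | tail

# Last journal entries from the boot that died.
journalctl -b -1 -n 20 --no-pager

# Last apache access entries before the outage (path/name assumed).
tail -n 5 /var/log/apache2/gerrit-access.log
```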
fungi | it is a very massive vm, after all | 13:43 |
fungi | 128gb ram | 13:43 |
slittle1_ | Looking for advice ... I'm trying to bring a feature branch up to date relative to the master branch via git merge across all the starlingx repos | 14:26 |
slittle1_ | the result is this series of reviews ... https://review.opendev.org/q/topic:%22merge-master-as-of-20250711T041000Z%22 | 14:27 |
slittle1_ | zuul is unhappy with about a dozen of them, even though it allowed the same updates into master | 14:28 |
slittle1_ | my theory is that there are a lot of cross-repo dependencies that were met when the content was delivered in small chunks and in the correct order. However, with the merge, some components are being tested against other repos in their unmerged state | 14:30 |
slittle1_ | is there a way around this? | 14:31 |
slittle1_ | Can I make zuul test the whole set of updates across ~100 repos together? Or can zuul be overridden in some way? | 14:33 |
fungi | slittle1_: you can push a change to disable testing in that branch on each repository and then enable testing again once you have all the merges completed? | 14:35 |
fungi | zuul does have an "atomic merge" functionality when setting circular dependencies between changes, but 1. i don't know how well that would work with proposing interdependent merge commits across 100 different repos at once, and 2. it's a tenant-wide setting, but starlingx isn't configured as a separate tenant; it's still sharing a tenant with lots of other projects | 14:37 |
fungi | another option would be to temporarily grant push --force permissions to you on those repositories so you can bypass code review and then you can try to sort out any remaining broken tests after the fact, but be aware that would allow you to accidentally create a lot of mess if you pushed the wrong thing somewhere | 14:40 |
slittle1_ | Can disabling the zuul tests on a branch be done en masse through the project config? Or would I have to modify each project one at a time? | 14:44 |
fungi | it depends on where the jobs are added to your project pipelines. if they're added in .zuul.yaml files in each project's feature branch you'd need to disable them there. if they're added to the project pipelines in central configuration that's where you'd temporarily limit what branches they apply to | 14:46 |
fungi | odds are you have a mix of both, but you probably only need to disable the jobs that are failing | 14:46 |
slittle1_ | k, thanks | 14:49 |
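For the in-repo case fungi describes, a hedged sketch of temporarily replacing the failing jobs on the feature branch with zuul's standard noop placeholder job; this assumes the in-repo config carries only the project stanza, and the pipeline layout shown may not match the real starlingx config:

```bash
# On the feature branch of an affected repo, replace the project's job
# lists with the noop placeholder until the merges have landed.
cat > .zuul.yaml <<'EOF'
- project:
    check:
      jobs:
        - noop
    gate:
      jobs:
        - noop
EOF
git add .zuul.yaml
git commit -m "Temporarily disable jobs while merging master into the feature branch"
git review
```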
clarkb | tonyb: frickler: fungi: reindexing changes after a gerrit restart can help prevent duplicate change id problems, if a change was created in a project but not indexed because the shutdown happened before it made it into the index | 14:57 |
clarkb | but that is a relatively minor problem if it occurs. And yes if the server comes up and the filesystems look happy then starting services is the next step | 14:57 |
clarkb | the main concern and why we don't auto start things is that we want to ensure that the filesystem is in place properly before starting the service | 14:58 |
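A sketch of the reindex clarkb mentions, using Gerrit's SSH command interface; this assumes an account with the needed capability, and the change numbers below are placeholders:

```bash
# Reindex any changes that may have missed indexing during the shutdown
# (the change numbers are placeholders).
ssh -p 29418 review.opendev.org gerrit index changes 955011 907665
```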
clarkb | slittle1_: fungi: when you push a merge commit, if each of the commits that are new to the branch has already been reviewed in gerrit for another branch, I thought it didn't make you review each of them; it only makes you review the merge commit itself | 15:01 |
clarkb | but I guess the issue here is that the branch exists across a bunch of repos and it's the coordination there that is the problem? Not unexpected reviews for changes in a single repo | 15:02 |
slittle1_ | correct, it's just the merge commit. However the merge commits are throwing zuul errors, as each repo is tested against the unmerged state of all the rest of our repos. | 15:03 |
fungi | i imagine it's that some of the integration tests depend on newer functionality across different repos | 15:03 |
fungi | slittle1_: when you say "zuul errors" you mean they're not being tested and instead you're seeing messages about missing job definitions and the like? or jobs are running but returning failing results? | 15:04 |
slittle1_ | they run and return failure results | 15:04 |
slittle1_ | eg https://review.opendev.org/c/starlingx/distcloud/+/955011 | 15:05 |
clarkb | ya so another option is to set up depends on between the changes if the issue is simply dependency ordering | 15:05 |
slittle1_ | I've no idea what the correct order is. | 15:06 |
fungi | right, depends on how many of the changes are failing jobs and whether their interdependencies can be untangled easily | 15:06 |
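If the ordering can be untangled, a sketch of the Depends-On footer clarkb suggests; the review URL below is a placeholder:

```bash
# Amend one of the merge commits to declare its cross-repo dependency,
# then re-push it for review.
git commit --amend
# In the editor, add a footer line such as (placeholder URL):
#
#   Depends-On: https://review.opendev.org/c/starlingx/example/+/999999
git review
```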
frickler | clarkb: so "filesystem is in place" refers to /home/gerrit2 mostly? maybe add that to the docs somewhere, then? or place a check into the container startup? | 15:15 |
clarkb | frickler: yes the /home/gerrit2 volume. I'm not sure what sort of check would be appropriate for automation. We would probably have to write our own unit either way to do anything useful | 15:16 |
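One possible shape for the startup guard discussed here, assuming /home/gerrit2 is a separately mounted volume and reusing the hypothetical compose path from earlier:

```bash
# Refuse to start the containers unless the data volume is really mounted,
# so a boot with a missing volume can't start gerrit against an empty path.
if mountpoint -q /home/gerrit2; then
    docker-compose -f /etc/gerrit-compose/docker-compose.yaml up -d
else
    echo "/home/gerrit2 is not mounted; refusing to start gerrit" >&2
    exit 1
fi
```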
opendevreview | Antoine Musso proposed opendev/git-review master: Command to delete applied local branches https://review.opendev.org/c/opendev/git-review/+/955094 | 17:48 |
opendevreview | Merged openstack/project-config master: DCO enforcement for all OpenInfra Foundation repos https://review.opendev.org/c/openstack/project-config/+/954672 | 17:58 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Remove CLA enforcement from all projects and lock https://review.opendev.org/c/openstack/project-config/+/954374 | 18:05 |
fungi | okay, i'm going to disappear for a few minutes to grab a bite to eat, should be back in under an hour | 19:27 |