| @clarkb:matrix.org | pruning has begun but it looks like it may actually take about 3 hours to complete so I probably won't be able to test that the lodgeit image is preserved until tomorrow | 00:01 |
|---|---|---|
| @clarkb:matrix.org | it hasn't failed yet which is also a good sign | 00:12 |
| @clarkb:matrix.org | (and I did check the container updated earlier today) | 00:12 |
| @clarkb:matrix.org | I have rechecked https://review.opendev.org/c/opendev/system-config/+/945143 to see if registry pruning causes the lodgeit image to 404 (it shouldn't I hope) | 15:07 |
| @clarkb:matrix.org | fungi: re depends on. I think the reason I'm convinced it is fine is that we are not changing the use of override checkout, only the target, and depends on worked fine when overriding to the branch name | 15:15 |
| @clarkb:matrix.org | fungi: so in theory we should continue to override whatever value is in there with the depends on in the future as well | 15:15 |
| @clarkb:matrix.org | The recheck of https://review.opendev.org/c/opendev/system-config/+/945143 succeeded and pruning seems to have completed successfully on the registry too. I think this problem may be fixed now | 15:43 |
| @clarkb:matrix.org | I also don't see any keycloak backup failure emails. Let me check the host directly | 15:44 |
| @clarkb:matrix.org | `Fri Feb 6 05:36:08 UTC 2026 Backup finished successfully` | 15:45 |
| @clarkb:matrix.org | I'm reasonably satisfied that both of those items are resolved now. Say something if you see evidence to the contrary | 15:45 |
| @tafkamax:matrix.org | Hey 👋 | 15:56 |
| @fungicide:matrix.org | welcome! | 15:57 |
| @clarkb:matrix.org | fungi: while I'm working on those infra-manual updates did we want to proceed with https://review.opendev.org/c/opendev/system-config/+/975176 or do you think we need more clarification on depends-on first? | 16:01 |
| @jim:acmegating.com | Clark: note the infra-manual still mentions #opendev irc (not sure if you're already planning on making that update, but maybe you can while you're in there?) | 16:03 |
| @clarkb:matrix.org | corvus: yup that is the first step. | 16:05 |
| @fungicide:matrix.org | Clark: i'm happy to approve 975176 but am disappearing for lunch so can't work on a restart for an hour or so | 16:10 |
| @fungicide:matrix.org | approved now, it probably won't merge until i get back regardless | 16:10 |
| @clarkb:matrix.org | fungi: sounds good thanks | 16:12 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/infra-manual] 975926: Add documentation for Matrix https://review.opendev.org/c/opendev/infra-manual/+/975926 | 16:39 | |
| @clarkb:matrix.org | corvus: ^ that is the first step there | 16:39 |
| @clarkb:matrix.org | Looking at the docs we actually do have the getting started document point to `git review -s` and it's the extra bits document that talks about https remotes. So I'll need to think about how we can convey this better | 16:40 |
| -@gerrit:opendev.org- Clark Boylan proposed: | 16:55 | |
| - [opendev/infra-manual] 975928: Point people at Getting Started with an attention block https://review.opendev.org/c/opendev/infra-manual/+/975928 | ||
| - [opendev/infra-manual] 975929: Make it clearer that SSH is the preferred Gerrit comms protocol https://review.opendev.org/c/opendev/infra-manual/+/975929 | ||
| @clarkb:matrix.org | Something like those two changes maybe | 16:55 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 975176: Update Gerrit images to 3.11.8 and 3.12.4 https://review.opendev.org/c/opendev/system-config/+/975176 | 17:18 | |
| @fungicide:matrix.org | just in time | 17:19 |
| @clarkb:matrix.org | I guess we can proceed with that whenever we think it is likely to be less disruptive (maybe already?) | 17:21 |
| @clarkb:matrix.org | the deployment jobs are still running should probably wait for those to complete | 17:21 |
| @clarkb:matrix.org | fungi: did we want to do this like last time where you run things on the gerrit side and I can pause zuul queue processing? | 17:24 |
| @clarkb:matrix.org | the deploy jobs have completed successfully now | 17:25 |
| @fungicide:matrix.org | sure, sounds fine to me | 17:42 |
| @clarkb:matrix.org | https://quay.io/repository/opendevorg/gerrit/manifest/sha256:8ff0f759ae6729bbf57f47721728038de6e41a92f50370ec96621d76b841c0da is the new image | 17:42 |
| @fungicide:matrix.org | i've opened a root screen session on review03 | 17:43 |
| @fungicide:matrix.org | quay.io/opendevorg/gerrit 3.11 5bebd6c38d59 2 weeks ago 716MB | 17:44 |
| @fungicide:matrix.org | that's what we're running on at the moment | 17:44 |
| @fungicide:matrix.org | i'll do a pull and inspect | 17:44 |
| @clarkb:matrix.org | I have attached and see that happening | 17:44 |
| @clarkb:matrix.org | I'm going to dig up the zuul pausing command | 17:45 |
| @fungicide:matrix.org | quay.io/opendevorg/gerrit 3.11 b4345ed1ab79 About an hour ago 715MB | 17:45 |
| @fungicide:matrix.org | quay.io/opendevorg/gerrit@sha256:8ff0f759ae6729bbf57f47721728038de6e41a92f50370ec96621d76b841c0da | 17:45 |
| @fungicide:matrix.org | seems to have pulled the expected image | 17:45 |
| @clarkb:matrix.org | excellent | 17:45 |
| @clarkb:matrix.org | should we send something like `#status notice Gerrit on review.opendev.org will experience a short outage while we upgrade it to 3.11.8` | 17:46 |
| @fungicide:matrix.org | lgtm | 17:46 |
| @clarkb:matrix.org | `zuul-client manage-events --all-tenants --reason "Gerrit restart in progress" pause-result` appears to be the pausing command. Now to figure out the unpausing command | 17:46 |
| @fungicide:matrix.org | `docker compose -f /etc/gerrit-compose/docker-compose.yaml down && mv ~gerrit2/review_site/data/replication/ref-updates/waiting ~gerrit2/tmp/waiting_queue_2026-02-06 && rm ~gerrit2/review_site/cache/{gerrit_file_diff,git_file_diff,git_modified_files,modified_files,comment_context}.* && sudo docker compose -f /etc/gerrit-compose/docker-compose.yaml up -d` | 17:46 |
| @fungicide:matrix.org | that's what i've queued up in screen on review03 | 17:47 |
| @clarkb:matrix.org | `zuul-client manage-events --all-tenants normal` is the unpause | 17:47 |
| @fungicide:matrix.org | i guess we're ready to go if you want to do the status notice? | 17:47 |
| @clarkb:matrix.org | yup let me do the status notice then when that completes I'll pause zuul and you can run your restart command | 17:48 |
| @clarkb:matrix.org | #status notice Gerrit on review.opendev.org will experience a short outage while we upgrade it to 3.11.8 | 17:48 |
| @status:opendev.org | @clarkb:matrix.org: sending notice | 17:48 |
| @fungicide:matrix.org | perfect | 17:49 |
| -@status:opendev.org- NOTICE: Gerrit on review.opendev.org will experience a short outage while we upgrade it to 3.11.8 | 17:51 | |
| @status:opendev.org | @clarkb:matrix.org: finished sending notice | 17:51 |
| @clarkb:matrix.org | fungi: zuul is paused now too | 17:52 |
| @fungicide:matrix.org | gerrit is restarting | 17:52 |
| @clarkb:matrix.org | `git_file_diff.lock.db` is the last cache lock file remaining so I think it is close to shutting down | 17:54 |
| @fungicide:matrix.org | stopping took 223.9 seconds | 17:55 |
| @fungicide:matrix.org | the webui is coming up for me now | 17:56 |
| @clarkb:matrix.org | `[2026-02-06T17:56:18.282Z] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.11.8-dirty ready` from the log | 17:56 |
| @fungicide:matrix.org | "Powered by Gerrit Code Review (3.11.8-dirty)" | 17:56 |
| @clarkb:matrix.org | should I unpause zuul now? I also have a change I want to make a small update on that I could push to first if we want | 17:57 |
| @fungicide:matrix.org | i'm fine with unpausing now, and yeah a quick replication test would be good | 17:57 |
| @clarkb:matrix.org | actually I have to write the change first. I should just unpause zuul now | 17:57 |
| @clarkb:matrix.org | zuul is unpaused | 17:58 |
| @clarkb:matrix.org | and diffs load for me. Let me make my update | 17:58 |
| @fungicide:matrix.org | i'm also prepped to run `gerrit index start changes --force` over the ssh api once we're ready for that | 17:58 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/infra-manual] 975929: Make it clearer that SSH is the preferred Gerrit comms protocol https://review.opendev.org/c/opendev/infra-manual/+/975929 | 18:00 | |
| @clarkb:matrix.org | there is my update | 18:00 |
| @fungicide:matrix.org | ah, no it's `replication start` we want to run, not index start | 18:00 |
| @clarkb:matrix.org | fungi: no it's index | 18:01 |
| @fungicide:matrix.org | oh, right, because we lose the pending queue when restarting | 18:01 |
| @clarkb:matrix.org | because there is a race in the shutdown process where a new change can arrive and get recorded in git before the index is updated, then gerrit shuts down, and if we don't update the index gerrit never finds out about that change (or it finds out later when reindexing happens for another reason) | 18:01 |
| @clarkb:matrix.org | https://opendev.org/opendev/infra-manual/commit/91bab6d6ae089c1ee15c97b3a5c113d6b905e9db seems to have replicated from my push so I think that is working | 18:01 |
| @fungicide:matrix.org | so it may have been preparing to index a change and we don't persist the storage for that | 18:01 |
| @fungicide:matrix.org | okay, running `gerrit index start changes --force` after all | 18:02 |
| @clarkb:matrix.org | yup I think that is the next step. show-queue was basically empty, I can push things and they replicate, web ui is up and diffs work etc | 18:02 |
| @fungicide:matrix.org | watching the reindex progress from gerrit logs in the screen session | 18:03 |
| @clarkb:matrix.org | I wish I understood what leads to the huge variance in shutdown timing | 18:06 |
| @clarkb:matrix.org | I'm half tempted to set our docker compose shutdown timeout to something like 1800 seconds (half an hour) then we can manually kill -9 if it takes longer than we want. Otherwise it lets us wait | 18:07 |
| @clarkb:matrix.org | but considering that we literally delete these caches before starting back up I think having a reasonable timeout then giving up and killing it with -9 is probably fine | 18:07 |
| @clarkb:matrix.org | I don't think anything is running in the jvm other than the cache db cleanup at that point so it's super safe (particularly when paired with the h2 deletion after shutdown) | 18:07 |
| @fungicide:matrix.org | clouds | 18:11 |
| @fungicide:matrix.org | the reason is always clouds | 18:11 |
| @clarkb:matrix.org | fungi: if you want to review the infra-manual updates they may be good "merge something" test cases since their ci jobs should run quickly. I'm also happy to update them if you find issues | 18:14 |
| @clarkb:matrix.org | fungi: I think reindexing is slower than it has been in the past. I suspect that is due to the extra cache dbs we are deleting now as indexing relies on the caches (it will populate them as it goes) | 18:21 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: | 18:28 | |
| - [opendev/infra-manual] 975926: Add documentation for Matrix https://review.opendev.org/c/opendev/infra-manual/+/975926 | ||
| - [opendev/infra-manual] 975928: Point people at Getting Started with an attention block https://review.opendev.org/c/opendev/infra-manual/+/975928 | ||
| - [opendev/infra-manual] 975929: Make it clearer that SSH is the preferred Gerrit comms protocol https://review.opendev.org/c/opendev/infra-manual/+/975929 | ||
| @fungicide:matrix.org | seems like reindexing is over halfway done already | 18:29 |
| @fungicide:matrix.org | i'm not sure it's any slower than in recent history | 18:29 |
| @fungicide:matrix.org | but also my sense of time is terrible to nonexistent | 18:30 |
| @clarkb:matrix.org | I think half an hour is what it would take in the past | 18:34 |
| @clarkb:matrix.org | and we're just about at that time now | 18:34 |
| @clarkb:matrix.org | looks like it completed after about 50 minutes with the expected 3 failures. fungi I detached from the screen | 19:06 |
| @fungicide:matrix.org | confirmed, i've closed out the screen session now | 19:10 |
| @clarkb:matrix.org | There are more trixie image updates if we want to take a risk on any of them on a Friday. I've actually got some yard work I should do today before the rain comes back so maybe I'll pop outside after lunch | 19:24 |
| -@gerrit:opendev.org- Bartosz Bezak proposed: [opendev/system-config] 975966: UCA: Add Gazpacho https://review.opendev.org/c/opendev/system-config/+/975966 | 21:30 | |
| @fungicide:matrix.org | looks like our uca mirror has been stale for 4 months, judging from logs it was broken by bionic removal: | 21:44 |
| @fungicide:matrix.org | `Error: packages database contains unused 'bionic-updates/rocky|main|arm64' database.` | 21:44 |
| @fungicide:matrix.org | i think we need to run `reprepro --delete clearvanished` per https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components | 21:46 |
| @fungicide:matrix.org | i'll work on that now, then try manually pulling updates | 21:46 |
| @fungicide:matrix.org | https://paste.opendev.org/show/bkVtLFGRdvfA7OmmaihY/ | 21:48 |
| @fungicide:matrix.org | we missed doing that for xenial as well, apparently | 21:49 |
| @fungicide:matrix.org | following up with `reprepro --nokeepunreferencedfiles deleteunreferenced` now to clear out the associated packages | 21:49 |
| @fungicide:matrix.org | and finally, manual update is in progress | 21:50 |
| @fungicide:matrix.org | seems to have worked, vos release is running | 21:58 |
| @fungicide:matrix.org | rerunning it just to make sure it's essentially a no-op | 21:58 |
| @fungicide:matrix.org | and it was, finished already | 21:59 |
| @fungicide:matrix.org | i've released the lock | 22:00 |
| @fungicide:matrix.org | next cron-driven run is in ~16 minutes | 22:00 |
| @fungicide:matrix.org | #status log Ran a reprepro clearvanished pass on our Ubuntu Cloud Archive mirror in order to resolve errors related to earlier Xenial and Bionic ARM removals which were blocking updates for the past 4 months | 22:02 |
| @status:opendev.org | @fungicide:matrix.org: finished logging | 22:02 |
| @clarkb:matrix.org | fungi: so if we remove a release and don't intervene we break the db? | 22:13 |
| @fungicide:matrix.org | no, it didn't break the db | 22:14 |
| @fungicide:matrix.org | our update script runs a check of the reprepro configuration, which errors when it finds components present which aren't reflected in the config, so the script aborts without updating anything | 22:15 |
| @fungicide:matrix.org | all i did was tell reprepro to clear references to anything not listed in the config | 22:16 |
| @fungicide:matrix.org | we could potentially just run that first in the script, but it's potentially destructive and doesn't need to be run often | 22:17 |
| @clarkb:matrix.org | Got it. And did that clear out the old packages from the pool too or just the release details on the index side? | 22:19 |
| @fungicide:matrix.org | the first command (clearvanished) only cleared the db entries, the second command (deleteunreferenced) cleared the orphaned package files from the pool | 22:22 |
| @fungicide:matrix.org | though our update script also does a deleteunreferenced so i could have skipped that, it was useful to do it first so i could see what got deleted and separate that from the subsequent deletions during update (when new package versions made the kept ones obsolete) | 22:23 |
| @fungicide:matrix.org | https://static.opendev.org/mirror/ubuntu-cloud-archive/timestamp.txt shows the time from the cron run a few minutes ago, so i think it's back on schedule now | 22:24 |
| @clarkb:matrix.org | I wonder if maybe we want to run clearvanished for repos like UCA which are much smaller (and thus easier to rebuild if necessary) and also update more frequently as they do an Ubuntu x OpenStack release matrix | 22:24 |
| @clarkb:matrix.org | Then be more conservative with the distro proper repos as those change infrequently from a release perspective and are massively expensive to rebuild | 22:25 |
| @fungicide:matrix.org | it's worth thinking about, but we reuse the same script for all repositories/suites of all deb-based distros we mirror, so we'd need a selector/flag | 22:25 |
| @fungicide:matrix.org | which i guess could just be a list of names baked into the script | 22:26 |
| @clarkb:matrix.org | Or a simple getopt flag | 22:26 |
| @fungicide:matrix.org | yep | 22:26 |
| @fungicide:matrix.org | and then set it in the cronjobs | 22:26 |
| @clarkb:matrix.org | I think I would be willing to do that for the lower risk repos (mostly a cost to rebuild question I think so UCA, docker, maybe ceph?) | 22:28 |
| @fungicide:matrix.org | yeah, just not for debian, ubuntu and ubuntu-ports | 22:29 |
| @fungicide:matrix.org | since those take days to a week to rebuild | 22:29 |
| @clarkb:matrix.org | ++ | 22:32 |
| @fungicide:matrix.org | just a heads up, seems like launchpad may be having problems. system-config-run-mirror-update failed twice in a row with timeouts reading from our openafs ppa | 22:32 |
| @fungicide:matrix.org | `failed to fetch PPA information, error was: Connection failure: The read operation timed out` | 22:33 |
| @clarkb:matrix.org | There is a whole conversation on the gerrit mailing list about 3.12 and its v2 h2 cache dbs getting corrupted. It looks like this is happening due to problems like OOMing which we should in theory avoid since we limit the jvm well below host limits. But I wonder if the kill -9 timeout from docker compose down would cause that to happen | 22:45 |
| @clarkb:matrix.org | again it probably doesn't matter too much if we are then immediately deleting the backing database file. | 22:45 |
| @clarkb:matrix.org | But something to keep in mind as we ramp up 3.12 upgrade planning | 22:46 |
| @clarkb:matrix.org | it also looks like the solution is to delete the database entirely too if you hit it, so again, since we're already doing that regularly, it's probably not a big deal for us | 22:47 |
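
The opt-in `clearvanished` idea discussed above (a flag set per cron job, allowed only for mirrors that are cheap to rebuild) could be sketched roughly as follows. This is an illustration only, not the actual mirror update script: the function name, the allowlist, the mirror names, and the ordering of steps are all assumptions. Only the `clearvanished` and `deleteunreferenced` invocations are taken from the log.

```python
# Hypothetical sketch; not the real OpenDev reprepro update script.
# Mirror names below are assumed labels for the "lower risk" repos
# mentioned in the discussion (UCA, docker, maybe ceph).
LOW_RISK_MIRRORS = frozenset({"ubuntu-cloud-archive", "docker", "ceph"})


def plan_reprepro_commands(mirror, clear_vanished=False):
    """Return the reprepro invocations (as argv lists) for one mirror run.

    clear_vanished is an opt-in flag, intended to be set per cron job.
    It is refused for mirrors outside the allowlist, since clearvanished
    is potentially destructive and the large distro mirrors (debian,
    ubuntu, ubuntu-ports) are expensive to rebuild.
    """
    commands = []
    if clear_vanished:
        if mirror not in LOW_RISK_MIRRORS:
            raise ValueError(
                "clearvanished is not enabled for mirror %r" % mirror
            )
        # Drop database entries for suites/components no longer in the config.
        commands.append(["reprepro", "--delete", "clearvanished"])
    # Assumed ordering: pull updates, then remove orphaned package files.
    commands.append(["reprepro", "update"])
    commands.append(
        ["reprepro", "--nokeepunreferencedfiles", "deleteunreferenced"]
    )
    return commands
```

Keeping the allowlist next to the flag check means a cron job that is misconfigured for one of the large mirrors fails loudly instead of silently clearing database entries.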