| cardoe | Just wondering how much you guys have on legacy Rackspace vs Rackspace Flex? | 00:29 |
|---|---|---|
| cardoe | devnull isn't on here but I poked him on work chat. | 00:31 |
| clarkb | cardoe: legacy rackspace is still the majority of stuff | 00:31 |
| clarkb | cardoe: basically the entire control plane except for gerrit, gitea, a nameserver, and a backupserver and then the per region mirrors is rax legacy. Rax legacy also provides the larger portion of the CI resources | 00:32 |
| clarkb | cardoe: before the holidays we had an email thread with james denton to discuss some issues we've had with rax flex. In particular corrupted image uploads (in sjc3 only iirc) and then inconsistent floating ip listings leading to some floating ips being misidentified as leaked and cleaned up, breaking external connectivity to the mirrors in those regions | 00:33 |
| clarkb | that and the prior struggles with cinder volumes (which I think are a lot happier these days) as well as the lack of ipv6 are why we've been slow to migrate | 00:34 |
| clarkb | happy to have a discussion about that more directly if an email thread is maybe not the most effective method | 00:34 |
| clarkb | the performance of the rax flex instances is great and I think we'd be happy to be there if we can get more of the administrative stuff sorted out | 00:35 |
| cardoe | No that’s great. I didn’t realize. It’s not my area so I’m not in the loop. James isn’t on here either. | 00:42 |
| *** amorin_ is now known as amorin | 08:22 | |
| fungi | cardoe: also the missing ipv6 presents a challenge, unless that's been solved very recently and we just haven't heard | 14:48 |
| fungi | the forced floating ips might also present a problem for some of our control plane (openafs fileservers in particular) | 14:49 |
| cardoe | Yeah that's a fair point. I honestly don't know. I would have thought ip6 was there from the get go. | 14:52 |
| fungi | it's been "coming soon" since we first started to test drive it, i guess they ran into a lot more routing challenges than anticipated | 14:53 |
| fungi | in other news, the launchpad matrix room reports lp should be back to mostly normal again as of 07:00 utc, about 8 hours ago | 14:53 |
| cardoe | I'm raising what you folks have shared internally. | 15:06 |
| clarkb | reminder I am/was interested in possibly proceeding with Gerrit on java 21 today: https://review.opendev.org/c/opendev/system-config/+/970160 but haven't gotten any reviews yet. I rechecked the change last week so there should be job logs you can refer to when reviewing. | 15:39 |
| clarkb | I know there have been a lot of distractions recently so am happy to delay if we need more time to feel comfortable with this. I could go and debug zuul registry pruning some more or work on honeypot improvement ideas instead. There is always something else to do :) | 15:40 |
| clarkb | after last weekend I totally understand if we want to avoid things that might create a working weekend | 15:41 |
| clarkb | now that I've said all that I've half convinced myself even that it can wait for monday or whatever :) | 15:42 |
| slittle | thumbs up on https://review.opendev.org/c/openstack/project-config/+/973521 | 16:51 |
| fungi | thanks slittle, working on it after the review i'm currently looking at | 16:57 |
| fungi | clarkb: that series lgtm up until 970451, any idea why the skopeo errors persist? | 17:02 |
| clarkb | fungi: yes, that is the over pruning of the images in the intermediate registry | 17:04 |
| clarkb | fungi: things work on the day you push things to the intermediate registry but then once the registry pruning runs it prunes things it shouldn't and then they are gone and 404 | 17:04 |
| fungi | okay, so we'd need to recheck the rest of the series or wait until they merge? | 17:05 |
| clarkb | correct | 17:05 |
| fungi | wfm | 17:05 |
| clarkb | we'll want to do the java 21 switch as its own thing too I think rather than landing the whole stack and restarting for the result | 17:05 |
| fungi | anyway, i'm good to proceed with the first one unless you wanted to wait for more eyes | 17:05 |
| clarkb | fungi: I suspect it may just be us. If we proceed with that first one we should plan to restart Gerrit on the new image today (I'm around so that should be fine) | 17:06 |
| clarkb | I'd say go ahead and approve it if you're willing to help with restarting things | 17:06 |
| clarkb | I should check the h2 cache files to get a sense for how slow the shutdown may be | 17:06 |
| fungi | yeah, i'm here all day too | 17:06 |
| clarkb | 49GB and 64GB | 17:08 |
| clarkb | something tells me that this shutdown is going to time out | 17:08 |
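A minimal sketch of how those cache sizes might be checked; the helper function is hypothetical, and the cache directory path is an assumption based on the `~gerrit2/review_site` paths quoted later in channel:

```shell
# List a Gerrit site's persistent H2 cache files, largest first.
# The directory argument is an assumption based on the layout
# discussed in channel (/home/gerrit2/review_site/cache).
list_h2_caches() {
  local cache_dir="$1"
  # du -h + sort -rh orders the files by human-readable size, descending.
  du -h "$cache_dir"/*.h2.db 2>/dev/null | sort -rh
}

list_h2_caches /home/gerrit2/review_site/cache
```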
| fungi | oof | 17:09 |
| clarkb | ya I don't think there is any way past that except forward | 17:10 |
| clarkb | so the question is ultimately do we want to do that restart to java 21 today or say monday? I'm good with today if you are | 17:11 |
| clarkb | and maybe sooner is better with the h2 caches growing like that | 17:11 |
| clarkb | it will take at least half an hour to gate and probably closer to an hour (depends on where the gitea job schedules). fungi I'd say go ahead and approve it if you want to help restart today. Anyone else that wants to review can do so while it is in the gate. Otherwise we can plan for monday? | 17:14 |
| opendevreview | Merged openstack/project-config master: Add repo app-openvswitch for starlingx https://review.opendev.org/c/openstack/project-config/+/973521 | 17:16 |
| fungi | clarkb: yeah, i say we go for it asap then | 17:16 |
| fungi | fridays tend to be quiet anyway, so a brief gerrit outage shouldn't pose a problem | 17:17 |
| clarkb | fungi: sounds good. Do you want to approve it or should I? | 17:20 |
| fungi | i can | 17:20 |
| opendevreview | Merged opendev/system-config master: Upgrade build and runtime for Gerrit to Java 21 https://review.opendev.org/c/opendev/system-config/+/970160 | 18:13 |
| clarkb | must've run on a slightly quicker set of nodes. Nice | 18:14 |
| fungi | infra-prod-service-review is running now | 18:15 |
| clarkb | that will update the gerrit.config to point at java 21 which I think is actually a noop for us these days as the run-gerrit.sh script in the container image selects the java version | 18:15 |
| clarkb | `quay.io/opendevorg/gerrit 3.11 127c71372a4f 2 months ago 693MB` this is the image we're running on right now | 18:17 |
| fungi | yeah, i guess system-config-promote-image-gerrit-3.11 is what we really care about, and that succeeded | 18:17 |
| clarkb | yup and gerrit.config did update as expected but I don't think that is super important | 18:18 |
| clarkb | I think we can probably start working on a restart now | 18:18 |
| fungi | i've started a root screen session on review03 | 18:18 |
| clarkb | https://quay.io/repository/opendevorg/gerrit/manifest/sha256:c1fc9ddf96bc2dc485306b5da6167c977e1c888cb6d061ad1abc911c09f03544 is the new image for comparison after pulling | 18:19 |
| clarkb | I've attached to the root screen | 18:19 |
| clarkb | fungi: I think we want to delete a few more caches this time as a number are >2GB. I'll work on a list | 18:20 |
| fungi | previously we ran something like: | 18:21 |
| fungi | docker compose -f /etc/gerrit-compose/docker-compose.yaml down && mv ~gerrit2/review_site/data/replication/ref-updates/waiting ~gerrit2/tmp/waiting_queue_2026-01-16 && rm ~gerrit2/review_site/cache/{gerrit_file_diff,git_file_diff}.* && sudo docker compose -f /etc/gerrit-compose/docker-compose.yaml up -d | 18:21 |
| clarkb | git_file_diff.h2.db gerrit_file_diff.h2.db git_modified_files.h2.db modified_files.h2.db comment_context.h2.db | 18:21 |
| clarkb | that is sorted largest to smallest. So we're covering the two really big ones with that command. The next largest is 3.4GB (git_modified_files.h2.db), then the last two are 2.2GB | 18:22 |
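The cache-clearing step in fungi's quoted command, extended to the five caches clarkb lists, could be sketched as a small helper; the function name is hypothetical, and the compose file and cache paths are the ones quoted in channel:

```shell
# Delete a set of Gerrit persistent caches from a cache directory.
# Each cache is an .h2.db file plus lock/trace siblings, hence the glob.
clear_caches() {
  local cache_dir="$1"; shift
  local c
  for c in "$@"; do
    rm -f "$cache_dir/$c".*
  done
}

# On the server this would run between the compose down and up, e.g.:
#   docker compose -f /etc/gerrit-compose/docker-compose.yaml down
#   clear_caches /home/gerrit2/review_site/cache \
#       gerrit_file_diff git_file_diff git_modified_files \
#       modified_files comment_context
#   docker compose -f /etc/gerrit-compose/docker-compose.yaml up -d
```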
| clarkb | and note that we need to pull the image and check that it looks right before we run your command above | 18:22 |
| fungi | ah, when you said "after pulling" i thought you meant you already had done, but yeah | 18:22 |
| fungi | i'll pull on the server | 18:23 |
| clarkb | that looks right (the pull command) | 18:23 |
| clarkb | fungi: if you run `sudo docker inspect f43f988a6736 | grep 'RepoDigests' -A 2` you'll see those numbers match up with https://quay.io/repository/opendevorg/gerrit/manifest/sha256:c1fc9ddf96bc2dc485306b5da6167c977e1c888cb6d061ad1abc911c09f03544 | 18:25 |
| fungi | confirmed! | 18:26 |
| clarkb | the top hash is in the url and the second one matches the linux amd64 subimage or whatever we're calling it | 18:26 |
| clarkb | and if you click on the amd64 image you'll go here https://quay.io/repository/opendevorg/gerrit/manifest/sha256:d2af8674917ebaf0a30afedcdf01dd13de95ed416ad66dfdbbaf6d2683ab358b and see it was built with trixie and openjdk 21 in the layer log thing | 18:26 |
| clarkb | so that all lgtm | 18:26 |
| clarkb | should we send something like #status notice Gerrit is going to be restarted to reset some caches and to switch the java runtime to java 21. | 18:27 |
| clarkb | I guess we can get our restart command sorted out first | 18:27 |
| clarkb | fungi: also git_modified_files ? | 18:29 |
| clarkb | ok I think that looks good to me | 18:30 |
| clarkb | should I send the notice? Anything to change in that notice? | 18:31 |
| fungi | message lgtm though i usually say something like "the gerrit service on review.opendev.org will be offline briefly..." since some people may not know that we have only one gerrit or that the service name isn't the same as the domain name, and to indicate that a restart does indeed also come with an outage | 18:31 |
| clarkb | ack how about this #status notice Gerrit will be offline briefly in order to restart on a newer jvm and to clear out caches | 18:32 |
| fungi | okay | 18:32 |
| fungi | whenever you're ready | 18:32 |
| clarkb | #status notice Gerrit on review.opendev.org will be offline briefly in order to restart on a newer JVM and to clear out caches | 18:32 |
| opendevstatus | clarkb: sending notice | 18:32 |
| clarkb | sorry made a few small edits and it took another second to get it going | 18:32 |
| -opendevstatus- NOTICE: Gerrit on review.opendev.org will be offline briefly in order to restart on a newer JVM and to clear out caches | 18:33 | |
| clarkb | in theory this will send to the matrix rooms now too | 18:33 |
| fungi | and apologies for my delays and terseness in communication, my network is really super laggy at the moment for some reason | 18:33 |
| opendevstatus | clarkb: finished sending notice | 18:35 |
| clarkb | it is interesting that I got the finished sending notice message before matrix loaded the messages, but the message made it to matrix this time which is nice | 18:36 |
| clarkb | fungi: I think we're ready whenever you are | 18:36 |
| clarkb | I expect this first command will fail | 18:37 |
| clarkb | due to cache cleanups being slow | 18:37 |
| clarkb | and if that happens we can just rerun it since the second down should succeed | 18:37 |
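clarkb's plan here — expect the first down to fail on the slow cache flush and just run it again — is a retry-once pattern; a sketch, with the helper name hypothetical and the compose invocation as quoted earlier in channel:

```shell
# Run a command, and run it one more time if the first attempt fails.
# Sketch of the "rerun the down if it times out" plan from channel.
retry_once() {
  "$@" || "$@"
}

# Intended use on the server (compose path as quoted in channel):
#   retry_once docker compose -f /etc/gerrit-compose/docker-compose.yaml down
```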
| clarkb | now I'm wondering if fungi's internet woes have gotten worse | 18:38 |
| fungi | ready to restart? | 18:40 |
| fungi | ugh, mosh picked that exact moment to decide it's suffering massive packet loss | 18:40 |
| clarkb | fungi: it is 18:40 UTC ish and yes I'm ready as soon as your connection is happy enough to proceed | 18:40 |
| clarkb | I'm guessing fungi has lost connectivity. I'm going to hold here for a bit before doing anything | 18:50 |
| clarkb | Ideally I'm not restarting gerrit by myself also good to not trip over each other if connectivity comes back | 18:50 |
| clarkb | fungi it looks like I'm able to ping/traceroute/mtr the ip address you connected to review03 from | 18:53 |
| clarkb | implying the problem is maybe not a complete loss of connectivity (route specific? or maybe application/server specific?) | 18:53 |
| fungi | okay, i'm back. yeah some segment of the internal network here went sideways, not sure why yet | 18:56 |
| fungi | anyway, this machine seems to be on a stable leg of the network so i'm available to proceed | 18:56 |
| clarkb | ok I'm still here and I think we can continue if you're happy with your network situation | 18:56 |
| fungi | yeah, proceeding now | 18:56 |
| fungi | did you kill the process manually, or did the db cleanup not timeout? | 18:58 |
| clarkb | I didn't touch it | 18:58 |
| fungi | neat | 18:58 |
| fungi | webui is already up for me | 18:58 |
| clarkb | Gerrit Code Review 3.11.7-22-g66e6095dcb-dirty ready | 18:59 |
| clarkb | that is from the error log | 18:59 |
| clarkb | it is pruning the caches we didn't delete which should go quickly | 18:59 |
| clarkb | maybe its just faster with gerrit 3.11 and/or java 21? | 18:59 |
| fungi | that's nice if so | 19:00 |
| clarkb | ps reports we're running with java 21 so that is good | 19:00 |
| clarkb | diffs load for me too | 19:00 |
| clarkb | show-queue output looks good | 19:01 |
| clarkb | fungi: I think now we want someone to push a change/patchset and confirm that works as well as replication | 19:02 |
| clarkb | do you have anything convenient to update? I can do a DNM update I guess | 19:02 |
| fungi | i have something to push but it'll take me a few minutes to write | 19:03 |
| clarkb | I can wait. better than an otherwise useless dnm change | 19:03 |
| clarkb | the new jdbc driver seems to be logging warnings about duplicate entries | 19:04 |
| clarkb | I think that is fine and that db is also one we can drop completely and start over with if we need to | 19:04 |
| clarkb | the reviewed tags on files seems to be updating for me otherwise | 19:05 |
| clarkb | https://review.opendev.org/c/openstack/manila-ui/+/963516 this change just merged | 19:07 |
| clarkb | (we should still test pushes but that is a good sign I'm trying to check replication of ^ now) | 19:07 |
| clarkb | https://opendev.org/openstack/manila-ui/commits/branch/master yup that merge replicated | 19:07 |
| clarkb | so I think the last major item to check is pushing patchsets/changes | 19:08 |
| clarkb | https://review.opendev.org/c/starlingx/update/+/973683 is a new change | 19:09 |
| clarkb | and it appears to have replicated here: https://opendev.org/starlingx/update/commit/2ffaba0c37589f225704dbf76b225b7d567e7144 | 19:09 |
| clarkb | ok I think the reviewed state is not actually updating in the db as expected | 19:12 |
| fungi | i just pushed https://review.opendev.org/c/openstack/ossa/+/973684 as well | 19:12 |
| clarkb | when I view the diffs or hit the "Mark reviewed" button on a change the UI updates for me there and it looks ok. But if I hard refresh the change the values go away | 19:12 |
| clarkb | but this doesn't affect every change just those with the issue with duplicate keys | 19:13 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/973535 I was able to update file reviewed state in and https://review.opendev.org/c/openstack/cinder/+/973232 I was not | 19:13 |
| clarkb | the 'Duplicate entry' warning messages begin after the restart. So it must be due to the updated jdbc | 19:16 |
| clarkb | I think this is a relatively minor problem. Testing on https://review.opendev.org/c/openstack/ossa/+/973684 I'm not able to reproduce the issue and the reviewed state seems to work there after toggling back and forth a couple times and hard refreshing in between | 19:19 |
| opendevreview | James E. Blair proposed zuul/zuul-jobs master: Validate crc32 checksum for S3 image upload https://review.opendev.org/c/zuul/zuul-jobs/+/973689 | 19:19 |
| corvus | ohai i also just helped with testing. :) | 19:20 |
| clarkb | though after setting fungi's change file to reviewed then hard refreshing ~4 times once it came back without the reviewed flag set | 19:21 |
| clarkb | but no errors for that change in the error_log so I'm guessing that is a cache problem not a db consistency problem | 19:21 |
| corvus | maybe just an async web request timeout? | 19:21 |
| clarkb | corvus: actually yes I think that may explain the hard refresh behavior because a normal f5 is consistent so it may be loading that data in the background and not getting it quickly enough on a hard refresh each time | 19:22 |
| clarkb | looking at the logs and cross checking with the activity I've done through the browser it definitely isn't every change and my change that was pushed (but not reviewed) before the jdbc driver update was fine as is fungi's new change post update | 19:23 |
| clarkb | I half wonder if the old driver was just problematic in some cases and making double entries. I'll see what I can find in the database looking at it directly | 19:23 |
| clarkb | but so far my gut feeling on this is its an annoying problem but one that may just go away over time as changes age out and/or one we could fix by deleting the db and starting over (or possibly by deleting double entries if we find some) | 19:24 |
| fungi | do we want to keep the upgrade screen session open for now, or go ahead and close it out? | 19:25 |
| clarkb | if there are disagreements on that assessment please let me know. I think going back to the old jdbc driver implies rolling back to java 17 | 19:25 |
| clarkb | fungi: I don't think there is really anything in there that we need to keep other than maybe the 43.6 second shutdown time for gerrit. Yes, it took less than a minute, amazing | 19:25 |
| fungi | okay, terminated | 19:25 |
| clarkb | oh! | 19:29 |
| clarkb | I think this is just noise and everything is mostly working as expected | 19:29 |
| clarkb | looking at https://review.opendev.org/c/openstack/cinder/+/973232 again it shows I've reviewed all three files in that change. If I do a select against that change number, patchset, and my gerrit numerical id in the db there are only three entries in there one for each file | 19:30 |
| clarkb | I think what is happening is that the reviewed state isn't making it up to the browser (or is slow to make it) so then when you open the file or explicitly click mark reviewed the backend tries to insert another row for that entry which then fails because that would create a duplicate primary key | 19:30 |
| clarkb | my hunch is that the old jdbc connector treated that as a lower debug level log entry or maybe it was "replacing" entries | 19:31 |
| clarkb | but either way I think this is ok | 19:31 |
| clarkb | could also be a caches are cold problem and this will go away entirely as they warm and we just never noticed when restarting gerrit before and the old jdbc would do the same thing | 19:32 |
| clarkb | when I grepped old logs they were just from a few days ago not when we previously restarted gerrit | 19:32 |
| clarkb | so anyway my assessment is that this is noise in the logs (gerrit is really good about that) but not something we really need to take action on | 19:33 |
| clarkb | unrelated: I just checked and lists backups didn't complain this morning so I think that should be solved for now | 19:35 |
| clarkb | alright I'm going to pop out for lunch soon. As far as I can tell things are working well enough. The reviewed state is still acting weird for me but I only get the jdbc error/warning when I intentionally mark something reviewed that the web ui is showing as unreviewed that is reviewed in the db | 19:44 |
| clarkb | so whatever is causing that has to do with how the web ui is loading that info | 19:44 |
| fungi | sounds good. i'm around and keeping an eye on things, but also rebooting my home network one device at a time | 19:44 |
| clarkb | I did skim the log for stable-3.11 and I don't see anything obviously related to cache/index/db state loading in the web ui | 19:45 |
| clarkb | https://gerrit.googlesource.com/gerrit/+log/refs/heads/stable-3.11 | 19:45 |
| clarkb | I half wonder if it could be a timing issue exposed by the new jvm | 19:45 |
| clarkb | looking in FF developer tools we do fetch files?reviewed whcih is a json list with a list of files in it that should be in a reviewed state. However, the web ui when I refresh the page doesn't mark files in that list as reviewed | 19:47 |
| clarkb | definitely a weird one | 19:47 |
| clarkb | timing might also explain why we see it on some changes and not others. Some changes have more data to load (larger diffs, more files, etc) | 19:48 |
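The browser request clarkb describes can be reproduced from the command line; Gerrit prefixes every REST JSON response with the XSSI guard line `)]}'`, which has to be stripped before parsing. The credentials below are placeholders and the helper name is hypothetical:

```shell
# Remove the XSSI guard line ()]}') that Gerrit prepends to REST
# responses, so the remainder parses as JSON.
strip_xssi() {
  tail -n +2
}

# Hedged sketch of the web UI's files/?reviewed request (USER:HTTPPASS
# is a placeholder; the /a/ prefix selects the authenticated endpoint):
#   curl -su USER:HTTPPASS \
#     'https://review.opendev.org/a/changes/973232/revisions/current/files/?reviewed' \
#     | strip_xssi
```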
| clarkb | oh some of it is new patchsets I should've been looking at a specific patchset in the ui or using a change I was managing | 19:49 |
| clarkb | ok the more I did the less I'm concerned (new patchset arrived which made it look like I had reset the older patchset values but it simply was an update so the state is updated) | 19:49 |
| clarkb | *the more I dig | 19:50 |
| clarkb | fungi: oh I've just remembered that something we've been doing post restart is reindexing changes to catch any that may have updated on disk without being indexed during the shutdown | 19:52 |
| clarkb | fungi: considering how quickly this restart went I'm a bit less concerned about that but I think that race exists regardless so we may wish to do that here | 19:52 |
| clarkb | corvus: also did you see that statusbot is working as expected now? | 19:52 |
| corvus | clarkb: yes i noticed that it appeared in the zuul room | 19:59 |
| Clark[m] | It actually reported completion on the irc side before this matrix client saw it in the zuul room but that isn't a big deal | 19:59 |
| corvus | yeah that's typical; you'll get to see the matrix federation delay :) | 20:00 |
| fungi | i observed the same | 20:02 |
| fungi | though in my case it was all in one client | 20:02 |
| corvus | heh, there's a bit of a quirk to this: the bot will publish to whichever platform the command is sent first. so since you sent the command from irc, it sent it to all the irc channels, then the matrix rooms. so it's understandable that the completion notice is racing the actual notices. | 20:03 |
| corvus | but if you send the command from matrix, the opposite is true. so you might see irc lag. | 20:03 |
| corvus | we could change that in the bot if we want; it's just an arbitrary ordering based on what i thought was aesthetically pleasing symmetry at the time. :) | 20:04 |
| Clark[m] | fungi: any opinion on reindexing? | 20:04 |
| fungi | i can start it momentarily | 20:04 |
| Clark[m] | Sorry I'm off to my sandwich now. Fried ahi fish burger made from fish i caught in Hawaii. | 20:04 |
| corvus | you brought enough for everyone? | 20:05 |
| fungi | oh wow | 20:05 |
| fungi | ground ahi then, formed into a patty? | 20:05 |
| fungi | or just a slab steak? | 20:06 |
| fungi | either way, jealous | 20:06 |
| Clark[m] | No just a steak with panko on it. We did end up with a lot of fish but by the time we got it back to the west coast and divided it up it doesn't seem like as much | 20:06 |
| Clark[m] | Brother that had a direct flight flew up with two coolers. After our 14 hour delay I was too paranoid to try the coolers as checked bags on our return | 20:07 |
| fungi | so sort of a tuna karaage? | 20:07 |
| Clark[m] | Or tuna katsu | 20:08 |
| fungi | ah okay | 20:09 |
| Clark[m] | If you ever do a charter in Hawaii not every boat will let you keep your fish and most will assume you want Marlin which is more of a catch and release thing. So be sure you talk to the boat on details like that before you go | 20:10 |
| fungi | i guess which one is a question of whether you deep-fried it or pan-fried | 20:10 |
| fungi | i did `gerrit index start changes --force` just now | 20:12 |
| fungi | 2792 tasks remaining | 20:13 |
| Clark[m] | Thanks. If you watch the log it reports that it is completed at the end with a count of the failed changes (which should be 3) | 20:17 |
| fungi | `sudo tail -Fn0 /home/gerrit2/review_site/logs/error_log|grep Reindex` | 20:22 |
| fungi | looks like that should catch it | 20:22 |
| fungi | seems to be getting the periodic updates anyway | 20:22 |
| clarkb | up to 85% done | 20:58 |
| fungi | nearly there now | 21:06 |
| clarkb | yup just completed with the expected 3 failures | 21:07 |
| fungi | Failed 3/967145 changes | 21:07 |
| clarkb | which is good because that is the behavior we've had for some time. Consistency is nice | 21:07 |
| fungi | we're drawing ever closer to the million-change mark too | 21:07 |
| clarkb | I'll probably start working on landing the gerrit 3.12 testing changes next week. Then for the nodejs update we should probably plan another restart of gerrit in production just to be sure it works as expected afterwards | 21:08 |
| mordred | when has nodejs ever broken anything? | 22:15 |
| clarkb | mordred: ya usually if it builds I would expect it to be fine | 23:06 |
| clarkb | since it isn't a runtime nodejs system | 23:06 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!