Friday, 2026-01-16

cardoeJust wondering how much you guys have on legacy Rackspace vs Rackspace Flex?00:29
cardoedevnull isn't on here but I poked him on work chat.00:31
clarkbcardoe: legacy rackspace is still the majority of stuff00:31
clarkbcardoe: basically the entire control plane except for gerrit, gitea, a nameserver, and a backupserver and then the per region mirrors is rax legacy. Rax legacy also provides the larger portion of the CI resources00:32
clarkbcardoe: before the holidays we had an email thread with james denton to discuss some issues we've had with rax flex. In particular corrupted image uploads (in sjc3 only iirc) and then inconsistent floating ip listings leading some floating ips to be misidentified as leaked cleaning them up and breaking external connectivity to the mirrors in those regions00:33
clarkbthat and the prior struggles with cinder volumes (which I think are a lot happier these days) as well as the lack of ipv6 are why we've been slow to migrate00:34
clarkbhappy to have a discussion about that more directly if an email thread is maybe not the most effective method00:34
clarkbthe performance of the rax flex instances is great and I think we'd be happy to be there if we can get more of the administrative stuff sorted out00:35
cardoeNo that’s great. I didn’t realize. It’s not my area so I’m not in the loop. James isn’t on here either.00:42
*** amorin_ is now known as amorin08:22
fungicardoe: also the missing ipv6 presents a challenge, unless that's been solved very recently and we just haven't heard14:48
fungithe forced floating ips might also present a problem for some of our control plane (openafs fileservers in particular)14:49
cardoeYeah that's a fair point. I honestly don't know. I would have thought ip6 was there from the get go.14:52
fungiit's been "coming soon" since we first started to test drive it, i guess they ran into a lot more routing challenges than anticipated14:53
fungiin other news, the launchpad matrix room reports lp should be back to mostly normal again as of 07:00 utc, about 8 hours ago14:53
cardoeI'm raising what you folks have shared internally.15:06
clarkbreminder I am/was interested in possibly proceeding with Gerrit on java 21 today: https://review.opendev.org/c/opendev/system-config/+/970160 but haven't gotten any reviews yet. I rechecked the change last week so there should be job logs you can refer to when reviewing.15:39
clarkbI know there have been a lot of distractions recently so am happy to delay if we need more time to feel comfortable with this. I could go and debug zuul registry pruning some more or work on honeypot improvement ideas instead. There is always something else to do :)15:40
clarkbafter last weekend I totally understand if we want to avoid things that might create a working weekend15:41
clarkbnow that I've said all that I've half convinced myself even that it can wait for monday or whatever :)15:42
slittlethumbs up on https://review.opendev.org/c/openstack/project-config/+/97352116:51
fungithanks slittle, working on it after the review i'm currently looking at16:57
fungiclarkb: that series lgtm up until 970451, any idea why the skopeo errors persist?17:02
clarkbfungi: yes, that is the over pruning of the images in the intermediate registry17:04
clarkbfungi: things work on the day you push things to the intermediate registry but then once the registry pruning runs it prunes things it shouldn't and then they are gone and 40417:04
fungiokay, so we'd need to recheck the rest of the series or wait until they merge?17:05
clarkbcorrect17:05
fungiwfm17:05
clarkbwe'll want to do the java 21 switch as its own thing too I think rather than landing the whole stack and restarting for the result17:05
fungianyway, i'm good to proceed with the first one unless you wanted to wait for more eyes17:05
clarkbfungi: I suspect it may just be us. If we proceed with that first one we should plan to restart Gerrit on the new image today (I'm around so that should be fine)17:06
clarkbI'd say go ahead and approve it if you're willing to help with restarting things17:06
clarkbI should check the h2 cache files to get a sense for how slow the shutdown may be17:06
fungiyeah, i'm here all day too17:06
clarkb49GB and 64GB17:08
clarkbsomething tells me that this shutdown is going to time out18:08
fungioof17:09
clarkbya I don't think there is any way past that except forward17:10
clarkbso the question is ultimately do we want to do that restart to java 21 today or say monday? I'm good with today if you are17:11
clarkband maybe sooner is better with the h2 caches growing like that17:11
clarkbit will take at least half an hour to gate and probably closer to an hour (depends on where the gitea job schedules). fungi I'd say go ahead and approve it if you want to help restart today. Anyone else that wants to review can do so while it is in the gate. Otherwise we can plan for monday?17:14
opendevreviewMerged openstack/project-config master: Add repo app-openvswitch for starlingx  https://review.opendev.org/c/openstack/project-config/+/97352117:16
fungiclarkb: yeah, i say we go for it asap then17:16
fungifridays tend to be quiet anyway, so a brief gerrit outage shouldn't pose a problem17:17
clarkbfungi: sounds good. Do you want to approve it or should I?17:20
fungii can17:20
opendevreviewMerged opendev/system-config master: Upgrade build and runtime for Gerrit to Java 21  https://review.opendev.org/c/opendev/system-config/+/97016018:13
clarkbmust've run on a slightly quicker set of nodes. Nice18:14
fungiinfra-prod-service-review is running now18:15
clarkbthat will update the gerrit.config to point at java 21 which I think is actually a noop for us these days as the run-gerrit.sh script in the container image selects the java version18:15
clarkb`quay.io/opendevorg/gerrit       3.11      127c71372a4f   2 months ago   693MB` this is the image we're running on right now18:17
fungiyeah, i guess system-config-promote-image-gerrit-3.11 is what we really care about, and that succeeded18:17
clarkbyup and gerrit.config did update as expected but I don't think that is super important18:18
clarkbI think we can probably start working on a restart now18:18
fungii've started a root screen session on review0318:18
clarkbhttps://quay.io/repository/opendevorg/gerrit/manifest/sha256:c1fc9ddf96bc2dc485306b5da6167c977e1c888cb6d061ad1abc911c09f03544 is the new image for comparison after pulling18:19
clarkbI've attached to the root screen18:19
clarkbfungi: I think we want to delete a few more caches this time as a number are >2GB. I'll work on a list18:20
fungipreviously we ran something like:18:21
fungidocker compose -f /etc/gerrit-compose/docker-compose.yaml down && mv ~gerrit2/review_site/data/replication/ref-updates/waiting ~gerrit2/tmp/waiting_queue_2026-01-16 && rm ~gerrit2/review_site/cache/{gerrit_file_diff,git_file_diff}.* && sudo docker compose -f /etc/gerrit-compose/docker-compose.yaml up -d18:21
clarkbgit_file_diff.h2.db gerrit_file_diff.h2.db git_modified_files.h2.db modified_files.h2.db comment_context.h2.db18:21
clarkbthat is sorted largest to smallest. So we're covering the two really big ones with that command. The next largest is 3.4gb with git_modified_files.h2.db then the last two are 2.2gb18:22
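A ranking like the one above can be produced mechanically. This is a minimal sketch (the directory path and file layout are assumptions based on the discussion, not a tool OpenDev actually runs) that lists the H2 cache files largest-first so the biggest ones can be added to the pre-restart deletion list:

```python
# Sketch with assumed paths: rank Gerrit's H2 cache files by size so the
# largest ones can be picked for deletion before a slow shutdown.
from pathlib import Path

def largest_caches(cache_dir, top=5):
    """Return (name, size_bytes) pairs for *.h2.db files, largest first."""
    sized = [(f.name, f.stat().st_size) for f in Path(cache_dir).glob("*.h2.db")]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)[:top]
```

On the real server this would be pointed at something like ~gerrit2/review_site/cache.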
clarkband note that we need to pull the image and check that it looks right before we run your command above18:22
fungiah, when you said "after pulling" i thought you meant you already had done, but yeah18:22
fungii'll pull on the server18:23
clarkbthat looks right (the pull command)18:23
clarkbfungi: if you run `sudo docker inspect f43f988a6736 | grep 'RepoDigests' -A 2` you'll see those numbers match up with https://quay.io/repository/opendevorg/gerrit/manifest/sha256:c1fc9ddf96bc2dc485306b5da6167c977e1c888cb6d061ad1abc911c09f0354418:25
fungiconfirmed!18:26
clarkbthe top hash is in the url and the second one matches the linux amd64 subimage or whatever we're calling it18:26
clarkband if you click on the amd64 image you'll go here https://quay.io/repository/opendevorg/gerrit/manifest/sha256:d2af8674917ebaf0a30afedcdf01dd13de95ed416ad66dfdbbaf6d2683ab358b and see it was built with trixie and openjdk 21 in the layer log thing18:26
clarkbso that all lgtm18:26
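The manual comparison above can be expressed as a tiny helper. This is a hypothetical sketch (function name and input shape are assumptions): it takes the `RepoDigests` entries from `docker inspect` output and checks one against the digest taken from the quay.io manifest URL:

```python
# Hypothetical helper mirroring the manual check above: compare the local
# image's RepoDigests entries against the digest from the registry URL.
def digest_matches(repo_digests, expected):
    """repo_digests: entries like 'quay.io/opendevorg/gerrit@sha256:...'."""
    return any(d.split("@", 1)[1] == expected for d in repo_digests)
```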
clarkbshould we send something like #status notice Gerrit is going to be restarted to reset some caches and to switch the java runtime to java 21.18:27
clarkbI guess we can get our restart command sorted out first18:27
clarkbfungi: also git_modified_files ?18:29
clarkbok I think that looks good to me18:30
clarkbshould I send the notice? Anything to change in that notice?18:31
fungimessage lgtm though i usually say something like "the gerrit service on review.opendev.org will be offline briefly..." since some people may not know that we have only one gerrit or that the service name isn't the same as the domain name, and to indicate that a restart does indeed also come with an outage18:31
clarkback how about this #status notice Gerrit will be offline briefly in order to restart on a newer jvm and to clear out caches18:32
fungiokay18:32
fungiwhenever you're ready18:32
clarkb#status notice Gerrit on review.opendev.org will be offline briefly in order to restart on a newer JVM and to clear out caches18:32
opendevstatusclarkb: sending notice18:32
clarkbsorry made a few small edits and it took another second to get it going18:32
-opendevstatus- NOTICE: Gerrit on review.opendev.org will be offline briefly in order to restart on a newer JVM and to clear out caches18:33
clarkbin theory this will send to the matrix rooms now too18:33
fungiand apologies for my delays and terseness in communication, my network is really super laggy at the moment for some reason18:33
opendevstatusclarkb: finished sending notice18:35
clarkbit is interesting that I got the finished sending notice message before matrix loaded the messages, but the message made it to matrix this time which is nice18:36
clarkbfungi: I think we're ready whenever you are18:36
clarkbI expect this first command will fail18:37
clarkbdue to cache cleanups being slow18:37
clarkband if that happens we can just rerun it since the second down should succeed18:37
clarkbnow I'm wondering if fungi's internet woes have gotten worse18:38
fungiready to restart?18:40
fungiugh, mosh picked that exact moment to decide it's suffering massive packet loss18:40
clarkbfungi: it is 18:40 UTC ish and yes I'm ready as soon as your connection is happy enough to proceed18:40
clarkbI'm guessing fungi has lost connectivity. I'm going to hold here for a bit before doing anything18:50
clarkbIdeally I'm not restarting gerrit by myself; also good to not trip over each other if connectivity comes back18:50
clarkbfungi it looks like I'm able to ping/traceroute/mtr the ip address you connected to review03 from18:53
clarkbimplying the problem is maybe not a complete loss of connectivity (route specific? or maybe application/server specific?)18:53
fungiokay, i'm back. yeah some segment of the internal network here went sideways, not sure why yet18:56
fungianyway, this machine seems to be on a stable leg of the network so i'm available to proceed18:56
clarkbok I'm still here and I think we can continue if you're happy with your network situation18:56
fungiyeah, proceeding now18:56
fungidid you kill the process manually, or did the db cleanup not timeout?18:58
clarkbI didn't touch it18:58
fungineat18:58
fungiwebui is already up for me18:58
clarkbGerrit Code Review 3.11.7-22-g66e6095dcb-dirty ready18:59
clarkbthat is from the error log18:59
clarkbit is pruning the caches we didn't delete which should go quickly18:59
clarkbmaybe it's just faster with gerrit 3.11 and/or java 21?18:59
fungithat's nice if so19:00
clarkbps reports we're running with java 21 so that is good19:00
clarkbdiffs load for me too19:00
clarkbshow-queue output looks good19:01
clarkbfungi: I think now we want someone to push a change/patchset and confirm that works as well as replication19:02
clarkbdo you have anything convenient to update? I can do a DNM update I guess19:02
fungii have something to push but it'll take me a few minutes to write19:03
clarkbI can wait. better than an otherwise useless dnm change19:03
clarkbthe new jdbc driver seems to be logging warnings about duplicate entries19:04
clarkbI think that is fine and that db is also one we can drop completely and start over with if we need to19:04
clarkbthe reviewed tags on files seems to be updating for me otherwise19:05
clarkbhttps://review.opendev.org/c/openstack/manila-ui/+/963516 this change just merged19:07
clarkb(we should still test pushes but that is a good sign I'm trying to check replication of ^ now)19:07
clarkbhttps://opendev.org/openstack/manila-ui/commits/branch/master yup that merge replicated19:07
clarkbso I think the last major item to check is pushing patchsets/changes19:08
clarkbhttps://review.opendev.org/c/starlingx/update/+/973683 is a new change19:09
clarkband it appears to have replicated here: https://opendev.org/starlingx/update/commit/2ffaba0c37589f225704dbf76b225b7d567e714419:09
clarkbok I think the reviewed state is not actually updating in the db as expected19:12
fungii just pushed https://review.opendev.org/c/openstack/ossa/+/973684 as well19:12
clarkbwhen I view the diffs or hit the "Mark reviewed" button on a change the UI updates for me there and it looks ok. But if I hard refresh the change the values go away19:12
clarkbbut this doesn't affect every change just those with the issue with duplicate keys19:13
clarkbhttps://review.opendev.org/c/opendev/system-config/+/973535 I was able to update file reviewed state in and https://review.opendev.org/c/openstack/cinder/+/973232 I was not19:13
clarkbthe 'Duplicate entry' warning messages begin after the restart. So it must be due to the updated jdbc19:16
clarkbI think this is a relatively minor problem. Testing on https://review.opendev.org/c/openstack/ossa/+/973684 I'm not able to reproduce the issue and the reviewed state seems to work there after toggling back and forth a couple times and hard refreshing in between19:19
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Validate crc32 checksum for S3 image upload  https://review.opendev.org/c/zuul/zuul-jobs/+/97368919:19
corvusohai i also just helped with testing.  :)19:20
clarkbthough after setting fungi's change file to reviewed then hard refreshing ~4 times once it came back without the reviewed flag set19:21
clarkbbut no errors for that change in the error_log so I'm guessing that is a cache problem not a db consistency problem19:21
corvusmaybe just an async web request timeout?19:21
clarkbcorvus: actually yes I think that may explain the hard refresh behavior because a normal f5 is consistent so it may be loading that data in the background and not getting it quickly enough on a hard refresh each time19:22
clarkblooking at the logs and cross checking with the activity I've done through the browser it definitely isn't every change and my change that was pushed (but not reviewed) before the jdbc driver update was fine as is fungi's new change post update19:23
clarkbI half wonder if the old driver was just problematic in some cases and making double entries. I'll see what I can find in the database looking at it directly19:23
clarkbbut so far my gut feeling on this is its an annoying problem but one that may just go away over time as changes age out and/or one we could fix by deleting the db and starting over (or possibly by deleting double entries if we find some)19:24
fungido we want to keep the upgrade screen session open for now, or go ahead and close it out?19:25
clarkbif there are disagreements on that assessment please let me know. I think going back to the old jdbc driver implies rolling back to java 1719:25
clarkbfungi: I don't think there is really anything in there that we need to keep other than maybe the 43.6 second shutdown for gerrit. Yes it took less than a minute, amazing19:25
fungiokay, terminated19:25
clarkboh!19:29
clarkbI think this is just noise and everything is mostly working as expected19:29
clarkblooking at https://review.opendev.org/c/openstack/cinder/+/973232 again it shows I've reviewed all three files in that change. If I do a select against that change number, patchset, and my gerrit numerical id in the db there are only three entries in there one for each file19:30
clarkbI think what is happening is that the reviewed state isn't making it up to the browser (or is slow to make it) so then when you open the file or explicitly click mark reviewed the backend tries to insert another row for that entry which then fails because that would create a duplicate primary key19:30
clarkbmy hunch is that the old jdbc connector treated that as a lower debug level log entry or maybe it was "replacing" entries19:31
clarkbbut either way I think this is ok19:31
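The duplicate-key hypothesis above can be illustrated outside Gerrit. This sketch uses Python's sqlite3 as a stand-in for the H2 store (the table shape is invented for illustration): a plain INSERT of an existing primary key raises an error, which would surface as the logged warning, while an INSERT OR REPLACE, the "replacing" behavior clarkb speculates the old driver had, succeeds silently:

```python
# Stand-in demo for the hunch above: duplicate primary-key inserts error,
# while INSERT OR REPLACE silently overwrites the existing row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviewed (change INTEGER, file TEXT,"
    " PRIMARY KEY (change, file))"
)
conn.execute("INSERT INTO reviewed VALUES (973232, 'a.py')")
try:
    # Same primary key again -> the 'Duplicate entry' style failure.
    conn.execute("INSERT INTO reviewed VALUES (973232, 'a.py')")
    duplicate_raised = False
except sqlite3.IntegrityError:
    duplicate_raised = True
# The hypothesized old-driver behavior: replace instead of fail.
conn.execute("INSERT OR REPLACE INTO reviewed VALUES (973232, 'a.py')")
```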
clarkbcould also be a caches are cold problem and this will go away entirely as they warm and we just never noticed when restarting gerrit before and the old jdbc would do the same thing19:32
clarkbwhen I grepped old logs they were just from a few days ago not when we previously restarted gerrit19:32
clarkbso anyway my assessment is that this is noise in the logs (gerrit is really good about that) but not something we really need to take action on19:33
clarkbunrelated: I just checked and lists backups didn't complain this morning so I think that should be solved for now19:35
clarkbalright I'm going to pop out for lunch soon. As far as I can tell things are working well enough. The reviewed state is still acting weird for me but I only get the jdbc error/warning when I intentionally mark something reviewed that the web ui is showing as unreviewed that is reviewed in the db19:44
clarkbso whatever is causing that has to do with how the web ui is loading that info19:44
fungisounds good. i'm around and keeping an eye on things, but also rebooting my home network one device at a time19:44
clarkbI did skim the log for stable-3.11 and I don't see anything obviously related to cache/index/db state loading in the web ui19:45
clarkbhttps://gerrit.googlesource.com/gerrit/+log/refs/heads/stable-3.1119:45
clarkbI half wonder if it could be a timing issue exposed by the new jvm19:45
clarkblooking in FF developer tools we do fetch files?reviewed which is a json list with a list of files in it that should be in a reviewed state. However, the web ui when I refresh the page doesn't mark files in that list as reviewed19:47
clarkbdefinitely a weird one19:47
clarkbtiming might also explain why we see it on some changes and not others. Some changes have more data to load (larger diffs, more files, etc)19:48
clarkboh some of it is new patchsets I should've been looking at a specific patchset in the ui or using a change I was managing19:49
clarkbok the more I dig the less I'm concerned (new patchset arrived which made it look like I had reset the older patchset values but it simply was an update so the state is updated)19:49
clarkbfungi: oh I've just remembered that something we've been doing post restart is reindexing changes to catch any that may have updated on disk without being indexed during the shutdown19:52
clarkbfungi: considering how quickly this restart went I'm a bit less concerned about that but I think that race exists regardless so we may wish to do that here19:52
clarkbcorvus: also did you see that statusbot is working as expected now?19:52
corvusclarkb: yes i noticed that it appeared in the zuul room19:59
Clark[m]It actually reported completion on the irc side before this matrix client saw it in the zuul room but that isn't a big deal19:59
corvusyeah that's typical; you'll get to see the matrix federation delay :)20:00
fungii observed the same20:02
fungithough in my case it was all in one client20:02
corvusheh, there's a bit of a quirk to this: the bot will publish to whichever platform the command is sent first.  so since you sent the command from irc, it sent it to all the irc channels, then the matrix rooms.  so it's understandable that the completion notice is racing the actual notices.20:03
corvusbut if you send the command from matrix, the opposite is true.  so you might see irc lag.20:03
corvuswe could change that in the bot if we want; it's just an arbitrary ordering based on what i thought was aesthetically pleasing symmetry at the time.  :)20:04
Clark[m]fungi: any opinion on reindexing?20:04
fungii can start it momentarily20:04
Clark[m]Sorry I'm off to my sandwich now. Fried ahi fish burger made from fish i caught in Hawaii.20:04
corvusyou brought enough for everyone?20:05
fungioh wow20:05
fungiground ahi then, formed into a patty?20:05
fungior just a slab steak?20:06
fungieither way, jealous20:06
Clark[m]No just a steak with panko on it. We did end up with a lot of fish but by the time we got it back to the west coast and divided it up it doesn't seem like as much20:06
Clark[m]Brother that had a direct flight flew up with two coolers. After our 14 hour delay I was too paranoid to try coolers as checked bags on our return20:07
fungiso sort of a tuna karaage?20:07
Clark[m]Or tuna katsu20:08
fungiah okay20:09
Clark[m]If you ever do a charter in Hawaii not every boat will let you keep your fish and most will assume you want Marlin which is more of a catch and release thing. So be sure you talk to the boat on details like that before you go20:10
fungii guess which one is a question of whether you deep-fried it or pan-fried20:10
fungii did `gerrit index start changes --force` just now20:12
fungi2792 tasks remaining20:13
Clark[m]Thanks. If you watch the log it reports that it is completed at the end with a count of the failed changes (which should be 3)20:17
fungi`sudo tail -Fn0 /home/gerrit2/review_site/logs/error_log|grep Reindex`20:22
fungilooks like that should catch it20:22
fungiseems to be getting the periodic updates anyway20:22
clarkbup to 85% done20:58
funginearly there now21:06
clarkbyup just completed with the expected 3 failures21:07
fungiFailed 3/967145 changes21:07
clarkbwhich is good because that is the behavior we've had for some time. Consistency is nice21:07
fungiwe're drawing ever closer to the million-change mark too21:07
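The completion line quoted above ("Failed 3/967145 changes") is easy to check mechanically. A small helper, with the log format assumed from that quoted output, to pull the failed/total counts out of Gerrit's reindex completion message:

```python
# Parse the reindex completion line (format assumed from the output above)
# into (failed, total) counts, or None if the line doesn't match.
import re

def parse_reindex_result(line):
    m = re.search(r"Failed (\d+)/(\d+) changes", line)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Something like this could sit behind the `tail -F ... | grep Reindex` watch fungi ran.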
clarkbI'll probably start working on landing the gerrit 3.12 testing changes next week. Then for the nodejs update we should probably plan another restart of gerrit in production just to be sure it works as expected afterwards21:08
mordredwhen has nodejs ever broken anything?22:15
clarkbmordred: ya usually if it builds I would expect it to be fine23:06
clarkbsince it isn't a runtime nodejs system23:06

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!