Friday, 2026-03-20

-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 981499: Switch OpenMetal mirror CNAME to new mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/981499 [11:44]
-@gerrit:opendev.org- David Ostrovsky proposed: [14:11]
- [zuul/zuul-jobs] 981517: Optionally ensure Java in ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981517
- [zuul/zuul-jobs] 981518: Make bazelisk-build ensure Java via ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981518
-@gerrit:opendev.org- David Ostrovsky proposed: [zuul/zuul-jobs] 981518: Make bazelisk-build ensure Java via ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981518 [14:36]
@clarkb:matrix.org: fungi: https://review.opendev.org/c/opendev/system-config/+/981042 is the change I promised to purge refstack01 and eavesdrop01 backups from the smaller backup server to free up more space there. We had previously discussed that this should be fine. If that still holds, maybe you can review it and we can proceed with that today? [14:54]
@clarkb:matrix.org: One thing I never followed up on from earlier was the LE cert renewal. It does look like the warnings have gone away after we addressed the dpkg lock on the osuosl mirror. I suspect that when we deployed static03 all of these certs were reissued, and it has been fine since then [15:16]
@fungicide:matrix.org: yes, it was entirely because static02 was in the disable list for a few days [15:19]
@fungicide:matrix.org: once we moved the sites it cleared up [15:19]
@fungicide:matrix.org: disappearing for a quick lunch, back shortly [15:28]
@clarkb:matrix.org: enjoy. When you get back I think I'll approve the backup purge change [15:29]
@clarkb:matrix.org: something is going on with raxflex iad3 api requests and that appears to be why we are not starting jobs quickly at the moment [15:35]
@clarkb:matrix.org: I tried to do a server show against a server and got a 500 error. Listing servers shows a number of them in BUILD. I'm trying to cross-reference the servers listed there with the np$UUIDPREFIX names listed by the zuul nodes page to see if they even show up [15:36]
@clarkb:matrix.org: corvus: ^ fyi [15:36]
@clarkb:matrix.org: oh also I see a fairly large number of 16GB nodes in use at the moment. Theoretically not a problem, but just something I didn't expect to see as I was looking into this [15:37]
@clarkb:matrix.org: ok, none of the requested nodes as listed by the zuul nodes page for that region show up in the server listing. The one active node does show up. I think that means we've leaked all of these other nodes? [15:39]
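A rough sketch of the cross-check being described: anything the cloud reports in BUILD whose np<uuid-prefix> name has no matching entry on the Zuul tenant nodes page is a likely leak. The clouds.yaml entry name below is an assumption.

```
# List what the cloud thinks is still building, then compare the np... names
# against the node names shown on the Zuul tenant nodes page.
export OS_CLOUD=raxflex-iad3   # assumed clouds.yaml entry name
openstack server list --status BUILD -f value -c ID -c Name -c Status
```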
@clarkb:matrix.org: I think my next step may be to try to delete one of the BUILD nodes [15:39]
@mnasiadka:matrix.org: Clark: re mirror02.iad3.openmetal I assume the next step is to remove the DNS/inventory entries somewhere next week and remove the old instance in OpenMetal's cloud? [15:40]
@clarkb:matrix.org: mnasiadka: yes exactly [15:40]
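For reference, the retirement being agreed to above would look roughly like this; the clouds.yaml entry name is an assumption, and mirror02 is the old instance per the discussion.

```
# 1. Remove mirror02's records from opendev/zone-opendev.org and its host
#    entry from the opendev/system-config inventory via separate reviews.
# 2. Once those merge and DNS has settled, delete the old instance:
export OS_CLOUD=openmetal-iad3   # assumed clouds.yaml entry name
openstack server delete mirror02.iad3.openmetal.opendev.org
```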
@clarkb:matrix.org: the raxflex-iad3 thing doesn't appear to be image specific fwiw. I see multiple labels for different images affected [15:41]
@clarkb:matrix.org: (just to rule out something with a single bad image upload) [15:41]
@clarkb:matrix.org: ok, doing server shows against a larger range of uuids I've had better luck. So far all of the servers were created 2-4 ish hours ago and are still in a BUILD state. At least one of them disappeared between the time I did the server list and the server show [15:45]
@clarkb:matrix.org: I suspect that the cloud is unable to boot things quickly (or at all), the launcher gives up after its own timeout, then moves on and thinks it has more quota, so it tries to boot more nodes and we're in this sort of self-sustaining sad state [15:46]
@clarkb:matrix.org: I suspect that we may want to disable the cloud region in the launchers, but I am going to try some deletes first and see if maybe clearing up these sad nodes helps things [15:47]
@clarkb:matrix.org: I was able to delete np14afcbb6c4ef4 (b1112c16-5432-42d5-9ca0-e4f0618eb67e). So ya, let me try to put together a safe list of deletes and see if we end up in a happier place when things are cleaned up [15:49]
@clarkb:matrix.org: oh this is interesting, there are two error state nodes. One says "unable to allocate network not retrying" and the other says "internal server error" [15:51]
@clarkb:matrix.org: looks like `server list` outputs newest nodes first. My current `server show | grep created` iteration over the list implies this anyway. If this rule holds through the whole list I'll go ahead and attempt to delete every BUILD server older than an hour or so (which I think may be all of them) [15:57]
@clarkb:matrix.org: then we can see if we end up with newly created stuck BUILD nodes, or maybe whatever the problem is has been corrected in the last hour and it will be happy again [15:57]
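A sketch of the cleanup just described; it prints delete commands instead of running them, and assumes GNU date plus an OS_CLOUD already exported for the raxflex iad3 region.

```
now=$(date -u +%s)
openstack server list --status BUILD -f value -c ID | while read -r id; do
  created=$(openstack server show "$id" -f value -c created)
  age=$(( now - $(date -u -d "$created" +%s) ))
  # only servers stuck in BUILD for more than an hour
  if [ "$age" -gt 3600 ]; then
    echo "openstack server delete $id  # in BUILD for ${age}s"
  fi
done
```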
@clarkb:matrix.org: corvus: ^ assuming that these deletes work, I guess my next question is why isn't the launcher deleting them for us? Is something missing in the handling of nodes that time out while stuck in a BUILD state? Or maybe we just aren't trying to delete things as often as nodepool once did? [15:58]
@clarkb:matrix.org: by the time I put together a list of uuids that are safe to delete they all appear to have been deleted automatically, so maybe the launcher is doing this [16:08]
@clarkb:matrix.org: their creation dates were all over an hour old though, so there was a time gap for some reason [16:08]
@clarkb:matrix.org: server list shows no error and no building nodes now [16:10]
@clarkb:matrix.org: ok, now I have ACTIVE and BUILD nodes again and they map to the nodes in the openstack tenant's node list [16:11]
@clarkb:matrix.org: and they appear to be getting ip addresses [16:11]
@clarkb:matrix.org: so either my manual testing of node deletions got something unstuck and that allowed automated systems on our end or the cloud side to delete the rest of them, or something changed in the launcher or on the cloud side to get stuff moving again [16:12]
@jim:acmegating.com: leaked nodes should be deleted fairly quickly, like after a minute [16:12]
@jim:acmegating.com: so there's a good chance the leaked node deletions weren't happening on the cloud side but perhaps are now (as you suggest) [16:12]
@clarkb:matrix.org: ya, that was my expectation. [16:12]
@jim:acmegating.com: * leaked nodes should be deleted fairly quickly, like after a few minutes [16:13]
@clarkb:matrix.org: ok, so it seems likely that whatever was wrong got corrected on the cloud side, which allowed the zuul launcher to complete its deletions [16:13]
@clarkb:matrix.org: and that just happened to coincide with my manual attempts at doing this [16:13]
@jim:acmegating.com: heisenbug [16:13]
-@gerrit:opendev.org- Zuul merged on behalf of Michal Nasiadka: [opendev/zone-opendev.org] 981499: Switch OpenMetal mirror CNAME to new mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/981499 [16:14]
@clarkb:matrix.org: corvus: I wonder if we could have a hidden by default, but accessible, listing of outstanding node deletions? [16:15]
@clarkb:matrix.org: corvus: I think for most users that info is noise/too verbose, but when debugging stuff like this a quick way to see whether or not zuul even knows about the nodes would be good [16:15]
@jim:acmegating.com: well, if zuul does know about it, it's in the list; it doesn't track the leaked nodes [16:17]
@clarkb:matrix.org: oh interesting. Does that imply the launcher thought the deletions were successful? [16:17]
@clarkb:matrix.org: there were nodes that were over 4 hours old according to the cloud api listing in a BUILD state [16:18]
@clarkb:matrix.org: but none of them showed up in the zuul node listing [16:18]
@jim:acmegating.com: usually; could have timed out or returned an error [16:18]
@jim:acmegating.com: but we see openstack lie about deletes all the time [16:19]
@clarkb:matrix.org: does it not keep the record and try again in those cases? (I got more than one 500 error so that does seem potentially likely) [16:19]
@clarkb:matrix.org: right, but with nodepool we wouldn't remove the internal record until the node stopped being listed by the cloud [16:19]
@clarkb:matrix.org: or maybe you are saying the listing was faulty? [16:20]
@jim:acmegating.com: yes we would [16:20]
@jim:acmegating.com: the behavior where nodepool kept those records or even recreated them was very old [16:20]
@jim:acmegating.com: later nodepool was quicker to delete and not recreate [16:20]
@jim:acmegating.com: it's a lot better this way [16:20]
@jim:acmegating.com: this way, zuul tries to delete things normally. if something goes wrong, then we just fall back to our universal "try to clean stuff up" routine [16:21]
@jim:acmegating.com: it's better than having two different "recover from failed delete" code paths [16:21]
@clarkb:matrix.org: I see, it gets handled through leak detection rather than "delete this specific node" [16:21]
@jim:acmegating.com: yep [16:21]
@clarkb:matrix.org: in any case things look much happier now. The only remaining potential issue I see is npdbcc63b190644 (e66a59fa-e463-4f25-b7f6-c8a199d0ba94), created at 2026-03-20T11:15, which is in an ACTIVE state with task_state set to deleting. So the cloud seems to know this one should be deleted, but it hasn't gone away yet [16:22]
@clarkb:matrix.org: but otherwise we are booting and using new nodes there again and changes are merging [16:23]
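The one stuck instance above can be watched with something like the following sketch; if task_state never leaves "deleting", it is a cloud-side problem rather than a launcher one.

```
openstack server show e66a59fa-e463-4f25-b7f6-c8a199d0ba94 \
  -c status -c OS-EXT-STS:task_state -c created
```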
@jim:acmegating.com: if we don't already, we could probably emit a stat for leaked resources [16:23]
@jim:acmegating.com: that would surface the issue on grafana [16:23]
@clarkb:matrix.org: corvus: I guess the other question I've got is why would the cloud accept requests in this state? I guess it must've been below quota so it grabbed them, but then requests were possibly piled up behind requests for cleaning up the leaks or booting a specific node? [16:27]
@clarkb:matrix.org: the nodes were listed as requested and not building though, so I don't think it was a request queue problem [16:27]
@clarkb:matrix.org: I would've expected maybe one request to end up in this situation and then prevent others from ending up in the same boat (but that is probably another nodepool-behavior-based assumption) [16:28]
@harbott.osism.tech:regio.chat: likely unrelated to raxflex, I've now seen quite a number of ssh failures on openmetal-iad3-main like https://zuul.opendev.org/t/openstack/build/37c15982bd4c498f9cd5f96c4b579397 , maybe someone can take a closer look? [17:15]
@clarkb:matrix.org: I started an ssh session to each of the two mirrors in that cloud and have `watch ls -l` running in them, just to see if I can reproduce it that way. Chances are this is hypervisor specific or something and it may need a closer look [17:22]
@fungicide:matrix.org: Clark: i went ahead and approved 981042 now that i'm around for the rest of the day [17:47]
@clarkb:matrix.org: cool, thanks [17:49]
@clarkb:matrix.org: I'm putting together the beginning of a gerrit 3.12 upgrade planning document: https://etherpad.opendev.org/p/gerrit-upgrade-3.12 [17:49]
@fungicide:matrix.org: for openmetal, i wonder if the new mirror node is consuming just enough additional resources to starve one of the hosts? or did we separate the two tenants to different aggregates? [17:49]
@clarkb:matrix.org: fungi: the rebuild was partially done because the old mirror is in the wrong tenant. The new one is in the correct opendevci tenant. I don't think there are any special host aggregates. [17:50]
@clarkb:matrix.org: Considering it is a network/ssh problem I suspect it is something more subtle, but it could be resource starvation [17:51]
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 981042: Purge eavesdrop01 and refstack01 backups on the smaller backup server https://review.opendev.org/c/opendev/system-config/+/981042 [17:57]
@clarkb:matrix.org: ok, I've gone through the 3.12 release notes and put info on our etherpad for the things that I think deserve extra attention. After lunch today I'll start going item by item and try to have answers for concerns or otherwise add info as necessary. [18:03]
@clarkb:matrix.org: the jobs to purge the backups are queued up. I'll check that it looks happy before lunch [18:04]
@clarkb:matrix.org: re gerrit 3.12, my primary concern remains the h2 v2 change. There are breaking changes and other things to check, but nothing immediately stands out to me as having a big impact on our installation (but I intend to double check) [18:06]
@clarkb:matrix.org: purging completed and the contents do appear to be gone. Unfortunately, that was only worth about 2GB. Better than 0 I suppose [18:11]
@fungicide:matrix.org: yeah, i guess there wasn't much data churn for them [18:20]
@mnasiadka:matrix.org: Clark: thanks for the review on https://review.opendev.org/c/opendev/system-config/+/980851 - I limited the change to Noble only because, as I understand it, that's the target version for all of the hosts - which is probably fine for now [19:04]
@clarkb:matrix.org: mnasiadka: yup, I'm not sure how old an apt version supports that format either. Noble seems to default to it so I think that is fine as written [19:08]
@mnasiadka:matrix.org: Great, just wanted to be sure it's fine as is [19:08]
@clarkb:matrix.org: Yes I think so [19:10]
@mnasiadka:matrix.org: I'll try to have a look at more Noble migrations next week, unless we want to have a go at setting up the Prometheus server (just responded on the tsdb sizing) [19:30]
@clarkb:matrix.org: mnasiadka: I am out on Monday, but I think getting Prometheus moving forward is a great goal [19:42]
@clarkb:matrix.org: Happy to help with that [19:42]
@clarkb:matrix.org: I'll check those review responses after lunch so that I'm all caught up and can leave any additional thoughts if necessary [19:42]
@clarkb:matrix.org: mnasiadka: ok, posted a response that includes some math indicating that 1TB is probably a good starting volume size [20:08]
@clarkb:matrix.org: my ssh connections to the openmetal mirrors have not failed yet. I'm going to stop them now [20:08]
@clarkb:matrix.org: Gerrit 3.12 will disable rules.pl prolog files by default. Looking at https://codesearch.opendev.org/?q=.*&i=nope&literal=nope&files=.*rules.pl&excludeFiles=&repos= I don't think we're using that anywhere. And off the top of my head I can't remember anyone relying on prolog in gerrit. If you know of cases that do rely on prolog please let me know [20:19]
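One caveat with the codesearch query: rules.pl lives on each project's refs/meta/config branch, which codesearch likely does not index, so a belt-and-suspenders check on the Gerrit server could look like this sketch (the git directory path is an assumption).

```
# Report any hosted repo carrying a prolog rules.pl on refs/meta/config
find /home/gerrit2/review_site/git -type d -name '*.git' | while read -r repo; do
  if git --git-dir="$repo" cat-file -e refs/meta/config:rules.pl 2>/dev/null; then
    echo "prolog rules found in $repo"
  fi
done
```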
@fungicide:matrix.org: we talked about it a few times, but ultimately always decided it was a bit too magical and brought unnecessary complexity over other solutions [20:29]
@clarkb:matrix.org: that was my recollection too. So good to know that we think we're good and that our previous selves evaluated the situation appropriately :) [20:32]
@clarkb:matrix.org: I've made it through the first portion of the gerrit 3.12 upgrade evaluation in https://etherpad.opendev.org/p/gerrit-upgrade-3.12 I skipped the h2 v2 stuff for now. I've added a few todos and struck out items that I believe are non-issues for us. I've got to do a school run, but I'll try to get through the remainder of the list I called out today (other than h2) so that next week I can look into the todos and h2 [21:01]
@fungicide:matrix.org: it looks like we're getting multiple post_failure results on builds where the executor is hitting a connection timeout when trying to rsync logs from test nodes [21:09]
@fungicide:matrix.org: openmetal-iad3 [21:10]
@fungicide:matrix.org: i've looked through several examples so far and they're all there [21:11]
@clarkb:matrix.org: That's what frickler mentioned earlier [21:11]
@fungicide:matrix.org: okay, so same symptom then [21:11]
@clarkb:matrix.org: My simple ssh test against the mirrors was fruitless, so it's probably specific to an IP address or hypervisor [21:12]
@clarkb:matrix.org: The regular ssh stuff uses control persistence and that connection seems fine. Rsync doesn't, and its new connection fails [21:13]
@clarkb:matrix.org: Maybe it is a conntrack type issue? [21:13]
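If conntrack exhaustion were the culprit, it should show up on whichever host's table is under pressure (our instances can be checked directly; the cloud's own infrastructure cannot). A quick sketch, assuming conntrack-tools is installed for the second command:

```
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
sudo conntrack -S   # non-zero drop/early_drop counters point at table pressure
```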
@fungicide:matrix.org: yeah, looks like his example was in a different task but also rsync [21:13]
@fungicide:matrix.org: interesting that ansible doesn't fail prior to that [21:14]
@fungicide:matrix.org: it's definitely happening on a variety of jobs and node types, but all in the same provider [21:15]
@clarkb:matrix.org: Ansible may be failing earlier but then we retry the job? Otherwise it connects once and reuses the connection for the whole job [21:16]
@clarkb:matrix.org: Except for rsync, which does not [21:16]
@fungicide:matrix.org: looking at two examples that post_failed around the same time there, the host_id fields in the inventory differ, so i think that means they weren't both scheduled to the same hypervisor host? [21:18]
@clarkb:matrix.org: I think that is correct [21:18]
@fungicide:matrix.org: so if it's a host-specific problem then it's impacting multiple hosts [21:19]
@clarkb:matrix.org: Another test may be repeated ssh connection attempts to the mirrors or to a held or manually booted node [21:20]
@clarkb:matrix.org: Considering we ran ansible before the rsync failure, it isn't happening 100% of the time for the IP [21:20]
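The repeated-connection test suggested above could look like this sketch, forcing a brand-new ssh connection each time (as rsync does) rather than reusing a ControlMaster socket; the hostname is an assumption and could be swapped for a held node.

```
host=mirror.iad3.openmetal.opendev.org   # assumed name; use a held node if preferred
for i in $(seq 1 200); do
  timeout 30 ssh -o ControlMaster=no -o ControlPath=none "$host" true \
    || echo "attempt $i failed at $(date -u +%H:%M:%S)"
  sleep 5
done
```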
@clarkb:matrix.org: second round of gerrit upgrade notes are in place. I also heard back from luca about the request refs caching thing, and his suggestion is that we just disable it since we don't run with nfs. Apparently that cache exists to improve the lives of nfs systems, and since we don't use nfs we can just disable it and avoid any potential problems. So I've got a todo to disable it [22:12]
@fungicide:matrix.org: oh, good to know [22:14]
@clarkb:matrix.org: Then I also discovered this script/tool: https://gerrit.googlesource.com/gerrit/+/refs/tags/v3.12.5/contrib/maintenance/ which I don't think we need, but it's nice to see that it exists; we could switch to it and would probably see some improvements. It's just that our repos aren't so large that we need it for gerrit to perform well. Normal git gc is keeping up for us [22:18]
