| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 981499: Switch OpenMetal mirror CNAME to new mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/981499 | 11:44 | |
| -@gerrit:opendev.org- David Ostrovsky proposed: | 14:11 | |
| - [zuul/zuul-jobs] 981517: Optionally ensure Java in ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981517 | ||
| - [zuul/zuul-jobs] 981518: Make bazelisk-build ensure Java via ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981518 | ||
| -@gerrit:opendev.org- David Ostrovsky proposed: [zuul/zuul-jobs] 981518: Make bazelisk-build ensure Java via ensure-bazelisk https://review.opendev.org/c/zuul/zuul-jobs/+/981518 | 14:36 | |
| @clarkb:matrix.org | fungi: https://review.opendev.org/c/opendev/system-config/+/981042 is the change I promised to purge refstack01 and eavesdrop01 backups from the smaller backup server to free up more space there. You had previously discussed this should be fine. If that holds maybe you can review it and we can proceed with that today? | 14:54 |
|---|---|---|
| @clarkb:matrix.org | One thing I never followed up on from earlier was the LE cert renewal. It does look like the warnings have gone away after we addressed the dpkg lock on the osuosl mirror. I suspect that when we deployed static03 all of these certs were reissued and it has been fine since then | 15:16 |
| @fungicide:matrix.org | yes, it was entirely because static02 was in the disable list for a few days | 15:19 |
| @fungicide:matrix.org | once we moved the sites it cleared up | 15:19 |
| @fungicide:matrix.org | disappearing for a quick lunch, back shortly | 15:28 |
| @clarkb:matrix.org | enjoy. When you get back I think I'll approve the backup purge change | 15:29 |
| @clarkb:matrix.org | something is going on with raxflex iad3 api requests and that appears to be why we are not starting jobs quickly at the moment | 15:35 |
| @clarkb:matrix.org | I tried to do a server show against a server and got a 500 error. Listing servers shows a number of them in BUILD. I'm trying to cross reference the servers listed there with np$UUIDPREFIX names listed by the zuul nodes page to see if they even show up | 15:36 |
| @clarkb:matrix.org | corvus: ^ fyi | 15:36 |
| @clarkb:matrix.org | oh also I see a fairly large number of 16GB nodes in use at the moment. Theoretically not a problem, but just something I didn't expect to see as I was looking into this | 15:37 |
| @clarkb:matrix.org | ok none of the requested nodes as listed by the zuul nodes page for that region show up in the server listing. The one active node does show up. I think that means we've leaked all of these other nodes? | 15:39 |
| @clarkb:matrix.org | I think my next step may be to try and delete one of the BUILD nodes | 15:39 |
| @mnasiadka:matrix.org | Clark: re mirror02.iad3.openmetal I assume next step is to remove the DNS/inventory entries somewhere next week and remove the old instance in OpenMetal's cloud? | 15:40 |
| @clarkb:matrix.org | mnasiadka: yes exactly | 15:40 |
| @clarkb:matrix.org | the raxflex-iad3 thing doesn't appear to be image specific fwiw. I see multiple labels for different images affected | 15:41 |
| @clarkb:matrix.org | (just to rule out something with a single bad image upload) | 15:41 |
| @clarkb:matrix.org | ok doing some server shows against a larger range of uuids I've had better luck. So far all of the servers were created 2-4 ish hours ago and are still in a BUILD state. At least one of them disappeared between the time I did a server list and a server show | 15:45 |
| @clarkb:matrix.org | I suspect that the cloud is unable to boot things quickly (or at all); the launcher gives up after its own timeout, moves on thinking it has more quota, then tries to boot more nodes, and we're in this sort of self-sustaining sad state | 15:46 |
| @clarkb:matrix.org | I suspect that we may want to disable the cloud region in the launchers, but I am going to try some deletes first and see if maybe clearing up these sad nodes helps things | 15:47 |
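The suspected feedback loop above can be sketched as a toy model. This is purely an assumption about the dynamics being described, not Zuul's actual launcher code:

```python
def simulate_launcher(cycles):
    """Toy model of the suspected loop: the cloud accepts each boot but never
    finishes it, the launcher times out and drops its record, sees free quota
    again, and boots more, so abandoned BUILD servers pile up cloud-side."""
    tracked = 0         # nodes the launcher still has a record of
    stuck_on_cloud = 0  # BUILD servers accumulating in the cloud
    for _ in range(cycles):
        tracked += 1         # launcher boots a node against "free" quota
        stuck_on_cloud += 1  # cloud accepts it but it never leaves BUILD
        tracked -= 1         # boot times out; launcher abandons the record
    return tracked, stuck_on_cloud
```

The launcher's own view ends each cycle empty, which is why its quota accounting never converges while the cloud keeps holding stuck servers.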
| @clarkb:matrix.org | I was able to delete np14afcbb6c4ef4 (b1112c16-5432-42d5-9ca0-e4f0618eb67e). So ya let me try and put together a safe list of deletes and see if we end up in a happier place when things are cleaned up | 15:49 |
| @clarkb:matrix.org | oh this is interesting there are two error state nodes. One says unable to allocate network not retrying and the other says internal server error | 15:51 |
| @clarkb:matrix.org | looks like `server list` outputs newest nodes first; my current server show | grep created iteration over the list implies this anyway. If this rule holds through the whole list I'll go ahead and attempt to delete every BUILD server older than an hour or so (which I think may be all of them) | 15:57 |
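The "delete every BUILD server older than an hour" pass could look something like the sketch below, filtering parsed `openstack server list -f json` output. The `ID`/`Status`/`Created At` field names are assumptions about that CLI's JSON columns:

```python
from datetime import datetime, timedelta, timezone

def stale_build_servers(servers, max_age=timedelta(hours=1), now=None):
    """Return UUIDs of servers stuck in BUILD longer than max_age.

    `servers` is assumed to be the parsed JSON output of an
    `openstack server list` invocation (field names are assumptions).
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for s in servers:
        if s.get("Status") != "BUILD":
            continue
        # normalize a trailing Z into an explicit UTC offset for fromisoformat
        created = datetime.fromisoformat(s["Created At"].replace("Z", "+00:00"))
        if now - created > max_age:
            stale.append(s["ID"])
    return stale
```

The resulting list could then be fed to `openstack server delete` one uuid at a time, which matches the cautious manual approach described above.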
| @clarkb:matrix.org | then we can see if we end up with newly created stuck BUILD nodes or maybe whatever the problem is has been corrected in the last hour and it will be happy again | 15:57 |
| @clarkb:matrix.org | corvus: ^ assuming that these deletes work I guess my next question is why isn't the launcher deleting them for us? Is that something missing in the timeout-while-stuck-in-a-build-state process? Or maybe we just aren't trying to delete things as often as nodepool once did? | 15:58 |
| @clarkb:matrix.org | by the time I put together a list of uuids that are safe to delete they all appear to have been deleted automatically so maybe the launcher is doing this | 16:08 |
| @clarkb:matrix.org | their creation dates were all over an hour old though, so there was a time gap for some reason | 16:08 |
| @clarkb:matrix.org | server list shows no error and no building nodes now | 16:10 |
| @clarkb:matrix.org | ok now I have ACTIVE and BUILD nodes again and they map to the nodes in the openstack tenant node list | 16:11 |
| @clarkb:matrix.org | and they appear to be getting ip addresses | 16:11 |
| @clarkb:matrix.org | so either my manual testing of node deletions got something unstuck, which allowed automated systems on our end or the cloud side to delete the rest of them, or something changed in the launcher or on the cloud side to get stuff moving again | 16:12 |
| @jim:acmegating.com | leaked nodes should be deleted fairly quickly, like after a minute | 16:12 |
| @jim:acmegating.com | so there's a good chance the leaked node deletions weren't happening on the cloud side but perhaps are now (as you suggest) | 16:12 |
| @clarkb:matrix.org | ya that was my expectation. | 16:12 |
| @jim:acmegating.com | * leaked nodes should be deleted fairly quickly, like after a few minutes | 16:13 |
| @clarkb:matrix.org | ok so it seems likely that whatever was wrong got corrected on the cloud side which allowed zuul launcher to complete its deletions | 16:13 |
| @clarkb:matrix.org | and that just happened to be occurring at the same time as my manual attempts at doing this | 16:13 |
| @jim:acmegating.com | heisenbug | 16:13 |
| -@gerrit:opendev.org- Zuul merged on behalf of Michal Nasiadka: [opendev/zone-opendev.org] 981499: Switch OpenMetal mirror CNAME to new mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/981499 | 16:14 | |
| @clarkb:matrix.org | corvus: I wonder if we could have a hidden by default, but accessible listing of outstanding node deletions? | 16:15 |
| @clarkb:matrix.org | corvus: I think for most users that info is noise/too verbose, but when debugging stuff like this a quick way to see whether or not zuul even knows about the nodes would be good | 16:15 |
| @jim:acmegating.com | well, if zuul does know about it, it's in the list; it doesn't track the leaked nodes | 16:17 |
| @clarkb:matrix.org | oh interesting. Does that imply the launcher thought the deletions were successful? | 16:17 |
| @clarkb:matrix.org | there were nodes that were over 4 hours old according to the cloud api listing in a BUILD state | 16:18 |
| @clarkb:matrix.org | but none of them showed up in the zuul node listing | 16:18 |
| @jim:acmegating.com | usually; could have timed out or returned an error | 16:18 |
| @jim:acmegating.com | but we see openstack lie about deletes all the time | 16:19 |
| @clarkb:matrix.org | does it not keep the record and try again in those cases? (I got more than one 500 error so that does seem potentially likely) | 16:19 |
| @clarkb:matrix.org | right but with nodepool we wouldn't remove the internal record until the node stopped being listed by the cloud | 16:19 |
| @clarkb:matrix.org | or maybe you are saying maybe the listing was faulty? | 16:20 |
| @jim:acmegating.com | yes we would | 16:20 |
| @jim:acmegating.com | the behavior where nodepool kept those records or even recreated them was very old | 16:20 |
| @jim:acmegating.com | later nodepool was quicker to delete and not recreate | 16:20 |
| @jim:acmegating.com | it's a lot better this way | 16:20 |
| @jim:acmegating.com | this way, zuul tries to delete things normally. if something goes wrong, then we just fall back to our universal "try to clean stuff up" routine | 16:21 |
| @jim:acmegating.com | it's better than having two different "recover from failed delete" code paths | 16:21 |
| @clarkb:matrix.org | I see, it gets handled through leak detection rather than delete this specific node | 16:21 |
| @jim:acmegating.com | yep | 16:21 |
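The "universal cleanup" idea described in the exchange above boils down to a set difference: anything the cloud lists with our node-name prefix that the launcher no longer tracks is a leak for the periodic sweep to delete. A minimal illustrative sketch (the function name, prefix, and data shapes are assumptions, not Zuul's actual code):

```python
def find_leaked(cloud_server_names, tracked_node_names, prefix="np"):
    """Return cloud servers carrying our node-name prefix that no longer
    appear in the launcher's records, i.e. candidates for leak cleanup."""
    tracked = set(tracked_node_names)
    return sorted(
        name for name in cloud_server_names
        if name.startswith(prefix) and name not in tracked
    )
```

This is why a failed or lying delete needs no dedicated recovery path: the node simply reappears in the leak set on the next sweep until the cloud really removes it.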
| @clarkb:matrix.org | in any case things look much happier now. The only remaining potential issue I see is npdbcc63b190644 (e66a59fa-e463-4f25-b7f6-c8a199d0ba94) created at 2026-03-20T11:15 is in an active state with task_state set to deleting. So the cloud seems to know this one should be deleted but it hasn't gone away yet | 16:22 |
| @clarkb:matrix.org | but otherwise we are booting and using new nodes there again and changes are merging | 16:23 |
| @jim:acmegating.com | if we don't already, we could probably emit a stat for leaked resources | 16:23 |
| @jim:acmegating.com | that would surface the issue on grafana | 16:23 |
| @clarkb:matrix.org | corvus: I guess the other question I've got is why would the cloud accept requests in this state? I guess it must've been below quota so grabbed them but then requests were possibly piled up behind requests for cleaning up the leaks or booting a specific node? | 16:27 |
| @clarkb:matrix.org | the nodes were listed as requested and not building though so I don't think it was a queue for requests problem | 16:27 |
| @clarkb:matrix.org | I would've expected maybe one request to end up in this situation and then prevent others from ending up in the same boat (but that is probably another nodepool behavior based assumption) | 16:28 |
| @harbott.osism.tech:regio.chat | likely unrelated to raxflex, I've now seen quite a number of ssh failures on openmetal-iad3-main like https://zuul.opendev.org/t/openstack/build/37c15982bd4c498f9cd5f96c4b579397 , maybe someone can take a closer look? | 17:15 |
| @clarkb:matrix.org | I started an ssh session to each of the two mirrors in that cloud and have watch ls -l running in them. Just to see if I can reproduce it that way. Chances are this is hypervisor specific or something and it may need a closer look | 17:22 |
| @fungicide:matrix.org | Clark: i went ahead and approved 981042 now that i'm around for the rest of the day | 17:47 |
| @clarkb:matrix.org | cool thanks | 17:49 |
| @clarkb:matrix.org | I'm putting together the beginning of a gerrit 3.12 upgrade planning document: https://etherpad.opendev.org/p/gerrit-upgrade-3.12 | 17:49 |
| @fungicide:matrix.org | for openmetal, i wonder if the new mirror node is consuming just enough additional resources to starve one of the hosts? or did we separate the two tenants to different aggregates? | 17:49 |
| @clarkb:matrix.org | fungi: the rebuild was partially done because the old mirror is in the wrong tenant. The new one is in the correct opendevci tenant. I don't think there are any special host aggregates. | 17:50 |
| @clarkb:matrix.org | Considering it is a network/ssh problem I suspect it is something more subtle but could be resource starvation | 17:51 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 981042: Purge eavesdrop01 and refstack01 backups on the smaller backup server https://review.opendev.org/c/opendev/system-config/+/981042 | 17:57 | |
| @clarkb:matrix.org | ok I've gone through the 3.12 release notes and put info on our etherpad for the things that I think deserve extra attention. After lunch today I'll start going item by item and try to have answers for concerns or otherwise add info as necessary. | 18:03 |
| @clarkb:matrix.org | the jobs to purge the backups are queued up. I'll check that looks happy before lunch | 18:04 |
| @clarkb:matrix.org | re gerrit 3.12 my primary concern remains the h2 v2 change. There are breaking changes and other things to check but nothing immediately stands out to me as having a big impact on our installation (but I intend to double check) | 18:06 |
| @clarkb:matrix.org | purging completed and the contents do appear to be gone. Unfortunately, that was only worth about 2GB. Better than 0 I suppose | 18:11 |
| @fungicide:matrix.org | yeah, i guess there wasn't much data churn for them | 18:20 |
| @mnasiadka:matrix.org | Clark: thanks for the review on https://review.opendev.org/c/opendev/system-config/+/980851 - I limited the change only to Noble because that’s sort of the target version for all of the hosts as I assume - which is probably fine for now | 19:04 |
| @clarkb:matrix.org | mnasiadka: yup I'm not sure how far back apt support for that format goes either. Noble seems to default to it so I think that is fine as written | 19:08 |
| @mnasiadka:matrix.org | Great, just wanted to be sure it’s fine as is | 19:08 |
| @clarkb:matrix.org | Yes I think so | 19:10 |
| @mnasiadka:matrix.org | I’ll try to have a look on more Noble migrations next week, unless we want to have a go on setting up the Prometheus server (just responded on the tsdb sizing) | 19:30 |
| @clarkb:matrix.org | mnasiadka: I am out on Monday, but I think getting the Prometheus moving forward is a great goal | 19:42 |
| @clarkb:matrix.org | Happy to help with that | 19:42 |
| @clarkb:matrix.org | I'll check those review responses after lunch so that I'm all caught up and can leave any additional thoughts if necessary | 19:42 |
| @clarkb:matrix.org | mnasiadka: ok posted a response that includes some math indicating that 1TB is probably a good starting volume size | 20:08 |
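The sizing math referenced above is presumably the standard Prometheus back-of-envelope formula: disk ≈ ingestion rate × retention × bytes per compressed sample (the Prometheus storage docs cite roughly 1-2 bytes/sample after compression). The concrete numbers below are hypothetical, not the actual figures from the review:

```python
def tsdb_bytes(samples_per_second, retention_days, bytes_per_sample=2):
    """Back-of-envelope Prometheus TSDB sizing:
    disk ~= samples/s * retention seconds * bytes per compressed sample."""
    return samples_per_second * retention_days * 86_400 * bytes_per_sample

# Hypothetical load: 100k samples/s retained for 30 days at ~2 bytes/sample
# -> 100_000 * 30 * 86_400 * 2 bytes, a bit over 500 GB, so a 1 TB volume
# leaves comfortable headroom for WAL, compaction, and growth.
```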
| @clarkb:matrix.org | my ssh connections to the openmetal mirrors have not failed yet. I'm going to stop them now | 20:08 |
| @clarkb:matrix.org | Gerrit 3.12 will disable rules.pl prolog files by default. Looking at https://codesearch.opendev.org/?q=.*&i=nope&literal=nope&files=.*rules.pl&excludeFiles=&repos= I don't think we're using that anywhere. And off the top of my head I can't remember anyone relying on prolog in gerrit. If you know of cases that do rely on prolog please let me know | 20:19 |
| @fungicide:matrix.org | we talked about it a few times, but ultimately always decided it was a bit too magical and brought unnecessary complexity over other solutions | 20:29 |
| @clarkb:matrix.org | that was my recollection too. So good to know that we think we're good and our previous selves evaluated the situation appropriately :) | 20:32 |
| @clarkb:matrix.org | I've made it through the first portion of gerrit 3.12 upgrade evaluation in https://etherpad.opendev.org/p/gerrit-upgrade-3.12 I skipped the h2 v2 stuff for now. I've added a few todos and struck out items that I believe are non issues for us. I've got to do a school run but I'll try to get through the remainder of the list I called out today (other than h2) so that next week I can look into the todos and h2 | 21:01 |
| @fungicide:matrix.org | it looks like we're getting multiple post_failure results on builds where the executor is hitting a connection timeout when trying to rsync logs from test nodes | 21:09 |
| @fungicide:matrix.org | openmetal-iad3 | 21:10 |
| @fungicide:matrix.org | i've looked through several examples so far and they're all there | 21:11 |
| @clarkb:matrix.org | That's what frickler mentioned earlier | 21:11 |
| @fungicide:matrix.org | okay, so same symptom then | 21:11 |
| @clarkb:matrix.org | My simple ssh to the mirrors test was fruitless so probably specific IP address or hypervisor related | 21:12 |
| @clarkb:matrix.org | The regular ssh stuff uses control persistence and that connection seems fine. Rsync doesn't, and its new connection fails | 21:13 |
| @clarkb:matrix.org | Maybe it is a conntrack type issue? | 21:13 |
| @fungicide:matrix.org | yeah, looks like his example was in a different task but also rsync | 21:13 |
| @fungicide:matrix.org | interesting that ansible doesn't fail prior to that | 21:14 |
| @fungicide:matrix.org | it's definitely happening on a variety of jobs and node types, but all in the same provider | 21:15 |
| @clarkb:matrix.org | Ansible may be failing prior but then we retry the job? Otherwise it connects once and reuses the connection the whole job | 21:16 |
| @clarkb:matrix.org | Except for rsync which does not | 21:16 |
| @fungicide:matrix.org | looking at two examples that post_failed around the same time there, the host_id fields in the inventory differ so i think that means they weren't both scheduled to the same hypervisor host? | 21:18 |
| @clarkb:matrix.org | I think that is correct | 21:18 |
| @fungicide:matrix.org | so if it's a host-specific problem then it's impacting multiple hosts | 21:19 |
| @clarkb:matrix.org | Another test may be repeated ssh connection attempts to the mirrors or to a held or manually booted node | 21:20 |
| @clarkb:matrix.org | Considering we ran ansible before the rsync failure it isn't happening 100% of the time for the IP | 21:20 |
| @clarkb:matrix.org | second round of gerrit upgrade notes are in place. I also heard back from luca about the request refs caching thing and his suggestion is that we just disable it since we don't run with nfs. Apparently that cache exists to improve the lives of nfs systems, and since we don't use nfs we can just disable it and avoid any potential problems. So I've got a todo to disable it | 22:12 |
| @fungicide:matrix.org | oh, good to know | 22:14 |
| @clarkb:matrix.org | Then I also discovered this script/tool: https://gerrit.googlesource.com/gerrit/+/refs/tags/v3.12.5/contrib/maintenance/ which I don't think we need, but it's nice to see that it exists; if we did switch to it we'd probably see some improvements. It's just that our repos aren't so large that we need it for gerrit to perform well. Normal git gc is keeping up for us | 22:18 |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!