Thursday, 2025-11-20

dmsimard[m]I must once again run, but I also dropped some thoughts in the ara-for-databases pad, thanks for getting it started00:52
*** liuxie is now known as liushy06:10
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618706:58
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:06
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:23
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:38
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:39
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618708:03
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618708:05
*** sfinucan is now known as stephenfin10:16
*** ykarel_ is now known as ykarel13:30
babicohello14:14
fungihi, need something?14:30
fungichecking up on openmetal transfer rates, i got 82.1 MB/s pulling 1gb-test.dat from the mirror there to /dev/null on ze12 just now, so doesn't seem like the problem is recurring at the moment14:44
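(For reference, a throughput spot-check along those lines can be done with curl; the mirror hostname below is illustrative:)
    curl -o /dev/null -w 'download speed: %{speed_download} bytes/s\n' \
        https://mirror.example.opendev.org/1gb-test.dat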
fungineed to disappear for a bit to run more errands, back after lunch16:04
clarkbseems quiet this morning. I'm going to start working through my todo list on the gerrit 3.11 upgrade etherpad16:08
clarkbI need to grab the gerrit war out of our 3.11 container images and I wish there was an easy way to do something like docker cp out of an image rather than a container. There probably is, but I don't know it16:20
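(A common workaround is to create a stopped container from the image, docker cp the file out of it, and then remove the container; the image name and war path below are assumptions for illustration:)
    ctr=$(docker create quay.io/example/gerrit:3.11)   # create, but do not start, a throwaway container
    docker cp "$ctr":/var/gerrit/bin/gerrit.war .      # copy the war out of its filesystem
    docker rm "$ctr"                                   # clean up the throwaway container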
clarkbhrm looks like my gerrit 3.11 held node is unhappy. I guess I need to look into that16:38
clarkbthe server rebooted 5 minutes ago16:39
clarkboh! the server at that ip address is a new different instance running some python unittests16:39
clarkbinteresting16:39
clarkbso it didn't really reboot so much as get replaced with a new boot16:40
clarkbmy autohold, 0000000256, is still in place16:40
clarkb(note I checked that on zuul02 running the cli client which caused it to pull the image from the new location on quay)16:41
clarkb`Unable to find image 'quay.io/zuul-ci/zuul-client:latest' locally` then it automatically pulled it down16:41
clarkbcorvus: ^ I don't yet know if this a bug in zuul-launcher or not, but the autohold is still in place and the web ui nodes list shows the nodes are still there. I'm going to see now what the cloud provider says16:42
clarkbcorvus: the cloud provider does not have the instance anymore (and the instance that was reusing the floating ip is also gone at this point). Still don't know if the launcher or cloud cleaned it up16:44
clarkb6d8b42238b674d1db580d2afdf777670 is the zuul id for the instance I think (it is the noble node which should be the gerrit node in the two node nodeset)16:45
clarkboh wait my earlier listing was in the wrong project let me double check the cloud side info16:45
clarkbok the server is still in the cloud but it no longer has the floating ip attached to it16:46
clarkbnp6d8b42238b674 735cd878-e101-441d-a62b-5b67469a7d3b in rax flex dfw316:47
clarkb174.143.59.58 is the old fip value. So now I think I need to see what if any logs around that ip being deleted we may have16:47
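(A cloud-side check along these lines shows whether that fip still exists and what addresses the server currently has; the cloud name here is an assumption:)
    openstack --os-cloud raxflex-dfw3 floating ip list | grep 174.143.59.58
    openstack --os-cloud raxflex-dfw3 server show 735cd878-e101-441d-a62b-5b67469a7d3b -c addresses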
clarkbI want to say nova has some way of giving an accounting of things like live migration events; I wonder if there is such a thing for floating ip attach and detach.16:49
clarkb`server event list 735cd878-e101-441d-a62b-5b67469a7d3b` only has the event for server create in it16:50
clarkbthe other node in the held nodeset also lost its fip if it becomes useful to look at that one16:51
clarkbthe ip address is nodescanned by zl02 at 2025-11-13 23:51:09,407 for the held node. Then we don't scan the ip address again until 2025-11-19 22:04:55,916, also on zl02. The ip address only shows up in those nodescan log entries so I'm not sure if the launcher deleted it for some reason or the cloud did16:57
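(Those nodescan entries can be pulled out of the rotated launcher debug logs with a plain grep, e.g.:)
    grep '174.143.59.58' /var/log/zuul/launcher-debug.log*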
clarkbtheory: this floating ip got deleted by our leaked floating ip cleanup routine. I have no evidence to support this yet17:00
clarkbcorvus: we call https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 from zuul to delete unattached floating ips here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L1071-L1082 There are a bunch of caveats listed in the openstacksdk method but I think we're following them as we create17:07
clarkbfloating ips, setting the server value during server creation. My hunch here is that this method is not actually safe to use17:07
clarkbmaybe it is neutron backend dependent or something17:07
clarkbanyway I can drop the provider floating ip cleanup flag from rax flex regions to disable that and then we can monitor to see if we leak fips17:08
clarkbunfortunately I don't see any logging around this in the zuul launcher code so I'm not sure we'll be able to find much more historical evidence other than the fact that we do try to periodically clean up floating ips and something may have gone wrong there17:08
clarkbanother approach would be to try and add logging to that and then leave the provider flag in place to clean things up and see if we can reproduce it17:09
clarkbI will need to recycle my held node to do more testing but ideally we would have some plan for ^ beforehand just in case the held nodes land in rax flex again17:10
clarkbhttps://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/providers.yaml#L136 and just to double check we do have the flag set17:13
clarkbit looks like that cleanup routine is called every time we list servers and it took 6 days to occur. Seems like this is a really infrequent corner case but an annoying one if hit17:16
clarkbas an alternative I can manually attach a new floating ip to my server and see if that works17:16
clarkb(rather than recycling the holds)17:17
opendevreviewClark Boylan proposed opendev/zuul-providers master: Stop automated floating ip clean up rax flex  https://review.opendev.org/c/opendev/zuul-providers/+/96788417:37
clarkbConsider that an RFC. I'm open to other ideas for debugging and dealing with this17:37
corvus...catching up...17:42
corvusclarkb: do you have an estimated time range for when it may have happened?  from your logs above i see 11-13 through 11-19... any narrower than that?17:46
clarkbcorvus: not really unfortunately. I was hoping the nova server event list would tell us but it only has the server creation event in it17:48
fungi13-19 would span a restart, i guess that's the biggest thing we'd want to narrow the window to rule out17:48
clarkbcorvus: judging by the frequency of ip address reuse I would guess it happened not too long before the ip got used again17:48
clarkb(we're recycling ips frequently enough that a few hours at most seems likely)17:48
corvuswe also clean up leaked ports, and i was looking to see if perhaps that happened, which may have then prompted the ip cleanup, but in the logs i only see port cleanups for iad317:48
corvusand i think you said it was dfw317:49
clarkbcorrect this is dfw3 17:49
corvus /var/log/zuul/launcher-debug.log.1:2025-11-19 20:13:58,584 DEBUG zuul.openstack.raxflex-IAD3: Removed DOWN port dfa55cb4-eb97-4461-bc74-0cc3ed577ebd in raxflex-IAD3 (example log message; not relevant)17:49
corvusclarkb: i'd be open to vendoring https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 into zuul with debug log lines17:50
clarkbcorvus: ya that might be a good approach. Though potentially super verbose? Getting the logging right might be tricky but I agree pulling the magic out of the sdk where we can log specific things is probably a good next step for debugging17:51
corvusi think just logging when we delete an ip would be a start; that shouldn't be verbose because this shouldn't happen that often17:52
corvuslike, we delete maybe 50 leaked ports per day17:52
corvusi wouldn't expect fip leaks to be much different in magnitude17:53
clarkb++17:53
clarkbcorvus: in that case should I WIP my zuul-providers change so that we don't remove the condition under which we would experience this in the first place?17:53
corvusyeah, i think my preference would be to do that instead of turning off the flag:  mostly because we can at least get more debugging and then try to correlate it with other things (maybe a zuul bug, maybe an openstack bug).  and this error seems to be super rare, if annoying, so we can probably stand to allow it to happen again17:54
corvusalso, i'm pretty sure we'd see actual fip leaks more anyway and we'd have to clean them up, so this is probably ultimately less work for us.  :)17:55
clarkbok the change is WIP now. I think we can/should vendor the code and add some debug logging. Then I can manually attach new floating IPs to my two held nodes and they can be our canaries17:56
clarkbcorvus: I can work on the launcher change as well unless you want to do that17:56
corvusclarkb: feel free and i can review it; i have a few things in progress locally right now and also an errand to run :)17:58
clarkbon it17:58
clarkbcorvus: remote:   https://review.opendev.org/c/zuul/zuul/+/967891 Vendor openstacksdk delete_unattached_floating_ips18:20
clarkbI don't have a test case for it as I'm half hoping the existing tests we have will cover it well enough but I'm not positive of that. I did manually do some repl stuff on bridge to confirm that ip['floating_ip_address'] is a valid attribute lookup and that it returns the correct data18:21
corvusclarkb: yeah, there is a resource cleanup test case that covers that, so we should have code coverage19:29
clarkbcorvus: looks like the tests failed because I need to mock or fake out some things20:44
corvusclarkb: see!  there is test coverage!  :)  i think you just need to implement _use_neutron_floating(); it can probably just return True.  goes in "tests/fake_openstack.py" -- right next to delete_unattached_floating_ips :)20:49
corvusbecause list and delete methods already exist20:49
clarkbyup just pushed. and I agree that appears to be the only thing that needs to be faked20:50
clarkbI ended up making it conditional based on a very similar preexisting conditional that I think better captures the need vs always returning True20:50
clarkbwe shall see (I don't have my local testing set up right now and I'm hoping to pop out shortly for a bike ride otherwise I would run that test locally)20:51
corvus++20:52
clarkbalso I realized I can just manually attach an fip to the held node whenever and if it clears out we can reattach it again manually. So when I get back from my bike ride I will probably do that and continue with my gerrit testing20:52
corvusclarkb: i think you'll have to do that between cleanup intervals for the launcher, so be fast :)20:53
clarkbcorvus: maybe? the node was held for ~6 days before the ip got reused and since my discovery of the problem I think that ip has been reused at least twice20:54
clarkbI suspect there is some other condition on the backend that we have to trip over to break things, but I don't know; as mentioned before we lack timestamps so this is a lot of guessing20:54
corvusno i mean, if you create a fip that's unattached, that cleanup method will delete it, so you need to create and attach while it isn't running20:55
clarkboh ha yes20:55
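(A create-and-attach roughly like the following would do it; the cloud and external network names are assumptions, and NEW_FIP stands for whatever address the create step returns:)
    openstack --os-cloud raxflex-dfw3 floating ip create PUBLICNET      # allocate a new fip from the external network
    openstack --os-cloud raxflex-dfw3 server add floating ip 735cd878-e101-441d-a62b-5b67469a7d3b NEW_FIP   # attach it to the held server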
clarkbone thing I've noticed on the gerrit upgrade prep which may or may not be concerning is that a number of these potentially concerning behavior changes actually come into gerrit via changes to older stable branches, then get forward ported, and then captured as a more scary thing in the next proper release's release notes, so for a lot of stuff we're already running the code and it is20:56
clarkbapparently a non issue for us20:56
clarkbI think some of these do get captured by the point release docs for older releases20:56
clarkbbut they are maybe easier to skim over in that context or maybe I'm just not taking good enough notes to remember something was investigated and determined to be a non issue20:57
clarkbthe sshd timeout fix is one example of this20:57
clarkbit was captured in 3.10.something release notes20:57
clarkbwhich by the way reduced our sshd idletimeout from an unexpected 3600000 seconds to the configured 3600 seconds, so not an issue for us because our idletimeout value was chosen without any awareness of the bug20:59
clarkbI also remembered to check the static status after lunch and it looks like load remains reasonable and we're running well below the new limit. So ya I suspect the crawling is continuing to get worse over time and we just have to crank up the volume in response21:02
clarkbok popping out now21:03
corvusclarkb: there were some un-exercised parts of the fakes that needed updating.  i made a new change and put it under yours:   https://review.opendev.org/c/zuul/zuul/+/967934 Increase fake openstack fidelity [NEW] 22:02
opendevreviewMerged zuul/zuul-jobs master: Fix defaults for upload-image-swift and -s3  https://review.opendev.org/c/zuul/zuul-jobs/+/96223822:27
clarkbcorvus: thanks, I'm taking a look now23:21
clarkband it lgtm; not sure if we want to go ahead and approve that since it is a test-only improvement? I'll let you decide on that23:24
clarkbfungi: I may have possibly found a regression with gerrit 3.11 but it is hard to say. It has to do with how repo browse links are rendered in the repo page on gerrit. Not urgent but if you get a chance can you review line 102 on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 and see if you know why this may be happening as you've done gerrit gitweb/gitea link stuff in the past23:31
clarkbif it is a regression I think it is one we can live with. But it may also simply be missing configuration in our test setup?23:31
corvusclarkb:  yeah, i think so.  and we can +w your change tomorrow23:31
clarkbcorvus: sounds good. I've got a floating ip attached to the node again so we should have a new canary in place23:32
clarkb50.56.159.76 this ip specifically23:33
clarkbRamereth[m]: not sure if you saw the other day but we think we're experiencing the poor io performance issues with the arm nodes again (specifically when building new images). I'm happy to dig up logs or run testing that you think may be relevant; if there is any way we can help debug that, just let us know23:44
