| dmsimard[m] | I must once again run, but I also dropped some thoughts in the ara-for-databases pad, thanks for getting it started | 00:52 |
|---|---|---|
| *** | liuxie is now known as liushy | 06:10 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 06:58 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:06 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:23 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:38 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:39 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 08:03 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 08:05 |
| *** | sfinucan is now known as stephenfin | 10:16 |
| *** | ykarel_ is now known as ykarel | 13:30 |
| babico | hello | 14:14 |
| fungi | hi, need something? | 14:30 |
| fungi | checking up on openmetal transfer rates, i got 82.1 MB/s pulling 1gb-test.dat from the mirror there to /dev/null on ze12 just now, so doesn't seem like the problem is recurring at the moment | 14:44 |
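A minimal sketch of an equivalent throughput check, assuming the `requests` library; the mirror URL below is a placeholder, not the real mirror address:

```python
import time

import requests

# Rough transfer-rate check in the same spirit as the manual test above:
# stream the test file, discard the bytes, and report MB/s.
URL = "https://mirror.example.opendev.org/1gb-test.dat"  # placeholder URL

start = time.monotonic()
total = 0
with requests.get(URL, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):
        total += len(chunk)  # count the bytes; the data itself is discarded
elapsed = time.monotonic() - start
print(f"{total / elapsed / 1e6:.1f} MB/s over {elapsed:.1f}s")
```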
| fungi | need to disappear for a bit to run more errands, back after lunch | 16:04 |
| clarkb | seems quiet this morning. I'm going to start working through my todo list on the gerrit 3.11 upgrade etherpad | 16:08 |
| clarkb | I need to grab the gerrit war out of our 3.11 container images and I wish there was an easy way to do something like docker cp out of an image rather than a container. There probably is, but I don't know it | 16:20 |
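One common way to do this, sketched below with the docker-py SDK: create a stopped container from the image, archive the file out of it, and remove the container. The image tag and the path to the war inside the image are assumptions, not taken from the log:

```python
import io
import tarfile

import docker  # docker-py

client = docker.from_env()
# Pull (if needed) and create a container without ever starting it.
client.images.pull("quay.io/opendevorg/gerrit", tag="3.11")  # hypothetical tag
container = client.containers.create("quay.io/opendevorg/gerrit:3.11")
try:
    # get_archive streams a tar of the requested path out of the container.
    stream, _stat = container.get_archive("/var/gerrit/bin/gerrit.war")  # hypothetical path
    archive = io.BytesIO(b"".join(stream))
    with tarfile.open(fileobj=archive) as tar:
        tar.extractall(path=".")  # leaves gerrit.war in the current directory
finally:
    container.remove()
```

The plain docker CLI equivalent is the same idea: create a container from the image, `docker cp` out of it, then remove the container.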
| clarkb | hrm looks like my gerrit 3.11 held node is unhappy. I guess I need to look into that | 16:38 |
| clarkb | the server rebooted 5 minutes ago | 16:39 |
| clarkb | oh! the server at that ip address is a new different instance running some python unittests | 16:39 |
| clarkb | interesting | 16:39 |
| clarkb | so it didn't really reboot so much as get replaced with a new boot | 16:40 |
| clarkb | my autohold, 0000000256, is still in place | 16:40 |
| clarkb | (note I checked that on zuul02 running the cli client which caused it to pull the image from the new location on quay) | 16:41 |
| clarkb | `Unable to find image 'quay.io/zuul-ci/zuul-client:latest' locally` then it automatically pulled it down | 16:41 |
| clarkb | corvus: ^ I don't yet know if this is a bug in zuul-launcher or not, but the autohold is still in place and the web ui nodes list shows the nodes are still there. I'm going to see now what the cloud provider says | 16:42 |
| clarkb | corvus: the cloud provider does not have the instance anymore (and the instance that was reusing the floating ip is also gone at this point). Still don't know if the launcher or cloud cleaned it up | 16:44 |
| clarkb | 6d8b42238b674d1db580d2afdf777670 is the zuul id for the instance I think (it is the noble node which should be the gerrit node in the two node nodeset) | 16:45 |
| clarkb | oh wait my earlier listing was in the wrong project let me double check the cloud side info | 16:45 |
| clarkb | ok the server is still in the cloud but it no longer has the floating ip attached to it | 16:46 |
| clarkb | np6d8b42238b674 735cd878-e101-441d-a62b-5b67469a7d3b in rax flex dfw3 | 16:47 |
| clarkb | 174.143.59.58 is the old fip value. So now I think I need to see what logs, if any, we may have around that ip being deleted | 16:47 |
| clarkb | I want to say nova has some way of giving an accounting of things like live migration events; I wonder if there is such a thing for floating ip attach and detach. | 16:49 |
| clarkb | `server event list 735cd878-e101-441d-a62b-5b67469a7d3b` only has the event for server create in it | 16:50 |
| clarkb | the other node in the held nodeset also lost its fip if it becomes useful to look at that one | 16:51 |
| clarkb | the ip address is nodescanned by zl02 at 2025-11-13 23:51:09,407 for the held node. Then we don't scan the ip address again until 2025-11-19 22:04:55,916, also on zl02. The ip address only shows up in those nodescan log entries, so I'm not sure if the launcher deleted it for some reason or the cloud did | 16:57 |
| clarkb | theory: this floating ip got deleted by our leaked floating ip cleanup routine. I have no evidence to support this yet | 17:00 |
| clarkb | corvus: we call https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 from zuul to delete unattached floating ips here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L1071-L1082 There are a bunch of caveats listed in the openstacksdk method but I think we're following them as we create | 17:07 |
| clarkb | floating ips by setting the server value during server creation. My hunch here is that this method is not actually safe to use | 17:07 |
| clarkb | maybe it is neutron backend dependent or something | 17:07 |
| clarkb | anyway I can drop the provider floating ip cleanup flag from rax flex regions to disable that and then we can monitor to see if we leak fips | 17:08 |
| clarkb | unfortunately I don't see any logging around this in the zuul launcher code so I'm not sure we'll be able to find much more historical evidence other than the fact that we do try to periodically clean up floating ips and something may have gone wrong there | 17:08 |
| clarkb | another approach would be to try and add logging to that and then leave the provider flag in place to clean things up and see if we can reproduce it | 17:09 |
| clarkb | I will need to recycle my held node to do more testing but ideally we would have some plan for ^ beforehand just in case the held nodes land in rax flex again | 17:10 |
| clarkb | https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/providers.yaml#L136 and just to double check we do have the flag set | 17:13 |
| clarkb | it looks like that cleanup routine is called every time we list servers and it took 6 days to occur. Seems like this is a really infrequent corner case but an annoying one if hit | 17:16 |
| clarkb | as an alternative I can manually attach a new floating ip to my server and see if that works | 17:16 |
| clarkb | (rather than recycling the holds) | 17:17 |
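A hedged sketch of that manual re-attach, assuming openstacksdk's cloud-layer helpers; the cloud name, region, and server name below are placeholders:

```python
import openstack

# Connect to the provider; cloud/region names here are placeholders.
cloud = openstack.connect(cloud="raxflex", region_name="DFW3")
server = cloud.get_server("np6d8b42238b674")  # the held node

# Passing server= allocates a floating IP and attaches it to the server.
fip = cloud.create_floating_ip(server=server, wait=True)
print(fip["floating_ip_address"])
```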
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Stop automated floating ip clean up rax flex https://review.opendev.org/c/opendev/zuul-providers/+/967884 | 17:37 |
| clarkb | Consider that an RFC. I'm open to other ideas for debugging and dealing with this | 17:37 |
| corvus | ...catching up... | 17:42 |
| corvus | clarkb: do you have an estimated time range for when it may have happened? from your logs above i see 11-13 through 11-19... any narrower than that? | 17:46 |
| clarkb | corvus: not really unfortunately. I was hoping the nova server event list would tell us but it only has the server creation event in it | 17:48 |
| fungi | 13-19 would span a restart, i guess that's the biggest thing we'd want to narrow the window to rule out | 17:48 |
| clarkb | corvus: judging by the frequency of ip address reuse I would guess it happened not too long before the ip gets used again | 17:48 |
| clarkb | (we're recycling ips frequently enough that a few hours at most seems likely) | 17:48 |
| corvus | we also cleanup leaked ports, and i was looking to see if perhaps that happened which may have then prompted the ip cleanup, but in the logs, i only see port cleanups for iad3 | 17:48 |
| corvus | and i think you said it was dfw3 | 17:49 |
| clarkb | correct this is dfw3 | 17:49 |
| corvus | /var/log/zuul/launcher-debug.log.1:2025-11-19 20:13:58,584 DEBUG zuul.openstack.raxflex-IAD3: Removed DOWN port dfa55cb4-eb97-4461-bc74-0cc3ed577ebd in raxflex-IAD3 (example log message; not relevant) | 17:49 |
| corvus | clarkb: i'd be open to vendoring https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 into zuul with debug log lines | 17:50 |
| clarkb | corvus: ya that might be a good approach. Though potentially super verbose? Getting the logging right might be tricky but I agree pulling the magic out of the sdk where we can log specific things is probably a good next step for debugging | 17:51 |
| corvus | i think just logging when we delete an ip would be a start; that shouldn't be verbose because this shouldn't happen that often | 17:52 |
| corvus | like, we delete maybe 50 leaked ports per day | 17:52 |
| corvus | i wouldn't expect fip leaks to be much different in magnitude | 17:53 |
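A minimal sketch of what the vendored routine with a per-deletion debug line might look like, assuming openstacksdk's cloud-layer `list_floating_ips()`/`delete_floating_ip()` calls and treating a floating IP with no bound port as leaked; this is not the actual Zuul change, and the real sdk helper carries extra caveats (nova-network vs neutron, racing against in-progress attachments):

```python
import logging

log = logging.getLogger("zuul.openstack")


def delete_unattached_floating_ips(cloud):
    """Delete floating IPs with no attached port, logging each deletion."""
    deleted = 0
    for fip in cloud.list_floating_ips():
        # A neutron floating IP attached to a server has a port_id; anything
        # without one is treated as leaked in this sketch.
        if not fip.get("port_id"):
            log.debug("Deleting unattached floating ip %s (%s)",
                      fip["id"], fip["floating_ip_address"])
            if cloud.delete_floating_ip(fip["id"]):
                deleted += 1
    return deleted
```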
| clarkb | ++ | 17:53 |
| clarkb | corvus: in that case should I WIP my zuul-providers change so that we don't remove the condition under which we would experience this in the first place? | 17:53 |
| corvus | yeah, i think my preference would be to do that instead of turning off the flag: mostly because we can at least get more debugging and then try to correlate it with other things (maybe a zuul bug, maybe an openstack bug). and this error seems to be super rare, if annoying, so we can probably stand to allow it to happen again | 17:54 |
| corvus | also, i'm pretty sure we'd see actual fip leaks more anyway and we'd have to clean them up, so this is probably ultimately less work for us. :) | 17:55 |
| clarkb | ok the change is WIP now. I think we can/should vendor the code and add some debug logging. Then I can manually attach new floating IPs to my two held nodes and they can be our canaries | 17:56 |
| clarkb | corvus: I can work on the launcher change as well unless you want to do that | 17:56 |
| corvus | clarkb: feel free and i can review it; i have a few things in progress locally right now and also an errand to run :) | 17:58 |
| clarkb | on it | 17:58 |
| clarkb | corvus: remote: https://review.opendev.org/c/zuul/zuul/+/967891 Vendor openstacksdk delete_unattached_floating_ips | 18:20 |
| clarkb | I don't have a test case for it as I'm half hoping the existing tests we have will cover it well enough but I'm not positive of that. I did manually do some repl stuff on bridge to confirm that ip['floating_ip_address'] is a valid attribute lookup and that it returns the correct data | 18:21 |
| corvus | clarkb: yeah, there is a resource cleanup test case that covers that, so we should have code coverage | 19:29 |
| clarkb | corvus: looks like the tests failed because I need to mock or fake out some things | 20:44 |
| corvus | clarkb: see! there is test coverage! :) i think you just need to implement _use_neutron_floating(); it can probably just return True. goes in "tests/fake_openstack.py" -- right next to delete_unattached_floating_ips :) | 20:49 |
| corvus | because list and delete methods already exist | 20:49 |
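A minimal sketch of the fake being suggested; the class name below is a stand-in for whatever the fake cloud class in tests/fake_openstack.py is actually called:

```python
class FakeCloud:
    """Stand-in for the fake cloud object in tests/fake_openstack.py."""

    def _use_neutron_floating(self):
        # Always behave as if neutron-style floating IPs are in use so the
        # cleanup path gets exercised; the actual change made this conditional.
        return True
```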
| clarkb | yup just pushed. and I agree that appears to be the only thing that needs to be faked | 20:50 |
| clarkb | I ended up making it conditional based on a very similar preexisting conditional that I think better captures the need vs always returning True | 20:50 |
| clarkb | we shall see (I don't have my local testing set up right now and I'm hoping to pop out shortly for a bike ride otherwise I would run that test locally) | 20:51 |
| corvus | ++ | 20:52 |
| clarkb | also I realized I can just manually attach an fip to the held node whenever and if it clears out we can reattach it again manually. So when I get back from my bike ride I will probably do that and continue with my gerrit testing | 20:52 |
| corvus | clarkb: i think you'll have to do that between cleanup intervals for the launcher, so be fast :) | 20:53 |
| clarkb | corvus: maybe? the node was held for ~6 days before the ip got reused and since my discovery of the problem I think that ip has been reused at least twice | 20:54 |
| clarkb | I suspect there is some other condition on the backend that we have to trip over to break things, but I don't know; as mentioned before we lack timestamps, so this is a lot of guessing | 20:54 |
| corvus | no i mean, if you create a fip that's unattached, that cleanup method will delete it, so you need to create and attach while it isn't running | 20:55 |
| clarkb | oh ha yes | 20:55 |
| clarkb | one thing I've noticed on the gerrit upgrade prep which may or may not be concerning is that a number of these potentially concerning behavior changes actually come into gerrit via changes to older stable branches, then get forward ported, and then captured as a more scary thing in the next proper release's release notes, so for a lot of stuff we're already running the code and it is | 20:56 |
| clarkb | apparently a non issue for us | 20:56 |
| clarkb | I think some of these do get captured by the point release release docs for older releases | 20:56 |
| clarkb | but they are maybe easier to skim over in that context or maybe I'm just not taking good enough notes to remember something was investigated and determined to be a non issue | 20:57 |
| clarkb | the sshd timeout fix is one example of this | 20:57 |
| clarkb | it was captured in 3.10.something release notes | 20:57 |
| clarkb | which, by the way, reduced our sshd idletimeout from an unexpected 3600000 seconds to the configured 3600 seconds, so not an issue for us because our idletimeout value was set without knowledge of the bug | 20:59 |
| clarkb | I also remembered to check the static status after lunch and it looks like load remains reasonable and we're running well below the new limit. So ya I suspect the crawling is continuing to get worse over time and we just have to crank up the volume in response | 21:02 |
| clarkb | ok popping out now | 21:03 |
| corvus | clarkb: there were some un-exercised parts of the fakes that needed updating. i made a new change and put it under yours: https://review.opendev.org/c/zuul/zuul/+/967934 Increase fake openstack fidelity [NEW] | 22:02 |
| opendevreview | Merged zuul/zuul-jobs master: Fix defaults for upload-image-swift and -s3 https://review.opendev.org/c/zuul/zuul-jobs/+/962238 | 22:27 |
| clarkb | corvus: thanks, I'm taking a look now | 23:21 |
| clarkb | and it lgtm. Not sure if we want to go ahead and approve that since it is a test only improvement? I'll let you decide on that | 23:24 |
| clarkb | fungi: I may have possibly found a regression with gerrit 3.11 but it is hard to say. It has to do with how repo browse links are rendered in the repo page on gerrit. Not urgent but if you get a chance can you review line 102 on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 and see if you know why this may be happening as you've done gerrit gitweb/gitea link stuff in the past | 23:31 |
| clarkb | if it is a regression I think it is one we can live with. But it may also simply be missing configuration in our test setup? | 23:31 |
| corvus | clarkb: yeah, i think so. and we can +w your change tomorrow | 23:31 |
| clarkb | corvus: sounds good. I've got a floating ip attached to the node again so we should have a new canary in place | 23:32 |
| clarkb | 50.56.159.76 this ip specifically | 23:33 |
| clarkb | Ramereth[m]: not sure if you saw the other day but we think we're experiencing the poor io performance issues with the arm nodes again (specifically when building new images). I'm happy to dig up logs or run testing that you think may be relevant; if there is any way we can help debug that, just let us know | 23:44 |