| dmsimard[m] | I must once again run, but I also dropped some thoughts in the ara-for-databases pad, thanks for getting it started | 00:52 |
|---|---|---|
| *** | liuxie is now known as liushy | 06:10 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 06:58 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:06 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:23 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:38 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:39 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 08:03 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 08:05 |
| *** | sfinucan is now known as stephenfin | 10:16 |
| *** | ykarel_ is now known as ykarel | 13:30 |
| babico | hello | 14:14 |
| fungi | hi, need something? | 14:30 |
| fungi | checking up on openmetal transfer rates, i got 82.1 MB/s pulling 1gb-test.dat from the mirror there to /dev/null on ze12 just now, so doesn't seem like the problem is recurring at the moment | 14:44 |
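A minimal sketch of an equivalent throughput check, assuming the `requests` library; the mirror URL below is a placeholder, not the real mirror address:

```python
import time

import requests

# Rough transfer-rate check in the same spirit as the manual test above:
# stream the test file, discard the bytes, and report MB/s.
URL = "https://mirror.example.opendev.org/1gb-test.dat"  # placeholder URL

start = time.monotonic()
total = 0
with requests.get(URL, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):
        total += len(chunk)  # count the bytes; the data itself is discarded
elapsed = time.monotonic() - start
print(f"{total / elapsed / 1e6:.1f} MB/s over {elapsed:.1f}s")
```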
| fungi | need to disappear for a bit to run more errands, back after lunch | 16:04 |
| clarkb | seems quiet this morning. I'm going to start working through my todo list on the gerrit 3.11 upgrade etherpad | 16:08 |
| clarkb | I need to grab the gerrit war out of our 3.11 container images and I wish there was an easy way to do something like docker cp out of an image rather than a container. There probably is, but I don't know it | 16:20 |
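One common way to do this, sketched below with the docker-py SDK: create a stopped container from the image, archive the file out of it, and remove the container. The image tag and the path to the war inside the image are assumptions, not taken from the log:

```python
import io
import tarfile

import docker  # docker-py

client = docker.from_env()
# Pull (if needed) and create a container without ever starting it.
client.images.pull("quay.io/opendevorg/gerrit", tag="3.11")  # hypothetical tag
container = client.containers.create("quay.io/opendevorg/gerrit:3.11")
try:
    # get_archive streams a tar of the requested path out of the container.
    stream, _stat = container.get_archive("/var/gerrit/bin/gerrit.war")  # hypothetical path
    archive = io.BytesIO(b"".join(stream))
    with tarfile.open(fileobj=archive) as tar:
        tar.extractall(path=".")  # leaves gerrit.war in the current directory
finally:
    container.remove()
```

The plain docker CLI equivalent is the same idea: create a container from the image, `docker cp` out of it, then remove the container.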
| clarkb | hrm looks like my gerrit 3.11 held node is unhappy. I guess I need to look into that | 16:38 |
| clarkb | the server rebooted 5 minutes ago | 16:39 |
| clarkb | oh! the server at that ip address is a new different instance running some python unittests | 16:39 |
| clarkb | interesting | 16:39 |
| clarkb | so it didn't really reboot so much as get replaced with a new boot | 16:40 |
| clarkb | my autohold, 0000000256, is still in place | 16:40 |
| clarkb | (note I checked that on zuul02 running the cli client which caused it to pull the image from the new location on quay) | 16:41 |
| clarkb | `Unable to find image 'quay.io/zuul-ci/zuul-client:latest' locally` then it automatically pulled it down | 16:41 |
| clarkb | corvus: ^ I don't yet know if this is a bug in zuul-launcher or not, but the autohold is still in place and the web ui nodes list shows the nodes are still there. I'm going to see now what the cloud provider says | 16:42 |
| clarkb | corvus: the cloud provider does not have the instance anymore (and the instance that was reusing the floating ip is also gone at this point). Still don't know if the launcher or cloud cleaned it up | 16:44 |
| clarkb | 6d8b42238b674d1db580d2afdf777670 is the zuul id for the instance I think (it is the noble node which should be the gerrit node in the two node nodeset) | 16:45 |
| clarkb | oh wait my earlier listing was in the wrong project let me double check the cloud side info | 16:45 |
| clarkb | ok the server is still in the cloud but it no longer has the floating ip attached to it | 16:46 |
| clarkb | np6d8b42238b674 735cd878-e101-441d-a62b-5b67469a7d3b in rax flex dfw3 | 16:47 |
| clarkb | 174.143.59.58 is the old fip value. So now I think I need to see what logs, if any, we may have around that ip being deleted | 16:47 |
| clarkb | I want to say nova has some way of giving an accounting of things like live migration events; I wonder if there is such a thing for floating ip attach and detach. | 16:49 |
| clarkb | `server event list 735cd878-e101-441d-a62b-5b67469a7d3b` only has the event for server create in it | 16:50 |
| clarkb | the other node in the held nodeset also lost its fip if it becomes useful to look at that one | 16:51 |
| clarkb | the ip address is nodescanned by zl02 at 2025-11-13 23:51:09,407 for the held node. Then we don't scan the ip address again until 2025-11-19 22:04:55,916, also on zl02. The ip address only shows up in those nodescan log entries, so I'm not sure if the launcher deleted it for some reason or the cloud did | 16:57 |
| clarkb | theory: this floating ip got deleted by our leaked floating ip cleanup routine. I have no evidence to support this yet | 17:00 |
| clarkb | corvus: we call https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 from zuul to delete unattached floating ips here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L1071-L1082 There are a bunch of caveats listed in the openstacksdk method but I think we're following them as we create | 17:07 |
| clarkb | floating ips by setting the server value during server creation. My hunch here is that this method is not actually safe to use | 17:07 |
| clarkb | maybe it is neutron backend dependent or something | 17:07 |
| clarkb | anyway I can drop the provider floating ip cleanup flag from rax flex regions to disable that and then we can monitor to see if we leak fips | 17:08 |
| clarkb | unfortunately I don't see any logging around this in the zuul launcher code so I'm not sure we'll be able to find much more historical evidence other than the fact that we do try to periodically clean up floating ips and something may have gone wrong there | 17:08 |
| clarkb | another approach would be to try and add logging to that and then leave the provider flag in place to clean things up and see if we can reproduce it | 17:09 |
| clarkb | I will need to recycle my held node to do more testing but ideally we would have some plan for ^ beforehand just in case the held nodes land in rax flex again | 17:10 |
| clarkb | https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/providers.yaml#L136 and just to double check we do have the flag set | 17:13 |
| clarkb | it looks like that cleanup routine is called every time we list servers and it took 6 days to occur. Seems like this is a really infrequent corner case but an annoying one if hit | 17:16 |
| clarkb | as an alternative I can manually attach a new floating ip to my server and see if that works | 17:16 |
| clarkb | (rather than recycling the holds) | 17:17 |
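A hedged sketch of that manual re-attach, assuming openstacksdk's cloud-layer helpers; the cloud name, region, and server name below are placeholders:

```python
import openstack

# Connect to the provider; cloud/region names here are placeholders.
cloud = openstack.connect(cloud="raxflex", region_name="DFW3")
server = cloud.get_server("np6d8b42238b674")  # the held node

# Passing server= allocates a floating IP and attaches it to the server.
fip = cloud.create_floating_ip(server=server, wait=True)
print(fip["floating_ip_address"])
```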
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Stop automated floating ip clean up rax flex https://review.opendev.org/c/opendev/zuul-providers/+/967884 | 17:37 |
| clarkb | Consider that an RFC. I'm open to other ideas for debugging and dealing with this | 17:37 |
| corvus | ...catching up... | 17:42 |
| corvus | clarkb: do you have an estimated time range for when it may have happened? from your logs above i see 11-13 through 11-19... any narrower than that? | 17:46 |
| clarkb | corvus: not really unfortunately. I was hoping the nova server event list would tell us but it only has the server creation event in it | 17:48 |
| fungi | 13-19 would span a restart, i guess that's the biggest thing we'd want to narrow the window to rule out | 17:48 |
| clarkb | corvus: judging by the frequency of ip address reuse I would guess it happened not too long before the ip gets used again | 17:48 |
| clarkb | (we're recycling ips frequently enough that a few hours at most seems likely) | 17:48 |
| corvus | we also cleanup leaked ports, and i was looking to see if perhaps that happened which may have then prompted the ip cleanup, but in the logs, i only see port cleanups for iad3 | 17:48 |
| corvus | and i think you said it was dfw3 | 17:49 |
| clarkb | correct this is dfw3 | 17:49 |
| corvus | /var/log/zuul/launcher-debug.log.1:2025-11-19 20:13:58,584 DEBUG zuul.openstack.raxflex-IAD3: Removed DOWN port dfa55cb4-eb97-4461-bc74-0cc3ed577ebd in raxflex-IAD3 (example log message; not relevant) | 17:49 |
| corvus | clarkb: i'd be open to vendoring https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L971-L1001 into zuul with debug log lines | 17:50 |
| clarkb | corvus: ya that might be a good approach. Though potentially super verbose? Getting the logging right might be tricky but I agree pulling the magic out of the sdk where we can log specific things is probably a good next step for debugging | 17:51 |
| corvus | i think just logging when we delete an ip would be a start; that shouldn't be verbose because this shouldn't happen that often | 17:52 |
| corvus | like, we delete maybe 50 leaked ports per day | 17:52 |
| corvus | i wouldn't expect fip leaks to be much different in magnitude | 17:53 |
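A minimal sketch of what the vendored routine with a per-deletion debug line might look like, assuming openstacksdk's cloud-layer `list_floating_ips()`/`delete_floating_ip()` calls and treating a floating IP with no bound port as leaked; this is not the actual Zuul change, and the real sdk helper carries extra caveats (nova-network vs neutron, racing against in-progress attachments):

```python
import logging

log = logging.getLogger("zuul.openstack")


def delete_unattached_floating_ips(cloud):
    """Delete floating IPs with no attached port, logging each deletion."""
    deleted = 0
    for fip in cloud.list_floating_ips():
        # A neutron floating IP attached to a server has a port_id; anything
        # without one is treated as leaked in this sketch.
        if not fip.get("port_id"):
            log.debug("Deleting unattached floating ip %s (%s)",
                      fip["id"], fip["floating_ip_address"])
            if cloud.delete_floating_ip(fip["id"]):
                deleted += 1
    return deleted
```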
| clarkb | ++ | 17:53 |
| clarkb | corvus: in that case should I WIP my zuul-providers change so that we don't remove the condition under which we would experience this in the first place? | 17:53 |
| corvus | yeah, i think my preference would be to do that instead of turning off the flag: mostly because we can at least get more debugging and then try to correlate it with other things (maybe a zuul bug, maybe an openstack bug). and this error seems to be super rare, if annoying, so we can probably stand to allow it to happen again | 17:54 |
| corvus | also, i'm pretty sure we'd see actual fip leaks more anyway and we'd have to clean them up, so this is probably ultimately less work for us. :) | 17:55 |
| clarkb | ok the change is WIP now. I think we can/should vendor the code and add some debug logging. Then I can manually attach new floating IPs to my two held nodes and they can be our canaries | 17:56 |
| clarkb | corvus: I can work on the launcher change as well unless you want to do that | 17:56 |
| corvus | clarkb: feel free and i can review it; i have a few things in progress locally right now and also an errand to run :) | 17:58 |
| clarkb | on it | 17:58 |
| clarkb | corvus: remote: https://review.opendev.org/c/zuul/zuul/+/967891 Vendor openstacksdk delete_unattached_floating_ips | 18:20 |
| clarkb | I don't have a test case for it as I'm half hoping the existing tests we have will cover it well enough but I'm not positive of that. I did manually do some repl stuff on bridge to confirm that ip['floating_ip_address'] is a valid attribute lookup and that it returns the correct data | 18:21 |
| corvus | clarkb: yeah, there is a resource cleanup test case that covers that, so we should have code coverage | 19:29 |
| clarkb | corvus: looks like the tests failed because I need to mock or fake out some things | 20:44 |
| corvus | clarkb: see! there is test coverage! :) i think you just need to implement _use_neutron_floating(); it can probably just return True. goes in "tests/fake_openstack.py" -- right next to delete_unattached_floating_ips :) | 20:49 |
| corvus | because list and delete methods already exist | 20:49 |
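A minimal sketch of the fake being suggested; the class name below is a stand-in for whatever the fake cloud class in tests/fake_openstack.py is actually called:

```python
class FakeCloud:
    """Stand-in for the fake cloud object in tests/fake_openstack.py."""

    def _use_neutron_floating(self):
        # Always behave as if neutron-style floating IPs are in use so the
        # cleanup path gets exercised; the actual change made this conditional.
        return True
```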
| clarkb | yup just pushed. and I agree that appears to be the only thing that needs to be faked | 20:50 |
| clarkb | I ended up making it conditional based on a very similar preexisting conditional that I think better captures the need vs always returning True | 20:50 |
| clarkb | we shall see (I don't have my local testing set up right now and I'm hoping to pop out shortly for a bike ride otherwise I would run that test locally) | 20:51 |
| corvus | ++ | 20:52 |
| clarkb | also I realized I can just manually attach an fip to the held node whenever and if it clears out we can reattach it again manually. So when I get back from my bike ride I will probably do that and continue with my gerrit testing | 20:52 |
| corvus | clarkb: i think you'll have to do that between cleanup intervals for the launcher, so be fast :) | 20:53 |
| clarkb | corvus: maybe? the node was held for ~6 days before the ip got reused and since my discovery of the problem I think that ip has been reused at least twice | 20:54 |
| clarkb | I suspect there is some other condition on the backend that we have to trip over to break things, but I don't know; as mentioned before we lack timestamps, so this is a lot of guessing | 20:54 |
| corvus | no i mean, if you create a fip that's unattached, that cleanup method will delete it, so you need to create and attach while it isn't running | 20:55 |
| clarkb | oh ha yes | 20:55 |
| clarkb | one thing I've noticed on the gerrit upgrade prep which may or may not be concerning is that a number of these potentially concerning behavior changes actually come into gerrit via changes to older stable branches, then get forward ported, and then captured as a more scary thing in the next proper release's release notes, so for a lot of stuff we're already running the code and it is | 20:56 |
| clarkb | apparently a non issue for us | 20:56 |
| clarkb | I think some of these do get captured by the point release release docs for older releases | 20:56 |
| clarkb | but they are maybe easier to skim over in that context or maybe I'm just not taking good enough notes to remember something was investigated and determined to be a non issue | 20:57 |
| clarkb | the sshd timeout fix is one example of this | 20:57 |
| clarkb | it was captured in 3.10.something release notes | 20:57 |
| clarkb | which, by the way, reduced our sshd idletimeout from an unexpected 3600000 seconds to the configured 3600 seconds, so not an issue for us because our idletimeout value was set without knowledge of the bug | 20:59 |
| clarkb | I also remembered to check the static status after lunch and it looks like load remains reasonable and we're running well below the new limit. So ya I suspect the crawling is continuing to get worse over time and we just have to crank up the volume in response | 21:02 |
| clarkb | ok popping out now | 21:03 |
| corvus | clarkb: there were some un-exercised parts of the fakes that needed updating. i made a new change and put it under yours: https://review.opendev.org/c/zuul/zuul/+/967934 Increase fake openstack fidelity [NEW] | 22:02 |
| opendevreview | Merged zuul/zuul-jobs master: Fix defaults for upload-image-swift and -s3 https://review.opendev.org/c/zuul/zuul-jobs/+/962238 | 22:27 |
| clarkb | corvus: thanks, I'm taking a look now | 23:21 |
| clarkb | and it lgtm. Not sure if we want to go ahead and approve that since it is a test only improvement? I'll let you decide on that | 23:24 |
| clarkb | fungi: I may have possibly found a regression with gerrit 3.11 but it is hard to say. It has to do with how repo browse links are rendered in the repo page on gerrit. Not urgent but if you get a chance can you review line 102 on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 and see if you know why this may be happening as you've done gerrit gitweb/gitea link stuff in the past | 23:31 |
| clarkb | if it is a regression I think it is one we can live with. But it may also simply be missing configuration in our test setup? | 23:31 |
| corvus | clarkb: yeah, i think so. and we can +w your change tomorrow | 23:31 |
| clarkb | corvus: sounds good. I've got a floating ip attached to the node again so we should have a new canary in place | 23:32 |
| clarkb | 50.56.159.76 this ip specifically | 23:33 |
| clarkb | Ramereth[m]: not sure if you saw the other day but we think we're experiencing the poor io performance issues with the arm nodes again (specifically when building new images). I'm happy to dig up logs or run testing that you think may be relevant; if there is any way we can help debug that, just let us know | 23:44 |