Monday, 2025-10-27

ykarel: Hi, is there a known issue in stable jobs where keystone fails to start with "failed to open python file /opt/stack/data/venv/bin/keystone-wsgi-public"? Seeing failures across stable branches like stable/2024.1 (non-grenade), stable/2024.2 (non-grenade), stable/2025.1 (grenade), stable/2025.2 (grenade skip-level), failing since Saturday  06:55
ykarel: https://zuul.openstack.org/builds?job_name=tempest-full-2024-2&job_name=tempest-full-2024-1&skip=0  06:57
ykarel: looks related to the new pip release, pip-25.3  07:36
mnasiadka: We’ll be lucky if that’s the only breakage from pip ;-)  07:39
ykarel: yes, locally I can confirm it's specific to pip-25.3  07:48
tonyb: pfft. that's annoying.  07:55
tonyb: I can't help debug right now, but I can probably help land stuff.  08:01
*** ralonsoh_ is now known as ralonsoh  08:04
ykarel: Reported https://bugs.launchpad.net/devstack/+bug/2129913 to track this  08:54
mnasiadka: Had a NODE_FAILURE in kolla-ansible recently - NODE_FAILURE Node(set) request 8888d843ceca43e5bf8ad026a550547e failed in 0s (I’ll leave that here if that’s something you’d like to check)  09:21
tonyb: mnasiadka: I'll check it out tomorrow if no one else has  11:33
*** dhill is now known as Guest30011  13:13
fungi: looks like today is the only time this week i'll be able to go anywhere for lunch, so taking that opportunity now. back soon  14:51
clarkb: ykarel: based on the notes fungi posted after the pip release I'm going to assume it's related to editable installation not working as expected. Devstack uses editable installs  14:52
clarkb: ykarel: it's probably worth moving this discussion to #openstack-qa since the changes will almost certainly occur within devstack  14:53
ykarel: clarkb, yes it's specific to editable installs  14:53
ykarel: ack, reported it in openstack-qa  14:54
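A minimal reproduction sketch of the editable-install theory above, assuming the DevStack venv path from the original report and a conventional /opt/stack/keystone checkout (the exact DevStack invocation is not shown in the log):

    python3 -m venv /opt/stack/data/venv
    /opt/stack/data/venv/bin/pip install 'pip==25.3'
    # DevStack installs services as editable packages, roughly like this:
    /opt/stack/data/venv/bin/pip install -e /opt/stack/keystone
    # per the bug report, the generated wsgi console script then fails to open:
    ls -l /opt/stack/data/venv/bin/keystone-wsgi-public
    # hedged local workaround: pin pip below 25.3 and reinstall the editable package
    /opt/stack/data/venv/bin/pip install 'pip<25.3'
    /opt/stack/data/venv/bin/pip install --force-reinstall -e /opt/stack/keystone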
clarkb: mnasiadka: corvus: on zl02 I see exceptions attempting to load the zkobject belonging to nodes associated with request 8888d843ceca43e5bf8ad026a550547e. Looks like zl01 tried to launch in one provider too  14:57
clarkb: the exceptions appear to be around trying to use unassigned nodes. I wonder if there is a bug around that? Or maybe something to do with the weekend updates making old unassigned nodes unusable now? (these are just hunches; I don't have concrete data beyond the zkobject load error after trying to use unassigned nodes)  14:57
corvus: i can take a look later  15:17
clarkb: on further inspection we just log an error with no traceback. May need to check the zk db directly?  15:17
clarkb: (or at least I have had trouble finding a traceback using grep -A and -B)  15:17
corvus: clarkb: the "unable to load" lines are not necessarily errors... they are usually benign, but we log them as errors in case they happen to be an error in that circumstance... i expect we'll get rid of them someday, but we need them there for now...  15:27
corvus: basically, there are a number of places where it's normal for certain zkobjects not to exist  15:27
clarkb: got it  15:28
corvus: but if, say, we're debugging something and we expect it to exist and it didn't, that's when we need that log msg.  15:28
corvus: so what looks weird here is this:  15:28
corvus: https://paste.opendev.org/show/bnjKjwHw2S4XCC8zrgVq/  15:28
corvus: we went straight from "everything is great" to "everything is not great"  15:29
corvus: oh wait, i see the node is failed in the first line  15:29
clarkb: ya, and that seems to happen immediately after the zkobject load error, which is why I associated them  15:29
clarkb: oh, maybe the unassigned node is failed and we don't filter for good unassigned nodes?  15:29
corvus: yeah, i think that's the case... i'm lining up timestamps  15:30
corvus: no, that's not the unassigned node id  15:33
corvus: let's ignore the unassigned node messages for a moment; if we do, then the story of that request is that it encountered 3 node launch failures and reported node_failure -- that seems fairly normal to me.  15:34
corvus: the only mystery i see is why it was toying with the idea of assigning a different unassigned node  15:35
clarkb: if I grep for 2d132651c4e24fa6aa29291dad0bb8ba (which is the zl02 sjc3 launch node uuid) I don't see it really do anything until what I noticed. But maybe we're not logging the uuid in places I would expect it to, so my grep doesn't show the whole story  15:37
clarkb: actually I guess we don't log where the requested node is being requested?  15:38
corvus: clarkb: it was built on zl01, which got this error: 2025-10-27 08:39:40,298 ERROR zuul.Launcher: [e: df0531d7312a49adbfeebae20f7a8e40] [req: 8888d843ceca43e5bf8ad026a550547e] Timeout creating node <OpenstackProviderNode uuid=2d132651c4e24fa6aa29291dad0bb8ba, label=rockylinux-10-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-sjc3-main>  15:39
clarkb: oh interesting, so zl02 can log that the node is requested even though it isn't responsible for servicing that request  15:40
clarkb: I'll have to keep that in mind for future debugging  15:40
corvus: okay, the weird thing is that it chose to use an unassigned node, but then it switched back to using the failed node  15:41
corvus: i think i see the bug  15:41
corvus: clarkb: remote: https://review.opendev.org/c/zuul/zuul/+/964893 WIP Fix missing update in node reassignment [NEW]  15:43
corvus: yay python  15:43
corvus: that should be the fix, but i want to see if i can get test coverage  15:43
clarkb: thanks, I'll look momentarily  15:44
clarkb: aha, ya, we're setting the node on each iteration through that loop.  15:47
clarkb: though that seems to only happen if the unassigned node is None, but in this case it seems like we do have an unassigned node?  15:47
clarkb: oh I see, the node is part of the request.updateProviderNode() call, then we use requested_nodes for accounting afterwards, so they need to be in sync  15:54
clarkb: infra-root: I do plan on hosting a meeting tomorrow since it has been a bit since we had one (I realize with the PTG this is maybe annoying/difficult, but there are a few topics I want to call out)  16:00
corvus: yep, basically we updated zookeeper but not our in-memory scratch copy  16:01
clarkb: For example the unexpected Gerrit shutdown, which according to nova folks at the summit could be due to memory pressure (we wouldn't get an error). But also the idea of preventing direct gitea backend access  16:01
opendevreview: Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7  https://review.opendev.org/c/opendev/system-config/+/964899  16:04
clarkb: also, if ^ looks clean we should probably plan on upgrading  16:04
fungi: okay, back  16:14
corvus: clarkb: mnasiadka: this should take care of that issue, thanks! remote: https://review.opendev.org/c/zuul/zuul/+/964893 Fix missing update in node reassignment  16:27
corvus: incidentally, that's going to be another subtle improvement with NIZ: we will be able to "rescue" nodeset requests by having them use unassigned ready nodes -- nodepool was unable to do that. that will cause us to re-use ready nodes faster and avoid situations where we launch a bunch of nodes at the same time that we have the same types of nodes sitting idle.  16:29
corvus: and that almost happened in this case (that's the only case that would have triggered that bug)  16:30
clarkb: corvus: +2. I did notice what I think is some redundant checking in the test case, but that doesn't hurt, so I +2'd  16:41
corvus: yep, a little silly but not broken  16:46
clarkb: infra-root: I think we need https://review.opendev.org/c/opendev/system-config/+/963642/ before we can upgrade gitea (or make many other system-config changes); 964899 should confirm by failing soon  17:04
clarkb: I'll rebase it once it fails  17:04
corvus: clarkb: oh that's cool. i guess once we convert everything to docker compose + podman daemon we should be able to back that out?  17:10
corvus: (though, i guess it's worth having a conversation about whether we want to do that, or instead standardize on the docker cli with the podman backend)  17:10
corvus: given that we're using "docker compose", maybe that's the sensible thing to do, so 642 would be a permanent change.  17:10
clarkb: corvus: yup, the main issue now is that we're still using docker commands everywhere (not just docker compose commands)  17:10
clarkb: corvus: but also the job I expect to fail hasn't failed yet; the package still only recommends, not requires, docker.io, so I'm not sure why it is working  17:11
clarkb: oh, I wonder if that job configures the load balancer (where we'd see the failure) later in the job execution than I expect. That could explain the delay in failing  17:15
JayF: is there an incident ongoing? I'm unable to reach e.g. https://opendev.org/openstack/governance  17:25
JayF: but ping opendev.org does return (2604:e100:3:0:f816:3eff:fe5a:c9d7)  17:26
JayF: looks like it's unreachable for me over ipv4 @ 38.108.68.97  17:26
JayF: well, there it went, maybe it's just internet issues?  17:26
clarkb: gitea12 is overloaded  17:28
clarkb: https://review.opendev.org/c/opendev/system-config/+/964728/ is aimed at addressing this  17:28
clarkb: but basically our friends the ai web crawlers have discovered the backends and will target them directly, making the load balancer less useful  17:28
clarkb: though looking at the other backends they all look busy too, so maybe load balancing is "working" in this case (I checked 12 first since it is the one I balance to)  17:29
clarkb: so ya, I don't think there is an outage. Things are slower than we'd like, probably because ai web crawlers don't care; only the data matters to them  17:30
fungi: if someone wanted to work on a crawler tarpit that blocked any ip address hitting unpublished urls disallowed by our robots.txt, i'd gladly review it  17:31
fungi: s/tarpit/honeypot/  17:31
clarkb: ya, looks like the traffic may be going through the load balancer, but it's doing the classic thing of fetching every file for every commit in a tight loop and hoping you don't melt down  17:31
clarkb: clearly not respecting our robots.txt rule for time between requests  17:32
clarkb: identifying itself as a modern chrome on osx  17:32
fungi: no, it's really a human with a bionic clicky-finger  17:32
clarkb: spot checking, I see the classic random cloud-provider-hosted IPs as the source. I've got a cut | sed | sort | uniq -c | sort running to try and quantify who is worst  17:43
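One hedged way to write that pipeline, assuming a combined-format Apache access log on the backend where the client address is the first field (the actual log path and field layout on the servers are assumptions):

    # count requests per client IP and show the top offenders
    sudo cut -d' ' -f1 /var/log/apache2/gitea-ssl-access.log \
        | sort | uniq -c | sort -rn | head -20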
clarkb: I've identified a particularly bad source that we can consider blocking. I also see a long tail of ip addrs coming out of a particular geo location that we might look at next if the first doesn't help  17:50
clarkb: `iptables -I openstack-INPUT -s $ipaddr -j DROP` ?  17:54
clarkb: I never know how many flags we should provide to iptables. We could restrict it to tcp with -p tcp, for example  17:54
JayF: if you're doing IP blocks for being abusive, being less specific is better  18:00
JayF: the fewer matches the cheaper it is  18:00
JayF: if you're getting to the point of large lists, you might wanna look into ipsets to manage them: https://ipset.netfilter.org/ipset.man.html  18:01
JayF: (e.g. you drop based on an ipset, then update the ipset to add/remove things; it's more performant aiui)  18:01
clarkb: ya, we can use masks so that the rule applies more broadly.  18:02
clarkb: I'm not too worried about that portion of the command; more, is this the right chain and do we need other attributes on the rule to avoid overblocking or whatever  18:02
clarkb: setting the ip addrs is relatively easy imo  18:02
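A hedged sketch of the ipset approach JayF describes, using the chain from clarkb's one-off rule; the set name and example networks are made up for illustration:

    # create a set of networks once, then reference it from a single iptables rule
    sudo ipset create crawler-block hash:net
    sudo ipset add crawler-block 203.0.113.0/24
    sudo iptables -I openstack-INPUT -m set --match-set crawler-block src -j DROP
    # later additions/removals only touch the set, not the rule list
    sudo ipset add crawler-block 198.51.100.0/24
    sudo ipset del crawler-block 203.0.113.0/24

Narrowing the rule with -p tcp (or -p tcp --dport 443), as clarkb suggests, would leave ICMP and other traffic unaffected.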
clarkb: also looks like it may be backing off too. Maybe they are watching us here  18:03
clarkb: I added the initial rule  18:05
clarkb: JayF: any better for you now? It looks like things may be settling into a happier state  18:18
JayF: works for me now pretty snappily; I got past my blocker not too long after the report  18:19
clarkb: load average is a lot more up and down now, with a ceiling lower than our old floor. But it is still a bit higher than I'd like  18:21
opendevreview: Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7  https://review.opendev.org/c/opendev/system-config/+/964899  18:45
clarkb: that is the promised rebase from earlier. the job did eventually fail, so the check isn't as early as I had thought it would be  18:45
clarkb: https://review.opendev.org/c/opendev/system-config/+/963642/ is the fix, and I'll probably plan to approve that after lunch unless someone else wants to beat me to it  18:46
fungi: approved 963642 now  18:56
opendevreview: Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/959393  19:29
opendevreview: Dmitriy Rabotyagov proposed zuul/zuul-jobs master: Allow mirror_fqdn to be overriden  https://review.opendev.org/c/zuul/zuul-jobs/+/965008  19:50
opendevreview: Merged opendev/system-config master: Explictly install docker.io package  https://review.opendev.org/c/opendev/system-config/+/963642  19:54
clarkb: that is deploying now to a bunch of services, but it should noop as they all already have docker.io installed  19:56
clarkb: ya, on gitea-lb03 the package seems to have remained the same  19:58
clarkb: and sudo docker ps -a continues to show the expected containers to me, so we didn't modify the socket or units as far as I can tell  19:59
clarkb: and /usr/lib/systemd/system/docker.service and docker.socket haven't been modified since june  20:01
clarkb: that was my main concern: that we'd somehow reinstall the docker.io package and re-enable some of the docker bits, but it seems to be nooping from what I can see  20:02
clarkb: `sudo systemctl list-unit-files --type=service --state=disabled` and `sudo systemctl list-unit-files --type=socket --state=disabled` seem to confirm. I do note that ssh.service shows up in the disabled list, oddly. ssh is running on the load balancer and list-units shows it. Not sure what that is about  20:05
clarkb: so I think ssh.service is triggered by ssh.socket and that is why it isn't enabled by default? Makes it confusing, but this way you never get an ssh process until systemd gets a connection on port 22, then it hands things over?  20:11
clarkb: weird way to express it, but I think this is fine  20:11
fungi: yeah, just odd that the socket type is also disabled  20:13
opendevreview: Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/959393  20:14
clarkb: fungi: no, ssh.socket is not disabled, but ssh.service is  20:14
clarkb: and I think that is why: because systemd doesn't actually start the service until the socket gets a connection, it's not really enabled (which would mean start the process on boot)  20:14
fungi: ah, yeah, `sudo systemctl list-unit-files --type=socket --state=enabled` has ssh.socket enables  20:15
fungi: er, enabled  20:16
fungi: i guess that's just how it indicates processes that are set up for socket activation  20:16
clarkb: ya, I think so, since enabled here means "run this at boot" and it isn't running it at boot; instead it is opening a port and only running things if a connection is made to that port  20:19
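A hedged pair of checks illustrating the socket-activation relationship described above; the expected output in the comments is inferred from the discussion, not captured from the host:

    # the socket unit is enabled (set up at boot) and listens on port 22
    systemctl is-enabled ssh.socket    # expected: enabled
    systemctl is-enabled ssh.service   # expected: disabled; started on demand by the socket
    # status of a socket-activated service names the socket that triggers it
    systemctl status ssh.service | grep -i triggeredby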
clarkb: nhicher[m]: I responded to your question on the multi-node bridge change  20:19
nhicher[m]: clarkb: thanks, I will have a look  20:20
gouthamr: o/ seeing NODE_FAILURES a few more times, probably needs https://review.opendev.org/c/zuul/zuul/+/964893 to be merged? but it's failing tests.. should we hold off on re-checking?  20:25
clarkb: yes, that is the expected fix. I think this is an issue where the more you stress the system, the worse it gets  20:27
clarkb: so not rechecking may help, but I also don't know that it is necessary for every change  20:27
clarkb: basically, if there are failures booting nodes then you can hit this case. More load puts more stress on nova scheduling in the clouds and could lead to more failures  20:28
clarkb: but we were able to land 963642, which has many multinode jobs running against it, and they all succeeded, so the error rate can't be too high  20:29
clarkb: https://review.opendev.org/c/opendev/system-config/+/964899/ passes now and the screenshots lgtm if we want to think about fitting a gitea upgrade in  20:34
clarkb: gitea11 and 12 load is moderate right now. Seems like things continue to be happier there  20:34
clarkb: I do have a school run to do in a bit, but otherwise am around if we want to try and do that today. Happy to aim for tomorrow too  20:34
corvus: yeah, that change corrects the code to what we all thought it was in the first place, so i'm comfortable approving it now if we're seeing more impact.  20:38
corvus: i +3d 964893  20:39
clarkb: ack, thanks  20:39
clarkb: I've just updated the meeting agenda on the wiki. Let me know if anything important is missing  20:42
opendevreview: James E. Blair proposed opendev/system-config master: static: redirect zuulci.org to zuul-ci.org  https://review.opendev.org/c/opendev/system-config/+/964296  21:21
clarkb: fungi: in ^ doesn't the regex there also redirect www?  21:52
clarkb: oh, you're suggesting that www.zuul-ci.org also redirect to zuul-ci.org  21:53
clarkb: I +2'd. I think the test case may actually push for fungi's update, but figure if it doesn't then we're good already, so +2  21:55
fungi: it would also be consistent with what we already have in the gating.dev vhost that way  22:08
clarkb: looks like the PBR testing is now unhappy in its pip-latest functional test. I'm guessing the pip release broke this too :/  22:09
clarkb: unfortunately the test case logs are not super helpful; I see https://zuul.opendev.org/t/openstack/build/3331f228fd65478ea1a4ef3a61ebc8a2/log/job-output.txt#5701-5712 but I'm not sure that maps to where the rc of 1 occurs, though it must, as I don't see other issues  22:10
clarkb: fungi: I guess that setuptools>=40.8.0 must come from some default somewhere, as we don't set that version in the test case as far as I can tell. Then maybe the --no-index flag causes it to not find the version?  22:12
fungi: we probably ought to recheck the entire stack in case it starts working somewhere up the list  22:12
clarkb: in this case I don't expect it to, since it seems to be a problem with the test case, and the stack doesn't really touch those beyond renaming things that get refactored  22:12
fungi: ah  22:12
clarkb: I suspect that pip made a change to how --no-index is handled for build env deps  22:12
clarkb: it almost looks like the package is trying to use setuptools>=40.8.0 by default in an isolated build env, then failing to install that dep because --no-index is set  22:13
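A hedged sketch of the theory above: with PEP 517 build isolation, a project with no [build-system] table in pyproject.toml gets pip's fallback build requirement of setuptools>=40.8.0, which pip has to install into the isolated build environment; with --no-index set and no local copy available, that install fails. The paths here are placeholders, not the actual PBR test layout:

    # fails if a setuptools wheel is not reachable via --find-links
    pip install --no-index --find-links /tmp/wheelhouse /path/to/pbr-test-project
    # possible workaround, assuming build deps already exist in the outer environment:
    pip install --no-index --no-build-isolation /path/to/pbr-test-project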
clarkb: fungi: do you think I should put the PBR situation on the opendev meeting agenda?  22:23
clarkb: also, I'll get that sent out in about half an hour  22:24
clarkb: heh, 964893 hit node failure errors in the gate (it previously had not)  22:24
clarkb: but also it has a failing unittest run, so it would've failed anyway  22:24
fungi: sounds worth discussing  22:30
clarkb: ok, it's on the agenda now. I'll send that out about 2300 UTC if there is anything else  22:33
clarkb: this build https://zuul.opendev.org/t/openstack/build/b23ab86fe9164667afebacb6fc2bbb51 just had an upload failure to swift  22:40
clarkb: it attempted to upload to ovh bhs  22:40
clarkb: so be on the lookout for more failures like that, I guess  22:40
