| ykarel | Hi, is there a known issue in stable jobs where keystone fails to start with "failed to open python file /opt/stack/data/venv/bin/keystone-wsgi-public"? seeing failures across stable branches like stable/2024.1 (non-grenade), stable/2024.2 (non-grenade), stable/2025.1 (grenade), stable/2025.2 (grenade skip-level), failing since Saturday | 06:55 |
|---|---|---|
| ykarel | https://zuul.openstack.org/builds?job_name=tempest-full-2024-2&job_name=tempest-full-2024-1&skip=0 | 06:57 |
| ykarel | looks related to new pip release pip-25.3 | 07:36 |
| mnasiadka | We’ll be lucky if that’s the only breakage from pip ;-) | 07:39 |
| ykarel | yes locally can confirm it's specific to pip-25.3 | 07:48 |
| tonyb | pfft. that's annoying. | 07:55 |
| tonyb | I can't help debug right now but I can, probably, help land stuff. | 08:01 |
| *** | ralonsoh_ is now known as ralonsoh | 08:04 |
| ykarel | Reported https://bugs.launchpad.net/devstack/+bug/2129913 to track this | 08:54 |
| mnasiadka | Had a NODE_FAILURE in kolla-ansible recently - NODE_FAILURE Node(set) request 8888d843ceca43e5bf8ad026a550547e failed in 0s (I’ll leave that here if that’s something you’d like to check) | 09:21 |
| tonyb | mnasiadka: I'll check it out tomorrow if no one else has | 11:33 |
| *** | dhill is now known as Guest30011 | 13:13 |
| fungi | looks like today is the only time this week i'll be able to go anywhere for lunch, so taking that opportunity now. back soon | 14:51 |
| clarkb | ykarel: based on the notes fungi posted after the pip release I'm going to assume it's related to editable installation not working as expected. Devstack uses editable installs | 14:52 |
| clarkb | ykarel: it's probably worth moving this discussion to #openstack-qa since the changes will almost certainly occur within devstack | 14:53 |
| ykarel | clarkb, yes it's specific to editable installs | 14:53 |
| ykarel | ack reported it in openstack-qa | 14:54 |
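A minimal local reproduction sketch of the editable-install breakage discussed above, assuming a devstack-style keystone checkout; the paths and the pip cap are illustrative, not the eventual devstack fix:

```shell
# Build a throwaway venv on the new pip and perform the same kind of editable
# install devstack does; under pip 25.3 the generated scripts (e.g.
# keystone-wsgi-public) may no longer land in the venv's bin/ as before.
python3 -m venv /tmp/pip253-test
/tmp/pip253-test/bin/pip install 'pip==25.3'
/tmp/pip253-test/bin/pip install -e /opt/stack/keystone
ls -l /tmp/pip253-test/bin/keystone-wsgi-public   # present with older pip, missing with 25.3?

# Temporary workaround while devstack adapts: cap pip below 25.3.
/tmp/pip253-test/bin/pip install 'pip<25.3'
```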
| clarkb | mnasiadka: corvus on zl02 I see exceptions attempting to load the zkobject belonging to nodes associated with request 8888d843ceca43e5bf8ad026a550547e. Looks like zl01 tried to launch in one provider too | 14:57 |
| clarkb | the exceptions appear to be around trying to use unassigned nodes. I wonder if there is a bug around that? or maybe something to do with the weekend updates making old unassigned nodes unusable now? (these are just hunches, I don't have concrete data beyond the zkobject load error after trying to use unassigned nodes) | 14:57 |
| corvus | i can take a look later | 15:17 |
| clarkb | on further inspection we just log an error with no traceback. May need to check the zk db directly? | 15:17 |
| clarkb | (or at least I have had trouble finding a traceback using grep -A and -B) | 15:17 |
| corvus | clarkb: the "unable to load" lines are not necessarily errors... they are usually benign, but we log them as errors in case they happen to be an error in that circumstance... i expect we'll get rid of them someday, but we need them there for now... | 15:27 |
| corvus | basically, there are a number of places where it's normal for certain zkobjects not to exist | 15:27 |
| clarkb | got it | 15:28 |
| corvus | but if, say, we're debugging something and we expect it to exist and it didn't, that's when we need that log msg. | 15:28 |
| corvus | so what looks weird here is this: | 15:28 |
| corvus | https://paste.opendev.org/show/bnjKjwHw2S4XCC8zrgVq/ | 15:28 |
| corvus | we went straight from "everything is great" to "everything is not great" | 15:29 |
| corvus | oh wait, i see the node is failed in the first line | 15:29 |
| clarkb | ya and that seems to happen immediately after the zkobject load error which is why I associated them | 15:29 |
| clarkb | oh maybe the unassigned node is failed and we don't filter for good unassigned nodes? | 15:29 |
| corvus | yeah, i think that's the case... i'm lining up timestamps | 15:30 |
| corvus | no that's not the unassigned node id | 15:33 |
| corvus | let's ignore the unassigned node messages for a moment; if we do, then the story of that request is that it encountered 3 node launch failures and reported node_failure -- that seems fairly normal to me. | 15:34 |
| corvus | the only mystery i see is why was it toying with the idea of assigning a different unassigned node | 15:35 |
| clarkb | if I grep for 2d132651c4e24fa6aa29291dad0bb8ba (which is the zl02 sjc3 launch node uuid) I don't see it really do anything until what I noticed. But maybe we're not logging the uuid in places I would expect it to so my grep doesn't show the whole story | 15:37 |
| clarkb | actually I guess we don't log where the requested node is being requested? | 15:38 |
| corvus | clarkb: it was built on zl01, got this error: 2025-10-27 08:39:40,298 ERROR zuul.Launcher: [e: df0531d7312a49adbfeebae20f7a8e40] [req: 8888d843ceca43e5bf8ad026a550547e] Timeout creating node <OpenstackProviderNode uuid=2d132651c4e24fa6aa29291dad0bb8ba, label=rockylinux-10-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-sjc3-main> | 15:39 |
| clarkb | oh interesting so zl02 can log that the node is requested but it isn't responsible for servicing that request | 15:40 |
| clarkb | I'll have to keep that in mind for future debugging | 15:40 |
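For future reference, a hedged sketch of the cross-launcher grep that helps here; the launcher log path is an assumption and may differ on the real hosts:

```shell
# Trace a nodeset request and its node uuid across both launchers, since the
# launcher that logs the request is not necessarily the one servicing it.
REQ=8888d843ceca43e5bf8ad026a550547e
NODE=2d132651c4e24fa6aa29291dad0bb8ba
for host in zl01 zl02; do
  echo "=== $host ==="
  ssh "$host" "grep -E '$REQ|$NODE' /var/log/zuul/launcher-debug.log"
done
```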
| corvus | okay, the weird thing is that it chose to use an unassigned node, but then it switched back to using the failed node | 15:41 |
| corvus | i think i see the bug | 15:41 |
| corvus | clarkb: remote: https://review.opendev.org/c/zuul/zuul/+/964893 WIP Fix missing update in node reassignment [NEW] | 15:43 |
| corvus | yay python | 15:43 |
| corvus | that should be the fix, but i want to see if i can get test coverage | 15:43 |
| clarkb | thanks I'll look momentarily | 15:44 |
| clarkb | aha ya we're setting the node each iteration through that loop. | 15:47 |
| clarkb | though that seems to only happen if the unassigned node is None but in this case it seems like we do have an unassigned node? | 15:47 |
| clarkb | oh I see node is part of the request.updateProviderNode() call then we use requested_nodes for accounting afterwards so they need to be in sync | 15:54 |
| clarkb | infra-root I do plan on hosting a meeting tomorrow since it has been a bit since we had one (I realize with the PTG this is maybe annoying/difficult but there are a few topics I want to call out) | 16:00 |
| corvus | yep, basically we updated zookeeper but not our in-memory scratch copy | 16:01 |
| clarkb | For example the unexpected Gerrit shutdown which according to nova folks at the summit could be due to memory pressure (we wouldn't get an error). But also the idea of preventing direct gitea backend access | 16:01 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7 https://review.opendev.org/c/opendev/system-config/+/964899 | 16:04 |
| clarkb | also if ^ looks clean we should probably plan on upgrading | 16:04 |
| fungi | okay, back | 16:14 |
| corvus | clarkb: mnasiadka this should take care of that issue, thanks! remote: https://review.opendev.org/c/zuul/zuul/+/964893 Fix missing update in node reassignment | 16:27 |
| corvus | incidentally, that's going to be another subtle improvement with NIZ: we will be able to "rescue" nodeset requests by having them use unassigned ready nodes -- nodepool was unable to do that. that will cause us to re-use ready nodes faster and avoid situations where we launch a bunch of nodes at the same time we have the same types of nodes sitting idle. | 16:29 |
| corvus | and that almost happened in this case (that's the only case that would have triggered that bug) | 16:30 |
| clarkb | corvus: +2 I did notice what I think is some redundant checking in the test case but that doesn't hurt so I +2'd | 16:41 |
| corvus | yep, a little silly but not broken | 16:46 |
| clarkb | infra-root I think we need https://review.opendev.org/c/opendev/system-config/+/963642/ before we can upgrade gitea (or make many other system-config changes) 964899 should confirm by failing soon | 17:04 |
| clarkb | I'll rebase it once it fails | 17:04 |
| corvus | clarkb: oh that's cool. i guess once we convert everything to docker compose + podman daemon we should be able to back that out? | 17:10 |
| corvus | (though, i guess it's worth having a conversation whether we want to do that, or instead standardize on the docker cli with the podman backend) | 17:10 |
| corvus | given that we're using "docker compose" maybe that's the sensible thing to do, so 642 would be a permanent change. | 17:10 |
| clarkb | corvus: yup the main issue now is that we're still using docker commands everywhere (not just docker compose commands) | 17:10 |
| clarkb | corvus: but also the job I expect to fail hasn't failed yet; the package still only recommends (not requires) docker.io so I'm not sure why it is working | 17:11 |
| clarkb | oh I wonder if that job configures the load balancer (where we'd see the failure) later in the job execution than I expect. That could explain the delay in failing | 17:15 |
| JayF | is there an incident ongoing? I'm unable to reach e.g. https://opendev.org/openstack/governance | 17:25 |
| JayF | but ping opendev.org does return (2604:e100:3:0:f816:3eff:fe5a:c9d7) | 17:26 |
| JayF | looks like it's unreachable for me over ipv4 @ 38.108.68.97 | 17:26 |
| JayF | well, there it went, maybe it's just internet issues? | 17:26 |
| clarkb | gitea12 is overloaded | 17:28 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/964728/ is aimed at addressing this | 17:28 |
| clarkb | but basically our friends the ai web crawlers have discovered the backends and will target them directly making the load balancer less useful | 17:28 |
| clarkb | though looking at the other backends they all look busy too so maybe load balancing is "working" in this case (I checked 12 first since it is the one I balance to) | 17:29 |
| clarkb | so ya I don't think there is an outage. Things are slower than we'd like probably because ai web crawlers don't care; only the data matters to them | 17:30 |
| fungi | if someone wanted to work on a crawler tarpit that blocked any ip address hitting unpublished urls disallowed by our robots.txt, i'd gladly review it | 17:31 |
| fungi | s/tarpit/honeypot/ | 17:31 |
| clarkb | ya looks like the traffic may be going through the load balancer but it's doing the classic thing of fetching every file for every commit in a tight loop and hoping you don't melt down | 17:31 |
| clarkb | clearly not respecting our robots.txt rule for time between requests | 17:32 |
| clarkb | identifying itself as a modern chrome on osx | 17:32 |
| fungi | no it's really a human with a bionic clicky-finger | 17:32 |
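For context, checking what the site actually asks of crawlers is a one-liner; nothing here assumes specific values, it just prints whatever opendev publishes:

```shell
# Show the user-agent, crawl-delay and disallow rules the crawler is ignoring.
curl -s https://opendev.org/robots.txt | grep -iE 'user-agent|crawl-delay|disallow'
```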
| clarkb | spot checking I see the classic random cloud provider hosted IPs as the source. I've got a cut | sed | sort | uniq -c | sort running to try and quantify who is worst | 17:43 |
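A sketch of that kind of ranking pipeline, assuming an Apache-style access log with the client address in the first field (the log path and field position are assumptions):

```shell
# Count requests per client IP and list the top offenders.
sudo cut -d' ' -f1 /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20
```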
| clarkb | I've identified a particularly bad source that we can consider blocking. I also see a long tail of ip addrs coming out of a particular geo location that we might look at next if the first doesn't help | 17:50 |
| clarkb | `iptables -I openstack-INPUT -s $ipaddr -j DROP` ? | 17:54 |
| clarkb | I never know how many flags we should provide to iptables. We could restrict it to tcp with -p tcp for example | 17:54 |
| JayF | if you're doing IP blocks for being abusive, being less specific is better | 18:00 |
| JayF | the fewer matches the cheaper it is | 18:00 |
| JayF | if you're getting to the point of large lists, you might wanna look into ipsets to manage them: https://ipset.netfilter.org/ipset.man.html | 18:01 |
| JayF | (e.g. you drop from ipset, then update the ipset to add/remove things, it's more performant aiui) | 18:01 |
| clarkb | ya can use masks so that the rule applies more broadly. | 18:02 |
| clarkb | I'm not too worried about that portion of the command; more, is this the right chain and do we need other attributes on the rule to avoid overblocking or whatever | 18:02 |
| clarkb | setting the ip addrs is relatively easy imo | 18:02 |
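A hedged sketch of the ipset approach JayF describes, reusing the openstack-INPUT chain from the command above; the addresses are documentation ranges, not the actual offenders:

```shell
# One iptables rule matches a named set; membership is managed separately,
# so growing the block list never adds more rules to evaluate.
sudo ipset create crawler-block hash:net
sudo iptables -I openstack-INPUT -m set --match-set crawler-block src -j DROP
sudo ipset add crawler-block 203.0.113.0/24
sudo ipset add crawler-block 198.51.100.7
# Removal is just as cheap:
sudo ipset del crawler-block 198.51.100.7
```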
| clarkb | also looks like it may be backing off too. Maybe they are watching us here | 18:03 |
| clarkb | I added the initial rule | 18:05 |
| clarkb | JayF: any better for you now? It looks like things may be settling into a happier state | 18:18 |
| JayF | works for me now pretty snappily; I got past my blocker not too long after the report | 18:19 |
| clarkb | load average is a lot more up and down now with a ceiling lower than our old floor. But it is still a bit higher than I'd like | 18:21 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7 https://review.opendev.org/c/opendev/system-config/+/964899 | 18:45 |
| clarkb | that is the promised rebase from earlier. the job did eventually fail so the check isn't as early as I had thought it would be | 18:45 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/963642/ is the fix and I'll probably plan to approve that after lunch unless someone else wants to beat me to it | 18:46 |
| fungi | approved 963642 now | 18:56 |
| opendevreview | Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge https://review.opendev.org/c/zuul/zuul-jobs/+/959393 | 19:29 |
| opendevreview | Dmitriy Rabotyagov proposed zuul/zuul-jobs master: Allow mirror_fqdn to be overriden https://review.opendev.org/c/zuul/zuul-jobs/+/965008 | 19:50 |
| opendevreview | Merged opendev/system-config master: Explictly install docker.io package https://review.opendev.org/c/opendev/system-config/+/963642 | 19:54 |
| clarkb | that is deploying now to a bunch of services but it should noop as they all already have docker.io installed | 19:56 |
| clarkb | ya on gitea-lb03 the package seems to have remained the same | 19:58 |
| clarkb | and sudo docker ps -a continues to show the expected containers to me so we didn't modify the socket or units as far as I can tell | 19:59 |
| clarkb | and /usr/lib/systemd/system/docker.service and docker.socket haven't been modified since june | 20:01 |
| clarkb | that was my main concern: that we'd somehow reinstall the docker.io package and reenable some of the docker bits, but seems to be nooping from what I can see | 20:02 |
| clarkb | `sudo systemctl list-unit-files --type=service --state=disabled` and `sudo systemctl list-unit-files --type=socket --state=disabled` seem to confirm. I do note that ssh.service shows up in the disabled list oddly. ssh is running on the load balancer and list-units shows it. Not sure what that is about | 20:05 |
| clarkb | so I think ssh.service is triggered by ssh.socket and that is why it isn't enabled by default? Makes it confusing but this way you never get an sshd process until systemd gets a connection on port 22, then it hands things over? | 20:11 |
| clarkb | weird way to express it but I think this is fine | 20:11 |
| fungi | yeah, just odd that socket type is also disabled | 20:13 |
| opendevreview | Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge https://review.opendev.org/c/zuul/zuul-jobs/+/959393 | 20:14 |
| clarkb | fungi: no, ssh.socket is not disabled but ssh.service is | 20:14 |
| clarkb | and I think that is why: systemd doesn't actually start the service until the socket gets a connection, so it's not really enabled (which means start the process on boot) | 20:14 |
| fungi | ah, yeah `sudo systemctl list-unit-files --type=socket --state=enabled` has ssh.socket enables | 20:15 |
| fungi | er, enabled | 20:16 |
| fungi | i guess that's just how it indicates processes that are set up for socket activation | 20:16 |
| clarkb | ya I think so, since enabled here means run this at boot, and it isn't running it at boot; instead it is opening a port and only running things if a connection is made to that port | 20:19 |
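A few commands that make the socket-activation relationship explicit, using the ssh units discussed above; exact output varies by distro and systemd version:

```shell
systemctl is-enabled ssh.socket    # expect: enabled  (systemd listens on port 22 at boot)
systemctl is-enabled ssh.service   # expect: disabled (only started when the socket fires)
systemctl cat ssh.socket           # the unit file shows ListenStream=22
systemctl status ssh.service       # newer systemd prints "TriggeredBy: ssh.socket"
```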
| clarkb | nhicher[m]: I responded to your question on the multi node bridge change | 20:19 |
| nhicher[m] | clarkb: thanks, I will have a look | 20:20 |
| gouthamr | o/ seeing NODE_FAILURES a few more times, probably needs https://review.opendev.org/c/zuul/zuul/+/964893 to be merged? but it's failing tests .. should we hold off on re-checking? | 20:25 |
| clarkb | yes that is the expected fix. I think this is an issue where the more you stress the system the worse it gets | 20:27 |
| clarkb | so not rechecking may help, but I also don't know that it is necessary for every change | 20:27 |
| clarkb | basically if there are failures booting nodes then you can hit this case. More load puts more stress on nova scheduling in the clouds and could lead to more failures | 20:28 |
| clarkb | but we were able to land 963642 which has many multinode jobs running against it and they all succeeded so the error rate can't be too high | 20:29 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/964899/ passes now and screenshots lgtm if we want to think about fitting a gitea upgrade in | 20:34 |
| clarkb | gitea11 and 12 load is moderate right now. Seems like things continue to be happier there | 20:34 |
| clarkb | I do have a school run to do in a bit but otherwise am around if we want to try and do that today. Happy to aim for tomorrow too | 20:34 |
| corvus | yeah, that change corrects the code to what we all thought it was in the first place, so i'm comfortable approving it now if we're seeing more impact. | 20:38 |
| corvus | i +3d 964893 | 20:39 |
| clarkb | ack thanks | 20:39 |
| clarkb | I've just updated the meeting agenda on the wiki. Let me know if anything important is missing | 20:42 |
| opendevreview | James E. Blair proposed opendev/system-config master: static: redirect zuulci.org to zuul-ci.org https://review.opendev.org/c/opendev/system-config/+/964296 | 21:21 |
| clarkb | fungi: in ^ doesn't the regex there also redirect www? | 21:52 |
| clarkb | oh you're suggesting that www.zuul-ci.org also redirect to zuul-ci.org | 21:53 |
| clarkb | I +2'd; I think the test case may actually push for fungi's update but figure if it doesn't then we're good already so +2 | 21:55 |
| fungi | it would also be consistent with what we already have in the gating.dev vhost that way | 22:08 |
| clarkb | looks like the PBR testing is now unhappy in its pip latest functional test. I'm guessing the pip release broke this too :/ | 22:09 |
| clarkb | unfortunately the test case logs are not super helpful; I see https://zuul.opendev.org/t/openstack/build/3331f228fd65478ea1a4ef3a61ebc8a2/log/job-output.txt#5701-5712 but I'm not sure that maps to where the rc of 1 occurs; it must though, as I don't see other issues | 22:10 |
| clarkb | fungi: I guess that setuptools>=40.8.0 must come from some default somewhere as we don't set that version in the test case as far as I can tell. Then maybe the --no-index flag causes it to not find the version? | 22:12 |
| fungi | we probably ought to recheck the entire stack in case it starts working somewhere up the list | 22:12 |
| clarkb | in this case I don't expect it to since it seems to be a problem with the test case and the stack doesn't really touch those beyond renaming things that get refactored | 22:12 |
| fungi | ah | 22:12 |
| clarkb | I suspect that pip made a change to how --no-index is handled for build env deps | 22:12 |
| clarkb | it almost looks like the package is trying to use setuptools>=40.8.0 by default in an isolated build env then failing to install that dep because --no-index is set | 22:13 |
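A hedged illustration of that failure mode: with build isolation on, pip creates a fresh environment and tries to fetch the implicit build requirement (setuptools>=40.8.0 is pip's fallback for projects without a pyproject.toml), which --no-index forbids unless the local index can satisfy it. The package name and wheelhouse path below are placeholders, and neither adjustment is necessarily the fix PBR will take:

```shell
# Fails if the wheelhouse lacks setuptools: the isolated build env cannot
# resolve its implicit build requirement with --no-index set.
pip install --no-index --find-links /path/to/wheelhouse ./some-pbr-test-package

# Possible test-side adjustments (assumptions, not the chosen fix):
pip install --no-index --no-build-isolation ./some-pbr-test-package  # reuse the outer env's setuptools
pip download setuptools wheel -d /path/to/wheelhouse                  # or pre-seed the wheelhouse
```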
| clarkb | fungi: do you think I should put the PBR situation on the opendev meeting agenda? | 22:23 |
| clarkb | also I'll get that sent out in about half an hour | 22:24 |
| clarkb | heh 964893 hit node failure errors in the gate (it previously had not) | 22:24 |
| clarkb | but also it has a failing unittest run so would've failed anyway | 22:24 |
| fungi | sounds worth discussing | 22:30 |
| clarkb | ok it's on the agenda now. I'll send that out about 2300 UTC if there is anything else | 22:33 |
| clarkb | this build https://zuul.opendev.org/t/openstack/build/b23ab86fe9164667afebacb6fc2bbb51 just had an upload failure to swift | 22:40 |
| clarkb | it attempted to upload to ovh bhs | 22:40 |
| clarkb | so be on the lookout for more failures like that I guess | 22:40 |