| ykarel | Hi, is there a known issue in stable jobs where keystone fails to start with "failed to open python file /opt/stack/data/venv/bin/keystone-wsgi-public"? seeing failures across stable branches like stable/2024.1 (non-grenade), stable/2024.2 (non-grenade), stable/2025.1 (grenade), stable/2025.2 (grenade skip-level), failing since Saturday | 06:55 |
|---|---|---|
| ykarel | https://zuul.openstack.org/builds?job_name=tempest-full-2024-2&job_name=tempest-full-2024-1&skip=0 | 06:57 |
| ykarel | looks related to new pip release pip-25.3 | 07:36 |
| mnasiadka | We’ll be lucky if that’s the only breakage from pip ;-) | 07:39 |
| ykarel | yes locally can confirm it's specific to pip-25.3 | 07:48 |
| tonyb | pfft. that's annoying. | 07:55 |
| tonyb | I can't help debug right now but I can, probably, help land stuff. | 08:01 |
| *** | ralonsoh_ is now known as ralonsoh | 08:04 |
| ykarel | Reported https://bugs.launchpad.net/devstack/+bug/2129913 to track this | 08:54 |
| mnasiadka | Had a NODE_FAILURE in kolla-ansible recently - NODE_FAILURE Node(set) request 8888d843ceca43e5bf8ad026a550547e failed in 0s (I’ll leave that here if that’s something you’d like to check) | 09:21 |
| tonyb | mnasiadka: I'll check it out tomorrow if no one else has | 11:33 |
| *** | dhill is now known as Guest30011 | 13:13 |
| fungi | looks like today is the only time this week i'll be able to go anywhere for lunch, so taking that opportunity now. back soon | 14:51 |
| clarkb | ykarel: based on the notes fungi posted after the pip release I'm going to assume it's related to editable installation not working as expected. Devstack uses editable installs | 14:52 |
| clarkb | ykarel: it's probably worth moving this discussion to #openstack-qa since the changes will almost certainly occur within devstack | 14:53 |
| ykarel | clarkb, yes it's specific to editable installs | 14:53 |
| ykarel | ack reported it in openstack-qa | 14:54 |
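A minimal local reproduction sketch of the editable-install breakage discussed above, assuming a devstack-style keystone checkout; the paths and the pip cap are illustrative, not the eventual devstack fix:

```shell
# Build a throwaway venv on the new pip and perform the same kind of editable
# install devstack does; under pip 25.3 the generated scripts (e.g.
# keystone-wsgi-public) may no longer land in the venv's bin/ as before.
python3 -m venv /tmp/pip253-test
/tmp/pip253-test/bin/pip install 'pip==25.3'
/tmp/pip253-test/bin/pip install -e /opt/stack/keystone
ls -l /tmp/pip253-test/bin/keystone-wsgi-public   # present with older pip, missing with 25.3?

# Temporary workaround while devstack adapts: cap pip below 25.3.
/tmp/pip253-test/bin/pip install 'pip<25.3'
```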
| clarkb | mnasiadka: corvus on zl02 I see exceptions attempting to load the zkobject belonging to nodes associated with request 8888d843ceca43e5bf8ad026a550547e. Looks like zl01 tried to launch in one provider too | 14:57 |
| clarkb | the exceptions appear to be around trying to use unassigned nodes. I wonder if there is a bug around that? or maybe something to do with the weekend updates making old unassigned nodes unusable now? (these are just hunches, I don't have concrete data beyond the zkobject load error after trying to use unassigned nodes) | 14:57 |
| corvus | i can take a look later | 15:17 |
| clarkb | on further inspection we just log an error with no traceback. May need to check the zk db directly? | 15:17 |
| clarkb | (or at least I have had trouble finding a traceback using grep -A and -B) | 15:17 |
| corvus | clarkb: the "unable to load" lines are not necessarily errors... they are usually benign, but we log them as errors in case they happen to be an error in that circumstance... i expect we'll get rid of them someday, but we need them there for now... | 15:27 |
| corvus | basically, there are a number of places where it's normal for certain zkobjects not to exist | 15:27 |
| clarkb | got it | 15:28 |
| corvus | but if, say, we're debugging something and we expect it to exist and it didn't, that's when we need that log msg. | 15:28 |
| corvus | so what looks weird here is this: | 15:28 |
| corvus | https://paste.opendev.org/show/bnjKjwHw2S4XCC8zrgVq/ | 15:28 |
| corvus | we went straight from "everything is great" to "everything is not great" | 15:29 |
| corvus | oh wait, i see the node is failed in the first line | 15:29 |
| clarkb | ya and that seems to happen immediately after the zkobject load error which is why I associated them | 15:29 |
| clarkb | oh maybe the unassigned node is failed and we don't filter for good unassigned nodes? | 15:29 |
| corvus | yeah, i think that's the case... i'm lining up timestamps | 15:30 |
| corvus | no that's not the unassigned node id | 15:33 |
| corvus | let's ignore the unassigned node messages for a moment; if we do, then the story of that request is that it encountered 3 node launch failures and reported node_failure -- that seems fairly normal to me. | 15:34 |
| corvus | the only mystery i see is why was it toying with the idea of assigning a different unassigned node | 15:35 |
| clarkb | if I grep for 2d132651c4e24fa6aa29291dad0bb8ba (which is the zl02 sjc3 launch node uuid) I don't see it really do anything until what I noticed. But maybe we're not logging the uuid in places I would expect it to so my grep doesn't show the whole story | 15:37 |
| clarkb | actually I guess we don't log where the requested node is being requested? | 15:38 |
| corvus | clarkb: it was built on zl01, got this error: 2025-10-27 08:39:40,298 ERROR zuul.Launcher: [e: df0531d7312a49adbfeebae20f7a8e40] [req: 8888d843ceca43e5bf8ad026a550547e] Timeout creating node <OpenstackProviderNode uuid=2d132651c4e24fa6aa29291dad0bb8ba, label=rockylinux-10-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-sjc3-main> | 15:39 |
| clarkb | oh interesting so zl02 can log that the node is requested but it isn't responsible for servicing that request | 15:40 |
| clarkb | I'll have to keep that in mind for future debugging | 15:40 |
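For future reference, a hedged sketch of the cross-launcher grep that helps here; the launcher log path is an assumption and may differ on the real hosts:

```shell
# Trace a nodeset request and its node uuid across both launchers, since the
# launcher that logs the request is not necessarily the one servicing it.
REQ=8888d843ceca43e5bf8ad026a550547e
NODE=2d132651c4e24fa6aa29291dad0bb8ba
for host in zl01 zl02; do
  echo "=== $host ==="
  ssh "$host" "grep -E '$REQ|$NODE' /var/log/zuul/launcher-debug.log"
done
```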
| corvus | okay, the weird thing is that it chose to use an unassigned node, but then it switched back to using the failed node | 15:41 |
| corvus | i think i see the bug | 15:41 |
| corvus | clarkb: remote: https://review.opendev.org/c/zuul/zuul/+/964893 WIP Fix missing update in node reassignment [NEW] | 15:43 |
| corvus | yay python | 15:43 |
| corvus | that should be the fix, but i want to see if i can get test coverage | 15:43 |
| clarkb | thanks I'll look momentarily | 15:44 |
| clarkb | aha ya we're setting the node each iteration through that loop. | 15:47 |
| clarkb | though that seems to only happen if the unassigned node is None but in this case it seems like we do have an unassigned node? | 15:47 |
| clarkb | oh I see node is part of the request.updateProviderNode() call then we use requested_nodes for accounting afterwards so they need to be in sync | 15:54 |
| clarkb | infra-root I do plan on hosting a meeting tomorrow since it has been a bit since we had one (I realize with the PTG this is maybe annoying/difficult but there are a few topics I want to call out) | 16:00 |
| corvus | yep, basically we updated zookeeper but not our in-memory scratch copy | 16:01 |
| clarkb | For example the unexpected Gerrit shutdown which according to nova folks at the summit could be due to memory pressure (we wouldn't get an error). But also the idea of preventing direct gitea backend access | 16:01 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7 https://review.opendev.org/c/opendev/system-config/+/964899 | 16:04 |
| clarkb | also if ^ looks clean we should probably plan on upgrading | 16:04 |
| fungi | okay, back | 16:14 |
| corvus | clarkb: mnasiadka this should take care of that issue, thanks! remote: https://review.opendev.org/c/zuul/zuul/+/964893 Fix missing update in node reassignment | 16:27 |
| corvus | incidentally, that's going to be another subtle improvement with NIZ: we will be able to "rescue" nodeset requests by having them use unassigned ready nodes -- nodepool was unable to do that. that will cause us to re-use ready nodes faster and avoid situations where we launch a bunch of nodes at the same time we have the same types of nodes sitting idle. | 16:29 |
| corvus | and that almost happened in this case (that's the only case that would have triggered that bug) | 16:30 |
| clarkb | corvus: +2 I did notice what I think is some redundant checking in the test case but that doesn't hurt so I +2'd | 16:41 |
| corvus | yep, a little silly but not broken | 16:46 |
| clarkb | infra-root I think we need https://review.opendev.org/c/opendev/system-config/+/963642/ before we can upgrade gitea (or make many other system-config changes) 964899 should confirm by failing soon | 17:04 |
| clarkb | I'll rebase it once it fails | 17:04 |
| corvus | clarkb: oh that's cool. i guess once we convert everything to docker compose + podman daemon we should be able to back that out? | 17:10 |
| corvus | (though, i guess it's worth having a conversation whether we want to do that, or instead standardize on the docker cli with the podman backend) | 17:10 |
| corvus | given that we're using "docker compose" maybe that's the sensible thing to do, so 642 would be a permanent change. | 17:10 |
| clarkb | corvus: yup the main issue now is that we're still using docker commands everywhere (not just docker compose commands) | 17:10 |
| clarkb | corvus: but also the job I expect to fail hasn't failed yet; the package still only recommends (not requires) docker.io so I'm not sure why it is working | 17:11 |
| clarkb | oh I wonder if that job configures the load balancer (where we'd see the failure) later in the job execution than I expect. That could explain the delay in failing | 17:15 |
| JayF | is there an incident ongoing? I'm unable to reach e.g. https://opendev.org/openstack/governance | 17:25 |
| JayF | but ping opendev.org does return (2604:e100:3:0:f816:3eff:fe5a:c9d7) | 17:26 |
| JayF | looks like it's unreachable for me over ipv4 @ 38.108.68.97 | 17:26 |
| JayF | well, there it went, maybe it's just internet issues? | 17:26 |
| clarkb | gitea12 is overloaded | 17:28 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/964728/ is aimed at addressing this | 17:28 |
| clarkb | but basically our friends the ai web crawlers have discovered the backends and will target them directly making the load balancer less useful | 17:28 |
| clarkb | though looking at the other backends they all look busy too so maybe load balancing is "working" in this case (I checked 12 first since it is the one I balance to) | 17:29 |
| clarkb | so ya I don't think there is an outage. Things are slower than we'd like probably because ai web crawlers don't care; only the data matters to them | 17:30 |
| fungi | if someone wanted to work on a crawler tarpit that blocked any ip address hitting unpublished urls disallowed by our robots.txt, i'd gladly review it | 17:31 |
| fungi | s/tarpit/honeypot/ | 17:31 |
| clarkb | ya looks like the traffic may be going through the load balancer but it's doing the classic thing of fetching every file for every commit in a tight loop and hoping you don't melt down | 17:31 |
| clarkb | clearly not respecting our robots.txt rule for time between requests | 17:32 |
| clarkb | identifying itself as a modern chrome on osx | 17:32 |
| fungi | no it's really a human with a bionic clicky-finger | 17:32 |
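For context, checking what the site actually asks of crawlers is a one-liner; nothing here assumes specific values, it just prints whatever opendev publishes:

```shell
# Show the user-agent, crawl-delay and disallow rules the crawler is ignoring.
curl -s https://opendev.org/robots.txt | grep -iE 'user-agent|crawl-delay|disallow'
```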
| clarkb | spot checking I see the classic random cloud provider hosted IPs as the source. I've got a cut | sed | sort | uniq -c | sort running to try and quantify who is worst | 17:43 |
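A sketch of that kind of ranking pipeline, assuming an Apache-style access log with the client address in the first field (the log path and field position are assumptions):

```shell
# Count requests per client IP and list the top offenders.
sudo cut -d' ' -f1 /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20
```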
| clarkb | I've identified a particularly bad source that we can consider blocking. I also see a long tail of ip addrs coming out of a particular geo location that we might look at next if the first doesn't help | 17:50 |
| clarkb | `iptables -I openstack-INPUT -s $ipaddr -j DROP` ? | 17:54 |
| clarkb | I never know how many flags we should provide to iptables. We could restrict it to tcp with -p tcp for example | 17:54 |
| JayF | if you're doing IP blocks for being abusive, being less specific is better | 18:00 |
| JayF | the fewer matches the cheaper it is | 18:00 |
| JayF | if you're getting to the point of large lists, you might wanna look into ipsets to manage them: https://ipset.netfilter.org/ipset.man.html | 18:01 |
| JayF | (e.g. you drop from ipset, then update the ipset to add/remove things, it's more performant aiui) | 18:01 |
| clarkb | ya can use masks so that the rule applies more broadly. | 18:02 |
| clarkb | I'm not too worried about that portion of the command; more, is this the right chain and do we need other attributes on the rule to avoid overblocking or whatever | 18:02 |
| clarkb | setting the ip addrs is relatively easy imo | 18:02 |
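A hedged sketch of the ipset approach JayF describes, reusing the openstack-INPUT chain from the command above; the addresses are documentation ranges, not the actual offenders:

```shell
# One iptables rule matches a named set; membership is managed separately,
# so growing the block list never adds more rules to evaluate.
sudo ipset create crawler-block hash:net
sudo iptables -I openstack-INPUT -m set --match-set crawler-block src -j DROP
sudo ipset add crawler-block 203.0.113.0/24
sudo ipset add crawler-block 198.51.100.7
# Removal is just as cheap:
sudo ipset del crawler-block 198.51.100.7
```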
| clarkb | also looks like it may be backing off too. Maybe they are watching us here | 18:03 |
| clarkb | I added the initial rule | 18:05 |
| clarkb | JayF: any better for you now? It looks like things may be settling into a happier state | 18:18 |
| JayF | works for me now pretty snappily; I got past my blocker not too long after the report | 18:19 |
| clarkb | load average is a lot more up and down now with a ceiling lower than our old floor. But it is still a bit higher than I'd like | 18:21 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.24.7 https://review.opendev.org/c/opendev/system-config/+/964899 | 18:45 |
| clarkb | that is the promised rebase from earlier. the job did eventually fail so the check isn't as early as I had thought it would be | 18:45 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/963642/ is the fix and I'll probably plan to approve that after lunch unless someone else wants to beat me to it | 18:46 |
| fungi | approved 963642 now | 18:56 |
| opendevreview | Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge https://review.opendev.org/c/zuul/zuul-jobs/+/959393 | 19:29 |
| opendevreview | Dmitriy Rabotyagov proposed zuul/zuul-jobs master: Allow mirror_fqdn to be overriden https://review.opendev.org/c/zuul/zuul-jobs/+/965008 | 19:50 |
| opendevreview | Merged opendev/system-config master: Explictly install docker.io package https://review.opendev.org/c/opendev/system-config/+/963642 | 19:54 |
| clarkb | that is deploying now to a bunch of services but it should noop as they all already have docker.io installed | 19:56 |
| clarkb | ya on gitea-lb03 the package seems to have remained the same | 19:58 |
| clarkb | and sudo docker ps -a continues to show the expected containers to me so we didn't modify the socket or units as far as I can tell | 19:59 |
| clarkb | and /usr/lib/systemd/system/docker.service and docker.socket haven't been modified since june | 20:01 |
| clarkb | that was my main concern: that we'd somehow reinstall the docker.io package and reenable some of the docker bits, but seems to be nooping from what I can see | 20:02 |
| clarkb | `sudo systemctl list-unit-files --type=service --state=disabled` and `sudo systemctl list-unit-files --type=socket --state=disabled` seem to confirm. I do note that ssh.service shows up in the disabled list oddly. ssh is running on the load balancer and list-units shows it. Not sure what that is about | 20:05 |
| clarkb | so I think ssh.service is triggered by ssh.socket and that is why it isn't enabled by default? Makes it confusing but this way you never get an sshd process until systemd gets a connection on port 22, then it hands things over? | 20:11 |
| clarkb | weird way to express it but I think this is fine | 20:11 |
| fungi | yeah, just odd that socket type is also disabled | 20:13 |
| opendevreview | Nicolas Hicher proposed zuul/zuul-jobs master: Refactor: multi-node-bridge to use linux bridge https://review.opendev.org/c/zuul/zuul-jobs/+/959393 | 20:14 |
| clarkb | fungi: no, ssh.socket is not disabled but ssh.service is | 20:14 |
| clarkb | and I think that is why: systemd doesn't actually start the service until the socket gets a connection, so it's not really enabled (which means start the process on boot) | 20:14 |
| fungi | ah, yeah `sudo systemctl list-unit-files --type=socket --state=enabled` has ssh.socket enables | 20:15 |
| fungi | er, enabled | 20:16 |
| fungi | i guess that's just how it indicates processes that are set up for socket activation | 20:16 |
| clarkb | ya I think so, since enabled here means run this at boot, and it isn't running it at boot; instead it is opening a port and only running things if a connection is made to that port | 20:19 |
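A few commands that make the socket-activation relationship explicit, using the ssh units discussed above; exact output varies by distro and systemd version:

```shell
systemctl is-enabled ssh.socket    # expect: enabled  (systemd listens on port 22 at boot)
systemctl is-enabled ssh.service   # expect: disabled (only started when the socket fires)
systemctl cat ssh.socket           # the unit file shows ListenStream=22
systemctl status ssh.service       # newer systemd prints "TriggeredBy: ssh.socket"
```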
| clarkb | nhicher[m]: I responded to your question on the multi node bridge change | 20:19 |
| nhicher[m] | clarkb: thanks, I will have a look | 20:20 |
| gouthamr | o/ seeing NODE_FAILURES a few more times, probably needs https://review.opendev.org/c/zuul/zuul/+/964893 to be merged? but it's failing tests .. should we hold off on re-checking? | 20:25 |
| clarkb | yes that is the expected fix. I think this is an issue where the more you stress the system the worse it gets | 20:27 |
| clarkb | so not rechecking may help, but I also don't know that it is necessary for every change | 20:27 |
| clarkb | basically if there are failures booting nodes then you can hit this case. More load puts more stress on nova scheduling in the clouds and could lead to more failures | 20:28 |
| clarkb | but we were able to land 963642 which has many multinode jobs running against it and they all succeeded so the error rate can't be too high | 20:29 |
| clarkb | https://review.opendev.org/c/opendev/system-config/+/964899/ passes now and screenshots lgtm if we want to think about fitting a gitea upgrade in | 20:34 |
| clarkb | gitea11 and 12 load is moderate right now. Seems like things continue to be happier there | 20:34 |
| clarkb | I do have a school run to do in a bit but otherwise am around if we want to try and do that today. Happy to aim for tomorrow too | 20:34 |
| corvus | yeah, that change corrects the code to what we all thought it was in the first place, so i'm comfortable approving it now if we're seeing more impact. | 20:38 |
| corvus | i +3d 964893 | 20:39 |
| clarkb | ack thanks | 20:39 |
| clarkb | I've just updated the meeting agenda on the wiki. Let me know if anything important is missing | 20:42 |
| opendevreview | James E. Blair proposed opendev/system-config master: static: redirect zuulci.org to zuul-ci.org https://review.opendev.org/c/opendev/system-config/+/964296 | 21:21 |
| clarkb | fungi: in ^ doesn't the regex there also redirect www? | 21:52 |
| clarkb | oh you're suggesting that www.zuul-ci.org also redirect to zuul-ci.org | 21:53 |
| clarkb | I +2'd; I think the test case may actually push for fungi's update but figure if it doesn't then we're good already so +2 | 21:55 |
| fungi | it would also be consistent with what we already have in the gating.dev vhost that way | 22:08 |
| clarkb | looks like the PBR testing is now unhappy in its pip latest functional test. I'm guessing the pip release broke this too :/ | 22:09 |
| clarkb | unfortunately the test case logs are not super helpful; I see https://zuul.opendev.org/t/openstack/build/3331f228fd65478ea1a4ef3a61ebc8a2/log/job-output.txt#5701-5712 but I'm not sure that maps to where the rc of 1 occurs; it must though, as I don't see other issues | 22:10 |
| clarkb | fungi: I guess that setuptools>=40.8.0 must come from some default somewhere as we don't set that version in the test case as far as I can tell. Then maybe the --no-index flag causes it to not find the version? | 22:12 |
| fungi | we probably ought to recheck the entire stack in case it starts working somewhere up the list | 22:12 |
| clarkb | in this case I don't expect it to since it seems to be a problem with the test case and the stack doesn't really touch those beyond renaming things that get refactored | 22:12 |
| fungi | ah | 22:12 |
| clarkb | I suspect that pip made a change to how --no-index is handled for build env deps | 22:12 |
| clarkb | it almost looks like the package is trying to use setuptools>=40.8.0 by default in an isolated build env then failing to install that dep because --no-index is set | 22:13 |
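A hedged illustration of that failure mode: with build isolation on, pip creates a fresh environment and tries to fetch the implicit build requirement (setuptools>=40.8.0 is pip's fallback for projects without a pyproject.toml), which --no-index forbids unless the local index can satisfy it. The package name and wheelhouse path below are placeholders, and neither adjustment is necessarily the fix PBR will take:

```shell
# Fails if the wheelhouse lacks setuptools: the isolated build env cannot
# resolve its implicit build requirement with --no-index set.
pip install --no-index --find-links /path/to/wheelhouse ./some-pbr-test-package

# Possible test-side adjustments (assumptions, not the chosen fix):
pip install --no-index --no-build-isolation ./some-pbr-test-package  # reuse the outer env's setuptools
pip download setuptools wheel -d /path/to/wheelhouse                  # or pre-seed the wheelhouse
```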
| clarkb | fungi: do you think I should put the PBR situation on the opendev meeting agenda? | 22:23 |
| clarkb | also I'll get that sent out in about half an hour | 22:24 |
| clarkb | heh 964893 hit node failure errors in the gate (it previously had not) | 22:24 |
| clarkb | but also it has a failing unittest run so would've failed anyway | 22:24 |
| fungi | sounds worth discussing | 22:30 |
| clarkb | ok it's on the agenda now. I'll send that out about 2300 UTC if there is anything else | 22:33 |
| clarkb | this build https://zuul.opendev.org/t/openstack/build/b23ab86fe9164667afebacb6fc2bbb51 just had an upload failure to swift | 22:40 |
| clarkb | it attempted to upload to ovh bhs | 22:40 |
| clarkb | so be on the lookout for more failures like that I guess | 22:40 |