frickler | clarkb: regarding OVN+DNS, I had forgotten that with recent OVN versions the affected queries should pass untouched, see the commit referenced in the SO issue https://github.com/ovn-org/ovn/commit/4b10571aa89b226c13a8c5551ceb7208d782b580 | 08:02 |
frickler | so that confirms my choice of "no need to do anything unless we see an actual issue" | 08:03 |
opendevreview | Merged openstack/project-config master: Remove CI jobs from trio2o https://review.opendev.org/c/openstack/project-config/+/930302 | 14:24 |
opendevreview | Merged openstack/project-config master: Remove jobs for dead projects https://review.opendev.org/c/openstack/project-config/+/930305 | 14:26 |
opendevreview | Merged openstack/project-config master: Remove references to legacy-sandbox-tag job https://review.opendev.org/c/openstack/project-config/+/930319 | 14:28 |
opendevreview | Merged openstack/project-config master: Set launch-timeout on nodepool providers https://review.opendev.org/c/openstack/project-config/+/930388 | 14:34 |
clarkb | frickler: ok. I think I'm personally not comfortable with it happening at all regardless of whether or not buggy (as considered by ovn) behaviors occur. But I'm happy to hold off on making any changes until there are concrete concerns rather than just philosophical ones | 14:51 |
clarkb | If I send a request to a server I either want that server to respond or if anything else responds it should be to indicate non delivery | 14:51 |
clarkb | did anyone else want to weigh in on whether or not we should keep older less maintained but native python tooling (blockdiag/seqdiag) or switch to non native python but maintained tooling (graphviz) for our document graphics generation? | 15:03 |
fungi | vanishing for lunch, but should be back in an hour | 15:17 |
fungi | i can take a look at the chart generator change then | 15:17 |
clarkb | cool thanks | 15:17 |
mordred | clarkb: graphviz is a standard enough tool that it seems like a fine switch in this case | 15:23 |
corvus | we've also been using it in zuul sphinx docs for years | 15:24 |
clarkb | yup I think it should be fine too. Just want to make sure someone isn't going to take over blockdiag maintenance or something | 15:24 |
corvus | (also, super fun fact that's not useful here but i like to share: we literally run graphviz in the zuul web ui via wasm) | 15:26 |
corvus | client side | 15:27 |
mordred | <timburke> "well, "should be" -- i suppose..." <- things could certainly switch to that at this point. the pbr runtime code LONG predates there being a sane API for what it's doing - as evidenced by the fallback behavior to use pkg_resources if importlib isn't around. It wasn't reasonable back in 2012 :) | 15:29 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add a role to set ulimits https://review.opendev.org/c/zuul/zuul-jobs/+/930493 | 16:33 |
corvus | clarkb fungi : i've been going through zuul test failures, and 4/10 of the ones i checked were on rax-flex and hit ulimit errors. i can't think of why that would be related to rax-flex other than just a different system causing different timings. to address it, i have changes that increase the ulimit: | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul-jobs/+/930493 Add a role to set ulimits [NEW] | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/930494 Update ulimits before running tests [NEW] | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/930495 DNM: exercise ulimit change [NEW] | 16:38 |
corvus | i included an output before changing the ulimit so if there is (somehow) a difference on different providers we can see it. i don't expect that, but i thought for completeness i should check and eliminate that as a possibility. | 16:39 |
clarkb | corvus: some neutron tests were also hitting ulimits with files open. These jobs used devstack and devstack collected open file counts but not paths and processes for those files. I suggested they modify the file count collector to do an lsof dump after crossing some threshold to debug further. This was after they bumped up the ulimit too. Anyway just more data. I think bumping the | 16:40 |
clarkb | limit is a good first step | 16:40 |
clarkb | I don't know that they attributed it to a specific provider | 16:40 |
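A collector along those lines might look roughly like this; a minimal sketch, not the actual devstack script, with the threshold, sampling interval, and log path as made-up placeholders:

```python
import subprocess
import time

# Placeholder values; the real collector would tune these per job.
THRESHOLD = 100_000          # dump details once the count crosses this
INTERVAL = 30                # seconds between samples
LOGFILE = "/tmp/open-files.log"


def open_file_count() -> int:
    """Return the system-wide allocated file handle count from /proc."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, _maximum = f.read().split()
    return int(allocated)


while True:
    count = open_file_count()
    with open(LOGFILE, "a") as log:
        log.write(f"{time.time():.0f} open files: {count}\n")
        if count > THRESHOLD:
            # Only past the threshold do we pay for an lsof dump, which
            # records the owning processes and the paths they hold open.
            result = subprocess.run(
                ["lsof", "-n", "-P"], capture_output=True, text=True
            )
            log.write(result.stdout)
    time.sleep(INTERVAL)
```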
corvus | yeah. i also have had issues running tests locally due to ulimits, but have not seen them in the gate until now. locally i run something like 16-20 in parallel, so that's more expected. the gate is less parallelized so it's a bit surprising. | 16:42 |
corvus | would be cool if dstat would collect the numbers for us | 16:43 |
corvus | --fs is "enable filesystem stats (open files, inodes) " | 16:43 |
clarkb | fwiw there is a remaining slow to boot focal node in the raxflex region, but I think that is because the noderequest was first processed before the launch-timeout update applied, so we're using the old default for all three boot attempts? | 16:48 |
clarkb | and the error count is much more consistently high. I'm going to manually try to boot a server out of band of nodepool to see if the nodes ever go ready and to collect console info etc if possible | 16:49 |
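For reference, that kind of out-of-band boot check can be scripted with openstacksdk's cloud layer; a minimal sketch, assuming the create_server/get_server_console helpers, where the cloud name, server name, flavor, and network are placeholders and only the image label mirrors the focal image being debugged:

```python
import openstack

# Placeholder cloud entry from clouds.yaml plus placeholder flavor/network.
conn = openstack.connect(cloud="raxflex")

server = conn.create_server(
    name="clarkb-boot-test",
    image="ubuntu-focal-1726786632",
    flavor="placeholder-flavor",
    network="placeholder-network",
    wait=True,          # block until the server goes ACTIVE or errors out
    timeout=1800,
)
print(server.status)

# If the instance gets far enough to boot, grab the console log for comparison.
print(conn.get_server_console(server))

conn.delete_server(server.id, wait=True)
```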
corvus | clarkb: re timeout update: likely so | 16:51 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Include filesystem stats in dstat https://review.opendev.org/c/zuul/zuul-jobs/+/930497 | 16:53 |
corvus | i updated the exercise change to depend on that so we can get some numbers | 16:54 |
corvus | oh look, that failed with a linter error | 16:55 |
corvus | roles/ulimit/tasks/main.yaml:13: command-instead-of-shell: Use shell only when shell functionality is required. | 16:55 |
clarkb | my test node went active almost immediately | 16:55 |
corvus | guess what is a shell command and not a binary? | 16:56 |
clarkb | ulimit? | 16:56 |
corvus | yes! | 16:56 |
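Because `ulimit` is a bash builtin rather than an executable, the task genuinely needs the shell module there and the lint rule can be skipped for it. For completeness, the in-process equivalent from Python is the `resource` module; a minimal sketch, not what the zuul-jobs role actually does:

```python
import resource

# Raise the soft open-files limit as far as the hard limit allows. This only
# affects the current process and its children, which is why the job has to
# run `ulimit -n` in the same shell that later launches the tests.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")
```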
fungi | corvus: clarkb: the only significant difference i can think of in raxflex is the cpu count... could using fewer cpus result in more files open in parallel? seems like more cpus->more parallelism->more open files rather than the other way around | 16:58 |
corvus | fungi: i agree it's counter-intuitive | 16:59 |
mordred | maybe fewer cpus means parallel tasks are stacking more / the work queue isn't getting drained as quickly, so tasks are getting spun up in parallel the same and are opening files then waiting for processing, but it's taking longer to close out things? | 17:00 |
mordred | (just thinking out loud) | 17:00 |
clarkb | the hostId from my successful but and a random focal stuck in build I looked at are different. Could be hypervisor specific? I don't really have strong evidence of that yet and I don't really ever see any successful focal boots from nodepool | 17:00 |
clarkb | s/successful but/successful boot/ | 17:01 |
corvus | mordred: yeah, this is showing up a lot in executor git repo cleanup actually; what you describe could happen there | 17:01 |
fungi | so basically more queuing | 17:01 |
clarkb | I lied there is a successful focal boot from nodepool | 17:01 |
corvus | i also pushed up a change to run a bunch of tests on current master vs the niz stack i was looking at (in case something about the in-progress niz work was causing it) | 17:02 |
clarkb | and the successful nodepool boot has the same hostId as my test | 17:02 |
clarkb | I'm going to delete my test and retry a couple of times to see if I can get it to go slowly and then compare hostIds | 17:02 |
clarkb | 7bd9e7a2-b89f-4a2f-913c-903e482d9e6a | np0038616486 | ubuntu-focal-1726786632 is a successful example | 17:04 |
clarkb | c413a91c-1481-4ffd-8e0d-d7f28b85633b | np0038617550 | ubuntu-focal-1726786632 is a failed example | 17:04 |
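For anyone reproducing the comparison, the hostId values can be pulled via openstacksdk's compute proxy; a short sketch using the two example instances above, with the cloud name as a placeholder:

```python
import openstack

conn = openstack.connect(cloud="raxflex")  # placeholder cloud name

# The two example instances above: one that booted, one that did not.
for uuid in (
    "7bd9e7a2-b89f-4a2f-913c-903e482d9e6a",  # successful example
    "c413a91c-1481-4ffd-8e0d-d7f28b85633b",  # failed example
):
    server = conn.compute.get_server(uuid)
    # hostId is a per-project hash of the hypervisor host, so two servers
    # sharing a hostId landed on the same hypervisor.
    print(server.id, server.status, server.host_id)
```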
clarkb | booting one at a time was always successful. I just tried booting 5 close together. We'll see what happens with placement now | 17:10 |
clarkb | I think the fifth one ended up on the potentially bad hypervisor based on hostId and didn't go active as quickly as the others. It is currently 17:11; if it isn't active in say 30 minutes I'll write up an email blaming that hostId/hypervisor | 17:12 |
clarkb | noonedeadpunk: ^ fyi followup on yesterday's debugging. I think maybe at least one hypervisor is sad | 17:14 |
opendevreview | Merged opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 17:22 |
opendevreview | Merged opendev/base-jobs master: Fine-tune graphviz sequence diagrams https://review.opendev.org/c/opendev/base-jobs/+/930358 | 17:24 |
clarkb | cool I'll work on porting those to the repos once I get a few other things done | 17:25 |
clarkb | this is interesting the hostId changed for that node. I wonder if there is rescheduling in the background? | 17:39 |
clarkb | email sent | 17:55 |
corvus | sample size of 1: but one of my "baseline" tests (current master, no ulimit changes) hit "too many open files" on rax-flex. so that probably excludes the in-progress niz stack as a cause. | 18:06 |
fungi | disappearing around the corner for fall vaccines, should only be gone a few minutes hopefully | 18:16 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz https://review.opendev.org/c/zuul/zuul-jobs/+/930502 | 18:20 |
clarkb | I need to go and get that done; I should look at my calendar | 18:28 |
noonedeadpunk | thanks for the update! | 18:41 |
clarkb | `Exceeded max scheduling attempts 3 for instance $UUID Last exception: [Errno 32] Corrupt image download. Hash was $HASH` | 18:57 |
clarkb | so if we let it go long enough we eventually get that error and the instance goes into an ERROR state. I've sent a followup email | 18:57 |
corvus | after increasing the ulimit, those jobs are now failing with 'ValueError: filedescriptor out of range in select()' indicating that, indeed, we do have >1024 files open since select has an fd number limit of 1024 | 19:01 |
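That ValueError is a limit in select() itself rather than the ulimit: fd_set is a fixed-size bitmap (FD_SETSIZE, normally 1024), so descriptor numbers of 1024 or higher are rejected no matter how high RLIMIT_NOFILE goes. A small demonstration, assuming the soft nofile limit has already been raised past ~1100 so the opens themselves succeed:

```python
import select

# Burn through enough descriptors that the last fd number exceeds 1023.
files = [open("/dev/null") for _ in range(1100)]
try:
    select.select([files[-1]], [], [], 0)
except ValueError as err:
    # "filedescriptor out of range in select()"; the fix is to move the
    # affected code to selectors/poll/epoll or stop leaking descriptors,
    # not to raise the ulimit further.
    print(err)
finally:
    for f in files:
        f.close()
```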
corvus | still 100% correlation with rax-flex | 19:04 |
corvus | here are the results, with 1 test still running: https://paste.opendev.org/show/bDXnwLpb8jmo49WExlQ0/ | 19:06 |
clarkb | so ya maybe things are piling up due to the smaller cpu count? | 19:24 |
opendevreview | Merged opendev/system-config master: Update Gitea to v1.22.2 https://review.opendev.org/c/opendev/system-config/+/930217 | 19:27 |
clarkb | gitea09 has updated and https://gitea09.opendev.org:3081/opendev/system-config/ loads for me | 19:31 |
fungi | yeah, working for me so far | 19:32 |
clarkb | fungi: maybe you can do a git clone just to sanity check that? I'll keep watching as it rotates through nodes | 19:33 |
clarkb | 9 10 and 11 are done | 19:33 |
clarkb | and web responses from all three look ok to me | 19:34 |
fungi | yeah, working on it | 19:34 |
clarkb | tyty | 19:34 |
fungi | `git clone https://gitea09.opendev.org:3000/opendev/bindep` worked fine | 19:34 |
fungi | nova is in progress but will take a few minutes | 19:35 |
clarkb | awesome. 12 is done now too | 19:36 |
clarkb | and now only 14 remains | 19:37 |
clarkb | and now 14 is done. The whole cluster should be running 1.22.2 | 19:38 |
*** elodilles is now known as elodilles_pto | 19:38 | |
fungi | my nova clone is nearly finished | 19:38 |
clarkb | the job reported success too so from the config management side all is well | 19:39 |
fungi | my nova git clone from 09 completed successfully and without errors | 19:40 |
clarkb | the other thing to check would be replication | 19:41 |
clarkb | but I'm not too worried about that | 19:41 |
clarkb | I'm cleaning up my own autohold for etherpad and notice frickler has one for debugging nodepool stuff which I think ended up being fixed by the newer microk8s stuff so I'll delete that. corvus you also have a bullseye image build debug hold, can I delete that one too? I think bullseye images are working | 19:55 |
clarkb | if frickler wasn't out for the next 3 weeks I'd wait for an answer on this one, but I'm like 95% certain and frickler is out so I'll go for it | 19:55 |
opendevreview | Merged zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz https://review.opendev.org/c/zuul/zuul-jobs/+/930502 | 19:56 |
corvus | clarkb: yep | 19:57 |
clarkb | thanks both of those have been deleted too | 19:58 |
corvus | new theory: something in the tests is leaking files; they are per-test-process. fewer test procs means more open files. | 20:22 |
corvus | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_522/930498/2/check/zuul-nox-py312-0/5220fb7/dstat.html | 20:23 |
corvus | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d2d/930498/2/check/zuul-nox-py312-1/d2dec09/dstat.html | 20:23 |
corvus | those have the same total files, but i think one has it spread over 4 processes, and one over 2 | 20:23 |
corvus | or maybe it's 3 and 5... something like that | 20:24 |
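One way to check that theory is to sample /proc/<pid>/fd for the stestr worker processes during a run and watch whether the per-process counts climb; a rough sketch, where the cmdline match ("subunit") is only a guess at how the workers show up on the test node:

```python
import os


def fd_counts(name_fragment: str) -> dict[int, int]:
    """Map pid -> open fd count for processes whose cmdline matches."""
    counts = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name_fragment in cmdline:
                counts[int(entry)] = len(os.listdir(f"/proc/{entry}/fd"))
        except (FileNotFoundError, PermissionError):
            # Process exited or belongs to another user; skip it.
            continue
    return counts


for pid, count in sorted(fd_counts("subunit").items()):
    print(pid, count)
```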
clarkb | oh that would explain it | 20:42 |
mordred | mmm. and much less esoteric than the previous hypothesis | 21:16 |
mordred | because stestr backends are based on nproc right? | 21:17 |
fungi | scaled by the number of processors i think, but i don't recall the scaling formula | 21:18 |
*** mtreinish_ is now known as mtreinish | 23:09 | |
corvus | i believe i have found the leaks. it was not trivial. | 23:34 |