Wednesday, 2024-09-25

fricklerclarkb: regarding OVN+DNS, I forgot that with recent OVN versions, the affected queries should pass untouched, see the commit reference in the so issue https://github.com/ovn-org/ovn/commit/4b10571aa89b226c13a8c5551ceb7208d782b58008:02
fricklerso that confirms my choice of "no need to do anything unless we see an actual issue"08:03
opendevreviewMerged openstack/project-config master: Remove CI jobs from trio2o  https://review.opendev.org/c/openstack/project-config/+/93030214:24
opendevreviewMerged openstack/project-config master: Remove jobs for dead projects  https://review.opendev.org/c/openstack/project-config/+/93030514:26
opendevreviewMerged openstack/project-config master: Remove references to legacy-sandbox-tag job  https://review.opendev.org/c/openstack/project-config/+/93031914:28
opendevreviewMerged openstack/project-config master: Set launch-timeout on nodepool providers  https://review.opendev.org/c/openstack/project-config/+/93038814:34
clarkbfrickler: ok. I think I'm personally not comfortable with it happening at all regardless of whether or not buggy (as considered by ovn) behaviors occur. But I'm happy to hold off on making any changes until there are concrete concerns rather than just philosophical ones14:51
clarkbIf I send a request to a server I either want that server to respond or if anything else responds it should be to indicate non delivery14:51
clarkbdid anyone else want to weigh in on whether or not we should keep older less maintained but native python tooling (blockdiag/seqdiag) or switch to non native python but maintained tooling (graphviz) for our document graphics generation?15:03
fungivanishing for lunch, but should be back in an hour15:17
fungii can take a look at the chart generator change then15:17
clarkbcool thanks15:17
mordredclarkb: graphviz is a standard enough tool that it seems like a fine switch in this case15:23
corvuswe've also been using it in zuul sphinx docs for years15:24
clarkbyup I think it should be fine too. Just want to make sure someone isn't going to take over blockdiag maintenance or something15:24
corvus(also, super fun fact that's not useful here but i like to share: we literally run graphviz in the zuul web ui via wasm)15:26
corvusclient side15:27
mordred<timburke> "well, "should be" -- i suppose..." <- things could certainly switch to that at this point. the pbr runtime code LONG predates there being a sane API for what it's doing - as evidenced by the fallback behavior to use pkg_resources if importlib isn't around. It wasn't reasonable back in 2012 :)15:29
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add a role to set ulimits  https://review.opendev.org/c/zuul/zuul-jobs/+/93049316:33
corvusclarkb fungi : i've been going through zuul test failures, and 4/10 of the ones i checked were on rax-flex and hit ulimit errors.  i can't think of why that would be related to rax-flex other than just a different system causing different timings.  to address it, i have changes that increase the ulimit:16:38
corvusremote:   https://review.opendev.org/c/zuul/zuul-jobs/+/930493 Add a role to set ulimits [NEW]        16:38
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/930494 Update ulimits before running tests [NEW]        16:38
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/930495 DNM: exercise ulimit change [NEW]        16:38
corvusi included an output before changing the ulimit so if there is (somehow) a difference on different providers we can see it.  i don't expect that, but i thought for completeness i should check and eliminate that as a possibility.16:39
clarkbcorvus: some neutron tests were also hitting ulimits with files open. These jobs used devstack and devstack collected open file counts but not paths and processes for those files. I suggested they modify the file count collector to do an lsof dump after crossing some threshold to debug further. This was after they bumped up the ulimit too. Anyway just more data. I think bumping the16:40
clarkblimit is a good first step16:40
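The "dump open files after crossing a threshold" idea can be sketched without lsof. This is a minimal illustration, not the devstack collector: it assumes a Linux host where `/proc/self/fd` is available, and the helper names `open_fd_targets` and `dump_if_over` are hypothetical.

```python
import os

# Assumption: Linux procfs; each entry is a symlink to what the fd refers to.
FD_DIR = "/proc/self/fd"


def open_fd_targets():
    """Map each open fd number of this process to the path or object it points at."""
    targets = {}
    for name in os.listdir(FD_DIR):
        try:
            targets[int(name)] = os.readlink(os.path.join(FD_DIR, name))
        except OSError:
            # The fd used to list the directory is already closed by now.
            continue
    return targets


def dump_if_over(threshold):
    """Return an lsof-like listing only once the open fd count exceeds threshold."""
    targets = open_fd_targets()
    if len(targets) <= threshold:
        return None
    return [f"{fd} -> {path}" for fd, path in sorted(targets.items())]
```

Calling `dump_if_over(800)` periodically from a test process would stay quiet until the count passes 800 and then show which paths and sockets are held open, which is the missing information when only counts are collected.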
clarkbI don't know that they attributed it to a specific provider16:40
corvusyeah.  i also have had issues running tests locally due to ulimits, but have not seen them in the gate until now.  locally i run something like 16-20 in parallel, so that's more expected.  the gate is less parallelized so it's a bit surprising.16:42
corvuswould be cool if dstat would collect the numbers for us16:43
corvus--fs is "enable filesystem stats (open files, inodes) "16:43
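For reference, a system-wide open file count is also exposed by the kernel directly. The sketch below assumes the Linux `/proc/sys/fs/file-nr` interface (three integers: allocated handles, allocated but unused handles, and the maximum); whether dstat reads exactly this file is an assumption, not something stated in the discussion. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: Linux procfs path for system-wide file handle counters.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Parse the three whitespace-separated counters from file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def read_file_nr():
    """Return the current counters, or None where the file does not exist."""
    if not FILE_NR.exists():
        return None
    return parse_file_nr(FILE_NR.read_text())
```

Note that these numbers are system-wide, while the limit raised by `ulimit -n` applies per process, so a host total can look modest even when one process is at its limit.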
clarkbfwiw there is a remaining slow to boot focal node in the raxflex region, but I think that is because the noderequest was first processed before the launch-timeout update applied so we're using the old default for all three boot attempts?16:48
clarkband the error count is much more consistently high. I'm going to manually try to boot a server out of band of nodepool to see if the nodes ever go ready and to collect console info etc if possible16:49
corvusclarkb: re timeout update: likely so 16:51
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Include filesystem stats in dstat  https://review.opendev.org/c/zuul/zuul-jobs/+/93049716:53
corvusi updated the exercise change to depend on that so we can get some numbers16:54
corvusoh look, that failed with a linter error16:55
corvusroles/ulimit/tasks/main.yaml:13: command-instead-of-shell: Use shell only when shell functionality is required.16:55
clarkbmy test node went active almost immediately16:55
corvusguess what is a shell command and not a binary?16:56
clarkbulimit?16:56
corvusyes!16:56
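The reason `ulimit` has to be a shell builtin is that resource limits belong to the process that sets them and are inherited by its children; a separate binary could only change its own limits and then exit. The same per-process behaviour can be seen from Python with the standard `resource` module. This is a minimal sketch for Unix-like systems, and `raise_nofile_soft_limit` is a hypothetical helper, not part of the zuul-jobs role.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise this process's soft open-files limit toward target, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        # Already at or above the requested value; nothing to change.
        return soft, hard
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Anything started from this process afterwards (for example test workers launched with `subprocess`) inherits the new limit, which mirrors running `ulimit -n` in the shell that then launches the tests.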
fungicorvus: clarkb: the only significant difference i can think of in raxflex is the cpu count... could using fewer cpus result in more files open in parallel? seems like more cpus->more parallelism->more open files rather than the other way around16:58
corvusfungi: i agree it's counter-intuitive16:59
mordredmaybe fewer cpus means parallel tasks are stacking more / the work queue isn't getting drained as quickly, so tasks are getting spun up in parallel the same and are opening files then waiting for processing, but it's taking longer to close out things?17:00
mordred(just thinking out loud)17:00
clarkbthe hostId from my successful but and a random focal stuck in build I looked at are different. Could be hypervisor specific? I don't really have strong evidence of that yet and I don't really ever see any successful focal boots from nodepool17:00
clarkbs/successful but/successful boot/17:01
corvusmordred: yeah, this is showing up a lot in executor git repo cleanup actually; what you describe could happen there17:01
fungiso basically more queuing17:01
clarkbI lied there is a successful focal boot from nodepool17:01
corvusi also pushed up a change to run a bunch of tests on current master vs the niz stack i was looking at (in case something about the in-progress niz work was causing it)17:02
clarkband the successful nodepool boot has the same hostId as my test17:02
clarkbI'm going to delete my test and retry a couple of times to see if I can get it to go slowly and then compare hostIds17:02
clarkb7bd9e7a2-b89f-4a2f-913c-903e482d9e6a | np0038616486 | ubuntu-focal-1726786632 is a successful example17:04
clarkbc413a91c-1481-4ffd-8e0d-d7f28b85633b | np0038617550 | ubuntu-focal-1726786632 is a failed example17:04
clarkbbooting one at a time was always successful. I just tried booting 5 close together. We'll see what happens with placement now17:10
clarkbI think the fifth one ended up on the potentially bad hypervisor based on hostid and didn't go active as quickly as the others. It is currently 17:11 if it isn't active in say 30 minutes I'll write up an email blaming that hostId/hypervisor17:12
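The hostId comparison can be made systematic by grouping servers by `hostId` and counting statuses. The sketch below works on plain dictionaries with `hostId` and `status` keys, as they appear in server listings; the sample values in the usage note are placeholders, not data from this debugging session, and both function names are hypothetical.

```python
from collections import defaultdict


def summarize_by_host(servers):
    """Count server statuses per hostId."""
    summary = defaultdict(lambda: defaultdict(int))
    for server in servers:
        summary[server["hostId"]][server["status"]] += 1
    return {host: dict(counts) for host, counts in summary.items()}


def suspect_hosts(servers):
    """Return hostIds on which no server ever reached ACTIVE."""
    summary = summarize_by_host(servers)
    return sorted(host for host, counts in summary.items()
                  if counts.get("ACTIVE", 0) == 0)
```

With placeholder input such as `[{"hostId": "host-a", "status": "ACTIVE"}, {"hostId": "host-b", "status": "BUILD"}]`, only `host-b` is flagged. A small sample can mislead here, since a host that simply received few boots will look the same as a broken one.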
clarkbnoonedeadpunk: ^ fyi followup on yesterday's debugging. I think maybe at least one hypervisor is sad17:14
opendevreviewMerged opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs  https://review.opendev.org/c/opendev/base-jobs/+/93008217:22
opendevreviewMerged opendev/base-jobs master: Fine-tune graphviz sequence diagrams  https://review.opendev.org/c/opendev/base-jobs/+/93035817:24
clarkbcool I'll work on porting those to the repos once I get a few other things done17:25
clarkbthis is interesting the hostId changed for that node. I wonder if there is rescheduling in the background?17:39
clarkbemail sent17:55
corvussample size of 1: but one of my "baseline" tests (current master, no ulimit changes) hit "too many open files" on rax-flex.  so that probably excludes the in-progress niz stack as a cause.18:06
fungidisappearing around the corner for fall vaccines, should only be gone a few minutes hopefully18:16
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz  https://review.opendev.org/c/zuul/zuul-jobs/+/93050218:20
clarkbI need to go and get that done I should look at my calendar18:28
noonedeadpunkthanks for the update!18:41
clarkb`Exceeded max scheduling attempts 3 for instance $UUID Last exception: [Errno 32] Corrupt image download. Hash was $HASH`18:57
clarkbso if we let it go long enough we eventually get that error and the instance goes into an ERROR state. I've sent a followup email18:57
corvusafter increasing the ulimit, those jobs are now failing with 'ValueError: filedescriptor out of range in select()' indicating that, indeed, we do have >1024 files open since select has an fd number limit of 102419:01
corvusstill 100% correlation with raxflex19:04
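The select() failure can be reproduced in isolation. The sketch below assumes Linux (where `FD_SETSIZE` is 1024 and `select.poll` is available) and a hard open-files limit that allows descriptors above 1024; `high_fd` and `select_error` are hypothetical helpers. It duplicates a pipe onto a descriptor numbered 1024 or higher, then shows that `select.select` rejects it while `poll` does not.

```python
import fcntl
import os
import resource
import select


def high_fd(min_fd=1024):
    """Return a readable fd numbered >= min_fd, or None if limits forbid it."""
    wanted = min_fd + 16
    try:
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft != resource.RLIM_INFINITY and soft < wanted:
            if hard != resource.RLIM_INFINITY and hard < wanted:
                return None
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        read_end, write_end = os.pipe()
    except (ValueError, OSError):
        return None
    try:
        # F_DUPFD picks the lowest free descriptor >= min_fd.
        return fcntl.fcntl(read_end, fcntl.F_DUPFD, min_fd)
    except OSError:
        return None
    finally:
        os.close(read_end)
        os.close(write_end)


def select_error(fd):
    """Return the ValueError message select() raises for fd, or None if it works."""
    try:
        select.select([fd], [], [], 0)
    except ValueError as exc:
        return str(exc)
    return None
```

Raising the ulimit therefore only moves the failure: code that uses `select()` still breaks once any descriptor number reaches 1024, whereas `select.poll` or the `selectors` module do not have that ceiling.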
corvushere are the results, with 1 test still running: https://paste.opendev.org/show/bDXnwLpb8jmo49WExlQ0/19:06
clarkbso ya maybe things are piling up due to the smaller cpu count?19:24
opendevreviewMerged opendev/system-config master: Update Gitea to v1.22.2  https://review.opendev.org/c/opendev/system-config/+/93021719:27
clarkbgitea09 has updated and https://gitea09.opendev.org:3081/opendev/system-config/ loads for me19:31
fungiyeah, working for me so far19:32
clarkbfungi: maybe you can do a git clone just to sanity check that? I'll keep watching as it rotates through nodes19:33
clarkb9 10 and 11 are done19:33
clarkband web responses from all three look ok to me19:34
fungiyeah, working on it19:34
clarkbtyty19:34
fungi`git clone https://gitea09.opendev.org:3000/opendev/bindep` worked fine19:34
funginova is in progress but will take a few minutes19:35
clarkbawesome. 12 is done now too19:36
clarkband now only 14 remains19:37
clarkband now 14 is done. The whole cluster should be running 1.22.219:38
*** elodilles is now known as elodilles_pto19:38
fungimy nova clone is nearly finished19:38
clarkbthe job reported success too so from the config management side all is well19:39
fungimy nova git clone from 09 completed successfully and without errors19:40
clarkbthe other thing to check would be replication19:41
clarkbbut I'm not too worried about that19:41
clarkbI'm cleaning up my own autohold for etherpad and notice frickler has one for debugging nodepool stuff which I think ended up being fixed by the newer microk8s stuff so I'll delete that. corvus you also have a bullseye image build debug hold can I delete that one too? I think bullseye images are working19:55
clarkbif frickler wasn't out for the next 3 weeks I'd wait for an answer on this one but I'm like 95% certain and frickler is out so I'll go for it19:55
opendevreviewMerged zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz  https://review.opendev.org/c/zuul/zuul-jobs/+/93050219:56
corvusclarkb: yep19:57
clarkbthanks both of those have been deleted too19:58
corvusnew theory: something in the tests is leaking files; they are per-test-process.  fewer test procs means more open files.20:22
corvushttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_522/930498/2/check/zuul-nox-py312-0/5220fb7/dstat.html20:23
corvushttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d2d/930498/2/check/zuul-nox-py312-1/d2dec09/dstat.html20:23
corvusthose have the same total files, but i think one has it spread over 4 procs, and one over 220:23
corvusor maybe it's 3 and 5... something like that20:24
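The arithmetic behind this theory is simple: if each test leaks descriptors inside its worker process, the same total is split across fewer processes on a node with fewer CPUs, so each process gets closer to its own limit. The numbers below are made up for illustration (2000 tests, 1 leaked fd per test, 50 baseline fds); they are not measurements from these jobs, and `fds_per_worker` is a hypothetical helper.

```python
import math


def fds_per_worker(total_tests, leaked_per_test, workers, baseline=50):
    """Estimate open fds in one worker once its share of the tests has run."""
    tests_per_worker = math.ceil(total_tests / workers)
    return baseline + tests_per_worker * leaked_per_test


# Illustrative only: the same test suite on nodes with different worker counts.
for workers in (8, 4, 2):
    print(workers, fds_per_worker(2000, 1, workers))
```

Under these assumed numbers, 8 workers end at 300 descriptors each and 4 workers at 550, but 2 workers reach 1050, past a 1024 limit, even though the total across all workers is similar in every case.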
clarkboh that would explain it20:42
mordredmmm. and much less esoterically than the previous hypothesis21:16
mordredbecause stestr backends are based on nproc right?21:17
fungiscaled by the number of processors i think, but i don't recall the scaling formula21:18
*** mtreinish_ is now known as mtreinish23:09
corvusi believe i have found the leaks.  it was not trivial.23:34
