frickler | clarkb: regarding OVN+DNS, I had forgotten that with recent OVN versions the affected queries should pass untouched, see the commit referenced in the SO issue https://github.com/ovn-org/ovn/commit/4b10571aa89b226c13a8c5551ceb7208d782b580 | 08:02 |
frickler | so that confirms my choice of "no need to do anything unless we see an actual issue" | 08:03 |
opendevreview | Merged openstack/project-config master: Remove CI jobs from trio2o https://review.opendev.org/c/openstack/project-config/+/930302 | 14:24 |
opendevreview | Merged openstack/project-config master: Remove jobs for dead projects https://review.opendev.org/c/openstack/project-config/+/930305 | 14:26 |
opendevreview | Merged openstack/project-config master: Remove references to legacy-sandbox-tag job https://review.opendev.org/c/openstack/project-config/+/930319 | 14:28 |
opendevreview | Merged openstack/project-config master: Set launch-timeout on nodepool providers https://review.opendev.org/c/openstack/project-config/+/930388 | 14:34 |
clarkb | frickler: ok. I think I'm personally not comfortable with it happening at all regardless of whether or not buggy (as considered by ovn) behaviors occur. But I'm happy to hold off on making any changes until there are concrete concerns rather than just philosophical ones | 14:51 |
clarkb | If I send a request to a server I either want that server to respond or if anything else responds it should be to indicate non delivery | 14:51 |
clarkb | did anyone else want to weigh in on whether or not we should keep older less maintained but native python tooling (blockdiag/seqdiag) or switch to non native python but maintained tooling (graphviz) for our document graphics generation? | 15:03 |
fungi | vanishing for lunch, but should be back in an hour | 15:17 |
fungi | i can take a look at the chart generator change then | 15:17 |
clarkb | cool thanks | 15:17 |
mordred | clarkb: graphviz is a standard enough tool that it seems like a fine switch in this case | 15:23 |
corvus | we've also been using it in zuul sphinx docs for years | 15:24 |
clarkb | yup I think it should be fine too. Just want to make sure someone isn't going to take over blockdiag maintenance or something | 15:24 |
corvus | (also, super fun fact that's not useful here but i like to share: we literally run graphviz in the zuul web ui via wasm) | 15:26 |
corvus | client side | 15:27 |
mordred | <timburke> "well, "should be" -- i suppose..." <- things could certainly switch to that at this point. the pbr runtime code LONG predates there being a sane API for what it's doing - as evidenced by the fallback behavior to use pkg_resources if importlib isn't around. It wasn't reasonable back in 2012 :) | 15:29 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add a role to set ulimits https://review.opendev.org/c/zuul/zuul-jobs/+/930493 | 16:33 |
corvus | clarkb fungi : i've been going through zuul test failures, and 4/10 of the ones i checked were on rax-flex and hit ulimit errors. i can't think of why that would be related to rax-flex other than just a different system causing different timings. to address it, i have changes that increase the ulimit: | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul-jobs/+/930493 Add a role to set ulimits [NEW] | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/930494 Update ulimits before running tests [NEW] | 16:38 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/930495 DNM: exercise ulimit change [NEW] | 16:38 |
corvus | i included an output before changing the ulimit so if there is (somehow) a difference on different providers we can see it. i don't expect that, but i thought for completeness i should check and eliminate that as a possibility. | 16:39 |
clarkb | corvus: some neutron tests were also hitting ulimits with files open. These jobs used devstack and devstack collected open file counts but not paths and processes for those files. I suggested they modify the file count collector to do an lsof dump after crossing some threshold to debug further. This was after they bumped up the ulimit too. Anyway just more data. I think bumping the | 16:40 |
clarkb | limit is a good first step | 16:40 |
clarkb | I don't know that they attributed it to a specific provider | 16:40 |
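A collector along those lines might look roughly like this; a minimal sketch, not the actual devstack script, with the threshold, sampling interval, and log path as made-up placeholders:

```python
import subprocess
import time

# Placeholder values; the real collector would tune these per job.
THRESHOLD = 100_000          # dump details once the count crosses this
INTERVAL = 30                # seconds between samples
LOGFILE = "/tmp/open-files.log"


def open_file_count() -> int:
    """Return the system-wide allocated file handle count from /proc."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, _maximum = f.read().split()
    return int(allocated)


while True:
    count = open_file_count()
    with open(LOGFILE, "a") as log:
        log.write(f"{time.time():.0f} open files: {count}\n")
        if count > THRESHOLD:
            # Only past the threshold do we pay for an lsof dump, which
            # records the owning processes and the paths they hold open.
            result = subprocess.run(
                ["lsof", "-n", "-P"], capture_output=True, text=True
            )
            log.write(result.stdout)
    time.sleep(INTERVAL)
```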
corvus | yeah. i also have had issues running tests locally due to ulimits, but have not seen them in the gate until now. locally i run something like 16-20 in parallel, so that's more expected. the gate is less parallelized so it's a bit surprising. | 16:42 |
corvus | would be cool if dstat would collect the numbers for us | 16:43 |
corvus | --fs is "enable filesystem stats (open files, inodes) " | 16:43 |
clarkb | fwiw there is a remaining slow to boot focal node in the raxflex region, but I think that is because the noderequest was first processed before the launch-timeout update applied, so we're using the old default for all three boot attempts? | 16:48 |
clarkb | and the error count is much more consistently high. I'm going to manually try to boot a server out of band of nodepool to see if the nodes ever go ready and to collect console info etc if possible | 16:49 |
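For reference, that kind of out-of-band boot check can be scripted with openstacksdk's cloud layer; a minimal sketch, assuming the create_server/get_server_console helpers, where the cloud name, server name, flavor, and network are placeholders and only the image label mirrors the focal image being debugged:

```python
import openstack

# Placeholder cloud entry from clouds.yaml plus placeholder flavor/network.
conn = openstack.connect(cloud="raxflex")

server = conn.create_server(
    name="clarkb-boot-test",
    image="ubuntu-focal-1726786632",
    flavor="placeholder-flavor",
    network="placeholder-network",
    wait=True,          # block until the server goes ACTIVE or errors out
    timeout=1800,
)
print(server.status)

# If the instance gets far enough to boot, grab the console log for comparison.
print(conn.get_server_console(server))

conn.delete_server(server.id, wait=True)
```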
corvus | clarkb: re timeout update: likely so | 16:51 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Include filesystem stats in dstat https://review.opendev.org/c/zuul/zuul-jobs/+/930497 | 16:53 |
corvus | i updated the exercise change to depend on that so we can get some numbers | 16:54 |
corvus | oh look, that failed with a linter error | 16:55 |
corvus | roles/ulimit/tasks/main.yaml:13: command-instead-of-shell: Use shell only when shell functionality is required. | 16:55 |
clarkb | my test node went active almost immediately | 16:55 |
corvus | guess what is a shell command and not a binary? | 16:56 |
clarkb | ulimit? | 16:56 |
corvus | yes! | 16:56 |
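Because `ulimit` is a bash builtin rather than an executable, the task genuinely needs the shell module there and the lint rule can be skipped for it. For completeness, the in-process equivalent from Python is the `resource` module; a minimal sketch, not what the zuul-jobs role actually does:

```python
import resource

# Raise the soft open-files limit as far as the hard limit allows. This only
# affects the current process and its children, which is why the job has to
# run `ulimit -n` in the same shell that later launches the tests.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")
```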
fungi | corvus: clarkb: the only significant difference i can think of in raxflex is the cpu count... could using fewer cpus result in more files open in parallel? seems like more cpus->more parallelism->more open files rather than the other way around | 16:58 |
corvus | fungi: i agree it's counter-intuitive | 16:59 |
mordred | maybe fewer cpus means parallel tasks are stacking more / the work queue isn't getting drained as quickly, so tasks are getting spun up in parallel the same and are opening files then waiting for processing, but it's taking longer to close out things? | 17:00 |
mordred | (just thinking out loud) | 17:00 |
clarkb | the hostId from my successful but and a random focal stuck in build I looked at are different. Could be hypervisor specific? I don't really have strong evidence of that yet and I don't really ever see any successful focal boots from nodepool | 17:00 |
clarkb | s/successful but/successful boot/ | 17:01 |
corvus | mordred: yeah, this is showing up a lot in executor git repo cleanup actually; what you describe could happen there | 17:01 |
fungi | so basically more queuing | 17:01 |
clarkb | I lied there is a successful focal boot from nodepool | 17:01 |
corvus | i also pushed up a change to run a bunch of tests on current master vs the niz stack i was looking at (in case something about the in-progress niz work was causing it) | 17:02 |
clarkb | and the successful nodepool boot has the same hostId as my test | 17:02 |
clarkb | I'm going to delete my test and retry a couple of times to see if I can get it to go slowly and then compare hostIds | 17:02 |
clarkb | 7bd9e7a2-b89f-4a2f-913c-903e482d9e6a | np0038616486 | ubuntu-focal-1726786632 is a successful example | 17:04 |
clarkb | c413a91c-1481-4ffd-8e0d-d7f28b85633b | np0038617550 | ubuntu-focal-1726786632 is a failed example | 17:04 |
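For anyone reproducing the comparison, the hostId values can be pulled via openstacksdk's compute proxy; a short sketch using the two example instances above, with the cloud name as a placeholder:

```python
import openstack

conn = openstack.connect(cloud="raxflex")  # placeholder cloud name

# The two example instances above: one that booted, one that did not.
for uuid in (
    "7bd9e7a2-b89f-4a2f-913c-903e482d9e6a",  # successful example
    "c413a91c-1481-4ffd-8e0d-d7f28b85633b",  # failed example
):
    server = conn.compute.get_server(uuid)
    # hostId is a per-project hash of the hypervisor host, so two servers
    # sharing a hostId landed on the same hypervisor.
    print(server.id, server.status, server.host_id)
```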
clarkb | booting one at a time was always successful. I just tried booting 5 close together. We'll see what happens with placement now | 17:10 |
clarkb | I think the fifth one ended up on the potentially bad hypervisor based on hostId and didn't go active as quickly as the others. It is currently 17:11; if it isn't active in say 30 minutes I'll write up an email blaming that hostId/hypervisor | 17:12 |
clarkb | noonedeadpunk: ^ fyi followup on yesterday's debugging. I think maybe at least one hypervisor is sad | 17:14 |
opendevreview | Merged opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 17:22 |
opendevreview | Merged opendev/base-jobs master: Fine-tune graphviz sequence diagrams https://review.opendev.org/c/opendev/base-jobs/+/930358 | 17:24 |
clarkb | cool I'll work on porting those to the repos once I get a few other things done | 17:25 |
clarkb | this is interesting the hostId changed for that node. I wonder if there is rescheduling in the background? | 17:39 |
clarkb | email sent | 17:55 |
corvus | sample size of 1: but one of my "baseline" tests (current master, no ulimit changes) hit "too many open files" on rax-flex. so that probably excludes the in-progress niz stack as a cause. | 18:06 |
fungi | disappearing around the corner for fall vaccines, should only be gone a few minutes hopefully | 18:16 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz https://review.opendev.org/c/zuul/zuul-jobs/+/930502 | 18:20 |
clarkb | I need to go and get that done; I should look at my calendar | 18:28 |
noonedeadpunk | thanks for the update! | 18:41 |
clarkb | `Exceeded max scheduling attempts 3 for instance $UUID Last exception: [Errno 32] Corrupt image download. Hash was $HASH` | 18:57 |
clarkb | so if we let it go long enough we eventually get that error and the instance goes into an ERROR state. I've sent a followup email | 18:57 |
corvus | after increasing the ulimit, those jobs are now failing with 'ValueError: filedescriptor out of range in select()' indicating that, indeed, we do have >1024 files open since select has an fd number limit of 1024 | 19:01 |
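That ValueError is a limit in select() itself rather than the ulimit: fd_set is a fixed-size bitmap (FD_SETSIZE, normally 1024), so descriptor numbers of 1024 or higher are rejected no matter how high RLIMIT_NOFILE goes. A small demonstration, assuming the soft nofile limit has already been raised past ~1100 so the opens themselves succeed:

```python
import select

# Burn through enough descriptors that the last fd number exceeds 1023.
files = [open("/dev/null") for _ in range(1100)]
try:
    select.select([files[-1]], [], [], 0)
except ValueError as err:
    # "filedescriptor out of range in select()"; the fix is to move the
    # affected code to selectors/poll/epoll or stop leaking descriptors,
    # not to raise the ulimit further.
    print(err)
finally:
    for f in files:
        f.close()
```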
corvus | still 100% correlation with rax-flex | 19:04 |
corvus | here are the results, with 1 test still running: https://paste.opendev.org/show/bDXnwLpb8jmo49WExlQ0/ | 19:06 |
clarkb | so ya maybe things are piling up due to the smaller cpu count? | 19:24 |
opendevreview | Merged opendev/system-config master: Update Gitea to v1.22.2 https://review.opendev.org/c/opendev/system-config/+/930217 | 19:27 |
clarkb | gitea09 has updated and https://gitea09.opendev.org:3081/opendev/system-config/ loads for me | 19:31 |
fungi | yeah, working for me so far | 19:32 |
clarkb | fungi: maybe you can do a git clone just to sanity check that? I'll keep watching as it rotates through nodes | 19:33 |
clarkb | 9 10 and 11 are done | 19:33 |
clarkb | and web responses from all three look ok to me | 19:34 |
fungi | yeah, working on it | 19:34 |
clarkb | tyty | 19:34 |
fungi | `git clone https://gitea09.opendev.org:3000/opendev/bindep` worked fine | 19:34 |
fungi | nova is in progress but will take a few minutes | 19:35 |
clarkb | awesome. 12 is done now too | 19:36 |
clarkb | and now only 14 remains | 19:37 |
clarkb | and now 14 is done. The whole cluster should be running 1.22.2 | 19:38 |
*** elodilles is now known as elodilles_pto | 19:38 | |
fungi | my nova clone is nearly finished | 19:38 |
clarkb | the job reported success too so from the config management side all is well | 19:39 |
fungi | my nova git clone from 09 completed successfully and without errors | 19:40 |
clarkb | the other thing to check would be replication | 19:41 |
clarkb | but I'm not too worried about that | 19:41 |
clarkb | I'm cleaning up my own autohold for etherpad and notice frickler has one for debugging nodepool stuff which I think ended up being fixed by the newer microk8s stuff so I'll delete that. corvus you also have a bullseye image build debug hold, can I delete that one too? I think bullseye images are working | 19:55 |
clarkb | if frickler wasn't out for the next 3 weeks I'd wait for an answer on this one, but I'm like 95% certain and frickler is out so I'll go for it | 19:55 |
opendevreview | Merged zuul/zuul-jobs master: Replace blockdiag/seqdiag with graphviz https://review.opendev.org/c/zuul/zuul-jobs/+/930502 | 19:56 |
corvus | clarkb: yep | 19:57 |
clarkb | thanks both of those have been deleted too | 19:58 |
corvus | new theory: something in the tests is leaking files; they are per-test-process. fewer test procs means more open files. | 20:22 |
corvus | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_522/930498/2/check/zuul-nox-py312-0/5220fb7/dstat.html | 20:23 |
corvus | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d2d/930498/2/check/zuul-nox-py312-1/d2dec09/dstat.html | 20:23 |
corvus | those have the same total files, but i think one has it spread over 4 processes, and one over 2 | 20:23 |
corvus | or maybe it's 3 and 5... something like that | 20:24 |
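One way to check that theory is to sample /proc/<pid>/fd for the stestr worker processes during a run and watch whether the per-process counts climb; a rough sketch, where the cmdline match ("subunit") is only a guess at how the workers show up on the test node:

```python
import os


def fd_counts(name_fragment: str) -> dict[int, int]:
    """Map pid -> open fd count for processes whose cmdline matches."""
    counts = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name_fragment in cmdline:
                counts[int(entry)] = len(os.listdir(f"/proc/{entry}/fd"))
        except (FileNotFoundError, PermissionError):
            # Process exited or belongs to another user; skip it.
            continue
    return counts


for pid, count in sorted(fd_counts("subunit").items()):
    print(pid, count)
```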
clarkb | oh that would explain it | 20:42 |
mordred | mmm. and much less esoteric than the previous hypothesis | 21:16 |
mordred | because stestr backends are based on nproc right? | 21:17 |
fungi | scaled by the number of processors i think, but i don't recall the scaling formula | 21:18 |
*** mtreinish_ is now known as mtreinish | 23:09 | |
corvus | i believe i have found the leaks. it was not trivial. | 23:34 |