Wednesday, 2025-08-13

Clark[m]I've switched over to dinner mode but a couple other multinode jobs succeeded.00:14
Clark[m]This is looking like it will work00:15
opendevreviewGoutham Pacha Ravi proposed openstack/project-config master: Retire Monasca project  https://review.opendev.org/c/openstack/project-config/+/95706304:15
*** elodilles_pto is now known as elodilles07:54
mnasiadkaBeen observing some timeouts since yesterday on ord.rax mirror in Kolla/Kolla-Ansible jobs09:06
mnasiadkahttps://zuul.opendev.org/t/openstack/build/94be78a6d6e249cca2fb160f5eef3bac/log/primary/logs/ansible/bootstrap-servers#76409:07
opendevreviewJan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role  https://review.opendev.org/c/zuul/zuul-jobs/+/95721810:48
opendevreviewDamian Fajfer proposed zuul/zuul-jobs master: Remove version defaults for nodejs jobs  https://review.opendev.org/c/zuul/zuul-jobs/+/95721911:00
opendevreviewJan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role  https://review.opendev.org/c/zuul/zuul-jobs/+/95721811:14
opendevreviewDamian Fajfer proposed zuul/zuul-jobs master: Remove version defaults for nodejs jobs  https://review.opendev.org/c/zuul/zuul-jobs/+/95721911:16
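The boto3 change above is about connection pooling: botocore caps its HTTP connection pool at 10 connections by default, which throttles parallel S3 uploads and can produce urllib3 "connection pool is full" warnings. A minimal sketch of the general technique (the pool size and client setup here are illustrative assumptions, not the actual zuul-jobs role code):

    import boto3
    from botocore.config import Config

    # botocore's default max_pool_connections is 10; raising it lets more
    # concurrent uploads reuse pooled connections instead of being dropped.
    s3 = boto3.client(
        "s3",
        config=Config(max_pool_connections=50),  # assumed value for illustration
    )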
fungiclarkb: stephenfin: mentioning here instead of #openstack-oslo since clarkb isn't in there, but zuul hit a possible regression in pbr 7.0.0 https://review.opendev.org/c/zuul/zuul/+/95723513:39
stephenfiniiuc that's not a regression, since it's a private function, right?14:36
fungistephenfin: yes, that's why i said "possible"14:39
stephenfingotcha14:39
fungii also pointed out that even the relocated version of that method may stop working in future setuptools14:39
fungii'm more wondering if you have better ideas for how to get at that data that don't go through private pbr internals14:40
fungiis there an interface pbr can/should expose for that, or is there a direct path in setuptools to get at that with pyproject.toml type packaging in the future?14:41
stephenfinlooking14:42
stephenfinalso, while I'm looking, it seems I regressed Ib494a0bac947cc6b5fb8d3d4315a3d386cbffdc2 and the long description isn't being used now14:43
clarkbhrm why would _from_git need to be private?15:38
fungiit was previously, it looks like, though i didn't dig through history to see how long it was that way15:38
clarkbthat's a pbr-specific thing, not setuptools-specific (it's looking at git details to get stuff like authors and changelog)15:39
clarkbI guess my point is I don't see that as being a "compat" setuptools-specific method. It's a pbr method for interacting with git info15:39
clarkbso it probably didn't need to be moved, and considering that one of the major things pbr does is expose git to packaging, it also probably doesn't need to be private?15:39
clarkb(But I agree it was marked private previously and maybe there was a reason for that)15:40
clarkbfungi: I wonder if zuul can just make _build_javascript the setup_hook rather than doing the indirection to run it later via _from_git()? I suspect the reason it doesn't is it wants version info or something like that maybe?15:43
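A rough sketch of that idea, registering the javascript build as the pbr setup hook itself instead of reaching it indirectly through _from_git(); the module path, function name, and build command below are assumptions for illustration, not zuul's actual code:

    # Hypothetical hook module, registered via setup.cfg, e.g.:
    #   [global]
    #   setup_hooks = zuul._setup_hooks.build_javascript
    import subprocess


    def build_javascript(config):
        # pbr invokes setup hooks with the parsed setup.cfg config dict,
        # so the web assets can be built here directly rather than waiting
        # for pbr's private _from_git() to run later in the build.
        subprocess.check_call(["yarn", "build"], cwd="web")  # assumed command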
clarkbfrom a PBR perspective we could release a 7.0.1 that puts the _from_git method back in place (possibly just as an import to the old path?) then pick it up from there?15:45
fungithat's a good point. maybe stephenfin has more specific rationale for moving it, or maybe it was just caught up in the move of all the surrounding code15:46
clarkbok seems like _from_git is only called by the old install.install classes (which I think aren't going away, it's just that they'll be triggered via pep517 entrypoints in the future and not setup.py install)15:51
clarkbbut that may explain why it moved15:51
clarkbit's only when pbr becomes a "native" pep 517 installer and doesn't go through setuptools anymore that this would change. But I assume we'd have to support both paths for a time. Given that, I think the proposed change to zuul may also be good enough?15:52
clarkbhrm except the pep517 path will always do sdist then wheel? It will never do the equivalent of a setup.py install will it? So maybe that is effectively dead code once setup.py install stops working entirely15:54
clarkboh wait LocalSDist calls _from_git too so yes I believe this code will live on for some time15:55
clarkbbecause the setuptools pep517 code paths will use the sdist path then make a wheel from that15:55
clarkbLocalInstall and InstallWithGit might become dead code real soon now but LocalSDist shouldn't15:55
clarkbso yes I think the zuul change is fine if we just want to roll forward. Or we can make an alias for _from_git in its old home and release a 7.0.115:56
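The 7.0.1 alias being floated would amount to a small backward-compatibility shim at the old import location. A sketch, assuming a plain re-export plus a deprecation warning; the "relocated" module path below is made up, not pbr's real layout:

    # Hypothetical shim left at the old module (e.g. pbr/packaging.py).
    import warnings

    # Assumed new home of the helper; the real path would be wherever
    # pbr 7.0.0 moved _from_git to.
    from pbr._relocated import _from_git as _relocated_from_git


    def _from_git(*args, **kwargs):
        # Deprecated alias kept at the old import path for existing callers.
        warnings.warn(
            "_from_git has moved; import it from its new location",
            DeprecationWarning,
            stacklevel=2,
        )
        return _relocated_from_git(*args, **kwargs)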
*** gthiemon1e is now known as gthiemonge15:56
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/95718816:03
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/95718816:08
fungiclarkb: a shim method at the old location would have the effect of solving the challenges of zuul's upgrade job16:12
clarkbfungi: good point.16:13
fungithough they will probably prefer a more immediate solution than the turn-around for getting that merged and a new release published, i dunno16:13
clarkbya and it's unlikely this private method will move again in the future16:13
clarkbso just fixing it directly in zuul is likely to be good enough and quick16:14
corvusyeah, i don't think zuul needs a pbr release16:15
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/95718816:26
corvusit looks like there may be an issue with zuul-launcher and ready nodes and quota that may be causing us to send too many requests to full providers; i'm looking into it.16:58
clarkbperson from the gas company just rang my doorbell. Good news they are just doing meter maintenance17:34
mnasiadkacorvus: wanted to write that I see some node requests stalled for 8 hours, but I see you’re already on top of this ;)18:40
corvusyeah, almost ready to push up some changes18:40
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Limit ready nodes to 30m  https://review.opendev.org/c/opendev/zuul-providers/+/95727618:54
corvusthat change and this one are related: https://review.opendev.org/95727518:55
corvusfrom what i can tell, it was a rough night last night; there were a number of errors, including zk disconnects.  that may have led to the launchers operating with incomplete information, and they may have assigned too many nodes to some of the providers.18:56
corvusthe zuul change is a small step in addressing that and also collecting more information if it happens again.  i doubt that change is enough on its own (but it could be!).  but the problem space is too big and i think we need to narrow it down, and it should do that.18:57
corvusthe change to zuul-providers is to mitigate one of the symptoms.  we ended up with a bunch of ready nodes that we can't use, because a whole bunch of node requests overnight were for ubuntu-noble nodes, and at this time of day, we're mostly using ubuntu-noble-8GB18:58
fungihow does max-ready-age interact with min-ready? do min-ready nodes get recycled and replaced every 30 minutes if unused?18:58
corvusso a chunk of our quota is sitting idle.  that change should let us recover that.18:58
corvusfungi: yes18:58
fungii guess i can see up and down sides to that, but mostly up sides18:58
corvus(also... mostly due to neglect... we haven't actually set min-ready on any labels yet)18:58
fungiah, then basically entirely up sides. sold18:59
corvusour glut of ready nodes right now is due to aborted requests18:59
opendevreviewMerged opendev/zuul-providers master: Limit ready nodes to 30m  https://review.opendev.org/c/opendev/zuul-providers/+/95727619:00
fungiwent ahead and approved it, the sooner that gets applied the sooner things will start running19:00
fungiin theory19:00
corvusthe final piece of the puzzle: a lot of the oldest node requests are multinode requests -- they're harder to fill when they get assigned to a backlogged provider, so they're sticking around longer.19:00
corvusi'll manually delete the ready nodes that are there now19:00
fungioh, i guess that only applies to new nodes, not retroactively?19:01
corvusyeah, awkwardly, the max-ready-age gets set when the node is launched, so the ones we have now will stick around indefinitely.19:01
fungiah too bad19:01
opendevreviewClark Boylan proposed opendev/system-config master: Move python base images back to quay.io  https://review.opendev.org/c/opendev/system-config/+/95727719:01
corvusthe other thing we could do is root out uses of "ubuntu-noble" and update them to "ubuntu-noble-8GB"19:03
corvusthat would improve efficiency a bit19:03
corvusbut at this point, most of those are going to be in individual project nodeset definitions19:03
corvusso... a bunch of busy work.  may not be worth much investment in time.19:03
fungii don't suppose it would make sense for zuul to have a concept of label aliases?19:04
fungiso that ubuntu-noble could just become another name for the ubuntu-noble-8GB label?19:04
corvusthis feels more like a one-time problem19:04
fungiyeah, likely so19:04
clarkbthe update to the launcher lgtm19:06
fungisame19:09
corvusi also think that some of the ready nodes became unlocked while they still were assigned to providers and tenants, which further restricted their use... i'll see about patching that too19:13
corvusoh the zookeeper connection issues are ongoing19:21
corvus2025-08-13 17:49:50,515 DEBUG zuul.zk.base.ZooKeeperClient: ZooKeeper connection (session: 0x0): LOST19:22
corvus2025-08-13 18:58:52,752 DEBUG zuul.zk.base.ZooKeeperClient: ZooKeeper connection (session: 0x0): LOST19:22
Clark[m]Is it only happening to one launcher?20:06
Clark[m]Thinking out loud I wonder if image uploads could impact the network traffic on launchers20:06
corvusno, both launchers and i suspect i just saw it on a scheduler too20:06
Clark[m]Ok if a scheduler is doing it too then unlikely to be image uploads20:06
corvushrm... nope not the scheduler20:07
corvusyeah, might be worth a check on cacti then20:07
corvusoh we have a memory problem20:10
corvusincluding significant swap usage20:10
fungithat could certainly result in things like process timeouts for connections20:12
corvusi think i have smoothed things over wrt the old ready nodes.  i did end up dequeuing and re-enqueuing a bunch of openstack changes.20:12
corvusi have a little more info that points to another possible fix/improvement to the launcher; i'm going to look at that now20:13
corvusokay, the thread i was pulling on is not a smoking gun.  so the exact mechanism by which we ended up in that adverse state is still a bit of a mystery.  the change to the quota handling may be part of it, and it also includes some more info that would be helpful if it happens again.  this change https://review.opendev.org/957282 would also provide some more info.20:42
corvusi'm going to try to figure out what the memory usage is about now20:43
clarkbfungi: if you have a moment you may wish to weigh in on https://review.opendev.org/c/zuul/zuul-jobs/+/957188 to avoid breaking devstack multinode jobs when we switch to ansible 11, but review 957282 first as that one is more urgent20:49
corvusbased on the output from SIGUSR2, i think https://review.opendev.org/957283 may be the immediate cause of our memory problems21:06
corvusor rather, that's the fix, i hope :)21:06
clarkblooking at it now. Side note I'm always happy to see the sigusr2 yappi and threaddump info continue to be so useful21:07
corvusyes, so much21:07
corvus2025-08-13 20:48:08,854 DEBUG zuul.stack_dump:   DeleteJob                    3368646  +336864621:07
corvusthat's how many of those are in memory right now21:08
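The SIGUSR2 output referenced here is the sort of thing a generic debug handler can produce. A minimal sketch of the pattern, not Zuul's actual implementation: dump each thread's stack plus live object counts per class, so a leak like the DeleteJob pileup shows up as one very large number:

    import collections
    import gc
    import signal
    import sys
    import traceback


    def dump_debug_info(signum, frame):
        # One traceback per live thread.
        for thread_id, stack in sys._current_frames().items():
            print("Thread %s:" % thread_id)
            traceback.print_stack(stack)
        # Live object counts by class name, largest first.
        counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
        for name, count in counts.most_common(20):
            print("%-30s %d" % (name, count))


    signal.signal(signal.SIGUSR2, dump_debug_info)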
fungiyikes21:09
fungifix lgtm21:11
fungithanks for the quick diagnosis!21:11
corvusnp21:13
corvusclarkb: fungi there's still some ongoing behavior i don't understand, so i'd like to get a little more debug info:  https://review.opendev.org/c/zuul/zuul/+/95728421:36
corvusi see a bunch of nodes assigned to providers when others are available, and this is recent21:37
clarkb+221:40
corvusanother launcher problem is that our temp space has filled up, probably due to crashes/restarts and not having anything to clean those up.  we'll need to add something to the launcher to deal with that, but in the meantime -- once all the outstanding niz changes land, i'll stop, delete, and start to clear those out.21:40
clarkback sounds like a plan21:41
corvustemp space -> the directory where we store the image downloads21:41
clarkb`Multiarch podman is not yet implemented` is interesting. I guess that means we're not using docker + buildkit like I thought we were?21:49
clarkbah container_command defaults to podman so I probably need to override that somewhere21:50
clarkbhuh I changed that in april. I don't remember doing that. But I also wonder if that was premature optimization given what I think I now know about buildkit and its mirror handling21:53
clarkbI'll update the python base job change to use docker only in those builds so the others don't change (for now)21:54
opendevreviewClark Boylan proposed opendev/system-config master: Move python base images back to quay.io  https://review.opendev.org/c/opendev/system-config/+/95727721:56
fungisorry, stepped away to cook dinner, back now for a bit22:47
