Clark[m] | I've switched over to dinner mode, but a couple of other multinode jobs succeeded. | 00:14 |
---|---|---|
Clark[m] | This is looking like it will work | 00:15 |
opendevreview | Goutham Pacha Ravi proposed openstack/project-config master: Retire Monasca project https://review.opendev.org/c/openstack/project-config/+/957063 | 04:15 |
*** elodilles_pto is now known as elodilles | 07:54 | |
mnasiadka | Been observing some timeouts since yesterday on ord.rax mirror in Kolla/Kolla-Ansible jobs | 09:06 |
mnasiadka | https://zuul.opendev.org/t/openstack/build/94be78a6d6e249cca2fb160f5eef3bac/log/primary/logs/ansible/bootstrap-servers#764 | 09:07 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role https://review.opendev.org/c/zuul/zuul-jobs/+/957218 | 10:48 |
opendevreview | Damian Fajfer proposed zuul/zuul-jobs master: Remove version defaults for nodejs jobs https://review.opendev.org/c/zuul/zuul-jobs/+/957219 | 11:00 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role https://review.opendev.org/c/zuul/zuul-jobs/+/957218 | 11:14 |
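For context on the boto3 change above: raising a client's connection pool is normally done through botocore's client config rather than boto3 itself. A minimal sketch of the general technique (not necessarily how the zuul-jobs role wires it in):

```python
import boto3
from botocore.config import Config

# botocore defaults to a pool of 10 connections per client; raising it lets
# more log files be uploaded to S3 concurrently without urllib3's
# "connection pool is full" warnings.
s3 = boto3.client('s3', config=Config(max_pool_connections=50))
```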
opendevreview | Damian Fajfer proposed zuul/zuul-jobs master: Remove version defaults for nodejs jobs https://review.opendev.org/c/zuul/zuul-jobs/+/957219 | 11:16 |
fungi | clarkb: stephenfin: mentioning here instead of #openstack-oslo since clarkb isn't in there, but zuul hit a possible regression in pbr 7.0.0 https://review.opendev.org/c/zuul/zuul/+/957235 | 13:39 |
stephenfin | iiuc that's not a regression, since it's a private function, right? | 14:36 |
fungi | stephenfin: yes, that's why i said "possible" | 14:39 |
stephenfin | gotcha | 14:39 |
fungi | i also pointed out that even the relocated version of that method may stop working in future setuptools | 14:39 |
fungi | i'm more wondering if you have better ideas for how to get at that data without going through private pbr internals | 14:40 |
fungi | is there an interface pbr can/should expose for that, or is there a direct path in setuptools to get at that with pyproject.toml type packaging in the future? | 14:41 |
stephenfin | looking | 14:42 |
stephenfin | also, while I'm looking, it seems I regressed Ib494a0bac947cc6b5fb8d3d4315a3d386cbffdc2 and the long description isn't being used now | 14:43 |
clarkb | hrm why would _from_git need to be private? | 15:38 |
fungi | it was previously, it looks like, though i didn't dig through history to see how long it was that way | 15:38 |
clarkb | that's a pbr specific thing, not setuptools specific (it's looking at git details to get stuff like authors and changelog) | 15:39 |
clarkb | I guess my point is I don't see that as being a "compat" setuptools specific method. It's a pbr method for interacting with git info | 15:39 |
clarkb | so it probably didn't need to be moved, and considering that one of the major things pbr does is expose git to packaging, it also probably doesn't need to be private? | 15:39 |
clarkb | (But I agree it was marked private previously and maybe there was a reason for that) | 15:40 |
clarkb | fungi: I wonder if zuul can just make _build_javascript the setup_hook rather than doing the indirection to run it later via _from_git()? I suspect the reason it doesn't is it wants version info or something like that maybe? | 15:43 |
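A rough sketch of what that could look like: pbr lets a project name setup hooks in setup.cfg, so zuul could in principle point one straight at the JavaScript build. The module name, registration key, hook signature, and build command below are all assumptions for illustration, not zuul's actual code:

```python
# zuul/_setup_hooks.py (hypothetical module)
# Registered in setup.cfg along the lines of:
#   [global]
#   setup_hooks = zuul._setup_hooks.build_javascript
# (the exact config key and the one-argument hook signature are assumptions
# about pbr's hook interface)
import subprocess


def build_javascript(config):
    """Build the web assets before pbr assembles the package."""
    subprocess.check_call(['yarn', 'build'], cwd='web')
```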
clarkb | from a PBR perspective we could release a 7.0.1 that puts the _from_git method back in place (possibly just as an import to the old path?) then pick it up from there? | 15:45 |
fungi | that's a good point. maybe stephenfin has more specific rationale for moving it, or maybe it was just caught up in the move of all the surrounding code | 15:46 |
clarkb | ok seems like _from_git is only called by the old install.install classes (which I think aren't going away, it's just that they'll be triggered via pep517 entrypoints in the future and not setup.py install) | 15:51 |
clarkb | but that may explain why it moved | 15:51 |
clarkb | it's only when pbr becomes a "native" pep 517 installer and doesn't go through setuptools anymore that this would change. But I assume we'd have to support both paths for a time. Given that, I think the proposed change to zuul may also be good enough? | 15:52 |
clarkb | hrm except the pep517 path will always do sdist then wheel? It will never do the equivalent of a setup.py install will it? So maybe that is effectively dead code once setup.py install stops working entirely | 15:54 |
clarkb | oh wait LocalSDist calls _from_git too so yes I believe this code will live on for some time | 15:55 |
clarkb | because the setuptools pep517 code paths will use the sdist path then make a wheel from that | 15:55 |
clarkb | LocalInstall and InstallWithGit might become dead code real soon now but LocalSDist shouldn't | 15:55 |
clarkb | so yes I think the zuul change is fine if we just want to roll forward. Or we can make an alias for _from_git in its old home and release a 7.0.1 | 15:56 |
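The alias option really is a one-liner. A minimal sketch of what a 7.0.1 shim could look like, assuming the helper previously lived in pbr.packaging (the new module name below is a placeholder, not pbr's actual post-7.0.0 layout):

```python
# pbr/packaging.py -- compatibility re-export at the helper's old home, so
# "from pbr.packaging import _from_git" keeps working on a 7.0.1 release.
# "pbr._relocated" is a placeholder for wherever 7.0.0 actually moved it.
from pbr._relocated import _from_git  # noqa: F401
```

The roll-forward alternative on the zuul side is simply a try/except ImportError that falls back between the two import paths, which avoids waiting on a new pbr release.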
*** gthiemon1e is now known as gthiemonge | 15:56 | |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge https://review.opendev.org/c/zuul/zuul-jobs/+/957188 | 16:03 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge https://review.opendev.org/c/zuul/zuul-jobs/+/957188 | 16:08 |
fungi | clarkb: a shim method at the old location would have the effect of solving the challenges of zuul's upgrade job | 16:12 |
clarkb | fungi: good point. | 16:13 |
fungi | though they will probably prefer a more immediate solution than the turn-around for getting that merged and a new release published, i dunno | 16:13 |
clarkb | ya and it's unlikely this private method will move again in the future | 16:13 |
clarkb | so just fixing it directly in zuul is likely to be good enough and quick | 16:14 |
corvus | yeah, i don't think zuul needs a pbr release | 16:15 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Vendor openvswitch_bridge https://review.opendev.org/c/zuul/zuul-jobs/+/957188 | 16:26 |
corvus | it looks like there may be an issue with zuul-launcher and ready nodes and quota that may be causing us to send too many requests to full providers; i'm looking into it. | 16:58 |
clarkb | person from the gas company just rang my doorbell. Good news: they're just doing meter maintenance | 17:34 |
mnasiadka | corvus: wanted to write that I see some node requests stalled for 8 hours, but I see you’re already on top of this ;) | 18:40 |
corvus | yeah, almost ready to push up some changes | 18:40 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Limit ready nodes to 30m https://review.opendev.org/c/opendev/zuul-providers/+/957276 | 18:54 |
corvus | that change and this one are related: https://review.opendev.org/957275 | 18:55 |
corvus | from what i can tell, it was a rough night last night; there were a number of errors, including zk disconnects. that may have led to the launchers operating with incomplete information, and they may have assigned too many nodes to some of the providers. | 18:56 |
corvus | the zuul change is a small step in addressing that and also collecting more information if it happens again. i doubt that change is enough on its own (but it could be!). but the problem space is too big and i think we need to narrow it down, and it should do that. | 18:57 |
corvus | the change to zuul-providers is to mitigate one of the symptoms. we ended up with a bunch of ready nodes that we can't use, because a whole bunch of node requests overnight were for ubuntu-noble nodes, and at this time of day, we're mostly using ubuntu-noble-8GB | 18:58 |
fungi | how does max-ready-age interact with min-ready? do min-ready nodes get recycled and replaced every 30 minutes if unused? | 18:58 |
corvus | so a chunk of our quota is sitting idle. that change should let us recover that. | 18:58 |
corvus | fungi: yes | 18:58 |
fungi | i guess i can see up and down sides to that, but mostly up sides | 18:58 |
corvus | (also... mostly due to neglect... we haven't actually set min-ready on any labels yet) | 18:58 |
fungi | ah, then basically entirely up sides. sold | 18:59 |
corvus | our glut of ready nodes right now is due to aborted requests | 18:59 |
opendevreview | Merged opendev/zuul-providers master: Limit ready nodes to 30m https://review.opendev.org/c/opendev/zuul-providers/+/957276 | 19:00 |
fungi | went ahead and approved it, the sooner that gets applied the sooner things will start running | 19:00 |
fungi | in theory | 19:00 |
corvus | the final piece of the puzzle: a lot of the oldest node requests are multinode requests -- they're harder to fill when they get assigned to a backlogged provider, so they're sticking around longer. | 19:00 |
corvus | i'll manually delete the ready nodes that are there now | 19:00 |
fungi | oh, i guess that only applies to new nodes, not retroactively? | 19:01 |
corvus | yeah, awkwardly, the max-ready-age gets set when the node is launched, so the ones we have now will stick around indefinitely. | 19:01 |
fungi | ah too bad | 19:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move python base images back to quay.io https://review.opendev.org/c/opendev/system-config/+/957277 | 19:01 |
corvus | the other thing we could do is root out uses of "ubuntu-noble" and update them to "ubuntu-noble-8GB" | 19:03 |
corvus | that would improve efficiency a bit | 19:03 |
corvus | but at this point, most of those are going to be in individual project nodeset definitions | 19:03 |
corvus | so... a bunch of busy work. may not be worth much investment in time. | 19:03 |
fungi | i don't suppose it would make sense for zuul to have a concept of label aliases? | 19:04 |
fungi | so that ubuntu-noble could just become another name for the ubuntu-noble-8GB label? | 19:04 |
corvus | this feels more like a one-time problem | 19:04 |
fungi | yeah, likely so | 19:04 |
clarkb | the update to the launcher lgtm | 19:06 |
fungi | same | 19:09 |
corvus | i also think that some of the ready nodes became unlocked while they still were assigned to providers and tenants, which further restricted their use... i'll see about patching that too | 19:13 |
corvus | oh the zookeeper connection issues are ongoing | 19:21 |
corvus | 2025-08-13 17:49:50,515 DEBUG zuul.zk.base.ZooKeeperClient: ZooKeeper connection (session: 0x0): LOST | 19:22 |
corvus | 2025-08-13 18:58:52,752 DEBUG zuul.zk.base.ZooKeeperClient: ZooKeeper connection (session: 0x0): LOST | 19:22 |
Clark[m] | Is it only happening to one launcher? | 20:06 |
Clark[m] | Thinking out loud I wonder if image uploads could impact the network traffic on launchers | 20:06 |
corvus | no, both launchers and i suspect i just saw it on a scheduler too | 20:06 |
Clark[m] | Ok if a scheduler is doing it too then it's unlikely to be image uploads | 20:06 |
corvus | hrm... nope not the scheduler | 20:07 |
corvus | yeah, might be worth a check on cacti then | 20:07 |
corvus | oh we have a memory problem | 20:10 |
corvus | including significant swap usage | 20:10 |
fungi | that could certainly result in things like process timeouts for connections | 20:12 |
corvus | i think i have smoothed things over wrt the old ready nodes. i did end up dequeueing and re-enqueueing a bunch of openstack changes. | 20:12 |
corvus | i have a little more info that points to another possible fix/improvement to the launcher; i'm going to look at that now | 20:13 |
corvus | okay, the thread i was pulling on is not a smoking gun, so the exact mechanism by which we ended up in that adverse state is still a bit of a mystery. the change to the quota handling may be part of it, and it also includes some more info that would be helpful if it happens again. this change https://review.opendev.org/957282 would also provide some more info. | 20:42 |
corvus | i'm going to try to figure out what the memory usage is about now | 20:43 |
clarkb | fungi: if you have a moment you may wish to weigh in on https://review.opendev.org/c/zuul/zuul-jobs/+/957188 to avoid breaking devstack multinode jobs when we switch to ansible 11, but review 957282 first as that one is more urgent | 20:49 |
corvus | based on the output from SIGUSR2, i think https://review.opendev.org/957283 may be the immediate cause of our memory problems | 21:06 |
corvus | or rather, that's the fix, i hope :) | 21:06 |
clarkb | looking at it now. Side note I'm always happy to see the sigusr2 yappi and threaddump info continue to be so useful | 21:07 |
corvus | yes, so much | 21:07 |
corvus | 2025-08-13 20:48:08,854 DEBUG zuul.stack_dump: DeleteJob 3368646 +3368646 | 21:07 |
corvus | that's how many of those are in memory right now | 21:08 |
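For anyone curious, the object-count side of that dump is what makes a leak like 3.3 million retained DeleteJob instances jump out immediately. A minimal sketch of the general idea (not Zuul's actual handler, which also emits yappi profiles and thread stacks):

```python
import collections
import gc
import signal


def dump_object_counts(signum, frame):
    # Count live objects by type so a class that is never released
    # (e.g. millions of retained job objects) stands out in the logs.
    counts = collections.Counter(type(obj).__name__ for obj in gc.get_objects())
    for name, count in counts.most_common(20):
        print(f'{name}: {count}')


signal.signal(signal.SIGUSR2, dump_object_counts)
```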
fungi | yikes | 21:09 |
fungi | fix lgtm | 21:11 |
fungi | thanks for the quick diagnosis! | 21:11 |
corvus | np | 21:13 |
corvus | clarkb: fungi there's still some ongoing behavior i don't understand, so i'd like to get a little more debug info: https://review.opendev.org/c/zuul/zuul/+/957284 | 21:36 |
corvus | i see a bunch of nodes assigned to providers when others are available, and this is recent | 21:37 |
clarkb | +2 | 21:40 |
corvus | another launcher problem is that our temp space has filled up; probably due to crashes/restarts, and not having anything to clean those up. we'll need to add something to the launcher to deal with that, but in the meantime -- once all the outstanding niz changes land, i'll do a stop, delete, start to clear those out. | 21:40 |
clarkb | ack sounds like a plan | 21:41 |
corvus | temp space -> the directory where we store the image downloads | 21:41 |
clarkb | `Multiarch podman is not yet implemented` is interesting. I guess that means we're not using docker + buildkit like I thought we were? | 21:49 |
clarkb | ah container_command defaults to podman so I probably need to override that somewhere | 21:50 |
clarkb | huh, I changed that in April. I don't remember doing that. But I also wonder if that was premature optimization given what I think I now know about buildkit and its mirror handling | 21:53 |
clarkb | I'll update the python base job change to use docker only in those builds so the others don't change (for now) | 21:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move python base images back to quay.io https://review.opendev.org/c/opendev/system-config/+/957277 | 21:56 |
fungi | sorry, stepped away to cook dinner, back now for a bit | 22:47 |