Thursday, 2025-06-12

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/95200602:25
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994203:33
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994206:04
*** amoralej_ is now known as amoralej11:59
fungiyeah, my 952408 is still in check with a paused buildset registry job 14 hours later. i'll leave it there for the moment since it's not an urgent fix or anything, and maybe corvus will have some ideas as to what sort of corner case/race we tripped over12:41
fungiokay that's odd, it completed moments ago, after running for nearly 15 hours13:13
fungi- opendev-buildset-registry https://zuul.opendev.org/t/openstack/build/b891f8394e074cae9b9d3d4df691decc : SUCCESS in 14h 51m 21s13:14
clarkbfungi: was it stalled out after all the child jobs had run or before?14:57
clarkbthat job pauses and waits for child jobs to complete. Just wondering if it didn't get the complete signal in a reasonable amount of time or if the child jobs took a long time to schedule14:57
fungithe child job failed because the parent change timed out performing an image build (sort of surprised that could even happen in an independent pipeline), and then the buildset registry job remained paused for 14+ hours after that14:59
fungithe failure on the child change's system-config-run-gitea job was: requires artifact(s) gitea-container-image provided by build e99d34c1c7e54d4685361256f18dcd35 (triggered by change 952407 on project opendev/system-config), but that build failed with result "TIMED_OUT"15:01
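The "requires artifact(s)" message fungi quotes comes from Zuul's provides/requires matching between jobs in a buildset. A hedged sketch of what that wiring looks like; the job names here are placeholders, not the real system-config job definitions:

```yaml
- job:
    name: build-gitea-container-image      # placeholder for the image build job
    provides: gitea-container-image

- job:
    name: system-config-run-gitea
    requires: gitea-container-image        # unsatisfied if the providing build fails or times out
```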
fungiwhile i'm still wondering what transpired to leave the registry job paused for so long, i'm even more curious as to what could have eventually unstuck it15:03
fungimaybe something merged in a random project that triggered a reconfigure?15:04
clarkbmaybe?15:15
clarkbchild jobs failing should also be sufficient to unpause15:15
fungiright, but clearly didn't in this case since the child jobs all ended 14 hours earlier15:16
fungiheading out to grab lunch, should be back in an hour or so15:24
corvusthe reconfigure sounds like a good lead; last was at 13:0316:04
fungithat does roughly coincide with when the job unpaused16:43
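For context, the pause clarkb describes above is driven by Zuul's zuul_return module in the job's run playbook; the job pauses after starting the registry and normally resumes once its children report back. A minimal sketch, with the playbook structure and task name illustrative rather than copied from the actual opendev-buildset-registry job:

```yaml
# Run playbook sketch of a registry-style job: start the service, then pause
# so child jobs in the same buildset can use it before this job completes.
- hosts: all
  tasks:
    - name: Pause this job until its child jobs have finished
      zuul_return:
        data:
          zuul:
            pause: true
```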
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994218:02
clarkbmnasiadka: tonyb: ^I took the liberty of doing a quick update. I think the failed_when in your loops is causing the tasks to fail immediately after one iteration (hence the wait for nova to convert the image and boot the node before we check the node status)18:03
clarkbinstead I think we can just rely on retries and until to determine if we fail due to a timeout when the state doesn't reach what we want18:03
clarkbthen I also bumped up the retry count significantly so that we can see if these tasks ever succeed. I suspect part of the problem previously was just not waiting long enough18:04
clarkband finally I put the devstack run in pre-run since we're not actually testing devstack here. This also fixes some node information collection stuff I think18:04
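A minimal sketch of the retries/until pattern clarkb mentions above, assuming a task that polls server status with the openstack CLI; the server name and counts are placeholders, not the change's actual task:

```yaml
- name: Wait for the test server to reach ACTIVE
  command: openstack server show dib-test-node -f value -c status
  register: server_status
  until: server_status.stdout == "ACTIVE"
  retries: 120        # generous, so slow image conversion and boot are not themselves failures
  delay: 10
  changed_when: false
```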
clarkbfungi: I noticed that the devstack install for 949942 is building a zstd wheel for this version of zstd: https://pypi.org/project/zstd/1.5.7.0/#files but that listing shows there is already a python3.12 manylinux wheel. My hunch right now is that wheel is built for glibc 2.4 or greater. Noble appears to have glibc 2.39. Is this a pip wheel compatibility parsing error where 2.4 is > 2.39 maybe?18:27
clarkbthere probably is a simpler explanation but I'm not seeing it on initial inspection18:27
clarkbthe python3.5 and 3.6 wheels are actually built for glibc 2.14+18:31
clarkbit does seem like there may be something fishy going on there.18:31
fungiclarkb: the cp312 wheel seems to be for i686 arch not x86_6418:58
fungiclarkb: in https://github.com/sergey-dryabzhinsky/python-zstd/issues/233 the maintainer comments "I know that wheels for python3.12 amd64 is missing, but I don't know how to fix it. I tried restarting action already."19:14
Clark[m]Oh that would explain it. I skimmed right over that19:40
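One way to confirm on the node which wheels pip will actually accept is to dump its compatible tag list, which makes an i686-only wheel (or a too-new manylinux glibc requirement) obvious. A hedged sketch as Ansible tasks; the filtering is illustrative:

```yaml
- name: List the wheel tags this interpreter accepts
  command: python3 -m pip debug --verbose
  register: pip_tags
  changed_when: false

- name: Show only the manylinux-related tags
  debug:
    msg: "{{ pip_tags.stdout_lines | select('search', 'manylinux') | list }}"
```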
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994220:43
corvusi restarted zuul-launcher to get the most recent updates21:25
fungithanks!21:25
fungilogins to wiki.openstack.org have broken again, rebooting it now21:36
fungiload average on it is around 6521:37
fungiworking again now21:39
Clark[m]fungi: I see you were already talking to the cfn folks about branch creation. Did you want to follow up on their email from today or should I? Looks like the acl is missing the branch creation rule21:41
fungithey e-mailed me privately at which point i sent them the link to the infra-manual section that talks about adding the acl entry, and suggested that in the future they ask on the ml or in irc rather than sending e-mail to me21:46
fungii haven't gotten around to replying to their list message, or looking at what their acl is like for that matter, but would likely just end up quoting from the same document i already sent them a link to21:47
fungii doubt i'll get to it tonight though21:48
Clark[m]oh I see you already pointed them to the acl update21:50
Clark[m]it's hard to parse the thread in my mail client. It got squashed21:51
fungithough part of me worries that by me continuing to reply to them, just on the ml instead of privately, they won't understand the distinction or reason for asking in normal channels21:51
Clark[m]ya I'll respond to try and push the list angle more21:51
fungiso maybe it would make sense for a second person to reply just so they don't get the impression i'm the only one around21:51
Clark[m]response sent. Let me know if you think any additional info would be helpful21:57
fungilooks perfect, thanks!!!21:58
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994222:18
Clark[m]tonyb: mnasiadka: ok I think if the new rescue block works this should help with debugging. Then I realized that we're testing with almalinux which I think is a bad choice since we (opendev) don't do any almalinux image builds, so it's harder to say if it should work at all. I updated the change to also build ubuntu noble.22:19
Clark[m]I think almalinux may be failing due to non working dhcp22:22
Clark[m]but it's all networkmanager, which I find indecipherable, so being able to confirm on ubuntu would be nice22:22
Clark[m]tonyb: mnasiadka: I bet there is no dhcp server for the public network. But I'm not positive of that. It may be best to simply boot a node and attach a floating ip to it22:31
Clark[m]rather than try and do something fancy with neutron networking22:31
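A rough sketch of the boot-then-attach-a-floating-ip approach Clark suggests, using plain openstack CLI calls from Ansible; the image, flavor, network, and server names are placeholders for whatever the devstack job sets up:

```yaml
- name: Boot the test server on the tenant network
  command: >-
    openstack server create --image dib-test-image --flavor m1.small
    --network private --wait dib-test-node

- name: Allocate a floating IP from the public network
  command: openstack floating ip create public -f value -c floating_ip_address
  register: fip

- name: Attach the floating IP to the server
  command: openstack server add floating ip dib-test-node {{ fip.stdout | trim }}
```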
tonybthanks clarkb. I picked alma because it starts with "a". It seems like glean doesn't write out the config, so I was happy to blame glean and move on, but it works under nodepool so there is something else happening22:33
tonybbut I can add the floating support22:34
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994222:34
Clark[m]tonyb: If you look at the console log here: https://zuul.opendev.org/t/openstack/build/940a11f6abe0496782199059fe51aa08/console#6/0/34/controller it is definitely using glean. But then there are a bunch of messages about failed dhcp22:35
Clark[m]which is why my hunch now is that by connecting to the public network we're getting a network that does not have dhcp configured22:36
Clark[m]it's been a while since I looked closely at a default devstack install's network setup. But my hunch is that the public network they set up isn't really meant for direct attachment22:39
Clark[m]we can probably configure it to work that way or just use the cloud in the manner that is expected of us22:40
tonybokay. That more or less matches the testing I did on a held node. There was definitely no dhcp reply, but looking at the network json file in the config drive it looked to me like glean should be writing a config file for a static address and it wasn't22:42
tonybthat's why I pushed the change to include the existing nodepool testing to locate the difference 22:43
Clark[m]hrm I think the default is dhcp22:43
Clark[m]for neutron networks I mean22:43
tonybtoday was going to be "learn glean" day to confirm all of that22:44
Clark[m]the nodepool based testing is almost certainly going to be using floating ips. we should be able to check that /me looks22:44
tonybI see some good feedback on the change and I can look at that once the caffeine hits22:44
Clark[m]oh nope the container image builds for nodepool are failing so those jobs aren't running22:44
Clark[m]tonyb: fwiw some of the feedback I incorporated into what I've already pushed. Mostly around getting better debugging info out to see what is going on22:45
tonybperfect 22:45
Clark[m]but yes I suspect that with nodepool based testing its doing floating ips22:46
tonybthe raw vs qcow thing I can address. I started with raw but it failed, switched to qcow, it also failed. Fixed the real problem (swift max object size) and never switched back to raw.22:47
tonybswitching back makes sense as we are definitely wasting time with all the switching between formats22:49
tonybI'll also look at ianw's feedback as I don't quite understand it.22:50
tonybClark[m]: are you okay for me to "take the reins" once the current jobs complete?22:51
tonybI don't want to mess up what you're working on22:52
Clark[m]yup and feel free to push updates at this point. I think the last fix I have is working (the job got far enough)22:55
Clark[m]don't need to wait for things to complete unless you want to get ubuntu image feedback yourself22:55
Clark[m]actually looks like ubuntu image may be working?22:56
Clark[m]so maybe you want to wait for it to finish to get that logging data recorded in an easy to find location22:56
tonybyup I'll wait for things to complete.    I need another coffee anyway 22:59
ianwtonyb: sorry if not clear -- the nodepool build config file sets several flags.  The DIB_REPOREF_* ones are pretty simple, they just point dib at the zuul checkout of glean instead of pypi to make sure it's co-installed.  the one to think about is DIB_DEV_USER_AUTHORIZED_KEYS ... i think we've been using that to login23:04
tonybianw: any lack of clarity/confusion lay with me not you23:09
tonybI think the dib repo ref stuff is no longer needed now that the ensure-dib role installs dib from the repo as set up by zuul23:10
ianwthe LABEL= one ... that's a mess.  the root disk has label "cloudimg-rootfs" ... IIRC basically grub would install in the built image, so if the currently booted kernel (like on a dib-built gate node) had LABEL=cloudimg-rootfs set on the kernel command-line, that would be copied in by grub and it would boot.  but then we'd build it on the nodepool builders that were not dib images and didn't have this label in their underlying kernel command line, there were times when dib was not setting it properly in the grub updating23:10
ianwensure-dib will install dib from zuul source, yep, but we also want dib to install glean from zuul src into the built images23:11
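A hedged sketch of the wiring ianw describes: the simple-init element set to its repo install type so DIB pulls glean from the Zuul-prepared checkout rather than PyPI, plus the devuser authorized_keys flag used for login. The element list, paths, and the glean_ref variable are illustrative, not the nodepool job's exact configuration:

```yaml
- name: Build a test image with glean installed from the Zuul checkout
  command: disk-image-create -o test-image vm ubuntu-minimal simple-init devuser
  environment:
    DIB_INSTALLTYPE_simple_init: repo
    DIB_REPOLOCATION_glean: "{{ ansible_user_dir }}/src/opendev.org/opendev/glean"
    DIB_REPOREF_glean: "{{ glean_ref | default('master') }}"
    DIB_DEV_USER_AUTHORIZED_KEYS: "{{ ansible_user_dir }}/.ssh/authorized_keys"
```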
