Thursday, 2025-06-12

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/95200602:25
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994203:33
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994206:04
*** amoralej_ is now known as amoralej11:59
fungiyeah, my 952408 is still in check with a paused buildset registry job 14 hours later. i'll leave it there for the moment since it's not an urgent fix or anything, and maybe corvus will have some ideas as to what sort of corner case/race we tripped over12:41
fungiokay that's odd, it completed moments ago, after running for nearly 15 hours13:13
fungi- opendev-buildset-registry https://zuul.opendev.org/t/openstack/build/b891f8394e074cae9b9d3d4df691decc : SUCCESS in 14h 51m 21s13:14
clarkbfungi: was it stalled out after all the child jobs had run or before?14:57
clarkbthat job pauses and waits for child jobs to complete. Just wondering if it didn't get the complete signal in a reasonable amount of time or if the child jobs took a long time to schedule14:57
fungithe child job failed because the parent change timed out performing an image build (sort of surprised that could even happen in an independent pipeline), and then the buildset registry job remained paused for 14+ hours after that14:59
fungithe failure on the child change's system-config-run-gitea job was: requires artifact(s) gitea-container-image provided by build e99d34c1c7e54d4685361256f18dcd35 (triggered by change 952407 on project opendev/system-config), but that build failed with result "TIMED_OUT"15:01
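The "requires artifact(s)" message fungi quotes comes from Zuul's provides/requires matching between jobs in a buildset. A hedged sketch of what that wiring looks like; the job names here are placeholders, not the real system-config job definitions:

```yaml
- job:
    name: build-gitea-container-image      # placeholder for the image build job
    provides: gitea-container-image

- job:
    name: system-config-run-gitea
    requires: gitea-container-image        # unsatisfied if the providing build fails or times out
```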
fungiwhile i'm still wondering what transpired to leave the registry job paused for so long, i'm even more curious as to what could have eventually unstuck it15:03
fungimaybe something merged in a random project that triggered a reconfigure?15:04
clarkbmaybe?15:15
clarkbchild jobs failing should also be sufficient to unpause15:15
fungiright, but clearly didn't in this case since the child jobs all ended 14 hours earlier15:16
fungiheading out to grab lunch, should be back in an hour or so15:24
corvusthe reconfigure sounds like a good lead; last was at 13:0316:04
fungithat does roughly coincide with when the job unpaused16:43
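For context, the pause clarkb describes above is driven by Zuul's zuul_return module in the job's run playbook; the job pauses after starting the registry and normally resumes once its children report back. A minimal sketch, with the playbook structure and task name illustrative rather than copied from the actual opendev-buildset-registry job:

```yaml
# Run playbook sketch of a registry-style job: start the service, then pause
# so child jobs in the same buildset can use it before this job completes.
- hosts: all
  tasks:
    - name: Pause this job until its child jobs have finished
      zuul_return:
        data:
          zuul:
            pause: true
```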
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994218:02
clarkbmnasiadka: tonyb: ^I took the liberty of doing a quick update. I think the failed_when in your loops is causing the tasks to fail immediately after one iteration (hence the wait for nova to convert the image and boot the node before we check the node status)18:03
clarkbinstead I think we can just rely on retries and until to determine if we fail due to a timeout when the state doesn't reach what we want18:03
clarkbthen I also bumped up the retry count significantly so that we can see if these tasks ever succeed. I suspect part of the problem previously was just not waiting long enough18:04
clarkband finally I put the devstack run in pre-run since we're not actually testing devstack here. This also fixes some node information collection stuff I think18:04
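A minimal sketch of the retries/until pattern clarkb mentions above, assuming a task that polls server status with the openstack CLI; the server name and counts are placeholders, not the change's actual task:

```yaml
- name: Wait for the test server to reach ACTIVE
  command: openstack server show dib-test-node -f value -c status
  register: server_status
  until: server_status.stdout == "ACTIVE"
  retries: 120        # generous, so slow image conversion and boot are not themselves failures
  delay: 10
  changed_when: false
```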
clarkbfungi: I noticed that the devstack install for 949942 is building a zstd wheel for this version of zstd: https://pypi.org/project/zstd/1.5.7.0/#files but that listing shows there is already a python3.12 manylinux wheel. My hunch right now is that wheel is built for glibc 2.4 or greater. Noble appears to have glibc 2.39. Is this a pip wheel compatibility parsing error where 2.4 is > 2.39 maybe?18:27
clarkbthere probably is a simpler explanation but I'm not seeing it on initial inspection18:27
clarkbthe python3.5 and 3.6 wheels are actually built for glibc 2.14+18:31
clarkbit does seem like there may be something fishy going on there.18:31
fungiclarkb: the cp312 wheel seems to be for i686 arch not x86_6418:58
fungiclarkb: in https://github.com/sergey-dryabzhinsky/python-zstd/issues/233 the maintainer comments "I know that wheels for python3.12 amd64 is missing, but I don't know how to fix it. I tried restarting action already."19:14
Clark[m]Oh that would explain it. I skimmed right over that19:40
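One way to confirm on the node which wheels pip will actually accept is to dump its compatible tag list, which makes an i686-only wheel (or a too-new manylinux glibc requirement) obvious. A hedged sketch as Ansible tasks; the filtering is illustrative:

```yaml
- name: List the wheel tags this interpreter accepts
  command: python3 -m pip debug --verbose
  register: pip_tags
  changed_when: false

- name: Show only the manylinux-related tags
  debug:
    msg: "{{ pip_tags.stdout_lines | select('search', 'manylinux') | list }}"
```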
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994220:43
corvusi restarted zuul-launcher to get the most recent updates21:25
fungithanks!21:25
fungilogins to wiki.openstack.org have broken again, rebooting it now21:36
fungiload average on it is around 6521:37
fungiworking again now21:39
Clark[m]fungi: I see you were already talking to the cfn folks about branch creation. Did you want to follow up on their email from today or should I? Looks like the acl is missing the branch creation rule21:41
fungithey e-mailed me privately at which point i sent them the link to the infra-manual section that talks about adding the acl entry, and suggested that in the future they ask on the ml or in irc rather than sending e-mail to me21:46
fungii haven't gotten around to replying to their list message, or looking at what their acl is like for that matter, but would likely just end up quoting from the same document i already sent them a link to21:47
fungii doubt i'll get to it tonight though21:48
Clark[m]oh I see you already pointed them to the acl update21:50
Clark[m]it's hard to parse the thread in my mail client. It got squashed21:51
fungithough part of me worries that by me continuing to reply to them, just on the ml instead of privately, they won't understand the distinction or reason for asking in normal channels21:51
Clark[m]ya I'll respond to try and push the list angle more21:51
fungiso maybe it would make sense for a second person to reply just so they don't get the impression i'm the only one around21:51
Clark[m]response sent. Let me know if you think any additional info would be helpful21:57
fungilooks perfect, thanks!!!21:58
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994222:18
Clark[m]tonyb: mnasiadka: ok I think if the new rescue block works this should help with debugging. Then I realized that we're testing with almalinux which I think is a bad choice since we (opendev) don't do any almalinux image builds, so it's harder to say if it should work at all. I updated the change to also build ubuntu noble.22:19
Clark[m]I think almalinux may be failing due to non working dhcp22:22
Clark[m]but it's all networkmanager, which I find indecipherable, so being able to confirm on ubuntu would be nice22:22
Clark[m]tonyb: mnasiadka: I bet there is no dhcp server for the public network. But I'm not positive of that. It may be best to simply boot a node and attach a floating ip to it22:31
Clark[m]rather than try and do something fancy with neutron networking22:31
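A rough sketch of the boot-then-attach-a-floating-ip approach Clark suggests, using plain openstack CLI calls from Ansible; the image, flavor, network, and server names are placeholders for whatever the devstack job sets up:

```yaml
- name: Boot the test server on the tenant network
  command: >-
    openstack server create --image dib-test-image --flavor m1.small
    --network private --wait dib-test-node

- name: Allocate a floating IP from the public network
  command: openstack floating ip create public -f value -c floating_ip_address
  register: fip

- name: Attach the floating IP to the server
  command: openstack server add floating ip dib-test-node {{ fip.stdout | trim }}
```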
tonybthanks clarkb. I picked alma because it starts with "a". It seems like glean doesn't write out the config, so I was happy to blame glean and move on, but it works under nodepool so there is something else happening22:33
tonybbut I can add the floating support22:34
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994222:34
Clark[m]tonyb: If you look at the console log here: https://zuul.opendev.org/t/openstack/build/940a11f6abe0496782199059fe51aa08/console#6/0/34/controller it is definitely using glean. But then there are a bunch of messages about failed dhcp22:35
Clark[m]which is why my hunch now is that by connecting to the public network we're getting a network that does not have dhcp configured22:36
Clark[m]it's been a while since I looked closely at a default devstack install's network setup. But my hunch is that the public network they set up isn't really meant for direct attachment22:39
Clark[m]we can probably configure it to work that way or just use the cloud in the manner that is expected of us22:40
tonybokay. That more or less matches the testing I did on a held node. There was definitely no dhcp reply, but looking at the network json file in the config drive it looked to me like glean should be writing a config file for a static address and it wasn't22:42
tonybthat's why I pushed the change to include the existing nodepool testing to locate the difference 22:43
Clark[m]hrm I think the default is dhcp22:43
Clark[m]for neutron networks I mean22:43
tonybtoday was going to be "learn glean" day to confirm all of that22:44
Clark[m]the nodepool based testing is almost certainly going to be using floating ips. we should be able to check that /me looks22:44
tonybI see some good feedback on the change and I can look at that once the caffeine hits22:44
Clark[m]oh nope the container image builds for nodepool are failing so those jobs aren't running22:44
Clark[m]tonyb: fwiw some of the feedback I incorporated into what I've already pushed. Mostly around getting better debugging info out to see what is going on22:45
tonybperfect 22:45
Clark[m]but yes I suspect that with nodepool based testing its doing floating ips22:46
tonybthe raw vs qcow thing I can address. I started with raw but it failed, switched to qcow, it also failed. Fixed the real problem (swift max object size) and never switched back to raw.22:47
tonybswitching back makes sense as we are definitely wasting time with all the switching between formats22:49
tonybI'll also look at ianw's feedback as I don't quite understand it.22:50
tonybClark[m]: are you okay for me to "take the reins" once the current jobs complete?22:51
tonybI don't want to mess up what you're working on22:52
Clark[m]yup and feel free to push updates at this point. I think the last fix I have is working (the job got far enough)22:55
Clark[m]don't need to wait for things to complete unless you want to get ubuntu image feedback yourself22:55
Clark[m]actually looks like ubuntu image may be working?22:56
Clark[m]so maybe you want to wait for it to finish to get that logging data recorded in an easy to find location22:56
tonybyup I'll wait for things to complete.    I need another coffee anyway 22:59
ianwtonyb: sorry if not clear -- the nodepool build config file sets several flags.  The DIB_REPOREF_* ones are pretty simple, they just point dib at the zuul checkout of glean instead of pypi to make sure it's co-installed.  the one to think about is DIB_DEV_USER_AUTHORIZED_KEYS ... i think we've been using that to login23:04
tonybianw: any lack of clarity/confusion lay with me not you23:09
tonybI think the dib repo ref stuff is no longer needed now that the ensure-dib role installs dib from the repo as set up by zuul23:10
ianwthe LABEL= one ... that's a mess.  the root disk has label "cloudimg-rootfs" ... IIRC basically grub would install in the built image, so if the currently booted kernel (like on a dib-built gate node) had LABEL=cloudimg-rootfs set on the kernel command-line, that would be copied in by grub and it would boot.  but then we'd build it on the nodepool builders that were not dib images and didn't have this label in their underlying kernel command line, there were times when dib was not setting it properly in the grub updating23:10
ianwensure-dib will install dib from zuul source, yep, but we also want dib to install glean from zuul src into the built images23:11
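A hedged sketch of the wiring ianw describes: the simple-init element set to its repo install type so DIB pulls glean from the Zuul-prepared checkout rather than PyPI, plus the devuser authorized_keys flag used for login. The element list, paths, and the glean_ref variable are illustrative, not the nodepool job's exact configuration:

```yaml
- name: Build a test image with glean installed from the Zuul checkout
  command: disk-image-create -o test-image vm ubuntu-minimal simple-init devuser
  environment:
    DIB_INSTALLTYPE_simple_init: repo
    DIB_REPOLOCATION_glean: "{{ ansible_user_dir }}/src/opendev.org/opendev/glean"
    DIB_REPOREF_glean: "{{ glean_ref | default('master') }}"
    DIB_DEV_USER_AUTHORIZED_KEYS: "{{ ansible_user_dir }}/.ssh/authorized_keys"
```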
