opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/952006 | 02:25 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 03:33 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 06:04 |
*** amoralej_ is now known as amoralej | 11:59 |
fungi | yeah, my 952408 is still in check with a paused buildset registry job 14 hours later. i'll leave it there for the moment since it's not an urgent fix or anything, and maybe corvus will have some ideas as to what sort of corner case/race we tripped over | 12:41 |
fungi | okay that's odd, it completed moments ago, after running for nearly 15 hours | 13:13 |
fungi | - opendev-buildset-registry https://zuul.opendev.org/t/openstack/build/b891f8394e074cae9b9d3d4df691decc : SUCCESS in 14h 51m 21s | 13:14 |
clarkb | fungi: was it stalled out after all the child jobs had run or before? | 14:57 |
clarkb | that job pauses and waits for child jobs to complete. Just wondering if it didn't get the complete signal in a reasonable amount of time or if the child jobs took a long time to schedule | 14:57 |
fungi | the child job failed because the parent change timed out performing an image build (sort of surprised that could even happen in an independent pipeline), and then the buildset registry job remained paused for 14+ hours after that | 14:59 |
fungi | the failure on the child change's system-config-run-gitea job was: requires artifact(s) gitea-container-image provided by build e99d34c1c7e54d4685361256f18dcd35 (triggered by change 952407 on project opendev/system-config), but that build failed with result "TIMED_OUT" | 15:01 |
fungi | while i'm still wondering what transpired to leave the registry job paused for so long, i'm even more curious as to what could have eventually unstuck it | 15:03 |
fungi | maybe something merged in a random project that triggered a reconfigure? | 15:04 |
clarkb | maybe? | 15:15 |
clarkb | child jobs failing should also be sufficient to unpause | 15:15 |
fungi | right, but clearly didn't in this case since the child jobs all ended 14 hours earlier | 15:16 |
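The unpause rule being debated here can be sketched generically. This is a hypothetical illustration, not Zuul's actual implementation: a paused parent job should resume once every child build has reported some result, successful or not.

```python
# Hypothetical sketch of the pause/resume rule discussed above; NOT Zuul's
# actual implementation, just an illustration of "child jobs finishing
# (even with failures) should be sufficient to unpause the parent".

def should_resume(child_results):
    """child_results maps child job name -> result string, or None while
    the child is still running. The paused parent may resume only once
    every child has reported a result (SUCCESS, FAILURE, TIMED_OUT, ...)."""
    return all(result is not None for result in child_results.values())


# One child still running: stay paused.
print(should_resume({"system-config-run-gitea": None}))       # False
# All children reported, even with a failure: resume.
print(should_resume({"system-config-run-gitea": "FAILURE"}))  # True
```

Under that rule, the registry job here should have unpaused as soon as the child's TIMED_OUT result landed, which is why the 14-hour stall looks like a missed event rather than intended behavior.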
fungi | heading out to grab lunch, should be back in an hour or so | 15:24 |
corvus | the reconfigure sounds like a good lead; last was at 13:03 | 16:04 |
fungi | that does roughly coincide with when the job unpaused | 16:43 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 18:02 |
clarkb | mnasiadka: tonyb: ^I took the liberty of doing a quick update. I think the failed_when in your loops is causing the tasks to fail immediately after one iteration (hence the wait for nova to convert the image and boot the node before we check the node status) | 18:03 |
clarkb | instead I think we can just rely on retries and until to determine if we fail due to a timeout of the state not reaching what we want | 18:03 |
clarkb | then I also bumped up the retry count significantly so that we can see if these tasks ever succeed. I suspect part of the problem previously was just not waiting long enough | 18:04 |
clarkb | and finally I put the devstack run in pre-run since we're not actually testing devstack here. This also fixes some node information collection stuff I think | 18:04 |
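The retries/until approach clarkb describes (keep polling until the resource reaches the desired state, and only fail on timeout, rather than letting a failed_when abort on the first mismatched check) can be sketched in plain Python. The names here are illustrative, not from the actual change:

```python
# Minimal sketch of the poll-until-ready pattern described above: keep
# checking the resource state and fail only when retries are exhausted,
# instead of treating the first non-matching status as a failure (as a
# failed_when inside a loop would). Names are hypothetical.
import time

def wait_for_state(get_state, desired, retries=120, delay=1.0):
    """Poll get_state() until it returns `desired`; raise on timeout."""
    for _ in range(retries):
        if get_state() == desired:
            return
        time.sleep(delay)
    raise TimeoutError(f"resource never reached state {desired!r}")

# Example: a server that becomes ACTIVE after a couple of polls succeeds
# where an immediate status check would have failed.
states = iter(["BUILD", "BUILD", "ACTIVE"])
wait_for_state(lambda: next(states), "ACTIVE", retries=10, delay=0)
```

Ansible's `until`/`retries`/`delay` task keywords implement the same loop natively, which is why dropping the per-iteration failure condition and bumping the retry count is enough.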
clarkb | fungi: I noticed that the devstack install for 949942 is building a zstd wheel for this version of zstd: https://pypi.org/project/zstd/1.5.7.0/#files but that listing shows there is already a python3.12 manylinux wheel. My hunch right now is that wheel is built for glibc 2.4 or greater. Noble appears to have glibc 2.39. Is this a pip wheel compatibility parsing error where 2.4 is > | 18:27 |
clarkb | 2.39 maybe? | 18:27 |
clarkb | there probably is a simpler explanation but I'm not seeing it on initial inspection | 18:27 |
clarkb | the python3.5 and 3.6 wheels are actually built for glibc 2.14+ | 18:31 |
clarkb | it does seem like there may be something fishy going on there. | 18:31 |
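The parsing question clarkb raises can be checked directly: as strings, "2.4" does sort after "2.39", while a correct numeric comparison orders them the other way. A small illustration using the glibc versions mentioned above (pip compares manylinux tags numerically, so this would only bite a tool doing naive string comparison):

```python
# Illustration of the version-comparison pitfall raised above: compared as
# strings, "2.4" sorts after "2.39", but numerically glibc 2.39 is newer
# than 2.4, so a manylinux wheel built for glibc >= 2.4 is compatible.

def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

wheel_min = "2.4"   # minimum glibc the wheel was built against
system = "2.39"     # glibc shipped on Ubuntu Noble

print(wheel_min <= system)                                # False (string compare)
print(version_tuple(wheel_min) <= version_tuple(system))  # True (numeric)
```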
fungi | clarkb: the cp312 wheel seems to be for i686 arch not x86_64 | 18:58 |
fungi | clarkb: in https://github.com/sergey-dryabzhinsky/python-zstd/issues/233 the maintainer comments "I know that wheels for python3.12 amd64 is missing, but I don't know how to fix it. I tried restarting action already." | 19:14 |
Clark[m] | Oh that would explain it. I skimmed right over that | 19:40 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 20:43 |
corvus | i restarted zuul-launcher to get the most recent updates | 21:25 |
fungi | thanks! | 21:25 |
fungi | logins to wiki.openstack.org have broken again, rebooting it now | 21:36 |
fungi | load average on it is around 65 | 21:37 |
fungi | working again now | 21:39 |
Clark[m] | fungi: I see you were already talking to the cfn folks about branch creation. Did you want to followup to their email from today or should I? Looks like the acl is missing the branch creation rule | 21:41 |
fungi | they e-mailed me privately at which point i sent them the link to the infra-manual section that talks about adding the acl entry, and suggested that in the future they ask on the ml or in irc rather than sending e-mail to me | 21:46 |
fungi | i haven't gotten around to replying to their list message, or looking at what their acl is like for that matter, but would likely just end up quoting from the same document i already sent them a link to | 21:47 |
fungi | i doubt i'll get to it tonight though | 21:48 |
Clark[m] | oh I see you already pointed them to the acl update | 21:50 |
Clark[m] | it's hard to parse the thread in my mail client. It got squashed | 21:51 |
fungi | though part of me worries that by me continuing to reply to them, just on the ml instead of privately, they won't understand the distinction or reason for asking in normal channels | 21:51 |
Clark[m] | ya I'll respond to try and push the list angle more | 21:51 |
fungi | so maybe it would make sense for a second person to reply just so they don't get the impression i'm the only one around | 21:51 |
Clark[m] | response sent. Let me know if you think any additional info would be helpful | 21:57 |
fungi | looks perfect, thanks!!! | 21:58 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 22:18 |
Clark[m] | tonyb: mnasiadka: ok I think if the new rescue block works this should help with debugging. Then I realized that we're testing with almalinux which I think is a bad choice since we (opendev) don't do any almalinux image builds so it's harder to say if it should work at all. I updated the change to also build ubuntu noble. | 22:19 |
Clark[m] | I think almalinux may be failing due to non working dhcp | 22:22 |
Clark[m] | but its all networkmanager which is find indecipherable so being able to confirm on ubuntu would be nice | 22:22 |
Clark[m] | tonyb: mnasiadka: I bet there is no dhcp server for the public network. But I'm not positive of that. It may be best to simply boot a node and attach a floating ip to it | 22:31 |
Clark[m] | rather than try and do something fancy with neutron networking | 22:31 |
tonyb | thanks clarkb. I picked alma because it starts with "a". it seems like glean doesn't write out the config, so I was happy to blame glean and move on, but it works under nodepool so there is something else happening | 22:33 |
tonyb | but I can add the floating support | 22:34 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 22:34 |
Clark[m] | tonyb: If you look at the console log here: https://zuul.opendev.org/t/openstack/build/940a11f6abe0496782199059fe51aa08/console#6/0/34/controller it is definitely using glean. But then there are a bunch of messages about failed dhcp | 22:35 |
Clark[m] | which is why my hunch now is that by connecting to the public network we're getting a network that does not have dhcp configured | 22:36 |
Clark[m] | it's been a while since I looked closely at a default devstack install's network setup. But my hunch is the public network they push isn't really for direct attach | 22:39 |
Clark[m] | we can probably configure it to work that way or just use the cloud in the manner that is expected of us | 22:40 |
tonyb | okay. that more or less matches the testing I did on a held node. there was definitely no dhcp reply but looking at the network json file in the config drive it looked to me like glean should be writing a config file for a static address and it wasn't | 22:42 |
tonyb | that's why I pushed the change to include the existing nodepool testing to locate the difference | 22:43 |
Clark[m] | hrm I think the default is dhcp | 22:43 |
Clark[m] | for neutron networks I mean | 22:43 |
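For the glean question above, one way to debug is to inspect the config drive's network_data.json directly and see whether the metadata declares a static or DHCP configuration. This is a generic sketch using the OpenStack network_data.json field names; the mount path is an assumption and varies by setup:

```python
# Sketch of inspecting a config-drive network_data.json (the file glean
# consumes) to see which network modes the metadata declares. Each entry
# in "networks" has a "type" such as "ipv4" (static address expected) or
# "ipv4_dhcp" (DHCP expected). The default path below is an assumption
# about where the config drive is mounted.
import json

def declared_network_modes(path="/mnt/config/openstack/latest/network_data.json"):
    with open(path) as f:
        data = json.load(f)
    return [net.get("type") for net in data.get("networks", [])]
```

If this reports "ipv4" for an interface, glean should be writing a static config and DHCP failures are beside the point; if it reports "ipv4_dhcp", the missing DHCP server on the attached network is the real problem.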
tonyb | today was going to be "learn glean" day to confirm all of that | 22:44 |
Clark[m] | the nodepool based testing is almost certainly going to be using floating ips. we should be able to check that /me looks | 22:44 |
tonyb | I see some good feedback on the change and I can look at that once the caffeine hits | 22:44 |
Clark[m] | oh nope the container image builds for nodepool are failing so those jobs aren't running | 22:44 |
Clark[m] | tonyb: fwiw some of the feedback I incorporated into what I've already pushed. Mostly around getting better debugging info out to see what is going on | 22:45 |
tonyb | perfect | 22:45 |
Clark[m] | but yes I suspect that with nodepool based testing its doing floating ips | 22:46 |
tonyb | the raw vs qcow thing I can address. I started with raw but it failed, switched to qcow, it also failed. fixed the real problem (swift max object size) and never switched back to raw. | 22:47 |
tonyb | switching back makes sense as we are definitely wasting time with all the switching between formats | 22:49 |
tonyb | I'll also look at ianw's feedback as I don't quite understand it. | 22:50 |
tonyb | Clark[m]: are you okay for me to "take the reins" once the current jobs complete? | 22:51 |
tonyb | I don't want to mess up what you're working on | 22:52 |
Clark[m] | yup and feel free to push updates at this point. I think the last fix I have is working (the job got far enough) | 22:55 |
Clark[m] | don't need to wait for things to complete unless you want to get ubuntu image feedback yourself | 22:55 |
Clark[m] | actually looks like ubuntu image may be working? | 22:56 |
Clark[m] | so maybe you want to wait for it to finish to get that logging data recorded in an easy to find location | 22:56 |
tonyb | yup I'll wait for things to complete. I need another coffee anyway | 22:59 |
ianw | tonyb: sorry if not clear -- the nodepool build config file sets several flags. The DIB_REPOREF_* ones are pretty simple, they just point dib at the zuul checkout of glean instead of pypi to make sure it's co-installed. the one to think about is DIB_DEV_USER_AUTHORIZED_KEYS ... i think we've been using that to login | 23:04 |
tonyb | ianw: any lack of clarity/confusion lay with me not you | 23:09 |
tonyb | I think the dib repo ref stuff is no longer needed now that the ensure dib role installs dib from the repo as setup by zuul | 23:10 |
ianw | the LABEL= one ... that's a mess. the root disk has label "cloudimg-rootfs" ... IIRC basically grub would install in the built image, so if the currently booted kernel (like on a dib-built gate node) had LABEL=cloudimg-rootfs set on the kernel command-line, that would be copied in by grub and it would boot. but then we'd build it on the nodepool builders that were not dib images and didn't have this label in their underlying kernel command line, | 23:10 |
ianw | there were times when dib was not setting it properly when updating grub | 23:10 |
ianw | ensure-dib will install dib from zuul source, yep, but we also want dib to install into the built images glean from zuul src | 23:11 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!