ykarel | Hi, is the issue with the latest ubuntu jammy image known? Jobs using linuxbridge/openvswitch are broken as a buggy kernel shipped | 04:11 |
ykarel | reported https://bugs.launchpad.net/neutron/+bug/2091990 | 05:04 |
tonyb | ykarel: Have you reported that to the ubuntu kernel team/package? | 05:06 |
ykarel | tonyb, just commented on the linked bugs | 05:07 |
tonyb | ykarel: Okay I think I follow | 05:14 |
ykarel | also asking on #ubuntu-kernel on libera.chat, or is there any other place to report? | 05:15 |
ykarel | tonyb, any way to hold this update on infra side? | 05:16 |
tonyb | ykarel: Thanks, that's probably adequate. I assume if they need more info they'll ask and/or point you to the right process | 05:16 |
tonyb | ykarel: I'm looking into it | 05:16 |
ykarel | like reverting to previous good image and stopping new updates until there is new kernel fix | 05:16 |
ykarel | /etc/dib-builddate.txt 2024-12-17 07:45 | 05:17 |
tonyb | ykarel: I think it's recoverable, but it requires a little more nodepool ops experience than I have. I'm reading, but it might be a job for frickler | 05:43 |
tonyb | infra-root: ykarel noticed https://bugs.launchpad.net/neutron/+bug/2091990 with jammy kernels. Looking at the nodepool logs ubuntu-jammy-e6937ec44c254831b6a59904e5e3c655 contains the bad kernel but ubuntu-jammy-6b82bb6f7d00476c90a7ab18de68b645 contains the previous ... expected good kernel. I'm not sure of the right way to force the clouds+regions that have the latest image to go back one ... and pause uploading new versions | 05:46 |
frickler | tonyb: ykarel: ok, I did this now, let's see whether it helps, not sure how old the broken kernel build actually is https://paste.opendev.org/show/b04mPkWgdEVY6nE3kaRU/ | 05:51 |
frickler | maybe jamespage is also around and can help push a fix/revert on the ubuntu archive side | 05:52 |
tonyb | frickler: thanks | 05:58 |
frickler | #status log deleted the latest ubuntu-jammy image and paused new builds for it in an attempt to avoid a likely kernel bug | 06:00 |
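A rough sketch of what rolling back an image like this involves on the nodepool side (frickler's actual commands are in the linked paste; the CLI names below are current nodepool ones, but take the exact identifier that dib-image-delete wants from dib-image-list rather than from this sketch):

```shell
# Run on a nodepool builder/launcher host with access to ZooKeeper.

# Find the ubuntu-jammy builds and their states; note the identifier of the
# build that shipped the bad kernel
nodepool dib-image-list | grep ubuntu-jammy

# Delete that build; providers then fall back to the previous build, which is
# still uploaded (pausing new builds is covered further down in this log)
nodepool dib-image-delete <id-of-the-bad-ubuntu-jammy-build>
```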
opendevstatus | frickler: finished logging | 06:00 |
*** ykarel_ is now known as ykarel | 06:02 | |
ykarel | thx frickler tonyb | 06:04 |
tonyb | frickler: thanks, and I assume the launcher will detect clouds that have e693... (re)upload 6b82... and cycle the nodes | 06:05 |
ykarel | and how much time does ^ usually take? | 06:05 |
frickler | the clouds also keep the two copies, no reupload needed. the deletion is almost done, just openmetal seems to be very slow at it | 06:06 |
frickler | so nodes starting anywhere else now should already be using the old image | 06:07 |
frickler | note that jobs may use up existing nodes for a bit, though | 06:07 |
tonyb | frickler: thanks for the clarification | 06:07 |
ykarel | thx, me rechecks and will have results in some time | 06:08 |
frickler | ah, openmetal cannot delete images that are still in use. there is also an image 116d old that nodepool keeps trying to delete, I assume that may be the one we booted the mirror node from | 06:15 |
frickler | Failed to delete image with name or ID 'ubuntu-jammy-1734429394': ConflictException: 409: Client Error for url: http://192.168.3.254:9292/v2/images/eca58c19-ce6f-4e74-b3fa-a955293042cb, : 409 Conflict: Image eca58c19-ce6f-4e74-b3fa-a955293042cb could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance. | 06:15 |
frickler | not sure whether we want to look into that with openmetal folks or just accept it as a given | 06:16 |
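For the "in use" conflict above, a hedged sketch of how one might check from the cloud side what still references the image. These are standard openstackclient commands; the UUID is the one from the Glance error, and for boot-from-volume nodes the reference lives on the root volume rather than on the server:

```shell
# Servers that nova still associates with the image
openstack server list --all-projects \
    --image eca58c19-ce6f-4e74-b3fa-a955293042cb

# Per suspect volume, the source image is recorded in its image metadata
openstack volume show <volume-id>   # look at the volume_image_metadata field
```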
ykarel | ok, can confirm jobs started using the old kernel in the latest triggered job: KERNEL_VERSION=5.15.0-126-generic | 06:17 |
ykarel | ^ from Provider: openmetal-iad3 | 06:17 |
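The check ykarel describes is easy to reproduce on any node booted from the rolled-back image; a minimal sketch, nothing job-specific assumed:

```shell
# Image build date, written by diskimage-builder at image build time
cat /etc/dib-builddate.txt

# Running kernel; 5.15.0-126-generic is the previous, known-good version noted
# above, so a newer 5.15 build is suspect until the fixed kernel ships
uname -r
```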
tonyb | frickler: Ahh okay. I think it's definitely something we should look into, IIRC I mistakenly booted the mirror node in the wrong tenant, so perhaps we should move it, which would free up the image for deletion | 06:19 |
frickler | too bad that this bug will reinforce the "IPv6 is bad" impression for some people | 06:27 |
*** darmach5 is now known as darmach | 14:24 | |
clarkb | frickler: yes this is a fundamental problem with boot from volume imo. I wish that the cloud could logically separate the resources instead of making humans keep track of it all | 15:48 |
clarkb | frickler: tonyb ykarel does the kernel bug affect any releases other than jammy? | 15:50 |
frickler | clarkb: afaict one could configure cinder to flatten volumes after cloning from an image, not sure though how that'd affect performance | 15:50 |
frickler | clarkb: no, only 5.15 kernel affected | 15:51 |
clarkb | frickler: I'm saying that this shouldn't be a configuration thing at all. I should be able to delete an image even if things are booted from it. Ownership of that data in ceph or wherever should simply be attributed to the instance and not the image at that point. Basically I want hardlinks for images but instead we have symlinks to the glance-owned resource | 15:58 |
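clarkb's hardlink/symlink analogy maps onto how Glance and Cinder typically share a Ceph cluster: a volume cloned from an image is an RBD child of a protected snapshot on the Glance-owned RBD image, which is why Glance refuses the delete. A hedged sketch of inspecting and breaking that chain with plain rbd commands; the pool names are common defaults, not confirmed for this cloud:

```shell
# List clones that still depend on the glance image's snapshot
# (glance's RBD backend keeps each image behind a protected snapshot named "snap")
rbd children images/eca58c19-ce6f-4e74-b3fa-a955293042cb@snap

# Flattening a child copies the parent data into it and drops the dependency,
# roughly the "hardlink" semantics being asked for; cinder's RBD driver can
# also be told to flatten at clone time, as frickler notes above
rbd flatten volumes/volume-<cinder-volume-uuid>
```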
clarkb | frickler: looks like https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2091990 should tell us when it is safe to unpause the jammy builds too? | 16:27 |
frickler | clarkb: yes, I'm subscribed to that bug. do we want to write the pause into the config, too? I applied it manually this morning in order to avoid being too slow and having the old image getting lost | 16:31 |
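If the pause does get written into configuration rather than only applied by hand, it would presumably be a small change to the builder's diskimage definition. A sketch of what that might look like, with the exact file location in project-config left as an assumption:

```shell
# Hypothetical: in the nodepool builder config that defines the ubuntu-jammy
# diskimage, something like:
#
#   diskimages:
#     - name: ubuntu-jammy
#       pause: true   # hold builds/uploads until a fixed 5.15 kernel ships
#
# Unpausing later is just a revert of the same change.
```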
clarkb | that might be good. I'm also trying to understand if this affects any of our jammy nodes | 16:31 |
tonyb | clarkb: I wasn't able to determine that yesterday but it's on my to-do list for today | 22:01 |
tonyb | but it looks like the Ubuntu kernel team has already answered that question | 22:05 |
clarkb | tonyb: ya seems like oracular and jammy only | 22:08 |
clarkb | still no updates to ubuntu jammy proposed unfortunately | 22:27 |
tonyb | Yeah :/ | 22:27 |
tonyb | I wonder how/why noble was skipped | 22:27 |
clarkb | I get the sense that the SRU process for each kernel is somewhat independent of the others? | 22:29 |
clarkb | and those two just happened to be updated already? But I'm not positive of that | 22:30 |
*** haleyb is now known as haleyb|out | 22:47 | |
tonyb | Yeah, seems somewhat strange to me. Speaking only as an ex-kernel engineer and distro packager | 23:34 |
opendevreview | Clark Boylan proposed openstack/pbr master: Update PBR testing for Noble https://review.opendev.org/c/openstack/pbr/+/938030 | 23:48 |
clarkb | after leaving my comments on https://review.opendev.org/c/openstack/pbr/+/924216 I decided to try and figure out what actually is required, and the first step is getting CI running on noble. That said, the python3.12 unittests pass, so ya I think PBR itself works; it's largely the supporting structures that we need to update, then also maybe dealing with deprecations and the like | 23:49 |
clarkb | I've forced the docs job to noble in that first patchset to determine if we need setuptools in doc/requirements.txt. Currently the job runs on jammy so it just works | 23:49 |
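A hedged sketch of the kind of pin clarkb describes for exercising the docs job on noble; the job and nodeset names below are illustrative guesses, not the contents of the actual 938030 patchset:

```shell
# Hypothetical .zuul.yaml variant forcing the docs build onto a noble node:
#
#   - job:
#       name: pbr-docs-noble
#       parent: openstack-tox-docs
#       nodeset: ubuntu-noble
#
# If the sphinx build fails there over a missing setuptools/pkg_resources,
# that would argue for adding setuptools to doc/requirements.txt.
```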
clarkb | I wasn't a fan of precommit before ^ but now that precommit doesn't log what it is installing into the env that it runs commands out of, I'm really not a fan | 23:50 |
clarkb | I only know that hacking needs to be updated to make this work because I've had to debug this elsewhere, but in those cases I wasn't using pre-commit so it was easier | 23:51 |