| *** ykarel_ is now known as ykarel | 07:08 | |
| *** dhill is now known as Guest30693 | 13:40 | |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 14:06 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 14:08 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Fedora 41 and 42 tests https://review.opendev.org/c/openstack/diskimage-builder/+/958519 | 14:12 |
| fungi | lunchtime, bbiab | 15:58 |
| nhicher[m] | clarkb: hello, can you have a look at the comment @yatin added on https://review.opendev.org/c/zuul/zuul-jobs/+/959393? thanks | 16:03 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 16:04 |
| clarkb | nhicher[m]: the reason is that installing ovs requires extra special packages on some platforms (like centos aiui) | 16:05 |
| clarkb | nhicher[m]: the idea in my mind was if we have to rewrite the entire role anyway then lets choose something that relies on standard linux kernel features so that we are more portable | 16:05 |
| clarkb | I'll leave a gerrit comment | 16:05 |
| nhicher[m] | clarkb: thanks | 16:06 |
| Ramereth[m] | clarkb: fungi I see a VM with the name npd38a8a60f6f64 still active but I got a report that the image attached to it was trying to be removed. Can we manually remove it on our end? It was created a few days ago so I figured it should be deleted by now | 17:30 |
| clarkb | Ramereth[m]: let me take a look. It might be a held node for debugging | 17:31 |
| fungi | there's definitely still a gap in openstack for a user to declare that they no longer care about the image a vm was booted from while still caring about the running vm itself | 17:32 |
| clarkb | Ramereth[m]: you can see the node is in a 'used' state here: https://zuul.opendev.org/t/openstack/nodes if you drop the 'np' prefix from the name and ^F the rest | 17:33 |
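clarkb's lookup trick can be sketched as a small helper: the cloud-side server name carries an "np" prefix that the Zuul nodes listing does not show, so matching a VM against the list means stripping that prefix first and searching for the remainder. The node dict shape below (an `id` field per entry) is an assumption for illustration, not the real Zuul API schema.

```python
def strip_np_prefix(server_name: str) -> str:
    """Drop the 'np' prefix from a cloud-side server name."""
    return server_name.removeprefix("np")

def match_nodes(nodes: list[dict], server_name: str) -> list[dict]:
    """Return node entries whose id contains the de-prefixed server name.

    `nodes` is assumed to be a list of dicts with an 'id' field, mirroring
    what you would ^F for in the Zuul nodes page.
    """
    needle = strip_np_prefix(server_name)
    return [n for n in nodes if needle in str(n.get("id", ""))]
```

Searching for `match_nodes(nodes, "npd38a8a60f6f64")` would then surface the entry whose id is `d38a8a60f6f64`, matching the manual browser search described above.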
| fungi | sort of a "if this vm disappears, i don't need the ability to boot it again" | 17:33 |
| clarkb | that is distinct from 'in-use' and 'hold'. I suspect that the node was used and we haven't been able to delete it for some reason. corvus: does that make sense if a node is in a used state for several days? | 17:34 |
| clarkb | it is active on the cloud side and not deleting. I don't see any errors either. Let me look at launcher logs next | 17:34 |
| Ramereth[m] | sounds good, thanks for checking | 17:39 |
| clarkb | ok I see two node requests associated with the node on 20251103 4924d0b0658f427ea2998f2f5177972b is the first request that boots the node | 17:41 |
| clarkb | then 61ecb590cc7846e99b16da2f72a1d94b attempts to use it until 61ecb590cc7846e99b16da2f72a1d94b seems to be discarded as a request and the node goes into a ready state | 17:41 |
| clarkb | I haven't seen how the node goes from ready to used yet | 17:41 |
| clarkb | 61ecb590cc7846e99b16da2f72a1d94b was fulfilled by zl01 on the 3rd | 17:44 |
| clarkb | so I think it did get used. But I'm not seeing where we flip it to used and try to delete it in the logs | 17:45 |
| clarkb | I suspect that we can delete the node and that is what zuul launcher would intend to do, but haven't confirmed that yet | 17:45 |
| clarkb | corvus: it looks like we log 'Marking node ...... as ready' and 'Marking node ...... as failed' in the launcher logs but not in-use or used. I suspect that this is because we're actually flipping that state in another service like the scheduler or executor | 17:48 |
| clarkb | corvus: should launchers automatically pick up nodes in a used state and delete them? Could it be that these zkobject load failures are preventing that? I know you've said many of those zkobject load failures are not real issues, but I'm wondering if in this case it is | 17:49 |
| clarkb | corvus: at least one of the failed zkobject loads from the 4th (actually just about 15 minutes after the logs on the 3rd) appears to be in a sea of other failures trying to clean up nodes | 17:54 |
| clarkb | I half suspect that we're iterating through some list of nodes to clean up and this one failed to load from an initial state but I'm not certain of that. | 17:55 |
| clarkb | Ramereth[m]: if it isn't a major issue for you to keep that instance running can we leave it for now? I think it may be useful to debug any potential bugs around the node cleanup process | 17:59 |
| clarkb | Ramereth[m]: I worry that if we delete the node manually it may inadvertently clean up zuul database state that is useful to any potential debugging so would like to avoid that for now if we can | 18:00 |
| clarkb | but ya my best guess at the moment is something is preventing the launchers from picking up the node in a used state and cleaning it up. If we can keep it around so that corvus can take a look with his better understanding of the system that may be helpful in fixing a bug | 18:08 |
| opendevreview | Merged opendev/system-config master: Add trixie mirror config to reprepro https://review.opendev.org/c/opendev/system-config/+/965334 | 19:18 |
| fungi | the debian mirror update with trixie started at 20:10 utc, i'll check back in after it probably times out to rerun manually with an indefinitely held lock and no timeout | 20:37 |
| clarkb | thanks! | 21:04 |
| Ramereth[m] | clarkb: sure, let me know later if you need me to delete it. The reason I noticed it was because there was an image that was queued waiting to be deleted and it was related to that VM still using the image. | 22:57 |
| fungi | i had to manually clear a leaked /afs/.openstack.org/mirror/debian/db/lockfile but have a manual debian mirror update in progress in a root screen session on mirror-update.o.o running now | 23:40 |
| fungi | oh... no good | 23:41 |
| fungi | BDB0689 /afs/.openstack.org/mirror/debian/db/checksums.db page 19020 is on free list with type 5 | 23:41 |
| fungi | BDB0061 PANIC: Invalid argument | 23:41 |
| fungi | Internal error of the underlying BerkeleyDB database: | 23:41 |
| fungi | Within checksums.db subtable pool at put(uniq): BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery | 23:41 |
| fungi | BDB0060 PANIC: fatal region error detected; run recovery | 23:41 |
| fungi | db_close(checksums.db, pool): BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery | 23:41 |
| fungi | i kinda hate it when i do a web search and just find a comment from past me... https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-09-13.log.html#t2022-09-13T18:52:32 | 23:43 |
| fungi | let's see how i fixed it three years ago when this exact same situation occurred | 23:43 |
| fungi | that time it was contents.cache.db rather than checksums.db that broke, but i've tried moving checksums.db out of afs to my homedir on afs01.dfw and have reprepro running again now, hopefully it will recreate the db but i'll keep an eye on it through the weekend | 23:48 |
| corvus | clarkb: Ramereth[m] ack on the node debug. i will look at that tomorrow and hopefully get an answer before next week. | 23:50 |
| corvus | clarkb: (and yes, the executor changes the node to used; used+unlocked is expected launcher cleanup; so yes, this is something to debug) | 23:51 |
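The lifecycle corvus confirms can be sketched as a cleanup predicate: the executor flips a node to "used", and a node that is both used and unlocked is what the launcher's cleanup pass is expected to delete. This is a minimal illustration of that rule, not actual Zuul code; the `Node` class and field names are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Node:
    state: str    # e.g. "ready", "in-use", "used", "hold"
    locked: bool  # whether some component (e.g. an executor) still holds a lock

def eligible_for_cleanup(node: Node) -> bool:
    # Held or still-locked nodes must be skipped; only a node that is
    # used AND unlocked should be picked up and deleted by the launcher.
    return node.state == "used" and not node.locked
```

Under this rule, the VM discussed above (used, unlocked, yet alive for days) is exactly the case that should have been deleted, which is why it points at a launcher-side bug rather than expected behavior.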
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!