| *** ykarel_ is now known as ykarel | 07:08 | |
| *** dhill is now known as Guest30693 | 13:40 | |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 14:06 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 14:08 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Fedora 41 and 42 tests https://review.opendev.org/c/openstack/diskimage-builder/+/958519 | 14:12 |
| fungi | lunchtime, bbiab | 15:58 |
| nhicher[m] | clarkb: hello, can you have a look at the comment @yatin added on https://review.opendev.org/c/zuul/zuul-jobs/+/959393? thanks | 16:03 |
| opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 16:04 |
| clarkb | nhicher[m]: the reason is that installing ovs requires extra special packages on some platforms (like centos aiui) | 16:05 |
| clarkb | nhicher[m]: the idea in my mind was if we have to rewrite the entire role anyway then lets choose something that relies on standard linux kernel features so that we are more portable | 16:05 |
| clarkb | I'll leave a gerrit comment | 16:05 |
| nhicher[m] | clarkb: thanks | 16:06 |
| Ramereth[m] | clarkb: fungi I see a VM with the name npd38a8a60f6f64 still active but I got a report that the image attached to it was trying to be removed. Can we manually remove it on our end? It was created a few days ago so I figured it should be deleted by now | 17:30 |
| clarkb | Ramereth[m]: let me take a look. It might be a held node for debugging | 17:31 |
| fungi | there's definitely still a gap in openstack for a user to declare that they no longer care about the image a vm was booted from while still caring about the running vm itself | 17:32 |
| clarkb | Ramereth[m]: you can see the node is in a 'used' state here: https://zuul.opendev.org/t/openstack/nodes if you drop the 'np' prefix from the name and ^F the rest | 17:33 |
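clarkb's lookup trick can be sketched as a small helper: the cloud-side server name carries an "np" prefix that the Zuul nodes listing does not show, so matching a VM against the list means stripping that prefix first and searching for the remainder. The node dict shape below (an `id` field per entry) is an assumption for illustration, not the real Zuul API schema.

```python
def strip_np_prefix(server_name: str) -> str:
    """Drop the 'np' prefix from a cloud-side server name."""
    return server_name.removeprefix("np")

def match_nodes(nodes: list[dict], server_name: str) -> list[dict]:
    """Return node entries whose id contains the de-prefixed server name.

    `nodes` is assumed to be a list of dicts with an 'id' field, mirroring
    what you would ^F for in the Zuul nodes page.
    """
    needle = strip_np_prefix(server_name)
    return [n for n in nodes if needle in str(n.get("id", ""))]
```

Searching for `match_nodes(nodes, "npd38a8a60f6f64")` would then surface the entry whose id is `d38a8a60f6f64`, matching the manual browser search described above.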
| fungi | sort of a "if this vm disappears, i don't need the ability to boot it again" | 17:33 |
| clarkb | that is distinct from 'in-use' and 'hold'. I suspect that the node was used and we haven't been able to delete it for some reason. corvus: does that make sense if a node is in a used state for several days? | 17:34 |
| clarkb | it is active on the cloud side and not deleting. I don't see any errors either. Let me look at launcher logs next | 17:34 |
| Ramereth[m] | sounds good, thanks for checking | 17:39 |
| clarkb | ok I see two node requests associated with the node on 20251103 4924d0b0658f427ea2998f2f5177972b is the first request that boots the node | 17:41 |
| clarkb | then 61ecb590cc7846e99b16da2f72a1d94b attempts to use it until 61ecb590cc7846e99b16da2f72a1d94b seems to be discarded as a request and the node goes into a ready state | 17:41 |
| clarkb | I haven't seen how the node goes from ready to used yet | 17:41 |
| clarkb | 61ecb590cc7846e99b16da2f72a1d94b was fulfilled by zl01 on the 3rd | 17:44 |
| clarkb | so I think it did get used. But I'm not seeing where we flip it to used and try to delete it in the logs | 17:45 |
| clarkb | I suspect that we can delete the node and that is what zuul launcher would intend to do, but haven't confirmed that yet | 17:45 |
| clarkb | corvus: it looks like we log 'Marking node ...... as ready' and 'Marking node ...... as failed' in the launcher logs but not in-use or used. I suspect that this is because we're actually flipping that state in another service like the scheduler or executor | 17:48 |
| clarkb | corvus: should launchers automatically pick up nodes in a used state and delete them? Could it be that these zkobject load failures are preventing that? I know you've said many of those zkobject load failures are not real issues, but I'm wondering if in this case it is | 17:49 |
| clarkb | corvus: at least one of the failed zkobject loads from the 4th (actually just about 15 minutes after the logs on the 3rd) appears to be in a sea of other failures trying to clean up nodes | 17:54 |
| clarkb | I half suspect that we're iterating through some list of nodes to clean up and this one failed to load from an initial state but I'm not certain of that. | 17:55 |
| clarkb | Ramereth[m]: if it isn't a major issue for you to keep that instance running can we leave it for now? I think it may be useful to debug any potential bugs around the node cleanup process | 17:59 |
| clarkb | Ramereth[m]: I worry that if we delete the node manually it may inadvertently clean up zuul database state that is useful to any potential debugging so would like to avoid that for now if we can | 18:00 |
| clarkb | but ya my best guess at the moment is something is preventing the launchers from picking up the node in a used state and cleaning it up. If we can keep it around so that corvus can take a look with his better understanding of the system that may be helpful in fixing a bug | 18:08 |
| opendevreview | Merged opendev/system-config master: Add trixie mirror config to reprepro https://review.opendev.org/c/opendev/system-config/+/965334 | 19:18 |
| fungi | the debian mirror update with trixie started at 20:10 utc, i'll check back in after it probably times out to rerun manually with an indefinitely held lock and no timeout | 20:37 |
| clarkb | thanks! | 21:04 |
| Ramereth[m] | clarkb: sure, let me know later if you need me to delete it. The reason I noticed it was because there was an image that was queued waiting to be deleted and it was related to that VM still using the image. | 22:57 |
| fungi | i had to manually clear a leaked /afs/.openstack.org/mirror/debian/db/lockfile but have a manual debian mirror update in progress in a root screen session on mirror-update.o.o running now | 23:40 |
| fungi | oh... no good | 23:41 |
| fungi | BDB0689 /afs/.openstack.org/mirror/debian/db/checksums.db page 19020 is on free list with type 5 | 23:41 |
| fungi | BDB0061 PANIC: Invalid argument | 23:41 |
| fungi | Internal error of the underlying BerkeleyDB database: | 23:41 |
| fungi | Within checksums.db subtable pool at put(uniq): BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery | 23:41 |
| fungi | BDB0060 PANIC: fatal region error detected; run recovery | 23:41 |
| fungi | db_close(checksums.db, pool): BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery | 23:41 |
| fungi | i kinda hate it when i do a web search and just find a comment from past me... https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-09-13.log.html#t2022-09-13T18:52:32 | 23:43 |
| fungi | let's see how i fixed it three years ago when this exact same situation occurred | 23:43 |
| fungi | that time it was contents.cache.db rather than checksums.db that broke, but i've tried moving checksums.db out of afs to my homedir on afs01.dfw and have reprepro running again now, hopefully it will recreate the db but i'll keep an eye on it through the weekend | 23:48 |
| corvus | clarkb: Ramereth[m] ack on the node debug. i will look at that tomorrow and hopefully get an answer before next week. | 23:50 |
| corvus | clarkb: (and yes, the executor changes the node to used; used+unlocked is expected launcher cleanup; so yes, this is something to debug) | 23:51 |
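The lifecycle corvus confirms can be sketched as a cleanup predicate: the executor flips a node to "used", and a node that is both used and unlocked is what the launcher's cleanup pass is expected to delete. This is a minimal illustration of that rule, not actual Zuul code; the `Node` class and field names are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Node:
    state: str    # e.g. "ready", "in-use", "used", "hold"
    locked: bool  # whether some component (e.g. an executor) still holds a lock

def eligible_for_cleanup(node: Node) -> bool:
    # Held or still-locked nodes must be skipped; only a node that is
    # used AND unlocked should be picked up and deleted by the launcher.
    return node.state == "used" and not node.locked
```

Under this rule, the VM discussed above (used, unlocked, yet alive for days) is exactly the case that should have been deleted, which is why it points at a launcher-side bug rather than expected behavior.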
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!