opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 12:54 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 13:01 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 13:04 |
dansmith | sean-k-mooney: so check this out: https://264a86dd518c5142b8a6-3b04393c7506ab488b4fde073fc22e36.ssl.cf1.rackcdn.com/881764/1/check/cinder-tempest-plugin-lvm-multiattach/d28afe7/testr_results.html | 13:31 |
dansmith | maybe kashyap too ^ | 13:32 |
dansmith | that almost kinda looks like we attached the disk to the guest and the kernel crashed while probing partitions/filesystems, right? | 13:32 |
dansmith | in sysfs_add_file_mode_ns() | 13:33 |
dansmith | or perhaps while creating sysfs entries for the disk? | 13:33 |
dansmith | eharney: you around by chance? | 16:30 |
eharney | dansmith: yes | 16:45 |
dansmith | eharney: so, I seem to have gotten the nova ceph job down to a single repeatable failure | 16:45 |
dansmith | it is in the volume extend test, and it fails during cleanup | 16:45 |
dansmith | it's trying to, I guess, detach the volume from the server before deleting the server, and then before deleting the volume | 16:46 |
dansmith | it does *not* fail locally, so I don't think it's something fundamentally broken with new ceph or anything like that | 16:46 |
dansmith | https://955f32f8268e5d475e65-6c8f4c6e546a0854b4c11cc7c78829ca.ssl.cf5.rackcdn.com/881585/6/check/nova-ceph-multistore/1ecb09a/testr_results.html | 16:46 |
eharney | dansmith: let me take a look through the logs | 16:47 |
eharney | detach failing isn't something i'm familiar with (other than it being mentioned here the other day) | 16:47 |
dansmith | I've just been tracing through the code looking for how this works and it seems to me like everything is working, but perhaps it's just legitimately that the guest doesn't let go of the volume when we detach after the resize happened | 16:48 |
dansmith | okay, volume detach is without a doubt our most common failure | 16:48 |
dansmith | eharney: yeah appreciate if you could see if you spot anything in the logs | 16:48 |
dansmith | I see the guest saw the size change on vdb and also mentions that it's resizing the filesystem on it... that looks like more than just the block device size change, | 17:00 |
dansmith | so I wonder if it is literally doing a resize2fs on it and that is still happening when we try to issue the detach and that gets us stuck | 17:00 |
dansmith | because we start the detach less than half a second after the resize happens | 17:01 |
eharney | dansmith: yeah, i was also just looking down the path of whether the libvirt block resize call is synchronous or not (it looks like nova assumes it is?) | 17:03 |
dansmith | eharney: synchronous with what? It's synchronous to libvirt from compute, but I don't know that it waits to return until it's delivered to the guest, but definitely stuff like resizing filesystems would happen after that returns | 17:04 |
dansmith | even still, the test polls for completion of the operation as far as nova is concerned, but the guest stuff would all be async | 17:05 |
eharney | i see | 17:05 |
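(For context on the synchronicity question above: the call under discussion is libvirt's block resize API. The sketch below is only an illustration of its semantics as described in the conversation, not nova's actual code path; the function name and connection details are made up for the example. The resize call returns once the hypervisor has grown the device, while anything the guest does in response, such as a filesystem resize, happens asynchronously afterwards.)

```python
# Minimal sketch of the libvirt-python call being discussed; illustrative only,
# not nova's actual driver code.
import libvirt

# VIR_DOMAIN_BLOCK_RESIZE_BYTES makes the size argument bytes instead of KiB.
RESIZE_BYTES = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES


def extend_attached_volume(conn_uri, domain_name, device, new_size_bytes):
    """Resize an attached block device (e.g. 'vdb') on a running guest."""
    conn = libvirt.open(conn_uri)
    try:
        dom = conn.lookupByName(domain_name)
        # blockResize() returns once qemu has grown the device; anything the
        # guest does in response (partition rescans, resize2fs, ...) happens
        # asynchronously after this call returns.
        dom.blockResize(device, new_size_bytes, RESIZE_BYTES)
    finally:
        conn.close()


# Example usage (hypothetical domain/device names):
# extend_attached_volume('qemu:///system', 'instance-00000001', 'vdb', 2 * 1024**3)
```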
dansmith | I think maybe I should make the test ssh to the guest and see if it's mounted, and perhaps try to unmount it before the test ends or something, to avoid racing with the detach | 17:05 |
dansmith | the resize happens before cirros is even done with its startup stuff, | 17:11 |
dansmith | so I wonder if, on a slow emulated guest, we mount all the filesystems we find, which means it's mounted in the guest when we resize, so it does the resize2fs activity automatically | 17:11 |
dansmith | but on a fast non-nested local run, it finishes startup before we do the attach and thus doesn't end up with it mounted during the resize | 17:11 |
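(The idea floated above, having the test SSH into the guest and check or unmount the volume before the detach is issued, might look roughly like the sketch below. It is not tempest's API; the helper name, credentials, and device path are placeholders, and it assumes the guest is reachable over SSH.)

```python
# Rough sketch of "make sure the volume isn't mounted before detaching",
# using plain paramiko rather than tempest's remote client. Illustrative only.
import paramiko


def ensure_volume_unmounted(host, username, password, device='/dev/vdb'):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password, timeout=30)
    try:
        # Check whether the device (or a partition on it) shows up in the mounts.
        _, stdout, _ = client.exec_command('mount')
        mounted = any(device in line for line in stdout.read().decode().splitlines())
        if mounted:
            # Unmount before the detach is requested from the API side, so the
            # detach doesn't race with in-guest activity such as resize2fs.
            _, stdout, _ = client.exec_command('sudo umount %s' % device)
            stdout.channel.recv_exit_status()
    finally:
        client.close()
```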
eharney | is resize2fs etc triggered by qemu-guest-agent? | 17:12 |
dansmith | could be.. does cirros have the guest agent in it? I assumed not | 17:13 |
eharney | i don't think so | 17:13 |
dansmith | on regular systems I've never had resize2fs triggered automatically for me, so I'm not really sure why that would happen | 17:13 |
dansmith | but it seems like it is here | 17:13 |
eharney | i would have guessed that resize2fs doesn't happen in these test jobs | 17:14 |
dansmith | me too | 17:14 |
dansmith | you see it in the output though right? | 17:14 |
eharney | no, where is that? | 17:14 |
dansmith | [ 48.156649] EXT4-fs (vda1): resizing filesystem from 25600 to 259835 blocks | 17:14 |
dansmith | [ 48.255755] EXT4-fs (vda1): resized filesystem to 259835 | 17:14 |
dansmith | in the guest console dump | 17:14 |
dansmith | oh damn | 17:14 |
dansmith | that is vda, nevermind! | 17:14 |
dansmith | that's cirros resizing its root disk on startup, not the attached volume | 17:15 |
eharney | ah, right | 17:15 |
dansmith | mah bad | 17:15 |
dansmith | so, without a doubt the most common failure in nova jobs is failing to detach volumes | 17:17 |
dansmith | we've been trying to get a handle on it for a long time, | 17:17 |
eharney | does it show up on non-rbd volumes? | 17:17 |
dansmith | yeah | 17:17 |
dansmith | there is some assertion that if we attach to an instance before it is far enough along during boot, then it might prevent it from being detached later | 17:18 |
dansmith | I'm slightly skeptical of that, but we've been adding "wait for sshable" checks everywhere | 17:18 |
eharney | i guess i'm not sure what kind of conditions in the libvirt area would prevent detach from completing | 17:19 |
dansmith | I just added that for this test recently (merged on Friday), which passes normally (and passes locally), but with this rbd job it seems to fail... it passed the gate on the focal-based rbd, but not on new ceph and jammy | 17:19 |
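(The "wait for sshable" checks mentioned above are tempest-side guards that delay attach/detach until the guest has finished booting far enough to accept SSH. The snippet below is a simplified, standalone sketch of that pattern using plain sockets; it is not tempest's actual helper, and the timeouts are arbitrary examples.)

```python
# Simplified sketch of a "wait until the guest is SSHable" poll; illustrative only.
import socket
import time


def wait_for_sshable(host, port=22, timeout=300, interval=5):
    """Poll until something is accepting TCP connections on the SSH port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    raise TimeoutError('guest at %s:%d never became SSHable' % (host, port))
```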