opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 12:54 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 13:01 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros https://review.opendev.org/c/openstack/nova/+/881912 | 13:04 |
dansmith | sean-k-mooney: so check this out: https://264a86dd518c5142b8a6-3b04393c7506ab488b4fde073fc22e36.ssl.cf1.rackcdn.com/881764/1/check/cinder-tempest-plugin-lvm-multiattach/d28afe7/testr_results.html | 13:31 |
dansmith | maybe kashyap too ^ | 13:32 |
dansmith | that almost kinda looks like we attached the disk to the guest and the kernel crashed while probing partitions/filesystems, right? | 13:32 |
dansmith | in sysfs_add_file_mode_ns() | 13:33 |
dansmith | or perhaps while creating sysfs entries for the disk? | 13:33 |
dansmith | eharney: you around by chance? | 16:30 |
eharney | dansmith: yes | 16:45 |
dansmith | eharney: so, I seem to have gotten the nova ceph job down to a single repeatable failure | 16:45 |
dansmith | it is in the volume extend test, and it fails during cleanup | 16:45 |
dansmith | it's trying to, I guess, detach the volume from the server before deleting the server, and then before deleting the volume | 16:46 |
dansmith | it does *not* fail locally, so I don't think it's something fundamentally broken with new ceph or anything like that | 16:46 |
dansmith | https://955f32f8268e5d475e65-6c8f4c6e546a0854b4c11cc7c78829ca.ssl.cf5.rackcdn.com/881585/6/check/nova-ceph-multistore/1ecb09a/testr_results.html | 16:46 |
eharney | dansmith: let me take a look through the logs | 16:47 |
eharney | detach failing isn't something i'm familiar with (other than it being mentioned here the other day) | 16:47 |
dansmith | I've just been tracing through the code looking for how this works and it seems to me like everything is working, but perhaps it's just legitimately that the guest doesn't let go of the volume when we detach after the resize happened | 16:48 |
dansmith | okay, volume detach is without a doubt our most common failure | 16:48 |
dansmith | eharney: yeah appreciate if you could see if you spot anything in the logs | 16:48 |
dansmith | I see the guest saw the size change on vdb and also mentions that it's resizing the filesystem on it... that looks like more than just the block device size change, | 17:00 |
dansmith | so I wonder if it is literally doing a resize2fs on it and that is still happening when we try to issue the detach and that gets us stuck | 17:00 |
dansmith | because we start the detach less than half a second after the resize happens | 17:01 |
eharney | dansmith: yeah, i was also just looking down the path of whether the libvirt block resize call is synchronous or not (it looks like nova assumes it is?) | 17:03 |
dansmith | eharney: synchronous with what? It's synchronous to libvirt from compute, but I don't know that it waits to return until it's delivered to the guest, but definitely stuff like resizing filesystems would happen after that returns | 17:04 |
dansmith | even still, the test polls for completion of the operation as far as nova is concerned, but the guest stuff would all be async | 17:05 |
eharney | i see | 17:05 |
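(For context on the synchronicity question above: the call under discussion is libvirt's block resize API. The sketch below is only an illustration of its semantics as described in the conversation, not nova's actual code path; the function name and connection details are made up for the example. The resize call returns once the hypervisor has grown the device, while anything the guest does in response, such as a filesystem resize, happens asynchronously afterwards.)

```python
# Minimal sketch of the libvirt-python call being discussed; illustrative only,
# not nova's actual driver code.
import libvirt

# VIR_DOMAIN_BLOCK_RESIZE_BYTES makes the size argument bytes instead of KiB.
RESIZE_BYTES = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES


def extend_attached_volume(conn_uri, domain_name, device, new_size_bytes):
    """Resize an attached block device (e.g. 'vdb') on a running guest."""
    conn = libvirt.open(conn_uri)
    try:
        dom = conn.lookupByName(domain_name)
        # blockResize() returns once qemu has grown the device; anything the
        # guest does in response (partition rescans, resize2fs, ...) happens
        # asynchronously after this call returns.
        dom.blockResize(device, new_size_bytes, RESIZE_BYTES)
    finally:
        conn.close()


# Example usage (hypothetical domain/device names):
# extend_attached_volume('qemu:///system', 'instance-00000001', 'vdb', 2 * 1024**3)
```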
dansmith | I think maybe I should make the test ssh to the guest and see if it's mounted, and perhaps try to unmount it before the test ends or something, to avoid racing with the detach | 17:05 |
dansmith | the resize happens before cirros is even done with its startup stuff, | 17:11 |
dansmith | so I wonder if, on a slow emulated guest, we mount all the filesystems we find, which means it's mounted in the guest when we resize, so it does the resize2fs activity automatically | 17:11 |
dansmith | but on a fast non-nested local run, it finishes startup before we do the attach and thus doesn't end up with it mounted during the resize | 17:11 |
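(The idea floated above, having the test SSH into the guest and check or unmount the volume before the detach is issued, might look roughly like the sketch below. It is not tempest's API; the helper name, credentials, and device path are placeholders, and it assumes the guest is reachable over SSH.)

```python
# Rough sketch of "make sure the volume isn't mounted before detaching",
# using plain paramiko rather than tempest's remote client. Illustrative only.
import paramiko


def ensure_volume_unmounted(host, username, password, device='/dev/vdb'):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password, timeout=30)
    try:
        # Check whether the device (or a partition on it) shows up in the mounts.
        _, stdout, _ = client.exec_command('mount')
        mounted = any(device in line for line in stdout.read().decode().splitlines())
        if mounted:
            # Unmount before the detach is requested from the API side, so the
            # detach doesn't race with in-guest activity such as resize2fs.
            _, stdout, _ = client.exec_command('sudo umount %s' % device)
            stdout.channel.recv_exit_status()
    finally:
        client.close()
```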
eharney | is resize2fs etc triggered by qemu-guest-agent? | 17:12 |
dansmith | could be.. does cirros have the guest agent in it? I assumed not | 17:13 |
eharney | i don't think so | 17:13 |
dansmith | on regular systems I've never had resize2fs triggered automatically for me, so I'm not really sure why that would happen | 17:13 |
dansmith | but it seems like it is here | 17:13 |
eharney | i would have guessed that resize2fs doesn't happen in these test jobs | 17:14 |
dansmith | me too | 17:14 |
dansmith | you see it in the output though right? | 17:14 |
eharney | no, where is that? | 17:14 |
dansmith | [ 48.156649] EXT4-fs (vda1): resizing filesystem from 25600 to 259835 blocks | 17:14 |
dansmith | [ 48.255755] EXT4-fs (vda1): resized filesystem to 259835 | 17:14 |
dansmith | in the guest console dump | 17:14 |
dansmith | oh damn | 17:14 |
dansmith | that is vda, nevermind! | 17:14 |
dansmith | that's cirros resizing its root disk on startup, not the attached volume | 17:15 |
eharney | ah, right | 17:15 |
dansmith | mah bad | 17:15 |
dansmith | so, without a doubt the most common failure in nova jobs is failing to detach volumes | 17:17 |
dansmith | we've been trying to get a handle on it for a long time, | 17:17 |
eharney | does it show up on non-rbd volumes? | 17:17 |
dansmith | yeah | 17:17 |
dansmith | there is some assertion that if we attach to an instance before it is far enough along during boot, then it might prevent it from being detached later | 17:18 |
dansmith | I'm slightly skeptical of that, but we've been adding "wait for sshable" checks everywhere | 17:18 |
eharney | i guess i'm not sure what kind of conditions in the libvirt area would prevent detach from completing | 17:19 |
dansmith | I just added that for this test recently (merged on Friday), which passes normally (and passes locally), but with this rbd job it seems to fail... it passed the gate on the focal-based rbd, but not on new ceph and jammy | 17:19 |
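(The "wait for sshable" checks mentioned above are tempest-side guards that delay attach/detach until the guest has finished booting far enough to accept SSH. The snippet below is a simplified, standalone sketch of that pattern using plain sockets; it is not tempest's actual helper, and the timeouts are arbitrary examples.)

```python
# Simplified sketch of a "wait until the guest is SSHable" poll; illustrative only.
import socket
import time


def wait_for_sshable(host, port=22, timeout=300, interval=5):
    """Poll until something is accepting TCP connections on the SSH port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    raise TimeoutError('guest at %s:%d never became SSHable' % (host, port))
```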