spatel | Any idea what is wrong with this error - ERROR oslo_service.service oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 32738247b2bc4024bd0a4e65d900729d | 02:06 |
spatel | my nova-compute is throwing that error | 02:06 |
spatel | Even after I rebuilt my RabbitMQ from scratch | 02:06 |
spatel | to me it looks like a bug | 02:33 |
spatel | https://paste.opendev.org/show/bFtAhaCLfed1R3ptNr1F/ | 02:33 |
*** ministry is now known as __ministry | 02:37 | |
spatel | what options do I have here? | 02:39 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] add healthcheck manager to manager base https://review.opendev.org/c/openstack/nova/+/827844 | 03:01 |
spatel | sean-k-mooney are you around? | 03:06 |
*** dasm is now known as dasm|off | 03:08 | |
opendevreview | Ghanshyam proposed openstack/nova master: Make more project level APIs scoped to project only https://review.opendev.org/c/openstack/nova/+/828670 | 05:37 |
*** amoralej|off is now known as amoralej | 07:21 | |
frickler | gibi: gmann: wouldn't https://bugs.launchpad.net/devstack/+bug/1960346 rather be a nova issue than devstack? | 07:41 |
__ministry | hi all, I have a question that is: "Can we plug multiple vGPU at the same time into a compute node?" | 07:48 |
__ministry | Please, help me. | 07:49 |
gibi | frickler: it feels more like a libvirt regression between 6.0.0 and 8.0.0, but sure, it is probably not a devstack issue | 07:53 |
gibi | __ministry: you probably want to plug vGPUs into the guest VMs, as compute nodes are physical things, so they have physical GPUs plugged in | 07:55 |
gibi | __ministry: I think nova supports multiple pGPUs per compute as well as multiple vGPUs per guest | 07:56 |
__ministry | yep. I want to know so I can estimate the number of devices to order. | 07:57 |
__ministry | thank you. | 07:58 |
frickler | gibi: how confident are you that it is actually a regression and not an intentional change that nova would have to adapt to? anyway I'll assign it to nova then to find out more details | 08:16 |
gibi | frickler: yeah, you are right, as there is major version bump, it can be an intended change too | 08:16 |
__ministry | gibi: can we have multiple vGPU-capable physical GPUs in a compute node? | 08:19 |
gibi | __ministry: I think so. bauzas do we have limitations around that ^^ ? | 08:20 |
bauzas | __ministry: gibi: which exact question do you want to know about vGPUs ? | 08:21 |
gibi | bauzas: as far as I understand, __ministry is asking if nova supports multiple pGPUs providing vGPUs in a single compute | 08:21 |
bauzas | most of the limitations are written here https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#caveats | 08:21 |
bauzas | gibi: oh this totally works | 08:22 |
bauzas | sometimes with some GPU board (like a Tesla T-something) you actually get more than 1 pGPU per board :) | 08:22 |
bauzas | since stein, we just create a resource provider per physical PCI device, that's it | 08:23 |
__ministry | I want to plug multiple vGPUs into a nova instance. | 08:23 |
bauzas | oh then the other way around | 08:23 |
bauzas | __ministry: so you're asking about resources:VGPU=X where X>1 | 08:23 |
bauzas | unfortunately, I have to say there is a limitation from libvirt due to a nasty bug | 08:24 |
__ministry | yep. it's like how we can attach multiple volumes to a nova instance. | 08:24 |
__ministry | ok | 08:24 |
bauzas | __ministry: https://bugs.launchpad.net/nova/+bug/1758086 | 08:27 |
__ministry | bauzas: thank you. ^.^ | 08:33 |
bauzas | __ministry: that being said, this is an nvidia driver limitation; you're welcome to test again with a newer GRID release version | 08:35 |
opendevreview | Manuel Bentele proposed openstack/nova master: libvirt: Add properties to set advanced QXL video RAM settings https://review.opendev.org/c/openstack/nova/+/828674 | 08:52 |
opendevreview | Manuel Bentele proposed openstack/nova master: libvirt: Add configuration options to set SPICE compression settings https://review.opendev.org/c/openstack/nova/+/828675 | 08:59 |
opendevreview | Manuel Bentele proposed openstack/nova master: libvirt: Add property to set number of screens per video adapter https://review.opendev.org/c/openstack/nova/+/828676 | 09:09 |
opendevreview | Balazs Gibizer proposed openstack/placement master: Add any-traits support for listing resource providers https://review.opendev.org/c/openstack/placement/+/826491 | 10:09 |
opendevreview | Balazs Gibizer proposed openstack/placement master: Add any-traits support for allocation candidates https://review.opendev.org/c/openstack/placement/+/826492 | 10:09 |
opendevreview | Balazs Gibizer proposed openstack/placement master: Remove unused compatibility code https://review.opendev.org/c/openstack/placement/+/826493 | 10:09 |
opendevreview | Balazs Gibizer proposed openstack/placement master: Add microversion 1.39 to support any-trait queries https://review.opendev.org/c/openstack/placement/+/826719 | 10:10 |
gibi | gmann: thanks for the comment in the any-traits series. I restored the legacy behavior in https://review.opendev.org/c/openstack/placement/+/826491/8 | 10:11 |
gibi | melwitt: I had to respin the any-traits series due to ^^ | 10:11 |
opendevreview | Manuel Bentele proposed openstack/nova master: libvirt: Add configuration options to set SPICE compression settings https://review.opendev.org/c/openstack/nova/+/828675 | 10:27 |
sean-k-mooney | chateaulav: you have 1 local branch for all commits in a feature | 10:58 |
sean-k-mooney | chateaulav: you should be able to check out that branch locally and run all the code from the top of the branch | 10:58 |
sean-k-mooney | e.g. run the unit/func tests and/or deploy nova and execute the code | 10:58 |
sean-k-mooney | frickler: I'm pretty sure any change in the detach behavior is a libvirt/qemu regression, likely related to the rework they are doing in this area for PCI passthrough | 11:02 |
sean-k-mooney | regarding https://bugs.launchpad.net/nova/+bug/1960346 | 11:02 |
bauzas | sean-k-mooney: yup we'd appreciate some qemu expert on this one | 11:05 |
bauzas | kashyap: maybe you can help ? | 11:05 |
kashyap | bauzas: Hey, reading back | 11:05 |
bauzas | kashyap: sean-k-mooney: context is some TripleO CI blocked https://bugs.launchpad.net/tripleo/+bug/1960310 | 11:05 |
kashyap | sean-k-mooney: I wouldn't be so confident about regressions in libvirt/QEMU without evidence. | 11:05 |
bauzas | which looks to be due b/c of https://bugs.launchpad.net/nova/+bug/1960346 | 11:06 |
kashyap | When it comes to bugs, I follow "seeing is believing" | 11:06 |
bauzas | kashyap: if you see the last bug, you'll see gibi saying we use a new libvirt/qemu version | 11:06 |
bauzas | but it seems to be a regression | 11:06 |
kashyap | bauzas: Regression where? Me looks | 11:06 |
frickler | oh, it's not only devstack, but also tripleo, then devstack is even more out of the boat ;) | 11:07 |
bauzas | kashyap: see https://zuul.openstack.org/build/3e24d977991d4536b6279afd7f3b5d56/log/controller/logs/screen-n-cpu.txt?severity=4#49433 | 11:08 |
kashyap | bauzas: Yep, already noticed it | 11:09 |
kashyap | bauzas: Looking for libvirtd logs w/ QEMU filters | 11:09 |
kashyap | bauzas: Have you got the affected instance ID from the logs | 11:09 |
bauzas | lemme try to find one | 11:09 |
kashyap | 1074c6fa-12fe-40a8-b1d5-a47a49018d9f | 11:11 |
kashyap | ? | 11:11 |
bauzas | for this job, yes | 11:12 |
chateaulav | sean-k-mooney: alright, then I'm on the right track; got that all set up and in line. | 11:15 |
*** bhagyashris__ is now known as bhagyashris | 11:26 | |
chateaulav | sean-k-mooney: and then with any further changes, I would use an interactive rebase to edit and add any files to each specific commit, and then submit the topic for review. | 11:29 |
kashyap | bauzas: Do you know why I'm unable to fetch all the instance log files with a simple `wget`? | 11:29 |
kashyap | I'm tryin this: | 11:29 |
kashyap | $> wget -r -nH -nd -np -R "index.html*" https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3e2/828280/1/check/devstack-platform-centos-9-stream/3e24d97/controller/logs/libvirt/libvirt/qemu/ | 11:29 |
kashyap | Anyway, it's not required; ignore the above | 11:42 |
*** mdbooth2 is now known as mdbooth | 11:49 | |
kashyap | Duh, I wish Launchpad didn't break the complete formatting in text messages by wrapping text | 11:56 |
sean-k-mooney | chateaulav: yes. and you can always author new commits at the end of the chain and move them with an interactive rebase if needed | 12:04 |
chateaulav | thanks for that last confirmation! your mentorship is much appreciated. | 12:05 |
sean-k-mooney | but basically gerrit is intended to work with feature branches, and it ties each change to a review via the Change-Id in the commit message | 12:05 |
sean-k-mooney | chateaulav: no worries glad to help | 12:05 |
sean-k-mooney | so rebases or changes to a commit will not create a new review if the Change-Id does not change; it will just update the existing review with a new revision | 12:06 |
gibi | kashyap: so what do you think about "Device virtio-disk1 is already in the process of unplug" error? Should we increase the amount of time we wait before we retry the detach? | 12:15 |
gibi | it is configurable with CONF.libvirt.device_detach_timeout | 12:17 |
gibi | it is 20 sec by default | 12:17 |
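The event-based wait with a timeout and bounded retries that gibi describes can be sketched as follows. This is an illustrative sketch only, not nova's actual code: `request_detach`, the event object, and the retry count are assumptions for illustration.

```python
import threading

def detach_with_retries(request_detach, event, timeout=20, max_attempts=8):
    """Return True once the device-removed event arrives, False if we give up."""
    for _ in range(max_attempts):
        request_detach()          # ask the hypervisor to unplug the device
        if event.wait(timeout):   # block until the event fires or we time out
            return True           # event received: detach completed
    return False                  # event never came: give up


# Usage: simulate the event arriving during the second attempt.
done = threading.Event()
attempts = []

def fake_detach():
    attempts.append(1)
    if len(attempts) == 2:        # pretend the hypervisor completes on the second try
        done.set()

result = detach_with_retries(fake_detach, done, timeout=0.01)
```

The key design point is that the timeout only bounds how long each wait lasts; if the event arrives early, the loop returns immediately.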
opendevreview | Erlon R. Cruz proposed openstack/nova master: Adds regression test for bug LP#1944619 https://review.opendev.org/c/openstack/nova/+/821840 | 12:22 |
opendevreview | Erlon R. Cruz proposed openstack/nova master: Fix pre_live_migration rollback https://review.opendev.org/c/openstack/nova/+/815324 | 12:22 |
bauzas | kashyap: sorry, was at lunch (isolated but in the kitchen tho) | 12:23 |
gibi | kashyap: I've pushed https://review.opendev.org/c/openstack/devstack/+/828705 to see if longer timeout helps or not | 12:24 |
sean-k-mooney | gibi: I thought you implemented an event-based retry | 12:29 |
gibi | sean-k-mooney: it is event based but with a timeout | 12:29 |
gibi | so if the event came then we stop waiting | 12:29 |
sean-k-mooney | and when it times out we give up | 12:29 |
gibi | but if the event never comes we give up | 12:29 |
sean-k-mooney | ya ok | 12:30 |
gibi | more precisely, we retry | 12:30 |
sean-k-mooney | well no | 12:30 |
sean-k-mooney | retrying would be wrong | 12:30 |
sean-k-mooney | since we know that is an error in qemu and it will abort the detach | 12:30 |
sean-k-mooney | at least in current versions; in old versions it was undefined behavior | 12:30 |
gibi | I remember that even with libvirt 6.0.0 retry was needed in some cases | 12:30 |
kashyap | bauzas: Don't worry | 12:31 |
gibi | but maybe that was just the case of not waiting enough | 12:31 |
sean-k-mooney | it's qemu rather than libvirt that I think is important here | 12:31 |
kashyap | gibi: Reading back; went for some air | 12:31 |
gibi | sean-k-mooney: ack, then qemu 4.2.0 vs 6.2.0 | 12:31 |
sean-k-mooney | gibi: the original behavior change was qemu considering a second detach request to be an error and aborting | 12:31 |
sean-k-mooney | yes | 12:32 |
kashyap | gibi: What I'm wondering is why is the unplug still in the process - what is holding up the unplug. Let me chat w/ the libvirt block dev | 12:32 |
gibi | kashyap: thanks | 12:32 |
gibi | kashyap: could be the guest keeping the dev busy? | 12:32 |
sean-k-mooney | kashyap: unplug requires the guest kernel to cooperate | 12:32 |
kashyap | gibi: Right, this is a negative test of server rescue, right? I'm trying to look at the exact test | 12:32 |
gibi | sean-k-mooney: do you have a qemu version number from which we should never retry? | 12:32 |
sean-k-mooney | so if the guest is not fully booted or busy that can delay it | 12:32 |
kashyap | sean-k-mooney: But note: there's no QEMU guest agent installed here | 12:32 |
sean-k-mooney | kashyap: it is not related to the guest agent | 12:33 |
sean-k-mooney | it's related to hardware interrupts that are sent by qemu and that the guest must process | 12:33 |
sean-k-mooney | either via ACPI or the PCI native hotplug mechanism, depending on your machine type and qemu version | 12:34 |
gibi | kashyap: we have a positive test test_stable_device_rescue_disk_virtio_with_volume_attached and a negative test_stable_device_rescue_disk_virtio_with_volume_attached both failing | 12:34 |
gibi | ahh | 12:34 |
gibi | this is the negative test_rescued_vm_detach_volume | 12:34 |
kashyap | Yeah, this is negative that's failing | 12:35 |
kashyap | What exactly is the negative test doing? /me looks... | 12:35 |
sean-k-mooney | gibi: in terms of the exact version I think it's in the release notes, but I'll see if I can find it | 12:35 |
kashyap | sean-k-mooney: Are you confident it is "related to hardware interrupts that are sent by QEMU"? What evidence is there for it? | 12:35 |
gibi | sean-k-mooney: thanks. if we know the version number then I can craft a patch that conditionally set the detach attempts to 1 if the qemu is new enough | 12:36 |
sean-k-mooney | kashyap: we don't know that for certain, and in fact I think https://bugzilla.redhat.com/show_bug.cgi?id=2007129 is a large part of the problem | 12:37 |
sean-k-mooney | unless you use virtio-scsi, which we don't by default, each volume attach and detach is a PCI hotplug from the guest perspective, as we add a separate PCI device for each virtio-blk device | 12:37 |
frickler | kashyap: for downloading logs, I think wget may have issues because the source is swift and not a "normal" webserver. you may want to look at https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3e2/828280/1/check/devstack-platform-centos-9-stream/3e24d97/download-logs.sh and maybe filter if you need only some subdirs | 12:38 |
kashyap | Okay, that bug is about PCI hotplug emulation | 12:38 |
sean-k-mooney | kashyap: that bug might be unrelated, but artom suggested it might be, in a different context | 12:38 |
kashyap | sean-k-mooney: Do you know what the negative test is exactly doing? | 12:38 |
sean-k-mooney | I have not looked explicitly | 12:39 |
sean-k-mooney | I guess booting into rescue mode and detaching a cinder volume | 12:39 |
sean-k-mooney | https://github.com/openstack/tempest/blob/7e96c8e854386f43604ad098a6ec7606ee676145/tempest/api/compute/servers/test_server_rescue_negative.py#L136 | 12:40 |
gibi | yepp it does that, I try to correlate that with the logs | 12:40 |
kashyap | So it is trying to rescue a paused instance, and a non-existing instance | 12:40 |
sean-k-mooney | no | 12:40 |
sean-k-mooney | it's booting a vm, attaching a volume, then putting it in rescue mode, which reboots it with a new root disk | 12:41 |
sean-k-mooney | waiting for it to get to RESCUE, meaning it's running | 12:41 |
sean-k-mooney | then asserting that detach raises a 409 conflict | 12:41 |
sean-k-mooney | there is no paused instance | 12:42 |
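The test flow sean-k-mooney describes can be sketched with a toy in-memory server object. All class and method names here are hypothetical, standing in for the real tempest/nova API calls: boot, attach a volume, rescue, then assert that detach is rejected with a 409-style conflict while the server is in RESCUE state.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 conflict the API returns."""

class Server:
    def __init__(self):
        self.status = "ACTIVE"
        self.volumes = []

    def attach_volume(self, vol):
        self.volumes.append(vol)

    def rescue(self):
        # Rescue reboots the guest from a fresh copy of its image,
        # with the original root disk still attached.
        self.status = "RESCUE"

    def detach_volume(self, vol):
        if self.status == "RESCUE":
            raise Conflict("detach not allowed while rescued")
        self.volumes.remove(vol)

server = Server()
server.attach_volume("vol-1")
server.rescue()
try:
    server.detach_volume("vol-1")
    detach_rejected = False
except Conflict:
    detach_rejected = True
```

The real detach only happens later, during tempest's cleanup, after the server has been unrescued.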
kashyap | sean-k-mooney: Well. What do you think this is doing, then? - test_rescue_non_existent_server()? | 12:42 |
kashyap | And test_rescue_paused_instance() | 12:43 |
sean-k-mooney | its just asserting that if the server does not exist the rescue call returns a 404 | 12:43 |
kashyap | And there are also: test_rescued_vm_attach_volume() and test_rescued_vm_detach_volume() | 12:43 |
sean-k-mooney | paused is asserting that you can't call rescue when it's paused | 12:43 |
sean-k-mooney | yep | 12:43 |
sean-k-mooney | how is this relevant? | 12:43 |
sean-k-mooney | the tests all look valid and are asserting what I would expect | 12:44 |
kashyap | It is relevant in the sense that these are the different tests being run here | 12:44 |
sean-k-mooney | yes they are different tests, but they are not using the same vm | 12:45 |
kashyap | Yep, noted | 12:45 |
sean-k-mooney | so they should have no impact on each other | 12:45 |
sean-k-mooney | although I will say, if we get very very unlucky, the random uuid could collide with a real one in the non_existent_instance test | 12:45 |
sean-k-mooney | statistically however we are not going to get uuid collisions | 12:46 |
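A rough birthday-bound calculation backs up that remark: for n random version-4 UUIDs (122 random bits each), the collision probability is approximately n(n-1)/2^123. A quick sketch:

```python
# Approximate birthday bound for random v4 UUIDs (122 random bits):
# P(collision among n UUIDs) ≈ n * (n - 1) / 2**123.

def uuid4_collision_probability(n):
    return n * (n - 1) / 2.0**123

# Even a billion UUIDs give a vanishingly small collision chance.
p_billion = uuid4_collision_probability(10**9)
```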
gibi | wait a bit I think we are looking at the wrong test case | 12:48 |
kashyap | gibi: So you don't think this is the failing test? test_rescued_vm_detach_volume()? | 12:49 |
gibi | I'm confused | 12:49 |
gibi | Im looking at this run https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3e2/828280/1/check/devstack-platform-centos-9-stream/3e24d97/controller/logs/index.html | 12:49 |
kashyap | Same here. | 12:50 |
*** dasm|off is now known as dasm | 12:52 | |
kashyap | gibi: If we can narrow down the exact test action that is causing this, then we can debug it further from libvirt/QEMU angle. | 12:53 |
gibi | yeah I'm trying that | 12:53 |
gibi | I mean trying to collect the steps the test took | 12:53 |
kashyap | Sure, no rush. (I just want to arrive at a reproducer w/ just libvirt - APIs or shell) | 12:53 |
gibi | https://paste.opendev.org/show/bEtFDBoLqDfotDMPDOq8/ | 12:55 |
gibi | OK so that test case is failing similarly to the other, so we can look at either of the two | 12:56 |
gibi | that paste has the steps from test_rescued_vm_detach_volume | 12:56 |
* kashyap clicks | 12:56 | |
gibi | so yeah the log correlates with the test case | 12:57 |
kashyap | gibi: So, it is indeed test_rescued_vm_detach_volume(), then | 12:57 |
gibi | we boot an instance then rescue it (basically destroy the domain and start it again with a different disk config), then we attempt to detach a volume while the rescue domain is running | 12:57 |
gibi | kashyap: yes | 12:57 |
kashyap | gibi: Please post this info in the bug as a record. | 12:58 |
gibi | kashyap: but the other test_stable_device_rescue_disk_virtio_with_volume_attached is also failing similarly | 12:58 |
gibi | kashyap: hence my confusion | 12:58 |
kashyap | Hm, so both positive and negative are failing similarly | 12:58 |
gibi | updating the bug... | 12:58 |
kashyap | gibi: During rescue, by "different disk config" do you mean a fresh, similar config? Or actually different? If so, how is the disk config different before rescue? | 13:00 |
gibi | we create a domain in a way that it boots from a rescue image but also has the original root fs attached | 13:00 |
kashyap | I see, noted. | 13:02 |
gibi | kashyap: https://paste.opendev.org/show/bf9JaJYMYDOX9onjFFy0/ here are the domain xmls for that nova instance | 13:05 |
kashyap | gibi: Excellent; I just asked Peter Krempa (he meditates on libvirt block layer) on #virt (OFTC) | 13:07 |
*** amoralej is now known as amoralej|lunch | 13:09 | |
kashyap | Corresponding QEMU log: | 13:10 |
kashyap | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3e2/828280/1/check/devstack-platform-centos-9-stream/3e24d97/controller/logs/libvirt/libvirt/qemu/instance-0000004d_log.txt | 13:10 |
sean-k-mooney | kashyap: lyarwood implemented a stable rescue feature; basically we add a new disk on the hw_rescue_bus (typically usb) as the boot disk | 13:10 |
sean-k-mooney | the default rescue disk is the same image from glance that the vm booted with, but it will be a clean copy of it | 13:11 |
sean-k-mooney | you can specify an alternative image to use via config or the rescue action, but these tests do not | 13:11 |
kashyap | I see, noted. | 13:12 |
*** artom__ is now known as artom | 13:14 | |
gibi | kashyap: these are the nova request ids from the compute log correlated with actions: https://paste.opendev.org/show/bS5Jmmw4PbLirMXtAUyj/ | 13:26 |
gibi | there are two rescue / unrescue pairs | 13:27 |
kashyap | gibi: Noted; meanwhile, the libvirt dev says: | 13:27 |
kashyap | "weird, we [libvirt] indeed try to detach a blockdev node that wasn't ever attached" | 13:27 |
gibi | that is weird indeed :) | 13:27 |
kashyap | gibi: I sent him an email with you in Cc. He asked it, as he's in a hurry | 13:27 |
gibi | thanks for the cc | 13:28 |
gibi | I have to jump on a call I will be back in an hour | 13:28 |
kashyap | No rush; we can deal with this async. | 13:28 |
kashyap | gibi: Oh, I don't think we have this output captured anywhere, right? The dev was asking me | 13:29 |
kashyap | "please also get me the output of 'qemu-img info' of the copy destination image if it wasn't removed" | 13:29 |
gibi | copy destination? | 13:30 |
kashyap | [quote] What breaks is a block copy into a image with "--reuse-external" [this is what Nova uses - i.e. reuse an external file], so we try to obey the metadata. [/quote] | 13:31 |
kashyap | gibi: Here an image copy is involved under the hood | 13:31 |
gibi | interesting.. | 13:31 |
kashyap | Okay, Peter says: "it's okay to just point me to the place formatting the image, I don't need an actual example, just what's put into the metadata" | 13:32 |
* kashyap taps on the table and thinks | 13:33 | |
kashyap | gibi: He does admit that there's a potential libvirt bug here. | 13:33 |
* kashyap --> bbiab | 13:34 | |
gibi | bahh, there are multiple test cases reusing the same nova instance during the testing https://paste.opendev.org/show/b0OYtRRb5FyZ5pyWgBCu/ | 13:43 |
gibi | hence there are more actions on the instance in the compute log than in the test case that fails | 13:44 |
gibi | OK, see a bit more. | 13:48 |
gibi | The actual tempest test case passes. The detach returns conflict when the instance is in RESCUE state. That is the end of the test case. _THEN_ tempest starts cleaning up the pieces, and during that it first unrescues the VM and then detaches the volume. This detach should work and remove the volume, but it fails with the error in libvirt | 13:50 |
*** amoralej|lunch is now known as amoralej | 13:58 | |
gibi | btw increasing the timeout from 20 to 60 did not help; the detach still times out | 14:07 |
gibi | https://zuul.opendev.org/t/openstack/build/61f733fc73834ff0924284dac61c9a4b/log/controller/logs/screen-n-cpu.txt?severity=3 | 14:08 |
* kashyap reads back | 14:13 | |
gibi | I don't know what block copy action we do during these sequences | 14:17 |
gibi | hopefully with this patch I can run only the single test case we want to troubleshoot, so the logs will be smaller and cleaner https://review.opendev.org/c/openstack/devstack/+/828705 | 14:23 |
opendevreview | Erlon R. Cruz proposed openstack/nova master: Adds regression test for bug LP#1944619 https://review.opendev.org/c/openstack/nova/+/821840 | 14:24 |
opendevreview | Erlon R. Cruz proposed openstack/nova master: Fix pre_live_migration rollback https://review.opendev.org/c/openstack/nova/+/815324 | 14:24 |
kashyap | gibi: So the block copy is definitely there, looking at the commands libvirt has sent to QEMU (from the CI log): | 14:31 |
kashyap | 2022-02-08 15:24:09.109+0000: 72482: info : qemuMonitorSend:914 : QEMU_MONITOR_SEND_MSG: mon=0x7f74e40cde70 msg={"execute":"blockdev-mirror","arguments":{"job-id":"copy-vda-libvirt-2-format","device":"libvirt-2-format","target":"libvirt-4-format","sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-408"} | 14:31 |
kashyap | The QEMU keyword here is "blockdev-mirror" | 14:31 |
kashyap | ... which is what libvirt calls "block copy". | 14:32 |
kashyap | gibi: When you get a minute, please post your above observation about how the Tempest test passes, but the detach returns conflict in RESCUE. It is useful for the record. | 14:33 |
gibi | kashyap: sure. the conflict is from nova, in rescue we don't allow detach. And that part works. | 14:34 |
kashyap | Okay, so expected error there. | 14:34 |
gibi | yepp that is OK and the test case passes, but after the test case tempest cleans up | 14:34 |
gibi | basically it did the actions in reverse to move back to the starting state | 14:35 |
gibi | so as the server is in RESCUE state, it unrescues it | 14:35 |
gibi | and as a volume was attached to the server before rescue, it tries to detach the volume after the unrescue | 14:35 |
gibi | and that detach should remove the volume from the domain, but that fails | 14:35 |
kashyap | I see. And looks like it couldn't find that volume? | 14:37 |
kashyap | Maybe this goes back to Peter's comment earlier about how libvirt tries "to detach a blockdev node that wasn't ever attached" | 14:37 |
kashyap | These weird tests are spinning my head. | 14:38 |
gibi | during this detach: nova first detaches the volume from the persistent domain, which succeeds | 14:39 |
gibi | then nova issues the detach command to the live domain and waits for the event | 14:39 |
gibi | that event is not received in 20 sec so it issues the command again | 14:40 |
kashyap | Ah-ha, that makes sense | 14:40 |
gibi | that command returns | 14:40 |
gibi | error message: internal error: unable to execute QEMU command 'device_del': Device virtio-disk1 is already in the process of unplug | 14:40 |
gibi | then nova retries 6 more times | 14:40 |
kashyap | Right, and then times out | 14:41 |
gibi | always getting the same message | 14:41 |
gibi | and then gives up | 14:41 |
gibi | this is the log from the detach attempts https://paste.opendev.org/show/be647YeC57HREuAfwwru/ | 14:41 |
kashyap | gibi: Your last 10-ish messages are a great summary of the prob at hand. Can I take and rephrase them into a paragraph on that mail thread? | 14:41 |
gibi | sure | 14:41 |
kashyap | gibi: I've also posted the pimped up version here: https://bugs.launchpad.net/nova/+bug/1960346/comments/8 | 14:48 |
gibi | thanks | 14:48 |
gmann | gibi: ack. thanks | 14:52 |
opendevreview | Andre Aranha proposed openstack/nova stable/xena: Add check job for FIPS https://review.opendev.org/c/openstack/nova/+/827895 | 14:52 |
gmann | frickler: I opened it for devstack due to centos9 libvirt version bump but it is ok to add nova too | 14:53 |
*** hemna8 is now known as hemna | 14:53 | |
kashyap | frickler: Forgot to respond in the scrollback; yeah, that explains it - why my `wget` didn't work :) Thank you. | 14:55 |
gibi | kashyap: these are the libvirtd logs from the first and second detach https://paste.opendev.org/show/bKANW2WAzfAzEIGcgJX8/ probably nothing new here, I'm just working through the logs... | 14:55 |
kashyap | gibi: Nice work narrowing down | 14:59 |
kashyap | A side-tip is: | 14:59 |
kashyap | $> grep -Ei '(MONITOR_SEND_MSG|QEMU_MONITOR_RECV_)' libvirtd_log.txt | 14:59 |
kashyap | (That gets you all the commands and the responses libvirt is sending to QEMU.) | 14:59 |
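The same filter, as a Python sketch; the sample log lines below are fabricated for illustration only, the regular expression is the one from the grep tip above.

```python
import re

# Keep only the QEMU monitor send/receive lines from a libvirtd debug log.
MONITOR = re.compile(r"MONITOR_SEND_MSG|QEMU_MONITOR_RECV_")

def monitor_lines(lines):
    return [line for line in lines if MONITOR.search(line)]

# Fabricated sample lines, just to show the filter in action.
sample = [
    "info : qemuMonitorSend:914 : QEMU_MONITOR_SEND_MSG: msg={...}",
    "debug : virEventPollRunOnce: unrelated libvirtd chatter",
    "info : qemuMonitorJSONIOProcessLine: QEMU_MONITOR_RECV_REPLY: {...}",
]
hits = monitor_lines(sample)
```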
* kashyap --> a call; back later | 14:59 | |
spatel | folks i need urgent help to understand what is wrong with nova and rabbitMQ :( | 15:43 |
spatel | I have rebuilt RabbitMQ but now I'm not able to spin up VMs | 15:44 |
spatel | vm getting stuck in BUILD | 15:44 |
spatel | nova-conductor throwing these errors - https://paste.opendev.org/show/bbSVnr5zGdCCOPdL5tQF/ | 15:45 |
kashyap | No answer, but please don't count on the community to provide "urgent help". That's what vendors are for | 15:48 |
spatel | kashyap: I understand, just looking for a clue to see what is going on | 15:49 |
melwitt | gibi: ack, will look | 16:04 |
gibi | melwitt: thanks | 16:05 |
*** priteau_ is now known as priteau | 16:11 | |
opendevreview | Erlon R. Cruz proposed openstack/nova master: Fix pre_live_migration rollback https://review.opendev.org/c/openstack/nova/+/815324 | 16:14 |
opendevreview | Andre Aranha proposed openstack/nova stable/xena: Add check job for FIPS https://review.opendev.org/c/openstack/nova/+/827895 | 16:24 |
opendevreview | Andre Aranha proposed openstack/nova stable/wallaby: Add check job for FIPS https://review.opendev.org/c/openstack/nova/+/827896 | 16:29 |
*** tkajinam is now known as Guest210 | 16:30 | |
*** priteau is now known as priteau_ | 16:48 | |
*** priteau_ is now known as priteau | 16:48 | |
opendevreview | Lior Friedman proposed openstack/nova master: Support use_multipath for NVME driver https://review.opendev.org/c/openstack/nova/+/823941 | 17:02 |
opendevreview | Dmitrii Shcherbakov proposed openstack/nova master: Document remote-managed port usage considerations https://review.opendev.org/c/openstack/nova/+/827513 | 17:05 |
opendevreview | Lior Friedman proposed openstack/nova master: Support use_multipath for NVME driver https://review.opendev.org/c/openstack/nova/+/823941 | 17:12 |
gibi | kashyap: fyi there is a smaller reproduction in https://bugs.launchpad.net/nova/+bug/1960346/comments/10 | 17:40 |
gibi | but I have to drop off now | 17:40 |
opendevreview | melanie witt proposed openstack/placement master: Make perfload jobs fail if write allocation fails https://review.opendev.org/c/openstack/placement/+/828438 | 18:01 |
*** amoralej is now known as amoralej|off | 18:21 | |
opendevreview | Ghanshyam proposed openstack/nova master: Make more project level APIs scoped to project only https://review.opendev.org/c/openstack/nova/+/828670 | 18:32 |
chateaulav | gibi: can I get a little more on the backporting of the 1.3 to 1.2? I have been playing around with it but am not quite sure. is this more related to the actual version itself, or to pulling the new values available in 1.3 back to 1.2? this is for https://review.opendev.org/c/openstack/nova/+/828369 and I know that my question seems repetitive | 18:54 |
opendevreview | Merged openstack/nova master: Join quota exception family trees https://review.opendev.org/c/openstack/nova/+/828185 | 19:43 |
spatel | kashyap: by the way, I found the issue; it was related to the neutron-metadata service, which was causing the problem and holding up the VM build.. | 20:27 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: libvirt: Add announce-self post live-migration workaround https://review.opendev.org/c/openstack/nova/+/825178 | 21:35 |
*** dasm is now known as dasm|off | 21:49 | |
chateaulav | gibi: found the info I needed. Will add the back ports tomorrow. | 23:03 |
opendevreview | Ghanshyam proposed openstack/nova master: Server actions APIs scoped to project scope https://review.opendev.org/c/openstack/nova/+/824358 | 23:21 |
opendevreview | Ghanshyam proposed openstack/nova master: Server actions APIs scoped to project scope https://review.opendev.org/c/openstack/nova/+/824358 | 23:21 |
*** prometheanfire is now known as Guest2 | 23:49 | |
*** osmanlicilegi is now known as Guest0 | 23:49 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!