*** ministry is now known as __ministry | 08:11 | |
auniyal | Hi sean-k-mooney | 09:16 |
---|---|---|
auniyal | can you please review this - https://review.opendev.org/c/openstack/nova/+/790447/ | 09:16 |
zigo | Hi there! | 09:44 |
zigo | I'm currently doing routine upgrade of compute nodes in a cluster (running Victoria), and I'm getting live-migration errors of VMs like this one: | 09:44 |
zigo | https://paste.opendev.org/show/bYlTfz7fxnQtzVhpVf91/ | 09:44 |
zigo | Is it known? Is there a way to fix? Is it related to the version of libvirt or qemu? | 09:44 |
bauzas | looking | 09:49 |
bauzas | zigo: good question I guess you've seen the libvirt error | 09:52 |
bauzas | 2023-03-23 09:37:52.682 3209246 INFO nova.compute.manager [req-5fda88d8-510a-4943-9703-9b47e865a89f - - - - -] [instance: 17672112-c416-494a-88f8-fd7cfa85453b] VM Resumed (Lifecycle Event) 2023-03-23 09:37:52.694 3209246 ERROR nova.virt.libvirt.driver [-] [instance: 17672112-c416-494a-88f8-fd7cfa85453b] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2023-03-23T09:37:52.215715Z qemu-system-x86_64: VQ | 09:52 |
bauzas | 0 size 0x80 < last_avail_idx 0x0 - used_idx 0x44 2023-03-23T09:37:52.215742Z qemu-system-x86_64: Failed to load virtio-balloon:virtio 2023-03-23T09:37:52.215745Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon' | 09:52 |
zigo | Yeah, I do. But then, where do I look? | 09:53 |
zigo | Libvirt logs? | 09:53 |
bauzas | it reminds me of this bug report https://bugs.launchpad.net/cloud-archive/+bug/1848497 | 09:54 |
zigo | I haven't seen anything doing a tail of /var/log/libvirt/qemu/*.log | 09:55 |
zigo | On Bullseye, I'm running with qemu 1:5.2+dfsg-11+deb11u2 | 09:55 |
bauzas | and you're not seeing anything with qemu logs ? | 09:57 |
bauzas | the error is reported by qemu process, not by libvirtd | 09:58 |
bauzas | so I'd say check the qemu logs | 09:58 |
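
Editor's sketch: when only a couple of VMs per compute node fail, it can help to pull the failing device out of the qemu error text automatically. The snippet below is a minimal example, assuming the message wording matches the error quoted above; other qemu versions may phrase it differently, and the function name `find_failed_devices` is hypothetical.

```python
import re

# Pattern modeled on the qemu message quoted in this log (assumption:
# the wording is stable across the qemu versions in use).
LOAD_ERROR = re.compile(
    r"error while loading state for instance (?P<instance>0x[0-9a-fA-F]+) "
    r"of device '(?P<device>[^']+)'"
)


def find_failed_devices(log_text):
    """Return (instance_id, device) pairs for every state-load failure found."""
    return [
        (m.group("instance"), m.group("device"))
        for m in LOAD_ERROR.finditer(log_text)
    ]


# Sample text copied from the error pasted in the channel.
sample = (
    "2023-03-23T09:37:52.215742Z qemu-system-x86_64: Failed to load "
    "virtio-balloon:virtio 2023-03-23T09:37:52.215745Z qemu-system-x86_64: "
    "error while loading state for "
    "instance 0x0 of device '0000:00:05.0/virtio-balloon'"
)
print(find_failed_devices(sample))
```

The same function could be fed the contents of the files under /var/log/libvirt/qemu/ on the destination host, since the error is reported by the qemu process rather than by libvirtd.
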
bauzas | zigo: so it seems a qemu migration *to* a node with a version of 4.0 or higher is problematic | 10:03 |
bauzas | which qemu version is the source running ? | 10:03 |
bauzas | I assume you're not mixing releases | 10:03 |
bauzas | but I wanted to double-check | 10:03 |
zigo | Same version of qemu and libvirt in both source and dest. | 10:04 |
zigo | It's a plain Bullseye, so I use whatever is in Debian Stable (minus the security upgrades that I'm trying to perform). | 10:05 |
bauzas | and what migration flags are you using ? | 10:06 |
bauzas | are the vms paused ? | 10:06 |
bauzas | or suspended ? | 10:06 |
zigo | They are ACTIVE. | 10:08 |
zigo | Are there migration flags I can set?!? :) | 10:08 |
zigo | Where do I look? | 10:08 |
zigo | I've just done "nova host-evacuate-live <hostname>" ... | 10:09 |
zigo | (not sure if there's a way to do this with python-openstackclient from Victoria...) | 10:10 |
zigo | What's weird, is that MANY VMs on the same host are live-migrating without a glitch. Then on average, 2 VMs on each compute can't live-migrate ... | 10:11 |
sean-k-mooney | zigo: there isnt, and thats intentional | 10:11 |
sean-k-mooney | host-evacuate-live is not something we recommend operators use | 10:12 |
zigo | sean-k-mooney: What should I use then? | 10:12 |
sean-k-mooney | we are intentionally not supporting it in osc | 10:12 |
bauzas | wait | 10:12 |
bauzas | evacuate or live-migrate ? | 10:12 |
bauzas | I'm lost here | 10:12 |
sean-k-mooney | zigo: you should write your own code to live migrate all the vms from a host that actually has error handling | 10:12 |
bauzas | oh | 10:12 |
bauzas | host-evacuate-live | 10:12 |
bauzas | damn old unsupported CLIs | 10:13 |
sean-k-mooney | technically deprecated rather than unsupported | 10:13 |
sean-k-mooney | until we can remove it in C/D | 10:13 |
sean-k-mooney | we need everyone to use the sdk first | 10:13 |
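
Editor's sketch: the advice above is to replace host-evacuate-live with your own loop that handles errors per VM. The following is one possible shape for such a loop, not the method anyone in the channel uses. It assumes `compute` behaves like openstacksdk's compute proxy (`conn.compute`) with `servers()`, `live_migrate_server()` and `get_server()`, and that a server reports status `MIGRATING` while the migration is in flight; check both against the SDK and Nova versions you run. The function name `drain_host` is hypothetical.

```python
import time


def drain_host(compute, hostname, timeout=600, poll=5, sleep=time.sleep):
    """Live-migrate every ACTIVE server off `hostname`, one at a time.

    Returns (migrated_ids, failed) where `failed` maps server id to a reason,
    so one bad VM does not stop the rest of the host from being drained.
    """
    migrated, failed = [], {}
    # Snapshot the list first: servers move hosts while we iterate.
    for server in list(compute.servers(all_projects=True, compute_host=hostname)):
        if server.status != "ACTIVE":
            failed[server.id] = f"skipped: status is {server.status}"
            continue
        try:
            # host=None lets the scheduler pick the destination.
            compute.live_migrate_server(server, host=None)
        except Exception as exc:  # the API refused the request
            failed[server.id] = f"request rejected: {exc}"
            continue
        # Poll until the server leaves MIGRATING or we give up.
        waited = 0
        current = compute.get_server(server.id)
        while current.status == "MIGRATING" and waited < timeout:
            sleep(poll)
            waited += poll
            current = compute.get_server(server.id)
        if current.status == "ACTIVE" and current.compute_host != hostname:
            migrated.append(server.id)
        elif current.status == "MIGRATING":
            failed[server.id] = f"still migrating after {timeout}s"
        else:
            # Covers rollbacks (ACTIVE on the source again) and ERROR.
            failed[server.id] = (
                f"ended as {current.status} on {current.compute_host}"
            )
    return migrated, failed
```

With openstacksdk this would presumably be called as `drain_host(openstack.connect(cloud="mycloud").compute, "compute-01")`, where the cloud and host names are placeholders. Migrating one server at a time is a deliberately conservative choice; the `failed` map gives the operator a list of VMs, such as the ones with the balloon error here, to investigate separately.
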
sean-k-mooney | zigo: it wont help now but you used to be able to set https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.mem_stats_period_seconds to 0 to disable the memory balloon | 10:15 |
sean-k-mooney | can you confirm if that is set to 0 on either host and that the vm has a memory balloon | 10:15 |
zigo | It's set to default on that host (ie: 10 ...). | 10:16 |
sean-k-mooney | ack | 10:16 |
zigo | Should I set it to zero and try again then? | 10:16 |
sean-k-mooney | i was wondering if having it enabled and disabled on different hosts could cause issues | 10:16 |
sean-k-mooney | i think this is one of those things that you cant change with running vms | 10:17 |
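
Editor's sketch: whether a given VM actually has a memory balloon can be read from its libvirt domain XML (for example the output of `virsh dumpxml` for the domain). The snippet below is a minimal example using only the standard library; the sample XML is hand-written for illustration and the function name `balloon_info` is hypothetical.

```python
import xml.etree.ElementTree as ET


def balloon_info(domain_xml):
    """Return (model, stats_period) for the guest's memballoon device.

    model is None when the domain has no <memballoon> element;
    stats_period is None when there is no <stats period='...'/> child.
    """
    root = ET.fromstring(domain_xml)
    balloon = root.find("./devices/memballoon")
    if balloon is None:
        return None, None
    stats = balloon.find("stats")
    period = None
    if stats is not None and stats.get("period"):
        period = int(stats.get("period"))
    return balloon.get("model"), period


# Illustrative domain XML, trimmed to the relevant device.
sample = """<domain type='kvm'>
  <devices>
    <memballoon model='virtio'>
      <stats period='10'/>
    </memballoon>
  </devices>
</domain>"""
print(balloon_info(sample))
```

A model of `none` in the XML means libvirt was asked not to attach a balloon device, so comparing this value for a failing VM and a successfully migrated one is a quick way to see whether the device differs between them.
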
sean-k-mooney | zigo: i assume you cant just cold migrate them | 10:17 |
bauzas | yup, or it would require a vm recycle | 10:17 |
zigo | Oh ... :/ | 10:17 |
sean-k-mooney | /recycle/restart/ | 10:17 |
zigo | sean-k-mooney: Well, to cold-migrate, I must get in touch with customers to at least warn them about the operation, and let them know their VM will reboot. | 10:18 |
zigo | That's kind of very annoying with 50 computes and 2k+ VMs ... | 10:18 |
sean-k-mooney | to answer your original question, no, im not aware of live migration issues related to memory balloons | 10:18 |
bauzas | sean-k-mooney: I remember we had some old qemu-4 live migration issues with the qemu balloons | 10:20 |
bauzas | but zigo isn't impacted | 10:21 |
sean-k-mooney | zigo: i assume this is persistent | 10:27 |
sean-k-mooney | i.e. a second live migration of the vm has the same error | 10:28 |
zigo | Right. | 10:28 |
sean-k-mooney | have you tried migrating to a different host? | 10:28 |
zigo | I didn't try to specify the dest host, but I can try. I'll let you know... | 10:29 |
sean-k-mooney | im trying to figure out if this is a thing that is specific to the vm, the destination host, etc. | 10:29 |
sean-k-mooney | https://bugzilla.redhat.com/show_bug.cgi?id=1923881 | 10:32 |
sean-k-mooney | that sounds like it | 10:32 |
sean-k-mooney | similar for virtio-blk https://bugs.launchpad.net/nova/+bug/1737625 | 10:34 |
sean-k-mooney | """Dave notes that we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.""" | 10:34 |
sean-k-mooney | zigo: are you using post-copy or autoconverge by the way | 10:36 |
sean-k-mooney | bauzas: this is the qemu 4.0 issue right https://lore.kernel.org/all/156517411102.26464.1302440989654328620.launchpad@gac.canonical.com/T/ | 10:37 |
sean-k-mooney | https://bugs.launchpad.net/qemu/+bug/1838569 | 10:37 |
zigo | Not using post-copy (it's set to default, ie: false) | 10:38 |
zigo | Same for live_migration_permit_auto_converge (set to default: false) | 10:39 |
zigo | I'm using TLS though... | 10:39 |
zigo | libvirt over TLS. | 10:39 |
zigo | I probably should set live_migration_permit_auto_converge to true though, as sometimes, I have to manually do a live-migration-force-complete ... | 10:40 |
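
Editor's sketch: the options discussed above all live in the `[libvirt]` section of nova.conf, and one concern raised was a setting being enabled on one host and disabled on another. The snippet below is a minimal example that compares those options between two config texts. The defaults are the ones stated in this conversation (10, false, false); the function names are hypothetical, and the sketch assumes nova.conf parses cleanly with Python's `configparser`, which may not hold for every deployment.

```python
import configparser

# Defaults as stated in the discussion above.
DEFAULTS = {
    "mem_stats_period_seconds": "10",
    "live_migration_permit_post_copy": "false",
    "live_migration_permit_auto_converge": "false",
}


def libvirt_migration_settings(conf_text):
    """Return the effective [libvirt] values for the options of interest."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        option: parser.get("libvirt", option, fallback=default)
        for option, default in DEFAULTS.items()
    }


def differing_options(source_conf, dest_conf):
    """Map each option that differs to its (source, destination) values."""
    src = libvirt_migration_settings(source_conf)
    dst = libvirt_migration_settings(dest_conf)
    return {opt: (src[opt], dst[opt]) for opt in DEFAULTS if src[opt] != dst[opt]}


# Illustrative inputs: the source uses defaults, the destination disables stats.
source = "[libvirt]\nvirt_type = kvm\n"
dest = "[libvirt]\nvirt_type = kvm\nmem_stats_period_seconds = 0\n"
print(differing_options(source, dest))
```

Values are compared as strings on purpose, so the sketch does not guess how Nova itself normalizes booleans or integers.
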
bauzas | sean-k-mooney: unrelated, was it you who wrote https://etherpad.opendev.org/p/nova-bobcat-ptg#L75 ? | 10:40 |
sean-k-mooney | i dont remember that but i am interested in that | 10:44 |
sean-k-mooney | maybe dansmith | 10:44 |
sean-k-mooney | or artom. i certainly asked that question or one like it in our internal meetings a few weeks ago | 10:44 |
sean-k-mooney | my understanding is the glance folks were going to also work on the nova patches | 10:45 |
bauzas | I'm just mentioning glance will discuss this with cinder | 10:45 |
sean-k-mooney | and we just needed to review | 10:45 |
bauzas | and people will join | 10:45 |
sean-k-mooney | ok, are you suggesting we also join that at the same time | 10:48 |
bauzas | yeah, if interested | 10:52 |
bauzas | I wrote that in our etherpad | 10:52 |
bauzas | but we can ask for a specific glance/nova session if after that some questions remain | 10:53 |
bauzas | I just arranged this with pdeore | 10:53 |
artom | sean-k-mooney, bauzas, no, not me | 11:15 |
bauzas | k | 11:15 |
stephenfin | Could I get another reviewer on that series to remove sqlalchemy-migrate, please? https://review.opendev.org/q/topic:sqlalchemy-20+project:openstack/nova+is:open | 12:28 |
sean-k-mooney | so you have two different patch changes there | 12:54 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/860829 and https://review.opendev.org/c/openstack/nova/+/872428 | 12:54 |
sean-k-mooney | i reviewed the latter before and was ok to proceed with it | 12:54 |
sean-k-mooney | i have not looked at the former | 12:55 |
opendevreview | Elod Illes proposed openstack/nova stable/victoria: DNM: gate test https://review.opendev.org/c/openstack/nova/+/878386 | 13:27 |
opendevreview | Merged openstack/nova stable/victoria: [stable-only][cve] Check VMDK create-type against an allowed list https://review.opendev.org/c/openstack/nova/+/871699 | 13:34 |
opendevreview | Merged openstack/nova stable/yoga: Reproducer for bug 1951656 https://review.opendev.org/c/openstack/nova/+/866153 | 13:34 |
dansmith | bauzas: sean-k-mooney wasn't me | 13:38 |
bauzas | anyway, we'll see this next Thursday then | 13:38 |
bauzas | elodilles: awesome \o/ https://review.opendev.org/c/openstack/nova/+/871699 | 13:38 |
bauzas | dansmith: gibi or sean-k-mooney: we need to merge this one now https://review.opendev.org/c/openstack/nova/+/875621 given grenade was modified | 13:39 |
dansmith | bauzas: yep, got it | 13:41 |
bauzas | ++ | 13:43 |
elodilles | bauzas: wow! finally... \o/ | 13:43 |
dansmith | gmann: around yet? | 14:12 |
dansmith | gmann: I'm not sure I understand the skip-level-always comment.. you made it gate/voting in the template and thus it needs to be in the gate pipeline in nova's zuul as well is that right? | 14:13 |
dansmith | just confused by the irrelevant-files part I guess | 14:13 |
sean-k-mooney | you can override it in nova .zuul.yaml | 14:14 |
sean-k-mooney | if we wanted to | 14:14 |
sean-k-mooney | the in-repo options take precedence | 14:14 |
sean-k-mooney | you would just need to add the job by name to check/gate and set voting false. but i think youre really wondering if it should be voting? | 14:15 |
sean-k-mooney | or was this a zuul mechanics question | 14:16 |
opendevreview | Dan Smith proposed openstack/nova master: Add grenade-skip-level-always to nova https://review.opendev.org/c/openstack/nova/+/875773 | 14:16 |
dansmith | sean-k-mooney: I'm talking about a specific comment *on* nova's zuul.yaml from gmann | 14:17 |
sean-k-mooney | oh hum well you have it in both check and gate and we have irrelevant-files in both | 14:18 |
sean-k-mooney | i think they were suggesting editing the project template | 14:19 |
sean-k-mooney | maybe | 14:19 |
dansmith | maybe we just wait and see what he meant :) | 14:20 |
sean-k-mooney | sure but im not sure i really like the idea of having to have the irrelevant-files we use in the tempest repo | 14:21 |
sean-k-mooney | since that depends on the nova repo structure. granted that changes very infrequently but still | 14:21 |
sean-k-mooney | it does not feel like this should be there | 14:21 |
sean-k-mooney | granted i have also said in the past that i would prefer if the integrated-gate-compute template was in the nova repo but i know why the qa team want to keep those all in one repo | 14:22 |
dansmith | are you talking about the policies-irrelevant list? | 14:23 |
dansmith | I dunno why that's named that way, but AFAIK it's defined in this file, not in tempest | 14:24 |
opendevreview | Dmitry Tantsur proposed openstack/nova master: ironic: clean up references to memory_mb/cpus/local_gb https://review.opendev.org/c/openstack/nova/+/878418 | 14:29 |
opendevreview | Merged openstack/nova stable/yoga: Handle mdev devices in libvirt 7.7+ https://review.opendev.org/c/openstack/nova/+/866154 | 14:57 |
bauzas | whoami-rajat: it's quite late for your time, but it looks like we need a cinder-nova cross-project session, IIRC my emails :) | 14:59 |
bauzas | for the vPTG | 15:00 |
whoami-rajat | bauzas, yeah i was just reading through it, I haven't finalized timing for cinder topics yet, do you have any time in mind? | 15:00 |
whoami-rajat | we're going to have a cross project with glance on thursday | 15:01 |
bauzas | yup, I discussed that this morning with pdeore | 15:02 |
bauzas | whoami-rajat: Sofia was requesting the last Thursday slot or the first Wed slot | 15:04 |
bauzas | whoami-rajat: I can somehow set it for Wed 1300UTC, would that work for you ? | 15:05 |
whoami-rajat | bauzas, hmm, the problem with first slots is we don't have full gathering, would thursday last slot work for you? 1600-1700 UTC | 15:05 |
bauzas | whoami-rajat: sure | 15:06 |
bauzas | let's take it | 15:06 |
whoami-rajat | great! | 15:06 |
whoami-rajat | thanks | 15:06 |
bauzas | whoami-rajat: would it work if that would be in our room ? | 15:06 |
whoami-rajat | bauzas, sure, we can move there, is it on Zoom or any other platform? | 15:07 |
bauzas | whoami-rajat: we use the diablo room (zoom) | 15:07 |
whoami-rajat | cool, we will be there | 15:08 |
bauzas | I just added it in our etherpad https://etherpad.opendev.org/p/nova-bobcat-ptg#L91 | 15:08 |
bauzas | whoami-rajat: and if you see other topics to discuss, please add them there | 15:08 |
whoami-rajat | bauzas, sure, sounds good! | 15:09 |
bauzas | ++ | 15:10 |
bauzas | g'night | 15:10 |
dtantsur | Hey folks! Seeing this in the ironic grenade job: https://zuul.opendev.org/t/openstack/build/d90374e9b6554704a2f84b7fe8a9d411/log/controller/logs/screen-n-api.txt#4182 | 15:44 |
dtantsur | rings any bells? | 15:44 |
dtantsur | It's quite possible that the ironic virt driver does not indeed support 'openstack console log show', but why is it called? | 15:45 |
dtantsur | hmm, maybe a red herring. judging by https://opendev.org/openstack/grenade/commit/adcb563b185416451da419186a8d7773ffb6b913 it happens if ping fails. | 15:47 |
dtantsur | (would be cool to check the virt driver before doing it) | 15:47 |
clarkb | dtantsur: I think one of the responses to a failed tempest test is to dump the instance console log. This is often useful if there are networking issues because with a VM the console is accessed via libvirt and not the network and the logs often show you if dhcp failed etc | 15:48 |
dtantsur | Right. Probably needs to exclude VIRT_DRIVER=ironic | 15:48 |
gmann | dansmith: hi | 17:08 |
gmann | dansmith: I mean to add irrelevant-files in the gate pipeline in the same way you did in the check pipeline, to avoid running grenade on doc-only changes etc | 17:08 |
sean-k-mooney | gmann: thats in the current patch but i dont know if you left your comment on a previous version | 17:50 |
sean-k-mooney | gmann: https://review.opendev.org/c/openstack/nova/+/875773/6/.zuul.yaml#798 | 17:50 |
sean-k-mooney | ah i see so in v5 it was being added to the gate pipeline via the template | 17:52 |
sean-k-mooney | not explicitly | 17:52 |
sean-k-mooney | so ya v6 addressed your comment | 17:52 |
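
Editor's sketch: the point of irrelevant-files, as described above, is that a job such as grenade is skipped when a change touches only files that do not matter to it (docs, for instance). The snippet below is a simplified model of that rule, not Zuul's implementation. The patterns are hypothetical examples rather than the list in nova's .zuul.yaml, and running the job when the file list is empty is a conservative assumption.

```python
import re


def job_should_run(changed_files, irrelevant_patterns):
    """Skip the job only when every changed file matches some pattern."""
    if not changed_files:
        # Assumption: with no file information, err on the side of running.
        return True
    compiled = [re.compile(p) for p in irrelevant_patterns]
    all_irrelevant = all(
        any(rx.search(path) for rx in compiled) for path in changed_files
    )
    return not all_irrelevant


# Hypothetical patterns, anchored so search and match behave the same.
PATTERNS = [r"^doc/.*$", r"^releasenotes/.*$", r"^.*\.rst$"]

print(job_should_run(["doc/source/index.rst"], PATTERNS))
print(job_should_run(["doc/source/index.rst", "nova/compute/manager.py"], PATTERNS))
```

Because the rule is evaluated per pipeline, a job listed in both check and gate needs the same list in both places, which is what the review comment was asking for.
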
sean-k-mooney | however it failed in v6 for some reason when it ran | 17:53 |
sean-k-mooney | test_security_group_rules_create | 17:53 |
sean-k-mooney | weird, that looks unrelated | 17:53 |
sean-k-mooney | i wonder why that failed | 17:54 |
gmann | sean-k-mooney: yes, already +2 on that | 17:55 |
sean-k-mooney | im just quickly checking the logs before rechecking and +2ing | 17:55 |
gmann | ohk | 17:56 |
gmann | I checked it from the failing test log but did not dig deep | 17:56 |
sean-k-mooney | nova.api.openstack.wsgi.Fault: Instance 08070e41-68b0-4dd3-9eb6-1926c3082060 could not be found. | 17:56 |
sean-k-mooney | there are a bunch of faults like that in the nova api | 17:57 |
sean-k-mooney | although im not sure its the same test | 17:57 |
gmann | api log might be confusing on NotFound due to negative tests | 17:59 |
sean-k-mooney | ya i was assuming that too | 17:59 |
sean-k-mooney | but i was just checking to see if there are any references to that test | 17:59 |
sean-k-mooney | ah found the request id req-90bd91de-9909-4b0a-a16f-1de245c67834 | 18:00 |
sean-k-mooney | req-90bd91de-9909-4b0a-a16f-1de245c67834 tempest-SecurityGroupRulesTestJSON-617656270 tempest-SecurityGroupRulesTestJSON-617656270-project-member] 10.209.0.48 "POST /compute/v2.1/os-security-groups" status: 200 | 18:00 |
sean-k-mooney | so nova was happy with it | 18:01 |
sean-k-mooney | im not seeing issues on the neutron side so im pretty happy this is a one-off failure | 18:03 |
sean-k-mooney | i just have not seen that test fail before at least not that stuck out in my memory | 18:04 |
sean-k-mooney | so i wanted to check it a little more deeply in case it was a real failure | 18:04 |
opendevreview | Merged openstack/nova master: Add grenade-skip-level-always to nova https://review.opendev.org/c/openstack/nova/+/875773 | 21:06 |
*** promethe- is now known as prometheanfire | 23:08 | |
*** seebaer is now known as seba | 23:19 |