opendevreview | melanie witt proposed openstack/nova master: WIP Support encrypted backing files for qcow2 https://review.opendev.org/c/openstack/nova/+/907961 | 01:33 |
---|---|---|
opendevreview | melanie witt proposed openstack/nova master: libvirt: Introduce support for raw with LUKS https://review.opendev.org/c/openstack/nova/+/884313 | 01:33 |
opendevreview | melanie witt proposed openstack/nova master: libvirt: Introduce support for rbd with LUKS https://review.opendev.org/c/openstack/nova/+/889912 | 01:33 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 02:13 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 02:15 |
opendevreview | Takashi Kajinami proposed openstack/os-resource-classes master: Update bug tracker url https://review.opendev.org/c/openstack/os-resource-classes/+/908219 | 02:17 |
opendevreview | Takashi Kajinami proposed openstack/os-traits master: Update bug tracker url https://review.opendev.org/c/openstack/os-traits/+/908220 | 02:18 |
opendevreview | Takashi Kajinami proposed openstack/osc-placement master: Update bug tracker url https://review.opendev.org/c/openstack/osc-placement/+/908221 | 02:24 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 06:02 |
melwitt | sean-k-mooney: I did the respin, it seems like things are working except stable rescue in some cases. I'm debugging that | 06:03 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 06:24 |
bauzas | okay, so I can officially say that nova-lvm has an issue :- | 10:27 |
whoami-rajat | bauzas, hey, do you mean nova-lvm or nova-cinder-lvm? not sure if nova directly uses lvm without cinder | 11:21 |
frickler | iiuc that's the nova-lvm job, which does NOVA_BACKEND="lvm", so not involving cinder | 11:27 |
whoami-rajat | okay, i remember nova team mentioning about some issues with the cinder-lvm backend so good to know it's not that | 11:28 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 11:55 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD statless firmware support https://review.opendev.org/c/openstack/nova-specs/+/908297 | 12:08 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support https://review.opendev.org/c/openstack/nova-specs/+/908297 | 12:08 |
greatgatsby_ | Hello. Is it possible to modify OS-EXT-SVR-ATTR:host via the nova command or OSC if it's incorrect (after compute crash)? Trying to avoid modifying via the database. | 12:24 |
bauzas | whoami-rajat: frickler: yeah was at lunch, correct, this is the nova-lvm job | 12:36 |
opendevreview | Sylvain Bauza proposed openstack/nova master: Fix verifying all the alloc requests from a multi-create https://review.opendev.org/c/openstack/nova/+/846786 | 13:26 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 13:54 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support https://review.opendev.org/c/openstack/nova-specs/+/908297 | 14:01 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support https://review.opendev.org/c/openstack/nova-specs/+/907702 | 14:02 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support https://review.opendev.org/c/openstack/nova-specs/+/908297 | 14:03 |
opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: Stateless firmware support https://review.opendev.org/c/openstack/nova-specs/+/908297 | 14:44 |
*** d34dh0r5- is now known as d34dh0r53 | 14:57 | |
bauzas | dansmith: saw your comment on https://review.opendev.org/c/openstack/nova/+/904209/14/nova/virt/libvirt/driver.py#9838 | 16:09 |
bauzas | dansmith: you're right, we could have problems, that's why I wrote this in the spec : | 16:09 |
bauzas | "will persist that list of target mediated devices in some internal dictionary field of the LibvirtDriver instance, keyed by the instance UUID. " https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/libvirt-mdev-live-migrate.html#proposed-change | 16:10 |
bauzas | but yeah, in case the operator restarts the compute service, then the dict could be wrong | 16:10 |
bauzas | or the other way if we have a rabbit issue | 16:11 |
bauzas | so I don't know what to say here | 16:11 |
bauzas | that's why we discussed this in the PTG, we said we will try to remember those by a dict | 16:11 |
bauzas | https://etherpad.opendev.org/p/nova-caracal-ptg#L585 | 16:12 |
bauzas | anyway, let's discuss this after our meeting | 16:13 |
* zigo has just finished setting-up an arm64 as a compute (80 ampere cores) ! :) | 16:19 | |
zigo | It felt super fast in the host, not so much in the VMs... | 16:19 |
dansmith | bauzas: yeah, and L576 says "Dan Smith not very convinced" :) | 16:23 |
dansmith | so I think this is well in bounds for being concerned now that we're looking at the implementation | 16:23 |
bauzas | dansmith: let's find an alternative after our meeting :) | 16:24 |
bauzas | let's try* to find (tbc) | 16:24 |
dansmith | I'm certainly not trying to cause you trouble, I'm just expressing concern | 16:24 |
dansmith | ack | 16:24 |
bauzas | dansmith: I'm swamped in another meeting but my point is that I wonder if we should persist the internal dict | 17:09 |
bauzas | in case of a rabbit issue, people just restart their computes | 17:09 |
bauzas | that said, if we have a current live-migration, it would be eventually stopped, right? | 17:10 |
dansmith | right, but if they do, they leak resources in placement right? | 17:10 |
bauzas | or not ? | 17:10 |
bauzas | because of the target allocations ? | 17:10 |
dansmith | yeah | 17:11 |
bauzas | dansmith: sorry was in a meeting | 17:40 |
bauzas | so, what I can do is to provide some logs, for sure | 17:41 |
bauzas | then, I'll try to test with my hardware environment what would happen if the operator tries to restart a compute while a live-migration is currently running | 17:42 |
bauzas | for the allocations, it would be like any other feature | 17:43 |
dansmith | bauzas: what if we had the compute service look for and clean up any allocations against its RP for instances that don't exist on it? basically "things that were reserved here, but are gone now"? | 17:54 |
dansmith | maybe we already do that and I'm missing it? | 17:54 |
dansmith | but if so, then a restart would *actually* clean up everything because the in-memory bit would be dumped and we'd clean up the stale reservations | 17:54 |
bauzas | dansmith: there are two things | 17:54 |
bauzas | there are VGPU allocations and there are mdev "reservations" | 17:55 |
bauzas | the internal dict is just here for making sure we don't pass a mdev to a new instance if that mdev is currently used by a live-migrating instance | 17:56 |
bauzas | so we just 'reserve' it | 17:56 |
bauzas | that said, and that's something I can verify, if the target domain is created just when we start the live-migration, then we don't need to 'reserve' the mdev | 17:57 |
bauzas | maybe the context you miss is that a compute only knows which mdevs are used by instances by looking at the guest XMLs | 17:58 |
bauzas | the concern I had is that if we don't have a domain for a target guest yet while live-migrating, then the compute would not know that the related mdevs are currently used for that one | 17:59 |
bauzas | hence the internal dict | 17:59 |
dansmith | bauzas: thanks, I understand the difference between the dict reservation and the allocations | 18:09 |
dansmith | however, the allocations are not cleaned up by a restart (right?) even though the dict reservations are | 18:09 |
dansmith | I don't really like the latter requiring a restart, but the former seems very unfortunate to me | 18:09 |
bauzas | that's the same problem with any live-migration allocation, right? | 18:10 |
bauzas | dansmith: one question I have and that you could maybe know, are we providing an error for running live-migrations if we restart a target ? | 18:11 |
dansmith | bauzas: It seems to me like this is the first place where we're doing that allocation in the pre check no? | 18:13 |
dansmith | we grab pci stuff, but I don't think we persist it until later, if I'm reading correctly | 18:13 |
bauzas | I don't know *when* we create a target allocation | 18:13 |
bauzas | I need to look at the code | 18:14 |
bauzas | https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py#L558 | 18:16 |
dansmith | okay I just found that | 18:16 |
dansmith | so the conductor nukes any allocations that we have against the proposed node if we abort, yeah? | 18:16 |
bauzas | after pre-livemigrate sure | 18:17 |
bauzas | but then I wonder if we delete the allocations when we monitor the running live-migration | 18:17 |
bauzas | looking at the compute now | 18:17 |
dansmith | yeah, that's only if we decide the destination isn't a good fit, but doesn't address the failure later | 18:17 |
bauzas | https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py#L558 | 18:20 |
bauzas | here we error a migration if we have any exception | 18:20 |
dansmith | right, but after we start the migration, nothing will clean up the destination node's allocations right? | 18:22 |
bauzas | https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9692 | 18:22 |
bauzas | nah, found it | 18:22 |
bauzas | we delete the allocation when calling _rollback_live_migration() | 18:22 |
bauzas | and then we call the driver's methods for cleaning up the residues... which then unreserve the mdev from the dict :) | 18:23 |
bauzas | so we're good | 18:23 |
bauzas | but I agree on adding more logs | 18:24 |
dansmith | the dict is on the destination, right? | 18:24 |
dansmith | so we don't clean that up | 18:24 |
bauzas | I didn't do that in the past when allocating mdevs and that was needed | 18:24 |
bauzas | dansmith: no, we both call source and dest virt drivers | 18:24 |
bauzas | source is here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9708-L9709 | 18:24 |
dansmith | bauzas: not if rabbit is down | 18:25 |
bauzas | and dest is there https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9736-L9738 | 18:25 |
bauzas | dansmith: hah, I get your point now | 18:25 |
dansmith | just saying it seems like we need more post-disaster recovery bits, be it on compute restart or some periodic | 18:26 |
bauzas | there could be a situation with rabbit down that would hit the live-migration but we wouldn't unreserve the mdev | 18:26 |
bauzas | gotcha | 18:26 |
dansmith | but, given that we do clean these allocations if rabbit is not down and we fail, then fair enough, | 18:26 |
dansmith | but I think the logging of the internal private data that won't be reset until restart is important, at least so it's discoverable why something is reserved but there's no record of it | 18:26 |
bauzas | yeah good point | 18:27 |
bauzas | my kids are starving, so I need to leave or they will run errand in the street like zombies | 18:28 |
bauzas | but I'll try to be a bit more resilient | 18:28 |
bauzas | at least the first thing is to log | 18:28 |
bauzas | then I wonder whether we need some periodic for the cleaning case | 18:28 |
bauzas | (but I just wonder which conditional to write in that periodic :) ) | 18:29 |
bauzas | that should be "give me all the migrations that were errored and make sure we don't have a reserved mdev for those" | 18:29 |
dansmith | something like that | 18:40 |
sean-k-mooney | melwitt: i think i need to restack but i got https://paste.opendev.org/show/bRUFWOFqOa8kSeXyH12I/ when i tried to delete a vm | 18:40 |
melwitt | sean-k-mooney: oh, ok that's bc I added a db migration (that I forgot to mention) to add a column (for backing file secret uuid). so you'd have to nova-manage db sync. it's complaining about the lack of the column | 18:42 |
sean-k-mooney | oh ok | 18:43 |
sean-k-mooney | i just did a git checkout | 18:43 |
sean-k-mooney | and restarted things so ya i can do a db sync | 18:43 |
melwitt | yeah, I did the same thing. just forgot about the db sync, sorry 😑 | 18:43 |
sean-k-mooney | i was really confused because the way this fails in the openstack client is really terrible | 18:43 |
sean-k-mooney | (venv) ubuntu@disk-encrypt-1:~/repos/devstack$ openstack server delete "test-vm" | 18:44 |
sean-k-mooney | Resource.get() takes 1 positional argument but 2 were given | 18:44 |
sean-k-mooney | that is the error you get | 18:44 |
melwitt | yeah, I remember thinking that too | 18:44 |
sean-k-mooney | and if you use debug mode it does not help | 18:44 |
sean-k-mooney | ya so i went back to nova client and it told me there was a 500 and then it was clear that i was missing something | 18:45 |
melwitt | guess that's another thing for the osc backlog, figure out what's with that error message and how to improve it | 18:45 |
sean-k-mooney | this is its internal traceback https://paste.opendev.org/show/bJQt6QO5X1pdKbL7dnm7/ | 18:46 |
melwitt | oh, and another thing that tripped me up is the db sync doesn't fan out to cells (!) and you have to use --local_cell to make it sync the nova_cell1 db | 18:46 |
sean-k-mooney | but the issue is it's trying to use the resource before it's verified the return code i think | 18:47 |
sean-k-mooney | good to know cause i totally did not add --local_cell | 18:47 |
melwitt | hm | 18:47 |
sean-k-mooney | its still not working for me but it did do the upgrade | 18:48 |
sean-k-mooney | do i need to also restart | 18:48 |
sean-k-mooney | Running upgrade 13863f4e1612 -> 2a7173b820a6, Add backing_encryption_secret_uuid to block_device_mapping | 18:48 |
melwitt | it took me a bit to figure out wtf was wrong too. been that long since I db sync'ed on the fly I guess 😶 | 18:48 |
melwitt | I don't think you should need to restart | 18:49 |
melwitt | and make sure you had 'db sync' without --local_cell too bc that's what syncs the nova_cell0 db 😛 | 18:50 |
melwitt | another thing for the backlog ... do a fanout. I'm not sure if it was deliberate not to or if it's just no one has gotten around to it yet | 18:51 |
sean-k-mooney | nova-manage --config-dir /etc/nova --config-file /etc/nova/nova_cell1.conf db sync | 18:51 |
sean-k-mooney | that worked | 18:51 |
sean-k-mooney | oh you know what this instance was in error so it was buried in cell0 | 18:52 |
melwitt | hm ok. I thought I tried that but it didn't work without --local_cell | 18:52 |
melwitt | *didn't work to sync nova_cell1 | 18:52 |
sean-k-mooney | oh actually no it got past the scheduler so it was in cell 1 | 18:53 |
sean-k-mooney | i say "was" because its deleted now so i guess it does not matter | 18:53 |
sean-k-mooney | also i just booted a vm with your latest code so that's cool too | 18:53 |
sean-k-mooney | ok its on disk-encrypt-2 so in theory i could live migrate it to disk-encrypt-1 | 18:54 |
melwitt | that should work, but since I said that it probably won't | 18:54 |
sean-k-mooney | it did not but i havent tried live migration without encrypted volumes so i dont know if i missed a step or not | 18:56 |
sean-k-mooney | ah ok | 18:57 |
sean-k-mooney | libvirt.libvirtError: Cannot recv data: ssh: Could not resolve hostname disk-encrypt-1: Name or service not known: Connection reset by peer | 18:57 |
sean-k-mooney | i just need to update /etc/hosts | 18:57 |
melwitt | 😅 | 18:58 |
sean-k-mooney | Live Migration failure: Cannot recv data: Host key verification failed.: Connection reset by peer: libvirt.libvirtError: Cannot recv data: Host key verification failed.: Connection reset by peer ok i need to check what user its using | 19:01 |
sean-k-mooney | i set up ubuntu but maybe its running as root | 19:01 |
sean-k-mooney | still complaining about host key verification | 19:10 |
sean-k-mooney | i think ill just turn that off | 19:10 |
sean-k-mooney | melwitt: finally i had to swap to using root for some reason for the ssh connection | 19:22 |
sean-k-mooney | but in any case yes live migration in one direction worked | 19:22 |
sean-k-mooney | im going to check if the instance dirs etc. actually got cleaned up but looking good so far | 19:23 |
sean-k-mooney | im so out of practice of doing this with devstack by hand | 19:24 |
melwitt | sean-k-mooney: yeah, same (I had been) | 19:32 |
melwitt | sean-k-mooney: oh, with the host key verification in my case what I had to do was both ssh to and from as root "ssh root@blah" but also had to ssh as the other user, maybe the stack user? but pass the root user like "ssh root@blah" | 19:38 |
melwitt | it would not stop complaining until I did both of those | 19:38 |
sean-k-mooney | so if i use live_migration_uri = qemu+ssh://root@%s/system | 20:07 |
sean-k-mooney | and ssh as root to each host | 20:07 |
sean-k-mooney | then it works fine | 20:08 |
sean-k-mooney | but i was trying to use live_migration_uri = qemu+ssh://ubuntu@%s/system | 20:08 |
opendevreview | Ghanshyam proposed openstack/nova master: Remove HyperV: cleanup doc/code ref https://review.opendev.org/c/openstack/nova/+/906629 | 20:08 |
sean-k-mooney | there is a way to use a non root user to do that but its not worth the hassle | 20:08 |
opendevreview | Ghanshyam proposed openstack/nova master: HyperV: Remove HyperVLiveMigrateData object https://review.opendev.org/c/openstack/nova/+/906636 | 20:10 |
melwitt | ah I see | 20:16 |
opendevreview | Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console connection information API https://review.opendev.org/c/openstack/nova/+/906991 | 20:19 |
sean-k-mooney | its still unhappy for cold migration so i need to add the root users ssh key to the ubuntu users authorised keys | 20:20 |
sean-k-mooney | i normally just use one key and put it in both ubuntu and root, with the same public/private key on all hosts | 20:20 |
opendevreview | Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console connection information API https://review.opendev.org/c/openstack/nova/+/906991 | 20:20 |
sean-k-mooney | this time i was trying to be fancy and use separate keys and minimal privileges | 20:21 |
opendevreview | Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console API https://review.opendev.org/c/openstack/nova/+/906809 | 20:21 |
melwitt | 🙂 | 20:22 |
opendevreview | Ghanshyam proposed openstack/nova master: HyperV: Remove extra specs of HyperV driver https://review.opendev.org/c/openstack/nova/+/906992 | 20:22 |
sean-k-mooney | melwitt: ill try this again tomorrow when i can make sure i set this up correctly end to end | 20:28 |
sean-k-mooney | so far i have got good coverage of what happens if your ssh keys dont work and we rollback the cold migrate early :) | 20:29 |
melwitt | haha nice | 20:29 |
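
The mdev "reservation" dict that bauzas and dansmith discuss above, together with the logging and the periodic cleanup they suggest ("give me all the migrations that were errored and make sure we don't have a reserved mdev for those"), could look roughly like the sketch below. This is not Nova's code: `MdevReservations`, `Migration`, and every method name are hypothetical and exist only to illustrate the idea, and how the list of migrations would be obtained is left out entirely.

```python
import logging

LOG = logging.getLogger(__name__)


class Migration:
    """Hypothetical stand-in for a migration record (not a Nova object)."""

    def __init__(self, instance_uuid, status):
        self.instance_uuid = instance_uuid
        self.status = status  # e.g. "running", "error", "completed"


class MdevReservations:
    """Hypothetical in-memory map: instance UUID -> mdevs reserved on the target."""

    def __init__(self):
        self._by_instance = {}

    def reserve(self, instance_uuid, mdev_uuids):
        # Log every reservation so it stays discoverable why an mdev is
        # held even though no guest XML references it yet.
        self._by_instance[instance_uuid] = list(mdev_uuids)
        LOG.info("Reserved mdevs %s for live-migrating instance %s",
                 mdev_uuids, instance_uuid)

    def unreserve(self, instance_uuid):
        # Returns the released mdevs, or an empty list if none were held.
        mdevs = self._by_instance.pop(instance_uuid, [])
        if mdevs:
            LOG.info("Released mdevs %s reserved for instance %s",
                     mdevs, instance_uuid)
        return mdevs

    def reserved_mdevs(self):
        # Set of every mdev that must not be handed to a new instance.
        return {m for mdevs in self._by_instance.values() for m in mdevs}

    def cleanup_errored(self, migrations):
        # Periodic-style cleanup: drop reservations whose migration errored.
        released = {}
        for migration in migrations:
            if migration.status != "error":
                continue
            mdevs = self.unreserve(migration.instance_uuid)
            if mdevs:
                released[migration.instance_uuid] = mdevs
        return released
```

Because the dict only lives in memory, a compute restart still empties it; the cleanup pass is meant for the case raised in the discussion where a rabbit outage prevents the rollback call from reaching the destination.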