bauzas | haleyb: sure, shoot | 08:04 |
parasitid | hi all, I have a question regarding the shelving process. Would it be possible to "suspend" then shelve an instance, and later unshelve + resume it, so that I can resume my instance on another host without restarting the OS? Currently it seems that unshelving an instance reboots the VM, and that its state file is lost during the shelving process. Am I correct? Thanks a lot | 09:05 |
pas-ha[m] | parasitid: I don't think 'shelving' saves a memory dump.. it only saves a disk state. Shelving is literally 'shutdown and create an image', and shelve-offload is '..and de-allocate resources' (removes local disk files, removes allocations in placement but keeps the instance in the DB). | 09:17 |
pas-ha[m] | And 'unshelving' just means 'rebuild the instance from the image we saved in the shelve command' | 09:18 |
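(A minimal sketch of the shelve/unshelve cycle described above, using openstacksdk; the cloud name "mycloud", the server name "demo", and the timeouts are illustrative assumptions, not anything from the log:)

```python
import openstack

conn = openstack.connect(cloud="mycloud")   # assumed clouds.yaml entry
server = conn.compute.find_server("demo")   # assumed server name

# Shelve: the guest is shut down and a snapshot image is created;
# shelve-offload then frees the local disk and placement allocations.
conn.compute.shelve_server(server)
server = conn.compute.wait_for_server(server, status="SHELVED_OFFLOADED", wait=600)

# Unshelve: the instance is rebuilt from the snapshot image, possibly
# on another host; the guest cold-boots, so in-memory state is lost.
conn.compute.unshelve_server(server)
server = conn.compute.wait_for_server(server, status="ACTIVE", wait=600)
```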
parasitid | pas-ha[m]: ok thanks, that's what I thought. I hoped that the state machine would differentiate an "instance active state > shelved state" transition from an "instance suspended state > shelved state" one. It would be nice to offload the instance state file if suspended, so that unshelving an instance could restore its state | 09:22 |
bauzas | parasitid: you need to live-migrate your instance if you want to keep the instance alive | 09:22 |
bauzas | shelve means "stop my instance and put it in the shelve" :) | 09:22 |
parasitid | bauzas: well my use case is: start an instance, use it, shelve it so I'm potentially not billed by the cloud provider, then unshelve it and recover the previous working environment | 09:24 |
pas-ha[m] | I don't think nova has anything similar to libvirt/virt-manager 'save VM', which is kind of like hibernate - dump the memory to a file, and resume from that (and the disk) later. | 09:24 |
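(The libvirt 'save VM' behaviour mentioned here is managed save; a rough libvirt-python sketch, outside of nova, with the domain name as a placeholder:)

```python
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")  # placeholder domain name

# managedSave() dumps guest RAM to a libvirt-managed file and stops
# the domain -- the hibernate-like behaviour described above.
dom.managedSave(0)

# create() sees the managed-save image and resumes from it instead of
# cold-booting -- but only on the same host with the same domain XML.
dom.create()
```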
bauzas | snapshot it | 09:25 |
pas-ha[m] | that's what would be required for your use case as you want to save the memory as well | 09:25 |
bauzas | you could also suspend the VM | 09:25 |
bauzas | but your cloud provider will probably continue to bill you | 09:26 |
parasitid | bauzas: yes, but suspending a VM keeps the resources allocated, so you're often billed by the cloud provider. whereas shelving it frees the resources | 09:26 |
bauzas | sure, but that's why you need to stop the instance, right? | 09:27 |
bauzas | when you shelve, IIUC, you'll get a memory snapshot | 09:27 |
bauzas | sorry, a disk snapshot I mean | 09:28 |
bauzas | but the memory will be lost | 09:28 |
parasitid | bauzas: ok. couldn't the shelve process consider the VM state and also back up/restore the memory state file if the instance was in the suspended state? | 09:30 |
bauzas | I see your usecase tbc | 09:32 |
bauzas | but the problem here is that for the moment, AFAIK nova isn't able to save the memory | 09:32 |
parasitid | bauzas: usecase would be VDI/remote desktop | 09:32 |
bauzas | yup, I got it | 09:32 |
bauzas | that's in general *why* people want to save the memory :D | 09:33 |
parasitid | bauzas: are you sure? I made a test this morning. When I suspend an instance and resume it, I can ssh back into my instance, tmux attach to my session, and recover the vim buffer right where it was before suspending | 09:34 |
bauzas | honestly, I don't know what to tell you, except that it would need a new feature | 09:34 |
bauzas | suspend works indeed with the memory, but we don't snapshot it | 09:35 |
bauzas | that's why I said above "snapshot it" | 09:35 |
parasitid | bauzas: ok thanks. i understand perfectly. i jumped here just in case there would be a secret tip to achieve this :) | 09:35 |
bauzas | but your usecase is different : you want to /persist/ it | 09:35 |
bauzas | tbh, VMware supports that AFAIK | 09:36 |
bauzas | (memory snapshot) | 09:36 |
bauzas | but as I said here, we need some new feature and some new API support for it | 09:36 |
bauzas | sean-k-mooney: ^ | 09:36 |
bauzas | (that and live-resize are possibly the most needed features we'd like to have for VMware migration :) ) | 09:38 |
bauzas | sean-k-mooney: I wonder, could we help parasitid with file-backed memory? | 09:39 |
bauzas | (oh I forgot, sean is on PTO this week) | 09:41 |
sean-k-mooney[m] | we have talked about this in the past, libvirt can do memory snapshots but we can't because we do not guarantee a stable hardware interface | 09:43 |
sean-k-mooney[m] | when you unshelve you might land on a different host with a different cpu, the pci device order in the guest may or may not be the same, and if you have any sriov devices their memory would not be captured | 09:44 |
sean-k-mooney[m] | we may be able to make it work in some cases, but we would have to save and store the memory as an additional image, and we also might need to record other info about the guest, like the pci device ids, instead of allowing libvirt to choose them | 09:46 |
sean-k-mooney[m] | but no, file-backed memory will not help here. libvirt has a call to take a snapshot and include the guest memory | 09:47 |
sean-k-mooney[m] | so it's not that we can't do that | 09:47 |
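(The libvirt call referred to here is virDomainSnapshotCreateXML with an external memory snapshot; a minimal sketch where "dom" is a libvirt domain handle as in the earlier sketch, and the file paths and disk target are illustrative:)

```python
snapshot_xml = """
<domainsnapshot>
  <name>shelve-with-memory</name>
  <memory snapshot='external' file='/var/lib/nova/instances/mem.save'/>
  <disks>
    <disk name='vda' snapshot='external'>
      <source file='/var/lib/nova/instances/disk.overlay'/>
    </disk>
  </disks>
</domainsnapshot>
"""
# Creates a disk overlay plus a guest RAM dump. Restoring it elsewhere
# would also need the domain XML and an identical guest-visible
# hardware layout -- the stability nova does not guarantee.
dom.snapshotCreateXML(snapshot_xml, 0)
```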
gibi | yeah shelve with memory would be like: hibernate your guest, write the image to disk, bring the image to different hardware, and try to resume from hibernation. As far as I know, if I change my hardware while my machine is hibernated and I try to resume, I may very well be in kernel panic territory | 09:48 |
sean-k-mooney[m] | for instances with no passthrough devices (like neutron sriov ports or vgpus) that use cpu_mode=custom and a pinned cpu model, it probably would work | 09:49 |
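(The "cpu_mode=custom and a pinned cpu model" case maps to nova's [libvirt] config options; illustrative values only, the model name is just an example:)

```ini
[libvirt]
cpu_mode = custom
cpu_models = Skylake-Client-IBRS
```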
gibi | wondering what restrictions libvirt puts on resuming from a memory snapshot while the hypervisor is reconfigured in between | 09:50 |
sean-k-mooney[m] | yep, so if there was a strong demand for this we could consider it, but it's very non-cloudy; it is an enterprise virt feature | 09:50 |
gibi | If I have to choose what to spend time on between live-resize vs shelve with memory, then the former feels like a more generally useful feature | 09:51 |
gibi | (and both are hard) | 09:52 |
sean-k-mooney[m] | the former is more cloudy | 09:53 |
sean-k-mooney[m] | shelve with memory is definitely in the enterprise virt space | 09:53 |
bauzas_ | I agree that there are some technical constraints and limitations | 09:56 |
sean-k-mooney[m] | anyway, yes, I'm on pto this week, so totally not going back to working on my home lab. | 09:56 |
gibi | it is more like we don't seem to know the list of constraints, and coming up with one is not simple | 09:56 |
bauzas_ | but the use case written as "as a VDI user, I'd like my instance to be shelved with its memory so I can spin up my instance again without problems" sounds like a valid cloud use case to me | 09:57 |
gibi | sean-k-mooney[m]: enjoy your PTO | 09:57 |
sean-k-mooney[m] | VDI itself is iffy | 09:57 |
gibi | bauzas_: in a cloud we run cattle. | 09:58 |
sean-k-mooney[m] | I'm open to looking at this again, but to be clear we have said no to this exact use case once before, and to snapshot with memory once as well | 09:58 |
sean-k-mooney[m] | shelve with memory I think is actually more reasonable than generic snapshot with memory | 09:58 |
bauzas_ | if you prefer, someone running an instance with virtual desktop capabilities | 09:58 |
*** bauzas_ is now known as bauzas | 09:58 | |
bauzas | or there could be other workloads where saving the memory would be more than just nice | 09:59 |
bauzas | but there are security implications for sure, we'd need to keep the memory safe | 09:59 |
gibi | bauzas: I still reject the idea that this is cloudy. In a cloud you should be ready to lose a single VM at any time. | 09:59 |
sean-k-mooney[m] | yep, you just need to reconcile that with the fact that we provide no abi stability for the guest. | 09:59 |
sean-k-mooney[m] | meaning if you add 2 volumes and remove 1 | 10:00 |
bauzas | gibi: well, that ship has already sailed for a while :( | 10:00 |
gibi | sean-k-mooney[m]: yeah, we can promise that we restore your memory, but we cannot promise the guest kernel won't panic | 10:00 |
sean-k-mooney[m] | the first one, then the next time we generate the xml the remaining volumes' pci addresses will change | 10:00 |
bauzas | while I'm one of the most ferocious guys to say 'sorry, cloud', I tend to admit that /some/ workloads require more care | 10:01 |
gibi | bauzas: we can try to stop on that slippery slope :) | 10:01 |
sean-k-mooney[m] | we can add this to the list of things people want, but I would put vtpm live migration higher personally | 10:02 |
bauzas | gibi: and ask the foundation to remove the VMware-migration-to-OpenStack white paper? :) | 10:02 |
gibi | unshelve with memory is like a brain transplant while you expect not just the patient to survive but also to keep all its past memories. | 10:02 |
sean-k-mooney[m] | this has nothing to do with vmware | 10:02 |
sean-k-mooney[m] | we have a vmware driver and they never cared to enable this use case | 10:02 |
sean-k-mooney[m] | we have discussed this at an in-person ptg before | 10:03 |
sean-k-mooney[m] | so this is an old request | 10:03 |
gibi | bauzas: just because the foundation says so, a brain transplant will not work out of the box | 10:03 |
bauzas | doesn't vsphere support snapshotting the memory? | 10:03 |
gibi | x | 10:03 |
bauzas | I like the idea of "brain transplant" | 10:03 |
sean-k-mooney[m] | it does but it does not do it as part of shelve | 10:03 |
bauzas | but this is more like live neuro-link transplants | 10:03 |
bauzas | I can consider the brain as the disk while the neurotransmitters are the RAM :) | 10:04 |
sean-k-mooney[m] | libvirt, hyperv and vmware can all do this (so can virtualbox) but none of them ever enabled it in their virt driver | 10:04 |
gibi | the disk content goes through a full kernel boot process with hw probing etc; the resume from ram tries to continue executing where it left off, without a boot process | 10:05 |
bauzas | then why would shelve w/ memory be more acceptable than snapshot w/ memory? | 10:06 |
bauzas | shit | 10:06 |
gibi | I think we can ask the libvirt folks how they feel about moving the libvirt memory snapshot, disk, and domain def to another compute and trying to resume; if they say they support it then let's discuss further | 10:06 |
bauzas | sorry, I said "shit" because I was trying to understand exactly the diffs with suspend | 10:07 |
bauzas | gibi: again, I don't have the energy nor the will to push that any further, I was just saying that the use case may sound legit | 10:08 |
bauzas | let's not rathole on it, the answer is as of now "NOT SUPPORTED" | 10:08 |
gibi | as legit as requesting a brain transplant. legit, but pointless if the tech is not there | 10:09 |
bauzas | the tech already allows us teleportation :) | 10:10 |
bauzas | but I hear ya | 10:10 |
bauzas | that would be a horrible spec to write and a terrible spec to review | 10:10 |
gibi | if by teleportation you mean live-migration then yes, libvirt has a bunch of tech implemented to support that; I'm not sure they have the tech to support brain transplants yet. | 10:11 |
gibi | btw that actually points to a direction. Instead of shelving with a memory snapshot, do the memory snapshot on the current hypervisor. The resume is supported there, on the same hypervisor. I think the requesting people do not need the move aspect of shelve, they need the resumability only. So give them a way to resume on the same hypervisor | 10:14 |
gibi | wait, we have that; they just get billed as they reserve the space to be able to resume | 10:15 |
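(The "we have that" above is nova's existing suspend/resume, which keeps the host allocation, and hence the billing, in place; continuing the earlier openstacksdk sketch with the same assumed "conn" and "server":)

```python
# Suspend: libvirt managed save under the hood; memory state survives,
# but resources stay allocated on the host.
conn.compute.suspend_server(server)
server = conn.compute.wait_for_server(server, status="SUSPENDED", wait=300)

# Resume: restores the guest from the saved memory on the same host.
conn.compute.resume_server(server)
server = conn.compute.wait_for_server(server, status="ACTIVE", wait=300)
```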
gibi | so they don't want to get billed, but still want to make sure they can resume in place; that is contradictory | 10:16 |
gibi | or at least we are moving into interruptible-instances territory to make space for the resume | 10:16 |
gibi | (or whatever the name of that feature was that allows killing certain types of instances to make space for reservations) | 10:17 |
gibi | I guess I'll stop braindumping here as it is not exciting :) | 10:18 |
sean-k-mooney[m] | the reason shelve is more legit than snapshot is that it removes the "snapshot once and create many copies" use case, and it also removes restore via rebuild | 11:06 |
sean-k-mooney[m] | so shelve/unshelve is much smaller in scope and has fewer sharp edges as a result | 11:08 |
sean-k-mooney[m] | gibi: the reason suspend (which calls managed save) works today is that we still have the domain, so we can still restore it without worrying about the xml changing; managed save also does not allow pci passthrough devices to be attached to the domain | 11:10 |
sean-k-mooney[m] | if I were to do shelve with memory, my instinct would say: save the xml and the memory as additional images, and reuse some of the live migration logic to update the host side of the xml without modifying the guest-visible side | 11:12 |
sean-k-mooney[m] | the same host does not really matter as long as the guest can't tell it moved | 11:12 |
gibi | sean-k-mooney[m]: I agree, the move does not matter to the user if the instance works. I'm just pointing out that they also never requested the move; they only request having a place to resume to, without getting billed for that place while suspended. | 11:32 |
gibi | btw, getting vtpm live migration indeed seems even more important | 11:33 |
sean-k-mooney[m] | i mainly mentioned vtpm as another example of needing to store additional data alongside the disk when shelving, which we don't support today | 11:35 |
sean-k-mooney[m] | but it's also a higher priority in my book; both are parallel efforts though | 11:35 |
parasitid | hi gibi: I think I get your point, but there is still a lot of very obscure stuff to me because I'm definitely not a specialist in this topic. When you say "so they don't want to get billed, but still want to make sure they can resume in place, that is contradicting", why wouldn't it be possible to be resumed on a host supporting the same "flavor"? Why does it work in the case of live migration and not unshelving? And yes I wouldn't | 11:37 |
parasitid | gibi: I just want to maximize the chances of being "resumable" while lowering the chances of a No Valid Host found error | 11:38 |
parasitid | btw, I didn't want to hijack your backlog by introducing this topic, I only wanted to know if there was a hidden feature/tip to achieve it. As it seems that it's not currently supported, I'm fine with it. Don't worry. | 11:40 |
gibi | parasitid: no worries. I'm happy to discuss incoming requests. I'm especially happy that it wasn't just a request, but that you are open to discussing it a bit deeper. | 11:41 |
gibi | parasitid: the devil is in the details of "a host supporting the same flavor". For nova, supporting a flavor to boot a new VM on a host is different from supporting a host as a target for a live migration. And I assume it will also be different for supporting a host as a target for unshelve with memory. | 11:44 |
parasitid | gibi: yes. the starting point of my day was a test where I created an instance, opened vim in a tmux session, then suspended it, shelved it... Up to that point, as nova didn't complain about shelving a suspended instance, I secretly hoped that the resume would work :) | 11:45 |
parasitid | as I didn't find anything related to this in the docs, I jumped here to ask questions | 11:45 |
gibi | parasitid: what, nova allowed you to shelve a suspended VM? That feels like an API validation bug :) (and also a way to raise false hope) | 11:46 |
gibi | parasitid: totally valid to jump in and ask questions. :) | 11:46 |
gibi | I think I'm debating mostly with bauzas about how acceptable this use case is within our scope. | 11:47 |
gibi | and pointing out that even if we accept it, there might be dependencies on libvirt / qemu we don't have right now to actually support it | 11:49 |
gibi | (and I think bauzas has a point about scope, the question boils down to how much we want to support enterprise virt vs. cloud) | 11:50 |
parasitid | by cloud you mean cloud-native workloads? | 11:52 |
parasitid | coz opening vim in a tmux is not a cloud-native workload :) | 11:53 |
parasitid | moreover.... i'm in the emacs team | 11:54 |
gibi | by cloud I mean VMs considered as cattle instead of pets | 12:01 |
gibi | you lose one, you create a new one | 12:01 |
gibi | no feelings attached | 12:01 |
bauzas | sometimes this is a bit harder than that, like if you have some database service using memory :) | 12:06 |
gibi | we have cluster-aware DBs like galera. I think galera Active/Active/Active is possible. I assume that survives killing a VM | 12:07 |
gibi | if you only store your important data in memory, then losing that is not on nova | 12:08 |
bauzas | gibi: (sorry, was upgrading to F40), yeah I don't disagree with your point, I'm just saying that some cases are related to some memory usage | 12:35 |
bauzas | and just saying 'sorry, but cattle' doesn't help them | 12:35 |
gibi | I hope that it is OK to send the message "please also try to change the workload to be more cloudy". I hope this helps them in the long run, changing those workloads and enjoying the benefits. | 12:37 |
gibi | I do believe saying no sometimes actually helps :) | 12:38 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Refactor obj_make_compatible to reduce complexity https://review.opendev.org/c/openstack/nova/+/928590 | 12:44 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [ovo]Add igb value to hw_vif_model image property https://review.opendev.org/c/openstack/nova/+/928456 | 12:44 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [libvirt]Support hw_vif_model = igb https://review.opendev.org/c/openstack/nova/+/928584 | 12:44 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [doc]Developer doc about PCI and SRIOV testing https://review.opendev.org/c/openstack/nova/+/928834 | 12:44 |
gibi | bauzas: can I drop the multipath_id and the novnc subpath topics from the nova meeting agenda? Or do we want to revisit them this week? I'm asking as I'm going to add https://blueprints.launchpad.net/nova/+spec/igb-vif-model to the agenda, and I noticed that those items might be stale | 12:55 |
bauzas | gibi: we discussed the multipath_id one | 12:56 |
bauzas | for the subpath, I think we also agreed the specless bp | 12:56 |
bauzas | so yeah | 12:56 |
gibi | OK, dropping them | 12:57 |
gibi | done | 12:57 |
bauzas | coo | 12:58 |
bauzas | cool even | 12:58 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [doc]Developer doc about PCI and SRIOV testing https://review.opendev.org/c/openstack/nova/+/928834 | 13:05 |
opendevreview | Brian Haley proposed openstack/nova stable/2023.2: libvirt: Cap with max_instances GPU types https://review.opendev.org/c/openstack/nova/+/916089 | 13:12 |
*** bauzas_ is now known as bauzas | 13:17 | |
opendevreview | Balazs Gibizer proposed openstack/nova master: [doc]Developer doc about PCI and SRIOV testing https://review.opendev.org/c/openstack/nova/+/928834 | 13:52 |
opendevreview | Doug Szumski proposed openstack/nova master: Revert "[libvirt] Live migration fails when config_drive_format=iso9660" https://review.opendev.org/c/openstack/nova/+/909122 | 16:15 |