opendevreview | Amit Uniyal proposed openstack/nova master: Adds check for VM snapshot fail while quiesce https://review.opendev.org/c/openstack/nova/+/852171 | 06:10 |
bauzas | JayF: ping me when you're back if you want, sure | 07:00 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/yoga: Bump min oslo.concurrency to >= 5.0.1 https://review.opendev.org/c/openstack/nova/+/859421 | 10:10 |
damiandabrowski | hey folks, I noticed that if a compute node already exceeds its overcommit ratios, it's not possible to migrate VMs out of it | 10:29 |
damiandabrowski | it makes no sense to me and it should be considered a bug, but maybe there is some reason behind it? | 10:29 |
damiandabrowski | https://paste.openstack.org/raw/byjHufprWu3hwDFPSKAD/ | 10:29 |
gibi | damiandabrowski: the reason is that if the compute already exceeds its overcommit then it cannot accept new allocations. And the migration needs to move the instance allocation to the migration_uuid in placement, and that would require a new allocation even if the total usage would not change. | 10:37 |
gibi | what you can do is | 10:37 |
gibi | i) increase the allocation ratio temporarily to a level where the usage does not exceed it | 10:38 |
gibi | ii) migrate the workload out | 10:38 |
gibi | iii) revert the allocation ratio to the original level | 10:38 |
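A minimal sketch of that three-step workaround, assuming the VCPU inventory is the over-committed one and using openstacksdk's placement proxy as a plain REST adapter; the cloud name, provider UUID and ratio values are placeholders, and note that a ratio changed only in placement may be reset by nova-compute's next periodic update unless it is also raised in nova.conf:

```python
# Hedged sketch: temporarily raise the VCPU allocation ratio on one resource
# provider so the migration's allocation move is accepted, then revert it.
import openstack

conn = openstack.connect(cloud="mycloud")        # hypothetical cloud name
rp_uuid = "RP_UUID_OF_THE_OVERALLOCATED_HOST"    # placeholder


def set_vcpu_ratio(ratio):
    # Fetch the current inventories to keep total/reserved and to get the
    # provider generation that placement requires for the update.
    data = conn.placement.get(
        f"/resource_providers/{rp_uuid}/inventories").json()
    vcpu = data["inventories"]["VCPU"]
    body = {
        "resource_provider_generation": data["resource_provider_generation"],
        "total": vcpu["total"],
        "reserved": vcpu.get("reserved", 0),
        "allocation_ratio": ratio,
    }
    resp = conn.placement.put(
        f"/resource_providers/{rp_uuid}/inventories/VCPU", json=body)
    resp.raise_for_status()


set_vcpu_ratio(32.0)   # i) raise the ratio above the current usage level
# ii) migrate the workload out of the host
set_vcpu_ratio(4.0)    # iii) revert to the original ratio
```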
damiandabrowski | thanks, i'll do that | 10:39 |
damiandabrowski | so you don't consider it as a bug, right? | 10:39 |
gibi | you exceeded the overallocation, so you are in an unsupported state I would say | 10:40 |
damiandabrowski | okok, thanks for clarification | 10:41 |
opendevreview | Oleksii Butenko proposed openstack/os-vif master: Add new os-vif `network` property https://review.opendev.org/c/openstack/os-vif/+/859574 | 12:17 |
jkulik | damiandabrowski: FYI, we have a patch downstream in our Placement to allow switching resources on overused providers, so migrations off should still be possible: https://github.com/sapcc/placement/commit/2691056d6fa3e7db0cf9966b082af51ae6b5dda9 | 12:19 |
jkulik | use at your own risk, though | 12:20 |
damiandabrowski | oh that's convenient, thanks | 12:29 |
opendevreview | Oleksii Butenko proposed openstack/nova master: Napatech SmartNIC support https://review.opendev.org/c/openstack/nova/+/859577 | 12:41 |
opendevreview | J.P.Klippel proposed openstack/nova master: Fix link to Cyborg device profiles API https://review.opendev.org/c/openstack/nova/+/859578 | 12:53 |
opendevreview | Justas Poderys proposed openstack/nova-specs master: Add support for Napatech LinkVirt SmartNICs https://review.opendev.org/c/openstack/nova-specs/+/859290 | 13:17 |
*** dasm|off is now known as dasm | 13:18 | |
justas_napa | Hi bauzas, gibi. Re: Napatech smartnic support, we have now pushed the code changes to opendev | 13:18 |
justas_napa | all changes are visible in opendev via https://review.opendev.org/q/topic:Napatech_SmartNIC_support | 13:19 |
noonedeadpunk | gibi: I guess the question is why a new allocation is created in the first place, as it's already present, isn't it? | 13:45 |
noonedeadpunk | so why is another allocation being created against the source host | 13:46 |
gibi | noonedeadpunk: it is a technicality. When the instance is created, the resource allocation of the instance is held by consumer=instance.uuid in Placement. When the instance is being migrated, Nova first moves the allocation held by consumer=instance.uuid to consumer=migration.uuid in Placement, then nova calls the scheduler to select a target host for the migration and allocates the resource on the target host to consumer=instance.uuid in Placement. | 13:47 |
gibi | noonedeadpunk: the move of the allocation from instance.uuid to migration.uuid is the one that is treated as a new allocation in Placement and rejected, as the host is already overallocated | 13:48 |
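For illustration only (all UUIDs and resource amounts below are made up), the move nova performs is roughly one bulk placement `POST /allocations` call that empties the instance consumer and recreates the same resources under the migration consumer, which is why placement re-runs its capacity check against the source host:

```python
# Illustrative payload, not lifted from nova's code: a bulk POST /allocations
# that atomically moves the allocation from the instance consumer to the
# migration consumer on the same source provider.
allocation_move = {
    "INSTANCE_UUID": {
        "allocations": {},              # empty dict drops this consumer's allocations
        "consumer_generation": 1,
        "project_id": "PROJECT_UUID",
        "user_id": "USER_UUID",
    },
    "MIGRATION_UUID": {
        "allocations": {
            "SOURCE_RP_UUID": {
                "resources": {"VCPU": 4, "MEMORY_MB": 8192},
            },
        },
        "consumer_generation": None,    # brand new consumer
        "project_id": "PROJECT_UUID",
        "user_id": "USER_UUID",
    },
}
# e.g. conn.placement.post("/allocations", json=allocation_move,
#                          microversion="1.28")
```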
noonedeadpunk | well, it's quite a fine line between a bug and "as designed" | 13:48 |
noonedeadpunk | because for me, an allocation change that does not involve a change of resources shouldn't be considered a new one... | 13:49 |
gibi | I intentionally don't say whether it is a bug or not; what I say is that having the host overallocated (over the overallocation limit) is the cause of the problem. A host should never be in that state | 13:49 |
noonedeadpunk | but I do agree that the workaround is easy enough not to spend time on making the placement logic more complex for this | 13:50 |
gibi | in all these cases you have to first figure out why and how you ended up in an overallocated state | 13:51 |
gibi | and as you said, recovering from this state is not that hard | 13:51 |
noonedeadpunk | I can give a super simple example: you decide that the overcommit ratio is too high and it would be good to lower it. And you have Prometheus, which can tell you that the average overcommit for the region is lower than what you want to define | 13:52 |
noonedeadpunk | So you lower it, but some computes still have a higher overcommit, as you have not checked that against placement per host | 13:53 |
noonedeadpunk | And you can't lower it down easily by reverting the setting and disabling the compute for scheduling on top | 13:53 |
noonedeadpunk | * And you can't lower it down easily by migrating instances out, so you have to revert the setting | 13:55 |
gibi | basically that process makes the mistake of lowering the allocation ratio on the host before checking whether it causes overallocation | 13:59 |
gibi | so the proper process would be (in my eyes): 1) disable the host temporarily to avoid new scheduling while you reconfigure it 2) migrate VMs out from the host to achieve the target resource usage 3) reconfigure the host with the new allocation ratio 4) enable the host | 14:01 |
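A rough openstacksdk sketch of that sequence; the cloud and host names are placeholders, live migrating every server is only for brevity (in practice you would stop once the target usage is reached), and step 3 happens out of band via config management:

```python
# Hedged sketch of the disable -> migrate -> reconfigure -> enable flow.
import openstack

conn = openstack.connect(cloud="mycloud")        # hypothetical cloud name
host = "compute-01"                              # placeholder hostname

# 1) disable the compute service so the scheduler stops placing new VMs there
svc = next(conn.compute.services(host=host, binary="nova-compute"))
conn.compute.disable_service(
    svc, disabled_reason="lowering allocation ratio")

# 2) migrate VMs out until the target resource usage is reached
for server in conn.compute.servers(all_projects=True, host=host):
    conn.compute.live_migrate_server(server, host=None)

# 3) set the new cpu_allocation_ratio in nova.conf and restart nova-compute
#    (not shown: done via config management or by hand)

# 4) re-enable the service
conn.compute.enable_service(svc)
```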
noonedeadpunk | well... it's hard to disagree here :) | 14:03 |
noonedeadpunk | (it's more tricky to do though with microversion 2.88) | 14:03 |
noonedeadpunk | or maybe not, and it's just me who got used to dealing with the nova api... | 14:07 |
noonedeadpunk | as ultimately the representation of inventory in placement is way more accurate | 14:08 |
gibi | yeah you should use placement to calculate how many vms you need to move | 14:10 |
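A hedged sketch of that calculation for a single provider, using raw placement calls through the SDK adapter; the UUID, the target ratio and the focus on VCPU are assumptions for illustration:

```python
# Hedged sketch: how much VCPU usage has to move off one provider before a
# lower allocation ratio no longer leaves it overallocated.
import openstack

conn = openstack.connect(cloud="mycloud")        # hypothetical cloud name
rp_uuid = "RP_UUID"                              # placeholder
new_ratio = 2.0                                  # the ratio you want to apply

inv = conn.placement.get(
    f"/resource_providers/{rp_uuid}/inventories").json()["inventories"]["VCPU"]
usage = conn.placement.get(
    f"/resource_providers/{rp_uuid}/usages").json()["usages"].get("VCPU", 0)

# capacity under the new ratio, mirroring placement's own formula
new_capacity = (inv["total"] - inv.get("reserved", 0)) * new_ratio
excess = usage - new_capacity
print(f"move at least {max(excess, 0):.0f} VCPUs of allocations off this host")
```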
auniyal | Hello #openstack-nova | 14:29 |
noonedeadpunk | gibi: the problem is that with the sdk it's quite tricky to use placement, as you need to actually call the api rather than get resources with all their attributes (like it was with hypervisors) | 14:29 |
auniyal | how this works - https://opendev.org/openstack/nova/src/commit/aad31e6ba489f720f5bdc765c132fd0f059a0329/nova/context.py#L396 | 14:30 |
noonedeadpunk | gibi: just to express my pain as an operator: https://paste.openstack.org/show/bpA2Iq28mHEfo8r4sH2W/ takes about 3 seconds to execute. And as you see, it overrides the microversion to <2.88. | 14:55 |
noonedeadpunk | If I am to use placement to get the same data, I will need code like this https://paste.openstack.org/show/bOsJwmnMMpftgZ01St0o/ and it takes... 42 seconds to execute against the exact same cluster | 14:56 |
noonedeadpunk | And that's not even counting the load on the APIs, as for each resource provider I need to make 2 calls to placement | 14:57 |
noonedeadpunk | so for users, the deprecation of providing resource statistics via the hypervisors API is quite a serious regression in performance. And if I want to monitor that regularly, it's really a waste of time. | 14:59 |
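For context, the per-provider pattern being described looks roughly like the loop below; this is purely illustrative of the two-calls-per-provider shape, not the contents of the paste above:

```python
# Illustrative serial loop: one inventories call and one usages call per
# resource provider, i.e. 2N requests plus the provider listing itself.
import openstack

conn = openstack.connect(cloud="mycloud")        # hypothetical cloud name

stats = {}
providers = conn.placement.get(
    "/resource_providers").json()["resource_providers"]
for rp in providers:
    uuid = rp["uuid"]
    inventories = conn.placement.get(
        f"/resource_providers/{uuid}/inventories").json()["inventories"]
    usages = conn.placement.get(
        f"/resource_providers/{uuid}/usages").json()["usages"]
    stats[rp["name"]] = {"inventories": inventories, "usages": usages}
```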
gibi | noonedeadpunk: interesting. so there are some heavy inefficiencies somewhere, as it's just 2x the number of calls but it takes more than 10 times as long to execute | 15:00 |
noonedeadpunk | fwiw - I was testing remote cloud (not from inside of it) | 15:00 |
noonedeadpunk | just to show that I'm not exaggerating https://paste.openstack.org/show/b1irA4GLwJEdUORuLarb/ | 15:03 |
noonedeadpunk | But I'm not sure about the 2x number of calls, as it feels like the hypervisor data was fetched with a single api call... | 15:04 |
noonedeadpunk | Not sure though | 15:04 |
noonedeadpunk | (or at least I can't explain it otherwise) | 15:05 |
gibi | hm, you are right, the hypervisor one returned all computes from the cluster | 15:06 |
gibi | /os-hypervisors/detail | 15:06 |
gibi | so then I think it would make sense to add a similar api for placement | 15:06 |
gibi | single call that returns all the usages per provider | 15:06 |
gibi | this explains the performance difference | 15:07 |
noonedeadpunk | yeah, and right now it actually depends quite dramatically on the number of providers | 15:08 |
gibi | so I would support adding such an api to placement. I cannot commit to implementing it, but sure, I can review the implementation | 15:08 |
noonedeadpunk | I can only put it in my backlog and hope to get to it one day... | 15:09 |
gibi | noonedeadpunk: and if you want to raise this issue, I think we have dedicated time at the coming PTG for operator feedback | 15:10 |
noonedeadpunk | that is a good idea actually | 15:10 |
gibi | noonedeadpunk: https://etherpad.opendev.org/p/oct2022-ptg-operator-hour-nova | 15:11 |
noonedeadpunk | the first hour does not overlap with anything, so I can join :) | 15:11 |
gibi | cool | 15:11 |
jkulik | you could™ make the calls in parallel at least ;) | 15:17 |
noonedeadpunk | isn't it still a waste of resources? | 15:19 |
noonedeadpunk | and a regression from the operator perspective? | 15:19 |
jkulik | sure. would be helpful to get it in a single call, but at least it reduces the pain of waiting | 15:19 |
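A small sketch of that mitigation with a standard-library thread pool instead of joblib, reusing the connection and the per-provider fetch shape from the loop above; it is still 2N requests, just overlapped, and it assumes the underlying session copes with concurrent GETs:

```python
# Hedged sketch: overlap the per-provider placement calls with a thread pool.
from concurrent.futures import ThreadPoolExecutor


def fetch(rp):
    uuid = rp["uuid"]
    return rp["name"], {
        "inventories": conn.placement.get(
            f"/resource_providers/{uuid}/inventories").json()["inventories"],
        "usages": conn.placement.get(
            f"/resource_providers/{uuid}/usages").json()["usages"],
    }


providers = conn.placement.get(
    "/resource_providers").json()["resource_providers"]
with ThreadPoolExecutor(max_workers=10) as pool:
    stats = dict(pool.map(fetch, providers))
```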
noonedeadpunk | and add some hardware to serve increased load?:) | 15:20 |
noonedeadpunk | but yeah, that's fair | 15:20 |
noonedeadpunk | *fair workaround | 15:20 |
noonedeadpunk | fwiw, even with 10 threads it's still 5 times slower than just asking nova | 15:28 |
noonedeadpunk | maybe I shouldn't have used joblib... | 15:28 |
*** dasm is now known as dasm|off | 21:31 | |
rm_work | hey, i've noticed the OSC seems to sometimes hide the "fault" field on a server show command... but also sometimes it doesn't. anyone know why this is the case? | 22:07 |
rm_work | starting to dig into the code now, but hoping someone has some idea, since the clients are often a bit funky to interpret, lots of magic 😛 | 22:08 |
rm_work | yeah, didn't find anything specifically relating to "fault" in there... this is super weird, I can see the "fault" come back with `--debug` in the response from nova, it just gets filtered out or something before the results are shown | 22:21 |
rm_work | nevermind, figured it out, there's a client patch I didn't know about >_< FML | 22:31 |
clarkb | rm_work: don't leave us hanging. What causes it? | 22:35 |
rm_work | local client patch where someone did something ... ill advised | 22:41 |
rm_work | trying to figure out how to untangle it now | 22:41 |
clarkb | Oh, I see what you mean by client patch now. I thought you meant you found the change in gerrit that did it or something | 22:45 |