gibi | good morning | 07:12 |
---|---|---|
bauzas | gibi: good morning | 07:56 |
bauzas | was mostly off the grid yesterday due to a hell day with customer escalations | 07:57 |
gibi | not much happened upstream yesterday here either | 08:01 |
bauzas | gibi: in case you haven't seen it, i eventually found the problem with my prelude change | 08:10 |
bauzas | one single line was badly indented by one char | 08:10 |
bauzas | but not the one which was returning the error | 08:11 |
bauzas | ... | 08:11 |
gibi | yepp, I +2d the prelude yesterday. thank you for writing it | 08:13 |
kashyap | bauzas: Then how did you discover it? If it's not the line that caused the error :-) | 08:19 |
bauzas | kashyap: I still use my eyes in general | 08:20 |
kashyap | Hehe | 08:20 |
bauzas | automated coding isn't an option yet, I'm afraid | 08:20 |
bauzas | I should ask elon to propose it | 08:20 |
kashyap | bauzas: GitHub has some ideas with their <clears-throat> "Copilot" | 08:23 |
kashyap | bauzas: But, look - https://www.techradar.com/news/github-autopilot-highly-likely-to-introduce-bugs-and-vulnerabilities-report-claims | 08:23 |
kashyap | "researchers discover that nearly 40% of the code suggestions by GitHub’s Copilot tool are erroneous, from a security point of view" | 08:23 |
kashyap | But LOL, as if that's a surprise! | 08:24 |
bauzas | kashyap: you get the same experience with Tesla Full Self Driving, which has been on offer for a while now | 08:24 |
bauzas | the only difference being that a car bug leads to some injury :D | 08:25 |
kashyap | I don't drive; so ignore cars altogether :D I'm more interested in long-distance electric bikes :D | 08:25 |
bauzas | I switched 70% of my drives to be full electric, I'm chasing down the last 30% bits | 08:25 |
bauzas | but that will require hardware upgrade | 08:26 |
bauzas | kashyap: excellent choice, btw. do you have government incentives for this like in France ? | 08:26 |
kashyap | bauzas: I was told there are; but I need to dig in still | 08:28 |
kashyap | bauzas: I got more energized about it after my recent hiking in the Alps. | 08:29 |
bauzas | kashyap: sure, but mountain e-bikes are way different | 08:31 |
kashyap | bauzas: Oh, sure | 08:31 |
kashyap | I was not mixing them up. Just noticing how people were doing long-distance rides there w/ e-bikes made me think about it more | 08:31 |
bauzas | kashyap: https://mobilit.belgium.be/nl/mobiliteit/personenvervoer/fiets ;) | 08:32 |
* bauzas is glad that when clicking the nl button, this didn't go back to the homepage :) | 08:32 | |
kashyap | bauzas: Hehe, thank you. Yeah, NL and FR are treated as first-class citizens on most web pages | 08:33 |
kashyap | bauzas: I knew this existed, just didn't bother to investigate | 08:33 |
* kashyap bookmarks | 08:33 | |
bauzas | kashyap: surprisingly, most of the incentives seem to be only for the Brussels and Wallonia regions | 08:36 |
bauzas | afaicr, Ghent is in Flanders, right? | 08:36 |
kashyap | bauzas: Yes, it is | 08:36 |
* bauzas suggests to relocate :D | 08:36 | |
kashyap | That's very odd, though. | 08:36 |
kashyap | bauzas: LOL, no, thank you | 08:36 |
bauzas | there is a fun fact here | 08:37 |
bauzas | if you want to buy an electric car, the gov gives you a discount of 7k€ if the car is below 45k€ | 08:37 |
kashyap | Nice, that's quite a chunky amount | 08:38 |
bauzas | but, if you live in the Marseille area, the local city gov gives you an extra 5k | 08:38 |
bauzas | so, lots of people are considering ways to declare their primary address as some random Marseille place | 08:39 |
kashyap | Heh | 08:40 |
gibi | those are nice sums indeed | 08:40 |
gibi | here we have a ~5k€ subsidy for full electric cars (and free parking) | 08:43 |
gibi | but for cars < 35k€ | 08:44 |
kashyap | I do hope people are taking advantage of it; I guess it's a win-win | 08:47 |
bauzas | kashyap: we do | 08:48 |
bauzas | we have two cars, one is full electric (a peugeot 208) and one is plugin-hybrid (a skoda superb) | 08:49 |
bauzas | even with the hybrid, which has 40km+ electric range, we try to make 100% of our drives electric | 08:49 |
kashyap | Nice; /me learned of plug-in hybrids for the first time | 08:50 |
bauzas | as a consequence, we use the 208 (which is a compact car) for one-day drives around our region | 08:50 |
bauzas | and we only take the superb for trips above 200 km that require lots of leg space | 08:50 |
bauzas | I now hate refuelling | 08:51 |
bauzas | this is expensive and it stinks | 08:51 |
gibi | bauzas: if you feel the power, then could you please review https://review.opendev.org/c/openstack/placement/+/807014 I think (and CI thinks) it is good and fixes the transaction issue | 08:52 |
bauzas | hopefully next year, we'll change the superb to a new electric vehicle, because we tested long-range trips with intermediary recharges, and this works | 08:52 |
bauzas | gibi: excellent point, I need to look at this one | 08:52 |
bauzas | gibi: I also need to amend the vgpu doc | 08:52 |
gibi | and if you are at placement land then https://review.opendev.org/c/openstack/placement/+/807155 is simple and fixes a bug in consumer_types | 08:53 |
bauzas | gibi: I guess melwitt addressed your excellent concern ? | 08:53 |
songwenping_ | hi team, is there any way to delete a compute node other than nova service-delete? | 08:54 |
gibi | bauzas: yes, she added an independent transaction for re-reading the rp data in the retry loop, and it seems to work well | 08:54 |
bauzas | songwenping_: you shouldn't delete the compute node entries | 08:54 |
bauzas | songwenping_: either the virt driver or the service deletion can do this | 08:54 |
songwenping_ | how does the virt driver work? | 08:55 |
kashyap | songwenping_: The bird's-eye view is this: | 08:57 |
kashyap | nova-api (in coordination with nova-scheduler) --> nova-compute (virt driver) --> launches libvirtd --> launches QEMU | 08:57 |
kashyap | But you have to be more specific for people to answer :) | 08:57 |
songwenping_ | no, i mean: in which scenarios does the virt driver delete the compute node? | 08:59 |
bauzas | songwenping_: sorry, I need to jump off for 30 mins | 09:16 |
bauzas | but basically, the virt driver gives the inventories and the compute nodes to the RT which creates the necessary records | 09:16 |
bauzas | RT : ResourceTracker | 09:16 |
bauzas | as the RT is run by the nova-compute service, you need to delete the service | 09:17 |
bauzas | there is a tight relationship between an RPC service (the nova-compute manager) and the compute node record | 09:17 |
* bauzas needs to drop | 09:18 | |
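
A minimal sketch of what "delete the service" looks like from the API side, assuming openstacksdk and a clouds.yaml entry; the cloud name, host name, and the exact proxy method and query parameter names are recalled from memory rather than verified:

```python
# Hedged sketch: removing a retired compute's nova-compute service record,
# which is what cleans up its compute node record; deleting compute_nodes
# rows directly in the DB is not supported.
import openstack

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry

# Find the nova-compute service record for the host being retired
# (binary/host filters are assumptions about the services() query support).
for svc in conn.compute.services(binary="nova-compute", host="compute-01"):
    conn.compute.delete_service(svc)
```
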
songwenping_ | hi team, does anybody know why placement checks the source node's resources when evacuating a VM? | 11:15 |
sean-k-mooney | how do you mean | 11:19 |
sean-k-mooney | as in, it can include the source host in the set of hosts returned? | 11:19 |
sean-k-mooney | that would be because we do not currently filter out that host, we just filter to up hosts | 11:21 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/scheduler/request_filter.py#L241-L254 | 11:21 |
sean-k-mooney | we do filter hosts by their disabled status however | 11:21 |
sean-k-mooney | so if you had disabled the host you are evacuating from, then it would not be included in the placement query | 11:21 |
sean-k-mooney | songwenping_: the source node will be eliminated by the scheduler after the placement query, so it does not really have a negative impact to include it | 11:22 |
opendevreview | Merged openstack/nova master: Support Cpu Compararion on Aarch64 Platform https://review.opendev.org/c/openstack/nova/+/763928 | 11:34 |
songwenping_ | sean-k-mooney: wait a minute, i am finding the placement code. | 11:41 |
songwenping_ | when we evacuate a vm, placement will run check_capacity_exceeded, https://github.com/openstack/placement/blob/master/placement/objects/allocation.py#L73 | 11:44 |
songwenping_ | https://github.com/openstack/placement/blob/master/placement/objects/allocation.py#L120 contains source node provider id and dest node provider id. | 11:45 |
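
For reference, the linked check boils down to a per-resource-class comparison along these lines; this is a simplified sketch, not the actual placement code:

```python
# Simplified sketch of placement's per-resource-class capacity check; the
# real logic in placement/objects/allocation.py also handles min_unit,
# max_unit and step_size, which are omitted here.
def capacity_exceeded(total, reserved, allocation_ratio, used, requested):
    capacity = (total - reserved) * allocation_ratio
    return used + requested > capacity
```
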
sean-k-mooney | yes | 12:05 |
sean-k-mooney | when we make the allocation candidate request we do not exclude the host we are evacuating from | 12:06 |
sean-k-mooney | ah i see | 12:06 |
sean-k-mooney | this should not have any negative effect | 12:07 |
sean-k-mooney | we are technically checking allocations for one additional host that we don't need placement to consider | 12:08 |
songwenping_ | sometimes there is some garbage data in the allocation table, and this leads to evacuation failing due to this check. | 12:10 |
sean-k-mooney | that should just eliminate the host as an allocation candidate, no? | 12:12 |
sean-k-mooney | have you filed a bug for this? | 12:12 |
sean-k-mooney | placement should not be made aware of evacuation or other lifecycle operations explicitly | 12:12 |
songwenping_ | haven't filed a bug yet. | 12:13 |
sean-k-mooney | so i'm not sure we could proceed with any approach that required modification of placement to make it explicitly aware of evacuation | 12:13 |
sean-k-mooney | but we might be able to handle the error condition with corrupt data | 12:13 |
sean-k-mooney | so that it would not fail | 12:13 |
gibi | songwenping_: during evacuation the source node allocation of the VM is kept and the dest node allocation is added to it | 12:17 |
gibi | so for an evacuated VM you will see allocations on both the source and dest nodes _while_ the source compute is down | 12:18 |
gibi | when the source compute node is recovered it will delete the allocation on the source node for the already evacuated VMs | 12:18 |
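
To illustrate the shape of that doubled allocation: a GET /allocations/{instance_uuid} against placement during this window returns something roughly like the following, where the UUIDs and resource amounts are invented:

```python
# Illustrative only: an evacuated VM's allocations while the source compute
# is still down, keyed by resource provider UUID (all values made up).
evacuated_vm_allocations = {
    "allocations": {
        "source-rp-uuid": {"resources": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 20}},
        "dest-rp-uuid": {"resources": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 20}},
    },
    "consumer_generation": 3,
    "project_id": "project-uuid",
    "user_id": "user-uuid",
}
```
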
sean-k-mooney | gibi: for evacuate, should we not be using the migration uuid for the evacuation | 12:19 |
sean-k-mooney | allocations | 12:19 |
sean-k-mooney | like we do for resize | 12:19 |
songwenping_ | gibi: yeah, this is the right workflow, but placement checks the source node capacity before evacuating. | 12:20 |
gibi | sean-k-mooney: we never switched the evac workflow to migration allocations | 12:21 |
gibi | there is a todo in the code | 12:21 |
sean-k-mooney | ah ok | 12:21 |
sean-k-mooney | i guess that would be the correct way to fix this then | 12:22 |
gibi | songwenping_: do you have a pointer where nova checks the source node capacity during evac? | 12:23 |
sean-k-mooney | it's in placement https://github.com/openstack/placement/blob/master/placement/objects/allocation.py#L120 | 12:23 |
songwenping_ | yes, it's placement checking. | 12:24 |
gibi | what placement does during any kind of allocation update, is to replace the existing allocation of the consumer (VM in this case) with the new requested allocation | 12:25 |
gibi | so if the node is not overallocated then the replace_all should not fail | 12:26 |
gibi | if you got your node overallocated then I guess the allocation update can fail | 12:26 |
gibi | as placement will not allow you to overallocate | 12:27 |
songwenping_ | yes, i mean: why do we check whether the source node is overallocated, since the vm will be evacuated to the dest node? | 12:27 |
gibi | placement does not know these things, what placement sees is a request to update the allocation of a consumer | 12:27 |
gibi | and the old allocation has resources on the source node, the new allocation has resources on the source node and on the dest node | 12:27 |
gibi | placement goes and replaces the whole old allocation with the new allocation | 12:28 |
gibi | if you got your compute overallocated then simply removing the source allocation and adding it back at this step fails | 12:28 |
gibi | sean-k-mooney: btw, moving to migration allocation will not solve this as there we move the VM allocation to the migration allocation and that move will fail for the same reason | 12:29 |
gibi | in short, if you overallocated your compute then placement will reject allocation updates on that compute | 12:29 |
gibi | you need to resolve the overallocation | 12:29 |
sean-k-mooney | it should reject new allocations | 12:29 |
gibi | no | 12:29 |
sean-k-mooney | but if we don't change the resources in a current one it should work, no? | 12:29 |
gibi | placement implements replace_all for allocation update | 12:30 |
gibi | so it is a delete + create in the same transaction | 12:30 |
gibi | but after the delete the compute is full | 12:30 |
gibi | so create fails | 12:30 |
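
A self-contained toy model of that delete + create behaviour, showing why an already overallocated provider rejects even an unchanged source allocation; the numbers and names below are invented, not real placement code:

```python
# Toy model of placement's "replace all" allocation update: the consumer's
# old allocation is dropped and the new one re-checked against capacity.
def capacity(inv):
    return (inv["total"] - inv["reserved"]) * inv["allocation_ratio"]

def replace_all(usage, inventory, consumer, new_amount):
    usage = dict(usage)
    usage.pop(consumer, None)                  # old allocation is deleted...
    if sum(usage.values()) + new_amount > capacity(inventory):
        raise ValueError("over capacity")      # ...and the create re-checks
    usage[consumer] = new_amount
    return usage

inventory = {"total": 8, "reserved": 0, "allocation_ratio": 1.0}
usage = {"vm-1": 6, "vm-2": 4}   # provider is already overallocated (10 > 8)

try:
    # Re-submitting vm-1's unchanged allocation of 6 still fails:
    replace_all(usage, inventory, "vm-1", 6)
except ValueError as exc:
    print(exc)   # "over capacity", even though vm-1's allocation did not change
```
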
songwenping_ | on our production env, for some unknown reason a vm can end up with allocations on two resource providers. | 12:30 |
gibi | sean-k-mooney: I mean, I get that it would be nice to detect that the source node allocation does not change during the update and not delete + re-create it, but placement does not do that logic today | 12:33 |
sean-k-mooney | well i was hoping that a simple uuid update would not trigger this check | 12:34 |
gibi | sean-k-mooney: there is no way to update a consumer uuid | 12:34 |
gibi | sean-k-mooney: you update the allocation of a consumer or you create / delete consumers | 12:35 |
gibi | there is no rename consumer | 12:35 |
gibi | and there is no partial allocation update | 12:35 |
gibi | just total one | 12:35 |
sean-k-mooney | ack | 12:35 |
gibi | probably the easiest thing is to implement rename consumer, the partial allocation update feels hard | 12:36 |
sean-k-mooney | ya, so we would move the source allocation to the migration uuid and create a new allocation for the vm using its uuid | 12:37 |
gibi | then we can reimplement the allocation move from VM -> migration with the rename | 12:37 |
sean-k-mooney | then have the dest delete the migration allocation after evac | 12:37 |
sean-k-mooney | which will avoid leaking the allocation in placement if the source compute never comes back | 12:38 |
sean-k-mooney | well actually no | 12:38 |
sean-k-mooney | we have to be careful | 12:38 |
sean-k-mooney | to make sure if the evac fails we can just evac again | 12:38 |
sean-k-mooney | so the dest vm needs to have the migration uuid | 12:38 |
gibi | yeah, I feel there is a reason why we kept the source allocation for the source compute to clean up | 12:38 |
sean-k-mooney | until it succeeds, then we can remove the source vm allocation and rename the migration allocation | 12:39 |
gibi | having the migration uuid to allocate on the test is a surgery as today we just call the scheduler and that always uses the instance uuid to allocate | 12:39 |
gibi | s/test/dest/ | 12:39 |
gibi | all the moves are using the migration uuid on the dest so the scheduler doesn't have to be branched for moves | 12:40 |
gibi | sorry on the source | 12:40 |
sean-k-mooney | ok | 12:40 |
sean-k-mooney | well we can just use our existing pattern | 12:40 |
sean-k-mooney | but rename would make it simpler | 12:40 |
gibi | rename would be needed to solve the above placement-reject-evac-as-source-is-overallocated issue | 12:41 |
sean-k-mooney | it might also be useful for blazar | 12:41 |
sean-k-mooney | this would obviously be an API change, right | 12:42 |
gibi | yepp | 12:42 |
sean-k-mooney | technically there is no field change and we are just changing from a 400 to a 200 | 12:42 |
sean-k-mooney | but i assume that still needs a microversion bump | 12:42 |
sean-k-mooney | so not backportable? | 12:43 |
gibi | as the 400 wasn't caused by a bug, transforming it to a 200 is a microversion bump | 12:43 |
sean-k-mooney | ok so i don't really see a way to fix this in code for existing releases then | 12:44 |
sean-k-mooney | operators will just need to fix the RP inventories | 12:44 |
sean-k-mooney | e.g. set capacity to max int or something | 12:44 |
sean-k-mooney | the compute node would fix it when it started back up | 12:44 |
gibi | basically the operator needs to resolve the overallocation | 12:45 |
sean-k-mooney | but while its down you can use osc to manually update it | 12:45 |
gibi | either by deleting allocations or by increasing inventory | 12:45 |
sean-k-mooney | gibi: right but if the host is down they cant really do deletes | 12:45 |
gibi | ture | 12:45 |
gibi | true | 12:45 |
gibi | then change the inventory via OSC | 12:45 |
gibi | then | 12:46 |
sean-k-mooney | yep | 12:46 |
gibi | that is the way | 12:46 |
gibi | and also investigate how you ended up in overallocation | 12:46 |
gibi | as placement should not allow that | 12:46 |
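
If osc-placement is not at hand, the same inventory bump can be done against the placement API directly; here is a rough sketch using keystoneauth1, where the endpoint, credentials, provider UUID and the chosen microversion are placeholders rather than anything from the log:

```python
# Hedged sketch: raise a provider's VCPU allocation_ratio directly in
# placement so a stuck operation can proceed; the compute service will
# rewrite the inventory from its own config once it comes back up.
from keystoneauth1 import adapter, session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url="http://keystone.example:5000/v3",
                   username="admin", password="secret",
                   project_name="admin", user_domain_name="Default",
                   project_domain_name="Default")
placement = adapter.Adapter(session=session.Session(auth=auth),
                            service_type="placement",
                            default_microversion="1.26")

rp = "935b9ad6-d7d1-4b5a-bb49-022acbba7c72"   # the over-capacity provider
body = placement.get(f"/resource_providers/{rp}/inventories").json()
body["inventories"]["VCPU"]["allocation_ratio"] *= 2.0
# PUT takes the same shape as GET, including resource_provider_generation,
# which guards against concurrent updates.
placement.put(f"/resource_providers/{rp}/inventories", json=body)
```
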
sean-k-mooney | it normally happens if you change things like cpu_dedicated_set or the amount of hugepages etc. | 12:47 |
sean-k-mooney | or actually more commonly the ram/disk/cpu allocation ratios | 12:48 |
sean-k-mooney | i'm sure there are other ways too but i have most often seen it due to operators changing config such that the current vms no longer fit | 12:49 |
gibi | hm, maybe we should add a WARNING for the compute log / placement log if there is overallocation detected so the admin will detect the misconfiguration | 12:50 |
sean-k-mooney | to the periodic | 12:50 |
sean-k-mooney | update_available_resource, when we recalculate the placement update | 12:51 |
sean-k-mooney | ya we could | 12:51 |
sean-k-mooney | i'm not sure how spammy that would be but it does indicate they might need to heal allocations or otherwise investigate why | 12:52 |
gibi | sean-k-mooney: actually placement already has a warning | 14:07 |
gibi | sean-k-mooney: "WARNING placement.objects.resource_provider [None req-6f2253b9-a195-4bf9-8c7e-2a32271a8c0c admin admin] Resource provider 935b9ad6-d7d1-4b5a-bb49-022acbba7c72 is now over-capacity for VCPU" | 14:07 |
gibi | when I set the allocation ratio lower to induce overallocation | 14:07 |
opendevreview | Balazs Gibizer proposed openstack/placement master: DNM: extra logs to troubleshoot overcapacity https://review.opendev.org/c/openstack/placement/+/808083 | 14:15 |
sean-k-mooney | ah nice | 15:33 |
*** abhishekk is now known as abhishekk|afk | 15:45 | |
sean-k-mooney | although if the compute agent is down you might not see that | 15:47 |
gibi | elodilles_pto: there is probably a stable-only bug here https://bugs.launchpad.net/nova/+bug/1941819 but as it probably only affects stein and older, which are in EM, I don't think I will spend time fixing it. Maybe the bug author can try it. | 17:03 |
*** efried1 is now known as efried | 17:30 | |
*** akekane__ is now known as abhishekk | 17:39 | |
legochen | hey nova experts, one question - can someone point me to the best practice for configuring a nova-scheduler filter to distribute VMs to user-specified cabinets? | 18:40 |
dansmith | user-specified, meaning "at boot a user specifies where this should go" ? | 18:41 |
dansmith | or did you mean user-specific? | 18:41 |
legochen | user-specified, meaning "at boot a user specifies where this should go" ? <= yes | 18:42 |
dansmith | in general this is not a thing nova allows or intends to allow, with one exception: AZs | 18:42 |
dansmith | users can choose AZs, so if you want them to be able to choose, make AZs for them to specify | 18:42 |
dansmith | you can do an AZ per site, or per aisle, or per rack or something | 18:43 |
legochen | For example, I have multiple cabinets in the data center; users want to distribute their VMs across different cabinets equally in order to avoid a SPOF of the ToR switch or power stuff. | 18:43 |
dansmith | that's what AZs are for | 18:44 |
legochen | hmm, an AZ per cabinet seems not that reasonable to me :( | 18:44 |
legochen | I was thinking of configuring an aggregate per cabinet, and setting a property - cabinet=A for aggregate A, cabinet=B for aggregate B... then users could specify --hint cabinet=A while creating a VM. | 18:49 |
dansmith | that's what AZs are for | 18:49 |
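
For completeness, the user-facing side of the AZ approach is just the normal server-create request; below is a hypothetical fragment where the operator has exposed one AZ per cabinet (all names and UUIDs are invented):

```python
# Hypothetical server-create request body: the user picks an AZ that the
# operator has mapped to a cabinet via a host aggregate; no custom scheduler
# filter or scheduler hint is needed for this.
boot_request = {
    "server": {
        "name": "app-01",
        "imageRef": "image-uuid",
        "flavorRef": "flavor-id",
        "availability_zone": "cabinet-A",
        "networks": [{"uuid": "net-uuid"}],
    }
}
```
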
opendevreview | xiaoxin yang proposed openstack/nova master: Secure boot requires SMM feature enabled https://review.opendev.org/c/openstack/nova/+/808126 | 19:18 |
*** slaweq1 is now known as slaweq | 19:19 |