*** sapd1_x has quit IRC | 00:07 | |
*** spatel has joined #openstack-nova | 00:08 | |
*** slaweq has joined #openstack-nova | 00:11 | |
*** slaweq has quit IRC | 00:15 | |
*** lbragstad has quit IRC | 00:23 | |
*** hamzy has joined #openstack-nova | 00:27 | |
*** brinzhang has joined #openstack-nova | 00:32 | |
*** spatel has quit IRC | 00:34 | |
*** bhagyashris has joined #openstack-nova | 00:50 | |
*** takashin has joined #openstack-nova | 00:51 | |
*** brinzhang has quit IRC | 00:55 | |
*** bhagyashris has quit IRC | 00:59 | |
*** _alastor_ has quit IRC | 01:12 | |
*** brinzhang has joined #openstack-nova | 01:18 | |
*** spatel has joined #openstack-nova | 01:30 | |
*** spatel has quit IRC | 01:30 | |
openstackgerrit | Merged openstack/nova master: hacking: Resolve W503 (line break occurred before a binary operator) https://review.opendev.org/651555 | 01:31 |
openstackgerrit | Merged openstack/nova master: hacking: Resolve E741 (ambiguous variable name) https://review.opendev.org/652103 | 01:31 |
*** mgoddard has quit IRC | 01:40 | |
*** mgoddard has joined #openstack-nova | 01:48 | |
*** yedongcan has joined #openstack-nova | 01:53 | |
openstackgerrit | Yongli He proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | 01:54 |
openstackgerrit | Yongli He proposed openstack/nova master: Clean up orphan instances virt driver https://review.opendev.org/648912 | 01:57 |
openstackgerrit | Yongli He proposed openstack/nova master: clean up orphan instances https://review.opendev.org/627765 | 01:57 |
*** igordc has quit IRC | 01:58 | |
*** Dinesh_Bhor has joined #openstack-nova | 02:05 | |
*** slaweq has joined #openstack-nova | 02:11 | |
*** slaweq has quit IRC | 02:16 | |
openstackgerrit | Bhagyashri Shewale proposed openstack/nova master: Ignore root_gb for BFV in simple tenant usage API https://review.opendev.org/612626 | 02:27 |
*** bhagyashris has joined #openstack-nova | 02:33 | |
*** hongbin has joined #openstack-nova | 02:44 | |
openstackgerrit | Alex Xu proposed openstack/nova master: Correct the comment of RequestSpec's network_metadata https://review.opendev.org/667061 | 02:44 |
*** cfriesen has quit IRC | 02:52 | |
*** ricolin has joined #openstack-nova | 03:01 | |
*** takashin has left #openstack-nova | 03:02 | |
*** whoami-rajat has joined #openstack-nova | 03:04 | |
*** hongbin has quit IRC | 03:18 | |
*** mkrai__ has joined #openstack-nova | 03:30 | |
*** Dinesh_Bhor has quit IRC | 03:33 | |
*** psachin has joined #openstack-nova | 03:35 | |
*** udesale has joined #openstack-nova | 03:51 | |
*** hongbin has joined #openstack-nova | 03:54 | |
*** slaweq has joined #openstack-nova | 04:11 | |
*** hongbin has quit IRC | 04:12 | |
*** slaweq has quit IRC | 04:16 | |
*** jhesketh has quit IRC | 04:19 | |
*** jhesketh has joined #openstack-nova | 04:19 | |
*** dave-mccowan has quit IRC | 04:23 | |
*** mkrai__ has quit IRC | 04:27 | |
*** mkrai__ has joined #openstack-nova | 04:29 | |
*** shilpasd has joined #openstack-nova | 04:29 | |
*** ratailor has joined #openstack-nova | 04:58 | |
openstackgerrit | Yongli He proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | 05:22 |
*** konetzed has quit IRC | 05:28 | |
*** ivve has joined #openstack-nova | 05:36 | |
*** mmethot has quit IRC | 05:46 | |
*** mgoddard has quit IRC | 05:54 | |
*** shilpasd has quit IRC | 05:56 | |
*** Bidwe_jay65 has quit IRC | 05:56 | |
*** lpetrut has joined #openstack-nova | 05:56 | |
*** lpetrut has quit IRC | 05:57 | |
*** lpetrut has joined #openstack-nova | 05:57 | |
*** tetsuro has joined #openstack-nova | 05:59 | |
*** mgoddard has joined #openstack-nova | 06:00 | |
*** dpawlik has joined #openstack-nova | 06:05 | |
*** slaweq has joined #openstack-nova | 06:11 | |
*** ratailor has quit IRC | 06:14 | |
*** slaweq has quit IRC | 06:16 | |
*** artom has joined #openstack-nova | 06:16 | |
*** artom is now known as artom|gmtplus3 | 06:16 | |
*** ratailor has joined #openstack-nova | 06:21 | |
*** pcaruana has joined #openstack-nova | 06:26 | |
*** belmoreira has joined #openstack-nova | 06:33 | |
*** rdopiera has joined #openstack-nova | 06:35 | |
*** mkrai__ has quit IRC | 06:39 | |
*** mkrai_ has joined #openstack-nova | 06:39 | |
openstackgerrit | Merged openstack/nova master: Remove comments about mirroring changes to nova/cells/messaging.py https://review.opendev.org/667107 | 06:43 |
openstackgerrit | Merged openstack/nova master: Drop source node allocations if finish_resize fails https://review.opendev.org/654067 | 06:43 |
*** slaweq has joined #openstack-nova | 06:44 | |
*** belmoreira has quit IRC | 06:45 | |
*** luksky has joined #openstack-nova | 06:45 | |
*** belmoreira has joined #openstack-nova | 06:47 | |
*** maciejjozefczyk has joined #openstack-nova | 06:50 | |
*** rdopiera has quit IRC | 06:52 | |
*** rdopiera has joined #openstack-nova | 06:52 | |
openstackgerrit | Brin Zhang proposed openstack/python-novaclient master: Microversion 2.74: Support Specifying AZ to unshelve https://review.opendev.org/665136 | 06:52 |
*** bhagyashris has quit IRC | 07:01 | |
*** rcernin has quit IRC | 07:06 | |
*** belmoreira has quit IRC | 07:13 | |
*** belmoreira has joined #openstack-nova | 07:14 | |
*** tesseract has joined #openstack-nova | 07:16 | |
brinzhang | efried: Are you around? | 07:29 |
*** belmoreira has quit IRC | 07:34 | |
*** ccamacho has joined #openstack-nova | 07:37 | |
*** tetsuro has quit IRC | 07:40 | |
*** rajinir has quit IRC | 07:45 | |
*** damien_r has joined #openstack-nova | 07:46 | |
*** belmoreira has joined #openstack-nova | 07:48 | |
*** ttsiouts has joined #openstack-nova | 07:48 | |
*** ralonsoh has joined #openstack-nova | 07:49 | |
*** psachin has quit IRC | 07:55 | |
*** tetsuro has joined #openstack-nova | 07:58 | |
*** tssurya has joined #openstack-nova | 08:00 | |
*** ttsiouts has quit IRC | 08:00 | |
*** ttsiouts has joined #openstack-nova | 08:01 | |
*** ttsiouts has quit IRC | 08:05 | |
*** ttsiouts has joined #openstack-nova | 08:06 | |
*** tkajinam has quit IRC | 08:16 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: pull out functions from _heal_allocations_for_instance https://review.opendev.org/655457 | 08:25 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: reorder conditions in _heal_allocations_for_instance https://review.opendev.org/655458 | 08:25 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Prepare _heal_allocations_for_instance for nested allocations https://review.opendev.org/637954 | 08:25 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: pull out put_allocation call from _heal_* https://review.opendev.org/655459 | 08:25 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: nova-manage: heal port allocations https://review.opendev.org/637955 | 08:25 |
*** tetsuro has quit IRC | 08:28 | |
*** Luzi has joined #openstack-nova | 08:31 | |
*** dtantsur|afk is now known as dtantsur|mtg | 08:37 | |
*** tesseract has quit IRC | 08:38 | |
*** tesseract has joined #openstack-nova | 08:40 | |
*** rpittau|afk is now known as rpittau|mtg | 08:41 | |
*** imacdonn has quit IRC | 08:42 | |
*** imacdonn has joined #openstack-nova | 08:43 | |
*** rcernin has joined #openstack-nova | 08:46 | |
*** ociuhandu has joined #openstack-nova | 08:47 | |
*** mdbooth has quit IRC | 09:02 | |
openstackgerrit | Surya Seetharaman proposed openstack/nova master: Grab fresh power state info from the driver https://review.opendev.org/665975 | 09:04 |
*** ociuhandu has quit IRC | 09:04 | |
*** rcernin has quit IRC | 09:07 | |
*** jaosorior has quit IRC | 09:22 | |
*** jaosorior has joined #openstack-nova | 09:24 | |
openstackgerrit | Boxiang Zhu proposed openstack/nova master: Update AZ admin doc to mention the new way to specify hosts https://review.opendev.org/666767 | 09:29 |
*** mdbooth has joined #openstack-nova | 09:32 | |
kashyap | Does anyone know of an existing bug in the Gate, where the "tempest-slow-py3" job is failing with: | 09:32 |
kashyap | tempest.exceptions.BuildErrorException: Server 008c5c50-ff54-49f4-adb0-23775e8af5f1 failed to build and is in ERROR status | 09:32 |
kashyap | Details: {'code': 500, 'created': '2019-06-25T20:55:49Z', 'message': 'Unexpected vif_type=binding_failed'} | 09:32 |
kashyap | http://logs.openstack.org/89/667389/1/check/tempest-slow-py3/2606bcc/testr_results.html.gz | 09:32 |
kashyap | [That is a stable/stein backport] | 09:32 |
kashyap | Okay, I see timeouts (also in the stable/rocky backport) in the 'testr_results'. /me goes to 'recheck' | 09:37 |
*** psachin has joined #openstack-nova | 09:39 | |
*** xek has joined #openstack-nova | 09:55 | |
*** ratailor_ has joined #openstack-nova | 10:01 | |
*** ratailor has quit IRC | 10:04 | |
*** ociuhandu has joined #openstack-nova | 10:06 | |
*** artom|gmtplus3 has quit IRC | 10:06 | |
*** ttsiouts has quit IRC | 10:10 | |
*** ttsiouts has joined #openstack-nova | 10:10 | |
*** artom has joined #openstack-nova | 10:10 | |
*** liuyulong has joined #openstack-nova | 10:14 | |
*** ttsiouts has quit IRC | 10:15 | |
*** brinzhang has quit IRC | 10:18 | |
*** luksky has quit IRC | 10:28 | |
*** mkrai_ has quit IRC | 10:29 | |
*** mkrai_ has joined #openstack-nova | 10:29 | |
*** mkrai_ has quit IRC | 10:31 | |
*** mkrai__ has joined #openstack-nova | 10:31 | |
*** dpawlik has quit IRC | 10:37 | |
*** dpawlik has joined #openstack-nova | 10:38 | |
*** davidsha has joined #openstack-nova | 10:40 | |
*** brinzhang has joined #openstack-nova | 10:41 | |
*** dpawlik has quit IRC | 10:42 | |
*** sapd1_x has joined #openstack-nova | 10:42 | |
*** dpawlik has joined #openstack-nova | 10:45 | |
*** liuyulong has quit IRC | 10:47 | |
*** dpawlik has quit IRC | 10:50 | |
*** dpawlik has joined #openstack-nova | 10:53 | |
*** Bidwe_jay has joined #openstack-nova | 10:57 | |
*** mkrai__ has quit IRC | 10:58 | |
*** mkrai_ has joined #openstack-nova | 10:58 | |
*** luksky has joined #openstack-nova | 10:58 | |
*** dpawlik has quit IRC | 11:00 | |
*** dpawlik has joined #openstack-nova | 11:01 | |
*** sapd1_x has quit IRC | 11:02 | |
NewBruce | Hey kashyap | 11:06 |
NewBruce | so good news, i didn't try to mess around with xml, instead just used SELinux ;) but not sure if you can help out on this one - migrations are failing with | 11:07 |
NewBruce | Live Migration failure: Library function returned error but did not set virError: libvirtError: Library function returned error but did not set vir | 11:07 |
NewBruce | digging into the libvirt logs - | 11:08 |
NewBruce | 2019-06-26 09:46:22.816+0000: 30621: error : virNetClientStreamRaiseError:200 : stream had I/O failure | 11:08 |
NewBruce | 2019-06-26 09:46:23.190+0000: 19228: error : virNetClientProgramDispatchError:177 : internal error: qemu unexpectedly closed the monitor: 2019-06-26T09:46:22.815029Z qemu-kvm: Failed to load PCIDevice:config | 11:08 |
NewBruce | 2019-06-26T09:46:22.815033Z qemu-kvm: Failed to load virtio-net:virtio | 11:08 |
NewBruce | 2019-06-26T09:46:22.815036Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net' | 11:08 |
NewBruce | thing is, from the control side of things, the migration completed - no errors or anything are returned… also, i could migrate fine between these hosts before i added SELinux, and it (rarely) works to migrate a machine… i'm lost as to whether it's a libvirt or nova issue at this point - thoughts? | 11:09 |
NewBruce | the osc reports life is peachy : openstack server migrate --live cc-compute28-sto2 aadfe56a-88b8-49c0-9dac-41a4c494c1b5 --wait | 11:10 |
NewBruce | Progress: 97Complete | 11:10 |
NewBruce | but nova never gets to post-migration, and i don't think it's actually doing the migration itself - on the source | 11:11 |
NewBruce | Took 2.35 seconds for pre_live_migration on destination host cc-compute26-sto2. | 11:11 |
NewBruce | Migration running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) | 11:11 |
NewBruce | Migration operation has aborted | 11:11 |
*** ttsiouts has joined #openstack-nova | 11:13 | |
*** ociuhandu has quit IRC | 11:23 | |
*** ociuhandu has joined #openstack-nova | 11:23 | |
*** tbachman has joined #openstack-nova | 11:27 | |
*** mkrai_ has quit IRC | 11:31 | |
*** shilpasd has joined #openstack-nova | 11:31 | |
*** shilpasd10 has joined #openstack-nova | 11:31 | |
*** shilpasd10 has quit IRC | 11:32 | |
*** shilpasd63 has joined #openstack-nova | 11:33 | |
*** sean-k-mooney has joined #openstack-nova | 11:43 | |
*** ratailor_ has quit IRC | 11:48 | |
*** ivve has quit IRC | 11:51 | |
*** udesale has quit IRC | 11:51 | |
*** udesale has joined #openstack-nova | 11:52 | |
*** eharney has quit IRC | 11:55 | |
*** _erlon_ has joined #openstack-nova | 11:59 | |
*** _alastor_ has joined #openstack-nova | 12:00 | |
*** luksky has quit IRC | 12:02 | |
*** luksky has joined #openstack-nova | 12:03 | |
*** francoisp has joined #openstack-nova | 12:04 | |
*** _alastor_ has quit IRC | 12:04 | |
*** mdbooth has quit IRC | 12:11 | |
*** ttsiouts has quit IRC | 12:13 | |
*** ttsiouts has joined #openstack-nova | 12:13 | |
*** lbragstad has joined #openstack-nova | 12:17 | |
*** ttsiouts has quit IRC | 12:18 | |
*** ttsiouts has joined #openstack-nova | 12:20 | |
*** mdbooth has joined #openstack-nova | 12:21 | |
*** martinkennelly has joined #openstack-nova | 12:23 | |
*** mriedem has joined #openstack-nova | 12:25 | |
mriedem | lyarwood: is your plan for https://bugs.launchpad.net/nova/+bug/1832248 to get https://review.opendev.org/#/c/664418/ released and then bump the lower constraint dependency from nova to os-brick and then consider the nova bug fixed? | 12:27 |
openstack | Launchpad bug 1832248 in OpenStack Compute (nova) "tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume failing when using the Q35 machine type" [Undecided,New] | 12:27 |
*** shilpasd63 has quit IRC | 12:30 | |
*** shilpasd80 has joined #openstack-nova | 12:31 | |
alex_xu | mriedem: hope we answered all your questions https://review.opendev.org/601596, looking for one more +2 :) | 12:32 |
alex_xu | johnthetubaguy: ^ hope you around, the vpmem spec is in good shape | 12:33 |
lyarwood | mriedem: no the nova bug is seperate, the os-brick change just works around the underlying QEMU issue Nova is hitting with q35 | 12:34 |
*** Luzi has quit IRC | 12:35 | |
mriedem | lyarwood: oh ok | 12:36 |
mriedem | alex_xu: ack, i still need to read all of the replies... | 12:36 |
alex_xu | mriedem: hah, i see, a lot | 12:37 |
*** brinzhang has quit IRC | 12:38 | |
*** brinzhang has joined #openstack-nova | 12:39 | |
alex_xu | mriedem: also, there is the code for reference https://review.opendev.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/virtual-persistent-memory; although it is in merge conflict, it is still good to see what we probably are going to change in the code | 12:39 |
*** dpawlik has quit IRC | 12:40 | |
sean-k-mooney | mriedem: can you take a look at https://review.opendev.org/#/c/667264/ its a osc change for force down. you sent a mail to the list about droping computenode host/service id compat code and im wondering if that is related or not | 12:42 |
mriedem | i think my biggest hangups were on the (1) flavor extra spec definition which was a bit hard to parse from a user perspective in my opinion and (2) the questions about the new data model and versioned object which were very similar to a BDM but i realize we don't want to re-use BDMs for this | 12:43 |
mriedem | sean-k-mooney: different issue | 12:44 |
mriedem | sean-k-mooney: before 2.53 you had to call a force-down route, with 2.53 you just call the normal PUT route | 12:44 |
openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Add mising tests for flavor extra_specs mv 2.61 https://review.opendev.org/667600 | 12:44 |
mriedem | https://developer.openstack.org/api-ref/compute/#update-forced-down for <2.53 | 12:44 |
sean-k-mooney | right i saw that | 12:44 |
mriedem | https://developer.openstack.org/api-ref/compute/#update-compute-service >=2.53 | 12:44 |
sean-k-mooney | what i was concerned about is the new form uses service id | 12:45 |
mriedem | with 2.53 the service_id in the API is a uuid | 12:45 |
openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Add missing tests for flavor extra_specs mv 2.61 https://review.opendev.org/667600 | 12:45 |
sean-k-mooney | the old used host and binary name | 12:45 |
mriedem | service_id is the uuid of the service, it's fine | 12:45 |
sean-k-mooney | and i was not clear what you were proposing dropping in the mail | 12:45 |
mriedem | it's unrelated to the relationship between compute nodes and services | 12:45 |
sean-k-mooney | ok | 12:45 |
mriedem | see all of the notes/todos around ComputeNode.service_id in the code | 12:45 |
mriedem | and ComputeNode.host | 12:46 |
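The two request shapes mriedem contrasts (the pre-2.53 force-down route vs the 2.53+ plain update route, per the api-ref links above) can be sketched as a small helper. The helper itself is only an illustration; the routes, header, and body keys follow the linked compute API docs:

```python
def force_down_request(microversion, host=None, binary=None,
                       service_uuid=None, down=True):
    """Return (method, path, headers, body) for forcing a service down.

    Before 2.53 the dedicated force-down route identifies the service
    by host + binary; from 2.53 on, the normal service update route
    takes the service uuid and a simple body.
    """
    mv = tuple(int(p) for p in microversion.split("."))
    headers = {"X-OpenStack-Nova-API-Version": microversion}
    if mv < (2, 53):
        return ("PUT", "/os-services/force-down", headers,
                {"host": host, "binary": binary, "forced_down": down})
    return ("PUT", "/os-services/%s" % service_uuid, headers,
            {"forced_down": down})
```

Note the microversion is compared as a (major, minor) tuple, since a plain string compare would mis-order e.g. "2.9" and "2.53".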
lyarwood | mriedem: https://review.opendev.org/#/c/457886/ - btw, would you mind taking a look at this if you have time this week. | 12:46 |
sean-k-mooney | mriedem: ok im reading them now thanks | 12:46 |
mriedem | lyarwood: sure | 12:47 |
lyarwood | thanks | 12:47 |
*** dave-mccowan has joined #openstack-nova | 12:52 | |
*** dpawlik has joined #openstack-nova | 12:55 | |
*** mmethot has joined #openstack-nova | 12:57 | |
openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Add missing tests for flavor extra_specs mv 2.61 https://review.opendev.org/667600 | 13:00 |
*** xek_ has joined #openstack-nova | 13:05 | |
*** xek has quit IRC | 13:07 | |
bauzas | mriedem: FWIW, I need to reload a shit ton of context from Kilo before replying to you but I saw your email | 13:08 |
bauzas | mriedem: because I wonder if we need a major version bump for the ComputeNode object | 13:09 |
*** mmethot is now known as mmethot|brb | 13:10 | |
mriedem | bauzas: i wondered about that as well but figured it wasn't required | 13:15 |
openstackgerrit | Martin Midolesov proposed openstack/nova master: Implementing graceful shutdown. https://review.opendev.org/666245 | 13:15 |
*** eharney has joined #openstack-nova | 13:15 | |
mriedem | i think we've only ever bumped the major version on an object and that's when dansmith did Instance v2.0 | 13:16 |
mriedem | i don't remember the details of how complicated it was but i'm pretty sure i'd screw it up if i tried to do it myself | 13:16 |
*** rajinir has joined #openstack-nova | 13:18 | |
bauzas | mriedem: yeah I need to remember why I was thinking about that by Kilo time | 13:24 |
mdbooth | stephenfin or sean-k-mooney: https://review.opendev.org/#/c/663382/4/nova/compute/manager.py Not my area of expertise, but would the prior call to _deallocate_network not mean that neutron would no longer return this stuff? | 13:28 |
mriedem | sean-k-mooney: i've replied on https://review.opendev.org/#/c/667264/2 with what i think they should do in the 2.53 case, | 13:28 |
mriedem | whether or not novaclient has all of the plumbing they need i haven't checked | 13:28 |
*** eharney has quit IRC | 13:29 | |
sean-k-mooney | mriedem: thanks osc is not what i normally review but since they asked me to take a look i said i would review | 13:29 |
*** mmethot|brb is now known as mmethot | 13:29 | |
mriedem | sean-k-mooney: mdbooth: also commented in https://review.opendev.org/#/c/663382/4 | 13:31 |
*** spatel has joined #openstack-nova | 13:31 | |
sean-k-mooney | maybe, i'm looking. we could probably use try_deallocate_networks there too | 13:31 |
mdbooth | mriedem: Ooh, I'd forgotten that gem. | 13:31 |
mriedem | mdbooth: what? force_refresh? | 13:33 |
dansmith | mriedem: correct, and yes, it's complicated | 13:33 |
mriedem | you don't want to use that in this case | 13:33 |
mriedem | mdbooth: because force_refresh only goes back to stein and i'm guessing you want to backport this further than that | 13:33 |
mdbooth | mriedem: Ack. | 13:33 |
*** shilpasd80 has quit IRC | 13:34 | |
*** spatel has quit IRC | 13:36 | |
kashyap | Anyone else seeing stable/stein failures with the 'tempest-slow-py3' job? | 13:43 |
kashyap | http://logs.openstack.org/89/667389/1/check/tempest-slow-py3/2606bcc/testr_results.html.gz | 13:44 |
mriedem | kashyap: yes known issue | 13:44 |
mriedem | https://review.opendev.org/#/c/667216 | 13:44 |
kashyap | Ah, thanks. I didn't want to mindlessly do 'recheck' | 13:45 |
*** yedongcan has quit IRC | 13:45 | |
sean-k-mooney | mdbooth: deallocate_network deletes the neutron ports that were auto-allocated by nova, so yes we probably should move that to the end of the function since it clears the network info cache https://opendev.org/openstack/nova/src/branch/master/nova/network/neutronv2/api.py#L1603-L1604 | 13:46 |
*** mloza has quit IRC | 13:47 | |
*** eharney has joined #openstack-nova | 13:48 | |
*** eharney has quit IRC | 13:48 | |
*** eharney has joined #openstack-nova | 13:49 | |
*** mlavalle has joined #openstack-nova | 13:54 | |
*** belmoreira has quit IRC | 13:55 | |
openstackgerrit | Lee Yarwood proposed openstack/nova master: libvirt: Add a rbd_connect_timeout configurable https://review.opendev.org/667421 | 13:56 |
mriedem | sean-k-mooney: i left some more comments/questions in that one and added some vmware and zvm driver devs | 14:00 |
sean-k-mooney | mriedem: it looks like https://review.opendev.org/#/c/660761/8 is trying to fix the same or a similar bug | 14:01 |
*** belmoreira has joined #openstack-nova | 14:02 | |
*** brinzhang has quit IRC | 14:03 | |
sean-k-mooney | mriedem: if we delete while building there is a second race which causes us to not clean up the vif | 14:03 |
*** brinzhang has joined #openstack-nova | 14:03 | |
mriedem | that's amorin's fix yes | 14:04 |
mriedem | which is different from stephenfin's which is handling a failure while building | 14:04 |
sean-k-mooney | e.g. if the vm has spawned but we get the delete before we update the instance state in the db, we raise an exception, which is what causes us to not clean them up | 14:04 |
mriedem | and amorin was just in here the other day saying he had a similar issue there | 14:04 |
sean-k-mooney | mriedem: no, stephen's issue was a failure caused when you delete while building | 14:05 |
*** jistr is now known as jistr|call | 14:05 | |
sean-k-mooney | specifically for the customer it was caused because one of the instances in their heat stack failed to build, and that caused all of the instances to be deleted | 14:05 |
sean-k-mooney | mriedem: i think amorin's bug is a duplicate of stephen's, but i'm not sure it would fix it in all cases, as in stephen's edge case we never call destroy | 14:07 |
sean-k-mooney | well maybe they are both bugs, i did not fully review their bug in detail | 14:08 |
mriedem | as i said amorin said he still has an issue which stephenfin's patch might resolve | 14:12 |
mriedem | amorin said he was going to try and recreate and use stephen's patch to test it | 14:12 |
sean-k-mooney | ya, i think on reading their bug both would be needed | 14:12 |
sean-k-mooney | mriedem: amorin is fixing the fact that we might be using an outdated network_info object from the instance, and stephenfin is fixing that if we fail due to the db update we never even try to clean up the vifs | 14:13 |
sean-k-mooney | so to fix the downstream bug we will need to backport both. | 14:14 |
sean-k-mooney | ok, this makes more sense to me now. | 14:14 |
*** _alastor_ has joined #openstack-nova | 14:15 | |
amorin | hey all | 14:20 |
amorin | the bug I faced 2 days ago was not fixed by stephenfin patch | 14:22 |
*** jistr|call is now known as jistr | 14:22 | |
amorin | I found that it was something else in our code | 14:22 |
amorin | cc mriedem sean-k-mooney | 14:23 |
*** _erlon_ has quit IRC | 14:23 | |
mriedem | mnaser: i think you just hit something like this nw info cache lost thing, so you might have input here http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007363.html | 14:23 |
amorin | by the way, I faced an other one, related to the patch I did: | 14:23 |
amorin | https://review.opendev.org/#/c/667294/ | 14:23 |
mriedem | maciejjozefczyk: sean-k-mooney: ^ | 14:23 |
mriedem | amorin: one step forward, two steps back :( | 14:24 |
amorin | yup | 14:25 |
mriedem | i remember a similar check was added here https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/network/base_api.py#L35 | 14:25 |
amorin | exact | 14:25 |
sean-k-mooney | ok so we might need all 3 patches | 14:26 |
sean-k-mooney | amorin: stephenfin's patch is a generalised fix to a very specific edge case | 14:27 |
sean-k-mooney | amorin: what you originally tried to fix was more subtle as we were passing stale data in some cases | 14:28 |
maciejjozefczyk | ehh, instance_info_cache :) | 14:29 |
openstackgerrit | Martin Midolesov proposed openstack/nova master: Implementing graceful shutdown. https://review.opendev.org/666245 | 14:29 |
sean-k-mooney | maciejjozefczyk: yep, it's awesome... | 14:30 |
sean-k-mooney | mriedem: out of interest why do we store the instance info cache in the db? | 14:30 |
sean-k-mooney | i feel like we would have fewer bugs related to it if we actually just made it an in-process dict cache | 14:31 |
mriedem | sean-k-mooney: i'll direct your question to the people that worked on nova back in 2011 or something | 14:31 |
sean-k-mooney | well my next question was going to be "i assume this is because of nova networks legacy choices" | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Remove no longer required "inner" methods. https://review.opendev.org/655282 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Privsepify ipv4 forwarding enablement. https://review.opendev.org/635431 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Remove unused FP device creation and deletion methods. https://review.opendev.org/635433 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Privsep the ebtables modification code. https://review.opendev.org/635435 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move adding vlans to interfaces to privsep. https://review.opendev.org/635436 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move iptables rule fetching and setting to privsep. https://review.opendev.org/636508 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move dnsmasq restarts to privsep. https://review.opendev.org/639280 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move router advertisement daemon restarts to privsep. https://review.opendev.org/639281 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move calls to ovs-vsctl to privsep. https://review.opendev.org/639282 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move setting of device trust to privsep. https://review.opendev.org/639283 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Move final bridge commands to privsep. https://review.opendev.org/639580 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Cleanup the _execute shim in nova/network. https://review.opendev.org/639581 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: We no longer need rootwrap. https://review.opendev.org/554438 | 14:32 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Cleanup no longer required filters and add a release note. https://review.opendev.org/639826 | 14:32 |
mriedem | sean-k-mooney: idk, you'd have to do some digging to find out when the network info cache was introduced, i don't know if it was before quantum or not | 14:33 |
mriedem | but we also store bdms in the db which are essentially the same thing - a cache of volume information for the server | 14:33 |
mriedem | which was probably before cinder existed | 14:33 |
sean-k-mooney | im seeing a pattern there | 14:33 |
sean-k-mooney | ok, well, let's fix the current issue first, but i think i might look into whether we could stop storing it in the db | 14:34 |
amorin | I would love that | 14:34 |
amorin | :p | 14:34 |
sean-k-mooney | caching in memory in the compute agent would likely be enough | 14:34 |
sean-k-mooney | we would have to rebuild it every time the compute agent restarts but i think that is fine | 14:35 |
sean-k-mooney | actually we could use memcache to cache it too, which would mean all the services would have access to it. anyway, it's now on my todo list | 14:37 |
sean-k-mooney | messing up the neutron policy and corrupting the network info cache is what caused our ci cloud production outage at the weekend | 14:38 |
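The in-process dict cache sean-k-mooney is proposing could look roughly like the sketch below. The class name, the `refresh_cb` rebuild hook, and the shape of `network_info` are all hypothetical; nova's real cache is persisted in the instance_info_caches table, which is exactly what this sketch would replace:

```python
import threading


class NetworkInfoCache:
    """Hypothetical in-process cache of instance network_info,
    keyed by instance uuid, living only in the compute agent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get(self, instance_uuid, refresh_cb=None):
        with self._lock:
            info = self._cache.get(instance_uuid)
        if info is None and refresh_cb is not None:
            # rebuild from neutron, e.g. after a compute agent restart
            info = refresh_cb(instance_uuid)
            self.put(instance_uuid, info)
        return info

    def put(self, instance_uuid, network_info):
        with self._lock:
            self._cache[instance_uuid] = network_info

    def drop(self, instance_uuid):
        # called on instance delete / network deallocation
        with self._lock:
            self._cache.pop(instance_uuid, None)
```

As noted in the channel, the cost is that the cache must be rebuilt on restart (or shared via something like memcache if other services need it).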
*** aarents has joined #openstack-nova | 14:38 | |
*** luksky has quit IRC | 14:43 | |
mriedem | TheJulia: is this a known busted job? http://logs.openstack.org/17/667417/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/db33ba3/controller/logs/devstacklog.txt.gz#_2019-06-26_05_47_14_168 | 14:45 |
mriedem | sean-k-mooney: redoing how the nw info cache works is hopefully wayyyyyyy down on your todo list | 14:46 |
shilpasd | efried: mriedem: can you tell me how to trigger live migration sync and async way, any CLI commands? | 14:47 |
mriedem | shilpasd: i don't know what you mean, sync and async way | 14:48 |
shilpasd | mriedem: i mean nova live-migration <instance_id> triggers a live migration, but is there another way to live migrate, any periodic call or something? | 14:48 |
mriedem | no nova doesn't auto-live migrate things for you | 14:49 |
shilpasd | mriedem: i am in the process of verifying all move operations against the NFS changes done in https://review.opendev.org/#/c/650188/ | 14:49 |
shilpasd | so want to take care of all move operations | 14:50 |
shilpasd | so just want to know about it | 14:50 |
mriedem | all move operations are user-initiated | 14:51 |
mriedem | as far as i know anyway | 14:51 |
shilpasd | ok, as of now verifying SHELVE + SHELVE with offload + UNSHELVE + REBUILD + RESIZE + RESIZE REVERT + EVACUATION + COLD MIGRATION + COLD MIGRATION REVERT + LIVE MIGRATION | 14:52 |
*** lpetrut has quit IRC | 14:52 | |
shilpasd | just list if i missed anything | 14:52 |
mriedem | by rebuild i assume you mean evacuate | 14:52 |
mriedem | rebuild (the server action in the api) isn't a move, | 14:52 |
mriedem | but evacuate is | 14:52 |
efried | brinzhang: I'm here now, what's up? | 14:52 |
mriedem | evacuate = rebuild on another host | 14:52 |
shilpasd | rebuild using another image | 14:52 |
mriedem | rebuild + a new image is not a move | 14:53 |
mriedem | it's rebuilding the server's root disk image on the same host | 14:53 |
bauzas | mriedem: not sure I understood your point in https://bugs.launchpad.net/nova/+bug/1793569/comments/5 | 14:53 |
openstack | Launchpad bug 1793569 in OpenStack Compute (nova) "Add placement audit commands" [Wishlist,Confirmed] - Assigned to Sylvain Bauza (sylvain-bauza) | 14:53 |
mriedem | also, shelve w/o offload and then unshelve is also not a move operation, | 14:53 |
bauzas | mriedem: do you want heal_allocations to support this or the "placement audit' rather ? | 14:53 |
mriedem | if the instance is shelved but not offloaded, and then the user unshelves, it's just unshelved on the same host | 14:53 |
shilpasd | mriedem: ok, noted | 14:53 |
shilpasd | mriedem: what about resize | 14:54 |
shilpasd | it's a move operation, right, since it can resize onto another host too | 14:55 |
mriedem | shilpasd: maybe :) | 14:55 |
mriedem | unless nova is configured with allow_resize_to_same_host and the scheduler picks the same host the instance is already one, | 14:55 |
mriedem | which is possible in a small edge site or if the server is in a strict affinity group and can't be moved | 14:55 |
mriedem | *already on | 14:56 |
shilpasd | got it | 14:56 |
mriedem | https://bugs.launchpad.net/nova/+bug/1790204 is all about that problem | 14:56 |
openstack | Launchpad bug 1790204 in OpenStack Compute (nova) "Allocations are "doubled up" on same host resize even though there is only 1 server on the host" [High,Triaged] | 14:56 |
mriedem | bauzas: i think i meant to say "nova-manage placement audit" there, | 14:58 |
mriedem | since heal_allocations doesn't report on things really, nor does it delete allocations, it only adds allocations for instances (not migrations) that are missing | 14:58 |
bauzas | mriedem: ack, will add this there then | 14:58 |
mriedem | i went on to continue talking about heal_allocations but idk, it's a blur | 15:00 |
shilpasd | mriedem: one more query, i have an NFS configuration, and while performing resize to another host, it goes to create an instance data file on the dest system via SSH | 15:00 |
shilpasd | refer code at https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8861 | 15:00 |
shilpasd | mriedem: during the shared resource provider check, why is this check necessary? | 15:01 |
shilpasd | _is_storage_shared_with() | 15:02 |
mriedem | shilpasd: it may be ssh or rsync, it depends on config https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.remote_filesystem_transport | 15:03 |
mriedem | the default is ssh | 15:03 |
mriedem | i'm less familiar with this code, but for one we don't have shared storage provider support in the libvirt driver anyway, | 15:04 |
mriedem | but this is presumably one of the things we could replace if we had compute nodes modeled in a shared storage aggregate and we could avoid the "temp file create" tests and such for shared storage | 15:04 |
mriedem | as i'm sure lyarwood and mdbooth could probably attest, shared storage support in the libvirt driver can be very confusing because there are the instance files like console logs and such, and there is the image backend, and that can all be different and be a mix of shared storage and non-shared storage, e.g. the root disk images might be in the rbd image backend but the instance files, like console logs, could be on local disk and get ssh'ed/rsync'ed around | 15:05 |
mriedem | e.g. | 15:06 |
mriedem | https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.ensure_libvirt_rbd_instance_dir_cleanup | 15:06 |
sean-k-mooney | mriedem: yes it is but if we keep getting bugs with it i might have to raise it. but ya not before m2, likely not before m3, if in train at all. | 15:07 |
sean-k-mooney | ^ network info cache rework | 15:07 |
mriedem | bauzas: i think what i was thinking of was an audit command could detect that you have orphaned allocations tied to a not-in-progress migration, e.g. a migration that failed but we failed to cleanup the allocations, | 15:07 |
mriedem | bauzas: and then that information could be provided to the admin to then determine what to do, e.g. delete the allocations for the migration record consumer and potentially the related instance, | 15:08 |
bauzas | mriedem: yeah ok | 15:08 |
mriedem | and if they delete the allocations for the instance, then they could run heal_allocations on the instance to fix things up | 15:08 |
mriedem | we could also eventually build on that to make it automatic with options | 15:08 |
mriedem | e.g. nova-manage placement audit --heal | 15:08 |
mriedem | something like that | 15:08 |
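[editor's note: the audit/heal flow mriedem sketches above can be illustrated in Python. This is a hypothetical sketch, not nova's actual code; `list_consumers`/`delete_allocations` stand in for placement API calls, and the real command would gather instance and in-progress-migration UUIDs from the cell databases.]

```python
# Hypothetical sketch of "nova-manage placement audit [--heal]":
# an allocation whose consumer UUID matches neither a non-deleted
# instance nor an in-progress migration is considered orphaned.

def classify_consumers(allocation_consumers, instance_uuids,
                       inprogress_migration_uuids):
    """Split allocation consumers into (healthy, orphaned) sets."""
    known = set(instance_uuids) | set(inprogress_migration_uuids)
    consumers = set(allocation_consumers)
    return consumers & known, consumers - known


def audit(placement, instance_uuids, migration_uuids, heal=False):
    """Report orphaned allocations; delete them when heal=True."""
    _healthy, orphans = classify_consumers(
        placement.list_consumers(), instance_uuids, migration_uuids)
    for consumer in sorted(orphans):
        print('orphaned allocations for consumer %s' % consumer)
        if heal:
            # the real API call would be DELETE /allocations/{consumer}
            placement.delete_allocations(consumer)
    return orphans
```

[an admin who deletes an instance's allocations this way could then run heal_allocations on that instance to rebuild them, as discussed above.]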
shilpasd | mriedem: thanks for discussing doubts, will go through the sharings and get back to you for any further | 15:09 |
mriedem | sean-k-mooney: redoing nova's nw info cache at this point in the game is going to be a big undertaking, and i would not be surprised if trying to use a global cache like memcache or etcd or something just generates more or different kinds of bugs than what we've already been patching lo these many years, as i'm sure dansmith can agree | 15:09 |
* mriedem feels the need to phone a friend | 15:10 | |
dansmith | oh mahgod | 15:10 |
efried | yonglihe: I'm going to fix your pep8 error on https://review.opendev.org/#/c/627765/ real quick, k? | 15:10 |
dansmith | why do we need a memcache? it's in the database | 15:10 |
efried | it's due to a new rule that recently merged. | 15:10 |
sean-k-mooney | dansmith: i was suggesting not keeping it in the database and only having a dict cache or maybe use memcache | 15:11 |
dansmith | sean-k-mooney: ...why? | 15:11 |
sean-k-mooney | mriedem: and ya it would be a blueprint or spec not a bug fix | 15:11 |
mriedem | sean-k-mooney: we can just as easily f that up | 15:11 |
sean-k-mooney | well if it's in process as a dict cache then if we f it up it's fixed by restarting the compute agent | 15:12 |
sean-k-mooney | memcache is probably not going to help with anything | 15:12 |
dansmith | we store some stuff in nwinfo that isn't anywhere else, IIRC, like which ports we created vs. the user, so that has to be persisted somewhere if we were going to use memcache | 15:12 |
dansmith | ...yeah ;) | 15:12 |
dansmith | what problem is being solved here? | 15:13 |
mriedem | i don't think that overhauling to use an external cache service and restarting the compute is the giant hammer we really need for what we're trying to solve | 15:13 |
sean-k-mooney | nothing at the moment, reworking it is unrelated to what we are trying to fix | 15:13 |
openstackgerrit | Eric Fried proposed openstack/nova master: Clean up orphan instances virt driver https://review.opendev.org/648912 | 15:13 |
openstackgerrit | Eric Fried proposed openstack/nova master: clean up orphan instances https://review.opendev.org/627765 | 15:13 |
mriedem | so this is a....thought exercise? | 15:13 |
efried | sean-k-mooney, gibi: Would y'all please have another look at these --^ | 15:13 |
sean-k-mooney | yes | 15:13 |
sean-k-mooney | its on my todo list to figure out if it makes sense to even do | 15:14 |
gibi | efried: I have it open | 15:14 |
efried | thanks gibi | 15:15 |
efried | thanks sean-k-mooney | 15:15 |
efried | sean-k-mooney: fyi it's apparently a thing stx cares about | 15:15 |
efried | thus presumably it "makes sense" in some capacity :) | 15:15 |
mriedem | efried: hyperv ci is happy with the update_provider_tree patch https://review.opendev.org/#/c/667417/ | 15:17 |
efried | mriedem: thanks for the reminder | 15:17 |
mriedem | efried: fwiw that cleanup orphan instances thing is also something that the public cloud SIG (and huawei public cloud ops) care about as well, which is why i was initially reviewing it awhile back | 15:17 |
mriedem | the concern at the last ptg was how much duplication there was with the existing periodic to cleanup running deleted (but not orphaned) instances | 15:18 |
efried | okay, thanks for that background. | 15:20 |
mriedem | something something live migration fails and you've got untracked guests on the host consuming resources (which aren't tracked obviously) so then trying to schedule things to those hosts fails b/c you're out of resources | 15:21 |
efried | sounds like we need a patch to clean up those orphaned instances | 15:21 |
* mriedem shrugs | 15:22 | |
mriedem | i'm sure lots of operators have already just written scripts to detect and clean those types of things up | 15:22 |
mriedem | but yeah it's better to have it native probably | 15:22 |
efried | mriedem: We don't have a way to prove the xen one is being hit, do we? (update_provider_tree) | 15:42 |
efried | since their CI is dead? | 15:42 |
efried | mriedem: also, if you haven't already, there should be a note to the ML warning of this (and another before we remove the code path, obvsly) | 15:43 |
efried | ...for oot folk | 15:44 |
mriedem | sorry was just doing tech support with my mom | 15:44 |
efried | (I know nova_powervm is copacetic fwiw) | 15:44 |
mriedem | i was waiting to send the oot ML email until we were more sure about what i've proposed | 15:45 |
mriedem | and idk about the xen one if their CI is dead, though it's pretty damn basic | 15:45 |
mriedem | just a port of get_inventory | 15:45 |
bauzas | efried: mriedem: heh, the reportclient doesn't of course support all placement API queries, so I wonder whether I should add something like "get_resource_providers()" method in the reportclient just for nova-manage caller, or calling directly the Placement API | 15:46 |
bauzas | thoughts on that ? | 15:46 |
efried | bauzas: If it's something simple like GET /resource_providers (you really want all of them?) then yeah, just call SchedulerReportClient.get() | 15:47 |
bauzas | zactly | 15:47 |
efried | sfine | 15:47 |
bauzas | efried: but then I don't have a safe_connect connection | 15:48 |
mriedem | if you're not going to page, you could be listing 14K providers in the case of cern... | 15:48 |
efried | bauzas: We don't want @safe_connect | 15:48 |
efried | ever, anywhere | 15:48 |
efried | Handle ksa.ClientException at the caller instead. | 15:48 |
efried | And if you see @safe_connect anywhere in your travels and want to kill it and do that ^, I will buy your drinks. | 15:48 |
* mriedem notes that GET /resource_providers doesn't support paging | 15:48 | |
efried | true story | 15:48 |
bauzas | it's 40°C here, I'm all for a drink | 15:49 |
efried | bauzas: what are you trying to do with the master list? | 15:49 |
bauzas | efried: looking up all allocations to see whether they're orphaned | 15:49 |
bauzas | mriedem: ah shit, excellent point | 15:49 |
mriedem | you could instead page the compute nodes in the cells and hit this api https://developer.openstack.org/api-ref/placement/?expanded=#list-resource-provider-allocations | 15:50 |
bauzas | we could possibly need to look at all allocations per resource provider, which would be given by a list of compute services (which is paginated AFAIK) | 15:50 |
bauzas | heh, jinxed | 15:50 |
mriedem | compute service != compute node == resource provider | 15:50 |
bauzas | shit, typo, nodes indeed | 15:50 |
bauzas | tell me about my Kilo bp | 15:50 |
mriedem | so once you get the allocations for a given provider, what are you going to do? | 15:51 |
mriedem | check if an instance (or migration) exists with the given consumer uuid? | 15:51 |
mriedem | and if not, consider the allocation orphaned? | 15:51 |
mriedem | iff the allocation has resources that nova "owns" like VCPU | 15:52 |
mriedem | without consumer types in the allocations response we have to rely on the resource class | 15:52 |
bauzas | exactly this, I was about to say which resource classes were nova-related | 15:52 |
efried | ugh, relying on resource class... | 15:53 |
efried | this is where the concept of provider owner would be handy. | 15:54 |
bauzas | yeah I know | 15:54 |
efried | hopefully we're not allowing allocations from different owners against the same provider anywhere | 15:54 |
bauzas | we could also add an argument asking for the resource class we wanna check | 15:54 |
efried | no, we shouldn't do it by resource class | 15:54 |
efried | because same resource class may be managed by different owners in different providers | 15:55 |
efried | think VF (nova-PCI vs cyborg vs neutron) | 15:55 |
efried | but we (need to make sure we) have a rule that a provider as a whole is only managed by a single owner. | 15:55 |
bauzas | hmmm | 15:56 |
bauzas | actually, I'm checking consumer_id | 15:57 |
bauzas | so I guess all resource providers corresponding to compute nodes (and associated children) should have allocations against a consumer_id that *is* either a migration object or a nova instance | 15:57 |
bauzas | even cyborg, right? | 15:58 |
openstackgerrit | Nate Johnston proposed openstack/nova stable/stein: [DNM] Test change to check for port/instance project mismatch https://review.opendev.org/667663 | 15:58 |
bauzas | efried: ^? | 15:59 |
efried | bauzas: If what you're looking to do is clean up allocations against orphaned instances, I think it's legit to remove all the allocations associated with that consumer, even if they're on providers you don't own. That's symmetrical with what we do when we schedule (we claim all of those atomically from nova). | 16:00 |
efried | and | 16:00 |
efried | if there's an allocation against a compute node RP, you can legitimately assume it's in that category | 16:01 |
efried | but | 16:01 |
efried | that will break eventually if we ever have resourceless roots | 16:01 |
efried | because | 16:01 |
efried | you can *not* assume that all children of the compute node RP *also* belong to nova. | 16:01 |
bauzas | baby steps here :) | 16:01 |
efried | yeah, just leave a note/todo I guess. | 16:02 |
bauzas | at least if I can support nested rps, it would be cool | 16:02 |
bauzas | because eg. VGPU allocations are still made *against* a consumer which is an instance, yeepee | 16:02 |
bauzas | but, that would mean I would look at all resource providers, not only the ones Nova owns | 16:03 |
efried | yeah, it would be 1) compute node => 2) compute node RP => 3) allocations against that RP => 4) consumer for that allocation => 5) filter down to orphan consumers => 6) allocations for those consumers => 7) delete all of those | 16:03 |
efried | no | 16:03 |
bauzas | and here comes pagination... | 16:04 |
efried | with the limitation noted above (stops working for resourceless roots, which we're a long way off of), the above process will get you there. | 16:04 |
efried | Step 1 done by paginating from the nova API. | 16:04 |
bauzas | cool then | 16:04 |
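[editor's note: efried's seven steps above can be sketched as a single loop. This is illustrative only; `iter_compute_node_uuids`, `get_allocations_for_provider` and `delete_allocations` are hypothetical stand-ins for paginating compute nodes from the nova API and for SchedulerReportClient GET/DELETE calls, and the whole approach carries the caveats from the log: GET /resource_providers/{uuid}/allocations, no assumption that children of the compute node RP belong to nova, and it stops working for resourceless roots.]

```python
# Sketch of: 1) compute node -> 2) compute node RP -> 3) allocations
# against that RP -> 4) consumer per allocation -> 5) filter orphans
# -> 6) allocations for those consumers -> 7) delete them.

def purge_orphan_allocations(nova_api, placement, is_orphan):
    deleted = []
    # 1-2) page through compute nodes; a compute node's UUID is also
    #      its resource provider UUID in placement
    for node_uuid in nova_api.iter_compute_node_uuids():
        # 3) allocations keyed by consumer on this provider
        allocs = placement.get_allocations_for_provider(node_uuid)
        # 4-5) keep consumers with no matching instance/migration
        orphans = {c for c in allocs if is_orphan(c)}
        # 6-7) deleting by consumer removes that consumer's
        #      allocations on every provider it touches, mirroring
        #      the atomic claim made at schedule time
        for consumer in sorted(orphans):
            placement.delete_allocations(consumer)
            deleted.append(consumer)
    return deleted
```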
efried | this is in a nova-manage type utility? | 16:05 |
efried | So we don't care that it'll take FOREVER to run at cern? | 16:05 |
bauzas | a nova-manage placement audit thing | 16:05 |
efried | mm | 16:05 |
bauzas | so a cron job basically | 16:05 |
bauzas | marker and the likes | 16:05 |
efried | mm | 16:06 |
bauzas | zactly like heal_allocations | 16:06 |
efried | sure would be nice to find a way to make it more efficient then. | 16:06 |
mriedem | heal_allocations doesn't have a marker | 16:06 |
efried | but: make it, make it right, make it fast | 16:06 |
mriedem | it has a limit of things to process | 16:06 |
mriedem | nor does heal_allocations deal with nested allocations | 16:06 |
mriedem | the audit command could also take a --consumer option to just investigate what the operator thinks is a problem instance/migration | 16:08 |
mriedem | note that i added --instance to heal_allocations later for that reason | 16:09 |
bauzas | yup I saw | 16:09 |
mriedem | and --dry-run | 16:09 |
mriedem | depends on what the command will do though, if it's just reporting then you don't need a --dry-run | 16:09 |
bauzas | I was thinking of just telling the orphaned, but then later adding a --remove option | 16:11 |
bauzas | *later* | 16:11 |
bauzas | anyway, needs to go off and run by hot summer nights | 16:12 |
bauzas | I think I have everything I needed, thanks folks | 16:13 |
* mriedem assumes "hot summer nights" is a french adult store that bauzas frequents | 16:14 | |
dansmith | hoo boy | 16:14 |
mriedem | strictly adult cheese, wine and things of that nature | 16:15 |
melwitt | now, for another fun topic | 16:16 |
melwitt | mriedem, dansmith: I was reading these comments on an old [unmerged] patch: https://review.opendev.org/#/c/462521/12/releasenotes/notes/resize-auto-revert-6e1648828aba16b2.yaml@5, | 16:17 |
melwitt | and it made me think of the [recently merged] patch: https://review.opendev.org/633227 again and how it changed ERROR state to ACTIVE (or STOPPED) state. now I'm worried that wasn't an ok thing to do (API change?) | 16:17 |
melwitt | for a failed cold migration to self | 16:18 |
mriedem | not the same | 16:18 |
mriedem | in my change, | 16:18 |
mriedem | we failed in prep_resize before we actually did anything to the guest | 16:18 |
mriedem | in that case, putting the instance in ERROR status makes no sense imo | 16:18 |
mriedem | as i said, the only way you can get it out of error then is to do something like rebuild, hard reboot and/or reset status to ACTIVE, | 16:19 |
dansmith | and was yours also resetting to ACTIVE if it was actually shutoff? | 16:19 |
dansmith | I forget | 16:19 |
mriedem | and if i started a resize or cold migration of a STOPPED instance, then resetting it to ACTIVE isn't what i want, nor is rebuild or hard reboot really | 16:19 |
mriedem | dansmith: that was the point of my fix | 16:19 |
mriedem | to reset to STOPPED if it was STOPPED | 16:19 |
dansmith | mriedem: right | 16:19 |
mriedem | well, in part, | 16:19 |
mriedem | the main point was don't put it in ERROR status | 16:20 |
melwitt | ok, I think I see. this is ok because the instance is actually ok (other than cosmetic), whereas for the first example, the instance was not ok and was proposed to auto-correct to an ok/healthy state | 16:21 |
dansmith | the auto-revert actually moved stuff back, IIRC | 16:22 |
dansmith | not just correcting state, but actual revert | 16:22 |
melwitt | yeah it did | 16:22 |
melwitt | I was zooming in on the vm_state part of it, how it appears to an external script like in your example in the comment | 16:23 |
melwitt | and then I was thinking, is that a problem, if we imagine an external script executing a cold migrate and it fails and the instance stays ACTIVE so the script doesn't know it didn't work. that sort of thing | 16:24 |
melwitt | I was wondering about that after I read the comments on the old auto-revert patch | 16:25 |
dansmith | but the difference is, | 16:26 |
mriedem | the external thing should be waiting for task_state to be None to know the operation is done (or the instance action is finished/error, or the migration status is 'finished' or whatever in this case) | 16:26 |
dansmith | the merged patch corrected state before it changed from $orig to MIGRATING or whatever, right? | 16:26 |
mriedem | polling the vm_state in the API is not sufficient | 16:26 |
dansmith | the auto-revert one has it go into all the migrating states and then pop back | 16:26 |
dansmith | specifically, potentially pop back to ACTIVE and not have moved, IIRC | 16:27 |
melwitt | yes, I believe it did prevent an ERROR state that occurred before going to MIGRATING | 16:27 |
mriedem | i'm getting lost in the "it" references here when talking about separate changes | 16:28 |
melwitt | heh, sorry. the merged change | 16:28 |
mriedem | booth's change was, | 16:28 |
mriedem | resize/cold migrated failed somewhere and somehow, and the instance was set to ERROR status, right? | 16:28 |
mriedem | and if you tried doing a revertResize API call on that ERROR instance, it would do the revert resize flow to go back from the dest to the source host | 16:29 |
dansmith | no, it did a full revert I think | 16:29 |
mriedem | even though what we could have failed on was maybe something in prep_resize or resize_instance before the guest / volumes / networking ever actually *got* to the dest host | 16:29 |
dansmith | so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE | 16:30 |
dansmith | you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or moved | 16:30 |
melwitt | yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert | 16:30 |
mriedem | oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449 | 16:30 |
dansmith | granted it's been 18 months since I last looked at this | 16:30 |
dansmith | it's really the opposite of what mriedem's change was doing, | 16:30 |
dansmith | which was keep it active if we don't start | 16:31 |
mriedem | or stopped rather than active... | 16:31 |
dansmith | well, and that's an important piece yeah | 16:31 |
mriedem | i.e. start resize with a stopped server, prep_resize fails, don't reset to active *because it's stopped* | 16:31 |
dansmith | right | 16:31 |
mriedem | eventually the power sync task would stop the instance i think but still | 16:31 |
dansmith | or restart it when it shouldn't, right? | 16:32 |
melwitt | yeah, makes sense | 16:32 |
dansmith | if vm_state is active, it was stopped, power state sync says "hmm, this should be running" | 16:32 |
mriedem | i don't think that task ever starts anything | 16:32 |
mriedem | even though people have asked for that in the past | 16:32 |
dansmith | no? I thought it would for things like post-host-failure recovery | 16:32 |
mriedem | i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user | 16:32 |
dansmith | well, billing is unrelated to started or stopped, but okay :) | 16:33 |
dansmith | it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong | 16:33 |
mriedem | depends on how you do your billing | 16:33 |
dansmith | regardless, ACTIVE but not running is about as bad | 16:33 |
mriedem | same - it's been a long time since i looked | 16:33 |
mriedem | anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None, | 16:34 |
mriedem | if the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely | 16:34 |
mriedem | unless i've got a timeout, | 16:34 |
dansmith | especially if you went into RESIZING in between | 16:34 |
mriedem | or also checking instance actions or migration status (which might be admin-only info) | 16:34 |
mriedem | i personally wouldn't try to track the task_state transitions since that's probably a losing game | 16:35 |
mriedem | i would just wait for terminal states but yeah | 16:35 |
dansmith | the thing is, ACTIVE is a terminal state for auto-confirm | 16:35 |
mriedem | true yeah | 16:35 |
dansmith | so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed | 16:35 |
dansmith | but with auto-revert, | 16:35 |
mriedem | i know powervc set auto-confirm to 1 second | 16:35 |
dansmith | that breaks that behavior | 16:35 |
mriedem | lbragstad had to fix a few race bugs as a result :) | 16:36 |
dansmith | with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't" | 16:36 |
mriedem | dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, which you as a non-admin might not have access to those details | 16:36 |
melwitt | yeah, I see | 16:36 |
dansmith | it turns waiting for a terminal state into a much more complex affair for sure | 16:36 |
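[editor's note: the external-script problem discussed above, sketched in Python. This assumes a `get_server()` callable returning the compute API view of the server (`status` plus the `OS-EXT-STS:task_state` extension key); per mriedem's advice, it waits for a terminal status with task_state None rather than tracking intermediate task states, and the comment marks where the proposed auto-revert would make ACTIVE ambiguous.]

```python
import time

def wait_for_resize(get_server, timeout=300, interval=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll a server until its resize reaches a terminal state."""
    deadline = clock() + timeout
    while clock() < deadline:
        server = get_server()
        status = server['status']
        task_state = server.get('OS-EXT-STS:task_state')
        if task_state is None:
            if status == 'VERIFY_RESIZE':
                return True   # resize finished, awaiting confirm
            if status == 'ERROR':
                return False  # resize failed
            # ACTIVE here is ambiguous: auto-confirm means success,
            # but with the proposed auto-revert it could equally mean
            # the resize failed and was rolled back
        sleep(interval)
    raise TimeoutError('server never reached a terminal resize state')
```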
openstackgerrit | Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221 | 16:37 |
melwitt | that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it | 16:37 |
mriedem | maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :) | 16:40 |
melwitt | yeah, that's a good idea. let me do that now | 16:40 |
sean-k-mooney | ... i started reading the scrollback and i think on second thought i'm not going to do that | 16:44 |
sean-k-mooney | melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after | 16:47 |
sean-k-mooney | for resize they could check if the flavor is the one they expected | 16:48 |
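[editor's note: sean-k-mooney's non-admin checks, sketched. The `before`/`after` dicts are assumed to look like the compute API's GET /servers/{id} response ('hostId' is the per-tenant hashed host id). As dansmith points out next, the hostId check only holds for a strict cold migration, since resize may legally land on the same host with allow_resize_to_same_host.]

```python
def cold_migrated(before, after):
    """True if the server appears to have moved hosts (hostId changed)."""
    return before['hostId'] != after['hostId']

def resized(before, after):
    """True if the server's flavor changed (a resize rather than a
    pure cold migration)."""
    return before['flavor']['id'] != after['flavor']['id']
```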
dansmith | sean-k-mooney: not really | 16:48 |
dansmith | oh for a strict migration, yeah | 16:48 |
dansmith | was going to say, resize to same host breaks that assumption | 16:48 |
melwitt | sean-k-mooney: could also observe ACTIVE -> RESIZING -> ACTIVE as dansmith described, right? as non-admin | 16:49 |
dansmith | melwitt: yes | 16:49 |
sean-k-mooney | you could observe it if you poll but you would not know if it succeeded or failed | 16:49 |
sean-k-mooney | without also checking if the flavor is the old or new one | 16:49 |
dansmith | sean-k-mooney: you won't go back to active from resizing currently | 16:49 |
sean-k-mooney | oh ok so that was the change ye were talking about | 16:50 |
melwitt | sean-k-mooney: if it failed [after going to RESIZING] it would go to ERROR. are you talking hypothetically with the abandoned patch? | 16:50 |
sean-k-mooney | melwitt: there are cases where i thought it would auto revert that went back to active | 16:51 |
melwitt | sean-k-mooney: no, that was the proposal in the abandoned patch | 16:51 |
sean-k-mooney | ok i might be thinking about live migrate then | 16:52 |
sean-k-mooney | for live migrate we can fail to migrate but still be in active | 16:52 |
sean-k-mooney | so ya looking at a code search, revert_resize is only ever called from the api which simplifies some things but not others | 16:58 |
sean-k-mooney | melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed | 16:59 |
sean-k-mooney | so you can go active->resizing->error->active? | 16:59 |
mriedem | fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null: GET /servers/{server_id}/os-instance-actions/{request_id} | 17:00 |
melwitt | sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error) | 17:00 |
sean-k-mooney | melwitt: ok if we did not, you would have to do reset state (which is admin only?) + a hard reboot | 17:01 |
melwitt | mriedem: you mean failed before resize started right | 17:02 |
mriedem | no if the operation failed | 17:02 |
mriedem | if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227 | 17:02 |
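[editor's note: mriedem's two points above, sketched. The record shapes assume the os-instance-actions API: an action has a top-level 'message' and its events each have a 'result'. As the following lines note (bug 1824420), the aggregation is a false positive when an early event fails but a reschedule succeeds, so the non-admin check is not foolproof.]

```python
def summarize_action(events):
    """How the overall action 'message' gets derived: any errored
    event marks the whole action 'Error', even if a later reschedule
    to another host succeeded."""
    if any(e.get('result') == 'Error' for e in events):
        return 'Error'
    return None

def action_failed(action):
    """The non-admin check: a non-null top-level 'message' means the
    action apparently failed."""
    return action.get('message') is not None
```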
melwitt | sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly | 17:02 |
mriedem | which is actually a bug... | 17:02 |
mriedem | https://bugs.launchpad.net/nova/+bug/1824420 | 17:03 |
openstack | Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged] | 17:03 |
melwitt | oh | 17:03 |
mriedem | so before we go down the road of "well the user can track the operation to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't foolproof because of that bug | 17:04 |
mriedem | and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts | 17:04 |
mriedem | which resize can do | 17:04 |
sean-k-mooney | the instace should become active on the source host but it might not fix the allocation in placment properly | 17:04 |
mriedem | auto-reverting a failed resize could be all sorts of f'ed up | 17:05 |
mriedem | because rollbacks are near impossible | 17:05 |
mriedem | hard to test | 17:05 |
mriedem | i'm fairly certain our live migration rollback code is also quite janky in several ways | 17:05 |
mriedem | because we don't test it in the gate | 17:06 |
sean-k-mooney | just looking at that bug the live migration failed right? | 17:08 |
sean-k-mooney | so we would exepct there to be an error in the instance action log? | 17:09 |
mriedem | no | 17:10 |
mriedem | read my comments on the bug | 17:10 |
sean-k-mooney | maybe im misreading it as its kind of hard to read the initial bug | 17:10 |
mriedem | a pre-check on one of the candidate dest hosts failed | 17:10 |
mriedem | which triggers a reschedule to another dest host in the conductor live migration task | 17:10 |
mriedem | the 2nd host works | 17:10 |
mriedem | but b/c the pre-check failed on the first dest host, the instance action event for that one is an error, which sets the overall action message to 'Error' | 17:11 |
sean-k-mooney | ah ok | 17:11 |
mriedem | iow, actions aren't reschedule aware | 17:11 |
mriedem | or what is a non-fatal error | 17:11 |
sean-k-mooney | right | 17:12 |
sean-k-mooney | should we be logging the prechecks as events? | 17:12 |
sean-k-mooney | i was not expecting to see compute_check_can_live_migrate_destination events | 17:12 |
mriedem | hard to say | 17:13 |
mriedem | if you configure nova to not have alternate hosts for reschedules | 17:13 |
mriedem | then you'd likely want to know it's that dest pre-check that failed right? | 17:13 |
sean-k-mooney | maybe, or just that you had no valid hosts? | 17:14 |
sean-k-mooney | / exhausted the list of alternates | 17:14 |
sean-k-mooney | i think we would still log the failure right | 17:14 |
openstackgerrit | Merged openstack/nova master: Remove orphaned comment from _get_group_details https://review.opendev.org/667135 | 17:15 |
mriedem | sure, if you set https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.migrate_max_retries to 0 for no retries, or you don't have any alternate hosts | 17:15 |
mriedem | idk, anyway, it's tangential to the auto-revert failed resize thing mel was asking about | 17:15 |
sean-k-mooney | for me this feels like we are leaking an implementation detail as an event | 17:15 |
sean-k-mooney | ya it is | 17:16 |
mriedem | instance action events are basically entirely leaked implementation details :) | 17:16 |
mriedem | the event names come from the methods they decorate | 17:16 |
mriedem | there is no guarantee on api stability for those things | 17:16 |
sean-k-mooney | ok personally i would prefer not to decorate that function | 17:17 |
sean-k-mooney | but as you said its tangential to melwitt's topic | 17:17 |
openstackgerrit | Lee Yarwood proposed openstack/nova master: libvirt: Add a rbd_connect_timeout configurable https://review.opendev.org/667421 | 17:36 |
openstackgerrit | Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | 17:36 |
Nick_A | any idea why metadata would send all /24 routes in a region to each instance? http://paste.openstack.org/show/y0lE42EA59yhnu7G1KnY/ | 17:38 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix AttributeError in RT._update_usage_from_migration https://review.opendev.org/667687 | 17:38 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix RT init arg order in test_unsupported_move_type https://review.opendev.org/667688 | 17:38 |
openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889 | 17:48 |
sean-k-mooney | yonglihe: efried just reviewing https://review.opendev.org/#/c/648912/14 but why are we looking up instance by name? | 17:59 |
*** altlogbot_2 has quit IRC | 18:01 | |
efried | sean-k-mooney: I haven't the foggiest. I'm involved here in an administrative capacity :) | 18:01 |
sean-k-mooney | the domain xml has had the instance uuid stored in the uuid field for several releases now, so im wondering why we would use the instance domain name instead | 18:02 |
*** altlogbot_1 has joined #openstack-nova | 18:03 | |
melwitt | sean-k-mooney: been meaning to get to that review. after what we briefly discussed at the ptg, that change would be best to piggyback onto the existing periodic for handling deleted instances. I don't know why it would be looking up instances by name | 18:03 |
efried | k, hopefully yonglihe will be able to answer on the review. Thanks sean-k-mooney. | 18:03 |
sean-k-mooney | melwitt: it is piggybacking on that task | 18:04 |
melwitt | but I see a lot of new methods | 18:04 |
sean-k-mooney | efried: ok im trying to find where it gets the list of suspected instances | 18:04 |
sean-k-mooney | melwitt: ya there are. im not sure if they are all needed. | 18:05 |
melwitt | yeah, I would think there should be none new needed | 18:05 |
sean-k-mooney | well we need a new method to query the driver for the instances that are running on the host but not in the db | 18:06 |
sean-k-mooney | and then we can call the old methods that implement the policy, e.g. reap or log or do nothing, whatever you have set in the config | 18:07 |
melwitt | why? there's a self._get_instances_on_driver method already | 18:07 |
sean-k-mooney | that is a good question :) | 18:08 |
sean-k-mooney | i have only properly looked at https://review.opendev.org/#/c/648912/14 which is doing the driver change, but i need to look at how that is being used in https://review.opendev.org/#/c/627765/28 | 18:08 |
melwitt | anyway, just saying skimming those patches I don't see why they're so large | 18:08 |
melwitt | or rather, I expected not such a large patch for this | 18:09 |
sean-k-mooney | melwitt: this seems overly complex https://review.opendev.org/#/c/627765/28/nova/compute/manager.py@8884 | 18:13 |
sean-k-mooney | also _destroy_orphan_instances is not a great name for that since it might not destroy anything. | 18:14 |
sean-k-mooney | melwitt: i was wrong however, its adding a new periodic task, not extending the existing task | 18:15 |
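The detection step being discussed, finding guests the hypervisor reports but the database does not know about, boils down to a set difference. A hypothetical helper as illustration; the function and both argument names are made up, and the real patch would get its inputs from the virt driver and an InstanceList query for the host:

```python
def find_orphan_guests(driver_uuids, db_uuids):
    """Return UUIDs of guests running on the host but unknown to the DB.

    driver_uuids: UUIDs reported by the virt driver (e.g. libvirt domains).
    db_uuids: UUIDs of instances the database maps to this host.
    Sorted so the result is deterministic for logging.
    """
    return sorted(set(driver_uuids) - set(db_uuids))

print(find_orphan_guests(
    ['aaa', 'bbb', 'ccc'],  # guests the hypervisor reports
    ['aaa', 'ccc']))        # instances the DB knows about
# ['bbb']
```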
*** hongbin has joined #openstack-nova | 18:33 | |
*** ociuhandu has quit IRC | 18:36 | |
melwitt | sean-k-mooney: yeah, that's what I had thought. it should get much simpler if it's changed to extend the existing periodic. and I think the suggestion at the ptg from dansmith was to add another enumerated choice to the existing conf option that is something like "reap-unknown" which will reap both known deleted and unknown guests. and otherwise just log unknowns | 18:41 |
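For reference, the existing option is `running_deleted_instance_action`; the PTG suggestion above would add a new enumerated choice to it rather than a new option. Shown as a nova.conf fragment, with "reap-unknown" being the hypothetical name floated in the discussion, not a settled or existing choice:

```ini
[DEFAULT]
# Existing choices are noop, log, shutdown and reap, and apply to guests
# still running on the host but marked deleted in the database. The
# proposal adds something like "reap-unknown", which would also reap
# guests nova has no record of; with plain "reap", unknown guests would
# only be logged.
running_deleted_instance_action = reap
```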
openstackgerrit | Merged openstack/nova master: Fix update_provider_tree signature in reference docs https://review.opendev.org/667408 | 18:43 |
openstackgerrit | Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | 18:43 |
mriedem | sean-k-mooney: it's named _destroy* because it's similar to the existing _destroy_running_instances | 18:44 |
mriedem | which might also not destroy anything | 18:44 |
sean-k-mooney | mriedem: ah ok | 18:44 |
sean-k-mooney | my main issue with this is it will only really work for the libvirt driver | 18:45 |
sean-k-mooney | and the fake driver | 18:45 |
*** ivve has joined #openstack-nova | 18:45 | |
mriedem | yup | 18:53 |
mriedem | i believe i noted something along those lines last time i did a deep review on it | 18:54 |
openstackgerrit | Miguel Ángel Herranz Trillo proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | 18:54 |
mriedem | i seem to remember why they needed to lookup by name at the time, and it was libvirt-specific | 18:54 |
sean-k-mooney | ya im looking at it again | 18:54 |
sean-k-mooney | its because we can have libvirt domains that are for instances that nova created but have been deleted from the db | 18:55 |
sean-k-mooney | so to destroy the running guest they need to use the libvirt domain name | 18:55 |
mriedem | having said all that, i'm not opposed to starting with something that could eventually be implemented by other drivers | 18:55 |
mriedem | as long as it's graceful about other drivers not implementing the necessary interface | 18:56 |
sean-k-mooney | ya it handles the not-implemented exception correctly in the manager | 18:56 |
sean-k-mooney | im thinking of asking them to use the uuid instead of the name however | 18:56 |
sean-k-mooney | the instance uuid is set in the libvirt domain's uuid field | 18:57 |
mriedem | yeah uuid is ideal if we can use it | 18:57 |
sean-k-mooney | but idk if libvirt allows us to delete by uuid | 18:58 |
sean-k-mooney | we might just be pushing the translation into the driver | 18:58 |
sean-k-mooney | im also wondering how to deal with the fact we could be leaking plugged interfaces | 18:59 |
*** maciejjozefczyk has quit IRC | 19:02 | |
mriedem | heh, | 19:03 |
mriedem | well we could be leaking all sorts of things | 19:04 |
mriedem | storage connections | 19:04 |
sean-k-mooney | will undefining the domain delete the root disk? | 19:04 |
sean-k-mooney | or other disks | 19:04 |
mriedem | i doubt it | 19:05 |
sean-k-mooney | if we are reaping the orphan vms that were created by nova but are deleted in the db, we really should be cleaning up all their local resources in that case, which this current patch does not attempt to do | 19:05 |
mriedem | otherwise we wouldn't need separate calls during driver.destroy to cleanup the volumes via brick and unplug vifs via os-vif | 19:06 |
mriedem | at some point this could also be cyborg devices and such couldn't it? | 19:06 |
sean-k-mooney | ya that is a good point | 19:06 |
sean-k-mooney | ya | 19:06 |
sean-k-mooney | so if we are going to reap these we need to try and clean up as much of the resources as we can, or just support powering down the instance but not reaping them? | 19:07 |
openstackgerrit | Merged openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | 19:07 |
sean-k-mooney | if we just delete the domain the operator has nothing to go on to figure out what they need to clean up manually | 19:07 |
sean-k-mooney | its tricky however, as from the domain we dont know what nova created or not, but i think we should at least try to do some cleanup | 19:09 |
*** dklyle has quit IRC | 19:09 | |
sean-k-mooney | anyway im going to get something to eat | 19:12 |
sean-k-mooney | o/ | 19:13 |
*** ralonsoh has quit IRC | 19:17 | |
*** phughk has quit IRC | 19:18 | |
mriedem | there is meta in the domain that tells us if nova created it | 19:19 |
sean-k-mooney | yes there is | 19:19 |
mriedem | for libvirt anyway | 19:19 |
sean-k-mooney | which i asked yonglihe to check when deleting the domain | 19:20 |
sean-k-mooney | in the libvirt case that is | 19:20 |
efried | mriedem: +1 on the ML note, thanks for that | 19:20 |
mriedem | np | 19:21 |
sean-k-mooney | that is what https://review.opendev.org/#/c/648912/14/nova/virt/libvirt/driver.py@9695 is doing and its used to filter the domains here https://review.opendev.org/#/c/627765/28/nova/compute/manager.py@8886 | 19:22 |
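Both points above, the instance UUID living in the domain's <uuid> element and the nova metadata marking guests nova created, can be seen by parsing a domain's XML dump. A self-contained sketch; the trimmed example document and its UUID are made up, and the real code would get the XML from libvirt's XMLDesc rather than a string:

```python
import xml.etree.ElementTree as ET

# Trimmed example of a libvirt domain XML dump: the instance UUID is in
# <uuid>, and nova tags guests it created with metadata under its own
# XML namespace. All values here are invented for illustration.
DOMAIN_XML = """
<domain type='kvm'>
  <name>instance-0000002a</name>
  <uuid>9873dcf7-4b27-4dd1-a427-97de7fa9e3c9</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="19.0.0"/>
    </nova:instance>
  </metadata>
</domain>
"""

NOVA_NS = 'http://openstack.org/xmlns/libvirt/nova/1.0'

def describe_domain(xml):
    """Return (uuid, created_by_nova) parsed from a domain XML dump."""
    root = ET.fromstring(xml)
    uuid = root.findtext('uuid')
    # A guest counts as nova-created if the nova metadata element exists.
    created_by_nova = root.find('./metadata/{%s}instance' % NOVA_NS) is not None
    return uuid, created_by_nova

print(describe_domain(DOMAIN_XML))
# ('9873dcf7-4b27-4dd1-a427-97de7fa9e3c9', True)
```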
*** whoami-rajat has quit IRC | 19:22 | |
*** maciejjozefczyk has joined #openstack-nova | 19:31 | |
*** dklyle has joined #openstack-nova | 19:36 | |
*** panda has quit IRC | 19:39 | |
*** panda has joined #openstack-nova | 19:40 | |
*** altlogbot_1 has quit IRC | 19:45 | |
*** altlogbot_2 has joined #openstack-nova | 19:47 | |
*** maciejjozefczyk has quit IRC | 19:50 | |
*** tbachman has quit IRC | 19:55 | |
efried | mriedem: took me three days to go through all the specs, but http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007381.html | 19:59 |
*** tbachman has joined #openstack-nova | 19:59 | |
*** mriedem has quit IRC | 20:04 | |
*** mriedem has joined #openstack-nova | 20:07 | |
mriedem | efried: huh https://review.opendev.org/#/q/project:openstack/nova-specs+status:open+path:%255Especs/train/approved/.*+reviewedby:self seems to not work properly, it's only showing me https://review.opendev.org/#/c/631154/ but i've clearly commented on https://review.opendev.org/#/c/648686/ as well | 20:10 |
mriedem | maybe reviewedby is only on the latest PS? | 20:10 |
mriedem | aha | 20:10 |
mriedem | https://review.opendev.org/#/q/project:openstack/nova-specs+status:open+path:%255Especs/train/approved/.*+reviewer:self | 20:10 |
mriedem | reviewer:self, not reviewedby:self | 20:11 |
efried | whoops, thanks | 20:14 |
*** altlogbot_2 has quit IRC | 20:15 | |
*** eharney has quit IRC | 20:16 | |
*** altlogbot_2 has joined #openstack-nova | 20:19 | |
*** altlogbot_2 has quit IRC | 20:43 | |
*** altlogbot_2 has joined #openstack-nova | 20:45 | |
*** Bidwe_jay has quit IRC | 20:59 | |
*** altlogbot_2 has quit IRC | 21:00 | |
*** altlogbot_3 has joined #openstack-nova | 21:05 | |
*** pcaruana has quit IRC | 21:05 | |
*** ivve has quit IRC | 21:07 | |
*** cfriesen has quit IRC | 21:24 | |
melwitt | mnaser, imacdonn: hi, as responders to the ML thread awhile back, I have a spec up for showing server status as UNKNOWN if host status is UNKNOWN that has been receiving some reviews. your reviews would be helpful for deciding whether it goes forward https://review.opendev.org/666181 | 21:31 |
*** cfriesen has joined #openstack-nova | 21:31 | |
mriedem | dansmith: you may want to drop your +2 to a +1 or 0 https://review.opendev.org/#/c/457886/ until i get the ceph job results on it | 21:33 |
openstackgerrit | Miguel Ángel Herranz Trillo proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | 21:41 |
*** panda has quit IRC | 21:42 | |
mriedem | dansmith: nvm, lyarwood already had a patch up to test that | 21:44 |
*** panda has joined #openstack-nova | 21:45 | |
*** takashin has joined #openstack-nova | 21:50 | |
*** rcernin has joined #openstack-nova | 22:00 | |
*** mlavalle has quit IRC | 22:09 | |
*** xek__ has quit IRC | 22:10 | |
*** shilpasd has quit IRC | 22:11 | |
*** eharney has joined #openstack-nova | 22:12 | |
mriedem | efried: are you ok with me just pushing up this test change and +2ing? https://review.opendev.org/#/c/659780/3/nova/tests/unit/virt/disk/mount/test_nbd.py | 22:14 |
efried | ... | 22:14 |
efried | mriedem: yes, and I'll +A | 22:15 |
*** luksky has quit IRC | 22:16 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | 22:16 |
mriedem | done | 22:17 |
efried | +A | 22:18 |
*** rcernin has quit IRC | 22:19 | |
*** rcernin has joined #openstack-nova | 22:20 | |
mnaser | melwitt: left a comment thanks :D | 22:25 |
melwitt | thanks | 22:43 |
*** tkajinam has joined #openstack-nova | 23:05 | |
*** threestrands has joined #openstack-nova | 23:15 | |
*** igordc has quit IRC | 23:25 | |
*** threestrands has quit IRC | 23:29 | |
*** mriedem has quit IRC | 23:42 | |
*** hongbin has quit IRC | 23:43 | |
*** slaweq has quit IRC | 23:50 | |
*** icarusfactor has quit IRC | 23:51 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!