rm_work | so I'm still not hearing anything that is a problem for my deployment | 00:00 |
dansmith | heh, okay | 00:00 |
rm_work | and if I can do it, others can do it | 00:00 |
dansmith | here's one problem: file injection has been deprecated :P | 00:00 |
rm_work | yes, thanks | 00:00 |
rm_work | I happened to notice that recently ;P | 00:01 |
penick | So the root problem is you need to put arbitrary files in an instance, or you need instances to have x509 certs chains? | 00:02 |
rm_work | i don't think we're going to get anywhere today on this, maybe we pick up at the PTG | 00:02 |
rm_work | right now it's a cert-chain, PK, and agent config | 00:02 |
rm_work | so I guess "arbitrary files" | 00:02 |
rm_work | and they all contain data that we consider "sensitive" | 00:03 |
rm_work | (obviously, in the case of the PK) | 00:03 |
penick | Are they "shared" secrets, like the keypair for a public website? Barbican might be the right place for those. | 00:03 |
rm_work | well, specifically the PK | 00:03 |
rm_work | we do use Barbican, but our instances have no way to auth against it | 00:03 |
rm_work | one is shared | 00:04 |
rm_work | the other is generated specifically for the VM in question | 00:04 |
rm_work | (the PK) | 00:04 |
penick | We generate secrets on our instances, then have another system the instances call to have their CSR signed; it asserts their identity before it's signed by our root of trust | 00:05 |
rm_work | ok so that there is the important bit | 00:05 |
johnsom | Speaking of FF - had to go take care of that. Yeah I think at least a floppy disk's worth of storage is reasonable. lol Like the PXE boot image size. | 00:05 |
rm_work | we USE those certs/PK to assert identity | 00:05 |
rm_work | how do you assert VM's identity without that? | 00:05 |
rm_work | i mean, that is exactly our workflow | 00:06 |
rm_work | well ... ALMOST our workflow | 00:06 |
rm_work | we reach out to the VM, not the other way around | 00:06 |
johnsom | Yeah, this was the whole discussion that led us to what was implemented years ago. | 00:06 |
dansmith | penick is headed down the right path here, which is not to pass everything to nova and expect it to keep it (most people) or disavow it (some people), and only give nova enough information to let you interact with some service that can do what you want | 00:06 |
dansmith | information that is not sensitive forever | 00:06 |
johnsom | I'm just concerned that if we don't trust how we store and handle images we are in trouble before we even get to config data and establishing secure channels. | 00:07 |
penick | We create a signed bearer document that's time limited and place it in the instance, on boot the instance creates a PK and CSR, then sends those along with the attestation document (created as part of vendor data) to the token server, which verifies the signature in the attestation document (and then invalidates the document) then calls openstack to verify the details in the CSR | 00:08 |
rm_work | which "we" is that? | 00:08 |
penick | eg, ensure the IP, UUID, etc in the CSR match the instance | 00:08 |
rm_work | which service | 00:08 |
rm_work | because that does sound like the workflow we're aiming for | 00:08 |
penick | the service is called Athenz, and the system we've built to integrate it into OpenStack is called copper argos | 00:08 |
penick | I have a talk on it, one sec.. | 00:09 |
rm_work | i was hoping to just glance at the repo | 00:09 |
rm_work | https://github.com/yahoo/athenz ? | 00:09 |
penick | https://www.openstack.org/videos/vancouver-2018/attestable-service-identity-with-copper-argos | 00:10 |
penick | yup | 00:10 |
rm_work | https://github.com/yahoo/athenz/blob/master/docs/copper_argos_dev.md | 00:10 |
penick | Ayup, that's it | 00:10 |
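The Copper Argos flow penick outlines above — a time-limited, signed bearer document placed in the instance via vendor data, redeemed exactly once by a token server before the CSR is signed — can be sketched minimally as follows. This is a hedged illustration, not Athenz's actual implementation: an HMAC with a shared key stands in for the real signing root of trust, and every name here is a hypothetical stand-in.

```python
import hashlib
import hmac
import time

SHARED_KEY = b"demo-signing-key"  # stand-in for the provider's root of trust

def issue_attestation(instance_uuid, ttl=300):
    """Vendor-data side: create a time-limited, signed bearer document."""
    expires = int(time.time()) + ttl
    payload = f"{instance_uuid}:{expires}"
    sig = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

_redeemed = set()  # the token server invalidates documents after first use

def verify_attestation(doc, instance_uuid):
    """Token-server side: check signature, identity, expiry, and one-time use
    before the CSR would be signed."""
    payload, sig = doc["payload"], doc["sig"]
    expected = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    uuid, expires = payload.rsplit(":", 1)
    if uuid != instance_uuid or int(expires) < time.time():
        return False
    if payload in _redeemed:  # replayed bearer document -> reject
        return False
    _redeemed.add(payload)
    return True
```

In the real system the token server additionally calls back into OpenStack to check that the IP, UUID, etc. in the CSR match the instance; that step is omitted here.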
rm_work | so basically, we're screwed once Stein hits, and we have to get something like this working before then? :P | 00:11 |
rm_work | sounds like another day at the office, lol | 00:11 |
penick | I feel like it benefits me to say Yes :) | 00:11 |
rm_work | we'll investigate | 00:11 |
dansmith | rm_work: you should really read the spec you're freaking out about | 00:11 |
rm_work | I did | 00:11 |
dansmith | "Since personality file injection will still be supported with older microversions, there will be nothing removed from the backend compute code related to file injection" | 00:11 |
penick | We're eager to have other people use this, so lmk if y'all (who..are..you?) are interested in using Athenz. It'd be good to get other organizations using/contributing to Athenz | 00:12 |
rm_work | yeah, but in Octavia we don't necessarily control the nova deployments | 00:12 |
rm_work | so we can't guarantee they have the thing enabled | 00:12 |
rm_work | but we still need our stuff to work | 00:12 |
dansmith | rm_work: oooh, I have good news for you | 00:12 |
rm_work | penick: we'd be writing something like that into Octavia | 00:12 |
dansmith | rm_work: user_data will always work? see how nice it is to have features that don't come and go with the deployment choices? :) | 00:12 |
rm_work | lol | 00:13 |
rm_work | except user-data already doesn't work :P | 00:13 |
johnsom | Well, nova is a stable api, so it shouldn't be going away any time soon or they are dropping their stable assertion.... | 00:13 |
penick | We'll be using octavia with this in the near future. It's one of the things we have to suss out this qtr | 00:13 |
dansmith | you mean jamming a bus into your wallet won't work | 00:13 |
penick | But, we already have Athenz in place | 00:13 |
penick | dansmith: Well not with that attitude | 00:13 |
dansmith | johnsom: that's what I'm trying to point out | 00:13 |
rm_work | but you're saying it's already disabled in most nova deploys? | 00:14 |
dansmith | johnsom: which is what you get if you read a paragraph down below "and now lose your mind" | 00:14 |
dansmith | rm_work: no, we're saying that file injection is disabled, but as you pointed out we're putting those personality files into the config drive the first time we make it | 00:14 |
rm_work | [16:38:53] <dansmith>so this has been disabled by default for libvirt for a long time, | 00:15 |
rm_work | ^^ so what did that mean? | 00:15 |
dansmith | rm_work: file. injection. | 00:15 |
rm_work | yes, which has always worked via personality files? | 00:15 |
dansmith | rm_work: you saw the part where I said "I'm not sure how this is going into config drive" and then ... found and quoted the code right? | 00:15 |
rm_work | maybe? | 00:15 |
johnsom | dansmith I was shocked because we hadn't heard of this and it was the *way* to do this securely and reliably and user-data was .... less than ideal | 00:16 |
rm_work | https://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L191-L194 this link? | 00:16 |
rm_work | I thought that was via libvirt using the thing you said was disabled | 00:16 |
dansmith | johnsom: you know that config drive is disable-able and depending on it is also not reliable yeah? | 00:16 |
dansmith | rm_work: no | 00:16 |
dansmith | rm_work: I get that it says libvirt there, but... | 00:17 |
johnsom | dansmith We force require it as the metadata service was swiss cheese and blew up if you booted more than a few instances at a time | 00:17 |
rm_work | if that's not "file injection" then I don't know | 00:17 |
dansmith | rm_work: the rest of the spec is talking about file injection specifically, which has nothing to do with config drive and is all about violating the very sanctity of the image by forcing large things into small holes | 00:17 |
rm_work | err | 00:18 |
penick | rm_work what's generating the secrets that you're putting into the instance? (amphora vms?) | 00:18 |
rm_work | so *are we using file injection or not*? | 00:18 |
dansmith | I'm serious, you should totes read the spec :) | 00:18 |
rm_work | I read the spec | 00:18 |
rm_work | several sections more than once | 00:18 |
rm_work | so obviously whatever you're hinting at, i'm not going to get | 00:18 |
johnsom | Yeah, the terminology in that spec is super confusing compared to the nova API and client API | 00:18 |
dansmith | that's the point of the first #1 bullet | 00:19 |
rm_work | this whole conversation started because I asked "is what we are doing the deprecated file injection" and multiple people said "yes" | 00:19 |
dansmith | users can't know whether they will get the files they send, because either the deployment may have actual injection disabled (the default), | 00:19 |
rm_work | which #1 bullet, there are several | 00:19 |
dansmith | or they may have disabled config drive (the other way to get these files) | 00:19 |
dansmith | rm_work: I said the first :) | 00:19 |
rm_work | (in fact, I DID notice something new by re-reading -- that SECTION has two, rofl) | 00:20 |
openstackgerrit | Merged openstack/nova master: conf: Add '[neutron] physnets' and related options https://review.openstack.org/564440 | 00:20 |
dansmith | let me try to restate this whole thing | 00:20 |
dansmith | and if that doesn't help, then I'll leave and you can keep your torches and pitchforks for whatever you want | 00:21 |
dansmith | in the olden times, | 00:21 |
dansmith | there was a feature called "file injection" | 00:21 |
dansmith | there are two halves of said feature: | 00:21 |
dansmith | 1. The API (personality files) by which people provide this data which may get ignored if config is unfriendly | 00:21 |
johnsom | Anyhow, any chance we can bump that max size of user-data up to a floppy size? Is it just the API limitation and a DB column alter, or is cloud-init going to need to spin too? | 00:22 |
dansmith | 2. The actual injection part, where the virt driver (some not all) could inject files into images forcibly, literally by taking a hard-coded partition number, and writing over it with your data | 00:22 |
dansmith | are you with me? | 00:22 |
dansmith | config drive didn't exist at this point | 00:22 |
dansmith | aight, I guess nobody wants to hear my story | 00:24 |
rm_work | i'm trying to parse it | 00:24 |
dansmith | which part? | 00:24 |
rm_work | so, file-injection IS what we're using, correct? so right now, we are using both halves of this? | 00:24 |
dansmith | no, | 00:24 |
dansmith | you're using the first part, | 00:24 |
rm_work | or this was just the past, and it's changed now, and you're getting to that | 00:24 |
dansmith | and another part I haven't gotten to yet | 00:24 |
rm_work | k | 00:25 |
dansmith | the #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of | 00:25 |
dansmith | however, the first part is problematic because we don't store it and it breaks several of our other features (agree to disagree on this) | 00:25 |
dansmith | so, in the middle ages, long before you showed up, | 00:25 |
dansmith | this config_drive thing was created | 00:25 |
dansmith | which was a way to avoid the metadata server's restrictions, complication, whatever | 00:26 |
dansmith | apparently when we create that the first time, we also put those files in there (TIL) | 00:26 |
dansmith | but we can't re-create it later, which is the #2 part of the spec problem section | 00:27 |
dansmith | so, | 00:27 |
dansmith | you're using the API part, and the config drive part, but not the actual injection thing which is the most smelly bit | 00:27 |
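The "API part" dansmith describes — personality files — is just a list of guest-path / base64-contents pairs in the server-create request body. A minimal sketch of building that list (field names follow the Nova API reference for `POST /servers`; the helper name is made up, and this whole mechanism is what's being deprecated here):

```python
import base64

def personality_entries(files):
    """Build the 'personality' list for a POST /servers request body:
    each entry pairs a guest path with base64-encoded file contents."""
    return [
        {"path": path, "contents": base64.b64encode(data).decode("ascii")}
        for path, data in sorted(files.items())
    ]

# Example request fragment (illustrative paths only):
body = {"server": {"name": "amphora-demo",
                   "personality": personality_entries(
                       {"/etc/octavia/agent.conf": b"[DEFAULT]\n"})}}
```

Nothing here reaches libvirt's partition-writing injection path; with injection disabled, these entries only surface via the config drive, as discussed above.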
rm_work | ha, right, which is funny because the #2 "problem" is actually WHY we chose this method | 00:27 |
dansmith | fine, but whatever | 00:27 |
rm_work | ok so if #2 was the bad part, and that's just not done anymore... why is the first part being removed? | 00:27 |
dansmith | #2 is related to the API not the really bad part | 00:28 |
rm_work | err | 00:28 |
rm_work | sorry, PART 1 and 2 | 00:28 |
dansmith | the API part is bad because it takes arbitrary files and then kinda keeps track of them, until a rebuild or something and then we lose them | 00:29 |
rm_work | per "1. The API (personality files) by which people provide this data" and "the #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of" | 00:29 |
rm_work | hmmm | 00:29 |
dansmith | the #2 part is the libvirt injection partition thing | 00:29 |
dansmith | sorry | 00:29 |
dansmith | eff, | 00:29 |
rm_work | yeah | 00:30 |
dansmith | this straightening isn't going well | 00:30 |
rm_work | so right, #2 part (libvirt) isn't even done anymore | 00:30 |
rm_work | now it puts things into config-drive | 00:30 |
rm_work | which is ... fine? | 00:30 |
rm_work | it's just that nova then loses track of that data, which you consider bad (but we don't) | 00:30 |
rm_work | (and it has worked that way for a while?) | 00:31 |
dansmith | okay, you know, it's after 5pm and I'm getting more frustrated here, so I'm just going to go | 00:31 |
rm_work | kk | 00:31 |
rm_work | prolly just discussing at the PTG is best | 00:31 |
Ileixe | Hello guys | 00:32 |
Ileixe | Recently I implemented custom hooking code for the server create API in nova-api via the hook API. | 00:33 |
johnsom | My take away. There was some nasty bit taking files and making some strange partition at boot. We aren't using that and never have. Then there is the bit that takes files, stashes them in the config drive, and cloud-init drops them in the guest filesystem. This is what we use. However, to remove the partition stuff the config drive part got removed too | 00:33 |
Ileixe | Oh sorry, there was a conversation going on just now. Never mind, I'll ask later. | 00:35 |
rm_work | Ileixe: we are ... wrapped up on that :P | 00:35 |
rm_work | it's fine | 00:35 |
rm_work | lol | 00:35 |
Ileixe | Thanks rm_work :) just a simple question. I found the hook API was deprecated, and it was the right thing for my logic, so I wonder what replaces the hook API | 00:37 |
melwitt | argh, looks like we have a new gate failure as of today | 00:57 |
melwitt | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags:screen-n-cpu.txt&from=7d | 00:57 |
melwitt | unless it's only the numa-aware-vswitches patches that are affected... looking closer | 00:58 |
melwitt | it's hitting several of the numa-aware-vswitches patches but is hitting other patches as well. started very recently | 01:04 |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 01:07 |
mriedem | melwitt: i was noticing those randomly the last couple of weeks | 01:44 |
mriedem | unless it's major, just recheck | 01:44 |
melwitt | mriedem: oh, logstash was claiming it started today. and I was wondering if it might be related to https://review.openstack.org/522537 | 01:45 |
melwitt | I've rechecked the numa patches at least twice because of it so far. maybe it's a coincidence. I'll keep trying to recheck | 01:45 |
mriedem | hmm, yeah it might be, mostly hitting on the live migration and multinode jobs | 01:46 |
mriedem | which is where that is turned on | 01:46 |
mriedem | well that would be...awesome | 01:47 |
mriedem | can you report a neutron bug? | 01:47 |
melwitt | that patch landed at 13:00 (my time) which coincides with the logstash start of hits | 01:47 |
melwitt | mriedem: can do. was just writing it up for nova not realizing it's neutron. will copy it over and open for neutron | 01:48 |
mriedem | it could be either | 01:48 |
mriedem | just add both | 01:49 |
melwitt | oh, right. we can do that | 01:49 |
mriedem | Kevin_Zheng: fyi, might need to see if zhaobo can investigate this ^ | 01:49 |
mriedem | mlavalle is already gone for the day | 01:49 |
Kevin_Zheng | ACK, I will ask him | 01:49 |
mriedem | melwitt: there would be an easy way to disable it in nova if needed | 01:50 |
melwitt | k | 01:50 |
mriedem | and then could be tracked as an rc bug (it will need to be an rc bug anyway) | 01:50 |
mriedem | rather than revert | 01:50 |
openstackgerrit | Matt Riedemann proposed openstack/nova-specs master: Fix problem description number in deprecate file injection spec https://review.openstack.org/586385 | 01:51 |
mriedem | i'm also going to fast approve ^ b/c of the confusion i saw in the backscroll | 01:51 |
Kevin_Zheng | mriedem, could you provide an error log? | 01:55 |
dansmith | mriedem: way ahead of you | 01:55 |
Kevin_Zheng | mriedem, never mind, I got it | 01:55 |
melwitt | mriedem: https://bugs.launchpad.net/neutron/+bug/1783917 | 01:57 |
openstack | Launchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [Undecided,New] | 01:57 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: api-ref: document user_data length restriction https://review.openstack.org/586388 | 01:57 |
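The restriction that patch documents is the API schema's cap on the base64-encoded `user_data` string (65535 bytes per the api-ref). A hedged sketch of validating a payload client-side before boot — the constant mirrors the documented limit, and the helper name is made up:

```python
import base64

MAX_USER_DATA = 65535  # documented Nova API cap on the *encoded* user_data

def encode_user_data(raw):
    """Base64-encode user_data as the Nova API expects, rejecting payloads
    whose encoded form exceeds the documented limit."""
    encoded = base64.b64encode(raw).decode("ascii")
    if len(encoded) > MAX_USER_DATA:
        raise ValueError(
            f"user_data is {len(encoded)} bytes encoded; limit is {MAX_USER_DATA}")
    return encoded
```

Because base64 inflates data by roughly 4/3, the usable raw payload is well under 64 KiB — nowhere near a floppy's worth, which is the size johnsom was asking about bumping it to.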
melwitt | Kevin_Zheng ^ | 01:57 |
Kevin_Zheng | Thanks | 01:57 |
mriedem | i'll push up an e-r and nova wip patch and then i have to run i think | 01:57 |
melwitt | oh, I'm not 100% sure it makes live migration "fail", I meant to change that to "raises" | 01:58 |
mriedem | e-r query https://review.openstack.org/#/c/586389/ | 01:59 |
mriedem | it fails | 01:59 |
mriedem | http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Live%20migration%20failed%5C%22%20AND%20message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d | 01:59 |
melwitt | although yeah, all the logstash hits containing the message are build failures | 02:00 |
melwitt | bah *changes it back* | 02:00 |
melwitt | cool, thanks for adding the e-r query | 02:01 |
sean-k-mooney | so I'm going to sleep now but http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_01_083831 looks like it's happening because we are calling unplug on the source node after we have activated the binding on the dest | 02:07 |
melwitt | sean-k-mooney: thanks. so maybe something we need to adjust given the use of the new binding API? I dunno | 02:08 |
melwitt | I'll add your comment to the bug | 02:08 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Temporarily disable port binding flows for live migration https://review.openstack.org/586391 | 02:09 |
mriedem | ^ is an option for temporarily disabling this while debugging a fix | 02:09 |
mriedem | i hope it doesn't have to come to that, but would understand if it's causing a lot of failures | 02:10 |
* melwitt nods | 02:10 | |
sean-k-mooney | melwitt: i can try and reproduce this in the morning. we probably need to store the original vif type, use that to construct the vif object, and use that to do the unplug on the host. | 02:10 |
melwitt | mriedem: okay, we'll decide what to do in the morning tomorrow when other people are around | 02:11 |
openstackgerrit | Merged openstack/nova-specs master: Fix problem description number in deprecate file injection spec https://review.openstack.org/586385 | 02:11 |
mriedem | yeah the error is from unplugging vifs in _post_live_migration which happens on the source, | 02:12 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L6581 | 02:12 |
mriedem | right before that, | 02:12 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L6572 | 02:12 |
mriedem | we activate the port bindings for the dest host | 02:13 |
melwitt | ah, I see | 02:13 |
sean-k-mooney | mriedem: yep that will deactivate all other port bindings for that port, meaning it will be in the unbound state on the source host | 02:13 |
melwitt | so just flip that? | 02:13 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/network/neutronv2/api.py#L2534 | 02:14 |
mriedem | i didn't know we couldn't unplug a deactivated port... | 02:14 |
melwitt | I wonder how it doesn't fail 100% of the time | 02:14 |
mriedem | melwitt: race | 02:14 |
mriedem | apparently | 02:14 |
melwitt | ah | 02:14 |
melwitt | yeah, what luck that the actual change *didn't* fail | 02:14 |
sean-k-mooney | mriedem: you're racing with the notification neutron sends for the port status change | 02:15 |
mriedem | hmm http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:15 |
mriedem | melwitt: i had seen this once in the series and mlavalle debugged it and couldn't find anything wrong | 02:15 |
mriedem | Jul 27 01:44:00.974248 ubuntu-xenial-rax-dfw-0001002000 nova-compute[2629]: DEBUG nova.network.neutronv2.api [None req-33283139-ba55-4106-b76c-8751a025f153 service nova] [instance: 6b72a721-0995-446e-848f-f407b788c7f4] Port 21095ff0-6bcd-414b-9d6f-b63e03aacb23 binding to destination host ubuntu-xenial-rax-dfw-0001002004 is already ACTIVE. {{(pid=2629) migrate_instance_start /opt/stack/new/nova/nova/network/neutronv2/api.py:25 | 02:15 |
melwitt | ah, okay | 02:15 |
mriedem | oh i know why it's already active, | 02:16 |
mriedem | because we activate the dest host port binding during post-copy | 02:16 |
mriedem | which is the whole point of the blueprint - to shorten the window of time that you don't have networking on the dest host | 02:16 |
melwitt | right | 02:17 |
melwitt | shorten the window | 02:17 |
mriedem | this is the unplug event http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_069526 | 02:18 |
mriedem | this is where we activate the ports on the dest host during post-copy http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_561391 | 02:18 |
mriedem | we could have the live migration method wait for the unplug event before starting with post live migration, but (1) i'm not sure that helps anything and (2) it might not work that way for all virt drivers - only libvirt + post-copy has this | 02:19 |
melwitt | yeah, events are sketch depending on which networking backend too, right | 02:20 |
melwitt | like ovs vs other | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Change deprecated policies to policy https://review.openstack.org/583434 | 02:20 |
mriedem | melwitt: shouldn't be in this case, | 02:20 |
mriedem | odl should send the event on host binding changes | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case https://review.openstack.org/574240 | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case https://review.openstack.org/574240 | 02:20 |
mriedem | just not plug/unplug | 02:20 |
melwitt | oh, because neutron knows about it and not relying on anything else? ok | 02:20 |
melwitt | just remember getting burned by the whole plug event thing for reboot | 02:21 |
melwitt | but that was because we do os-vif plug only, not any call to neutron, and the agent (or something) has to notice it | 02:21 |
sean-k-mooney | melwitt: the binding change is handled in the common ml2 layer if i remember correctly, yes. the port wire up/tear down event however has to come from the backend, not the common layer, hence the delta between odl/ovs in that case | 02:22 |
melwitt | sean-k-mooney: yeah, I was having trouble remembering what the deal was. thanks | 02:22 |
sean-k-mooney | melwitt: the reason it did not work with linux bridge is that it polls. the reason it did not work for odl was they were missing the handler for the event in odl to send it to the websocket created by networking-odl. i think they have fixed that. maybe | 02:24 |
sean-k-mooney | anyway nova is receiving the port update event in this case from neutron and it's updating the network info cache, so by the time we call nova_to_osvif_vif the vif_type is set to unbound and boom. if we still have the migration data object at this point we should have a copy of the original vif object that we could use instead of the info_cache version to work around it. | 02:27 |
mriedem | so migrate_instance_start() was always a noop before this series, | 02:28 |
*** shaohe_feng has quit IRC | 02:28 | |
mriedem | so its order in _post_live_migration would have never mattered except for nova-network | 02:28 |
mriedem | given we already call migrate_instance_start during post-copy, i don't think moving the order of those calls in _post_live_migration will matter, | 02:29 |
mriedem | because from these logs, i can see that when we call migrate_instance_start from _post_live_migration, it's a noop b/c the dest port binding is already active | 02:29 |
mriedem | http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:30 |
mriedem | so i would think it means, we need to handle unbound vifs during unplug in the driver? | 02:30 |
mriedem | or just not call unplug_vifs in certain cases | 02:30 |
mriedem | not totally sure though | 02:30 |
mriedem | all the libvirt driver does in post_live_migration_at_source is unplug_vifs | 02:31 |
sean-k-mooney | if we don't call unplug_vifs we could leak the linux bridges we create for ovs hybrid plug | 02:32 |
mriedem | umm... | 02:33 |
mriedem | oh i see what you were saying about storing off the vif_type then | 02:33 |
mriedem | b/c i was going to say, we could just not call unplug_vifs if the vif type (after refreshing the network info cache from neutron) was now 'unbound' | 02:33 |
mriedem | if it is, we can temporarily heal that using migrate_data.vifs | 02:33 |
mriedem | that has the vif type in it | 02:34 |
sean-k-mooney | mriedem: yep | 02:34 |
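The race and the proposed migrate_data workaround can be modeled as a toy simulation — all names here are hypothetical stand-ins, not Nova's real code; it only mirrors the ordering being discussed:

```python
class PortCache:
    """Toy stand-in for Nova's instance network info cache."""
    def __init__(self, vif_type):
        self.vif_type = vif_type

def activate_dest_binding(cache):
    # Activating the dest port binding deactivates the source binding; once
    # the cache is refreshed from Neutron, the source-side vif reads "unbound".
    cache.vif_type = "unbound"

def unplug(vif_type):
    # Mirrors os-vif conversion blowing up on an unbound vif type.
    if vif_type == "unbound":
        raise RuntimeError("Unsupported VIF type unbound")
    return f"unplugged {vif_type}"

cache = PortCache("ovs")
saved_vif_type = cache.vif_type      # snapshot kept in migrate_data before migration

# The failing ordering seen in the gate: activate dest binding, cache refresh,
# then unplug on the source using the (now stale-to-unbound) cache.
activate_dest_binding(cache)
try:
    unplug(cache.vif_type)           # raises: cache now says "unbound"
except RuntimeError:
    pass

# Workaround: unplug with the vif saved in migrate_data, not the refreshed cache.
result = unplug(saved_vif_type)
```

This also shows why skipping unplug entirely isn't safe: the saved vif still has to be torn down, or the hybrid-plug linux bridges leak, as sean-k-mooney notes.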
mriedem | ok i could try cooking something up real quick, | 02:34 |
mriedem | my wife is going to kill me though | 02:34 |
melwitt | you could do tomorrow morning? | 02:34 |
sean-k-mooney | i can try this in the morning too. i just need a 2 node vanila devstack install right | 02:35 |
melwitt | unless you were thinking to fast-approve this tonight | 02:35 |
mriedem | why would the vif type be unbound? | 02:36 |
mriedem | shouldn't it be bound to the dest host? | 02:36 |
mriedem | since we activated it there? | 02:36 |
sean-k-mooney | mriedem: it is. each host has its own binding now. only one will be in the bound state all the rest will be unbound | 02:37 |
mriedem | but i think the port in our info cache is not host-aware... | 02:38 |
mriedem | i need to check | 02:38 |
mriedem | http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_726935 | 02:39 |
mriedem | that's where we refresh the info cache in _post_live_migration | 02:40 |
mriedem | after activating the dest host port binding | 02:40 |
mriedem | [{"profile": {"migrating_to": "ubuntu-xenial-rax-dfw-0001002004"}, "ovs_interfaceid": null, "preserve_on_delete": false, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.10"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "10.1.0.0/28", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}], "meta": {"in | 02:40 |
mriedem | ed": false, "tenant_id": "7dbeedd7076e472091193779ebbcf887", "mtu": 1400}, "id": "1d8de970-331e-46b5-8c7b-574821e891e5", "label": "tempest-LiveMigrationTest-411356071-network"}, "devname": "tap21095ff0-6b", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {}, "address": "fa:16:3e:34:c9:90", "active": false, "type": "unbound", "id": "21095ff0-6bcd-414b-9d6f-b63e03aacb23", "qbg_params": null}] | 02:40 |
mriedem | yeah...that's wrong | 02:40 |
mriedem | it should be bound to the dest host | 02:40 |
sean-k-mooney | well it was bound shortly before http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_311896 | 02:44 |
mriedem | yup we hit post-copy callback here and activate the dest host port binding http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_561391 | 02:46 |
mriedem | refresh nw info cache here http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_310738 | 02:47 |
mriedem | then we get an unplugged vif event from neutron | 02:48 |
mriedem | could be concurrently | 02:48 |
sean-k-mooney | what's happening is likely that when the ovs neutron agent sees the tap device disappear, it sends an update to notify us the port state has changed on the source node. | 02:48 |
mriedem | yeah we get the unplugged event and refresh the cache and it's unbound http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_726935 | 02:48 |
*** shaohe_feng has quit IRC | 02:48 | |
mriedem | post live migrate the dest host port binding is already active http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:49 |
mriedem | then we unplug and kablammo | 02:50 |
mriedem | doesn't help that we route all of these plug/unplug neutron events to the source host only, that's a nova limitation during live migration right now | 02:50 |
mriedem | and there might be some kind of delay in the state updates or something in the neutron db? | 02:50 |
*** shaohe_feng has joined #openstack-nova | 02:51 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments https://review.openstack.org/586396 | 02:51 |
mriedem | anyway, i can hack around this a bit i think but kind of sucks | 02:51 |
sean-k-mooney | mriedem: well there is a delay in the neutron agent sending the update over the rabbit rpc bus to the neutron-server and then the rest call to nova. | 02:51 |
mriedem | i just worry the port isn't wired up on the dest or something, but that shouldn't be the case b/c we plug_vifs on the dest host during pre_live_migration now | 02:52 |
mriedem | it's just inactive until post-copy | 02:52 |
sean-k-mooney | we could probably hack in a filter to ignore any info cache updates where the vif type is unbound and the port profile contains a migrating_to field | 02:52 |
mriedem | yeah... | 02:53 |
mriedem | that would coincide with this http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_310738 | 02:53 |
sean-k-mooney | mriedem: yes if the pulgin fails in pre_live_migration we bail out early and try another host so at this point the dest networking shoudl be fully set up | 02:54 |
mriedem | also, if we get the info cache based on what's setup for the dest host, we could have changed vif types, so unplugging on the source could be a different vif type...couldn't it? | 02:55 |
mriedem | this gets a bit wonky | 02:56 |
mriedem | we do have an exact copy of the source_vif in the migrate data vifs | 02:56 |
sean-k-mooney | yes it could have changed. | 02:56 |
sean-k-mooney | yep | 02:56 |
sean-k-mooney | the migrate data has everything you need. | 02:56 |
sean-k-mooney | just look up the vif by the port uuid and unplug or better yet just loop over all the vifs in migrate data instead of instance | 02:57 |
mriedem | that's kind of what i'm going to do, will hack something up quick and post it then flesh it out more in the morning | 02:57 |
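A rough sketch of the approach mriedem describes: during post live migration, unplug using the exact source VIFs preserved in the migrate data rather than the (possibly refreshed and now unbound) instance info cache. The data shapes and the unplug callable are stand-ins for the real nova/os-vif objects.

```python
def unplug_source_vifs(migrate_data_vifs, unplug):
    """Unplug each source VIF recorded in the migrate data.

    source_vif is a snapshot taken before the dest binding was
    activated, so its vif type still reflects the source host even
    if the info cache has since been refreshed to 'unbound'.
    """
    for vif in migrate_data_vifs:
        unplug(vif["source_vif"])


unplugged = []
unplug_source_vifs(
    [{"port_id": "21095ff0", "source_vif": {"type": "ovs"}}],
    unplugged.append,
)
print(unplugged)  # [{'type': 'ovs'}]
```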
mriedem | sean-k-mooney: and for the love of toast go to bed | 02:57 |
bzhao__ | Sorry for the network break. I had a brief look at the neutron log from the link. For the failing test instance, it seems to work correctly on the Neutron side. | 02:58 |
sean-k-mooney | haha it's only 4 am, but ya. i'll be back in 6-8 hours and i'll take a look at it then. night o/ | 02:58 |
*** shaohe_feng has quit IRC | 02:59 | |
*** shaohe_feng has joined #openstack-nova | 03:00 | |
melwitt | bzhao__: thanks. feel free to add a comment to explain about the neutron side in https://bugs.launchpad.net/neutron/+bug/1783917 see comment #6 | 03:01 |
openstack | Launchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [High,Confirmed] | 03:01 |
bzhao__ | melwitt: Thanks, I will. ;-) | 03:02 |
mriedem | got a patch, pretty simple, no tests but can be easily added by someone else tonight or in the morning | 03:07 |
sapd | Hi everyone. I got this error when attach a SR-IOV port to instance http://paste.openstack.org/show/726723/ Please help me | 03:09 |
*** shaohe_feng has quit IRC | 03:09 | |
mriedem | sapd: read through https://docs.openstack.org/neutron/latest/admin/config-sriov.html and check everything in there | 03:10 |
melwitt | mriedem: coolness, sounds good | 03:10 |
*** shaohe_feng has joined #openstack-nova | 03:12 | |
sapd | mriedem: yep, I have read it and followed the guide to configure everything. My setup is correct, because I already launched an instance using SR-IOV successfully, but it did not receive DHCP. So I launched another instance using Open vSwitch, then added an SR-IOV port to it, and got the above error. | 03:14 |
melwitt | sapd: looks like the bug has been around for awhile and still not resolved https://bugs.launchpad.net/nova/+bug/1708433 they say you can boot with the port if you pass it during server create, but that attaching port separately is broken | 03:16 |
openstack | Launchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,Expired] | 03:16 |
*** abhishekk has quit IRC | 03:17 | |
melwitt | sapd: what release of nova are you using? | 03:18 |
sapd | melwitt: I'm using queens version. 17.0.4 | 03:18 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 03:18 |
mriedem | melwitt: bzhao__: Kevin_Zheng: sean-k-mooney: ^ just needs unit tests | 03:18 |
melwitt | sapd: okay, I'm going to re-open that bug and mention what version you saw it in. it will need to be worked on | 03:19 |
Kevin_Zheng | mriedem, got it, just finish reading launchpad report | 03:19 |
mriedem | ask sahid to look at it | 03:19 |
mriedem | the sriov bug i mean | 03:19 |
*** shaohe_feng has quit IRC | 03:19 | |
melwitt | k | 03:19 |
sean-k-mooney[m] | melwitt: we used to have an api check at one point to expressly forbid attaching sriov ports to existing instances. | 03:20 |
melwitt | hmm, interesting. I wonder what happened to that | 03:20 |
sapd | melwitt: I'm waiting. | 03:21 |
melwitt | hah | 03:21 |
*** shaohe_feng has joined #openstack-nova | 03:21 | |
sean-k-mooney[m] | melwitt: i'm guessing some of artom's changes | 03:21 |
melwitt | okay, I'll ask him about it | 03:23 |
*** dave-mccowan has quit IRC | 03:24 | |
openstackgerrit | Merged openstack/os-vif stable/rocky: Add vif_plug_noop to setup.cfg packages https://review.openstack.org/586340 | 03:26 |
melwitt | hot dog | 03:26 |
bzhao__ | mriedem: So so quick.... =。= | 03:29 |
*** shaohe_feng has quit IRC | 03:29 | |
*** annp has quit IRC | 03:31 | |
*** tiendc has quit IRC | 03:31 | |
*** trungnv has quit IRC | 03:31 | |
melwitt | I think I'm gonna give up on rechecking the r-3 patches, seems like a pretty high fail rate with the live migration thing | 03:31 |
*** shaohe_feng has joined #openstack-nova | 03:32 | |
*** tiendc has joined #openstack-nova | 03:32 | |
*** trungnv has joined #openstack-nova | 03:32 | |
melwitt | get the fix sorted in the morning and go from there | 03:32 |
*** annp has joined #openstack-nova | 03:32 | |
*** gbarros has quit IRC | 03:39 | |
*** shaohe_feng has quit IRC | 03:40 | |
*** shaohe_feng has joined #openstack-nova | 03:40 | |
*** vladikr has quit IRC | 03:45 | |
*** vladikr has joined #openstack-nova | 03:45 | |
mriedem | should have tests done pretty soon | 03:48 |
*** shaohe_feng has quit IRC | 03:50 | |
*** vladikr has quit IRC | 03:51 | |
*** vladikr has joined #openstack-nova | 03:51 | |
*** shaohe_feng has joined #openstack-nova | 03:52 | |
*** links has joined #openstack-nova | 03:52 | |
*** Dinesh_Bhor has quit IRC | 03:52 | |
*** gongysh has quit IRC | 03:52 | |
*** yamahata has joined #openstack-nova | 03:53 | |
*** Dinesh_Bhor has joined #openstack-nova | 03:54 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 03:56 |
mriedem | alright gang there it is with a test ^ | 03:57 |
* melwitt clicks | 03:58 | |
*** Dinesh_Bhor has quit IRC | 04:00 | |
*** shaohe_feng has quit IRC | 04:00 | |
*** shaohe_feng has joined #openstack-nova | 04:01 | |
*** vladikr has quit IRC | 04:03 | |
*** vladikr has joined #openstack-nova | 04:03 | |
mriedem | and now i'm going to bed | 04:04 |
mriedem | o/ | 04:04 |
*** mriedem has quit IRC | 04:04 | |
melwitt | gnite | 04:04 |
*** mschuppert has joined #openstack-nova | 04:06 | |
*** tiendc has quit IRC | 04:10 | |
*** shaohe_feng has quit IRC | 04:10 | |
*** tiendc has joined #openstack-nova | 04:11 | |
*** slaweq has joined #openstack-nova | 04:11 | |
*** shaohe_feng has joined #openstack-nova | 04:11 | |
*** slaweq has quit IRC | 04:16 | |
*** shaohe_feng has quit IRC | 04:21 | |
*** mdnadeem has joined #openstack-nova | 04:21 | |
*** itlinux has joined #openstack-nova | 04:22 | |
*** shaohe_feng has joined #openstack-nova | 04:22 | |
*** pcaruana has joined #openstack-nova | 04:28 | |
*** pcaruana has quit IRC | 04:30 | |
*** shaohe_feng has quit IRC | 04:31 | |
*** shaohe_feng has joined #openstack-nova | 04:33 | |
*** shaohe_feng has quit IRC | 04:41 | |
*** shaohe_feng has joined #openstack-nova | 04:41 | |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 04:47 |
*** gongysh has joined #openstack-nova | 04:50 | |
*** shaohe_feng has quit IRC | 04:51 | |
*** shaohe_feng has joined #openstack-nova | 04:53 | |
*** vladikr has quit IRC | 04:53 | |
*** vladikr has joined #openstack-nova | 04:54 | |
*** flwang1 has quit IRC | 04:59 | |
*** shaohe_feng has quit IRC | 05:02 | |
*** shaohe_feng has joined #openstack-nova | 05:02 | |
*** vladikr has quit IRC | 05:05 | |
*** itlinux has quit IRC | 05:05 | |
*** tbachman has joined #openstack-nova | 05:06 | |
*** vladikr has joined #openstack-nova | 05:08 | |
*** tbachman has quit IRC | 05:11 | |
vishakha | melwitt : Hi, waiting for your response https://review.openstack.org/#/c/580271/. Thanks | 05:11 |
*** shaohe_feng has quit IRC | 05:12 | |
*** slaweq has joined #openstack-nova | 05:13 | |
*** shaohe_feng has joined #openstack-nova | 05:14 | |
*** tbachman has joined #openstack-nova | 05:16 | |
*** Bhujay has joined #openstack-nova | 05:17 | |
*** slaweq has quit IRC | 05:17 | |
*** Bhujay has quit IRC | 05:21 | |
*** shaohe_feng has quit IRC | 05:22 | |
*** shaohe_feng has joined #openstack-nova | 05:23 | |
*** vladikr has quit IRC | 05:27 | |
*** vladikr has joined #openstack-nova | 05:29 | |
*** shaohe_feng has quit IRC | 05:32 | |
*** sridharg has joined #openstack-nova | 05:32 | |
*** shaohe_feng has joined #openstack-nova | 05:34 | |
*** shaohe_feng has quit IRC | 05:43 | |
*** shaohe_feng has joined #openstack-nova | 05:46 | |
*** tbachman has quit IRC | 05:46 | |
*** vladikr has quit IRC | 05:48 | |
*** josecastroleon has joined #openstack-nova | 05:48 | |
*** vladikr has joined #openstack-nova | 05:51 | |
*** trungnv has quit IRC | 05:51 | |
*** annp has quit IRC | 05:51 | |
*** tiendc has quit IRC | 05:51 | |
*** tiendc has joined #openstack-nova | 05:52 | |
*** trungnv has joined #openstack-nova | 05:52 | |
*** annp has joined #openstack-nova | 05:52 | |
*** zigo_ has joined #openstack-nova | 05:53 | |
*** zigo has quit IRC | 05:53 | |
*** shaohe_feng has quit IRC | 05:53 | |
*** shaohe_feng has joined #openstack-nova | 05:54 | |
*** Luzi has joined #openstack-nova | 05:54 | |
*** vladikr has quit IRC | 06:01 | |
*** vladikr has joined #openstack-nova | 06:02 | |
*** shaohe_feng has quit IRC | 06:03 | |
*** shaohe_feng has joined #openstack-nova | 06:05 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 06:08 |
*** shaohe_feng has quit IRC | 06:13 | |
*** shaohe_feng has joined #openstack-nova | 06:15 | |
*** alexchadin has joined #openstack-nova | 06:15 | |
*** sapd has quit IRC | 06:22 | |
*** sapd has joined #openstack-nova | 06:23 | |
*** shaohe_feng has quit IRC | 06:24 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 06:25 |
*** shaohe_feng has joined #openstack-nova | 06:26 | |
*** tiendc_ has joined #openstack-nova | 06:28 | |
*** tiendc has quit IRC | 06:30 | |
ileixe | Hello again | 06:32 |
ileixe | Does any body know how to expand APIExtensionBase for pre-processing not for post-processing..? | 06:33 |
*** shaohe_feng has quit IRC | 06:34 | |
*** shaohe_feng has joined #openstack-nova | 06:35 | |
*** abhishekk has joined #openstack-nova | 06:41 | |
*** mgoddard has joined #openstack-nova | 06:41 | |
*** shaohe_feng has quit IRC | 06:44 | |
*** shaohe_feng has joined #openstack-nova | 06:45 | |
*** vladikr has quit IRC | 06:45 | |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 06:47 |
*** vladikr has joined #openstack-nova | 06:48 | |
*** mgoddard has quit IRC | 06:50 | |
*** brault has joined #openstack-nova | 06:51 | |
*** tesseract has joined #openstack-nova | 06:52 | |
*** shaohe_feng has quit IRC | 06:54 | |
*** shaohe_feng has joined #openstack-nova | 06:56 | |
*** rcernin has quit IRC | 07:00 | |
*** ispp has joined #openstack-nova | 07:00 | |
*** liuyulong__ has joined #openstack-nova | 07:02 | |
*** shaohe_feng has quit IRC | 07:05 | |
*** shaohe_feng has joined #openstack-nova | 07:05 | |
*** liuyulong_ has quit IRC | 07:06 | |
*** ileixe has quit IRC | 07:09 | |
*** ttsiouts has joined #openstack-nova | 07:14 | |
*** shaohe_feng has quit IRC | 07:15 | |
openstackgerrit | Chen proposed openstack/nova master: Make nova-manage capable of syncing all cell databases https://review.openstack.org/519275 | 07:15 |
*** tiendc has joined #openstack-nova | 07:15 | |
*** tiendc_ has quit IRC | 07:16 | |
*** shaohe_feng has joined #openstack-nova | 07:16 | |
*** ccamacho has joined #openstack-nova | 07:20 | |
*** dtantsur|afk is now known as dtantsur | 07:21 | |
*** ttsiouts has quit IRC | 07:24 | |
*** shaohe_feng has quit IRC | 07:25 | |
*** shaohe_feng has joined #openstack-nova | 07:26 | |
*** ileixe has joined #openstack-nova | 07:27 | |
*** ispp has quit IRC | 07:27 | |
*** AlexeyAbashkin has joined #openstack-nova | 07:29 | |
*** gibi is now known as giblet | 07:30 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 07:33 |
*** shaohe_feng has quit IRC | 07:35 | |
*** shaohe_feng has joined #openstack-nova | 07:37 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments https://review.openstack.org/586396 | 07:43 |
*** shaohe_feng has quit IRC | 07:46 | |
*** shaohe_feng has joined #openstack-nova | 07:46 | |
*** tssurya has joined #openstack-nova | 07:48 | |
*** ispp has joined #openstack-nova | 07:48 | |
*** alexchadin has quit IRC | 07:52 | |
*** ttsiouts has joined #openstack-nova | 07:54 | |
*** shaohe_feng has quit IRC | 07:56 | |
*** shaohe_feng has joined #openstack-nova | 07:57 | |
*** rpittau has quit IRC | 07:57 | |
*** rpittau has joined #openstack-nova | 07:57 | |
*** dtantsur is now known as dtantsur|bbl | 08:00 | |
*** abhishekk has quit IRC | 08:04 | |
*** alexchadin has joined #openstack-nova | 08:05 | |
*** shaohe_feng has quit IRC | 08:06 | |
*** vladikr has quit IRC | 08:07 | |
*** vladikr has joined #openstack-nova | 08:08 | |
*** shaohe_feng has joined #openstack-nova | 08:08 | |
*** mgoddard has joined #openstack-nova | 08:12 | |
*** tetsuro has quit IRC | 08:14 | |
*** vladikr has quit IRC | 08:15 | |
*** vladikr has joined #openstack-nova | 08:15 | |
*** shaohe_feng has quit IRC | 08:16 | |
*** shaohe_feng has joined #openstack-nova | 08:19 | |
*** bauzas is now known as PapaOurs | 08:19 | |
kashyap | Hey folks, I'm hitting a "POST_FAILURE" state for the 'nova-live-migration' CI job; seems like a Zuul problem? | 08:20 |
kashyap | (For this change: https://review.openstack.org/#/c/567258/) | 08:20 |
PapaOurs | kashyap: nothing raised by infra AFAIK | 08:21 |
PapaOurs | kashyap: but maybe you should ask in #openstack-infra ? | 08:21 |
kashyap | Nod; in the past I've seen the channel topic being changed when such errors occurred. | 08:21 |
kashyap | PapaOurs: Yep, was just about to check there. | 08:21 |
kashyap | When I look into the log, it's the SSH failing | 08:22 |
*** derekh has joined #openstack-nova | 08:23 | |
*** shaohe_feng has quit IRC | 08:27 | |
*** shaohe_feng has joined #openstack-nova | 08:28 | |
*** avolkov has joined #openstack-nova | 08:28 | |
*** mgoddard has quit IRC | 08:34 | |
*** flwang1 has joined #openstack-nova | 08:34 | |
*** shaohe_feng has quit IRC | 08:37 | |
*** jaosorior has quit IRC | 08:38 | |
*** shaohe_feng has joined #openstack-nova | 08:38 | |
*** vivsoni has quit IRC | 08:41 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 08:43 |
*** mgoddard has joined #openstack-nova | 08:43 | |
*** flwang1 has quit IRC | 08:46 | |
*** shaohe_feng has quit IRC | 08:47 | |
*** shaohe_feng has joined #openstack-nova | 08:49 | |
*** lifeless has quit IRC | 08:54 | |
*** vladikr has quit IRC | 08:55 | |
*** vladikr has joined #openstack-nova | 08:55 | |
*** vishakha has quit IRC | 08:57 | |
*** shaohe_feng has quit IRC | 08:57 | |
*** jaosorior has joined #openstack-nova | 08:58 | |
*** shaohe_feng has joined #openstack-nova | 08:58 | |
*** vivsoni has joined #openstack-nova | 09:05 | |
*** shaohe_feng has quit IRC | 09:08 | |
*** shaohe_feng has joined #openstack-nova | 09:08 | |
*** flwang1 has joined #openstack-nova | 09:09 | |
*** josecastroleon has quit IRC | 09:09 | |
*** lifeless has joined #openstack-nova | 09:11 | |
*** vladikr has quit IRC | 09:11 | |
*** vladikr has joined #openstack-nova | 09:12 | |
*** akki has joined #openstack-nova | 09:12 | |
*** akki has quit IRC | 09:13 | |
*** akki has joined #openstack-nova | 09:13 | |
akki | can we take lxd container snapshots and use them to launch new containers? | 09:15 |
*** cdent has joined #openstack-nova | 09:18 | |
*** naichuans has quit IRC | 09:18 | |
*** shaohe_feng has quit IRC | 09:18 | |
*** josecastroleon has joined #openstack-nova | 09:18 | |
PapaOurs | do folks have any idea why we stupidly set the device owner of a port to be compute:<instance_az> ? | 09:18 |
openstackgerrit | huanhongda proposed openstack/nova master: hypervisor-stats shows wrong disk usages with shared storage https://review.openstack.org/149878 | 09:18 |
*** vladikr has quit IRC | 09:21 | |
*** shaohe_feng has joined #openstack-nova | 09:21 | |
*** shaohe_feng has quit IRC | 09:28 | |
*** shaohe_feng has joined #openstack-nova | 09:29 | |
*** MultipleCrashes has joined #openstack-nova | 09:29 | |
MultipleCrashes | Looking for further review from sometime , please have a look https://review.openstack.org/#/c/563418/ | 09:29 |
openstackgerrit | huanhongda proposed openstack/nova master: Change the metadata re to match the unicode https://review.openstack.org/536236 | 09:32 |
*** vladikr has joined #openstack-nova | 09:33 | |
*** MultipleCrashes has quit IRC | 09:37 | |
*** shaohe_feng has quit IRC | 09:38 | |
*** shaohe_feng has joined #openstack-nova | 09:41 | |
*** Dinesh_Bhor has joined #openstack-nova | 09:45 | |
*** andymccr- has joined #openstack-nova | 09:47 | |
*** shaohe_feng has quit IRC | 09:49 | |
*** jaosorior has quit IRC | 09:49 | |
*** shaohe_feng has joined #openstack-nova | 09:49 | |
*** andymccr_ has quit IRC | 09:50 | |
*** johnthetubaguy has quit IRC | 09:52 | |
*** flwang1 has quit IRC | 09:55 | |
*** flwang1 has joined #openstack-nova | 09:56 | |
*** shaohe_feng has quit IRC | 09:59 | |
*** shaohe_feng has joined #openstack-nova | 10:00 | |
*** flwang1 has quit IRC | 10:00 | |
*** vladikr has quit IRC | 10:03 | |
*** stakeda has quit IRC | 10:03 | |
*** vladikr has joined #openstack-nova | 10:04 | |
*** andymccr has quit IRC | 10:04 | |
*** andymccr- is now known as andymccr | 10:05 | |
*** liuzz_ has quit IRC | 10:09 | |
*** shaohe_feng has quit IRC | 10:09 | |
*** ispp has quit IRC | 10:09 | |
*** shaohe_feng has joined #openstack-nova | 10:10 | |
*** Dinesh_Bhor has quit IRC | 10:10 | |
*** flwang1 has joined #openstack-nova | 10:13 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 10:15 |
*** trungnv has quit IRC | 10:18 | |
*** shaohe_feng has quit IRC | 10:19 | |
*** shaohe_feng has joined #openstack-nova | 10:21 | |
*** alexchadin has quit IRC | 10:26 | |
*** cdent has quit IRC | 10:27 | |
*** shaohe_feng has quit IRC | 10:30 | |
*** shaohe_feng has joined #openstack-nova | 10:30 | |
*** ttsiouts has quit IRC | 10:31 | |
*** ispp has joined #openstack-nova | 10:33 | |
*** vladikr has quit IRC | 10:36 | |
*** vladikr has joined #openstack-nova | 10:36 | |
sean-k-mooney[m] | kashyap: post_failure means the job failed to upload the logs/result | 10:36 |
kashyap | sean-k-mooney[m]: Ah, I see | 10:36 |
kashyap | sean-k-mooney[m]: I hit a recheck, let's see if it goes through. | 10:37 |
kashyap | sean-k-mooney[m]: Would you happen to have time to have a gander at this: https://review.openstack.org/#/c/567258/ ("libvirt: Remove usage of migrateToURI{2} APIs") | 10:37 |
kashyap | Fairly mechanical, but some churn in there. | 10:37 |
kashyap | (The 'recheck' is still in progress, though.) | 10:38 |
kashyap | It's slow as molasses. | 10:38 |
sean-k-mooney[m] | Am sure. I'll take a look once i get coffee | 10:38 |
sean-k-mooney[m] | It's feature freeze time; the gate is under a lot of load. Recheck is all you could have done in this case | 10:39 |
*** alexchadin has joined #openstack-nova | 10:39 | |
*** shaohe_feng has quit IRC | 10:40 | |
*** shaohe_feng has joined #openstack-nova | 10:42 | |
kashyap | Ah, right | 10:47 |
*** gongysh has quit IRC | 10:47 | |
*** dtantsur|bbl is now known as dtantsur | 10:49 | |
*** sridharg has quit IRC | 10:50 | |
*** brault_ has joined #openstack-nova | 10:50 | |
*** shaohe_feng has quit IRC | 10:50 | |
*** shaohe_feng has joined #openstack-nova | 10:51 | |
openstackgerrit | Merged openstack/nova master: doc: add missing permission for the vCenter service account https://review.openstack.org/585683 | 10:52 |
*** brault has quit IRC | 10:53 | |
*** savvas has quit IRC | 10:53 | |
*** savvas has joined #openstack-nova | 10:53 | |
*** vladikr has quit IRC | 10:55 | |
*** vladikr has joined #openstack-nova | 10:55 | |
*** gilfoyle has joined #openstack-nova | 10:58 | |
gilfoyle | I'm trying to replicate some of what nova (the cli util) is doing. This is an old deployment of openstack. My goal is to understand how it gets the zone-related information from the database when no zones are created | 10:59 |
gilfoyle | could someone help me by pointing out where in the repos should I be looking for this? | 11:00 |
gilfoyle | the relevant command is `nova availability-zone-list` | 11:00 |
*** shaohe_feng has quit IRC | 11:00 | |
*** shaohe_feng has joined #openstack-nova | 11:01 | |
sean-k-mooney | gilfoyle: what is the result you are getting and what were you expecting | 11:04 |
sean-k-mooney | there are 2 default azs that exist without you creating any | 11:05 |
sean-k-mooney | internal and nova | 11:05 |
sean-k-mooney | the controller nodes will be in internal and all computes will be in nova | 11:05 |
*** dave-mccowan has joined #openstack-nova | 11:06 | |
*** pooja_jadhav has quit IRC | 11:08 | |
sean-k-mooney | kashyap: i was going to ask why there is a migrateToURI(), migrateToURI2() and migrateToURI3(), then i remembered libvirt is written in c... | 11:08 |
gilfoyle | sean-k-mooney: my issue is that I'm running a query against the database that's not returning me any of the computes in the `nova` az, yet from the nova command above I do see them there | 11:10 |
*** shaohe_feng has quit IRC | 11:11 | |
sean-k-mooney | gilfoyle: yes i think the api layer injects the nova az before it gets to the client | 11:11 |
*** shaohe_feng has joined #openstack-nova | 11:11 | |
*** takedakn has joined #openstack-nova | 11:12 | |
gilfoyle | is it the case that if a compute node has been added without specifying an AZ, the reporting then returns it as being in `nova`? that's how I've handled it in the past | 11:13 |
*** s10 has joined #openstack-nova | 11:15 | |
sean-k-mooney | gilfoyle: yes and that is still how it's handled today | 11:15 |
gilfoyle | or, let me restate: if the compute node has not been added to an AZ, it ends up in 'nova'? I've seen occasions where the aggregates.name came up as NULL, so I used the following shortcut in mysql `IFNULL(aggregates.name, 'nova') as zone` | 11:16 |
sean-k-mooney | gilfoyle: ah no, if you have added a host to a host aggregate and you have set the availability_zone metadata key on the aggregate, it should not show up in nova anymore | 11:17 |
gilfoyle | ah, that explains my conundrum then, however, I now have a different question/ask | 11:18 |
gilfoyle | what's the case where aggregates.name is NULL? | 11:18 |
gilfoyle | if this isn't an obvious one, then I'll go back to the drawing board and try to analyse it further :) | 11:19 |
sean-k-mooney | gilfoyle: i believe we allow you to have a host aggregate where you only set the uuid | 11:19 |
sean-k-mooney | i can't remember off the top of my head why, however | 11:20 |
*** shaohe_feng has quit IRC | 11:21 | |
gilfoyle | ah, cool :) | 11:22 |
*** flwang1 has quit IRC | 11:22 | |
*** vivsoni has quit IRC | 11:23 | |
*** takedakn has quit IRC | 11:23 | |
*** shaohe_feng has joined #openstack-nova | 11:24 | |
sean-k-mooney | gilfoyle: the name field on the aggregate is not the availability_zone name, by the way. it's the host aggregate name, just in case you thought they were the same | 11:24 |
sean-k-mooney | i mean i personally always set them the same but they don't have to be | 11:24 |
gilfoyle | sean-k-mooney: Oh. interesting, I've been using a query with a relationship between aggregates, aggregate_hosts, compute_nodes and services tables to try and get all nodes for all AZs | 11:25 |
*** flwang1 has joined #openstack-nova | 11:26 | |
sean-k-mooney | gilfoyle: an availability zone isn't really a thing in nova. it's just a host aggregate with a metadata key called availability_zone in it | 11:27 |
*** cdent has joined #openstack-nova | 11:28 | |
*** Shilpa has quit IRC | 11:28 | |
*** ttsiouts has joined #openstack-nova | 11:29 | |
sean-k-mooney | so to get all hosts in an az you just find the host aggregate with the correct metadata key, then list its hosts. | 11:29 |
*** pooja_jadhav has joined #openstack-nova | 11:29 | |
sean-k-mooney | the nova and internal az are special however | 11:29 |
gilfoyle | could you possibly eyeball this and see if you can spot any obvious assumption(s) https://paste.ubuntu.com/p/sDFRDffzpy/ ? | 11:30 |
sean-k-mooney | i think the nova az is calculated by generating a list of hosts that are not part of another az | 11:31 |
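The behaviour sean-k-mooney outlines can be sketched like this: an availability zone is just a host aggregate carrying an "availability_zone" metadata key, and the default "nova" AZ is whatever compute hosts are left over after every real AZ has claimed its members. The data shapes below are invented for illustration, not nova's actual objects.

```python
def hosts_by_az(all_compute_hosts, aggregates, default_az="nova"):
    """Group compute hosts by availability zone.

    Aggregates without an availability_zone metadata key are plain
    host aggregates and do not define an AZ; hosts claimed by no AZ
    fall into the default 'nova' zone.
    """
    result = {}
    claimed = set()
    for agg in aggregates:
        az = agg.get("metadata", {}).get("availability_zone")
        if az is None:
            continue  # a host aggregate, but not an AZ
        result.setdefault(az, set()).update(agg["hosts"])
        claimed.update(agg["hosts"])
    result[default_az] = set(all_compute_hosts) - claimed
    return result


azs = hosts_by_az(
    ["cn1", "cn2", "cn3"],
    [{"metadata": {"availability_zone": "az1"}, "hosts": ["cn1"]}],
)
print(sorted(azs["nova"]))  # ['cn2', 'cn3']
```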
*** jamesde__ has joined #openstack-nova | 11:31 | |
*** shaohe_feng has quit IRC | 11:31 | |
*** jamesden_ has quit IRC | 11:32 | |
*** flwang1 has quit IRC | 11:33 | |
*** jamesde__ has quit IRC | 11:34 | |
*** shaohe_feng has joined #openstack-nova | 11:34 | |
gilfoyle | that seems to make sense to me, so I assume it does that as a separate step/query in the `nova` cli? would you have any idea where this is defined in the source? | 11:34 |
sean-k-mooney | gilfoyle: i think services.topic = 'compute' can be changed in the nova conf, so that might be more fragile than looking at the service.binary | 11:34 |
*** flwang1 has joined #openstack-nova | 11:34 | |
*** alexchadin has quit IRC | 11:34 | |
*** alexchadin has joined #openstack-nova | 11:35 | |
sean-k-mooney | gilfoyle: but that should list the capacity of all compute nodes ordered by the az they are in | 11:35 |
*** alexchadin has quit IRC | 11:35 | |
*** alexchadin has joined #openstack-nova | 11:36 | |
gilfoyle | yes, that's the goal, but for a cluster w/o any zones, I don't see its only compute node in the output. Probably because it needs to be a separate query as you suggested above :) | 11:36 |
sean-k-mooney | actually no it won't | 11:36 |
*** alexchadin has quit IRC | 11:36 | |
*** alexchadin has joined #openstack-nova | 11:36 | |
sean-k-mooney | ya, that's because you are matching on the aggregate name, not the az name | 11:37 |
*** alexchadin has quit IRC | 11:37 | |
sean-k-mooney | actually that's not quite true either | 11:37 |
sean-k-mooney | by default you will not have any aggregates, so the left join on aggregate_hosts.host = compute_nodes.hypervisor_hostname will filter out all the hosts | 11:38 |
gilfoyle | yup, that became apparent after your nugget above, too :) | 11:39 |
*** kholkina has joined #openstack-nova | 11:39 | |
sean-k-mooney | gilfoyle: so what you need to do, rather than set aggregate.name to nova if null, is to also join this result with a subquery on the compute_nodes table for every host that is not in the first result set | 11:40 |
gilfoyle | thank you sean-k-mooney! :) | 11:41 |
*** shaohe_feng has quit IRC | 11:41 | |
sean-k-mooney | gilfoyle: do you want to view this by host_aggregate or availability zone, by the way? | 11:42 |
sean-k-mooney | the service has the az embedded https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/objects/service.py#L190 | 11:42 |
*** shaohe_feng has joined #openstack-nova | 11:42 | |
gilfoyle | by availability zone :) | 11:42 |
*** abhishekk has joined #openstack-nova | 11:46 | |
openstackgerrit | Merged openstack/nova master: [placement] Use base test in placement functional tests https://review.openstack.org/585778 | 11:49 |
*** shaohe_feng has quit IRC | 11:52 | |
kashyap | sean-k-mooney: Was AFK for lunch | 11:52 |
kashyap | sean-k-mooney: Hehe, yeah. I linked to a libvirt commit that explains it | 11:52 |
*** tiendc has quit IRC | 11:53 | |
*** shaohe_feng has joined #openstack-nova | 11:53 | |
*** shaohe_feng has quit IRC | 12:02 | |
*** shaohe_feng has joined #openstack-nova | 12:03 | |
*** linkmark has joined #openstack-nova | 12:03 | |
*** savvas has quit IRC | 12:04 | |
*** medberry has joined #openstack-nova | 12:04 | |
*** ispp has quit IRC | 12:08 | |
*** savvas has joined #openstack-nova | 12:09 | |
*** savvas has quit IRC | 12:11 | |
*** savvas_ has joined #openstack-nova | 12:11 | |
*** shaohe_feng has quit IRC | 12:12 | |
*** shaohe_feng has joined #openstack-nova | 12:14 | |
*** alexchadin has joined #openstack-nova | 12:16 | |
*** edmondsw has joined #openstack-nova | 12:17 | |
*** johnthetubaguy has joined #openstack-nova | 12:17 | |
*** alexchadin has quit IRC | 12:20 | |
*** ispp has joined #openstack-nova | 12:20 | |
*** armaan has joined #openstack-nova | 12:22 | |
*** shaohe_feng has quit IRC | 12:22 | |
*** shaohe_feng has joined #openstack-nova | 12:23 | |
*** sridharg has joined #openstack-nova | 12:24 | |
*** wolverineav has joined #openstack-nova | 12:26 | |
*** annp has quit IRC | 12:27 | |
*** Shilpa has joined #openstack-nova | 12:31 | |
*** mdnadeem has quit IRC | 12:32 | |
*** alexchadin has joined #openstack-nova | 12:33 | |
*** shaohe_feng has quit IRC | 12:33 | |
*** shaohe_feng has joined #openstack-nova | 12:33 | |
*** lyan has joined #openstack-nova | 12:34 | |
*** lyan is now known as Guest87808 | 12:34 | |
*** vladikr has quit IRC | 12:35 | |
*** mriedem has joined #openstack-nova | 12:35 | |
*** ispp has quit IRC | 12:36 | |
*** savvas_ has quit IRC | 12:40 | |
*** armaan has quit IRC | 12:41 | |
*** shaohe_feng has quit IRC | 12:43 | |
*** flwang1 has quit IRC | 12:43 | |
*** shaohe_feng has joined #openstack-nova | 12:44 | |
*** armaan has joined #openstack-nova | 12:45 | |
*** savvas has joined #openstack-nova | 12:45 | |
*** flwang1 has joined #openstack-nova | 12:46 | |
mriedem | http://status.openstack.org/elastic-recheck/index.html#1783917 is clearly our top code-related gate failure so need eyes on the proposed fix https://review.openstack.org/#/c/586402/ | 12:47 |
giblet | mriedem: as sean-k-mooney is +1 on the change I'm going to approve it | 12:49 |
mriedem | giblet: ok. i'm looking at what other calls we make on the source, | 12:49 |
*** armaan has quit IRC | 12:49 | |
mriedem | rollback_live_migration looks OK - nothing directly using the info cache in there | 12:49 |
*** savvas has quit IRC | 12:50 | |
*** armaan has joined #openstack-nova | 12:50 | |
*** ttsiouts has quit IRC | 12:50 | |
PapaOurs | mriedem: there were some POST_FAILURE gate issues this morning too | 12:51 |
mriedem | PapaOurs: that's not code related | 12:51 |
PapaOurs | yup, I know, just FYI | 12:51 |
mriedem | and has been a known issue the last few weeks with one of the node providers | 12:51 |
PapaOurs | that I didn't know of | 12:51 |
PapaOurs | either way, giblet +Wd your change | 12:52 |
*** shaohe_feng has quit IRC | 12:53 | |
*** armaan has quit IRC | 12:54 | |
*** savvas has joined #openstack-nova | 12:54 | |
*** shaohe_feng has joined #openstack-nova | 12:55 | |
*** ttsiouts has joined #openstack-nova | 12:56 | |
mriedem | i do see one potential place i missed | 12:57 |
*** savvas has quit IRC | 12:59 | |
mriedem | giblet: comment inline, i'll do a follow up | 12:59 |
*** rmart04 has joined #openstack-nova | 12:59 | |
giblet | mriedem: OK, cool | 13:00 |
*** pchavva has joined #openstack-nova | 13:01 | |
*** vladikr has joined #openstack-nova | 13:01 | |
mriedem | hyperv ci failed but on unrelated tests | 13:01 |
mriedem | looks like those were failing due to ssh and timeouts | 13:01 |
mriedem | {7} tempest.api.volume.test_volumes_extend.VolumesExtendTest.test_volume_extend_when_volume_has_snapshot [365.093541s] ... FAILED | 13:01 |
mriedem | huh | 13:03 |
mriedem | 2018-07-27 05:15:36.661 5060 105049744 MainThread WARNING nova.scheduler.client.report [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] Failed to save allocation for 390d33d0-36e2-469e-85be-8ec10658e953. Got HTTP 400: {"errors": [{"status": 400, "request_id": "req-fc67d1c6-b641-475a-afdf-27075995c0ff", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: {} does not have enough properties Failed validating 'minProperties' in schema['properties']['allocations']['items']['properties']['resources']: {'additionalProperties': False, 'minProperties': 1, 'patternProperties': {'^[0-9A-Z_]+$': {'minimum': 1, 'type': 'integer'}}, 'type': 'object'} On instance['allocations'][0]['resources']: {} ", "title": "Bad Request"}]} | 13:03 |
*** shaohe_feng has quit IRC | 13:03 | |
*** ispp has joined #openstack-nova | 13:03 | |
mriedem | Sending updated allocation [{'resource_provider': {'uuid': u'b2979fd7-376b-4f9e-a1b9-b4c69d619cb9'}, 'resources': {}}] for instance 390d33d0-36e2-469e-85be-8ec10658e953 | 13:03 |
mriedem | 2018-07-27 05:15:36.513 5060 105049744 MainThread INFO nova.compute.manager [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] [instance: 390d33d0-36e2-469e-85be-8ec10658e953] Doing legacy allocation math for migration 8221f52a-c72b-4b7b-81d9-67cb67fb37bc after instance move | 13:04 |
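The 400 in the paste above is placement's JSON-schema validation rejecting an empty `resources` dict. A pure-Python sketch of the two rules that matter here (placement's real validator uses jsonschema; this just mirrors the 'minProperties': 1 and patternProperties checks from the error message):

```python
import re

# Toy re-implementation of the two schema rules from the error above:
# resources must have at least one key, and each key must match
# ^[0-9A-Z_]+$ with an integer amount >= 1.
def validate_resources(resources):
    if len(resources) < 1:
        return False, "{} does not have enough properties"
    for name, amount in resources.items():
        if not re.match(r"^[0-9A-Z_]+$", name):
            return False, "bad resource class name: %s" % name
        if not isinstance(amount, int) or amount < 1:
            return False, "bad amount for %s: %r" % (name, amount)
    return True, None

# The allocation nova sent had an empty resources dict -> HTTP 400.
ok, err = validate_resources({})
print(ok, err)

# What a healthy resources fragment looks like:
ok2, _ = validate_resources({"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1})
print(ok2)  # True
```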
*** shaohe_feng has joined #openstack-nova | 13:04 | |
mriedem | i'm not sure why the hyperv ci would be hitting that in rocky | 13:05 |
mriedem | edmondsw: powervm in-tree ci took over 5 hours here and timed out https://review.openstack.org/#/c/586402/ | 13:06 |
mriedem | fyi | 13:06 |
*** mgariepy has quit IRC | 13:08 | |
*** mgariepy has joined #openstack-nova | 13:10 | |
*** edleafe is now known as figleaf | 13:11 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 13:11 |
cdent | Is this already a known thing: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%5C%22 | 13:12 |
cdent | oh never mind, my search on launchpad just hit | 13:13 |
mriedem | http://status.openstack.org/elastic-recheck/index.html#1783917 | 13:13 |
cdent | it didn't when I was missing a closing t | 13:13 |
mriedem | fix is in the gate | 13:13 |
cdent | cool, thanks | 13:13 |
*** flwang1 has quit IRC | 13:14 | |
*** antosh has joined #openstack-nova | 13:14 | |
*** shaohe_feng has quit IRC | 13:14 | |
*** shaohe_feng has joined #openstack-nova | 13:14 | |
mriedem | based on the 50 mocks i have to do in _post_live_migration, clearly that method is too big | 13:15 |
cdent | ugh | 13:16 |
*** savvas has joined #openstack-nova | 13:16 | |
*** cdent has quit IRC | 13:16 | |
*** savvas has quit IRC | 13:21 | |
*** savvas has joined #openstack-nova | 13:21 | |
*** abhishekk has quit IRC | 13:21 | |
*** shaohe_feng has quit IRC | 13:24 | |
*** shaohe_feng has joined #openstack-nova | 13:25 | |
edmondsw | mriedem the powervm ci is borked right now. I'm trying to help get it fixed | 13:26 |
*** ttsiouts has quit IRC | 13:26 | |
*** mdrabe has joined #openstack-nova | 13:26 | |
*** gbarros has joined #openstack-nova | 13:27 | |
*** flwang1 has joined #openstack-nova | 13:31 | |
*** jistr is now known as jistr|mtg | 13:32 | |
*** Luzi has quit IRC | 13:32 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 13:33 |
*** shaohe_feng has quit IRC | 13:34 | |
dansmith | efried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider? | 13:35 |
*** shaohe_feng has joined #openstack-nova | 13:35 | |
dansmith | because I have yet to convince it to do that in a functional test | 13:36 |
*** alexchadin has quit IRC | 13:36 | |
*** medberry has quit IRC | 13:37 | |
*** gilfoyle has quit IRC | 13:37 | |
*** alexchadin has joined #openstack-nova | 13:42 | |
*** burt has joined #openstack-nova | 13:43 | |
*** tbachman has joined #openstack-nova | 13:44 | |
*** shaohe_feng has quit IRC | 13:44 | |
*** shaohe_feng has joined #openstack-nova | 13:45 | |
*** fanzhang has quit IRC | 13:45 | |
*** fanzhang has joined #openstack-nova | 13:45 | |
*** alexchadin has quit IRC | 13:49 | |
*** shaohe_feng has quit IRC | 13:55 | |
*** ttsiouts has joined #openstack-nova | 13:56 | |
*** shaohe_feng has joined #openstack-nova | 13:56 | |
*** mlavalle has joined #openstack-nova | 13:57 | |
mriedem | speaking of, i think this is going to be the money patch https://review.openstack.org/#/c/586363/ | 13:57 |
mriedem | creates a shared storage provider using the DISK_GB calculated from the compute node provider, then removes the compute node provider's DISK_GB inventory before the compute service host is discovered | 13:58 |
*** awaugama has joined #openstack-nova | 13:58 | |
*** med_ has quit IRC | 13:58 | |
s10 | Please check this bug: https://bugs.launchpad.net/nova/+bug/1784006 | 13:59 |
openstack | Launchpad bug 1784006 in OpenStack Compute (nova) "Instances misses neutron QoS on their ports after unrescue and soft reboot" [Undecided,New] | 13:59 |
*** ttsiouts has quit IRC | 14:00 | |
*** blkart has quit IRC | 14:01 | |
s10 | User can easily drop QoS limitations on ports with _soft_reboot() or unrescue() for libvirt driver. | 14:01 |
*** blkart has joined #openstack-nova | 14:01 | |
*** ttsiouts has joined #openstack-nova | 14:02 | |
mriedem | s10: i think we do plug_vifs on hard reboot now, but maybe not in pike... | 14:04 |
mriedem | or maybe only for certain types of vifs... | 14:04 |
mriedem | it's kind of a mess | 14:04 |
*** shaohe_feng has quit IRC | 14:05 | |
s10 | plug_vifs is executed on hard reboot and in spawn(), but not on soft reboot. In master: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2706 | 14:06 |
mriedem | oh right, i missed the "without" here: "Execute nova reboot (without parameter --hard)" | 14:06 |
melwitt | hm, I thought for soft reboot they shouldn't have been unplugged in the first place, but the bug says a new domain is created, which I didn't think happened either. I wonder if something changed there | 14:07 |
*** shaohe_feng has joined #openstack-nova | 14:08 | |
*** gbarros has quit IRC | 14:08 | |
dansmith | soft reboot will turn into a hard reboot if the guest doesn't shut down voluntarily right? | 14:08 |
melwitt | shutdown and then a create | 14:08 |
mriedem | correct | 14:08 |
dansmith | it's trivial for me to make my guest not shut down when asked | 14:09 |
mriedem | but apparently in this case soft reboot works | 14:09 |
melwitt | looking at the code, indeed it does a guest.shutdown() followed by a create. so you'd think you'd have to plug the vifs in again, I wonder how this normally works? | 14:09 |
*** links has quit IRC | 14:10 | |
dansmith | hmm, it doesn't do an actual reboot? | 14:10 |
*** gilfoyle has joined #openstack-nova | 14:10 | |
melwitt | doesn't look like it? I guess I've never looked at soft reboot in detail before https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L2547 | 14:11 |
dansmith | hmm, yeah, I didn't think this was like this | 14:11 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration https://review.openstack.org/586568 | 14:12 |
mriedem | giblet: ^ | 14:12 |
giblet | mriedem: looking | 14:12 |
dansmith | I thought if we were running a virt that could do real reboot, we did that and only fell back to the shutdown/restart if not | 14:13 |
dansmith | but I don't see that | 14:13 |
melwitt | mriedem: why were you thinking not to use the source vifs throughout the entire method? just wondering | 14:13 |
mriedem | no particular reason, just wanted to minimize the amount of change, | 14:14 |
mriedem | but we could just do that at the top rather than get the refreshed nw info cache | 14:14 |
*** gilfoyle has quit IRC | 14:14 | |
mriedem | i.e. here https://review.openstack.org/#/c/586568/1/nova/compute/manager.py@6555 | 14:15 |
*** eharney has joined #openstack-nova | 14:15 | |
*** shaohe_feng has quit IRC | 14:15 | |
mriedem | i can definitely make that change if it makes more sense | 14:17 |
*** shaohe_feng has joined #openstack-nova | 14:17 | |
melwitt | yeah, I'm not 100% sure but it feels like it should be consistent throughout. but I guess that's never guaranteed anyway, because neutron events in flight could change the network_info as it goes through the method? | 14:17 |
mriedem | shouldn't | 14:18 |
*** jlvacation is now known as jlvillal | 14:18 | |
mriedem | an event would be processed separately and shouldn't be able to modify that network_info variable by reference | 14:18 |
melwitt | oh, yeah, okay | 14:18 |
mriedem | the instance.info_cache might be updated concurrently, sure | 14:18 |
mriedem | but we're using the local variable in most places | 14:18 |
*** r-daneel has joined #openstack-nova | 14:18 | |
melwitt | yeah | 14:18 |
mriedem | the versioned notifications will still use instance.info_cache | 14:18 |
*** felipemonteiro has joined #openstack-nova | 14:19 | |
mriedem | left that as a comment so giblet can also ponder it | 14:20 |
mriedem | i didn't do it in https://review.openstack.org/#/c/586402/ because (1) it was late and (2) i just wanted to get the immediate fire put out | 14:20 |
melwitt | I guess I could see the rationale in only using the source vifs for the relevant actions because like I think you mentioned, maybe the notifications should reflect the state of the network info cache at the time it was queried | 14:20 |
melwitt | that's the only other thing network info is used for in that method, I assume? | 14:21 |
pooja_jadhav | mriedem: hello | 14:21 |
*** gilfoyle has joined #openstack-nova | 14:22 | |
pooja_jadhav | sean-k-mooney : hello | 14:22 |
mriedem | and unfilter_instance in the firewall driver, | 14:22 |
mriedem | i looked at how it was used in the various drivers and it was just getting the mac address off the vifs in one case | 14:23 |
mriedem | which i don't think should change | 14:23 |
mriedem | but, | 14:23 |
mriedem | admittedly, only passing the source vifs from migrate_data to 2 spots indicates tight coupling into knowing exactly what those methods are doing with network_info | 14:23 |
giblet | mriedem, melwitt: I think having the current network info sent in the notification is the valid thing, as we are notifying about the current state | 14:23 |
pooja_jadhav | sean-k-mooney, mriedem: I am trying live migrate and using nfs storage, its failing for "Binding failed for port e973dde6-d68c-4aec-a70d-86dcd81fa11b and host Neha-VirtualBox." | 14:24 |
mriedem | pooja_jadhav: i can't really help you debug that right now | 14:24 |
melwitt | giblet, mriedem: I think that makes sense too, the more I think about it | 14:24 |
mriedem | pooja_jadhav: i'd suggest using something besides devstack if you want a more sophisticated deployment tool for multi-node with live migration, like openstack-ansible | 14:24 |
pooja_jadhav | mriedem: ok | 14:24 |
mriedem | melwitt: i'm totally fine with making the generic switch at the top of the method | 14:25 |
mriedem | i don't like the tight coupling that's in here really | 14:25 |
mriedem | i just wanted to reduce any exposure to regression | 14:25 |
mriedem | pooja_jadhav: or look at a nova-live-migration job config and see how it set things up | 14:25 |
mriedem | but those don't use nfs | 14:25 |
*** shaohe_feng has quit IRC | 14:25 | |
mriedem | http://logs.openstack.org/02/586402/2/check/nova-live-migration/2db7a54/ | 14:25 |
*** med_ has joined #openstack-nova | 14:25 | |
*** med_ has quit IRC | 14:25 | |
*** med_ has joined #openstack-nova | 14:25 | |
*** alexchadin has joined #openstack-nova | 14:26 | |
mriedem | pooja_jadhav: binding failed means something failed in neutron | 14:26 |
mriedem | so network is messed up | 14:26 |
*** gilfoyle has quit IRC | 14:26 | |
pooja_jadhav | mriedem: Hmm | 14:26 |
*** mdrabe has quit IRC | 14:26 | |
*** shaohe_feng has joined #openstack-nova | 14:27 | |
melwitt | mriedem: yeah, I'm thinking I agree with giblet though, that we should leave it the way you have it. let the notifications use the fresh network info and not artificially send source vif. I think the only reason to use source vifs there is if somehow a notifications listener might want to know which vif is actually being acted upon during the actions in the method. hmm. | 14:27 |
pooja_jadhav | mriedem: But I am not able to see any error logs at neutron side.. thats the problem | 14:28 |
mriedem | melwitt: giblet: well, only the versioned notifications will use the instance.info_cache, | 14:28 |
mriedem | the legacy ones would end up using the source vifs | 14:28 |
melwitt | oh | 14:29 |
*** links has joined #openstack-nova | 14:29 | |
mriedem | anyway, we could always change this later i guess if it causes some other unanticipated problem | 14:29 |
melwitt | yeah | 14:30 |
mriedem | let me check to make sure the mac address on the vif is the same between source and dest | 14:30 |
mriedem | since that's used in the firewall driver to unfilter | 14:30 |
*** alexchadin has quit IRC | 14:30 | |
giblet | mriedem: in the current code the legacy notification uses the local network_info and I guess that is the same as what the versioned gets from instance.info_cache | 14:31 |
mriedem | source vif "address": "fa:16:3e:cc:ff:66" | 14:31 |
mriedem | from the cache: "address": "fa:16:3e:cc:ff:66" | 14:31 |
mriedem | so yeah the mac doesn't change | 14:31 |
mriedem | giblet: yes | 14:31 |
giblet | mriedem: then I still think that the current code in your patch is good | 14:32 |
*** cdent has joined #openstack-nova | 14:32 | |
*** breton has quit IRC | 14:33 | |
*** gilfoyle has joined #openstack-nova | 14:34 | |
s10 | What could be done with the unrescue/soft reboot QoS issue? Should we use _create_domain_and_network() in those functions instead of a simple _create_domain()? Or call plug_vifs()? | 14:34 |
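A schematic of the gap s10 is reporting: the hard-reboot path goes through VIF plugging (which is where neutron QoS gets reapplied on the host), while soft reboot's shutdown-and-create path does not. The class and method names below echo the libvirt driver, but the bodies are stand-ins modeling the reported behavior, not the real driver:

```python
# Toy model only: the point is structural, i.e. which reboot path ends up
# calling plug_vifs() again.
class FakeDriver:
    def __init__(self):
        self.qos_applied = False

    def plug_vifs(self):
        self.qos_applied = True   # QoS rules are (re)applied here

    def _create_domain(self):
        pass                      # bare domain create, no VIF replug

    def _create_domain_and_network(self):
        self.plug_vifs()          # domain create plus VIF replug

    def soft_reboot(self):
        # shutdown + bare create: QoS state on the host is silently lost
        self.qos_applied = False
        self._create_domain()

    def hard_reboot(self):
        self.qos_applied = False
        self._create_domain_and_network()

drv = FakeDriver()
drv.plug_vifs()
drv.soft_reboot()
print(drv.qos_applied)  # False: the QoS setup never came back

drv.hard_reboot()
print(drv.qos_applied)  # True
```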
* giblet is logging off for the weekend | 14:35 | |
*** shaohe_feng has quit IRC | 14:36 | |
*** shaohe_feng has joined #openstack-nova | 14:37 | |
*** jistr|mtg is now known as jistr | 14:39 | |
*** tidwellr has joined #openstack-nova | 14:39 | |
*** gilfoyle has quit IRC | 14:39 | |
*** tidwellr has quit IRC | 14:39 | |
*** tidwellr has joined #openstack-nova | 14:39 | |
*** bhagyashris has quit IRC | 14:41 | |
*** flwang1 has quit IRC | 14:41 | |
mriedem | woot ceph shared storage change got through stack.sh and is now running tempest | 14:42 |
*** felipemonteiro_ has joined #openstack-nova | 14:42 | |
cdent | huzzah | 14:42 |
dansmith | cdent: did you see my question to efried earlier? | 14:42 |
cdent | dansmith: no sir, what's up? | 14:43 |
openstackgerrit | Chris Dent proposed openstack/nova master: [placement] Retry allocation writes server side https://review.openstack.org/586048 | 14:43 |
dansmith | [06:36:22] <dansmith>efried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider? | 14:43 |
dansmith | [06:36:46] <dansmith>because I have yet to convince it to do that in a functional test | 14:43 |
dansmith | cdent: ^ | 14:43 |
cdent | one sec, let me find something | 14:44 |
cdent | dansmith: this is current passing: https://github.com/cdent/placecat/blob/master/gabbits/fridge.yaml#L204-L213 | 14:45 |
cdent | which is an example of some allocations with sharing providers | 14:45 |
*** [fcandido] has joined #openstack-nova | 14:45 | |
cdent | so in theory it should work, but I'm not clear on what needs to happen on compute-node side to set things up | 14:45 |
*** felipemonteiro has quit IRC | 14:46 | |
dansmith | cdent: that is asserting what? that one of the providers only has a part of the whole? | 14:46 |
*** shaohe_feng has quit IRC | 14:46 | |
dansmith | or, two providers in the request | 14:46 |
mriedem | # but there are two resource providers in that one allocations block | 14:46 |
*** shaohe_feng has joined #openstack-nova | 14:46 | |
cdent | ^ | 14:46 |
dansmith | yeah | 14:46 |
*** gilfoyle has joined #openstack-nova | 14:47 | |
dansmith | so, that tells me that a single non-fancy request to placement should return a split allocation | 14:47 |
mriedem | dansmith: we should know shortly from this ceph patch i have | 14:47 |
cdent | If we need a specific functional test for something, I'm semi idle right now, so could make something if someone tells me what it needs to be | 14:47 |
dansmith | and the scheduler is doing a non-fancy request, so it should be getting back a split allocation I guess | 14:47 |
melwitt | mriedem: in https://review.openstack.org/586568 is that taking care of the live migration rollback scenario? or is that still an open question | 14:47 |
mriedem | melwitt: i looked at rollback and didn't see anything that needed this type of thing | 14:47 |
dansmith | cdent: well, I've tried writing a very hacky one and placement is returning no allocation requests | 14:47 |
melwitt | mriedem: ack | 14:47 |
cdent | dansmith: do you want to push it up and I'll tune it and you can go review something? | 14:48 |
mriedem | melwitt: i'd say if we ever go the generic route in _post_live_migration, we'd want to do the same in _rollback_live_migration | 14:48 |
mriedem | rollback is likely less of an issue b/c if we failed live migration, we won't activate the dest host port bindings and get into this mess | 14:48 |
*** lucasbagb has joined #openstack-nova | 14:49 | |
[fcandido] | http://eavesdrop.openstack.org/meetings | 14:49 |
*** efried is now known as fried_rice | 14:49 | |
*** [fcandido] has left #openstack-nova | 14:49 | |
melwitt | ack | 14:49 |
openstackgerrit | Dan Smith proposed openstack/nova master: WIP: funtional test with sharing providers https://review.openstack.org/586589 | 14:49 |
dansmith | cdent: ^ | 14:49 |
cdent | on it | 14:49 |
dansmith | cdent: warning, it's very, uh, forced | 14:50 |
cdent | ha, noted | 14:50 |
*** flwang1 has joined #openstack-nova | 14:50 | |
fried_rice | dansmith/superdan: I haven't caught up on the whole conversation, but you're asking about a compute node that's marked as a sharing provider? | 14:50 |
dansmith | cdent: attempts to create a provider with disk, associate with the compute host providers, nuke the disk inventory from one and then try to boot and see if we got the shared bit | 14:50 |
cdent | ✔ | 14:51 |
dansmith | fried_rice: no, not a compute node marked as sharing, just a compute with no disk because it's associated to a shared disk provider | 14:51 |
*** mlavalle has quit IRC | 14:52 | |
*** imacdonn has quit IRC | 14:52 | |
*** mlavalle has joined #openstack-nova | 14:52 | |
mriedem | dansmith: why not write a simple fake virt driver that doesn't report DISK_GB inventory? | 14:52 |
*** imacdonn has joined #openstack-nova | 14:52 | |
dansmith | mriedem: because this was quick | 14:52 |
dansmith | mriedem: obviously not mergeable | 14:52 |
*** fgonzales_ has joined #openstack-nova | 14:53 | |
mriedem | your max_unit is wrong | 14:54 |
melwitt | argh, gate bug fix just failed merge for POST_FAILURE | 14:54 |
mriedem | your sharing provider has 1gb | 14:54 |
mriedem | unless flavor1 doesn't have any root_gb | 14:55 |
*** bacape has joined #openstack-nova | 14:56 | |
*** breton has joined #openstack-nova | 14:56 | |
dansmith | it has 1024 GB | 14:56 |
*** Bellesse has joined #openstack-nova | 14:56 | |
*** jfinck has joined #openstack-nova | 14:56 | |
*** shaohe_feng has quit IRC | 14:56 | |
mriedem | but max you can request in a chunk is 1 right? | 14:56 |
dansmith | oh max_unit | 14:56 |
cdent | i'll mess with it | 14:57 |
mriedem | max_unit should equal total | 14:57 |
*** shaohe_feng has joined #openstack-nova | 14:57 | |
dansmith | still no dice | 14:57 |
dansmith | er, hmm it didn't update | 14:57 |
dansmith | ah, I'm setting inventory twice for some reason | 14:58 |
sean-k-mooney | melwitt: the live migrate one? | 14:58 |
mriedem | yeah | 14:58 |
mriedem | you might be using a 1 root_gb flavor anyway | 14:58 |
mriedem | so the max_unit being 1 might not make a difference | 14:59 |
dansmith | I was, and still no difference | 14:59 |
dansmith | yeah | 14:59 |
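The max_unit point above in a short sketch: placement rejects any single allocation larger than max_unit, no matter how much total inventory the provider has. This is a simplified stand-in for placement's capacity check, not its actual code:

```python
# Simplified model of the inventory capacity check being discussed.
def can_allocate(requested, total, used, max_unit,
                 reserved=0, step_size=1, allocation_ratio=1.0):
    # A request must fit in a single chunk of at most max_unit...
    if requested > max_unit:
        return False
    if requested % step_size != 0:
        return False
    # ...and within overall capacity.
    capacity = (total - reserved) * allocation_ratio
    return used + requested <= capacity

# total=1024 but max_unit=1: a 10 GB root disk request fails.
print(can_allocate(10, total=1024, used=0, max_unit=1))     # False
# With max_unit equal to total, the same request fits.
print(can_allocate(10, total=1024, used=0, max_unit=1024))  # True
```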
melwitt | sean-k-mooney: yeah | 14:59 |
fried_rice | dansmith/superdan: Okay, you're trying to make a setup that has its disk allocated from a sharing provider, not the compute node. And then what, migrate it? | 14:59 |
mriedem | boot and then migrate | 14:59 |
mriedem | but boot fails? | 14:59 |
dansmith | fried_rice: well, boot first would be nice | 14:59 |
fried_rice | bhagyashri got that working live and in a func test with the libvirt driver. | 15:00 |
dansmith | fried_rice: I believe migrate will mangle the allocations, but trying to prove it | 15:00 |
fried_rice | Have you located that func test yet? | 15:00 |
dansmith | nope | 15:00 |
fried_rice | dansmith: I suspect you may be right. | 15:00 |
fried_rice | okay, stand by... | 15:00 |
mriedem | fried_rice: that libvirt func test doesn't go through the scheduler though right? | 15:00 |
dansmith | fried_rice: yeah, so in that case, I want to remove the bit of the libvirt inventory thing that will not expose disk_gb, because people may turn that on, and then be mangling their allocations with migrations for a couple days before realizing it | 15:01 |
fried_rice | mriedem: I sure thought it did. | 15:01 |
fried_rice | https://review.openstack.org/#/c/560459/ | 15:01 |
mriedem | hmm yeah https://review.openstack.org/#/c/560459/17/nova/tests/functional/libvirt/test_shared_resource_provider.py | 15:01 |
fried_rice | yup | 15:02 |
*** links has quit IRC | 15:02 | |
dansmith | yeah, so I dunno why it's not working for me | 15:02 |
dansmith | but that's fine | 15:03 |
sean-k-mooney | dansmith: only the allocation for the compute resources needs to be migrated, correct? the shared storage allocation should remain the same. | 15:03 |
mriedem | sean-k-mooney: well, that's the point of the test, | 15:03 |
dansmith | sean-k-mooney: right, but we don't do that properly | 15:03 |
fried_rice | dansmith: Building on that one and trying a migration would be informative. I would be surprised if it works properly, because we have no logic to do ^ | 15:03 |
mriedem | because we have FIXME notes all over the migration code | 15:03 |
sean-k-mooney | i guess unless we are migrating with a block migration to a different storage provider | 15:03 |
dansmith | fried_rice: I have fixmes about it being broken and known | 15:03 |
fried_rice | yup | 15:04 |
*** alexchadin has joined #openstack-nova | 15:04 | |
dansmith | fried_rice: so, yeah, I'm not sure why we landed the patch to do that for inventory in that case, but.. alas | 15:04 |
fried_rice | dansmith: So that we wouldn't be double-reporting inventory allocations. | 15:04 |
*** gilfoyle_ has joined #openstack-nova | 15:05 | |
fried_rice | dansmith: Can't you only migrate an instance that's on volume storage anyway? | 15:05 |
dansmith | fried_rice: right, but that has been broken since forever, and this change means we *lose* data | 15:05 |
dansmith | no | 15:05 |
fried_rice | what happens to the disk? | 15:05 |
mriedem | ssh to the dest | 15:05 |
dansmith | hah | 15:05 |
fried_rice | eek, really? | 15:05 |
dansmith | it gets migrated | 15:05 |
*** mdrabe has joined #openstack-nova | 15:05 | |
dansmith | either block migration or shared (non-volume) storage in the backend | 15:05 |
fried_rice | Okay, so what are we expecting to happen here? | 15:05 |
dansmith | for live, and yeah, scp to dest for the cold migration case | 15:06 |
fried_rice | I would have thought we would ssh the data to whatever disk got allocated on the dest. | 15:06 |
*** gilfoyle has quit IRC | 15:06 | |
dansmith | I think we need to remove that bit of the inventory logic that doesn't expose DISK_GB | 15:06 |
dansmith | so that we don't get split allocations that we trash during a migration | 15:06 |
dansmith | because we'll end up with instances with no DISK_GB allocation at all | 15:06 |
fried_rice | which may or may not be the same provider as we started on, but to a different spot on that disk - which would be something to fix later | 15:06 |
dansmith | and then start overcommitting | 15:06 |
*** shaohe_feng has quit IRC | 15:06 | |
fried_rice | I don't understand that thinking. And IMO it is premature to land a patch to yank that out until we've demonstrated that anything bad happens. | 15:07 |
*** shaohe_feng has joined #openstack-nova | 15:07 | |
dansmith | that's why I'm trying to write a test | 15:07 |
fried_rice | sounds good. | 15:07 |
fried_rice | need help? | 15:07 |
*** josecastroleon has quit IRC | 15:07 | |
dansmith | I asked for help and now am working on using that functional test to do my bidding | 15:08 |
* cdent is still poking at the test too | 15:08 | |
*** r-daneel_ has joined #openstack-nova | 15:08 | |
*** r-daneel has quit IRC | 15:09 | |
*** r-daneel_ is now known as r-daneel | 15:09 | |
mriedem | i believe this is the problem https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L48 | 15:09 |
mriedem | b/c we're assuming only allocations on the source compute node provider | 15:09 |
fried_rice | I think the worst that happens is we fail to remove the original allocation for the DISK_GB on the sharing provider. What happens after that depends on whether we migrated to a compute node with or without sharing disk. But the doubled allocation leaves us in no worse shape than we were before this fix, I would have thought. | 15:09 |
mriedem | and copy those to the migration consumer | 15:09 |
mriedem | which won't include the DISK_GB allocation on the shared provider | 15:09 |
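A toy model of the failure mode mriedem points at: if the move-allocation code assumes everything lives on the source compute node provider, a DISK_GB allocation on a sharing provider is silently dropped from the migration record. The UUIDs and helper names here are made up for illustration; the real code is in nova/conductor/tasks/migrate.py:

```python
# Instance allocations as placement returns them: keyed by resource
# provider UUID, with disk living on a sharing provider rather than the
# compute node. UUIDs are illustrative placeholders.
SOURCE_CN = "cn-uuid"
SHARED_DISK_RP = "shared-disk-uuid"

instance_allocs = {
    SOURCE_CN: {"resources": {"VCPU": 2, "MEMORY_MB": 2048}},
    SHARED_DISK_RP: {"resources": {"DISK_GB": 20}},
}

def copy_source_allocs_buggy(allocs, source_cn_uuid):
    """The pattern under discussion: copy only the source compute node's
    entry to the migration consumer."""
    return {source_cn_uuid: allocs[source_cn_uuid]}

def copy_source_allocs_fixed(allocs):
    """Copy the whole allocation set, sharing providers included."""
    return dict(allocs)

migration_allocs = copy_source_allocs_buggy(instance_allocs, SOURCE_CN)
# The DISK_GB allocation never makes it onto the migration consumer, so a
# revert restores allocations with no disk at all.
print(SHARED_DISK_RP in migration_allocs)                           # False
print(SHARED_DISK_RP in copy_source_allocs_fixed(instance_allocs))  # True
```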
sean-k-mooney | fried_rice: dansmith: do we handle flavors with root_gb=0 in placement, by the way? pre-placement we just did not track their disk usage properly. im assuming that is still broken | 15:09 |
mriedem | sean-k-mooney: fixed like 1 week ago | 15:10 |
mriedem | sean-k-mooney: https://review.openstack.org/#/q/topic:bug/1469179+(status:open+OR+status:merged) | 15:10 |
sean-k-mooney | mriedem: fixed by reading the disk size from the image? | 15:10 |
dansmith | fried_rice: and I think we lose the disk allocation silently | 15:10 |
fried_rice | dansmith: mriedem: oic, yeah, that makes sense. | 15:10 |
mriedem | sean-k-mooney: no, we don't request DISK_GB allocations for bfv | 15:10 |
sean-k-mooney | mriedem: i was thinking about the non-boot-from-volume case | 15:11 |
fried_rice | I didn't realize we don't go through GET /a_c to request the resources on the dest. | 15:11 |
mriedem | yes this is what removes the instances allocations https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L60 | 15:11 |
mriedem | from all providers | 15:11 |
mriedem | fried_rice: we do to pick the dest host during scheduling | 15:12 |
sean-k-mooney | mriedem: for example the nano flavor with the cirros image in devstack, with no volume for the guest. | 15:12 |
fried_rice | mriedem: GET /a_c or just GET /rps?resources=... ? | 15:12 |
mriedem | GET /a_c, | 15:12 |
mriedem | we have to do that in the scheduler to figure out which providers to filter for a dest host | 15:12 |
fried_rice | mriedem: and then ignore that result and just copy the resources from the src to the dest? | 15:12 |
mriedem | i'm looking to confirm that | 15:13 |
fried_rice | mriedem: well, you could have used GET /rps?resources=... as well | 15:13 |
mriedem | sure but we don't in the scheduler | 15:13 |
fried_rice | The right thing would be to use GET /a_c to pick the host *and* create the allocations. Then we wouldn't be having this problem. | 15:13 |
mriedem | oh you know what, | 15:13 |
mriedem | yes that's what we do | 15:14 |
mriedem | we move the existing allocs from the instance on the source node to the migration record, | 15:14 |
mriedem | and then call the scheduler and claim on the dest host | 15:14 |
mriedem | so the migration has allocs on source host and instance has allocs on dest host | 15:14 |
mriedem | then on successful migration we delete the migration allocs on the source host | 15:14 |
mriedem | on failure, we delete allocs for instance on dest and move allocs from migration on source host back to the instance | 15:15 |
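mriedem's allocation-swap flow as a small state sketch. This is a toy dict model of the steps he lists (the real logic is spread across nova.conductor and the scheduler report client):

```python
# Toy model of the migration allocation dance:
# 1. move the instance's source allocs onto the migration consumer,
# 2. the scheduler claims fresh allocs for the instance on the dest,
# 3. success deletes the migration's source allocs,
# 4. failure drops the dest allocs and moves the source ones back.
def start_migration(allocations, instance, migration, dest_allocs):
    allocations[migration] = allocations.pop(instance)      # step 1
    allocations[instance] = dest_allocs                     # step 2

def finish_migration(allocations, instance, migration, success):
    if success:
        del allocations[migration]                          # step 3
    else:
        # Overwriting the instance entry discards the dest allocs and
        # restores the source ones in one move.
        allocations[instance] = allocations.pop(migration)  # step 4

allocs = {"inst-1": {"src-cn": {"VCPU": 2}}}
start_migration(allocs, "inst-1", "mig-1", {"dest-cn": {"VCPU": 2}})
print(allocs)  # migration holds source allocs, instance holds dest allocs

finish_migration(allocs, "inst-1", "mig-1", success=False)
print(allocs)  # {'inst-1': {'src-cn': {'VCPU': 2}}}
```

Note this model also shows why the sharing-provider case breaks: step 1 only moves whatever the copy logic hands it, so anything missing there is gone by the time step 4 restores it.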
fried_rice | oh, so what's actually happening is we're erroneously losing the DISK_GB allocation for a minute during the migration, but picking it up again on the dest. | 15:15 |
mriedem | so we don't hit _move_operation_alloc_request in the scheduler report client | 15:15 |
cdent | dansmith: the root problem in your test is that two compute nodes are not in the aggregate, when you do the put for that it is coming up 404, so the resource providers don't exist yet, not sure why that is | 15:15 |
dansmith | cdent: ah, okay | 15:15 |
dansmith | mriedem: hmm, so we end up double-claiming on the shared provider? | 15:16 |
dansmith | mriedem: I thought even with the new accounting we had to grab the allocation for the provider in question and regenerate it, which would mean the instance's allocation on the dest wouldn't include the shared one | 15:16 |
mriedem | i don't think so...as eric said, we'll remove the allocs for the instance on the shared provider, | 15:16 |
mriedem | then claim on the dest during scheduling | 15:16 |
*** shaohe_feng has quit IRC | 15:17 | |
dansmith | because we do a full regular schedule? | 15:17 |
mriedem | i think on a revert or failed migration we'd eff that up though | 15:17 |
mriedem | yes | 15:17 |
mriedem | *EXCEPT* in the case of forced live migrate | 15:17 |
dansmith | oh you're saying we drop the disk allocation but only because we don't copy it for the migration | 15:17 |
mriedem | we don't go through the scheduler there | 15:17 |
mriedem | dansmith: yeah | 15:17 |
dansmith | so, | 15:18 |
*** shaohe_feng has joined #openstack-nova | 15:18 | |
mriedem | on a revert or failed resize, we'll delete the allocs for the instance on the dest host (created by the scheduler) and move those back from the migration to the instance, but the migration allocs won't be on the sharing provider | 15:18 |
dansmith | what happens if placement picks a different sharing provider than we had before? our disk doesn't actually move | 15:18 |
mriedem | so we'd lost the DISK_GB alocs in that case | 15:18 |
dansmith | ah, yeah, anything where we use the migration's allocations would be wrong | 15:18 |
*** niraj_singh has quit IRC | 15:18 | |
mriedem | dansmith: yup that's this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4138 | 15:19 |
mriedem | well, we wouldn't hit that yet | 15:19 |
mriedem | the migration consumer will only have VCPU and MEMORY_MB allocations against the source node | 15:19 |
mriedem | so this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4155 | 15:19 |
mriedem | so that's definitely busted - we could easily test that with a resize revert test and verify the DISK_GB allocation for the instance is gone | 15:20 |
dansmith | and forced live | 15:20 |
mriedem | i haven't stepped through forced live yet (or evac for that matter) | 15:20 |
sean-k-mooney | dansmith: we should only be able to pick a different provider in a block migrate case, correct? if we do not set that flag we should not allow the sharing provider to change | 15:21 |
dansmith | also, | 15:21 |
dansmith | migrate to an older node that doesn't have this will drop the shared disk allocation | 15:21 |
dansmith | because placement will allocate from its own disk inventory, even though it's the same pool, | 15:21 |
mriedem | sean-k-mooney: re: "for example the nano flavor with with the cirros image in devstack with no volume for the guest." i don't know what you're asking me | 15:21 |
dansmith | and then when we upgrade that node, we won't convert the allocations | 15:21 |
dansmith | in fact, | 15:21 |
dansmith | any upgrade where we boot up on rocky code and change our inventory will break all our allocations right? | 15:22 |
dansmith | fried_rice: cdent what happens if I have allocations against my disk_gb inventory and then I update my inventory with no disk_gb ? | 15:22 |
fried_rice | The inv update will bounce 409 InventoryInUse. | 15:22 |
mriedem | i don't think you can do that | 15:22 |
mriedem | yeah | 15:22 |
fried_rice | on every periodic | 15:23 |
dansmith | okay, so anyone with MISC_SHARES now will failboat on upgrade to rocky | 15:23 |
fried_rice | update_from_provider_tree will never succeed. | 15:23 |
dansmith | and anyone that sets that on non-empty computes will stop reporting | 15:23 |
mriedem | oh right b/c upt removes the DISK_GB from the node provider if it sees it's in a sharing provider relationship | 15:23 |
dansmith | yeah | 15:24 |
fried_rice | Note that we didn't document that you could do this. | 15:24 |
mriedem | and if that DISK_GB is being used it will blow up on the remove | 15:24 |
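Placement's behavior here can be modeled in a few lines (a toy check, not placement's actual code): an inventory update that drops a resource class with live allocations is rejected with 409 InventoryInUse, which is why update_from_provider_tree keeps failing on every periodic.

```python
class InventoryInUse(Exception):
    pass

def update_inventory(inventory, allocations, new_inventory):
    # Reject removal of any resource class that still has allocations,
    # mirroring placement's 409 InventoryInUse response.
    used = {rc for alloc in allocations.values() for rc in alloc}
    removed = set(inventory) - set(new_inventory)
    if used & removed:
        raise InventoryInUse(sorted(used & removed))
    inventory.clear()
    inventory.update(new_inventory)

inv = {'VCPU': 8, 'MEMORY_MB': 16384, 'DISK_GB': 100}
allocs = {'inst': {'DISK_GB': 20}}
try:
    # The Rocky libvirt driver stops reporting DISK_GB when it sees the
    # sharing setup -- with existing disk allocations, this 409s.
    update_inventory(inv, allocs, {'VCPU': 8, 'MEMORY_MB': 16384})
except InventoryInUse as exc:
    print('409 InventoryInUse:', exc)
```

With no allocations against DISK_GB the same update succeeds, which is why only non-empty computes hit the failure dansmith describes.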
mriedem | fried_rice: heh i know | 15:24 |
dansmith | fried_rice: and yet, it's in documentation and people have tried it, hence the bug yeah? | 15:24 |
sean-k-mooney | mriedem: in that instance, the flavor has root_gb=0, the image is like 20MB in glance, and we boot it on the dest without claiming space in placement. the vm can use as much space as the disk topology in the image specifies | 15:24 |
*** alexchadin has quit IRC | 15:24 | |
*** tssurya has quit IRC | 15:24 | |
fried_rice | dansmith: The bug was opened because bhagyashri was working on it and I said it should have a bug report. | 15:24 |
mriedem | shared storage providers is definitely a feature/spec | 15:25 |
fried_rice | ...which we don't claim works yet. | 15:25 |
mriedem | given the upgrade/CI/etc | 15:25 |
mriedem | i know, but | 15:25 |
*** ttsiouts has quit IRC | 15:25 | |
fried_rice | we should document that we *don't* support it. | 15:25 |
dansmith | well, there's mention of that trait in our own docs, and given what it breaks it's not trivial, IMHO | 15:25 |
fried_rice | and then take some time to resolve these issues correctly. | 15:25 |
mriedem | that's what dansmith and melwitt were talking about last night | 15:25 |
sean-k-mooney | mriedem: anyway that's unrelated to dan's question except for the fact we don't track the disk usage correctly in placement | 15:25 |
*** ttsiouts has joined #openstack-nova | 15:26 | |
melwitt | yes, described here L8 https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo | 15:26 |
mriedem | sean-k-mooney: yes i think that's correct and likely a bug; i'm not entirely sure how the resource tracker reports disk usage for a flavor like that which *is* using local disk | 15:26 |
*** fgonzales_ has quit IRC | 15:26 | |
mriedem | where root_gb=0 | 15:26 |
mriedem | sean-k-mooney: we've also said you shouldn't use root_gb=0 except for volume-backed flavors | 15:27 |
mriedem | and added a policy rule in rocky to disable that | 15:27 |
*** shaohe_feng has quit IRC | 15:27 | |
sean-k-mooney | mriedem: oh cool | 15:27 |
sean-k-mooney | is it set by default? | 15:27 |
mriedem | sean-k-mooney: https://github.com/openstack/nova/commit/763fd62464e9a0753e061171cc1fd826055bbc01 | 15:28 |
mriedem | the plan was to disable that by default starting in stein | 15:28 |
*** jfinck has quit IRC | 15:28 | |
mriedem | so you can't boot a server with a root_gb=0 flavor unless you're doing boot from volume | 15:28 |
cdent | dansmith: microversions :( | 15:28 |
mriedem | how are microversions related to this? | 15:28 |
dansmith | I assume because I messed up a version in my test | 15:29 |
*** shaohe_feng has joined #openstack-nova | 15:29 | |
cdent | (sorry, his test) | 15:29 |
mriedem | ah | 15:29 |
mriedem | whew | 15:29 |
sean-k-mooney | mriedem: right ok cool. if you have existing instances booted that way we would have to update the embedded flavor or resource dict to indicate that, or live migration will explode | 15:29 |
mriedem | sean-k-mooney: so figuring out how we track disk usage for those types of flavors in the resource tracker would be good to know | 15:29 |
mriedem | because if it was never tracked as usage before, then it's not really a huge regression to not be tracking it in placement | 15:30 |
sean-k-mooney | mriedem: im pretty sure we track it as 0 | 15:30 |
sean-k-mooney | e.g. we dont track it at all | 15:30 |
*** ttsiouts has quit IRC | 15:30 | |
mriedem | sean-k-mooney: i think that too b/c https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L1461 | 15:30 |
mriedem | object_or_dict.flavor.root_gb | 15:30 |
sean-k-mooney | it was a way to bypass quota in the past | 15:31 |
mriedem | the is_bfv in there was just recently added in the same series of fixes for the bfv thing | 15:31 |
mriedem | right, so to summarize, don't set flavor root_gb=0 *unless* those flavors are only used with bfv instances, | 15:31 |
mriedem | and we have the is_bfv root_gb reporting in the RT and placement fixed in rocky | 15:31 |
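mriedem's summary boils down to one conditional in the resource tracker (a simplified sketch; the real accounting is in nova/compute/resource_tracker.py):

```python
def reported_disk_gb(flavor_root_gb, is_bfv, image_size_gb=0):
    """Simplified view of the root disk usage the RT reports.

    With the Rocky-era is_bfv fix, volume-backed instances report no
    local DISK_GB. A root_gb=0 flavor that is *not* volume-backed
    still reports 0 -- the gap sean-k-mooney points out: the real
    usage would come from the image (image_size_gb here), which the
    RT never sees.
    """
    if is_bfv:
        return 0
    return flavor_root_gb  # ignores image_size_gb when root_gb == 0

assert reported_disk_gb(20, is_bfv=False) == 20
assert reported_disk_gb(20, is_bfv=True) == 0
# The untracked case: local disk in use, but 0 reported.
assert reported_disk_gb(0, is_bfv=False, image_size_gb=1) == 0
```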
dansmith | melwitt: aight, well, anyway, my recommendation is that we just remove that inventory quirk for rocky since it can't work and it's one line. alternatively, at least a known-issue reno just to cover our butts in case someone hits it | 15:32 |
sean-k-mooney | mriedem: ya i think though we will have to fix up the allcoation for existing instance that are not bfv going forward | 15:32 |
mriedem | sean-k-mooney: we do | 15:32 |
dansmith | melwitt: it's like having a half-merged feature.. doesn't really serve any purpose and is externally tickle-able to failure | 15:32 |
dansmith | obviously your call on what to do | 15:33 |
mriedem | sean-k-mooney: https://review.openstack.org/#/c/583715/ | 15:33 |
mriedem | sean-k-mooney: we'll heal on move | 15:33 |
cdent | any swag on how hard to make it go, now-ish? | 15:33 |
mriedem | "make it go" == make it work? | 15:33 |
mriedem | we don't even have multi-node shared storage provider CI | 15:33 |
mriedem | so very high risk IMO | 15:33 |
sean-k-mooney | mriedem: in the non BFV case we need to read the size form the image if the flavor root_gb=0 | 15:33 |
mriedem | way too late | 15:34 |
mriedem | sean-k-mooney: yup | 15:34 |
dansmith | yeah, way too late to try to make any of the broken non-broken | 15:34 |
mriedem | but that's not reported to the RT as far as i know | 15:34 |
fried_rice | dansmith: I can propose that if you like. | 15:34 |
dansmith | fried_rice: which? | 15:34 |
mriedem | dansmith: so we should likely start with a bug saying this stuff will nuke your DISK_GB allocations on failure or revert at least | 15:34 |
*** andymccr has quit IRC | 15:34 | |
mriedem | fried_rice: melwitt: ^ | 15:34 |
fried_rice | dansmith: Taking that line out of the libvirt driver. | 15:34 |
dansmith | mriedem: for sure | 15:35 |
*** andymccr has joined #openstack-nova | 15:35 | |
dansmith | fried_rice: sure, I'm happy to do it as well, either way | 15:35 |
mriedem | and we can track the various bugs in a spec in stein if we're going to go full on and support this | 15:35 |
fried_rice | dansmith: You want to write up the bug, I'll do the patch? | 15:36 |
mriedem | b/c we need a spec for the upgrade impacts obviously, and how to deploy the thing, plus CI requirements (which i'm already half-way done with) | 15:36 |
melwitt | sounds like a plan | 15:36 |
dansmith | fried_rice: if mriedem isn't going to | 15:36 |
mriedem | go ahead | 15:36 |
dansmith | fried_rice: I think mriedem really wants to do it | 15:36 |
dansmith | I heard him say earlier | 15:36 |
dansmith | so I don't want to step on his toes | 15:36 |
mriedem | i'm cleaning up stephenfin's last 2 changes in his vswitch series | 15:36 |
dansmith | because I think he measures his weekly progress by bugs reported | 15:37 |
dansmith | mriedem: more? | 15:37 |
mriedem | plus, zuul just f'ed my ceph ci run that was almost done! | 15:37 |
melwitt | this is new, gate failure RETRY_LIMIT | 15:37 |
melwitt | great | 15:37 |
mriedem | melwitt: yes same | 15:37 |
openstackgerrit | Chris Dent proposed openstack/nova master: WIP: funtional test with sharing providers https://review.openstack.org/586589 | 15:37 |
mriedem | infra just posted a status | 15:37 |
mriedem | #status alert A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes | 15:37 |
*** shaohe_feng has quit IRC | 15:37 | |
mriedem | so don't recheck | 15:37 |
cdent | dansmith: ^ that gets the test actually making reasonable requests, but no more than that | 15:38 |
cdent | not sure if you care given the earlier discussion, but in case you do... | 15:38 |
*** shaohe_feng has joined #openstack-nova | 15:38 | |
dansmith | cdent: yeah, probably don't care now that I found this other one | 15:38 |
dansmith | but thanks for setting me straight | 15:38 |
-openstackstatus- NOTICE: A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes | 15:39 | |
*** ChanServ changes topic to "A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes" | 15:39 | |
*** hongbin_ has joined #openstack-nova | 15:39 | |
fried_rice | cdent: All I care about is that you misspelled funtional | 15:40 |
mriedem | i can write the bug if no one has started yet | 15:40 |
fried_rice | Do it. And let the English see you do it. | 15:41 |
mriedem | alright | 15:41 |
cdent | fried_rice: that was dansmith in this case | 15:42 |
dansmith | I was rushing | 15:42 |
cdent | but I can see how it being me would be unsurprising | 15:42 |
* cdent is always rushing | 15:42 | |
*** Shilpa has quit IRC | 15:42 | |
cdent | is my new excuse | 15:42 |
fried_rice | cdent: If it had been three weeks ago, and it had been fuctional, I would have totally known it was you. | 15:42 |
cdent | I fuctional tests all the time | 15:43 |
fried_rice | Is there a way to mark a normal funtional test as an xfail? | 15:44 |
fried_rice | oh, shit, bhagyashri's test still succeeds with that bit commented out :( | 15:45 |
*** mdrabe has quit IRC | 15:46 | |
fried_rice | ignore me, phew. | 15:46 |
*** mdrabe has joined #openstack-nova | 15:46 | |
*** AlexeyAbashkin has quit IRC | 15:46 | |
cdent | fried_rice: https://docs.python.org/3/library/unittest.html#unittest.expectedFailure | 15:47 |
cdent | https://docs.python.org/3/library/unittest.html#skipping-tests-and-expected-failures | 15:47 |
fried_rice | thanks cdent | 15:47 |
fried_rice | only py3? Are we running func tests on only py3 these days? | 15:47 |
*** shaohe_feng has quit IRC | 15:47 | |
sean-k-mooney | fried_rice: i think we have both. still | 15:48 |
fried_rice | yup sean-k-mooney | 15:48 |
sean-k-mooney | you can probably do a version check and just skip on 2 and expect failure on 3 | 15:49 |
*** shaohe_feng has joined #openstack-nova | 15:49 | |
mriedem | here you go https://bugs.launchpad.net/nova/+bug/1784020 | 15:50 |
openstack | Launchpad bug 1784020 in OpenStack Compute (nova) "Shared storage providers are not supported and will break things if used" [High,Triaged] | 15:50 |
mriedem | dansmith: fried_rice: melwitt: ^ | 15:50 |
dansmith | oh thanks | 15:50 |
* dansmith closes the empty bug report he had open | 15:50 | |
cdent | fried_rice: https://docs.python.org/2.7/library/unittest.html#skipping-tests-and-expected-failures | 15:50 |
melwitt | now that's a bug report | 15:50 |
dansmith | mine would have been 5% of that | 15:51 |
mriedem | fried_rice: testtools has an expectedFailure thing | 15:51 |
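Per cdent's links, `expectedFailure` exists in both py2.7 and py3 unittest (and testtools has its own), so no version check is needed. A minimal example of marking a known-broken test as xfail:

```python
import unittest

class SharedStorageTests(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_broken(self):
        # Documents a known bug: the run stays green while this fails,
        # and flips to "unexpected success" once the bug is fixed.
        self.assertEqual(1, 2)

    def test_working(self):
        self.assertEqual(1, 1)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedStorageTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The expected failure is recorded in `result.expectedFailures` rather than `result.failures`, so the overall run still reports success.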
melwitt | hah, I know | 15:51 |
mriedem | you said i take pride in it... | 15:51 |
dansmith | "s'broken, kthx" | 15:51 |
mriedem | mostly because if i don't put those details in there, i'll totally forget wtf we talked about a year from now | 15:51 |
melwitt | yeah. the details are super helpful | 15:51 |
sean-k-mooney | mriedem: i think that bug also falls into the category of "we don't have ci for it so it's broken by default" | 15:52 |
mriedem | well, | 15:52 |
mriedem | we don't have CI for a lot of things | 15:52 |
mriedem | and we still support them, <cough>evacuate</cough> | 15:52 |
*** gyee has joined #openstack-nova | 15:52 | |
sean-k-mooney | yes and i assume they are broken by default unless proven otherwise by it working when i use it and being happy | 15:53 |
dansmith | evacuate is hard to test for legit reasons, but this shared thing is not | 15:53 |
dansmith | and it's also often broken | 15:53 |
mriedem | yup | 15:54 |
mriedem | btw, yes, forced host live migrate/evacuate will drop the DISK_GB allocation on the shared provider | 15:54 |
*** flwang1 has quit IRC | 15:55 | |
dansmith | mriedem: from your test? | 15:56 |
mriedem | no just looking at the code | 15:57 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L500 | 15:57 |
mriedem | we only get the allocations for the instance against the source node | 15:57 |
dansmith | oh | 15:57 |
mriedem | and copy those to the dest node for the instance | 15:57 |
mriedem | double up | 15:57 |
mriedem | doesn't put anything on the migration record in the force cas | 15:58 |
mriedem | *case | 15:58 |
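The forced path (the `claim_resources_on_destination` helper in nova/scheduler/utils.py mriedem is describing) can be modeled as a toy: it copies the instance's source-node allocations onto the dest under the *same* consumer, leaving nothing on the migration record to revert to.

```python
def forced_claim(allocs, instance, source_rp, dest_rp):
    # Copy the instance's allocations against the source provider onto
    # the dest provider under the same consumer -- a doubled claim.
    # Nothing moves to the migration record, so a rollback has nothing
    # to restore from.
    allocs[instance][dest_rp] = dict(allocs[instance][source_rp])

allocs = {'inst': {'src': {'VCPU': 2, 'MEMORY_MB': 512}}}
forced_claim(allocs, 'inst', 'src', 'dst')
assert allocs['inst'] == {'src': {'VCPU': 2, 'MEMORY_MB': 512},
                          'dst': {'VCPU': 2, 'MEMORY_MB': 512}}
```

Note it only copies allocations against the source compute node provider, so any DISK_GB allocation held on a sharing provider is simply dropped on the dest.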
*** shaohe_feng has quit IRC | 15:58 | |
*** rpittau has quit IRC | 15:58 | |
*** r-daneel_ has joined #openstack-nova | 15:58 | |
mriedem | hmm, which makes me wonder if we ever cleanup the dest host allocations on a failed live migration | 15:58 |
mriedem | that is forced | 15:58 |
*** flwang1 has joined #openstack-nova | 16:00 | |
mriedem | looks like post_live_migration will give you a warning but remove the doubled allocation | 16:00 |
*** shaohe_feng has joined #openstack-nova | 16:00 | |
*** r-daneel has quit IRC | 16:00 | |
*** r-daneel_ is now known as r-daneel | 16:00 | |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L6638L6669 | 16:00 |
mriedem | oops | 16:00 |
mriedem | i'll write a functional test for the rollback forced live migration case | 16:01 |
*** openstackgerrit has quit IRC | 16:04 | |
mriedem | https://bugs.launchpad.net/nova/+bug/1784022 | 16:06 |
openstack | Launchpad bug 1784022 in OpenStack Compute (nova) "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged] | 16:06 |
mriedem | looks like we regressed that in queens | 16:07 |
*** shaohe_feng has quit IRC | 16:08 | |
mriedem | blarg https://review.openstack.org/#/c/507638/25/nova/compute/manager.py@6252 | 16:08 |
*** shaohe_feng has joined #openstack-nova | 16:09 | |
*** jangutter has quit IRC | 16:10 | |
dansmith | mriedem: are you saying we don't have a migration record if we do a forced? | 16:11 |
mriedem | dansmith: we do, but we don't put the allocations on it | 16:11 |
*** lbragstad_ is now known as lbragstad | 16:11 | |
mriedem | b/c we don't go through the scheduler for forced | 16:11 |
*** ispp has quit IRC | 16:11 | |
dansmith | um | 16:11 |
mriedem | this is just one of the many reasons for the dreaded -5 in dublin | 16:11 |
*** flwang1 has quit IRC | 16:12 | |
mriedem | dansmith: forced live migration calls this method to double up the allocations from the source to the forced dest https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L473 | 16:12 |
mriedem | that's from pike when doubling was all the rage | 16:12 |
dansmith | okay, so you're saying on forced we don't do the migration allocations, we just allocate against the newhost, then if we have to revert, we don't have the migration allocations to revert to the instance? | 16:12 |
melwitt | is it safe to recheck yet? I didn't see another status update | 16:12 |
mriedem | dansmith: correct | 16:13 |
mriedem | melwitt: yeah i just did | 16:13 |
melwitt | ok | 16:13 |
mriedem | dansmith: i'll write a functional test for it when i'm back from lunch | 16:13 |
dansmith | mriedem: okay but the doubling is not intentional, just incidental since we didn't replace the instance allocs with the migration one yeah? | 16:13 |
mriedem | it's intentional | 16:13 |
mriedem | it mimics the behavior of doubling in the scheduler from before quens | 16:13 |
mriedem | *queens | 16:14 |
dansmith | right, but we shouldn't be doing any doubling anymore | 16:14 |
mriedem | sure, | 16:14 |
mriedem | but we are :) | 16:14 |
mriedem | for forced | 16:14 |
mriedem | b/c forced is FUN | 16:14 |
mriedem | -20! | 16:14 |
*** links has joined #openstack-nova | 16:14 | |
dansmith | I'm saying we shouldn't intend to be doing that, | 16:14 |
mriedem | not anymore no | 16:14 |
dansmith | which means it's a case we missed in converting to non-doubling | 16:14 |
mriedem | but we missed it in queens with your bp | 16:14 |
mriedem | yes | 16:14 |
dansmith | right, that's what I mean | 16:14 |
dansmith | unintentional | 16:14 |
mriedem | yeah | 16:14 |
mriedem | ok lunch | 16:15 |
*** mriedem is now known as mriedem_away | 16:15 | |
*** links has quit IRC | 16:17 | |
*** links has joined #openstack-nova | 16:17 | |
*** shaohe_feng has quit IRC | 16:18 | |
*** shaohe_feng has joined #openstack-nova | 16:19 | |
*** artom_ has joined #openstack-nova | 16:22 | |
*** links has quit IRC | 16:23 | |
*** Sundar_ has joined #openstack-nova | 16:23 | |
sean-k-mooney | mriedem_away: im going to choose to read -20! as -(20 factorial) to give it the weight it should have | 16:23 |
Sundar_ | efried: Please ping me when you have the time | 16:25 |
*** openstackgerrit has joined #openstack-nova | 16:26 | |
openstackgerrit | Eric Fried proposed openstack/nova master: libvirt: Revert non-reporting DISK_GB if sharing https://review.openstack.org/586614 | 16:26 |
fried_rice | mriedem_away, dansmith, cdent, melwitt: ^ | 16:26 |
fried_rice | Sundar_: Bad timing :( I have to run for a bit. Will you be around in a couple of hours? | 16:26 |
Sundar_ | NP, sure | 16:27 |
*** harlowja has joined #openstack-nova | 16:27 | |
*** flwang1 has joined #openstack-nova | 16:28 | |
*** shaohe_feng has quit IRC | 16:28 | |
*** shaohe_feng has joined #openstack-nova | 16:29 | |
*** derekh has quit IRC | 16:30 | |
*** tesseract has quit IRC | 16:32 | |
*** fried_rice is now known as fried_rolls | 16:33 | |
*** vladikr has quit IRC | 16:35 | |
*** vladikr has joined #openstack-nova | 16:35 | |
*** shaohe_feng has quit IRC | 16:39 | |
dansmith | mriedem_away: when you're back: I guess I don't really see the thing requiring the dynamic opts registration as being a bad thing | 16:40 |
dansmith | mriedem_away: it forces us to think about it when we write new code and the tests for it | 16:40 |
*** shaohe_feng has joined #openstack-nova | 16:41 | |
*** Bellesse has quit IRC | 16:44 | |
*** rmart04 has quit IRC | 16:46 | |
*** shaohe_feng has quit IRC | 16:49 | |
openstackgerrit | Dan Smith proposed openstack/nova master: Assorted cleanups from numa-aware-vswitches series https://review.openstack.org/582651 | 16:49 |
openstackgerrit | Dan Smith proposed openstack/nova master: Add additional functional tests for NUMA networks https://review.openstack.org/585385 | 16:49 |
*** shaohe_feng has joined #openstack-nova | 16:49 | |
*** felipemonteiro__ has joined #openstack-nova | 16:52 | |
*** felipemonteiro_ has quit IRC | 16:52 | |
cdent | melwitt, dansmith, mriedem_away : next week I'm pretty broadly available, so if stuff comes up and you want to wind me up and point me particular places, please ask. | 16:52 |
melwitt | will do, thanks | 16:54 |
*** shaohe_feng has quit IRC | 16:59 | |
*** shaohe_feng has joined #openstack-nova | 17:04 | |
*** felipemonteiro_ has joined #openstack-nova | 17:06 | |
*** mgoddard has quit IRC | 17:07 | |
*** yamahata has quit IRC | 17:07 | |
*** burt has quit IRC | 17:08 | |
*** shaohe_feng has quit IRC | 17:09 | |
*** felipemonteiro__ has quit IRC | 17:10 | |
*** dtantsur is now known as dtantsur|afk | 17:10 | |
*** gbarros has joined #openstack-nova | 17:11 | |
*** shaohe_feng has joined #openstack-nova | 17:12 | |
*** bacape_ has joined #openstack-nova | 17:16 | |
*** felipemonteiro__ has joined #openstack-nova | 17:18 | |
*** felipemonteiro_ has quit IRC | 17:18 | |
*** bacape_ has quit IRC | 17:18 | |
*** bacape has quit IRC | 17:20 | |
*** shaohe_feng has quit IRC | 17:20 | |
*** shaohe_feng has joined #openstack-nova | 17:20 | |
*** mriedem_away is now known as mriedem | 17:23 | |
*** gbarros has quit IRC | 17:23 | |
*** artom has joined #openstack-nova | 17:23 | |
*** jmlowe has joined #openstack-nova | 17:24 | |
*** artom_ has quit IRC | 17:26 | |
*** savvas has quit IRC | 17:29 | |
*** shaohe_feng has quit IRC | 17:30 | |
*** harlowja has quit IRC | 17:31 | |
*** shaohe_feng has joined #openstack-nova | 17:32 | |
*** felipemonteiro_ has joined #openstack-nova | 17:34 | |
*** felipemonteiro__ has quit IRC | 17:37 | |
*** cfriesen_ has quit IRC | 17:39 | |
*** shaohe_feng has quit IRC | 17:40 | |
*** shaohe_feng has joined #openstack-nova | 17:41 | |
*** gbarros has joined #openstack-nova | 17:42 | |
*** mgoddard has joined #openstack-nova | 17:43 | |
*** yamahata has joined #openstack-nova | 17:43 | |
*** colby_ has joined #openstack-nova | 17:46 | |
colby_ | Hey Everyone. Im trying to get metrics based filtering working in nova. I tried enabling compute_monitors but I always get an error in the logs: | 17:47 |
colby_ | compute_monitors=["nova.compute.monitors.cpu.virt_driver", "numa_mem_bw.virt_driver"] | 17:47 |
colby_ | 2018-07-27 17:43:14.001 2295696 WARNING nova.compute.monitors [req-51711d41-c626-4af2-92fd-dde09c576fb2 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors). | 17:48 |
colby_ | Ive tried variations on the monitor: cpu.virt_driver & just virt_driver. It always gives the same error | 17:48 |
colby_ | Im on pike, Centos, kvm | 17:49 |
colby_ | I have gnocchi running and collecting resources | 17:49 |
colby_ | 2018-07-27 17:44:36.110 2800963 INFO nova.filters [req-0e8215e5-e029-4104-8578-a917bf9edddc e28435e0a66740968c523e6376c57f68 18882d9c32ba42aeaa33c4703ad84b2c - default default] Filter MetricsFilter returned 0 hosts | 17:50 |
colby_ | Not sure where the problem is | 17:50 |
*** shaohe_feng has quit IRC | 17:50 | |
colby_ | weight_setting=compute.node.cpu.percent=-1.0 | 17:51 |
dansmith | colby_: I really can't help you, but I can tell you that metrics have nothing to do with gnocchi/ceilo | 17:51 |
colby_ | ok I thought I read somewhere that it used the gnocchi metrics... | 17:51 |
dansmith | colby_: the computes have to be configured to report them in order to use the filter | 17:51 |
dansmith | nope | 17:51 |
*** shaohe_feng has joined #openstack-nova | 17:52 | |
colby_ | ok so then the compute_monitors is the issue then | 17:52 |
dansmith | the metrics come from libvirt, reported by the compute, used by the filter | 17:52 |
colby_ | ok then Im not sure why Im not getting the metrics | 17:53 |
colby_ | besides the filed driver load | 17:53 |
colby_ | or monitor load I mean | 17:53 |
dansmith | yeah, I can't really help beyond that | 17:53 |
sean-k-mooney | dansmith: colby_ if you enable the metric reporting on the compute node ceilometer is able to read them form the message bus and store them but that is a sideffect | 17:54 |
colby_ | Ok so does that mean my metrics reporting is working? | 17:55 |
sean-k-mooney | colby_: by the way memory bandwidth monitoring is broken on skylake. both read and write metrics are actually read... | 17:56 |
colby_ | Im actually just interested in the cpu.percent | 17:56 |
colby_ | I want to not put instances on nodes with high cpu usage. We have a large memory node and the scheduler always puts instances there even when its way overcommited on cpu | 17:57 |
sean-k-mooney | ah well you could just change the order of the weigher to prefer weighing on cpus before memory. am but i have not used the metric based weigher myself so i have not tried to configure it before | 17:58 |
*** penick is now known as OcataGuy | 17:58 | |
*** OcataGuy is now known as MostlyOcataGuy | 17:58 | |
colby_ | ah ok. I treid weight_setting=cpu.percent=-1.0 | 17:59 |
colby_ | but I got zero hosts returned with metrics filter enabled | 17:59 |
*** Sundar_ has quit IRC | 18:00 | |
colby_ | I was not aware that changing weigher order made any difference | 18:00 |
*** shaohe_feng has quit IRC | 18:01 | |
colby_ | I just used: nova.scheduler.weights.all_weighers | 18:01 |
colby_ | I thought it was all just based on multipliers | 18:02 |
*** savvas has joined #openstack-nova | 18:02 | |
*** med_ has quit IRC | 18:02 | |
sean-k-mooney | colby_: well strictly speaking it does not, but what i meant was listing only the weighers you care about and then setting their multipliers | 18:02 |
*** jdillaman has quit IRC | 18:03 | |
*** shaohe_feng has joined #openstack-nova | 18:03 | |
sean-k-mooney | if you only care about cpus then you can simply enable only the cpu weigher | 18:03 |
colby_ | hmm ok | 18:04 |
colby_ | thanks | 18:04 |
melwitt | colby_: are you specifying compute_monitors= under the [DEFAULT] section of the nova.conf? | 18:04 |
colby_ | yes | 18:04 |
colby_ | but I get the error: Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors) | 18:04 |
melwitt | okay. the log message you posted earlier is saying it doesn't find the monitor in the list from the conf option. hm | 18:05 |
*** gbarros_ has joined #openstack-nova | 18:05 | |
colby_ | oh wait...there is a typo <smacks head> | 18:06 |
sean-k-mooney | colby_: you could probably get a similar effect by setting ram_weight_multiplier=0 or 0.1 so that ram is basically ignored when weighing if that does not work | 18:08 |
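Pulling this exchange together, a nova.conf along these lines should cover what colby_ is after (Pike-era option names; treat the exact filter list and weigher class paths as an assumption to check against your release):

```ini
[DEFAULT]
# Must match the monitor's registered name exactly ("cpu.virt_driver"),
# or the compute logs "Excluding ... Not in the list of enabled monitors".
compute_monitors = cpu.virt_driver

[filter_scheduler]
enabled_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,MetricsFilter
weight_classes = nova.scheduler.weights.all_weighers
# Downplay free-RAM weighting so the big-memory host stops winning every time.
ram_weight_multiplier = 0.1

[metrics]
# Negative multiplier: prefer hosts with lower CPU utilization.
weight_setting = cpu.percent=-1.0
```

The metrics come straight from the compute's monitors via libvirt, as dansmith says; gnocchi/ceilometer are not involved in scheduling.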
colby_ | ok thanks for your help! | 18:08 |
*** gbarros has quit IRC | 18:09 | |
*** gbarros_ has quit IRC | 18:09 | |
*** gbarros has joined #openstack-nova | 18:10 | |
*** shaohe_feng has quit IRC | 18:11 | |
*** gbarros_ has joined #openstack-nova | 18:12 | |
*** gbarros__ has joined #openstack-nova | 18:13 | |
*** mriedem1 has joined #openstack-nova | 18:14 | |
*** mriedem has quit IRC | 18:14 | |
*** gbarro___ has joined #openstack-nova | 18:14 | |
*** gbarros has quit IRC | 18:15 | |
*** shaohe_feng has joined #openstack-nova | 18:15 | |
*** gbarros has joined #openstack-nova | 18:15 | |
*** harlowja has joined #openstack-nova | 18:15 | |
*** gbarros_ has quit IRC | 18:16 | |
*** gbarros__ has quit IRC | 18:18 | |
*** sridharg has quit IRC | 18:18 | |
*** gbarro___ has quit IRC | 18:18 | |
sean-k-mooney | melwitt: mriedem1 https://review.openstack.org/#/c/586568/ hit the retry_limit issue after your last recheck. is that issue(retry_limit) still happening in the gate | 18:19 |
melwitt | I think it's been fixed | 18:19 |
sean-k-mooney | well there is no gate job for that patch at the moment. will i retry it? | 18:20 |
melwitt | yeah, go ahead. I didn't realize that one hadn't been rechecked | 18:21 |
*** shaohe_feng has quit IRC | 18:21 | |
sean-k-mooney | melwitt: it had. you did it at 5:14 but it hit the error again | 18:22 |
sean-k-mooney | you probably missed the fix by a few minutes | 18:22 |
melwitt | yeah, guh | 18:22 |
mriedem1 | dansmith: danicus, i have good pleasurable news | 18:22 |
*** mriedem1 is now known as mriedem | 18:22 | |
dansmith | um | 18:22 |
*** shaohe_feng has joined #openstack-nova | 18:22 | |
mriedem | bug 1784022 isn't a problem | 18:23 |
openstack | bug 1784022 in OpenStack Compute (nova) queens "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged] https://launchpad.net/bugs/1784022 | 18:23 |
mriedem | it's handled | 18:23 |
dansmith | oh yeah? | 18:23 |
dansmith | that is indeed pleasurable | 18:23 |
melwitt | dansmith: wanna ack this? https://review.openstack.org/586614 | 18:24 |
dansmith | yup | 18:25 |
*** artom has quit IRC | 18:26 | |
melwitt | hooray | 18:26 |
melwitt | dangit, missed artom again. I had wanted to ask him about https://bugs.launchpad.net/nova/+bug/1708433 | 18:27 |
openstack | Launchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,New] | 18:27 |
mriedem | dansmith: i'll push up the functional test anyway since it didn't look like we had one, only for the non-forced rollback checks | 18:28 |
dansmith | okay | 18:28 |
dansmith | mriedem: did you see my comment above about stephen's set? | 18:29 |
dansmith | and I pushed up the other fixes to that, btw | 18:29 |
dansmith | since you hadn't and seemingly got distracted with this other thing | 18:29 |
*** Sundar_ has joined #openstack-nova | 18:29 | |
dansmith | oh I see you did | 18:29 |
dansmith | cool | 18:29 |
*** rmart04 has joined #openstack-nova | 18:31 | |
*** jmlowe has quit IRC | 18:31 | |
*** shaohe_feng has quit IRC | 18:31 | |
mriedem | sure did | 18:33 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add functional test for forced live migration rollback allocs https://review.openstack.org/586636 | 18:33 |
*** shaohe_feng has joined #openstack-nova | 18:34 | |
mriedem | well, just in time for us to kill the shared storage provider support, i got it passing the ceph job http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/ | 18:35 |
dansmith | presumably because we're left with broken allocations after a revert or something, but don't check/assert them? | 18:36 |
*** artom has joined #openstack-nova | 18:37 | |
mriedem | right tempest won't assert any of that stuff, | 18:37 |
mriedem | we do have a post-test hook in the nova-next job for making sure there are no orphaned allocations but only on compute node providers | 18:37 |
dansmith | we had some sanity checking and logging in the RT when we removed the healing.. maybe there is some evidence in there? | 18:37 |
mriedem | oh nvm it's not just computes, it's all resource providers | 18:38 |
mriedem | but we don't run it on that job | 18:38 |
dansmith | http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_35_337258 | 18:39 |
mriedem | yeah i don't see any obvious warnings related to allocations | 18:40 |
mriedem | i think if we ran our post-test leaked allocation hook on this job it would fail | 18:40 |
*** flwang1 has quit IRC | 18:41 | |
mriedem | well, maybe not for single node | 18:41 |
*** shaohe_feng has quit IRC | 18:42 | |
*** flwang1 has joined #openstack-nova | 18:42 | |
dansmith | yeah, so there are 133 logs of instance fd563ed2-d42c-4dc1-a614-8700c6e6c8fd | 18:42 |
dansmith | having non-cleaned-up allocations | 18:43 |
*** shaohe_feng has joined #openstack-nova | 18:43 | |
dansmith | although really the allocations that we'd destroy wouldn't be against the compute node, | 18:43 |
dansmith | and would be gone not stale | 18:43 |
dansmith | so even your check probably wouldn't catch it | 18:43 |
dansmith | because we'd be _losing_ not _leaking_ disk allocations | 18:43 |
dansmith | also, um | 18:45 |
dansmith | I just noticed that we're logging an entire console log out of privsep somewhere | 18:45 |
dansmith | http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_550670 | 18:45 |
dansmith | you could argue that is a security issue if instances log sensitive info to their console | 18:46 |
mriedem | nice, 9 of those | 18:46 |
mriedem | you can open that bug | 18:46 |
*** r-daneel_ has joined #openstack-nova | 18:47 | |
*** r-daneel has quit IRC | 18:47 | |
*** r-daneel_ is now known as r-daneel | 18:47 | |
dansmith | okay | 18:47 |
dansmith | does privsep daemon log everything over the channel or something? | 18:48 |
*** s10 has quit IRC | 18:48 | |
sean-k-mooney | dansmith: that log is because setting a route in the guest failed http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_555618 | 18:49 |
dansmith | not sure about that | 18:50 |
sean-k-mooney | i think | 18:50 |
Sundar_ | efried: I need to take off for lunch. I'll look for your response in https://review.openstack.org/#/c/577438/. We need to get this discussion to a closure. | 18:50 |
dansmith | don't think so, I'm not sure why we'd log the console output in that case | 18:50 |
*** Sundar_ has quit IRC | 18:50 | |
dansmith | the route errors on the console are just there because we're logging it, if that's what you're looking at | 18:51 |
*** rmart04 has quit IRC | 18:51 | |
sean-k-mooney | yes it was but this looks like the output of dmesg when we are running through cloud-init | 18:51 |
sean-k-mooney | well i guess its the main console log | 18:52 |
*** shaohe_feng has quit IRC | 18:52 | |
dansmith | sean-k-mooney: it's the instance console log | 18:53 |
*** shaohe_feng has joined #openstack-nova | 18:53 | |
dansmith | which is more than dmesg | 18:53 |
*** rmart04 has joined #openstack-nova | 18:53 | |
sean-k-mooney | well its a debug log. i wonder if it's related to http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_548723 | 18:54 |
dansmith | it looks to me like privsep daemon is logging anything sent over the channel, | 18:54 |
sean-k-mooney | by that i mean its a debug log so at least it does not do this normally | 18:54 |
*** rmart04 has quit IRC | 18:54 | |
dansmith | and since we're using it to do a readpty of the console, it gets logged | 18:55 |
dansmith | sean-k-mooney: lots of people run with debug on all the time | 18:55 |
dansmith | https://bugs.launchpad.net/nova/+bug/1784062 | 18:55 |
openstack | Launchpad bug 1784062 in OpenStack Compute (nova) "Instance console data is logged at DEBUG" [Undecided,New] | 18:55 |
dansmith | melwitt: ^ | 18:55 |
dansmith | I dunno what will be involved in squelching that, | 18:55 |
*** gbarros has quit IRC | 18:55 | |
dansmith | but might be good to fix that before GA, IMHO | 18:55 |
melwitt | gah, moar bugs | 18:55 |
melwitt | yeah, agreed. I'll put it on the RC1 list | 18:56 |
sean-k-mooney | dansmith: well i know privsep propagates any exceptions back over the unix socket and any logging within the privsep daemon is redirected to the parent too as far as i know | 18:57 |
dansmith | I'd like to point to mriedem's statement that we should be finding and fixing critical bugs during this phase instead of rushing on a lot of FFEs | 18:57 |
dansmith | the last 24 hours has been pretty ... that. | 18:57 |
*** fried_rolls is now known as fried_rice | 18:59 | |
*** MostlyOcataGuy is now known as penick | 19:00 | |
mriedem | SWEET VALIDATION | 19:01 |
*** shaohe_feng has quit IRC | 19:02 | |
sean-k-mooney | dansmith: its coming from this line https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L442 | 19:02 |
dansmith | sean-k-mooney: that wouldn't make much sense | 19:03 |
dansmith | I expect it's the one below, L455 | 19:03 |
dansmith | TestNetworkBasicOps-1426085565] privsep: reply[140593546325360]: (4, '') | 19:03 |
sean-k-mooney | sorry yes l455 | 19:03 |
*** shaohe_feng has joined #openstack-nova | 19:03 | |
dansmith | yup | 19:03 |
*** r-daneel has quit IRC | 19:04 | |
melwitt | that doesn't look very squelchable | 19:04 |
sean-k-mooney | so should we just delete those? | 19:04 |
dansmith | melwitt: agree, it's sticky, but .. imagine what else we might be logging when we're running commands as root... | 19:04 |
melwitt | no, I agree. just thinking, how can we stop it | 19:05 |
dansmith | melwitt: maybe we recommend squelching privsep DEBUG logs in the levels as a security measure? | 19:05 |
dansmith | but still, | 19:05 |
dansmith | something better likely needs doing | 19:05 |
sean-k-mooney | we could add a conf option for extra verbose logging to privsep. | 19:05 |
dansmith | we control that to some degree in our default levels for libraries, | 19:05 |
dansmith | assuming the daemon starts with our config | 19:06 |
*** gbarros has joined #openstack-nova | 19:06 | |
sean-k-mooney | things like os-vif plugins create their own privsep daemons | 19:06 |
sean-k-mooney | it would be nice to turn that off by default globally | 19:07 |
*** gbarros has quit IRC | 19:09 | |
dansmith | decorating privsep methods as "may return sensitive stuff" would be one way, and let the daemon just not log the result | 19:11 |
dansmith | for the DoS case, limiting what we log to 256 chars max or something seems prudent | 19:11 |
melwitt | are you talking about changes to oslo_privsep or nova? | 19:12 |
*** shaohe_feng has quit IRC | 19:12 | |
dansmith | well, the decoration would be both | 19:12 |
dansmith | we'd decorate our things, and the daemon code would have to honor it | 19:13 |
melwitt | okay, I see | 19:13 |
dansmith | the log length limit would be purely privsep | 19:13 |
melwitt | gotcha | 19:13 |
dansmith | and our forcing of a log level for our own daemon could maybe be all on our end, but not sure | 19:13 |
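[A rough sketch of the two mitigations floated just above -- a "this is sensitive" marker the daemon would honor, plus a cap on logged reply size. The names here are invented for illustration; this is not actual oslo.privsep API.]

```python
MAX_LOGGED = 256  # cap reply logging to mitigate giant console dumps


def sensitive(func):
    """Mark a privsep-exposed function whose result must never be logged."""
    func._privsep_sensitive = True
    return func


def log_reply(func, reply):
    """What daemon-side reply logging might do if it honored such a flag."""
    if getattr(func, '_privsep_sensitive', False):
        return '<redacted>'
    text = repr(reply)
    # truncate anything huge (e.g. an entire instance console log)
    if len(text) > MAX_LOGGED:
        text = text[:MAX_LOGGED] + '...'
    return text
```

[A `readpty` of an instance console would be decorated `@sensitive`, so the daemon logs `<redacted>` instead of the console contents.]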
*** shaohe_feng has joined #openstack-nova | 19:14 | |
melwitt | yeah, I was looking for where the default log levels come from and didn't find it yet | 19:14 |
dansmith | well, we control them for our libraries you know, | 19:15 |
dansmith | but I think the daemon itself is logging this | 19:15 |
sean-k-mooney | melwitt: well this is a devstack run so we probably hardcode the log level to debug in the nova conf | 19:15 |
melwitt | the decorator idea sounds like a good feature but I don't know how hard it would be to coordinate that with oslo in the next week or so | 19:15 |
dansmith | but, I assumed it was following our debug=true, so.. | 19:15 |
*** jaypipes has joined #openstack-nova | 19:16 | |
dansmith | I wonder if we've been doing this since this patch merged... | 19:16 |
melwitt | yeah, I mean how do we configure another library to log at a certain different level | 19:16 |
dansmith | surely thought we'd have heard of it | 19:16 |
dansmith | there's a default log levels thing | 19:18 |
sean-k-mooney | well privsep has its own log handler that redirects everything over the unix socket https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L144 | 19:18 |
dansmith | https://docs.openstack.org/kilo/config-reference/content/list-of-compute-config-options.html | 19:18 |
dansmith | default_log_levels = | 19:19 |
*** med_ has joined #openstack-nova | 19:19 | |
*** med_ has quit IRC | 19:19 | |
*** med_ has joined #openstack-nova | 19:19 | |
dansmith | default contains, for example: oslo.messaging=INFO | 19:19 |
dansmith | heh, that's kilo, but... :) | 19:19 |
melwitt | oh, never knew about that. cool | 19:19 |
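[For reference, the knob being discussed looks roughly like this in nova.conf. The exact logger name to pin depends on what the relay logs under (dansmith's eventual patch targets oslo.privsep.daemon), so treat this as a sketch rather than the final change:]

```ini
[DEFAULT]
# Append the privsep daemon logger at INFO to the stock defaults, so that
# debug=True no longer dumps privsep replies (e.g. console reads) to logs.
default_log_levels = amqp=WARN,sqlalchemy=WARN,oslo.messaging=INFO,oslo_privsep.daemon=INFO
```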
dansmith | I don't see that we much control the execution of the daemon really, | 19:21 |
dansmith | so not sure if it even knows what our config is | 19:21 |
dansmith | or how it knows to have debug on | 19:22 |
dansmith | but yeah, if it's being fed into our logger (like sean-k-mooney is suggesting) then setting a level in that config might affect it | 19:22 |
*** shaohe_feng has quit IRC | 19:23 | |
melwitt | hm, yeah. the example shows all kinds of libraries that aren't openstack things as being affected | 19:23 |
sean-k-mooney | well this is what is handeling the log message on the nova side of the call https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L206 | 19:23 |
dansmith | melwitt: it has nothing to do with openstack | 19:24 |
*** shaohe_feng has joined #openstack-nova | 19:24 | |
dansmith | melwitt: it's in our config of the root logger, | 19:24 |
dansmith | which any library will ultimately use | 19:24 |
*** mgoddard has quit IRC | 19:24 | |
dansmith | it just matters that it's in our process space | 19:24 |
*** rtjure has quit IRC | 19:25 | |
dansmith | so the daemon being outside, would be unaffected (unless it's looking at our config), but if it's redirecting all the log traffic over the channel and we have something our side reading that and logging _as_ privsep.daemon in our process, | 19:25 |
dansmith | then our root logger config would affect it | 19:25 |
melwitt | okay, I see. thanks for explaining that | 19:25 |
sean-k-mooney | dansmith: in this case it's even going to work across processes because both the rootwrap and fork clients swap out the logger to redirect it over the socket | 19:26 |
dansmith | sean-k-mooney: yeah I just said that :) | 19:26 |
mriedem | so we just need to hard-code oslo.privsep=INFO or something in our default_log_levels yeah for that bug? didn't read all the backscroll | 19:27 |
sean-k-mooney | hah yep. i was typing when you did :) | 19:27 |
dansmith | sean-k-mooney: heh okay | 19:27 |
dansmith | mriedem: yeah, sounds like it | 19:27 |
mriedem | easy peasy | 19:27 |
dansmith | yup | 19:27 |
mriedem | melwitt: don't forget to defer a bunch of these https://blueprints.launchpad.net/nova/rocky | 19:28 |
melwitt | mriedem: right, thanks | 19:28 |
mriedem | i only see 3 in there that wouldn't be deferred | 19:28 |
mriedem | mox-removal, versioned notifications and stephen's numa vswitch bp | 19:28 |
melwitt | thanks | 19:28 |
sean-k-mooney | dansmith: um, could we use a decorator/context manager to also change the config for a specific call? | 19:29 |
dansmith | sean-k-mooney: not sure I parsed that, but I think we'd not want to override log levels in a context manager | 19:29 |
sean-k-mooney | basically im thinking about your previous suggestion of a decorator for the "this is sensitive, never log it" cases | 19:30 |
dansmith | sean-k-mooney: yep, something intentional for this might be good | 19:30 |
*** eharney has quit IRC | 19:31 | |
melwitt | setting the default log level for oslo.privsep is a good mitigation we can do immediately. then we can look at the idea of adding something to oslo.privsep to control this in a better, non-overrideable way (though I guess one could argue if the user really wants to override, they should be able to) | 19:31 |
sean-k-mooney | the default log level change is also good but that read tty call probably should never be logged | 19:31 |
sean-k-mooney | melwitt: if the user really want to log it that much they could add a print() | 19:32 |
sean-k-mooney | or remove the decorator | 19:32 |
sean-k-mooney | its likely that you would only want to do this if you're debugging | 19:32 |
*** shaohe_feng has quit IRC | 19:33 | |
melwitt | yeah, I just meant to point out it's a consideration. not arguing either way | 19:33 |
*** shaohe_feng has joined #openstack-nova | 19:34 | |
*** rtjure has joined #openstack-nova | 19:35 | |
sean-k-mooney | ya thats true | 19:36 |
mriedem | the default_log_levels thing is backportable, which i'm assuming this needs to be | 19:36 |
mriedem | we've had privsep in for awhile | 19:36 |
dansmith | I've been trying to git-review this mofo for a few minutes now | 19:37 |
sean-k-mooney | well the default_log_levels can be set in deployment tools so it can be done downstream also even if it was not upstream | 19:38 |
openstackgerrit | Dan Smith proposed openstack/nova master: Force oslo.privsep.daemon logging to INFO level https://review.openstack.org/586643 | 19:38 |
dansmith | thar ^ | 19:38 |
dansmith | we can check the logs after a run of that and make sure theres no privsep debug noise in there | 19:39 |
*** lbragstad_ has joined #openstack-nova | 19:40 | |
*** lbragstad has quit IRC | 19:41 | |
*** shaohe_feng has quit IRC | 19:43 | |
*** shaohe_feng has joined #openstack-nova | 19:45 | |
*** mchlumsky_ has quit IRC | 19:47 | |
* mriedem goes to get ma child | 19:48 | |
*** mchlumsky has joined #openstack-nova | 19:50 | |
sean-k-mooney | dansmith: the only down side to this change is i used to use some of those log messages to debug os-vif plugging stuff but in hindsight i should have probably questioned why they were there | 19:51 |
sean-k-mooney | dansmith: that said http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_229614 | 19:51 |
dansmith | sean-k-mooney: this is just the default, you can still override it in config to turn it on | 19:51 |
sean-k-mooney | this is being logged from the privsep daemon but reported as oslo_concurrency | 19:52 |
sean-k-mooney | dansmith: oh i know, what will we do in the gate? | 19:52 |
dansmith | well, we can override this for the gate, it just needs to not be on by default | 19:53 |
*** shaohe_feng has quit IRC | 19:53 | |
dansmith | sean-k-mooney: are you sure? that doesn't look like the privsep format | 19:54 |
dansmith | and processutils would log something like that | 19:54 |
dansmith | maybe it's inside the daemon, but running processutils, which is emitting the actual log? | 19:54 |
sean-k-mooney | dansmith: that code is executed via privsep but that message is not from that log | 19:54 |
sean-k-mooney | dansmith: yes | 19:54 |
dansmith | okay I'm confused about what you're saying | 19:55 |
sean-k-mooney | sorry one sec | 19:55 |
*** shaohe_feng has joined #openstack-nova | 19:56 | |
sean-k-mooney | its basically this https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L155 | 19:56 |
*** lbragstad_ is now known as lbragstad | 19:56 | |
*** mchlumsky has quit IRC | 19:56 | |
sean-k-mooney | which invokes processutils here https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L58 | 19:56 |
sean-k-mooney | the actual privsep request message is printed here http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_229139 | 19:58 |
*** mchlumsky has joined #openstack-nova | 19:58 | |
sean-k-mooney | but any logging privileged functions do is also relayed to the parent over the socket. | 19:59 |
*** pchavva has quit IRC | 19:59 | |
*** ccamacho1 has joined #openstack-nova | 20:00 | |
*** mgoddard has joined #openstack-nova | 20:01 | |
*** ccamacho has quit IRC | 20:01 | |
sean-k-mooney | anyway lets see if that config option just affects the default log level of oslo.privsep's own internal logging or also the stuff called via a privsep context | 20:02 |
*** itlinux has joined #openstack-nova | 20:03 | |
dansmith | okay, I'm still not sure what your concern is | 20:03 |
dansmith | but it's probably my friday brain | 20:03 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: DNM: Extra logs for volume detach device tags cleanup https://review.openstack.org/584032 | 20:03 |
*** shaohe_feng has quit IRC | 20:04 | |
sean-k-mooney | well im hoping that oslo.privsep.daemon=INFO just disables the debug logging for privsep's own debug logs but not debug logs from things called via privsep | 20:04 |
dansmith | why? | 20:05 |
*** mgoddard has quit IRC | 20:05 | |
dansmith | it should affect anything that logs with oslo.privsep.daemon, not anything else | 20:05 |
*** mchlumsky has quit IRC | 20:05 | |
*** shaohe_feng has joined #openstack-nova | 20:05 | |
dansmith | if those concurrency logs are logged with a logger name of oslo.concurrency.processutils, then they should be unaffected | 20:06 |
dansmith | is that what you mean? | 20:06 |
sean-k-mooney | yes. | 20:06 |
dansmith | okay I think we'll be okay on that, assuming it works the way I think it does | 20:06 |
dansmith | I expect there is some code in privsep that does: | 20:06 |
dansmith | for message_logged_in_the_daemon: logger.getLogger(message.log_name).log.$level(message.msg) | 20:07 |
*** mchlumsky has joined #openstack-nova | 20:07 | |
dansmith | so my change should only affect actual messages logged on the daemon log name | 20:07 |
dansmith | not anything logged in the context of the daemon at all | 20:07 |
*** liuyulong__ has quit IRC | 20:08 | |
dansmith | hmm, that code was kindof nonsense, let me try again: | 20:08 |
sean-k-mooney | its this code that i was unsure about https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L249-L254 | 20:08 |
itlinux | hello Nova guys, when spinning up a VM, and the hypervisor is asking to pull the image from glance does that go over the storage network? thanks | 20:08 |
*** liuyulong__ has joined #openstack-nova | 20:08 | |
dansmith | sean-k-mooney: that's the daemon-side code that intercepts the logs to redirect | 20:09 |
sean-k-mooney | yes | 20:09 |
dansmith | it's the non-daemon code that does the actual logging and would do what I surmised above | 20:09 |
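[dansmith's pseudocode above, tidied into a runnable sketch of the client-side relay as he surmises it works (assumed behavior, not the real oslo_privsep code): records shipped back from the daemon are re-emitted under their original logger name, which is exactly why a default_log_levels entry in the parent process can filter them.]

```python
import logging


def handle_remote_log(message):
    """Re-emit a log record relayed from the privsep daemon.

    `message` is assumed to be (logger_name, level, text) -- a stand-in
    for whatever the real wire format carries.
    """
    name, level, text = message
    # Emitted through the parent's logging config, so the parent's
    # per-logger levels (default_log_levels) apply to daemon-side logs.
    logging.getLogger(name).log(level, text)
```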
*** s10 has joined #openstack-nova | 20:09 | |
sean-k-mooney | well part of it | 20:09 |
sean-k-mooney | anyway we will see soon. | 20:09 |
*** errantekarmico has joined #openstack-nova | 20:14 | |
*** shaohe_feng has quit IRC | 20:14 | |
*** shaohe_feng has joined #openstack-nova | 20:15 | |
*** slaweq has joined #openstack-nova | 20:15 | |
dansmith | yup | 20:15 |
*** errantekarmico has left #openstack-nova | 20:16 | |
mnaser | there technically should never be rows with cell_id=NULL in instance_mappings.. right? | 20:19 |
dansmith | mnaser: mappings have no cell until they're scheduled | 20:20 |
mnaser | dansmith: right, but yknow, not an instance from march lets say | 20:21 |
mnaser | :p | 20:21 |
dansmith | they should always end up scheduled, to cell0 at least, but they can be there transiently and/or if something fails | 20:21 |
mnaser | alright so i think i'll have to write something to look in our cell vs cell0 and update mappings to make the db consistent | 20:22 |
*** dtruong_ has quit IRC | 20:23 | |
*** shaohe_feng has quit IRC | 20:24 | |
*** shaohe_feng has joined #openstack-nova | 20:25 | |
*** dtruong_ has joined #openstack-nova | 20:26 | |
*** med_ has quit IRC | 20:27 | |
*** savvas has quit IRC | 20:28 | |
*** savvas has joined #openstack-nova | 20:28 | |
*** savvas has quit IRC | 20:30 | |
*** savvas has joined #openstack-nova | 20:30 | |
*** artom has quit IRC | 20:32 | |
*** shaohe_feng has quit IRC | 20:34 | |
*** mchlumsky has quit IRC | 20:37 | |
*** tidwellr has quit IRC | 20:38 | |
*** slaweq has quit IRC | 20:40 | |
*** shaohe_feng has joined #openstack-nova | 20:40 | |
*** felipemonteiro_ has quit IRC | 20:40 | |
*** felipemonteiro_ has joined #openstack-nova | 20:40 | |
*** shaohe_feng has quit IRC | 20:45 | |
*** shaohe_feng has joined #openstack-nova | 20:45 | |
mriedem | mnaser: same issue from last week right? | 20:46 |
mriedem | could have been rpc outage so a failed db update | 20:46 |
mriedem | er db? | 20:46 |
mriedem | failed write i mean | 20:46 |
*** cdent has quit IRC | 20:46 | |
mnaser | mriedem: no it looks like over the lifetime of our cloud any rpc or db related things might have accumulated a lot of things in nova_api with cell_id = NONE | 20:46 |
mnaser | like, 20000 worth. | 20:46 |
mriedem | i had also identified one spot in conductor where the build request will be gone and we don't set the instance mapping to cell0 | 20:46 |
mnaser | however for 99.9999% of those, they were actually assigned a cell and not buried in cell0 | 20:47 |
mnaser | dansmith, mriedem: http://paste.openstack.org/show/726767/ might be a useful little tool if someone ends up in the same situation | 20:47 |
mnaser | connect to api db, get all cells, go over them all and check where it can find the instance, and then print out an update statement for manual fix | 20:48 |
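[A hypothetical reconstruction of what the pasted repair tool does -- the data shapes and helper names here are invented for illustration; the real script talks to the API and cell databases directly.]

```python
def find_cell_for_instance(cells, instance_uuid):
    """Return the first cell whose instances table contains the uuid."""
    for cell in cells:
        if instance_uuid in cell['instance_uuids']:
            return cell  # an instance should only ever be in one cell
    return None  # never scheduled, or build request still pending


def emit_fixups(unmapped_uuids, cells):
    """Print UPDATE statements for manual review, one per fixable row."""
    for uuid in unmapped_uuids:
        cell = find_cell_for_instance(cells, uuid)
        if cell is None:
            continue  # leave the mapping alone; nothing to point it at
        print("UPDATE instance_mappings SET cell_id = %d "
              "WHERE instance_uuid = '%s';" % (cell['id'], uuid))
```

[Emitting SQL for manual application rather than writing directly keeps the operator in the loop, which matches how the paste is described above.]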
mriedem | we could nova-manage cell_v2 that baby | 20:48 |
mnaser | i can push up an initial patch but i dunno how much i can iterate/test/etc because i've been a bit overwhelmed | 20:49 |
mnaser | and it would have to be updated to use nova objects too i guess | 20:49 |
mriedem | np, or just report a bug and put this paste in it as a template | 20:49 |
mriedem | latter is fine ^ | 20:49 |
mnaser | good idea | 20:49 |
mriedem | is this finding instances in non-cell0 cells? | 20:50 |
mriedem | that aren't in error state? | 20:50 |
mnaser | mriedem: im not sure about the exact logic, but i grab a list of all cells, connect to them, and loop until i find an entry inside 'instances' table with the same id | 20:50 |
mnaser | if that is logically wrong, i can fix it | 20:51 |
mriedem | it makes sense | 20:51 |
mriedem | if the instance mapping doesn't tell what cell it's in, we have to iterate the cells looking for it | 20:51 |
mnaser | and there is no change it ever being in two cells | 20:52 |
mnaser | s/change/chance/ | 20:52 |
mriedem | is that a question? | 20:52 |
mnaser | yes | 20:52 |
mriedem | shouldn't be no | 20:52 |
mnaser | okay sounds good, because i break off once i find it and stop looping | 20:52 |
mriedem | but this shouldn't be happening in the first place | 20:52 |
mnaser | yeah :\ but i dunno how much to blame nova when it might have been an infra problem | 20:52 |
mriedem | i mean in a normal case we create the instance in the cell here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257 | 20:53 |
mriedem | if the user goes over quota we should put the instance into error state and mark the instance mapping https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1370 | 20:53 |
*** david-lyle has joined #openstack-nova | 20:54 | |
mriedem | in a normal case, we update the instance mapping here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1322 | 20:54 |
mnaser | in any case -- https://bugs.launchpad.net/nova/+bug/1784074 | 20:54 |
openstack | Launchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New] | 20:54 |
mriedem | before deleting the build request and casting to compute | 20:54 |
mnaser | hmm | 20:54 |
mnaser | i wonder if i wanna update that script | 20:54 |
mnaser | to check if a build_request exists | 20:55 |
mriedem | if anything fails in between there we could fail to update the mapping | 20:55 |
mriedem | mnaser: maybe - if the build request exists, the instance shouldn't be in a cell | 20:55 |
*** manjeets_ has joined #openstack-nova | 20:55 | |
mriedem | so L42 in your script is where i'd look for a build request | 20:55 |
mriedem | as a sanity check | 20:55 |
*** shaohe_feng_ has joined #openstack-nova | 20:56 | |
mnaser | mriedem: yeah i was planning to just run the mysql till a certain point and assume the rest was just unscheduled stuff but it could be confusing to hand off to others | 20:56 |
*** dklyle_ has quit IRC | 20:57 | |
mnaser | i'm leaning toward checking if a build request exists at L27 so a) i dont hit the cells and b) if a build request exists, technically there shouldn't be an issue because api calls will interact with that build request | 20:57 |
*** manjeets has quit IRC | 20:57 | |
*** anupn_ has quit IRC | 20:57 | |
mnaser | i think the problem is there when a build request AND cell mapping is missing | 20:57 |
mnaser | but i believe if build request is there but cell mapping is missing, it'll work just fine and not do any weird 404s on instances | 20:57 |
mriedem | correct | 20:57 |
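[The state table mnaser and mriedem just agreed on, as a tiny sketch (names invented for illustration): a mapping with no cell is only pathological once its build request is gone too.]

```python
def mapping_is_broken(has_build_request, mapping_has_cell):
    """Is an instance_mapping with these properties pathological?

    While the build request exists, the API can serve the instance from
    it, so a NULL cell_id is a normal transient state. Once the build
    request is deleted, a NULL cell_id leaves the instance listable (via
    a cell scan) but impossible to show or delete.
    """
    return not has_build_request and not mapping_has_cell
```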
mriedem | this was the case i was worried about last week https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 20:58 |
*** karimull has quit IRC | 20:58 | |
mriedem | in that case, the api has deleted the build request, and we haven't updated the instance mapping | 20:58 |
mriedem | but, we wouldn't put the instance in cell0 b/c the user deleted the instance before we created it (via build request) | 20:58 |
mriedem | mnaser: might be nice info to know if these unmapped instances are deleted | 20:59 |
melwitt | one thing that's interesting that I learned recently is that if, for some reason, there is a case where a build request exists but *no* instance mapping exists, the API does not handle it in that, the "instance" will show up in a 'nova list' but it can't be deleted because delete will raise NotFound | 20:59 |
*** shaohe_feng has quit IRC | 20:59 | |
*** shaohe_feng_ is now known as shaohe_feng | 20:59 | |
mriedem | i don't know how that could happen | 20:59 |
mriedem | we create the build request and the instance mapping in _provision_instances | 20:59 |
mriedem | *and request spec | 21:00 |
melwitt | and via code inspection, I don't know how that state could be gotten into other than nova-api restarting at precisely the moment after the build request is created but before the instance mapping was | 21:00 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L930 and then https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L942 | 21:00 |
*** anupn has joined #openstack-nova | 21:00 | |
mnaser | melwitt: yeah that's essentially the state that these vms are in | 21:00 |
mriedem | or the db failing the instance mapping insert | 21:00 |
*** karimull has joined #openstack-nova | 21:01 | |
melwitt | mnaser: I thought you had instance mappings though, right? | 21:01 |
melwitt | yeah, or that | 21:01 |
mnaser | melwitt: instance_mapping is there sure, but cell_id=NONE | 21:01 |
mnaser | so some of those are list-able, but not delete-able | 21:01 |
melwitt | yeah, that's different than what I said. your case will let a delete work | 21:01 |
mriedem | mnaser: are you listing as admin? | 21:01 |
melwitt | oh really? | 21:01 |
mriedem | to list out deleted instances? | 21:01 |
mnaser | nope, i had a user complain they could list an instance but could not delete it | 21:02 |
mriedem | i have to think you're hitting this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 21:02 |
mnaser | hell i cant even delete it | 21:02 |
melwitt | hm, okay, that is a new case I didn't know | 21:02 |
mnaser | let me dig th eticket | 21:02 |
*** brault_ has quit IRC | 21:02 | |
melwitt | I guess what it must do is, get the instance mapping, see cell_id=None and then think "I can't lookup the instance, therefore I can't delete it" | 21:03 |
mriedem | well, | 21:03 |
* melwitt looks at the code | 21:03 | |
mnaser | ok so confirmed here | 21:03 |
mriedem | it will fallback to trying to lookup the instance from the locally configured (in the api) [database]/connection | 21:03 |
mnaser | nova list --all-tenants | grep 1812c2eb-cfbc-4659-9817-4694ad3d2c37 < returns the instance with ERROR/NOSTATE | 21:03 |
mnaser | nova show 1812c2eb-cfbc-4659-9817-4694ad3d2c37 => ERROR (CommandError): No server with a name or ID of '1812c2eb-cfbc-4659-9817-4694ad3d2c37' exists. | 21:03 |
mriedem | mnaser: is that instance deleted? | 21:03 |
mriedem | instances.deleted != 0 | 21:04 |
mnaser | let me double check | 21:04 |
mnaser | fwiw though cell_id=NULL | 21:04 |
mnaser | checking instances | 21:04 |
*** edmondsw has quit IRC | 21:04 | |
mnaser | deleted=0 but this one is in cell0 | 21:05 |
mriedem | melwitt: this is what i'm thinking of https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1768 | 21:05 |
*** edmondsw has joined #openstack-nova | 21:05 | |
mriedem | mnaser: hmm, ok so the instance was created in cell0 but the instance mapping update failed | 21:05 |
mnaser | in this case yes | 21:05 |
*** yamahata has quit IRC | 21:05 | |
melwitt | that's not what runs for a delete though | 21:05 |
*** shaohe_feng has quit IRC | 21:05 | |
mriedem | melwitt: it has to lookup the instance right? | 21:05 |
mriedem | _lookup_instance is called via API.get() | 21:05 |
mnaser | yeah i cant even look it up, it just 404s | 21:05 |
melwitt | yeah but it goes here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L2333 | 21:05 |
*** r-daneel has joined #openstack-nova | 21:05 | |
*** shaohe_feng has joined #openstack-nova | 21:06 | |
mnaser | let me check | 21:06 |
mnaser | it probably doesnt have a build request | 21:06 |
mnaser | no build request indeed | 21:07 |
mnaser | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L2353 | 21:07 |
mnaser | so ending up here afaik | 21:07 |
mriedem | how are we listing it then... | 21:08 |
mnaser | maybe list just hits the cells and ignores api stuff? | 21:09 |
mnaser | i can help if i knew where the list code is :p | 21:09 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L98 | 21:09 |
*** edmondsw has quit IRC | 21:10 | |
melwitt | _lookup_instance is called via API().delete, _get_instance is called via API().get | 21:10 |
melwitt | and the API (nova/api/openstack/compute/servers.py) does a API().get first before doing anything with an instance | 21:10 |
mriedem | mnaser: you're right, we'll just iterate the cells | 21:10 |
mnaser | i guess in an ideal world you retrieve list of vms from nova_api, and then generate a subsequent list to each cell with a list of instance uuids to request | 21:11 |
mnaser | which might even eliminate extra calls if a user is located in one cell | 21:12 |
melwitt | so in the case of a build request with an instance mapping with cell_mapping = None, it will return build_request.instance, and I'm not sure what will happen if you try to delete that | 21:12 |
mriedem | mnaser: that's what this is for https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L101 | 21:12 |
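The per-cell iteration being linked can be pictured as a scatter-gather: query every cell database and merge the results, stopping once a requested limit is reached. A minimal toy sketch, with dicts standing in for cell databases — all names here are illustrative, not nova's real `CellMapping`/`instance_list` objects:

```python
# Toy model of cross-cell instance listing: each cell database is a dict of
# instances keyed by uuid. Illustrative stand-ins for nova's scatter-gather
# machinery in nova/compute/instance_list.py.

CELLS = {
    "cell0": {"uuid-a": {"uuid": "uuid-a", "host": None}},
    "cell1": {"uuid-b": {"uuid": "uuid-b", "host": "compute1"}},
}

def list_instances(limit=None):
    """Scatter the query across every cell and gather the merged results."""
    results = []
    for cell_name, cell_db in CELLS.items():
        for inst in cell_db.values():
            results.append(inst)
            if limit is not None and len(results) >= limit:
                return results
    return results
```

Note this path never consults the instance mappings, which is why an unmapped instance can still show up in a listing.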
melwitt | presumably it fails | 21:12 |
mriedem | and that's what cern uses | 21:12 |
mnaser | wouldn't it be safer to only delete the build request once the cell has been set? | 21:13 |
melwitt | so that means build_request.instance gets passed to compute API().delete | 21:13 |
mriedem | melwitt: in that case we should go through here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1877 | 21:14 |
mriedem | mnaser: the idea is if the user deletes the build request before the instance has been scheduled to a cell, we never create the instance in the cell, | 21:15 |
mriedem | so there is nothing to do with the instance mapping b/c it's not in a cell | 21:15 |
*** shaohe_feng has quit IRC | 21:15 | |
mriedem | and shouldn't get listed either b/c it's (1) not a build request and (2) not in a cell | 21:16 |
*** r-daneel has quit IRC | 21:16 | |
mnaser | yeah so maybe the issue here is really inside list? | 21:16 |
melwitt | right, so the delete of the build request would succeed, but then the lookup of the instance will fail because it was just a build_request.instance shell | 21:16 |
mriedem | which if that is really working, we get here in conductor after the build request was deleted in api https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 21:16 |
melwitt | or well, maybe not. _lookup_instance would return None, None in the cell_mapping = None case | 21:17 |
mriedem | i wonder why we don't update the instance mapping right after this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257 | 21:17 |
*** shaohe_feng has joined #openstack-nova | 21:18 | |
*** yamahata has joined #openstack-nova | 21:18 | |
mriedem | melwitt: right, if _delete_while_booting returns True, we exit https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1877 | 21:18 |
melwitt | hm, so I'm not seeing how delete would fail in that case | 21:19 |
*** manjeets_ is now known as manjeets | 21:20 | |
melwitt | mnaser: is there any chance the service version in one of the records in the 'services' table is < 15? | 21:22 |
mriedem | heh, i asked that last week too :) | 21:22 |
mnaser | melwitt: i checked that with mriedem last time we tried to look into this and no, none | 21:22 |
melwitt | I guess that wouldn't make sense. all of your instance GET would fail in that case | 21:22 |
mriedem | btw, i think we should probably remove that service version check now | 21:22 |
mriedem | commented on the bug https://bugs.launchpad.net/nova/+bug/1784074/comments/1 | 21:23 |
openstack | Launchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New] | 21:23 |
mriedem | with what *might* be happening | 21:23 |
mriedem | but you'd have errors in the logs | 21:23 |
melwitt | this doesn't make any sense how delete returns 404 | 21:23 |
mriedem | melwitt: read ^ that comment in the bug because i think that could explain a window where it could happen | 21:23 |
*** liuyulong_ has joined #openstack-nova | 21:24 | |
mriedem | mnaser: i wonder if these are instances getting created as part of a multi-create request where they all get created in a cell, then when we go to update mappings, something fails and then the rest are left unmapped | 21:24 |
*** liuyulong__ has quit IRC | 21:24 | |
mriedem | the user attempts to delete the instance, they delete the build request, but then they can still list it, | 21:24 |
mriedem | but can't delete it b/c the build request is gone and the instance mapping isn't pointing at a cell | 21:24 |
mriedem | hence your fix up script | 21:24 |
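The stuck state mriedem describes can be reproduced with a toy model of the delete path: the instance row exists in a cell, but the build request is gone and the mapping never got a cell, so delete has nowhere to look and 404s even though the listing (which walks the cells directly) still shows the instance. All names below are illustrative, not nova's actual API:

```python
# Toy reproduction of "listable but not deletable". Dicts stand in for the
# API-level and cell databases; NotFound stands in for the HTTP 404.

class NotFound(Exception):
    pass

def delete_instance(uuid, build_requests, instance_mappings, cells):
    if uuid in build_requests:              # roughly _delete_while_booting
        del build_requests[uuid]
        return "build request deleted"
    mapping = instance_mappings.get(uuid)
    if mapping is None or mapping["cell"] is None:
        raise NotFound(uuid)                # nowhere to look up the instance
    del cells[mapping["cell"]][uuid]
    return "instance deleted"
```

With a healthy mapping the same call finds and deletes the cell record, which is essentially what the fix-up script restores.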
melwitt | ohhh | 21:25 |
mriedem | this goes back to something we've talked about before where the schedule_and_build_instances method was split into a few phases where it was originally one | 21:25 |
*** shaohe_feng has quit IRC | 21:26 | |
mriedem | so now we (1) get hosts from scheduler (2) create instances in cells (3) recheck quota (4) do some other stuff including updating instance mappings and casting to compute to build | 21:26 |
*** awaugama has quit IRC | 21:26 | |
mriedem | if anything fails in the loop in #4 we'd have this situation | 21:26 |
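The window in those phases can be sketched with a toy version of the conductor flow: instances are created in their cells in one loop, and the mappings only updated in a later loop, so any failure in between strands created-but-unmapped instances. Names and the failure injection are illustrative:

```python
# Sketch of the ordering problem discussed above (illustrative, not nova's
# schedule_and_build_instances): create-in-cell and mapping-update happen in
# separate loops, leaving a failure window between them.

def schedule_and_build(instances, hosts, cells, instance_mappings,
                       fail_before_mapping=False):
    # phase 2: create each instance in its target cell
    for inst, host in zip(instances, hosts):
        cells[host["cell"]][inst["uuid"]] = inst
    # phases 3/4: quota recheck, notifications, mapping updates, casts --
    # a failure anywhere in here reproduces the unmapped-instance bug
    if fail_before_mapping:
        raise RuntimeError("simulated rabbit/db failure")
    for inst, host in zip(instances, hosts):
        instance_mappings[inst["uuid"]] = {"cell": host["cell"]}
```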
mnaser | these could be a multi create | 21:26 |
mnaser | let me double check | 21:26 |
*** shaohe_feng has joined #openstack-nova | 21:26 | |
mriedem | mnaser: you'd have to find the request spec and look that up | 21:27 |
melwitt | yeah, gosh | 21:27 |
mnaser | i know of a customer that uses this feature all the time | 21:27 |
mnaser | so it could just be them | 21:27 |
mriedem | there should be a num_instances field in the request spec for any of those instances | 21:27 |
mnaser | nope, at least one i randomly picked out is not a multi create | 21:27 |
mriedem | ok, well, | 21:27 |
mriedem | i think the theory still applies | 21:27 |
mriedem | if we fail *before* setting the instance mapping but after we've created the instance in the cell, we're toast | 21:28 |
mriedem | did we ever figure out if rabbit being down for notifications could screw us up too? because we send notifications before we update the instance mapping... | 21:29 |
melwitt | I don't know | 21:30 |
mriedem | i'll throw something up quick before i have to head out | 21:31 |
mnaser | so my audit script helped bring them from 20k down to 308 left which have no build_requests, no cell_id in the mapping | 21:32 |
mnaser | and not existing in any cells | 21:32 |
mriedem | mnaser: ok those are likely just instance mappings for deleted and purged instances | 21:32 |
mriedem | do you archive/purge the cell dbs often/ | 21:32 |
mriedem | ? | 21:32 |
mriedem | b/c it wasn't until i think rocky that we added instance mapping and reqspec hard delete to nova-manage db archive_deleted_rows when instances are archived | 21:33 |
mriedem | or maybe you run your own archive/purge script? | 21:33 |
mnaser | select created_at from instances order by id asc limit 1; => 2014-12-14 02:38:53 | 21:33 |
mnaser | ...ha. | 21:33 |
mnaser | but i think i'm mostly waiting for the rocky archive delete stuff | 21:34 |
*** shaohe_feng has quit IRC | 21:36 | |
*** shaohe_feng has joined #openstack-nova | 21:37 | |
mriedem | do you run your own archive script or nova-manage db archive_deleted_rows? | 21:41 |
mnaser | mriedem: none of the above, we just have a really really really big database | 21:42 |
mnaser | mysql indexing seems fast enough that it hasn't really affected us much other than just.. being a big db. | 21:42 |
sean-k-mooney | mriedem: fyi i left a comment on the review but is the call to self.driver.cleanup in https://review.openstack.org/#/c/586568/1 against the source or dest node? | 21:42 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Update instance mapping as soon as instance is created in cell https://review.openstack.org/586713 | 21:44 |
mriedem | mnaser: melwitt: throwing things at the wall ^ | 21:44 |
mriedem | sean-k-mooney: source | 21:44 |
mriedem | _post_live_migration and _rollback_live_migration run on the source host | 21:44 |
*** liuyulong__ has joined #openstack-nova | 21:45 | |
mriedem | sean-k-mooney: replie | 21:45 |
mriedem | *replied | 21:45 |
sean-k-mooney | mriedem: oh ok then yes it probably should have the source vif then however i dont think it actually will need them unless we replug the vifs | 21:45 |
mriedem | that's not what you said last night | 21:46 |
*** rtjure has quit IRC | 21:46 | |
mriedem | something something ovs hybrid plug cleanup | 21:46 |
mriedem | but it was 4am and you were maybe loopy | 21:46 |
sean-k-mooney | mriedem: for the cleanup | 21:46 |
*** shaohe_feng has quit IRC | 21:46 | |
sean-k-mooney | mriedem: self.driver.post_live_migration_at_source should use the old source vifs so it can unplug correctly | 21:47 |
*** liuyulong_ has quit IRC | 21:47 | |
*** shaohe_feng has joined #openstack-nova | 21:47 | |
mriedem | sean-k-mooney: yes, same thing | 21:48 |
sean-k-mooney | i dont know what self.driver.cleanup does. if its on the source however it should also probably be using the source vifs | 21:48 |
mriedem | sean-k-mooney: in the commit message, i pointed out that if post_live_migration_at_source is successful, destroy_vifs=False and the libvirt driver won't try to unplug in cleanup() | 21:48 |
mriedem | however, not all virt drivers adhere to that destroy_vifs flag | 21:49 |
mriedem | the hyperv driver doesn't for example | 21:49 |
sean-k-mooney | ah ok then yes that all looks good then | 21:49 |
mriedem | it looks...beautiful | 21:49 |
sean-k-mooney | normally i like shorter function names but the at_source and at_destination really help keep context in this code | 21:50 |
mriedem | that's why i did https://review.openstack.org/#/c/551371/ | 21:51 |
mriedem | because knowing wtf is going on in the 20 methods involved in live migration is not something you can keep in your head | 21:51 |
mriedem | also https://docs.openstack.org/nova/latest/reference/live-migration.html | 21:52 |
melwitt | yes. every time I figure out code like that, a few months later I end up wishing I had added a lot of code comments to it, if nothing else | 21:52 |
mriedem | yup also https://review.openstack.org/#/c/496861/ | 21:53 |
melwitt | two thumbs up | 21:53 |
mriedem | thanks ebert | 21:54 |
melwitt | looking at your change, trying to remember why the instance.create() was split up from the inst mapping update in the first place | 21:54 |
mriedem | RIP | 21:54 |
mriedem | melwitt: the quota stuff | 21:54 |
mriedem | i can find a review comment where we talked about the split | 21:54 |
melwitt | yeah, trying to re-remember | 21:54 |
sean-k-mooney | ya i have that bookmarked, i just didn't see we were still in _post_live_migration. that function does a lot | 21:54 |
mriedem | too much | 21:54 |
melwitt | I think it was something about, if we failed a quota recheck in the middle of a multi create, and to nix all the instances before creating any mappings | 21:55 |
melwitt | but we ended up not doing that and putting them in ERROR state | 21:55 |
sean-k-mooney | part of the issue is it's implementing a state machine and all of that context is mixed in with what it's doing | 21:55 |
melwitt | so that ended up being the wrong thing to do, I think | 21:55 |
mriedem | melwitt: https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@1020 | 21:56 |
*** antosh has quit IRC | 21:56 | |
mriedem | too bad i didn't link that irc convo in | 21:56 |
*** shaohe_feng has quit IRC | 21:56 | |
melwitt | yeah, this is coming back to me. there were other things like, at the time I was thinking don't create the BDMs etc until after we know we're good after the quota recheck | 21:58 |
*** shaohe_feng has joined #openstack-nova | 21:58 | |
melwitt | but we discussed on IRC and determined that all had a failure path to clean up anything that was created, and so should have been okay to just do everything normally and check quota at the end | 21:58 |
melwitt | in one loop instead of two | 21:59 |
mriedem | http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2017-09-06.log.html#t2017-09-06T20:33:51 | 21:59 |
mriedem | it was also a refactor we didn't want to backport | 21:59 |
melwitt | right yeah | 21:59 |
mriedem | i had a todo to combine back to a single loop on my desk for a long time, b/c i had in mind how to do it, | 22:00 |
mriedem | but long forgot now | 22:00 |
sean-k-mooney | mriedem: haha i was just looking at the irc logs to see if i could find it for you. | 22:00 |
melwitt | I added it to my todo list too so hopefully one of us will do it this time. I had forgotten about it | 22:00 |
*** savvas has quit IRC | 22:01 | |
*** med_ has joined #openstack-nova | 22:02 | |
*** med_ has quit IRC | 22:02 | |
*** med_ has joined #openstack-nova | 22:02 | |
mriedem | "dansmith: mriedem: we wouldn't know where to find the instance record to mark it as deleted when they deleted the buildreq, so we'd leave that undeleted but unfindable instance forever" | 22:02 |
mriedem | heh | 22:02 |
mriedem | sound familiar? | 22:03 |
mriedem | "mriedem: i shit my pants everytime we touch nova these days" | 22:04 |
mriedem | ha | 22:04 |
melwitt | haha, relatable | 22:04 |
mriedem | mnaser: again, congratulations to you to continue running a business on top of stuff we're still talking about fixing almost 1 year later :) | 22:04 |
*** itlinux has quit IRC | 22:05 | |
openstackgerrit | karim proposed openstack/nova master: Updated AggregateImagePropertiesIsolation filter illustration https://review.openstack.org/586317 | 22:05 |
*** felipemonteiro_ has quit IRC | 22:06 | |
mriedem | i think the tl;dr from the irc convo is just combine the loops and move the quota check to the end | 22:06 |
mriedem | "locally" deleting the instance will automatically delete the tags and bdms along with the instance from the cell | 22:06 |
melwitt | I'm trying to think, why didn't we move the instance mapping update earlier last time? | 22:06 |
melwitt | yeah, that's what I'm getting from it too, merge the loops and check quota at the end | 22:07 |
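One hedged sketch of the fix being converged on here: do the create and the mapping update together in a single loop, then recheck quota once at the end. How a quota failure should be handled (local delete vs. ERROR state) was itself debated earlier in the conversation, so this version just raises. Names are illustrative:

```python
# Illustrative single-loop variant: the mapping is updated immediately after
# the create, so delete and list always agree, and the quota recheck moves
# to the end. Not nova's actual conductor code.

def build_single_loop(instances, hosts, cells, instance_mappings, quota_ok):
    for inst, host in zip(instances, hosts):
        cells[host["cell"]][inst["uuid"]] = inst
        # mapping set right away: no created-but-unmapped window
        instance_mappings[inst["uuid"]] = {"cell": host["cell"]}
    if not quota_ok():
        raise RuntimeError("quota recheck failed after create")
```

Even if the quota recheck raises, every created instance is already mapped, so it remains findable and deletable.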
*** shaohe_feng has quit IRC | 22:07 | |
*** jmlowe has joined #openstack-nova | 22:07 | |
*** shaohe_feng has joined #openstack-nova | 22:07 | |
mriedem | idk, my guess is tunnel vision on the fix at hand | 22:07 |
melwitt | wait, that change (last year) *did* move the inst mapping update earlier to right after the instance.create(). looking to see what happened to that | 22:10 |
*** itlinux has joined #openstack-nova | 22:10 | |
*** savvas has joined #openstack-nova | 22:11 | |
*** rtjure has joined #openstack-nova | 22:13 | |
mriedem | but only if the quota check failed | 22:14 |
mriedem | b/c we exit after that | 22:14 |
mriedem | we don't bury in cell0 if quota check fails because the instances are already created in cells at that point | 22:15 |
*** figleaf is now known as edleafe | 22:15 | |
melwitt | I mean this, this is showing an update of the instance mapping right after we create the instance record https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@1003 | 22:16 |
*** savvas has quit IRC | 22:16 | |
mriedem | oh right yewah | 22:17 |
mriedem | *yeah | 22:17 |
melwitt | but in the current version of the code, the instance mapping update isn't right after the instance create anymore | 22:17 |
melwitt | and I can't find how that changed, looking at git blame and failing | 22:17 |
*** shaohe_feng has quit IRC | 22:17 | |
mriedem | _populate_instance_mapping was only ever used in the cellsv1 path | 22:17 |
mriedem | the build_instances method | 22:17 |
mriedem | i'm pretty sure | 22:17 |
melwitt | but in that old patch, it's in schedule_and_build_instances | 22:17 |
*** shaohe_feng has joined #openstack-nova | 22:18 | |
mriedem | because mnaser was re-using it | 22:19 |
mriedem | you mean why did we talk him out of that? | 22:19 |
melwitt | no I mean, as of that patch, the instance mapping update was right after instance create, but the current code has the mapping update much later, and I was wondering why that was moved. I assume it was to fix some other bug or something | 22:20 |
*** savvas has joined #openstack-nova | 22:20 | |
mriedem | looks like it was changed as a result of the irc convo | 22:21 |
melwitt | oh gaaaahhh, I didn't realize I was looking at an earlier PS | 22:21 |
melwitt | okay so the final version only added a mapping update to the cleanup method, like you said earlier I think. so the normal path for updating the mapping was always later on | 22:24 |
*** sambetts_ has quit IRC | 22:24 | |
*** savvas has quit IRC | 22:25 | |
melwitt | ok | 22:25 |
*** sambetts_ has joined #openstack-nova | 22:26 | |
mriedem | yup. alright gotta run. o/ | 22:27 |
melwitt | o/ | 22:27 |
*** shaohe_feng has quit IRC | 22:27 | |
*** shaohe_feng has joined #openstack-nova | 22:28 | |
*** shaohe_feng has quit IRC | 22:37 | |
*** savvas has joined #openstack-nova | 22:38 | |
*** shaohe_feng has joined #openstack-nova | 22:39 | |
*** avolkov has quit IRC | 22:40 | |
*** hongbin_ has quit IRC | 22:42 | |
*** shaohe_feng has quit IRC | 22:48 | |
*** shaohe_feng has joined #openstack-nova | 22:48 | |
*** mhg has quit IRC | 22:53 | |
*** shaohe_feng has quit IRC | 22:58 | |
*** shaohe_feng has joined #openstack-nova | 23:01 | |
*** mschuppert has quit IRC | 23:03 | |
*** gilfoyle_ has quit IRC | 23:08 | |
*** shaohe_feng has quit IRC | 23:08 | |
*** harlowja has quit IRC | 23:09 | |
*** shaohe_feng has joined #openstack-nova | 23:10 | |
*** shaohe_feng has quit IRC | 23:18 | |
*** shaohe_feng has joined #openstack-nova | 23:20 | |
openstackgerrit | Merged openstack/nova master: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 23:27 |
openstackgerrit | Merged openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration https://review.openstack.org/586568 | 23:27 |
openstackgerrit | Merged openstack/nova master: Update queued-for-delete from the ComputeAPI during deletion/restoration https://review.openstack.org/566813 | 23:27 |
*** shaohe_feng has quit IRC | 23:29 | |
melwitt | finally \o/ | 23:30 |
*** shaohe_feng has joined #openstack-nova | 23:32 | |
*** itlinux has quit IRC | 23:37 | |
*** shaohe_feng has quit IRC | 23:39 | |
*** shaohe_feng has joined #openstack-nova | 23:40 | |
*** gongysh has joined #openstack-nova | 23:43 | |
mnaser | well i know its late | 23:49 |
mnaser | but now there's even another whole interesting failure | 23:49 |
mnaser | no record in nova_api but one in the cell | 23:49 |
mnaser | lol | 23:49 |
*** shaohe_feng has quit IRC | 23:49 | |
*** shaohe_feng has joined #openstack-nova | 23:50 | |
*** itlinux has joined #openstack-nova | 23:50 | |
*** itlinux has quit IRC | 23:50 | |
*** itlinux has joined #openstack-nova | 23:51 | |
*** itlinux has quit IRC | 23:51 | |
*** wolverineav has quit IRC | 23:53 | |
*** wolverineav has joined #openstack-nova | 23:54 | |
melwitt | mnaser: no build request or instance mapping? | 23:54 |
mnaser | melwitt: build request, no instance mapping | 23:55 |
mnaser | wait sorry | 23:55 |
mnaser | it doesnt exist in the cell, sorry | 23:55 |
melwitt | build request, instance in cell, no instance mapping | 23:56 |
melwitt | build request only? | 23:56 |
mnaser | yes | 23:56 |
mnaser | build request only | 23:56 |
melwitt | that's the exact same thing rdo cloud ran into | 23:56 |
mnaser | so shows up in list but not deletable etc | 23:56 |
melwitt | right | 23:56 |
mnaser | i guess i can just delete the build request and have it disappear? | 23:56 |
melwitt | do you have several or just a few? like does it happen a lot? | 23:56 |
melwitt | yes, that's what I told rdo cloud to do too | 23:57 |
mnaser | i mean after running my fixup script, i still had a few instances that were stuck BUILD/scheduling | 23:57 |
melwitt | I dug around in the code and didn't see a way it can happen other than nova-api going down at the precise moment between the build_request.create() and the instance_mapping.create() or the instance_mapping.create() somehow failing | 23:57 |
mnaser | so for context it is possible that rpc and/or db both had issues at the time | 23:57 |
mnaser | does the build request and instance_mapping get created at the same time or? | 23:58 |
melwitt | which seems it would be crazy rare ... so maybe we're missing some other way it could happen | 23:58 |
melwitt | pretty much yeah. let me grab a link | 23:58 |
melwitt | https://github.com/openstack/nova/blob/3e0b17b1e138615b66293976ca5b55c291957844/nova/compute/api.py#L930-L942 | 23:58 |
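The two creates in the linked snippet are not atomic, which is the crash window melwitt describes: if nova-api dies between `build_request.create()` and `instance_mapping.create()`, an orphan build request is left with no mapping — listable but not deletable or mappable. A toy reproduction (illustrative names):

```python
# Toy model of the non-atomic pair of creates at boot time. A crash between
# them leaves a build request with no instance mapping.

def start_boot(uuid, build_requests, instance_mappings, crash_between=False):
    build_requests[uuid] = {"uuid": uuid}       # build_request.create()
    if crash_between:
        raise RuntimeError("nova-api died between the creates")
    instance_mappings[uuid] = {"cell": None}    # instance_mapping.create()
```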
* mnaser is learning so much lol | 23:59 | |
melwitt | yeah, soon you can come fix all these bugs | 23:59 |
mnaser | ok that's interesting | 23:59 |
mnaser | haha | 23:59 |
*** wolverineav has quit IRC | 23:59 | |
*** shaohe_feng has quit IRC | 23:59 | |
mnaser | so build request was created, instance mapping was *not* created. unless there was an attempt to delete the instance while it was still in build request | 23:59 |