opendevreview | Ghanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix https://review.opendev.org/c/openstack/nova/+/811118 | 00:51 |
---|---|---|
opendevreview | Ghanshyam proposed openstack/nova stable/xena: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811491 | 01:02 |
opendevreview | Ghanshyam proposed openstack/nova stable/wallaby: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811513 | 01:03 |
opendevreview | Hang Yang proposed openstack/nova master: Support creating servers with RBAC SGs https://review.opendev.org/c/openstack/nova/+/811521 | 01:34 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 01:48 |
opendevreview | Federico Ressi proposed openstack/nova master: Check Nova project changes with Tobiko scenario test cases https://review.opendev.org/c/openstack/nova/+/806853 | 01:52 |
opendevreview | Federico Ressi proposed openstack/nova master: Debug Nova APIs call failures https://review.opendev.org/c/openstack/nova/+/806683 | 01:56 |
opendevreview | Ghanshyam proposed openstack/nova stable/victoria: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811540 | 04:10 |
bauzas | good morning Nova | 06:43 |
deke | Hi | 08:16 |
deke | Has there been any discussion about implementing a MAAS driver for nova to enable ironic-like functionality for users who have deployed openstack with juju on MAAS? | 08:16 |
bauzas | deke: I haven't heard anything about this | 08:36 |
bauzas | deke: tbh, it would be a laaaarge discussion, right? | 08:36 |
deke | yea it would | 08:36 |
deke | it was just a thought I had | 08:36 |
deke | if we are already deploying on top of a baremetal provisioning service | 08:37 |
deke | then why deploy another baremetal provisioning service | 08:37 |
bauzas | we are not saying "no" for a new virt driver | 08:39 |
bauzas | but for having a new virt driver upstream, that means we need to discuss how to use it and how we could verify it by the CI | 08:40 |
bauzas | that means large discussions during a lot of PTGs + making sure we at least have a third-party CI | 08:40 |
lyarwood | dansmith: https://review.opendev.org/c/openstack/grenade/+/811117 - looks like you're the only remaining active core on grenade, can you ack this when you get online to unblock Nova's gate? | 08:42 |
gibi | gmann: thanks for the patches! | 09:03 |
bauzas | gmann: what gibi said, thanks for having worked on them | 09:05 |
bauzas | so the xena and master changes for fixing the placement API endpoints are now merged, we only have the grenade issue left, right? | 09:06 |
lyarwood | Yup I believe so | 09:07 |
lyarwood | and I think we only need it in master | 09:07 |
gibi | yupp that's my view too | 09:07 |
lyarwood | I don't know why we've backported it tbh | 09:07 |
gibi | lyarwood: you mean why we are enabling trunk testing on wallaby and back? | 09:07 |
lyarwood | on stable/xena and backwards, I thought grenade used the current branch of itself and the previous branch for everything else for the initial deploy | 09:08 |
lyarwood | so master grenade deploys a stable/xena env and then upgrades that to master | 09:08 |
bauzas | gibi: I stupidly forgot to +1 the RC2 patches yesterday before leaving | 09:10 |
bauzas | :facepalm: | 09:10 |
bauzas | now we need elodilles_pto but as his nick says, he's on PTO :) | 09:10 |
* bauzas hides | 09:11 |
gibi | lyarwood: if we do this https://review.opendev.org/c/openstack/devstack/+/811518/2/lib/tempest then we need to do this as well https://review.opendev.org/c/openstack/grenade/+/811542/1/.zuul.yaml on stable. But I agree that we don't have to enable trunk on stable, skipping is OK with me too. | 09:12 |
lyarwood | yeah I agree with the comment on the stable/wallaby change that we shouldn't be adding tests to stable this late on | 09:12 |
gibi | bauzas: I think other release cores can approve the RC2, we don't necessarily need to wait for elodilles_pto. | 09:13 |
gibi | anyhow he is back tomorrow so no worries | 09:13 |
gibi | lyarwood: ack, we can discuss this with gmann when he is up | 09:14 |
lyarwood | kk | 09:20 |
opendevreview | Lee Yarwood proposed openstack/nova master: nova-manage: Ensure mountpoint is passed when updating attachment https://review.opendev.org/c/openstack/nova/+/811713 | 10:56 |
opendevreview | Lee Yarwood proposed openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance https://review.opendev.org/c/openstack/nova/+/811716 | 11:17 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/811118 - cool the check queue is passing with the grenade fix | 11:26 |
gibi | \o/ | 11:27 |
mdbooth | 👋 Are we confident that https://review.opendev.org/c/openstack/devstack/+/811399 fixed the devstack issue on Ubuntu? My devstack-using CI job still seems to be failing: | 12:00 |
mdbooth | https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1009/pull-cluster-api-provider-openstack-e2e-test/1443169418262614016/artifacts/logs/cloud-final.log | 12:00 |
mdbooth | I can see the following in the devstack output: "ProxyPass "/placement" "unix:/var/run/uwsgi/placement-api.socket|uwsgi://uwsgi-uds-placement-api" retry=0", which looks like it contains the fix. However I'm still seeing the borked GETs to placement | 12:02 |
mdbooth | I'm far from discounting some other issue in the environment, btw, just looking for confirmation or otherwise that we've seen this fix the issue. | 12:04 |
*** swp20 is now known as songwenping | 12:04 |
gibi | mdbooth: your logs show the original error, I can confirm, as placement logs "GET /placemen//resource_providers?in_" | 12:05 |
mdbooth | gibi: Right | 12:05 |
gibi | and I do see the ProxyPass you mentioned, and that is the correct one | 12:06 |
gibi | are you sure that your env is using the ProxyPass that is logged? | 12:06 |
mdbooth | gibi: There's a git pull; git show at the top of that output which suggests the working directory is at "7f16f6d4 Fix uwsgi config for trailing slashes" | 12:06 |
mdbooth | gibi: i.e. Can I confirm that the apache config actually contains that ProxyPass? | 12:07 |
gibi | gibi: or that apache2 was restarted / reloaded after the setting was applied | 12:07 |
gibi | hm, I see an apache2 reload later | 12:08 |
mdbooth | This is running in cloud-init on initial boot. *However* it is running from an image that already contains devstack bits, so it's entirely possible there's dirty state there. | 12:08 |
mdbooth | Let me try to confirm that. I hadn't considered that it might not actually be updating the apache config. | 12:09 |
gibi | hm, sorry I only see logs that suggest the reload like | 12:09 |
gibi | To activate the new configuration, you need to run: | 12:09 |
gibi | systemctl reload apache2 | 12:09 |
gibi | but I don't see the reload itself | 12:10 |
gibi | so if you can log into the system, try accessing placement; if it fails, check the ProxyPass config, then reload apache and check that placement is accessible | 12:11 |
gibi | (I did this manually in my local devstack so I do believe the fix helps, at least in devstack) | 12:11 |
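A minimal sketch of the manual check gibi describes above, assuming a stock devstack where placement is served by apache2 under /placement; hostnames, paths and socket names may differ per deployment:

```bash
# Does placement answer on the unmangled path? (localhost endpoint is an assumption)
curl -s http://localhost/placement/ | head -n 3

# Inspect which ProxyPass directives apache2 has on disk for placement
grep -R "ProxyPass" /etc/apache2/sites-enabled/ | grep -i placement

# If the fixed directive is present but not yet active, reload apache2 and re-test
sudo systemctl reload apache2
curl -s http://localhost/placement/ | head -n 3
```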
mdbooth | Unfortunately this is running in a CI system I don't have access to :( I have to debug via modifying the job and re-executing 😬 | 12:12 |
mdbooth | I'll see what I can confirm. Thanks! | 12:12 |
gibi | mdbooth: no problem, let us know if this still fails for you | 12:13 |
mdbooth | Ok, I can confirm that the image the machine is created from already contains dirty devstack state | 12:18 |
mdbooth | Including the broken ProxyPass directives | 12:18 |
mdbooth | So unless we consider it a bug that devstack doesn't update that, I'm guessing this is my problem? | 12:19 |
gibi | mdbooth: so you have both wrong and bad ProxyPass lines in the config file? | 12:20 |
gibi | I mean wrong and good | 12:20 |
gibi | :D | 12:20 |
mdbooth | So *before* pulling and executing a patched devstack I already have bad config present | 12:21 |
mdbooth | And running the patched devstack doesn't fix it | 12:21 |
gibi | I do see multiple ProxyPass lines locally too, I guess those are from multiple unstack.sh / stack.sh runs in my case | 12:21 |
lyarwood | yeah that isn't a devstack bug | 12:22 |
lyarwood | ./clean.sh first | 12:22 |
gibi | hm, if clean.sh removes it then I agree it is not a bug (I'm too lazy to always run clean.sh but I should) | 12:22 |
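For reference, a minimal sketch of the clean re-stack lyarwood is suggesting, assuming the conventional /opt/stack/devstack checkout (in mdbooth's CI the equivalent would be rebuilding the preinstalled image):

```bash
cd /opt/stack/devstack   # assumed checkout location
./clean.sh               # removes stale devstack state, including old apache vhost configs
./stack.sh               # redeploys with the fixed ProxyPass directives
```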
mdbooth | lyarwood: Or more likely summon the arcane wizards of CI and update the 'preinstalled' image! | 12:23 |
lyarwood | yeah or that if it's already baked into the image | 12:23 |
gibi | mdbooth: yeah, for CI I suggest to always start from a clean image :) | 12:23 |
lyarwood | that's odd tbh | 12:23 |
mdbooth | Apparently it saves a ton of time, although I haven't personally measured it. I'm about to measure it, though :) | 12:24 |
gibi | not too long ago dansmith added parallelism to the devstack stack.sh run, which helped a lot with runtime | 12:24 |
lyarwood | yeah that we use across most jobs now AFAIK | 12:27 |
mdbooth | lyarwood: Is there a flag? | 12:28 |
gibi | there should be | 12:29 |
gibi | but it is on by default | 12:29 |
mdbooth | Ok, cool | 12:29 |
lyarwood | yeah it's there by default in master and xena | 12:29 |
gibi | btw, ./clean.sh deletes all the config from /etc/apache2/sites-enabled/ except glance :D | 12:29 |
mdbooth | Ah, looks like it's DEVSTACK_PARALLEL, and it's not in victoria | 12:30 |
mdbooth | Perhaps I'll update | 12:30 |
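A minimal local.conf sketch for the flag named above, on branches where it exists (reportedly master and xena, where it is on by default); the section placement is an assumption about a typical devstack local.conf:

```
# local.conf snippet; merge into your existing localrc section
[[local|localrc]]
DEVSTACK_PARALLEL=True
```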
lyarwood | why are you deploying victoria btw? | 12:30 |
mdbooth | lyarwood: Because nobody changed it | 12:31 |
lyarwood | oh fun | 12:31 |
gibi | :) | 12:31 |
mdbooth | Honestly this CI system is awesome, I'm not complaining. It installs devstack in a VM in GCE and runs tests against it, and until we hit this it was rock solid. | 12:32 |
mdbooth | Ok, I'm going to run again from a clean image instead of the preinstalled image to see how long that takes. Then I'm going to do the same again but against Xena to measure again. | 12:33 |
mdbooth | Thanks for all the help! | 12:33 |
gibi | happy to help | 12:34 |
gmann | bauzas: other than neutron-grenade, nova-ceph-multistore is broken on stable (https://bugs.launchpad.net/devstack-plugin-ceph/+bug/1945358); fixes are ready to merge - https://review.opendev.org/q/topic:%22bug%252F1945358%22+(status:open%20OR%20status:merged) | 12:36 |
gmann | gibi: lyarwood checking comments on grenade fixes, | 12:36 |
gibi | gmann: in short we are questioning whether we need to turn on trunk testing on stable branches | 12:39 |
gmann | gibi: lyarwood ok, I would say we should, as Tempest master is used to test stable/ussuri -> master, and if we do not enable it then we skip the trunk tests for no reason. It was just a miss that the extension wasn't enabled on stable branches. | 12:41 |
lyarwood | gm | 12:41 |
lyarwood | ops sorry | 12:41 |
lyarwood | gmann: Yeah I appreciate that but if it wasn't tested until now it seems a little odd to add it in for non-master branches | 12:42 |
lyarwood | gmann: the missing coverage was mostly around live migration right? | 12:42 |
gmann | and we can see it is passing in nova grenade jobs on stable so we do not need anything extra beyond the grenade fix | 12:42 |
gmann | lyarwood: yeah, live migration trunk tests. those are the only trunk tests currently in tempest | 12:43 |
lyarwood | gmann: okay well if it's passing and the neutron folks are okay with helping with any fallout that might appear in the coming days then I guess we can go ahead | 12:43 |
gmann | lyarwood: yeah, if neutron team is not comfortable or confident then we can just keep it like that. | 12:43 |
gmann | lyarwood: whether we need it for Xena or not, I remember the master testing patch passed only when I added the xena fix as a depends-on here https://review.opendev.org/c/openstack/nova/+/811118 | 12:53 |
gmann | I think I added it due to the zuul job inventory but let me debug what exactly is happening here | 12:54 |
lyarwood | okay if that's the case let's just merge both | 12:54 |
gmann | lyarwood: zuul inventory seems to take job definition from master only https://zuul.opendev.org/t/openstack/build/01f10e2cb162450e912551026dde8f85/log/zuul-info/inventory.yaml#294-304 | 13:08 |
gmann | let me remove the xena fix as depends-on and then we can get more clarity | 13:09 |
opendevreview | Ghanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix https://review.opendev.org/c/openstack/nova/+/811118 | 13:09 |
lyarwood | gmann: kk | 13:16 |
bauzas | sorry folks, I had issues with my computer | 13:55 |
bauzas | gmann: thanks, unfortunately gtk | 13:56 |
bauzas | looks like it's limbo | 13:57 |
gmann | nova grenade master fix is in gate https://review.opendev.org/c/openstack/grenade/+/811117 | 13:58 |
gmann | whether to enable testing on stable or not is a different thing and we can continue that discussion. | 13:58 |
dansmith | bauzas: gibi looks like the nova-yoga-ptg etherpad has been emptied | 14:01 |
bauzas | ORLY ? | 14:01 |
gibi | shit, I see | 14:02 |
bauzas | dansmith: this was really done 3 mins ago | 14:02 |
gibi | this is the last known state https://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#8272 | 14:02 |
gibi | last known good state | 14:03 |
bauzas | yup | 14:03 |
dansmith | bauzas: okay I just loaded it and saw it empty | 14:03 |
bauzas | dansmith: me too | 14:03 |
bauzas | it's just, something happened between 8272 and 8273 | 14:03 |
dansmith | certainly you can restore a rev right? | 14:03 |
gibi | the content yes, the coloring I think no | 14:03 |
bauzas | I wonder | 14:04 |
bauzas | lemme ask infra | 14:04 |
gibi | sure | 14:04 |
dansmith | hmm, I thought there was some way | 14:04 |
gibi | what I did before is that I exported the old state and copied that back | 14:04 |
bauzas | fungi is looking at restoring | 14:05 |
gibi | ack | 14:05 |
bauzas | folks, don't touch now | 14:05 |
dansmith | sweet | 14:05 |
bauzas | dansmith: thanks for having identified this | 14:06 |
lyarwood | ah ffs I think that was me | 14:06 |
lyarwood | I was editing and my dock crashed | 14:06 |
lyarwood | and then I couldn't reconnect to the pad | 14:07 |
lyarwood | yeah my items were the last on there | 14:08 |
lyarwood | https://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#8272 | 14:08 |
lyarwood | not entirely sure how my laptop dock (and thus network) crashing caused this tbh | 14:09 |
gibi | an interesting untested edge case in the etherpad code :) | 14:09 |
fungi | bauzas: gibi: dansmith: i've rolled it back to revision 8272 | 14:13 |
fungi | lyarwood: ^ | 14:13 |
dansmith | looks good, thanks fungi ! | 14:13 |
bauzas | fungi: with all my love | 14:13 |
fungi | cool, just making sure it's looking like you needed | 14:13 |
lyarwood | thanks for that and apologies all | 14:13 |
fungi | we do also back up the db behind it daily, worst case | 14:13 |
bauzas | fungi: it is, all the colors | 14:14 |
gibi | fungi: awesome thanks | 14:14 |
fungi | no problem, glad we could recover it | 14:14 |
bauzas | and yeah, interesting edge case | 14:14 |
bauzas | a docking issue with a network problem swallows an etherpad | 14:14 |
fungi | those aren't so bad. in the past there have also been bugs which clients somehow tickled to make the pads completely unusable | 14:16 |
fungi | and the most we can do in those cases is dump an earlier text copy with the api and stick that into a new pad, but it loses all the attribution and history | 14:16 |
fungi | and formatting | 14:17 |
bauzas | fungi: thanks, gtk | 14:18 |
bauzas | gmann: i know you're busy with all those devstack and grenade stuff | 14:19 |
bauzas | gmann: but tell me when you think it would be a good opportunity for the x-p PTG session between oslo and nova re: policy | 14:19 |
gmann | bauzas: give me 10 min, currently internal meeting | 14:22 |
bauzas | gmann: nah, no rush | 14:22 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/811118/ is passing with just the master grenade fix that's in the gate FWIW | 14:34 |
* lyarwood does a little dance | 14:34 |
gibi | \o/ | 14:35 |
bauzas | lyarwood: https://media.giphy.com/media/HTjcWZwMtHpyhuCGKZ/giphy-downsized-large.gif?cid=ecf05e47cwkexzrwaxwlkbicd2ppy22bb1l52sg7yoa8sk9x&rid=giphy-downsized-large.gif&ct=g | 14:52 |
* bauzas is sometimes regretting we can't add media in IRC | 14:52 |
bauzas | but that's lasting 0.1sec and then I say "meh" | 14:52 |
fungi | if you used a matrix client with the matrix-oftc bridge to join this channel, you and other matrix users could share inline media while irc users would just see a url to it in-channel | 14:53 |
bauzas | fungi: what I said, "meh" :p | 15:04 |
fungi | heh | 15:05 |
gmann | lyarwood: perfect, then I will say it was a late night thing that made me think the Xena fix was needed :) | 15:05 |
gmann | bauzas: for the PTG oslo session, is it possible on Tuesday Oct 19th between 13:00-15:00 UTC? | 15:07 |
opendevreview | melanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy https://review.opendev.org/c/openstack/nova/+/791807 | 15:14 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 15:14 |
gmann | bauzas: and as per the current topic I need only 30 min unless there are more things to discuss from anyone else | 15:19 |
bauzas | gmann: sorry, I'm in a meeting but I guess we could run the session after 14:00 UTC as we have a cyborg x-p session first | 15:20 |
gmann | bauzas: on tuesday right? | 15:22 |
bauzas | gmann: on the 19th of October, yes | 15:22 |
gmann | bauzas: perfect, sounds good. thanks | 15:22 |
bauzas | https://etherpad.opendev.org/p/nova-yoga-ptg L47 tells me the 14-15:00UTC slot is already taken | 15:23 |
gmann | bauzas: that is 'Cyborg-Nova: Tuesday (19th Oct) 13:00 UTC - 14: 00 UTC:' | 15:24 |
bauzas | my bad, yeah | 15:24 |
bauzas | again, wfm for an oslo x-p session on Oct 19 at 14:00 UTC | 15:24 |
gmann | +1, thanks again | 15:24 |
lyarwood | gmann: ^_^ no issues thanks for working on it so late | 15:24 |
bauzas | gmann: all good | 15:25 |
gmann | lyarwood: np! and for the stable backports we can wait for the neutron team's opinion, so I agree on 'no hurry for those'. | 15:25 |
lyarwood | awesome | 15:25 |
bauzas | artom: honestly, I'm torn with https://review.opendev.org/c/openstack/nova/+/808474 | 16:12 |
bauzas | that's a behavioural change | 16:13 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next https://review.opendev.org/c/openstack/nova/+/811748 | 16:13 |
bauzas | operators assume a plug-off when deleting | 16:13 |
bauzas | now, we'll first try to shutdown the guest for every instance | 16:13 |
bauzas | including drivers other than libvirt | 16:13 |
melwitt | that is my concern as well. maybe it could be conditional on bfv that is not "delete on termination"? | 16:14 |
melwitt | is that the only case where this would be desired? | 16:15 |
melwitt | or rather, is there any gain in doing it for non bfv non delete on termination? | 16:15 |
dansmith | generally you don't want that :) | 16:15 |
melwitt | maybe also shared storage is another case | 16:16 |
dansmith | the only real case for non-bfv volumes is for precious data | 16:16 |
melwitt | but precious data on a local disk that's going to be deleted anyway? I must be missing something | 16:16 |
dansmith | sorry, thought you were talking about delete-on-termination for non-bfv cinder volumes | 16:17 |
artom | melwitt, "is there any gain in doing it for non bfv non delete on termination?" None that I can see | 16:17 |
artom | Well, no, any volume, really | 16:17 |
artom | Doesn't have to be bfv | 16:17 |
melwitt | oh, yeah attached volumes. I wasn't thinking of that. yeah | 16:17 |
dansmith | that's my point, delete-on-termination should only be useful for bfv volumes we created with non-precious data from an image | 16:18 |
artom | It's still attached and mounted in the guest, and would ideally be flushed correctly if it's not delete_on_termination=True | 16:18 |
artom | I need to run an errand quickly, can this be carried over to the gerrit review? | 16:18 |
artom | And thanks for looking into it :) | 16:18 |
artom | And yeah, to bauzas's point, the compute manager/driver division of labour here is pretty muddy | 16:19 |
melwitt | my bad for not looking at the change yet. if it's targeted at only the instance-with-volume(s) cases I think that makes a lot more sense | 16:19 |
gibi | is it really a graceful shutdown via openstack server stop and then an openstack server delete? | 16:19 |
dansmith | gibi: stop + delete should be graceful | 16:19 |
gibi | so my point is this can already be done with our APIs | 16:20 |
melwitt | just saying I don't think we should be doing it for everything, for things where the data is going to be blown away anyway | 16:20 |
dansmith | gibi: for sure. I assume the goal is to make nova do the graceful behavior if volumes are attached, but to do it properly really requires some higher-level orch, like a stop...timeout...destroy kind of thing | 16:21 |
dansmith | "do the graceful behavior *automatically*" I should have said | 16:21 |
gibi | OK I see | 16:21 |
melwitt | yeah that is my understanding as well | 16:21 |
dansmith | I'm a bit torn, because unless you're running with unsafe cache, I would think that fast destroy is fine.. might have a journal to replay when you use the volume later, but... | 16:22 |
gibi | it makes sense for data consistency but it also makes delete slower so I think this should be opt-in | 16:22 |
melwitt | dansmith: yeah it's weird, the user is seeing the volume get corrupted and become unusable when they delete without stopping first | 16:23 |
melwitt | we had thought just deleting should be fine but it's behaving in a way we didn't expect | 16:23 |
melwitt | not sure why | 16:23 |
dansmith | destroy of a running vm is the same as pulling the plug.. if you're using a precious volume, you wouldn't do that to a physical server, so... | 16:23 |
artom | dansmith, so the "real" problem is https://bugzilla.redhat.com/show_bug.cgi?id=1965081 | 16:24 |
melwitt | yeah but re: "I would think that fast destroy is fine"? | 16:24 |
artom | Apparently in some cases stop+delete causes races | 16:24 |
melwitt | yeah I think you have to poll and wait for it to be stopped no? | 16:25 |
dansmith | melwitt: to do the graceful shutdown, you'd need some long-running task, yeah, what I said above | 16:25 |
artom | So we can either make delete safer, or require that any orchestration/automation on top of Nova does stop + delete, but then we would need to fix that race | 16:25 |
dansmith | artom: so this has nothing to do with volume safety? | 16:25 |
artom | dansmith, it does, because the reason for doing stop + delete (which can cause this deadlock) is volume safety | 16:26 |
melwitt | it does. it began with the racing problem and then we said "try a delete without the stop" and then their volumes got messed up | 16:26 |
dansmith | oh the dbdeadlock came from stop? | 16:26 |
lyarwood | doesn't delete take an instance.uuid lock on the compute? | 16:26 |
artom | dansmith, stop immediately followed by delete, apparently | 16:26 |
melwitt | it came from doing a delete right after a stop | 16:27 |
melwitt | without waiting for the stop to complete | 16:27 |
melwitt | lyarwood: good question | 16:27 |
dansmith | okay, so, fix that | 16:27 |
dansmith | don't engineer an orchestrated graceful delete, IMHO | 16:27 |
dansmith | I'm guessing task_state doesn't protect the delete from running, | 16:27 |
artom | I dunno, I don't necessarily think expecting an attached volume to not be corrupted after a delete is invalid | 16:28 |
dansmith | and the api might start the delete process while the stop is still running on the compute host or something | 16:28 |
melwitt | hm, stop and delete are both locked with instance.uuid | 16:28 |
dansmith | artom: I agree, delete could leave a volume unhappy, but I'm thinking if you issued a stop and then delete you're assuming they're queued | 16:28 |
melwitt | I just checked | 16:28 |
lyarwood | https://github.com/openstack/nova/blob/e07bb310b674fb471a92edf3258e564f05534595/nova/compute/manager.py#L3237-L3255 looks like soft delete doesn't | 16:28 |
dansmith | artom: in reality it's probably about like hitting shutdown on your server and then pulling the plug before it finishes, but... | 16:29 |
melwitt | dansmith: that's exactly what they want, the queuing | 16:29 |
dansmith | I think relying on the instance lock is probably too fragile here, | 16:30 |
dansmith | since it's just on the compute node, but I'd have to go look at the (many) delete path(s) we have | 16:30 |
lyarwood | I wouldn't say so for this case where the instance was running on a host | 16:31 |
dansmith | delete is and always has been pretty much "I want this to complete and stop charging me immediately", so... we're really not wrong here, IMHO | 16:31 |
lyarwood | that makes perfect sense | 16:31 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next https://review.opendev.org/c/openstack/nova/+/811748 | 16:31 |
dansmith | lyarwood: wouldn't say what, that relying on the lock is unsafe? we can do stuff in the api to delete things (in the local case) which has nothing to do with the instance lock, | 16:32 |
dansmith | and I'm not sure that relying on the ordering of two calls is really safe either unless we have perfectly fair locks | 16:32 |
bauzas | sorry, I was afk | 16:32 |
* bauzas scrolling back to the convo I started | 16:33 | |
lyarwood | dansmith: we can't force the delete in the API of an instance still associated with a host can we? | 16:33 |
dansmith | like, if you issue three calls on that instance and the stop is #2, delete is #3, the lock has to be fair to make sure we don't happen to choose the waiting #3 thread when we stop doing #1 | 16:33 |
dansmith | lyarwood: of course we can | 16:33 |
dansmith | if we think the host is down, or we missed the last service update, etc | 16:33 |
lyarwood | right I'm stuck in the happy path here | 16:34 |
melwitt | we have the ability to add fair=True to the locks, I wonder if that would work? | 16:34 |
bauzas | my biggest concern here is that we would make destroy wait for a graceful shutdown | 16:34 |
dansmith | bauzas: yeah I don't think we should do that for sure | 16:35 |
dansmith | bauzas: I think at most, we should try to make sure a delete doesn't preempt a stop operation in progress | 16:35 |
bauzas | yup, agreed | 16:35 |
bauzas | if a stop is occurring, destroy should wait | 16:35 |
melwitt | how can we make it wait without rejecting it? | 16:35 |
bauzas | excellent questioin | 16:36 |
melwitt | I do wonder if the fair=True would help | 16:36 |
bauzas | I'd say we would reject synchronously by looking at the VM state | 16:36 |
melwitt | I'll try it in a devstack if I can repro the problem | 16:36 |
bauzas | destroy is synchronous, right? | 16:37 |
bauzas | even if the host is down, destroy will occur | 16:37 |
dansmith | melwitt: I'm just suggesting that lock inversion is *one* possible way that we might not be able to depend on the ordering | 16:37 |
bauzas | that's the guest destroy which is async, right? | 16:37 |
dansmith | anything else that happens on the compute host before we block on the lock could cause it, as well as if conductor is involved or something like that | 16:38 |
bauzas | dansmith: I missed your proposal | 16:38 |
bauzas | you're telling we could lock on stop ? | 16:38 |
bauzas | hence preventing the delete ? | 16:38 |
melwitt | we do lock on stop on the compute node | 16:39 |
melwitt | but lock waiters would not be in order because they are not fair locks | 16:39 |
melwitt | *would not necessarily be in order | 16:39 |
dansmith | bauzas: I haven't proposed anything | 16:39 |
bauzas | ok, melwitt explained | 16:40 |
dansmith | yeah, so when we go to do delete, | 16:41 |
melwitt | dansmith: ack, I was just thinking that would be such a small and simple way to fix it if the fair locks would do that. but like you said if we have an issue with two async things arriving on the compute node in the "wrong order" then it wouldn't help | 16:41 |
dansmith | we create an instance event before we run terminate_instance and grab the lock, | 16:41 |
dansmith | which means we're doing network IO to conductor | 16:41 |
dansmith | so even fair locks won't prevent order inversion there | 16:41 |
melwitt | yeah... | 16:41 |
dansmith | so if stop and delete arrive in the right order, but each call out to conductor, we hit any conductor in the cluster, each one tries to create records in the db, who knows which one will finish first, grab the lock, etc | 16:42 |
dansmith | that's what I mean by assuming this sort of super tight ordering is unsafe, even though we *think* we're on the compute and largely single-threaded | 16:42 |
bauzas | I see | 16:42 |
bauzas | stupid idea, can't we rely on the state of the instance ? | 16:43 |
dansmith | delete is anything-goes I think | 16:43 |
bauzas | yeah, that's the original problem | 16:43 |
dansmith | because that's what we want... | 16:43 |
bauzas | yup... | 16:43 |
dansmith | and anything else we build into there is going to be pretty obscure.. "delete always deletes, except stop but not shelve... and either waits or refuses or ..." | 16:44 |
dansmith | makes the "delete always works" contract a little less clear | 16:44 |
bauzas | honestly I don't know how to move on with this ask | 16:44 |
dansmith | tell them that delete means pulling the plug, | 16:44 |
bauzas | "ask your orchestration to be smarter ?" | 16:44 |
lyarwood | how about making stop more graceful just for the libvirt driver? | 16:44 |
gibi | can we add a flag to the delete api saying I-want-a-shutdown-first? | 16:44 |
dansmith | and you wouldn't do that while waiting for start->shutdown on a physical server, so they should wait for vm_state=STOPPED before delete | 16:45 |
dansmith | gibi: that makes it a little less obscure, but doesn't eliminate the need to make that orchestration bit work, of course | 16:45 |
bauzas | yeah | 16:45 |
lyarwood | artom: the main issue with stop was that it eventually destroys the instance right? | 16:45 |
bauzas | that doesn't solve the ordering problem | 16:45 |
lyarwood | artom: and in your change you've suggested that we call shutdown on the domain as an initial step to make this more graceful | 16:46 |
dansmith | lyarwood: can we call quiesce or something else constant-time before libvirt destroy? | 16:46 |
dansmith | instead of shutdown, which the guest can block or ignore? | 16:46 |
bauzas | the root problem is the I/O flushes, right? | 16:47 |
dansmith | novaclient has a --poll option for some things. "nova stop server --poll; nova delete server" would solve this pretty easy :) | 16:47 |
melwitt | they're using tripleo/heat but I'm pretty sure heat has dependency or waiting ability | 16:49 |
dansmith | surely | 16:49 |
lyarwood | dansmith: quiesce before $domain.shutdown() might help flush things if that's where the guestOS is getting hung up | 16:49 |
melwitt | their argument has been that stop + insta delete should not break the volume, IIUC | 16:49 |
lyarwood | waiting assumes we don't kill it before it's finished shutting down | 16:50 |
dansmith | lyarwood: I meant to make sure journal buffers are written before we nuke the guest | 16:50 |
dansmith | melwitt: that argument is fine as long as you wait between stop and delete :) | 16:50 |
dansmith | melwitt: because again, start->shutdown and then pulling the plug before it finishes yields a corrupted disk :) | 16:51 |
melwitt | dansmith: that was the first thing I said on the bug report but it's gone on for a long time now and gotten into the weeds | 16:51 |
* dansmith nods | 16:51 |
bauzas | melwitt: I haven't seen any bug report against artom's patch | 16:52 |
bauzas | I guess you're talking internally | 16:52 |
melwitt | bauzas: yes internally | 16:52 |
bauzas | yeah, because that's what worried me originally | 16:52 |
bauzas | technically, destroy works as expected | 16:52 |
dansmith | we also really need to do a better job of making this a sanitized bug externally if we're going to claim this is a bug in nova | 16:52 |
bauzas | I'm not happy with claiming this as an upstream bug | 16:53 |
dansmith | me either, fwiw :) | 16:53 |
bauzas | a blueprint or a wishlist bug | 16:53 |
dansmith | we could add a feature as gibi said, but destroy is doing the right thing here | 16:54 |
bauzas | (18:52:51) bauzas: technically, destroy works as expected | 16:54 |
melwitt | +1 to all of that | 16:54 |
dansmith | an alternative to gibi's idea, would be a delete flag that says "assuming task_state=None" meaning "delete this if nothing else is going on" | 16:54 |
bauzas | strong agreement here | 16:54 |
dansmith | but it would require the client to retry, which they could currently do by just waiting for the stop to finish | 16:54 |
dansmith | so, meh | 16:54 |
bauzas | dansmith: yeah, that's why I was considering the vm state or the task state | 16:55 |
bauzas | if we really want to do *something* | 16:55 |
bauzas | but, | 16:55 |
bauzas | this can't be done with the current destroy API | 16:55 |
dansmith | it doesn't make it do what they want, and honestly it's kinda weird since they could just wait for the stop just as well, but it's less new orchestration stuff | 16:55 |
dansmith | [09:47:37] <dansmith> novaclient has a --poll option for some things. "nova stop server --poll; nova delete server" would solve this pretty easy :) | 16:56 |
dansmith | ^ :) | 16:56 |
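A minimal client-side sketch of the stop-then-wait-then-delete flow dansmith keeps pointing at, using python-openstackclient; the server name is a placeholder and the poll interval is arbitrary:

```bash
openstack server stop my-server        # my-server is a hypothetical name

# wait until nova reports the guest has actually shut down
until [ "$(openstack server show my-server -f value -c status)" = "SHUTOFF" ]; do
    sleep 5
done

openstack server delete my-server      # deleting an already-stopped guest, so no in-flight I/O to lose
```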
bauzas | it would be a 'destroy++" API | 16:56 |
bauzas | heh | 16:56 |
dansmith | also, if melwitt is right and they're using heat, then FFS, get heat to wait or something | 16:56 |
bauzas | (18:44:41) bauzas: "ask your orchestration to be smarter ?" | 16:56 |
dansmith | lol | 16:56 |
lyarwood | we keep saying they could wait for the stop to finish but wasn't that part of the issue here? Even if they did wait the libvirt driver would kill the instance prematurely before it had finished shutting down? | 16:56 |
dansmith | if we're just re-quoting ourselves, are we done here? :P | 16:57 |
gibi | if currently stop + wait for STOPPED + delete works, then a new delete-with-graceful-shutdown could also be orchestrated from the conductor but it is obviously an orchestration and adds complexity to the already complex delete codepaths | 16:57 |
bauzas | gibi: and again, this can't be the straight delete | 16:57 |
dansmith | gibi: we don't go through conductor directly for stop or delete right now, AFAIK, so it would really get confusing to add another whole path like that, IMHO | 16:57 |
gibi | bauzas: yepp this is delete++ :D | 16:57 |
gibi | dansmith: then we either need to make the stop RCP sync or do polling from the nova-api | 16:58 |
gibi | RPC | 16:58 |
melwitt | lyarwood: I didn't think so? | 16:58 |
bauzas | can't we just have as dansmith suggested a "destroy++" API return some 40x when the task state is not None ? | 16:58 |
lyarwood | that's fine if they aren't waiting | 16:59 |
melwitt | if that's the case I missed it | 16:59 |
dansmith | gibi: we'd need a cast to conductor, a new method there, and a sync stop operation to compute from conductor followed by delete.. waiting from the api is not reasonable because stop can take a *long* time | 16:59 |
gibi | yeah | 16:59 |
lyarwood | melwitt: I was sure we had talked about them waiting and libvirt still killing the domain before things had shut down, sorry | 16:59 |
lyarwood | I'll look at the bug again | 16:59 |
dansmith | in addition to the other four ways we can delete things :) | 16:59 |
melwitt | lyarwood: my understanding (and I could be wrong) is that they have never tried waiting to stopped | 16:59 |
lyarwood | I was sure we suggested that early on | 17:00 |
dansmith | melwitt: right that seems like it to me | 17:00 |
melwitt | lyarwood: ok same, maybe I'm way off | 17:00 |
bauzas | I'd prefer this destroy++ API to fail synchronously at the API level if some conditions aren't met | 17:00 |
bauzas | rather than us pursuing the idea we could achieve some distributed locking mechanism | 17:00 |
dansmith | gibi: also delete returns "I will eventually do this" to the client, which it would no longer be able to guarantee | 17:00 |
dansmith | bauzas: fwiw, I don't think that is really going to do what this person wants, so we should make sure they think it would help before we do that work | 17:01 |
gibi | dansmith: ohh, that is correct and bad :/ | 17:01 |
gibi | delete++ is just too complex | 17:02 |
bauzas | dansmith: welp, good point | 17:02 |
bauzas | either way, I think we need proper tracking for this | 17:07 |
bauzas | artom: I guess you need to file a blueprint and honestly, given the brainstorm efforts we made over the last hour, you need to write a bit of a spec | 17:07 |
bauzas | mostly because the current destroy action can't be used for this and we need to consider a new action parameter (or any change in our API) | 17:08 |
bauzas | artom: also feel free to write a PTG proposal for this one, we could continue the talk there | 17:08 |
* bauzas stops for the day | 17:22 | |
* gibi too | 17:23 | |
*** beekneemech is now known as bnemec | 17:41 | |
spatel | Any idea how to do VM nic bonding using two VFs with SR-IOV? | 17:57 |
spatel | I am looking for redundancy with the SR-IOV implementation and the only solution is to do bonding inside the VM | 17:58 |
opendevreview | Ivan Kolodyazhny proposed openstack/nova master: Add release note which descrube NVMe attach issue https://review.opendev.org/c/openstack/nova/+/811447 | 18:04 |
opendevreview | melanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy https://review.opendev.org/c/openstack/nova/+/791807 | 19:15 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 19:15 |
opendevreview | melanie witt proposed openstack/nova stable/train: [stable-only] Set lower-constraints job as non-voting https://review.opendev.org/c/openstack/nova/+/811762 | 19:15 |
opendevreview | David Vallee Delisle proposed openstack/nova master: rephrasing config description for num_pcie_ports in libvirt https://review.opendev.org/c/openstack/nova/+/811173 | 21:59 |
opendevreview | Goutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:19 |
melwitt | gouthamr: thanks for that! ^ I was going to do it and forgot | 22:33 |
gouthamr | o/ melwitt - i may have hit a circular dependency of sorts | 22:34 |
melwitt | oh hm.. (looking at it now) | 22:35 |
melwitt | gouthamr: I wonder if you need to use the url instead of the change-id? not sure how it would know which of the two (master or stable/wallaby) to use | 22:36 |
melwitt | for the depends-on | 22:37 |
gouthamr | +1 can try that | 22:37 |
opendevreview | Goutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:38 |
clarkb | the url is preferred now because you can depend on things in other code review systems | 22:38 |
clarkb | change id is gerrit specific | 22:38 |
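For reference, a hedged example of the commit message footer clarkb is describing, using the dependency URLs mentioned later in this exchange; the trailers go at the end of the change's commit message, one per dependency:

```
DNM: Test wallaby with ceph pacific

Depends-On: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/810059
Depends-On: https://review.opendev.org/c/openstack/devstack/+/810202
```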
gmann | gouthamr: i do not see circular deps, it should work fine. nova->plugin->devstack | 22:39 |
melwitt | good to know | 22:39 |
gouthamr | thanks, but zuul still tells me "Unable to freeze job graph: 0" | 22:40 |
clarkb | gouthamr: that implies you have a bug in your zuul config I think | 22:40 |
clarkb | gouthamr: can you give me links to all the changes involved? | 22:41 |
gouthamr | clarkb: yep: https://review.opendev.org/c/openstack/nova/+/811802/ | 22:41 |
gouthamr | https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/810059 and https://review.opendev.org/c/openstack/devstack/+/810202 | 22:42 |
opendevreview | Ghanshyam proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:42 |
gmann | gouthamr: I think running now ^^ ? | 22:43 |
gouthamr | oh! | 22:43 |
gmann | yeah | 22:43 |
gouthamr | gmann: thanks! i am used to pushing empty changes to trigger the CI elsewhere :) | 22:43 |
gmann | with right deps too https://zuul.openstack.org/status#nova | 22:44 |
melwitt | gmann++ | 22:44 |
clarkb | the issue is you have no files in the commit | 22:47 |
clarkb | seems like you figured that out | 22:47 |
clarkb | I don't know that that is a use case we should support. If you aren't changing anything then why bother | 22:48 |
melwitt | I noticed that but didn't know that it would cause problems | 22:48 |
gmann | clarkb: yeah, the error was confusing though | 22:48 |
clarkb | gmann: yes, https://paste.opendev.org/show/809680/ is the internal handling of it. It is an exceptional case currently | 22:49 |
clarkb | I've brought it up in the zuul matrix room to see if that is something we can handle better | 22:49 |
gmann | clarkb: +1 | 22:49 |
gmann | thanks | 22:49 |
gouthamr | awesome thanks clarkb! | 22:51 |
clarkb | the other issue you'll run into with that, even if zuul didn't explode on it, is that so many jobs match on files that are modified. If no files are modified then none of those jobs will run | 22:52 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Add functional regression test for bug 1853009 https://review.opendev.org/c/openstack/nova/+/811805 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Clear rebalanced compute nodes from resource tracker https://review.opendev.org/c/openstack/nova/+/811806 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Invalidate provider tree when compute node disappears https://review.opendev.org/c/openstack/nova/+/811807 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Prevent deletion of a compute node belonging to another host https://review.opendev.org/c/openstack/nova/+/811808 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Fix inactive session error in compute node creation https://review.opendev.org/c/openstack/nova/+/811809 | 23:51 |