*** dasm|off is now known as Guest1645 | 04:00 | |
opendevreview | Rajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.94 https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 06:24 |
bauzas | good morning Nova | 07:05 |
Uggla | good morning | 07:22 |
bauzas | mmm, I forgot to ask about your preferences for that *yet again* virtual PTG | 07:44 |
bauzas | I mean, how many slots we should book | 07:44 |
bauzas | I'm about to book 4 x 4 slots (Tues-Fri) but I'll cancel some if we agree | 07:44 |
bauzas | just done, see the scheduled tracks https://ptg.opendev.org/ptg.html | 07:49 |
bauzas | I can unbook the slots | 07:49 |
gibi | morning | 07:50 |
gibi | bauzas: 4x4 works for me | 07:50 |
gibi | sean-k-mooney: thanks for testing the user data out. | 07:50 |
bauzas | gibi: we'll see how much we need once we have an agenda | 07:51 |
gibi | bauzas: sure | 07:51 |
gibi | sean-k-mooney, dansmith: in light of the actual regeneration issue I agree to punt user_data and land rebuild BFV first | 07:51 |
sean-k-mooney[m] | I'm locally trying to fix the patch, by the way, as a follow-up patch | 07:52 |
sean-k-mooney[m] | ok, what I was trying won't work, but I know what to do instead, so trying that now | 08:03 |
hkominos | Hi guys. Quick question. When overriding policies for nova, should one redefine the whole policy.json file or can I just define the single policy that I want to override? My current policy file is empty {} | 08:34 |
bauzas | hkominos: no, you should just update what you need* | 08:37 |
gibi | hkominos: you only need to add the policies that you want to override to the file | 08:37 |
gibi | bauzas: you won :) | 08:37 |
bauzas | \o/ | 08:37 |
hkominos | thank you guys! | 08:40 |
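For context on what that looks like in practice: an override file only needs the specific rules being changed, and every rule not listed falls back to nova's in-code default. A minimal sketch (the rule name and value below are illustrative, not a recommendation):

```json
{
    "os_compute_api:os-hypervisors:list": "rule:admin_or_owner"
}
```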
bauzas | gmann: good spot for the WIP status on my election patch | 08:40 |
bauzas | gmann: that's because I flagged the WIP status directly in Gerrit when I created the PS1 in Aug | 08:41 |
bauzas | now the patch is in active state | 08:41 |
sean-k-mooney[m] | gibi: this is such a can of worms | 09:00 |
sean-k-mooney[m] | I think to make this work we need to delete the old config drive first, but that means we need to deal with all the image backends separately since they don't have an rm() function | 09:01 |
bauzas | sean-k-mooney: a tl;dr about the problem? | 09:02 |
sean-k-mooney[m] | I can add one, but I was hoping this would be simple, and then I remembered this should also work for lvm and rbd, so I can't just delete the file | 09:02 |
sean-k-mooney[m] | bauzas: the user data update feature is trying to rebuild the config drive on hard reboot if you updated the user data | 09:03 |
sean-k-mooney[m] | but that fails with a permission denied | 09:03 |
sean-k-mooney[m] | I believe that is because the disk is owned by qemu | 09:03 |
sean-k-mooney[m] | so nova can't update it, and this needs to work for rbd and lvm anyway, so I can't really just assume it's a file on disk and chown it | 09:04 |
sean-k-mooney[m] | the user_data feature is blocking the BFV rebuild feature because of the microversion order | 09:05 |
sean-k-mooney[m] | so I was trying to unblock that, but we are just going to have to change the order | 09:05 |
sean-k-mooney[m] | make BFV rebuild 2.93 and likely punt mutable user_data to AA | 09:06 |
sean-k-mooney[m] | there are other issues with rpc/testing in the user_data patch that likely won't get addressed without an FFE | 09:06 |
bauzas | agreed on flipping the microversions | 09:07 |
bauzas | at least for not blocking bfv rebuild | 09:07 |
gibi | sean-k-mooney[m]: do we store the config drive in lvm / rbd? I thought it is always a file on disk | 09:07 |
gibi | but I agree, if this is complicated then punt it | 09:08 |
bauzas | for userdata, we can't just say "sorry you can't update your data because config drive" as this is a config-driven API behaviour | 09:08 |
bauzas | could we have a separate flag that would say "I allow you to update your userdata" and leave the responsibility to the ops to enable it ? | 09:08 |
kashyap | gibi: I think it's a file on the disk, too, the config-drive | 09:08 |
bauzas | with a caveat saying there could be perm issues with configdrive | 09:08 |
sean-k-mooney[m] | gibi: good question. I believe we do, looking at the code, but I'll check. I know swap gets created on rbd | 09:09 |
sean-k-mooney[m] | and this appears to be using the generic imagebackend code | 09:09 |
sean-k-mooney[m] | but I'll confirm again | 09:09 |
sean-k-mooney[m] | bauzas: I don't think we should hack around this | 09:10 |
sean-k-mooney[m] | the feature is broken currently if you use config drive and I think it's important that it works | 09:10 |
sean-k-mooney[m] | especially since we added a standard trait for this | 09:11 |
bauzas | that's my point | 09:11 |
sean-k-mooney[m] | we could simply not report that for the libvirt driver I guess | 09:11 |
bauzas | in theory, we could only allow updating userdata if config drive isn't in use | 09:11 |
sean-k-mooney[m] | so in this cycle no driver would report it, so you could only use this if you did not have config drive | 09:11 |
sean-k-mooney[m] | bauzas: that works today, I think | 09:12 |
bauzas | but I don't wanna leak a config-driven behaviour | 09:12 |
sean-k-mooney[m] | it would not be config driven behavior | 09:12 |
bauzas | sean-k-mooney: then this is documentation | 09:12 |
sean-k-mooney[m] | we just set the compute capability trait to false for libvirt | 09:12 |
* bauzas needs to look at the patches | 09:12 | |
sean-k-mooney[m] | and then when config drive works we set it to true | 09:12 |
sean-k-mooney[m] | bauzas: either way I'm not sure it's reasonable of us to ask whoami-rajat to wait any longer given the state of the user data patch | 09:13 |
bauzas | agreed again | 09:13 |
bauzas | we should ask to flip the microversions | 09:13 |
sean-k-mooney[m] | so flip the microversion and split the user data patch in two: set the capability trait to false in all drivers in the first one, and then implement the config drive rebuild and RPC change in the second patch | 09:15 |
sean-k-mooney[m] | although that still feels a bit wrong | 09:16 |
bauzas | rpc change becaaaaause ? | 09:16 |
sean-k-mooney[m] | because currently it uses a dirty flag in the instance system metadata to say the config drive needs to be rebuilt | 09:17 |
sean-k-mooney[m] | and dansmith made a very valid point that that's a shadow RPC interface and we should have added it as a parameter to the reboot RPC call | 09:17 |
sean-k-mooney[m] | so that was already requested as a followup | 09:17 |
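For reference, the mechanism being described: nova virt drivers advertise support flags in a `capabilities` dict, and those flags are surfaced as `COMPUTE_*` traits on the compute node resource provider. A minimal sketch, assuming a hypothetical `supports_mutable_user_data` capability name (the name is illustrative, the dict-to-trait mechanism is nova's):

```python
from nova.virt import driver


class LibvirtDriver(driver.ComputeDriver):
    # Merge the base capabilities with the hypothetical flag; keeping it
    # False means no libvirt compute node reports the corresponding
    # COMPUTE_* trait until config drive regeneration actually works.
    capabilities = dict(
        driver.ComputeDriver.capabilities,
        supports_mutable_user_data=False,
    )
```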
gibi | I'm OK to only allow user data update for non config drive instances in Zed. That is a good enough compromise. Also config drive is just half config driven, as it can also be requested via the API | 09:17 |
sean-k-mooney[m] | yes it can, that's how I was testing | 09:18 |
sean-k-mooney[m] | it can be forced via nova.conf but it's user-requestable as you said | 09:19 |
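For concreteness, the operator-forced path is nova's real `force_config_drive` option:

```ini
# nova.conf on the compute node: force a config drive for every instance
[DEFAULT]
force_config_drive = True
```

The user-requested path is `openstack server create --config-drive true ...`, which is why "only when a config drive is present" is not a purely operator-controlled condition.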
gibi | as a side note I found a bug in the allocation candidate filtering in the PCI series. | 09:19 |
gibi | so I think what is realistic there is to land the healing part | 09:20 |
gibi | up until https://review.opendev.org/c/openstack/nova/+/850468/20 | 09:20 |
gibi | stephenfin: Sean is +2 til ^^ so if you have time :) | 09:20 |
sean-k-mooney[m] | yes, that was the partition I thought made sense if we did not land it all | 09:21 |
stephenfin | okay | 09:21 |
sean-k-mooney[m] | that's everything except the scheduler part | 09:21 |
gibi | also in the light of the recent shadow RPC discussion I feel that storing data in InstancePCIRequest.extra_info is also shadowy in my series | 09:21 |
gibi | sean-k-mooney[m]: yepp, that already gives visibility of the PCI resource inventories in placement which is good to have | 09:22 |
sean-k-mooney[m] | it would have been nice to have the split pools by PF change | 09:22 |
gibi | we could land that, that is not affected by the bug | 09:22 |
sean-k-mooney[m] | that would be nice since it gives preference to PFs without VFs when you ask for a PF | 09:23 |
gibi | it does not do any externally visible change | 09:23 |
gibi | I'm not sure it is a real preference or just an ordering change | 09:23 |
gibi | I have to look at whether we sort the pools | 09:23 |
sean-k-mooney[m] | I always thought the way we allocated was more or less deterministic | 09:23 |
sean-k-mooney[m] | at least it has been stable enough for us to use in the functional tests without intermittent failures | 09:24 |
gibi | it is stable | 09:24 |
gibi | but it might be stable not due to sorting | 09:24 |
gibi | but due to simple iteration order of devices / pools | 09:24 |
sean-k-mooney[m] | ack | 09:25 |
bauzas | sean-k-mooney: so I've read https://review.opendev.org/c/openstack/nova/+/816157/15 | 09:25 |
sean-k-mooney[m] | so I guess we can look at that again and confirm if it is of benefit | 09:25 |
bauzas | sean-k-mooney: can you request the patch split and the other microversion? | 09:25 |
gibi | sean-k-mooney[m]: yes, I will look at the ordering | 09:25 |
bauzas | I can leave some comments but I'm way behind | 09:25 |
gibi | I have to drop for an hour now for an early lunch but I will be back after | 09:25 |
sean-k-mooney[m] | bauzas: sure | 09:26 |
bauzas | thanks | 09:26 |
bauzas | I'll quickly review the bfv rebuild series | 09:26 |
bauzas | both you and gibi gave +2s but I'll take a quick glance | 09:26 |
bauzas | and I'll tell whoami-rajat to respin the API patch with 2.93 | 09:27 |
gibi | bauzas: you could look at the viommu one too | 09:27 |
bauzas | gibi: yup, on my list | 09:27 |
sean-k-mooney[m] | https://review.opendev.org/c/openstack/nova/+/816157/15#message-f9bcfb9f6d0115f20223c4cd1e91d5c5428798d7 | 09:32 |
sean-k-mooney[m] | done ^ | 09:32 |
sean-k-mooney[m] | whoami-rajat: dansmith do you have time to update the BFV series to use 2.93 and rebase to master? | 09:32 |
bauzas | thanks | 09:32 |
sean-k-mooney[m] | it's too early for them but they should see it when they get online | 09:33 |
sean-k-mooney[m] | gibi: I think you are right, by the way. there is no preference, it's just the fact that we create the pool with just the PF before we create the pool with the VFs in those tests, so it's just down to PCI address/pool ordering I think. | 09:36 |
sean-k-mooney[m] | we could add a preference but that is out of scope of the spec so don't worry about it | 09:37 |
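A toy illustration of the distinction being made here (illustrative only, not nova's actual pool code): with a first-fit scan, the result is deterministic because pools are visited in creation order, but nothing actively prefers a VF-free PF.

```python
# Pools as they happen to be created in the tests: PF pool first.
pools = [
    {"address": "0000:81:00.0", "dev_type": "type-PF", "count": 1},
    {"address": "0000:81:01.0", "dev_type": "type-PF", "count": 1},
]


def consume_first_fit(pools, requested_type):
    # No sort() here: the outcome is stable purely because iteration
    # order over the list is stable, not because of any preference.
    for pool in pools:
        if pool["dev_type"] == requested_type and pool["count"] > 0:
            pool["count"] -= 1
            return pool
    return None


print(consume_first_fit(pools, "type-PF")["address"])  # 0000:81:00.0
```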
sean-k-mooney[m] | ... it's annoying, but I think I see how to fix configdrive | 09:45 |
sean-k-mooney[m] | for RBD we do in fact import the local config drive into ceph and it will correctly delete and recreate it if required | 09:45 |
sean-k-mooney[m] | if I use a temporary path to generate the new config drive and implement import_file for the other backends then that should fix it | 09:46 |
sean-k-mooney[m] | for flat/qcow that's just an atomic move to the normal location via privsep to work around the permission issue | 09:47 |
sean-k-mooney[m] | for lvm it's not much harder to do but it's a little more work | 09:47 |
sean-k-mooney[m] | so I'm still going to try and do that so that they have a reference patch for how to fix it | 09:48 |
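A rough sketch of that plan, with heavy assumptions: `import_file` mirrors what the RBD backend already does when importing a locally generated config drive, and the helper below is illustrative, not nova's actual imagebackend API.

```python
import os
import tempfile


def rebuild_config_drive(backend, build_iso):
    """Regenerate a config drive without writing the attached disk in place.

    build_iso(path) is assumed to write a fresh config drive ISO to `path`.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = os.path.join(tmpdir, "disk.config")
        build_iso(tmp_path)
        # For flat/qcow this can be an atomic rename into the instance
        # directory (via privsep, to get past the qemu-owned file); for
        # rbd/lvm the backend deletes the old image and imports the new
        # file instead of assuming a file on local disk.
        backend.import_file(tmp_path)
```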
bauzas | sean-k-mooney: gibi: fwiw https://review.opendev.org/c/openstack/nova/+/820368/comment/bd35c4c9_0ad9dd3d/ I'm asking to update https://docs.openstack.org/nova/latest/user/support-matrix.html#operation_rebuild in the API patch | 10:03 |
sean-k-mooney[m] | bauzas: gibi ok i have a working version of the config drive part | 10:11 |
sean-k-mooney[m] | i can push it for review and then work on tests to see what people think | 10:11 |
bauzas | ok | 10:14 |
bauzas | I need to bail out for lunch but I'll be back later | 10:14 |
opendevreview | sean mooney proposed openstack/nova master: support configdrive rebuilding https://review.opendev.org/c/openstack/nova/+/855351 | 10:26 |
whoami-rajat | sean-k-mooney[m], ack, on it | 10:44 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add support for volume backed server rebuild https://review.opendev.org/c/openstack/nova/+/820368 | 11:35 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add conductor RPC interface for rebuild https://review.opendev.org/c/openstack/nova/+/831219 | 11:35 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 11:35 |
whoami-rajat | sean-k-mooney[m], bauzas gibi dansmith ^^ rebased to 2.93 | 11:37 |
bauzas | ++ | 11:37 |
whoami-rajat | once this merges, I will update the novaclient, openstackclient, and tempest changes as well | 11:38 |
whoami-rajat | bauzas, I wasn't sure what your ask was regarding the support matrix change, can you check if this is the change you intended? https://review.opendev.org/c/openstack/nova/+/830883/29..30/doc/source/user/support-matrix.ini | 11:39 |
whoami-rajat | https://review.opendev.org/c/openstack/nova/+/830883/30/doc/source/user/support-matrix.ini | 11:39 |
bauzas | whoami-rajat: well, the status should be 'complete' for the libvirt drivers, right? | 11:45 |
whoami-rajat | bauzas, yeah, there were too many, so wasn't sure, will update for all libvirt drivers | 11:48 |
bauzas | whoami-rajat: say 'unknown' for the ones you don't know | 11:49 |
bauzas | like ppc64, s390x and lxc | 11:49 |
whoami-rajat | to be honest, I'm not sure about different architectures | 11:49 |
whoami-rajat | ack will do | 11:50 |
gibi | I guess on those architectures where the libvirt virt driver supports boot from volume, the rebuild will work too. at least I don't remember any arch-dependent code in the rebuild patches | 11:52 |
sean-k-mooney | the arch should not matter | 11:53 |
sean-k-mooney | the virt types might | 11:53 |
sean-k-mooney | as I expect this to work for qemu and kvm | 11:53 |
sean-k-mooney | we could list them as unknown but I strongly suspect it will work on aarch64 at the very least, or as gibi suggested, anywhere where boot from volume is supported | 11:55 |
sean-k-mooney | whoami-rajat: we support attach and detach volume on all the kvm and qemu combinations | 11:57 |
sean-k-mooney | do you have any reason to believe that it would not work or would depend on the architecture? | 11:58 |
gibi | we don't have a row in the matrix about boot from volume support | 11:59 |
sean-k-mooney | ya, I noticed that | 11:59 |
sean-k-mooney | this could be addressed in a follow-up patch, yes | 11:59 |
gibi | yes | 11:59 |
gibi | the whole matrix thing is just doc | 12:00 |
gibi | that can be done even after FF | 12:00 |
gibi | before RC1 | 12:00 |
gibi | or as a doc bugfix later | 12:00 |
whoami-rajat | sean-k-mooney, I don't think so, if BFV works then this should work too, as we're doing some attach/detach operations and on the cinder side we're doing dd to copy image data to the volume | 12:00 |
whoami-rajat | more or less the same as what we do in BFV | 12:01 |
sean-k-mooney | ack, that's what I thought would be the case | 12:01 |
whoami-rajat | so should I just mark all kvm/qemu drivers as supported? | 12:01 |
sean-k-mooney | I think we could mark them complete unless we get a bug report saying it does not work | 12:01 |
sean-k-mooney | that's what I would do but bauzas might prefer unknown | 12:02 |
whoami-rajat | ack, will update that | 12:02 |
bauzas | I'm not opinionated | 12:02 |
bauzas | and I thought, given whoami-rajat was about to update his API change, it was OK to ask for this in the API change | 12:03 |
bauzas | and not by a FUP | 12:03 |
sean-k-mooney | either is fine | 12:03 |
gibi | I'm OK with the patch as is if we go with a FUP, or I can reapply my +2 if it is respun | 12:03 |
sean-k-mooney | I was waiting for zuul to report before reviewing after the microversion change | 12:03 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 12:05 |
whoami-rajat | does this look good? ^ | 12:05 |
sean-k-mooney | yes, I think so | 12:06 |
bauzas | +2d | 12:06 |
sean-k-mooney | libvirt-lxc won't work since attach/detach volume is missing but that's a nit | 12:07 |
sean-k-mooney | unknown is fine | 12:07 |
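For reference, entries in doc/source/user/support-matrix.ini follow this shape; the operation key, notes, and exact statuses below are an illustrative sketch of what was agreed, not the merged text:

```ini
[operation.rebuild-volume-backed]
title=Rebuild volume backed instance
status=optional
notes=Rebuild the root Cinder volume of a volume-backed server.
driver.libvirt-kvm-x86=complete
driver.libvirt-qemu-x86=complete
driver.libvirt-lxc=missing
driver.libvirt-kvm-ppc64=unknown
driver.libvirt-kvm-s390x=unknown
```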
gibi | we need a +2+A on the first patch then the series will land | 12:09 |
sean-k-mooney | doing that now | 12:09 |
sean-k-mooney | was just reviewing it | 12:09 |
sean-k-mooney | whoami-rajat: thanks for fixing the typos and renaming the detach function | 12:10 |
whoami-rajat | sean-k-mooney, np, I had to rebase so was just avoiding follow-ups | 12:10 |
sean-k-mooney | whoami-rajat: and sorry for the microversion hassle. stacking the changes was meant to prevent fighting for the same microversion but in this case it did not help | 12:11 |
opendevreview | Takashi Natsume proposed openstack/nova master: Add missing descriptions in HACKING.rst https://review.opendev.org/c/openstack/nova/+/853054 | 12:12 |
opendevreview | Takashi Natsume proposed openstack/nova master: Add a hacking rule for the setDaemon method https://review.opendev.org/c/openstack/nova/+/854653 | 12:12 |
* sean-k-mooney should really be on a downstream call | 12:12 | |
whoami-rajat | sean-k-mooney, I can understand there were some last-minute decisions to be made, everything is fine as long as the changes get in, so no problem :) | 12:12 |
* sean-k-mooney that is why I have been wearing this headset for 15 mins in total silence | 12:12 | |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 12:36 |
whoami-rajat | ^ there was a functional test failure due to the method name change in the compute manager, everything else is running fine | 12:37 |
sean-k-mooney | ack, I'll re-review after the downstream call | 12:38 |
sean-k-mooney | if gibi or bauzas don't do it before | 12:39 |
gibi | I'm on it | 12:39 |
whoami-rajat | thanks | 12:40 |
dansmith | gibi: wanna hear something funny? | 13:42 |
*** Guest1645 is now known as dasm | 13:50 | |
gibi | dansmith: sure | 13:51 |
dansmith | gibi: I would have bet my lunch on user_data being in the set of things we don't query out by default and only load if required | 13:52 |
dansmith | I remember extensive discussions about it around icehouse when all this was being done | 13:52 |
dansmith | because of its size | 13:52 |
dansmith | but I think we're actually *always* pulling that out and *always* sending it over the bus right now, which is insane because it's almost never used | 13:52 |
dansmith | *and* in any of the joined queries where we duplicate the instance fields (like why we stopped joining the metadata queries), we'd be duplicating up to 64k of user data in the DB response... | 13:53 |
sean-k-mooney | dansmith: we had a cve related to that in the past | 13:53 |
sean-k-mooney | that should be fixed now | 13:54 |
dansmith | sean-k-mooney: related to what? | 13:54 |
gibi | I don't see any special casing for user_data so probably you are right we are always pulling it out | 13:54 |
sean-k-mooney | pulling GBs of results into a join when you have large userdata | 13:54 |
dansmith | okay, then fixed how? | 13:54 |
dansmith | because AFAICT, we're always pulling it | 13:55 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/758928/ | 13:55 |
dansmith | sean-k-mooney: that's metadata | 13:55 |
dansmith | but you're saying that because we were also loading user_data that was much larger I guess? | 13:56 |
sean-k-mooney | yep, so I think we fixed the one case where we had a really large join as a result | 13:56 |
dansmith | yeah so that patch will maybe avoid some duplication of it, but we don't need to be pulling it out of the DB except when we need it | 13:56 |
sean-k-mooney | yes, that's also fair, we don't | 13:57 |
sean-k-mooney | I guess it would be nice to do as a performance enhancement in general | 13:58 |
gibi | does it also mean that we can save the updated user data on the compute side now? | 13:58 |
sean-k-mooney | but also possibly reducing the chance of another CVE | 13:58 |
dansmith | well, it could seriously cut down on our rabbit load | 13:58 |
sean-k-mooney | where the user data is large yes | 13:58 |
dansmith | gibi: yeah I assume that if that patch works, it works because somewhere in reboot the instance gets saved, probably due to state | 13:59 |
dansmith | gibi: no test coverage of that though that I can see :) | 13:59 |
sean-k-mooney | unfortunately people that use heat tend to also abuse the userdata | 13:59 |
sean-k-mooney | we have had requests in the past to increase the limit even more | 13:59 |
gibi | dansmith: ahh yes, as we don't call save on the API side | 13:59 |
sean-k-mooney | we said no the last few times it came up | 13:59 |
dansmith | sean-k-mooney: yeah, I bet people are bzipping stuff to keep it under the limit :D | 13:59 |
sean-k-mooney | oh, by the way, the limit of MEDIUMTEXT is 16MB not 64KB, but I think we limit to 64KB; now I want to go check | 14:01 |
sean-k-mooney | we use mediumtext in the db schema | 14:01 |
sean-k-mooney | I think we put the limit in the API | 14:01 |
dansmith | we are limiting in the api yeah | 14:01 |
dansmith | at least in this patch | 14:02 |
sean-k-mooney | our API ref says "Configuration information or scripts to use upon launch. Must be Base64 encoded. Restricted to 65535 bytes." | 14:02 |
sean-k-mooney | which is what I expected | 14:03 |
dansmith | btw, I've seen this fail several times in the last day: https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/testr_results.html | 14:03 |
dansmith | it complains of a missing host, which the test is unshelving to by name, so I assume that's a test bug or something | 14:04 |
sean-k-mooney | Compute host ubuntu-focal-rax-dfw-0030919238 could not be found. | 14:04 |
sean-k-mooney | ya | 14:04 |
sean-k-mooney | so tempest config error maybe | 14:04 |
dansmith | so anyway, on the user_data thing, I think it would be a good idea for us to en-lazy that by default and only load it in the api and metadata api, and I bet we'll see some rabbit load relaxed | 14:05 |
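A minimal sketch of the lazy-load idea, using illustrative names rather than nova's actual Instance/versioned-object internals: the heavy column stays out of the default query and is fetched only on first access.

```python
class Instance:
    """Illustrative stand-in for an object with one lazily loaded field."""

    _lazy_fields = frozenset(["user_data"])

    def __init__(self, context, row):
        self._context = context
        # `row` comes from a query that deliberately did NOT select
        # user_data, so the blob never rides along on list operations or
        # gets serialized into every RPC message.
        self._fields = dict(row)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._lazy_fields:
            # One extra, targeted round trip on first access, instead of
            # dragging up to 64KiB through every instance fetch.
            value = db_fetch_column(self._context, self._fields["uuid"], name)
            self._fields[name] = value
            return value
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)


def db_fetch_column(context, uuid, column):
    """Hypothetical helper: SELECT one column for one instance."""
    raise NotImplementedError
```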
gmann | dansmith: sean-k-mooney gibi Uggla I am also seeing nova-multi-cell failing consistently for the new test added for test_unshelve_to_specific_host | 14:06 |
gmann | https://zuul.opendev.org/t/openstack/builds?job_name=nova-multi-cell&skip=0 | 14:06 |
dansmith | ack, I've probably rechecked 8 of that failure in the last 24 hours | 14:06 |
sean-k-mooney | ubuntu-focal-rax-dfw-0030919243 is the compute, ubuntu-focal-rax-dfw-0030919238 is the controller | 14:06 |
gmann | this test was recently merged yesterday and was passing that time | 14:07 |
gibi | it is interesting as the test looks up the other compute by the service list | 14:07 |
dansmith | might be flaky or might not work on some cloud providers if there's a name weirdness? | 14:07 |
sean-k-mooney | dansmith: so the name is correct, at least it's trying to unshelve to the controller | 14:07 |
sean-k-mooney | I wonder if there is FQDN stuff going on | 14:08 |
gibi | ubuntu-focal-rax-dfw-0030919238 doesn't feel weird | 14:08 |
gmann | yeah | 14:08 |
dansmith | gibi: I meant FQDN type things | 14:08 |
dansmith | sean-k-mooney and I are scarred for life on FQDN problems :) | 14:08 |
sean-k-mooney | ubuntu-focal-rax-dfw-0030919238 is what we see for the host value in the compute agent startup | 14:09 |
sean-k-mooney | so maybe hypervisor hostname | 14:09 |
sean-k-mooney | DEBUG nova.compute.resource_tracker [None req-49d84742-3abf-4111-8a83-e2c5e0e67661 None None] Hypervisor/Node resource view: name=ubuntu-focal-rax-dfw-0030919238 | 14:10 |
sean-k-mooney | nope they all seem to line up | 14:10 |
sean-k-mooney | odd | 14:10 |
dansmith | do we do a disabled or service liveness check before we let you unshelve there? maybe the compute is stuck and not updating its counter? | 14:11 |
gibi | when we query the service we check for that it is up and enabled | 14:12 |
gmann | dansmith: you mean just before unshelve or while selecting the host? while selecting the host we do check the service is up and enabled | 14:12 |
gmann | but after that the shelve happens and then the unshelve, and at that time there is no check before unshelve | 14:12 |
dansmith | gmann: yeah I mean whatever check we're doing that gives that error | 14:13 |
dansmith | is there a tempest test that is disabling a compute by chance for testing? | 14:13 |
sean-k-mooney | it's probably worth checking the scheduler logs | 14:13 |
dansmith | oh is this just the scheduler "is this host okay" check? | 14:13 |
gmann | I do not think we have any such test of touching the compute services enable/disable | 14:15 |
gibi | it is not about service state | 14:15 |
gibi | the unshelve fails on compute_node_get_all_by_host | 14:15 |
gibi | so the compute is not in the DB | 14:15 |
gibi | https://github.com/openstack/nova/blob/master/nova/compute/api.py#L4577 | 14:17 |
gibi | this where the unshelve fails | 14:17 |
sean-k-mooney | so it is actually trying to shelve and unshelve multiple times | 14:18 |
sean-k-mooney | it shelve-offloaded and unshelved to the compute | 14:19 |
sean-k-mooney | then it shelved again | 14:19 |
sean-k-mooney | and failed to unshelve to the controller with the not-found issue | 14:19 |
gibi | yes | 14:19 |
gibi | first it unselves back to where it was, then it unshelves to the other compute | 14:20 |
sean-k-mooney | and req-b3e7705d-a20b-42b3-8eb5-594181a096c8 is the request that failed | 14:20 |
sean-k-mooney | which failed in the api, right | 14:20 |
sean-k-mooney | it's not in the scheduler | 14:20 |
gibi | yes | 14:21 |
gmann | yes, it is failing on the 2nd unshelve, to another host | 14:21 |
gibi | it does not fail in the scheduler | 14:21 |
gibi | it fails in the compute api | 14:21 |
gibi | https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/controller/logs/screen-n-api.txt | 14:21 |
gibi | https://github.com/openstack/nova/blob/master/nova/compute/api.py#L4577 | 14:21 |
gibi | unshelve checks if the requested host exists | 14:21 |
gibi | before it goes to the conductor | 14:21 |
sean-k-mooney | by calling compute_node_get_all_by_host | 14:21 |
gibi | it calls get_first_node_by_host_for_old_compat | 14:22 |
sean-k-mooney | that's on the main db, i.e. the cell db, right? | 14:22 |
dansmith | so that is looking up by the service, and getting the first compute node on the service | 14:22 |
dansmith | not by_host_and_node | 14:22 |
* dansmith tickles his "robustify compute node hostnames" spec | 14:23 | |
sean-k-mooney | for libvirt there should only be one but ya | 14:23 |
sean-k-mooney | if shelve was a thing for vmware or ironic that would not be ideal | 14:23 |
dansmith | sean-k-mooney: right but this is a very loose association which doesn't have FK constraints | 14:23 |
sean-k-mooney | why is it hitting the cell db? it could look at the host mapping table in the api db, right | 14:24 |
sean-k-mooney | or cell mappings table | 14:24 |
dansmith | eh? | 14:24 |
dansmith | those just tell us which db the host is in | 14:24 |
sean-k-mooney | right, but if all you care about is whether it exists, is that not enough | 14:24 |
dansmith | nothing about whether it's really there or up or whatever | 14:24 |
gibi | sean-k-mooney: you have a point, so the compute-api does a cell DB query without targeting the cell first | 14:24 |
sean-k-mooney | dansmith: I guess it could be down but I'm not sure this is checking that | 14:25 |
dansmith | sean-k-mooney: or deleted | 14:25 |
sean-k-mooney | I would expect that to be checked by the scheduler, to be honest | 14:25 |
dansmith | gibi: these hosts are in the same cell though so if that wasn't working, we wouldn't have found the first one right? | 14:25 |
gibi | this is the multicell job so I assume we have hosts in different cells | 14:26 |
sean-k-mooney | yes there are only two nodes in the job | 14:26 |
sean-k-mooney | so each is in a different cell | 14:26 |
dansmith | multicell meaning superconductor not actually one host per cell right? | 14:26 |
sean-k-mooney | well we can check, but I thought this was 2 cells | 14:27 |
dansmith | or is this really unshelving to different cells? | 14:27 |
dansmith | can we unshelve across cells? | 14:27 |
dansmith | I thought not | 14:27 |
dansmith | so maybe we're targeted at the cell where the instance is and not the cell where the host is that we're unshelving to? | 14:28 |
dansmith | not sure how that would have ever passed | 14:28 |
sean-k-mooney | the compute is not running a second conductor | 14:28 |
gibi | I'm looking at the compute.api.API.unshelve and I think the context is not targeted | 14:28 |
sean-k-mooney | oh, the controller is running both the cell1 and cell2 conductors | 14:29 |
sean-k-mooney | and the super conductor | 14:29 |
sean-k-mooney | https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/compute1/logs/etc/nova/nova_cell1_conf.txt | 14:30 |
sean-k-mooney | transport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell2 | 14:30 |
sean-k-mooney | so ya compute1 is nova_cell2 | 14:31 |
sean-k-mooney | and the controller is transport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell1 | 14:31 |
gibi | I don't know why this ever passed the test | 14:31 |
dansmith | I also don't know how unshelve could be using an untargeted context | 14:31 |
dansmith | it should be pointing at cell0 all the time which would never work | 14:31 |
sean-k-mooney | dansmith: did you say this just started failing recently? | 14:32 |
sean-k-mooney | because we merged unshelve to host a while ago | 14:32 |
dansmith | sean-k-mooney: it just recently merged I think gmann said | 14:32 |
sean-k-mooney | the tempest test? | 14:32 |
sean-k-mooney | if so, I guess we don't run the multi-cell job on tempest | 14:33 |
gmann | dansmith: sean-k-mooney gibi that passed in tempest-multinode-full-py3 job and I do not think we checked multi-cell job run for this | 14:33 |
dansmith | the base nova conf doesn't have a db connection pointing at cell0, which I assume is what the apis are using | 14:33 |
gmann | yeah we do not run multi-cell job there and only checked tempest-multinode-full-py3 job passing it | 14:34 |
sean-k-mooney | gmann: ack that makes sense | 14:34 |
gibi | yeah nova-multicell did not run on the tempest patch | 14:34 |
sean-k-mooney | ok, so we have two things: one, we should skip this temporarily on the multi-cell job | 14:34 |
sean-k-mooney | and then fix the bug | 14:34 |
gibi | yes | 14:34 |
dansmith | a normal run should have the api pointing at cell0: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_8cf/831219/22/check/tempest-integrated-compute/8cfb267/controller/logs/etc/nova/nova_conf.txt | 14:34 |
dansmith | but that multicell run has nothing defined there | 14:35 |
sean-k-mooney | we at least should check we are not unshelving across cells for now, but that actually should work | 14:35 |
sean-k-mooney | since we use shelve for | 14:35 |
sean-k-mooney | cross cell resize | 14:35 |
dansmith | sean-k-mooney: it shouldn't work | 14:35 |
* gibi goes and files a bug | 14:35 |
dansmith | because cross-cell migration does other stuff on top of the shelve | 14:35 |
sean-k-mooney | oh, you're right | 14:35 |
gmann | and cross-cell is not in that feature scope, right? which was added in this cycle only. | 14:36 |
gmann | sean-k-mooney: agree to skip it in multi-cell job | 14:36 |
sean-k-mooney | it was not explicitly, no | 14:36 |
sean-k-mooney | so we can just document the limitation I guess | 14:36 |
gmann | but it will be good to add in the documentation/releasenotes | 14:37 |
gmann | yes | 14:37 |
sean-k-mooney | dansmith: other than using multiple port bindings to test if the destination cell can bind the port, how much more does cross-cell migration do over shelve? there is a bunch of stuff in the super conductor to copy the instance object between the cell DBs, right | 14:38 |
dansmith | yeah it moves the thing between DBs, that's the big thing | 14:38 |
sean-k-mooney | well, for now I guess it's fine; the error could be improved but the API won't let you corrupt the DB or anything like that | 14:39 |
dansmith | so this multicell job has two cells and two computes, one compute per cell? it must be disabling live migration and lots of other stuff that would normally require multinode, right? | 14:39 |
sean-k-mooney | ya, it's two nodes (controller and compute) and the nodes are in different cells | 14:40 |
dansmith | so it must disable a bunch of things | 14:40 |
sean-k-mooney | yep | 14:40 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/.zuul.yaml#L514-L571 | 14:41 |
gmann | yeah, live migration is disabled https://github.com/openstack/nova/blob/master/.zuul.yaml#L554 | 14:41 |
sean-k-mooney | that's the definition | 14:41 |
dansmith | okay | 14:42 |
gibi | filed the bug https://bugs.launchpad.net/nova/+bug/1988316 | 14:42 |
gibi | as it blocks the gate I will set it to Critical | 14:43 |
dansmith | also, the api config I was looking at was on the compute, the controller is pointed at cell0 by default | 14:43 |
dansmith | gibi: ++ | 14:43 |
sean-k-mooney | gmann: does the tempest test have a compute_feature_flag or will I just add it to the exclude regex? | 14:43 |
gibi | and I will push the skip | 14:43 |
gibi | to unblock the gate | 14:43 |
sean-k-mooney | oh ok | 14:43 |
sean-k-mooney | ill leave it to you so | 14:43 |
gmann | sean-k-mooney: exclude the test | 14:44 |
gmann | gibi: ok, I was doing it but please go ahead | 14:44 |
gibi | ack, I'm on it | 14:45 |
dansmith | sean-k-mooney and I can get our +2 hammers polished up | 14:45 |
gmann | gibi: we can add the test class name in case more tests are added for this https://github.com/openstack/nova/blob/master/devstack/nova-multi-cell-exclude-list.txt | 14:45 |
sean-k-mooney | ya UnshelveToHostMultiNodesTest | 14:45 |
sean-k-mooney | or tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest | 14:46 |
sean-k-mooney | if we want to fully qualify it | 14:46 |
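For reference, the skip is just a line in devstack/nova-multi-cell-exclude-list.txt; with the fully qualified name the entry would look like this (comment wording illustrative):

```
# Skipped: unshelve to a specific host does not work across cells (bug 1988316)
tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest
```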
gibi | dansmith: would the current code fail to find the compute even if it is in the same cell as the instance, as it does not target the cell? | 14:48 |
dansmith | gibi: I think it has to be targeting the cell | 14:48 |
sean-k-mooney | it seemed to work for the host it started on | 14:48 |
gibi | I don't see where it targets it | 14:48 |
dansmith | I know it's not clear but it wouldn't be getting as far as it is if not | 14:48 |
sean-k-mooney | it's presumably targeting the source cell and not finding the dest host | 14:49 |
dansmith | I confirmed it's using cell0 by default, which means it must be getting targeted | 14:49 |
sean-k-mooney | oh ok | 14:50 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (db) https://review.opendev.org/c/openstack/nova/+/831193 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (objects) https://review.opendev.org/c/openstack/nova/+/839401 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (manila abstraction) https://review.opendev.org/c/openstack/nova/+/831194 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (drivers and compute manager part) https://review.opendev.org/c/openstack/nova/+/833090 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (api) https://review.opendev.org/c/openstack/nova/+/836830 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Bump compute version and check shares support https://review.opendev.org/c/openstack/nova/+/850499 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add metadata for shares https://review.opendev.org/c/openstack/nova/+/850500 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.share_attach notification https://review.opendev.org/c/openstack/nova/+/850501 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.share_detach notification https://review.opendev.org/c/openstack/nova/+/851028 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add shares to InstancePayload https://review.opendev.org/c/openstack/nova/+/851029 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.power_on_error notification https://review.opendev.org/c/openstack/nova/+/852084 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.power_off_error notification https://review.opendev.org/c/openstack/nova/+/852278 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add helper methods to attach/detach shares https://review.opendev.org/c/openstack/nova/+/852085 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add libvirt test to ensure metadata are working. https://review.opendev.org/c/openstack/nova/+/852086 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add virt/libvirt error test cases https://review.opendev.org/c/openstack/nova/+/852087 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add share_info parameter to reboot method for each driver (driver part) https://review.opendev.org/c/openstack/nova/+/854823 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Support rebooting an instance with shares (compute and API part) https://review.opendev.org/c/openstack/nova/+/854824 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Change microversion to 2.93 https://review.opendev.org/c/openstack/nova/+/852088 | 14:52 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell https://review.opendev.org/c/openstack/nova/+/855378 | 14:52 |
gibi | gmann, dansmith, sean-k-mooney: here is the skip ^^ | 14:52 |
sean-k-mooney | do we still use that file? | 14:54 |
gibi | dansmith: in the REST API code we load the instance properly from its cell; could that codepath alter the context too? | 14:54 |
sean-k-mooney | oh we do | 14:54 |
dansmith | no, because we're doing scatter | 14:54 |
gmann | yes | 14:54 |
dansmith | so actually | 14:55 |
dansmith | I think gibi might be right and that the feature itself is broken | 14:55 |
dansmith | but I can't explain why it's working on single-cell, since we're targeted to cell0 by default | 14:55 |
dansmith | it looks like we don't target until we hit superconductor | 14:55 |
dansmith | https://github.com/openstack/nova/blob/cbc9b516fb5e6b079b57cc40380e1d730751cd6c/nova/conductor/manager.py#L974-L975 | 14:55 |
dansmith | so I can't explain why this ever works, because the api should have only checked for the host being present in cell0, which should never have any hosts | 14:56 |
gibi | I think I see where we target the context | 14:56 |
gibi | _get_instance_from_cell | 14:56 |
gibi | https://github.com/openstack/nova/blob/3862cfc649d0099971c91d17e6f8800d10712a26/nova/compute/api.py#L2867-L2870 | 14:56 |
gibi | this actually has a side effect to alter the context | 14:56 |
dansmith | ah, yep, okay I missed that | 14:57 |
dansmith | phew :) | 14:57 |
gibi | so unshelve actually works within a cell | 14:57 |
gibi | we need to document that it does not work across cells and it needs to emit a better error | 14:57 |
dansmith | yeah | 14:57 |
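To make the mechanism concrete: `nova.context.target_cell` and the object lookups below are real nova APIs, while the wrapper function itself is an illustrative sketch of the code path just described.

```python
from nova import context as nova_context
from nova import objects


def find_host_in_instance_cell(ctxt, instance_uuid, host):
    # The instance mapping tells us which cell DB to talk to;
    # compute.api's _get_instance_from_cell targets the context this way
    # as a side effect of loading the instance.
    im = objects.InstanceMapping.get_by_instance_uuid(ctxt, instance_uuid)
    with nova_context.target_cell(ctxt, im.cell_mapping) as cctxt:
        # This query only hits the instance's cell DB, so a destination
        # host living in a different cell raises ComputeHostNotFound.
        return objects.ComputeNode.get_first_node_by_host_for_old_compat(
            cctxt, host)
```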
sean-k-mooney | ya, although that's probably a latent bug | 14:58 |
sean-k-mooney | i.e. if we use the old way of targeting an AZ | 14:58 |
sean-k-mooney | and that AZ was in a different cell | 14:58 |
sean-k-mooney | it likely would have failed in the same place | 14:58 |
dansmith | latent bug in terms of raising a useful error you mean? | 14:58 |
sean-k-mooney | yes | 14:58 |
dansmith | yeah | 14:58 |
sean-k-mooney | it's just that now that we have a tempest test that hits it, we are seeing it | 14:58 |
sean-k-mooney | but we probably were not doing cross-AZ unshelve in the multi-cell job | 14:59 |
gibi | I've updated the bug | 15:00 |
dansmith | sean-k-mooney: can you +W the skip so it's headed in after tests? | 15:00 |
dansmith | or gmann | 15:01 |
gibi | Uggla: probably the above discussion about unshelve is interesting to you too ;) | 15:01 |
gibi | Uggla: also https://bugs.launchpad.net/nova/+bug/1988316 :) | 15:01 |
gmann | dansmith: gibi done | 15:03 |
gibi | gmann: thanks | 15:03 |
dansmith | sweet | 15:03 |
gmann | I think it's worth adding the multi-cell job on the tempest side too and checking the new tests for cross-cell things | 15:04 |
gmann | I will do | 15:04 |
gibi | nah then we had the necessary gate block at FF week done | 15:04 |
gibi | could have been worse | 15:06 |
gmann | ? I did not get that | 15:06 |
gmann | you mean we should not add in tempest gate ? | 15:06 |
gibi | nah I'm just joking | 15:06 |
dansmith | I believe gibi is being funny :) | 15:07 |
gmann | :) | 15:07 |
gibi | so we tend to break the gate during FF as we are rushing | 15:07 |
dansmith | but in all seriousness, gmann adding multicell to tempest gate might be a bit heavy and nova-centric | 15:07 |
dansmith | not sure it's really necessary | 15:07 |
dansmith | maybe in experimental or something | 15:07 |
gmann | yeah experimental is good idea | 15:07 |
bauzas | add this in periodic | 15:09 |
bauzas | periodic-weekly | 15:09 |
bauzas | and then I'll check its state every week before the meeting | 15:09 |
sean-k-mooney | dansmith: sorry, was getting a drink; yes I can | 15:09 |
bauzas | better than experimental where not all of us would check it | 15:09 |
sean-k-mooney | oh gmann already did | 15:10 |
gmann | bauzas: with experimental we can check it during new test addition. it runs in the nova gate as voting, so a periodic on the tempest side would not be of much benefit. if anything is broken then the nova gate will catch it and inform tempest | 15:11 |
sean-k-mooney | I'm not sure periodic-weekly helps | 15:11 |
sean-k-mooney | we will notice it on the nova side before it runs | 15:11 |
gmann | yeah | 15:11 |
bauzas | okok | 15:11 |
bauzas | maybe I misunderstood | 15:12 |
dansmith | right, periodic won't really help | 15:12 |
gmann | bauzas: we are talking about adding the multi-cell job to the tempest experimental pipeline so that we can check the new tests (the migration/unshelve ones) in a cross-cell env | 15:12 |
bauzas | aaaaah | 15:13 |
bauzas | OK, then yes | 15:13 |
bauzas | fine with me | 15:13 |
bauzas | I was thinking it was a nova job | 15:13 |
bauzas | lgtm then to be experimental | 15:13 |
gmann | bauzas: with the new change in RBAC (dropped system scope from policy), we do not need this blueprint change anymore. I have added the notes there https://blueprints.launchpad.net/nova/+spec/allow-project-admin-list-hypervisors | 16:29 |
gmann | bauzas: please close/mark it not needed | 16:30 |
gmann | added the same notes in etherpad too for your easy tracking https://etherpad.opendev.org/p/nova-zed-blueprint-status | 16:32 |
sean-k-mooney | gmann i just set it to obsolete | 16:33 |
gmann | sean-k-mooney: thanks | 16:34 |
bauzas | ricolin: just said -1 for https://review.opendev.org/c/openstack/nova/+/830646 but if you don't have time today or tomorrow morning, lemme know | 16:58 |
bauzas | ricolin: we could just need a follow-up patch for my nits | 16:58 |
bauzas | and I'll +2 | 16:58 |
ricolin | bauzas: thanks, I will try to update it ASAP | 17:15 |
whoami-rajat | dansmith, do you have any idea about the nova-multi-cell job breaking constantly, or is it just flaky? | 17:16 |
dansmith | whoami-rajat: there's a fix in the queue | 17:16 |
dansmith | whoami-rajat: no point in rechecking until that lands | 17:16 |
whoami-rajat | ah ok | 17:16 |
whoami-rajat | wasn't aware | 17:16 |
dansmith | it has been in check for hours, not even in the gate yet | 17:16 |
whoami-rajat | doesn't sound good | 17:17 |
dansmith | whoami-rajat: it's in the gate now | 17:30 |
whoami-rajat | great | 17:30 |
whoami-rajat | can i have the link? | 17:30 |
dansmith | it's 855378 | 17:31 |
whoami-rajat | ack | 17:31 |
sean-k-mooney | whoami-rajat: since your change is approved, even if it's not merged by COB tomorrow at M3 it can be rechecked to land it | 17:56 |
sean-k-mooney | so we will keep an eye on it until it's landed | 17:56 |
sean-k-mooney | whoami-rajat: if you have a Depends-On from the novaclient and osc patches against the top patch we can review those in parallel | 17:57 |
whoami-rajat | sean-k-mooney, thanks, I just wanted to update the other project patches after this merged; I guess it's safe to update the clients and tempest in parallel so I will get to that | 17:57 |
whoami-rajat | sean-k-mooney, yep, I will just update them quickly and let you know | 17:58 |
sean-k-mooney | yep, Depends-On will prevent them from merging until the API change lands and then they can just be rechecked at that point | 17:58 |
whoami-rajat | ack, sounds good | 18:00 |
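For anyone following along: a Gerrit Depends-On is just a commit-message footer pointing at the change that must land first, for example (using the nova API change from this thread):

```
Depends-On: https://review.opendev.org/c/openstack/nova/+/830883
```

Zuul then tests the client patches with that change applied and refuses to merge them until it lands, which is why a simple recheck is all that is needed afterwards.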
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 18:14 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 18:14 |
sean-k-mooney | ricolin: you still have not addressed my feedback on https://review.opendev.org/c/openstack/nova/+/844507/15 | 18:15 |
sean-k-mooney | the way you are doing the traits reporting is wrong | 18:15 |
opendevreview | Rajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.93 https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 18:16 |
whoami-rajat | novaclient: https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 18:18 |
whoami-rajat | osc: https://review.opendev.org/c/openstack/python-openstackclient/+/831014 | 18:18 |
whoami-rajat | sean-k-mooney, ^ | 18:18 |
sean-k-mooney | ack ill go review those now | 18:19 |
whoami-rajat | thanks! | 18:20 |
ricolin | sean-k-mooney: okay, checking the comments again | 18:28 |
dansmith | that skip is about to pop | 19:31 |
dansmith | next ten minutes or so | 19:32 |
sean-k-mooney | whoami-rajat: I looked over the changes; novaclient looks good. minor issue with osc, and does the tempest change really need its own job? I would just add that test to tempest-full if cinder is deployed and nova is new enough | 19:43 |
sean-k-mooney | none of that is blocking the nova change, but that's just my feedback | 19:43 |
sean-k-mooney | I'm going to go have dinner/breakfast | 19:43 |
opendevreview | Merged openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell https://review.opendev.org/c/openstack/nova/+/855378 | 19:50 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 20:18 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 20:18 |
ricolin | sean-k-mooney: can you check if I made it right this time, thanks ^^^ | 20:18 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 20:50 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 20:50 |
*** dasm is now known as dasm|off | 21:40 | |
opendevreview | Merged openstack/nova master: Add documentation and releasenotes for RBAC change https://review.opendev.org/c/openstack/nova/+/854882 | 23:27 |
*** efried1 is now known as efried | 23:49 |