*** ralonsoh_ooo is now known as ralonsoh | 07:32 | |
opendevreview | Jorge San Emeterio proposed openstack/nova master: WIP: Look for cpu controller on cgroups v2 https://review.opendev.org/c/openstack/nova/+/873127 | 08:49 |
bauzas | gibi: so a bit of a heads-up | 09:26 |
bauzas | gibi: first, you may be interested in knowing what the logs tell us about the functests https://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/ | 09:27 |
bauzas | gibi: secondly, I created a docs patch and a follow-up for the cpu mgmt series https://review.opendev.org/c/openstack/nova/+/874514 and https://review.opendev.org/c/openstack/nova/+/874515/ | 09:28 |
bauzas | finally, I'll talk about the RC1 etherpad in the meeting https://etherpad.opendev.org/p/nova-antelope-rc-potential | 09:29 |
bauzas | we now have an LP tag for the antelope RC | 09:29 |
opendevreview | Jorge San Emeterio proposed openstack/nova stable/train: WIP: Fixing python-devel package for RHEL 8 https://review.opendev.org/c/openstack/nova/+/874547 | 10:13 |
opendevreview | Jorge San Emeterio proposed openstack/nova stable/train: Changing "python-devel" to "python3-devel" on bindep test requirements for RPM based distros. https://review.opendev.org/c/openstack/nova/+/874547 | 11:10 |
opendevreview | Alexey Stupnikov proposed openstack/nova stable/victoria: reenable greendns in nova. https://review.opendev.org/c/openstack/nova/+/833436 | 12:03 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Handle InstanceExists exception for duplicate instance https://review.opendev.org/c/openstack/nova/+/860938 | 12:39 |
*** ralonsoh is now known as ralonsoh_lunch | 12:51 | |
*** ralonsoh_lunch is now known as ralonsoh | 13:31 | |
opendevreview | Jorge San Emeterio proposed openstack/nova stable/train: Indicate dependency on "python3-devel" for py3 based RPM distros. https://review.opendev.org/c/openstack/nova/+/874547 | 14:10 |
*** dasm|off is now known as dasm | 14:12 | |
opendevreview | Jorge San Emeterio proposed openstack/nova stable/train: Add binary test dependency "python3-devel" for py3 based RPM distros. https://review.opendev.org/c/openstack/nova/+/874547 | 14:12 |
opendevreview | Jorge San Emeterio proposed openstack/nova stable/train: [stable-only] Add binary test dependency "python3-devel" for py3 based RPM distros. https://review.opendev.org/c/openstack/nova/+/874547 | 14:13 |
gibi | bauzas: I will have to drop around 17:30 during the nova weekly meeting | 14:40 |
bauzas | gibi: ack, np | 14:40 |
elodilles | bauzas: are you editing the meeting page? let me know when i can update stable section | 14:43 |
bauzas | elodilles: do it now | 14:44 |
bauzas | elodilles: I'll add all the Bobcat plans and RC1 later | 14:44 |
elodilles | bauzas: done | 14:45 |
bauzas | all cool | 14:45 |
opendevreview | Jorge San Emeterio proposed openstack/nova master: WIP: Look for cpu controller on cgroups v2 https://review.opendev.org/c/openstack/nova/+/873127 | 14:45 |
elodilles | bauzas: btw, have you seen this? https://review.opendev.org/c/openstack/releases/+/874450 | 14:51 |
elodilles | (i know that you are busy with everything o:)) | 14:51 |
bauzas | elodilles: yup, it's now in the RC1 etherpad | 14:51 |
bauzas | we'll discuss it in the meeting | 14:52 |
elodilles | bauzas: ++ | 14:52 |
opendevreview | Merged openstack/nova-specs master: Create specs directory for 2023.2 Bobcat https://review.opendev.org/c/openstack/nova-specs/+/872068 | 15:39 |
bauzas | #startmeeting nova | 16:00 |
opendevmeet | Meeting started Tue Feb 21 16:00:38 2023 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:00 |
opendevmeet | The meeting name has been set to 'nova' | 16:00 |
Uggla | o/ | 16:00 |
bauzas | hey folks, hola everyone | 16:00 |
dansmith | o/ | 16:01 |
bauzas | #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting | 16:01 |
*** artom_ is now known as artom | 16:01 | |
elodilles | o/ | 16:01 |
bauzas | let's start, some people have to leave early | 16:01 |
bauzas | #topic Bugs (stuck/critical) | 16:01 |
bauzas | #info No Critical bug | 16:02 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 16 new untriaged bugs (-1 since the last meeting) | 16:02 |
gibi | o/ | 16:02 |
bauzas | auniyal helped me with triage | 16:02 |
bauzas | I created an etherpad | 16:02 |
bauzas | and I have a bug I'd like to discuss with you folks | 16:02 |
bauzas | #link https://etherpad.opendev.org/p/nova-bug-triage-20230214 | 16:02 |
bauzas | the bug in question : | 16:03 |
bauzas | #link https://bugs.launchpad.net/nova/+bug/2006770 | 16:03 |
bauzas | as you can see, I set it to Opinion | 16:03 |
bauzas | tl;dr: this is about our ip query param for instances list | 16:03 |
bauzas | we directly call Neutron to get the ports | 16:03 |
bauzas | it basically works, but the reporter had some concerns | 16:04 |
bauzas | do people want to discuss this bug now or later? | 16:05 |
bauzas | (we can discuss it in the open disc topic if we have time) | 16:05 |
bauzas | let's say later then :) | 16:06 |
bauzas | (people can lookup the bug if they want meanwhile) | 16:06 |
dansmith | opinion seems right to me :) | 16:06 |
bauzas | let's discuss this then later in the open discussion topic | 16:06 |
bauzas | so people will have time | 16:06 |
bauzas | moving on | 16:06 |
bauzas | #info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster | 16:07 |
bauzas | Uggla: does it work for you to take the baton this week? | 16:07 |
Uggla | bauzas, ok | 16:07 |
bauzas | ack | 16:07 |
bauzas | #info bug baton is being passed to Uggla | 16:07 |
bauzas | #topic Gate status | 16:07 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs | 16:07 |
bauzas | but the best is to track the etherpad | 16:07 |
bauzas | #link https://etherpad.opendev.org/p/nova-ci-failures | 16:08 |
bauzas | it was a dodgy week | 16:08 |
dansmith | so, | 16:08 |
dansmith | this got merged: https://review.opendev.org/c/openstack/devstack/+/873646 | 16:08 |
dansmith | which seems to allow halving the memory used by mysqld | 16:08 |
bauzas | haha, gtk | 16:09 |
dansmith | which may help with the OOM issues we see, especially in the fat jobs like ceph-multistore | 16:09 |
dansmith | we could enable that in our nova-ceph-multistore job if we want to be on the leading edge and try to make sure that it's actually helping | 16:09 |
dansmith | (it's opt-in right now) | 16:09 |
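A minimal sketch of what opting in could look like in nova's .zuul.yaml, assuming the devstack change above exposes the behaviour as a `MYSQL_REDUCE_MEMORY` devstack variable (the name is taken from the linked review; treat it as illustrative and verify against the merged change):

```yaml
# Hypothetical job variant: opt the ceph job into the devstack
# mysql memory-reduction flag via the usual devstack_localrc vars.
- job:
    name: nova-ceph-multistore
    vars:
      devstack_localrc:
        MYSQL_REDUCE_MEMORY: true
```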
bauzas | indeed, I'll double check later if we continue to have some OOM issues | 16:09 |
bauzas | ah my bad | 16:09 |
dansmith | we could remove it if it causes other problems, but.. might be good to try it | 16:09 |
bauzas | surely | 16:10 |
bauzas | dansmith: thanks for working on it | 16:10 |
bauzas | dansmith: I can write a zuul patch for nova | 16:10 |
dansmith | I can do it too, just wanted to socialize | 16:10 |
bauzas | dansmith: ack cool then, ping me for reviews | 16:10 |
dansmith | ack | 16:11 |
bauzas | ++ again | 16:11 |
bauzas | dansmith: I also need to look at all the Gerrit recheck comments I wrote last week | 16:12 |
bauzas | I maybe found some other races | 16:12 |
bauzas | but we'll see | 16:12 |
bauzas | we also have the OOM logger patch that was telling us a few things | 16:13 |
bauzas | https://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/ | 16:13 |
bauzas | but let's discuss this off-meeting | 16:13 |
* gibi had no time to look at the extra logs from the functional race | 16:14 | |
bauzas | gibi: basically, each of the 6 failures with logs was in a different functest | 16:15 |
gibi | cool, that can serve as a basis for a local repro | 16:15 |
bauzas | anyway, moving on | 16:15 |
sean-k-mooney1 | bauzas: they are all in the libvirt test suite | 16:15 |
sean-k-mooney1 | so likely all have the same common issue | 16:16 |
sean-k-mooney1 | but ya, let's move on | 16:16 |
bauzas | maybe, I haven't had time yet to look at the code | 16:16 |
bauzas | #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status | 16:16 |
bauzas | all of them are green ^ | 16:16 |
bauzas | #info Please look at the gate failures and file a bug report with the gate-failure tag. | 16:16 |
bauzas | #info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures | 16:16 |
bauzas | that's it | 16:16 |
bauzas | #topic Release Planning | 16:16 |
bauzas | #link https://releases.openstack.org/antelope/schedule.html | 16:16 |
bauzas | so we're now in Feature Freeze | 16:17 |
bauzas | #link https://etherpad.opendev.org/p/nova-antelope-blueprint-status Blueprint status for 2023.1 | 16:17 |
bauzas | you can see what we merged | 16:17 |
bauzas | I also created two changes for my own series that were requested | 16:17 |
bauzas | but I'll ping folks tomorrow about them | 16:17 |
bauzas | #info Antelope-rc1 is in 1.5 weeks | 16:17 |
bauzas | now, we need to prepare for our RC1 where we branch master | 16:18 |
bauzas | #link https://etherpad.opendev.org/p/nova-antelope-rc-potential | 16:18 |
bauzas | as you see ^ I created an etherpad | 16:18 |
bauzas | thanks again btw, takashi, for creating some changes that are needed | 16:18 |
bauzas | as a reminder, if people find some bugs, they can use a specific tag : | 16:19 |
bauzas | https://bugs.launchpad.net/nova/+bugs?field.tag=antelope-rc-potential | 16:19 |
bauzas | before RC1, any bug report can use this tag, but we prefer to make sure they are regressions | 16:19 |
bauzas | after RC1, only regressions should use this tag | 16:20 |
bauzas | I created a cycle highlights change too : | 16:20 |
bauzas | #link https://review.opendev.org/c/openstack/releases/+/874483 Cycle highlights for Nova Antelope | 16:20 |
bauzas | please review it | 16:20 |
bauzas | at least gibi, dansmith, artom and other folks who had changes merged | 16:21 |
gibi | ack | 16:21 |
bauzas | I'll +1 on Thursday | 16:21 |
bauzas | we need to merge this before this Thursday for the Foundation marketing folks | 16:22 |
bauzas | we also have https://review.opendev.org/c/openstack/releases/+/874450 to +1 | 16:22 |
bauzas | I guess we're done with our clients | 16:22 |
bauzas | so I'll branch os-vif, osc-placement and novaclient unless people have concerns | 16:23 |
bauzas | as you see in the commit msg, it will be merged eventually on Friday | 16:23 |
* bauzas will just verify the SHA1 | 16:24 | |
elodilles | or earlier if a release liaison +1s it | 16:24 |
bauzas | elodilles: yup, but I'm asking people if they have concerns | 16:24 |
sean-k-mooney1 | speaking of which, I'm not sure if I will have time to continue doing that | 16:24 |
bauzas | looks not, so I'll just verify the SHA1s before +1ing | 16:24 |
sean-k-mooney1 | I may leave myself in for this cycle if no one else wants to take on that role | 16:24 |
bauzas | sean-k-mooney1: yup, I know and I was planning to ask you | 16:25 |
sean-k-mooney1 | but I am not sure of my availability to keep an eye on it this cycle | 16:25 |
*** sean-k-mooney1 is now known as sean-k-mooney | 16:25 | |
bauzas | ok, so maybe it's not time yet to ask if someone else wants to be a release liaison | 16:25 |
bauzas | but I'll officially ask it next week | 16:25 |
sean-k-mooney | ok | 16:25 |
bauzas | we can have more than one release liaison btw. | 16:26 |
bauzas | no need to remove you before someone arrives or something like that | 16:26 |
sean-k-mooney | ack; the primary role is to reduce the bus factor and ensure that releases are done correctly and in a timely fashion, so it does not all fall on the PTL | 16:26 |
bauzas | and we can even have *two* liaisons if we really find *two* people wanting to be :) | 16:26 |
bauzas | no need to battle :p | 16:26 |
bauzas | I'll explain next week what a release liaison is and what they do | 16:27 |
bauzas | but if people want, they can DM me | 16:27 |
bauzas | before next meeting | 16:27 |
bauzas | #info If someone wants to run as a Nova release liaison next cycle, please ping bauzas | 16:28 |
bauzas | I think that's it for the RC1 agenda | 16:28 |
bauzas | oh | 16:28 |
bauzas | one last thing | 16:28 |
bauzas | thanks to takashi, https://review.opendev.org/c/openstack/nova-specs/+/872068 is merged | 16:29 |
bauzas | you can now add your specs for Bocat | 16:29 |
bauzas | Bobcat even | 16:29 |
bauzas | like, people who had accepted specs for Antelope can just repropose them for Bobcat and I'll quickly +2/+W directly if nothing changes between both spec files | 16:30 |
* bauzas tries not to eye folks | 16:30 |
bauzas | I'll do the Launchpad Bobcat magic later next week (I guess) | 16:31 |
bauzas | that's it this time | 16:31 |
bauzas | #topic vPTG Planning | 16:31 |
bauzas | as a weekly reminder : | 16:31 |
bauzas | #link https://www.eventbrite.com/e/project-teams-gathering-march-2023-tickets-483971570997 Register your free ticket | 16:31 |
* gibi needs to drop, will read back tomorrow | 16:31 | |
bauzas | maybe you haven't seen but we are officially a PTG team | 16:31 |
bauzas | I don't know yet how long we could run the vPTG sessions | 16:32 |
bauzas | but like every cycle, I'll ask your opinions about the timing | 16:32 |
bauzas | not today, but once I'm asked | 16:32 |
bauzas | good time for saying | 16:32 |
bauzas | #link https://etherpad.opendev.org/p/nova-bobcat-ptg Draft PTG etherpad | 16:33 |
bauzas | I feel alone with this etherpad ^ | 16:33 |
bauzas | and I'm sure people have topics they want to discuss | 16:33 |
bauzas | anyway, moving on | 16:34 |
bauzas | (just hoping people read our meeting notes) | 16:34 |
bauzas | #topic Review priorities | 16:34 |
bauzas | #link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2) | 16:34 |
bauzas | #info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review | 16:34 |
bauzas | #topic Stable Branches | 16:35 |
bauzas | elodilles: your turn | 16:35 |
elodilles | #info stable gates seem to be OK (victoria gate workaround has landed and it is now unblocked) | 16:35 |
elodilles | well, unblocked | 16:35 |
elodilles | though it's not easy to merge patches everywhere | 16:35 |
elodilles | #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci | 16:36 |
bauzas | indeed | 16:36 |
elodilles | that's the short summary | 16:36 |
bauzas | I still have the ussuri CVE VMDK fix to be merged | 16:36 |
bauzas | I rechecked it a few times | 16:36 |
bauzas | elodilles: thanks for the notes | 16:37 |
elodilles | np | 16:37 |
bauzas | #topic Open discussion | 16:37 |
bauzas | so, nothing on the agenda | 16:37 |
bauzas | we can discuss https://bugs.launchpad.net/nova/+bug/2006770 if people want or close the meeting | 16:37 |
bauzas | the fact is, I wrote Opinion | 16:37 |
bauzas | unless people have concerns with what I wrote, I'm done. | 16:38 |
bauzas | looks not | 16:39 |
bauzas | then I assume we're done. | 16:39 |
dansmith | ++ | 16:40 |
bauzas | thanks all | 16:40 |
bauzas | #endmeeting | 16:40 |
opendevmeet | Meeting ended Tue Feb 21 16:40:23 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:40 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.html | 16:40 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.txt | 16:40 |
opendevmeet | Log: https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.log.html | 16:40 |
elodilles | thanks o/ | 16:40 |
dansmith | bauzas: so, gmann and I were running that memory usage patch in periodic on tempest jobs for a few days to make sure it didn't substantially worsen things | 16:40 |
dansmith | and my survey at the moment indicates that it looks good | 16:41 |
dansmith | so I'll propose to make it enabled for ceph-multistore (which will also impact glance) and we'll see if gmann is cool with that when he's around | 16:41 |
bauzas | nice to hear | 16:41 |
bauzas | ack, do it and I'll vote | 16:41 |
opendevreview | Dan Smith proposed openstack/nova master: Use mysql memory reduction flags for ceph job https://review.opendev.org/c/openstack/nova/+/874664 | 16:45 |
dansmith | bauzas: ^ | 16:45 |
bauzas | dansmith: I doubt that cells_v2 map_instances could work with https://bugs.launchpad.net/nova/+bug/2007922 (even though I asked for it) | 17:43 |
bauzas | dansmith: tl;dr: the instance mapping exists but the cell value is None | 17:43 |
bauzas | and we know the instance is in cell0 DB | 17:43 |
dansmith | yeah, as I said, I initially missed that the person said they had reference in the mappings table | 17:43 |
bauzas | dansmith: I guess the simplest thing is to hack the DB to add the cell0 uuid in the InstanceMapping record, no? | 17:43 |
dansmith | probably | 17:44 |
bauzas | or do we have a better nova-manage command ? | 17:44 |
bauzas | looking at the docs, nope | 17:44 |
dansmith | not that I know of | 17:44 |
bauzas | this instance is somehow sitting in the middle | 17:44 |
bauzas | not fully migrated, but in between | 17:44 |
dansmith | not fully ... mapped? | 17:44 |
bauzas | sorry, yeah mapped | 17:45 |
bauzas | I'll propose the ALTER to the reporter | 17:45 |
dansmith | don't we have a mapped flag on the instance (or something else)? | 17:47 |
bauzas | in the instances table you mean ? | 17:48 |
dansmith | I thought it was.. that's how we survey instances that need to be mapped right? | 17:48 |
* bauzas just checks the map_instances code | 17:49 | |
dansmith | just wondering if that flag matches or not | 17:49 |
bauzas | so | 17:52 |
bauzas | https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L874 | 17:52 |
bauzas | we just iterate with a limit and a marker over the instances table of the given cell | 17:53 |
dansmith | ah right | 17:54 |
bauzas | and I think I understand how the cell ID was set to None | 17:54 |
bauzas | https://github.com/openstack/nova/blob/439c67254859485011e7fd2859051464e570d78b/nova/objects/instance_mapping.py#L73 | 17:54 |
dansmith | it only does that if it's not none though | 17:55 |
bauzas | anyway, map_instances *could* work with cell0 | 17:55 |
bauzas | if the reporter runs map_instances with cell0 attribute, it will loop over the contents of cell0's instances table and will create an instancemapping object | 17:56 |
bauzas | oh wait, fuck no | 17:56 |
bauzas | https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L791-L792 | 17:56 |
bauzas | so, definitely the easiest is to alter the db | 17:56 |
dansmith | again, I only thought it was useful to run map if the mapping didn't exist | 18:00 |
bauzas | yup | 18:03 |
bauzas | or the reporter could then delete the instance mapping | 18:03 |
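For reference, a hedged sketch of the object-level equivalent of the DB hack discussed above: repointing the orphaned InstanceMapping at cell0 through nova's own object model instead of a raw UPDATE. It assumes a node whose nova.conf has API database access, and `instance_uuid` stands in for the affected instance:

```python
from nova import config, context, objects

config.parse_args([])  # load nova.conf ([api_database] access assumed)
objects.register_all()
ctxt = context.get_admin_context()

# cell0 has a well-known uuid, exposed as CellMapping.CELL0_UUID
cell0 = objects.CellMapping.get_by_uuid(
    ctxt, objects.CellMapping.CELL0_UUID)

# The broken record: the mapping exists but its cell_mapping is None
im = objects.InstanceMapping.get_by_instance_uuid(ctxt, instance_uuid)
im.cell_mapping = cell0
im.save()
```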
sean-k-mooney | bauzas: elodilles: can we prioritise review of this if possible https://review.opendev.org/c/openstack/nova/+/874547 | 18:10 |
opendevreview | Takashi Natsume proposed openstack/placement master: Move implemented specs for Xena and Yoga release https://review.opendev.org/c/openstack/placement/+/853730 | 18:10 |
sean-k-mooney | this will help us fix our downstream ci | 18:11 |
bauzas | done, but I'll leave the +W to you as I don't have a lot of context | 18:13 |
sean-k-mooney | tl;dr we use bindep in our downstream jobs to install deps before running tox, but rhel 8 no longer has python-devel | 18:15 |
gmann | dansmith: +W on 'mysql memory reduction flags for ceph job' | 18:15 |
sean-k-mooney | i wanted to check with elodilles to make sure they were ok with the stable-only change | 18:15 |
dansmith | gmann: cool | 18:16 |
dansmith | thanks | 18:16 |
dansmith | gmann: oh jeez, I didn't realize the mysql periodic thing hadn't landed yet | 18:24 |
dansmith | gmann: do you think we should wait for that to soak for a bit? | 18:24 |
dansmith | I know the devstack one did, and I guess I misread that the tempest one hadn't yet | 18:24 |
gmann | dansmith: I also did not realize it when I checked that patch. but I think it is ok to enable it in ceph job and see. | 18:26 |
gmann | we can always revert it if it fails and starts delaying things during release time | 18:26 |
dansmith | okay, that's my preference too | 18:26 |
mnaser | i got a fun one. it looks like by default nova saves the az of the vm in the cell db, but it doesn't update the request_spec; when we do migrations, we pass the request_spec to the scheduler (which contains az=null), which can then move you from one az to another during the migration | 18:44 |
mnaser | since .. https://github.com/openstack/nova/blob/90e2a5e50fbf08e62a1aedd5e176845ee22d96c9/nova/scheduler/request_filter.py#L138-L166 checks for request_spec az | 18:45 |
sean-k-mooney | mnaser: this was changed recently | 18:45 |
mnaser | this is in a scenario where an operator wants to make vms stick to their az if a user doesn't specify one | 18:45 |
sean-k-mooney | right, so we spent a lot of time trying to decide what the semantics should be | 18:46 |
sean-k-mooney | I'm trying to find the spec | 18:46 |
sean-k-mooney | https://specs.openstack.org/openstack/nova-specs/specs/zed/implemented/unshelve-to-host.html | 18:47 |
sean-k-mooney | i guess this was for unshelve | 18:47 |
sean-k-mooney | mnaser: we expect that the request spec would not have the az, by the way, if the user did not request one | 18:48 |
mnaser | makes sense cause that's their request | 18:48 |
mnaser | i understand it might not be everyone that wants this, but maybe for live migration use case it can cause issues if nova ends up doing cross-az migrations | 18:49 |
sean-k-mooney | for move operations where we support specifying an AZ, it would be ok in some cases to set it in the request spec | 18:49 |
sean-k-mooney | mnaser: but we would want to have the same behavior as in the unshelve spec | 18:50 |
sean-k-mooney | i don't recall if we fixed the other move operations to be consistent with that when we did this | 18:50 |
sean-k-mooney | Uggla: do you recall | 18:50 |
sean-k-mooney | mnaser: with unshelve to a specific az, if you set it in the unshelve request and it was not set in the original request spec, it will be set after | 18:51 |
sean-k-mooney | mnaser: live migration does not currently support an az | 18:52 |
mnaser | sean-k-mooney: essentially I'm thinking this is where this can be changed https://github.com/openstack/nova/blob/f01a90ccb85ab254236f84009cd432d03ce12ebb/nova/compute/api.py#L5499-L5500 | 18:52 |
mnaser | cause live migrating from one az to another could pretty much fail, and we can just have it as an option i guess if we don't want to change default behaviour | 18:53 |
sean-k-mooney | nor does migrate | 18:53 |
mnaser | in most worlds migrate or live migrate will fail across az's | 18:53 |
mnaser | esp if you're using different storage backends for example | 18:53 |
sean-k-mooney | mnaser: this would be an api change and need a spec | 18:53 |
sean-k-mooney | in general, live migration between AZs will either work or not work depending on your deployment; I would expect it to work in most cases | 18:54 |
sean-k-mooney | it just comes down to whether you have exchanged ssh keys such that the hypervisors can communicate, and whether you are using AZs with cinder or not | 18:55 |
sean-k-mooney | and cross_az_attach | 18:55 |
mnaser | Maybe we can make it so that if cross_az_attach = false then it would update the request spec to match? | 18:56 |
sean-k-mooney | no | 18:56 |
sean-k-mooney | no config-driven api behavior | 18:56 |
sean-k-mooney | this is not a bug | 18:56 |
sean-k-mooney | if we want to support move operations targeting an AZ, or change the request spec, this is an api change | 18:56 |
mnaser | No it’s not to target an AZ | 18:56 |
sean-k-mooney | i know you'd prefer to keep affinity | 18:57 |
mnaser | it's so that it stays in the same AZ, as otherwise the live migration will fail | 18:57 |
sean-k-mooney | like a weigher or filter | 18:57 |
sean-k-mooney | however that is not what the end user asked for | 18:57 |
mnaser | if nova allows you to live migrate from one az to another for a vm with cross_az_attach set to false is that a bug ? | 18:58 |
sean-k-mooney | not a scheduler bug | 18:58 |
sean-k-mooney | it will fail in pre-live-migrate | 18:59 |
sean-k-mooney | and the vm will stay in active on the host | 18:59 |
sean-k-mooney | (source host) | 18:59 |
mnaser | now if you’re using rbd for images_type and you have 2 clusters with each az using different cluster | 18:59 |
mnaser | And you do a live migrate and end up with vm running on the other side and but using it’s original storage | 19:00 |
sean-k-mooney | then you need to configure your filters to ensure that you target the vsm to spcific cluster using a flavor or simialr | 19:00 |
mnaser | And then on resize ops it blows up horribly because it’s trying to use the destination cluster id | 19:00 |
sean-k-mooney | yep that operator error if they did not configure things properly to prevent this | 19:01 |
sean-k-mooney | adressing theses usecase is somethign that could be done but it would be a feature not a bug | 19:01 |
mnaser | How? So if you have 3 azs you create 3 flavors? | 19:01 |
sean-k-mooney | yep | 19:01 |
mnaser | Do you think that’s user friendly at all | 19:01 |
sean-k-mooney | nope, but it's how it's currently designed | 19:02 |
sean-k-mooney | and fixing it would not be a bug fix | 19:02 |
mnaser | So really what you’re saying is nova will do live migrations that will break your vm | 19:02 |
mnaser | And that’s not a bug | 19:02 |
sean-k-mooney | nope | 19:02 |
sean-k-mooney | nova checks if it can attach the volumes to the selected host before it live migrates | 19:03 |
sean-k-mooney | so it will pass the scheduler but fail in pre-live-migrate | 19:03 |
mnaser | ok, let's put that aside and talk about the users who use images_type=rbd | 19:03 |
mnaser | with different az's | 19:03 |
mnaser | it will break those vms | 19:03 |
sean-k-mooney | also live migrate is an admin-only api, and we allow you as an admin to select the host | 19:03 |
mnaser | ok when we're deploying openstack for customers they don't expect to sit and decide which host they are going to move things into at scale | 19:04 |
sean-k-mooney | mnaser: not if you use cross_az_attach=false | 19:04 |
mnaser | if i tell them 'sorry, openstack is kinda silly, it picks the wrong hosts, you just pick the right host yourself instead' | 19:04 |
mnaser | non-bfv, images_type=rbd, 2 az's with a ceph cluster each will result in broken live migrations | 19:04 |
sean-k-mooney | if you want to propose a new feature for this, I'm open to reviewing that | 19:04 |
sean-k-mooney | what i do not think would be correct is considering this a bug, when we previously declared it out of scope, and backporting this | 19:05 |
sean-k-mooney | mnaser: it would break if the ceph cluster was inaccessible, yes | 19:05 |
sean-k-mooney | although i believe | 19:06 |
sean-k-mooney | the vm would stay running on the source host in active | 19:06 |
sean-k-mooney | with the migration in error | 19:06 |
mnaser | and any reasonable operator would make a sane assumption that the cloud would not live migrate across az's | 19:06 |
sean-k-mooney | libvirt will detect the qemu instance was not able to connect | 19:06 |
mnaser | the vm does migrate if the cluster is accessible, and then all further operations like resize/migrate/etc are broken | 19:06 |
sean-k-mooney | and it should abort the migration | 19:06 |
mnaser | so it goes into a user-facing broken state | 19:06 |
sean-k-mooney | azs are not fault domains | 19:07 |
sean-k-mooney | or isolated segments | 19:07 |
sean-k-mooney | mnaser: i do not believe you will get into a user-facing broken state for live migration | 19:07 |
mnaser | you will.. if both ceph clusters are accessible, then the further operations will try to use the fsid of the target vm | 19:07 |
mnaser | i can ask to get tracebacks and logs from the customer | 19:08 |
mnaser | but it makes sense, since now it's trying to use the _new_ cluster fsid but doesn't find the volume, since it's attached from the old cluster fsid | 19:08 |
sean-k-mooney | if both are accessible and you only have ceph creds for one of them on the compute host, then qemu will not be able to connect | 19:08 |
sean-k-mooney | mnaser: that sounds like they are trying to use the same user/keyring between both clusters | 19:09 |
mnaser | ok, assume one cluster with different pools when you're using ceph then | 19:09 |
sean-k-mooney | which is incorrect | 19:09 |
mnaser | i haven't dug that deep into their stuff | 19:09 |
mnaser | now when nova tries to do things it'll do it on the new pool but can't find that _disk image | 19:10 |
sean-k-mooney | which will fail when we try to create the qemu instance on the dest | 19:10 |
sean-k-mooney | but the migration should abort then | 19:10 |
mnaser | isnt the old xml get transferred | 19:11 |
sean-k-mooney | and the vm should stay running on the source node in active | 19:11 |
mnaser | so it successfully completes? | 19:11 |
mnaser | s/isnt/doesnt/ | 19:11 |
sean-k-mooney | no, the vm gets created really really early on the dest | 19:11 |
mnaser | i don't think we rebuild xml from scratch on target but rather rely on shipping the xml from the old libvirt to the new one? | 19:11 |
sean-k-mooney | we have to create the vm on the dest so that the ram can be copied | 19:11 |
sean-k-mooney | mnaser: we generate a new xml on the source for the dest | 19:11 |
mnaser | ok something is not adding up then | 19:12 |
sean-k-mooney | so my expectation is that it should use the old cluster | 19:12 |
sean-k-mooney | so you would have cross-az traffic | 19:12 |
mnaser | oh ok right yes, it would add up nevermind | 19:12 |
mnaser | if we generate xml on source for the dest it'll have the old | 19:12 |
sean-k-mooney | what might break is a hard reboot after that | 19:12 |
mnaser | yes exactly, or resize, etc | 19:12 |
sean-k-mooney | right, but that's a completely different issue | 19:13 |
sean-k-mooney | we do not support move operations across different storage backends at all | 19:13 |
sean-k-mooney | and preventing that is left to the operator today; it has always been that way in nova | 19:13 |
mnaser | so as someone who's trying to get people to use openstack, this is giving them a big gun to shoot themselves in the foot | 19:13 |
mnaser | and then when they do, it doesn't seem very trivial and obvious that what they did is wrong | 19:14 |
sean-k-mooney | mnaser: the simpler approach is to use cells | 19:14 |
mnaser | when they went ahead, created azs, aggregates, etc | 19:14 |
sean-k-mooney | we do not allow cross-cell live migration | 19:14 |
mnaser | that's a really good point | 19:14 |
mnaser | so ensure same storage backend inside a cell | 19:14 |
mnaser | seems like pretty sane advice | 19:15 |
sean-k-mooney | yes | 19:15 |
sean-k-mooney | with all that said, we could work on a feature to address this | 19:15 |
sean-k-mooney | but it would be a new feature, and it would still have to allow use cases where cross-az move operations make sense | 19:15 |
sean-k-mooney | mnaser: for example we recently added a similar feature for neutron routed networks | 19:16 |
sean-k-mooney | https://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/routed-networks-scheduling.html | 19:16 |
mnaser | sometimes i really feel letting users create az's was a massive mistake lol | 19:16 |
sean-k-mooney | well, users can't | 19:17 |
mnaser | it was always so loose and there's so many people who get shot in the foot with it | 19:17 |
mnaser | nah i mean from an operator perspective | 19:17 |
sean-k-mooney | it's admin-only unless you change the policy | 19:17 |
mnaser | people build out something and then it almost never gives them what they want | 19:17 |
sean-k-mooney | oh well, the issue is people confuse nova azs with aws azs | 19:17 |
sean-k-mooney | and they are nothing like each other | 19:17 |
mnaser | yeah | 19:17 |
sean-k-mooney | so before wallaby there was no scheduler support for routed l3 networks | 19:18 |
mnaser | aws has a strong presence so its natural to think of it that way | 19:18 |
sean-k-mooney | i.e. there was nothing preventing you from cold/live migrating to a host where that ip could not be routed | 19:18 |
sean-k-mooney | https://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/routed-networks-scheduling.html added support for this | 19:18 |
sean-k-mooney | it would not be unreasonable to have a similar feature for nova storage | 19:19 |
sean-k-mooney | for example, if we used the ceph fsid to create a placement aggregate containing all hosts that were configured to use that ceph cluster | 19:19 |
sean-k-mooney | and then recorded that in the instance_system_metadata and scheduled based on that if set | 19:19 |
sean-k-mooney | we would just need to do member_of=<fsid> in the placement query | 19:20 |
mnaser | yeah, that seems like a handy simple way to track that for ceph | 19:20 |
sean-k-mooney | if rbd_fsid was in instance_system_metadata | 19:21 |
mnaser | i guess we would technically toss that into block device mapping data | 19:21 |
mnaser | i can't remember if nova uses that for its own storage | 19:21 |
sean-k-mooney | ish | 19:21 |
sean-k-mooney | we do in weird ways | 19:22 |
mnaser | maybe we should add a warning to the doc https://docs.openstack.org/nova/latest/admin/availability-zones.html about looking into using cells if you want to have full isolation and not allow migrations from one az to another | 19:22 |
sean-k-mooney | but this would be for the root disk really, although you could map cinder volumes to placement aggregates in a similar way | 19:22 |
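As a rough illustration of the member_of idea: if each ceph cluster's fsid were modelled as a placement aggregate with the matching compute nodes as members, the scheduler-side query could look roughly like this (a sketch; the credentials, `AUTH_URL`/`PASSWORD`, and the fsid-to-aggregate mapping `agg_uuid` are all assumed placeholders):

```python
from keystoneauth1 import loading, session

# Hypothetical admin credentials against keystone.
auth = loading.get_plugin_loader('password').load_from_options(
    auth_url=AUTH_URL, username='admin', password=PASSWORD,
    project_name='admin', user_domain_id='default',
    project_domain_id='default')
sess = session.Session(auth=auth)

# member_of restricts candidates to one aggregate (placement >= 1.21);
# agg_uuid is the hypothetical aggregate standing in for one ceph fsid.
resp = sess.get(
    '/allocation_candidates',
    params={'resources': 'VCPU:1,MEMORY_MB:512,DISK_GB:10',
            'member_of': agg_uuid},
    endpoint_filter={'service_type': 'placement'},
    headers={'OpenStack-API-Version': 'placement 1.21'})
candidates = resp.json()['allocation_requests']
```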
sean-k-mooney | bauzas: when you have time, reading back over ^ would be good | 19:23 |
mnaser | I'll push a PR to add some details about migrations and bring up cells | 19:23 |
sean-k-mooney | mnaser: cells are still not full isolation, but ya. | 19:24 |
mnaser | i have to be honest about my ability to provide help; spec + new feature discussion + all that is a bit too far of a reach for this | 19:24 |
sean-k-mooney | mnaser: the other approach would be to have a weigher | 19:24 |
sean-k-mooney | so an az affinity weigher | 19:24 |
mnaser | hmm | 19:24 |
mnaser | i could do that out of tree i guess | 19:24 |
mnaser | as i don't think nova would particularly want to carry that | 19:24 |
sean-k-mooney | we would need to pass the instance's current az to the scheduler, and then the weigher could prefer to stay in the same az | 19:25 |
sean-k-mooney | em, I would not be against having it | 19:25 |
sean-k-mooney | we would need to modify the destination object and add a preferred az field or something | 19:25 |
mnaser | i guess it can be a filter too but it would be very ugly | 19:26 |
mnaser | cause it would have to check if this is a reschedule (aka instance exists and we can find it) or first time (ignore) | 19:27 |
sean-k-mooney | well, it should not be a filter, because cross-az move operations are valid | 19:27 |
mnaser | ah yes also addressing that | 19:28 |
sean-k-mooney | mnaser: basically we could add a "current_az" field here https://github.com/openstack/nova/blob/master/nova/objects/request_spec.py#L1092-L1122 | 19:28 |
mnaser | this starts to enter the domain of requiring more resources/time than i have, so trying to see how i can be the most useful with the little resource i can spend on this 😅 | 19:29 |
sean-k-mooney | that's used in a few places, but we basically would just need to get the instance.az and pass it on | 19:29 |
sean-k-mooney | well, the simple solution is a docs patch + ptg topic | 19:29 |
sean-k-mooney | and i can raise it as an "operator pain point" internally and see if there is interest in addressing it | 19:30 |
sean-k-mooney | although i think we likely won't have time in the next cycle to work on this | 19:30 |
sean-k-mooney | there are potentially 2 features here: an az affinity weigher, and reporting ceph cluster reachability to placement | 19:31 |
sean-k-mooney | both help usability in different ways | 19:32 |
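A hedged sketch of what that out-of-tree az affinity weigher could look like. It assumes something upstream of the scheduler has recorded the instance's current AZ on the request spec, which, per the discussion above, nova does not do today for move operations:

```python
# Illustrative only: prefer hosts whose aggregates carry the same
# availability_zone as the one recorded on the request spec.
from nova.scheduler import weights


class AZAffinityWeigher(weights.BaseWeigher):
    """Prefer hosts in the AZ the instance currently lives in."""

    def _weigh_object(self, host_state, request_spec):
        current_az = request_spec.availability_zone
        if not current_az:
            return 0.0
        host_azs = {agg.metadata.get('availability_zone')
                    for agg in host_state.aggregates}
        # Reward same-AZ hosts but never exclude others: cross-az
        # moves stay valid, which is why this is a weigher, not a filter.
        return 1.0 if current_az in host_azs else 0.0
```

An out-of-tree class like this would be enabled via the [filter_scheduler]weight_classes option, alongside the default weighers.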
*** dasm is now known as dasm|off | 19:56 | |
simondodsley | In Train when I try to volume migrate a boot volume of a shutdown instance I get the message `Cannot 'swap_volume' instance xyx while it is in vm_state stopped` | 21:36 |
simondodsley | Is there any way to do this? | 21:36 |
simondodsley | Or was this something added after Train | 21:36 |