*** clarkb is now known as Guest298 | 01:19 | |
*** Guest298 is now known as clarkb | 01:20 | |
*** atmark is now known as Guest305 | 02:10 | |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Adds a reproducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863806 | 02:38
opendevreview | Amit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/864055 | 02:38 |
gmann | bauzas: sean-k-mooney[m] gibi : this is for skipping failing tests https://review.opendev.org/c/openstack/nova/+/865922 | 02:43 |
opendevreview | Amit Uniyal proposed openstack/nova master: Refactoring live_migrate function name https://review.opendev.org/c/openstack/nova/+/865954 | 08:38 |
bauzas | gibi: sean-k-mooney: your help is appreciated https://review.opendev.org/c/openstack/nova/+/865922 | 08:41 |
opendevreview | Arnaud Morin proposed openstack/nova master: Unbind port when offloading a shelved instance https://review.opendev.org/c/openstack/nova/+/853682 | 09:30 |
gibi | bauzas: on it | 09:34 |
gibi | gmann: thank you | 09:34 |
bauzas | gibi: thanks | 09:45 |
bauzas | and I updated the ML thread to notify our gerrit users | 09:45 |
gibi | cool | 09:45 |
opendevreview | Amit Uniyal proposed openstack/nova master: Adds check if resized to swap zero https://review.opendev.org/c/openstack/nova/+/857339 | 09:45 |
opendevreview | Merged openstack/nova master: Temporary skip some volume detach test in nova-lvm job https://review.opendev.org/c/openstack/nova/+/865922 | 11:08 |
opendevreview | Merged openstack/nova stable/wallaby: Adds a reproducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863900 | 11:08
opendevreview | Manuel Bentele proposed openstack/nova-specs master: Add configuration options to set SPICE compression settings https://review.opendev.org/c/openstack/nova-specs/+/849488 | 11:13 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: add regression test case for bug 1978983 https://review.opendev.org/c/openstack/nova/+/864168 | 11:13 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: For evacuation, ignore if task_state is not None https://review.opendev.org/c/openstack/nova/+/864169 | 11:13 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: add regression test case for bug 1978983 https://review.opendev.org/c/openstack/nova/+/864168 | 11:40 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: For evacuation, ignore if task_state is not None https://review.opendev.org/c/openstack/nova/+/864169 | 11:40 |
opendevreview | Merged openstack/nova stable/wallaby: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/863901 | 12:01 |
*** dasm|off is now known as dasm | 13:05 | |
sahid | morning dansmith, how do you feel regarding spec https://review.opendev.org/c/openstack/nova-specs/+/857838 I also have proposed the implementation | 13:45 |
sahid | if you have a moment I would be glad to make progress on it, feel free to let me know if there is any point that needs details, thanks a lot | 13:46
slaweq | gibi bauzas and other nova cores: hi, please check https://review.opendev.org/c/openstack/nova/+/855664 when you have a few minutes, thx in advance | 13:47
pengo_ | Hello, I am using the wallaby release of openstack and having issues with cinder volume attachments: once I try to delete, resize or unshelve the shelved vms, the volume_attachment entries do not get deleted in the cinder db and therefore the above mentioned operations fail every time. I have to delete these volume_attachment entries manually to make it work. I could only gather logs from nova-compute | 13:48
pengo_ | https://www.irccloud.com/pastebin/kIDKAF54/ | 13:48 |
pengo_ | If I would like to unshelve the instance it won't work as it has a duplicate entry in the cinder db for the attachment. So I have to delete it manually from the db or via the cli. This is the only choice I have if I would like to unshelve the vm. But deleting duplicate volume attachment entries every time for every vm is not a good approach for a production env. Is there any way to fix this issue? | 13:49
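[Editor's note: a hedged sketch of cleaning up leaked attachment records through the cinder attachments API (microversion 3.27+) instead of hand-editing the database; `sess` (a keystoneauth1 session) and the server UUID are placeholders, not from the log.]

```python
from cinderclient import client as cinder_client

# `sess` is assumed to be an authenticated keystoneauth1 session.
cinder = cinder_client.Client('3.27', session=sess)

# Inspect the attachment records still held for the stuck server
# ('SERVER_UUID' is a placeholder):
for att in cinder.attachments.list(search_opts={'instance_id': 'SERVER_UUID'}):
    print(att.id, att.status)
    # cinder.attachments.delete(att.id)  # only after confirming it is stale
```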
opendevreview | Amit Uniyal proposed openstack/nova master: Adds regression functional test for 1980720 https://review.opendev.org/c/openstack/nova/+/861357 | 14:11 |
opendevreview | Amit Uniyal proposed openstack/nova master: Adds check for VM snapshot fail while quiesce https://review.opendev.org/c/openstack/nova/+/852171 | 14:11 |
bauzas | slaweq: hah, I remember the context, we discussed this at the PTG right? | 14:16 |
pengo_ | yes | 14:22
slaweq_ | bauzas: yes, we talked about it at the PTG | 14:35 |
slaweq_ | and we agreed that we can go with this approach | 14:35 |
sean-k-mooney | is this related to the mtu advertisement ? | 14:36 |
sean-k-mooney | or something else | 14:36
sean-k-mooney | ah yes https://review.opendev.org/c/openstack/nova/+/855664 | 14:37 |
slaweq_ | sean-k-mooney: yes | 14:37 |
sean-k-mooney | ya so we agreed that if dhcp is available on the subnet we can omit the mtu from the metadata | 14:37
sean-k-mooney | this will allow the mtu to be reduced but not increased, and the vms will clamp the mtu the next time they renew their dhcp lease | 14:38
sean-k-mooney | while still not perfect, it will help | 14:38
sean-k-mooney | i can take a look now before i forget about it again | 14:38 |
sean-k-mooney | slaweq_: quick question | 14:41 |
sean-k-mooney | slaweq_: is the mtu on the network or on the subnet in neutron? it's generally an aspect of the network in a real deployment, just wondering how it's modeled in neutron | 14:42
sean-k-mooney | neutron does not support having a different mtu per network segment when using routed networks, correct? | 14:42
sean-k-mooney | i have asked that in the review https://review.opendev.org/c/openstack/nova/+/855664/3/nova/virt/netutils.py#b266 | 14:47 |
slaweq_ | sean-k-mooney: mtu is per network for sure | 14:48 |
slaweq_ | I'm not sure about segments in routed networks | 14:48 |
sean-k-mooney | "The net-mtu extension allows plug-ins to expose the MTU that is guaranteed to pass through the data path of the segments in the network." | 14:49 |
sean-k-mooney | that makes it sound like the network mtu should be the maximum mtu that would be supported on all segments | 14:49
sean-k-mooney | ok we are good | 14:50 |
sean-k-mooney | https://docs.openstack.org/api-ref/network/v2/index.html?expanded=show-segment-details-detail#show-segment-details | 14:50 |
sean-k-mooney | the segment does not have an mtu field | 14:50 |
slaweq_ | sean-k-mooney++ | 14:50 |
bauzas | got pulled into internal problems, lemme look at the MTU patch | 14:51
sean-k-mooney | nor does the subnet, so ya, definitely per network; I'll upgrade to +2 so | 14:51
slaweq_ | sean-k-mooney: thx a lot | 14:53 |
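[Editor's note: a small sketch confirming the conclusion above with openstacksdk: the net-mtu extension exposes `mtu` on the network resource, while subnets and segments carry no such field. The cloud and resource names are illustrative.]

```python
import openstack

conn = openstack.connect(cloud='devstack')  # hypothetical clouds.yaml entry

net = conn.network.find_network('private')  # placeholder network name
print(net.mtu)                  # mtu is defined per network

subnet = conn.network.find_subnet('private-subnet')
print(hasattr(subnet, 'mtu'))   # False: subnets have no mtu field
```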
bauzas | slaweq: sent to the gate now the gate is back :) | 15:02 |
gibi | slaweq: I have a question https://review.opendev.org/c/openstack/nova/+/855664/3/nova/virt/netutils.py#274 | 15:02 |
bauzas | gibi: I had the same concern, but eventually I said yes because it's an operator question | 15:06 |
bauzas | gibi: here, that means that we won't provide the MTU in the metadata service if the subnet from the instance is using a dhcp server | 15:07 |
gibi | "subnet from the instance is using a dhcp server" <- but if the network has two subnets one with dhcp and one without dhcp then we need to check the actualy subnet the port uses, not every subnet in the network | 15:08 |
gibi | or do I miss something | 15:08 |
gibi | ? | 15:08 |
gibi | I'm OK to not set MTU if the subnet the port uses has DHCP. But the patch does not implement that | 15:09 |
gibi | that patch does not set MTU if _any_ of the subnets of the network has DHCP | 15:09 |
gibi | not just the one the port uses | 15:09
frickler | I need to double-check but I think for v6 the MTU is signaled by RAs, not dhcp? | 15:10 |
sean-k-mooney | gibi: good catch gibi | 15:11 |
gibi | (I assume that as the dhcp_server is defined per subnet it can be differently configured per subnet of the same network) | 15:11 |
sean-k-mooney | frickler: for ipv6 the mtu is discovered via the neighbour discovery protocol | 15:11
sean-k-mooney | and it should be automatically negotiated regardless of using RA or DHCPv6 | 15:12
opendevreview | Sylvain Bauza proposed openstack/nova master: Don't provide MTU value in metadata service if DHCP is enabled https://review.opendev.org/c/openstack/nova/+/855664 | 15:12 |
sean-k-mooney | RA and DHCPv6 could provide an initial value. | 15:12
bauzas | slaweq: gibi: sean-k-mooney: in order to stop the check pipeline, I created a new revision ^ | 15:12 |
sean-k-mooney | gibi: yes the dhcp option is per subnet not per network | 15:13 |
sean-k-mooney | bauzas: ack | 15:13 |
sean-k-mooney | well you could have just removed the +w | 15:13 |
sean-k-mooney | gibi: actually | 15:14
sean-k-mooney | gibi: is the set of subnets that it is looping over the subnets the port is attached to via the fixed ips it has, or just the ones on the network? | 15:15
sean-k-mooney | i would assume we have not prefiltered them | 15:15
gibi | I don't think we prefilter them | 15:16 |
gibi | but I haven't checked explicitly | 15:16 |
gibi | I assumed it is all the subnets of the network as it is under [network][subnets] | 15:16
sean-k-mooney | so we need to do an intersection between the subnets of the fixed ips and the subnets of the network and then check only those | 15:16 |
sean-k-mooney | gibi: ya that is what i would assume too | 15:16
bauzas | sean-k-mooney: no, you can't just remove the +W if it was running to the gate | 15:18 |
bauzas | even in check pipeline | 15:19 |
bauzas | gibi: we prefilter only if we opt in for routed networks | 15:20 |
sean-k-mooney | bauzas: if it was in check you can, if it was in gate then no | 15:20
bauzas | sean-k-mooney: anyway, this is done, but I'm pretty sure it's the other way | 15:21
bauzas | meh | 15:21 |
sean-k-mooney | so this is not related to routed networks | 15:21 |
bauzas | so, here, we have a list of subnets that's given from a network | 15:21 |
sean-k-mooney | you can have as many subnets on a network as you like to add more ips to the network | 15:21
sean-k-mooney | it's pretty common to just add additional subnets to a network if you run out of ips for ports on that network | 15:22
bauzas | if the instance has an IP address for a subnet that doesn't have a DHCP server running, but other subnets do have DHCP server, then we won't set the MTU | 15:22 |
bauzas | if we merge this one | 15:22 |
bauzas | sean-k-mooney: I know, I'm clarifying the situation | 15:22 |
sean-k-mooney | yes which would be incorrect | 15:22
bauzas | so yeah, agreed, we need to check the fixed IP address subnet, that's it | 15:23 |
bauzas | we even don't need an intersection | 15:23 |
sean-k-mooney | oh | 15:23 |
sean-k-mooney | ya good point | 15:23 |
sean-k-mooney | we don't need the intersection | 15:23
bauzas | and I think my prefilter already does this somewhere | 15:23 |
sean-k-mooney | just loop over the subnets of the fixed ips | 15:23 |
bauzas | not saying we should run the prefilter | 15:24 |
bauzas | but the check is the same | 15:24 |
bauzas | I mean the pattern | 15:24 |
sean-k-mooney | right, just that there might be existing code that can be copied or shared | 15:24
* bauzas needs to go get the kid | 15:24 | |
sean-k-mooney | this https://github.com/openstack/nova/blob/master/nova/scheduler/request_filter.py#L317-L331 | 15:25 |
sean-k-mooney | you are calling neutron there since the info is not in the network info cache at this point | 15:26 |
sean-k-mooney | so that won't actually help | 15:26
sean-k-mooney | but the vif object can give you the fixed ips | 15:27 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/network/model.py#L445-L450 | 15:27 |
sean-k-mooney | looking at that code it might be already filtering | 15:28 |
sean-k-mooney | so the subnets are constructed here | 15:30
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L3330-L3332 | 15:30 |
sean-k-mooney | which internally uses https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L3568-L3637 | 15:31 |
sean-k-mooney | _get_subnets_from_port | 15:31 |
sean-k-mooney | so we only store the subnets for the current port in the info cache | 15:32 |
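[Editor's note: a minimal sketch of the check this thread converges on, assuming nova's network info cache model: each VIF's network carries only the subnets the port has fixed IPs on (per `_get_subnets_from_port`), and a subnet with DHCP enabled records a `dhcp_server` entry in its metadata. This is an illustration of the idea, not the merged patch verbatim.]

```python
def should_advertise_mtu(vif):
    """Advertise the MTU in network_data.json only when no subnet of
    this port runs DHCP."""
    for subnet in vif['network']['subnets']:
        if subnet.get_meta('dhcp_server'):
            # The guest re-learns (and can clamp) the MTU via DHCP
            # option 26 on every lease renewal, so omit it here.
            return False
    return True
```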
bauzas | reminder : nova meeting in 5 mins | 15:54 |
bauzas | -ish | 15:54 |
bauzas | #startmeeting nova | 16:00 |
opendevmeet | Meeting started Tue Nov 29 16:00:07 2022 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:00 |
opendevmeet | The meeting name has been set to 'nova' | 16:00 |
bauzas | hey folks | 16:00 |
auniyal | O/ | 16:00 |
gibi | o/ | 16:00 |
Uggla | o/ | 16:00 |
elodilles | o/ | 16:01 |
bauzas | let me grab a coffee and we start | 16:01 |
bauzas | ok let's start and welcome | 16:02 |
bauzas | #topic Bugs (stuck/critical) | 16:03 |
dansmith | o/ | 16:03 |
bauzas | #info No Critical bug | 16:03 |
sean-k-mooney | o/ | 16:03 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 16 new untriaged bugs (+5 since the last meeting) | 16:03 |
bauzas | #info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster | 16:03 |
bauzas | I know this was a busy week | 16:03 |
bauzas | any bug to discuss ? | 16:03 |
bauzas | (apart from the gate ones) | 16:03 |
bauzas | looks not | 16:04 |
bauzas | elodilles: can you use the baton for the next bugs ? | 16:04 |
elodilles | yepp | 16:04 |
bauzas | cool thanks ! | 16:04 |
bauzas | #info bug baton is being passed to elodilles | 16:04 |
bauzas | #topic Gate status | 16:04 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs | 16:04 |
bauzas | it was a busy week | 16:04 |
bauzas | #info ML thread about the gate blocking issues we had https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031357.html | 16:05 |
bauzas | kudos to the team for the hard work | 16:05 |
bauzas | it looks like the gate is back now | 16:05
bauzas | unfortunately, we had to skip some tests :( | 16:05 |
bauzas | but actually maybe they were not necessary :) | 16:06 |
bauzas | anyway | 16:06 |
bauzas | #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status | 16:06 |
bauzas | #info Please look at the gate failures and file a bug report with the gate-failure tag. | 16:06 |
bauzas | #info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures | 16:06 |
gibi | nah no test is necessary :) only the code needs to work :) | 16:06
opendevreview | Arnaud Morin proposed openstack/nova master: Unbind port when offloading a shelved instance https://review.opendev.org/c/openstack/nova/+/853682 | 16:07 |
bauzas | anything to discuss about the gate ? | 16:07 |
bauzas | #topic Release Planning | 16:07 |
bauzas | #link https://releases.openstack.org/antelope/schedule.html | 16:07 |
bauzas | #info Antelope-2 is in 5 weeks | 16:07 |
bauzas | as a reminder, remember that the last December week(s) you could be off :) | 16:08 |
bauzas | so even if we have 5 weeks until A-2, maybe less for you :) | 16:09 |
sean-k-mooney | ya I don't know if we want to have another review day before then | 16:09
bauzas | should we do another spec review day before end of December, btw ? | 16:09 |
gibi | I vote for something after 13th of Dec :) | 16:09
bauzas | sean-k-mooney: we agreed to have an implementation review day around end of Dec, like Dec 20 | 16:09
sean-k-mooney | well spec or implementation or both | 16:09
sean-k-mooney | that might be a bit late | 16:10 |
bauzas | for implementations, not really | 16:10 |
gibi | more specifically between 14th and 19th | 16:10 |
bauzas | as we only have a deadline for A-3 | 16:10 |
sean-k-mooney | because of vacation | 16:10 |
bauzas | for specs, yes | 16:10 |
bauzas | ah | 16:10 |
bauzas | then, we could do a spec review day on Dec 13th | 16:11 |
gibi | 14 please | 16:11 |
gibi | there is an internal demo on 13th I will be busy with :) | 16:11 |
bauzas | and when should we be doing a implementation review day ? | 16:11 |
sean-k-mooney | ya I'm off from the 19th so 14th-16th for feature review | 16:11
sean-k-mooney | would be ok | 16:11
bauzas | gibi: haha, I don't see what you're saying :p | 16:11 |
gibi | yeah I support what sean-k-mooney proposes 14-16 | 16:12 |
bauzas | gibi: as a reminder, last week, I was discussing here during the meeting while adding some slides for some internal meeting :p | 16:12 |
bauzas | you surely could do the same :D | 16:12 |
gibi | maybe a spec review day on 14th and an impl review day on 15th :) | 16:12 |
bauzas | if people agree to have two upstream days | 16:13
sean-k-mooney | that would work for me | 16:13 |
gibi | bauzas: I'm nowhere near your ability to multitask | 16:13
bauzas | during the same week | 16:13 |
sean-k-mooney | that leaves the 16th to wrap up stuff before pto | 16:13
gibi | yepp | 16:13 |
gibi | (I will be here on 19th too but off from 20th) | 16:13 |
bauzas | gibi: or then I'd prefer to have an implementation day once we're back | 16:13 |
bauzas | not all of us work upstream everyday :) | 16:14 |
sean-k-mooney | we could but the idea was to have 2 of them | 16:14 |
sean-k-mooney | to take the pressure off m3 | 16:14 |
sean-k-mooney | so one before m2 and one before m3 | 16:14 |
gibi | I'm back on the 5th of Jan | 16:14 |
sean-k-mooney | so we could probably do one on the 10th of january | 16:14
sean-k-mooney | most will be around by then | 16:15 |
bauzas | sean-k-mooney: I don't disagree, I'm just advocating that some folks might not be able to have two review days in the same week | 16:15
sean-k-mooney | if we want to keep it aligned to the meeting days | 16:15
gibi | 10th works for me | 16:15 |
bauzas | gibi: we don't really need to align those review days to our meeting | 16:16 |
bauzas | gibi: but this is nice as a reminder | 16:16 |
gibi | so I think we are converging on Dec 14th as spec and Jan 10th as a code review day | 16:17 |
bauzas | I think this works for me | 16:17 |
bauzas | and we can have another implementation review day later after Jan 10th | 16:17 |
sean-k-mooney | ya sure that sounds workable | 16:18
bauzas | as a reminder, Antelope-3 (FF) is Feb 16th | 16:18 |
bauzas | more than 5 weeks after Jan 10th | 16:18 |
sean-k-mooney | there are still a few bits i would hope we can merge by the end of the year however, namely i would like to see us make progress on the pci in placement series | 16:19
bauzas | sure | 16:19 |
sean-k-mooney | ok so i think we can move on for now | 16:19 |
bauzas | what we can do is say we can review some changes by Dec 15th if we want | 16:19
bauzas | that shouldn't be a specific review day, but people would know that *some* folks can review their changes by this day | 16:20 |
bauzas | anyway, I think we found a way | 16:20 |
gibi | yepp | 16:21 |
bauzas | #agreed Dec-14th will be a spec review day and Jan-10th will be an implementation review day, mark your calendars | 16:21 |
bauzas | #action bauzas to send an email about it | 16:21 |
bauzas | #agreed Some nova-cores can review some features changes around Dec 15th, you now know about it | 16:22 |
gibi | :) | 16:22 |
bauzas | OK, that's it | 16:22 |
bauzas | moving on | 16:22 |
bauzas | (sorry, that was a long discussion) | 16:22 |
bauzas | #topic Review priorities | 16:22 |
bauzas | #link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2) | 16:23 |
bauzas | #info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review | 16:23 |
bauzas | I'm happy to see people using it | 16:23 |
bauzas | that's it for that topic | 16:23 |
bauzas | next one | 16:24 |
bauzas | #topic Stable Branches | 16:24 |
bauzas | elodilles: your turn | 16:24 |
elodilles | ack | 16:24 |
elodilles | this will be short | 16:24 |
elodilles | #info stable branches seem to be unblocked / OK | 16:24 |
elodilles | #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci | 16:24 |
elodilles | that's it | 16:24 |
gibi | nice | 16:25 |
bauzas | was quick and awesome | 16:26 |
bauzas | last topic but not the least in theory, | 16:26 |
bauzas | #topic Open discussion | 16:26 |
bauzas | nothing in the wikipage | 16:26 |
bauzas | so | 16:26 |
bauzas | anything to discuss here by now ? | 16:27 |
gibi | - | 16:27 |
sean-k-mooney | did you merge skipping the failing nova-lvm tests yet | 16:27
sean-k-mooney | or is the master gate still exploding on that | 16:27
bauzas | I think yesterday we said we could discuss during this meeting about the test skips | 16:27 |
bauzas | but given we merged gmann's patch, the ship has sailed | 16:27 |
sean-k-mooney | ack | 16:27 |
sean-k-mooney | so they are disabled currently | 16:28
bauzas | sean-k-mooney: see my ML thread above ^ | 16:28 |
sean-k-mooney | the failing detach tests | 16:28 |
sean-k-mooney | ah ok will check after meeting | 16:28 |
bauzas | sean-k-mooney: you'll get the link to the gerrit change | 16:28 |
sean-k-mooney | nothing else form me | 16:28 |
auniyal | hand-raise: frequent zuul timeout issues/fails - this seems to be a resource issue, is it possible zuul resources can be increased? | 16:28
bauzas | sean-k-mooney: tl;dr: yes we skipped the related tests but maybe they are actually not needed as you said | 16:29 |
sean-k-mooney | auniyal: not really, timeouts are not that common in our jobs | 16:29
bauzas | auniyal: see what I said above, we had problems with the gate very recently | 16:29 |
sean-k-mooney | auniyal: do you have an example | 16:29 |
auniyal | in the morning when there are fewer jobs running, if we run the same job it passes | 16:29
auniyal | like fewer than 20 | 16:29
auniyal | right now 60 jobs are running | 16:29 |
sean-k-mooney | that should not really be a thing | 16:29 |
sean-k-mooney | unless we have issues with our ci providers | 16:30 |
bauzas | auniyal: if you speak about job results telling timeouts, agreed with sean-k-mooney, you should tell which ones so we could investigate | 16:30 |
sean-k-mooney | we occasionally have issues with slow providers but it's not normally correlated with the number of running jobs | 16:30
bauzas | yup | 16:30 |
auniyal | ack | 16:30 |
bauzas | timeouts are generally an infra issue | 16:30 |
bauzas | from a ci provider | 16:30 |
bauzas | but "generally" | 16:30 |
bauzas | which means sometimes we may have a larger problem | 16:31 |
sean-k-mooney | auniyal: do you have a gerrit link to a change where it happened | 16:31
dansmith | are they fips jobs? | 16:31 |
clarkb | bauzas: I'm not sure I agree with that statement | 16:31 |
sean-k-mooney | oh ya it could be that, did we add the extra 30 mins to the job yet? | 16:31
clarkb | we have significant amounts of very inefficient test payload | 16:31 |
clarkb | yes slow providers make that worse, but we have lots of ability to improve things in the jobs just about every time I look | 16:31 |
sean-k-mooney | clarkb: we don't often see timeouts in the jobs that run on the nova gate | 16:32
sean-k-mooney | we tend to be well within the job timeout interval | 16:32
sean-k-mooney | that is not necessarily the same for other projects | 16:32
clarkb | (it is common for tempest jobs to dig into swap which slows everything down, devstack uses osc which is super slow because it gets a new token for every request and has python spin up time, ansible loops are costly with large numbers of entries and so on) | 16:32
auniyal | sean, I am trying to find a link but it's taking time | 16:32
clarkb | sean-k-mooney: yes swap is a common cause for the difference in behaviors and that isn't an infra issue | 16:33 |
clarkb | sean-k-mooney: and devstack runtime could be ~halved if we stopped using osc | 16:33 |
clarkb | or improved osc's startup and token acquisition time | 16:33 |
sean-k-mooney | clarkb: ack | 16:33 |
clarkb | I just want to avoid the idea its an infra issue so ignore it | 16:33 |
sean-k-mooney | ya the osc thing is a long-running known issue | 16:33
clarkb | this assertion gets made often, then I go looking and there is plenty of job payload that is just slow | 16:33
sean-k-mooney | the parallel improvements dansmith did helped indirectly | 16:33
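[Editor's note: a sketch of the token-reuse point clarkb makes about osc: each CLI invocation pays interpreter startup plus a fresh keystone token, while a long-lived keystoneauth1 session authenticates once and reuses the token for every call. Endpoints and credentials are illustrative, not from the log.]

```python
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(
    auth_url='http://keystone.example:5000/v3',  # hypothetical endpoint
    username='demo', password='secret', project_name='demo',
    user_domain_name='Default', project_domain_name='Default')
sess = session.Session(auth=auth)

# One token fetch up front, then each call is a single HTTP round trip,
# instead of interpreter startup + authentication per CLI command.
for name in ('net-a', 'net-b'):
    sess.get('http://neutron.example:9696/v2.0/networks',
             params={'name': name})
```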
bauzas | clarkb: sorry, I was unclear | 16:33 |
auniyal | although I have experienced this a lot: if my zuul jobs are not passing at night time (IST), even after a recheck, I run them in the morning and then they pass | 16:33
bauzas | clarkb: I wasn't advocating about someone else's fault | 16:34 |
bauzas | clarkb: I was just explaining to some new nova contributor that given the current situation, we only have timeouts with nova jobs due to some ci provider issue | 16:34 |
clarkb | bauzas: right I disagree with that | 16:35 |
bauzas | clarkb: but I agree with you on some jobs that are wasting resources | 16:35 |
clarkb | jobs timeout due to an accumulation of slow steps | 16:35 |
clarkb | some of those may be due to a slow provider or slow instance | 16:35 |
clarkb | but, it is extremely rare that this is the only problem | 16:35 |
sean-k-mooney | clarkb: we tend to be seeing an average runtime at about 75% or less of the job timeout in my experience | 16:35
clarkb | and I know nova tempest jobs have a large number of other slowness problems | 16:35 |
clarkb | sean-k-mooney: yes, but if a job digs deeply into swap its all downhill from there | 16:35 |
bauzas | clarkb: that's a fair point | 16:35 |
sean-k-mooney | we have 2 hour timeouts on our tempest jobs and we rarely go above about 90 mins | 16:35
clarkb | suddenly your 75% typical runtime can balloon to 200% | 16:36 |
bauzas | except the fips one | 16:36 |
sean-k-mooney | clarkb: sure but i don't think we are | 16:36
sean-k-mooney | but it's something we can look at | 16:36
sean-k-mooney | auniyal: the best thing you can do is provide us an example and we can look into it | 16:36 |
clarkb | ++ to looking at it | 16:36 |
sean-k-mooney | and then see if there is a trend | 16:36 |
auniyal | ack | 16:37 |
bauzas | I actually wonder how we can track the trend | 16:38 |
sean-k-mooney | https://zuul.openstack.org/builds?project=openstack%2Fnova&result=TIMED_OUT&skip=0 | 16:38 |
sean-k-mooney | that, but it's currently loading | 16:38
sean-k-mooney | we have a couple every few days | 16:39 |
bauzas | sure, but you don't have the time a SUCCESS job runs | 16:39 |
bauzas | which is what we should track | 16:39 |
clarkb | you can show both success and timeouts in a listing | 16:39 |
clarkb | (and failures, etc) | 16:39 |
sean-k-mooney | well we can change the result to filter both | 16:39 |
bauzas | the duration field, shit, missed it | 16:39 |
sean-k-mooney | we also haven't fixed the fips job | 16:40
sean-k-mooney | I'll create a patch for that i think | 16:40
bauzas | sean-k-mooney: I said I should do it | 16:40 |
sean-k-mooney | bauzas: ok please do | 16:40 |
bauzas | sean-k-mooney: that's simple to do and it's been like 4 weeks since I promised it | 16:40
bauzas | sean-k-mooney: you know what ? I'll end this meeting by now so everyone can do what they want, including me writing a zuul patch :) | 16:41 |
sean-k-mooney | :) | 16:41 |
bauzas | having said it, | 16:41 |
bauzas | thanks folks | 16:41 |
bauzas | #endmeeting | 16:41 |
opendevmeet | Meeting ended Tue Nov 29 16:41:43 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:41 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.html | 16:41 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.txt | 16:41 |
opendevmeet | Log: https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.log.html | 16:41 |
gibi | o/ | 16:42 |
chateaulav | o/ | 16:42 |
elodilles | o/ | 16:42 |
* bauzas says again how much he enjoys zuul v3 web interface and how easy you can get a specific job's details | 16:44 | |
bauzas | like, https://zuul.openstack.org/job/tempest-centos9-stream-fips | 16:44 |
clarkb | sean-k-mooney: one thing I've found is fairly consistent for our systemic job timeouts is that we've got a number of steps to each job: base setup (configuring mirrors, configuring ssh keys, configuring git repos), test env setup (tox/devstack/whatever), actual testing, and log collection. Each of these tends to have unnecessary slow bits that add up over the course of a job. | 16:48 |
clarkb | When you then run into something like a slower node or swapping or slowness fetching an external resource it is very easy to tip over the timeout | 16:48 |
clarkb | sean-k-mooney: imo it would be helpful for us to try and whittle away at that accumulated slowness tech debt to make us less susceptible when we run into an overall slower situation. | 16:49 |
clarkb | I've worked on that a bit in the base jobs and log collection side of things as that has broad impact. But the downsides here are that it has broad impact so I have to be extremely careful to maintain backward compatibility | 16:49 |
clarkb | but the same approaches can be taken to improve things like devstack (did you know it installs tempest 3 times!) | 16:49 |
clarkb | I think improving memory consumption would also help avoid slowness caused by swapping. privsep is a fairly outsized offender here | 16:50 |
bauzas | clarkb: I'm curious about privsep being memory greedy | 16:53 |
bauzas | and I wonder why | 16:53 |
clarkb | I suspect because it grows buffers to handle all the input and output sent through it | 16:53 |
bauzas | our internal customers haven't reported such a problem, but I guess because of lack of evidence rather than not having a problem | 16:54
clarkb | one way to maybe improve things is to stop running a different privsep for each service which creates a bunch of large buffers. We might be able to get away with one large buffer | 16:54
clarkb | or buffer things with intentionally smaller buffers | 16:54 |
bauzas | agreed | 16:54 |
bauzas | a stream is costly | 16:54 |
sean-k-mooney | clarkb: yep although we have enough fo a buffer in the nova project that we get a time out failure only 1 or twice a week | 16:57 |
clarkb | sean-k-mooney: yes, but you've also set your timeout to two hours | 16:57 |
clarkb | (one hour was the goal once upon a time) | 16:57 |
sean-k-mooney | ack | 16:57 |
sean-k-mooney | sharing privsep is a security issue | 16:58
sean-k-mooney | so i don't think we can ever do that | 16:58
sean-k-mooney | nova will have more privsep daemons eventually | 16:58
clarkb | does privsep prevent random processes from connecting to it? If not this is equivalent. If so it could also apply restrictions on what a specific process can do (granted this is maybe a larger attack surface than refusing to talk at all) | 16:59 |
sean-k-mooney | you are meant to use file permissions to limit access at the file system level | 16:59
sean-k-mooney | but no | 16:59 |
sean-k-mooney | not as part of privsep itself | 17:00 |
sean-k-mooney | clarkb: you need one privsep daemon per privsep context currently | 17:00
sean-k-mooney | for it to properly provide the correct permission enforcement/escalation | 17:00
clarkb | sean-k-mooney: in that case my suggestion would be to investigate optimizing privsep memory usage | 17:00 |
sean-k-mooney | we might be able to reduce it in ci | 17:01 |
sean-k-mooney | by limiting it to one process | 17:01 |
clarkb | or just bound your buffers | 17:01 |
clarkb | and read in chunks | 17:01 |
sean-k-mooney | maybe, i haven't really looked at the channel implementation closely | 17:01
clarkb | I suspect this is a case of python makes it easy to read arbitrary-sized buffers into memory without much fuss | 17:02
clarkb | it might also be inefficient compilation of the ruleset | 17:02 |
clarkb | (regexes aren't free either) | 17:02 |
sean-k-mooney | the implementation is here https://github.com/openstack/oslo.privsep/blob/83870bd2655f3250bb5d5aed7c9865ba0b5e4770/oslo_privsep/comm.py | 17:02
sean-k-mooney | self.writesock.sendall(buf) | 17:03 |
sean-k-mooney | so it's taking the serialised payload and sending it | 17:03
sean-k-mooney | using the msgpack format for serialisation | 17:04 |
sean-k-mooney | it's using 4k buffers https://github.com/openstack/oslo.privsep/blob/83870bd2655f3250bb5d5aed7c9865ba0b5e4770/oslo_privsep/comm.py#L81 | 17:05
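[Editor's note: a sketch of bounded, incremental reads in the spirit of the comm layer linked above; it is not oslo.privsep's actual code. A streaming msgpack Unpacker lets the receiver process messages chunk by chunk rather than growing one large buffer.]

```python
import msgpack

CHUNK = 4096  # mirrors the 4k read size linked above


def iter_messages(sock):
    """Yield complete msgpack messages from a socket using fixed-size
    reads, so no single allocation has to hold a whole large payload."""
    unpacker = msgpack.Unpacker(raw=False)
    while True:
        chunk = sock.recv(CHUNK)
        if not chunk:
            return
        unpacker.feed(chunk)
        for msg in unpacker:  # yields each fully received message
            yield msg
```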
sean-k-mooney | clarkb: honestly i have looked at privsep a couple of times but don't have enough context of the code to have a feel for how much memory it should be using and if it's bounded or not | 17:08
dansmith | also not sure what we might be calling via privsep that would return large buffers | 17:09 |
sean-k-mooney | clarkb: but i suspect that if we are using it for any file operations then it might need to process guest images | 17:09
dansmith | it's mostly for doing things and maybe pulling the qemu-img info blob | 17:09 |
bauzas | sean-k-mooney: https://review.opendev.org/c/openstack/tempest/+/866049 | 17:09 |
sean-k-mooney | dansmith: the console log or writing images to disk would be the two that come to mind | 17:09
dansmith | I don't think we write images to disk via privsep. console log.. maybe? I thought we can get that via libvirt | 17:10 |
clarkb | ya I'm not sure. It's just that in aggregate privsep uses more memory than most other openstack services | 17:10
sean-k-mooney | dansmith: no, we read the file that is written to disk, I'm pretty sure | 17:10
clarkb | I think cinder? and neutron use more, then privsep is next. It's been a while since I looked at the memory profiling though | 17:10
bauzas | I don't know if we could somehow pdb the running privsep process thru a backdoor, because if we could, like we do with nova services, then we could monitor the growing memory | 17:11 |
sean-k-mooney | dansmith: for the images, i would hope we write them to somewhere we own, then move them and change the permissions if needed | 17:11
bauzas | I personally use tracemalloc to persist the memory state and compare between snapshots | 17:11
sean-k-mooney | like in most cases i would expect nova to put it in the image cache then use qemu-img to create a qcow with the image as the backing file | 17:11
bauzas | but this requires access to the process | 17:11 |
sean-k-mooney | so privsep should only be needed for invoking qemu-img and not the actual image download | 17:12
sean-k-mooney | but not sure about the same codepath for raw images | 17:12
sean-k-mooney | bauzas: you could do that locally but i don't think privsep uses eventlet | 17:13
bauzas | sean-k-mooney: afaik it doesn't | 17:13 |
dansmith | sean-k-mooney: we can't possibly be streaming multiple gigs of image data over the privsep socket | 17:13 |
sean-k-mooney | dansmith: ya that would not make sense to me either | 17:13
sean-k-mooney | i know the console log is streamed over it as we had a security issue with that in the past | 17:14
sean-k-mooney | but that is small | 17:14 |
dansmith | well, that might be large, but not tens of gigs | 17:15 |
dansmith | and still not sure why we're doing that really, but AFAIK nova has to eat the whole thing and return it over RPC anyway, right? | 17:15 |
sean-k-mooney | in production yes; in ci the vms don't run that long so i would be surprised if we ever got close to a mb | 17:15
dansmith | oh sure | 17:16 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/config.py#L78-L80 | 17:17 |
sean-k-mooney | so in production we drop privsep back to info | 17:17
clarkb | Looking at https://6546e150b851cf607d58-90bcb133e16a34d804e19428ec130682.ssl.cf5.rackcdn.com/865479/1/gate/tempest-full-py3/9076120/controller/logs/screen-memory_tracker.txt there are three different nova-cpu privsep processes which together use about 150MB of rss. Similar story for neutron ovn metadata, but it's ~250MB rss | 17:17
sean-k-mooney | i'm not sure if we override that in ci | 17:17
sean-k-mooney | clarkb: two of the privsep processes i can account for | 17:18
sean-k-mooney | nova has one privsep context for everything and os-vif has one each for the linux bridge and ovs plugins, but i only expect 1 of those two to be created in any given ci job | 17:19
clarkb | sean-k-mooney: why do we need multiple processes per service though? | 17:19
sean-k-mooney | we need one per security context | 17:19 |
sean-k-mooney | and we should have multiple security contexts per service | 17:19
clarkb | sean-k-mooney: but if your security control method is file permissions (and maybe selinux rules) then you can only limit by pid user or pid selinux context | 17:19
clarkb | you aren't any more secure if you have three different processes that nova cpu can talk to | 17:20 |
sean-k-mooney | privsep drops permissions to only those in the context when it spawns | 17:20
sean-k-mooney | clarkb: if you have multiple contexts you can limit the permissions of each function | 17:20
clarkb | sean-k-mooney: but if the same entity can tell each of those three to do the work then functionally it's the same context | 17:21
sean-k-mooney | that is the permission model that privsep provides and the big delta from rootwrap | 17:21
clarkb | you're limited by what you can limit access to on the filesystem | 17:21 |
clarkb | and if I have permission to talk to one I have permission to talk to all since a process shares those attributes | 17:21
clarkb | and if I somehow break that security layer to promote myself to nova-cpu user or selinux context I can talk to all of them | 17:22 |
sean-k-mooney | you have to decorate the functions with the context they can use | 17:22
clarkb | right but does that give the system any additional security since it sounds like we are using filesystem controls? I think not | 17:22
clarkb | the level of scrutiny is process attribute (user, selinux context etc) | 17:22 |
sean-k-mooney | we have had this discussion multiple times | 17:22
sean-k-mooney | privsep is about limiting the permissions of individual function calls | 17:23
sean-k-mooney | it is not intended to address all security concerns | 17:23
sean-k-mooney | in many respects rootwrap was more secure | 17:24
sean-k-mooney | but it was unmaintained and slow | 17:24 |
clarkb | ok, the result is that ~5% of system memory on a tempest job run is consumed by privsep for nova cpu and neutron ovn metadata | 17:24
sean-k-mooney | the neutron ovn metadata one is kind of interesting, it will need CAP_NET_ADMIN for setting the iptables rules to nat the traffic | 17:25
sean-k-mooney | but it should not need much memory for that | 17:25
bauzas | I remember some internal customer bug about the metadata service being greedy | 17:25 |
clarkb | nova cpu uses about 200MB of memory compared to 150 for privsep for nova cpu | 17:25 |
clarkb | as a comparison | 17:25 |
clarkb | it essentially doubles the memory footprint of nova cpu | 17:25 |
bauzas | we tried to look into it, but given the issue was not reproducible after a reboot, we were unable to conclude | 17:26
sean-k-mooney | yep it has to load many of the same nova dependencies | 17:26
sean-k-mooney | it is running part of the nova codebase after all | 17:26
sean-k-mooney | bauzas: the customer issue was because they had 0 swap | 17:27
clarkb | sure, but that is why I remember it being outsized. Seems like the current memory tracking continues to show this | 17:27 |
sean-k-mooney | so the python interpreter was using more memory than when swap was available | 17:27
clarkb | mysqld then journald are the top consumers. No surprise there given how mysql uses memory and the quantity of logs we are dumping into the journal | 17:28 |
sean-k-mooney | i wonder if there is any way to have privsep free memory | 17:28
sean-k-mooney | i don't know if it suffers from the same fragmentation thing that cinder? glance? had | 17:28
bauzas | sean-k-mooney: no, unrelated case | 17:29 |
clarkb | looks like cinder-backup is also running in the base job. I thought there was no testing of it in the base jobs and we were removing it | 17:29
sean-k-mooney | oh sorry, metadata not nova-compute | 17:29
sean-k-mooney | bauzas: ya metadata memory usage is variable | 17:30
sean-k-mooney | we need to build the full metadata response in memory for every request | 17:30
sean-k-mooney | which is why we store it in memcache | 17:30 |
bauzas | anyway, I need to quit, my wife feels ill and I need to visit the drugstore | 17:31 |
sean-k-mooney | i'm not sure exactly how we handle user-data and if that is included in the cached value or if we stream that separately | 17:31
clarkb | in that example job we have about 100MB free memory when the log file is collected | 17:31 |
sean-k-mooney | the user-data file can be up to 64mb | 17:32 |
clarkb | and we are using about 600MB of swap if I read that correctly | 17:32
clarkb | (out of about 1GB of swap total) | 17:32 |
clarkb | other than mysqld there isn't any single process getting close to double digit memory consumption. This is death by a thousand cuts (which makes sense given microservices etc) | 17:41
dansmith | I'm pretty sure "death by a thousand cuts" is the actual marketing tagline of the microservice approach :) | 17:42 |
clarkb | oh wait, memavailable is distinct from memfree. It's closer to 750MB of memory available. That isn't too bad. I wonder why we're so deep into swap then | 17:44
sean-k-mooney | free is actually unallocated | 17:49
sean-k-mooney | available includes caches and maybe buffers | 17:50
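[Editor's note: a short illustration of the MemFree/MemAvailable distinction just mentioned, reading /proc/meminfo directly.]

```python
# MemFree counts truly unallocated pages; MemAvailable estimates what
# could be reclaimed (page cache, buffers) without swapping, which is
# why the two numbers diverge in the job's memory_tracker log.
def meminfo():
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            fields[key] = int(rest.split()[0])  # values are in kB
    return fields

m = meminfo()
print('MemFree:      %6d MiB' % (m['MemFree'] // 1024))
print('MemAvailable: %6d MiB' % (m['MemAvailable'] // 1024))
```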
dansmith | sean-k-mooney: bauzas if you're still around, have you seen anything from rajat about changing nova to use a service account to look at image locations? | 19:25 |
opendevreview | Merged openstack/nova-specs master: spec: allowing target state for evacuate https://review.opendev.org/c/openstack/nova-specs/+/857838 | 19:52 |
sean-k-mooney | dansmith: i have not seen that but i don't think nova would need to do anything, would it | 20:44
sean-k-mooney | the account we have in the glance section just needs to have the service role | 20:45 |
dansmith | sean-k-mooney: we have no such account | 20:45 |
sean-k-mooney | oh ok because we currently dont use admin for this | 20:45 |
dansmith | right, everything we do with glance is with the user's token | 20:46 |
dansmith | this is a pretty fundamental change from that | 20:46 |
sean-k-mooney | ya ok i'm not sure if the service_user stuff would help there, although that's really just for working around token expiration | 20:47
dansmith | sean-k-mooney: it would require it | 20:48 |
sean-k-mooney | this wouldn't really be that large a change in nova; either a specless blueprint or a short spec i guess | 20:48
dansmith | sean-k-mooney: maybe just read the spec :) | 20:48 |
sean-k-mooney | oh we have a spec already then sure | 20:48 |
dansmith | sean-k-mooney: no, the spec on the glance side I copied you on | 20:49 |
sean-k-mooney | the current service user config we have is not for that however | 20:49 |
dansmith | no spec on the nova side, | 20:49 |
sean-k-mooney | ah right, i saw the email, haven't looked at it yet | 20:49
sean-k-mooney | https://review.opendev.org/c/openstack/glance-specs/+/863209 | 20:49 |
sean-k-mooney | the new locations api spec | 20:49 |
dansmith | but since everything in nova assumes that the user's token is used for glance interaction, I just want to be super careful that we don't accidentally use the service account for anything related to the image other than the ceph location thing | 20:49 |
sean-k-mooney | ya so the same way we have for getting an admin client for neutron, we likely need to add a get_service_client function or similar and use it for that call explicitly | 20:50
dansmith | right | 20:51 |
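[Editor's note: a hedged sketch of the `get_service_client`-style helper floated here, using keystoneauth1's config loading; the `glance_service_user` config group is hypothetical, not an existing nova option. The point is that this session would be used only for the image-location call, while every other glance interaction keeps the user's token.]

```python
from keystoneauth1 import loading as ks_loading

_SERVICE_AUTH = None


def get_glance_service_session(conf):
    """Session holding the service account's own token, used solely
    for the new image-location API call."""
    global _SERVICE_AUTH
    if _SERVICE_AUTH is None:
        _SERVICE_AUTH = ks_loading.load_auth_from_conf_options(
            conf, 'glance_service_user')  # hypothetical config group
    return ks_loading.load_session_from_conf_options(
        conf, 'glance_service_user', auth=_SERVICE_AUTH)
```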
dansmith | we discussed this earlier related to the dual internal/external glance thing was brought up | 20:52 |
dansmith | and I thought you had a good reason for why there's a gotcha there, but I don't remember what it was | 20:52 |
sean-k-mooney | the internal endpoint is also used to provide unmetered access to the api for tenant workloads | 20:53
sean-k-mooney | in public clouds | 20:53 |
sean-k-mooney | so really it would be nice if keystone added a service endpoint | 20:53
sean-k-mooney | for service to service communications | 20:53
dansmith | yeah, that's unrelated to this | 20:54 |
sean-k-mooney | although if this required the service user | 20:54
dansmith | I mean, this is to avoid needing that | 20:54 |
sean-k-mooney | right if we have the service role not user | 20:54 |
sean-k-mooney | then we can just filter the field based on the role | 20:54
sean-k-mooney | like we do with server show | 20:54 |
sean-k-mooney | so only show the image location if the token has the service role | 20:55 |
sean-k-mooney | that would be the nicer way to do this | 20:55 |
dansmith | you should read the spec | 20:55 |
sean-k-mooney | you said the policy/rbac stuff in glance has only recently been made capable of supporting something like that, right | 20:55
sean-k-mooney | sure, I'll add it to my list for tomorrow | 20:56
sean-k-mooney | i was just back briefly to check on something | 20:56
sean-k-mooney | skimming it: without the nova changes, if this was unconditionally added to glance | 20:57
sean-k-mooney | it would silently disable the fast clone support | 20:58
sean-k-mooney | and even then it would break the grenade upgrade rules if we did not do the upgrade carefully | 20:58
sean-k-mooney | i.e. you should not need to change the config when you upgrade | 20:59
dansmith | it's a new api, so they'll have to support the old one for a while, and I commented on that in the spec that it needs to hang around for a good while | 20:59 |
sean-k-mooney | ack | 21:00 |
sean-k-mooney | so they are not just doing the filtering on the old one | 21:00
sean-k-mooney | ya ok there is no point in me speculating on this until i have had time to read the spec fully, thanks for highlighting it | 21:00
dansmith | I just want to make sure that nova people are aware of when glance people say "we'll just change nova to do X" with no planning on this side, and potentially nobody signing up to do it or review it | 21:03 |
*** dasm is now known as dasm|off | 21:39 | |
opendevreview | Merged openstack/nova master: Handle mdev devices in libvirt 7.7+ https://review.opendev.org/c/openstack/nova/+/838976 | 21:52 |
opendevreview | Merged openstack/nova master: Don't provide MTU value in metadata service if DHCP is enabled https://review.opendev.org/c/openstack/nova/+/855664 | 22:02 |
opendevreview | Merged openstack/nova master: extend_volume of libvirt/volume/fc should not use device_path https://review.opendev.org/c/openstack/nova/+/858129 | 22:02 |
opendevreview | melanie witt proposed openstack/nova stable/xena: Retry attachment delete API call for 504 Gateway Timeout https://review.opendev.org/c/openstack/nova/+/866083 | 22:54 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Fix the wrong exception used to retry detach API calls https://review.opendev.org/c/openstack/nova/+/866084 | 22:57 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Retry attachment delete API call for 504 Gateway Timeout https://review.opendev.org/c/openstack/nova/+/866085 | 22:57 |
opendevreview | melanie witt proposed openstack/nova stable/victoria: Fix the wrong exception used to retry detach API calls https://review.opendev.org/c/openstack/nova/+/866086 | 23:09 |
opendevreview | melanie witt proposed openstack/nova stable/victoria: Retry attachment delete API call for 504 Gateway Timeout https://review.opendev.org/c/openstack/nova/+/866087 | 23:09 |
opendevreview | melanie witt proposed openstack/nova stable/ussuri: Fix the wrong exception used to retry detach API calls https://review.opendev.org/c/openstack/nova/+/866088 | 23:16 |
opendevreview | melanie witt proposed openstack/nova stable/ussuri: Retry attachment delete API call for 504 Gateway Timeout https://review.opendev.org/c/openstack/nova/+/866089 | 23:16 |
opendevreview | melanie witt proposed openstack/nova stable/train: Fix the wrong exception used to retry detach API calls https://review.opendev.org/c/openstack/nova/+/866090 | 23:19 |
opendevreview | melanie witt proposed openstack/nova stable/train: Retry attachment delete API call for 504 Gateway Timeout https://review.opendev.org/c/openstack/nova/+/866091 | 23:19 |
clarkb | sean-k-mooney: bauzas: not urgent, but the other piece of info that is probably worth remembering for slow jobs is that we are our own noisy neighbor in some of these clouds. This means our own inefficiencies add up across jobs too, not just within them. | 23:44