brinzhang | bauzas: I agree with gibi and sean-k-mooney, I have no difficulty discussing it, thanks | 00:29 |
---|---|---|
*** ganso_ is now known as ganso | 02:29 | |
*** viks___ is now known as viks__ | 02:31 | |
gibi | morning | 07:25 |
bauzas | good morning | 08:00 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 08:09 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 08:13 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 08:27 |
-opendevstatus- NOTICE: zuul needed to be restarted, queues were lost, you may need to recheck your changes | 08:47 | |
stephenfin | bauzas: Morning o/ Care to look at https://review.opendev.org/c/openstack/nova/+/814547 | 08:56 |
bauzas | stephenfin: ok thanks for finding it ! | 08:56 |
bauzas | stephenfin: just a thought, you're not explaining in https://review.opendev.org/c/openstack/nova/+/814547/1//COMMIT_MSG why we now have a regression | 08:58 |
stephenfin | oh, sorry, the regression is because I removed I6ce930fa86c82da1008089791942b1fff7d04c18 | 08:59 |
stephenfin | I mention that at the end of the commit message. It's kind of implicit though, admittedly | 08:59 |
stephenfin | I thought I'd fixed the issue that made I6ce930fa86c82da1008089791942b1fff7d04c18 necessary. Evidently not :( | 09:00 |
opendevreview | Rajat Dhasmana proposed openstack/nova-specs master: Add spec for volume backed server rebuild https://review.opendev.org/c/openstack/nova-specs/+/809621 | 09:00 |
bauzas | stephenfin: ok, then I'll leave a comment telling it and then I'll approve | 09:10 |
bauzas | done. | 09:12 |
stephenfin | ty | 09:36 |
opendevreview | Merged openstack/nova master: db: Increase timeout for migration tests https://review.opendev.org/c/openstack/nova/+/814547 | 09:40 |
opendevreview | Wenping Song proposed openstack/nova master: Support concurrently add hosts to aggregates https://review.opendev.org/c/openstack/nova/+/815105 | 10:04 |
gibi | sean-k-mooney[m]: do I understand correctly that neutron's sriov-nic-agent only sends plugtime plug/unplug events for vnic_type=direct ports but not for vnic_type=direct-physical ports | 10:16 |
gibi | ? | 10:16 |
gibi | sean-k-mooney[m]: https://github.com/openstack/neutron/blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L135-L137 | 10:18 |
sean-k-mooney[m] | hmm, that is a good question. i guess that would be the case, yes, since for PFs the agent does not configure anything; anything it did would be undone when we detach the device from the host kernel and attach it to the guest | 10:20 |
sean-k-mooney[m] | i have never actually checked its behavior in that regard | 10:21 |
gibi | in my local env I see plug/unplug events during nova hard reboot for VF ports but not for PF ports so probably this is the case | 10:22 |
sean-k-mooney[m] | yes, so you might need to make an exception in your workaround patch | 10:23 |
sean-k-mooney[m] | perhaps change it from a boolean to a list of vnic_types | 10:23 |
sean-k-mooney[m] | odl only supports vnic_type normal and vhost_user | 10:24 |
sean-k-mooney[m] | well vhost-user | 10:24 |
gibi | I think the doc in the patch is still correct when we say to set the flag only for ml2/ovs or networking-odl | 10:24 |
gibi | I might extend that with mech_sriov + vnic_type direct | 10:25 |
sean-k-mooney[m] | right but if you filter by vnic type you can use it when you have odl and sriov on the same host | 10:25 |
gibi | yeah, I can ignore direct-physical ports when waiting for plug | 10:26 |
sean-k-mooney[m] | ya i guess that also works | 10:26 |
gibi | sean-k-mooney[m]: what would be your way to filter? | 10:26 |
sean-k-mooney[m] | if we make the config option a list of vnic_types to wait for on hard reboot we just do | 10:28 |
sean-k-mooney[m] | if vif.vnic_type in CONF.wait_on_reboot: … | 10:28 |
gibi | hm yeah that is also a way | 10:29 |
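The filter sean-k-mooney sketches above could look roughly like the following. This is a hedged illustration only: the option name `wait_on_reboot` is the one used in the chat, while the helper name and the dict-shaped vif objects are assumptions, not nova's actual data model.

```python
# Hypothetical sketch of the opt-in allow-list filter discussed above.
# "wait_on_reboot" is the config name from the chat; everything else
# (function name, vif representation) is illustrative.

DEFAULT_WAIT_VNIC_TYPES = []  # empty by default, so the workaround stays off


def vifs_to_wait_for(vifs, wait_vnic_types=None):
    """Return the VIFs whose plug-time events should be awaited on reboot."""
    allowed = set(
        wait_vnic_types if wait_vnic_types is not None
        else DEFAULT_WAIT_VNIC_TYPES
    )
    return [vif for vif in vifs if vif["vnic_type"] in allowed]
```

With an operator setting of, say, `["normal", "direct"]`, a PF (`direct-physical`) port is simply skipped, which is what makes ODL and SR-IOV workable on the same host as suggested above.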
sean-k-mooney[m] | im not sure if we need to have different behavior for other vnic types like the cyborg ones | 10:29 |
sean-k-mooney[m] | or baremetal, though that is used only by ironic | 10:30 |
sean-k-mooney[m] | i would have to look at the spec again, but when we are using cyborg provided smart nics neutron still sends the events right? | 10:31 |
gibi | hm, neutron seems to support accelerator-direct with mech_sriov, and direct means a VF, so I assume there are plug time events | 10:35 |
sean-k-mooney[m] | i would assume so too but it's not actually mentioned in https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/sriov-smartnic-support.html | 10:35 |
sean-k-mooney[m] | gibi: so you could either limit this to the cases we know work (normal, direct, vhost-user) or you could filter out the cases we know won't work (direct-physical, accelerator-direct-physical) | 10:38 |
gibi | yeah | 10:38 |
gibi | as the config today needs to be opt-in | 10:38 |
gibi | I guess opting in to supported vnic types is better | 10:38 |
gibi | when you say vhost-user why is not enough to simply filter for vnic_type direct? | 10:39 |
sean-k-mooney[m] | in that case the list would be normal, direct, macvtap, accelerator-direct, vhost-user | 10:40 |
sean-k-mooney[m] | well, direct should work, right | 10:40 |
sean-k-mooney[m] | hardware offloaded ovs with ml2/ovs supports direct and will send plug time events, and the sriov nic agent should also | 10:40 |
sean-k-mooney[m] | and vhost-user should also work with ml2/ovs and ml2/odl | 10:41 |
sean-k-mooney[m] | the sriov nic agent should support macvtap plug time events | 10:41 |
sean-k-mooney[m] | ml2/ovs should also send them for vdpa | 10:42 |
gibi | for the deployer probably it is easier to just list vnic_types and not go into details like vhost-user | 10:42 |
sean-k-mooney[m] | vhost-user is a vnic type | 10:42 |
gibi | sean-k-mooney[m]: is it? | 10:42 |
sean-k-mooney[m] | yes | 10:42 |
sean-k-mooney[m] | its not a vif_type | 10:42 |
gibi | blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L164 | 10:43 |
gibi | sorry | 10:43 |
gibi | wrong buffer | 10:43 |
gibi | https://github.com/openstack/neutron-lib/blob/f01b2e9025d33aeff3bf22ea2568bda036878819/neutron_lib/api/definitions/portbindings.py#L131 | 10:43 |
sean-k-mooney[m] | so i think if you want to hard code it just filter out direct-physical, baremetal and accelerator-direct-physical | 10:43 |
gibi | so here are the vnic_types | 10:43 |
gibi | I don't see vhost-user as vnic type in that list | 10:44 |
sean-k-mooney[m] | oh sorry, maybe you're right, i have not looked at dpdk in 2 years or more | 10:45 |
sean-k-mooney[m] | i might be misremembering, let me check the ml2 driver, but i guess it's vnic_normal | 10:45 |
gibi | I think it is mapped to normal, yes | 10:46 |
gibi | anyhow I think we are in agreement to have this filtering based on vnic_type | 10:46 |
gibi | I think I will amend the current patch with that | 10:46 |
sean-k-mooney[m] | ya you are right it is | 10:46 |
sean-k-mooney[m] | so we should just skip waiting then for *-physical and baremetal | 10:47 |
sean-k-mooney[m] | looking at the other vnic_types i dont think vnic_type smartnic is used with ovs or odl | 10:48 |
gibi | OK, I will discuss this with the downstream folks too and see if they prefer a configurable vnic_type or they are OK with a hardcode | 10:48 |
sean-k-mooney[m] | i think that is used by ironic | 10:48 |
sean-k-mooney[m] | ack | 10:49 |
gibi | yes, smartnic is ironic afaik | 10:49 |
gibi | so we can filter out that too | 10:49 |
sean-k-mooney[m] | ya most likely | 10:49 |
sean-k-mooney[m] | you could make the config an exclude list and default to the set we know wont work | 10:50 |
sean-k-mooney[m] | actully no | 10:50 |
sean-k-mooney[m] | that would enable it by default which we do not want | 10:50 |
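The hard-coded alternative discussed above (skip the types known not to get plug-time events, wait for everything else) could be sketched as below. The set contents follow the vnic_types named in this conversation (PF, cyborg PF, baremetal, smartnic); the names of the set and function are hypothetical, not nova code.

```python
# Hypothetical sketch: skip waiting for vnic_types that, per the discussion
# above, are not expected to get plug-time events from neutron.
NO_PLUG_EVENT_VNIC_TYPES = frozenset({
    "direct-physical",              # PFs: sriov-nic-agent does not manage them
    "accelerator-direct-physical",  # cyborg-managed PFs
    "baremetal",                    # used only by ironic
    "smartnic",                     # also ironic, per the chat
})


def should_wait_for_plug(vnic_type):
    """True if a plug-time event is expected for this vnic_type."""
    return vnic_type not in NO_PLUG_EVENT_VNIC_TYPES
```

As noted above, making this an exclude-list config would flip the feature on by default, which is why the hardcode (or an opt-in allow-list) is preferred.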
sean-k-mooney[m] | ok ill be afk for 20 mins or so chat to you later | 10:51 |
gibi | ack, thanks! | 10:52 |
frickler | kashyap: couple of more findings: a) no change with the Nehalem cpu settings patch from clarkb | 11:13 |
frickler | b) same issue with qemu-6.1 compiled from source | 11:13 |
kashyap | frickler: Hi | 11:13 |
frickler | c) the delta doesn't really increase with large flavors, i.e. with 512M or 1G, the cirros process still stays at 600M | 11:14 |
kashyap | frickler: The Nehalem thing here is not relevant (unless you're using CentOS9). | 11:15 |
kashyap | frickler: Good to know that you've actually tested it compiled from source | 11:15 |
frickler | the latter is likely why this issue isn't more widely seen. it just affects CIs that try to start a larger number of small instances | 11:15 |
frickler | ... in a limited memory environment | 11:15 |
kashyap | frickler: Right. So, this is TCG - this is not super amazingly well-tested upstream. Because a lot of folks use hardware accel. That said: | 11:17 |
kashyap | frickler: Can you please file an upstream QEMU bug here (do you have a GitLab account?) - https://gitlab.com/qemu-project/qemu/-/issues | 11:18 |
kashyap | frickler: That'll help me investigate the issue with a TCG dev. | 11:19 |
kashyap | Also, please include the bits you posted yesterday. (https://paste.opendev.org/raw/810150/) | 11:21 |
kashyap | frickler: I wonder if we can replicate this outside of OpenStack CI: like artificially triggering a script that'll start a ton of CirrOS instances? | 11:22 |
frickler | kashyap: I'll do the bug report, though likely not today, I'll let you know then | 11:23 |
kashyap | Thanks! Do mention the buggy version where you saw it first. And also the 6.1 compiled-from-source test. | 11:23 |
kashyap | It'll help with bisecting. | 11:23 |
frickler | kashyap: well I replicated with a local devstack deployment. I can also try to just create an instance with virt-manager | 11:24 |
kashyap | Yes, that'll be more preferable, if possible. | 11:24 |
frickler | kashyap: ok, thx for your feedback so far | 11:25 |
kashyap | No problem. These TCG bugs (if it is indeed a bug) are hard to suss out. | 11:25 |
sean-k-mooney1 | frickler: you are seeing this just with a normal boot right | 11:37 |
sean-k-mooney1 | you don't need to boot many vms to trigger it | 11:37 |
sean-k-mooney1 | you are seeing the large memory usage with just a single instance | 11:37 |
sean-k-mooney1 | so this should not be hard to replicate right | 11:37 |
kashyap | sean-k-mooney1: I don't think it's just one boot. He said "CIs that try to start a larger no. of small instances" | 11:37 |
sean-k-mooney1 | frickler: out of interest, do you have swap available on these hosts | 11:37 |
kashyap | sean-k-mooney1: Good question ;-) The "ghost of swap"... | 11:38 |
sean-k-mooney1 | kashyap: its only an issue for CIs because those small instances that used to fit no longer do | 11:38 |
sean-k-mooney1 | kashyap: when i first spoke to frickler about this i think they mentioned it happens for any small vm created | 11:38 |
*** sean-k-mooney1 is now known as sean-k-mooney | 11:39 | |
sean-k-mooney | i.e. in ci a vm that used to take say 128MB of RAM is now using 600MB, so we can't run 4 of them in parallel anymore | 11:39 |
* kashyap nods (on both points) | 11:39 | |
sean-k-mooney | so what im wondering is, in environments without swap, are we seeing more resident memory usage | 11:40 |
frickler | sean-k-mooney: yes, I see this with a single instance. the CI example is just where we noticed it first, with jobs OOMing with tempest running parallel tests | 11:40 |
sean-k-mooney | frickler: we had a very weird customer issue where we saw python processes have very large resident memory usage when no swap was present, but when it was allocated they did not have high memory usage and also did not use any swap | 11:41 |
sean-k-mooney | it was like having swap available stopped the memory allocator preallocating the memory | 11:42 |
frickler | hmm, indeed I have no swap on my test host. but we do have swap enabled on CI instances | 11:42 |
sean-k-mooney | ah ok, i was going to say: could you add a 1G swap file temporarily to your devstack and see if it changes the behavior | 11:43 |
sean-k-mooney | if it's in the CI then no point, it's not related | 11:43 |
*** tosky_ is now known as tosky | 12:07 | |
bauzas | (late) reminder: final PTG day for nova sessions starting in 12 mins at https://www.openstack.org/ptg/rooms/newton | 12:49 |
sean-k-mooney | dansmith: for the health check you want me to support http over tcp rather than just a tcp socket, right. assuming yes, im wondering whether it makes sense to make this a real wsgi application or to just use https://eventlet.net/doc/modules/wsgi.html#eventlet.wsgi.server to call a binary-specific health check function | 12:57 |
dansmith | sean-k-mooney: yes http, and I'd keep it uber simple (so the latter) | 12:58 |
sean-k-mooney | ok | 12:58 |
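The "uber simple" option, a bare WSGI callable served by the eventlet.wsgi.server linked above, could be sketched as follows. The `check_health` hook, the port, and the response bodies are all assumptions for illustration, not the eventual nova implementation.

```python
# Hypothetical sketch: a minimal per-binary health check served over HTTP,
# as discussed above. check_health() is a placeholder for whatever a real
# service would verify (RPC/DB connectivity, etc.).

def check_health():
    return True  # placeholder: always healthy


def health_app(environ, start_response):
    """Bare WSGI callable: 200 when healthy, 503 otherwise."""
    if check_health():
        status, body = "200 OK", b"OK"
    else:
        status, body = "503 Service Unavailable", b"UNHEALTHY"
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]


if __name__ == "__main__":
    # Serve on a local TCP port so haproxy or systemd can poll it over HTTP.
    import eventlet
    from eventlet import wsgi
    wsgi.server(eventlet.listen(("127.0.0.1", 8999)), health_app)
```

Because the app is a plain WSGI callable, it could later be promoted to a "real" wsgi application without changing the check logic, which keeps both options in the chat open.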
sean-k-mooney | brb just going to make a coffee and ill join | 13:00 |
bauzas | nova session started | 13:02 |
dansmith | sean-k-mooney: out of curiosity, does haproxy support some sort of bare tcp socket health check/ | 13:03 |
dansmith | I would kinda expect not | 13:03 |
dansmith | I thought even systemd wanted http, but can use a script too | 13:03 |
sean-k-mooney | dansmith: i think it can, but haproxy was not part of my original use cases | 13:10 |
dansmith | okay | 13:10 |
sean-k-mooney | i was originally thinking of this as a command/control interface with command objects exchanged, more like the rpc bus | 13:11 |
sean-k-mooney | with nc or nova-manage as the cli | 13:11 |
gibi | sean-k-mooney: fyi, the sriov agent only unreliably sends vif plug events for VFs, as it polls the hypervisor. If the unplug/plug is fast enough then the agent might miss the state when the device was down | 13:12 |
sean-k-mooney | gibi: more fun | 13:12 |
sean-k-mooney | ok | 13:12 |
gibi | it is fun all the way down :D | 13:12 |
sean-k-mooney | so we might want to only wait for vnic_type=normal then | 13:13 |
gibi | it seems soo | 13:13 |
sean-k-mooney | long term we really do need to fix this interface and enforce a stricter contract | 13:13 |
sean-k-mooney | rather than guessing, because there are so many factors to consider | 13:14 |
gibi | yes, we need neutron to either enforce that events are always sent, or declare in the port what events can be expected. There is no way nova can maintain a sane mapping alone | 13:14 |
bauzas | dmitriis: saw the chat ? | 13:25 |
dmitriis | bauzas: looking, 1 sec | 13:26 |
bauzas | dmitriis: I'm about to propose to postpone your topic after 3pm UTC | 13:26 |
dmitriis | bauzas: got a conflict at 3PM but I can make it work | 13:27 |
bauzas | dmitriis: maybe later then ? | 13:27 |
bauzas | the idea is just to avoid discussing your stuff *before* :) | 13:27 |
dmitriis | bauzas: let's do it at 3PM, later is more complicated :^) | 13:28 |
bauzas | ack, moving your topic then :) | 13:28 |
dmitriis | bauzas: ack, ty for pinging | 13:28 |
stephenfin | Is it just me or is tbarron's sound clipping real bad? I can understand him though (i.e. can be fixed later) | 13:33 |
tbarron | stephenfin: It may be from my end, sorry. Yesterday zoom was having trouble with my rural location. | 13:38 |
stephenfin | tbarron: nw, I was just concerned something was broken on my end :) I could understand everything just fine | 13:39 |
tbarron | cool | 13:39 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove models that were moved to the API database https://review.opendev.org/c/openstack/nova/+/812149 | 13:43 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove models for removed services, features https://review.opendev.org/c/openstack/nova/+/812150 | 13:43 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove nova-network models https://review.opendev.org/c/openstack/nova/+/812151 | 13:43 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 13:45 |
kashyap | clarkb: I'm inundated with a few things; I will reply to your email on the list on Monday. Hope that's okay | 13:56 |
clarkb | kashyap: yup no worries | 15:15 |
clarkb | have a good weekend | 15:16 |
bauzas | dansmith: dunno if you fancy joining us, but we're discussing a topic where your knowledge could be helpful: instance v3.0 object bump | 17:04 |
dansmith | sorry, I'm tied up | 17:04 |
bauzas | or you're stuck in the TC meeting | 17:05 |
bauzas | heh, no worries | 17:05 |
dansmith | I definitely have opinions | 17:05 |
bauzas | dansmith: notes on the etherpad could be appreciated tho :) | 17:05 |
bauzas | https://etherpad.opendev.org/p/nova-yoga-ptg L646 | 17:05 |
sean-k-mooney | https://etherpad.opendev.org/p/nova-yoga-ptg-backup | 17:26 |
sean-k-mooney | no colors but there is a snapshot ^ | 17:26 |
clarkb | I just restored it to the version that bauzas identified as good (the original etherpad url I mean) | 17:28 |
bauzas | yeah \o/ | 17:28 |
sean-k-mooney | yep that looks ok | 17:28 |
bauzas | I just wrote our last conclusions | 17:28 |
bauzas | we can call it a wrap | 17:28 |
sean-k-mooney | clarkb++ thanks | 17:28 |
bauzas | sean-k-mooney: just copy it again, so we have a backup | 17:28 |
* bauzas doesn't wanna implore the infra team for the 3rd time (on the same etherpad) :D | 17:29 | |
sean-k-mooney | i just did a plain text export from the timeline and imported it again in a different page | 17:29 |
bauzas | yeah that works | 17:29 |
sean-k-mooney | but ill try that in html form and see if it can keep colours | 17:29 |
bauzas | we have the highlights with the history | 17:29 |
bauzas | so nothing is technically lost | 17:29 |
sean-k-mooney | in etherpad format it goes to the latest point in history not the one you have selected in the timeline | 17:30 |
sean-k-mooney | ya so html format does not keep colors either | 17:31 |
sean-k-mooney | but we have the backup in any case | 17:31 |
sean-k-mooney | and the original is restored so we are good | 17:31 |
bauzas | yup | 17:32 |
bauzas | on that note, /me calls it a week | 17:32 |
bauzas | \o | 17:32 |
gibi | me too | 17:32 |
gibi | o/ | 17:32 |
bauzas | I'm just sad to hear that the next PTG will still be virtual, but that's life | 17:33 |
bauzas | I'm just exhausted and I miss our whiteboards and hallway discussions | 17:33 |
bauzas | but I'll open a beer to enjoy the last day of the PTG as if it was physical :) | 17:34 |
mnaser_ | hm | 19:22 |
mnaser_ | anyone would have an idea as to why metadata service seems to be stalling / very slow | 19:22 |
mnaser_ | 2021-10-22 19:23:13.422 11 INFO nova.metadata.wsgi.server [req-f65469d1-4082-4b9e-ab3b-3c710c1f0366 - - - - -] 10.30.107.187,10.101.2.141 "GET /latest/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 148 time: 25.8355188 | 19:23 |
mnaser_ | it almost feels like all 'requests' stall for a bit and then they all burst out at once | 19:23 |
mnaser_ | memcache is fine, mysql is fine | 19:23 |
mnaser_ | plenty of conductors | 19:24 |
mnaser_ | 2021-10-22 19:25:15.880 13 INFO nova.metadata.wsgi.server [-] 192.168.1.5,10.101.2.142 "GET /openstack HTTP/1.1" status: 200 len: 235 time: 5.0295317 | 19:25 |
mnaser_ | especially this, it seems very 5s-y | 19:25 |
clarkb | mnaser_: I've noticed that on a couple of devstack jobs recently too but haven't had a chance to dig into it beyond noting that was the error returned by tempest | 19:46 |
clarkb | specifically some tests says metadata service fails to respond in time and the test times out | 19:47 |
prometheanfire | nova fails with the new oslo-concurrency 4.5.0 https://zuul.opendev.org/t/openstack/build/9a385f3324fb46a3abf6257a09020d38 | 20:44 |
prometheanfire | looks like an extra value was used (blocking)? | 20:45 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!