opendevreview | Merged openstack/project-config master: Update post-review zuul pipeline definition https://review.opendev.org/c/openstack/project-config/+/867282 | 00:02 |
---|---|---|
*** rlandy|bbl is now known as rlandy|out | 02:24 | |
*** yadnesh|away is now known as yadnesh | 05:08 | |
*** soniya29 is now known as soniya29|pto | 05:57 | |
*** jpena|off is now known as jpena | 07:54 | |
opendevreview | Ade Lee proposed openstack/project-config master: Add FIPS job for ubuntu https://review.opendev.org/c/openstack/project-config/+/867112 | 10:37 |
*** yadnesh is now known as yadnesh|afk | 11:05 | |
*** dviroel|out is now known as dviroel|rover | 11:12 | |
*** rlandy|out is now known as rlandy | 11:14 | |
*** yadnesh|afk is now known as yadnesh | 12:02 | |
opendevreview | Thierry Carrez proposed openstack/project-config master: Bring back the PTL+1 column in release dashboard https://review.opendev.org/c/openstack/project-config/+/867801 | 13:00 |
ykarel_ | fungi, Thanks, I got nodes on hold. Who can help from the vexxhost-ca-ymq-1 side, as they may have already dealt with such nested virt issues? | 13:34 |
fungi | ykarel_: well, for a start, do you need me to give your ssh key access to log into the held nodes? | 13:36 |
ykarel_ | fungi, my keys are already injected there in the pre playbook | 13:36 |
ykarel_ | now collecting information required for reporting bugs as described in https://www.kernel.org/doc/html/latest/virt/kvm/x86/running-nested-guests.html#reporting-bugs-from-nested-setups | 13:37 |
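The kind of information gathering ykarel_ describes could be sketched roughly as below. This is a hypothetical helper, not the exact checklist from the linked kernel doc; the sysfs paths are standard KVM module parameter locations, and missing files (e.g. on a non-KVM host) are reported as None.

```python
# Hedged sketch: collect a few of the host details useful when
# reporting nested-virt bugs (kernel version, KVM "nested" flag).
# The exact field list in the kernel doc is longer; this is illustrative.
import platform
from pathlib import Path

def read_first_line(path):
    """Return the first line of a file, or None if unreadable/empty."""
    try:
        lines = Path(path).read_text().splitlines()
    except OSError:
        return None
    return lines[0].strip() if lines else None

def collect_nested_virt_info():
    """Gather kernel version and the per-vendor KVM nested flag."""
    info = {"kernel": platform.release()}
    # The nested parameter lives under kvm_amd or kvm_intel,
    # whichever module is loaded on this host (None if neither).
    for module in ("kvm_amd", "kvm_intel"):
        info[module + "_nested"] = read_first_line(
            f"/sys/module/{module}/parameters/nested")
    return info

print(collect_nested_virt_info())
```

Running this on both a held (broken) node and a healthy one gives a quick first diff before digging into the full report.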
fungi | ahh, okay. in that case, mnaser has helped look into this sort of thing before from the vexxhost side | 13:37 |
ykarel_ | Thanks fungi | 13:38 |
fungi | it looks like the uuid for 199.19.213.61 is 1982206a-6620-454f-9983-465db3651cf8, while the uuid for 199.19.213.167 is 5200d42a-e41f-4fc4-a5f8-3d8a48a71349 | 13:38 |
fungi | that may make it easier to look up on the nova side | 13:38 |
ykarel_ | mnaser, since we switched jobs to run on ubuntu jammy, we started seeing random issues https://bugs.launchpad.net/neutron/+bug/1999249. Can you please help in clearing it? We are only seeing it in the vexxhost provider | 13:39 |
ykarel_ | Thanks fungi | 13:39 |
fungi | ykarel_: and just to be clear, you're enabling nested virt in the tempest ovn jobs? | 13:40 |
ykarel_ | fungi, yes we do | 13:40 |
fungi | yeah, looks like you limited those to the nested-virt-ubuntu-jammy label | 13:41 |
ykarel_ | https://github.com/openstack/neutron-tempest-plugin/blob/master/zuul.d/base-nested-switch.yaml#L31-L32 | 13:41 |
ykarel_ | yes | 13:41 |
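The label pinning fungi and ykarel_ are discussing could look roughly like the fragment below. This is an illustrative sketch only; the nodeset and job names here are invented, and the real definition is in the linked base-nested-switch.yaml.

```yaml
# Hypothetical Zuul config fragment: pin a job to the
# nested-virt-ubuntu-jammy label via a dedicated nodeset.
# Names other than the label itself are illustrative.
- nodeset:
    name: neutron-nested-virt-ubuntu-jammy
    nodes:
      - name: controller
        label: nested-virt-ubuntu-jammy

- job:
    name: neutron-tempest-plugin-ovn-nested
    parent: neutron-tempest-plugin-base
    nodeset: neutron-nested-virt-ubuntu-jammy
```

Because the label is opt-in per nodeset, only jobs that explicitly reference it land on the nested-virt-capable providers.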
*** akekane is now known as abhishekk | 13:45 | |
*** dasm|off is now known as dasm | 14:30 | |
*** frenzy_friday is now known as frenzy_friday|food | 14:45 | |
*** yadnesh is now known as yadnesh|away | 15:01 | |
clarkb | ykarel_: fungi: to be extra clear this is exactly why we don't support nested virt... We have those labels set up under the condition that you know it is unsupported and any issues will need to be sorted out between you and the cloud provider (we only add it to clouds that indicate they are willing to work through this) | 15:26 |
clarkb | we can collect info out of nodepool if necessary (though at this point zuul and the node itself should have all that data?) but we haven't in the past been able to debug this for you | 15:27 |
ykarel_ | clarkb, Thanks, yes I understand that. The nodes have the required data; if anything else is needed we can collect that too, as it's quite easily reproducible. But for such issues, investigation is required at the L0 level too | 15:32 |
ykarel_ | those nodes provide good performance, that's why we moved to those ;) | 15:33 |
clarkb | ykarel_: ok that is not why you should use them | 15:33 |
*** frenzy_friday|food is now known as frenzy_friday | 15:33 | |
clarkb | they are there specifically for jobs that require the functionality to run at all with an understanding that debugging might need to be undertaken | 15:33 |
clarkb | they don't provide the same redundancy as other labels, nor the same capacity. We need to use them judiciously to balance the needs of jobs against limited resources and the possibility of brokenness | 15:34 |
opendevreview | Merged openstack/project-config master: Bring back the PTL+1 column in release dashboard https://review.opendev.org/c/openstack/project-config/+/867801 | 15:34 |
ykarel_ | clarkb, yes we had considered the capacity/redundancy/debugging aspects when we started using these | 15:36 |
ykarel_ | we did not consider the functionality part, as for those jobs we don't need any nested virt functionality | 15:37 |
clarkb | if you don't need any nested virt functionality then the jobs probably shouldn't be using that label | 15:37 |
clarkb | and that is for two reasons. The first is that these are limited resources and some jobs actually do need that functionality (octavia for example), and second, they are known to be less reliable because nested virt is less reliable, and now your jobs are broken | 15:38 |
ykarel_ | yes, got it. it worked great for us with focal, but is broken with jammy :( | 15:40 |
ykarel_ | job time was reduced approximately by half, plus far fewer failures and thus rechecks | 15:41 |
mlavalle | hey, can I get someone from the devstack core team to take a look at https://review.opendev.org/c/openstack/devstack/+/866944. Just needs to be pushed over the edge | 15:42 |
clarkb | right, and the idea is that we very specifically target jobs that need that to these labels. If nested virt worked everywhere and was reliable we wouldn't have this situation. Unfortunately nested virt is extremely complicated; it relies on hardware details and the kernel versions for each level of operating system, and probably libvirt too. | 15:42 |
clarkb | mlavalle: you'll want to ask in #openstack-qa | 15:42 |
ykarel_ | and now with jammy, with qemu>=5, guest VMs are consuming too much memory and we are seeing oom-kills | 15:42 |
mlavalle | clarkb: ack, thanks | 15:42 |
ykarel_ | clarkb, yes right | 15:45 |
clarkb | For a long time we thought that nested virt was stable on AMD cpus because the linux kernel set the flag to enable it by default on AMD but not Intel. The implication being "we expect it to work on AMD but be less reliable on Intel". Turns out that was a bug and they had set the flag improperly, and it is flaky on both :( | 15:47 |
* ykarel_ was not aware about ^ | 15:48 | |
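The per-vendor default flag clarkb mentions can be inspected under sysfs. The sketch below is an assumption-laden illustration (standard KVM module parameter paths, with the sysfs root injectable purely so the helper can be exercised against a fake tree):

```python
# Hedged sketch: check whether the kernel's KVM "nested" flag is set
# for a given vendor module. sysfs_root is a parameter only to make
# the helper testable against a fake directory tree.
from pathlib import Path

def nested_enabled(module, sysfs_root="/sys"):
    """Return True/False for the kvm nested flag, or None if absent."""
    param = Path(sysfs_root) / "module" / module / "parameters" / "nested"
    try:
        value = param.read_text().strip()
    except OSError:
        return None
    # Kernels have reported the flag as Y/N or as 1/0 over time.
    return value in ("Y", "y", "1")

# Result depends on the host; None means the module isn't loaded.
print(nested_enabled("kvm_amd"), nested_enabled("kvm_intel"))
```

On a host where kvm_amd is loaded with the historical default, this would report True for AMD, which is exactly the default clarkb says turned out to be set by mistake.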
ykarel_ | clarkb, if we are not affecting the capacity for projects that really need nested virt functionality can't we keep on using these nodes as we understand the supportability issues with it? | 15:50 |
ykarel_ | after the jammy issue is fixed with those | 15:50 |
clarkb | ykarel_: looking at https://grafana.opendev.org/d/b283670153/nodepool-vexxhost?orgId=1&from=now-7d&to=now it does appear that we are not hitting our capacity for those resources. I think the main risks to you are being queued behind capacity limits, the cloud having an outage or going away and no longer being able to run there, and the nested virt errors themselves. I guess | 15:52 |
ykarel_ | we are not running all neutron jobs there but a few | 15:52 |
clarkb | I'm ok with it as long as capacity isn't an issue | 15:52 |
clarkb | the main risk is that you could end up being unable to test if that cloud can't work for some reason (like now with nested virt problems or if the cloud has to shutdown or has an outage) | 15:53 |
ykarel_ | Thanks clarkb, yes, totally understandable | 15:54 |
clarkb | ykarel_: I would definitely keep it to targeted usage | 15:55 |
fungi | ykarel_: out of curiosity, is that job failing this way 100% of the time, or just sometimes? | 15:55 |
ykarel_ | fungi, on the vexxhost provider about 40% of the time; on ovh gra1 and bhs1 I have not seen a failure yet | 15:56 |
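The per-provider breakdown ykarel_ quotes can be computed mechanically from job results. This is a toy illustration only; the run data below is invented to match the roughly 40% vexxhost / 0% ovh observation, not real CI output.

```python
# Toy sketch: failure rate per provider from (provider, passed) pairs.
# The sample data is invented to mirror the numbers quoted in channel.
from collections import defaultdict

def failure_rate_by_provider(results):
    """Map provider name -> fraction of failed runs."""
    totals, failures = defaultdict(int), defaultdict(int)
    for provider, passed in results:
        totals[provider] += 1
        if not passed:
            failures[provider] += 1
    return {p: failures[p] / totals[p] for p in totals}

runs = [("vexxhost", False), ("vexxhost", True), ("vexxhost", True),
        ("vexxhost", False), ("vexxhost", True),
        ("ovh-gra1", True), ("ovh-bhs1", True)]
print(failure_rate_by_provider(runs))  # vexxhost at 0.4, both ovh at 0.0
```

A split like this (one provider failing, others clean) is what points the investigation at the provider's hypervisors rather than at the job itself.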
clarkb | ya it's very likely specific to the underlying CPUs or kernels | 15:56 |
ykarel_ | so I was trying to understand if it's only with a few compute nodes in the cloud | 15:56 |
clarkb | and they almost certainly differ between vexxhost and ovh | 15:56 |
fungi | but also the fact that it doesn't lock up immediately and actually works all the way through a job most of the time is interesting, definitely sounds like some sort of corner case (granted a fairly impactful one in this situation) | 15:57 |
fungi | or maybe it's that only 40% of the nova controllers have the right cpu or kernel version to expose the bug | 15:58 |
clarkb | fungi: that would be my hunch | 15:58 |
ykarel_ | yeah, what's triggering it is something to be figured out | 16:01 |
ykarel_ | will wait for mnaser to look into it | 16:01 |
ykarel_ | maybe he already knows about it | 16:01 |
johnsom | ykarel_ As others have said, please don’t use the nested-virt nodes unless you absolutely need them. | 16:05 |
fungi | but also, finding out new nested virt bugs and getting them reported to the kernel devs is helpful | 16:06 |
fungi | ricolin: ^ also not sure if that environment might be one you have insight into | 16:08 |
ykarel_ | johnsom, just wondering if you have seen this issue with octavia or any other project using those nodes? | 16:13 |
johnsom | There are only two projects that really need these nodes (that I know of). So far I have not seen any issues with Jammy on them. | 16:16 |
ykarel_ | thanks, it would be interesting to see why it's not seen there, as just running libguestfs-test-tool gets stuck on the affected system | 16:19 |
clarkb | oh libguestfs.... | 16:19 |
johnsom | Typically we have to attach to the kernel during boot to debug issues on these nodes. It is a lot of work. If you don't absolutely need them for the test cases, I recommend you don't use them | 16:21 |
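The "gets stuck" symptom ykarel_ describes can be detected mechanically by running a probe command under a timeout. The sketch below is illustrative: on a held node the probe command would be something like libguestfs-test-tool, but here a harmless Python subcommand stands in for it.

```python
# Hedged sketch: run a probe command under a timeout and classify the
# result as completed ("ok"), failed, or hung. On a held node the
# command might be ["libguestfs-test-tool"] with a generous timeout.
import subprocess
import sys

def probe(cmd, timeout):
    """Return 'ok', 'failed', or 'hung' for the given command."""
    try:
        result = subprocess.run(cmd, timeout=timeout,
                                capture_output=True, text=True)
        return "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        return "hung"

# Stand-in for the real probe; prints "ok".
print(probe([sys.executable, "-c", "print('hello')"], 10))
```

Wrapping the probe this way lets a job (or a cron loop on a held node) record hang frequency instead of blocking forever, which helps quantify the 40%-of-hypervisors hypothesis discussed above.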
opendevreview | Merged openstack/project-config master: Set iweb max-servers to 0 https://review.opendev.org/c/openstack/project-config/+/867262 | 16:55 |
*** jpena is now known as jpena|off | 17:16 | |
opendevreview | Ghanshyam proposed openstack/openstack-zuul-jobs master: Pin tox<4 for stable branches (<=stable/zed) testing https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/867849 | 18:55 |
opendevreview | Merged openstack/project-config master: Set iweb to empty labels and diskimages https://review.opendev.org/c/openstack/project-config/+/867263 | 21:07 |
*** dviroel|rover is now known as dviroel|rover|afk | 21:16 | |
*** cloudnull1 is now known as cloudnull | 22:29 | |
*** blarnath is now known as d34dh0r53 | 22:29 | |
*** rlandy is now known as rlandy|out | 22:49 | |
*** sfinucan is now known as stephenfin | 22:55 | |
*** dasm is now known as dasm|off | 23:59 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!