opendevreview | Merged openstack/project-config master: Update post-review zuul pipeline definition https://review.opendev.org/c/openstack/project-config/+/867282 | 00:02 |
---|---|---|
*** rlandy|bbl is now known as rlandy|out | 02:24 | |
*** yadnesh|away is now known as yadnesh | 05:08 | |
*** soniya29 is now known as soniya29|pto | 05:57 | |
*** jpena|off is now known as jpena | 07:54 | |
opendevreview | Ade Lee proposed openstack/project-config master: Add FIPS job for ubuntu https://review.opendev.org/c/openstack/project-config/+/867112 | 10:37 |
*** yadnesh is now known as yadnesh|afk | 11:05 | |
*** dviroel|out is now known as dviroel|rover | 11:12 | |
*** rlandy|out is now known as rlandy | 11:14 | |
*** yadnesh|afk is now known as yadnesh | 12:02 | |
opendevreview | Thierry Carrez proposed openstack/project-config master: Bring back the PTL+1 column in release dashboard https://review.opendev.org/c/openstack/project-config/+/867801 | 13:00 |
ykarel_ | fungi, Thanks, I got nodes on hold. Who can help from the vexxhost-ca-ymq-1 side, as they may have already dealt with such nested virt issues? | 13:34 |
fungi | ykarel_: well, for a start, do you need me to give your ssh key access to log into the held nodes? | 13:36 |
ykarel_ | fungi, my keys are already injected there in the pre playbook | 13:36 |
ykarel_ | now collecting information required for reporting bugs as described in https://www.kernel.org/doc/html/latest/virt/kvm/x86/running-nested-guests.html#reporting-bugs-from-nested-setups | 13:37 |
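The kind of information gathering ykarel_ describes could be sketched roughly as below. This is a hypothetical helper, not the exact checklist from the linked kernel doc; the sysfs paths are standard KVM module parameter locations, and missing files (e.g. on a non-KVM host) are reported as None.

```python
# Hedged sketch: collect a few of the host details useful when
# reporting nested-virt bugs (kernel version, KVM "nested" flag).
# The exact field list in the kernel doc is longer; this is illustrative.
import platform
from pathlib import Path

def read_first_line(path):
    """Return the first line of a file, or None if unreadable/empty."""
    try:
        lines = Path(path).read_text().splitlines()
    except OSError:
        return None
    return lines[0].strip() if lines else None

def collect_nested_virt_info():
    """Gather kernel version and the per-vendor KVM nested flag."""
    info = {"kernel": platform.release()}
    # The nested parameter lives under kvm_amd or kvm_intel,
    # whichever module is loaded on this host (None if neither).
    for module in ("kvm_amd", "kvm_intel"):
        info[module + "_nested"] = read_first_line(
            f"/sys/module/{module}/parameters/nested")
    return info

print(collect_nested_virt_info())
```

Running this on both a held (broken) node and a healthy one gives a quick first diff before digging into the full report.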
fungi | ahh, okay. in that case, mnaser has helped look into this sort of thing before from the vexxhost side | 13:37 |
ykarel_ | Thanks fungi | 13:38 |
fungi | it looks like the uuid for 199.19.213.61 is 1982206a-6620-454f-9983-465db3651cf8, while the uuid for 199.19.213.167 is 5200d42a-e41f-4fc4-a5f8-3d8a48a71349 | 13:38 |
fungi | that may make it easier to look up on the nova side | 13:38 |
ykarel_ | mnaser, since we switched jobs to run on ubuntu jammy, we started seeing random issues https://bugs.launchpad.net/neutron/+bug/1999249. Can you please help in clearing it? We are only seeing it in the vexxhost provider | 13:39 |
ykarel_ | Thanks fungi | 13:39 |
fungi | ykarel_: and just to be clear, you're enabling nested virt in the tempest ovn jobs? | 13:40 |
ykarel_ | fungi, yes we do | 13:40 |
fungi | yeah, looks like you limited those to the nested-virt-ubuntu-jammy label | 13:41 |
ykarel_ | https://github.com/openstack/neutron-tempest-plugin/blob/master/zuul.d/base-nested-switch.yaml#L31-L32 | 13:41 |
ykarel_ | yes | 13:41 |
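The label pinning fungi and ykarel_ are discussing could look roughly like the fragment below. This is an illustrative sketch only; the nodeset and job names here are invented, and the real definition is in the linked base-nested-switch.yaml.

```yaml
# Hypothetical Zuul config fragment: pin a job to the
# nested-virt-ubuntu-jammy label via a dedicated nodeset.
# Names other than the label itself are illustrative.
- nodeset:
    name: neutron-nested-virt-ubuntu-jammy
    nodes:
      - name: controller
        label: nested-virt-ubuntu-jammy

- job:
    name: neutron-tempest-plugin-ovn-nested
    parent: neutron-tempest-plugin-base
    nodeset: neutron-nested-virt-ubuntu-jammy
```

Because the label is opt-in per nodeset, only jobs that explicitly reference it land on the nested-virt-capable providers.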
*** akekane is now known as abhishekk | 13:45 | |
*** dasm|off is now known as dasm | 14:30 | |
*** frenzy_friday is now known as frenzy_friday|food | 14:45 | |
*** yadnesh is now known as yadnesh|away | 15:01 | |
clarkb | ykarel_: fungi: to be extra clear this is exactly why we don't support nested virt... We have those labels set up under the condition that you know it is unsupported and any issues will need to be sorted out between you and the cloud provider (we only add it to clouds that indicate they are willing to work through this) | 15:26 |
clarkb | we can collect info out of nodepool if necessary (though at this point zuul and the node itself should have all that data?) but we haven't in the past been able to debug this for you | 15:27 |
ykarel_ | clarkb, Thanks, yes I understand that. The nodes have the required data; if anything else is needed we can collect that too, as it's quite easily reproducible. But for such issues, investigation is required at the L0 level too | 15:32 |
ykarel_ | those nodes provide good performance, that's why we moved to those ;) | 15:33 |
clarkb | ykarel_: ok that is not why you should use them | 15:33 |
*** frenzy_friday|food is now known as frenzy_friday | 15:33 | |
clarkb | they are there specifically for jobs that require the functionality to run at all with an understanding that debugging might need to be undertaken | 15:33 |
clarkb | they don't provide the same redundancy as other labels, nor the same capacity. We need to use them judiciously to balance the needs of jobs against limited resources and the possibility of brokenness | 15:34 |
opendevreview | Merged openstack/project-config master: Bring back the PTL+1 column in release dashboard https://review.opendev.org/c/openstack/project-config/+/867801 | 15:34 |
ykarel_ | clarkb, yes we had considered the capacity/redundancy/debugging aspects when we started using these | 15:36 |
ykarel_ | we did not consider the functionality part, as for those jobs we don't need any nested virt functionality | 15:37 |
clarkb | if you don't need any nested virt functionality then the jobs probably shouldn't be using that label | 15:37 |
clarkb | and that is for two reasons. The first is that these are limited resources and some jobs actually do need that functionality (octavia for example), and second, they are known to be less reliable because nested virt is less reliable, and now your jobs are broken | 15:38 |
ykarel_ | yes, got it. it worked great for us with focal, but is broken with jammy :( | 15:40 |
ykarel_ | job time was reduced approximately by half, plus far fewer failures and thus rechecks | 15:41 |
mlavalle | hey, can I get someone from the devstack core team to take a look at https://review.opendev.org/c/openstack/devstack/+/866944. Just needs to be pushed over the edge | 15:42 |
clarkb | right, and the idea is that we very specifically target jobs that need that to these labels. If nested virt worked everywhere and was reliable we wouldn't have this situation. Unfortunately nested virt is extremely complicated; it relies on hardware details and the kernel versions for each level of operating system, and probably libvirt too. | 15:42 |
clarkb | mlavalle: you'll want to ask in #openstack-qa | 15:42 |
ykarel_ | and now with jammy, with qemu>=5, guest VMs are consuming too much memory and we are seeing oom-kills | 15:42 |
mlavalle | clarkb: ack, thanks | 15:42 |
ykarel_ | clarkb, yes right | 15:45 |
clarkb | For a long time we thought that nested virt was stable on AMD cpus because the linux kernel set the flag to enable it by default on AMD but not Intel. The implication being "we expect it to work on AMD but be less reliable on Intel". Turns out that was a bug and they had set the flag improperly, and it is flaky on both :( | 15:47 |
* ykarel_ was not aware about ^ | 15:48 | |
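The per-vendor default flag clarkb mentions can be inspected under sysfs. The sketch below is an assumption-laden illustration (standard KVM module parameter paths, with the sysfs root injectable purely so the helper can be exercised against a fake tree):

```python
# Hedged sketch: check whether the kernel's KVM "nested" flag is set
# for a given vendor module. sysfs_root is a parameter only to make
# the helper testable against a fake directory tree.
from pathlib import Path

def nested_enabled(module, sysfs_root="/sys"):
    """Return True/False for the kvm nested flag, or None if absent."""
    param = Path(sysfs_root) / "module" / module / "parameters" / "nested"
    try:
        value = param.read_text().strip()
    except OSError:
        return None
    # Kernels have reported the flag as Y/N or as 1/0 over time.
    return value in ("Y", "y", "1")

# Result depends on the host; None means the module isn't loaded.
print(nested_enabled("kvm_amd"), nested_enabled("kvm_intel"))
```

On a host where kvm_amd is loaded with the historical default, this would report True for AMD, which is exactly the default clarkb says turned out to be set by mistake.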
ykarel_ | clarkb, if we are not affecting the capacity for projects that really need nested virt functionality can't we keep on using these nodes as we understand the supportability issues with it? | 15:50 |
ykarel_ | after the jammy issue is fixed with those | 15:50 |
clarkb | ykarel_: looking at https://grafana.opendev.org/d/b283670153/nodepool-vexxhost?orgId=1&from=now-7d&to=now it does appear that we are not hitting our capacity for those resources. I think the main risks to you are being queued behind capacity limits, the cloud having an outage or going away and no longer being able to run there, and the nested virt errors themselves. I guess | 15:52 |
ykarel_ | we are not running all neutron jobs there but a few | 15:52 |
clarkb | I'm ok with it as long as capacity isn't an issue | 15:52 |
clarkb | the main risk is that you could end up being unable to test if that cloud can't work for some reason (like now with nested virt problems or if the cloud has to shutdown or has an outage) | 15:53 |
ykarel_ | Thanks clarkb, yes, totally understandable | 15:54 |
clarkb | ykarel_: I would definitely keep it to targeted usage | 15:55 |
fungi | ykarel_: out of curiosity, is that job failing this way 100% of the time, or just sometimes? | 15:55 |
ykarel_ | fungi, on the vexxhost provider about 40% of the time; on ovh gra1 and bhs1 I have not seen a failure yet | 15:56 |
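The per-provider breakdown ykarel_ quotes can be computed mechanically from job results. This is a toy illustration only; the run data below is invented to match the roughly 40% vexxhost / 0% ovh observation, not real CI output.

```python
# Toy sketch: failure rate per provider from (provider, passed) pairs.
# The sample data is invented to mirror the numbers quoted in channel.
from collections import defaultdict

def failure_rate_by_provider(results):
    """Map provider name -> fraction of failed runs."""
    totals, failures = defaultdict(int), defaultdict(int)
    for provider, passed in results:
        totals[provider] += 1
        if not passed:
            failures[provider] += 1
    return {p: failures[p] / totals[p] for p in totals}

runs = [("vexxhost", False), ("vexxhost", True), ("vexxhost", True),
        ("vexxhost", False), ("vexxhost", True),
        ("ovh-gra1", True), ("ovh-bhs1", True)]
print(failure_rate_by_provider(runs))  # vexxhost at 0.4, both ovh at 0.0
```

A split like this (one provider failing, others clean) is what points the investigation at the provider's hypervisors rather than at the job itself.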
clarkb | ya it's very likely specific to the underlying CPUs or kernels | 15:56 |
ykarel_ | so I was trying to understand if it's only with a few compute nodes in the cloud | 15:56 |
clarkb | and they almost certainly differ between vexxhost and ovh | 15:56 |
fungi | but also the fact that it doesn't lock up immediately and actually works all the way through a job most of the time is interesting, definitely sounds like some sort of corner case (granted a fairly impactful one in this situation) | 15:57 |
fungi | or maybe it's that only 40% of the nova controllers have the right cpu or kernel version to expose the bug | 15:58 |
clarkb | fungi: that would be my hunch | 15:58 |
ykarel_ | yeah, what's triggering it is something to be figured out | 16:01 |
ykarel_ | will wait for mnaser to look into it | 16:01 |
ykarel_ | maybe he already knows about it | 16:01 |
johnsom | ykarel_ As others have said, please don’t use the nested-virt nodes unless you absolutely need them. | 16:05 |
fungi | but also, finding out new nested virt bugs and getting them reported to the kernel devs is helpful | 16:06 |
fungi | ricolin: ^ also not sure if that environment might be one you have insight into | 16:08 |
ykarel_ | johnsom, just wondering if you have seen this issue with octavia or any other project using those nodes? | 16:13 |
johnsom | There are only two projects that really need these nodes (that I know of). So far I have not seen any issues with Jammy on them. | 16:16 |
ykarel_ | thanks, it would be interesting to see why it's not seen there, as just running libguestfs-test-tool gets stuck on the affected system | 16:19 |
clarkb | oh libguestfs.... | 16:19 |
johnsom | Typically we have to attach to the kernel during boot to debug issues on these nodes. It is a lot of work. If you don't absolutely need them for the test cases, I recommend you don't use them | 16:21 |
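The "gets stuck" symptom ykarel_ describes can be detected mechanically by running a probe command under a timeout. The sketch below is illustrative: on a held node the probe command would be something like libguestfs-test-tool, but here a harmless Python subcommand stands in for it.

```python
# Hedged sketch: run a probe command under a timeout and classify the
# result as completed ("ok"), failed, or hung. On a held node the
# command might be ["libguestfs-test-tool"] with a generous timeout.
import subprocess
import sys

def probe(cmd, timeout):
    """Return 'ok', 'failed', or 'hung' for the given command."""
    try:
        result = subprocess.run(cmd, timeout=timeout,
                                capture_output=True, text=True)
        return "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        return "hung"

# Stand-in for the real probe; prints "ok".
print(probe([sys.executable, "-c", "print('hello')"], 10))
```

Wrapping the probe this way lets a job (or a cron loop on a held node) record hang frequency instead of blocking forever, which helps quantify the 40%-of-hypervisors hypothesis discussed above.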
opendevreview | Merged openstack/project-config master: Set iweb max-servers to 0 https://review.opendev.org/c/openstack/project-config/+/867262 | 16:55 |
*** jpena is now known as jpena|off | 17:16 | |
opendevreview | Ghanshyam proposed openstack/openstack-zuul-jobs master: Pin tox<4 for stable branches (<=stable/zed) testing https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/867849 | 18:55 |
opendevreview | Merged openstack/project-config master: Set iweb to empty labels and diskimages https://review.opendev.org/c/openstack/project-config/+/867263 | 21:07 |
*** dviroel|rover is now known as dviroel|rover|afk | 21:16 | |
*** cloudnull1 is now known as cloudnull | 22:29 | |
*** blarnath is now known as d34dh0r53 | 22:29 | |
*** rlandy is now known as rlandy|out | 22:49 | |
*** sfinucan is now known as stephenfin | 22:55 | |
*** dasm is now known as dasm|off | 23:59 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!