Thursday, 2024-10-17

*** liuxie is now known as liushy02:50
*** liuxie is now known as liushy07:16
*** travisholton0 is now known as travisholton07:42
*** elodilles_pto is now known as elodilles09:12
opendevreviewMerged opendev/irc-meetings master: Move Large Scale SIG meeting temporarily  https://review.opendev.org/c/opendev/irc-meetings/+/93249509:33
TheJuliaGreetings folks, I had some discussion with the Chameleon folks at OpenInfra Days NA this week, and I'm wondering if there could be a way to do some sort of spot experimental job trigger which we could automate with Zuul. I realize that might not exist today, but it could be interesting to be able to leverage a specifically reserved hardware node to validate, for example, a driver change/fix against very specific hardware 10:18
TheJuliafor which third party CI would be unreasonable.10:18
opendevreviewMerged openstack/project-config master: Start including team as release meta field  https://review.opendev.org/c/openstack/project-config/+/92991411:58
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40  https://review.opendev.org/c/openstack/diskimage-builder/+/92210912:19
*** iurygregory_ is now known as iurygregory12:57
fungiTheJulia: i think it would come down to what api(s) the job would be calling to run whatever you want run. it sounds like the proposal was to use the ironic api to provision a physical server and then boot a devstack image, update that with the proposed change and exercise it somehow?14:45
fungii expect it's doable, just not simple, so more a question of finding someone who has time to develop the job for it14:46
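A rough sketch (not an existing job) of the direct-ironic variant fungi describes, assuming openstacksdk's baremetal proxy, a hypothetical cloud entry, and a pre-reserved node name; a real job would also need to set the node's instance_info (deploy image, etc.) before this and then ssh in to apply and exercise the proposed change:

    import openstack

    # Assumes a 'chameleon' entry in clouds.yaml and a node already reserved for CI;
    # populating instance_info (deploy image, kernel/ramdisk) is omitted here.
    conn = openstack.connect(cloud='chameleon')
    node = conn.baremetal.find_node('reserved-gpu-node', ignore_missing=False)

    # Ask ironic to deploy the node, then wait for it to reach 'active'.
    conn.baremetal.set_node_provision_state(node, 'active')
    conn.baremetal.wait_for_nodes_provision_state([node], 'active')
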
fungifor the specific case which came up in discussion at oid-na though, i suspect it wouldn't have helped without someone also provisioning an instance of nvidia's proprietary licensing server, loading it with paid licenses, and incorporating the proprietary vgpu driver module in the image to be booted14:48
Clark[m]If you use ironic via nova I expect it to work today with just configuration updates15:01
Clark[m]The big issue has always been people unwilling to provide hardware attached to the internet (which is totally fair). Not any deficiency in zuul or nodepool15:02
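By contrast, a minimal sketch of the nova-mediated path Clark describes: if the cloud exposes its ironic nodes as a nova flavor, booting one with openstacksdk looks like any other server create, which is why existing nodepool/zuul tooling could treat it as plain configuration. The cloud, flavor, and image names below are hypothetical:

    import openstack

    conn = openstack.connect(cloud='chameleon')
    server = conn.create_server(
        name='hardware-validation-node',
        flavor='baremetal',            # hypothetical flavor backed by ironic
        image='devstack-ready-image',  # hypothetical image name
        network='public',
        wait=True,
    )
    print(server.id, server.status)
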
TheJuliafungi: possibly, AIUI, they present stuff via nova, but I can talk to them about details. Definitely not simple.15:03
TheJuliaClark[m]: I think we might be able to work something out :)15:04
TheJuliaI believe mutual interest to build trust might be a factor to help bridge some gaps15:05
fungiyeah, it wasn't clear to me whether chameleon was able to provide nova baremetal access to ironic or only direct ironic api access15:05
TheJuliaI don't remember, exactly, what capabilities zuul has under the hood. A similar idea was mused about *ages* ago, but I suspect something is better than nothing if we can find some happy middle ground15:07
fungibut i do think that the specific case of needing to test nova code changes against licensed vgpus might be more complicated than can be reasonably automated15:07
TheJuliaquite possibly, but it also seems like something it wouldn't be awful to have a job in a project's back pocket15:08
TheJuliathere are $issues regardless, but yeah15:08
fungibecause it's not a case of exercising them in an existing deployment, but rather setting up a deployment and doing the additional bits to integrate the nvidia license compliance services15:08
fungialso i suspect we wouldn't be able to use our normal image management solutions if the images need to include proprietary drivers that aren't freely redistributable, since we'd end up violating the license with the images being accessible15:09
TheJuliafungi: Indeed, but the remote operator might have it and we might be able to invoke an automated job on it15:10
TheJuliaThere might be something like "here is your special flavor to use"15:11
TheJuliaand $magic, but yeah.....15:11
fungiyeah, there are definitely a lot of details to figure out15:11
TheJuliatotally15:11
TheJuliaNot something I'm going to be figuring out while on the metal tube15:12
Clark[m]Ya you'd do all of that work to install drivers in the jobs15:16
JayFI kinda wanna ask: what's the virtue of having something one-off like this integrated rather than just some kind of external scripting of devstack which could run ad-hoc15:29
Clark[m]Makes it more accessible which can lead to doing bigger and better things like additional hardware testing or better gpu integration. Ensures it is run regularly and makes reports public so there isn't any confusion over its status.15:33
Clark[m]Enables people without the hardware to make changes and have confidence in them15:33
JayFIt sounded to me like the entire idea was to *not* run it regularly? 15:33
Clark[m]If you don't intend on running it with some sort of regularity (daily/weekly or every change touching $codepath) then ya I think you lose a lot of the benefit. You'll fight why it doesn't work every 6 months when you go to run it instead15:44
JayFThat's mainly the thing playing out in my head when I asked the original question :)15:45
fungiyeah, for a bit of explanation, there was a forum discussion where someone at indiana university using nova with nvidia vgpus found a regression in 2024.1/caracal but it sounded like the nova team didn't want to entertain fixing the regression unless there was some means of automated testing for that hardware, then someone else from chameleon cloud said they had a bunch of vgpu-capable servers16:02
fungiset up if we could figure out a way to use them16:02
jrosseryou don't actually need the licence to boot a vgpu node16:03
fungioh, that might simplify matters then16:03
jrosserthe performance will be terrible after $short-time without it, but it will boot and you can verify it's ok with `nvidia-smi` or similar16:03
jrosserif you want to actually run something that uses the gpu to do useful work, that might be a different matter16:05
clarkbhaving some sort of workload would probably help confirm that everything is working, but it also probably doesn't need to be performant or useful. Just execute successfully16:06
jrosserthe nvidia cli tools are good enough for us to verify that the gpu and drivers are loaded successfully16:06
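As an illustration of the kind of check jrosser means (a sketch, assuming the proprietary driver and nvidia-smi are already installed in the image), a job could simply confirm the vGPU is visible and the driver loaded rather than run a licensed workload:

    import subprocess
    import sys

    # Query the GPU name and driver version; a non-zero exit or empty output
    # means the device isn't visible or the driver failed to load.
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,driver_version', '--format=csv,noheader'],
        capture_output=True, text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        sys.exit('GPU not visible or driver not loaded: %s' % result.stderr.strip())
    print('GPU detected:', result.stdout.strip())
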
fungiin the specific case in question, it sounded like there was also some sr-iov involved too, not sure if that complicates matters16:06
jrosserthere is also this to keep an eye on https://www.phoronix.com/news/NVIDIA-Open-GPU-Virtualization16:07
jrossersriov depends on the gpu type16:07
jrossersome are/can, some are not16:07
jrosserthis is the documentation for what happens with an unlicenced vgpu https://docs.nvidia.com/vgpu/17.0/grid-licensing-user-guide/index.html#software-enforcement-grid-licensing16:13
fungifrom what i gathered about the regression, some change in libvirt caused it to enumerate virtual devices differently, and they would no longer map up to the correct device names later when nova tried to add them16:14
fungibut i haven't had time to hunt down the proposed changes for the fix16:14
jrosserprobably this https://review.opendev.org/c/openstack/nova/+/83897616:16
jrosseror something very similar16:16
fungilooks like it was actually https://review.opendev.org/c/openstack/nova/+/92503716:24
fungiwhich was an attempted fix for something that supposedly worked in 2023.2/bobcat but stopped working in 2024.1/caracal and the objection from reviewers seems to be that it was unsupported but just happened to be working before some changes in nova16:26
jrosserhmmm interesting16:29
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/932102 should be an easy straightforward cleanup of some docker images we don't need to build any longer16:31
clarkband then https://review.opendev.org/c/opendev/system-config/+/931966 is a gitea update we should probably deploy tomorrow if the change/update itself looks fine (tomorrow because fungi should be back from travel and I'm back now)16:32
clarkbis there anything else I should be looking at to catch up from being out the last few days?16:32
clarkbThe PTG is next week too so I suspect we'll be busy with discussions and planning16:32
fungihaving not really been around, not sure either. seems like there was some success/progress with zuul-launcher image uploads to rackspace flex swift though16:45
clarkbhttps://d6c736fcc9a860f59461-fbb3a5107d50e8d0a9c9940ac7f8a1de.ssl.cf5.rackcdn.com/931966/1/check/system-config-run-gitea/aca029e/bridge99.opendev.org/screenshots/gitea-project-dib.png gitea screenshots look good and this shorter one shows the version is correct (the longer ones are taller than our virtual display so it doesn't render the version on them)16:59
corvusyeah, image uploads are working; need to elaborate a bit more on the openstack driver in order to launch nodes though; that's on my list17:19
corvusone note: we have clouds.yaml on the schedulers so that the schedulers can figure out what image format to use.  we're going to need to do the same thing on zuul-web as well.  i think that's okay, and i think it's worth the convenience to have openstacksdk tell zuul what image format to use without asking users to figure it out and duplicate it in the zuul config17:21
corvusbut if that sounds sketchy to anyone, let me know and i can revisit that.17:21
corvus(personally, i think we're going to eventually want to have all the cloud config on all the components anyway, so even if it's "just" for image formats now, i think we'll find other uses later anyway)17:22
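A small sketch of the convenience corvus describes, assuming the image_format key openstacksdk exposes via clouds.yaml defaults/vendor profiles (the cloud name is hypothetical): any component with clouds.yaml available can ask the sdk which image format a cloud wants instead of users duplicating it in the zuul config:

    from openstack.config.loader import OpenStackConfig

    # Reads clouds.yaml (and secure.yaml, if present) from the standard locations.
    config = OpenStackConfig()
    region = config.get_one(cloud='rax-flex')
    print(region.config.get('image_format', 'qcow2'))
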
clarkbI don't think we should have them on the executors. I think for the other services I'm less concerned17:22
corvusclarkb: ack.  we don't need them on the executors for niz.17:23
fungiagreed, executors are the main exposure risk for that data17:38
fungiwe could also consider splitting credentials out if some of the components only need other information from clouds.yaml and don't actually need to make openstack api calls17:40
fungiopenstacksdk does support keeping credentials in a separate file from the rest, if memory serves17:40
corvus++18:11
corvus(as in, i don't think we need it now but if we're concerned we can split)18:11
fungiagreed18:13
mordredyup. secure.yaml is totally a thing19:03
mordredin fact- I think we even used it at one point on opendev servers for reasons I've long since forgotten19:04
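For completeness, a hedged sketch of the split fungi and mordred mention, assuming the secure_files argument the openstacksdk config loader accepts: non-secret settings (regions, endpoints, image format) stay in clouds.yaml while only the auth credentials live in secure.yaml, and the loader merges the two when building a cloud's config. Paths and cloud name are illustrative:

    from openstack.config.loader import OpenStackConfig

    # clouds.yaml holds the non-secret settings; secure.yaml holds only the
    # credential pieces for each cloud and overlays them at load time.
    config = OpenStackConfig(
        config_files=['/etc/openstack/clouds.yaml'],
        secure_files=['/etc/openstack/secure.yaml'],
    )
    region = config.get_one(cloud='rax-flex')   # hypothetical cloud name
    print(sorted(region.config.get('auth', {}).keys()))
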
clarkbreminder that the PTG is next week. I've just sorted out half of a schedule for myself and need to register. A good week to avoid changes to jitsi and etherpad21:18
fungi...and to ptgbot21:18
clarkb++21:19
fungi(so maybe wait until after next week to merge my pending fixes there)21:19
clarkbnormally I would cancel the infra meeting but considering we didn't have one this week maybe we should try to catch up next week21:55
clarkbI'll operate under that assumption unless things are so busy on Monday that it seems unlikely we'll get anyone other than myself to attend21:55
clarkbTIL the gerrit plugin manager plugin will serve an intro page to gerrit the first time a user logs in22:44
clarkbI actually noticed this behavior when using zuul quickstart but didn't realize that was the plugin that does it22:44
clarkbwe don't get that behavior because we disable remote plugin administration and this only happens if you enable that feature (not sure the config options align but hey)22:44
