*** liuxie is now known as liushy | 02:50 | |
*** liuxie is now known as liushy | 07:16 | |
*** travisholton0 is now known as travisholton | 07:42 | |
*** elodilles_pto is now known as elodilles | 09:12 | |
opendevreview | Merged opendev/irc-meetings master: Move Large Scale SIG meeting temporarily https://review.opendev.org/c/opendev/irc-meetings/+/932495 | 09:33 |
TheJulia | Greetings folks, I had some discussion with the Chameleon folks at OpenInfra Days NA this week, and I'm wondering if there could be a way to do some sort of spot experimental job trigger which we could automate with Zuul. I realize that might not exist today, but it could be interesting to be able to leverage a specifically reserved hardware node to validate, for example, a driver change/fix against very specific hardware | 10:18 |
TheJulia | for which third party CI would be unreasonable. | 10:18 |
opendevreview | Merged openstack/project-config master: Start including team as release meta field https://review.opendev.org/c/openstack/project-config/+/929914 | 11:58 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 12:19 |
*** iurygregory_ is now known as iurygregory | 12:57 | |
fungi | TheJulia: i think it would come down to what api(s) the job would be calling to run whatever you want run. it sounds like the proposal was to use the ironic api to provision a physical server and then boot a devstack image, update that with the proposed change and exercise it somehow? | 14:45 |
fungi | i expect it's doable, just not simple, so more a question of finding someone who has time to develop the job for it | 14:46 |
fungi | for the specific case which came up in discussion at oid-na though, i suspect it wouldn't have helped without someone also provisioning an instance of nvidia's proprietary licensing server, loading it with paid licenses, and incorporating the proprietary vgpu driver module in the image to be booted | 14:48 |
Clark[m] | If you use ironic via nova I expect it to work today with just configuration updates | 15:01 |
Clark[m] | The big issue has always been people unwilling to provide hardware attached to the internet (which is totally fair). Not any deficiency in zuul or nodepool | 15:02 |
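If Chameleon really does expose its Ironic nodes through the normal Nova API, the job side is mostly ordinary openstacksdk usage. A minimal sketch under that assumption; the cloud entry, flavor, image, and network names below are hypothetical and do not come from the discussion:

```python
import openstack

# Connect using an entry from clouds.yaml; "chameleon-baremetal" is a
# hypothetical name, not a real configured cloud.
conn = openstack.connect(cloud="chameleon-baremetal")

# Boot an Ironic-backed node through the regular Nova API. Flavor, image and
# network names are placeholders for whatever the operator actually exposes.
server = conn.create_server(
    name="zuul-experimental-node",
    flavor="baremetal-gpu",
    image="devstack-ready",
    network="sharednet",
    wait=True,        # block until the node goes ACTIVE
    timeout=3600,     # baremetal provisioning is slow
)
print(server.public_v4 or server.private_v4)
```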
TheJulia | fungi: possibly, AIUI, they present stuff via nova, but I can talk to them about details. Definitely not simple. | 15:03 |
TheJulia | Clark[m]: I think we might be able to work something out :) | 15:04 |
TheJulia | I believe mutual interest in building trust might be a factor that helps bridge some gaps | 15:05 |
fungi | yeah, it wasn't clear to me whether chameleon was able to provide nova baremetal access to ironic or only direct ironic api access | 15:05 |
TheJulia | I don't remember, exactly, what capabilities zuul has under the hood. A similar idea was mused about *ages* ago, but I suspect something is better than nothing if we can find some happy middle ground | 15:07 |
fungi | but i do think that the specific case of needing to test nova code changes against licensed vgpus might be more complicated than can be reasonably automated | 15:07 |
TheJulia | quite possibly, but it also seems like it wouldn't be awful for a project to have a job like that in its back pocket | 15:08 |
TheJulia | there are $issues regardless, but yeah | 15:08 |
fungi | because it's not a case of exercising them in an existing deployment, but rather setting up a deployment and doing the additional bits to integrate the nvidia license compliance services | 15:08 |
fungi | also i suspect we wouldn't be able to use our normal image management solutions if the images need to include proprietary drivers that aren't freely redistributable, since we'd end up violating the license with the images being accessible | 15:09 |
TheJulia | fungi: Indeed, but the remote operator might have it and we might be able to invoke an automated job on it | 15:10 |
TheJulia | There might be something like "here is your special flavor to use" | 15:11 |
TheJulia | and $magic, but yeah..... | 15:11 |
fungi | yeah, there are definitely a lot of details to figure out | 15:11 |
TheJulia | totally | 15:11 |
TheJulia | Not something I'm going to be figuring out while on the metal tube | 15:12 |
Clark[m] | Ya you'd do all of that work to install drivers in the jobs | 15:16 |
JayF | I kinda wanna ask: what's the virtue of having something one-off like this integrated rather than just some kind of external scripting of devstack which could run ad-hoc | 15:29 |
Clark[m] | Makes it more accessible which can lead to doing bigger and better things like additional hardware testing or better gpu integration. Ensures it is run regularly and makes reports public so there isn't any confusion over its status. | 15:33 |
Clark[m] | Enables people without the hardware to make changes and have confidence in them | 15:33 |
JayF | It sounded to me like the entire idea was to *not* run it regularly? | 15:33 |
Clark[m] | If you don't intend on running it with some sort of regularity (daily/weekly or every change touching $codepath) then ya I think you lose a lot of the benefit. You'll fight why it doesn't work every 6 months when you go to run it instead | 15:44 |
JayF | That's mainly the thing playing out in my head when I asked the original question :) | 15:45 |
fungi | yeah, for a bit of explanation, there was a forum discussion where someone at indiana university using nova with nvidia vgpus found a regression in 2024.1/caracal but it sounded like the nova team didn't want to entertain fixing the regression unless there was some means of automated testing for that hardware, then someone else from chameleon cloud said they had a bunch of vgpu-capable servers | 16:02 |
fungi | set up if we could figure out a way to use them | 16:02 |
jrosser | you don't actually need the licence to boot a vgpu node | 16:03 |
fungi | oh, that might simplify matters then | 16:03 |
jrosser | the performance will be terrible after $short-time without it, but it will boot and you can verify it's ok with `nvidia-smi` or similar | 16:03 |
jrosser | if you want to actually run something that uses the gpu to do useful work, that might be a different matter | 16:05 |
clarkb | having some sort of workload would probably help confirm that everything is working, but it also probably doesn't need to be performant or useful. Just execute successfully | 16:06 |
jrosser | the nvidia cli tools are good enough for us to verify that the gpu and drivers are loaded successfully | 16:06 |
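A rough sketch of the kind of check jrosser describes: shell out to the NVIDIA CLI tools inside the guest to confirm a vGPU is visible and its driver loaded, which works even without a license. This assumes nvidia-smi is installed in the image; the wrapper itself is illustrative, not something from the discussion:

```python
import subprocess
import sys


def vgpu_visible() -> bool:
    """Return True if nvidia-smi can enumerate at least one GPU and driver."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # nvidia-smi missing, timed out, or returned non-zero: treat as failure
        return False
    return bool(result.stdout.strip())


if __name__ == "__main__":
    sys.exit(0 if vgpu_visible() else 1)
```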
fungi | in the specific case in question, it sounded like there was also some sr-iov involved too, not sure if that complicates matters | 16:06 |
jrosser | there is also this to keep an eye on https://www.phoronix.com/news/NVIDIA-Open-GPU-Virtualization | 16:07 |
jrosser | sriov depends on the gpu type | 16:07 |
jrosser | some can, some cannot | 16:07 |
jrosser | this is the documentation for what happens with an unlicenced vgpu https://docs.nvidia.com/vgpu/17.0/grid-licensing-user-guide/index.html#software-enforcement-grid-licensing | 16:13 |
fungi | from what i gathered about the regression, some change in libvirt caused it to enumerate virtual devices differently, and they would no longer map up to the correct device names later when nova tried to add them | 16:14 |
fungi | but i haven't had time to hunt down the proposed changes for the fix | 16:14 |
jrosser | probably this https://review.opendev.org/c/openstack/nova/+/838976 | 16:16 |
jrosser | or something very similar | 16:16 |
fungi | looks like it was actually https://review.opendev.org/c/openstack/nova/+/925037 | 16:24 |
fungi | which was an attempted fix for something that supposedly worked in 2023.2/bobcat but stopped working in 2024.1/caracal and the objection from reviewers seems to be that it was unsupported but just happened to be working before some changes in nova | 16:26 |
jrosser | hmmm interesting | 16:29 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/932102 should be an easy, straightforward cleanup of some docker images we don't need to build any longer | 16:31 |
clarkb | and then https://review.opendev.org/c/opendev/system-config/+/931966 is a gitea update we should probably deploy tomorrow if the change/update itself looks fine (tomorrow because fungi should be back from travel and I'm back now) | 16:32 |
clarkb | is there anything else I should be looking at to catch up from being out the last few days? | 16:32 |
clarkb | The PTG is next week too so I suspect we'll be busy with discussions and planning | 16:32 |
fungi | having not really been around, not sure either. seems like there was some success/progress with zuul-launcher image uploads to rackspace flex swift though | 16:45 |
clarkb | https://d6c736fcc9a860f59461-fbb3a5107d50e8d0a9c9940ac7f8a1de.ssl.cf5.rackcdn.com/931966/1/check/system-config-run-gitea/aca029e/bridge99.opendev.org/screenshots/gitea-project-dib.png gitea screenshots look good and this shorter one shows the version is correct (the longer ones are taller than our virtual display so it doesn't render the version on them) | 16:59 |
corvus | yeah, image uploads are working; need to elaborate a bit more on the openstack driver in order to launch nodes though; that's on my list | 17:19 |
corvus | one note: we have clouds.yaml on the schedulers so that the schedulers can figure out what image format to use. we're going to need to do the same thing on zuul-web as well. i think that's okay, and i think it's worth the convenience to have openstacksdk tell zuul what image format to use without asking users to figure it out and duplicate it in the zuul config | 17:21 |
corvus | but if that sounds sketchy to anyone, let me know and i can revisit that. | 17:21 |
corvus | (personally, i think we're going to eventually want to have all the cloud config on all the components anyway, so even if it's "just" for image formats now, i think we'll find other uses later anyway) | 17:22 |
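For context, the convenience corvus is describing could look roughly like the following, assuming openstacksdk exposes the cloud's preferred disk format through its per-cloud config dict; the cloud name and the exact config key are assumptions, not confirmed by the discussion:

```python
import openstack.config

# Load clouds.yaml the same way the SDK itself would.
loader = openstack.config.OpenStackConfig()
region = loader.get_one(cloud="rax-flex")  # hypothetical cloud name

# openstacksdk carries a per-cloud "image_format" setting (qcow2 by default,
# vhd for some providers); reading it here is the assumption being sketched.
image_format = region.config.get("image_format", "qcow2")
print(f"{region.name}: upload images as {image_format}")
```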
clarkb | I don't think we should have them on the executors. I think for the other services I'm less concerned | 17:22 |
corvus | clarkb: ack. we don't need them on the executors for niz. | 17:23 |
fungi | agreed, executors are the main exposure risk for that data | 17:38 |
fungi | we could also consider splitting credentials out if some of the components only need other information from clouds.yaml and don't actually need to make openstack api calls | 17:40 |
fungi | openstacksdk does support keeping credentials in a separate file from the rest, if memory serves | 17:40 |
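If memory serves (as fungi hedges), the split could look roughly like this on the Python side: point openstacksdk's loader at a clouds.yaml without credentials and a separate secure.yaml that carries them. The file paths, cloud name, and the secure_files parameter are assumptions used to illustrate the idea:

```python
from openstack.config import OpenStackConfig

# Non-secret settings (regions, endpoints, image_format, ...) in clouds.yaml,
# credentials in secure.yaml; the loader merges the two. Paths are examples.
loader = OpenStackConfig(
    config_files=["/etc/openstack/clouds.yaml"],
    secure_files=["/etc/openstack/secure.yaml"],
)
region = loader.get_one(cloud="rax-flex")  # hypothetical cloud name

# After merging, the auth dict should hold both the endpoint information from
# clouds.yaml and the username/password pulled in from secure.yaml.
print(sorted(region.config["auth"].keys()))
```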
corvus | ++ | 18:11 |
corvus | (as in, i don't think we need it now but if we're concerned we can split) | 18:11 |
fungi | agreed | 18:13 |
mordred | yup. secure.yaml is totally a thing | 19:03 |
mordred | in fact- I think we even used it at one point on opendev servers for reasons I've long since forgotten | 19:04 |
clarkb | reminder that the PTG is next week. I've just sorted out half of a schedule for myself and need to register. A good week to avoid changes to jitsi and etherpad | 21:18 |
fungi | ...and to ptgbot | 21:18 |
clarkb | ++ | 21:19 |
fungi | (so maybe wait until after next week to merge my pending fixes there) | 21:19 |
clarkb | normally I would cancel the infra meeting but considering we didn't have one this week maybe we should try to catch up next week | 21:55 |
clarkb | I'll operate under that assumption unless things are so busy on Monday that it seems unlikely we'll get anyone other than myself to attend | 21:55 |
clarkb | TIL the gerrit plugin manager plugin will serve an intro page to gerrit the first time a user logs in | 22:44 |
clarkb | I actually noticed this behavior when using zuul quickstart but didn't realize that was the plugin that does it | 22:44 |
clarkb | we don't get that behavior because we disable remote plugin administration and this only happens if you enable that feature (not sure the config options align, but hey) | 22:44 |