*** liuxie is now known as liushy | 02:50 | |
*** liuxie is now known as liushy | 07:16 | |
*** travisholton0 is now known as travisholton | 07:42 | |
*** elodilles_pto is now known as elodilles | 09:12 | |
opendevreview | Merged opendev/irc-meetings master: Move Large Scale SIG meeting temporarily https://review.opendev.org/c/opendev/irc-meetings/+/932495 | 09:33 |
TheJulia | Greetings folks, I had some discussion with the Chameleon folks at OpenInfra Days NA this week, and I'm wondering if there could be a way to do some sort of spot experimental job trigger which we could automate with Zuul. I realize that might not exist today, but it could be interesting to be able to leverage a specifically reserved hardware node to validate, for example, a driver change/fix against very specific hardware | 10:18 |
TheJulia | for which third party CI would be unreasonable. | 10:18 |
opendevreview | Merged openstack/project-config master: Start including team as release meta field https://review.opendev.org/c/openstack/project-config/+/929914 | 11:58 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for building Fedora 40 https://review.opendev.org/c/openstack/diskimage-builder/+/922109 | 12:19 |
*** iurygregory_ is now known as iurygregory | 12:57 | |
fungi | TheJulia: i think it would come down to what api(s) the job would be calling to run whatever you want run. it sounds like the proposal was to use the ironic api to provision a physical server and then boot a devstack image, update that with the proposed change and exercise it somehow? | 14:45 |
fungi | i expect it's doable, just not simple, so more a question of finding someone who has time to develop the job for it | 14:46 |
fungi | for the specific case which came up in discussion at oid-na though, i suspect it wouldn't have helped without someone also provisioning an instance of nvidia's proprietary licensing server, loading it with paid licenses, and incorporating the proprietary vgpu driver module in the image to be booted | 14:48 |
Clark[m] | If you use ironic via nova I expect it to work today with just configuration updates | 15:01 |
Clark[m] | The big issue has always been people unwilling to provide hardware attached to the internet (which is totally fair). Not any deficiency in zuul or nodepool | 15:02 |
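If Chameleon really does expose its Ironic nodes through the normal Nova API, the job side is mostly ordinary openstacksdk usage. A minimal sketch under that assumption; the cloud entry, flavor, image, and network names below are hypothetical and do not come from the discussion:

```python
import openstack

# Connect using an entry from clouds.yaml; "chameleon-baremetal" is a
# hypothetical name, not a real configured cloud.
conn = openstack.connect(cloud="chameleon-baremetal")

# Boot an Ironic-backed node through the regular Nova API. Flavor, image and
# network names are placeholders for whatever the operator actually exposes.
server = conn.create_server(
    name="zuul-experimental-node",
    flavor="baremetal-gpu",
    image="devstack-ready",
    network="sharednet",
    wait=True,        # block until the node goes ACTIVE
    timeout=3600,     # baremetal provisioning is slow
)
print(server.public_v4 or server.private_v4)
```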
TheJulia | fungi: possibly, AIUI, they present stuff via nova, but I can talk to them about details. Definitely not simple. | 15:03 |
TheJulia | Clark[m]: I think we might be able to work something out :) | 15:04 |
TheJulia | I believe mutual interest in building trust might be a factor that helps bridge some gaps | 15:05 |
fungi | yeah, it wasn't clear to me whether chameleon was able to provide nova baremetal access to ironic or only direct ironic api access | 15:05 |
TheJulia | I don't remember, exactly, what capabilities zuul has under the hood. A similar idea was mused about *ages* ago, but I suspect something is better than nothing if we can find some happy middle ground | 15:07 |
fungi | but i do think that the specific case of needing to test nova code changes against licensed vgpus might be more complicated than can be reasonably automated | 15:07 |
TheJulia | quite possibly, but it also seems like it wouldn't be awful for a project to have a job like that in its back pocket | 15:08 |
TheJulia | there are $issues regardless, but yeah | 15:08 |
fungi | because it's not a case of exercising them in an existing deployment, but rather setting up a deployment and doing the additional bits to integrate the nvidia license compliance services | 15:08 |
fungi | also i suspect we wouldn't be able to use our normal image management solutions if the images need to include proprietary drivers that aren't freely redistributable, since we'd end up violating the license with the images being accessible | 15:09 |
TheJulia | fungi: Indeed, but the remote operator might have it and we might be able to invoke an automated job on it | 15:10 |
TheJulia | There might be something like "here is your special flavor to use" | 15:11 |
TheJulia | and $magic, but yeah..... | 15:11 |
fungi | yeah, there are definitely a lot of details to figure out | 15:11 |
TheJulia | totally | 15:11 |
TheJulia | Not something I'm going to be figuring out while on the metal tube | 15:12 |
Clark[m] | Ya you'd do all of that work to install drivers in the jobs | 15:16 |
JayF | I kinda wanna ask: what's the virtue of having something one-off like this integrated rather than just some kind of external scripting of devstack which could run ad-hoc | 15:29 |
Clark[m] | Makes it more accessible which can lead to doing bigger and better things like additional hardware testing or better gpu integration. Ensures it is run regularly and makes reports public so there isn't any confusion over its status. | 15:33 |
Clark[m] | Enables people without the hardware to make changes and have confidence in them | 15:33 |
JayF | It sounded to me like the entire idea was to *not* run it regularly? | 15:33 |
Clark[m] | If you don't intend on running it with some sort of regularity (daily/weekly or every change touching $codepath) then ya I think you lose a lot of the benefit. You'll fight why it doesn't work every 6 months when you go to run it instead | 15:44 |
JayF | That's mainly the thing playing out in my head when I asked the original question :) | 15:45 |
fungi | yeah, for a bit of explanation, there was a forum discussion where someone at indiana university using nova with nvidia vgpus found a regression in 2024.1/caracal but it sounded like the nova team didn't want to entertain fixing the regression unless there was some means of automated testing for that hardware, then someone else from chameleon cloud said they had a bunch of vgpu-capable servers | 16:02 |
fungi | set up if we could figure out a way to use them | 16:02 |
jrosser | you don't actually need the licence to boot a vgpu node | 16:03 |
fungi | oh, that might simplify matters then | 16:03 |
jrosser | the performance will be terrible after $short-time without it, but it will boot and you can verify it's ok with `nvidia-smi` or similar | 16:03 |
jrosser | if you want to actually run something that uses the gpu to do useful work, that might be a different matter | 16:05 |
clarkb | having some sort of workload would probably help confirm that everything is working, but it also probably doesn't need to be performant or useful. Just execute successfully | 16:06 |
jrosser | the nvidia cli tools are good enough for us to verify that the gpu and drivers are loaded successfully | 16:06 |
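A rough sketch of the kind of check jrosser describes: shell out to the NVIDIA CLI tools inside the guest to confirm a vGPU is visible and its driver loaded, which works even without a license. This assumes nvidia-smi is installed in the image; the wrapper itself is illustrative, not something from the discussion:

```python
import subprocess
import sys


def vgpu_visible() -> bool:
    """Return True if nvidia-smi can enumerate at least one GPU and driver."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # nvidia-smi missing, timed out, or returned non-zero: treat as failure
        return False
    return bool(result.stdout.strip())


if __name__ == "__main__":
    sys.exit(0 if vgpu_visible() else 1)
```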
fungi | in the specific case in question, it sounded like there was also some sr-iov involved too, not sure if that complicates matters | 16:06 |
jrosser | there is also this to keep an eye on https://www.phoronix.com/news/NVIDIA-Open-GPU-Virtualization | 16:07 |
jrosser | sriov depends on the gpu type | 16:07 |
jrosser | some can, some cannot | 16:07 |
jrosser | this is the documentation for what happens with an unlicenced vgpu https://docs.nvidia.com/vgpu/17.0/grid-licensing-user-guide/index.html#software-enforcement-grid-licensing | 16:13 |
fungi | from what i gathered about the regression, some change in libvirt caused it to enumerate virtual devices differently, and they would no longer map up to the correct device names later when nova tried to add them | 16:14 |
fungi | but i haven't had time to hunt down the proposed changes for the fix | 16:14 |
jrosser | probably this https://review.opendev.org/c/openstack/nova/+/838976 | 16:16 |
jrosser | or something very similar | 16:16 |
fungi | looks like it was actually https://review.opendev.org/c/openstack/nova/+/925037 | 16:24 |
fungi | which was an attempted fix for something that supposedly worked in 2023.2/bobcat but stopped working in 2024.1/caracal and the objection from reviewers seems to be that it was unsupported but just happened to be working before some changes in nova | 16:26 |
jrosser | hmmm interesting | 16:29 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/932102 should be an easy, straightforward cleanup of some docker images we don't need to build any longer | 16:31 |
clarkb | and then https://review.opendev.org/c/opendev/system-config/+/931966 is a gitea update we should probably deploy tomorrow if the change/update itself looks fine (tomorrow because fungi should be back from travel and I'm back now) | 16:32 |
clarkb | is there anything else I should be looking at to catch up from being out the last few days? | 16:32 |
clarkb | The PTG is next week too so I suspect we'll be busy with discussions and planning | 16:32 |
fungi | having not really been around, not sure either. seems like there was some success/progress with zuul-launcher image uploads to rackspace flex swift though | 16:45 |
clarkb | https://d6c736fcc9a860f59461-fbb3a5107d50e8d0a9c9940ac7f8a1de.ssl.cf5.rackcdn.com/931966/1/check/system-config-run-gitea/aca029e/bridge99.opendev.org/screenshots/gitea-project-dib.png gitea screenshots look good and this shorter one shows the version is correct (the longer ones are taller than our virtual display so it doesn't render the version on them) | 16:59 |
corvus | yeah, image uploads are working; need to elaborate a bit more on the openstack driver in order to launch nodes though; that's on my list | 17:19 |
corvus | one note: we have clouds.yaml on the schedulers so that the schedulers can figure out what image format to use. we're going to need to do the same thing on zuul-web as well. i think that's okay, and i think it's worth the convenience to have openstacksdk tell zuul what image format to use without asking users to figure it out and duplicate it in the zuul config | 17:21 |
corvus | but if that sounds sketchy to anyone, let me know and i can revisit that. | 17:21 |
corvus | (personally, i think we're going to eventually want to have all the cloud config on all the components anyway, so even if it's "just" for image formats now, i think we'll find other uses later anyway) | 17:22 |
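For context, the convenience corvus is describing could look roughly like the following, assuming openstacksdk exposes the cloud's preferred disk format through its per-cloud config dict; the cloud name and the exact config key are assumptions, not confirmed by the discussion:

```python
import openstack.config

# Load clouds.yaml the same way the SDK itself would.
loader = openstack.config.OpenStackConfig()
region = loader.get_one(cloud="rax-flex")  # hypothetical cloud name

# openstacksdk carries a per-cloud "image_format" setting (qcow2 by default,
# vhd for some providers); reading it here is the assumption being sketched.
image_format = region.config.get("image_format", "qcow2")
print(f"{region.name}: upload images as {image_format}")
```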
clarkb | I don't think we should have them on the executors. I think for the other services I'm less concerned | 17:22 |
corvus | clarkb: ack. we don't need them on the executors for niz. | 17:23 |
fungi | agreed, executors are the main exposure risk for that data | 17:38 |
fungi | we could also consider splitting credentials out if some of the components only need other information from clouds.yaml and don't actually need to make openstack api calls | 17:40 |
fungi | openstacksdk does support keeping credentials in a separate file from the rest, if memory serves | 17:40 |
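If memory serves (as fungi hedges), the split could look roughly like this on the Python side: point openstacksdk's loader at a clouds.yaml without credentials and a separate secure.yaml that carries them. The file paths, cloud name, and the secure_files parameter are assumptions used to illustrate the idea:

```python
from openstack.config import OpenStackConfig

# Non-secret settings (regions, endpoints, image_format, ...) in clouds.yaml,
# credentials in secure.yaml; the loader merges the two. Paths are examples.
loader = OpenStackConfig(
    config_files=["/etc/openstack/clouds.yaml"],
    secure_files=["/etc/openstack/secure.yaml"],
)
region = loader.get_one(cloud="rax-flex")  # hypothetical cloud name

# After merging, the auth dict should hold both the endpoint information from
# clouds.yaml and the username/password pulled in from secure.yaml.
print(sorted(region.config["auth"].keys()))
```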
corvus | ++ | 18:11 |
corvus | (as in, i don't think we need it now but if we're concerned we can split) | 18:11 |
fungi | agreed | 18:13 |
mordred | yup. secure.yaml is totally a thing | 19:03 |
mordred | in fact- I think we even used it at one point on opendev servers for reasons I've long since forgotten | 19:04 |
clarkb | reminder that the PTG is next week. I've just sorted out half of a schedule for myself and need to register. A good week to avoid changes to jitsi and etherpad | 21:18 |
fungi | ...and to ptgbot | 21:18 |
clarkb | ++ | 21:19 |
fungi | (so maybe wait until after next week to merge my pending fixes there) | 21:19 |
clarkb | normally I would cancel the infra meeting but considering we didn't have one this week maybe we should try to catch up next week | 21:55 |
clarkb | I'll operate under that assumption unless things are so busy on Monday that it seems unlikely we'll get anyone other than myself to attend | 21:55 |
clarkb | TIL the gerrit plugin manager plugin will serve an intro page to gerrit the first time a user logs in | 22:44 |
clarkb | I actually noticed this behavior when using zuul quickstart but didn't realize that was the plugin that does it | 22:44 |
clarkb | we don't get that behavior because we disable remote plugin administration and this only happens if you enable that feature (not sure the config options align, but hey) | 22:44 |