opendevreview | Si Snow proposed openstack/nova master: Improve the use of castellan.key_manager https://review.opendev.org/c/openstack/nova/+/934225 | 01:30 |
---|---|---|
opendevreview | Si Snow proposed openstack/nova master: Improve the use of castellan.key_manager https://review.opendev.org/c/openstack/nova/+/934225 | 03:24 |
*** __ministry is now known as Guest9049 | 04:04 | |
opendevreview | Ghanshyam proposed openstack/nova master: Remove default override for config options policy_file https://review.opendev.org/c/openstack/nova/+/934578 | 05:44 |
ratailor | #openstack-nova Can I get attention on my open patches? Some of them have a +1 from zuul but are waiting for reviews from cores. | 10:05 |
ratailor | https://review.opendev.org/q/owner:ratailor@redhat.com+status:open | 10:05 |
ratailor | especially https://review.opendev.org/c/openstack/nova/+/928933 where api-samples tests are failing. I tried to debug, but couldn't work out what's missing. It would be great if anyone with expertise in api-samples could help there. | 10:06 |
gibi | ratailor: added hints | 12:42 |
ratailor | gibi. Thanks! | 12:43 |
ratailor | but there is no api action yet, because the finish method of instance action is not called anywhere as of now. | 12:43 |
ratailor | that's why it's the first patch in bp to add support for showing the same to user in show api response. | 12:44 |
gibi | ratailor: do you mean that the finish_time is always None after your first patch? If so then the template for the api test should expect None, not a strtime | 12:47 |
gibi | ratailor: but also we tend to order the patches differently to have the API change as the last one in the series. | 12:48 |
ratailor | gibi, let me check if version bump resolves this. | 12:48 |
gibi | the version bump only resolves some of the tests, the version api tests | 12:48 |
ratailor | this is the only patch for implementation | 12:48 |
ratailor | but we have strtime for other time related fields, like start_time, updated_at etc. | 12:49 |
ratailor | gibi, also could you please review other patches, which are waiting for review since some time https://review.opendev.org/c/openstack/nova/+/873901 https://review.opendev.org/c/openstack/nova/+/911108 https://review.opendev.org/c/openstack/nova/+/901810 | 12:52 |
gibi | ratailor: I could not commit to other reviews right now. | 12:52 |
gibi | but lets discuss the finish_time further as I'm confused | 12:53 |
ratailor | gibi, whenever you have time later, but please add those to your review list. :) | 12:53 |
ratailor | gibi, yea, sure. | 12:53 |
gibi | 13:44 < ratailor> that's why it's the first patch in bp to add support for showing the same to user in show api response. | 12:53 |
gibi | vs | 12:53 |
gibi | ratailor> this is the only patch for implementation | 12:53 |
gibi | so will there be other patches in the bp? | 12:54 |
ratailor | no, for nova this is the only patch. and if osc and sdk related changes are required, then those too. | 12:55 |
gibi | or is this just adding a field to the API that is always None? | 12:55 |
ratailor | so long story short, we have this lp https://bugs.launchpad.net/nova/+bug/2058928 which requires calling the finish method of instance action, which we don't do currently. | 12:55 |
ratailor | there is already a finish_time field in the InstanceAction object, but we don't show it to the user in the instance action show api response. | 12:56 |
ratailor | so this bp is just to add support for showing finish_time in the instance action show response. | 12:57 |
ratailor | after that there will be a series of patches to call the finish method of instance action wherever we start the action but don't finish it yet (which is the case for every api where the instance_action start method is called). | 12:58 |
ratailor | so currently this is the case | 12:58 |
ratailor | https://paste.opendev.org/show/bKShu6Vcc0Q9Ej4nzZgj/ | 12:58 |
gibi | ratailor: OK. So it does mean that after the new microversion finish_time will be None for all responses until the further patches are added to actually call action_finish. | 13:00 |
ratailor | gibi, yes. | 13:00 |
gibi | then 1) you need to change the api sample template to expect None instead of strtime | 13:00 |
ratailor | gibi, but do we need to change it to strtime once we start calling the finish method of instance action for any api? | 13:01 |
gibi | 2) at some point in the future patches that test will fail, as it will start returning a time instead of None, which is a bit problematic, but I need to see how others feel about that | 13:01 |
gibi | hence my point of ordering of patches | 13:01 |
gibi | I would rather expose a field in the API after that field has meaningful value | 13:02 |
ratailor | but the bp required only one patch to expose the field to user. | 13:02 |
ratailor | the field is already there, we are just missing the functionality to show it to user. | 13:02 |
gibi | bauzas: sean-k-mooney: dansmith: ^^ any opinion about the ordering here? The https://review.opendev.org/c/openstack/nova/+/928933 would expose instance action finish_time in a new microversion but at that point the field would be always None and later patches will add the population of finish_time. | 13:03 |
gibi | ratailor: showing always None to the user is not useful in my eyes. It becomes useful as soon as it can be not None. Maybe you can add one patch that calls finish for at least one action and then test the API via that action. | 13:05 |
sean-k-mooney | i feel like there is little point in exposing it until it works | 13:05 |
sean-k-mooney | so to me that should be the last patch | 13:05 |
ratailor | gibi, yes, that's not useful. | 13:06 |
ratailor | gibi, sean-k-mooney ack. I will prepare one patch for any api and see if that works. And if it requires all patches, then I will add as many as required for other apis. | 13:07 |
ratailor | gibi, sean-k-mooney Thanks for your suggestions. Also please provide your reviews on other patches. | 13:08 |
gibi | ratailor: I think the action list tests create a VM and stop it. If you fix those first then the api sample test will probably work. | 13:18 |
ratailor | gibi, ack. sure. | 13:19 |
ratailor | gibi, I will then try to add finish call to stop api first and see how it goes. | 13:21 |
ratailor | gibi, Thanks for the hint. :) | 13:21 |
*** ratailor is now known as ratailor|dinner | 13:22 | |
opendevreview | Rajesh Tailor proposed openstack/nova master: Fix instance vm_state during shelve https://review.opendev.org/c/openstack/nova/+/934294 | 13:50 |
*** ratailor|dinner is now known as ratailor | 14:02 | |
dansmith | gibi: agree, no point until it's useful | 14:24 |
gibi | sean-k-mooney, dansmith: thanks for chiming in, that was my immediate feeling as well | 14:35 |
opendevreview | Merged openstack/nova-specs master: fix footnote refernces https://review.opendev.org/c/openstack/nova-specs/+/934493 | 15:06 |
moot | Hello ! I am having trouble with my GPU being detected as netdev when trying to set up vgpu managed by nova, I am on Caracal. | 15:20 |
moot | When trying to understand what could have gone wrong I stumbled across this bit of code and I am wondering if there is an indentation error : https://github.com/openstack/nova/blob/stable/2024.1/nova/virt/libvirt/host.py#L1487-L1489 | 15:20 |
moot | If there is not, I am really not sure about what can cause those errors : | 15:20 |
moot | "Could not get a PF mac for 0000:41:00.0 _get_sriov_netdev_details /var/lib/kolla/venv/lib/python3.11/site-packages/nova/virt/libvirt/host.py:1455" | 15:20 |
moot | "Cannot get MAC address of the PF 0000:41:00.0. It is probably attached to a guest already _get_pf_details /var/lib/kolla/venv/lib/python3.11/site-packages/nova/virt/libvirt/host.py:1318" | 15:20 |
sean-k-mooney | so there are two factors, first in the pci dev-spec it's important to make sure you have not set physical_network on the entry for the gpus | 15:22 |
sean-k-mooney | second i believe in newer releases of nova that is just a warning not an error | 15:23 |
sean-k-mooney | 2024.1 should not error out for example | 15:23 |
zigo | Mickael is already running Caracal. | 15:25 |
sean-k-mooney | moot: one other thing to note is that vGPU is not the same thing as pci passthrough and today we do not support using the VFs from nvidia vGPUs directly | 15:25 |
sean-k-mooney | intel and amd gpus support sriov and that can be used with nova's generic pci passthrough support | 15:26 |
sean-k-mooney | but nvidia's vGPU support uses mdevs today, we are looking at adding support for the vfio-variant drivers in 2025.1 but that work has not reached the POC stage yet | 15:27 |
moot | but if I have : | 15:29 |
moot | [devices] | 15:29 |
moot | enabled_mdev_types = nvidia-746 | 15:29 |
moot | [mdev_nvidia-746] | 15:29 |
moot | device_addresses = 0000:41:00.0 | 15:29 |
moot | in nova.conf as told in https://docs.openstack.org/nova/latest/admin/virtual-gpu.html | 15:29 |
moot | a resource provider should appear, right? | 15:29 |
sean-k-mooney | it should yes. are you actually getting an error in the logs | 15:30 |
sean-k-mooney | the code you are referencing here https://github.com/openstack/nova/blob/stable/2024.1/nova/virt/libvirt/host.py#L1487-L1489 is not related to vgpus | 15:30 |
sean-k-mooney | the _get_sriov_netdev_details function has a graceful fallback in the case that the VF is not a nic | 15:35 |
moot | It is not an error nor a warning just a DEBUG log | 15:35 |
sean-k-mooney | right this one https://github.com/openstack/nova/blob/stable/2024.1/nova/virt/libvirt/host.py#L1455 | 15:35 |
sean-k-mooney | you can ignore that | 15:35 |
sean-k-mooney | it's not meant for operators, it's meant for developers to know which code path was taken | 15:36 |
moot | okay but the RP is never created, I enabled DEBUG hoping I would find some clue | 15:37 |
sean-k-mooney | ack well we can probably help you debug that but that's not related to that log :) | 15:37 |
sean-k-mooney | can you post the output of "ls /sys/class/mdev_bus/*/mdev_supported_types" somewhere | 15:38 |
sean-k-mooney | let's see if your host is properly configured to be capable of supporting mdevs on that device and with that type | 15:38 |
moot | sure ! | 15:40 |
moot | https://paste.opendev.org/show/buohdB4juV5JTLRA0mna/ | 15:40 |
sean-k-mooney | ah i see the problem | 15:40 |
sean-k-mooney | so 0000:41:00.0 is the PF correct | 15:40 |
sean-k-mooney | the physical gpu | 15:40 |
moot | yes | 15:41 |
sean-k-mooney | you have enabled mig mode on the gpu which moves the mdevs from the PF to the VF | 15:41 |
sean-k-mooney | or VFs plural | 15:41 |
sean-k-mooney | so you need to update device_addresses with the pci addresses of the vfs | 15:41 |
sean-k-mooney | specifically you should only list a number of addresses equal to the number of nvidia-746 instances that mdev type can create | 15:42 |
sean-k-mooney | if you do not enable mig mode with "sriov-manage -e" then all the mdevs will work in time slicing mode and will be reported on the PF | 15:43 |
sean-k-mooney | in other words, whether you list the PF address or the VF addresses depends on whether you're using timesliced vgpus or mig mode vgpus | 15:45 |
moot | A2 nvidia GPU should not support mig mode if I am not mistaken ? | 15:48 |
moot | I tried to provide the pci addresses of the VFs to the nova.conf device_addresses field and it worked, but I wanted to let nova manage the whole PF. | 15:48 |
moot | I don't think I enabled MIG mode. I enabled my VFs using: /usr/lib/nvidia/sriov-manage -e 0000:41:00.0 | 15:48 |
moot | I think I am already in the time slicing mode | 15:51 |
opendevreview | Takashi Kajinami proposed openstack/nova master: Skip functional tests on pre-commit config update https://review.opendev.org/c/openstack/nova/+/933192 | 15:54 |
moot | I am double checking about the mig mode that you talked about to be sure | 15:56 |
moot | I get : | 15:59 |
moot | >> sudo nvidia-smi -mig 0 | 15:59 |
moot | Unable to disable MIG Mode for GPU 00000000:41:00.0: Not Supported | 15:59 |
moot | >> sudo nvidia-smi -mig 1 | 15:59 |
moot | Unable to enable MIG Mode for GPU 00000000:41:00.0: Not Supported | 15:59 |
sean-k-mooney | so i think you are only meant to run "/usr/lib/nvidia/sriov-manage -e" if you don't want time slicing | 16:15 |
sean-k-mooney | with that said nvidia keeps changing how that works | 16:15 |
sean-k-mooney | so even if that was true once it may not be anymore | 16:16 |
moot | MIG mode is not enabled on my gpu I can see that using nvidia-smi -q | 16:16 |
sean-k-mooney | moot: if you don't have any use case for using the mdevs on the host you can omit the device address and instead define the maximum mdevs i think | 16:16 |
sean-k-mooney | oh that's missing from our docs entirely | 16:18 |
sean-k-mooney | that might be because it's an option in a dynamic config section | 16:18 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/conf/devices.py#L108-L115 | 16:19 |
sean-k-mooney | that is not invoked as part of our docs generation. not really sure how to fix that | 16:20 |
moot | I will try that but it does not work in a scenario with multiple PFs, right? | 16:20 |
sean-k-mooney | it does provided you do not want to use different mdev types per pf | 16:21 |
sean-k-mooney | the second you want to do that you need to statically partition them | 16:21 |
moot | even if I am using time sliced gpus ? | 16:22 |
sean-k-mooney | so in the past for timesliced vgpu you did not use sriov-manage -e <address> | 16:22 |
sean-k-mooney | so it might be worth testing without doing that and seeing if the mdevs move back to the pf | 16:23 |
sean-k-mooney | that is how it used to work before nvidia released cards that support sriov | 16:23 |
moot | you mean testing without enabling the VFs, so no sriov-manage -e <address>? | 16:23 |
sean-k-mooney | nvidia have unfortunately changed this behavior on the same card with different versions of their driver/firmware | 16:23 |
sean-k-mooney | moot: correct | 16:24 |
sean-k-mooney | if you don't enable sriov i would expect the mdev capability to be advertised on the PF instead and then in nova you could list just the pf address | 16:24 |
moot | I can do that, I will look into contributing to the doc if you think it needs some updates | 16:24 |
sean-k-mooney | thats how it worked on the T and older generation of cards | 16:25 |
sean-k-mooney | this is kind of why we just say "look at the vendor docs" for that part | 16:25 |
sean-k-mooney | i.e. https://docs.nvidia.com/vgpu/5.0/pdf/grid-vgpu-user-guide.pdf | 16:26 |
moot | well the nvidia doc just tells you to enable the VFs if I am not mistaken | 16:26 |
moot | btw I tried to remove the pci address from the nova conf add a resource provider was created for each VF | 16:29 |
moot | "If not set, it implies that we use the maximum allowed by the type." | 16:29 |
moot | and not add * | 16:32 |
zigo | sean-k-mooney: Thanks for helping my (intern) colleague that was kind of lost! :) | 16:45 |
sean-k-mooney | zigo: vgpus are not a kind thing to expose interns to :P happy to help | 16:46 |
sean-k-mooney | this is one of the cases where nova is kind of helpless to document properly because it not only depends on the vendor but also the card and the driver used | 16:47 |
sean-k-mooney | and you know the phase of the moon, your choice in whiskey and if you are feeling lucky today | 16:47 |
zigo | sean-k-mooney: Yeah, but that one intern is special and knows his stuff ... :P | 16:47 |
sean-k-mooney | moot: if you do see a way to make the docs better please feel free to | 16:48 |
zigo | I'll beat him until all is right! :P | 16:55 |
moot | sean-k-mooney: I tried not enabling SRIOV as discussed but the PF does not seem to be mdev capable, so no RP is created. | 17:02 |
moot | when I specify neither the device id nor the mdev max number, nova does seem to manage well. I did not test extensively, I will do that tomorrow! | 17:02 |
moot | Anyway, thank you for your help. I will try my best to find a way to add this information to the docs! | 17:02 |
opendevreview | Ghanshyam proposed openstack/nova master: Remove default override for config options policy_file https://review.opendev.org/c/openstack/nova/+/934578 | 21:29 |
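
The vGPU debugging above hinged on which PCI addresses (the PF or its VFs) advertise the `nvidia-746` mdev type under `/sys/class/mdev_bus/*/mdev_supported_types`. A minimal sketch of that check in Python follows. It assumes the standard kernel mediated-device sysfs layout, where each type directory contains an `available_instances` file. The function name `find_mdev_capable_devices` is hypothetical and is not part of nova.

```python
from pathlib import Path


def find_mdev_capable_devices(mdev_type, sysfs_root="/sys/class/mdev_bus"):
    """Map each PCI address advertising mdev_type to its available_instances.

    Hypothetical helper for illustration only; nova does not ship it.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No mdev-capable devices are registered (or the driver is not loaded).
        return result
    for dev in sorted(root.iterdir()):
        # Assumed layout: <addr>/mdev_supported_types/<type>/available_instances
        avail = dev / "mdev_supported_types" / mdev_type / "available_instances"
        if avail.is_file():
            result[dev.name] = int(avail.read_text().strip())
    return result


if __name__ == "__main__":
    for addr, count in find_mdev_capable_devices("nvidia-746").items():
        print(f"{addr}: {count} instance(s) available")
```

If only VF addresses show up for the type, as in moot's paste, those are the addresses to list in `device_addresses`. If the PF shows up instead, listing the PF address should be enough.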