| Nick | Message | Time |
|---|---|---|
| opendevreview | Joan Gilabert proposed openstack/cyborg master: Move cyborg-tempest job definitions to cyborg https://review.opendev.org/c/openstack/cyborg/+/987620 | 10:32 |
| tafkamax | Hi, I just deployed cyborg and am planning to use it to pass PGPUs to VMs for now. | 10:40 |
| tafkamax | I just asked in the kolla chat as well, as it is active and there is some knowledge there too. | 10:41 |
| opendevreview | Joan Gilabert proposed openstack/cyborg master: Move cyborg-tempest job definitions to cyborg https://review.opendev.org/c/openstack/cyborg/+/987620 | 10:42 |
| tafkamax | The GPU shows up in `openstack accelerator device show UUID`... (full message at <https://matrix.org/oftc/media/v1/media/download/AW1EbeU-tbWQQQrjpuO6aBPema1ZFNxz1H-Oe8zT3IzpZWc2PNyPKxAaRLOwQuK9lv5coKnS2iHnahg2PqYfBgFCeeSiEWiwAG1hdHJpeC5vcmcvbmZOem9Rc2Zqak9zb2JialhsU0RSWmNB>) | 10:42 |
| opendevreview | Joan Gilabert proposed openstack/cyborg-tempest-plugin master: Move job definitions to cyborg repo https://review.opendev.org/c/openstack/cyborg-tempest-plugin/+/987621 | 10:42 |
| tafkamax | the traits are pretty empty for it though:... (full message at <https://matrix.org/oftc/media/v1/media/download/AUewwyd-zCY46PWfpF0JsBLctL5NmRnri3j9bK6462IJYTg7QZOmF_RjeTDUs9FvS2nB6diGWvBU9MVBF2J4ynFCeeSiF-zAAG1hdHJpeC5vcmcvZ0xUUWRHYUNOdHhzVlp0RENUU2tsY2xT>) | 10:42 |
| tafkamax | I also tried this accelerator profile... (full message at <https://matrix.org/oftc/media/v1/media/download/ARQew_Tex2aH3YGmBZDRiWwCFloG5Cdc0UaMQ5z9BfAuBKNfoNRQyiIb9zs5J0HZola9q9RhLGjJY1Y3Jb_PGcVCeeSiJUsQAG1hdHJpeC5vcmcvUVdVeEtqaUNXeXBZT2xOd3dnSmZ4c0hv>) | 10:43 |
| tafkamax | Any thoughts? | 10:43 |
| jgilaber | tafkamax, hi! the traits look correct to me. Was the flavor created correctly? | 11:01 |
| tafkamax | openstack flavor show 6943fd8e-08ea-4541-bc11-a167f270e98f... (full message at <https://matrix.org/oftc/media/v1/media/download/AXUWetSMAsXVtnlrnt-ox0XHF4gtEKxjUQ1zhRXjy7dCvygSAi7WqnZJr4jUfNLe3gDhYf-3t8FlaROgiijBg15CeeSjQ8dQAG1hdHJpeC5vcmcvZnZ3ZHdiZ0tkbUhUUnluTUxrRWtJUkpt>) | 11:03 |
| opendevreview | Joan Gilabert proposed openstack/cyborg master: Move cyborg-tempest job definitions to cyborg https://review.opendev.org/c/openstack/cyborg/+/987620 | 11:05 |
| opendevreview | Joan Gilabert proposed openstack/cyborg-tempest-plugin master: Move job definitions to cyborg repo https://review.opendev.org/c/openstack/cyborg-tempest-plugin/+/987621 | 11:05 |
| jgilaber | is there any more detail in the nova logs? it looks like placement might be reporting that there are no available gpus | 11:11 |
| jgilaber | could nova maybe be configured to pass through the gpu as well? | 11:12 |
| tafkamax | We haven't explicitly configured nova for that. | 11:13 |
| jgilaber | ack, thanks, can you also check what 'openstack accelerator arq list' reports? | 11:13 |
| tafkamax | kolla-ansible 2025.1 deployment type | 11:14 |
| tafkamax | hmm empty | 11:15 |
| tafkamax | oh I need to do that beforehand? | 11:15 |
| chandankumar | Do we need to create the device profile with resource class like this https://paste.openstack.org/raw/bJEmpgye27s5h4tEMlTf/ | 11:17 |
| chandankumar | for pci and fake fpga device, we pass resource class | 11:18 |
| tafkamax | Like PGPU? | 11:18 |
| chandankumar | sorry it should be GPU | 11:20 |
| tafkamax | ok, thanks, so that comes from openstack accelerator device show command | 11:21 |
| chandankumar | https://github.com/openstack/cyborg/blob/f111946df6713aa64efa29dd025d47839241c529/cyborg/common/constants.py#L115 | 11:29 |
| chandankumar | "PGPU": orc.PGPU, | 11:29 |
| chandankumar | "VGPU": orc.VGPU, | 11:29 |
| chandankumar | pgpu - for physical gpu | 11:29 |
| chandankumar | pgpu - for physical gpu and vgpu - for virtual gpu | 11:30 |
| chandankumar | Cyborg also has GPU. | 11:30 |
| tafkamax | I have this in the log: 2026-05-07 14:39:05.152 7 WARNING cyborg.accelerator.drivers.gpu.nvidia.sysinfo [-] Unable to load vGPU_type from [gpu_devices] Ensure "enabled_vgpu_types" is set if the gpuis virtualized. But I don't have vgpu enabled. Is this just informational? | 11:41 |
| tafkamax | from nova scheduler then: 2026-05-07 14:40:05.475 1087 ERROR nova.scheduler.client.report [req-3a504845-350b-4532-a33d-f0334e44db4c req-f8f77ce8-89c9-42c9-90e5-a9e7d6eaaf33 204c13dfae0b4214ae00b15a95a5d180 87aa79e7272a4f9b9e66e4582fb28c93 - - default default] Failed to retrieve allocation candidates from placement API for filters: | 11:45 |
| tafkamax | RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set([]),in_tree=8b9f6c21-0c4a-458a-8541-d481921c6d08,provider_uuids=[],requester_id=None,required_traits=set([]),resources={MEMORY_MB=16384,VCPU=8},use_same_provider=False), | 11:45 |
| tafkamax | RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set([]),in_tree=None,provider_uuids=[],requester_id='device_profile_0',required_traits=set(['CUSTOM_NVIDIA_26B9']),resources={},use_same_provider=True) | 11:45 |
| chandankumar | I am not sure about the first warning. It might be just an informational message. | 11:48 |
| tafkamax | Do I need to enable more filters or something? | 11:48 |
| chandankumar | From last error, it does not have any resources. | 11:49 |
| chandankumar | Not sure if it is linked to the "Failed to retrieve allocation candidates" error | 11:50 |
| * tafkamax | sent a code block: https://matrix.org/oftc/media/v1/media/download/AeGI7UNDRjSQKyaYmmY-Z1mBBe7wRM3gMOvtpUjizKsu24kuRJkZjISUG5IC_WTUfV0znrjfz4QyJsrfIj7nbK5CeeSl9e8AAG1hdHJpeC5vcmcvWUtEV2p0TVhQc2dNWGRDdGxHWEJYSXZu | 11:50 |
| chandankumar | I will wait for other people to take a look | 11:50 |
| chandankumar | Maybe the resource type should be PGPU or VGPU | 11:50 |
| chandankumar | can you try with PGPU? | 11:51 |
| tafkamax | Yeah, I will try. Also, some commands give this output in CLI: | 11:51 |
| tafkamax | openstack accelerator device enable 4dc9f844-75d4-4e6c-a252-b9986816cc5b | 11:51 |
| tafkamax | 'Proxy' object has no attribute 'enable_device' | 11:51 |
| tafkamax | Is this expected? | 11:51 |
| tafkamax | Do I need to do the arq bind thing as well? | 11:57 |
| chandank` | tafkamax: yes, I get a similar 'Proxy' object has no attribute 'enable_device' error in my env with both enable and disable | 11:58 |
| chandank` | you can open a bug for this on https://bugs.launchpad.net/openstack-cyborg | 11:59 |
| tafkamax | Aww damn | 12:01 |
| tafkamax | openstack accelerator device attribute list | 12:03 |
| tafkamax | 'Proxy' object has no attribute 'attributes' | 12:03 |
| tafkamax | this also returns the same | 12:03 |
| tafkamax | https://bugs.launchpad.net/openstack-cyborg/+bug/2151792 | 12:03 |
| chandank` | https://paste.openstack.org/raw/bRpUV3OrPTrGUImCllzA/ - openstack accelerator device attribute list output from master devstack vm | 12:04 |
| tafkamax | Could it be openstacksdk or some client lib version? | 12:05 |
| tafkamax | I am using from 2025.1 UC:... (full message at <https://matrix.org/oftc/media/v1/media/download/AZYa4b2-0u7aEhweDKgZL_Hoq4YysFkqdAt284_KUhwWGucPCZc6Mi3CuUP0OQA3sVMa9LPSYo5LiskzetFd6NFCeeSm4TQgAG1hdHJpeC5vcmcvRmVaa0JwUENuSWlhUFRjdm5OUm5Bc3h2>) | 12:06 |
| tafkamax | Accelerator client initialized using OpenStackSDK: <openstack.accelerator.v2._proxy.Proxy object at 0x7f1aad7e4dd0> | 12:10 |
| tafkamax | 'Proxy' object has no attribute 'attributes' | 12:10 |
| tafkamax | using -vvv | 12:10 |
| chandank` | openstacksdk 4.11.0 is used in master. | 12:12 |
| chandank` | Can you also add the openstacksdk and python-cyborgclient versions to the bug? If anyone fixes it, we can backport it | 12:12 |
| tafkamax | edited | 12:14 |
| tafkamax | 2025.2 UC worked for attribute list | 12:14 |
| * tafkamax | sent a code block: https://matrix.org/oftc/media/v1/media/download/Ac5ss1UANXX71xOr-qPTkqzVWnkfPDTGM0R7-rgAeQAnVFslfRU342ShXaTzcRU2RvKH7B6EXL2KbWVsePcdQmBCeeSnXAiwAG1hdHJpeC5vcmcvQ2ZoZmpPSWdzWk9YVFNwYVlzWHZOcm5o | 12:14 |
| chandank` | we are missing some backport then | 12:15 |
| tafkamax | but the enable command does indeed seem to be broken | 12:16 |
| tafkamax | https://review.opendev.org/c/openstack/openstacksdk/+/883238?usp=search | 12:20 |
| tafkamax | does this need to be backported? | 12:20 |
| tafkamax | it's April 4 2025, so after 2025.1? | 12:21 |
| jgilaber | yes that commit is missing in 2025.1 https://github.com/openstack/openstacksdk/tree/stable/2025.1/openstack/accelerator | 12:24 |
| chandank` | We have an upgrade job, let me push a patch to get the error in CI | 12:25 |
| tafkamax | But regarding the enable/disable that just seems to be missing? | 12:27 |
| tafkamax | E.g. this needs to be looked in the cyborg API spec and implemented i presume | 12:27 |
| tafkamax | docs.openstack.org/api-ref/accelerator/#enable-a-device ? | 12:30 |
| jgilaber | from a quick glance looks like the controller for that API exists https://github.com/openstack/cyborg/blob/79384661ce73984d2eef05dbee800507d36e997c/cyborg/api/controllers/v2/devices.py#L173 | 12:31 |
| chandank` | yes, it is implemented in cyborg side | 12:31 |
| chandank` | something is missing or broken on the cyborgclient side | 12:31 |
| tafkamax | oh okay | 12:33 |
| tafkamax | So: https://github.com/openstack/python-cyborgclient/blob/master/cyborgclient/osc/v2/device.py#L141 | 12:34 |
| sean-k-mooney | ya it's a known gap | 12:40 |
| sean-k-mooney | there are a few issues with the current cli | 12:40 |
| sean-k-mooney | and the way it's using the sdk | 12:40 |
| sean-k-mooney | those are things we plan to fix over this cycle now that we are trying to more actively maintain cyborg again | 12:41 |
| sean-k-mooney | tafkamax: thanks for filing the bug, it will help to have a backlog of the specific breakages | 12:42 |
| sean-k-mooney | chandank`: P in PGPU stands for physical because it's only used for the physical function, but this is something we will likely evolve in newer drivers | 12:43 |
| sean-k-mooney | the type in the device list and the resource class are not expected to be the same | 12:43 |
| tafkamax | aha okay, so I should always look for attribute list, when creating a profile | 12:47 |
| tafkamax | Okay. So I actually need to enable the device for it to be able to "bind" to a VM? | 12:51 |
| tafkamax | Hmm I will try to use an API call for enable then. | 12:52 |
| sean-k-mooney | tafkamax: so the device profile needs to match the resource class used, but the attributes api allows you to override that | 12:58 |
| sean-k-mooney | tafkamax: we will use a default one based on the driver that manages the device, but if you wanted to use a different resource class the attributes api provides a way to configure it | 12:59 |
| tafkamax | oh okay, so my device is ID 8, I can add a custom attribute to that device ID 8. | 13:00 |
| sean-k-mooney | my expectation is that over the next 12-18 months we are going to revisit how many of the drivers work and ensure there is a declarative way to do that via the config file as well | 13:00 |
| sean-k-mooney | tafkamax: yes, if you wanted to add custom traits to it you could do that via placement directly, but you can also do that via the attributes api | 13:00 |
| sean-k-mooney | at least in theory | 13:00 |
| sean-k-mooney | this is an area i have not spent too much time on yet and the testing is light | 13:01 |
| sean-k-mooney | so if you find bugs please let us know | 13:01 |
| tafkamax | ok | 13:01 |
| sean-k-mooney | one of the topics from the PTG we don't have time for this cycle was "how to evolve the api" | 13:02 |
| sean-k-mooney | currently i think having devices, attributes and deployables as 3 separate apis is a bit confusing | 13:02 |
| sean-k-mooney | i think eventually the devices api would be a better home for attributes, for example | 13:03 |
| sean-k-mooney | i.e. include the attributes in the device show and provide /devices/<id>/attributes subpaths for adding/removing them | 13:03 |
| tafkamax | so how would you enable the device if the CLI does not work. Via curl to the API endpoint? | 13:04 |
| sean-k-mooney | so it should be enabled by default | 13:04 |
| sean-k-mooney | but yes | 13:04 |
| sean-k-mooney | unfortunately via curl | 13:04 |
| sean-k-mooney | so you would do an openstack token issue | 13:04 |
| sean-k-mooney | to get a keystone token and then curl the endpoint with that token | 13:04 |
| tafkamax | yep, thats what i was thinking. will do a script for it now for testing | 13:05 |
| tafkamax | thanks for the quick responses here | 13:05 |
| sean-k-mooney | just one point of clarification: while the resource class is an attribute on the device, i'm not sure that you can modify it today | 13:06 |
| sean-k-mooney | you can add and remove additional attributes | 13:06 |
| sean-k-mooney | but i'm not sure the api allows you to override it once generated by the driver | 13:06 |
| sean-k-mooney | and the docs don't actually tell you one way or another, so that's on my todo list to figure out | 13:07 |
| sean-k-mooney | i'll need to go back to the original attributes spec and compare that to the final code | 13:07 |
| tafkamax | oh okay | 13:08 |
| tafkamax | Hmm, I am trying to enable the gpu via the API and it gives me a 204: | 13:48 |
| tafkamax | 2026-05-07 16:45:59.843 1097 INFO eventlet.wsgi.server [None req-78038ad2-f1fe-4f2c-a984-aa6fd469ab34 204c13dfae0b4214ae00b15a95a5d180 1cc33bde294848818c8a462ad9d221a9 - - default default] "POST /v2/devices/4dc9f844-75d4-4e6c-a252-b9986816cc5b/enable HTTP/1.1" status: 204 len: 278 time: 0.4349580 | 13:48 |
| tafkamax | when using device show the status value is empty: `\| status \| \|` | 13:48 |
| tafkamax | I think it might be enabled by default. Not sure though | 14:18 |
| tafkamax | I found the placement API request in the logs: | 14:28 |
| tafkamax | 2026-05-07 17:19:35.782 1081 INFO placement.requestlog [req-e9714860-9a55-4040-b055-1402f814745d req-c938965f-e5f0-4e5b-854c-6399de72e6e2 f28c1a6ab3704064bd656bdd2d3db679 9e4bc1e8a4ef469695c83b671c090a34 - - default default] redacted "GET | 14:28 |
| tafkamax | /allocation_candidates?in_tree=8b9f6c21-0c4a-458a-8541-d481921c6d08&limit=1000&requireddevice_profile_0=CUSTOM_NVIDIA_26B9&resources=MEMORY_MB%3A16384%2CVCPU%3A8&resourcesdevice_profile_0=PGPU%3A1&root_required=COMPUTE_ACCELERATORS%2C%21COMPUTE_STATUS_DISABLED" status: 200 len: 53 microversion: 1.36 | 14:28 |
| tafkamax | i need to see what this returns | 14:29 |
| tafkamax | Ok I modified the query | 14:50 |
| tafkamax | and removed root_required | 14:50 |
| tafkamax | and got results | 14:50 |
| tafkamax | wait maybe the compute node is disabled because it was in maintenance 😅😅😅😅 | 14:51 |
| tafkamax | Enabled the hypervisor and trying again | 14:54 |
| tafkamax | Seems like it booted. | 14:55 |
| tafkamax | the VM | 14:55 |
| tafkamax | And the VM can see the device in lspci! | 14:57 |
| tafkamax | root@test-vm-gpu:~# lspci... (full message at <https://matrix.org/oftc/media/v1/media/download/AYWYRn0ernbP2iFEup_KDgOi8sTs-6dPqBKmAXcisI3S1lBaDW-A9B5u8Dr7jSbXxWudb-A2N9Xzt87A53zswTVCeeSwsnywAG1hdHJpeC5vcmcvdktwZlBsT0FKSmdXZ1VqT0RqRVdSUnh5>) | 14:58 |
| opendevreview | sean mooney proposed openstack/cyborg master: Fix rule:allow policy bypass on device/deployable/attribute APIs https://review.opendev.org/c/openstack/cyborg/+/987680 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Set project_id on ARQ creation and binding https://review.opendev.org/c/openstack/cyborg/+/987681 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Add project_id backfill for existing ARQs https://review.opendev.org/c/openstack/cyborg/+/987682 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Enforce project-scoped access for ARQs https://review.opendev.org/c/openstack/cyborg/+/987683 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Require service token for bound ARQ operations https://review.opendev.org/c/openstack/cyborg/+/987684 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Document ARQ ownership and service tokens https://review.opendev.org/c/openstack/cyborg/+/987685 | 15:05 |
| opendevreview | sean mooney proposed openstack/cyborg master: Mark conductor ARQ delete methods for removal in RPC v2 https://review.opendev.org/c/openstack/cyborg/+/987686 | 15:05 |
| chandank` | tafkamax: awesome, | 15:15 |
| tafkamax | thanks for the help and good that some bugs were found :-) | 15:16 |
| chandank` | here is our driver doc https://docs.openstack.org/cyborg/latest/configuration/drivers.html, please have a look. Do share if there is anything we can improve on the doc side, and if you hit any other issues feel free to open bugs so that we can address them in future. :-) | 15:17 |
| tafkamax | regarding the actual config it was rather intuitive. kolla-ansible did its magic and I just saw that for PGPU I needed to enable the | 15:20 |
| tafkamax | [agent] | 15:20 |
| tafkamax | enabled_drivers = nvidia_gpu_driver | 15:20 |
| tafkamax | I didn't initially understand all the stuff in `openstack accelerator <command>` | 15:20 |
| tafkamax | I guess the understanding issue was how to create a working "profile". E.g. -> look into `openstack accelerator attribute list` and if it's an rc use `resources:<attribute>:1` and if it is a trait use `trait:<attribute>:required` | 15:22 |
| tafkamax | and an NB: don't look at the attributes under `openstack accelerator device list` or `openstack accelerator device show <uuid>`, as these are not used in profiles. Did I understand this correctly? | 15:24 |
| tafkamax | Also this page is not present in the index menu on the left side of the screen: https://docs.openstack.org/cyborg/latest/admin/ | 15:26 |
| tafkamax | I just found this link via search | 15:26 |
| sean-k-mooney | ya so one of the things we are missing is an end to end workflow guide | 15:30 |
| sean-k-mooney | that has some of the info required | 15:30 |
| sean-k-mooney | in https://docs.openstack.org/cyborg/latest/admin/#user-requests | 15:30 |
| sean-k-mooney | but what i would like to add going forward is a better end to end "how do i make this work" guide to help new operators properly configure it | 15:31 |
| sean-k-mooney | let me explain briefly | 15:34 |
| sean-k-mooney | "trait:CUSTOM_FPGA_TRAITS":"required", | 15:34 |
| sean-k-mooney | "resources:FPGA":"1", | 15:34 |
| sean-k-mooney | in the device profile the resources: part is describing a countable resource that will be assigned | 15:35 |
| sean-k-mooney | and trait: entries are qualitative traits that must also be advertised on the device | 15:35 |
| sean-k-mooney | for a gpu this could be a cuda level or something like that | 15:35 |
| sean-k-mooney | a device profile can have more than 1 device request | 15:36 |
| sean-k-mooney | this is expressed in the groups section | 15:36 |
| sean-k-mooney | each group can be allocated from a different resource provider | 15:36 |
| sean-k-mooney | typically you will have resources:<something>:1 | 15:37 |
| sean-k-mooney | but if that something is divisible, say ssd storage, you could ask for say resources:CUSTOM_SSD_GB:100 | 15:38 |
| sean-k-mooney | as an example | 15:38 |
| sean-k-mooney | i added https://docs.openstack.org/cyborg/latest/configuration/drivers.html | 15:39 |
| sean-k-mooney | as a stop gap to have some info on how to configure each driver | 15:39 |
| sean-k-mooney | but that only covers the config options currently | 15:40 |
| tafkamax | oh okay | 15:40 |
| sean-k-mooney | i would like to have a per driver doc going forward that provides an end to end example for each of the drivers | 15:40 |
| sean-k-mooney | including a sample device profile | 15:40 |
| sean-k-mooney | tafkamax: in the interim https://specs.openstack.org/openstack/cyborg-specs/specs/train/implemented/device-profiles.html | 15:42 |
| sean-k-mooney | is the spec that defined what device profiles are and how they are expected to work | 15:42 |
| sean-k-mooney | and https://specs.openstack.org/openstack/cyborg-specs/specs/train/implemented/cyborg-nova-placement.html covers how this works with regard to placement | 15:43 |
| sean-k-mooney | tafkamax: you were asking about the enable/disable api before | 15:45 |
| sean-k-mooney | https://specs.openstack.org/openstack/cyborg-specs/specs/2023.2/approved/disable-enable-device.html | 15:45 |
| sean-k-mooney | that was new in bobcat and not completed fully | 15:45 |
| tafkamax | Aha okay, that makes sense then why its like this | 15:57 |
| opendevreview | Merged openstack/cyborg master: Fix rule:allow policy bypass on device/deployable/attribute APIs https://review.opendev.org/c/openstack/cyborg/+/987680 | 18:28 |
| opendevreview | Merged openstack/cyborg master: Set project_id on ARQ creation and binding https://review.opendev.org/c/openstack/cyborg/+/987681 | 18:38 |
| opendevreview | Merged openstack/cyborg master: Add project_id backfill for existing ARQs https://review.opendev.org/c/openstack/cyborg/+/987682 | 18:38 |
| opendevreview | Merged openstack/cyborg master: Enforce project-scoped access for ARQs https://review.opendev.org/c/openstack/cyborg/+/987683 | 18:38 |
| opendevreview | Merged openstack/cyborg master: Require service token for bound ARQ operations https://review.opendev.org/c/openstack/cyborg/+/987684 | 18:38 |
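
Editor's note: the failed boot came down to the placement query logged at 14:28, so a small sketch may help relate that query to the device profile syntax explained at 15:34. It is not Cyborg or Nova code. It only decodes the logged query string with the Python standard library, and the function names `split_root_required` and `profile_group` are hypothetical helpers made up for this illustration. The `device_profile_0` suffix is copied from the log and is assumed here to identify the request group for the device profile.

```python
from urllib.parse import parse_qs

# Query string copied from the placement request log quoted at 14:28.
QUERY = (
    "in_tree=8b9f6c21-0c4a-458a-8541-d481921c6d08&limit=1000"
    "&requireddevice_profile_0=CUSTOM_NVIDIA_26B9"
    "&resources=MEMORY_MB%3A16384%2CVCPU%3A8"
    "&resourcesdevice_profile_0=PGPU%3A1"
    "&root_required=COMPUTE_ACCELERATORS%2C%21COMPUTE_STATUS_DISABLED"
)


def split_root_required(query):
    """Split root_required into (required, forbidden) trait lists.

    A leading "!" marks a trait the root provider must NOT have.
    """
    params = parse_qs(query)
    traits = params.get("root_required", [""])[0].split(",")
    required = [t for t in traits if t and not t.startswith("!")]
    forbidden = [t[1:] for t in traits if t.startswith("!")]
    return required, forbidden


def profile_group(query, suffix="device_profile_0"):
    """Rebuild a device-profile-style group dict from the suffixed params.

    The suffix value is taken from the log; treating it as the device
    profile's request group is an assumption of this sketch.
    """
    params = parse_qs(query)
    group = {}
    for item in params.get("resources" + suffix, [""])[0].split(","):
        if item:
            resource_class, amount = item.split(":", 1)
            group["resources:" + resource_class] = amount
    for trait in params.get("required" + suffix, [""])[0].split(","):
        if trait:
            group["trait:" + trait] = "required"
    return group


if __name__ == "__main__":
    print(split_root_required(QUERY))
    print(profile_group(QUERY))
```

Read this way, the query asks for one `PGPU` with the `CUSTOM_NVIDIA_26B9` trait, and separately forbids root providers carrying `COMPUTE_STATUS_DISABLED`. That matches what tafkamax found: dropping `root_required` returned candidates, and re-enabling the hypervisor let the VM boot.
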