03:05:12 <Sundar> #startmeeting openstack-cyborg
03:05:13 <openstack> Meeting started Thu Aug 29 03:05:12 2019 UTC and is due to finish in 60 minutes. The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:05:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:05:16 <openstack> The meeting name has been set to 'openstack_cyborg'
03:05:18 <Coco_gao_> #info Coco_gao_
03:05:20 <Sundar> Hi all
03:05:24 <shaohe_feng> morning Coco_gao_
03:05:33 <s_shogo> Hi all
03:05:35 <Sundar> #topic Attendance
03:05:35 <Coco_gao_> morning shaohe
03:05:40 <Sundar> #info SUndar
03:05:47 <s_shogo> #info s_shogo
03:05:51 <chenke> Hi~
03:05:58 <Sundar> Hi all
03:05:59 <Coco_gao_> Hi chenke
03:05:59 <chenke> #info chenke
03:06:01 <shaohe_feng> #info shaohe_feng
03:06:07 <Yumeng> #info Yumeng
03:06:08 <yikun> #info yikun
03:06:19 <Sundar> Agenda: https://wiki.openstack.org/wiki/Meetings/CyborgTeamMeeting#Agenda
03:06:19 <chenke> Hi Coco_gao_
03:07:53 <Sundar> Python 3: Since the OpenStack Train release has some Python 3 goals, due by Milestone 3, and it seems that we are close to fixing Py3 issues for Cyborg,
03:08:17 <Sundar> I have requested s_shogo to make the Python 3 tests a voting job in Zuul.
03:08:27 <Sundar> Any objections or comments?
03:10:01 <Sundar> I'll take the silence as agreement. ;) There were requests for fixing Python 3 in the cyborg client too. Luckily, it has taken only 1 patch so far, so we don't need to spend much time on it.
03:10:21 <chenke> +1
03:11:05 <s_shogo> I'll do the py3 work in cyborg client, too.
03:11:19 <chenke> good job
03:12:16 <chenke> I have modified the tox.ini default env to support py36,py37
03:12:22 <Coco_gao_> thank you
03:12:26 <Sundar> s_shogo: The catch is, the current client is for v1 API code and not based on the openstacksdk method. Bringing it to v2 is more important, right?
03:12:27 <Coco_gao_> s_shaogo
03:12:35 <Coco_gao_> s_shogo
03:12:37 <wangzhh> Cool.
03:12:49 <shaohe_feng> https://review.opendev.org/#/c/673228/
03:13:06 <shaohe_feng> this is a python3 issue fix for client
03:13:25 <Sundar> But somebody else proposed a patch and it got merged.
03:13:50 <s_shogo> Sundar: I think so. My openstackSDK patch is made for the v2 Deployable API now.
03:14:12 <s_shogo> And the P5-P9 patches don't include the migration code for the "Deployable" API from v1 to v2.
03:15:03 <Sundar> s_shogo: Great. Please add device profiles, as that is more important IMHO. Operators need to create device profiles to use Cyborg, but doing that with curl is not easy
03:15:26 <Coco_gao_> agree, Sundar
03:15:49 <Sundar> As 2nd priority, I'd say devices -- that will give an inventory of accelerator devices in the cluster
03:16:57 <Sundar> IMHO, when devices are asked for, we can return the components like deployables and attributes, so the client gets a full picture
03:16:59 <shaohe_feng> yes, the client is more friendly than curl
03:17:25 <s_shogo> Regarding the client, the deadline for openstackSDK commits seems to be near, so I would like to begin committing to that prior to the merge of the APIv2 patches.
03:17:34 <Sundar> Yes, makes sense
03:17:48 <Sundar> Thanks, s_shogo!
03:18:28 <Sundar> The main thing that is holding me back is that I am testing P5-P9 with the notification and Placement report patches. Plus, Nova code changes create a merge conflict for me.
03:18:46 <Sundar> Once those are resolved, hope we can merge the P5-P9 patches
03:19:01 <Sundar> Any other comments on the client, anybody?
03:19:13 <shaohe_feng> yes, async job depends on P5-P9
03:19:25 <s_shogo> In my assumption, python-cyborg client and openstacksdk could be completed before the Train release,
03:19:38 <shaohe_feng> great
03:19:38 <s_shogo> but I'm anxious about the sufficiency of my test code, so please review it in the following patches, and help with it if necessary.
03:19:53 <Sundar> shaohe_feng: Agreed. I'll expedite as much as I can.
03:20:06 <Sundar> s_shogo: Agreed, we'll help for sure
03:20:23 <s_shogo> Thanks, Sundar
03:20:29 <shaohe_feng> maybe the test code can be added later.
03:20:35 <Sundar> wangzhh: Thanks for proposing the RBAC patch. I had some concerns/questions in the patch. Please take a look.
03:20:50 <shaohe_feng> first let the client work.
03:21:11 <Coco_gao_> s_shogo, thank you. We will review the code.
03:21:28 <wangzhh> Yep. I have updated my code. May commit after meeting.
03:21:39 <Sundar> Thanks, wangzhh
03:22:03 <s_shogo> shaohe_feng : OK, I'll do that preferentially.
03:22:26 <Sundar> shaohe_feng: Part of the issue is that some Nova developers want to test Cyborg code with Nova code in their env. Also, we need to show tempest working end-to-end.
03:23:41 <Sundar> Anybody else trying out the Placement report? With GPUs, AI chip, etc.?
03:23:42 <Coco_gao_> What's the remaining work for tempest?
03:23:46 <shaohe_feng> yes, tempest can eliminate their concerns
03:24:07 <Sundar> Coco_gao_: It is mostly to get the patches to work together, I think
03:24:34 <Sundar> Xinran's patches look good IMO. Trying to make sure they work with P5-P9
03:24:51 <Yumeng> I have tried the Placement report with GPUs
03:25:20 <Sundar> Yumeng: Good to know
03:25:42 <Sundar> #topic Nova functional tests
03:26:40 <Sundar> There was talk at the PTG that we should propose functional tests for Nova, which mock Cyborg API in a test fixture, and use that to test Nova patches
03:27:08 <Sundar> They seem to cover a few more scenarios than unit tests and tempest
03:27:52 <Coco_gao_> mock cyborg API's return?
03:27:53 <chenke> I agree we need to introduce functional tests for nova.
03:28:09 <Sundar> Coco_gao_: Yes
03:28:44 <Sundar> We have an entry in the Storyboard too.
I have not any comments of late, but there is concern that it may come up at the last moment
03:29:11 <Sundar> Since there is lots of stuff in Nova runway, it can be tough to get a 2nd look if this issue comes up
03:29:38 <Sundar> Do we have any volunteers for writing Nova functional tests? I'll help as much as I can
03:31:35 <Sundar> Please think it over and LMK if you can.
03:32:33 <Sundar> shaohe_feng: Do you want to bring up the discussion about ARQ states and transitions, as followup? Or is it settled?
03:33:12 <shaohe_feng> yes
03:33:45 <shaohe_feng> one thing is, who deletes the ARQ
03:34:09 <shaohe_feng> when the delete API tags the state as delete_pending?
03:34:22 <Sundar> There is Nova code to delete the ARQ in some error cases and when VM is terminated
03:35:03 <shaohe_feng> maybe it is still in the bind process
03:35:47 <shaohe_feng> should the bind process delete it when it finds the state is delete_pending?
03:36:06 <Sundar> Yes. In that case, IMHO, it is best to let the bind complete and the traits get updated in Placement, and then unbind/delete the ARQ
03:36:26 <Sundar> If we try to interrupt FPGA programming, bad things can happen
03:36:46 <shaohe_feng> we will not add any rollback this release for bind. just go through the whole process even when deleting.
03:36:54 <Sundar> Agreed
03:37:06 <shaohe_feng> OK.
03:37:09 <Coco_gao_> OK
03:37:48 <shaohe_feng> any state transition should be a transaction.
03:38:39 <Sundar> Yes, db transaction
03:39:21 <shaohe_feng> seems there is a state machine in oslo lib
03:39:26 <Sundar> Any other issue, shaohe_feng?
03:39:43 <shaohe_feng> we will not introduce it this release
03:40:12 <Sundar> Ok by me. What are the benefits of using that?
03:40:13 <shaohe_feng> for I need time to read up on it.
03:40:31 <shaohe_feng> do not look into it at present.
03:40:35 <Sundar> ok
03:40:55 <shaohe_feng> maybe after the whole flow code is finished
03:41:07 <shaohe_feng> we can have a look at the pros and cons
03:41:33 <Sundar> Sure.
We'll trust your judgement on this :)
03:41:48 <shaohe_feng> another thing, should the async job time out?
03:42:01 <Sundar> On a different note, I am seeing this issue for allocating attach handles: https://opendev.org/openstack/cyborg/src/branch/master/cyborg/db/sqlalchemy/api.py#L269 The in_use field does not get written to db
03:42:37 <shaohe_feng> but there's still a problem.
03:42:40 <Sundar> The timeout should correspond to default Nova timeout
03:43:10 <shaohe_feng> maybe it is in programming or other critical job
03:43:34 <Sundar> The programming typically takes a few seconds, so default of 300 seconds (I think) is good enough
03:43:40 <shaohe_feng> timeout can be a disaster
03:44:21 <shaohe_feng> another thing
03:44:43 <shaohe_feng> currently the bind process is specific to FPGA
03:45:40 <Sundar> Umm, bind if for all accelerators. Only programming is for FPGA. The bind means the ARQ is associated with a host and deployable in Cyborg's db, and the device is ready to use
03:45:44 <Sundar> *is for
03:45:55 <shaohe_feng> there should be good extension for other kinds
03:46:05 <shaohe_feng> I mean:
03:46:22 <shaohe_feng> 1. get the resource type.
03:46:50 <shaohe_feng> every resource type should have its own extended bind action
03:46:56 <shaohe_feng> for FPGA it is program.
03:47:08 <shaohe_feng> for others maybe env setup, not sure.
03:47:43 <shaohe_feng> 2. every resource should have its own placement report.
03:48:20 <shaohe_feng> the report info may be different
03:48:35 <shaohe_feng> so the code should be:
03:49:09 <shaohe_feng> type, num = arq.group_get_resource()
03:49:17 <shaohe_feng> for n in num:
03:50:05 <shaohe_feng> action = get_accelerator_action(type) # fpga is program
03:50:08 <shaohe_feng> action()
03:50:15 <shaohe_feng> something like this
03:50:38 <shaohe_feng> and this code should be split from the arq object file
03:51:36 <Sundar> In general, the process should be generic for all accelerators.
The current code looks at the device profile request group to see if it has function_id or bitstream_id entries, which are specific to FPGA, to decide if programming is needed
03:52:32 <shaohe_feng> we may add other specs in
03:52:36 <Sundar> AFAIK, for non-FPGA devices in this release, there is nothing required to prepare the device, right?
03:53:10 <shaohe_feng> device profiles for different accelerators
03:53:14 <shaohe_feng> such as HDDL
03:53:19 <shaohe_feng> we can add
03:54:18 <shaohe_feng> "accel:affinity": true
03:54:44 <Sundar> Ok
03:54:46 <shaohe_feng> which means we need 4 accelerators in one card
03:54:49 <Sundar> We had an idea of a generic prepare_device API in the driver, which gets a dictionary as a parameter, where the dictionary values depend on the device type.
03:55:26 <shaohe_feng> yes, different devices may take different actions during bind.
03:56:14 <Sundar> Quick process check: Since we have only a few minutes left, should we continue this via email, copying all of us and openstack-ML? What do you all think?
03:56:34 <shaohe_feng> also another thing, where do we init the ThreadPoolExecutor?
03:56:43 <shaohe_feng> in the arq object file?
03:56:48 <shaohe_feng> seems not good.
03:57:11 <shaohe_feng> OK.
03:57:52 <Sundar> All, please look at this issue for allocating attach handles: https://opendev.org/openstack/cyborg/src/branch/master/cyborg/db/sqlalchemy/api.py#L269 The in_use field does not get written to db
03:58:34 <Sundar> All, we are seeing good review activity of late. Thank you all, and please keep it up. We are literally 2 weeks from the milestone. :)
03:58:41 <Sundar> #topic AoB
03:59:20 <Sundar> shaohe_feng: if you prefer, I can initiate an email thread for the good points that you brought up. Good?
03:59:39 <shaohe_feng> OK
03:59:48 <Sundar> Anything else, folks?
04:00:50 <shaohe_feng> did you check that in_use is in the arguments of the update function?
04:01:00 <Sundar> Yes
04:01:14 <shaohe_feng> and does your DB really have the in_use field?
04:01:32 <shaohe_feng> directly use the mysql command.
04:02:04 <Sundar> Oh yes. The ref.update has it, but it doesn't get written to db. Use mysql cmd from Python code?
04:02:20 <shaohe_feng> no
04:02:22 <shaohe_feng> such as:
04:02:33 <shaohe_feng> mysql -uroot -ppass cyborg
04:02:58 <Sundar> Yes, update command works from CLI
04:03:25 <Sundar> We'll follow up on this too by email.
04:03:33 <shaohe_feng> desc haddler;
04:03:43 <shaohe_feng> OK.
04:03:52 <Sundar> Thanks, everybody. Happy coding and reviewing :). Have a good day. Bye.
04:03:58 <Sundar> #endmeeting
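
Editor's sketch (not part of the meeting log): the per-type bind action that shaohe_feng outlined at 03:49 and the generic prepare_device idea that Sundar mentioned at 03:54 could look roughly like the following. This is only an illustration under stated assumptions. Every name here (BIND_ACTIONS, program_fpga, no_op, bind, and the "FPGA"/"GPU" type strings) is hypothetical and not taken from the Cyborg code base; only prepare_device, function_id and bitstream_id are names that appear in the discussion, and their use here is a guess at the intent.

```python
# Hypothetical sketch only: none of these functions exist in Cyborg as written.
from typing import Callable, Dict, List


def program_fpga(params: dict) -> str:
    """FPGA-specific bind step: program a bitstream or function."""
    image = params.get("bitstream_id") or params.get("function_id")
    return f"programmed FPGA with {image}"


def no_op(params: dict) -> str:
    """Fallback for device types that need no preparation."""
    return "nothing to prepare"


# Registry mapping a resource type to its bind-time action (assumed layout).
BIND_ACTIONS: Dict[str, Callable[[dict], str]] = {
    "FPGA": program_fpga,
}


def prepare_device(resource_type: str, params: dict) -> str:
    """Generic entry point: the dictionary contents depend on the device type."""
    action = BIND_ACTIONS.get(resource_type, no_op)
    return action(params)


def bind(resource_type: str, num: int, params: dict) -> List[str]:
    """Run the type-specific action once per requested accelerator."""
    return [prepare_device(resource_type, params) for _ in range(num)]
```

With a registry like this, supporting a new accelerator type would mean adding one entry to the mapping rather than branching inside the ARQ object code, which seems to be the separation shaohe_feng was asking for. Whether the real driver interface should take this shape was left for the follow-up email thread.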