#openstack-cyborg log

03:05:12 <Sundar> #startmeeting openstack-cyborg
03:05:13 <openstack> Meeting started Thu Aug 29 03:05:12 2019 UTC and is due to finish in 60 minutes.  The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:05:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:05:16 <openstack> The meeting name has been set to 'openstack_cyborg'
03:05:18 <Coco_gao_> #info Coco_gao_
03:05:20 <Sundar> Hi all
03:05:24 <shaohe_feng> morning Coco_gao_
03:05:33 <s_shogo> Hi all
03:05:35 <Sundar> #topic Attendance
03:05:35 <Coco_gao_> morning shaohe
03:05:40 <Sundar> #info Sundar
03:05:47 <s_shogo> #info s_shogo
03:05:51 <chenke> Hi~
03:05:58 <Sundar> Hi all
03:05:59 <Coco_gao_> Hi chenke
03:05:59 <chenke> #info chenke
03:06:01 <shaohe_feng> #info shaohe_feng
03:06:07 <Yumeng> #info Yumeng
03:06:08 <yikun> #info yikun
03:06:19 <Sundar> Agenda: https://wiki.openstack.org/wiki/Meetings/CyborgTeamMeeting#Agenda
03:06:19 <chenke> Hi Coco_gao_
03:07:53 <Sundar> Python 3: Since the OpenStack Train release has some Python 3 goals due by Milestone 3, and it seems that we are close to fixing the Py3 issues for Cyborg,
03:08:17 <Sundar> I have requested s_shogo to make the Python 3 tests a voting job in Zuul.
03:08:27 <Sundar> Any objections or comments?
03:10:01 <Sundar> I'll take the silence as agreement. ;)  There were requests for fixing Python 3 in the cyborg client too. Luckily, it has taken only 1 patch so far, so we don't need to spend much time on it.
03:10:21 <chenke> +1
03:11:05 <s_shogo> I'll do the py3 work in cyborg client, too.
03:11:19 <chenke> good job
03:12:16 <chenke> I have modified the tox.ini default envs to support py36 and py37
03:12:22 <Coco_gao_> thank you
03:12:26 <Sundar> s_shogo: The catch is, the current client is for v1 API code and not based on the openstacksdk method. Bringing it to v2 is more important, right?
03:12:27 <Coco_gao_> s_shaogo
03:12:35 <Coco_gao_> s_shogo
03:12:37 <wangzhh> Cool.
03:12:49 <shaohe_feng> https://review.opendev.org/#/c/673228/
03:13:06 <shaohe_feng> this is a python3 issue fix for client
03:13:25 <Sundar> But somebody else proposed a patch and it got merged.
03:13:50 <s_shogo> Sundar: I think so. My openstacksdk patch now targets the v2 Deployable API.
03:14:12 <s_shogo> And the P5-P9 patches don't include the code that migrates the "Deployable" API from v1 to v2.
03:15:03 <Sundar> s_shogo: Great. Please add device profiles, as that is more important IMHO. Operators need to create device profiles to use Cyborg, but doing that with curl is not easy.
03:15:26 <Coco_gao_> agree, Sundar
03:15:49 <Sundar> As 2nd priority, I'd say devices -- that will give an inventory of accelerator devices in the cluster
03:16:57 <Sundar> IMHO, when devices are asked for, we can return the components like deployables and attributes, so the client gets a full picture
03:16:59 <shaohe_feng> yes, the client is more friendly than curl
03:17:25 <s_shogo> Regarding the client, the deadline for openstacksdk commits seems to be near, so I would like to begin committing to it before the APIv2 patches merge.
03:17:34 <Sundar> Yes, makes sense
03:17:48 <Sundar> Thanks, s_shogo!
03:18:28 <Sundar> The main thing that is holding me back is that I am testing P5-P9 with the notification and Placement report patches. Plus, Nova code changes create a merge conflict for me.
03:18:46 <Sundar> Once those are resolved, hope we can merge the P5-P9 patches
03:19:01 <Sundar> Any other comments on the client, anybody?
03:19:13 <shaohe_feng> yes, async job depends on P5-P9
03:19:25 <s_shogo> My assumption is that the python-cyborg client and openstacksdk work can be completed before the Train release,
03:19:38 <shaohe_feng> great
03:19:38 <s_shogo> but I'm worried that my test code may not be sufficient, so please review it in the following patches, and help with it if necessary.
03:19:53 <Sundar> shaohe_feng: Agreed. I'll expedite as much as I can.
03:20:06 <Sundar> s_shogo: Agreed, we'll help for sure
03:20:23 <s_shogo> Thanks , Sundar
03:20:29 <shaohe_feng> maybe the test code can be added later.
03:20:35 <Sundar> wangzhh: Thanks for proposing the RBAC patch. I had some concerns/questions in the patch. Please take a look.
03:20:50 <shaohe_feng> first, let's get the client working.
03:21:11 <Coco_gao_> s_shogo, thank you . We will review the code.
03:21:28 <wangzhh> Yep. I have updated my code. May commit after meeting.
03:21:39 <Sundar> Thanks, wangzhh
03:22:03 <s_shogo> shaohe_feng: OK, I'll prioritize that.
03:22:26 <Sundar> shaohe_feng: Part of the issue is that some Nova developers want to test Cyborg code with Nova code in their env. Also, we need to show tempest working end-to-end.
03:23:41 <Sundar> Anybody else trying out the Placement report? With GPUs, AI chip, etc.?
03:23:42 <Coco_gao_> What's the remaining work for tempest?
03:23:46 <shaohe_feng> yes, tempest can eliminate their concerns
03:24:07 <Sundar> Coco_gao_: It is mostly to get the patches to work together, I think
03:24:34 <Sundar> Xinran's patches look good IMO. Trying to make sure they work with P5-P9
03:24:51 <Yumeng> I have tried the Placement report with GPUs
03:25:20 <Sundar> Yumeng: Good to know
03:25:42 <Sundar> #topic Nova functional tests
03:26:40 <Sundar> There was talk at the PTG that we should propose functional tests for Nova, which mock the Cyborg API in a test fixture, and use that to test Nova patches
03:27:08 <Sundar> They seem to cover a few more scenarios than unit tests and tempest
03:27:52 <Coco_gao_> mock cyborg API's return?
03:27:53 <chenke> I agree we need to introduce functional tests for Nova.
03:28:09 <Sundar> Coco_gao_: Yes
03:28:44 <Sundar> We have an entry in the Storyboard too. I have not seen any comments of late, but there is concern that it may come up at the last moment
03:29:11 <Sundar> Since there is lots of stuff in Nova runway, it can be tough to get a 2nd look if this issue comes up
03:29:38 <Sundar> Do we have any volunteers for writing Nova functional tests? I'll help as much as I can
03:31:35 <Sundar> Please think it over and LMK if you can.
03:32:33 <Sundar> shaohe_feng: Do you want to bring up the discussion about ARQ states and transitions, as followup? Or is it settled?
03:33:12 <shaohe_feng> yes
03:33:45 <shaohe_feng> one thing is, who deletes the ARQ
03:34:09 <shaohe_feng> when the delete API tags the state as delete_pending?
03:34:22 <Sundar> There is Nova code to delete the ARQ in some error cases and when VM is terminated
03:35:03 <shaohe_feng> maybe it is still in the bind process
03:35:47 <shaohe_feng> should the bind process delete it when it finds the state is delete_pending?
03:36:06 <Sundar> Yes. In that case, IMHO, it is best to let the bind complete and the traits get updated in Placement, and then unbind/delete the ARQ
03:36:26 <Sundar> If we try to interrupt FPGA programming, bad things can happen
03:36:46 <shaohe_feng> we will not add any rollback for bind in this release. just go through the whole process, even when deleting.
03:36:54 <Sundar> Agreed
03:37:06 <shaohe_feng> OK.
03:37:09 <Coco_gao_> OK
03:37:48 <shaohe_feng> any state transition should be a transaction.
03:38:39 <Sundar> Yes, db transaction
03:39:21 <shaohe_feng> it seems there is a state machine in the oslo lib
03:39:26 <Sundar> Any other issue, shaohe_feng?
03:39:43 <shaohe_feng> we will not introduce it in this release
03:40:12 <Sundar> Ok by me. What are the benefits of using that?
03:40:13 <shaohe_feng> because I need time to read up on it.
03:40:31 <shaohe_feng> I will not look into it at present.
03:40:35 <Sundar> ok
03:40:55 <shaohe_feng> maybe after the whole flow code is finished
03:41:07 <shaohe_feng> we can look at the pros and cons
03:41:33 <Sundar> Sure. We'll trust your judgement on this :)
03:41:48 <shaohe_feng> another thing: should the async job time out?
03:42:01 <Sundar> On a different note, I am seeing this issue for allocating attach handles: https://opendev.org/openstack/cyborg/src/branch/master/cyborg/db/sqlalchemy/api.py#L269 The in_use field does not get written to db
03:42:37 <shaohe_feng> but there's still a problem.
03:42:40 <Sundar> The timeout should correspond to default Nova timeout
03:43:10 <shaohe_feng> maybe it is in the middle of programming or another critical job
03:43:34 <Sundar> The programming typically takes a few seconds, so default of 300 seconds (I think) is good enough
03:43:40 <shaohe_feng> a timeout can be a disaster
03:44:21 <shaohe_feng> another thing
03:44:43 <shaohe_feng> currently the bind process is specific to FPGA
03:45:40 <Sundar> Umm, bind if for all accelerators. Only programming is for FPGA. the bind means the ARQ is associated with a host and deployable in Cyborg's db, and the device is ready to use
03:45:44 <Sundar> *is for
03:45:55 <shaohe_feng> there should be a good extension mechanism for other kinds
03:46:05 <shaohe_feng> I mean:
03:46:22 <shaohe_feng> 1.  get the resource type.
03:46:50 <shaohe_feng> every resource type should have its own extended bind action
03:46:56 <shaohe_feng> for FPGA it is programming.
03:47:08 <shaohe_feng> for others, maybe env setup, not sure.
03:47:43 <shaohe_feng> 2. every resource type should have its own placement report.
03:48:20 <shaohe_feng> the report info may be different
03:48:35 <shaohe_feng> so the code should be:
03:49:09 <shaohe_feng> type, num = arq.group_get_resource()
03:49:17 <shaohe_feng> for n in range(num):
03:50:05 <shaohe_feng>     action = get_accelerator_action(type)  # for FPGA, the action is program
03:50:08 <shaohe_feng>     action()
03:50:15 <shaohe_feng> something like this
03:50:38 <shaohe_feng> and this code should be split out of the arq object file
03:51:36 <Sundar> In general, the process should be generic for all accelerators. The current code looks at the device profile request group to see if it has function_id or bitstream_id entries, which are specific to FPGA, to decide if programming is needed
03:52:32 <shaohe_feng> we may add other specs in
03:52:36 <Sundar> AFAIK, for non-FPGA devices in this release, there is nothing required to prepare the device, right?
03:53:10 <shaohe_feng> device profiles for different accelerators
03:53:14 <shaohe_feng> such as HDDL
03:53:19 <shaohe_feng> we can add
03:54:18 <shaohe_feng> "accel:affinity": true
03:54:44 <Sundar> Ok
03:54:46 <shaohe_feng> which means we need 4 accelerators on one card
03:54:49 <Sundar> We had an idea of a generic prepare_device API in the driver, which gets a dictionary as a parameter, where the dictionary values depend on the device type.
03:55:26 <shaohe_feng> yes, different devices may take different actions during bind.
03:56:14 <Sundar> Quick process check: Since we have only a few minutes left, should we continue this via email, copying all of us and openstack-ML? What do you all think?
03:56:34 <shaohe_feng> also another thing: where do we init the ThreadPoolExecutor?
03:56:43 <shaohe_feng> in the arq object file?
03:56:48 <shaohe_feng> seems not good.
03:57:11 <shaohe_feng> OK.
03:57:52 <Sundar> All, please look at this issue for allocating attach handles: https://opendev.org/openstack/cyborg/src/branch/master/cyborg/db/sqlalchemy/api.py#L269 The in_use field does not get written to db
03:58:34 <Sundar> All, we are seeing good review activity of late. Thank you all, and please keep it up. We are literally 2 weeks from the milestone. :)
03:58:41 <Sundar> #topic AoB
03:59:20 <Sundar> shaohe_feng: if you prefer, I can initiate an email thread for the good points that you brought up. Good?
03:59:39 <shaohe_feng> OK
03:59:48 <Sundar> Anything else, folks?
04:00:50 <shaohe_feng> did you check that in_use is in the arguments of the update function?
04:01:00 <Sundar> Yes
04:01:14 <shaohe_feng> and does your DB really have the in_use field?
04:01:32 <shaohe_feng> use the mysql command directly.
04:02:04 <Sundar> Oh yes. The ref.update has it, but it doesn't get written to db. Use mysql cmd from Python code?
04:02:20 <shaohe_feng> no
04:02:22 <shaohe_feng> such as:
04:02:33 <shaohe_feng> mysql  -uroot -ppass cyborg
04:02:58 <Sundar> Yes, update command works from CLI
04:03:25 <Sundar> We'll follow up on this too by email.
04:03:33 <shaohe_feng> desc haddler;
04:03:43 <shaohe_feng> OK.
04:03:52 <Sundar> Thanks, everybody. Happy coding and reviewing :). Have a good day. Bye.
04:03:58 <Sundar> #endmeeting