03:02:43 <Sundar> #startmeeting openstack-cyborg
03:02:44 <openstack> Meeting started Thu Sep 19 03:02:43 2019 UTC and is due to finish in 60 minutes. The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:02:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:02:47 <openstack> The meeting name has been set to 'openstack_cyborg'
03:02:59 <Sundar> #topic Who's here
03:03:02 <Sundar> o/
03:03:08 <chenke> o/
03:03:20 <Yumeng> #info Yumeng
03:03:26 <s_shogo> #info s_shogo
03:03:32 <wangzhh> #info wangzhh
03:03:40 <changzhi> #info changzhi
03:03:49 <chenke> #info chenke
03:04:00 <Sundar> Hi chenke, Yumeng, s_shogo, wangzhh. Welcome changzhi
03:04:03 <shaohe_feng> #info shaohe_feng
03:04:10 <Sundar> Hi shaohe
03:04:17 <Sundar> #topic Status
03:04:41 <Sundar> First, thank you all for an active Train cycle. We hit feature freeze a week ago
03:05:00 <Sundar> So did other projects.
03:05:41 <Sundar> The good news: Cyborg side of the Nova integration is pretty much done. We just need to clean up the way we invoke other services
03:06:07 <chenke> Great
03:06:24 <wangzhh> Cool
03:06:33 <Sundar> Not so good news: Our Nova patches did not get enough reviews from Nova developers, and so did not make the cut.
03:07:24 <Sundar> Part of the problem is that Cyborg patches were open for a long time, so Nova developers did not see them as ready, though we could bring up a VM with Cyborg + Nova patches
03:08:16 <Sundar> Also, there was a longstanding request to show tempest CI working. That completed exactly in the milestone week. That was too late to get sustained reviews.
03:08:23 <shaohe_feng> We know integration is a big effort
03:08:45 <shaohe_feng> Sundar: you put in a lot of effort. Thanks
03:09:42 <chenke> It is understandable that patches in Nova get merged slowly.
03:09:51 <Sundar> NP, thanks Shaohe. I am optimistic about U because I think we are close, and I have re-proposed the Nova spec. This time, tempest and most things are merged. Things that attract cross-project attention, like tempest, privsep, sdk_adapter stuff, etc. are all done or making good progress
03:10:36 <Sundar> Hope to get the Nova patches in the runway very early in the cycle. The longer we wait, the more things get bogged down among the tons of other reviews.
03:11:28 <Sundar> That said, we have a few more things to wrap up in Train :)
03:12:12 <Sundar> First, remove the hardcoding of 'devstack-admin'. Thanks, chenker and all for addressing that :)
03:12:47 <Sundar> Second, v1 API is deprecated but still supported in Train. But it is not working because we removed all v1 from devstack. I should re-enable it, I think
03:13:28 <xinranwang> #info xinranwang
03:13:31 <Sundar> Shaohe's async bind, privsep, rbac are important
03:13:36 <xinranwang> Hi all
03:14:05 <Sundar> I think all the pep8/flake fixes from chenker/zhurong are looking good and will probably merge this week
03:14:33 <Sundar> Can you all think of anything else?
03:14:44 <Yumeng> Sundar: and please don't forget "update device_profile db by conductor": https://review.opendev.org/#/c/679406/
03:14:48 <Yumeng> just updated
03:15:06 <shaohe_feng> Sundar, please leave a slot for me to introduce the async jobs, so others can review it more easily.
03:15:17 <shaohe_feng> Thanks
03:15:25 <Yumeng> and this gpu fix: https://review.opendev.org/#/c/675059/ I tested it in my devstack env, it works
03:15:57 <Sundar> Ah yes, that too, Yumeng :) There are quite a few patches up there, including https://review.opendev.org/680953.
03:16:26 <Sundar> Sure, let's knock off as much as we can. Was just listing the ones critical to complete in Train
03:16:35 <openstackgerrit> Merged openstack/cyborg master: P5: Fix pep8 error in cyborg/accelerator https://review.opendev.org/679175
03:16:54 <Sundar> shaohe_feng: Sure
03:17:15 <Sundar> Folks, anything else before we dive into Shaohe's async bind?
03:18:01 <s_shogo> I'm starting a test & validation task on a real machine, beginning with common functions that are independent of specific accelerators.
03:18:11 <s_shogo> If I find bugs or errors, I'll report them or post patches before the Train release.
03:19:36 <Sundar> Sure, s_shogo. I think the client effort can be aimed early in U release, since the Train release milestone for clients is past
03:19:41 <Sundar> I have some questions on RBAC: https://review.opendev.org/#/c/678177/ . In https://review.opendev.org/#/c/678177/3/cyborg/common/policy.py@83, should it be an allow rule? Anybody can create an ARQ and thereby bind that ARQ, and so program an FPGA?
03:21:11 <s_shogo> Sundar: OK, I'll continue the client & sdk task into the U release.
03:22:26 <Sundar> wangzhh: What do you think?
03:22:33 <xinranwang> should we complete v2 API in T?
03:23:09 <wangzhh> Sundar, it should be allowed, and we recheck in the method whether it is a program action or not.
03:23:29 <Sundar> wangzhh: ok
03:23:51 <Sundar> xinranwang: Only devices API remains. We are supposed to merge only bug fixes, I think. So, it will probably go to U. Is anything else remaining?
03:25:30 <Sundar> OK, 35 min remaining. Let's move to async bind.
03:25:41 <Sundar> #topic Async bind
03:25:54 <Sundar> Shaohe, take it away!
03:26:05 <shaohe_feng> Now let's start introducing async bind. Any questions can come after the introduction.
03:26:12 <shaohe_feng> Briefly put, bind is to find a suitable device (maybe PCI, or MDEV) on the right host for a server instance to use.
03:26:18 <shaohe_feng> So what is a suitable device? We need a spec to describe it.
03:26:25 <shaohe_feng> In v1 we describe the device directly in the nova flavor extra spec, and cyborg parses the spec. Xinran implemented this work.
03:26:32 <shaohe_feng> In v2, after the PTG discussion, we define it in Cyborg's own Device Profile. And Sundar implemented it.
03:26:43 <shaohe_feng> I had no chance to attend the PTG discussion. For more details, please talk with Sundar.
03:26:50 <shaohe_feng> Thanks for Xinran's and Sundar's effort.
03:26:59 <shaohe_feng> Before we introduce async bind, let's first go over some rules in the current code.
03:27:08 <shaohe_feng> 1. The AtachHandler in ExtARQ is not a list, so only one AtachHandler (one device per ARQ)
03:27:08 <shaohe_feng> profile group in order to get the expected devices.
03:27:19 <shaohe_feng> Now our Cyborg ARQ bind API is sync, but we define it as async, so we need to improve it.
03:27:28 <shaohe_feng> So what we changed:
03:27:43 <shaohe_feng> 1. Use a thread pool to start the async job.
03:27:50 <shaohe_feng> In the Cyborg spec, Sundar suggests using concurrent.futures; yes, it is a Python standard library. See the official Python link:
03:27:57 <shaohe_feng> https://docs.python.org/3/library/concurrent.futures.html
03:28:05 <shaohe_feng> Also we can "green" it with greenlet, patching it via eventlet.
03:28:11 <shaohe_feng> futures = eventlet.import_patched('concurrent.futures') # 'greening' futures,
03:28:13 <openstackgerrit> Merged openstack/cyborg master: P6: Fix pep8 error in cyborg/agent and cyborg/db https://review.opendev.org/679193
03:28:27 <shaohe_feng> easy to green
03:28:37 <shaohe_feng> See the Python mailing list discussion.
03:28:52 <shaohe_feng> I have tested it briefly and it works, but I did not test its performance, so greening is not enabled in the patch.
03:29:00 <shaohe_feng> 2. I moved the bind logic out of the ExtARQ object.
03:29:13 <shaohe_feng> Let ExtARQ keep its base function, such as CRUD of its attributes.
03:29:20 <shaohe_feng> Moved it to cyborg/accelerator/common/handler.py (not sure this is a good place, this is an OPEN)
03:29:30 <shaohe_feng> Added a basic and general bind handler class named Accelerators. (not sure this is a good name, this is an OPEN)
03:29:37 <shaohe_feng> It supports the base _bind
03:29:44 <shaohe_feng> https://review.opendev.org/#/c/681005/16/cyborg/accelerator/common/handler.py
03:29:54 <shaohe_feng> If a new accelerator needs extra operations, one can derive from it and extend it as needed, such as FPGA
03:30:02 <shaohe_feng> line 386 at
03:30:47 <shaohe_feng> For FPGA it needs to get image metadata, download the image, program the image and update placement.
03:31:18 <shaohe_feng> If _bind is time-consuming, use "wrap_job_tb" to wrap it.
03:31:30 <shaohe_feng> In this wrapper I tag it with "is_job", and it can catch every Exception/traceback during the bind process, then log it.
03:31:31 <openstackgerrit> Merged openstack/cyborg master: P7: Fix pep8 error in cyborg/objects and cyborg/image https://review.opendev.org/679526
03:31:32 <openstackgerrit> Merged openstack/cyborg master: P8: Fix pep8 error in cyborg/tests and add post_mortem_debug.py https://review.opendev.org/679538
03:31:38 <shaohe_feng> I also added a bind in the general class to start the jobs tagged with "is_job".
03:31:46 <shaohe_feng> I also added a master to monitor the jobs (as Sundar suggested)
03:31:52 <shaohe_feng> https://review.opendev.org/#/c/681005/16/cyborg/accelerator/common/handler.py
03:32:00 <shaohe_feng> It checks the job status and also gets the job Exception/traceback.
03:32:28 <shaohe_feng> please add a SUPPORT_RESOURCES in
03:32:41 <shaohe_feng> 4. I added ARQ_STATES_TRANSFORM_MATRIX to sync the status.
03:32:49 <shaohe_feng> After talking with Sundar and Xinran, we added extra statuses: ARQ_DELETING and ARQ_BIND_STARTED
03:32:57 <shaohe_feng> line at 29
03:33:11 <shaohe_feng> I just refactored Sundar's work and did not change his logic, at present. So no API definition exposed to users changed. Thanks for Sundar's effort.
03:33:17 <shaohe_feng> I did not test multi/batch ARQs, for example, a request for 2 FPGAs, or 1 GPU and 1 FPGA.
03:33:21 <shaohe_feng> I have no real env for that.
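[Editor's note] A minimal sketch of the thread-pool approach described above, using only the standard concurrent.futures module. The names wrap_job_tb and is_job are taken from the discussion, but their bodies here are assumptions for illustration and are not the code under review in 681005; fake_bind and run_jobs are hypothetical helpers.

```python
import concurrent.futures
import functools
import logging
import traceback

LOG = logging.getLogger(__name__)


def wrap_job_tb(func):
    """Hypothetical wrapper: tag func as a job and log any traceback."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            LOG.error("bind job failed:\n%s", traceback.format_exc())
            raise  # re-raise so the Future still records the exception
    wrapper.is_job = True  # marker a dispatcher could use to pick async jobs
    return wrapper


@wrap_job_tb
def fake_bind(arq_id, fail=False):
    """Stand-in for a slow bind step such as FPGA programming."""
    if fail:
        raise RuntimeError("programming failed for %s" % arq_id)
    return "bound %s" % arq_id


def run_jobs(jobs):
    """Submit (func, args) pairs to a thread pool and collect the outcomes."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(func, *args): args[0] for func, args in jobs}
        # The "master" role: wait on every job and record result or exception.
        for fut in concurrent.futures.as_completed(futures):
            exc = fut.exception()
            results[futures[fut]] = exc if exc is not None else fut.result()
    return results
```

Calling run_jobs([(fake_bind, ("arq-1",)), (fake_bind, ("arq-2", True))]) returns a dict keyed by ARQ id, holding either the return value or the exception raised by each job, so a monitor can decide the next ARQ state without blocking the API call.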
03:33:50 <shaohe_feng> So I think we need to merge the patch, and let more developers test it.
03:34:00 <shaohe_feng> That's the difference from VM management. Ironic or Cyborg sometimes need hardware, so it is difficult to manage.
03:34:39 <shaohe_feng> the commit message shows you how to test this patch and
03:35:17 <shaohe_feng> analyze the process by log: https://review.opendev.org/#/c/681005/16//COMMIT_MSG
03:35:39 <shaohe_feng> Also there's still a lot of work to do on it. We need to improve it continuously. Let it work first, then improve.
03:36:31 <shaohe_feng> sorry
03:37:01 <shaohe_feng> any questions?
03:37:30 <Sundar> shaohe_feng: Thanks for all the time and hard work
03:38:19 <Sundar> For testing, hope people can use the fake driver. It supports FPGA resource class. Can we get it to take the programming patch but treat it as a no-op?
03:38:36 <Sundar> *programming code path
03:39:19 <shaohe_feng> Do you mean adding some mocks so it does not really program?
03:39:23 <Sundar> Yes
03:39:36 <shaohe_feng> Hardware support is really harder than VM
03:39:51 <Yumeng> shaohe_feng: that's really comprehensive and deep research and a very helpful introduction.
03:40:40 <shaohe_feng> Yumeng thanks. Hopefully it is useful.
03:40:50 <s_shogo> Thanks, shaohe_feng :
03:41:12 <xinranwang> shaohe_feng: thanks Shaohe for your efforts
03:41:13 <shaohe_feng> Sundar, let me give a method to mock it later.
03:41:18 <Sundar> Not everybody has hardware, as you said. But concurrent execution is not easy to test thoroughly. It may work in my env but fail in somebody else's. We can hopefully get more people to check it out using fake driver
03:41:30 <Sundar> Great, thanks
03:41:54 <shaohe_feng> Yes, will give a guide for how to mock it.
03:42:26 <chenke> Great job, thanks ShaoHe.
03:42:40 <Sundar> Also: "Move it to cyborg/accelerator/common/handler.py". Bind is really an operation on an ExtARQ. It logically belongs with objects/ext_arq.py. If you want to split that into separate source file, that is OK. But it can be a mix-in rather than a separate object/class, IMHO
03:43:37 <Yumeng> shaohe_feng: great! looking forward to the mock guide
03:44:12 <shaohe_feng> I checked nova's object code, then I made this change.
03:44:28 <wangzhh> shaohe_feng, Thx for your effort.
03:44:38 <shaohe_feng> Sundar any details for how to split it?
03:45:23 <Sundar> shaohe_feng: I found this blog useful: http://www.qtrac.eu/pyclassmulti.html
03:46:00 <Sundar> It considers many ways to split a Python class into different source files, and finally recommends mix-ins
03:48:43 <shaohe_feng> Glanced at it. Seems it is a big change.
03:49:52 <Sundar> Hmmm... only the last part is the mix-in. That could be a small change. You can move your chosen methods into a separate file, put it in a mix-in, and inherit that mix-in into the ExtARQ object class
03:50:11 <Sundar> I can help as much as I can.
03:51:09 <shaohe_feng> good, then I can write a mock env guide for testing.
03:51:35 <Sundar> In that article, the last section "The Definitive Version?" alone is about mix-ins
03:51:40 <Sundar> OK, great
03:52:34 <Sundar> Anything else, Shaohe?
03:52:48 <shaohe_feng> no, that's all for me.
03:53:02 <Sundar> Thanks very much, once again.
03:53:06 <Sundar> #topic AoB
03:53:09 <shaohe_feng> let's move the patch on
03:53:20 <Sundar> Python IPv6 jobs: https://review.opendev.org/#/c/682517/ Please review
03:53:52 <Sundar> Many patches hit merge conflict after recent merges
03:54:04 <shaohe_feng> it does not matter.
03:54:25 <shaohe_feng> we just improve our git skills
03:54:49 <shaohe_feng> other active projects
03:55:18 <Sundar> We need one more review for https://review.opendev.org/#/c/680953/ from outside Intel.
03:55:28 <shaohe_feng> conflicts are very common
03:55:51 <Sundar> Sure
03:56:03 <Sundar> Train schedule: https://releases.openstack.org/train/schedule.html RC1 candidate is next week!
03:56:18 <Sundar> Hope to get the critical patches in by that time.
03:56:49 <Sundar> After that, even bug fixes are not assured
03:57:48 <Sundar> BTW, Cyborg will get packaged as an RPM as part of OpenStack release: https://opendev.org/openstack/rpm-packaging/src/branch/master/openstack/cyborg
03:58:19 <Sundar> Anything else, guys?
03:58:28 <shaohe_feng> no
03:58:32 <chenke> no
03:58:52 <Sundar> Have a good day! Bye
03:58:56 <Sundar> #endmeeting
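[Editor's note] A minimal sketch of the mix-in split Sundar suggests in the async bind topic: the bind methods move to their own source file as a mix-in, and the object class inherits from it. ExtARQBindMixin, the simplified ExtARQ class, its fields, and the state strings are all hypothetical stand-ins, not Cyborg's actual objects.

```python
# Would live in a separate file, e.g. a hypothetical ext_arq_bind.py.
class ExtARQBindMixin:
    """Bind-related behaviour, kept apart from the object's CRUD code."""

    def bind(self, hostname, device_rp_uuid):
        # The mix-in relies on attributes that the host class defines.
        self.hostname = hostname
        self.device_rp_uuid = device_rp_uuid
        self.state = "Bound"
        return self


# Simplified stand-in for the object class in objects/ext_arq.py.
class ExtARQ(ExtARQBindMixin):
    def __init__(self, uuid):
        self.uuid = uuid
        self.state = "Initial"
        self.hostname = None
        self.device_rp_uuid = None
```

With this layout, callers still write ExtARQ(uuid).bind(host, rp_uuid), so bind stays an operation on the ExtARQ while its code sits in a separate file.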