15:00:56 <zhipeng_> #startmeeting openstack-cyborg
15:00:56 <openstack> Meeting started Wed Jun 7 15:00:56 2017 UTC and is due to finish in 60 minutes. The chair is zhipeng_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:00 <openstack> The meeting name has been set to 'openstack_cyborg'
15:01:05 <zhipeng_> hahaha
15:01:10 <zhipeng_> let's hope so
15:01:18 <zhipeng_> okey so quick update from my side
15:01:22 <zhipeng_> on the api/db patch
15:01:40 <zhipeng_> #topic BP discussion
15:01:43 <zhipeng_> #link https://review.openstack.org/#/c/445814/
15:02:07 <zhipeng_> so ChrisD reviewed with the comment that there is an ongoing discussion on the traits
15:02:16 <zhipeng_> we might consider aligning our design to it
15:02:43 <zhipeng_> originally, the placement resource provider was meant for just the compute node
15:02:53 <jkilpatr> I was looking over that, care to summarize?
15:03:16 <zhipeng_> sure I'm putting my thoughts together now
15:03:27 <zhipeng_> so now the placement team sees the pitfall of that
15:03:48 <zhipeng_> since for example for shared storage (external arrays I would suppose)
15:04:03 <zhipeng_> if you only count the storage side of things on the compute node
15:04:22 <zhipeng_> your resource provider will never correctly reflect the required traits
15:04:43 <jkilpatr> so this is an issue with accelerators that may be shared between many computes?
15:04:48 <zhipeng_> the resource provider should reflect the shared storage arrays, rather than only local disks
15:05:02 <zhipeng_> no, I think this is an issue for accelerators as a whole
15:05:14 <jkilpatr> how so?
15:05:23 <zhipeng_> since if the resource provider only identifies with the compute node
15:05:56 <zhipeng_> we could wind up with the same problem as we have now, since accelerator characteristics are bundled with the compute characteristics
15:06:13 <zhipeng_> well we could have our own resource class for sure, but that does not solve the problem
15:06:36 <zhipeng_> nova scheduler asks the placement api to provide all the necessary resources
15:07:00 <zhipeng_> and for Cyborg, one of the important goals is that accelerators be treated as first-class citizens
15:07:51 <zhipeng_> meaning that we should have individual resource providers for accelerators
15:08:20 <zhipeng_> from the email link Chris provided, there is an etherpad documenting the "Plan B"
15:09:02 <jkilpatr> ok so the issue is that if we have a 'gpu' resource provider it's dependent on computes in a way that resource providers aren't supposed to be.
15:09:02 <zhipeng_> which I liked very much, it works on extending the current nested resource provider definition to a more relaxed one with multiple resource providers
15:09:11 <zhipeng_> yes exactly
15:09:42 <zhipeng_> the scheduling decision would still largely depend on the regular compute features, since we are just part of the traits
15:09:59 <crushil> interesting
15:10:12 <zhipeng_> so back to the "Plan B", the current nested resource provider model is designed primarily for stuff like NUMA nodes
15:10:22 <zhipeng_> where you got this parent-child relationship
15:10:28 <crushil> So, how does that change our implementation?
15:10:42 <zhipeng_> the Plan B extends the scope to be more general, meaning for Cyborg use cases
15:11:05 <zhipeng_> we could have multiple resource providers, one for each and every accelerator
15:11:14 <zhipeng_> (if they are deemed important for the workload)
15:11:23 <zhipeng_> crushil the change is that
15:11:46 <zhipeng_> our DB design has to align with the proposed nested resource provider/trait design
15:12:02 <zhipeng_> at least DB schemas
15:12:18 <zhipeng_> so that when the cyborg agent populates our inventory to the placement api
15:12:25 <zhipeng_> it could understand it correctly
15:13:59 <crushil> Ok, what about the other specs?
15:14:13 <zhipeng_> not concerned that much :)
15:14:18 <crushil> gotcha
15:14:58 <zhipeng_> So I'm thinking we might need two DB schemas
15:15:15 <zhipeng_> the current one in the spec patch could be used for the discovery phase
15:15:37 <zhipeng_> that is when the user starts the cyborg service and then the agent/driver does the discovery/pre-config
15:15:53 <zhipeng_> collect what we have, on the host
15:16:16 <zhipeng_> the second set of schemas needs to be aligned with the nested resource provider
15:16:33 <zhipeng_> to interact with the placement api and eventually nova-scheduler
15:16:53 <zhipeng_> for the VM to select the correct accelerator resource
15:17:19 <jkilpatr> so we need to maintain two parallel db's for each purpose or do you mean we want to change the format in a future release?
15:18:20 <zhipeng_> what I'm thinking is that we don't have exhaustive knowledge of the hardware now
15:18:58 <zhipeng_> therefore we keep a separate DB schema, the host side one should be more extendable or more abstract
15:19:17 <zhipeng_> But on another thought
15:19:23 <zhipeng_> it might be just too complex .....
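A minimal sketch of the model discussed so far, assuming each accelerator becomes its own child resource provider under the compute node, with its own inventory and traits. This is plain in-memory Python and does not call the placement API. The class `ResourceProvider`, the helpers `build_example_host` and `find_providers`, and every resource class and trait name below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set
from uuid import uuid4


@dataclass
class ResourceProvider:
    """Hypothetical stand-in for a (nested) placement resource provider."""
    name: str
    inventory: Dict[str, int] = field(default_factory=dict)  # resource class -> total
    traits: Set[str] = field(default_factory=set)
    rp_uuid: str = field(default_factory=lambda: str(uuid4()))
    # parent is excluded from repr/compare to avoid parent<->child recursion
    parent: Optional["ResourceProvider"] = field(
        default=None, repr=False, compare=False)
    children: List["ResourceProvider"] = field(default_factory=list)

    def add_child(self, child: "ResourceProvider") -> "ResourceProvider":
        """Attach a child provider (e.g. one accelerator) under this one."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self) -> Iterator["ResourceProvider"]:
        """Yield this provider and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_providers(root: ResourceProvider, resource_class: str,
                   amount: int, required_traits: Set[str]) -> List[ResourceProvider]:
    """Return providers in the tree that can supply `amount` of
    `resource_class` and carry every trait in `required_traits`."""
    return [rp for rp in root.walk()
            if rp.inventory.get(resource_class, 0) >= amount
            and required_traits <= rp.traits]


def build_example_host() -> ResourceProvider:
    """Build an illustrative compute node with two accelerator children."""
    compute = ResourceProvider(
        "compute-1", inventory={"VCPU": 16, "MEMORY_MB": 65536})
    compute.add_child(ResourceProvider(
        "compute-1-gpu-0",
        inventory={"CUSTOM_ACCELERATOR_GPU": 1},
        traits={"CUSTOM_GPU_EXAMPLE"}))
    compute.add_child(ResourceProvider(
        "compute-1-fpga-0",
        inventory={"CUSTOM_ACCELERATOR_FPGA": 1},
        traits={"CUSTOM_FPGA_EXAMPLE"}))
    return compute
```

With this shape, a request for an FPGA with a given trait matches the accelerator's own provider rather than the compute node, which is the "first-class citizen" property described above.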
15:19:26 <zhipeng_> what do you guys think
15:19:49 <jkilpatr> I think we should try and keep one db as much as possible, I don't want to try and maintain parallel sets of data
15:19:58 <zhipeng_> that makes sense
15:20:24 <crushil> I agree, having multiple DBs is just clunky
15:21:04 <zhipeng_> in that case we will just use the resource provider schema, I will follow up with Chris to see which one I should use
15:21:11 <zhipeng_> the current one or the proposed one
15:22:19 <jkilpatr> sounds good.
15:22:56 <jkilpatr> Anything else on that subject?
15:22:59 <zhipeng_> nope
15:23:13 <zhipeng_> anything else from you guys on the open spec ?
15:23:33 <crushil> nope
15:23:51 <zhipeng_> great
15:23:58 <zhipeng_> #topic initial code development
15:24:05 <zhipeng_> so, any roadblocks
15:24:39 <jkilpatr> been trying to understand oslo rpc and message passing and start structuring the conductor/agent
15:24:52 <zhipeng_> sounds like a great start :)
15:25:02 <crushil> I have created stubs and I will push them up by the end of the week
15:25:11 <zhipeng_> great !
15:25:18 <jkilpatr> crushil, sounds good.
15:26:03 <zhipeng_> let's do small pieces like Justin suggested
15:26:09 <crushil> I will fill them out rebased on top of the API and agent patches
15:26:10 <jkilpatr> so a lot of what we will be doing involves rpc between different components, so people with integrating parts need to talk to each other about interfaces
15:26:19 <jkilpatr> I don't think we should be too worried about a stable internal interface
15:26:29 <zhipeng_> yes I agree
15:27:04 <zhipeng_> oslo.messaging could provide everything we need
15:27:46 <jkilpatr> well sometimes we need rpc, for example the driver should be called by the agent over rpc I'm thinking (we could invoke directly but I'm not sure if I want to do that)
15:29:38 <zhipeng_> i think it should be done over rpc
15:29:57 <zhipeng_> unless we gave the driver restful apis ?
15:30:31 <jkilpatr> I don't think that's the right application here. Our internal code needs to be more tightly integrated than restfulness allows.
15:30:45 <zhipeng_> yep
15:31:07 <zhipeng_> so rpc should be fine here
15:31:27 <zhipeng_> i think at the moment, it is the agent talking to the generic driver
15:31:50 <zhipeng_> later on, we should design something like the neutron ml2 driver interface
15:32:29 <zhipeng_> so that every driver, vendor or not, implements the interface which rpc calls will go through
15:32:34 <zhipeng_> in a rather standard way
15:33:29 <rushil> Ok. So, are we going to follow the neutron model vs the nova/cinder model?
15:33:47 <zhipeng_> i think more like the neutron model
15:33:55 <zhipeng_> for out-of-tree drivers
15:34:03 <rushil> But isn't that too complicated
15:34:09 <zhipeng_> cinder and nova are mostly in-tree maintained drivers
15:34:20 <zhipeng_> it won't be too complicated for us i think
15:34:40 <zhipeng_> neutron is complicated because they have to define the type drivers and mechanism drivers
15:34:42 <rushil> Well, cinder has out of tree drivers based on whether you have CI or not
15:35:04 <zhipeng_> I think in-tree drivers also require the CI
15:35:14 <zhipeng_> otherwise the cinder team removes your driver
15:35:38 <rushil> No, they just make it unsupported i.e. move it out of tree
15:35:54 <zhipeng_> for us, as long as the devices communicate over PCIe, the driver interface won't be too complicated
15:36:12 <zhipeng_> but if we need to support extra protocols, that is where things will get wild
15:36:24 <zhipeng_> rushil ah okey
15:36:25 <rushil> Ok. I just want to make sure we don't make things more complicated than they should be
15:36:37 <zhipeng_> yes that is always our goal
15:36:38 <jkilpatr> I can agree on a standard rpc interface but that's less complicated than I think you are making it out to be.
15:36:49 <zhipeng_> we even wanted to skip the conductor :P
15:37:14 <jkilpatr> and I nearly got away with it too!
15:37:21 <zhipeng_> jkilpatr haha
15:37:41 <rushil> Lol
15:39:26 <zhipeng_> rushil the cyborg ml2 driver would be modeled on your generic driver implementation :P
15:40:37 <rushil> I wouldn't call it ml2 driver though
15:40:52 <zhipeng_> of course we will have another name for it
15:41:14 <zhipeng_> aluminum drivers :P
15:41:24 <zhipeng_> for cyborg robots
15:41:45 <rushil> Hehe
15:42:36 <jkilpatr> Anyways I'll try to have a stub up this week (conductor) and then agent next week.
15:42:46 <jkilpatr> depends on how other tasks go for me.
15:43:44 <rushil> jkilpatr: Cool
15:43:50 <zhipeng_> sounds great, i got another colleague working on cyborg this week, so api code will be developed in parallel
15:44:09 <rushil> Awesome
15:44:10 <zhipeng_> hopefully once we settle the spec, the initial code will come out
15:44:21 <zhipeng_> and we could iterate over it
15:44:48 <zhipeng_> #topic AoB
15:44:52 <zhipeng_> any other topics
15:44:59 <rushil> Btw our group at Lenovo sent out initial emails to vendors to get their drivers aligned with cyborg
15:45:20 <zhipeng_> wow
15:45:24 <zhipeng_> that is awesome
15:45:45 <rushil> I'll keep you guys posted on that
15:45:51 <zhipeng_> could you disclose the vendor names for now ?
15:45:57 <zhipeng_> or should we wait until later
15:46:14 <rushil> The usual suspects
15:46:35 <zhipeng_> e.g. ?
15:46:47 <rushil> Nvidia, AMD
15:47:00 <rushil> And smaller ones like Micron
15:47:29 <zhipeng_> cool !
15:47:30 <rushil> I'll let y'all know when they are committed to contributing code
15:47:40 <zhipeng_> great :)
15:50:41 <zhipeng_> okey if there are no other topics, we go to the usual long slumber ~~
15:50:56 <zhipeng_> will try to remember to close the meeting an hour later
15:51:05 <crushil> Cool, thanks zhipeng_
17:00:56 <zhipeng_> #endmeeting
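
A rough sketch of the "standard driver interface" idea from the code development topic, assuming every driver (generic or vendor) implements one abstract base class and the agent reaches it through a single dispatch point. The dispatcher here is an in-process stand-in for the RPC layer and does not use oslo.messaging. All names (`AcceleratorDriver`, `GenericDriver`, `AgentDriverEndpoint`, the `discover` and `configure` methods) are hypothetical and were not agreed in the meeting.

```python
import abc
from typing import Any, Dict, List


class AcceleratorDriver(abc.ABC):
    """Hypothetical common interface that every driver would implement."""

    @abc.abstractmethod
    def discover(self) -> List[Dict[str, Any]]:
        """Return one description dict per accelerator found on the host."""

    @abc.abstractmethod
    def configure(self, accelerator_id: str,
                  settings: Dict[str, Any]) -> Dict[str, Any]:
        """Apply settings to one accelerator and return the result."""


class GenericDriver(AcceleratorDriver):
    """Illustrative generic driver returning fixed, made-up data."""

    def discover(self) -> List[Dict[str, Any]]:
        return [{"id": "acc-0", "type": "generic"}]

    def configure(self, accelerator_id: str,
                  settings: Dict[str, Any]) -> Dict[str, Any]:
        return {"id": accelerator_id, "applied": dict(settings)}


class AgentDriverEndpoint:
    """In-process stand-in for the agent-to-driver RPC endpoint."""

    ALLOWED_METHODS = ("discover", "configure")

    def __init__(self) -> None:
        self._drivers: Dict[str, AcceleratorDriver] = {}

    def register(self, name: str, driver: AcceleratorDriver) -> None:
        # Only objects implementing the common interface are accepted.
        if not isinstance(driver, AcceleratorDriver):
            raise TypeError("driver must implement AcceleratorDriver")
        self._drivers[name] = driver

    def call(self, driver_name: str, method: str, **kwargs: Any) -> Any:
        # Restrict calls to the methods defined by the interface.
        if method not in self.ALLOWED_METHODS:
            raise ValueError("unsupported driver method: %s" % method)
        return getattr(self._drivers[driver_name], method)(**kwargs)
```

A possible usage, under the same assumptions: `endpoint = AgentDriverEndpoint()`, then `endpoint.register("generic", GenericDriver())`, then `endpoint.call("generic", "discover")`. Because the agent only ever goes through `call`, swapping the in-process dispatch for a real RPC transport later would not change the driver classes.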