#openstack-cyborg log

15:00:56 <zhipeng_> #startmeeting openstack-cyborg
15:00:56 <openstack> Meeting started Wed Jun  7 15:00:56 2017 UTC and is due to finish in 60 minutes.  The chair is zhipeng_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:00 <openstack> The meeting name has been set to 'openstack_cyborg'
15:01:05 <zhipeng_> hahaha
15:01:10 <zhipeng_> let's hope so
15:01:18 <zhipeng_> okay so quick update from my side
15:01:22 <zhipeng_> on the api/db patch
15:01:40 <zhipeng_> #topic BP discussion
15:01:43 <zhipeng_> #link https://review.openstack.org/#/c/445814/
15:02:07 <zhipeng_> so ChrisD reviewed and commented that there is an ongoing discussion on traits
15:02:16 <zhipeng_> we might consider aligning our design with it
15:02:43 <zhipeng_> originally, the placement resource provider was meant for just compute node
15:02:53 <jkilpatr> I was looking over that, care to summarize?
15:03:16 <zhipeng_> sure I'm putting my thoughts together now
15:03:27 <zhipeng_> so now the placement team sees the pitfall in that
15:03:48 <zhipeng_> since for example for shared storage (external arrays I would suppose)
15:04:03 <zhipeng_> if you only count the storage side of things on the compute node
15:04:22 <zhipeng_> your resource provider will never correctly reflect the required traits
15:04:43 <jkilpatr> so this is an issue with accelerators that may be shared between many computes?
15:04:48 <zhipeng_> the resource provider should reflect the shared storage arrays, rather than only local disks
15:05:02 <zhipeng_> no, I think this is an issue for accelerators as a whole
15:05:14 <jkilpatr> how so?
15:05:23 <zhipeng_> since if the resource provider is only identified with the compute node
15:05:56 <zhipeng_> we could wind up with the same problem as we have now, since accelerator characteristics are bundled with the compute characteristics
15:06:13 <zhipeng_> well we could have our own resource class for sure, but that does not solve the problem
15:06:36 <zhipeng_> nova scheduler asks the placement api to provide all the necessary resources
15:07:00 <zhipeng_> and for Cyborg, one of the important goals is that accelerators are treated as first-class citizens
15:07:51 <zhipeng_> meaning that we should have individual resource providers for accelerators
15:08:20 <zhipeng_> from the email link Chris provided, there is an etherpad documenting the "Plan B"
15:09:02 <jkilpatr> ok so the issue is that if we have a 'gpu' resource provider it's dependent on computes in a way that resource providers aren't supposed to be.
15:09:02 <zhipeng_> which I liked very much; it works on extending the current nested resource provider definition to a more relaxed, multiple-resource-provider one
15:09:11 <zhipeng_> yes exactly
15:09:42 <zhipeng_> the scheduling decision would still largely depend on the regular compute features, since we are just part of the traits
15:09:59 <crushil> interesting
15:10:12 <zhipeng_> so back to the "Plan B", the current nested resource provider model is designed primarily for stuff like NUMA nodes
15:10:22 <zhipeng_> where you got this parent-child relationship
15:10:28 <crushil> So, how does that change our implementation?
15:10:42 <zhipeng_> the Plan B extends the scope to be more general, meaning for Cyborg use cases
15:11:05 <zhipeng_> we could have multiple resource providers, one for each and every accelerator
15:11:14 <zhipeng_> (if they are deemed important for the workload)
15:11:23 <zhipeng_> crushil the change is that
15:11:46 <zhipeng_> our DB design has to align with the proposed nested resource provider/trait design
15:12:02 <zhipeng_> at least DB schemas
15:12:18 <zhipeng_> so that when the cyborg agent populates our inventory to the placement api
15:12:25 <zhipeng_> it could understand it correctly
15:13:59 <crushil> Ok, what about the other specs?
15:14:13 <zhipeng_> not concerned that much :)
15:14:18 <crushil> gotcha
15:14:58 <zhipeng_> So I'm thinking we might need two DB schemas
15:15:15 <zhipeng_> the current one in the spec patch, could be used for the discovery phase
15:15:37 <zhipeng_> that is when the user starts the cyborg service and then the agent/driver does the discovery/pre-config
15:15:53 <zhipeng_> collect what we have, on the host
15:16:16 <zhipeng_> the second schema needs to be aligned with the nested resource provider
15:16:33 <zhipeng_> to interact with placement api and eventually nova-scheduler
15:16:53 <zhipeng_> for the VM to select the correct accelerator resource
15:17:19 <jkilpatr> so we need to maintain two parallel db's for each purpose or do you mean we want to change the format in a future release?
15:18:20 <zhipeng_> what I'm thinking is that we don't have exhaustive knowledge on the hardware now
15:18:58 <zhipeng_> therefore we keep a separate DB schema, the host side one should be more extendable or more abstract
15:19:17 <zhipeng_> But on another thought
15:19:23 <zhipeng_> it might be just too complex .....
15:19:26 <zhipeng_> what do you guys think
15:19:49 <jkilpatr> I think we should try and keep one db as much as possible, I don't want to try and maintain parallel sets of data
15:19:58 <zhipeng_> that makes sense
15:20:24 <crushil> I agree, having multiple DBs is just clunky
15:21:04 <zhipeng_> in that case we will just use the resource provider schema, I will follow up with Chris to see which one I should use
15:21:11 <zhipeng_> the current one or the proposed one
15:22:19 <jkilpatr> sounds good.
15:22:56 <jkilpatr> Anything else on that subject?
15:22:59 <zhipeng_> nope
15:23:13 <zhipeng_> anything else from you guys on the open spec ?
15:23:33 <crushil> nope
15:23:51 <zhipeng_> great
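A minimal sketch of the idea discussed under this topic, where each accelerator gets its own resource provider instead of being folded into the compute node's provider. This is not the placement API: the `ResourceProvider` class, the `find_providers` helper, and every provider, resource class, and trait name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ResourceProvider:
    """Hypothetical stand-in for a resource provider record."""
    name: str
    parent: Optional[str] = None  # name of the parent provider, if nested
    inventory: Dict[str, int] = field(default_factory=dict)
    traits: Set[str] = field(default_factory=set)


def find_providers(providers: List[ResourceProvider], resource_class: str,
                   amount: int, required_traits: Set[str]) -> List[str]:
    """Return names of providers holding enough inventory and all traits."""
    return [
        p.name
        for p in providers
        if p.inventory.get(resource_class, 0) >= amount
        and required_traits <= p.traits
    ]


# One provider for the compute node, plus one per accelerator (assumed names).
providers = [
    ResourceProvider("compute-1", inventory={"VCPU": 16, "MEMORY_MB": 65536}),
    ResourceProvider("compute-1-gpu-0", parent="compute-1",
                     inventory={"CUSTOM_ACCELERATOR_GPU": 1},
                     traits={"CUSTOM_GPU_VENDOR_X"}),
    ResourceProvider("compute-1-fpga-0", parent="compute-1",
                     inventory={"CUSTOM_ACCELERATOR_FPGA": 1},
                     traits={"CUSTOM_FPGA_VENDOR_Y"}),
]

matches = find_providers(providers, "CUSTOM_ACCELERATOR_GPU", 1,
                         {"CUSTOM_GPU_VENDOR_X"})
```

Because the GPU and FPGA carry their own inventory and traits, a request for an accelerator can be matched without touching the compute node's own characteristics, which is the separation the discussion is after.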
15:23:58 <zhipeng_> #topic initial code development
15:24:05 <zhipeng_> so, any roadblocks
15:24:39 <jkilpatr> been trying to understand oslo rpc and message passing and start structuring the conductor/agent
15:24:52 <zhipeng_> sounds like a great start :)
15:25:02 <crushil> I have created stubs and I will push them up by the end of the week
15:25:11 <zhipeng_> great !
15:25:18 <jkilpatr> crushil, sounds good.
15:26:03 <zhipeng_> let's do small pieces like Justin suggested
15:26:09 <crushil> I will fill them out rebased on top of the API and agent patches
15:26:10 <jkilpatr> so a lot of what we will be doing involves rpc between different components, so people with integrating parts need to talk to each other about interfaces
15:26:19 <jkilpatr> I don't think we should be too worried about a stable internal interface
15:26:29 <zhipeng_> yes I agree
15:27:04 <zhipeng_> oslo.messaging could provide everything we need
15:27:46 <jkilpatr> well sometimes we need rpc, for example the driver should be called by the agent over rpc I'm thinking (we could invoke it directly but I'm not sure I want to do that)
15:29:38 <zhipeng_> i think it should be done over rpc
15:29:57 <zhipeng_> unless we give the driver restful apis?
15:30:31 <jkilpatr> I don't think that's the right application here. Our internal code needs to be more tightly integrated than restfulness allows.
15:30:45 <zhipeng_> yep
15:31:07 <zhipeng_> so rpc should be fine here
15:31:27 <zhipeng_> i think at the moment, it is agent talking to the generic driver
15:31:50 <zhipeng_> later on, we should design something like the neutron ml2 driver interface
15:32:29 <zhipeng_> so that every driver, vendor or not, implements the interface that rpc calls will go through
15:32:34 <zhipeng_> in a rather standard way
15:33:29 <rushil> Ok. So, are we going to follow the neutron model vs the nova/cinder model?
15:33:47 <zhipeng_> i think more like the neutron model
15:33:55 <zhipeng_> for out-of-tree drivers
15:34:03 <rushil> But isn't that too complicated
15:34:09 <zhipeng_> cinder and nova are mostly in-tree maintained drivers
15:34:20 <zhipeng_> it won't be too complicated for us i think
15:34:40 <zhipeng_> neutron is complicated because they have to define the type drivers and mechanism drivers
15:34:42 <rushil> Well, cinder has out of tree drivers based on whether you have CI or not
15:35:04 <zhipeng_> I think in-tree drivers also require CI
15:35:14 <zhipeng_> otherwise the cinder team removes your driver
15:35:38 <rushil> No, they just make it unsupported i.e. move it out of tree
15:35:54 <zhipeng_> for us, as long as the devices communicate over PCIe, the driver interface won't be too complicated
15:36:12 <zhipeng_> but if we need to support extra protocols, that is where things will get wild
15:36:24 <zhipeng_> rushil ah okay
15:36:25 <rushil> Ok. I just want to make sure we don't make things more complicated than they should be
15:36:37 <zhipeng_> yes that is always our goal
15:36:38 <jkilpatr> I can agree on a standard rpc interface but that's less complicated than I think you are making it out to be.
15:36:49 <zhipeng_> we even wanted to skip the conductor :P
15:37:14 <jkilpatr> and I nearly got away with it too!
15:37:21 <zhipeng_> jkilpatr haha
15:37:41 <rushil> Lol
15:39:26 <zhipeng_> rushil the cyborg ml2 driver would be modeled from your generic driver implementation :P
15:40:37 <rushil> I wouldn't call it ml2 driver though
15:40:52 <zhipeng_> of course we will have another name for it
15:41:14 <zhipeng_> aluminum drivers :P
15:41:24 <zhipeng_> for cyborg robots
15:41:45 <rushil> Hehe
15:42:36 <jkilpatr> Anyways I'll try to have a stub up this week (conductor) and then agent next week.
15:42:46 <jkilpatr> depends on how other tasks go for me.
15:43:44 <rushil> jkilpatr: Cool
15:43:50 <zhipeng_> sounds great, i got another colleague working on cyborg this week, so api code will be developed in parallel
15:44:09 <rushil> Awesome
15:44:10 <zhipeng_> hopefully once we settle the spec, the initial code will come out
15:44:21 <zhipeng_> and we could iterate on it
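A rough sketch of the standard driver interface discussed under this topic, where every driver, vendor or not, implements the same set of calls and the agent reaches the driver through them. The class names, the `discover` and `attach` methods, and the device data are all hypothetical. The `AgentEndpoint.call` dispatcher is only an in-process stand-in for the RPC hop, not oslo.messaging.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AcceleratorDriver(ABC):
    """Hypothetical interface every driver, vendor or not, would implement."""

    @abstractmethod
    def discover(self) -> List[Dict[str, str]]:
        """Return a description of each accelerator found on the host."""

    @abstractmethod
    def attach(self, accelerator_id: str, instance_id: str) -> bool:
        """Attach an accelerator to an instance; True on success."""


class GenericDriver(AcceleratorDriver):
    """Toy generic driver with a hard-coded device list (illustration only)."""

    def __init__(self) -> None:
        self._devices = {"acc-0": {"id": "acc-0", "type": "generic"}}
        self._attached: Dict[str, str] = {}

    def discover(self) -> List[Dict[str, str]]:
        return list(self._devices.values())

    def attach(self, accelerator_id: str, instance_id: str) -> bool:
        # Refuse unknown devices and devices that are already in use.
        if accelerator_id not in self._devices or accelerator_id in self._attached:
            return False
        self._attached[accelerator_id] = instance_id
        return True


class AgentEndpoint:
    """Stand-in for the agent side: dispatches a named call to the driver.

    In a real deployment the call would travel over RPC; here it is a
    direct in-process lookup so the sketch stays self-contained.
    """

    def __init__(self, driver: AcceleratorDriver) -> None:
        self._driver = driver

    def call(self, method: str, **kwargs):
        if method not in ("discover", "attach"):
            raise ValueError(f"unknown driver method: {method}")
        return getattr(self._driver, method)(**kwargs)


agent = AgentEndpoint(GenericDriver())
found = agent.call("discover")
first_attach = agent.call("attach", accelerator_id="acc-0", instance_id="vm-1")
second_attach = agent.call("attach", accelerator_id="acc-0", instance_id="vm-2")
```

Keeping the agent dependent only on the abstract class means an out-of-tree vendor driver could be swapped in for `GenericDriver` without changing the agent side.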
15:44:48 <zhipeng_> #topic AoB
15:44:52 <zhipeng_> any other topics
15:44:59 <rushil> Btw our group at Lenovo sent out initial emails to vendors to get their drivers aligned with cyborg
15:45:20 <zhipeng_> wow
15:45:24 <zhipeng_> that is awesome
15:45:45 <rushil> I'll keep you guys posted on that
15:45:51 <zhipeng_> could you disclose the vendor names for now ?
15:45:57 <zhipeng_> or should we wait until later
15:46:14 <rushil> The usual suspects
15:46:35 <zhipeng_> e.g ?
15:46:47 <rushil> Nvidia, AMD
15:47:00 <rushil> And smaller ones like Micron
15:47:29 <zhipeng_> cool !
15:47:30 <rushil> I'll let y'all know when they are committed to contributing code
15:47:40 <zhipeng_> great :)
15:50:41 <zhipeng_> okay if there are no other topics, we go to the usual long slumber ~~
15:50:56 <zhipeng_> will try to remember to close the meeting an hour later
15:51:05 <crushil> Cool, thanks zhipeng_
17:00:56 <zhipeng_> #endmeeting