15:03:31 <zhipeng> #startmeeting openstack-cyborg
15:03:31 <openstack> Meeting started Wed May 17 15:03:31 2017 UTC and is due to finish in 60 minutes.  The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:03:34 <openstack> The meeting name has been set to 'openstack_cyborg'
15:03:38 <jkilpatr> ah, speak of the devil.
15:03:41 <zhipeng> just got connected
15:03:43 <zhipeng> :P
15:03:59 <jkilpatr> ok then I had a list of topics to go over
15:04:08 <jkilpatr> unless someone wants to bring other things up before I steamroll ahead?
15:04:10 * cdent is here to lurk
15:04:29 <zhipeng> #topic BP discussion
15:04:38 <zhipeng> jkilpatr go ahead
15:05:55 <jkilpatr> zhipeng, your outlined Cyborg API doesn't actually have an attach API call. What did you plan to do, catch some property of a booting instance?
15:07:03 <zhipeng> as far as I can tell I did include the attach API?
15:07:04 <jkilpatr> nova *can* do PCI hotplug, supposedly, which would make live attachment *possible*, but I have no grasp on the gap between "it should work" and "it actually will"
15:07:21 * jkilpatr is checking the commit again please hold
15:07:57 * jkilpatr hold music ends
15:08:34 <jkilpatr> so we have GET, POST, PUT, DELETE, but they all seem to be for managing the database of accelerators. You can PUT to update an accelerator spec, but what if I have an instance I need to attach it to?
15:09:35 <zhipeng> Okay, for VM instances I still think we would have to look to Nova for the actual attachment operation
15:09:59 <zhipeng> actually Jay and I discussed this today
15:10:29 <zhipeng> unless we identify a set of properties for the accelerator connection to the host
15:10:48 <zhipeng> meaning that we have a os-brick like library
15:11:16 <zhipeng> until we have that, we could just assume it is a regular PCIe attachment
15:12:06 <zhipeng> I think for most of the use cases we have at the moment, PCIe would be the most usual case, and Nova currently supports that
15:13:08 <jkilpatr> I think we need to come to a workflow conclusion sooner rather than later: how does the user ask for a VM with a specific accelerator? Do they post to us and then we talk to Nova, do they post to Nova and we watch for some tag, etc.
15:13:34 <jkilpatr> should there be like a 'user workflow' spec? I guess it goes into the api
15:14:01 <zhipeng> workflow is kinda an end-to-end thing
15:14:46 <zhipeng> I personally would suggest that for the moment we take the Cinder approach, and expand later based upon it
15:15:02 <jkilpatr> that meaning live attachment?
15:15:13 <zhipeng> because this might be the only way that we don't create a major impact on the existing implementations
15:15:42 <zhipeng> not necessarily live attachment
15:15:51 <zhipeng> but just attach/detach ops in general
15:16:00 <zhipeng> for VM instances
15:16:15 <zhipeng> since you will need the instance ID and host ID anyway
15:16:22 <zhipeng> and Nova has all of these
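A minimal sketch of what the Cinder-style attach/detach calls under discussion might look like against the proposed Cyborg API; the endpoint, port, and payload fields are illustrative assumptions, not the agreed spec:

```python
# Hypothetical Cinder-style attach/detach against the proposed Cyborg API.
# Endpoint, port, and payload fields are assumptions for illustration only;
# the actual (PCIe) attachment work is delegated to Nova, Cyborg records it.
import requests

CYBORG = 'http://controller:6666/v1'   # assumed endpoint; no port is settled
HEADERS = {'X-Auth-Token': 'USER_TOKEN', 'Content-Type': 'application/json'}


def attach(accelerator_id, instance_id, host_id):
    # record the accelerator-to-instance association in Cyborg
    resp = requests.post(
        '%s/accelerators/%s/attach' % (CYBORG, accelerator_id),
        json={'instance_id': instance_id, 'host_id': host_id},
        headers=HEADERS)
    resp.raise_for_status()
    return resp.json()


def detach(accelerator_id, instance_id):
    resp = requests.post(
        '%s/accelerators/%s/detach' % (CYBORG, accelerator_id),
        json={'instance_id': instance_id},
        headers=HEADERS)
    resp.raise_for_status()
```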
15:16:26 <jkilpatr> ok then: we have a booted VM we need to do the attachment on; we can set up Nova to do the passthrough and then reboot the instance. Not sure how Nova feels about that, but if it doesn't work I don't think they would be opposed to us making it work
15:16:37 <jkilpatr> but also Nova does support live PCI attachment, so we can/should just use that
15:17:07 <zhipeng> yes if Nova indeed could support that
15:17:14 <jkilpatr> https://wiki.openstack.org/wiki/Nova/pci_hotplug
15:17:25 <jkilpatr> this page is a stub right now, so there's some support
15:17:28 <zhipeng> we just don't differentiate in Cyborg
15:18:14 <jkilpatr> the driver actually handles this of course, we just run attach on a live instance that was spawned with a flavor or some other indicator that says "cyborg: TeslaP100"
15:18:59 <zhipeng> yes
15:19:21 <jkilpatr> because we have to make sure Nova gets it to the right spot first, so we need to have the placement API fed live knowledge, and then the instance needs to request the resource name we fed the placement API
15:19:40 <jkilpatr> maybe we can have a command in cyborg that will help you make flavors, but that's for later.
15:19:51 <zhipeng> hotplug could be a trait when we use the placement API
15:20:33 <zhipeng> so when we schedule it, it knows it needs a compute node with the hotplug feature
15:20:46 <jkilpatr> wouldn't we want a bunch of traits?
15:20:58 <zhipeng> yep I guess so :)
15:21:03 <jkilpatr> like GPU, with CUDA support, with hotplug support... so on and so forth
15:21:10 <zhipeng> we could start with really basic and simple ones
15:21:11 <jkilpatr> this is why we need a flavor creation wizard in cyborg but once again, later.
15:21:25 <zhipeng> yes
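A sketch of how the traits just mentioned (GPU, CUDA, hotplug) might be fed to placement; the trait names are assumed custom traits, and the traits API requires a recent placement microversion (around 1.6):

```python
# Hedged sketch: tag a compute node's resource provider with custom traits
# via the placement REST API. Trait names here are assumptions; custom
# traits must start with CUSTOM_ and be created before they can be assigned.
import requests

PLACEMENT = 'http://controller:8778'
HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN',
           'OpenStack-API-Version': 'placement 1.6'}


def set_traits(rp_uuid, generation, traits):
    # create each custom trait if it does not exist yet
    for trait in traits:
        requests.put('%s/traits/%s' % (PLACEMENT, trait),
                     headers=HEADERS).raise_for_status()
    # PUT replaces the provider's full trait set; the generation guards
    # against concurrent updates to the same resource provider
    resp = requests.put(
        '%s/resource_providers/%s/traits' % (PLACEMENT, rp_uuid),
        json={'resource_provider_generation': generation, 'traits': traits},
        headers=HEADERS)
    resp.raise_for_status()


set_traits('COMPUTE_NODE_RP_UUID', 1,
           ['CUSTOM_HW_GPU', 'CUSTOM_HW_CUDA', 'CUSTOM_HW_HOTPLUG'])
```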
15:21:34 <jkilpatr> ok anyone have comments on these ideas? things that they might want that this won't cover?
15:22:15 <jkilpatr> zhipeng, I'm going to put a comment on your API patch as a reminder to add Cinder-like attach/detach to the spec, sound good?
15:23:00 <zhipeng> which I thought was already done in the current patch?
15:23:27 <cdent> it would be great to see, at some point, a narration of the expected end to end flow from a user's standpoint, if it doesn't already exist. Including how various services will be touched.
15:24:17 <zhipeng> cdent we've got a flow chart in our BOS presentation, but it's rather rudimentary at the moment
15:24:20 <jkilpatr> cdent, we have a decent idea of how we want it to work but I expect some things will change as we get into the nitty gritty of placement problems
15:24:42 <cdent> sure, change is the nature of this stuff :)
15:24:47 <jkilpatr> zhipeng, ok so if I want to attach an accelerator to an instance, what do I do? Do I PUT to update an accelerator spec with a new instance ID to attach to?
15:24:48 <zhipeng> :)
15:25:18 <jkilpatr> cdent, I think I'll put up a user workflow spec later today, just so that we keep track of all of this better.
15:25:31 <cdent> \o/
15:25:48 <cdent> can you add me as a review on that when it is up, so I get some email to remind me to look?
15:26:23 <zhipeng> just as we drew it for our presentation: after the user uses the Cyborg service to complete the discovery phase, and Cyborg finishes interacting with placement to advertise the accelerator inventory
15:26:24 <jkilpatr> will do, whats your email?
15:26:44 <cdent> cdent@anticdent.org is me on gerrit
15:27:01 <zhipeng> then the user just requests an instance on a compute node with the corresponding accelerator trait
15:27:32 <zhipeng> if the traits include hotplug, then maybe it will be a live attachment
15:27:45 <jkilpatr> I really don't think we're going to get away with one trait per accelerator, users will probably bundle them into flavors, but instead of being tied to a list of whitelisted PCI devices these flavors can be much more general.
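A hypothetical sketch of bundling an accelerator indicator into a flavor with python-novaclient, along the lines of the "cyborg: TeslaP100" indicator mentioned earlier; the extra-spec key is an assumption, not an agreed convention:

```python
# Hedged sketch: create a flavor and tag it with a hypothetical Cyborg
# extra spec. The 'cyborg:accelerator' key is an assumption for
# illustration; nothing like it has been agreed in the specs yet.
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3', username='admin',
    password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
sess = session.Session(auth=auth)
nova = client.Client('2', session=sess)

# a flavor bundling the accelerator request with the usual sizing
flavor = nova.flavors.create('gpu.p100', ram=8192, vcpus=4, disk=40)
flavor.set_keys({'cyborg:accelerator': 'TeslaP100'})
```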
15:27:50 <zhipeng> which means user could attach the accelerator after VM creation
15:28:35 <zhipeng> I was told by Jay today that traits are per resource provider
15:29:02 <zhipeng> so it would mostly be one trait per compute node
15:29:08 <jkilpatr> I'll have to look at it in detail, I was watching the summit presentation again today.
15:29:17 <zhipeng> or if we've got vGPUs or FPGA virtual functions
15:29:28 <zhipeng> then it would be a nested resource provider
15:29:38 <zhipeng> and we could have traits on the virtual functions
15:29:53 <zhipeng> but anyways it does not tie to a specific accelerator
15:30:10 <zhipeng> just depending on how you model your accelerators into resource providers
15:30:22 <zhipeng> cdent plz correct me if i'm wrong :P
15:30:24 <jkilpatr> we're going to need to be careful with that.
15:31:03 <zhipeng> #link https://pbs.twimg.com/media/DAAUxEWUAAAV6zA.jpg
15:31:16 <cdent> zhipeng: that looks mostly correct, but I'm only partially paying attention :(
15:31:27 <zhipeng> cdent no problemo :)
15:31:51 <zhipeng> as long as I don't make any extremely wrong claims :)
15:32:07 <jkilpatr> so implementation. What can we start and when?
15:32:24 <zhipeng> as soon as we freeze the specs
15:32:35 <zhipeng> i suppose we should all go ahead and start coding
15:32:48 <crushil> Can we focus on closing out the specs first though?
15:32:55 <zhipeng> yes
15:32:57 <jkilpatr> ok then that's a plan.
15:33:15 <zhipeng> okay, then for the api spec
15:33:31 <zhipeng> #link https://review.openstack.org/445814
15:33:35 <zhipeng> any other questions ?
15:34:22 <jkilpatr> I just posted a comment there, otherwise I'm happy enough
15:34:31 <zhipeng> okay
15:34:54 <jkilpatr> um should we pick a database tech? what's available already: SQL, Mongo... Redis (not sure about that one)
15:35:08 <crushil> MariaDB?
15:35:15 <zhipeng> #action jkilpatr to post a reminder comment, the api spec patch is LGTM
15:35:39 <zhipeng> i think we could just use mysql
15:35:54 <jkilpatr> MariaDB == MySQL, except when it doesn't
15:36:04 <jkilpatr> I think openstack ships with Maria right now
15:36:07 <zhipeng> yes
15:36:09 <crushil> yup
15:36:28 <zhipeng> next up, agent spec
15:36:33 <zhipeng> #link https://review.openstack.org/#/c/446091/
15:37:12 <jkilpatr> looks like most people are happy with it.
15:37:14 <zhipeng> #info Jay Pipes suggests the agent could directly interact with the placement API, instead of going through the conductor
15:37:30 <zhipeng> #link https://pbs.twimg.com/media/DAAUtyoUMAAXQXI.jpg
15:37:50 <jkilpatr> so from the summit preso all the computes already talk to the placement api themselves
15:37:51 <zhipeng> but I guess we don't need to reflect that in the agent spec
15:37:54 <jkilpatr> so it's designed to scale well like that
15:38:30 <jkilpatr> I'd prefer to be explicit, I'll patch it into my spec today
15:38:48 <crushil> We should reflect that in the agent spec
15:39:01 <zhipeng> jkilpatr yes, and for implementation, Jay suggests we could just directly copy nova/scheduler/client/report.py
15:39:24 <zhipeng> since it is basically REST calls between the agent and the placement API
15:39:32 <jkilpatr> that's the sort of laziness I can get behind.
15:39:42 <zhipeng> XD
15:39:44 <crushil> lol
15:40:29 <zhipeng> #agreed jkilpatr to do a quick update on the agent spec to reflect jaypipes' comment, then the agent spec patch is LGTM
15:40:38 <zhipeng> okay, next up, generic driver
15:40:58 <zhipeng> #link https://review.openstack.org/#/c/447257/
15:41:04 <zhipeng> any more comments
15:41:08 <zhipeng> looks fine to me
15:41:52 <crushil> jkilpatr, Any other comments on your end? I have tried to address all of your and Roman's comments in the patch
15:41:56 <jkilpatr> what about detecting accelerators? discovery has to be handled by someone, do we want drivers to have a discovery call?
15:42:10 <jkilpatr> I like the rest of the api list for it, good job
15:43:12 <crushil> I can add that to the list. What would be the flow though for discovery?
15:43:37 <zhipeng> i think discovery is already part of the spec?
15:43:44 <zhipeng> see line 121
15:43:55 <jkilpatr> ah yup just not in the other list
15:44:04 <crushil> It's not part of the API list
15:44:09 <jkilpatr> crushil, the flow (which I think you should add into your spec, or maybe I should add into the agent spec)
15:44:27 <jkilpatr> is: the agent on first startup says "hey, I've never been started up before, let's call discover for all my drivers"
15:44:48 <jkilpatr> whatever returns true, it lists and sends to the conductor to store in the DB as possible accelerators
15:45:08 <jkilpatr> later on operators can call discover to do this again and add new accelerators.
15:45:42 <jkilpatr> as a note, I think accelerators should get added in a "not ready" state, with the operator having to tell cyborg to go install drivers; otherwise we risk bad endings installing software on live clouds
15:45:59 <jkilpatr> more things to add to the agent spec
15:46:14 <zhipeng> agree
15:46:17 <crushil> +1
15:46:35 <crushil> Makes sense, but should we add it to the driver spec or agent spec or both?
15:47:26 <zhipeng> i think for both, because discovery is directly triggered by the agent running loops over the drivers, right?
15:47:28 <jkilpatr> crushil, the driver spec just needs "on discovery, return whether the accelerator exists or not"; the agent is the one that will call discovery, then wait for the operator to call the API to move the accelerator into 'ready' before calling the install-driver function.
15:48:08 <zhipeng> yep
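A sketch of the driver-side discovery hook being agreed here; the class and method names are assumptions rather than the spec's final interface:

```python
# Hedged sketch of the generic driver discovery interface. Names are
# assumptions; only the discover/install split reflects the discussion.
import abc


class AcceleratorDriver(abc.ABC):

    @abc.abstractmethod
    def discover(self):
        """Return the accelerators this driver can see on the host.

        The agent calls this on first startup (and again on demand);
        results go to the conductor, which stores them in the DB in a
        'not ready' state until an operator enables them.
        """

    @abc.abstractmethod
    def install(self, accelerator):
        """Install host software once the operator moves it to 'ready'."""
```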
15:48:27 <jkilpatr> um speaking of message passing
15:48:34 <jkilpatr> most of this should be done over message passing
15:48:41 <jkilpatr> RabbitMQ/oslo.messaging fine?
15:48:53 <crushil> Yup, that is the OpenStack standard
15:48:59 <zhipeng> yep
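A minimal oslo.messaging RPC sketch for the agent-to-conductor calls over RabbitMQ; the topic and method names are assumptions:

```python
# Hedged sketch of agent -> conductor RPC over RabbitMQ via oslo.messaging;
# the 'cyborg-conductor' topic and 'report_accelerators' method are assumed.
from oslo_config import cfg
import oslo_messaging as messaging

CONF = cfg.CONF
# the transport_url (e.g. rabbit://guest:guest@controller:5672/) comes
# from the service's oslo.config configuration
transport = messaging.get_transport(CONF)
target = messaging.Target(topic='cyborg-conductor', version='1.0')
client = messaging.RPCClient(transport, target)

# the agent reports newly discovered accelerators; the conductor stores
# them in the DB in a 'not ready' state
client.call({}, 'report_accelerators',
            accelerators=[{'type': 'GPU', 'state': 'not ready'}])
```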
15:50:13 <zhipeng> #agreed crushil to update the driver spec to include the discovery interface, and jkilpatr update the agent spec to reflect the related operations, otherwise it is LGTM
15:50:48 <zhipeng> moving along, next up, interaction https://review.openstack.org/#/c/448228/
15:50:52 <zhipeng> #link https://review.openstack.org/#/c/448228/
15:51:01 <zhipeng> I think we still need more work on this
15:51:27 <zhipeng> first of all, thx to gryf for working on this in his own time
15:51:53 <ttk2[m]> I think this is where most of the workflow stuff is hidden right now.
15:52:04 <ttk2[m]> Oh, this is jkilpatr, moved to my phone.
15:53:00 <zhipeng> yes
15:53:32 <zhipeng> we should continue to work on the spec, but I don't think it will block our implementation
15:54:32 <zhipeng> any thoughts ?
15:56:18 <zhipeng> and ttk2[m] I think you could just work with Roman on this patch to illustrate the workflow
15:56:25 <zhipeng> and also have cdent review it
15:56:37 <crushil> Ya, makes sense. But, we need to have a cutoff date to finish the spec
15:57:40 <zhipeng> we slipped the Apr 15th one rather quickly lol, but ya, I agree we need another cutoff date
15:57:55 <zhipeng> what is the m2 deadline for Pike ?
15:58:40 <crushil> June 9
15:59:05 <zhipeng> i think we could just use that for all the non-LGTM specs per today's meeting
15:59:23 <ttk2[m]> Ok then. Can we comment on the specs with that deadline?
15:59:41 <crushil> We should close out all the other specs sooner though
15:59:51 <ttk2[m]> I feel like we should make a point of moving info out of meetings and into specs, so we don't lose it in the black hole of IRC logs.
15:59:58 <crushil> +1
16:00:42 <zhipeng> +1
16:01:05 <zhipeng> at least for all the LGTM specs I will merge those by the end of this week
16:01:47 <zhipeng> #agreed set June 9th as a hard cut-off date for all the remaining specs, including cyborg-nova interaction
16:01:58 <zhipeng> next up, conductor spec
16:02:14 <zhipeng> #link https://review.openstack.org/#/c/463316/
16:02:30 <zhipeng> i think I will post some reviews, mostly on the wording
16:02:53 <zhipeng> but this should be a simple one for us to freeze this week
16:03:41 <ttk2[m]> Agreed. It's pretty much just glue code.
16:04:08 <zhipeng> #agreed after some polishing, conductor spec LGTM this week
16:04:24 <zhipeng> the last one in the queue, not a spec patch tho
16:04:34 <zhipeng> #link https://review.openstack.org/#/c/461220/
16:04:57 <zhipeng> could folks just give it a +1 so that I can merge it, it is mostly house-cleaning stuff
16:07:43 <gryf> I have mixed feelings about that
16:08:05 * gryf just joined
16:08:22 <zhipeng> gryf which topic ?
16:08:36 <gryf> nacsa.tgz in a repo
16:08:44 <gryf> it doesn't sound right
16:09:24 <zhipeng> we just host it in the sandbox
16:09:37 <zhipeng> we could even move them out to an individual repo later on
16:09:46 <gryf> well, yeah
16:09:50 <zhipeng> but we did have extensive discussion on that matter
16:09:55 <zhipeng> with moshe and his team
16:09:59 <gryf> but it will affect the size of the repository
16:10:49 <zhipeng> then I think maybe we could move the sandbox out to an individual repo, such as cyborg-sandbox
16:11:01 <zhipeng> so that it won't affect the cyborg project repo itself
16:11:19 <gryf> yes, I think that the better solution
16:11:23 <gryf> also
16:11:53 <gryf> I'd like to avoid keeping binary blobs in repository
16:12:03 <ttk2[m]> Agreed.
16:12:05 <zhipeng> that's fine for me :)
16:12:20 <zhipeng> but we do need to merge it first, and then move it out
16:12:27 <zhipeng> due process
16:12:35 <gryf> so the perfect solution would be to unpack it and make the commit which moves the entire work into its own directory. what do you think?
16:12:51 <zhipeng> nah, that won't be necessary
16:13:15 <zhipeng> i think just move it to another repo, just for the record
16:13:30 <zhipeng> we won't do any release, for example , for the cyborg-sandbox
16:13:37 <ttk2[m]> Um if we merge it it's in the repo history forever.
16:13:40 <zhipeng> it just sits there
16:13:48 <zhipeng> no, we could move it out
16:13:59 <zhipeng> and we need to move out the specs later as well
16:14:14 <zhipeng> cyborg-spec will be the standalone repo to store all the specs
16:14:19 <ttk2[m]> I don't have super strong feelings. But I'd like to keep binaries out of the repo
16:14:28 <gryf> ttk2[m], +1
16:14:39 <zhipeng> I have no problem either
16:14:58 <zhipeng> but let's just follow a procedure and get it done
16:16:44 <zhipeng> sound reasonable to everyone?
16:18:23 <gryf> zhipeng, what exactly do you mean by following procedure?
16:19:02 <zhipeng> have it first in the current cyborg repo, and then move it out to a separate one
16:19:31 <gryf> I'm against it. as ttk2[m] said - if we merge it, it stays forever.
16:19:43 <zhipeng> why ??
16:19:51 <gryf> it's a git :>
16:19:54 <ttk2[m]> Because history
16:20:02 <zhipeng> you're saying we couldn't even move the specs out?
16:20:12 <adreznec> merging it will permanently increase the repo size because the artifact will remain in the history forever
16:21:00 <zhipeng> okey understood
16:21:06 <gryf> zhipeng, we can, but they will still be available if someone would like to go back in time (in history), and nothing prevents them from doing so :D
16:21:37 <zhipeng> then I will abandon the patch and directly submit it to the separate repo instead
16:21:42 <zhipeng> does this sound reasonable?
16:21:42 <gryf> unless we do some rebase stuff on the repo itself, but I'm not sure if that is good practice
16:21:55 <gryf> yup
16:22:31 <zhipeng> #agreed abandon the nacsa sandbox patch and directly submit it to a separate repo
16:22:53 <adreznec> gryf: yeah, you basically have to use a rebase or git filter-branch to remove it, but that'll break everyone's checked out repos since you're rewriting history... so not typically good practice
16:23:03 <zhipeng> okay, we got many things settled :)
16:23:11 <zhipeng> #topic CI discussion
16:23:33 <zhipeng> as I understand it, ttk2[m] and gryf have had some discussion on the CI setup
16:23:40 <zhipeng> do we have any preference now?
16:23:40 <gryf> adreznec, yeah.
16:24:25 <gryf> zhipeng, it was mostly very high level discussion
16:25:05 <gryf> we have to have some concrete implementation first
16:25:07 <zhipeng> well then, on a high level, are there any directions that we want to follow up on? :)
16:25:14 <zhipeng> okay
16:25:32 <zhipeng> but having vendors provide third-party CI environments would always be a good idea
16:25:49 <zhipeng> bare metal or VM, is that correct?
16:25:53 <gryf> we can figure that out later
16:26:00 <zhipeng> sure
16:26:09 <zhipeng> #topic AoB
16:26:21 <zhipeng> any other topics ?
16:26:36 <ttk2[m]> Keep up the good work guys.
16:27:00 <zhipeng> that would be a good note for our meeting to end on :)
16:28:20 <zhipeng> ok thx guys, let's end the meeting for today
16:28:25 <zhipeng> #endmeeting