15:03:31 #startmeeting openstack-cyborg
15:03:31 Meeting started Wed May 17 15:03:31 2017 UTC and is due to finish in 60 minutes. The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:32 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:03:34 The meeting name has been set to 'openstack_cyborg'
15:03:38 ah speak the devil.
15:03:41 just got connected
15:03:43 :P
15:03:59 ok then I had a list of topics to go over
15:04:08 unless someone wants to bring other things up before I steamroll head?
15:04:10 * cdent is here to lurk
15:04:11 ahead*
15:04:29 #topic BP discussion
15:04:38 jkilpatr go ahead
15:05:55 zhipeng, your outlined Cyborg api doesn't actually have an attach api call. What did you plan to do, catch some property of a booting instance?
15:07:03 as far as I could tell I did include the attach api ?
15:07:04 nova *can* do pci hotplug, supposedly. which would make live attachment *possible* but I have no grasp on the gap between "it should work" and "it actually will"
15:07:21 * jkilpatr is checking the commit again please hold
15:07:57 * jkilpatr hold music ends
15:08:34 so we have get, post, put, delete but they all seem to be for managing the database of accelerators, you can put to update an accelerator spec but what if I have an instance I need to attach it to
15:09:35 Okey for VM instance, I still think we would have to look to Nova for the actual attachment operation
15:09:59 actually Jay and I discussed this today
15:10:29 unless we have identified a set of properties for the accelerator connection to the host
15:10:48 meaning that we have an os-brick like library
15:11:16 unless we have that, then we could just assume it is a regular PCI-e attachment
15:12:06 I think for most of the use cases we have at the moment, PCIe would be the most usual case and Nova currently supports that
15:13:08 I think we need to come to a workflow conclusion sooner rather than later, how does the user ask for a vm with a specific accelerator, do they post to us and then we talk to nova, do they post to nova and we watch for some tag etc
15:13:34 should there be like a 'user workflow' spec? I guess it goes into the api
15:14:01 workflow is kinda an end-to-end thing
15:14:46 I personally would suggest for the moment, we take the Cinder approach, and expand later based upon this
15:15:02 that meaning live attachment?
15:15:13 because this might be the only way that we don't create a major impact on the existing implementations
15:15:42 not necessarily live-attachment
15:15:51 but just attach/detach ops in general
15:16:00 for VM instances
15:16:15 since you will need the instance id and host id anyway
15:16:22 and nova got all these
15:16:26 ok then, we have a booted vm we need to do the attachment, we can setup nova to do the passthrough and then reboot the instance, not sure how nova feels about it but if it doesn't work I don't think they would be opposed to us making that work
15:16:37 but also nova does support live pci attachment, so we can/should just use that
15:17:07 yes if Nova indeed could support that
15:17:14 https://wiki.openstack.org/wiki/Nova/pci_hotplug
15:17:25 this is a stub right now, so some support
15:17:28 we just don't differentiate in Cyborg
15:18:14 the driver actually handles this of course, we just run attach on a live instance that was spawned with a flavor or some other indicator that says "cyborg: TeslaP100"
15:18:59 yes
15:19:21 because we have to make sure Nova gets it to the right spot first, so we need to have the placement api fed live knowledge and then the instance needs to call the resource name we fed the placement api
15:19:40 maybe we can have a command in cyborg that will help you make flavors, but that's for later.
15:19:51 hotplug could be the trait when we use placement api
15:20:33 so when we schedule it, it knows it needs a compute node with hotplug feature
15:20:46 wouldn't we want a bunch of traits
15:20:58 yep I guess so :)
15:21:03 like gpu, with cuda support, with hotplug support .... so on and so forth
15:21:10 we could start with really basic and simple ones
15:21:11 this is why we need a flavor creation wizard in cyborg but once again, later.
15:21:25 yes
15:21:34 ok anyone have comments on these ideas? things that they might want that this won't cover?
15:22:15 zhipeng, I'm going to put a comment on your api patch as a reminder to add cinder like attach/detach to the spec, sound good?
15:23:00 which I thought is already done in the current patch ?
15:23:27 it would be great to see, at some point, a narration of the expected end to end flow from a user's standpoint, if it doesn't already exist. Including how various services will be touched.
15:24:17 cdent we got a flow chart in our BOS presentation, but rather rudimentary at the moment
15:24:20 cdent, we have a decent idea of how we want it to work but I expect some things will change as we get into the nitty gritty of placement problems
15:24:42 sure, change is the nature of this stuff :)
15:24:47 zhipeng, ok so if I want to attach an accelerator to an instance what do I do? Do I put to update an accelerator spec with a new instance ID to attach to?
15:24:48 :)
15:25:18 cdent, I think I'll put up a user workflow spec later today, just so that we keep track of all of this better.
15:25:31 \o/
15:25:48 can you add me as a reviewer on that when it is up, so I get some email to remind me to look?
15:26:23 just as we drew for our presentation, after the user using Cyborg service to complete the discovery phase and Cyborg finishing interaction with placement to advertise the accelerator inventory
15:26:24 will do, what's your email?
15:26:44 cdent@anticdent.org is me on gerrit
15:27:01 then user just requests to create an instance on a compute node with the corresponding accelerator trait
15:27:32 if trait includes hotplug, then maybe it will be a live attachment
15:27:45 I really don't think we're going to get away with one trait per accelerator, users will probably bundle them into flavors, but instead of being tied to a list of whitelisted pci devices these flavors can be much more general.
15:27:50 which means user could attach the accelerator after VM creation
15:28:35 I was told by jay today that traits are per resource provider
15:29:02 so it would mostly be one trait per compute node
15:29:08 I'll have to look at it in detail, I was watching the summit presentation again today.
15:29:17 or we got vGPUs or FPGA virtual functions
15:29:28 then it would be nested resource provider
15:29:38 and we could have trait on the virtual functions
15:29:53 but anyways it does not tie to a specific accelerator
15:30:10 just depending on how you model your accelerators into resource providers
15:30:22 cdent plz correct me if i'm wrong :P
15:30:24 we're going to need to be careful with that.
15:31:03 #link https://pbs.twimg.com/media/DAAUxEWUAAAV6zA.jpg
15:31:16 zhipeng: that looks mostly correct, but I'm only partially paying attention :(
15:31:27 cdent no problemo :)
15:31:51 as long as I don't make any extremely wrong claims :)
15:32:07 so implementation. What can we start and when?
15:32:24 as soon as we freeze the specs
15:32:35 i suppose we should all go ahead and start coding
15:32:48 Can we focus on closing out the specs first though?
15:32:55 yes
15:32:57 ok then that's a plan.
15:33:15 okey, then for api spec
15:33:31 #link https://review.openstack.org/445814
15:33:35 any other questions ?
15:34:22 I just posted a comment there, otherwise I'm happy enough
15:34:31 okey
15:34:54 um should we pick a database tech? what's available already sql, mongo... redis (not sure about that one)
15:35:08 MariaDB?
15:35:15 #action jkilpatr to post a reminder comment, the api spec patch is LGTM
15:35:39 i think we could just use mysql
15:35:54 MariaDB == mysql except when it doesn't
15:36:04 I think openstack ships with Maria right now
15:36:07 yes
15:36:09 yup
15:36:28 next up, agent spec
15:36:33 #link https://review.openstack.org/#/c/446091/
15:37:12 looks like most people are happy with it.
15:37:14 #info Jay Pipes suggests agent could directly interact with placement api, instead of going through conductor
15:37:30 #link https://pbs.twimg.com/media/DAAUtyoUMAAXQXI.jpg
15:37:50 so from the summit preso all the computes already talk to the placement api themselves
15:37:51 but I guess we don't need to reflect that in the agent spec
15:37:54 so it's designed to scale well like that
15:38:30 I'd prefer to be explicit, I'll patch it into my spec today
15:38:48 We should reflect that in the agent spec
15:39:01 jkilpatr yes, and for implementation, Jay suggests we could directly just copy nova/scheduler/client/report.py
15:39:24 since it is basically rest calls between agent and placement api
15:39:32 that's the sort of laziness I can get behind.
15:39:42 XD
15:39:44 lol
15:40:29 #agreed jkilpatr do a quick update on agent spec to reflect jaypipes comment, then the agent spec patch LGTM
15:40:38 okey, next up, generic driver
15:40:58 #link https://review.openstack.org/#/c/447257/
15:41:04 any more comments
15:41:08 looks fine to me
15:41:52 jkilpatr, Any other comments on your end? I have tried to address all of your and Roman's comments in the patch
15:41:56 what about detect accelerator? discovery has to be handled by someone, do we want drivers to have a discovery call?
15:42:10 I like the rest of the api list for it, good job
15:43:12 I can add that to the list. What would be the flow though for discovery?
15:43:37 i think discovery is already part of the spec ?
15:43:44 see line 121
15:43:55 ah yup just not in the other list
15:44:04 It's not part of the API list
15:44:09 crushil, the flow (which I think you should add into your spec or maybe me in to the agent spec)
15:44:27 is agent on first startup says "hey I've never been started up before, let's call discover for all my drivers"
15:44:48 whatever returns true it lists and sends to the conductor to store in the db as possible accelerators
15:45:08 later on operators can call discover to do this again and add new accelerators.
15:45:42 as a note I think accelerators should get added in a "not ready" state with the operator having to tell cyborg to go install drivers otherwise we risk bad endings installing software on live clouds
15:45:59 more things to add to the agent spec
15:46:14 agree
15:46:17 +1
15:46:35 Makes sense, but should we add it to the driver spec or agent spec or both?
15:47:26 i think for both, because discovery is directly triggered by agent to run loops on drivers, right ?
15:47:28 crushil, driver spec just needs "on discovery return if the accelerator exists or not" agent is the one that will call discovery then wait for the operator to call the api to move the accelerator into 'ready' before calling the install driver function.
15:48:08 yep
15:48:27 um speaking of message passing
15:48:34 most of this should be done over message passing
15:48:41 rabbitmq/oslo messaging fine?
15:48:53 Yup, that is the OS standard
15:48:59 yep
15:50:13 #agreed crushil to update the driver spec to include the discovery interface, and jkilpatr update the agent spec to reflect the related operations, otherwise it is LGTM
15:50:48 moving along, next up, interaction https://review.openstack.org/#/c/448228/
15:50:52 #link https://review.openstack.org/#/c/448228/
15:51:01 I think we still need more work on this
15:51:27 first of all thx to gryf for working on this on his own time
15:51:53 I think this is where most of the workflow stuff is hidden right now.
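The discovery flow described above (agent calls discover on every driver, positive results go to the conductor as "not ready", and the driver install only runs once an operator marks the accelerator ready) can be sketched roughly as follows. This is a minimal illustration, not Cyborg code: `FakeDriver`, `Conductor`, `Agent`, and every method name here are hypothetical, and the real components would talk over oslo.messaging rather than direct calls.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Accelerator:
    name: str
    state: str = "not ready"  # discovered accelerators start out not ready


class FakeDriver:
    """Hypothetical driver: discover() only reports whether the device exists."""

    def __init__(self, present: bool) -> None:
        self.present = present
        self.installed = False

    def discover(self) -> bool:
        return self.present

    def install(self) -> None:
        self.installed = True


class Conductor:
    """Hypothetical stand-in for the conductor that owns the database."""

    def __init__(self) -> None:
        self.db: Dict[str, Accelerator] = {}

    def store(self, acc: Accelerator) -> None:
        # Keep an existing record so re-running discovery does not reset state.
        self.db.setdefault(acc.name, acc)


class Agent:
    def __init__(self, drivers: Dict[str, FakeDriver], conductor: Conductor) -> None:
        self.drivers = drivers
        self.conductor = conductor

    def discover(self) -> None:
        # Run on first startup, and again whenever an operator asks for it.
        for name, driver in self.drivers.items():
            if driver.discover():
                self.conductor.store(Accelerator(name))

    def mark_ready(self, name: str) -> None:
        # Operator-triggered: only now is software installed on the host.
        self.conductor.db[name].state = "ready"
        self.drivers[name].install()
```

Keeping the install step behind an explicit operator call is the point raised at 15:45:42: discovery alone never changes software on a live cloud.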
15:52:04 Oh this is jkilpatr moved to my phone.
15:53:00 yes
15:53:32 we should continue to work on the spec, but I don't think it will block our implementation
15:54:32 any thoughts ?
15:56:18 and ttk2[m] I think you could just work with Roman on this patch to illustrate the workflow
15:56:25 and also have cdent for review
15:56:37 Ya, makes sense. But, we need to have a cutoff date to finish the spec
15:57:40 we slipped the Apr 15th one rather quickly lol, but ya I agree we need another cutoff date
15:57:55 what is the m2 deadline for Pike ?
15:58:40 June 9
15:59:05 i think we could just use that for all the non-LGTM specs per today's meeting
15:59:23 Ok then. Can we comment on the specs with that deadline?
15:59:41 We should close out all the other specs sooner though
15:59:51 I feel like we should make a point of moving info out of meetings and into specs so we don't lose them in the black hole of IRC logs.
15:59:58 +1
16:00:42 +1
16:01:05 at least for all the LGTM specs I will merge those by the end of this week
16:01:47 #agreed set June 9th as a hard cut-off date for all the remaining specs, including cyborg-nova interaction
16:01:58 next up , conductor spec
16:02:14 #link https://review.openstack.org/#/c/463316/
16:02:30 i think I will post some review, mostly on the wording
16:02:53 but this should be a simple one for us to freeze this week
16:03:41 Agreed. It's pretty much just glue code.
16:04:08 #agreed after some polishing, conductor spec LGTM this week
16:04:24 the last one in the queue, not a spec patch tho
16:04:34 #link https://review.openstack.org/#/c/461220/
16:04:57 could folks just give a +1 so that I could merge it, it is mostly house cleaning stuff
16:07:43 I have mixed feelings about that
16:08:05 * gryf just joined
16:08:22 gryf which topic ?
16:08:36 nacsa.tgz in a repo
16:08:44 it doesn't sound right
16:09:24 we just hosted it in the sandbox
16:09:37 we could even move them out to an individual repo later on
16:09:46 well, yeah
16:09:50 but we did have extensive discussion on that matter
16:09:55 with moshe and his team
16:09:59 but it will affect the size of the repository
16:10:49 then I think maybe we could move the sandbox out to an individual repo, such as cyborg-sandbox
16:11:01 so that it won't affect the cyborg project repo itself
16:11:19 yes, I think that's the better solution
16:11:23 also
16:11:53 I'd like to avoid keeping binary blobs in repository
16:12:03 Agreed.
16:12:05 that's fine for me :)
16:12:20 but we do need to merge it first, and then move it out
16:12:27 due process
16:12:35 so the perfect solution would be to unpack it, and make a commit which moves the entire work into its own directory. what do you think?
16:12:51 nuh that won't be necessary
16:13:15 i think just move to another repo just for records
16:13:30 we won't do any release, for example , for the cyborg-sandbox
16:13:37 Um if we merge it it's in the repo history forever.
16:13:40 it just sits there
16:13:48 no we could move it out
16:13:59 and we need to move out the spec later as well
16:14:14 cyborg-spec will be the standalone repo to store all the specs
16:14:19 I don't have super strong feelings. But I'd like to keep binaries out of the repo
16:14:28 ttk2[m], +1
16:14:39 I have no problem either
16:14:58 but let's just follow a procedure and get it done
16:16:44 sounds reasonable for everyone ?
16:18:23 zhipeng, what exactly do you mean by following procedure?
16:19:02 have it first in the current cyborg repo, and then move it out to a separate one
16:19:31 I'm against it. as ttk2[m] said - if we merge it, it stays forever.
16:19:43 why ??
16:19:51 it's a git :>
16:19:54 Because history
16:20:02 say we couldn't even move the specs out ?
16:20:12 merging it will permanently increase the repo size because the artifact will remain in the history forever
16:21:00 okey understood
16:21:06 zhipeng, we can, but they will be available, if someone would like to go back in time (in history) and nothing prevents him from doing so :D
16:21:37 then I will abandon the patch and directly submit it to the separate repo instead
16:21:42 this sounds reasonable ?
16:21:42 unless, we do some rebase stuff on the repo itself, but I'm not aware if this is a good practice
16:21:55 yup
16:22:31 #agreed abandon the nacsa sandbox patch and directly submit it to a separate repo
16:22:53 gryf: yeah, you basically have to use a rebase or git filter-branch to remove it, but that'll break everyone's checked out repos since you're rewriting history... so not typically good practice
16:23:03 okey, we got many things settled :)
16:23:11 #topic CI discussion
16:23:33 as I understand ttk2[m] and gryf had some discussion on the CI settings
16:23:40 do we have any preference now ?
16:23:40 adreznec, yeah.
16:24:25 zhipeng, it was mostly very high level discussion
16:25:05 we have to have some concrete implementation first
16:25:07 well then on high level, any directions that we want to follow up on :)
16:25:14 okey
16:25:32 but having vendors provide third party CI env would always be a good idea
16:25:49 baremetal or vm, is it correct ?
16:25:53 we can figure that out later
16:26:00 sure
16:26:09 #topic AoB
16:26:21 any other topics ?
16:26:36 Keep up the good work guys.
16:27:00 that would be a good note our meeting ends on :)
16:28:20 ok thx guys, let's end the meeting for today
16:28:25 #endmeeting
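The trait idea from the BP discussion (a request names the traits it needs, such as gpu, cuda or hotplug, and scheduling picks a resource provider that advertises all of them) can be illustrated with a small sketch. This is an assumption-laden toy and not the placement API: the function name, the provider names, and the trait strings are all made up for illustration.

```python
from typing import Dict, List, Set


def matching_providers(providers: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Return the providers whose trait set contains every required trait."""
    return sorted(
        name for name, traits in providers.items() if required <= traits
    )


# Hypothetical inventory: traits are attached per resource provider.
providers = {
    "compute-1": {"gpu", "cuda"},
    "compute-2": {"gpu", "cuda", "hotplug"},
    "compute-3": {"fpga"},
}

# A flavor that bundles several traits, as suggested at 15:27:45.
hotplug_gpu_flavor = {"gpu", "hotplug"}
candidates = matching_providers(providers, hotplug_gpu_flavor)
```

Modelling a flavor as a set of required traits keeps it independent of any specific accelerator, which matches the point that traits do not tie to a particular device.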