#openstack-meeting log

15:00:44 <n0ano> #startmeeting gantt
15:00:45 <openstack> Meeting started Tue Jun 24 15:00:44 2014 UTC and is due to finish in 60 minutes.  The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:48 <openstack> The meeting name has been set to 'gantt'
15:00:54 <n0ano> anyone here to talk about the scheduler?
15:00:55 <bauzas> \o
15:01:54 <bauzas> sounds like the crowds are silent :)
15:02:08 <n0ano> listen to the crickets :-)
15:02:34 <bauzas> well, it's the Summer, so that's possible :)
15:02:48 <n0ano> well, we can make it quick
15:02:53 <n0ano> #topic code forklift
15:03:06 <bauzas> soooooo
15:03:22 <bauzas> long debates this week with johnthetubaguy :)
15:03:24 <n0ano> I see that john garbutt doesn't like one of your patches, is that one crucial?
15:03:43 <bauzas> well, which one do you refer to?
15:03:53 <n0ano> https://review.openstack.org/#/c/97232/
15:04:00 <bauzas> ah, this one
15:04:12 <bauzas> (I know now all of them by the numbers ;) )
15:04:24 <n0ano> it's shorter :-)
15:04:38 <bauzas> so, indeed, the problem is about how we care until Gantt is out
15:04:54 <bauzas> I finally agreed with John
15:05:21 <n0ano> if we can do the split and then do 97232 then I don't have a problem waiting for it also, my prime goal is just doing the split
15:05:28 <bauzas> indeed
15:05:39 <bauzas> we will still have many things to do
15:06:08 <n0ano> I looked at 82778 and didn't see anything wrong (probably a weakness in my python foo)
15:06:12 <bauzas> so, let's unscope 97232 from the pre-split tasks
15:06:21 <n0ano> +1
15:06:27 <bauzas> yey, the real debate is about 82778
15:06:39 <bauzas> everything was fine until last week...
15:07:06 <bauzas> the idea is to update the stats without returning anything
15:07:12 <bauzas> that's cool
15:07:14 <bauzas> but...
15:07:17 * johnthetubaguy waves
15:07:19 <bauzas> (teaser)
15:07:45 <bauzas> so, everything is setup in resourcetracker and updated to the scheduler
15:07:58 <bauzas> nothing is needed to be returned to RT
15:08:01 <bauzas> nothing but...
15:08:19 <bauzas> I discovered a section of PCI code asking explicitly for the compute node id
15:08:36 <bauzas> which is not returned back when creating
15:08:46 <bauzas> so that's blocking the creation of computenode in TR
15:08:47 <bauzas> RT
15:09:16 <n0ano> yeah, things can get very intertwined
15:09:16 <bauzas> I did the PCI code review and there is no clear benefit of keeping the cn id
15:09:38 <bauzas> that's basically for raising an exception and saying which CN is failing
15:09:52 <bauzas> so, yjiang5 agreed on removing that cn id from PCI
15:10:07 <bauzas> but until that, compute node creation has to be done on the RT side
15:10:19 <n0ano> so now you're dependent upon the change to the PCI code, right?
15:10:32 <bauzas> nope
15:10:46 <bauzas> because that's life, and we have to figure out another way
15:10:51 <johnthetubaguy> n0ano: I think we can leave compute node in nova basically
15:11:04 <johnthetubaguy> n0ano: and gantt just has its own copy
15:11:22 <bauzas> that was most of the previous talk I had with johnthetubaguy, about keeping compute node in Nova :)
15:11:30 <johnthetubaguy> n0ano: so PCI stuff still goes in their tables (till they fix that)
15:11:40 <bauzas> so n0ano your thoughts are welcome here :)
15:11:52 <johnthetubaguy> yeah, would be good to agree a direction here
15:12:09 <johnthetubaguy> I am pro leaving compute node in nova, and just not populating the stats there when we switch to gantt
15:12:30 <n0ano> a little concerned about 2 copies of compute node (one in nova, one in gantt) but, if we're careful, it's probably OK
15:12:35 <johnthetubaguy> that way we don't have to wait for PCI, they can do that at their own speed
15:13:02 <n0ano> +1 (not being dependent upon PCI changes is good)
15:13:08 <johnthetubaguy> n0ano: yeah, it just means we rip it out of Nova later, and gantt needs a separate database and schema anyways
15:13:10 <bauzas> +1 too
15:13:25 <n0ano> I'm hearing violent agreement
15:13:36 <bauzas> yey, but the debate is not there :)
15:14:04 <bauzas> now we agree to workaround the current situation (RT has to know the CN ID)
15:14:08 <PaulMurray> bauzas, what does it take for PCI stuff to be done first
15:14:12 <bauzas> what's the best option ?
15:14:32 <bauzas> PaulMurray: good question, a quick code review showed me little effort here
15:14:43 <PaulMurray> bauzas, so why not do it?
15:14:45 <bauzas> PaulMurray: but I'm possibly missing a crucial thing
15:15:03 <PaulMurray> bauzas, I think you said it was for information in the exceptions?
15:15:11 <PaulMurray> bauzas, or at least logging
15:15:11 <n0ano> PaulMurray, I've been waiting 2 weeks for a simple change, things can take a long time to get done
15:15:19 <bauzas> PaulMurray: because PCI code is very sensitive, I have no way to correctly test it :)
15:15:39 <johnthetubaguy> yeah, its not worth doing PCI code
15:15:43 <PaulMurray> n0ano, no s***
15:16:00 <johnthetubaguy> bauzas: I got confused where we are debating to be honest
15:16:08 <bauzas> PaulMurray: and I unfortunately discovered that PCI testing is a bit insufficient because Jenkins was still happy with my change even if I was not giving back the id.... :)
15:16:17 <n0ano> the more stuff we are responsible for, the better
15:16:35 <johnthetubaguy> bauzas: yes, no PCI passthrough testing in the gate right now
15:17:10 <johnthetubaguy> n0ano: turns out we get PCI stats for scheduling OK, it's just some internal accounting stuff that's not scheduling related that happens to use ComputeNode that would stay in nova
15:17:12 <bauzas> ok, everyone happy with leaving compute nodes in RT ?
15:17:52 <PaulMurray> bauzas, no, but understand it may be better than waiting on pci
15:18:15 <johnthetubaguy> PaulMurray: what's your worry with compute node in RT?
15:18:19 <n0ano> johnthetubaguy, as I said, a little icky (2 copies) but probably OK, something to fix as soon as possible once gantt is split out
15:18:21 <bauzas> PaulMurray: the alternative was to give back the id when creating the node in Scheduler client code
15:18:47 <johnthetubaguy> n0ano: it's just a slightly empty nova structure, the gantt one will have to be different in shape regardless
15:19:00 <PaulMurray> bauzas, johnthetubaguy it's only about making things cleaner for me
15:19:02 <johnthetubaguy> n0ano: there should be no data in two places, except the name
15:19:09 <bauzas> I made various code proposals about the split in different patchsets
15:19:21 <johnthetubaguy> PaulMurray: right, I see it as delete the ComputeNode independently of the scheduler split
15:19:24 <bauzas> people can compare and appreciate which one is better
15:20:11 <bauzas> I'm still having a -1 from Jenkins on the last patchset, but probably a short miss
15:20:26 <bauzas> or some tests to fix
15:20:28 <PaulMurray> bauzas, is the latest the one to review?
15:20:37 <bauzas> one sec, checking
15:20:45 <PaulMurray> bauzas, you made 3 revisions since I started to look
15:21:08 <johnthetubaguy> so, for the record, I am kinda pushing for this sort of approach: https://review.openstack.org/#/c/101858/
15:21:24 <n0ano> bauzas, you do have a -1, I missed that (I `hate` color encodings)
15:21:37 <bauzas> ok, lemme give you details
15:21:42 <johnthetubaguy> basically, just move self.conductor_api.compute_node_update into the scheduler client
15:22:07 <bauzas> patchset #26 is everything done in Scheduler, compute nodes owned by sched client
15:22:08 <johnthetubaguy> so we drop a gantt client that would take the same stats and store them somewhere else, if you don't want nova-scheduler
15:22:27 <bauzas> nah, nevermind
15:22:29 <bauzas> so
15:23:07 <johnthetubaguy> its a tricky line to draw, thats agreed here
15:23:16 <bauzas> so, patchset #26 is everything done in sched client
15:23:32 <n0ano> johnthetubaguy, +2 (plus the devil is in the details)
15:23:34 <bauzas> patchset #27 is john's proposal about keeping creation in RT
15:23:50 <bauzas> patchset #30 is a CRUD interface
15:24:02 <bauzas> (everything done in sched client)
15:24:21 <johnthetubaguy> yup, https://review.openstack.org/#/c/101858 (which became #27) is just updating the stats in the client, and leaving the compute node in the RT
15:24:31 <bauzas> patchset #31 is john's idea, but a little rewritten
15:24:57 <PaulMurray> bauzas, I'm wondering if this should be marked work in progress
15:25:14 <bauzas> PaulMurray: I can, at least until Jenkins is happy
15:25:29 <bauzas> PaulMurray: but situation is changing everyday because of the confusion
15:25:39 <bauzas> yesterday, it was OK for reviewing
15:25:43 <PaulMurray> bauzas, its the latter I was thinking about
15:25:52 <bauzas> ok, putting WIP now
15:26:17 <PaulMurray> bauzas, although I know people don't review WIP
15:26:32 <johnthetubaguy> bauzas: my patch had some fixed up unit tests you might want to borrow
15:26:33 <PaulMurray> bauzas, but I will :)
15:26:40 <bauzas> well, I think we need to agree on the approach
15:27:02 <johnthetubaguy> bauzas: what have we not agreed on at this point?
15:27:05 <bauzas> everyone's happy with keeping cn creation in RT and cn update in scheduler client ?
15:27:23 <n0ano> bauzas, +1 (that's my understanding)
15:27:27 <bauzas> johnthetubaguy: I was feeling a little confusion over here
15:27:33 <bauzas> PaulMurray ?
15:27:46 <PaulMurray> bauzas, not sure
15:28:12 <PaulMurray> bauzas, I think I am probably not deep enough into the problems here
15:28:14 <bauzas> PaulMurray: could you then review all the patchsets I mentioned and leave a comment directly in Gerrit ?
15:28:28 <johnthetubaguy> so just to mess things up a bit, I don't think we keep ComputeNode in RT, I think we remove that separately to this client work
15:28:40 <PaulMurray> bauzas, yes, I was half way through looking but got confused over what I should be looking at
15:28:56 <bauzas> this blueprint is awfully time-consuming, and I just want to make sure we all agree on what needs to be done
15:29:30 <bauzas> the last 2 weeks were about doing back-and-forths on that patch, just want to make sure everybody will like the direction :)
15:30:09 <PaulMurray> bauzas, by "create compute node" do you mean the conductor call to create it in db or do you mean the compute_node data structure?
15:30:49 <bauzas> RT will still issue the call to conductor create, and the DB model will stay in Nova :)
15:31:10 <PaulMurray> bauzas, I would prefer to see the db create and update in client library, but as I said
15:31:25 <bauzas> but in the proposal, the sched client is doing the update call
15:31:26 <PaulMurray> bauzas, I'm not sure I understand the difficulties
15:31:28 <bauzas> to the conductor
15:31:49 <bauzas> well, there is no difficulty in having sched client owning the creation
15:32:00 <bauzas> you can look at patchset #26
15:32:05 <bauzas> that was the case
15:32:21 <bauzas> we just need to return the id when the creation is done
15:32:31 <johnthetubaguy> PaulMurray: let me find you a quick link, its this...
15:32:59 <johnthetubaguy> PaulMurray: https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L1378
15:33:17 <PaulMurray> n0ano, as a process point, is it ok dwelling on this in this meeting?
15:33:17 <johnthetubaguy> PCI device stats are stored with a foreignkey into compute node table
15:33:37 <n0ano> PaulMurray, sure, this is the only agenda item for today
15:33:45 <johnthetubaguy> its not scheduler info, its resource tracker only state, so needs to stay in Nova
15:33:45 <bauzas> PaulMurray: https://review.openstack.org/#/c/82778/26/nova/scheduler/client.py here is the proposal having sched client owning conductor calls
15:34:23 <johnthetubaguy> PaulMurray: if the scheduler client owns the compute node, it ends up passing back the id, so that the compute node id is known to the PCI tracker
15:35:20 <johnthetubaguy> PaulMurray: so when you replace the nova scheduler client, the gantt client would have to still access the nova db, which is not allowed, so we need to first change the PCI stats, then we also end up passing back an id that should be internal to gantt, and it's all very confusing
15:35:21 <n0ano> as I understand it, this is all a little convoluted in order to make the current PCI code happy
15:35:45 <bauzas> johnthetubaguy: I'm just wondering if we can imagine such a scenario:
15:35:56 <johnthetubaguy> n0ano: yeah, further I think it doesn't work when you add in the gantt client, until you change the nova DB structures for the PCI devices
15:36:27 <bauzas> 1. merge the client to place all creation/update calls and return the id with a FIXME comment
15:36:42 <bauzas> 2. do the PCI work of removing that FK
15:36:43 <johnthetubaguy> updates only seems to work both ways though, although it leaves the old DB structures in Nova, till it deletes them, which we have to do anyway due to the different deprecation cycles
15:36:49 <n0ano> johnthetubaguy, well, one good thing is jiang is in my group so I can probably get his attention to make PCI changes :-)
15:36:51 <bauzas> 3. provide a Gantt client
15:37:18 <johnthetubaguy> n0ano: right, but it's a big chain of changes that may never get into trunk
15:37:28 <bauzas> anyway,we're far from proposing a Gantt client now so there is high probability to have the PCI fix before the use of a Gantt client in Nova
15:38:06 <n0ano> johnthetubaguy, which is why we work around the PCI code for now and worry about fixing it later
15:38:11 <PaulMurray> n0ano, the reason I didn't get the ComputeNode object done in RT a while back was difficulties with PCI
15:38:36 <PaulMurray> n0ano, that's why I am sympathetic to the "PCI avoidance route"
15:38:42 <johnthetubaguy> n0ano: yeah, thats my preference
15:38:52 <PaulMurray> n0ano, sooner or later we need to clean up the PCI code
15:38:58 <n0ano> unfortunate that all roads lead back to the PCI code, oh well
15:39:20 <bauzas> and there is high visibility because of the SRIOV efforts
15:39:50 <n0ano> I don't have a problem with PCI & SR/IOV, it's just the implementation that needs to be cleaned up
15:39:56 <PaulMurray> n0ano, there was a suggestion (possibly from jiang) to make PCI a resource plugin for extensible RT
15:39:59 <bauzas> n0ano: +1
15:40:28 <PaulMurray> n0ano, we could try to make an effort to sort it out if that ever comes about
15:41:02 <n0ano> mid-cycle meetup coming soon, we should raise that issue then (both jiang & I will be there)
15:41:14 <PaulMurray> n0ano, me too
15:41:23 <bauzas> I won't be able to be there :(
15:41:41 <n0ano> bauzas, NP, I'll do your proxy :-)
15:42:06 <bauzas> my wife had the unsupportable thing to expect to release my 2.0 baby by these dates
15:42:23 <johnthetubaguy> I will be at the mid-cycle
15:42:32 <johnthetubaguy> just not really sure what we are disagreeing about here
15:42:34 <n0ano> bauzas, congratulations, that's a good excuse
15:43:05 <n0ano> johnthetubaguy, seems like the current PCI implementation is impacting two different areas, that's an indication that something's wrong
15:43:21 <bauzas> anyway, that's also a matter of timeline
15:43:34 <bauzas> I would prefer this code to be merged before Juno-2
15:43:43 <johnthetubaguy> n0ano: yes, but we seem to have a solution to that now, but maybe I am missing something
15:43:55 <bauzas> basically, the refactor is very simple, but we care about how we should do it
15:44:15 <n0ano> johnthetubaguy, a work around for us but we would still like the PCI code to change later
15:45:20 <johnthetubaguy> n0ano: oh very true, I think the SRIOV stuff gets most of that into a better place, but there are other bits of work
15:45:48 <n0ano> as I keep saying, the devil is in the details
15:45:48 <bauzas> maybe I'm wrong, but if we talk about workarounds, why returning an id can't be a possible approach ?
15:46:21 <n0ano> bauzas, would that require changes to the PCI code?
15:46:32 <PaulMurray> bauzas, the db interface passes back the whole data structure with the id filled in
15:46:38 <johnthetubaguy> bauzas: what id are you returning, is the problem? and why, the key into the DB should be the (compute host, compute node) triple that gets returned from select destinations
15:46:38 <bauzas> n0ano: nope
15:46:41 <PaulMurray> bauzas, could do same?
15:47:38 <johnthetubaguy> PaulMurray: I just don't see what the scheduler should have to return the values you just sent it, given the values are out of date when you send them, but you know the better values yourself
15:47:51 <johnthetubaguy> ^ oops, why the scheduler, not what
15:47:58 <bauzas> johnthetubaguy: if we say that https://review.openstack.org/#/c/82778/26/nova/scheduler/client.py is returning the compute_node['id'], that's a workaround
15:48:01 <PaulMurray> johnthetubaguy, yes, fair enough
15:48:14 <johnthetubaguy> bauzas: but what would the gantt client do when you drop it in there?
15:48:36 <johnthetubaguy> bauzas: it is writing into a different database, so the id will not help the PCI stats that still talks to the Nova db
15:48:39 <bauzas> but we agreed to ask PCI guys to do the removal ?
15:49:27 <PaulMurray> bauzas, isn't the PCI stats part of the compute node information that would go to the scheduler?
15:49:27 <bauzas> as I said, in terms of planning, removing the FK in the PCI table will come far sooner than having Nova making use of a Gantt client
15:49:33 <johnthetubaguy> bauzas: its a bit trickier on the PCI side, it might make sense to keep it
15:49:59 <bauzas> johnthetubaguy: well, yjiang5 said he was ok to remove it
15:50:03 <johnthetubaguy> bauzas: can we go back to the question around the id, what would you return, and how does it help the PCI stats?
15:50:43 <johnthetubaguy> bauzas: it should go into the persistent resource tracker BP, so its more of a move, but lets not get distracted by PCI details
15:50:53 <bauzas> johnthetubaguy: https://review.openstack.org/#/c/82778/26/nova/scheduler/client.py L65 I would return values['id']
15:51:17 <johnthetubaguy> bauzas: but its from the gantt db, not the nova db, so doesn't help the PCI stats
15:51:27 <johnthetubaguy> that id doesn't exist in the nova db
15:51:43 <johnthetubaguy> (when using the gantt client)
15:52:10 * johnthetubaguy so wishes he could draw a picture
15:53:12 * bauzas would love to see all of you in person :)
15:53:38 <PaulMurray> maybe we need our own meet up
15:53:44 <johnthetubaguy> https://awwapp.com/draw.html#0dca39e5
15:53:51 <bauzas> I'm saying that once we will have a gantt client
15:54:18 <n0ano> PaulMurray, unfortunately, the july in Oregon and nov and Pars are the only near term options
15:54:40 <n0ano> s/and Pars/in Paris
15:54:59 <johnthetubaguy> bauzas: thats what I am trying to say, the id is from the wrong database: https://awwapp.com/draw.html#0dca39e5
15:55:07 <bauzas> we would possibly move the conductor calls back in RT in the _update() method and make use of the client, if necessary
15:55:49 <johnthetubaguy> bauzas: you can't access the nova DB directly from compute nodes, security reasons, so we need the conductor for that
15:56:08 <johnthetubaguy> the scheduler might need its own conductor, but thats a different story
15:56:33 <bauzas> johnthetubaguy: I would so love to show you my thoughts in the code directly...
15:56:41 <johnthetubaguy> (I have a gantt API plan in my head that doesn't involve REST...)
15:56:59 <johnthetubaguy> bauzas: which bit?
15:58:21 <bauzas> johnthetubaguy: https://review.openstack.org/#/c/82778/26/nova/compute/resource_tracker.py
15:58:34 <bauzas> johnthetubaguy: consider this one the trunk with gantt in use
15:59:28 <n0ano> unfortunately, I have another meeting and we're running out of time, I'll have to send you guys over to #openstack-nova
15:59:50 <johnthetubaguy> bauzas: so thats basically the gantt client I am imagining, but it talks to the scheduler "conductor", but right now that would break PCI if it were the nova client as well
16:00:07 <bauzas> let's move to #openstack-nova :)
16:00:13 <n0ano> so, I'll thank everyone, looking forward to an updated patch, and we'll talk again next week
16:00:17 <johnthetubaguy> bauzas: +1
16:00:18 <bauzas> n0ano: thanks a lot :)
16:00:25 <n0ano> #endmeeting