21:00:29 <alaski> #startmeeting nova_cells
21:00:30 <openstack> Meeting started Wed Apr 29 21:00:29 2015 UTC and is due to finish in 60 minutes.  The chair is alaski. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:34 <bauzas> \o
21:00:36 <openstack> The meeting name has been set to 'nova_cells'
21:00:51 <melwitt> o/
21:00:53 <dansmith> o/
21:00:53 <alaski> good <regional time of day> everyone
21:00:59 <belmoreira> o/
21:01:17 <alaski> #topic Tempest testing
21:01:41 <alaski> so there are still some intermittent failures
21:01:54 <alaski> but nothing has merged yet to address them, so totally expected
21:02:36 <edleafe> o/
21:02:43 <alaski> https://review.openstack.org/#/c/177356/ will hopefully help, but it's hard to say for sure at this point
21:02:47 <bauzas> alaski: to be clear, we should see if your patch helps that
21:03:01 <alaski> yeah
21:03:12 <bauzas> so another week round before proposing it to vote ?
21:03:24 <alaski> I also want to dig into some of the failures, but there's been a lack of hours in the day
21:03:29 <bauzas> trying to remember the logstash query for checking the intermittent ratio
21:04:05 <alaski> http://goo.gl/AlgZRa
21:04:25 <bauzas> gotcha "build_name:"check-tempest-dsvm-cells" AND message:"Worker Balance" AND build_status:"FAILURE" "
21:04:26 <alaski> well, that shows failures
21:04:51 <bauzas> alaski: agreed but that counts per hour
21:05:28 <alaski> there are two bugs opened https://bugs.launchpad.net/nova/+bug/1448316 and https://bugs.launchpad.net/nova/+bug/1448302
21:05:28 <openstack> Launchpad bug 1448316 in OpenStack Compute (nova) "cells: Object action destroy failed because: host changed" [Medium,Confirmed] - Assigned to tianzichen306 (tianzichen306)
21:05:28 <bauzas> alaski: so if we have like 5/h, we just need to know how many runs per hour and we get the max error ratio
21:05:29 <openstack> Launchpad bug 1448302 in OpenStack Compute (nova) "cells: intermittent KeyError when deleting instance metadata" [Low,Confirmed]
21:05:35 <alaski> if anyone fancies looking at some point
21:06:15 <bauzas> alaski: by saying that, I mean that if we are below 0.1% error rate, why not just consider the job as voting ?
21:06:27 <bauzas> because a recheck would be worth it
21:06:55 <bauzas> we should just face the failures and mark them in e-r
21:07:01 <bauzas> once the job is voting
21:07:23 <alaski> I'm not sold on that at this point, I'd want to see something saying the rate is really really low first
21:07:46 <bauzas> we spiked at 7 per hour during the last 7 days
21:07:59 <dansmith> the last thing we want,
21:08:00 <bauzas> but with an average around 2 or 3
21:08:05 <alaski> that's a fair amount
21:08:11 <dansmith> is to turn it on prematurely and gain the scorn of our colleagues :)
21:08:13 <dansmith> IMHO.
21:08:17 <bauzas> dansmith: agreed
21:08:21 <alaski> right
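The back-of-the-envelope math bauzas is describing (failures per hour divided by runs per hour, from the logstash counts) can be sketched as follows; this is just an illustration of the arithmetic, not anything in Nova or elastic-recheck:

```python
def failure_ratio(failures_per_hour, runs_per_hour):
    """Estimate the job failure rate from hourly logstash counts.

    Per the discussion: a spike of ~7 failures/hour only matters
    relative to how many runs happened in that hour.
    """
    if runs_per_hour == 0:
        return 0.0
    return failures_per_hour / runs_per_hour

# e.g. 5 failures out of 200 runs in an hour -> 2.5% failure rate,
# well above the 0.1% threshold bauzas floats for making the job voting
```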
21:08:47 <bauzas> dansmith: my point is just to identify the balance between waiting for any magical solution and how much it costs to get it voting
21:08:54 <dansmith> yep
21:09:03 <alaski> I think we just have a couple of issues left, which may be resolved with my patch
21:09:05 <dansmith> there's definitely a balance there
21:09:26 <bauzas> anyway don't get me wrong, the prio is alaski's patch
21:09:31 <bauzas> before even thinking about voting
21:09:54 <alaski> bauzas: if it got to a handful of failures a day I think we should have a conversation on it
21:09:55 <bauzas> ie. https://review.openstack.org/#/c/177356/
21:10:24 <bauzas> alaski: agreed
21:10:58 <alaski> anything else on testing?  I think for now it's waiting on that patch and digging in on the lp bugs when there's time
21:11:05 <bauzas> anyway, let's chase another guru and then see how it reduces the failures
21:11:15 <bauzas> alaski: +1
21:11:31 <alaski> #topic Specs
21:11:50 <alaski> shamefully I have not updated any specs recently
21:12:30 <bauzas> so, maybe nothing to say ?
21:12:37 <alaski> well, one thing
21:12:43 <bauzas> unless someone takes the opportunity to ask questions ?
21:12:54 <alaski> perhaps I misread something, but I felt like there might be some question about https://review.openstack.org/#/c/176078/ being enough
21:13:20 <alaski> because while it works for scheduling, it doesn't work for an API response
21:13:47 <bauzas> alaski: then I need to be more clear
21:14:08 <bauzas> alaski: I think persisting the whole object in the API db is good to me
21:15:05 <bauzas> alaski: then, it could maybe be done like instance.save(), ie. being synced
21:15:12 <bauzas> maybe I'm foolish
21:15:54 <alaski> I was thinking we would pick the db based on context
21:16:10 <bauzas> my point is that if the scheduling stuff is magically working on the parent cell, why do we need to get it to the child cell ?
21:16:10 <alaski> so with cell_db_context: req.save()
21:16:16 <bauzas> I perhaps missed it
21:17:04 <alaski> the primary reason IMO is we don't want to store much data per instance in the api db
21:17:34 <bauzas> well the instance <=> req_spec is thin
21:18:00 <bauzas> an instance can have a request_spec dependency
21:18:08 <alaski> but there can be one request spec per instance
21:18:10 <bauzas> ergh s/dep/relationship
21:18:20 <bauzas> alaski: yeah
21:18:38 <bauzas> alaski: but then, you only need to care about how to get the reqspec reference
21:18:51 <bauzas> alaski: you don't need to store the whole object
21:19:03 <bauzas> I mean nested in the instance
21:19:35 <alaski> to be clear, are you suggesting the request spec in the api db, and a ref to it in the instance?
21:19:42 <bauzas> alaski: yup
21:20:37 <belmoreira> I have concerns about https://review.openstack.org/#/c/141486/ but maybe I'm misunderstanding it. A spec iteration would be good.
21:20:43 <alaski> my concern is that for scaling it's better to have that in the cell, so the api db doesn't store much data per instance
21:21:12 <belmoreira> bauzas: thanks for the comments
21:21:18 <bauzas> alaski: then the context is a good option
21:21:33 <bauzas> I mean a cell context related save()
21:21:51 <bauzas> alaski: will amend my comment then
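The "cell context related save()" alaski and bauzas converge on (the `with cell_db_context: req.save()` idea from earlier) could be sketched like this; every name here is hypothetical, not actual Nova code:

```python
import contextlib


@contextlib.contextmanager
def cell_db_context(context, cell):
    """Temporarily point a request context's DB connection at a cell DB.

    Inside the block, saves go to the cell's database; on exit the
    context is restored to the API database connection.
    """
    original = context.db_connection
    context.db_connection = cell.db_connection
    try:
        yield context
    finally:
        context.db_connection = original

# usage sketch:
# with cell_db_context(ctxt, cell) as cctxt:
#     request_spec.save(cctxt)
```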
21:22:03 <alaski> belmoreira: I will iterate that.  sorry for the delay
21:22:45 <bauzas> belmoreira: I know that we probably have kind of misunderstanding between us
21:22:46 <bauzas> :)
21:23:07 <bauzas> belmoreira: I know you would like some cell-only scheduling options
21:23:22 <alaski> for the request spec, it will need to be stored in the api db originally.  and then additional data will need to be stored to satisfy an instance list/show
21:23:33 <bauzas> alaski: agreed
21:23:52 <melwitt> I had a similar, albeit vague, thought to alaski's. I'm not very clear on the interactions for things like a migrate where we want to refer to the req spec, whether keeping it in the cell would help avoid things converging on the api db constantly
21:25:13 <bauzas> melwitt: to be clear, migrations are not using requestspec now, just because we're not persisting it yet :)
21:25:26 <melwitt> bauzas: yes, I mean assuming we persist
21:26:01 <bauzas> melwitt: MHO is that migrating comes from the API anyway
21:26:06 <belmoreira> bauzas: yes :)
21:26:35 <bauzas> melwitt: the conductor is being called, but that's still coming in from the api
21:27:01 <melwitt> bauzas: that's what I was getting at. if it's in the api anyway, it wouldn't help much. so I was trying to say, if there are scenarios where we need to use request spec and we're not already at the api
21:27:06 <bauzas> melwitt: there is no periodic task related to migrations done on the conductor, AFAIK
21:27:34 <melwitt> just trying to get clear on, is this a current need or are we just thinking to cover that case for the future, if something comes up
21:28:06 <bauzas> melwitt: I think we are pretty clear that automatic evacuation should not be done in Nova
21:28:11 <dansmith> alaski: I thought we said store the req spec in the api db until we schedule, then delete and move it to the cell db?
21:28:22 <alaski> dansmith: right, that's my thinking
21:28:36 <bauzas> that can work too
21:28:41 <alaski> bauzas: was making a case for keeping it in the api
21:28:43 <dansmith> okay, above it sounded like maybe you meant not delete from the api db
21:28:48 <dansmith> permanently/
21:29:10 <bauzas> dansmith: that depends on where you query the DB to get the spec when migrating
21:29:18 <dansmith> I think having it in the api db makes it clear it's looking for a home, and once it's scheduled, it's in the other one
21:29:45 <dansmith> bauzas: right, I don't think it really needs to be in the api db though, especially if the scheduler was external, we'd just grab the req from the current home, provide it to the scheduler and let it move
21:29:49 <bauzas> dansmith: if that's done in the API, then there is a need for keeping it locally, but if that goes down to the conductor and is then read there, it's actually better to have it in the cell db IMHO
21:30:02 <dansmith> so, migration/resize would be moving the instance/reqspec from one cell to another (if it moves cells)
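The lifecycle dansmith summarizes — the request spec lives in the API database while the instance is "looking for a home", then moves to the target cell's database once scheduled — might look roughly like the sketch below. The dict-based DB stand-ins and function name are purely illustrative, not Nova's actual API:

```python
def move_request_spec_to_cell(api_db, cell_db, instance_uuid):
    """After scheduling, relocate the request spec to the chosen cell.

    While the spec sits in the API DB, the instance is unscheduled;
    once a cell is picked, the spec is deleted from the API DB and
    persisted in that cell's DB, so the API DB stores little
    per-instance data (alaski's scaling concern).
    """
    spec = api_db.pop(instance_uuid)   # read and remove from the API DB
    cell_db[instance_uuid] = spec      # persist in the target cell's DB
    return spec
```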
21:30:27 <bauzas> dansmith: well, we need to read the spec for calling the scheduler right ?
21:30:29 <dansmith> okay, I don't follow that conductor line of thinking, but sounds like it's not important
21:30:41 <dansmith> yes
21:30:48 <alaski> bauzas: the api can read from the cell db
21:30:57 <dansmith> that's the point yeah :)
21:31:05 <bauzas> dansmith: then, it's just a matter of finding where to call that
21:31:15 <bauzas> alaski: but not the contrary, right ?
21:31:24 <alaski> bauzas: correct
21:31:30 <dansmith> contrary what?
21:31:31 <bauzas> alaski: given the current process of migrating
21:31:38 <alaski> dansmith: reverse I'm assuming
21:31:42 <dansmith> cell things reading from the api db?
21:31:43 <bauzas> dansmith: the cell can't read the api db
21:31:57 <bauzas> yeah, French strikes again
21:32:02 <dansmith> I'm not sure we'll be able to completely ban it, but it would be great if we can pass enough information down so that it doesn't
21:32:02 <bauzas> so
21:32:23 <alaski> dansmith: that's my goal.  but I agree it could be added if necessary
21:32:34 <bauzas> the current process is that a migration request comes to the API, which calls the conductor methods asynchronously
21:32:47 <bauzas> that's a cast I mean
21:32:48 <dansmith> yeah, I can see some things needing to rendezvous at the api db, but if we can avoid it, it's nice separation
21:33:25 <bauzas> so, speaking about cells, conductor methods will be run on the cell side
21:33:47 <bauzas> so they won't have access to the api db, unless we explicitly allow that
21:34:39 <alaski> a migration within a cell could be handled without coordination of the api
21:34:56 <bauzas> alaski: who would trigger the migration ?
21:35:18 <alaski> the api would cast to conductor as you said
21:35:26 <dansmith> alaski: but it's the same, right?
21:35:40 <dansmith> alaski: api passes the destination info, which might be the current cell
21:36:01 <dansmith> er, no
21:36:05 <bauzas> dansmith: unless the user specifies no destination, and then it goes to a scheduler call
21:36:20 <dansmith> I guess api calls to conductor to trigger, conductor calls scheduler, and then conductor will need the deets of the destination cell
21:36:32 <alaski> I think we need to be clear about inter vs intra cell
21:36:36 <dansmith> which goes back to my conductor might need to call back up
21:36:44 <dansmith> alaski: why? isn't that a scheduler thing?
21:37:09 <bauzas> alaski: intra-cell conductor would call the inter-cell scheduler, I don't see that as a problem
21:37:23 <alaski> there's more needed for intra-cell
21:37:30 * dansmith groks "intra-cell conductor" for a minute
21:37:39 <bauzas> alaski: when migrating ?
21:37:51 <dansmith> I say we just punt on inter-cell migrations for the moment
21:37:52 <bauzas> alaski: or unshelving or whatever ?
21:38:01 <alaski> dansmith: right, that's what I want to do
21:38:05 <dansmith> they'll be possible, but not supported today in v1 so we might as well punt on this
21:38:12 <bauzas> fair point
21:38:19 <alaski> they're punted in cells v1 too
21:38:46 <alaski> if you're looking at intra cell you just pass the request into the cell, bam
21:38:46 <dansmith> right
21:38:54 <bauzas> yup
21:39:34 <bauzas> alaski: well, there is just a flaw in that
21:39:54 <bauzas> alaski: since the user can specify a dest host, it bypasses the scheduler and then calls the compute
21:40:27 <bauzas> alaski: so we just need to make sure that if the user specifies a dest, it's one within the same cell as the source
21:40:35 <bauzas> but I'm nitpicking for now
21:40:45 <alaski> right.  that would fail currently because the two computes can't talk to each other
21:41:03 <alaski> and in the new setup
21:41:36 <alaski> so are we agreed that the request spec lives in a cell? :)
21:41:39 <bauzas> alaski: in the new setup, that's just a check that the target is in the same cell as the source
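The check bauzas describes — when the user names a destination host, make sure it sits in the same cell as the source, since cross-cell migrations are punted in v2 (as they were in v1) — reduces to a lookup against a host-to-cell mapping. The mapping shape and function name below are hypothetical:

```python
def validate_migration_target(host_to_cell, source_host, dest_host):
    """Reject a user-specified destination outside the source's cell.

    A destination in another cell would bypass the scheduler and then
    fail anyway, because computes in different cells can't talk to
    each other.
    """
    if host_to_cell[source_host] != host_to_cell[dest_host]:
        raise ValueError(
            "destination host is not in the same cell as the source")
```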
21:41:52 <dansmith> alaski: it lives in the cell after scheduling...
21:41:58 <bauzas> alaski: after scheduling
21:42:02 <bauzas> aaaarh
21:42:07 * bauzas jinxed
21:42:08 <alaski> dansmith: right, that's totally what I meant
21:42:18 <dansmith> okay :)
21:42:38 <bauzas> but yeah, I'm fine with that approach
21:42:43 <alaski> cool
21:42:48 <alaski> moving on
21:42:53 <alaski> #Open discussion
21:42:57 <bauzas> cool
21:42:57 <alaski> dammit
21:43:04 <alaski> #topic Open discussion
21:43:56 <alaski> anyone have a topic to discuss?
21:44:17 <alaski> I need to iterate some specs and get some more things down on paper
21:44:35 <bauzas> we at least need to have a plan for the summit :)
21:44:43 <alaski> ahh, yeah
21:44:55 <bauzas> for showing that and discuss
21:44:55 <alaski> so we're split between scheduling topics and everything else
21:45:10 <bauzas> you're toasted
21:45:39 <alaski> toasted?
21:45:44 <bauzas> mmm
21:45:56 <bauzas> sandwitched
21:46:13 <alaski> ahh, yeah
21:46:16 * bauzas tries to invent words at 11.40pm - always a bad idea
21:46:51 <bauzas> so, it will depend on how it goes with the scheduler stuff
21:47:05 <bauzas> but we actually have 2 sessions back-to-back for cells v2
21:47:21 <bauzas> one for discussing about the scheduling needs, and one for the other stuff
21:47:25 <bauzas> which is good IMHO
21:47:29 <alaski> yeah
21:47:46 <alaski> getting scheduling right is a big part of this
21:48:00 <bauzas> and at least, people won't get bored just because we'll be speaking about the scheduler all morning...
21:48:21 <alaski> outside of that is getting the interactions between the api and cell right, and migrating data properly
21:48:24 <bauzas> alaski: I would rather make a call to discuss how we can scale
21:48:42 <bauzas> alaski: I just wonder if we should also poke the neutron guys
21:48:56 <bauzas> alaski: since we haven't yet agreed on the plan for network requests
21:49:02 <alaski> I've been thinking on that
21:49:07 <bauzas> alaski: and since neutron doesn't have cells also
21:49:14 <bauzas> alaski: yeah I know
21:49:17 <alaski> if you assume a global neutron everything should just work
21:49:31 <bauzas> alaski: would it be not facing the same problems as nova ?
21:49:54 <alaski> it has scaling concerns, and cells might be a good approach for them
21:49:56 <bauzas> alaski: like the queues and so on
21:50:30 <alaski> there's definitely things to talk about there, but we're not blocked on it I think
21:50:31 <bauzas> alaski: that's my point, it would just be horrible to do all the efforts and then get pushed back just because neutron doesn't scale
21:51:05 <bauzas> alaski: well, n-net is not deprecated yet, right ? :)
21:51:18 <alaski> heh
21:51:41 <alaski> realistically it would help if someone could help liaise that stuff right now
21:51:44 <bauzas> so indeed, let's do the assumptions that everything will scale, and at the end, n-net will be there for us :)
21:52:17 <alaski> I speak with some networking folks occasionally, but don't have the bandwidth to drive it very far at the moment
21:52:22 <bauzas> should we maybe engage some discussions with s68cal since he's the liaison ?
21:52:43 <alaski> oh, was that figured out?  my email went wonky for a few days
21:53:11 <bauzas> he's the neutron guy foolish enough to visit Nova
21:53:24 <alaski> gotcha.  is there anyone on the nova side yet?
21:53:43 <bauzas> none I heard about
21:54:02 <bauzas> I mean, network guys are creepy
21:54:15 * bauzas kidding
21:54:34 <alaski> okay.  I can reach out to s68cal to give him a heads up on our efforts
21:55:18 <bauzas> we could at least give him a wish list
21:56:17 <alaski> back to the summit: does the data migration part of cells, getting instances and cells mapped, seem like a good topic to y'all?
21:56:59 <bauzas> that's a fair point to discuss, if the consensus is getting reached quickly, that's even better
21:57:03 <alaski> I want to touch on a few other things, but in the time allowed we can probably only hit one/two big things
21:57:50 <bauzas> alaski: I seriously need to see what we should discuss for the scheduler efforts, so I could help you on that by giving you time if I can
21:58:22 <alaski> okay
21:58:23 <bauzas> I seriously doubt people will like having all the morning kinda dedicated to scheduling
21:58:33 <bauzas> because that's a beast
21:59:11 <bauzas> the last session is about RT, I still need to figure out what to fill in there
21:59:13 <dansmith> they'll get over it
21:59:40 <bauzas> because there are good points to raise, but few chances to work on for Lib
21:59:41 <alaski> yep, that's what we're there for
22:00:18 <alaski> well this was a bit disorganized, sorry.  I'll be more prepared next week
22:00:21 <alaski> thanks all!
22:00:29 <alaski> #endmeeting