22:00:28 #startmeeting nova_cells
22:00:33 Meeting started Wed Dec 10 22:00:28 2014 UTC and is due to finish in 60 minutes. The chair is alaski. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:00:35 hi
22:00:37 The meeting name has been set to 'nova_cells'
22:00:40 Howdy guys.
22:00:42 o/
22:00:43 Hello
22:00:43 o/
22:00:45 hi
22:00:45 hey
22:00:47 yo
22:00:55 Hi
22:01:03 hi
22:01:18 awesome, let's get started
22:01:27 #topic Cells manifesto
22:01:34 https://review.openstack.org/#/c/139191/
22:01:44 It's been proposed to devref
22:02:04 I have some changes I would still like to make, but please look it over and comment
22:02:21 gilliard: you know that normal nova *is* a no-cells deployment, right?
22:02:45 alaski: I didn't realize this was up yet, sorry, else I'd have reviewed it
22:02:53 dansmith: right.
22:03:04 dansmith: no worries. I didn't really announce it at all
22:03:07 alaski: nice stuff..
22:03:09 \o
22:03:15 gilliard: okay, then I don't understand your comment on there
22:03:34 after cells v2 there won't be no-cells is what I thought it meant
22:03:39 I took it to mean that going forward there won't be a no-cells deployment
22:03:46 ah, okay,
22:03:53 that's what I meant, yes.
22:04:09 there won't not be a cells deployment, more double negatives please
22:04:11 it's actually, no no-cell deployment
22:04:12 I guess it seems weird that the comment is in the cells v1 section
22:04:24 oh, I didn't notice that
22:04:40 that's why I was confused, but it sounds like alaski will just slap something in there to clarify
22:04:54 I'll try to word it better
22:04:59 yeah, I'll likely add it to the proposal section
22:05:01 alaski: How does the top-level cell have things like the correct current power state for an instance, for 'nova show'?
22:05:23 uh oh
22:05:26 yeah :/
22:05:37 comstud: because it connects directly to the cell db to look at that
22:05:42 comstud: it doesn't. all queries go to a cell
22:05:54 Ok, so you look up the mapping first
22:05:57 comstud: and direct to its db, not to its conductor or anything
22:06:00 comstud: right
22:06:01 using the cell instance mapping table :)
22:06:06 so 2 db calls
22:06:17 which is fine
22:06:17 initially yeah
22:06:26 comstud: yes. very likely to end up cached in memory
22:06:30 eventually
22:06:40 actually 3 db calls if we have the cells and server tables in different dbs
22:06:44 comstud: I think you should be interested in looking at https://review.openstack.org/#/c/135644/
22:06:53 vineetmenon_: how three?
22:07:12 first get the cell, then the server, then get the status from the db connection?
22:07:28 this is unfortunate for cells that may be 'far away'
22:07:30 vineetmenon_: um, not sure about that
22:07:57 cache probably solves it
22:08:11 oh.. okay.. i may need to revisit that..
22:08:16 comstud: so the thought was also to mix in some of what alaski was talking about doing with current stuff even, which is caching something like the json from an instance show
22:08:47 yeah, now that xml is dead.. caching json is cool
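A minimal sketch of the two-step lookup discussed above: one query against a mapping table in the API-level database to find the instance's cell, then one query against that cell's own database. The table, URLs, and names below are invented for illustration and are not the actual Nova schema.

```python
# Hypothetical two-step lookup: instance uuid -> cell DB, then cell DB -> record.
# The mapping table and connection URLs are stand-ins, not Nova's real schema.

# API-level mapping table: instance uuid -> cell database connection URL.
INSTANCE_CELL_MAPPINGS = {
    "inst-1": "mysql://nova@cell1-db/nova",
}

# Per-cell databases keyed by connection URL (faked as plain dicts here).
CELL_DATABASES = {
    "mysql://nova@cell1-db/nova": {
        "inst-1": {"power_state": "running", "vm_state": "active"},
    },
}


def get_instance(instance_uuid):
    """Return the instance record using the two DB calls described above."""
    cell_db_url = INSTANCE_CELL_MAPPINGS.get(instance_uuid)   # call 1: mapping
    if cell_db_url is None:
        raise LookupError("no cell mapping for %s" % instance_uuid)
    cell_db = CELL_DATABASES[cell_db_url]
    return cell_db[instance_uuid]                              # call 2: cell DB


print(get_instance("inst-1"))
```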
22:08:58 cache could be a good option provided we know when to invalidate it :)
22:09:00 :)
22:09:24 right, we could keep a read-friendly copy of relevant instance data up top
22:09:27 bauzas: if vm_state == active, cache for a while, invalidate when you do a thing
22:09:28 * bauzas likes nitpicking
22:09:40 bauzas: if vm_state != active, cache for only short periods
22:09:49 I think there are some easy wins there
22:09:57 dansmith: agreed, and invalidate on a subset of queries ?
22:10:02 right
22:10:09 There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton.
22:10:15 comstud: I was originally thinking of how to move from current cells to a place where the global db differed from the cells. this is approaching that same place from the other side
22:10:30 vineetmenon_: that's why I just wanted to make sure we're not creating a beast :)
22:10:44 comstud: and removing the cells rpc proxy and just hitting the mqs directly
22:10:48 and dbs
22:10:55 hallelujah
22:11:01 yeah
22:11:16 I think this is a reasonable solution
22:11:22 * dansmith gasps
22:11:26 it feels somewhat less distributed somehow, though.
22:11:33 'feels'
22:12:03 how about the list-instances-all-tenants query? I assume that doesn't end up significantly more costly than current cells
22:12:07 i think the thing that negates that feeling is the cache.. depending on how it is implemented
22:12:30 melwitt: are we really *sure* that this query scales ? :D
22:12:36 I mean atm ?
22:12:37 comstud: you could potentially have the dbs all close to the api node, since the computes are using rpc to get to the DB anyway
22:12:51 comstud, but thinking from a systems POV, this design looks clean, if not efficient
22:12:54 comstud: i.e. more latency for computes, less for the api, but still one DB per cell
22:13:15 melwitt: you issue multiple parallel requests, one to each cell
22:13:20 melwitt: it's costlier in terms of db connections, but the size of the data should be the same
22:13:25 dansmith: that makes it worse for me :)
22:13:27 less distributed
22:13:40 and then someone invented MapReduce
22:14:01 comstud: right, but faster for the API -- I wasn't really talking about the distributedness there
22:14:07 sure, it's faster
22:14:14 dansmith, alaski: yeah, I was referring to the number of db calls
22:14:16 I mean, I like the idea of divide and conquer
22:14:36 melwitt: but if they're in parallel to N DBs and each one is smaller
22:14:46 so we can scale if we do parallel calls
22:15:02 yeah, cool
22:15:09 comstud: this may end up evolving to look a lot like current cells over time, if we need more distributedness
22:15:18 I'll try to word it better.
22:15:25 of course, that's tied to greenthread IO, but let's not pick the nits just yet :)
22:15:31 bauzas: right
22:16:02 next topic?
22:16:16 yep
22:16:23 this got off topic a bit
22:16:32 but everyone please comment on the manifesto :)
22:16:37 alaski: just a sec
22:16:37 #topic Testing
22:16:49 there are two tables, right..
22:17:03 vineetmenon_: there are. but can you hold for open discussion?
22:17:09 I certainly like the queue part.
22:17:19 so I'm not getting why we need parallel db calls..
22:17:21 I updated the bottom of https://etherpad.openstack.org/p/nova-cells-testing with the latest test failures
22:17:43 you can get precisely which server resides where, right
22:17:46 I think the DB part is reasonable for now to simplify things
22:17:50 we're down to 40 test failures on the latest runs
22:17:51 or am I totally wrong?
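The "list instances across all tenants" discussion above amounts to a scatter-gather: one smaller query per cell database, issued in parallel, then merged. A rough sketch with invented cell names and fake rows, using a stdlib thread pool in place of the greenthread IO and real DB connections Nova would use:

```python
# Illustrative scatter-gather across per-cell databases (data shapes invented).
from concurrent.futures import ThreadPoolExecutor

CELL_DATABASES = {
    "cell1": [{"uuid": "a1", "vm_state": "active"}],
    "cell2": [{"uuid": "b7", "vm_state": "stopped"}],
}


def list_instances_in_cell(cell_name):
    # Stand-in for "SELECT * FROM instances" against one cell's database.
    return CELL_DATABASES[cell_name]


def list_all_instances():
    # One parallel request per cell; the total data returned is the same as a
    # single global query, just split into N smaller per-cell results.
    with ThreadPoolExecutor(max_workers=len(CELL_DATABASES)) as pool:
        per_cell = pool.map(list_instances_in_cell, CELL_DATABASES)
    return [inst for rows in per_cell for inst in rows]


print(list_all_instances())
```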
22:17:59 vineetmenon_: can you wait for open discussion?
22:18:04 okay.. sure
22:18:27 so it looks like we need eyes on https://review.openstack.org/#/c/135700/
22:18:37 comstud: awesome. I would love your thoughts on the specs that are open, or a more open discussion at some point
22:18:52 I just looked at the one so far
22:18:55 mriedem: well, eyes are good for sure, but I'm still not where I want to be on that thing :(
22:19:08 dansmith: so it's WIP?
22:19:08 I'll see if I can find time to look and comment more
22:19:15 but it probably won't be until after christmas
22:19:17 mriedem: the one before that is, I think
22:19:17 hehe
22:19:22 mriedem: sorry for looking lazy, but how does this patch help increase the coverage ?
22:19:25 dansmith: oh boy, just saw the LOC
22:19:32 comstud: heh, no worries
22:19:39 * bauzas misses some backlog history
22:19:41 bauzas: i was just looking at the list of patches for review in the testing etherpad
22:19:41 mriedem: yeah
22:19:51 you obviously don't need me ;)
22:20:02 mriedem: this is the third time I've started it from scratch, but flavors are just soooo pervasive :(
22:20:10 mriedem: third time I've tried splitting bits off, I mean
22:20:18 comstud: if we say we do, will you stick around? :)
22:20:24 ok, so can someone explain why modifying the storage of flavors will help functional test coverage ?
22:20:34 alaski: he means he doesn't need *us* either :)
22:20:35 * bauzas is lost a bit
22:20:43 bauzas: unrelated
22:20:55 dansmith: oh, perfect reason then
22:20:59 :)
22:21:11 bauzas: what we need is for flavor extra_specs to be available in a cell
22:21:26 because flavors are not replicated into the cell dbs
22:21:35 we want to pass that down with the instance
22:21:36 alaski: aaaah ack.
22:21:37 This does go somewhat in the opposite direction from what I was thinking in terms of the API
22:21:39 alaski: is that basically the same request the ironic guys had?
22:21:46 we need it for a lot of things
22:21:55 tonyb: yes, it will help them as well
22:22:06 I wanted to segregate it more from the computes
22:22:14 alaski: cool.
22:22:29 but depending on how the cache works, it might end up being the same thing
22:23:22 comstud: cool, would love to talk more on this when there's more time
22:23:26 comstud: that's why we are looking forward to splitting data between cell and api.. https://etherpad.openstack.org/p/nova-cells-table-analysis
22:23:39 me too
22:23:53 unfortunately i have like 3 big things to finish before xmas
22:24:09 so the reason that dansmith's patch series is linked in the testing etherpad is that it's a long-term solution to something we've worked around in another way in the short term
22:24:30 which (the short-term solution) is likely to break as the scheduler work progresses
22:25:17 dansmith: is there anything we can do to help with it atm?
22:25:24 alaski: shoot me in the head
22:25:34 that helps you, not us :)
22:25:37 :(
22:25:42 heh
22:25:45 :)
22:26:33 alaski: so on Testing and the ~40 failures, is that something I can look at without duplicating work you're doing?
22:26:39 well, there are still 40 failures which are likely unrelated to flavors
22:26:54 tonyb: yes
22:27:08 I'm intermittently looking into them, but I can mark it on the pad when I do
22:27:08 well, the host detail Tempest test worries me
22:27:29 because I can't see how we can fix it
22:27:36 alaski: okay, I'll see what I can do in that area
22:27:46 tonyb: awesome
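For the flavor discussion above ("we want to pass that down with the instance"), the idea is roughly that the flavor, extra_specs included, travels in the payload sent to the cell so nothing inside the cell has to read the API-level flavors table. A hedged sketch with invented field and function names, not the actual Nova objects:

```python
# Sketch only: serialize the flavor (with extra_specs) into the request that
# goes down to the cell.  Field and function names are made up for the example.
import json

FLAVOR = {
    "flavorid": "m1.small",
    "vcpus": 1,
    "memory_mb": 2048,
    "extra_specs": {"hw:cpu_policy": "dedicated"},
}


def build_boot_payload(instance_uuid, flavor):
    # Everything the cell needs travels with the request itself.
    return {"instance_uuid": instance_uuid, "flavor": json.dumps(flavor)}


def extra_specs_from_payload(payload):
    # Inside the cell: recover extra_specs without a flavors-table lookup.
    return json.loads(payload["flavor"])["extra_specs"]


payload = build_boot_payload("inst-1", FLAVOR)
print(extra_specs_from_payload(payload))
```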
22:27:56 bauzas: do you know where the failure is?
22:28:13 alaski: definitely need more time to look at the issue
22:28:19 ok
22:28:20 it'd be nice to actually write code in openstack rather than qemu/libvirt ;P
22:28:46 bauzas: if it's not a quick fix, we can skip the test(s)
22:29:23 moving on...
22:29:32 #topic cells scheduling requirements
22:29:50 whoops, forgot to link it on the agenda
22:29:52 https://etherpad.openstack.org/p/nova-cells-scheduling-requirements
22:29:53 bauzas: are you talking about this? https://bugs.launchpad.net/nova/+bug/1312002
22:29:57 There is a use case of filtering cells based on their capabilities, which are already gathered and passed up to the parent cell: https://review.openstack.org/#/c/140031/
22:30:48 mateuszb: yes, I agree. But I do think it's a bit unrelated to this
22:31:09 because we're not really looking at using the cells scheduler
22:31:11 vineetmenon_: nope
22:32:00 mateuszb: but I do like that spec and think it's worthwhile regardless
22:33:05 alaski: I think that belmoreira raised the main issues for having an intra-cell scheduler
22:33:06 alaski: Ok, but it would be great if you left your feedback on this. I know there is interest in it apart from Intel
22:33:21 alaski: to be clear, s/issues/concerns
22:33:35 mateuszb_: okay, I will do that
22:33:43 thank you
22:34:24 mateuszb_: yes, we are interested in this... in fact we are already using something similar to what you are proposing
22:34:57 belmoreira: there is one spec for changing how the scheduler would pick hosts based on aggregates that I think you could be interested in: https://review.openstack.org/#/c/89893/
22:35:07 belmoreira: you mean you created your own filter?
22:35:53 belmoreira and I both listed a desire to have intra-cell scheduling
22:35:53 mateuszb_: yes, we have created filters to deal with capabilities... (datacentre, az, ...)
22:36:25 bauzas: thanks, I will have a look
22:36:45 just to be clear, does everyone know we'll change how filters look at aggregates and instances ?
22:36:48 belmoreira: a question I had for you is whether it is a requirement that they be different scheduler endpoints
22:37:26 I'm very concerned by any spec that has filters making their own calls to the Nova DB or the like
22:37:26 belmoreira: I'm thinking ahead to when the scheduler is split out, potentially
22:38:39 alaski: I think we need to think about having a scheduler able to provide either a cell or a host
22:38:42 alaski: not a requirement... My concern is then having a bottleneck, and being more difficult to scale
22:38:48 alaski: but not having 2 different schedulers
22:39:05 alaski: or we would reproduce what we have with the current cells v1 scheduler
22:39:36 i.e. something totally out of scope of what's happening within the scheduler's world
22:39:39 belmoreira: I completely agree about scale. But I'm wondering if we can have the scheduler be something we query, and have it deal with the scale and separation question separately
22:40:04 For me one of the advantages of having cells is that each cell can be configured in a different way (depending on the use case), including the schedulers
22:40:06 bauzas: agreed
22:40:19 belmoreira: we know that the scheduler can't scale because it makes DB calls while processing a query
22:40:24 bauzas: I've been thinking about it a lot these past few days, and want to write something down about it
22:40:29 that's really expensive
22:40:38 alaski: keep me in the loop then
22:40:57 alaski: we have time to loop back with the scheduler swat team
22:41:06 bauzas: will do. I would love to get some ideas and thoughts from others on this
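The capability-based cell filtering raised earlier in this topic (https://review.openstack.org/#/c/140031/, and the datacentre/AZ capability filters belmoreira mentioned) boils down to keeping only the cells whose advertised capabilities satisfy what a request asks for. A toy sketch with invented data shapes, not the cells v1 RPC format:

```python
# Toy cell filter: a cell passes if it advertises every required capability.
CELL_CAPABILITIES = {
    "cell1": {"hypervisor": {"kvm"}, "datacentre": {"geneva"}},
    "cell2": {"hypervisor": {"hyperv"}, "datacentre": {"budapest"}},
}


def cell_passes(capabilities, required):
    # Every required key must be present with at least the requested values.
    return all(values <= capabilities.get(key, set())
               for key, values in required.items())


def filter_cells(required):
    return [name for name, caps in CELL_CAPABILITIES.items()
            if cell_passes(caps, required)]


print(filter_cells({"hypervisor": {"kvm"}}))   # -> ['cell1']
```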
22:41:45 a memcache would be more beneficial here, IMHO.
22:41:45 alaski, bauzas: keep me in the loop as well
22:41:58 belmoreira: will do
22:42:18 anyway, the idea is to make sure we can do something generic and scalable
22:42:32 We have not received any feedback from HP/Nectar yet, because I haven't reached out to them yet
22:42:33 eh, isn't that what we want to provide for the scheduler ? :D
22:42:51 So I will do that so they can add to the conversation
22:43:20 #action alaski Reach out to HP/Nectar for scheduling requirement feedback
22:43:45 next up...
22:43:53 #topic open discussion
22:44:09 alaski: did you miss database?
22:44:27 i guess that was an agenda item as well..
22:44:57 vineetmenon_: I actually removed that item since I wasn't sure where to go with it yet
22:45:14 but we can talk about it now
22:45:20 just to be clear, I think my comments on https://etherpad.openstack.org/p/nova-cells-table-analysis depend on the outcome of the discussions about the sched requirements
22:45:28 aah..
22:46:17 bauzas: which comments?
22:46:57 alaski: eh... damn etherpad, it left me with different colors
22:47:10 and should we come to more of a resolution around scheduling requirements before getting too far into table analysis?
22:47:13 alaski: I was mentioning aggregates and instance groups
22:47:17 under the controversial tab?
22:47:24 for the DB discussion it will be easier if we first reserve a meeting to discuss aggregates, volumes, server groups...
22:47:26 alaski: they're tied to the scheduler
22:47:38 belmoreira: +1
22:47:40 vineetmenon_: right
22:48:07 belmoreira: as I said, I think the scheduling requirement decision seems to be the first thing to do
22:48:41 belmoreira: because we can't talk about DB segregation without having a clear idea of what the whole stories for boot and evacuate would be, for example
22:49:12 so this part is going to be in limbo for a looong time.
22:49:20 bauzas: I think we can talk about it if we limit the scope to basic scheduling
22:49:46 vineetmenon_: some parts of it, maybe
22:49:51 alaski: well, it depends on whether you want to reach feature parity with Nova
22:49:54 but let's start with this:
22:50:01 alaski: like, live migration between cells ?
22:50:17 do people want more time to consider the "easy" tables ?
22:50:29 or is there a general consensus there?
22:50:46 alaski: well, I pointed out services
22:51:02 alaski: as it's very related to the SG API
22:51:10 okay, that can move to controversial
22:52:09 bauzas: I agree with you, but we can also see it the other way around... for example deciding where aggregates live will influence the scheduler
22:52:31 Can we maybe pick a date, like next Wednesday, and say that we're generally okay with what's not in controversial/unsure and start from there?
22:52:38 belmoreira: aggregates are a beast only for the scheduler
22:52:52 then we can start picking apart what's left
22:53:09 belmoreira: whatever the decision for aggregates is, it would require the same level of granularity from the scheduler
22:53:33 alaski: that plan sounds fair
22:53:36 alaski: I'm pretty ok with the list except one last thing
22:53:41 alaski: networks ?
22:54:14 but time is running fast, dammit.
22:54:14 this is why we need to decide what to do about nova-network,
22:54:23 because if n-net goes away soon, so does networks
22:54:28 +1
22:54:42 and what would be the network topology for cells ?
22:55:10 belmoreira: do you have 1 to N subnets per cell ?
22:55:20 belmoreira: or is it something global ?
22:55:27 it'll depend
22:55:30 on the deployer
22:55:44 there are people out there running everything on one L2, one cell per subnet, etc
22:55:54 bauzas: each cell has different subnets
22:55:58 dansmith: agreed, that's why Neutron exists
22:56:12 what about a subnet spread across multiple cells?
22:56:29 anyway, we have 4 mins left :(
22:56:33 and each cell consisting of multiple subnets as well?
22:57:07 dansmith: I don't remember anything about networks in the manifesto, will review the patch with that in mind
22:57:25 dansmith: we need to be explicit on that, I guess
22:57:26 so for now it seems like networks falls under controversial, and we can devote some time to it later
22:58:11 let's try to get to the list of non-controversial things for now
22:58:35 so if there's a concern about a table, move it to the unsure/controversial list
22:59:07 sounds good, with a deadline set to Wed then ?
22:59:18 there's a lot to try to tackle in this effort, we're not going to get it all at once
22:59:28 bauzas: yes, since I didn't hear any complaints
22:59:34 alaski: cool
22:59:56 #action review table split so we can claim consensus by next Wednesday
23:00:26 my final item was that I will not be around to run the meeting on the 24th or 31st
23:00:40 and it's likely others won't be around either
23:00:49 so I am going to suggest we skip those weeks
23:00:59 the 31st seems hard to attend :)
23:01:00 but we can make that decision later, just throwing it out there
23:01:01 alaski: I'd say cancel those meetings.
23:01:06 alaski: +1
23:01:16 in particular as it's midnight now
23:01:25 cool
23:01:29 bauzas: :)
23:01:29 thanks all!
23:01:39 #endmeeting