21:00:29 #startmeeting nova_cells
21:00:30 Meeting started Wed Apr 29 21:00:29 2015 UTC and is due to finish in 60 minutes. The chair is alaski. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:34 \o
21:00:36 The meeting name has been set to 'nova_cells'
21:00:51 o/
21:00:53 o/
21:00:53 good everyone
21:00:59 o/
21:01:17 #topic Tempest testing
21:01:41 so there are still some intermittent failures
21:01:54 but nothing has merged yet to address them, so totally expected
21:02:36 o/
21:02:43 https://review.openstack.org/#/c/177356/ will hopefully help, but it's hard to say for sure at this point
21:02:47 alaski: to be clear, we should see if your patch helps that
21:03:01 yeah
21:03:12 so another week before proposing it for voting?
21:03:24 I also want to dig into some of the failures, but there's been a lack of hours in the day
21:03:29 trying to remember the logstash query for checking the intermittent-failure ratio
21:04:05 http://goo.gl/AlgZRa
21:04:25 gotcha: build_name:"check-tempest-dsvm-cells" AND message:"Worker Balance" AND build_status:"FAILURE"
21:04:26 well, that shows failures
21:04:51 alaski: agreed, but that counts per hour
21:05:28 there are two bugs open: https://bugs.launchpad.net/nova/+bug/1448316 and https://bugs.launchpad.net/nova/+bug/1448302
21:05:28 Launchpad bug 1448316 in OpenStack Compute (nova) "cells: Object action destroy failed because: host changed" [Medium,Confirmed] - Assigned to tianzichen306 (tianzichen306)
21:05:28 alaski: so if we have like 5/h, we just need to know how many runs per hour and we get the max error ratio
21:05:29 Launchpad bug 1448302 in OpenStack Compute (nova) "cells: intermittent KeyError when deleting instance metadata" [Low,Confirmed]
21:05:35 if anyone fancies looking at some point
21:06:15 alaski: by saying that, I mean that if we are below a 0.1% error rate, why not just consider the job as voting?
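The back-of-the-envelope reasoning above — failures per hour divided by runs per hour gives a failure rate to compare against the 0.1% threshold bauzas floats — can be sketched as follows. The function name and all numbers are illustrative assumptions, not actual gate statistics.

```python
# Rough error-rate estimate for a CI job, per the discussion above.
# All numbers here are illustrative, not measured gate data.

def failure_rate(failures_per_hour: float, runs_per_hour: float) -> float:
    """Fraction of job runs that fail."""
    if runs_per_hour <= 0:
        raise ValueError("runs_per_hour must be positive")
    return failures_per_hour / runs_per_hour

# e.g. 5 failures/hour against 1000 runs/hour -> 0.5% failure rate
rate = failure_rate(5, 1000)
print(f"{rate:.2%}")           # prints 0.50%
voting_ok = rate < 0.001       # the 0.1% threshold mentioned in-channel
```

With the spike of 7/h mentioned below, the same arithmetic shows why the channel was reluctant to make the job voting yet.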
21:06:27 because a recheck would be worth it
21:06:55 we should just face the failures and mark them in e-r
21:07:01 once the job is voting
21:07:23 I'm not sold on that at this point, I'd want to see something saying the rate is really, really low first
21:07:46 we spiked at 7 per hour during the last 7 days
21:07:59 the last thing we want,
21:08:00 but with an average around 2 or 3
21:08:05 that's a fair amount
21:08:11 is to turn it on prematurely and gain the scorn of our colleagues :)
21:08:13 IMHO.
21:08:17 dansmith: agreed
21:08:21 right
21:08:47 dansmith: my point is just to find the balance between waiting for some magical solution and how much it costs to get it voting
21:08:54 yep
21:09:03 I think we just have a couple of issues left, which may be resolved with my patch
21:09:05 there's definitely a balance there
21:09:26 anyway, don't get me wrong, the priority is alaski's patch
21:09:31 before even thinking about voting
21:09:54 bauzas: if it got down to a handful of failures a day I think we should have a conversation about it
21:09:55 i.e. https://review.openstack.org/#/c/177356/
21:10:24 alaski: agreed
21:10:58 anything else on testing? I think for now it's waiting on that patch and digging into the lp bugs when there's time
21:11:05 anyway, let's chase another guru and then see how it reduces the failures
21:11:15 alaski: +1
21:11:31 #topic Specs
21:11:50 shamefully, I have not updated any specs recently
21:12:30 so, maybe nothing to say?
21:12:37 well, one thing
21:12:43 unless someone takes the opportunity to ask questions?
21:12:54 perhaps I misread something, but I felt like there might be some question about https://review.openstack.org/#/c/176078/ being enough
21:13:20 because while it works for scheduling, it doesn't work for an API response
21:13:47 alaski: then I need to be more clear
21:14:08 alaski: I think persisting the whole object in the API db is fine by me
21:15:05 alaski: then, it could maybe be done like instance.save(), i.e.
being synced
21:15:12 maybe I'm foolish
21:15:54 I was thinking we would pick the db based on context
21:16:10 my point is that if the scheduling stuff is magically working on the parent cell, why do we need to get it to the child cell?
21:16:10 so with cell_db_context: req.save()
21:16:16 I perhaps missed it
21:17:04 the primary reason IMO is we don't want to store much data per instance in the api db
21:17:34 well, the instance <=> req_spec relation is thin
21:18:00 an instance can have a request_spec dependency
21:18:08 but there can be one request spec per instance
21:18:10 ergh, s/dep/relationship
21:18:20 alaski: yeah
21:18:38 alaski: but then, you only need to care about how to get the reqspec reference
21:18:51 alaski: you don't need to store the whole object
21:19:03 I mean nested in the instance
21:19:35 to be clear, are you suggesting the request spec in the api db, and a ref to it in the instance?
21:19:42 alaski: yup
21:20:37 I have concerns about https://review.openstack.org/#/c/141486/ but maybe I'm misunderstanding it. A spec iteration would be good.
21:20:43 my concern is that for scaling it's better to have that in the cell, so the api db doesn't store much data per instance
21:21:12 bauzas: thanks for the comments
21:21:18 alaski: then the context is a good option
21:21:33 I mean a cell-context-related save()
21:21:51 alaski: will amend my comment then
21:22:03 belmoreira: I will iterate on that. sorry for the delay
21:22:45 belmoreira: I know that we probably have a kind of misunderstanding between us
21:22:46 :)
21:23:07 belmoreira: I know you would like some cell-only scheduling options
21:23:22 for the request spec, it will need to be stored in the api db originally, and then additional data will need to be stored to satisfy an instance list/show
21:23:33 alaski: agreed
21:23:52 I had a similar, albeit vague, thought to alaski's.
I'm not very clear on the interactions for things like a migrate where we want to refer to the req spec, i.e. whether keeping it in the cell would help save things from converging on the api db constantly
21:25:13 melwitt: to be clear, migrations are not using the requestspec now, just because we're not persisting it yet :)
21:25:26 bauzas: yes, I mean assuming we persist it
21:26:01 melwitt: MHO is that migrating comes from the API anyway
21:26:06 bauzas: yes :)
21:26:35 melwitt: the conductor is being called, but that's still coming in from the api
21:27:01 bauzas: that's what I was getting at. if it's in the api anyway, it wouldn't help much. so I was trying to say, are there scenarios where we need to use the request spec and we're not already at the api?
21:27:06 melwitt: there is no periodic task related to migrations done on the conductor, AFAIK
21:27:34 just trying to get clear on: is this a current need, or are we just thinking to cover that case for the future, if something comes up?
21:28:06 melwitt: I think we are pretty clear that automatic evacuation should not be done in Nova
21:28:11 alaski: I thought we said store the req spec in the api db until we schedule, then delete it and move it to the cell db?
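The "pick the db based on context" idea from earlier in the discussion ("so with cell_db_context: req.save()") could look roughly like this. Every name here — the context manager, the Context attribute, the RequestSpec class — is a hypothetical stand-in, not an actual Nova API; the sketch only illustrates the control flow of temporarily retargeting saves at a cell database.

```python
# Hypothetical sketch of context-based database targeting, as floated
# in the meeting. None of these names are real Nova APIs.
from contextlib import contextmanager

class Context:
    def __init__(self):
        self.db_connection = "api_db"   # default: the top-level API database

@contextmanager
def cell_db_context(context, cell_db):
    """Temporarily point the context at a specific cell database."""
    saved = context.db_connection
    context.db_connection = cell_db
    try:
        yield context
    finally:
        context.db_connection = saved   # restore the API db afterwards

class RequestSpec:
    def save(self, context):
        # A real object would issue a DB write through the connection
        # selected by the context; here we just record the target.
        self.saved_to = context.db_connection

ctx = Context()
spec = RequestSpec()
with cell_db_context(ctx, "cell1_db"):
    spec.save(ctx)                       # this save lands in the cell db
```

The appeal of this shape is that the object code (`spec.save(ctx)`) stays identical whether it targets the api db or a cell db; only the surrounding context changes.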
21:28:22 dansmith: right, that's my thinking
21:28:36 that can work too
21:28:41 bauzas: was making a case for keeping it in the api
21:28:43 okay, above it sounded like maybe you meant not deleting from the api db
21:28:48 permanently?
21:29:10 dansmith: that depends on where you query the DB to get the spec when migrating
21:29:18 I think having it in the api db makes it clear it's looking for a home, and once it's scheduled, it's in the other one
21:29:45 bauzas: right, I don't think it really needs to be in the api db though; especially if the scheduler were external, we'd just grab the req from the current home, provide it to the scheduler and let it move
21:29:49 dansmith: if that's done in the API, then there is a need for keeping it locally, but since that goes down to the conductor and is then read, it's actually better to have it in the cell db IMHO
21:30:02 so, migration/resize would be moving the instance/reqspec from one cell to another (if it moves cells)
21:30:27 dansmith: well, we need to read the spec for calling the scheduler, right?
21:30:29 okay, I don't follow that conductor line of thinking, but it sounds like it's not important
21:30:41 yes
21:30:48 bauzas: the api can read from the cell db
21:30:57 that's the point, yeah :)
21:31:05 dansmith: then, it's just a matter of finding where to call that
21:31:15 alaski: but not the contrary, right?
21:31:24 bauzas: correct
21:31:30 contrary what?
21:31:31 alaski: given the current process of migrating
21:31:38 dansmith: reverse, I'm assuming
21:31:42 cell things reading from the api db?
21:31:43 dansmith: the cell can't read the api db
21:31:57 yeah, French hitted
21:32:02 I'm not sure we'll be able to completely ban it, but it would be great if we can pass enough information down so that it doesn't
21:32:02 so
21:32:23 dansmith: that's my goal.
but I agree it could be added if necessary
21:32:34 the current process is that a migration request comes to the API, which calls the conductor methods asynchronously
21:32:47 that's a cast, I mean
21:32:48 yeah, I can see some things needing to rendezvous at the api db, but if we can avoid it, it's nice separation
21:33:25 so, speaking about cells, conductor methods will be run on the cell side
21:33:47 so they won't have access to the api db, unless we explicitly allow that
21:34:39 a migration within a cell could be handled without coordination by the api
21:34:56 alaski: who would trigger the migration?
21:35:18 the api would cast to conductor as you said
21:35:26 alaski: but it's the same, right?
21:35:40 alaski: api passes the destination info, which might be the current cell
21:36:01 er, no
21:36:05 dansmith: unless the user specifies no destination, and then it goes to a scheduler call
21:36:20 I guess api calls to conductor to trigger, conductor calls scheduler, and then conductor will need the deets of the destination cell
21:36:32 I think we need to be clear about inter- vs intra-cell
21:36:36 which goes back to my "conductor might need to call back up"
21:36:44 alaski: why? isn't that a scheduler thing?
21:37:09 alaski: an intra-cell conductor would call the inter-cell scheduler, I don't see that as a problem
21:37:23 there's more needed for inter-cell
21:37:30 * dansmith groks "intra-cell conductor" for a minute
21:37:39 alaski: when migrating?
21:37:51 I say we just punt on inter-cell migrations for the moment
21:37:52 alaski: or unshelving or whatever?
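The request-spec lifecycle discussed above — keep the spec in the api db while the instance is "looking for a home", then delete it there and persist it in the chosen cell's db once scheduling completes — can be sketched minimally. The dict-based "databases" and the two function names are purely illustrative, not Nova code.

```python
# Sketch of the request-spec flow from the discussion: api db until
# scheduled, then delete-and-move into the cell db. Dicts stand in
# for the real databases; all names here are hypothetical.

api_db = {}                                 # request_spec_id -> spec
cell_dbs = {"cell1": {}, "cell2": {}}       # per-cell databases

def create_request_spec(spec_id, spec):
    """On a boot request: persist the spec in the API db."""
    api_db[spec_id] = spec

def on_scheduled(spec_id, cell):
    """After the scheduler picks a cell: move the spec down, delete up."""
    spec = api_db.pop(spec_id)              # delete from the API db...
    cell_dbs[cell][spec_id] = spec          # ...and persist it in the cell db

create_request_spec("spec-1", {"flavor": "m1.small"})
on_scheduled("spec-1", "cell1")
```

This keeps the api db thin per instance (alaski's scaling concern) while still making an unscheduled spec easy to find (dansmith's "looking for a home" point).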
21:38:01 dansmith: right, that's what I want to do
21:38:05 they'll be possible, but not supported today in v1, so we might as well punt on this
21:38:12 fair point
21:38:19 they're punted in cells v1 too
21:38:46 if you're looking at intra-cell you just pass the request into the cell, bam
21:38:46 right
21:38:54 yup
21:39:34 alaski: well, there is just one flaw in that
21:39:54 alaski: since the user can specify a dest host, it bypasses the scheduler and then calls the compute
21:40:27 alaski: so we just need to make sure that if the user specifies a dest, it's one within the same cell as the source
21:40:35 but I'm nitpicking for now
21:40:45 right. that would fail currently because the two computes can't talk to each other
21:41:03 and in the new setup
21:41:36 so are we agreed that the request spec lives in a cell? :)
21:41:39 alaski: in the new setup, that's just a check that the target is in the same cell as the source
21:41:52 alaski: it lives in the cell after scheduling...
21:41:58 alaski: after scheduling
21:42:02 aaaarh
21:42:07 * bauzas jinxed
21:42:08 dansmith: right, that's totally what I meant
21:42:18 okay :)
21:42:38 but yeah, I'm fine with that approach
21:42:43 cool
21:42:48 moving on
21:42:53 #Open discussion
21:42:57 cool
21:42:57 dammit
21:43:04 #topic Open discussion
21:43:56 anyone have a topic to discuss?
21:44:17 I need to iterate on some specs and get some more things down on paper
21:44:35 we at least need to have a plan for the summit :)
21:44:43 ahh, yeah
21:44:55 for showing that and discussing it
21:44:55 so we're split between scheduling topics and everything else
21:45:10 you're toasted
21:45:39 toasted?
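The "same cell" validation bauzas raises above — when a user supplies an explicit destination host, bypassing the scheduler, the target must live in the source instance's cell — could be a simple lookup-and-compare. The host-to-cell mapping and function name below are hypothetical stand-ins for the real cell mapping data, not Nova code.

```python
# Sketch of the same-cell destination check discussed above. The
# mapping and names are illustrative; real Nova would consult its
# host/cell mapping records instead of a hard-coded dict.

host_to_cell = {
    "compute1": "cell1",
    "compute2": "cell1",
    "compute3": "cell2",
}

def check_same_cell(source_host, dest_host):
    """Reject a user-specified destination outside the source's cell."""
    if host_to_cell[source_host] != host_to_cell[dest_host]:
        raise ValueError(
            f"{dest_host} is not in the same cell as {source_host}")

check_same_cell("compute1", "compute2")     # ok: both in cell1
try:
    check_same_cell("compute1", "compute3") # cell1 -> cell2
except ValueError:
    pass  # inter-cell move rejected, matching the punt agreed above
```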
21:45:44 mmm
21:45:56 sandwitched
21:46:13 ahh, yeah
21:46:16 * bauzas tries to invent words at 11.40pm - always a bad idea
21:46:51 so, it will depend on how it goes with the scheduler stuff
21:47:05 but we actually have 2 sessions back-to-back for cells v2
21:47:21 one for discussing the scheduling needs, and one for the other stuff
21:47:25 which is good IMHO
21:47:29 yeah
21:47:46 getting scheduling right is a big part of this
21:48:00 and at least people won't get bored just because we'll be speaking about the scheduler all morning...
21:48:21 outside of that is getting the interactions between the api and cell right, and migrating data properly
21:48:24 alaski: I would rather make a call to discuss how we can scale
21:48:42 alaski: I just wonder if we should also poke the neutron guys
21:48:56 alaski: since we haven't yet agreed on the plan for network requests
21:49:02 I've been thinking about that
21:49:07 alaski: and since neutron doesn't have cells either
21:49:14 alaski: yeah, I know
21:49:17 if you assume a global neutron, everything should just work
21:49:31 alaski: wouldn't it face the same problems as nova?
21:49:54 it has scaling concerns, and cells might be a good approach for them
21:49:56 alaski: like the queues and so on
21:50:30 there are definitely things to talk about there, but we're not blocked on it, I think
21:50:31 alaski: that's my point, it would just be horrible to make all this effort and then get pushed back just because neutron doesn't scale
21:51:05 alaski: well, n-net is not deprecated yet, right? :)
21:51:18 heh
21:51:41 realistically it would help if someone could liaise on that stuff right now
21:51:44 so indeed, let's assume that everything will scale, and in the end, n-net will be there for us :)
21:52:17 I speak with some networking folks occasionally, but don't have the bandwidth to drive it very far at the moment
21:52:22 should we maybe start some discussions with s68cal since he's the liaison?
21:52:43 oh, was that figured out? my email went wonky for a few days
21:53:11 he's the neutron guy foolish enough to visit Nova
21:53:24 gotcha. is there anyone on the nova side yet?
21:53:43 none that I heard about
21:54:02 I mean, network guys are creepy
21:54:15 * bauzas kidding
21:54:34 okay. I can reach out to s68cal to give him a heads-up on our efforts
21:55:18 we could at least give him a wish list
21:56:17 back to the summit: does the data migration part of cells, getting instances and cells mapped, seem like a good topic to y'all?
21:56:59 that's a fair point to discuss; if consensus is reached quickly, that's even better
21:57:03 I want to touch on a few other things, but in the time allowed we can probably only hit one or two big things
21:57:50 alaski: I seriously need to see what we should discuss for the scheduler efforts, so I could help you on that by giving you time if I can
21:58:22 okay
21:58:23 I seriously doubt people will like having the whole morning kinda dedicated to scheduling
21:58:33 because that's a beast
21:59:11 the last session is about RT, I still need to figure out what to fill in there
21:59:13 they'll get over it
21:59:40 because there are good points to raise, but few chances to work on them for Lib
21:59:41 yep, that's what we're there for
22:00:18 well, this was a bit disorganized, sorry. I'll be more prepared next week
22:00:21 thanks all!
22:00:29 #endmeeting