21:00:00 #startmeeting nova_cells
21:00:01 Meeting started Wed Nov 30 21:00:00 2016 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:03 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 The meeting name has been set to 'nova_cells'
21:00:26 *ahem*
21:00:26 o/
21:00:50 o/
21:01:15 well...
21:01:28 was really hoping for alaski to show up
21:01:32 because I have questions
21:01:36 concerns
21:01:43 o/
21:01:54 i made it
21:01:59 congrats :)
21:02:02 thank you
21:02:09 #topic cells testing / bugs
21:02:22 anything on testing/bugs this week?
21:02:23 the only bug we had was that pg one
21:02:32 only took 5 days to fix it
21:02:39 true, I guess that was because of cellsv2
21:03:06 otherwise nada
21:03:08 #topic open reviews
21:03:16 so, I refreshed my set again just a bit ago:
21:03:20 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/cells-scheduling-interaction
21:03:27 on top of melwitt's cell database fixture
21:03:38 the last patch is mostly good for unit tests at this point,
21:03:40 so https://review.openstack.org/#/c/396417/ got sorted out?
21:03:46 except for one ugly gotcha around cellsv1
21:04:02 mriedem: I need a fix to oslo.messaging for that, which I have proposed
21:04:08 mriedem: but their gate is fubar right now
21:04:17 do they know what's up?
21:04:22 like, are they fixing their stuff?
21:04:25 mriedem: so I have a hack in there to make it work, but I don't expect us to merge it
21:04:34 mriedem: it's zmq.. someone lobbed a patch at it claiming to fix it, but it didn't
21:04:42 mriedem: but yeah, I've been in there poking people about it
21:04:58 my fix has been +W for a few days but can't make it through gate
21:05:01 are you going to wip https://review.openstack.org/#/c/396417/ then?
21:05:09 we could also do the old depends-on dance
21:05:13 if I rechecked it harder I could maybe get it in
21:05:15 but will need to depend on a release and g-r bump
21:05:45 yeah
21:05:48 mriedem: this is the hack: https://review.openstack.org/#/c/396417/15/nova/rpc.py@77
21:05:54 yeah i saw it
21:06:01 yeah I guess I can -W it
21:06:18 and https://review.openstack.org/#/c/403924/ is what we want
21:06:26 yar
21:06:49 fun http://logs.openstack.org/24/403924/1/check/gate-oslo.messaging-dsvm-functional-py27-zeromq/35e71b1/console.html#_2016-11-29_19_18_53_458260
21:07:11 anyway, so yeah that is blocked at the moment
21:07:13 if it's not a regression and no one is fixing it we could skip the test...
21:07:20 but maybe that's extreme for right now
21:07:21 *shrug*
21:07:36 i guess we can light a fire when we've reviewed the entire series
21:07:40 yeah
21:07:48 the bottom couple of patches there can merge I think
21:07:58 at least
21:08:03 anyway, there is a bigger problem at the top of that set
21:08:18 and that is around cellsv1
21:08:27 which one? https://review.openstack.org/#/c/396775/ ?
21:08:43 no that one is out for the moment
21:08:52 https://review.openstack.org/#/c/396417
21:09:21 so the deal there is something I realized yesterday while trying to squash the last couple of unit test failures
21:09:48 in cellsv1, we do the compute/api bit, create in the api, then call down to the cell, which replays the compute/api part in the cell,
21:09:58 before finally casting to conductor to get things going
21:10:12 if we finish this move of the create from api to conductor,
21:10:49 then we're not going to create in the api cell, and we're going to call to conductor in the api cell,
21:11:05 which is then going to try to talk directly to the compute instead of calling through the cells rpc to get down there
21:11:10 which is ... scary
21:11:29 because (a) I don't want to add any more cells calls, and certainly not for things that need to call to conductor
21:11:36 so I'm not really sure what to do
21:11:47 which is what I was hoping to poke alaski about today
21:11:58 does any of that ranting make sense?
21:12:01 and we don't want a bunch of if CONF.cells.enabled checks all over this flow
21:12:05 yeah
21:12:16 and we don't want to have to keep two instance create paths either
21:12:32 i won't profess to understand the entire create flow through the api for cells v1
21:12:43 well, it's complicated
21:12:52 so the parent api cell creates the instance in the parent api cell db?
21:12:54 but we do it twice, once in the api cell and once in the child cell
21:13:02 oh right the instance is in both places
21:13:04 hence the up calls to sync
21:13:11 and that replication happens at the compute/api layer,
21:13:30 by intercepting the compute/api call, doing some of it, and then re-playing it in the child cell
21:13:36 and when status changes in the child cell while building the instance, we send that up to the parent cell to update the instance there too right?
21:13:47 yeah
21:14:09 and cellsv1 doesn't know jack about build requests...
21:14:15 nor does it know about conductor
21:14:21 it's purely replication at the compute/api layer
21:14:55 we also *have* to create in the api cell before we create in the child because otherwise the api stops working,
21:15:06 so we can't even try to patch it up by making the first sync create the instance back at the top again
21:16:20 so yeah
21:16:22 pretty much the suck.
21:16:25 I was thinking there's already "if cell_type == 'api'" in the compute/api and maybe we could just do the old way in there ...
21:16:36 old create I mean
21:16:53 melwitt: well, it means we keep two paths
21:17:05 CellsScheduler is parent/api cell right?
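[Editor's sketch of the cellsv1 dual-create flow described above: create the instance in the API cell first, then replay the compute/api create in the child cell, which finally casts to conductor. All class and method names here are illustrative stand-ins, not nova's real code paths.]

```python
class FakeDB:
    """Stand-in for a cell database; just records created instances."""

    def __init__(self):
        self.instances = []

    def instance_create(self, spec):
        self.instances.append(dict(spec))
        return self.instances[-1]


class FakeCellsRPC:
    """Stand-in for the cells RPC layer: replays the create in the child."""

    def __init__(self, child_db):
        self.child_db = child_db

    def build_instances(self, spec):
        # The child cell re-runs the compute/api create, then casts to
        # conductor (elided here) to actually build the instance.
        self.child_db.instance_create(spec)


class ApiCellComputeAPI:
    """Simplified compute/api running in the API cell."""

    def __init__(self, api_db, cells_rpcapi):
        self.api_db = api_db
        self.cells_rpcapi = cells_rpcapi

    def create(self, spec):
        # 1. Create in the API cell first -- the API stops working
        #    without a local instance record.
        instance = self.api_db.instance_create(spec)
        # 2. Then replay the create (and the eventual conductor cast)
        #    down in the child cell via the cells RPC.
        self.cells_rpcapi.build_instances(spec)
        return instance
```

The point of contention in the discussion: moving the create from compute/api into conductor removes step 1, so the API cell would never get its copy of the instance.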
21:17:11 which builds the instances in the api cell
21:17:20 and calls the compute api code to create the instance in the api cell db
21:17:21 gah pushed the wrong key
21:17:35 mriedem: I dunno tbh
21:17:50 i'm pretty sure it is b/c _create_instances_here calls instance = self.compute_api.create_db_entry_for_new_instance(
21:17:55 don't we have to keep the two paths though? I mean wherever it says things like "cell_type =='api'" has to stay because cells v1 needs it
21:18:08 non-cellsv1 create_db_entry_for_new_instance doesn't create the instance in the db
21:18:34 melwitt: well, yes, but having instance created in two different *services* is uglier than just in two places in the same service, IMHO
21:19:00 melwitt: because then you get into all kinds of potentials for races I think, assuming that the instance is created by a certain point, when it might not be, etc
21:19:21 I mean, obviously the path out of this box is going to be some amount of "if cells1, else" sort of thing
21:19:29 but I'm just dreading it
21:19:39 yeah
21:20:38 anyway,
21:20:46 hmm, so have we hit the patch that sees this fail in cells v1 yet?
21:20:48 I was hoping that maybe someone had already thought about this and what the best plan would be
21:21:15 mriedem: it fails in unit tests, I'm still waiting for the run on cellsv1, but I might have other breakages to fix first
21:21:19 I just pushed it up like 30 minutes ago
21:21:48 and that's just https://review.openstack.org/#/c/396417 ?
21:21:59 i figured it would manifest in https://review.openstack.org/#/c/319379/
21:22:05 no, https://review.openstack.org/#/c/319379
21:22:10 ah ok
21:22:10 yeah
21:22:14 that makes sense
21:23:03 so, unit tests will fail, sure, but i think we should check out what the cells v1 job failures are and then start poking at it
21:23:05 anyway, so I will try to fix whatever that shakes out in the next day for the normal path and then see if I can start throwing things at it to make it still work
21:23:06 yeah, I guess I would be thinking to preserve what cells v1 is doing until we remove it, which is the two path thing. because the alternatives involve trying to sync the create upward to the API or something, right?
21:23:50 could we....take the build request that's created in the api now and use that to hydrate and create the instance in the api/parent cell db?
21:23:58 melwitt: yeah, I'm just concerned about other stuff, like bits we have moved out of compute api to conductor that won't get run for cellsv1, like bdm validation or something (not really, but something like that)
21:24:03 instead of the cells scheduler calling the compute api to create the instance?
21:24:28 mriedem: no, we have to create it at about the same place as we do now, or later, because of all the junk that compute/api does to the instance
21:24:32 I see
21:24:54 does to the instance how?
21:25:00 like figuring out the name and stuff?
21:25:16 mriedem: well, for one thing it handles the "num_instances" bit, as well as yeah names and stuff
21:25:17 i thought that was all done before the instance was serialized and the build request was created
21:25:31 it's spread all over the place
21:25:43 well, I should say,
21:25:59 I don't know where exactly the cell scheduler bit plugs in,
21:26:00 so maybe it's more in the middle than I think, I dunno
21:26:11 ok, so....sounds like maybe 2 options,
21:26:15 regardless, fixing at that layer seems like more new different code
21:26:18 which I'm afraid of
21:26:28 1. if the build_request.instance is 90% what we need in the api cell for v1, then maybe we can use that to create it in the api cell for v1
21:26:46 2. else we see if we can hack in a conditional here or there to do the dirty deeds done dirt cheap
21:26:56 until we can kill cells v1
21:26:59 I think #2 is the thing to try first
21:27:02 meaning,
21:27:02 sure
21:27:11 I like that my suggestion is the AC/DC one
21:27:12 right now we have a place where we just no longer call instance.create()
21:27:14 3. alaski saves our asses
21:27:34 and so we'd just do "if cellsv1, do create like old times" but then we also have to not call the new conductor method I think
21:27:36 yeah, alaski, come and solve this for us
21:27:36 something
21:27:47 dansmith: yeah that's not too terrible
21:27:49 if that's all it is
21:27:58 I think it's end-of-days worst possible thing
21:28:07 # NOTE(danms): this makes bon scott roll in his grave, but we have to do this...
21:28:19 lol
21:28:19 which is probably "not too terrible" times standard dansmith inflation factor
21:28:20 haha
21:28:41 4. dtp fixes this all for us
21:28:48 i wish!
21:28:54 * dansmith reassigns to dtp
21:28:55 did laski mention any of this in his brain dump patch?
21:29:00 mriedem: no
21:29:01 mriedem: I looked
21:29:04 a lot. :)
21:29:05 dagnabbit
21:29:06 ha
21:29:06 I looked too
21:29:19 first thing I did :)
21:29:26 alright, anyway, enough dwelling on this
21:29:35 so,
21:29:37 quotas?!
21:29:38 I will keep plugging away
21:29:46 * dansmith hands the mic to melwitt
21:30:31 yeah, I'm working on it but nothing to show yet. haven't gotten as much done by now as I wanted to
21:31:12 the spec was amended :) https://review.openstack.org/#/c/399750/ that's something
21:31:38 oh, right. I did do that
21:32:32 okay, well,
21:32:42 #topic open dis-cush-ee-ohn
21:32:47 anything else?
21:32:55 well,
21:32:59 on the ci front,
21:33:06 we should be back to nova-net being gone by eod
21:33:10 except for cells v1
21:33:17 yeah, that's good
21:33:24 and on the bright side sdague us back to look at the grenade change to require cells v2 in ocata
21:33:27 so progress
21:33:32 yay
21:33:38 s/us/is/
21:34:14 there are 2 semi related things for cells v2
21:34:15 https://blueprints.launchpad.net/nova/+spec/prep-for-network-aware-scheduling-ocata
21:34:30 looks like that's moving slowly
21:34:39 john has been caught up in the multiattach stuff lately
21:34:46 boo
21:35:10 and https://review.openstack.org/#/c/393205/ which i've asked sdague to look at again, and i need to look at again
21:35:46 on the bright side alex has started on the json schema validation for query params https://review.openstack.org/#/q/topic:bp/consistent-query-parameters-validation
21:35:59 there are multiple bright sides?
21:36:09 there are 3 bright sides in this meeting
21:36:13 oh my
21:36:26 #action need to review https://review.openstack.org/#/c/393205/
21:36:44 #action review https://review.openstack.org/#/c/399750/
21:36:49 one thing I realized is the remaining consoleauth stuff was covered by a spec that I missed reproposing for ocata. so that has to wait until pike
21:37:24 does it block anything?
21:37:25 if not implemented
21:37:36 might block the upcall thing
21:37:44 but also not sure who is going to work on it
21:38:09 i was never really familiar with that series
21:38:14 yeah, upcall needed to talk to consoleauth service otherwise. so mq switch needed from a cell, I think in that case
21:38:15 would have to read up on it
21:38:15 https://review.openstack.org/#/q/topic:bp/convert-consoles-to-objects
21:38:46 I was thinking to pick it up, i.e. restore the abandoned patches
21:39:45 i think someone would have to explain the upcall bits in more detail to me
21:40:00 to see the relation to cells v2 here
21:40:08 mriedem: preventing a call from a compute node up to the api db
21:40:23 b/c the consoleauth stuff is in the api db now?
21:40:49 no it's not
21:41:04 oh wait, my memory is recalling alaski saying we could just change deployment assumptions to run a consoleauth service per cell
21:41:18 in the meantime
21:41:18 hmm, does that work?
21:41:29 I thought the problem was that we route the api request by token id,
21:41:36 which we couldn't resolve to a cell without more information
21:41:41 https://review.openstack.org/#/c/321636/ is the cells v2 spec amendment
21:41:58 unless maybe we change what we return for them to call or something, but that's an api change I *thought*
21:42:06 I was thinking of the message queue part. if consoleauth runs on the api host, the cell would need to be able to talk to the api message queue
21:42:29 to request auth of a token, IIUC
21:42:32 I dunno, I don't have it in my head for sure
21:42:52 maybe you two can go off and figure out what needs doing and decide if someone has time for it this cycle
21:42:56 "The consoleauth service will be retained for legacy compatibility but
21:42:56 in a deprecated status, supported for one release. After the
21:42:56 period the consoleauth service can be removed."
21:43:13 that's if we do paul's thing I think
21:43:15 if routing is by token id from the api then that's another problem. I'm not that familiar with consoleauth
21:43:28 yeah
21:43:33 melwitt: I dunno, I might be making that up, but I thought there was something about that
21:43:43 This can be resolved by adding the instance uuid to
21:43:43 the query string in the URL.
21:43:59 i think it just means adding something like &instance_uuid=foo to the url
21:44:02 dansmith: yeah, I take it as a possibility. I'll dig into it and find out what the deal is
21:44:03 right
21:44:04 and then we can map that to a cell
21:44:10 mriedem: correct
21:44:34 anyway, we've brought it up a couple weeks in a row now, so maybe we can shoot for having a plan by next week
21:44:35 ?
21:44:45 okay
21:45:01 #action melwitt to figure out what's up with consoleauth changes wrt cells v2
21:45:04 #action melwitt to detangle the consoleauth stuff for next week
21:45:07 ooo
21:45:09 mriedem: gdi who is running this meeting?
21:45:15 force of habit
21:45:18 mriedem taking over the place
21:45:37 so on the quotas thing, if there were some steps to get started on that i could maybe take a crack
21:45:46 but would need some hand holding
21:46:32 anyway, i just don't do much code stuff these days on any priorities, so i'm open
21:46:55 #info mriedem's code chops are falling into disrepair
21:46:59 anything else?
21:47:01 hey
21:47:05 heh
21:47:12 see me triage that shelve race earlier?!
21:47:20 mriedem: okay, let me think about that. I was first going to do the object code for the quota tables we're keeping (that keep the limits)
21:47:20 i'm done
21:48:00 and then after that work on the resource counting and replacing reserve/commit/rollback calls
21:48:13 the latter part is what i thought sounded somewhat simpler
21:48:16 but maybe it's not
21:48:32 anyway, we can take that outside the meeting
21:49:04 I move we adjourn
21:49:17 second
21:49:47 * dansmith wields the gavel
21:49:59 #endmeeting
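[Editor's sketch of the consoleauth fix quoted in the meeting: adding the instance uuid to the console URL's query string so the API can map a token-auth request to a cell without resolving the token id itself. The URL handling uses real urllib.parse calls; the instance-to-cell lookup is an illustrative stand-in, not nova's actual mapping code.]

```python
from urllib.parse import parse_qs, urlencode, urlparse


def add_instance_to_console_url(url, instance_uuid):
    """Append instance_uuid to the console URL's query string."""
    sep = '&' if urlparse(url).query else '?'
    return url + sep + urlencode({'instance_uuid': instance_uuid})


def cell_for_console_request(url, instance_to_cell):
    """Route a console request to a cell using the uuid in the URL.

    With the instance uuid present in the query string, routing no
    longer needs to resolve the opaque token id to a cell.
    """
    qs = parse_qs(urlparse(url).query)
    return instance_to_cell[qs['instance_uuid'][0]]
```

For example, a URL like `http://proxy/console?token=abc` becomes `http://proxy/console?token=abc&instance_uuid=foo`, and `foo` can then be looked up in the instance mappings to find the right cell.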