21:00:24 #startmeeting nova_cells
21:00:25 Meeting started Wed Aug 22 21:00:24 2018 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:28 The meeting name has been set to 'nova_cells'
21:00:30 #link https://wiki.openstack.org/wiki/Meetings/NovaCellsv2
21:00:32 o/
21:00:58 o/
21:01:08 #topic bugs
21:01:10 o.
21:01:28 there's a big list of bugs in the agenda, and because I'm a terrible person I haven't looked at them much
21:01:39 I plan to do that tomorrow morning first thing
21:02:10 anyone want to highlight any of them specifically? they're fairly well documented so that it's easy to see which are the most important.
21:02:35 anything i have in there is stale,
21:02:54 like the stuff mnaser reported on instance mappings without a build request and no cell set on the mapping
21:03:07 we had talked about a transactional write for those
21:03:38 nothing critical atm
21:04:01 alright.
21:04:03 anyone else?
21:04:15 there's a bug for this cells batching perf thing but I will comment on that in open reviews
21:04:50 skipping testing
21:04:55 #topic open reviews
21:05:05 mriedem: has one open which I just clicked on
21:05:16 which appears merged, so. okay
21:05:26 ?
21:05:34 oh
21:05:45 yeah that
21:05:57 i have an item for the ptg for that
21:06:03 I still have the multiple cells affinity thing https://review.openstack.org/585073
21:06:32 i at least got the test :)
21:06:47 which reminds me,
21:06:48 yes, thank you
21:07:01 that i think we do some redundant instance group query shit now in setup_instance_group or whatever it's called
21:07:09 unrelated to that patch, but perftastic
21:07:36 the second query in there? that keeps confusing me
21:07:37 ever since we stopped storing hosts/members in the group record
21:08:12 no https://review.openstack.org/#/c/540258/13/nova/scheduler/utils.py@792
21:08:13 i think
21:08:40 well, i mean if we already have the group
21:08:46 yeah, that's what I was thinking of
21:08:58 and the subsequent get the hosts again
21:09:03 yeah i assume that was in there b/c we needed the freshest latest members and hosts in the group
21:09:17 but dansmith fixed that by not storing the group hosts/members in the db
21:09:28 this ^ code predated dan's fix
21:09:31 and is likely redundant now
21:09:46 just one of the many todo threads to pull
21:09:54 so, open a bug?
21:10:09 "this might be total bullshit, but i think x is not required now b/c of y"
21:10:09 sure
21:10:42 melwitt: I will try to hit those tomorrow morning as well, remind me if not
21:10:51 ok, thanks
21:11:02 so tssurya has had down-cells patches up for a while and because I'm a terrible person I just circled back to them this morning
21:11:15 or was that yesterday? I forget
21:11:20 yesterday
21:11:22 yesterday
21:11:34 regardless, I think we worked out a better way to do part of the accounting bit which is cool,
21:11:36 i've been slowly burning through gmann's dep stuff below it
21:11:47 yea thanks and thanks
21:11:48 and I have that on the end of my batching set for her to use,
21:11:52 since she's blocked on that other thing anyway
21:12:17 meanwhile I will write tests for them
21:12:19 I found an issue in the batching code while writing that for her, so I'm glad I did that at the end anyway
21:13:05 which leads me to my batching stuff to address perf issues the huawei guys have noted
21:13:20 they've been reviewing and testing that stuff for me, which is cool
21:13:24 have kevin and yikun been talking to you about testing the latest?
21:13:29 with like error instances and such?
21:13:39 I assume we're all aware of that process but if anyone isn't clear on what that is or why/if it's important speak up
21:13:56 mriedem: not since last night or whatever
21:13:58 I decided not to do the fault thing (again)
21:14:39 ok but they tested the distributed strategy right?
21:15:03 ah nevermind
21:15:04 I'm not sure about that, but it will equate to the same thing for them unless they perturb their distribution a little
21:15:26 chenfujun will be pleased!
21:15:45 anything else on batching or other open reviews?
21:16:13 #topic open reviews
21:16:19 mriedem has things
21:16:30 open discussion?
21:16:34 sorry
21:16:35 #undo
21:16:35 Removing item from minutes: #topic open reviews
21:16:39 #topic open discussion
21:16:50 i'll do the 2nd one first, since it's a segue from handling a down cell
21:17:06 when we were talking to the public cloud guy in charge of moving from their cascading / cells v1 type thing last week,
21:17:11 and talked about handling a down cell,
21:17:40 he wanted an option to not return any results when listing if the tenant has any instances in a down cell; this would be before tssurya's new microversion which returns the instances in down cells with the UNKNOWN status
21:18:18 he was concerned that users would be confused if they knew they had like 3 VMs and then all of a sudden they see 2
21:18:26 this seems a bit close to config-defined-api-behavior to me
21:18:32 so he'd rather not show them any, because if they delete the 2 and the 1 comes back, it would be an issue
21:18:38 seems like going back to 500 would be better than an empty list
21:18:48 it is, yes, which i pointed out,
21:19:01 but it's also not really an interop kind of situation
21:19:03 500 is an option
21:19:21 but we already fixed that 500 bug...
21:19:27 so clearly CERN wants it one way and huawei wants it another
21:19:45 I meant turn on the 500 with a config flag
21:19:56 oh
21:20:01 like explode_if_some_are_down=True
21:20:05 yeah that's an option
21:20:13 i can go back and ask which is preferable
21:20:15 hide or explode
21:20:18 in the end,
21:20:23 it's basically the same from the user viewpoint
21:20:45 I had this https://review.openstack.org/592428/ which was for the empty list config
21:20:50 although of course huawei public cloud has a strict no errors policy... :)
21:20:53 so this would be a change, pre-microversion?
21:20:53 not IMHO.. 500 means "the server is broken, I'll try later", empty means "firedrill, my instances and data are gone"
21:20:55 to me
21:21:23 melwitt: yes pre-microversion
21:21:30 ok
21:21:33 if you request the microversion you get the down cell instances with UNKNOWN status
21:21:35 can't imagine how returning an empty list doesn't just generate a billion urgent calls to support
21:21:46 yeah i know
21:21:52 i'll ask
21:22:07 500 might mean "hmm, I'll go check the status page" and see that some issues are occurring
21:22:08 yeah, the empty list might give people a heart attack
21:22:14 if my instances are gone suddenly I freak the eff out
21:22:25 I vote for 500 too, makes more sense
21:22:46 especially since the problem is users being confused to see 2 instead of 3.. now they would see 0
21:22:57 yup
21:22:58 noted
21:23:04 moving on to the other item
21:23:07 cross-cell cold migration
21:23:15 i've vomited some random thoughts into https://etherpad.openstack.org/p/nova-ptg-stein-cells
21:23:29 our requirements are basically, network isolation between cells,
21:23:42 so no ssh between hosts in different cells so can't cold migrate the traditional way
21:23:56 image service is global, which made me think of an orchestrated shelve/unshelve type thing
21:24:15 i feel like that might be a bit kludgy, so open to other ideas
21:24:32 i'm not sure how else to get the guest disk image out of the one cell and into the other w/o snapshots
21:24:36 I think snapshot for migrate will not be what some people want, but I think it will be ideal for a lot of others
21:24:48 I think that's a good first step for us to do though
21:24:52 of course, ports and volumes...
21:25:28 with ports and volumes we just attach on the new host, migrate, detach from old host on confirm
21:25:40 so how ports and volumes get migrated is going to be another thing
21:26:11 what else do we have attached to an instance that's external? barbican secrets? castellan certs?
21:27:16 yep, we'll have to do some thinking probably
21:27:26 i dislike thinking
21:27:48 however,
21:28:03 mlavalle and smcginnis work for the same overlords so i can talk to them
21:28:12 routed networks comes to mind for ports, but cinder has no concept like this
21:28:16 as far as i know
21:28:20 synergy
21:29:30 are we done with that for the moment?
21:29:52 yes
21:29:59 dump ideas/comments in the etherpad
21:30:02 #todo everyone be thinking about cross-cell-migration gotchas
21:30:08 also,
21:30:12 huawei cloud is all volume-backed...
21:30:17 so i have to care about that bigly
21:30:28 well, maybe that helps
21:30:49 maybe, if the volume storage is shared between cells
21:31:04 or if it's migratable on the cinder side and we can just wait
21:31:35 my guess is that would trigger a swap callback
21:31:43 that's what happens if you do volume live migration
21:32:12 kevin seemed pretty confident this wasn't a hard problem, so maybe i'll just let him own it :)
21:32:56 anything else before we close?
21:33:20 nope
21:34:00 #endmeeting
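
The "transactional write" mentioned in the bugs discussion (for the instance mappings mnaser reported with no build request and no cell set on the mapping) amounts to creating both records in a single database transaction so that neither can exist without the other. A minimal sketch, assuming SQLAlchemy and simplified stand-in tables rather than nova's real schema or objects:

# Sketch only: stand-in tables, not nova's API database schema.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class BuildRequest(Base):
    __tablename__ = 'build_requests'
    id = Column(Integer, primary_key=True)
    instance_uuid = Column(String(36), nullable=False)

class InstanceMapping(Base):
    __tablename__ = 'instance_mappings'
    id = Column(Integer, primary_key=True)
    instance_uuid = Column(String(36), nullable=False)
    cell_id = Column(Integer, nullable=True)  # unset until scheduled to a cell

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

uuid = '11111111-2222-3333-4444-555555555555'
with Session(engine) as session, session.begin():
    # Both rows commit together or not at all, so a mapping can never be
    # left behind without its build request (or vice versa).
    session.add(BuildRequest(instance_uuid=uuid))
    session.add(InstanceMapping(instance_uuid=uuid, cell_id=None))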
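
On the down-cell listing question, the choices debated were a pre-microversion operator knob that either hides the tenant's instances or returns HTTP 500 when a cell is unreachable, versus the new microversion that lists the affected instances with UNKNOWN status. A hedged sketch of what such a knob could look like using oslo.config; the option name and values here are hypothetical illustrations, not the actual patch under review:

# Hypothetical option name/values; only illustrates the behaviors discussed.
from oslo_config import cfg

opts = [
    cfg.StrOpt('down_cell_list_behavior',
               default='partial',
               choices=['partial', 'empty', 'error'],
               help='Behavior of instance listing when one or more cells are '
                    'unreachable: "partial" returns only instances from '
                    'reachable cells, "empty" hides all of the tenant\'s '
                    'instances, and "error" returns HTTP 500.'),
]

CONF = cfg.CONF
CONF.register_opts(opts, group='api')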
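
The cross-cell cold migration idea leans on the image service being global: snapshot the guest in the source cell, rebuild from that snapshot in the target cell, move ports and volumes over, and keep the source instance until the migration is confirmed. A rough sketch of that ordering follows; every helper passed in is a hypothetical placeholder, not a nova internal:

# Placeholder helpers stand in for the real snapshot/build/attach plumbing.
def cross_cell_cold_migrate(instance, source_cell, target_cell,
                            snapshot_to_image, build_from_image,
                            swap_ports_and_volumes, destroy_source):
    """Cold-migrate between cells that cannot reach each other over SSH,
    using the global image service as the transport for the guest disk."""
    # 1. Snapshot the guest disk in the source cell; the image service is
    #    global, so the snapshot is visible from the target cell.
    image_id = snapshot_to_image(source_cell, instance)

    # 2. Spawn a replacement instance in the target cell from that snapshot.
    new_instance = build_from_image(target_cell, instance, image_id)

    # 3. Attach ports and volumes on the new host now; they are detached
    #    from the old host only when the migration is confirmed.
    swap_ports_and_volumes(instance, new_instance)

    # 4. Leave the source instance in place until confirmation so the
    #    operation can still be reverted; destroy it only on confirm.
    def confirm():
        destroy_source(source_cell, instance)

    return new_instance, confirm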