21:00:24 <dansmith> #startmeeting nova_cells
21:00:25 <openstack> Meeting started Wed Aug 22 21:00:24 2018 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:28 <openstack> The meeting name has been set to 'nova_cells'
21:00:30 <dansmith> #link https://wiki.openstack.org/wiki/Meetings/NovaCellsv2
21:00:32 <melwitt> o/
21:00:58 <tssurya> o/
21:01:08 <dansmith> #topic bugs
21:01:10 <mriedem> o.
21:01:28 <dansmith> there's a big list of bugs in the agenda, and because I'm a terrible person I haven't looked at them much
21:01:39 <dansmith> I plan to do that tomorrow morning first thing
21:02:10 <dansmith> anyone want to highlight any of them specifically? they're fairly well documented so that it's easy to see which are the most important.
21:02:35 <mriedem> anything i have in there is stale,
21:02:54 <mriedem> like the stuff mnaser reported on instance mappings without a build request and no cell set on the mapping
21:03:07 <mriedem> we had talked about a transactional write for those
21:03:38 <mriedem> nothing critical atm
21:04:01 <dansmith> alright.
21:04:03 <dansmith> anyone else?
21:04:15 <dansmith> there's a bug for this cells batching perf thing but I will comment on that in open reviews
21:04:50 <dansmith> skipping testing
21:04:55 <dansmith> #topic open reviews
21:05:05 <dansmith> mriedem: has one open which I just clicked on
21:05:16 <dansmith> which appears merged, so. okay
21:05:26 <mriedem> ?
21:05:34 <mriedem> oh
21:05:45 <mriedem> yeah that
21:05:57 <mriedem> i have an item for the ptg for that
21:06:03 <melwitt> I still have the multiple cells affinity thing https://review.openstack.org/585073
21:06:32 <mriedem> i at least got the test :)
21:06:47 <mriedem> which reminds me,
21:06:48 <melwitt> yes, thank you
21:07:01 <mriedem> that i think we do some redundant instance group query shit now in setup_instance_group or whatever it's called
21:07:09 <mriedem> unrelated to that patch, but perftastic
21:07:36 <melwitt> the second query in there? that keeps confusing me
21:07:37 <mriedem> ever since we stopped storing hosts/members in the group record
21:08:12 <mriedem> no https://review.openstack.org/#/c/540258/13/nova/scheduler/utils.py@792
21:08:13 <mriedem> i think
21:08:40 <mriedem> well, i mean if we already have the group
21:08:46 <melwitt> yeah, that's what I was thinking of
21:08:58 <melwitt> and the subsequent get the hosts again
21:09:03 <mriedem> yeah i assume that was in there b/c we needed the freshest latest members and hosts in the group
21:09:17 <mriedem> but dansmith fixed that by not storing the group hosts/members in the db
21:09:28 <mriedem> this ^ code predated dan's fix
21:09:31 <mriedem> and is likely redundant now
21:09:46 <mriedem> just one of the many todo threads to pull
21:09:54 <dansmith> so, open a bug?
21:10:09 <mriedem> "this might be total bullshit, but i think x is not required now b/c of y"
21:10:09 <mriedem> sure
21:10:42 <dansmith> melwitt: I will try to hit those tomorrow morning as well, remind me if not
21:10:51 <melwitt> ok, thanks
21:11:02 <dansmith> so tssurya has had down-cells patches up for a while and because I'm a terrible person I just circled back to them this morning
21:11:15 <dansmith> or was that yesterday?
I forget
21:11:20 <mriedem> yesterday
21:11:22 <tssurya> yesterday
21:11:34 <dansmith> regardless, I think we worked out a better way to do part of the accounting bit which is cool,
21:11:36 <mriedem> i've been slowly burning through gmann's dep stuff below it
21:11:47 <tssurya> yea thanks and thanks
21:11:48 <dansmith> and I have that on the end of my batching set for her to use,
21:11:52 <dansmith> since she's blocked on that other thing anyway
21:12:17 <tssurya> meanwhile I will write tests for them
21:12:19 <dansmith> I found an issue in the batching code while writing that for her, so I'm glad I did that at the end anyway
21:13:05 <dansmith> which leads me to my batching stuff to address perf issues the huawei guys have noted
21:13:20 <dansmith> they've been reviewing and testing that stuff for me, which is cool
21:13:24 <mriedem> have kevin and yikun been talking to you about testing the latest?
21:13:29 <mriedem> with like error instances and such?
21:13:39 <dansmith> I assume we're all aware of that process but if anyone isn't clear on what that is or why/if it's important speak up
21:13:56 <dansmith> mriedem: not since last night or whatever
21:13:58 <dansmith> I decided not to do the fault thing (again)
21:14:39 <mriedem> ok but they tested the distributed strategy right?
21:15:03 <mriedem> ah nevermind
21:15:04 <dansmith> I'm not sure about that, but it will equate to the same thing for them unless they perturb their distribution a little
21:15:26 <mriedem> chenfujun will be pleased!
21:15:45 <dansmith> anything else on batching or other open reviews?
21:16:13 <dansmith> #topic open reviews
21:16:19 <dansmith> mriedem has things
21:16:30 <mriedem> open discussion?
21:16:34 <dansmith> sorry
21:16:35 <dansmith> #undo
21:16:35 <openstack> Removing item from minutes: #topic open reviews
21:16:39 <dansmith> #topic open discussion
21:16:50 <mriedem> i'll do the 2nd one first, since it's a segue from handling a down cell
21:17:06 <mriedem> when we were talking to the public cloud guy in charge of moving from their cascading / cells v1 type thing last week,
21:17:11 <mriedem> and talked about handling a down cell,
21:17:40 <mriedem> he wanted an option to not return any results when listing if the tenant has any instances in a down cell; this would be before tssurya's new microversion which returns the instances in down cells with the UNKNOWN status
21:18:18 <mriedem> he was concerned that users would be confused if they knew they had like 3 VMs and then all of a sudden they see 2
21:18:26 <dansmith> this seems a bit close to config-defined-api-behavior to me
21:18:32 <mriedem> so he'd rather not show them any, because if they delete the 2 and the 1 comes back, it would be an issue
21:18:38 <dansmith> seems like going back to 500 would be better than an empty list
21:18:48 <mriedem> it is, yes, which i pointed out,
21:19:01 <mriedem> but it's also not really an interop kind of situation
21:19:03 <mriedem> 500 is an option
21:19:21 <mriedem> but we already fixed that 500 bug...
21:19:27 <mriedem> so clearly CERN wants it one way and huawei wants it another
21:19:45 <dansmith> I meant turn on the 500 with a config flag
21:19:56 <mriedem> oh
21:20:01 <dansmith> like explode_if_some_are_down=True
21:20:05 <mriedem> yeah that's an option
21:20:13 <mriedem> i can go back and ask which is preferable
21:20:15 <mriedem> hide or explode
21:20:18 <mriedem> in the end,
21:20:23 <mriedem> it's basically the same from the user viewpoint
21:20:45 <tssurya> I had this https://review.openstack.org/592428/ which was for the empty list config
21:20:50 <mriedem> although of course huawei public cloud has a strict no errors policy... :)
21:20:53 <melwitt> so this would be a change, pre-microversion?
21:20:53 <dansmith> not IMHO.. 500 means "the server is broken, I'll try later", empty means "firedrill, my instances and data are gone"
21:20:55 <dansmith> to me
21:21:23 <mriedem> melwitt: yes pre-microversion
21:21:30 <melwitt> ok
21:21:33 <mriedem> if you request the microversion you get the down cell instances with UNKNOWN status
21:21:35 <dansmith> can't imagine how returning an empty list doesn't just generate a billion urgent calls to support
21:21:46 <mriedem> yeah i know
21:21:52 <mriedem> i'll ask
21:22:07 <dansmith> 500 might mean "hmm, I'll go check the status page" and see that some issues are occurring
21:22:08 <melwitt> yeah, the empty list might give people a heart attack
21:22:14 <dansmith> if my instances are gone suddenly I freak the eff out
21:22:25 <tssurya> I vote for 500 too, makes more sense
21:22:46 <tssurya> especially since the problem is users being confused to see 2 instead of 3.. now they would see 0
21:22:57 <mriedem> yup
21:22:58 <mriedem> noted
21:23:04 <mriedem> moving on to the other item
21:23:07 <mriedem> cross-cell cold migration
21:23:15 <mriedem> i've vomited some random thoughts into https://etherpad.openstack.org/p/nova-ptg-stein-cells
21:23:29 <mriedem> our requirements are basically, network isolation between cells,
21:23:42 <mriedem> so no ssh between hosts in different cells so can't cold migrate the traditional way
21:23:56 <mriedem> image service is global, which made me think of an orchestrated shelve/unshelve type thing
21:24:15 <mriedem> i feel like that might be a bit cludgy, so open to other ideas
21:24:32 <mriedem> i'm not sure how else to get the guest disk image out of the one cell and into the other w/o snapshots
21:24:36 <dansmith> I think snapshot for migrate will not be what some people want, but I think it will be ideal for a lot of others
21:24:48 <dansmith> I think that's a good first step for us to do though
21:24:52 <mriedem> of course, ports and volumes...
21:25:28 <mriedem> with ports and volumes we just attach on the new host, migrate, detach from old host on confirm
21:25:40 <mriedem> so how ports and volumes get migrated is going to be another thing
21:26:11 <mriedem> what else do we have attached to an instance that's external? barbican secrets? castellan certs?
21:27:16 <dansmith> yep, we'll have to do some thinking probably
21:27:26 <mriedem> i dislike thinking
21:27:48 <mriedem> however,
21:28:03 <mriedem> mlavalle and smcginnis work for the same overlords so i can talk to them
21:28:12 <mriedem> routed networks comes to mind for ports, but cinder has no concept like this
21:28:16 <mriedem> as far as i know
21:28:20 <melwitt> synergy
21:29:30 <dansmith> are we done with that for the moment?
21:29:52 <mriedem> yes
21:29:59 <mriedem> dump ideas/comments in the etherpad
21:30:02 <dansmith> #todo everyone be thinking about cross-cell-migration gotchas
21:30:08 <mriedem> also,
21:30:12 <mriedem> huawei cloud is all volume-backed...
21:30:17 <mriedem> so i have to care about that bigly
21:30:28 <dansmith> well, maybe that helps
21:30:49 <mriedem> maybe, if the volume storage is shared between cells
21:31:04 <dansmith> or if it's migratable on the cinder side and we can just wait
21:31:35 <mriedem> my guess is that would trigger a swap callback
21:31:43 <mriedem> that's what happens if you do volume live migration
21:32:12 <mriedem> kevin seemed pretty confident this wasn't a hard problem, so maybe i'll just let him own it :)
21:32:56 <dansmith> anything else before we close?
21:33:20 <mriedem> nope
21:34:00 <dansmith> #endmeeting
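
For readers following the down-cell thread above, here is a minimal standalone sketch of the three list-servers behaviors that were compared: today's pre-microversion behavior (skip the down cell and return a partial list), the explode_if_some_are_down-style config knob dansmith floated (fail loudly instead), and tssurya's proposed microversion (show the unreachable records as UNKNOWN). This is illustrative Python only; list_instances, CellDownError, and the dict shapes are hypothetical and are not nova code, config options, or API behavior.

    class CellDownError(Exception):
        """Stands in for the API returning a 500 to the user."""


    def list_instances(cell_results, explode_if_some_are_down=False,
                       unknown_microversion_requested=False):
        """Merge per-cell listing results.

        cell_results maps a cell name to a list of instance dicts, or to
        None when that cell did not respond (the "down cell" case).
        """
        down_cells = [cell for cell, result in cell_results.items()
                      if result is None]
        instances = []
        for result in cell_results.values():
            if result is not None:
                instances.extend(result)

        if down_cells:
            if unknown_microversion_requested:
                # The proposed microversion: surface the unreachable records
                # instead of hiding them. The real series builds these from
                # the API-level instance mappings; a placeholder per down
                # cell keeps this sketch self-contained.
                instances.extend({'status': 'UNKNOWN', 'cell': cell}
                                 for cell in down_cells)
            elif explode_if_some_are_down:
                # The "explode" option: fail loudly (a 500) rather than
                # return a partial or empty list.
                raise CellDownError('cells unreachable: %s' % down_cells)
            # Otherwise fall through and return only what was reachable,
            # i.e. the partial list ("3 VMs suddenly look like 2") that
            # worried the operator in the discussion above.
        return instances

The "hide everything" option the operator asked for is deliberately not modeled here, since the consensus in the meeting was that a 500 is less alarming than an empty list.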