21:00:24 <dansmith> #startmeeting nova_cells
21:00:25 <openstack> Meeting started Wed Aug 22 21:00:24 2018 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:28 <openstack> The meeting name has been set to 'nova_cells'
21:00:30 <dansmith> #link https://wiki.openstack.org/wiki/Meetings/NovaCellsv2
21:00:32 <melwitt> o/
21:00:58 <tssurya> o/
21:01:08 <dansmith> #topic bugs
21:01:10 <mriedem> o.
21:01:28 <dansmith> there's a big list of bugs in the agenda, and because I'm a terrible person I haven't looked at them much
21:01:39 <dansmith> I plan to do that tomorrow morning first thing
21:02:10 <dansmith> anyone want to highlight any of them specifically? they're fairly well documented so that it's easy to see which are the most important.
21:02:35 <mriedem> anything i have in there is stale,
21:02:54 <mriedem> like the stuff mnaser reported on instance mappings without a build request and no cell set on the mapping
21:03:07 <mriedem> we had talked about a transactional write for those
21:03:38 <mriedem> nothing critical atm
21:04:01 <dansmith> alright.
21:04:03 <dansmith> anyone else?
21:04:15 <dansmith> there's a bug for this cells batching perf thing but I will comment on that in open reviews
21:04:50 <dansmith> skipping testing
21:04:55 <dansmith> #topic open reviews
21:05:05 <dansmith> mriedem: has one open which I just clicked on
21:05:16 <dansmith> which appears merged, so. okay
21:05:26 <mriedem> ?
21:05:34 <mriedem> oh
21:05:45 <mriedem> yeah that
21:05:57 <mriedem> i have an item for the ptg for that
21:06:03 <melwitt> I still have the multiple cells affinity thing https://review.openstack.org/585073
21:06:32 <mriedem> i at least got the test :)
21:06:47 <mriedem> which reminds me,
21:06:48 <melwitt> yes, thank you
21:07:01 <mriedem> that i think we do some redundant instance group query shit now in setup_instance_group or whatever it's called
21:07:09 <mriedem> unrelated to that patch, but perftastic
21:07:36 <melwitt> the second query in there? that keeps confusing me
21:07:37 <mriedem> ever since we stopped storing hosts/members in the group record
21:08:12 <mriedem> no https://review.openstack.org/#/c/540258/13/nova/scheduler/utils.py@792
21:08:13 <mriedem> i think
21:08:40 <mriedem> well, i mean if we already have the group
21:08:46 <melwitt> yeah, that's what I was thinking of
21:08:58 <melwitt> and the subsequent get the hosts again
21:09:03 <mriedem> yeah i assume that was in there b/c we needed the freshest latest members and hosts in the group
21:09:17 <mriedem> but dansmith fixed that by not storing the group hosts/members in the db
21:09:28 <mriedem> this ^ code predated dan's fix
21:09:31 <mriedem> and is likely redundant now
21:09:46 <mriedem> just one of the many todo threads to pull
21:09:54 <dansmith> so, open a bug?
21:10:09 <mriedem> "this might be total bullshit, but i think x is not required now b/c of y"
21:10:09 <mriedem> sure
21:10:42 <dansmith> melwitt: I will try to hit those tomorrow morning as well, remind me if not
21:10:51 <melwitt> ok, thanks
21:11:02 <dansmith> so tssurya has had down-cells patches up for a while and because I'm a terrible person I just circled back to them this morning
21:11:15 <dansmith> or was that yesterday?
I forget
21:11:20 <mriedem> yesterday
21:11:22 <tssurya> yesterday
21:11:34 <dansmith> regardless, I think we worked out a better way to do part of the accounting bit which is cool,
21:11:36 <mriedem> i've been slowly burning through gmann's dep stuff below it
21:11:47 <tssurya> yea thanks and thanks
21:11:48 <dansmith> and I have that on the end of my batching set for her to use,
21:11:52 <dansmith> since she's blocked on that other thing anyway
21:12:17 <tssurya> meanwhile I will write tests for them
21:12:19 <dansmith> I found an issue in the batching code while writing that for her, so I'm glad I did that at the end anyway
21:13:05 <dansmith> which leads me to my batching stuff to address perf issues the huawei guys have noted
21:13:20 <dansmith> they've been reviewing and testing that stuff for me, which is cool
21:13:24 <mriedem> have kevin and yikun been talking to you about testing the latest?
21:13:29 <mriedem> with like error instances and such?
21:13:39 <dansmith> I assume we're all aware of that process but if anyone isn't clear on what that is or why/if it's important speak up
21:13:56 <dansmith> mriedem: not since last night or whatever
21:13:58 <dansmith> I decided not to do the fault thing (again)
21:14:39 <mriedem> ok but they tested the distributed strategy right?
21:15:03 <mriedem> ah nevermind
21:15:04 <dansmith> I'm not sure about that, but it will equate to the same thing for them unless they perturb their distribution a little
21:15:26 <mriedem> chenfujun will be pleased!
21:15:45 <dansmith> anything else on batching or other open reviews?
21:16:13 <dansmith> #topic open reviews
21:16:19 <dansmith> mriedem has things
21:16:30 <mriedem> open discussion?
21:16:34 <dansmith> sorry
21:16:35 <dansmith> #undo
21:16:35 <openstack> Removing item from minutes: #topic open reviews
21:16:39 <dansmith> #topic open discussion
21:16:50 <mriedem> i'll do the 2nd one first, since it's a segue from handling a down cell
21:17:06 <mriedem> when we were talking to the public cloud guy in charge of moving from their cascading / cells v1 type thing last week,
21:17:11 <mriedem> and talked about handling a down cell,
21:17:40 <mriedem> he wanted an option to not return any results when listing if the tenant has any instances in a down cell; this would be before tssurya's new microversion which returns the instances in down cells with the UNKNOWN status
21:18:18 <mriedem> he was concerned that users would be confused if they knew they had like 3 VMs and then all of a sudden they see 2
21:18:26 <dansmith> this seems a bit close to config-defined-api-behavior to me
21:18:32 <mriedem> so he'd rather not show them any, because if they delete the 2 and the 1 comes back, it would be an issue
21:18:38 <dansmith> seems like going back to 500 would be better than an empty list
21:18:48 <mriedem> it is, yes, which i pointed out,
21:19:01 <mriedem> but it's also not really an interop kind of situation
21:19:03 <mriedem> 500 is an option
21:19:21 <mriedem> but we already fixed that 500 bug...
21:19:27 <mriedem> so clearly CERN wants it one way and huawei wants it another
21:19:45 <dansmith> I meant turn on the 500 with a config flag
21:19:56 <mriedem> oh
21:20:01 <dansmith> like explode_if_some_are_down=True
21:20:05 <mriedem> yeah that's an option
21:20:13 <mriedem> i can go back and ask which is preferable
21:20:15 <mriedem> hide or explode
21:20:18 <mriedem> in the end,
21:20:23 <mriedem> it's basically the same from the user viewpoint
21:20:45 <tssurya> I had this https://review.openstack.org/592428/ which was for the empty list config
21:20:50 <mriedem> although of course huawei public cloud has a strict no errors policy... :)
21:20:53 <melwitt> so this would be a change, pre-microversion?
21:20:53 <dansmith> not IMHO.. 500 means "the server is broken, I'll try later", empty means "firedrill, my instances and data are gone"
21:20:55 <dansmith> to me
21:21:23 <mriedem> melwitt: yes pre-microversion
21:21:30 <melwitt> ok
21:21:33 <mriedem> if you request the microversion you get the down cell instances with UNKNOWN status
21:21:35 <dansmith> can't imagine how returning an empty list doesn't just generate a billion urgent calls to support
21:21:46 <mriedem> yeah i know
21:21:52 <mriedem> i'll ask
21:22:07 <dansmith> 500 might mean "hmm, I'll go check the status page" and see that some issues are occurring
21:22:08 <melwitt> yeah, the empty list might give people a heart attack
21:22:14 <dansmith> if my instances are gone suddenly I freak the eff out
21:22:25 <tssurya> I vote for 500 too, makes more sense
21:22:46 <tssurya> especially since the problem is users being confused to see 2 instead of 3.. now they would see 0
21:22:57 <mriedem> yup
21:22:58 <mriedem> noted
21:23:04 <mriedem> moving on to the other item
21:23:07 <mriedem> cross-cell cold migration
21:23:15 <mriedem> i've vomited some random thoughts into https://etherpad.openstack.org/p/nova-ptg-stein-cells
21:23:29 <mriedem> our requirements are basically, network isolation between cells,
21:23:42 <mriedem> so no ssh between hosts in different cells so can't cold migrate the traditional way
21:23:56 <mriedem> image service is global, which made me think of an orchestrated shelve/unshelve type thing
21:24:15 <mriedem> i feel like that might be a bit cludgy, so open to other ideas
21:24:32 <mriedem> i'm not sure how else to get the guest disk image out of the one cell and into the other w/o snapshots
21:24:36 <dansmith> I think snapshot for migrate will not be what some people want, but I think it will be ideal for a lot of others
21:24:48 <dansmith> I think that's a good first step for us to do though
21:24:52 <mriedem> of course, ports and volumes...
21:25:28 <mriedem> with ports and volumes we just attach on the new host, migrate, detach from old host on confirm
21:25:40 <mriedem> so how ports and volumes get migrated is going to be another thing
21:26:11 <mriedem> what else do we have attached to an instance that's external? barbican secrets? castellan certs?
21:27:16 <dansmith> yep, we'll have to do some thinking probably
21:27:26 <mriedem> i dislike thinking
21:27:48 <mriedem> however,
21:28:03 <mriedem> mlavalle and smcginnis work for the same overlords so i can talk to them
21:28:12 <mriedem> routed networks comes to mind for ports, but cinder has no concept like this
21:28:16 <mriedem> as far as i know
21:28:20 <melwitt> synergy
21:29:30 <dansmith> are we done with that for the moment?
21:29:52 <mriedem> yes
21:29:59 <mriedem> dump ideas/comments in the etherpad
21:30:02 <dansmith> #todo everyone be thinking about cross-cell-migration gotchas
21:30:08 <mriedem> also,
21:30:12 <mriedem> huawei cloud is all volume-backed...
21:30:17 <mriedem> so i have to care about that bigly
21:30:28 <dansmith> well, maybe that helps
21:30:49 <mriedem> maybe, if the volume storage is shared between cells
21:31:04 <dansmith> or if it's migratable on the cinder side and we can just wait
21:31:35 <mriedem> my guess is that would trigger a swap callback
21:31:43 <mriedem> that's what happens if you do volume live migration
21:32:12 <mriedem> kevin seemed pretty confident this wasn't a hard problem, so maybe i'll just let him own it :)
21:32:56 <dansmith> anything else before we close?
21:33:20 <mriedem> nope
21:34:00 <dansmith> #endmeeting
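
For readers following the down-cell thread above, here is a minimal standalone sketch of the three list-servers behaviors that were compared: today's pre-microversion behavior (skip the down cell and return a partial list), the explode_if_some_are_down-style config knob dansmith floated (fail loudly instead), and tssurya's proposed microversion (show the unreachable records as UNKNOWN). This is illustrative Python only; list_instances, CellDownError, and the dict shapes are hypothetical and are not nova code, config options, or API behavior.

    class CellDownError(Exception):
        """Stands in for the API returning a 500 to the user."""


    def list_instances(cell_results, explode_if_some_are_down=False,
                       unknown_microversion_requested=False):
        """Merge per-cell listing results.

        cell_results maps a cell name to a list of instance dicts, or to
        None when that cell did not respond (the "down cell" case).
        """
        down_cells = [cell for cell, result in cell_results.items()
                      if result is None]
        instances = []
        for result in cell_results.values():
            if result is not None:
                instances.extend(result)

        if down_cells:
            if unknown_microversion_requested:
                # The proposed microversion: surface the unreachable records
                # instead of hiding them. The real series builds these from
                # the API-level instance mappings; a placeholder per down
                # cell keeps this sketch self-contained.
                instances.extend({'status': 'UNKNOWN', 'cell': cell}
                                 for cell in down_cells)
            elif explode_if_some_are_down:
                # The "explode" option: fail loudly (a 500) rather than
                # return a partial or empty list.
                raise CellDownError('cells unreachable: %s' % down_cells)
            # Otherwise fall through and return only what was reachable,
            # i.e. the partial list ("3 VMs suddenly look like 2") that
            # worried the operator in the discussion above.
        return instances

The "hide everything" option the operator asked for is deliberately not modeled here, since the consensus in the meeting was that a 500 is less alarming than an empty list.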