17:00:00 <dansmith> #startmeeting nova_cells
17:00:00 <openstack> Meeting started Wed Sep 27 17:00:00 2017 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:04 <openstack> The meeting name has been set to 'nova_cells'
17:00:08 <mriedem> o/
17:00:37 <melwitt> o/
17:01:03 <dansmith> doesn't look like surya is around
17:01:17 <dansmith> assuming I remember that nick properly
17:01:22 <mriedem> tssurya: here?
17:01:30 <dansmith> oh forgot there was a t in front
17:01:52 <dansmith> anyway
17:01:55 <dansmith> #topic bugs
17:02:09 <mriedem> https://bugs.launchpad.net/nova/+bugs?field.tag=cells&orderby=-datecreated&start=0
17:02:15 <mriedem> i was poking through those the other day
17:02:41 <mriedem> https://review.openstack.org/#/q/status:open+project:openstack/nova+topic:bug/1715533
17:02:49 <mriedem> i've got those backports lined up for the map_instances things
17:02:51 <mriedem> *thing
17:02:55 <dansmith> cool
17:03:01 <dansmith> I shall leave that open
17:03:25 <dansmith> anything else we should discuss here?
17:03:39 <mriedem> umm,
17:03:46 <mriedem> no. we should just scrub that list at some point
17:03:51 <dansmith> yeah
17:03:51 <mriedem> probably cells v1 things in there we won't fix
17:03:58 <dansmith> aye
17:04:05 <dansmith> #topic open reviews
17:04:20 <dansmith> before I get to whining about my instance list set, anything we should note here?
17:04:28 <dansmith> ed has the selection patches and spec up for review,
17:04:38 <dansmith> which I've been trying to circle back to every once in a while
17:04:53 <mriedem> are you lumping the alternate hosts stuff in with that as well?
17:04:56 <mriedem> i know he has a spec for both
17:05:00 <dansmith> #link https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/placement-allocation-requests
17:05:19 <dansmith> it's all part of the same effort yeah
17:06:01 <mriedem> ok, i haven't reviewed any of it yet
17:06:05 <mriedem> are there any major issues?
17:06:24 <dansmith> well, the limits thing might be, but I haven't gone back around to that since yesterday
17:08:13 <dansmith> anything else before we get to instance list stuff?
17:08:25 <mriedem> nope
17:08:40 <dansmith> okay, so, for the logs:
17:08:52 <dansmith> mriedem did some perf testing with and without the base instance list cutover patch
17:09:04 <dansmith> and seemed to measure some slowness with the patch applied
17:09:23 <dansmith> I tried to reproduce and I don't see the differing behavior
17:09:32 <dansmith> but in the process,
17:09:38 <dansmith> we probably uncovered some placement issues,
17:09:54 <dansmith> and definitely are seeing a large performance difference (with and without the patch applied) between 2.1 and 2.53
17:10:11 <dansmith> like going from average of 6s per request to 9s for listing 1000 instances
17:10:38 <mriedem> yup, plus you can't create 500 instances at once
17:10:40 <dansmith> details are here:
17:10:41 <dansmith> #link https://etherpad.openstack.org/p/nova-instance-list
17:10:46 <dansmith> mriedem: that's the placement thing I mentioned
17:10:53 <mriedem> oh right
17:11:04 <dansmith> so right now, I'm running every microversion from 1 to 53 for 10x and recording the results
17:11:11 <dansmith> I'm up to 2.22 on master
17:11:25 <dansmith> and so far I'm still at ~6s
17:11:27 <dansmith> looking for a jump
17:11:53 <dansmith> https://imgur.com/a/SHGmi
17:12:07 <melwitt> neat
17:12:35 <dansmith> mriedem: when you got your 30s run on my patch, did you run that a bunch of times or just once?
17:12:39 <dansmith> I know it was 10x per,
17:13:07 <dansmith> but I see some pretty big variances that correlate to other things on the system happening, so I'm just wondering if you being in a public cloud instance skewed one run
17:13:20 <mriedem> for GET /servers/detail with 2.53?
17:13:20 <dansmith> mine is running with plenty of ram, cpu and disk IO
17:13:23 <dansmith> yeah
17:13:34 <mriedem> it was just 10x in a loop
17:13:46 <dansmith> right but you didn't run the 10x loop a couple more times?
17:13:47 <mriedem> 46.3s was the highest
17:13:49 <mriedem> no
17:13:51 <dansmith> okay
17:14:02 <mriedem> i'm running loops of 10 with 1000 active instances with your big patch now
17:14:09 <dansmith> okay
17:14:47 <mriedem> otherwise yeah doing perf testing on a public cloud vm isn't ideal,
17:14:50 <mriedem> but it's basically all i've got
17:15:08 <dansmith> yeah, it's fine, it's just good to keep it in mind if we have an outlier we can't explain
17:15:16 <dansmith> anyway,
17:15:42 <dansmith> since I've got my thing up I'm going to keep poking at master, my patch, and the one that does the fault joining
17:15:45 <dansmith> at this point,
17:16:05 <dansmith> I think maybe we might want to drop the fault join because it doesn't really seem to help (and may hurt in its current form)
17:16:31 <dansmith> ah fsck, my token expired at 2.27 :P
17:17:31 <dansmith> so, anything else on this?
17:17:55 <mriedem> ha
17:18:02 <mriedem> yeah i have to remember to refresh that every once in a while
17:18:19 <mriedem> i don't have anything else on this
17:18:22 <dansmith> I have the timed loop isolated so I should just fetch a token before I run it each time
17:18:40 <dansmith> okay
17:18:41 <dansmith> #topic open discussion
17:18:44 <dansmith> anything else on any subject?
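[For reference, a rough sketch of the kind of timed listing loop discussed above: fetch a fresh Keystone token before each run so it cannot expire partway through the microversion sweep, then time GET /servers/detail ten times per microversion and average the results. The endpoint URLs, credentials, and structure here are placeholders, not the actual script behind the numbers in the etherpad.]

    # Sketch of a per-microversion timing loop for GET /servers/detail.
    # KEYSTONE/COMPUTE URLs and the admin credentials are placeholders
    # for a local devstack-style environment.
    import time
    import requests

    KEYSTONE = "http://127.0.0.1/identity/v3"   # placeholder auth URL
    COMPUTE = "http://127.0.0.1/compute/v2.1"   # placeholder compute endpoint

    def get_token():
        # Keystone v3 password auth; the token comes back in X-Subject-Token.
        body = {"auth": {
            "identity": {"methods": ["password"],
                         "password": {"user": {"name": "admin",
                                               "domain": {"id": "default"},
                                               "password": "secret"}}},
            "scope": {"project": {"name": "admin",
                                  "domain": {"id": "default"}}}}}
        resp = requests.post(KEYSTONE + "/auth/tokens", json=body)
        resp.raise_for_status()
        return resp.headers["X-Subject-Token"]

    def time_list(microversion, runs=10):
        token = get_token()  # refresh the token before each timed run
        headers = {"X-Auth-Token": token,
                   "X-OpenStack-Nova-API-Version": microversion}
        times = []
        for _ in range(runs):
            start = time.time()
            resp = requests.get(COMPUTE + "/servers/detail", headers=headers)
            resp.raise_for_status()
            times.append(time.time() - start)
        return sum(times) / len(times)

    # Sweep 2.1 through 2.53 and look for a jump in the average list time.
    for minor in range(1, 54):
        mv = "2.%d" % minor
        print(mv, time_list(mv))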
17:19:03 <mriedem> i've got a todo to circle back on forum topics
17:19:16 <mriedem> which i'll bring up in the team meeting tomorrow
17:19:22 <mriedem> and probably in the ML again before friday
17:19:29 <melwitt> just a FYI, I've been working on some cell databases testing stuff, to try to make our env more like reality to help catch more bugs in testing
17:19:46 <dansmith> melwitt: functional or devstack?
17:20:10 <mriedem> functional, it's the cells db fixture,
17:20:15 <mriedem> it defaults a context
17:20:19 <melwitt> dansmith: functional. namely trying to rejigger the way we use the fixture to see more reality
17:20:25 <mriedem> which can mask bugs when we should be targeting in code but aren't
17:20:29 <dansmith> ah I see
17:20:37 <mriedem> we've had a few bugs slip through testing b/c of that
17:20:40 <dansmith> yeah
17:20:58 <melwitt> not sure what yall are gonna think of it but, I've almost got it working (had to fix some tests). so I'll let you know when that's up
17:21:01 <mriedem> and then my huawei masters beat me
17:21:08 <melwitt> lol
17:21:47 <dansmith> okay anything else?
17:21:51 <mriedem> fortunately they don't care as much about instance list performance
17:21:58 <mriedem> only server create and delete performance
17:22:08 <mriedem> oh wait, nvm, they care about instance list too...
17:22:10 <mriedem> f me
17:22:11 <melwitt> well, you found that delete takes forever though
17:22:16 <mriedem> so did they
17:22:18 <melwitt> which is really weird
17:22:19 <melwitt> oh
17:22:37 <mriedem> but we can talk about that in -nova since dan is getting knifey
17:22:42 <dansmith> instance delete is pretty similar to instance boot
17:22:49 <dansmith> in terms of things involved
17:22:52 <mriedem> except,
17:22:55 <mriedem> it's serialized
17:22:57 <mriedem> unlike boot
17:23:04 <dansmith> eh?
17:23:04 <mriedem> where networking and bdms are concurrent
17:23:15 <dansmith> serialized where?
17:23:34 <mriedem> whichever one of the 10 methods in the compute manager cleans up network and volume attachments
17:23:42 <mriedem> _shutdown_instance?
17:23:51 <dansmith> oh you mean serialized in the delete process, not across all instances being deleted
17:24:00 <mriedem> right
17:24:02 <dansmith> gotcha
17:24:12 <mriedem> i think the public cloud team made that concurrent
17:24:18 <mriedem> they've done lots of things...
17:24:31 <mriedem> dims and i are trying to figure it out
17:24:31 <dansmith> anyway, yeah, we're off topic
17:24:35 <dansmith> so I think we're good to end, yes?
17:24:40 <mriedem> yes
17:24:42 <melwitt> yes
17:24:44 <dansmith> #endmeeting
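[For context on the serialized-versus-concurrent cleanup discussed at the end: the delete path tears down networking and block device mappings one after the other, while boot handles them concurrently. A minimal sketch of what concurrent cleanup could look like, using the standard library and hypothetical helper names rather than nova's real _shutdown_instance internals:]

    # Illustrative sketch only: running network teardown and volume detach
    # concurrently instead of one after the other. _teardown_network and
    # _detach_volumes are hypothetical stand-ins, not real nova methods.
    from concurrent import futures

    def _teardown_network(instance):
        ...  # deallocate ports/networking for the instance

    def _detach_volumes(instance, bdms):
        ...  # detach and clean up each block device mapping

    def shutdown_instance_concurrently(instance, bdms):
        with futures.ThreadPoolExecutor(max_workers=2) as pool:
            net = pool.submit(_teardown_network, instance)
            vols = pool.submit(_detach_volumes, instance, bdms)
            # Surface any exception from either cleanup path.
            net.result()
            vols.result()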