21:00:05 <dansmith> #startmeeting nova_cells
21:00:06 <openstack> Meeting started Wed Mar  8 21:00:05 2017 UTC and is due to finish in 60 minutes.  The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:10 <openstack> The meeting name has been set to 'nova_cells'
21:00:18 <mriedem> o/
21:00:24 <melwitt> o/.
21:00:26 <macsz> \o
21:00:33 <pumaranikar> o/
21:00:42 <dansmith> melwitt: armpit?
21:01:09 <melwitt> what armpit
21:01:20 <dansmith> your hand up had a bogey
21:01:20 <macsz> the dot :)
21:01:33 <melwitt> oh, haha. I didn't even notice
21:01:38 <dansmith> #topic cells testing/bugs
21:01:47 <dansmith> so before we get into mriedem shitting all over it,
21:01:52 <dansmith> in regards to testing, I'd like to point out this:
21:01:59 <dansmith> http://logs.openstack.org/94/436094/14/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/f7d6160/logs/testr_results.html.gz
21:02:15 <dansmith> all but two tempest tests running with multiple cells, and I have patches up for those as well
21:02:35 <dansmith> unfortunately, no chance of having a clean run at this point due to when I pushed those up
21:02:54 <dansmith> but anyway, the effort of actually getting a clean test run on multicell devstack is progressing
21:02:54 <mriedem> \o/
21:03:00 <dansmith> the devstack patch itself still needs a lot of work,
21:03:07 <dansmith> but I won't ever get to it if mriedem keeps up his antics
21:03:18 <dansmith> anyway, anything else testing-related?
21:03:23 <mriedem> s/antics/excellent reviews/
21:03:51 <mriedem> yeah
21:03:52 <mriedem> on https://review.openstack.org/#/c/442861/
21:03:59 <mriedem> is the nova-status thing just separate from this series?
21:04:06 <mriedem> i think i thought that was fine but checking
21:04:11 <dansmith> mriedem: it is
21:04:22 <mriedem> ah yes https://review.openstack.org/#/c/442787/
21:04:38 <dansmith> mriedem: we pulled out a newer service version check at the end of ocata, you'll recall, and I continue to challenge the root concern anyway,
21:04:54 <dansmith> but not opposed to a status check of course
21:05:21 <mriedem> you mean the one in the scheduler filter for placement?
21:05:37 <dansmith> no, that was in pike, but that's another good example :)
21:05:55 <dansmith> we had one in compute/api about earlier computes before a cells patch from avolkov
21:06:00 <mriedem> so really our minimum version service check in nova-status should be whatever was required for that placement thing
21:06:14 <mriedem> which i think i had a bug for anyway
21:06:37 <mriedem> oh no different check https://bugs.launchpad.net/nova/+bug/1669433
21:06:37 <openstack> Launchpad bug 1669433 in OpenStack Compute (nova) "nova-status needs to check that placement 1.4 is available for pike" [High,In progress] - Assigned to Roman Podoliaka (rpodolyaka)
21:07:01 <mriedem> anyway, it's a good point that the minimum compute version is going to need to be 16
21:07:03 <mriedem> which is your patch
21:07:54 <dansmith> anything else on testing? the next topic on open reviews has a lot of material
21:08:04 <mriedem> no
21:08:37 <dansmith> #topic open reviews
21:08:55 <dansmith> so one of our oldest is dtp's console upcall patch, which I hit again today
21:09:09 <dansmith> I had a minor complaint about it doing some cleanup and a functional change in the same patch and asked him to split it
21:09:15 <dansmith> #link https://review.openstack.org/#/c/415922/
21:09:23 <dansmith> anyone else able to take a look at that soon?
21:09:42 <mriedem> i'd prefer melwitt to look at that given she was looking more into the spec
21:09:48 <melwitt> I'm planning to look at it
21:09:48 <mriedem> did that get re-proposed and approved btw?
21:10:00 <mriedem> in other words,
21:10:04 <melwitt> mriedem: not yet, going to do that maybe today. this week
21:10:05 <mriedem> shouldn't this change go under that blueprint?
21:10:22 <dansmith> mriedem: which blueprint? the console tokens in db one?
21:10:24 <melwitt> well, I guess the thing is this is an interim thing
21:10:39 <dansmith> right this is not really related to that larger effort
21:10:41 <melwitt> it was supposed to go in ocata as a stop-gap
21:10:42 <mriedem> oh
21:10:55 <mriedem> carry on then
21:11:12 <dansmith> cool
21:11:27 <dansmith> melwitt: can we maybe try to have that merged by this time next week?
21:11:54 <melwitt> dansmith: the spec? yeah. I will also get the placement spec up this week too
21:12:00 <dansmith> no, dtp's patch
21:12:06 <melwitt> oh, yeah. sorry. yeah
21:12:15 <dansmith> cool, the specs are important too of course,
21:12:25 <dansmith> but just want to avoid this withering on the vine too much
21:12:33 <melwitt> roger that
21:12:55 <dansmith> okay so the next set is the quotas stuff,
21:13:10 <dansmith> which got some activity this morning and I think melwitt is probably working on as we speak
21:13:16 <dansmith> I've been through parts of that patch but not the rest
21:13:25 <dansmith> the bottom two are approved and just holding until the third is ready to go
21:13:51 <dansmith> mriedem: in between shitting on my patches that might be a good one for you to look at too
21:13:57 <dansmith> you know, to spread the pain^Wlove around
21:14:22 <mriedem> john has been reviewing that right?
21:14:25 <dansmith> yeah
21:14:28 <mriedem> at this point i'm happy to let john handle it
21:14:39 <dansmith> well, it has some implications to behavior
21:14:50 <mriedem> i realize it's something i should know about...
21:14:54 <melwitt> yup. the top patch is not a picnic for review, a lot of it is deleting of code. so be on the lookout for gaps as something to watch out for
21:14:56 <dansmith> about how things behave when you're close to quota
21:15:09 <mriedem> do we have functional tests for the edge cases?
21:15:33 <dansmith> well, the point is the edge cases are leaky by design
21:16:32 <mriedem> sure. selfishly speaking, there are only so many super complicated series of things i can push into context in my brain at any given time, and with cells v2 and jay's inventory stuff and some other things, i just won't say i can get to it right now and give it a thorough review.
21:16:55 <dansmith> okay
21:17:07 <mriedem> i'm channeling my inner sdague here
21:17:51 <dansmith> melwitt: maybe we try to make sure the commit message/reno summarize the changes well enough that if he just reads that he won't be surprised on stage in the future
21:18:26 <melwitt> fwiw, the edge case discussions are contained at the moment as the only comments on the review. that makes it easier-ish to weigh in on those points
21:18:59 <dansmith> yeah
21:19:00 <melwitt> dansmith: yeah, that's a good idea in general, for anyone to be able to get the main points
21:19:04 <dansmith> yeah
21:19:19 <dansmith> alright anyway,
21:19:28 <mriedem> didn't we need the user/project in placement for counting quotas first?
21:19:35 <dansmith> no
21:19:39 <mriedem> or was that optional for now since we don't expect cells to be 'down' right now
21:19:41 <dansmith> it helps us do it better
21:19:43 <dansmith> yeah
21:20:11 <melwitt> yeah, we're going to go forward with this for now as a first step that has caveats, and expect the placement stuff to complete this cycle and close that gap
21:20:28 <melwitt> since multi cell isn't really a thing at the present moment, anyway
21:20:43 <dansmith> hey!
21:20:59 <dansmith> it is in my fairy tale life
21:20:59 <melwitt> sorry, I meant in the non CD case
21:21:00 <mriedem> look who is shitting on your stuff now
21:21:07 * dansmith steams
21:21:13 <mriedem> like a steaming pile of...
21:21:13 <dansmith> moving on?
21:21:15 <mriedem> yes
21:21:23 <melwitt> guh, no sorry not what I meant
21:21:25 <dansmith> the next series is my steaming pile of shit
21:21:44 <mriedem> don't worry, i also have searchlight to talk about at some point here
21:21:49 <dansmith> which I just realized won't work in the order I just pushed up, so I will have to transplant some code first
21:22:29 <mriedem> dansmith: as in the patch we just talked about first, but without the GET by id stuff?
21:22:31 <dansmith> however, on top of all of them, we pass a tempest run, although just a few minutes ago mriedem identified some issues that stem from historical leaks of things like internal DB ids
21:22:57 <dansmith> mriedem: no, I moved that up, but it had a refactor (load_cells) that the other ones need, so I need to transplant that
21:23:03 <mriedem> ok
21:23:18 <dansmith> mriedem: so one thing to note is that until you have multiple actual cells,
21:23:26 <dansmith> what I have is not any different than what we have today I think
21:23:41 <dansmith> but, you said you had an idea about moving forward with those?
21:23:44 <mriedem> sure for single cell this is fine
21:23:46 <mriedem> yeah
21:23:59 <mriedem> so, i think we can agree that we should stop leaking ids out of the cell databases in the REST API
21:24:00 <mriedem> correct?
21:24:15 <dansmith> are you saying you're okay merging this early with that caveat? because reordering back is much easier
21:24:19 <dansmith> uh, yes, agreed
21:24:23 <dansmith> obviously
21:24:25 <mriedem> dansmith: not yet
21:24:49 <mriedem> ok, so i think we can agree that we should probably do a microversion in os-hypervisors that returns the compute node uuid rather than the id, and takes a uuid rather than an id for GET calls
21:24:53 <mriedem> is that ok?
21:25:29 <dansmith> what are you asking is okay? that we stop being stupid? yes, that's okay :)
21:25:34 <mriedem> ok
21:25:40 <mriedem> just setting the foundation of shit we can agree on
21:25:56 <mriedem> next thing is, in this code that's the problem, and not cells aware,
21:26:15 <mriedem> if we have multiple cells and can't find a unique compute/service by id (not uuid), we fail with a 400
21:26:27 <mriedem> and force you to use the microversion to pass the uuid to find the thing you need
21:26:37 <dansmith> meaning, check all of them and if we find any dupes, then refuse to do that thing?
21:26:50 <mriedem> right, just like when we boot a server w/o a specific network
21:26:55 <mriedem> if there are duplicate networks, we fail
21:26:58 <dansmith> sure, that's a good idea
21:27:04 <dansmith> but only after we have the microversion api I guess
21:27:09 <mriedem> yeah, so,
21:27:18 <mriedem> you can still pass id before the microversion in the single cell case
21:27:19 <mriedem> that's fine
21:27:33 <mriedem> but in the multi-cell case, if you pass id and we find multiple, it's a 400,
21:27:40 <mriedem> and you have to pass the uuid using the microversion
21:27:46 <dansmith> aye
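
The lookup rule agreed above can be sketched roughly as follows. This is a minimal illustration, not nova code: `find_compute_node`, `AmbiguousId`, `NotFound` and the in-memory `cells` mapping are all hypothetical names, and the real implementation would query each cell database. The assumption is that integer ids are only unique within a cell, while uuids are unique across the deployment.

```python
class NotFound(Exception):
    """Hypothetical error: no record matched (an API layer might return 404)."""


class AmbiguousId(Exception):
    """Hypothetical error: the id matched in several cells (maps to HTTP 400)."""


def find_compute_node(cells, ident):
    """Find one compute node across all cells.

    cells: dict mapping a cell name to a list of records, each a dict
           with an integer 'id' (per-cell) and a string 'uuid' (global).
    ident: an int (legacy id) or a str (uuid, as with the new microversion).
    """
    # Assumption: the caller passes an int for the legacy id path and a
    # string for the uuid path.
    key = "id" if isinstance(ident, int) else "uuid"
    matches = [
        (cell_name, node)
        for cell_name, nodes in cells.items()
        for node in nodes
        if node[key] == ident
    ]
    if not matches:
        raise NotFound(ident)
    if len(matches) > 1:
        # Same idea as booting a server without a specific network when
        # duplicate networks exist: refuse rather than guess.
        raise AmbiguousId(
            "id %r matches %d cells; use the uuid instead" % (ident, len(matches))
        )
    return matches[0]
```

With a single cell the legacy integer id keeps working; with several cells it only fails when the id really is duplicated, which is the point at which the caller has to switch to the uuid.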
21:27:55 <mriedem> ok, if we're all happy with that, i can start the spec
21:28:13 <dansmith> so, there's probably a few things, right? os-hypervisors, os-services at least
21:28:17 <dansmith> shouldn't we do them all together?
21:28:21 <mriedem> i'm slightly less clear on the os-pci api here, but would have to investigate that more
21:28:29 <mriedem> yes probably
21:28:39 <mriedem> yeah for sure os-hypervisors and os-services
21:28:41 <dansmith> okay, well, anyway, I'm definitely on board with that
21:28:44 <mriedem> cool
21:28:50 <mriedem> i think we have the same issue in os-pci,
21:28:58 <mriedem> but i have 0 idea if anyone ever uses that api
21:29:02 <mriedem> it's not even documented
21:29:07 <dansmith> I will reswizzle these so this patch can be later in the stack and keep pushing what we can, and wait for that for this patch
21:29:59 <mriedem> ha, also, side note,
21:30:10 <mriedem> PCI_ADMIN_KEYS is used in os-pci but doesn't check if you're an admin,
21:30:15 <mriedem> or perform any kind of check
21:30:32 <dansmith> wt...f
21:30:33 <mriedem> anyway
21:30:41 <mriedem> well, the default policy on listing pci devices is admin only
21:30:44 <mriedem> but still
21:30:55 <dansmith> ah
21:30:56 <dansmith> yeah
21:31:09 <dansmith> okay, mriedem you wanted to call out the searchlight review I assume?
21:31:22 <mriedem> yeah, sec
21:31:31 <mriedem> https://review.openstack.org/#/c/441692/
21:31:33 <dansmith> I have started looking at it a few times, but this guy keeps shitting on my patches
21:31:37 <mriedem> #link searchlight integration spec https://review.openstack.org/#/c/441692/
21:31:39 <dansmith> with "alternative facts"
21:31:52 <mriedem> i haven't gone through the latest round of comments in there,
21:31:58 <mriedem> but it's got quite a bit of detail,
21:32:19 <mriedem> net is it's a bit of a mess dependency-wise
21:32:28 <mriedem> searchlight doesn't support versioned notifications yet
21:32:38 <mriedem> they have a blueprint to do it, but aren't doing it yet
21:32:43 <dansmith> orly
21:32:55 <dansmith> I thought they were super interested in those
21:33:01 <mriedem> we also have an issue with the fact that when you delete a server in nova, they delete the index entry for that server in searchlight,
21:33:21 <mriedem> so if nova is using searchlight and you do nova list --deleted, you get nothing
21:33:44 <mriedem> elasticsearch used to have a concept of a ttl on the entry, but that's removed in v5.0
21:33:57 <melwitt> what are the implications of them not supporting versioned notifications? how do they currently get nova notifications?
21:33:58 <mriedem> they basically pushed the filtering on time to the client it sounds like
21:34:06 <mriedem> melwitt: they get the legacy unversioned notifications
21:34:19 <mriedem> they said they wanted to get versioned notification support in for ocata but didn't have the people to do it
21:35:00 <mriedem> i think it will happen, it's just something to note right now
21:35:05 <dansmith> okay
21:35:11 <mriedem> the delete thing is a bit more worrisome for me,
21:35:16 <dansmith> the deleted thing is probably an issue for them anyway right?
21:35:24 <mriedem> i've suggested a config option in searchlight for a time window before they delete the entry
21:35:28 <dansmith> because people that care about that won't be happy with searchlight as a semi replacement
21:36:04 <mriedem> we don't guarantee that you can get deleted instances forever anyway b/c of archive and purge, but it's something people are going to assume works
21:36:14 <mriedem> and i'm sure admins rely on for debug
21:36:15 <dansmith> yeah
21:36:44 <mriedem> as far as data migrations,
21:37:06 <mriedem> the upside is searchlight already has a searchlight-manage command that you can run to make searchlight hit the nova api and pull in all of the existing instances to populate indexes
21:37:20 <mriedem> so we don't have to worry about nova pushing that data out, or setting up a cron to issue instance.usage.exists
21:37:37 <dansmith> sweet
21:37:46 <mriedem> so you (1) setup searchlight, (2) pull the nova data to populate searchlight, (3) configure nova-api to use it, (4) restart nova-api
21:38:25 <mriedem> the other thing i noted in there that sucks is every new field we add to the rest api we have to add to our versioned notifications
21:38:30 <mriedem> that's not really new, but will be more strictly enforced
21:38:55 <mriedem> plus right now the searchlight guys said we'd also have to make a corresponding mapping change to searchlight to make it handle the new field
21:39:16 <mriedem> gibi pointed out that we have a bp to send the schema with the versioned notification payload, and searchlight could use that schema to add new mappings, but that's a long ways off i think
21:39:30 <mriedem> anyway, none of this is impossible, it's just not as trivial as "we'll just have searchlight do our stuff"
21:39:46 <mriedem> fin
21:39:57 <dansmith> okay that's not too bad,
21:40:02 <dansmith> if we're depending on them like we plan to
21:40:24 <dansmith> not unlike making changes to o.vo or os-vif that we need
21:40:53 <mriedem> yeah it would just suck if we have to make 3 changes before we can return something new out of the rest api
21:40:58 <mriedem> but anyway
21:41:16 <dansmith> well,
21:41:38 <dansmith> we'd have to make the searchlight changes before it would work in that environment, not necessarily for it to work at all
21:41:39 <dansmith> but yeah
21:41:55 <dansmith> not surprising given the level at which we're using them for api in this scenario though
21:42:55 <dansmith> anything else on stuff up for review?
21:43:15 <mriedem> i don't have anything
21:43:29 <dansmith> melwitt: ?
21:43:52 <melwitt> no, think everything got mentioned
21:44:03 <dansmith> cool
21:44:06 <dansmith> #topic open discussion
21:44:20 <melwitt> I wanted to clarify what I said earlier,
21:44:42 <melwitt> I was thinking of multi cell from an operator perspective as in, how long would they experience a gap in say, the "cell down quota issue"
21:45:15 <melwitt> I had been thinking we were going to signal to them that it's okay/recommended to create multiple cells at rc1
21:45:16 <mriedem> can someone explain to me what 'cell down' even means?
21:45:22 <mriedem> rabbit and db are dead for that cell?
21:45:30 <melwitt> like, lose communication with cell, for whatever reason
21:45:51 <melwitt> yeah, that's one example
21:45:59 <dansmith> melwitt: if mriedem stops shitting on patches, then yes I agree with that statement :)
21:46:00 <mriedem> how is that different from if your non-cells single region deployment loses rabbit/db today?
21:46:15 <dansmith> mriedem: your quota appears to expand in that case
21:46:21 <dansmith> mriedem: because you stop counting certain resources you can't see
21:46:32 <mriedem> ok
21:46:42 <mriedem> which is why we need the global allocation
21:46:44 <mriedem> via placement
21:46:46 <mriedem> got it
21:46:55 <dansmith> yeah
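
The "quota appears to expand" effect described above can be illustrated with a small sketch. It assumes usage is counted by summing over whichever cells are reachable; `count_instances` and the dict layout are hypothetical and do not reflect nova's actual quota code.

```python
def count_instances(cells, project_id):
    """Count a project's instances across reachable cells.

    cells: dict mapping a cell name to a dict with
           'up' (bool) and 'instances' (list of dicts with 'project_id').
    Returns (count, skipped), where skipped lists the cells that could
    not be reached and therefore contributed nothing to the count.
    """
    total = 0
    skipped = []
    for name, cell in cells.items():
        if not cell["up"]:
            # Resources in an unreachable cell are invisible, so the
            # project's usage is undercounted.
            skipped.append(name)
            continue
        total += sum(
            1 for inst in cell["instances"] if inst["project_id"] == project_id
        )
    return total, skipped
```

For a project with two instances in each of two cells and a limit of four, the count is four while both cells are up, so further boots are refused. If one cell becomes unreachable the count drops to two and two more instances would be allowed, which is the gap a global per-project allocation record in placement is meant to close.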
21:46:58 <mriedem> btw, it's fun that placement is the new keystone :)
21:47:01 <mriedem> has anyone mentioned that yet?
21:47:14 <dansmith> hah, jaypipes kinda did indirectly :P
21:47:19 <melwitt> yeah, so I was thinking if we aren't going to recommend to operators to create multiple cells until rc1, then we're "safe" in that we don't have a gap for the "cell down" case from their perspective
21:47:20 <macsz> in which way?
21:47:21 <dansmith> when discussing quota shit
21:47:50 <mriedem> macsz: what in which way? placement == keystone?
21:47:56 <macsz> yeah
21:47:59 <dansmith> melwitt: even if we get all my stuff landed, we can still say "it works with the following caveats, so don't use it if those bother you"
21:48:00 <mriedem> macsz: just that it's going to have user/project stuff in it and it's global
21:48:22 <dansmith> melwitt: in fact, I have one such caveat called out in the series already
21:48:25 <macsz> mriedem: oh, ok, got it :)
21:48:33 <melwitt> dansmith: yeah, true
21:48:33 <dansmith> although it's much smaller than quotas of course
21:48:40 <mriedem> melwitt: i'm also fine with saying multiple cells is ok with caveats
21:48:45 <mriedem> for what we know doesn't work
21:48:55 <dansmith> like.. pci :)
21:48:58 <mriedem> melwitt: however,
21:49:07 <mriedem> the fact you're thinking about caveats for rc1 at this point scares me
21:49:22 <melwitt> heh
21:49:25 <dansmith> mriedem: this quota caveat has been planned since before atlanta
21:49:55 <mriedem> but wasn't that before talking about putting user/project in allocations in placement?
21:49:58 <dansmith> as long as the caveats don't affect the single-cell case, I don't see the problem other than just limiting the scope of who can move to cellsv2 on release day
21:50:12 <melwitt> well, I woke up in the middle of the night and exclaimed (in my mind) "what if a cell goes down!" so I've been thinking about it. I'm sure there are other caveats I haven't thought of yet
21:50:34 <mriedem> i woke up thinking about sump pumps and patio furniture covers and sling tv
21:50:43 <mriedem> to each his/her own
21:50:44 <dansmith> yes
21:50:55 <dansmith> oops
21:51:19 <melwitt> dansmith: agreed with the single cell vs multi cell caveat thing
21:51:25 <dansmith> anything else? this is already the longest cells meeting on record
21:51:38 <macsz> just would like to say that me and pumaranikar are both working in osic with johnthetubaguy
21:51:45 <macsz> and we both are interested in doing some work for cells
21:52:07 <macsz> probably will start with some bugs, unfortunately most of cells bug reports are over 1 yr old
21:52:14 <macsz> but will start digging sth :)
21:52:18 <mriedem> macsz: i think there will be work to do with searchlight possibly
21:52:20 <dansmith> macsz: those are bugs we don't care much about
21:52:27 <dansmith> macsz: i.e. cellsv1 bugs
21:52:38 <mriedem> macsz: i don't know that anyone is slated to add versioned notification work to searchlight
21:52:55 <dansmith> macsz: testing more realistic stuff with cellsv2 is something major you guys could help out with
21:52:59 <mriedem> probably need to sort that out with steve mclellan and/or kevin_zheng
21:53:09 <macsz> ok :)
21:53:20 <mriedem> macsz: also, i've been trying to get a ci job setup with searchlight enabled and nova configured to send things to searchlight,
21:53:27 <mriedem> that's not as easy as i thought it would be
21:53:34 <mriedem> we can talk about that in -nova if you're interested
21:53:58 <macsz> mriedem: yeah, sure
21:54:36 <mriedem> ok so i think we can wrap up
21:54:46 <mriedem> excellent meeting gang!
21:54:57 <dansmith> sweet
21:54:59 <dansmith> #endmeeting