21:00:05 #startmeeting nova_cells
21:00:06 Meeting started Wed Mar 8 21:00:05 2017 UTC and is due to finish in 60 minutes. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:10 The meeting name has been set to 'nova_cells'
21:00:18 o/
21:00:24 o/.
21:00:26 \o
21:00:33 o/
21:00:42 melwitt: armpit?
21:01:09 what armpit
21:01:20 your hand up had a bogey
21:01:20 the dot :)
21:01:33 oh, haha. I didn't even notice
21:01:38 #topic cells testing/bugs
21:01:47 so before we get into mriedem shitting all over it,
21:01:52 in regards to testing, I'd like to point out this:
21:01:59 http://logs.openstack.org/94/436094/14/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/f7d6160/logs/testr_results.html.gz
21:02:15 all but two tempest tests running with multiple cells, and I have patches up for those as well
21:02:35 unfortunately, no chance of having a clean run at this point due to when I pushed those up
21:02:54 but anyway, the effort of actually getting a clean test run on multicell devstack is progressing
21:02:54 \o/
21:03:00 the devstack patch itself still needs a lot of work,
21:03:07 but I won't ever get to it if mriedem keeps up his antics
21:03:18 anyway, anything else testing-related?
21:03:23 s/antics/excellent reviews/
21:03:51 yeah
21:03:52 on https://review.openstack.org/#/c/442861/
21:03:59 is the nova-status thing just separate from this series?
21:04:06 i think i thought that was fine but checking
21:04:11 mriedem: it is
21:04:22 ah yes https://review.openstack.org/#/c/442787/
21:04:38 mriedem: we pulled out a newer service version check at the end of ocata, you'll recall, and I continue to challenge the root concern anyway,
21:04:54 but not opposed to a status check of course
21:05:21 you mean the one in the scheduler filter for placement?
21:05:37 no, that was in pike, but that's another good example :)
21:05:55 we had one in compute/api about earlier computes before a cells patch from avolkov
21:06:00 so really our minimum service version check in nova-status should be whatever was required for that placement thing
21:06:14 which i think i had a bug for anyway
21:06:37 oh no, different check: https://bugs.launchpad.net/nova/+bug/1669433
21:06:37 Launchpad bug 1669433 in OpenStack Compute (nova) "nova-status needs to check that placement 1.4 is available for pike" [High,In progress] - Assigned to Roman Podoliaka (rpodolyaka)
21:07:01 anyway, it's a good point that the minimum compute version is going to need to be 16
21:07:03 which is your patch
21:07:54 anything else on testing? the next topic, open reviews, has a lot of material
21:08:04 no
21:08:37 #topic open reviews
21:08:55 so one of our oldest is dtp's console upcall patch, which I hit again today
21:09:09 I had a minor complaint about it doing some cleanup and a functional change in the same patch, and asked him to split it
21:09:15 #link https://review.openstack.org/#/c/415922/
21:09:23 anyone else able to take a look at that soon?
21:09:42 i'd prefer melwitt to look at that given she was looking more into the spec
21:09:48 I'm planning to look at it
21:09:48 did that get re-proposed and approved btw?
21:10:00 in other words,
21:10:04 mriedem: not yet, going to do that maybe today. this week
21:10:05 shouldn't this change go under that blueprint?
21:10:22 mriedem: which blueprint? the console tokens in db one?
21:10:24 well, I guess the thing is this is an interim thing
21:10:39 right this is not really related to that larger effort
21:10:41 it was supposed to go in ocata as a stop-gap
21:10:42 oh
21:10:55 carry on then
21:11:12 cool
21:11:27 melwitt: can we maybe try to have that merged by this time next week?
21:11:54 dansmith: the spec? yeah. I will also get the placement spec up this week too
21:12:00 no, dtp's patch
21:12:06 oh, yeah. sorry. yeah
21:12:15 cool, the specs are important too of course,
21:12:25 but just want to avoid this withering on the vine too much
21:12:33 roger that
21:12:55 okay so the next set is the quotas stuff,
21:13:10 which got some activity this morning and I think melwitt is probably working on as we speak
21:13:16 I've been through parts of that patch but not the rest
21:13:25 the bottom two are approved and just holding until the third is ready to go
21:13:51 mriedem: in between shitting on my patches that might be a good one for you to look at too
21:13:57 you know, to spread the pain^Wlove around
21:14:22 john has been reviewing that right?
21:14:25 yeah
21:14:28 at this point i'm happy to let john handle it
21:14:39 well, it has some implications to behavior
21:14:50 i realize it's something i should know about...
21:14:54 yup. the top patch is not a picnic for review, a lot of it is deleting of code. so be on the lookout for gaps as something to watch out for
21:14:56 about how things behave when you're close to quota
21:15:09 do we have functional tests for the edge cases?
21:15:33 well, the point is the edge cases are leaky by design
21:16:32 sure. selfishly speaking, there are only so many super complicated series of things i can push into context in my brain at any given time, and with cells v2 and jay's inventory stuff and some other things, i just won't say i can get to it right now and give it a thorough review.
21:16:55 okay
21:17:07 i'm channeling my inner sdague here
21:17:51 melwitt: maybe we try to make sure the commit message/reno summarize the changes well enough that if he just reads that he won't be surprised on stage in the future
21:18:26 fwiw, the edge case discussions are contained at the moment as the only comments on the review. that makes it easier-ish to weigh in on those points
21:18:59 yeah
21:19:00 dansmith: yeah, that's a good idea in general, for anyone to be able to get the main points
21:19:04 yeah
21:19:19 alright anyway,
21:19:28 didn't we need the user/project in placement for counting quotas first?
21:19:35 no
21:19:39 or was that optional for now since we don't expect cells to be 'down' right now
21:19:41 it helps us do it better
21:19:43 yeah
21:20:11 yeah, we're going to go forward with this for now as a first step that has caveats, and expect the placement stuff to complete this cycle and close that gap
21:20:28 since multi cell isn't really a thing at the present moment, anyway
21:20:43 hey!
21:20:59 it is in my fairy tale life
21:20:59 sorry, I meant in the non CD case
21:21:00 look who is shitting on your stuff now
21:21:07 * dansmith steams
21:21:13 like a steaming pile of...
21:21:13 moving on?
21:21:15 yes
21:21:23 guh, no sorry not what I meant
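A minimal sketch of the counting-based quota check being discussed, under the assumption that usage is simply counted from each cell database on demand and compared to the limit (no reservations). The count_in_cell callable is a hypothetical stand-in for the per-cell query, not the actual patch:

    class OverQuota(Exception):
        pass

    def check_instance_quota(cells, count_in_cell, project_id, limit, requested):
        """Count the project's instances across cells and enforce the limit.

        count_in_cell(cell, project_id) -> int would query one cell database.
        The check is racy on purpose: two concurrent requests can both count
        before either creates, which is one of the "leaky by design" edge
        cases mentioned above.
        """
        in_use = sum(count_in_cell(cell, project_id) for cell in cells)
        if in_use + requested > limit:
            raise OverQuota('requested %d, already using %d of %d'
                            % (requested, in_use, limit))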
21:21:25 the next series is my steaming pile of shit
21:21:44 don't worry, i also have searchlight to talk about at some point here
21:21:49 which I just realized won't work in the order I just pushed up, so I will have to transplant some code first
21:22:29 dansmith: as in the patch we just talked about first, but without the GET by id stuff?
21:22:31 however, on top of all of them, we pass a tempest run, although just a few minutes ago mriedem identified some issues that stem from historical leaks of things like internal DB ids
21:22:57 mriedem: no, I moved that up, but it had a refactor (load_cells) that the other ones need, so I need to transplant that
21:23:03 ok
21:23:18 mriedem: so one thing to note is that until you have multiple actual cells,
21:23:26 what I have is not any different than what we have today I think
21:23:41 but, you said you had an idea about moving forward with those?
21:23:44 sure for single cell this is fine
21:23:46 yeah
21:23:59 so, i think we can agree that we should stop leaking ids out of the cell databases in the REST API
21:24:00 correct?
21:24:15 are you saying you're okay merging this early with that caveat? because reordering back is much easier
21:24:19 uh, yes, agreed
21:24:23 obviously
21:24:25 dansmith: not yet
21:24:49 ok, so i think we can agree that we should probably do a microversion in os-hypervisors that returns the compute node uuid rather than the id, and takes a uuid rather than an id for GET calls
21:24:53 is that ok?
21:25:29 what are you asking is okay? that we stop being stupid? yes, that's okay :)
21:25:34 ok
21:25:40 just setting the foundation of shit we can agree on
21:25:56 next thing is, in this code that's the problem, and not cells aware,
21:26:15 if we have multiple cells and can't find a unique compute/service by id (not uuid), we fail with a 400
21:26:27 and force you to use the microversion to pass the uuid to find the thing you need
21:26:37 meaning, check all of them and if we find any dupes, then refuse to do that thing?
21:26:50 right, just like when we boot a server w/o a specific network
21:26:55 if there are duplicate networks, we fail
21:26:58 sure, that's a good idea
21:27:04 but only after we have the microversion api I guess
21:27:09 yeah, so,
21:27:18 you can still pass id before the microversion in the single cell case
21:27:19 that's fine
21:27:33 but in the multi-cell case, if you pass id and we find multiple, it's a 400,
21:27:40 and you have to pass the uuid using the microversion
21:27:46 aye
21:27:55 ok, if we're all happy with that, i can start the spec
21:28:13 so, there's probably a few things, right? os-hypervisors, os-services at least
21:28:17 shouldn't we do them all together?
21:28:21 i'm slightly less clear on the os-pci api here, but would have to investigate that more
21:28:29 yes probably
21:28:39 yeah for sure os-hypervisors and os-services
21:28:41 okay, well, anyway, I'm definitely on board with that
21:28:44 cool
21:28:50 i think we have the same issue in os-pci,
21:28:58 but i have 0 idea if anyone ever uses that api
21:29:02 it's not even documented
21:29:07 I will reswizzle these so this patch can be later in the stack and keep pushing what we can, and wait for that for this patch
21:29:59 ha, also, side note,
21:30:10 PCI_ADMIN_KEYS is used in os-pci but doesn't check if you're an admin,
21:30:15 or perform any kind of check
21:30:32 wt...f
21:30:33 anyway
21:30:41 well, the default policy on listing pci devices is admin only
21:30:44 but still
21:30:55 ah
21:30:56 yeah
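A rough sketch of the lookup behavior just agreed on (not the actual patch): look a compute node up by its integer id in every cell, return it if the id is unique, and fail with a 400 when the id is ambiguous across cells so the client has to use the uuid exposed by the proposed microversion. find_in_cell is a hypothetical stand-in for the per-cell DB query:

    class HTTPBadRequest(Exception):
        pass

    def get_compute_node_by_id(cells, find_in_cell, node_id):
        """Return the single compute node matching an integer id, if unique."""
        matches = [node for node in (find_in_cell(cell, node_id) for cell in cells)
                   if node is not None]
        if len(matches) > 1:
            # The same integer id exists in more than one cell database:
            # refuse, and make the caller retry with the uuid, which the
            # proposed microversion both returns and accepts for GET calls.
            raise HTTPBadRequest('compute node id %s is ambiguous across '
                                 'cells, use the uuid instead' % node_id)
        return matches[0] if matches else None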
21:31:09 okay, mriedem you wanted to call out the searchlight review I assume?
21:31:22 yeah, sec
21:31:31 https://review.openstack.org/#/c/441692/
21:31:33 I have started looking at it a few times, but this guy keeps shitting on my patches
21:31:37 #link searchlight integration spec https://review.openstack.org/#/c/441692/
21:31:39 with "alternative facts"
21:31:52 i haven't gone through the latest round of comments in there,
21:31:58 but it's got quite a bit of detail,
21:32:19 net is it's a bit of a mess dependency-wise
21:32:28 searchlight doesn't support versioned notifications yet
21:32:38 they have a blueprint to do it, but aren't doing it yet
21:32:43 orly
21:32:55 I thought they were super interested in those
21:33:01 we also have an issue with the fact that when you delete a server in nova, they delete the index entry for that server in searchlight,
21:33:21 so if nova is using searchlight and you do nova list --deleted, you get nothing
21:33:44 elasticsearch used to have a concept of a ttl on the entry, but that's removed in v5.0
21:33:57 what are the implications of them not supporting versioned notifications? how do they currently get nova notifications?
21:33:58 they basically pushed the filtering on time to the client it sounds like
21:34:06 melwitt: they get the legacy unversioned notifications
21:34:19 they said they wanted to get versioned notification support in for ocata but didn't have the people to do it
21:35:00 i think it will happen, it's just something to note right now
21:35:05 okay
21:35:11 the delete thing is a bit more worrisome for me,
21:35:16 the deleted thing is probably an issue for them anyway right?
21:35:24 i've suggested a config option in searchlight for a time window before they delete the entry
21:35:28 because people that care about that won't be happy with searchlight as a semi replacement
21:36:04 we don't guarantee that you can get deleted instances forever anyway b/c of archive and purge, but it's something people are going to assume works
21:36:14 and i'm sure admins rely on for debug
21:36:15 yeah
21:36:44 as far as data migrations,
21:37:06 the upside is searchlight already has a searchlight-manage command that you can run to make searchlight hit the nova api and pull in all of the existing instances to populate indexes
21:37:20 so we don't have to worry about nova pushing that data out, or setting up a cron to issue instance.usage.exists
21:37:37 sweet
21:37:46 so you (1) set up searchlight, (2) pull the nova data to populate searchlight, (3) configure nova-api to use it, (4) restart nova-api
21:38:25 the other thing i noted in there that sucks is every new field we add to the rest api we have to add to our versioned notifications
21:38:30 that's not really new, but will be more strictly enforced
21:38:55 plus right now the searchlight guys said we'd also have to make a corresponding mapping change to searchlight to make it handle the new field
21:39:16 gibi pointed out that we have a bp to send the schema with the versioned notification payload, and searchlight could use that schema to add new mappings, but that's a long ways off i think
21:39:30 anyway, none of this is impossible, it's just not as trivial as "we'll just have searchlight do our stuff"
21:39:46 fin
21:39:57 okay that's not too bad,
21:40:02 if we're depending on them like we plan to
21:40:24 not unlike making changes to o.vo or os-vif that we need
21:40:53 yeah it would just suck if we have to make 3 changes before we can return something new out of the rest api
21:40:58 but anyway
21:41:16 well,
21:41:38 we'd have to make the searchlight changes before it would work in that environment, not necessarily for it to work at all
21:41:39 but yeah
21:41:55 not surprising given the level at which we're using them for api in this scenario though
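To make the trade-off concrete, a very rough sketch of the architecture being debated, not the proposed implementation: with searchlight integration enabled, nova-api would answer list requests from the notification-fed index instead of fanning out to every cell database. All names here (use_searchlight, query_searchlight, query_cell_db) are hypothetical:

    def list_instances(context, cells, use_searchlight,
                       query_searchlight, query_cell_db):
        if use_searchlight:
            # One indexed query regardless of the number of cells. Caveat from
            # the discussion above: searchlight drops the index entry when a
            # server is deleted, so "nova list --deleted" would come back empty.
            return query_searchlight(context)
        # Fallback: fan out to each cell database and merge the results.
        results = []
        for cell in cells:
            results.extend(query_cell_db(context, cell))
        return results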
21:42:55 anything else on stuff up for review?
21:43:15 i don't have anything
21:43:29 melwitt: ?
21:43:52 no, think everything got mentioned
21:44:03 cool
21:44:06 #topic open discussion
21:44:20 I wanted to clarify what I said earlier,
21:44:42 I was thinking of multi cell from an operator perspective as in, how long would they experience a gap in say, the "cell down quota issue"
21:45:15 I had been thinking we were going to signal to them that it's okay/recommended to create multiple cells at rc1
21:45:16 can someone explain to me what 'cell down' even means?
21:45:22 rabbit and db are dead for that cell?
21:45:30 like, lose communication with the cell, for whatever reason
21:45:51 yeah, that's one example
21:45:59 melwitt: if mriedem stops shitting on patches, then yes I agree with that statement :)
21:46:00 how is that different from if your non-cells single region deployment loses rabbit/db today?
21:46:15 mriedem: your quota appears to expand in that case
21:46:21 mriedem: because you stop counting certain resources you can't see
21:46:32 ok
21:46:42 which is why we need the global allocation
21:46:44 via placement
21:46:46 got it
21:46:55 yeah
21:46:58 btw, it's fun that placement is the new keystone :)
21:47:01 has anyone mentioned that yet?
21:47:14 hah, jaypipes kinda did indirectly :P
21:47:19 yeah, so I was thinking if we aren't going to recommend to operators to create multiple cells until rc1, then we're "safe" in that we don't have a gap for the "cell down" case from their perspective
21:47:20 in which way?
21:47:21 when discussing quota shit
21:47:50 macsz: what in which way? placement == keystone?
21:47:56 yeah
21:47:59 melwitt: even if we get all my stuff landed, we can still say "it works with the following caveats, so don't use it if those bother you"
21:48:00 macsz: just that it's going to have user/project stuff in it and it's global
21:48:22 melwitt: in fact, I have one such caveat called out in the series already
21:48:25 mriedem: oh, ok, got it :)
21:48:33 dansmith: yeah, true
21:48:33 although it's much smaller than quotas of course
21:48:40 melwitt: i'm also fine with saying multiple cells is ok with caveats
21:48:45 for what we know doesn't work
21:48:55 like.. pci :)
21:48:58 melwitt: however,
21:49:07 the fact you're thinking about caveats for rc1 at this point scares me
21:49:22 heh
21:49:25 mriedem: this quota caveat has been planned since before atlanta
21:49:55 but wasn't that before talking about putting user/project in allocations in placement?
21:49:58 as long as the caveats don't affect the single-cell case, I don't see the problem other than just limiting the scope of who can move to cellsv2 on release day
21:50:12 well, I woke up in the middle of the night and exclaimed (in my mind) "what if a cell goes down!" so I've been thinking about it. I'm sure there are other caveats I haven't thought of yet
21:50:34 i woke up thinking about sump pumps and patio furniture covers and sling tv
21:50:43 to each his/her own
21:50:44 yes
21:50:55 oops
21:51:19 dansmith: agreed with the single cell vs multi cell caveat thing
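A small sketch of the "cell down" caveat just described, with hypothetical helper names rather than the real code: when a cell's database cannot be reached, its resources simply are not counted, so the tenant's apparent usage drops and the quota effectively expands until the cell comes back; this is the gap the placement-based global allocation counting is meant to close:

    def count_usage(cells, count_in_cell):
        """Sum usage across reachable cells; an unreachable cell contributes zero."""
        in_use = 0
        for cell in cells:
            try:
                in_use += count_in_cell(cell)
            except ConnectionError:
                # Cell DB/MQ unreachable ("cell down"): we can't see those
                # resources, so they don't count against the quota right now.
                continue
        return in_use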
21:51:25 anything else? this is already the longest cells meeting on record
21:51:38 just would like to say that me and pumaranikar are both working in osic with johnthetubaguy
21:51:45 and we both are interested in doing some work for cells
21:52:07 probably will start with some bugs, unfortunately most of the cells bug reports are over 1 yr old
21:52:14 but will start digging into something :)
21:52:18 macsz: i think there will be work to do with searchlight possibly
21:52:20 macsz: those are bugs we don't care much about
21:52:27 macsz: i.e. cellsv1 bugs
21:52:38 macsz: i don't know that anyone is slated to add versioned notification work to searchlight
21:52:55 macsz: testing more realistic stuff with cellsv2 is something major you guys could help out with
21:52:59 probably need to sort that out with steve mclellan and/or kevin_zheng
21:53:09 ok :)
21:53:20 macsz: also, i've been trying to get a ci job setup with searchlight enabled and nova configured to send things to searchlight,
21:53:27 that's not as easy as i thought it would be
21:53:34 we can talk about that in -nova if you're interested
21:53:58 mriedem: yeah, sure
21:54:36 ok so i think we can wrap up
21:54:46 excellent meeting gang!
21:54:57 sweet
21:54:59 #endmeeting