17:00:48 #startmeeting nova_cells
17:00:52 Meeting started Wed May 25 17:00:48 2016 UTC and is due to finish in 60 minutes. The chair is alaski. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:54 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:57 The meeting name has been set to 'nova_cells'
17:01:07 o/
17:01:09 \o/
17:01:33 small crowd, but let's get going
17:01:39 #topic Testing
17:01:40 o/
17:01:55 no cellsv1 breaks that I know of
17:02:18 auggy has kindly offered to work on cellsv2 testing
17:02:37 she was researching what it would take to get a grenade job going on that
17:02:52 Thanks auggy.
17:03:06 * bauzas waves a bit late
17:03:17 one point of interest that came up was whether we encourage deployers to migrate as part of an upgrade, or after
17:03:39 and should grenade test M->N plus migrate, or N->N plus migrate
17:04:10 I leaned towards thinking that deployers will want to do it in two steps
17:04:17 yup
17:04:22 so they would migrate to N, then at some later point migrate to cells
17:04:28 that's my thought as well
17:04:38 that depends on if we want to support deployers not yet using cells v2
17:04:54 bauzas: so far I've been working under that assumption
17:05:31 so, the question is: should we accept to have operators deploying Newton without using cells v2?
17:05:34 and I don't think there's anything to prevent migrating to cells during an upgrade, but that won't be the tested/recommended path
17:06:05 bauzas: yes, I think so
17:06:12 and then, the correlated question is: if we accept Newton deployments without cells v2, should we deprecate it for Ocata then?
17:06:19 o/
17:06:24 Yes. We don't want that to go on too long.
17:06:27 I would like to put a hard cut off for O
17:06:38 okay, so the point is
17:06:46 i.e. must migrate to cellsv2 before upgrade to O
17:07:15 okay, so we answered the question you had
17:07:16 Absolutely. We (developers) are at some point going to make assumptions that cellsv2 is set up.
17:07:36 ie. grenade for M-N and then N->N + cells V2
17:07:39 bauzas: yep. we all seem to be on the same page
17:08:06 saying that at the GA for Ocata, all the deployers will be using cellsv2
17:08:13 doffm: that will make some code much easier, right now I have conditionals everywhere
17:08:32 The migration code is basically all conditionals.
17:08:33 (ie. at least running the main command)
17:08:53 I mean the one-for-all command
17:09:01 yep
17:09:09 okay, I'm fine
17:09:15 the assumption I would like to make is that instance and host mappings are all set up
17:09:47 cool, let's move on
17:09:51 #topic Open Reviews
17:09:53 https://etherpad.openstack.org/p/newton-nova-priorities-tracking
17:10:10 be better than me and try to keep that updated :)
17:10:15 I haven't put it on the etherpad. (I will) but bauzas had some questions about: https://review.openstack.org/#/c/296597
17:10:36 Basically I think there is a race condition when migrating and updating at the same time.
17:10:48 I'll put additional comments up, but would like feedback on how best to proceed.
17:10:49 yeah it was my question
17:11:13 okay, I will need to read through it
17:11:16 I haven't looked, but will
17:11:17 either way,
17:11:35 admin-only operations that could be racy without damage during an upgrade don't concern me as much
17:11:46 No, I made that point, no damage caused.
17:11:47 if it's something a user can do and have issues, that's a problem
17:11:53 good point
17:11:55 And operator only.
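One way to handle the migrate-vs-update race on aggregates being discussed here, consistent with the "ensure_migrated" check that comes up just below, is to make the (admin-only) aggregate update operations fail loudly while the migration to the API database is in progress, rather than letting them write to the wrong database. The following is a minimal sketch of that idea, not the patch under review (https://review.openstack.org/#/c/296597); the helper _main_db_aggregate_count is a stand-in and the exact exception used may differ.

    from nova import exception


    def _main_db_aggregate_count(context):
        # Stand-in: the real check would query the main (cell) database
        # directly for aggregates that have not yet been migrated.
        return 0


    def ensure_migrated(context):
        """Refuse aggregate updates while migration is incomplete.

        Per the discussion ("anything in either means no updates"), any
        unmigrated rows left in the main database mean update operations
        should fail momentarily rather than succeed against the wrong db.
        """
        if _main_db_aggregate_count(context):
            raise exception.ObjectActionError(
                action='save',
                reason='aggregate migration to the API database is not '
                       'complete; retry after the migration finishes')


    def update_aggregate_metadata(context, aggregate, metadata):
        # Called from the admin-only aggregate API: fail instead of racing
        # the one-time migration.
        ensure_migrated(context)
        aggregate.update_metadata(metadata)

An operator who hits this mid-upgrade simply retries once the migration command has finished, which matches the "sorry dude, you should know why you have this exception" position taken below.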
17:11:56 good point yeah
17:12:01 but an admin can follow renos and not do things like that during an upgrade
17:12:12 or at least, we can blame an admin who doesn't :P
17:12:15 given the aggregate API is an admin one, then please nevermind my question
17:12:22 because I agree with dansmith
17:12:24 Ok.
17:12:29 I still want to take a look, but I agree
17:12:36 yep, and I will
17:12:47 we don't want to make too many of those kinds of excuses,
17:12:51 meaning that only operators modify the aggregates, so they should know when they migrate
17:12:55 but weigh the relative payoff vs. the complexity
17:13:22 yeah
17:13:30 I just have to make sure that all the update operations actually fail, rather than write to wrong db successfully.
17:13:44 I think that is the case.
17:13:49 Just have to make sure.
17:13:59 if we could have an easy check of "migration has started but not completed so all actions fail momentarily" that seems good. but anything more complex may not be worth it
17:14:19 like flavors
17:14:24 right
17:14:28 anything in api db means no updates
17:14:34 er, anything in either, means no updates
17:15:00 I can add the 'ensure_migrated' check to update ops.
17:15:02 yeah.
17:15:08 cool
17:15:30 #topic Open Discussion
17:15:31 cool for me yeah, we could just say "sorry dude, you should know why you have this exception"
17:15:52 o/ hi!
17:16:05 hi!
17:16:11 aïl
17:16:18 for the grenade test, i have been updating the testing etherpad and am currently working on setting up a target devstack
17:16:26 awesome
17:16:28 so i will probably have questions ;)
17:17:01 I proposed the devstack transport_url patch https://review.openstack.org/320746 before seeing that sileht had one up already https://review.openstack.org/317958
17:17:18 auggy: great. that should be a good opportunity for documenting things that will help someone getting up to speed on cells
17:17:29 his doesn't handle the cells v1 case so I was hoping we could merge the two. mine passes the cells job in the experimental queue
17:18:18 melwitt: excellent
17:18:59 I will take a look and comment
17:19:26 cool, thanks
17:19:43 yeah the docs question was going to be the next thing, since i'm basically fresh eyes on this whole process
17:20:07 I want to mention something that dansmith, bauzas, and I discussed yesterday but might benefit from further thinking
17:20:21 * dansmith doesn't even know what this is
17:20:26 I have completely forgotten what docs we have. :)
17:20:27 oh, delete
17:20:29 auggy: there's a cells doc in the devref that could be updated
17:20:29 plusone
17:20:31 dansmith: yep
17:20:42 it gets even better though
17:20:57 alaski: ok cool i'll take a look and put together a patch based on my experiences with the grenade/devstack setup
17:21:17 so instance delete currently blocks all progress on ongoing actions through the expected_task_state check on instance.save()
17:21:39 but instance delete when all we have is a buildrequest does not block further actions
17:21:47 like creating an instance in a cell after scheduling
17:22:41 right now my approach is going to be that instance delete removes the buildrequest and instance_mapping if there is no instance in a cell yet, and their existence should be checked when writing to a cell
17:23:08 which mostly works, but might leave a small window for a race
17:23:39 so if anyone has thoughts on improving that I'm open to suggestions
17:23:55 The race would be a very odd failure though. Would the instance possibly boot while the instance_mapping was empty?
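For reference on the delete path being described here, the following is a minimal sketch of "instance delete removes the buildrequest and instance_mapping if there is no instance in a cell yet". The object names follow the nova objects mentioned in the log, but the structure is illustrative rather than the actual API code, and the already-scheduled branch is elided.

    from nova import exception
    from nova import objects


    def delete_instance(context, instance_uuid):
        try:
            build_request = objects.BuildRequest.get_by_instance_uuid(
                context, instance_uuid)
        except exception.BuildRequestNotFound:
            # Already scheduled: the instance lives in a cell, so the normal
            # delete path (task_state checks, compute RPC) applies; elided.
            return

        # Not scheduled yet: removing these two rows is the whole delete.
        # The small race window is the conductor creating the cell instance
        # at this same moment, which is why the conductor must re-check the
        # build request before casting (see the ordering sketch further on).
        build_request.destroy()
        mapping = objects.InstanceMapping.get_by_instance_uuid(
            context, instance_uuid)
        mapping.destroy()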
17:24:28 I'm still missing where there is a race
17:24:31 once the RPC call goes to a compute the instance_mapping won't matter
17:25:08 if we create in the cell db before we send the build rpc call, then a delete can hit the instance and block progress the same way it does today
17:25:19 if we create instance, delete buildreq in that order,
17:25:32 the delete won't hit the instance until instance_mapping points to it
17:25:33 then I would think we can clean up the instance if buildreq is gone, before we do the cast, so we're good
17:26:06 we just have to make sure we create and delete things in the right order I think
17:26:13 mapping, instance, delete buildreq
17:26:21 should let us detect the delete before the rpc call I think
17:26:32 so if we write the instance, confirm buildreq/instancemapping exist,* , remap to instance, send RPC
17:26:34 once the instance and mapping are in place, then the normal delete procedures will work at any point
17:26:37 the * is the race I think
17:27:01 write instance, write mapping, delete the buildreq, send rpc
17:27:56 also,
17:28:03 if we race at your *,
17:28:09 then delete will be re-issue-able, right?
17:28:13 if we write mapping at the same time as the delete comes through it might remove the buildreq after we check
17:28:25 yes, the delete is re-issue-able
17:28:44 so not the worst thing, as long as our data stays consistent
17:28:46 I don't get that last thing
17:29:13 create mapping and instance in one transaction,
17:29:21 delete the buildreq in another
17:29:38 mapping and instance are in two dbs
17:29:41 if delete fails because it's missing, then we undo the instance
17:29:50 okay, right, but it doesn't matter
17:29:53 create the instance first
17:30:00 the instance isn't findable until the mapping is there
17:30:08 * bauzas sorry, got distracted
17:30:12 so mapping in one transaction, delete the buildreq in a separate transaction
17:30:26 if buildreq is gone, delete the mapping and instance, never send the rpc
17:30:42 I think we could whiteboard this in PDX and figure out a plan
17:30:48 not sure it's worth obsessing over right now
17:30:49 a whiteboard would help
17:31:10 we can move on and discuss this outside the meeting, but I wanted to bring it to everyone's attention
17:31:14 there's one further issue
17:31:26 instance tags, which can be created at any point in the instance lifecycle
17:31:35 alaski: Is there a patch where this race is introduced... for comments.
17:31:37 as in, when no instance exists
17:31:58 I just think the instance mapping should be the very last thing to delete, given any problem we could have
17:32:09 doffm: this is the closest https://review.openstack.org/#/c/319379/
17:32:12 so we could just ask the user to delete again
17:32:39 alaski: Thanks.
17:32:52 bauzas: we could
17:32:58 alaski: you mean tag creation before the instance is scheduled?
17:33:00 asking the user to delete again, I hope we can avoid that because I think it'll happen a lot more than we are expecting here
17:33:32 dansmith: right. so it's looking like we need to have that code update the buildreq instance
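A sketch of the create-side ordering settled on above ("write instance, write mapping, delete the buildreq, send rpc"), including the "if buildreq is gone, delete the mapping and instance, never send the rpc" cleanup. Helper names are assumptions and the final cast is elided; this is not the actual conductor code, just the ordering from the discussion.

    from nova import exception
    from nova import objects


    def build_in_cell(context, build_request, instance, cell_mapping):
        # 1. Create the instance in the chosen cell. It is not yet reachable
        #    through the API because the mapping still points at the build
        #    request ("the instance isn't findable until the mapping is there").
        instance.create()

        # 2. Point the instance_mapping at the cell, so the normal delete
        #    procedures work from here on.
        mapping = objects.InstanceMapping.get_by_instance_uuid(
            context, instance.uuid)
        mapping.cell_mapping = cell_mapping
        mapping.save()

        # 3. Delete the build request in a separate transaction. If it is
        #    already gone, a delete raced us before scheduling finished:
        #    undo the instance and mapping and never cast to compute.
        try:
            build_request.destroy()
        except exception.BuildRequestNotFound:
            instance.destroy()
            mapping.destroy()
            return

        # 4. Only now cast build_and_run_instance to the compute host (elided).

The remaining window is the one marked with "*" above: a delete that removes the build request between steps 2 and 3 only costs the user a re-issued delete, since the mapping and instance are already in place for the normal delete path.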
17:33:34 melwitt: agree we should try to avoid it, but if the tiny race window still allows reissuing of the delete, then that's the least worst
17:33:44 melwitt: sure, I'm just saying that if we still have the instance mapping, we can still delete again, which is not possible when the row is removed
17:33:45 alaski: oye
17:33:46 melwitt: dansmith +1
17:34:24 dansmith: yeah, I'm hitting a few places where api code needs to be aware of if it got instance from a buildreq and do different things
17:34:49 yeah, I know it's least worst. just that it's not okay IMO. I think it'll happen a lot and will cause a lot of pain for users
17:35:12 alaski: seems like we might be able to argue for a "busy try again in a minute" status code for that
17:35:20 melwitt: you think the race will?
17:35:46 melwitt: it would need to be a delete call almost immediately after issuing the boot request
17:36:09 melwitt: I would really like to solve this. we need a lock or sync point, and I'm not sure if we have the right things in place to do that yet
17:36:23 Sounds like something that could happen in testing repeatedly. Which isn't the worst thing, but could cause repeated alarms in automated tests.
17:36:31 we previously used task_state
17:36:44 alaski: that's really not airtight though
17:36:52 true
17:37:04 alaski: it eventually stops the thing, but not always on the first try, AFAIK
17:37:45 yeah. I'm going to think on it more, and I just wanted to get you smart people thinking on it as well
17:37:53 dansmith: yes. I think I've seen a lot of bugs in launchpad over the years of people deleting before the instance schedules. perhaps I'm thinking of the old cells v1 bug with delete racing with boot that got fixed when I converted things to instance objects
17:38:00 but for now I'm going to try relying on checking buildreq delete
17:38:02 I think it's fully resolvable and not that big of a deal
17:38:15 that race happened often in the tempest job
17:38:33 melwitt: sure, I know that kind of thing happens, but I wouldn't say it "affects a lot of our users" I guess
17:38:38 I tend to agree with dansmith here, because cells v1 delete is a very different problem
17:38:49 given how we call top/down and bottom/up
17:38:55 (for cells v1)
17:38:56 or a lot of pain or whatever
17:39:15 I think the average horizon user probably clicks delete three times per delete anyway
17:39:18 like an elevator button :D
17:39:21 just making some assumptions when issuing a delete doesn't sound hard to do
17:40:18 I think we need to take this to the review, or a whiteboard in pdx
17:40:32 well, I have seen enough things like that, people boot an instance and immediately delete it. and to have the situation of "oh yeah, sometimes it doesn't really delete it, just delete it again"
17:40:38 I have something to try, but further suggestions definitely welcome
17:41:02 I'm just saying I think it's worse than the consensus here is saying. just my opinion
17:42:13 I agree it's crappy, but potentially addressable with the current direction
17:42:24 I need to see it visually to convince myself either way
17:42:49 We have reached the cognitive limits of IRC.
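For context on the task_state mechanism referenced above ("instance delete currently blocks all progress on ongoing actions through the expected_task_state check on instance.save()"), here is a simplified, self-contained model of how that kind of guard behaves. It is not nova's implementation: the real comparison happens as part of the database update and uses nova's own exception types; this toy version only illustrates why a concurrent delete makes an in-flight action fail, and why it is "not always on the first try".

    class UnexpectedTaskStateError(Exception):
        """Raised when task_state changed underneath the caller."""


    class SimplifiedInstance:
        """Toy model of the expected_task_state guard; not nova's Instance."""

        def __init__(self):
            self.task_state = 'scheduling'

        def save(self, expected_task_state=None):
            # nova performs this comparison as part of the database UPDATE;
            # a plain attribute check stands in for that here.
            if (expected_task_state is not None
                    and self.task_state not in expected_task_state):
                raise UnexpectedTaskStateError(
                    'expected %s, found %s'
                    % (expected_task_state, self.task_state))
            # ... persist pending changes ...


    instance = SimplifiedInstance()
    instance.task_state = 'deleting'   # a concurrent delete wins the race

    try:
        # An action that read task_state='scheduling' earlier now fails its
        # save(), which is how delete blocks further progress today.
        instance.save(expected_task_state=['scheduling'])
    except UnexpectedTaskStateError:
        pass  # the in-flight action aborts, as described in the log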
17:42:56 heh
17:43:09 dansmith: so back to your "try again later" status code
17:43:14 doffm: yep
17:43:47 that touches on the crux of this, which is that there is now a big distinction between pre/post scheduled instances
17:44:17 to avoid exposing that in the API we sort of need to hide that, but should we really do that
17:44:35 um
17:44:41 too many "thats" in there for me to grok
17:44:49 heh
17:45:02 should users be aware that an instance isn't really an instance until after scheduling
17:45:11 and some actions may not work until then
17:45:26 the problem is it's not an action right?
17:45:30 like updating tags or metadata
17:45:38 an action of shelve doesn't work on a pre-schedule instance,
17:45:46 but tags have their own CRUD operations, right?
17:46:01 right. it's not instance actions, it's more actions on data attached to an instance
17:46:14 s/attached/associated/
17:46:34 alaski: so we *can* load modify and save the buildreq, right? the problem is,
17:46:45 we can yes
17:46:48 high danger of losing writes when multiple things are happening together
17:47:07 we could use a generation field to at least signal that it happened instead of lying,
17:47:11 that. and that some code now needs two paths
17:47:19 but that is the same as the "try back later" right?
17:48:17 yes
17:48:32 I'm not sure it's much, or any, worse than today with an instance
17:48:48 my main concern is special casing the code around buildrequests
17:49:04 but I may be convincing myself that it's only a few things and not that bad
17:49:20 right but blocking things is less duplication than allowing it to complete in two ways
17:50:00 yes. but if we can avoid blocking I would like to
17:50:32 so wait
17:50:57 you want to (a) avoid blocking (b) avoid doing it two ways (c) not move tags to the api db?
17:51:06 please god don't say it's (c)
17:51:25 (a)
17:51:38 but also (c), why would we move tags
17:51:41 so you want to load/store the buildreq, add a version field?
17:52:07 I guess I thought above you said you didn't want to do that
17:53:24 * dansmith notes seven minutes remain
17:53:39 what I'm hoping to avoid is too much complex if buildreq then foo elif instance then bar
17:53:53 if it can be abstracted nicely I'm not concerned
17:54:18 yeah, I don't think we'll solve this here
17:54:21 I dunno, I think you're likely going to have it be pretty complex,
17:54:26 because tags are not stored in the instance,
17:54:46 so you're going to need to store them with the buildreq instance, and make sure instance create handles that right
17:54:55 maybe it will, I'd have to go look, I don't recall you being able to do that though
17:55:11 it doesn't, I would have to add that
17:55:11 I was thinking tags was a regular @property, but maybe it's a proper field
17:55:18 it's a field
17:55:30 but it's not normally updated that way
17:55:34 maybe not ever
17:55:38 it is yeah
17:55:42 and yeah, it's a weird one
17:56:21 well, I'll put something together and we can discuss on a review
17:56:45 that's all I have today
17:56:51 tags are not created in create()
17:56:55 anyone want to use up the last 3 minutes?
17:56:56 anyway
17:57:24 dansmith: yeah, they're a bit odd
17:57:27 okay, thanks all!
17:57:30 Thanks.
17:57:32 #endmeeting
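An illustration of the generation-field idea floated at 17:47:07: update the build request's stored instance only if the generation read earlier is still current, so a concurrent writer, a delete, or a completed move to a cell shows up as a conflict ("try back later") instead of a silently lost write. This is a hedged sketch under an assumed schema (a build_requests table with id, instance, and generation columns); it is not nova's BuildRequest code.

    import sqlalchemy as sa


    def save_build_request_instance(connection, build_requests, req_id,
                                    generation, new_instance_blob):
        """Compare-and-swap update keyed on a generation column (assumed)."""
        result = connection.execute(
            sa.update(build_requests)
            .where(build_requests.c.id == req_id)
            .where(build_requests.c.generation == generation)
            .values(instance=new_instance_blob, generation=generation + 1))
        if result.rowcount == 0:
            # Another writer bumped the generation, or the build request was
            # deleted or already moved to a cell: surface this as "busy, try
            # again" (e.g. HTTP 409) instead of pretending the write landed.
            raise RuntimeError('build request changed or deleted; retry')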