21:00:00 #startmeeting nova
21:00:01 Meeting started Thu Nov 21 21:00:00 2019 UTC and is due to finish in 60 minutes. The chair is efried. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 The meeting name has been set to 'nova'
21:00:13 o/
21:00:29 o/
21:01:39 Hello both of you!
21:01:40 o/ hello. Though I might chime in with some questions about a review if you have time
21:01:53 Hi bcm, glad to have you. Saw your name on the agenda.
21:02:16 though you better not let dansmith see that you put it in "stuck reviews" rather than "open discussion".
21:02:33 okay, might as well get started
21:02:40 #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
21:02:51 efried: ah cool. Hmm. Thought it qualified as one. Ok
21:03:01 maybe it does. I'm mostly being facetious.
21:03:20 :)
21:03:24 It's been nearly a month since we had a meeting, what with PTG/summit fore- and aftermath, DST shift (f DST) etc.
21:03:39 on that note:
21:03:39 #topic Last meeting (4 weeks ago)
21:03:39 #link Minutes from last meeting: http://eavesdrop.openstack.org/meetings/nova/2019/nova.2019-10-24-21.00.html
21:03:45 any ol' business?
21:15:28 Is there anything else we can accomplish here now?
21:15:36 So, and pardon my ignorance, but does the current review look like it could be made acceptable
21:15:37 wrt that patch, bcm?
21:15:57 or are substantial changes required to achieve something that's "right"?
21:16:56 Just wondering what the best course of action is going forward to close off the bug
21:17:02 (I definitely don't have enough background/context to be able to weigh in, btw; I'm just sitting here with the gavel)
21:17:23 mriedem, any thoughts? you seem pretty well versed in this area.
21:17:47 It looks like lyarwood was tracking this for a while; maybe he could be poked again now that we're post-PTG?
21:18:01 i would have to load this all back in my head,
21:18:14 but i saw a few people flounder for over a year trying to get CI working
21:18:19 so i'm not sure i trust what it's doing
21:18:29 right ok
21:18:52 need gorka to weigh in from a cinder POV
21:18:59 efried: when/where is lyarwood usually contactable?
21:19:07 since he basically told me what i was doing in my change, which was similar to this, was wrong
21:19:13 bcm: he's UK
21:19:24 For the sake of the minutes: Bug #1452641
21:19:24 or ireland?
21:19:27 UK
21:19:36 same thing am i right?!
21:19:38 mriedem: I did bring this up in the cinder meeting yesterday, but they didn't have much to say
21:19:39 hello? bug 1452641
21:19:42 * mriedem dodges a bottle
21:19:45 lol
21:19:57 bots are ill.
21:20:15 #link Static Ceph mon IP addresses in connection_info can prevent VM startup https://bugs.launchpad.net/nova/+bug/1452641
21:20:49 anyway, yeah, when I see things about ceph, I think of lyarwood and maaaybe dansmith.
21:21:12 bug 1452641 in nova (Ubuntu) "Static Ceph mon IP addresses in connection_info can prevent VM startup" [Medium,In progress] https://launchpad.net/bugs/1452641 - Assigned to Corey Bryant (corey.bryant)
21:21:17 nice
21:21:26 bcm: perhaps another way to resurrect interest would be to respond to that ML thread.
21:21:27 I'm the ceph janitor, I watch the ceph job and help fix it when it breaks
21:21:49 with your handy dandy bottle of ceph turpentine
21:22:48 Okay, I guess we should move on. you good for now, bcm?
21:22:52 efried: ok sounds good
21:22:54 yep i'm good
21:23:01 k
21:23:02 thanks efried mriedem and all
21:23:15 #topic Open discussion
21:23:31 (mriedem): Thoughts on exposing exception type to non-admins in instance action event http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010775.html
21:23:32 Is this still open, mriedem?
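[Editor's note on the bug discussed above: 1452641 is about Nova persisting the Cinder-provided connection_info — including the Ceph monitor addresses current at attach time — in its block device mapping records; if the cluster's mon addresses later change, the stale list can prevent the VM from starting. A minimal sketch of the general idea behind the fix under review — refreshing the stored mon hosts from a current ceph.conf — follows. The helper and the connection_info field layout are illustrative assumptions modeled on the rbd connector format, not the actual patch.]

```python
import configparser

def refresh_mon_hosts(connection_info, ceph_conf_text):
    """Hypothetical helper: replace stale Ceph monitor addresses
    stored in a BDM's connection_info with the current mon_host
    list from ceph.conf. The 'data'/'hosts'/'ports' keys mirror
    the rbd volume connection_info layout."""
    conf = configparser.ConfigParser()
    conf.read_string(ceph_conf_text)
    # mon_host is a comma-separated "ip:port" list in [global].
    mon_host = conf.get("global", "mon_host")
    hosts, ports = [], []
    for entry in (e.strip() for e in mon_host.split(",")):
        host, _, port = entry.rpartition(":")
        hosts.append(host)
        ports.append(port)
    refreshed = dict(connection_info)
    refreshed["data"] = dict(connection_info.get("data", {}))
    refreshed["data"]["hosts"] = hosts
    refreshed["data"]["ports"] = ports
    return refreshed

# Invented example: connection_info saved at attach time, with a
# mon address that no longer exists in the cluster.
stale = {"driver_volume_type": "rbd",
         "data": {"name": "volumes/volume-1",
                  "hosts": ["10.0.0.1"], "ports": ["6789"]}}
fresh = refresh_mon_hosts(
    stale, "[global]\nmon_host = 10.0.1.1:6789,10.0.1.2:6789\n")
```

(IPv6 mon addresses would need more careful parsing; this sketch only illustrates why the stored list has to be refreshed from a live source rather than trusted forever.)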
21:23:48 i have some PoC patches up,
21:23:53 mdbooth and melwitt commented
21:24:12 there is some work to do if it's desirable, but i won't push on it unless people think it's useful
21:24:39 this email says "store type" but the patch is storing the entire message?
21:25:04 I'm not sure whether it's useful b/c I'm not that familiar with people using instance action events
21:25:06 yeah i changed it b/c the feeling in the ML thread was that just the type isn't useful, since you need a decoder to know wtf the types mean
21:25:11 ah ok
21:25:19 so i changed it to store the same thing as the fault message we store
21:25:21 if someone thinks it's useful, sure why not
21:25:45 but I haven't heard anything from our customers about that before
21:25:45 the impetus was because we changed resize/cold migrate to do an rpc cast rather than call from the api to conductor,
21:25:56 Heck, if it's the whole message, I'm totally on board. I thought we couldn't do that for some kind of security reason tho.
21:25:59 so we lost the "fast" NoValidHost scheduling failures you could get for resize and cold migrate
21:26:10 so i figured, you can just get that info from instance actions,
21:26:21 but they don't contain details, just that some part of the operation failed
21:26:48 efried: security-wise it's no worse than anything we expose out of the server fault message if the server is in ERROR status
21:27:01 don't you normally just nova show the instance and see the details?
21:27:04 which was mdbooth's concern (leaking details), but melwitt pointed out ^ we already have that problem
21:27:12 melwitt: for what? faults?
21:27:23 yeah, if resize failed
21:27:33 we only expose the server fault if the server status is ERROR or DELETED
21:27:37 don't you get an errored instance and look at the fault?
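[Editor's note: the fault-exposure rule mriedem states at 21:27:33 is the crux of the exchange that follows. A sketch of that rule, purely illustrative — the real logic lives in Nova's API view builders, and the sample server dicts are invented:]

```python
def visible_fault(server):
    """Sketch of the rule stated in the log: Nova only includes the
    'fault' object in a server API response when the server status
    is ERROR or DELETED. Illustration only, not Nova's code."""
    if server.get("status") in ("ERROR", "DELETED"):
        return server.get("fault")
    return None

# A resize that fails scheduling leaves the server ACTIVE (Nova
# didn't touch the guest), so nothing surfaces via the fault field:
untouched = {"status": "ACTIVE"}
errored = {"status": "ERROR", "fault": {"message": "NoValidHost"}}
```

This is why "nova show and look at the fault" stops working for the resize case discussed next: the server never goes to ERROR, so the user has to look at instance actions instead.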
21:27:43 if you fail scheduling during resize, the server status doesn't go to ERROR
21:27:47 oh I see ok
21:27:57 no reason to error out the server if we didn't touch it
21:28:03 that's what instance actions are for
21:28:03 so... the user doesn't know the resize failed?
21:28:30 they don't get a 400 response for NoValidHost during scheduling anymore
21:28:45 but they wouldn't know if, say, the resize_claim on the dest compute failed either (from the API response)
21:28:56 actions are for the results of parts of the operation
21:29:05 so there is a compute_prep_resize event for example
21:29:16 either that is success or error, but if error there are no details
21:29:16 we were doing scheduling synchronously with the call to resize before,
21:29:20 hence my proposal
21:29:26 which backs us into a corner for complex things like cross-cell resize
21:29:43 I think I'm gonna have to read up on this, b/c on the surface it sounds like we lost the ability to tell the user their resize failed
21:29:52 no,
21:30:05 only the signal for one situation became less obvious
21:30:33 mriedem: for that matter, the async schedule task could also just error the instance, just like a failure later in resize would, right?
21:31:00 dansmith: depends on where the resize fails
21:31:07 if we haven't touched the guest, we don't put the instance into error status
21:31:21 if finish_resize on the dest fails, yes the instance goes to ERROR status
21:31:24 yeah I'm just... this goes back to the whole task API thing. in a lot of cases we kind of have to put the instance into ERROR just to signal that something didn't work
21:31:24 if resize_claim fails, we don't
21:31:27 okay, then never mind
21:31:36 even if we didn't "touch" the instance
21:31:41 melwitt: actions are our poor version of the tasks API
21:31:49 yeah, actions are fine for this
21:32:03 it would be better if they were used more for determining success
21:32:07 how does someone know to go look at actions?
because that's not a normal part of most users' workflow, as far as I have known
21:32:23 melwitt: it should be
21:32:27 actions used to be this forgotten API
21:32:34 right, it should be,
21:32:34 ok... I dunno about that
21:32:41 i write actions checking into my functional tests quite a bit
21:32:44 that's gonna take some user re-education
21:32:58 especially b/c end users can't listen for notifications
21:33:05 put it in the release notes, "you really need to look at this now"
21:33:06 like we do in a lot of our functional tests
21:33:07 melwitt: well, the problem is before actions, we had very inconsistent reporting, where only severe errors could really be signaled,
21:33:12 and we have things like this,
21:33:23 If that's what it's for, and it's a matter of education, then we really shouldn't paper over it by making it unnecessary again.
21:33:29 where the architecture of the system relies on returning a synchronous result for a scheduling operation which could take minutes
21:33:30 we've also had innumerable bugs like "server goes to ERROR status even though nothing changed"
21:33:30 instead we should respond to such bugs with said education.
21:33:38 and some clouds get a page when something goes to ERROR
21:34:28 I'm fine with that, just yeah, want the heads-up in the release notes. at least IMHO
21:34:47 mriedem's change from sync to async on this one thing did have a reno
21:34:50 *fine with re-education being the way forward. just needs to be communicated
21:34:56 I dunno if it provided a handholding recipe for checking actions, but..
21:35:10 just a link to the api-ref or such is fine I think
21:35:36 https://review.opendev.org/#/c/693937/2/releasenotes/notes/resize-api-cast-always-7eb1dbef8f7fe228.yaml
21:35:37 patch 693937 - nova - Make API always RPC cast to conductor for resize/m... (MERGED) - 2 patch sets
21:35:55 but just, I know from when I was at yahoo, no user knew about the instance actions API or needed to.
this is a change in thinking, at least for me. even myself as a user would not think to look at instance actions as part of my normal usage
21:36:01 mriedem: of course
21:36:24 make that last line a hyperlink, mriedem, and we're golden.
21:36:27 that's how a user knows anything other than current status though
21:36:52 at yahoo you probably had enough external plugin APIs to do all the backend reporting for operations so you didn't need the actions API :)
21:36:56 ok. sorry for the tangent. back to the point, if this is the place to let someone know why the resize failed, I think adding the error message to the instance action is a good idea
21:37:12 mriedem: haha, you'd think so, but no. it was just, instance is in ERROR, nova show the instance
21:38:23 ok i think these coffee shop people want me out and my battery is about to die so wrapping up?
21:38:34 okay
21:38:38 (luyao): Does vpmem live migration work need a blueprint?
21:38:41 I think we said "yes"
21:38:52 please
21:38:53 for the love
21:38:54 of crap
21:38:55 yes it does
21:38:56 because if nothing else it involves RPC obj changes
21:39:04 okay, anything else before we close?
21:39:07 I think blueprint needed is a given
21:39:21 I think efried meant spec
21:39:23 spec is the one we say is not always needed
21:39:30 oh ok
21:39:30 because data model changes, if nothing else
21:39:38 that's also old business
21:39:43 yes
21:39:46 we told them multinode CI with cold migrate as table stakes
21:39:48 then come talk to us
21:39:55 oh yeah, I remember now.
21:39:58 efried: please put me out of your misery
21:40:00 Didn't clear it from agenda
21:40:03 Thanks all. o/
21:40:03 #endmeeting
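[Editor's note, for readers who — like the users melwitt describes — have never poked at the actions API: a server's actions are listed via `GET /servers/{server_id}/os-instance-actions` (CLI: `openstack server event list <server>`), and each action carries per-step events like the `compute_prep_resize` one mriedem mentions. A sketch, with invented sample data, of scanning one action's events for failures; it illustrates mriedem's point that today an error event carries no details, which is exactly what his PoC would change:]

```python
def find_failed_events(action):
    """Return (event_name, details) pairs for the failed events in
    an instance action, shaped like an os-instance-actions API
    response. The 'details' value is the kind of field the PoC
    discussed above would populate; absent that, all a user learns
    is which step errored. Sample data below is invented."""
    return [
        (ev["event"], ev.get("details", "<no details exposed>"))
        for ev in action.get("events", [])
        if ev.get("result") == "Error"
    ]

# Hypothetical response for a resize that failed its prep step:
resize_action = {
    "action": "resize",
    "request_id": "req-hypothetical",
    "events": [
        {"event": "conductor_migrate_server", "result": "Success"},
        {"event": "compute_prep_resize", "result": "Error"},
    ],
}
```

With a server left ACTIVE after a failed resize (as discussed at 21:27:43), this event list is the only place the failure shows up, which is why the release-note "re-education" point matters.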