21:00:00 <efried> #startmeeting nova
21:00:01 <openstack> Meeting started Thu Nov 21 21:00:00 2019 UTC and is due to finish in 60 minutes. The chair is efried. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 <openstack> The meeting name has been set to 'nova'
21:00:13 <mriedem> o/
21:00:29 <melwitt> o/
21:01:39 <efried> Hello both of you!
21:01:40 <bcm> o/ hello. Though I might chime in with some questions about a review if you have time
21:01:53 <efried> Hi bcm, glad to have you. Saw your name on the agenda.
21:02:16 <efried> though you better not let dansmith see that you put it in "stuck reviews" rather than "open discussion".
21:02:33 <efried> okay, might as well get started
21:02:40 <efried> #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
21:02:51 <bcm> efried: ah cool. Hmm. Thought it qualified as one. Ok
21:03:01 <efried> maybe it does. I'm mostly being facetious.
21:03:20 <bcm> :)
21:03:24 <efried> It's been nearly a month since we had a meeting, what with PTG/summit fore- and aftermath, DST shift (f DST) etc.
21:03:39 <efried> on that note:
21:03:39 <efried> #topic Last meeting (4 weeks ago)
21:03:39 <efried> #link Minutes from last meeting: http://eavesdrop.openstack.org/meetings/nova/2019/nova.2019-10-24-21.00.html
21:03:45 <efried> any ol biniss?
21:15:28 <efried> Is there anything else we can accomplish herenow?
21:15:36 <bcm> So, and pardon my ignorance, but does the current review look like it could be made acceptable
21:15:37 <efried> wrt that patch, bcm?
21:15:57 <bcm> or are substantial changes required to achieve something thats "right" ?
21:16:56 <bcm> Just wondering what the best course of action is going forward to close off the bug
21:17:02 <efried> (I definitely don't have enough background/context to be able to weigh in btw, I'm just sitting here with the gavel)
21:17:23 <bcm> mriedem any thoughts? you seem pretty well versed in this area.
21:17:47 <efried> It looks like lyarwood was tracking this for a while; maybe he could be poked again now that we're post-ptg?
21:18:01 <mriedem> i would have to load this all back in my head,
21:18:14 <mriedem> but i saw a few people flounder for over a year trying to get CI working
21:18:19 <mriedem> so i'm not sure i trust what it's donig
21:18:21 <mriedem> *doing
21:18:29 <bcm> right ok
21:18:52 <mriedem> need gorka to weigh in from a cinder pov
21:18:59 <bcm> efried: when/where is lyarwood usually contactable?
21:19:07 <mriedem> since he basically told me what i was doing in my change, which was similar to this, was wrong
21:19:13 <mriedem> bcm: he's UK
21:19:24 <efried> For the sake of the minutes: Bug #1452641
21:19:24 <mriedem> or ireland?
21:19:27 <melwitt> UK
21:19:36 <mriedem> same thing am i right?!
21:19:38 <bcm> mriedem: I did bring this up in the cinder meeting yesterday, but they didn't have much to say
21:19:39 <efried> hello? bug 1452641
21:19:42 * mriedem dodges a bottle
21:19:45 <melwitt> lol
21:19:57 <efried> bots are ill.
21:20:15 <efried> #link Static Ceph mon IP addresses in connection_info can prevent VM startup https://bugs.launchpad.net/nova/+bug/1452641
21:20:49 <efried> anyway, yeah, when I see things about ceph, I think of lyarwood and maaaybe dansmith.
21:21:12 <openstack> bug 1452641 in nova (Ubuntu) "Static Ceph mon IP addresses in connection_info can prevent VM startup" [Medium,In progress] https://launchpad.net/bugs/1452641 - Assigned to Corey Bryant (corey.bryant)
21:21:17 <efried> nice
21:21:26 <efried> bcm: perhaps another way to resurrect interest would be to respond to that ML thread.
21:21:27 <melwitt> I'm the ceph janitor, I watch the ceph job and help fix it when it breaks
21:21:49 <efried> with your handy dandy bottle of ceph turpentine
21:22:48 <efried> Okay, I guess we should move on, you good for now bcm?
21:22:52 <bcm> efried: ok sounds good
21:22:54 <bcm> yep i'm good
21:23:01 <efried> k
21:23:02 <bcm> thanks efried mriedem and all
21:23:15 <efried> #topic Open discussion
21:23:31 <efried> (mriedem): Thoughts on exposing exception type to non-admins in instance action event http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010775.html
21:23:32 <efried> Is this still open mriedem?
21:23:48 <mriedem> i have some poc patches up,
21:23:53 <mriedem> mdbooth and melwitt commented
21:24:12 <mriedem> there is some work to do if it's desirable but i won't push on it unless people think it's useful
21:24:39 <melwitt> this email says "store type" but the patch is storing the entire message?
21:25:04 <melwitt> I'm not sure whether it's useful bc I'm not that familiar with ppl using instance action events
21:25:06 <mriedem> yeah i changed it b/c the feeling in the ML thread was that just the type isn't useful since you need a decoder to know wtf the types mean
21:25:11 <melwitt> ah ok
21:25:19 <mriedem> so i changed to store the same thing as the fault message we store
21:25:21 <melwitt> if someone thinks it's useful, sure why not
21:25:45 <melwitt> but I haven't heard anything from our customers about that before
21:25:45 <mriedem> the impetus was because we changed resize/cold migrate to do an rpc cast rather than call from the api to conductor,
21:25:56 <efried> Heck, if it's the whole message, I'm totally on board. I thought we couldn't do that for some kind of security reason tho.
21:25:59 <mriedem> so we lost the "fast" novalidhost scheduling failures you could get for resize and cold migarte
21:26:10 <mriedem> so i figured, you can just get that info from instance actoins,
21:26:21 <mriedem> but they don't contain details, just that some part of the operation failed
21:26:48 <mriedem> efried: security wise it's no worse than anything we expose out of the server fault message if the server is in ERROR status
21:27:01 <melwitt> don't you normally just nova show the instance and see the details?
21:27:04 <mriedem> which was mdbooth's concern (leaking details) but melwitt pointed out ^ we already have that problem
21:27:12 <mriedem> melwitt: for what? faults?
21:27:23 <melwitt> yeah, if resize failed
21:27:33 <mriedem> we only expose the server fault if the server status is ERROR or DELETED
21:27:37 <melwitt> don't you get error instance and look at fault?
21:27:43 <mriedem> if you fail scheduling during resize, the server status doesn't go to ERROR
21:27:47 <melwitt> oh I see ok
21:27:57 <mriedem> no reason to error out the server if we didnt' touch it
21:28:03 <mriedem> that's what instance actions are for
21:28:03 <melwitt> so... the user doesn't know the resize failed?
21:28:30 <mriedem> they don't get a 400 response for NoValidHost during scheduling anymore
21:28:45 <mriedem> but they wouldn't know if say the resize_claim on the dest compute failed either (from the API response)
21:28:56 <mriedem> actions are for the results of parts of the operatoin
21:29:05 <mriedem> so there is a compute_prep_resize event for example
21:29:16 <mriedem> either that is success or error, but if error there are no details
21:29:16 <dansmith> we were doing scheduling synchronously with the call to resize before,
21:29:20 <mriedem> hence my proposal
21:29:26 <dansmith> which backs us into a corner for complex things like cross-cell resize
21:29:43 <melwitt> I think I'm gonna have to read up on this. bc on the surface it sounds like we lost the ability to tell the user their resize failed
21:29:52 <dansmith> no,
21:30:05 <dansmith> only only signal for one situation became less obvious
21:30:33 <dansmith> mriedem: for that matter, the async schedule task could also just error the instance just like a failure later in resize would right?
21:31:00 <mriedem> dansmith: depends on where the resize fails
21:31:07 <mriedem> if we haven't touched the guest, we don't put the instance into error status
21:31:21 <mriedem> if finish_resize on the dest fails, yes the instance goes to ERROR status
21:31:24 <melwitt> yeah I'm just... this goes back to the whole task API thing. in a lot of cases we kind of have to put the instance into ERROR just to signal that something didn't work
21:31:24 <mriedem> if resize_claim fails, we don't
21:31:27 <dansmith> okay, then nevermind
21:31:36 <melwitt> even if we didn't "touch" the instance
21:31:41 <mriedem> melwitt: actions are our poor version of the tasks api
21:31:49 <dansmith> yeah, actions are fine for this
21:32:03 <dansmith> it would be better if they were used more for determining success
21:32:07 <melwitt> how does someone know to go look at actions? because that's not a normal part of most users workflow as far as I have known
21:32:23 <dansmith> melwitt: it should be
21:32:27 <melwitt> actions used to be this forgotten API
21:32:34 <mriedem> right, it should be,
21:32:34 <melwitt> ok... I dunno about that
21:32:41 <mriedem> i write actions checking into my functional tests quite a bit
21:32:44 <melwitt> that's gonna take some user re-education
21:32:58 <mriedem> especially b/c end users can't listen for notifications
21:33:05 <melwitt> put it in the release notes, "you really need to look at this now"
21:33:06 <mriedem> like we do in a lot of our functional tests
21:33:07 <dansmith> melwitt: well, the problem is before actions, we had a very inconsistent reporting, where only severe errors could really be signaled,
21:33:12 <dansmith> and we have things like this,
21:33:23 <efried> If that's what it's for, and it's a matter of education, then we really shouldn't paper over it by making it unnecessary again.
21:33:29 <dansmith> where the architecture of the system relies on returning a synchronous result for a scheduling operation which could take minutes
21:33:30 <mriedem> we've also had innumerable bugs like "server goes to ERROR status even though nothing changed"
21:33:30 <efried> instead we should respond to such bugs with said education.
21:33:38 <mriedem> and some clouds get a page when something goes to ERROR
21:34:28 <melwitt> I'm fine with that, just yeah, want the heads up in the release notes. at least MHO
21:34:47 <dansmith> mriedem's change from sync to async on this one thing did have a reno
21:34:50 <melwitt> *fine with re-education being the way forward. just needs to be communicated
21:34:56 <dansmith> I dunno if it provided a handholding recipe for checking actions, but..
21:35:10 <melwitt> just a link to the api-ref or such is fine I think
21:35:36 <mriedem> https://review.opendev.org/#/c/693937/2/releasenotes/notes/resize-api-cast-always-7eb1dbef8f7fe228.yaml
21:35:37 <patchbot> patch 693937 - nova - Make API always RPC cast to conductor for resize/m... (MERGED) - 2 patch sets
21:35:55 <melwitt> but just, I know from when I was at yahoo, no user knew about the instance actions API or needed to. this is a change in thinking, at least for me. even myself as a user, would not think to look at instance actions as part of my normal usage
21:36:01 <dansmith> mriedem: of course
21:36:24 <efried> make that last line a hyperlink mriedem and we're golden.
21:36:27 <dansmith> that's how a user knows anything other than current status though
21:36:52 <mriedem> at yahoo you probably had enough external plugin APIs to do all the backend reporting for operations so you didn't need the actoins API :)
21:36:56 <melwitt> ok. sorry for the tangent. back to the point, if this is the place to let someone know why the resize failed, I think adding the error message to the instance action is a good idea
21:37:12 <melwitt> mriedem: haha, you'd think so, but no. it was just, instance is in ERROR, nova show the instance
21:38:23 <mriedem> ok i think these coffee shop people want me out and my battery is about to die so wrapping up?
21:38:34 <efried> okay
21:38:38 <efried> (luyao): Does vpmem live migration work need a blueprint?
21:38:41 <efried> I think we said "yes"
21:38:52 <dansmith> please
21:38:53 <dansmith> for the love
21:38:54 <dansmith> of crap
21:38:55 <dansmith> yes it does
21:38:56 <efried> because if nothing else it involves RPC obj changes
21:39:04 <efried> okay, anything else before we close?
21:39:07 <melwitt> I think blueprint needed is a given
21:39:21 <dansmith> I think efried meant spec
21:39:23 <melwitt> spec is the one we say not always needed
21:39:30 <melwitt> oh ok
21:39:30 <dansmith> because data model changes, if nothing else
21:39:38 <mriedem> that's also old business
21:39:43 <dansmith> yes
21:39:46 <mriedem> we told them multinode ci with cold migrate as table stakes
21:39:48 <mriedem> then come talk to us
21:39:55 <efried> oh yeah, I remember now.
21:39:58 <dansmith> efried: please put me out of your misery
21:40:00 <efried> Didn't clear it from agenda
21:40:03 <efried> Thanks all. o/
21:40:03 <efried> #endmeeting
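
Editor's note on the "Open discussion" topic: the workflow discussed there (after an async resize, look at the instance action events instead of the server status) can be sketched roughly as below. This is a minimal sketch, not nova code. It assumes the action body looks like the `instanceAction` document returned by the compute API's os-instance-actions show call, with an `events` list whose entries carry `event` and `result`. The `details` key stands in for the fault-style message proposed in the meeting and is an assumption here, as are the function name `failed_events` and the sample data, which is made up.

```python
def failed_events(action_body):
    """Return (event name, details) pairs for events whose result is "Error".

    `action_body` is assumed to be the decoded JSON of a single instance
    action. The "details" key is hypothetical: it represents the message
    proposed in the discussion and may be absent, in which case None is used.
    """
    events = action_body.get("instanceAction", {}).get("events", [])
    return [
        (event.get("event"), event.get("details"))
        for event in events
        if event.get("result") == "Error"
    ]


# Made-up sample: one failed event and one successful event (hypothetical name).
sample = {
    "instanceAction": {
        "action": "resize",
        "events": [
            {
                "event": "compute_prep_resize",
                "result": "Error",
                "details": "example fault message",
            },
            {"event": "hypothetical_other_event", "result": "Success"},
        ],
    }
}

for name, details in failed_events(sample):
    print(f"{name} failed: {details}")
```

A caller would fetch the action body from the API with whatever client it already uses; the helper only inspects the decoded response, so it works the same whether or not the message field is present.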