21:00:00 <efried> #startmeeting nova
21:00:01 <openstack> Meeting started Thu Nov 21 21:00:00 2019 UTC and is due to finish in 60 minutes.  The chair is efried. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 <openstack> The meeting name has been set to 'nova'
21:00:13 <mriedem> o/
21:00:29 <melwitt> o/
21:01:39 <efried> Hello both of you!
21:01:40 <bcm> o/ hello. Though I might chime in with some questions about a review if you have time
21:01:53 <efried> Hi bcm, glad to have you. Saw your name on the agenda.
21:02:16 <efried> though you better not let dansmith see that you put it in "stuck reviews" rather than "open discussion".
21:02:33 <efried> okay, might as well get started
21:02:40 <efried> #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
21:02:51 <bcm> efried: ah cool. Hmm. Thought it qualified as one. Ok
21:03:01 <efried> maybe it does. I'm mostly being facetious.
21:03:20 <bcm> :)
21:03:24 <efried> It's been nearly a month since we had a meeting, what with PTG/summit fore- and aftermath, DST shift (f DST) etc.
21:03:39 <efried> on that note:
21:03:39 <efried> #topic Last meeting (4 weeks ago)
21:03:39 <efried> #link Minutes from last meeting: http://eavesdrop.openstack.org/meetings/nova/2019/nova.2019-10-24-21.00.html
21:03:45 <efried> any ol biniss?
21:15:28 <efried> Is there anything else we can accomplish here now?
21:15:36 <bcm> So, and pardon my ignorance, but does the current review look like it could be made acceptable
21:15:37 <efried> wrt that patch, bcm?
21:15:57 <bcm> or are substantial changes required to achieve something that's "right" ?
21:16:56 <bcm> Just wondering what the best course of action is going forward to close off the bug
21:17:02 <efried> (I definitely don't have enough background/context to be able to weigh in btw, I'm just sitting here with the gavel)
21:17:23 <bcm> mriedem any thoughts? you seem pretty well versed in this area.
21:17:47 <efried> It looks like lyarwood was tracking this for a while; maybe he could be poked again now that we're post-ptg?
21:18:01 <mriedem> i would have to load this all back in my head,
21:18:14 <mriedem> but i saw a few people flounder for over a year trying to get CI working
21:18:19 <mriedem> so i'm not sure i trust what it's doing
21:18:29 <bcm> right ok
21:18:52 <mriedem> need gorka to weigh in from a cinder pov
21:18:59 <bcm> efried: when/where is lyarwood  usually contactable?
21:19:07 <mriedem> since he basically told me what i was doing in my change, which was similar to this, was wrong
21:19:13 <mriedem> bcm: he's UK
21:19:24 <efried> For the sake of the minutes: Bug #1452641
21:19:24 <mriedem> or ireland?
21:19:27 <melwitt> UK
21:19:36 <mriedem> same thing am i right?!
21:19:38 <bcm> mriedem: I did bring this up in the cinder meeting yesterday, but they didn't have much to say
21:19:39 <efried> hello? bug 1452641
21:19:42 * mriedem dodges a bottle
21:19:45 <melwitt> lol
21:19:57 <efried> bots are ill.
21:20:15 <efried> #link Static Ceph mon IP addresses in connection_info can prevent VM startup https://bugs.launchpad.net/nova/+bug/1452641
21:20:49 <efried> anyway, yeah, when I see things about ceph, I think of lyarwood and maaaybe dansmith.
21:21:12 <openstack> bug 1452641 in nova (Ubuntu) "Static Ceph mon IP addresses in connection_info can prevent VM startup" [Medium,In progress] https://launchpad.net/bugs/1452641 - Assigned to Corey Bryant (corey.bryant)
21:21:17 <efried> nice
21:21:26 <efried> bcm: perhaps another way to resurrect interest would be to respond to that ML thread.
21:21:27 <melwitt> I'm the ceph janitor, I watch the ceph job and help fix it when it breaks
21:21:49 <efried> with your handy dandy bottle of ceph turpentine
21:22:48 <efried> Okay, I guess we should move on, you good for now bcm?
21:22:52 <bcm> efried: ok sounds good
21:22:54 <bcm> yep i'm good
21:23:01 <efried> k
21:23:02 <bcm> thanks efried mriedem  and all
21:23:15 <efried> #topic Open discussion
21:23:31 <efried> (mriedem): Thoughts on exposing exception type to non-admins in instance action event http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010775.html
21:23:32 <efried> Is this still open mriedem?
21:23:48 <mriedem> i have some poc patches up,
21:23:53 <mriedem> mdbooth and melwitt commented
21:24:12 <mriedem> there is some work to do if it's desirable but i won't push on it unless people think it's useful
21:24:39 <melwitt> this email says "store type" but the patch is storing the entire message?
21:25:04 <melwitt> I'm not sure whether it's useful bc I'm not that familiar with ppl using instance action events
21:25:06 <mriedem> yeah i changed it b/c the feeling in the ML thread was that just the type isn't useful since you need a decoder to know wtf the types mean
21:25:11 <melwitt> ah ok
21:25:19 <mriedem> so i changed to store the same thing as the fault message we store
21:25:21 <melwitt> if someone thinks it's useful, sure why not
21:25:45 <melwitt> but I haven't heard anything from our customers about that before
21:25:45 <mriedem> the impetus was because we changed resize/cold migrate to do an rpc cast rather than call from the api to conductor,
21:25:56 <efried> Heck, if it's the whole message, I'm totally on board. I thought we couldn't do that for some kind of security reason tho.
21:25:59 <mriedem> so we lost the "fast" novalidhost scheduling failures you could get for resize and cold migrate
21:26:10 <mriedem> so i figured, you can just get that info from instance actions,
21:26:21 <mriedem> but they don't contain details, just that some part of the operation failed
21:26:48 <mriedem> efried: security wise it's no worse than anything we expose out of the server fault message if the server is in ERROR status
21:27:01 <melwitt> don't you normally just nova show the instance and see the details?
21:27:04 <mriedem> which was mdbooth's concern (leaking details) but melwitt pointed out ^ we already have that problem
21:27:12 <mriedem> melwitt: for what? faults?
21:27:23 <melwitt> yeah, if resize failed
21:27:33 <mriedem> we only expose the server fault if the server status is ERROR or DELETED
21:27:37 <melwitt> don't you get error instance and look at fault?
21:27:43 <mriedem> if you fail scheduling during resize, the server status doesn't go to ERROR
21:27:47 <melwitt> oh I see ok
21:27:57 <mriedem> no reason to error out the server if we didn't touch it
21:28:03 <mriedem> that's what instance actions are for
21:28:03 <melwitt> so... the user doesn't know the resize failed?
21:28:30 <mriedem> they don't get a 400 response for NoValidHost during scheduling anymore
21:28:45 <mriedem> but they wouldn't know if say the resize_claim on the dest compute failed either (from the API response)
21:28:56 <mriedem> actions are for the results of parts of the operation
21:29:05 <mriedem> so there is a compute_prep_resize event for example
21:29:16 <mriedem> either that is success or error, but if error there are no details
21:29:16 <dansmith> we were doing scheduling synchronously with the call to resize before,
21:29:20 <mriedem> hence my proposal
21:29:26 <dansmith> which backs us into a corner for complex things like cross-cell resize
21:29:43 <melwitt> I think I'm gonna have to read up on this. bc on the surface it sounds like we lost the ability to tell the user their resize failed
21:29:52 <dansmith> no,
21:30:05 <dansmith> our only signal for one situation became less obvious
21:30:33 <dansmith> mriedem: for that matter, the async schedule task could also just error the instance just like a failure later in resize would right?
21:31:00 <mriedem> dansmith: depends on where the resize fails
21:31:07 <mriedem> if we haven't touched the guest, we don't put the instance into error status
21:31:21 <mriedem> if finish_resize on the dest fails, yes the instance goes to ERROR status
21:31:24 <melwitt> yeah I'm just... this goes back to the whole task API thing. in a lot of cases we kind of have to put the instance into ERROR just to signal that something didn't work
21:31:24 <mriedem> if resize_claim fails, we don't
21:31:27 <dansmith> okay, then nevermind
21:31:36 <melwitt> even if we didn't "touch" the instance
21:31:41 <mriedem> melwitt: actions are our poor version of the tasks api
21:31:49 <dansmith> yeah, actions are fine for this
21:32:03 <dansmith> it would be better if they were used more for determining success
21:32:07 <melwitt> how does someone know to go look at actions? because that's not a normal part of most users workflow as far as I have known
21:32:23 <dansmith> melwitt: it should be
21:32:27 <melwitt> actions used to be this forgotten API
21:32:34 <mriedem> right, it should be,
21:32:34 <melwitt> ok... I dunno about that
21:32:41 <mriedem> i write actions checking into my functional tests quite a bit
21:32:44 <melwitt> that's gonna take some user re-education
21:32:58 <mriedem> especially b/c end users can't listen for notifications
21:33:05 <melwitt> put it in the release notes, "you really need to look at this now"
21:33:06 <mriedem> like we do in a lot of our functional tests
21:33:07 <dansmith> melwitt: well, the problem is before actions, we had a very inconsistent reporting, where only severe errors could really be signaled,
21:33:12 <dansmith> and we have things like this,
21:33:23 <efried> If that's what it's for, and it's a matter of education, then we really shouldn't paper over it by making it unnecessary again.
21:33:29 <dansmith> where the architecture of the system relies on returning a synchronous result for a scheduling operation which could take minutes
21:33:30 <mriedem> we've also had innumerable bugs like "server goes to ERROR status even though nothing changed"
21:33:30 <efried> instead we should respond to such bugs with said education.
21:33:38 <mriedem> and some clouds get a page when something goes to ERROR
21:34:28 <melwitt> I'm fine with that, just yeah, want the heads up in the release notes. at least MHO
21:34:47 <dansmith> mriedem's change from sync to async on this one thing did have a reno
21:34:50 <melwitt> *fine with re-education being the way forward. just needs to be communicated
21:34:56 <dansmith> I dunno if it provided a handholding recipe for checking actions, but..
21:35:10 <melwitt> just a link to the api-ref or such is fine I think
21:35:36 <mriedem> https://review.opendev.org/#/c/693937/2/releasenotes/notes/resize-api-cast-always-7eb1dbef8f7fe228.yaml
21:35:37 <patchbot> patch 693937 - nova - Make API always RPC cast to conductor for resize/m... (MERGED) - 2 patch sets
21:35:55 <melwitt> but just, I know from when I was at yahoo, no user knew about the instance actions API or needed to. this is a change in thinking, at least for me. even myself as a user, would not think to look at instance actions as part of my normal usage
21:36:01 <dansmith> mriedem: of course
21:36:24 <efried> make that last line a hyperlink mriedem and we're golden.
21:36:27 <dansmith> that's how a user knows anything other than current status though
21:36:52 <mriedem> at yahoo you probably had enough external plugin APIs to do all the backend reporting for operations so you didn't need the actions API :)
21:36:56 <melwitt> ok. sorry for the tangent. back to the point, if this is the place to let someone know why the resize failed, I think adding the error message to the instance action is a good idea
21:37:12 <melwitt> mriedem: haha, you'd think so, but no. it was just, instance is in ERROR, nova show the instance
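[Editor's note: the proposal discussed above — storing the fault message on the failed instance action event and exposing it to non-admins the same way the server fault message is exposed — can be sketched as below. The record layout, function names, and policy check are illustrative assumptions, not nova's actual data model or code.]

```python
# Rough sketch of the instance-action-event fault proposal from the
# discussion above. Everything here (dict layout, function names, the
# is_admin flag) is hypothetical; nova's real objects differ.

def finish_event(event, exc):
    """Mark an action event failed, storing the fault message.

    Stores the exception's message rather than just its type, since a
    bare type name needs a "decoder" to mean anything to users.
    """
    event["result"] = "Error"
    event["details"] = str(exc)
    return event

def view_event(event, is_admin=False):
    """Build the API view of an event.

    Non-admins only see details for failed events, mirroring how the
    server fault message is only shown when the server is in
    ERROR/DELETED status.
    """
    view = {"event": event.get("event"), "result": event.get("result")}
    if is_admin or event.get("result") == "Error":
        view["details"] = event.get("details")
    return view
```

This would let a user who got a 202 back from an async resize later inspect the action events and see e.g. "No valid host was found." instead of only "Error".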
21:38:23 <mriedem> ok i think these coffee shop people want me out and my battery is about to die so wrapping up?
21:38:34 <efried> okay
21:38:38 <efried> (luyao): Does vpmem live migration work need a blueprint?
21:38:41 <efried> I think we said "yes"
21:38:52 <dansmith> please
21:38:53 <dansmith> for the love
21:38:54 <dansmith> of crap
21:38:55 <dansmith> yes it does
21:38:56 <efried> because if nothing else it involves RPC obj changes
21:39:04 <efried> okay, anything else before we close?
21:39:07 <melwitt> I think blueprint needed is a given
21:39:21 <dansmith> I think efried meant spec
21:39:23 <melwitt> spec is the one we say not always needed
21:39:30 <melwitt> oh ok
21:39:30 <dansmith> because data model changes, if nothing else
21:39:38 <mriedem> that's also old business
21:39:43 <dansmith> yes
21:39:46 <mriedem> we told them multinode ci with cold migrate as table stakes
21:39:48 <mriedem> then come talk to us
21:39:55 <efried> oh yeah, I remember now.
21:39:58 <dansmith> efried: please put me out of your misery
21:40:00 <efried> Didn't clear it from agenda
21:40:03 <efried> Thanks all. o/
21:40:03 <efried> #endmeeting