21:00:04 <melwitt> #startmeeting nova
21:00:05 <openstack> Meeting started Thu May 10 21:00:04 2018 UTC and is due to finish in 60 minutes. The chair is melwitt. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:08 <openstack> The meeting name has been set to 'nova'
21:00:11 <mriedem> o/
21:00:12 <tssurya> o/
21:00:13 <melwitt> hi everybody
21:00:15 <takashin> o/
21:00:19 <dansmith> o/
21:00:27 <edleafe> \o
21:00:47 <melwitt> let's get started
21:00:49 <efried> ō/
21:00:52 <melwitt> #topic Release News
21:00:58 <melwitt> #link Rocky release schedule: https://wiki.openstack.org/wiki/Nova/Rocky_Release_Schedule
21:01:13 <melwitt> we've got the summit coming up soon, the week after next
21:01:27 <cdent> hmm. I totally shouldn't be here. Oh well.
21:01:30 * cdent settles in
21:01:34 <melwitt> r-2 is June 7 which is also the spec freeze
21:01:53 <melwitt> current runway status:
21:01:57 <melwitt> #link Rocky review runways: https://etherpad.openstack.org/p/nova-runways-rocky
21:02:03 <melwitt> #link runway #1: XenAPI: Support a new image handler for non-FS based SRs [END DATE: 2018-05-11] series starting at https://review.openstack.org/#/c/497201
21:02:11 <melwitt> #link runway #2: Add z/VM driver [END DATE: 2018-05-15] spec amendment needed at https://review.openstack.org/562154 and implementation starting at https://review.openstack.org/523387
21:02:17 <melwitt> #link runway #3: Local disk serial numbers [END DATE: 2018-05-16] series starting at https://review.openstack.org/526346
21:03:10 <melwitt> anyone have anything to add for release news or runways?
21:03:48 <melwitt> #topic Bugs (stuck/critical)
21:03:59 <melwitt> no critical bugs
21:04:07 <melwitt> #link 34 new untriaged bugs (up 3 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
21:04:15 <melwitt> #link 9 untagged untriaged bugs: https://bugs.launchpad.net/nova/+bugs?field.tag=-*&field.status%3Alist=NEW
21:04:24 <mriedem> damn was down to 28 last night
21:04:42 <melwitt> wow, 6 new overnight then. that's a lot
21:04:52 <mriedem> https://bugs.launchpad.net/nova/+bug/1718439 isn't new but...
21:04:53 <openstack> Launchpad bug 1718439 in OpenStack Compute (nova) "Apparent lack of locking in conductor logs" [Undecided,New]
21:05:47 <melwitt> oh, okay got added to nova today
21:05:57 <mriedem> added back, i tried to pawn it off on oslo a year ago
21:06:02 <mriedem> anyway
21:06:21 <melwitt> okay
21:06:43 <melwitt> #link bug triage how-to: https://wiki.openstack.org/wiki/Nova/BugTriage#Tags
21:07:00 <melwitt> please lend a helping hand with bug triage if you can ^ and thanks to all who have been helping
21:07:28 <melwitt> Gate status:
21:07:32 <melwitt> #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
21:07:45 <melwitt> there have been a lot of job timeouts lately, I've noticed
21:08:26 <melwitt> one cool thing is that e-r (elastic recheck) is working again and commenting on reviews if a known gate bug is hit
21:08:49 <melwitt> 3rd party CI:
21:08:55 <melwitt> #link 3rd party CI status http://ci-watch.tintri.com/project?project=nova&time=7+days
21:09:28 <melwitt> the zKVM third party job is currently broken because of a consoles-related devstack change of mine
21:10:02 <melwitt> ML thread http://lists.openstack.org/pipermail/openstack-dev/2018-May/130331.html and a fix is proposed https://review.openstack.org/567298
21:10:24 <melwitt> does anyone have anything else on bugs, gate status, or third party CI?
21:11:04 <melwitt> #topic Reminders
21:11:14 <melwitt> #link Forum session moderators, create your etherpads and link them at https://wiki.openstack.org/wiki/Forum/Vancouver2018
21:11:20 <melwitt> #link ML post http://lists.openstack.org/pipermail/openstack-dev/2018-May/130316.html
21:11:45 <melwitt> friendly reminder to get your etherpads created and link them on the wiki so people can find 'em
21:11:59 <melwitt> #link Rocky Review Priorities https://etherpad.openstack.org/p/rocky-nova-priorities-tracking
21:12:26 <melwitt> reminder that the subteam and bugs etherpad is there for finding patches
21:12:33 <melwitt> anything else for reminders?
21:12:57 <melwitt> #topic Stable branch status
21:13:04 <melwitt> #link stable/queens: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/queens,n,z
21:13:15 <melwitt> #link stable/pike: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/pike,n,z
21:13:20 <melwitt> #link stable/ocata: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/ocata,n,z
21:13:35 <melwitt> #link We're going to do some stable releases to get some regression fixes out: https://etherpad.openstack.org/p/nova-stable-branch-status
21:14:15 <melwitt> looks like "libvirt: check image type before removing snapshots in _cleanup_resize" is the last thing we need before we do the releases
21:14:33 <melwitt> anything else on stable branch status?
21:14:56 <melwitt> #topic Subteam Highlights
21:15:16 <melwitt> we skipped the cells v2 meeting again because we didn't need a meeting. anything interesting to mention, dansmith?
21:15:23 <dansmith> negative ghostrider
21:15:28 <melwitt> k
21:15:38 <melwitt> edleafe, have some scheduler subteam news?
21:15:42 <edleafe> Gave an update on Nested Resource Providers progress to a Cyborg team rep
21:15:45 <edleafe> Bauzas remembered that he owes me a beer, to be paid in Vancouver
21:15:47 <edleafe> Confirmed the importance of merging the conversion of libvirt's get_inventory() method to update_provider_tree()
21:15:51 <edleafe> #link https://review.openstack.org/#/c/560444
21:15:53 <edleafe> This will be my last update. After running the Scheduler meeting for over two years, I'm stepping down.
21:15:56 <edleafe> EOM
21:16:58 <melwitt> glad to hear it sounds like the collab with the cyborg team is going smoothly
21:17:22 <edleafe> yeah, I want to keep them in the loop
21:17:34 <melwitt> and sorry to hear that, edleafe. thanks for chairing the meeting all these years
21:18:17 <melwitt> gibi has left some notes about the notifications meeting
21:18:26 <melwitt> "We talked about #link https://review.openstack.org/#/c/563269 Add notification support for trusted_certs and agreed that it is going in a good direction"
21:18:45 <melwitt> "We agreed that the notification requirement in #link https://review.openstack.org/#/c/554212 Add PENDING vm state can be fulfilled with existing notifications. But now I see that additional requirements are popping up in the spec, so I have to go back and re-review it."
21:19:19 <melwitt> anything else for subteams?
21:19:45 <melwitt> #topic Stuck Reviews
21:19:57 <melwitt> nothing in the agenda. anyone have anything for stuck reviews?
21:20:21 <melwitt> #topic Open discussion
21:20:27 <melwitt> couple of agenda items in here,
21:20:33 <melwitt> first one:
21:20:41 <melwitt> (wznoinsk): I wanted to raise a concern about not running some advanced tempest tests (they're part of the 'slow' type and hence are skipped): https://github.com/openstack/tempest/search?utf8=%E2%9C%93&q=%22type%3D%27slow%27%22&type=
21:21:05 <melwitt> are you around, wznoinsk?
21:21:14 <mriedem> we do run the slow tests in a job in the experimental queue
21:21:19 <mriedem> i always have to look up which one though
21:21:39 <melwitt> so they only run on-demand then? I wonder if that should be a daily periodic
21:21:48 <mriedem> for example, the encrypted volume tests are marked as slow
21:21:52 <mriedem> well,
21:22:01 <mriedem> i think i've also floated the idea of a job that only runs slow tests
21:22:14 <mriedem> since there are relatively few of them
21:22:35 <wznoinsk> melwitt, yes
21:22:37 <mriedem> they were removed from the main tempest-full job because of the time it was taking to run api, scenario, and slow tests
21:22:45 <melwitt> to run on every change, the slow tests you mean?
21:22:48 <mriedem> yes
21:22:57 <melwitt> ah, okay. so they used to be in tempest-full
21:23:09 <mriedem> i believe so, until sdague went on a rampage
21:23:13 <melwitt> yeah, I think they should be run once a day at the very least
21:23:22 <mriedem> we could even have a nova-slow job in-tree that just runs compute api and all slow scenario tests
21:23:36 <mriedem> i could put up something to see what that looks like and how long it takes
21:23:39 <melwitt> having a separate slow-only job on every change sounds fine too, given that we used to run them anyway as part of tempest-full
21:23:48 <mriedem> if it's only like 45 minutes, that's about half of a normal tempest-full run
21:23:50 <wznoinsk> melwitt, mriedem I don't think some of them are that slow... I can compare times but IIRC there wasn't anything time-exploding
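For context on the 'slow' type being discussed here: tempest tags its long-running tests with a test attribute, and jobs then include or exclude tests by that attribute. Below is a minimal sketch of how such a test is tagged, assuming tempest's tempest.lib.decorators API; the class, test body, and UUID are illustrative, not actual tempest code:

```python
# Illustrative tempest-style scenario test; only the decorator usage
# reflects the real tempest API, the test itself is hypothetical.
from tempest.lib import decorators
from tempest.scenario import manager


class EncryptedVolumeScenarioTest(manager.ScenarioTest):

    @decorators.attr(type='slow')  # excluded from tempest-full runs
    @decorators.idempotent_id('11111111-2222-3333-4444-555555555555')
    def test_boot_from_encrypted_volume(self):
        # A slow-only job like the proposed nova-slow would opt in to
        # tests carrying the 'slow' attribute instead of skipping them.
        pass
```

Selecting on the attribute rather than on test names means a slow-only job would pick up newly tagged tests automatically, which is presumably why a single nova-slow (or generic tempest-slow) job is attractive.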
21:23:53 <melwitt> okay, I think that would be a good idea
21:24:24 <mriedem> i think it's this one: legacy-tempest-dsvm-neutron-scenario-multinode-lvm-multibackend
21:24:39 <wznoinsk> melwitt, mriedem basically I found that because we don't run these slow tests upstream at all (or I couldn't find any check/gate job with them), some of them are basically broken
21:24:41 <mriedem> in nova experimental i mean
21:24:56 <mriedem> yeah, and we could be regressing encrypted volume support w/o knowing it
21:25:03 <melwitt> wznoinsk: gotcha. thanks for bringing it up, I didn't know about this
21:26:05 <melwitt> okay, so let's try out a nova-slow in-tree job and see if it's reasonable to run on all changes to guard against regressions
21:26:30 <melwitt> anything else on that topic?
21:26:33 <mriedem> #action mriedem to add nova-slow job
21:27:01 <melwitt> do I have to type that to make it go on the minutes?
21:27:02 <wznoinsk> mriedem, there are a few networking-related slow tests too
21:27:15 <mriedem> wznoinsk: sure, but nova changes don't care about slow provider network tests
21:28:01 <mriedem> we can argue in the review
21:28:03 <wznoinsk> mriedem, should I bring up this topic with the networking guys too then? or can it be somehow discussed between teams internally?
21:28:20 <mriedem> cinder would also likely want this
21:28:28 <mriedem> or we just do a generic tempest-slow job
21:28:33 <mriedem> should talk to gmann
21:28:41 <wznoinsk> mriedem++
21:28:55 <mriedem> wznoinsk: how about you talk to gmann :)
21:29:09 <wznoinsk> mriedem, I don't mind, will do
21:29:45 <melwitt> cool. let's move to the next agenda item:
21:30:00 <melwitt> (takashin): Abort Cold Migration #link https://review.openstack.org/#/c/334732/
21:30:22 <takashin> About the "Abort Cold Migration" function (spec), I got feedback from operators on the openstack-operators mailing list.
21:30:36 <takashin> #link http://lists.openstack.org/pipermail/openstack-operators/2018-May/015209.html
21:30:44 <takashin> #link http://lists.openstack.org/pipermail/openstack-operators/2018-May/015237.html
21:30:53 <takashin> I would like to proceed with the "List/show all server migration types" implementation that the function depends on.
21:31:00 <takashin> #link https://review.openstack.org/#/c/430608/
21:31:16 <mriedem> takashin: you got replies from some people from your company,
21:31:21 <mriedem> and i asked them some follow-up questions
21:31:47 <takashin> Yes.
21:32:27 <mriedem> so i personally would like to at least wait to hear those responses
21:32:58 <takashin> okay.
21:33:08 <melwitt> IIRC, this has been discussed long ago and the usual use case is cold migrating an instance with a large disk that will take forever, and wanting the ability to cancel it
21:33:30 <mriedem> sure, on non-shared storage
21:33:45 <mriedem> which was the one other reply in the ops list: they are rolling out a new deployment and, before investing in it, they use cheap local storage
21:33:45 <takashin> or it is stalled out
21:33:48 <mriedem> not ceph
21:34:03 <melwitt> right
21:34:21 <mriedem> i'd like more details in the ops list reply about the stall-out case
21:35:03 <mriedem> i don't really want to put a bunch of time into new apis and plumbing for one operator that hits some stall issue without details
21:36:19 <melwitt> fwiw, I've thought the use case sounds reasonable but that the implementation would be really error-prone and racy. I don't know if that could be solved with a simple task_state lockout
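For context on the "task_state lockout" melwitt mentions: nova guards instance state transitions with a compare-and-swap when saving, so an abort would have to win a race against the migration task itself. Here is a rough sketch of that idiom applied to a hypothetical abort call; abort_cold_migration and the 'aborting' marker state are invented for illustration, while save(expected_task_state=...) raising UnexpectedTaskStateError on a lost race is the existing nova pattern:

```python
# Hypothetical sketch of a task_state "lockout" for aborting a cold
# migration; the function and the 'aborting' state are invented.
from nova import exception
from nova.compute import task_states


def abort_cold_migration(instance):
    """Hypothetical abort entry point, not an existing nova API."""
    instance.task_state = 'aborting'  # invented marker state
    try:
        # Compare-and-swap: this save only succeeds if task_state was
        # still RESIZE_MIGRATING when we read the instance; if the
        # migration task moved on first, UnexpectedTaskStateError is
        # raised and nothing is persisted.
        instance.save(expected_task_state=task_states.RESIZE_MIGRATING)
    except exception.UnexpectedTaskStateError:
        # Lost the race: the disk copy already finished or failed, so
        # there is no longer anything to abort cleanly.
        raise
```

Even with this guard, the hard part the meeting raises remains: cleaning up a half-copied disk on the destination after a successful abort, which the compare-and-swap alone does not solve.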
21:37:04 <mriedem> we also can't reliably test this
21:37:12 <mriedem> without stubs
21:37:24 <mriedem> by test i'm talking tempest
21:37:41 <dansmith> cleanup from stuff like this is a pita
21:37:46 <dansmith> and if we can't test it (agree we can't), then it's going to be ugly
21:37:59 <dansmith> I know it sounds useful in principle,
21:38:13 <dansmith> but I think most people that care about this kind of thing use shared storage,
21:38:38 <dansmith> because allowing your users to move terabytes of data around your network and disks every time they want to resize is kinda ugly
21:38:55 <dansmith> the cancel and cleanup case for ceph, by the way,
21:39:20 <dansmith> is potentially even worse than the "stop a copy and roll back" case of copying the disk manually
21:39:49 <dansmith> so without some really compelling reason I'd love to not introduce this complication
21:39:50 <melwitt> you mean non-shared ceph?
21:40:22 <dansmith> no
21:41:10 <melwitt> okay, I thought in the shared storage case, you don't have to copy anything
21:41:20 <dansmith> right,
21:41:54 <dansmith> you've got a vastly smaller window of time where it makes sense to abort
21:42:38 <mriedem> i would expect anyone using ceph to not need this
21:42:40 <mriedem> so they won't use it
21:42:53 <dansmith> right, but, it'll be there.. a button to push
21:43:18 <melwitt> yeah ... I think that's a good summary of why we haven't gone forward with this in the past. really complex and anyone with large storage isn't going to be doing local storage
21:43:25 <melwitt> (if they want to migrate things)
21:44:31 <mriedem> so (1) wait for replies, (2) talk some more
21:44:40 <melwitt> yeah
21:44:41 <mriedem> (3) we're <2 weeks from the forum with real live operators in the room to harass
21:44:57 <mriedem> there will be a public cloud forum session,
21:44:58 <takashin> okay
21:45:01 <mriedem> where they talk about gaps and such,
21:45:07 <mriedem> we should get some feedback in that session for sure
21:45:16 <mriedem> i'll find their etherpad and add this
21:45:32 <melwitt> cool, thank you
21:45:39 <takashin> Thanks.
21:46:16 <mriedem> https://etherpad.openstack.org/p/YVR-publiccloud-wg-brainstorming btw
21:46:40 <melwitt> okay, so we'll await the replies to the ML thread and gain more understanding of what the "stall out" problem is, whether large disk local storage is being used, and why shared storage or boot-from-volume isn't being used
21:47:52 <melwitt> anything else on that topic? anyone have any other topics for open discussion?
21:48:14 <dansmith> END IT
21:48:27 <melwitt> :)
21:48:34 <melwitt> going once
21:48:44 <melwitt> going twice
21:48:56 <melwitt> ...
21:49:13 <melwitt> #endmeeting