14:00:04 <gibi> #startmeeting nova
14:00:05 <openstack> Meeting started Thu Nov 29 14:00:04 2018 UTC and is due to finish in 60 minutes.  The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:08 <openstack> The meeting name has been set to 'nova'
14:00:23 <takashin> o/
14:00:40 <gibi> hi
14:00:42 <gmann> o/
14:00:45 <edleafe> \o
14:01:30 <mdbooth> o/
14:01:40 <gibi> let's get started
14:01:48 <gibi> #topic Release News
14:01:52 * bauzas waves late
14:01:54 <efried> o/
14:01:54 <gibi> #link Stein release schedule: https://wiki.openstack.org/wiki/Nova/Stein_Release_Schedule
14:02:08 <gibi> next milestone is 10th of January
14:02:33 <gibi> #link Stein runway etherpad: https://etherpad.openstack.org/p/nova-runways-stein
14:02:38 <gibi> #link runway #1: https://blueprints.launchpad.net/nova/+spec/io-semaphore-for-concurrent-disk-ops (jackding) [END 2018-12-04] https://review.openstack.org/#/c/609180/ (approved Nov 27, currently in the gate queue)
14:02:42 <gibi> #link runway #2: https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree (bauzas/naichuan) [END 2018-12-04]
14:02:45 <gibi> https://review.openstack.org/#/c/599208/ libvirt: implement reshaper for vgpu (bauzas)
14:02:48 <gibi> https://review.openstack.org/#/c/521041/ xenapi(N-R-P): support compute node resource provider update (naichuan)
14:02:51 <gibi> #link runway #3 https://blueprints.launchpad.net/nova/+spec/initial-allocation-ratios (yikun) [END 2018-12-10] starts here https://review.openstack.org/#/c/602803
14:03:17 <gibi> hopefully the io semaphore item goes through CI soon and then one slot can be freed
14:03:46 <gibi> any comments about release or runways?
14:04:22 <gibi> #topic Bugs
14:04:26 <gibi> No critical bugs
14:04:34 <gibi> #link 54 new untriaged bugs (down 4 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
14:04:37 <gibi> #link 9 untagged untriaged bugs (up 2 since the last meeting): https://bugs.launchpad.net/nova/+bugs?field.tag=-*&field.status%3Alist=NEW
14:04:40 <gibi> #link bug triage how-to: https://wiki.openstack.org/wiki/Nova/BugTriage#Tags
14:04:43 <gibi> #help need help with bug triage
14:05:14 <gibi> any comments on the bug situation?
14:05:31 <cdent> I'm trying to stir up some bug triage help internally, but everyone is busy :(
14:05:51 <gibi> thanks cdent
14:06:02 <cdent> seems same story everywhere
14:06:16 <gibi> seems so
14:06:23 <gibi> #topic Gate status
14:06:30 <gibi> #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
14:06:43 <gibi> there are some mirror issues as far as I can see
14:07:20 <gibi> it seems http://status.openstack.org/elastic-recheck/index.html#1449136 is a heavy hitter
14:07:46 <gibi> #link 3rd party CI status http://ci-watch.tintri.com/project?project=nova&time=7+days
14:07:57 <gibi> the link ^^ does not open for me
14:08:25 <gibi> any comments about the gate or 3rd party CI?
14:08:27 <bauzas> that depends
14:08:31 <bauzas> it can take a long time to open
14:08:57 <gibi> bauzas: it times out for me today
14:09:03 <bauzas> hmm, right
14:09:16 <efried> yeah, I heard whoever was maintaining it isn't anymore.
14:09:46 <efried> I (or someone) started a conversation in -infra, possibly with the thought of getting someone to take over / replace that with something equivalent
14:09:50 <efried> but I'm not sure if that panned out.
14:09:56 <efried> That was several weeks ago, before chaos.
14:10:05 <bauzas> I have a very bad DSL connection at home, so I'm not really the right guy to tell whether it's a problem or not
14:10:05 <gibi> efried: thanks for the info
14:10:37 <gibi> #topic Reminders
14:10:43 <gibi> #link Stein Subteam Patches n Bugs: https://etherpad.openstack.org/p/stein-nova-subteam-tracking
14:10:48 <gibi> any other reminders?
14:11:21 <gibi> #topic Stable branch status
14:11:25 <gibi> #link stable/rocky: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/rocky,n,z
14:11:28 <gibi> #link stable/queens: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/queens,n,z
14:11:31 <gibi> #link stable/pike: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/pike,n,z
14:11:44 <gibi> there is a list of patches waiting for a second stable core
14:12:26 <gibi> any comment on stable?
14:13:01 <sean-k-mooney> has there been any progress on the oslo.service issue
14:13:01 <gmann> FYI, nova-next does not run on queens. i have pushed the backport
14:13:19 <gmann> #link https://review.openstack.org/#/q/topic:nova-next-queens+(status:open+OR+status:merged)
14:13:43 <gmann> and matt backported the devstack patch
14:15:02 <gibi> sean-k-mooney: this is the last mail in the ML http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000350.html
14:15:22 <gibi> sean-k-mooney: I did not follow the issue closely
14:16:09 <efried> I'm happy with any of the proposed solutions, so I'm sitting back and letting others sort it out.
14:16:12 <gibi> gmann: thanks
14:16:28 <sean-k-mooney> it looks like a new version of oslo.service is proposed to fix it https://review.openstack.org/#/c/620913/
14:16:32 <efried> but I do feel kinda guilty for causing the whole debacle.
14:18:10 <sean-k-mooney> ok i think we can assume that it is progressing
14:18:20 <gibi> OK, moving on
14:18:23 <gibi> #topic Subteam Highlights
14:18:32 <gibi> Scheduler (efried)
14:18:50 <efried> #link n-sch meeting minutes http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:18:53 <efried> Lots of specs need TLC from authors
14:18:57 <efried> Extraction is proceeding well. The
14:18:57 <efried> #link devstack change https://review.openstack.org/#/c/600162/
14:18:57 <efried> has merged since the meeting
14:19:05 <efried> #link data migrations fix from tetsuro #link https://review.openstack.org/#/c/619126/
14:19:19 <efried> Lots of small/easy placement changes that could use a look to clean things up.
14:19:19 <efried> #link placement open patches https://review.openstack.org/#/q/project:openstack/placement+status:open
14:19:19 <efried> Especially
14:19:19 <efried> #link integrated template https://review.openstack.org/#/c/617565/
14:19:19 <efried> which has since merged
14:19:23 <efried> Reshaper patches still need review
14:19:23 <efried> #link libvirt reshaper https://review.openstack.org/#/c/599208/
14:19:23 <efried> #link xen reshaper (middle of series) https://review.openstack.org/#/c/521041
14:19:26 <efried> FFU framework for reshaper: we will consciously continue kicking this can down the road.
14:19:29 <efried> Educated tetsuro on flamethrowers
14:19:32 <efried> END
14:19:38 <gibi> efried: thanks
14:19:48 <edleafe> Important education
14:20:04 <gibi> API (gmann)
14:20:07 <gmann> No office hour this week
14:20:09 <gmann> Triaged 5 bugs during that time.
14:20:15 <gmann> Updated the subteam tracking etherpad #link https://etherpad.openstack.org/p/stein-nova-subteam-tracking
14:20:24 <gmann> Detail status of this week #link http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000281.html
14:20:27 <gmann> END
14:20:30 <gibi> gmann: thanks
14:20:51 <gibi> any other subteam report?
14:21:14 <gibi> #topic Stuck Reviews
14:21:22 <gibi> we have one on the agenda
14:21:23 <gibi> (mriedem): Need to figure out what to do about the fix for gate bug 1798688: https://review.openstack.org/#/c/617040/ - do we whack the mole in the compute code or detect and retry in the scheduler?
14:21:25 <openstack> bug 1798688 in OpenStack Compute (nova) "AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim" [High,In progress] https://launchpad.net/bugs/1798688 - Assigned to Matt Riedemann (mriedem)
14:22:15 <gibi> This is basically a race condition between shelve offload and unshelve
14:23:10 <gibi> detected by the consumer generation
14:23:10 <gibi> in placement
14:23:13 <gibi> efried had -1 on the code and I also left an alternative proposal there this morning
14:23:53 <gibi> the currently proposed solution is a retry in the report client
14:23:55 <cdent> I think I prefer option 2 as well. Without matt or dan here, it's hard to move along
14:24:06 <bauzas> I don't have context
14:24:27 <gibi> cdent: yeah, without dansmith and mriedem it is not easy to discuss this item
14:24:45 <bauzas> hah, reading the commit msg
14:25:18 <georgh> I need a final approval for this change: https://review.openstack.org/#/c/575735/
14:25:41 <gibi> georgh: we can get back to that in Open Discussion
14:26:12 <gibi> so I think I will leave this review in the agenda in the stuck reviews
14:26:30 <gibi> and if nothing changes til next week then we can try to discuss it again
14:26:52 <bauzas> I thought we said retries are fine when we have a generation issue ?
14:27:05 <bauzas> if so, why not for an unshelve ?
14:27:18 <edleafe> bauzas: I was wondering the same thing
14:27:26 <efried> bauzas: that's too broad a statement.
14:27:29 <gibi> bauzas: blind retry felt like ignoring the consumer generation feature itself
14:27:35 <efried> yes, exactly.
14:27:49 <bauzas> so we should check the generation bit, that's your concern ?
14:27:50 <gibi> to be precise, the proposed solution is not totally blind
14:28:13 <gibi> bauzas: I personally would like to avoid the unshelve race if possible
14:28:14 <edleafe> Retry with a GET to update the current state of the allocations
14:28:29 <edleafe> that's what the idea of consumer gens was for
14:28:41 <efried> The statement needs to be more like, "When you get a generation conflict, you need to re-GET the relevant pieces and re-execute the relevant logic, which may or may not amount to an exact retry."
14:28:46 <efried> Yeah, what edleafe said.
14:29:02 <edleafe> So why is this case different?
14:29:30 <efried> edleafe: I think it's the scope of the retry
14:29:46 <efried> edleafe: I think the point is that some of the logic from the caller of this method would need to be included in the retry.
14:30:34 <efried> I don't know that for sure in this case btw
14:30:39 <edleafe> The retry would be better in the code itself, not the reportclient, which should be generic
14:30:40 <cdent> Is there anything wrong with gibi's proposal? because fixing a race is much nicer than handling a race
14:30:47 <sean-k-mooney> this is the basis of the compare-and-swap idiom for concurrent modification of a shared data structure. the unshelve action would need to read the state but it should be able to retry
14:31:19 <edleafe> cdent: I'm not familiar enough with unshelve to know whether that would break other assumptions
14:31:46 <mdbooth> Is there any reason why you wouldn't make both changes?
14:32:07 <bauzas> edleafe: unshelve is just a crazy scheduling call which literally recreates the instance
14:32:25 <sean-k-mooney> mdbooth: it might hide it if we reintroduce the race after fixing it, if we have a retry mechanism too
14:32:37 <edleafe> bauzas: yeah, I get that. It's the 'crazy' part that I'm nervous about
14:32:51 <efried> mriedem: are you following / catching up?
14:33:14 <mriedem> did dst mess me up?
14:33:23 <bauzas> indeed
14:33:29 <mriedem> then no
14:34:11 <mriedem> you're talking about my stuck patch
14:34:13 <efried> mriedem: we're discussing the delete-consumer-race-on-shelving
14:34:14 <efried> yeah
14:34:33 <bauzas> edleafe: for more details https://developer.openstack.org/api-guide/compute/server_concepts.html#server-actions
14:34:36 <mriedem> as i said in the change, i could do the spot fix in the shelve code or the more generic fix in the scheduler, i opted for the latter
14:34:54 <mriedem> it seems to me the scheduler code, before consumer aggregates, was already retrying on conflict
14:34:57 <mriedem> so i followed that
14:35:27 <mriedem> fixing where shelve removes the allocations will fix *this* race but i worry about others
14:35:39 <gibi> mriedem: I would fix the race in the shelve code as that would be a specific fix. the current proposal is a generic fix for more than just shelve - unshelve race
14:35:53 <efried> Agree with ^
14:36:01 <mriedem> so people want both?
14:36:03 <gibi> mriedem: I'm afraid the retry would hide things we eventually want to fix
14:36:32 <mriedem> ok, i guess if there is majority agreement on at least the shelve fix i can do that and leave the generic fix to rot
14:37:02 <gibi> anybody against ^^ ?
14:37:30 <edleafe> not I
14:37:32 <efried> Don't know about "rot". But it should be evaluated separately whether we can identify a proper scope for "always retry".
14:37:48 <efried> I'm just not convinced that that scope == this method.
14:37:48 <mriedem> i just likely won't have the energy to do that evaluation
14:38:01 <efried> That's fine.
14:38:05 <mriedem> i already spent the better part of a day identifying this race
14:38:11 <gibi> yeah, fix the known race
14:38:12 <efried> Seems like something gibi would have the energy for, nudge nudge :P
14:38:21 <mriedem> gibi has bigger fish to fry
14:38:30 <gibi> efried: :) I don't feel that way
14:39:08 <gibi> OK so we agreed that mriedem propose a fix for the shelve race
14:39:12 <gibi> moving forward
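For context, a minimal sketch of the "re-GET and re-execute" pattern edleafe and efried describe above, written in Python under assumed names: the client object, its get_allocations/put_allocations methods and the AllocationConflict exception are hypothetical stand-ins, not Nova's actual report client API.

    # Hypothetical sketch: on a consumer generation conflict, re-read the
    # consumer state and re-run the caller's logic against it instead of
    # blindly replaying the same PUT.
    MAX_ATTEMPTS = 3

    class AllocationConflict(Exception):
        """Placement reported a consumer generation mismatch."""

    def update_allocations(client, consumer_uuid, modify_fn):
        for _ in range(MAX_ATTEMPTS):
            # GET the current allocations plus the consumer generation.
            current = client.get_allocations(consumer_uuid)
            desired = modify_fn(current['allocations'])
            try:
                # PUT with the generation we just read; placement rejects the
                # write if another process changed the consumer in the meantime.
                client.put_allocations(
                    consumer_uuid, desired,
                    consumer_generation=current['consumer_generation'])
                return
            except AllocationConflict:
                continue  # someone raced us; re-read and re-apply
        raise AllocationConflict(
            'gave up on %s after %d attempts' % (consumer_uuid, MAX_ATTEMPTS))

The point of the objection above is that the scope of modify_fn matters: a correct retry may need to include the caller's logic, not just the report client's PUT.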
14:39:31 <gibi> #topic Open discussion
14:39:37 <gibi> (mriedem): Looking for approval on this specless blueprint: https://blueprints.launchpad.net/nova/+spec/run-meta-api-per-cell - this was discussed in the Oct 25 meeting and there was agreement for no spec, but efried wanted to know the name of the config option which is in the blueprint now. So are we good to go?
14:39:54 <gibi> efried: ^^ ?
14:40:02 <efried> ...
14:40:33 <rambo_li_> Hi all, I have some questions. One is that the actual operator is different from the operator recorded in panko. For example, with the delete action: we create the VM as user1 and delete the VM as user2, but the operator recorded for the delete in the panko event is user1, not the actual operator user2.
14:40:52 <gibi> rambo_li_: lets take that after georgh
14:40:54 <efried> gibi, mriedem: Yup, good to go, thanks for the update.
14:40:56 <rambo_li_> o
14:41:01 <gibi> efried: cool
14:41:08 <mriedem> ok so should i approve my own blueprint then?
14:41:18 <gibi> mriedem: I agree to approve it
14:41:20 <mriedem> seems...dubious
14:41:23 <mriedem> ok
14:41:25 <mriedem> it's on the record
14:41:29 <gibi> :)
14:41:35 <gibi> next
14:41:37 <gibi> (gmann): Regarding migrating the gate jobs to Bionic- http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000168.html . I have tested existing zuulv3 nova jobs on bionic and all working fine. I have marked OK for nova in https://etherpad.openstack.org/p/devstack-bionic
14:41:55 <gibi> gmann: thanks for the effort
14:42:17 <gibi> is there anything we have to discuss about the bionic jobs?
14:42:22 <gmann> note: legacy jobs are not planned to migrate to bionic, so they keep running on xenial until they move to zuulv3 native
14:42:41 <mriedem> ouch...
14:42:45 <mriedem> when will infra drop xenial?
14:42:51 <mriedem> we have a lot of legacy jobs
14:43:12 <gmann> it will not be dropped as a few jobs still have dependencies, like the keystone federation job etc
14:43:34 <gmann> mriedem: i am planning to start the migration to zuulv3 one by one next week.
14:43:37 <mriedem> ok, i guess it can be a community wide goal in the future when it's a real problem
14:43:42 <mriedem> or that
14:43:43 <gmann> +1
14:43:47 <gibi> +1
14:44:05 <gibi> OK. georgh it is your turn
14:44:24 <georgh> thx, I'm looking for a final approval of https://review.openstack.org/#/c/575735/
14:45:50 <georgh> melwitt asked Kashyap Chamarthy for help but his review didn't move the issue forward
14:46:22 <mriedem> and sahid is gone from red hat,
14:46:22 <sean-k-mooney> yes i remember this patch; for what it's worth i still think it's good
14:46:25 <georgh> It's been stuck since then and from my point of view the change is finished.
14:46:27 <mriedem> there just aren't that many people familiar with this code
14:46:50 <efried> I pinged kashyap to have another look.
14:46:50 <georgh> I know that interactive serial consoles are an exotic topic
14:46:54 <sean-k-mooney> mriedem: sahid is now at canonical but im not sure if he is working on openstack anymore
14:48:03 <efried> kashyap is looking
14:48:06 <gibi> efried: thanks
14:48:20 <gibi> I hope kashyap reply will unblock this patch
14:48:20 <kashyap> efried: Oh, this one.  Yes, need to load all the "console-related context", can do it tomm or early next week
14:48:24 <kashyap> Thanks for the ping
14:48:37 <georgh> ok, thank you
14:48:39 <kashyap> My bad, the reply has been sitting since 02-Nov
14:48:55 <gibi> moving forward
14:48:58 <efried> We'll need a couple of cores once that happens. I'm afraid I have zero background in this area. Anyone?
14:49:15 <sean-k-mooney> efried: stephenfin: should be back monday
14:49:17 <gibi> I'm happy to read the patch and learn some of it
14:49:17 <mriedem> mine is very thin
14:49:29 <efried> If not, I wouldn't feel *too* bad about leaning on the expertise of sean-k-mooney and kashyap
14:49:35 <mriedem> someone get markus zoeller out of retirement
14:49:45 <efried> if stephenfin is familiar, that would be good.
14:49:59 <gibi> really moving forward
14:50:03 <gibi> rambo_li_: your turn
14:50:43 <efried> I wonder if rambo_li_'s issue is better for the main channel, outside of the meeting.
14:51:07 <efried> or even the ML
14:51:45 <gibi> rambo_li_: would it be OK to you to bring this up on #openstack-nova or on the mailing list?
14:52:09 <rambo_li_> ok, thank you, let's go to the next one
14:52:19 <gibi> rambo_li_: thanks
14:52:27 <rambo_li_> Another one: when we resize/migrate an instance, if an error occurs on the source compute node, the instance state can currently roll back to active. But if an error occurs in the "finish_resize" function on the destination compute node, the instance state does not roll back to active. Is this a bug, or does anyone plan to change this?
14:53:11 <bauzas> reset-state ?
14:53:17 <mriedem> likely because there isn't cleanup on finish_resize if an error occurs
14:53:23 <gibi> rambo_li_: if finish resize already destroyed something on the source then it is hard to roll back
14:53:31 <mriedem> depending on where it happens, by the time you're in finish_resize there is a guest on the dest
14:53:37 <mriedem> so putting the instance as ACTIVE isn't really valid in that case
14:53:46 <gibi> mriedem: agree
14:53:54 <efried> TL;DR: working as designed
14:54:03 <mriedem> if not designed,
14:54:09 <mriedem> not a bug really,
14:54:22 <mriedem> it would really be an RFE i think to make it more robust for rolling back
14:54:24 <mriedem> if we cared to do that
14:54:35 <mriedem> b/c rollback is hard
14:54:48 <gibi> yeah, it would be good to know why finish resize failed in the first place
14:54:53 <gibi> maybe we can avoid that failure
14:55:07 <bauzas> what exactly does finish_resize do?
14:55:20 <bauzas> sorry for the dumb question, I can read code but I'm old and lazy
14:55:23 <mriedem> it finishes the resize
14:55:27 <bauzas> \o/
14:56:03 <gibi> rambo_li_: could you provide information about the failure during finish resize?
14:56:11 <mriedem> sets up the disk and starts the guest
14:56:13 <mriedem> on the dest host
14:56:17 <sean-k-mooney> bauzas: it cleans up networks on the source node and a few other things
14:56:40 <rambo_li_> oh, thank you, I will reconsider it
14:56:56 <bauzas> ok, I was asking this, because if that's just clean-up, you can still resurrect your instance ?
14:57:08 <mriedem> it's not cleanup
14:57:11 <rambo_li_> when we finish resize, the instance can't start on the dest node
14:57:16 <mriedem> it's the thing right before the status goes to VERIFY_RESIZE
14:57:36 <bauzas> ok, nevermind, I'll look
14:57:51 <mriedem> prep_resize (claim on dest) -> resize_instance (source, transfer disk to dest) -> finish_resize (setup disk on dest, start guest, do db magic)
14:58:05 <sean-k-mooney> oh ok my bad, i thought it was the confirm step after that
14:58:05 <mriedem> by the time you get to the finish_resize and it fails, you're in trouble
14:58:10 <efried> Someone should do a flow diagram for that.
14:58:27 <mriedem> efried: i've got a dime with your name on it
14:58:38 <efried> I require a napkin sketch, like last time.
14:58:41 <mriedem> oh and don't forget about reschedules in the flow diagram :)
14:58:45 <gibi> we have less than 2 minutes. I think robustifying finish_resize could be a valid feature request
14:58:46 <mriedem> heh
14:58:54 <bauzas> mriedem: oh shit, I also confused myself with confirm resize
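For readers following along, a rough sketch of the flow mriedem outlines above, with hypothetical host objects and method names; the real compute RPC methods carry many more arguments and also handle reschedules, which this sketch omits.

    # Cold resize, simplified: by the time finish_resize fails, the guest is
    # already partially set up on the dest and may be gone from the source,
    # so rolling the state back to ACTIVE is not generally safe.
    def cold_resize(instance, source, dest, new_flavor):
        # 1. prep_resize: claim resources (and placement allocations) on dest.
        dest.prep_resize(instance, new_flavor)
        # 2. resize_instance: power off on the source and transfer the disk
        #    to the dest.
        source.resize_instance(instance, dest)
        # 3. finish_resize: set up the disk on dest, start the guest, update
        #    the database; the instance then sits in VERIFY_RESIZE until the
        #    user confirms or reverts the resize.
        dest.finish_resize(instance, new_flavor)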
14:59:16 <efried> Update on ci-watch:
14:59:16 <efried> The maintainer of ci-watch.tintri.com is gone and unreachable.
14:59:16 <efried> But mmedvede has redeployed the code to: http://ciwatch.mmedvede.net/
14:59:16 <efried> I have updated the Nova meeting agenda accordingly.
14:59:17 <mriedem> see i don't know shit about tcp consoles, but i know how the resize flow works
14:59:34 <gibi> efried: thanks
14:59:47 <gibi> we have to close the meeting in seconds.
14:59:51 <rambo_li_> ok, last one: I find live-resize of instances important in production environments. We have talked about it for many years and agreed on it at the Rocky PTG, then the author moved the spec to Stein, but there has been no update on the spec since. Is there anyone to push the spec forward and implement it? The link: https://review.openstack.org/#/c/141219/
14:59:58 <gibi> let's continue on #openstack-nova
15:00:05 <gibi> #endmeeting