14:00:04 #startmeeting nova
14:00:05 Meeting started Thu Nov 29 14:00:04 2018 UTC and is due to finish in 60 minutes. The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:08 The meeting name has been set to 'nova'
14:00:23 o/
14:00:40 hi
14:00:42 o/
14:00:45 \o
14:01:30 o/
14:01:40 let's get started
14:01:48 #topic Release News
14:01:52 * bauzas waves late
14:01:54 o/
14:01:54 #link Stein release schedule: https://wiki.openstack.org/wiki/Nova/Stein_Release_Schedule
14:02:08 next milestone is the 10th of January
14:02:33 #link Stein runway etherpad: https://etherpad.openstack.org/p/nova-runways-stein
14:02:38 #link runway #1: https://blueprints.launchpad.net/nova/+spec/io-semaphore-for-concurrent-disk-ops (jackding) [END 2018-12-04] https://review.openstack.org/#/c/609180/ (approved Nov 27, currently in the gate queue)
14:02:42 #link runway #2: https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree (bauzas/naichuan) [END 2018-12-04]
14:02:45 https://review.openstack.org/#/c/599208/ libvirt: implement reshaper for vgpu (bauzas)
14:02:48 https://review.openstack.org/#/c/521041/ xenapi(N-R-P): support compute node resource provider update (naichuan)
14:02:51 #link runway #3: https://blueprints.launchpad.net/nova/+spec/initial-allocation-ratios (yikun) [END 2018-12-10] starts here https://review.openstack.org/#/c/602803
14:03:17 hopefully the io semaphore item goes through CI soon and then one slot can be freed
14:03:46 any comments about release or runways?
14:04:22 #topic Bugs
14:04:26 No critical bugs
14:04:34 #link 54 new untriaged bugs (down 4 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
14:04:37 #link 9 untagged untriaged bugs (up 2 since the last meeting): https://bugs.launchpad.net/nova/+bugs?field.tag=-*&field.status%3Alist=NEW
14:04:40 #link bug triage how-to: https://wiki.openstack.org/wiki/Nova/BugTriage#Tags
14:04:43 #help need help with bug triage
14:05:14 any comments on the bug situation?
14:05:31 I'm trying to stir up some bug triage help internally, but everyone is busy :(
14:05:51 thanks cdent
14:06:02 seems like the same story everywhere
14:06:16 seems so
14:06:23 #topic Gate status
14:06:30 #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
14:06:43 there are some mirror issues as far as I see
14:07:20 it seems http://status.openstack.org/elastic-recheck/index.html#1449136 is a heavy hitter
14:07:46 #link 3rd party CI status http://ci-watch.tintri.com/project?project=nova&time=7+days
14:07:57 the link ^^ does not open for me
14:08:25 any comments about the gate or 3rd party CI?
14:08:27 that depends
14:08:31 it can take a long time to open
14:08:57 bauzas: it times out for me today
14:09:03 hmm, right
14:09:16 yeah, I heard whoever was maintaining it isn't anymore.
14:09:46 I (or someone) started a conversation in -infra, possibly with the thought of getting someone to take over / replace that with something equivalent
14:09:50 but I'm not sure if that panned out.
14:09:56 That was several weeks ago, before chaos.
14:10:05 I have a very bad DSL connection at home, so I'm not really the right guy to tell whether it's a problem or not
14:10:05 efried: thanks for the info
14:10:37 #topic Reminders
14:10:43 #link Stein Subteam Patches n Bugs: https://etherpad.openstack.org/p/stein-nova-subteam-tracking
14:10:48 any other reminders?
14:11:21 #topic Stable branch status
14:11:25 #link stable/rocky: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/rocky,n,z
14:11:28 #link stable/queens: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/queens,n,z
14:11:31 #link stable/pike: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/pike,n,z
14:11:44 there is a list of patches waiting for a second stable core
14:12:26 any comment on stable?
14:13:01 has there been any progress on the oslo.service issue?
14:13:01 FYI, nova-next does not run on queens. i have pushed the backport
14:13:19 #link https://review.openstack.org/#/q/topic:nova-next-queens+(status:open+OR+status:merged)
14:13:43 and matt backported the devstack patch
14:15:02 sean-k-mooney: this is the last mail in the ML http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000350.html
14:15:22 sean-k-mooney: I did not follow the issue closely
14:16:09 I'm happy with any of the proposed solutions, so I'm sitting back and letting others sort it out.
14:16:12 gmann: thanks
14:16:28 it looks like a new version of oslo.service is proposed to fix it https://review.openstack.org/#/c/620913/
14:16:32 but I do feel kinda guilty for causing the whole debacle.
14:18:10 ok i think we can assume that it is progressing
14:18:20 OK, moving on
14:18:23 #topic Subteam Highlights
14:18:32 Scheduler (efried)
14:18:50 #link n-sch meeting minutes http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:18:53 Lots of specs need TLC from authors
14:18:57 Extraction is proceeding well. The
14:18:57 #link devstack change https://review.openstack.org/#/c/600162/
14:18:57 has merged since the meeting
14:19:05 #link data migrations fix from tetsuro https://review.openstack.org/#/c/619126/
14:19:19 Lots of small/easy placement changes that could use a look to clean things up.
14:19:19 #link placement open patches https://review.openstack.org/#/q/project:openstack/placement+status:open
14:19:19 Especially
14:19:19 #link integrated template https://review.openstack.org/#/c/617565/
14:19:19 which has since merged
14:19:23 Reshaper patches still need review
14:19:23 #link libvirt reshaper https://review.openstack.org/#/c/599208/
14:19:23 #link xen reshaper (middle of series) https://review.openstack.org/#/c/521041
14:19:26 FFU framework for reshaper: we will consciously continue kicking this can down the road.
14:19:29 Educated tetsuro on flamethrowers
14:19:32 END
14:19:38 efried: thanks
14:19:48 Important education
14:20:04 API (gmann)
14:20:07 No office hour this week
14:20:09 Triaged 5 bugs during that time.
14:20:15 Updated the subteam tracking etherpad #link https://etherpad.openstack.org/p/stein-nova-subteam-tracking
14:20:24 Detailed status of this week #link http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000281.html
14:20:27 END
14:20:30 gmann: thanks
14:20:51 any other subteam report?
14:21:14 #topic Stuck Reviews
14:21:22 we have one on the agenda
14:21:23 (mriedem): Need to figure out what to do about the fix for gate bug 1798688: https://review.openstack.org/#/c/617040/ - do we whack the mole in the compute code or detect and retry in the scheduler?
14:21:25 bug 1798688 in OpenStack Compute (nova) "AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim" [High,In progress] https://launchpad.net/bugs/1798688 - Assigned to Matt Riedemann (mriedem)
14:22:15 This is basically a race condition between shelve offload and unshelve
14:23:10 detected by the consumer generation
14:23:10 in placement
14:23:13 efried: had -1 on the code and I also left an alternative proposal there this morning
14:23:30 s/://
14:23:53 the currently proposed solution is a retry in the report client
14:23:55 I think I prefer option 2 as well. Without matt or dan here, it's hard to move along
14:24:06 I don't have context
14:24:27 cdent: yeah, without dansmith and mriedem it is not easy to discuss this item
14:24:45 hah, reading the commit msg
14:25:18 i need a final approval for this change: https://review.openstack.org/#/c/575735/
14:25:41 georgh: we can get back to that in Open Discussion
14:26:12 so I think I will leave this review on the agenda in the stuck reviews
14:26:30 and if nothing changes till next week then we can try to discuss it again
14:26:52 I thought we said retries are fine when we have a generation issue ?
14:27:05 if so, why not for an unshelve ?
14:27:18 bauzas: I was wondering the same thing
14:27:26 bauzas: that's too broad a statement.
14:27:29 bauzas: blind retry felt like ignoring the consumer generation feature itself
14:27:35 yes, exactly.
14:27:49 so we should check the generation bit, that's your concern ?
14:27:50 to be accurate, the proposed solution is not totally blind
14:28:13 bauzas: I personally would like to avoid the unshelve race if possible
14:28:14 Retry with a GET to update the current state of the allocations
14:28:29 that's what the idea of consumer gens was for
14:28:41 The statement needs to be more like, "When you get a generation conflict, you need to re-GET the relevant pieces and re-execute the relevant logic, which may or may not amount to an exact retry."
14:28:46 Yeah, what edleafe said.
14:29:02 So why is this case different?
14:29:30 edleafe: I think it's the scope of the retry
14:29:46 edleafe: I think the point is that some of the logic from the caller of this method would need to be included in the retry.
14:30:34 I don't know that for sure in this case btw
14:30:39 The retry would be better in the code itself, not the reportclient, which would be generic
14:30:40 Is there anything wrong with gibi's proposal? because fixing a race is much nicer than handling a race
14:30:47 this is the basis of the compare-and-swap idiom for concurrent modification of a shared data structure. the unshelve action would need to re-read the state but it should be able to retry
14:31:19 cdent: I'm not familiar enough with unshelve to know whether that would break other assumptions
14:31:46 Is there any reason why you wouldn't make both changes?
14:32:07 edleafe: unshelve is just a crazy scheduling call which literally recreates the instance
14:32:25 mdbooth: it might hide it if we reintroduce the race after fixing it, if we have a retry mechanism too
14:32:37 bauzas: yeah, I get that. It's the 'crazy' part that I'm nervous about
14:32:51 mriedem: are you following / catching up?
14:33:14 did dst mess me up?
14:33:23 indeed
14:33:29 then no
14:34:11 you're talking about my stuck patch
14:34:13 mriedem: we're discussing the delete-consumer-race-on-shelving
14:34:14 yeah
14:34:33 edleafe: for more details https://developer.openstack.org/api-guide/compute/server_concepts.html#server-actions
14:34:36 as i said in the change, i could do the spot fix in the shelve code or the more generic fix in the scheduler, i opted for the latter
14:34:54 it seems to me the scheduler code, before consumer generations, was already retrying on conflict
14:34:57 so i followed that
14:35:27 fixing where shelve removes the allocations will fix *this* race but i worry about others
14:35:39 mriedem: I would fix the race in the shelve code as that would be a specific fix. the current proposal is a generic fix for more than just the shelve/unshelve race
14:35:53 Agree with ^
14:36:01 so people want both?
14:36:03 mriedem: I'm afraid the retry would hide things we eventually want to fix
14:36:32 ok, i guess if there is majority agreement on at least the shelve fix i can do that and leave the generic fix to rot
14:37:02 anybody against ^^ ?
14:37:30 not I
14:37:32 Don't know about "rot". But it should be evaluated separately whether we can identify a proper scope for "always retry".
14:37:48 I'm just not convinced that that scope == this method.
14:37:48 i just likely won't have the energy to do that evaluation
14:38:01 That's fine.
14:38:05 i already spent the better part of a day identifying this race
14:38:11 yeah, fix the known race
14:38:12 Seems like something gibi would have the energy for, nudge nudge :P
14:38:21 gibi has bigger fish to fry
14:38:30 efried: :) I don't feel that way
14:39:08 OK, so we agreed that mriedem will propose a fix for the shelve race
14:39:12 moving forward
14:39:31 #topic Open discussion
14:39:37 (mriedem): Looking for approval on this specless blueprint: https://blueprints.launchpad.net/nova/+spec/run-meta-api-per-cell - this was discussed in the Oct 25 meeting and there was agreement for no spec, but efried wanted to know the name of the config option, which is in the blueprint now. So are we good to go?
14:39:54 efried: ^^ ?
14:40:02 ...
14:40:33 Hi all, I have some questions. One is that the actual operator is different from the operator recorded in panko. For example, for the delete action we create the VM as user1 and delete the VM as user2, but the operator recorded for the delete in the panko event is user1, not the actual operator user2.
14:40:52 rambo_li_: let's take that after georgh
14:40:54 gibi, mriedem: Yup, good to go, thanks for the update.
14:40:56 o
14:41:01 efried: cool
14:41:08 ok so should i approve my own blueprint then?
14:41:18 mriedem: I agree to approve it
14:41:20 seems...dubious
14:41:23 ok
14:41:25 it's on the record
14:41:29 :)
14:41:35 next
14:41:37 (gmann): Regarding migrating the gate jobs to Bionic - http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000168.html . I have tested the existing zuulv3 nova jobs on bionic and all are working fine. I have marked nova as OK in https://etherpad.openstack.org/p/devstack-bionic
14:41:55 gmann: thanks for the effort
14:42:17 is there anything we have to discuss about the bionic jobs?
14:42:22 note - legacy jobs are not planned to be migrated to bionic, so they keep running on xenial until they move to zuulv3 native
14:42:41 ouch...
14:42:45 when will infra drop xenial?
14:42:51 we have a lot of legacy jobs
14:43:12 it will not be dropped as a few jobs still have a dependency on it, like the keystone federation job etc
14:43:34 mriedem: i am planning to start the migration to zuulv3 one by one next week.
14:43:37 ok, i guess it can be a community wide goal in the future when it's a real problem
14:43:42 or that
14:43:43 +1
14:43:47 +1
14:44:05 OK. georgh it is your turn
14:44:24 thx, I'm looking for a final approval of https://review.openstack.org/#/c/575735/
14:45:50 melwitt asked Kashyap Chamarthy for help but his review didn't move the issue forward
14:46:22 and sahid is gone from red hat,
14:46:22 yes i remember this patch, for what it's worth i still think it's good
14:46:25 It's been stuck since then and from my point of view the change is finished.
14:46:27 there just aren't that many people familiar with this code
14:46:50 I pinged kashyap to have another look.
14:46:50 I know that interactive serial consoles are an exotic topic
14:46:54 mriedem: sahid is now at canonical but i'm not sure if he is working on openstack anymore
14:48:03 kashyap is looking
14:48:06 efried: thanks
14:48:20 I hope kashyap's reply will unblock this patch
14:48:20 efried: Oh, this one. Yes, I need to load all the "console-related context", can do it tomorrow or early next week
14:48:24 Thanks for the ping
14:48:37 ok, thank you
14:48:39 My bad, the reply has been sitting since 02-Nov
14:48:55 moving forward
14:48:58 We'll need a couple of cores once that happens. I'm afraid I have zero background in this area. Anyone?
14:49:15 efried: stephenfin: should be black monday
14:49:17 I'm happy to read the patch and learn some of it
14:49:17 mine is very thin
14:49:29 If not, I wouldn't feel *too* bad about leaning on the expertise of sean-k-mooney and kashyap
14:49:35 someone get markus zoeller out of retirement
14:49:45 if stephenfin is familiar, that would be good.
14:49:59 really moving forward
14:50:03 rambo_li_: your turn
14:50:43 I wonder if rambo_li_'s issue is better for the main channel, outside of the meeting.
14:51:07 or even the ML
14:51:45 rambo_li_: would it be OK with you to bring this up on #openstack-nova or on the mailing list?
14:52:09 ok, thank you, let's go to the next one
14:52:19 rambo_li_: thanks
14:52:27 Another one: when we resize/migrate an instance, if an error occurs on the source compute node the instance state can currently roll back to active. But if an error occurs in the "finish_resize" function on the destination compute node, the instance state is not rolled back to active. Is this a bug, or does anyone plan to change this?
14:53:11 reset-state ?
14:53:17 likely because there isn't cleanup on finish_resize if an error occurs
14:53:23 rambo_li_: if finish resize already destroyed something on the source then it is hard to roll back
14:53:31 depending on where it happens, by the time you're in finish_resize there is a guest on the dest
14:53:37 so putting the instance as ACTIVE isn't really valid in that case
14:53:46 mriedem: agree
14:53:54 TL;DR: working as designed
14:54:03 if not designed,
14:54:09 not a bug really,
14:54:22 it would really be an RFE i think to make it more robust for rolling back
14:54:24 if we cared to do that
14:54:35 b/c rollback is hard
14:54:48 yeah it would be good to know why finish resize failed in the first place
14:54:53 maybe we can avoid that failure
14:55:07 what exactly does finish_resize do?
14:55:20 sorry for the dumb question, I can read code but I'm old and laz
14:55:21 lazy*
14:55:23 it finishes the resize
14:55:27 \o/
14:56:03 rambo_li_: could you provide information about the failure during finish resize?
14:56:11 sets up the disk and starts the guest
14:56:13 on the dest host
14:56:17 bauzas: it cleans up networks on the source node and a few other things
14:56:40 oh, thank you, I will reconsider it
14:56:56 ok, I was asking this because if that's just clean-up, you can still resurrect your instance ?
14:57:08 it's not cleanup
14:57:11 when we finish the resize, the instance can't start on the dest node
14:57:16 it's the thing right before the status goes to VERIFY_RESIZE
14:57:36 ok, nevermind, I'll look
14:57:51 prep_resize (claim on dest) -> resize_instance (source, transfer disk to dest) -> finish_resize (setup disk on dest, start guest, do db magic)
14:57:56 oh ok my bad i thought it was the confirm step after that
14:58:05 by the time you get to finish_resize and it fails, you're in trouble
14:58:10 Someone should do a flow diagram for that.
14:58:27 efried: i've got a dime with your name on it
14:58:38 I require a napkin sketch, like last time.
14:58:41 oh and don't forget about reschedules in the flow diagram :)
14:58:45 we have less than 2 minutes. I think robustifying finish_resize could be a valid feature request
14:58:46 heh
14:58:54 mriedem: oh shit, I also confused myself with confirm resize
14:59:16 Update on ci-watch:
14:59:16 The maintainer of ci-watch.tintri.com is gone and unreachable.
14:59:16 But mmedvede has redeployed the code to: http://ciwatch.mmedvede.net/
14:59:16 I have updated the Nova meeting agenda accordingly.
14:59:17 see i don't know shit about tcp consoles, but i know how the resize flow works
14:59:34 efried: thanks
14:59:47 we have to close the meeting in seconds.
14:59:51 ok, last one: I find that live-resizing an instance is important in a production environment. We have talked about it for many years and we agreed on this at the Rocky PTG, then the author moved the spec to Stein, but there is no news about this spec. Is there anyone to push the spec forward and get it done? The link: https://review.openstack.org/#/c/141219/
14:59:58 let's continue on #openstack-nova
15:00:05 #endmeeting
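
For readers catching up on the Stuck Reviews discussion above, here is a minimal, illustrative sketch of the re-GET-and-retry ("compare-and-swap") pattern edleafe and efried describe for placement consumer generation conflicts. It is not the patch under review at https://review.openstack.org/#/c/617040/; get_allocations, put_allocations, apply_change and GenerationConflict are hypothetical stand-ins for the placement calls and caller logic involved.

    # Illustrative sketch only, not nova's report client: a compare-and-swap
    # style retry against placement consumer generations. All names below
    # are hypothetical stand-ins.
    import random
    import time


    class GenerationConflict(Exception):
        """Placement rejected the write because the consumer generation we
        sent was stale (another process changed the consumer first)."""


    def get_allocations(consumer_uuid):
        # Stand-in for reading the consumer's current allocations plus the
        # consumer generation placement knows about.
        raise NotImplementedError


    def put_allocations(consumer_uuid, allocations, consumer_generation):
        # Stand-in for writing allocations; raises GenerationConflict on a
        # generation mismatch.
        raise NotImplementedError


    def update_allocations_cas(consumer_uuid, apply_change, max_attempts=4):
        """On every conflict, re-GET the consumer state and re-execute the
        caller's logic against it, instead of blindly resending the old
        payload.

        apply_change(current_allocations) returns the allocations the
        caller wants, so the caller's reasoning is re-run on fresh state
        each time.
        """
        for attempt in range(max_attempts):
            current, generation = get_allocations(consumer_uuid)
            desired = apply_change(current)
            try:
                put_allocations(consumer_uuid, desired, generation)
                return True
            except GenerationConflict:
                # Another process won the race; back off briefly, re-read.
                time.sleep(random.uniform(0.1, 0.5) * (attempt + 1))
        return False

The open question from the meeting maps onto where apply_change lives: if most of the caller's reasoning (e.g. the shelve/unshelve logic) has to be re-run inside it, the retry belongs in the caller rather than in a generic report client helper.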