21:00:15 #startmeeting nova
21:00:16 Meeting started Thu Jul 20 21:00:15 2017 UTC and is due to finish in 60 minutes. The chair is mriedem. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:19 The meeting name has been set to 'nova'
21:00:25 ohai
21:00:35 o/
21:00:49 \o
21:00:50 o/
21:01:19 o/
21:01:28 \m/
21:01:50 ok let's get started
21:01:53 #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
21:01:56 o/
21:01:59 #topic release news
21:02:07 so i heard there is a release around the corner
21:02:12 #link Pike release schedule: https://wiki.openstack.org/wiki/Nova/Pike_Release_Schedule
21:02:22 #info July 27 is feature freeze (1 week away)
21:02:32 #info July 20 (today) is the release freeze for non-client libraries (oslo, os-vif, os-traits, os-brick): https://releases.openstack.org/pike/schedule.html#p-final-lib
21:02:39 * bauzas waves late
21:02:41 we've done the final os-vif and os-traits releases already
21:02:59 if you needed something from oslo or keystoneauth, you'd better get cracking on it before dims passes out
21:03:18 * dims waves
21:03:21 #info Blueprints: 65 targeted, 64 approved, 33 completed (+4 from last week)
21:03:38 yay we officially made it over the 50% mark
21:03:43 which i think is a passing grade in the US now
21:04:01 lol
21:04:04 and... sad....
21:04:04 thank you
21:04:10 i'll be here all week
21:04:17 #link Pike feature freeze blueprint tracker etherpad: https://etherpad.openstack.org/p/nova-pike-feature-freeze-status
21:04:35 i'm keeping ^ up to date throughout the day
21:04:41 trying to focus reviews a bit
21:04:49 any questions about the release?
21:05:05 ok moving on
21:05:07 #topic bugs
21:05:16 nothing is listed as critical
21:05:22 #help Need help with bug triage; there are 123 (+8 from last week) new untriaged bugs as of today (July 20).
21:05:35 just a note, after the FF, we have 2 weeks until the rc1
21:05:58 so we'll have to start going through those untriaged bugs here at some point to see if there are stop ship rc candidates
21:06:07 "rc candidates" is redundant
21:06:07 i know
21:06:20 speaking of
21:06:20 #info Starting to tag bugs for pike-rc-potential: https://bugs.launchpad.net/nova/+bugs?field.tag=pike-rc-potential
21:06:43 so if you spot a bug that looks like a major regression or upgrade issue, please tag it
21:06:57 gate status
21:06:58 #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
21:07:07 well, we know the functional tests have been wonky
21:07:21 something to do with lock timeouts
21:07:36 i started playing with that here but it went badly https://review.openstack.org/#/c/485335/
21:07:43 heh
21:07:54 turns out that if you don't set the external lock fixture,
21:08:12 we're still starting services, like the network service, which has code that uses external locks,
21:08:19 and if you don't have the lock_path configured, it blows up
21:08:29 so maybe we can stub that out, i haven't had time to dig into it
21:08:36 oslo.concurrency does have some fixtures for this though
21:09:03 #link 3rd party CI status http://ci-watch.tintri.com/project?project=nova&time=7+days
21:09:15 i noticed the xenserver ci was busted the other night
21:09:22 i don't know if that's been fixed
21:09:34 questions about bugs?
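For the lock_path failures discussed under gate status above: the oslo.concurrency fixture mriedem alludes to is presumably ExternalLockFixture, which configures lock_path to a temporary directory. A minimal sketch of wiring it into a testtools-based test; the base class name here is hypothetical:

    from oslo_concurrency.fixture import lockutils as lock_fixtures
    import testtools

    class FunctionalTestBase(testtools.TestCase):
        def setUp(self):
            super(FunctionalTestBase, self).setUp()
            # Services started by the test (e.g. the network service) take
            # external locks; without a configured lock_path that blows up.
            # This fixture supplies a temporary lock_path for the test run.
            self.useFixture(lock_fixtures.ExternalLockFixture())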
21:09:45 ok moving on
21:09:47 #topic reminders
21:09:50 #link Pike Review Priorities etherpad: https://etherpad.openstack.org/p/pike-nova-priorities-tracking
21:10:02 ^ is probably stale, and i'm personally focusing on the FF etherpad above
21:10:09 #link Consider signing up for the Queens PTG which is the week of September 11 in Denver, CO, USA: https://www.openstack.org/ptg
21:10:30 #topic stable branch status
21:10:41 there isn't really any news here
21:10:57 we have some things to review,
21:11:06 and i'll probably do stable branch releases around rc1
21:11:20 #topic subteam highlights
21:11:29 dansmith: things for cells v2?
21:11:35 no meeting this week,
21:11:46 made some progress on the quotas set, which is our highest priority
21:11:50 need more tho
21:11:52 that is all.
21:12:18 right so quotas are now at https://review.openstack.org/#/c/410945/
21:12:23 which is moving quotas into the api db
21:12:26 it's 2 changes
21:12:31 the counting quotas stuff is all approved though
21:12:35 aye
21:12:45 and the external-server-events thing is approved
21:12:52 it just doesn't want to merge
21:12:59 nor do lots of things
21:13:02 it's in the gate right no though
21:13:04 *now
21:13:05 beyond that we have some docs and bugs to fix
21:13:19 tracking those here https://etherpad.openstack.org/p/nova-pike-cells-v2-todos
21:13:23 * dansmith nods
21:13:30 #help cells v2 TODOs https://etherpad.openstack.org/p/nova-pike-cells-v2-todos
21:13:41 ok scheduler stuff
21:13:42 edleafe: ?
21:13:48 I'll fill in for ed
21:13:52 he's out
21:14:09 ok
21:14:31 mriedem: basically, thanks to cdent and gibi we identified a nasty little bug in the placement shared resource providers SQL code.
21:15:18 and...
21:15:32 jaypipes is a master of suspense
21:15:38 * jaypipes looking for link
21:15:55 The secret of the universe is...
21:16:02 * dansmith is on the edge of his seat
21:16:04 42
21:16:13 https://review.openstack.org/#/c/485088/
21:16:15 #link https://review.openstack.org/#/c/485088/
21:16:36 that is the cause of the transient NoValidHosts functional test failures we were seeing
21:16:48 oh nice
21:16:56 my bad for not notifying you and dansmith sooner.
21:16:58 mriedem is clearly spoiled by edleafe's prepared copy/paste reports
21:17:03 i saw the talk
21:17:05 for some reason I thought I did but no.
21:17:15 i mean,
21:17:17 efried: no, I'm just an idiot.
21:17:17 people been talkin'
21:17:24 talkin' about patches
21:17:26 mriedem: talkin bout talkin?
21:17:48 i like to do the bonnie rait joke once per year
21:17:51 mriedem: outside of that major bug fix, there is the placement-claims work
21:17:54 I got it :)
21:18:01 which has gotten reviews and is slowly being merged.
21:18:04 yeah so claims https://review.openstack.org/#/c/483566/
21:18:08 jaypipes: you see the bug in there?
21:18:40 like, you don't have to understand it right now
21:18:45 but there is a bug in there, found by gib
21:18:47 *gibi
21:19:01 #info gibi is finding lots of placement bugs - yay gibi
21:19:18 mriedem: yeah, sorry, I dropped the ball on that. will address shortly after meeting.
21:19:19 we also need to talk about alternatives at some point
21:19:25 but can do that later
21:19:38 i'd like to rap about contingency plans, mkay?
21:19:46 * mriedem turns his chair around
21:19:48 go for it.
21:19:51 well, later
21:20:05 well, or now
21:20:20 main question is, if we don't get alternatives sorted out going to the computes, what breaks or do we lose?
21:20:20 there's some outstanding traits-related things that are also in-flight from alex_xu
21:20:32 including a bug fix that is important. I need to finish the review on that one.
21:20:34 re-review..
21:20:35 we lose the separate conductors
21:20:59 we lose separate conductors for retries
21:21:04 right
21:21:10 asking from another pov,
21:21:11 well, yeah, we can't fulfill our promise of retaining retries in cellsv2
21:21:12 could be worse,
21:21:15 What Dan said
21:21:21 if we land the claims in the scheduler, but not alternatives going to computes, is that ok?
21:21:25 one of the first people that should be doing things with cellsv2 and pike don't care about retries anyway
21:21:54 mriedem: if we haven't removed the regular retry mechanism it should still work the same way it does now, right?
21:21:56 who is that? CERN?
21:21:59 melwitt: yeah
21:22:07 dansmith: yes,
21:22:12 i just want to make sure we're not missing something,
21:22:22 or that once we claim in the scheduler we lose something or break something else
21:22:34 but i think that if you're single level conductor it's business as usual,
21:22:43 yep
21:22:48 but you get the up-front claim protection which should mitigate some of the claim race failures in the computes
21:22:54 so we're making progress still
21:22:56 I don't think we would have other problems
21:23:16 and we'll still retry alternatives if the allocation request in the scheduler fails
21:23:18 due to a race
21:23:34 Correct
21:24:04 right
21:24:22 is the retry happening in https://review.openstack.org/#/c/483566/ yet?
21:24:35 mriedem: oh yes.
21:24:40 Yup
21:24:56 mriedem: line 189 in filter_scheduler.py
21:25:06 ok i was looking at https://review.openstack.org/#/c/483566/4/nova/scheduler/filter_scheduler.py@203
21:25:31 mriedem: that's after all retries.
21:25:33 ok and the number of hosts is limited by the number of retries?
21:25:55 mriedem: no
21:26:09 there's no retries in this code right?
21:26:14 so we could retry 1000 times?
21:26:30 the retry is the for loop over hosts on L189
21:26:30 dansmith: yes, there is. line 189-200
21:26:44 mriedem: yes, we can retry 1000 times.
21:26:45 but we don't limit that by the configurable retry option
21:26:55 uh
21:27:08 so wait, this is a loop over all the hosts we got back from placement,
21:27:14 dansmith: correct.
21:27:19 and filtered/weighed
21:27:20 and we try to claim one until we run out of hosts or get one?
21:27:25 yes
21:28:06 okay, I hadn't noticed that the first time around, but I see now
21:28:12 so is there a reason we don't limit by the config?
21:28:15 to be clear, this is just retries in the claiming sense,
21:28:24 not the *reschedule* that won't work without the alternatives stuff
21:28:24 mriedem: no reason really...
21:28:31 dansmith: yeah i get that
21:28:49 Honestly it's just for a race
21:28:50 so retrying here 1000 times is better than retrying compute > conductor > scheduler * 1000
21:28:59 I feel like we should limit here, but I'm not sure why to be honest
21:29:14 i feel like we should limit too,
21:29:27 because as bauzas says, this should only be a race between placement saying something is good and us trying it
21:29:28 In case two concurrent requests get the same destination
21:29:33 but only because if we goofed something up and are going to retry 1000 times on a set of hosts that won't ever match
21:29:37 you'd need a lot of concurrent schedulers to race through 1000 of them
21:29:49 Yup
21:30:08 so, I'm tempted to leave it as is and backport an emergency fix if we find people are hitting it in some way
21:30:15 Looks a minor issue for me IMHO
21:30:17 and a lot of luck. remember there's the randomization of the host subset as well
21:30:18 release note?
21:30:22 in reality, we probably have to fix whatever is causing them to hit it and not just the limit
21:30:26 jaypipes: aye
21:30:29 dansmith: yeah agreed
21:30:34 the retry is papering over a bigger problem
21:30:38 mriedem: sure, reno for the warning perhaps
21:30:44 ok
21:30:48 happy to do that.
21:30:49 let's continue that fun in the review
21:30:52 k
21:30:53 Reno FTW
21:31:15 remember, reno is the way we can always say "we told you so" :)
21:31:26 "we told you it was going to break!"
21:31:31 we're making a lot of people watch a small conversation, fwiw
21:31:38 good
21:31:41 we're a family here
21:31:52 ok moving on then
21:31:58 thanks for the discussion
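In outline, the claiming behavior just discussed amounts to the following (a deliberately simplified sketch, not nova's actual filter_scheduler.py code; all names here are illustrative):

    def select_host(hosts, claim_resources):
        """Try to claim each candidate host in order.

        hosts: candidates already filtered and weighed by the scheduler.
        claim_resources: callable that attempts the placement allocation,
        returning False when a concurrent scheduler won the race.
        """
        for host in hosts:
            # The "retry" is simply the next loop iteration; there is no
            # reschedule back through conductor and no config-based cap,
            # so in the worst case every candidate host is attempted.
            if claim_resources(host):
                return host
        return None  # every candidate failed to claim: NoValidHost

    # e.g. select_host(['h1', 'h2'], lambda h: h == 'h2') returns 'h2'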
21:32:10 the api meeting, i don't think i was there
21:32:19 sdague: were you?
21:32:36 the no more api extensions stuff is working its way through https://review.openstack.org/#/q/topic:bp/api-no-more-extensions-pike+status:open
21:32:44 the service/hypervisor uuids apis landed for 2.53
21:32:53 i'm working on the novaclient changes for that one now
21:33:08 notification meeting was relatively short,
21:33:24 gibi and i talked about the last remaining change for the searchlight notifications series
21:33:30 which is https://review.openstack.org/#/c/459493/
21:33:51 and we talked about the updated_at bug fix from takashin
21:34:05 this https://review.openstack.org/#/c/475276/
21:34:27 there is something weird going on in there where the updated_at field isn't set on the instance record even after we've updated the vm/task state in there
21:34:42 and updated_at should be set anytime there is an ONUPDATE triggered by a db record update
21:34:44 so it's weird
21:35:00 that's the only thing i'm holding my +2 for
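For background on the ONUPDATE point: updated_at on models like nova's is a SQLAlchemy column declared with an onupdate callback, so any UPDATE issued through SQLAlchemy should refresh it. A generic sketch of that declaration (simplified, not nova's actual model):

    import datetime

    from sqlalchemy import Column, DateTime, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Instance(Base):
        __tablename__ = 'instances'
        id = Column(Integer, primary_key=True)
        task_state = Column(String(255), nullable=True)
        # The onupdate callback fires for any UPDATE issued through
        # SQLAlchemy, which is why updated_at remaining unset after a
        # vm/task state change on the instance record looks like a bug.
        updated_at = Column(DateTime, onupdate=datetime.datetime.utcnow)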
21:35:10 ok cinder stuff
21:35:21 making progress, the swap volume change merged
21:35:24 open reviews are now https://review.openstack.org/#/q/status:open+project:openstack/nova+topic:bp/cinder-new-attach-apis+branch:master
21:35:31 the check_detach one is close
21:35:43 stvnoyes found a problem in the live migration one and is fixing it up
21:35:46 and we need a cinderclient release
21:35:54 but it's really down to like mainly 2 changes
21:36:02 so any help there on reviews would be appreciated
21:36:19 #topic stuck reviews
21:36:32 there was nothing in the agenda, so does anyone have something to mention here?
21:36:52 alright then
21:36:55 #topic open discussion
21:37:01 So I made a thing: https://blueprints.launchpad.net/nova/+spec/libvirt-virtio-set-queue-sizes
21:37:08 I can't see this landing in Pike, seeing as it exposes functionality in libvirt that hasn't even merged yet, and therefore technically doesn't exist
21:37:13 But it's been a while since I've done any Nova work, and I could use some expert eyes on it to make sure I'm not crazy, and that my approach makes some sense
21:37:18 And I could use some help getting it prioritized somewhere, obviously
21:37:43 nic: i'll target the bp to queens so we can discuss it after the pike release
21:37:50 \o/
21:38:06 I will be in Denever and nic might be so we could discuss it there as well?
21:38:11 Denver even...
21:38:11 sure
21:38:27 I like how that was "de never"
21:38:31 In the meantime I think what nic was asking is.. are we even headed in the right direction or is it omg abort?
21:38:43 talk to sean-k-mooney
21:38:57 or vladikr maybe?
21:39:00 nic: I'm also interested. For the NFV case with low drops, you sometimes need larger queues.
21:39:32 jangutter That's what it's for...
21:39:37 i've added people to the review
21:39:52 Excellent
21:40:00 jangutter We hit it with NFV work loads under NFV. We carry a local set of patches for now to work around this but obviously want to get it into upstream.
21:40:09 NFV workloads under VPP even
21:40:15 I can't type today... or ever.
21:40:21 cburgess: Yep, and partly QEMU's also to blame :-(
21:40:30 jangutter yeah
21:40:50 alright, anything else?
21:41:15 i'll take that as a no
21:41:19 1 week left
21:41:21 thanks everyone
21:41:23 #endmeeting
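For context on nic's blueprint from open discussion: the in-flight libvirt feature it depends on adds rx_queue_size/tx_queue_size attributes to the <driver> element of a guest interface, so larger virtio queues can reduce packet drops for NFV-style workloads. A hypothetical sketch of emitting that element (illustrative only, not nova's libvirt driver code):

    from xml.etree import ElementTree as ET

    def build_iface_driver(rx_queue_size=None, tx_queue_size=None):
        # Build the <driver> sub-element of a virtio <interface>
        # definition, attaching queue sizes only when requested.
        drv = ET.Element('driver', name='vhost')
        if rx_queue_size is not None:
            drv.set('rx_queue_size', str(rx_queue_size))
        if tx_queue_size is not None:
            drv.set('tx_queue_size', str(tx_queue_size))
        return ET.tostring(drv).decode()

    print(build_iface_driver(1024, 1024))
    # <driver name="vhost" rx_queue_size="1024" tx_queue_size="1024" />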