22:04:44 <armax> #startmeeting neutron_drivers
22:04:45 <openstack> Meeting started Thu Jan 26 22:04:44 2017 UTC and is due to finish in 60 minutes. The chair is armax. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:04:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:04:48 <openstack> The meeting name has been set to 'neutron_drivers'
22:04:52 <kevinbenton> NoneType has no attribute meeting_name
22:05:40 <njohnston> :)
22:06:10 <armax> hello folks
22:06:21 <armax> we’re no officially in feature freeze
22:06:25 <armax> as O-3 is out
22:06:37 <ihrachys> *now
22:06:44 <ihrachys> a slight different :)
22:06:46 <armax> ihrachys: that one
22:07:16 <armax> I’ll start preparing a postmortem document so that we can figure out what gets granted a FFE and what can’t
22:07:37 <armax> make sure the stuff you have looked at/reviewed/care about has up to date information, targets etc
22:07:49 <ihrachys> armax: also maybe starting looking at pre-release checklist would make sense, so that we avoid backports later
22:08:14 <armax> I’ll create an RC1/Pike-1 milestone pages shortly and start rolling stuff over
22:09:06 <armax> I’ll also create a ocata-rc-potential bug tag if it doesn’t exist already so that we can flag what we want to make sure lands in time for ocata
22:09:14 <armax> ihrachys: excellent, we need that
22:10:18 <armax> anything else I am missing?
22:10:29 <ihrachys> gate-failure bugs, do they get FFE by default?
22:10:58 <armax> ihrachys: I’d say so
22:11:05 <ihrachys> good
22:11:13 <armax> about gate-failures, I wanted actually to talk about those briefly during this meeting
22:11:24 <armax> but before we do, anything else release related that comes to mind?
22:12:14 <ihrachys> how do we handle releases for e.g. ovn?
22:12:29 <ihrachys> does FFE apply to them?
22:13:25 <armax> ihrachys: the way this worked the last few times is that if we end up seeing blockers
22:13:38 <armax> especially introduced in neutron by kevinbenton, we’ll most likely know about them
22:13:42 <armax> and release accordingly :)
22:14:10 <ihrachys> no pain no gain :)
22:14:13 <kevinbenton> i'm not sure how i understand what we are talking about relates to OVN releases?
22:14:25 <kevinbenton> (terrible sentence)
22:14:35 <armax> but as for regular outstanding work, I’d defer to the team how they want to handle the merge process in the RC window
22:14:41 <kevinbenton> how are OVN releases related to me breaking the gate?
22:14:44 <armax> unless there’s something that shows up on the neutron dashboard
22:15:12 <armax> kevinbenton: remember when we broke a bunch of DHCP stuff last time?
22:15:28 <kevinbenton> yes
22:15:52 <ihrachys> kevinbenton: yeah I also try to figure. I was merely asking if they follow FFE rules, and if so, if we want stadium to follow the same procedures/timelines as we do for neutron core (btw is it now only neutron + fwaas?)
22:15:53 <armax> that led us to churn a few RCs for neutron
22:16:06 <kevinbenton> armax: for neutron, not for OVN
22:16:27 <armax> kevinbenton: they landed stop-gap in the meantime, remember?
22:16:46 <armax> that may lead to get a new release
22:16:51 <armax> but to ihrachys’ point
22:17:30 <armax> FFE process applies to anything that’s on the neutron launchpad dashboard
22:18:00 <armax> if there’s anything that pertains networking-foo, that’ll have to show up on it for us to have a look at it
22:18:37 <ihrachys> meaning, we don't really enforce release process on e.g. networking-odl even though we seem to vouch on it
22:18:48 <ihrachys> due to them being in stadium
22:18:58 <armax> other than that, when we are about to push an RC release, we can look at the milestone-based projects and advance the hash for them
22:19:02 <armax> however...
22:19:18 <armax> for those projects that are currently on a release:independent schedule
22:19:50 <armax> I think we should make sure that for those in the stadium end up being ready to cut a release soon-ish
22:20:14 <armax> rather than waiting Pike-? for an ocata release
22:20:25 <armax> I’ll send a note out about this
22:20:30 <armax> ihrachys: does that make sense?
22:21:10 <ihrachys> yes. I would suggest we also try to advertise proper release practices to them, because otherwise they may continue landing disruptive patches only to learn later they need to stabilize in a week
22:21:42 <ihrachys> some are still in delusion that they will have months+ after neutron final to prepare their release
22:22:10 <armax> ihrachys: I don’t believe they are that clueless, but we can surely remind that on the ML
22:22:40 <ihrachys> aye, I think we are good on stadium then for the most part
22:23:11 <armax> as I said, I’ll follow up on the ML, I noticed that dasm has already started the thread, and I’ll reply to his email
22:23:36 <armax> and provide more details on what I expect it’ll happen between now and the time Ocata is officially released
22:23:40 <armax> sounds good?
22:24:08 <ihrachys> yes
22:24:12 <armax> sweet
22:24:25 <armax> kevinbenton, amotoki anything to add?
22:24:55 <armax> now about gate snafus
22:25:09 <kevinbenton> let's switch to pecan today. blogan wants to :)
22:25:53 <armax> can’t find the word to express the feeling that statement made me feel
22:26:06 <armax> :)
22:26:30 <ihrachys> lonely?
22:26:41 <armax> ihrachys: no, that’s not it
22:27:03 <armax> as for gate failures, I was looking at grafana right now
22:27:05 <armax> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
22:27:42 <armax> it seems most of the instability is in the xenial versions of the jobs
22:27:46 <armax> would you guys concur?
22:28:12 <armax> it’s fair that trusty runs on mitaka only at this point
22:28:29 <ihrachys> yeah, not many data points there
22:28:30 <armax> and we may not have many data points
22:28:34 <kevinbenton> i haven't been watching many mitaka runs
22:28:52 <armax> kevinbenton: they seemed to have landed fine for what I could tell
22:28:59 <ihrachys> xenial is probably a mix of libvirtd corruptions and oom-killer
22:29:22 <kevinbenton> nobody replied back about my swap observation with the oom thing
22:29:30 <armax> about the oom-killer, is there anything we can do there?
22:29:55 <ihrachys> kevinbenton: I think dasm was replying in the bug didn't he?
22:30:17 <armax> do we know if anyone in the infra team is looking at these?
22:30:57 <dasm> ihrachys: i was. didn't find anything yet
22:31:12 <ihrachys> for the record, the bug is https://bugs.launchpad.net/neutron/+bug/1656386
22:31:12 <openstack> Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
22:31:51 <ihrachys> dasm: "Now, I see it completely reversed." can you clarify what you mean? that you don't see swap used? or that it's different from what kevinbenton observed?
22:32:24 <dasm> i added this chart: https://imgur.com/a/59KYz
22:32:38 <kevinbenton> dasm: in that run swap never exceeds a couple of hundred MB
22:32:44 <ihrachys> armax: I dunno of anyone looking at it. does it suggest we are the only affected?
22:32:59 <dasm> first time when i've seen this issue, i've seen that swap was exceeding 2GBs
22:33:19 <armax> ihrachys: the entire gate is affected
22:33:31 <armax> ihrachys: the integrated gate is, I mean
22:33:36 <armax> we’re default now, remember? :)
22:33:40 <kevinbenton> dasm: we have up to 8GB though
22:33:49 <armax> looking at logstash
22:33:52 <dasm> ihrachys: i've found oom-killer just for tripleo jobs, but as comments, not cause of failure
22:33:54 <kevinbenton> oom killer shouldn't trigger with 6GB free
22:34:15 <armax> I see 39 events in neutron patches against 18 in nova, 15 in cinder and 10 in tempest
22:34:37 <dasm> kevinbenton: ram was consumed ~8GB, then swap arose to ~2GB and after couple minutes, oom-killer was triggered against mysql
22:34:58 <armax> ihrachys: it happened 21 times in the gate queue
22:35:01 <armax> and 110 times in check
22:35:21 <armax> it’s master only at this point
22:35:22 <kevinbenton> dasm: right, that's still bad behavior though. it just killed stuff with 6GB free in that case
22:35:38 <armax> don’t see any occurrence in any stable branch
22:35:45 <dasm> kevinbenton: agree
22:36:00 <ihrachys> kevinbenton: something requested more than 6gb at once? :)
22:36:59 <kevinbenton> ihrachys: maybe. I would assume that would make it a prime target for the OOM killer
22:37:11 <ihrachys> nah that's not how oom-killer works :)
22:37:20 <ihrachys> I believe it's shooting processes at random
22:37:23 <kevinbenton> ihrachys: no
22:37:30 <kevinbenton> ihrachys: each process gets a score
22:37:54 <armax> ihrachys: as a matter of fact oom-killer is always picking up on mysql
22:37:57 <ihrachys> ok but the hungry doesn't necessarily get shot?
22:38:12 <kevinbenton> http://unix.stackexchange.com/questions/153585/how-the-oom-killer-decides-which-process-to-kill-first
22:38:18 <armax> a handful of times it picks something different and that’s why I see OOM kill traces with successful runs
22:38:51 <armax> making mysql less attractive for a oom-kill would be a way to try and mitigate the issue
22:39:30 <armax> however, if we want to do something within our control, all we can do is figure out how to reduce the neutron processes memory footprint
22:39:35 <kevinbenton> armax: well i assume in that case it would randomly just kill nova/neutron/cinder
22:39:38 <armax> kevinbenton and I looked at this
22:39:51 <armax> kevinbenton: it depends on the test run though
22:40:13 <armax> it could kill something that can either be restarted or not being used by the specific test at that specific time, no?
22:40:34 <kevinbenton> armax: who will restart it?
22:40:34 <smcginnis> fwiw, we chased oom issues in cinder for a while and never found a good way to isolate a root cause.
22:40:49 <smcginnis> Other than just things generally getting bigger and requiring more memory.
22:40:59 <kevinbenton> smcginnis: did you also notice that it's doing it way before swap space is exhausted?
22:41:04 <armax> kevinbenton: in neutron we have the processmonitor thingy
22:41:17 <kevinbenton> armax: the processmonitor thingy is for subprocesses of the agent
22:41:20 <smcginnis> kevinbenton: I can't remember specifics now, but there were some strange things with it.
22:41:26 <kevinbenton> armax: the neutron-server will not be restarted
22:41:31 <armax> kevinbenton: I am not saying that’s exactly how it happens
22:41:36 <armax> duh
22:41:38 <armax> :)
22:42:19 <armax> kevinbenton: I’ll find a successful run with an oom-kill to see what actually happened
22:42:27 <clarkb> apparently one potential reason for OOMkiller being invoked even with plenty of swap is if the kernel itself is requesting the memory
22:42:58 <kevinbenton> clarkb: i wonder if this kernel version is janky with our low swappiness setting that we put on the gate?
22:43:20 <armax> clarkb: do you think that trying to keep mysql memory footprint small would be a viable step, at the expense perhaps of running a bit slower in the gate?
22:43:27 <clarkb> or newer kernel in general?
22:43:44 <clarkb> armax: as I said in my email to the thread I think the ideal solution here is if we put openstack on more of a diet like heat did
22:44:00 <clarkb> they dropped memory consumption quite a bit over the last cycle or two
22:44:10 <armax> clarkb: yeah, trimming the memory footprint of the neutron processes is something we’re looking at
22:44:21 <armax> if we compound the effort across all projects we can definitely make a difference
22:44:25 <armax> but these things usually take time
22:44:42 <clarkb> kevinbenton: another possible cause is use of mlock
22:45:00 <kevinbenton> clarkb: would you be amenable to adjusting that swappiness setting upward to see if it improves things?
22:45:15 <clarkb> kevinbenton: ya I think devstack can do that
22:45:24 <armax> clarkb: we saw it in d-g
22:45:31 <kevinbenton> i will propose change
22:45:38 <clarkb> oh I guess we already adjust in d-g ya
22:45:41 <armax> that swappiness is explicitly set towards the low end of the scale
22:45:45 <armax> and for a good reason too
22:46:07 <armax> http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/functions.sh#n432
22:46:13 <ihrachys> could it be nova doing something fancy with guest memory? like enabling mlock?
22:46:26 <armax> mriedem: ^
22:46:41 <armax> we’re brainstorming on the oom-kill failures we’re seeing lately
22:46:52 <armax> any idea about how nova might be hogging memory?
22:47:16 <ihrachys> (...and in such a way that swap is not fully utilized)
22:47:24 <mriedem> not really, no one has profiled it yet
22:47:30 <armax> so kevinbenton how high are you thinking of raising swappiness?
22:47:43 <armax> mriedem: ok
22:47:48 <armax> default is 60
22:48:09 <armax> clarkb: there are also a bunch of mysql settings that devstack sets
22:48:40 <armax> clarkb: what do you think of setting cache sizes as well?
22:48:56 <clarkb> mysql cache sizes? thats probably a better question for mordred or devananda
22:49:14 <clarkb> my guess is that they probably help quite a bit on instances with slower IO in the gate
22:49:44 <kevinbenton> https://review.openstack.org/425961
22:49:54 <armax> clarkb: I see we have lots of connections open
22:49:55 <armax> http://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/databases/mysql#n99
22:49:56 <kevinbenton> clarkb, armax: interesting comments above that
22:50:17 <armax> kevinbenton: you mean the IO one?
22:50:32 <kevinbenton> L426
22:50:57 <armax> nonymous-memory to file-backed mappings
22:51:44 <ihrachys> "kicking in on some processes despite swap being available;"
22:51:55 <dasm> kevinbenton: indeed, interesting. would fit with our use-case (although we're running ubuntu, afair)
22:52:23 <ihrachys> btw time check 8 mins
22:52:24 <clarkb> git blame should show us who ran into that but my guess is ianw when getting stuff running on fedora/centos
22:52:26 <kevinbenton> dasm: right, but setting swappiness to 10 here in the gate happens regardless of what we are running
22:52:26 <dasm> is it possible, that something change with latest ubuntu?
22:52:27 <armax> ihrachys: agreed
22:52:47 <armax> even though I had a different agenda in mind, this turned out to be a good discussion I didn’t want to stop
22:53:13 <dasm> nah. if it would be problem with ubuntu, then we should probably start seeing this oom-killer earlier.
22:53:27 <kevinbenton> did you want to talk more about how we are going to switch to pecan and ovs firewall by default before feature freeze?
22:53:29 <armax> dasm: we started hammering on xenial only recently though
22:53:37 <kevinbenton> dasm: it could be related to a later kernel version in xenial
22:53:50 <armax> and if we indeed increasing the memory of some services, it might be that it was just the last straw
22:54:13 <dasm> mhm
22:54:41 <armax> kevinbenton, clarkb: do we want to try and tweak mysql settings, propose the change and see what reactions people might have?
22:54:55 <clarkb> its possible the weighting has changed on swappiness. Apparently such weighting has changed in the past too
22:55:03 <armax> or do we want to wait and see how the swappiness increase does?
22:55:15 <kevinbenton> armax: let's wait and see if this helps
22:55:17 <clarkb> armax: probably best to see if one toggle at a time makes a difference
22:55:31 <armax> clarkb: ok, I can start on the patch and put it in WIP
22:55:40 <kevinbenton> if it does, we've successfully swept the problem under the rug until the end of next cycle when the memory consumption of all of the services has doubled :)
22:55:50 <dasm> LOL
22:56:01 <armax> we’ll double the swappiness then!
22:56:05 <armax> no
22:56:21 <armax> but in all seriousness we should look at the memory footprints of our services
22:56:29 <armax> to see if in the long term we can trim some fat
22:57:27 <armax> ok, so we’re 3 mins until the top of the hour
22:57:31 <armax> I guess that’s about it
22:58:17 <ihrachys> oktnxbye
22:58:23 <dasm> o/
22:58:26 <armax> #endmeeting
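
For anyone following up on the oom-killer thread above: the two knobs discussed (the per-process score kevinbenton mentioned, and vm.swappiness) can both be read from /proc on a Linux node. The snippet below is only a rough sketch of how one might inspect them, not something that was run or proposed in the meeting. The function names and the proc_root parameter are made up for illustration; the files it reads (/proc/sys/vm/swappiness, /proc/<pid>/oom_score, /proc/<pid>/comm) are standard kernel interfaces.

```python
from pathlib import Path


def read_swappiness(proc_root="/proc"):
    """Return the current vm.swappiness value as an int."""
    return int(Path(proc_root, "sys/vm/swappiness").read_text().strip())


def rank_oom_candidates(proc_root="/proc", limit=5):
    """Return up to `limit` (oom_score, pid, name) tuples, highest score first.

    A higher oom_score means the kernel considers the process a more
    likely victim when the oom-killer runs.
    """
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "sys" or "self"
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited, or the file was unreadable
        rows.append((score, int(entry.name), name))
    # Sort by score descending, then by pid ascending for a stable order.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows[:limit]
```

Calling rank_oom_candidates() on a gate node while a job is running would presumably show whether mysqld sits at the top of the list, which would match armax's observation. Making a process "less attractive", as suggested above, maps to writing a negative value to /proc/<pid>/oom_score_adj (range -1000 to 1000), which needs elevated privileges; that step is deliberately left out of the sketch.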