16:00:54 <ihrachys> #startmeeting neutron_ci
16:00:55 <openstack> Meeting started Tue Feb 7 16:00:54 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:58 <openstack> The meeting name has been set to 'neutron_ci'
16:01:12 <ihrachys> hi everyone
16:01:15 <sindhu> hi
16:01:46 * ihrachys gives a minute for everyone to gather around a fire in a circle
16:01:58 <manjeets> hi
16:02:44 * haleyb gets some marshmallows for the fire
16:03:32 <ihrachys> ok let's get it started. hopefully people will get on board. :)
16:03:52 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:04:21 <ihrachys> I guess we can start with looking at action items from the previous meeting
16:04:30 <ihrachys> #topic Action items from previous meeting
16:04:47 <ihrachys> "armax to make sure periodic functional test job shows up in grafana"
16:05:07 <ihrachys> armax: I still don't see the job in the periodic dashboard in grafana. any news on that one?
16:06:05 <ihrachys> ok, I guess armax is not up that early. I will follow up with him offline.
16:06:26 <ihrachys> #action ihrachys to follow up with armax on periodic functional job not showing up in grafana
16:06:33 <ihrachys> next is: "ihrachys to follow up on elastic-recheck with e-r cores"
16:07:18 <ihrachys> so I sent this email to get some answers from e-r folks: http://lists.openstack.org/pipermail/openstack-dev/2017-January/111288.html
16:08:43 <ihrachys> mtreinish gave some answers; basically, 1) the check queue is as eligible for e-r queries as the gate one, so the functional job can be captured with it; 2) adding cores is a matter of doing some reviews before getting a hammer; 3) there is a recheck bot that we can enable for the neutron channel if we feel a need for that.
16:09:17 <ihrachys> so from our perspective, we should be fine pushing more queries into the repo, and we should probably try to get some review weight in the repo
16:10:15 <ihrachys> #action ihrachys to look at e-r bot for openstack-neutron channel
16:10:18 <manjeets> ihrachys, question: do we have a dashboard for CI-specific patches?
16:10:55 <ihrachys> manjeets: we don't, though we have LP that captures bugs with gate-failure and similar tags
16:11:05 <mtreinish> ihrachys: I said check queue jobs are covered by e-r queries. But we normally only add queries for failures that occur in gate jobs. If it's just the check queue, filtering noise from patches makes it tricky
16:11:10 <armax> ihrachys: I am here
16:11:13 <armax> but in another meeting
16:11:20 <ihrachys> manjeets: we probably can look at writing something to generate such a dashboard
16:11:33 <mtreinish> ihrachys: I can give you a hand updating the e-r bot config, it should be a simple yaml change
16:11:41 <electrocucaracha> ihrachys: do we have criteria for adding a new query to e-r, like more than x number of hits?
16:11:51 <ihrachys> mtreinish: that's the price of having the functional neutron job in check only. we may want to revisit that.
16:12:05 <manjeets> ihrachys, yes thanks
16:12:26 <ihrachys> electrocucaracha: I don't think we have any, but the guidelines would be: it's some high-profile gate bug, and we can't quickly come up with a fix.
16:12:41 <ihrachys> manjeets: do you want to look at creating such a dashboard?
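For context on the queries discussed above: an elastic-recheck query is a small YAML file added to the elastic-recheck repo, with the file named after the Launchpad bug number and a single query field holding the Logstash search string that fingerprints the failure. A minimal sketch follows; the bug number, job name, and message string are placeholders, not a verified fingerprint.

    # hypothetical queries/NNNNNNN.yaml in openstack-infra/elastic-recheck
    # (bug number, message, and build_name below are illustrative placeholders)
    query: >-
      message:"OVSDB command timed out" AND
      build_name:"gate-neutron-dsvm-functional" AND
      build_status:"FAILURE"

Once such a query merges, matching failures are counted on the elastic-recheck status pages, which is the basis for prioritizing gate bugs.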
16:12:55 <manjeets> ihrachys, sure, i'll take a look
16:13:04 <ihrachys> manjeets: rossella_s wrote one for patches targeted for the next release, you could probably reuse her work
16:13:28 <mtreinish> electrocucaracha: if you want to get your feet wet, everything here http://status.openstack.org/elastic-recheck/data/integrated_gate.html needs categorization
16:13:31 <electrocucaracha> ihrachys: regarding point 3, it seems like it's only a matter of adding a new entry in the yaml file https://github.com/openstack-infra/project-config/blob/master/gerritbot/channels.yaml#L831
16:13:34 <manjeets> cool, thanks for the example
16:13:45 <ihrachys> manjeets: see Gerrit Dashboard Links at the top of http://status.openstack.org/reviews/
16:13:56 <electrocucaracha> thanks mtreinish
16:14:03 <rossella_s> ihrachys, manjeets I can help with that
16:14:17 <mtreinish> if you find a fingerprint for any of those failures, that's something we'd definitely accept an e-r query for
16:14:44 <ihrachys> #action manjeets to produce a gerrit dashboard for gate and functional failures
16:14:47 <manjeets> thanks rossella_s, i'll go through it and will ping you if any help is needed
16:15:04 <mtreinish> electrocucaracha: that's actually not the correct irc bot. The elastic-recheck config lives in the puppet-elastic_recheck repo
16:15:11 <mtreinish> I linked to it in my ML post
16:15:25 <ihrachys> armax: np, I will ping you later to see what we can do with the dashboard
16:15:59 <ihrachys> overall, some grafana dashboards are in bad shape, we may need to have a broader look at them
16:16:48 <ihrachys> ok, the next action was: "ihrachys to follow up with infra on forbidding bare gerrit rechecks"
16:17:01 <ihrachys> I haven't actually followed up with infra, though I checked the project-config code
16:17:09 <mtreinish> ihrachys: if you have ideas on how to make http://status.openstack.org/openstack-health/#/g/project/openstack~2Fneutron more useful, that's something we should work on too
16:17:20 <ihrachys> basically, the gerrit comment recheck filter is per queue, not per project
16:18:03 <ihrachys> here: https://github.com/openstack-infra/project-config/blob/master/zuul/layout.yaml#L20
16:18:33 <ihrachys> so I believe it would require some more work on the infra side to make it per project. but I will still check with infra to make sure.
16:19:14 <ihrachys> mtreinish: what do you mean? is it somehow project specific? or do you mean just general improvements that may help everyone?
16:19:59 <mtreinish> you were talking about dashboards, and I would like to make sure that your needs are being met with openstack-health. Whatever improvements neutron needed would likely benefit everyone
16:20:14 <mtreinish> so instead of doing it in a corner, I just wanted to see if there was space to work on that in o-h
16:20:20 <ihrachys> you see -health as a replacement for grafana?
16:20:53 <mtreinish> I see it as something that can consume grafana as necessary. But I want to unify all the test results dashboards in a single place
16:21:01 <mtreinish> instead of jumping around between a dozen web pages
16:22:42 <ihrachys> makes sense. it's just that we were so far set on grafana; it's probably time to consider -health first for any new ideas.
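For the gerrit dashboard action item above, one common approach is the gerrit-dash-creator format: an INI-style .dash file that the tool turns into a Gerrit dashboard URL like the ones linked from status.openstack.org/reviews/. A rough sketch, assuming placeholder section queries rather than agreed-upon filters:

    # hypothetical neutron-ci.dash file for gerrit-dash-creator
    [dashboard]
    title = Neutron CI fixes
    description = Patches addressing gate and functional job failures
    foreach = project:openstack/neutron status:open

    # section queries are illustrative; real ones would match whatever
    # convention (topic, commit message tag) the team settles on
    [section "Gate failure fixes"]
    query = message:"gate-failure"

    [section "Functional test changes"]
    query = file:"^neutron/tests/functional/.*"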
16:23:13 <ihrachys> ok, the next action was "ihrachys check if we can enable dstat logs for functional job"
16:23:40 <ihrachys> that's to track system load during functional test runs that sometimes produce timeouts in ovsdb native
16:23:52 <ihrachys> I posted this: https://review.openstack.org/427358, please have a look
16:25:05 <ihrachys> on a related note, I also posted https://review.openstack.org/427362 to properly index per-testcase messages in logstash; that should also help us with elastic-recheck queries, and overall with understanding the impact of some failures
16:25:07 <manjeets> ihrachys, where will it dump the logs? a separate screen window?
16:25:27 <manjeets> i mean a separate file?
16:25:34 <ihrachys> sorry, not a project-config patch, I wanted to post https://review.openstack.org/430316 instead
16:25:48 <ihrachys> manjeets: yes, it should go in screen-dstat as in devstack runs
16:26:22 <manjeets> ohk
16:26:25 <ihrachys> finally, there is some peakmem-tracker service in devstack that I try to enable here: https://review.openstack.org/430289 (not sure if it will even work, I haven't found other repos that use the service)
16:26:53 <ihrachys> finally, the last action item is "jlibosva to explore what broke scenario job"
16:27:01 <ihrachys> sadly I don't see Jakub
16:27:39 <ihrachys> but afaik the failures were related to bad ubuntu image contents
16:27:55 <ihrachys> so we pinned the image with https://review.openstack.org/#/c/425165/
16:28:24 <ihrachys> and Jakub also has a patch to enable console logging for connectivity failures in scenario jobs: https://review.openstack.org/#/c/427312/, that one needs a second +2, please review
16:29:16 <ihrachys> sadly grafana shows that scenario jobs are still at 80% to 100% failure rate, something that did not happen even a month ago
16:29:29 <ihrachys> so there is still something to follow up on
16:29:39 <ihrachys> #action jlibosva to follow up on scenario failures
16:30:13 <ihrachys> overall, those jobs will need to go voting, or it will be another fullstack job broken once in a while :)
16:32:11 <ihrachys> ok, let's have a look at bugs now
16:32:28 <ihrachys> #topic Known gate failures
16:32:39 <ihrachys> #link https://goo.gl/IYhs7k Confirmed/In progress bugs
16:33:05 <ihrachys> ok, so first is the ovsdb native timeout
16:33:26 <ihrachys> otherwiseguy was kind enough to produce a patch that hopefully mitigates the issue: https://review.openstack.org/#/c/429095/
16:33:32 <ihrachys> and it already has +W, nice
16:34:01 <ihrachys> there is a backport of the patch for Ocata: https://review.openstack.org/#/q/I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e,n,z
16:34:30 <ihrachys> so far rechecks in gerrit show some good results
16:34:42 <ihrachys> we will monitor the failure rate after it lands
16:36:17 <ihrachys> another bug that lingers in our gates is bug 1643911
16:36:17 <openstack> bug 1643911 in OpenStack Compute (nova) "libvirt randomly crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd': malloc(): memory corruption:"" [Medium,Confirmed] https://launchpad.net/bugs/1643911
16:37:21 <ihrachys> the last time I checked, armax suspected it to be the same as the oom-killer spree bug in the gate, something discussed extensively in http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html
16:38:02 <ihrachys> armax made several attempts to lower the memory footprint for neutron, like the one merged in https://review.openstack.org/429069
16:38:17 <ihrachys> it's not a complete solution, but hopefully it buys us some time
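As background for the dstat discussion above: the dstat service periodically samples CPU, memory, swap, disk, and network so that load spikes can be correlated with test timeouts, and its output lands in the screen-dstat log. The exact command is defined by the patches linked above; the invocation below is only a sketch of what such a sampler typically looks like (flags and the output path are assumptions, not the job's literal command).

    # illustrative dstat invocation: sample system stats every second and also
    # write a CSV copy that can be graphed after the job finishes
    dstat --time --cpu --mem --swap --net --disk --io --top-cpu --top-mem \
          --output /opt/stack/logs/dstat-csv.log 1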
16:38:48 <ihrachys> there is an action item to actually run a memory profiler against neutron services and see what takes the most
16:39:12 <ihrachys> afaiu armax won't have time in the next days for that, so, anyone willing to try it out?
16:41:03 <ihrachys> I may give some guidance if you are hesitant about tools to try :)
16:41:22 <ihrachys> anyway, reach out if you have cycles for this high-profile assignment
16:41:23 <ihrachys> :)
16:41:51 <manjeets> wouldn't enabling dstat help?
16:42:52 <ihrachys> it will give us info about how the system behaved while tests were running, but it won't give us info on which data structures use the memory
16:43:05 <manjeets> ohk gotcha
16:44:10 <armax> ihrachys: the devstack changes landed at last
16:44:26 <armax> preliminary logstash results seem promising
16:44:29 <ihrachys> armax: do we see the rate going down?
16:44:33 <armax> but that only bought us a few more days
16:44:34 <ihrachys> mmm, good
16:44:55 <ihrachys> armax: a few more days? you are optimistic about what the community can achieve in such a short time :P
16:45:03 <armax> ihrachys: I haven't looked in great detail, but the last failure was yesterday lunchtime PST
16:45:10 <ihrachys> what was it? like 300 MB freed?
16:45:25 <armax> ihrachys: between 350 and 400 MB of RSS memory, yes
16:45:57 <armax> some might be shared, but it should be enough to push the ceiling a bit further up and help avoid oom-kills and libvirt barfing all over the place
16:46:10 <mtreinish> armax: fwiw, harlowja and I have been playing with tracemalloc to try and profile the memory usage
16:46:19 <armax> mtreinish: nice
16:46:29 <armax> mtreinish: for libvirt?
16:47:05 <mtreinish> well, I started with turning the profiling on for the neutron api server
16:47:43 <mtreinish> https://review.openstack.org/#/q/status:open+topic:tracemalloc
16:48:20 <ihrachys> mtreinish: any results to consume so far? I see all red in the neutron patch.
16:48:48 <mtreinish> the neutron patch won't work, it's just a DNM to set up stuff for testing. Look at the oslo.service patch's tempest job
16:48:52 <mtreinish> there are memory snapshots there
16:48:56 <mtreinish> if you want your browser to hate you, this kind of thing is the end goal: http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg
16:49:24 * ihrachys puts a life vest on and clicks
16:49:24 <mtreinish> we're still debugging the memory snapshot collection, because what we're collecting doesn't match what ps says the process is consuming
16:49:41 <mtreinish> ihrachys: heh, it's 26MB IIRC
16:50:19 <ihrachys> cool, let me capture the link in the notes
16:50:39 <ihrachys> #link https://review.openstack.org/#/q/status:open+topic:tracemalloc Attempt to trace memory usage for Neutron
16:51:01 <ihrachys> #link http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg Neutron API server memory trace attempt
16:51:08 <armax> mtreinish: would applying the same patches to nova help correlate/spot a pattern?
16:51:54 <ihrachys> armax: and maybe also a service that does not seem to be as obese
16:51:56 <mtreinish> armax: that's the theory
16:52:19 <mtreinish> once we can figure out a successful pattern for collecting and visualizing where things are eating memory, we can apply it to all the things
16:52:38 <armax> mtreinish: understood
16:52:43 <armax> mtreinish: thanks for looking into this
16:52:50 <ihrachys> is the tracer usable in the gate? does it slow down/destabilize jobs?
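For anyone picking up the memory-profiling action item: tracemalloc, which mtreinish mentions above, is in the standard library from Python 3.4 on (Python 2 needed the pytracemalloc backport). The snippet below is only a minimal standalone sketch of taking and comparing snapshots, not the actual DNM/oslo.service patches linked above.

    # minimal tracemalloc sketch: snapshot allocations and print top growers
    import tracemalloc

    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

    before = tracemalloc.take_snapshot()
    data = [object() for _ in range(100000)]  # stand-in for the real workload
    after = tracemalloc.take_snapshot()

    # show which source lines grew the most between the two snapshots
    for stat in after.compare_to(before, 'lineno')[:10]:
        print(stat)

Grouping by 'lineno' points at individual allocation sites; grouping by 'traceback' gives the full call chains that flame-graph tooling like the one linked above can visualize.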
16:52:59 <armax> I have been struggling to find time to go deeper into this headache
16:53:09 * mtreinish prepares his little vm for the traffic flood
16:53:33 <mtreinish> ihrachys: there didn't seem to be too much of an overhead, but I wasn't watching it closely
16:53:51 <mtreinish> it's still very early in all of this (I just started playing with it yesterday afternoon :) )
16:54:55 <ihrachys> gotcha, thanks a lot for taking it upon yourself
16:55:19 <ihrachys> on a related note, I am not sure we got to the root of why we don't use all the swap
16:55:37 <mtreinish> ihrachys, armax: oh, if you want to generate that flame graph locally: http://paste.openstack.org/show/597889/
16:55:45 <ihrachys> we played with the swappiness knob to no effect, I believe: https://review.openstack.org/#/c/425961/
16:55:55 <mtreinish> just take the snapshots from the oslo.service log dir
16:58:03 <ihrachys> #action ihrachys to read about how swappiness is supposed to work, and why it doesn't in the gate
16:58:25 <ihrachys> #topic Open discussion
16:58:43 <ihrachys> we are almost at the top of the hour. anything worth mentioning before we wrap up?
17:00:09 <ihrachys> ok, thanks everyone for joining, and for working on making the gate great again
17:00:09 <ihrachys> #endmeeting
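A note on the swappiness action item recorded above: vm.swappiness (0-100, default 60) only biases the kernel toward swapping anonymous pages versus reclaiming page cache; it is a hint, not a hard policy, which may be part of why tweaking it appeared to change nothing. A quick way to poke at it on a gate-like node is sketched below; these are generic Linux commands, not part of any of the patches discussed in the meeting.

    # inspect and tweak swappiness, then watch actual swap usage
    cat /proc/sys/vm/swappiness       # current value, default 60
    sudo sysctl -w vm.swappiness=100  # higher = more willing to swap anonymous pages
    free -m                           # compare the Swap: line before/after a test run
    sudo dmesg | grep -i oom          # check whether the OOM killer fired anyway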