16:00:54 #startmeeting neutron_ci
16:00:55 Meeting started Tue Feb 7 16:00:54 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:58 The meeting name has been set to 'neutron_ci'
16:01:12 hi everyone
16:01:15 hi
16:01:46 * ihrachys gives a minute for everyone to gather around a fire in a circle
16:01:58 hi
16:02:44 * haleyb gets some marshmallows for the fire
16:03:32 ok let's get it started. hopefully people will get on board. :)
16:03:52 #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:04:21 I guess we can start with looking at action items from the previous meeting
16:04:30 #topic Action items from previous meeting
16:04:47 "armax to make sure periodic functional test job shows up in grafana"
16:05:07 armax: I still don't see the job in the periodic dashboard in grafana. any news on that one?
16:06:05 ok, I guess armax is not up that early. I will follow up with him offline.
16:06:26 #action ihrachys to follow up with armax on periodic functional job not showing up in grafana
16:06:33 next is: "ihrachys to follow up on elastic-recheck with e-r cores"
16:07:18 so I sent this email to get some answers from e-r folks: http://lists.openstack.org/pipermail/openstack-dev/2017-January/111288.html
16:08:43 mtreinish gave some answers; basically, 1) the check queue is as eligible for e-r queries as the gate one, so the functional job can be captured with it; 2) adding cores is a matter of doing some reviews before getting a hammer; 3) there is a recheck bot that we can enable for the neutron channel if we feel a need for that.
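For context on the queries being discussed: an elastic-recheck query is a small YAML file in the elastic-recheck repo's queries/ directory, named after the Launchpad bug it fingerprints. A minimal sketch (the bug number and message string here are illustrative, not an actual fingerprint from this meeting):

```yaml
# queries/1643911.yaml (hypothetical example)
# The query is an elasticsearch-style search over indexed job logs;
# a hit on a failed run classifies that failure against the bug.
query: >-
  message:"malloc(): memory corruption" AND
  tags:"syslog"
```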
16:09:17 so from our perspective, we should be fine pushing more queries into the repo, and we should probably try to get some review weight in the repo
16:10:15 #action ihrachys to look at e-r bot for openstack-neutron channel
16:10:18 ihrachys, question : do we have a dashboard for ci specific patches ?
16:10:55 manjeets: we don't, though we have LP that captures bugs with gate-failure and similar tags
16:11:05 ihrachys: I said check queue jobs are covered by e-r queries. But we normally only add queries for failures that occur in gate jobs. If it's just the check queue, filtering noise from patches makes it tricky
16:11:10 ihrachys: I am here
16:11:13 but in another meeting
16:11:20 manjeets: we probably can look at writing something to generate such a dashboard
16:11:33 ihrachys: I can give you a hand updating the e-r bot config, it should be a simple yaml change
16:11:41 ihrachys: do we have a criteria for adding a new query to e-r, like more than x number of hits?
16:11:51 mtreinish: that's the price of having the functional neutron job in check only. we may want to revisit that.
16:12:05 ihrachys, yes thanks
16:12:26 electrocucaracha: I don't think we have one, but the guidelines would be - it's some high profile gate bug, and we can't quickly come up with a fix.
16:12:41 manjeets: do you want to look at creating such a dashboard?
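One common way to build the kind of per-topic review dashboard discussed here is gerrit-dash-creator (not named in the meeting), which turns an INI-style .dash file into a Gerrit dashboard URL. A sketch, with queries and titles purely illustrative:

```ini
# neutron-ci.dash (hypothetical) -- fed to gerrit-dash-creator
[dashboard]
title = Neutron CI Patches
description = Open patches addressing gate and functional failures
foreach = project:openstack/neutron status:open

[section "Gate failure fixes"]
query = message:"gate-failure"

[section "Functional job fixes"]
query = message:"functional"
```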
16:12:55 ihrachys, sure i'll take a look
16:13:04 manjeets: rossella_s wrote one for patches targeted for the next release, you probably could reuse her work
16:13:28 electrocucaracha: if you want to get your feet wet, everything here http://status.openstack.org/elastic-recheck/data/integrated_gate.html needs categorization
16:13:31 ihrachys: regarding point 3, it seems like it's only a matter of adding a new entry in the yaml file https://github.com/openstack-infra/project-config/blob/master/gerritbot/channels.yaml#L831
16:13:34 cool thanks for the example
16:13:45 manjeets: see Gerrit Dashboard Links at the top of http://status.openstack.org/reviews/
16:13:56 thanks mtreinish
16:14:03 ihrachys, manjeets I can help with that
16:14:17 if you find a fingerprint for any of those failures that's something we'd definitely accept an e-r query for
16:14:44 #action manjeets to produce a gerrit dashboard for gate and functional failures
16:14:47 thanks rossella_s i'll go through and will ping you if any help needed
16:15:04 electrocucaracha: that's actually not the correct irc bot. The elastic recheck config lives in the puppet-elastic_recheck repo
16:15:11 I linked to it in my ML post
16:15:25 armax: np, I will ping you later to see what we can do with the dashboard
16:15:59 overall, some grafana dashboards are in bad shape, we may need to have a broader look at them
16:16:48 ok next action was: "ihrachys to follow up with infra on forbidding bare gerrit rechecks"
16:17:01 I haven't actually followed up with infra, though I checked the project-config code
16:17:09 ihrachys: if you have ideas on how to make http://status.openstack.org/openstack-health/#/g/project/openstack~2Fneutron more useful, that's something we should work on too
16:17:20 basically, the gerrit comment recheck filter is per queue, not per project
16:18:03 here: https://github.com/openstack-infra/project-config/blob/master/zuul/layout.yaml#L20
16:18:33 so I believe it would require some more work on the infra side to make it per project. but I will still check with infra to make sure.
16:19:14 mtreinish: what do you mean? is it somehow project specific? or do you mean just general improvements that may help everyone?
16:19:59 you were talking about dashboards, and I would like to make sure that your needs are being met with openstack-health. Whatever improvements neutron needed would likely benefit everyone
16:20:14 so instead of doing it in a corner, I just wanted to see if there was space to work on that in o-h
16:20:20 you see -health as a replacement for grafana?
16:20:53 I see it as something that can consume grafana as necessary. But I want to unify all the test results dashboards in a single place
16:21:01 instead of jumping around between a dozen web pages
16:22:42 makes sense. it's just we were so far set on grafana, probably it's time to consider -health first for any new ideas.
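The "per queue, not per project" point refers to how the recheck comment filter is attached to the check pipeline's trigger in zuul's layout.yaml. Roughly (simplified and abbreviated, not the exact file contents):

```yaml
# zuul layout.yaml sketch: the recheck regex lives on the pipeline,
# so it applies to every project sharing that pipeline
pipelines:
  - name: check
    trigger:
      gerrit:
        - event: patchset-created
        - event: comment-added
          comment: (?i)^(Patch Set [0-9]+:)?( [\w\\+-]*)*(\n\n)?\s*(recheck|reverify)
```

Because the comment regex belongs to the pipeline rather than to any project, forbidding bare rechecks for neutron alone would need new zuul-side support, as noted above.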
16:23:13 ok, next action was "ihrachys check if we can enable dstat logs for functional job"
16:23:40 that's to track system load during functional test runs that sometimes produce timeouts in ovsdb native
16:23:52 I posted this https://review.openstack.org/427358, please have a look
16:25:05 on a related note, I also posted https://review.openstack.org/427362 to properly index per-testcase messages in logstash, that should also help us with elastic-recheck queries, and overall with understanding the impact of some failures
16:25:07 ihrachys, where will it dump the logs, a separate screen window ?
16:25:27 i mean separate file ?
16:25:34 sorry, not the project-config patch, I wanted to post https://review.openstack.org/430316 instead
16:25:48 manjeets: yes, it should go in screen-dstat as in devstack runs
16:26:22 ohk
16:26:25 finally, there is some peakmem-tracker service in devstack that I try to enable here: https://review.openstack.org/430289 (not sure if it will even work, I haven't found other repos that use the service)
16:26:53 finally, the last action item is "jlibosva to explore what broke scenario job"
16:27:01 sadly I don't see Jakub
16:27:39 but afaik the failures were related to bad ubuntu image contents
16:27:55 so we pinned the image with https://review.openstack.org/#/c/425165/
16:28:24 and Jakub also has a patch to enable console logging for connectivity failures in scenario jobs: https://review.openstack.org/#/c/427312/, that one needs a second +2, please review
16:29:16 sadly grafana shows that scenario jobs are still at 80% to 100% failure rate, something that did not happen even a month ago
16:29:29 so there is still something to follow up on
16:29:39 #action jlibosva to follow up on scenario failures
16:30:13 overall, those jobs will need to go voting, or it will be another fullstack job broken once in a while :)
16:32:11 ok let's have a look at bugs now
16:32:28 #topic Known gate failures
16:32:39 #link https://goo.gl/IYhs7k Confirmed/In progress bugs
16:33:05 ok, so first is the ovsdb native timeout
16:33:26 otherwiseguy was kind enough to produce a patch that hopefully mitigates the issue: https://review.openstack.org/#/c/429095/
16:33:32 and it already has +W, nice
16:34:01 there is a backport of the patch for Ocata at: https://review.openstack.org/#/q/I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e,n,z
16:34:30 so far rechecks in gerrit show some good results
16:34:42 we will monitor the failure rate after it lands
16:36:17 another bug that lingers in our gates is bug 1643911
16:36:17 bug 1643911 in OpenStack Compute (nova) "libvirt randomly crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd': malloc(): memory corruption:"" [Medium,Confirmed] https://launchpad.net/bugs/1643911
16:37:21 the last time I checked, armax suspected it to be the same as the oom-killer spree bug in gate, something discussed extensively in http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html
16:38:02 armax made several attempts to lower the memory footprint for neutron, like the one merged in https://review.openstack.org/429069
16:38:17 it's not a complete solution, but hopefully buys us some time
16:38:48 there is an action item to actually run a memory profiler against neutron services and see what takes the most
16:39:12 afaiu armax won't have time in the next days for that, so, anyone willing to try it out?
16:41:03 I may give some guidance if you are hesitant about tools to try :)
16:41:22 anyway, reach out if you have cycles for this high profile assignment :
16:41:23 :)
16:41:51 wouldn't enabling dstat help ?
16:42:52 it will give us info about how the system behaved while tests were running, but it won't give us info on which data structures use the memory
16:43:05 ohk gotcha
16:44:10 ihrachys: the devstack changes landed at last
16:44:26 preliminary logstash results seem promising
16:44:29 armax: do we see the rate going down?
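For reference, enabling dstat (and the peakmem-tracker mentioned above) in a devstack-based job comes down to service toggles in local.conf; the output lands in the screen-dstat log, as noted in the discussion. A sketch, assuming a standard devstack setup (service names taken from devstack of that era, not from the linked reviews):

```ini
# local.conf sketch: log system load (dstat) and peak memory usage
# during the run; output goes to the screen-dstat log
[[local|localrc]]
enable_service dstat
enable_service peakmem_tracker
```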
16:44:33 but that only bought us a few more days
16:44:34 mmm, good
16:44:55 armax: a few more days? you are optimistic about what the community can achieve in such a short time :P
16:45:03 ihrachys: I haven't looked in great detail, but the last failure was yesterday lunchtime PST
16:45:10 what was it? like 300 mb freed?
16:45:25 ihrachys: between 350 and 400 MB of RSS memory, yes
16:45:57 some might be shared, but it should be enough to push the ceiling a bit further up and help avoid oom-kills and libvirt barfing all over the place
16:46:10 armax: fwiw, harlowja and I have been playing with tracemalloc to try and profile the memory usage
16:46:19 mtreinish: nice
16:46:29 mtreinish: for libvirt?
16:47:05 well I started with turning the profiling on for the neutron api server
16:47:43 https://review.openstack.org/#/q/status:open+topic:tracemalloc
16:48:20 mtreinish: any results to consume so far? I see all red in the neutron patch.
16:48:48 the neutron patch won't work, it's just a DNM to set up stuff for testing. Look at the oslo.service patch's tempest job
16:48:52 there are memory snapshots there
16:48:56 if you want your browser to hate you, this kind of thing is the end goal: http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg
16:49:24 * ihrachys puts a life vest on and clicks
16:49:24 we're still debugging the memory snapshot collection, because what we're collecting doesn't match what ps says the process is consuming
16:49:41 ihrachys: heh, it's 26MB IIRC
16:50:19 cool, let me capture the link in the notes
16:50:39 #link https://review.openstack.org/#/q/status:open+topic:tracemalloc Attempt to trace memory usage for Neutron
16:51:01 #link http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg Neutron API server memory trace attempt
16:51:08 mtreinish: would applying the same patches to nova help correlate/spot a pattern?
16:51:54 armax: and maybe also a service that does not seem to be as obese
16:51:56 armax: that's the theory
16:52:19 once we can figure out a successful pattern for collecting and visualizing where things are eating memory we can apply it to all the things
16:52:38 mtreinish: understood
16:52:43 mtreinish: thanks for looking into this
16:52:50 is the tracer usable in gate? does it slow down/destabilize jobs?
16:52:59 I have been struggling to find time to go deeper into this headache
16:53:09 * mtreinish prepares his little vm for the traffic flood
16:53:33 ihrachys: there didn't seem to be too much of an overhead, but I wasn't watching it closely
16:53:51 it's still very early in all of this (I just started playing with it yesterday afternoon :) )
16:54:55 gotcha, thanks a lot for taking it upon yourself
16:55:19 on a related note, I am not sure we got to the root of why we don't use all the swap
16:55:37 ihrachys, armax: oh if you want to generate that flame graph locally: http://paste.openstack.org/show/597889/
16:55:45 we played with the swappiness knob to no effect I believe: https://review.openstack.org/#/c/425961/
16:55:55 just take the snapshots from the oslo.service log dir
16:58:03 #action ihrachys to read about how swappiness is supposed to work, and why it doesn't in gate
16:58:25 #topic Open discussion
16:58:43 we are almost at the top of the hour. anything worth mentioning before we wrap up?
17:00:09 ok thanks everyone for joining, and working on making the gate great again
17:00:09 #endmeeting
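For the swappiness action item: vm.swappiness (0-100 on Linux of this era) biases the kernel between reclaiming page cache and swapping out anonymous memory; higher values make it swap more eagerly, which is presumably what the linked review was tuning. A quick way to inspect and change it (the sysctl line needs root, so it is left commented):

```shell
# Read the current swappiness value (default is typically 60)
cat /proc/sys/vm/swappiness
# Raising it pushes the kernel to swap anonymous pages more eagerly:
# sudo sysctl vm.swappiness=100
```

Note that even a high swappiness will not force swap usage if the kernel can satisfy allocations by dropping cache, which may be part of why the gate nodes never touched all of their swap.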