16:00:52 <ihrachys> #startmeeting neutron_ci
16:00:52 <openstack> Meeting started Tue Jul 25 16:00:52 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:56 <openstack> The meeting name has been set to 'neutron_ci'
16:01:09 <ihrachys> jlibosva, o/
16:01:15 <jlibosva> hi
16:01:31 <ihrachys> #topic Actions from prev week
16:01:43 <ihrachys> "jlibosva to report bugs for eventlet and maybe oslo.service for eventlet signal/timer issue with py3"
16:01:53 <ihrachys> jlibosva, I recollect you reported a bug against oslo.service
16:02:04 <jlibosva> there was one reported by Victor IIRC
16:02:08 <ihrachys> https://bugs.launchpad.net/oslo.service/+bug/1705047
16:02:08 <openstack> Launchpad bug 1705047 in oslo.service "Race condition in signal handling on Python 3" [Undecided,New]
16:02:35 <jlibosva> yep
16:02:37 <jlibosva> that one
16:02:46 <ihrachys> was one reported for eventlet? or is it not really a bug for them?
16:03:17 <jlibosva> I think he was working on a fix for eventlet but he sent me only this one, lemme check
16:03:40 <ihrachys> what's his irc nick?
16:03:57 <jlibosva> I don't see any eventlet issues on github, seems like that's their bugtracker
16:04:02 <jlibosva> vstinner is his irc nick
16:05:28 <ihrachys> ok
16:05:35 <ihrachys> well if it's oslo.service so be it
16:05:41 <ihrachys> I assume we will see a fix in the near future
16:05:50 <ihrachys> but we may need to track it so that we can close it in pike
16:06:02 <jlibosva> would be good if that went to oslo.service
16:06:04 <jlibosva> imho
16:06:12 <ihrachys> yeah, no new eventlet requirement
16:06:12 <haleyb> hi
16:06:17 <ihrachys> haleyb, heya
16:06:30 <ihrachys> ok, next item was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:06:50 <jlibosva> I did but it's a WIP - https://review.openstack.org/#/c/484874/
16:06:53 <jlibosva> failing tests
16:07:08 <jlibosva> oh, I see I pushed the wrong patch :)
16:07:13 <jlibosva> in comments there
16:07:19 <jlibosva> anyway, there is still work to be done there
16:07:51 <haleyb> hi there
16:08:05 <jlibosva> hello
16:08:18 <ihrachys> jlibosva, do you need reviews at this point, or do you want to spin on it?
16:08:33 <jlibosva> ihrachys: no reviews needed
16:08:35 <ihrachys> ack
16:08:48 <ihrachys> next was "ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases"
16:09:00 <ihrachys> you think I did this time? not in the slightest.
16:09:08 <ihrachys> and it seems like I am swamped for this week too
16:09:20 <ihrachys> so if someone would like to take a look, feel free to take over
16:09:48 <ihrachys> next was "haleyb to collect feedback of key contributors on multinode-by-default patch"
16:10:07 <ihrachys> haleyb, what's the result of the thread on switching to multinode grenade for the integrated gate?
16:10:16 <haleyb> there weren't many comments
16:11:08 <ihrachys> general stance seems to be "let's try it and see"?
16:11:27 <haleyb> i will ping the QA PTL to make sure they don't have a problem, think it's andreaf
16:11:40 <ihrachys> ack
16:11:51 <haleyb> otherwise we'll see what happens :)
16:12:01 <ihrachys> one question I'm still not clear on is whether we want to get rid of single node coverage for the grenade repo itself
16:12:24 <ihrachys> there may be a reason to keep it working for cases like local reproduction
16:13:12 <ihrachys> not sure who is responsible for grenade to make this call
16:13:13 <haleyb> i can leave it in the check and gate queues there
16:13:32 <ihrachys> haleyb, that would definitely be on the safe side. they can decide to clean up on their own
16:14:14 <haleyb> that might fall under qa as well, i'll find out
16:14:15 <ihrachys> #action haleyb to reach out to QA PTL about switching grenade integrated gate to multinode
16:14:34 <ihrachys> next item was "jlibosva to send a debug patch for random test runner murders, and recheck, recheck, recheck"
16:14:45 <ihrachys> this is in relation to the functional test timeouts we saw the prev weeks
16:14:55 <jlibosva> I sent out one patch today
16:14:58 <ihrachys> to recap, the current failures were triaged and classified in https://etherpad.openstack.org/p/neutron-functional-gate-failures-july
16:15:30 <jlibosva> ihrachys: I also re-shuffled the above, most of the firewall failures were caused by the test executor being killed
16:15:36 <ihrachys> you mean this? https://review.openstack.org/#/c/487065/2/tools/kill.sh
16:16:03 <jlibosva> fortunately, we do log all execute() calls. unfortunately, there is no call having the pid of the killed executor
16:16:06 <jlibosva> ihrachys: yes, that one
16:16:19 * haleyb is having irc lag if he seems slow
16:16:30 <jlibosva> so I thought I'd log each call to the kill binary
16:16:47 <jlibosva> which won't catch calls like os.kill() from python but at least it's something
16:17:29 <ihrachys> jlibosva, cool. btw we could probably use the -index.txt log file to get messages from all tests executed around the Killed message
16:17:41 <ihrachys> maybe that makes it easier to spot some other thread calling kill
16:17:55 <ihrachys> though the number of messages there may be hard to digest
16:18:11 <ihrachys> jlibosva, so the plan now is to recheck the patch until it hits the timeout?
16:18:35 <jlibosva> probably, maybe if it doesn't cause many issues, we could merge it and wait for a failure
16:19:36 <ihrachys> ack
16:19:41 <ihrachys> let's review the state next week
16:19:57 <ihrachys> #action jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back
16:20:22 <ihrachys> that's all there is regarding action items
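A minimal sketch of the kill-logging idea jlibosva describes above, assuming a wrapper script placed ahead of the real kill binary on PATH; this is an illustration only, not the contents of the tools/kill.sh patch under review. The log path, field layout, and use of $PPID are assumptions made for the sketch, and, as noted in the discussion, it only catches callers that exec the kill binary, not os.kill() from Python.

    #!/bin/bash
    # Hypothetical wrapper that records every invocation of kill (arguments
    # plus the calling process and its command line) before delegating to
    # the real /bin/kill, to help find what kills the test executor.
    LOG=/tmp/kill-calls.log   # assumed log location, purely illustrative

    caller_pid=$PPID
    caller_cmd=$(tr '\0' ' ' < /proc/$caller_pid/cmdline 2>/dev/null)
    echo "$(date '+%Y-%m-%d %H:%M:%S') caller_pid=$caller_pid caller_cmd=$caller_cmd args=$*" >> "$LOG"

    # Hand off to the real binary so behaviour is unchanged.
    exec /bin/kill "$@"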
16:20:27 <ihrachys> #topic Grafana
16:20:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:59 <ihrachys> haleyb, I see that in the grenade gate queue, the dvr flavour is nuts
16:21:12 <ihrachys> regular at 0% but dvr is 30%?
16:21:31 <ihrachys> it's the job we propose for the integrated gate right?
16:21:38 <ihrachys> the chart: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=6&fullscreen
16:22:06 <haleyb> ihrachys: yes, since 00 GMT it seems, i'll look
16:22:49 <ihrachys> #action haleyb to look at why dvr grenade flavour surged to 30% failure rate compared to 0% for other grenades
16:24:02 <ihrachys> as for non-voting jobs...
16:24:04 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:24:09 <ihrachys> this is the tempest check queue chart
16:24:18 <ihrachys> we have scenarios at ~100% for a long time
16:24:29 <ihrachys> jlibosva, do we still have a grasp on what's happening there?
16:24:32 <haleyb> the dvr-ha-multinode check job is also nuts
16:24:33 <ihrachys> is it the same trunk connectivity issue?
16:24:56 <jlibosva> ihrachys: the migrations are broken - from legacy to ha or alike
16:25:00 <ihrachys> haleyb, yep. the fact it's at ~100% is reassuring though
16:25:06 <ihrachys> haleyb, should be easy to reproduce
16:25:17 <jlibosva> also I saw one failure related to the trunk port lifecycle, which seemed like the port status was not updated
16:25:31 <jlibosva> and then there is the east_west fip test that's also failing
16:26:49 <ihrachys> jlibosva, http://logs.openstack.org/22/410422/24/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/67e24b4/console.html#_2017-07-24_23_58_32_020051 seems like a good example of the migration breakage?
16:27:30 <ihrachys> jlibosva, do you think it makes sense to start with classifying failures? and then we could send an email to openstack-dev@ asking for help?
16:27:39 <ihrachys> ofc we would need to report bugs first
16:28:05 <ihrachys> I am afraid this job is not going anywhere for quite some time, and it may repeat the story of fullstack and friends never getting to voting
16:28:06 <jlibosva> yeah, that sounds like a good plan. I can look at it on Friday
16:28:13 <ihrachys> jlibosva, cooool
16:28:30 <ihrachys> #action jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help
16:28:58 <ihrachys> #action haleyb to check why dvr-ha job is at ~100% failure rate
16:29:28 * ihrachys feels like a jerk giving out tasks to innocent people
16:29:49 <jlibosva> we're not innocent, we're developers doing qe :)
16:30:36 <ihrachys> ok, looking at other charts, there is probably nothing serious there except what we covered
16:30:51 <jlibosva> the migration tests might actually be uncovering legitimate bugs
16:31:21 <ihrachys> jlibosva, would be interesting to play with logstash to see when it started. if we can spot the time, we can find the offender easily.
16:31:30 <ihrachys> it should have the failure in its scenario log
16:31:37 <jlibosva> I don't think we have enough data to do so, it's already been a while
16:31:43 <ihrachys> oh ok
16:31:55 <jlibosva> and we have logstash for a week or so, no?
16:32:03 <ihrachys> yeah, something like 1-2 weeks
16:32:07 <ihrachys> sadly
16:32:27 <ihrachys> they seem to be bloated with logs lately (hence the push to consolidate jobs)
16:32:31 <ihrachys> #topic Fullstack isolation
16:32:38 <jlibosva> yeah, as per logstash, it started on the 17th :)
16:32:52 <ihrachys> jlibosva, we discussed that topic last time, was there any progress since then?
16:33:12 <jlibosva> unfortunately not, I just have a plan in my head that I need to implement
16:33:33 <ihrachys> ok no rush
16:33:40 <ihrachys> #topic Open discussion
16:33:42 <jlibosva> I also wanted to introduce something like a network description but today I decided I'll defer it and use the current bridges
16:33:48 <ihrachys> anything else to discuss?
16:34:08 <jlibosva> about the multinode-dvr job running neutron tests -
16:34:27 <jlibosva> how about adding skips for those tests that are reproducible on local environments
16:34:37 <jlibosva> and then solving those issues with lower priority
16:34:54 <jlibosva> if there are any like that
16:34:56 <ihrachys> jlibosva, what is the job you refer to? the scenario job with the migration failures?
16:35:08 <jlibosva> yeah, if they are consistent
16:35:26 <jlibosva> I didn't check tho. I checked that the east_west tests are not reproducible easily, at least not on my env
16:35:26 <ihrachys> if they are consistent; if we reported bugs; and if we asked for help, yes I think we can do that
16:35:34 <ihrachys> probably at the gate level, not in the tempest plugin
16:35:56 <ihrachys> so that downstream consumers suffer :p
16:36:01 <jlibosva> lol
16:36:06 <ihrachys> and other plugins don't lose coverage
16:36:19 <jlibosva> makes sense, I was thinking of the tempest plugin but gate is better
16:36:21 <jlibosva> thanks
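A minimal sketch of what "skipping at the gate level, not in the tempest plugin" could look like, assuming the scenario job picks its tests through devstack-gate's DEVSTACK_GATE_TEMPEST_REGEX (or a similar job-level regex); the variable usage and module names are assumptions made for illustration, not taken from the actual job definition. Excluding tests here leaves them in the tree for midonet and other consumers.

    # Hypothetical job-level test selection: run the scenario tests but leave
    # out the known-broken modules until the reported bugs are fixed. The
    # module names below are placeholders.
    export DEVSTACK_GATE_TEMPEST_REGEX='^neutron\.tests\.tempest\.scenario\.(?!test_migrations|test_dvr_east_west)'

    # Re-enabling a test later only means dropping its name from the
    # exclusion, without touching the test code itself.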
16:36:33 <jlibosva> that's all from me
16:36:39 <ihrachys> I think midonet actively runs tests from our tree
16:36:42 <haleyb> luckily that job is non-voting
16:37:02 <jlibosva> that's a question, if it were voting, it would get more attention and priority :D
16:37:51 <ihrachys> yeah, we should target voting even if it means losing some coverage along the way. voting with fewer tests is worth more than non-voting with plenty of tests no one cares about
16:38:25 <ihrachys> ok I think we can close the meeting
16:38:27 <ihrachys> thanks folks
16:38:30 <ihrachys> #endmeeting