16:00:52 #startmeeting neutron_ci
16:00:52 Meeting started Tue Jul 25 16:00:52 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:56 The meeting name has been set to 'neutron_ci'
16:01:09 jlibosva, o/
16:01:15 hi
16:01:31 #topic Actions from prev week
16:01:43 "jlibosva to report bugs for eventlet and maybe oslo.service for eventlet signal/timer issue with py3"
16:01:53 jlibosva, I recollect you reported a bug against oslo.service
16:02:04 there was one reported by Victor IIRC
16:02:08 https://bugs.launchpad.net/oslo.service/+bug/1705047
16:02:08 Launchpad bug 1705047 in oslo.service "Race condition in signal handling on Python 3" [Undecided,New]
16:02:35 yep
16:02:37 that one
16:02:46 was one reported for eventlet too? or is it not really a bug for them?
16:03:17 I think he was working on a fix for eventlet but he sent me only this one, lemme check
16:03:40 what's his irc nick?
16:03:57 I don't see any eventlet issues on github, seems like that's their bug tracker
16:04:02 vstinner is his irc nick
16:05:28 ok
16:05:35 well, if it's oslo.service, so be it
16:05:41 I assume we will see a fix in the near future
16:05:50 but we may need to track it so that we can close it in pike
16:06:02 would be good if that'd go to oslo.service
16:06:04 imho
16:06:12 yeah, no new eventlet requirement
16:06:12 hi
16:06:17 haleyb, heya
16:06:30 ok, next item was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:06:50 I did, but it's a WIP - https://review.openstack.org/#/c/484874/
16:06:53 failing tests
16:07:08 oh, I see I pushed the wrong patch :)
16:07:13 in comments there
16:07:19 anyway, there is still work to be done there
16:07:51 hi there
16:08:05 hello
16:08:18 jlibosva, do you need reviews at this point, or do you want to spin on it?
16:08:33 ihrachys: no reviews needed
16:08:35 ack
16:08:48 next was "ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases"
16:09:00 you think I did this time? not in the slightest.
16:09:08 and it seems like I am swamped for this week too
16:09:20 so if someone would like to take a look, feel free to take over
16:09:48 next was "haleyb to collect feedback of key contributors on multinode-by-default patch"
16:10:07 haleyb, what's the result of the thread on switching to multinode grenade for the integrated gate?
16:10:16 there weren't many comments
16:11:08 general stance seems to be "let's try it and see"?
16:11:27 i will ping the QA PTL to make sure they don't have a problem, think it's andreaf
16:11:40 ack
16:11:51 otherwise we'll see what happens :)
16:12:01 one question I'm still not clear on is whether we want to get rid of single-node coverage for the grenade repo itself
16:12:24 there may be a reason to keep it working for cases like local reproduction
16:13:12 not sure who is responsible for grenade to make this call
16:13:13 i can leave it in the check and gate queues there
16:13:32 haleyb, that would definitely be on the safe side.
they can decide to clean up on their own
16:14:14 that might fall under qa as well, i'll find out
16:14:15 #action haleyb to reach out to QA PTL about switching grenade integrated gate to multinode
16:14:34 next item was "jlibosva to send a debug patch for random test runner murders, and recheck, recheck, recheck"
16:14:45 this is in relation to the functional test timeouts we saw in previous weeks
16:14:55 I sent out one patch today
16:14:58 to recap, the current failures were triaged and classified in https://etherpad.openstack.org/p/neutron-functional-gate-failures-july
16:15:30 ihrachys: I also re-shuffled the above, most of the firewall failures were caused by the test executor being killed
16:15:36 you mean this? https://review.openstack.org/#/c/487065/2/tools/kill.sh
16:16:03 fortunately, we do log all execute() calls. unfortunately, there is no call with the pid of the killed executor
16:16:06 ihrachys: yes, that one
16:16:19 * haleyb is having irc lag if he seems slow
16:16:30 so I thought I'd log each call to the kill binary
16:16:47 which won't catch calls like os.kill() from python, but at least it's something
16:17:29 jlibosva, cool. btw we could probably use the -index.txt log file to get messages from all tests executed around the Killed message
16:17:41 maybe that makes it easier to spot some other thread calling kill
16:17:55 though the number of messages there may be hard to digest
16:18:11 jlibosva, so the plan now is to recheck the patch until it hits the timeout?
16:18:35 probably; if it doesn't cause many issues, we could merge it and wait for a failure
16:19:36 ack
16:19:41 let's review the state next week
16:19:57 #action jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back
16:20:22 that's all there is regarding action items
16:20:27 #topic Grafana
16:20:30 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:59 haleyb, I see that in the grenade gate queue, the dvr flavour is nuts
16:21:12 regular at 0% but dvr is 30%?
16:21:31 it's the job we propose for the integrated gate, right?
16:21:38 the chart: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=6&fullscreen
16:22:06 ihrachys: yes, since 00 GMT it seems, i'll look
16:22:49 #action haleyb to look at why dvr grenade flavour surged to 30% failure rate compared to 0% for other grenades
16:24:02 as for non-voting jobs...
16:24:04 http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:24:09 this is the tempest check queue chart
16:24:18 we have had scenarios at ~100% for a long time
16:24:29 jlibosva, do we have a grasp of what happens there yet?
16:24:32 the dvr-ha-multinode check job is also nuts
16:24:33 is it the same trunk connectivity?
16:24:56 ihrachys: the migrations are broken - from legacy to ha or the like
16:25:00 haleyb, yep. the fact it's at ~100% is reassuring though
16:25:06 haleyb, should be easy to reproduce
16:25:17 also I saw one failure related to the trunk port lifecycle, which looked like the port status not being updated
16:25:31 and then there is the east_west fip test that's also failing
16:26:49 jlibosva, http://logs.openstack.org/22/410422/24/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/67e24b4/console.html#_2017-07-24_23_58_32_020051 seems like a good example of the migration breakage?
16:27:30 jlibosva, do you think it makes sense to start with classifying failures? and then we could send an email to openstack-dev@ asking for help?
16:27:39 ofc we would need to report bugs first
16:28:05 I am afraid this job is not going anywhere for quite some time, and it may repeat the story of fullstack and friends never getting to voting
16:28:06 yeah, that sounds like a good plan. I can look at it on Friday
16:28:13 jlibosva, cooool
16:28:30 #action jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help
16:28:58 #action haleyb to check why dvr-ha job is at ~100% failure rate
16:29:28 * ihrachys feels like a jerk giving out tasks to innocent people
16:29:49 we're not innocent, we're developers doing QE :)
16:30:36 ok, looking at the other charts, there is probably nothing serious there except what we covered
16:30:51 the migration tests might actually be uncovering legitimate bugs
16:31:21 jlibosva, it would be interesting to play with logstash to see when it started. if we can spot the time, we can find the offender easily.
16:31:30 it should have the failure in its scenario log
16:31:37 I don't think we have enough data to do so, it's already been a while
16:31:43 oh ok
16:31:55 and we have logstash data for a week or so, no?
16:32:03 yeah, something like 1-2 weeks
16:32:07 sadly
16:32:27 they seem to be bloated with logs lately (hence the push to consolidate jobs)
16:32:31 #topic Fullstack isolation
16:32:38 yeah, as per logstash, it started on the 17th :)
16:32:52 jlibosva, we discussed that topic last time, was there any progress since then?
16:33:12 unfortunately not, I just have a plan in my head that I need to implement
16:33:33 ok, no rush
16:33:40 #topic Open discussion
16:33:42 I also wanted to introduce something like a network description, but today I decided I'll defer it and use the current bridges
16:33:48 anything else to discuss?
16:34:08 about the multinode-dvr job running neutron tests -
16:34:27 how about skipping those tests that are reproducible in local environments
16:34:37 and then solving those issues with lower priority
16:34:54 if there are any like that
16:34:56 jlibosva, what is the job you refer to? the scenario job with the migration failures?
16:35:08 yeah, if they are consistent
16:35:26 I didn't check though. I checked that the east_west tests are not easily reproducible, at least not in my env
16:35:26 if they are consistent, if we reported bugs, and if we asked for help, yes, I think we can do that
16:35:34 probably at the gate level, not in the tempest plugin
16:35:56 so that downstream consumers suffer :p
16:36:01 lol
16:36:06 and other plugins don't lose coverage
16:36:19 makes sense, I was thinking of the tempest plugin but gate is better
16:36:21 thanks
16:36:33 that's all from me
16:36:39 I think midonet actively runs tests from our tree
16:36:42 luckily that job is non-voting
16:37:02 that's the question; if it were voting, it would get more attention and priority :D
16:37:51 yeah, we should target voting even if it means losing some coverage along the way. voting with fewer tests is worth more than non-voting with plenty of tests no one cares about
16:38:25 ok, I think we can close the meeting
16:38:27 thanks folks
16:38:30 #endmeeting
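
For reference, a minimal sketch of the kill-logging idea jlibosva describes at 16:16:30: a wrapper placed ahead of the real kill that records every invocation before delegating. The actual debug patch under review (tools/kill.sh) is a shell script; this Python version is only an illustration, and the log path and real-kill location below are assumptions, not taken from the patch. As noted in the meeting, it would not catch os.kill() calls made directly from Python.

    #!/usr/bin/env python
    # Hedged sketch only, not the tools/kill.sh patch: log caller pid and
    # arguments of every "kill" invocation, then hand off to the real binary.
    import os
    import sys
    import time

    LOG_PATH = '/tmp/kill-wrapper.log'   # hypothetical log location
    REAL_KILL = '/bin/kill'              # assumed path to the real kill binary


    def main():
        # Record who asked to kill what before delegating.
        with open(LOG_PATH, 'a') as log:
            log.write('%s ppid=%d argv=%r\n'
                      % (time.strftime('%H:%M:%S'), os.getppid(), sys.argv[1:]))
        # Replace this process with the real kill so callers see normal behaviour.
        os.execv(REAL_KILL, [REAL_KILL] + sys.argv[1:])


    if __name__ == '__main__':
        main()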
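
Similarly, a hedged sketch of the 16:17:29 suggestion to read the consolidated -index.txt log around the "Killed" message; the example file name, needle string, and context window size are assumptions chosen for illustration only.

    # Hedged sketch: print a window of lines before and after every "Killed"
    # occurrence in a consolidated test log, to see what else ran around it.
    import sys
    from collections import deque


    def context_around(path, needle='Killed', window=30):
        """Print `window` lines of context before and after every `needle` hit."""
        before = deque(maxlen=window)
        remaining_after = 0
        with open(path, errors='replace') as fh:
            for line in fh:
                if needle in line:
                    sys.stdout.writelines(before)   # buffered context before the hit
                    sys.stdout.write(line)
                    before.clear()
                    remaining_after = window
                elif remaining_after > 0:
                    sys.stdout.write(line)          # context after the hit
                    remaining_after -= 1
                else:
                    before.append(line)


    if __name__ == '__main__':
        # e.g. python context_around.py dsvm-functional-index.txt (example name)
        context_around(sys.argv[1])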