16:00:52 <ihrachys> #startmeeting neutron_ci
16:00:52 <openstack> Meeting started Tue Jul 25 16:00:52 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:56 <openstack> The meeting name has been set to 'neutron_ci'
16:01:09 <ihrachys> jlibosva, o/
16:01:15 <jlibosva> hi
16:01:31 <ihrachys> #topic Actions from prev week
16:01:43 <ihrachys> "jlibosva to report bugs for eventlet and maybe oslo.service for eventlet signal/timer issue with py3"
16:01:53 <ihrachys> jlibosva, I recollect you reported a bug against oslo.service
16:02:04 <jlibosva> there was one reported by Victor IIRC
16:02:08 <ihrachys> https://bugs.launchpad.net/oslo.service/+bug/1705047
16:02:08 <openstack> Launchpad bug 1705047 in oslo.service "Race condition in signal handling on Python 3" [Undecided,New]
16:02:35 <jlibosva> yep
16:02:37 <jlibosva> that one
16:02:46 <ihrachys> was one reported for eventlet? or it's not really a bug for them?
16:03:17 <jlibosva> I think he was working on a fix for eventlet but he sent me only this one, lemme check
16:03:40 <ihrachys> what's his irc nick?
16:03:57 <jlibosva> I don't see any eventlet issues on github, seems like that's their bugtracker
16:04:02 <jlibosva> vstinner is his irc nick
16:05:28 <ihrachys> ok
16:05:35 <ihrachys> well if it's oslo.service so be it
16:05:41 <ihrachys> I assume we will see a fix in the near future
16:05:50 <ihrachys> but we may need to track it so that we can close it in pike
16:06:02 <jlibosva> would be good if that'd go to oslo.service
16:06:04 <jlibosva> imho
16:06:12 <ihrachys> yeah, no new eventlet requirement
16:06:12 <haleyb> hi
16:06:17 <ihrachys> haleyb, heya
16:06:30 <ihrachys> ok, next item was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:06:50 <jlibosva> I did but it's a WIP - https://review.openstack.org/#/c/484874/
16:06:53 <jlibosva> failing tests
16:07:08 <jlibosva> oh, I see I pushed wrong patch :)
16:07:13 <jlibosva> in comments there
16:07:19 <jlibosva> anyway, there is still work to be done there
16:07:51 <haleyb> hi there
16:08:05 <jlibosva> hello
16:08:18 <ihrachys> jlibosva, do you need reviews at this point, or you want to spin on it?
16:08:33 <jlibosva> ihrachys: no reviews needed
16:08:35 <ihrachys> ack
16:08:48 <ihrachys> next was "ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases"
16:09:00 <ihrachys> you think I did it this time? not in the slightest.
16:09:08 <ihrachys> and seems like I am swamped for this week too
16:09:20 <ihrachys> so if someone would like to take a look, feel free to take over
16:09:48 <ihrachys> next was "haleyb to collect feedback of key contributors on multinode-by-default patch"
16:10:07 <ihrachys> haleyb, what's the result of the thread on switching to multinode grenade for integrated gate?
16:10:16 <haleyb> there weren't many comments
16:11:08 <ihrachys> general stance seems to be "let's try it and see"?
16:11:27 <haleyb> i will ping the QA PTL to make sure they don't have a problem, think it's andreaf
16:11:40 <ihrachys> ack
16:11:51 <haleyb> otherwise we'll see what happens :)
16:12:01 <ihrachys> one question I had that I'm not clear on is whether we want to get rid of single-node coverage for the grenade repo itself
16:12:24 <ihrachys> there may be a reason to keep it working for cases like local reproduction
16:13:12 <ihrachys> not sure who is responsible for grenade to make this call
16:13:13 <haleyb> i can leave it in the check and gate queues there
16:13:32 <ihrachys> haleyb, that would definitely be on the safe side. they can decide to clean up on their own
16:14:14 <haleyb> that might fall under qa as well, i'll find out
16:14:15 <ihrachys> #action haleyb to reach out to QA PTL about switching grenade integrated gate to multinode
16:14:34 <ihrachys> next item was "jlibosva to send a debug patch for random test runner murders, and recheck, recheck, recheck"
16:14:45 <ihrachys> this is in relation to functional test timeouts we saw the prev weeks
16:14:55 <jlibosva> I sent out one patch today
16:14:58 <ihrachys> to recap, the current failures were triaged and classified in https://etherpad.openstack.org/p/neutron-functional-gate-failures-july
16:15:30 <jlibosva> ihrachys: I also re-shuffled the above, most of the firewall failures were caused by the test executor being killed
16:15:36 <ihrachys> you mean this? https://review.openstack.org/#/c/487065/2/tools/kill.sh
16:16:03 <jlibosva> fortunately, we do log all execute() calls. unfortunately, there is no call with the pid of the killed executor
16:16:06 <jlibosva> ihrachys: yes, that one
16:16:19 * haleyb is having irc lag if he seems slow
16:16:30 <jlibosva> so I thought I'd log each call to the kill binary
16:16:47 <jlibosva> which won't catch calls like os.kill() from python but at least it's something
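A minimal sketch of the kind of logging wrapper being discussed (the actual tools/kill.sh in the patch above may differ; the log path and fields here are assumptions):

    #!/bin/bash
    # Hypothetical wrapper shadowing /bin/kill on PATH: record the caller and
    # arguments before delegating to the real binary, so post-mortem logs show
    # where a stray SIGKILL to the test executor came from. As noted above,
    # this won't catch python-level os.kill() calls.
    LOG=/tmp/kill-calls.log
    echo "$(date -u +%FT%TZ) ppid=$PPID caller=$(ps -o cmd= -p "$PPID") args: $*" >> "$LOG"
    exec /bin/kill "$@"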
16:17:29 <ihrachys> jlibosva, cool. btw we could probably use the -index.txt log file to get messages from all tests executed around the Killed message
16:17:41 <ihrachys> maybe that makes it easier to spot some other thread calling kill
16:17:55 <ihrachys> though the number of messages there may be hard to digest
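A quick example of the log spelunking suggested here, assuming the job publishes a consolidated *-index.txt log (the filename below is a placeholder):

    # Print context around the "Killed" message to see which tests and workers
    # were active at that moment; widen -B/-A as long as the output stays readable.
    grep -n -B 20 -A 5 'Killed' dsvm-functional-index.txt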
16:18:11 <ihrachys> jlibosva, so the plan now is to recheck the patch until it hits the timeout?
16:18:35 <jlibosva> probably, maybe if it doesn't cause too many issues, we could merge it and wait for a failure
16:19:36 <ihrachys> ack
16:19:41 <ihrachys> let's review the state next week
16:19:57 <ihrachys> #action jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back
16:20:22 <ihrachys> that's all there is regarding action items
16:20:27 <ihrachys> #topic Grafana
16:20:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:59 <ihrachys> haleyb, I see in grenade gate queue, dvr flavour is nuts
16:21:12 <ihrachys> regular at 0% but dvr is 30%?
16:21:31 <ihrachys> it's the job we propose for the integrated gate right?
16:21:38 <ihrachys> the chart: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=6&fullscreen
16:22:06 <haleyb> ihrachys: yes, since 00 GMT it seems, i'll look
16:22:49 <ihrachys> #action haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades
16:24:02 <ihrachys> as for non-voting jobs...
16:24:04 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:24:09 <ihrachys> this is tempest check queue chart
16:24:18 <ihrachys> we have scenarios at ~100% for a long time
16:24:29 <ihrachys> jlibosva, do we still have a grasp of what's happening there?
16:24:32 <haleyb> the dvr-ha-multinode check job is also nuts
16:24:33 <ihrachys> is it same trunk connectivity?
16:24:56 <jlibosva> ihrachys: the migrations are broken - from legacy to HA or the like
16:25:00 <ihrachys> haleyb, yep. the fact it's at ~100% is reassuring though
16:25:06 <ihrachys> haleyb, should be easy to reproduce
16:25:17 <jlibosva> also I saw one failure related to trunk port lifecycle, which seemed like port status is not updated
16:25:31 <jlibosva> and then there is east_west fip test that's also failing
16:26:49 <ihrachys> jlibosva, http://logs.openstack.org/22/410422/24/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/67e24b4/console.html#_2017-07-24_23_58_32_020051 seems like a good example of migration breakage?
16:27:30 <ihrachys> jlibosva, do you think it makes sense to start with classifying failures? and then we could send an email to openstack-dev@ asking for help?
16:27:39 <ihrachys> ofc we would need to report bugs first
16:28:05 <ihrachys> I am afraid this job is not going anywhere for quite some time, and it may repeat the story of fullstack and friends never getting to voting
16:28:06 <jlibosva> yeah, that sounds like a good plan. I can look at it on Friday
16:28:13 <ihrachys> jlibosva, cooool
16:28:30 <ihrachys> #action jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help
16:28:58 <ihrachys> #action haleyb to check why dvr-ha job is at ~100% failure rate
16:29:28 * ihrachys feels like a jerk giving out tasks to innocent people
16:29:49 <jlibosva> we're not innocent, we're developers doing qe :)
16:30:36 <ihrachys> ok, looking at other charts, there is probably nothing serious there except what we covered
16:30:51 <jlibosva> the migration tests might be actually uncovering legitimate bugs
16:31:21 <ihrachys> jlibosva, would be interesting to play with logstash to see when it started. if we can spot the time, we can find the offender easily.
16:31:30 <ihrachys> it should have the failure in its scenario log
16:31:37 <jlibosva> I don't think we have enough data to do so, it's already been a while
16:31:43 <ihrachys> oh ok
16:31:55 <jlibosva> and we have logstash for a week or so, no?
16:32:03 <ihrachys> yeah, something like 1-2 week
16:32:07 <ihrachys> sadly
16:32:27 <ihrachys> they seem to be bloated with logs lately (hence the push to consolidate jobs)
16:32:31 <ihrachys> #topic Fullstack isolation
16:32:38 <jlibosva> yeah, as per logstash, it started on the 17th :)
16:32:52 <ihrachys> jlibosva, we discussed that topic the last time, was there any progress since then?
16:33:12 <jlibosva> unfortunately not, I just have a plan in my head that I need to implement
16:33:33 <ihrachys> ok no rush
16:33:40 <ihrachys> #topic Open discussion
16:33:42 <jlibosva> I also wanted to introduce something like a network description, but today I decided I'll defer it and use the current bridges
16:33:48 <ihrachys> anything else to discuss?
16:34:08 <jlibosva> about the multinode-dvr running neutron tests -
16:34:27 <jlibosva> how about adding skips for those tests that are reproducible in local environments
16:34:37 <jlibosva> and then solve those issues with lower priority
16:34:54 <jlibosva> if there are any like that
16:34:56 <ihrachys> jlibosva, what is the job you refer to? the scenario for migration failures?
16:35:08 <jlibosva> yeah, if they are consistent
16:35:26 <jlibosva> I didn't check tho. I checked that east_west tests are not reproducible easily, at least not on my env
16:35:26 <ihrachys> if they are consistent; if we reported bugs; and if we asked for help, yes I think we can do that
16:35:34 <ihrachys> probably at the gate level, not in the tempest plugin
16:35:56 <ihrachys> so that downstream consumers suffer :p
16:36:01 <jlibosva> lol
16:36:06 <ihrachys> and other plugins don't lose coverage
16:36:19 <jlibosva> makes sense, I was thinking of the tempest plugin but gate is better
16:36:21 <jlibosva> thanks
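A rough sketch of the gate-level skip idea discussed above, assuming the job's hook can feed a blacklist file to the test runner (the file path, regex, and exact invocation are illustrative, not the real job config):

    # Keep the tests in the tempest plugin, but exclude them only in this job,
    # so downstream consumers and other plugins keep the coverage.
    cat > /tmp/scenario-blacklist.txt <<'EOF'
    # consistently failing migration scenarios, tracked in launchpad
    neutron\.tests\.tempest\.scenario\.test_migrations.*
    EOF
    tempest run --regex neutron.tests.tempest.scenario \
        --blacklist-file /tmp/scenario-blacklist.txt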
16:36:33 <jlibosva> that's all from me
16:36:39 <ihrachys> I think midonet actively runs tests from our tree
16:36:42 <haleyb> luckily that job is non-voting
16:37:02 <jlibosva> that's a question, if it were voting, it would get more attention and priority :D
16:37:51 <ihrachys> yeah, we should target voting even if it means losing some coverage along the way. voting with fewer tests is worth more than non-voting with plenty of tests no one cares about
16:38:25 <ihrachys> ok I think we can close the meeting
16:38:27 <ihrachys> thanks folks
16:38:30 <ihrachys> #endmeeting