16:00:57 <ihrachys> #startmeeting neutron_ci
16:00:57 <openstack> Meeting started Tue Mar 20 16:00:57 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:58 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:01 <openstack> The meeting name has been set to 'neutron_ci'
16:01:06 <slaweq> hi
16:01:10 <ihrachys> o/
16:01:16 <ihrachys> mlavalle won't be able to join us today
16:01:22 <ihrachys> haleyb, jlibosva
16:01:54 <jlibosva> o/
16:02:42 <ihrachys> ok let's get going
16:02:43 <ihrachys> #topic Actions from prev meeting
16:02:56 <ihrachys> btw thanks slaweq for taking over the previous meeting, appreciate it
16:03:05 <ihrachys> slaweq, you should probably become the chair for the meeting
16:03:06 <slaweq> ihrachys: no problem :)
16:03:24 <ihrachys> since you are the one who actually does most of the work :) you and jlibosva
16:03:33 <ihrachys> slaweq, I mean in general
16:03:42 <slaweq> maybe jlibosva then :)
16:03:58 <slaweq> I don't think I'm experienced enough to do it all the time
16:04:12 <jlibosva> You did really well last time :)
16:04:12 <ihrachys> nah I think you are a better long term shot ;) things I know... :)
16:04:20 <ihrachys> ok as for action items
16:04:22 <ihrachys> "slaweq to enable voting for fullstack"
16:04:30 <slaweq> done
16:04:42 <ihrachys> https://review.openstack.org/552686
16:05:09 <ihrachys> was the change global for all queues? https://review.openstack.org/#/c/552686/1/.zuul.yaml
16:05:17 <ihrachys> or is it blocked from voting in gate somewhere else?
16:05:39 <slaweq> it is only for master branch and for check queue for now
16:05:45 <slaweq> that is what we decided last time
16:06:31 <slaweq> if all goes fine we will add it to the gate queue too, am I right?
16:06:38 <ihrachys> oh I get it, it's not in gate jobs list
16:06:45 <slaweq> yes
16:06:49 <ihrachys> yeah you are right, I am just bad at tech
16:06:57 <ihrachys> "slaweq to enable voting for linuxbridge scenarios"
16:07:02 <slaweq> done also: https://review.openstack.org/#/c/552689/
16:07:17 <ihrachys> democracy, everyone gets a vote
16:07:17 <slaweq> and also only in check queue for now
16:07:36 <ihrachys> +
16:07:37 <ihrachys> "jlibosva to take a look on dvr trunk tests issue"
16:07:42 <jlibosva> ehm
16:07:47 <jlibosva> I didn't
16:07:49 <ihrachys> context: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:04 <jlibosva> I'll try this week
16:08:16 <ihrachys> #action jlibosva to take a look on dvr trunk tests issue: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:22 <ihrachys> great
16:08:29 <ihrachys> "slaweq check reasons of failures of neutron-tempest-dvr-ha-multinode-full"
16:08:39 <slaweq> I checked it and I created bug https://bugs.launchpad.net/neutron/+bug/1756301
16:08:40 <openstack> Launchpad bug 1756301 in neutron "Tempest DVR HA multimode tests fails due to no FIP connectivity" [High,Confirmed]
16:08:46 <ihrachys> context http://logs.openstack.org/14/529814/5/check/neutron-tempest-dvr-ha-multinode-full/d8cfbdf/logs/testr_results.html.gz
16:09:01 <slaweq> it looks like in all those cases the failure is the same
16:09:13 <slaweq> but I didn't do any debugging on this issue
16:10:15 <slaweq> in all the other cases which I checked the issues were related to cinder volumes
16:10:58 <ihrachys> slaweq, that's weird. I would understand it if e.g. an instance crashed because of volume attachment.
16:11:03 <ihrachys> but it's clear it's something net related
16:11:25 <slaweq> yes, in all the cases pointed at in this bug report the errors are related to neutron
16:11:38 <ihrachys> at least one failure of those listed is in tempest.scenario.test_server_basic_ops.TestServerBasicOps
16:11:41 <ihrachys> I don't think it's vol specific
16:11:55 <slaweq> I was just saying that I found many other failures of this job with issues not related to network but related to cinder
16:12:23 <slaweq> sorry if I wasn't clear :)
16:12:31 <ihrachys> I see.
16:13:16 <ihrachys> well, the bug is reported. if someone has cycles they can take it
16:13:20 <ihrachys> "slaweq switch scenario jobs to lib/neutron"
16:13:39 <ihrachys> slaweq, I am missing context on that one. while it's good to move there, I don't think we did for gate or any other jobs.
16:13:44 <ihrachys> so what's special about scenarios?
16:13:58 <slaweq> nothing, I just wanted to start with something :)
16:14:10 <slaweq> so I even sent a patch https://review.openstack.org/#/c/552689/
16:14:20 <slaweq> to switch one scenario job to zuul v3
16:14:45 <slaweq> sorry, wrong link, this one is good: https://review.openstack.org/#/c/552846/13
16:15:25 <slaweq> but in the meantime the devstack-tempest job was switched back to lib/neutron-legacy, so this will not switch our job to using lib/neutron for now
16:15:47 <slaweq> but I still think that it could be good to start moving jobs to zuul v3 definitions
16:15:53 <slaweq> what do You think about that?
16:16:31 <ihrachys> yeah that's good. I am not sure if that helps with lib/neutron target but we should move to the new format regardless.
16:16:44 <jlibosva> we talked with slaweq and we agreed it would be good to migrate and then once devstack-tempest is switched, we'll get it for free
16:16:59 <slaweq> exactly, thx jlibosva :)
16:17:42 <slaweq> so please review it if You have some time :)
16:18:14 <ihrachys> fair
16:18:37 <ihrachys> #topic Grafana
16:18:40 <jlibosva> I looked but I don't feel like I understand the zuul thingy completely so I'm not comfortable +2'ing
16:18:53 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:15 <ihrachys> there was some suspicious fullstack spike a day ago. while some other jobs also spiked at the same time, the fullstack chart doesn't seem to have recovered completely.
16:20:48 <ihrachys> any ideas what's the spike about?
16:21:04 <slaweq> today I spotted the issue described in https://bugs.launchpad.net/neutron/+bug/1757089 2 or 3 times
16:21:05 <openstack> Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:21:16 <slaweq> I even started debugging it but haven't found anything yet
16:21:59 <slaweq> I think it is some kind of race between finishing the configuration of the port and setting it to admin_state_up = False
16:22:05 <slaweq> but I'm not sure yet
16:22:05 <ihrachys> yeah I haven't seen this test case failing before
16:22:38 <ihrachys> slaweq, are you planning to dig into it further or should we have someone else on it?
16:22:49 <slaweq> yes, I will take care of this one
16:22:53 <ihrachys> the job just moved to voting so it would be nice to tackle that before people start complaining :)
16:22:57 <ihrachys> great!
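A minimal sketch of the kind of guard that would rule the race out, assuming slaweq's theory above is right. wait_until_true is the real helper from neutron.common.utils, but `client`, `show_port` and `update_port` are hypothetical stand-ins for the fullstack test machinery, not the actual test code:

    # Sketch only: wait for the port to finish provisioning before flipping
    # admin_state_up, so the shutdown does not race the initial wiring.
    from neutron.common import utils


    def shut_down_port_when_wired(client, port_id, timeout=60):
        def _port_active():
            # hypothetical test client call; real fullstack helpers differ
            return client.show_port(port_id)['port']['status'] == 'ACTIVE'

        # wait_until_true polls the predicate and raises on timeout
        utils.wait_until_true(_port_active, timeout=timeout)
        client.update_port(port_id, body={'port': {'admin_state_up': False}})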
16:23:21 <ihrachys> another change in grafana worth discussion is that dvr-scenarios seem to be at 100% now
16:23:22 <slaweq> but I don't know if this is the only issue with fullstack
16:23:30 <ihrachys> since 4 days ago
16:24:07 <ihrachys> failure example: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/testr_results.html.gz
16:25:07 <ihrachys> haleyb, ideas what's broken in dvr scenarios job?
16:26:27 <ihrachys> ok here are errors: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/screen-q-l3.txt.gz?level=TRACE
16:26:33 <ihrachys> "NetlinkError: (99, 'Cannot assign requested address')"
16:26:41 <ihrachys> slaweq, I believe that's your pyroute2 patch
16:27:30 <slaweq> ihrachys: it can be
16:28:16 <slaweq> I will check that one ASAP
16:28:58 <ihrachys> #action slaweq to check why dvr-scenario job is broken with netlink errors in l3 agent log
16:29:12 <ihrachys> interesting that it doesn't happen in linuxbridge
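For context on reading that trace: a pyroute2 NetlinkError carries a plain errno as its code, so (99, 'Cannot assign requested address') is EADDRNOTAVAIL coming back from whatever netlink call ip_lib issued. A minimal illustration of the pattern (not the agent code; the device name and address are made up):

    import errno

    from pyroute2 import IPRoute
    from pyroute2.netlink.exceptions import NetlinkError

    ipr = IPRoute()
    try:
        idx = ipr.link_lookup(ifname='qr-made-up')  # hypothetical device
        if not idx:
            raise RuntimeError('device not found')
        ipr.addr('add', index=idx[0], address='198.51.100.10', prefixlen=24)
    except NetlinkError as e:
        # e.code is a plain errno; 99 == errno.EADDRNOTAVAIL on Linux
        print('netlink errno %d (%s)' % (e.code, errno.errorcode.get(e.code, '?')))
    finally:
        ipr.close()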
16:30:07 <ihrachys> slaweq, btw jlibosva left so we are two now
16:30:16 <slaweq> ihrachys: ok
16:30:45 <ihrachys> #topic Fullstack
16:31:42 <ihrachys> there are some existing bug reports that we could take a look at since we have half an hour
16:31:43 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack&orderby=status&start=0
16:31:52 <ihrachys> I will ignore wishlist bugs
16:31:55 <slaweq> sure
16:32:02 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1757089
16:32:02 <openstack> Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:32:12 <ihrachys> this one we already discussed, it's the new failure you just reported
16:32:17 <slaweq> yep
16:33:15 <ihrachys> and you are going to look into it
16:33:20 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1744402
16:33:21 <openstack> Launchpad bug 1744402 in neutron "fullstack security groups test fails because ncat process don't starts" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:33:21 <slaweq> yes
16:33:59 <ihrachys> this one resulted in the test being disabled
16:34:09 <ihrachys> do we have patches tackling the original issue?
16:34:37 <ihrachys> doesn't seem like it
16:34:42 <slaweq> no, we don't have any patches for this one
16:35:27 <ihrachys> checking on logstash how often it happens
16:35:33 <slaweq> and to be honest I'm not sure if this is still an issue or if maybe it was fixed by https://review.openstack.org/#/c/545820/
16:36:00 <slaweq> as it could be related to the problem with an unconfigured IP address in the fake machine namespace
16:36:10 <ihrachys> slaweq, you think ncat failed to start because no IP was set?
16:36:23 <slaweq> it could be IMO
16:36:50 <slaweq> if there wasn't an IP address on the destination fake machine, it couldn't start
16:37:25 <slaweq> but I'm not 100% sure, maybe there is still a different problem
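A minimal sketch of that suspected failure mode, assuming the theory holds: a listener (ncat in the real test) asked to bind to an address not configured in its namespace dies immediately with the same "Cannot assign requested address" errno. 192.0.2.1 (TEST-NET-1) is assumed not to be configured on any local interface:

    import errno
    import socket

    # mimics a fake machine namespace that got no IP before ncat started
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(('192.0.2.1', 3333))
    except OSError as e:
        print('bind failed: %s (errno %d)' % (e.strerror, e.errno))
        assert e.errno == errno.EADDRNOTAVAIL  # "Cannot assign requested address"
    finally:
        s.close()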
16:37:41 <ihrachys> we could close it and see if it reemerges
16:37:47 <ihrachys> boy logstash is slow
16:38:20 <slaweq> we can do that, and I will remove this unstable test decorator
16:38:31 <slaweq> so if it fails it will be easier to find
16:38:45 <ihrachys> yeah. no hits in logstash in a month.
16:39:13 <slaweq> so I will remove this decorator and we will see, fine?
16:39:22 <ihrachys> #action slaweq to revert https://review.openstack.org/#/c/541242/ because the bug is probably fixed by https://review.openstack.org/#/c/545820/4
16:39:33 <slaweq> thx
16:39:47 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687074
16:39:48 <openstack> Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:40:02 <ihrachys> that's an old one.
16:40:16 <ihrachys> I don't think Terry made any progress since it was initially looked into
16:40:43 <slaweq> I didn't see any patch related to this one for sure
16:41:13 <slaweq> but it also doesn't happen so often now IMO
16:41:24 <ihrachys> at some point, this was present in lots of failures. I don't think we have those lately.
16:41:38 <ihrachys> right. it seems like maybe something was fixed in the meantime.
16:41:49 <ihrachys> and I recollect we sometimes saw those errors in successful runs too
16:42:11 <ihrachys> so it could as well be a red herring, and what helped is a set of unrelated patches that you carefully merged for legit issues.
16:42:21 <slaweq> so maybe we can close it for now and reopen if necessary?
16:42:26 <ihrachys> maybe we were blaming the failures on the error message
16:42:45 <slaweq> it could be like You are saying
16:42:49 <ihrachys> yes that's my inclination. I will leave the ovsdbapp part open in LP so that they can decide on their own.
16:43:06 <slaweq> ok
16:44:40 <ihrachys> did just that
16:44:45 <slaweq> thx
16:44:54 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1673531
16:44:55 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:47:52 <slaweq> I also didn't see this one lately
16:47:52 <ihrachys> crawling logstash in background. I will close if no hits.
16:47:53 <slaweq> it could maybe be fixed by https://review.openstack.org/#/c/547345/
16:47:53 <slaweq> but I don't remember exactly at which point this test usually failed, so I'm not sure
16:47:53 <ihrachys> maybe. it was slow port active state propagation
16:47:53 <ihrachys> ok no hits in logstash
16:47:53 <ihrachys> closed
16:47:57 <ihrachys> the rest are wishlist bugs
16:48:21 <slaweq> FYI: I just marked https://bugs.launchpad.net/neutron/+bug/1744396 as duplicate
16:48:23 <openstack> Launchpad bug 1723912 in neutron "duplicate for #1744396 Refactor securitygroup fullstack tests" [Wishlist,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:48:24 <ihrachys> (I just moved the one that talks about the signal we use to tear down resources from Low -> Wishlist)
16:48:32 <slaweq> because there were two about the same thing
16:48:46 <ihrachys> slaweq, great
16:49:19 <slaweq> only two not wishlist, that is good :)
16:49:29 <ihrachys> ok we are done with this list and still have 10 more minutes. so let's also clean up gate-failure bugs
16:49:31 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:49:37 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1741889
16:49:38 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:50:08 <slaweq> is it still an issue?
16:50:23 <ihrachys> not sure. an ovsdbapp release that claims it's fixed was issued.
16:50:39 <slaweq> functional tests have been quite stable the last few days
16:51:36 <ihrachys> there are 16 hits of the timeout message in logstash over the last month
16:52:02 <ihrachys> but none in functional it seems
16:52:13 <ihrachys> example: http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_06_775653
16:52:38 <slaweq> but this is oslo messaging timeout
16:52:43 <ihrachys> but there is also queue.Empty above
16:52:50 <slaweq> right
16:53:02 <ihrachys> slaweq, oslo messaging?
16:53:17 <slaweq> http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_07_663639
16:53:18 <ihrachys> oh I see below
16:53:29 <slaweq> I was talking about this one just below "yours" :)
16:53:41 <ihrachys> that's a timeout on reporting state
16:54:07 <ihrachys> looks like both the rpc state thread and ovsdbapp were hanging
16:54:23 <slaweq> maybe it's another issue with load on the host
16:54:27 <ihrachys> yeah
16:54:39 <ihrachys> ok let's close the functional bug
16:54:45 <slaweq> ++
16:55:55 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1738475
16:55:57 <openstack> Launchpad bug 1738475 in neutron "test_notify_port_ready_after_enable_dhcp fails with RuntimeError: 'dhcp_ready_on_ports' not be called" [Low,New]
16:56:35 <ihrachys> haven't seen it honestly
16:56:42 <slaweq> IMO we can close it - I haven't found anything like that since then
16:56:49 <ihrachys> ok
16:57:27 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1664347
16:57:28 <openstack> Launchpad bug 1664347 in OpenStack Compute (nova) "test_volume_boot_pattern failed to get an instance into ACTIVE state" [Undecided,Confirmed]
16:57:35 <ihrachys> it's Incomplete for us
16:57:41 <slaweq> yep
16:58:12 <ihrachys> how do we get rid of it from the list?
16:58:20 <ihrachys> I thought incomplete bugs eventually expire
16:58:27 <ihrachys> but the nova component probably holds it open
16:58:28 <slaweq> I don't know
16:58:38 <ihrachys> maybe I will just kill the neutron link
16:58:48 <slaweq> ++
16:59:00 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1750334
16:59:01 <openstack> Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:59:08 <ihrachys> isn't it the same as other timeout bugs?
16:59:42 <slaweq> yes, AFAIR it was the same issue as with functional
16:59:58 <ihrachys> marking as dup
17:00:05 <ihrachys> ok we are at the top of the hour
17:00:16 <ihrachys> great, nice to do some cleanup once in a while :)
17:00:20 <ihrachys> slaweq, thanks a lot!
17:00:25 <slaweq> thx
17:00:29 <ihrachys> #endmeeting