16:00:57 <ihrachys> #startmeeting neutron_ci
16:00:57 <openstack> Meeting started Tue Mar 20 16:00:57 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:58 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:01 <openstack> The meeting name has been set to 'neutron_ci'
16:01:06 <slaweq> hi
16:01:10 <ihrachys> o/
16:01:16 <ihrachys> mlavalle won't be able to join us today
16:01:22 <ihrachys> haleyb, jlibosva
16:01:54 <jlibosva> o
16:01:56 <jlibosva> /
16:02:42 <ihrachys> ok let's get going
16:02:43 <ihrachys> #topic Actions from prev meeting
16:02:56 <ihrachys> btw thanks slaweq for taking it over the prev time, appreciate it
16:03:05 <ihrachys> slaweq, you should probably become the chair for the meeting
16:03:06 <slaweq> ihrachys: no problem :)
16:03:24 <ihrachys> since you are the one who actually does most of the work :) you and jlibosva
16:03:33 <ihrachys> slaweq, I mean in general
16:03:42 <slaweq> maybe jlibosva then :)
16:03:58 <slaweq> I don't think I'm experienced enough to do it all the time
16:04:12 <jlibosva> You did really well last time :)
16:04:12 <ihrachys> nah I think you are a better long term shot ;) things I know... :)
16:04:20 <ihrachys> ok as for action items
16:04:22 <ihrachys> "slaweq to enable voting for fullstack"
16:04:30 <slaweq> done
16:04:42 <ihrachys> https://review.openstack.org/552686
16:05:09 <ihrachys> was the change global for all queues? https://review.openstack.org/#/c/552686/1/.zuul.yaml
16:05:17 <ihrachys> or is it blocked from voting in gate somewhere else?
16:05:39 <slaweq> it is only for master branch and for check queue for now
16:05:45 <slaweq> that is what we decided last time
16:06:31 <slaweq> if all will be fine we will add it to gate queue also, am I right?
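[Editor's note: for context on the check/gate distinction discussed here, voting is controlled per queue in the project stanza of neutron's .zuul.yaml. The fragment below is a minimal sketch of the idea only; job names and layout are illustrative and not copied from https://review.openstack.org/552686.]

    - project:
        check:
          jobs:
            # listing the job plainly (i.e. without "voting: false")
            # makes it voting in the check queue
            - neutron-fullstack
        gate:
          jobs:
            # fullstack is simply not listed here, so it does not run
            # or vote in the gate queue
            - neutron-tempest-plugin-api

[Adding the job under "gate: jobs:" later is all that would be needed to make it gate-blocking as well.]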
16:06:38 <ihrachys> oh I get it, it's not in gate jobs list
16:06:45 <slaweq> yes
16:06:49 <ihrachys> yeah you are right, I am just bad at tech
16:06:57 <ihrachys> "slaweq to enable voting for linuxbridge scenarios"
16:07:02 <slaweq> done also: https://review.openstack.org/#/c/552689/
16:07:17 <ihrachys> democracy, everyone gets a vote
16:07:17 <slaweq> and also only in check queue for now
16:07:36 <ihrachys> +
16:07:37 <ihrachys> "jlibosva to take a look on dvr trunk tests issue"
16:07:42 <jlibosva> ehm
16:07:47 <jlibosva> I didn't
16:07:49 <ihrachys> context: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:04 <jlibosva> I'll try this week
16:08:16 <ihrachys> #action jlibosva to take a look on dvr trunk tests issue: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:22 <ihrachys> great
16:08:29 <ihrachys> "slaweq check reasons of failures of neutron-tempest-dvr-ha-multinode-full"
16:08:39 <slaweq> I checked it and I created bug https://bugs.launchpad.net/neutron/+bug/1756301
16:08:40 <openstack> Launchpad bug 1756301 in neutron "Tempest DVR HA multimode tests fails due to no FIP connectivity" [High,Confirmed]
16:08:46 <ihrachys> context http://logs.openstack.org/14/529814/5/check/neutron-tempest-dvr-ha-multinode-full/d8cfbdf/logs/testr_results.html.gz
16:09:01 <slaweq> it looks that in all those cases failure is the same
16:09:13 <slaweq> but I didn't do any debugging on this issue
16:10:15 <slaweq> in all of the other cases which I checked, issues were related to cinder volumes
16:10:58 <ihrachys> slaweq, that's weird. I would understand it if e.g. instance would crash because of volume attachment.
16:11:03 <ihrachys> but it's clear it's something net related
16:11:25 <slaweq> yes, in all those cases which are pointed in this bug report errors are related to neutron
16:11:38 <ihrachys> at least one failure of those listed is in tempest.scenario.test_server_basic_ops.TestServerBasicOps
16:11:41 <ihrachys> I don't think it's vol specific
16:11:55 <slaweq> I was just saying that I found many other failures of this job with issues not related to network but related to cinder
16:12:23 <slaweq> sorry if I wasn't clear :)
16:12:31 <ihrachys> I see.
16:13:16 <ihrachys> well, the bug is reported. if someone has cycles they can take it
16:13:20 <ihrachys> "slaweq switch scenario jobs to lib/neutron"
16:13:39 <ihrachys> slaweq, I am missing context on that one. while it's good to move there, I don't think we did for gate or any other jobs.
16:13:44 <ihrachys> so what's special about scenarios?
16:13:58 <slaweq> nothing, I just wanted to start with something :)
16:14:10 <slaweq> so I sent a patch https://review.openstack.org/#/c/552689/ even
16:14:20 <slaweq> to switch one scenario job to zuul v3
16:14:45 <slaweq> sorry, wrong link, this one is good: https://review.openstack.org/#/c/552846/13
16:15:25 <slaweq> but in the meantime devstack-tempest job was switched back to lib/neutron-legacy so this will not switch our job to using lib/neutron for now
16:15:47 <slaweq> but I still think that it could be good to start moving jobs to zuul v3 definitions
16:15:53 <slaweq> what do You think about that?
16:16:31 <ihrachys> yeah that's good. I am not sure if that helps with lib/neutron target but we should move to the new format regardless.
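[Editor's note: the Zuul v3 migration mentioned above means replacing legacy job definitions with native ones that inherit from the devstack-tempest base job. The fragment below is a rough sketch of what such a definition can look like; the job name, project list, and variable values are assumed for illustration and are not copied from https://review.openstack.org/552846.]

    - job:
        name: neutron-tempest-plugin-scenario-linuxbridge
        parent: devstack-tempest
        required-projects:
          - openstack/neutron
          - openstack/neutron-tempest-plugin
        vars:
          # run only the plugin's scenario tests
          tempest_test_regex: ^neutron_tempest_plugin\.scenario
          devstack_localrc:
            Q_AGENT: linuxbridge

[Once devstack-tempest itself switches to lib/neutron, jobs defined this way would pick that up automatically, which is the "get it for free" point made just below.]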
16:16:44 <jlibosva> we talked with slaweq and we agreed it would be good to migrate and then once devstack-tempest is switched, we'll get it for free
16:16:59 <slaweq> exactly, thx jlibosva :)
16:17:42 <slaweq> so please review it if You will have some time :)
16:18:14 <ihrachys> fair
16:18:37 <ihrachys> #topic Grafana
16:18:40 <jlibosva> I looked but I don't feel like I understand the zuul thingy completely so I'm not comfortable +2'ing
16:18:53 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:15 <ihrachys> there was some suspicious fullstack spike a day ago. while some other jobs also spiked at the same time, the fullstack chart doesn't seem to have recovered completely.
16:20:48 <ihrachys> any ideas what's the spike about?
16:21:04 <slaweq> today I spotted 2 or 3 times an issue like described in https://bugs.launchpad.net/neutron/+bug/1757089
16:21:05 <openstack> Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:21:16 <slaweq> I started debugging it even but haven't found anything yet
16:21:59 <slaweq> I think it is some kind of race between finishing configuration of port and setting it to admin_state_up = False
16:22:05 <slaweq> but I'm not sure yet
16:22:05 <ihrachys> yeah I haven't seen this test case failing before
16:22:38 <ihrachys> slaweq, are you planning to dig it further or we should have someone else on it?
16:22:49 <slaweq> yes, I will take care of this one
16:22:53 <ihrachys> the job just moved to voting so it would be nice to tackle that before people start complaining :)
16:22:57 <ihrachys> great!
16:23:21 <ihrachys> another change in grafana worth discussion is that dvr-scenarios seem to be at 100% now
16:23:22 <slaweq> but I don't know if this is the only issue with fullstack
16:23:30 <ihrachys> since 4 days ago
16:24:07 <ihrachys> failure example: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/testr_results.html.gz
16:25:07 <ihrachys> haleyb, ideas what's broken in dvr scenarios job?
16:26:27 <ihrachys> ok here are errors: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/screen-q-l3.txt.gz?level=TRACE
16:26:33 <ihrachys> "NetlinkError: (99, 'Cannot assign requested address')"
16:26:41 <ihrachys> slaweq, I believe that's your pyroute2 patch
16:27:30 <slaweq> ihrachys: it can be
16:28:16 <slaweq> I will check that one ASAP
16:28:58 <ihrachys> #action slaweq to check why dvr-scenario job is broken with netlink errors in l3 agent log
16:29:12 <ihrachys> interesting that it doesn't happen in linuxbridge
16:30:07 <ihrachys> slaweq, btw jlibosva left so we are two now
16:30:16 <slaweq> ihrachys: ok
16:30:45 <ihrachys> #topic Fullstack
16:31:42 <ihrachys> there are some existing bug reports that we could take a look at since we have half an hour
16:31:43 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack&orderby=status&start=0
16:31:52 <ihrachys> I will ignore wishlist bugs
16:31:55 <slaweq> sure
16:32:02 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1757089
16:32:02 <openstack> Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:32:12 <ihrachys> this one we already discussed, it's the new failure you just reported
16:32:17 <slaweq> yep
16:33:15 <ihrachys> and you are going to look into it
16:33:20 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1744402
16:33:21 <openstack> Launchpad bug 1744402 in neutron "fullstack security groups test fails because ncat process don't starts" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:33:21 <slaweq> yes
16:33:59 <ihrachys> this one resulted in test disabled
16:34:09 <ihrachys> do we have patches tackling the original issue?
16:34:37 <ihrachys> doesn't seem like it
16:34:42 <slaweq> no, we don't have any patches for this one
16:35:27 <ihrachys> checking on logstash how often it happens
16:35:33 <slaweq> and to be honest I'm not sure if this is still an issue or maybe it is fixed by https://review.openstack.org/#/c/545820/
16:36:00 <slaweq> as it could be related to a problem with the IP address not being configured in the fake machine namespace
16:36:10 <ihrachys> slaweq, you think ncat failed to start because no ip set?
16:36:23 <slaweq> it could be IMO
16:36:50 <slaweq> if on the destination fake machine there wasn't an IP address, it can't start
16:37:25 <slaweq> but I'm not 100% sure, maybe there is different problem still
16:37:41 <ihrachys> we could close it and see if it reemerges
16:37:47 <ihrachys> boy logstash is slow
16:38:20 <slaweq> we can do that, and I will remove this unstable test decorator
16:38:31 <slaweq> so if it fails we will find it more easily
16:38:45 <ihrachys> yeah. no hits in logstash in a month.
16:39:13 <slaweq> so I will remove this decorator and we will see, fine?
16:39:22 <ihrachys> #action slaweq to revert https://review.openstack.org/#/c/541242/ because the bug is probably fixed by https://review.openstack.org/#/c/545820/4
16:39:33 <slaweq> thx
16:39:47 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687074
16:39:48 <openstack> Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:40:02 <ihrachys> that's an old one.
16:40:16 <ihrachys> I don't think Terry made any progress since it was initially looked into
16:40:43 <slaweq> I didn't see any patch related to this one for sure
16:41:13 <slaweq> but it's also not so often now IMO
16:41:24 <ihrachys> at some point, this was present in lots of failures. I don't think we have those lately.
16:41:38 <ihrachys> right. it seems like maybe something was fixed in the middle.
16:41:49 <ihrachys> and I recollect we sometimes saw those errors in successful runs too
16:42:11 <ihrachys> so it could as well be a red herring, and what helped is a set of unrelated patches that you carefully merged for legit issues.
16:42:21 <slaweq> so maybe we can close it for now and reopen if it will be necessary?
16:42:26 <ihrachys> maybe we were blaming the failures on the error message
16:42:45 <slaweq> it can be like You are saying
16:42:49 <ihrachys> yes that's my inclination. I will leave ovsdbapp part open in LP so that they can decide on their own.
16:43:06 <slaweq> ok
16:44:40 <ihrachys> did just that
16:44:45 <slaweq> thx
16:44:54 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1673531
16:44:55 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:47:52 <slaweq> I also didn't see this one lately
16:47:52 <ihrachys> crawling logstash in background. I will close if no hits.
16:47:53 <slaweq> could be maybe fixed by https://review.openstack.org/#/c/547345/
16:47:53 <slaweq> but I don't remember exactly in which moment this test was failing usually so I'm not sure
16:47:53 <ihrachys> maybe. it was slow port active state propagation
16:47:53 <ihrachys> ok no hits in logstash
16:47:53 <ihrachys> closed
16:47:57 <ihrachys> the rest are wishlist bugs
16:48:21 <slaweq> FYI: I just marked https://bugs.launchpad.net/neutron/+bug/1744396 as duplicate
16:48:23 <openstack> Launchpad bug 1723912 in neutron "duplicate for #1744396 Refactor securitygroup fullstack tests" [Wishlist,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:48:24 <ihrachys> (moved the one that talks about the signal we use to tear down resources from Low -> Wishlist just now)
16:48:32 <slaweq> because there were two about the same thing
16:48:46 <ihrachys> slaweq, great
16:49:19 <slaweq> only two not wishlist, that is good :)
16:49:29 <ihrachys> ok we are done with this list and still have 10 more minutes. so let's also clean up gate-failure bugs
16:49:31 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:49:37 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1741889
16:49:38 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:50:08 <slaweq> is it still an issue?
16:50:23 <ihrachys> not sure. ovsdbapp release that claims it's fixed was issued.
16:50:39 <slaweq> functional tests are quite stable last days
16:51:36 <ihrachys> there are 16 hits of the timeout message in logstash for a month
16:52:02 <ihrachys> but no functional it seems
16:52:13 <ihrachys> example: http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_06_775653
16:52:38 <slaweq> but this is oslo messaging timeout
16:52:43 <ihrachys> but there is also queue.Empty above
16:52:50 <slaweq> righ
16:52:53 <slaweq> right
16:53:02 <ihrachys> slaweq, oslo messaging?
16:53:17 <slaweq> http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_07_663639
16:53:18 <ihrachys> oh I see below
16:53:29 <slaweq> I was talking about this one just below "yours" :)
16:53:41 <ihrachys> that's timeout on reporting state
16:54:07 <ihrachys> looks like both rpc state thread and ovsdbapp were hanging
16:54:23 <slaweq> maybe it's another issue with load on host
16:54:27 <ihrachys> yeah
16:54:39 <ihrachys> ok let's close the functional bug
16:54:45 <slaweq> ++
16:55:55 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1738475
16:55:57 <openstack> Launchpad bug 1738475 in neutron "test_notify_port_ready_after_enable_dhcp fails with RuntimeError: 'dhcp_ready_on_ports' not be called" [Low,New]
16:56:35 <ihrachys> haven't seen it honestly
16:56:42 <slaweq> IMO we can close it - I didn't find anything like that since then
16:56:49 <ihrachys> ok
16:57:27 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1664347
16:57:28 <openstack> Launchpad bug 1664347 in OpenStack Compute (nova) "test_volume_boot_pattern failed to get an instance into ACTIVE state" [Undecided,Confirmed]
16:57:35 <ihrachys> it's Incomplete for us
16:57:41 <slaweq> yep
16:58:12 <ihrachys> how to get rid of it from the list?
16:58:20 <ihrachys> I thought incomplete bugs eventually expire
16:58:27 <ihrachys> but probably the nova component holds it
16:58:28 <slaweq> I don't know
16:58:38 <ihrachys> maybe I will just kill the neutron link
16:58:48 <slaweq> ++
16:59:00 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1750334
16:59:01 <openstack> Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:59:08 <ihrachys> isn't it the same as other timeout bugs?
16:59:42 <slaweq> yes, AFAIR it was the same issue as with functional
16:59:58 <ihrachys> marking as dup
17:00:05 <ihrachys> ok we are at the top of the hour
17:00:16 <ihrachys> great, nice to do some cleanup once in a while :)
17:00:20 <ihrachys> slaweq, thanks a lot!
17:00:25 <slaweq> thx
17:00:29 <ihrachys> #endmeeting