16:00:57 #startmeeting neutron_ci
16:00:57 Meeting started Tue Mar 20 16:00:57 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:58 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:01 The meeting name has been set to 'neutron_ci'
16:01:06 hi
16:01:10 o/
16:01:16 mlavalle won't be able to join us today
16:01:22 haleyb, jlibosva
16:01:54 o
16:01:56 /
16:02:42 ok let's get going
16:02:43 #topic Actions from prev meeting
16:02:56 btw thanks slaweq for taking it over the prev time, appreciate it
16:03:05 slaweq, you should probably become the chair for the meeting
16:03:06 ihrachys: no problem :)
16:03:24 since you are the one who actually does most of the work :) you and jlibosva
16:03:33 slaweq, I mean in general
16:03:42 maybe jlibosva then :)
16:03:58 I don't think I'm experienced enough to do it all the time
16:04:12 You did really well last time :)
16:04:12 nah I think you are a better long term shot ;) things I know... :)
16:04:20 ok as for action items
16:04:22 "slaweq to enable voting for fullstack"
16:04:30 done
16:04:42 https://review.openstack.org/552686
16:05:09 was the change global for all queues? https://review.openstack.org/#/c/552686/1/.zuul.yaml
16:05:17 or is it blocked from voting in gate somewhere else?
16:05:39 it is only for the master branch and the check queue for now
16:05:45 that is what we decided last time
16:06:31 if all goes fine we will add it to the gate queue too, am I right?
16:06:38 oh I get it, it's not in the gate jobs list
16:06:45 yes
16:06:49 yeah you are right, I am just bad at tech
16:06:57 "slaweq to enable voting for linuxbridge scenarios"
16:07:02 done also: https://review.openstack.org/#/c/552689/
16:07:17 democracy, everyone gets a vote
16:07:17 and also only in the check queue for now
16:07:36 +
16:07:37 "jlibosva to take a look at dvr trunk tests issue"
16:07:42 ehm
16:07:47 I didn't
16:07:49 context: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:04 I'll try this week
16:08:16 #action jlibosva to take a look at dvr trunk tests issue: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:08:22 great
16:08:29 "slaweq check reasons of failures of neutron-tempest-dvr-ha-multinode-full"
16:08:39 I checked it and I created bug https://bugs.launchpad.net/neutron/+bug/1756301
16:08:40 Launchpad bug 1756301 in neutron "Tempest DVR HA multimode tests fails due to no FIP connectivity" [High,Confirmed]
16:08:46 context http://logs.openstack.org/14/529814/5/check/neutron-tempest-dvr-ha-multinode-full/d8cfbdf/logs/testr_results.html.gz
16:09:01 it looks like in all those cases the failure is the same
16:09:13 but I didn't do any debugging on this issue
16:10:15 in all other cases which I checked, the issues were related to cinder volumes
16:10:58 slaweq, that's weird. I would understand it if e.g. the instance crashed because of a volume attachment.
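[Editor's note on the check-vs-gate distinction reviewed in the action items above: in Zuul v3 a job listed in a pipeline is voting by default, and it can only block the gate if it appears in the gate job list at all. A minimal illustrative sketch of a .zuul.yaml project stanza; the job lists here are hypothetical, not the content of review 552686.]

```yaml
# Hypothetical .zuul.yaml project stanza: the fullstack and scenario jobs
# vote in check because they are listed there without "voting: false", and
# they cannot block merges because they are absent from the gate job list.
- project:
    check:
      jobs:
        - neutron-fullstack
        - neutron-tempest-plugin-scenario-linuxbridge
    gate:
      jobs:
        - neutron-functional   # fullstack/scenario jobs intentionally absent
```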
16:11:03 but it's clear it's something net related
16:11:25 yes, in all those cases which are pointed at in this bug report the errors are related to neutron
16:11:38 at least one failure of those listed is in tempest.scenario.test_server_basic_ops.TestServerBasicOps
16:11:41 I don't think it's vol specific
16:11:55 I was just saying that I found many other failures of this job with issues not related to network but related to cinder
16:12:23 sorry if I wasn't clear :)
16:12:31 I see.
16:13:16 well, the bug is reported. if someone has cycles they can take it
16:13:20 "slaweq switch scenario jobs to lib/neutron"
16:13:39 slaweq, I am missing context on that one. while it's good to move there, I don't think we did for gate or any other jobs.
16:13:44 so what's special about scenarios?
16:13:58 nothing, I just wanted to start with something :)
16:14:10 so I even sent a patch https://review.openstack.org/#/c/552689/
16:14:20 to switch one scenario job to zuul v3
16:14:45 sorry, wrong link, this one is good: https://review.openstack.org/#/c/552846/13
16:15:25 but in the meantime the devstack-tempest job was switched back to lib/neutron-legacy so this will not switch our job to using lib/neutron for now
16:15:47 but I still think that it could be good to start moving jobs to zuul v3 definitions
16:15:53 what do you think about that?
16:16:31 yeah that's good. I am not sure if that helps with the lib/neutron target but we should move to the new format regardless.
16:16:44 we talked with slaweq and we agreed it would be good to migrate and then once devstack-tempest is switched, we'll get it for free
16:16:59 exactly, thx jlibosva :)
16:17:42 so please review it if you have some time :)
16:18:14 fair
16:18:37 #topic Grafana
16:18:40 I looked but I don't feel like I understand the zuul thingy completely so I'm not comfortable +2'ing
16:18:53 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:15 there was some suspicious fullstack spike a day ago. while some other jobs also spiked at the same time, the fullstack chart doesn't seem to have recovered completely.
16:20:48 any ideas what the spike is about?
16:21:04 today I spotted 2 or 3 times an issue like the one described in https://bugs.launchpad.net/neutron/+bug/1757089
16:21:05 Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:21:16 I even started debugging it but haven't found anything yet
16:21:59 I think it is some kind of race between finishing configuration of the port and setting it to admin_state_up = False
16:22:05 but I'm not sure yet
16:22:05 yeah I haven't seen this test case failing before
16:22:38 slaweq, are you planning to dig into it further or should we put someone else on it?
16:22:49 yes, I will take care of this one
16:22:53 the job just moved to voting so it would be nice to tackle that before people start complaining :)
16:22:57 great!
16:23:21 another change in grafana worth discussing is that dvr-scenarios seem to be at 100% now
16:23:22 but I don't know if this is the only issue with fullstack
16:23:30 since 4 days ago
16:24:07 failure example: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/testr_results.html.gz
16:25:07 haleyb, ideas what's broken in the dvr scenarios job?
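[Editor's note on the Zuul v3 migration slaweq describes above: the point of moving off legacy job definitions is that a native job inherits from the devstack-tempest base job, so once that base job switches from lib/neutron-legacy to lib/neutron, every child job picks the change up "for free", as jlibosva notes. A rough sketch of the shape such a definition takes; the name, parent variables, and regex are illustrative assumptions, not the contents of review 552846.]

```yaml
# Illustrative native Zuul v3 job definition inheriting from devstack-tempest.
# Child jobs only declare their deltas (projects, services, test selection);
# everything else comes from the parent.
- job:
    name: neutron-tempest-plugin-scenario
    parent: devstack-tempest
    required-projects:
      - openstack/neutron
      - openstack/neutron-tempest-plugin
    vars:
      tempest_test_regex: ^neutron_tempest_plugin\.scenario
      devstack_localrc:
        Q_AGENT: openvswitch
```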
16:26:27 ok here are errors: http://logs.openstack.org/76/550676/10/check/neutron-tempest-plugin-dvr-multinode-scenario/1c3297a/logs/screen-q-l3.txt.gz?level=TRACE
16:26:33 "NetlinkError: (99, 'Cannot assign requested address')"
16:26:41 slaweq, I believe that's your pyroute2 patch
16:27:30 ihrachys: it can be
16:28:16 I will check that one ASAP
16:28:58 #action slaweq to check why the dvr-scenario job is broken with netlink errors in the l3 agent log
16:29:12 interesting that it doesn't happen in linuxbridge
16:30:07 slaweq, btw jlibosva left so we are two now
16:30:16 ihrachys: ok
16:30:45 #topic Fullstack
16:31:42 there are some existing bug reports that we could take a look at since we have half an hour
16:31:43 https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack&orderby=status&start=0
16:31:52 I will ignore wishlist bugs
16:31:55 sure
16:32:02 https://bugs.launchpad.net/neutron/+bug/1757089
16:32:02 Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:32:12 this one we already discussed, it's the new failure you just reported
16:32:17 yep
16:33:15 and you are going to look into it
16:33:20 https://bugs.launchpad.net/neutron/+bug/1744402
16:33:21 Launchpad bug 1744402 in neutron "fullstack security groups test fails because ncat process don't starts" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:33:21 yes
16:33:59 this one resulted in the test being disabled
16:34:09 do we have patches tackling the original issue?
16:34:37 doesn't seem like it
16:34:42 no, we don't have any patches for this one
16:35:27 checking on logstash how often it happens
16:35:33 and to be honest I'm not sure if this is still an issue, or maybe it is fixed by https://review.openstack.org/#/c/545820/
16:36:00 as it could be related to the problem with an unconfigured IP address in the fake machine namespace
16:36:10 slaweq, you think ncat failed to start because no IP was set?
16:36:23 it could be IMO
16:36:50 if there wasn't an IP address on the destination fake machine, it can't start
16:37:25 but I'm not 100% sure, maybe there is still a different problem
16:37:41 we could close it and see if it reemerges
16:37:47 boy logstash is slow
16:38:20 we can do that, and I will remove this unstable test decorator
16:38:31 so if it fails we will find it more easily
16:38:45 yeah. no hits in logstash in a month.
16:39:13 so I will remove this decorator and we will see, fine?
16:39:22 #action slaweq to revert https://review.openstack.org/#/c/541242/ because the bug is probably fixed by https://review.openstack.org/#/c/545820/4
16:39:33 thx
16:39:47 https://bugs.launchpad.net/neutron/+bug/1687074
16:39:48 Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:40:02 that's an old one.
16:40:16 I don't think Terry made any progress since it was initially looked into
16:40:43 I haven't seen any patch related to this one for sure
16:41:13 but it's also not so frequent now IMO
16:41:24 at some point, this was present in lots of failures. I don't think we have those lately.
16:41:38 right. it seems like maybe something was fixed in the meantime.
16:41:49 and I recollect we sometimes saw those errors in successful runs too
16:42:11 so it could as well be a red herring, and what helped is a set of unrelated patches that you carefully merged for legit issues.
16:42:21 so maybe we can close it for now and reopen if necessary?
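[Editor's note on the NetlinkError trace shown at the start of this topic: pyroute2 raises NetlinkError(code, message) when the kernel rejects a netlink request, and code 99 is errno.EADDRNOTAVAIL, 'Cannot assign requested address'. A minimal sketch of catching it; root privileges are assumed, and the interface name and address are placeholders, not the agent's actual values.]

```python
# Minimal sketch: pyroute2 surfaces the kernel errno via NetlinkError.code;
# 99 (errno.EADDRNOTAVAIL) is the error seen in the l3 agent log above.
# Interface name and address are placeholders.
import errno

from pyroute2 import IPRoute
from pyroute2.netlink.exceptions import NetlinkError

ip = IPRoute()
idx = ip.link_lookup(ifname='qr-placeholder')[0]
try:
    ip.addr('add', index=idx, address='203.0.113.5', prefixlen=24)
except NetlinkError as e:
    if e.code == errno.EADDRNOTAVAIL:
        # the kernel refused the address, mirroring the agent-side failure
        print('Cannot assign requested address:', e)
finally:
    ip.close()
```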
16:42:26 maybe we were blaming the failures on the error message
16:42:45 it can be like you are saying
16:42:49 yes that's my inclination. I will leave the ovsdbapp part open in LP so that they can decide on their own.
16:43:06 ok
16:44:40 did just that
16:44:45 thx
16:44:54 https://bugs.launchpad.net/neutron/+bug/1673531
16:44:55 Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:47:52 I also haven't seen this one lately
16:47:52 crawling logstash in the background. I will close it if there are no hits.
16:47:53 it could maybe be fixed by https://review.openstack.org/#/c/547345/
16:47:53 but I don't remember exactly at which point this test usually failed so I'm not sure
16:47:53 maybe. it was slow port active state propagation
16:47:53 ok no hits in logstash
16:47:53 closed
16:47:57 the rest are wishlist bugs
16:48:21 FYI: I just marked https://bugs.launchpad.net/neutron/+bug/1744396 as duplicate
16:48:23 Launchpad bug 1723912 in neutron "duplicate for #1744396 Refactor securitygroup fullstack tests" [Wishlist,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:48:24 (moved Low -> Wishlist just now; it's the one that talks about the signal we use to tear down resources)
16:48:32 because there were two about the same thing
16:48:46 slaweq, great
16:49:19 only two not wishlist, that is good :)
16:49:29 ok we are done with this list and still have 10 more minutes. so let's also clean up gate-failure bugs
16:49:31 https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:49:37 https://bugs.launchpad.net/neutron/+bug/1741889
16:49:38 Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:50:08 is it still an issue?
16:50:23 not sure. an ovsdbapp release that claims it's fixed was issued.
16:50:39 functional tests have been quite stable in recent days
16:51:36 there are 16 hits of the timeout message in logstash for a month
16:52:02 but none in functional it seems
16:52:13 example: http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_06_775653
16:52:38 but this is an oslo messaging timeout
16:52:43 but there is also queue.Empty above
16:52:50 right
16:53:02 slaweq, oslo messaging?
16:53:17 http://logs.openstack.org/99/538699/3/gate/tempest-full-py3/71b6729/controller/logs/screen-q-agt.txt?level=TRACE#_Mar_20_08_57_07_663639
16:53:18 oh I see below
16:53:29 I was talking about the one just below "yours" :)
16:53:41 that's a timeout on reporting state
16:54:07 looks like both the rpc state thread and ovsdbapp were hanging
16:54:23 maybe it's another issue with load on the host
16:54:27 yeah
16:54:39 ok let's close the functional bug
16:54:45 ++
16:55:55 https://bugs.launchpad.net/neutron/+bug/1738475
16:55:57 Launchpad bug 1738475 in neutron "test_notify_port_ready_after_enable_dhcp fails with RuntimeError: 'dhcp_ready_on_ports' not be called" [Low,New]
16:56:35 haven't seen it honestly
16:56:42 IMO we can close it - I haven't found anything like that since then
16:56:49 ok
16:57:27 https://bugs.launchpad.net/neutron/+bug/1664347
16:57:28 Launchpad bug 1664347 in OpenStack Compute (nova) "test_volume_boot_pattern failed to get an instance into ACTIVE state" [Undecided,Confirmed]
16:57:35 it's Incomplete for us
16:57:41 yep
16:58:12 how do we get rid of it from the list?
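[Editor's note on bug 1741889 above: the 10-second figure in the bug title is the per-transaction timeout that ovsdbapp's Connection is created with; when a command such as DbAddCommand cannot complete within that window it raises a TimeoutException. A hedged sketch of that plumbing; the server address, bridge name, and values are illustrative, and a reachable ovsdb-server plus an existing br-int are assumed.]

```python
# Illustrative ovsdbapp usage; the timeout passed to Connection is the
# transaction budget after which commands like db_add raise
# ovsdbapp.exceptions.TimeoutException (the 10 s from bug 1741889).
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.open_vswitch import impl_idl

idl = connection.OvsdbIdl.from_server('tcp:127.0.0.1:6640', 'Open_vSwitch')
conn = connection.Connection(idl, timeout=10)
api = impl_idl.OvsdbIdl(conn)

# rough equivalent of the DbAddCommand that times out in the bug report;
# assumes a bridge named br-int already exists
api.db_add('Bridge', 'br-int', 'external_ids',
           {'ci': 'demo'}).execute(check_error=True)
```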
16:58:20 I thought incomplete bugs eventually expire
16:58:27 but probably the nova component holds it
16:58:28 I don't know
16:58:38 maybe I will just kill the neutron link
16:58:48 ++
16:59:00 https://bugs.launchpad.net/neutron/+bug/1750334
16:59:01 Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:59:08 isn't it the same as the other timeout bugs?
16:59:42 yes, AFAIR it was the same issue as with functional
16:59:58 marking as dup
17:00:05 ok we are at the top of the hour
17:00:16 great, nice to do some cleanup once in a while :)
17:00:20 slaweq, thanks a lot!
17:00:25 thx
17:00:29 #endmeeting
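[Editor's postscript on the "unstable test decorator" from the fullstack topic: the pattern is to turn a known-flaky test's failure into a skip annotated with the bug reference, so the test keeps running and producing data without breaking the job; the revert in the action item removes the decorator again so failures count. A hedged sketch, assuming neutron's unstable_test helper; the class and test names are hypothetical, not the code touched by review 541242.]

```python
# Sketch of the "unstable test" pattern discussed above. The decorator wraps
# the test so that an exception is reported as a skip citing the bug instead
# of a failure; reverting https://review.openstack.org/541242 removes it.
from neutron.tests import base


class SecurityGroupsTest(base.BaseTestCase):  # hypothetical test class

    @base.unstable_test("bug 1744402")
    def test_securitygroup_connectivity(self):
        # placeholder body; the real fullstack test starts ncat in a fake
        # machine namespace and asserts connectivity through security groups
        raise RuntimeError("flaky failure")  # would surface as a skip
```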