16:00:48 <ihrachys> #startmeeting neutron_ci
16:00:49 <openstack> Meeting started Tue Feb 13 16:00:48 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:52 <openstack> The meeting name has been set to 'neutron_ci'
16:00:55 <mlavalle> o/
16:01:12 <slaweq> hi
16:01:16 <ihrachys> #topic Actions from prev meeting
16:01:23 <mlavalle> ihrachys: Kuba won't make it to the meeting
16:01:29 <ihrachys> ack
16:01:34 <ihrachys> thanks for the notice
16:01:35 <mlavalle> He went to hospital to see wife
16:01:51 <ihrachys> yeah he has better things to care about right now ;)
16:01:52 <mlavalle> and Sara
16:02:19 <ihrachys> first action item was "ihrachys report bug for -pg- failure to glance"
16:02:42 <ihrachys> that was rather optimistic, before we learned about OVO / facade issue
16:02:45 <ihrachys> so i didn't
16:02:50 <ihrachys> let me check if it's still relevant
16:03:57 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:04:20 <ihrachys> there was a 2d trench apparently on weekend
16:04:29 <ihrachys> don't we run periodics on weekends?
16:05:44 <ihrachys> ok the latest failure seems to be smth different than devstack failure we saw before:
16:05:45 <ihrachys> http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/legacy-periodic-tempest-dsvm-neutron-pg-full/894eb3b/job-output.txt.gz#_2018-02-13_07_13_23_633653
16:05:52 <ihrachys> test_get_service_by_service_and_host_name failed
16:06:02 <ihrachys> that's for compute admin api
16:07:09 <ihrachys> I will report this one instead
16:07:18 <ihrachys> #action ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:07:30 <ihrachys> then there was "slaweq to report bug for ovsfw job failure / sea of red in ovs agent logs"
16:07:38 <ihrachys> I believe this is largely contained now
16:07:50 <slaweq> so I reported bug: https://bugs.launchpad.net/neutron/+bug/1747709
16:07:51 <openstack> Launchpad bug 1747709 in neutron "neutron-tempest-ovsfw fails 100% times" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:08:11 <slaweq> and later also https://bugs.launchpad.net/neutron/+bug/1748546
16:08:12 <openstack> Launchpad bug 1748546 in neutron "ovsfw tempest tests random fails" [High,Fix committed] - Assigned to Slawek Kaplonski (slaweq)
16:08:25 <slaweq> and now patches for both are merged
16:08:37 <slaweq> https://review.openstack.org/#/c/542596/
16:08:55 <slaweq> adn https://review.openstack.org/#/c/542257/
16:08:57 <slaweq> *and
16:09:09 <ihrachys> ok, we need to land backports: https://review.openstack.org/#/q/I750224f7495aa46634bec1211599953cbbd4d1df
16:09:21 <ihrachys> and https://review.openstack.org/#/q/I6d917cbac61293e9a956a2efcd9f2b720e4cac95
16:09:31 <slaweq> with those 2 patches neutron-tempest-ovsfw job was green at least 5-6 times when I checked that during weekend
16:09:51 <mlavalle> yeap
16:10:01 <ihrachys> mlavalle, we'll need rc2 for those right?
16:10:07 <mlavalle> yes
16:10:08 <slaweq> yes, cherry-picks to queens are done already also
16:10:30 <mlavalle> those two are the only ones we have for RC2, so far
16:10:43 <ihrachys> slaweq, do we need earlier backports for those bugs?
16:10:49 <ihrachys> pike and ocata are still open
16:11:03 <slaweq> I'm not sure but I will check it ASAP
16:11:18 <mlavalle> with Pike we have more time
16:11:24 <mlavalle> Ocata is the urgent one
16:12:13 <slaweq> ihrachys: please assign "action" for me - I will check it after the meeting
16:12:39 <ihrachys> #action slaweq to backport ovsfw fixes to older stable branches
16:12:42 <slaweq> thx
16:12:53 <ihrachys> thanks for fixing the job! incredible.
16:13:03 <mlavalle> ++
16:13:05 <ihrachys> next was "ihrachys to look at linuxbridge scenario random failures when sshing to FIP"
16:13:14 <ihrachys> ok that's another one that fell through cracks
16:13:33 <ihrachys> and I don't have capacity for anything new / time consuming this week
16:14:11 <ihrachys> so maybe someone can take it over
16:14:28 <ihrachys> actually, we can probably check if it's still an important issue later in the meeting when we talk about scenarios
16:14:38 <mlavalle> let's do that
16:15:28 <ihrachys> #topic Grafana
16:15:28 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:21 <ihrachys> fullstack seems to be the only job that is above 50%
16:17:36 <slaweq> what is also good ;)
16:17:37 <ihrachys> well except ovsfw but it looks like it quickly goes down with its average
16:18:02 <slaweq> ovsfw patches was merged today so I think we should wait with this one some time
16:18:52 * mlavalle keeping fingers crossed
16:18:58 <ihrachys> right, but it's clear it's going down very quick. if that would be a Dow Jones chart we would see people jumping from windows.
16:19:10 <mlavalle> LOL
16:19:12 <slaweq> LOL
16:19:14 <mlavalle> like 1929
16:19:31 <slaweq> but here lower is better
16:19:40 <ihrachys> ok on serious note... seems like we want to talk about fullstack first thing
16:19:57 <ihrachys> and then move to scenarios if time allows for it
16:20:12 <ihrachys> #topic Fullstack
16:20:37 <ihrachys> before going through latest failures, I think it makes sense to check some patches
16:21:23 <ihrachys> specifically, https://review.openstack.org/#/c/539953/ enables test_dscp_marking_packets (it's actually already in gate so maybe we can just make a note it may resurface now and skip it)
16:21:44 <ihrachys> also those backports for the same test for older branches: https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:21:55 <slaweq> but I believe this test should works fine now
16:21:59 <ihrachys> I believe once those are in, we can also backport the re-enabling patch
16:22:14 <ihrachys> there are also those backports: https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149
16:23:17 <ihrachys> any other patches I missed for fullstack?
16:23:17 <slaweq> AFAIR https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149 are required by https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:24:44 <ihrachys> yeah I think I mentioned both
16:24:53 <mlavalle> yes you did
16:24:54 <slaweq> yes, just saying :)
16:24:55 <ihrachys> ok now looking at a latest run
16:24:55 <ihrachys> http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz
16:25:02 <ihrachys> test_controller_timeout_does_not_break_connectivity_sigterm and test_l2_agent_restart
16:25:06 <ihrachys> all old friends
16:25:24 <ihrachys> at some point I should have even looked at one of them but never did
16:25:55 <ihrachys> and as I said I am out of capacity for this week anyway
16:26:15 <slaweq> I can take a look on them
16:26:30 <ihrachys> I can't see neither reported in https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack
16:27:18 <ihrachys> slaweq,
16:27:20 <ihrachys> great
16:27:24 <slaweq> https://bugs.launchpad.net/neutron/+bug/1673531
16:27:25 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:27:26 <ihrachys> actually maybe the sigterm one is https://bugs.launchpad.net/neutron/+bug/1673531
16:27:29 <ihrachys> riiight
16:27:31 <slaweq> isn't this related?
16:27:35 <ihrachys> probably
16:27:39 <ihrachys> just different signal
16:28:08 <slaweq> I will take a look on them this week
16:28:28 <ihrachys> #action slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures
16:28:38 <ihrachys> slaweq, very cool, thanks
16:28:54 <slaweq> no problem
16:29:17 <ihrachys> checked another run, same failures: http://logs.openstack.org/01/542301/1/check/neutron-fullstack/d8f312e/logs/testr_results.html.gz
16:29:53 <ihrachys> and another failed run - same: http://logs.openstack.org/96/542596/6/check/neutron-fullstack/8db76b7/logs/testr_results.html.gz (just one failed though)
16:30:11 <ihrachys> it's clear that if those are tackled, it may get fullstack down to normal failure rate
16:30:32 <ihrachys> so it's great we'll have slaweq on those, meaning next week we'll have it green ;)
16:30:32 <slaweq> I hope so :)
16:30:55 <slaweq> I can always mark them as unstable to make it green :P
16:31:29 <ihrachys> true lol
16:32:02 <ihrachys> #topic Functional
16:32:29 <ihrachys> I am looking at functional job charts and it seems to work fine. actually showing weird 0% failure last several days.
16:32:54 <ihrachys> mlavalle, what would be the standard to hit to get it marked voting in check queue?
16:33:10 <ihrachys> for the record I look at http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:34:16 <mlavalle> good question
16:34:34 <mlavalle> 10% sounds good to me but I am shooting from the hip
16:34:47 <mlavalle> what do you think?
16:35:19 * ihrachys checking what's average for unit tests
16:35:47 <ihrachys> these are unit tests: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:36:37 <ihrachys> if you look at 30 days stats, you see average is somewhere between 10 and 15
16:37:10 <mlavalle> so 10% sounds good
16:37:14 <ihrachys> mlavalle, let's say we can hit 10%. how long should we experience it to consider voting?
16:37:30 <mlavalle> let's say a week
16:37:35 <mlavalle> with that average
16:37:49 <ihrachys> ok gotcha, and full *working* week I believe
16:37:56 <ihrachys> because weekends are not normal
16:37:56 <mlavalle> yes
16:38:35 <ihrachys> ok we are in the second day then right now. I will monitor it proactively then. ;)
16:38:44 <mlavalle> cool
16:39:08 <mlavalle> we can revisit net meeting
16:39:14 <mlavalle> next^^^
16:39:20 <ihrachys> ok
16:39:23 <ihrachys> #topic Scenarios
16:39:38 <ihrachys> the chart here: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:40:35 <ihrachys> ok. first, dvr one is 20-25%
16:40:48 <ihrachys> slaweq since the job uses ovsfw shouldn't we expect some stabilization there too?
16:41:12 <slaweq> I don't know exactly
16:41:25 <slaweq> do You have some example failure?
16:42:00 <ihrachys> no not yet. we'll look at them later.
16:42:12 <ihrachys> as for linuxbridge job, it's 25-30% so slightly worse
16:42:34 <ihrachys> hence first looking at linuxbridge
16:43:07 <ihrachys> example failure: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/testr_results.html.gz
16:43:37 <ihrachys> all ssh failures using fip
16:45:35 <ihrachys> since this job is not dvr / ha, routers are legacy
16:45:49 <ihrachys> so we should e.g. see arpings and iptables-restore and 'ip addr add' in logs
16:47:19 <ihrachys> this is where the fip address is configured on l3agent side: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-l3.txt.gz#_Feb_13_15_11_35_940199
16:48:27 <ihrachys> so the address is configured in namespace; arping is executed; tempest should know about correct location of the address
16:48:51 <slaweq> and iptables-save is also done just after so NAT should be also done
16:48:56 <haleyb> there's no errors in the l2 agent log either
16:49:02 <slaweq> maybe there is some issue with SG then?
16:50:13 <ihrachys> hm possible
16:50:23 <ihrachys> and also instances booted fine and reached metadata
16:50:23 <haleyb> the only thing that merged recently was the ARP change i did maybe, moved things to PREROUTING chain
16:50:32 <ihrachys> (it's seen in one of test cases that logs console)
16:50:54 <haleyb> slaweq: right, unless its SG
16:51:20 <ihrachys> slaweq, wouldn't it be Connection refused if SG closed the port?
16:51:58 <slaweq> I don't think so
16:52:03 <ihrachys> there it seems it just hangs there until timeout
16:55:18 <ihrachys> haleyb, so one thing that pops up is this message:
16:55:21 <ihrachys> http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-agt.txt.gz?#_Feb_13_15_19_22_794763
16:55:32 <ihrachys> Tried to remove rule that was not there: 'POSTROUTING' u'-m physdev --physdev-in tap967365b4-ce --physdev-is-bridged -j neutron-linuxbri-qos-o967365'
16:55:48 <slaweq> I think there is already patch for that
16:55:54 <ihrachys> or is it the red herring that slaweq has a patch for
16:56:03 <ihrachys> https://review.openstack.org/#/c/539922/ ?
16:56:05 <slaweq> haleyb did this patch I think
16:56:05 <haleyb> yes, there is, red herring i believe
16:56:14 <ihrachys> ah wait that's db side
16:56:29 <ihrachys> https://review.openstack.org/#/c/541379/ ?
16:56:35 <slaweq> yes
16:56:41 <slaweq> You were faster :)
16:57:03 <haleyb> and that passed the scenario test
16:57:20 <ihrachys> ok I nudged it in gate
16:57:31 <ihrachys> but do we think it could break anything?
16:58:30 <slaweq> this warning? I don't think so
16:58:36 <ihrachys> ok
16:58:41 <ihrachys> so there is something else
16:58:49 <ihrachys> and we are almost out of time
16:58:52 <slaweq> IMHO yes
16:59:04 <slaweq> *yes, there is something else :)
16:59:44 <ihrachys> this issue doesn't have obvious leads in logs so one thing to consider is maybe we should expand logging
16:59:50 <mlavalle> ihrachys: I don't have much time this week, but if nobody else has, I'll try to take a stab at it
17:00:11 <slaweq> I will probably be busy with debugging fullstack tests
17:00:13 <ihrachys> one thing I noticed is most tests that failed don't log console on failure. that could be worth expanding on, not just for this issue.
17:00:24 <ihrachys> mlavalle, please do!
17:00:31 <mlavalle> ok
17:00:35 <ihrachys> #action mlavalle to look into linuxbridge ssh timeout failures
17:00:40 <ihrachys> thanks folks for joining
17:00:42 <ihrachys> #endmeeting
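
Editor's note: the checks ihrachys and slaweq walk through around 16:45-16:48 (floating IP added in the legacy qrouter namespace, arping sent, NAT rules written via iptables-save) can be reproduced by hand roughly as in the sketch below. This is an illustrative sketch only, not something run or agreed in the meeting; the router UUID and floating IP are hypothetical placeholders, and the commands assume a devstack-style node with sudo access to the network namespaces.

```python
#!/usr/bin/env python3
"""Rough sketch: verify a floating IP is plumbed into a legacy router
namespace the way the l3-agent log above suggests it should be."""
import subprocess

ROUTER_ID = "ROUTER-UUID"      # placeholder: UUID of the legacy router
FLOATING_IP = "172.24.5.10"    # placeholder: FIP the test tries to ssh to
NAMESPACE = f"qrouter-{ROUTER_ID}"


def in_namespace(*cmd):
    """Run a command inside the router namespace and return its stdout."""
    return subprocess.run(
        ["sudo", "ip", "netns", "exec", NAMESPACE, *cmd],
        capture_output=True, text=True, check=True,
    ).stdout


# 1. The FIP should appear on an interface in the namespace, matching the
#    'ip addr add' call seen in the l3-agent log.
addresses = in_namespace("ip", "-4", "addr", "show")
print("FIP configured in namespace:", FLOATING_IP in addresses)

# 2. The nat table should carry DNAT/SNAT rules for the FIP, matching the
#    iptables-save call made right after the address is configured.
nat_rules = in_namespace("iptables-save", "-t", "nat")
print("NAT rules reference FIP:", FLOATING_IP in nat_rules)
```

When both checks pass but ssh still hangs until timeout, as in the runs discussed above, the remaining suspects raised in the meeting are security groups and the guest itself, which is why logging the instance console on failure was flagged as a follow-up.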