16:00:48 #startmeeting neutron_ci
16:00:49 Meeting started Tue Feb 13 16:00:48 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:50 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:52 The meeting name has been set to 'neutron_ci'
16:00:55 o/
16:01:12 hi
16:01:16 #topic Actions from prev meeting
16:01:23 ihrachys: Kuba won't make it to the meeting
16:01:29 ack
16:01:34 thanks for the notice
16:01:35 He went to hospital to see wife
16:01:51 yeah he has better things to care about right now ;)
16:01:52 and Sara
16:02:19 first action item was "ihrachys report bug for -pg- failure to glance"
16:02:42 that was rather optimistic, before we learned about OVO / facade issue
16:02:45 so i didn't
16:02:50 let me check if it's still relevant
16:03:57 http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:04:20 there was a 2d trench apparently on weekend
16:04:29 don't we run periodics on weekends?
16:05:44 ok the latest failure seems to be smth different than devstack failure we saw before:
16:05:45 http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/legacy-periodic-tempest-dsvm-neutron-pg-full/894eb3b/job-output.txt.gz#_2018-02-13_07_13_23_633653
16:05:52 test_get_service_by_service_and_host_name failed
16:06:02 that's for compute admin api
16:07:09 I will report this one instead
16:07:18 #action ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:07:30 then there was "slaweq to report bug for ovsfw job failure / sea of red in ovs agent logs"
16:07:38 I believe this is largely contained now
16:07:50 so I reported bug: https://bugs.launchpad.net/neutron/+bug/1747709
16:07:51 Launchpad bug 1747709 in neutron "neutron-tempest-ovsfw fails 100% times" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:08:11 and later also https://bugs.launchpad.net/neutron/+bug/1748546
16:08:12 Launchpad bug 1748546 in neutron "ovsfw tempest tests random fails" [High,Fix committed] - Assigned to Slawek Kaplonski (slaweq)
16:08:25 and now patches for both are merged
16:08:37 https://review.openstack.org/#/c/542596/
16:08:55 adn https://review.openstack.org/#/c/542257/
16:08:57 *and
16:09:09 ok, we need to land backports: https://review.openstack.org/#/q/I750224f7495aa46634bec1211599953cbbd4d1df
16:09:21 and https://review.openstack.org/#/q/I6d917cbac61293e9a956a2efcd9f2b720e4cac95
16:09:31 with those 2 patches neutron-tempest-ovsfw job was green at least 5-6 times when I checked that during weekend
16:09:51 yeap
16:10:01 mlavalle, we'll need rc2 for those right?
16:10:07 yes
16:10:08 yes, cherry-picks to queens are done already also
16:10:30 those two are the only ones we have for RC2, so far
16:10:43 slaweq, do we need earlier backports for those bugs?
16:10:49 pike and ocata are still open
16:11:03 I'm not sure but I will check it ASAP
16:11:18 with Pike we have more time
16:11:24 Ocata is the urgent one
16:12:13 ihrachys: please assign "action" for me - I will check it after the meeting
16:12:39 #action slaweq to backport ovsfw fixes to older stable branches
16:12:42 thx
16:12:53 thanks for fixing the job! incredible.
16:13:03 ++
16:13:05 next was "ihrachys to look at linuxbridge scenario random failures when sshing to FIP"
16:13:14 ok that's another one that fell through cracks
16:13:33 and I don't have capacity for anything new / time consuming this week
16:14:11 so maybe someone can take it over
16:14:28 actually, we can probably check if it's still an important issue later in the meeting when we talk about scenarios
16:14:38 let's do that
16:15:28 #topic Grafana
16:15:28 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:21 fullstack seems to be the only job that is above 50%
16:17:36 what is also good ;)
16:17:37 well except ovsfw but it looks like it quickly goes down with its average
16:18:02 ovsfw patches was merged today so I think we should wait with this one some time
16:18:52 * mlavalle keeping fingers crossed
16:18:58 right, but it's clear it's going down very quick. if that would be a Dow Jones chart we would see people jumping from windows.
16:19:10 LOL
16:19:12 LOL
16:19:14 like 1929
16:19:31 but here lower is better
16:19:40 ok on serious note... seems like we want to talk about fullstack first thing
16:19:57 and then move to scenarios if time allows for it
16:20:12 #topic Fullstack
16:20:37 before going through latest failures, I think it makes sense to check some patches
16:21:23 specifically, https://review.openstack.org/#/c/539953/ enables test_dscp_marking_packets (it's actually already in gate so maybe we can just make a note it may resurface now and skip it)
16:21:44 also those backports for the same test for older branches: https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:21:55 but I believe this test should works fine now
16:21:59 I believe once those are in, we can also backport the re-enabling patch
16:22:14 there are also those backports: https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149
16:23:17 any other patches I missed for fullstack?
16:23:17 AFAIR https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149 are required by https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:24:44 yeah I think I mentioned both
16:24:53 yes you did
16:24:54 yes, just saying :)
16:24:55 ok now looking at a latest run
16:24:55 http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz
16:25:02 test_controller_timeout_does_not_break_connectivity_sigterm and test_l2_agent_restart
16:25:06 all old friends
16:25:24 at some point I should have even looked at one of them but never did
16:25:55 and as I said I am out of capacity for this week anyway
16:26:15 I can take a look on them
16:26:30 I can't see neither reported in https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack
16:27:18 slaweq,
16:27:20 great
16:27:24 https://bugs.launchpad.net/neutron/+bug/1673531
16:27:25 Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:27:26 actually maybe the sigterm one is https://bugs.launchpad.net/neutron/+bug/1673531
16:27:29 riiight
16:27:31 isn't this related?
16:27:35 probably
16:27:39 just different signal
16:28:08 I will take a look on them this week
16:28:28 #action slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures
16:28:38 slaweq, very cool, thanks
16:28:54 no problem
16:29:17 checked another run, same failures: http://logs.openstack.org/01/542301/1/check/neutron-fullstack/d8f312e/logs/testr_results.html.gz
16:29:53 and another failed run - same: http://logs.openstack.org/96/542596/6/check/neutron-fullstack/8db76b7/logs/testr_results.html.gz (just one failed though)
16:30:11 it's clear that if those are tackled, it may get fullstack down to normal failure rate
16:30:32 so it's great we'll have slaweq on those, meaning next week we'll have it green ;)
16:30:32 I hope so :)
16:30:55 I can always mark them as unstable to make it green :P
16:31:29 true lol
16:32:02 #topic Functional
16:32:29 I am looking at functional job charts and it seems to work fine. actually showing weird 0% failure last several days.
16:32:54 mlavalle, what would be the standard to hit to get it marked voting in check queue?
16:33:10 for the record I look at http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:34:16 good question
16:34:34 10% sounds good to me but I am shooting from the hip
16:34:47 what do you think?
16:35:19 * ihrachys checking what's average for unit tests
16:35:47 these are unit tests: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:36:37 if you look at 30 days stats, you see average is somewhere between 10 and 15
16:37:10 so 10% sounds good
16:37:14 mlavalle, let's say we can hit 10%. how long should we experience it to consider voting?
16:37:30 let's say a week
16:37:35 with that average
16:37:49 ok gotcha, and full *working* week I believe
16:37:56 because weekends are not normal
16:37:56 yes
16:38:35 ok we are in the second day then right now. I will monitor it proactively then. ;)
16:38:44 cool
16:39:08 we can revisit net meeting
16:39:14 next^^^
16:39:20 ok
16:39:23 #topic Scenarios
16:39:38 the chart here: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:40:35 ok. first, dvr one is 20-25%
16:40:48 slaweq since the job uses ovsfw shouldn't we expect some stabilization there too?
16:41:12 I don't know exactly
16:41:25 do You have some example failure?
16:42:00 no not yet. we'll look at them later.
16:42:12 as for linuxbridge job, it's 25-30% so slightly worse
16:42:34 hence first looking at linuxbridge
16:43:07 example failure: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/testr_results.html.gz
16:43:37 all ssh failures using fip
16:45:35 since this job is not dvr / ha, routers are legacy
16:45:49 so we should e.g. see arpings and iptables-restore and 'ip addr add' in logs
16:47:19 this is where the fip address is configured on l3agent side: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-l3.txt.gz#_Feb_13_15_11_35_940199
16:48:27 so the address is configured in namespace; arping is executed; tempest should know about correct location of the address
16:48:51 and iptables-save is also done just after so NAT should be also done
16:48:56 there's no errors in the l2 agent log either
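
[editor note] A rough sketch of how the legacy-router wiring described above could be double-checked on the test node. The router ID and floating IP below are placeholders (not values from this run), and the namespace naming just follows the usual qrouter-<uuid> convention; it only verifies the two things the log points at: the FIP address present in the router namespace and a DNAT rule for it in the namespace's NAT table.

    # Rough debugging sketch with placeholder values: confirm that the legacy
    # router namespace got the floating IP address and a DNAT rule for it,
    # matching the 'ip addr add' / iptables steps mentioned above.
    import subprocess

    def check_fip_wiring(router_id, fip):
        ns = "qrouter-%s" % router_id  # usual l3 agent namespace naming

        def in_ns(*cmd):
            # run a command inside the router namespace and return its output
            out = subprocess.check_output(("ip", "netns", "exec", ns) + cmd)
            return out.decode()

        addrs = in_ns("ip", "-4", "addr", "show")
        nat_rules = in_ns("iptables", "-t", "nat", "-S")
        print("FIP configured in namespace:", fip in addrs)
        print("DNAT rule for FIP present:", ("-d %s/32" % fip) in nat_rules)

    # hypothetical router ID and floating IP, for illustration only
    check_fip_wiring("deadbeef-0000-0000-0000-000000000000", "172.24.4.10")
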
16:49:02 maybe there is some issue with SG then?
16:50:13 hm possible
16:50:23 and also instances booted fine and reached metadata
16:50:23 the only thing that merged recently was the ARP change i did maybe, moved things to PREROUTING chain
16:50:32 (it's seen in one of test cases that logs console)
16:50:54 slaweq: right, unless its SG
16:51:20 slaweq, wouldn't it be Connection refused if SG closed the port?
16:51:58 I don't think so
16:52:03 there it seems it just hangs there until timeout
16:55:18 haleyb, so one thing that pops up is this message:
16:55:21 http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-agt.txt.gz?#_Feb_13_15_19_22_794763
16:55:32 Tried to remove rule that was not there: 'POSTROUTING' u'-m physdev --physdev-in tap967365b4-ce --physdev-is-bridged -j neutron-linuxbri-qos-o967365'
16:55:48 I think there is already patch for that
16:55:54 or is it the red herring that slaweq has a patch for
16:56:03 https://review.openstack.org/#/c/539922/ ?
16:56:05 haleyb did this patch I think
16:56:05 yes, there is, red herring i believe
16:56:14 ah wait that's db side
16:56:29 https://review.openstack.org/#/c/541379/ ?
16:56:35 yes
16:56:41 You were faster :)
16:57:03 and that passed the scenario test
16:57:20 ok I nudged it in gate
16:57:31 but do we think it could break anything?
16:58:30 this warning? I don't think so
16:58:36 ok
16:58:41 so there is something else
16:58:49 and we are almost out of time
16:58:52 IMHO yes
16:59:04 *yes, there is something else :)
16:59:44 this issue doesn't have obvious leads in logs so one thing to consider is maybe we should expand logging
16:59:50 ihrachys: I don't have much time this week, but if nobody else has, I'll try to take a stab at it
17:00:11 I will probably be busy with debugging fullstack tests
17:00:13 one thing I noticed is most tests that failed don't log console on failure. that could be worth expanding on, not just for this issue.
17:00:24 mlavalle, please do!
17:00:31 ok
17:00:35 #action mlavalle to look into linuxbridge ssh timeout failures
17:00:40 thanks folks for joining
17:00:42 #endmeeting
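
[editor note] For reference on the percentages discussed throughout the meeting: the Grafana dashboards linked above graph Zuul job results stored in the community graphite instance, and the plotted failure rate is essentially failures / (failures + successes) over the chosen window. Below is a minimal sketch of pulling the same number for one job over a week; the metric target names are assumptions, so check the dashboard's panel definitions for the real ones.

    # Minimal sketch (metric paths are assumed, not copied from the dashboard):
    # compute a job's failure rate over the last week from the graphite
    # instance that backs the Grafana dashboards linked in the log above.
    import json
    import urllib.request

    GRAPHITE = "http://graphite.openstack.org/render"
    TARGETS = {  # hypothetical target names; verify against the panel JSON
        "FAILURE": "stats_counts.zuul.pipeline.check.job.neutron-functional.FAILURE",
        "SUCCESS": "stats_counts.zuul.pipeline.check.job.neutron-functional.SUCCESS",
    }

    def total(target, period="7d"):
        url = "%s?target=%s&from=-%s&format=json" % (GRAPHITE, target, period)
        with urllib.request.urlopen(url) as resp:
            series = json.load(resp)
        # graphite returns [{"target": ..., "datapoints": [[value, ts], ...]}]
        return sum(v for v, _ in series[0]["datapoints"] if v)

    failures = total(TARGETS["FAILURE"])
    successes = total(TARGETS["SUCCESS"])
    print("failure rate, last 7 days: %.1f%%"
          % (100.0 * failures / (failures + successes)))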