16:00:48 <ihrachys> #startmeeting neutron_ci
16:00:49 <openstack> Meeting started Tue Feb 13 16:00:48 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:52 <openstack> The meeting name has been set to 'neutron_ci'
16:00:55 <mlavalle> o/
16:01:12 <slaweq> hi
16:01:16 <ihrachys> #topic Actions from prev meeting
16:01:23 <mlavalle> ihrachys: Kuba won't make it to the meeting
16:01:29 <ihrachys> ack
16:01:34 <ihrachys> thanks for the notice
16:01:35 <mlavalle> He went to hospital to see wife
16:01:51 <ihrachys> yeah he has better things to care about right now ;)
16:01:52 <mlavalle> and Sara
16:02:19 <ihrachys> first action item was "ihrachys report bug for -pg- failure to glance"
16:02:42 <ihrachys> that was rather optimistic, before we learned about OVO / facade issue
16:02:45 <ihrachys> so i didn't
16:02:50 <ihrachys> let me check if it's still relevant
16:03:57 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:04:20 <ihrachys> there was a 2d trench apparently on weekend
16:04:29 <ihrachys> don't we run periodics on weekends?
16:05:44 <ihrachys> ok the latest failure seems to be smth different than devstack failure we saw before:
16:05:45 <ihrachys> http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/legacy-periodic-tempest-dsvm-neutron-pg-full/894eb3b/job-output.txt.gz#_2018-02-13_07_13_23_633653
16:05:52 <ihrachys> test_get_service_by_service_and_host_name failed
16:06:02 <ihrachys> that's for compute admin api
16:07:09 <ihrachys> I will report this one instead
16:07:18 <ihrachys> #action ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:07:30 <ihrachys> then there was "slaweq to report bug for ovsfw job failure / sea of red in ovs agent logs"
16:07:38 <ihrachys> I believe this is largely contained now
16:07:50 <slaweq> so I reported bug: https://bugs.launchpad.net/neutron/+bug/1747709
16:07:51 <openstack> Launchpad bug 1747709 in neutron "neutron-tempest-ovsfw fails 100% times" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:08:11 <slaweq> and later also https://bugs.launchpad.net/neutron/+bug/1748546
16:08:12 <openstack> Launchpad bug 1748546 in neutron "ovsfw tempest tests random fails" [High,Fix committed] - Assigned to Slawek Kaplonski (slaweq)
16:08:25 <slaweq> and now patches for both are merged
16:08:37 <slaweq> https://review.openstack.org/#/c/542596/
16:08:55 <slaweq> adn https://review.openstack.org/#/c/542257/
16:08:57 <slaweq> *and
16:09:09 <ihrachys> ok, we need to land backports: https://review.openstack.org/#/q/I750224f7495aa46634bec1211599953cbbd4d1df
16:09:21 <ihrachys> and https://review.openstack.org/#/q/I6d917cbac61293e9a956a2efcd9f2b720e4cac95
16:09:31 <slaweq> with those 2 patches neutron-tempest-ovsfw job was green at least 5-6 times when I checked that during weekend
16:09:51 <mlavalle> yeap
16:10:01 <ihrachys> mlavalle, we'll need rc2 for those right?
16:10:07 <mlavalle> yes
16:10:08 <slaweq> yes, cherry-picks to queens are done already also
16:10:30 <mlavalle> those two are the only ones we have for RC2, so far
16:10:43 <ihrachys> slaweq, do we need earlier backports for those bugs?
16:10:49 <ihrachys> pike and ocata are still open
16:11:03 <slaweq> I'm not sure but I will check it ASAP
16:11:18 <mlavalle> with Pike we have more time
16:11:24 <mlavalle> Ocata is the urgent one
16:12:13 <slaweq> ihrachys: please assign "action" for me - I will check it after the meeting
16:12:39 <ihrachys> #action slaweq to backport ovsfw fixes to older stable branches
16:12:42 <slaweq> thx
16:12:53 <ihrachys> thanks for fixing the job! incredible.
16:13:03 <mlavalle> ++
16:13:05 <ihrachys> next was "ihrachys to look at linuxbridge scenario random failures when sshing to FIP"
16:13:14 <ihrachys> ok that's another one that fell through cracks
16:13:33 <ihrachys> and I don't have capacity for anything new / time consuming this week
16:14:11 <ihrachys> so maybe someone can take it over
16:14:28 <ihrachys> actually, we can probably check if it's still an important issue later in the meeting when we talk about scenarios
16:14:38 <mlavalle> let's do that
16:15:28 <ihrachys> #topic Grafana
16:15:28 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:21 <ihrachys> fullstack seems to be the only job that is above 50%
16:17:36 <slaweq> what is also good ;)
16:17:37 <ihrachys> well except ovsfw but it looks like it quickly goes down with its average
16:18:02 <slaweq> ovsfw patches was merged today so I think we should wait with this one some time
16:18:52 * mlavalle keeping fingers crossed
16:18:58 <ihrachys> right, but it's clear it's going down very quick. if that would be a Dow Jones chart we would see people jumping from windows.
16:19:10 <mlavalle> LOL
16:19:12 <slaweq> LOL
16:19:14 <mlavalle> like 1929
16:19:31 <slaweq> but here lower is better
16:19:40 <ihrachys> ok on serious note... seems like we want to talk about fullstack first thing
16:19:57 <ihrachys> and then move to scenarios if time allows for it
16:20:12 <ihrachys> #topic Fullstack
16:20:37 <ihrachys> before going through latest failures, I think it makes sense to check some patches
16:21:23 <ihrachys> specifically, https://review.openstack.org/#/c/539953/ enables test_dscp_marking_packets (it's actually already in gate so maybe we can just make a note it may resurface now and skip it)
16:21:44 <ihrachys> also those backports for the same test for older branches: https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:21:55 <slaweq> but I believe this test should works fine now
16:21:59 <ihrachys> I believe once those are in, we can also backport the re-enabling patch
16:22:14 <ihrachys> there are also those backports: https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149
16:23:17 <ihrachys> any other patches I missed for fullstack?
16:23:17 <slaweq> AFAIR https://review.openstack.org/#/q/Ib2d588d081a48f4f2b6e98a943bca95b9955a149 are required by https://review.openstack.org/#/q/Ia3522237dc787edb90d162ac4a5535ff5d2a03d5
16:24:44 <ihrachys> yeah I think I mentioned both
16:24:53 <mlavalle> yes you did
16:24:54 <slaweq> yes, just saying :)
16:24:55 <ihrachys> ok now looking at a latest run
16:24:55 <ihrachys> http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz
16:25:02 <ihrachys> test_controller_timeout_does_not_break_connectivity_sigterm and test_l2_agent_restart
16:25:06 <ihrachys> all old friends
16:25:24 <ihrachys> at some point I should have even looked at one of them but never did
16:25:55 <ihrachys> and as I said I am out of capacity for this week anyway
16:26:15 <slaweq> I can take a look on them
16:26:30 <ihrachys> I can't see neither reported in https://bugs.launchpad.net/neutron/+bugs?field.tag=fullstack
16:27:18 <ihrachys> slaweq,
16:27:20 <ihrachys> great
16:27:24 <slaweq> https://bugs.launchpad.net/neutron/+bug/1673531
16:27:25 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:27:26 <ihrachys> actually maybe the sigterm one is https://bugs.launchpad.net/neutron/+bug/1673531
16:27:29 <ihrachys> riiight
16:27:31 <slaweq> isn't this related?
16:27:35 <ihrachys> probably
16:27:39 <ihrachys> just different signal
16:28:08 <slaweq> I will take a look on them this week
16:28:28 <ihrachys> #action slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures
16:28:38 <ihrachys> slaweq, very cool, thanks
16:28:54 <slaweq> no problem
16:29:17 <ihrachys> checked another run, same failures: http://logs.openstack.org/01/542301/1/check/neutron-fullstack/d8f312e/logs/testr_results.html.gz
16:29:53 <ihrachys> and another failed run - same: http://logs.openstack.org/96/542596/6/check/neutron-fullstack/8db76b7/logs/testr_results.html.gz (just one failed though)
16:30:11 <ihrachys> it's clear that if those are tackled, it may get fullstack down to normal failure rate
16:30:32 <ihrachys> so it's great we'll have slaweq on those, meaning next week we'll have it green ;)
16:30:32 <slaweq> I hope so :)
16:30:55 <slaweq> I can always mark them as unstable to make it green :P
16:31:29 <ihrachys> true lol
16:32:02 <ihrachys> #topic Functional
16:32:29 <ihrachys> I am looking at functional job charts and it seems to work fine. actually showing weird 0% failure last several days.
16:32:54 <ihrachys> mlavalle, what would be the standard to hit to get it marked voting in check queue?
16:33:10 <ihrachys> for the record I look at http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:34:16 <mlavalle> good question
16:34:34 <mlavalle> 10% sounds good to me but I am shooting from the hip
16:34:47 <mlavalle> what do you think?
16:35:19 * ihrachys checking what's average for unit tests
16:35:47 <ihrachys> these are unit tests: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
16:36:37 <ihrachys> if you look at 30 days stats, you see average is somewhere between 10 and 15
16:37:10 <mlavalle> so 10% sounds good
16:37:14 <ihrachys> mlavalle, let's say we can hit 10%. how long should we experience it to consider voting?
16:37:30 <mlavalle> let's say a week
16:37:35 <mlavalle> with that average
16:37:49 <ihrachys> ok gotcha, and full *working* week I believe
16:37:56 <ihrachys> because weekends are not normal
16:37:56 <mlavalle> yes
16:38:35 <ihrachys> ok we are in the second day then right now. I will monitor it proactively then. ;)
16:38:44 <mlavalle> cool
16:39:08 <mlavalle> we can revisit net meeting
16:39:14 <mlavalle> next^^^
16:39:20 <ihrachys> ok
16:39:23 <ihrachys> #topic Scenarios
16:39:38 <ihrachys> the chart here: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:40:35 <ihrachys> ok. first, dvr one is 20-25%
16:40:48 <ihrachys> slaweq since the job uses ovsfw shouldn't we expect some stabilization there too?
16:41:12 <slaweq> I don't know exactly
16:41:25 <slaweq> do You have some example failure?
16:42:00 <ihrachys> no not yet. we'll look at them later.
16:42:12 <ihrachys> as for linuxbridge job, it's 25-30% so slightly worse
16:42:34 <ihrachys> hence first looking at linuxbridge
16:43:07 <ihrachys> example failure: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/testr_results.html.gz
16:43:37 <ihrachys> all ssh failures using fip
16:45:35 <ihrachys> since this job is not dvr / ha, routers are legacy
16:45:49 <ihrachys> so we should e.g. see arpings and iptables-restore and 'ip addr add' in logs
16:47:19 <ihrachys> this is where the fip address is configured on l3agent side: http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-l3.txt.gz#_Feb_13_15_11_35_940199
16:48:27 <ihrachys> so the address is configured in namespace; arping is executed; tempest should know about correct location of the address
16:48:51 <slaweq> and iptables-save is also done just after so NAT should be also done
16:48:56 <haleyb> there's no errors in the l2 agent log either
16:49:02 <slaweq> maybe there is some issue with SG then?
16:50:13 <ihrachys> hm possible
16:50:23 <ihrachys> and also instances booted fine and reached metadata
16:50:23 <haleyb> the only thing that merged recently was the ARP change i did maybe, moved things to PREROUTING chain
16:50:32 <ihrachys> (it's seen in one of test cases that logs console)
16:50:54 <haleyb> slaweq: right, unless its SG
16:51:20 <ihrachys> slaweq, wouldn't it be Connection refused if SG closed the port?
16:51:58 <slaweq> I don't think so
16:52:03 <ihrachys> there it seems it just hangs there until timeout
16:55:18 <ihrachys> haleyb, so one thing that pops up is this message:
16:55:21 <ihrachys> http://logs.openstack.org/07/492107/9/check/neutron-tempest-plugin-scenario-linuxbridge/086f728/logs/screen-q-agt.txt.gz?#_Feb_13_15_19_22_794763
16:55:32 <ihrachys> Tried to remove rule that was not there: 'POSTROUTING' u'-m physdev --physdev-in tap967365b4-ce --physdev-is-bridged -j neutron-linuxbri-qos-o967365'
16:55:48 <slaweq> I think there is already patch for that
16:55:54 <ihrachys> or is it the red herring that slaweq has a patch for
16:56:03 <ihrachys> https://review.openstack.org/#/c/539922/ ?
16:56:05 <slaweq> haleyb did this patch I think
16:56:05 <haleyb> yes, there is, red herring i believe
16:56:14 <ihrachys> ah wait that's db side
16:56:29 <ihrachys> https://review.openstack.org/#/c/541379/ ?
16:56:35 <slaweq> yes
16:56:41 <slaweq> You were faster :)
16:57:03 <haleyb> and that passed the scenario test
16:57:20 <ihrachys> ok I nudged it in gate
16:57:31 <ihrachys> but do we think it could break anything?
16:58:30 <slaweq> this warning? I don't think so
16:58:36 <ihrachys> ok
16:58:41 <ihrachys> so there is something else
16:58:49 <ihrachys> and we are almost out of time
16:58:52 <slaweq> IMHO yes
16:59:04 <slaweq> *yes, there is something else :)
16:59:44 <ihrachys> this issue doesn't have obvious leads in logs so one thing to consider is maybe we should expand logging
16:59:50 <mlavalle> ihrachys: I don't have much time this week, but if nobody else has, I'll try to take a stab at it
17:00:11 <slaweq> I will probably be busy with debugging fullstack tests
17:00:13 <ihrachys> one thing I noticed is most tests that failed don't log console on failure. that could be worth expanding on, not just for this issue.
17:00:24 <ihrachys> mlavalle, please do!
17:00:31 <mlavalle> ok
17:00:35 <ihrachys> #action mlavalle to look into linuxbridge ssh timeout failures
17:00:40 <ihrachys> thanks folks for joining
17:00:42 <ihrachys> #endmeeting
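
Editor's note: the checks ihrachys and slaweq walk through around 16:45-16:48 (floating IP added in the legacy qrouter namespace, arping sent, NAT rules written via iptables-save) can be reproduced by hand roughly as in the sketch below. This is an illustrative sketch only, not something run or agreed in the meeting; the router UUID and floating IP are hypothetical placeholders, and the commands assume a devstack-style node with sudo access to the network namespaces.

```python
#!/usr/bin/env python3
"""Rough sketch: verify a floating IP is plumbed into a legacy router
namespace the way the l3-agent log above suggests it should be."""
import subprocess

ROUTER_ID = "ROUTER-UUID"      # placeholder: UUID of the legacy router
FLOATING_IP = "172.24.5.10"    # placeholder: FIP the test tries to ssh to
NAMESPACE = f"qrouter-{ROUTER_ID}"


def in_namespace(*cmd):
    """Run a command inside the router namespace and return its stdout."""
    return subprocess.run(
        ["sudo", "ip", "netns", "exec", NAMESPACE, *cmd],
        capture_output=True, text=True, check=True,
    ).stdout


# 1. The FIP should appear on an interface in the namespace, matching the
#    'ip addr add' call seen in the l3-agent log.
addresses = in_namespace("ip", "-4", "addr", "show")
print("FIP configured in namespace:", FLOATING_IP in addresses)

# 2. The nat table should carry DNAT/SNAT rules for the FIP, matching the
#    iptables-save call made right after the address is configured.
nat_rules = in_namespace("iptables-save", "-t", "nat")
print("NAT rules reference FIP:", FLOATING_IP in nat_rules)
```

When both checks pass but ssh still hangs until timeout, as in the runs discussed above, the remaining suspects raised in the meeting are security groups and the guest itself, which is why logging the instance console on failure was flagged as a follow-up.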