16:00:58 <ihrachys> #startmeeting neutron_ci
16:00:59 <openstack> Meeting started Tue Sep 19 16:00:58 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:00 <jlibosva> o/
16:01:01 <ihrachys> hello my friends
16:01:03 <openstack> The meeting name has been set to 'neutron_ci'
16:01:09 <ihrachys> jlibosva, o/
16:01:37 <ihrachys> ok, that's a tight meeting company we have here ;)
16:01:54 <ihrachys> haleyb seems to be offline
16:02:00 <jlibosva> he just disconnected
16:02:06 <jlibosva> I hope he'll be back
16:02:36 <jlibosva> there he is
16:02:38 <ihrachys> here he is
16:02:40 <ihrachys> :)
16:02:44 <ihrachys> haleyb, o/
16:02:49 <ihrachys> #topic Actions from prev week
16:02:57 <ihrachys> "jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate"
16:03:09 <jlibosva> I didn't because I forgot
16:03:16 <jlibosva> I'll do it this week
16:03:22 <ihrachys> #action jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate
16:03:28 <haleyb> hi there, irc-proxy was having problems
16:03:37 <ihrachys> "haleyb to figure out the way forward for grenade/dvr gate"
16:03:48 <ihrachys> I believe the grenade job is largely back to normal now?
16:04:05 <ihrachys> the bug being https://bugs.launchpad.net/neutron/+bug/1713927
16:04:06 <openstack> Launchpad bug 1713927 in neutron "gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial fails constantly" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
16:04:09 <haleyb> i just rebooted, but i remember it was at ~20%
16:04:23 <ihrachys> haleyb, you mentioned in the bug there is still a fix to land
16:04:25 <ihrachys> on the server side
16:04:31 <ihrachys> can you post it in LP?
16:04:50 <haleyb> yes, it needs a quick re-spin to address a comment
16:04:53 <ihrachys> I am at a loss with all those dvr fixes you have with Swami :)
16:05:11 <ihrachys> haleyb, is it required to fix the gate failure?
16:05:15 <ihrachys> or is it just nice to have?
16:05:42 <haleyb> it's required to fix the server side, we only worked around it in the agent
16:06:15 <ihrachys> ok. I remember we made the job non-voting. I assume we will get it back after the server side fix?
16:07:04 <haleyb> i hope so, if the failure rate is still good
16:07:18 <ihrachys> I recollect we had a revert for that somewhere.
16:07:25 <ihrachys> but now I fail to find it
16:07:34 <jlibosva> I don't think we do
16:07:57 <haleyb> actually the dvr-multinode job is at 0%, it's the regular multinode job that's at 20% failure
16:08:13 <jlibosva> this is the original - https://review.openstack.org/#/c/500567/
16:08:39 <ihrachys> ok, I created a revert for tracking purposes
16:09:01 <jlibosva> haleyb: which could also mean we're not collecting data correctly, 0% failure is always suspicious to me since we normally have infra issues or catch regression failures etc.
16:09:06 <ihrachys> haleyb, hm... should we break it back so that we are on the same failure rate? :p
16:09:25 <haleyb> what, you don't believe DVR is better? :-p
16:09:47 <ihrachys> it is. it's just that the baseline we compare with was always rather low. :p
16:10:11 <ihrachys> anyhow...
16:10:19 <ihrachys> these were all the AIs we had
16:10:27 <ihrachys> #topic Grafana
16:10:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:11:03 <ihrachys> the api job is at a 20% failure rate on the 7-day sliding window chart
16:11:07 <ihrachys> not sure what causes that
16:11:15 <ihrachys> could it be the skip_checks issue?
16:11:24 <ihrachys> I mean this: https://review.openstack.org/#/q/I1c0902e3c06886812029fae0e4435bb6674f57df
16:11:43 <ihrachys> I believe the charts collect data from all branches, so we may still see those failures there if they happen in stable
16:12:38 <ihrachys> apart from that, it's the usual suspects - fullstack and scenarios - that we still have at 100%. let's deal with them one by one.
16:12:42 <ihrachys> #topic Fullstack failures
16:13:06 <ihrachys> I believe we made some significant progress lately with fixing and triaging failures
16:13:40 <ihrachys> I believe the main issue causing the 100% failure rate right now is the one where test_trunk_lifecycle fails because ovs agents clean up ports that don't belong to them
16:13:51 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687709
16:13:52 <openstack> Launchpad bug 1687709 in neutron "fullstack: ovs-agents remove trunk bridges that don't belong to them" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:14:02 <ihrachys> and armax had a WIP patch for that
16:14:16 <ihrachys> https://review.openstack.org/#/c/504186/
16:15:10 <ihrachys> the idea there is to introduce a test-only option that will set a prefix for agent ports
16:15:20 <ihrachys> then configure each fullstack test case with a unique prefix
16:15:32 <ihrachys> then the ovs agent would filter those ports by the prefix
16:15:40 <ihrachys> not ideal, but should be an easy fix
16:15:55 <ihrachys> and then we can follow up with filtering them at the ovsdb level, or something else
16:16:29 <ihrachys> jlibosva, with that patch, do we still want to follow up?
16:16:31 <jlibosva> I just don't like that we're having a "monkey patch hack" in one agent and a "config hack" in other agents. I'd rather have it unified
16:17:00 <jlibosva> and move towards removing those hacks long term, but the goal of the idea is neat
16:17:11 <ihrachys> jlibosva, that's not ideal. but do you agree testing the actual production code is better than monkey patching the agent?
16:17:25 <jlibosva> imho it's moot
16:17:56 <ihrachys> moot point?
16:18:02 <ihrachys> as in - long term both are bad?
16:18:49 <jlibosva> no, I mean whether you have code that's parametrized and never used that way in production, or you patch the code - it's still the same
16:18:53 <jlibosva> just written differently
16:19:30 <jlibosva> but if we get rid of monkey patching and replace it with non-production config values, I'm good. But in my opinion it's the same as monkey patching
16:20:57 <ihrachys> jlibosva, I remember I had complications with monkey patched agents when switching to rootwrap, because I needed to configure rootwrap to allow those dirs in exec_dirs
16:21:05 <ihrachys> of course now it's solved somewhat
16:21:14 <ihrachys> but it left a bitter taste in my mouth
16:21:16 <ihrachys> :)
16:21:18 <ihrachys> ok
16:21:29 <ihrachys> so to the question of a follow up - is it smth we want to still track?
16:21:48 <ihrachys> or will we pretend we can live with the hacks?
16:21:53 <jlibosva> is it the last issue? I thought we still have some l3 east-west failing IIRC
16:22:06 <ihrachys> there are some issues that show up once in a while
16:22:08 <jlibosva> or do you mean to track the isolation stuff that I was working on?
16:22:21 <ihrachys> I was hoping that we tackle this one to finally have a reasonable chart that is not always at 100%
16:22:30 <ihrachys> and then we can meaningfully assess progress
16:22:33 <ihrachys> but yes, there are others
16:22:38 <jlibosva> sounds good
16:22:48 <jlibosva> btw the trunk failure shouldn't be 100%, it's a race condition
16:22:54 <ihrachys> re follow up, I meant the isolation and/or the filtering of ports at the ovsdb query level
16:23:04 <ihrachys> some decent fix that would allow us to kill the test option
16:23:23 <ihrachys> jlibosva, sometimes the job rate falls to 95%, yes :p
16:23:26 <jlibosva> I'll keep it in my backlog
16:23:35 <jlibosva> ah, nice :) I like progress
16:24:06 <ihrachys> ok, good. it's a nice thing to have, but I believe not a critical thing right now, especially considering all the other priorities we tend to have
16:24:23 <ihrachys> another bug that affected the job was https://bugs.launchpad.net/neutron/+bug/1717582
16:24:24 <openstack> Launchpad bug 1717582 in neutron "fullstack job failing to create namespace because it's already exists" [Undecided,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:24:24 <jlibosva> agreed
16:24:34 <ihrachys> and slaweq has a fix: https://review.openstack.org/#/c/503890/
16:25:09 <ihrachys> as we learned, netlink is async and hence doesn't guarantee that the namespace is present right after a netns add
16:25:12 <ihrachys> so we need to spin
16:26:18 <ihrachys> I need to get back to the patch, seems like slaweq has reasonable replies to my concerns
16:27:19 <jlibosva> thomas has ideas on how to avoid the race
16:27:35 <jlibosva> it's in PS6
16:27:51 <ihrachys> ok, will check
16:28:31 <ihrachys> (looking through the list of fullstack bugs) one other thing I had for the suite is switching to using SIGTERM instead of SIGKILL for all services: https://bugs.launchpad.net/neutron/+bug/1487548
16:28:32 <openstack> Launchpad bug 1487548 in neutron "fullstack infrastructure tears down processes via kill -9" [Low,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:28:36 <ihrachys> I have https://review.openstack.org/#/c/499803/
16:28:53 <ihrachys> but I need to rework it so that if SIGTERM doesn't kill the process within a minute, we SIGKILL it
16:29:38 <ihrachys> anything else on fullstack?
16:31:12 <jlibosva> nope
16:31:54 <ihrachys> #topic Scenarios
16:32:12 <ihrachys> jlibosva, I think we were making some progress there too?
16:32:35 <ihrachys> afaik there were two fronts - one the dvr fixes from haleyb and Swami, and another the router migrations from anilvenkata
16:32:37 <jlibosva> I think only Anil has some patches for router migrations
16:32:50 <ihrachys> would be nice to have a list of things we believe are related
16:33:16 <ihrachys> jlibosva, afaiu some dvr fixes for grenade were effectively also helping scenarios
16:33:57 <jlibosva> cool, two birds with one stone
16:33:59 <haleyb> Swami just sent out https://review.openstack.org/#/c/505324 as well, as he noticed there are some edge cases still broken
16:34:27 <haleyb> i think he had opened a new bug for that, will check
16:35:01 <ihrachys> the flow of dvr fixes both encourages and scares me
16:35:08 <ihrachys> it's encouraging that we fix that
16:35:18 <ihrachys> but... it's like... broken since forever?
16:35:54 <haleyb> router migrations are complicated. this is new since the dvr_snat_bound code merged in pike
16:36:10 <ihrachys> jlibosva, haleyb do you think it would make sense to have a list of scenario-related fixes somewhere so that we can prioritize them somehow?
16:36:20 <jlibosva> I can take an AI to do that
16:36:37 <ihrachys> #action jlibosva to prepare a list of scenario related fixes
16:36:40 <ihrachys> jlibosva, thanks!
16:36:44 <haleyb> yes, either a single bug and/or topic i guess
16:37:00 <ihrachys> haleyb, ^ please help Jakub with that, I believe you have a good grasp of the dvr side of things
16:37:09 <haleyb> will do
16:37:44 <jlibosva> maybe we could create a short-term LP tag to get the list quickly
16:37:55 <ihrachys> jlibosva, apart from that, are we aware of any other fixes? how close would the fixes we have in the pipeline get us in terms of failure rate?
16:38:14 <ihrachys> jlibosva, good idea. we can create a new tag, no need to have a permanent one
16:38:58 <jlibosva> there is a fix for remote security groups that iwamoto is working on
16:39:12 <ihrachys> https://review.openstack.org/#/c/492404/ ?
16:39:14 <jlibosva> https://review.openstack.org/#/c/492404/
16:39:15 <jlibosva> yep
16:39:26 <jlibosva> this should also bring some peace
16:39:39 <ihrachys> I thought it was just a perf optimization?
16:40:20 <ihrachys> ok I see in the commit message: "that filtering
16:40:21 <ihrachys> are correctly performed"
16:40:27 <ihrachys> so I guess it's functional too
16:40:29 <jlibosva> no, it's a new regression caused by conjunctions
16:40:52 <jlibosva> also slaweq has a patch for qos for better logging in the test: https://review.openstack.org/#/c/491244/
16:40:53 <ihrachys> jlibosva, does it affect pike+?
16:41:28 <ihrachys> I mean the conjunctions
16:41:38 <jlibosva> yes
16:41:43 <jlibosva> it's been merged to pike
16:42:33 <ihrachys> another regression, good
16:42:36 <ihrachys> :)
16:42:47 <ihrachys> but it seems like we are not close to complete there?
16:42:55 <ihrachys> so I guess I will need to release .1 without it
16:43:22 <ihrachys> jlibosva, re the qos test patch, my concern was that it hides the issue. is that correct?
16:43:37 <ihrachys> it retries over and over. shouldn't we expect it to work correctly?
16:43:52 <ihrachys> instead it hangs in the middle
16:44:23 <jlibosva> yep, we could have some mechanism to retry a few times
16:44:38 <jlibosva> but more importantly, it adds some logging messages that could reveal more information about the issue
16:46:17 <ihrachys> jlibosva, but can't we raise after logging?
16:46:26 <jlibosva> we can :)
16:46:31 <ihrachys> otherwise it seems like the test will pass if it works after e.g. the 5th attempt
16:46:39 <jlibosva> or make a loop
16:46:47 <jlibosva> aah
16:46:48 <jlibosva> I see
16:47:11 <ihrachys> it can also loop indefinitely now
16:47:25 <ihrachys> before, the timeout would bubble up to the test runner
16:47:28 <ihrachys> now it's swallowed
16:47:32 <jlibosva> I was thinking about trying several times, but you're right, failing the test would make more sense
16:47:49 <jlibosva> and we should log the exception too
16:48:30 <ihrachys> ok, let's follow up with comments there then
16:48:41 <ihrachys> logging is a good idea, it's just the execution that scared me
16:49:40 <ihrachys> #topic Open discussion
16:49:59 <ihrachys> we hit https://launchpad.net/bugs/1717046 last week
16:50:01 <openstack> Launchpad bug 1717046 in neutron "L3HARouterVRIdAllocationDbObjectTestCase.test_delete_objects fails because of duplicate record" [Medium,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:50:08 <ihrachys> I worked around it with a fix for now
16:50:15 <ihrachys> but the root issue in the test framework is still there
16:50:21 <ihrachys> I have this attempt: https://review.openstack.org/#/c/503854/
16:50:25 <ihrachys> but it will require some more work
16:50:54 <ihrachys> also, Genadi Ch sent a new scenario for sec groups here: https://review.openstack.org/#/c/504021/
16:51:15 <ihrachys> and it seems like he was able to trigger a sqlalchemy error and an error 500 with it
16:51:25 <ihrachys> he reported a bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1493175 (it's rhbz, not lp)
16:51:26 <openstack> bugzilla.redhat.com bug 1493175 in openstack-neutron "Update of VM port to have different number of security groups fails with Error 500" [High,New] - Assigned to amuller
16:51:44 <ihrachys> we will need to have a look, it may be the reason for some scenario failures
16:51:47 <ihrachys> and it's in the db layer.
16:51:56 <ihrachys> it fails on a refresh(port_db) call somewhere
16:52:10 <ihrachys> so it probably affects everyone, it's not specific to a backend
16:52:50 <ihrachys> also, to recap the ptg discussions, the dvr folks were planning to adopt the fullstack suite for testing different agent deployment modes
16:53:02 <ihrachys> I assume haleyb will follow up with Swami on that one
16:53:10 <ihrachys> that's all I have
16:53:17 <ihrachys> anything else to discuss?
16:53:41 <ihrachys> haleyb, jlibosva
16:53:42 <jlibosva> not from me
16:53:56 <haleyb> nothing here
16:54:14 <ihrachys> good. thanks for joining. we should have progress next time we meet. green gate future ahead.
16:54:16 <ihrachys> o/
16:54:18 <ihrachys> #endmeeting