16:00:58 <ihrachys> #startmeeting neutron_ci
16:00:59 <openstack> Meeting started Tue Sep 19 16:00:58 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:00 <jlibosva> o/
16:01:01 <ihrachys> hello my friends
16:01:03 <openstack> The meeting name has been set to 'neutron_ci'
16:01:09 <ihrachys> jlibosva, o/
16:01:37 <ihrachys> ok, that's a tight meeting company we have here ;)
16:01:54 <ihrachys> haleyb seems to be offline
16:02:00 <jlibosva> he just disconnected
16:02:06 <jlibosva> I hope he'll be back
16:02:36 <jlibosva> there he is
16:02:38 <ihrachys> here he is
16:02:40 <ihrachys> :)
16:02:44 <ihrachys> haleyb, o/
16:02:49 <ihrachys> #topic Actions from prev week
16:02:57 <ihrachys> "jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate"
16:03:09 <jlibosva> I didn't because I forgot
16:03:16 <jlibosva> I'll do it this week
16:03:22 <ihrachys> #action jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate
16:03:28 <haleyb> hi there, irc-proxy was having problems
16:03:37 <ihrachys> "haleyb to figure out the way forward for grenade/dvr gate"
16:03:48 <ihrachys> I believe the grenade job is largely back to normal now?
16:04:05 <ihrachys> the bug being https://bugs.launchpad.net/neutron/+bug/1713927
16:04:06 <openstack> Launchpad bug 1713927 in neutron "gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial fails constantly" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
16:04:09 <haleyb> i just rebooted, but i remember it was at ~20%
16:04:23 <ihrachys> haleyb, you mentioned in the bug there is still a fix to land
16:04:25 <ihrachys> on the server side
16:04:31 <ihrachys> can you post it in LP?
16:04:50 <haleyb> yes, it needs a quick re-spin to address a comment
16:04:53 <ihrachys> I am at a loss with all those dvr fixes you have with Swami :)
16:05:11 <ihrachys> haleyb, is it required to fix the gate failure?
16:05:15 <ihrachys> or is it just nice to have?
16:05:42 <haleyb> it's required to fix the server side, we only worked around it in the agent
16:06:15 <ihrachys> ok. I remember we made the job non-voting. I assume we will get it back after the server side fix?
16:07:04 <haleyb> i hope so, if the failure rate is still good
16:07:18 <ihrachys> I recollect we had a revert for that somewhere.
16:07:25 <ihrachys> but now I fail to find it
16:07:34 <jlibosva> I don't think we do
16:07:57 <haleyb> actually the dvr-multinode job is at 0%, it's the regular multinode job that's at 20% failure
16:08:13 <jlibosva> this is the original - https://review.openstack.org/#/c/500567/
16:08:39 <ihrachys> ok, I created a revert for tracking purposes
16:09:01 <jlibosva> haleyb: which could also mean we're not collecting data correctly, 0% failure is always suspicious to me since we normally have infra issues or catch regression failures etc.
16:09:06 <ihrachys> haleyb, hm... should we break it back so that we are on the same failure rate? :p
16:09:25 <haleyb> what, you don't believe DVR is better? :-p
16:09:47 <ihrachys> it is. it's just that the baseline we compare with was always rather low. :p
16:10:11 <ihrachys> anyhow...
16:10:19 <ihrachys> these were all the AIs we had
16:10:27 <ihrachys> #topic Grafana
16:10:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:11:03 <ihrachys> the api job is at a 20% failure rate on the 7-day sliding window chart
16:11:07 <ihrachys> not sure what causes that
16:11:15 <ihrachys> could it be the skip_checks issue?
16:11:24 <ihrachys> I mean this: https://review.openstack.org/#/q/I1c0902e3c06886812029fae0e4435bb6674f57df
16:11:43 <ihrachys> I believe the charts collect data from all branches, so we may still see those failures there if they happen in stable
16:12:38 <ihrachys> apart from that, it's the usual suspects - fullstack and scenarios - that we still have at 100%. let's deal with them one by one.
16:12:42 <ihrachys> #topic Fullstack failures
16:13:06 <ihrachys> I believe we made some significant progress lately with fixing and triaging failures
16:13:40 <ihrachys> I believe the main issue causing the 100% failure rate right now is the one where test_trunk_lifecycle fails because ovs agents clean up ports that don't belong to them
16:13:51 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687709
16:13:52 <openstack> Launchpad bug 1687709 in neutron "fullstack: ovs-agents remove trunk bridges that don't belong to them" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:14:02 <ihrachys> and armax had a WIP patch for that
16:14:16 <ihrachys> https://review.openstack.org/#/c/504186/
16:15:10 <ihrachys> the idea there is to introduce a test-only option that will set a prefix for agent ports
16:15:20 <ihrachys> then configure each fullstack test case with a unique prefix
16:15:32 <ihrachys> then the ovs agent would filter those ports by the prefix
16:15:40 <ihrachys> not ideal, but should be an easy fix
16:15:55 <ihrachys> and then we can follow up with filtering them at the ovsdb level, or something else
16:16:29 <ihrachys> jlibosva, with that patch, do we still want to follow up?
16:16:31 <jlibosva> I just don't like that we're having a "monkey patch hack" in one agent and a "config hack" in other agents. I'd rather have it unified
16:17:00 <jlibosva> and move towards removing those hacks long term, but the goal of the idea is neat
16:17:11 <ihrachys> jlibosva, that's not ideal. but do you agree testing the actual production code is better than monkey patching the agent?
16:17:25 <jlibosva> imho it's moot
16:17:56 <ihrachys> moot point?
16:18:02 <ihrachys> as in - long term both are bad?
16:18:49 <jlibosva> no, I mean whether you have code that's parametrized and never used that way in production, or you patch the code - it's still the same
16:18:53 <jlibosva> just written differently
16:19:30 <jlibosva> but if we get rid of monkey patching and replace it with non-production config values, I'm good. But in my opinion it's the same as monkey patching
16:20:57 <ihrachys> jlibosva, I remember I had complications with monkey patched agents when switching to rootwrap, because I needed to configure rootwrap to allow those dirs in exec_dirs
16:21:05 <ihrachys> of course now it's solved somewhat
16:21:14 <ihrachys> but it left a bitter taste in my mouth
16:21:16 <ihrachys> :)
16:21:18 <ihrachys> ok
16:21:29 <ihrachys> so to the question of a follow up - is it smth we want to still track?
16:21:48 <ihrachys> or will we pretend we can live with the hacks?
16:21:53 <jlibosva> is it the last issue? I thought we still have some l3 east-west failing IIRC
16:22:06 <ihrachys> there are some issues that show up once in a while
16:22:08 <jlibosva> or do you mean to track the isolation stuff that I was working on?
16:22:21 <ihrachys> I was hoping that we tackle this one to finally have a reasonable chart that is not always at 100%
16:22:30 <ihrachys> and then we can meaningfully assess progress
16:22:33 <ihrachys> but yes, there are others
16:22:38 <jlibosva> sounds good
16:22:48 <jlibosva> btw the trunk failure shouldn't be 100%, it's a race condition
16:22:54 <ihrachys> re follow up, I meant the isolation and/or the filtering of ports at the ovsdb query level
16:23:04 <ihrachys> some decent fix that would allow us to kill the test option
16:23:23 <ihrachys> jlibosva, sometimes the job rate falls to 95%, yes :p
16:23:26 <jlibosva> I'll keep it in my backlog
16:23:35 <jlibosva> ah, nice :) I like progress
16:24:06 <ihrachys> ok, good. it's a nice thing to have, but I believe not a critical thing right now, especially considering all the other priorities we tend to have
16:24:23 <ihrachys> another bug that affected the job was https://bugs.launchpad.net/neutron/+bug/1717582
16:24:24 <openstack> Launchpad bug 1717582 in neutron "fullstack job failing to create namespace because it's already exists" [Undecided,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:24:24 <jlibosva> agreed
16:24:34 <ihrachys> and slaweq has a fix: https://review.openstack.org/#/c/503890/
16:25:09 <ihrachys> as we learned, netlink is async and hence doesn't guarantee that the namespace is present right after a netns add
16:25:12 <ihrachys> so we need to spin
16:26:18 <ihrachys> I need to get back to the patch, seems like slaweq has reasonable replies to my concerns
16:27:19 <jlibosva> thomas has ideas on how to avoid the race
16:27:35 <jlibosva> it's in PS6
16:27:51 <ihrachys> ok, will check
16:28:31 <ihrachys> (looking through the list of fullstack bugs) one other thing I had for the suite is switching to using SIGTERM instead of SIGKILL for all services: https://bugs.launchpad.net/neutron/+bug/1487548
16:28:32 <openstack> Launchpad bug 1487548 in neutron "fullstack infrastructure tears down processes via kill -9" [Low,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:28:36 <ihrachys> I have https://review.openstack.org/#/c/499803/
16:28:53 <ihrachys> but I need to rework it so that if SIGTERM doesn't kill the process within a minute, we SIGKILL it
16:29:38 <ihrachys> anything else on fullstack?
16:31:12 <jlibosva> nope
16:31:54 <ihrachys> #topic Scenarios
16:32:12 <ihrachys> jlibosva, I think we were making some progress there too?
16:32:35 <ihrachys> afaik there were two fronts - one the dvr fixes from haleyb and Swami, and another the router migrations from anilvenkata
16:32:37 <jlibosva> I think only Anil has some patches for router migrations
16:32:50 <ihrachys> would be nice to have a list of things we believe are related
16:33:16 <ihrachys> jlibosva, afaiu some dvr fixes for grenade were effectively also helping scenarios
16:33:57 <jlibosva> cool, two birds with one stone
16:33:59 <haleyb> Swami just sent out https://review.openstack.org/#/c/505324 as well, as he noticed there are some edge cases still broken
16:34:27 <haleyb> i think he had opened a new bug for that, will check
16:35:01 <ihrachys> the flow of dvr fixes both encourages and scares me
16:35:08 <ihrachys> it's encouraging that we fix that
16:35:18 <ihrachys> but... it's like... broken since forever?
16:35:54 <haleyb> router migrations are complicated. this is new since the dvr_snat_bound code merged in pike
16:36:10 <ihrachys> jlibosva, haleyb do you think it would make sense to have a list of scenario-related fixes somewhere so that we can prioritize them somehow?
16:36:20 <jlibosva> I can take an AI to do that
16:36:37 <ihrachys> #action jlibosva to prepare a list of scenario related fixes
16:36:40 <ihrachys> jlibosva, thanks!
16:36:44 <haleyb> yes, either a single bug and/or topic i guess
16:37:00 <ihrachys> haleyb, ^ please help Jakub with that, I believe you have a good grasp of the dvr side of things
16:37:09 <haleyb> will do
16:37:44 <jlibosva> maybe we could create a short-term LP tag to get the list quickly
16:37:55 <ihrachys> jlibosva, apart from that, are we aware of any other fixes? how close would the fixes we have in the pipeline get us in terms of failure rate?
16:38:14 <ihrachys> jlibosva, good idea. we can create a new tag, no need to have a permanent one
16:38:58 <jlibosva> there is a fix for remote security groups that iwamoto is working on
16:39:12 <ihrachys> https://review.openstack.org/#/c/492404/ ?
16:39:14 <jlibosva> https://review.openstack.org/#/c/492404/
16:39:15 <jlibosva> yep
16:39:26 <jlibosva> this should also bring some peace
16:39:39 <ihrachys> I thought it was just a perf optimization?
16:40:20 <ihrachys> ok I see in the commit message: "that filtering
16:40:21 <ihrachys> are correctly performed"
16:40:27 <ihrachys> so I guess it's functional too
16:40:29 <jlibosva> no, it's a new regression caused by conjunctions
16:40:52 <jlibosva> also slaweq has a patch for qos for better logging in the test: https://review.openstack.org/#/c/491244/
16:40:53 <ihrachys> jlibosva, does it affect pike+?
16:41:28 <ihrachys> I mean the conjunctions
16:41:38 <jlibosva> yes
16:41:43 <jlibosva> it's been merged to pike
16:42:33 <ihrachys> another regression, good
16:42:36 <ihrachys> :)
16:42:47 <ihrachys> but it seems like we are not close to complete there?
16:42:55 <ihrachys> so I guess I will need to release .1 without it
16:43:22 <ihrachys> jlibosva, re the qos test patch, my concern was that it hides the issue. is that correct?
16:43:37 <ihrachys> it retries over and over. shouldn't we expect it to work correctly?
16:43:52 <ihrachys> instead it hangs in the middle
16:44:23 <jlibosva> yep, we could have some mechanism to retry a few times
16:44:38 <jlibosva> but more importantly, it adds some logging messages that could reveal more information about the issue
16:46:17 <ihrachys> jlibosva, but can't we raise after logging?
16:46:26 <jlibosva> we can :)
16:46:31 <ihrachys> otherwise it seems like the test will pass if it works after e.g. the 5th attempt
16:46:39 <jlibosva> or make a loop
16:46:47 <jlibosva> aah
16:46:48 <jlibosva> I see
16:47:11 <ihrachys> it can also loop indefinitely now
16:47:25 <ihrachys> before, the timeout would bubble up to the test runner
16:47:28 <ihrachys> now it's swallowed
16:47:32 <jlibosva> I was thinking about trying several times, but you're right, failing the test would make more sense
16:47:49 <jlibosva> and we should log the exception too
16:48:30 <ihrachys> ok, let's follow up with comments there then
16:48:41 <ihrachys> logging is a good idea, it's just the execution that scared me
16:49:40 <ihrachys> #topic Open discussion
16:49:59 <ihrachys> we hit https://launchpad.net/bugs/1717046 last week
16:50:01 <openstack> Launchpad bug 1717046 in neutron "L3HARouterVRIdAllocationDbObjectTestCase.test_delete_objects fails because of duplicate record" [Medium,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:50:08 <ihrachys> I worked around it with a fix for now
16:50:15 <ihrachys> but the root issue in the test framework is still there
16:50:21 <ihrachys> I have this attempt: https://review.openstack.org/#/c/503854/
16:50:25 <ihrachys> but it will require some more work
16:50:54 <ihrachys> also, Genadi Ch sent a new scenario for sec groups here: https://review.openstack.org/#/c/504021/
16:51:15 <ihrachys> and it seems like he was able to trigger a sqlalchemy error and an error 500 with it
16:51:25 <ihrachys> he reported a bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1493175 (it's rhbz, not lp)
16:51:26 <openstack> bugzilla.redhat.com bug 1493175 in openstack-neutron "Update of VM port to have different number of security groups fails with Error 500" [High,New] - Assigned to amuller
16:51:44 <ihrachys> we will need to have a look, it may be the reason for some scenario failures
16:51:47 <ihrachys> and it's in the db layer.
16:51:56 <ihrachys> it fails on a refresh(port_db) call somewhere
16:52:10 <ihrachys> so it probably affects everyone, it's not specific to a backend
16:52:50 <ihrachys> also, to recap the ptg discussions, the dvr folks were planning to adopt the fullstack suite for testing different agent deployment modes
16:53:02 <ihrachys> I assume haleyb will follow up with Swami on that one
16:53:10 <ihrachys> that's all I have
16:53:17 <ihrachys> anything else to discuss?
16:53:41 <ihrachys> haleyb, jlibosva
16:53:42 <jlibosva> not from me
16:53:56 <haleyb> nothing here
16:54:14 <ihrachys> good. thanks for joining. we should have progress next time we meet. green gate future ahead.
16:54:16 <ihrachys> o/
16:54:18 <ihrachys> #endmeeting