16:00:58 #startmeeting neutron_ci
16:00:59 Meeting started Tue Sep 19 16:00:58 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:00 o/
16:01:01 hello my friends
16:01:03 The meeting name has been set to 'neutron_ci'
16:01:09 jlibosva, o/
16:01:37 ok, that's a tight meeting company we have here ;)
16:01:54 haleyb seems to be offline
16:02:00 he just disconnected
16:02:06 I hope he'll be back
16:02:36 there he is
16:02:38 here he is
16:02:40 :)
16:02:44 haleyb, o/
16:02:49 #topic Actions from prev week
16:02:57 "jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate"
16:03:09 I didn't because I forgot
16:03:16 I'll do this week
16:03:22 #action jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate
16:03:28 hi there, irc-proxy was having problems
16:03:37 "haleyb to figure out the way forward for grenade/dvr gate"
16:03:48 I believe the grenade job is largely back to normal now?
16:04:05 the bug being https://bugs.launchpad.net/neutron/+bug/1713927
16:04:06 Launchpad bug 1713927 in neutron "gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial fails constantly" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
16:04:09 i just rebooted, but remember it ~20%
16:04:23 haleyb, you mentioned in the bug there is still a fix to land
16:04:25 on server side
16:04:31 can you post it in LP?
16:04:50 yes, it needs a quick re-spin to address a comment
16:04:53 I am at loss with all those dvr fixes you have with Swami :)
16:05:11 haleyb, is it required to fix the gate failure?
16:05:15 or it's just nice to have?
16:05:42 it's required to fix the server side, we only worked around it in the agent
16:06:15 ok. I remember we make the job non-voting. I assume we will get it back after the server side fix?
16:07:04 i hope so, if the failure rate is still good
16:07:18 I recollect we had a revert for that somewhere.
16:07:25 but now I fail to find it
16:07:34 I don't think we do
16:07:57 actually the dvr-multinode job is 0%, it's the regular multinode job that's at 20% failure
16:08:13 this is the original - https://review.openstack.org/#/c/500567/
16:08:39 ok, I created a revert for tracking purposes
16:09:01 haleyb: which could also mean we're not collecting data correctly, 0% failure is always suspicious to me as we have infra issues or catching regression failures etc.
16:09:06 haleyb, hm... should we break it back so that we are on the same failure rate? :p
16:09:25 what, you don't believe DVR is better? :-p
16:09:47 it is. it's just that the base line we compare with was always rather low. :p
16:10:11 anyhow...
16:10:19 these were all AIs we had
16:10:27 #topic Grafana
16:10:30 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:11:03 api job is at 20% failure rate on 7days sliding window chart
16:11:07 not sure what causes that
16:11:15 could it be the skip_checks issue?
16:11:24 I mean this: https://review.openstack.org/#/q/I1c0902e3c06886812029fae0e4435bb6674f57df
16:11:43 I believe charts collect data from all branches so we may still see those failures there if they happen in stable
16:12:38 apart from that, it's usual suspects - fullstack and scenarios - that we still have at 100%. let's deal with them one by one.
16:12:42 #topic Fullstack failures
16:13:06 I believe we made some significant progress lately with fixing and triaging failures
16:13:40 I believe the main issue that is causing 100% failure rate right now is the one where test_trunk_lifecycle fails because ovs agents clean up ports that don't belong to them
16:13:51 https://bugs.launchpad.net/neutron/+bug/1687709
16:13:52 Launchpad bug 1687709 in neutron "fullstack: ovs-agents remove trunk bridges that don't belong to them" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:14:02 and armax had a WIP patch for that
16:14:16 https://review.openstack.org/#/c/504186/
16:15:10 the idea there is to introduce a test only option that will set a prefix for agent ports
16:15:20 then configure each fullstack test case with a unique prefix
16:15:32 then ovs agent would filter those ports with the prefix
16:15:40 not ideal, but should be an easy fix
16:15:55 and then we can follow up with filtering them on ovsdb level, or something else
16:16:29 jlibosva, with that patch, do we still want to follow up?
16:16:31 I just don't like that we're having "monkey patch hack" at one agent and "config hack" in other agents. I'd rather have it unified
16:17:00 and moving towards removing those hacks longterm but the goal of the idea is neat
16:17:11 jlibosva, that's not ideal. but do you agree testing the actual production code is better than monkey patching the agent?
16:17:25 imho it's moot
16:17:56 moot point?
16:18:02 as in - longterm both are bad?
16:18:49 no, I mean if you have a code that's parametrized and never used in production - or you patch the code. It's still the same
16:18:53 just written differently
16:19:30 but if we get rid of monkey patching and replacing it with no-prod config values, I'm good. But in my opinion it's the same as monkey patching
16:20:57 jlibosva, I remember I had complications with monkey patched agents when switching to rootwrap because I needed to configure rootwrap to allow those dirs for exec_dirs
16:21:05 of course now it's solved somewhat
16:21:14 but it left bitter taste in my mouth
16:21:16 :)
16:21:18 ok
16:21:29 so to the question of follow up - is it smth we want to still track?
16:21:48 or we will pretend we can live with the hacks?
16:21:53 is it the last issue? I thought we still have some l3 east-west failing IIRC
16:22:06 there are some issues that show once in a while
16:22:08 or you mean to track the isolation stuff that I was working on?
16:22:21 I was hoping that we tackle this one to finally have a reasonable chart that is not 100% always
16:22:30 and then can meaningfully assess progress
16:22:33 but yes, there are others
16:22:38 sounds good
16:22:48 btw the trunk failure shouldn't be 100%, it's a race condition
16:22:54 re follow up, I meant the isolation and/or the filtering of ports on ovsdb query level
16:23:04 some decent fix that would allow us to kill the test option
16:23:23 jlibosva, sometimes the job rate falls to 95%, yes :p
16:23:26 I'll keep it in my backlog
16:23:35 ah, nice :) I like progress
16:24:06 ok, good. it's a nice thing to have, but I believe not a critical thing right now, especially considering all the other priorities we tend to have
16:24:23 another bug that affected the job was https://bugs.launchpad.net/neutron/+bug/1717582
16:24:24 Launchpad bug 1717582 in neutron "fullstack job failing to create namespace because it's already exists" [Undecided,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:24:24 agreed
16:24:34 and slaweq has a fix: https://review.openstack.org/#/c/503890/
16:25:09 as we learned, netlink is async and hence doesn't provide guarantee that after netns add the namespace is present
16:25:12 so we need to spin
16:26:18 I need to get back to the patch, seems like slaweq has reasonable replies to my concerns
16:27:19 thomas has ideas how to avoid the race
16:27:35 it's in PS6
16:27:51 ok, will check
16:28:31 (looking through the list of fullstack bugs) one other thing I had for the suite is switching to using SIGTERM instead of SIGKILL for all services: https://bugs.launchpad.net/neutron/+bug/1487548
16:28:32 Launchpad bug 1487548 in neutron "fullstack infrastructure tears down processes via kill -9" [Low,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:28:36 I have https://review.openstack.org/#/c/499803/
16:28:53 but I need to rework it so that if SIGTERM doesn't kill in a minute, we SIGKILL
16:29:38 anything else on fullstack?
16:31:12 nope
16:31:54 #topic Scenarios
16:32:12 jlibosva, I think we were making some progress there too?
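[Editor's note: the SIGTERM-then-SIGKILL teardown discussed under the fullstack topic above ("if SIGTERM doesn't kill in a minute, we SIGKILL") could be sketched roughly as below. This is an illustrative sketch only, not the code from the patch under review; the `stop_process` helper name and its parameters are made up for the example.]

```python
import signal
import subprocess
import time


def stop_process(proc, grace_period=60.0, poll_interval=0.5):
    """Stop a child process politely: SIGTERM first, SIGKILL as fallback.

    Sends SIGTERM and polls for up to ``grace_period`` seconds; if the
    process is still alive after that, falls back to SIGKILL (kill -9).
    Returns the exit code (a negative signal number on POSIX).
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode  # exited on SIGTERM within the grace period
        time.sleep(poll_interval)
    proc.kill()  # grace period expired, fall back to kill -9
    proc.wait()
    return proc.returncode


if __name__ == "__main__":
    child = subprocess.Popen(["sleep", "300"])
    print("exit code:", stop_process(child, grace_period=5))
```

Since `sleep` exits promptly on SIGTERM, the example never reaches the SIGKILL branch; a process that ignores SIGTERM would be killed once the grace period runs out.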
16:32:35 afaik there were two fronts - one dvr fixes from haleyb and Swami and another router migrations from anilvenkata
16:32:37 I think only Anil has some patches for router migrations
16:32:50 would be nice to have a list of things we believe are related
16:33:16 jlibosva, afaiu some dvr fixes for grenade were effectively also helping scenarios
16:33:57 cool, two birds with one stone
16:33:59 Swami just sent out https://review.openstack.org/#/c/505324 as well as he noticed there are some edge cases still broken
16:34:27 i think he had opened a new bug for that, will check
16:35:01 the flow of dvr fixes both encourages and scares
16:35:08 it's encouraging that we fix that
16:35:18 but... it's like... broken since ever?
16:35:54 router migrations are complicated. this is new since the dvr_snat_bound code merged in pike
16:36:10 jlibosva, haleyb you think it would make sense to have a list of scenario related fixes somewhere so that we can prioritize them somehow?
16:36:20 I can take an AI to do
16:36:37 #action jlibosva to prepare a list of scenario related fixes
16:36:40 jlibosva, thanks!
16:36:44 yes, either a single bug and/or topic i guess
16:37:00 haleyb, ^ please help Jakub with that, I believe you have a good grasp of dvr side of things
16:37:09 will do
16:37:44 maybe we could create a short-term LP tag to get the list quickly
16:37:55 jlibosva, apart from it, are we aware of any other fixes? how close would the fixes we have in pipeline get us in terms of failure rate?
16:38:14 jlibosva, good idea. we can create a new tag, no need to have a permanent one
16:38:58 there is a fix for remote security groups that iwamoto is working on
16:39:12 https://review.openstack.org/#/c/492404/ ?
16:39:14 https://review.openstack.org/#/c/492404/
16:39:15 yep
16:39:26 this should also bring some peace
16:39:39 I thought it's just perf optimization?
16:40:20 ok I see in the commit message: "that filtering
16:40:21 are correctly performed"
16:40:27 so I guess it's functional too
16:40:29 no, it's a new regression caused by conjunctions
16:40:52 also slaweq has a patch for qos for better logging in the test: https://review.openstack.org/#/c/491244/
16:40:53 jlibosva, it affects pike+?
16:41:28 I mean the conjunctions
16:41:38 yes
16:41:43 it's been merged to pike
16:42:33 another regression, good
16:42:36 :)
16:42:47 but it seems like we are not close to complete there?
16:42:55 so I guess I will need to release .1 without it
16:43:22 jlibosva, re the qos test patch, my concern was that it hides the issue. is it correct?
16:43:37 it retries over and over. shouldn't we expect it to work correctly?
16:43:52 instead it hangs in the middle
16:44:23 yep, we could have some mechanism to retry few times
16:44:38 but importantly, it adds some logging messages that could reveal more information about the issue
16:46:17 jlibosva, but can't we raise after logging?
16:46:26 we can :)
16:46:31 otherwise it seems like the test will pass if it works after e.g. 5th attempt
16:46:39 or make a loop
16:46:47 aah
16:46:48 I see
16:47:11 it can also loop now indefinitely
16:47:25 before, the timeout would bubble up to test runner
16:47:28 now it's swallowed
16:47:32 I was thinking about trying several times but you're right, that failing the test would make more sense
16:47:49 and we should log the exception too
16:48:30 ok, let's follow up with comments there then
16:48:41 logging is a good idea, it's just execution that scared me
16:49:40 #topic Open discussion
16:49:59 we hit https://launchpad.net/bugs/1717046 the last week
16:50:01 Launchpad bug 1717046 in neutron "L3HARouterVRIdAllocationDbObjectTestCase.test_delete_objects fails because of duplicate record" [Medium,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:50:08 I worked it around with a fix for now
16:50:15 but the root issue in the test framework is still there
16:50:21 I have this attempt: https://review.openstack.org/#/c/503854/
16:50:25 but it will require some more work
16:50:54 also, Genadi Ch sent a new scenario for sec groups here: https://review.openstack.org/#/c/504021/
16:51:15 and seems like he was able to trigger a sqlalchemy error and error 500 with it
16:51:25 he reported a bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1493175 (it's rhbz, not lp)
16:51:26 bugzilla.redhat.com bug 1493175 in openstack-neutron "Update of VM port to have different number of security groups fails with Error 500" [High,New] - Assigned to amuller
16:51:44 we will need to have a look, may be a reason of some scenario failures
16:51:47 and it's in db layer.
16:51:56 fails on refresh(port_db) call somewhere
16:52:10 so probably affects everyone, not specific to backend
16:52:50 also, to recap the ptg discussions, dvr folks were planning to adopt fullstack suite for testing different agent deployment modes
16:53:02 I assume haleyb will follow up with Swami on that one
16:53:10 that's all I have
16:53:17 anything else to discuss?
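[Editor's note: the "log the exception, then raise so the timeout bubbles up to the test runner" behaviour agreed on for the qos test above could look roughly like the helper below. This is an illustrative sketch; the function name and exception class are invented for the example and are not taken from the patch under review.]

```python
import logging
import time

LOG = logging.getLogger(__name__)


class ConditionTimeout(Exception):
    """Raised when the awaited condition never became true."""


def wait_until_true(predicate, timeout=10.0, sleep=0.5, description=None):
    """Poll ``predicate`` until it returns True or ``timeout`` expires.

    On timeout we log what we were waiting for and then raise, so a hang
    fails the test instead of being swallowed by an indefinite retry loop.
    """
    what = description or getattr(predicate, "__name__", repr(predicate))
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            LOG.error("Timed out after %.1fs waiting for: %s", timeout, what)
            raise ConditionTimeout(what)  # fail the test, don't loop forever
        time.sleep(sleep)
```

A condition that is already true returns immediately; one that never becomes true raises `ConditionTimeout` after logging, and the exception propagates to the test runner.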
16:53:41 haleyb, jlibosva
16:53:42 not from me
16:53:56 nothing here
16:54:14 good. thanks for joining. we should have progress next time we meet. green gate future ahead.
16:54:16 o/
16:54:18 #endmeeting