16:01:09 <ihrachys> #startmeeting neutron_ci
16:01:10 <openstack> Meeting started Tue Sep 26 16:01:09 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:14 <openstack> The meeting name has been set to 'neutron_ci'
16:01:15 <ihrachys> jlibosva, mlavalle, haleyb o/
16:01:19 <mlavalle> o/
16:01:19 <jlibosva> o/
16:01:24 <ihrachys> and to whoever else joined
16:01:24 <haleyb> hi
16:01:39 <ihrachys> #topic Actions from prev week
16:01:40 <slaweq_> hello
16:01:49 <ihrachys> hi slaweq_ !
16:02:05 <ihrachys> traditionally, we walk through action items from the previous week
16:02:10 <ihrachys> first item was "jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate"
16:02:15 <jlibosva> I did!
16:02:31 <ihrachys> fill us in!
16:02:43 <jlibosva> tldr; it's not possible because default subnetpool is also used for nova tests
16:03:08 <ihrachys> ok then how is it supposed to be executed by a tempest consumer?
16:03:12 <jlibosva> and it would be very hacky to have both auto-allocated-topology tests + nova tests and other relying on default subnet pool and default subnetpool
16:03:21 <ihrachys> do users choose one or another?
16:03:38 <jlibosva> as per armax it was a mistake to merge the default subnetpool API tests
16:04:01 <jlibosva> so I made this patch https://review.openstack.org/#/c/507487/
16:04:02 <patchbot> patch 507487 - neutron - Replace default subnetpool API tests with UT
16:04:24 <jlibosva> it moves the tests from API to unittests using the extension tests
16:04:40 <jlibosva> as the current default subnetpool API tests touch only API and DB layer, the coverage is kept
16:05:21 <ihrachys> great, adding to my review queue
16:05:33 <ihrachys> next action item was also on jlibosva
16:05:34 <slaweq_> jlibosva: I'm not expert but can't it be moved to fullstack?
16:05:35 <ihrachys> "jlibosva to prepare a list of scenario related fixes"
16:05:58 <ihrachys> let's address slaweq_ question first
16:06:08 <jlibosva> slaweq_: yes, that was also one of the options but given the current state of fullstack I thought UT should be sufficient
16:06:24 <ihrachys> it's not that bad ;)
16:06:29 <ihrachys> I mean, we are getting closer
16:06:32 <slaweq_> ok, UT are voting at least :)
16:06:44 <slaweq_> but I just wanted to ask
16:06:51 <jlibosva> slaweq_: it's a good point :)
16:07:14 <ihrachys> I think it's fine to move with what you have. fullstack could be a nice follow up
16:07:29 <mlavalle> ++
16:07:33 <ihrachys> aka never happen
16:07:36 <ihrachys> :)
16:07:50 <ihrachys> ok, let's look at the next item. "jlibosva to prepare a list of scenario related fixes"
16:08:13 <ihrachys> I was hoping to have the list for the team meeting today but seems like we did not have one.
16:08:19 <ihrachys> haleyb was going to help jlibosva with the task
16:08:25 <haleyb> i have two
16:08:29 <ihrachys> two lists?
16:08:46 <jlibosva> I was looking at the fixes related to the dvr stuff, I somehow mixed the things ...
16:08:52 <haleyb> two reviews that help with scenarios jobs
16:09:16 <haleyb> https://review.openstack.org/#/c/500384/ - check router interface before ssh
16:09:16 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:09:32 <haleyb> https://review.openstack.org/#/c/505324
16:09:33 <patchbot> patch 505324 - neutron - DVR: Fix unbound fip port migration to bound port
16:09:46 <ihrachys> anilvenkata, do you plan to respin https://review.openstack.org/#/c/500384/ this week?
16:09:47 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:10:03 <anilvenkata> ihrachys, yes, doing it now
16:10:39 <ihrachys> great
16:10:51 <haleyb> ihrachys: i am reviewing swami's patch again now, but with it there are very few scenario failures, think 3 now
16:11:01 <ihrachys> haleyb, looking at the second patch. why is it every time we need to fix a bug in dvr, it ends up like a 200+ line patch? :)
16:11:07 <ihrachys> I am scared each time
16:11:40 <haleyb> it's not all butterflies and rainbows
16:12:16 <ihrachys> linuxbridge job seems to show more failures than ovs
16:12:19 <ihrachys> http://logs.openstack.org/24/505324/10/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/e1a0b47/logs/testr_results.html.gz
16:12:22 <haleyb> if we didn't do migration it would be real simple :)
16:12:23 <ihrachys> vs. http://logs.openstack.org/24/505324/10/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/c2e8825/testr_results.html.gz
16:12:37 <ihrachys> so we also have trunks and qos
16:13:00 <ihrachys> but some of those could be the iptables issue that I spotted today
16:13:06 <ihrachys> (I will update later in the meeting)
16:13:41 <ihrachys> haleyb, thanks for the update on the patches. would be nice if we land them this week.
16:13:53 <haleyb> ihrachys: oh, and there is one more
16:14:20 <haleyb> https://review.openstack.org/#/c/500143/
16:14:21 <patchbot> patch 500143 - neutron - DVR: Always initialize floating IP host
16:14:37 <haleyb> final fix for agent, but maybe that's more dvr-multinode related, getting ahead of myself
16:14:41 <mlavalle> ihrachys: land the 3 of them?
16:15:46 <ihrachys> mlavalle, yeah. 2 of them will require l3 experts to chime in. too complex for my 1oz brain
16:16:15 <mlavalle> lol, I'll keep an eye on them
16:16:23 <ihrachys> haleyb, the scenario job is dvr-multinode so it would affect it
16:16:45 <ihrachys> the patch from anilvenkata - https://review.openstack.org/#/c/500384/ - reveals some weirdness (to my taste) in api behavior. the change seems fine, but the number of things that api user is supposed to check to make sure data plane is healthy after an update is enormous. is it really the best we can do? see: https://review.openstack.org/#/c/500384/7/neutron/tests/tempest/scenario/test_migration.py
16:16:46 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:16:47 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:17:17 <ihrachys> we check ports, fips, ports deleted...
16:17:44 <ihrachys> I started to think that we need some attribute to expose whether router is ready
16:17:47 <ihrachys> maybe reuse status
16:17:47 <haleyb> ihrachys: i saw your comment regarding PENDING_UPDATE, just don't know how invasive that would be
16:19:00 <haleyb> i'm just thinking of a case where the server changes the state, informs agent, but reply lost. server would need to re-send
16:19:42 <ihrachys> it would stay pending until they recover somehow. data plane would still be functional (to the point it can be functional with the update not applied)
16:20:19 <haleyb> but i'm willing to try and make it better
16:20:50 <ihrachys> great, thanks a lot
16:20:59 <mlavalle> yeah, thanks
16:21:00 <anilvenkata> haleyb, we have similar thing for router state change
16:21:12 <anilvenkata> haleyb, if agent's reply is lost, it will periodically retry
16:21:18 <anilvenkata> haleyb, Ann has implemented it
16:21:47 <ihrachys> anilvenkata, is it merged or up for review?
16:22:16 <anilvenkata> ihrachys, that is only regarding HA router active/backup state change
16:23:29 <anilvenkata> ihrachys, not related to data plane status change
16:23:34 <ihrachys> ok. we didn't have any more items
16:23:37 <ihrachys> let's move on
16:23:48 <ihrachys> #topic Grafana
16:23:51 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:09 <ihrachys> (boy it takes a while to load it)
16:25:43 <mlavalle> yeap
16:25:45 <ihrachys> so first, our all time friends - scenario and fullstack - are still close to 100%
16:25:54 <ihrachys> the good news is that fullstack is now not 100% all the time
16:25:58 <ihrachys> more like 80%
16:26:06 <ihrachys> so the latest fixes helped a bit
16:26:30 <ihrachys> scenarios are not that lucky
16:26:49 <ihrachys> we had a blip with rally breakages earlier in the week but it's back to normal now
16:27:15 <ihrachys> those guys apparently don't have enough test coverage for the code so regressions happen, and since we feed from their master, it breaks us once in a while
16:28:19 <ihrachys> tl;dr nothing new, we can move to discuss specific day-to-day offenders
16:28:23 <ihrachys> #topic Scenarios
16:28:31 <ihrachys> we already discussed a bit the dvr/migration fixes
16:29:12 <ihrachys> one thing that I noticed today while looking at results for the new sec group scenario is that linuxbridge job seems to be busted because of a iptables issue
16:29:21 <ihrachys> example: http://logs.openstack.org/21/504021/2/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/e47a3f3/logs/screen-q-agt.txt.gz#_Sep_26_09_17_36_512389
16:29:39 <ihrachys> we have code in iptables managers that applies rules twice and checks that result is same
16:29:59 <ihrachys> that is activated by debug_iptables_rules = True
16:30:03 <ihrachys> which is on in gate
16:31:14 <ihrachys> doesn't happen in ovs
16:31:31 <ihrachys> but that's probably because there we use the 'openvswitch' flow firewall driver
16:31:55 <ihrachys> so maybe the issue still affects ovs/iptables setups
16:32:17 <ihrachys> of course we could disable debug_iptables_rules to avoid the errors but that would be a silly thing to do
16:33:07 <ihrachys> I hear that haleyb and jlibosva would love to take it from here ;)
16:33:23 <jlibosva> I can take a look tomorrow :)
16:33:37 <ihrachys> ok good. I will prepare a bug report after the meeting
16:33:52 <ihrachys> #action jlibosva to triage iptables apply failure in linuxbridge scenarios job
16:33:53 <haleyb> i will help as well of course
16:34:00 <ihrachys> #action ihrachys to report bug for iptables apply failure
16:34:34 <ihrachys> also while we are at it, worth reviewing the test scenario itself that triggers the failure: https://review.openstack.org/#/c/504021
16:34:34 <patchbot> patch 504021 - neutron - [Tempest] Scenarios for several sec groups on VM
16:34:38 <mlavalle> Thanks jlibosva, haleyb
16:35:01 <ihrachys> final thing to discuss here is that tempest plugin with all tests, api and scenarios, is moving to a separate repo
16:35:12 <ihrachys> chandankumar is spearheading the initiative
16:35:14 * mlavalle will review it
16:35:20 <haleyb> ihrachys: oh, so it only happens with that patch?
16:35:48 <ihrachys> haleyb, well I am not sure. could be. that's where I spotted it.
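The "apply rules twice and check that the result is the same" idea mentioned above for debug_iptables_rules can be illustrated with a small sketch. This is not neutron's iptables manager code: the function names, the exception, and the list-of-strings rule model are all hypothetical, and a real implementation would compare what iptables itself reports after each pass.

```python
from typing import Callable, List

Rules = List[str]


class NonIdempotentApplyError(Exception):
    """Raised when a second apply pass changes the resulting rule set."""


def apply_and_verify(apply: Callable[[Rules, Rules], Rules],
                     current: Rules, desired: Rules) -> Rules:
    """Apply `desired` twice and fail if the two results differ.

    `apply` is a hypothetical stand-in for whatever pushes rules to the
    system and returns the resulting state.
    """
    first = apply(current, desired)
    second = apply(first, desired)
    if first != second:
        raise NonIdempotentApplyError(
            "second apply changed state: %r != %r" % (first, second))
    return first


def stable_apply(current: Rules, desired: Rules) -> Rules:
    # Toy apply: result depends only on `desired`, duplicates dropped,
    # order preserved. Re-applying yields the same list.
    return list(dict.fromkeys(desired))


def buggy_apply(current: Rules, desired: Rules) -> Rules:
    # Toy apply with a bug: blindly appends, so a second pass
    # duplicates every rule.
    return current + desired
```

With these toy functions, `apply_and_verify(stable_apply, [], rules)` returns the deduplicated rules, while `apply_and_verify(buggy_apply, [], rules)` raises, which is the kind of failure such a debug check is meant to surface.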
16:36:17 <ihrachys> for the split tempest, we already have https://github.com/openstack/neutron-tempest-plugin
16:36:23 <ihrachys> that carries copy of tests
16:37:02 <ihrachys> chandankumar is currently working on preparing jobs that would run tests from the plugin: https://review.openstack.org/#/c/507038/
16:37:03 <patchbot> patch 507038 - openstack-infra/project-config - Added tempest dsvm job for neutron-tempest-plugin
16:37:32 <ihrachys> once they are in the separate repo gate, and we prove they work, we will switch neutron repo to new jobs that use the plugin repo
16:37:45 <ihrachys> and then clean up the neutron tree: https://review.openstack.org/#/c/507038/
16:37:45 <patchbot> patch 507038 - openstack-infra/project-config - Added tempest dsvm job for neutron-tempest-plugin
16:37:59 <ihrachys> oops wrong link
16:38:04 <mlavalle> ihrachys: one question, what happens to code we are changing now in the Neutron repo?
16:38:13 <ihrachys> this is the patch that removes the in-tree tests: https://review.openstack.org/#/c/506672/
16:38:14 <patchbot> patch 506672 - neutron - Remove the bundled intree neutron tempest plugin
16:38:30 <ihrachys> mlavalle, that's a good question, and the answer is, we will need to first sync everything before pulling the trigger
16:38:48 <mlavalle> yeah, I saw it over the weekend and was wondering
16:38:56 <mlavalle> good to know
16:39:02 <ihrachys> once we are closer to that, we should probably freeze all new patches, work through sync process, pull the trigger, cleanup; then ask everyone to move patches to the new repo
16:39:16 <mlavalle> sounds good
16:39:25 <ihrachys> there is a bit of legwork there
16:39:31 <mlavalle> yeap
16:40:27 <ihrachys> actually I now realized that project-config change is incomplete
16:40:44 <ihrachys> because since the new repo is branchless, we need to target all neutron stable branches there
16:40:49 <ihrachys> like tempest does
16:40:55 <mlavalle> right
16:41:00 <ihrachys> I commented in the patch
16:41:19 <ihrachys> the new setup will also temporarily complicate backports of test fixes into stable branches
16:42:37 <ihrachys> that's all I have for scenarios. there may be more issues with it, but I prefer we get back to them once migration/dvr/iptables stuff is fixed and we see new results
16:42:42 <ihrachys> #topic Fullstack
16:43:02 <ihrachys> there was a PTG discussion on things we should take care to make fullstack pass
16:43:35 <ihrachys> first issue was about trunk test case failing because ovs agents remove ports of each other because ovsdb is not sandboxed
16:43:50 <ihrachys> I believe the long term vision is we actually have it sandboxed
16:43:57 <mlavalle> correct
16:44:01 <ihrachys> but for the time being, armax sent this fix: https://review.openstack.org/#/c/504186/
16:44:02 <patchbot> patch 504186 - neutron - Isolate trunk bridges on fullstack runs
16:44:39 <ihrachys> it needs some more work, but if you folks could assess the general approach that would be great.
16:45:13 <mlavalle> will do
16:45:51 <ihrachys> another issue we have with fullstack is that sometimes it seems like netlink socket operations don't immediately translate into kernel state changes
16:45:51 <ihrachys> https://launchpad.net/bugs/1717582
16:45:53 <openstack> Launchpad bug 1717582 in neutron "fullstack job failing to create namespace because it's already exists" [Undecided,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:46:01 <ihrachys> which then makes ensure_namespace fail
16:46:15 <ihrachys> slaweq_ had a fix for that https://review.openstack.org/#/c/503890/ but seems like it didn't help
16:46:16 <patchbot> patch 503890 - neutron - Fix for race condition during netns creation (MERGED)
16:46:21 <slaweq_> and my patch didn't solve it
16:46:27 <ihrachys> so now we probably revert: https://review.openstack.org/#/c/507382/
16:46:28 <patchbot> patch 507382 - neutron - Revert "Fix for race condition during netns creation"
16:46:40 <slaweq_> I think that there is less such kind of errors now but it's only my "feeling"
16:46:48 <ihrachys> and look at alternative implementation of netns add using pyroute2: https://review.openstack.org/#/c/505701/
16:46:48 <patchbot> patch 505701 - neutron - Change ip_lib network namespace code to use pyroute2
16:46:55 <mlavalle> have we taken the decision to revert yet?
16:47:14 <ihrachys> haleyb, remind everyone why we think it helps?
16:47:24 <ihrachys> haleyb, should probably be part of the commit message (also a link to the bug)
16:48:02 <ihrachys> mlavalle, afaiu slaweq_ explained: "My patch changed ensure_namespace() method to catch ProcessExecutionError exception. If Brian's patch will be merged it will never happen here that such exception will be raised."
16:48:15 <slaweq_> yep
16:48:18 <ihrachys> sounds like useless code if we go the pyroute2 path
16:48:25 <mlavalle> cool
16:48:26 <haleyb> ihrachys: i more started the pyroute2 change as a performance improvement, but it should catch the namespace exists issue better
16:48:31 <slaweq_> because pyroute2 will raise exception with EEXIST error code
16:48:48 <ihrachys> slaweq_, but we will still need to catch it?
16:48:52 <slaweq_> yep
16:49:03 <ihrachys> ok so the haleyb's patch doesn't fix the bug
16:49:04 <slaweq_> but we can check if it's this error or other one
16:49:07 <mlavalle> ahhh, I glimpsed pieces of the EEXIST conversation yesterday
16:49:32 <ihrachys> haleyb, then maybe related-bug
16:49:35 <haleyb> ihrachys: it does because it catches the pyroute EEXIST exception and ignores it
16:49:39 <ihrachys> oh ok
16:49:51 <slaweq_> also, I don't know exactly how privsep works
16:49:55 <haleyb> i'll add related-bug
16:49:58 <slaweq_> and if it uses rootwrap daemon
16:50:11 <ihrachys> haleyb, is it in neutron that you catch, or it's library?
16:50:37 <slaweq_> but I think that current issue might be related to some kind of race condition between two rootwrap daemons which executes commands
16:50:59 <haleyb> ihrachys: in neutron, the pyroute2 code doesn't seem to catch it properly
16:51:11 <ihrachys> slaweq_, could well be, meaning this wouldn't solve the underlying issue
16:51:18 <jlibosva> haleyb: even with the flag passed we saw at pyroute2 docs?
16:51:36 <ihrachys> oh I see the catch in https://review.openstack.org/#/c/505701/4/neutron/privileged/agent/linux/ip_lib.py
16:51:37 <patchbot> patch 505701 - neutron - Change ip_lib network namespace code to use pyroute2
16:52:17 <haleyb> ihrachys: i'd have to do it differently to use any flags
16:53:12 <ihrachys> slaweq_, so it seems like the lib handles the error itself?
16:53:19 <ihrachys> we don't need to handle it in neutron then?
16:53:27 <slaweq_> not exactly
16:53:42 <slaweq_> there is try except which should catch it
16:54:03 <slaweq_> but as we checked yesterday with haleyb it raises exception in other place
16:54:15 <ihrachys> oh ok, I see
16:54:29 <ihrachys> maybe a good idea to patch pyroute2 (in addition to neutron)
16:54:41 <haleyb> it might be a bug in the pyroute2 code, i need to file a bug and get maintainer response
16:54:42 <jlibosva> I haven't played with it but I looked at the docs of pyroute2 and there was an option to ignore namespace creation in case it already existed
16:54:47 <slaweq_> I was thinking that in https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/__init__.py#L149 it will be catched
16:55:05 <slaweq_> but it looks that exception is raised in https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/__init__.py#L162
16:55:28 <ihrachys> slaweq_, right. because netnsdir is /var/run/netns/
16:55:35 <ihrachys> and netnspath is /var/run/netns/xxx
16:55:44 <ihrachys> so it handles exists for first only
16:56:12 <jlibosva> # create a new netns or use existing one
16:56:14 <jlibosva> netns = NetNS('test', flags=os.O_CREAT)
16:56:23 <jlibosva> from http://pyroute2.org/pyroute2-0.3.4/netns.html
16:56:59 <haleyb> i use the lazy pyroute2.netns.create("test") method instead of instantiating a class
16:57:20 <ihrachys> ok we have little time, let's move on with the issue. everyone make sure we review those patches quick.
16:57:34 <ihrachys> #topic Open discussion
16:57:40 <ihrachys> I would like to note one thing before we wrap up
16:58:01 <ihrachys> infra is switching gates to zuulv3
16:58:14 <ihrachys> there are significant changes in the future in how we define our jobs
16:58:29 <ihrachys> the gist of the changes can be found in https://docs.openstack.org/infra/manual/zuulv3.html
16:58:35 <ihrachys> make yourself comfortable
16:58:49 <jlibosva> I also wanted to note I'm working on some fullstack work: https://review.openstack.org/#/c/506722/ - it's very invasive :) but I hope it'll bring a lot of fruits
16:58:50 <patchbot> patch 506722 - neutron - WIP: fullstack: Put all services into namespaces
16:58:50 <ihrachys> the switch is currently ongoing, so project-config changes are frozen for now
16:59:08 <jlibosva> I'll probably split it into smaller pieces once it's done
16:59:09 <ihrachys> jlibosva, nice!
16:59:14 <mlavalle> thanks for pointing it out
16:59:50 <ihrachys> haleyb, btw you may want to push again on getting multinode-dvr the default once infra is done with zuulv3
16:59:53 <haleyb> i also plan on re-basing https://review.openstack.org/#/c/483600/ today
16:59:54 <patchbot> patch 483600 - openstack-infra/project-config - Make neutron-multinode job the default
16:59:58 <ihrachys> ok folks we are at the top of the hour
17:00:09 <ihrachys> haleyb, huh, great minds...
17:00:10 <ihrachys> :)
17:00:15 <ihrachys> thanks everyone for joining
17:00:18 <jlibosva> thanks all, cya :)
17:00:24 <slaweq_> thanks, bye
17:00:24 <mlavalle> o/
17:00:27 <ihrachys> mlavalle, slaweq_ thanks for joining and all the help, I appreciate
17:00:29 <ihrachys> #endmeeting
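The EEXIST handling discussed under the Fullstack topic (catch the "already exists" error, let anything else propagate) can be sketched roughly as follows. This is not the code from review 505701. The `create` callable is injected so the example runs without pyroute2 or root privileges; in the setup described in the log it would presumably be something like `pyroute2.netns.create`, and the assumption that the failure surfaces as an `OSError` carrying `errno.EEXIST` is taken from the discussion rather than verified here. `FakeNetnsBackend` is purely hypothetical.

```python
import errno
from typing import Callable, Set


def ensure_namespace(name: str, create: Callable[[str], None]) -> bool:
    """Make sure namespace `name` exists.

    Returns True if this call created it, False if it already existed.
    Any error other than EEXIST is re-raised unchanged.
    """
    try:
        create(name)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        return False
    return True


class FakeNetnsBackend:
    """Hypothetical in-memory stand-in for a namespace-creating backend."""

    def __init__(self) -> None:
        self.names: Set[str] = set()

    def create(self, name: str) -> None:
        if name in self.names:
            # Assumed failure mode: "already exists" reported via errno.
            raise OSError(errno.EEXIST, "netns exists", name)
        self.names.add(name)
```

Checking the error code instead of swallowing every `OSError` matches the point made in the log that "we can check if it's this error or other one": a permission error, for example, would still be raised to the caller.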