16:00:37 <ihrachys> #startmeeting neutron_ci
16:00:38 <openstack> Meeting started Tue Jul 11 16:00:37 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:39 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:41 <openstack> The meeting name has been set to 'neutron_ci'
16:00:49 <jlibosva> o/
16:01:01 <ihrachys> hi jlibosva
16:01:06 * ihrachys waves at haleyb too
16:01:35 <ihrachys> we haven't had a meeting for a while
16:01:45 <ihrachys> #topic Actions from prev week
16:01:51 <jlibosva> week :)
16:02:12 <ihrachys> more than a week no? anyhoo.
16:02:16 <ihrachys> first AI was "jlibosva to craft an email to openstack-dev@ with func-py3 failures and request for action"
16:02:28 <ihrachys> I believe that we made significant progress for py3 for func tests
16:02:36 <ihrachys> jlibosva, can you briefly update about latest?
16:02:48 <jlibosva> yep, team did a great job and took down failures pretty quick
16:03:01 <jlibosva> last time I checked we had a single failure that's caused likely by a bug in eventlet
16:03:28 <jlibosva> some thread switch takes too long - increasing a timeout for particular test helps: https://review.openstack.org/#/c/475888/
16:03:34 <jlibosva> but that's not a correct way to go
16:03:49 <jlibosva> I planned to reach out to some eventlet peeps but I haven't yet
16:03:57 <ihrachys> who would be the peeps?
16:04:34 <jlibosva> I'll try vstinner first
16:05:24 <ihrachys> ack
16:05:34 <ihrachys> cool, seems like we are on track with it
16:05:45 <jlibosva> I can take an AI till the next mtg
16:05:47 <ihrachys> #action jlibosva to reach out to Victor Stinner about eventlet/py3 issue with functional tests
16:06:04 <ihrachys> next AI was "jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'"
16:06:18 <jlibosva> so that's an interesting thing
16:06:25 <ihrachys> jlibosva, I was trying to find the patch that otherwiseguy had for ovsdbapp with the test fixture lately and couldn't find it. have a link?
16:06:29 <otherwiseguy> ihrachys, we've been talking about it quite a bit today. :p
16:06:50 <jlibosva> ihrachys: it's hidden :)
16:06:52 <jlibosva> here https://review.openstack.org/#/c/470441/30/ovsdbapp/venv.py
16:07:19 <ihrachys> oh ok that patch. I expected a separate one.
16:07:25 <otherwiseguy> yeah, that should have been separate. :(
16:07:35 <jlibosva> I was actually doing some coding and I'm trying to use the ovsdb-server in a sandbox
16:07:57 <jlibosva> my concern was whether we'll be able to connect "nodes"
16:08:14 <jlibosva> meaning that bridges in one sandbox must be reachable by bridges from the other sandbox
16:08:40 <jlibosva> it seems the ovs_system sees all entities in the sandboxes - so I hope we're on a good track
16:09:05 <ihrachys> ok cool. how do you test it? depends-on won't work until new lib is released right?
16:09:13 <jlibosva> currently I have some code that runs ovs agent, each using its own ovsdb-server
16:09:49 <jlibosva> I haven't pushed anything to gerrit yet so I have an egg-link pointing to ovsdbapp dir
16:09:55 <ihrachys> ah ok.
16:10:05 <jlibosva> but generally, yeah, for gate we'll need a new ovsdbapp release
16:10:11 <ihrachys> otherwiseguy, can we get it split separately this week?
16:10:34 <otherwiseguy> It's entirely possible that I will have enough in for 1.0 this week.
16:10:41 <otherwiseguy> maybe 0.99 just to be safe. :p
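A rough sketch of the sandbox approach jlibosva describes above, assuming the fixture API from the venv.py patch linked in the discussion (the import path, class name, and constructor arguments are assumptions; the real ovsdbapp API may differ). Each simulated fullstack "node" drives its own ovsdb-server instead of the shared host one, while the common ovs_system datapath is what lets bridges in different sandboxes still reach each other:

```python
# Sketch only: one fullstack "node" with a private OVS sandbox.
# venv.OvsVenvFixture and its arguments are assumed from the linked patch.
import fixtures

from ovsdbapp import venv  # assumed module from the linked venv.py patch


class SandboxedNodeFixture(fixtures.Fixture):
    """One simulated fullstack host with its own ovsdb-server."""

    def __init__(self, temp_dir):
        super(SandboxedNodeFixture, self).__init__()
        self.temp_dir = temp_dir

    def _setUp(self):
        # Dedicated ovsdb-server for this node only, so OVS agents in
        # different "nodes" no longer share one host-wide OVSDB.
        self.venv = self.useFixture(venv.OvsVenvFixture(self.temp_dir))
        # The node's ovs agent would then be pointed at this sandbox's
        # connection, e.g. something like 'unix:%s/db.sock' % self.temp_dir
        # (path is illustrative).
```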
16:10:49 <ihrachys> otherwiseguy, but this patch is not in yet right?
16:10:55 <ihrachys> oh it is
16:10:57 <ihrachys> sorry
16:11:02 <otherwiseguy> the venv patch is.
16:11:11 <otherwiseguy> just not in a release.
16:11:31 <jlibosva> I have some WIP for ovsdbapp too, to not start ovn schemas, would be nice to get it in the release too, if the patch makes sense
16:11:41 <jlibosva> https://github.com/cubeek/ovsdbapp/commit/0f51ab16ec72a7033057740d928c599ba3cd7fc6?diff=split
16:11:53 <ihrachys> jlibosva, why github fork?
16:12:07 <jlibosva> ihrachys: to show the WIP patch
16:12:27 <jlibosva> ihrachys: it's 2 hours old :)
16:12:41 <ihrachys> the idea of the patch makes a lot of sense. I think we discussed that before.
16:12:52 <ihrachys> please post to gerrit so that we can bash it
16:13:18 <jlibosva> oh did we? maybe I forgot, I just wanted to have the small minimum for my fullstack work
16:13:25 <ihrachys> #action jlibosva to post patch splitting OVN from OvsVenvFixture
16:13:26 <jlibosva> I'll push it once I polish it
16:13:51 <jlibosva> I'm also not sure e.g. if vtap belongs to ovn or ovs ...
16:14:01 <ihrachys> jlibosva, two weeks ago while drinking beer. I am not surprised some details could be forgotten :)
16:14:17 <jlibosva> damn you beer
16:14:35 <ihrachys> nah. yay beer. it spurred discussion in the first place.
16:15:00 <ihrachys> ok, we'll wait for your patch on gerrit and then whine about it there
16:15:25 <ihrachys> nice work otherwiseguy btw, it's a long-standing issue for fullstack and you just solved it
16:15:40 * otherwiseguy crosses his fingers
16:15:40 <ihrachys> next AI was on me "ihrachys to update about functional/fullstack switch to devstack-gate and rootwrap"
16:16:08 <jlibosva> otherwiseguy++
16:16:11 <ihrachys> so, to unblock the gate, we landed the patch switching fullstack and functional test runners to rootwrap, then landed the switch of those gates to devstack-gate
16:16:15 <otherwiseguy> jlibosva++
16:16:36 <ihrachys> which resulted in breakage of the fullstack job because some tests are still apparently not using rootwrap correctly
16:16:50 <ihrachys> which is the reason why fullstack is 100% failing in grafana ;)
16:17:13 <ihrachys> I didn't have time to look at it till now. I should have some till next meeting.
16:17:30 <ihrachys> #action ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases
16:17:41 <ihrachys> next was "haleyb to continue looking at prospects of dvr+ha job"
16:17:51 <haleyb> yes, i'm here
16:18:11 <ihrachys> the last time we talked about it you were going to watch the progress of the new job
16:18:37 * ihrachys looks at grafana
16:19:02 <haleyb> the job looks ok, it is still non-voting of course
16:19:56 <haleyb> it can still be higher than the dvr-multinode job
16:19:58 <ihrachys> I see it's 25%+ right now. is it ok?
16:20:28 <haleyb> when i looked yesterday it was lower, need to refresh
16:21:27 <haleyb> ihrachys: maybe it's time to split that check queue panel into two - grenade and tempest
16:21:49 <ihrachys> yeah I guess that could help. it's a mess right now.
16:21:57 <ihrachys> haleyb, will you post a patch?
16:22:44 <haleyb> sure i can do that. i don't know why it's failing more now, i'll have to look further
16:23:16 <ihrachys> #action haleyb to split grafana check dashboard into grenade and tempest charts.
16:23:37 <ihrachys> haleyb, re failure rate, even the dvr one seems too high
16:23:43 <haleyb> as part of the job reduction i was going to suggest making it voting to replace the dvr-multinode
16:24:36 <haleyb> yes, they usually track each other. last time i saw higher failures it was a node setup issue, which is more likely to happen the more nodes we use
16:24:56 <ihrachys> it's 3 nodes for ha right?
16:25:05 <haleyb> yes, 3 nodes versus 2
16:25:46 <ihrachys> ack. yeah, replacing would be the end goal. if we know for sure it's just a node setup thing, I think we can make the call to switch anyway. we should know though.
16:26:05 <ihrachys> #action haleyb to continue looking at dvr-ha job failure rate and reasons
16:26:13 <haleyb> i will have to go to logstash to see what's failing and if it's not just bad patches, since it is the check queue and not gate
16:26:22 <haleyb> :)
16:26:48 <ihrachys> +
16:26:52 <ihrachys> next was "ihrachys to talk to qa/keystone and maybe remove v3-only job"
16:27:13 <ihrachys> this is done as part of https://review.openstack.org/474733
16:27:30 <ihrachys> next is "haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim"
16:27:37 <ihrachys> we already touched on it somewhat
16:27:44 <haleyb> let me cut/paste a comment
16:27:58 <haleyb> regarding the grenade gate queue
16:28:01 <haleyb> grenade-dsvm-neutron and grenade-dsvm-neutron-multinode are both
16:28:01 <haleyb> voting. Propose we remove the single-node job, multinode will
16:28:01 <haleyb> just need a small Cells v2 tweak in its config. This is
16:28:01 <haleyb> actually two jobs less since there's a -trusty and -xenial.
16:28:10 <haleyb> doh, that pasted bad
16:28:17 <ihrachys> clarkb and other infra folks were eager to see progress on it because they had some issues with log storage disk space.
16:28:56 <ihrachys> haleyb, trusty is about to go if not already since it was newton only, and mitaka is EOL now
16:28:57 <haleyb> basically there are single-node and multinode jobs, i think we can just use the multinode ones
16:29:16 <clarkb> ihrachys: trusty should mostly be gone at this point
16:29:20 <haleyb> ihrachys: i was going to ask about -trusty, that would be a nice cleanup
16:29:28 <ihrachys> haleyb, since the single node job is part of the integrated gate, do you imply that we do the replacement for all projects?
16:29:36 <clarkb> if you notice any straggler trusty jobs let us know and we can help remove them
16:29:57 <ihrachys> clarkb, nice
16:30:03 <haleyb> neutron has a bunch
16:30:12 <clarkb> haleyb: still running as of today?
16:30:20 <clarkb> most of the cleanup happened late last week
16:30:46 <haleyb> ihrachys: i would think having multi-node is better than single-node, and more like a real setup
16:31:07 <haleyb> clarkb: i don't know, just see them on the grafana dashboard
16:31:17 <haleyb> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:30 <clarkb> haleyb: ya I think we kept them in the dashboard so you don't lose the historical data?
16:31:32 <ihrachys> grafana may have obsolete entries, not everyone is even aware of its existence in project-config
16:31:37 <clarkb> but we may have missed things
16:31:55 <ihrachys> clarkb, we use those dashboards for the current state so it's ok to clean them up
16:32:16 <clarkb> as for making multinode the default in the integrated gate, I've been in favor of it because like you say it is more realistic, but the resistance/concern has been that it will be even more difficult for developers to reproduce locally and debug
16:32:31 <clarkb> now you need 16GB of ram and ~160GB of disk just to run a base test
16:32:49 <ihrachys> haleyb, since you are going to do some cleanup there anyway, I will put that on you too ;)
16:32:54 <clarkb> but I think it's worth revisiting that discussion because maybe that is necessary and we take that trade off (also how many people reproduce locally?)
16:33:00 <ihrachys> #action haleyb to clean up old trusty charts from grafana
16:33:45 <haleyb> ok, np
16:34:26 <haleyb> clarkb: yeah, i was looking more at the failure rates, which are about the same, and which is more important
16:34:59 <haleyb> the grenade jobs were the only ones with this overlap
16:35:48 <haleyb> the other thing is we would still have the multinode and dvr-multinode jobs, but i had a thought on that
16:35:49 <ihrachys> clarkb, I think the switch would fit nicely in your goal of reducing the number of jobs. where would we start from to drive it? ML? I guess folks would want to see stats on stability of the job before committing to anything?
16:36:11 <clarkb> yes I think the ML is a good place to start. QA team in particular had concerns
16:36:21 <clarkb> including stats on stability would be good
16:36:41 <haleyb> since we don't want to reduce coverage on non-dvr code, i was wondering if it was possible to use the dvr setup, but also run tests with a "legacy" router
16:36:49 <clarkb> and maybe an argument for how it is more realistic, eg which code paths can we test on top of the base job (metadata proxy, live migration come to mind)
16:36:54 <ihrachys> ok. I guess it may make sense to focus on neutron state for a bit to prove to ourselves it's a good replacement, then go to a broader audience.
16:38:05 <haleyb> so i can send something to the ML regarding the grenade jobs
16:38:37 <clarkb> and I can respond with info on log server retention and trying to get that under control
16:38:47 <clarkb> and how reducing job counts will help
16:38:59 <ihrachys> haleyb, in theory, each test class could be transformed into a scenario class passing different args to create_router (scenarios could be generated from the list of api extensions except for dvr that may incorrectly indicate support at least before pike)
16:39:42 <ihrachys> ok, let's start a discussion on grenade reduction now, we can polish the multinode job in parallel
16:39:44 <haleyb> ihrachys: right, the only problem could be that only the admin can create non-dvr routers in a dvr setup
16:40:02 <ihrachys> #action haleyb to spin up a ML discussion on replacing single node grenade job with multinode in integrated gate
16:40:28 <haleyb> unfortunately the tempest gate didn't have the overlap the grenade one did
16:42:16 * haleyb stops there since every time he talks he gets another job :)
16:42:28 <ihrachys> haleyb, I would imagine tempest core repo, being a certification tool, may not want to see dvr/ha specific scenarios.
16:43:01 <ihrachys> haleyb, haha. well we may find other candidates for some items that are on you. speak up. :)
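The scenario-class idea ihrachys floats above (running the same L3 tests against both a legacy and a distributed router by varying the arguments to create_router) could look roughly like the sketch below using testscenarios. The class name and the create_router helper are illustrative placeholders rather than actual neutron test code, and as haleyb notes the dvr scenario would need admin credentials:

```python
# Illustrative sketch only: one test class, run once per router flavour.
import testscenarios

from neutron.tests import base


class RouterLifecycleTest(testscenarios.WithScenarios, base.BaseTestCase):

    scenarios = [
        # Legacy (centralized) router; any plain tenant can create it.
        ('legacy', {'router_kwargs': {'distributed': False}}),
        # DVR router; distributed=True typically requires admin credentials,
        # which is the caveat raised in the discussion above.
        ('dvr', {'router_kwargs': {'distributed': True}}),
    ]

    def test_router_create_and_delete(self):
        # create_router is a hypothetical helper that would forward
        # self.router_kwargs to the router-create API call.
        router = self.create_router(**self.router_kwargs)
        self.assertEqual(
            self.router_kwargs['distributed'], router['distributed'])
```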
16:43:51 <haleyb> ihrachys: nah, the grafana and jobs are easy, digging into my other dvr option would be harder
16:43:59 <ihrachys> ok ok
16:44:05 <ihrachys> and just to piss you off
16:44:05 <ihrachys> #action haleyb to continue looking at places to reduce the number of jobs
16:44:09 <ihrachys> :p
16:44:21 <ihrachys> ok those were all items we had
16:44:26 <ihrachys> #topic Grafana
16:44:35 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:44:41 <ihrachys> we somewhat discussed that before
16:44:58 <ihrachys> one thing to spot there though is that the functional job is in bad shape it seems
16:45:07 <ihrachys> it's currently ~25-30% in gate
16:45:44 <ihrachys> I checked both https://bugs.launchpad.net/neutron/+bugs?field.tag=functional-tests&orderby=-id&start=0 and https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0 for any new bug reports that could explain it, with no luck
16:45:49 <ihrachys> I also checked some recent failures
16:46:04 <ihrachys> of those I spotted some were for https://bugs.launchpad.net/neutron/+bug/1693931
16:46:05 <openstack> Launchpad bug 1693931 in neutron "functional test_next_port_closed test case failed with ProcessExecutionError when killing netcat" [High,Confirmed]
16:47:02 <ihrachys> but I haven't done complete triage
16:47:09 <ihrachys> I will take it on me to complete it asap
16:47:24 <ihrachys> #action ihrachys to complete triage of latest functional test failures that result in 30% failure rate
16:47:40 <ihrachys> anyone aware of late issues with the gate that could explain it?
16:48:48 <ihrachys> I guess not. ok I will look closer.
16:49:17 <ihrachys> one other tiny thing that bothers me every time I look at grafana is - why do we have the postgres job in periodics?
16:49:55 <ihrachys> it doesn't seem like anyone really cares, and TC plans to express explicitly that psql is a second class citizen in openstack
16:50:01 * ihrachys wonders if we need it there
16:50:18 <ihrachys> I would be fine to have it there if someone would work on the failures.
16:50:37 <ihrachys> btw I talk about http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=4&fullscreen
16:50:56 <ihrachys> it's always 100% and makes me click through the other job names to see their results are 0%
16:51:23 <jlibosva> just send a patch to remove it and let's see who will complain :)
16:52:08 <ihrachys> ok ok
16:52:21 <ihrachys> #action ihrachys to remove pg job from periodics grafana board
16:52:46 <ihrachys> finally, fullstack is 100% but I said I will have a look so moving on
16:52:47 <ihrachys> #topic Gate bugs
16:52:52 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:52:58 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1696690
16:52:58 <openstack> Launchpad bug 1696690 in neutron "neutron fails to connect to q-agent-notifier-port-delete_fanout exchange" [Undecided,Confirmed]
16:53:03 <ihrachys> this was reported lately
16:53:23 <ihrachys> seems like it's affecting ironic
16:55:13 <ihrachys> it seems like a fanout queue was not created
16:55:23 <ihrachys> but shouldn't neutron-server itself initialize it on start?
16:58:15 <ihrachys> ok doesn't seem anyone has an idea :)
16:58:21 * jlibosva ¯\_(ツ)_/¯
16:58:41 <ihrachys> seems something ironic/grenade specific, and they may in the end need to be more active poking us in our channel to get more traction.
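For context on the exchange name in that bug: a minimal sketch, assuming stock oslo.messaging RPC, of how a port-delete notification is fanned out on the q-agent-notifier-port-delete topic (the "_fanout" suffix in the bug title is the rabbit driver's naming for the fanout side of a topic). The method name and payload below are illustrative, not necessarily neutron's exact RPC interface:

```python
# Sketch: neutron-server-style fanout cast to agents; the fanout queues and
# bindings behind the exchange are declared by the consumers (the agents),
# which is one way a missing consumer can surface as a missing fanout exchange.
import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic='q-agent-notifier-port-delete')
client = oslo_messaging.RPCClient(transport, target)


def notify_port_delete(context, port_id):
    # fanout=True broadcasts to every agent listening on the topic.
    cctxt = client.prepare(fanout=True)
    cctxt.cast(context, 'port_delete', port_id=port_id)
```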
16:58:56 <ihrachys> we have little time, so let's take those 2 mins we have back
16:59:00 <ihrachys> thanks everyone
16:59:03 <ihrachys> #endmeeting