16:02:25 <ihrachys> #startmeeting neutron_ci
16:02:26 <openstack> Meeting started Tue May 30 16:02:25 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:29 <openstack> The meeting name has been set to 'neutron_ci'
16:03:03 <ihrachys> #topic Action items from prev meeting
16:03:23 <ihrachys> the last time was 3 weeks ago, that's a long time
16:03:26 <ihrachys> first was: jlibosva to post a patch to disable port sec for trunk scenarios
16:03:46 <jlibosva> uh, I need to look into gerrit
16:03:59 <ihrachys> I see this https://review.openstack.org/#/c/462227/
16:04:04 <jlibosva> oh, I did :)
16:04:06 <jlibosva> yeah
16:04:33 <jlibosva> jenkins voted -1 but it seems it's not related to the change
16:04:35 <ihrachys> I see the test still fails
16:04:37 <ihrachys> in scenarip
16:04:42 <ihrachys> *scenario
16:04:49 <ihrachys> for dvr
16:04:57 <ihrachys> which suggests the patch hasn't helped?
16:05:07 <ihrachys> or wasn't it an expectation that it will heal the failure?
16:05:59 <jlibosva> yes, I hoped it will heal
16:06:54 <jlibosva> it seems that parent port didn't even get address from dhcp
16:07:44 <ihrachys> yeah
16:07:45 <ihrachys> Failed to start Raise network interfaces
16:07:50 <ihrachys> that's in console log for the instance
16:08:40 <jlibosva> I'll investigate that
16:08:45 <ihrachys> I guess we won't be able to get more data from inside the instance. the details are probably in journald.
16:08:53 <ihrachys> ok great, let's move on
16:09:20 <ihrachys> #action jlibosva to understand why instance failed to up networking in trunk conn test: https://review.openstack.org/#/c/462227/
16:09:39 <ihrachys> ok next is "jlibosva to talk to higher summit beings on python3 gate strategy for Pike"
16:10:24 <jlibosva> so there was a discussion regarding python3 goal
16:10:33 <jlibosva> as part of the forum
16:10:52 <jlibosva> the requirement is to have functional and integration test running python3
16:11:58 <jlibosva> as there is a lack of overview where other projects stand in the transition, the goal I think slips into Queens
16:12:38 <ihrachys> we talked with kevinbenton afterwards, and agreed that functional could be a good start because we control the pipeline there.
16:12:57 <ihrachys> as for integration, it's harder because we depend on broader strategy. we can't really move independently there.
16:12:59 <jlibosva> but I also briefly talked with kevinbenton about what tempest configuration makes most sense for us. to not consume gate resources, I think it makes sense to have at one tempest job
16:13:10 <jlibosva> yeah, functional first, that's for sure
16:14:00 <jlibosva> I was more wondering about the tempest configuration as neutron testing matrix is huge. So for tempest we'll go with multinode dvr - maybe +ha. The most complex scenario
16:14:22 <haleyb> +1 to that
16:14:51 <ihrachys> re ha: we don't have +ha anywhere. I would say that should be achieved first for py2 before we look to transit to py3.
16:15:22 <jlibosva> that's why the "maybe" word. :)
16:15:36 <ihrachys> ok. focus on functional for now, and we'll revisit tempest once closer to passing func.
16:15:42 <ihrachys> I started py3 work for func
16:15:44 <jlibosva> yep, I see you already started :)
16:15:52 <ihrachys> there is now an experimental job for that
16:15:59 <ihrachys> and we landed some fixes already
16:16:12 <ihrachys> I think it's still ~150 failures right now. but a lot of them are identical
16:16:24 <ihrachys> I would say, ~7 distinct failures probably
16:16:26 <jlibosva> maybe we could fetch a list of failures and put them in sort of groups with similar failures
16:16:34 <jlibosva> and split the failures among members
16:16:41 <jlibosva> so we don't work on the same failure
16:17:02 <ihrachys> this is the gerrit topic to use for py3 functional work: https://review.openstack.org/#/q/topic:func-py3
16:17:17 <ihrachys> jlibosva, that would be nice, yes. who's up for the job?
16:17:25 <jlibosva> I can take some failures
16:17:52 <jlibosva> or you mean to fetch the list?
16:17:56 <jlibosva> I can do that too
16:18:03 <ihrachys> yeah, fetch and categorize
16:18:08 <ihrachys> then we can spread the load
16:18:15 <ihrachys> and ask others to help
16:18:25 <jlibosva> ok, I'll make a list, on etherpad probably
16:18:32 <haleyb> i should be able to take some as well
16:18:35 <ihrachys> #action jlibosva to fetch and categorize functional py3 failures
16:19:08 <ihrachys> haleyb, nice, let's wait for the list and then see if we want to pull external help
16:19:14 <ihrachys> jlibosva, thanks for handling it
16:19:27 <ihrachys> jlibosva, please use the etherpad we had before
16:19:41 <jlibosva> yeah, makes sense
16:19:43 <ihrachys> I mean https://etherpad.openstack.org/p/py3-neutron-pike
16:20:16 <ihrachys> ok, next item was organizational and handled so I will skip it
16:20:19 <ihrachys> next is "jlibosva to report fullstack trunk failure bug once he has a reproducer"
16:20:28 <ihrachys> jlibosva, you have stuff on your plate don't you?:)
16:20:42 * ihrachys recovers context on that one, sec
16:20:49 <jlibosva> so if I remember correctly - I suspected ovs to delete bridge
16:21:16 <jlibosva> but then I realized fullstack runs in parallel and we don't have isolation of ovsdb data between fullstack ovs-agents
16:21:26 <ihrachys> this is the context: http://eavesdrop.openstack.org/meetings/neutron_ci/2017/neutron_ci.2017-05-02-16.01.log.html#l-122
16:22:03 <ihrachys> that's another trunk failure, now in fullstack
16:22:04 <jlibosva> which means - all ovs-agents running in fullstack will start processing all trunk ports
16:23:01 <ihrachys> hm. will they irrespective of l2 extension being on?
16:23:03 <jlibosva> and once trunk port is deleted, whoever gets the notification first from ip monitor will remove also the trunk bridge. As only one server is the source of truth, other servers will say they don't know this trunk
16:23:30 <jlibosva> as a consequence it leaves subport patch ports between br-int and trunk-bridge
16:23:41 <jlibosva> ihrachys: can you explain?
16:24:03 <ihrachys> oh sorry. you mean two ovs agents running in scope of the same test case?
16:24:09 <ihrachys> not cross test race?
16:25:33 <jlibosva> cross test race
16:26:20 <jlibosva> in case we have two agents in the same test, both agents will think they handle the trunk .. so they get correct information from server
16:26:21 <ihrachys> hm. how does the agent detect removal? independently of server?
16:26:41 <jlibosva> yep, it gets events from ip monitor that tap device was removed and will start processing trunk removal
16:26:55 <jlibosva> since all agents use the same ovsdb, all of them will get this event and start processing
16:27:10 <ihrachys> aha. and all of them monitor the same thing.
16:27:12 <ihrachys> makes sense
16:27:18 <jlibosva> there were attempts in the past to have multiple ovsdb running but it didn't go well
16:27:34 <jlibosva> I don't know any details though
16:28:07 <ihrachys> it may make sense to talk to ovn/ovs folks, they should know better.
16:28:14 <jlibosva> ovs devs said ovsdb has never been designed to run with other ovsdb process on the same node
16:29:02 <ihrachys> I would start with reporting a bug and collecting ideas from ovs folks.
16:29:26 <ihrachys> if nothing else, maybe we can somehow tell agents to handle their own devices only (like passing a prefix for devices)
16:30:41 <ihrachys> jlibosva, what's the next step?
16:31:59 <jlibosva> ihrachys: I was thinking about hacked l2 agent for fullstack that would react only on its resources. But I don't like the idea that we have hacked agents (we already have l3 and dhcp). In the end we end up testing hacked neutron but not neutron
16:32:37 <jlibosva> I also vaguely remember otherwiseguy saying something about having ovsdb events per agent. I'll need to talk to him
16:33:04 <ihrachys> yeah, but if we are limited by ovs, we can make a workaround. ofc first thing would be asking for alternatives/experimenting with isolated ovsdb.
16:33:51 <ihrachys> #action jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'
16:34:27 <ihrachys> I guess we covered all items from the prev meeting
16:34:30 <ihrachys> #topic Grafana
16:34:35 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:43 <ihrachys> I don't see anything horrific there
16:34:58 <ihrachys> just scenarios and fullstack failing for reasons already discussed (all trunk)
16:35:12 <ihrachys> functional job stays stable despite gating.
16:35:26 <jlibosva> there was a functional pike yesterday
16:35:28 <ihrachys> ofc sometimes it flickers but not a lot
16:35:38 <ihrachys> gate or check?
16:35:50 <jlibosva> check
16:35:54 <jlibosva> let me check the gate
16:36:17 <ihrachys> hm yeah I see it 30% yesterday. do we have an idea?
16:37:20 <jlibosva> I didn't look at failures. Also the gate has some kind of pike. makes sense as gate won't get triggered if check is busted
16:37:29 <ihrachys> I actually see one failure that doesn't seem neutron specific. like:
16:37:30 <ihrachys> http://logs.openstack.org/33/468833/2/check/gate-neutron-dsvm-functional-ubuntu-xenial/dccf45b/console.html#_2017-05-30_13_28_43_886898
16:37:36 <ihrachys> and I saw that in the past
16:37:43 <ihrachys> for some reason mostly in functional job.
16:38:09 <ihrachys> I believe there was some infra issue with shade that infra tackled the prev week. could be a fallout.
16:38:35 <ihrachys> the reason why it hits functional job can be because of how we (ab)use devstack to deploy rootwrap/deps in the job.
16:38:57 <ihrachys> still would make sense to have a look closer.
16:39:01 <ihrachys> I will take it
16:39:05 <clarkb> ihrachys: I don't think that it is related to shade. Since that is devstack checking the local machine to determine its IP
16:39:24 <ihrachys> #action ihrachys to understand why functional job spiked on weekend
16:39:39 <clarkb> ihrachys: that looks like a bug in devstack unable to handle the ip ranges in citycloud. I can poke a bit more
16:39:47 <ihrachys> clarkb, in regular devstack-gate runs, do we preconfigure the IP in localrc?
16:40:01 <clarkb> ihrachys: I think we may. Though someone wanted to stop doing it
16:40:54 <ihrachys> ok. maybe this logic is not triggered in regular jobs and hence doesn't bother other projects. I will poke clarkb once I have a grasp of impact and read d-g.
16:41:20 <ihrachys> clarkb, citycloud, is it something new?
16:42:06 <clarkb> yes it is a new cloud we are running jobs in
16:42:09 <clarkb> with 4 regions
16:44:53 <ihrachys> ok
16:45:22 <ihrachys> #topic Gate setup
16:45:38 <ihrachys> I was looking at the jobs that we have in check queue lately
16:45:42 <ihrachys> f.e. see in https://review.openstack.org/#/c/468056/
16:45:49 <ihrachys> and I see a lot of -nv jobs
16:45:58 <ihrachys> I wonder if we need them all.
16:46:12 <ihrachys> for one, what's gate-tempest-dsvm-neutron-identity-v3-only-full-ubuntu-xenial-nv ?
16:46:24 <ihrachys> hasn't we enabled v3 in gate for other jobs?
16:47:03 <clarkb> ihrachys: yes I think devstack is v3 by default now, so we can likely drop those identity-v3 jobs across the board. Double check with keystone and qa?
16:48:03 <ihrachys> clarkb, thanks for the info. I will check.
16:48:32 <ihrachys> then there is gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv that seems passing (not sure how stable it is)
16:48:56 <haleyb> ihrachys: i think that's the one i just added
16:49:46 <ihrachys> hm nice. so you are monitoring it?
16:50:36 <haleyb> yes, i've been watching, it generally rises and falls with all the other jobs
16:50:40 <ihrachys> haleyb, what's your plan re this job? will it replace any existing ones?
16:51:35 <haleyb> ihrachys: it replaced another one, basically swapped it to the experimental queue
16:51:54 <ihrachys> yeah but I mean, will it be promoted to voting (replacing something?)
16:52:59 <haleyb> it should be able to replace the dvr-multinode tempest job if the failure rates are equal, i will look how it's behaved recently
16:53:57 <ihrachys> cool
16:54:05 <ihrachys> yeah I think dvr-multinode are good candidates.
16:54:22 <ihrachys> #action haleyb to monitor dvr+ha job and maybe replace existing dvr-multinode
16:54:33 <haleyb> i don't know what the multinode-full job difference is, so there might be multiple overlaps
16:54:37 <ihrachys> #action ihrachys to talk to qa/keystone and maybe remove v3-only job
16:55:49 <clarkb> haleyb: the regular multinode-full job is going to run the full tempest suite against multinode setup without dvr
16:56:05 <clarkb> haleyb: in that case controller node is also a network node and all things terminate there
16:56:40 <clarkb> assuming dvr works with live migration we can likely drop non dvr multinode everywhere as its definitely not as interesting as the dvr multinode case I think
16:57:14 <haleyb> so i guess we have to keep that non-dvr job then, but for the rest might as well have all the moving parts
16:57:39 <ihrachys> I feel there is some scoping/writing up to do. the set of l3 job flavours becomes hard to comprehend.
16:58:25 <ihrachys> once we would have the scope, we could bring it to wider team to see what we can afford to drop
16:58:55 <haleyb> yes, i will track down all the settings, perhaps something like the single-node one can go away in favor of the multinode (non-dvr), for example
16:58:56 <ihrachys> one can argue both ways - either that legacy mode should stay covered, or that dvr only should stay (in addition to the new dvr/ha)
16:59:38 <ihrachys> haleyb, afaik single node was a requirement for integrated gate in the past, but maybe now that we have multinode, it can go away.
16:59:48 <ihrachys> ok we are at the top of the hour
17:00:14 <ihrachys> #action haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim
17:00:21 <ihrachys> thanks folks
17:00:22 <ihrachys> #endmeeting
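
Note on the 16:29:26 idea (agents handling only their own devices via a device-name prefix): a minimal sketch of what such filtering could look like, assuming each fullstack 'machine' is given its own prefix. The names DeviceEvent and events_for_agent, and the "host1-"/"host2-" prefixes, are hypothetical and not taken from neutron; the real agent and its event source are not modelled here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DeviceEvent:
    """Hypothetical stand-in for a 'device removed' notification."""
    device_name: str
    action: str  # e.g. "removed"


def events_for_agent(events: Iterable[DeviceEvent], prefix: str) -> List[DeviceEvent]:
    """Keep only the events whose device name starts with this agent's prefix."""
    return [e for e in events if e.device_name.startswith(prefix)]


# Two fullstack 'machines' see the same event stream (hypothetical names).
shared_events = [
    DeviceEvent("host1-tap0", "removed"),
    DeviceEvent("host2-tap0", "removed"),
]

# The agent for host1 would only act on its own device.
host1_events = events_for_agent(shared_events, prefix="host1-")
print([e.device_name for e in host1_events])
```

With a filter like this, an agent would ignore removal events for devices created by other test cases even though all agents share one ovsdb. Whether this belongs in a fullstack-only agent or somewhere else is the open question from the discussion above.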