16:02:25 <ihrachys> #startmeeting neutron_ci
16:02:26 <openstack> Meeting started Tue May 30 16:02:25 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:29 <openstack> The meeting name has been set to 'neutron_ci'
16:03:03 <ihrachys> #topic Action items from prev meeting
16:03:23 <ihrachys> the last time was 3 weeks ago, that's a long time
16:03:26 <ihrachys> first was: jlibosva to post a patch to disable port sec for trunk scenarios
16:03:46 <jlibosva> uh, I need to look into gerrit
16:03:59 <ihrachys> I see this https://review.openstack.org/#/c/462227/
16:04:04 <jlibosva> oh, I did :)
16:04:06 <jlibosva> yeah
16:04:33 <jlibosva> jenkins voted -1 but it seems it's not related to the change
16:04:35 <ihrachys> I see the test still fails
16:04:37 <ihrachys> in scenario
16:04:49 <ihrachys> for dvr
16:04:57 <ihrachys> which suggests the patch hasn't helped?
16:05:07 <ihrachys> or wasn't it an expectation that it would heal the failure?
16:05:59 <jlibosva> yes, I hoped it would heal
16:06:54 <jlibosva> it seems that parent port didn't even get address from dhcp
16:07:44 <ihrachys> yeah
16:07:45 <ihrachys> Failed to start Raise network interfaces
16:07:50 <ihrachys> that's in console log for the instance
16:08:40 <jlibosva> I'll investigate that
16:08:45 <ihrachys> I guess we won't be able to get more data from inside the instance. the details are probably in journald.
16:08:53 <ihrachys> ok great, let's move on
16:09:20 <ihrachys> #action jlibosva to understand why instance failed to up networking in trunk conn test: https://review.openstack.org/#/c/462227/
16:09:39 <ihrachys> ok next is "jlibosva to talk to higher summit beings on python3 gate strategy for Pike"
16:10:24 <jlibosva> so there was a discussion regarding python3 goal
16:10:33 <jlibosva> as part of the forum
16:10:52 <jlibosva> the requirement is to have functional and integration test running python3
16:11:58 <jlibosva> as there is a lack of overview of where other projects stand in the transition, I think the goal slips into Queens
16:12:38 <ihrachys> we talked with kevinbenton afterwards, and agreed that functional could be a good start because we control the pipeline there.
16:12:57 <ihrachys> as for integration, it's harder because we depend on broader strategy. we can't really move independently there.
16:12:59 <jlibosva> but I also briefly talked with kevinbenton about what tempest configuration makes most sense for us. to not consume gate resources, I think it makes sense to have just one tempest job
16:13:10 <jlibosva> yeah, functional first, that's for sure
16:14:00 <jlibosva> I was more wondering about the tempest configuration as the neutron testing matrix is huge. So for tempest we'll go with multinode dvr - maybe +ha. The most complex scenario
16:14:22 <haleyb> +1 to that
16:14:51 <ihrachys> re ha: we don't have +ha anywhere. I would say that should be achieved first for py2 before we look to transition to py3.
16:15:22 <jlibosva> that's why the "maybe" word. :)
16:15:36 <ihrachys> ok. focus on functional for now, and we'll revisit tempest once closer to passing func.
16:15:42 <ihrachys> I started py3 work for func
16:15:44 <jlibosva> yep, I see you already started :)
16:15:52 <ihrachys> there is now an experimental job for that
16:15:59 <ihrachys> and we landed some fixes already
16:16:12 <ihrachys> I think it's still ~150 failures right now. but a lot of them are identical
16:16:24 <ihrachys> I would say, ~7 distinct failures probably
16:16:26 <jlibosva> maybe we could fetch a list of failures and put them in sort of groups with similar failures
16:16:34 <jlibosva> and split the failures among members
16:16:41 <jlibosva> so we don't work on the same failure
16:17:02 <ihrachys> this is the gerrit topic to use for py3 functional work: https://review.openstack.org/#/q/topic:func-py3
16:17:17 <ihrachys> jlibosva, that would be nice, yes. who's up for the job?
16:17:25 <jlibosva> I can take some failures
16:17:52 <jlibosva> or you mean to fetch the list?
16:17:56 <jlibosva> I can do that too
16:18:03 <ihrachys> yeah, fetch and categorize
16:18:08 <ihrachys> then we can spread the load
16:18:15 <ihrachys> and ask others to help
16:18:25 <jlibosva> ok, I'll make a list, on etherpad probably
16:18:32 <haleyb> i should be able to take some as well
16:18:35 <ihrachys> #action jlibosva to fetch and categorize functional py3 failures
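[Editor's note: the triage jlibosva agrees to here (fetch the failure list, bucket similar failures, split them among members) can be sketched as below. This is a hypothetical illustration, not neutron tooling; the `failures` dict stands in for test names and tracebacks parsed from the job's subunit output.]

```python
# Group test failures by the last traceback line, which is usually a
# good fingerprint for "same root cause" (e.g. the same TypeError hit
# by many py3 functional tests).
from collections import defaultdict

def group_failures(failures):
    """failures: dict mapping test name -> full traceback text."""
    groups = defaultdict(list)
    for test, traceback_text in failures.items():
        signature = traceback_text.strip().splitlines()[-1]
        groups[signature].append(test)
    return dict(groups)

failures = {
    "test_a": "...\nTypeError: a bytes-like object is required",
    "test_b": "...\nTypeError: a bytes-like object is required",
    "test_c": "...\nUnicodeDecodeError: 'ascii' codec can't decode",
}
for signature, tests in group_failures(failures).items():
    print(signature, "->", sorted(tests))
```

With ~150 raw failures collapsing into ~7 distinct signatures, each bucket can then become one etherpad entry and be assigned to one person.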
16:19:08 <ihrachys> haleyb, nice, let's wait for the list and then see if we want to pull external help
16:19:14 <ihrachys> jlibosva, thanks for handling it
16:19:27 <ihrachys> jlibosva, please use the etherpad we had before
16:19:41 <jlibosva> yeah, makes sense
16:19:43 <ihrachys> I mean https://etherpad.openstack.org/p/py3-neutron-pike
16:20:16 <ihrachys> ok, next item was organizational and handled so I will skip it
16:20:19 <ihrachys> next is "jlibosva to report fullstack trunk failure bug once he has a reproducer"
16:20:28 <ihrachys> jlibosva, you have stuff on your plate don't you?:)
16:20:42 * ihrachys recovers context on that one, sec
16:20:49 <jlibosva> so if I remember correctly - I suspected ovs to delete bridge
16:21:16 <jlibosva> but then I realized fullstack runs in parallel and we don't have isolation of ovsdb data between fullstack ovs-agents
16:21:26 <ihrachys> this is the context: http://eavesdrop.openstack.org/meetings/neutron_ci/2017/neutron_ci.2017-05-02-16.01.log.html#l-122
16:22:03 <ihrachys> that's another trunk failure, now in fullstack
16:22:04 <jlibosva> which means - all ovs-agents running in fullstack will start processing all trunk ports
16:23:01 <ihrachys> hm. will they irrespective of l2 extension being on?
16:23:03 <jlibosva> and once trunk port is deleted, whoever gets the notification first from ip monitor will remove also the trunk bridge. As only one server is the source of truth, other servers will say they don't know this trunk
16:23:30 <jlibosva> as a consequence it leaves subport patch ports between br-int and trunk-bridge
16:23:41 <jlibosva> ihrachys: can you explain?
16:24:03 <ihrachys> oh sorry. you mean two ovs agents running in scope of the same test case?
16:24:09 <ihrachys> not cross test race?
16:25:33 <jlibosva> cross test race
16:26:20 <jlibosva> in case we have two agents in the same test, both agents will think they handle the trunk .. so they get correct information from server
16:26:21 <ihrachys> hm. how does the agent detect removal? independently of server?
16:26:41 <jlibosva> yep, it gets events from ip monitor that tap device was removed and will start processing trunk removal
16:26:55 <jlibosva> since all agents use the same ovsdb, all of them will get this event and start processing
16:27:10 <ihrachys> aha. and all of them monitor the same thing.
16:27:12 <ihrachys> makes sense
16:27:18 <jlibosva> there were attempts in the past to have multiple ovsdb running but it didn't go well
16:27:34 <jlibosva> I don't know any details though
16:28:07 <ihrachys> it may make sense to talk to ovn/ovs folks, they should know better.
16:28:14 <jlibosva> ovs devs said ovsdb has never been designed to run with other ovsdb process on the same node
16:29:02 <ihrachys> I would start with reporting a bug and collecting ideas from ovs folks.
16:29:26 <ihrachys> if nothing else, maybe we can somehow tell agents to handle their own devices only (like passing a prefix for devices)
16:30:41 <ihrachys> jlibosva, what's the next step?
16:31:59 <jlibosva> ihrachys: I was thinking about a hacked l2 agent for fullstack that would react only to its own resources. But I don't like the idea of having hacked agents (we already have l3 and dhcp). In the end we end up testing a hacked neutron, not neutron
16:32:37 <jlibosva> I also vaguely remember otherwiseguy saying something about having ovsdb events per agent. I'll need to talk to him
16:33:04 <ihrachys> yeah, but if we are limited by ovs, we can make a workaround. ofc first thing would be asking for alternatives/experimenting with isolated ovsdb.
16:33:51 <ihrachys> #action jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'
16:34:27 <ihrachys> I guess we covered all items from the prev meeting
16:34:30 <ihrachys> #topic Grafana
16:34:35 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:43 <ihrachys> I don't see anything horrific there
16:34:58 <ihrachys> just scenarios and fullstack failing for reasons already discussed (all trunk)
16:35:12 <ihrachys> functional job stays stable despite gating.
16:35:26 <jlibosva> there was a functional spike yesterday
16:35:28 <ihrachys> ofc sometimes it flickers but not a lot
16:35:38 <ihrachys> gate or check?
16:35:50 <jlibosva> check
16:35:54 <jlibosva> let me check the gate
16:36:17 <ihrachys> hm yeah I see it 30% yesterday. do we have an idea?
16:37:20 <jlibosva> I didn't look at failures. Also the gate has some kind of spike. makes sense as gate won't get triggered if check is busted
16:37:29 <ihrachys> I actually see one failure that doesn't seem neutron specific. like:
16:37:30 <ihrachys> http://logs.openstack.org/33/468833/2/check/gate-neutron-dsvm-functional-ubuntu-xenial/dccf45b/console.html#_2017-05-30_13_28_43_886898
16:37:36 <ihrachys> and I saw that in the past
16:37:43 <ihrachys> for some reason mostly in functional job.
16:38:09 <ihrachys> I believe there was some infra issue with shade that infra tackled the prev week. could be a fallout.
16:38:35 <ihrachys> the reason it hits the functional job could be how we (ab)use devstack to deploy rootwrap/deps in the job.
16:38:57 <ihrachys> still would make sense to have a look closer.
16:39:01 <ihrachys> I will take it
16:39:05 <clarkb> ihrachys: I don't think that it is related to shade. Since that is devstack checking the local machine to determine its IP
16:39:24 <ihrachys> #action ihrachys to understand why functional job spiked on weekend
16:39:39 <clarkb> ihrachys: that looks like a bug in devstack unable to handle the ip ranges in citycloud. I can poke a bit more
16:39:47 <ihrachys> clarkb, in regular devstack-gate runs, do we preconfigure the IP in localrc?
16:40:01 <clarkb> ihrachys: I think we may. Though someone wanted to stop doing it
16:40:54 <ihrachys> ok. maybe this logic is not triggered in regular jobs and hence doesn't bother other projects. I will poke clarkb once I have a grasp of impact and read d-g.
16:41:20 <ihrachys> clarkb, citycloud, is it something new?
16:42:06 <clarkb> yes it is a new cloud we are running jobs in
16:42:09 <clarkb> with 4 regions
16:44:53 <ihrachys> ok
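[Editor's note: the failure discussed here involves devstack determining the local machine's IP when it isn't preconfigured in localrc. A common technique for that kind of discovery, shown below purely as a hedged illustration of the idea (not devstack's actual implementation), is to connect a UDP socket toward an external address and read back the source address the kernel picks; `connect()` on a UDP socket sends no packets.]

```python
import socket

def guess_host_ip(probe_addr="203.0.113.1"):
    """Return the local IP the routing table would use toward probe_addr."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((probe_addr, 53))
        return s.getsockname()[0]
    finally:
        s.close()
```

A scheme like this can misbehave on clouds with unusual address ranges (as clarkb suggests happened in citycloud), which is why preconfiguring the IP in localrc sidesteps the problem.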
16:45:22 <ihrachys> #topic Gate setup
16:45:38 <ihrachys> I was looking at the jobs that we have in check queue lately
16:45:42 <ihrachys> f.e. see in https://review.openstack.org/#/c/468056/
16:45:49 <ihrachys> and I see a lot of -nv jobs
16:45:58 <ihrachys> I wonder if we need them all.
16:46:12 <ihrachys> for one, what's gate-tempest-dsvm-neutron-identity-v3-only-full-ubuntu-xenial-nv ?
16:46:24 <ihrachys> haven't we enabled v3 in gate for other jobs?
16:47:03 <clarkb> ihrachys: yes I think devstack is v3 by default now, so we can likely drop those identity-v3 jobs acorss the board. Double check with keystone and qa ?
16:48:03 <ihrachys> clarkb, thanks for the info. I will check.
16:48:32 <ihrachys> then there is gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv that seems passing (not sure how stable it is)
16:48:56 <haleyb> ihrachys: i think that's the one i just added
16:49:46 <ihrachys> hm nice. so you are monitoring it?
16:50:36 <haleyb> yes, i've been watching, it generally rises and falls with all the other jobs
16:50:40 <ihrachys> haleyb, what's your plan re this job? will it replace any existing ones?
16:51:35 <haleyb> ihrachys: it replaced another one, basically swapped it to the experimental queue
16:51:54 <ihrachys> yeah but I mean, will it be promoted to voting (replacing something?)
16:52:59 <haleyb> it should be able to replace the dvr-multinode tempest job if the failure rates are equal, i will look at how it's behaved recently
16:53:57 <ihrachys> cool
16:54:05 <ihrachys> yeah I think dvr-multinode are good candidates.
16:54:22 <ihrachys> #action haleyb to monitor dvr+ha job and maybe replace existing dvr-multinode
16:54:33 <haleyb> i don't know what the multinode-full job difference is, so there might be multiple overlaps
16:54:37 <ihrachys> #action ihrachys to talk to qa/keystone and maybe remove v3-only job
16:55:49 <clarkb> haleyb: the regular multinode-full job is going to run the full tempest suite against multinode setup without dvr
16:56:05 <clarkb> haleyb: in that case controller node is also a network node and all things terminate there
16:56:40 <clarkb> assuming dvr works with live migration we can likely drop non dvr multinode everywhere as its definitely not as interesting as the dvr multinode case I think
16:57:14 <haleyb> so i guess we have to keep that non-dvr job then, but for the rest might as well have all the moving parts
16:57:39 <ihrachys> I feel there is some scoping/writing up to do. the set of l3 job flavours becomes hard to comprehend.
16:58:25 <ihrachys> once we would have the scope, we could bring it to wider team to see what we can afford to drop
16:58:55 <haleyb> yes, i will track down all the settings, perhaps something like the single-node one can go away in favor of the multinode (non-dvr), for example
16:58:56 <ihrachys> one can argue both ways - either that legacy mode should stay covered, or that dvr only should stay (in addition to the new dvr/ha)
16:59:38 <ihrachys> haleyb, afaik single node was a requirement for integrated gate in the past, but maybe now that we have multinode, it can go away.
16:59:48 <ihrachys> ok we are at the top of the hour
17:00:14 <ihrachys> #action haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim
17:00:21 <ihrachys> thanks folks
17:00:22 <ihrachys> #endmeeting