16:01:47 <ihrachys> #startmeeting neutron_ci
16:01:49 <openstack> Meeting started Tue Mar 13 16:01:47 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:52 <openstack> The meeting name has been set to 'neutron_ci'
16:01:56 <mlavalle> o/
16:02:13 <jlibosva> o/
16:02:16 <ihrachys> I will need to drop off today in the middle of the meeting so I would like someone to take over the chair from there
16:02:28 <ihrachys> volunteers welcome
16:02:37 <ihrachys> #topic Actions from prev meeting
16:02:45 <ihrachys> "slaweq to check why it takes too long to raise interfaces in linuxbridge scenarios job, and to compare with dvr"
16:03:24 <ihrachys> slaweq, around?
16:03:57 <mlavalle> he was here 10 minutes ago
16:04:03 <mlavalle> we finished the QoS meeting
16:04:05 <jlibosva> I think we agreed last time it was because of multinode?
16:04:23 <jlibosva> that dvr is multinode while linuxbridge is allinone
16:05:08 <ihrachys> yeah though I am not sure if we made any progress in terms of patches to bump timeout
16:05:31 <slaweq> hi
16:05:33 <jlibosva> iirc it was merged
16:05:34 <slaweq> sorry for being late
16:05:51 <ihrachys> https://review.openstack.org/550832 ?
16:05:59 <jlibosva> yep
16:06:08 <slaweq> ihrachys: the only thing I could do was increase the ssh timeout
16:06:21 <ihrachys> ok great. I also noticed on grafana that the linuxbridge job is very stable now
16:06:30 <ihrachys> so that probably actually helped
16:06:31 <slaweq> from the logs I checked, it didn't look like there was any issue with the neutron agents or anything like that
16:06:41 <ihrachys> great work slaweq
16:06:50 <slaweq> thx
16:06:59 <mlavalle> he got an ovation for this in the Neutron meeting
16:07:07 <slaweq> LOL
16:07:11 <mlavalle> standing ovation
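For context, the fix discussed above raises tempest's SSH timeout for the scenario job. A rough sketch of what such a change can look like for a zuul v3 devstack-based job follows; the job name and the value 180 are illustrative only, the actual change is in https://review.openstack.org/550832:

    # Illustrative sketch only -- the real job name, structure and value may differ.
    - job:
        name: neutron-tempest-plugin-scenario-linuxbridge
        parent: neutron-tempest-plugin-scenario
        vars:
          devstack_local_conf:
            test-config:
              "$TEMPEST_CONFIG":
                validation:
                  ssh_timeout: 180  # give slow-to-come-up interfaces more time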
16:07:19 <ihrachys> next was "ihrachys to check grafana stats several days later when dust settles"
16:07:21 <jlibosva> indeed :)
16:07:27 <ihrachys> that's about fullstack / dvr instability
16:07:47 <ihrachys> dvr scenarios are still problematic, but fullstack seems to be more stable than functional now for a week or so
16:07:57 <slaweq> \o/
16:08:03 <ihrachys> looks like if we keep functional voting we should definitely have fullstack too
16:08:10 <ihrachys> we can check charts later
16:08:20 <ihrachys> "jlibosva to look into agent startup failure and missing logs in: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/"
16:08:45 <jlibosva> I tried to reproduce the l3 agent failing to start locally but no success
16:08:59 <jlibosva> I posted a patch to log failed processes though, I think it merged
16:09:13 <jlibosva> https://review.openstack.org/#/c/550566/
16:09:29 <slaweq> maybe it was an issue in some other package that was fixed in the meantime
16:10:04 <ihrachys> nice, we will revisit it the next time it hits
16:10:26 <ihrachys> these are all the action items we had
16:10:30 <ihrachys> #topic Grafana
16:10:37 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:11:15 <ihrachys> as I said before, fullstack looks very nice. there were some failure bumps during the week but they match with other bumps for other jobs
16:11:25 <ihrachys> now fullstack is at 6% or something like that
16:11:36 <ihrachys> in contrast functional is almost 20%
16:12:19 <ihrachys> and as for tempest, linuxbridge scenarios are looking very good
16:12:36 <ihrachys> like 2-3% now
16:12:52 <ihrachys> in contrast to dvr scenarios that are still 35%
16:13:04 <slaweq> I hope it will stay like that now
16:13:20 <slaweq> when I was checking during the weekend with the longer ssh timeout, it was passing every time
16:13:22 <ihrachys> and dvr-ha job is also at the same high failure rate
16:13:50 <slaweq> I think I saw some trunk-related test in dvr failing often
16:14:18 <ihrachys> another candidate with reasonable behavior is -ovsfw-, though it will need more monitoring, I believe
16:14:29 <ihrachys> we can dive in each type
16:14:32 <ihrachys> #topic Fullstack
16:14:50 <ihrachys> considering that the job has been quite stable for a while, should we make it voting now?
16:14:59 <ihrachys> we are at the start of cycle so that's good
16:15:28 <slaweq> sounds good to me
16:15:31 <slaweq> we can try
16:15:42 <jlibosva> would that require also including fullstack in the gate queue?
16:16:04 <ihrachys> well we can experiment with partial enablement like we did with functional of course
16:16:27 <ihrachys> not that I am suggesting we do it
16:17:10 <ihrachys> mlavalle, thoughts
16:17:43 <mlavalle> I'd say go for it
16:17:59 <ihrachys> slaweq, do you want the honor of posting the patch?
16:18:26 <slaweq> ihrachys: I would love to :)
16:18:28 <slaweq> thx
16:18:40 <ihrachys> #action slaweq to enable voting for fullstack
16:18:52 <ihrachys> mlavalle, jlibosva so do we go with both queues or check only?
16:19:05 <mlavalle> let's start with check
16:19:10 <slaweq> ++
16:19:16 <ihrachys> ok
16:19:17 <slaweq> let's do it in small steps
16:19:31 <ihrachys> we will revisit gate in like 2 weeks then
16:19:37 <jlibosva> that's what my question was about above; I think there is a "rule" that voting jobs should be in the gate queue too, isn't there?
16:19:48 <ihrachys> jlibosva, there is a tradition for sure
16:19:58 <ihrachys> but we had functional not in gate but in check for some time
16:20:17 <ihrachys> we can post and see if infra objects
16:20:22 <ihrachys> I don't mind enabling in both queues
16:20:56 <slaweq> there is also neutron-rally-neutron which is not in gate currently
16:21:01 <slaweq> and is voting in check queue
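For reference, making a job voting in the check queue only (the plan here for fullstack, and later in the meeting for the linuxbridge scenario job) is a small project-pipeline change in neutron's zuul configuration. A rough sketch, assuming the usual zuul v3 layout; neutron's actual .zuul.yaml may be organized differently:

    # Sketch: listing a job in check without "voting: false" makes it voting;
    # leaving it out of the gate pipeline keeps it from gating.
    - project:
        check:
          jobs:
            - neutron-fullstack
        gate:
          jobs:
            - openstack-tox-py27  # existing gate jobs stay as they are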
16:21:42 <ihrachys> I will skip discussion of fullstack failures for today
16:21:52 <ihrachys> #topic Rally
16:22:07 <ihrachys> do we know the background behind rally not gating?
16:22:20 <ihrachys> is it because it breaks from time to time due to instabilities in rally itself?
16:22:53 <ihrachys> I remember there were multiple cases when a patch in rally broke our gates
16:22:58 <ihrachys> so maybe that
16:23:21 <ihrachys> I am not too excited to enable gating for it
16:23:27 <ihrachys> I mean, gate queue
16:23:32 <mlavalle> I don't know the details, but it must be that
16:23:57 <ihrachys> ok. I would focus on other jobs for now where we have better understanding and control.
16:24:03 <ihrachys> #topic Scenarios
16:24:08 <clarkb> glance recently removed rally because it wasn't working at all for them
16:24:12 <clarkb> (just a datapoint)
16:24:26 <mlavalle> good to know, clarkb
16:24:57 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:25:32 <ihrachys> considering linuxbridge scenarios job is in good shape (failure spikes are reflected in other jobs, it's at 5% right now), do we want to consider it for voting too?
16:25:55 <mlavalle> yes
16:26:02 <mlavalle> let's go for it
16:26:07 <mlavalle> early in the cycle
16:26:18 <ihrachys> objections?
16:26:31 <slaweq> also only for the check queue for now?
16:26:47 * ihrachys waves at rossella_s
16:27:14 <ihrachys> I would imagine yes, check queue for now
16:27:16 <rossella_s> ihrachys, hi!
16:27:29 <ihrachys> ok seems like no objections
16:27:34 <mlavalle> hola rossella_s
16:27:48 <ihrachys> #action slaweq to enable voting for linuxbridge scenarios
16:27:59 <slaweq> ok, sure
16:28:11 <rossella_s> mlavalle, hola :)
16:28:18 <slaweq> hi rossella_s
16:28:27 <ihrachys> slaweq, can you take over the meeting from here? I gotta bail out.
16:28:45 <slaweq> sure
16:28:47 <ihrachys> #chair slaweq
16:28:48 <openstack> Current chairs: ihrachys slaweq
16:28:55 <slaweq> do you have a link to the agenda somewhere?
16:29:02 <ihrachys> there is no agenda :)
16:29:19 <mlavalle> we can follow previous meetings
16:29:24 <ihrachys> but basically look at dvr scenarios that would be the main thing :)
16:29:27 <slaweq> ok, sure
16:29:36 <ihrachys> ok bye folks!
16:29:36 <slaweq> I will take care of it
16:29:41 <slaweq> bye ihrachys
16:29:54 <jlibosva> o/
16:30:14 <slaweq> so moving on with scenario jobs
16:30:29 <slaweq> dvr and dvr-ha still have a high failure rate
16:31:10 <slaweq> about dvr: when I was testing the ssh timeout patch, I hit the same errors a few times, like in http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6186573/logs/testr_results.html.gz
16:32:00 <slaweq> or http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/136c7ef/logs/testr_results.html.gz
16:32:10 <slaweq> but every time it was a failure with a trunk port
16:32:16 <jlibosva> is it just test_subport_connectivity or also lifecycle?
16:32:30 <jlibosva> ah, second link answers my question :) sorry
16:32:33 <mlavalle> yeah, it's trunk ports
16:32:34 <slaweq> jlibosva: in the second link it was lifecycle
16:32:47 <slaweq> any ideas about that?
16:33:07 <slaweq> in the third one there were even both of them: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:33:52 <slaweq> I just found another example of the same tests failing
16:34:06 <slaweq> do you have the time/resources to check that this week?
16:34:49 <slaweq> I don't have a dvr environment and don't have experience with either dvr or trunk, so it might be hard for me
16:36:43 <haleyb> I know dvr but not trunk, do you just need a multi-node environment?
16:36:55 <jlibosva> I know trunk and a bit of dvr :)
16:37:03 <slaweq> so we have a winner :)
16:37:30 <jlibosva> I can have a look and if I have issues with dvr, I'll poke haleyb
16:37:45 <slaweq> thx jlibosva
16:38:14 <slaweq> #action jlibosva to take a look at the dvr trunk tests issue
16:39:04 <slaweq> I'm trying to find some failed neutron-tempest-dvr-ha-multinode-full tests now, as this one also fails around 50% of the time
16:40:17 <slaweq> I found one http://logs.openstack.org/14/529814/5/check/neutron-tempest-dvr-ha-multinode-full/d8cfbdf/logs/testr_results.html.gz
16:40:30 <slaweq> but this one doesn't look related to neutron
16:40:52 <slaweq> do you maybe have any other example?
16:41:24 <haleyb> that might be an old bug?  https://bugs.launchpad.net/neutron/+bug/1717302
16:41:25 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:44:06 <slaweq> haleyb: will you continue working on it?
16:44:19 <jlibosva> do we have a logstash query for the above?
16:44:38 <slaweq> I don't know
16:45:20 <mlavalle> No we don't have one
16:45:56 <haleyb> slaweq: I have not reproduced it locally yet
16:46:50 <jlibosva> maybe this could be used? http://bit.ly/2Fz6FPs
16:47:32 <jlibosva> it looks to me like the issue is gone
16:48:01 <slaweq> jlibosva: are you sure the logs from this job are properly indexed in logstash?
16:48:08 <mlavalle> no hits over the past 7 days
16:48:12 <slaweq> AFAIR there was some issue with fullstack for example
16:48:36 <jlibosva> it's a normal tempest job and afaik all service logs are indexed
16:48:44 <slaweq> ok, it should be
16:49:01 <jlibosva> or is the error coming from neutron-keepalived-state-change ?
16:49:06 <slaweq> ok, I will try to find more examples of failures for this job
16:49:20 <jlibosva> maybe that one is not indexed, I don't know :)
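A logstash query along the following lines could answer whether the failure still occurs; build_name and build_status are standard fields in the OpenStack logstash index, while the message term here is only a guess and would need tuning to the actual failing test names:

    build_name:"neutron-tempest-dvr-ha-multinode-full" AND build_status:"FAILURE" AND message:"test_floatingip"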
16:49:37 <slaweq> and if I find something that happens often, I will just report it as a bug
16:49:41 <slaweq> what do you think?
16:49:53 <mlavalle> sounds good
16:50:21 <slaweq> #action slaweq check reasons of failures of neutron-tempest-dvr-ha-multinode-full
16:51:09 <slaweq> so, the other tempest jobs are below a 10% failure rate, so that's fine IMO
16:52:02 <slaweq> periodic jobs look fine as well
16:52:31 <slaweq> do you have anything to add?
16:53:00 <mlavalle> I don't
16:53:22 <slaweq> #topic Open discussion
16:53:50 <slaweq> I just want to say that I found today that our scenario jobs are still using the neutron-legacy devstack lib
16:54:00 <slaweq> so I want to switch it to lib/neutron
16:54:25 <slaweq> are you fine with that? or is there a reason why it's like that and I shouldn't change it?
16:55:32 <mlavalle> not that I'm aware of
16:56:08 <slaweq> ok, so I will send a patch and check how it works then
16:56:42 <slaweq> #action slaweq switch scenario jobs to lib/neutron
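For reference, in a zuul v3 devstack job the switch from neutron-legacy to lib/neutron mostly comes down to which service names the job enables (devstack picks lib/neutron-legacy for q-* services and lib/neutron for neutron-* ones). A rough sketch, with service names following devstack conventions; the exact set a given job needs may differ:

    # Sketch: disable the legacy q-* services and enable their lib/neutron counterparts.
    vars:
      devstack_services:
        q-svc: false
        neutron-api: true
        q-agt: false
        neutron-agent: true
        q-dhcp: false
        neutron-dhcp: true
        q-l3: false
        neutron-l3: true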
16:57:25 <slaweq> ok, so if you don't have anything else, I think we are done for today
16:58:14 * jlibosva nods
16:58:27 <slaweq> I'm sure I wasn't as good a chair as ihrachys, but I hope it went OK :)
16:58:36 <jlibosva> you did really great :)
16:58:46 <slaweq> thank you
16:58:54 <slaweq> bye
16:58:57 <slaweq> #endmeeting