16:00:22 <slaweq> #startmeeting neutron_ci
16:00:23 <openstack> Meeting started Tue Jun 26 16:00:22 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:26 <slaweq> hi
16:00:27 <mlavalle> o/
16:00:28 <openstack> The meeting name has been set to 'neutron_ci'
16:00:35 <njohnston> o/
16:01:21 <slaweq> haleyb: are You around?
16:02:09 * slaweq is at the airport so sorry if there are some issues with the connection
16:02:38 <slaweq> ok, maybe haleyb will join later
16:02:41 <slaweq> let's start
16:02:41 <mlavalle> going on vacation?
16:02:56 <slaweq> no, I'm going back home from internal training
16:03:02 <mlavalle> ahhh
16:03:07 <mlavalle> have a safe flight
16:03:11 <slaweq> thx :)
16:03:13 <slaweq> #topic Actions from previous meetings
16:03:27 <slaweq> we have only one action from last week
16:03:29 <slaweq> njohnston to look into adding Grafana dashboard for stable branches
16:04:19 <njohnston> I have started working on that but got pulled into some other things; I plan to push something for people to review later this week
16:04:30 <njohnston> Apologies for the delay
16:04:33 <slaweq> ok, thx for update
16:04:40 <slaweq> no problem, it's not very urgent :)
16:04:46 <slaweq> #action njohnston to look into adding Grafana dashboard for stable branches
16:04:55 <slaweq> You have it for next week :)
16:05:00 <mlavalle> no apologies needed. we all have sponsors who help us to pay the bills
16:05:01 <njohnston> thanks
16:05:13 <mlavalle> and they have their priorities
16:05:22 <slaweq> mlavalle: well said :)
16:05:43 <slaweq> #topic Grafana
16:05:52 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:07:11 <slaweq> generally it doesn't look very bad this week IMO
16:07:39 <slaweq> but I don't know why there is a "gap" in the graphs from yesterday and Sunday
16:07:40 <mlavalle> The tempest linux bridge job has had some failures
16:08:14 <slaweq> mlavalle: are You talking about this "spike" in the gate queue?
16:08:24 <mlavalle> yeah
16:08:57 <slaweq> I was checking that and I found a few issues not related to neutron, like e.g. 503 errors from cinder: http://logs.openstack.org/63/323963/61/check/neutron-tempest-linuxbridge/1d56164/logs/testr_results.html.gz
16:09:25 <mlavalle> yes, I agree. What I've seen is something about the Nova tagging test
16:09:32 <slaweq> and I think that this high failure rate is also because there weren't many runs counted this time
16:10:02 <slaweq> as I went through some of the failed jobs yesterday and there weren't many failed examples of this job
16:10:17 <mlavalle> ok, let's keep an eye on it and see what happens over the next few days
16:10:25 <slaweq> the tagging test failure is covered in one of the next topics :)
16:10:32 <slaweq> mlavalle: I agree
16:11:00 <mlavalle> maybe I am just pissed off with this job because it has failed in some of my patches ;-)
16:11:08 <slaweq> so let's now talk about some specific jobs
16:11:11 <slaweq> mlavalle: LOL
16:11:13 <slaweq> maybe
16:11:16 <slaweq> #topic Unit tests
16:11:38 <slaweq> I added UT as a topic today because I found a few failures with timeouts
16:11:54 <slaweq> like e.g.: http://logs.openstack.org/58/565358/14/check/openstack-tox-py35/aa30b12/job-output.txt.gz or http://logs.openstack.org/03/563803/9/check/openstack-tox-py35/a50de4a/job-output.txt.gz
16:12:22 <slaweq> and it was always py35
16:12:42 <slaweq> or maybe You saw something similar with py27, but I haven't spotted it yet
16:12:53 <mlavalle> mhhh, let me see
16:12:56 <slaweq> I think that this should be investigated
16:13:42 <mlavalle> no, I saw a timeout a few minutes ago, but it wasn't unit tests
16:14:32 <njohnston> I have seen tempest get killed due to timeouts as well, for example: http://logs.openstack.org/61/566961/4/check/neutron-tempest-iptables_hybrid/c70896b/job-output.txt.gz#_2018-06-26_08_55_07_638631
16:15:52 <slaweq> I think that we should report this one as a bug, check how long a "good" run takes and then maybe figure out what the issue is
16:16:14 <mlavalle> sounds like a plan
16:16:20 <slaweq> and I will try to investigate these UT timeouts
16:16:37 <slaweq> #action slaweq will report and investigate py35 timeouts
16:16:55 <slaweq> njohnston: tempest is probably a different issue
16:17:13 <slaweq> I know that it happens sometimes and we should probably check it too
16:18:33 <slaweq> njohnston: would You like to investigate it?
16:19:47 <slaweq> ok, I will also report it as a bug and see if I'm able to check it this week
16:20:12 <slaweq> #action slaweq reports a bug with tempest timeouts, like on http://logs.openstack.org/61/566961/4/check/neutron-tempest-iptables_hybrid/c70896b/job-output.txt.gz
16:20:27 <slaweq> ok, let's go to next one
16:20:29 <slaweq> #topic Functional
16:20:50 <slaweq> I saw something like this at least a few times:     * http://logs.openstack.org/58/574058/10/check/neutron-functional/f38d685/logs/testr_results.html.gz
16:21:04 <slaweq> it might be related to https://bugs.launchpad.net/neutron/+bug/1687027
16:21:04 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed]
16:21:31 <slaweq> it happened at least a few times this week so I think we should also dig into it and maybe find a way to fix it :)
16:24:00 <slaweq> and those are different tests failing but with a similar error; a second example from the last few days:     * http://logs.openstack.org/61/567461/4/check/neutron-functional/81d69a4/logs/testr_results.html.gz
16:24:35 <slaweq> any volunteer to debug this issue?
16:24:40 <mlavalle> o/
16:24:45 <slaweq> thx mlavalle :)
16:25:02 <mlavalle> I assigned it to me
16:25:09 <slaweq> #action mlavalle will check functional tests issue, https://bugs.launchpad.net/neutron/+bug/1687027
16:25:09 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:25:13 <slaweq> ok, thx a lot
16:25:51 <slaweq> I will not talk about fullstack tests today as the failure rate has been quite good recently and I haven't had time yet to dig into the one often-failing test
16:25:59 <slaweq> so the next topic is
16:26:04 <slaweq> #topic Scenarios
16:26:30 <slaweq> As mlavalle mentioned before, there is some issue with the tagging test from tempest
16:26:42 <slaweq> bug report: https://bugs.launchpad.net/tempest/+bug/1775947
16:26:42 <openstack> Launchpad bug 1775947 in tempest "tempest.api.compute.servers.test_device_tagging.TaggedAttachmentsTest failing" [Medium,Confirmed] - Assigned to Deepak Mourya (mourya007)
16:26:59 <mlavalle> yeah I've seen it more in api tests
16:27:00 <slaweq> example of failure is e.g. on http://logs.openstack.org/44/575444/4/gate/neutron-tempest-linuxbridge/118cc97/logs/testr_results.html.gz
16:27:05 <mlavalle> not as much in scenario tests
16:27:10 <slaweq> ahh, yes
16:27:14 <slaweq> sorry, my bad
16:27:25 <slaweq> I mixed them as it's "tempest" in both cases :)
16:27:32 <slaweq> it's in tempest api tests
16:27:43 <slaweq> so let's change topic to
16:27:43 <mlavalle> and yes, I see it mostly in the linuxbridge job
16:27:52 <slaweq> #topic tempest tests
16:27:54 <slaweq> :)
16:28:22 <slaweq> I didn't check if it's only related to the linuxbridge job
16:28:47 <slaweq> what I saw often was something like "process exited with errcode 137" on the VM
16:28:59 <slaweq> while doing "curl" to the metadata service
16:29:24 <mlavalle> right, I've also seen the message about the metadata service
16:31:20 <slaweq> for now it is reported against tempest as I suspected at first glance that it might be an issue with removing the port from the instance "too fast" while doing curl from the VM to the metadata API
16:31:32 <slaweq> but now I'm not sure if that is the (only) issue
16:31:52 <slaweq> I will try to investigate it more as I was already checking it a bit
16:32:05 <slaweq> but I'm not sure if I will be able to do it this week
16:32:27 <slaweq> #action slaweq will investigate tagging tempest test issue
16:33:18 <slaweq> another issue which I found and wanted to raise here is a failure in the neutron-tempest-dvr job:
16:33:20 <slaweq> http://logs.openstack.org/14/529814/16/check/neutron-tempest-dvr/d9696d5/logs/testr_results.html.gz
16:33:41 <slaweq> I think I spotted it only once, but maybe You saw it somewhere too?
16:34:14 <mlavalle> No, I haven't seen it
16:34:25 <mlavalle> But I'll keep an eye open for it
16:35:13 <slaweq> thx mlavalle
16:36:12 <slaweq> I also found timeout issue with tempest-py3 test: http://logs.openstack.org/61/566961/4/gate/tempest-full-py3/4ca37f8/job-output.txt.gz
16:36:51 <slaweq> but I think that's the same issue as the one njohnston pointed out earlier
16:36:59 <slaweq> so it's already assigned to me :)
16:37:04 <mlavalle> lol
16:37:20 <slaweq> anything else to add here?
16:37:27 <mlavalle> not from me
16:37:35 <slaweq> ok, moving to next topic then
16:37:37 <slaweq> #topic Rally
16:37:47 <slaweq> There was an issue related to rally last week (spike on grafana) but it was fixed quickly on the rally side after a talk with andreykurilin
16:37:52 <slaweq> it's just FYI :)
16:38:02 <mlavalle> ack
16:38:07 <slaweq> today boden spotted a new issue with rally for the stable/queens branch
16:38:14 <slaweq> https://bugs.launchpad.net/neutron/+bug/1778714
16:38:14 <openstack> Launchpad bug 1778714 in neutron "neutron-rally-neutron fails with `NeutronTrunks.create_and_list_trunk_subports` in in any platform" [Critical,New]
16:38:25 <slaweq> but it looks like the patch which he proposed fixes this problem
16:38:48 <mlavalle> is that a neutron patch?
16:38:49 <slaweq> I also talked with andreykurilin on the rally channel and he confirmed to me just before the meeting that this should help :)
16:39:14 <slaweq> yes: https://review.openstack.org/#/c/578104/
16:39:15 <patchbot> patch 578104 - neutron (stable/queens) - Use rally 0.12.1 release for stable/pike branch.
16:39:57 <mlavalle> ok, added to my pile
16:40:13 <slaweq> thx mlavalle
16:40:35 <slaweq> so I guess we can go to next topic now
16:40:45 <slaweq> #topic Grenade
16:40:56 <slaweq> I also found an issue in the neutron-grenade-dvr-multinode job:
16:41:01 <slaweq> http://logs.openstack.org/03/563803/9/check/neutron-grenade-dvr-multinode/13338d9/logs/testr_results.html.gz
16:41:09 <slaweq> did You see it already?
16:41:34 <mlavalle> I think I did, yeah
16:41:50 <slaweq> so it's also something which we should check
16:42:19 <mlavalle> This specific failure is the instance failing to become active
16:42:25 <slaweq> it looks to me like something that could potentially be related to neutron
16:42:44 <slaweq> the L2 agent could not configure the port, or the server didn't send the notification to nova
16:43:21 <slaweq> or maybe it's just a misconfiguration problem, as the error message says that it didn't become active in 196 seconds
16:43:38 <slaweq> and IIRC nova-compute waits for the port to become active for 300 seconds by default
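A minimal sketch of the nova-compute options that the 300-second wait mentioned above most likely corresponds to, shown with their documented defaults; whether the grenade jobs override these values is an assumption that is not verified here:

    # nova.conf on the compute node (documented defaults)
    [DEFAULT]
    # how long nova-compute waits for the network-vif-plugged event from neutron
    vif_plugging_timeout = 300
    # if the event does not arrive in time, fail the boot instead of continuing
    vif_plugging_is_fatal = true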
16:47:07 <slaweq> the strange thing for me is that this instance uuid is not in the nova-compute logs at all
16:47:28 <slaweq> but I don't know exactly if it should be there
16:48:41 <slaweq> mlavalle: but in the neutron-server logs there are some errors related to dvr somehow: http://logs.openstack.org/03/563803/9/check/neutron-grenade-dvr-multinode/13338d9/logs/screen-q-svc.txt.gz?level=ERROR
16:48:53 <slaweq> check the first 3 lines
16:49:02 <slaweq> do You think that it might be related?
16:51:10 <slaweq> mlavalle: are You here still? :)
16:51:14 <mlavalle> mhhh
16:51:18 <mlavalle> I was looking
16:51:23 <slaweq> ahh, ok :)
16:51:32 <mlavalle> at first glance I don't think so
16:51:53 <mlavalle> the messages refer to DVR ports
16:52:09 <mlavalle> and multiple port bindings only apply to compute ports
16:52:20 <mlavalle> but I will keep an eye open anyways
16:52:26 <mlavalle> thanks for bringing it up
16:52:30 <slaweq> thx
16:53:01 <slaweq> ok, that's all from me regarding CI failures for this week
16:53:15 <slaweq> do You have anything else?
16:53:33 <mlavalle> nope, thanks for the thorough update, as usual
16:53:43 <njohnston> Quick question, do the grafana graphs include just jobs that end in ERROR, or do they also include the increasing number of jobs that end in TIMED_OUT status as reported back in Zuul?
16:53:59 <slaweq> njohnston: good question
16:54:06 <slaweq> I don't know to be honest
16:54:36 <mlavalle> don't know either
16:54:43 <mlavalle> better ask in the infra channel
16:55:14 <slaweq> according to the config file of the dashboard: https://git.openstack.org/cgit/openstack-infra/project-config/tree/grafana/neutron.yaml
16:55:30 <slaweq> it's FAILURE but I have no idea how this value is calculated :)
16:55:39 <njohnston> right
16:55:47 <njohnston> OK, I will ask in the infra channel and report back
16:55:54 <slaweq> njohnston: thx a lot
16:56:14 <slaweq> if You don't have anything else I have one more thing to tell You
16:56:17 <njohnston> #action njohnston to ask infra team if TIMED_OUT is included in FAILURE for grafana graphs
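For reference, a rough sketch of what a failure-rate target in that grafana/neutron.yaml typically looks like; the zuul statsd key is abbreviated and illustrative rather than copied from the real file, so the exact metric path is an assumption:

    - title: Failure Rate (sketch)
      span: 4
      targets:
        # the graph divides the per-result FAILURE counter by SUCCESS + FAILURE;
        # whether a TIMED_OUT result bumps FAILURE or a separate counter is the
        # open question captured in the action item above
        - target: >-
            asPercent(transformNull(stats_counts.zuul.<...>.job.neutron-functional.FAILURE),
                      transformNull(sum(stats_counts.zuul.<...>.job.neutron-functional.{SUCCESS,FAILURE})))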
16:56:30 <njohnston> go ahead
16:56:40 <slaweq> for the next 2 weeks I will be on PTO (\o/) so I will not be able to run this meeting
16:57:00 <slaweq> You will need to find someone else to replace me or just cancel the meetings
16:57:28 <slaweq> I will be able to chair it again on 17.07
16:57:33 <mlavalle> Next week will be slow anyway with the July 4th holiday
16:57:48 <mlavalle> so probably it will be best to cancel it
16:57:58 <mlavalle> but I can run it on the 10th
16:58:12 <slaweq> ok, I will send email about canceling next week's meeting
16:58:25 <slaweq> and thx for handling 10th of July :)
16:58:29 <mlavalle> if you trust I can handle it, that is
16:58:33 <njohnston> thanks mlavalle
16:58:46 <slaweq> #action slaweq will cancel next week's meeting
16:58:59 <slaweq> mlavalle: You will do it better than me for sure :)
16:59:12 <mlavalle> doubt it. I'll do it anyway
16:59:13 <slaweq> ok, thx for attending today
16:59:14 * haleyb waves in the final minute, just got back and will go through the logs
16:59:24 <slaweq> hi haleyb :)
16:59:33 <slaweq> see You all
16:59:35 <haleyb> bye slaweq :)
16:59:39 <mlavalle> what airport are you at, slaweq?
16:59:41 <slaweq> #endmeeting