16:00:22 <slaweq> #startmeeting neutron_ci
16:00:23 <openstack> Meeting started Tue Jun 26 16:00:22 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:26 <slaweq> hi
16:00:27 <mlavalle> o/
16:00:28 <openstack> The meeting name has been set to 'neutron_ci'
16:00:35 <njohnston> o/
16:01:21 <slaweq> haleyb: are You around?
16:02:09 * slaweq is at the airport so sorry if there are some issues with the connection
16:02:38 <slaweq> ok, maybe haleyb will join later
16:02:41 <slaweq> let's start
16:02:41 <mlavalle> going on vacation?
16:02:56 <slaweq> no, I'm going back home from internal training
16:03:02 <mlavalle> ahhh
16:03:07 <mlavalle> have a safe flight
16:03:11 <slaweq> thx :)
16:03:13 <slaweq> #topic Actions from previous meetings
16:03:27 <slaweq> we have only one action from last week
16:03:29 <slaweq> njohnston to look into adding Grafana dashboard for stable branches
16:04:19 <njohnston> I have started working on that but got pulled into some other things; I plan to push something for people to review later this week
16:04:30 <njohnston> Apologies for the delay
16:04:33 <slaweq> ok, thx for the update
16:04:40 <slaweq> no problem, it's not very urgent :)
16:04:46 <slaweq> #action njohnston to look into adding Grafana dashboard for stable branches
16:04:55 <slaweq> You have it for next week :)
16:05:00 <mlavalle> no apologies needed. we all have sponsors who help us to pay the bills
16:05:01 <njohnston> thanks
16:05:13 <mlavalle> and they have their priorities
16:05:22 <slaweq> mlavalle: well said :)
16:05:43 <slaweq> #topic Grafana
16:05:52 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:07:11 <slaweq> generally it doesn't look very bad this week IMO
16:07:39 <slaweq> but I don't know why there is some "gap" in the graphs yesterday and on Sunday
16:07:40 <mlavalle> The tempest linux bridge job has had some failures
16:08:14 <slaweq> mlavalle: are You talking about this "spike" in the gate queue?
16:08:24 <mlavalle> yeah
16:08:57 <slaweq> I was checking that and I found a few issues not related to neutron, like e.g. 503 errors from cinder: http://logs.openstack.org/63/323963/61/check/neutron-tempest-linuxbridge/1d56164/logs/testr_results.html.gz
16:09:25 <mlavalle> yes, I agree. What I've seen is something about Nova tagging testing
16:09:32 <slaweq> and I think that this high failure rate is also because there weren't many runs counted this time
16:10:02 <slaweq> as I went through some of the failed jobs yesterday and there weren't many failed examples of this job
16:10:17 <mlavalle> ok, let's keep an eye on it and see what happens over the next few days
16:10:25 <slaweq> about the tagging test failure, I have it for one of the next topics :)
16:10:32 <slaweq> mlavalle: I agree
16:11:00 <mlavalle> maybe I am just pissed off with this job because it has failed in some of my patches ;-)
16:11:08 <slaweq> so let's now talk about some specific jobs
16:11:11 <slaweq> mlavalle: LOL
16:11:13 <slaweq> maybe
16:11:16 <slaweq> #topic Unit tests
16:11:38 <slaweq> I added UT as a topic today because I found a few failures with timeouts
16:11:54 <slaweq> like e.g.: http://logs.openstack.org/58/565358/14/check/openstack-tox-py35/aa30b12/job-output.txt.gz or http://logs.openstack.org/03/563803/9/check/openstack-tox-py35/a50de4a/job-output.txt.gz
16:12:22 <slaweq> and it was always py35
16:12:42 <slaweq> or maybe You saw something similar with py27 but I didn't spot it yet
16:12:53 <mlavalle> mhhh, let me see
16:12:56 <slaweq> I think that this should be investigated
16:13:42 <mlavalle> no, I saw a timeout a few minutes ago, but it wasn't unit tests
16:14:32 <njohnston> I have seen tempest get killed due to timeouts as well, for example: http://logs.openstack.org/61/566961/4/check/neutron-tempest-iptables_hybrid/c70896b/job-output.txt.gz#_2018-06-26_08_55_07_638631
16:15:52 <slaweq> I think that we should report this one as a bug, check how long a "good" run takes and then maybe check what the issue is
16:16:14 <mlavalle> sounds like a plan
16:16:20 <slaweq> and I will try to investigate these UT tests
16:16:37 <slaweq> #action slaweq will report and investigate py35 timeouts
16:16:55 <slaweq> njohnston: tempest is probably a different issue
16:17:13 <slaweq> I know that it happens sometimes and we probably should also check it
16:18:33 <slaweq> njohnston: would You like to investigate it?
16:19:47 <slaweq> ok, I will report it as a bug also and I will see if I will be able to check it this week
16:20:12 <slaweq> #action slaweq reports a bug with tempest timeouts, like on http://logs.openstack.org/61/566961/4/check/neutron-tempest-iptables_hybrid/c70896b/job-output.txt.gz
16:20:27 <slaweq> ok, let's go to the next one
16:20:29 <slaweq> #topic Functional
16:20:50 <slaweq> I saw something like this at least a few times: * http://logs.openstack.org/58/574058/10/check/neutron-functional/f38d685/logs/testr_results.html.gz
16:21:04 <slaweq> it might be related to https://bugs.launchpad.net/neutron/+bug/1687027
16:21:04 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed]
16:21:31 <slaweq> it happened at least a few times this week so I think we should also dig into it and maybe find a way to fix it :)
16:24:00 <slaweq> and those are different tests failing but with a similar error; second example from the last few days: * http://logs.openstack.org/61/567461/4/check/neutron-functional/81d69a4/logs/testr_results.html.gz
16:24:35 <slaweq> any volunteer to debug this issue?
16:24:40 <mlavalle> o/
16:24:45 <slaweq> thx mlavalle :)
16:25:02 <mlavalle> I assigned it to me
16:25:09 <slaweq> #action mlavalle will check functional tests issue, https://bugs.launchpad.net/neutron/+bug/1687027
16:25:09 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:25:13 <slaweq> ok, thx a lot
16:25:51 <slaweq> I will not talk about fullstack tests today as its failure rate has been quite good recently and I didn't have time yet to dig into the one often-failing test
16:25:59 <slaweq> so let's move to the next topic
16:26:04 <slaweq> #topic Scenarios
16:26:30 <slaweq> As mlavalle mentioned before, there is some issue with the tagging test from tempest
16:26:42 <slaweq> bug report: https://bugs.launchpad.net/tempest/+bug/1775947
16:26:42 <openstack> Launchpad bug 1775947 in tempest "tempest.api.compute.servers.test_device_tagging.TaggedAttachmentsTest failing" [Medium,Confirmed] - Assigned to Deepak Mourya (mourya007)
16:26:59 <mlavalle> yeah I've seen it more in api tests
16:27:00 <slaweq> an example of the failure is e.g. http://logs.openstack.org/44/575444/4/gate/neutron-tempest-linuxbridge/118cc97/logs/testr_results.html.gz
16:27:05 <mlavalle> not as much in scenario tests
16:27:10 <slaweq> ahh, yes
16:27:14 <slaweq> sorry, my bad
16:27:25 <slaweq> I mixed them up as it's "tempest" in both cases :)
16:27:32 <slaweq> it's in tempest api tests
16:27:43 <slaweq> so let's change the topic to
16:27:43 <mlavalle> and yes, I see it mostly in the linuxbridge job
16:27:52 <slaweq> #topic tempest tests
16:27:54 <slaweq> :)
16:28:22 <slaweq> I didn't check if it's only related to the linuxbridge job
16:28:47 <slaweq> what I saw often was something like "process exited with errcode 137" on the VM
16:28:59 <slaweq> while doing "curl" to the metadata service
16:29:24 <mlavalle> right, I've also seen the message about the metadata service
16:31:20 <slaweq> for now it is reported against tempest as I suspected at first glance that it might be an issue with removing the port from the instance "too fast" while doing curl from the VM to the metadata API
16:31:32 <slaweq> but now I'm not sure if that is the (only) issue
16:31:52 <slaweq> I will try to investigate it more as I was already checking it a bit
16:32:05 <slaweq> but I'm not sure if I will be able to do it this week
16:32:27 <slaweq> #action slaweq will investigate tagging tempest test issue
16:33:18 <slaweq> another issue which I found and wanted to raise here is a failure in the neutron-tempest-dvr job:
16:33:20 <slaweq> http://logs.openstack.org/14/529814/16/check/neutron-tempest-dvr/d9696d5/logs/testr_results.html.gz
16:33:41 <slaweq> I think I spotted it only once but maybe You saw it also somewhere?
16:34:14 <mlavalle> No, I haven't seen it
16:34:25 <mlavalle> But I'll keep an eye open for it
16:35:13 <slaweq> thx mlavalle
16:36:12 <slaweq> I also found a timeout issue with the tempest-full-py3 job: http://logs.openstack.org/61/566961/4/gate/tempest-full-py3/4ca37f8/job-output.txt.gz
16:36:51 <slaweq> but I think that's the same as the one pointed out by njohnston earlier
16:36:59 <slaweq> so it's already assigned to me :)
16:37:04 <mlavalle> lol
16:37:20 <slaweq> anything else to add here?
16:37:27 <mlavalle> not from me
16:37:35 <slaweq> ok, moving to next topic then
16:37:37 <slaweq> #topic Rally
16:37:47 <slaweq> There was an issue related to rally last week (spike on grafana) but it was fixed quickly on the rally side after talking with andreykurilin
16:37:52 <slaweq> it's just FYI :)
16:38:02 <mlavalle> ack
16:38:07 <slaweq> today boden spotted some new issue with rally for the stable/queens branch
16:38:14 <slaweq> https://bugs.launchpad.net/neutron/+bug/1778714
16:38:14 <openstack> Launchpad bug 1778714 in neutron "neutron-rally-neutron fails with `NeutronTrunks.create_and_list_trunk_subports` in in any platform" [Critical,New]
16:38:25 <slaweq> but it looks like the patch which he proposed fixes this problem
16:38:48 <mlavalle> is that a neutron patch?
16:38:49 <slaweq> I also talked with andreykurilin on the rally channel and he confirmed to me just before the meeting that this should help :)
16:39:14 <slaweq> yes: https://review.openstack.org/#/c/578104/
16:39:15 <patchbot> patch 578104 - neutron (stable/queens) - Use rally 0.12.1 release for stable/pike branch.
16:39:57 <mlavalle> ok, added to my pile
16:40:13 <slaweq> thx mlavalle
16:40:35 <slaweq> so I guess we can go to the next topic now
16:40:45 <slaweq> #topic Grenade
16:40:56 <slaweq> I also found an issue in the neutron-grenade-dvr-multinode job:
16:41:01 <slaweq> http://logs.openstack.org/03/563803/9/check/neutron-grenade-dvr-multinode/13338d9/logs/testr_results.html.gz
16:41:09 <slaweq> did You see it already?
16:41:34 <mlavalle> I think I did, yeah
16:41:50 <slaweq> so it's also something which we should check
16:42:19 <mlavalle> This specific failure is the instance failing to become active
16:42:25 <slaweq> it looks to me like something that can potentially be related to neutron
16:42:44 <slaweq> the L2 agent could not configure the port or the server didn't send a notification to nova
16:43:21 <slaweq> or maybe it's just a misconfiguration problem, as the error message says that it didn't become active in 196 seconds
16:43:38 <slaweq> and IIRC nova-compute waits for the port to be active for 300 seconds by default
16:47:07 <slaweq> the strange thing for me is that this instance uuid is not in the nova-compute logs at all
16:47:28 <slaweq> but I don't know exactly if it should be there
16:48:41 <slaweq> mlavalle: but in the neutron-server logs there are some errors somehow related to dvr: http://logs.openstack.org/03/563803/9/check/neutron-grenade-dvr-multinode/13338d9/logs/screen-q-svc.txt.gz?level=ERROR
16:48:53 <slaweq> check the first 3 lines
16:49:02 <slaweq> do You think that it might be related?
16:51:10 <slaweq> mlavalle: are You still here? :)
16:51:14 <mlavalle> mhhh
16:51:18 <mlavalle> I was looking
16:51:23 <slaweq> ahh, ok :)
16:51:32 <mlavalle> at first glance I don't think so
16:51:53 <mlavalle> the messages refer to DVR ports
16:52:09 <mlavalle> and multiple port bindings only apply to compute ports
16:52:20 <mlavalle> but I will keep an eye open anyways
16:52:26 <mlavalle> thanks for bringing it up
16:52:30 <slaweq> thx
16:53:01 <slaweq> ok, that's all from me regarding failures in CI for this week
16:53:15 <slaweq> do You have anything else?
16:53:33 <mlavalle> nope, thanks for the thorough update, as usual
16:53:43 <njohnston> Quick question, do the grafana graphs include just jobs that end in ERROR, or do they also include the increasing number of jobs that end in TIMED_OUT status as reported back in Zuul?
16:53:59 <slaweq> njohnston: good question
16:54:06 <slaweq> I don't know to be honest
16:54:36 <mlavalle> don't know either
16:54:43 <mlavalle> better ask in the infra channel
16:55:14 <slaweq> according to the config file of the dashboard: https://git.openstack.org/cgit/openstack-infra/project-config/tree/grafana/neutron.yaml
16:55:30 <slaweq> it's FAILURE but I have no idea how this value is calculated :)
16:55:39 <njohnston> right
16:55:47 <njohnston> OK, I will ask in the infra channel and report back
16:55:54 <slaweq> njohnston: thx a lot
16:56:14 <slaweq> if You don't have anything else I have one more thing to tell You
16:56:17 <njohnston> #action njohnston to ask infra team if TIMED_OUT is included in FAILURE for grafana graphs
16:56:30 <njohnston> go ahead
16:56:40 <slaweq> for the next 2 weeks I will be on PTO (\o/) so I will not be able to run this meeting
16:57:00 <slaweq> You will need to find someone else to replace me or just cancel them
16:57:28 <slaweq> I will be able to chair it again on 17.07
16:57:33 <mlavalle> Next week will be slow anyway with the July 4th holiday
16:57:48 <mlavalle> so probably it will be best to cancel it
16:57:58 <mlavalle> but I can run it on the 10th
16:58:12 <slaweq> ok, I will send an email about canceling next week's meeting
16:58:25 <slaweq> and thx for handling 10th of July :)
16:58:29 <mlavalle> if you trust I can handle it, that is
16:58:33 <njohnston> thanks mlavalle
16:58:46 <slaweq> #action slaweq will cancel next week's meeting
16:58:59 <slaweq> mlavalle: You will do it better than me for sure :)
16:59:12 <mlavalle> doubt it. I'll do it anyway
16:59:13 <slaweq> ok, thx for attending today
16:59:14 * haleyb waves in the final minute, just got back and will go through the logs
16:59:24 <slaweq> hi haleyb :)
16:59:33 <slaweq> see You all
16:59:35 <haleyb> bye slaweq :)
16:59:39 <mlavalle> what airport are you at, slaweq?
16:59:41 <slaweq> #endmeeting