16:00:43 <slaweq> #startmeeting neutron_ci
16:00:44 <openstack> Meeting started Tue Dec 4 16:00:43 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:45 <slaweq> hi
16:00:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:48 <openstack> The meeting name has been set to 'neutron_ci'
16:01:06 <mlavalle> o/
16:01:48 <slaweq> lets wait few minutes for others
16:02:21 <mlavalle> ok
16:03:23 <slaweq> I pinged them in neutron channel
16:03:24 <haleyb> hi
16:03:35 <hongbin> o/
16:03:36 <haleyb> slaweq found us :)
16:03:44 <bcafarel> o/
16:03:51 <slaweq> :)
16:04:17 <njohnston> o/
16:04:29 <slaweq> ok, lets start then
16:04:29 <njohnston> my coffeemaker was slow :-/
16:04:34 <slaweq> #topic Actions from previous meetings
16:04:46 <slaweq> mlavalle to continue tracking not reachable FIP in trunk tests
16:05:03 <mlavalle> I did after we merged your patch
16:05:13 <mlavalle> the error is still taking place
16:05:25 <slaweq> yes, but I think that it's less often now, right?
16:05:51 <mlavalle> it may be slightly lower
16:06:22 <mlavalle> when did your fix exactly merge in master?
16:07:05 <slaweq> https://review.openstack.org/#/c/620805/
16:07:09 <slaweq> 3.12 I think
16:08:00 <mlavalle> in master it seems it merged 11/27
16:08:12 <slaweq> ahh, right
16:08:15 <slaweq> this one was for pike
16:08:18 <slaweq> sorry
16:08:32 <mlavalle> and yes, there seem to be lower number of hits since then
16:08:48 <mlavalle> so the next step is to change the scenario test
16:09:03 <slaweq> yes, will You do it?
16:09:06 <mlavalle> to create the fip when the port is already bound
16:09:14 <mlavalle> and yes, I will do that today
16:09:40 <slaweq> #action mlavalle to change trunk scenario test and see if that will help with FIP issues
16:09:47 <mlavalle> ++
16:10:07 <slaweq> ok, next one then
16:10:09 <slaweq> haleyb takes all this week :D
16:10:19 <slaweq> haleyb: did You fixed all our bugs?
16:10:21 <slaweq> :P
16:10:41 <haleyb> yeah, you didn't see the changes? :p
16:10:48 <slaweq> ok, thx haleyb++
16:10:55 <slaweq> we can go to the next one then
16:10:57 <slaweq> njohnston to remove neutron-grenade job from neutron's CI queues
16:11:24 <njohnston> So I have a change up to do that, but there are pre-requisites
16:11:46 <njohnston> I can remove the job from our queues but leave the job definition there
16:11:53 <njohnston> because it is used elsewhere
16:12:05 <njohnston> I have a change up already for the grenade repo, which uses neutron-grenade
16:12:18 <njohnston> but it looks like nova, cinder, glance, keystone all use neutron-grenade
16:12:33 <njohnston> so I'll need to spin changes for all of those before we can delete the job definition
16:12:57 <slaweq> but in our gate we have already neutron-grenade-py3, right?
16:13:02 <njohnston> oh, and tempest, openstacksdk, and the requirements repo
16:13:37 <njohnston> we have grenade-py3 yes, provided by the grenade repo
16:14:09 <slaweq> so maybe we can remove this neutron-grenade job from our check and gate queues to not overload gates
16:14:29 <slaweq> and add some comments that definition of this job is used by other projects so can't be removed now
16:14:34 <slaweq> mlavalle: what do You think?
16:14:47 <njohnston> yes, I think I'll do that to start with, and I'll gradually work towards the other projects replacing neutron-grenade with grenade-py3
16:15:19 <mlavalle> sounds good
16:15:34 <slaweq> great, so go on with it njohnston :)
16:15:57 <slaweq> #action njohnston will remove neutron-grenade from neutron ci queues and add comment why definition of job is still needed
16:16:09 <njohnston> will do
16:16:13 <slaweq> thx
16:16:17 <slaweq> so, next one
16:16:19 <slaweq> slaweq to continue debugging bug 1798475 when journal log will be available in fullstack tests
16:16:20 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:16:42 <slaweq> I found this failure on patch where journal log is already stored
16:16:48 <slaweq> http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/testr_results.html.gz
16:17:10 <slaweq> but I still have no idea why keepalived switched VIP address from one "host" to the other
16:17:29 <slaweq> I will keep investigating that
16:17:38 <slaweq> #action slaweq to continue debugging bug 1798475
16:17:39 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:18:12 <slaweq> if I will need help I will ping haleyb :)
16:18:26 <slaweq> next one
16:18:28 <slaweq> slaweq to continue fixing functional-py3 tests
16:18:34 <slaweq> no progress on this one still
16:18:49 <njohnston> I hope you didn't mind that I raised the flag on this one in the neutron team meeting
16:18:51 <slaweq> we know that we need to limit output from functional tests, remove all deprecations and things like that and check then what will be the result
16:19:03 <slaweq> njohnston: no, great that You did it
16:19:17 <slaweq> we definitely need more eyes looking at it
16:20:07 <slaweq> if there is anyone else who wants to try to fix it this week, feel free
16:20:21 <slaweq> I will assign it to myself as an action just to not forget about it
16:20:24 <slaweq> ok?
16:20:32 <njohnston> +1
16:20:45 <mlavalle> so based on how much progress I make on other assignments, I may take a look at this one
16:20:49 <slaweq> #action slaweq to continue fixing functional-py3 tests
16:21:04 <slaweq> mlavalle: thx, ping me if You will need any info
16:21:10 <mlavalle> ack
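To make the "limit output from functional tests" idea above concrete, here is a minimal standard-library sketch of filtering deprecation warnings and verbose log lines. The helper name quiet_test_output is hypothetical; this is not the actual neutron change, only the general technique it refers to.

```python
# Minimal sketch (assumed approach, not the actual neutron patch): drop
# deprecation warnings and chatty log lines so the captured functional
# test output stays small.  Uses only the standard library.
import logging
import warnings


def quiet_test_output():
    # Silence DeprecationWarning and PendingDeprecationWarning messages.
    warnings.simplefilter("ignore", DeprecationWarning)
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    # Raise the root logger level so INFO/DEBUG noise is not captured.
    logging.getLogger().setLevel(logging.WARNING)


if __name__ == "__main__":
    quiet_test_output()
    warnings.warn("this would normally be printed", DeprecationWarning)
    print("deprecation warnings are now filtered")
```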
16:21:25 <slaweq> ok, next one
16:21:27 <slaweq> njohnston to research py3 conversion for neutron grenade multinode jobs
16:21:46 <njohnston> no progress on that one yet
16:22:03 <slaweq> ok, lets move it to the next week then, right?
16:22:09 <njohnston> sounds good
16:22:10 <slaweq> #action njohnston to research py3 conversion for neutron grenade multinode jobs
16:22:12 <slaweq> thx
16:22:19 <slaweq> next:
16:22:22 <slaweq> slaweq to convert neutron-tempest-plugin jobs to py3
16:22:26 <slaweq> Patch https://review.openstack.org/#/c/621401/
16:22:38 <slaweq> I rechecked it few times and it looks that it works fine
16:22:52 <slaweq> all neutron-tempest-plugin jobs for master branch are switched to py3 in this patch
16:23:04 <slaweq> and jobs for stable branches are still on py27 of course
16:23:14 <slaweq> please review it if You will have some time
16:24:00 <mlavalle> slaweq: I will look at it today
16:24:07 <slaweq> thx mlavalle
16:24:15 <slaweq> ok, so lets move on
16:24:26 <slaweq> njohnston add tempest-slow and networking-ovn-tempest-dsvm-ovs-release to grafana
16:25:14 <njohnston> hmm, I added that config but it looks like I never did a 'git review'. I'll do that right away, sorry I missed it.
16:25:34 <slaweq> no problem, thx njohnston for working on this :)
16:25:57 <slaweq> please add me as reviewer to it when You will push it
16:26:05 <njohnston> will do
16:26:13 <slaweq> thx
16:26:20 <slaweq> ok, lets move on to the last one
16:26:22 <slaweq> mlavalle to discuss about neutron-tempest-dvr job in L3 meeting
16:26:38 <mlavalle> mhhh, I couldn't attend the meeting
16:27:15 <mlavalle> I'll discuss with haleyb in the Neutron channel
16:27:15 <slaweq> ok, so will You talk about it this week?
16:27:20 <slaweq> ahh, thx
16:27:32 <slaweq> so I will not add action for it anymore
16:27:39 <mlavalle> please do
16:27:48 <slaweq> any questions/something to add?
16:27:54 <mlavalle> so I can report concusion next week
16:28:01 <mlavalle> conclusion
16:29:02 <slaweq> ok, lets move on then
16:29:03 <slaweq> #topic Python 3
16:29:09 <slaweq> njohnston: any updates?
16:30:09 <slaweq> I think we should update our etherpad, to mark what is already done
16:30:20 <slaweq> I will do that this week
16:30:40 <njohnston> I don't think there are any updates this week
16:30:43 <njohnston> from my end
16:31:00 <slaweq> #action slaweq to update etherpad with what is already converted to py3
16:31:28 <slaweq> ok, from my side there is also nothing more to talk today
16:31:37 <slaweq> so lets move on, next topic
16:31:44 <slaweq> #topic Grafana
16:31:49 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:15 <slaweq> good news is that tempest jobs, and scenario jobs looks much better this week
16:34:25 <slaweq> even those multinode dvr jobs
16:35:08 <mlavalle> what happened last week?
16:35:19 <slaweq> when?
16:35:53 <mlavalle> seems we had a spike around Thursday in gate
16:37:14 <slaweq> hmm, I don't know
16:37:58 <hongbin> possibly there are some infra related issues
16:38:21 <mlavalle> well, let's keep it in mind
16:38:33 <slaweq> one day last week I saw some failures because of missing dependencies or something like that
16:38:50 <slaweq> but I don't remember what it was exactly and in which day
16:39:33 <njohnston> Hey, just a quick sanity check: we have the tempest-slow job in the check queue as voting but not in gate. Is that the intended configuration?
16:40:04 <njohnston> As in it is not present in the gate queue in any form
16:40:31 <hongbin> in theory, if a job is voting in check, it should be in the gate as well
16:41:04 <slaweq> yes, we should add it to gate queue as well
16:41:09 <njohnston> yes, that is my impression, I just wanted to see if there is some reason for this to be an exception that I don't know about
16:41:14 <slaweq> but maybe lets add it to grafana first
16:41:29 <njohnston> it turns out, it already is in grafana, at least for the check queue
16:41:34 <slaweq> then check how it works for one/two weeks and we will then decide
16:42:02 <slaweq> indeed, I missed it
16:42:50 <njohnston> So I will submit a change to add it to gate, and we can slow-walk that change until we are comfortable on it
16:43:21 <mlavalle> sounds like a plan
16:43:39 <slaweq> ok
16:44:10 <slaweq> ok, one more thing I want to mention
16:44:25 <slaweq> not related strictly to grafana but related to job failures
16:44:48 <slaweq> together with hongbin we decided to collect examples of failures which we spot in CI in etherpad:
16:44:49 <slaweq> https://etherpad.openstack.org/p/neutron-ci-failures
16:45:09 <slaweq> there are already quite many examples of issues from this week
16:45:40 <slaweq> some of them are not related to neutron but we wanted to have info how many times such issue happens for us
16:45:48 <slaweq> some are related to neutron for sure
16:46:19 <slaweq> most common issue related to neutron are problems with connectivity to FIP
16:46:31 <mlavalle> yeap
16:46:49 <slaweq> I don't know if that is the same issue every time but visible culprit is the same - no connection to FIP
16:46:55 <slaweq> it is in many different jobs
16:47:04 <mlavalle> I don't think it is always the same root cause
16:47:53 <slaweq> yes, probably not, but we should take a look at them, collect such examples for few weeks and report bugs if it hits more often in same job/test
16:48:30 <slaweq> from other things, I wanted to mention that there is (again) some issue with ovsdbapp timeouts
16:48:44 <slaweq> it hits us in many different jobs
16:49:00 <slaweq> and otherwiseguy was checking it already
16:49:35 <hongbin> there is a patch to turn on the python-ovs debugging to do further troubleshooting for that
16:50:16 <hongbin> #link https://review.openstack.org/#/c/621572/
16:50:33 <slaweq> and bug reported already: https://bugs.launchpad.net/bugs/1802640
16:50:34 <openstack> Launchpad bug 1802640 in neutron "TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand in q-agt gate failure" [Medium,Confirmed]
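For context on the python-ovs debugging patch hongbin links above, a minimal sketch of the general technique is shown below, using only the standard logging module. The logger names "ovsdbapp" and "ovs" are assumptions for illustration; this is not the content of the linked review.

```python
# A minimal sketch (not the linked review): raise log verbosity for the
# OVSDB client libraries so command timeouts leave more context in the
# captured logs.  The logger name prefixes are assumed for illustration.
import logging
import sys


def enable_ovsdb_debug_logging():
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    for name in ("ovsdbapp", "ovs"):  # assumed logger names
        logger = logging.getLogger(name)
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)


if __name__ == "__main__":
    enable_ovsdb_debug_logging()
    logging.getLogger("ovsdbapp").debug("debug logging is now visible")
```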
16:51:21 <slaweq> ok, lets talk about some specific jobs now
16:51:28 <slaweq> #topic fullstack/functional
16:51:43 <slaweq> according to fullstack tests we have new issue I think
16:52:04 <slaweq> I saw at least 3 times that test: neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed failed:
16:52:12 <slaweq> http://logs.openstack.org/36/617736/10/check/neutron-fullstack/61fa2eb/logs/testr_results.html.gz
16:52:15 <slaweq> http://logs.openstack.org/68/424468/34/check/neutron-fullstack/4422e70/logs/testr_results.html.gz
16:52:17 <slaweq> http://logs.openstack.org/08/620708/2/check/neutron-fullstack/51099db/logs/testr_results.html.gz
16:52:28 <slaweq> is there anyone who wants to take a look on it?
16:52:48 <hongbin> i can try to look at it
16:53:39 <mlavalle> is it always the IP address already allocated exception?
16:53:48 <slaweq> thx hongbin
16:54:00 <slaweq> patch which introduced this test is https://review.openstack.org/#/c/606876/
16:54:21 <slaweq> mlavalle: I don't know, I didn't have time to investigate it
16:55:05 <slaweq> yes, it looks that it is the same culprit every time
16:55:17 <slaweq> it should be easy to fix I hope :)
16:55:41 <slaweq> #action hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:56:19 <slaweq> I want also to highlight again failing functional db migrations tests
16:56:34 <slaweq> I found it at least 5 or 6 times failing this week
16:56:47 <slaweq> there are already logs from those tests stored properly
16:56:58 <slaweq> but there is not too much info there
16:57:16 <slaweq> for example: http://logs.openstack.org/49/613549/6/gate/neutron-functional/315ddc4/logs/testr_results.html.gz
16:57:53 <slaweq> it looks that migrations are working but very slow
16:58:33 <slaweq> I think we should compare such failed tests with passed ones and check if there are always the same migrations slowest ones or maybe it's random
16:58:48 <slaweq> but I don't know if I will have time for it this week
16:59:43 <slaweq> ok, and that's all what I wanted to highlight from issues in CI
16:59:54 <slaweq> thx for attending and see You next week
16:59:58 <slaweq> #endmeeting
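As a follow-up to slaweq's suggestion near the end of the meeting to compare slow db migrations between failed and passing runs, here is a rough, self-contained sketch of such a comparison; the helper function and the sample revision ids and timings are hypothetical, not data from the linked job logs.

```python
# A rough sketch (hypothetical helper, not an existing neutron tool):
# given per-migration durations from a failed run and a passing run,
# print the revisions whose runtime grew the most, to check whether the
# same migrations are always the slow ones or whether it is random.
from typing import Dict, List, Tuple


def slowest_regressions(failed: Dict[str, float],
                        passed: Dict[str, float],
                        top: int = 5) -> List[Tuple[float, str]]:
    deltas = []
    for revision, failed_seconds in failed.items():
        passed_seconds = passed.get(revision, 0.0)
        deltas.append((failed_seconds - passed_seconds, revision))
    return sorted(deltas, reverse=True)[:top]


if __name__ == "__main__":
    # Durations in seconds; revision ids and numbers are made up.
    failed_run = {"rev_a1b2": 42.0, "rev_c3d4": 3.1, "rev_e5f6": 15.7}
    passing_run = {"rev_a1b2": 2.5, "rev_c3d4": 2.9, "rev_e5f6": 3.0}
    for delta, revision in slowest_regressions(failed_run, passing_run):
        print(f"{revision}: +{delta:.1f}s slower in the failed run")
```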