16:00:43 <slaweq> #startmeeting neutron_ci
16:00:44 <openstack> Meeting started Tue Dec 4 16:00:43 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:45 <slaweq> hi
16:00:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:48 <openstack> The meeting name has been set to 'neutron_ci'
16:01:06 <mlavalle> o/
16:01:48 <slaweq> lets wait few minutes for others
16:02:21 <mlavalle> ok
16:03:23 <slaweq> I pinged them in neutron channel
16:03:24 <haleyb> hi
16:03:35 <hongbin> o/
16:03:36 <haleyb> slaweq found us :)
16:03:44 <bcafarel> o/
16:03:51 <slaweq> :)
16:04:17 <njohnston> o/
16:04:29 <slaweq> ok, lets start then
16:04:29 <njohnston> my coffeemaker was slow :-/
16:04:34 <slaweq> #topic Actions from previous meetings
16:04:46 <slaweq> mlavalle to continue tracking not reachable FIP in trunk tests
16:05:03 <mlavalle> I did after we merged your patch
16:05:13 <mlavalle> the error is still taking place
16:05:25 <slaweq> yes, but I think that it's less often now, right?
16:05:51 <mlavalle> it may be slightly lower
16:06:22 <mlavalle> when did your fix exactly merge in master?
16:07:05 <slaweq> https://review.openstack.org/#/c/620805/
16:07:09 <slaweq> 3.12 I think
16:08:00 <mlavalle> in master it seems it merged 11/27
16:08:12 <slaweq> ahh, right
16:08:15 <slaweq> this one was for pike
16:08:18 <slaweq> sorry
16:08:32 <mlavalle> and yes, there seem to be lower number of hits since then
16:08:48 <mlavalle> so the next step is to change the scenario test
16:09:03 <slaweq> yes, will You do it?
16:09:06 <mlavalle> to create the fip when the port is already bound
16:09:14 <mlavalle> and yes, I will do that today
16:09:40 <slaweq> #action mlavalle to change trunk scenario test and see if that will help with FIP issues
16:09:47 <mlavalle> ++
16:10:07 <slaweq> ok, next one then
16:10:09 <slaweq> haleyb takes all this week :D
16:10:19 <slaweq> haleyb: did You fixed all our bugs?
16:10:21 <slaweq> :P
16:10:41 <haleyb> yeah, you didn't see the changes? :p
16:10:48 <slaweq> ok, thx haleyb++
16:10:55 <slaweq> we can go to the next one then
16:10:57 <slaweq> njohnston to remove neutron-grenade job from neutron's CI queues
16:11:24 <njohnston> So I have a change up to do that, but there are pre-requisites
16:11:46 <njohnston> I can remove the job from our queues but leave the job definition there
16:11:53 <njohnston> because it is used elsewhere
16:12:05 <njohnston> I have a change up already for the grenade repo, which uses neutron-grenade
16:12:18 <njohnston> but it looks like nova, cinder, glance, keystone all use neutron-grenade
16:12:33 <njohnston> so I'll need to spin changes for all of those before we can delete the job definition
16:12:57 <slaweq> but in our gate we have already neutron-grenade-py3, right?
16:13:02 <njohnston> oh, and tempest, openstacksdk, and the requirements repo
16:13:37 <njohnston> we have grenade-py3 yes, provided by the grenade repo
16:14:09 <slaweq> so maybe we can remove this neutron-grenade job from our check and gate queues to not overload gates
16:14:29 <slaweq> and add some comments that definition of this job is used by other projects so can't be removed now
16:14:34 <slaweq> mlavalle: what do You think?
16:14:47 <njohnston> yes, I think I'll do that to start with, and I'll gradually work towards the other projects replacing neutron-grenade with grenade-py3
16:15:19 <mlavalle> sounds good
16:15:34 <slaweq> great, so go on with it njohnston :)
16:15:57 <slaweq> #action njohnston will remove neutron-grenade from neutron ci queues and add comment why definition of job is still needed
16:16:09 <njohnston> will do
16:16:13 <slaweq> thx
16:16:17 <slaweq> so, next one
16:16:19 <slaweq> slaweq to continue debugging bug 1798475 when journal log will be available in fullstack tests
16:16:20 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:16:42 <slaweq> I found this failure on patch where journal log is already stored
16:16:48 <slaweq> http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/testr_results.html.gz
16:17:10 <slaweq> but I still have no idea why keepalived switched VIP address from one "host" to the other
16:17:29 <slaweq> I will keep investigating that
16:17:38 <slaweq> #action slaweq to continue debugging bug 1798475
16:17:39 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:18:12 <slaweq> if I will need help I will ping haleyb :)
16:18:26 <slaweq> next one
16:18:28 <slaweq> slaweq to continue fixing functional-py3 tests
16:18:34 <slaweq> no progress on this one still
16:18:49 <njohnston> I hope you didn't mind that I raised the flag on this one in the neutron team meeting
16:18:51 <slaweq> we know that we need to limit output from functional tests, remove all deprecations and things like that and check then what will be the result
16:19:03 <slaweq> njohnston: no, great that You did it
16:19:17 <slaweq> we definitely need more eyes looking at it
16:20:07 <slaweq> if there is anyone else who wants to try to fix it this week, feel free
16:20:21 <slaweq> I will assign it to myself as an action just to not forget about it
16:20:24 <slaweq> ok?
16:20:32 <njohnston> +1
16:20:45 <mlavalle> so based on how much progress I make on other assignments, I may take a look at this one
16:20:49 <slaweq> #action slaweq to continue fixing functional-py3 tests
16:21:04 <slaweq> mlavalle: thx, ping me if You will need any info
16:21:10 <mlavalle> ack
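To make the "limit output from functional tests" idea above concrete, here is a minimal standard-library sketch of filtering deprecation warnings and verbose log lines. The helper name quiet_test_output is hypothetical; this is not the actual neutron change, only the general technique it refers to.

```python
# Minimal sketch (assumed approach, not the actual neutron patch): drop
# deprecation warnings and chatty log lines so the captured functional
# test output stays small.  Uses only the standard library.
import logging
import warnings


def quiet_test_output():
    # Silence DeprecationWarning and PendingDeprecationWarning messages.
    warnings.simplefilter("ignore", DeprecationWarning)
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    # Raise the root logger level so INFO/DEBUG noise is not captured.
    logging.getLogger().setLevel(logging.WARNING)


if __name__ == "__main__":
    quiet_test_output()
    warnings.warn("this would normally be printed", DeprecationWarning)
    print("deprecation warnings are now filtered")
```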
16:21:25 <slaweq> ok, next one
16:21:27 <slaweq> njohnston to research py3 conversion for neutron grenade multinode jobs
16:21:46 <njohnston> no progress on that one yet
16:22:03 <slaweq> ok, lets move it to the next week then, right?
16:22:09 <njohnston> sounds good
16:22:10 <slaweq> #action njohnston to research py3 conversion for neutron grenade multinode jobs
16:22:12 <slaweq> thx
16:22:19 <slaweq> next:
16:22:22 <slaweq> slaweq to convert neutron-tempest-plugin jobs to py3
16:22:26 <slaweq> Patch https://review.openstack.org/#/c/621401/
16:22:38 <slaweq> I rechecked it few times and it looks that it works fine
16:22:52 <slaweq> all neutron-tempest-plugin jobs for master branch are switched to py3 in this patch
16:23:04 <slaweq> and jobs for stable branches are still on py27 of course
16:23:14 <slaweq> please review it if You will have some time
16:24:00 <mlavalle> slaweq: I will look at it today
16:24:07 <slaweq> thx mlavalle
16:24:15 <slaweq> ok, so lets move on
16:24:26 <slaweq> njohnston add tempest-slow and networking-ovn-tempest-dsvm-ovs-release to grafana
16:25:14 <njohnston> hmm, I added that config but it looks like I never did a 'git review'. I'll do that right away, sorry I missed it.
16:25:34 <slaweq> no problem, thx njohnston for working on this :)
16:25:57 <slaweq> please add me as reviewer to it when You will push it
16:26:05 <njohnston> will do
16:26:13 <slaweq> thx
16:26:20 <slaweq> ok, lets move on to the last one
16:26:22 <slaweq> mlavalle to discuss about neutron-tempest-dvr job in L3 meeting
16:26:38 <mlavalle> mhhh, I couldn't attend the meeting
16:27:15 <mlavalle> I'll discuss with haleyb in the Neutron channel
16:27:15 <slaweq> ok, so will You talk about it this week?
16:27:20 <slaweq> ahh, thx
16:27:32 <slaweq> so I will not add action for it anymore
16:27:39 <mlavalle> please do
16:27:48 <slaweq> any questions/something to add?
16:27:54 <mlavalle> so I can report concusion next week
16:28:01 <mlavalle> conclusion
16:29:02 <slaweq> ok, lets move on then
16:29:03 <slaweq> #topic Python 3
16:29:09 <slaweq> njohnston: any updates?
16:30:09 <slaweq> I think we should update our etherpad, to mark what is already done
16:30:20 <slaweq> I will do that this week
16:30:40 <njohnston> I don't think there are any updates this week
16:30:43 <njohnston> from my end
16:31:00 <slaweq> #action slaweq to update etherpad with what is already converted to py3
16:31:28 <slaweq> ok, from my side there is also nothing more to talk today
16:31:37 <slaweq> so lets move on, next topic
16:31:44 <slaweq> #topic Grafana
16:31:49 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:15 <slaweq> good news is that tempest jobs, and scenario jobs looks much better this week
16:34:25 <slaweq> even those multinode dvr jobs
16:35:08 <mlavalle> what happened last week?
16:35:19 <slaweq> when?
16:35:53 <mlavalle> seems we had a spike around Thursday in gate
16:37:14 <slaweq> hmm, I don't know
16:37:58 <hongbin> possibly there are some infra related issues
16:38:21 <mlavalle> well, let's keep it in mind
16:38:33 <slaweq> one day last week I saw some failures because of missing dependencies or something like that
16:38:50 <slaweq> but I don't remember what it was exactly and in which day
16:39:33 <njohnston> Hey, just a quick sanity check: we have the tempest-slow job in the check queue as voting but not in gate. Is that the intended configuration?
16:40:04 <njohnston> As in it is not present in the gate queue in any form
16:40:31 <hongbin> in theory, if a job is voting in check, it should be in the gate as well
16:41:04 <slaweq> yes, we should add it to gate queue as well
16:41:09 <njohnston> yes, that is my impression, I just wanted to see if there is some reason for this to be an exception that I don't know about
16:41:14 <slaweq> but maybe lets add it to grafana first
16:41:29 <njohnston> it turns out, it already is in grafana, at least for the check queue
16:41:34 <slaweq> then check how it works for one/two weeks and we will then decide
16:42:02 <slaweq> indeed, I missed it
16:42:50 <njohnston> So I will submit a change to add it to gate, and we can slow-walk that change until we are comfortable on it
16:43:21 <mlavalle> sounds like a plan
16:43:39 <slaweq> ok
16:44:10 <slaweq> ok, one more thing I want to mention
16:44:25 <slaweq> not related strictly to grafana but related to job failures
16:44:48 <slaweq> together with hongbin we decided to collect examples of failures which we spot in CI in etherpad:
16:44:49 <slaweq> https://etherpad.openstack.org/p/neutron-ci-failures
16:45:09 <slaweq> there are already quite many examples of issues from this week
16:45:40 <slaweq> some of them are not related to neutron but we wanted to have info how many times such issue happens for us
16:45:48 <slaweq> some are related to neutron for sure
16:46:19 <slaweq> most common issue related to neutron are problems with connectivity to FIP
16:46:31 <mlavalle> yeap
16:46:49 <slaweq> I don't know if that is the same issue every time but visible culprit is the same - no connection to FIP
16:46:55 <slaweq> it is in many different jobs
16:47:04 <mlavalle> I don't think it is always the same root cause
16:47:53 <slaweq> yes, probably not, but we should take a look at them, collect such examples for few weeks and report bugs if it hits more often in same job/test
16:48:30 <slaweq> from other things, I wanted to mention that there is (again) some issue with ovsdbapp timeouts
16:48:44 <slaweq> it hits us in many different jobs
16:49:00 <slaweq> and otherwiseguy was checking it already
16:49:35 <hongbin> there is a patch to turn on the python-ovs debugging to do further troubleshooting for that
16:50:16 <hongbin> #link https://review.openstack.org/#/c/621572/
16:50:33 <slaweq> and bug reported already: https://bugs.launchpad.net/bugs/1802640
16:50:34 <openstack> Launchpad bug 1802640 in neutron "TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand in q-agt gate failure" [Medium,Confirmed]
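For context on the python-ovs debugging patch hongbin links above, a minimal sketch of the general technique is shown below, using only the standard logging module. The logger names "ovsdbapp" and "ovs" are assumptions for illustration; this is not the content of the linked review.

```python
# A minimal sketch (not the linked review): raise log verbosity for the
# OVSDB client libraries so command timeouts leave more context in the
# captured logs.  The logger name prefixes are assumed for illustration.
import logging
import sys


def enable_ovsdb_debug_logging():
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    for name in ("ovsdbapp", "ovs"):  # assumed logger names
        logger = logging.getLogger(name)
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)


if __name__ == "__main__":
    enable_ovsdb_debug_logging()
    logging.getLogger("ovsdbapp").debug("debug logging is now visible")
```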
16:51:21 <slaweq> ok, lets talk about some specific jobs now
16:51:28 <slaweq> #topic fullstack/functional
16:51:43 <slaweq> according to fullstack tests we have new issue I think
16:52:04 <slaweq> I saw at least 3 times that test: neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed failed:
16:52:12 <slaweq> http://logs.openstack.org/36/617736/10/check/neutron-fullstack/61fa2eb/logs/testr_results.html.gz
16:52:15 <slaweq> http://logs.openstack.org/68/424468/34/check/neutron-fullstack/4422e70/logs/testr_results.html.gz
16:52:17 <slaweq> http://logs.openstack.org/08/620708/2/check/neutron-fullstack/51099db/logs/testr_results.html.gz
16:52:28 <slaweq> is there anyone who wants to take a look on it?
16:52:48 <hongbin> i can try to look at it
16:53:39 <mlavalle> is it always the IP address already allocated exception?
16:53:48 <slaweq> thx hongbin
16:54:00 <slaweq> patch which introduced this test is https://review.openstack.org/#/c/606876/
16:54:21 <slaweq> mlavalle: I don't know, I didn't have time to investigate it
16:55:05 <slaweq> yes, it looks that it is the same culprit every time
16:55:17 <slaweq> it should be easy to fix I hope :)
16:55:41 <slaweq> #action hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:56:19 <slaweq> I want also to highlight again failing functional db migrations tests
16:56:34 <slaweq> I found it at least 5 or 6 times failing this week
16:56:47 <slaweq> there are already logs from those tests stored properly
16:56:58 <slaweq> but there is not too much info there
16:57:16 <slaweq> for example: http://logs.openstack.org/49/613549/6/gate/neutron-functional/315ddc4/logs/testr_results.html.gz
16:57:53 <slaweq> it looks that migrations are working but very slow
16:58:33 <slaweq> I think we should compare such failed tests with passed ones and check if there are always the same migrations slowest ones or maybe it's random
16:58:48 <slaweq> but I don't know if I will have time for it this week
16:59:43 <slaweq> ok, and that's all what I wanted to highlight from issues in CI
16:59:54 <slaweq> thx for attending and see You next week
16:59:58 <slaweq> #endmeeting
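As a follow-up to slaweq's suggestion near the end of the meeting to compare slow db migrations between failed and passing runs, here is a rough, self-contained sketch of such a comparison; the helper function and the sample revision ids and timings are hypothetical, not data from the linked job logs.

```python
# A rough sketch (hypothetical helper, not an existing neutron tool):
# given per-migration durations from a failed run and a passing run,
# print the revisions whose runtime grew the most, to check whether the
# same migrations are always the slow ones or whether it is random.
from typing import Dict, List, Tuple


def slowest_regressions(failed: Dict[str, float],
                        passed: Dict[str, float],
                        top: int = 5) -> List[Tuple[float, str]]:
    deltas = []
    for revision, failed_seconds in failed.items():
        passed_seconds = passed.get(revision, 0.0)
        deltas.append((failed_seconds - passed_seconds, revision))
    return sorted(deltas, reverse=True)[:top]


if __name__ == "__main__":
    # Durations in seconds; revision ids and numbers are made up.
    failed_run = {"rev_a1b2": 42.0, "rev_c3d4": 3.1, "rev_e5f6": 15.7}
    passing_run = {"rev_a1b2": 2.5, "rev_c3d4": 2.9, "rev_e5f6": 3.0}
    for delta, revision in slowest_regressions(failed_run, passing_run):
        print(f"{revision}: +{delta:.1f}s slower in the failed run")
```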