16:00:37 <slaweq> #startmeeting neutron_ci
16:00:38 <openstack> Meeting started Tue Sep 18 16:00:37 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:39 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 <slaweq> hi
16:00:41 <openstack> The meeting name has been set to 'neutron_ci'
16:00:48 <mlavalle> o/
16:01:47 <slaweq> let's wait a few more minutes for others
16:01:53 <slaweq> maybe someone else will join
16:02:05 <mlavalle> np
16:02:28 <mlavalle> I didn't show up late, did I?
16:02:40 <slaweq> mlavalle: no, You were just on time :)
16:03:17 <mlavalle> I was distracted doing the homework you gave me last meeting and I was startled by the time
16:03:20 * haleyb wanders in
16:03:57 <slaweq> :)
16:04:50 <njohnston> o/
16:04:54 <slaweq> hi njohnston :)
16:04:58 <slaweq> let's start then
16:05:06 <njohnston> hello slaweq, sorry I am late - working on a bug
16:05:11 <slaweq> #topic Actions from previous meetings
16:05:18 <slaweq> njohnston: no problem :)
16:05:26 <slaweq> * mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:26 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:05:31 <slaweq> I think it's done, right? :)
16:05:36 <mlavalle> we did and we fixed it
16:05:42 <mriedem> yar
16:05:42 <mlavalle> \o/
16:05:46 <mriedem> virt_type=qemu
16:05:52 <mlavalle> thanks mriedem
16:05:53 <slaweq> thx mriedem for help on that
16:06:12 <njohnston> \o/
16:06:19 <slaweq> ok, next one
16:06:22 <slaweq> * mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:06:22 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:39 <mlavalle> manjeets took over that bug last week
16:06:56 <slaweq> yes, I saw, but I don't think his approach to fixing it is good
16:06:58 <manjeets> ++
16:07:00 <mlavalle> he literally stole it from my hands, despite my strong resistance ;-)
16:07:06 <slaweq> LOL
16:07:20 <slaweq> I can imagine that mlavalle :P
16:07:21 <mlavalle> manjeets: I am going to assign the bug to you, ok?
16:07:32 <manjeets> mlavalle, ++
16:07:52 <slaweq> I was also looking at it quickly during the weekend
16:08:13 <slaweq> and I think that it is again some race condition or something like that
16:08:30 <manjeets> It could be some subscribed callback as well?
16:08:43 <slaweq> IMO setting these ports to down should come from the L2 agent, not directly from the l3 service plugin
16:09:26 <slaweq> because it is something like: neutron-server sends a notification that the router is disabled to the L3 agent, the L3 agent removes the ports, so the L2 agent updates the ports to down status (probably)
16:09:48 <slaweq> and probably this notification isn't sent properly, and because of that the other things don't happen
16:10:22 <slaweq> haleyb: mlavalle: does it make sense to You? or maybe I misunderstood something in this workflow?
16:11:06 <haleyb> slaweq: yes, that makes sense. originally i thought it was in the ha db mode code, but l2 is maybe more likely
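(A minimal illustrative sketch of the flow slaweq describes above, using made-up class and method names rather than neutron's actual code: the server-side notifier asks the L3 plugin which hosts are running the router and casts a notification to the L3 agent on each of them; if that host list comes back empty, as suspected during migration from HA, nothing is cast and the ports never get reported as DOWN.)

```python
# Illustrative sketch only -- hypothetical names, not neutron's actual classes.
# It mimics the flow discussed above: neutron-server asks the L3 plugin which
# hosts are hosting the router, then casts a routers_updated notification to
# the L3 agent on each of those hosts.  If the host list comes back empty
# (e.g. because the HA bindings were already torn down during migration),
# nothing is cast, the L3 agent never removes its devices, and the L2 agent
# never reports the router ports as DOWN.

class FakeL3Plugin(object):
    def __init__(self, hosting_map):
        # router_id -> list of hosts currently hosting that router
        self._hosting_map = hosting_map

    def get_hosts_hosting_router(self, router_id):
        return self._hosting_map.get(router_id, [])


class FakeL3AgentNotifier(object):
    def __init__(self, plugin, rpc_cast):
        self._plugin = plugin
        self._rpc_cast = rpc_cast   # callable(host, method, router_ids)

    def routers_updated(self, router_id):
        hosts = self._plugin.get_hosts_hosting_router(router_id)
        if not hosts:
            # This silent no-op is the suspected failure mode.
            print("no hosts hosting %s -> no notification sent" % router_id)
            return
        for host in hosts:
            self._rpc_cast(host, "routers_updated", [router_id])


if __name__ == "__main__":
    sent = []
    notifier = FakeL3AgentNotifier(
        FakeL3Plugin({"router-legacy": ["compute-1"]}),
        lambda host, method, ids: sent.append((host, method, ids)),
    )
    notifier.routers_updated("router-legacy")   # cast goes out, ports go DOWN
    notifier.routers_updated("router-ha")       # empty host list, nothing happens
    print(sent)
```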
16:11:26 <mlavalle> agree
16:11:28 * manjeets takes a note, will dig into l2
16:11:29 <haleyb> but it's someone getting the event and missing something
16:12:34 <slaweq> give me a sec, I will check one thing related to that
16:14:37 <slaweq> so what I found was that when I was checking migration from Legacy to HA, ports were down after this notification: https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L55
16:14:53 <slaweq> in case of migration from HA it didn't happen
16:15:17 <slaweq> but I didn't have more time at the airport to dig deeper into it
16:15:34 <manjeets> slaweq, you that notification wasn't called in case of HA ?
16:15:38 <manjeets> you mean**
16:16:05 <slaweq> I think that it was called but then the "hosts" list was empty and it wasn't sent to any agent
16:16:18 <slaweq> but it still has to be checked, I'm not 100% sure
16:16:58 <manjeets> i'll test that today
16:17:05 <slaweq> ok, thx manjeets
16:17:08 <manjeets> the host thing, whether it's empty in that case
16:17:17 <slaweq> so I will assign this to You as an action, ok?
16:17:24 <manjeets> sure !
16:17:27 <slaweq> thx
16:17:51 <slaweq> #action manjeets continue debugging why migration from HA routers fails 100% of times
16:17:57 <slaweq> ok, let's move on
16:18:03 <slaweq> next one
16:18:05 <slaweq> * mlavalle to check issue with failing test_attach_volume_shelved_or_offload_server test
16:18:10 <manjeets> but the issue occurs after migration to HA, migration from HA worked
16:18:24 <manjeets> slaweq, it's migration to HA, from HA i think is fine
16:19:23 <slaweq> manjeets: http://logs.openstack.org/29/572729/9/check/neutron-tempest-plugin-dvr-multinode-scenario/21bb5b7/logs/testr_results.html.gz
16:19:25 <slaweq> first example
16:19:32 <slaweq> from HA to any other always fails
16:19:56 <mlavalle> yes, that was my experience when I tried it
16:20:23 <manjeets> ah ok got it !
16:20:37 <slaweq> :)
16:20:47 <mlavalle> are we moving on then?
16:20:54 <slaweq> I think we can
16:21:13 <mlavalle> That bug was the reason I was almost late for this meeting
16:21:27 <mlavalle> we have six instances in kibana over the past 7 days
16:21:42 <mlavalle> two of them are with the experimental queue
16:21:45 <slaweq> so not too many
16:22:09 <mlavalle> njohnston is playing with change 580450 and the experimental queue
16:22:20 <njohnston> yes
16:22:24 <slaweq> so it's njohnston's fault :P
16:22:42 <njohnston> => fault <=
16:22:44 <njohnston> :-)
16:22:59 <mlavalle> the other failures are with neutron-tempest-ovsfw and tempest-multinode-full, which are non-voting if I remember correctly
16:23:55 <slaweq> IIRC this issue was caused by the instance not pinging after shelve/unshelve, right?
16:24:05 <mlavalle> yeah
16:24:10 <mlavalle> we get a timeout
16:24:29 <slaweq> maybe something was changed in nova then and this is not an issue anymore?
16:25:20 <mlavalle> I'll dig a little longer
16:25:30 <mlavalle> before concluding that
16:25:45 <slaweq> ok
16:25:51 <mlavalle> for the time being I am just making the point that it is not hitting us very hard
16:26:03 <slaweq> that is good information
16:26:19 <slaweq> ok, let's assign it to You for one more week then
16:26:25 <mlavalle> yes
16:26:31 <slaweq> #action mlavalle to check issue with failing test_attach_volume_shelved_or_offload_server test
16:26:34 <slaweq> thx mlavalle
16:26:44 <slaweq> can we go to the next one?
16:27:04 * mlavalle has long let go of the hope of meeting El Comandante without getting homework
16:27:27 <slaweq> LOL
16:27:42 <slaweq> mlavalle: next week I will not give You homework :P
16:28:03 <mlavalle> np whatsoever.... just taking the opportunity to make a joke
16:28:14 <slaweq> I know :)
16:28:19 <slaweq> ok, let's move on
16:28:22 <slaweq> last one
16:28:24 <slaweq> * njohnston to switch fullstack-python35 to python36 job
16:29:26 <slaweq> njohnston: are You around?
16:29:33 <njohnston> yes, I have a change for that up; I think it just needs a little love now that the gate is clear
16:29:47 <njohnston> I'll check it and make sure it's good to go
16:29:59 <slaweq> ok, thx njohnston
16:30:01 <slaweq> sounds good
16:30:18 <njohnston> https://review.openstack.org/599711
16:30:23 <slaweq> #action njohnston will continue work on switching fullstack-python35 to python36 job
16:30:31 <slaweq> ok, that's all from last meeting
16:31:00 <slaweq> #topic Grafana
16:31:05 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:07 <slaweq> I was checking grafana earlier today and there weren't many problems there
16:32:23 <slaweq> at least no problems which we are not aware of :)
16:32:30 <haleyb> not since we removed the dvr-multinode from the gate :(
16:32:39 <slaweq> haleyb: yes
16:33:28 <slaweq> so we still have neutron-tempest-plugin-dvr-multinode-scenario at 100% failures, but that's related to the issue with migration from HA routers
16:33:47 <slaweq> and the issue related to the grenade job
16:34:03 <slaweq> other things are, I think, in quite good shape now
16:34:43 <slaweq> I was also recently checking the reasons for some failures in tempest jobs and it was usually some issues with volumes (I don't have links to examples now)
16:34:45 <mlavalle> I have a question
16:34:50 <slaweq> sure mlavalle
16:35:17 <mlavalle> This doesn't have an owner: https://bugs.launchpad.net/neutron/+bug/1791989
16:35:18 <openstack> Launchpad bug 1791989 in neutron "grenade-dvr-multinode job fails" [High,Confirmed]
16:35:30 <slaweq> yes, sorry
16:35:35 <slaweq> I forgot to assign myself to it
16:35:40 <slaweq> I just did it now
16:35:42 <mlavalle> it is not voting for the time being, but we need to fix it, right?
16:35:54 <slaweq> yes, I was checking it just today
16:35:59 <mlavalle> ah ok, question answered
16:36:02 <mlavalle> thanks
16:36:03 <slaweq> :)
16:36:28 <slaweq> and I wanted to talk about it now as it's the last point on my list for today :)
16:36:36 <mlavalle> ok
16:36:43 <slaweq> #topic grenade
16:36:49 <slaweq> so, speaking of this issue
16:37:38 <slaweq> yesterday I pushed the patch https://review.openstack.org/#/c/602156/6/playbooks/legacy/neutron-grenade-dvr-multinode/run.yaml to neutron
16:38:56 <slaweq> together with the depends-on from grenade https://review.openstack.org/#/c/602204/7/projects/60_nova/resources.sh it allowed me to log into at least the controller node in this job
16:39:18 <slaweq> so I tried today and manually spawned the same vm as is spawned by the grenade script
16:39:37 <slaweq> and it all worked perfectly fine, the instance was pinging after around 5 seconds :/
16:40:05 <slaweq> so now I added some additional logs to this grenade script: https://review.openstack.org/#/c/602204/9/projects/60_nova/resources.sh
16:40:38 <slaweq> and I'm running this job once again: http://zuul.openstack.org/stream.html?uuid=928662f6de054715835c6ef9599aefbd&logfile=console.log
16:40:45 <slaweq> I'm waiting for its results
16:41:23 <slaweq> I also compared the packages installed on nodes in such a failed job from this week with the packages installed before 7.09 on a job which passed
16:41:43 <slaweq> I have a list of packages which have different versions
16:41:59 <slaweq> there are different libvirt, linux kernel, qemu and openvswitch versions
16:42:08 <slaweq> so many potential culprits
16:42:35 <slaweq> I think I will start with downgrading libvirt as it was updated in the cloud-archive repo on 7.09
16:42:47 <haleyb> slaweq: we will eventually figure that one out!
16:43:04 <slaweq> any ideas what else I can check/test/do here?
16:43:36 <mlavalle> checking packages seems the right way to go
16:44:33 <slaweq> yes, so I will try to send some DNM patches with each of those packages downgraded (except the kernel) and will try to recheck them a few times
16:44:45 <haleyb> yes, other than that you can keep adding debug commands to the script - e.g. for looking at interfaces, routes, ovs, etc., but packages are a good first step
16:44:48 <slaweq> and see if the issue still happens on each of them
16:45:21 <slaweq> haleyb: yes, I just don't know if it's possible (and how to do it) to run such commands on the subnode
16:45:56 <slaweq> so currently I only added some OSC commands to check the status of the instance/port/fip at the control plane level
16:47:12 <slaweq> so I will continue debugging this issue
16:47:28 <slaweq> ohh, one more thing, yesterday haleyb and I also spotted it in a stable/pike job
16:47:58 <slaweq> and when I was looking for this issue in logstash, I found that it happened a couple of times in stable/pike
16:48:09 <slaweq> less than in master, but still it happened there also
16:48:11 <mlavalle> nice ctach
16:48:14 <mlavalle> catch
16:48:45 <slaweq> the strange thing is that I didn't see it on the stable/queens or stable/rocky branches
16:49:27 <slaweq> so as this is failing on the "old" openstack side, it means that it fails on neutron with the stable/rocky and stable/ocata branches
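(A rough sketch of the package comparison slaweq describes, assuming "dpkg -l" listings saved from the logs of a passing run, from before 7.09, and a failing run; the file names below are made up.)

```python
# Compare two "dpkg -l" listings and print the packages whose versions differ
# between the passing and the failing job run.

def parse_dpkg_list(path):
    """Return {package: version} from a saved 'dpkg -l' output file."""
    versions = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            # installed packages are listed as: ii <name> <version> <arch> ...
            if len(parts) >= 3 and parts[0] == "ii":
                versions[parts[1]] = parts[2]
    return versions


def diff_packages(good_path, bad_path):
    good = parse_dpkg_list(good_path)
    bad = parse_dpkg_list(bad_path)
    for pkg in sorted(set(good) & set(bad)):
        if good[pkg] != bad[pkg]:
            print("%s: %s -> %s" % (pkg, good[pkg], bad[pkg]))


if __name__ == "__main__":
    # hypothetical file names for the two saved listings
    diff_packages("dpkg-l.passing.txt", "dpkg-l.failing.txt")
```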
16:50:47 <slaweq> and that's all as summary of this f..... issue :/
16:50:58 <mlavalle> LOL
16:51:09 <slaweq> I will assign it to myself as an action for this week
16:51:15 <mlavalle> I see you are learning some Frnech
16:51:19 <mlavalle> French
16:51:28 <slaweq> #action slaweq will continue debugging multinode-dvr-grenade issue
16:51:40 <slaweq> mlavalle: it can be "French" :P
16:52:10 * slaweq is becoming Hulk when he has to deal with the grenade multinode issue ;)
16:52:21 <njohnston> LOL!
16:52:43 <slaweq> ok, that's all from me about this issue
16:52:52 <slaweq> #topic Open discussion
16:53:03 <slaweq> do You have anything else to talk about?
16:53:32 <njohnston> I just sent an email to openstack-dev to inquire about the python3 conversion status of tempest and grenade
16:53:52 <slaweq> thx njohnston, I will read it after the meeting then
16:54:01 <njohnston> if those conversions have not happened yet, and further if they need to be done globally, that could be interesting.
16:55:00 <njohnston> But I'll try not to borrow trouble. That's it from me.
16:55:08 <slaweq> speaking of emails, I want to ask mlavalle one thing :)
16:55:19 <mlavalle> ok
16:55:37 <slaweq> do You remember to send the email about adding some 3rd party projects' jobs to neutron?
16:55:52 <mlavalle> yes
16:55:57 <slaweq> ok, great :)
16:56:50 <slaweq> ok, so if there is nothing else to talk about, I think we can finish now
16:56:58 <mlavalle> Thanks!
16:57:02 <slaweq> thanks for attending
16:57:06 <slaweq> and see You next week
16:57:07 <slaweq> o/
16:57:11 <slaweq> #endmeeting