16:00:37 #startmeeting neutron_ci
16:00:38 Meeting started Tue Sep 18 16:00:37 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:39 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 hi
16:00:41 The meeting name has been set to 'neutron_ci'
16:00:48 o/
16:01:47 let's wait a few more minutes for others
16:01:53 maybe someone else will join
16:02:05 np
16:02:28 I didn't show up late, did I?
16:02:40 mlavalle: no, You were just on time :)
16:03:17 I was distracted doing the homework you gave me last meeting and I was startled by the time
16:03:20 * haleyb wanders in
16:03:57 :)
16:04:50 o/
16:04:54 hi njohnston :)
16:04:58 let's start then
16:05:06 hello slaweq, sorry I am late - working on a bug
16:05:11 #topic Actions from previous meetings
16:05:18 njohnston: no problem :)
16:05:26 * mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:26 Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:05:31 I think it's done, right? :)
16:05:36 we did and we fixed it
16:05:42 yar
16:05:42 \o/
16:05:46 virt_type=qemu
16:05:52 thanks mriedem
16:05:53 thx mriedem for help on that
16:06:12 \o/
16:06:19 ok, next one
16:06:22 * mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:06:22 Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:39 manjeets took over that bug last week
16:06:56 yes, I saw, but I don't think his approach to fixing it is good
16:06:58 ++
16:07:00 he literally stole it from my hands, despite my strong resistance ;-)
16:07:06 LOL
16:07:20 I can imagine that mlavalle :P
16:07:21 manjeets: I am going to assign the bug to you, ok?
16:07:32 mlavalle, ++
16:07:52 I was also looking at it quickly during the weekend
16:08:13 and I think that it is again some race condition or something like that
16:08:30 It could be some subscribed callback as well?
16:08:43 IMO this setting of ports to DOWN should come from the L2 agent, not directly from the l3 service plugin
16:09:26 because it is something like: neutron-server sends a notification that the router is disabled to the L3 agent, the L3 agent removes ports, so the L2 agent updates the ports to DOWN status (probably)
16:09:48 and probably this notification isn't sent properly, and because of that the other things don't happen
16:10:22 haleyb: mlavalle does it make sense for You? or maybe I misunderstood something in this workflow?
16:11:06 slaweq: yes, that makes sense. originally i thought it was in the hadbmode code, but l2 is maybe more likely
16:11:26 agree
16:11:28 * manjeets takes a note, will dig into l2
16:11:29 but it's someone getting the event and missing something
16:12:34 give me a sec, I will check one thing related to that
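
A rough sketch of how the migration could be reproduced by hand to watch the port-status behaviour being discussed above. This is only a sketch, not something shown in the meeting: the router name is a placeholder, admin credentials are assumed, and it assumes the installed openstackclient supports the --ha/--no-ha flags on "router set" and the --router filters on "port list" / "network agent list".

    # create a router and make it HA; the ha/no-ha attribute can only be
    # changed while the router is administratively down
    openstack router create test-router
    openstack router set --disable test-router
    openstack router set --ha test-router
    openstack router set --enable test-router

    # ... attach a subnet, boot a VM behind it, add a FIP, then migrate back ...
    openstack router set --disable test-router
    openstack router set --no-ha test-router
    openstack router set --enable test-router

    # watch whether the router ports ever transition DOWN/ACTIVE and which
    # L3 agents are hosting the router after the migration
    openstack port list --router test-router -c ID -c Status
    openstack network agent list --router test-router
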
16:14:37 so what I found was that when I was checking migration from Legacy to HA, ports were down after this notification: https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L55
16:14:53 in case of migration from HA it didn't happen
16:15:17 but I didn't have more time at the airport to dig more into it
16:15:34 slaweq, you that notification wasn't called in case of HA?
16:15:38 you mean**
16:16:05 I think that it was called but then the "hosts" list was empty and it wasn't sent to any agent
16:16:18 but it still has to be checked, I'm not 100% sure
16:16:58 I'll test that today
16:17:05 ok, thx manjeets
16:17:08 the host thing, if it's empty in that case
16:17:17 so I will assign this to You as an action, ok?
16:17:24 sure!
16:17:27 thx
16:17:51 #action manjeets continue debugging why migration from HA routers fails 100% of times
16:17:57 ok, let's move on
16:18:03 next one
16:18:05 * mlavale to check issue with failing test_attach_volume_shelved_or_offload_server test
16:18:10 but the issue occurs after migration to HA, migration from HA worked
16:18:24 slaweq, it's migration to HA; from HA I think is fine
16:19:23 manjeets: http://logs.openstack.org/29/572729/9/check/neutron-tempest-plugin-dvr-multinode-scenario/21bb5b7/logs/testr_results.html.gz
16:19:25 first example
16:19:32 from HA to any other always fails
16:19:56 yes, that was my experience when I tried it
16:20:23 ah ok, got it!
16:20:37 :)
16:20:47 are we moving on then?
16:20:54 I think we can
16:21:13 That bug was the reason I was almost late for this meeting
16:21:27 we have six instances in kibana over the past 7 days
16:21:42 two of them are with the experimental queue
16:21:45 so not too many
16:22:09 njohnston is playing with change 580450 and the experimental queue
16:22:20 yes
16:22:24 so it's njohnston's fault :P
16:22:42 => fault <=
16:22:44 :-)
16:22:59 the other failures are with neutron-tempest-ovsfw and tempest-multinode-full, which are non-voting if I remember correctly
16:23:55 IIRC this issue was caused by the instance not pinging after shelve/unshelve, right?
16:24:05 yeah
16:24:10 we get a timeout
16:24:29 maybe something was changed in nova then and this is not an issue anymore?
16:25:20 I'll dig a little longer
16:25:30 before concluding that
16:25:45 ok
16:25:51 for the time being I am just making the point that it is not hitting us very hard
16:26:03 that is good information
16:26:19 ok, let's assign it to You for one more week then
16:26:25 yes
16:26:31 #action mlavale to check issue with failing test_attach_volume_shelved_or_offload_server test
16:26:34 thx mlavalle
16:26:44 can we go to the next one?
16:27:04 * mlavalle has long let go of the hope of meeting El Comandante without getting homework
16:27:27 LOL
16:27:42 mlavalle: next week I will not give You homework :P
16:28:03 np whatsoever.... just taking the opportunity to make a joke
16:28:14 I know :)
16:28:19 ok, let's move on
16:28:22 last one
16:28:24 njohnston to switch fullstack-python35 to python36 job
16:29:26 njohnston: are You around?
16:29:33 yes, I have a change for that up; I think it just needs a little love now that the gate is clear
16:29:47 I'll check it and make sure it's good to go
16:29:59 ok, thx njohnston
16:30:01 sounds good
16:30:18 https://review.openstack.org/599711
16:30:23 #action njohnston will continue work on switch fullstack-python35 to python36 job
16:30:31 ok, that's all from last meeting
16:31:00 #topic Grafana
16:31:05 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:07 I was checking grafana earlier today and there weren't many problems there
16:32:23 at least not problems which we are not aware of :)
16:32:30 not since we removed the dvr-multinode from the gate :(
16:32:39 haleyb: yes
16:33:28 so we still have neutron-tempest-plugin-dvr-multinode-scenario at 100% failures, but it's related to the issue with migration from HA routers
16:33:47 and the issue related to the grenade job
16:34:03 other things I think are in quite good shape now
16:34:43 I was also recently checking the reasons for some failures in tempest jobs and it was usually some issue with volumes (I don't have links to examples now)
16:34:45 I have a question
16:34:50 sure mlavalle
16:35:17 This doesn't have an owner: https://bugs.launchpad.net/neutron/+bug/1791989
16:35:18 Launchpad bug 1791989 in neutron "grenade-dvr-multinode job fails" [High,Confirmed]
16:35:30 yes, sorry
16:35:35 I forgot to assign myself to it
16:35:40 I just did it now
16:35:42 it is not voting for the time being, but we need to fix it, right?
16:35:54 yes, I was checking that even today
16:35:59 ah ok, question answered
16:36:02 thanks
16:36:03 :)
16:36:28 and I wanted to talk about it now as it's the last point on my list for today :)
16:36:36 ok
16:36:43 #topic grenade
16:36:49 so speaking about this issue
16:37:38 yesterday I pushed patch https://review.openstack.org/#/c/602156/6/playbooks/legacy/neutron-grenade-dvr-multinode/run.yaml to neutron
16:38:56 together with a depends-on from grenade, https://review.openstack.org/#/c/602204/7/projects/60_nova/resources.sh, it allowed me to log into at least the controller node in this job
16:39:18 so I tried today and manually spawned the same vm as is spawned by the grenade script
16:39:37 and all worked perfectly fine, the instance was pinging after around 5 seconds :/
16:40:05 so now I added some additional logs to this grenade script: https://review.openstack.org/#/c/602204/9/projects/60_nova/resources.sh
16:40:38 and I'm running this job once again: http://zuul.openstack.org/stream.html?uuid=928662f6de054715835c6ef9599aefbd&logfile=console.log
16:40:45 I'm waiting for the results of it
16:41:23 I also compared the packages installed on nodes in such a failed job from this week with the packages installed before 7.09 on a job which passed
16:41:43 I have a list of packages which have different versions
16:41:59 there are different libvirt, linux-kernel, qemu, openvswitch versions
16:42:08 so many potential culprits
16:42:35 I think I will start with downgrading libvirt as it was updated in the cloud-archive repo on 7.09
16:42:47 slaweq: we will eventually figure that one out!
16:43:04 any ideas what else I can check/test/do here?
16:43:36 checking packages seems the right way to go
16:44:33 yes, so I will try to send some DNM patches with each of those packages downgraded (except the kernel) and will try to recheck them a few times
16:44:45 yes, other than that you can keep adding debug commands to the script - e.g. for looking at interfaces, routes, ovs, etc, but packages is a good first step
16:44:48 and see if the issue still happens on each of them
16:45:21 haleyb: yes, I just don't know if it's possible (and how to do it) to run such commands on the subnode
16:45:56 so currently I only added some OSC commands to check the status of the instance/port/fip at the control plane level
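
A sketch of the kind of debug commands being discussed for the grenade script. This is not the actual change under review: the server name is a placeholder for whatever the grenade resource script creates, <router-id> and <fip-address> are placeholders, devstack-style credentials are assumed to be sourced, and the data-plane commands would have to run on the node hosting the L3/OVS agents (the open question about the subnode above).

    # control-plane view of the resources the grenade script created
    openstack server show grenade_vm -c status -c addresses
    openstack port list --server grenade_vm -c ID -c Status
    openstack floating ip list

    # the reachability check the job is actually failing on
    ping -c 4 <fip-address>

    # data-plane view (needs to run where the L3/OVS agents live)
    ip netns list
    sudo ip netns exec qrouter-<router-id> ip addr
    sudo ovs-vsctl show
    sudo ovs-ofctl dump-flows br-int | head -50
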
16:47:12 so I will continue debugging this issue
16:47:28 ohh, one more thing: yesterday haleyb and I also spotted it in a stable/pike job
16:47:58 and when I was looking for this issue in logstash, I found that it happened a couple of times in stable/pike
16:48:09 less than in master, but it still happened there too
16:48:11 nice ctach
16:48:14 catch
16:48:45 the strange thing is that I didn't see it on stable/queens or stable/rocky branches
16:49:27 so as this is failing on the "old" openstack, this means that it fails on neutron with stable/rocky and stable/ocata branches
16:50:47 and that's all as a summary of this f..... issue :/
16:50:58 LOL
16:51:09 I will assign it to myself as an action for this week
16:51:15 I see you are learning some Frnech
16:51:19 French
16:51:28 #action slaweq will continue debugging multinode-dvr-grenade issue
16:51:40 mlavalle: it can be "French" :P
16:52:10 * slaweq is becoming Hulk when he has to deal with the grenade multinode issue ;)
16:52:21 LOL!
16:52:43 ok, that's all from me about this issue
16:52:52 #topic Open discussion
16:53:03 do You have anything else to talk about?
16:53:32 I just sent an email to openstack-dev to inquire about the python3 conversion status of tempest and grenade
16:53:52 thx njohnston, I will read it after the meeting then
16:54:01 if those conversions have not happened yet, and further if they need to be done globally, that could be interesting.
16:55:00 But I'll try not to borrow trouble. That's it from me.
16:55:08 speaking about emails, I want to ask mlavalle one thing :)
16:55:19 ok
16:55:37 do You remember to send the email about adding some 3rd party project jobs to neutron?
16:55:52 yes
16:55:57 ok, great :)
16:56:50 ok, so if there is nothing else to talk about, I think we can finish now
16:56:58 Thanks!
16:57:02 thanks for attending
16:57:06 and see You next week
16:57:07 o/
16:57:11 #endmeeting