15:01:24 #startmeeting neutron_dvr
15:01:26 Meeting started Wed Aug 31 15:01:24 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:29 The meeting name has been set to 'neutron_dvr'
15:01:36 #chair Swami
15:01:37 Current chairs: Swami haleyb
15:02:00 #topic Announcements
15:02:22 N-3 is closing this week
15:02:43 any new features that haven't merged will need to be granted a FFE
15:02:45 almost there
15:03:11 We still will be able to fix bugs, especially if they help the check/gate job stability
15:03:14 haleyb: what do you think about the Fast path exit, do we need a FFE or can we move it to the next release and backport it.
15:04:04 Swami: i know we don't usually backport new features, but there doesn't seem to be a config change required
15:04:21 how close do you think it is?
15:04:23 it can still be considered as a bug
15:04:32 It is pretty good.
15:04:52 I have addressed a couple of review comments yesterday as well on the route rules.
15:06:09 haleyb: if you get some time today you can review the rule patch for the fast path exit and we can keep it ready, if it works out we can merge.
15:06:37 I will have to look at the reviews again, then we can move forward once we're happy
15:06:56 haleyb: ok, makes sense.
15:07:07 at this point, we don't need to hurry.
15:07:07 #topic Bugs
15:07:19 right, i'd rather get it right
15:07:42 haleyb: we don't have new bugs filed this week. So we pretty much have to address all the existing bugs.
15:08:20 * jschwarz is also here, but in a different meeting
15:08:25 #link https://bugs.launchpad.net/neutron/+bug/1612192
15:08:25 Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:08:25 No new bugs? must be a record
15:09:19 haleyb: Did you get a chance to triage this.
15:09:51 I think this is seen in the gate.
Are we still seeing this in the gate after the gate stabilization.
15:09:58 Swami: no, been mostly looking at the dvr multinode job failures
15:10:17 i don't know if we're still seeing it, need to ask logstash
15:10:33 Both the criticals that we have were seen in the gate.
15:11:09 #link https://bugs.launchpad.net/neutron/+bug/1612804
15:11:09 Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:11:53 This was also reported as seen in the gate.
15:12:52 regarding the first, there have been 8 occurrences in the past two weeks, all on one day
15:12:55 * jschwarz is back now
15:13:13 haleyb: so will it be due to the gate issues or a bad patch.
15:13:16 jschwarz: hi
15:13:30 Swami: actually 8 in past 30 days
15:13:57 all happened in 24 hours on 8/23-24
15:14:08 haleyb: so let us monitor it for another week and see what happens.
15:14:11 i will look and close if necessary
15:15:22 the second happens more often, but i don't know if it's neutron, have not dug
15:15:43 haleyb: ok thanks.
15:15:44 i'll guess and say it's an issue saving to storage
15:16:17 haleyb: you may be right.
15:16:40 haleyb: we have seen the shelve_instance fail earlier as well.
15:17:28 Swami: hmm, some show a failure to get DHCP
15:17:51 haleyb: so are there two different symptoms for the same failure.
15:17:53 and that's in the non-dvr test
15:18:47 Swami: any time we can't ssh to an instance it's neutron's fault :(
15:19:01 haleyb: yeah! you got it.
15:19:22 s/neutron/dvr
15:20:02 The next set of bugs are related to DVR+HA+L3
15:20:14 we have discussed these bugs earlier.
15:20:23 haleyb, we're seeing a problem where in certain cases, with HA-only routers, instances lose connectivity
15:20:29 haleyb, that's still DVR's fault ;-)
15:20:46 jschwarz: let me know when patches are up :)
15:20:53 haleyb, :D
15:21:26 jschwarz: are there any HA patches that need attention?
15:21:27 jschwarz: mostly the L3+HA+DVR combination is generating a lot more bugs.
15:21:59 jschwarz: any update on the L3+HA+DVR bugs that you are working on.
15:22:05 yes, sorry
15:22:12 I came unprepared :)
15:22:17 #link https://bugs.launchpad.net/neutron/+bug/1597461
15:22:17 Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:22:21 jschwarz: no problem
15:22:39 Swami, the patch for that merged a few days ago
15:22:54 jschwarz: so can we close this bug, is it all fixed.
15:23:13 Swami, yes, with a side note that a couple of other repos are complaining it broke some of their tests
15:23:30 jschwarz: that's not good.
15:23:53 Swami, one was already taken care of, the other one I'm not sure if it has (it was from networking-odl)
15:24:03 #link https://bugs.launchpad.net/neutron/+bug/1602320
15:24:03 Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Undecided,In progress] - Assigned to Dongcan Ye (hellochosen)
15:24:40 Swami, https://review.openstack.org/#/c/357458/9/neutron/services/l3_router/l3_router_plugin.py@77
15:25:38 jschwarz: so are you planning to rollback or fix it with a different patch
15:25:57 Swami, I see a bug fix here: https://review.openstack.org/#/c/363175/
15:26:10 Swami, I think it's best to keep it separated and not rollback
15:26:21 Swami, I will work with Isaku to provide a good fix for this ASAP
15:26:30 jschwarz: ok thanks.
15:27:00 jschwarz: do you have any input on the keepalived bug that I posted above.
15:27:19 Swami, nope :<
15:27:40 Swami, it looks like this patch has seen no action for 2 weeks now
15:27:51 I'll ping Dongcan Ye to see if this can be moved forward
15:28:04 jschwarz: thanks
15:28:36 #link https://bugs.launchpad.net/neutron/+bug/1602614
15:28:36 Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:29:09 Swami, the fix is https://review.openstack.org/#/c/255237/ - the long l2pop patch by anilvenkata
15:29:33 it's being reviewed by Carl, Assaf and Kevin and since it's a complicated fix it's taking a while
15:29:34 jschwarz: I think that is still under review.
15:29:43 yes
15:29:44 hopefully this will get in N's RC
15:29:44 jschwarz: ok makes sense.
15:29:56 getting different suggestions :)
15:30:03 jschwarz: ok
15:30:19 anilvenkata: thanks
15:30:32 thanks Swami jschwarz haleyb
15:31:27 Swami, we also have https://bugs.launchpad.net/neutron/+bug/1607381
15:31:27 Launchpad bug 1607381 in neutron "HA router in l3 dvr_snat/legacy agent has no ha_port" [Undecided,In progress] - Assigned to LIU Yulong (dragon889)
15:31:27 I did not get to that review yesterday. I hope to get to it today.
15:31:37 anilvenkata: jschwarz: ^
15:31:43 carl_baldwin, ack, thanks :)
15:31:55 carl_baldwin, thanks Carl
15:31:57 the bug I linked to is being dealt with in https://review.openstack.org/#/c/265672/
15:32:04 anilvenkata: can we close this bug or is it still valid #link https://bugs.launchpad.net/neutron/+bug/1595043
15:32:04 Launchpad bug 1595043 in neutron "Make DVR portbinding implementation useful for HA ports" [Medium,In progress] - Assigned to venkata anil (anil-venkata)
15:32:22 Swami, we will keep it
15:32:31 Swami, we have other issues with HA
15:32:37 anilvenkata: ok.
15:32:44 Swami, thanks Swami
15:33:19 going back, https://review.openstack.org/#/c/265672/ just needs reviews IMO - it looks to be ready
15:33:57 jschwarz: thanks
15:34:07 The next one is
15:34:12 #link https://bugs.launchpad.net/neutron/+bug/1593354
15:34:12 Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New]
15:34:42 I have not seen this one before (and it wasn't even triaged)
15:34:54 jschwarz: yes
15:35:12 haleyb: did you get a chance to triage this bug.
15:35:43 anilvenkata, ^ is this something that your patch deals with?
15:35:54 Swami: no
15:36:10 jschwarz, no for bug 1593354
15:36:10 bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New] https://launchpad.net/bugs/1593354
15:36:38 haleyb: thanks,
15:36:45 Swami, I'll have a pass through it tomorrow and triage this, unless haleyb beats me to it
15:38:06 jschwarz: thanks
15:38:42 That's all I had for the bugs this week.
15:39:11 any other bugs to talk about from anyone?
15:39:53 * jschwarz claps his hands to show they are empty
15:40:28 you need to clap again and make all the bugs go away :)
15:40:38 * jschwarz claps again
15:40:48 jschwarz: thanks
15:40:50 * jschwarz magically fails to clap as his hands miss each other
15:41:08 #topic Gate failures
15:41:47 the dvr and dvr-multinode (and regular multinode) jobs still have higher failure rates
15:42:31 i am continuing to triage this when i have time, filing bugs as i go
15:42:31 Any insight in to the high multi-node rate?
15:43:36 carl_baldwin: DHCP is failing, but have not determined the timeline to see if it's the agent not starting
15:44:29 haleyb: Are you all alone in looking in to this?
15:44:33 kevinbenton and armax have been cleaning-up the dhcp issues i've been finding, but none actually fix the bug
15:44:51 haleyb: dvr is not happy still
15:45:32 armax: yes, i continue to look
15:46:55 and i do see the neutron-dvr job is about 12% failure in the gate today
15:47:10 haleyb: is this for the single node
15:47:27 yes, http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
15:47:46 gate-tempest-dsvm-neutron-dvr gate job
15:48:12 haleyb: thanks
15:48:41 but the failure rate is falling, don't know if it was related to the dhcp patch that merged last night
15:49:08 haleyb: will this also fix the multinode issue with respect to dhcp.
15:49:40 Swami: don't think so, dhcp could fail for many reasons as we know...
15:50:30 carl_baldwin: i will look into the gate dvr failure, and yes i'm all alone doing it at the moment, but will reach out to people
15:50:31 haleyb: yes.
15:51:02 haleyb: let me know, I might be able to help some.
15:51:37 i need to find a patch failing first
15:52:12 haleyb: is there a particular test that fails with respect to dhcp or is it random.
15:52:38 Is it the instance that is not able to get the dhcp and so the ssh fails.
15:52:54 Swami: TestNetworkBasicOps is usually the one
15:53:44 haleyb: thanks
15:53:51 right, console shows dhcp failing and no IP on eth0. i started looking on the server and dhcp agent and found some issues, but it's a work in progress
15:55:10 haleyb: thanks
15:55:18 #topic Open discussion
15:55:37 Swami: only note on wiki is live migration patch
15:56:01 haleyb: what is that. I don't get it.
15:56:36 https://review.openstack.org/#/c/275073/
15:56:49 nova patch for live migration w/neutron
15:57:15 haleyb: yes got it. We are still waiting for a +2 on it.
15:57:34 even the experimental job was happy
15:57:49 I think john ran the experimental job, so let us wait and see.
15:58:18 #link https://review.openstack.org/#/c/353788/
15:58:31 haleyb: I need your blessing on this patch.
15:58:57 Swami: yes, saw that, will look
15:59:21 haleyb: thanks
15:59:36 we're at end of hour, thanks for all the hard work everyone!
15:59:38 There is one backport patch as well for stable/liberty
15:59:45 we will take it offline.
15:59:47 #endmeeting