15:01:09 #startmeeting Neutron_dvr
15:01:10 Meeting started Wed Nov 25 15:01:09 2015 UTC and is due to finish in 60 minutes. The chair is Swami. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:14 The meeting name has been set to 'neutron_dvr'
15:01:34 #topic announcement
15:01:48 haleyb: will not be here today, he is on vacation
15:02:02 the attendance today is going to be slim
15:02:15 so we keep it short then :)
15:02:19 Happy Thanksgiving to all.
15:02:35 regXboi: yes let us try to keep it short and effective
15:02:49 Hi
15:03:26 fitoduarte: hi
15:03:32 #topic Agenda
15:03:36 #link https://wiki.openstack.org/wiki/Meetings/Neutron-DVR
15:04:04 #topic bugs
15:04:23 There was a new bug that was added a couple of days back.
15:04:40 #link https://bugs.launchpad.net/neutron/+bug/1456073
15:04:41 Launchpad bug 1456073 in OpenStack Compute (nova) "Connection to an instance with floating IP breaks during block migration when using DVR" [Undecided,New]
15:05:21 https://bugs.launchpad.net/neutron/+bug/1414559 might be related according to comments
15:05:21 Launchpad bug 1414559 in neutron "OVS drops RARP packets by QEMU upon live-migration - VM temporarily disconnected" [Medium,In progress] - Assigned to Oleg Bondarev (obondarev)
15:05:34 I was looking at this bug and it seems right now the way DVR dynamically schedules routers is causing issues for nova live migration.
15:05:36 I have patches in nova and neutron for it
15:05:50 obondarev: can you provide a link in here
15:05:52 obondarev: can we link the patches onto the wiki to save clicks?
15:06:01 (that is in addition to putting it here)
15:06:05 regXboi: good idea
15:06:19 will do, patches are https://review.openstack.org/#/c/246898/ and https://review.openstack.org/#/c/246910/
15:06:45 but I think more work is required to address floating ip issue
15:07:03 obondarev: did you see my comment in the bug in launchpad
15:07:13 Swami: I did
15:07:26 obondarev: yes I agree more work is required in terms of nova to neutron communication with respect to the pre-migration event.
15:07:49 Swami: that's what my patches are about
15:08:23 but it's not that explicit on neutron side that we're at pre migration step
15:08:35 obondarev: but I looked at the existing nova patch that you have for state update, this would just address one part of it.
15:08:54 Swami: right
15:08:58 obondarev: are you planning to add more nova patches to address this handshake between nova and neutron
15:09:28 on neutron side we should be sure we're at pre migration step and schedule router to destination host I guess
15:10:06 Swami: I had no such plans before I saw this bug about 5 mins ago :)
15:10:15 will think more on it
15:10:21 yes, nova should inform neutron that it is in pre-migration state to a destination host and wait for a response from neutron to complete the migration.
15:11:01 is it working fine with legacy routers?
15:11:14 Yes I had a discussion about this with armax as well. armax mentioned that he will try to tie in the nova PTL to this bug and we can decide on the right flows between nova and neutron.
15:12:02 obondarev: based on my team's input, they did see that there was a 5sec delay with legacy routers and 8sec delay with dvr routers when doing live migration.
15:12:25 Swami: delay or disconnect?
15:12:29 with dvr
15:12:36 bug says disconnect
15:13:02 obondarev: it is delay that they measured.
15:13:31 obondarev: yes the bug states about the ssh connection getting disconnected. I have asked our internal team to do some investigation on this.
15:13:33 Swami: not sure I got what is delay here
15:14:03 tcp will make a disconnect look like a delay.
15:14:30 fitoduarte: ah, ok, thanks
15:14:50 but bug clearly says disconnect
15:15:32 obondarev: agreed, it might be a disconnect.
15:15:44 d
15:16:01 obondarev: but to close I will try to see both the legacy and dvr and will update the bug report.
15:16:20 Swami: sounds good, thanks
15:16:22 disconnect would mean you can't reach the vm ever again?
15:16:33 The next one in the list is related to live migration.
15:16:37 #link https://bugs.launchpad.net/neutron/+bug/1508869
15:16:37 Launchpad bug 1508869 in neutron "DVR: handle dvr serviceable port's host change" [Low,In progress] - Assigned to Oleg Bondarev (obondarev)
15:17:00 obondarev: I think you already have a patch for this bug, that you posted above.
15:17:02 fitoduarte: you'll be able to reconnect but current connection will be lost
15:17:28 for 1508869 fix is on review, there are some concerns on it
15:17:30 obondarev: I had one question related to this patch.
15:17:50 obondarev: right now you are dealing with all dvr serviceable ports, do we need that?
15:18:00 regXboi: you have concerns as well, can you please clarify in the review?
15:18:11 obondarev: I thought we can restrict to the compute ports for live migration.
15:18:34 Swami: we can't be sure if it's live migration or not
15:18:39 obondarev: since we are not going to migrate the dhcp ports.
15:18:56 Swami: I answered in the review, can you please check?
15:19:10 obondarev: ok will check it out.
15:19:10 Swami: dhcp ports can also change host
15:19:34 obondarev: what would be the use case for a dhcp port to change host
15:19:35 not suer why we should restrict
15:19:42 sure*
15:20:16 Swami: during the transition from active to reserved and back to active DHCP port may change host
15:20:29 depending on dhcp agent that grabbed reserved port
15:21:17 obondarev: In this case you are saying "if" we have multiple "Nodes" with dhcp agents running on all nodes.
15:21:19 Swami: I might be missing smth, but what are benefits of restriction to only compute ports?
15:21:30 Swami: correct
15:21:53 Swami: that's usual case on production clouds
15:22:33 obondarev: There is no real benefit from restricting, other than that I don't want to burden the control plane by removing the router namespace just for the dhcp ports.
15:22:56 obondarev: that's fine, I will address it in your review.
15:23:00 Let us move on.
15:23:26 Swami: we're adding this namespace on controllers just because of dhcp ports - why not remove it when there are no dhcp ports?
15:24:14 obondarev: sure agreed
15:24:41 regXboi: please also check https://review.openstack.org/#/c/238478/
15:24:45 Let us quickly go over the high priority bugs, since we have already discussed them.
15:25:17 #link https://bugs.launchpad.net/neutron/+bug/1462154
15:25:17 Launchpad bug 1462154 in neutron "With DVR Pings to floating IPs replied with fixed-ips" [High,In progress] - Assigned to ZongKai LI (lzklibj)
15:25:51 #link https://review.openstack.org/#/c/246894/ Patch is in review
15:26:01 So if you have not reviewed it please review it.
15:26:08 The next one in the list is
15:26:19 There is also https://review.openstack.org/#/c/246855
15:26:31 #link https://bugs.launchpad.net/neutron/+bug/1505575
15:26:31 Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:26:41 as part of the bugfix for 1462154.
15:26:42 stephen-ma: thanks
15:26:56 review is in progress
15:27:22 there might not be memory issue anymore, but timeout issue is there
15:27:23 #link https://review.openstack.org/#/c/234067/
15:27:36 obondarev: seems that it is in merge conflict
15:27:52 obondarev: so what removed the memory issue, do you know?
15:28:10 yeah, just wanted to get carl_baldwin answers before uploading new patchset
15:28:29 obondarev: thanks, makes sense.
15:28:34 Swami: let me find the link to the patch..
15:29:03 Swami: here it is https://review.openstack.org/#/c/214974/
15:29:38 obondarev: thanks
15:30:03 #link https://bugs.launchpad.net/neutron/+bug/1513678
15:30:03 Launchpad bug 1513678 in neutron "At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:30:18 The patches are up for review for this one.
15:30:40 I got some review comments from carl_baldwin yesterday and will address them today based on his reply.
15:31:09 #link https://review.openstack.org/#/c/241843/
15:31:25 #link https://review.openstack.org/#/c/242286/
15:31:36 The next in the list is
15:31:54 #link https://bugs.launchpad.net/neutron/+bug/1515360
15:31:54 Launchpad bug 1515360 in neutron "Add more verbose to Tempest Test Errors that causes "SSHTimeout" seen in CVR and DVR" [High,Invalid]
15:32:23 I made some headway on this one by adding some self test within the agent. The patch is under review right now.
15:32:43 #link https://review.openstack.org/#/c/247748/
15:33:03 Please review this patch at your earliest convenience.
15:33:21 armax mentioned that he can push in this patch for debugging purposes and it can be removed later.
15:33:22 this has a -2 on it
15:33:33 so you need to get that cleared
15:33:49 regXboi: that is fine, we can get it cleared once we are ready to merge.
15:34:11 obondarev: I am still seeing some issue in this patch when I try to ping and validate the ping.
15:34:53 obondarev: can you help me figure out what the problem is here. It is throwing a _stderr
15:35:01 * carl_baldwin driving to see family. Will read log...
15:35:17 carl_baldwin: thanks, drive safe.
15:35:30 Swami: which patch, sorry?
15:35:38 Swami: got distracted, sorry
15:35:48 247748
15:36:11 Swami: ok, will review 247748
15:36:49 Swami: shows nothing for https://review.fuel-infra.org/#/c/247748
15:36:51 from the gate logs, I did see that it was able to ping twice successfully but ping did not go through in a lot of the tests. I am not sure if it is a timing issue.
15:37:07 sorry
15:37:19 #link https://review.openstack.org/#/c/247748/
15:37:34 yeah, thanks
15:37:34 obondarev: did you get it.
15:37:47 fitoduarte: adolfo is your server patch in.
15:37:52 signed up for it
15:37:59 obondarev: thanks
15:38:42 ok I don't see him here anymore.
15:39:04 BTW did someone see SSH timeout issues in the gates lately?
15:39:07 Are there any other bugs that we wanted to discuss here.
15:39:31 obondarev: I do see SSHTimeout in the gate.
15:39:44 let's look
15:39:53 obondarev: This is from last week, I have captured it in the slide deck.
15:40:09 Swami: link please?
15:40:14 #link https://drive.google.com/file/d/0B4kh-7VVPWlPMkdrQWFFdjNsSnM/view?usp=sharing
15:40:18 seam: sorry which server patch?
15:40:38 obondarev: as part of the gate failure investigation I am trying to capture all the failures that I see.
15:40:51 fitoduarte: DVR HA server patch
15:41:14 since we are on this topic right now let us change the topic.
15:41:21 #topic Gate-failures
15:41:36 Swami: yeah, I just don't see SSH timeout happening often (like before) on dvr multinode job
15:42:48 swami: has two cores blessings but assaf is holding it. hopefully he'll be done this week.
15:43:16 obondarev: but i think in most of the failure cases, the issue is we can't get to the VM through SSH
15:43:47 obondarev: are you seeing any other failure symptom that is causing this graph to deviate from the cvr.
15:44:08 Swami: I'm not
15:44:11 obondarev: in my honest opinion sometimes the logstash results are also confusing.
15:45:12 obondarev: regXboi: Does either of you know if the dsvm-functional test failure results are included in the dvr failure analysis or is it different.
15:45:37 Swami: which analysis?
15:46:03 obondarev: The graphite failure graph, does it take into consideration the dsvm-functional test failures.
15:46:42 #link https://goo.gl/L1WODG
15:46:55 ah, I see
15:47:02 no, that's a different graphite graph
15:47:29 regXboi: ok, the reason I asked is recently I did see some instability with the dsvm-functional tests.
15:48:06 obondarev: regXboi: If you see any gate failures please log them in the PowerPoint deck, so that it will be easy for us to get to the root.
15:48:13 http://goo.gl/WaeP7L
15:48:47 fullstack and functional both appear ill
15:49:23 regXboi: also the graph for neutron-full, neutron-dvr is way high when compared to this. I am not sure if the metrics are calculated correctly.
15:49:39 in my patches and in those that I review I didn't see SSH timeout failures on multinode dvr (and any other job) for a while so that's why I have an impression it might be gone for some reason
15:50:11 obondarev: yes you are right, I have not seen it in my patches either. But the logstash still reports that it is seeing it in the gate.
15:50:31 obondarev: This might be from the one or two patches that cause a whole lot of failures in DVR.
15:50:40 obondarev: user error.
15:50:50 Swami: so failures related to patches?
15:51:00 obondarev: yes.
15:51:01 doesn't count I guess
15:51:20 obondarev: I am not sure about it.
15:51:47 ok, let us move on
15:51:49 ok lets proceed collecting statistics
15:52:00 obondarev: sure
15:52:48 #topic performance/scalability
15:53:00 obondarev: do you have anything to add here.
15:53:16 not really, https://blueprints.launchpad.net/neutron/+spec/improve-dvr-l3-agent-binding is in progress
15:53:51 obondarev: ok
15:54:00 anything else
15:54:29 #topic Open-Discussion
15:54:43 Do we have any other topic to discuss in the last 5 minutes
15:54:57 If we don't have any other topics we can end the meeting.
15:55:26 Please log the failures that you are seeing in the gate with respect to the DVR in the powerpoint.
15:55:39 Let us discuss it next week.
15:56:00 Thanks everyone for joining this meeting.
15:56:11 Wish you all again a Happy Thanksgiving.
15:56:16 #endmeeting