15:01:09 <Swami> #startmeeting Neutron_dvr
15:01:10 <openstack> Meeting started Wed Nov 25 15:01:09 2015 UTC and is due to finish in 60 minutes.  The chair is Swami. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:14 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:34 <Swami> #topic announcement
15:01:48 <Swami> haleyb: will not be here today, he is on vacation
15:02:02 <Swami> the attendance today is going to be slim
15:02:15 <regXboi> so we keep it short then :)
15:02:19 <Swami> Happy Thanksgiving to all.
15:02:35 <Swami> regXboi: yes let us try to keep it short and effective
15:02:49 <fitoduarte> Hi
15:03:26 <Swami> fitoduarte: hi
15:03:32 <Swami> #topic Agenda
15:03:36 <Swami> #link https://wiki.openstack.org/wiki/Meetings/Neutron-DVR
15:04:04 <Swami> #topic bugs
15:04:23 <Swami> There was a new bug that was added a couple of days back.
15:04:40 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1456073
15:04:41 <openstack> Launchpad bug 1456073 in OpenStack Compute (nova) "Connection to an instance with floating IP breaks during block migration when using DVR" [Undecided,New]
15:05:21 <obondarev> https://bugs.launchpad.net/neutron/+bug/1414559 might be related according to comments
15:05:21 <openstack> Launchpad bug 1414559 in neutron "OVS drops RARP packets by QEMU upon live-migration - VM temporarily disconnected" [Medium,In progress] - Assigned to Oleg Bondarev (obondarev)
15:05:34 <Swami> I was looking at this bug and it seems right now the way DVR dynamically schedules routers is causing issues for nova live migration.
15:05:36 <obondarev> I have patches in nova and neutron for it
15:05:50 <Swami> obondarev: can you provide a link in here
15:05:52 <regXboi> obondarev: can we link the patches onto the wiki to save clicks?
15:06:01 <regXboi> (that is in addition to putting it here)
15:06:05 <Swami> regXboi: good idea
15:06:19 <obondarev> will do, patches are https://review.openstack.org/#/c/246898/ and https://review.openstack.org/#/c/246910/
15:06:45 <obondarev> but I think more work is required to address floating ip issue
15:07:03 <Swami> obondarev: did you see my comment in the bug in launchpad
15:07:13 <obondarev> Swami: I did
15:07:26 <Swami> obondarev: yes I agree more work is required in terms of nova to neutron communication with respect to the pre-migration event.
15:07:49 <obondarev> Swami: that's what my patches are about
15:08:23 <obondarev> but it's not that explicit on neutron side that we're at pre migration step
15:08:35 <Swami> obondarev: but I looked at the existing nova patch that you have for state update, this would just address one part of it.
15:08:54 <obondarev> Swami: right
15:08:58 <Swami> obondarev: are you planning to add more nova patches to address this handshake between nova and neutron?
15:09:28 <obondarev> on neutron side we should be sure we're at pre migration step and schedule router to destination host I guess
15:10:06 <obondarev> Swami: I had no such plans before I saw this bug about 5 mins ago :)
15:10:15 <obondarev> will think more on it
15:10:21 <Swami> yes, nova should inform neutron that it is in pre-migration state to a destination host and wait for a response from neutron to complete the migration.
15:11:01 <obondarev> is it working fine with legacy routers?
15:11:14 <Swami> Yes I had discussion about this with armax as well. armax mentioned that he will try to tie in the nova PTL to this bug and we can decide on the right flows between nova and neutron.
15:12:02 <Swami> obondarev: based on my team's input, they did see that there was a 5sec delay with legacy routers and 8sec delay with dvr routers when doing live migration.
15:12:25 <obondarev> Swami: delay or disconnect?
15:12:29 <obondarev> with dvr
15:12:36 <obondarev> bug says disconnect
15:13:02 <Swami> obondarev: it is delay that they measured.
15:13:31 <Swami> obondarev: yes the bug states about the ssh connection getting disconnected. I have asked our internal team to do some investigation on this.
15:13:33 <obondarev> Swami: not sure I got what is delay here
15:14:03 <fitoduarte> tcp will make a disconnect look like a delay.
15:14:30 <obondarev> fitoduarte: ah, ok, thanks
15:14:50 <obondarev> but bug clearly says disconnect
15:15:32 <Swami> obondarev: agreed, it might be a disconnect.
15:15:44 <fitoduarte> d
15:16:01 <Swami> obondarev: but to close I will try to see both the legacy and dvr and will update the bug report.
15:16:20 <obondarev> Swami: sounds good, thanks
15:16:22 <fitoduarte> disconnect would mean you can't reach the vm ever again?
15:16:33 <Swami> The next one in the list is related to live migration.
15:16:37 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1508869
15:16:37 <openstack> Launchpad bug 1508869 in neutron "DVR: handle dvr serviceable port's host change" [Low,In progress] - Assigned to Oleg Bondarev (obondarev)
15:17:00 <Swami> obondarev: I think you already have a patch for this bug, that you posted above.
15:17:02 <obondarev> fitoduarte: you'll be able to reconnect but current connection will be lost
15:17:28 <obondarev> for 1508869 fix is on review, there are some concerns on it
15:17:30 <Swami> obondarev: I had one question related to this patch.
15:17:50 <Swami> obondarev: right now you are dealing with all dvr serviceable ports, do we need that?
15:18:00 <obondarev> regXboi: you have concerns as well, can you please clarify in the review?
15:18:11 <Swami> obondarev: I thought we can restrict to the compute ports for live migration.
15:18:34 <obondarev> Swami: we can't be sure if it's live migration or not
15:18:39 <Swami> obondarev: since we are not going to migrate the dhcp ports.
15:18:56 <obondarev> Swami: I answered in the review, can you please check?
15:19:10 <Swami> obondarev: ok will check it out.
15:19:10 <obondarev> Swami: dhcp ports can also change host
15:19:34 <Swami> obondarev: what would be use case for dhcp port to change host
15:19:35 <obondarev> not sure why we should restrict
15:20:16 <obondarev> Swami: during the transition from active to reserved and back to active DHCP port may change host
15:20:29 <obondarev> depending on dhcp agent that grabbed reserved port
15:21:17 <Swami> obondarev: In this case you are saying "if" we have multiple "Nodes" with dhcp agents running on all nodes.
15:21:19 <obondarev> Swami: I might be missing something, but what are the benefits of restriction to only compute ports?
15:21:30 <obondarev> Swami: correct
15:21:53 <obondarev> Swami: that's usual case on production clouds
15:22:33 <Swami> obondarev: There is no benefit from restricting as such, I just don't want to burden the control plane by removing the router namespace just for the dhcp ports.
15:22:56 <Swami> obondarev: that's fine, I will address it in your review.
15:23:00 <Swami> Let us move on.
15:23:26 <obondarev> Swami: we're adding this namespace on controllers just because of dhcp ports - why not remove it when there are no dhcp ports?
15:24:14 <Swami> obondarev: sure agreed
15:24:41 <obondarev> regXboi: please also check https://review.openstack.org/#/c/238478/
15:24:45 <Swami> Let us quickly go over the high priority bugs, since we have already discussed them.
15:25:17 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1462154
15:25:17 <openstack> Launchpad bug 1462154 in neutron "With DVR Pings to floating IPs replied with fixed-ips" [High,In progress] - Assigned to ZongKai LI (lzklibj)
15:25:51 <Swami> #link https://review.openstack.org/#/c/246894/ Patch is in review
15:26:01 <Swami> So if you have not reviewed it please review it.
15:26:08 <Swami> The next one in the list is
15:26:19 <stephen-ma> There is also https://review.openstack.org/#/c/246855
15:26:31 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1505575
15:26:31 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:26:41 <stephen-ma> as part of the bugfix for 1462154.
15:26:42 <Swami> stephen-ma: thanks
15:26:56 <obondarev> review is in progress
15:27:22 <obondarev> there might not be memory issue anymore, but timeout issue is there
15:27:23 <Swami> #link https://review.openstack.org/#/c/234067/
15:27:36 <Swami> obondarev: seems that it is in merge conflict
15:27:52 <Swami> obondarev: so what removed the memory issue, do you know?
15:28:10 <obondarev> yeah, just wanted to get carl_baldwin answers before uploading new patchset
15:28:29 <Swami> obondarev: thanks, makes sense.
15:28:34 <obondarev> Swami: let me find the link to the patch..
15:29:03 <obondarev> Swami: here it is https://review.openstack.org/#/c/214974/
15:29:38 <Swami> obondarev: thanks
15:30:03 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1513678
15:30:03 <openstack> Launchpad bug 1513678 in neutron "At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:30:18 <Swami> The patches are up for review for this one.
15:30:40 <Swami> I got some review comments from carl_baldwin yesterday and will address it today based on his reply.
15:31:09 <Swami> #link https://review.openstack.org/#/c/241843/
15:31:25 <Swami> #link https://review.openstack.org/#/c/242286/
15:31:36 <Swami> The next in the list is
15:31:54 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1515360
15:31:54 <openstack> Launchpad bug 1515360 in neutron "Add more verbose to Tempest Test Errors that causes "SSHTimeout" seen in CVR and DVR" [High,Invalid]
15:32:23 <Swami> I made some headway on this one by adding some self test within the agent. The patch is under review right now.
15:32:43 <Swami> #link https://review.openstack.org/#/c/247748/
15:33:03 <Swami> Please review this patch at your earliest convenience.
15:33:21 <Swami> armax mentioned that he can push in this patch for debugging purposes and it can be removed later.
15:33:22 <regXboi> this has a -2 on it
15:33:33 <regXboi> so you need to get that cleared
15:33:49 <Swami> regXboi: that is fine, we can get it cleared once we are ready to merge.
15:34:11 <Swami> obondarev: I am still seeing some issue in this patch when I try to ping and validate the ping.
15:34:53 <Swami> obondarev: can you help me figure out what the problem is here. It is throwing a _stderr
15:35:01 * carl_baldwin driving to see family. Will read log...
15:35:17 <Swami> carl_baldwin: thanks, drive safe.
15:35:30 <obondarev> Swami: which patch, sorry?
15:35:38 <obondarev> Swami: got distracted, sorry
15:35:48 <Swami> 247748
15:36:11 <obondarev> Swami: ok, will review 247748
15:36:49 <obondarev> Swami: shows nothing for https://review.fuel-infra.org/#/c/247748
15:36:51 <Swami> from the gate logs, I did see that it was able to ping twice successfully but ping did not go through in a lot of the tests. I am not sure if it is a timing issue.
15:37:07 <obondarev> sorry
15:37:19 <Swami> #link https://review.openstack.org/#/c/247748/
15:37:34 <obondarev> yeah, thanks
15:37:34 <Swami> obondarev: did you get it.
15:37:47 <Swami> fitoduarte: adolfo is your server patch in.
15:37:52 <obondarev> signed up for it
15:37:59 <Swami> obondarev: thanks
15:38:42 <Swami> ok I don't see him here anymore.
15:39:04 <obondarev> BTW did anyone see SSH timeout issues in the gates lately?
15:39:07 <Swami> Are there any other bugs that we want to discuss here?
15:39:31 <Swami> obondarev: I do see SSHTimeout in the gate.
15:39:44 <regXboi> let's look
15:39:53 <Swami> obondarev: This is from last week, I have captured it in the slide deck.
15:40:09 <obondarev> Swami: link please?
15:40:14 <Swami> #link https://drive.google.com/file/d/0B4kh-7VVPWlPMkdrQWFFdjNsSnM/view?usp=sharing
15:40:18 <fitoduarte> swami: sorry which server patch?
15:40:38 <Swami> obondarev: as part of the gate failure investigation I am trying to capture all the failures that I see.
15:40:51 <Swami> fitoduarte: DVR HA server patch
15:41:14 <Swami> since we are in this topic right now let us change the topic.
15:41:21 <Swami> #topic Gate-failures
15:41:36 <obondarev> Swami: yeah, I just don't see SSH timeout happening often (like before) on dvr multinode job
15:42:48 <fitoduarte> swami: it has two cores' blessings but assaf is holding it. hopefully he'll be done this week.
15:43:16 <Swami> obondarev: but i think in most of the failure cases, the issue is we can't get to the VM through SSH
15:43:47 <Swami> obondarev: are you seeing any other failure symptom that is causing this graph to deviate from the cvr.
15:44:08 <obondarev> Swami: I'm not
15:44:11 <Swami> obondarev: in my honest opinion sometimes the logstash results are also confusing.
15:45:12 <Swami> obondarev: regXboi: Do either of you know whether the dsvm-functional test failure results are included in the dvr failure analysis, or is that different?
15:45:37 <obondarev> Swami: which analysis?
15:46:03 <Swami> obondarev: The graphite failure graph, does it take into consideration the dsvm-functional test failures?
15:46:42 <Swami> #link  https://goo.gl/L1WODG
15:46:55 <obondarev> ah, I see
15:47:02 <regXboi> no, that's a different graphite graph
15:47:29 <Swami> regXboi: ok, the reason I asked is recently I did see some instability with the dsvm-functional tests.
15:48:06 <Swami> obondarev: regXboi: If you see any gate failures please log them in the PowerPoint deck, so that it will be easy for us to get to the root cause.
15:48:13 <regXboi> http://goo.gl/WaeP7L
15:48:47 <regXboi> fullstack and functional both appear ill
15:49:23 <Swami> regXboi: also the graph for neutron-full, neutron-dvr is way higher when compared to this. I am not sure if the metrics are calculated correctly.
15:49:39 <obondarev> in my patches and in those that I review I didn't see SSH timeout failures on multinode dvr (and any other job) for a while so that's why I have an impression it might be gone for some reason
15:50:11 <Swami> obondarev: yes you are right, I have not seen it in my patches either. But the logstash still reports that it is seeing it in the gate.
15:50:31 <Swami> obondarev: This might be from the one or two patches that cause a whole lot of failures in DVR.
15:50:40 <Swami> obondarev: user error.
15:50:50 <obondarev> Swami: so failures related to patches?
15:51:00 <Swami> obondarev: yes.
15:51:01 <obondarev> doesn't count I guess
15:51:20 <Swami> obondarev: I am not sure about it.
15:51:47 <Swami> ok, let us move on
15:51:49 <obondarev> ok lets proceed collecting statistics
15:52:00 <Swami> obondarev: sure
15:52:48 <Swami> #topic performance/scalability
15:53:00 <Swami> obondarev: do you have anything to add here.
15:53:16 <obondarev> not really, https://blueprints.launchpad.net/neutron/+spec/improve-dvr-l3-agent-binding is in progress
15:53:51 <Swami> obondarev: ok
15:54:00 <Swami> anything else
15:54:29 <Swami> #topic Open-Discussion
15:54:43 <Swami> Do we have any other topic to discuss in the last 5 minutes
15:54:57 <Swami> If we don't have any other topics we can end the meeting.
15:55:26 <Swami> Please log the failures that you are seeing in the gate with respect to the DVR in the powerpoint.
15:55:39 <Swami> Let us discuss it next week.
15:56:00 <Swami> Thanks everyone for joining this meeting.
15:56:11 <Swami> Wish you all again a Happy Thanksgiving.
15:56:16 <Swami> #endmeeting