21:02:16 <markmcclain> #startmeeting Networking
21:02:16 <openstack> Meeting started Mon Oct  7 21:02:16 2013 UTC and is due to finish in 60 minutes.  The chair is markmcclain. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:19 <salv-orlando> I was in the wrong room
21:02:20 <openstack> The meeting name has been set to 'networking'
21:02:33 <markmcclain> #link https://wiki.openstack.org/wiki/Network/Meetings
21:02:36 <beagles> o/
21:02:41 <markmcclain> #topic Announcements
21:03:12 <markmcclain> RC1 was released last week: http://tarballs.openstack.org/neutron/neutron-2013.2.rc1.tar.gz
21:03:31 <markmcclain> Please take this time to test the release candidate.
21:03:47 <roaet> o/
21:03:48 <markmcclain> If you find a bug please file on launchpad.
21:04:19 <markmcclain> If you've got a bug that needs to be fixed in the final release please tag a havana-rc-potential
21:04:57 <markmcclain> We've found a few bugs that will necessitate an RC2
21:05:10 <markmcclain> that will be Thursday of this week
21:06:27 <markmcclain> To make things easier for the release team, please do not target bugs for RC2. If something needs to be targeted please contact me offline.
21:06:52 <markmcclain> Speaking of bugs…
21:06:54 <markmcclain> #topic Bugs
21:07:11 <markmcclain> We've still got 3 bugs which are causing problems in the gate
21:07:16 <markmcclain> https://bugs.launchpad.net/neutron/+bug/1224001
21:07:18 <uvirtbot> Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress]
21:07:21 <salv-orlando> aloha again!
21:07:21 <markmcclain> https://bugs.launchpad.net/neutron/+bug/1211915
21:07:23 <uvirtbot> Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress]
21:07:48 <markmcclain> salv-orlando: I know you've spent a lot of time looking into the first one.  Want to fill us in?
21:08:16 <salv-orlando> I actually have spent a lot of time with idle thumbs waiting for devstack-gate jobs to complete
21:08:40 <salv-orlando> anyway, the gist of the failure is that either rpc_loop or sync_routers_task hang at some point
21:09:05 <salv-orlando> if they hang, the looping call does not execute anymore and API commands are no longer processed by the l3 agent
21:09:15 <enikanorov> salv-orlando: is the issue reproducible in regular devstack setup?
21:09:16 <salv-orlando> hence, no floating IP - and therefore the failure
21:09:38 <salv-orlando> enikanorov: I've spawned 104 different devstack-gate vms with no luck
21:09:41 <salv-orlando> happens only on the gate
21:09:58 <enikanorov> i see
21:10:02 <salv-orlando> 104 is the real number of vms I've spawned
21:10:20 <markmcclain> definitely some kind of race condition that the gate seems to trigger more readily
21:10:25 <salv-orlando> anyway the cause of the hang is subprocess.Popen.communicate never returning
21:10:40 <dkehn> any news on https://bugs.launchpad.net/swift/+bug/1224001 - it's killing the check-tempest-devstack-vm-neutron-isolated & check-tempest-devstack-vm-neutron-pg-isolated, or it is believed to be the issue
21:10:41 <uvirtbot> Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress]
21:10:56 <markmcclain> dkehn: that is what we're discussing
21:10:59 <salv-orlando> dkehn: that's what we're discussing
21:11:04 <dkehn> ouch
21:11:06 <dkehn> thx
21:11:39 <salv-orlando> I've found out that openstack.common.processutils.execute always does a sleep
21:11:54 <marun> salv-orlando: termies patch didn't fix the problem?
21:11:58 <salv-orlando> and there's a comment that says 'we do this otherwise subprocess.popen hangs'
21:12:09 <salv-orlando> marun: it fixed the problem, partly.
21:12:24 <salv-orlando> Because arping kept hanging!!! Even with -w option specified
21:12:32 <salv-orlando> And I've no idea how that could happen
21:13:05 <marun> salv-orlando: i'm confused - why only partly?
21:13:08 <salv-orlando> So I suggest to do a few changes to execute() in neutron.agent.linux.utils
21:13:22 <salv-orlando> marun: because before it was hanging on pretty much any command,
21:13:43 <salv-orlando> after doing the change in execute() I only observed hanging while executing arping
21:13:50 <salv-orlando> this happened for 18 failures out of 18
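[Editor's note: the execute() change being proposed for neutron.agent.linux.utils can be sketched roughly as below. This is a hypothetical illustration only — it uses the stdlib communicate() timeout for brevity, whereas Neutron's real agent code is eventlet-based, and the function name and timeout value are assumed.]

```python
import subprocess

def execute(cmd, timeout=10):
    # Hypothetical sketch: run a command without blocking forever on
    # communicate(), which is where the observed hang occurred.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # unblocks communicate() and reaps the child
        proc.communicate()
        raise RuntimeError("command %r hung for more than %ss"
                           % (cmd, timeout))
    return out.decode(), err.decode(), proc.returncode
```

With a bound like this, a hung child (such as the arping case discussed here) surfaces as an error instead of silently stalling the agent's looping call.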
21:13:53 <nati_ueno> salv-orlando: How about the patch that introduces spawn for arping?
21:13:59 <marun> crazy
21:14:22 <salv-orlando> I have not been following the other patches; that would not block the process but will still leave green threads hanging
21:14:36 <salv-orlando> for the gate, I think we can say that we can disable gratuitous apr
21:14:39 <salv-orlando> apr -> arp
21:14:57 <salv-orlando> just for the gate… there's a configuration option after all
21:15:05 <markmcclain> here's the spawn for GARP patch: https://review.openstack.org/#/c/49063/
21:15:18 <haleyb> yes, i would agree with above - spawn would succeed, but arping hang would consume greenthread
21:16:25 <salv-orlando> haleyb: honestly I am happy either way
21:16:42 <salv-orlando> either by disabling it on the gate or by spawning in another thread
21:17:01 <nati_ueno> let's disable it for now.
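[Editor's note: the spawn-for-GARP approach under discussion amounts to firing arping from a background worker so a hung arping no longer stalls the agent's main loop. A minimal sketch, using plain daemon threads in place of the eventlet green threads Neutron actually uses; the function name, interface handling, and arping flags are assumptions for illustration.]

```python
import subprocess
import threading

def send_garp_async(device, ip, count=1):
    # Hypothetical sketch: run gratuitous ARP in a daemon thread so the
    # caller (the l3 agent loop, in the real code) returns immediately.
    # A hung arping now only ties up this worker, not the whole agent.
    def _arping():
        cmd = ["arping", "-A", "-I", device, "-c", str(count),
               "-w", str(count + 1), ip]
        try:
            subprocess.call(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
        except OSError:
            pass  # arping missing/unrunnable; the real code would log this
    worker = threading.Thread(target=_arping, daemon=True)
    worker.start()
    return worker
```

As haleyb notes above, the trade-off is that each hung arping still consumes a worker until the process exits.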
21:17:06 <salv-orlando> Indeed I think I also commented on bug 1233391 that I believed it had the same root cause as bug 1224001
21:17:08 <uvirtbot> Launchpad bug 1233391 in neutron "arping can be spawned in background to speed-up floating IP assignment" [Medium,In progress] https://launchpad.net/bugs/1233391
21:17:09 <uvirtbot> Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress] https://launchpad.net/bugs/1224001
21:17:09 <nati_ueno> and test the patch work or not
21:17:12 <marun> …there's another possibility btw
21:17:39 <salv-orlando> nati_ueno, markmcclain: disabling GARP on the gate will also require a change in devstack-gate and a change in devstack
21:17:44 <salv-orlando> although they're trivial changes
21:17:55 <nati_ueno> it should be easy
21:18:14 <nati_ueno> so default off is only for gating?
21:18:16 <markmcclain> what's the change to devstack?  making GARP configurable?
21:18:16 <marun> the async process class I've added as part of the polling minimization might be an alternate way of handling process threading
21:19:01 <salv-orlando> markmcclain: in devstack we need something to do the iniset on the conf variable
21:19:14 <salv-orlando> in devstack-gate we need to set the value to 0
21:19:46 <salv-orlando> marun: you've done only the l2 agent part so far, is that right?
21:19:55 <markmcclain> salv-orlando: ok
21:20:08 <nati_ueno> salv-orlando: let's me take the patch
21:20:45 <marun> salv-orlando: sure, but the async process class can be used for running any process, not just ovsdb-client
21:20:56 <salv-orlando> nati_ueno: are you suggesting to take the GARP patch from bhaley or are you offering to do the devstack patch for disabling GARP on the gate?
21:21:20 <nati_ueno> salv-orlando: devstack and devstack-gate patch for disabling GARP on the gate
21:21:23 <salv-orlando> marun: agreed. I was just wondering whether we could use this out of the box in the l3 agent too
21:21:42 <markmcclain> I'm still a bit concerned that we'll end up making the gate pass, but in production deployers will encounter the lockup
21:21:46 <marun> salv-orlando: it's on a process-by-process basis I'm afraid.
21:21:47 <salv-orlando> nati_ueno: thanks. Let's first agree on a direction. I won't mind just merging bhaley's patch too
21:22:04 <nati_ueno> salv-orlando: markmcclain: sure
21:22:06 <salv-orlando> markmcclain: me too.
21:22:27 <markmcclain> the only downside of bhaley's patch is we potentially leak greenthreads right?
21:22:35 <salv-orlando> Not having been able to reproduce locally, I am still at a loss about why arping should hang
21:22:47 <salv-orlando> markmcclain: correct. We'll leak one for each time arping hangs
21:23:21 <salv-orlando> but we could easily put all those threads in a pool and kill them mercilessly if they've lived for more than say 10 seconds
21:24:30 <markmcclain> yeah that would be an approach
21:24:32 <marun> oy
21:24:55 <marun> you know we're desperate when…. :\
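[Editor's note: the pool-and-kill idea salv-orlando proposes could look something like the sketch below — track each spawned process and kill any that outlives a deadline, which also unblocks a thread stuck in communicate(). The class and method names are invented for illustration; the real fix would live in Neutron's eventlet-based agent code.]

```python
import subprocess
import threading
import time

class GarpReaper:
    # Hypothetical sketch of the proposed pool: remember every spawned
    # process and mercilessly kill any that lives past max_age seconds.
    def __init__(self, max_age=10):
        self.max_age = max_age
        self._procs = []            # list of (Popen, start time)
        self._lock = threading.Lock()

    def spawn(self, cmd):
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        with self._lock:
            self._procs.append((proc, time.monotonic()))
        return proc

    def reap(self):
        # Meant to be called periodically, e.g. from the agent's
        # existing looping call.
        now = time.monotonic()
        with self._lock:
            survivors = []
            for proc, started in self._procs:
                if proc.poll() is not None:
                    continue                  # exited on its own
                if now - started > self.max_age:
                    proc.kill()               # hung: kill it mercilessly
                else:
                    survivors.append((proc, started))
            self._procs = survivors
```

This keeps the "leak one greenthread per hang" cost bounded: a killed child lets its worker finish, so nothing lives longer than max_age plus one reap interval.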
21:25:02 <salv-orlando> in the icehouse timeframe then we would be able to use marun's code and therefore have an orderly and systematic approach for managing async tasks
21:25:18 <marun> agreed that we need a fix now, though.
21:25:39 <markmcclain> I'd be interested to see what the deadlock rate would be with spawning the garp and the other fixes you've been working on
21:26:17 <salv-orlando> I can rebase on top of bhaley patch.
21:26:22 <markmcclain> both might be a good first step that we merge sooner
21:26:40 <markmcclain> and if we need to add a kill the stalled threads we can do that as follow up
21:26:55 <markmcclain> I think the sooner we can get this in, the sooner more folks can test
21:27:08 <salv-orlando> ok, so do we have consensus that as a stopgap measure we're going to fix execute in neutron.agent.linux.utils, and spawn GARP into its own thread?
21:27:34 <markmcclain> that would be my choice
21:27:41 <salv-orlando> +1 from me
21:28:00 <nati_ueno> +1
21:28:05 <marun> +1, let's just make sure to file another bug that ensures we have a longer-term solution on the horizon
21:28:24 <armax> +1 from me too, it's best to ship havana with one less opportunity for a deadlock
21:28:32 <markmcclain> marun: agreed we'll need a long term bug
21:28:34 <salv-orlando> I think we should mark the commit messages as partial-bug and leave the bug open for havana with a lesser priority
21:28:49 <marun> salv-orlando: +1
21:28:52 <markmcclain> +1
21:28:59 <rkukura> +1
21:29:39 <markmcclain> We've consensus on a direction for this one
21:29:51 <salv-orlando> yeah, let's move to the next bug
21:30:12 <markmcclain> Are there any other bugs the team needs to discuss?
21:30:51 <armax> there's bug 1197627 lying around
21:30:52 <uvirtbot> Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:31:30 <armax> but it hasn't been targeted
21:31:35 <salv-orlando> update on bug 1211915?
21:31:53 <markmcclain> armax: I'll tag it as havana-rc-potential
21:31:58 <markmcclain> and would like folks to review it
21:32:27 <markmcclain> bug 1211915
21:32:27 <armax> I think the severity can be lowered
21:32:28 <uvirtbot> Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress] https://launchpad.net/bugs/1211915
21:32:33 <armax> to medium
21:33:27 <marun> armax: why?  It seems like a pretty major problem to me
21:33:34 <marun> armax: and we have a fix in process
21:33:37 <armax> it is
21:33:58 <salv-orlando> armax: which bug are you talking about?
21:34:09 <armax> salv-orlando: bug 1197627
21:34:10 <uvirtbot> Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:34:17 <markmcclain> yeah.. logstash is showing that bug 1211915 has gone away
21:34:17 <markmcclain> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkNvbm5lY3Rpb24gdG8gbmV1dHJvbiBmYWlsZWQ6IE1heGltdW0gYXR0ZW1wdHMgcmVhY2hlZFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzgxMTgxNjAyNjcxfQ==
21:34:18 <uvirtbot> Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress] https://launchpad.net/bugs/1211915
21:34:53 <salv-orlando> markmcclain: I have never believed in the "bug has gone away" fairytale ;)
21:35:18 <marun> markmcclain: Yeah, I'm with salv-orlando.  I think race conditions can be triggered and then not triggered, but they're still there.
21:35:25 <armax> marun: I managed to reproduce the bug
21:35:41 <armax> but the conditions were staged
21:35:41 <marun> armax: bravo! :)
21:36:20 <armax> I wasn't able to reproduce it without injecting a time skew
21:36:35 <marun> armax: I'm not sure that changes anything, though.  Race conditions are like landmines, we're fine until someone steps on one.  But it's still there.
21:36:55 <markmcclain> Marun: so true
21:36:56 <armax> besides it really depends on how the plugin implements the delete operations
21:37:19 <armax> marun: I agree with you …but today no-one has lost a leg besides the reporter
21:37:22 <marun> armax: Trying to update the agent in a db transaction is simply a bad idea, and easily fixed.  I'm not sure what we're arguing about.
21:37:59 <beagles> 063911
21:38:08 <beagles> argh
21:38:14 <beagles> sorry
21:38:15 <marun> armax: wait, are we talking about the same bug?  Maybe I'm arguing for nothing :(
21:38:23 <armax> marun: maybe
21:38:29 <marun> 1211915?
21:38:32 <armax> nope
21:38:35 <armax> 1197627
21:38:42 <marun> apologies for wasting your time
21:38:51 <armax> no need
21:39:04 <markmcclain> Re 1211915: we made changes on the 3rd to make the api less likely to deadlock
21:39:20 <markmcclain> that's why I think it has not popped up recently
21:39:42 <markmcclain> my inclination is to wait a few more days on 1211915 and we don't see it, close it
21:39:51 <marun> armax: I'll help review the patches for 1197627 in any case.
21:40:06 <markmcclain> at the summit we'll discuss the general problem of agents within transactions
21:40:12 <marun> markmcclain: But doesn't Jakub's patch make sense regardless?
21:40:30 <armax> marun: tnx
21:40:34 <marun> markmcclain: at least the changes to extraroute_db?
21:41:49 <markmcclain> maybe.. let's get the other patches in
21:42:16 <markmcclain> and then run it through the gate again… I'm a bit concerned we had to recheck a change that fixes a deadlock
21:42:57 <marun> markmcclain: ok
21:43:48 <markmcclain> from a code perspective I think it is an upgrade over the current
21:44:17 <markmcclain> getting back to bug 1197627
21:44:21 <uvirtbot> Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:44:23 <markmcclain> I know marun said he'd review
21:44:46 <markmcclain> looks like arosen2 is the other reviewer
21:44:53 <markmcclain> so we've got 2 cores on it
21:44:57 <markmcclain> Any other bugs?
21:45:50 <markmcclain> I want to thank the folks who've put in a lot of hours working on fixes for these tricky bugs
21:46:03 <markmcclain> #topic API docs
21:46:05 <markmcclain> salv-orlando: hi
21:46:21 <salv-orlando> hi
21:46:35 <salv-orlando> in a nutshell, not a lot of progress from last week.
21:46:50 <salv-orlando> I have been busy chasing these bugs and I've let the API fall back
21:46:56 <dkehn> https://review.openstack.org/#/c/49224/
21:47:02 <salv-orlando> will resume work from tomorrow
21:47:26 <salv-orlando> dkehn: thanks for raising that
21:47:37 <markmcclain> salv-orlando: no problem we were in good shape last week and fixing the bugs is a higher team priority
21:47:43 <salv-orlando> thankfully, not everybody is a slacker like me :)
21:48:08 <markmcclain> let's move on admin docs
21:48:11 <markmcclain> #topic docs
21:48:13 <markmcclain> emagana: hi
21:48:17 <emagana> markmcclain: Hi there, It will be a brief update
21:48:28 <emagana> mostly working on ML2 docs
21:49:00 <markmcclain> cool
21:49:05 <emagana> cloud admin guide, glossary* and deprecation messages are merged
21:49:17 <markmcclain> great
21:49:27 <emagana> we have owners for installation and configuration guide, so it should be fine
21:50:05 <emagana> we are still missing an owner for the metering agent; if we don't have volunteers, I will take it
21:50:10 <markmcclain> ok
21:50:24 <markmcclain> thanks for keeping watch over the docs
21:50:34 <emagana> no more from my side, I have updated the wiki properly
21:50:45 <markmcclain> perfect.. thanks for updating
21:50:51 <emagana> markmcclain: n.p.
21:51:04 <markmcclain> #topic Horizon
21:51:29 <markmcclain> amotoki: has picked up this bug: https://review.openstack.org/49993
21:51:31 <markmcclain> I'll review
21:51:40 <markmcclain> but we'll need 1 more core to look at it
21:51:57 <armax> I'll look at it
21:52:03 <markmcclain> armax: thanks
21:52:14 <markmcclain> #topic Summit
21:52:29 <markmcclain> Reminder to get your summit sessions in: http://summit.openstack.org/
21:52:45 <emagana> markmcclain: when is the deadline?
21:53:36 <markmcclain> I'd like to have them in by end of day the 17th
21:53:51 <emagana> markmcclain: thanks
21:54:00 <markmcclain> remember, the sooner they're filed, the more time we have to provide feedback
21:54:10 <geoffarnold> Including merged session proposals, right?
21:54:28 <markmcclain> yes
21:55:27 <markmcclain> Any summit related questions?
21:56:02 <markmcclain> #topic Open Discussion
21:56:04 <markmcclain> #link http://lists.openstack.org/pipermail/openstack-dev/2013-October/016025.html
21:56:20 <marun> I'd like to suggest a change in review policy
21:56:39 <marun> Any change that requires feedback by the submitter should require a -1 or -2 on the part of the reviewer
21:56:59 <salv-orlando> congratulations markmcclain for your success in the PTL elections!
21:57:03 <emagana> Congratulations markmcclain
21:57:09 <nati_ueno> congrat!
21:57:11 <ivar-lazzaro> grats!
21:57:12 <marun> congratulations!
21:57:14 <roaet> grats.
21:57:23 <enikanorov> congratulations!
21:57:54 <samuelbercovici> congratulations!
21:57:58 <markmcclain> Well our community works really hard and I'm excited for Icehouse cycle
21:58:00 <emagana> marun: Is it not the case right now?
21:58:27 <marun> emagana: I have experienced being reviewed in comments but with no score.
21:58:28 <markmcclain> marun: I'm assuming you're referring to the 0 score with comments?
21:58:38 <marun> markmcclain: correct
21:59:02 <markmcclain> I know I'm guilty of leaving comments without scoring
21:59:13 <salv-orlando> I often leave no score on purpose
21:59:26 <marun> markmcclain: if feedback requires action of any kind, I think a minus score is appropriate to ensure that other reviewers give time for the feedback loop to complete
21:59:32 <marun> salv-orlando: why?
21:59:48 <marun> salv-orlando: the minus scores are to ensure that action is taken.
22:00:01 <marun> salv-orlando: i'm not sure of the value of not being explicit about that
22:00:04 <salv-orlando> because I had non-fundamental questions/comments and I did not want to put other reviewers off that patch
22:00:28 <rkukura> one problem is that a single -1 often discourages anyone else from looking at the patch
22:00:29 <salv-orlando> you know, often when one sees -1 on a patch just skips it
22:00:32 <marun> salv-orlando: i think we should change the view that a minus score doesn't require other reviewer attention, though
22:00:40 <marun> salv-orlando: that's an education thing
22:00:52 <marun> salv-orlando, rkukura: let's solve the problem rather than the symptom
22:01:05 <salv-orlando> anyway I'm happy to change my behaviour
22:01:13 <markmcclain> I'll start leaving scores too
22:01:15 <rkukura> marun: I'm not opposed to change either
22:01:48 <emagana> marun: Is this the behavior in the rest of the openstack projects?
22:02:01 <marun> emagana: I'm pretty sure, but I can check.
22:02:05 <markmcclain> emagana: each project has their own slight differences
22:02:21 <markmcclain> some always score and others might comment w/o scoring
22:02:44 <markmcclain> the size of the teams is often a factor too
22:02:47 <emagana> we should "try" to be aligned with the rest of the projects, it makes the educational part easier  :-)
22:02:49 <marun> In any case, I'll send an email to the list reiterating the call for minus scores and for people to check reviews for sufficient reviewer coverage rather than ignoring -1s
22:02:53 <salv-orlando> yup pretty much in every other project you get a -1 for any kind of question
22:03:26 <emagana> salv-orlando: yeap, I noticed that in nova!
22:04:25 <shivh> document it on neutron wiki, if it works great other projects may adopt
22:04:41 <markmcclain> for now let's be aware and remember to leave a score if the question is a blocker
22:05:17 <markmcclain> we're out of time for today
22:06:12 <marun> shivh: ok
22:07:27 <shivh> congrats Mark for PTL, bye all.
22:07:36 <emagana> bye
22:07:56 <dkehn> markmcclain, congrats
22:08:23 <jlibosva> congratulations
22:08:35 <markmcclain> this got lost in the scoring discussion… the reason I mentioned the PTL results is to say thanks to everyone because the best part of being PTL is working with our great community
22:09:12 <markmcclain> the diligent work folks have put in the bugs this week is a prime example of how well our community works and shares information
22:09:29 <markmcclain> Thanks for stopping in and have a great week
22:09:31 <markmcclain> #endmeeting