21:02:16 #startmeeting Networking
21:02:16 Meeting started Mon Oct 7 21:02:16 2013 UTC and is due to finish in 60 minutes. The chair is markmcclain. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:19 I was in the wrong room
21:02:20 The meeting name has been set to 'networking'
21:02:33 #link https://wiki.openstack.org/wiki/Network/Meetings
21:02:36 o/
21:02:41 #topic Announcements
21:03:12 RC1 was released last week: http://tarballs.openstack.org/neutron/neutron-2013.2.rc1.tar.gz
21:03:31 Please take this time to test the release candidate.
21:03:47 o/
21:03:48 If you find a bug, please file it on Launchpad.
21:04:19 If you've got a bug that needs to be fixed in the final release, please tag it havana-rc-potential
21:04:57 We've found a few bugs that will necessitate an RC2
21:05:10 that will be Thursday of this week
21:06:27 To make things easier for the release team, please do not target bugs for RC2. If something needs to be targeted, please contact me offline.
21:06:52 Speaking of bugs…
21:06:54 #topic Bugs
21:07:11 We've still got 3 bugs which are causing problems in the gate
21:07:16 https://bugs.launchpad.net/neutron/+bug/1224001
21:07:18 Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress]
21:07:21 aloha again!
21:07:21 https://bugs.launchpad.net/neutron/+bug/1211915
21:07:23 Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress]
21:07:48 salv-orlando: I know you've spent a lot of time looking into the first one. Want to fill us in?
21:08:16 I actually have spent a lot of time with idle thumbs waiting for devstack-gate jobs to complete
21:08:40 anyway, the gist of the failure is that either rpc_loop or sync_routers_task hangs at some point
21:09:05 if they hang, the looping call does not execute anymore and API commands are not implemented on the l3 agent
21:09:15 salv-orlando: is the issue reproducible in a regular devstack setup?
21:09:16 hence, no floating IP - and therefore the failure
21:09:38 enikanorov: I've spawned 104 different devstack-gate vms with no luck
21:09:41 happens only on the gate
21:09:58 i see
21:10:02 104 is the real number of vms I've spawned
21:10:20 definitely some kind of race condition that the gate seems to trigger more readily
21:10:25 anyway the cause of the hang is subprocess.Popen.communicate never returning
21:10:40 any news on https://bugs.launchpad.net/swift/+bug/1224001 - it's killing check-tempest-devstack-vm-neutron-isolated & check-tempest-devstack-vm-neutron-pg-isolated, or is believed to be the issue
21:10:41 Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress]
21:10:56 dkehn: that is what we're discussing
21:10:59 dkehn: that's what we're discussing
21:11:04 ouch
21:11:06 thx
21:11:39 I've found out that openstack.common.processutils.execute always does a sleep
21:11:54 salv-orlando: termie's patch didn't fix the problem?
21:11:58 and there's a comment that says 'we do this otherwise subprocess.popen hangs'
21:12:09 marun: it fixed the problem, partly.
21:12:24 Because arping kept hanging!!! Even with the -w option specified
21:12:32 And I've no idea how that could happen
21:13:05 salv-orlando: i'm confused - why only partly?
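The failure mode described above is subprocess.Popen.communicate blocking forever on a hung child process. A minimal way to guard against that is a timeout with a forced kill; the sketch below is an illustration only (modern-Python stdlib, hypothetical helper name), not the actual code in neutron.agent.linux.utils.

```python
import subprocess

def guarded_communicate(cmd, timeout=10):
    """Run cmd but never block forever: kill the child if it exceeds
    timeout seconds instead of letting communicate() hang.

    Illustrative sketch only; the real agents route commands through
    neutron.agent.linux.utils.execute.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # reap the stuck child
        proc.communicate()   # drain pipes after the kill
        raise RuntimeError("command %r hung past %ss" % (cmd, timeout))
    return proc.returncode, out, err

rc, out, err = guarded_communicate(["echo", "hello"])
print(rc, out.decode().strip())
```

A hung `arping` handled this way would fail loudly after the timeout rather than silently stalling `rpc_loop` or `sync_routers_task`.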
21:13:08 So I suggest making a few changes to execute() in neutron.agent.linux.utils
21:13:22 marun: because before it was hanging on pretty much any command,
21:13:43 after doing the change in execute() I only observed hanging while executing arping
21:13:50 this happened for 18 failures out of 18
21:13:53 salv-orlando: How about the patch that introduces spawn for arping?
21:13:59 crazy
21:14:22 I have not been following other patches; that would not block the process but would still leave green threads hanging
21:14:36 for the gate, I think we can say that we can disable gratuitous apr
21:14:39 apr -> arp
21:14:57 just for the gate… there's a configuration option after all
21:15:05 here's the spawn for GARP patch: https://review.openstack.org/#/c/49063/
21:15:18 yes, i would agree with the above - spawn would succeed, but an arping hang would consume a greenthread
21:16:25 haleyb: honestly I am happy either way
21:16:42 either by disabling it on the gate or by spawning in another thread
21:17:01 let's disable it for now.
21:17:06 Indeed I think I also commented on bug 1233391 that I believed it had the same root cause as bug 1224001
21:17:08 Launchpad bug 1233391 in neutron "arping can be spawned in background to speed-up floating IP assignment" [Medium,In progress] https://launchpad.net/bugs/1233391
21:17:09 Launchpad bug 1224001 in neutron "test_network_basic_ops fails waiting for network to become available" [Critical,In progress] https://launchpad.net/bugs/1224001
21:17:09 and test whether the patch works or not
21:17:12 …there's another possibility btw
21:17:39 nati_ueno, markmcclain: disabling GARP on the gate will also require a change in devstack-gate and a change in devstack
21:17:44 although they're trivial changes
21:17:55 it should be easy
21:18:14 so default off is only for gating?
21:18:16 what's the change to devstack? making GARP configurable?
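For context, the devstack-side change being discussed amounts to flipping the l3 agent's gratuitous-ARP option off via an ini edit (devstack's `iniset` helper). The sketch below emulates that edit with configparser purely for illustration; `send_arp_for_ha` is the l3_agent.ini option controlling how many GARPs are sent, with 0 disabling them, but treat the exact setup here as an assumption rather than the actual devstack patch.

```python
import configparser
import io

# Emulate: iniset /etc/neutron/l3_agent.ini DEFAULT send_arp_for_ha 0
# (hypothetical recreation of the devstack-gate tweak discussed above)
conf = configparser.ConfigParser()
conf.read_string("[DEFAULT]\nsend_arp_for_ha = 3\n")  # a default-ish value
conf["DEFAULT"]["send_arp_for_ha"] = "0"              # 0 disables GARP

buf = io.StringIO()
conf.write(buf)
print(buf.getvalue().strip())
```

With GARP disabled, the l3 agent never spawns `arping`, so the hang cannot occur in gate runs; the trade-off (raised below) is that production deployers with GARP enabled could still hit it.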
21:18:16 the async process class I've added as part of the polling minimization might be an alternate way of handling process threading
21:19:01 markmcclain: in devstack we need something to do the iniset on the conf variable
21:19:14 in devstack-gate we need to set the value to 0
21:19:46 marun: you've done only the l2 agent part so far, is that right?
21:19:55 salv-orlando: ok
21:20:08 salv-orlando: let me take the patch
21:20:45 salv-orlando: sure, but the async process class can be used for running any process, not just ovsdb-client
21:20:56 nati_ueno: are you suggesting to take the GARP patch from bhaley or are you offering to do the devstack patch for disabling GARP on the gate?
21:21:20 salv-orlando: the devstack and devstack-gate patches for disabling GARP on the gate
21:21:23 marun: agreed. I was just wondering whether we could use this out of the box in the l3 agent too
21:21:42 I'm still a bit concerned that we'll end up making the gate pass, but in production deployers will encounter the lockup
21:21:46 salv-orlando: it's on a process-by-process basis I'm afraid.
21:21:47 nati_ueno: thanks. Let's first agree on a direction. I won't mind just merging bhaley's patch too
21:22:04 salv-orlando: markmcclain: sure
21:22:06 markmcclain: me too.
21:22:27 the only downside of bhaley's patch is we potentially leak greenthreads, right?
21:22:35 Not having been able to reproduce locally, I am still at a loss about why arping should hang
21:22:47 markmcclain: correct. We'll leak one each time arping hangs
21:23:21 but we could easily put all those threads in a pool and kill them mercilessly if they've lived for more than, say, 10 seconds
21:24:30 yeah that would be an approach
21:24:32 oy
21:24:55 you know we're desperate when…. :\
21:25:02 in the icehouse timeframe we would then be able to use marun's code and therefore have an orderly and systematic approach for managing async tasks
21:25:18 agreed that we need a fix now, though.
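The "spawn, then kill mercilessly after 10 seconds" idea above can be sketched with stock threads: fire the command in a background worker and arm a timer that kills the child process if it outlives its budget. This is a stdlib illustration under assumed names; the agent itself would use eventlet green threads, where the mechanics differ.

```python
import subprocess
import threading

def spawn_in_background(cmd, reap_after=10.0):
    """Fire-and-forget cmd in a worker thread, killing the child process
    if it is still running after reap_after seconds, so a hung
    arping-like command cannot pin a worker forever.

    Hypothetical stdlib sketch of the pool-and-reap idea from the log;
    not the eventlet-based code the l3 agent would actually use.
    """
    def _run():
        proc = subprocess.Popen(cmd)
        timer = threading.Timer(reap_after, proc.kill)  # the "merciless" reaper
        timer.start()
        try:
            proc.wait()
        finally:
            timer.cancel()  # child exited on its own; disarm the reaper

    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    return worker

w = spawn_in_background(["sleep", "0"])
w.join(5)
print("worker finished:", not w.is_alive())
```

The caller never blocks on the command (the floating-IP path stays fast), and the worst case is one short-lived worker per hung command instead of a permanently leaked greenthread.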
21:25:39 I'd be interested to see what the deadlock would be of spawning the garp plus the other fixes you've been working on
21:25:47 *deadlock rate
21:26:17 I can rebase on top of bhaley's patch.
21:26:22 both might be a good first step that we merge sooner
21:26:40 and if we need to add a kill for the stalled threads we can do that as a follow-up
21:26:55 I think the sooner we can get this in, the sooner more folks can test
21:27:08 ok, so do we have consensus that as a stopgap measure we're going to fix execute in neutron.agent.utils, and spawn GARP into its own thread?
21:27:34 that would be my choice
21:27:41 +1 from me
21:28:00 +1
21:28:05 +1, let's just make sure to file another bug that ensures we have a longer-term solution on the horizon
21:28:24 +1 from me too, it's best to ship havana with one less opportunity for a deadlock
21:28:32 marun: agreed we'll need a long term bug
21:28:34 I think we should mark the commit messages as partial-bug and leave the bug open for havana with a lesser priority
21:28:49 salv-orlando: +1
21:28:52 +1
21:28:59 +1
21:29:39 We've consensus on a direction for this one
21:29:51 yeah, let's move to the next bug
21:30:12 Are there any other bugs the team needs to discuss?
21:30:51 there's bug 1197627 lying around
21:30:52 Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:31:30 but it hasn't been targeted
21:31:35 update on bug 1211915?
21:31:53 armax: I'll tag it as havana-rc-potential
21:31:58 and would like folks to review it
21:32:27 bug 1211915
21:32:27 I think the severity can be lowered
21:32:28 Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress] https://launchpad.net/bugs/1211915
21:32:33 to medium
21:33:27 armax: why? It seems like a pretty major problem to me
21:33:34 armax: and we have a fix in process
21:33:37 it is
21:33:58 armax: which bug are you talking about?
21:34:09 salv-orlando: bug 1197627
21:34:10 Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:34:17 yeah.. logstash is showing that bug 1211915 has gone away
21:34:17 http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkNvbm5lY3Rpb24gdG8gbmV1dHJvbiBmYWlsZWQ6IE1heGltdW0gYXR0ZW1wdHMgcmVhY2hlZFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzgxMTgxNjAyNjcxfQ==
21:34:18 Launchpad bug 1211915 in neutron "Connection to neutron failed: Maximum attempts reached" [Critical,In progress] https://launchpad.net/bugs/1211915
21:34:53 markmcclain: I have never believed in the "bug has gone away" fairytale ;)
21:35:18 markmcclain: Yeah, I'm with salv-orlando. I think race conditions can be triggered and then not triggered, but they're still there.
21:35:25 marun: I managed to reproduce the bug
21:35:41 but the conditions were staged
21:35:41 armax: bravo! :)
21:36:20 I wasn't able to reproduce it without injecting a time skew
21:36:35 armax: I'm not sure that changes anything, though. Race conditions are like landmines, we're fine until someone steps on one. But it's still there.
21:36:55 marun: so true
21:36:56 besides, it really depends on how the plugin implements the delete operations
21:37:19 marun: I agree with you …but today no-one has lost a leg besides the reporter
21:37:22 armax: Trying to update the agent in a db transaction is simply a bad idea, and easily fixed. I'm not sure what we're arguing about.
21:37:59 063911
21:38:08 argh
21:38:14 sorry
21:38:15 armax: wait, are we talking about the same bug? Maybe I'm arguing for nothing :(
21:38:23 marun: maybe
21:38:29 1211915?
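On the anti-pattern marun calls out above (updating an agent inside a db transaction): a slow or blocking call held inside an open transaction keeps locks across the call and invites deadlock, and the usual fix is to notify only after commit. The sketch below shows the shape of that fix with sqlite and a stand-in callback; the table, names, and notify hook are all illustrative, not Neutron's actual schema or RPC layer.

```python
import sqlite3

def delete_network(conn, notify_agent, network_id):
    """Delete a row inside a transaction, then notify the agent only
    after the commit has happened.

    Illustrative sketch: doing notify_agent() inside the `with conn:`
    block would hold the transaction open across a slow external call,
    which is the anti-pattern discussed in the log.
    """
    with conn:  # transaction scope: commits on clean exit
        conn.execute("DELETE FROM networks WHERE id = ?", (network_id,))
    notify_agent(network_id)  # external call happens outside the transaction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE networks (id TEXT)")
conn.execute("INSERT INTO networks VALUES ('net-1')")
conn.commit()

notified = []
delete_network(conn, notified.append, "net-1")
print(notified, conn.execute("SELECT COUNT(*) FROM networks").fetchone()[0])
```

The ordering guarantees the agent only ever sees state that has actually been committed, which is also why the fix is "easy" in marun's words: it is mostly a matter of moving the call.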
21:38:32 nope
21:38:35 1197627
21:38:42 apologies for wasting your time
21:38:51 no need
21:39:04 Re 1211915: we made changes on the 3rd to make the api less likely to deadlock
21:39:20 that's why I think it has not popped up recently
21:39:42 my inclination is to wait a few more days on 1211915 and if we don't see it, close it
21:39:51 armax: I'll help review the patches for 1197627 in any case.
21:40:06 at the summit we'll discuss the general problem of agent calls within transactions
21:40:12 markmcclain: But doesn't Jakub's patch make sense regardless?
21:40:30 marun: tnx
21:40:34 markmcclain: at least the changes to extraroute_db?
21:41:49 maybe.. let's get the other patches in
21:42:16 and then run it through the gate again… I'm a bit concerned we had to recheck a change that fixes a deadlock
21:42:57 markmcclain: ok
21:43:48 from a code perspective I think it is an upgrade over the current code
21:44:17 getting back to bug 1197627
21:44:21 Launchpad bug 1197627 in neutron "DHCP agent race condition on subnet and network delete" [High,In progress] https://launchpad.net/bugs/1197627
21:44:23 I know marun said he'd review
21:44:46 looks like arosen2 is the other reviewer
21:44:53 so we've got 2 cores on it
21:44:57 Any other bugs?
21:45:50 I want to thank the folks who've put in a lot of hours working on fixes for these tricky bugs
21:46:03 #topic API docs
21:46:05 salv-orlando: hi
21:46:21 hi
21:46:35 in a nutshell, not a lot of progress from last week.
21:46:50 I have been busy chasing these bugs and I've let the API docs fall behind
21:46:56 https://review.openstack.org/#/c/49224/
21:47:02 will resume work from tomorrow
21:47:26 dkehn: thanks for raising that
21:47:37 salv-orlando: no problem, we were in good shape last week and fixing the bugs is a higher team priority
21:47:43 thankfully, not everybody is a slacker like me :)
21:48:08 let's move on to admin docs
21:48:11 #topic docs
21:48:13 emagana: hi
21:48:17 markmcclain: Hi there, it will be a brief update
21:48:28 mostly working on ML2 docs
21:49:00 cool
21:49:05 cloud admin guide, glossary and deprecation message are merged
21:49:17 great
21:49:27 we have owners for the installation and configuration guides, so it should be fine
21:50:05 we are still missing an owner for the metering agent; if we don't have volunteers, I will take it
21:50:10 ok
21:50:24 thanks for keeping watch over the docs
21:50:34 no more from my side, I have updated the wiki properly
21:50:45 perfect.. thanks for updating
21:50:51 markmcclain: n.p.
21:51:04 #topic Horizon
21:51:29 amotoki has picked up this bug: https://review.openstack.org/49993
21:51:31 I'll review
21:51:40 but we'll need 1 more core to look at it
21:51:57 I'll look at it
21:52:03 armax: thanks
21:52:14 #topic Summit
21:52:29 Reminder to get your summit sessions in: http://summit.openstack.org/
21:52:45 markmcclain: when is the deadline?
21:53:36 I'd like to have them in by end of day on the 17th
21:53:51 markmcclain: thanks
21:54:00 remember, the sooner they're filed, the more time it gives us to provide feedback
21:54:10 Including merged session proposals, right?
21:54:28 yes
21:55:27 Any summit related questions?
21:56:02 #topic Open Discussion
21:56:04 #link http://lists.openstack.org/pipermail/openstack-dev/2013-October/016025.html
21:56:20 I'd like to suggest a change in review policy
21:56:39 Any change that requires a response from the submitter should come with a -1 or -2 on the part of the reviewer
21:56:59 congratulations markmcclain on your success in the PTL elections!
21:57:03 Congratulations markmcclain
21:57:09 congrats!
21:57:11 grats!
21:57:12 congratulations!
21:57:14 grats.
21:57:23 congratulations!
21:57:54 congratulations!
21:57:58 Well our community works really hard and I'm excited for the Icehouse cycle
21:58:00 marun: Is that not the case right now?
21:58:27 emagana: I have experienced being reviewed in comments but with no score.
21:58:28 marun: I'm assuming you're referring to the 0 score with comments?
21:58:38 markmcclain: correct
21:59:02 I know I'm guilty of leaving comments and not scoring
21:59:13 I often leave no score on purpose
21:59:26 markmcclain: if feedback requires action of any kind, I think a minus score is appropriate to ensure that other reviewers give time for the feedback loop to complete
21:59:32 salv-orlando: why?
21:59:48 salv-orlando: the minus scores are there to ensure that action is taken.
22:00:01 salv-orlando: i'm not sure of the value of not being explicit about that
22:00:04 because I had non-fundamental questions/comments and I did not want to put other reviewers off that patch
22:00:28 one problem is that a single -1 often discourages anyone else from looking at the patch
22:00:29 you know, often when one sees a -1 on a patch one just skips it
22:00:32 salv-orlando: i think we should change the view that a minus score doesn't require other reviewer attention, though
22:00:40 salv-orlando: that's an education thing
22:00:52 salv-orlando, rkukura: let's solve the problem rather than the symptom
22:01:05 anyway I'm happy to change my behaviour
22:01:13 I'll start leaving scores too
22:01:15 marun: I'm not opposed to change either
22:01:48 marun: Is this the behavior in the rest of the openstack projects?
22:02:01 emagana: I'm pretty sure, but I can check.
22:02:05 emagana: each project has their own slight differences
22:02:21 some always score and others might comment w/o scoring
22:02:44 the size of the team is often a factor too
22:02:47 we should "try" to be aligned with the rest of the projects, it makes the educational part easier :-)
22:02:49 In any case, I'll send an email to the list reiterating the call for minus scores and for people to check reviews for sufficient reviewer coverage rather than ignoring -1s
22:02:53 yup pretty much in every other project you get a -1 for any kind of question
22:03:26 salv-orlando: yeap, I noticed that in nova!
22:04:25 document it on the neutron wiki; if it works great, other projects may adopt it
22:04:41 for now let's be aware and remember to leave a score if the question is a blocker
22:05:17 we're out of time for today
22:06:12 sivh: ok
22:07:27 congrats Mark for PTL, bye all.
22:07:36 adieu
22:07:56 markmcclain, congrats
22:08:23 congratulations
22:08:35 this got lost in the scoring discussion… the reason I mentioned the PTL results is to say thanks to everyone, because the best part of being PTL is working with our great community
22:09:12 the diligent work folks have put into the bugs this week is a prime example of how well our community works and shares information
22:09:29 Thanks for stopping in and have a great week
22:09:31 #endmeeting