16:00:47 #startmeeting neutron_ci
16:00:49 Meeting started Tue Sep 5 16:00:47 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:50 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:52 jlibosva, haleyb o/
16:00:52 The meeting name has been set to 'neutron_ci'
16:00:53 o/
16:00:58 hi
16:01:06 we have some actual juice to discuss today, let's get going
16:01:14 first...
16:01:15 #topic Actions from prev week
16:01:21 "jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default"
16:01:51 I did investigate the option and it seems the default subnetpool is created because it's required by the auto_allocated_topology tests
16:02:41 as I'm not really familiar with the auto allocated topology feature, I'll need to talk to armax to see whether we can create the default subnetpool before running the auto allocated topology tests and then remove it
16:03:07 #action jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate
16:03:07 it will require some external locking so the auto allocated topology and default subnetpool tests don't run in parallel
16:03:31 I see. I think tests in the same class don't run in parallel
16:03:34 so you can use that
16:04:33 aha, you mean to put auto allocated topology and default subnetpool under the same class
16:04:57 yeah. that will give you the necessary serialization. I am not sure if it makes sense logically though.
16:05:05 I mean, in terms of code quality
16:06:19 ok let's move on
16:06:24 next item was "anilvenkata to add more control plane checks for migrated router ports tests before ssh'ing"
16:06:34 I believe anilvenkata sent a bunch of fixes for the issues
16:06:41 https://review.openstack.org/#/c/500384/
16:06:49 I believe it will eventually be split
16:06:59 correct?
16:07:24 also https://review.openstack.org/#/c/500379 I think is relevant
16:07:42 I see. I will have a look at that one too
16:07:51 so it seems like good progress, let's see it through to completion :)
16:08:09 * haleyb will look as well of course
16:08:22 next was "ihrachys to complete gate-failure cleanup"
16:08:25 I did close quite a few bugs that haven't shown up again
16:08:36 * jlibosva will look too and will be like "hmmm, l3 code ... "
16:08:44 I also looked closely through all fullstack bugs and tried to fix/triage some of them
16:08:49 we will discuss specific patches later
16:09:02 let's look at the gate
16:09:06 #topic Grafana
16:09:10 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:09:17 * jlibosva refreshes grafana page
16:09:28 I see rally was at 100% a while ago, but that should be fixed now.
16:09:30 in master
16:09:41 I backported the fixes to stable: https://review.openstack.org/#/q/I80c9a155ee2b52558109c764075a58dfabee44d4,n,z
16:10:10 we also have grenade-dvr at 100% right now
16:10:24 is it the same failure we were struggling with last week?
16:10:30 were stable branches also affected by the rally failure?
16:10:42 jlibosva, I would think so, since devstack-gate is branchless
16:10:47 we have two patches ready for that, got blocked when tox failed, then rally
16:10:49 ok
16:11:09 i'm assuming we're talking about the keyerror revert and fix
16:11:12 haleyb, that will be https://review.openstack.org/#/c/499292/ and https://review.openstack.org/#/c/499585/ and https://review.openstack.org/#/c/499725/ ?
16:11:29 I looked at the grenade failure and updated the LP bug about what I saw in the logs. I thought that my patch ignoring host in the fip object brought back the regression we were fighting with
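A minimal sketch of the serialization idea from the 16:03-16:05 exchange above, relying on ihrachys' point that tests in the same class are not run in parallel. The class and test names are hypothetical and plain unittest is used instead of the real neutron tempest modules, so this is only an illustration of the grouping, not the actual gate code:

    import unittest


    class DefaultSubnetpoolLifecycleTest(unittest.TestCase):
        """Both tests share one class, so the test runner keeps them in a single
        worker and executes them serially, with no external lock needed."""

        def test_auto_allocated_topology_with_default_subnetpool(self):
            # would create the default subnetpool, exercise auto-allocation, then clean up
            pass

        def test_convert_default_subnetpool_to_non_default(self):
            # runs before or after the test above, but never at the same time
            pass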
16:12:18 ihrachys: yes, I thought there were only two but i'll look at those 3
16:12:20 also worth mentioning that I made the grenade dvr job non-voting as we weren't able to merge the rally patch with both jobs failing
16:12:51 jlibosva, ok. I think it's a good call in general before we prove it's back to normal
16:13:02 we already wasted ~a week of gate
16:13:11 oh, I just +2'd that last one, so yes 3
16:13:22 ETOOMANYPATCHES
16:13:53 ihrachys: and we'll keep working on the server side, then take a deep breath
16:14:02 I'm confused, I guess we still need to fix the pike branch?
16:14:12 we also have scenario jobs at 100%. what's the latest status there? I believe anilvenkata's fixes should help that in part, but do we have a full picture of the remaining items?
16:14:27 I also believe the dvr keyerror issue was affecting it
16:14:31 I haven't been tracking the failures since the etherpad
16:14:47 jlibosva, we probably need to backport all those 3 fixes
16:14:54 (the revert is effectively already there, so 2)
16:15:13 i thought we backported jlibosva's keyerror fix out-of-order
16:15:24 yeah, that's why I think it broke the job again
16:15:33 i.e. pike first, then master since we needed it in the "old" grenade
16:15:47 jlibosva, ok. I think it's fine to let the router migration and some other work proceed and revisit the state after those fixes are in.
16:16:01 we're talking about two topics now :)
16:16:04 :)
16:16:10 ok, let's focus on grenade/dvr
16:16:13 ack
16:16:27 my understanding is, we have 3 fixes: the revert + 2 new ones.
16:16:37 we landed the revert in pike because it blocked the master revert
16:16:47 now we try to push the revert in master
16:17:03 since the job is non-voting it should go in smoothly despite pike being broken by the keyerror
16:17:05 we also pushed the second (the keyerror fix) in pike
16:17:13 but pike is still broken now as per grafana
16:17:36 jlibosva, because we haven't merged all 3 fixes in pike, no?
16:17:41 so haleyb do you think it's possible that my fix that went only to pike broke the job again - as it might be causing the same behavior as swami's patch?
16:17:44 haleyb, you pushed? merged you mean?
16:18:16 it would be nice to test all patches in one go, both backports and master fixes, with some kind of DNM patch
16:18:23 https://review.openstack.org/#/c/500077/ one
16:18:30 I *think* gate actually allows depends-on for patches in multiple branches
16:18:55 haleyb, I see, I wasn't aware of that
16:19:26 and I think 500077 broke the gate again
16:19:31 as per grafana: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=11&fullscreen
16:19:42 ihrachys: it was backwards because once we merged the revert to pike we couldn't land things in master...
16:19:51 if you look at the 7-day history, the curve goes up after the patch was merged
16:20:13 haleyb, ok. comments on what jlibosva suggested? ^
16:20:34 so maybe the only way to fix it is the server-side patch
16:20:41 I would be glad if we avoided more reverts/backports before we are sure they don't break anything
16:21:04 the job was still failing on the keyerror after the revert in the gate queue
16:21:07 in master
16:21:28 yes. the original patch was actually trying to fix the keyerror
16:21:32 which I don't understand why it wasn't happening in the check queue ...
16:21:37 and then broke the gate with duplicate fips
16:21:59 jlibosva: what was your proposal? I'm sure it will be great :)
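One way the DNM idea jlibosva floats at 16:18:16 could look, assuming Zuul's Depends-On footer resolves a Gerrit Change-Id to every open change carrying it (a master fix plus its pike backport share the same Change-Id): a throwaway patch whose commit message pulls the whole fix stack into a single test run. The Change-Ids below are placeholders, not the actual reviews:

    DNM: exercise grenade-dvr with the full keyerror fix stack

    Do not merge; only here to get one gate run with every pending fix applied.

    Depends-On: I1111111111111111111111111111111111111111
    Depends-On: I2222222222222222222222222222222222222222
    Change-Id: I3333333333333333333333333333333333333333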
16:22:12 so we had the keyerror; then we fixed it but regressed with duplicate fips; then we reverted the latter and got keyerrors back; now we try to fix the keyerrors.
16:22:19 haleyb: my suggestion is to ask haleyb and swami for help :-p
16:22:25 haha
16:22:49 well this sh*t is definitely on the l3 team to solve, so I kinda agree Swami and Brian should be involved ;)
16:23:01 it all started with the late feature backport for the new dvr mode
16:23:28 yes, we can take the blame for that, armando warned us...
16:23:52 I don't care about blame. I want my precious gate back to normal. :)
16:23:54 don't tell him, he'd be like "I told you"
16:24:11 * haleyb goes to his corner
16:24:15 anyway
16:24:26 #action haleyb to figure out the way forward for grenade/dvr gate
16:24:30 one thing I don't understand is why the keyerror was happening on the gate queue only - after the revert of swami's patch
16:24:49 luck?
16:24:58 I think the original keyerror was not breaking 100%?
16:25:03 maybe the reproduction rate was lower than for the current issue
16:25:05 yeah
16:25:15 it depends on where the VM was launched from what i remember, so it wasn't 100%
16:25:29 but now it is 100%
16:25:40 anyway, let's move on :)
16:26:28 maybe we should revert my patch from pike and not merge it in master
16:26:35 I will let haleyb dig ;) I will stay and watch the house burning.
16:27:11 jlibosva, maybe. maybe not. that's why I would love to see some gate proof that we have a final resolution.
16:27:14 https://i.imgur.com/c4jt321.png
16:27:42 jlibosva, that was exactly my reaction when I noticed the tox ALL RED failure late Friday
16:27:50 I said f* it and turned off irc
16:27:56 :D
16:28:11 anyway... let's switch topics
16:28:42 so scenario jobs. they fail, but we have router migrations work and keyerror work in the pipeline, so it probably makes sense to revisit the failure rate once those are tackled.
16:29:04 so finally, the fullstack job
16:29:13 that one was on me to triage
16:29:26 and after weeks of procrastination I actually did. yay Ihar.
16:29:40 I started https://etherpad.openstack.org/p/queens-neutron-fullstack-failures
16:29:42 I wouldn't call that procrastination but go on :)
16:29:51 and I also looked through reported bugs in gate-failure
16:30:00 and sent some patches, pulled some people
16:30:34 first, to skip test_mtu_update for linuxbridge: https://review.openstack.org/#/c/498932/
16:31:00 because for lb, we start the dhcp agent in a namespace, and so the qdhcp- netns is inside this agent namespace
16:31:19 and we don't have ip_lib support for digging into devices inside a 2nd-order namespace
16:31:27 later we may revisit that
16:31:50 second is the test_ha_router sporadic failure. I sent this: https://review.openstack.org/#/c/500185/
16:32:22 it basically seems that neutron-server may take some time to schedule both agents to the router, so there is a short window when only a single agent is scheduled
16:32:26 and it fails with 2 != 1
16:32:39 the patch makes the test wait, like other test cases do
16:32:47 lgtm
16:32:56 I actually believe there may be a server side bug here, because ideally it would immediately schedule both
16:33:05 and haleyb said he will have a look.
16:33:07 ;)
16:33:11 ihrachys, jlibosva haleyb sorry I was away at that time
16:33:16 but I don't think it's a top priority
16:33:24 for migration tests I proposed this patch https://review.openstack.org/#/c/500384/
16:33:31 i started looking friday, will continue
16:33:43 anilvenkata, np, I think we are good. but an update on how far we are from having the migration fixes land would be nice.
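A rough sketch of the "make the test wait" approach ihrachys describes for test_ha_router at 16:31:50-16:32:39, polling until both L3 agents host the router instead of asserting immediately and racing the scheduler. This is a hypothetical helper, not the actual content of review 500185; it assumes a neutron client handle and neutron's wait_until_true utility:

    from neutron.common import utils as common_utils


    def wait_until_router_scheduled(client, router_id, expected_agents=2, timeout=60):
        """Wait until the HA router is hosted by the expected number of L3 agents."""
        def _scheduled():
            agents = client.list_l3_agent_hosting_routers(router_id)['agents']
            return len(agents) >= expected_agents

        # raises on timeout, so the test still fails if scheduling never completes
        common_utils.wait_until_true(_scheduled, timeout=timeout, sleep=1)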
16:33:50 constantly monitoring failures and updating this patch
16:34:04 ihrachys, failures have come done
16:34:17 down?
16:34:56 earlier most of them were failing, now one or two
16:35:26 gooooood
16:35:29 anilvenkata: great job! :)
16:35:34 we'll loop in our best troops
16:35:36 aka haleyb
16:35:42 I am tracking them and constantly updating the patchset, there are multiple issues (at least 4 bugs reported :) )
16:35:56 anilvenkata, keep them coming lol
16:36:11 once the tests are consistently passing, I will ask for review
16:36:23 thanks ihrachys jlibosva haleyb
16:36:23 anilvenkata, ++ you da man
16:36:32 thanks :)
16:36:32 ok. we were talking about fullstack
16:37:25 ok
16:37:25 the last thing I am aware of is, some HA tests were failing because the ml2 plugin was incorrectly setting a dhcp provisioning block on port update while the dhcp agent was not aware of any action to do on its side, so the port never came to ACTIVE
16:37:43 I also noticed that
16:37:45 I dug into that, talked to kevinbenton, and it seems like now we have a fix for that: https://review.openstack.org/#/c/500269/
16:37:48 dhcp HA or l3 HA?
16:37:55 jlibosva, l3 ha
16:38:03 but maybe it's not l3 ha specific
16:38:06 l3 ha waiting for dhcp provisioning
16:38:13 it's just the fact that the fullstack l3ha test case triggered it
16:38:31 so fullstack found yet another real issue?
16:38:36 the fix is now in gate, so once those three are in, we may expect some better results.
16:38:43 jlibosva, how is it news?
16:38:51 that's what it does.
16:38:57 good guy fullstack
16:39:11 yeah, he is the cool kid in the hood
16:39:18 sadly not everyone has realized that yet
16:39:45 and that's what I had for grafana
16:39:48 speaking about fullstack, I found some time today to do some real work
16:40:01 any other interesting artifacts of grafana that I missed?
16:40:11 jlibosva, work on isolating ovs?
16:40:35 and did some of the isolation work, I plan to continue with the work this week
16:40:48 kevinbenton, Kevin is awesome at proposing quick fixes like this https://review.openstack.org/#/c/500269/3/neutron/plugins/ml2/plugin.py
16:41:19 yeah, it's not just ovs, but simulating the nodes better - it will give us a chance to run two neutron servers
16:41:33 so with haproxy, we could also test active/active server
16:41:40 oh, that's very nice.
16:41:57 also I split the network into management and data networks
16:42:09 that's great
16:42:13 so there are more changes, I need to talk to tmorin as I know bagpipe consumes fullstack
16:42:21 so I don't break their work :)
16:42:30 jlibosva++
16:43:27 cool. I love the way we make progress.
16:43:50 I assume there is nothing else of interest in grafana
16:43:57 #topic Other patches
16:44:38 I noticed there was a wishlist bug for fullstack asking to kill processes more gracefully (not sigkill but sigterm)
16:44:50 I pushed https://review.openstack.org/#/c/499803/ to test whether a mere switch will do
16:45:17 so far there is only a single run and it seems successful (as much as a fullstack run can be right now)
16:45:32 I will monitor and recheck it for a while. I should have some more stats the next time we meet.
16:46:19 also, the fix for the functional job with the new iptables from RHEL 7.4 can be found here: https://review.openstack.org/#/c/495974/
16:46:30 jlibosva already +2'd, maybe haleyb can have a look.
16:46:55 i will look
16:47:59 any more patches we may need to be aware of?
16:48:52 I guess no
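A generic sketch of the sigterm-first pattern behind the fullstack wishlist item mentioned at 16:44:38, illustrative only and not the code in review 499803: ask the process to exit with SIGTERM, give it a grace period to shut down cleanly, and only then fall back to SIGKILL.

    import signal
    import subprocess


    def stop_process(proc: subprocess.Popen, grace_period: float = 10) -> None:
        proc.send_signal(signal.SIGTERM)     # let the process shut down cleanly
        try:
            proc.wait(timeout=grace_period)  # wait for a graceful exit
        except subprocess.TimeoutExpired:
            proc.kill()                      # SIGKILL only as a last resort
            proc.wait()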
16:49:03 #topic Open discussion
16:49:09 we will have PTG next week
16:49:14 so we will cancel the next meeting
16:49:22 we will meet two weeks from now
16:49:33 anything else to discuss?
16:49:38 not from me
16:49:45 nothing here
16:50:19 ok, good, we'll wrap up. keep up the good work.
16:50:27 and haleyb, all eyes are on l3 team ;)
16:50:30 cheers
16:50:32 lol
16:50:32 #endmeeting