16:00:47 <ihrachys> #startmeeting neutron_ci
16:00:49 <openstack> Meeting started Tue Sep  5 16:00:47 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:52 <ihrachys> jlibosva, haleyb o/
16:00:52 <openstack> The meeting name has been set to 'neutron_ci'
16:00:53 <jlibosva> o/
16:00:58 <haleyb> hi
16:01:06 <ihrachys> we have some actual juice to discuss today, let's get going
16:01:14 <ihrachys> first...
16:01:15 <ihrachys> #topic Actions from prev week
16:01:21 <ihrachys> "jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default"
16:01:51 <jlibosva> I did investigate the option and it seems the default subnetpool is created because it's required by the auto_allocated_topology tests
16:02:41 <jlibosva> as I'm not really familiar with the auto allocated topology feature, I'll need to talk to armax to see whether we can create the default subnetpool before running the auto allocated topology tests and then remove it
16:03:07 <ihrachys> #action jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate
16:03:07 <jlibosva> it will require some external locking so the auto allocated topology and default subnetpool tests don't run in parallel
16:03:31 <ihrachys> I see. I think tests in the same class don't run in parallel
16:03:34 <ihrachys> so you can use that
16:04:33 <jlibosva> aha, you mean to put the auto allocated topology and default subnetpool tests under the same class
16:04:57 <ihrachys> yeah. that will give you necessary serialization. I am not sure if it makes sense logically though.
16:05:05 <ihrachys> I mean, in terms of code quality
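For context on the class-grouping idea above, here is a minimal sketch; it is not the actual test code, and the import path, base class, and method bodies are assumptions about the tempest API test layout at the time. Because the test runner groups methods of one class into a single worker, the two cases below would never run in parallel with each other:

    from neutron.tests.tempest.api import base  # assumed path

    class DefaultSubnetpoolTopologyTest(base.BaseAdminNetworkTest):
        """Hypothetical grouping of the two conflicting cases."""

        def test_auto_allocated_topology(self):
            # create the default subnetpool that auto-allocation requires,
            # exercise the topology, then delete the default subnetpool
            ...

        def test_convert_default_subnetpool_to_non_default(self):
            # safe here: no concurrently running test in this class depends
            # on the default subnetpool staying default
            ...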
16:06:19 <ihrachys> ok let's move on
16:06:24 <ihrachys> next item was "anilvenkata to add more control plane checks for migrated router ports tests before ssh'ing"
16:06:34 <ihrachys> I believe anilvenkata sent a bunch of fixes for the issues
16:06:41 <ihrachys> https://review.openstack.org/#/c/500384/
16:06:49 <ihrachys> I believe it will be eventually split
16:06:59 <ihrachys> correct?
16:07:24 <jlibosva> also https://review.openstack.org/#/c/500379 I think is relevant
16:07:42 <ihrachys> I see. I will have a look at that one too
16:07:51 <ihrachys> so it seems like good progress, let's take it to completion :)
16:08:09 * haleyb will look as well of course
16:08:22 <ihrachys> next was "ihrachys to complete gate-failure cleanup"
16:08:25 <ihrachys> I did close quite a few bugs that haven't shown up again
16:08:36 * jlibosva will look too and will be like "hmmm, l3 code ... "
16:08:44 <ihrachys> I also looked closely through all fullstack bugs and tried to fix/triage some of them
16:08:49 <ihrachys> we will discuss specific patches later
16:09:02 <ihrachys> let's look at the gate
16:09:06 <ihrachys> #topic Grafana
16:09:10 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:09:17 * jlibosva refreshes grafana page
16:09:28 <ihrachys> I see rally being at 100% a while ago, but that should be fixed now.
16:09:30 <ihrachys> in master
16:09:41 <ihrachys> I backported the fixes to stable: https://review.openstack.org/#/q/I80c9a155ee2b52558109c764075a58dfabee44d4,n,z
16:10:10 <ihrachys> we also have grenade-dvr at 100% right now
16:10:24 <ihrachys> is it the same failure we were struggling with last week?
16:10:30 <jlibosva> were stable branches also affected by the rally?
16:10:42 <ihrachys> jlibosva, I would think so, since devstack-gate is branchless
16:10:47 <haleyb> we have two patches ready for that; they got blocked when tox failed, then rally
16:10:49 <jlibosva> ok
16:11:09 <haleyb> i'm assuming we're talking about the keyerror revert and fix
16:11:12 <ihrachys> haleyb, that will be https://review.openstack.org/#/c/499292/ and https://review.openstack.org/#/c/499585/ and https://review.openstack.org/#/c/499725/ ?
16:11:29 <jlibosva> I looked at the grenade failure and updated the LP bug with what I saw in the logs. I thought that my patch ignoring the host in the fip object brought back the regression we were fighting with
16:12:18 <haleyb> ihrachys: yes, thought there were only two but i'll look at those 3
16:12:20 <jlibosva> also worth mentioning that I made the grenade dvr job non-voting as we weren't able to merge the rally patch with both jobs failing
16:12:51 <ihrachys> jlibosva, ok. I think it's a good call in general before we prove it's back to normal
16:13:02 <ihrachys> we already wasted ~week of gate
16:13:11 <haleyb> oh, I just +2'd that last one, so yes 3
16:13:22 <ihrachys> ETOOMANYPATCHES
16:13:53 <haleyb> ihrachys: and we'll keep working on the server side, then take a deep breath
16:14:02 <jlibosva> I'm confused, I guess we still need to fix the pike branch?
16:14:12 <ihrachys> we also have scenario jobs at 100%. what's the latest status there? I believe anilvenkata's fixes should help that in part, but do we have full picture of remaining items?
16:14:27 <ihrachys> I also believe dvr keyerror issue was affecting it
16:14:31 <jlibosva> I haven't been tracking the failures since the etherpad
16:14:47 <ihrachys> jlibosva, we probably need to backport all those 3 fixes
16:14:54 <ihrachys> (revert is effectively already there, so 2)
16:15:13 <haleyb> i thought we backported jlibosva's keyerror fix out-of-order
16:15:24 <jlibosva> yeah, that's why I think it broke the job again
16:15:33 <haleyb> i.e. pike first, then master since we needed it in the "old" grenade
16:15:47 <ihrachys> jlibosva, ok. I think it's fine to let the router migration and some other work proceed and revisit the state after those fixes are in.
16:16:01 <jlibosva> we're talking about two topics now :)
16:16:04 <ihrachys> :)
16:16:10 <ihrachys> ok, let's focus on grenade/dvr
16:16:13 <jlibosva> ack
16:16:27 <ihrachys> my understanding is, we have 3 fixes: revert + 2 new.
16:16:37 <ihrachys> we landed revert in pike because it blocked master revert
16:16:47 <ihrachys> now we try to push the revert in master
16:17:03 <ihrachys> since the job is non-voting it should go in smoothly despite pike being broken by keyerror
16:17:05 <haleyb> we also pushed the second (fix keyerror) in pike
16:17:13 <jlibosva> but pike is still broken now as per grafana
16:17:36 <ihrachys> jlibosva, because we haven't merged all 3 fixes in pike no?
16:17:41 <jlibosva> so haleyb, do you think it's possible that my fix that went only to pike broke the job again - as it might be causing the same behavior as swami's patch?
16:17:44 <ihrachys> haleyb, you pushed? merged you mean?
16:18:16 <ihrachys> it would be nice to test all patches in one go, both backports and master fixes, with some kind of DNM patch
16:18:23 <haleyb> https://review.openstack.org/#/c/500077/ one
16:18:30 <ihrachys> I *think* gate actually allows depends-on for patches in multiple branches
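If the cross-branch dependency works as ihrachys suspects, the DNM test patch would only need commit-message footers like the one below (the subject and Change-Id are made up). Since a cherry-picked backport keeps its Change-Id, a single footer should pull in both the master change and its pike counterpart:

    DNM: exercise grenade-dvr with all pending fixes

    Depends-On: I0123456789abcdef0123456789abcdef01234567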
16:18:55 <ihrachys> haleyb, I see, I wasn't aware of that
16:19:26 <jlibosva> and I think the 500077 broke the gate again
16:19:31 <jlibosva> as per grafana: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=11&fullscreen
16:19:42 <haleyb> ihrachys: it was backwards because until we merged the revert to pike we couldn't land things in master...
16:19:51 <jlibosva> if you look at the 7-day history, the curve goes up after the patch was merged
16:20:13 <ihrachys> haleyb, ok. comments on what jlibosva suggested? ^
16:20:34 <jlibosva> so maybe the only way to fix it is the server side patch
16:20:41 <ihrachys> I would be glad if we avoid some reverts/backports before we are sure those don't break anything
16:21:04 <jlibosva> the job was still failing on keyerror after the revert, in the gate queue
16:21:07 <jlibosva> in master
16:21:28 <ihrachys> yes. the original patch was actually trying to fix the keyerror
16:21:32 <jlibosva> what I don't understand is why it wasn't happening in the check queue ...
16:21:37 <ihrachys> and then broke the gate with duplicate fips
16:21:59 <haleyb> jlibosva: what was your proposal?  I'm sure it will be great :)
16:22:12 <ihrachys> so we had keyerror; then we fixed it but regressed with duplicate fips; then we reverted the latter and got keyerrors back; now we try to fix keyerrors.
16:22:19 <jlibosva> haleyb: my suggestion is to ask haleyb and swami for help :-p
16:22:25 <ihrachys> haha
16:22:49 <ihrachys> well this sh*t is definitely on l3 team to solve, so I kinda agree Swami and Brian should be involved ;)
16:23:01 <ihrachys> it all started with the late feature backport for new dvr mode
16:23:28 <haleyb> yes, we can take the blame for that, armando warned us...
16:23:52 <ihrachys> I don't care about blame. I want my precious gate back to normal. :)
16:23:54 <jlibosva> don't tell him, he'd be like "I told you"
16:24:11 * haleyb goes to his corner
16:24:15 <ihrachys> anyway
16:24:26 <ihrachys> #action haleyb to figure out the way forward for grenade/dvr gate
16:24:30 <jlibosva> one thing I don't understand is why the keyerror was happening in the gate queue only - after the revert of swami's patch
16:24:49 <ihrachys> luck?
16:24:58 <ihrachys> I think the original keyerror was not breaking things 100% of the time?
16:25:03 <jlibosva> maybe the reproduction rate was lower than with the current issue
16:25:05 <jlibosva> yeah
16:25:15 <haleyb> it depends on where the VM was launched, from what i remember, so it wasn't 100%
16:25:29 <jlibosva> but now it is 100%
16:25:40 <jlibosva> anyway, let's move on :)
16:26:28 <jlibosva> maybe we should revert my patch from pike and not merge it in master
16:26:35 <ihrachys> I will let haleyb dig ;) I will stay and watch the house burn.
16:27:11 <ihrachys> jlibosva, maybe. maybe not. that's why I would love to see some gate proof we have a final resolution.
16:27:14 <jlibosva> https://i.imgur.com/c4jt321.png
16:27:42 <ihrachys> jlibosva, that was exactly my reaction when I noticed tox ALL RED failure late Friday
16:27:50 <ihrachys> I said f* it and turned off irc
16:27:56 <jlibosva> :D
16:28:11 <ihrachys> anyway... let's switch topics
16:28:42 <ihrachys> so scenario jobs. they fail, but we have the router migration work and the keyerror work in the pipeline, so it would probably make sense to revisit the failure rate once those are tackled.
16:29:04 <ihrachys> so finally, fullstack job
16:29:13 <ihrachys> that one was on me to triage
16:29:26 <ihrachys> and after weeks of procrastination I actually did. yay Ihar.
16:29:40 <ihrachys> I started https://etherpad.openstack.org/p/queens-neutron-fullstack-failures
16:29:42 <jlibosva> I wouldn't call that procrastination but go on :)
16:29:51 <ihrachys> and I also looked through reported bugs in gate-failure
16:30:00 <ihrachys> and sent some patches, pulled some people
16:30:34 <ihrachys> first, to skip test_mtu_update for linuxbridge: https://review.openstack.org/#/c/498932/
16:31:00 <ihrachys> because for lb, we start the dhcp agent in a namespace, and so the qdhcp- netns is inside this agent's namespace
16:31:19 <ihrachys> and ip_lib doesn't support digging into devices inside a 2nd-order namespace
16:31:27 <ihrachys> later we may revisit that
16:31:50 <ihrachys> second is test_ha_router sporadic failure. I sent this: https://review.openstack.org/#/c/500185/
16:32:22 <ihrachys> it basically seems that neutron-server may take some time to schedule both agents to the router, so there is a short window when only a single agent is scheduled
16:32:26 <ihrachys> and it fails with 2 != 1
16:32:39 <ihrachys> the patch makes the test wait, like other test cases do
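A rough sketch of that wait-style fix, using wait_until_true from neutron.common.utils; this is not the literal diff in https://review.openstack.org/#/c/500185/, and the client call inside the helper is a hypothetical stand-in for the fullstack API client:

    from neutron.common import utils as common_utils

    def wait_for_both_l3_agents(client, router_id, timeout=60):
        """Block until the server reports two L3 agents hosting the router."""

        def _both_scheduled():
            # hypothetical client call; the real test asks neutron-server
            # which agents currently host the router
            agents = client.list_l3_agents_hosting_router(router_id)
            return len(agents) == 2

        # instead of asserting immediately (and failing with "2 != 1" in the
        # short window where only one agent is scheduled), poll until both
        # agents show up or the timeout expires
        common_utils.wait_until_true(_both_scheduled, timeout=timeout, sleep=1)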
16:32:47 <haleyb> lgtm
16:32:56 <ihrachys> I actually believe there may be a server side bug here, because ideally it would immediately schedule both
16:33:05 <ihrachys> and haleyb said he will have a look.
16:33:07 <ihrachys> ;)
16:33:11 <anilvenkata> ihrachys, jlibosva haleyb sorry I was away at that time
16:33:16 <ihrachys> but I don't think it's at top priority
16:33:24 <anilvenkata> for migration tests I proposed this patch https://review.openstack.org/#/c/500384/
16:33:31 <haleyb> i started looking friday, will continue
16:33:43 <ihrachys> anilvenkata, np, I think we are good. but an update on how far we are for migration fixes to land would be nice.
16:33:50 <anilvenkata> constantly monitoring failures and updating this patch
16:34:04 <anilvenkata> ihrachys, failures have come done
16:34:17 <ihrachys> down?
16:34:56 <anilvenkata> earlier most of them were failing, now one or two
16:35:26 <ihrachys> gooooood
16:35:29 <jlibosva> anilvenkata: great job! :)
16:35:34 <ihrachys> we'll loop in our best troops
16:35:36 <ihrachys> aka haleyb
16:35:42 <anilvenkata> I am tracking them and constantly updating the patchset; there are multiple issues (I've reported at least 4 bugs :) )
16:35:56 <ihrachys> anilvenkata, keep them coming lol
16:36:11 <anilvenkata> once the tests are consistently passing, I will ask for review
16:36:23 <anilvenkata> thanks ihrachys jlibosva haleyb
16:36:23 <ihrachys> anilvenkata, ++ you da man
16:36:32 <anilvenkata> thanks :)
16:36:32 <ihrachys> ok. we were talking about fullstack
16:37:25 <anilvenkata> ok
16:37:25 <ihrachys> the last thing I am aware of is that some HA tests were failing because the ml2 plugin was incorrectly setting a dhcp provisioning block on port update while the dhcp agent was not aware of any action to take on its side, so the port never came to ACTIVE
16:37:43 <anilvenkata> I also noticed that
16:37:45 <ihrachys> I dug into that, talked to kevinbenton, and it seems we now have a fix for that: https://review.openstack.org/#/c/500269/
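For readers unfamiliar with provisioning blocks, a simplified illustration of the mechanism involved; this is not the code of the fix above, and the helper below is purely illustrative:

    from neutron.db import provisioning_blocks

    def illustrate_dhcp_block(context, port_id):
        # ml2 registers a DHCP "block" on the port ('port' is the resource
        # type, normally taken from the callbacks resources module); the
        # port only transitions to ACTIVE once every registered entity
        # reports completion.
        provisioning_blocks.add_provisioning_component(
            context, port_id, 'port', provisioning_blocks.DHCP_ENTITY)

        # the dhcp agent is expected to clear the block (via RPC) once it
        # has actually reconfigured something for the port
        provisioning_blocks.provisioning_complete(
            context, port_id, 'port', provisioning_blocks.DHCP_ENTITY)

    # The bug: a block was added for port updates the dhcp agent never acts
    # on, so provisioning_complete never arrived and the port stayed stuck
    # short of ACTIVE.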
16:37:48 <jlibosva> dhcp HA or l3 HA?
16:37:55 <ihrachys> jlibosva, l3 ha
16:38:03 <ihrachys> but maybe it's not l3 ha specific
16:38:06 <anilvenkata> l3 ha waiting for dhcp provisioning
16:38:13 <ihrachys> it's just that the fullstack l3ha test case triggered it
16:38:31 <jlibosva> so fullstack found yet another real issue?
16:38:36 <ihrachys> the fix is now in the gate, so once those three are in, we may expect some better results.
16:38:43 <ihrachys> jlibosva, how is it news?
16:38:51 <ihrachys> that's what it does.
16:38:57 <jlibosva> good guy fullstack
16:39:11 <ihrachys> yeah, he is the cool kid in the hood
16:39:18 <ihrachys> sadly not everyone realized that yet
16:39:45 <ihrachys> and that's what I had for grafana
16:39:48 <jlibosva> speaking about fullstack, I found some time today to do some real work
16:40:01 <ihrachys> any other interesting artifacts of grafana that I missed?
16:40:11 <ihrachys> jlibosva, work on isolating ovs?
16:40:35 <jlibosva> and did some of the isolation work, I plan to continue with the work this week
16:40:48 <anilvenkata> kevinbenton, Kevin is awesome at proposing quick fixes like this https://review.openstack.org/#/c/500269/3/neutron/plugins/ml2/plugin.py
16:41:19 <jlibosva> yeah, it's not just ovs, but better simulating the nodes - it will give us a chance to run two neutron servers
16:41:33 <jlibosva> so with haproxy, we could also test active/active server
16:41:40 <ihrachys> oh, that's very nice.
16:41:57 <jlibosva> also I split the network into a management network and a data network
16:42:09 <anilvenkata> that's great
16:42:13 <jlibosva> so there are more changes, I need to talk to tmorin as I know bagpipe consumes fullstack
16:42:21 <jlibosva> so I don't break their work :)
16:42:30 <anilvenkata> jlibosva++
16:43:27 <ihrachys> cool. I love the way we make progress.
16:43:50 <ihrachys> I assume there is nothing else of interest in grafana
16:43:57 <ihrachys> #topic Other patches
16:44:38 <ihrachys> I noticed there was a wishlist bug for fullstack asking to kill processes more gracefully (not sigkill but sigterm)
16:44:50 <ihrachys> I pushed https://review.openstack.org/#/c/499803/ to test whether a mere switch will do
16:45:17 <ihrachys> so far there is only a single run and it seems successful (as much as a fullstack run can be right now)
16:45:32 <ihrachys> I will monitor and recheck it for a while. I should have some more stats the next time we meet.
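The idea behind https://review.openstack.org/#/c/499803/ boils down to a graceful stop with a forced fallback; a minimal sketch, assuming a subprocess.Popen-style handle rather than the actual fullstack process helper:

    import signal
    import subprocess

    def stop_process(process, timeout=15):
        """Ask the process to exit cleanly, escalating only if it hangs."""
        process.send_signal(signal.SIGTERM)  # let the agent clean up
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()                   # last resort, the old behaviour
            process.wait()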
16:46:19 <ihrachys> also, the fix for the functional job with the new iptables from RHEL 7.4 can be found here: https://review.openstack.org/#/c/495974/
16:46:30 <ihrachys> jlibosva already +2d, maybe haleyb can have a look.
16:46:55 <haleyb> i will look
16:47:59 <ihrachys> any more patches we may need to be aware of?
16:48:52 <ihrachys> I guess no
16:49:03 <ihrachys> #topic Open discussion
16:49:09 <ihrachys> we will have PTG next week
16:49:14 <ihrachys> so we will cancel the next meeting
16:49:22 <ihrachys> we will meet two weeks from now
16:49:33 <ihrachys> anything else to discuss?
16:49:38 <jlibosva> not from me
16:49:45 <haleyb> nothing here
16:50:19 <ihrachys> ok, good, we'll wrap up. keep up the good work.
16:50:27 <ihrachys> and haleyb, all eyes are on l3 team ;)
16:50:30 <ihrachys> cheers
16:50:32 <jlibosva> lol
16:50:32 <ihrachys> #endmeeting