16:00:23 <ihrachys> #startmeeting neutron_ci
16:00:24 <openstack> Meeting started Tue Mar 21 16:00:23 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 <openstack> The meeting name has been set to 'neutron_ci'
16:00:29 <manjeets> o/
16:00:31 <ihrachys> good day everyone
16:00:34 <jlibosva> o/
16:00:47 <dasm> o/
16:00:48 <dasanind> o/
16:00:58 * ihrachys waves at mlavalle
16:01:09 <mlavalle> o/
16:01:42 * mlavalle waves back at ihrachys
16:01:57 <ihrachys> ok let's start with reviewing action items from prev meeting
16:02:03 <ihrachys> "ihrachys fix e-r bot not reporting in irc channel"
16:02:25 <ihrachys> didn't happen, I am a slug, will need to wrap it up this week
16:02:28 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:02:38 <ihrachys> next is "haleyb and mlavalle to investigate what makes dvr gate job failing with 25% rate"
16:02:45 <mlavalle> hi
16:02:49 <ihrachys> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:03:02 <mlavalle> I've been digging into this in Kibana
16:03:40 <mlavalle> since it is in the check queue, it is laborious to separate failures caused by the patch set under test from real failures
16:04:05 <mlavalle> for real failures, I don't have a conclusive diagnosis yet
16:04:06 <ihrachys> any idea why gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial is not in grafana board at all?
16:04:33 <mlavalle> no^^^^
16:04:44 <mlavalle> maybe haleyb might speak to that
16:04:55 <mlavalle> I don't see him around, though
16:05:06 <ihrachys> oh because it's gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv now
16:05:08 <ihrachys> note -nv
16:05:20 <mlavalle> ok cool
16:05:26 <ihrachys> #action ihrachys to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv
16:05:45 <ihrachys> mlavalle: it's hard to say if there is any trend beyond regular check queue issues
16:06:13 <mlavalle> of the real failures that I have analyzed, the most common case is {u'code': 500, u'message': u'No valid host was found. There are not enough hosts available.'
16:06:41 <ihrachys> which is usually because the scheduler failed to talk to libvirtd, isn't it?
16:06:53 <mlavalle> Yeah, I think so
16:07:05 <mlavalle> I haven't seen enough evidence yet, though
16:07:16 <mlavalle> I would like to watch this a few more days
16:07:38 <mlavalle> I just wanted to share with the team the trend that I am seeing
16:07:59 <ihrachys> ok let's sync the next week when the board is hopefully fixed
16:08:05 <mlavalle> cool
16:08:13 <ihrachys> ok next was "ihrachys explore why bug 1532809 bubbled into top in e-r"
16:08:13 <openstack> bug 1532809 in OpenStack Compute (nova) liberty "Gate failures when DHCP lease cannot be acquired" [High,In progress] https://launchpad.net/bugs/1532809 - Assigned to Sean Dague (sdague)
16:08:15 <mlavalle> I'll keep watching this over the next few days
16:08:41 <ihrachys> I checked logs, and it was because the ODL gate was triggering it; it's just a gate setup issue on their side, I believe
16:09:01 <ihrachys> overall the query seems to me rather generic: https://review.openstack.org/#/c/445596/
16:09:12 <ihrachys> hence the delete request ^
16:09:25 <ihrachys> next was "jlibosva fix delete_dscp for native driver: https://review.openstack.org/445560"
16:09:45 <ihrachys> it's merged, yay. there are still issues in the job that bring us to the next item
16:09:52 <ihrachys> "jlibosva to fix remaining fullstack failures in securitygroups for linuxbridge"
16:09:58 <ihrachys> jlibosva: progress there?
16:10:10 <jlibosva> I attempted to make it work, then it was pointed out that kevinbenton had started a similar patch
16:10:13 <ihrachys> is it the conntrack issue reintroduced by kevinbenton lately?
16:10:21 <jlibosva> yes
16:10:39 <jlibosva> the thing is that iptables driver in linuxbridge agent doesn't remove conntracks due to missing zones
16:10:40 <jlibosva> https://review.openstack.org/#/c/441353/
16:11:03 <jlibosva> conntracks == conntrack entries
16:11:21 <ihrachys> makes sense, gotta have a closer look, though it's failing in unit tests right now
16:11:38 <ihrachys> jlibosva: so we proved it fixes the failure?
16:11:58 <jlibosva> I don't think that patch is final.
16:12:06 <ihrachys> I see test_securitygroup(ovs-hybrid) is failing there
16:12:09 <jlibosva> I'll try to support kevin or take over if needed
16:12:16 <jlibosva> ihrachys: which means it fixes the LB one :)
16:12:28 <ihrachys> or that it breaks ovs?
16:12:45 <jlibosva> that's a side-effect, yes. It still needs some work to be done
16:14:07 <ihrachys> ok
16:14:08 <jlibosva> that's all I can tell about that
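To make the missing-zones point concrete, here is a minimal sketch (not the actual patch linked above; the helper name and the direct use of the conntrack CLI are assumptions) of the zone-scoped conntrack cleanup that the linuxbridge iptables driver was lacking:

    import subprocess

    def delete_conntrack_entries(zone, dest_ip=None):
        """Hypothetical helper: flush conntrack entries scoped to one zone.

        Without the -w/--zone filter the agent cannot target just the
        entries belonging to a given port, which is roughly the gap
        described above for the linuxbridge iptables driver.
        """
        cmd = ['conntrack', '-D', '-w', str(zone)]
        if dest_ip:
            cmd += ['-d', dest_ip]
        # conntrack exits non-zero when no entries matched; that is fine here.
        subprocess.call(cmd)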
16:14:34 <ihrachys> I see test_controller_timeout_does_not_break_connectivity_sigkill has been bubbling up since we landed it
16:14:55 <jlibosva> yep, I reported a bug about that some time ago I think
16:14:56 * jlibosva searches
16:15:26 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1673531
16:15:26 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [Undecided,New]
16:15:32 <jlibosva> I haven't looked at it yet though
16:16:18 <ihrachys> jlibosva: I understand that fullstack is not voting but would probably make sense to tag those bugs as gate-failure nevertheless
16:16:31 <ihrachys> since they may indicate something actually broken
16:16:31 <jlibosva> ihrachys: alright
16:16:39 <ihrachys> and since we want it voting this time :)
16:16:50 <ihrachys> I added the tag
16:17:57 <ihrachys> ok next item was "anilvenkata to follow up on HA+DVR job patches"
16:18:15 <ihrachys> the patch is still up for review: https://review.openstack.org/#/c/383827/
16:18:33 <ihrachys> there were some back and forth comments, seems the patch is finally ready for merge
16:19:11 <ihrachys> clarkb: would be cool to see you revisit ^
16:19:34 <ihrachys> ok, next item was "jlibosva to figure out the plan for py3 gate transition and report back"
16:19:48 <jlibosva> this is still 'to be done'
16:20:15 <ihrachys> ok let's repeat it
16:20:15 <jlibosva> can you please move it to the next week?
16:20:21 <ihrachys> #action jlibosva to figure out the plan for py3 gate transition and report back
16:20:33 <ihrachys> next was "manjeets respin https://review.openstack.org/#/c/439114/ to include gate-failure reviews into existing dashboard"
16:20:43 <manjeets> ihrachys, done
16:20:44 <ihrachys> I see the patch was respun yesterday: https://review.openstack.org/#/c/439114/
16:21:11 <manjeets> dashboard will look like https://tinyurl.com/lqdu3qo
16:21:16 <ihrachys> cool, let's review that and then chase infra to get it in
16:21:17 <manjeets> scroll all the way to end
16:21:42 <ihrachys> ack. I would like to see it higher in the stack
16:21:47 <ihrachys> somewhere near Critical
16:22:03 <manjeets> that'll be a one-line change I believe, I can do that
16:22:06 <ihrachys> without gate, we can't land anything, so it would make sense to give it a priority
16:22:11 <ihrachys> ack
16:22:54 <ihrachys> and... that's it for action items from the previous week
16:23:48 <ihrachys> #topic State of the Gate
16:23:57 <mlavalle> quite a few!
16:24:13 <ihrachys> mlavalle: as always, makes it easier to track things
16:24:17 <ihrachys> several breakages happened the prev week that we dealt with
16:24:40 <ihrachys> one was a hilarious breakage by os-log-merger integration that was hopefully unraveled by https://review.openstack.org/#/q/topic:fix-os-log-merger-crash
16:24:57 <ihrachys> turned out os-log-merger was not really that liberal about the input it accepts
16:25:20 <ihrachys> then there was an eventlet-induced breakage from yesterday, fixed by https://review.openstack.org/#/c/447817/
16:25:34 <ihrachys> a follow-up for that patch is up for review here: https://review.openstack.org/#/c/447896/1
16:26:31 <ihrachys> there is still a gate breakage by an OVO test up for review here: https://review.openstack.org/#/c/447600/
16:26:41 <ihrachys> it's not consistent, but may hit the gate sometimes
16:26:55 <ihrachys> since it's unit tests, we gotta make it stable before people get used to rechecking
16:26:57 <ihrachys> :)
16:27:58 <ihrachys> any other fixes that we are aware of that will fix gate failures?
16:28:59 <jlibosva> not that I'm aware of
16:29:35 <ihrachys> cool
16:30:09 * manjeets remembers os-log-merger crashing multiple jobs
16:30:19 <ihrachys> manjeets: that is fixed and discussed above
16:30:22 <ihrachys> #topic Gate failure bugs
16:30:28 <ihrachys> #link https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:30:39 <ihrachys> I am wondering about https://bugs.launchpad.net/neutron/+bug/1627106
16:30:39 <openstack> Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [High,In progress] - Assigned to Terry Wilson (otherwiseguy)
16:30:43 <ihrachys> has anyone seen it lately?
16:30:45 <manjeets> ihrachys, yea i noticed that from jenkins results
16:31:11 <ihrachys> the query is at http://status.openstack.org/elastic-recheck/#1627106
16:31:22 <ihrachys> and it shows 0 fails in 24 hrs / 0 fails in 10 days
16:31:30 <ihrachys> which apparently means that we squashed it somehow :)
16:31:50 <ihrachys> otherwiseguy: ^ are we in agreement?
16:32:02 <otherwiseguy> ihrachys, I *hope* so :)
16:32:15 <ihrachys> ok let's close the bug and see if it shows up again
16:32:44 * otherwiseguy dances
16:32:48 <ihrachys> nice job everyone and especially otherwiseguy
16:33:26 <manjeets> otherwiseguy, \o/
16:33:54 <ihrachys> I also believe we can close https://bugs.launchpad.net/neutron/+bug/1632290 now that metadata-proxy is rewritten to use haproxy?
16:33:54 <openstack> Launchpad bug 1632290 in neutron "RuntimeError: Metadata proxy didn't spawn" [Medium,Triaged]
16:33:55 <jlibosva> otherwiseguy rocks
16:34:23 <otherwiseguy> jlibosva misspelled "openstack"
16:35:20 <ihrachys> the only place the metadata logstash query hits is this patch https://review.openstack.org/#/c/401086/
16:35:22 <ihrachys> which is all red
16:35:30 <ihrachys> so it's not unexpected that metadata is also borked there
16:35:34 <ihrachys> I will close the bug
16:37:44 <ihrachys> I am looking at https://bugs.launchpad.net/neutron/+bug/1673780 and wonder why we track vpnaas bugs in neutron component
16:37:44 <openstack> Launchpad bug 1673780 in neutron "vpnaas import error for agent config" [Undecided,In progress] - Assigned to YAMAMOTO Takashi (yamamoto)
16:37:49 <ihrachys> it's not even a stadium participant
16:37:58 <manjeets> ihrachys, would it make sense to have gate failures at the top, before RFEs?
16:38:01 <manjeets> https://tinyurl.com/lqdu3qo
16:38:11 <ihrachys> manjeets: yeah, probably
16:38:27 <ihrachys> #action ihrachys to figure out why we track vpnaas bugs under neutron component
16:39:47 <ihrachys> another bug of interest is this: https://bugs.launchpad.net/neutron/+bug/1674517 and there is a patch from dasanind for that: https://review.openstack.org/#/c/447781/
16:39:47 <openstack> Launchpad bug 1674517 in neutron "pecan missing custom tenant_id policy project_id matching" [High,In progress] - Assigned to Anindita Das (anindita-das)
16:40:05 <ihrachys> as you prolly know we switched to pecan lately and now squash bugs that pop up
16:40:54 <mlavalle> yeap, they are going to show up now
16:41:22 <mlavalle> but it's the right time to do it, early in the cycle :-)
16:41:51 <manjeets> ihrachys, it will be now like  https://tinyurl.com/mj2uyw5
16:42:18 <ihrachys> manjeets: +
16:43:26 <ihrachys> there were several segfaults of the ovs-vswitchd component
16:43:43 <ihrachys> see https://bugs.launchpad.net/neutron/+bug/1669900
16:43:43 <openstack> Launchpad bug 1669900 in neutron "ovs-vswitchd crashed in functional test with segmentation fault" [Medium,Confirmed]
16:43:48 <ihrachys> and https://bugs.launchpad.net/neutron/+bug/1672607
16:43:48 <openstack> Launchpad bug 1672607 in neutron "test_arp_spoof_doesnt_block_normal_traffic fails with AttributeError: 'NoneType' object has no attribute 'splitlines'" [High,Confirmed]
16:44:07 <ihrachys> I don't think we can do much about segfaults per se (we don't maintain ovs)
16:44:30 <ihrachys> and afaik in functional job, we don't even compile ovs from source
16:44:32 <ihrachys> right?
16:44:39 <jlibosva> right
16:44:47 <ihrachys> but the second bug has two sides to it
16:44:54 <ihrachys> one, yes, it's a segfault
16:45:09 <ihrachys> but that does not explain why our code fails with AttributeError
16:45:42 <ihrachys> we should be ready for ovs crashing, in which case we may get None back from dump-flows (or whatever is called there)
16:46:05 <ihrachys> I was thinking the latter bug should be about making the code more paranoid about ovs state
16:46:45 <ihrachys> it's ofctl though, which is not default driver, so maybe not of great criticality
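As a rough illustration of the "more paranoid about ovs state" idea (the names here are made up; this is not the actual ofctl code path): if ovs-vswitchd dies mid-test, the flow dump can come back as None, and the caller should treat that as "no flows" instead of raising the AttributeError from the bug title:

    def get_flow_lines(bridge):
        """Hypothetical sketch: tolerate a crashed or restarting vswitchd.

        dump_flows() is assumed to return the raw flow dump as a string,
        or None when the underlying command failed.
        """
        output = bridge.dump_flows()
        if output is None:
            # ovs is gone; report no flows rather than failing with
            # AttributeError: 'NoneType' object has no attribute 'splitlines'
            return []
        return output.splitlines()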
16:47:19 <ihrachys> otherwiseguy: thoughts on ovs segfaults?
16:48:39 <otherwiseguy> ihrachys, I haven't looked at the crash bugs unfortunately (other than a cursory "oh, ovs segfaulted. that's bad").
16:48:51 <mlavalle> lol
16:49:02 <ihrachys> :)
16:49:27 <ihrachys> I don't recollect segfaults in times when we were compiling ovs
16:49:29 <otherwiseguy> I'd love for a vm snapshot to be stashed away in that case so that we could play around with the core dump etc., but ...
16:49:40 <ihrachys> otherwiseguy: yeah
16:49:49 <ihrachys> otherwiseguy: I don't think we collect core dumps
16:49:56 <ihrachys> otherwiseguy: just dumps would be helpful
16:50:02 <ihrachys> no need for the whole vm state
16:50:26 <otherwiseguy> at least a thread apply all bt full output or something
16:50:55 <ihrachys> maybe it's just an oversight.
16:50:59 <ihrachys> anyone want to chase infra about why we don't collect core dumps? :)
16:51:19 <ihrachys> I suspect it may be a concern of dump size
16:51:52 <otherwiseguy> are core dumps really that useful without the environment they were created in?
16:52:17 <ihrachys> otherwiseguy: you have config and logs; you can load debuginfo into gdb to get symbols
16:52:37 <otherwiseguy> if you have *exactly* the same libraries, etc.
16:52:53 <ihrachys> we have exact rpm version numbers
16:53:13 <ihrachys> I don't think it moves so quickly that you can't reproduce the environment as in the gate
16:53:32 <ihrachys> in the end, it will be glibc and ovs-whatever-lib and some more, but not whole system
16:54:03 <ihrachys> btw .rpm -> .deb above, it's xenial
16:54:37 <ihrachys> ok we gotta figure out something, at least patch neutron so that it doesn't crash so hard
16:55:03 <ihrachys> I don't see any more bugs in the list that may be worth discussing right now
16:55:07 <ihrachys> #topic Open discussion
16:55:39 <manjeets> ihrachys, even though it is related to upgrades, it's somehow also a CI issue
16:55:42 <ihrachys> https://review.openstack.org/#/q/topic:new-neutron-devstack-in-gate is slowly progressing, if you have spare review cycles, please chime in, there are some pieces in neutron
16:55:49 <ihrachys> manjeets: shoot
16:55:59 <clarkb> I think ideally people would be replicating locally for the vast majority of failures. Then for those that fail to reproduce we figure out grabbing a core dump specifically for that
16:56:09 <manjeets> grenade-multinode-lb is mostly failing on one kind of cloud
16:56:13 <clarkb> has anyone tried running ./reproduce.sh and then hitting ovs until it segfaults/
16:56:18 <ihrachys> clarkb: it happened like twice in last weeks
16:56:34 <manjeets> on which instance bring-up times out, as far as I can tell
16:56:49 <manjeets> it passes on the ovh and osic clouds but fails on rax
16:56:49 <clarkb> ihrachys: ah ok so ya in that case I think you'd set it up to handle that specific case
16:57:31 <ihrachys> clarkb: so we would cp /var/run/*.core or wherever they are stored into logdir?
16:57:59 <ihrachys> ideally like /var/run/ovs-vswitchd*.core or smth
16:58:13 <manjeets> I am experimenting with tempest_concurrency=2 (which is 4 by default) to see if that is causing the smoke test failures on rax nodes
16:58:26 <clarkb> ihrachys: ya, you may also need to set ulimit(s)
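A rough sketch of how that collection could work, assuming a hypothetical post-job hook (the paths, env var, and glob patterns are guesses, not the actual gate scripts): lift the core-size rlimit before the tests run, then copy whatever cores ovs-vswitchd left behind into the directory that gets uploaded with the job logs.

    import glob
    import os
    import resource
    import shutil

    LOG_DIR = os.environ.get('LOG_DIR', '/opt/stack/logs')  # assumed log location

    def allow_core_dumps():
        # Lift the limit on core file size; raising the hard limit may need root.
        resource.setrlimit(resource.RLIMIT_CORE,
                           (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

    def collect_core_dumps():
        # Where cores land depends on kernel.core_pattern; these globs are guesses.
        for pattern in ('/var/crash/*core*', '/var/run/openvswitch/*core*'):
            for core in glob.glob(pattern):
                shutil.copy(core, LOG_DIR)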
16:58:35 <ihrachys> manjeets: yeah, may be worth that
16:58:40 <manjeets> https://review.openstack.org/#/c/447628/
16:58:54 <ihrachys> manjeets: at least to understand if it's indeed the case of concurrency and load
16:59:06 <ihrachys> manjeets: I think it fails often enough to catch it with rechecks?
16:59:49 <manjeets> a recheck also depends on which cloud it gets placed on; I want to make sure concurrency is not the thing breaking it
17:00:11 <ihrachys> manjeets: ok let's follow up in gerrit
17:00:13 <ihrachys> we are at the top of the hour
17:00:15 <ihrachys> thanks everyone
17:00:17 <ihrachys> #endmeeting