16:00:23 <ihrachys> #startmeeting neutron_ci
16:00:24 <openstack> Meeting started Tue Mar 21 16:00:23 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 <openstack> The meeting name has been set to 'neutron_ci'
16:00:29 <manjeets> o/
16:00:31 <ihrachys> good day everyone
16:00:34 <jlibosva> o/
16:00:47 <dasm> o/
16:00:48 <dasanind> o/
16:00:58 * ihrachys waves at mlavalle
16:01:09 <mlavalle> o/
16:01:42 * mlavalle waves back at ihrachys
16:01:57 <ihrachys> ok let's start with reviewing action items from prev meeting
16:02:03 <ihrachys> "ihrachys fix e-r bot not reporting in irc channel"
16:02:25 <ihrachys> didn't happen, I am a slug, will need to wrap into this week
16:02:28 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:02:38 <ihrachys> next is "haleyb and mlavalle to investigate what makes dvr gate job failing with 25% rate"
16:02:45 <mlavalle> hi
16:02:49 <ihrachys> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:03:02 <mlavalle> I've been digging in Kibana into this
16:03:40 <mlavalle> since it is in the check queue, it is laborious to separate failures caused by patchset and real failures
16:04:05 <mlavalle> for real failures, I don't have a conclusive diagnostic yet
16:04:06 <ihrachys> any idea why gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial is not in grafana board at all?
16:04:33 <mlavalle> no^^^^
16:04:44 <mlavalle> maybe haleyb might speak to that
16:04:55 <mlavalle> I don't see him around, though
16:05:06 <ihrachys> oh because it's gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv now
16:05:08 <ihrachys> note -nv
16:05:20 <mlavalle> ok cool
16:05:26 <ihrachys> #action ihrachys to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv
16:05:45 <ihrachys> mlavalle: it's hard to say if there is any trend beyond regular check queue issues
16:06:13 <mlavalle> of the real failures that I have analyzed, the most common case is {u'code': 500, u'message': u'No valid host was found. There are not enough hosts available.'
16:06:41 <ihrachys> which is usually because scheduler failed to talk to libvirtd isn't it?
16:06:53 <mlavalle> Yeah, I think so
16:07:05 <mlavalle> I haven't seen enough evidence yet, though
16:07:16 <mlavalle> I would like to watch this a few more days
16:07:38 <mlavalle> I just wanted to share with the team the trend that I am seeing
16:07:59 <ihrachys> ok let's sync the next week when the board is hopefully fixed
16:08:05 <mlavalle> cool
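[Editor's note, on the #action at 16:05:26 to add gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv to the grafana board: the neutron-failure-rate dashboard plots job counters that Zuul publishes to graphite.openstack.org, so a quick sanity check before editing the dashboard definition is to query the renamed job's counters directly. The sketch below assumes the usual stats_counts.zuul.pipeline.<pipeline>.job.<job>.<RESULT> metric layout; the metric path and time window are assumptions, not something confirmed in the meeting.]

```python
# Hedged sketch: confirm the renamed -nv job reports data to graphite and
# compute a rough failure rate for it.  The metric path below is an assumed
# layout (stats_counts.zuul.pipeline.<pipeline>.job.<job>.<RESULT>), not a
# name verified in this meeting.
import json
import urllib.request

GRAPHITE = 'http://graphite.openstack.org/render'
JOB = 'gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv'


def counter_sum(result, pipeline='check', window='-7days'):
    """Sum a SUCCESS/FAILURE counter for JOB over the given window."""
    target = 'stats_counts.zuul.pipeline.%s.job.%s.%s' % (pipeline, JOB, result)
    url = '%s?target=%s&from=%s&format=json' % (GRAPHITE, target, window)
    with urllib.request.urlopen(url) as resp:
        series = json.loads(resp.read().decode('utf-8'))
    if not series:  # no datapoints at all: the job name is probably wrong
        return 0
    return sum(v for v, _ts in series[0]['datapoints'] if v)


if __name__ == '__main__':
    failures = counter_sum('FAILURE')
    total = failures + counter_sum('SUCCESS')
    rate = 100.0 * failures / total if total else 0.0
    print('%s: %d runs, %.1f%% failed' % (JOB, total, rate))
```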
16:08:13 <ihrachys> ok next was "ihrachys explore why bug 1532809 bubbled into top in e-r"
16:08:13 <openstack> bug 1532809 in OpenStack Compute (nova) liberty "Gate failures when DHCP lease cannot be acquired" [High,In progress] https://launchpad.net/bugs/1532809 - Assigned to Sean Dague (sdague)
16:08:15 <mlavalle> I'll keep watching this over the next few days
16:08:41 <ihrachys> I checked logs, and it was because ODL gate was triggering it, it's just their gate setup issue I believe
16:09:01 <ihrachys> overall the query seems to me rather generic: https://review.openstack.org/#/c/445596/
16:09:12 <ihrachys> hence delete request ^
16:09:25 <ihrachys> next was "jlibosva fix delete_dscp for native driver: https://review.openstack.org/445560"
16:09:45 <ihrachys> it's merged, yay. there are still issues in the job that bring us to the next item
16:09:52 <ihrachys> "jlibosva to fix remaining fullstack failures in securitygroups for linuxbridge"
16:09:58 <ihrachys> jlibosva: progress there?
16:10:10 <jlibosva> I attempted to make it work then I was pointed out kevinbenton started similar patch
16:10:13 <ihrachys> is it conntrack issue reintroduced by kevinbenton lately?
16:10:21 <jlibosva> yes
16:10:39 <jlibosva> the thing is that iptables driver in linuxbridge agent doesn't remove conntracks due to missing zones
16:10:40 <jlibosva> https://review.openstack.org/#/c/441353/
16:11:03 <jlibosva> conntracks == conntrack entries
16:11:21 <ihrachys> makes sense, gotta have a closer look, though it's failing in unit tests right now
16:11:38 <ihrachys> jlibosva: so we proved it fixes the failure?
16:11:58 <jlibosva> I don't think that patch is final.
16:12:06 <ihrachys> I see test_securitygroup(ovs-hybrid) is failing there
16:12:09 <jlibosva> I'll try to support kevin or take over if needed
16:12:16 <jlibosva> ihrachys: which means it fixes the LB one :)
16:12:28 <ihrachys> or that it breaks ovs?
16:12:45 <jlibosva> that's a side-effect, yes. It still needs some work to be done
16:14:07 <ihrachys> ok
16:14:08 <jlibosva> that's all I can tell about that
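[Editor's note on the linuxbridge fullstack discussion above: the failure mode jlibosva describes is that security group filters get removed but the corresponding conntrack entries survive, because the linuxbridge/iptables path does not delete them per conntrack zone. A minimal sketch of the kind of cleanup involved is below; the helper name and zone value are illustrative, and this is not the code from https://review.openstack.org/#/c/441353/.]

```python
# Hedged sketch of zone-scoped conntrack cleanup (illustrative only, not the
# actual patch under review): when a port's filters are removed, also flush
# its tracked connections, restricted to the port's conntrack zone so that
# unrelated flows stay untouched.  Needs root and the conntrack-tools CLI.
import subprocess


def delete_conntrack_entries(ip_address, zone, ethertype='ipv4'):
    """Delete conntrack entries for one address within a conntrack zone."""
    cmd = ['conntrack', '-D', '-f', ethertype, '-d', ip_address,
           '-w', str(zone)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # conntrack exits non-zero when there was simply nothing to delete,
    # so only treat other exit codes as real errors.
    if result.returncode not in (0, 1):
        raise RuntimeError('conntrack failed: %s' % result.stderr.strip())


# Hypothetical usage when a security group member is removed:
# delete_conntrack_entries('10.0.0.5', zone=4097)
```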
16:14:34 <ihrachys> I see test_controller_timeout_does_not_break_connectivity_sigkill bubbles up since we landed it
16:14:55 <jlibosva> yep, I reported a bug about that some time ago I think
16:14:56 * jlibosva searches
16:15:26 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1673531
16:15:26 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [Undecided,New]
16:15:32 <jlibosva> I haven't looked at it yet though
16:16:18 <ihrachys> jlibosva: I understand that fullstack is not voting but would probably make sense to tag those bugs as gate-failure nevertheless
16:16:31 <ihrachys> since they may indicate something actually broken
16:16:31 <jlibosva> ihrachys: alright
16:16:39 <ihrachys> and since we want it voting this time :)
16:16:50 <ihrachys> I added the tag
16:17:57 <ihrachys> ok next item was "anilvenkata to follow up on HA+DVR job patches"
16:18:15 <ihrachys> the patch is still up for review: https://review.openstack.org/#/c/383827/
16:18:33 <ihrachys> there were some back and forth comments, seems the patch is finally ready for merge
16:19:11 <ihrachys> clarkb: would be cool to see you revisit ^
16:19:34 <ihrachys> ok, next item was "jlibosva to figure out the plan for py3 gate transition and report back"
16:19:48 <jlibosva> this is still 'to be done'
16:20:15 <ihrachys> ok let's repeat it
16:20:15 <jlibosva> can you please move it to the next week?
16:20:21 <ihrachys> #action jlibosva to figure out the plan for py3 gate transition and report back
16:20:33 <ihrachys> next was "manjeets respin https://review.openstack.org/#/c/439114/ to include gate-failure reviews into existing dashboard"
16:20:43 <manjeets> ihrachys, done
16:20:44 <ihrachys> I see the patch was respinned yesterday: https://review.openstack.org/#/c/439114/
16:21:11 <manjeets> dashboard will look like https://tinyurl.com/lqdu3qo
16:21:16 <ihrachys> cool, let's review that and then chase infra to get it in
16:21:17 <manjeets> scroll all the way to end
16:21:42 <ihrachys> ack. I would like to see it higher the stack
16:21:47 <ihrachys> somewhere near Critical
16:22:03 <manjeets> that'll be one line change i believe I can do that
16:22:06 <ihrachys> without gate, we can't land anything, so it would make sense to give it a priority
16:22:11 <ihrachys> ack
16:22:54 <ihrachys> and... that's it for action items from the previous week
16:23:48 <ihrachys> #topic State of the Gate
16:23:57 <mlavalle> quite a few!
16:24:13 <ihrachys> mlavalle: as always, makes it easier to track things
16:24:17 <ihrachys> several breakages happened the prev week that we dealt with
16:24:40 <ihrachys> one was a hilarious breakage by os-log-merger integration that was hopefully unraveled by https://review.openstack.org/#/q/topic:fix-os-log-merger-crash
16:24:57 <ihrachys> turned out os-log-merger was not really too liberal about accepted output
16:25:20 <ihrachys> then there was a eventlet induced breakage from yesterday fixed by https://review.openstack.org/#/c/447817/
16:25:34 <ihrachys> there is a follow up for the patch here up for review: https://review.openstack.org/#/c/447896/1
16:26:31 <ihrachys> there is still a gate breakage by an OVO test up for review here: https://review.openstack.org/#/c/447600/
16:26:41 <ihrachys> it's not consistent, but may hit the gate sometimes
16:26:55 <ihrachys> since it's unit tests, we gotta make it stable before people are used to recheck
16:26:57 <ihrachys> :)
16:27:58 <ihrachys> any other fixes that we are aware that will fix some gate failures?
16:28:59 <jlibosva> not that I'm aware of
16:29:35 <ihrachys> cool
16:30:09 * manjeets remember oslo-log-merger crashing multiple jobs
16:30:19 <ihrachys> manjeets: that is fixed and discussed above
16:30:22 <ihrachys> #topic Gate failure bugs
16:30:28 <ihrachys> #link https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:30:39 <ihrachys> I am wondering about https://bugs.launchpad.net/neutron/+bug/1627106
16:30:39 <openstack> Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [High,In progress] - Assigned to Terry Wilson (otherwiseguy)
16:30:43 <ihrachys> has anyone seen it lately?
16:30:45 <manjeets> ihrachys, yea i noticed that from jenkins results
16:31:11 <ihrachys> the query is at http://status.openstack.org/elastic-recheck/#1627106
16:31:22 <ihrachys> and it shows 0 fails in 24 hrs / 0 fails in 10 days
16:31:30 <ihrachys> which apparently means that we squashed it somehow :)
16:31:50 <ihrachys> otherwiseguy: ^ are we in agreement?
16:32:02 <otherwiseguy> ihrachys, I *hope* so :)
16:32:15 <ihrachys> ok let's close the bug and see if it shows up again
16:32:44 * otherwiseguy dances
16:32:48 <ihrachys> nice job everyone and especially otherwiseguy
16:33:26 <manjeets> otherwiseguy, \o/
16:33:54 <ihrachys> I also believe we can close https://bugs.launchpad.net/neutron/+bug/1632290 now that metadata-proxy is rewritten to haproxy?
16:33:54 <openstack> Launchpad bug 1632290 in neutron "RuntimeError: Metadata proxy didn't spawn" [Medium,Triaged]
16:33:55 <jlibosva> otherwiseguy rocks
16:34:23 <otherwiseguy> jlibosva misspelled "openstack"
16:35:20 <ihrachys> the only place the metadata logstash query hits is this patch https://review.openstack.org/#/c/401086/
16:35:22 <ihrachys> which is all red
16:35:30 <ihrachys> so it's not unexpected that metadata is also borked there
16:35:34 <ihrachys> I will close the bug
16:37:44 <ihrachys> I am looking at https://bugs.launchpad.net/neutron/+bug/1673780 and wonder why we track vpnaas bugs in neutron component
16:37:44 <openstack> Launchpad bug 1673780 in neutron "vpnaas import error for agent config" [Undecided,In progress] - Assigned to YAMAMOTO Takashi (yamamoto)
16:37:49 <ihrachys> it's not even stadium participant
16:37:58 <manjeets> ihrachys, would it make sense to have gate failures at Top before RFE's ?
16:38:01 <manjeets> https://tinyurl.com/lqdu3qo
16:38:11 <ihrachys> manjeets: yeah, probably
16:38:27 <ihrachys> #action ihrachys to figure out why we track vpnaas bugs under neutron component
16:39:47 <ihrachys> another bug of interest is this: https://bugs.launchpad.net/neutron/+bug/1674517 and there is a patch from dasanind for that: https://review.openstack.org/#/c/447781/
16:39:47 <openstack> Launchpad bug 1674517 in neutron "pecan missing custom tenant_id policy project_id matching" [High,In progress] - Assigned to Anindita Das (anindita-das)
16:40:05 <ihrachys> as you prolly know we switched to pecan lately and now squash bugs that pop up
16:40:54 <mlavalle> yeap, they are going to show up now
16:41:22 <mlavalle> but it's the right time to do it, early in the cycle :-)
16:41:51 <manjeets> ihrachys, it will be now like https://tinyurl.com/mj2uyw5
16:42:18 <ihrachys> manjeets: +
16:43:26 <ihrachys> there were several segfaults of ovs vswitchd component
16:43:43 <ihrachys> see https://bugs.launchpad.net/neutron/+bug/1669900
16:43:43 <openstack> Launchpad bug 1669900 in neutron "ovs-vswitchd crashed in functional test with segmentation fault" [Medium,Confirmed]
16:43:48 <ihrachys> and https://bugs.launchpad.net/neutron/+bug/1672607
16:43:48 <openstack> Launchpad bug 1672607 in neutron "test_arp_spoof_doesnt_block_normal_traffic fails with AttributeError: 'NoneType' object has no attribute 'splitlines'" [High,Confirmed]
16:44:07 <ihrachys> I don't think we can do much about segfaults per se (we don't maintain ovs)
16:44:30 <ihrachys> and afaik in functional job, we don't even compile ovs from source
16:44:32 <ihrachys> right?
16:44:39 <jlibosva> right
16:44:47 <ihrachys> but the second bug is a bit two sides
16:44:54 <ihrachys> one, yes, it's a segfault
16:45:09 <ihrachys> but that should not explain why our code fails with AttributeError
16:45:42 <ihrachys> we should be ready that ovs crashes, and then we may get None from dump-flows (or whatever is called there)
16:46:05 <ihrachys> I was thinking that latter bug should be about making the code more paranoid about ovs state
16:46:45 <ihrachys> it's ofctl though, which is not default driver, so maybe not of great criticality
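[Editor's note on bug 1672607 discussed just above: the AttributeError comes from calling .splitlines() on a flow dump that came back as None after ovs-vswitchd died. A hedged sketch of the "more paranoid" handling ihrachys suggests is below; the function names are illustrative and this is not the actual neutron ofctl code.]

```python
# Hedged sketch (illustrative, not the real neutron code): guard the ofctl
# flow-dump path so a crashed ovs-vswitchd surfaces as a clear error instead
# of "'NoneType' object has no attribute 'splitlines'".


class OVSUnavailableError(RuntimeError):
    """Raised when no flow dump could be obtained from ovs-vswitchd."""


def parse_flow_dump(raw_dump):
    """Split an ovs-ofctl dump-flows blob into individual flow lines.

    raw_dump is whatever the ofctl helper returned; None or an empty string
    means the switch did not answer (e.g. the daemon segfaulted mid-test).
    """
    if not raw_dump:
        raise OVSUnavailableError(
            'no flow dump returned; is ovs-vswitchd still running?')
    return [line.strip() for line in raw_dump.splitlines() if line.strip()]
```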
16:47:19 <ihrachys> otherwiseguy: thoughts on ovs segfaults?
16:48:39 <otherwiseguy> ihrachys, I haven't looked at the crash bugs unfortunately (other than a cursory "oh, ovs segfaulted. that's bad".
16:48:41 <otherwiseguy> )
16:48:51 <mlavalle> lol
16:49:02 <ihrachys> :)
16:49:27 <ihrachys> I don't recollect segfaults in times when we were compiling ovs
16:49:29 <otherwiseguy> I'd love for a vm snapshot to be stashed away in that case that we could play around with the core dump etc., but ...
16:49:40 <ihrachys> otherwiseguy: yeah
16:49:49 <ihrachys> otherwiseguy: I don't think we collect core dumps
16:49:56 <ihrachys> otherwiseguy: just dumps would be helpful
16:50:02 <ihrachys> no need for the whole vm state
16:50:26 <otherwiseguy> at least a thread apply all bt full output or something
16:50:55 <ihrachys> maybe it's just oversight.
16:50:59 <ihrachys> anyone want to chase infra about why we don't collect core dumps? :)
16:51:19 <ihrachys> I suspect it may be a concern of dump size
16:51:52 <otherwiseguy> are core dumps really that useful without the environment they were created?
16:52:17 <ihrachys> otherwiseguy: you have config and logs; you can load debuginfo into gdb to get symbols
16:52:37 <otherwiseguy> if you have *exactly* the same libraries, etc.
16:52:53 <ihrachys> we have exact rpm version numbers
16:53:13 <ihrachys> I don't think it moves so quick you can't reproduce the environment as in gate
16:53:32 <ihrachys> in the end, it will be glibc and ovs-whatever-lib and some more, but not whole system
16:54:03 <ihrachys> btw .rpm -> .deb above, it's xenial
16:54:37 <ihrachys> ok we gotta figure out something, at least patching neutron so that not to crash so hard
16:55:03 <ihrachys> I don't see any more bugs in the list that may be worth discussing right now
16:55:07 <ihrachys> #topic Open discussion
16:55:39 <manjeets> ihrachys, even though it is related with upgrades but somehow ci
16:55:42 <ihrachys> https://review.openstack.org/#/q/topic:new-neutron-devstack-in-gate is slowly progressing, if you have spare review cycles, please chime in, there are some pieces in neutron
16:55:49 <ihrachys> manjeets: shoot
16:55:59 <clarkb> I think ideally people would be replicating locally for the vast majority of failures. Then for those that fail to reproduce we figure out grabbing a core dump specifically for that
16:56:09 <manjeets> grenade-multinode-lb is mostly failing on one kind of cloud
16:56:13 <clarkb> has anyone tried running ./reproduce.sh and then hitting ovs until it segfaults/
16:56:18 <ihrachys> clarkb: it happened like twice in last weeks
16:56:34 <manjeets> on which instance brings up time out according to me
16:56:49 <manjeets> it passes on ovh, osic clouds but fails on rax
16:56:49 <clarkb> ihrachys: ah ok so ya in that case I think you'd set it up to handle that specific case
16:57:31 <ihrachys> clarkb: so we would cp /var/run/*.core or wherever they are stored into logdir?
16:57:59 <ihrachys> ideally like /var/run/ovs-vswitchd*.core or smth
16:58:13 <manjeets> I am experimenting tempest_concurrency=2, which 4 by default on rax if that is causing failure on rax node for failure over smoke tests
16:58:26 <clarkb> ihrachys: ya, you may also need to set ulimit(s)
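[Editor's note on the core-dump idea clarkb and ihrachys discuss above: one way to act on it is to point kernel.core_pattern at a known directory before the run and copy any ovs-vswitchd cores into the job's log directory afterwards, so they get uploaded with the rest of the logs. The sketch below is a set of assumptions, not an existing infra script; the directories are placeholders, and as clarkb notes the daemon itself must also run with an unlimited core size (ulimit/LimitCORE in its service definition), which this snippet does not handle.]

```python
# Hedged sketch (assumed paths, not an existing infra script): make core
# files land somewhere predictable, then stash any ovs-vswitchd cores with
# the job logs so they are uploaded for post-mortem gdb analysis.
import glob
import os
import shutil

CORE_DIR = '/var/crash'                  # assumed drop directory
LOG_DIR = os.path.expanduser('~/logs')   # whatever the job publishes


def enable_core_dumps():
    """Write a predictable core_pattern (requires root)."""
    with open('/proc/sys/kernel/core_pattern', 'w') as f:
        f.write(os.path.join(CORE_DIR, 'core.%e.%p'))


def collect_ovs_cores():
    """Copy any ovs-vswitchd core files into the log directory."""
    os.makedirs(LOG_DIR, exist_ok=True)
    for core in glob.glob(os.path.join(CORE_DIR, 'core.ovs-vswitchd.*')):
        shutil.copy(core, LOG_DIR)
```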
16:58:35 <ihrachys> manjeets: yeah, may be worth that
16:58:40 <manjeets> https://review.openstack.org/#/c/447628/
16:58:54 <ihrachys> manjeets: at least to understand if it's indeed the case of concurrency and load
16:59:06 <ihrachys> manjeets: I think it fails rather often to catch it with rechecks?
16:59:49 <manjeets> recheck also depends on which cloud it get placed on I want to make sure if concurrency is not one breaking things
17:00:11 <ihrachys> manjeets: ok let's follow up in gerrit
17:00:13 <ihrachys> we are at the top of the hour
17:00:15 <ihrachys> thanks everyone
17:00:17 <ihrachys> #endmeeting