16:00:23 #startmeeting neutron_ci
16:00:24 Meeting started Tue Mar 21 16:00:23 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 The meeting name has been set to 'neutron_ci'
16:00:29 o/
16:00:31 good day everyone
16:00:34 o/
16:00:47 o/
16:00:48 o/
16:00:58 * ihrachys waves at mlavalle
16:01:09 o/
16:01:42 * mlavalle waves back at ihrachys
16:01:57 ok let's start with reviewing action items from prev meeting
16:02:03 "ihrachys fix e-r bot not reporting in irc channel"
16:02:25 didn't happen, I am a slug, will need to wrap into this week
16:02:28 #action ihrachys fix e-r bot not reporting in irc channel
16:02:38 next is "haleyb and mlavalle to investigate what makes dvr gate job failing with 25% rate"
16:02:45 hi
16:02:49 #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:03:02 I've been digging in Kibana into this
16:03:40 since it is in the check queue, it is laborious to separate failures caused by patchset and real failures
16:04:05 for real failures, I don't have a conclusive diagnostic yet
16:04:06 any idea why gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial is not in grafana board at all?
16:04:33 no^^^^
16:04:44 maybe haleyb might speak to that
16:04:55 I don't see him around, though
16:05:06 oh because it's gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv now
16:05:08 note -nv
16:05:20 ok cool
16:05:26 #action ihrachys to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv
16:05:45 mlavalle: it's hard to say if there is any trend beyond regular check queue issues
16:06:13 of the real failures that I have analyzed, the most common case is {u'code': 500, u'message': u'No valid host was found. There are not enough hosts available.'
16:06:41 which is usually because scheduler failed to talk to libvirtd isn't it?
16:06:53 Yeah, I think so
16:07:05 I haven't seen enough evidence yet, though
16:07:16 I would like to watch this a few more days
16:07:38 I just wanted to share with the team the trend that I am seeing
16:07:59 ok let's sync the next week when the board is hopefully fixed
16:08:05 cool
16:08:13 ok next was "ihrachys explore why bug 1532809 bubbled into top in e-r"
16:08:13 bug 1532809 in OpenStack Compute (nova) liberty "Gate failures when DHCP lease cannot be acquired" [High,In progress] https://launchpad.net/bugs/1532809 - Assigned to Sean Dague (sdague)
16:08:15 I'll keep watching this over the next few days
16:08:41 I checked logs, and it was because ODL gate was triggering it, it's just their gate setup issue I believe
16:09:01 overall the query seems to me rather generic: https://review.openstack.org/#/c/445596/
16:09:12 hence delete request ^
16:09:25 next was "jlibosva fix delete_dscp for native driver: https://review.openstack.org/445560"
16:09:45 it's merged, yay. there are still issues in the job that bring us to the next item
16:09:52 "jlibosva to fix remaining fullstack failures in securitygroups for linuxbridge"
16:09:58 jlibosva: progress there?
16:10:10 I attempted to make it work then I was pointed out kevinbenton started similar patch
16:10:13 is it conntrack issue reintroduced by kevinbenton lately?
16:10:21 yes
16:10:39 the thing is that iptables driver in linuxbridge agent doesn't remove conntracks due to missing zones
16:10:40 https://review.openstack.org/#/c/441353/
16:11:03 conntracks == conntrack entries
16:11:21 makes sense, gotta have a closer look, though it's failing in unit tests right now
16:11:38 jlibosva: so we proved it fixes the failure?
16:11:58 I don't think that patch is final.
16:12:06 I see test_securitygroup(ovs-hybrid) is failing there
16:12:09 I'll try to support kevin or take over if needed
16:12:16 ihrachys: which means it fixes the LB one :)
16:12:28 or that it breaks ovs?
16:12:45 that's a side-effect, yes. It still needs some work to be done
16:14:07 ok
16:14:08 that's all I can tell about that
16:14:34 I see test_controller_timeout_does_not_break_connectivity_sigkill bubbles up since we landed it
16:14:55 yep, I reported a bug about that some time ago I think
16:14:56 * jlibosva searches
16:15:26 https://bugs.launchpad.net/neutron/+bug/1673531
16:15:26 Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [Undecided,New]
16:15:32 I haven't looked at it yet though
16:16:18 jlibosva: I understand that fullstack is not voting but would probably make sense to tag those bugs as gate-failure nevertheless
16:16:31 since they may indicate something actually broken
16:16:31 ihrachys: alright
16:16:39 and since we want it voting this time :)
16:16:50 I added the tag
16:17:57 ok next item was "anilvenkata to follow up on HA+DVR job patches"
16:18:15 the patch is still up for review: https://review.openstack.org/#/c/383827/
16:18:33 there were some back and forth comments, seems the patch is finally ready for merge
16:19:11 clarkb: would be cool to see you revisit ^
16:19:34 ok, next item was "jlibosva to figure out the plan for py3 gate transition and report back"
16:19:48 this is still 'to be done'
16:20:15 ok let's repeat it
16:20:15 can you please move it to the next week?
16:20:21 #action jlibosva to figure out the plan for py3 gate transition and report back
16:20:33 next was "manjeets respin https://review.openstack.org/#/c/439114/ to include gate-failure reviews into existing dashboard"
16:20:43 ihrachys, done
16:20:44 I see the patch was respinned yesterday: https://review.openstack.org/#/c/439114/
16:21:11 dashboard will look like https://tinyurl.com/lqdu3qo
16:21:16 cool, let's review that and then chase infra to get it in
16:21:17 scroll all the way to end
16:21:42 ack. I would like to see it higher the stack
16:21:47 somewhere near Critical
16:22:03 that'll be one line change i believe I can do that
16:22:06 without gate, we can't land anything, so it would make sense to give it a priority
16:22:11 ack
16:22:54 and... that's it for action items from the previous week
16:23:48 #topic State of the Gate
16:23:57 quite a few!
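
[Editor's note: for context on the linuxbridge conntrack discussion above (https://review.openstack.org/#/c/441353/), here is a minimal sketch of what zone-scoped conntrack cleanup looks like. The helper name and parameters are illustrative, not the actual patch.]

    # Illustrative sketch only -- not the code under review in
    # https://review.openstack.org/#/c/441353/. When a security group rule is
    # removed, already-established connections keep matching kernel conntrack
    # state, so the agent has to delete the stale entries. The deletion must
    # be scoped to the port's conntrack zone (conntrack-tools -w/--zone) so
    # that other ports' connections stay untouched; the missing zone
    # bookkeeping in the linuxbridge iptables driver is what is discussed
    # above.
    import subprocess

    def delete_conntrack_entries(ip_address, zone):
        """Drop conntrack entries for ip_address, limited to one zone."""
        # Hypothetical helper; assigning a zone per port is the part the
        # real patch has to take care of.
        subprocess.call(
            ['conntrack', '-D', '-d', ip_address, '-w', str(zone)])
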
16:24:13 mlavalle: as always, makes it easier to track things
16:24:17 several breakages happened the prev week that we dealt with
16:24:40 one was a hilarious breakage by os-log-merger integration that was hopefully unraveled by https://review.openstack.org/#/q/topic:fix-os-log-merger-crash
16:24:57 turned out os-log-merger was not really too liberal about accepted output
16:25:20 then there was a eventlet induced breakage from yesterday fixed by https://review.openstack.org/#/c/447817/
16:25:34 there is a follow up for the patch here up for review: https://review.openstack.org/#/c/447896/1
16:26:31 there is still a gate breakage by an OVO test up for review here: https://review.openstack.org/#/c/447600/
16:26:41 it's not consistent, but may hit the gate sometimes
16:26:55 since it's unit tests, we gotta make it stable before people are used to recheck
16:26:57 :)
16:27:58 any other fixes that we are aware that will fix some gate failures?
16:28:59 not that I'm aware of
16:29:35 cool
16:30:09 * manjeets remember oslo-log-merger crashing multiple jobs
16:30:19 manjeets: that is fixed and discussed above
16:30:22 #topic Gate failure bugs
16:30:28 #link https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:30:39 I am wondering about https://bugs.launchpad.net/neutron/+bug/1627106
16:30:39 Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [High,In progress] - Assigned to Terry Wilson (otherwiseguy)
16:30:43 has anyone seen it lately?
16:30:45 ihrachys, yea i noticed that from jenkins results
16:31:11 the query is at http://status.openstack.org/elastic-recheck/#1627106
16:31:22 and it shows 0 fails in 24 hrs / 0 fails in 10 days
16:31:30 which apparently means that we squashed it somehow :)
16:31:50 otherwiseguy: ^ are we in agreement?
16:32:02 ihrachys, I *hope* so :)
16:32:15 ok let's close the bug and see if it shows up again
16:32:44 * otherwiseguy dances
16:32:48 nice job everyone and especially otherwiseguy
16:33:26 otherwiseguy, \o/
16:33:54 I also believe we can close https://bugs.launchpad.net/neutron/+bug/1632290 now that metadata-proxy is rewritten to haproxy?
16:33:54 Launchpad bug 1632290 in neutron "RuntimeError: Metadata proxy didn't spawn" [Medium,Triaged]
16:33:55 otherwiseguy rocks
16:34:23 jlibosva misspelled "openstack"
16:35:20 the only place the metadata logstash query hits is this patch https://review.openstack.org/#/c/401086/
16:35:22 which is all red
16:35:30 so it's not unexpected that metadata is also borked there
16:35:34 I will close the bug
16:37:44 I am looking at https://bugs.launchpad.net/neutron/+bug/1673780 and wonder why we track vpnaas bugs in neutron component
16:37:44 Launchpad bug 1673780 in neutron "vpnaas import error for agent config" [Undecided,In progress] - Assigned to YAMAMOTO Takashi (yamamoto)
16:37:49 it's not even stadium participant
16:37:58 ihrachys, would it make sense to have gate failures at Top before RFE's ?
16:38:01 https://tinyurl.com/lqdu3qo
16:38:11 manjeets: yeah, probably
16:38:27 #action ihrachys to figure out why we track vpnaas bugs under neutron component
16:39:47 another bug of interest is this: https://bugs.launchpad.net/neutron/+bug/1674517 and there is a patch from dasanind for that: https://review.openstack.org/#/c/447781/
16:39:47 Launchpad bug 1674517 in neutron "pecan missing custom tenant_id policy project_id matching" [High,In progress] - Assigned to Anindita Das (anindita-das)
16:40:05 as you prolly know we switched to pecan lately and now squash bugs that pop up
16:40:54 yeap, they are going to show up now
16:41:22 but it's the right time to do it, early in the cycle :-)
16:41:51 ihrachys, it will be now like https://tinyurl.com/mj2uyw5
16:42:18 manjeets: +
16:43:26 there were several segfaults of ovs vswitchd component
16:43:43 see https://bugs.launchpad.net/neutron/+bug/1669900
16:43:43 Launchpad bug 1669900 in neutron "ovs-vswitchd crashed in functional test with segmentation fault" [Medium,Confirmed]
16:43:48 and https://bugs.launchpad.net/neutron/+bug/1672607
16:43:48 Launchpad bug 1672607 in neutron "test_arp_spoof_doesnt_block_normal_traffic fails with AttributeError: 'NoneType' object has no attribute 'splitlines'" [High,Confirmed]
16:44:07 I don't think we can do much about segfaults per se (we don't maintain ovs)
16:44:30 and afaik in functional job, we don't even compile ovs from source
16:44:32 right?
16:44:39 right
16:44:47 but the second bug is a bit two sides
16:44:54 one, yes, it's a segfault
16:45:09 but that should not explain why our code fails with AttributeError
16:45:42 we should be ready that ovs crashes, and then we may get None from dump-flows (or whatever is called there)
16:46:05 I was thinking that latter bug should be about making the code more paranoid about ovs state
16:46:45 it's ofctl though, which is not default driver, so maybe not of great criticality
16:47:19 otherwiseguy: thoughts on ovs segfaults?
16:48:39 ihrachys, I haven't looked at the crash bugs unfortunately (other than a cursory "oh, ovs segfaulted. that's bad".
16:48:41 )
16:48:51 lol
16:49:02 :)
16:49:27 I don't recollect segfaults in times when we were compiling ovs
16:49:29 I'd love for a vm snapshot to be stashed away in that case that we could play around with the core dump etc., but ...
16:49:40 otherwiseguy: yeah
16:49:49 otherwiseguy: I don't think we collect core dumps
16:49:56 otherwiseguy: just dumps would be helpful
16:50:02 no need for the whole vm state
16:50:26 at least a thread apply all bt full output or something
16:50:55 maybe it's just oversight.
16:50:59 anyone want to chase infra about why we don't collect core dumps? :)
16:51:19 I suspect it may be a concern of dump size
16:51:52 are core dumps really that useful without the environment they were created?
16:52:17 otherwiseguy: you have config and logs; you can load debuginfo into gdb to get symbols
16:52:37 if you have *exactly* the same libraries, etc.
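
[Editor's note: regarding "making the code more paranoid about ovs state" above (bug 1672607), here is a minimal sketch of the kind of guard that is meant, assuming an ofctl-style flow dump helper that can return None once ovs-vswitchd has died. The names are illustrative, not actual neutron code.]

    # Illustrative only: the ofctl-based agent code effectively calls
    # .splitlines() on the output of a flow dump; if ovs-vswitchd segfaulted,
    # the dump can come back as None and the test then fails with
    # "AttributeError: 'NoneType' object has no attribute 'splitlines'".
    # A more paranoid version surfaces a meaningful error instead.

    class OVSNotRespondingError(RuntimeError):
        pass

    def parse_flows(raw_dump):
        if raw_dump is None:
            # ovs returned nothing -- most likely the daemon is gone; raise a
            # clear error rather than an AttributeError deep in parsing code.
            raise OVSNotRespondingError(
                "flow dump returned no output; is ovs-vswitchd running?")
        return [line.strip() for line in raw_dump.splitlines() if line.strip()]
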
16:52:53 we have exact rpm version numbers
16:53:13 I don't think it moves so quick you can't reproduce the environment as in gate
16:53:32 in the end, it will be glibc and ovs-whatever-lib and some more, but not whole system
16:54:03 btw .rpm -> .deb above, it's xenial
16:54:37 ok we gotta figure out something, at least patch neutron so that it doesn't crash so hard
16:55:03 I don't see any more bugs in the list that may be worth discussing right now
16:55:07 #topic Open discussion
16:55:39 ihrachys, even though it is related with upgrades but somehow ci
16:55:42 https://review.openstack.org/#/q/topic:new-neutron-devstack-in-gate is slowly progressing, if you have spare review cycles, please chime in, there are some pieces in neutron
16:55:49 manjeets: shoot
16:55:59 I think ideally people would be replicating locally for the vast majority of failures. Then for those that fail to reproduce we figure out grabbing a core dump specifically for that
16:56:09 grenade-multinode-lb is mostly failing on one kind of cloud
16:56:13 has anyone tried running ./reproduce.sh and then hitting ovs until it segfaults/
16:56:18 clarkb: it happened like twice in last weeks
16:56:34 on which instance bring-up times out, according to me
16:56:49 it passes on ovh, osic clouds but fails on rax
16:56:49 ihrachys: ah ok so ya in that case I think you'd set it up to handle that specific case
16:57:31 clarkb: so we would cp /var/run/*.core or wherever they are stored into logdir?
16:57:59 ideally like /var/run/ovs-vswitchd*.core or smth
16:58:13 I am experimenting with tempest_concurrency=2, which is 4 by default, to see if that is causing the failure on rax nodes over smoke tests
16:58:26 ihrachys: ya, you may also need to set ulimit(s)
16:58:35 manjeets: yeah, may be worth that
16:58:40 https://review.openstack.org/#/c/447628/
16:58:54 manjeets: at least to understand if it's indeed the case of concurrency and load
16:59:06 manjeets: I think it fails rather often to catch it with rechecks?
16:59:49 recheck also depends on which cloud it gets placed on; I want to make sure concurrency is not the thing breaking it
17:00:11 manjeets: ok let's follow up in gerrit
17:00:13 we are at the top of the hour
17:00:15 thanks everyone
17:00:17 #endmeeting
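
[Editor's note: on the core-dump collection idea discussed above (set ulimit, then copy /var/run/ovs-vswitchd*.core or similar into the log directory so a backtrace such as "thread apply all bt full" can later be pulled from it with gdb and matching debug symbols), here is a rough sketch of what a post-job collection step could look like. The paths and the log-directory argument are assumptions, not an existing devstack-gate hook.]

    # Rough sketch, not an existing infra/devstack-gate hook. Where cores end
    # up depends on kernel.core_pattern on the test node; /var/run and
    # /var/crash are only guesses used to illustrate the idea. The point of
    # keeping the core is to later load it into gdb with matching debug
    # symbols and grab e.g. "thread apply all bt full".
    import glob
    import resource
    import shutil

    def allow_core_dumps():
        # Equivalent of "ulimit -c unlimited" for this process and children.
        resource.setrlimit(resource.RLIMIT_CORE,
                           (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

    def collect_core_dumps(logdir, patterns=('/var/run/*.core', '/var/crash/*')):
        for pattern in patterns:
            for core in glob.glob(pattern):
                # Core files can be large; a real job would probably compress
                # them or cap how many are kept.
                shutil.copy(core, logdir)
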