16:00:50 #startmeeting neutron_ci
16:00:51 Meeting started Tue Jun 6 16:00:50 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys|afk. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:54 good day everyone
16:00:55 o/
16:00:56 The meeting name has been set to 'neutron_ci'
16:01:03 * ihrachys|afk waves at haleyb
16:01:20 * haleyb waves back
16:01:28 as usual, starting with actions from prev week
16:01:33 #topic Actions from prev week
16:01:51 first is "jlibosva to understand why instance failed to up networking in trunk conn test: https://review.openstack.org/#/c/462227/"
16:02:14 I sent a new PS today
16:02:26 yeah, still failing, though in a different way it seems
16:02:29 I suspect it was because the port security was disabled *after* instance booted
16:02:33 for linuxbridge
16:02:48 http://logs.openstack.org/27/462227/5/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/f85a4b2/testr_results.html.gz
16:02:54 so now I changed the approach to disable by default and after instances are up, it will enable for LB
16:02:56 lookng
16:04:08 I didn't test it with LB as I have ovs-agt only
16:04:38 ok
16:04:39 KeyError: 'port_security_enabled'
16:04:44 this really looks like port-sec not enabled
16:05:00 anyhoo, not a bother for the meeting I think
16:05:04 let's move on
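
(Note on the port-security ordering discussed above: for linuxbridge the idea is to have port security off on the port before the guest ever boots, rather than flipping it afterwards. A minimal, purely illustrative sketch of that pattern with python-neutronclient — the session, network id and variable names are assumed, and this is not the code in the review under discussion:)

    # Sketch only: create the port with port security off *before* booting
    # the instance, instead of disabling it after the guest is up.
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(session=keystone_session)  # assumed auth session

    port = neutron.create_port({'port': {
        'network_id': net_id,            # assumed network
        'port_security_enabled': False,  # off from the start
        'security_groups': [],           # must be empty when port security is off
    }})['port']

    # The server is then booted against this pre-created port, so the guest
    # never observes a change of port-security state after it is up, e.g.
    # nova.servers.create(..., nics=[{'port-id': port['id']}])
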
16:05:19 next is "jlibosva to fetch and categorize functional py3 failures"
16:05:35 I see smth in https://etherpad.openstack.org/p/py3-neutron-pike
16:05:42 so I categorized it to 12 failures: https://etherpad.openstack.org/p/py3-neutron-pike
16:06:37 nice
16:06:44 now we need to decide what to do with the list
16:07:06 considering that we are all full hand with stuff, maybe we can craft a request for action and send it to openstack-dev?
16:07:34 maybe also prioritizing them
16:08:03 some of those may look different but be the same issues, I would like to start where we are pretty sure those are unique
16:08:03 or in 'spare time' - we can write our name to the number and try to produce some patch
16:08:24 like one ovs firewall; one wsgi; one sqlfixture
16:08:37 then once those are tackled, we can revisit the results and see what's still there
16:08:44 what do you think?
16:09:19 I wanted to add that some failures might be related to 3rd party libraries not working with python3 - like ovsdbapp or ryu
16:09:32 as some failures occur only with these drivers
16:10:10 aha
16:10:23 well good news is I think we have links to their authors ;)
16:10:45 maybe worth pulling those people for failures we suspect are related to the libs
16:10:56 I am sure otherwiseguy will be able to help with ovsdbapp
16:11:05 and yamamoto should know whom to pull for ryu
16:11:16 jlibosva, are you up to craft the mail?
16:11:21 I haven't confirmed it's really there but maybe would be worth e.g. enable python3-functional for ovsdbapp
16:11:25 (assuming you think it's the right thing)
16:11:37 yeah, you can make me an AI
16:11:45 jlibosva, ovsdbapp has functional job?
16:11:50 ok
16:11:51 ihrachys: but not python3 flavor
16:11:56 or does it?
16:12:00 it didn't last time I checked
16:12:07 #action jlibosva to craft an email to openstack-dev@ with func-py3 failures and request for action
16:12:39 there is func job in ovsdbapp as can be seen in e.g. https://review.openstack.org/#/c/470441/
16:13:01 ihrachys: but that runs with python2
16:13:05 yeah I know
16:13:13 just saying there is a job that we could dup for py3
16:13:18 ah, ok
16:13:30 there is not much inside though afair :)
16:13:34 I would start with talking to Terry about it (maybe through same venue)
16:13:46 there should be no expectation we pull it all ourselves
16:14:58 ok let's move on, thanks for the work, good progress
16:15:06 next is "jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'"
16:15:11 boy you have stuff on the plate
16:15:26 oh, that didn't happen
16:15:31 cause I forgot
16:15:31 that's related to trunk test instability in fullstack job
16:15:42 ok lemme repeat the AI for the next week
16:15:48 #action jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'
16:15:55 I should stop trusting my memory
16:16:13 jlibosva, I usually create a trello card for each thing I say I will look at
16:16:33 doesn't guarantee I do, but at least it makes me conscious about it being on the plate
16:16:37 I did create two after meeting without checking the logs
16:16:56 I do right away, I don't trust myself :)
16:16:59 ok, next was "ihrachys to understand why functional job spiked on weekend"
16:17:46 so the spike (and current instability) is because of the job failing on one of clouds where the cloud uses same IP range as the job
16:17:57 the fix is https://review.openstack.org/#/c/469189/
16:18:12 which is switch to devstack-gate for the functional job (and fullstack while at it)
16:18:24 d-g knows the correct ip range to use for devstack
16:18:46 there is an issue with the switch right now, since fullstack doesn't use rootwrap, and sudo is disabled by d-g
16:19:36 aha, so that's why you want to use rootwrap in the test runner :)
16:19:37 well, we have one piece of rootwrap transition in already, for deployed resources: https://review.openstack.org/459110
16:19:46 but test runner needs that too
16:19:56 and the patch for that is at https://review.openstack.org/471097
16:19:59 jlibosva, yes :)
16:20:09 the patch is still failing, have to look at it
16:20:52 I will update the next week about progress, if it's not merged till then
16:21:20 #action ihrachys to update about functional/fullstack switch to devstack-gate and rootwrap
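
(Context for the rootwrap item above: devstack-gate locks down blanket passwordless sudo, so commands that the functional/fullstack test runner spawns as root have to go through neutron-rootwrap and be allowed by oslo.rootwrap filter files instead. A purely illustrative sketch of what such a filter file looks like — the actual entries are whatever the patches linked above introduce:)

    [Filters]
    # hypothetical entries for commands a test runner might need as root;
    # the real list lives in the rootwrap patches referenced above
    ip: IpFilter, ip, root
    ip_netns: IpNetnsExecFilter, ip, root
    ovs-vsctl: CommandFilter, ovs-vsctl, root

(The usual pattern is that the runner then executes commands via "sudo neutron-rootwrap /etc/neutron/rootwrap.conf <cmd>" rather than raw sudo, which fits the restricted sudoers setup d-g installs.)
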
16:21:26 ok next was "haleyb to monitor dvr+ha job and maybe replace existing dvr-multinode"
16:21:38 haleyb, how's the job feeling these days?
16:22:16 that dashboard is a mess, the job isn't perfect
16:23:06 haleyb, totally agreed about the dash
16:23:23 haleyb, not perfect as in higher failure rate?
16:24:02 ihrachys: it's close to the dvr-multinode job
16:24:17 maybe 5% higher
16:24:36 do we have a grasp of pressing issues there?
16:25:18 i don't think there's any dvr-specific failure from what i've looked at
16:26:06 this is just looking at the check queue jobs, the gate is clearly better since we don't push things in with failures
16:27:14 i will continue to watch it, wouldn't be comfortable changing it right now
16:28:00 ok. one thing that may help is going through let's say last 30 patches and see how it failed there. can give a clue where to look at to make it less scary.
16:28:20 if we don't know specific issues that hit it, we can't really make a progress towards enabling it
16:28:20 so
16:28:30 ok let's monitor/look at it and check next week
16:28:44 #action haleyb to continue looking at prospects of dvr+ha job
16:28:55 next in line was "ihrachys to talk to qa/keystone and maybe remove v3-only job"
16:29:07 I haven't done that, will hopefully find some time this week
16:29:09 #action ihrachys to talk to qa/keystone and maybe remove v3-only job
16:29:18 it's not very pressing
16:29:22 next was "haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim"
16:30:06 i am not done with that one, still need to look at all the configs for the jobs
16:30:54 take your time
16:31:01 I will hang it for the next
16:31:02 #action haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim
16:31:11 and these are all we had from prev meeting
16:31:16 #topic Grafana
16:31:22 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:53 one thing to note is ~12% failure rate *in gate* for unittests
16:32:09 not sure exactly, but can be https://review.openstack.org/#/c/469602/
16:32:20 I am still to get back to it to fix tests
16:32:45 if someone would like to take it over while I look at functional job that would be great
16:33:46 another thing to note is, linuxbridge job seems to be at horrible rate
16:33:50 30%?
16:33:54 and it's in gate
16:34:00 what's going on there?
16:34:22 http://status.openstack.org/elastic-recheck/data/integrated_gate.html will give you a list of the recent fails if you need to dig in
16:34:24 there was a failure in detaching vifs
16:34:26 i could look at those unit test failures for your review
16:34:42 mlavalle, are you aware of any tempest failures that could affect linuxbridge? vif detach nova, is it the bug?
16:34:48 haleyb, please do, thanks
16:34:52 ihrachys: https://bugs.launchpad.net/nova/+bug/1696006
16:34:54 Launchpad bug 1696006 in neutron "Libvirt fails to detach network interface with Linux bridge" [Critical,New]
16:35:00 ihrachys: we talked about that in neutron meeting, right?
16:35:11 haleyb: ihrachys connected later
16:35:46 yeah I suck. I can read the logs instead
16:35:53 so we think it's it?
16:36:02 I compared times where we bumped os-vif correlate with failure occurence
16:36:12 but I didn't find any patch in particular, I started looking at nova code
16:36:20 ihrachys: possible libvirt issue from what mlavalle saw - failure during port_delete causing this
16:36:28 is nova team aware of this pressing issue?
16:36:38 aware as in actively work on?
16:36:45 32 hits for 24h
16:36:53 i think he just filed bug last night
16:37:07 not sure, but mlavalle did a good triage and is looking at it
16:37:33 ok
16:37:52 mriedem, https://bugs.launchpad.net/nova/+bug/1696125 affects neutron gate a lot. can we bump priority on it?
16:37:53 Launchpad bug 1696125 in OpenStack Compute (nova) "Detach interface failed - Unable to detach from guest transient domain (pike)" [Medium,Confirmed]
16:39:56 I guess Matt is not avail
16:39:59 i'm here
16:40:09 ok
16:40:10 i'm always here for you ihar
16:40:12 :)
16:40:21 * ihrachys hugs mriedem
16:40:27 i've got some tabs open,
16:40:35 dealing with some other stuff atm and then that this afternoon
16:40:38 so what's about this bug? is it on the radar for nova?
16:40:49 yeah https://review.openstack.org/#/c/441204/6 needs to be updated
16:40:54 it's on my radar
16:41:06 no one else in nova probably is aware or cares
16:41:15 ok cool. I will add myself to reviewers to monitor progress.
16:41:25 thanks for caring
16:42:04 looking at other grafana dashboards, they are mostly ok-ish, or it's functional/fullstack/scenarios that we know about and already covered
16:42:35 moving to bugs
16:42:47 #topic Gate bugs
16:43:17 one thing that popped today is it seems like neutron broke tripleo pipeline
16:43:20 https://bugs.launchpad.net/tripleo/+bug/1696094
16:43:21 Launchpad bug 1696094 in tripleo "CI: ovb-ha promotion job fails with 504 gateway timeout, neutron-server create-subnet timing out" [Critical,Triaged]
16:43:38 as per logs, it seems like neutron-server serves a subnet create request for 2minutes+
16:43:50 and holds some locks for 60s+
16:44:25 I suspect it's something like eventlet interacting badly with workers. like a green thread not yielding
16:44:42 the ~60s is suspicious, it's same in all failure runs I looked at
16:44:54 do we monkey patch server? :)
16:45:14 jlibosva, we do, via neutron/common/eventlet_utils.py
16:45:26 which is called from neutron/cmd/eventlet/__init__.py
16:45:36 and neutron-server entrypoint is under it
16:45:58 there are some suspects https://review.openstack.org/#/c/471345/ and https://review.openstack.org/#/c/471357/
16:46:27 but really it's just a silly way to find late changes that seem related in some way :)
16:47:03 ideally someone would run with the bug from there, but I don't know of anyone actively working on it right now
16:48:36 ok I guess it may require some broader venue to advertise the issue
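
(A note on the monkey-patching discussion above: since neutron-server is eventlet-monkey-patched, any code path that runs for a long stretch without reaching a patched, yielding call — a pure-Python loop, a native call, work done while holding a lock — blocks every other green thread in that API worker for its full duration, which matches the "subnet create stuck for minutes while holding locks" shape in the tripleo logs. A standalone toy example of the effect, not Neutron code:)

    # Toy demonstration of a non-yielding green thread starving its
    # neighbours under eventlet monkey patching.
    import time
    import eventlet
    eventlet.monkey_patch()  # neutron-server does the equivalent via
                             # neutron/common/eventlet_utils.py (see above)

    def well_behaved():
        for _ in range(5):
            print("still responsive at %.1f" % time.time())
            time.sleep(1)  # patched sleep: yields back to the event loop

    def greedy():
        deadline = time.time() + 3
        while time.time() < deadline:
            pass  # busy loop: never yields, so nothing else in the worker runs

    worker = eventlet.spawn(well_behaved)
    eventlet.spawn(greedy)
    worker.wait()
    # While greedy() spins, well_behaved() prints nothing for ~3 seconds,
    # even though it only wanted to wake up once per second.
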
16:49:26 looking at the list of other bugs here: https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:50:19 it doesn't seem there is anything on the list that was not covered and a new issue
16:50:27 so let's move on
16:50:33 #topic Open discussion
16:50:45 anyone has anything to share? any concerns?
16:51:06 I saw the pep8 job failing - is it known issue?
16:51:20 I just saw it at one of patching in this meeting
16:51:31 jlibosva, link?
16:51:56 https://review.openstack.org/#/c/469602/
16:52:00 https://review.openstack.org/#/c/469189/
16:52:06 oops, that wasn't it
16:52:24 * haleyb knew it was one of ihar's patches
16:52:46 oh this. I just suck and uploaded a patch with a pep8 violation
16:52:49 nothing to look here :)
16:53:12 ihrachys: but you didn't touch the file it was complaining about
16:53:15 yeah
16:53:24 haleyb, it's based on another patch
16:53:30 that touches it
16:53:37 aaah
16:53:39 ok unless someone else has more to share, I call it a day in 30s
16:53:47 i'll call it lunch
16:53:54 :)
16:54:05 heh
16:54:14 #endmeeting
16:54:36 has I done smth wrong?
16:54:41 where is the bot?
16:54:59 #endmeeting