16:00:45 <ihrachys> #startmeeting neutron_ci
16:00:45 <openstack> Meeting started Tue Jul 18 16:00:45 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:46 <jlibosva> o/
16:00:48 <openstack> The meeting name has been set to 'neutron_ci'
16:00:57 <ihrachys> we'll try to be quick
16:00:58 <ihrachys> #topic Actions from prev week
16:01:10 <ihrachys> first was "jlibosva to reach out to Victor Stinner about eventlet/py3 issue with functional tests"
16:01:23 <jlibosva> I did
16:01:35 <jlibosva> he was able to get some data from my environment
16:01:42 <jlibosva> and produced a simple reproducer
16:01:48 <jlibosva> http://paste.alacon.org/44101
16:01:58 <ihrachys> aha
16:02:04 <ihrachys> since it doesn't import from openstack...
16:02:06 <ihrachys> eventlet bug?
16:02:24 <jlibosva> there seems to be a difference in python3 signal handling that breaks the eventlet loop
16:02:27 <jlibosva> yes, eventlet bug
16:02:46 <jlibosva> but the fix might be complicated as they will need to redesign signal handling in eventlet
16:02:48 <jlibosva> or
16:02:54 <jlibosva> victor came up with the following fix
16:03:00 <jlibosva> http://paste.alacon.org/44102
16:03:45 <jlibosva> we also got help from sileth, who suggested making a workaround in oslo.service
16:03:56 <jlibosva> next step is to report a bug (if not done already)
16:04:02 <jlibosva> but it seems like it got traction
16:04:11 <jlibosva> maybe the self_pipe will be a way to fix this
16:04:15 <ihrachys> I guess it will be both oslo and eventlet bugs to report?
16:04:25 <jlibosva> technically oslo doesn't do anything wrong
16:04:49 <ihrachys> yeah, but for a workaround it may make sense, no?
16:05:07 <ihrachys> bumping the minimum eventlet version is always a pain
16:05:27 <jlibosva> hmm
16:05:45 <jlibosva> then maybe an oslo workaround would be easier, we'll see how the fix will go
16:06:21 <ihrachys> ok, so the next step from your side is to report bugs
16:06:40 <ihrachys> #action jlibosva to report bugs for eventlet and maybe oslo.service for eventlet signal/timer issue with py3
16:06:49 <ihrachys> great dig it seems
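(The pastes above are not reproduced in the log. Purely as an illustration of the self_pipe idea jlibosva mentions, here is a minimal Python sketch of the classic self-pipe pattern: the signal handler only writes a byte to a pipe, and a green thread waiting on the read end does the real work, so the eventlet hub is woken reliably under Python 3. All names below are hypothetical; this is not the content of either paste nor the actual proposed fix.)

```python
# Illustrative sketch of the "self-pipe" idea discussed above.
import os
import signal

import eventlet
from eventlet import hubs

read_end, write_end = os.pipe()


def _handler(signum, frame):
    # Keep the signal handler tiny: just poke the pipe.
    os.write(write_end, b"\0")


signal.signal(signal.SIGTERM, _handler)


def _wait_for_signal():
    # Green-wait until the read end is readable; this wakes the eventlet
    # hub even when the signal itself does not interrupt the poll call.
    hubs.trampoline(read_end, read=True)
    os.read(read_end, 1)
    print("signal received, shutting down")


# Deliver SIGTERM to ourselves after a second to exercise the path.
eventlet.spawn_after(1, os.kill, os.getpid(), signal.SIGTERM)
eventlet.spawn(_wait_for_signal).wait()
```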
16:07:11 <ihrachys> next item was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:07:43 <jlibosva> ah, didn't do. maybe I could just push what I currently have :)
16:07:58 <ihrachys> it may be good to see the direction
16:08:08 <ihrachys> I will repeat the item
16:08:19 <ihrachys> #action jlibosva to post patch splitting OVN from OvsVenvFixture
16:08:29 <ihrachys> next was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:08:33 <ihrachys> I didn't do :-x
16:08:43 <ihrachys> oops
16:08:46 <jlibosva> yeah, I didn't do that either
16:08:53 <ihrachys> I meant "ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases"
16:09:03 <ihrachys> #action ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases
16:09:20 <ihrachys> that kinda gets pushed down in my list because it's non-voting :p
16:09:29 <ihrachys> next was "haleyb to split grafana check dashboard into grenade and tempest charts"
16:09:54 <ihrachys> I believe it's https://review.openstack.org/#/c/483119/
16:10:22 <ihrachys> next was "haleyb to continue looking at dvr-ha job failure rate and reasons"
16:10:38 <ihrachys> I see there is this patch: https://review.openstack.org/#/c/483600/ (WIP)
16:11:06 <ihrachys> also, Brian sent an email to openstack-dev@
16:11:40 <ihrachys> #link http://lists.openstack.org/pipermail/openstack-dev/2017-July/119743.html
16:11:43 <ihrachys> no replies so far
16:12:18 <ihrachys> let's chime in there, and also maybe ping some key folks
16:12:53 <ihrachys> #action haleyb to collect feedback of key contributors on multinode-by-default patch
16:13:07 <ihrachys> #action everyone to chime in on haleyb's thread on multinode switch
16:13:24 <ihrachys> next was "haleyb to clean up old trusty charts from grafana"
16:14:22 <ihrachys> I believe it magically happened by virtue of the general trusty cleanup
16:14:38 <ihrachys> so no patches from Brian, but it is done nevertheless
16:14:55 <ihrachys> next was "haleyb to spin up a ML discussion on replacing single node grenade job with multinode in integrated gate"
16:15:04 <ihrachys> ok, THAT one was for the email thread I mentioned above
16:15:10 <ihrachys> but those topics are interrelated
16:15:17 <ihrachys> and it seems Brian is not here
16:15:32 <ihrachys> so we will follow up with the thread and see where it leads us
16:15:40 <ihrachys> next was "haleyb to continue looking at places to reduce the number of jobs"
16:15:57 <ihrachys> kinda an open-ended action, probably wasn't worth existence in the first place :)
16:16:10 <ihrachys> again, we will see where more specific actions lead us
16:16:16 <ihrachys> next was "ihrachys to complete triage of latest functional test failures that result in 30% failure rate"
16:16:34 <ihrachys> I did some triaging of all failures since the last meeting for the functional gate (not the check queue)
16:16:41 <ihrachys> this is the result:
16:16:51 <ihrachys> #link https://etherpad.openstack.org/p/neutron-functional-gate-failures-july Functional Gate failures
16:17:11 <ihrachys> it's basically timeouts, and tester threads running firewall test cases dying
16:17:18 <ihrachys> which may actually be the same
16:17:37 <ihrachys> when a tester thread is dying, we just see 'Killed' in the console log
16:17:44 <ihrachys> nothing in the per-test case log
16:17:46 <ihrachys> or in syslog
16:18:01 <ihrachys> it's suspicious that it's almost always firewall test cases
16:18:13 * ihrachys wonders if it's smth wrong with the test class
16:18:25 <jlibosva> one thing that comes to mind is that the search for a pid to kill (like nc) is malfunctioning
16:18:38 <ihrachys> or is it just so huge that the chance of triggering it there is high?
16:18:41 <jlibosva> and since nc dies, it picks a wrong pid
16:18:58 <ihrachys> jlibosva, could it be some other thread kills the current thread somehow?
16:19:03 <ihrachys> or maybe it kills itself? :)
16:19:18 <jlibosva> I'll try to add some debug messages to a patch and send it upstream to recheck, recheck, recheck
16:19:28 <ihrachys> ok
16:19:56 <ihrachys> #action jlibosva to send a debug patch for random test runner murders, and recheck, recheck, recheck
16:20:56 <ihrachys> maybe the code searching for children is misbehaving and ends up killing itself?
16:21:07 <ihrachys> anyway, we will follow up in gerrit
16:21:16 <jlibosva> yeah, that's what I meant
16:21:16 <ihrachys> thanks for taking the next step on this one!
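(Purely as an illustration of the suspected failure mode, not code from the neutron tree: if the helper that looks up a pid to kill races with the target process such as nc exiting, it can end up signalling an unrelated, reused pid, conceivably the test runner itself. A hedged sketch of a more defensive lookup, with hypothetical names:)

```python
# Hypothetical sketch of a defensive "find pid and kill" helper.
import os
import signal


def _cmdline(pid):
    """Return the command line of a pid, or None if it is gone."""
    try:
        with open('/proc/%d/cmdline' % pid, 'rb') as f:
            return f.read().replace(b'\0', b' ').strip()
    except OSError:
        return None


def kill_if_still_matching(pid, expected_token):
    """Send SIGKILL only if the pid still looks like the process we spawned.

    Guards against the target having already exited and the pid being
    reused, which would make us signal an unrelated process, possibly
    the test runner itself.
    """
    if pid == os.getpid():
        raise RuntimeError("refusing to kill ourselves (pid %d)" % pid)
    cmdline = _cmdline(pid)
    if cmdline is None or expected_token.encode() not in cmdline:
        # Process is gone or the pid was recycled; nothing safe to do.
        return False
    os.kill(pid, signal.SIGKILL)
    return True
```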
16:21:28 <ihrachys> next was "ihrachys to remove pg job from periodics grafana board"
16:21:46 <ihrachys> I sent this: https://review.openstack.org/#/c/482676/
16:21:56 <ihrachys> and Armando chimed in there with action ;)
16:21:59 <ihrachys> so I abandoned it
16:22:06 <ihrachys> seems like there is some interest
16:22:29 <ihrachys> and next time the job fails, we may ask him
16:23:13 <jlibosva> pg-liaison
16:23:57 <ihrachys> it *seems* that the bugs reported are now closed
16:24:00 <ihrachys> checking the dash
16:24:28 <ihrachys> yeah, it's green
16:24:33 <ihrachys> so we have closure here
16:24:40 <ihrachys> and those were all the items we had
16:24:43 <ihrachys> now...
16:24:50 <ihrachys> let's review grafana
16:24:54 <ihrachys> #topic Grafana
16:25:01 <ihrachys> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:04 <jlibosva> ah, no action for me re fullstack process isolation?
16:25:30 <ihrachys> jlibosva, there was the ovs fixture one, no?
16:25:42 <ihrachys> you were saying you would push what you have
16:25:52 <jlibosva> that was just about splitting the class into ovs and ovn
16:25:57 <ihrachys> ah ok
16:26:09 <ihrachys> let's discuss that after grafana
16:26:10 <jlibosva> I did some successful research on multiple ovsdb-server processes on a single node
16:26:12 <jlibosva> sure
16:26:47 <ihrachys> so looking at the board, we are in a decent state actually
16:26:55 <ihrachys> except functional, which was discussed
16:27:04 <ihrachys> and fullstack, which is just knowingly broken
16:27:20 <ihrachys> and scenarios that don't get much traction for their connectivity issues
16:27:25 <ihrachys> but those are long known
16:27:30 <ihrachys> no new breakages
16:28:07 <ihrachys> I see there is some -ovsfw- job in the tempest check queue chart that is at a 40% failure rate
16:28:11 <ihrachys> I haven't seen it before
16:28:13 <ihrachys> something new?
16:28:19 <jlibosva> no, it's been there for a while
16:28:28 <ihrachys> ok.
16:28:38 <ihrachys> the result is not great, don't you think?
16:28:40 <jlibosva> I checked the failures a couple of months back and they don't seem to be related to the firewall
16:28:50 <jlibosva> it's mostly other tempest issues
16:29:03 <jlibosva> if you look at the curve, it copies other tempest jobs like dvr
16:29:06 <ihrachys> the fact that it stands out is suspicious
16:29:33 <ihrachys> the curve, yeah. but its rate is twice as high as e.g. dvr+ha
16:30:28 <ihrachys> anyway
16:30:32 <ihrachys> I don't think it's high prio
16:30:53 <ihrachys> #topic Fullstack isolation
16:30:59 <ihrachys> jlibosva, you had smth to update here
16:31:07 <jlibosva> hmm, maybe we'll need some job that uses the same env to compare
16:31:15 <jlibosva> to just have a diff between ovs vs. iptables
16:31:33 <jlibosva> I don't remember what it runs, I think it's an all-in-one but not sure if it's dvr or not
16:31:50 <jlibosva> yep, so, I'm happy to announce that I was able to have multiple ovsdb-servers
16:32:12 <jlibosva> each running its own vswitchd to communicate with the kernel datapath. the ovsdb-servers must be in namespaces
16:32:15 <ihrachys> in the fullstack env? or just in some poc env?
16:32:23 <jlibosva> so I had two namespaces and the root namespace
16:32:30 <jlibosva> just a poc to see what's possible
16:32:34 <jlibosva> I didn't write any code
16:33:08 <jlibosva> I was able to have traffic going from one namespace to the other namespace, basically running traffic through interfaces from three different ovsdb-servers
16:33:26 <jlibosva> also, we can have a namespace in a namespace, nest it
16:33:35 <ihrachys> inception
16:33:45 <jlibosva> so what I plan to do is to create a namespace per fullstack host
16:34:04 <jlibosva> and run all agents there, using a single ovsdb-server running in the namespace
16:34:27 <jlibosva> which means the fake fullstack machines will also be spawned in this namespace, that's why the inception
16:34:36 <ihrachys> hm. and so e.g. dhcp or l3 namespaces will be 2nd order depth?
16:34:53 <jlibosva> no, dhcp and l3 will run in a host namespace, not in their own like today
16:35:06 <jlibosva> oh, right
16:35:13 <ihrachys> yeah, but I mean, then they create namespaces
16:35:14 <jlibosva> yeah, qrouter and qdhcp will be 2nd
16:35:16 <ihrachys> ok
16:35:18 <ihrachys> we are on the same page
16:35:29 <ihrachys> I wonder if it will reveal any kernel issues :p
16:35:32 <jlibosva> which means we won't need the hacks for unique namespaces
16:35:47 <ihrachys> jlibosva, oh right, it would be great to get rid of that
16:36:00 <jlibosva> and I used ovs 2.6
16:36:12 <ihrachys> is it the gate version?
16:36:14 <jlibosva> there were attempts to do similar things in the past on older ovs, without success
16:36:25 <jlibosva> I think we have 2.5.2
16:36:37 <jlibosva> iirc
16:37:07 <ihrachys> OVS_BRANCH=v2.6.1
16:37:10 <ihrachys> in fullstack
16:37:20 <jlibosva> but we don't compile, do we?
16:37:24 <jlibosva> oh, we compile the kernel module
16:37:33 <ihrachys> http://logs.openstack.org/20/483020/3/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/a95166c/console.html#_2017-07-18_14_02_16_215224 ?
16:37:40 <jlibosva> because of vxlan local tunneling
16:37:53 <jlibosva> but we use the userspace from the deb
16:38:10 <jlibosva> yeah, anyways, those were my findings and I'm excited about implementing it :)
16:38:24 <ihrachys> yeah, that sounds like a great project
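(For the record, a rough sketch of the kind of per-host setup being described here, written the way a fullstack fixture might shell out to it. The helper name, paths, and flag choices below are assumptions for illustration only, not jlibosva's PoC.)

```python
# Rough sketch of one "fake host": a network namespace with its own
# ovsdb-server and ovs-vswitchd, while the kernel datapath stays shared.
import subprocess


def run(cmd):
    subprocess.check_call(cmd, shell=True)


def start_fake_host(name, run_dir='/tmp/fullstack'):
    db = '%s/%s.db' % (run_dir, name)
    sock = '%s/%s.sock' % (run_dir, name)
    run('mkdir -p %s' % run_dir)
    run('ip netns add %s' % name)
    # Each fake host gets its own database file and ovsdb-server socket.
    run('ovsdb-tool create %s /usr/share/openvswitch/vswitch.ovsschema' % db)
    run('ip netns exec %s ovsdb-server %s --remote=punix:%s'
        ' --pidfile=%s/%s-ovsdb.pid --detach' % (name, db, sock, run_dir, name))
    # Its own vswitchd talks to that ovsdb-server; the kernel datapath is
    # shared with the other fake hosts on the node.
    run('ip netns exec %s ovs-vswitchd unix:%s'
        ' --pidfile=%s/%s-vswitchd.pid --detach' % (name, sock, run_dir, name))
    # Agents for this fake host would then be pointed at the per-host
    # socket instead of the default ovsdb socket.


if __name__ == '__main__':
    start_fake_host('host-1')
    start_fake_host('host-2')
```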
16:39:15 <ihrachys> there are no interesting new gate bugs, so I will skip that
16:39:18 <ihrachys> #topic Open discussion
16:39:22 <ihrachys> I don't have anything
16:39:29 <jlibosva> me neither
16:39:45 <ihrachys> ok, then we close the meeting. thanks jlibosva for being active, I would feel lonely otherwise lol
16:39:48 <ihrachys> #endmeeting