16:00:45 #startmeeting neutron_ci
16:00:45 Meeting started Tue Jul 18 16:00:45 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:46 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:46 o/
16:00:48 The meeting name has been set to 'neutron_ci'
16:00:57 we'll try to be quick
16:00:58 #topic Actions from prev week
16:01:10 first was "jlibosva to reach out to Victor Stinner about eventlet/py3 issue with functional tests"
16:01:23 I did
16:01:35 he was able to get some data from my environment
16:01:42 and produced a simple reproducer
16:01:48 http://paste.alacon.org/44101
16:01:58 aha
16:02:04 since it doesn't import from openstack..
16:02:06 eventlet bug?
16:02:24 there seems to be a difference in python3 signal handling that breaks the eventlet loop
16:02:27 yes, eventlet bug
16:02:46 but the fix might be complicated, as they will need to redesign signal handling in eventlet
16:02:48 or
16:02:54 victor came up with the following fix
16:03:00 http://paste.alacon.org/44102
16:03:45 we also got help from sileht, who suggested making a workaround in oslo.service
16:03:56 next step is to report the bug (if not done already)
16:04:02 but seems like it got some traction
16:04:11 maybe the self_pipe will be a way to fix this
16:04:15 I guess it will be both oslo and eventlet bugs to report?
16:04:25 technically oslo doesn't do anything wrong
16:04:49 yeah but for a workaround it may make sense, no?
16:05:07 bumping the minimum eventlet version is always a pain
16:05:27 hmm
16:05:45 then maybe an oslo workaround would be easier, we'll see how the fix will go
16:06:21 ok, so the next step from your side is to report bugs
16:06:40 #action jlibosva to report bugs for eventlet and maybe oslo.service for eventlet signal/timer issue with py3
16:06:49 great dig it seems
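The two pastes linked above are not preserved in the log. As a rough illustration of the failure mode being discussed (an assumption-based sketch, not the actual reproducer): one commonly cited py3 difference is PEP 475, under which system calls interrupted by a signal are transparently restarted after the handler returns, so a handler that schedules work no longer wakes eventlet's epoll-based hub early.

    # Hedged, hypothetical sketch -- the real reproducer lived at
    # paste.alacon.org/44101. Assumption: on python3 the hub's blocking
    # epoll_wait() is restarted after the signal handler runs (PEP 475),
    # so newly scheduled greenthreads are not noticed until the hub wakes
    # up on its own.
    import signal
    import eventlet

    def on_alarm(signum, frame):
        # schedule a greenthread from the signal handler; on py3 this may
        # not wake the hub out of its current blocking wait
        eventlet.spawn(print, 'woken by signal')

    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(1)
    # expected: 'woken by signal' after ~1s; the reported py3 behavior is
    # that it can be delayed until this sleep expires
    eventlet.sleep(30)

The "self_pipe" mentioned at 16:04:11 presumably refers to the classic self-pipe trick (e.g. signal.set_wakeup_fd() writing a byte to a pipe the hub is watching), which gives the blocked hub a file descriptor event to wake up on.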
16:07:11 next item was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:07:43 ah, didn't do. maybe I could just push what I currently have :)
16:07:58 it may be good to see the direction
16:08:08 I will repeat the item
16:08:19 #action jlibosva to post patch splitting OVN from OvsVenvFixture
16:08:29 next was "jlibosva to post patch splitting OVN from OvsVenvFixture"
16:08:33 I didn't do :-x
16:08:43 oops
16:08:46 yeah, I didn't do that either
16:08:53 I meant "ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases"
16:09:03 #action ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases
16:09:20 that kinda gets pushed down my list because it's non-voting :p
16:09:29 next was "haleyb to split grafana check dashboard into grenade and tempest charts"
16:09:54 I believe it's https://review.openstack.org/#/c/483119/
16:10:22 next was "haleyb to continue looking at dvr-ha job failure rate and reasons"
16:10:38 I see there is this patch: https://review.openstack.org/#/c/483600/ (WIP)
16:11:06 also, Brian sent an email to openstack-dev@
16:11:40 #link http://lists.openstack.org/pipermail/openstack-dev/2017-July/119743.html
16:11:43 no replies so far
16:12:18 let's chime in there, and also maybe ping some key folks
16:12:53 #action haleyb to collect feedback of key contributors on multinode-by-default patch
16:13:07 #action everyone to chime in on haleyb's thread on multinode switch
16:13:24 next was "haleyb to clean up old trusty charts from grafana"
16:14:22 I believe it magically happened by virtue of the general trusty cleanup
16:14:38 so no patches from Brian, but it is done nevertheless
16:14:55 next was "haleyb to spin up a ML discussion on replacing single node grenade job with multinode in integrated gate"
16:15:04 ok, THAT one was for the email thread I mentioned above
16:15:10 but those topics are interrelated
16:15:17 and it seems Brian is not here
16:15:32 so we will follow up with the thread and see where it leads us
16:15:40 next was "haleyb to continue looking at places to reduce the number of jobs"
16:15:57 kinda an open-ended action, probably wasn't worth creating in the first place :)
16:16:10 again, will see where more specific actions lead us
16:16:16 next was "ihrachys to complete triage of latest functional test failures that result in 30% failure rate"
16:16:34 I did some triaging of all failures since last meeting for the functional gate (not the check queue)
16:16:41 this is the result:
16:16:51 #link https://etherpad.openstack.org/p/neutron-functional-gate-failures-july Functional Gate failures
16:17:11 it's basically timeouts, and tester threads running firewall test cases dying
16:17:18 which may actually be the same
16:17:37 when a tester thread is dying, we just see 'Killed' in the console log
16:17:44 nothing in the per-test case log
16:17:46 or in syslog
16:18:01 it's suspicious that it's almost always firewall test cases
16:18:13 * ihrachys wonders if it's smth wrong with the test class
16:18:25 one thing that comes to my mind is that the search for a pid to kill (like nc) is malfunctioning
16:18:38 or is it just so huge that the chance of triggering it there is high?
16:18:41 and since nc dies, it picks a wrong pid
16:18:58 jlibosva, could it be some other thread kills the current thread somehow?
16:19:03 or maybe it kills itself? :)
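To illustrate the hypothesis above (purely hypothetical code, not the actual test helper): if the nc process being hunted has already exited, a loose pattern match can return the pid of an unrelated process, and an unguarded kill loop could take out the test runner itself, which would surface only as 'Killed' in the console log.

    # Hypothetical sketch of the suspected failure mode; the helper name
    # and the 'nc ' pattern are made up for illustration.
    import os
    import signal
    import subprocess

    def pids_matching(pattern):
        # pgrep -f matches against the full command line of each process
        out = subprocess.run(['pgrep', '-f', pattern],
                             stdout=subprocess.PIPE, universal_newlines=True)
        return [int(p) for p in out.stdout.split()]

    for pid in pids_matching('nc '):  # if nc already died, this can match anything
        if pid == os.getpid():
            continue  # without this guard, the runner could SIGKILL itself
        os.kill(pid, signal.SIGKILL)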
16:19:18 I'll try to add some debug messages to the patch and send it upstream to recheck, recheck, recheck
16:19:28 ok
16:19:56 #action jlibosva to send a debug patch for random test runner murders, and recheck, recheck, recheck
16:20:56 maybe the code searching for children is misbehaving and ends up killing itself?
16:21:07 anyway, we will follow up in gerrit
16:21:16 yeah, that's what I meant
16:21:16 thanks for taking the next step on this one!
16:21:28 next was "ihrachys to remove pg job from periodics grafana board"
16:21:46 I sent this: https://review.openstack.org/#/c/482676/
16:21:56 and Armando chimed in there with an action ;)
16:21:59 so I abandoned it
16:22:06 seems like there is some interest
16:22:29 and next time the job fails, we may ask him
16:23:13 pg-liaison
16:23:57 it *seems* that the bugs reported are now closed
16:24:00 checking the dash
16:24:28 yeah it's green
16:24:33 so we have closure here
16:24:40 and those were all items we had
16:24:43 now...
16:24:50 let's review grafana
16:24:54 #topic Grafana
16:25:01 #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:04 ah, no action for me re fullstack process isolation?
16:25:30 jlibosva, there was the ovs fixture, no?
16:25:42 you were saying you will push what you have
16:25:52 that was just about splitting the class into ovs and ovn
16:25:57 ah ok
16:26:09 let's discuss that after grafana
16:26:10 I did some successful research on multiple ovsdb-server processes on a single node
16:26:12 sure
16:26:47 so looking at the board, we have a decent state actually
16:26:55 except functional, which was discussed
16:27:04 and fullstack, which is just knowingly broken
16:27:20 and scenarios that don't get much traction for connectivity issues
16:27:25 but those are long known
16:27:30 no new breakages
16:28:07 I see there is some -ovsfw- job in the tempest check queue chart that is at 40% failure rate
16:28:11 I haven't seen it before
16:28:13 something new?
16:28:19 no, it's been there for a while
16:28:28 ok.
16:28:38 the result is not great, don't you think
16:28:40 I checked the failures a couple months back and they don't seem to be related to the firewall
16:28:50 it's mostly other tempest issues
16:29:03 if you look at the curve, it copies other tempest tests like dvr
16:29:06 the fact that it stands out is suspicious
16:29:33 curve yeah. but it's twice the rate of e.g. dvr+ha
16:30:28 anyway
16:30:32 I don't think it's high prio
16:30:53 #topic Fullstack isolation
16:30:59 jlibosva, you had smth to update here
16:31:07 hmm, maybe we'll need some job that uses the same env to compare
16:31:15 to just have a diff between ovs vs. iptables
16:31:33 I don't remember what it runs, I think it's an all-in-one but not sure if it's dvr or not
16:31:50 yep, so, I'm happy to announce that I was able to have multiple ovsdb-servers
16:32:12 each running its own vswitchd to communicate with the kernel datapath. ovsdb-servers must be in namespaces
16:32:15 in fullstack env? or just in some poc env?
16:32:23 so I had two namespaces and the root namespace
16:32:30 just a poc to see what's possible
16:32:34 I didn't write any code
16:33:08 I was able to have traffic from one namespace to the other, basically running traffic through interfaces from three different ovsdb-servers
16:33:26 also, we can have a namespace in a namespace, nest it
16:33:35 inception
16:33:45 so what I plan to do is to create a namespace per fullstack host
16:34:04 and run all agents there, using a single ovsdb-server running in the namespace
16:34:27 which means fake fullstack machines will also be spawned in this namespace, that's why the inception
16:34:36 hm. and so e.g. dhcp or l3 namespaces will be 2nd order depth?
16:34:53 no, dhcp and l3 will run in a host namespace, not in their own like today
16:35:06 oh, right
16:35:13 yeah but I mean, then they create namespaces
16:35:14 yeah, qrouter and qdhcp will be 2nd
16:35:16 ok
16:35:18 we are on the same page
16:35:29 I wonder if it will reveal any kernel issues :p
16:35:32 which means we won't need the hacks for unique namespaces
16:35:47 jlibosva, oh right, it would be great to get rid of that
16:36:00 and I used ovs 2.6
16:36:12 is it the gate version?
16:36:14 there were attempts to do similar things in the past on older ovs without success
16:36:25 I think we have 2.5.2
16:36:37 iirc
16:37:07 OVS_BRANCH=v2.6.1
16:37:10 in fullstack
16:37:20 but we don't compile, do we?
16:37:24 oh, we compile the kernel module
16:37:33 http://logs.openstack.org/20/483020/3/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/a95166c/console.html#_2017-07-18_14_02_16_215224 ?
16:37:40 because of vxlan local tunneling
16:37:53 but we use userspace from deb
16:38:10 yeah, anyways, those were my findings and I'm excited about implementing it :)
16:38:24 yeah that sounds like a great project
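A rough sketch of what the PoC described above might look like (assumptions: ovs 2.6 userspace tools and iproute2 installed, the openvswitch kernel module loaded, run as root; the namespace name and all paths below are invented for illustration): one private ovsdb-server plus its own ovs-vswitchd per network namespace, each with its own database file and control socket.

    # Hypothetical reconstruction of the PoC; the meeting notes say no
    # code was written, so this is only one plausible shape of it.
    import subprocess

    def in_ns(ns, *cmd):
        # run a command inside the given network namespace
        subprocess.check_call(('ip', 'netns', 'exec', ns) + cmd)

    ns = 'fullstack-host-1'          # hypothetical per-host namespace
    rundir = '/tmp/' + ns
    db = rundir + '/conf.db'
    sock = rundir + '/db.sock'

    subprocess.check_call(['mkdir', '-p', rundir])
    subprocess.check_call(['ip', 'netns', 'add', ns])
    subprocess.check_call(['ovsdb-tool', 'create', db,
                           '/usr/share/openvswitch/vswitch.ovsschema'])
    in_ns(ns, 'ovsdb-server', db, '--remote=punix:' + sock,
          '--pidfile=' + rundir + '/ovsdb.pid', '--detach')
    in_ns(ns, 'ovs-vswitchd', 'unix:' + sock,
          '--pidfile=' + rundir + '/vswitchd.pid', '--detach')
    # agents/ovs-vsctl inside the namespace talk to the private db:
    in_ns(ns, 'ovs-vsctl', '--db=unix:' + sock, 'add-br', 'br-test')

With one such set per fake fullstack host, agents launched inside the namespace would point at the private db socket instead of the system-wide one, which is what would remove the need for the unique-namespace hacks mentioned at 16:35:32.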
16:39:15 there are no interesting new gate bugs, so I will skip
16:39:18 #topic Open discussion
16:39:22 I don't have anything
16:39:29 me neither
16:39:45 ok then we close the meeting. thanks jlibosva for being active, I would feel lonely otherwise lol.
16:39:48 #endmeeting