16:00:32 <slaweq> #startmeeting neutron_ci
16:00:33 <openstack> Meeting started Tue Jan 22 16:00:32 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:37 <openstack> The meeting name has been set to 'neutron_ci'
16:00:38 <mlavalle> o/
16:00:42 <slaweq> o/
16:01:28 <slaweq> haleyb: njohnston hongbin: are You around for the CI meeting?
16:01:48 <haleyb> hi, i might have to leave early though
16:01:53 <hongbin> o/
16:02:18 <slaweq> ok, lets start then
16:02:27 <slaweq> #topic Actions from previous meetings
16:02:40 <slaweq> there weren't too many actions from last week
16:02:44 <slaweq> slaweq to make e-r query for bug 1811515
16:02:46 <openstack> bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:03:09 <slaweq> I didn't write this query but the bug is now fixed (worked around) so it's not necessary anymore
16:03:20 <mlavalle> ++
16:03:20 <slaweq> next one:
16:03:23 <slaweq> slaweq to check if oslo.privsep < 1.31.0 will help to work around the issue with SSH to FIP
16:03:32 <mlavalle> it did
16:03:35 <slaweq> as above, this is now fixed
16:03:48 <slaweq> it did help but we worked around it in a different way
16:03:49 <slaweq> :)
16:03:51 <njohnston> o/
16:04:03 <slaweq> and the last one was:
16:04:05 <bcafarel> o/ too
16:04:05 <slaweq> slaweq to post more examples of failures in bug 1811515
16:04:06 <openstack> bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:04:24 <slaweq> I think I added a link to a logstash query where more examples can be found
16:04:29 <slaweq> but it's already fixed
16:04:33 <mlavalle> yeap
16:04:37 <slaweq> so that was all from last week :)
16:04:38 <mlavalle> you did
16:04:42 <mlavalle> not quite
16:04:49 <mlavalle> I had an action item
16:05:02 <slaweq> ok, so I somehow missed it
16:05:05 <slaweq> sorry mlavalle
16:05:08 <slaweq> go on
16:05:17 <mlavalle> continue working on https://bugs.launchpad.net/neutron/+bug/1795870
16:05:18 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:05:23 <mlavalle> which I did
16:05:49 <mlavalle> submitted DNM patch https://review.openstack.org/#/c/630778/
16:06:00 <mlavalle> which I rechecked a few times and I got lucky
16:06:18 <mlavalle> I got one successful run and one failure right afterwards
16:06:39 <mlavalle> that is allowing me to compare success / failure:
16:07:09 <mlavalle> 1) When the test passes, the router is hosted on both the controller and the compute
16:07:32 <mlavalle> so we see the router in the logs of both L3 agents
16:08:05 <mlavalle> 2) When the test fails, the router is never scheduled on the controller and it doesn't show up in its L3 agent
16:08:51 <slaweq> hmm, it should always be scheduled to the controller, as the dhcp port is only there, right?
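One quick way to confirm where a router actually landed during a comparison like this is the L3-agent-scheduler API. A minimal sketch, assuming python-neutronclient's list_l3_agent_hosting_routers() call, placeholder keystone credentials, and a hypothetical router UUID:

    # Illustrative sketch only: list the L3 agents hosting a router and the
    # liveness the neutron server reports for them.
    from keystoneauth1 import loading, session
    from neutronclient.v2_0 import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller:5000/v3',   # assumed keystone endpoint
        username='admin', password='secret',    # assumed credentials
        project_name='admin',
        user_domain_id='default', project_domain_id='default')
    neutron = client.Client(session=session.Session(auth=auth))

    router_id = '00000000-0000-0000-0000-000000000000'  # hypothetical router UUID
    agents = neutron.list_l3_agent_hosting_routers(router_id)['agents']
    for agent in agents:
        # 'alive' is the server-side view based on the agent's heartbeats
        print(agent['host'], agent['agent_type'], agent['alive'])

The 'alive' field reflects whether the server has seen a recent heartbeat from that agent.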
16:08:54 <mlavalle> This is an example of the L3 agent log in a failed execution on the controller: http://logs.openstack.org/78/630778/1/check/neutron-tempest-plugin-dvr-multinode-scenario/02391e0/controller/logs/screen-q-svc.txt.gz?level=TRACE
16:09:35 <mlavalle> please note that the L3 agent is down according to the neutron server
16:09:52 <mlavalle> I am seeing the same pattern in at least two cases
16:10:35 <slaweq> why is the agent down?
16:10:38 <slaweq> do You know?
16:10:51 <mlavalle> the agent is not actually down
16:10:56 <mlavalle> it is running
16:11:09 <mlavalle> but the server thinks it is down
16:11:20 <mlavalle> and my next step is to investigate why
16:11:35 <mlavalle> so please give me an action item for next week
16:11:43 <slaweq> sure
16:12:11 <slaweq> #action mlavalle to continue investigating why the L3 agent is considered down and causes trunk tests to fail
16:12:17 <slaweq> thx mlavalle for working on this
16:12:34 <slaweq> it is interesting why this agent is treated as down
16:12:45 <mlavalle> yeah, in the same node
16:13:54 <slaweq> I wonder if the agent is not sending the heartbeat or neutron-server is not processing it properly
16:14:10 <slaweq> IMHO it's more likely that the agent is not sending it
16:14:24 <slaweq> as heartbeats from other agents are ok on the server
16:14:33 <mlavalle> yes
16:14:39 <slaweq> but we will see when You check it :)
16:15:13 <slaweq> ok, so now I think we can move on to the next topic
16:15:16 <slaweq> right?
16:15:19 <mlavalle> thanks
16:15:21 <mlavalle> yes
16:15:25 <slaweq> #topic Python 3
16:15:37 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:16:17 <slaweq> in the last week we merged 2 patches and now the neutron-tempest-linuxbridge and neutron-tempest-dvr jobs are already running on python 3 and using zuulv3 syntax
16:16:44 <slaweq> I plan to do the same with neutron-tempest-dvr-ha-multinode-full this week
16:17:10 <mlavalle> Thanks
16:17:14 <slaweq> there is also this neutron-functional job which needs to be switched
16:17:24 <slaweq> and that is still problematic
16:17:38 <slaweq> This week I sent an email to the ML: http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001904.html
16:18:22 <slaweq> maybe someone more familiar with ostestr/subunit and python 3 will be able to help us with it
16:18:39 <mlavalle> did you get any responses so far?
16:18:44 <slaweq> nope :/
16:19:34 <slaweq> I will ask tomorrow on the openstack-qa channel - maybe e.g. gmann will know who can help us with it
16:20:10 <bcafarel> crossing fingers
16:20:10 <njohnston> good idea
16:20:54 <slaweq> unfortunately, other than that I have no other idea how to deal with this issue currently :/
16:21:02 <slaweq> and that's all related to python 3 from me
16:21:09 <slaweq> njohnston: do You have anything else to add?
16:21:21 <slaweq> how is it going with the grenade jobs switch?
16:21:49 <bcafarel> yeah pushing to add other limits to log output does not sound good as a viable fix :/ hopefully someone will fix the root cause and we get all our jobs py3-green
16:21:54 <njohnston> No, nothing for this week. I have been focused on bulk ports, but I should be able to move forward on the grenade work this week
16:22:15 <slaweq> njohnston: great, thx
16:22:28 <njohnston> #action njohnston Work on grenade job transition
16:22:34 <slaweq> thx :)
16:22:50 <slaweq> ok, so lets go to the next topic
16:22:51 <slaweq> #topic Grafana
16:23:00 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:24:54 <slaweq> there is a peak on many jobs today, I don't know exactly why that is, but:
16:25:16 <slaweq> 1. I didn't find anything very bad in today's jobs,
16:25:45 <slaweq> 2. as it shows up even in the pep8 job, I think it could be some "generic" issue, not related to a neutron bug directly
16:26:29 <slaweq> apart from that, I think that we again have one major issue
16:26:36 <slaweq> and this week it is the tempest-slow job :/
16:26:40 <mlavalle> yeah
16:26:47 <mlavalle> it is evident in Grafana
16:27:14 <slaweq> so, if You don't have anything else related to grafana, lets move on to talk about this tempest job now
16:27:16 <slaweq> fine?
16:28:40 <slaweq> ok, I take it as a yes :)
16:28:42 <slaweq> #topic Tempest/Scenario
16:29:05 <slaweq> We have one major issue with the tempest-slow job, it is described in bug report https://bugs.launchpad.net/neutron/+bug/1812552
16:29:06 <openstack> Launchpad bug 1812552 in neutron "tempest-slow tests fails often" [Critical,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:29:25 <slaweq> Today I was talking with sean-k-mooney about that
16:29:37 <mlavalle> did he help?
16:29:40 <slaweq> nope
16:29:42 <bcafarel> slaweq: for today's jobs there was some general post failure ~20h ago that probably did not help the graph (sorry lagging as usual)
16:30:04 <slaweq> bcafarel: yes, that might be the reason :)
16:30:12 <mlavalle> thanks anyway bcafarel
16:30:17 <mlavalle> much appreciated
16:30:22 <slaweq> ok, so the tempest-slow job issue in detail:
16:30:46 <slaweq> 1. from grafana and logstash it looks like this issue was somehow caused by https://review.openstack.org/#/c/631584/
16:31:14 <slaweq> 2. I proposed a revert of this patch https://review.openstack.org/#/c/631944/ and indeed the job has passed 5 or 6 times already
16:31:32 <slaweq> in fact it didn't fail even once with this revert
16:31:47 <slaweq> which somewhat confirms that this patch is the culprit
16:32:23 <slaweq> 3. I talked with sean today and we thought that maybe my patch introduced some additional race in the ovsdb monitor and how it handles port events on ovs bridges
16:32:54 <mlavalle> so we are leaving interfaces un-plugged, right?
16:33:12 <slaweq> but what is strange to me is the fact that it causes issues only in this job, and only (at least where I was checking) with two tests which shelve/unshelve an instance
16:33:21 <slaweq> mlavalle: not exactly
16:33:44 <slaweq> today I investigated the logs from one of those jobs carefully
16:33:55 <slaweq> (please read my last comment in the bug report)
16:34:08 <slaweq> and it looks to me like the port was configured properly
16:34:25 <slaweq> and communication, at least from the VM to the dhcp server, was fine
16:34:35 <slaweq> which puzzled me even more
16:35:07 <mlavalle> but the VM doesn't get the dhcp offer, right?
16:35:09 <slaweq> I have no idea what is wrong there and how (if) the mentioned patch could break it
16:35:18 <slaweq> mlavalle: it looks so
16:35:31 <slaweq> but dnsmasq gets the DHCP Request from the VM
16:35:36 <slaweq> and sends the DHCP offer
16:35:42 <mlavalle> so the DHCP request gets to dnsmasq
16:35:46 <slaweq> but it is somehow lost somewhere
16:36:07 <mlavalle> could it be a problem with the flows?
16:36:17 <slaweq> possibly yes
16:36:29 <slaweq> especially since we are using the openvswitch fw driver there
16:36:31 <mlavalle> we are dropping the offer
16:37:00 <mlavalle> perhaps, that is
16:37:07 <slaweq> it can be
16:37:32 <slaweq> but basically I think that we should revert this patch to make tempest-slow into better shape
16:37:39 <slaweq> *to get
16:37:51 <mlavalle> yes, let's pull the trigger
16:38:00 <njohnston> +1
16:38:05 <mlavalle> worst case, we revert the revert ;-)
16:38:11 <slaweq> LOL
16:38:56 <bcafarel> it's been some time since I saw a "revert revert revert revert revert ..." review
16:38:57 <slaweq> I know that sean is going to release a new os-vif version soon and it will have his revert of the patches which caused the port to be created twice during VM boot
16:39:15 <slaweq> so maybe my patch will not really be needed
16:39:44 <slaweq> but as a side note I want to mention that I think I have seen similar issues from time to time already
16:40:04 <slaweq> I mean issues where the VM didn't have its IP address configured and because of that was not reachable
16:40:24 <slaweq> maybe it is the same issue and it just happens less often without my patch
16:40:32 <slaweq> just please keep it in mind :)
16:40:44 <mlavalle> ok, thanks for the clarification
16:41:18 <slaweq> ok, that's all from my side about tempest tests
16:41:24 <slaweq> anything else You want to add?
16:42:21 <mlavalle> not me
16:42:25 <slaweq> ok, lets move on then
16:42:29 <slaweq> next topic
16:42:31 <slaweq> #topic fullstack/functional
16:42:48 <slaweq> today I found a new bug in the functional tests \o/
16:42:57 <slaweq> https://bugs.launchpad.net/neutron/+bug/1812872
16:42:58 <openstack> Launchpad bug 1812872 in neutron "Trunk functional tests can interact with each other" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:42:58 <mlavalle> yaay
16:43:18 <slaweq> fortunately so far we have been lucky and it doesn't hit us in the gate
16:43:29 <slaweq> but I can reproduce it locally
16:43:42 <slaweq> and I hit it in my patch where I want to switch the functional job to python3
16:44:04 <slaweq> basically it is a race between two trunk-related tests
16:44:04 <mlavalle> ahh
16:44:15 <slaweq> the details are in the bug description
16:44:32 <slaweq> I have one idea how to "fix" it in an ugly but fast way
16:44:53 <slaweq> we can add a lockutils.synchronized() decorator to those 2 tests and it should work fine IMHO
16:45:04 <slaweq> what do You think about such a solution?
16:45:24 <mlavalle> so I assume they share a resource
16:45:28 <mlavalle> right?
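A minimal sketch of the lock-based workaround proposed here, assuming oslo.concurrency's lockutils; the test class and test names are illustrative, not the actual trunk functional tests:

    # Illustrative only: serialize two tests that race on a shared stream of
    # OVSDB monitor events by guarding both with the same named lock.
    import unittest

    from oslo_concurrency import lockutils


    class TrunkRaceExampleTests(unittest.TestCase):  # hypothetical test class

        @lockutils.synchronized('trunk-ovsdb-events')
        def test_trunk_bridge_created(self):
            # While this test holds the lock, the other test cannot run, so it
            # cannot react to OVSDB events generated for this test's bridge.
            self.assertTrue(True)  # placeholder body

        @lockutils.synchronized('trunk-ovsdb-events')
        def test_trunk_bridge_removed(self):
            self.assertTrue(True)  # placeholder body

Note that lockutils.synchronized() defaults to an in-process semaphore; if the two tests can land in different test worker processes, an external file lock (external=True with a configured lock_path) would be needed instead.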
16:45:37 <mlavalle> that is why you need the lock
16:45:38 <slaweq> other than that, it would probably be something similar to the fullstack tests, where we would need to monkey patch some functions
16:45:46 <slaweq> they don't share resources in fact
16:46:03 <slaweq> but each of them is listening to ovsdb monitor events
16:46:36 <slaweq> and such events aren't distinguished between tests
16:46:51 <mlavalle> well, that's the shared resource, the stream of events
16:47:08 <slaweq> so one test has the handle_trunk_remove method mocked so that it doesn't clean up the trunk bridge
16:47:30 <slaweq> but then the second test cleans up this bridge as it gets an event from the ovsdb monitor
16:47:44 <slaweq> mlavalle: yes, so in that way they share resources
16:48:15 <slaweq> so is such a lock (with a comment) acceptable to You?
16:48:23 <slaweq> what do You think about it?
16:48:31 <njohnston> are trunk bridges vlan-specific? Would there be a way to plumb another fake vlan, thus making the trunk bridges in each test different from each other?
16:48:32 <mlavalle> I don't see why not
16:48:52 <slaweq> njohnston: the trunk bridges are different
16:49:06 <slaweq> they are created in the setUp() method for each test
16:49:43 <slaweq> but one test can remove the trunk bridge of the other one because it is triggered by an ovsdb event
16:49:43 <mlavalle> so essentially another alternative is to come up with a way for each test to have its own stream of events
16:50:11 <slaweq> that was discussed some time ago, I think in the context of the fullstack tests
16:50:15 <njohnston> so the ovsdb event does not include info on which bridge it is related to?
16:50:15 <mlavalle> would that be a lot of work?
16:50:44 <slaweq> and we don't have (at least we didn't have then) an easy way to say "listen only for events from this bridge"
16:51:04 <mlavalle> we have a lot of work to do
16:51:11 <slaweq> but tbh now, when we have switched the ovsdb monitors to the native implementation thx to ralonsoh's work, maybe that would be possible
16:51:17 <slaweq> I can check it
16:51:21 <njohnston> cool
16:51:40 <slaweq> if this will be too much to do, I will go for now with the lock and a TODO note on how it might be fixed in fututr
16:51:43 <slaweq> *future
16:51:45 <slaweq> ok for You?
16:51:49 <njohnston> +1
16:51:50 <mlavalle> if fixing this without locks is too labor intensive, let's go for the locks and leave a todo comment
16:52:01 <slaweq> mlavalle: ++ :)
16:52:31 <slaweq> #action slaweq to check if the new ovsdb monitor implementation can allow listening only for events from a specific bridge
16:53:11 <slaweq> ok, anything else You want to mention related to functional/fullstack tests?
16:53:17 <mlavalle> not me
16:53:27 <slaweq> ok, lets move to the next topic then
16:53:29 <slaweq> #topic Periodic
16:53:56 <slaweq> for at least a few days we have had an issue with the openstack-tox-py27-with-oslo-master and openstack-tox-py35-with-oslo-master jobs
16:54:02 <slaweq> Example: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/openstack-tox-py27-with-oslo-master/c787964/testr_results.html.gz
16:54:12 <slaweq> it's only a link to one python27 job
16:54:19 <slaweq> but the error is exactly the same in python 35
16:54:35 <slaweq> I haven't reported this bug in launchpad yet
16:54:48 <slaweq> is there any volunteer to report it on launchpad and fix it? :)
16:54:50 <njohnston> never seen one like that before
16:55:29 <njohnston> I'll take a look
16:55:52 <slaweq> njohnston: it's probably some change in the newest oslo_service lib and we need to adjust our code to it
16:55:55 <slaweq> thx njohnston
16:56:01 <njohnston> that's what I am thinking too
16:56:17 <slaweq> #action njohnston to take care of the periodic UT job failures
16:56:23 <njohnston> thx
16:56:31 <slaweq> Thank You :)
16:56:41 <slaweq> ok, and that's all from me for today :)
16:56:52 <slaweq> do You want to talk about anything else quickly?
16:57:03 <mlavalle> not me
16:57:39 <slaweq> ok, so thanks for attending guys
16:57:43 <slaweq> have a great week
16:57:46 <slaweq> o/
16:57:49 <slaweq> #endmeeting