16:00:32 <slaweq> #startmeeting neutron_ci
16:00:33 <openstack> Meeting started Tue Jan 22 16:00:32 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:37 <openstack> The meeting name has been set to 'neutron_ci'
16:00:38 <mlavalle> o/
16:00:42 <slaweq> o/
16:01:28 <slaweq> haleyb: njohnston hongbin: are You around for the CI meeting?
16:01:48 <haleyb> hi, i might have to leave early though
16:01:53 <hongbin> o/
16:02:18 <slaweq> ok, lets start then
16:02:27 <slaweq> #topic Actions from previous meetings
16:02:40 <slaweq> there weren't too many actions from last week
16:02:44 <slaweq> slaweq to make e-r query for bug 1811515
16:02:46 <openstack> bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:03:09 <slaweq> I didn't write this query but the bug is now fixed (worked around) so it's not necessary anymore
16:03:20 <mlavalle> ++
16:03:20 <slaweq> next one:
16:03:23 <slaweq> slaweq to check if oslo.privsep < 1.31.0 will help to work around the issue with SSH to FIP
16:03:32 <mlavalle> it did
16:03:35 <slaweq> as above, this is now fixed
16:03:48 <slaweq> it did help but we worked around it in a different way
16:03:49 <slaweq> :)
16:03:51 <njohnston> o/
16:04:03 <slaweq> and the last one was:
16:04:05 <bcafarel> o/ too
16:04:05 <slaweq> slaweq to post more examples of failures in bug 1811515
16:04:06 <openstack> bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:04:24 <slaweq> I think I added a link to a logstash query where more examples can be found
16:04:29 <slaweq> but it's already fixed
16:04:33 <mlavalle> yeap
16:04:37 <slaweq> so that was all from last week :)
16:04:38 <mlavalle> you did
16:04:42 <mlavalle> not quite
16:04:49 <mlavalle> I had an action item
16:05:02 <slaweq> ok, so I somehow missed it
16:05:05 <slaweq> sorry mlavalle
16:05:08 <slaweq> go on
16:05:17 <mlavalle> continue working on https://bugs.launchpad.net/neutron/+bug/1795870
16:05:18 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:05:23 <mlavalle> which I did
16:05:49 <mlavalle> submitted DNM patch https://review.openstack.org/#/c/630778/
16:06:00 <mlavalle> which I rechecked a few times and I got lucky
16:06:18 <mlavalle> I got one successful run and one failure right afterwards
16:06:39 <mlavalle> that is allowing me to compare success / failure:
16:07:09 <mlavalle> 1) When the test passes, the router is hosted on both the controller and the compute
16:07:32 <mlavalle> so we see the router in the logs of both L3 agents
16:08:05 <mlavalle> 2) When the test fails, the router is never scheduled on the controller and it doesn't show up in its L3 agent
16:08:51 <slaweq> hmm, it should always be scheduled to the controller, as the dhcp port is only there, right?
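One quick way to confirm where a router actually landed during a comparison like this is the L3-agent-scheduler API. A minimal sketch, assuming python-neutronclient's list_l3_agent_hosting_routers() call, placeholder keystone credentials, and a hypothetical router UUID:

    # Illustrative sketch only: list the L3 agents hosting a router and the
    # liveness the neutron server reports for them.
    from keystoneauth1 import loading, session
    from neutronclient.v2_0 import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller:5000/v3',   # assumed keystone endpoint
        username='admin', password='secret',    # assumed credentials
        project_name='admin',
        user_domain_id='default', project_domain_id='default')
    neutron = client.Client(session=session.Session(auth=auth))

    router_id = '00000000-0000-0000-0000-000000000000'  # hypothetical router UUID
    agents = neutron.list_l3_agent_hosting_routers(router_id)['agents']
    for agent in agents:
        # 'alive' is the server-side view based on the agent's heartbeats
        print(agent['host'], agent['agent_type'], agent['alive'])

The 'alive' field reflects whether the server has seen a recent heartbeat from that agent.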
16:08:54 <mlavalle> This is an example of the L3 agent log in a failed execution on the controller: http://logs.openstack.org/78/630778/1/check/neutron-tempest-plugin-dvr-multinode-scenario/02391e0/controller/logs/screen-q-svc.txt.gz?level=TRACE
16:09:35 <mlavalle> please note that the L3 agent is down according to the neutron server
16:09:52 <mlavalle> I am seeing the same pattern in at least two cases
16:10:35 <slaweq> why is the agent down?
16:10:38 <slaweq> do You know?
16:10:51 <mlavalle> the agent is not actually down
16:10:56 <mlavalle> it is running
16:11:09 <mlavalle> but the server thinks it is down
16:11:20 <mlavalle> and my next step is to investigate why
16:11:35 <mlavalle> so please give me an action item for next week
16:11:43 <slaweq> sure
16:12:11 <slaweq> #action mlavalle to continue investigating why the L3 agent is considered down and causes trunk tests to fail
16:12:17 <slaweq> thx mlavalle for working on this
16:12:34 <slaweq> it is interesting why this agent is treated as down
16:12:45 <mlavalle> yeah, in the same node
16:13:54 <slaweq> I wonder if the agent is not sending the heartbeat or neutron-server is not processing it properly
16:14:10 <slaweq> IMHO it's more likely that the agent is not sending it
16:14:24 <slaweq> as heartbeats from other agents are ok on the server
16:14:33 <mlavalle> yes
16:14:39 <slaweq> but we will see when You check it :)
16:15:13 <slaweq> ok, so now I think we can move on to the next topic
16:15:16 <slaweq> right?
16:15:19 <mlavalle> thanks
16:15:21 <mlavalle> yes
16:15:25 <slaweq> #topic Python 3
16:15:37 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:16:17 <slaweq> in the last week we merged 2 patches and now the neutron-tempest-linuxbridge and neutron-tempest-dvr jobs are already running on python 3 and using zuulv3 syntax
16:16:44 <slaweq> I plan to do the same with neutron-tempest-dvr-ha-multinode-full this week
16:17:10 <mlavalle> Thanks
16:17:14 <slaweq> there is also this neutron-functional job which needs to be switched
16:17:24 <slaweq> and that is still problematic
16:17:38 <slaweq> This week I sent an email to the ML: http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001904.html
16:18:22 <slaweq> maybe someone more familiar with ostestr/subunit and python 3 will be able to help us with it
16:18:39 <mlavalle> did you get any responses so far?
16:18:44 <slaweq> nope :/
16:19:34 <slaweq> I will ask tomorrow on the openstack-qa channel - maybe e.g. gmann will know who can help us with it
16:20:10 <bcafarel> crossing fingers
16:20:10 <njohnston> good idea
16:20:54 <slaweq> unfortunately, other than that I have no other idea how to deal with this issue currently :/
16:21:02 <slaweq> and that's all related to python 3 from me
16:21:09 <slaweq> njohnston: do You have anything else to add?
16:21:21 <slaweq> how is it going with the grenade jobs switch?
16:21:49 <bcafarel> yeah pushing to add other limits to log output does not sound good as a viable fix :/ hopefully someone will fix the root cause and we get all our jobs py3-green
16:21:54 <njohnston> No, nothing for this week. I have been focused on bulk ports, but I should be able to move forward on the grenade work this week
16:22:15 <slaweq> njohnston: great, thx
16:22:28 <njohnston> #action njohnston Work on grenade job transition
16:22:34 <slaweq> thx :)
16:22:50 <slaweq> ok, so lets go to the next topic
16:22:51 <slaweq> #topic Grafana
16:23:00 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:24:54 <slaweq> there is a peak on many jobs today, I don't know exactly why that is, but:
16:25:16 <slaweq> 1. I didn't find anything very bad in today's jobs,
16:25:45 <slaweq> 2. as it shows up even in the pep8 job, I think it could be some "generic" issue, not related to a neutron bug directly
16:26:29 <slaweq> apart from that, I think that we again have one major issue
16:26:36 <slaweq> and this week it is the tempest-slow job :/
16:26:40 <mlavalle> yeah
16:26:47 <mlavalle> it is evident in Grafana
16:27:14 <slaweq> so, if You don't have anything else related to grafana, lets move on to talk about this tempest job now
16:27:16 <slaweq> fine?
16:28:40 <slaweq> ok, I take it as a yes :)
16:28:42 <slaweq> #topic Tempest/Scenario
16:29:05 <slaweq> We have one major issue with the tempest-slow job, it is described in bug report https://bugs.launchpad.net/neutron/+bug/1812552
16:29:06 <openstack> Launchpad bug 1812552 in neutron "tempest-slow tests fails often" [Critical,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:29:25 <slaweq> Today I was talking with sean-k-mooney about that
16:29:37 <mlavalle> did he help?
16:29:40 <slaweq> nope
16:29:42 <bcafarel> slaweq: for today's jobs there was some general post failure ~20h ago that probably did not help the graph (sorry lagging as usual)
16:30:04 <slaweq> bcafarel: yes, that might be the reason :)
16:30:12 <mlavalle> thanks anyway bcafarel
16:30:17 <mlavalle> much appreciated
16:30:22 <slaweq> ok, so the tempest-slow job issue in detail:
16:30:46 <slaweq> 1. from grafana and logstash it looks like this issue was somehow caused by https://review.openstack.org/#/c/631584/
16:31:14 <slaweq> 2. I proposed a revert of this patch https://review.openstack.org/#/c/631944/ and indeed the job has passed 5 or 6 times already
16:31:32 <slaweq> in fact it didn't fail even once with this revert
16:31:47 <slaweq> which somewhat confirms that this patch is the culprit
16:32:23 <slaweq> 3. I talked with sean today and we thought that maybe my patch introduced some additional race in the ovsdb monitor and how it handles port events on ovs bridges
16:32:54 <mlavalle> so we are leaving interfaces un-plugged, right?
16:33:12 <slaweq> but what is strange to me is the fact that it causes issues only in this job, and only (at least where I was checking) with two tests which shelve/unshelve an instance
16:33:21 <slaweq> mlavalle: not exactly
16:33:44 <slaweq> today I investigated the logs from one of those jobs carefully
16:33:55 <slaweq> (please read my last comment in the bug report)
16:34:08 <slaweq> and it looks to me like the port was configured properly
16:34:25 <slaweq> and communication, at least from the VM to the dhcp server, was fine
16:34:35 <slaweq> which puzzled me even more
16:35:07 <mlavalle> but the VM doesn't get the dhcp offer, right?
16:35:09 <slaweq> I have no idea what is wrong there and how (if) the mentioned patch could break it
16:35:18 <slaweq> mlavalle: it looks so
16:35:31 <slaweq> but dnsmasq gets the DHCP Request from the VM
16:35:36 <slaweq> and sends the DHCP offer
16:35:42 <mlavalle> so the DHCP request gets to dnsmasq
16:35:46 <slaweq> but it is somehow lost somewhere
16:36:07 <mlavalle> could it be a problem with the flows?
16:36:17 <slaweq> possibly yes
16:36:29 <slaweq> especially since we are using the openvswitch fw driver there
16:36:31 <mlavalle> we are dropping the offer
16:37:00 <mlavalle> perhaps, that is
16:37:07 <slaweq> it can be
16:37:32 <slaweq> but basically I think that we should revert this patch to make tempest-slow into better shape
16:37:39 <slaweq> *to get
16:37:51 <mlavalle> yes, let's pull the trigger
16:38:00 <njohnston> +1
16:38:05 <mlavalle> worst case, we revert the revert ;-)
16:38:11 <slaweq> LOL
16:38:56 <bcafarel> it's been some time since I saw a "revert revert revert revert revert ..." review
16:38:57 <slaweq> I know that sean is going to release a new os-vif version soon and it will have his revert of the patches which caused the port to be created twice during VM boot
16:39:15 <slaweq> so maybe my patch will not really be needed
16:39:44 <slaweq> but as a side note I want to mention that I think I have seen similar issues from time to time already
16:40:04 <slaweq> I mean issues where the VM didn't have its IP address configured and because of that was not reachable
16:40:24 <slaweq> maybe it is the same issue and it just happens less often without my patch
16:40:32 <slaweq> just please keep it in mind :)
16:40:44 <mlavalle> ok, thanks for the clarification
16:41:18 <slaweq> ok, that's all from my side about tempest tests
16:41:24 <slaweq> anything else You want to add?
16:42:21 <mlavalle> not me
16:42:25 <slaweq> ok, lets move on then
16:42:29 <slaweq> next topic
16:42:31 <slaweq> #topic fullstack/functional
16:42:48 <slaweq> today I found a new bug in the functional tests \o/
16:42:57 <slaweq> https://bugs.launchpad.net/neutron/+bug/1812872
16:42:58 <openstack> Launchpad bug 1812872 in neutron "Trunk functional tests can interact with each other" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:42:58 <mlavalle> yaay
16:43:18 <slaweq> fortunately so far we have been lucky and it doesn't hit us in the gate
16:43:29 <slaweq> but I can reproduce it locally
16:43:42 <slaweq> and I hit it in my patch where I want to switch the functional job to python3
16:44:04 <slaweq> basically it is a race between two trunk-related tests
16:44:04 <mlavalle> ahh
16:44:15 <slaweq> the details are in the bug description
16:44:32 <slaweq> I have one idea how to "fix" it in an ugly but fast way
16:44:53 <slaweq> we can add a lockutils.synchronized() decorator to those 2 tests and it should work fine IMHO
16:45:04 <slaweq> what do You think about such a solution?
16:45:24 <mlavalle> so I assume they share a resource
16:45:28 <mlavalle> right?
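A minimal sketch of the lock-based workaround proposed here, assuming oslo.concurrency's lockutils; the test class and test names are illustrative, not the actual trunk functional tests:

    # Illustrative only: serialize two tests that race on a shared stream of
    # OVSDB monitor events by guarding both with the same named lock.
    import unittest

    from oslo_concurrency import lockutils


    class TrunkRaceExampleTests(unittest.TestCase):  # hypothetical test class

        @lockutils.synchronized('trunk-ovsdb-events')
        def test_trunk_bridge_created(self):
            # While this test holds the lock, the other test cannot run, so it
            # cannot react to OVSDB events generated for this test's bridge.
            self.assertTrue(True)  # placeholder body

        @lockutils.synchronized('trunk-ovsdb-events')
        def test_trunk_bridge_removed(self):
            self.assertTrue(True)  # placeholder body

Note that lockutils.synchronized() defaults to an in-process semaphore; if the two tests can land in different test worker processes, an external file lock (external=True with a configured lock_path) would be needed instead.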
16:45:37 <mlavalle> that is why you need the lock
16:45:38 <slaweq> other than that, it would probably be something similar to the fullstack tests, where we would need to monkey patch some functions
16:45:46 <slaweq> they don't share resources in fact
16:46:03 <slaweq> but each of them is listening to ovsdb monitor events
16:46:36 <slaweq> and such events aren't distinguished between tests
16:46:51 <mlavalle> well, that's the shared resource, the stream of events
16:47:08 <slaweq> so one test has the handle_trunk_remove method mocked so that it doesn't clean up the trunk bridge
16:47:30 <slaweq> but then the second test cleans up this bridge as it gets an event from the ovsdb monitor
16:47:44 <slaweq> mlavalle: yes, so in that way they share resources
16:48:15 <slaweq> so is such a lock (with a comment) acceptable to You?
16:48:23 <slaweq> what do You think about it?
16:48:31 <njohnston> are trunk bridges vlan-specific? Would there be a way to plumb another fake vlan, thus making the trunk bridges in each test different from each other?
16:48:32 <mlavalle> I don't see why not
16:48:52 <slaweq> njohnston: the trunk bridges are different
16:49:06 <slaweq> they are created in the setUp() method for each test
16:49:43 <slaweq> but one test can remove the trunk bridge of the other one because it is triggered by an ovsdb event
16:49:43 <mlavalle> so essentially another alternative is to come up with a way for each test to have its own stream of events
16:50:11 <slaweq> that was discussed some time ago, I think in the context of the fullstack tests
16:50:15 <njohnston> so the ovsdb event does not include info on which bridge it is related to?
16:50:15 <mlavalle> would that be a lot of work?
16:50:44 <slaweq> and we don't have (at least we didn't have then) an easy way to say "listen only for events from this bridge"
16:51:04 <mlavalle> we have a lot of work to do
16:51:11 <slaweq> but tbh now, when we have switched the ovsdb monitors to the native implementation thx to ralonsoh's work, maybe that would be possible
16:51:17 <slaweq> I can check it
16:51:21 <njohnston> cool
16:51:40 <slaweq> if this will be too much to do, I will go for now with the lock and a TODO note on how it might be fixed in fututr
16:51:43 <slaweq> *future
16:51:45 <slaweq> ok for You?
16:51:49 <njohnston> +1
16:51:50 <mlavalle> if fixing this without locks is too labor intensive, let's go for the locks and leave a todo comment
16:52:01 <slaweq> mlavalle: ++ :)
16:52:31 <slaweq> #action slaweq to check if the new ovsdb monitor implementation can allow listening only for events from a specific bridge
16:53:11 <slaweq> ok, anything else You want to mention related to functional/fullstack tests?
16:53:17 <mlavalle> not me
16:53:27 <slaweq> ok, lets move to the next topic then
16:53:29 <slaweq> #topic Periodic
16:53:56 <slaweq> for at least a few days we have had an issue with the openstack-tox-py27-with-oslo-master and openstack-tox-py35-with-oslo-master jobs
16:54:02 <slaweq> Example: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/openstack-tox-py27-with-oslo-master/c787964/testr_results.html.gz
16:54:12 <slaweq> it's only a link to one python27 job
16:54:19 <slaweq> but the error is exactly the same in python 35
16:54:35 <slaweq> I haven't reported this bug in launchpad yet
16:54:48 <slaweq> is there any volunteer to report it on launchpad and fix it? :)
16:54:50 <njohnston> never seen one like that before
16:55:29 <njohnston> I'll take a look
16:55:52 <slaweq> njohnston: it's probably some change in the newest oslo_service lib and we need to adjust our code to it
16:55:55 <slaweq> thx njohnston
16:56:01 <njohnston> that's what I am thinking too
16:56:17 <slaweq> #action njohnston to take care of the periodic UT job failures
16:56:23 <njohnston> thx
16:56:31 <slaweq> Thank You :)
16:56:41 <slaweq> ok, and that's all from me for today :)
16:56:52 <slaweq> do You want to talk about anything else quickly?
16:57:03 <mlavalle> not me
16:57:39 <slaweq> ok, so thanks for attending guys
16:57:43 <slaweq> have a great week
16:57:46 <slaweq> o/
16:57:49 <slaweq> #endmeeting