16:00:32 #startmeeting neutron_ci
16:00:33 Meeting started Tue Jan 22 16:00:32 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:37 The meeting name has been set to 'neutron_ci'
16:00:38 o/
16:00:42 o/
16:01:28 haleyb: njohnston hongbin: are You around for CI meeting?
16:01:48 hi, i might have to leave early though
16:01:53 o/
16:02:18 ok, let's start then
16:02:27 #topic Actions from previous meetings
16:02:40 there weren't too many actions from last week
16:02:44 slaweq to make e-r query for bug 1811515
16:02:46 bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:03:09 I didn't write this query but the bug is now fixed (worked around) so it's not necessary anymore
16:03:20 ++
16:03:20 next one:
16:03:23 slaweq to check if oslo.privsep < 1.31.0 will help to work around the issue with SSH to FIP
16:03:32 it did
16:03:35 as above, this is now fixed
16:03:48 it did help but we worked around it in a different way
16:03:49 :)
16:03:51 o/
16:04:03 and the last one was:
16:04:05 o/ too
16:04:05 slaweq to post more examples of failures in bug 1811515
16:04:06 bug 1811515 in neutron "pyroute2.NetNS don't work properly with concurrency in oslo.privsep" [Critical,Fix released] https://launchpad.net/bugs/1811515 - Assigned to Slawek Kaplonski (slaweq)
16:04:24 I think I added a link to a logstash query where more examples can be found
16:04:29 but it's already fixed
16:04:33 yeap
16:04:37 so that was all from last week :)
16:04:38 you did
16:04:42 not quite
16:04:49 I had an action item
16:05:02 ok, so I somehow missed it
16:05:05 sorry mlavalle
16:05:08 go on
16:05:17 continue working on https://bugs.launchpad.net/neutron/+bug/1795870
16:05:18 Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:05:23 which I did
16:05:49 submitted DNM patch https://review.openstack.org/#/c/630778/
16:06:00 which I rechecked a few times and I got lucky
16:06:18 I got one successful run and one failure right afterwards
16:06:39 that is allowing me to compare success / failure:
16:07:09 1) When the test passes, the router is being hosted both in the controller and the compute
16:07:32 so we see the router in the logs of both L3 agents
16:08:05 2) When the test fails, the router is never scheduled in the controller and it doesn't show up in its L3 agent
16:08:51 hmm, it should always be scheduled to the controller, as only there is the dhcp port, right?
16:08:54 This is an example of the L3 agent log from a failed execution on the controller: http://logs.openstack.org/78/630778/1/check/neutron-tempest-plugin-dvr-multinode-scenario/02391e0/controller/logs/screen-q-svc.txt.gz?level=TRACE
16:09:35 please note that the L3 agent is down according to the neutron server
16:09:52 I am seeing the same pattern in at least two cases
16:10:35 why is the agent down?
16:10:38 do You know?
16:10:51 the agent is not actually down
16:10:56 it is running
16:11:09 but the server thinks it is down
16:11:20 and my next step is to investigate why
16:11:35 so please give me an action item for next week
16:11:43 sure
16:12:11 #action mlavalle to continue investigating why the L3 agent is considered down and causes trunk tests to fail
16:12:17 thx mlavalle for working on this
16:12:34 it is interesting why this agent is treated as down
16:12:45 yeah, in the same node
16:13:54 I wonder if the agent is not sending the heartbeat or neutron-server is not processing it properly
16:14:10 IMHO it's more likely that the agent is not sending it
16:14:24 as heartbeats from other agents are ok on the server
16:14:33 yes
16:14:39 but we will see when You check it :)
16:15:13 ok, so now I think we can move on to the next topic
16:15:16 right?
16:15:19 thanks
16:15:21 yes
16:15:25 #topic Python 3
16:15:37 Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:16:17 last week we merged 2 patches and now the neutron-tempest-linuxbridge and neutron-tempest-dvr jobs are running on python 3 and already using zuulv3 syntax
16:16:44 I plan to do the same with neutron-tempest-dvr-ha-multinode-full this week
16:17:10 Thanks
16:17:14 there is also the neutron-functional job which needs to be switched
16:17:24 and that is still problematic
16:17:38 This week I sent an email to the ML: http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001904.html
16:18:22 maybe someone more familiar with ostestr/subunit and python 3 will be able to help us with it
16:18:39 did you get any responses so far?
16:18:44 nope :/
16:19:34 I will ask tomorrow on the openstack-qa channel - maybe e.g. gmann will know who can help us with it
16:20:10 crossing fingers
16:20:10 good idea
16:20:54 unfortunately, except for that I have no other idea how to deal with this issue currently :/
16:21:02 and that's all related to python 3 from me
16:21:09 njohnston: do You have anything else to add?
16:21:21 how is it going with the grenade jobs switch?
16:21:49 yeah, pushing to add other limits to the log output does not sound like a viable fix :/ hopefully someone will fix the root cause and we get all our jobs py3-green
16:21:54 No, nothing for this week. I have been focused on bulk ports, but I should be able to move forward on the grenade work this week
16:22:15 njohnston: great, thx
16:22:28 #action njohnston Work on grenade job transition
16:22:34 thx :)
16:22:50 ok, so let's go to the next topic
16:22:51 #topic Grafana
16:23:00 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:24:54 there is a peak on many jobs today, I don't know exactly why, but:
16:25:16 1. I didn't find anything very bad in today's jobs,
16:25:45 2. as it happens even in the pep8 job, I think it could be some "generic" issue, not directly related to a neutron bug
16:26:29 apart from that, I think we again have one major issue
16:26:36 and this week it is the tempest-slow job :/
16:26:40 yeah
16:26:47 it is evident in Grafana
16:27:14 so, if You don't have anything else related to grafana, let's move on to talk about these tempest jobs now
16:27:16 fine?
16:28:40 ok, I take that as a yes :)
16:28:42 #topic Tempest/Scenario
16:29:05 We have one major issue with the tempest-slow job, it is described in the bug report https://bugs.launchpad.net/neutron/+bug/1812552
16:29:06 Launchpad bug 1812552 in neutron "tempest-slow tests fails often" [Critical,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:29:25 Today I was talking with sean-k-mooney about that
16:29:37 did he help?
16:29:40 nope
16:29:42 slaweq: for today's jobs there was some general post failure ~20h ago that probably did not help the graph (sorry, lagging as usual)
16:30:04 bcafarel: yes, that might be the reason :)
16:30:12 thanks anyway bcafarel
16:30:17 much appreciated
16:30:22 ok, so the tempest-slow job issue in detail:
16:30:46 1. from grafana and logstash it looks like this issue was somehow caused by https://review.openstack.org/#/c/631584/
16:31:14 2. I proposed a revert of this patch, https://review.openstack.org/#/c/631944/, and indeed this job passed 5 or 6 times already
16:31:32 in fact it didn't fail even once with this revert
16:31:47 which somehow confirms that this patch is the culprit
16:32:23 3. I talked with sean today and we thought that maybe this patch of mine introduced some additional race in the ovsdb monitor and how it handles port events on ovs bridges
16:32:54 so we are leaving interfaces un-plugged, right?
16:33:12 but what is strange for me is the fact that it causes issues only in this job, and only (at least where I was checking it) in two tests which shelve/unshelve an instance
16:33:21 mlavalle: not exactly
16:33:44 today I carefully investigated logs from one such job
16:33:55 (please read my last comment in the bug report)
16:34:08 and it looks to me like the port was configured properly
16:34:25 and communication, at least from the VM to the dhcp server, was fine
16:34:35 which puzzled me even more
16:35:07 but the VM doesn't get the dhcp offer, right?
16:35:09 I have no idea what is wrong there and how (if) this mentioned patch could break it
16:35:18 mlavalle: it looks so
16:35:31 but dnsmasq gets the DHCP Request from the VM
16:35:36 and sends the DHCP offer
16:35:42 so the DHCP request gets to dnsmasq
16:35:46 but it is somehow lost somewhere
16:36:07 could it be a problem with the flows?
16:36:17 possibly yes
16:36:29 especially since we are using the openvswitch fw driver there
16:36:31 we are dropping the offer
16:37:00 perhaps, that is
16:37:07 it can be
16:37:32 but basically I think that we should revert this patch to make tempest-slow into better shape
16:37:39 *to get
16:37:51 yes, let's pull the trigger
16:38:00 +1
16:38:05 worst case, we revert the revert ;-)
16:38:11 LOL
16:38:56 it's been some time since I saw a "revert revert revert revert revert ..." review
16:38:57 I know that sean is going to release a new os-vif version soon and it will include his revert of the patches which caused the port to be created twice during VM boot
16:39:15 so maybe this patch of mine will not really be needed
16:39:44 but as a sidenote I want to mention that I think I already saw some similar issues from time to time
16:40:04 I mean issues where the VM didn't have a configured IP address and because of that was not reachable
16:40:24 maybe it is the same issue and it just happens less often without this patch of mine
16:40:32 please just keep it in mind :)
16:40:44 ok, thanks for the clarification
16:41:18 ok, that's all from my side about tempest tests
16:41:24 anything else You want to add?
16:42:21 not me
16:42:25 ok, let's move on then
16:42:29 next topic
16:42:31 #topic fullstack/functional
16:42:48 today I found a new bug in functional tests \o/
16:42:57 https://bugs.launchpad.net/neutron/+bug/1812872
16:42:58 Launchpad bug 1812872 in neutron "Trunk functional tests can interact with each other " [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:42:58 yaay
16:43:18 fortunately so far we are lucky and it doesn't hit us in the gate
16:43:29 but I can reproduce it locally
16:43:42 and I hit it in my patch where I want to switch the functional job to python3
16:44:04 basically it is a race between two trunk related tests
16:44:04 ahh
16:44:15 there are details in the bug description
16:44:32 I have one idea how to "fix" it in an ugly but fast way
16:44:53 we can add a lockutils.synchronized() decorator to those 2 tests and it should work fine IMHO
16:45:04 what do You think about such a solution?
16:45:24 so I assume they share a resource
16:45:28 right?
16:45:37 that is why you need the lock
16:45:38 other than that it is probably doing something similar to fullstack tests, where we will need to monkey patch some functions
16:45:46 they don't share resources in fact
16:46:03 but each of them is listening to ovsdb monitor events
16:46:36 and such events aren't distinguished between tests
16:46:51 well, that's the shared resource, the stream of events
16:47:08 so one test has the handle_trunk_remove method mocked to not clean up the trunk bridge
16:47:30 but then the second test cleans this bridge as it gets the event from the ovsdb monitor
16:47:44 mlavalle: yes, so in that way they share resources
16:48:15 so is such a lock (with a comment) acceptable for You?
16:48:23 what do You think about it?
16:48:31 are trunk bridges vlan-specific? Would there be a way to plumb another fake vlan, thus making the trunk bridges in each test different from each other?
16:48:32 I don't see why not
16:48:52 njohnston: trunk bridges are different
16:49:06 they are created in the setUp() method for each test
16:49:43 but one of them can remove the trunk bridge of another one because it is triggered by an ovsdb event
16:49:43 so essentially another alternative is to come up with a way for each test to have its own stream of events
16:50:11 that was discussed some time ago, I think in the context of fullstack tests
16:50:15 so the ovsdb event does not include info on which bridge it is related to?
16:50:15 would that be a lot of work?
16:50:44 and we don't have (or at least we didn't have then) an easy way to say "listen only for events from this bridge"
16:51:04 we have a lot of work to do
16:51:11 but tbh now, when we switched ovsdb monitors to the native implementation thx to ralonsoh's work, maybe that would be possible
16:51:17 I can check it
16:51:21 cool
16:51:40 if this is too much to do, I will go for now with the lock and a TODO note on how it might be fixed in the fututr
16:51:43 *future
16:51:45 ok for You?
16:51:49 +1
16:51:50 if fixing this without locks is too labor intensive, let's go for the locks and leave a todo comment
16:52:01 mlavalle: ++ :)
16:52:31 #action slaweq to check if the new ovsdb monitor implementation can allow listening only for events from a specific bridge
16:53:11 ok, anything else You want to mention related to functional/fullstack tests?
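A minimal, self-contained sketch of the lockutils.synchronized() idea discussed above; the test class and method names are placeholders, not the real neutron trunk functional tests:

    # Sketch only: serialize the two racing trunk tests with a named
    # oslo.concurrency lock so one test's ovsdb-monitor event handling
    # cannot tear down the other test's trunk bridge mid-run.
    import unittest

    from oslo_concurrency import lockutils


    class TrunkInteractionTests(unittest.TestCase):
        """Placeholder for the two trunk functional tests that race."""

        @lockutils.synchronized('trunk-functional-tests')
        def test_trunk_lifecycle_first(self):
            # While this lock is held, the other test cannot run and so
            # cannot react to ovsdb events for this test's trunk bridge.
            self.assertTrue(True)

        @lockutils.synchronized('trunk-functional-tests')
        def test_trunk_lifecycle_second(self):
            self.assertTrue(True)


    if __name__ == '__main__':
        unittest.main()

Note that the default (internal) lock only serializes tests running inside the same worker process; if the two tests can land in different test-runner workers, an external file-based lock (external=True plus a configured lock_path) would be needed. The bridge-filtered ovsdb event approach from the action item above would avoid the lock entirely.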
16:53:17 not me
16:53:27 ok, let's move to the next topic then
16:53:29 #topic Periodic
16:53:56 for at least a few days we have had an issue with the openstack-tox-py27-with-oslo-master and openstack-tox-py35-with-oslo-master jobs
16:54:02 Example: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/openstack-tox-py27-with-oslo-master/c787964/testr_results.html.gz
16:54:12 it's only a link to one python27 job
16:54:19 but the error is exactly the same in python 35
16:54:35 I didn't report this bug in launchpad yet
16:54:48 is there any volunteer to report it on launchpad and fix it? :)
16:54:50 never seen one like that before
16:55:29 I'll take a look
16:55:52 njohnston: it's probably some change in the newest oslo_service lib and we need to adjust our code to it
16:55:55 thx njohnston
16:56:01 that's what I am thinking too
16:56:17 #action njohnston to take care of periodic UT job failures
16:56:23 thx
16:56:31 Thank You :)
16:56:41 ok, and that's all from me for today :)
16:56:52 do You want to talk about anything else quickly?
16:57:03 not me
16:57:39 ok, so thanks for attending guys
16:57:43 have a great week
16:57:46 o/
16:57:49 #endmeeting