16:00:20 #startmeeting neutron_ci
16:00:21 Meeting started Tue Aug 7 16:00:20 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:25 The meeting name has been set to 'neutron_ci'
16:00:30 hi
16:00:33 o/
16:01:00 hi mlavalle
16:01:10 let's wait a few minutes for others maybe
16:01:30 did you throw a party during your home mancation?
16:01:40 no, I didn't
16:01:56 LOL
16:02:27 I wouldn't either. I live in perpetual fear of my wife
16:02:32 but funny thing was, last week I told my wife about this term "mancation" and I explained to her what it means
16:02:48 she then asked me when she will be on "womencation"
16:03:24 I told her that she is already (with kids) - she was a bit angry at me :D
16:03:49 LOL
16:04:24 ok, I think we will start
16:04:31 #topic Actions from previous meetings
16:04:42 mlavalle to check dvr (dvr-ha) env and shelve/unshelve server
16:04:50 I did
16:05:16 Just filed this bug: https://bugs.launchpad.net/neutron/+bug/1785848
16:05:16 Launchpad bug 1785848 in neutron "Neutron server producing tracebacks with 'L3RouterPlugin' object has no attribute 'is_distributed_router' when DVR is enabled" [Medium,New] - Assigned to Miguel Lavalle (minsel)
16:05:49 In a nutshell, this patch https://review.openstack.org/#/c/579058 introduced a bug
16:06:06 look at https://review.openstack.org/#/c/579058/3/neutron/db/l3_dvrscheduler_db.py@460
16:06:16 I noticed those errors in neutron logs this week and I wanted to ask about it later :)
16:06:43 it is trying to call is_distributed_router in the l3_plugin
16:06:48 mlavalle: do You think we should do it for RC1 or it's not so important?
16:07:06 it is probably good if we fix it
16:07:11 *do it => fix
16:07:25 I'll push a patch at the end of this meeting
16:07:35 let's see what haleyb thinks
16:07:36 great
16:07:39 thx
16:07:43 hi
16:07:47 hi haleyb
16:08:39 ok, let's go to the next one then
16:08:41 njohnston to tweak stable branches dashboards
16:08:53 I don't know if he did anything about that this week
16:09:21 let's check it next week, ok?
16:09:24 he is off today
16:09:30 and tomorrow as well
16:09:34 yes, I know
16:09:45 I know because he came collecting debts late last week
16:10:11 I got an action item from him last meeting
16:11:45 so do You think we can move it to next week to check then?
16:12:08 you mean the stable dashboard?
16:12:12 yes
16:12:15 yeah
16:12:27 #action njohnston to tweak stable branches dashboards
16:12:29 thx
16:12:37 next one:
16:12:39 slaweq to talk with infra about gaps in grafana graphs
16:12:43 I did
16:13:10 I talked with ianw, he explained to me that it was probably like that all the time; maybe some setting changed during the grafana upgrade and that’s why it’s visible now
16:13:43 We can change nullPointMode to connected mode to make it „not visible” in dashboard config, but I don't know if we want to do that
16:13:55 what do You think? is it necessary to change it?
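For reference, a minimal sketch of the option being discussed, written in the grafyaml style used for the openstack-infra dashboards; it assumes grafyaml passes the key through under the same name as raw Grafana panel JSON ("nullPointMode"), and the panel title and graphite series below are placeholders, not entries from the real dashboard:

    # Illustrative panel only - the title and series path are placeholders.
    - title: Example failure rate graph
      type: graph
      span: 6
      # "connected" draws a line across missing datapoints instead of leaving a gap
      nullPointMode: connected
      targets:
        - target: "transformNull(some.zuul.series.FAILURE, 0)"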
16:16:19 Mhhh
16:16:51 I don't think it's worth it
16:17:10 if it's been there all the time, maybe we should just get used to it
16:18:17 exactly, that's what I think
16:18:30 but I wanted to ask You, as our boss ;)
16:18:51 I am just another contributor with some additional duties
16:18:56 mlavalle: :)
16:19:06 ok, let's move on to the next one
16:19:08 slaweq to report a bug with failing neutron.tests.functional.db.migrations functional tests
16:19:24 I did it: https://bugs.launchpad.net/neutron/+bug/1784836
16:19:24 Launchpad bug 1784836 in neutron "Functional tests from neutron.tests.functional.db.migrations fails randomly" [Medium,Confirmed]
16:19:46 but it's not assigned to anybody yet
16:19:55 next one:
16:19:57 * slaweq to check tempest delete_resource method
16:20:13 so I checked it and there is nothing to fix in tempest
16:20:22 It was caused by very slow processing of the DELETE network call by neutron: http://logs.openstack.org/01/584601/2/check/neutron-tempest-dvr/65f95f5/logs/tempest.txt.gz#_2018-07-27_22_42_11_577
16:20:29 Because of that the client repeated the call after http_timeout, but by then the network was already deleted by neutron (from the first call), and that’s the issue.
16:20:55 so we sometimes have a very slow API - I know that njohnston was recently working on some improvements for that
16:21:32 should we discuss this at the PTG?
16:22:06 that's a good idea
16:22:27 it hits us in many tests and quite often from what I can see in different jobs' results
16:22:38 so I propose you add a topic to the etherpad and gather some data to discuss in Denver
16:22:59 ok, I will do that
16:23:06 Thanks
16:23:15 #action slaweq add data about slow API to etherpad for PTG
16:23:27 so, next one
16:23:29 * slaweq to check with infra if virt_type=kvm is possible in gate jobs
16:23:40 and you did
16:23:50 I think I reviewed a patch last week
16:23:56 I talked again with ianw, clarkb and johnsom about it
16:24:13 yes, last week I added a patch to the scenario jobs which use zuulv3 syntax
16:24:40 and also https://review.openstack.org/#/c/588465/ to use it in "legacy" jobs
16:24:59 o/
16:25:02 ahh cool
16:25:16 as johnsom said, they have been using it for quite a long time in octavia jobs and it works fine
16:25:21 hi johnsom :)
16:25:21 hey johnsom :-)
16:26:01 Yeah, a kernel update to the nodepool images a few months back fixed the one issue we were seeing with guests crashing, otherwise it works great.
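A rough sketch of what the zuulv3-native side of this looks like, assuming the usual devstack job mechanism of passing local.conf settings via devstack_localrc; the job and parent names below are made up for illustration, and the actual changes are the reviews linked above:

    # Illustrative job definition only - names are placeholders, see the linked
    # reviews for the real patches. LIBVIRT_TYPE=kvm makes devstack configure
    # nova to use KVM instead of plain QEMU emulation on the test nodes.
    - job:
        name: neutron-tempest-scenario-example
        parent: devstack-tempest
        vars:
          devstack_localrc:
            LIBVIRT_TYPE: kvm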
16:26:13 Saves up to 50 minutes on some of our gates
16:26:28 wow, that's great
16:26:44 yes, for our linuxbridge scenario job I saw something like a 40 minute speedup IIRC
16:26:54 so it should be better I hope
16:26:54 ++
16:27:24 ok, next one
16:27:26 * slaweq to report a bug with tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:31 Done: https://bugs.launchpad.net/neutron/+bug/1784837
16:27:31 Launchpad bug 1784837 in neutron "Test tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic fails in neutron-tempest-dvr-ha-multinode-full job" [Medium,Confirmed]
16:27:43 next one:
16:27:45 slaweq to send patch to add logging of console output in tests like tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:52 Patch for tempest: https://review.openstack.org/#/c/588884/ merged today
16:28:22 if I find such an issue again, we will maybe be able to check if the vm was booted properly and how fast it booted
16:28:37 ok, and finally the last one from last week:
16:28:38 cool
16:28:39 * mlavalle will check with tc what jobs should be running with py3
16:28:57 It's actually an ongoing conversation
16:29:15 I am talking to smcginnis about it
16:29:40 as soon as I finish that conversation, I'll report back
16:29:49 thx mlavalle
16:29:55 we will be waiting then :)
16:30:12 that's the debt that njohnston collected ahead of time last week
16:30:27 ahh, ok :)
16:30:52 let's move on
16:30:58 next topic
16:31:00 #topic Grafana
16:31:08 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:10 and regarding grafana, I want to propose one small (maybe) improvement
16:32:25 what do You think about adding a line with the total number of test runs to the graphs
16:32:58 then we could see e.g. for UT that there was 100% failure but there were only e.g. 2 runs and those 2 failed
16:33:39 or a graph with the number of failed runs instead of (or together with) the % of failures
16:34:06 yeah, that's actually an excellent idea
16:34:32 because that way we can understand quickly whether we have a trend or just a few outliers failing
16:34:43 exactly
16:35:38 like now, e.g. the UT graph has a spike to 100% but we don't know if it was one WIP patch rechecked a few times or many patches failing for reasons not related to the change :)
16:36:14 but which do You think would be better? number of failures together with the percentage on the same graph?
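A rough sketch of the kind of panel this could become in the dashboard definition; the graphite series names below are placeholders (the real zuul statsd paths should be copied from the existing neutron-failure-rate dashboard), and asPercent/transformNull/sum/alias are standard graphite functions:

    # Illustrative panel only - <job-series> is a placeholder for the real
    # zuul statsd path used in the existing dashboard.
    - title: Unit tests failure rate and count (gate)
      type: graph
      span: 6
      targets:
        # existing style: failures as a percentage of all finished runs
        - target: "alias(asPercent(transformNull(<job-series>.FAILURE, 0), transformNull(sum(<job-series>.{SUCCESS,FAILURE}), 0)), 'failure %')"
        # proposed addition: raw number of failed runs, to tell "100% of 2 runs" from a real trend
        - target: "alias(transformNull(<job-series>.FAILURE, 0), 'failed runs')"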
16:36:55 sounds good as a first step
16:37:07 we can experiment and tweak it as we learn more
16:37:12 ok, then I will try to add something like that this week
16:37:30 thx for the blessing :)
16:38:03 thanks for improving our Grafana reporting :-)
16:38:06 #action slaweq will add number of failures to grafana
16:38:27 I'm just trying to make my life easier :P
16:38:57 speaking about grafana, there are some spikes there from last week in a few jobs
16:39:18 but I was trying to check and note failures during the whole last week
16:39:36 and I didn't find as many failures related to neutron as it looks from the graphs
16:39:50 (that's why I had this idea of improving the graphs)
16:39:57 yeap
16:40:04 it makes sense
16:40:18 I saw that many patches were merged quite quickly during last week
16:40:31 that was my impression last week as well
16:40:37 so maybe we can talk about some failures which I found
16:40:58 let's first talk about voting jobs
16:41:00 #topic Integrated tempest
16:41:53 there were a couple of errors which look to me like slow neutron API or global timeouts (maybe also because of the slow API)
16:41:57 e.g.:
16:42:03 * http://logs.openstack.org/80/585180/3/check/tempest-full/1f62b43/testr_results.html.gz
16:42:09 * http://logs.openstack.org/31/585731/8/gate/tempest-full/ecac3d5/
16:42:18 * http://logs.openstack.org/59/585559/3/check/neutron-tempest-dvr/b4b7fda/job-output.txt.gz#_2018-08-01_08_45_22_477667
16:42:42 and one with a timeout and a "strange" exception: http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:44:10 this last one looks to me like some race, but I don't know what this race could be between
16:44:19 I found it only once during the last week
16:44:23 did You see it before?
16:44:30 nope
16:44:46 do you mean the HTTPSConnectionPool exception?
16:45:38 no
16:45:49 this one unfortunately is quite "common"
16:46:10 I was asking about the error in tearDownClass (tempest.api.network.test_ports.PortsIpV6TestJSON)
16:46:20 in http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:47:03 that may be a faulty test
16:47:14 that is not cleaning up after itself
16:47:33 but then it would be visible more often IMHO
16:47:55 as it happened only once, let's just be aware of it :)
16:47:57 ok?
16:48:13 yeah
16:48:28 I would check anyways if something changed with the test
16:48:43 but I see your point
16:50:02 and those HTTPSConnectionPool exceptions are a different issue
16:50:09 look e.g. at http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/controller/logs/screen-q-svc.txt.gz?level=INFO#_Aug_06_16_44_12_685397
16:50:21 POST took 80 seconds :/
16:50:33 dang
16:50:42 I will prepare more data like that for the PTG etherpad
16:51:07 Great
16:51:22 because I think it happens quite often
16:52:02 the other issues which I saw in those integrated tempest jobs weren't related to neutron - most often some errors with volumes
16:52:24 so let's move on to the next jobs
16:52:35 #topic Fullstack
16:52:50 I again found the test neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork.test_securitygroup(linuxbridge-iptables) failing sometimes
16:53:08 it was always the linuxbridge scenario, never ovs, from what I've seen
16:53:26 and I spotted it only once in the neutron-fullstack job
16:53:38 but at least 5 or six times in neutron-fullstack-python35
16:53:49 py27: * http://logs.openstack.org/27/534227/16/check/neutron-fullstack/2a475f6/logs/testr_results.html.gz
16:54:02 py35: * http://logs.openstack.org/27/534227/16/check/neutron-fullstack-python35/f9e25c1/logs/testr_results.html.gz
16:55:41 so I don't know if that is only a coincidence or if something works differently with py35 and triggers the issue more often
16:55:56 I will report a new issue for that and try to investigate it
16:55:59 fine for You?
16:56:20 yes
16:56:55 #action slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:57:20 from the other jobs, I have a list of failures in the scenario jobs
16:58:02 but it's mostly the same issues as every week - many issues with shelve/unshelve, issues with volumes, some issues with FIP connectivity (randomly)
16:58:15 but we don't have time to talk about them now I think
16:58:27 just two things then
16:58:36 go on mlavalle
16:58:50 "part of what Doug is driving now is that everything should run using py3 except for those jobs explicitly testing py2"
16:59:10 a quote from smcginnis' guidance regarding py3 testing
16:59:26 and next week I will be in China, so I won't be able to attend this meeting
17:00:38 any actions I should take on it next week?
17:00:48 or should we just wait?
17:00:56 ok, I think we are out of time now
17:01:03 thx mlavalle
17:01:10 let's talk more on the neutron channel
17:01:17 #endmeeting