16:00:20 <slaweq> #startmeeting neutron_ci
16:00:21 <openstack> Meeting started Tue Aug 7 16:00:20 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:25 <openstack> The meeting name has been set to 'neutron_ci'
16:00:30 <slaweq> hi
16:00:33 <mlavalle> o/
16:01:00 <slaweq> hi mlavalle
16:01:10 <slaweq> let's wait a few minutes for others maybe
16:01:30 <mlavalle> did you throw a party during your home mancation?
16:01:40 <slaweq> no, I didn't
16:01:56 <mlavalle> LOL
16:02:27 <mlavalle> I wouldn't either. I live in perpetual fear of my wife
16:02:32 <slaweq> but funny thing was, last week I told my wife about this term "mancation" and I explained to her what it means
16:02:48 <slaweq> she then asked me when she will be on "womencation"
16:03:24 <slaweq> I told her that she already is (with kids) - she was a bit angry at me :D
16:03:49 <mlavalle> LOL
16:04:24 <slaweq> ok, I think we will start
16:04:31 <slaweq> #topic Actions from previous meetings
16:04:42 <slaweq> mlavalle to check dvr (dvr-ha) env and shelve/unshelve server
16:04:50 <mlavalle> I did
16:05:16 <mlavalle> Just filed this bug: https://bugs.launchpad.net/neutron/+bug/1785848
16:05:16 <openstack> Launchpad bug 1785848 in neutron "Neutron server producing tracebacks with 'L3RouterPlugin' object has no attribute 'is_distributed_router' when DVR is enabled" [Medium,New] - Assigned to Miguel Lavalle (minsel)
16:05:49 <mlavalle> In a nutshell, this patch https://review.openstack.org/#/c/579058 introduced a bug
16:06:06 <mlavalle> look at https://review.openstack.org/#/c/579058/3/neutron/db/l3_dvrscheduler_db.py@460
16:06:16 <slaweq> I noticed those errors in neutron logs this week and I wanted to ask about it later :)
16:06:43 <mlavalle> it is trying to call is_distributed_router in the l3_plugin
16:06:48 <slaweq> mlavalle: do You think we should do it for RC1 or it's not so important?
16:07:06 <mlavalle> it is probably good if we fix it
16:07:11 <slaweq> *do it => fix
16:07:25 <mlavalle> I'll push a patch at the end of this meeting
16:07:35 <mlavalle> let's see what haleyb thinks
16:07:36 <slaweq> great
16:07:39 <slaweq> thx
16:07:43 <haleyb> hi
16:07:47 <slaweq> hi haleyb
16:08:39 <slaweq> ok, let's go to the next one then
16:08:41 <slaweq> njohnston to tweak stable branches dashboards
16:08:53 <slaweq> I don't know if he did anything about that this week
16:09:21 <slaweq> let's check it next week, ok?
16:09:24 <mlavalle> he is off today
16:09:30 <mlavalle> and tomorrow as well
16:09:34 <slaweq> yes, I know
16:09:45 <mlavalle> I know because he came collecting debts late last week
16:10:11 <mlavalle> I got an action item from him last meeting
16:11:45 <slaweq> so do You think we can move it to next week to check then?
16:12:08 <mlavalle> you mean the stable dashboard?
16:12:12 <slaweq> yes
16:12:15 <mlavalle> yeah
16:12:27 <slaweq> #action njohnston to tweak stable branches dashboards
16:12:29 <slaweq> thx
16:12:37 <slaweq> next one:
16:12:39 <slaweq> slaweq to talk with infra about gaps in grafana graphs
16:12:43 <slaweq> I did
16:13:10 <slaweq> I talked with ianw, he explained to me that it was probably like that all the time; maybe some setting changed during the grafana upgrade and that's why it's visible now
16:13:43 <slaweq> We can change nullPointMode to connected mode to make it „not visible” in the dashboard config, but I don't know if we want to do that
16:13:55 <slaweq> what do You think? is it necessary to change?
16:16:19 <mlavalle> Mhhh
16:16:51 <mlavalle> I don't think it's worth it
16:17:10 <mlavalle> if it's been there all the time, maybe we just should get used to it
16:18:17 <slaweq> exactly, that's what I think
16:18:30 <slaweq> but wanted to ask You, as our boss ;)
16:18:51 <mlavalle> I am just another contributor with some additional duties
16:18:56 <slaweq> mlavalle: :)
16:19:06 <slaweq> ok, let's move on to the next one
16:19:08 <slaweq> slaweq to report a bug with failing neutron.tests.functional.db.migrations functional tests
16:19:24 <slaweq> I did it: https://bugs.launchpad.net/neutron/+bug/1784836
16:19:24 <openstack> Launchpad bug 1784836 in neutron "Functional tests from neutron.tests.functional.db.migrations fails randomly" [Medium,Confirmed]
16:19:46 <slaweq> but it's not assigned to anybody yet
16:19:55 <slaweq> next one:
16:19:57 <slaweq> * slaweq to check tempest delete_resource method
16:20:13 <slaweq> so I checked it and there is nothing to fix in tempest
16:20:22 <slaweq> It was caused by very slow processing of a DELETE network call by neutron: http://logs.openstack.org/01/584601/2/check/neutron-tempest-dvr/65f95f5/logs/tempest.txt.gz#_2018-07-27_22_42_11_577
16:20:29 <slaweq> Because of that, the client repeated the call after http_timeout, but by then the network was already deleted by neutron (from the first call) and that's the issue.
16:20:55 <slaweq> so we sometimes have a very slow API - I know that njohnston was working recently on some improvements for that
16:21:32 <mlavalle> should we discuss this in the PTG?
16:22:06 <slaweq> that's a good idea
16:22:27 <slaweq> it hits us in many tests and quite often, from what I can see in different jobs' results
16:22:38 <mlavalle> so I propose you add a topic to the etherpad and gather some data to discuss in Denver
16:22:59 <slaweq> ok, I will do that
16:23:06 <mlavalle> Thanks
16:23:15 <slaweq> #action slaweq add data about slow API to etherpad for PTG
16:23:27 <slaweq> so, next one
16:23:29 <slaweq> * slaweq to check with infra if virt_type=kvm is possible in gate jobs
16:23:40 <mlavalle> and you did
16:23:50 <mlavalle> I think I reviewed a patch last week
16:23:56 <slaweq> I again talked with ianw, clarkb and johnsom about it
16:24:13 <slaweq> yes, I added a patch last week to the scenario jobs which use zuulv3 syntax
16:24:40 <slaweq> and also https://review.openstack.org/#/c/588465/ to use it in "legacy" jobs
16:24:59 <johnsom> o/
16:25:02 <mlavalle> ahh cool
16:25:16 <slaweq> as johnsom said, they have been using it for quite a long time in octavia jobs and it works fine
16:25:21 <slaweq> hi johnsom :)
16:25:21 <mlavalle> hey johnsom :-)
16:26:01 <johnsom> Yeah, a kernel update to the nodepool images a few months back fixed the one issue we were seeing with guests crashing, otherwise it works great.
16:26:13 <johnsom> Saves up to 50 minutes on some of our gates
16:26:28 <mlavalle> wow, that's great
16:26:44 <slaweq> yes, for our linuxbridge scenario job I saw something like a 40 minute speed up IIRC
16:26:54 <slaweq> so it should be better I hope
16:26:54 <mlavalle> ++
16:27:24 <slaweq> ok, next one
16:27:26 <slaweq> * slaweq to report a bug with tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:31 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1784837
16:27:31 <openstack> Launchpad bug 1784837 in neutron "Test tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic fails in neutron-tempest-dvr-ha-multinode-full job" [Medium,Confirmed]
16:27:43 <slaweq> next one:
16:27:45 <slaweq> slaweq to send patch to add logging of console output in tests like tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:52 <slaweq> Patch for tempest: https://review.openstack.org/#/c/588884/ merged today
16:28:22 <slaweq> if I find such an issue again, maybe we will be able to check if the vm was booted properly and how fast it booted
16:28:37 <slaweq> ok, and finally the last one from last week:
16:28:38 <mlavalle> cool
16:28:39 <slaweq> * mlavalle will check with tc what jobs should be running with py3
16:28:57 <mlavalle> It's actually an ongoing conversation
16:29:15 <mlavalle> I am talking to smcginnis about it
16:29:40 <mlavalle> as soon as I finish that conversation, I'll report back
16:29:49 <slaweq> thx mlavalle
16:29:55 <slaweq> we will be waiting then :)
16:30:12 <mlavalle> that's the debt that njohnston collected ahead of time last week
16:30:27 <slaweq> ahh, ok :)
16:30:52 <slaweq> let's move on
16:30:58 <slaweq> next topic
16:31:00 <slaweq> #topic Grafana
16:31:08 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:10 <slaweq> and regarding grafana I want to propose one small (maybe) improvement
16:32:25 <slaweq> what do You think about adding a line with the total number of tests to the graphs
16:32:58 <slaweq> then we could see e.g. for UT that there was 100% of failures but there were e.g. only 2 tests run and those 2 failed
16:33:39 <slaweq> or a graph with the number of failed tests instead of (or together with) the % of failures
16:34:06 <mlavalle> yeah, that's actually an excellent idea
16:34:32 <mlavalle> because that way we can understand quickly whether we have a trend or just a few outliers failing
16:34:43 <slaweq> exactly
16:35:38 <slaweq> like now, e.g. on the UT graph it has a spike to 100% but we don't know if it was one WIP patch rechecked a few times or many patches failing for a reason not related to the change :)
16:36:14 <slaweq> but which do You think would be better? number of failures together with percentage on the same graph?
16:36:55 <mlavalle> sounds good as a first step
16:37:07 <mlavalle> we can experiment and tweak it as we learn more
16:37:12 <slaweq> ok, I will then try to add something like that this week
16:37:30 <slaweq> thx for blessing :)
16:38:03 <mlavalle> thanks for improving our Grafana reporting :-)
16:38:06 <slaweq> #action slaweq will add number of failures to grafana
16:38:27 <slaweq> I'm just trying to make my life easier :P
16:38:57 <slaweq> speaking about grafana, there are some spikes there from last week in a few jobs
16:39:18 <slaweq> but I was trying to check and note failures during the whole last week
16:39:36 <slaweq> and I didn't find as many failures related to neutron as the graphs suggest
16:39:50 <slaweq> (that's why I had this idea of improving the graphs)
16:39:57 <mlavalle> yeap
16:40:04 <mlavalle> it makes sense
16:40:18 <slaweq> I saw that many patches were merged quite quickly during last week
16:40:31 <mlavalle> that was my impression last week as well
16:40:37 <slaweq> so maybe we can talk about some failures which I found
16:40:58 <slaweq> let's first talk about voting jobs
16:41:00 <slaweq> #topic Integrated tempest
16:41:53 <slaweq> there were a couple of errors which look to me like slow neutron API or global timeouts (maybe also because of the slow API)
16:41:57 <slaweq> e.g.:
16:42:03 <slaweq> * http://logs.openstack.org/80/585180/3/check/tempest-full/1f62b43/testr_results.html.gz
16:42:09 <slaweq> * http://logs.openstack.org/31/585731/8/gate/tempest-full/ecac3d5/
16:42:18 <slaweq> * http://logs.openstack.org/59/585559/3/check/neutron-tempest-dvr/b4b7fda/job-output.txt.gz#_2018-08-01_08_45_22_477667
16:42:42 <slaweq> and one with a timeout and a "strange" exception: http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:44:10 <slaweq> this last one looks to me like some race but I don't know between what this race could be
16:44:19 <slaweq> I found it only once during last week
16:44:23 <slaweq> did You see it before?
16:44:30 <mlavalle> nope
16:44:46 <mlavalle> do you mean the HTTPSConnectionPool exception?
16:45:38 <slaweq> no
16:45:49 <slaweq> this one unfortunately is quite "common"
16:46:10 <slaweq> I was asking about the error in tearDownClass (tempest.api.network.test_ports.PortsIpV6TestJSON)
16:46:20 <slaweq> in http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:47:03 <mlavalle> that may be a faulty test
16:47:14 <mlavalle> that is not cleaning up after itself
16:47:33 <slaweq> but then it would be visible more often IMHO
16:47:55 <slaweq> as it happened only once, let's just be aware of this :)
16:47:57 <slaweq> ok?
16:48:13 <mlavalle> yeah
16:48:28 <mlavalle> I would check anyway if something changed with the test
16:48:43 <mlavalle> but I see your point
16:50:02 <slaweq> and those HTTPSConnectionPool exceptions are a different issue
16:50:09 <slaweq> look e.g. at http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/controller/logs/screen-q-svc.txt.gz?level=INFO#_Aug_06_16_44_12_685397
16:50:21 <slaweq> POST took 80 seconds :/
16:50:33 <mlavalle> dang
16:50:42 <slaweq> I will prepare more data like that for the PTG etherpad
16:51:07 <mlavalle> Great
16:51:22 <slaweq> because I think it happens quite often
16:52:02 <slaweq> other issues which I saw in those tempest integrated jobs weren't related to neutron - most often some errors with volumes
16:52:24 <slaweq> so let's move on to the next jobs
16:52:35 <slaweq> #topic Fullstack
16:52:50 <slaweq> I again found the test * neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork.test_securitygroup(linuxbridge-iptables) failing sometimes
16:53:08 <slaweq> it was always the linuxbridge scenario, never ovs, from what I have seen
16:53:26 <slaweq> and I spotted it only once in the neutron-fullstack job
16:53:38 <slaweq> but at least 5 or six times in neutron-fullstack-python35
16:53:49 <slaweq> py27: * http://logs.openstack.org/27/534227/16/check/neutron-fullstack/2a475f6/logs/testr_results.html.gz
16:54:02 <slaweq> py35: * http://logs.openstack.org/27/534227/16/check/neutron-fullstack-python35/f9e25c1/logs/testr_results.html.gz
16:55:41 <slaweq> so I don't know if that is only a coincidence or if something works differently in py35 and that triggers the issue more often
16:55:56 <slaweq> I will report a new issue for that and try to investigate it
16:55:59 <slaweq> fine for You?
16:56:20 <mlavalle> yes
16:56:55 <slaweq> #action slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:57:20 <slaweq> from the other jobs, I have a list of failures in scenario jobs
16:58:02 <slaweq> but it's mostly the same issues as every week - many issues with shelve/unshelve, issues with volumes, some issues with FIP connectivity (randomly)
16:58:15 <slaweq> but we don't have time to talk about them now I think
16:58:27 <mlavalle> just two things then
16:58:36 <slaweq> go on mlavalle
16:58:50 <mlavalle> "part of what Doug is driving now is that everything should run using py3 except for those jobs explicitly testing py2"
16:59:10 <mlavalle> a quote from smcginnis' guidance regarding py3 testing
16:59:26 <mlavalle> and next week I will be in China, so I won't be able to attend this meeting
17:00:38 <slaweq> any actions I should take on it next week?
17:00:48 <slaweq> or should we just wait?
17:00:56 <slaweq> ok, I think we are out of time now
17:01:03 <slaweq> thx mlavalle
17:01:10 <slaweq> let's talk more on the neutron channel
17:01:17 <slaweq> #endmeeting
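
(Editor's note: the retry-after-timeout race slaweq describes at 16:20:22-16:20:29 can be hard to picture from the log alone. Below is a minimal, hypothetical Python sketch of that failure mode, assuming a toy in-memory "server": a DELETE that the server processes very slowly, a client that gives up after its HTTP timeout and retries, and the retry then seeing 404 because the first call eventually completed. The names, timeouts, and threading model are illustrative only and are not the actual tempest or neutron code.)

# Hypothetical sketch (not real tempest/neutron code) of the race described
# at 16:20:22-16:20:29: the server handles a DELETE network call very slowly
# (tens of seconds in the linked logs), the client gives up after
# http_timeout and retries, and the retry gets 404 because the first DELETE
# finished in the meantime.
import threading
import time

networks = {"net-1"}            # pretend server-side state
lock = threading.Lock()

def server_delete(net_id, processing_time):
    """Server eventually deletes the network, but only after a long delay."""
    time.sleep(processing_time)
    with lock:
        networks.discard(net_id)

def client_delete(net_id, http_timeout=1.0):
    """Client sends DELETE; if no response within http_timeout, it retries."""
    first_call = threading.Thread(target=server_delete, args=(net_id, 2.0))
    first_call.start()                     # first DELETE is "in flight"
    first_call.join(timeout=http_timeout)  # client stops waiting here
    if first_call.is_alive():
        # Retry path: by the time the retry is processed, the first call may
        # already have removed the network, so the retry sees 404 and the
        # test's delete_resource helper reports a failure.
        first_call.join()                  # model the worst case: it finished
        with lock:
            return 204 if net_id in networks else 404
    return 204

print(client_delete("net-1"))   # prints 404 -> the "already deleted" failure

As the log notes, there was nothing to fix in tempest for this; the direction discussed above is to gather data on the slow neutron API for the PTG.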