16:00:20 <slaweq> #startmeeting neutron_ci
16:00:21 <openstack> Meeting started Tue Aug  7 16:00:20 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:25 <openstack> The meeting name has been set to 'neutron_ci'
16:00:30 <slaweq> hi
16:00:33 <mlavalle> o/
16:01:00 <slaweq> hi mlavalle
16:01:10 <slaweq> let's wait a few minutes for others maybe
16:01:30 <mlavalle> did you throw a party during your home mancation?
16:01:40 <slaweq> no, I didn't
16:01:56 <mlavalle> LOL
16:02:27 <mlavalle> I wouldn't either. I live in perpetual fear of my wife
16:02:32 <slaweq> but the funny thing was, last week I told my wife about this term "mancation" and I explained to her what it means
16:02:48 <slaweq> she then asked me when she would be on "womencation"
16:03:24 <slaweq> I told her that she already is (with the kids) - she was a bit angry with me :D
16:03:49 <mlavalle> LOL
16:04:24 <slaweq> ok, I think we will start
16:04:31 <slaweq> #topic Actions from previous meetings
16:04:42 <slaweq> mlavalle to check dvr (dvr-ha) env and shelve/unshelve server
16:04:50 <mlavalle> I did
16:05:16 <mlavalle> Just filed this bug: https://bugs.launchpad.net/neutron/+bug/1785848
16:05:16 <openstack> Launchpad bug 1785848 in neutron "Neutron server producing tracebacks with 'L3RouterPlugin' object has no attribute 'is_distributed_router' when DVR is enabled" [Medium,New] - Assigned to Miguel Lavalle (minsel)
16:05:49 <mlavalle> In a nutshell, this patch https://review.openstack.org/#/c/579058 introduced a bug
16:06:06 <mlavalle> look at https://review.openstack.org/#/c/579058/3/neutron/db/l3_dvrscheduler_db.py@460
16:06:16 <slaweq> I noticed those errors in neutron logs this week and I wanted to ask about it later :)
16:06:43 <mlavalle> it is trying to call is_distributed_router in the l3_plugin
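For context, a minimal sketch of the failure mode described above, assuming the helper is a module-level function in neutron.db.l3_dvr_db as the traceback suggests; this is illustrative only, not the actual patch or its fix.

```python
# Hedged sketch of the AttributeError mlavalle describes (not the real diff).
from neutron.db import l3_dvr_db


def handle_router_update(l3plugin, router):
    # What the traceback points at: L3RouterPlugin exposes no such method,
    # so a call like this raises "'L3RouterPlugin' object has no attribute
    # 'is_distributed_router'" at runtime:
    #   if l3plugin.is_distributed_router(router): ...

    # Assumption: the helper lives at module level, so the likely fix is to
    # call it directly on the router dict instead of via the plugin object.
    if l3_dvr_db.is_distributed_router(router):
        pass  # ... DVR-specific handling would go here
```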
16:06:48 <slaweq> mlavalle: do You think we should fix it for RC1 or is it not so important?
16:07:06 <mlavalle> it is probably good if we fix it
16:07:25 <mlavalle> I'll push a patch at the end of this meeting
16:07:35 <mlavalle> let's see what haleyb thinks
16:07:36 <slaweq> great
16:07:39 <slaweq> thx
16:07:43 <haleyb> hi
16:07:47 <slaweq> hi haleyb
16:08:39 <slaweq> ok, let's go to the next one then
16:08:41 <slaweq> njohnston to tweak stable branches dashboards
16:08:53 <slaweq> I don't know if he did anything about that this week
16:09:21 <slaweq> let's check it next week, ok?
16:09:24 <mlavalle> he is off today
16:09:30 <mlavalle> and tomorrow as well
16:09:34 <slaweq> yes, I know
16:09:45 <mlavalle> I know because he came collecting debts late last week
16:10:11 <mlavalle> I got an action item from him last meeting
16:11:45 <slaweq> so do You think we can move it to next week and check it then?
16:12:08 <mlavalle> you mean the stable dashboard?
16:12:12 <slaweq> yes
16:12:15 <mlavalle> yeah
16:12:27 <slaweq> #action njohnston to tweak stable branches dashboards
16:12:29 <slaweq> thx
16:12:37 <slaweq> next one:
16:12:39 <slaweq> slaweq to talk with infra about gaps in grafana graphs
16:12:43 <slaweq> I did
16:13:10 <slaweq> I talked with ianw, he explained to me that it was probably like that all the time; maybe some setting changed during the grafana upgrade and that's why it's visible now
16:13:43 <slaweq> We can change nullPointMode to "connected" in the dashboard config to make the gaps "not visible", but I don't know if we want to do that
16:13:55 <slaweq> what do You think? is it necessary to change it?
16:16:19 <mlavalle> Mhhh
16:16:51 <mlavalle> I don't think it's worth it
16:17:10 <mlavalle> if it's been there all the time, maybe we just should get used to it
16:18:17 <slaweq> exactly, that's what I think too
16:18:30 <slaweq> but wanted to ask You, as our boss ;)
16:18:51 <mlavalle> I am just another contributor with some additional duties
16:18:56 <slaweq> mlavalle: :)
16:19:06 <slaweq> ok, let's move on to the next one
16:19:08 <slaweq> slaweq to report a bug with failing neutron.tests.functional.db.migrations functional tests
16:19:24 <slaweq> I did it: https://bugs.launchpad.net/neutron/+bug/1784836
16:19:24 <openstack> Launchpad bug 1784836 in neutron "Functional tests from neutron.tests.functional.db.migrations fails randomly" [Medium,Confirmed]
16:19:46 <slaweq> but it's not assigned to anybody yet
16:19:55 <slaweq> next one:
16:19:57 <slaweq> * slaweq to check tempest delete_resource method
16:20:13 <slaweq> so I checked it and there is nothing to fix in tempest
16:20:22 <slaweq> It was caused by very slow processing of the DELETE network call by neutron: http://logs.openstack.org/01/584601/2/check/neutron-tempest-dvr/65f95f5/logs/tempest.txt.gz#_2018-07-27_22_42_11_577
16:20:29 <slaweq> Because of that the client repeated the call after http_timeout, but by then the network was already deleted by neutron (from the first call) and that's the issue.
16:20:55 <slaweq> so sometimes we have a very slow API - I know that njohnston was working recently on some improvements for that
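To illustrate the race described above, here is a minimal sketch using plain requests rather than tempest's actual REST client (the function name and URL handling are made up): a retried DELETE can legitimately get a 404 when the first, slow call has already removed the network, so the 404 on the retry is not a real failure.

```python
import requests  # illustrative only; tempest uses its own rest_client


def delete_network(session: requests.Session, url: str, timeout: float = 60.0) -> None:
    """Delete a network, tolerating the slow-API race described above."""
    for _ in range(2):
        try:
            resp = session.delete(url, timeout=timeout)
        except requests.Timeout:
            # Neutron may still be processing the first call; retry once.
            continue
        if resp.status_code in (204, 404):
            # A 404 on the retry usually means the first (slow) call already
            # deleted the network, so treat it as success.
            return
        resp.raise_for_status()
```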
16:21:32 <mlavalle> should we discuss this in the PTG?
16:22:06 <slaweq> that's good idea
16:22:27 <slaweq> it hits us in many tests and quite often, from what I can see in different jobs' results
16:22:38 <mlavalle> so I propose you add a topic to the etherpad and gather some data to discuss in Denver
16:22:59 <slaweq> ok, I will do that
16:23:06 <mlavalle> Thanks
16:23:15 <slaweq> #action slaweq add data about slow API to etherpad for PTG
16:23:27 <slaweq> so, next one
16:23:29 <slaweq> * slaweq to check with infra if virt_type=kvm is possible in gate jobs
16:23:40 <mlavalle> and you did
16:23:50 <mlavalle> I think I reviewed a patch last week
16:23:56 <slaweq> I again talked with ianw, clarkb and johnsom about it
16:24:13 <slaweq> yes, last week I added a patch to the scenario jobs which use zuulv3 syntax
16:24:40 <slaweq> and also https://review.openstack.org/#/c/588465/ to use it in "legacy" jobs
16:24:59 <johnsom> o/
16:25:02 <mlavalle> ahh cool
16:25:16 <slaweq> as johnsom said, they have been using it in octavia jobs for quite a long time and it works fine
16:25:21 <slaweq> hi johnsom :)
16:25:21 <mlavalle> hey johnsom :-)
16:26:01 <johnsom> Yeah, a kernel update to the nodepool images a few months back fixed the one issue we were seeing with guests crashing, otherwise it works great.
16:26:13 <johnsom> Saves up to 50 minutes on some of our gates
16:26:28 <mlavalle> wow, that's great
16:26:44 <slaweq> yes, for our linuxbridge scenario job I saw something like a 40 minute speedup IIRC
16:26:54 <slaweq> so it should be better I hope
16:26:54 <mlavalle> ++
16:27:24 <slaweq> ok, next one
16:27:26 <slaweq> * slaweq to report a bug with tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:31 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1784837
16:27:31 <openstack> Launchpad bug 1784837 in neutron "Test tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic fails in neutron-tempest-dvr-ha-multinode-full job" [Medium,Confirmed]
16:27:43 <slaweq> next one:
16:27:45 <slaweq> slaweq to send patch to add logging of console output in tests like tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic
16:27:52 <slaweq> Patch for tempest: https://review.openstack.org/#/c/588884/  merged today
16:28:22 <slaweq> if I find such an issue again, maybe we will be able to check if the vm was booted properly and how fast it booted
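A rough sketch of what that kind of debug logging can look like in a tempest-style test, assuming the standard compute servers client is available; the merged tempest patch may do this differently.

```python
import logging

LOG = logging.getLogger(__name__)


def log_console_output(servers_client, server_id):
    """Dump the guest console so boot problems show up in the job logs."""
    try:
        # Assumption: servers_client is tempest's compute ServersClient, whose
        # get_console_output() returns a body with an 'output' key.
        output = servers_client.get_console_output(server_id)['output']
    except Exception:
        LOG.exception('Could not fetch console output for server %s', server_id)
        return
    LOG.debug('Console output for server %s:\n%s', server_id, output)
```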
16:28:37 <slaweq> ok, and finally last one from last week:
16:28:38 <mlavalle> cool
16:28:39 <slaweq> * mlavalle will check with tc what jobs should be running with py3
16:28:57 <mlavalle> It's actually an ongoing conversation
16:29:15 <mlavalle> I am talking to smcginnis about it
16:29:40 <mlavalle> as soon as I finish that conversation, I'll report back
16:29:49 <slaweq> thx mlavalle
16:29:55 <slaweq> we will be waiting then :)
16:30:12 <mlavalle> that's the debt that njohnston collected ahead of time last week
16:30:27 <slaweq> ahh, ok :)
16:30:52 <slaweq> let's move on
16:30:58 <slaweq> next topic
16:31:00 <slaweq> #topic Grafana
16:31:08 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:10 <slaweq> and regarding grafana I want to propose one small (maybe) improvement
16:32:25 <slaweq> what do You think about adding to the graphs a line with the total number of runs
16:32:58 <slaweq> then we could see e.g. for UT that there was a 100% failure rate but there were only e.g. 2 runs and those 2 failed
16:33:39 <slaweq> or a graph with the number of failed runs instead of (or together with) the % of failures
16:34:06 <mlavalle> yeah, that's actually an excellent idea
16:34:32 <mlavalle> because that way we can understand quickly whether we have a trend or just a few outliers failing
16:34:43 <slaweq> exactly
16:35:38 <slaweq> like now, e.g. the UT graph has a spike to 100% but we don't know if it was one WIP patch rechecked a few times or many patches failing for reasons not related to the change :)
16:36:14 <slaweq> but which do You think would be better? the number of failures together with the percentage on the same graph?
16:36:55 <mlavalle> sounds good as a first step
16:37:07 <mlavalle> we can experiment and tweak it as we learn more
16:37:12 <slaweq> ok, I will then try to add something like that this week
16:37:30 <slaweq> thx for blessing :)
16:38:03 <mlavalle> thanks for improving our Grafana reporting :-)
16:38:06 <slaweq> #action slaweq will add number of failures to grafana
16:38:27 <slaweq> I'm just trying to make my life easier :P
16:38:57 <slaweq> speaking of grafana, there are some spikes there from last week in a few jobs
16:39:18 <slaweq> but I was trying to check and note failures during the whole of last week
16:39:36 <slaweq> and I didn't find as many failures related to neutron as it looks from the graphs
16:39:50 <slaweq> (that's why I had this idea of improving graphs)
16:39:57 <mlavalle> yeap
16:40:04 <mlavalle> it makes sense
16:40:18 <slaweq> I saw that many patches were merged quite quickly last week
16:40:31 <mlavalle> that was my impression last week as well
16:40:37 <slaweq> so maybe we can talk about some failures which I found
16:40:58 <slaweq> let's first talk about voting jobs
16:41:00 <slaweq> #topic Integrated tempest
16:41:53 <slaweq> there were a couple of errors which look to me like slow neutron API or global timeouts (maybe also because of the slow API)
16:41:57 <slaweq> e.g.:
16:42:03 <slaweq> * http://logs.openstack.org/80/585180/3/check/tempest-full/1f62b43/testr_results.html.gz
16:42:09 <slaweq> * http://logs.openstack.org/31/585731/8/gate/tempest-full/ecac3d5/
16:42:18 <slaweq> * http://logs.openstack.org/59/585559/3/check/neutron-tempest-dvr/b4b7fda/job-output.txt.gz#_2018-08-01_08_45_22_477667
16:42:42 <slaweq> and one with a timeout and a "strange" exception: http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:44:10 <slaweq> this last one looks to me like some race but I don't know what this race could be between
16:44:19 <slaweq> I found it only once during last week
16:44:23 <slaweq> did You see it before?
16:44:30 <mlavalle> nope
16:44:46 <mlavalle> do you mean the HTTPSConnectionPool exception?
16:45:38 <slaweq> no
16:45:49 <slaweq> this one unfortunately is quite "common"
16:46:10 <slaweq> I was asking about the error in tearDownClass (tempest.api.network.test_ports.PortsIpV6TestJSON)
16:46:20 <slaweq> in http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/testr_results.html.gz
16:47:03 <mlavalle> that may be a faulty test
16:47:14 <mlavalle> that is not cleaning up after itself
16:47:33 <slaweq> but then it would be visible more often IMHO
16:47:55 <slaweq> as it happened only once, let's just be aware of it :)
16:47:57 <slaweq> ok?
16:48:13 <mlavalle> yeah
16:48:28 <mlavalle> I would check anyway if something changed with the test
16:48:43 <mlavalle> but I see your point
16:50:02 <slaweq> and those HTTPSConnectionPool exceptions are a different issue
16:50:09 <slaweq> look e.g. at http://logs.openstack.org/80/585180/3/gate/tempest-full-py3/5d2fb38/controller/logs/screen-q-svc.txt.gz?level=INFO#_Aug_06_16_44_12_685397
16:50:21 <slaweq> POST took 80 seconds :/
16:50:33 <mlavalle> dang
16:50:42 <slaweq> I will prepare more data like that for the PTG etherpad
16:51:07 <mlavalle> Great
16:51:22 <slaweq> because I think it happens quite often
16:52:02 <slaweq> other issues which I saw in those tempest integrated jobs weren't related to neutron - most often some errors with volumes
16:52:24 <slaweq> so let's move on to next jobs
16:52:35 <slaweq> #topic Fullstack
16:52:50 <slaweq> I again found the test neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork.test_securitygroup(linuxbridge-iptables) failing sometimes
16:53:08 <slaweq> it was always the linuxbridge scenario, never ovs, from what I've seen
16:53:26 <slaweq> and I spotted it only once in the neutron-fullstack job
16:53:38 <slaweq> but at least 5 or six times in neutron-fullstack-python35
16:53:49 <slaweq> py27:     * http://logs.openstack.org/27/534227/16/check/neutron-fullstack/2a475f6/logs/testr_results.html.gz
16:54:02 <slaweq> py35:     * http://logs.openstack.org/27/534227/16/check/neutron-fullstack-python35/f9e25c1/logs/testr_results.html.gz
16:55:41 <slaweq> so I don't know if that is only a coincidence or if something works differently in py35 and triggers the issue more often
16:55:56 <slaweq> I will report a new bug for that and try to investigate it
16:55:59 <slaweq> fine for You?
16:56:20 <mlavalle> yes
16:56:55 <slaweq> #action slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:57:20 <slaweq> from other jobs, I have list of failures in scenario jobs
16:58:02 <slaweq> but it's mostly the same issues as every week - many issues with shelve/unshelve, issues with volumes, some random issues with FIP connectivity
16:58:15 <slaweq> but we don't have time to talk about them now I think
16:58:27 <mlavalle> just two things then
16:58:36 <slaweq> go on mlavalle
16:58:50 <mlavalle> "part of what Doug is driving now is that everything should run using py3 except for those jobs explicitly testing py2"
16:59:10 <mlavalle> quote from smcginnis guidance regarding py3 testing
16:59:26 <mlavalle> and next week I will be in China, so I won't be able to attend this meeting
17:00:38 <slaweq> any actions I should take on it next week?
17:00:48 <slaweq> or should we just wait?
17:00:56 <slaweq> ok, I think we are out of time now
17:01:03 <slaweq> thx mlavalle
17:01:10 <slaweq> let's talk on neutron channel more
17:01:17 <slaweq> #endmeeting