16:00:26 <slaweq> #startmeeting neutron_ci
16:00:27 <openstack> Meeting started Tue Jul 23 16:00:26 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 <slaweq> hi
16:00:30 <openstack> The meeting name has been set to 'neutron_ci'
16:00:31 <ralonsoh> hi
16:00:31 <mlavalle> o/
16:02:01 <slaweq> I know that njohnston is quite busy with internal stuff now, haleyb and bcafarel are on PTO
16:02:09 <slaweq> so lets start
16:02:14 <slaweq> first of all:
16:02:18 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:34 <slaweq> and now lets go
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:44 <slaweq> first one:
16:02:46 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:03:04 <mlavalle> I did, although I changed my approach
16:03:10 <njohnston> o/
16:03:58 <mlavalle> since the merging of  https://review.opendev.org/#/c/667547/, the frequency of test_connectivity_through_2_routers has decreased significantly
16:04:33 <mlavalle> it still fails sometimes but many of those failures may be due to the slowness in metadata / nova we discussed yesterday
16:04:56 <mlavalle> so I did an analysis of the failures over the past 7 days of all the tests
16:05:05 <mlavalle> and came up with a ranking
16:05:25 <mlavalle> Please see note #5 in
16:05:28 <mlavalle> https://bugs.launchpad.net/neutron/+bug/1830763
16:05:29 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:24 <mlavalle> As you can see, the biggest offenders are test_qos_basic_and_update and the routers migrations
16:07:06 <mlavalle> digging into test_qos_basic_and_update, please read note #6
16:07:28 <mlavalle> most of the failures happen after updating the QoS policy / rule
16:08:00 <mlavalle> in other words, at that point we already have connectivity and the routers / dvr seem to be working fine
16:08:24 <slaweq> ok, so we have a couple of different issues there
16:08:53 <slaweq> because the issue with failing to get the instance-id from metadata also happens quite often in various tests
16:10:47 * mlavalle waiting for the rest of the comment
16:11:10 <slaweq> mlavalle: that's all from my side
16:11:16 <mlavalle> oh
16:11:19 <ralonsoh> mlavalle, I would like to see why the QoS check is failing
16:11:22 <slaweq> I just wanted to say that we have a few different issues
16:11:28 <ralonsoh> I'll review the logs
16:11:45 <slaweq> can we then report it as 3 different bugs:
16:12:03 <slaweq> 1. the already created bug - leave it related to the ssh and metadata issues,
16:12:05 <mlavalle> yeah, I wanted someone (really thinking ralonsoh) to check the failures in QoS
16:12:08 <slaweq> 2. qos test issue
16:12:15 <slaweq> 3. router migrations problems
16:12:22 <mlavalle> I don't think this QoS failure is purely due to dvr
16:12:29 <mlavalle> it must be the combination
16:12:43 <mlavalle> so if ralonsoh takes care of that one
16:12:44 <ralonsoh> I can take QoS (2)
16:12:59 <mlavalle> I will take care of bug 3 as described by slaweq
16:12:59 <openstack> bug 3 in mono (Ubuntu) "Custom information for each translation team" [Undecided,Fix committed] https://launchpad.net/bugs/3
16:13:09 <slaweq> LOL
16:13:38 <mlavalle> ralonsoh: would you file bug 2?
16:13:42 <ralonsoh> sure
16:13:47 <slaweq> thx guys
16:14:00 <slaweq> I can take a look once again at the issue with metadata
16:14:06 <mlavalle> if you do, I'll post there the Kibana search that you need to see all the occurrences
16:14:18 <slaweq> this one is IMO clearly related to dvr as I didn't see it in any other jobs
16:14:22 <mlavalle> ralonsoh: ^^^^
16:15:01 <ralonsoh> ok
16:15:23 <mlavalle> ok, I achieved what I wanted with my report today
16:15:59 <slaweq> thx mlavalle for update and for working on this
16:16:14 <slaweq> mlavalle: will You also report bug 2 as a new one?
16:16:21 <mlavalle> yes
16:16:24 <mlavalle> I will
16:16:24 <slaweq> thx a lot
16:16:35 <slaweq> #action mlavalle to report bug with router migrations
16:16:43 <mlavalle> and assign to me
16:16:54 <slaweq> #action ralonsoh to report bug with qos scenario test failures
16:17:14 <slaweq> #action slaweq to take a look at issue with dvr and metadata: https://bugs.launchpad.net/neutron/+bug/1830763
16:17:15 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:17:42 <slaweq> ok, I think we are good here and can move on
16:17:52 <slaweq> ralonsoh to try a patch to reduce the number of workers in FT
16:18:15 <ralonsoh> slaweq, and I didn't have time, sorry
16:18:23 * mlavalle will have to drop off at 30 minutes after the hour
16:18:25 <ralonsoh> my bad, I'll do this tomorrow morning
16:18:44 <ralonsoh> but I'll need help, because I really don't know where to modify this
16:18:51 <slaweq> ralonsoh: sure, no problem
16:18:52 <ralonsoh> in the neutron/tox.ini file?
16:19:04 <ralonsoh> in the zuul FT definition?
16:19:19 <slaweq> ralonsoh: I think that it should be defined in tox.ini
16:19:26 <slaweq> but I'm not 100% sure now
16:19:31 <ralonsoh> slaweq, that was my initial thought
16:19:39 <ralonsoh> I'll try it tomorrow
16:20:16 <slaweq> ok, thx
16:20:26 <slaweq> #action ralonsoh to try a patch to reduce the number of workers in FT
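For reference, a minimal sketch of the tox.ini option discussed above, assuming the functional testenv is named dsvm-functional and runs stestr (the actual env name, command line and worker count in neutron's tox.ini may differ); stestr's --concurrency flag caps the number of parallel test workers:

    [testenv:dsvm-functional]
    # cap the number of parallel test workers (the value here is illustrative)
    commands =
        stestr run --concurrency 2 {posargs}

The alternative mentioned above would be to set an equivalent value in the zuul job definition instead.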
16:20:36 <slaweq> so, next one
16:20:38 <slaweq> ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:20:52 <ralonsoh> in my list too, and no time
16:20:54 <ralonsoh> sorry again
16:21:15 <ralonsoh> really, I didn't have time last week
16:22:26 <slaweq> ralonsoh: no problem at all :)
16:22:31 <slaweq> it's not very urgent
16:22:35 <slaweq> #action ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:22:52 <slaweq> but please at least report a bug, so that we have it tracked
16:22:59 <ralonsoh> sure
16:23:00 <slaweq> maybe someone else will take a look
16:23:02 <slaweq> :)
16:23:13 <slaweq> thx
16:23:18 <slaweq> ok, and last one
16:23:20 <slaweq> slaweq to open bug about slow neutron-tempest-with-uwsgi job
16:23:26 <slaweq> I opened bug https://bugs.launchpad.net/neutron/+bug/1837552
16:23:27 <openstack> Launchpad bug 1837552 in neutron "neutron-tempest-with-uwsgi job finish with timeout very often" [Medium,Confirmed]
16:23:38 <slaweq> and I wrote there my initial findings
16:24:02 <slaweq> basically there is some issue with neutron API IMO but I'm not sure what's going on there exactly
16:25:38 <ralonsoh> slaweq, I saw that most of the time the test failing is the vrrp one
16:25:42 <ralonsoh> let me find the name
16:25:52 <slaweq> ralonsoh: but in which job?
16:26:07 <ralonsoh> hmmm, not in tempest.... sorry
16:26:42 <slaweq> sure
16:27:12 <slaweq> in this tempest job, it looks like at some point all tests start failing due to timeouts connecting to the neutron API
16:27:28 <slaweq> in apache logs I see HTTP 500 for each request related to neutron
16:27:39 <slaweq> but in neutron logs I didn't see any error
16:27:48 <slaweq> so I'm a bit confused with that
16:27:59 <slaweq> if I have some time, I will try to investigate it
16:28:27 <slaweq> but I didn't assign myself to this bug for now, maybe someone else will want to look into this
16:29:01 * mlavalle drops off for another meeting o/
16:29:09 <slaweq> see You mlavalle
16:29:20 <slaweq> ok, I think we can move on
16:29:23 <slaweq> next topic
16:29:24 <slaweq> #topic Stadium projects
16:29:32 <slaweq> Python 3 migration
16:29:34 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:29:41 <slaweq> I have only short update about this one
16:29:55 <slaweq> the last job in the fwaas repo has been switched to zuulv3 and python3
16:29:59 <slaweq> so we are good with this one
16:30:19 <slaweq> we still have some work to do in
16:30:36 <slaweq> networking-bagpipe, networking-midonet, networking-odl and python-neutronclient
16:30:41 <slaweq> and that would be all
16:31:03 <slaweq> we have patches or at least volunteers for all of them except midonet
16:31:13 <slaweq> so I think we are quite good with this
16:31:25 <slaweq> next part is:
16:31:27 <slaweq> tempest-plugins migration
16:31:28 <ralonsoh> slaweq, cool, we still support neutron-client
16:31:33 <slaweq> and I don't have any updates here
16:31:45 <slaweq> ralonsoh: sure, we are still supporting neutronclient
16:31:51 <slaweq> it's deprecated but supported
16:32:04 <slaweq> and amotoki is taking care of switching it to py3
16:33:14 <slaweq> any questions/other updates about stadium projects?
16:33:26 <ralonsoh> no
16:33:38 <slaweq> ok, lets move on quickly
16:33:40 <slaweq> #topic Grafana
16:33:47 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:45 <slaweq> looking at integrated tempest jobs, I think we are a bit better now, since we switched some jobs to only run neutron and nova related tests
16:35:26 <slaweq> even last week most of those jobs were in quite good shape
16:35:57 <slaweq> fullstack is also quite good now
16:36:04 <slaweq> functional tests are failing a bit
16:36:24 <slaweq> but I think it's mostly related to the issues which we talked about earlier today
16:37:36 <slaweq> anything else regarding grafana?
16:38:37 <slaweq> ok, lets move on
16:38:39 <slaweq> #topic fullstack/functional
16:38:55 <slaweq> I have only 2 things related to fullstack tests today
16:39:14 <slaweq> 1. I recently found one "new" issue and reported a bug: https://bugs.launchpad.net/neutron/+bug/1837380
16:39:15 <openstack> Launchpad bug 1837380 in neutron "Timeout while getting bridge datapath id crashes ovs agent" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:40:00 <slaweq> basically if some physical bridge is recreated and there is a timeout while getting the datapath id, neutron-ovs-agent will crash completely
16:40:11 <ralonsoh> ??
16:40:18 <ralonsoh> that means the bridge is created again?
16:40:28 <slaweq> correct ralonsoh
16:40:32 <ralonsoh> hmmm
16:40:46 <ralonsoh> maybe the default datapath_id is not correct
16:40:58 <ralonsoh> this must be different from the other bridges
16:41:11 <ralonsoh> and should be "something" not null
16:41:13 <slaweq> ralonsoh: please check logs: http://logs.openstack.org/81/671881/1/check/neutron-fullstack/c6b2e08/controller/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_north_south_traffic/neutron-openvswitch-agent--2019-07-22--07-48-17-892074_log.txt.gz#_2019-07-22_07_49_28_698
16:41:24 <slaweq> it was timeout while getting datapath_id
16:41:28 <ralonsoh> I see, yes
16:41:44 <ralonsoh> 10 secs to retrieve the datapath
16:41:49 <slaweq> in fullstack/functional jobs we have seen ovsdbapp timeouts and things like that from time to time
16:42:01 <slaweq> so my assumption is that a similar issue happened here
16:42:12 <slaweq> but this shouldn't cause a crash of the agent IMO
16:42:33 <slaweq> but, as You can see at http://logs.openstack.org/81/671881/1/check/neutron-fullstack/c6b2e08/controller/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_north_south_traffic/neutron-openvswitch-agent--2019-07-22--07-48-17-892074_log.txt.gz#_2019-07-22_07_49_28_756 it crashes
16:42:36 <ralonsoh> no, of course
16:42:59 <slaweq> I proposed a patch for that but will probably need to add some UT for it: https://review.opendev.org/672018
16:43:10 <ralonsoh> actually I was talking more about the second error
16:43:56 <slaweq> which one?
16:44:10 <ralonsoh> the one related to https://review.opendev.org/#/c/672018
16:44:17 <ralonsoh> I'll review your patch
16:44:37 <slaweq> yes, this one is related to this crash while getting datapath_id of bridge
16:44:41 <slaweq> it's one issue
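To illustrate the idea discussed above - tolerating a timeout while fetching the bridge datapath ID instead of letting it take down the agent - here is a minimal Python sketch; the exception class, method name and retry values are placeholders, not the actual neutron-ovs-agent code or the patch linked above:

    import time

    class DatapathIDTimeout(Exception):
        """Placeholder for the timeout raised by the OVSDB layer."""

    def get_datapath_id_with_retry(bridge, retries=3, interval=1):
        """Return the bridge datapath ID, or None if it keeps timing out.

        ``bridge`` is assumed to expose get_datapath_id(), which raises
        DatapathIDTimeout when the OVSDB query does not finish in time.
        """
        for attempt in range(1, retries + 1):
            try:
                return bridge.get_datapath_id()
            except DatapathIDTimeout:
                if attempt == retries:
                    # Give up for now; the caller can skip this bridge and
                    # retry on the next agent loop instead of crashing.
                    return None
                time.sleep(interval)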
16:45:53 <slaweq> ok, and the other thing which I have for today is
16:45:55 <slaweq> http://logs.openstack.org/77/670177/4/check/neutron-fullstack/56c8bb0/testr_results.html.gz
16:46:16 <slaweq> but I found this failure only once so far and IMO it may be related to the patch on which it was running
16:46:25 <slaweq> so I would just keep an eye on those tests :)
16:46:56 <slaweq> and that's all from me for today
16:47:18 <slaweq> I think we already talked about all other "hot" issues with tempest and functional tests
16:47:32 <slaweq> anything else You want to talk about today?
16:47:43 <ralonsoh> no
16:48:08 <slaweq> ok, so lets finish a bit earlier today
16:48:12 <slaweq> thx for attending
16:48:16 <slaweq> o/
16:48:16 <ralonsoh> bye!
16:48:19 <slaweq> #endmeeting