16:00:26 <slaweq> #startmeeting neutron_ci
16:00:27 <openstack> Meeting started Tue Jul 23 16:00:26 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 <slaweq> hi
16:00:30 <openstack> The meeting name has been set to 'neutron_ci'
16:00:31 <ralonsoh> hi
16:00:31 <mlavalle> o/
16:02:01 <slaweq> I know that njohnston is quite busy with internal stuff now, haleyb and bcafarel are on PTO
16:02:09 <slaweq> so lets start
16:02:14 <slaweq> first of all:
16:02:18 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:34 <slaweq> and now lets go
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:44 <slaweq> first one:
16:02:46 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:03:04 <mlavalle> I did, although I changed my approach
16:03:10 <njohnston> o/
16:03:58 <mlavalle> since the merging of https://review.opendev.org/#/c/667547/, the frequency of test_connectivity_through_2_routers has decreased significantly
16:04:33 <mlavalle> it still fails sometimes but many of those failures may be due to the slowness in metadata / nova we discussed yesterday
16:04:56 <mlavalle> so I did an analysis of the failures over the past 7 days of all the tests
16:05:05 <mlavalle> and came up with a ranking
16:05:25 <mlavalle> Please see note #5 in
16:05:28 <mlavalle> https://bugs.launchpad.net/neutron/+bug/1830763
16:05:29 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:24 <mlavalle> As you can see, the biggest offenders are test_qos_basic_and_update and the routers migrations
16:07:06 <mlavalle> digging into test_qos_basic_and_update, please read note #6
16:07:28 <mlavalle> most of the failures happen after updating the QoS policy / rule
16:08:00 <mlavalle> in other words, at that point we already have connectivity and the routers / dvr seem to be working fine
16:08:24 <slaweq> ok, so we have couple of different issues there
16:08:53 <slaweq> because issue with failing to get instance-id from metadata also happens quite often in various tests
16:10:47 * mlavalle waiting for the rest of the comment
16:11:10 <slaweq> mlavalle: that's all from my side
16:11:16 <mlavalle> oh
16:11:19 <ralonsoh> mlavalle, I would like to see why the qos_check os failing
16:11:22 <slaweq> I just wanted to say that we have few different issues
16:11:24 <ralonsoh> is failing
16:11:28 <ralonsoh> I'll review the logs
16:11:45 <slaweq> can we than report it as 3 different bugs:
16:12:03 <slaweq> 1. already created bug - left it related to ssh and metadata issues,
16:12:05 <mlavalle> yeah, I wanted someone (really thinking ralonsoh) to check the failures in QoS
16:12:08 <slaweq> 2. qos test issue
16:12:15 <slaweq> 3. router migrations problems
16:12:22 <mlavalle> I don't think this QoS failure is purely due to dvr
16:12:29 <mlavalle> it must be the combination
16:12:43 <mlavalle> so if ralonsoh takes care of that one
16:12:44 <ralonsoh> I can take QoS (2)
16:12:59 <mlavalle> I will take care of bug 3 as described by slaweq
16:12:59 <openstack> bug 3 in mono (Ubuntu) "Custom information for each translation team" [Undecided,Fix committed] https://launchpad.net/bugs/3
16:13:09 <slaweq> LOL
16:13:38 <mlavalle> ralonsoh: would you file bug 2?
16:13:42 <ralonsoh> sure
16:13:47 <slaweq> thx guys
16:14:00 <slaweq> I can take a look once again on the issue with metadata
16:14:06 <mlavalle> if you do, I'll post there the Kibana search that you need to see all the occurrences
16:14:18 <slaweq> this one is IMO clearly related to dvr as I didn't saw it on any other jobs
16:14:22 <mlavalle> ralonsoh: ^^^^
16:15:01 <ralonsoh> ok
16:15:23 <mlavalle> ok, I achieved what I wanted with my report today
16:15:59 <slaweq> thx mlavalle for update and for working on this
16:16:14 <slaweq> mlavalle: will You also report bug 2 as new one?
16:16:21 <mlavalle> yes
16:16:24 <mlavalle> I will
16:16:24 <slaweq> thx a lot
16:16:35 <slaweq> #action mlavalle to report bug with router migrations
16:16:43 <mlavalle> and assign to me
16:16:54 <slaweq> #action ralonsoh to report bug with qos scenario test failures
16:17:14 <slaweq> #action slaweq to take a look at issue with dvr and metadata: https://bugs.launchpad.net/neutron/+bug/1830763
16:17:15 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:17:42 <slaweq> ok, I think we are good here and can move on
16:17:52 <slaweq> ralonsoh to try a patch to resuce the number of workers in FT
16:18:15 <ralonsoh> slaweq, and I didn't have time, sorry
16:18:23 * mlavalle will have to drop off at 30 minutes after the hour
16:18:25 <ralonsoh> my bad, I'll do this tomorrow morning
16:18:44 <ralonsoh> but I'll need help, because I really don't know where to modify this
16:18:51 <slaweq> ralonsoh: sure, no problem
16:18:52 <ralonsoh> in the neutron/tox.ini file?
16:19:04 <ralonsoh> in the zuul FT definition?
16:19:19 <slaweq> ralonsoh: I think that it should be defined in tox.ini
16:19:26 <slaweq> but I'm not 100% sure now
16:19:31 <ralonsoh> slaweq, that was my initial though
16:19:39 <ralonsoh> I'll try it tomorrow
16:20:16 <slaweq> ok, thx
16:20:26 <slaweq> #action ralonsoh to try a patch to resuce the number of workers in FT
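[Editor's note on the tox.ini question above: a minimal sketch of where the functional-test worker count could be capped, assuming the functional env runs its tests through stestr. The env name, deps and concurrency value are illustrative, not the actual neutron settings.]

    # Hypothetical fragment only -- not the real neutron tox.ini.
    # stestr's --concurrency flag limits the number of parallel
    # test workers; other env settings (deps, setenv) are omitted.
    [testenv:dsvm-functional]
    commands =
        stestr run --concurrency 2 {posargs}

[Capping the value in tox.ini keeps the limit with the test environment itself; the alternative raised (the Zuul job definition) is not sketched here.]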
16:20:36 <slaweq> so, next one
16:20:38 <slaweq> ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:20:52 <ralonsoh> in my list too, and no time
16:20:54 <ralonsoh> sorry again
16:21:15 <ralonsoh> really, I dind't have time last week
16:22:26 <slaweq> ralonsoh: no problem at all :)
16:22:31 <slaweq> it's not very urgent
16:22:35 <slaweq> #action ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:22:52 <slaweq> but please at least report a bug, that we have it tracked
16:22:59 <ralonsoh> sure
16:23:00 <slaweq> maybe someone else will take a look
16:23:02 <slaweq> :)
16:23:13 <slaweq> thx
16:23:18 <slaweq> ok, and last one
16:23:20 <slaweq> slaweq to open bug about slow neutron-tempest-with-uwsgi job
16:23:26 <slaweq> I opened bug https://bugs.launchpad.net/neutron/+bug/1837552
16:23:27 <openstack> Launchpad bug 1837552 in neutron "neutron-tempest-with-uwsgi job finish with timeout very often" [Medium,Confirmed]
16:23:38 <slaweq> and I wrote there my initial findings
16:24:02 <slaweq> basically there is some issue with neutron API IMO but I'm not sure what's going on there exactly
16:25:38 <ralonsoh> slaweq, I saw that most of the time the test failing is the vrrp one
16:25:42 <ralonsoh> let me find the name
16:25:52 <slaweq> ralonsoh: but in which job?
16:26:07 <ralonsoh> hmmm, not in tempest.... osrry
16:26:09 <ralonsoh> sorry
16:26:42 <slaweq> sure
16:27:12 <slaweq> in this tempest job, it is like at some point all tests are failing due to timeouts connecting to neutron API
16:27:28 <slaweq> in apache logs I see HTTP 500 for each request related to neutron
16:27:39 <slaweq> but in neutron logs I didn't saw any error
16:27:48 <slaweq> so I'm a bit confused with that
16:27:59 <slaweq> if I will have some time, I will try to investigate it
16:28:27 <slaweq> but I didn't assign myself to this bug for now, maybe there will be someone else who will wants to look into this
16:29:01 * mlavalle drops off for another meeting o/
16:29:09 <slaweq> see You mlavalle
16:29:20 <slaweq> ok, I think we can move on
16:29:23 <slaweq> next topic
16:29:24 <slaweq> #topic Stadium projects
16:29:32 <slaweq> Python 3 migration
16:29:34 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:29:41 <slaweq> I have only short update about this one
16:29:55 <slaweq> last job in fwaas repo is switched to zuulv3 and python3
16:29:59 <slaweq> so we are good with this one
16:30:19 <slaweq> we still have some work to do in
16:30:36 <slaweq> networking-bagpipe, networking-midonet, networking-odl and python-neutronclient
16:30:41 <slaweq> and that would be all
16:31:03 <slaweq> we have patches or at least volunteers for all of them except midonet
16:31:13 <slaweq> so I think we are quite good with this
16:31:25 <slaweq> next part is:
16:31:27 <slaweq> tempest-plugins migration
16:31:28 <ralonsoh> slaweq, cool, we still support neutron-client
16:31:33 <slaweq> and I don't have any updates here
16:31:45 <slaweq> ralonsoh: sure, we are still supporting neutronclient
16:31:51 <slaweq> it's deprecated but supported
16:32:04 <slaweq> and amotoki is taking care of switching it to py3
16:33:14 <slaweq> any questions/other updates about stadium projects?
16:33:26 <ralonsoh> no
16:33:38 <slaweq> ok, lets move on quickly
16:33:40 <slaweq> #topic Grafana
16:33:47 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:34:45 <slaweq> looking at integrated tempest jobs, I think we are a bit better now, since we switched some jobs to only run neutron and nova related tests
16:35:26 <slaweq> even last week most of those jobs were in quite good shape
16:35:57 <slaweq> fullstack is also quite good now
16:36:04 <slaweq> functiona test are failing a bit
16:36:24 <slaweq> but I think it's mostly related to the issues which we talked about earlier today
16:37:36 <slaweq> anything else regarding grafana?
16:38:37 <slaweq> ok, lets move on
16:38:39 <slaweq> #topic fullstack/functional
16:38:55 <slaweq> I have only 2 things related to fullstack tests today
16:39:14 <slaweq> 1. I recently found one "new" issue and reported a bug: https://bugs.launchpad.net/neutron/+bug/1837380
16:39:15 <openstack> Launchpad bug 1837380 in neutron "Timeout while getting bridge datapath id crashes ova agent" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:40:00 <slaweq> basically if some physical bridge is recreated and there will be timeout while getting datapath id, neutron-ovs-agent will crash completly
16:40:11 <ralonsoh> ??
16:40:18 <ralonsoh> that means the bridge is created again?
16:40:28 <slaweq> correct ralonsoh
16:40:32 <ralonsoh> hmmm
16:40:46 <ralonsoh> maybe the default datapath_is is not correct
16:40:58 <ralonsoh> this must be different from the other bridges
16:41:11 <ralonsoh> and should be "something" not null
16:41:13 <slaweq> ralonsoh: please check logs: http://logs.openstack.org/81/671881/1/check/neutron-fullstack/c6b2e08/controller/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_north_south_traffic/neutron-openvswitch-agent--2019-07-22--07-48-17-892074_log.txt.gz#_2019-07-22_07_49_28_698
16:41:24 <slaweq> it was timeout while getting datapath_id
16:41:28 <ralonsoh> I see, yes
16:41:44 <ralonsoh> 10 secs to retrieve the datapath
16:41:49 <slaweq> in fullstack/functional jobs we have seen from time to time e.g. ovsdbapp timeouts and things like that
16:42:01 <slaweq> so my assumption is that similar issue happend here
16:42:12 <slaweq> but this shouldn't cause crash of agent IMO
16:42:33 <slaweq> but, as You can see at http://logs.openstack.org/81/671881/1/check/neutron-fullstack/c6b2e08/controller/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_north_south_traffic/neutron-openvswitch-agent--2019-07-22--07-48-17-892074_log.txt.gz#_2019-07-22_07_49_28_756 it crashes
16:42:36 <ralonsoh> no, of course
16:42:59 <slaweq> I proposed some patch for that but will need to add some UT for that probably: https://review.opendev.org/672018
16:43:10 <ralonsoh> actually I was talking more about the second error
16:43:56 <slaweq> which one?
16:44:10 <ralonsoh> the one related to https://review.opendev.org/#/c/672018
16:44:17 <ralonsoh> I'll review your patch
16:44:37 <slaweq> yes, this one is related to this crash while getting datapath_id of bridge
16:44:41 <slaweq> it's one issue
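[Editor's note on the crash discussed above: a minimal sketch of the kind of defensive handling being described, assuming the bridge object exposes a get_datapath_id() call and that the ovsdb timeout surfaces as a RuntimeError. The helper name, exception choice and retry parameters are illustrative and are not the contents of https://review.opendev.org/672018.]

    import logging
    import time

    LOG = logging.getLogger(__name__)


    def get_datapath_id_with_retry(bridge, retries=3, interval=1):
        # Illustrative sketch: keep the agent alive if reading the
        # datapath id times out, instead of letting the exception
        # propagate and kill the agent's main loop.
        for attempt in range(1, retries + 1):
            try:
                return bridge.get_datapath_id()
            except RuntimeError:
                LOG.warning("Timeout getting datapath id from bridge %s "
                            "(attempt %d/%d)", bridge, attempt, retries)
                time.sleep(interval)
        # Give up for now; the next agent iteration can try again.
        return None

[The actual fix may differ in detail; the point, as described above, is that a transient ovsdb timeout should degrade to a retry rather than crash the neutron-ovs-agent.]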
16:45:53 <slaweq> ok, and the other thing which I have for today is
16:45:55 <slaweq> http://logs.openstack.org/77/670177/4/check/neutron-fullstack/56c8bb0/testr_results.html.gz
16:46:16 <slaweq> but I found this failure only once so far and IMO it may be related to the patch on which it was running
16:46:25 <slaweq> so I would just keep an eye on those tests :)
16:46:56 <slaweq> and that's all from my for today
16:47:18 <slaweq> I think we already talked about all other "hot" issues with tempest and functional tests
16:47:32 <slaweq> anything else You want to talk about today?
16:47:43 <ralonsoh> no
16:48:08 <slaweq> ok, so lets finish a bit earlier today
16:48:12 <slaweq> thx for attending
16:48:16 <slaweq> o/
16:48:16 <ralonsoh> bye!
16:48:19 <slaweq> #endmeeting