16:00:05 #startmeeting neutron_ci
16:00:06 Meeting started Tue Sep 4 16:00:05 2018 UTC and is due to finish in 60 minutes. The chair is slaweq_. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 hi
16:00:11 The meeting name has been set to 'neutron_ci'
16:00:11 o/
16:00:14 hi
16:00:25 o/
16:00:49 ok, let's start
16:00:55 #topic Actions from previous meetings
16:01:10 mlavalle to check other cases of the failing neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip test
16:01:15 I did
16:01:26 I checked other occurrences of this bug
16:01:40 I found that it not only happened with the DNS integration job
16:01:47 but in other jobs as well
16:02:04 also found that the evidence points to a problem with Nova
16:02:15 which never resumes the instance
16:02:34 did You talk with mriedem about that?
16:02:38 it seems to me the compute manager is mixing up the events from Neutron and the virt layer
16:02:55 I pinged mriedem earlier today....
16:03:12 but he hasn't responded. I assume he is busy and will pong back later
16:03:27 I will follow up with him
16:03:30 ok, good that You are on it :)
16:03:32 thx mlavalle
16:03:52 #action mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:03:52 Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:04:12 I added the action just so we don't forget it next time :)
16:04:20 mlavalle: sorry,
16:04:22 yeah, please keep me honest
16:04:23 forgot to reply
16:04:30 i'm creating stories for each project today
16:04:30 hi mriedem
16:04:48 oh sorry, thought we were talking about the upgrade-checkers goal :)
16:05:12 no, we are talking about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:12 Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:05:13 mriedem: when you have time, please look at https://bugs.launchpad.net/neutron/+bug/1788006
16:05:33 mriedem: we think we need help from a Nova expert
16:05:42 * mriedem throws it on the pile
16:05:50 mriedem: Thanks :-)
16:05:56 thx mriedem
16:06:14 ok, so let's move on to the next actions now
16:06:22 mlavalle to check failing router migration from DVR tests
16:06:37 I am working on that one right now
16:07:04 the migrations failing are always from HA to {dvr, dvr-ha, legacy}
16:07:34 it seems that when you set the admin_state_up to False in the router....
16:07:44 yes, but that's related to another bug I think, see https://bugs.launchpad.net/neutron/+bug/1789434
16:07:45 Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:07:45 the "device_owner":"network:router_ha_interface" never goes down
16:08:10 yeah, that is what I am talking about
16:09:26 I have confirmed that is the case for migration from HA to DVR
16:09:36 and from HA to HA-DVR
16:10:24 I expect the same to be happening for HA to legacy (or Coke Classic as I think of it)
16:10:33 I intend to continue debugging
16:10:35 but that check whether ports are down was introduced by me recently, and it was passing then IIRC
16:10:47 it is not passing now
16:10:48 so was there any other change recently which could cause such an issue?
16:10:59 yes, I know it is not passing now :)
16:11:08 I was about to ask the same....
16:11:27 the port status remains active all the time
16:11:53 so the port doesn't go down, or the interface in the namespace? as in two routers have active ha ports
16:12:13 the test is checking for the status of the port
16:12:21 mlavalle: haleyb: this is the patch which adds this check: https://review.openstack.org/#/c/589410/
16:12:27 that's as far as I've checked so far
16:12:33 migration from HA tests were passing then
16:13:40 ok, so the port, strange
16:14:25 yes, that is really strange
16:14:42 well, that's what the test checks
16:14:46 the status of the port
16:15:44 yes, I added it to avoid a race condition during migration from legacy to ha, described in https://bugs.launchpad.net/neutron/+bug/1785582
16:15:44 Launchpad bug 1785582 in neutron "Connectivity to instance after L3 router migration from Legacy to HA fails" [Medium,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:16:11 I remember and logically it makes sense
16:16:16 but if the router is set to admin_state_down, its ports should be set to down too, right?
16:16:53 but mlavalle You can at least reproduce it manually, right?
16:17:11 I haven't tried manually yet. That is my next step
16:17:14 or is the port status fine during manual actions?
16:17:17 ahh, ok
16:17:32 so I will add it as an action for You for next week, ok?
16:17:39 of course
16:17:50 as I said earlier, keep me honest
16:18:33 #action mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:18:33 Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:18:36 thx mlavalle
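The manual reproduction mlavalle plans above comes down to disabling an HA router and watching whether its ports (including the "network:router_ha_interface" ones) ever report DOWN. The following is only a rough sketch of that check, assuming openstacksdk, a clouds.yaml entry named "devstack-admin" and a router named "ha-router-under-test"; all of these names are placeholders, not from the meeting:

    # Sketch: disable an HA router and poll its ports' status (bug 1789434).
    import time

    import openstack

    conn = openstack.connect(cloud='devstack-admin')  # placeholder cloud name

    router = conn.network.find_router('ha-router-under-test')  # placeholder name
    conn.network.update_router(router, is_admin_state_up=False)

    deadline = time.time() + 60
    while time.time() < deadline:
        ports = list(conn.network.ports(device_id=router.id))
        if ports and all(p.status == 'DOWN' for p in ports):
            print('all router ports went DOWN as expected')
            break
        time.sleep(5)
    else:
        # This is what the test trips over: the router_ha_interface ports
        # keep reporting ACTIVE even with admin_state_up set to False.
        for p in ports:
            print(p.device_owner, p.status)

If the ports stay ACTIVE here as well, the problem is on the server side rather than in the tempest check added in the patch above.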
16:18:47 slaweq to report a bug about timeouts in neutron-tempest-plugin-scenario-linuxbridge
16:18:51 that was the next one
16:19:00 I reported it https://bugs.launchpad.net/neutron/+bug/1789579
16:19:00 Launchpad bug 1788006 in neutron "duplicate for #1789579 Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:16 but it's in fact a duplicate of the issue mentioned before https://bugs.launchpad.net/neutron/+bug/1788006
16:19:16 Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:22 yeap
16:19:36 so let's move forward
16:19:38 next was
16:19:40 mlavalle to ask Mike Bayer about functional db migration tests failures
16:19:58 I also suspect that some of the failures we are getting in the DVR multinode job are also due to this issue with Nova
16:20:13 and the others are due to router migration
16:20:28 I conclude that from the results of my queries with Kibana
16:20:42 now moving on to Mike Bayer
16:20:51 I was going to ping him
16:21:09 but the log that is pointed out in the bug doesn't exist anymore
16:21:30 and I don't want to ping him if I don't have data for him to look at
16:21:50 ok, I will update the bug report with a fresh log today and will ping You
16:21:55 fine for You?
16:22:10 so let's keep our eyes open for one of those failures
16:22:12 as soon as I have one, I'll ping him
16:22:16 yes
16:22:23 ok, sounds good
16:22:25 I will also update the bug if I find one
16:22:32 if I have anything new, I will update the bug
16:22:33 :)
16:22:45 ok
16:22:50 the last one from last week was
16:22:53 slaweq to mark fullstack security group test as unstable again
16:22:57 Done: https://review.openstack.org/597426
16:23:10 it merged recently so fullstack tests should be in better shape now
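For context on what "mark as unstable" means here: neutron's test base ships an unstable_test decorator that reports a failure of a known-flaky test as a skip, so the fullstack job is not dragged down while the race is being debugged. Below is a minimal, illustrative sketch of that pattern only; the class, method and reason string are placeholders, not the contents of the patch linked above:

    # Sketch of neutron's unstable_test pattern (placeholder test, not the real patch).
    from neutron.tests import base


    class TestSecurityGroupsFlaky(base.BaseTestCase):

        @base.unstable_test("fullstack security group race under investigation")
        def test_securitygroup_connectivity(self):
            # If the assertion below raises, unstable_test turns the failure
            # into a skip carrying the reason string, instead of failing the job.
            self.assertTrue(self._check_connectivity())

        def _check_connectivity(self):
            # Stand-in for the real fullstack connectivity check.
            return True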
16:23:49 and that was all the actions from the previous week
16:23:56 let's move to the next topic
16:23:58 #topic Grafana
16:24:04 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:05 FYI: patch https://review.openstack.org/#/c/595763/ is merged, so the graphs now also include jobs which were TIMED_OUT
16:26:12 ok
16:26:15 speaking about last week, there was a spike around 31.08 on many jobs
16:26:31 it was related to an infra bug, fixed in https://review.openstack.org/591527
16:28:00 Just to note, I have a change to bump the "older" dashboards' versions and to bring them in line with slaweq_'s format: https://review.openstack.org/597168
16:28:04 yeah I can see that
16:28:30 thx njohnston for working on this :)
16:28:41 ok, so in principle we shouldn't worry about the spike on 8/31
16:29:00 yep
16:29:47 and speaking of other things, I think we have one main problem now
16:29:59 and it's the tempest/scenario jobs issue which we already talked about
16:30:26 the nova issue?
16:30:32 yes
16:30:49 ok, I'll stay on top of it
16:30:55 this is basically the most common issue I found when I was checking job results
16:31:27 and it is quite painful as it happens randomly on many jobs, so You need to recheck many times and each time a different job fails :/
16:31:45 because of that it is also not very visible on the graphs IMO
16:31:46 but the same underlying problem
16:32:15 yes, from what I was checking it looks like that, the culprit is the same
16:32:54 ok, but let's talk about a few other issues which I also found :)
16:33:00 #topic Tempest/Scenario
16:33:27 so, apart from this "main" issue, I found a few times that the same shelve/unshelve test was failing again
16:33:32 examples:
16:33:39 * http://logs.openstack.org/71/516371/8/check/neutron-tempest-multinode-full/bb2ae6a/logs/testr_results.html.gz
16:33:41 * http://logs.openstack.org/59/591059/3/check/neutron-tempest-multinode-full/72dbf69/logs/testr_results.html.gz
16:33:43 * http://logs.openstack.org/72/507772/66/check/neutron-tempest-multinode-full/aff626c/logs/testr_results.html.gz
16:33:52 I think mlavalle You were checking that in the past, right?
16:34:04 yeah I think I was
16:34:25 how urgent is it compared to the other two I am working on?
16:34:35 not very urgent I think
16:34:52 I found those 3 examples in the last 50 jobs or something like that
16:34:54 ok, give me an action item, so I remember
16:35:09 I don't want it to fall through the cracks
16:35:29 on the understanding that I'll get to it after the others
16:35:46 do You remember the bug report for that maybe?
16:36:15 slaweq_: don't worry, I'll find it. for the time being just mention the shelve / unshelve issue in the AI
16:36:29 mlavalle: ok
16:36:55 #action mlavalle to check issue with failing test_attach_volume_shelved_or_offload_server test
16:37:10 thanks :-)
16:37:15 thx mlavalle
16:37:19 ok, let's move on
16:37:32 I also found one more example of another error: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/testr_results.html.gz
16:37:42 it was some error during FIP deletion
16:39:41 but the REST API call was made to nova: 500 DELETE https://10.210.194.229/compute/v2.1/os-floating-ips/a06bbb30-1914-4467-9e75-6fc70d99427f
16:40:22 and I can't find a FIP with that uuid in the nova-api logs nor the neutron-server logs
16:40:33 do You maybe know where this error may come from?
16:41:15 mhhh
16:41:48 so a06bbb30-1914-4467-9e75-6fc70d99427f is not in the neutron server log?
16:42:03 no
16:42:13 and also there is no ERROR in the neutron-server log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz?level=ERROR
16:42:29 which is strange IMO because this error 500 came from somewhere, right?
16:42:37 right
16:42:55 it's only one failed test like this, but it's interesting to me :)
16:43:05 indeed
16:43:41 maybe our fips are plunging into a black hole
16:44:43 in the tempest log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/tempest_log.txt
16:44:50 it's visible in 3 places
16:44:54 once when it's created
16:44:58 and then when it's removed
16:45:03 with error 500
16:45:25 * mlavalle looking
16:46:47 yeah, but it responded 200 for the POST
16:47:43 and the fip doesn't work either
16:48:09 the ssh fails
16:48:20 because it was not visible anywhere in neutron IMO, so it couldn't work, right?
16:48:55 yeah
16:49:13 maybe that is also the culprit of other issues with non-working FIPs
16:49:37 let's keep an eye on it and check whether similar things pop up in other examples
16:49:58 ok
16:50:12 next topic
16:50:14 #topic Fullstack
16:50:22 slaweq_: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_07_989880
16:51:13 the fip uuid is in the neutron server log
16:51:43 so I was searching too fast, and not the whole log was loaded in the browser :)
16:51:47 sorry for that
16:51:50 np
16:52:07 I'll let you dig deeper
16:52:35 but the delete was also fine: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_30_966952
16:53:09 I will check those logs once again later
16:53:22 right, so it seems the problem is more on the Nova side
16:53:33 maybe :)
16:53:47 let's get back to fullstack for now
16:53:52 ok
16:54:19 the test_securitygroups test is marked as unstable now so it should be in at least better shape than it was recently
16:54:34 I will continue debugging this issue when I have some time
16:55:03 I think that this is the most common issue in fullstack tests and without this one failing, the rest should be in good shape
16:55:14 great!
16:55:34 I have one more question about fullstack
16:55:34 excellent
16:55:45 what do You think about switching fullstack-python35 to be voting?
16:55:58 it follows exactly the failure rate of the py27 job: http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?panelId=22&fullscreen&orgId=1&from=now-7d&to=now
16:56:22 at this point should I change it to be fullstack-py36 before we commit to it?
16:56:28 or maybe we should switch neutron-fullstack to be py35 and add neutron-fullstack-python27 in parallel?
16:56:52 njohnston: ok, so You want to switch it to py36 and check for a few more weeks how it behaves?
16:57:07 I like that plan
16:57:08 at least a week, yes
16:57:20 that is a good idea IMO
16:57:35 can we switch it this week?
16:58:04 that way we get the week of the PTG as testing, and if everything goes well, we can pull the trigger in two weeks
16:58:10 yes, I will get that change up today
16:58:16 great, thx njohnston
16:58:34 #action njohnston to switch fullstack-python35 to python36 job
16:58:50 ok, let's move quickly to the last topic
16:58:55 #topic Open discussion
16:59:07 FYI: the QA team asked me to add a list of the slowest tests from neutron jobs to https://ethercalc.openstack.org/dorupfz6s9qt - they want to move such tests to the slow job
16:59:22 ok
16:59:24 Such a slow job for neutron is proposed in https://review.openstack.org/#/c/583847/
16:59:25 sounds smart
16:59:39 I will try to check the longest tests in our jobs this week
17:00:00 the second thing I want to announce is that I will cancel next week's meeting due to the PTG
17:00:06 #endmeeting
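For reference on the "slow job" mechanics discussed at the end of the meeting: in tempest-based suites a test is usually routed to a slow job by tagging it with the slow attribute, which the regular jobs then exclude. A rough sketch only, assuming the neutron-tempest-plugin scenario base class; the class name, test name and idempotent id below are placeholders:

    # Sketch: tagging a long-running scenario test as "slow" (placeholder names).
    from tempest.lib import decorators

    from neutron_tempest_plugin.scenario import base


    class TestLongRunningScenario(base.BaseTempestTestCase):

        @decorators.attr(type='slow')
        @decorators.idempotent_id('00000000-0000-0000-0000-000000000000')  # placeholder
        def test_expensive_connectivity_scenario(self):
            # A test carrying the 'slow' attribute is picked up by the slow
            # job's attribute filter and excluded from the regular runs.
            ...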