16:00:05 <slaweq_> #startmeeting neutron_ci
16:00:06 <openstack> Meeting started Tue Sep 4 16:00:05 2018 UTC and is due to finish in 60 minutes. The chair is slaweq_. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <slaweq_> hi
16:00:11 <openstack> The meeting name has been set to 'neutron_ci'
16:00:11 <mlavalle> o/
16:00:14 <haleyb> hi
16:00:25 <njohnston> o/
16:00:49 <slaweq_> ok, let's start
16:00:55 <slaweq_> #topic Actions from previous meetings
16:01:10 <slaweq_> mlavalle to check other cases of the failing neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip test
16:01:15 <mlavalle> I did
16:01:26 <mlavalle> I checked other occurrences of this bug
16:01:40 <mlavalle> I found that it not only happened with the DNS integration job
16:01:47 <mlavalle> but in other jobs as well
16:02:04 <mlavalle> I also found that the evidence points to a problem with Nova
16:02:15 <mlavalle> which never resumes the instance
16:02:34 <slaweq_> did You talk with mriedem about that?
16:02:38 <mlavalle> it seems to me the compute manager is mixing up the events from Neutron and the virt layer
16:02:55 <mlavalle> I pinged mriedem earlier today....
16:03:12 <mlavalle> but he hasn't responded. I assume he is busy and will pong back later
16:03:27 <mlavalle> I will follow up with him
16:03:30 <slaweq_> ok, good that You are on it :)
16:03:32 <slaweq_> thx mlavalle
16:03:52 <slaweq_> #action mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:03:52 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:04:12 <slaweq_> I added the action just to not forget it next time :)
16:04:20 <mriedem> mlavalle: sorry,
16:04:22 <mlavalle> yeah, please keep me honest
16:04:23 <mriedem> forgot to reply
16:04:30 <mriedem> i'm creating stories for each project today
16:04:30 <slaweq_> hi mriedem
16:04:48 <mriedem> oh sorry, thought we were talking about the upgrade-checkers goal :)
16:05:12 <slaweq_> no, we are talking about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:12 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:05:13 <mlavalle> mriedem: when you have time, please look at https://bugs.launchpad.net/neutron/+bug/1788006
16:05:33 <mlavalle> mriedem: we think we need help from a Nova expert
16:05:42 * mriedem throws it on the pile
16:05:50 <mlavalle> mriedem: Thanks :-)
16:05:56 <slaweq_> thx mriedem
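For context on "the events from Neutron" mentioned above: when a port becomes ACTIVE, neutron-server reports network-vif-plugged to nova through the os-server-external-events API, and nova's compute manager has to reconcile that with what the virt driver reports before it resumes the instance and moves it out of BUILD/spawning. Below is a minimal sketch of such an event; it is an illustration only, not the actual neutron notifier code, and the credentials, URL and UUIDs are placeholders.

    # Sketch only, not neutron's notifier code; credentials, URL and UUIDs
    # below are placeholders.
    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client as nova_client

    auth = v3.Password(auth_url='http://controller/identity/v3',
                       username='neutron', password='secret',
                       project_name='service',
                       user_domain_id='default', project_domain_id='default')
    nova = nova_client.Client('2.1', session=session.Session(auth=auth))

    # One event per (instance, port); nova holds the boot in "spawning" until
    # it receives this (or hits its vif plugging timeout).
    nova.server_external_events.create([{
        'server_uuid': '11111111-2222-3333-4444-555555555555',  # instance UUID
        'name': 'network-vif-plugged',
        'status': 'completed',
        'tag': '66666666-7777-8888-9999-000000000000',          # port UUID
    }])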
16:06:14 <slaweq_> ok, so let's move on to next actions now
16:06:22 <slaweq_> mlavalle to check failing router migration from DVR tests
16:06:37 <mlavalle> I am working on that one right now
16:07:04 <mlavalle> the migrations failing are always from HA to {dvr, dvr-ha, legacy}
16:07:34 <mlavalle> it seems that when you set the admin_state_up to False in the router....
16:07:44 <slaweq_> yes, but that's related to another bug I think, see https://bugs.launchpad.net/neutron/+bug/1789434
16:07:45 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:07:45 <mlavalle> the "device_owner":"network:router_ha_interface" port never goes down
16:08:10 <mlavalle> yeah, that is what I am talking about
16:09:26 <mlavalle> I have confirmed that is the case for migration from HA to DVR
16:09:36 <mlavalle> and from HA to HA-DVR
16:10:24 <mlavalle> I expect the same to be happening for HA to legacy (or Coke Classic as I think of it)
16:10:33 <mlavalle> I intend to continue debugging
16:10:35 <slaweq_> but that check whether ports are down was introduced by me recently, and it was passing then IIRC
16:10:47 <mlavalle> it is not passing now
16:10:48 <slaweq_> so was there any other change recently which could cause such an issue?
16:10:59 <slaweq_> yes, I know it is not passing now :)
16:11:08 <mlavalle> I was about to ask the same....
16:11:27 <mlavalle> the port status remains ACTIVE all the time
16:11:53 <haleyb> so the port doesn't go down, or the interface in the namespace? as in two routers have active ha ports
16:12:13 <mlavalle> the test is checking for the status of the port
16:12:21 <slaweq_> mlavalle: haleyb: this is the patch which adds this check: https://review.openstack.org/#/c/589410/
16:12:27 <mlavalle> that's as far as I've checked so far
16:12:33 <slaweq_> the migration from HA tests were passing then
16:13:40 <haleyb> ok, so the port, strange
16:14:25 <slaweq_> yes, that is really strange
16:14:42 <mlavalle> well, that's what the test checks
16:14:46 <mlavalle> the status of the port
16:15:44 <slaweq_> yes, I added it to avoid a race condition during migration from legacy to ha, described in https://bugs.launchpad.net/neutron/+bug/1785582
16:15:44 <openstack> Launchpad bug 1785582 in neutron "Connectivity to instance after L3 router migration from Legacy to HA fails" [Medium,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:16:11 <mlavalle> I remember and logically it makes sense
16:16:16 <slaweq_> but if the router is set to admin_state_down, its ports should be set to DOWN as well, right?
16:16:53 <slaweq_> but mlavalle You can at least reproduce it manually, right?
16:17:11 <mlavalle> I haven't tried manually yet. That is my next step
16:17:14 <slaweq_> or is the port status fine during manual actions?
16:17:17 <slaweq_> ahh, ok
16:17:32 <slaweq_> so I will add it as an action for You for next week, ok?
16:17:39 <mlavalle> of course
16:17:50 <mlavalle> as I said earlier, keep me honest
16:18:33 <slaweq_> #action mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:18:33 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:18:36 <slaweq_> thx mlavalle
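For the manual reproduction mlavalle mentions above, here is a minimal openstacksdk sketch of what the migration test expects; the cloud name 'devstack-admin' and the router name 'router1' are placeholders.

    # Sketch of the manual check, not the tempest test itself.
    import openstack

    conn = openstack.connect(cloud='devstack-admin')   # placeholder cloud name
    router = conn.network.find_router('router1', ignore_missing=False)

    # Same first step as the migration test: take the router down before
    # changing its distributed/ha flags.
    conn.network.update_router(router, is_admin_state_up=False)

    # Expectation from bug 1789434: every router port, including the one with
    # device_owner "network:router_ha_interface", should report status DOWN.
    for port in conn.network.ports(device_id=router.id):
        print(port.device_owner, port.status)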
16:18:47 <slaweq_> slaweq to report a bug about timeouts in neutron-tempest-plugin-scenario-linuxbridge
16:18:51 <slaweq_> that was the next one
16:19:00 <slaweq_> I reported it: https://bugs.launchpad.net/neutron/+bug/1789579
16:19:00 <openstack> Launchpad bug 1788006 in neutron "duplicate for #1789579 Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:16 <slaweq_> but it's in fact a duplicate of the issue mentioned before: https://bugs.launchpad.net/neutron/+bug/1788006
16:19:16 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:22 <mlavalle> yeap
16:19:36 <slaweq_> so let's move forward
16:19:38 <slaweq_> the next one was
16:19:40 <slaweq_> mlavalle to ask Mike Bayer about functional db migration tests failures
16:19:58 <mlavalle> I also suspect that some of the failures we are getting in the DVR multinode job are also due to this issue with Nova
16:20:13 <mlavalle> and the others are due to router migration
16:20:28 <mlavalle> I conclude that from the results of my queries with Kibana
16:20:42 <mlavalle> now moving on to Mike Bayer
16:20:51 <mlavalle> I was going to ping him
16:21:09 <mlavalle> but the log that is pointed out in the bug doesn't exist anymore
16:21:30 <mlavalle> and I don't want to ping him if I don't have data for him to look at
16:21:50 <slaweq_> ok, I will update the bug report with a fresh log today and will ping You
16:21:55 <slaweq_> fine for You?
16:22:10 <mlavalle> so let's keep our eyes open for one of those failures
16:22:12 <mlavalle> as soon as I have one, I'll ping him
16:22:16 <mlavalle> yes
16:22:23 <slaweq_> ok, sounds good
16:22:25 <mlavalle> I will also update the bug if I find one
16:22:32 <slaweq_> if I have anything new, I will update the bug
16:22:33 <slaweq_> :)
16:22:45 <slaweq_> ok
16:22:50 <slaweq_> the last one from last week was
16:22:53 <slaweq_> slaweq to mark fullstack security group test as unstable again
16:22:57 <slaweq_> Done: https://review.openstack.org/597426
16:23:10 <slaweq_> it was merged recently, so fullstack tests should be in better shape now
16:23:49 <slaweq_> and that was all for actions from the previous week
16:23:56 <slaweq_> let's move to the next topic
16:23:58 <slaweq_> #topic Grafana
16:24:04 <slaweq_> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:05 <slaweq_> FYI: patch https://review.openstack.org/#/c/595763/ is merged, so the graphs now also include jobs which TIMED_OUT
16:26:12 <mlavalle> ok
16:26:15 <slaweq_> speaking about last week, there was a spike around 31.08 on many jobs
16:26:31 <slaweq_> it was related to an infra bug, fixed in https://review.openstack.org/591527
16:28:00 <njohnston> Just to note, I have a change to bump the "older" dashboards' versions and to bring them in line with slaweq_'s format: https://review.openstack.org/597168
16:28:04 <mlavalle> yeah I can see that
16:28:30 <slaweq_> thx njohnston for working on this :)
16:28:41 <mlavalle> ok, so in principle we shouldn't worry about the spike on 8/31
16:29:00 <slaweq_> yep
16:29:47 <slaweq_> and speaking of other things, I think we have one main problem now
16:29:59 <slaweq_> and it's the tempest/scenario jobs issue which we already talked about
16:30:26 <mlavalle> the nova issue?
16:30:32 <slaweq_> yes
16:30:49 <mlavalle> ok, I'll stay on top of it
16:30:55 <slaweq_> this is basically the most common issue I found when I was checking job results
16:31:27 <slaweq_> and it is quite painful as it happens randomly on many jobs, so You need to recheck jobs many times and each time a different job fails :/
16:31:45 <slaweq_> because of that it is also not very visible on graphs IMO
16:31:46 <mlavalle> but the same underlying problem
16:32:15 <slaweq_> yes, from what I was checking it looks like the culprit is the same
16:32:54 <slaweq_> ok, but let's talk about a few other issues which I also found :)
16:33:00 <slaweq_> #topic Tempest/Scenario
16:33:27 <slaweq_> so, apart from this "main" issue, I found a few times that the same shelve/unshelve test was failing again
16:33:32 <slaweq_> examples:
16:33:39 <slaweq_> * http://logs.openstack.org/71/516371/8/check/neutron-tempest-multinode-full/bb2ae6a/logs/testr_results.html.gz
16:33:41 <slaweq_> * http://logs.openstack.org/59/591059/3/check/neutron-tempest-multinode-full/72dbf69/logs/testr_results.html.gz
16:33:43 <slaweq_> * http://logs.openstack.org/72/507772/66/check/neutron-tempest-multinode-full/aff626c/logs/testr_results.html.gz
16:33:52 <slaweq_> I think mlavalle You were checking that in the past, right?
16:34:04 <mlavalle> yeah I think I was
16:34:25 <mlavalle> how urgent is it compared to the other two I am working on?
16:34:35 <slaweq_> not very urgent I think
16:34:52 <slaweq_> I found those 3 examples in the last 50 jobs or something like that
16:34:54 <mlavalle> ok, give me an action item, so I remember
16:35:09 <mlavalle> I don't want it to fall through the cracks
16:35:29 <mlavalle> on the understanding that I'll get to it after the others
16:35:46 <slaweq_> do You remember the bug report for that, maybe?
16:36:15 <mlavalle> slaweq_: don't worry, I'll find it. for the time being just mention the shelve / unshelve issue in the AI
16:36:29 <slaweq_> mlavalle: ok
16:36:55 <slaweq_> #action mlavalle to check the issue with the failing test_attach_volume_shelved_or_offload_server test
16:37:10 <mlavalle> thanks :-)
16:37:15 <slaweq_> thx mlavalle
16:37:19 <slaweq_> ok, let's move on
16:37:32 <slaweq_> I also found one more example of another error: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/testr_results.html.gz
16:37:42 <slaweq_> it was some error during FIP deletion
16:39:41 <slaweq_> but the REST API call was made to nova: 500 DELETE https://10.210.194.229/compute/v2.1/os-floating-ips/a06bbb30-1914-4467-9e75-6fc70d99427f
16:40:22 <slaweq_> and I can't find a FIP with that uuid in the nova-api logs or the neutron-server logs
16:40:33 <slaweq_> do You maybe know where this error may come from?
16:41:15 <mlavalle> mhhh
16:41:48 <mlavalle> so a06bbb30-1914-4467-9e75-6fc70d99427f is not in the neutron server log?
16:42:03 <slaweq_> no
16:42:13 <slaweq_> and there is also no ERROR at all in the neutron-server log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz?level=ERROR
16:42:29 <slaweq_> which is strange IMO because this error 500 came from somewhere, right?
16:42:37 <mlavalle> right
16:42:55 <slaweq_> it's only one such failed test but it's interesting for me :)
16:43:05 <mlavalle> indeed
16:43:41 <mlavalle> maybe our fips are plunging into a black hole
16:44:43 <slaweq_> in the tempest log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/tempest_log.txt
16:44:50 <slaweq_> it's visible in 3 places
16:44:54 <slaweq_> once when it's created
16:44:58 <slaweq_> and then when it's removed
16:45:03 <slaweq_> with error 500
16:45:25 * mlavalle looking
16:46:47 <mlavalle> yeah, but it responded 200 for the POST
16:47:43 <mlavalle> and the fip doesn't work either
16:48:09 <mlavalle> the ssh fails
16:48:20 <slaweq_> because it was not visible anywhere in neutron IMO so it couldn't work, right?
16:48:55 <mlavalle> yeah
16:49:13 <slaweq_> maybe that is also the culprit of other issues with non-working FIPs
16:49:37 <slaweq_> let's keep an eye on it and check whether similar things pop up in other examples
16:49:58 <mlavalle> ok
16:50:12 <slaweq_> next topic
16:50:14 <slaweq_> #topic Fullstack
16:50:22 <mlavalle> slaweq_: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_07_989880
16:51:13 <mlavalle> the fip uuid is in the neutron server log
16:51:43 <slaweq_> so I was searching too fast, and not the whole log was loaded in the browser :)
16:51:47 <slaweq_> sorry for that
16:51:50 <mlavalle> np
16:52:07 <mlavalle> I'll let you dig deeper
16:52:35 <slaweq_> but the delete was also fine: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_30_966952
16:53:09 <slaweq_> I will check those logs once again later
16:53:22 <mlavalle> right, so it seems the problem is more on the Nova side
16:53:33 <slaweq_> maybe :)
16:53:47 <slaweq_> let's get back to fullstack for now
16:53:52 <mlavalle> ok
16:54:19 <slaweq_> the test_securitygroups test is marked as unstable now, so it should be in at least a bit better shape than it was recently
16:54:34 <slaweq_> I will continue debugging this issue when I have some time
16:55:03 <slaweq_> I think that this is the most common issue in fullstack tests and, without this one failing, the other things should be in good shape
16:55:14 <mlavalle> great!
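For reference, "marked as unstable" means the test is wrapped in a decorator that turns any failure into a skip, so the job stays green while the bug is being debugged. Below is a minimal illustration of the pattern, not the exact helper used in the fullstack suite; the actual change for test_securitygroups is in https://review.openstack.org/597426, and the bug number below is a placeholder.

    # Illustration of the "unstable test" pattern only.
    import functools
    import unittest


    def unstable_test(reason):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as exc:
                    # A failure becomes a skip that records the tracked bug.
                    raise unittest.SkipTest('%s is unstable (%s), failure: %s'
                                            % (self.id(), reason, exc))
            return wrapper
        return decorator


    class SecurityGroupsTest(unittest.TestCase):

        @unstable_test("bug XXXXXXX")  # placeholder bug reference
        def test_securitygroup(self):
            self.fail("flaky failure")  # reported as a skip, not a failure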
16:55:34 <slaweq_> I have one more question about fullstack
16:55:34 <njohnston> excellent
16:55:45 <slaweq_> what do You think about switching fullstack-python35 to be voting?
16:55:58 <slaweq_> it's following the failure rate of the py27 job exactly: http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?panelId=22&fullscreen&orgId=1&from=now-7d&to=now
16:56:22 <njohnston> at this point should I change it to be fullstack-py36 before we commit it?
16:56:28 <slaweq_> or maybe we should switch neutron-fullstack to be py35 and add neutron-fullstack-python27 in parallel?
16:56:52 <slaweq_> njohnston: ok, so You want to switch it to py36 and check for a few more weeks how it goes?
16:57:07 <mlavalle> I like that plan
16:57:08 <njohnston> at least a week, yes
16:57:20 <slaweq_> that is a good idea IMO
16:57:35 <mlavalle> can we switch it this week?
16:58:04 <mlavalle> that way we get the week of the PTG as testing and if everything goes well, we can pull the trigger in two weeks
16:58:10 <njohnston> yes, I will get that change up today
16:58:16 <slaweq_> great, thx njohnston
16:58:34 <slaweq_> #action njohnston to switch the fullstack-python35 job to python36
16:58:50 <slaweq_> ok, let's move quickly to the last topic
16:58:55 <slaweq_> #topic Open discussion
16:59:07 <slaweq_> FYI: the QA team asked me to add a list of the slowest tests from neutron jobs to https://ethercalc.openstack.org/dorupfz6s9qt - they want to move such tests to the slow job
16:59:22 <mlavalle> ok
16:59:24 <slaweq_> Such a slow job for neutron is proposed in https://review.openstack.org/#/c/583847/
16:59:25 <njohnston> sounds smart
16:59:39 <slaweq_> I will try to check the longest tests in our jobs this week
17:00:00 <slaweq_> the second thing I want to announce is that I will cancel next week's meeting due to the PTG
17:00:06 <slaweq_> #endmeeting