16:00:05 <slaweq_> #startmeeting neutron_ci
16:00:06 <openstack> Meeting started Tue Sep 4 16:00:05 2018 UTC and is due to finish in 60 minutes. The chair is slaweq_. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <slaweq_> hi
16:00:11 <openstack> The meeting name has been set to 'neutron_ci'
16:00:11 <mlavalle> o/
16:00:14 <haleyb> hi
16:00:25 <njohnston> o/
16:00:49 <slaweq_> ok, let's start
16:00:55 <slaweq_> #topic Actions from previous meetings
16:01:10 <slaweq_> mlavalle to check other cases of the failing neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip test
16:01:15 <mlavalle> I did
16:01:26 <mlavalle> I checked other occurrences of this bug
16:01:40 <mlavalle> I found that it not only happened with the DNS integration job
16:01:47 <mlavalle> but in other jobs as well
16:02:04 <mlavalle> I also found that the evidence points to a problem with Nova
16:02:15 <mlavalle> which never resumes the instance
16:02:34 <slaweq_> did You talk with mriedem about that?
16:02:38 <mlavalle> it seems to me the compute manager is mixing up the events from Neutron and the virt layer
16:02:55 <mlavalle> I pinged mriedem earlier today....
16:03:12 <mlavalle> but he hasn't responded. I assume he is busy and will pong back later
16:03:27 <mlavalle> I will follow up with him
16:03:30 <slaweq_> ok, good that You are on it :)
16:03:32 <slaweq_> thx mlavalle
16:03:52 <slaweq_> #action mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:03:52 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:04:12 <slaweq_> I added the action just to not forget it next time :)
16:04:20 <mriedem> mlavalle: sorry,
16:04:22 <mlavalle> yeah, please keep me honest
16:04:23 <mriedem> forgot to reply
16:04:30 <mriedem> i'm creating stories for each project today
16:04:30 <slaweq_> hi mriedem
16:04:48 <mriedem> oh sorry, thought we were talking about the upgrade-checkers goal :)
16:05:12 <slaweq_> no, we are talking about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:12 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:05:13 <mlavalle> mriedem: when you have time, please look at https://bugs.launchpad.net/neutron/+bug/1788006
16:05:33 <mlavalle> mriedem: we think we need help from a Nova expert
16:05:42 * mriedem throws it on the pile
16:05:50 <mlavalle> mriedem: Thanks :-)
16:05:56 <slaweq_> thx mriedem
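For context on "the events from Neutron" mentioned above: when a port becomes ACTIVE, neutron-server reports network-vif-plugged to nova through the os-server-external-events API, and nova's compute manager has to reconcile that with what the virt driver reports before it resumes the instance and moves it out of BUILD/spawning. Below is a minimal sketch of such an event; it is an illustration only, not the actual neutron notifier code, and the credentials, URL and UUIDs are placeholders.

    # Sketch only, not neutron's notifier code; credentials, URL and UUIDs
    # below are placeholders.
    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client as nova_client

    auth = v3.Password(auth_url='http://controller/identity/v3',
                       username='neutron', password='secret',
                       project_name='service',
                       user_domain_id='default', project_domain_id='default')
    nova = nova_client.Client('2.1', session=session.Session(auth=auth))

    # One event per (instance, port); nova holds the boot in "spawning" until
    # it receives this (or hits its vif plugging timeout).
    nova.server_external_events.create([{
        'server_uuid': '11111111-2222-3333-4444-555555555555',  # instance UUID
        'name': 'network-vif-plugged',
        'status': 'completed',
        'tag': '66666666-7777-8888-9999-000000000000',          # port UUID
    }])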
16:06:14 <slaweq_> ok, so let's move on to next actions now
16:06:22 <slaweq_> mlavalle to check failing router migration from DVR tests
16:06:37 <mlavalle> I am working on that one right now
16:07:04 <mlavalle> the migrations failing are always from HA to {dvr, dvr-ha, legacy}
16:07:34 <mlavalle> it seems that when you set the admin_state_up to False in the router....
16:07:44 <slaweq_> yes, but that's related to another bug I think, see https://bugs.launchpad.net/neutron/+bug/1789434
16:07:45 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:07:45 <mlavalle> the "device_owner":"network:router_ha_interface" port never goes down
16:08:10 <mlavalle> yeah, that is what I am talking about
16:09:26 <mlavalle> I have confirmed that is the case for migration from HA to DVR
16:09:36 <mlavalle> and from HA to HA-DVR
16:10:24 <mlavalle> I expect the same to be happening for HA to legacy (or Coke Classic as I think of it)
16:10:33 <mlavalle> I intend to continue debugging
16:10:35 <slaweq_> but that check whether ports are down was introduced by me recently, and it was passing then IIRC
16:10:47 <mlavalle> it is not passing now
16:10:48 <slaweq_> so was there any other change recently which could cause such an issue?
16:10:59 <slaweq_> yes, I know it is not passing now :)
16:11:08 <mlavalle> I was about to ask the same....
16:11:27 <mlavalle> the port status remains ACTIVE all the time
16:11:53 <haleyb> so the port doesn't go down, or the interface in the namespace? as in two routers have active ha ports
16:12:13 <mlavalle> the test is checking for the status of the port
16:12:21 <slaweq_> mlavalle: haleyb: this is the patch which adds this check: https://review.openstack.org/#/c/589410/
16:12:27 <mlavalle> that's as far as I've checked so far
16:12:33 <slaweq_> the migration from HA tests were passing then
16:13:40 <haleyb> ok, so the port, strange
16:14:25 <slaweq_> yes, that is really strange
16:14:42 <mlavalle> well, that's what the test checks
16:14:46 <mlavalle> the status of the port
16:15:44 <slaweq_> yes, I added it to avoid a race condition during migration from legacy to ha, described in https://bugs.launchpad.net/neutron/+bug/1785582
16:15:44 <openstack> Launchpad bug 1785582 in neutron "Connectivity to instance after L3 router migration from Legacy to HA fails" [Medium,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:16:11 <mlavalle> I remember and logically it makes sense
16:16:16 <slaweq_> but if the router is set to admin_state_down, its ports should be set to DOWN as well, right?
16:16:53 <slaweq_> but mlavalle You can at least reproduce it manually, right?
16:17:11 <mlavalle> I haven't tried manually yet. That is my next step
16:17:14 <slaweq_> or is the port status fine during manual actions?
16:17:17 <slaweq_> ahh, ok
16:17:32 <slaweq_> so I will add it as an action for You for next week, ok?
16:17:39 <mlavalle> of course
16:17:50 <mlavalle> as I said earlier, keep me honest
16:18:33 <slaweq_> #action mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:18:33 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:18:36 <slaweq_> thx mlavalle
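For the manual reproduction mlavalle mentions above, here is a minimal openstacksdk sketch of what the migration test expects; the cloud name 'devstack-admin' and the router name 'router1' are placeholders.

    # Sketch of the manual check, not the tempest test itself.
    import openstack

    conn = openstack.connect(cloud='devstack-admin')   # placeholder cloud name
    router = conn.network.find_router('router1', ignore_missing=False)

    # Same first step as the migration test: take the router down before
    # changing its distributed/ha flags.
    conn.network.update_router(router, is_admin_state_up=False)

    # Expectation from bug 1789434: every router port, including the one with
    # device_owner "network:router_ha_interface", should report status DOWN.
    for port in conn.network.ports(device_id=router.id):
        print(port.device_owner, port.status)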
16:18:47 <slaweq_> slaweq to report a bug about timeouts in neutron-tempest-plugin-scenario-linuxbridge
16:18:51 <slaweq_> that was the next one
16:19:00 <slaweq_> I reported it: https://bugs.launchpad.net/neutron/+bug/1789579
16:19:00 <openstack> Launchpad bug 1788006 in neutron "duplicate for #1789579 Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:16 <slaweq_> but it's in fact a duplicate of the issue mentioned before: https://bugs.launchpad.net/neutron/+bug/1788006
16:19:16 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:19:22 <mlavalle> yeap
16:19:36 <slaweq_> so let's move forward
16:19:38 <slaweq_> the next one was
16:19:40 <slaweq_> mlavalle to ask Mike Bayer about functional db migration tests failures
16:19:58 <mlavalle> I also suspect that some of the failures we are getting in the DVR multinode job are also due to this issue with Nova
16:20:13 <mlavalle> and the others are due to router migration
16:20:28 <mlavalle> I conclude that from the results of my queries with Kibana
16:20:42 <mlavalle> now moving on to Mike Bayer
16:20:51 <mlavalle> I was going to ping him
16:21:09 <mlavalle> but the log that is pointed out in the bug doesn't exist anymore
16:21:30 <mlavalle> and I don't want to ping him if I don't have data for him to look at
16:21:50 <slaweq_> ok, I will update the bug report with a fresh log today and will ping You
16:21:55 <slaweq_> fine for You?
16:22:10 <mlavalle> so let's keep our eyes open for one of those failures
16:22:12 <mlavalle> as soon as I have one, I'll ping him
16:22:16 <mlavalle> yes
16:22:23 <slaweq_> ok, sounds good
16:22:25 <mlavalle> I will also update the bug if I find one
16:22:32 <slaweq_> if I have anything new, I will update the bug
16:22:33 <slaweq_> :)
16:22:45 <slaweq_> ok
16:22:50 <slaweq_> the last one from last week was
16:22:53 <slaweq_> slaweq to mark fullstack security group test as unstable again
16:22:57 <slaweq_> Done: https://review.openstack.org/597426
16:23:10 <slaweq_> it was merged recently, so fullstack tests should be in better shape now
16:23:49 <slaweq_> and that was all for actions from the previous week
16:23:56 <slaweq_> let's move to the next topic
16:23:58 <slaweq_> #topic Grafana
16:24:04 <slaweq_> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:05 <slaweq_> FYI: patch https://review.openstack.org/#/c/595763/ is merged, so the graphs now also include jobs which TIMED_OUT
16:26:12 <mlavalle> ok
16:26:15 <slaweq_> speaking about last week, there was a spike around 31.08 on many jobs
16:26:31 <slaweq_> it was related to an infra bug, fixed in https://review.openstack.org/591527
16:28:00 <njohnston> Just to note, I have a change to bump the "older" dashboards' versions and to bring them in line with slaweq_'s format: https://review.openstack.org/597168
16:28:04 <mlavalle> yeah I can see that
16:28:30 <slaweq_> thx njohnston for working on this :)
16:28:41 <mlavalle> ok, so in principle we shouldn't worry about the spike on 8/31
16:29:00 <slaweq_> yep
16:29:47 <slaweq_> and speaking of other things, I think we have one main problem now
16:29:59 <slaweq_> and it's the tempest/scenario jobs issue which we already talked about
16:30:26 <mlavalle> the nova issue?
16:30:32 <slaweq_> yes
16:30:49 <mlavalle> ok, I'll stay on top of it
16:30:55 <slaweq_> this is basically the most common issue I found when I was checking job results
16:31:27 <slaweq_> and it is quite painful as it happens randomly on many jobs, so You need to recheck jobs many times and each time a different job fails :/
16:31:45 <slaweq_> because of that it is also not very visible on graphs IMO
16:31:46 <mlavalle> but the same underlying problem
16:32:15 <slaweq_> yes, from what I was checking it looks like the culprit is the same
16:32:54 <slaweq_> ok, but let's talk about a few other issues which I also found :)
16:33:00 <slaweq_> #topic Tempest/Scenario
16:33:27 <slaweq_> so, apart from this "main" issue, I found a few times that the same shelve/unshelve test was failing again
16:33:32 <slaweq_> examples:
16:33:39 <slaweq_> * http://logs.openstack.org/71/516371/8/check/neutron-tempest-multinode-full/bb2ae6a/logs/testr_results.html.gz
16:33:41 <slaweq_> * http://logs.openstack.org/59/591059/3/check/neutron-tempest-multinode-full/72dbf69/logs/testr_results.html.gz
16:33:43 <slaweq_> * http://logs.openstack.org/72/507772/66/check/neutron-tempest-multinode-full/aff626c/logs/testr_results.html.gz
16:33:52 <slaweq_> I think mlavalle You were checking that in the past, right?
16:34:04 <mlavalle> yeah I think I was
16:34:25 <mlavalle> how urgent is it compared to the other two I am working on?
16:34:35 <slaweq_> not very urgent I think
16:34:52 <slaweq_> I found those 3 examples in the last 50 jobs or something like that
16:34:54 <mlavalle> ok, give me an action item, so I remember
16:35:09 <mlavalle> I don't want it to fall through the cracks
16:35:29 <mlavalle> on the understanding that I'll get to it after the others
16:35:46 <slaweq_> do You remember the bug report for that, maybe?
16:36:15 <mlavalle> slaweq_: don't worry, I'll find it. for the time being just mention the shelve / unshelve issue in the AI
16:36:29 <slaweq_> mlavalle: ok
16:36:55 <slaweq_> #action mlavalle to check the issue with the failing test_attach_volume_shelved_or_offload_server test
16:37:10 <mlavalle> thanks :-)
16:37:15 <slaweq_> thx mlavalle
16:37:19 <slaweq_> ok, let's move on
16:37:32 <slaweq_> I also found one more example of another error: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/testr_results.html.gz
16:37:42 <slaweq_> it was some error during FIP deletion
16:39:41 <slaweq_> but the REST API call was made to nova: 500 DELETE https://10.210.194.229/compute/v2.1/os-floating-ips/a06bbb30-1914-4467-9e75-6fc70d99427f
16:40:22 <slaweq_> and I can't find a FIP with that uuid in the nova-api logs or the neutron-server logs
16:40:33 <slaweq_> do You maybe know where this error may come from?
16:41:15 <mlavalle> mhhh
16:41:48 <mlavalle> so a06bbb30-1914-4467-9e75-6fc70d99427f is not in the neutron server log?
16:42:03 <slaweq_> no
16:42:13 <slaweq_> and there is also no ERROR at all in the neutron-server log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz?level=ERROR
16:42:29 <slaweq_> which is strange IMO because this error 500 came from somewhere, right?
16:42:37 <mlavalle> right
16:42:55 <slaweq_> it's only one such failed test but it's interesting for me :)
16:43:05 <mlavalle> indeed
16:43:41 <mlavalle> maybe our fips are plunging into a black hole
16:44:43 <slaweq_> in the tempest log: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/tempest_log.txt
16:44:50 <slaweq_> it's visible in 3 places
16:44:54 <slaweq_> once when it's created
16:44:58 <slaweq_> and then when it's removed
16:45:03 <slaweq_> with error 500
16:45:25 * mlavalle looking
16:46:47 <mlavalle> yeah, but it responded 200 for the POST
16:47:43 <mlavalle> and the fip doesn't work either
16:48:09 <mlavalle> the ssh fails
16:48:20 <slaweq_> because it was not visible anywhere in neutron IMO so it couldn't work, right?
16:48:55 <mlavalle> yeah
16:49:13 <slaweq_> maybe that is also the culprit of other issues with non-working FIPs
16:49:37 <slaweq_> let's keep an eye on it and check whether similar things pop up in other examples
16:49:58 <mlavalle> ok
16:50:12 <slaweq_> next topic
16:50:14 <slaweq_> #topic Fullstack
16:50:22 <mlavalle> slaweq_: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_07_989880
16:51:13 <mlavalle> the fip uuid is in the neutron server log
16:51:43 <slaweq_> so I was searching too fast, and not the whole log was loaded in the browser :)
16:51:47 <slaweq_> sorry for that
16:51:50 <mlavalle> np
16:52:07 <mlavalle> I'll let you dig deeper
16:52:35 <slaweq_> but the delete was also fine: http://logs.openstack.org/96/596896/2/check/tempest-full-py3/95eba41/controller/logs/screen-q-svc.txt.gz#_Sep_02_23_25_30_966952
16:53:09 <slaweq_> I will check those logs once again later
16:53:22 <mlavalle> right, so it seems the problem is more on the Nova side
16:53:33 <slaweq_> maybe :)
16:53:47 <slaweq_> let's get back to fullstack for now
16:53:52 <mlavalle> ok
16:54:19 <slaweq_> the test_securitygroups test is marked as unstable now, so it should be in at least a bit better shape than it was recently
16:54:34 <slaweq_> I will continue debugging this issue when I have some time
16:55:03 <slaweq_> I think that this is the most common issue in fullstack tests and, without this one failing, the other things should be in good shape
16:55:14 <mlavalle> great!
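For reference, "marked as unstable" means the test is wrapped in a decorator that turns any failure into a skip, so the job stays green while the bug is being debugged. Below is a minimal illustration of the pattern, not the exact helper used in the fullstack suite; the actual change for test_securitygroups is in https://review.openstack.org/597426, and the bug number below is a placeholder.

    # Illustration of the "unstable test" pattern only.
    import functools
    import unittest


    def unstable_test(reason):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as exc:
                    # A failure becomes a skip that records the tracked bug.
                    raise unittest.SkipTest('%s is unstable (%s), failure: %s'
                                            % (self.id(), reason, exc))
            return wrapper
        return decorator


    class SecurityGroupsTest(unittest.TestCase):

        @unstable_test("bug XXXXXXX")  # placeholder bug reference
        def test_securitygroup(self):
            self.fail("flaky failure")  # reported as a skip, not a failure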
16:55:34 <slaweq_> I have one more question about fullstack
16:55:34 <njohnston> excellent
16:55:45 <slaweq_> what do You think about switching fullstack-python35 to be voting?
16:55:58 <slaweq_> it's following the failure rate of the py27 job exactly: http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?panelId=22&fullscreen&orgId=1&from=now-7d&to=now
16:56:22 <njohnston> at this point should I change it to be fullstack-py36 before we commit it?
16:56:28 <slaweq_> or maybe we should switch neutron-fullstack to be py35 and add neutron-fullstack-python27 in parallel?
16:56:52 <slaweq_> njohnston: ok, so You want to switch it to py36 and check for a few more weeks how it goes?
16:57:07 <mlavalle> I like that plan
16:57:08 <njohnston> at least a week, yes
16:57:20 <slaweq_> that is a good idea IMO
16:57:35 <mlavalle> can we switch it this week?
16:58:04 <mlavalle> that way we get the week of the PTG as testing and if everything goes well, we can pull the trigger in two weeks
16:58:10 <njohnston> yes, I will get that change up today
16:58:16 <slaweq_> great, thx njohnston
16:58:34 <slaweq_> #action njohnston to switch the fullstack-python35 job to python36
16:58:50 <slaweq_> ok, let's move quickly to the last topic
16:58:55 <slaweq_> #topic Open discussion
16:59:07 <slaweq_> FYI: the QA team asked me to add a list of the slowest tests from neutron jobs to https://ethercalc.openstack.org/dorupfz6s9qt - they want to move such tests to the slow job
16:59:22 <mlavalle> ok
16:59:24 <slaweq_> Such a slow job for neutron is proposed in https://review.openstack.org/#/c/583847/
16:59:25 <njohnston> sounds smart
16:59:39 <slaweq_> I will try to check the longest tests in our jobs this week
17:00:00 <slaweq_> the second thing I want to announce is that I will cancel next week's meeting due to the PTG
17:00:06 <slaweq_> #endmeeting