15:00:11 <slaweq> #startmeeting neutron_ci
15:00:12 <openstack> Meeting started Wed May 27 15:00:11 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 <openstack> The meeting name has been set to 'neutron_ci'
15:00:20 <slaweq> hi
15:00:24 <maciejjozefczyk> hey
15:00:41 <njohnston> o/
15:01:07 <bcafarel> o/
15:01:18 <lajoskatona> o/
15:01:23 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:01:26 <slaweq> Please open now :)
15:01:38 <slaweq> and I think that we can start
15:01:44 <slaweq> #topic Actions from previous meetings
15:01:58 <slaweq> first one
15:02:00 <slaweq> ralonsoh to continue checking ovn jobs timeouts
15:02:33 <ralonsoh> sorry
15:02:35 <ralonsoh> I'm late
15:02:47 <slaweq> ralonsoh: no problem :)
15:02:48 <ralonsoh> I spent some time on this again
15:02:59 <ralonsoh> with ovsdbapp and python-ovn developers
15:03:09 <ralonsoh> and I still don't find a "breach" in the code
15:03:24 <ralonsoh> this is something permanent in my plate
15:03:32 <ralonsoh> can we put this task on hold?
15:03:37 <slaweq> but it still happens in the ci, right?
15:03:44 <ralonsoh> sometimes
15:03:49 <ralonsoh> but no so often
15:03:52 <slaweq> ok
15:04:01 <ralonsoh> and this is something that happens in OVS too
15:04:31 <slaweq> so it's not always related to ovn jobs? also ml2/ovs jobs has got the same issue?
15:04:59 <ralonsoh> yes
15:05:07 <slaweq> ok
15:05:12 <ralonsoh> it's a problem with eventlet, ovsdbapp and pyuthon-ovs
15:05:28 <slaweq> ouch, probably will be hard to find :/
15:05:36 <ralonsoh> pfffff
15:05:41 <slaweq> :)
15:07:16 <slaweq> ok, I know You will continue this investigation so I think we can move on
15:07:24 <slaweq> next one
15:07:26 <slaweq> slaweq to check failure in test_ha_router_failover: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6d0/726168/2/check/neutron-functional/6d0b174/testr_results.html
15:07:40 <slaweq> I have it in my todo list but I didn't have time to get to this one yet
15:07:44 <slaweq> I will try this week
15:07:52 <slaweq> #action slaweq to check failure in test_ha_router_failover: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6d0/726168/2/check/neutron-functional/6d0b174/testr_results.html
15:08:54 <slaweq> next one
15:08:56 <slaweq> slaweq to add additional logging for fullstack's firewall tests
15:09:19 <slaweq> and here it is the same: I have it in my todo list but I didn't have time yet to get to this
15:09:41 <slaweq> I reopened bug related to this test and marked it as unstable again
15:09:57 <slaweq> so it will not make our life harder :)
15:10:23 <slaweq> #action slaweq to add additional logging for fullstack's firewall tests
15:10:25 <slaweq> next one
15:10:30 <njohnston> there si so much that goes on in that one test, would it make sense to break it up?
15:11:09 <slaweq> njohnston: I was thinking about that, and I even started something https://review.opendev.org/#/c/716773/
15:11:27 <slaweq> but it will require some more work
15:11:48 <njohnston> cool!  As always, you are ahead of the curve. :-)
15:11:51 <slaweq> and I agree that this would be good to break it into few smaller tests
15:11:59 <slaweq> njohnston: thx :)
15:13:18 <slaweq> njohnston: so I will continue this effort in next weeek(s) but as low priority task
15:13:24 <njohnston> makes sense
15:14:12 <slaweq> ok
15:14:14 <slaweq> thx
15:14:17 <slaweq> so lets move on
15:14:31 <slaweq> slaweq to reopen bug related to failing fuillstack firewall tests
15:14:44 <slaweq> as I said I reopended bug https://bugs.launchpad.net/neutron/+bug/1742401
15:14:44 <openstack> Launchpad bug 1742401 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:14:50 <slaweq> and marked test as unstable again
15:16:05 <slaweq> ok, and the last one
15:16:07 <slaweq> ralonsoh to check  Address already allocated in subnet issue in tempest job
15:16:34 <ralonsoh> I'm on it, sorry
15:16:42 <ralonsoh> didn't spend too much time on this one
15:17:35 <slaweq> ralonsoh: I can imagine as I know what You were doing for most of the week :)
15:17:55 * slaweq also doesn't like conjunctions :P
15:18:00 <maciejjozefczyk> :
15:18:01 <maciejjozefczyk> :P
15:18:32 <slaweq> with that I think we can move on to the next topic
15:18:34 <slaweq> #topic Stadium projects
15:18:59 <slaweq> I don't have anything related to the stadium for today
15:19:07 <slaweq> but maybe You have something to discuss here?
15:19:19 <njohnston> nope
15:19:20 <lajoskatona> nothing special, at least from me
15:20:00 <slaweq> ok, so lets move on to the next topic
15:20:02 <slaweq> #topic Stable branches
15:20:08 <slaweq> Train dashboard: http://grafana.openstack.org/d/pM54U-Kiz/neutron-failure-rate-previous-stable-release?orgId=1
15:20:10 <slaweq> Stein dashboard: http://grafana.openstack.org/d/dCFVU-Kik/neutron-failure-rate-older-stable-release?orgId=1
15:20:18 <bcafarel> Ussuri and Train now :)
15:20:31 <njohnston> nice!
15:20:55 <slaweq> bcafarel: are You sure?
15:21:03 <slaweq> I see in the description train and stein still
15:21:13 <slaweq> maybe You forgot change description?
15:21:19 <bcafarel> argh, checking
15:21:26 <bcafarel> the changeset merged for sure
15:22:51 <bcafarel> nah, names were updated too, I guess 14h ago is too recent?
15:23:05 <bcafarel> ( https://review.opendev.org/#/c/729291/ )
15:23:44 <slaweq> ok, lets wait some more time
15:23:54 <slaweq> hopefully it will change
15:24:03 <slaweq> and thx bcafarel for taking care of it
15:24:13 <bcafarel> something to check on next meeting :)
15:24:32 <bcafarel> and apart from that from memory I don't think I have seen many stable failures
15:24:52 <bcafarel> (also new stable releases for trein/stain)
15:25:00 <bcafarel> *train/stein
15:25:02 <slaweq> yes, me too - most of my patches were merged pretty fast recently
15:26:25 <slaweq> if there is nothing else regarding stable branches, I think we can move on to the next topic
15:29:16 <slaweq> #topic Grafana
15:29:22 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:32:36 <slaweq> I don't know but our gate queue looks very good since few days - no failures at all :)
15:32:43 <slaweq> or almost at all
15:32:59 <slaweq> but it's worst in the check queue
15:33:21 <slaweq> e.g. neutron-ovn-tempest-slow is failing 100% times
15:36:01 <maciejjozefczyk> hmm, I may know whats going there
15:36:15 <slaweq> maciejjozefczyk: I was hoping that You will know :P
15:36:30 <maciejjozefczyk> I can take a look, this failing test seems similar to me :P (test_port_security_macspoofing_port)
15:36:48 <bcafarel> hmm yeah the name does ring a bell
15:36:52 <slaweq> yes, that's the test which is failing every time since around last friday
15:36:55 <maciejjozefczyk> I think we blacklisted that test in other jobs
15:37:08 <maciejjozefczyk> there is a fix in core-ovn for this but I think its not yet in the OVN version we use in the gates
15:37:09 <slaweq> maciejjozefczyk: please check that if You will have some time
15:37:18 <maciejjozefczyk> sure, thats gonan be quick :) slaweq
15:37:30 <slaweq> #action maciejjozefczyk to check failing test_port_security_macspoofing_port test
15:37:40 <maciejjozefczyk> I left the vivaldi tab open for tomorrow morning to not forget :)
15:37:56 <slaweq> other than that things looks quite good IMO
15:38:10 <slaweq> anything else You want to discuss regarding grafana?
15:39:53 <slaweq> ok, so next topic
15:39:55 <slaweq> #topic fullstack/functional
15:40:08 <slaweq> I have only one thing regarding functional job
15:40:14 <slaweq> What to do with https://review.opendev.org/#/c/729588/ ? IMO it's good to go
15:40:35 <slaweq> this uwsgi job is even more stable recently than "normal" functional tests job
15:40:35 <ralonsoh> I think so
15:40:50 <njohnston> +1
15:40:53 <ralonsoh> +1
15:40:53 <lajoskatona> +1
15:41:26 <slaweq> ok, so please review and approve it :)
15:41:41 <slaweq> thx lajoskatona bcafarel and maciejjozefczyk for review of it already :)
15:42:04 <maciejjozefczyk> :) +1
15:42:13 <slaweq> fullstack tests seems to be better when I marked this one test as unstable
15:42:31 <lajoskatona> slaweq: no problem
15:42:54 <slaweq> any questions/comments or can we move on?
15:43:40 <njohnston> go ahead
15:43:56 <bcafarel> nothing from me
15:44:01 <slaweq> ok
15:44:04 <slaweq> #topic Tempest/Scenario
15:44:16 <slaweq> we already talked about failing ovn-slow job
15:44:44 <slaweq> the only other issue which I have today is yet another error with IpAddressAlreadyAllocated error
15:44:55 <slaweq> this time in grenade job:     https://7ba1272f105c99db4826-d9c28f5658476db1e4ca5968d196888d.ssl.cf2.rackcdn.com/729591/1/check/neutron-grenade-dvr-multinode/54a824e/controller/logs/grenade.sh_log.txt
15:48:15 <slaweq> I'm not sure why it's like that but for me it smells like issue in the test
15:49:29 <slaweq> ok
15:49:31 <slaweq> I found
15:49:33 <slaweq> May 20 14:02:57.487624 ubuntu-bionic-rax-ord-0016690112 neutron-server[5514]: INFO neutron.wsgi [None req-e0e86a16-03d9-4d2c-86fd-db9c1c4133b9 tempest-RoutersTest-1419393395 tempest-RoutersTest-1419393395] 10.209.98.10 "PUT /v2.0/routers/68d7b71d-47e2-4b98-9f87-d972e2f3889d/add_router_interface HTTP/1.1" status: 409  len: 369 time: 38.4982579
15:50:08 <slaweq> sorry
15:50:12 <slaweq> first was:
15:50:14 <slaweq> May 20 14:03:00.501061 ubuntu-bionic-rax-ord-0016690112 neutron-server[5514]: INFO neutron.wsgi [None req-a76c0e0a-7bc9-4c57-94e9-fe92c402f768 tempest-RoutersTest-1419393395 tempest-RoutersTest-1419393395] 10.209.98.10 "PUT /v2.0/routers/68d7b71d-47e2-4b98-9f87-d972e2f3889d/add_router_interface HTTP/1.1" status: 200  len: 503 time: 102.8894622
15:50:31 <slaweq> this took long time, so client ended up with timeout and retried request
15:50:35 <ralonsoh> slaweq, I found the error for trhe IpAddressAlreadyAllocated
15:50:40 <ralonsoh> slaweq, https://bugs.launchpad.net/neutron/+bug/1880976
15:50:40 <openstack> Launchpad bug 1880976 in neutron "[tempest] Error in "test_reuse_ip_address_with_other_fip_on_other_router" with duplicated floating IP" [Undecided,New]
15:50:44 <ralonsoh> reported 5 mins ago
15:51:04 <slaweq> but in the meantime it was allocated already so second request was 409
15:51:15 <slaweq> so at least in this grenade job it's nothing really new
15:51:17 <ralonsoh> in a nutshell: between the FIP deletion and the creation again with the same IP address, another test requested an FIP
15:51:32 <ralonsoh> and the server gave the same IP
15:51:38 <ralonsoh> just a coincidence
15:51:42 <njohnston> ugh
15:51:52 <ralonsoh> I'll try to reduce the time between the deletion and the new creation
15:52:00 <ralonsoh> reusing the IP address
15:52:13 <slaweq> nice catch ralonsoh
15:52:22 <ralonsoh> good luck!
15:52:25 <njohnston> it's just unfortunate that with our ipam randomization work that it got the same IP
15:52:35 <ralonsoh> yeah...
15:52:44 <bcafarel> lower probability but still not 0%
15:52:53 <njohnston> bad roll of the dice
15:53:28 <slaweq> ralonsoh: so is it conflict of FIP or fixed IP?
15:53:33 <ralonsoh> FIP
15:53:34 <slaweq> how this test works?
15:53:39 <ralonsoh> simple
15:53:40 <slaweq> do we need to delete it?
15:53:45 <ralonsoh> 2 vms with FIP
15:53:50 <ralonsoh> 1 vm deleted and the FIP
15:54:09 <ralonsoh> then a VM is created and the FIP is created again, "reusing" the IP address
15:54:16 <slaweq> maybe we could just create FIP, attach to the first vm, detach from the vm, attach to the second vm?
15:54:33 <slaweq> that way FIP would be "reserved" in the tenant for whole test
15:54:37 <slaweq> am I right?
15:54:48 <ralonsoh> but we really need to delete it
15:54:55 <ralonsoh> one sec
15:55:00 <ralonsoh> it's in the test case description
15:55:24 <ralonsoh> https://github.com/openstack/neutron-tempest-plugin/blob/7b374486a54456d3c67fd2961c5894fb64ba48ab/neutron_tempest_plugin/scenario/test_floatingip.py#L518-L536
15:55:32 <ralonsoh> step 6
15:56:20 <ralonsoh> (btw, we SHOULD document always the test cases like this)
15:56:28 <njohnston> +1 for documenting
15:56:36 <slaweq> I agree
15:56:42 <ralonsoh> so clear, with those steps
15:57:12 <slaweq> and according to the test, I still don't see any reason why FIP has to be deleted? if we would detach it from VM it would be just DB record
15:57:33 <ralonsoh> you are right
15:57:35 <slaweq> so You don't need to "remember" its IP address but just reuse it later for VM3
15:57:42 <ralonsoh> we really don't need to delete the DB register
15:57:52 <slaweq> and this should be more stable IMHO
15:58:02 <ralonsoh> okidoki, I'll propose a patch
15:58:07 <slaweq> ralonsoh++ thx
15:58:15 <bcafarel> nice
15:58:25 <slaweq> and with that we are almost out of time
15:58:35 <slaweq> at the end one quick note about periodic jobs
15:58:42 <slaweq> all seems ok this week
15:58:45 <lajoskatona> just a question: isn't it possible to run serially these tests?
15:58:49 <lajoskatona> too much time?
15:58:53 <slaweq> our job with ovsdbapp from mast is fine now
15:59:12 <slaweq> lajoskatona: yes, it would take too much time
15:59:19 <lajoskatona> ok
15:59:29 <slaweq> in tempest-slow job tests are run in serial but there is only few of them
15:59:47 <slaweq> thx for attending the meeting, see You all tomorrow :)
15:59:51 <slaweq> o/
15:59:54 <bcafarel> o/
15:59:55 <slaweq> #endmeeting