15:00:11 <slaweq> #startmeeting neutron_ci
15:00:12 <openstack> Meeting started Wed May 27 15:00:11 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 <openstack> The meeting name has been set to 'neutron_ci'
15:00:20 <slaweq> hi
15:00:24 <maciejjozefczyk> hey
15:00:41 <njohnston> o/
15:01:07 <bcafarel> o/
15:01:18 <lajoskatona> o/
15:01:23 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:01:26 <slaweq> Please open now :)
15:01:38 <slaweq> and I think that we can start
15:01:44 <slaweq> #topic Actions from previous meetings
15:01:58 <slaweq> first one
15:02:00 <slaweq> ralonsoh to continue checking ovn jobs timeouts
15:02:33 <ralonsoh> sorry
15:02:35 <ralonsoh> I'm late
15:02:47 <slaweq> ralonsoh: no problem :)
15:02:48 <ralonsoh> I spent some time on this again
15:02:59 <ralonsoh> with ovsdbapp and python-ovn developers
15:03:09 <ralonsoh> and I still haven't found a "breach" in the code
15:03:24 <ralonsoh> this is something permanent on my plate
15:03:32 <ralonsoh> can we put this task on hold?
15:03:37 <slaweq> but it still happens in the ci, right?
15:03:44 <ralonsoh> sometimes
15:03:49 <ralonsoh> but not so often
15:03:52 <slaweq> ok
15:04:01 <ralonsoh> and this is something that happens in OVS too
15:04:31 <slaweq> so it's not always related to ovn jobs? ml2/ovs jobs have got the same issue too?
15:04:59 <ralonsoh> yes
15:05:07 <slaweq> ok
15:05:12 <ralonsoh> it's a problem with eventlet, ovsdbapp and python-ovs
15:05:28 <slaweq> ouch, that will probably be hard to find :/
15:05:36 <ralonsoh> pfffff
15:05:41 <slaweq> :)
15:07:16 <slaweq> ok, I know You will continue this investigation so I think we can move on
15:07:24 <slaweq> next one
15:07:26 <slaweq> slaweq to check failure in test_ha_router_failover: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6d0/726168/2/check/neutron-functional/6d0b174/testr_results.html
15:07:40 <slaweq> I have it in my todo list but I didn't have time to get to this one yet
15:07:44 <slaweq> I will try this week
15:07:52 <slaweq> #action slaweq to check failure in test_ha_router_failover: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6d0/726168/2/check/neutron-functional/6d0b174/testr_results.html
15:08:54 <slaweq> next one
15:08:56 <slaweq> slaweq to add additional logging for fullstack's firewall tests
15:09:19 <slaweq> and here it is the same: I have it in my todo list but I didn't have time yet to get to this
15:09:41 <slaweq> I reopened the bug related to this test and marked it as unstable again
15:09:57 <slaweq> so it will not make our life harder :)
15:10:23 <slaweq> #action slaweq to add additional logging for fullstack's firewall tests
15:10:25 <slaweq> next one
15:10:30 <njohnston> there is so much that goes on in that one test, would it make sense to break it up?
15:11:09 <slaweq> njohnston: I was thinking about that, and I even started something https://review.opendev.org/#/c/716773/
15:11:27 <slaweq> but it will require some more work
15:11:48 <njohnston> cool! As always, you are ahead of the curve. :-)
15:11:51 <slaweq> and I agree that it would be good to break it into a few smaller tests
15:11:59 <slaweq> njohnston: thx :)
15:13:18 <slaweq> njohnston: so I will continue this effort in the next week(s) but as a low priority task
15:13:24 <njohnston> makes sense
15:14:12 <slaweq> ok
15:14:14 <slaweq> thx
15:14:17 <slaweq> so lets move on
15:14:31 <slaweq> slaweq to reopen bug related to failing fullstack firewall tests
15:14:44 <slaweq> as I said, I reopened bug https://bugs.launchpad.net/neutron/+bug/1742401
15:14:44 <openstack> Launchpad bug 1742401 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:14:50 <slaweq> and marked the test as unstable again
15:16:05 <slaweq> ok, and the last one
15:16:07 <slaweq> ralonsoh to check Address already allocated in subnet issue in tempest job
15:16:34 <ralonsoh> I'm on it, sorry
15:16:42 <ralonsoh> didn't spend too much time on this one
15:17:35 <slaweq> ralonsoh: I can imagine, as I know what You were doing for most of the week :)
15:17:55 * slaweq also doesn't like conjunctions :P
15:18:00 <maciejjozefczyk> :
15:18:01 <maciejjozefczyk> :P
15:18:32 <slaweq> with that I think we can move on to the next topic
15:18:34 <slaweq> #topic Stadium projects
15:18:59 <slaweq> I don't have anything related to the stadium for today
15:19:07 <slaweq> but maybe You have something to discuss here?
15:19:19 <njohnston> nope
15:19:20 <lajoskatona> nothing special, at least from me
15:20:00 <slaweq> ok, so lets move on to the next topic
15:20:02 <slaweq> #topic Stable branches
15:20:08 <slaweq> Train dashboard: http://grafana.openstack.org/d/pM54U-Kiz/neutron-failure-rate-previous-stable-release?orgId=1
15:20:10 <slaweq> Stein dashboard: http://grafana.openstack.org/d/dCFVU-Kik/neutron-failure-rate-older-stable-release?orgId=1
15:20:18 <bcafarel> Ussuri and Train now :)
15:20:31 <njohnston> nice!
15:20:55 <slaweq> bcafarel: are You sure?
15:21:03 <slaweq> I still see Train and Stein in the description
15:21:13 <slaweq> maybe You forgot to change the description?
15:21:19 <bcafarel> argh, checking
15:21:26 <bcafarel> the changeset merged for sure
15:22:51 <bcafarel> nah, names were updated too, I guess 14h ago is too recent?
15:23:05 <bcafarel> ( https://review.opendev.org/#/c/729291/ )
15:23:44 <slaweq> ok, lets wait some more time
15:23:54 <slaweq> hopefully it will change
15:24:03 <slaweq> and thx bcafarel for taking care of it
15:24:13 <bcafarel> something to check on next meeting :)
15:24:32 <bcafarel> and apart from that, from memory I don't think I have seen many stable failures
15:24:52 <bcafarel> (also new stable releases for trein/stain)
15:25:00 <bcafarel> *train/stein
15:25:02 <slaweq> yes, me too - most of my patches were merged pretty fast recently
15:26:25 <slaweq> if there is nothing else regarding stable branches, I think we can move on to the next topic
15:29:16 <slaweq> #topic Grafana
15:29:22 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:32:36 <slaweq> I don't know why, but our gate queue has looked very good for a few days - no failures at all :)
15:32:43 <slaweq> or almost none at all
15:32:59 <slaweq> but it's worse in the check queue
15:33:21 <slaweq> e.g. neutron-ovn-tempest-slow is failing 100% of the time
15:36:01 <maciejjozefczyk> hmm, I may know what's going on there
15:36:15 <slaweq> maciejjozefczyk: I was hoping that You would know :P
15:36:30 <maciejjozefczyk> I can take a look, this failing test seems familiar to me :P (test_port_security_macspoofing_port)
15:36:48 <bcafarel> hmm yeah the name does ring a bell
15:36:52 <slaweq> yes, that's the test which is failing every time since around last friday
15:36:55 <maciejjozefczyk> I think we blacklisted that test in other jobs
15:37:08 <maciejjozefczyk> there is a fix in core OVN for this but I think it's not yet in the OVN version we use in the gates
15:37:09 <slaweq> maciejjozefczyk: please check that if You have some time
15:37:18 <maciejjozefczyk> sure, that's gonna be quick :) slaweq
15:37:30 <slaweq> #action maciejjozefczyk to check failing test_port_security_macspoofing_port test
15:37:40 <maciejjozefczyk> I left the vivaldi tab open for tomorrow morning so I don't forget :)
15:37:56 <slaweq> other than that things look quite good IMO
15:38:10 <slaweq> anything else You want to discuss regarding grafana?
15:39:53 <slaweq> ok, so next topic
15:39:55 <slaweq> #topic fullstack/functional
15:40:08 <slaweq> I have only one thing regarding the functional job
15:40:14 <slaweq> What to do with https://review.opendev.org/#/c/729588/ ? IMO it's good to go
15:40:35 <slaweq> this uwsgi job has recently been even more stable than the "normal" functional tests job
15:40:35 <ralonsoh> I think so
15:40:50 <njohnston> +1
15:40:53 <ralonsoh> +1
15:40:53 <lajoskatona> +1
15:41:26 <slaweq> ok, so please review and approve it :)
15:41:41 <slaweq> thx lajoskatona bcafarel and maciejjozefczyk for reviewing it already :)
15:42:04 <maciejjozefczyk> :) +1
15:42:13 <slaweq> fullstack tests seem to be better since I marked this one test as unstable
15:42:31 <lajoskatona> slaweq: no problem
15:42:54 <slaweq> any questions/comments or can we move on?
15:43:40 <njohnston> go ahead
15:43:56 <bcafarel> nothing from me
15:44:01 <slaweq> ok
15:44:04 <slaweq> #topic Tempest/Scenario
15:44:16 <slaweq> we already talked about the failing ovn-slow job
15:44:44 <slaweq> the only other issue which I have today is yet another error with IpAddressAlreadyAllocated
15:44:55 <slaweq> this time in the grenade job: https://7ba1272f105c99db4826-d9c28f5658476db1e4ca5968d196888d.ssl.cf2.rackcdn.com/729591/1/check/neutron-grenade-dvr-multinode/54a824e/controller/logs/grenade.sh_log.txt
15:48:15 <slaweq> I'm not sure why it's like that but for me it smells like an issue in the test
15:49:29 <slaweq> ok
15:49:31 <slaweq> I found it:
15:49:33 <slaweq> May 20 14:02:57.487624 ubuntu-bionic-rax-ord-0016690112 neutron-server[5514]: INFO neutron.wsgi [None req-e0e86a16-03d9-4d2c-86fd-db9c1c4133b9 tempest-RoutersTest-1419393395 tempest-RoutersTest-1419393395] 10.209.98.10 "PUT /v2.0/routers/68d7b71d-47e2-4b98-9f87-d972e2f3889d/add_router_interface HTTP/1.1" status: 409 len: 369 time: 38.4982579
15:50:08 <slaweq> sorry
15:50:12 <slaweq> first was:
15:50:14 <slaweq> May 20 14:03:00.501061 ubuntu-bionic-rax-ord-0016690112 neutron-server[5514]: INFO neutron.wsgi [None req-a76c0e0a-7bc9-4c57-94e9-fe92c402f768 tempest-RoutersTest-1419393395 tempest-RoutersTest-1419393395] 10.209.98.10 "PUT /v2.0/routers/68d7b71d-47e2-4b98-9f87-d972e2f3889d/add_router_interface HTTP/1.1" status: 200 len: 503 time: 102.8894622
15:50:31 <slaweq> this took a long time, so the client ended up with a timeout and retried the request
15:50:35 <ralonsoh> slaweq, I found the error for the IpAddressAlreadyAllocated
15:50:40 <ralonsoh> slaweq, https://bugs.launchpad.net/neutron/+bug/1880976
15:50:40 <openstack> Launchpad bug 1880976 in neutron "[tempest] Error in "test_reuse_ip_address_with_other_fip_on_other_router" with duplicated floating IP" [Undecided,New]
15:50:44 <ralonsoh> reported 5 mins ago
15:51:04 <slaweq> but in the meantime it was allocated already, so the second request got a 409
15:51:15 <slaweq> so at least in this grenade job it's nothing really new
15:51:17 <ralonsoh> in a nutshell: between the FIP deletion and the creation again with the same IP address, another test requested an FIP
15:51:32 <ralonsoh> and the server gave the same IP
15:51:38 <ralonsoh> just a coincidence
15:51:42 <njohnston> ugh
15:51:52 <ralonsoh> I'll try to reduce the time between the deletion and the new creation
15:52:00 <ralonsoh> reusing the IP address
15:52:13 <slaweq> nice catch ralonsoh
15:52:22 <ralonsoh> good luck!
15:52:25 <njohnston> it's just unfortunate that with our ipam randomization work it got the same IP
15:52:35 <ralonsoh> yeah...
15:52:44 <bcafarel> lower probability but still not 0%
15:52:53 <njohnston> bad roll of the dice
15:53:28 <slaweq> ralonsoh: so is it a conflict of FIP or fixed IP?
15:53:33 <ralonsoh> FIP
15:53:34 <slaweq> how does this test work?
15:53:39 <ralonsoh> simple
15:53:40 <slaweq> do we need to delete it?
15:53:45 <ralonsoh> 2 vms with FIP
15:53:50 <ralonsoh> 1 vm is deleted, and the FIP too
15:54:09 <ralonsoh> then a VM is created and the FIP is created again, "reusing" the IP address
15:54:16 <slaweq> maybe we could just create the FIP, attach it to the first vm, detach it from the vm, attach it to the second vm?
15:54:33 <slaweq> that way the FIP would be "reserved" in the tenant for the whole test
15:54:37 <slaweq> am I right?
15:54:48 <ralonsoh> but we really need to delete it
15:54:55 <ralonsoh> one sec
15:55:00 <ralonsoh> it's in the test case description
15:55:24 <ralonsoh> https://github.com/openstack/neutron-tempest-plugin/blob/7b374486a54456d3c67fd2961c5894fb64ba48ab/neutron_tempest_plugin/scenario/test_floatingip.py#L518-L536
15:55:32 <ralonsoh> step 6
15:56:20 <ralonsoh> (btw, we SHOULD always document the test cases like this)
15:56:28 <njohnston> +1 for documenting
15:56:36 <slaweq> I agree
15:56:42 <ralonsoh> so clear, with those steps
15:57:12 <slaweq> and according to the test, I still don't see any reason why the FIP has to be deleted - if we just detached it from the VM it would stay as just a DB record
15:57:33 <ralonsoh> you are right
15:57:35 <slaweq> so You don't need to "remember" its IP address, just reuse it later for VM3
15:57:42 <ralonsoh> we really don't need to delete the DB record
15:57:52 <slaweq> and this should be more stable IMHO
15:58:02 <ralonsoh> okidoki, I'll propose a patch
15:58:07 <slaweq> ralonsoh++ thx
15:58:15 <bcafarel> nice
15:58:25 <slaweq> and with that we are almost out of time
15:58:35 <slaweq> at the end, one quick note about periodic jobs
15:58:42 <slaweq> all seems ok this week
15:58:45 <lajoskatona> just a question: isn't it possible to run these tests serially?
15:58:49 <lajoskatona> too much time?
15:58:53 <slaweq> our job with ovsdbapp from master is fine now
15:59:12 <slaweq> lajoskatona: yes, it would take too much time
15:59:19 <lajoskatona> ok
15:59:29 <slaweq> in the tempest-slow job tests are run serially but there are only a few of them
15:59:47 <slaweq> thx for attending the meeting, see You all tomorrow :)
15:59:51 <slaweq> o/
15:59:54 <bcafarel> o/
15:59:55 <slaweq> #endmeeting
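
Editor's note: a minimal sketch of the detach-and-reattach approach slaweq and ralonsoh agreed on above, written against openstacksdk rather than the actual neutron-tempest-plugin helpers (the real patch would change test_reuse_ip_address_with_other_fip_on_other_router in scenario/test_floatingip.py); the cloud name and port IDs are illustrative assumptions.

# Sketch only: keep the floating IP allocated to the tenant for the whole
# test and move it between ports, instead of deleting it and re-creating it
# with the same address (which races with other tests allocating FIPs from
# the same pool).
import openstack

conn = openstack.connect(cloud='devstack')  # assumed clouds.yaml entry


def move_fip_between_ports(fip_id, first_port_id, second_port_id):
    # attach the FIP to the first VM's port, then test connectivity
    conn.network.update_ip(fip_id, port_id=first_port_id)

    # detach it: the FIP stays reserved as a plain DB record, so no other
    # test can be handed the same address in the meantime
    conn.network.update_ip(fip_id, port_id=None)

    # re-attach the very same FIP (same IP address) to the second VM's port
    conn.network.update_ip(fip_id, port_id=second_port_id)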