15:00:16 <slaweq> #startmeeting neutron_ci
15:00:17 <openstack> Meeting started Wed Apr 8 15:00:16 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:20 <openstack> The meeting name has been set to 'neutron_ci'
15:00:26 <ralonsoh> hi
15:00:27 <slaweq> hi
15:00:28 <njohnston> o/
15:00:51 <maciejjozefczyk> \o
15:01:08 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:01:16 <slaweq> #topic Actions from previous meetings
15:01:23 <bcafarel> o/
15:01:24 <slaweq> first one:
15:01:26 <slaweq> slaweq to investigate fullstack SG test broken pipe failures
15:01:36 <slaweq> I did some investigation on that one
15:02:06 <slaweq> I have some theory why it could fail like that but I need to try to reproduce it locally to confirm that
15:02:15 <slaweq> so I will continue this work next week too
15:02:28 <slaweq> #action slaweq to continue investigation of fullstack SG test broken pipe failures
15:02:41 <slaweq> next one
15:02:43 <slaweq> maciejjozefczyk to take a look and report LP for failures in neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_ovn_db_sync.TestOvnNbSyncOverTcp.test_ovn_nb_sync_log
15:03:02 <maciejjozefczyk> I *think* that I found what the problem is
15:03:06 <maciejjozefczyk> #link https://review.opendev.org/#/c/717704/
15:03:37 <maciejjozefczyk> I proposed to change the transaction timeout for updating OVN rows
15:03:55 <maciejjozefczyk> with the timeout bumped up I cannot see any failure of that test in the link above
15:04:17 <maciejjozefczyk> for a while I changed the zuul config to run those jobs 20 times and in all cases the failure wasn't there
15:04:40 <maciejjozefczyk> according to the logs in the LP report I found timeout errors there, so I think that should help
15:04:42 <slaweq> based on an old comment there, it's not the first time we are changing this timeout for tests
15:05:08 <maciejjozefczyk> slaweq, yes, but the funny thing is that by *default* in the config we have a much bigger timeout there
15:05:21 <maciejjozefczyk> I don't know why it was set to 5 seconds for functional tests before
15:05:51 <slaweq> ahh, ok
15:06:07 <maciejjozefczyk> https://docs.openstack.org/networking-ovn/latest/configuration/ml2_conf.html
15:06:14 <maciejjozefczyk> default is 180 seconds
15:06:43 <slaweq> ok, if that solves the problem, let's try that :)
15:06:45 <slaweq> thx maciejjozefczyk
15:06:56 <maciejjozefczyk> So actually it wasn't that bad that sometimes we got a timeout after 5 seconds :)
15:07:49 <slaweq> ok, let's move on
15:07:52 <slaweq> next one
15:07:54 <slaweq> ralonsoh to open LP about issue with neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase
15:08:05 <ralonsoh> patch merged
15:08:11 <slaweq> ++
15:08:13 <slaweq> thx ralonsoh
15:08:36 <ralonsoh> #link https://review.opendev.org/#/c/716944/
15:09:17 <slaweq> I hope that with this one and maciejjozefczyk's patch the functional jobs will finally be in better shape :)
15:09:29 <ralonsoh> sure!
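[Editor's note: for context on the timeout change discussed above, here is a minimal, hypothetical sketch - not the actual content of review 717704 - of how a functional test could raise the OVN OVSDB transaction timeout from the 5 seconds previously used to the documented default of 180 seconds. The option name ovsdb_connection_timeout and the [ovn] group come from the ml2_conf reference linked above; the test class is illustrative and assumes those options are already registered by the functional test framework.]

    # Hypothetical sketch only - not the code from review 717704.
    # Raises the OVN OVSDB transaction timeout to the documented default
    # (180 seconds) so slow CI nodes do not hit "Timeout after 5 seconds".
    from oslo_config import cfg

    from neutron.tests import base


    class OvnNbSyncTimeoutExampleTestCase(base.BaseTestCase):
        def setUp(self):
            super().setUp()
            # Assumes the [ovn] option group is already registered by the
            # functional test setup; otherwise set_override() would raise
            # NoSuchOptError.
            cfg.CONF.set_override('ovsdb_connection_timeout', 180,
                                  group='ovn')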
15:09:32 <maciejjozefczyk> ++
15:09:51 <slaweq> ok, next one
15:09:53 <slaweq> slaweq to check server termination on multicast test
15:09:59 <slaweq> I again didn't have time for this
15:10:22 <slaweq> and sadly it just hit us again in maciejjozefczyk's patch mentioned a few minutes ago
15:10:24 <slaweq> :/
15:10:37 <slaweq> I will really dedicate some time this week to finally check that one
15:10:40 <slaweq> #action slaweq to check server termination on multicast test
15:10:54 <slaweq> next one
15:10:56 <slaweq> ralonsoh to check failure in neutron.tests.functional.agent.l3.test_keepalived_state_change.TestMonitorDaemon
15:11:27 <ralonsoh> which one?
15:11:41 <ralonsoh> ups, I wasn't aware of this one
15:11:44 <ralonsoh> sorry
15:12:06 <slaweq> no problem
15:12:12 <maciejjozefczyk> slaweq, perhaps my patch needs a rebase
15:12:13 <slaweq> do You want to check it this week?
15:12:24 <slaweq> maciejjozefczyk: just recheck :)
15:12:25 <ralonsoh> slaweq, do you have the links?
15:12:28 <ralonsoh> of the logs
15:12:57 <slaweq> ralonsoh: https://80137ce53930819135d8-42d904af0faa486c8226703976d821a0.ssl.cf2.rackcdn.com/704833/23/check/neutron-functional/17568d5/testr_results.html
15:13:13 <ralonsoh> slaweq, thanks!
15:13:26 <slaweq> ralonsoh: is this related to https://bugs.launchpad.net/neutron/+bug/1870313 ?
15:13:27 <openstack> Launchpad bug 1870313 in neutron ""send_ip_addr_adv_notif" can't use eventlet when called from "keepalived_state_change"" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:13:34 <ralonsoh> ahh, it is the same problem
15:13:39 <slaweq> You linked this bug last week in the meeting's etherpad
15:13:43 <ralonsoh> yes, I was writing this
15:13:59 <ralonsoh> we can remove it from the TODO list
15:14:08 <slaweq> ok, so done :)
15:14:10 <slaweq> great, thx
15:14:36 <slaweq> ok, and the last one from the previous week
15:14:38 <slaweq> ralonsoh to check issue with neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_router_fip_qos_after_admin_state_down_up
15:14:46 <ralonsoh> yes, the namespace drama
15:15:17 <ralonsoh> #link https://review.opendev.org/#/c/717017/
15:15:44 <ralonsoh> it's merged, please review the commit message for more info
15:15:47 <slaweq> LOL - "namespace drama" sounds like a good name for a topic
15:15:48 <ralonsoh> or ping me on IRC
15:15:53 <ralonsoh> hehehehe
15:15:57 <slaweq> ahh, it's that one
15:15:59 <slaweq> ok
15:16:03 <maciejjozefczyk> heh
15:16:24 <slaweq> ok, that's all the actions from last week
15:16:30 <slaweq> let's move on
15:16:34 <slaweq> #topic Stadium projects
15:16:44 <slaweq> standardize on zuul v3
15:16:45 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron-train-zuulv3-py27drop
15:16:52 <slaweq> I just marked networking-bagpipe as done
15:17:01 <slaweq> thx lajoskatona for the work on this
15:17:17 <slaweq> so we only have networking-{midonet,odl} left in the list
15:17:20 <slaweq> not bad :)
15:17:26 <bcafarel> getting there
15:17:40 <njohnston> \o/
15:17:54 <slaweq> IPv6-only CI
15:17:55 <slaweq> Etherpad https://etherpad.openstack.org/p/neutron-stadium-ipv6-testing
15:18:02 <slaweq> still no progress on my side with that one
15:19:07 <slaweq> and there is one more thing related to stadium projects for today
15:19:12 <slaweq> midonet UT failures: https://bugs.launchpad.net/networking-midonet/+bug/1871568
15:19:13 <openstack> Launchpad bug 1871568 in networking-midonet "python3 unit tests jobs are failing on master branch" [Undecided,New]
15:19:24 <slaweq> basically the midonet gate is broken now
15:19:50 <slaweq> I just cloned it locally to check that one, but if someone wants to take it - feel free :)
15:20:23 <njohnston> Do we have any active midonet contributors left?
15:20:38 <ralonsoh> yamamoto?
15:20:39 <njohnston> I haven't seen yamamoto active recently
15:20:41 <slaweq> njohnston: yamamoto is the only one I'm aware of
15:20:50 <slaweq> but his activity is very limited :/
15:21:17 <slaweq> I will send him an email about that one - maybe he will be able to help with this
15:22:11 <slaweq> #action slaweq to ping yamamoto about midonet gate problems
15:22:24 <slaweq> ok, anything else related to stadium for today?
15:22:28 <njohnston> I think it would be good from a stadium perspective to do that just to check the health of the midonet driver, I know they have struggled to keep up the last few cycles with Xenial->Bionic and py27
15:22:59 <slaweq> njohnston: yes, that's true
15:23:19 <slaweq> maybe we will need to discuss that (again) during the next PTG
15:23:23 <njohnston> and Stackalytics shows no activity for yamamoto since January
15:24:19 <slaweq> ok
15:24:29 <slaweq> I will email him and we will see how it goes
15:25:09 <njohnston> +1
15:25:21 <ralonsoh> +1
15:25:35 <slaweq> let's move on
15:25:36 <slaweq> #topic Stable branches
15:26:06 <slaweq> as we recently have fewer issues to discuss e.g. with scenario jobs, I thought it would be good to add a topic about stable branches to this meeting
15:26:24 <slaweq> so we can all catch up on the current CI state for stable branches
15:26:25 <bcafarel> +100
15:26:43 <ralonsoh> perfect
15:27:14 <njohnston> I have not been updating the stable grafana dashboards, has anyone else?
15:27:47 <slaweq> njohnston: nope
15:27:47 <njohnston> ah, looks like they are good, excellent
15:27:55 <njohnston> http://grafana.openstack.org/d/pM54U-Kiz/neutron-failure-rate-previous-stable-release?orgId=1 <- train
15:28:14 <njohnston> http://grafana.openstack.org/d/dCFVU-Kik/neutron-failure-rate-older-stable-release?orgId=1 <- stein
15:28:48 <slaweq> I'm not sure if all jobs are up to date there
15:28:56 <bcafarel> yes, at least the branch name changes are there, I did not check if the jobs are up to date
15:30:04 <slaweq> bcafarel: do You want to check that this week? :)
15:30:09 <bcafarel> sure!
15:30:23 <slaweq> thx
15:30:37 <bcafarel> from my "what recheck keyword did you type?" memory, most are designate, rally and the occasional non-neutron test failing
15:30:39 <slaweq> #action bcafarel to check and update stable branches grafana dashboards
15:31:23 <slaweq> one serious issue from today is this one with rally: https://bugs.launchpad.net/neutron/+bug/1871596
15:31:24 <openstack> Launchpad bug 1871596 in neutron "Rally job on stable branches is failing" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:31:37 <slaweq> but it seems it's already fixed in rally thanks to lucasgomes :)
15:31:51 <ralonsoh> +1
15:32:21 <bcafarel> yep
15:32:48 <njohnston> bcafarel: looks like haleyb updated the stable dashboards when train was released
15:32:49 <slaweq> great, so other than that it's fine for stable branches now, right?
15:33:09 <bcafarel> fix is merged and from andreykurilin's comments there are not many rally calls in the playbook (the part currently running from master), so we should be good
15:33:40 <bcafarel> https://bugs.launchpad.net/bugs/1871327 also has fixes merged in all branches now
15:33:41 <openstack> Launchpad bug 1871327 in tempest "stable/stein tempest-full job fails with "tempest requires Python '>=3.6' but the running Python is 2.7.17"" [Undecided,New]
15:34:27 <slaweq> ok
15:34:37 <slaweq> so I think we can move on to the next topic
15:34:38 <slaweq> #topic Grafana
15:34:44 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:34:49 <slaweq> Average number of rechecks in last weeks:
15:34:51 <slaweq> week 14 of 2020: 3.13
15:34:53 <slaweq> week 15 of 2020: 3.6
15:35:06 <slaweq> those are my metrics - not so good this week :/
15:35:19 <ralonsoh> IMO, related to the mentioned problems
15:35:25 <slaweq> ralonsoh: yes
15:35:41 <slaweq> that's for sure - I didn't notice any new problems last week
15:36:03 <slaweq> one thing which I want to mention is that Grenade jobs should be better soon, as https://bugs.launchpad.net/bugs/1844929 has a proposed patch https://review.opendev.org/717662
15:36:04 <openstack> Launchpad bug 1844929 in OpenStack Compute (nova) "grenade jobs failing due to "Timed out waiting for response from cell" in scheduler" [High,In progress] - Assigned to melanie witt (melwitt)
15:36:26 <slaweq> melwitt did great debugging on it and finally found what the issue was there
15:36:43 <bcafarel> ooooh nice! (I forgot grenade in the frequent recheck causes on stable)
15:38:00 <slaweq> I also proposed patch https://review.opendev.org/718392 with updates for the grafana dashboard
15:39:30 <slaweq> and that's all about grafana from me
15:39:33 <slaweq> anything else to add?
15:39:34 <ralonsoh> (you need to rebase it)
15:40:19 <slaweq> ralonsoh: I will
15:40:51 <slaweq> ok, if not, let's move on
15:41:00 <slaweq> #topic Tempest/Scenario
15:41:12 <slaweq> I have only one new issue here
15:41:16 <slaweq> TripleO-based jobs are failing, like e.g.: https://933286ee423f4ed9028e-1eceb8a6fb7f917522f65bda64a8589f.ssl.cf5.rackcdn.com/717754/2/check/neutron-centos-8-tripleo-standalone/a5f2585/job-output.txt
15:41:38 <slaweq> do You maybe know why that is? It seems that the neutron rpm build fails in those jobs
15:42:08 <bcafarel> sounds like https://review.rdoproject.org/r/26305
15:42:15 <maciejjozefczyk> slaweq, it started after we merged the ovn migration tools
15:42:32 <maciejjozefczyk> yes bcafarel
15:42:34 <slaweq> I thought that but I wanted to confirm it :)
15:42:41 <slaweq> thx
15:42:45 <slaweq> so it should be good soon
15:43:28 <slaweq> ok, so that's all I have for today
15:43:40 <slaweq> anything else You want to talk about regarding CI?
15:44:00 <njohnston> glad things are looking pretty stable going into the U release
15:45:11 <slaweq> njohnston: yes, IMO it is in better shape recently
15:45:17 <slaweq> we don't have many new issues
15:45:25 <njohnston> nothing else from me
15:45:29 <slaweq> and patches are generally merged (usually) pretty fast
15:46:12 <slaweq> ok, if there is nothing else, I will give You almost 15 minutes back :)
15:46:26 <slaweq> thx for attending the meeting and for taking care of our CI
15:46:29 <slaweq> o/
15:46:31 <slaweq> #endmeeting
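[Editor's note: the "average number of rechecks" figures quoted under the Grafana topic come from slaweq's own tracking. The sketch below is only a rough, hypothetical approximation of such a metric, not his actual script: it counts comments containing "recheck" on recently merged neutron patches via the Gerrit REST API. The query string, the one-week window, and the crude text match are assumptions, and pagination is ignored for brevity.]

    # Hypothetical approximation of an "average rechecks per merged patch"
    # metric - not the actual script behind the numbers quoted above.
    import json
    import urllib.parse
    import urllib.request

    GERRIT = "https://review.opendev.org"
    QUERY = "project:openstack/neutron status:merged -age:1week"

    url = f"{GERRIT}/changes/?q={urllib.parse.quote(QUERY)}&o=MESSAGES"
    with urllib.request.urlopen(url) as resp:
        # Gerrit prefixes its JSON responses with ")]}'" to prevent XSSI.
        body = resp.read().decode("utf-8").lstrip(")]}'\n")
    changes = json.loads(body)

    # Crude match: count review comments mentioning "recheck" per change.
    rechecks = [
        sum("recheck" in m.get("message", "").lower()
            for m in change.get("messages", []))
        for change in changes
    ]
    if changes:
        print(f"average rechecks per merged patch: "
              f"{sum(rechecks) / len(changes):.2f}")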