15:00:12 <ykarel> #startmeeting neutron_ci
15:00:12 <opendevmeet> Meeting started Tue Oct  3 15:00:12 2023 UTC and is due to finish in 60 minutes.  The chair is ykarel. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:12 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:17 <ralonsoh> hello
15:00:18 <lajoskatona> o/
15:00:22 <ykarel> ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva, elvira
15:00:31 <ykarel> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:32 <ykarel> Please open now :)
15:00:39 <haleyb> o/
15:00:39 <slaweq> o/
15:01:01 <bcafarel> o/
15:01:30 <ykarel> Ok let's start with the topics
15:01:33 <ykarel> #topic Actions from previous meetings
15:01:42 <ykarel> ralonsoh to check failure with ha functional test
15:02:04 <ralonsoh> I opened a bug but I would need to find it
15:02:12 <ralonsoh> but I didn't push any patch yet
15:02:36 <ykarel> Thanks ralonsoh for checking, yes please share the bug later
15:02:48 <ykarel> lajoskatona to check consistency with bgpvpn related to https://zuul.openstack.org/build/8a624b4d29ea44589c9c83b0ec1da446
15:03:17 <lajoskatona> yes I pushed a DNM patch and it passed, and by the logs it looks like the ssh timeout issue we have in other jobs
15:03:58 <ykarel> yes, looks like a random one https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&result=TIMED_OUT&skip=0
15:04:19 <ykarel> and it happened again in the weekly run
15:04:21 <lajoskatona> yes, this week it was bgpvpn, and as the two run together with tempest, I suppose it's the same pattern
15:05:28 <ykarel> overall the job looks healthy https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&skip=0
15:06:01 <lajoskatona> yes, it is not frequent
15:06:18 <ykarel> Ok thanks lajoskatona, so we can keep monitoring it
15:06:50 <ykarel> ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:05 <ykarel> This was on me, but I couldn't check it, so I will check this week
15:07:19 <ykarel> #action ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:33 <ykarel> #topic Stable branches
15:07:48 <ykarel> bcafarel anything to share for stable branches?
15:08:29 <ykarel> at least stable/2023.2 is impacted by the L3 router issue
15:09:05 <bcafarel> indeed
15:10:04 <ykarel> the rest of the branches look good
15:10:28 <ykarel> #topic Stadium projects
15:10:32 <bcafarel> I am a bit behind on the rest but older branches look good as far as I could check
15:10:43 <ykarel> thx bcafarel
15:10:57 <ykarel> it was again bgpvpn job timeout https://zuul.openstack.org/buildset/5f604ee2caaf4d5883496d63087aa0dc
15:11:02 <ykarel> which we already discussed
15:11:03 <lajoskatona> except the issue in the bagpipe/bgpvpn tempest job, nothing else
15:11:19 <ykarel> thx
15:11:21 <lajoskatona> I pushed a few patches to update the weekly jobs to run with py311
15:11:23 <ykarel> #topic Grafana
15:11:36 <lajoskatona> and update small things in zuul.yaml
15:11:46 <ykarel> #undo
15:11:46 <opendevmeet> Removing item from minutes: #topic Grafana
15:12:06 <lajoskatona> this is the topic for them: https://review.opendev.org/q/topic:py311_neutron
15:12:12 <ykarel> #info lajoskatona pushed patches to stadium to include py311 jobs in weekly pipeline
15:12:26 <ykarel> #link https://review.opendev.org/q/topic:py311_neutron
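For reference, a minimal sketch (assuming the standard OpenDev job and pipeline names, not the actual patches under that topic) of what adding a py3.11 unit test job to a stadium project's weekly pipeline looks like in its zuul.yaml:

    # Hypothetical stadium-project zuul.yaml fragment; openstack-tox-py311 and
    # the periodic-weekly pipeline are the usual OpenDev ones, exact placement
    # differs per project.
    - project:
        periodic-weekly:
          jobs:
            - openstack-tox-py311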
15:12:42 <lajoskatona> as I see it, I have to go back to the fwaas one (last week I had no time to take care of them)
15:12:51 <lajoskatona> that's it from me
15:12:55 <ykarel> thx lajoskatona
15:13:01 <ykarel> #topic Grafana
15:13:10 <ykarel> https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:14:27 <ykarel> So we can see some significant failures from that l3 agent issue in the tempest scenario jobs
15:14:33 <slaweq> IMHO grafana looks good, except those neutron-tempest-plugin jobs which are failing A LOT
15:15:19 <ykarel> yeap right
15:15:36 <ykarel> ok moving to next topic
15:15:39 <ykarel> #topic Rechecks
15:16:01 <ykarel> stats look good, but there was not much activity this week
15:16:13 <ykarel> and just 1 bare recheck, so that's good too
15:16:38 <slaweq> yeah, I guess it's also because of those broken neutron-tempest-plugin jobs, as not many patches have really merged recently
15:16:51 <slaweq> and my script is checking only patches already merged
15:17:05 <ykarel> yes right
15:17:33 <ykarel> #topic fullstack/functional
15:17:44 <ykarel> AssertionError: Text not found in file /tmp/tmp_kg_b71e/tmpgbxs4_ju/log_file: "Initial status of router".
15:17:51 <ykarel> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_fe2/periodic/opendev.org/openstack/neutron/master/neutron-functional-with-sqlalchemy-master/fe21035/testr_results.html
15:17:57 <ykarel> https://81a66b9ba1b7a73d7079-cf2e6d1d128a778849f5c48f17b8a34a.ssl.cf1.rackcdn.com/896926/2/check/neutron-functional-with-uwsgi/ea8b000/testr_results.html
15:18:24 <ykarel> we discussed this a few weeks back and ralonsoh said it's a known issue that happens rarely
15:18:37 <ralonsoh> yes and this could also be the case with the tempest problems
15:18:38 <ykarel> but it seems to be happening quite often now
15:18:49 <ykarel> seen 6 failures in a week
15:18:51 <slaweq> maybe it's somehow related to the issue with scenario jobs?
15:19:02 <ralonsoh> yes
15:19:06 <ralonsoh> it could be
15:19:06 <ykarel> slaweq, ralonsoh yes symptoms looks quite similar
15:19:31 <ralonsoh> I don't know whether to refactor the IP monitor
15:19:42 <ralonsoh> or to continue the implementation of the bash script
15:19:46 <ralonsoh> one sec
15:19:55 <ralonsoh> --> https://review.opendev.org/c/openstack/neutron/+/836140?usp=dashboard
15:20:19 <ralonsoh> but somehow the IP monitor is not working fine now
15:21:07 <slaweq> maybe we can use something like https://raymii.org/s/tutorials/Keepalived_notify_script_execute_action_on_failover.html instead ?
15:21:19 <slaweq> I didn't read it fully, just googled for something like that now
15:21:36 <slaweq> AFAIR the ip_monitor is there only to monitor the IP address to see if keepalived did a failover
15:21:44 <ralonsoh> yes, exactly
15:21:48 <slaweq> maybe instead keepalived can notify neutron-l3-agent by itself
15:22:25 <slaweq> I would like to help with it but I don't know if I will have time in next few weeks
15:22:39 <slaweq> so I will not volunteer for it, at least not for now
15:23:39 <ykarel> ralonsoh, and what's left in the move to the bash script?
15:24:09 <ykarel> just so we can see whether the current issue reproduces with it
15:24:11 <ralonsoh> this change needs to handle the migration to this new script, in order to stop and replace the running keepalived-state-change scripts
15:24:22 <ralonsoh> what slaweq commented in the review
15:26:05 <ykarel> ok, and without those migration changes in place, it can be validated against the current issues, right?
15:26:25 <ralonsoh> yes, the current script should work fine now
15:26:34 <ralonsoh> I've pushed a new PS 30 mins ago
15:27:00 <ralonsoh> I'll change the zuul definitions to run the ovs jobs multiple times
15:27:17 <ykarel> okk thanks, just noticed the update
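For context, a rough sketch (not ralonsoh's actual change) of how a test patch can run the same job several times in one check run: thin variants that reuse the existing job as parent are added to the pipeline. The variant names here are made up, and the base job only stands in for "the ovs jobs" mentioned above.

    # Hypothetical .zuul.yaml fragment for a DNM/test patch.
    - job:
        name: neutron-tempest-plugin-openvswitch-rerun-1
        parent: neutron-tempest-plugin-openvswitch

    - job:
        name: neutron-tempest-plugin-openvswitch-rerun-2
        parent: neutron-tempest-plugin-openvswitch

    - project:
        check:
          jobs:
            - neutron-tempest-plugin-openvswitch
            - neutron-tempest-plugin-openvswitch-rerun-1
            - neutron-tempest-plugin-openvswitch-rerun-2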
15:27:56 <ykarel> Also similar failures were seen in other tests
15:27:57 <ykarel> test_dvr_lifecycle_ha_with_snat_with_fips_with_cent_fips_no_gw and test_dvr_ha_router_interface_mtu_update
15:28:04 <ykarel> https://53d9a8858ad69ec7c4a3-c555fae2d8c498523cc4b2c363541725.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-functional/6b7fe58/testr_results.html
15:28:16 <ykarel> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1ad/periodic/opendev.org/openstack/neutron/master/neutron-functional/1ad8b06/testr_results.html
15:28:39 <ykarel> #topic Tempest/Scenario
15:28:46 <ykarel> This we already discussed
15:28:54 <ykarel> master/stable/2023.2: tests in linuxbridge/openvswitch scenario jobs fail randomly since 22nd September, bug https://bugs.launchpad.net/neutron/+bug/2037239
15:29:08 <ykarel> failures like: Details: Router 411b39c1-b9fd-4fa1-a28b-d7976858a4d4 is not active on any of the L3 agents
15:29:26 <ykarel> I tried to reproduce with a linuxbridge setup locally; it reproduces rarely there (in CI reproducibility is quite high), and with some debug statements added I saw the observations below
15:29:34 <ykarel> Fails differently at different threads within keepalived-state-change process
15:29:34 <ykarel> handle_initial_state stuck, nothing written to state file, ha_state is set to "unknown"
15:29:34 <ykarel> Timeout reading the initial status of the router; "backup" is written to the state file and the state remains backup
15:29:34 <ykarel> ip_monitor stuck, so read_ip_updates is never started
15:30:00 <ykarel> but I couldn't find any recent change that could change the behavior in master/stable/2023.2
15:31:11 <lajoskatona> even dependency things seem to be the same, e.g. no keepalived change or similar
15:31:53 <ralonsoh> could be a change in the eventlet library and now something is being blocked in the ip_monitor
15:32:03 <ralonsoh> to be honest (and I implemented it) I don't like how it works
15:32:11 <ykarel> yes right, so it should be something in openstack, as external packages should be the same in those stable branches since they also run on jammy
15:32:29 <ykarel> ralonsoh, I checked, eventlet was not bumped for quite a few months
15:32:59 <ykarel> last updated 6+ months back
15:35:01 <ykarel> haleyb btw I had pushed https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/897233 with l3_ha=False on top of your 2023.2 jobs patch
15:35:12 <lajoskatona> +1
15:35:18 <lajoskatona> for short term
15:35:43 <haleyb> ykarel: ack, will review and see if it helps
15:35:54 <ykarel> I just pushed a test patch, but we could adapt it and take it as a short-term fix if it works fine
15:36:05 <ykarel> thx haleyb
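For reference, a minimal sketch of what such an l3_ha=False workaround typically looks like in a neutron-tempest-plugin job definition, assuming the usual devstack post-config mechanism; the job name here is illustrative and the real patch (897233) may place the override differently:

    # Hypothetical job override: disable HA routers for a failing scenario job.
    - job:
        name: neutron-tempest-plugin-openvswitch   # illustrative target job
        vars:
          devstack_local_conf:
            post-config:
              $NEUTRON_CONF:
                DEFAULT:
                  l3_ha: false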
15:36:21 <ykarel> #topic Periodic
15:36:49 <ykarel> periodic also had the same tempest failures in the linuxbridge job, and test failures in functional jobs
15:36:58 <ykarel> which we already discussed
15:37:05 <ykarel> #topic On Demand
15:37:30 <ykarel> anything else to raise here?
15:37:33 <ralonsoh> no
15:38:00 <lajoskatona> nothing from me
15:38:01 <slaweq> nope
15:38:37 <ykarel> thx everyone
15:38:52 <ykarel> #endmeeting