15:00:12 #startmeeting neutron_ci
15:00:12 Meeting started Tue Oct 3 15:00:12 2023 UTC and is due to finish in 60 minutes. The chair is ykarel. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:12 The meeting name has been set to 'neutron_ci'
15:00:17 hello
15:00:18 o/
15:00:22 ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva, elvira
15:00:31 Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:32 Please open now :)
15:00:39 o/
15:00:39 o/
15:01:01 o/
15:01:30 Ok let's start with the topics
15:01:33 #topic Actions from previous meetings
15:01:42 ralonsoh to check failure with ha functional test
15:02:04 I opened a bug but I would need to find it
15:02:12 but I didn't push any patch yet
15:02:36 Thanks ralonsoh for checking, yes, can share the bug later
15:02:48 lajoskatona to check consistency with bgpvpn related to https://zuul.openstack.org/build/8a624b4d29ea44589c9c83b0ec1da446
15:03:17 yes, I pushed a dnm patch and it passed, and by the logs it looks like the ssh timeout issue we have in other jobs
15:03:58 yes, looks like a random one https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&result=TIMED_OUT&skip=0
15:04:19 and it happened again in the weekly run
15:04:21 yes, this week it was bgpvpn, and as the two run together with tempest, I suppose the same pattern
15:05:28 overall the job looks healthy https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&skip=0
15:06:01 yes, it is not frequent
15:06:18 Ok thanks lajoskatona, so we can keep monitoring it
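(As a possible aid for the "keep monitoring it" action above, here is a minimal sketch of polling the Zuul builds API for TIMED_OUT runs of the bgpvpn-bagpipe job. It mirrors the filters in the dashboard URLs linked earlier, but the response field names used below are assumptions to verify against the live API.)

```python
#!/usr/bin/env python3
"""Rough sketch: count recent TIMED_OUT runs of the bgpvpn-bagpipe job.

Talks to the Zuul REST API behind the builds URLs linked above; the field
names used here ("result", "end_time", "log_url") are assumptions to
double-check against the live API responses.
"""
import requests

ZUUL_API = "https://zuul.openstack.org/api/builds"
JOB = "neutron-tempest-plugin-bgpvpn-bagpipe"


def recent_builds(job_name, limit=50):
    # Same filters as the dashboard URL, just via the JSON API.
    resp = requests.get(ZUUL_API, params={"job_name": job_name, "limit": limit},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    builds = recent_builds(JOB)
    timed_out = [b for b in builds if b.get("result") == "TIMED_OUT"]
    print(f"{len(timed_out)}/{len(builds)} recent runs TIMED_OUT")
    for b in timed_out:
        print(b.get("end_time"), b.get("log_url"))
```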
15:06:50 ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:05 This was on me, but I couldn't check, so will check it this week
15:07:19 #action ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:33 #topic Stable branches
15:07:48 bcafarel anything to share for stable branches?
15:08:29 at least stable/2023.2 is impacted with the l3 router issue
15:09:05 indeed
15:10:04 the rest of the branches look good
15:10:28 #topic Stadium projects
15:10:32 I am a bit behind on the rest but older branches look good as far as I could check
15:10:43 thx bcafarel
15:10:57 it was again the bgpvpn job timeout https://zuul.openstack.org/buildset/5f604ee2caaf4d5883496d63087aa0dc
15:11:02 which we already discussed
15:11:03 except the issue in the bagpipe/bgpvpn tempest job, nothing else
15:11:19 thx
15:11:21 I pushed a few patches to update the weekly jobs to run with py311
15:11:23 #topic Grafana
15:11:36 and update small things in zuul.yaml
15:11:46 #undo
15:11:46 Removing item from minutes: #topic Grafana
15:12:06 this is the topic for them: https://review.opendev.org/q/topic:py311_neutron
15:12:12 #info lajoskatona pushed patches to stadium to include py311 jobs in weekly pipeline
15:12:26 #link https://review.opendev.org/q/topic:py311_neutron
15:12:42 as I see I have to go back to the fwaas one (last week I had no time to take care of them)
15:12:51 that's it from me
15:12:55 thx lajoskatona
15:13:01 #topic Grafana
15:13:10 https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:14:27 So we can see some significant failures from those l3 agent tempest failures in the scenario jobs
15:14:33 IMHO grafana looks good, except those neutron-tempest-plugin jobs which are failing A LOT
15:15:19 yeap right
15:15:36 ok moving to next topic
15:15:39 #topic Rechecks
15:16:01 stats look good, but there was not much activity this week
15:16:13 and just 1 bare recheck, so that's good too
15:16:38 yeah, I guess it's also because of those broken neutron-tempest-plugin jobs, as not many patches really got merged recently
15:16:51 and my script is checking only patches already merged
15:17:05 yes right
15:17:33 #topic fullstack/functional
15:17:44 AssertionError: Text not found in file /tmp/tmp_kg_b71e/tmpgbxs4_ju/log_file: "Initial status of router".
15:17:51 https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_fe2/periodic/opendev.org/openstack/neutron/master/neutron-functional-with-sqlalchemy-master/fe21035/testr_results.html
15:17:57 https://81a66b9ba1b7a73d7079-cf2e6d1d128a778849f5c48f17b8a34a.ssl.cf1.rackcdn.com/896926/2/check/neutron-functional-with-uwsgi/ea8b000/testr_results.html
15:18:24 we discussed this a few weeks back and ralonsoh said it's a known issue and happens rarely
15:18:37 yes, and this could also be the case with the tempest problems
15:18:38 but it seems to be happening quite often now
15:18:49 seen 6 failures in a week
15:18:51 maybe it's somehow related to the issue with the scenario jobs?
15:19:02 yes
15:19:06 it could be
15:19:06 slaweq, ralonsoh yes, the symptoms look quite similar
15:19:31 I don't know whether to refactor the IP monitor
15:19:42 or to continue the implementation of the bash script
15:19:46 one sec
15:19:55 --> https://review.opendev.org/c/openstack/neutron/+/836140?usp=dashboard
15:20:19 but somehow the IP monitor is not working fine now
15:21:07 maybe we can use something like https://raymii.org/s/tutorials/Keepalived_notify_script_execute_action_on_failover.html instead ?
15:21:19 I didn't read it fully, just googled about something like that now
15:21:36 AFAIR ip_monitor is there only to monitor the IP address to see if keepalived did a failover
15:21:44 yes, exactly
15:21:48 maybe instead keepalived can notify neutron-l3-agent by itself
15:22:25 I would like to help with it but I don't know if I will have time in the next few weeks
15:22:39 so I will not volunteer for it, at least not for now
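(To illustrate the notify-script idea raised above: keepalived can invoke an external script on every VRRP state transition, passing the instance name and the new state as arguments, so the agent would not need to watch IP addresses at all. Below is a minimal sketch under that assumption; the state-file layout and the wiring into the l3 agent are hypothetical, and this is not neutron's current mechanism.)

```python
#!/usr/bin/env python3
"""Hypothetical keepalived notify script (illustration only, not neutron code).

keepalived would call it via a vrrp_instance "notify" option and pass:
    argv[1] = "GROUP" or "INSTANCE", argv[2] = instance name, argv[3] = new state.
"""
import pathlib
import sys

# Hypothetical location; neutron's real HA state file layout may differ.
STATE_DIR = pathlib.Path("/var/lib/neutron/ha_confs")


def main() -> int:
    if len(sys.argv) < 4:
        return 1
    _kind, instance, state = sys.argv[1], sys.argv[2], sys.argv[3]
    # Record the transition (MASTER / BACKUP / FAULT) so the agent could read
    # it directly instead of inferring it from "ip monitor" events.
    state_file = STATE_DIR / instance / "state"
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(state.lower() + "\n")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```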
15:23:39 ralonsoh, and what's left in the move to the bash script?
15:24:09 just to see if the current issue reproduces with it
15:24:11 this change needs to handle the migration to this new script, in order to stop and replace the running keepalived-state-change scripts
15:24:22 what slaweq commented in the review
15:26:05 ohkk, and without those migration changes in place, it can be validated against the current issues, right?
15:26:25 yes, the current script should work fine now
15:26:34 I've pushed a new PS 30 mins ago
15:27:00 I'll change the zuul definitions to run the ovs jobs multiple times
15:27:17 okk thanks, just noticed the update
15:27:56 Also similar failures seen in other tests
15:27:57 test_dvr_lifecycle_ha_with_snat_with_fips_with_cent_fips_no_gw and test_dvr_ha_router_interface_mtu_update
15:28:04 https://53d9a8858ad69ec7c4a3-c555fae2d8c498523cc4b2c363541725.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-functional/6b7fe58/testr_results.html
15:28:16 https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1ad/periodic/opendev.org/openstack/neutron/master/neutron-functional/1ad8b06/testr_results.html
15:28:31 Anton Vazhnetsov proposed openstack/ovsdbapp master: nb: add 'nexthop' argument to 'lr_route_del' https://review.opendev.org/c/openstack/ovsdbapp/+/896645
15:28:39 #topic Tempest/Scenario
15:28:46 This we already discussed
15:28:54 master/stable/2023.2: tests in linuxbridge/openvswitch scenario jobs fail randomly since 22nd September, bug https://bugs.launchpad.net/neutron/+bug/2037239, with
15:29:08 failures like Details: Router 411b39c1-b9fd-4fa1-a28b-d7976858a4d4 is not active on any of the L3 agents
15:29:26 I tried to reproduce with a linuxbridge setup locally; it reproduces rarely there, and I saw the below observations after adding some debug statements; in CI the reproducibility is quite high
15:29:34 Fails differently at different threads within the keepalived-state-change process
15:29:34 handle_initial_state stuck, nothing written to the state file, ha_state is set to "unknown"
15:29:34 Timeout reading the initial status of the router, "backup" state written to the state file, and the state remains backup
15:29:34 ip_monitor stuck, leading to read_ip_updates not starting
15:30:00 but I couldn't find any recent change that could change the behavior in master/stable/2023.2
15:31:11 even dependency things seem to be the same, like no keepalived change or such
15:31:53 could be a change in the eventlet library and now something is being blocked in the ip_monitor
15:32:03 to be honest (and I implemented it) I don't like how it works
15:32:11 yes right, so it should be something within openstack, as the external packages should be the same in those stable branches since they run jammy
15:32:29 ralonsoh, I checked, eventlet was not bumped for quite a few months
15:32:59 updated 6+ months back
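(To make the "ip_monitor stuck" observation above concrete: conceptually the monitor only watches for the HA VIP being added to or removed from the router's interface, in order to infer master/backup transitions. Below is a minimal standalone sketch of that idea with placeholder device and VIP values; it is not the actual keepalived-state-change implementation.)

```python
#!/usr/bin/env python3
"""Standalone sketch of the VIP-watching idea (not neutron's actual code).

Spawns "ip -o monitor address" and infers the keepalived state from whether
the HA VIP shows up on the given device. Device name and VIP are placeholders.
"""
import subprocess

DEVICE = "ha-12345"        # placeholder HA interface name
VIP = "169.254.0.1/24"     # placeholder VRRP virtual IP


def watch():
    # One event per line thanks to the -o (oneline) output format.
    proc = subprocess.Popen(["ip", "-o", "monitor", "address"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if DEVICE not in line or VIP not in line:
            continue
        # "Deleted" prefix means the address went away -> backup;
        # otherwise the VIP was just added -> master.
        state = "backup" if line.startswith("Deleted") else "master"
        print(f"router transitioned to {state}: {line.strip()}")


if __name__ == "__main__":
    watch()
```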
15:35:01 haleyb btw I had pushed https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/897233 with l3_ha=False on top of your 2023.2 jobs patch
15:35:12 +1
15:35:18 for the short term
15:35:43 ykarel: ack, will review and see if it helps
15:35:54 I just pushed a test patch, but we could change it and take it for the short term if it works fine
15:36:05 thx haleyb
15:36:21 #topic Periodic
15:36:49 periodic also had the same tempest failures in the linuxbridge job, and test failures in the functional jobs
15:36:58 which we already discussed
15:37:05 #topic On Demand
15:37:30 anything else to raise here?
15:37:33 no
15:38:00 nothing from me
15:38:01 nope
15:38:37 thx everyone
15:38:52 #endmeeting