15:00:12 <ykarel> #startmeeting neutron_ci
15:00:12 <opendevmeet> Meeting started Tue Oct 3 15:00:12 2023 UTC and is due to finish in 60 minutes. The chair is ykarel. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:12 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:17 <ralonsoh> hello
15:00:18 <lajoskatona> o/
15:00:22 <ykarel> ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva, elvira
15:00:31 <ykarel> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:32 <ykarel> Please open now :)
15:00:39 <haleyb> o/
15:00:39 <slaweq> o/
15:01:01 <bcafarel> o/
15:01:30 <ykarel> Ok let's start with the topics
15:01:33 <ykarel> #topic Actions from previous meetings
15:01:42 <ykarel> ralonsoh to check failure with ha functional test
15:02:04 <ralonsoh> I opened a bug but I would need to find it
15:02:12 <ralonsoh> but I didn't push any patch yet
15:02:36 <ykarel> Thanks ralonsoh for checking, yes you can share the bug later
15:02:48 <ykarel> lajoskatona to check consistency with bgpvpn related to https://zuul.openstack.org/build/8a624b4d29ea44589c9c83b0ec1da446
15:03:17 <lajoskatona> yes I pushed a dnm patch and it passed, and by the logs it looks like the ssh timeout issue we have in other jobs
15:03:58 <ykarel> yes, looks like a random one https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&result=TIMED_OUT&skip=0
15:04:19 <ykarel> and it happened again in the weekly run
15:04:21 <lajoskatona> yes, this week it was bgpvpn, and as the 2 run together with tempest, I suppose the same pattern
15:05:28 <ykarel> overall the job looks healthy https://zuul.openstack.org/builds?job_name=neutron-tempest-plugin-bgpvpn-bagpipe&skip=0
15:06:01 <lajoskatona> yes, it is not frequent
15:06:18 <ykarel> Ok thanks lajoskatona, so we can keep monitoring it
15:06:50 <ykarel> ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:05 <ykarel> This was on me, but I couldn't check so will check it this week
15:07:19 <ykarel> #action ykarel to check https://bugs.launchpad.net/neutron/+bug/2036603
15:07:33 <ykarel> #topic Stable branches
15:07:48 <ykarel> bcafarel anything to share for stable branches?
15:08:29 <ykarel> at least stable/2023.2 is impacted by the l3 router issue
15:09:05 <bcafarel> indeed
15:10:04 <ykarel> rest of the branches look good
15:10:28 <ykarel> #topic Stadium projects
15:10:32 <bcafarel> I am a bit behind on the rest but older branches look good as far as I could check
15:10:43 <ykarel> thx bcafarel
15:10:57 <ykarel> it was again the bgpvpn job timeout https://zuul.openstack.org/buildset/5f604ee2caaf4d5883496d63087aa0dc
15:11:02 <ykarel> which we already discussed
15:11:03 <lajoskatona> except the issue in the bagpipe/bgpvpn tempest job nothing else
15:11:19 <ykarel> thx
15:11:21 <lajoskatona> I pushed a few patches to update the weekly jobs to run with py311
15:11:23 <ykarel> #topic Grafana
15:11:36 <lajoskatona> and update small things in zuul.yaml
15:11:46 <ykarel> #undo
15:11:46 <opendevmeet> Removing item from minutes: #topic Grafana
15:12:06 <lajoskatona> this is the topic for them: https://review.opendev.org/q/topic:py311_neutron
15:12:12 <ykarel> #info lajoskatona pushed patches to stadium to include py311 jobs in weekly pipeline
15:12:26 <ykarel> #link https://review.opendev.org/q/topic:py311_neutron
15:12:42 <lajoskatona> as I see I have to go back to the fwaas one (last week I had no time to take care of them)
15:12:51 <lajoskatona> that's it from me
15:12:55 <ykarel> thx lajoskatona
15:13:01 <ykarel> #topic Grafana
15:13:10 <ykarel> https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:14:27 <ykarel> So we can see some significant failures from that l3 agent issue in the tempest scenario jobs
15:14:33 <slaweq> IMHO grafana looks good, except those neutron-tempest-plugin jobs which are failing A LOT
15:15:19 <ykarel> yeap right
15:15:36 <ykarel> ok moving to next topic
15:15:39 <ykarel> #topic Rechecks
15:16:01 <ykarel> stats look good, but there was not much activity this week
15:16:13 <ykarel> and just 1 bare recheck, so that's good too
15:16:38 <slaweq> yeah, I guess it's also because of those broken neutron-tempest-plugin jobs, as not many patches really merged recently
15:16:51 <slaweq> and my script is checking only patches already merged
15:17:05 <ykarel> yes right
15:17:33 <ykarel> #topic fullstack/functional
15:17:44 <ykarel> AssertionError: Text not found in file /tmp/tmp_kg_b71e/tmpgbxs4_ju/log_file: "Initial status of router".
15:17:51 <ykarel> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_fe2/periodic/opendev.org/openstack/neutron/master/neutron-functional-with-sqlalchemy-master/fe21035/testr_results.html
15:17:57 <ykarel> https://81a66b9ba1b7a73d7079-cf2e6d1d128a778849f5c48f17b8a34a.ssl.cf1.rackcdn.com/896926/2/check/neutron-functional-with-uwsgi/ea8b000/testr_results.html
15:18:24 <ykarel> we discussed this a few weeks back and ralonsoh said it's a known issue that happens rarely
15:18:37 <ralonsoh> yes and this could also be the case for the tempest problems
15:18:38 <ykarel> but it seems to be happening quite often now
15:18:49 <ykarel> seen 6 failures in a week
15:18:51 <slaweq> maybe it's somehow related to the issue with scenario jobs?
15:19:02 <ralonsoh> yes
15:19:06 <ralonsoh> it could be
15:19:06 <ykarel> slaweq, ralonsoh yes symptoms look quite similar
15:19:31 <ralonsoh> I don't know whether to refactor the IP monitor
15:19:42 <ralonsoh> or to continue the implementation of the bash script
15:19:46 <ralonsoh> one sec
15:19:55 <ralonsoh> --> https://review.opendev.org/c/openstack/neutron/+/836140?usp=dashboard
15:20:19 <ralonsoh> but somehow the IP monitor is not working fine now
15:21:07 <slaweq> maybe we can use something like https://raymii.org/s/tutorials/Keepalived_notify_script_execute_action_on_failover.html instead ?
15:21:19 <slaweq> I didn't read it fully, just googled for something like that now
15:21:36 <slaweq> AFAIR ip_monitor is there only to monitor the IP address to see if keepalived did a failover
15:21:44 <ralonsoh> yes, exactly
15:21:48 <slaweq> maybe instead keepalived can notify neutron-l3-agent by itself
15:22:25 <slaweq> I would like to help with it but I don't know if I will have time in the next few weeks
15:22:39 <slaweq> so I will not volunteer for it, at least not for now
15:23:39 <ykarel> ralonsoh, and what's left in the move to the bash script?
15:24:09 <ykarel> just so we can see if the current issue reproduces with it
15:24:11 <ralonsoh> this change needs to handle the migration to this new script, in order to stop and replace the running keepalived-state-change scripts
15:24:22 <ralonsoh> what slaweq commented in the review
15:26:05 <ykarel> ohkk and without those migration changes in place, it can be validated against the current issues, right?
15:26:25 <ralonsoh> yes, the current script should work fine now
15:26:34 <ralonsoh> I've pushed a new PS 30 mins ago
15:27:00 <ralonsoh> I'll change the zuul definitions to run the ovs jobs multiple times
15:27:17 <ykarel> okk thanks, just noticed the update
15:27:56 <ykarel> Also similar failures seen in other tests
15:27:57 <ykarel> test_dvr_lifecycle_ha_with_snat_with_fips_with_cent_fips_no_gw and test_dvr_ha_router_interface_mtu_update
15:28:04 <ykarel> https://53d9a8858ad69ec7c4a3-c555fae2d8c498523cc4b2c363541725.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-functional/6b7fe58/testr_results.html
15:28:16 <ykarel> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1ad/periodic/opendev.org/openstack/neutron/master/neutron-functional/1ad8b06/testr_results.html
15:28:31 <opendevreview> Anton Vazhnetsov proposed openstack/ovsdbapp master: nb: add 'nexthop' argument to 'lr_route_del' https://review.opendev.org/c/openstack/ovsdbapp/+/896645
15:28:39 <ykarel> #topic Tempest/Scenario
15:28:46 <ykarel> This we already discussed
15:28:54 <ykarel> master/stable/2023.2: tests in linuxbridge/openvswitch scenario jobs fail randomly since 22nd September (bug https://bugs.launchpad.net/neutron/+bug/2037239), with
15:29:08 <ykarel> failures like Details: Router 411b39c1-b9fd-4fa1-a28b-d7976858a4d4 is not active on any of the L3 agents
15:29:26 <ykarel> I tried to reproduce with a linuxbridge setup locally; it reproduces rarely there, while in CI reproducibility is quite high. With some debug statements added I saw the observations below:
15:29:34 <ykarel> Fails differently at different threads within the keepalived-state-change process
15:29:34 <ykarel> handle_initial_state stuck, nothing written to state file, ha_state is set to "unknown"
15:29:34 <ykarel> Timeout reading the initial status of router, state "backup" written to state file, and state remains backup
15:29:34 <ykarel> ip_monitor stuck, so read_ip_updates never starts and stays stuck
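
(For reference on the keepalived notify idea slaweq linked above: keepalived can invoke a "notify" script on every VRRP state transition, passing the instance name and the new state (MASTER/BACKUP/FAULT), so the state could be pushed to the agent instead of being inferred from "ip monitor". Below is a minimal sketch of such a hook, assuming a hypothetical state directory and file layout; it is not what neutron-keepalived-state-change currently does.)

#!/usr/bin/env python3
# Hypothetical sketch of a keepalived notify hook. keepalived calls the
# configured "notify" script as: <script> (GROUP|INSTANCE) <name> <state>
# on every transition, so the new state can be recorded directly instead
# of being deduced by watching the VIP with "ip monitor".
# The state directory and file format below are illustrative only and are
# not what neutron-keepalived-state-change uses.
import os
import sys
import tempfile

STATE_DIR = "/var/lib/neutron/ha_states"  # assumed location for this sketch


def main() -> int:
    if len(sys.argv) < 4:
        return 1
    _kind, name, state = sys.argv[1:4]  # e.g. INSTANCE, VR_1, MASTER

    os.makedirs(STATE_DIR, exist_ok=True)
    # Write the state atomically so a reader never sees a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=STATE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write(state.lower() + "\n")
    os.replace(tmp_path, os.path.join(STATE_DIR, name))
    return 0


if __name__ == "__main__":
    sys.exit(main())

(In keepalived.conf this would be wired up with a "notify /path/to/script" line inside the vrrp_instance block; a real implementation would also need to inform the l3 agent of the change, as the state-change daemon does today.)
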
15:30:00 <ykarel> but I couldn't find any recent change that could change the behavior in master/stable/2023.2
15:31:11 <lajoskatona> even dependency things seem to be the same, like no keepalived change or such
15:31:53 <ralonsoh> could be a change in the eventlet library and now something is being blocked in the ip_monitor
15:32:03 <ralonsoh> to be honest (and I implemented it) I don't like how it works
15:32:11 <ykarel> yes right, so it should be something on the openstack side, as external packages should be the same in the stable branches since those also run jammy
15:32:29 <ykarel> ralonsoh, I checked, eventlet was not bumped for quite a few months
15:32:59 <ykarel> updated 6+ months back
15:35:01 <ykarel> haleyb btw I had pushed https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/897233 with l3_ha=False on top of your 2023.2 jobs patch
15:35:12 <lajoskatona> +1
15:35:18 <lajoskatona> for short term
15:35:43 <haleyb> ykarel: ack, will review and see if it helps
15:35:54 <ykarel> I just pushed a test patch, but we could change it and take it for the short term if it works fine
15:36:05 <ykarel> thx haleyb
15:36:21 <ykarel> #topic Periodic
15:36:49 <ykarel> periodic also had the same tempest failures in the linuxbridge job, and test failures in functional jobs
15:36:58 <ykarel> which we already discussed
15:37:05 <ykarel> #topic On Demand
15:37:30 <ykarel> anything else to raise here?
15:37:33 <ralonsoh> no
15:38:00 <lajoskatona> nothing from me
15:38:01 <slaweq> nope
15:38:37 <ykarel> thx everyone
15:38:52 <ykarel> #endmeeting
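
(For context on the failure signature discussed above: "Router ... is not active on any of the L3 agents" indicates that none of the L3 agents hosting the HA router reported ha_state "active" before the wait timed out. Below is a rough sketch of what such a wait boils down to against the Neutron API; endpoint, token, router id and timings are placeholders, and this is not the actual neutron-tempest-plugin code.)

#!/usr/bin/env python3
# Rough sketch: poll the L3 agents hosting an HA router via the Neutron API
# (GET /v2.0/routers/{router_id}/l3-agents) until one of them reports
# ha_state "active". Endpoint, token, router id and timings are placeholders.
import time

import requests

NEUTRON_URL = "http://controller:9696/v2.0"  # placeholder endpoint
TOKEN = "<keystone-token>"                   # placeholder token
ROUTER_ID = "411b39c1-b9fd-4fa1-a28b-d7976858a4d4"  # router id from the failure above


def router_is_active(router_id: str) -> bool:
    resp = requests.get(
        f"{NEUTRON_URL}/routers/{router_id}/l3-agents",
        headers={"X-Auth-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    agents = resp.json().get("agents", [])
    # For an HA router each hosting agent reports ha_state "active" or "standby".
    return any(agent.get("ha_state") == "active" for agent in agents)


deadline = time.time() + 300
while time.time() < deadline:
    if router_is_active(ROUTER_ID):
        print("router is active on at least one L3 agent")
        break
    time.sleep(5)
else:
    raise TimeoutError(
        "Router %s is not active on any of the L3 agents" % ROUTER_ID)
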