15:00:28 <slaweq> #startmeeting neutron_ci
15:00:28 <opendevmeet> Meeting started Tue Jul 18 15:00:28 2023 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:31 <mlavalle> o/
15:00:37 <slaweq> ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva, elvira
15:00:41 <ralonsoh> hi
15:00:42 <bcafarel> o/
15:00:46 <slaweq> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:54 <mtomaska> o/
15:01:06 <ykarel> o/
15:01:48 <slaweq> ok, let's start
15:01:54 <slaweq> #topic Actions from previous meetings
15:02:03 <slaweq> there was only one action item on the list
15:02:07 <slaweq> ralonsoh to try reduce concurency in functional job(s)
15:02:23 <ralonsoh> done and merged (I don't have the patch right now)
15:02:46 <ralonsoh> hold on, not this one, sorry
15:02:47 <ralonsoh> no no
15:02:52 <slaweq> thx
15:03:02 <slaweq> no need to search for the link to the patch
15:03:10 <slaweq> We trust You :P
15:03:12 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/887633
15:03:24 <ralonsoh> heheheheh (but I don't trust myself)
15:03:52 <elvira> o/
15:04:00 <slaweq> LOL
15:04:10 <slaweq> ok, next topic then
15:04:11 <slaweq> #topic Stable branches
15:04:20 <slaweq> bcafarel any updates?
15:04:41 <bcafarel> not a lot, I was off yesterday so I still need to check the last backports, but it looked good
15:04:55 <bcafarel> wallaby/victoria is back in shape too :)
15:05:03 <slaweq> that's good, thx
15:05:06 <ralonsoh> +1!
15:05:37 <slaweq> anything else regarding CI of the stable branches anyone?
15:06:33 <slaweq> if not, I think we can move on
15:06:48 <slaweq> I will skip the stadium projects topic today as Lajos is off
15:06:52 <ralonsoh> one sec
15:07:03 <ralonsoh> https://review.opendev.org/c/openstack/networking-bagpipe/+/862505
15:07:07 <ralonsoh> https://review.opendev.org/c/openstack/networking-bgpvpn/+/888719
15:07:20 <ralonsoh> CI is not working for both of them, in particular n-t-p
15:07:23 <slaweq> sure
15:07:35 <ralonsoh> these patches are in Victoria
15:07:53 <ralonsoh> if these jobs are not fixed, I'll propose the EOL of these branches
15:08:29 <slaweq> AttributeError: module 'neutron.common.config' has no attribute 'register_common_config_options'
15:08:32 <ralonsoh> yes
15:08:36 <ralonsoh> in both jobs
15:08:56 <slaweq> so I guess it's some mismatch in versions used there
15:09:06 <ralonsoh> could be, right
15:09:11 <slaweq> maybe we can wait 1 more week for Lajos to check it?
15:09:14 <slaweq> wdyt?
15:09:18 <ralonsoh> because that was implemented in newer versions of Neutron
15:09:22 <ralonsoh> of course, that can wait
15:09:30 <slaweq> ++
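A minimal sketch of what the version mismatch above implies, assuming the Victoria branch of neutron simply predates register_common_config_options (the attribute named in the traceback); the guard below is only illustrative, not a proposed fix for the stadium branches:

    # Hypothetical guard: call the neutron helper only on releases that provide it.
    from neutron.common import config as common_config

    def register_common_options():
        if hasattr(common_config, 'register_common_config_options'):
            # Newer neutron releases expose this explicit registration helper.
            common_config.register_common_config_options()
        # On older branches (the assumption here for e.g. Victoria) the common
        # options are registered on import, so there is nothing extra to do.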
15:09:50 <slaweq> ok, so next topic
15:09:55 <slaweq> #topic Grafana
15:10:22 <slaweq> I see that yesterday functional and fullstack jobs were broken completely
15:10:31 <slaweq> but it's getting better today
15:10:38 <slaweq> do You know what happened there, maybe?
15:10:51 <ralonsoh> no idea, I'll check the logs
15:10:59 <ralonsoh> I didn't see anything special
15:11:25 <slaweq> ok, maybe just some coincidence
15:11:48 <slaweq> #topic Rechecks
15:12:06 <slaweq> rechecks looked a bit better last week
15:12:19 <slaweq> and we have had almost no "bare rechecks" recently
15:12:33 <slaweq> that's all from me regarding this topic
15:12:41 <slaweq> any questions/comments on that?
15:12:49 <ralonsoh> one note
15:13:05 <ralonsoh> don't hesitate to reply on a patch requesting someone not to do bare rechecks
15:13:23 <ralonsoh> we should teach other people to check the CI first
15:13:39 <ralonsoh> that's all
15:13:58 <mlavalle> +1
15:14:02 <ykarel> slaweq, wrt the bare recheck report, does it also check stable branch patches?
15:14:03 <slaweq> ++
15:14:26 <slaweq> ykarel no, I'm checking only master branch currently
15:14:30 <ykarel> because I had noticed some bare rechecks there recently but you said there were none
15:14:38 <ykarel> ok thanks, makes sense then
15:15:28 <slaweq> ok, let's move on then
15:15:30 <slaweq> #topic fullstack/functional
15:16:18 <slaweq> I was checking some timeouts in the functional job today and it seems to me that it may not always be just noisy neighbors or something like that
15:16:28 <slaweq> please look at https://3e402c0e76741e83fc60-d00ff4f1a74cdbc5ea9d8044145b77c0.ssl.cf2.rackcdn.com/888574/3/check/neutron-functional-with-uwsgi/3bea3d7/job-output.txt
15:16:46 <slaweq> if You look at the logs there, there are a lot of tests which failed with timeouts
15:16:53 <ralonsoh> timeout in privsep
15:16:58 <slaweq> and those tests took 3 minutes or more
15:17:22 <slaweq> and most (or all) of them are failing due to the timeouts while interacting with netlink
15:17:35 <slaweq> like e.g. checking if device exists in namespace
15:17:47 <slaweq> or setting IP address on interface
15:17:48 <slaweq> etc.
15:18:08 <slaweq> it also looks similar in fullstack tests job in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c52/883246/16/check/neutron-fullstack-with-uwsgi/c524e84/job-output.txt
15:18:46 <slaweq> so maybe we should spend some more time and try to understand this issue on our side
15:18:49 <slaweq> did You see issues like that before?
15:19:41 <ralonsoh> yes but randomly
15:21:37 <slaweq> ralonsoh and do You maybe know what is causing such a problem there?
15:21:45 <ralonsoh> no, sorry
15:22:02 <slaweq> do You think we should open LP for that?
15:22:15 <ralonsoh> but when you have an issue in FT related to privsep, the other tests will be affected
15:22:22 <ralonsoh> I think so, at least we should track it
15:22:32 <slaweq> I will open LP for it
15:22:44 <slaweq> and if anyone has any cycles, please try to take a look at this
15:23:02 <mlavalle> I'll look at this
15:23:14 <slaweq> thx mlavalle
15:23:18 <mlavalle> please assign the LP to me
15:23:37 <slaweq> #action slaweq to create LP about timeouts in functional job and assign it to mlavalle
15:23:41 <slaweq> sure, thx a lot
15:23:46 <mlavalle> will you include the pointers to the test failures in the LP?
15:24:00 <racosta> are you using generic or kvm kernel in the test images? KVM kernel has no gre module: "controller | FAILED TO LOAD nf_conntrack_proto_gre"
15:24:06 <slaweq> mlavalle yes
15:24:14 <mlavalle> Thanks!
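As a side note, a minimal sketch (not neutron code) of the kind of namespace/netlink call the failing tests were timing out on, assuming pyroute2 is available; in the functional jobs these calls go through neutron.agent.linux.ip_lib and the privsep daemon, which is where the timeouts showed up:

    from pyroute2 import NetNS

    def device_exists_in_namespace(namespace, device):
        # Open a netlink socket inside the namespace and look the device up by name.
        with NetNS(namespace) as ns:
            return bool(ns.link_lookup(ifname=device))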
15:24:49 <slaweq> racosta is this question related to the functional tests issue which we are talking about now?
15:25:01 <racosta> yes
15:25:21 <slaweq> I don't know what kernel is used there
15:25:30 <slaweq> we just use what infra provides us
15:26:05 <slaweq> but mlavalle maybe it's a good first thing to check then :)
15:26:21 <ykarel> Linux np0034680805 5.15.0-76-generic
15:26:23 <slaweq> maybe it's a different kernel in different providers and maybe that is causing some issue
15:26:24 <slaweq> idk
15:26:51 <racosta> I imagine it is kvm, in which case loading the gre module will not work.
15:26:51 <ykarel> I think images are common across providers
15:27:17 <slaweq> thx ykarel
15:27:23 <slaweq> so it should work fine then
15:27:57 <racosta> cloud images use kvm kernel - at least the ubuntu ones I've tested
15:29:14 <ykarel> looking at the kernel it seems it's not a kvm one, i.e. Linux np0034680805 5.15.0-76-generic; iirc kvm kernels have -kvm as suffix
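For reference, an illustrative local check (not part of the CI jobs) of the kernel flavour and whether the GRE conntrack module can be loaded, along the lines of what racosta describes; the -kvm suffix convention and the module name are taken from the discussion above:

    import platform
    import subprocess

    def check_gre_conntrack():
        kernel = platform.release()  # e.g. '5.15.0-76-generic'
        print('kernel: %s (kvm flavour: %s)' % (kernel, kernel.endswith('-kvm')))
        # On kvm cloud kernels this modprobe is expected to fail, as in the
        # "FAILED TO LOAD nf_conntrack_proto_gre" message quoted earlier.
        result = subprocess.run(['modprobe', 'nf_conntrack_proto_gre'],
                                capture_output=True, text=True)
        if result.returncode != 0:
            print('FAILED TO LOAD nf_conntrack_proto_gre: %s' % result.stderr.strip())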
15:31:50 <slaweq> ok, let's move forward for now and mlavalle can check it if needed :)
15:32:14 <slaweq> regarding fullstack tests I have found one more issue
15:32:16 <slaweq> neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_keepalived_multiple_sighups_does_not_forfeit_primary
15:32:20 <slaweq> https://284d785cc67babb2d75b-a8caf4f4ad4b0a7f74dab21fc6a45bed.ssl.cf2.rackcdn.com/886992/5/check/neutron-fullstack-with-uwsgi/0bcd7be/testr_results.html
15:32:47 <slaweq> I just found it once so far but I think it's worth checking what happened there at least
15:32:54 <slaweq> any volunteer for that?
15:32:58 <ralonsoh> I'll check it
15:33:05 <slaweq> thx ralonsoh
15:33:26 <slaweq> #action ralonsoh to check failed neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_keepalived_multiple_sighups_does_not_forfeit_primary test
15:33:31 <slaweq> #topic Tempest/Scenario
15:33:38 <slaweq> slow jobs have been broken in releases before xena since 11th July
15:33:48 <slaweq> https://bugs.launchpad.net/neutron/+bug/2027817
15:33:52 <slaweq> I think this was added by ykarel
15:33:56 <ykarel> that's fixed with a patch in tempest
15:34:08 <ykarel> merged yesterday
15:34:12 <ralonsoh> I think we are good now
15:34:25 <slaweq> thx
15:34:56 <slaweq> from other issues, I have found one with some timeouts in the nova-compute https://7ad29d1b700c1da60ae0-1bae5319fe4594ade335a46ad1c3bcc9.ssl.cf2.rackcdn.com/867513/24/check/neutron-tempest-plugin-openvswitch-iptables_hybrid/2bee760/controller/logs/screen-n-cpu.txt
15:35:15 <slaweq> so just FYI - if You see something similar, I think we may want to report a bug for nova for this
15:35:31 <ralonsoh> but related to port bindings?
15:35:32 <slaweq> and that's all regarding scenario jobs for today from me
15:35:47 <slaweq> ralonsoh no, I think there is not even a port created yet
15:36:05 <slaweq> nova-compute was failing earlier in the process there, IIUC the nova-compute log
15:36:06 <ralonsoh> ahh I see, RPC messages
15:36:20 <opendevreview> Merged openstack/neutron stable/wallaby: [OVN] Hash Ring: Set nodes as offline upon exit  https://review.opendev.org/c/openstack/neutron/+/887279
15:36:21 <slaweq> yeap
15:36:42 <slaweq> and last topic from me for today is
15:36:44 <slaweq> #topic Periodic
15:36:54 <slaweq> here I found one new issue
15:37:04 <slaweq> https://bugs.launchpad.net/neutron/+bug/2028037
15:37:24 <ralonsoh> pfffff
15:37:27 <slaweq> but ykarel found out that this is already reported in https://bugs.launchpad.net/neutron/+bug/2028003
15:37:31 <slaweq> thx ykarel
15:38:08 <ralonsoh> let me check that issue, it could be related to a specific postgres requirement
15:38:19 <ralonsoh> (as reported in ykarel bug)
15:38:24 <ralonsoh> must appear in the GROUP BY clause or be used in an aggregate function
15:38:27 <slaweq> yes, it is something specific to postgresql
15:38:29 <ykarel> ralonsoh, yes and seems to be triggered with https://review.opendev.org/q/Ic6001bd5a57493b8befdf81a41eb0bd1c8022df3
15:38:40 <slaweq> as it works fine in other jobs which are using MySQL/MariaDB
15:38:44 <ralonsoh> yeah, I was expecting that...
15:39:02 <ykarel> and the same job is impacted in stable branches
15:39:19 <ralonsoh> this is again the ironic job
15:39:31 <ralonsoh> we had another issue with postgres and ironic 2 weeks ago
15:39:55 <slaweq> one is the ironic job and the other is our periodic scenario job
15:40:02 <slaweq> both found that issue
15:40:54 <slaweq> ralonsoh will You take a look at this? Or do You want me to check it?
15:41:00 <ralonsoh> I'll do
15:41:04 <slaweq> thx
15:41:16 <slaweq> #action ralonsoh to check failing postgresql job
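For context, an illustrative sketch (not the actual neutron query) of the class of query behind that error, using SQLAlchemy and a hypothetical ports table: PostgreSQL rejects selecting a non-aggregated column that is missing from the GROUP BY clause, while the MySQL/MariaDB-based jobs tolerate it, which matches only the postgres jobs failing:

    import sqlalchemy as sa

    ports = sa.table('ports', sa.column('id'), sa.column('network_id'),
                     sa.column('status'))

    # PostgreSQL: "column ports.status must appear in the GROUP BY clause
    # or be used in an aggregate function"; MySQL may accept this.
    bad = sa.select(ports.c.network_id, ports.c.status).group_by(ports.c.network_id)

    # Portable form: aggregate the extra column (or add it to GROUP BY).
    good = sa.select(ports.c.network_id,
                     sa.func.count(ports.c.id)).group_by(ports.c.network_id)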
15:41:27 <slaweq> and that's all from me for today
15:41:38 <slaweq> do You have any other topics to discuss?
15:41:49 <slaweq> or if not, I will give You back about 18 minutes today
15:42:14 <ralonsoh> fine for me
15:42:32 <ykarel> o/
15:42:48 <slaweq> ok, so thx for attending the meeting
15:42:53 <slaweq> #endmeeting