15:00:12 <slaweq> #startmeeting neutron_ci
15:00:12 <opendevmeet> Meeting started Tue Sep  6 15:00:12 2022 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:12 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:14 <slaweq> hi
15:00:15 <mlavalle> o/
15:00:38 <ykarel> o/
15:02:49 <slaweq> ralonsoh_ is on PTO, bcafarel too
15:03:00 <slaweq> I don't think lajoskatona will be able to join today
15:03:07 <slaweq> so I guess we can start
15:03:09 <mlavalle> probably not
15:03:17 <slaweq> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:03:23 <lajoskatona> Hi, I can, but only on IRC
15:03:33 <slaweq> lets go with first topic
15:03:38 <slaweq> lajoskatona: hi, yeah, we have it on irc today
15:03:41 <slaweq> #topic Actions from previous meetings
15:03:49 <lajoskatona> and I am on mobile net, so it's possible that I will disappear from time to time....
15:03:50 <slaweq> slaweq to fix functional/fullstack failures on centos 9 stream: https://bugs.launchpad.net/neutron/+bug/1976323
15:03:59 <slaweq> lajoskatona: sure, thx for the heads up
15:04:14 <slaweq> regarding that action item, I didn't make any progress really
15:05:15 <slaweq> so I will add it for myself for next week too
15:05:21 <slaweq> #action slaweq to fix functional/fullstack failures on centos 9 stream: https://bugs.launchpad.net/neutron/+bug/1976323
15:05:28 <slaweq> next one
15:05:30 <slaweq> slaweq to check POST_FAILURE reasons
15:05:57 <slaweq> I checked it with the infra team and it seems that it is timing out while uploading logs to swift
15:06:22 <slaweq> and we have a lot of small log files in the "dsvm-functional-logs" directory, and uploading all those files to Swift may be slow
15:06:25 <lajoskatona> ok so it is not that our tests are taking longer again
15:06:31 <slaweq> so I prepared patch https://review.opendev.org/c/openstack/neutron/+/855868
15:06:39 <slaweq> lajoskatona: nope
15:06:59 <slaweq> with that patch we will upload a .tar.gz archive with those logs to swift, which should be faster (I hope)
15:07:30 <slaweq> I also did an additional patch https://review.opendev.org/c/openstack/neutron/+/855867/ which stops storing journal.log in the job logs
15:07:51 <slaweq> it's not needed as devstack is already doing that and storing it in the devstack.journal.gz file
15:08:11 <slaweq> so it can save some disk space and a few seconds during the job execution :)
15:08:27 <slaweq> please review both those patches when You have a minute or two
15:08:40 <ykarel> ack
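A minimal sketch of the kind of archiving step described above, assuming the per-test logs sit in a local "dsvm-functional-logs" directory; the function name, paths, and standalone shape are illustrative only and not necessarily how the linked patch implements it:

    import tarfile
    from pathlib import Path

    def archive_job_logs(logs_dir: str, archive_path: str) -> None:
        """Pack many small per-test log files into a single .tar.gz so the
        post-job upload to Swift moves one large object instead of
        thousands of tiny files."""
        src = Path(logs_dir)
        with tarfile.open(archive_path, "w:gz") as tar:
            # keep only the directory name inside the archive, not the full host path
            tar.add(src, arcname=src.name)

    # hypothetical usage:
    # archive_job_logs("dsvm-functional-logs", "dsvm-functional-logs.tar.gz")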
15:09:05 <slaweq> next one
15:09:08 <slaweq> ykarel to check interface not found issues in the periodic functional jobs
15:09:38 <ykarel> yes i checked all three failures linked
15:09:44 <ykarel> All the failures share common symptoms: the interface gets deleted/re-added quickly, and in that window neutron fails with device missing in namespace in two of those failures
15:09:59 <ykarel> in those two: deleted at 02:45:35.681, re-added at 02:45:35.778, failed at 02:45:35.705
15:10:05 <ykarel> and deleted at 02:55:12.157, re-added at 02:55:13.608, failed at 02:55:13.527
15:10:14 <ykarel> One failure shares the same observations as made by slawek in https://bugs.launchpad.net/neutron/+bug/1961740/comments/17
15:10:45 <ykarel> from opensearch i see some more occurrences in non-periodic jobs too, in master and stable/yoga
15:11:33 <ykarel> https://opensearch.logs.openstack.org/_dashboards/app/discover/?security_tenant=global#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30d,to:now))&_a=(columns:!(_source),filters:!(),index:'94869730-aea8-11ec-9e6a-83741af3fdcd',interval:auto,query:(language:kuery,query:'message:%22not%20found%20in%20namespace%20snat%22'),sort:!())
15:11:42 <slaweq> ykarel: yes, I also saw that "re-add" of the interfaces some time ago when I was investigating that
15:11:50 <slaweq> but I have no idea why it happens like that
15:12:08 <ykarel> slaweq, yes i didn't get to the root cause for that either
15:12:34 <slaweq> some time ago I even did patch which I hoped will workaround it
15:12:37 <slaweq> let me find it
15:12:41 <ykarel> the retry one?
15:12:47 <slaweq> yes
15:13:12 <ykarel> yeap that's not helping, at least not avoiding this issue completely
15:13:30 <slaweq> I know :/
15:13:46 <ykarel> as in two of them i see that the device was added to the namespace without retry
15:13:49 <slaweq> and that's strange as the interface is added/removed/added in a short period of time
15:13:52 <ykarel> but then removed
15:14:02 <ykarel> yes
15:15:12 <ykarel> also noticed there was cpu load > 10 around the failure, but i see similar with successful jobs
15:15:25 <ykarel> also ram was not fully utilized during failures
15:16:41 <ykarel> also observed that many failures were seen in the test patch https://review.opendev.org/c/openstack/neutron/+/854191/ as per opensearch
15:17:21 <ykarel> but that's just to trigger jobs, a lot of jobs
15:18:26 <ykarel> i recall some time back it was discussed to not use rootwrap in functional tests, do you think that's related here?
15:18:37 <slaweq> maybe I have a theory
15:19:33 <slaweq> it is failing with an error like "Interface not found in namespace snat..." or something like that
15:19:55 <ykarel> yes
15:20:05 <slaweq> so maybe as the device is re-added, it's not in the snat-XXX namespace but in the global namespace
15:20:10 <slaweq> and that's why it cannot find it
15:20:33 <slaweq> look at https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_lib.py#L463
15:20:38 <slaweq> it's where it is failing
15:20:57 <slaweq> and here "self._parent.namespace" is the namespace in which the interface is looked for?
15:21:12 <slaweq> and "net_ns_fd=namespace" is the attribute to set for the interface
15:21:25 <slaweq> so it is expected to be in the snat-XXX namespace but it's not there
15:21:33 <slaweq> as it was deleted/added again
15:21:45 <slaweq> does it make sense?
15:21:57 <ykarel> didn't get why it's in the global namespace
15:22:15 <slaweq> when You are adding a new port it's always first in the global namespace
15:22:18 <slaweq> right?
15:22:37 <ykarel> yes i think so
15:23:00 <ykarel> and to add it to a namespace it needs some explicit calls
15:25:24 <slaweq> ok, I know why
15:25:42 <slaweq> it's a bug in my retry
15:25:42 <ykarel> ahh
15:25:48 <slaweq> when it calls add_device_to_namespace the first time
15:26:04 <slaweq> it sets the parent namespace to the target namespace in https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_lib.py#L464
15:26:17 <slaweq> and when it's deleted and added again, it's in the global namespace
15:26:26 <slaweq> but _parent.namespace is already set
15:26:36 <slaweq> so that's why it's failing, as it's looking for it in the wrong namespace
15:26:38 <slaweq> :)
15:27:27 <slaweq> in this except block https://github.com/openstack/neutron/blob/master/neutron/agent/linux/interface.py#L360
15:27:31 <slaweq> we should do something like:
15:27:47 <slaweq> device._parent.namespace = None before retrying
15:27:57 <slaweq> and that should make it work fine IMO
15:28:28 <ykarel> so iiuc this will fix the case where it's failing even after multiple retries, right?
15:28:44 <slaweq> yes
15:28:47 <ykarel> not the other two cases
15:28:50 <ykarel> okk
15:28:58 <slaweq> I will propose patch for that
15:29:07 <ykarel> k Thanks
15:30:31 <slaweq> I think it may fix all cases where the interface is "re-added"
15:30:39 <slaweq> as currently the retry mechanism is broken
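A rough sketch of the shape of the fix described above: reset the cached parent namespace before retrying, so the lookup happens in the global namespace where a re-added device lands. The wrapper function, exception type, and retry structure are illustrative only, not the actual interface.py/ip_lib.py code:

    def add_device_to_namespace_with_retry(device, namespace, retries=3):
        """Illustrative retry wrapper: move a device into a namespace while
        tolerating the device being deleted and re-added in between."""
        for attempt in range(retries):
            try:
                # the device lookup here happens in device._parent.namespace
                device.link.set_netns(namespace)
                return
            except RuntimeError:  # stand-in for the real "not found in namespace" error
                if attempt == retries - 1:
                    raise
                # the device may have been re-added in the global namespace, but
                # _parent.namespace still points at the target namespace from the
                # previous attempt, so reset it before retrying
                device._parent.namespace = None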
15:32:16 <slaweq> #action slaweq to fix add_device_to_namespace retry mechanism
15:32:35 <ykarel> if add_interface_to_namespace is called every time a port is added to the ovs-bridge, then yes it should fix it
15:32:42 <ykarel> i still have to check the complete flow
15:32:51 <slaweq> k
15:32:59 <slaweq> I will propose patch to fix that issue which we found
15:33:14 <slaweq> but if You find anything else, please propose fixes too :)
15:33:37 <slaweq> ok, lets move on
15:33:42 <slaweq> mlavalle to check failing quota test in openstack-tox-py39-with-oslo-master periodic job
15:33:56 <mlavalle> It is failing sometimes
15:34:03 <mlavalle> I filed this bug: https://bugs.launchpad.net/neutron/+bug/1988604
15:34:21 <mlavalle> and proposed this fix: https://review.opendev.org/c/openstack/neutron/+/855703
15:34:45 <lajoskatona> quick question: can it be related to the sqlalchemy 2.0 vs oslo.db release thread?
15:35:03 <mlavalle> it might
15:35:31 <lajoskatona> ok, thanks, it is interesting to have an opinion on the debate
15:36:12 <lajoskatona> this morning I said let's wait with it, but if to our best understanding we are on the safe side, let's have a release
15:36:14 <slaweq> lajoskatona: oslo.db version which has this "issue" is 12.1.0, right?
15:36:32 <lajoskatona> yes I think
15:36:36 <slaweq> ok
15:36:45 <slaweq> in "normal" unit test jobs we are still using 12.0.0
15:36:52 <lajoskatona> It is not really an issue, more that some projects have not adapted yet, and we have this flapping job
15:36:52 <slaweq> so that's why those jobs are working fine
15:37:04 <lajoskatona> yes this is how I understand
15:37:08 <slaweq> mlavalle: I just ran the experimental jobs on Your patch
15:37:25 <slaweq> I think we can run it a few times to check if that oslo-master job will be stable with it
15:37:34 <lajoskatona> but if mlavalle's patch fixes the job, I would say let's get this oslo.db release out
15:37:41 <slaweq> ++
15:37:45 <lajoskatona> slaweq: good idea
15:37:53 <lajoskatona> i forgot that we have experimental jobs for this
15:38:02 <slaweq> mlavalle: and also, I would really like ralonsoh_ to look at Your patch too :)
15:38:39 <mlavalle> slaweq: we actually discussed it before he went on vacation
15:38:43 <lajoskatona> +1
15:38:53 <mlavalle> it is in this channel's log from a week ago
15:38:58 <slaweq> mlavalle: ahh, ok
15:39:02 <slaweq> so if he was fine with it, I'm good too :)
15:39:07 <slaweq> I trust You ;)
15:39:11 <mlavalle> yes, he was
15:39:36 <lajoskatona> ok, then let's see the experimental job results and get back to the thread
15:39:40 <slaweq> ++
15:39:43 <slaweq> thx mlavalle
15:39:54 <slaweq> next topic then
15:39:56 <slaweq> #topic Stable branches
15:40:06 <slaweq> anything new regarding stable branches?
15:40:14 <lajoskatona> I just checked (https://review.opendev.org/c/openstack/requirements/+/855973 ) and cinder seems to be failing but I can't check the logs on mobile net :P
15:40:48 <lajoskatona> elodilles proposed a series for capping virtualenv: https://review.opendev.org/q/topic:cap-virtualenv-py37
15:41:17 <lajoskatona> it affects some networking projects also, I started to check (bagpipe perhaps); if you have time please keep an eye on these
15:41:49 <lajoskatona> it is for ussuri only as far as I see
15:42:05 <slaweq> thx lajoskatona
15:42:09 <slaweq> I will take a look
15:42:31 <lajoskatona> thanks
15:42:50 <slaweq> ok, next topic
15:42:51 <slaweq> #topic Stadium projects
15:43:15 * slaweq will be back in 2 minutes
15:43:36 <lajoskatona> One topic: with the segments patches we let in a change in vlanmanager that breaks some stadium projects
15:43:46 <lajoskatona> I added the patches to the etherpad
15:43:58 <lajoskatona> https://review.opendev.org/c/openstack/networking-bagpipe/+/855886
15:43:58 <lajoskatona> https://review.opendev.org/c/openstack/networking-sfc/+/855887
15:43:58 <lajoskatona> https://review.opendev.org/c/openstack/neutron-fwaas/+/855891
15:44:45 <lajoskatona> it was too late when I switched to FF mode and stopped merging this feature, sorry for it :-(
15:45:03 * slaweq is back
15:45:33 <slaweq> no worries
15:45:41 <slaweq> good that we found it before final release of Zed
15:45:47 <slaweq> so we still have time to fix those
15:46:06 <lajoskatona> And i have to drop (low battery, and have to fetch my sons from English lesson)
15:46:28 <lajoskatona> yeah good that we have periodic jobs :-)
15:46:30 <slaweq> lajoskatona: thx, see You
15:46:36 <lajoskatona> o/
15:46:44 <slaweq> ok, lets move on to the next topic
15:46:49 <slaweq> #topic Grafana
15:47:14 <slaweq> dashboards look pretty good IMO
15:47:22 <mlavalle> yeap
15:47:22 <slaweq> I don't see anything very bad there
15:47:52 <slaweq> do You see anything worth discussing there?
15:48:42 <slaweq> if not, I think we can quickly move on
15:48:49 <slaweq> #topic Rechecks
15:49:15 <slaweq> rechecks stats are in the meeting agenda etherpad https://etherpad.opendev.org/p/neutron-ci-meetings#L52
15:49:20 <slaweq> basically it looks good still
15:49:32 <slaweq> last week we had 0.17 rechecks on average to get a patch merged
15:49:53 <slaweq> this week it's 1.5 but it's just the beginning of the week
15:49:59 <slaweq> so hopefully it will be better
15:50:13 <slaweq> regarding bare rechecks it's also much better this week
15:50:17 <slaweq> +---------+---------------+--------------+-------------------+... (full message at https://matrix.org/_matrix/media/r0/download/matrix.org/WAYKKcqlXmMdgZjqrbwNZgul)
15:50:38 <slaweq> thx a lot to all of You who are checking failures before rechecking :)
15:50:39 <mlavalle> +1
15:51:23 <slaweq> anything else You want to add/ask regarding rechecks?
15:51:30 <mlavalle> nope
15:52:10 <slaweq> ok, so next topic
15:52:15 <slaweq> #topic fullstack/functional
15:52:24 <slaweq> here I found one "new" error
15:52:30 <slaweq> https://zuul.openstack.org/build/ad0801f20bc143cebf5692440b331df4
15:52:34 <slaweq> metadata proxy didn't start
15:53:59 <slaweq> but I didn't have time to look into it deeper
15:54:04 <slaweq> anyone wants to check it?
15:54:17 <mlavalle> I'll look
15:54:27 <slaweq> from the log https://6338bbe59b3242bd04ef-84c9f5cd8c2b87d7cd3ff61e3f0a2559.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-functional-with-uwsgi-fips/ad0801f/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_metadata_proxy_respawned.txt it seems that it was respawned
15:54:36 <mlavalle> we don't know if it's an issue yet, right?
15:54:38 <slaweq> so maybe that's some issue in test
15:54:44 <slaweq> mlavalle: nope
15:54:50 <slaweq> thx for volunteering
15:55:06 <slaweq> #action mlavalle to check metadata proxy not respawned error
15:55:22 <slaweq> mlavalle: but please don't treat it as high priority (for now) as it happened only once
15:55:33 <mlavalle> yeap, that's why I asked
15:55:39 <slaweq> ++
15:56:01 <slaweq> any other issues/questions related to the functional or fullstack jobs?
15:56:03 <slaweq> or can we move on?
15:56:43 <slaweq> ok, lets move on
15:56:51 <slaweq> #topic Tempest/Scenario
15:57:00 <slaweq> here I just wanted to share with You one failure
15:57:04 <slaweq> https://3525f1c73d59ef5d5b98-485374e596f765d9f96c9ac94e680c34.ssl.cf2.rackcdn.com/840421/34/check/neutron-tempest-plugin-ovn/b503178/testr_results.html
15:57:16 <slaweq> it seems like some segfault in the guest ubuntu image
15:57:25 <slaweq> I saw it only once and it's not neutron related issue
15:57:39 <slaweq> but just wanted to make You aware of things like that
15:57:51 <slaweq> and that's all
15:58:00 <slaweq> regarding periodic jobs, it looks good this week
15:58:13 <slaweq> it was even all green for 3 or 4 days, so that's great
15:58:20 <slaweq> that's all from me for today
15:58:33 <slaweq> any last minute topics for the CI meeting for today?
15:58:50 <ykarel> none from me
15:59:04 <mlavalle> none from me either
15:59:09 <slaweq> ok, if not, then thx for attending the meeting and have a great week :)
15:59:13 <slaweq> #endmeeting