15:00:12 #startmeeting neutron_ci
15:00:12 Meeting started Tue Sep 6 15:00:12 2022 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:12 The meeting name has been set to 'neutron_ci'
15:00:14 hi
15:00:15 o/
15:00:38 o/
15:02:49 ralonsoh_ is on PTO, bcafarel too
15:03:00 I don't know if lajoskatona will be able to join today
15:03:07 so I guess we can start
15:03:09 probably not
15:03:17 Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:03:23 Hi, I can, but only on IRC
15:03:33 let's go with the first topic
15:03:38 lajoskatona: hi, yeah, we have it on irc today
15:03:41 #topic Actions from previous meetings
15:03:49 and I am on mobile net, so it's possible that I will disappear from time to time....
15:03:50 slaweq to fix functional/fullstack failures on centos 9 stream: https://bugs.launchpad.net/neutron/+bug/1976323
15:03:59 lajoskatona: sure, thx for the heads up
15:04:14 regarding that action item, I didn't make any progress really
15:05:15 so I will add it for myself for next week too
15:05:21 #action slaweq to fix functional/fullstack failures on centos 9 stream: https://bugs.launchpad.net/neutron/+bug/1976323
15:05:28 next one
15:05:30 slaweq to check POST_FAILURE reasons
15:05:57 I checked it with the infra team and it seems the job is timing out while uploading logs to swift
15:06:22 and we have a lot of small log files in the "dsvm-functional-logs" directory, and uploading all those files to Swift may be slow
15:06:25 ok, so it is not that our tests are taking longer again
15:06:31 so I prepared a patch: https://review.opendev.org/c/openstack/neutron/+/855868
15:06:39 lajoskatona: nope
15:06:59 with that patch we will upload a .tar.gz archive with those logs to swift, which should be faster (I hope)
15:07:30 I also did an additional patch, https://review.opendev.org/c/openstack/neutron/+/855867/, which removes storing of journal.log in the job's logs
15:07:51 it's not needed as devstack is already doing that and storing it in the devstack.journal.gz file
15:08:11 so it can save some disk space and a few seconds during the job execution :)
15:08:27 please review both those patches when You have a minute or two
15:08:40 ack
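To illustrate the idea behind the logs patch discussed above (a sketch only, not the actual patch, which changes the job's post-run steps; the path used here is an assumption): instead of uploading hundreds of tiny files from "dsvm-functional-logs" one by one, pack them into a single .tar.gz so the upload step pushes one object to Swift.

    # Minimal sketch, assuming the functional logs live under the path below.
    # Archiving the per-test log files into one .tar.gz means the log upload
    # step has a single object to push instead of many small files.
    import os
    import tarfile

    logs_dir = '/opt/stack/logs/dsvm-functional-logs'   # assumed location

    with tarfile.open(logs_dir + '.tar.gz', 'w:gz') as archive:
        # keep only the directory name inside the archive, not the full path
        archive.add(logs_dir, arcname=os.path.basename(logs_dir))

A single compressed object also tends to be smaller overall, which is where the hoped-for speedup comes from.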
15:09:05 next one
15:09:08 ykarel to check interface not found issues in the periodic functional jobs
15:09:38 yes, i checked all the three failures linked
15:09:44 all the failures share common symptoms: the interface gets deleted/added quickly, and in that period neutron fails with the device missing from the namespace in two of those failures
15:09:59 like in two of them: deleted at 02:45:35.681, readded at 02:45:35.778, fails at 02:45:35.705
15:10:05 deleted at 02:55:12.157, readded at 02:55:13.608, fails at 02:55:13.527
15:10:14 one failure shares the same observations as noted by slawek in https://bugs.launchpad.net/neutron/+bug/1961740/comments/17
15:10:45 from opensearch i see some more occurrences in non-periodic jobs too, in master and stable/yoga
15:11:33 https://opensearch.logs.openstack.org/_dashboards/app/discover/?security_tenant=global#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30d,to:now))&_a=(columns:!(_source),filters:!(),index:'94869730-aea8-11ec-9e6a-83741af3fdcd',interval:auto,query:(language:kuery,query:'message:%22not%20found%20in%20namespace%20snat%22'),sort:!())
15:11:42 ykarel: yes, I also saw that "readd" of the interfaces some time ago when I was investigating that
15:11:50 but I have no idea why it happens like that
15:12:08 slaweq, yes, i too didn't get the root cause for that
15:12:34 some time ago I even did a patch which I hoped would work around it
15:12:37 let me find it
15:12:41 the retry one?
15:12:47 yes
15:13:12 yeah, that's not helping, at least not avoiding this issue completely
15:13:30 I know :/
15:13:46 as in two of them i see the device added to the namespace without a retry
15:13:49 and that's strange, as the interface is added/removed/added in a short period of time
15:13:52 but then removed
15:14:02 yes
15:15:12 also noticed there was cpu load > 10 around the failure, but i see similar with successful jobs
15:15:25 also ram was not fully utilized during failures
15:16:41 also observed many failures were seen in test patch https://review.opendev.org/c/openstack/neutron/+/854191/ as per opensearch
15:17:21 but that's just to trigger jobs, a lot of jobs
15:18:26 i recall some time back it was discussed to not use rootwrap in functional tests, you think that's related here?
15:18:37 maybe, I have some theory
15:19:33 it is failing with an error like "Interface not found in namespace snat..." or something like that
15:19:55 yes
15:20:05 so maybe as the device is re-added, it's not in the snat-XXX namespace but in the global namespace
15:20:10 and that's why it cannot find it
15:20:33 look at https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_lib.py#L463
15:20:38 it's where it is failing
15:20:57 and here "self._parent.namespace" is the namespace in which the interface is looked for?
15:21:12 and "net_ns_fd=namespace" is the attribute to set for the interface
15:21:25 so it is expected to be in the snat-XXX namespace but it's not there
15:21:33 as it was deleted/added again
15:21:45 does it make sense?
15:21:57 didn't get why it's in the global namespace
15:22:15 when You are adding a new port it's always first in the global namespace
15:22:18 right?
15:22:37 yes, i think so
15:23:00 and to add it to a namespace it needs some explicit calls
15:25:24 ok, I know why
15:25:30 it's a bug in my retry
15:25:42 ahh
15:25:48 when it calls add_device_to_namespace the first time
15:26:04 it sets the parent namespace to the target namespace in https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_lib.py#L464
15:26:17 and when it's deleted and added again, it's in the global namespace
15:26:26 but _parent.namespace is already set
15:26:36 so that's why it's failing, as it's looking for it in the wrong namespace
15:26:38 :)
15:27:27 in this except block https://github.com/openstack/neutron/blob/master/neutron/agent/linux/interface.py#L360
15:27:31 we should do something like:
15:27:47 device._parent.namespace = None before retrying
15:27:57 and that should make it work fine IMO
15:28:28 so iiuc this will fix the case where it's failing even after multiple retries, right?
15:28:44 yes
15:28:47 not the other two cases
15:28:50 okk
15:28:58 I will propose a patch for that
15:29:07 k Thanks
15:30:31 I think it may fix all cases where the interface is "re-added"
15:30:39 as currently the retry mechanism is broken
15:32:16 #action slaweq to fix add_device_to_namespace retry mechanism
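A rough, self-contained sketch of the retry bug described above, using toy classes rather than the real neutron code (FakeDevice, set_netns and the simplified add_device_to_namespace signature are all illustrative stand-ins): the first attempt caches the target namespace on device._parent, so once the interface is deleted and re-added in the global namespace every retry keeps looking in snat-XXX; resetting the cached namespace before retrying, as proposed above, lets the retry succeed.

    # Toy model of the retry bug discussed above -- not the real neutron code;
    # only the namespace bookkeeping matters here.
    import types


    class DeviceNotFound(Exception):
        pass


    class FakeDevice:
        def __init__(self, name):
            self.name = name
            self.kernel_namespace = None  # where the interface really lives (None == global)
            self._parent = types.SimpleNamespace(namespace=None)  # namespace used for lookups

        def set_netns(self, namespace):
            # like ip_lib, the device is looked up in self._parent.namespace first
            if self.kernel_namespace != self._parent.namespace:
                raise DeviceNotFound(
                    '%s not found in namespace %s' % (self.name, self._parent.namespace))
            self.kernel_namespace = namespace
            self._parent.namespace = namespace  # this is the state that goes stale


    def add_device_to_namespace(device, namespace, retries=3):
        # simplified signature, illustrative only
        for _ in range(retries):
            try:
                device.set_netns(namespace)
                return
            except DeviceNotFound:
                # the proposed fix: the re-added interface is back in the global
                # namespace, so drop the cached namespace before retrying
                device._parent.namespace = None
        raise DeviceNotFound(device.name)


    # simulate the failure scenario: the interface was deleted and re-added
    # (so it is back in the global namespace), but the cached namespace still
    # points at snat-xxx from the first attempt
    dev = FakeDevice('sg-aaaa')
    dev._parent.namespace = 'snat-xxx'
    add_device_to_namespace(dev, 'snat-xxx')  # fails once, succeeds after the reset
    assert dev.kernel_namespace == 'snat-xxx'

Run as-is, the first attempt fails, the except block resets the cached namespace, and the second attempt succeeds; removing the reset line makes every retry fail, which mirrors the behaviour seen in the functional job failures.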
15:32:35 if add_interface_to_namespace is called every time a port is added to the ovs-bridge, then yes, it should fix it
15:32:42 i still have to check the complete flow
15:32:51 k
15:32:59 I will propose a patch to fix that issue which we found
15:33:14 but if You find anything else, please propose fixes too :)
15:33:37 ok, let's move on
15:33:42 mlavalle to check failing quota test in openstack-tox-py39-with-oslo-master periodic job
15:33:56 It is failing sometimes
15:34:03 I filed this bug: https://bugs.launchpad.net/neutron/+bug/1988604
15:34:21 and proposed this fix: https://review.opendev.org/c/openstack/neutron/+/855703
15:34:45 quick question: can it be related to the sqlalchemy 2.0 vs oslo.db release thread?
15:35:03 it might
15:35:31 ok, thanks, it is interesting to have an opinion on the debate
15:36:12 this morning I said let's wait with it, but if we are on the safe side with our best understanding let's have a release
15:36:14 lajoskatona: the oslo.db version which has this "issue" is 12.1.0, right?
15:36:32 yes, I think
15:36:36 ok
15:36:45 in "normal" unit test jobs we are still using 12.0.0
15:36:52 It is not so much an issue, more that some projects have not adopted it, and we have this flapping job
15:36:52 so that's why those jobs are working fine
15:37:04 yes, this is how I understand it
15:37:08 mlavalle: I just ran experimental jobs on Your patch
15:37:25 I think we can run it a few times to check if that oslo-master job will be stable with it
15:37:34 but if mlavalle's patch fixes the job, I would say let's have this oslo.db out
15:37:41 ++
15:37:45 slaweq: good idea
15:37:53 i forgot that we have experimental for this
15:38:02 mlavalle: and also, I would really like ralonsoh_ to look at Your patch too :)
15:38:39 slaweq: we actually discussed it before he went on vacation
15:38:43 +1
15:38:53 it is in this channel's log a week ago
15:38:58 mlavalle: ahh, ok
15:39:02 so if he was fine with it, I'm good too :)
15:39:07 I trust You ;)
15:39:11 yes, he was
15:39:36 ok, then let's see the experimental jobs results and go back to the thread
15:39:40 ++
15:39:43 thx mlavalle
15:39:54 next topic then
15:39:56 #topic Stable branches
15:40:06 anything new regarding stable branches?
15:40:14 I just checked (https://review.opendev.org/c/openstack/requirements/+/855973 ) and cinder seems to be failing, but I can't check the logs on mobile net :P
15:40:48 elodilles proposed a series for capping virtualenv: https://review.opendev.org/q/topic:cap-virtualenv-py37
15:41:17 it affects some networking projects also, I started to check (bagpipe perhaps), if you have time please keep an eye on these
15:41:49 it is for ussuri only, as I see
15:42:05 thx lajoskatona
15:42:09 I will take a look
15:42:31 thanks
15:42:50 ok, next topic
15:42:51 #topic Stadium projects
15:43:15 * slaweq will be back in 2 minutes
15:43:36 One topic: with the segments patches we let in a change in vlanmanager that breaks some stadium projects
15:43:46 I added the patches to the etherpad
15:43:58 https://review.opendev.org/c/openstack/networking-bagpipe/+/855886
15:43:58 https://review.opendev.org/c/openstack/networking-sfc/+/855887
15:43:58 https://review.opendev.org/c/openstack/neutron-fwaas/+/855891
15:44:45 it was too late when I switched to FF mode and stopped merging of this feature, sorry for it :-(
15:45:03 * slaweq is back
15:45:33 no worries
15:45:41 good that we found it before the final release of Zed
15:45:47 so we still have time to fix those
15:46:06 and i have to drop (low battery, and I have to fetch my sons from their English lesson)
15:46:28 yeah, good that we have periodic jobs :-)
15:46:30 lajoskatona: thx, see You
15:46:36 o/
15:46:44 ok, let's move on to the next topic
15:46:49 #topic Grafana
15:47:14 dashboards look pretty good IMO
15:47:22 yeap
15:47:22 I don't see anything very bad there
15:47:52 do You see anything worth discussing there?
15:48:42 if not, I think we can quickly move on
15:48:49 #topic Rechecks
15:49:15 rechecks stats are in the meeting agenda etherpad https://etherpad.opendev.org/p/neutron-ci-meetings#L52
15:49:20 basically it still looks good
15:49:32 last week we had 0.17 rechecks on average to get a patch merged
15:49:53 this week it's 1.5, but it's just the beginning of the week
15:49:59 so hopefully it will get better
15:50:13 regarding bare rechecks it's also much better this week
15:50:17 +---------+---------------+--------------+-------------------+... (full message at https://matrix.org/_matrix/media/r0/download/matrix.org/WAYKKcqlXmMdgZjqrbwNZgul)
15:50:38 thx a lot to all of You who are checking failures before rechecking :)
15:50:39 +1
15:51:23 anything else You want to add/ask regarding rechecks?
15:51:30 nope
15:52:10 ok, so next topic
15:52:15 #topic fullstack/functional
15:52:24 here I found one "new" error
15:52:30 https://zuul.openstack.org/build/ad0801f20bc143cebf5692440b331df4
15:52:34 metadata proxy didn't start
15:53:59 but I didn't have time to look into it deeper
15:54:04 anyone want to check it?
15:54:17 I'll look
15:54:27 from the log https://6338bbe59b3242bd04ef-84c9f5cd8c2b87d7cd3ff61e3f0a2559.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-functional-with-uwsgi-fips/ad0801f/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_metadata_proxy_respawned.txt it seems that it was respawned
15:54:36 we don't know if it's an issue yet, right?
15:54:38 so maybe that's some issue in the test
15:54:44 mlavalle: nope
15:54:50 thx for volunteering
15:55:06 #action mlavalle to check metadata proxy not respawned error
15:55:22 mlavalle: but please don't treat it as high priority (for now), as it happened only once
15:55:33 yeap, that's why I asked
15:55:39 ++
15:56:01 any other issues/questions related to the functional or fullstack jobs?
15:56:03 or can we move on?
15:56:43 ok, let's move on
15:56:51 #topic Tempest/Scenario
15:57:00 here I just wanted to share one failure with You
15:57:04 https://3525f1c73d59ef5d5b98-485374e596f765d9f96c9ac94e680c34.ssl.cf2.rackcdn.com/840421/34/check/neutron-tempest-plugin-ovn/b503178/testr_results.html
15:57:16 it seems like some segfault in the guest ubuntu image
15:57:25 I saw it only once and it's not a neutron related issue
15:57:39 but just wanted to make You aware of things like that
15:57:51 and that's all
15:58:00 regarding periodic jobs, it looks good this week
15:58:13 it was even all green for 3 or 4 days, so it's great
15:58:20 that's all from me for today
15:58:33 any last-minute topics for the CI meeting today?
15:58:50 none from me
15:59:04 none from me either
15:59:09 ok, if not, then thx for attending the meeting and have a great week :)
15:59:13 #endmeeting