15:00:11 <slaweq> #startmeeting neutron_ci
15:00:11 <opendevmeet> Meeting started Tue Nov  9 15:00:11 2021 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:11 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:28 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:00:37 <opendevreview> Merged openstack/neutron master: Bump OVN version for functional job to 21.06  https://review.opendev.org/c/openstack/neutron/+/816614
15:00:49 <slaweq> bcafarel: lajoskatona obondarev CI meeting is starting  :)
15:00:50 <obondarev> hi
15:00:54 <slaweq> hi
15:00:55 <bcafarel> o/
15:02:08 <lajoskatona> Hi
15:02:44 <slaweq> ok, I think we can start
15:02:48 <slaweq> this week is just on IRC
15:02:56 <slaweq> #topic Actions from previous meetings
15:03:05 <slaweq> slaweq to work on https://bugs.launchpad.net/neutron/+bug/1948832
15:03:35 <slaweq> I was checking it today
15:03:57 <slaweq> and TBH I don't think it's the same bug as mentioned by ralonsoh in the LP
15:04:16 <slaweq> I wrote my findings in the comment there
15:04:25 <slaweq> I will try to investigate it a little bit more this week
15:05:35 <slaweq> #action slaweq to check more deeply https://bugs.launchpad.net/neutron/+bug/1948832
15:05:39 <slaweq> next one
15:05:49 <slaweq> ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:06:55 <slaweq> I guess we need to move that one for next week as ralonsoh is not here today
15:07:18 <slaweq> #action ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:07:26 <slaweq> next one
15:07:31 <slaweq> slaweq to check https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:07:48 <slaweq> It seems to me that the issue is similar to https://launchpad.net/bugs/1892861
15:07:48 <slaweq> The problem is that the "login: " message appears in the console log pretty quickly, but cloud-init is still doing things on the node, so maybe SSH isn't ready yet and we hit the same bug.
15:09:02 <lajoskatona> this happens with cirros?
15:09:13 <slaweq> no, it's with Ubuntu
15:09:20 <lajoskatona> ok
15:09:51 <slaweq> maybe we should find a better way to check if the guest OS has really booted before trying SSH
15:09:52 <bcafarel> advanced OS wanting to show that login prompt too fast :)
15:10:08 <slaweq> it seems that way to me
15:10:26 <bcafarel> the problem is that sounds OS-dependent? though maybe a systemd check for advanced images + "login:" for cirros would be good enough
15:10:35 <slaweq> because I didn't see anything else wrong there, really
15:12:09 <slaweq> anyway, I will keep an eye on that. If it happens more often I will report a bug and try to figure something out :)
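As a side note, a minimal sketch of the idea discussed above (the helper name and marker string are illustrative assumptions, not tempest's actual waiters): poll the guest console log for a cloud-init completion message instead of trusting the early "login:" prompt before attempting SSH.

import time


def wait_for_console_marker(get_console_output, marker, timeout=300,
                            interval=5):
    # Poll the console log until `marker` shows up or the timeout expires.
    # `get_console_output` is any callable returning the current console
    # log as a string, e.g. a thin wrapper around the Nova
    # console-output API call.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if marker in get_console_output():
            return True
        time.sleep(interval)
    return False


# Usage idea: for an Ubuntu guest, a "Cloud-init ... finished" line is a
# better readiness signal than "login:"; for CirrOS, "login:" is usually
# good enough.
# ready = wait_for_console_marker(lambda: server.get_console_output(),
#                                 'Cloud-init', timeout=300)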
15:12:27 <slaweq> for now let's move on to the last one
15:12:29 <slaweq> lajoskatona to check https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/testr_results.html
15:12:58 <lajoskatona> yeah I checked this one, and the strange thing is that the FIP goes UP, but too late (by a few secs)
15:13:31 <slaweq> to UP or to DOWN?
15:13:47 <lajoskatona> slaweq: I mean down, thanks....
15:13:47 <slaweq> error message says "attached port status failed to transition to DOWN "
15:13:52 <slaweq> ahh, ok
15:14:10 <slaweq> how long are we waiting for that transition?
15:14:43 <lajoskatona> 120sec
15:15:12 <obondarev> should be more than enough..
15:15:18 <slaweq> yeah
15:15:24 <lajoskatona> tempest stops waiting at: 2021-10-21 08:09:14.583994
15:15:54 <lajoskatona> and q-svc reports it's down: Oct 21 08:09:17.015995 ubuntu-focal-iweb-mtl01-0027033075 neutron-server[81347]: DEBUG neutron.api.rpc.handlers.l3_rpc [None req-ca3c8a33-07ab-41b7-b946-2411e666af30 None None] New status for floating IP aa555045-e872-47f5-a5f4-b4b59017b474: DOWN {{(pid=81347) update_floatingip_statuses /opt/stack/neutron/neutron/api/rpc/handlers/l3_rpc.py:270}}
15:16:23 <lajoskatona> from https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/job-output.txt and https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/controller/logs/screen-q-svc.txt
15:16:25 <slaweq> was L3 agent very busy during that time?
15:16:55 <lajoskatona> yeah, it was continuously refreshing the iptables rules
15:17:58 <lajoskatona> my question: do we need HA routers enabled for this job, for example?
15:17:59 <slaweq> maybe there is some issue in the L3 agent then?
15:18:13 <lajoskatona> this is a single-node job as far as I see
15:18:21 <slaweq> lajoskatona: we enabled HA for those routers to have at least some HA test coverage
15:18:36 <slaweq> as we don't have any other jobs with HA routers
15:18:56 <slaweq> and even if that job is single-node, when the router is HA, the whole HA codepath is tested
15:19:02 <slaweq> like keepalived and other stuff
15:19:18 <slaweq> it's just that the router always transitions to primary on that single node :)
15:19:46 <lajoskatona> slaweq: ok
15:20:33 <lajoskatona> I can check the l3-agent to see if I can find out why it took so long to set the FIP to DOWN
15:21:09 <slaweq> ++
15:21:33 <slaweq> #action lajoskatona to check why setting the FIP to DOWN took more than 120 seconds in the L3 agent
15:21:37 <slaweq> thx lajoskatona
15:21:53 <slaweq> ok, those are all the actions from last week
15:21:57 <slaweq> let's move on
15:21:58 <slaweq> #topic Stadium projects
15:22:17 <lajoskatona> nothing new
15:22:37 <lajoskatona> This week I realized that the ODL tempest jobs try to run with OVN
15:23:13 <lajoskatona> so I'm fighting with that; the jobs will still fail, but at least we have the services up and tempest/rally started
15:24:05 <lajoskatona> that's it for stadiums
15:24:11 <slaweq> do You have a patch for that?
15:24:13 <slaweq> I can review it if You want
15:24:17 <lajoskatona> thanks
15:24:54 <lajoskatona> it hasn't fully gone through Zuul yet, as I see: https://review.opendev.org/c/openstack/networking-odl/+/817186
15:26:39 <slaweq> ok
15:26:46 <lajoskatona> ohh, no, tempest is ok at least. It's failing, but in the usual way, which is due to ODL being slow behind it....
15:27:22 <lajoskatona> ok, so I have to tune the rally job
15:27:38 <lajoskatona> I will send it on IRC when that is ok
15:27:47 <slaweq> I added it to my review list for tomorrow morning :)
15:27:56 <lajoskatona> slaweq: thanks
15:28:13 <slaweq> ok, next topic then
15:28:18 <slaweq> #topic Stable branches
15:28:51 <bcafarel> overall good this week, as mentioned ussuri has only https://review.opendev.org/c/openstack/neutron/+/816661 left (before EM transition)
15:29:02 <bcafarel> though the functional test failing in it looks related
15:29:26 <bcafarel> and I found just before the meeting that train has neutron-tempest-plugin-designate-scenario-train failing :( https://zuul.opendev.org/t/openstack/build/a6a1142368b742248be710f902f541f5
15:29:28 <slaweq> yes, indeed
15:29:33 <slaweq> I will check it tomorrow
15:29:48 <bcafarel> I will file a bug for that designate train one after the meetings
15:30:03 <bcafarel> but at least the fully supported branches are good :)
15:30:34 <slaweq> ok, thx bcafarel
15:30:51 <slaweq> I think we can move on then
15:30:54 <slaweq> #topic Grafana
15:31:04 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:31:48 <slaweq> rally was broken last week
15:32:02 <slaweq> but I guess it's due to that missing service endpoint in keystone, right?
15:32:08 <slaweq> and it's already fixed
15:32:20 <lajoskatona> yes
15:32:43 <lajoskatona> actually it's funny: the endpoint creation was removed from devstack and now we create it from the job
15:33:03 <lajoskatona> and it seems like neutron is mostly the one running rally as a voting job
15:33:09 <slaweq> so only the rally job needs it?
15:33:28 <lajoskatona> yes as I remember
15:34:01 <slaweq> hmm, ok
15:34:07 <slaweq> thx for fixing that issue quickly
15:34:07 <lajoskatona> and the proper fix should be in rally, but I checked and it's not clear at first sight where to fix it.....
15:34:37 <lajoskatona> it was ralonsoh actually, I just tried another "alternative" path, but that was not enough....
15:34:55 <slaweq> :)
15:35:33 <slaweq> apart from that, I see that the neutron-tempest-plugin-scenario-linuxbridge job is failing pretty often (more than other similar jobs)
15:35:43 <slaweq> I'm not sure why it is like that really
15:35:58 <slaweq> probably some timeouts which I saw pretty often in the scenario jobs last week
15:36:28 <slaweq> like e.g. https://zuul.opendev.org/t/openstack/build/7a2bc299305249c5911e7107bb4d4a37
15:38:58 <slaweq> I think we need to investigate why our tests have been so slow so often recently
15:39:28 <slaweq> I doubt I will have time for that this week, but maybe someone else could check it
15:40:08 <obondarev> also neutron-tempest-plugin-scenario-ovn started failing I think
15:40:22 <obondarev> example: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_330/807116/23/check/neutron-tempest-plugin-scenario-ovn/330ce5c/testr_results.html
15:40:27 <slaweq> obondarev: You mean with timeouts?
15:40:33 <obondarev> nope
15:40:48 <obondarev> a couple of tests failing with ssh I believe
15:41:32 <slaweq> hmm, in those 2 tests for example we are missing the console log from the instance
15:41:35 <slaweq> so it's hard to say
15:41:52 <slaweq> but it may be the same issue like in  https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:42:07 <slaweq> which I spoke about at the beginning of the meeting
15:44:08 <slaweq> let's observe it, and if someone has some time, maybe investigate it a bit more :)
15:44:10 <obondarev> I'll keep an eye on it and report a bug if it continues to fail with similar symptoms
15:44:23 <slaweq> obondarev: sounds good, thx
15:44:57 <slaweq> ok, I think we can move on quickly
15:45:16 <slaweq> we already spoke about the most urgent issues in our jobs, at least those which I had prepared for today
15:45:22 <slaweq> #topic tempest/scenario
15:45:32 <slaweq> here I have one more thing to mention
15:45:39 <slaweq> this time, good news I think
15:45:47 <slaweq> our CI job found a bug :)
15:45:55 <slaweq> https://bugs.launchpad.net/neutron/+bug/1950273
15:46:09 <slaweq> so sometimes those jobs seem to be useful ;)
15:46:18 <obondarev> nice! :)
15:46:59 <slaweq> it seems to be some race condition or something else, but even in such a case we shouldn't return error 500 to the user
15:47:02 <lajoskatona> +1
15:47:10 <slaweq> so at least we should properly handle that error
15:47:19 <slaweq> that's why I marked it as "Higt"
15:47:28 <slaweq> *High
15:47:49 <slaweq> ok, and last topic from me for today
15:47:51 <slaweq> #topic Periodic
15:48:03 <slaweq> here everything except the UT job with neutron-lib master looks good
15:48:14 <slaweq> I opened bug https://bugs.launchpad.net/neutron/+bug/1950275 today
15:48:41 <lajoskatona> For that I pushed a patch: https://review.opendev.org/c/openstack/neutron/+/817178
15:49:05 <slaweq> thx lajoskatona
15:49:21 <slaweq> but my question is: why did some change in neutron-lib cause that failure?
15:49:35 <slaweq> as the OVO object is defined in the Neutron repo, isn't it?
15:49:44 <lajoskatona> but that breaks the unit tests with n-lib 2.16.0, and that's what I can't get around:
15:49:52 <obondarev> I believe the order of the constants the OVO depends on changed
15:49:58 <lajoskatona> seems like this patch brings the failure: https://review.opendev.org/c/openstack/neutron-lib/+/816447
15:50:16 <lajoskatona> yeah, possibly, but can that change the hash for the OVO?
15:50:38 <obondarev> I think we faced similar issues in the past
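For context, a minimal sketch of why a constants change in neutron-lib can alter an OVO hash (the object below is hypothetical, not Neutron's actual OVO or its real test code): oslo.versionedobjects fingerprints each object it checks from its field definitions, so if a neutron-lib constants list feeds e.g. an Enum field's valid_values, reordering or extending that list changes the hash even though the object itself lives in the neutron repo.

from oslo_versionedobjects import base as ovo_base
from oslo_versionedobjects import fields as ovo_fields
from oslo_versionedobjects import fixture as ovo_fixture


@ovo_base.VersionedObjectRegistry.register
class ExampleObject(ovo_base.VersionedObject):
    # Hypothetical stand-in for a Neutron OVO.
    VERSION = '1.0'
    fields = {
        # Pretend this list is imported from neutron-lib constants;
        # changing its content or order changes the fingerprint below.
        'router_type': ovo_fields.EnumField(valid_values=['legacy', 'ha', 'dvr']),
    }


checker = ovo_fixture.ObjectVersionChecker(
    obj_classes={'ExampleObject': [ExampleObject]})
# Prints something like {'ExampleObject': '1.0-<md5 of the field definitions>'};
# unit tests compare this fingerprint against a recorded value.
print(checker.get_hashes())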
15:51:16 <lajoskatona> so if I understand it, we have to release n-lib with the constants change and bump n-lib in neutron?
15:51:38 <obondarev> I think so too
15:51:47 <lajoskatona> ok
15:51:49 <slaweq> but the problem may be that the release patch runs some UT job for neutron, no?
15:51:57 <slaweq> and this job will fail then
15:52:01 <slaweq> or am I missing something?
15:52:13 <lajoskatona> that's possible
15:53:46 <lajoskatona> I'll push the release patch and that will tell us if this way can work
15:53:57 <slaweq> ++
15:54:07 <slaweq> maybe I'm wrong here
15:54:26 <slaweq> the release patches probably don't run UT
15:54:31 <slaweq> I hope so at least :)
15:54:42 <slaweq> ok, so that's all I had for today
15:54:50 <slaweq> anything else You want to discuss regarding CI?
15:54:56 <obondarev> lajoskatona: I would appreciate it if you could check https://review.opendev.org/c/openstack/neutron-lib/+/816468 before the lib release :)
15:55:15 <lajoskatona> obondarev: I will
15:55:38 <obondarev> lajoskatona: thanks a lot!
15:56:38 <slaweq> ok, so I think we can finish our meeting for today
15:56:57 <slaweq> have a great week and see You all next week on the video call again :)
15:56:57 <slaweq> o/
15:56:58 <slaweq> #endmeeting