15:00:11 <slaweq> #startmeeting neutron_ci
15:00:11 <opendevmeet> Meeting started Tue Nov 9 15:00:11 2021 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:11 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:28 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:00:37 <opendevreview> Merged openstack/neutron master: Bump OVN version for functional job to 21.06 https://review.opendev.org/c/openstack/neutron/+/816614
15:00:49 <slaweq> bcafarel: lajoskatona obondarev CI meeting is starting :)
15:00:50 <obondarev> hi
15:00:54 <slaweq> hi
15:00:55 <bcafarel> o/
15:02:08 <lajoskatona> Hi
15:02:44 <slaweq> ok, I think we can start
15:02:48 <slaweq> this week it's just on IRC
15:02:56 <slaweq> #topic Actions from previous meetings
15:03:05 <slaweq> slaweq to work on https://bugs.launchpad.net/neutron/+bug/1948832
15:03:35 <slaweq> I was checking it today
15:03:57 <slaweq> and TBH I don't think it's the same bug as mentioned by ralonsoh in the LP
15:04:16 <slaweq> I wrote my findings in the comment there
15:04:25 <slaweq> I will try to investigate it a little bit more this week
15:05:35 <slaweq> #action slaweq to check more deeply https://bugs.launchpad.net/neutron/+bug/1948832
15:05:39 <slaweq> next one
15:05:49 <slaweq> ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:06:55 <slaweq> I guess we need to move that one to next week as ralonsoh is not here today
15:07:18 <slaweq> #action ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:07:26 <slaweq> next one
15:07:31 <slaweq> slaweq to check https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:07:48 <slaweq> It seems to me that the issue is similar to https://launchpad.net/bugs/1892861
15:07:48 <slaweq> The problem is that the "login: " message in the console log appears pretty quickly but cloud-init is still doing some things on the node, so maybe SSH isn't working yet and we hit the same bug.
15:09:02 <lajoskatona> does this happen with cirros?
15:09:13 <slaweq> no, it's with Ubuntu
15:09:20 <lajoskatona> ok
15:09:51 <slaweq> maybe we should find a better way to check if the guest OS is really booted before doing SSH
15:09:52 <bcafarel> advanced OS wanting to show that login prompt too fast :)
15:10:08 <slaweq> it seems like that to me
15:10:26 <bcafarel> the problem is that sounds OS-dependent? though maybe a systemd check for advanced + "login:" for cirros would be good enough
15:10:35 <slaweq> because I didn't see anything else wrong there really
15:12:09 <slaweq> anyway, I will keep an eye on that. If it happens more often I will report a bug and try to figure out something :)
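A minimal sketch of the kind of boot-readiness check discussed above (not the tempest implementation): get_console_output is a hypothetical callable returning the instance's console log, and the timeouts are illustrative values only. On Ubuntu images cloud-init normally prints a "... finished at ..." line to the console when it is done, so waiting for that in addition to the "login:" prompt should avoid SSH-ing into a guest that is still being configured.

    import time

    def wait_for_guest_ready(get_console_output, timeout=300, interval=5):
        """Wait until the guest looks fully booted, not just showing a login prompt."""
        start = time.time()
        while time.time() - start < timeout:
            console = get_console_output()
            # The "login:" prompt alone can appear while cloud-init is still
            # configuring the guest, so also require cloud-init's completion line.
            if 'login:' in console and 'finished at' in console:
                return
            time.sleep(interval)
        raise TimeoutError('guest not fully booted after %s seconds' % timeout)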
15:12:27 <slaweq> for now let's move on to the last one
15:12:29 <slaweq> lajoskatona to check https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/testr_results.html
15:12:58 <lajoskatona> yeah I checked this one, and the strange thing is that the FIP goes UP but too late (~few secs)
15:13:31 <slaweq> to UP or to DOWN?
15:13:47 <lajoskatona> slaweq: I mean down, thanks....
15:13:47 <slaweq> the error message says "attached port status failed to transition to DOWN "
15:13:52 <slaweq> ahh, ok
15:14:10 <slaweq> how long are we waiting for that transition?
15:14:43 <lajoskatona> 120sec
15:15:12 <obondarev> should be more than enough..
15:15:18 <slaweq> yeah
15:15:24 <lajoskatona> tempest stops waiting at: 2021-10-21 08:09:14.583994
15:15:54 <lajoskatona> and q-svc reports it's down: Oct 21 08:09:17.015995 ubuntu-focal-iweb-mtl01-0027033075 neutron-server[81347]: DEBUG neutron.api.rpc.handlers.l3_rpc [None req-ca3c8a33-07ab-41b7-b946-2411e666af30 None None] New status for floating IP aa555045-e872-47f5-a5f4-b4b59017b474: DOWN {{(pid=81347) update_floatingip_statuses /opt/stack/neutron/neutron/api/rpc/handlers/l3_rpc.py:270}}
15:16:23 <lajoskatona> from https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/job-output.txt and https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/controller/logs/screen-q-svc.txt
15:16:25 <slaweq> was the L3 agent very busy during that time?
15:16:55 <lajoskatona> yeah it was refreshing the iptables rules continuously
15:17:58 <lajoskatona> my question: do we need the HA router enabled for this, for example?
15:17:59 <slaweq> maybe there is some issue in the L3 agent then?
15:18:13 <lajoskatona> this is a singlenode job as I see
15:18:21 <slaweq> lajoskatona: we enabled HA for those routers to have at least some HA test coverage
15:18:36 <slaweq> as we don't have any other jobs with HA routers
15:18:56 <slaweq> and even if that job is singlenode, when the router is HA, the whole HA codepath is tested
15:19:02 <slaweq> like keepalived and other stuff
15:19:18 <slaweq> it's just that the router is always transitioned to be primary on that single node :)
15:19:46 <lajoskatona> slaweq: ok
15:20:33 <lajoskatona> I can check the l3-agent to see why it took so long to set the FIP down
15:21:09 <slaweq> ++
15:21:33 <slaweq> #action lajoskatona to check why setting the FIP to DOWN took more than 120 seconds in the L3 agent
15:21:37 <slaweq> thx lajoskatona
15:21:53 <slaweq> ok, those are all the actions from last week
15:21:57 <slaweq> let's move on
15:21:58 <slaweq> #topic Stadium projects
15:22:17 <lajoskatona> nothing new
15:22:37 <lajoskatona> This week I realized that the odl tempest jobs try to run with ovn
15:23:13 <lajoskatona> so I'm fighting against that; the jobs will still fail, but at least we have services up and tempest/rally started
15:24:05 <lajoskatona> that's it for stadiums
15:24:11 <slaweq> do You have a patch for that?
15:24:13 <slaweq> I can review it if You want
15:24:17 <lajoskatona> thanks
15:24:54 <lajoskatona> it's not fully verified by zuul yet as I see: https://review.opendev.org/c/openstack/networking-odl/+/817186
15:26:39 <slaweq> ok
15:26:46 <lajoskatona> ohh, no, tempest is ok at least. It's failing, but in the usual way, which is due to ODL being slow behind it....
15:27:22 <lajoskatona> ok, so I have to tune the rally job
15:27:38 <lajoskatona> I will send it on IRC when that is ok
15:27:47 <slaweq> I added it to my review list for tomorrow morning :)
15:27:56 <lajoskatona> slaweq: thanks
15:28:13 <slaweq> ok, next topic then
15:28:18 <slaweq> #topic Stable branches
15:28:51 <bcafarel> overall good this week, as mentioned ussuri has only https://review.opendev.org/c/openstack/neutron/+/816661 left (before the EM transition)
15:29:02 <bcafarel> though the functional test failing in it looks related
15:29:26 <bcafarel> and I found just before the meeting that train has neutron-tempest-plugin-designate-scenario-train failing :( https://zuul.opendev.org/t/openstack/build/a6a1142368b742248be710f902f541f5
15:29:28 <slaweq> yes, indeed
15:29:33 <slaweq> I will check it tomorrow
15:29:48 <bcafarel> I will file a bug for that designate train one after the meetings
15:30:03 <bcafarel> but at least the fully supported branches are good :)
15:30:34 <slaweq> ok, thx bcafarel
15:30:51 <slaweq> I think we can move on then
15:30:54 <slaweq> #topic Grafana
15:31:04 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:31:48 <slaweq> rally was broken last week
15:32:02 <slaweq> but I guess it's due to that missing service endpoint in keystone, right?
15:32:08 <slaweq> and it's already fixed
15:32:20 <lajoskatona> yes
15:32:43 <lajoskatona> actually it's funny, in devstack the endpoint creation was deleted and now we create it from the job
15:33:03 <lajoskatona> and it seems like neutron is mostly the one running rally as a voting job
15:33:09 <slaweq> so only the rally job needs it?
15:33:28 <lajoskatona> yes as I remember
15:34:01 <slaweq> hmm, ok
15:34:07 <slaweq> thx for fixing that issue quickly
15:34:07 <lajoskatona> and the proper fix should be in rally, but I checked and it's not clear at first view where to fix it.....
15:34:37 <lajoskatona> it was ralonsoh actually, I just ran another "alternative" path, but that was not enough....
15:34:55 <slaweq> :)
15:35:33 <slaweq> apart from that, I see that the neutron-tempest-plugin-scenario-linuxbridge job is failing pretty often (more than other similar jobs)
15:35:43 <slaweq> I'm not sure why it is like that really
15:35:58 <slaweq> probably some timeouts, which I saw pretty often in the scenario jobs last week
15:36:28 <slaweq> like e.g. https://zuul.opendev.org/t/openstack/build/7a2bc299305249c5911e7107bb4d4a37
15:38:58 <slaweq> I think we need to investigate why our tests have been so slow so often recently
15:39:28 <slaweq> I doubt I will have time for that this week but maybe there is someone who could check that
15:40:08 <obondarev> also neutron-tempest-plugin-scenario-ovn started failing I think
15:40:22 <obondarev> example: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_330/807116/23/check/neutron-tempest-plugin-scenario-ovn/330ce5c/testr_results.html
15:40:27 <slaweq> obondarev: You mean with timeouts?
15:40:33 <obondarev> nope
15:40:48 <obondarev> a couple of tests failing with ssh I believe
15:41:32 <slaweq> hmm, in those 2 tests for example we are missing the console log from the instance
15:41:35 <slaweq> so it's hard to say
15:41:52 <slaweq> but it may be the same issue as in https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:42:07 <slaweq> which I spoke about at the beginning of the meeting
15:44:08 <slaweq> let's observe it, and if someone has some time, maybe investigate it a bit more :)
15:44:10 <obondarev> I'll keep an eye on it and report a bug if it continues to fail with similar symptoms
15:44:23 <slaweq> obondarev: sounds good, thx
15:44:57 <slaweq> ok, I think we can move on quickly
15:45:16 <slaweq> we already spoke about the most urgent issues in our jobs, at least those which I had prepared for today
15:45:22 <slaweq> #topic tempest/scenario
15:45:32 <slaweq> here I have one more thing to mention
15:45:39 <slaweq> this time, good news I think
15:45:47 <slaweq> our CI job found a bug :)
15:45:55 <slaweq> https://bugs.launchpad.net/neutron/+bug/1950273
15:46:09 <slaweq> so sometimes those jobs seem to be useful ;)
15:46:18 <obondarev> nice! :)
15:46:59 <slaweq> it seems to be some race condition or something else, but even in such a case we shouldn't return error 500 to the user
15:47:02 <lajoskatona> +1
15:47:10 <slaweq> so at least we should properly handle that error
15:47:19 <slaweq> that's why I marked it as "Higt"
15:47:28 <slaweq> *High
15:47:49 <slaweq> ok, and the last topic from me for today
15:47:51 <slaweq> #topic Periodic
15:48:03 <slaweq> here everything except the UT job with neutron-lib master looks good
15:48:14 <slaweq> I opened bug https://bugs.launchpad.net/neutron/+bug/1950275 today
15:48:41 <lajoskatona> For that I pushed a patch: https://review.opendev.org/c/openstack/neutron/+/817178
15:49:05 <slaweq> thx lajoskatona
15:49:21 <slaweq> but my question is: why did some change in neutron-lib cause that failure?
15:49:35 <slaweq> as the OVO object is defined in the Neutron repo, isn't it?
15:49:44 <lajoskatona> but that breaks the unit tests with n-lib 2.16.0, and what I can't understand is:
15:49:52 <obondarev> I believe the order of constants on which the OVO depends changed
15:49:58 <lajoskatona> seems like this patch brings the failure: https://review.opendev.org/c/openstack/neutron-lib/+/816447
15:50:16 <lajoskatona> yeah, possible, but can that change the hash for the OVO?
15:50:38 <obondarev> I think we faced similar issues in the past
15:51:16 <lajoskatona> so if I understand it, we have to release n-lib with the constants change and bump n-lib for neutron?
15:51:38 <obondarev> I think so too
15:51:47 <lajoskatona> ok
15:51:49 <slaweq> but the problem may be that in the release patches it runs some UT job for neutron, no?
15:51:57 <slaweq> and this job will fail then
15:52:01 <slaweq> or am I missing something?
15:52:13 <lajoskatona> that's possible
15:53:46 <lajoskatona> I'll push a release patch and that will tell us if this way can work
15:53:57 <slaweq> ++
15:54:07 <slaweq> maybe I'm wrong here
15:54:26 <slaweq> it probably doesn't run UT in the releases
15:54:31 <slaweq> I hope so at least :)
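A rough sketch of how a constants change in neutron-lib can surface as an OVO hash change in the neutron UT job, assuming oslo.versionedobjects is available (DemoObject is made up for illustration; the neutron unit test does essentially the same comparison using ObjectVersionChecker against a list of expected hashes). The fingerprint it computes covers the field definitions, so when an enum field's valid values come from neutron-lib constants and those values change or get reordered, the computed hash can change and the expected value in the unit tests has to be bumped.

    from oslo_versionedobjects import base as ovo_base
    from oslo_versionedobjects import fields as ovo_fields
    from oslo_versionedobjects import fixture as ovo_fixture

    @ovo_base.VersionedObjectRegistry.register
    class DemoObject(ovo_base.VersionedObject):
        # Made-up object, only to show the mechanism.
        VERSION = '1.0'
        fields = {
            # In neutron such valid_values typically come from neutron-lib
            # constants; they are part of the field definition and therefore
            # part of the object's fingerprint.
            'state': ovo_fields.EnumField(valid_values=['ACTIVE', 'DOWN']),
        }

    # The UT job does the same kind of comparison against a dict of expected
    # "<version>-<hash>" strings; editing valid_values above changes the
    # printed hash.
    checker = ovo_fixture.ObjectVersionChecker(
        obj_classes=ovo_base.VersionedObjectRegistry.obj_classes())
    print(checker.get_hashes())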
15:54:42 <slaweq> ok, so that's all I had for today
15:54:50 <slaweq> anything else You want to discuss regarding CI?
15:54:56 <obondarev> lajoskatona: I would appreciate it if you could check https://review.opendev.org/c/openstack/neutron-lib/+/816468 before the lib release :)
15:55:15 <lajoskatona> obondarev: I will
15:55:38 <obondarev> lajoskatona: thanks a lot!
15:56:38 <slaweq> ok, so I think we can finish our meeting for today
15:56:57 <slaweq> have a great week and see You all next week on the video call again :)
15:56:57 <slaweq> o/
15:56:58 <slaweq> #endmeeting