15:00:11 #startmeeting neutron_ci
15:00:11 Meeting started Tue Nov 9 15:00:11 2021 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:11 The meeting name has been set to 'neutron_ci'
15:00:28 Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:00:37 Merged openstack/neutron master: Bump OVN version for functional job to 21.06 https://review.opendev.org/c/openstack/neutron/+/816614
15:00:49 bcafarel: lajoskatona obondarev CI meeting is starting :)
15:00:50 hi
15:00:54 hi
15:00:55 o/
15:02:08 Hi
15:02:44 ok, I think we can start
15:02:48 this week is just on IRC
15:02:56 #topic Actions from previous meetings
15:03:05 slaweq to work on https://bugs.launchpad.net/neutron/+bug/1948832
15:03:35 I was checking it today
15:03:57 and TBH I don't think it's the same bug as mentioned by ralonsoh in the LP
15:04:16 I wrote my findings in the comment there
15:04:25 I will try to investigate it a little bit more this week
15:05:35 #action slaweq to check more deeply https://bugs.launchpad.net/neutron/+bug/1948832
15:05:39 next one
15:05:49 ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:06:55 I guess we need to move that one to next week as ralonsoh is not here today
15:07:18 #action ralonsoh to check metadata issue https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/814807/4/check/neutron-tempest-plugin-scenario-openvswitch/3de2195/testr_results.html
15:07:26 next one
15:07:31 slaweq to check https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:07:48 It seems to me that the issue is similar to https://launchpad.net/bugs/1892861
15:07:48 The problem is that the "login: " message in the console log appears pretty quickly, but cloud-init is still doing some things on the node, so maybe SSH isn't working yet and we hit the same bug.
15:09:02 this happens with cirros?
15:09:13 no, it's with Ubuntu
15:09:20 ok
15:09:51 maybe we should find a better way to check if the guest OS is really booted before doing SSH
15:09:52 advanced OS wanting to show that login prompt too fast :)
15:10:08 it seems like that to me
15:10:26 the problem is that sounds OS-dependent? though maybe a systemd check for advanced images + "login:" for cirros would be good enough
15:10:35 because I didn't see anything else wrong there really
15:12:09 anyway, I will keep an eye on that. If it happens more often I will report a bug and try to figure out something :)
15:12:27 for now let's move on to the last one
15:12:29 lajoskatona to check https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/testr_results.html
15:12:58 yeah I checked this one, and the strange thing is that the FIP goes UP but too late (~a few secs)
15:13:31 to UP or to DOWN?
15:13:47 slaweq: I mean down, thanks....
15:13:47 the error message says "attached port status failed to transition to DOWN "
15:13:52 ahh, ok
15:14:10 how long are we waiting for that transition?
15:14:43 120sec
15:15:12 should be more than enough..
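[Editor's note: a minimal sketch of the guest-readiness idea discussed earlier in this section, assuming we keep polling the instance console log for a boot-completion marker instead of trusting the first "login:" prompt before trying SSH. The helper name, the get_console_output callable and the marker strings are illustrative assumptions, not what neutron-tempest-plugin currently implements.]

```python
# Sketch only: poll the console log until the guest looks fully booted
# before attempting SSH. `get_console_output` is a hypothetical callable
# (e.g. a wrapper around the compute API console-output call) supplied by
# the test; the marker strings are assumptions.
import time


def wait_until_guest_ready(get_console_output, is_cirros,
                           timeout=300, interval=5):
    """Return once the console log suggests the guest finished booting."""
    # For cirros the login prompt is the best signal available; for a full
    # OS like Ubuntu, wait for cloud-init to report that it finished.
    marker = "login:" if is_cirros else "finished at"
    deadline = time.time() + timeout
    while time.time() < deadline:
        if marker in get_console_output():
            return
        time.sleep(interval)
    raise AssertionError(
        "Guest did not report boot completion within %s seconds" % timeout)
```

[Keying the wait on cloud-init's final log line rather than the login prompt is exactly the OS-dependent trade-off raised in the discussion above.]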
15:15:18 yeah
15:15:24 tempest stops waiting at: 2021-10-21 08:09:14.583994
15:15:54 and q-svc reports it's down: Oct 21 08:09:17.015995 ubuntu-focal-iweb-mtl01-0027033075 neutron-server[81347]: DEBUG neutron.api.rpc.handlers.l3_rpc [None req-ca3c8a33-07ab-41b7-b946-2411e666af30 None None] New status for floating IP aa555045-e872-47f5-a5f4-b4b59017b474: DOWN {{(pid=81347) update_floatingip_statuses /opt/stack/neutron/neutron/api/rpc/handlers/l3_rpc.py:270}}
15:16:23 from https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/job-output.txt and https://581819ea67919485b97e-6002fae613cad806f99007086c39ea60.ssl.cf2.rackcdn.com/813977/5/gate/neutron-tempest-plugin-scenario-linuxbridge/8fdcf6f/controller/logs/screen-q-svc.txt
15:16:25 was the L3 agent very busy during that time?
15:16:55 yeah, it was continuously refreshing the iptables rules
15:17:58 my question: do we need HA routers enabled for this, for example?
15:17:59 maybe there is some issue in the L3 agent then?
15:18:13 this is a singlenode job as I see
15:18:21 lajoskatona: we enabled HA for those routers to have at least some HA test coverage
15:18:36 as we don't have any other jobs with HA routers
15:18:56 and even if that job is singlenode, when the router is HA, all the HA codepath is tested
15:19:02 like keepalived and other stuff
15:19:18 it's just that the router is always transitioned to be primary on that single node :)
15:19:46 slaweq: ok
15:20:33 I can check the l3-agent to see if I can find why it took so long to set the FIP down
15:21:09 ++
15:21:33 #action lajoskatona to check why making the FIP DOWN took more than 120 seconds in the L3 agent
15:21:37 thx lajoskatona
15:21:53 ok, those are all the actions from last week
15:21:57 let's move on
15:21:58 #topic Stadium projects
15:22:17 nothing new
15:22:37 This week I realized that odl tempest jobs try to run with ovn
15:23:13 so I fought against that; the jobs will still fail, but at least we have services up and tempest/rally started
15:24:05 that's it for stadiums
15:24:11 do You have a patch for that?
15:24:13 I can review it if You want
15:24:17 thanks
15:24:54 it's not fully ready yet according to zuul, as I see: https://review.opendev.org/c/openstack/networking-odl/+/817186
15:26:39 ok
15:26:46 ohh, no, tempest is ok at least. It's failing but in the usual way, which is due to ODL being slow behind it....
15:27:22 ok, so I have to tune the rally job
15:27:38 I will send it on IRC when that is ok
15:27:47 I added it to my review list for tomorrow morning :)
15:27:56 slaweq: thanks
15:28:13 ok, next topic then
15:28:18 #topic Stable branches
15:28:51 overall good this week, as mentioned ussuri has only https://review.opendev.org/c/openstack/neutron/+/816661 left (before EM transition)
15:29:02 though the functional test failure in it looks related
15:29:26 and I found just before the meeting that train has neutron-tempest-plugin-designate-scenario-train failing :( https://zuul.opendev.org/t/openstack/build/a6a1142368b742248be710f902f541f5
15:29:28 yes, indeed
15:29:33 I will check it tomorrow
15:29:48 I will file a bug for that designate train one after the meetings
15:30:03 but at least the fully supported branches are good :)
15:30:34 ok, thx bcafarel
15:30:51 I think we can move on then
15:30:54 #topic Grafana
15:31:04 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:31:48 rally was broken last week
15:32:02 but it's I guess due to that missing service endpoint in keystone, right?
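[Editor's note: for reference, the 120-second FIP wait dissected at the start of this stretch (tempest gave up at 08:09:14, neutron-server only reported DOWN at 08:09:17) is essentially a status poll like the sketch below. The client object and its show_floatingip() call are modelled on the tempest network client, but treat the exact names here as assumptions.]

```python
# Sketch of the kind of waiter involved in the linuxbridge failure above:
# poll the Neutron API until the floating IP reaches the expected status,
# or give up after the timeout (120 s in the failing job).
import time


def wait_for_fip_status(client, fip_id, expected_status="DOWN",
                        timeout=120, interval=5):
    start = time.time()
    while time.time() - start < timeout:
        fip = client.show_floatingip(fip_id)["floatingip"]
        if fip["status"] == expected_status:
            return fip
        time.sleep(interval)
    raise TimeoutError(
        "Floating IP %s did not reach status %s within %s seconds"
        % (fip_id, expected_status, timeout))
```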
15:32:08 and it's already fixed
15:32:20 yes
15:32:43 actually it's funny, in devstack the endpoint creation was deleted and now we create it from the job
15:33:03 and it seems like neutron is mostly the one running rally as a voting job
15:33:09 so only the rally job needs it?
15:33:28 yes, as I remember
15:34:01 hmm, ok
15:34:07 thx for fixing that issue quickly
15:34:07 and the proper fix should be in rally, but I checked and it's not clear at first view where to fix it.....
15:34:37 it was ralonsoh actually, I just ran another "alternative" path, but that was not enough....
15:34:55 :)
15:35:33 from other things, I see that the neutron-tempest-plugin-scenario-linuxbridge job is failing pretty often (more than other similar jobs)
15:35:43 I'm not sure why it is like that really
15:35:58 probably some timeouts which I saw pretty often in the scenario jobs last week
15:36:28 like e.g. https://zuul.opendev.org/t/openstack/build/7a2bc299305249c5911e7107bb4d4a37
15:38:58 I think we need to investigate why our tests have been so slow so often recently
15:39:28 I doubt I will have time for that this week but maybe there is someone who could check that
15:40:08 also neutron-tempest-plugin-scenario-ovn started failing I think
15:40:22 example: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_330/807116/23/check/neutron-tempest-plugin-scenario-ovn/330ce5c/testr_results.html
15:40:27 obondarev: You mean with timeouts?
15:40:33 nope
15:40:48 a couple of tests failing with ssh I believe
15:41:32 hmm, in those 2 tests for example we are missing the console log from the instance
15:41:35 so it's hard to say
15:41:52 but it may be the same issue as in https://5a5cde44dedb81c8bd48-91d0b9dca863bf6ffc8b1718d062319a.ssl.cf5.rackcdn.com/805391/13/check/neutron-tempest-plugin-scenario-ovn/84283cb/testr_results.html
15:42:07 which I spoke about at the beginning of the meeting
15:44:08 let's observe it and if someone has some time, maybe investigate it a bit more :)
15:44:10 I'll keep an eye on it and report a bug if it continues to fail with similar symptoms
15:44:23 obondarev: sounds good, thx
15:44:57 ok, I think we can move on quickly
15:45:16 we already spoke about the most urgent issues in our jobs, at least those which I had prepared for today
15:45:22 #topic tempest/scenario
15:45:32 here I have one more thing to mention
15:45:39 this time, good news I think
15:45:47 our CI job found a bug :)
15:45:55 https://bugs.launchpad.net/neutron/+bug/1950273
15:46:09 so sometimes those jobs seem to be useful ;)
15:46:18 nice! :)
15:46:59 it seems to be some race condition or something else, but even in such a case we shouldn't return error 500 to the user
15:47:02 +1
15:47:10 so at least we should properly handle that error
15:47:19 that's why I marked it as "High"
15:47:49 ok, and the last topic from me for today
15:47:51 #topic Periodic
15:48:03 here everything except the UT job with neutron-lib master looks good
15:48:14 I opened bug https://bugs.launchpad.net/neutron/+bug/1950275 today
15:48:41 For that I pushed a patch: https://review.opendev.org/c/openstack/neutron/+/817178
15:49:05 thx lajoskatona
15:49:21 but my question is: why did some change in neutron-lib cause that failure?
15:49:35 as the OVO object is defined in the Neutron repo, isn't it?
15:49:44 but that breaks the unit tests with n-lib 2.16.0, and that's what I can't understand:
15:49:52 I believe the order of the constants on which the OVO depends changed
15:49:58 seems like this patch brings the failure: https://review.opendev.org/c/openstack/neutron-lib/+/816447
15:50:16 yeah, possible, but can that change the hash for the OVO?
15:50:38 I think we faced similar issues in the past
15:51:16 so if I understand it, we have to release n-lib with the constants change and bump n-lib in neutron?
15:51:38 I think so too
15:51:47 ok
15:51:49 but the problem may be that the release patches run some UT job for neutron, no?
15:51:57 and this job will fail then
15:52:01 or am I missing something?
15:52:13 that's possible
15:53:46 I'll push a release patch and that will tell us if this way can work
15:53:57 ++
15:54:07 maybe I'm wrong here
15:54:26 the releases repo probably doesn't run UT
15:54:31 I hope so at least :)
15:54:42 ok, so that's all I had for today
15:54:50 anything else You want to discuss regarding CI?
15:54:56 lajoskatona: I would appreciate it if you check https://review.opendev.org/c/openstack/neutron-lib/+/816468 before the lib release :)
15:55:15 obondarev: I will
15:55:38 lajoskatona: thanks a lot!
15:56:38 ok, so I think we can finish our meeting for today
15:56:57 have a great week and see You all next week on the video call again :)
15:56:57 o/
15:56:58 #endmeeting
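[Editor's note: a hedged illustration of the hash question discussed just before the meeting closed. The fingerprint compared by OVO hash unit tests is derived from each object's field definitions, and an EnumField built from a list of constants changes its fingerprint when that list changes, including its order. The class below is purely hypothetical; neutron's real objects live under neutron/objects/.]

```python
# Hypothetical example only, to show why a constants change in neutron-lib
# can break neutron's OVO hash unit tests.
from oslo_versionedobjects import base as ovo_base
from oslo_versionedobjects import fields as ovo_fields
from oslo_versionedobjects import fixture as ovo_fixture

# Imagine these values come from neutron-lib constants.
FAKE_NETWORK_TYPES = ['flat', 'vlan']


@ovo_base.VersionedObjectRegistry.register
class FakeSegment(ovo_base.VersionedObject):
    VERSION = '1.0'
    fields = {
        'network_type': ovo_fields.EnumField(valid_values=FAKE_NETWORK_TYPES),
    }


checker = ovo_fixture.ObjectVersionChecker(
    obj_classes={'FakeSegment': [FakeSegment]})
# Reordering or extending FAKE_NETWORK_TYPES changes this fingerprint, which
# is the kind of mismatch the UT job with neutron-lib master tripped over.
print(checker.get_hashes())
```

[That is why the discussion above lands on releasing neutron-lib with the constants change and then bumping it in neutron.]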