15:00:33 <slaweq> #startmeeting neutron_ci
15:00:33 <openstack> Meeting started Tue Mar 2 15:00:33 2021 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:36 <openstack> The meeting name has been set to 'neutron_ci'
15:00:39 <slaweq> hi
15:00:57 <bcafarel> hey again
15:02:08 <slaweq> ping ralonsoh: lajoskatona
15:02:22 <slaweq> CI meeting, are You going to attend?
15:02:27 <ralonsoh> hi
15:02:58 <lajoskatona> Hi
15:03:02 <slaweq> I think we can start
15:03:08 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:03:15 <slaweq> #topic Actions from previous meetings
15:03:26 <slaweq> we have only one
15:03:31 <slaweq> slaweq to check failing periodic task in functional test
15:03:39 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/1916761
15:03:40 <openstack> Launchpad bug 1916761 in neutron "[dvr] bound port permanent arp entries never deleted" [High,In progress] - Assigned to Edward Hope-Morley (hopem)
15:03:47 <slaweq> and proposed patch https://review.opendev.org/c/openstack/neutron/+/778080
15:04:08 <slaweq> it seems that it works, at least there are no errors related to the maintenance task in the logs
15:04:27 <slaweq> the job-output.txt file is about 10x smaller with that change
15:05:00 <slaweq> no, sorry
15:05:08 <slaweq> it's not that much smaller
15:05:15 <slaweq> but it is significantly smaller
15:05:45 <slaweq> please review that patch if You have some time
15:06:06 <slaweq> and let's move on
15:06:08 <slaweq> #topic Stadium projects
15:06:17 <slaweq> anything regarding stadium projects' CI?
15:06:26 <lajoskatona> not much from me
15:06:39 <lajoskatona> still struggling with old branch fixes
15:06:52 <bcafarel> :)
15:06:56 <lajoskatona> I am at the point of asking around on the infra or QA channel
15:07:17 <lajoskatona> I ran into stupid "no pip2.7 available" issues and similar
15:07:45 <lajoskatona> but it's on older (before Train??) branches, so
15:08:39 <lajoskatona> that's it, as far as I remember
15:09:07 <slaweq> I think that all branches before Train are already EM
15:09:17 <bcafarel> yep
15:09:19 <slaweq> so we can probably mark them as EOL for stadium projects
15:09:22 <slaweq> no?
15:10:01 <lajoskatona> yes, we can check them all and decide based on that
15:10:27 <lajoskatona> I mean based on the alive backports or similar
15:10:37 <slaweq> lajoskatona: so if there is no interest in the community to maintain them, and there are big issues, I would say - don't spend too much time on it :)
15:11:00 <lajoskatona> agree
15:11:50 <slaweq> thx, please keep me updated if You want to EOL some branches in some projects
15:12:01 <bcafarel> especially if they have been broken for some time and no one complained
15:12:44 <lajoskatona> ok, I'll check where we are with those branches
15:12:47 <bcafarel> and I think we can easily announce them as Unmaintained (then if people show up to fix them, they can go back to EM)
15:13:01 <slaweq> ++
15:13:10 <lajoskatona> +1
15:13:16 <slaweq> thx lajoskatona for taking care of it
15:13:27 <slaweq> let's move on
15:13:29 <slaweq> #topic Stable branches
15:13:34 <slaweq> Victoria dashboard: https://grafana.opendev.org/d/HUCHup2Gz/neutron-failure-rate-previous-stable-release?orgId=1
15:13:36 <slaweq> Ussuri dashboard: https://grafana.opendev.org/d/smqHXphMk/neutron-failure-rate-older-stable-release?orgId=1
15:13:46 <slaweq> bcafarel: any updates about stable branches?
15:15:05 <ralonsoh> Train is failing, the "neutron-tempest-dvr-ha-multinode-full" job, qos migration tests
15:15:26 <ralonsoh> e.g.: https://0c345762207dc13e339e-d1e090fdf1a39e65d2b0ba37cbdce0a4.ssl.cf2.rackcdn.com/777781/1/check/neutron-tempest-dvr-ha-multinode-full/463e963/testr_results.html
15:15:30 <ralonsoh> in all patches
15:15:48 <slaweq> ralonsoh: but this job is non-voting, right?
15:16:05 <ralonsoh> it is, yes
15:16:11 <ralonsoh> just a heads-up
15:16:19 <slaweq> but it really seems like a nova issue
15:16:27 <slaweq> No valid host found for cold migrate
15:16:34 <slaweq> and
15:16:36 <slaweq> No valid host found for resize
15:16:58 <slaweq> is it only in Train?
15:17:31 <bcafarel> sorry, Murphy's law, the mailman rang just before the ping
15:17:43 <bcafarel> did it work before? I remember this job being mostly unstable
15:19:44 <slaweq> TBH I haven't checked it for a pretty long time
15:19:55 <slaweq> I can investigate and reach out to nova ppl if needed
15:20:17 <slaweq> #action slaweq to check failing qos migration tests in the Train neutron-tempest-dvr-ha-multinode-full job
15:21:46 <slaweq> anything else related to stable branches?
15:22:08 <bcafarel> hopefully https://review.opendev.org/c/openstack/neutron/+/777389 will be done soon for Stein :)
15:22:39 <bcafarel> the rest looked OK, I have a few to review in my backlog, but CI looked good overall
15:22:59 <slaweq> thx
15:23:03 <slaweq> so we can move on
15:23:06 <slaweq> next topic
15:23:08 <slaweq> #topic Grafana
15:23:12 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:23:44 <slaweq> I see just one, but pretty serious, problem there
15:23:53 <slaweq> fullstack/functional jobs are still failing way too much
15:24:05 <slaweq> other than that it seems that we are good
15:25:11 <slaweq> do You see any other issues there?
15:25:17 <ralonsoh> we need to commit ourselves to reporting in LP any error we find in those jobs
15:25:27 <ralonsoh> just to track it
15:26:01 <slaweq> yes, I agree
15:26:06 <slaweq> I spent some time today on them
15:26:30 <slaweq> and I came up with a few small patches https://review.opendev.org/q/topic:%22improve-neutron-ci%22+(status:merged)
15:26:58 <slaweq> sorry
15:26:59 <slaweq> https://review.opendev.org/q/topic:%2522improve-neutron-ci%2522+status:open
15:27:03 <slaweq> that is the correct link
15:27:09 <slaweq> please take a look at them
15:28:01 <slaweq> I think that the most frequent failures are due to the oom-killer killing the mysql server
15:28:22 <ralonsoh> right
15:28:24 <slaweq> so I proposed to lower the number of test workers in both jobs
15:28:31 <slaweq> in fullstack I already did that some time ago
15:28:45 <ralonsoh> let's reduce FT to 4 and fullstack to 3
15:28:51 <slaweq> but I forgot about the dsvm-fullstack-gate tox env, which is what is really used in the gate
15:29:05 <slaweq> ralonsoh: that's exactly what my patches proposed :)
15:29:13 <slaweq> FT to 4 and fullstack to 3
15:29:17 <ralonsoh> yeah hehehe
15:29:19 <slaweq> :)
15:29:33 <slaweq> so that should be covered :)
15:29:52 <slaweq> another issue is timed-out jobs
15:30:05 <slaweq> and I think those are mostly due to stestr and "too much output"
15:30:19 <slaweq> like we already had in the past in the UT, IIRC
15:30:25 <slaweq> so I proposed https://review.opendev.org/c/openstack/neutron/+/778196
15:30:37 <slaweq> and https://review.opendev.org/c/openstack/neutron/+/778080 should also help with that
15:30:51 <slaweq> but in the FT job there are still a lot of things logged
15:31:09 <slaweq> if You check https://23965cc52ad55df824a3-476d86922c45abb704c82e069ca48dea.ssl.cf1.rackcdn.com/778080/2/check/neutron-functional-with-uwsgi/875857e/job-output.txt
15:31:15 <slaweq> there are a lot of errors like:
15:31:26 <slaweq> oslo_db.exception.CantStartEngineError: No sql_connection parameter is established
15:31:33 <slaweq> and a lot of lines like:
15:31:39 <slaweq> Running command: ...
15:31:57 <slaweq> I was trying to somehow get rid of them but I really don't know how
15:32:11 <slaweq> if You have any ideas, help is more than welcome :)
15:33:11 <slaweq> anyone want to check that?
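(Side note on the "Running command: ..." noise above: a minimal sketch of one possible way to quiet it, assuming those lines come from the logger in neutron.agent.linux.utils that logs executed commands; the logger name and the idea of raising its level in the functional test fixtures are assumptions, not what the proposed patches actually do.)

    import logging

    # Raise the level of the (assumed) logger that emits the
    # "Running command: ..." debug lines, so the functional job's
    # job-output.txt is not flooded with them.
    logging.getLogger('neutron.agent.linux.utils').setLevel(logging.INFO)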
15:33:42 <ralonsoh> sure
15:34:21 <slaweq> thx ralonsoh
15:35:15 <slaweq> #action ralonsoh to try to check how to limit the number of logged lines in FT output
15:35:26 * slaweq will be back in 2 minutes
15:36:02 <ralonsoh> during this waiting time: I think this is because of "DBInconsistenciesPeriodics", in FTs
15:36:06 <ralonsoh> but I need to check it
15:36:57 * slaweq is back
15:37:17 <slaweq> ralonsoh: but in patch https://review.opendev.org/c/openstack/neutron/+/778080 I mocked this maintenance worker thread
15:37:28 <slaweq> and those lines are still logged there
15:37:37 <slaweq> maybe I missed something there, idk
15:38:00 <ralonsoh> right, you are stopping the thread there
15:39:36 <slaweq> there is one more issue which I found a couple of times in FT recently
15:39:42 <slaweq> timeouts while doing some ip operations, like in
15:39:47 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_018/772460/13/check/neutron-functional-with-uwsgi/0181e4f/testr_results.html
15:39:49 <slaweq> https://3d423a08ba57e3349bef-667e59a55d2239af414b0984e42f005a.ssl.cf5.rackcdn.com/771621/7/check/neutron-functional-with-uwsgi/c8d3396/testr_results.html
15:39:51 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ac4/768129/37/check/neutron-functional-with-uwsgi/ac480c2/testr_results.html
15:39:56 <slaweq> did You see them already?
15:40:03 <slaweq> maybe You know what can cause such problems?
15:40:17 <ralonsoh> I did, but I still can't find the root cause
15:41:12 <slaweq> I always saw it in the test_linuxbridge_arp_protect tests module
15:41:19 <slaweq> did You see it in different modules too?
15:41:39 <ralonsoh> I can't remember
15:41:40 <slaweq> maybe we could mark those tests as unstable for now, and that would give us some breathing room
15:42:17 <ralonsoh> I'll record all appearances I find and I'll report an LP bug
15:42:54 <slaweq> ralonsoh: thx
15:43:17 <slaweq> #action ralonsoh to report a bug about the ip operations timeout in FT
15:43:39 <slaweq> so that's basically all I had for today regarding those jobs
15:43:59 <slaweq> long story short, let's merge the patches which we have now and hopefully it will be a bit better
15:44:13 <slaweq> and then let's focus on debugging the issues which we already mentioned here
15:44:25 <slaweq> last topic for today
15:44:27 <slaweq> #topic Periodic
15:44:34 <slaweq> Job results: http://zuul.openstack.org/buildsets?project=openstack%2Fneutron&pipeline=periodic&branch=master
15:44:43 <slaweq> overall the periodic jobs seem good
15:44:51 <slaweq> but the fedora-based job is red all the time
15:44:58 <slaweq> is there any volunteer to check it?
15:45:26 <ralonsoh> not this week, sorry
15:45:30 <bcafarel> again? sigh
15:45:35 <slaweq> no need to be sorry ralonsoh :)
15:45:43 <bcafarel> I can try to take a look
15:45:44 <slaweq> bcafarel: I think it is still failing, not failing again :)
15:45:49 <slaweq> bcafarel: thx a lot
15:46:07 <slaweq> #action bcafarel to check the failing fedora-based periodic job
15:46:37 <bcafarel> it seems not so long ago we had to push a few fixes for it :)
15:47:16 <slaweq> yes, but then it started failing again
15:47:22 <slaweq> and we never fixed it, I think :/
15:48:04 <slaweq> and that's basically all I had for today
15:48:18 <slaweq> do You have anything else You want to discuss today regarding our CI?
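(Side note on the "mark those tests as unstable" idea above: a minimal sketch assuming neutron's unstable_test helper in neutron.tests.base; the test class, test name and bug reference below are placeholders, not the real linuxbridge ARP-protect functional tests.)

    from neutron.tests import base

    class LinuxBridgeARPProtectPlaceholderTestCase(base.BaseTestCase):

        # Assumed decorator: skips the test when it fails instead of
        # failing the whole functional job; the reason string is a
        # placeholder for the LP bug ralonsoh agreed to report.
        @base.unstable_test("bug for ip command timeouts in FT")
        def test_arp_protection_placeholder(self):
            pass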
15:48:31 <ralonsoh> no
15:48:36 <bcafarel> nothing from me
15:48:44 <slaweq> if not, I will give You a few minutes back today
15:48:50 <slaweq> thx for attending the meeting
15:48:55 <slaweq> and have a great week
15:48:56 <slaweq> o/
15:48:57 <ralonsoh> bye!
15:48:59 <slaweq> #endmeeting