15:00:33 <slaweq> #startmeeting neutron_ci
15:00:33 <openstack> Meeting started Tue Mar 2 15:00:33 2021 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:36 <openstack> The meeting name has been set to 'neutron_ci'
15:00:39 <slaweq> hi
15:00:57 <bcafarel> hey again
15:02:08 <slaweq> ping ralonsoh: lajoskatona
15:02:22 <slaweq> CI meeting, are You going to attend?
15:02:27 <ralonsoh> hi
15:02:58 <lajoskatona> Hi
15:03:02 <slaweq> I think we can start
15:03:08 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:03:15 <slaweq> #topic Actions from previous meetings
15:03:26 <slaweq> we have only one
15:03:31 <slaweq> slaweq to check failing periodic task in functional test
15:03:39 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/1916761
15:03:40 <openstack> Launchpad bug 1916761 in neutron "[dvr] bound port permanent arp entries never deleted" [High,In progress] - Assigned to Edward Hope-Morley (hopem)
15:03:47 <slaweq> and proposed patch https://review.opendev.org/c/openstack/neutron/+/778080
15:04:08 <slaweq> it seems that it works, at least there are no errors related to the maintenance task in the logs
15:04:27 <slaweq> the job-output.txt file is about 10x smaller with that change
15:05:00 <slaweq> no, sorry
15:05:08 <slaweq> it's not that much smaller
15:05:15 <slaweq> but it is significantly smaller
15:05:45 <slaweq> please review that patch if You have some time
15:06:06 <slaweq> and let's move on
15:06:08 <slaweq> #topic Stadium projects
15:06:17 <slaweq> anything regarding stadium projects' CI?
15:06:26 <lajoskatona> not much from me
15:06:39 <lajoskatona> still struggling with old branch fixes
15:06:52 <bcafarel> :)
15:06:56 <lajoskatona> I am at the point of asking around on the infra or QA channel
15:07:17 <lajoskatona> I ran into stupid "no pip2.7 available" issues and similar
15:07:45 <lajoskatona> but it's on older (before Train??) branches, so
15:08:39 <lajoskatona> that's it, as far as I remember
15:09:07 <slaweq> I think that all branches before Train are already EM
15:09:17 <bcafarel> yep
15:09:19 <slaweq> so we can probably mark them as EOL for stadium projects
15:09:22 <slaweq> no?
15:10:01 <lajoskatona> yes, we can check them all and decide based on that
15:10:27 <lajoskatona> I mean based on the alive backports or similar
15:10:37 <slaweq> lajoskatona: so if there is no interest in the community to maintain them, and there are big issues, I would say - don't spend too much time on it :)
15:11:00 <lajoskatona> agree
15:11:50 <slaweq> thx, please keep me updated if You want to EOL some branches in some projects
15:12:01 <bcafarel> especially if they have been broken for some time and no one complained
15:12:44 <lajoskatona> ok, I'll check where we are with those branches
15:12:47 <bcafarel> and I think we can easily announce them as Unmaintained (then if people show up to fix them, they can go back to EM)
15:13:01 <slaweq> ++
15:13:10 <lajoskatona> +1
15:13:16 <slaweq> thx lajoskatona for taking care of it
15:13:27 <slaweq> let's move on
15:13:29 <slaweq> #topic Stable branches
15:13:34 <slaweq> Victoria dashboard: https://grafana.opendev.org/d/HUCHup2Gz/neutron-failure-rate-previous-stable-release?orgId=1
15:13:36 <slaweq> Ussuri dashboard: https://grafana.opendev.org/d/smqHXphMk/neutron-failure-rate-older-stable-release?orgId=1
15:13:46 <slaweq> bcafarel: any updates about stable branches?
15:15:05 <ralonsoh> Train is failing, the "neutron-tempest-dvr-ha-multinode-full" job, qos migration tests
15:15:26 <ralonsoh> e.g.: https://0c345762207dc13e339e-d1e090fdf1a39e65d2b0ba37cbdce0a4.ssl.cf2.rackcdn.com/777781/1/check/neutron-tempest-dvr-ha-multinode-full/463e963/testr_results.html
15:15:30 <ralonsoh> in all patches
15:15:48 <slaweq> ralonsoh: but this job is non-voting, right?
15:16:05 <ralonsoh> it is, yes
15:16:11 <ralonsoh> just a heads-up
15:16:19 <slaweq> but it really seems like a nova issue
15:16:27 <slaweq> No valid host found for cold migrate
15:16:34 <slaweq> and
15:16:36 <slaweq> No valid host found for resize
15:16:58 <slaweq> is it only in Train?
15:17:31 <bcafarel> sorry, Murphy's law, the mailman rang just before the ping
15:17:43 <bcafarel> did it work before? I remember this job being mostly unstable
15:19:44 <slaweq> TBH I haven't checked it for a pretty long time
15:19:55 <slaweq> I can investigate and reach out to nova ppl if needed
15:20:17 <slaweq> #action slaweq to check failing qos migration tests in the Train neutron-tempest-dvr-ha-multinode-full job
15:21:46 <slaweq> anything else related to stable branches?
15:22:08 <bcafarel> hopefully https://review.opendev.org/c/openstack/neutron/+/777389 will be done soon for Stein :)
15:22:39 <bcafarel> the rest looked OK, I have a few to review in my backlog, but CI looked good overall
15:22:59 <slaweq> thx
15:23:03 <slaweq> so we can move on
15:23:06 <slaweq> next topic
15:23:08 <slaweq> #topic Grafana
15:23:12 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:23:44 <slaweq> I see just one, but pretty serious, problem there
15:23:53 <slaweq> fullstack/functional jobs are still failing way too much
15:24:05 <slaweq> other than that it seems that we are good
15:25:11 <slaweq> do You see any other issues there?
15:25:17 <ralonsoh> we need to commit ourselves to reporting in LP any error we find in those jobs
15:25:27 <ralonsoh> just to track it
15:26:01 <slaweq> yes, I agree
15:26:06 <slaweq> I spent some time today on them
15:26:30 <slaweq> and I came up with a few small patches https://review.opendev.org/q/topic:%22improve-neutron-ci%22+(status:merged)
15:26:58 <slaweq> sorry
15:26:59 <slaweq> https://review.opendev.org/q/topic:%2522improve-neutron-ci%2522+status:open
15:27:03 <slaweq> that is the correct link
15:27:09 <slaweq> please take a look at them
15:28:01 <slaweq> I think that the most frequent failures are due to the oom-killer killing the mysql server
15:28:22 <ralonsoh> right
15:28:24 <slaweq> so I proposed to lower the number of test workers in both jobs
15:28:31 <slaweq> in fullstack I already did that some time ago
15:28:45 <ralonsoh> let's reduce FT to 4 and fullstack to 3
15:28:51 <slaweq> but I forgot about the dsvm-fullstack-gate tox env, which is what is really used in the gate
15:29:05 <slaweq> ralonsoh: that's exactly what my patches proposed :)
15:29:13 <slaweq> FT to 4 and fullstack to 3
15:29:17 <ralonsoh> yeah hehehe
15:29:19 <slaweq> :)
15:29:33 <slaweq> so that should be covered :)
15:29:52 <slaweq> another issue is timed-out jobs
15:30:05 <slaweq> and I think those are mostly due to stestr and "too much output"
15:30:19 <slaweq> like we already had in the past in the UT, IIRC
15:30:25 <slaweq> so I proposed https://review.opendev.org/c/openstack/neutron/+/778196
15:30:37 <slaweq> and https://review.opendev.org/c/openstack/neutron/+/778080 should also help with that
15:30:51 <slaweq> but in the FT job there are still a lot of things logged
15:31:09 <slaweq> if You check https://23965cc52ad55df824a3-476d86922c45abb704c82e069ca48dea.ssl.cf1.rackcdn.com/778080/2/check/neutron-functional-with-uwsgi/875857e/job-output.txt
15:31:15 <slaweq> there are a lot of errors like:
15:31:26 <slaweq> oslo_db.exception.CantStartEngineError: No sql_connection parameter is established
15:31:33 <slaweq> and a lot of lines like:
15:31:39 <slaweq> Running command: ...
15:31:57 <slaweq> I was trying to somehow get rid of them but I really don't know how
15:32:11 <slaweq> if You have any ideas, help is more than welcome :)
15:33:11 <slaweq> anyone want to check that?
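(Side note on the "Running command: ..." noise above: a minimal sketch of one possible way to quiet it, assuming those lines come from the logger in neutron.agent.linux.utils that logs executed commands; the logger name and the idea of raising its level in the functional test fixtures are assumptions, not what the proposed patches actually do.)

    import logging

    # Raise the level of the (assumed) logger that emits the
    # "Running command: ..." debug lines, so the functional job's
    # job-output.txt is not flooded with them.
    logging.getLogger('neutron.agent.linux.utils').setLevel(logging.INFO)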
15:33:42 <ralonsoh> sure
15:34:21 <slaweq> thx ralonsoh
15:35:15 <slaweq> #action ralonsoh to try to check how to limit the number of logged lines in FT output
15:35:26 * slaweq will be back in 2 minutes
15:36:02 <ralonsoh> during this waiting time: I think this is because of "DBInconsistenciesPeriodics", in FTs
15:36:06 <ralonsoh> but I need to check it
15:36:57 * slaweq is back
15:37:17 <slaweq> ralonsoh: but in patch https://review.opendev.org/c/openstack/neutron/+/778080 I mocked this maintenance worker thread
15:37:28 <slaweq> and those lines are still logged there
15:37:37 <slaweq> maybe I missed something there, idk
15:38:00 <ralonsoh> right, you are stopping the thread there
15:39:36 <slaweq> there is one more issue which I found a couple of times in FT recently
15:39:42 <slaweq> timeouts while doing some ip operations, like in
15:39:47 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_018/772460/13/check/neutron-functional-with-uwsgi/0181e4f/testr_results.html
15:39:49 <slaweq> https://3d423a08ba57e3349bef-667e59a55d2239af414b0984e42f005a.ssl.cf5.rackcdn.com/771621/7/check/neutron-functional-with-uwsgi/c8d3396/testr_results.html
15:39:51 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ac4/768129/37/check/neutron-functional-with-uwsgi/ac480c2/testr_results.html
15:39:56 <slaweq> did You see them already?
15:40:03 <slaweq> maybe You know what can cause such problems?
15:40:17 <ralonsoh> I did, but I still can't find the root cause
15:41:12 <slaweq> I always saw it in the test_linuxbridge_arp_protect tests module
15:41:19 <slaweq> did You see it in different modules too?
15:41:39 <ralonsoh> I can't remember
15:41:40 <slaweq> maybe we could mark those tests as unstable for now, and that would give us some breathing room
15:42:17 <ralonsoh> I'll record all appearances I find and I'll report an LP bug
15:42:54 <slaweq> ralonsoh: thx
15:43:17 <slaweq> #action ralonsoh to report a bug about the ip operations timeout in FT
15:43:39 <slaweq> so that's basically all I had for today regarding those jobs
15:43:59 <slaweq> long story short, let's merge the patches which we have now and hopefully it will be a bit better
15:44:13 <slaweq> and then let's focus on debugging the issues which we already mentioned here
15:44:25 <slaweq> last topic for today
15:44:27 <slaweq> #topic Periodic
15:44:34 <slaweq> Job results: http://zuul.openstack.org/buildsets?project=openstack%2Fneutron&pipeline=periodic&branch=master
15:44:43 <slaweq> overall the periodic jobs seem good
15:44:51 <slaweq> but the fedora-based job is red all the time
15:44:58 <slaweq> is there any volunteer to check it?
15:45:26 <ralonsoh> not this week, sorry
15:45:30 <bcafarel> again? sigh
15:45:35 <slaweq> no need to be sorry ralonsoh :)
15:45:43 <bcafarel> I can try to take a look
15:45:44 <slaweq> bcafarel: I think it is still failing, not failing again :)
15:45:49 <slaweq> bcafarel: thx a lot
15:46:07 <slaweq> #action bcafarel to check the failing fedora-based periodic job
15:46:37 <bcafarel> it seems not so long ago we had to push a few fixes for it :)
15:47:16 <slaweq> yes, but then it started failing again
15:47:22 <slaweq> and we never fixed it, I think :/
15:48:04 <slaweq> and that's basically all I had for today
15:48:18 <slaweq> do You have anything else You want to discuss today regarding our CI?
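(Side note on the "mark those tests as unstable" idea above: a minimal sketch assuming neutron's unstable_test helper in neutron.tests.base; the test class, test name and bug reference below are placeholders, not the real linuxbridge ARP-protect functional tests.)

    from neutron.tests import base

    class LinuxBridgeARPProtectPlaceholderTestCase(base.BaseTestCase):

        # Assumed decorator: skips the test when it fails instead of
        # failing the whole functional job; the reason string is a
        # placeholder for the LP bug ralonsoh agreed to report.
        @base.unstable_test("bug for ip command timeouts in FT")
        def test_arp_protection_placeholder(self):
            pass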
15:48:31 <ralonsoh> no
15:48:36 <bcafarel> nothing from me
15:48:44 <slaweq> if not, I will give You a few minutes back today
15:48:50 <slaweq> thx for attending the meeting
15:48:55 <slaweq> and have a great week
15:48:56 <slaweq> o/
15:48:57 <ralonsoh> bye!
15:48:59 <slaweq> #endmeeting