16:00:05 <slaweq> #startmeeting neutron_ci
16:00:06 <openstack> Meeting started Tue Dec 10 16:00:05 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:08 <slaweq> hi
16:00:09 <openstack> The meeting name has been set to 'neutron_ci'
16:00:52 <bcafarel> o/
16:01:46 <slaweq> let's wait a few more minutes for ralonsoh and others
16:01:51 <ralonsoh> hi
16:02:54 <slaweq> ok, lets start
16:02:56 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:11 <slaweq> please open it now so that it will be ready when needed :)
16:03:39 <slaweq> #topic Actions from previous meetings
16:03:51 <slaweq> first one:
16:03:53 <slaweq> njohnston to check failing NetworkMigrationFromHA in multinode dvr job
16:04:01 <slaweq> I'm not sure if njohnston is around now
16:04:51 <slaweq> #action njohnston to check failing NetworkMigrationFromHA in multinode dvr job
16:04:58 <slaweq> let's keep it for next week then
16:05:02 <njohnston> o/
16:05:06 <slaweq> hi njohnston :)
16:05:12 <njohnston> yeah, keep it for next week, I am debugging it right now
16:05:18 <slaweq> ok
16:05:20 <slaweq> thx
16:05:26 <slaweq> and good luck with debugging
16:05:28 <slaweq> :)
16:05:36 <slaweq> ok, next one:
16:05:37 <njohnston> :)
16:05:43 <slaweq> ralonsoh to check functional tests timeouts https://bugs.launchpad.net/neutron/+bug/1854462
16:05:43 <openstack> Launchpad bug 1854462 in neutron "[Functional tests] Timeout exception in list_namespace_pids" [High,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:05:56 <ralonsoh> I wrote a small script for this
16:06:12 <ralonsoh> http://paste.openstack.org/show/787322/
16:06:27 <ralonsoh> and I added log messages in pyroute2
16:07:01 <ralonsoh> I detected that most of the time, the blocking method was https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/__init__.py#L209
16:07:20 <ralonsoh> so instead of calling it every time we call create/delete namespace
16:07:44 <ralonsoh> I create the object once (in the root context, see patch https://review.opendev.org/#/c/698039/)
16:07:46 <ralonsoh> that's all
16:09:01 <slaweq> smart :)
16:09:15 <ralonsoh> BTW, _CDLL = ctypes.CDLL(ctypes_util.find_library('c'), use_errno=True)
16:09:25 <ralonsoh> this MUST not change during the execution
16:09:59 <njohnston> maybe put a comment in to that effect?
16:10:25 <ralonsoh> njohnston, I mean: the library can't be modified
16:10:32 <ralonsoh> this is not going to happen
16:10:36 <njohnston> ok
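[Editor's note: a minimal sketch of the caching pattern ralonsoh describes above, assuming pyroute2's netns.create()/remove() accept a pre-built libc handle as the linked patch relies on; the helper names here are illustrative, not the actual neutron code.]

    import ctypes
    from ctypes import util as ctypes_util

    from pyroute2 import netns

    # Resolve libc once, in the root (privileged) context; this handle must
    # not change during execution.
    _CDLL = ctypes.CDLL(ctypes_util.find_library('c'), use_errno=True)

    def create_netns(name):
        # Reusing the pre-built handle avoids the repeated find_library()/
        # CDLL() call that was blocking the functional tests.
        netns.create(name, libc=_CDLL)

    def delete_netns(name):
        netns.remove(name, libc=_CDLL)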
16:11:27 <slaweq> ok
16:11:40 <slaweq> thx ralonsoh for working on this
16:11:51 <slaweq> I hope we will get rid of those timeouts with this patch
16:11:57 <slaweq> next one
16:11:59 <slaweq> slaweq to check reason of grenade jobs failures
16:12:02 <bcafarel> looks nice indeed
16:12:02 <slaweq> I checked it
16:12:25 <slaweq> and it seems that all those failures are related to https://bugs.launchpad.net/nova/+bug/1844929
16:12:25 <openstack> Launchpad bug 1844929 in OpenStack Compute (nova) "grenade jobs failing due to "Timed out waiting for response from cell" in scheduler" [High,Confirmed]
16:13:06 <slaweq> and as I talked with efried and mriedem yesterday, it is probably caused by oversubscribed CI nodes
16:13:27 <slaweq> so we don't have any good solution for that problem now
16:13:54 <slaweq> only 2 possible options imo are:
16:14:13 <slaweq> 1. live with it like it's now
16:14:34 <slaweq> 2. make grenade jobs non-voting and non-gating temporarily until this issue is solved
16:14:49 <slaweq> problem with 2 is that we don't know when it may possibly be fixed
16:15:21 <ralonsoh> pfff all grenade jobs?
16:15:38 <ralonsoh> mark them as non-voting?
16:15:41 <slaweq> ralonsoh: we have now only 2 multinode grenade jobs
16:15:42 <njohnston> I wonder if it would be worthwhile to email openstack-discuss and ask if anything can be done about the oversubscription
16:15:50 <slaweq> we removed single node jobs
16:16:01 <ralonsoh> yes, but several tests
16:16:04 <ralonsoh> I mean tests
16:16:28 <slaweq> njohnston: there are such threads started by mriedem here http://lists.openstack.org/pipermail/openstack-discuss/2019-October/thread.html#10484 and continued here http://lists.openstack.org/pipermail/openstack-discuss/2019-November/thread.html#10502
16:16:28 <clarkb> we are our own noisy neighbors in many cases. One way to address oversubscription is to make our software run more efficiently
16:17:19 <clarkb> devstack jobs swap and I've asked several times that openstack address this
16:17:43 <clarkb> I think fixing swapping will likely have a major impact on performance related problems
16:19:40 <slaweq> clarkb: so we should focus on optimizing Neutron's memory usage to make this work better, correct?
16:21:04 <ralonsoh> slaweq, agree with this but we usually optimize the speed, not the memory consumption
16:21:28 <slaweq> yep
16:21:30 <ralonsoh> most of our efforts are in optimizing the DB access, the parallelism, etc
16:22:19 <clarkb> slaweq: and the rest of openstack/devstack
16:22:27 <clarkb> etcd seems like it isnt used but always run
16:22:31 <clarkb> cinder backup too
16:22:57 <clarkb> but it all adds up, then you start swapping, which impacts the current job and all other jobs trying to access those disk resources
16:23:33 <slaweq> clarkb: ok, I will check those jobs and maybe I will be able to disable some of those unused services there
16:23:54 <slaweq> thx for the tips
16:24:17 <slaweq> #action slaweq to check and try to optimize neutron-grenade-multinode jobs memory consumption
16:24:39 <slaweq> ok, I think we can move on
16:24:44 <slaweq> next topic
16:24:46 <slaweq> #topic Stadium projects
16:24:52 <slaweq> tempest-plugins migration
16:24:57 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:25:10 <slaweq> last 2 patches for neutron-vpnaas are ready for review now
16:25:21 <slaweq> Step 1: https://review.openstack.org/#/c/649373
16:25:23 <slaweq> Step 2: https://review.opendev.org/#/c/695834
16:25:29 <slaweq> I just +2'ed Step 1 patch
16:25:49 <slaweq> and in step 2 we will probably need to switch centos based job to be non-voting
16:26:03 <slaweq> as code isn't compatible with py27 anymore
16:26:58 <bcafarel> yes we need centos7+py3, or centos8 (when possible)
16:26:59 <njohnston> agreed
16:27:29 <ralonsoh> (I can't run devstack with centos8)
16:27:39 <ralonsoh> some libraries are missing
16:28:18 <slaweq> maybe we should switch this job to be fedora based?
16:28:29 <slaweq> but that can be IMO done as follow up patch
16:28:32 <ralonsoh> F29 is working, not F30
16:28:45 <ralonsoh> sure, in another patch
16:29:28 <bcafarel> yes, let's wrap up tempest plugin migration first
16:30:09 <slaweq> yeah :)
16:30:27 <slaweq> so I hope mlavalle will push one more PS soon and we will be done with this finally
16:30:41 <slaweq> I hope next week we will move it out from meeting agenda
16:31:11 <bcafarel> :)
16:31:38 <slaweq> next stadium related topic is
16:31:40 <slaweq> Neutron Train - Drop py27 and standardize on zuul v3
16:31:45 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron-train-zuulv3-py27drop
16:31:54 <slaweq> njohnston: any updates on this since yesterday? :)
16:32:10 <njohnston> Nope, I thought I saw bcafarel doing something related though
16:32:32 <bcafarel> yes I sent a few phase 1 patches for review (links in etherpad)
16:32:42 <bcafarel> also I got feedback on openstack-python-jobs-neutron jobs
16:33:10 <bcafarel> these are in fact a legacy set now and should not be touched, we should move to openstack-python3-ussuri-jobs-neutron
16:34:21 <bcafarel> I updated status for some projects too (reviews merged so done for them)
16:34:32 <njohnston> thanks very much bcafarel!
16:35:15 <slaweq> bcafarel++ thx a lot
16:35:35 * njohnston sees slaweq switching between channels, always in demand!
16:36:13 <bcafarel> everybody always looking for the PTL
16:36:33 <slaweq> njohnston: yes, I'm trying
16:36:37 <slaweq> but it's hard :)
16:37:03 <slaweq> ok, I think that's all related to stadium projects for today
16:37:23 <slaweq> or maybe You have anything else You want to discuss today?
16:37:33 <slaweq> if not, lets move on to the next topic
16:38:34 <slaweq> ok, let's move on then
16:38:46 <slaweq> #topic Grafana
16:38:54 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:39:10 <slaweq> first thing which I want to mention is
16:39:22 <slaweq> that I sent today cleaning patch for grafana dashboard: https://review.opendev.org/698264
16:40:30 <slaweq> other than that, I don't see any issues on grafana
16:40:39 <slaweq> jobs look pretty good this week
16:40:59 <njohnston> I see the ironic cogating problem as well as tripleo-standalone, and midonet cogating looks like it is getting better
16:42:08 <slaweq> njohnston: yes, those I noticed too
16:42:14 <slaweq> and I forgot about them now :)
16:42:17 <slaweq> sorry
16:42:50 <njohnston> do we know what is up with tripleo-standalone?
16:42:59 <slaweq> nope
16:43:16 <slaweq> is there any volunteer to check both of those jobs?
16:44:14 <ralonsoh> can we ping yamamoto for midonet?
16:44:20 <ralonsoh> sorry for being lazy
16:44:34 <njohnston> well midonet is fixed now
16:44:41 <ralonsoh> oh yes, sorry
16:44:47 <njohnston> it's the other two that are having issues
16:44:53 <slaweq> ralonsoh: midonet is fixed by skipping one failing test
16:45:09 <ralonsoh> slaweq, give one to me
16:45:15 <ralonsoh> I'll take a look this week
16:46:13 <slaweq> ralonsoh: ok, pick whichever You want
16:46:17 <ralonsoh> ironic
16:47:04 <slaweq> ralonsoh: ok :)
16:47:08 <slaweq> so I will check tripleo
16:47:47 <slaweq> ralonsoh: please check on neutron-channel, I just spoke with TheJulia about one issue in dhcp agent on ironic job
16:47:54 <ralonsoh> sure
16:47:55 <slaweq> maybe it's the same issue (idk)
16:48:15 <slaweq> thx
16:48:30 <slaweq> #action ralonsoh to check ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa job
16:48:44 <slaweq> #action slaweq to check tripleo job
16:49:09 <slaweq> ok, let's move on then
16:49:25 <slaweq> I don't have any new issues with scenario/functional/fullstack jobs for today
16:49:28 <slaweq> which is very good
16:49:31 <slaweq> \o/
16:49:39 <slaweq> but I have one issue with periodic jobs
16:49:40 <slaweq> #topic Periodic
16:50:00 <slaweq> recently we added periodic job which runs on mariadb instead of mysql
16:50:04 <slaweq> and it is failing now:
16:50:10 <slaweq> https://b12f79f00ace923cb903-227be9d6f8442281010ef49b8394f34d.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-tempest-mariadb-full/18fecee/job-output.txt
16:50:18 <ralonsoh> +1
16:50:26 <slaweq> it seems that our new ovn related code is broken on mariadb
16:50:42 <ralonsoh> uhmmmm ok
16:50:46 <ralonsoh> I'll take a look
16:50:52 <slaweq> ralonsoh: thx
16:50:55 <ralonsoh> (I did the DB migration)
16:51:41 <slaweq> ralonsoh: maybe it's again an issue with mariadb 10.1
16:51:46 <slaweq> and on 10.4 will work fine
16:51:57 <ralonsoh> do you have a link?
16:51:59 <slaweq> as we had already with one other db migration script some time ago
16:52:05 <slaweq> ralonsoh: link to what?
16:52:19 <ralonsoh> the problem in mariadb 10.1
16:52:33 <slaweq> give me a sec
16:52:39 <ralonsoh> np, we can talk tomorrow
16:52:54 <slaweq> https://bugs.launchpad.net/kolla-ansible/+bug/1841907
16:52:54 <openstack> Launchpad bug 1841907 in neutron "Neutron bootstrap failing on Ubuntu bionic with Cannot change column 'network_id" [Critical,Confirmed]
16:53:04 <ralonsoh> upsss also mine!!
16:53:10 <slaweq> lol
16:53:14 <ralonsoh> I did this change too
16:53:26 <slaweq> You are doing many patches so some of them may break things ;)
16:53:41 <ralonsoh> the point is that in mysql and postgresql that was working
16:53:55 <slaweq> yes
16:54:10 <slaweq> that's why I proposed mariadb periodic job
16:54:20 <slaweq> as there are differences between mysql and mariadb now
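[Editor's note: the bug linked above appears to be the MariaDB 10.1 restriction that a column referenced by a foreign key cannot be altered in place, which MySQL tolerates; a hedged alembic-style sketch of the usual workaround follows, with made-up table and constraint names.]

    from alembic import op
    import sqlalchemy as sa

    def upgrade():
        # MariaDB 10.1 refuses to change a column while a FK references it,
        # so drop the constraint, alter the column, then re-create the FK.
        op.drop_constraint('example_network_id_fk', 'example_table',
                           type_='foreignkey')
        op.alter_column('example_table', 'network_id',
                        existing_type=sa.String(36), nullable=False)
        op.create_foreign_key('example_network_id_fk', 'example_table',
                              'networks', ['network_id'], ['id'],
                              ondelete='CASCADE')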
16:55:44 <slaweq> ok, so ralonsoh You will check that, right?
16:55:48 <ralonsoh> yes
16:55:50 <slaweq> thx
16:55:59 <slaweq> #action ralonsoh to check periodic mariadb job failures
16:56:04 <slaweq> ok
16:56:12 <slaweq> so that's all what I had for today
16:56:46 <slaweq> overall I think that we are now in really good shape with our CI, many patches were merged recently without dozens of rechecks
16:57:03 <slaweq> so thx for working on CI improvements guys :)
16:57:11 <njohnston> \o/
16:57:13 <bcafarel> nice!
16:57:21 <ralonsoh> fantastic
16:57:31 <slaweq> have a great evening and see You online
16:57:33 <slaweq> o/
16:57:35 <slaweq> #endmeeting