15:00:16 <mgoddard> #startmeeting kolla
15:00:20 <openstack> Meeting started Wed Jan 27 15:00:16 2021 UTC and is due to finish in 60 minutes.  The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:23 <openstack> The meeting name has been set to 'kolla'
15:00:26 <mgoddard> #topic rollcall
15:00:28 <mgoddard> \o
15:00:34 <wuchunyang> \o
15:00:46 <osmanlicilegi> o/
15:00:55 <hrw> [o]
15:03:09 <yoctozepto> /o\
15:03:48 <mnasiadka> o/
15:04:49 <mgoddard> #topic agenda
15:04:57 <mgoddard> * Roll-call
15:04:59 <mgoddard> * Announcements
15:05:01 <mgoddard> * Review action items from the last meeting
15:05:03 <mgoddard> * CI status
15:05:04 <mgoddard> * Cleaning up disabled components and services (e.g. monasca/storm) https://review.opendev.org/c/openstack/kolla-ansible/+/769900
15:05:07 <mgoddard> * Allowed to fail images - next steps https://review.opendev.org/c/openstack/kolla/+/765807
15:05:09 <mgoddard> * Wallaby release planning
15:05:11 <mgoddard> #topic announcements
15:05:18 <mgoddard> I have none. Anyone else?
15:06:01 <mgoddard> #topic Review action items from the last meeting
15:06:05 <mgoddard> There were none
15:06:10 <mgoddard> #topic CI status
15:06:53 <mgoddard> How are we looking?
15:07:05 <mgoddard> We've had some issues with train, let's start there
15:07:08 <yoctozepto> good
15:07:34 <yoctozepto> everything works
15:08:27 <mgoddard> great. Thanks to those involved in rescuing it
15:08:35 <mgoddard> Stein has some issues
15:08:38 <mgoddard> https://review.opendev.org/c/openstack/kolla-ansible/+/772501
15:08:44 <mgoddard> https://review.opendev.org/c/openstack/kolla/+/772490
15:08:51 <mgoddard> Those two should address them
15:09:54 <mgoddard> optional: propose CI Slim Down in K-A (T-W)
15:09:57 <mgoddard> remove l-c jobs in K-A U, V & W (as agreed; already dropped in older)
15:10:11 <mgoddard> let's either do those or remove them from the board
15:10:16 <yoctozepto> do these
15:11:10 <mgoddard> I guess I'm undecided on the slim down
15:12:14 <mgoddard> has anyone actually run 'check experimental' in kolla since we removed the jobs?
15:12:26 <yoctozepto> I
15:12:27 <yoctozepto> once
15:12:40 <yoctozepto> I'd prefer we make CI work better
15:12:46 <yoctozepto> but we can't easily
15:12:58 <mgoddard> work better how?
15:14:56 <mnasiadka> I honestly don't think anybody will use check experimental
15:15:20 <mnasiadka> So, if we could have a list of problematic jobs, and vote on what to do with them - would probably be better
15:15:36 <mgoddard> most people don't know about it, and those who do are likely to forget
15:15:57 <mgoddard> it seems like a reasonable trade-off for kolla, but not so sure about k-a
15:16:24 <mgoddard> maybe we should have only one of each scenario
15:16:34 <mgoddard> and put the other in experimental
15:16:35 <yoctozepto> mgoddard: as in dockerhub not failing
15:16:43 <yoctozepto> and no disk_full
15:16:46 <yoctozepto> and no timeout
15:16:56 <yoctozepto> and no multiple retries
15:17:11 <yoctozepto> we got ourselves a ton of jobs
15:17:14 <yoctozepto> I liked adding them
15:17:21 <yoctozepto> but I don't like waiting for them nowadays :D
15:17:33 * yoctozepto practicing self-blame
15:17:53 <mnasiadka> well, I prefer to wait for them, instead of writing check experimental and waiting for them to fail :)
15:18:08 <mgoddard> +1
15:18:31 <mgoddard> we do have various things that make CI unreliable, but it does seem workable at the moment
15:19:21 <mnasiadka> I would rather keep a list of jobs that need to be looked into, if there's no interest - we can decide to remove them, or just disable them.
15:19:48 <mgoddard> or combine them
15:19:52 <yoctozepto> +1
15:20:05 <yoctozepto> let's not overwork ourselves for the moment
15:20:12 <yoctozepto> remove the point on slim down
15:20:20 <mgoddard> ok
15:20:22 <yoctozepto> and we will reconsider our options when it gets worse
15:20:31 <yoctozepto> kolla has it already
15:20:42 <yoctozepto> it sounds like a reasonable tradeoff
15:20:50 <mnasiadka> I know the ceph jobs are failing often, but I plan to move to use cephadm, and then investigate failures if they happen.
15:21:26 <yoctozepto> yeah, cephadm should help a bit; ceph-ansible is quite quirky
15:21:27 <mgoddard> are there legit ceph failures we should look into?
15:21:48 <yoctozepto> or perhaps we should just spin up ceph manually for our simple scenario
15:21:55 <yoctozepto> it's a no-brainer xD
15:22:10 <yoctozepto> mgoddard: I don't think so
15:22:20 <yoctozepto> it just feels worse than before
15:22:21 <mgoddard> ok
15:22:26 <mgoddard> let's move on, tired of CI :)
15:22:27 <yoctozepto> perhaps just due to multinode being worse
15:22:29 <yoctozepto> indeed!
15:22:38 <mgoddard> #topic Cleaning up disabled components and services (e.g. monasca/storm) https://review.opendev.org/c/openstack/kolla-ansible/+/769900
15:22:50 * yoctozepto to deprecate everything
15:23:09 <mgoddard> This came up while reviewing one of dougsz's patches
15:23:14 <mgoddard> https://review.opendev.org/c/openstack/kolla-ansible/+/769900
15:23:17 <hrw> monasca goes out?
15:23:20 <mgoddard> no
15:23:33 <mgoddard> but it's going on a diet :)
15:23:49 <mgoddard> some things will be removed
15:24:09 <mgoddard> and we'll add a flag to allow disabling the alerting pipeline
15:24:12 <mgoddard> https://review.opendev.org/c/openstack/kolla-ansible/+/769902/
15:24:17 <mgoddard> so...
15:24:34 <mgoddard> there are two cases here
15:24:52 <mgoddard> 1. a container is no longer required and should be cleaned up
15:25:16 <mgoddard> 2. a container is disabled based on some feature flag, and should be cleaned up
15:25:18 <yoctozepto> ah, this general thing
15:25:38 <mgoddard> 1. I think should be done on upgrade
15:25:46 <yoctozepto> we should address both at once
15:26:22 <mgoddard> 2. is less clear. I don't really want a bunch of code that cleans up on every deploy
15:26:23 <yoctozepto> perhaps treat being disabled as removal?
15:26:30 <mgoddard> it will be messy and slow
15:26:33 <yoctozepto> and have a stub for those removed completely due to new release
15:26:50 <yoctozepto> hmm
15:27:06 <mgoddard> there is also a 3. which I forget
15:27:21 <mgoddard> 3. a service (enable_X) is disabled and should be cleaned up
15:27:53 <mgoddard> 3 is tricky because the site.yml playbook doesn't execute plays for disabled services
15:28:09 <mnasiadka> we need undeploy.yml in each role and a separate command to clean up?
15:28:19 <mgoddard> that is my preference
15:28:42 <mnasiadka> that sounds simple and doable in a short time
15:28:54 <mgoddard> right
15:29:07 <mgoddard> however, I think dougsz didn't want to overload his patch
15:29:16 <mgoddard> so we came to a compromise
15:29:38 <mgoddard> he will implement a specific command to clean up the monasca alerting pipeline, which can be adapted into a general cleanup command
15:29:53 <mgoddard> e.g. kolla-ansible monasca-disable-alerting
15:30:01 <mgoddard> does that seem reasonable?
15:30:15 <wuchunyang> what about kolla-ansible destroy --tags service ?
15:30:25 <mgoddard> (dougsz is busy right now so can't join in)
15:30:55 <mgoddard> wuchunyang: I think destroy is subtly different
15:31:16 <mgoddard> destroy will clean up everything, regardless of enable flags, etc.
15:31:30 <mgoddard> cleanup should only destroy things that have been disabled
15:32:22 <mnasiadka> and it's still questionable whether it should delete database entries and so on
15:32:33 <mgoddard> is this making sense?
15:32:42 <mgoddard> hmm, hadn't considered DB
15:32:51 <mgoddard> that would only apply if the whole service is disabled
15:33:02 <mgoddard> probably it will grow some flags over time
15:33:21 <yoctozepto> it's tricky and sounds like a good topic for the next ptg
15:33:25 <mgoddard> --clean-db --clean-keystone --clean-mq
15:33:36 <yoctozepto> as the need grows :D
15:33:57 <mnasiadka> yeah, but the initial approach with Monasca sounds fine, we need to start somewhere
15:33:58 <yoctozepto> --yes-i-really-really-want-it-clean
15:33:59 <wuchunyang> sounds good
15:34:01 <mgoddard> but a simple deletion of containers and volumes would be a good start
15:34:08 <yoctozepto> yes
15:34:21 <mnasiadka> just need to make sure we adapt reno if we change the command name :)
15:34:26 <mnasiadka> (over time)
15:34:41 <mgoddard> I think it would be a new command
15:34:54 <wuchunyang> + 1
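For reference, a minimal sketch of how the commands discussed above might relate. Only destroy (and its confirmation flag) and the proposed monasca-disable-alerting come from the discussion; the cleanup-services command name and its --clean-* flags are hypothetical, echoing the flags floated earlier.

    # Existing behaviour: destroy removes everything, regardless of enable_* flags.
    kolla-ansible destroy --yes-i-really-really-mean-it

    # Proposed first step: a service-specific command that removes the disabled
    # Monasca alerting pipeline (containers and volumes only).
    kolla-ansible monasca-disable-alerting

    # Possible later generalisation (hypothetical name and flags): remove only the
    # containers and volumes of services that have been disabled since deployment,
    # with opt-in flags for deeper cleanup.
    kolla-ansible cleanup-services --clean-db --clean-keystone --clean-mq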
15:35:37 <mgoddard> any other thoughts or shall we progress?
15:36:13 <mgoddard> #topic Allowed to fail images - next steps https://review.opendev.org/c/openstack/kolla/+/765807
15:36:16 <mgoddard> ping hrw
15:36:30 <mgoddard> #link https://review.opendev.org/c/openstack/kolla/+/765807
15:36:46 <hrw> pong
15:36:48 <mgoddard> we had some discussion about this in IRC earlier
15:37:12 <mgoddard> that patch adds an option to kolla to mark some images as being allowed to fail
15:37:29 <mgoddard> the next step is to make use of this in our CI
15:37:45 <mgoddard> some key requirements for me:
15:38:15 <mgoddard> * a change that breaks an image must fail CI
15:38:42 <mgoddard> * ideally we have some visibility if an external change breaks an image
15:39:14 * hrw waits for end of list
15:39:23 <mgoddard> </end>
15:39:26 <hrw> ok.
15:39:47 <hrw> 1. 'change that breaks an image must fail CI' == what we have by default
15:40:02 <hrw> allowed-to-fail does not change anything here
15:40:54 <hrw> this feature is more for 'omg, xyz image failed again and the only core who understands it is on holiday. let us ignore it for a week with this patch'
15:41:14 <mgoddard> ok
15:41:26 <hrw> otherwise we would switch to profile builds, which can list 'those images we build and which cannot fail'
15:41:46 <mgoddard> I think if we keep the list empty by default it will work
15:41:47 <hrw> + a non-voting job building everything to check whether there are images which fail
15:41:55 <hrw> that's the plan
15:42:03 <hrw> (empty list)
15:42:06 <mgoddard> ok
15:42:23 <hrw> 'ideally we have some visibility if an external change breaks an image' == normal CI too
15:43:06 <hrw> allowed-to-fail is kind of 'one step' before moving to UNBUILDABLE_IMAGES
15:43:07 <mgoddard> the original goal was for tiers of images, so I imagined having a static list of tier 3 images
15:43:34 <mgoddard> in which case they could have started failing silently
15:43:51 <mgoddard> default empty list works for me
15:44:10 <mgoddard> so perhaps you are right in saying that the feature is complete :)
15:44:13 <mgoddard> anyone disagree?
15:44:51 <mgoddard> great
15:44:53 <hrw> we have 3 levels for images: BUILD, allowed to fail, UNBUILDABLE
15:45:16 <hrw> with middle one being for emergencies
15:45:25 <mgoddard> hrw: could you add a few sentences on that to the contributor docs?
15:45:28 <yoctozepto> no disagreement
15:45:44 <hrw> mgoddard: #action please. will handle
15:46:22 <mgoddard> #action hrw to document allowed to fail & unbuildable images in contributor docs
15:46:42 <mgoddard> good job - nice and simple, should be useful
15:46:44 <hrw> and 'allowed to fail' is basically CI stuff which can be used by anyone
15:47:14 <mgoddard> PTL just too stoopid to understand it :D
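To put the three tiers hrw describes side by side (a sketch: kolla-build and UNBUILDABLE_IMAGES are existing kolla features, but the exact option added by https://review.opendev.org/c/openstack/kolla/+/765807 is not quoted in the meeting, so the allowed-to-fail flag below is an assumption):

    # Tier 1 (default): the image must build; any failure fails the job.
    kolla-build keystone

    # Tier 2 (emergencies): the image is still built, but is listed as allowed to
    # fail so a known breakage does not block unrelated changes for a while.
    # Hypothetical invocation - see the patch above for the real option:
    kolla-build --allowed-to-fail monasca-thresh

    # Tier 3: the image is listed in UNBUILDABLE_IMAGES in the kolla source and is
    # skipped entirely for the affected distro/arch combination.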
15:47:32 <mgoddard> #topic Wallaby release planning
15:47:44 <hrw> mgoddard: after adding UNBUILDABLE I have a feeling that I understand all those lists. and that's scary
15:48:00 <hrw> ~3 months to release?
15:48:08 <mgoddard> I don't think we've talked dates yet
15:48:21 <hrw> April is usual date ;D
15:48:24 <mgoddard> I'll add a section to the whiteboard
15:48:52 <yoctozepto> ++
15:49:02 <mgoddard> #link https://releases.openstack.org/wallaby/schedule.html
15:49:18 <hrw> I think that infra stuff would be nice. I may take a look and rewrite the patches adding functionality, and skip the CI changes
15:49:51 <mgoddard> Apr 12 - Apr 16
15:50:15 <mgoddard> ^ is OpenStack Wallaby GA
15:50:59 <hrw> so 1st April is the day when we should be ready with features, then freeze, fix issues and release on 1st May
15:51:03 <mgoddard> Kolla feature freeze: Mar 29 - Apr 02
15:51:06 <mgoddard> Kolla RC1: Apr 12 - Apr 16
15:51:39 <mgoddard> CentOS 8 stream is a risk
15:51:58 <mgoddard> RDO is targeting stream for wallaby
15:52:55 <yoctozepto> it will freeze our blood
15:52:56 <yoctozepto> :-)
15:52:56 <mgoddard> just to reiterate - there are around 2 months until feature freeze
15:53:10 <mgoddard> now is the time to get working on features
15:53:11 <yoctozepto> only sad reactions!
15:53:12 <hrw> if RDO goes stream then so do we.
15:53:14 <mgoddard> (and reviewing)
15:53:20 <yoctozepto> :D
15:53:58 <hrw> which may end up with centos:8 -> stream or ubi:8 as the base image
15:54:02 <mgoddard> 7 minutes to go. Shall we look at the release priorities?
15:55:01 <hrw> go, quickly
15:55:06 <mgoddard> this one is ever present, and never done:
15:55:09 <mgoddard> High level documentation, eg. examples of networking config, diagrams, justification of use of containers, not k8s etc
15:55:27 <mgoddard> but if anyone has time for docs, they will be appreciated
15:55:37 <mgoddard> Ability to run CI jobs locally (without Zuul, but possibly with Ansible)
15:55:41 <mgoddard> yoctozepto: any progress?
15:56:31 <yoctozepto> mgoddard: none
15:56:49 <mgoddard> ok
15:57:12 <mgoddard> Switch to CentOS 8 stream
15:57:17 <mgoddard> Anyone want to pick this up?
15:58:00 <hrw> I may if no one wants
15:58:14 <mgoddard> ok
15:58:20 <mgoddard> Infra images
15:58:33 <mgoddard> hrw: you mentioned adding features but skipping CI
15:58:43 <mgoddard> AFAIR, CI was the blocking part
15:58:54 <mgoddard> the CentOS combined job fails
15:59:14 <mgoddard> so we either need to fix that, or implement the multi-job pipeline
16:00:02 <hrw> I want to rewrite the patches to have the functionality up for review while waiting for someone with ideas for the CI changes
16:00:35 <hrw> it is quite old patchset so I first have to remind myself how it works
16:01:14 <mgoddard> I do wonder if the combined job could be made to work with a bit of debugging
16:01:39 <mgoddard> it could be a disk issue, try removing images after each build
16:02:20 <mgoddard> anyway, we are over time
16:02:23 <mgoddard> thanks all
16:02:24 <hrw> the good part is that CI changes can be done separately
16:02:34 <mgoddard> #endmeeting