15:00:06 #startmeeting kolla
15:00:06 Meeting started Wed Aug 4 15:00:06 2021 UTC and is due to finish in 60 minutes. The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:06 The meeting name has been set to 'kolla'
15:00:14 #topic rollcall
15:00:17 \o
15:00:30 o/
15:03:20 \o/
15:04:23 #topic agenda
15:04:31 * Roll-call
15:04:33 * Agenda
15:04:35 * Announcements
15:04:37 * Review action items from the last meeting
15:04:39 * CI status
15:04:41 * Release tasks
15:04:43 ** Policy on dead projects (vide Freezer) https://bugs.launchpad.net/kolla-ansible/+bug/1901698
15:04:45 * Xena cycle planning
15:04:47 ** Xena feature prioritisation https://docs.google.com/spreadsheets/d/1BuVMwP8eLnOVJDX8f3Nb6hCrNcNpRQl57T2ENU9Xao8/edit?usp=sharing
15:04:49 * Kolla operator pain points https://etherpad.opendev.org/p/pain-point-elimination
15:04:51 * Open discussion
15:04:53 #topic announcements
15:05:00 none here
15:05:03 anyone else?
15:05:36 #topic Review action items from the last meeting
15:05:58 mgoddard email kolla-klubbers & openstack-discuss about pain points
15:06:00 done
15:06:09 no responses :)
15:06:13 ;-(
15:06:32 clearly our users are pain free
15:06:54 painKillers
15:07:19 #topic CI status
15:07:51 I noticed that the debian upgrade job is borked on master
15:08:33 #link https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b1e/793664/7/check/kolla-ansible-debian-source-upgrade/b1e0b6b/primary/logs/ansible/chrony-install
15:09:50 ah, yes, looks like ansible broke with python autodetection; come on, ansible :P
15:10:05 last passed 2021-07-01...
15:10:26 the wonder of non-voting jobs
15:11:10 oh, is that what it is? makes sense now
15:12:13 https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/upgrade.sh#L27
15:14:29 any other CI breakage to speak of?
15:15:10 not seen any ohter
15:15:16 nor other
15:15:18 nor otter
15:15:26 nor oh-errr
15:15:41 good
15:15:52 #topic Release tasks
15:16:16 R-9
15:18:22 #link https://docs.openstack.org/kolla/latest/contributor/release-management.html
15:18:40 so next week we Switch binary images to current release
15:18:52 otherwise I think we're ok
15:19:06 #topic Policy on dead projects (vide Freezer) https://bugs.launchpad.net/kolla-ansible/+bug/1901698
15:19:11 I smell a yoctozepto
15:19:35 Michal Arbet proposed openstack/kolla-ansible master: Add mariadb arbitrator to mariadb role https://review.opendev.org/c/openstack/kolla-ansible/+/780811
15:19:59 a wild yoctozepto appeared
15:20:21 Discuss / Change core / Items / Run
15:20:39 and back to the topic - yes
15:20:57 and I see yesterday we also got asked on the channel about freezer specifically
15:21:02 anyhow
15:21:15 it's certainly in my list of 'would not recommend' projects
15:21:19 the topic is general - how should we approach projects that seem dead and unusable
15:21:23 because
15:21:26 but how do we decide?
15:21:30 ( mgoddard: mine too )
15:21:33 mgoddard: precisely
15:21:40 but let me continue
15:21:43 because
15:21:53 the issue is we (as kolla) endorse the projects we support
15:22:09 it is just how the worlds seems to view that
15:22:40 and that is problematic because it just grows the frustration in users as there are parts that are simply known-not-to-work
15:22:51 as opposed to merely not-known-whether-it-works
15:23:05 Michal Arbet proposed openstack/kolla-ansible master: [CI] Test Mariadb-Arbitrator with shards in the nova cells scenario https://review.opendev.org/c/openstack/kolla-ansible/+/780970
15:23:21 I think we kind of need a policy/doc thing on that
15:23:37 with extra bells and whistles around tries to use such projects
15:23:50 though I, personally, would just drop them not to create illusions of workability
15:24:27 what are your thoughts?
15:24:33 I know it's mostly only mgoddard today
15:24:35 New variable? enable_experimental_services: True
15:24:40 but perhaps priteau can say a thing or two
15:24:42 oh, he did
15:24:43 it's a tricky one
15:24:56 And without it you can't deploy them?
15:25:10 priteau: that's a part of those bells and whistles, yeah, good idea, thanks!
15:25:23 on the one hand I don't like that we have unstable/unusable projects that add to our maintenance burden
15:25:40 Doc is important too of course, but we know it isn't read by everyone
15:26:13 on the other hand, I quite like that a project can go quiet but be rekindled by an interested party
15:27:07 well, perhaps the freezer example is just radical and we should just drop freezer and be happy with all the others
15:27:19 ansible-collection-kolla-moribund-projects
15:27:30 xD
15:27:49 what other projects come to your mind, btw?
15:27:51 be honest
15:27:55 it could quickly go stale though
15:28:10 unless we maintained it
15:28:51 freezer seems pretty dead, if it doesn't support py3.8
15:29:05 a targeted approach might be best
15:29:15 for others, maybe a long period of deprecation?
15:29:37 ok, so for freezer a mail to the mailing list that it looks bad
15:29:45 perhaps some TC look at it too
15:29:52 https://www.stackalytics.io/?module=freezer-group
15:29:53 and what "others" do you have in mind?
15:29:57 there have been some commits
15:30:45 we could just add metadata to a role if it's tested and supported, or deprecated (known not to work, no data, no known users, etc) - and build some support matrix based on that? I'm not saying we should remove those deprecated, but just deprecate them?
15:30:52 yeah, there are commits but nobody looks at bug reports
15:32:05 Isn't this the fix for your bug? https://review.opendev.org/c/openstack/freezer/+/795715
15:33:06 priteau: looks like it
15:33:14 Although, source & master images should have it?
15:33:18 the bug is not mine though :-)
15:33:25 Oh, bug comments are from 2020
15:33:29 yeah, but people are running stable
15:33:29 yeah
15:34:37 Sooner or later we will reverse support. Provide a list of known-to-work images and rest will generate 'UNSUPPORTED, YOU HAVE BEEN WARNED' message on build and deploy. plus 1 minute sleep per image
15:34:39 priteau: you could comment with that in the bug report
15:34:45 Already did
15:34:57 hrw: I suggest one hour, make them learn :D
15:35:02 priteau: great!
15:35:20 yoctozepto: 1 minute is annoying. 1 hour is 'where it is, I'll kill it'
15:35:33 would love to get rid of these:
15:35:35 aodh
15:35:37 murano
15:35:39 solum
15:35:41 trove
15:35:43 gnocchi
15:35:45 qdrouterd
15:35:47 storm
15:35:49 vitrage
15:35:51 mistral
15:35:53 ovs-dpdk
15:35:55 vmtp
15:35:57 tacker
15:35:59 watcher
15:36:01 collectd
15:36:03 sahara
15:36:05 telegraf
15:36:07 ceilometer
15:36:09 freezer
15:36:11 senlin
15:36:13 skydive
15:36:20 base
15:36:22 for the love of CI, that's one long list :-)
15:36:43 I doubt it would help our CI - few are tested!
15:36:45 I suppose there are still many deployments using ceilometer + gnocchi + aodh
15:36:50 vitrage's HA is broken, so nothing new
15:37:01 collectd, wasn't flint adding new things recently?
15:37:10 priteau: true, that's just my preference :)
15:37:11 I hope he was not
15:37:16 :D
15:37:49 so where are we now?
15:37:56 freezer fixed?
15:38:11 perhaps? at least that error is hidden now
15:38:47 it stumbles on
15:39:06 taking us back to the general question of how proactive we should be
15:39:46 I'd quite like to see a page in docs for each service
15:39:52 that priteau-proposed flag sounds cool
15:40:02 mgoddard: yeah, and warnings on bad services
15:40:08 as I think a lot of users at least get that far - search for 'kolla freezer'
15:40:20 the flag sounds interesting
15:40:52 but what are the criteria?
15:41:14 having a banner in docs that says 'we do not test this in CI' makes sense to me
15:41:18 We have a support matrix right?
15:41:31 true
15:41:40 projects with long history of weird issues and no help coming is yet another stage
15:42:29 Maybe we need another level in the matrix in addition to "C - Community maintained"
15:42:44 D - despised
15:42:53 U - unloved?
15:43:01 P - promoted
15:43:04 U - underrated
15:43:43 U - unsupported
15:43:46 FWIW, I like what cinder did around their drivers with requiring testing, and eventually removing things with no tests. Maybe you have to set: allow_crazy_untested_services: True or the config doesn't validate?
15:45:09 in Nova we found no one bothered to read the support matrix, some people noticed having to enable extra config flags to use dodgy stuff
15:46:32 is there anyone who would like to work on this?
15:46:44 I can
15:46:49 unless priteau wants to
15:46:56 E - experimental (another suggestion for the matrix)
15:47:01 anyhow, I think we agree on two things
15:47:04 or three
15:47:14 Unfortunately no time at the moment
15:47:21 now we have Terrible, Common issues, Not even try levels in support matrix
15:47:28 1) add the extra flag (starting Xena, with reno and general docs)
15:47:51 2) extend the support matrix to mark "worse" services (this needs discussing how to choose and call these really)
15:48:29 3) add docs per service with warning banner for services not tested in CI and especially for those that we know are ugly
15:48:53 4) drop the really bad ones that won't be fixed?
15:49:09 ad 4) that is long term I guess
15:49:30 anyhow, I can drive this
15:49:32 just to add another take on it - H - needs help. That was an original goal of the support matrix, to highlight where we lack skills/interest, and attract help
15:49:43 oh, good one
15:50:02 Nice one
15:50:03 although that assumes the issues are on our side
15:50:07 sor tof
15:50:12 so rtof
15:50:22 read the online form
15:50:26 anyway, I think your list makes sense yoctozepto, especially if you're picking it up :)
15:50:33 mgoddard: thanks
15:50:47 I just want to focus on these and restore https://review.opendev.org/c/openstack/kolla-ansible/+/773246 and its stack
15:50:57 shameless plug but please consider reviewing and pushing that
15:51:07 as it's rotting for no good reason
15:51:08 yeah, let's do that
15:51:24 ok, 10m left
15:51:36 #topic Xena feature prioritisation https://docs.google.com/spreadsheets/d/1BuVMwP8eLnOVJDX8f3Nb6hCrNcNpRQl57T2ENU9Xao8/edit?usp=sharing
15:52:01 We covered this last time
15:52:12 I gave it some thought and using the launchpad blueprints for Masakari now
15:52:13 How did that project-config patch go?
15:52:19 perhaps we could use it for Kolla projects too
15:52:42 got stuck
15:53:05 could do
15:53:07 mgoddard: the description for "H" could explain that it needs help in integration / testing in Kolla Ansible and/or fixing issues upstream
15:53:20 priteau: true
15:53:41 re blueprints, it could work
15:53:51 we don't really use them much today
15:53:55 yup
15:53:58 we could start
15:54:01 there are many stale ones
15:54:13 yeah, needs some form of cleanup
15:54:24 read my post on masakari for how I've done that
15:54:30 closed the obviously implemented ones
15:54:38 and reset priorities for other oldies
15:54:57 then we set priorities for our real goals
15:54:58 I think I have done that mostly
15:55:04 and handle it all from there
15:55:09 plus the gerrit tags
15:55:12 I will ping infra
15:56:47 done
15:56:51 both channels
15:56:56 so no hashtags for masakari?
15:57:32 nah, we don't have that much of change traffic
15:57:43 and only really one core that does most of the reviews
15:58:11 doh
15:58:15 yup
15:58:50 #topic Kolla operator pain points https://etherpad.opendev.org/p/pain-point-elimination
15:59:00 I think I misspoke earlier about this
15:59:17 while I received no direct responses, we do have 4 pain points
15:59:28 or 3, if you can count
15:59:37 (clarkb) Stopping docker and its containers to perform rolling reboots for operating system patches can result in bad rabbitmq state preventing services like cinder-scheduler from reconnecting successfully afterwards. This was addressed by completely rebuilding the rabbitmq cluster with a blank state. Not sure if there is a better rolling reboot process or maybe improvements can be made
15:59:39 to rabbitmq in kolla to make it more resilient? This was with a victoria kolla deployment.
15:59:41 (Fl1nt): Kolla-ansible lack a bit of an offline mode for companies that are using it in a really source restricted environment that doesn't get access to the internet at all but get access to internal mirrors and cloned repositories. I'm already working on this issue right now and few patches have already been proposed and merged back, few remaining patches are coming.
15:59:43 Kolla-ansible lack some self-recovery mechanisms for offline environments. For example, mariadb cannot recover automatically after the cluster is restarted, and human intervention is highly likely to be required. It is hoped that some self-recovery solutions can be introduced to improve the recoverability of the cluster.
16:00:47 :-(
16:00:49 sad stories
16:00:53 some food for though
16:01:01 time's up y'all
16:01:05 thanks
16:01:07 #endmeeting
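The first two actions agreed under the dead-projects topic (an opt-in flag for services of doubtful health, plus extra support-matrix levels) amount to a validation step at deploy time. What follows is a minimal Python sketch of that idea. It assumes per-service metadata of the kind suggested in the meeting and is not Kolla Ansible code. Only the flag name `enable_experimental_services` comes from the discussion. `ServiceInfo`, `validate_services`, the level letters used and the service names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical support-matrix letters, loosely based on the meeting's suggestions:
# "C" (community maintained) deploys by default; anything else, e.g. "H" (needs
# help) or "U" (unsupported), is gated behind the opt-in flag.
DEFAULT_ALLOWED_LEVELS = frozenset({"C"})


@dataclass(frozen=True)
class ServiceInfo:
    """Hypothetical per-role metadata, not an existing Kolla Ansible structure."""

    name: str
    level: str          # support-matrix letter
    tested_in_ci: bool  # drives the "not tested in CI" warning


def validate_services(enabled, catalogue, enable_experimental_services=False):
    """Check enabled services against the hypothetical support matrix.

    Raises ValueError if a service outside DEFAULT_ALLOWED_LEVELS is enabled
    without the opt-in flag; otherwise returns warning strings to display.
    """
    gated = [catalogue[n] for n in enabled
             if catalogue[n].level not in DEFAULT_ALLOWED_LEVELS]
    if gated and not enable_experimental_services:
        names = ", ".join(s.name for s in gated)
        raise ValueError(
            f"{names}: outside the default support level; "
            "set enable_experimental_services to true to deploy anyway"
        )
    warnings = [f"WARNING: {s.name} has support level {s.level}" for s in gated]
    warnings += [f"WARNING: {n} is not tested in CI"
                 for n in enabled if not catalogue[n].tested_in_ci]
    return warnings


if __name__ == "__main__":
    # Placeholder service names; the levels are illustrative, not real data.
    catalogue = {
        "service-a": ServiceInfo("service-a", "C", tested_in_ci=True),
        "service-b": ServiceInfo("service-b", "U", tested_in_ci=False),
    }
    print(validate_services(["service-a"], catalogue))
    # prints []
    print(validate_services(["service-a", "service-b"], catalogue,
                            enable_experimental_services=True))
    # prints ['WARNING: service-b has support level U', 'WARNING: service-b is not tested in CI']
```

The sketch makes the missing opt-in a hard failure rather than a warning. That reflects the point made in the meeting that users tend to notice an extra config flag, while the support matrix and the docs often go unread.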