15:00:17 <gmann> #startmeeting tc
15:00:17 <opendevmeet> Meeting started Thu Jul  8 15:00:17 2021 UTC and is due to finish in 60 minutes.  The chair is gmann. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:17 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:17 <opendevmeet> The meeting name has been set to 'tc'
15:00:21 <gmann> #topic Roll call
15:00:24 <diablo_rojo_phone> o/
15:00:24 <gmann> o/
15:00:24 <jungleboyj> o/
15:00:24 <ricolin> o/
15:00:32 <belmoreira> o/
15:00:36 <dansmith> o/
15:00:39 <clarkb> hello
15:00:48 <gmann> clarkb: hi
15:00:57 <gmann> yoctozepto is on PTO so would not be able to join today meeting
15:01:13 <dansmith> did we approve that time off?
15:01:27 <gmann> :)
15:01:33 <fungi> i told him it was okay
15:01:34 <jungleboyj> :-)
15:01:49 <gmann> let's start
15:01:52 <gmann> #topic Follow up on past action items
15:02:02 <gmann> gmann to remove Governance non-active repos cleanup topic from agenda
15:02:03 <gmann> done
15:02:14 <gmann> gmann to remove election assignments topic form agenda
15:02:22 <gmann> this too
15:02:32 <gmann> ricolin to ask for collecting the ops pain points on openstack-discuss ML
15:02:40 <gmann> ricolin: any update on this
15:03:13 <ricolin> already added it on community-goals backlogs and y-cycle pre-selected,but not yet send ML out
15:03:36 <gmann> +1. i think that is good
15:03:41 <spotz> o/
15:03:47 <ricolin> will send it out this week
15:03:49 <ricolin> on ML
15:03:59 <gmann> ok, thanks
15:04:04 <gmann> gmann to propose the RBAC goal
15:04:15 <gmann> I proposed that #link https://review.opendev.org/c/openstack/governance/+/799705
15:04:19 <gmann> please review
15:04:40 <gmann> #topic Gate health check (dansmith/yoctozepto)
15:05:02 <gmann> dansmith: any news
15:05:07 <dansmith> I really have nothing to report, but mostly because I've been too busy with other stuff to be submitting many patches in the last week or so
15:05:26 <gmann> ok
15:05:41 <gmann> one thing to share is about the log warning especialyl from oslo policy
15:05:56 <fungi> we've had a bit of job configuration upheaval from the zuul 4.6.0 security release
15:06:24 <gmann> melwitt clarkb: pointed out that in infra channel and many projects have such ot of warning due to policy rule
15:06:37 <gmann> I am fixing those in #link https://review.opendev.org/q/topic:%22fix-oslo-policy-warnings%22+(status:open%20OR%20status:merged)
15:06:53 <fungi> had to make non-backward-compatible changes to how some kinds of variables are accessed, particularly with regards to secrets, so that's been disrupting some post/promote jobs (should be under control now), as well as made some projects' over all zuul configuration insta-buggy causing some of their jobs to not run
15:07:27 <fungi> i think kolla was hardest hit by that
15:07:48 <gmann> ok
15:08:08 <gmann> fungi: any effected project without ack or need helpon this ?
15:08:35 <gmann> I saw on ML about few project ack that and working on
15:08:36 <clarkb> gmann: it might be good to update those warnings to only fire once per process
15:08:40 <fungi> i haven't checked in the past few days, but click the bell icon at the top-right of the zuul status page for a list of some which may need help
15:08:51 <clarkb> I can't imagine those warnings helps operators any more than they help CI
15:09:37 <gmann> fungi: ok, thanks for update. let us know if any project did not notice or need help
15:10:19 <gmann> back to policy rule warning
15:10:27 <gmann> clarkb: yes, that seems very noisy now
15:11:01 <gmann> when we added it initially we thought it would help operator to move to new rbac but as in new rbac work every policy rule changed rhe default so warning
15:11:09 <gmann> whihc seems does not help much
15:11:47 <gmann> One approch I sent on ML about disableing those by default with make it configurable so that operator can enable those to see what all they need to update
15:11:54 <gmann> #link http://lists.openstack.org/pipermail/openstack-discuss/2021-July/023484.html
15:12:11 <gmann> and this is patch #link https://review.opendev.org/c/openstack/oslo.policy/+/799539
15:12:39 <gmann> feel free to respond to ML or gerrit about your opinon
15:13:42 <gmann> anything else to discuss related to gate health?
15:14:12 <gmann> #topic Migration from 'Freenode' to 'OFTC' (gmann)
15:14:15 <gmann> #link https://etherpad.opendev.org/p/openstack-irc-migration-to-oftc
15:14:42 <gmann> I started pushing the patches for remaining projects #link https://review.opendev.org/q/topic:%22oftc%22+(status:open%20OR%20status:merged)
15:14:48 <gmann> few are still left
15:15:12 <gmann> nothing else to share on this
15:15:17 <fungi> today we landed an update to the opendev infra manual as well, so if you refer anyone there it should now properly reference oftc and not freenode
15:15:29 <gmann> +1
15:16:19 <gmann> #topic Xena Tracker
15:16:21 <spotz> +1
15:16:27 <gmann> #link https://etherpad.opendev.org/p/tc-xena-tracker
15:17:00 <gmann> I think we can close 'election promotion' now as we have three new election official
15:17:15 <gmann> spotz: belmoreira diablo_rojo_phone ? what you say?
15:17:23 <gmann> L63 in etherpad
15:17:29 <fungi> i'm very excited by that, and happy to answer questions anyone has
15:17:29 <jungleboyj> \o/
15:17:33 <spotz> Yeah and we now have a name for that patch
15:18:13 <gmann> and email  opt in process or solution can be discussed by yoou guys at election channel
15:18:20 <belmoreira> lgtm
15:18:21 <gmann> thanks again for volunteering
15:19:41 <gmann> Charter revision also done so marked as completed
15:19:49 <diablo_rojo_phone> Yes we can close it.
15:20:08 <gmann> any other update on Xena tracker?
15:20:28 <gmann> jungleboyj: mnaser any update you want to share for 'stable policy process change' ?
15:21:41 <jungleboyj> No, didn't get to that with the holiday week.
15:21:53 <gmann> ok
15:22:25 <gmann> we have 8 items in etherpad to finish in Xena, let's start working on those which should not take much time
15:22:48 <gmann> moving next..
15:22:52 <gmann> #topic ELK services plan and help status
15:23:04 <gmann> first is Board meeting updates
15:23:32 <gmann> I presented this slide in 30th June Board meeting #link https://docs.google.com/presentation/u/1/d/1ugdwMI2ZM2L8z1sobzHJwDpbvlyWKH02PH7Fi4tkyVc/edit#slide=id.ge1bdf71dac_0_0
15:23:50 <gmann> I was expected some actional item from Board but that did not happen.
15:24:44 <gmann> Board ack this help-needed and stated to broadcast it in the organization/local community etc
15:25:11 <gmann> that I think we everyone are doing since 2018 when we re-defined the upstream investment opportunity
15:26:06 <gmann> honestly saying I am not so happy with the no-actionable item from that meeting
15:26:21 <gmann> and do not know how we can get help here ?
15:26:45 <spotz> I took it as folks were going back to their own companies
15:26:53 <spotz> It was a bit late for me though
15:26:55 <gmann> yeah theor own company also
15:27:28 <gmann> butn that is no different step from what we all including Board are trying since 2018
15:27:30 <fungi> not to apologize for them, but i don't expect the board members come to those meetings expecting to make commitments on the behalf of their employers, and probably don't control the budget that assistance would be provided out of in most cases (they're often in entirely separate business units), so they have to lobby internally for that sort of thing
15:28:02 <jungleboyj> fungi:  True.
15:28:18 <fungi> i'm more disappointed by the years of inaction than in their inability to make any immediate promises
15:28:27 <gmann> few of the suggestion are listed in the slide#5 #link https://docs.google.com/presentation/d/1ugdwMI2ZM2L8z1sobzHJwDpbvlyWKH02PH7Fi4tkyVc/edit#slide=id.ge1bdf71dac_0_24
15:28:57 <gmann> that was my expecttion and hope. I know those are not easy but in current situation we need such support
15:30:47 <gmann> anyways that is update from Board meeting. moving next..
15:30:56 <gmann> Creating a timeline for shutting the service down if help isn't found
15:31:09 <gmann> clarkb please go ahead
15:31:31 <clarkb> This is mostly a request that we start thinkign about what the timeline looks like if we don't end up with help to update the system or host it somewhere else
15:32:01 <clarkb> I'm not currently in a rush to shut it down, but there is a risk that external circumstances could force that to be done (security concerns or similar)
15:32:19 <clarkb> However, I think it would be good to have some agreement on what not a rush means :)
15:32:21 <jungleboyj> :-(
15:32:45 <clarkb> part of the reason this came up was after a week or two it was noticed that he cluster had completely crashed and I had to go resurrect it
15:33:02 <clarkb> I don't want to do that indefinitely if there isn't proper care and feeding happening
15:33:51 <clarkb> There are also a few problems with indexing currently including the massive log files generated by unittests due to warnings and for some reason logsatsh is emitting events for centuries in the future which floods the elasticsaerch cluster with indexes for the future
15:34:18 <clarkb> I think the massive log files led to the cluster crashing. The future events problem is more annoying than anything else
15:34:43 <gmann> yeah, we should start fixing those warning.may be we can ask all projects on ML. I can fix for oslo policy but do nto have bandwidth to fix other
15:35:00 <gmann> back to shutdown thing
15:35:09 <gmann> so if we shutdown, bug question is how we are going to debug the failure or how much it will add extra load on gate in term of doing recheck ..
15:35:28 <clarkb> gmann: to be fair I think most people just recheck anyway and don't do debugging
15:35:36 <gmann> yeah but not all
15:35:51 <gmann> after shutdown there will be many recheck we had to do
15:35:58 <clarkb> where elastic-recheck has been particularly useful is when you have an sdague, jogo, mtreinish, melwitt, or dansmith digging into braoder failures and trying to address them
15:36:14 <dansmith> yeah, I try to shame people that just blindly recheck,
15:36:17 <dansmith> but it's a bit of a losing battle
15:36:27 <dansmith> still, removing the *ability* to do real checking sucks :/
15:36:32 <clarkb> I suspect the biggest impact will not be recheck problems but the once a cycle or so fix a very unstable gate
15:36:54 <dansmith> ...or a more continuously unstable gate
15:36:58 <gmann> yeah
15:36:58 <clarkb> ya
15:37:26 <gmann> which will directly impact our release
15:37:32 <gmann> or feature implementation
15:37:33 <fungi> though it sounds like the entire cluster was broken for a couple weeks there before anyone noticed it wasn't returning results to their queries
15:37:43 <clarkb> I think that also is part of why it has been so hard to find help for this. When it is a tool you use every 6 months it is less in your mind continuously for care and feeding
15:37:52 <clarkb> fungi: yes, but no one notices if the gate is stable
15:37:56 <dansmith> yeah
15:38:02 <clarkb> which is a big underlying issue here imo
15:38:30 <clarkb> people do notice when there are systemic problems in the gate that need addressing
15:39:06 <clarkb> another reason to have a rough timeline is it may help light a fire under people willing to help
15:39:26 <clarkb> when I brought this up last week gmann suggested the end of the Yoga cycle as a potential deadline
15:39:35 <dansmith> yeah, "no rush" is not as motivating
15:40:01 <gmann> yeah, I am thinking end of Yoga will be like more than 6 month we called it as last critical call for help
15:40:13 <clarkb> That ensures that Xena (hopefully) doesn't have any major changes to the stabilzation process. Then in Yoga we can start planning for replacement/shutdown/etc (though that can start earlier too)
15:40:23 <gmann> so if there is anyone want to help, should be raising hand by than
15:40:34 <clarkb> That timeline seems reasonable to me
15:41:23 <gmann> any objection on above deadline ?
15:41:32 <fungi> "no rush" has also been tempered by "but might be tomorrow, depending on outside factors"
15:41:45 <clarkb> fungi: yes and I think that is still the message from me
15:41:53 <dansmith> I'm not happy about the timeline, but accept the need
15:42:09 <dansmith> "happy with, not happy about" you might say :)
15:42:10 <gmann> dansmith: means? it too late or early ?
15:42:11 <clarkb> if we notice abuse of elasticsearch or lgostash that requires upgrades to address we'll be in a situation where we don't have much choice
15:42:13 <jungleboyj> I think that sounds like a reasonable timeline ... even though we don't want one.
15:42:54 <dansmith> gmann: it's me being intentionally vague. I'm good with it, just not happy about it.. necessary, but I worry about the inevitable end where nobody has actually stepped up
15:42:55 <gmann> clarkb: yeah for outside factors we would not be able to do anything and shutdown early ?
15:43:06 <clarkb> gmann: correct
15:43:10 <gmann> k
15:43:34 <gmann> dansmith: correct. my last hope was Board on paid resource but anyways that did not happen
15:43:52 <clarkb> Another concern is the sheer size of the system. I've temporarily shut down 50% of the indexing pipeline and have been monitoring our indexing queue https://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=17&orgId=1&from=now-24h&to=now
15:44:37 <clarkb> compared to elasticsearch the logstash workers aren't huge but it is still something. I think I may turn on 10% again and leave it at 40% shutdown for another week then turn off the extra servers if that looks stable.
15:44:44 <clarkb> currently we seem to be just barely keeping up with demand
15:44:48 <fungi> yeah, that's just half the indexing workers, not half the system
15:45:00 <clarkb> (and then having some headroom for feature freeze is a good idea hence only reducing by 40% total)
15:45:18 <gmann> how about keeping only check pipeline logs ?
15:45:36 <clarkb> gmann: I would probably to the opposite and only keep gate
15:45:40 <clarkb> check is too noisy
15:45:48 <clarkb> people push a lot of broken into check :)
15:45:55 <jungleboyj> clarkb:  That makes sense to me.
15:46:03 <gmann> clarkb: yeah but in check we do most of debugging and make it more stable till gate
15:46:06 <fungi> the check pipeline results are full of noise failures from bad changes, while the gate pipeline should in theory be things which at least got through check and code review to approval
15:46:12 <clarkb> but that is another option and reducing the total amount of logs indexed would potentially allows us to remove an elasticsearch server or two (since the major factor there is total storage size)
15:46:57 <clarkb> gmann: yes, but it is very hard to see anything useful in check because you can't really tell if things are just broken because someone didn't run tox locally or if they are really broken
15:47:11 <gmann> yeah
15:47:17 <clarkb> it is still useful to have check, often you want to go and see where something may have been introduced and you can trace that back to check
15:47:27 <clarkb> but if we start trimming logs check is what I would drop first
15:47:58 <clarkb> as far as elasticsaerch disk consumption goes we should have a pretty good idnicate of current db size for 7 days of indexes at the beginning of next week
15:48:06 <clarkb> the data is currently a bit off since we had the cluster crash recently
15:48:45 <clarkb> that info is available in our cacti instance if you want to see what usage looks like. We have 6TB storage available but 5TB useable beacuse we need to be tolerate to losing one server and its 1TB of disk
15:49:16 <clarkb> If we want t ostart pruning logs out then maybe we start that conversation next week when we have a good baseline of data to look at first
15:49:34 <gmann> or truncate the log storage time? to 2-3 days
15:49:42 <clarkb> yes that is another option
15:50:11 <fungi> though that doesn't give you much history to be able to identify when a particular failure started
15:50:22 <fungi> a week is already fairly short in that regard
15:50:32 <clarkb> yup, but may be enough to identify the source of problems and then work backward in code
15:50:34 <gmann> yeah, we anyways going to loose that
15:50:47 <clarkb> as well as track what issues are still occuring
15:50:56 <gmann> yes
15:51:16 <clarkb> anyway I think discussion for pruning elasticsearch size is better next week when we have better data to look at. I'm happy to help collect some of that info together and discuss it further next week if we like
15:51:26 <fungi> i wonder if we could change the indexing threshold to >info instead of >debug
15:51:44 <clarkb> (this is about all I had on this agenda item. I'll go ahead and make note of the Yoga daedline on the mailing list in a response to the thread I started a while back if I can find it)
15:51:45 <gmann> clarkb: +1 that will be even better
15:52:05 <clarkb> fungi: the issue with that is a good chunk of logs are the job-output.txt files now with no log level
15:52:11 <gmann> clarkb: +1  and thanks for publishing deadline on ML
15:52:11 <clarkb> fungi: this is why the warnings hurt so much
15:52:14 <fungi> ahh, yeah good point
15:52:47 <gmann> on warnings I will start a thread to fix and start onverting them to error at from openstack lib side so that proejct had tp fix them
15:53:17 <gmann> #action clarkb to convey the ELK service shutdown deadline on ML
15:53:42 <gmann> #action gmann to send ML to fix warning and oslo side changes to convert them to error
15:54:01 <gmann> and we will continue disscussing it in next week
15:54:23 <gmann> thanks clarkb fungi for the updates and maintaining these services
15:54:49 <jungleboyj> ++
15:54:51 <gmann> #topic Open Reviews
15:54:54 <gmann> #link https://review.opendev.org/q/projects:openstack/governance+is:open
15:55:21 <gmann> I added the link for Yoga release name announcement
15:55:24 <gmann> please review that
15:55:35 <gmann> also Yoga testing runtime #link https://review.opendev.org/c/openstack/governance/+/799927
15:55:46 <gmann> with no change from what we have in Xena
15:56:21 <gmann> and this one about rbac goal proposal #link https://review.opendev.org/c/openstack/governance/+/799705
15:56:53 <gmann> and need one more vote in this project-update #link https://review.opendev.org/c/openstack/governance/+/799817
15:57:05 <clarkb> as a note on the python version available in focal I think 3.9 is avaible now
15:57:24 <spotz> 3.9 is also what will be Stream 9
15:57:26 <clarkb> oh I guess it is in universe though
15:57:39 <clarkb> probably good to test it but not make it the default
15:57:39 <gmann> clarkb: I think 3.8
15:57:55 <gmann> clarkb:  we have unit test job as non voting for 3.9
15:58:02 <clarkb> gmann: it has both. But 3.8 is the default and not in universe :)
15:58:11 <gmann> yeah, default
15:59:15 <gmann> that's all for me today, anything else to diuscuss ?
15:59:33 <gmann> though 1 min left
16:00:07 <jungleboyj> Nothing here.
16:00:08 <gmann> if nothing, let's close meeting.
16:00:11 <gmann> k
16:00:12 <gmann> thanks all for joining.
16:00:16 <gmann> #endmeeting