15:00:17 #startmeeting tc
15:00:17 Meeting started Thu Jul 8 15:00:17 2021 UTC and is due to finish in 60 minutes. The chair is gmann. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:17 The meeting name has been set to 'tc'
15:00:21 #topic Roll call
15:00:24 o/
15:00:24 o/
15:00:24 o/
15:00:24 o/
15:00:32 o/
15:00:36 o/
15:00:39 hello
15:00:48 clarkb: hi
15:00:57 yoctozepto is on PTO so will not be able to join today's meeting
15:01:13 did we approve that time off?
15:01:27 :)
15:01:33 i told him it was okay
15:01:34 :-)
15:01:49 let's start
15:01:52 #topic Follow up on past action items
15:02:02 gmann to remove Governance non-active repos cleanup topic from agenda
15:02:03 done
15:02:14 gmann to remove election assignments topic from agenda
15:02:22 this too
15:02:32 ricolin to ask for collecting the ops pain points on openstack-discuss ML
15:02:40 ricolin: any update on this
15:03:13 already added it to the community-goals backlog and the y-cycle pre-selected list, but have not sent it to the ML yet
15:03:36 +1. i think that is good
15:03:41 o/
15:03:47 will send it out this week
15:03:49 on ML
15:03:59 ok, thanks
15:04:04 gmann to propose the RBAC goal
15:04:15 I proposed that #link https://review.opendev.org/c/openstack/governance/+/799705
15:04:19 please review
15:04:40 #topic Gate health check (dansmith/yoctozepto)
15:05:02 dansmith: any news
15:05:07 I really have nothing to report, but mostly because I've been too busy with other stuff to be submitting many patches in the last week or so
15:05:26 ok
15:05:41 one thing to share is about the log warnings, especially from oslo policy
15:05:56 we've had a bit of job configuration upheaval from the zuul 4.6.0 security release
15:06:24 melwitt and clarkb pointed that out in the infra channel, and many projects have a lot of such warnings due to policy rules
15:06:37 I am fixing those in #link https://review.opendev.org/q/topic:%22fix-oslo-policy-warnings%22+(status:open%20OR%20status:merged)
15:06:53 had to make non-backward-compatible changes to how some kinds of variables are accessed, particularly with regard to secrets, so that's been disrupting some post/promote jobs (should be under control now), as well as made some projects' overall zuul configuration insta-buggy, causing some of their jobs to not run
15:07:27 i think kolla was hardest hit by that
15:07:48 ok
15:08:08 fungi: any affected project without an ack, or needing help on this?
15:08:35 I saw on the ML that a few projects acked that and are working on it
15:08:36 gmann: it might be good to update those warnings to only fire once per process
15:08:40 i haven't checked in the past few days, but click the bell icon at the top-right of the zuul status page for a list of some which may need help
15:08:51 I can't imagine those warnings help operators any more than they help CI
15:09:37 fungi: ok, thanks for the update. let us know if any project did not notice this or needs help
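
[editor's note: the oslo.policy warning cleanup gmann links above, and which the discussion returns to just below, mostly comes down to quieting the deprecation noise in a project's own enforcer or test setup. Below is a minimal sketch only, assuming the Enforcer exposes the suppress_deprecation_warnings and suppress_default_change_warnings flags found in recent oslo.policy releases; check the pinned version before relying on them.]

    # Illustrative sketch (not necessarily what the linked patches do):
    # quiet oslo.policy deprecation noise when building an enforcer.
    from oslo_config import cfg
    from oslo_policy import policy

    CONF = cfg.CONF

    def make_quiet_enforcer():
        enforcer = policy.Enforcer(CONF)
        # Stop warning for every rule whose default changed as part of
        # the new RBAC defaults work.
        enforcer.suppress_default_change_warnings = True
        # Stop warning about rules that are deprecated for removal.
        enforcer.suppress_deprecation_warnings = True
        return enforcer
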
15:10:19 back to the policy rule warnings
15:10:27 clarkb: yes, that seems very noisy now
15:11:01 when we added it initially we thought it would help operators move to the new rbac, but in the new rbac work every policy rule changed its default, so everything warns
15:11:09 which does not seem to help much
15:11:47 one approach I sent on the ML is to disable those by default and make them configurable, so that operators can enable them to see what they need to update
15:11:54 #link http://lists.openstack.org/pipermail/openstack-discuss/2021-July/023484.html
15:12:11 and this is the patch #link https://review.opendev.org/c/openstack/oslo.policy/+/799539
15:12:39 feel free to respond on the ML or in gerrit with your opinion
15:13:42 anything else to discuss related to gate health?
15:14:12 #topic Migration from 'Freenode' to 'OFTC' (gmann)
15:14:15 #link https://etherpad.opendev.org/p/openstack-irc-migration-to-oftc
15:14:42 I started pushing the patches for the remaining projects #link https://review.opendev.org/q/topic:%22oftc%22+(status:open%20OR%20status:merged)
15:14:48 a few are still left
15:15:12 nothing else to share on this
15:15:17 today we landed an update to the opendev infra manual as well, so if you refer anyone there it should now properly reference oftc and not freenode
15:15:29 +1
15:16:19 #topic Xena Tracker
15:16:21 +1
15:16:27 #link https://etherpad.opendev.org/p/tc-xena-tracker
15:17:00 I think we can close 'election promotion' now as we have three new election officials
15:17:15 spotz: belmoreira diablo_rojo_phone ? what do you say?
15:17:23 L63 in etherpad
15:17:29 i'm very excited by that, and happy to answer questions anyone has
15:17:29 \o/
15:17:33 Yeah and we now have a name for that patch
15:18:13 and the email opt-in process or solution can be discussed by you all in the election channel
15:18:20 lgtm
15:18:21 thanks again for volunteering
15:19:41 Charter revision is also done, so marked as completed
15:19:49 Yes we can close it.
15:20:08 any other update on the Xena tracker?
15:20:28 jungleboyj: mnaser any update you want to share for 'stable policy process change' ?
15:21:41 No, didn't get to that with the holiday week.
15:21:53 ok
15:22:25 we have 8 items in the etherpad to finish in Xena; let's start working on those, which should not take much time
15:22:48 moving next..
15:22:52 #topic ELK services plan and help status
15:23:04 first is Board meeting updates
15:23:32 I presented this slide in the 30th June Board meeting #link https://docs.google.com/presentation/u/1/d/1ugdwMI2ZM2L8z1sobzHJwDpbvlyWKH02PH7Fi4tkyVc/edit#slide=id.ge1bdf71dac_0_0
15:23:50 I was expecting some actionable items from the Board but that did not happen.
15:24:44 the Board acked this help-needed item and said they would broadcast it within their organizations/local communities etc
15:25:11 that is what I think everyone has been doing since 2018, when we re-defined the upstream investment opportunities
15:26:06 honestly speaking, I am not so happy with no actionable items coming out of that meeting
15:26:21 and I do not know how we can get help here ?
15:26:45 I took it as folks were going back to their own companies
15:26:53 It was a bit late for me though
15:26:55 yeah, their own companies also
15:27:28 but that is no different from what we all, including the Board, have been trying since 2018
15:27:30 not to apologize for them, but i don't expect the board members to come to those meetings expecting to make commitments on behalf of their employers, and they probably don't control the budget that assistance would be provided out of in most cases (they're often in entirely separate business units), so they have to lobby internally for that sort of thing
15:28:02 fungi: True.
15:28:18 i'm more disappointed by the years of inaction than in their inability to make any immediate promises
15:28:27 a few of the suggestions are listed in slide #5 #link https://docs.google.com/presentation/d/1ugdwMI2ZM2L8z1sobzHJwDpbvlyWKH02PH7Fi4tkyVc/edit#slide=id.ge1bdf71dac_0_24
15:28:57 that was my expectation and hope. I know those are not easy, but in the current situation we need such support
15:30:47 anyway, that is the update from the Board meeting. moving next..
15:30:56 Creating a timeline for shutting the service down if help isn't found
15:31:09 clarkb please go ahead
15:31:31 This is mostly a request that we start thinking about what the timeline looks like if we don't end up with help to update the system or host it somewhere else
15:32:01 I'm not currently in a rush to shut it down, but there is a risk that external circumstances could force that to be done (security concerns or similar)
15:32:19 However, I think it would be good to have some agreement on what "not a rush" means :)
15:32:21 :-(
15:32:45 part of the reason this came up was that after a week or two it was noticed that the cluster had completely crashed and I had to go resurrect it
15:33:02 I don't want to do that indefinitely if there isn't proper care and feeding happening
15:33:51 There are also a few problems with indexing currently, including the massive log files generated by unit tests due to warnings, and for some reason logstash is emitting events for centuries in the future, which floods the elasticsearch cluster with indexes for the future
15:34:18 I think the massive log files led to the cluster crashing. The future events problem is more annoying than anything else
15:34:43 yeah, we should start fixing those warnings. maybe we can ask all projects on the ML. I can fix oslo policy but do not have the bandwidth to fix the others
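
[editor's note: on clarkb's point above about logstash emitting events dated centuries in the future and flooding elasticsearch with bogus daily indexes: the real fix would live in the indexer/logstash configuration, but a minimal Python sketch of the guard being described, with all names hypothetical, looks like this.]

    # Hypothetical sketch of the guard described above: refuse to index
    # events whose parsed timestamp is implausibly far in the future, so a
    # bad timestamp cannot create indexes for future dates.
    from datetime import datetime, timedelta, timezone

    MAX_FUTURE_SKEW = timedelta(days=1)  # tolerance for ordinary clock skew

    def should_index(event_timestamp: datetime) -> bool:
        now = datetime.now(timezone.utc)
        return event_timestamp <= now + MAX_FUTURE_SKEW
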
15:35:00 back to the shutdown thing
15:35:09 so if we shut down, the big question is how we are going to debug failures, and how much extra load it will add on the gate in terms of rechecks ..
15:35:28 gmann: to be fair I think most people just recheck anyway and don't do debugging
15:35:36 yeah but not all
15:35:51 after a shutdown there will be many more rechecks we have to do
15:35:58 where elastic-recheck has been particularly useful is when you have an sdague, jogo, mtreinish, melwitt, or dansmith digging into broader failures and trying to address them
15:36:14 yeah, I try to shame people that just blindly recheck,
15:36:17 but it's a bit of a losing battle
15:36:27 still, removing the *ability* to do real checking sucks :/
15:36:32 I suspect the biggest impact will not be recheck problems but the once-a-cycle-or-so effort to fix a very unstable gate
15:36:54 ...or a more continuously unstable gate
15:36:58 yeah
15:36:58 ya
15:37:26 which will directly impact our release
15:37:32 or feature implementation
15:37:33 though it sounds like the entire cluster was broken for a couple of weeks there before anyone noticed it wasn't returning results to their queries
15:37:43 I think that also is part of why it has been so hard to find help for this. When it is a tool you use every 6 months it is less in your mind continuously for care and feeding
15:37:52 fungi: yes, but no one notices if the gate is stable
15:37:56 yeah
15:38:02 which is a big underlying issue here imo
15:38:30 people do notice when there are systemic problems in the gate that need addressing
15:39:06 another reason to have a rough timeline is it may help light a fire under people willing to help
15:39:26 when I brought this up last week gmann suggested the end of the Yoga cycle as a potential deadline
15:39:35 yeah, "no rush" is not as motivating
15:40:01 yeah, I am thinking end of Yoga; that gives more than 6 months, and we call it the last critical call for help
15:40:13 That ensures that Xena (hopefully) doesn't have any major changes to the stabilization process. Then in Yoga we can start planning for replacement/shutdown/etc (though that can start earlier too)
15:40:23 so anyone who wants to help should be raising their hand by then
15:40:34 That timeline seems reasonable to me
15:41:23 any objection to the above deadline ?
15:41:32 "no rush" has also been tempered by "but might be tomorrow, depending on outside factors"
15:41:45 fungi: yes and I think that is still the message from me
15:41:53 I'm not happy about the timeline, but accept the need
15:42:09 "happy with, not happy about" you might say :)
15:42:10 dansmith: meaning? is it too late or too early ?
15:42:11 if we notice abuse of elasticsearch or logstash that requires upgrades to address, we'll be in a situation where we don't have much choice
15:42:13 I think that sounds like a reasonable timeline ... even though we don't want one.
15:42:54 gmann: it's me being intentionally vague. I'm good with it, just not happy about it.. necessary, but I worry about the inevitable end where nobody has actually stepped up
15:42:55 clarkb: yeah, for outside factors we would not be able to do anything and would shut down early ?
15:43:06 gmann: correct
15:43:10 k
15:43:34 dansmith: correct. my last hope was the Board on paid resources, but anyway that did not happen
15:43:52 Another concern is the sheer size of the system. I've temporarily shut down 50% of the indexing pipeline and have been monitoring our indexing queue https://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=17&orgId=1&from=now-24h&to=now
15:44:37 compared to elasticsearch the logstash workers aren't huge, but it is still something. I think I may turn on 10% again and leave it at 40% shutdown for another week, then turn off the extra servers if that looks stable.
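
[editor's note: for readers unfamiliar with elastic-recheck, mentioned earlier in this discussion, its value is in matching known failure signatures against the indexed job logs. A rough sketch of the kind of search involved, using the Python elasticsearch client; the endpoint and the signature string are placeholders, not the real service or a curated query, and the field names follow the conventional logstash/CI schema rather than a verified mapping.]

    # Placeholder endpoint and query string; build_status/build_uuid/build_name
    # are assumed field names and should be checked against the real mapping.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://elasticsearch.example.org:9200"])
    signature = 'message:"Timed out waiting for a reply" AND build_status:"FAILURE"'

    result = es.search(index="logstash-*", q=signature, size=10)
    for hit in result["hits"]["hits"]:
        source = hit["_source"]
        print(source.get("build_uuid"), source.get("build_name"))
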
15:44:44 currently we seem to be just barely keeping up with demand
15:44:48 yeah, that's just half the indexing workers, not half the system
15:45:00 (and then having some headroom for feature freeze is a good idea, hence only reducing by 40% total)
15:45:18 how about keeping only check pipeline logs ?
15:45:36 gmann: I would probably do the opposite and only keep gate
15:45:40 check is too noisy
15:45:48 people push a lot of broken stuff into check :)
15:45:55 clarkb: That makes sense to me.
15:46:03 clarkb: yeah, but in check we do most of the debugging and make things more stable before the gate
15:46:06 the check pipeline results are full of noise failures from bad changes, while the gate pipeline should in theory be things which at least got through check and code review to approval
15:46:12 but that is another option, and reducing the total amount of logs indexed would potentially allow us to remove an elasticsearch server or two (since the major factor there is total storage size)
15:46:57 gmann: yes, but it is very hard to see anything useful in check because you can't really tell if things are just broken because someone didn't run tox locally or if they are really broken
15:47:11 yeah
15:47:17 it is still useful to have check; often you want to go and see where something may have been introduced, and you can trace that back to check
15:47:27 but if we start trimming logs, check is what I would drop first
15:47:58 as far as elasticsearch disk consumption goes, we should have a pretty good indication of current db size for 7 days of indexes at the beginning of next week
15:48:06 the data is currently a bit off since we had the cluster crash recently
15:48:45 that info is available in our cacti instance if you want to see what usage looks like. We have 6TB of storage available but 5TB usable, because we need to be tolerant of losing one server and its 1TB of disk
15:49:16 If we want to start pruning logs out then maybe we start that conversation next week when we have a good baseline of data to look at first
15:49:34 or truncate the log storage time? to 2-3 days
15:49:42 yes that is another option
15:50:11 though that doesn't give you much history to be able to identify when a particular failure started
15:50:22 a week is already fairly short in that regard
15:50:32 yup, but maybe enough to identify the source of problems and then work backward in code
15:50:34 yeah, we are going to lose that anyway
15:50:47 as well as track what issues are still occurring
15:50:56 yes
15:51:16 anyway I think the discussion on pruning elasticsearch size is better next week when we have better data to look at. I'm happy to help collect some of that info together and discuss it further next week if we like
15:51:26 i wonder if we could change the indexing threshold to >info instead of >debug
15:51:44 (this is about all I had on this agenda item. I'll go ahead and make note of the Yoga deadline on the mailing list in a response to the thread I started a while back if I can find it)
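
[editor's note: gmann's suggestion above to truncate log storage to 2-3 days amounts to deleting the oldest daily indexes sooner. Retention tooling such as elasticsearch-curator is the usual way to do this; the following is only a bare-bones sketch of the same idea, assuming the conventional logstash-YYYY.MM.DD index naming, with a placeholder endpoint.]

    # Bare-bones retention sketch: delete daily logstash indexes older than
    # the cutoff. A real deployment would use purpose-built tooling.
    from datetime import datetime, timedelta, timezone
    from elasticsearch import Elasticsearch

    RETENTION_DAYS = 3
    es = Elasticsearch(["http://elasticsearch.example.org:9200"])
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

    for name in es.indices.get(index="logstash-*"):
        try:
            day = datetime.strptime(name, "logstash-%Y.%m.%d")
        except ValueError:
            continue  # not a daily index; leave it alone
        if day.replace(tzinfo=timezone.utc) < cutoff:
            es.indices.delete(index=name)
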
15:51:45 clarkb: +1 that will be even better
15:52:05 fungi: the issue with that is a good chunk of the logs are the job-output.txt files now, with no log level
15:52:11 clarkb: +1 and thanks for publishing the deadline on the ML
15:52:11 fungi: this is why the warnings hurt so much
15:52:14 ahh, yeah good point
15:52:47 on warnings, I will start a thread about fixing them and start converting them to errors on the openstack lib side so that projects have to fix them
15:53:17 #action clarkb to convey the ELK service shutdown deadline on the ML
15:53:42 #action gmann to send an ML thread about fixing warnings and the oslo side changes to convert them to errors
15:54:01 and we will continue discussing it next week
15:54:23 thanks clarkb fungi for the updates and for maintaining these services
15:54:49 ++
15:54:51 #topic Open Reviews
15:54:54 #link https://review.opendev.org/q/projects:openstack/governance+is:open
15:55:21 I added the link for the Yoga release name announcement
15:55:24 please review that
15:55:35 also the Yoga testing runtime #link https://review.opendev.org/c/openstack/governance/+/799927
15:55:46 with no change from what we have in Xena
15:56:21 and this one about the rbac goal proposal #link https://review.opendev.org/c/openstack/governance/+/799705
15:56:53 and we need one more vote on this project-update #link https://review.opendev.org/c/openstack/governance/+/799817
15:57:05 as a note on the python version available in focal, I think 3.9 is available now
15:57:24 3.9 is also what will be in Stream 9
15:57:26 oh I guess it is in universe though
15:57:39 probably good to test it but not make it the default
15:57:39 clarkb: I think 3.8
15:57:55 clarkb: we have a non-voting unit test job for 3.9
15:58:02 gmann: it has both. But 3.8 is the default and not in universe :)
15:58:11 yeah, default
15:59:15 that's all for me today, anything else to discuss ?
15:59:33 though 1 min left
16:00:07 Nothing here.
16:00:08 if nothing, let's close the meeting.
16:00:11 k
16:00:12 thanks all for joining.
16:00:16 #endmeeting
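
[editor's note: on gmann's action item above about eventually converting the warnings to errors on the library side: one way a consuming project could opt in early is to escalate deprecation warnings in its test base class. This assumes the warnings come through Python's warnings module; if a library emits them via its logger instead, the analogous approach is a logging fixture that fails on WARNING records. The class below is a hypothetical sketch, not an agreed mechanism.]

    # Hypothetical sketch: make deprecation warnings attributed to a given
    # module fail the test suite, assuming they use the warnings module.
    import unittest
    import warnings

    class WarningsAreErrors(unittest.TestCase):
        def setUp(self):
            super().setUp()
            warnings.filterwarnings(
                "error", category=DeprecationWarning, module="oslo_policy")
            # Blunt but sufficient for a sketch: restore default filters.
            self.addCleanup(warnings.resetwarnings)
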