16:01:56 #startmeeting openstack-helm
16:01:57 Meeting started Tue Feb 18 16:01:56 2020 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:58 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:00 The meeting name has been set to 'openstack_helm'
16:02:04 o/
16:02:17 o/
16:02:19 \o
16:02:32 let's give it a few mins for people to roll up
16:02:54 lamt and gagehugo prepared the agenda for today, as they both rock: https://etherpad.openstack.org/p/openstack-helm-meeting-2020-02-18
16:02:57 O/
16:02:59 also it's great to be back
16:03:03 o/
16:03:05 wb
16:03:10 wb
16:06:02 ok let's go
16:06:09 #topic virtual ptg
16:06:24 o/
16:06:37 before i got roped into one million and one things we were talking about holding a virtual midcycle ptg
16:06:47 primarily focused upon documentation and gating
16:07:19 we still like this idea? if so I'll send out a few dates in march for us to choose from
16:07:35 I think it's a good idea
16:07:38 +1
16:07:39 ++
16:07:53 Did we find a good video conferencing tool?
16:08:07 i think webex should work, unless anyone objects?
16:09:32 ok, i'll take silence as approval ;)
16:09:55 #action portdirect to pull finger out and get email sent for virtual ptg
16:10:26 webex should be fine
16:10:42 so - that actually is it for the agenda today
16:11:08 however i can offer a tale of osh in use that may be of interest?
16:11:33 certainly sounds interesting
16:11:37 I am eared.
16:11:43 consider my interests piqued
16:12:14 so - a couple of weeks ago an osh deployment was happily humming along...
16:12:28 when several things hit it at once
16:12:39 1) network fabric instability
16:13:03 2) significant resource contention on control plane and compute nodes
16:13:41 these two factors combined resulted in us not living up to the mantra of our humble mascot, le grande honeybadger
16:14:47 ultimately resulting in the oom-killer taking out (i believe) qemu and kvm processes
16:14:58 which is not a good thing
16:15:36 oh
16:15:57 so i think there are some really valuable lessons we can take away from this:
16:16:02 is that the issue with the 1000 probes?
16:16:24 >2k probes, but yes ;) also you ruin stories lamt
16:16:55 1) we need to ensure that the blast radius of any component in osh is limited
16:17:46 2) we need to do work that ensures that even when a node is unhealthy - we take out 'non-essential' services and infra prior to tenants being impacted
16:18:16 as lamt has pointed out, it was observed that we had a massive buildup of exec probes on some compute nodes
16:18:38 which were caused by an unhealthy rabbit cluster*
16:19:13 * the rabbit cluster was in fact working, but due to load on the control plane, exec probes there were timing out, resulting in k8s flapping endpoints
16:20:00 some of the community members involved in this have started to make some great patchsets to improve the 'honey badger' ideals we strive for
16:20:10 eg: https://review.opendev.org/#/c/706590/
16:20:57 but i'd also like to ask that we all look at the areas of osh that we know best
16:21:13 and see if there are any areas there where we can make similar hardening improvements
16:21:44 eg phil recently made this change to mariadb: https://review.opendev.org/#/c/708071/
16:21:59 which improves how the deployment copes with instability of the k8s api
16:22:21 but i'm sure there are many other areas where we can make improvements
16:22:40 anyone have any thoughts or insight into this?
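For readers following the probe discussion above: an exec liveness probe shells into the container on every check, so under heavy control-plane load those probe processes pile up and time out, and once the failure threshold is reached Kubernetes restarts the container, which is what caused the endpoint flapping described here. A minimal sketch of a less aggressive exec probe follows, using the standard Kubernetes pod-spec fields; the container name, image, command, and numbers are illustrative only and are not taken from any openstack-helm chart.

containers:
  - name: rabbitmq                     # illustrative container
    image: rabbitmq:3.8                # illustrative image
    livenessProbe:
      exec:
        command: ["rabbitmq-diagnostics", "-q", "ping"]
      initialDelaySeconds: 120         # let clustering settle before the first check
      periodSeconds: 60                # probe less often to cut exec load on busy nodes
      timeoutSeconds: 30               # tolerate a slow, contended control plane
      failureThreshold: 5              # require several consecutive failures before a restart

Raising periodSeconds, timeoutSeconds, and failureThreshold trades slower failure detection for far fewer and more forgiving exec calls per node, which is the direction the probe-tuning comments later in the meeting point toward.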
16:23:40 Thanks for sharing the background and the work to date - those changes both look great
16:24:11 all hail the honey badger, may he live forever!
16:24:12 would probably look closer at mariadb and rabbitmq - those two components are always the troublemakers
16:24:25 yeah
16:24:33 those two and their clustering
16:24:54 amen to that - but please also look at their consumers
16:25:11 the probes issue above is testament to that
16:25:37 i think there is a ps being worked on to ensure the oom-score is adjusted appropriately for libvirt etc
16:25:58 in summary - we will never be able to protect against all control plane issues
16:26:11 s/protect/prevent
16:26:15 part of that may also be tuning those probes (timeout, delay, etc.)
16:26:22 so we need to do all we can to mitigate them
16:26:27 lamt: +100
16:26:46 i think overall we are probably much more aggressive than we need to be with liveness probes
16:27:02 it would also be great to see resource limits turned on by default
16:27:21 eg - if say a neutron agent is consuming more than 8gb of ram
16:27:31 it's pretty safe to assume something is wrong
16:27:49 the other side of this - is things like lma
16:28:14 should we be reading from the head, or tail of files? what are the tradeoffs etc
16:29:13 anyway - food for thought
16:29:37 I don't think it's been done upstream yet, but dsmith was able to implement the pos file for fluentd, which will allow it to 'remember' where it was reading from in case of a restart
16:30:00 what does that do?
16:31:19 It records the position in the log file which fluentd is reading
16:31:51 So after a restart, fluentd won't start at the top of the file and send out a bunch of duplicated messages
16:32:06 oh nice - that sounds like a great improvement
16:33:07 Yeah, I'll ask him to make the changes to the chart upstream as well, as an example to other OSH users
16:33:08 ok should we move onto the weekly plea for reviews?
16:33:13 stevthedev: +++
16:33:14 Sure
16:33:20 #topic reviews
16:33:37 we have a few this week that could do with some eyes:
16:33:40 https://review.opendev.org/#/c/706590/ (sangeet)
16:33:40 https://review.opendev.org/#/c/708046/
16:33:40 https://review.opendev.org/#/c/702983/
16:33:40 https://review.opendev.org/#/c/697554/
16:33:40 https://review.opendev.org/#/c/701839/
16:34:38 and with that - if we have nothing left to discuss this week, we can take 20 mins back?
16:37:07 wfm
16:37:53 ok - thanks for coming everyone!
16:37:57 #endmeeting
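A footnote on the resource-limits idea raised in the meeting: with a memory limit set, a leaking agent is OOM-killed inside its own cgroup instead of pushing the node into the kernel oom-killer that took out qemu/kvm in the incident described earlier. Below is a minimal sketch using plain Kubernetes requests and limits; the container name, image, and figures are illustrative (the 8Gi ceiling simply echoes the number mentioned above) and are not openstack-helm chart defaults.

containers:
  - name: neutron-l3-agent                         # illustrative agent; any consumer works the same way
    image: example.local/neutron-l3-agent:latest   # placeholder image reference
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 8Gi                                # past this point it is safe to assume something is wrong

The request keeps the scheduler honest about normal usage, while the limit bounds the blast radius of a misbehaving pod, in line with point 1) from the incident write-up.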