#openstack-meeting-4 log

16:01:56 <portdirect> #startmeeting openstack-helm
16:01:57 <openstack> Meeting started Tue Feb 18 16:01:56 2020 UTC and is due to finish in 60 minutes.  The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:58 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:00 <openstack> The meeting name has been set to 'openstack_helm'
16:02:04 <gagehugo> o/
16:02:17 <dwalt> o/
16:02:19 <megheisler> \o
16:02:32 <portdirect> let's give it a few mins for people to roll up
16:02:54 <portdirect> lamt and gagehugo prepared the agenda for today, as they both rock: https://etherpad.openstack.org/p/openstack-helm-meeting-2020-02-18
16:02:57 <stevthedev> O/
16:02:59 <portdirect> also it's great to be back
16:03:03 <lamt> o/
16:03:05 <lamt> wb
16:03:10 <gagehugo> wb
16:06:02 <portdirect> ok let's go
16:06:09 <portdirect> #topic virtual ptg
16:06:24 <mattmceuen> o/
16:06:37 <portdirect> before i got roped into one million and one things we were talking about holding a virtual midcycle ptg
16:06:47 <portdirect> primarily focused upon documentation and gating
16:07:19 <portdirect> we still like this idea? if so I'll send out a few dates in march for us to choose from
16:07:35 <gagehugo> I think it's a good idea
16:07:38 <mattmceuen> +1
16:07:39 <lamt> ++
16:07:53 <gagehugo> Did we find a good video conferencing tool?
16:08:07 <portdirect> i think webex should work, unless anyone objects?
16:09:32 <portdirect> ok, i'll take silence as approval ;)
16:09:55 <portdirect> #action, portdirect to pull finger out and get email sent for virtualptg
16:10:26 <gagehugo> webex should be fine
16:10:42 <portdirect> so - that actually is it for the agenda today
16:11:08 <portdirect> however i can offer a tale of osh in use that may be of interest?
16:11:33 <megheisler> certainly sounds interesting
16:11:37 <lamt> I am eared.
16:11:43 <stevthedev> consider my interests piqued
16:12:14 <portdirect> so - a couple of weeks ago an osh deployment was happily humming along...
16:12:28 <portdirect> when several things hit it at once
16:12:39 <portdirect> 1) network fabric instability
16:13:03 <portdirect> 2) significant resource contention on control plane and compute nodes
16:13:41 <portdirect> these two factors combined resulted in us not living up to the mantra of our humble mascot, le grande honeybadger
16:14:47 <portdirect> ultimately resulting in the oom-killer taking out (i believe) qemu and kvm processes
16:14:58 <portdirect> which is not a good thing
16:15:36 <gagehugo> oh
16:15:57 <portdirect> so i think there's some really valuable learnings we can take away from this:
16:16:02 <lamt> is that the issue with the 1000 probes?
16:16:24 <portdirect> >2k probes, but yes ;) also you ruin stories lamt
16:16:55 <portdirect> 1) we need to ensure that the blast radius of any component in osh is limited
16:17:46 <portdirect> 2) we need to do work that ensures that even when a node is unhealthy - we take out 'non-essential' services and infra prior to tenants being impacted
16:18:16 <portdirect> as lamt has pointed out, it was observed that we had a massive buildup of exec probes on some compute nodes
16:18:38 <portdirect> which were caused by an unhealthy rabbit cluster*
16:19:13 <portdirect> * the rabbit cluster was in fact working, but due to load on the control plane, exec probes there were timing out, resulting in k8s flapping endpoints
16:20:00 <portdirect> some of the community members involved in this have started to make some great patchsets to improve the 'honey badger' ideals we strive for
16:20:10 <portdirect> eg: https://review.opendev.org/#/c/706590/
16:20:57 <portdirect> but i'd also like to ask that we all look at the areas of osh that we know best
16:21:13 <portdirect> and see if there are any areas where we can make similar hardening improvements
16:21:44 <portdirect> eg phil recently made this change to mariadb: https://review.opendev.org/#/c/708071/
16:21:59 <portdirect> which improves how the deployment copes with instability of the k8s api
16:22:21 <portdirect> but i'm sure there are many other areas where we can make improvements
16:22:40 <portdirect> anyone have any thoughts or insight into this?
16:23:40 <mattmceuen> Thanks for sharing the background and the work to date - those changes both look great
16:24:11 <mattmceuen> all hail the honey badger, may he live forever!
16:24:12 <lamt> would probably look closer at mariadb and rabbitmq - those two components are always the troublemakers
16:24:25 <gagehugo> yeah
16:24:33 <gagehugo> those two and their clustering
16:24:54 <portdirect> amen to that - but please also look at their consumers
16:25:11 <portdirect> the probes issue above is testament to that
16:25:37 <portdirect> i think there is a ps being worked on to ensure the oom-score is adjusted appropriately for libvirt etc
16:25:58 <portdirect> in summary - we will never be able to protect against all control plane issues
16:26:11 <portdirect> s/protect/prevent
16:26:15 <lamt> part of that may also be tuning those probes (timeout, delay, etc.)
16:26:22 <portdirect> so we need to do all we can to mitigate them
16:26:27 <portdirect> lamt: +100
16:26:46 <portdirect> i think overall we are probably much more aggressive than we need to be with liveness probes
16:27:02 <portdirect> it would also be great to see resource limits turned on by default
16:27:21 <portdirect> eg - if say a neutron agent is consuming more than 8gb of ram
16:27:31 <portdirect> it's pretty safe to assume something is wrong
16:27:49 <portdirect> the other side of this - is things like lma
16:28:14 <portdirect> should we be reading from the head, or tail of files? what are the tradeoffs etc
16:29:13 <portdirect> anyway - food for thought
16:29:37 <stevthedev> I don't think it's been done upstream yet, but dsmith was able to implement the pos file for fluentd, which will allow it to 'remember' where it was reading from in case of a restart
16:30:00 <portdirect> what does that do?
16:31:19 <stevthedev> It records the position in the log file which fluentd is reading
16:31:51 <stevthedev> So after a restart, fluentd won't start at the top of the file and send out a bunch of duplicate messages
16:32:06 <portdirect> oh nice - that sounds like a great improvement
16:33:07 <stevthedev> Yeah, I'll ask him to make the changes to the chart upstream as well, as an example to other OSH users
16:33:08 <portdirect> ok should we move onto the weekly plea for reviews?
16:33:13 <portdirect> stevthedev: +++
16:33:14 <stevthedev> Sure
16:33:20 <portdirect> #topic reviews
16:33:37 <portdirect> we have a few this week that could do with some eyes:
16:33:40 <portdirect> https://review.opendev.org/#/c/706590/ (sangeet)
16:33:40 <portdirect> https://review.opendev.org/#/c/708046/
16:33:40 <portdirect> https://review.opendev.org/#/c/702983/
16:33:40 <portdirect> https://review.opendev.org/#/c/697554/
16:33:40 <portdirect> https://review.opendev.org/#/c/701839/
16:34:38 <portdirect> and with that - if we have nothing left to discuss this week, we can take 20 mins back?
16:37:07 <gagehugo> wfm
16:37:53 <portdirect> ok - thanks for coming everyone!
16:37:57 <portdirect> #endmeeting