14:59:57 <mattmceuen> #startmeeting openstack-helm
14:59:58 <openstack> Meeting started Tue Aug 21 14:59:57 2018 UTC and is due to finish in 60 minutes. The chair is mattmceuen. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:03 <openstack> The meeting name has been set to 'openstack_helm'
15:00:04 <mattmceuen> #topic Rollcall
15:00:08 <srwilkers> o/
15:00:13 <mattmceuen> GM / GE / GD everyone!
15:00:17 <mattmceuen> o/ srwilkers
15:00:20 <tdoc> hi
15:00:24 <mattmceuen> hey tdoc
15:01:14 <evrardjp> o/
15:01:16 <mattmceuen> Here's our agenda: https://etherpad.openstack.org/p/openstack-helm-meeting-2018-08-21
15:01:27 <mattmceuen> Please go ahead and add anything you'd like to discuss today
15:01:35 <mattmceuen> Otherwise I'll give one more min for folks to filter in
15:02:42 <mattmceuen> o/ jayahn!
15:02:48 <portdirect> o/
15:03:02 <jayahn> o/
15:03:15 <mattmceuen> #topic LMA News
15:03:17 <jayahn> mattmceuen!!
15:03:23 <mattmceuen> Good to see you man :)
15:03:51 <mattmceuen> srwilkers has been hard at work testing our LMA stack in various labs of various sizes and workloads
15:03:56 <jayahn> yeah.. it was independence holiday + summer vacation last week.
15:04:06 <jayahn> that is good.
15:04:26 <mattmceuen> That sounds awesome jayahn - hope it was an awesome vacation
15:04:28 <mattmceuen> well earned
15:04:43 <jayahn> we are analyzing what each exporter gathers, which ones are significant to watch, and alarm..
15:05:12 <portdirect> have you ever deployed the current lma stack at scale in a working osh cluster?
15:05:13 <mattmceuen> srwilkers has been doing some of the same thing as part of his analysis
15:05:23 <srwilkers> oh hello
15:05:28 <jayahn> at scale.. how big?
15:06:15 <portdirect> lets start small >10 nodes, with active workloads?
15:06:48 <jayahn> i think we did a fairly good test on >10, for the logging part
15:07:24 <portdirect> did you run into any issues? OOM's or similar?
15:07:24 <jayahn> for prometheus, we are behind the schedule.
15:07:46 <jayahn> on the elasticsearch side, i heard sungil experienced lots of oom
15:08:08 <jayahn> sungil had experienced..
15:08:15 <srwilkers> are you running default values mostly, or have you started providing more fine-grained overrides for things like fluentbit and fluentd? my biggest takeaway from the logging stack was that it's better to leverage fluentd to provide smarter filters than just jamming everything into elasticsearch
15:08:43 <srwilkers> once i started adding more granular filters and dumping specific entries, elasticsearch was much healthier in the long term
15:08:50 <jayahn> nope, i think we override values on es, fluent-bit.
15:09:09 <jayahn> i will ask sungil tomorrow on this
15:09:42 <portdirect> that would be great jayahn
15:09:45 <jayahn> he has been struggling with logging for the last two months..
15:09:48 <srwilkers> i've done quite a bit of work on this and exposed it as part of the work to introduce an ocata based armada gate
15:09:49 <srwilkers> https://review.openstack.org/#/c/591808/12/tools/deployment/armada/manifests/ocata/armada-lma.yaml
15:10:00 <jayahn> did some testing on federation as well.
15:10:49 <jayahn> okay.
15:11:43 <srwilkers> prometheus is a whole different beast though
15:11:50 <jayahn> agreed
15:12:03 <jayahn> it graduated at least. :)
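The Elasticsearch OOMs mentioned here usually come down to the JVM heap and the container memory limit being out of step. A minimal, generic sketch follows (not the openstack-helm chart's actual values layout; the container name, image tag, and sizes are assumptions), showing the usual practice of keeping the heap at roughly half the memory limit:

```yaml
# Generic illustration only, not the openstack-helm elasticsearch chart values.
# Keeping the JVM heap at ~50% of the container memory limit leaves room for
# off-heap buffers and page cache and avoids cgroup OOM kills.
containers:
  - name: elasticsearch-data
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.4   # assumed tag
    env:
      - name: ES_JAVA_OPTS
        value: "-Xms4g -Xmx4g"        # heap roughly half of the limit below
    resources:
      requests:
        memory: "6Gi"
      limits:
        memory: "8Gi"                 # the limit the kernel OOM-kills against
```

Even with sane limits, srwilkers' point stands: filtering in fluentd before indexing is the more durable fix than giving Elasticsearch ever more memory.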
15:12:14 <srwilkers> lol
15:12:17 <mattmceuen> srwilkers I believe you're moving more sane defaults into the charts so that operators can choose to let more logs through to elasticsearch if they need them, right?
15:12:43 <jayahn> we are urgently hiring a person to take on prometheus, really short handed right now. :)
15:13:00 <jayahn> srwilkers: you will be always welcome here :)
15:13:12 <mattmceuen> hey hey save the poaching for after the team meeting
15:13:27 <srwilkers> jayahn: lol
15:13:33 <portdirect> jayahn: I'm his agent
15:13:49 <mattmceuen> and I'm portdirect's agent
15:13:56 <portdirect> and alanmeadows is yours?
15:14:02 <mattmceuen> we all get a cut
15:14:13 <jayahn> hey.... pyramid organization...
15:14:19 <mattmceuen> :D
15:14:22 <srwilkers> bernie madoff would be proud
15:14:24 <srwilkers> anyway
15:14:26 <evrardjp> lol
15:14:29 <mattmceuen> Anything else to share on the LMA front folks?
15:14:35 <srwilkers> yeah
15:14:37 <evrardjp> FYI it hurts corporate diversity :p
15:14:48 <evrardjp> I know, I have been there :p
15:15:00 <srwilkers> im starting to work on pruning out the metrics we actually ingest into prometheus by default
15:15:23 <srwilkers> as we were consuming a massive amount of metrics that we arent actually using with grafana/nagios/prometheus by default
15:15:39 <jayahn> frankly speaking, we are currently only guaranteeing "short term usage" for elasticsearch, and asking all the operation teams to help us fine-tune this logging beast if they want to use it as more long term logging storage
15:15:40 <jayahn> :)
15:16:04 <srwilkers> cadvisor was the biggest culprit -- i've proposed dropping 41 metrics from cadvisor alone, and that reduced the total number of time series in a single node deployment from 18500ish to a little more than 3000
15:16:14 <jayahn> srwilkers: I think we can help on that
15:16:17 <mattmceuen> wow
15:17:15 <srwilkers> node exporter is probably my next target
15:17:23 <srwilkers> as there's some there we dont really need
15:17:30 <evrardjp> srwilkers: good to know
15:18:08 <jamesgu> srwilkers: is there a doc for the metrics that we are currently collecting?
15:18:12 <portdirect> srwilkers: were you able to get all we lost from cadvisor out of k8s itself, or too early to say?
15:18:13 <srwilkers> it's something that needs some attention though, because ive been seeing prometheus fall over dead in a ~10 node deployment with 16GB memory limits
15:18:37 <srwilkers> and it was hitting that limit after about 2 days without significant workloads running on top
15:18:45 <srwilkers> portdirect: too early to say
15:19:03 <portdirect> jayahn: maybe your team could help there?
15:19:19 <srwilkers> jamesgu: we currently gather everything available from every exporter we leverage. i dont have a list handy yet, but can provide a quick list of exporters we have
15:19:42 <jayahn> for every exporter, we are doing something like this: https://usercontent.irccloud-cdn.com/file/mtrps177/Calico%20Exporter.pdf
15:19:48 <evrardjp> srwilkers: if you could document it, that would be nice :)
15:20:20 <srwilkers> evrardjp: yeah, it's about that time :)
15:20:21 <jayahn> if we can set up a wiki page we can all use, I will certainly upload the information we've summarized so far, and work together.
15:20:34 <jamesgu> srwilkers: that would be very nice.
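The cAdvisor pruning described above is normally done with Prometheus metric relabelling at scrape time, so unwanted series are dropped before they are ever stored. A minimal sketch follows, assuming a scrape job named kubernetes-nodes-cadvisor and three example metric names; the actual list of 41 dropped metrics is in the review linked earlier:

```yaml
# Illustrative only: drop selected cAdvisor series before ingestion.
# The job name and the metric names below are placeholders for this sketch.
scrape_configs:
  - job_name: kubernetes-nodes-cadvisor
    kubernetes_sd_configs:
      - role: node
    metric_relabel_configs:
      # Drop per-container series that no dashboard or alert consumes.
      - source_labels: [__name__]
        regex: 'container_(network_tcp_usage_total|tasks_state|memory_failures_total)'
        action: drop
```

The same pattern applies to node-exporter once the list of unused series there is settled.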
15:20:44 <jayahn> wiki or anything to put this massive document, or information to share
15:20:46 <evrardjp> srwilkers: ping me for reviews when ready
15:20:53 <srwilkers> evrardjp: nice, cheers
15:21:12 <evrardjp> no need for the whole document, just saying how it works
15:21:19 <jamesgu> have we run into disk issues too, besides memory?
15:22:01 <srwilkers> jamesgu: yep. noticing 500gb PVCs filling up in ~7 days time on a similar sized deployment (~8-10 nodes)
15:22:15 <srwilkers> which once again is due to the massive amount of time series we gather and persist
15:22:31 <evrardjp> yeah pruning would be an important part :)
15:22:38 <portdirect> have any idea on the i/o reqs?
15:22:47 <portdirect> as well as raw capacity
15:23:15 <srwilkers> and things like cadvisor are especially bad, because there's ~50 metrics that get gathered per container, so if you think about how many containers would be deployed in a production-ish environment, that gets out of hand quickly
15:23:50 <srwilkers> portdirect: not at the moment -- certainly something that would be nice to get multiple peoples' input on. would be awesome if you could help evaluate that too jayahn
15:24:42 <jayahn> I am missing the conversation flow..
15:25:17 <jayahn> sorry.. what awesome thing can I do?
15:25:29 <jayahn> could you kindly summarize?
15:25:30 <srwilkers> jayahn: oh, sorry. just getting a better idea of the io requirements and storage capacity requirements for prometheus in a medium/large-ish deployment
15:25:48 <jayahn> ah.. okay
15:25:50 <portdirect> jayahn: in your env do you know how much pressure lma has been putting on the storage - both capacity wise, and IOPs/throughput?
15:26:36 <jayahn> for prometheus, we have a plan to test that on a 20 node deployment from next week
15:26:55 <jayahn> we are right now enabling every exporter.
15:27:22 <jayahn> so, i guess we can share something next month.
15:28:00 <evrardjp> jayahn: my point (sorry to have disrupted the flow) was that your research can be documented https://docs.openstack.org/openstack-helm/latest/devref/fluent-logging.html
15:28:00 <jayahn> to get an idea on capacity planning / requirements for prometheus.
15:28:07 <mattmceuen> would you plan to incorporate srwilkers' pruning work, or do you want all that data?
15:28:22 <evrardjp> or elsewhere, as this is maybe not enough
15:29:28 <jayahn> evrardjp: documentation would be a good place once we finalize all the contents, but WIP information sharing might be better with more flexible tools, like a wiki
15:29:43 <srwilkers> evrardjp: that's largely my fault. it's no secret that the biggest documentation gap we have is the LMA stack
15:29:49 <mattmceuen> Sorry guys, great discussion but we need to move on unfortunately
15:29:54 <srwilkers> mattmceuen: agreed
15:29:57 <mattmceuen> Let's touch point next week
15:30:01 <jayahn> mattmceuen: I will review srwilkers' pruning work, and try to leverage that
15:30:25 <mattmceuen> Thanks jayahn, hopefully it's a quick & easy win for you to learn from our pain :)
15:30:59 <mattmceuen> Ok speaking of this and going slightly out of order as it's probably related to this topic
15:31:07 <mattmceuen> #topic Korean Documentation
15:31:17 <portdirect> oh
15:31:40 <portdirect> so - we have some awesome work being done by korean speaking community members
15:31:53 <portdirect> and they have some fantastic docs
15:32:03 <evrardjp> that's nice :)
15:32:09 <evrardjp> is that linked to the i18n team?
15:32:15 <portdirect> not yet!
15:32:23 <jayahn> not yet. :)
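For the capacity questions raised here, the upstream Prometheus rule of thumb is disk ≈ retention × ingested samples per second × bytes per sample, with local storage averaging roughly 1-2 bytes per sample. A back-of-envelope example with assumed numbers (3,000 series per node after pruning, 10 nodes, a 15 s scrape interval, 15 days retention):

```latex
% Illustrative Prometheus capacity estimate; all inputs are assumptions.
\[
  \text{samples/s} = \frac{3000 \times 10}{15\,\text{s}} = 2000
\]
\[
  \text{disk} \approx 15\,\text{d} \times 86400\,\tfrac{\text{s}}{\text{d}}
    \times 2000\,\tfrac{\text{samples}}{\text{s}}
    \times 2\,\tfrac{\text{B}}{\text{sample}} \approx 5.2\,\text{GB}
\]
```

The 500 GB-in-a-week figure reported above implies a far higher ingest rate than this sketch assumes, which is exactly what the exporter-by-exporter review and pruning work should help quantify.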
15:32:25 <evrardjp> sorry, go ahead :)
15:32:38 <portdirect> jayahn: can we work together to get korean docs up for osh
15:32:49 <portdirect> so your team can start moving work upstream?
15:32:54 <jayahn> that would be no problem.
15:33:11 <jayahn> so it would be korean docs? no need to translate to english?
15:33:29 <portdirect> in what would be an awesome bit of reversal, i think the other english speakers would be happy to help translate them into english
15:33:33 <evrardjp> jayahn: I guess you still need to have upstream english, but that can go through the i18n process to publish korean docs
15:34:00 <portdirect> evrardjp: i think we need to work out how to handle this case prob a bit differently
15:34:01 <evrardjp> if it's following the standard process :)
15:34:16 <evrardjp> yeah I guess the first step would be to do it the other way around?
15:34:17 <portdirect> as there are more korean docs than english....
15:34:22 <portdirect> i think so?
15:34:27 <jayahn> okay. I will talk to ian.choi, the previous i18n PTL
15:34:31 <evrardjp> yeah, but I am not sure the tooling is ready for that.
15:34:41 <portdirect> jayahn: can you loop me in on that please
15:34:41 <evrardjp> jayahn: that's great, I was planning to suggest that :)
15:35:13 <mattmceuen> I think this is a great idea
15:35:22 <jayahn> we both (ian.choi and myself) will be at the PTG, we can do a f2f discussion on this topic as well
15:35:24 <evrardjp> if you need help on the english side, shoot. I think good docs are a big factor in ramping up community size.
15:35:38 <portdirect> could not agree more evrardjp
15:35:45 <portdirect> and seeing things like this: https://usercontent.irccloud-cdn.com/file/mtrps177/Calico%20Exporter.pdf
15:36:02 <portdirect> makes me sad, as this is such a great resource to have
15:36:02 <evrardjp> jayahn: let's plan that PTG part in a separate channel :)
15:36:06 <srwilkers> jayahn: youre coming to denver? time for more beer
15:36:41 <jayahn> I told the foundation that I only have a budget to do a single trip between PTG and Summit
15:36:50 <portdirect> so - mattmceuen can we get an action item to get this worked out at ptg
15:36:51 <evrardjp> so I guess the question was: do we all agree to bring more docs from jayahn to upstream, and how we do things, right?
15:36:52 <jayahn> they kindly offered me a free hotel
15:36:59 <mattmceuen> I will add it to the agenda
15:37:07 <portdirect> evrardjp: 100%
15:37:13 <mattmceuen> oh that's awesome jayahn. #thanksOSF!!!
15:37:27 <jayahn> okay. doing upstream in korean is really fantastic!
15:37:29 <evrardjp> that's cool indeed :)
15:37:58 <mattmceuen> Alrighty - anything else before we move on?
15:37:59 <evrardjp> should we discuss the technicalities more at the PTG, now that ppl are in agreement we should bring your things in?
15:38:13 <evrardjp> mattmceuen: I guess we agree there :)
15:38:15 <mattmceuen> I think that'll be easier
15:38:25 <mattmceuen> we can move in that direction ahead of time
15:38:39 <mattmceuen> but lets plan on having things in good shape by the time PTG is over
15:38:53 <evrardjp> I think Frank or Ian's input would be valuable here.
15:39:00 <alanmeadows> jayahn / evrardjp: quick question, to be able to report things like calico being unable to peer to prometheus, are you running prometheus and all scrapers in host networking mode
15:39:49 <jayahn> unfortunately, i am not an expert on that, but I will ask hyunsun and get back to you. just put your question on the etherpad. :)
15:40:42 <portdirect> jayahn: is there a reason your team cant attend these? time/language etc?
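On alanmeadows' host-networking question: exporters that need to reach services bound to the node itself (calico's BIRD daemon, for instance) are typically run in the host network namespace. An illustrative DaemonSet fragment follows, with hypothetical names and image, showing what "host networking mode" means in practice; it is not taken from any openstack-helm chart:

```yaml
# Illustrative fragment only; names and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: calico-exporter
spec:
  selector:
    matchLabels:
      app: calico-exporter
  template:
    metadata:
      labels:
        app: calico-exporter
    spec:
      hostNetwork: true                        # share the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet       # keep cluster DNS despite hostNetwork
      containers:
        - name: exporter
          image: example/calico-exporter:latest   # placeholder image
          ports:
            - containerPort: 9091              # scrape port exposed on the node IP
```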
15:40:57 <jayahn> time and language
15:41:05 <portdirect> lol - the double whammy
15:41:09 <evrardjp> :)
15:41:28 <jayahn> dan, robert often attend these. they have english capability.
15:41:37 <portdirect> the other thing i'd like to discuss at the ptg is how to bridge that gap a bit better
15:41:47 <jayahn> but most of the others do not
15:42:22 <jayahn> i totally agree.. it has been a very difficult point for me as well.
15:42:54 <portdirect> lets start on the docs - and use that as a way to close the language barrier better
15:43:05 <mattmceuen> Next week let's revisit meeting timing -- we still haven't found a time that works well for everyone
15:43:22 <mattmceuen> But if we can try harder and find a good time that would be really valuable
15:43:43 <mattmceuen> Alright gotta keep movin'
15:43:45 <mattmceuen> #topic Moving config to secrets
15:43:53 <portdirect> oh hai
15:44:21 <portdirect> so - I'm working to move much of the config we have for openstack services to k8s secrets from configmaps
15:44:36 <evrardjp> \o/
15:44:40 <portdirect> this should bring us a few wins
15:44:57 <portdirect> 1) stop writing passwords/creds to disk on nodes
15:45:30 <portdirect> 2) give us more granular control on rbac for ops teams*
15:45:50 <portdirect> 3) let us leverage k8s secrets backends etc
15:46:26 <portdirect> * this will need to fully come in in follow up work, when we start to split out 'config' from 'sensitive config'
15:47:02 <portdirect> Just wanted to highlight this - as it will be a bit disruptive for some work in flight
15:47:13 <portdirect> but i think it moves us in the right direction.
15:47:26 <evrardjp> it's positive disruption -- maybe using release notes would help :p
15:47:44 <mattmceuen> Making sure I understand the last part: is this the path
15:47:44 <mattmceuen> 1) None of the configs are secrets today
15:47:44 <mattmceuen> 2) All configs that contain passwords etc will be secrets soon
15:47:44 <mattmceuen> 3) More fine-grained split between the two in the future
15:47:44 <mattmceuen> ?
15:47:57 <portdirect> 1) yup
15:48:08 <portdirect> 2) yup
15:48:44 <portdirect> 3) yeah
15:49:03 <portdirect> * three may take some time to implement, and frankly may not be possible
15:49:10 <portdirect> but thats the intent
15:49:13 <mattmceuen> #2 is my favorite
15:49:20 <srwilkers> ++
15:49:21 <mattmceuen> But yeah - #3 would be nice
15:49:22 <evrardjp> ++
15:49:30 <mattmceuen> That's awesome portdirect
15:50:20 <mattmceuen> Any questions on secrecy before we move on
15:50:36 <evrardjp> none, positive improvement, thanks portdirect
15:50:36 <mattmceuen> #topic Tempest
15:50:54 <mattmceuen> We have several colors of lavender in the etherpad, I think this may be you jayahn :)
15:51:23 <jayahn> just curious on tempest usage
15:51:37 <mattmceuen> Sharing the full question:
15:51:37 <mattmceuen> AT&T uses tempest? We found out that the "regex, blacklist, whitelist" part is not working well. tempest 19.0.0 is required for pike; the regex generation logic has changed from the "currently available tempest 13.0.0 on osh upstream". Just curious how gating or AT&T uses tempest. We think tempest needs to be fixed, similar to rally.
15:52:14 <jayahn> yeah.. that
15:52:16 <mattmceuen> We are still integrating tempest into our downstream gating
15:52:55 <evrardjp> tempest 19.0.0 is required in rocky for keystone api testing, if you do it. 18.0.0 will not work.
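A minimal sketch of the configmap-to-secret move portdirect outlines, assuming a hypothetical Secret name (keystone-etc) and placeholder config content; the point is simply that the rendered config file is carried in a Secret object rather than a ConfigMap:

```yaml
# Minimal sketch, not the actual openstack-helm template output.
# Name and config contents below are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: keystone-etc
type: Opaque
stringData:
  keystone.conf: |
    [database]
    connection = mysql+pymysql://keystone:password@mariadb/keystone
```

When mounted into a pod, a Secret volume is backed by tmpfs on the node, whereas a ConfigMap volume is written out under the kubelet's directory on disk, which is the first win in the list above.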
15:53:07 <evrardjp> and queens
15:53:28 <portdirect> the tempest chart we have today is very unloved :(
15:53:43 <portdirect> and could do with a blanket, and some cocoa.
15:53:46 <jayahn> so like the discussion we had with rally, we need to find a good way to keep a tempest version for each openstack release, and have corresponding values
15:53:48 <srwilkers> i love it only enough to kick it every now and then
15:54:01 <mattmceuen> rough crowd!
15:54:21 <evrardjp> jayahn: so, for OSA, we are using tempest 18.0.0 for everything until rocky.
15:54:37 <evrardjp> that should work, as tempest is supposed to be backwards compatible
15:54:43 <portdirect> evrardjp: we should make that same shift then
15:55:20 <evrardjp> if you point me to your whitelist/blacklist, I can help on which version should be required per upstream branch
15:55:41 <evrardjp> but we ourselves are thinking of moving everything to smoke.
15:55:55 <portdirect> ++ this makes sense for community gates
15:56:11 <jayahn> we did manage to make it work.
15:56:23 <mattmceuen> what did you do to get it working jayahn?
15:56:35 <portdirect> jayahn: can you get a ps up, with the changes you made?
15:57:36 <evrardjp> portdirect: indeed, for community, I'd think that smoke tests are fine. You can do more thorough tests in periodics or internally.
15:58:19 <mattmceuen> ++
15:58:21 <jayahn> portdirect: okay
15:58:48 <mattmceuen> alright guys - we're at a couple minutes to time
15:58:53 <mattmceuen> #topic Roundtable
15:59:06 <mattmceuen> I will move the things we didn't get to today to next week, sorry for not hitting everything today
15:59:07 <jayahn> pls review the PS. :)
15:59:13 <mattmceuen> Yes!
15:59:16 <portdirect> one big thing - helm 2.10 is here!
15:59:21 <mattmceuen> helm yeah!
15:59:32 <portdirect> so expect to see some tls related patches from ruslan and I ;)
15:59:36 <jayahn> yeah!
16:00:02 <evrardjp> thanks everyone
16:00:43 <goutham1> By any chance did anyone go through this
16:00:55 <goutham1> https://storyboard.openstack.org/#!/story/2003507
16:01:27 <goutham1> portdirect: u said u will check yesterday, did u find anything??
16:01:40 <mattmceuen> Gotta shut down the meeting goutham1 - can we move this into #openstack-helm?
16:01:47 <mattmceuen> Thanks all!
16:01:53 <mattmceuen> #endmeeting