#openstack-meeting-4 log

15:00:24 <portdirect> #startmeeting openstack-helm
15:00:25 <openstack> Meeting started Tue Sep 17 15:00:24 2019 UTC and is due to finish in 60 minutes.  The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 <openstack> The meeting name has been set to 'openstack_helm'
15:00:39 <portdirect> sorry im running a bit late this morning, i'll get an etherpad up now
15:00:58 <portdirect> https://etherpad.openstack.org/p/openstack-helm-meeting-2019-09-17
15:02:19 <lamt> o/
15:02:32 <stevthedev> \o
15:03:30 <itxaka> \o/
15:03:42 <rihabb> o/
15:04:28 <mattmceuen> o/
15:06:03 <portdirect> #topic topics
15:06:18 <portdirect> anything that we should be discussing this week?
15:08:28 <srwilkers> o/
15:08:51 <itxaka> spiderman out of the MCU? Good/bad/neutral, discuss.
15:08:57 <stevthedev> At some point I would like to discuss more what we have in mind re: using nagios upstream gates. I know srwilkers and I briefly touched on it the other day
15:08:59 <itxaka> it might be a bit offtopic though
15:09:13 <itxaka> oh good, real discussion items :D
15:11:00 <srwilkers> spiderman out of the MCU makes me sad
15:12:01 <itxaka> nagios makes me sad
15:12:04 <itxaka> :p
15:12:13 <srwilkers> join the clun
15:12:15 <srwilkers> club
15:12:16 <portdirect> itxaka: i think nagios makes everyone sad
15:12:33 <portdirect> so - what could we use instead of it?
15:12:46 <portdirect> its been the tool we've used for the following use cases:
15:12:51 <portdirect> * storage is down
15:13:10 <portdirect> * integration with corp ticketing/alerting systems
15:13:44 <portdirect> * simple to understand (from an ops perspective)
15:13:44 <itxaka> isnt prometheus already in the osh-infra repo?
15:14:01 <portdirect> it is, and most of what nagios reports on comes from prom
15:14:15 <portdirect> (as i understand it)
15:14:23 <stevthedev> I think that's correct
15:14:27 <itxaka> Im out of touch with monitoring systems (THANK GOD), last thing I used was mmonit years ago
15:14:55 <itxaka> now I live free of the monitoring overlords corrupting my life
15:15:02 <srwilkers> >:(
15:15:40 <srwilkers> the storage part is tricky - and our current approach with nagios comes with its own cost
15:16:15 <srwilkers> that being without a way to persist nagios's log file, we can't realistically enable alert/notification history as nagios generates that history by tracking state changes in its log file
15:16:56 <itxaka> no pvc?
15:17:24 <srwilkers> well, that's the tricky part - if we use a PVC (ie, ceph), then nagios is subject to the same limitations that drove us to use it in the way we do in the first place
15:17:32 <itxaka> to be fair this seems like an AT&T problem rather than an upstream problem, doesnt it?
15:17:34 * itxaka hides
15:17:35 <srwilkers> that being: when ceph is unhappy/misbehaving/dead, we can't really tell
15:18:17 <srwilkers> i'd say it's a nagios problem in general if you're wanting to use it to provide insight into in-cluster storage it relies on, not just an AT&T problem
15:19:21 <itxaka> how about using ses with rook? That should fix the misbehaving stuff by moving it to be a k8s problem lol
15:19:32 <itxaka> at least for the misbehaving/dead point of view
15:19:46 <portdirect> the fundamental point remains
15:20:17 <portdirect> we need something to alert that cluster storage is having issues
15:23:43 <itxaka> cant you save the history to db?
15:23:50 <itxaka> so you get history + replication?
15:25:03 <itxaka> or you do offsite log replication every X minutes so its safe and secure?
15:26:32 <itxaka> that would be local storage + remote sync of course, so you dont depend on the backend storage
15:26:44 <itxaka> but of course that means sending data offsite which is not always possible
15:30:19 <itxaka> sorry if this sounds stupid, missing context on the architecture makes it difficult to come up with decent ideas :D
15:30:47 <srwilkers> this is probably heading too far into the weeds, really.  if i had my way, i'd just have nagios reporting alerts for things like downed hosts or ceph being unhealthy/bad, and use alertmanager for everything else.  alertmanager supports smtp out of the box, just like nagios, and whatever system is listening should realistically be able to handle both with a little bit of effort
15:31:25 <srwilkers> but monitoring and alarming are two things that are likely unique enterprise-to-enterprise, so discussing a one-size-fits-all here might be a bit pie in the sky
15:33:57 * srwilkers pokes portdirect
15:36:38 * gagehugo hands srwilkers a rather large trout
15:37:04 * portdirect falls off end of wharf
15:37:16 <portdirect> sorry - got called into something
15:37:50 <portdirect> ive got very little to add at this point to the convo
15:38:02 <portdirect> im still trying to wrap my head round all the aspects atm
15:38:21 <portdirect> anything else we should think about before we move onto the plea for reviews?
15:42:34 <portdirect> #topic reviews
15:42:49 <portdirect> please can we look at the following:
15:42:51 <portdirect> https://review.opendev.org/#/c/672678/
15:42:52 <portdirect> https://review.opendev.org/#/c/670550/
15:43:14 <portdirect> itxaka: if you could help with https://review.opendev.org/#/c/672678/ it would be great
15:43:39 <itxaka> oh, interesting
15:43:42 <itxaka> will have a look
15:48:49 <portdirect> ok folks
15:49:01 <portdirect> catch you all in #openstack-helm
15:49:04 <portdirect> #endmeeting