15:00:24 <portdirect> #startmeeting openstack-helm
15:00:25 <openstack> Meeting started Tue Sep 17 15:00:24 2019 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 <openstack> The meeting name has been set to 'openstack_helm'
15:00:39 <portdirect> sorry im running a bit late this morning, i'll get an etherpad up now
15:00:58 <portdirect> https://etherpad.openstack.org/p/openstack-helm-meeting-2019-09-17
15:02:19 <lamt> o/
15:02:32 <stevthedev> \o
15:03:30 <itxaka> \o/
15:03:42 <rihabb> o/
15:04:28 <mattmceuen> o/
15:06:03 <portdirect> #topic topics
15:06:18 <portdirect> anything that we should be discussing this week?
15:08:28 <srwilkers> o/
15:08:51 <itxaka> spiderman out of the MCU? Good/bad/neutral, discuss.
15:08:57 <stevthedev> At some point I would like to discuss more what we have in mind re: using nagios upstream gates. I know srwilkers and I briefly touched on it the other day
15:08:59 <itxaka> it might be a bit offtopic though
15:09:13 <itxaka> oh good, real discussion items :D
15:11:00 <srwilkers> spiderman out of the MCU makes me sad
15:12:01 <itxaka> nagios makes me sad
15:12:04 <itxaka> :p
15:12:13 <srwilkers> join the clun
15:12:15 <srwilkers> club
15:12:16 <portdirect> itxaka: i think nagios makes everyone sad
15:12:33 <portdirect> so - what could we use instead of it?
15:12:46 <portdirect> its been the tool we've used for the following use cases:
15:12:51 <portdirect> * storage is down
15:13:10 <portdirect> * integration with corp ticketing/alerting systems
15:13:44 <portdirect> * simple to understand (from an ops perspective)
15:13:44 <itxaka> isnt prometheus already in the osh-infra repo?
15:14:01 <portdirect> it is, and most of what nagios reports on comes from prom
15:14:15 <portdirect> (as i understand it)
15:14:23 <stevthedev> I think that's correct
15:14:27 <itxaka> Im out of touch with monitoring systems (THANK GOD), last thing I used was mmonit years ago
15:14:55 <itxaka> now I live free of the monitoring overlords corrupting my life
15:15:02 <srwilkers> >:(
15:15:40 <srwilkers> the storage part is tricky - and our current approach with nagios comes with its own cost
15:16:15 <srwilkers> that being without a way to persist nagios's log file, we can't realistically enable alert/notification history as nagios generates that history by tracking state changes in its log file
15:16:56 <itxaka> no pvc?
15:17:24 <srwilkers> well, that's the tricky part - if we use a PVC (ie, ceph), then nagios is subject to the same limitations that drove us to use it in the way we do in the first place
15:17:32 <itxaka> to be fair this seems like an AT&T problem rather than an upstream problem, doesnt it?
15:17:34 * itxaka hides
15:17:35 <srwilkers> that being: when ceph is unhappy/misbehaving/dead, we can't really tell
15:18:17 <srwilkers> i'd say it's a nagios problem in general if you're wanting to use it to provide insight into in-cluster storage it relies on, not just an AT&T problem
15:19:21 <itxaka> how about using ses with rook? That should fix the misbehaving stuff by moving it to be a k8s problem lol
15:19:32 <itxaka> at least for the misbehaving/dead point of view
15:19:46 <portdirect> the fundamental point remains
15:20:17 <portdirect> we need something to alert that cluster storage is having issues
15:23:43 <itxaka> cant you save the history to db?
15:23:50 <itxaka> so you get history + replication?
15:25:03 <itxaka> or you do offsite log replication every X minutes so its safe and secure?
15:26:32 <itxaka> that would be local storage + remote sync of course, so you dont depend on the backend storage
15:26:44 <itxaka> but of course that means sending data offsite which is not always possible
15:30:19 <itxaka> sorry if this sounds stupid, missing context on the architecture makes it difficult to come up with decent ideas :D
15:30:47 <srwilkers> this is probably heading too far into the weeds, really. if i had my way, i'd just have nagios reporting alerts for things like downed hosts or ceph being unhealthy/bad, and use alertmanager for everything else. alertmanager supports smtp out of the box, just like nagios, and whatever system is listening should realistically be able to handle both with a little bit of effort
15:31:25 <srwilkers> but monitoring and alarming are two things that are likely unique enterprise-to-enterprise, so discussing a one-size-fits-all here might be a bit pie in the sky
15:33:57 * srwilkers pokes portdirect
15:36:38 * gagehugo hands srwilkers a rather large trout
15:37:04 * portdirect falls off end of wharf
15:37:16 <portdirect> sorry - got called into something
15:37:50 <portdirect> ive got very little to add at this point to the convo
15:38:02 <portdirect> im still trying to wrap my head round all the aspects atm
15:38:21 <portdirect> anything else we should think about before we move onto the plea for reviews?
15:42:34 <portdirect> #topic reviews
15:42:49 <portdirect> please can we look at the following:
15:42:51 <portdirect> https://review.opendev.org/#/c/672678/
15:42:52 <portdirect> https://review.opendev.org/#/c/670550/
15:43:14 <portdirect> itxaka: if you could help with https://review.opendev.org/#/c/672678/ it would be great
15:43:39 <itxaka> oh, interesting
15:43:42 <itxaka> will have a look
15:48:49 <portdirect> ok folks
15:49:01 <portdirect> catch you all in #openstack-helm
15:49:04 <portdirect> #endmeeting