15:00:24 #startmeeting openstack-helm
15:00:25 Meeting started Tue Sep 17 15:00:24 2019 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 The meeting name has been set to 'openstack_helm'
15:00:39 sorry im running a bit late this morning, i'll get an etherpad up now
15:00:58 https://etherpad.openstack.org/p/openstack-helm-meeting-2019-09-17
15:02:19 o/
15:02:32 \o
15:03:30 \o/
15:03:42 o/
15:04:28 o/
15:06:03 #topic topics
15:06:18 anything that we should be discussing this week?
15:08:28 o/
15:08:51 spiderman out of the MCU? Good/bad/neutral, discuss.
15:08:57 At some point I would like to discuss more what we have in mind re: using nagios in the upstream gates. I know srwilkers and I briefly touched on it the other day
15:08:59 it might be a bit offtopic though
15:09:13 oh good, real discussion items :D
15:11:00 spiderman out of the MCU makes me sad
15:12:01 nagios makes me sad
15:12:04 :p
15:12:13 join the clun
15:12:15 club
15:12:16 itxaka: i think nagios makes everyone sad
15:12:33 so - what could we use instead of it?
15:12:46 its been the tool we've used for the following use cases:
15:12:51 * alerting when storage is down
15:13:10 * integration with corp ticketing/alerting systems
15:13:44 * simple to understand (from an ops perspective)
15:13:44 isnt prometheus already in the osh-infra repo?
15:14:01 it is, and most of what nagios reports on comes from prom
15:14:15 (as i understand it)
15:14:23 I think that's correct
15:14:27 Im out of touch with monitoring systems (THANK GOD), last thing I used was mmonit years ago
15:14:55 now I live free of the monitoring overlords corrupting my life
15:15:02 >:(
15:15:40 the storage part is tricky - and our current approach with nagios comes with its own cost
15:16:15 that being: without a way to persist nagios's log file, we can't realistically enable alert/notification history, as nagios generates that history by tracking state changes in its log file
15:16:56 no pvc?
15:17:24 well, that's the tricky part - if we use a PVC (ie, ceph), then nagios is subject to the same limitations that drove us to use it in the way we do in the first place
15:17:32 to be fair this seems like an AT&T problem rather than an upstream problem, doesnt it?
15:17:34 * itxaka hides
15:17:35 that being: when ceph is unhappy/misbehaving/dead, we can't really tell
15:18:17 i'd say it's a nagios problem in general if you're wanting to use it to provide insight into in-cluster storage it relies on, not just an AT&T problem
15:19:21 how about using ses with rook? That should fix the misbehaving stuff by moving it to be a k8s problem lol
15:19:32 at least for the misbehaving/dead point of view
15:19:46 the fundamental point remains
15:20:17 we need something to alert that cluster storage is having issues
15:23:43 cant you save the history to a db?
15:23:50 so you get history + replication?
15:25:03 or you do offsite log replication every X minutes so its safe and secure?
15:26:32 that would be local storage + remote sync of course, so you dont depend on the backend storage
15:26:44 but of course that means sending data offsite which is not always possible
15:30:19 sorry if this sounds stupid, missing context on the architecture makes it difficult to come up with decent ideas :D
15:30:47 this is probably heading too far into the weeds, really. if i had my way, i'd just have nagios reporting alerts for things like downed hosts or ceph being unhealthy/bad, and use alertmanager for everything else. alertmanager supports smtp out of the box, just like nagios, and whatever system is listening should realistically be able to handle both with a little bit of effort
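A minimal sketch of the kind of prometheus alerting rule the "alert that cluster storage is having issues" point above could translate to, assuming the ceph-mgr prometheus module is exposing the ceph_health_status metric to the prometheus deployment in osh-infra; the alert name, duration, and labels are illustrative, not values taken from the charts:

groups:
  - name: ceph-storage
    rules:
      # ceph_health_status is 0 for HEALTH_OK, 1 for HEALTH_WARN, 2 for HEALTH_ERR
      - alert: CephClusterUnhealthy
        expr: ceph_health_status > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ceph is reporting a non-OK health status"
          description: "ceph_health_status has been {{ $value }} for 5 minutes"

Note this keeps the storage check inside prometheus rather than nagios, which is the trade-off the discussion circles around: the rule only fires while prometheus itself is healthy, so the "ceph is dead and nothing can tell us" case still needs a check that does not depend on in-cluster storage.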
15:31:25 but monitoring and alarming are two things that are likely unique enterprise-to-enterprise, so discussing a one-size-fits-all here might be a bit pie in the sky
15:33:57 * srwilkers pokes portdirect
15:36:38 * gagehugo hands srwilkers a rather large trout
15:37:04 * portdirect falls off the end of the wharf
15:37:16 sorry - got called into something
15:37:50 ive got very little to add at this point to the convo
15:38:02 im still trying to wrap my head round all the aspects atm
15:38:21 anything else we should think about before we move onto the plea for reviews?
15:42:34 #topic reviews
15:42:49 please can we look at the following:
15:42:51 https://review.opendev.org/#/c/672678/
15:42:52 https://review.opendev.org/#/c/670550/
15:43:14 itxaka: if you could help with https://review.opendev.org/#/c/672678/ it would be great
15:43:39 oh, interesting
15:43:42 will have a look
15:48:49 ok folks
15:49:01 catch you all in #openstack-helm
15:49:04 #endmeeting
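For the "alertmanager supports smtp out of the box" point made at 15:30:47, a minimal sketch of an alertmanager configuration with an email receiver; the smarthost, addresses, and credentials are placeholders, not values from any deployed chart:

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'changeme'

route:
  # send everything to the email receiver by default
  receiver: ops-email
  group_by: ['alertname', 'severity']

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true

A corporate ticketing or paging system that already consumes smtp notifications from nagios could be pointed at the same mailbox, which is the "handle both with a little bit of effort" argument made in the discussion.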