09:02:12 #startmeeting self-healing
09:02:13 Meeting started Wed Dec 19 09:02:12 2018 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:02:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:02:16 The meeting name has been set to 'self_healing'
09:02:30 #topic vitrage / monasca integration
09:02:54 so, what's new with this? :)
09:03:05 https://review.openstack.org/#/c/622899/
09:03:31 The integration is almost done, but we need to solve the issue of identifying in Vitrage the resource that the alarm is raised on
09:03:36 we have to clarify the mapping between Monasca alarms and Vitrage entities/resources
09:04:08 awesome
09:04:14 BTW, once we agree on the conceptual design, I think that we might be able to close the current change with minimal fixes, and do the complete solution later in a different change
09:04:26 +1
09:04:29 makes sense
09:04:52 nice to see my colleague Joe on the review :)
09:04:59 I should get him on this channel
09:05:03 witek: did you see my last mail? I suggested a solution, but I’m not familiar enough with Monasca so I need your approval that it will work
09:05:35 yes, started writing an answer yesterday, will send today
09:05:47 in general I think it should work
09:05:47 Cool, thanks
09:06:28 was wondering if you want to implement it as `global` mapping, or add to resource entity definition in Vitrage template?
09:06:58 I thought about a global mapping (single configuration file), but let me think it over
09:07:22 The benefit of the global mapping is that the same alarm can be easily reused in several templates
09:07:29 true
09:07:46 Any disadvantages in your opinion?
09:08:32 I think global file would cover most of use cases, but definition in template might be more flexible
09:09:15 but my knowledge about Vitrage is very limited, so I might be wrong
09:09:36 I need to think about it. Do you have an example?
09:10:16 http_status on node or VIP
09:10:35 How is it defined? it was probably written in the mail
09:10:54 "name": "http_status", "dimensions": {"hostname": "node1",
09:10:55 "service": "keystone", "url": "http://node1/identity"
09:10:55 }
09:11:00 This one, right?
09:11:18 yes
09:11:37 And why do you think we should handle it in the template?
09:12:20 when configured with node URL, gives information about the service on the node
09:12:45 sorry, I don’t understand
09:12:56 when configured with VIP URL, information is one layer higher, for load-balanced service
09:13:03 newbie question: how is it currently done with Zabbix?
09:14:01 we are facing similar questions with Zabbix. So far we are using it for monitoring hosts, and these are statically defined in a zabbix_conf file. For monitoring vms, interfaces etc we didn’t implement a good solution yet
09:14:07 ah ok
09:14:15 we are also facing this question with Prometheus...
09:14:33 is it worth writing a spec for this maybe?
09:14:39 or is that too heavy-weight?
09:14:49 of course, I’m trying to understand what this spec should include
09:14:54 got it :-)
09:15:10 I don’t think it’s too heavy-weight, and it is definitely something we should handle
09:15:45 witek: let me get back to your question. You are saying that depending on the URL we should figure out the resource type?
09:15:50 well, the spec could list multiple options but propose a preferred solution and list the other(s) as alternative(s)
09:16:02 aspiers: of course
09:16:35 this is why I wanted to make a temporary fix for the current change in gerrit. But it should be a smart fix and not the existing POC code
09:17:07 ifat_afek: I think that in general operators might want to use the same metric to alarm about different things
09:17:11 the full implementation should not take a long time, the design is the complicated part
09:17:23 witek: I agree
09:18:06 and how do the operators understand what is being monitored? suppose they see the alarms in Monasca itself, do they figure it out by the resource name? by the URL?…
09:18:20 so it might be an advantage if they also have a mechanism to describe it in alarm entity definition, how a given alarm should be interpreted
09:18:43 but the interpretation should happen in Monasca first, right?
09:19:01 when you create an alarm definition, you should somehow describe what you are monitoring
09:19:02 operators are free to define their own alarms
09:19:16 they know what the metric measures
09:19:52 If I am an operator, and I see a ‘high cpu load’ alarm, how can I tell if it was raised on a vm or on a host? by the resource name? by a certain dimension?
09:20:42 BTW, aspiers, if you prefer we can take this discussion offline :-)
09:21:00 no, this is a great discussion and obviously important for self-healing :)
09:21:21 and we probably don't have any other topics today anyway :)
09:21:26 depends how the metric is collected: each monasca-agent plugin provides unique metric names
09:21:50 so, system metrics names are different than ones from libvirt plugin
09:22:25 in this case, can we have a configuration file in Vitrage that determines the resource type per metric name? will it always be 1-1 relation?
09:22:43 or maybe better, can we get this information directly from Monasca?
09:23:36 I'd say not always, but in most cases
09:24:26 if it’s most cases, then maybe we can’t do what I suggested
09:25:07 so back to my previous question - if the same metric can be used for two different resource types, how does the operator understand what the alarm was raised on?
09:25:35 I’m trying to understand the logic behind this, so I can use this logic in Vitrage
09:26:54 I think the problem appears for generic metrics, like e.g. http_check
09:27:11 which can be configured with really any http endpoint
09:28:18 and to your question, the metric is uniquely described by its name and dimension key/values
09:29:48 so I could say something like this: if metric_name==‘http_check’ and url==‘…..’ then resource_type is host and resource_id is ‘resource_id’?
09:30:09 the disadvantage is, of course, having a detailed description for every alarm
09:30:28 and the need to manually update the description once in a while
09:31:01 alternatively - does it make sense to ask for a dedicated ‘resource_type’ dimension in Monasca?
09:34:24 configuration of dedicated `resource_type` in agent could be left to operator, per convention
09:34:43 ok, so we can’t force it nor assume it is there
09:35:24 so it seems like we should have a slightly-complex configuration file. do you have a better idea?
09:36:20 #link http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000806.html mailing list discussion on the integration
09:36:37 I think we should start with defining use cases we would like to cover and check if we can do it with simple conf file
09:36:59 I think it should work for most cases
09:37:19 sounds like a good idea
09:37:46 so I can start a spec with use cases and you’ll help me
09:37:58 I'll definitely help
09:38:04 later I can add a proposed solution to the spec
09:38:17 but I agree we should start with the use cases
09:38:30 sounds good to me too
09:38:49 great. I’ll push an initial version today or tomorrow
09:38:57 cool, thanks
09:39:00 awesome!
09:39:12 where to? I was thinking about vitrage-specs, because the implementation will most likely be inside Vitrage
09:39:26 unless you think it should be in the self-healing repo
09:39:28 #action ifat_afek will submit a spec with use cases
09:39:37 yes, I think Vitrage repo is best suited
09:39:53 either is fine by me
09:41:19 witek: do you have a simple solution for the existing change? e.g. use a dimension that will work in many cases but not all, instead of the one used for the POC? just so we can say this change is finished
09:43:10 I would say, we could stay with current approach for the first version
09:43:27 will leave a comment in review
09:43:39 great, thanks
09:44:15 cool. anything else on this topic?
09:44:27 nothing on my side
09:44:36 no, thanks
09:44:37 I learned some useful things, especially that I need to improve my email filters ;-)
09:45:02 :) oh, we haven't added [self-healing]
09:45:10 my bad…
09:45:18 haha no problem X-D
09:45:59 ok, well thanks a lot both - I'm SUPER happy and excited this discussion is happening :)
09:46:14 me too!
09:46:20 yes, me too, thanks ifat_afek for launching it!
09:46:22 it's exactly the kind of cross-project work I was dreaming of for the SIG
09:47:12 I will ping Joseph and see if he can join future IRC discussions, but I see he's on the mailing list thread already anyway
09:47:19 and he's in the wrong timezone for this meeting
09:47:37 do either of you intend to join the one later today?
09:47:41 no problem at all if not
09:47:44 I plan to
09:47:53 OK, it will be his morning then
09:47:54 I'm not sure yet
09:48:19 I'll see if he can, but I guess it's not a big deal if not
09:48:51 if you included [self-healing] when announcing the spec, maybe it can get a few more reviewers
09:49:07 I will certainly review, anyway
09:49:10 usually I don’t announce specs, but I can do it this time
09:49:29 because indeed it is interesting, and I’ll be happy to hear more opinions
09:49:33 great :)
09:49:57 alright, just very briefly for the record...
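The global mapping idea from the vitrage / monasca topic (metric name plus dimension values deciding the resource type and resource id) could look roughly like the sketch below. It is only an illustration of the rule-matching idea raised in the discussion, not the design that was agreed or anything implemented in Vitrage: `MappingRule`, `resolve_resource`, and `RULES` are hypothetical names, and the metric names and resource types in the rules are assumed examples.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MappingRule:
    """One hypothetical mapping rule (not an actual Vitrage config schema)."""
    metric_name: str      # Monasca metric name the rule applies to
    resource_type: str    # resource type to report when the rule matches
    id_dimension: str     # dimension whose value is used as the resource id
    # Extra dimension key/values that must match exactly (e.g. a specific URL)
    match_dimensions: Dict[str, str] = field(default_factory=dict)


def resolve_resource(alarm: dict,
                     rules: List[MappingRule]) -> Optional[Tuple[str, str]]:
    """Return (resource_type, resource_id) from the first matching rule, else None."""
    dims = alarm.get("dimensions", {})
    for rule in rules:
        if alarm.get("name") != rule.metric_name:
            continue
        # Skip the rule if any required dimension is missing or different
        if any(dims.get(k) != v for k, v in rule.match_dimensions.items()):
            continue
        resource_id = dims.get(rule.id_dimension)
        if resource_id is not None:
            return rule.resource_type, resource_id
    return None


# Illustrative rules only; order matters, so more specific rules come first.
RULES = [
    MappingRule("http_status", "nova.host", "hostname",
                match_dimensions={"url": "http://node1/identity"}),
    MappingRule("cpu.user_perc", "nova.host", "hostname"),
    MappingRule("vm.cpu.utilization_perc", "nova.instance", "resource_id"),
]

# The example alarm quoted in the meeting
alarm = {
    "name": "http_status",
    "dimensions": {"hostname": "node1", "service": "keystone",
                   "url": "http://node1/identity"},
}
```

Calling `resolve_resource(alarm, RULES)` walks the rules in order and stops at the first match. An alarm that no rule covers yields `None`, which reflects the concern raised in the meeting that a global file would need a detailed, manually maintained entry for each generic alarm.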
09:50:05 #topic service health check API
09:50:23 there seems to be some movement on this, since the TC have proposed it as a goal for Train
09:50:44 but there needs to be a champion goal
09:51:20 I have initiated some discussion in SUSE about this - maybe there is a chance that I or another colleague could volunteer for that
09:51:47 but we need to discuss prioritisation first, so can't guarantee anything
09:52:26 in any case, if we had a health check API across multiple services, this would presumably tie in very nicely with the Vitrage/Monasca integration efforts
09:52:57 of course, I think it’s a great initiative, and I’ll be happy if you or your colleagues can drive it forward
09:53:09 awesome, thanks
09:53:37 I think it's being discussed under [all][tc] and I added [self-healing] later IIRC
09:54:13 #link http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000599.html
09:54:19 you probably already saw that
09:54:46 right
09:54:49 not much more to say about that right now, but I thought it was worth mentioning
09:55:25 ifat_afek: should I add a task to https://storyboard.openstack.org/#!/story/2002684 for creating the vitrage-spec?
09:55:37 so you can reference it in the commit message?
09:56:19 ah sorry
09:56:21 wrong story
09:56:22 aspiers: are you sure this is the right story? this one is about Vitrage and Heat
09:56:25 :-)
09:56:25 :)
09:56:29 I'm not awake yet
09:56:42 hrm, do we have a story for the integration yet?
09:56:50 maybe need to create one
09:56:51 which I plan to progress with, BTW, but I’m not ready to update about it yet
09:57:02 we have a story in Vitrage
09:57:11 OK, I'll look for that
09:57:14 https://storyboard.openstack.org/#!/story/2004550
09:57:18 thanks!
09:57:37 alright, I guess we're done
09:57:40 And another one for the (near?) future, to accept immediate notifications
09:57:56 got it
09:57:59 https://storyboard.openstack.org/#!/story/2004064
09:58:32 ah yeah, it seems I was already subscribed to that one :)
09:59:22 OK, thanks a lot both, and maybe see you later!
09:59:23 I can add tasks for writing a spec and also for implementing this spec, on top of the initial implementation
09:59:29 see you later!
09:59:37 awesome
09:59:44 #endmeeting