#openstack-self-healing log

09:02:12 <aspiers> #startmeeting self-healing
09:02:13 <openstack> Meeting started Wed Dec 19 09:02:12 2018 UTC and is due to finish in 60 minutes.  The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:02:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:02:16 <openstack> The meeting name has been set to 'self_healing'
09:02:30 <aspiers> #topic vitrage / monasca integration
09:02:54 <aspiers> so, what's new with this? :)
09:03:05 <ifat_afek> https://review.openstack.org/#/c/622899/
09:03:31 <ifat_afek> The integration is almost done, but we need to solve the issue of identifying in Vitrage the resource that the alarm is raised on
09:03:36 <witek> we have to clarify the mapping between Monasca alarms and Vitrage entities/resources
09:04:08 <aspiers> awesome
09:04:14 <ifat_afek> BTW, once we agree on the conceptual design, I think that we might be able to close the current change with minimal fixes, and do the complete solution later in a different change
09:04:26 <witek> +1
09:04:29 <aspiers> makes sense
09:04:52 <aspiers> nice to see my colleague Joe on the review :)
09:04:59 <aspiers> I should get him on this channel
09:05:03 <ifat_afek> witek: did you see my last mail? I suggested a solution, but I’m not familiar enough with Monasca so I need your approval that it will work
09:05:35 <witek> yes, started writing an answer yesterday, will send today
09:05:47 <witek> in general I think it should work
09:05:47 <ifat_afek> Cool, thanks
09:06:28 <witek> was wondering if you want to implement it as `global` mapping, or add to resource entity definition in Vitrage template?
09:06:58 <ifat_afek> I thought about a global mapping (single configuration file), but let me think it over
09:07:22 <ifat_afek> The benefit of the global mapping is that the same alarm can be easily reused in several templates
09:07:29 <witek> true
09:07:46 <ifat_afek> Any disadvantages in your opinion?
09:08:32 <witek> I think global file would cover most of use cases, but definition in template might be more flexible
09:09:15 <witek> but my knowledge about Vitrage is very limited, so I might be wrong
09:09:36 <ifat_afek> I need to think about it. Do you have an example?
09:10:16 <witek> http_status on node or VIP
09:10:35 <ifat_afek> How is it defined? it was probably written in the mail
09:10:54 <ifat_afek> `name`: `http_status`, `dimensions`: {`hostname`: `node1`,
09:10:55 <ifat_afek> `service`: `keystone`, `url`: `http://node1/identity`}
09:11:00 <ifat_afek> This one, right?
09:11:18 <witek> yes
09:11:37 <ifat_afek> And why do you think we should handle it in the template?
09:12:20 <witek> when configured with node URL, gives information about the service on the node
09:12:45 <ifat_afek> sorry, I don’t understand
09:12:56 <witek> when configured with VIP URL, information is one layer higher, for load-balanced service
09:13:03 <aspiers> newbie question: how is it currently done with Zabbix?
09:14:01 <ifat_afek> we are facing similar questions with Zabbix. So far we are using it for monitoring hosts, and these are statically defined in a zabbix_conf file. For monitoring vms, interfaces etc we didn’t implement a good solution yet
09:14:07 <aspiers> ah ok
09:14:15 <ifat_afek> we are also facing this question with Prometheus...
09:14:33 <aspiers> is it worth writing a spec for this maybe?
09:14:39 <aspiers> or is that too heavy-weight?
09:14:49 <ifat_afek> of course, I’m trying to understand what this spec should include
09:14:54 <aspiers> got it :-)
09:15:10 <ifat_afek> I don’t think it’s too heavy-weight, and it is definitely something we should handle
09:15:45 <ifat_afek> witek: let me get back to your question. You are saying that depending on the URL we should figure out the resource type?
09:15:50 <aspiers> well, the spec could list multiple options but propose a preferred solution and list the other(s) as alternative(s)
09:16:02 <ifat_afek> aspiers: of course
09:16:35 <ifat_afek> this is why I wanted to make a temporary fix for the current change in gerrit. But it should be a smart fix and not the existing POC code
09:17:07 <witek> ifat_afek: I think that in general operators might want to use the same metric to alarm about different things
09:17:11 <ifat_afek> the full implementation should not take a long time, the design is the complicated part
09:17:23 <ifat_afek> witek: I agree
09:18:06 <ifat_afek> and how do the operators understand what is being monitored? suppose they see the alarms in Monasca itself, do they figure it out by the resource name? by the URL?…
09:18:20 <witek> so it might be an advantage if they also have a mechanism to describe it in alarm entity definition, how a given alarm should be interpreted
09:18:43 <ifat_afek> but the interpretation should happen in Monasca first, right?
09:19:01 <ifat_afek> when you create an alarm definition, you should somehow describe what you are monitoring
09:19:02 <witek> operators are free to define their own alarms
09:19:16 <witek> they know, what the metric measures
09:19:52 <ifat_afek> If I am an operator, and I see a ‘high cpu load’ alarm, how can I tell if it was raised on a vm or on a host? by the resource name? by a certain dimension?
09:20:42 <ifat_afek> BTW, aspiers, if you prefer we can take this discussion offline :-)
09:21:00 <aspiers> no, this is a great discussion and obviously important for self-healing :)
09:21:21 <aspiers> and we probably don't have any other topics today anyway :)
09:21:26 <witek> depends how the metric is collected: each monasca-agent plugin provides unique metric names
09:21:50 <witek> so, system metric names are different than the ones from the libvirt plugin
09:22:25 <ifat_afek> in this case, can we have a configuration file in Vitrage that determines the resource type per metric name? will it always be 1-1 relation?
09:22:43 <ifat_afek> or maybe better, can we get this information directly from Monasca?
09:23:36 <witek> I'd say not always, but in most cases
09:24:26 <ifat_afek> if it’s most cases, then maybe we can’t do what I suggested
09:25:07 <ifat_afek> so back to my previous question - if the same metric can be used for two different resource types, how does the operator understand what the alarm was raised on?
09:25:35 <ifat_afek> I’m trying to understand the logic behind this, so I can use this logic in Vitrage
09:26:54 <witek> I think the problem appears for generic metrics, like e.g. http_check
09:27:11 <witek> which can be configured with really any http endpoint
09:28:18 <witek> and to your question, the metric is uniquely described by its name and dimension key/values
09:29:48 <ifat_afek> so I could say something like this: if metric_name==‘http_check’ and url==‘…..’ then resource_type is host and resource_id is ‘resource_id’?
09:30:09 <ifat_afek> the disadvantage is, of course, having a detailed description for every alarm
09:30:28 <ifat_afek> and the need to manually update the description once in a while
09:31:01 <ifat_afek> alternatively - does it make sense to ask for a dedicated ‘resource_type’ dimension in Monasca?
09:34:24 <witek> configuration of dedicated `resource_type` in agent could be left to operator, per convention
09:34:43 <ifat_afek> ok, so we can’t force it nor assume it is there
09:35:24 <ifat_afek> so it seems like we should have a slightly-complex configuration file. do you have a better idea?
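A minimal sketch of the "slightly-complex configuration file" idea discussed above, assuming a rule format that is purely hypothetical: it is not Vitrage's or Monasca's actual API. Each rule matches an alarm by metric name plus optional dimension values and names the dimension that holds the resource identifier. `MappingRule`, `resolve_resource`, the `resource_id` dimension, the VIP URL and the `load_balanced_service` type are all illustrative assumptions; only the `http_status` alarm and its dimensions come from the discussion. An operator-set `resource_type` dimension is honoured when present but never required.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MappingRule:
    """One hypothetical entry of a global alarm-to-resource mapping file."""
    metric_name: str        # Monasca metric name, e.g. "http_status"
    resource_type: str      # resource type to report for matching alarms
    id_dimension: str       # dimension whose value identifies the resource
    match_dimensions: Dict[str, str] = field(default_factory=dict)


def resolve_resource(alarm: dict,
                     rules: List[MappingRule]) -> Optional[Tuple[str, str]]:
    """Return (resource_type, resource_id) for an alarm, or None if unmapped."""
    name = alarm.get("name")
    dims = alarm.get("dimensions", {})

    # Assumed convention: an operator may set these dimensions in the agent,
    # but they cannot be forced, so they are only used when both exist.
    if "resource_type" in dims and "resource_id" in dims:
        return dims["resource_type"], dims["resource_id"]

    # Otherwise fall back to the first rule that matches the metric name
    # and every dimension value the rule requires.
    for rule in rules:
        if rule.metric_name != name:
            continue
        if any(dims.get(k) != v for k, v in rule.match_dimensions.items()):
            continue
        if rule.id_dimension in dims:
            return rule.resource_type, dims[rule.id_dimension]
    return None


# The same generic metric mapped to two resource types, depending on the URL.
RULES = [
    MappingRule("http_status", "host", "hostname",
                {"url": "http://node1/identity"}),
    # Hypothetical VIP URL and resource type, for illustration only.
    MappingRule("http_status", "load_balanced_service", "service",
                {"url": "http://vip/identity"}),
]

node_alarm = {
    "name": "http_status",
    "dimensions": {"hostname": "node1", "service": "keystone",
                   "url": "http://node1/identity"},
}
vip_alarm = {
    "name": "http_status",
    "dimensions": {"service": "keystone", "url": "http://vip/identity"},
}

print(resolve_resource(node_alarm, RULES))  # ('host', 'node1')
print(resolve_resource(vip_alarm, RULES))
```

The sketch also shows the disadvantage raised above: every generic alarm needs its own detailed rule, and the rules have to be updated by hand whenever the monitored endpoints change.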
09:36:20 <aspiers> #link http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000806.html mailing list discussion on the integration
09:36:37 <witek> I think we should start with defining use cases we would like to cover and check if we can do it with simple conf file
09:36:59 <witek> I think it should work for most cases
09:37:19 <ifat_afek> sounds like a good idea
09:37:46 <ifat_afek> so I can start a spec with use cases and you’ll help me
09:37:58 <witek> I'll definitely help
09:38:04 <ifat_afek> later I can add a proposed solution to the spec
09:38:17 <ifat_afek> but I agree we should start with the use cases
09:38:30 <aspiers> sounds good to me too
09:38:49 <ifat_afek> great. I’ll push an initial version today or tomorrow
09:38:57 <witek> cool, thanks
09:39:00 <aspiers> awesome!
09:39:12 <ifat_afek> where to? I was thinking about vitrage-specs, because the implementation will most likely be inside Vitrage
09:39:26 <ifat_afek> unless you think it should be in the self-healing repo
09:39:28 <aspiers> #action ifat_afek will submit a spec with use cases
09:39:37 <witek> yes, I think Vitrage repo is best suited
09:39:53 <aspiers> either is fine by me
09:41:19 <ifat_afek> witek: do you have a simple solution for the existing change? e.g. use a dimension that will work in many cases but not all, instead of the one used for the POC? just so we can say this change is finished
09:43:10 <witek> I would say, we could stay with current approach for the first version
09:43:27 <witek> will leave a comment in review
09:43:39 <ifat_afek> great, thanks
09:44:15 <aspiers> cool. anything else on this topic?
09:44:27 <ifat_afek> nothing on my side
09:44:36 <witek> no, thanks
09:44:37 <aspiers> I learned some useful things, especially that I need to improve my email filters ;-)
09:45:02 <witek> :) oh, we haven't added [self-healing]
09:45:10 <ifat_afek> my bad…
09:45:18 <aspiers> haha no problem X-D
09:45:59 <aspiers> ok, well thanks a lot both - I'm SUPER happy and excited this discussion is happening :)
09:46:14 <ifat_afek> me too!
09:46:20 <witek> yes, me too, thanks ifat_afek for launching it!
09:46:22 <aspiers> it's exactly the kind of cross-project work I was dreaming of for the SIG
09:47:12 <aspiers> I will ping Joseph and see if he can join future IRC discussions, but I see he's on the mailing list thread already anyway
09:47:19 <aspiers> and he's in the wrong timezone for this meeting
09:47:37 <aspiers> do either of you intend to join the one later today?
09:47:41 <aspiers> no problem at all if not
09:47:44 <ifat_afek> I plan to
09:47:53 <aspiers> OK, it will be his morning then
09:47:54 <witek> I'm not sure yet
09:48:19 <aspiers> I'll see if he can, but I guess it's not a big deal if not
09:48:51 <aspiers> if you include [self-healing] when announcing the spec, maybe it can get a few more reviewers
09:49:07 <aspiers> I will certainly review, anyway
09:49:10 <ifat_afek> usually I don’t announce specs, but I can do it this time
09:49:29 <ifat_afek> because indeed it is interesting, and I’ll be happy to hear more opinions
09:49:33 <aspiers> great :)
09:49:57 <aspiers> alright, just very briefly for the record...
09:50:05 <aspiers> #topic service health check API
09:50:23 <aspiers> there seems to be some movement on this, since the TC have proposed it as a goal for Train
09:50:44 <aspiers> but there needs to be a goal champion
09:51:20 <aspiers> I have initiated some discussion in SUSE about this - maybe there is a chance that I or another colleague could volunteer for that
09:51:47 <aspiers> but we need to discuss prioritisation first, so can't guarantee anything
09:52:26 <aspiers> in any case, if we had a health check API across multiple services, this would presumably tie in very nicely with the Vitrage/Monasca integration efforts
09:52:57 <ifat_afek> of course, I think it’s a great initiative, and I’ll be happy if you or your colleagues can drive it forward
09:53:09 <aspiers> awesome, thanks
09:53:37 <aspiers> I think it's being discussed under [all][tc] and I added [self-healing] later IIRC
09:54:13 <aspiers> #link http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000599.html
09:54:19 <aspiers> you probably already saw that
09:54:46 <ifat_afek> right
09:54:49 <aspiers> not much more to say about that right now, but I thought it was worth mentioning
09:55:25 <aspiers> ifat_afek: should I add a task to https://storyboard.openstack.org/#!/story/2002684 for creating the vitrage-spec?
09:55:37 <aspiers> so you can reference it in the commit message?
09:56:19 <aspiers> ah sorry
09:56:21 <aspiers> wrong story
09:56:22 <ifat_afek> aspiers: are you sure this is the right story? this one is about Vitrage and Heat
09:56:25 <ifat_afek> :-)
09:56:25 <aspiers> :)
09:56:29 <aspiers> I'm not awake yet
09:56:42 <aspiers> hrm, do we have a story for the integration yet?
09:56:50 <aspiers> maybe need to create one
09:56:51 <ifat_afek> which I plan to progress with, BTW, but I’m not ready to update about it yet
09:57:02 <ifat_afek> we have a story in Vitrage
09:57:11 <aspiers> OK, I'll look for that
09:57:14 <ifat_afek> https://storyboard.openstack.org/#!/story/2004550
09:57:18 <aspiers> thanks!
09:57:37 <aspiers> alright, I guess we're done
09:57:40 <ifat_afek> And another one for the (near?) future, to accept immediate notifications
09:57:56 <aspiers> got it
09:57:59 <ifat_afek> https://storyboard.openstack.org/#!/story/2004064
09:58:32 <aspiers> ah yeah, it seems I was already subscribed to that one :)
09:59:22 <aspiers> OK, thanks a lot both, and maybe see you later!
09:59:23 <ifat_afek> I can add tasks for writing a spec and also for implementing this spec, on top of the initial implementation
09:59:29 <ifat_afek> see you later!
09:59:37 <aspiers> awesome
09:59:44 <aspiers> #endmeeting