17:04:55 <aspiers> #startmeeting self-healing
17:04:56 <openstack> Meeting started Wed Apr 10 17:04:55 2019 UTC and is due to finish in 60 minutes.  The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:04:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:04:59 <openstack> The meeting name has been set to 'self_healing'
17:05:16 <aspiers> OK so I think we have 3 topics for the agenda today
17:05:36 <aspiers> 1. whatever jsuchome wants to talk about ;-)
17:05:47 <aspiers> 2. ricolin's topic of http://lists.openstack.org/pipermail/openstack-discuss/2019-March/004246.html
17:05:49 <aspiers> 3. Denver
17:06:01 <jsuchome> I've added mine to the agenda btw
17:06:06 <aspiers> cool
17:06:11 <jsuchome> if https://etherpad.openstack.org/p/self-healing-SIG-IRC-meeting is indeed current agenda
17:06:29 <aspiers> I actually forgot about that etherpad %-D
17:06:48 <aspiers> jsuchome, ricolin: if either of you are in a hurry we can change the order so you can leave early
17:07:11 <ricolin> it will be nice if I can go first
17:07:15 <aspiers> wow, I don't think I've ever used that etherpad before
17:07:29 <aspiers> ricolin: OK I guess your topic will be quick anyway
17:07:41 <aspiers> #topic help most needed for SIG
17:07:47 <ricolin> Not that I will leave the meeting, I'd just like to go while my brain is still awake
17:07:51 <aspiers> haha :)
17:07:53 <aspiers> sure
17:08:08 <aspiers> ricolin: I can answer most of these questions
17:08:15 <ricolin> yeah
17:08:21 <aspiers> but I could send a reply to the ML ?
17:08:39 <aspiers> I guess you have things you want to ask now too?
17:08:45 <ricolin> It would be nice if you could reply on the ML
17:08:56 <aspiers> sure
17:09:34 <ricolin> I'm fine with getting those answers from the SIG chairs
17:09:39 <aspiers> OK
17:09:52 <ricolin> so I will leave it to you then :)
17:10:10 <aspiers> that's fine :)
17:10:15 <ricolin> Any specific concerns about those questions?
17:10:20 <aspiers> I will reply tonight or tomorrow
17:10:45 <aspiers> feel free to chase me if I forget ...
17:10:56 <ricolin> I will also try to invite project teams to join us in that forum
17:11:09 <ricolin> and hope that will help with SIGs
17:11:16 <aspiers> cool
17:11:22 <aspiers> is there a meta-SIG Forum session?
17:11:57 <aspiers> I don't see one on https://wiki.openstack.org/wiki/Forum/Denver2019
17:12:01 <ricolin> It's an action from the previous summit forum session `expose SIGs and WGs`
17:12:17 <aspiers> Yeah I remember that discussion
17:12:24 <aspiers> Just wondering if there is a slot for it in Denver
17:12:42 <ricolin> https://etherpad.openstack.org/p/DEN-Train-TC-brainstorming
17:12:47 <ricolin> under TC actually
17:13:00 <aspiers> OK cool
17:13:11 <ricolin> I wasn't running meta-SIG at the time
17:13:36 <aspiers> np
17:13:47 <aspiers> maybe we can suggest that to ttx and diablo_rojo
17:13:51 <aspiers> anyway
17:14:03 <ricolin> what do you mean by suggest?
17:14:03 <aspiers> anything else on this topic or can we move on to jsuchome's topic?
17:14:23 <aspiers> I mean suggest that maybe we should have a meta-SIG session
17:14:28 <aspiers> e.g. 30 mins
17:14:39 <ricolin> aspiers, we can move on:)
17:14:44 <aspiers> OK thanks :)
17:14:53 <aspiers> #topic supporting self-healing in openstack-helm
17:14:53 <ricolin> aspiers, That's actually a nice idea
17:15:06 <aspiers> ricolin: we can talk about it in #openstack-tc maybe
17:15:22 <ricolin> aspiers, I already proposed a SIG governance PTG topic under TC PTG as well
17:15:31 <aspiers> oh nice
17:15:32 <ricolin> aspiers, yep
17:15:43 <aspiers> jsuchome: the floor is yours :)
17:15:48 <jsuchome> OK, so - projects deployed by openstack-helm have these probes, basically python scripts that make sure that a certain service is alive
17:16:03 <aspiers> evrardjp: ^^^ in case you are listening or reading scrollback
17:16:09 <jsuchome> (service running in a container / kubernetes pod, but that's not necessarily interesting here)
17:16:32 <jsuchome> you can see one e.g. here, that is for neutron https://review.openstack.org/#/c/632200/
17:16:44 <aspiers> yeah, just looking at that
17:16:48 <jsuchome> it is using some fake RPC calls just to find out if RPC service is responding
17:17:03 <aspiers> so is this just for RPC, or also APIs?
17:17:21 <jsuchome> the problem with that is that when the fake function is called, the service logs some errors into its logs
17:17:50 <jsuchome> which does not break anything, because we catch the exception and ignore it, but it is ugly and you can miss the real errors in such log files
17:17:54 <jsuchome> just RPC AFAIK
17:18:16 <aspiers> OK so it's basically the RPC equivalent of https://storyboard.openstack.org/#!/story/2001439 ?
17:18:18 <jsuchome> so, I think, we could use some real methods instead of fake ones, the problem is which ones?
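Note: a minimal sketch of the difference jsuchome describes, using a toy in-process dispatcher rather than the real openstack-helm probe or any real messaging library. `ToyRpcServer`, `no_such_method`, and `ping` are hypothetical names chosen for illustration only.

```python
import logging

log = logging.getLogger("toy_rpc")


class ToyRpcServer:
    """Hypothetical stand-in for a service's RPC endpoint (not a real library)."""

    def __init__(self):
        # One real, trivial method: no parameters, basic response.
        self.endpoints = {"ping": lambda: "pong"}
        self.errors_logged = 0

    def dispatch(self, method):
        if method not in self.endpoints:
            # An unknown method is a genuine error from the server's point of view.
            self.errors_logged += 1
            log.error("no such RPC method: %s", method)
            raise LookupError(method)
        return self.endpoints[method]()


def probe_with_fake_method(server):
    """Probe by calling a method that does not exist and ignoring the failure."""
    try:
        server.dispatch("no_such_method")
    except LookupError:
        pass  # The server replied, so RPC is alive, but an error was logged.
    return True


def probe_with_ping(server):
    """Probe by calling a real method, which leaves the error log untouched."""
    return server.dispatch("ping") == "pong"
```

Both probes report the server as alive; only the fake-method probe adds an error entry each time it runs, which is the log noise being discussed.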
17:19:01 <aspiers> I was hoping to drive API health checks as a community goal for Train, but I ran out of time :-/
17:19:03 <jsuchome> that might be it, I do not know this story, but that's probably the reason I was redirected to self-healing when proposing we should change it in helm
17:19:14 <aspiers> but yeah this sounds like a good idea to me
17:19:23 <jsuchome> I mean, helm : my idea would be to have such simple "ping" methods in different openstack components
17:19:31 <aspiers> the original idea was to start with APIs first and handle RPC etc. later
17:19:48 <aspiers> but if there is developer bandwidth to do RPC first then that is fine too
17:21:01 <aspiers> IIUC https://review.openstack.org/#/c/632200/37/neutron/templates/bin/_health-probe.py.tpl is somewhat overlapping with API health checks too
17:21:01 <jsuchome> so ... I do not know if it's a good idea just to start posting something into e.g. the nova codebase ... seems like a better start would be to read through your proposal for the API
17:21:32 <aspiers> definitely worth reading through the API proposal, although I guess the implementation will need to be quite different for RPC
17:21:44 <ricolin> jsuchome, I think it's a nice tool for checking RPC health, and the self-healing SIG should definitely help run/host the process. We can also discuss where that code should live
17:21:59 <aspiers> yeah
17:22:13 <aspiers> I agree that fake RPC calls are not good, we should have real ones to avoid errors
17:22:53 <aspiers> but not just to avoid errors, I guess the call could trigger some internal checks similarly to what we intended with the API health checks
17:23:14 <aspiers> jsuchome: https://storyboard.openstack.org/#!/story/2001439 is the best place to start reading about API health checks
17:23:53 <aspiers> we agreed to reuse the existing oslo.middleware API and extend it to v2
17:23:59 <jsuchome> maybe ... but the idea is probably that the current probes should be "light" so they do not use up resources needed elsewhere; I do not know right now how often they are called
17:24:17 <aspiers> yes
17:24:26 <ricolin> Yeah, agree with aspiers, we can reuse/create the same storyboard story in the self-healing SIG and implement the basic checking tool in oslo, and services can implement RPC/API checks within their own codebases
17:25:00 <aspiers> with API health checks we debated for years whether to make them synchronous (triggered by the request) or async (run periodically and cache the results)
17:25:25 <aspiers> in the end we decided to start simple with synchronous and worry about maybe moving to background later
17:25:39 <aspiers> since there were endless discussions and no progress on code
17:26:08 <aspiers> so my advice would be to avoid over-engineering something complex early on, and start with the simplest thing which can possibly work
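Note: a rough sketch of the two options aspiers contrasts, assuming a health check is just a zero-argument callable. `check_sync` and `CachedCheck` are hypothetical names, and a lazily refreshed cache stands in here for a true periodic background run.

```python
import time


def check_sync(check):
    """Synchronous option: run the check when the request arrives."""
    return check()


class CachedCheck:
    """Cached option: reuse the last result until it is max_age seconds old."""

    def __init__(self, check, max_age, clock=time.monotonic):
        self._check = check
        self._max_age = max_age
        self._clock = clock  # injectable so the behaviour can be tested
        self._result = None
        self._stamp = None

    def get(self):
        now = self._clock()
        # Re-run the check only on first use or once the cached result is stale.
        if self._stamp is None or now - self._stamp >= self._max_age:
            self._result = self._check()
            self._stamp = now
        return self._result
```

The synchronous form is the simpler starting point: every probe costs one check, and there is no staleness to reason about. The cached form trades freshness for a bounded check rate.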
17:26:45 <ricolin> aspiers, jsuchome agree, as light as it can be, and expand if really needed
17:27:05 <jsuchome> well, I'm not sure if my idea wasn't too simple - I really just thought of going through those services I want to watch and adding simple calls with no params and basic response
17:27:17 <aspiers> that sounds good to me
17:27:35 <aspiers> later the basic response could optionally be made more informative
17:27:56 <aspiers> e.g. "oh no! I'm still running but my backend is broken"
17:28:06 <aspiers> however IIRC we also agreed to avoid recursive / transitive checks
17:28:44 <ricolin> jsuchome, so is the basic code ready to be split out from openstack-helm now, or do you plan to restart the effort?
17:29:02 <aspiers> so e.g. it's OK for a service to report stuff like "my db connection is healthy/broken", but not OK for it to report "I depend on another service X, and X can't talk to its db"
17:29:20 <aspiers> since in that case service X should report db connection issues via its own healthcheck API or RPC calls
17:29:34 <jsuchome> this would really not be part of openstack-helm code at all, it needs to go directly into openstack components, openstack-helm would just use what is prepared
17:29:35 <aspiers> and we don't want the same health checks duplicated / triggered in multiple places
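Note: a small sketch of a non-transitive health report along the lines aspiers describes, assuming each local check is a zero-argument callable returning a boolean. `health_report` and the check names are hypothetical, not part of any existing API.

```python
def health_report(local_checks):
    """Run only this service's own checks; return a basic status plus details.

    Checks of other services are deliberately absent: a dependency such as
    service X is expected to report on its own db connection itself.
    """
    details = {}
    for name, check in local_checks.items():
        try:
            details[name] = "healthy" if check() else "broken"
        except Exception:
            # A check that raises is reported as broken rather than crashing the probe.
            details[name] = "broken"
    healthy = all(state == "healthy" for state in details.values())
    return {"status": "ok" if healthy else "degraded", "checks": details}


# Example: the service is still running, but one of its own backends is broken.
report = health_report({
    "db_connection": lambda: True,
    "message_queue": lambda: False,
})
```

A caller that only wants the basic response can read `report["status"]`; the per-check details are the optional, more informative part.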
17:29:45 <aspiers> jsuchome: yes exactly
17:30:19 <aspiers> jsuchome: we already intended openstack-helm to be one of the first customers of the API health checks (I talked to the AT&T guys about this in Berlin and they were interested)
17:30:24 <aspiers> so the same applies for RPC
17:30:44 <aspiers> I think this SIG is the best place to track that work
17:31:02 <jsuchome> it makes sense, with kubernetes being one of the engines that should be interested in health
17:31:09 <aspiers> definitely
17:31:53 <jsuchome> ok then - I'm gonna look at your API proposal ... then I might hack some simple POC and offer it for reviews, possibly coming here again for some consulting
17:31:54 <aspiers> my suggestion would be a) submit a new story for this, just like https://storyboard.openstack.org/#!/story/2001439
17:32:13 <ricolin> +1
17:32:15 <aspiers> (I think maybe best to keep it separate to the API health checks, unless you can see any significant overlap)
17:32:23 <aspiers> b) submit a spec to self-healing-sig repo
17:32:53 <aspiers> c) after spec is merged (or maybe even before) start submitting code to implement
17:33:06 <aspiers> does that make sense?
17:33:27 <jsuchome> it does, yes
17:33:31 <aspiers> cool
17:33:41 <ricolin> We should also have a PTG session for health checks for API and RPC
17:33:51 <ricolin> jsuchome, are you going to Denver too?
17:33:58 <aspiers> we should have time in the self-healing PTG session
17:34:10 <ricolin> aspiers, that's perfect
17:34:27 <jsuchome> unfortunately not, but I hope I can prepare something beforehand so you have some base if you have time to talk about this topic
17:34:43 <aspiers> jsuchome: I am very happy to lead that discussion
17:34:49 <aspiers> maybe evrardjp is coming too? I can't remember
17:34:57 <aspiers> jsuchome: feel free to add to https://etherpad.openstack.org/p/DEN-self-healing-SIG
17:34:58 <jsuchome> I think he will, yes
17:35:02 <ricolin> aspiers, I think he will
17:35:02 <aspiers> awesome
17:35:29 <jsuchome> ok, cool, I think I'm done here, thanks for the ideas
17:35:45 <aspiers> cool
17:35:53 <aspiers> thanks a lot for proposing, this is a really cool initiative :)
17:36:16 <aspiers> #topic Denver
17:36:24 <aspiers> well I guess we already mostly covered this
17:36:29 <aspiers> but just for the record ...
17:36:44 <aspiers> #link https://etherpad.openstack.org/p/DEN-self-healing-SIG self-healing etherpad for Denver Forum and PTG sessions
17:37:08 <aspiers> I thought it was probably overkill to have two separate etherpads
17:37:23 <aspiers> I'll try to touch base with ekcs about Denver too
17:39:10 <aspiers> alright, anything else? if not I think we're done
17:39:46 <ricolin> aspiers, thx, I will try to see if I have any topic to put in
17:39:56 <aspiers> +1
17:40:11 <ricolin> I think I will put one for tempest later
17:40:11 <aspiers> alright cool, thanks a lot folks!
17:40:15 <aspiers> good idea
17:40:15 <ricolin> aspiers, thx!
17:40:28 <aspiers> #action aspiers to reply to ricolin's SIG questionnaire on ML
17:40:40 * ricolin likes that action :)
17:40:45 <aspiers> XD
17:40:57 <jsuchome> good night or whatever time it is in your part of the world
17:41:02 <aspiers> #action jsuchome to propose the Denver discussion about RPC healthchecks
17:41:11 <ricolin> 01:41 for me:/
17:41:16 <aspiers> #action ricolin to propose discussion topic for tempest
17:41:18 <aspiers> ouch!
17:41:25 <aspiers> OK, please sleep now ;-)
17:41:30 <aspiers> thanks a lot for attending
17:41:53 <aspiers> ttyl o/
17:42:02 <ricolin> any chance to make our meeting two hours earlier? :)
17:42:15 <aspiers> we can make it at least one hour earlier
17:42:19 <aspiers> maybe 2
17:42:33 <aspiers> but there was another meeting this morning
17:42:49 <aspiers> that one is targeted at EU / APAC
17:43:02 <aspiers> we always have two on the same day, so that all time zones are covered
17:43:05 <ricolin> aspiers, oh, okay then, I should try to join that one
17:43:10 <aspiers> yes please :)
17:43:57 <aspiers> #action aspiers to ask if irc-meetings can accept events in non-fixed time zones, so that they automatically adapt to daylight savings
17:44:12 <aspiers> I'll ask on infra now
17:44:18 <aspiers> OK, l8r folks!
17:44:20 <aspiers> #endmeeting