17:04:55 <aspiers> #startmeeting self-healing
17:04:56 <openstack> Meeting started Wed Apr 10 17:04:55 2019 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:04:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:04:59 <openstack> The meeting name has been set to 'self_healing'
17:05:16 <aspiers> OK so I think we have 3 topics for the agenda today
17:05:36 <aspiers> 1. whatever jsuchome wants to talk about ;-)
17:05:47 <aspiers> 2. ricolin's topic of http://lists.openstack.org/pipermail/openstack-discuss/2019-March/004246.html
17:05:49 <aspiers> 3. Denver
17:06:01 <jsuchome> I've added mine to the agenda btw
17:06:06 <aspiers> cool
17:06:11 <jsuchome> if https://etherpad.openstack.org/p/self-healing-SIG-IRC-meeting is indeed the current agenda
17:06:29 <aspiers> I actually forgot about that etherpad %-D
17:06:48 <aspiers> jsuchome, ricolin: if either of you are in a hurry we can change the order so you can leave early
17:07:11 <ricolin> it will be nice if I can go first
17:07:15 <aspiers> wow, I don't think I've ever used that etherpad before
17:07:29 <aspiers> ricolin: OK I guess your topic will be quick anyway
17:07:41 <aspiers> #topic help most needed for SIG
17:07:47 <ricolin> Not like I will leave the meeting, just wish to go with it while my brain is still awake
17:07:51 <aspiers> haha :)
17:07:53 <aspiers> sure
17:08:08 <aspiers> ricolin: I can answer most of these questions
17:08:15 <ricolin> yeah
17:08:21 <aspiers> but I could send a reply to the ML ?
17:08:39 <aspiers> I guess you have things you want to ask now too?
17:08:45 <ricolin> It will be nice if you can reply on that ML
17:08:56 <aspiers> sure
17:09:34 <ricolin> I'm fine to get those answers from SIG chairs
17:09:39 <aspiers> OK
17:09:52 <ricolin> so I will leave it to you then:)
17:10:10 <aspiers> that's fine :)
17:10:15 <ricolin> Any specific concerns about those questions?
17:10:20 <aspiers> I will reply tonight or tomorrow
17:10:45 <aspiers> feel free to chase me if I forget ...
17:10:56 <ricolin> I will also try to invite project teams to join us in that forum
17:11:09 <ricolin> and hope that will help with SIGs
17:11:16 <aspiers> cool
17:11:22 <aspiers> is there a meta-SIG Forum session?
17:11:57 <aspiers> I don't see one on https://wiki.openstack.org/wiki/Forum/Denver2019
17:12:01 <ricolin> It's an action from previous summit forum `expose SIGs and WGs`
17:12:17 <aspiers> Yeah I remember that discussion
17:12:24 <aspiers> Just wondering if there is a slot for it in Denver
17:12:42 <ricolin> https://etherpad.openstack.org/p/DEN-Train-TC-brainstorming
17:12:47 <ricolin> under TC actually
17:13:00 <aspiers> OK cool
17:13:11 <ricolin> I wasn't running meta-SIG at the time
17:13:36 <aspiers> np
17:13:47 <aspiers> maybe we can suggest that to ttx and diablo_rojo
17:13:51 <aspiers> anyway
17:14:03 <ricolin> what do you mean by suggest?
17:14:03 <aspiers> anything else on this topic or can we move on to jsuchome's topic?
17:14:23 <aspiers> I mean suggest that maybe we should have a meta-SIG session
17:14:28 <aspiers> e.g. 30 mins
17:14:39 <ricolin> aspiers, we can move on:)
17:14:44 <aspiers> OK thanks :)
17:14:53 <aspiers> #topic supporting self-healing in openstack-helm
17:14:53 <ricolin> aspiers, That's actually a nice idea
17:15:06 <aspiers> ricolin: we can talk about it in #openstack-tc maybe
17:15:22 <ricolin> aspiers, I already proposed a SIG governance PTG topic under TC PTG as well
17:15:31 <aspiers> oh nice
17:15:32 <ricolin> aspiers, yep
17:15:43 <aspiers> jsuchome: the floor is yours :)
17:15:48 <jsuchome> OK, so - projects deployed by openstack-helm have these probes, basically python scripts that make sure that a certain service is alive
17:16:03 <aspiers> evrardjp: ^^^ in case you are listening or reading scrollback
17:16:09 <jsuchome> (service running in a container / kubernetes pod, but that's not necessarily interesting here)
17:16:32 <jsuchome> you can see one e.g. here, that is for neutron https://review.openstack.org/#/c/632200/
17:16:44 <aspiers> yeah, just looking at that
17:16:48 <jsuchome> it is using some fake RPC calls just to find out if the RPC service is responding
17:17:03 <aspiers> so is this just for RPC, or also APIs?
17:17:21 <jsuchome> problem with that is that while getting a fake function, it logs some errors into the logs
17:17:50 <jsuchome> which does not break anything, because we catch the exception and ignore it, but it is ugly and you can miss the real errors in such log files
17:17:54 <jsuchome> just RPC AFAIK
17:18:16 <aspiers> OK so it's basically the RPC equivalent of https://storyboard.openstack.org/#!/story/2001439 ?
17:18:18 <jsuchome> so, I think, we could use some real methods instead of fake ones, the problem is which ones?
17:19:01 <aspiers> I was hoping to drive API health checks as a community goal for Train, but I ran out of time :-/
17:19:03 <jsuchome> that might be it, I do not know this story, but that's probably the reason I was redirected to self-healing when proposing we should change it in helm
17:19:14 <aspiers> but yeah this sounds like a good idea to me
17:19:23 <jsuchome> I mean, helm : my idea would be to have such simple "ping" methods in different openstack components
17:19:31 <aspiers> the original idea was to start with APIs first and handle RPC etc. later
17:19:48 <aspiers> but if there is developer bandwidth to do RPC first then that is fine too
17:21:01 <aspiers> IIUC https://review.openstack.org/#/c/632200/37/neutron/templates/bin/_health-probe.py.tpl is somewhat overlapping with API health checks too
17:21:01 <jsuchome> so ... I do not know if it's a good idea just to start posting something into e.g. nova codebase ... seems like a better start would be to read through your proposal for API
17:21:32 <aspiers> definitely worth reading through the API proposal, although I guess the implementation will need to be quite different for RPC
17:21:44 <ricolin> jsuchome, I think it's a nice tool to check the RPC health, which self-healing SIG should definitely help to run/host the process. We can also discuss where that code can live
17:21:59 <aspiers> yeah
17:22:13 <aspiers> I agree that fake RPC calls are not good, we should have real ones to avoid errors
17:22:53 <aspiers> but not just to avoid errors, I guess the call could trigger some internal checks similarly to what we intended with the API health checks
17:23:14 <aspiers> jsuchome: https://storyboard.openstack.org/#!/story/2001439 is the best place to start reading about API health checks
17:23:53 <aspiers> we agreed to reuse the existing oslo.middleware API and extend it to v2
17:23:59 <jsuchome> maybe ... but there's probably the idea that current probes should be "light" so they do not spend resources needed elsewhere, I do not know right now how often they are called
17:24:17 <aspiers> yes
17:24:26 <ricolin> Yeah agree with aspiers, we can reuse/create the same storyboard story in self-healing SIG and implement the basic checking tool in oslo, and services can implement RPC/API check within their own codebase
17:25:00 <aspiers> with API health checks we debated for years whether to make them synchronous (triggered by the request) or async (run periodically and cache the results)
17:25:25 <aspiers> in the end we decided to start simple with synchronous and worry about maybe moving to background later
17:25:39 <aspiers> since there were endless discussions and no progress on code
17:26:08 <aspiers> so my advice would be to avoid over-engineering something complex early on, and start with the simplest thing which can possibly work
17:26:45 <ricolin> aspiers, jsuchome agree, light as it can be, and expand if really needed
17:27:05 <jsuchome> well, I'm not sure if my idea wasn't too simple - I really just thought of going through those services I want to watch and adding simple calls with no params and basic response
17:27:17 <aspiers> that sounds good to me
17:27:35 <aspiers> later the basic response could optionally be made more informative
17:27:56 <aspiers> e.g. "oh no! I'm still running but my backend is broken"
17:28:06 <aspiers> however IIRC we also agreed to avoid recursive / transitive checks
17:28:44 <ricolin> jsuchome, so the basic code is ready to split from openstack-helm now or you plan to restart the effort?
17:29:02 <aspiers> so e.g. it's OK for a service to report stuff like "my db connection is healthy/broken", but not OK for it to report "I depend on another service X, and X can't talk to its db"
17:29:20 <aspiers> since in that case service X should report db connection issues via its own healthcheck API or RPC calls
17:29:34 <jsuchome> this would really not be part of openstack-helm code at all, it needs to go directly into openstack components, openstack-helm would just use what is prepared
17:29:35 <aspiers> and we don't want the same health checks duplicated / triggered in multiple places
17:29:45 <aspiers> jsuchome: yes exactly
17:30:19 <aspiers> jsuchome: we already intended openstack-helm to be one of the first customers of the API health checks (I talked to the AT&T guys about this in Berlin and they were interested)
17:30:24 <aspiers> so the same applies for RPC
17:30:44 <aspiers> I think this SIG is the best place to track that work
17:31:02 <jsuchome> it makes sense, as kubernetes is one of the engines that should be interested in health
17:31:09 <aspiers> definitely
17:31:53 <jsuchome> ok then - I'm gonna look at your API proposal ... then I might hack some simple POC and offer it for reviews, possibly coming here again for some consulting
17:31:54 <aspiers> my suggestion would be a) submit a new story for this, just like https://storyboard.openstack.org/#!/story/2001439
17:32:13 <ricolin> +1
17:32:15 <aspiers> (I think maybe best to keep it separate to the API health checks, unless you can see any significant overlap)
17:32:23 <aspiers> b) submit a spec to self-healing-sig repo
17:32:53 <aspiers> c) after spec is merged (or maybe even before) start submitting code to implement
17:33:06 <aspiers> does that make sense?
17:33:27 <jsuchome> it does, yes
17:33:31 <aspiers> cool
17:33:41 <ricolin> We should also have a PTG session for health checks for API and RPC
17:33:51 <ricolin> jsuchome, are you going to Denver too?
17:33:58 <aspiers> we should have time in the self-healing PTG session
17:34:10 <ricolin> aspiers, that's perfect
17:34:27 <jsuchome> unfortunately not, but I hope I can prepare something beforehand so you can have some base if you have time to talk about this topic
17:34:43 <aspiers> jsuchome: I am very happy to lead that discussion
17:34:49 <aspiers> maybe evrardjp is coming too? I can't remember
17:34:57 <aspiers> jsuchome: feel free to add to https://etherpad.openstack.org/p/DEN-self-healing-SIG
17:34:58 <jsuchome> I think he will, yes
17:35:02 <ricolin> aspiers, I think he will
17:35:02 <aspiers> awesome
17:35:29 <jsuchome> ok, cool, I think I'm done here, thanks for the ideas
17:35:45 <aspiers> cool
17:35:53 <aspiers> thanks a lot for proposing, this is a really cool initiative :)
17:36:16 <aspiers> #topic Denver
17:36:24 <aspiers> well I guess we already mostly covered this
17:36:29 <aspiers> but just for the record ...
17:36:44 <aspiers> #link https://etherpad.openstack.org/p/DEN-self-healing-SIG self-healing etherpad for Denver Forum and PTG sessions
17:37:08 <aspiers> I thought it was probably overkill to have two separate etherpads
17:37:23 <aspiers> I'll try to touch base with ekcs about Denver too
17:39:10 <aspiers> alright, anything else? if not I think we're done
17:39:46 <ricolin> aspiers, thx, I will try to see if I have any topic to put in
17:39:56 <aspiers> +1
17:40:11 <ricolin> I think I will put one for tempest later
17:40:11 <aspiers> alright cool, thanks a lot folks!
17:40:15 <aspiers> good idea
17:40:15 <ricolin> aspiers, thx!
17:40:28 <aspiers> #action aspiers to reply to ricolin's SIG questionnaire on ML
17:40:40 * ricolin likes that action:)
17:40:45 <aspiers> XD
17:40:57 <jsuchome> good night or whatever time it is in your part of the world
17:41:02 <aspiers> #action jsuchome to propose the Denver discussion about RPC healthchecks
17:41:11 <ricolin> 01:41 for me:/
17:41:16 <aspiers> #action ricolin to propose discussion topic for tempest
17:41:18 <aspiers> ouch!
17:41:25 <aspiers> OK, please sleep now ;-)
17:41:30 <aspiers> thanks a lot for attending
17:41:53 <aspiers> ttyl o/
17:42:02 <ricolin> any chance to make our meeting two hours earlier?:)
17:42:15 <aspiers> we can make it at least one hour earlier
17:42:19 <aspiers> maybe 2
17:42:33 <aspiers> but there was another meeting this morning
17:42:49 <aspiers> that one is targeted at EU / APAC
17:43:02 <aspiers> we always have two on the same day, so that all time zones are covered
17:43:05 <ricolin> aspiers, oh, okay then I should try to join that one
17:43:10 <aspiers> yes please :)
17:43:57 <aspiers> #action aspiers to ask if irc-meetings can accept events in non-fixed time zones, so that they automatically adapt to daylight savings
17:44:12 <aspiers> I'll ask on infra now
17:44:18 <aspiers> OK, l8r folks!
17:44:20 <aspiers> #endmeeting