17:04:55 #startmeeting self-healing
17:04:56 Meeting started Wed Apr 10 17:04:55 2019 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:04:57 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:04:59 The meeting name has been set to 'self_healing'
17:05:16 OK so I think we have 3 topics for the agenda today
17:05:36 1. whatever jsuchome wants to talk about ;-)
17:05:47 2. ricolin's topic of http://lists.openstack.org/pipermail/openstack-discuss/2019-March/004246.html
17:05:49 3. Denver
17:06:01 I've added mine to the agenda btw
17:06:06 cool
17:06:11 if https://etherpad.openstack.org/p/self-healing-SIG-IRC-meeting is indeed the current agenda
17:06:29 I actually forgot about that etherpad %-D
17:06:48 jsuchome, ricolin: if either of you are in a hurry we can change the order so you can leave early
17:07:11 it would be nice if I could go first
17:07:15 wow, I don't think I've ever used that etherpad before
17:07:29 ricolin: OK I guess your topic will be quick anyway
17:07:41 #topic help most needed for SIG
17:07:47 Not that I will leave the meeting, I just wish to cover it while my brain is still awake
17:07:51 haha :)
17:07:53 sure
17:08:08 ricolin: I can answer most of these questions
17:08:15 yeah
17:08:21 but I could send a reply to the ML ?
17:08:39 I guess you have things you want to ask now too?
17:08:45 It would be nice if you could reply on the ML
17:08:56 sure
17:09:34 I'm fine with getting those answers from SIG chairs
17:09:39 OK
17:09:52 so I will leave it to you then :)
17:10:10 that's fine :)
17:10:15 Any specific concerns about those questions?
17:10:20 I will reply tonight or tomorrow
17:10:45 feel free to chase me if I forget ...
17:10:56 I will also try to invite project teams to join us in that forum
17:11:09 and hope that will help with SIGs
17:11:16 cool
17:11:22 is there a meta-SIG Forum session?
17:11:57 I don't see one on https://wiki.openstack.org/wiki/Forum/Denver2019
17:12:01 It's an action from the previous summit forum `expose SIGs and WGs`
17:12:17 Yeah I remember that discussion
17:12:24 Just wondering if there is a slot for it in Denver
17:12:42 https://etherpad.openstack.org/p/DEN-Train-TC-brainstorming
17:12:47 under TC actually
17:13:00 OK cool
17:13:11 I wasn't running meta-SIG at the time
17:13:36 np
17:13:47 maybe we can suggest that to ttx and diablo_rojo
17:13:51 anyway
17:14:03 what do you mean by suggest?
17:14:03 anything else on this topic or can we move on to jsuchome's topic?
17:14:23 I mean suggest that maybe we should have a meta-SIG session
17:14:28 e.g. 30 mins
17:14:39 aspiers, we can move on:)
17:14:44 OK thanks :)
17:14:53 #topic supporting self-healing in openstack-helm
17:14:53 aspiers, That's actually a nice idea
17:15:06 ricolin: we can talk about it in #openstack-tc maybe
17:15:22 aspiers, I already proposed a SIG governance PTG topic under TC PTG as well
17:15:31 oh nice
17:15:32 aspiers, yep
17:15:43 jsuchome: the floor is yours :)
17:15:48 OK, so - projects deployed by openstack-helm have these probes, basically python scripts that make sure that a certain service is alive
17:16:03 evrardjp: ^^^ in case you are listening or reading scrollback
17:16:09 (service running in a container/ kubernetes pod, but that's not necessarily interesting here)
17:16:32 you can see one e.g. here, that is for neutron https://review.openstack.org/#/c/632200/
17:16:44 yeah, just looking at that
17:16:48 it is using some fake RPC calls just to find out if RPC service is responding
17:17:03 so is this just for RPC, or also APIs?
17:17:21 problem with that is that when calling a fake function, it logs some errors into the logs
17:17:50 which does not break anything, because we catch the exception and ignore it, but it is ugly and you can miss the real errors in such log files
17:17:54 just RPC AFAIK
17:18:16 OK so it's basically the RPC equivalent of https://storyboard.openstack.org/#!/story/2001439 ?
17:18:18 so, I think, we could use some real methods instead of fake ones, the problem is which ones?
17:19:01 I was hoping to drive API health checks as a community goal for Train, but I ran out of time :-/
17:19:03 that might be it, I do not know this story, but that's probably the reason I was redirected to self-healing when proposing we should change it in helm
17:19:14 but yeah this sounds like a good idea to me
17:19:23 I mean, helm : my idea would be to have such simple "ping" methods in different openstack components
17:19:31 the original idea was to start with APIs first and handle RPC etc. later
17:19:48 but if there is developer bandwidth to do RPC first then that is fine too
17:21:01 IIUC https://review.openstack.org/#/c/632200/37/neutron/templates/bin/_health-probe.py.tpl is somewhat overlapping with API health checks too
17:21:01 so ... I do not know if it's a good idea just to start posting something into e.g. nova codebase ... seems like a better start would be to read through your proposal for API
17:21:32 definitely worth reading through the API proposal, although I guess the implementation will need to be quite different for RPC
17:21:44 jsuchome, I think it's a nice tool to check the RPC health, and the self-healing SIG should definitely help to run/host the process.
We can also discuss where that code can live
17:21:59 yeah
17:22:13 I agree that fake RPC calls are not good, we should have real ones to avoid errors
17:22:53 but not just to avoid errors, I guess the call could trigger some internal checks similarly to what we intended with the API health checks
17:23:14 jsuchome: https://storyboard.openstack.org/#!/story/2001439 is the best place to start reading about API health checks
17:23:53 we agreed to reuse the existing oslo.middleware API and extend it to v2
17:23:59 maybe ... but there's probably the idea that current probes should be "light" so they do not consume resources needed elsewhere, I do not know right now how often they are called
17:24:17 yes
17:24:26 Yeah agree with aspiers, we can reuse/create the same storyboard story in self-healing SIG and implement the basic checking tool in oslo, and services can implement RPC/API check within their own codebase
17:25:00 with API health checks we debated for years whether to make them synchronous (triggered by the request) or async (run periodically and cache the results)
17:25:25 in the end we decided to start simple with synchronous and worry about maybe moving to background later
17:25:39 since there were endless discussions and no progress on code
17:26:08 so my advice would be to avoid over-engineering something complex early on, and start with the simplest thing which can possibly work
17:26:45 aspiers, jsuchome agree, as light as it can be, and expand if really needed
17:27:05 well, I'm not sure if my idea wasn't too simple - I really just thought of going through those services I want to watch and adding simple calls with no params and basic response
17:27:17 that sounds good to me
17:27:35 later the basic response could optionally be made more informative
17:27:56 e.g. "oh no!
I'm still running but my backend is broken"
17:28:06 however IIRC we also agreed to avoid recursive / transitive checks
17:28:44 jsuchome, so is the basic code ready to be split out from openstack-helm now, or do you plan to restart the effort?
17:29:02 so e.g. it's OK for a service to report stuff like "my db connection is healthy/broken", but not OK for it to report "I depend on another service X, and X can't talk to its db"
17:29:20 since in that case service X should report db connection issues via its own healthcheck API or RPC calls
17:29:34 this would really not be part of openstack-helm code at all, it needs to go directly into openstack components, openstack-helm would just use what is prepared
17:29:35 and we don't want the same health checks duplicated / triggered in multiple places
17:29:45 jsuchome: yes exactly
17:30:19 jsuchome: we already intended openstack-helm to be one of the first customers of the API health checks (I talked to the AT&T guys about this in Berlin and they were interested)
17:30:24 so the same applies for RPC
17:30:44 I think this SIG is the best place to track that work
17:31:02 it makes sense, with kubernetes being one of the engines that should be interested in health
17:31:09 definitely
17:31:53 ok then - I'm gonna look at your API proposal ... then I might hack some simple POC and offer it for reviews, possibly coming here again for some consulting
17:31:54 my suggestion would be a) submit a new story for this, just like https://storyboard.openstack.org/#!/story/2001439
17:32:13 +1
17:32:15 (I think maybe best to keep it separate to the API health checks, unless you can see any significant overlap)
17:32:23 b) submit a spec to self-healing-sig repo
17:32:53 c) after spec is merged (or maybe even before) start submitting code to implement
17:33:06 does that make sense?
17:33:27 it does, yes
17:33:31 cool
17:33:41 We should also have a PTG session for health checks for API and RPC
17:33:51 jsuchome, are you going to Denver too?
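A rough sketch of the "ping" idea discussed between 17:27:05 and 17:29:35: a call with no params and a basic response, optionally reporting the state of the service's own dependencies but never those of other services. Everything here is hypothetical (the `PingResponder` class, the `is_healthy` helper, the shape of the response); it is not code from openstack-helm, oslo or any OpenStack component, just one way the agreed constraints might look in Python.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class PingResponder:
    """Hypothetical no-parameter "ping" handler for a single service."""

    service: str
    # Checks for this service's OWN dependencies only (e.g. its db connection).
    # Dependencies of other services are deliberately not checked here.
    local_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def ping(self) -> dict:
        details = {}
        for name, check in self.local_checks.items():
            try:
                details[name] = "healthy" if check() else "broken"
            except Exception:
                # A failing check must not take the ping itself down.
                details[name] = "broken"
        # The service answered, so it is alive even if a backend is broken.
        return {"service": self.service, "alive": True, "details": details}


def is_healthy(response: dict) -> bool:
    """A probe may treat any broken local dependency as unhealthy."""
    return response["alive"] and all(
        state == "healthy" for state in response["details"].values()
    )
```

A liveness probe would only need `ping()` to return at all; the `details` part corresponds to the "more informative" response suggested at 17:27:35.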
17:33:58 we should have time in the self-healing PTG session
17:34:10 aspiers, that's perfect
17:34:27 unfortunately not, but I hope I can prepare something beforehand so you have some basis if you have time to talk about this topic
17:34:43 jsuchome: I am very happy to lead that discussion
17:34:49 maybe evrardjp is coming too? I can't remember
17:34:57 jsuchome: feel free to add to https://etherpad.openstack.org/p/DEN-self-healing-SIG
17:34:58 I think he will, yes
17:35:02 aspiers, I think he will
17:35:02 awesome
17:35:29 ok, cool, I think I'm done here, thanks for the ideas
17:35:45 cool
17:35:53 thanks a lot for proposing, this is a really cool initiative :)
17:36:16 #topic Denver
17:36:24 well I guess we already mostly covered this
17:36:29 but just for the record ...
17:36:44 #link https://etherpad.openstack.org/p/DEN-self-healing-SIG self-healing etherpad for Denver Forum and PTG sessions
17:37:08 I thought it was probably overkill to have two separate etherpads
17:37:23 I'll try to touch base with ekcs about Denver too
17:39:10 alright, anything else? if not I think we're done
17:39:46 aspiers, thx, I will try to see if I have any topic to put in
17:39:56 +1
17:40:11 I think I will put one for tempest later
17:40:11 alright cool, thanks a lot folks!
17:40:15 good idea
17:40:15 aspiers, thx!
17:40:28 #action aspiers to reply to ricolin's SIG questionnaire on ML
17:40:40 * ricolin like that action:)
17:40:45 XD
17:40:57 good night or whatever time it is in your part of the world
17:41:02 #action jsuchome to propose the Denver discussion about RPC healthchecks
17:41:11 01:41 for me:/
17:41:16 #action ricolin to propose discussion topic for tempest
17:41:18 ouch!
17:41:25 OK, please sleep now ;-)
17:41:30 thanks a lot for attending
17:41:53 ttyl o/
17:42:02 any chance to make our meeting two hours earlier?:)
17:42:15 we can make it at least one hour earlier
17:42:19 maybe 2
17:42:33 but there was another meeting this morning
17:42:49 that one is targeted at EU / APAC
17:43:02 we always have two on the same day, so that all time zones are covered
17:43:05 aspiers, oh, okay then I should try to join that one
17:43:10 yes please :)
17:43:57 #action aspiers to ask if irc-meetings can accept events in non-fixed time zones, so that they automatically adapt to daylight savings
17:44:12 I'll ask on infra now
17:44:18 OK, l8r folks!
17:44:20 #endmeeting
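For reference, the synchronous versus async trade-off mentioned at 17:25:00 could be illustrated roughly as follows. The class names `SyncCheck` and `CachedCheck` are made up for this sketch and do not come from oslo.middleware; the periodic scheduling that would call `refresh()` is assumed to exist elsewhere and is not shown.

```python
from typing import Callable, Optional


class SyncCheck:
    """Synchronous style: the check runs each time a request asks for it."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check

    def status(self) -> bool:
        return self._check()


class CachedCheck:
    """Async style: a periodic job refreshes; requests only read the cache."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check
        self._last: Optional[bool] = None

    def refresh(self) -> None:
        # Intended to be called from a timer or background thread (not shown).
        self._last = self._check()

    def status(self) -> Optional[bool]:
        # None means no result has been stored yet.
        return self._last
```

The synchronous variant is the "simplest thing which can possibly work" from 17:26:08: its cost scales with how often the probe is called, whereas the cached variant keeps probes cheap at the price of possibly stale results and extra moving parts.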