-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 892280: Log exceptions on db migration errors https://review.opendev.org/c/zuul/zuul/+/892280 | 07:55 | |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893375: Kubernetes/OpenShirt drivers: allow setting dynamic k8s labels https://review.opendev.org/c/zuul/nodepool/+/893375 | 09:00 | |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893481: Fix sphinx doc build https://review.opendev.org/c/zuul/nodepool/+/893481 | 09:03 | |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893375: Kubernetes/OpenShirt drivers: allow setting dynamic k8s labels https://review.opendev.org/c/zuul/nodepool/+/893375 | 09:39 | |
@mhuin:matrix.org | hello zuul-maint, I've started this chain https://review.opendev.org/q/topic:connections_health and would appreciate feedback before I go further down the rabbit hole. For context/incentive behind the chain, we've had issues with github repositories where the zuul app wasn't installed or configured properly, resulting in degraded performances and hitting API rate limiting quotas much earlier than expected. While the zuul admins ended up finding evidence of the issue in the scheduler's logs, it wasn't especially obvious if you don't know what you're looking for. And furthermore, the zuul admins weren't admins on the problematic github repos so we couldn't fix the app, while the rate limiting was applied to our zuul instance as a whole and affected every GH repo IIRC | 14:20 |
---|---|---|
@mhuin:matrix.org | So the idea is that for connections that might be out of perimeter for zuul admins, the health of the connection should be made available, either in the form of monitoring metrics or GUI info | 14:21 |
@mhuin:matrix.org | another possibility would be a zuul-admin subcommand that could poll connections on demand to check their state | 14:23 |
@tristanc_:matrix.org | mhu: if I understand correctly, the goal is to report connection issues (e.g. rate limit, project not found, api error, ...). Then I wonder if we need to carry such `_health` structure for every projects, wouldn't it be easier to simply have a list of such error next to the `config_errors` structure, and report them on the existing web page? | 15:04 |
@jim:acmegating.com | i think having connection configuration errors reported in the logs is pretty reasonable and expected :). but i understand that you might want to set alert conditions for administrators, and that can be difficult to do based on log messages. so i think having statsd/prometheus metrics that report health on connections is reasonable. but in general, connection configuration problems are not actionable by users, so i don't think they should be in the main part of the web ui. if it's not something that can be fixed by changing zuul.yaml then it should probably not be on the projects page (which shows the zuul.yaml configuration for a project) or the configuration errors page (which shows errors in zuul.yaml configuration) | 15:07 |
@jim:acmegating.com | tbh, my preference would be to keep this stuff in the logs. especially if it's very open ended (you listed a lot of different types of errors there). but something to alert admins that they should check the logs for errors, or certain kinds of errors, sounds useful. | 15:10 |
@tristanc_:matrix.org | corvus: but sometime, connection problems are not actionable by admin either, for example when there is a GitHub api outage, or when the app is not installed properly. | 15:11 |
@jim:acmegating.com | tristanC: sure, but none of those can be fixed by changes to zuul's configuration. that's the distinction i'm trying to convey. | 15:12 |
@jim:acmegating.com | maybe it would be worth writing up exactly what information is needed and where you think it makes sense to expose that. because there's a lot of confusing crossover with connections and projects in a multi-tenant setup. associating connection information to projects can be tricky. | 15:17 |
@mhuin:matrix.org | The difficulty is that for connections like GH and possibly also GL and Pagure is the requirement that a repo has an application installed, which makes {connection,project} a unique setup | 15:19 |
@jim:acmegating.com | exactly, while {tenant, project} is also unique and potentially overlaps in strange ways with the other. | 15:20 |
@mhuin:matrix.org | also, metrics as in statsd and prometheus are just numerical values AFAICT so you can't really convey info about what's exactly going on, which is why I added a description field to the health structure I proposed | 15:21 |
@jim:acmegating.com | metrics for alerting and logs for diagnosis makes sense to me | 15:21 |
@mhuin:matrix.org | so maybe a simpler approach could be to have statsd 0/1 gauges for states like "OK", "DEGRADED", "FAILING" | 15:22 |
@mhuin:matrix.org | fire them when there's a problem in a connection, and leave it up to admins to have alerting set up and to look in the logs | 15:23 |
@mhuin:matrix.org | and drop the project headache | 15:23 |
@jim:acmegating.com | i think that would be a simple solution. but also, it's starting to sound to me like this problem may be much more specific than i was originally understanding. if really the issue is "we want users to know if the github app used by the connection hasn't been installed for this project"... that maybe is a bit more of a tractable problem than generalized connection health reporting. | 15:26 |
@mhuin:matrix.org | yeah that was the original incentive, and then I thought "maybe I can go for something more generic and useful?" | 15:29 |
@jim:acmegating.com | mhu: okay, i think i get the use case, and i get why putting it on the project page in the ui makes sense for this case... and i think maybe setting it up in a generic enough way that we could use it for other things in the future makes sense. but i'd want to be really careful with it that we don't put too much admin-level information in there, or that we don't leak what should be private information about the connection. so i think i'm okay with adding connection information (specifically to start out: errors relating to app installation) to the project page. | 15:33 |
@jim:acmegating.com | i think things like persistent 500 errors, etc, may be better for a statsd/prometheus metric for admin alerting. like, if you get a 500, or hit a rate limit, or something else, that's better to signal an admin to check the logs | 15:37 |
@mhuin:matrix.org | Agreed, then 4XX errors would probably fall under config errors | 15:38 |
@mhuin:matrix.org | well generally speaking | 15:38 |
@jim:acmegating.com | i think maybe the "github app not installed" thing is a new class of malfunction for us to think about: basically, it's an external configuration error. so propagating external configuration errors to users is reasonable. and server 5xx or authentication errors, or rate limits, etc, are operational issues, or admin configuration errors, and they should not show up in the web ui, but should be in logs, or alerts... | 15:39 |
@clarkb:matrix.org | it isn't necessarily an error to not have an app installed though? | 15:46 |
@clarkb:matrix.org | zuul operates with a regular user token iirc | 15:46 |
@mhuin:matrix.org | @Clark: no, if the app isn't install and if there's no token in the config, it falls back to anonymous API access | 15:46 |
@mhuin:matrix.org | but then can get hit with API rate limiting | 15:47 |
@clarkb:matrix.org | but zuul is functional in that state right? | 15:47 |
@clarkb:matrix.org | other than potential rate limits which for small intsalls may be a non concern | 15:48 |
@mhuin:matrix.org | well not trying to brag, but our install is yuuuge! 🙂 | 15:48 |
@clarkb:matrix.org | sure, my point is that having a big error sign may be counter productive for some zuul installs | 15:49 |
@jim:acmegating.com | sure, but i think the point stands that not having a huge install is not an error | 15:49 |
@mhuin:matrix.org | but yeah it's probably not a concern for smaller deployments | 15:49 |
@mhuin:matrix.org | but if you have an app_id set in zuul's config and the app isn't installed on the repos zuul provides CI for, surely this should be an error? | 15:50 |
@jim:acmegating.com | nope | 15:50 |
@jim:acmegating.com | not an error | 15:50 |
@jim:acmegating.com | it's entirely reasonable and expected to add projects for sibling testing, dependencies, etc without having the app installed | 15:51 |
@mhuin:matrix.org | ah ... this gets pretty complex then if this can be expected behavior | 15:52 |
@mhuin:matrix.org | maybe that should be part of the project's description in the gui, if the connection is of type github, show whether the zuul app is installed or not | 15:54 |
@mhuin:matrix.org | well, food for thought for the weekend | 15:55 |
@jim:acmegating.com | it could potentially be okay as information, not error | 15:55 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!