opendevreview | Ian Wienand proposed opendev/system-config master: grafana: set custom home dashboard https://review.opendev.org/c/opendev/system-config/+/832179 | 00:03 |
---|---|---|
ianw | https://zuul.opendev.org/t/openstack/build/d0310e707ace4cc683e64959c2295bae/console got a 503 from zuul api, but i'm going to assume it was during the about deployment of the lb | 00:08 |
ianw | above | 00:08 |
opendevreview | Ian Wienand proposed opendev/system-config master: grafana: set custom home dashboard https://review.opendev.org/c/opendev/system-config/+/832179 | 00:35 |
opendevreview | Ian Wienand proposed opendev/system-config master: grafana: set custom home dashboard https://review.opendev.org/c/opendev/system-config/+/832179 | 01:49 |
ianw | <corvus> btw, while i'm looking at that, does anyone know how to make grafana not look like a clickbait news site? ^ | 02:21 |
ianw | last time i looked, it wasn't configurable. but I just looked again and they added a feature in 7.1 to allow setting the default dashboard. so that makes a simple page | 02:22 |
ianw | https://3b798c7d49936bd690f8-9fa499072a9f8bf63e024cc09284603e.ssl.cf1.rackcdn.com/832179/5/check/system-config-run-grafana/ce5320a/bridge.openstack.org/screenshots/grafana-main-page.png | 02:22 |
ianw | i will now delete the node i held to debug the initial failure, that turned out to be a single misplaced trailing comma in the page .json :/ | 02:24 |
*** rlandy__ is now known as rlandy|out | 05:08 | |
opendevreview | Merged opendev/system-config master: Don't run infra-prod-run-refstack on all group var updates https://review.opendev.org/c/opendev/system-config/+/832150 | 05:13 |
*** ysandeep|out is now known as ysandeep | 05:22 | |
opendevreview | Merged opendev/system-config master: Allow zuul-lb to send stats to graphite https://review.opendev.org/c/opendev/system-config/+/832148 | 05:23 |
*** jpena|off is now known as jpena | 08:02 | |
*** ysandeep is now known as ysandeep|lunch | 08:29 | |
*** gthiemon1e is now known as gthiemonge | 08:39 | |
*** ysandeep|lunch is now known as ysandeep | 08:52 | |
*** poojajadhav is now known as pojadhav | 09:01 | |
*** arxcruz is now known as arxcruz|off | 11:04 | |
*** rlandy is now known as rlandy|ruck | 11:13 | |
*** dviroel|out is now known as dviroel | 11:23 | |
opendevreview | Merged openstack/project-config master: Add Some periodic jobs to Neutron Dashboard https://review.opendev.org/c/openstack/project-config/+/831378 | 13:40 |
opendevreview | Merged opendev/system-config master: grafana: set custom home dashboard https://review.opendev.org/c/opendev/system-config/+/832179 | 13:55 |
fungi | woohoo! https://grafana.opendev.org/ looks so much better now. thanks ianw! | 14:42 |
yoctozepto | wow, nice | 14:47 |
fungi | in particular, no more ads from the grafana corporate product blog feed | 14:48 |
yoctozepto | lovely that kolla has a dedicated dashboard | 14:48 |
fungi | yoctozepto: that was originally added by mnaser in 2017 | 14:50 |
fungi | https://review.openstack.org/446697 | 14:50 |
fungi | 5 years ago next week | 14:51 |
yoctozepto | woohoo | 14:51 |
*** dviroel is now known as dviroel|lunch | 14:56 | |
mnaser | dang, time goes by | 15:01 |
*** ysandeep is now known as ysandeep|dinner | 15:36 | |
opendevreview | Francisco Seruca Salgado proposed opendev/gerritbot master: Gerrit Bot IRC post https://review.opendev.org/c/opendev/gerritbot/+/832270 | 15:40 |
*** ysandeep|dinner is now known as ysandeep | 15:55 | |
fungi | infra-root: i've temporarily disabled deployment to zuul-lb01 while i fiddle with the haproxy config there to turn on some more verbose health check logging | 16:02 |
*** dviroel|lunch is now known as dviroel | 16:13 | |
*** ysandeep is now known as ysandeep|out | 17:18 | |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check keyword to zuul01 HTTPS server line https://review.opendev.org/c/opendev/system-config/+/832294 | 17:45 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_zuul_https https://review.opendev.org/c/opendev/system-config/+/832295 | 17:45 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check keyword to gitea01 HTTPS server line https://review.opendev.org/c/opendev/system-config/+/832296 | 17:45 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_git_https https://review.opendev.org/c/opendev/system-config/+/832297 | 17:45 |
fungi | infra-root: ^ i'm around to monitor if someone else wants to approve those. i've already tested that configuration on zuul-lb01 and confirmed it got the health checks running there | 17:46 |
fungi | i'll set all but the first one wip so we can monitor the effects one at a time | 17:46 |
fungi | i've also taken zuul-lb01 back out of the disable list now | 17:46 |
fungi | i guess we could safely approve 832295 and 832296 at the same time so i didn't wip the latter | 17:49 |
*** jpena is now known as jpena|off | 17:52 | |
corvus | fungi: if you confirmed it already, why not do 94 and 95 together? | 17:52 |
corvus | (even though things are busier now, it's still not the worst thing in the world if we lose the zuul web app for a few mins) | 17:52 |
fungi | i'd be fine with that, just trying to be cautious | 18:03 |
fungi | happy to smash 94-96 together even, if that makes sense, and just be more careful with 97 | 18:04 |
fungi | corvus: what do you think? | 18:04 |
fungi | also, if you grep 'check.*balance' from /var/log/syslog on zuul-lb01 you can see the results of a brief experiment where i repointed the proxy lines in the zuul01 apache vhost to a random incorrect port to simulate the zuul-web service being down with apache still running | 18:07 |
fungi | Mar 7 17:20:25 zuul-lb01 haproxy[34]: Health check for server balance_zuul_https/zuul01.opendev.org failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 4ms, status: 0/2 DOWN. | 18:08 |
corvus | i'm in favor of smashing :) | 18:10 |
fungi | hulk smash | 18:10 |
fungi | better than clobbering time, in this particular case | 18:11 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check keyword to balance_zuul_https servers https://review.opendev.org/c/opendev/system-config/+/832294 | 18:15 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_git_https https://review.opendev.org/c/opendev/system-config/+/832297 | 18:15 |
fungi | corvus: ^ | 18:15 |
fungi | also the well-hidden haproxy docs are a gold mine of options for fine-tuning the check details if we like, including doing sni, passing specific host headers, changing the url we want to hit, et cetera | 18:18 |
fungi | can also adjust the http protocol version, set cookies, require specific response codes, look for actual content in the response... | 18:19 |
fungi | change the method to head if get is causing additional load... | 18:20 |
corvus | we should probably switch zuul to check /api/info but i don't think we need to do that right now | 18:21 |
fungi | yeah, whatever the preferred health check url is. could also look for specific data to indicate that it's done reloading its config if that page returns content too early in a startup cycle | 18:24 |
corvus | fungi: it won't start answering on the main port until it's done (it will answer on the prometheus port though). so checking any uri on the main port, or checking the ready uri on the prometheus port are equivalent for us. but we'd need to do extra apache config to add in a health check on the prom port, so i think sticking with the main port is good for now. the uri change would be simply to reduce the load (but we should double check that | 18:35 |
corvus | -- it's possible that OPTIONS on / may be less load than even the info endpoint) | 18:35 |
fungi | noted | 19:33 |
opendevreview | Merged opendev/system-config master: Add check keyword to balance_zuul_https servers https://review.opendev.org/c/opendev/system-config/+/832294 | 20:17 |
ianw | is there anything else to do for the zuul LB ATM? | 21:15 |
Clark[m] | check check-ssl makes sense but also really? Wow | 21:49 |
fungi | yep | 21:53 |
fungi | ianw: no, i'm stuffing my face at the moment, but assuming the zuul webui is still functional and gitea01 is still in the https pool, we can approve 832297 too | 21:56 |
fungi | i'll un-wip it once i'm done eating | 21:56 |
fungi | zuul webui lgtm | 22:13 |
fungi | balance_git_https on gitea-lb01 on the other hand doesn't look so good for gitea01 | 22:15 |
fungi | "Method Not Allowed" and "Layer7 wrong status" in the show stat output | 22:15 |
fungi | i reckon we need to adjust the parameters for the check we're performing (maybe do sni or pass a host header, maybe adjust the url?) | 22:15 |
fungi | ianw: ^ ideas? | 22:16 |
fungi | i can start experimenting | 22:16 |
ianw | umm, i'll have to context switch in | 22:16 |
fungi | ianw: the executive summary is that 832294 turned on checking for gitea01's https interface (but not the other servers for now) | 22:17 |
fungi | i have a feeling the naive `GET /` performed by haproxy on the ssl socket isn't sufficient to get a good https response back from apache | 22:18 |
fungi | 832297 would turn it on for the other servers in the pool too, but that would be a disaster | 22:18 |
ianw | oh, so this was/is working on zuul-lb01, but not yet against gitea? | 22:19 |
fungi | it's working for zuul-lb01 yep, but zuul is a different beast than gitea obviously | 22:19 |
ianw | ahh, ok :) | 22:19 |
fungi | gitea may need more... wooing | 22:19 |
ianw | no timestamps in "sudo docker logs haproxy-docker_haproxy_1" ... wonder if that can be turned on. separate issue | 22:21 |
fungi | not urgent since we're running at 7/8 capacity on gitea servers and load is pretty low | 22:21 |
ianw | so it would be helpful if we could figure out how gitea01.opendev.org can return a 405 | 22:21 |
fungi | it's also possible to turn on haproxy's check logging in the defaults, if that helps | 22:21 |
fungi | it's how i tracked down what was (or more accurately, wasn't) happening on zuul-lb01 | 22:22 |
ianw | i think that the SNI thread is probably one to pull ... | 22:23 |
*** dviroel is now known as dviroel|out | 22:25 | |
ianw | "check-sni <sni>" maybe? | 22:25 |
fungi | yeah, we can pass a specific host header in the check too, if that's the problem (maybe the default vhost is returning back responses?) | 22:25 |
ianw | "The most common use is to send HTTPS checks by combining "httpchk" with SSL checks." | 22:27 |
fungi | "option log-health-checks" in defaults is what i used to turn on the additional logging on zuul-lb01 temporarily, btw | 22:27 |
fungi | i'd consider adding it permanently, since it only logs state changes anyway and so doesn't add that much volume to the logs | 22:28 |
Clark[m] | ++ | 22:32 |
ianw | curl --insecure https://38.108.68.172:3081 returns the page | 22:35 |
ianw | i would have thought that was not sending any SNI and might trigger a 405 from gitea01 | 22:35 |
fungi | yeah, it may not be sni that's the problem | 22:36 |
ianw | 38.108.68.124:40776 - - [07/Mar/2022:22:36:17 +0000] "OPTIONS / HTTP/1.0" 405 - "-" "-" | 22:36 |
ianw | is what apache is saying | 22:36 |
fungi | it's doing an options method instead of get? that's not what i'd have assumed from the haproxy docs | 22:37 |
ianw | yeah, i think we might have to specify GET to the actual request | 22:37 |
ianw | option httpchk GET / | 22:37 |
fungi | strangely, this is working for the zuul apache vhosts | 22:38 |
ianw | "curl -v --insecure -X OPTIONS https://38.108.68.172:3081" is also not giving me an error | 22:40 |
ianw | i wonder if the http/1.0 makes a difference ... | 22:40 |
ianw | no, actually, it's same thing 159.X.X.X:49372 - - [07/Mar/2022:22:40:48 +0000] "OPTIONS / HTTP/1.1" 405 - "-" "curl/7.79.1" | 22:41 |
ianw | that's me | 22:41 |
ianw | curl -v --insecure -X OPTIONS https://zuul01.opendev.org returns the html; i feel like that is *also* wrong | 22:42 |
Clark[m] | It's probably because gitea doesn't do options but cherrypy does? | 22:42 |
fungi | also if we're going to have to specify a request method, i'd probably go with head instead of get in order to reduce load from the checks, assuming that works | 22:43 |
Clark[m] | Since that ends up getting proxied | 22:43 |
Clark[m] | ++ to HEAD | 22:43 |
fungi | i buy Clark[m]'s theory | 22:43 |
Clark[m] | And / is probably fine for gitea | 22:43 |
fungi | though i'm still surprised haproxy defaults to options, i could swear i saw it said get was the default request method | 22:43 |
fungi | and yeah, i guess options would explain why the default check on zuul was returning a 200 response even though a web browser sees / redirected to /tenants | 22:44 |
fungi | (i assumed it was a "meta refresh" instead of a protocol level redirect, but didn't dig too deep there) | 22:45 |
opendevreview | Ian Wienand proposed opendev/system-config master: gitea-haproxy: issue liveness check to HEAD / https://review.opendev.org/c/opendev/system-config/+/832379 | 22:48 |
ianw | so should i manually edit in something like ^ to test? | 22:48 |
fungi | i think there are method and url parameters you can set on the server entries | 22:52 |
fungi | <method> is the optional HTTP method used with the requests. When not set, the "OPTIONS" method is used | 22:56 |
fungi | aha! | 22:56 |
fungi | so you're not mad | 22:56 |
fungi | i wish haproxy (non-enterprise) docs weren't so hard to track down | 22:57 |
ianw | yep, so HEAD / (or really just option httpchk HEAD, as "/" is default) might help | 22:59 |
ianw | (832379) | 22:59 |
corvus | fungi ianw https://review.opendev.org/832149 adds a zuul-lb dashboard | 23:00 |
fungi | ianw: yeah, putting the lb in the emergency disable list and then tweaking the config and hupping the parent haproxy process is how i tested it on zuul-lb01 | 23:03 |
ianw | it seems worth doing that as i have everything open | 23:04 |
fungi | temporarily adding the line to turn on health check logging is also useful if you haven't already | 23:04 |
ianw | you can pretty much see it from the gitea01 side anyway, as it constantly gets pings and gets the 405's | 23:05 |
ianw | ok, i've put it in emergency | 23:06 |
ianw | i've added the HEAD / | 23:07 |
ianw | i hupped it | 23:09 |
ianw | 38.108.68.124:46416 - - [07/Mar/2022:23:09:04 +0000] "HEAD / HTTP/1.0" 200 - "-" "-" | 23:09 |
ianw | we're logging those requests every 2 seconds now | 23:09 |
fungi | lgtm! | 23:09 |
ianw | there's also requests coming in | 23:10 |
ianw | so must be back in rotation | 23:10 |
fungi | Clark[m]: to your earlier point there are a number of check-something directives for the server statements, all of which change the ways checks are done, but none of them do anything if you don't also set "check" | 23:10 |
fungi | so in a weird sort of way, i guess it makes sense | 23:10 |
ianw | i think if we merged 832379 -> 832297 we would have this working | 23:11 |
fungi | sgtm | 23:11 |
fungi | i've approved 832379 and will un-wip 832297 now | 23:11 |
ianw | removed gitea-lb01 from emergency | 23:12 |
ianw | there seem to be 2 remaining issues -- 1) zuul is giving a 200 response with the page content to an OPTIONS request, which seems wrong. 2) possibly, apache for gitea should be allowing OPTIONS requests? might be important if we ever do something with data being submitted to gitea? | 23:13 |
opendevreview | Merged openstack/project-config master: Add zuul load balancer dashboard https://review.opendev.org/c/openstack/project-config/+/832149 | 23:15 |
ianw | another small thought is that the health checks are a good idea, but our usual way of finding out about degraded service has been someone who gets hashed to a failing server popping up and saying things don't work | 23:18 |
fungi | ianw: reassuringly, when i adjusted the apache vhost config on zuul01 to proxy to an unused port, haproxy started seeing a 503 response code even with the default options / request | 23:19 |
fungi | as for the incentive for health checks, why i raised the concern is that we assumed orchestrated service restarts for ha clusters were "hitless" but since we were restarting services behind an apache proxy and what haproxy was testing via tcp check was apache not the proxied services, we were creating momentary outages | 23:21 |
ianw | yep, end-to-end makes perfect sense | 23:22 |
fungi | the tcp checks were relevant initially, but adding apache between haproxy and gitea made it so haproxy was no longer taking gitea servers out of the pool when they were down if their corresponding apache proxy was still running | 23:22 |
fungi | if the entire server became unreachable, the tcp check was still useful. but not for the micro-outages we were creating by restarting services | 23:24 |
opendevreview | Ian Wienand proposed opendev/system-config master: zuul-lb : issue HEAD / checks https://review.opendev.org/c/opendev/system-config/+/832439 | 23:26 |
ianw | that changes it for zuul too, more as a robustness thing | 23:26 |
fungi | note that the result may be a redirect response, in which case something more specific than / might also be preferable (or maybe not) | 23:29 |
ianw | zuul seems to handle options; https://opendev.org/zuul/zuul/src/branch/master/zuul/web/__init__.py#L128 | 23:30 |
fungi | neat | 23:31 |
ianw | but i do not see those Access-Control-Allow-Origin | 23:31 |
ianw | headers | 23:31 |
ianw | curl -i -X OPTIONS https://zuul01.opendev.org/api/tenant/openstack/project/openstack/project-config/enqueue | 23:33 |
ianw | does, however | 23:33 |
corvus | / is not handled by an api method, it's cherrypy static hosting | 23:34 |
corvus | which is why i'm unsure whether OPTIONS or HEAD on / is better or worse than OPTIONS/HEAD/GET on /api/info | 23:34 |
corvus | the /api/info is the least resource-intensive api method; that may or may not be faster than OPTIONS or HEAD on a static file resource in cherrypy. i honestly don't know. | 23:35 |
ianw | curl -i -X OPTIONS https://zuul01.opendev.org/api/info also returns a body | 23:38 |
fungi | yeah, options returns a 200, head / will probably be a 302. i think haproxy is okay with 302 so it's probably still fine either way | 23:38 |
corvus | any request to / should be a 200, not a 302. it will return the static html file that bootstraps the js app | 23:39 |
ianw | the info calls aren't wrapped with @cherrypy.tools.handle_options(allowed_methods=['GET', ]) | 23:40 |
fungi | ahh, okay, so the browser switching to /tenants must be happening via meta refresh or something | 23:41 |
ianw | @https://opendev.org/zuul/zuul/src/branch/master/zuul/web/__init__.py#L821 | 23:41 |
corvus | fungi: the js app does the internal redirect | 23:41 |
ianw | i'm not sure if they should be | 23:42 |
ianw | OPTIONS to google.com gives you a 405 | 23:42 |
fungi | corvus: thanks. i didn't realize javascript could tell the browser to switch to a different url entirely, but i shouldn't be surprised | 23:43 |
corvus | HEAD / lgtm -- it seems about as fast as HEAD /api/info | 23:46 |
corvus | fungi ianw regarding redirects -- i think the thing to keep in mind is that the HTTP (not ssl) port should return a redirect to https in all cases from apache; but again, that should be sufficient for our purposes | 23:47 |
corvus | (i guess it's okay to hit zuul01's apache for the redirect even if zuul01's zuul-web is down; we can let those operate independently) | 23:48 |
fungi | yeah, it does and haproxy is cool with that response in the port 80 check | 23:48 |
ianw | https://review.opendev.org/c/zuul/zuul/+/734134 added the headers for the protected end-points | 23:51 |
ianw | i think everything could respond to OPTIONS. it does seem like it would be better to explicitly return 504 if it's *not* supported though | 23:52 |
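
The OPTIONS-versus-HEAD finding in the log above can be reproduced locally without touching a real load balancer. The following is a minimal sketch in Python, not the haproxy implementation: `GiteaLikeHandler` is a hypothetical stand-in for a backend that answers GET and HEAD but rejects OPTIONS with 405, and `is_healthy` assumes that 2xx and 3xx responses count as a passing check, which matches the behaviour described in the discussion.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class GiteaLikeHandler(BaseHTTPRequestHandler):
    """Hypothetical backend: serves GET and HEAD, rejects OPTIONS with 405."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_OPTIONS(self):
        self.send_response(405)
        self.send_header("Allow", "GET, HEAD")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging for the demo


def probe(host, port, method, path="/"):
    """Send one request with the given method and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()


def is_healthy(status):
    """Assumption: 2xx and 3xx pass the check, anything else fails it."""
    return 200 <= status < 400


def run_demo():
    """Start the stand-in backend on a free local port and probe it."""
    server = HTTPServer(("127.0.0.1", 0), GiteaLikeHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return {m: probe("127.0.0.1", port, m) for m in ("OPTIONS", "HEAD", "GET")}
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns a mapping of method to status code. With this stand-in backend the OPTIONS probe fails the health test while HEAD and GET pass, which is the same pattern ianw saw in the gitea01 apache log before and after switching the check to `HEAD /`.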
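
For anyone grepping syslog for the health check lines fungi pasted at 18:08, a small parser can make state changes easier to scan. This is a sketch that assumes the line layout is exactly as shown in that single pasted example; the regular expression and the `parse_health_check` helper are illustrative names, and other haproxy versions or check types may log different fields.

```python
import re

# Pattern derived only from the one sample line quoted in the log.
HEALTH_CHECK_RE = re.compile(
    r"Health check for server (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<result>\w+), reason: (?P<reason>[^,]+), code: (?P<code>\d+), "
    r'info: "(?P<info>[^"]*)", check duration: (?P<duration_ms>\d+)ms, '
    r"status: (?P<status>\d+/\d+ \w+)\."
)


def parse_health_check(line):
    """Return the fields of a health check log line, or None if no match."""
    match = HEALTH_CHECK_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["duration_ms"] = int(fields["duration_ms"])
    return fields


# The sample line from the log, reproduced verbatim.
SAMPLE = (
    "Mar 7 17:20:25 zuul-lb01 haproxy[34]: Health check for server "
    "balance_zuul_https/zuul01.opendev.org failed, reason: Layer7 wrong status, "
    'code: 503, info: "Service Unavailable", check duration: 4ms, status: 0/2 DOWN.'
)
```

`parse_health_check(SAMPLE)` yields a dictionary with keys such as `backend`, `server`, `code`, and `status`, so a short loop over `/var/log/syslog` could list which servers changed state and why.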