Monday, 2022-03-07

opendevreviewIan Wienand proposed opendev/system-config master: grafana: set custom home dashboard
ianw got a 503 from zuul api, but i'm going to assume it was during the about deployment of the lb00:08
opendevreviewIan Wienand proposed opendev/system-config master: grafana: set custom home dashboard
opendevreviewIan Wienand proposed opendev/system-config master: grafana: set custom home dashboard
ianw<corvus> btw, while i'm looking at that, does anyone know how to make grafana not look like a clickbait news site? ^02:21
ianwlast time i looked, it wasn't configurable.  but I just looked again and they added a feature in 7.1 to allow setting the default dashboard.  so that makes a simple page02:22
ianwi will now delete the node i held to debug the initial failure, that turned out to be a single misplaced trailing comma in the page .json :/02:24
*** rlandy__ is now known as rlandy|out05:08
opendevreviewMerged opendev/system-config master: Don't run infra-prod-run-refstack on all group var updates
*** ysandeep|out is now known as ysandeep05:22
opendevreviewMerged opendev/system-config master: Allow zuul-lb to send stats to graphite
*** jpena|off is now known as jpena08:02
*** ysandeep is now known as ysandeep|lunch08:29
*** gthiemon1e is now known as gthiemonge08:39
*** ysandeep|lunch is now known as ysandeep08:52
*** poojajadhav is now known as pojadhav09:01
*** arxcruz is now known as arxcruz|off11:04
*** rlandy is now known as rlandy|ruck11:13
*** dviroel|out is now known as dviroel11:23
opendevreviewMerged openstack/project-config master: Add Some periodic jobs to Neutron Dashboard
opendevreviewMerged opendev/system-config master: grafana: set custom home dashboard
fungiwoohoo! looks so much better now. thanks ianw!14:42
yoctozeptowow, nice14:47
fungiin particular, no more ads from the grafana corporate product blog feed14:48
yoctozeptolovely that kolla has a dedicated dashboard14:48
fungiyoctozepto: that was originally added by mnaser in 201714:50
fungi5 years ago next week14:51
*** dviroel is now known as dviroel|lunch14:56
mnaserdang, time goes by15:01
*** ysandeep is now known as ysandeep|dinner15:36
opendevreviewFrancisco Seruca Salgado proposed opendev/gerritbot master: Gerrit Bot IRC post
*** ysandeep|dinner is now known as ysandeep15:55
fungiinfra-root: i've temporarily disabled deployment to zuul-lb01 while i fiddle with the haproxy config there to turn on some more verbose health check logging16:02
*** dviroel|lunch is now known as dviroel16:13
*** ysandeep is now known as ysandeep|out17:18
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check keyword to zuul01 HTTPS server line
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_zuul_https
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check keyword to gitea01 HTTPS server line
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_git_https
fungiinfra-root: ^ i'm around to monitor if someone else wants to approve those. i've already tested that configuration on zuul-lb01 and confirmed it got the health checks running there17:46
fungii'll set all but the first one wip so we can monitor the effects one at a time17:46
fungii've also taken zuul-lb01 back out of the disable list now17:46
fungii guess we could safely approve 832294 and 832296 at the same time so i didn't wip the latter17:49
*** jpena is now known as jpena|off17:52
corvusfungi: if you confirmed it already, why not do 94 and 95 together?17:52
corvus(even though things are busier now, it's still not the worst thing in the world if we lose the zuul web app for a few mins)17:52
fungii'd be fine with that, just trying to be cautious18:03
fungihappy to smash 94-96 together even, if that makes sense, and just be more careful with 9718:04
fungicorvus: what do you think?18:04
fungialso, if you grep 'check.*balance' from /var/log/syslog on zuul-lb01 you can see the results of a brief experiment where i repointed the proxy lines in the zuul01 apache vhost to a random incorrect port to simulate the zuul-web service being down with apache still running18:07
fungiMar  7 17:20:25 zuul-lb01 haproxy[34]: Health check for server balance_zuul_https/ failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 4ms, status: 0/2 DOWN.18:08
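(editor's note) A minimal sketch of how a log line like the one above could be pulled apart when grepping syslog for check results. The regular expression is modelled only on that single quoted line; the "succeeded" alternative and the helper name parse_health_check are assumptions, and other haproxy versions may word their messages differently.

```python
import re

# Pattern modelled on the one log line quoted in the channel; treat it as
# an assumption rather than a complete grammar for haproxy check logs.
HEALTH_CHECK_RE = re.compile(
    r'Health check for server (?P<backend>[^/\s]+)/(?P<server>\S*) '
    r'(?P<result>failed|succeeded), reason: (?P<reason>[^,]+)'
    r'(?:, code: (?P<code>\d+))?'
    r'.*status: (?P<status>\d+/\d+ \w+)\.'
)


def parse_health_check(line):
    """Return the check fields as a dict, or None if the line does not match."""
    match = HEALTH_CHECK_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["code"] is not None:
        fields["code"] = int(fields["code"])
    return fields


# The sample is the line from the log, with the server name left elided.
sample = (
    'Mar  7 17:20:25 zuul-lb01 haproxy[34]: Health check for server '
    'balance_zuul_https/ failed, reason: Layer7 wrong status, code: 503, '
    'info: "Service Unavailable", check duration: 4ms, status: 0/2 DOWN.'
)

if __name__ == "__main__":
    print(parse_health_check(sample))
```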
corvusi'm in favor of smashing :)18:10
fungihulk smash18:10
fungibetter than clobbering time, in this particular case18:11
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check keyword to balance_zuul_https servers
opendevreviewJeremy Stanley proposed opendev/system-config master: Add check to remainder of balance_git_https
fungicorvus: ^18:15
fungialso the well-hidden haproxy docs are a gold mine of options for fine-tuning the check details if we like, including doing sni, passing specific host headers, changing the url we want to hit, et cetera18:18
fungican also adjust the http protocol version, set cookies, require specific response codes, look for actual content in the response...18:19
fungichange the method to head if get is causing additional load...18:20
corvuswe should probably switch zuul to check /api/info but i don't think we need to do that right now18:21
fungiyeah, whatever the preferred health check url is. could also look for specific data to indicate that it's done reloading its config if that page returns content too early in a startup cycle18:24
corvusfungi: it won't start answering on the main port until it's done (it will answer on the prometheus port though).  so checking any uri on the main port, or checking the ready uri on the prometheus port are equivalent for us.  but we'd need to do extra apache config to add in a health check on the prom port, so i think sticking with the main port is good for now.  the uri change would be simply to reduce the load  (but we should double check that18:35
corvus-- it's possible that OPTIONS on / may be less load than even the info endpoint)18:35
opendevreviewMerged opendev/system-config master: Add check keyword to balance_zuul_https servers
ianwis there anything else to do for the zuul LB ATM?21:15
Clark[m]check check-ssl makes sense but also really? Wow21:49
fungiianw: no, i'm stuffing my face at the moment, but assuming the zuul webui is still functional and gitea01 is still in the https pool, we can approve 832297 too21:56
fungii'll un-wip it once i'm done eating21:56
fungizuul webui lgtm22:13
fungibalance_git_https on gitea-lb01 on the other hand doesn't look so good for gitea0122:15
fungi"Method Not Allowed" and "Layer7 wrong status" in the show stat output22:15
fungii reckon we need to adjust the parameters for the check we're performing (maybe do sni or pass a host header, maybe adjust the url?)22:15
fungiianw: ^ ideas?22:16
fungii can start experimenting22:16
ianwumm, i'll have to context switch in22:16
fungiianw: the executive summary is that 832294 turned on checking for gitea01's https interface (but not the other servers for now)22:17
fungii have a feeling the naive `GET /` performed by haproxy on the ssl socket isn't sufficient to get a good https response back from apache22:18
fungi832297 would turn it on for the other servers in the pool too, but that would be a disaster22:18
ianwoh, so this was/is working on zuul-lb01, but not yet against gitea?22:19
fungiit's working for zuul-lb01 yep, but zuul is a different beast than gitea obviously22:19
ianwahh, ok :)22:19
fungigitea may need more... wooing22:19
ianwno timestamps in "sudo docker logs haproxy-docker_haproxy_1" ... wonder if that can be turned on. separate issue22:21
funginot urgent since we're running at 7/8 capacity on gitea servers and load is pretty low22:21
ianwso it would be helpful if we could figure out what is returning the 40522:21
fungiit's also possible to turn on haproxy's check logging in the defaults, if that helps22:21
fungiit's how i tracked down what was (or more accurately, wasn't) happening on zuul-lb0122:22
ianwi think that the SNI thread is probably one to pull ... 22:23
*** dviroel is now known as dviroel|out22:25
ianw"check-sni <sni>" maybe?22:25
fungiyeah, we can pass a specific host header in the check too, if that's the problem (maybe the default vhost is returning back responses?)22:25
ianw"The most common use is to send HTTPS checks by combining "httpchk" with SSL checks."22:27
fungi"option  log-health-checks" in defaults is what i used to turn on the additional logging on zuul-lb01 temporarily, btw22:27
fungii'd consider adding it permanently, since it only logs state changes anyway and so doesn't add that much volume to the logs22:28
ianwcurl --insecure returns the page22:35
ianwi would have thought that was not sending any SNI and might trigger a 405 from gitea0122:35
fungiyeah, it may not be sni that's the problem22:36
ianw38.108.68.124:40776 - - [07/Mar/2022:22:36:17 +0000] "OPTIONS / HTTP/1.0" 405 - "-" "-"22:36
ianwis what apache is saying22:36
fungiit's doing an options method instead of get? that's not what i'd have assumed from the haproxy docs22:37
ianwyeah, i think we might have to specify GET to the actual request22:37
ianwoption httpchk GET /22:37
fungistrangely, this is working for the zuul apache vhosts22:38
ianw"curl -v --insecure  -X OPTIONS" is also not giving me an error22:40
ianwi wonder if the http/1.0 makes a difference ...22:40
ianwno, actually, it's same thing 159.X.X.X:49372 - - [07/Mar/2022:22:40:48 +0000] "OPTIONS / HTTP/1.1" 405 - "-" "curl/7.79.1"22:41
ianwthat's me22:41
ianwcurl -v --insecure  -X OPTIONS returns the html; i feel like that is *also* wrong22:42
Clark[m]It's probably because gitea doesn't do options but cherrypy does?22:42
fungialso if we're going to have to specify a request method, i'd probably go with head instead of get in order to reduce load from the checks, assuming that works22:43
Clark[m]Since that ends up getting proxied22:43
Clark[m]++ to HEAD22:43
fungii buy Clark[m]'s theory22:43
Clark[m]And / is probably fine for gitea22:43
fungithough i'm still surprised haproxy defaults to options, i could swear i saw it said get was the default request method22:43
fungiand yeah, i guess options would explain why the default check on zuul was returning a 200 response even though a web browser sees / redirected to /tenants22:44
fungi(i assumed it was a "meta refresh" instead of a protocol level redirect, but didn't dig too deep there)22:45
opendevreviewIan Wienand proposed opendev/system-config master: gitea-haproxy: issue liveness check to HEAD /
ianwso should i manually edit in something like ^ to test?22:48
fungii think there are method and url parameters you can set on the server entries22:52
fungi<method>  is the optional HTTP method used with the requests. When not set, the "OPTIONS" method is used22:56
fungiso you're not mad22:56
fungii wish haproxy (non-enterprise) docs weren't so hard to track down22:57
ianwyep, so HEAD / (or really just "option httpchk HEAD", as "/" is the default) might help22:59
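(editor's note) The method mismatch discussed above can be reproduced without touching the real servers. This is a sketch under stated assumptions: StandInHandler is a hypothetical local backend that, like the gitea vhost described in the log, answers HEAD and GET but rejects OPTIONS with a 405. It is not gitea, Apache or haproxy, and probe and run_probes are illustrative helper names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StandInHandler(BaseHTTPRequestHandler):
    """Hypothetical backend: accepts HEAD and GET, rejects OPTIONS."""

    def _reply(self, status):
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_HEAD(self):
        self._reply(200)

    def do_GET(self):
        self._reply(200)

    def do_OPTIONS(self):
        self._reply(405)  # "Method Not Allowed", as seen in the check output

    def log_message(self, *args):
        pass  # keep the example quiet


def probe(host, port, method, path="/"):
    """Send one request with the given method and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()


def run_probes(methods=("OPTIONS", "HEAD", "GET")):
    """Start the stand-in on a free local port and probe it with each method."""
    server = HTTPServer(("127.0.0.1", 0), StandInHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return {method: probe("127.0.0.1", port, method) for method in methods}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_probes())
```

Against such a backend only the OPTIONS probe fails, which matches the reasoning for switching the check to HEAD.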
corvusfungi ianw adds a zuul-lb dashboard23:00
fungiianw: yeah, putting the lb in the emergency disable list and then tweaking the config and hupping the parent haproxy process is how i tested it on zuul-lb0123:03
ianwit seems worth doing that as i have everything open23:04
fungitemporarily adding the line to turn on health check logging is also useful if you haven't already23:04
ianwyou can pretty much see it from the gitea01 side anyway, as it constantly gets pings and gets the 405's23:05
ianwok, i've put it in emergency23:06
ianwi've added the HEAD /23:07
ianwi hupped it23:09
ianw38.108.68.124:46416 - - [07/Mar/2022:23:09:04 +0000] "HEAD / HTTP/1.0" 200 - "-" "-"23:09
ianwwe're logging those requests every 2 seconds now23:09
ianwthere's also requests coming in23:10
ianwso must be back in rotation23:10
fungiClark[m]: to your earlier point there are a number of check-something directives for the server statements, all of which change the ways checks are done, but none of them do anything if you don't also set "check"23:10
fungiso in a weird sort of way, i guess it makes sense23:10
ianwi think if we merged 832379 -> 832297 we would have this working23:11
fungii've approved 832379 and will un-wip 832297 now23:11
ianwremoved gitea-lb01 from emergency23:12
ianwthere seem to be 2 remaining issues -- 1) zuul is giving a 200 response with the page content to an OPTIONS request, which seems wrong.  2) possibly, apache for gitea should be allowing OPTIONS requests?  might be important if we ever do something with data being submitted to gitea?23:13
opendevreviewMerged openstack/project-config master: Add zuul load balancer dashboard
ianwanother small thought is that the health checks are a good idea, but our usual way of finding out about degraded service has been someone who gets hashed to a failing server popping up and saying things don't work23:18
fungiianw: reassuringly, when i adjusted the apache vhost config on zuul01 to proxy to an unused port, haproxy started seeing a 503 response code even with the default options / request23:19
fungias for the incentive for health checks, why i raised the concern is that we assumed orchestrated service restarts for ha clusters were "hitless" but since we were restarting services behind an apache proxy and what haproxy was testing via tcp check was apache not the proxied services, we were creating momentary outages23:21
ianwyep, end-to-end makes perfect sense23:22
fungithe tcp checks were relevant initially, but adding apache between haproxy and gitea made it so haproxy was no longer taking gitea servers out of the pool when they were down if their corresponding apache proxy was still running23:22
fungiif the entire server became unreachable, the tcp check was still useful. but not for the micro-outages we were creating by restarting services23:24
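(editor's note) The gap between a TCP check and an HTTP check can be sketched locally. Assumptions: ProxyWithDeadBackend is a hypothetical stand-in for an Apache proxy whose proxied service is down, so it accepts connections but answers 503. The helper names and the 2xx/3xx pass rule are illustrative and not taken from the real configuration.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProxyWithDeadBackend(BaseHTTPRequestHandler):
    """Hypothetical front-end proxy: reachable, but its backend is down."""

    def do_HEAD(self):
        self.send_response(503)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def tcp_check(host, port):
    """Pass if a TCP connection can be opened at all."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False


def http_check(host, port):
    """Pass only if HEAD / returns a 2xx or 3xx status."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", "/")
        return 200 <= conn.getresponse().status < 400
    except OSError:
        return False
    finally:
        conn.close()


def compare_checks():
    """Return (tcp_result, http_result) for the stand-in proxy."""
    server = HTTPServer(("127.0.0.1", 0), ProxyWithDeadBackend)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return tcp_check("127.0.0.1", port), http_check("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(compare_checks())
```

The TCP check passes because the proxy still accepts connections, while the HTTP check fails, which is the micro-outage case described above.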
opendevreviewIan Wienand proposed opendev/system-config master: zuul-lb : issue HEAD / checks
ianwthat changes it for zuul too, more as a robustness thing23:26
funginote that the result may be a redirect response, in which case something more specific than / might also be preferable (or maybe not)23:29
ianwzuul seems to handle options;
ianwbut i do not see those Access-Control-Allow-Origin23:31
ianwcurl -i -X OPTIONS
ianwdoes, however23:33
corvus/ is not handled by an api method, it's cherrypy static hosting23:34
corvuswhich is why i'm unsure whether OPTIONS or HEAD on /  is better or worse than OPTIONS/HEAD/GET on /api/info23:34
corvusthe /api/info is the least resource-intensive api method; that may or may not be faster than OPTIONS or HEAD on a static file resource in cherrypy.  i honestly don't know.23:35
ianwcurl -i -X OPTIONS also returns a body23:38
fungiyeah, options returns a 200, head / will probably be a 302. i think haproxy is okay with 302 so it's probably still fine either way23:38
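(editor's note) A small sketch of the acceptance rule being relied on here: with an HTTP check and no explicit expectation configured, haproxy treats 2xx and 3xx statuses as passing and anything else as a failure. The function name and the sample labels are illustrative; the status codes are the ones mentioned in the log.

```python
def check_passes(status_code):
    """Mimic the default rule: any 2xx or 3xx status counts as healthy."""
    return 200 <= status_code < 400


# Status codes mentioned in the discussion, with illustrative labels.
observed = {
    "HEAD / answered normally": 200,
    "a redirect": 302,
    "OPTIONS / rejected": 405,
    "proxy pointed at a dead port": 503,
}

if __name__ == "__main__":
    for label, code in observed.items():
        print(label, code, "UP" if check_passes(code) else "DOWN")
```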
corvusany request to / should be a 200, not a 302.  it will return the static html file that bootstraps the js app23:39
ianwthe info calls aren't wrapped with['GET', ])23:40
fungiahh, okay, so the browser switching to /tenants must be happening via meta refresh or something23:41
corvusfungi: the js app does the internal redirect23:41
ianwi'm not sure if they should be23:42
ianwOPTIONS to gives you a 40523:42
fungicorvus: thanks. i didn't realize javascript could tell the browser to switch to a different url entirely, but i shouldn't be surprised23:43
corvusHEAD / lgtm -- it seems about as fast as HEAD /api/info23:46
corvusfungi ianw regarding redirects -- i think the thing to keep in mind is that the HTTP (not ssl) port should return a redirect to https in all cases from apache; but again, that should be sufficient for our purposes23:47
corvus(i guess it's okay to hit zuul01's apache for the redirect even if zuul01's zuul-web is down; we can let those operate independently)23:48
fungiyeah, it does and haproxy is cool with that response in the port 80 check23:48
ianw added the headers for the protected end-points23:51
ianwi think everything could respond to OPTIONS.  it does seem like it would be better to explicitly return 405 if it's *not* supported though23:52
