Clark[m] | Deployment concluded successfully and a requirements change merged and replicated successfully | 02:10 |
---|---|---|
*** dhill is now known as Guest10533 | 03:02 | |
qwebirc35203 | Is there a known problem or maintenance causing opendev.org to be down? | 05:02 |
tonyb | qwebirc35203: can you be more specific? I'm not seeing an issue with review.opendev.org | 05:08 |
qwebirc35203 | Some subdomains seem to work, but the base domain, as in https://opendev.org/, is not opening for me (tested a few different providers and a few website-testing sites; all give timeouts) | 05:09 |
tonyb | qwebirc35203: hmm okay. I see it. I'll look into it | 05:11 |
qwebirc35203 | tonyb: Thank you. | 05:11 |
tonyb | #status notice Web, and possibly others, services on opendev.org appear to be down. Admins are investigating. | 05:18 |
opendevstatus | tonyb: sending notice | 05:18 |
-opendevstatus- NOTICE: Web, and possibly others, services on opendev.org appear to be down. Admins are investigating. | 05:18 | |
tonyb | Ooops that wasn't supposed to be notice :/ | 05:20 |
opendevstatus | tonyb: finished sending notice | 05:21 |
tonyb | qwebirc35203: Try again now. It's working for me | 05:46 |
tonyb | qwebirc35203: and stalled again | 05:56 |
qwebirc35203 | tonyb: Ok, I probably wasn't fast enough on testing, because I didn't get a successful load. | 05:57 |
qwebirc35203 | Routing seems ok as traceroute completes to the ip that resolves from opendev.org | 05:58 |
tonyb | Yup, and the systems behind the LoadBalancer seem fine, but not the LB itself. | 05:59 |
tonyb | it was fine after a process restart | 05:59 |
qwebirc35203 | Someone session flooding the lb? | 06:00 |
tonyb | qwebirc35203: Something like that. I'm struggling to verify that :/ | 06:10 |
qwebirc35203 | If you are running haproxy, I'm assuming you've checked hatop? | 06:12 |
tonyb | qwebirc35203: we don't seem to have that. | 06:14 |
qwebirc35203 | You might have the stats web endpoint; that's usually on a different port, so it should still be accessible | 06:16 |
qwebirc35203 | Or you can get the raw stats with: echo "show stat" | nc -U /var/run/haproxy.stat (your stat socket location may differ) | 06:19 |
tonyb | I can certainly do the last one. I'll fess up that I'm "new" here so I'm struggling to filter the normal from the "problem pointers" | 06:20 |
qwebirc35203 | echo "show info" | nc -U /var/run/haproxy.stat | 06:21 |
qwebirc35203 | That should give you the generic stats | 06:21 |
qwebirc35203 | If CurrConns is hitting Hard_maxconn, then something is hogging the connections | 06:22 |
tonyb | Okay. That is the current state. | 06:23 |
tonyb | Maxconn: 4000 | 06:23 |
tonyb | Hard_maxconn: 4000 | 06:23 |
tonyb | CurrConns: 4000 | 06:23 |
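As an aside, the same admin socket can break those totals down per frontend and backend. A minimal sketch, assuming the socket path from the example above; the scur/smax/slim columns of the "show stat" CSV show which proxy is actually saturated:

```shell
# Per-proxy session counts from the haproxy admin socket (path is an
# assumption; check the "stats socket" line in haproxy.cfg if it differs).
echo "show stat" | nc -U /var/run/haproxy.stat |
  awk -F',' 'NR==1 || $1=="" {next} {print $1, $2, "scur="$5, "smax="$6, "slim="$7}'
```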
qwebirc35203 | Ok, then it's a bit of an advanced topic to find out what is doing that. I would recommend forwarding the issue to someone who is used to solving that kind of thing | 06:24 |
tonyb | qwebirc35203: Thanks for your help. | 06:25 |
qwebirc35203 | Could be someone intentionally doing abuse, or just automations hammering the site after some downtime | 06:25 |
qwebirc35203 | Increasing logging or using tcpdump may be required to figure out what is happening and how to correct the situation. If it is single source, that can be firewalled or rate limited from haproxy | 06:26 |
qwebirc35203 | Correct solution heavily depends on what is actually happening | 06:27 |
qwebirc35203 | Increasing max connections shouldn't be a problem, but I wouldn't do that without consulting someone who has experience with that specific system | 06:29 |
tonyb | Sure, but given how quickly the connections filled up last time, I suspect any additional connections will also just fill up | 06:31 |
qwebirc35203 | Yes, that is likely. And more connections on the load balancer may cause issues for the servers behind it during normal operation. That's why changing the number must be an informed decision made with knowledge of the system. | 06:32 |
qwebirc35203 | I would probably do a tcpdump on port 443 with a limited packet count to check whether some IP has a high dominance in the packets received. If you don't limit the packet count on a remote system you might have trouble stopping the capture. But every admin has their preferred way; some would add verbosity to haproxy logging. | 06:36 |
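A minimal sketch of that kind of bounded capture, assuming the interface is eth0 and that HTTPS traffic on 443 is what matters; adjust both to the actual host:

```shell
# Sample a fixed number of packets and count source addresses. -c makes the
# capture stop on its own; -nn skips DNS and port name lookups. tcpdump
# prints "TIME IP SRC.PORT > DST.PORT: ...", so field 3 is the source and
# the sed strips its trailing ".PORT".
sudo tcpdump -nn -c 5000 -i eth0 'tcp dst port 443' 2>/dev/null |
  awk '{print $3}' | sed 's/\.[0-9]*$//' |
  sort | uniq -c | sort -rn | head
```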
tonyb | I'm confident I can up the maxconn and the load balancer and backend will handle it. But that's without knowing whether this is innocent or not | 06:37 |
qwebirc35203 | Tcpdump has the advantage of having no effect on logging (if the logs are processed or forwarded somewhere, changing verbosity may cause some trouble; same goes if disk space is at a premium) | 06:37 |
tonyb | qwebirc35203: Yeah I've looked at tcpdump and there isn't a visible hog. | 06:37 |
qwebirc35203 | You should be able to do 2x or 4x the connections without issues, at least on the haproxy side. If you do that, make sure the Ulimit-n value in the output increases accordingly | 06:40 |
qwebirc35203 | Because haproxy has a front connection and a back connection per session, you need double the sockets relative to maxconn | 06:41 |
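A sketch of checking that relationship before raising anything, reusing the same assumed socket path as earlier:

```shell
# Compare maxconn against the file-descriptor limit haproxy reports; with
# roughly two sockets per session (client side plus server side), Ulimit-n
# needs comfortable headroom over 2 * maxconn.
echo "show info" | nc -U /var/run/haproxy.stat |
  awk -F': ' '
    $1 == "Maxconn"  { maxconn = $2 }
    $1 == "Ulimit-n" { ulimit = $2 }
    END { printf "maxconn=%d ulimit-n=%d (want ulimit-n well above 2*maxconn)\n",
                 maxconn, ulimit }'
```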
tonyb | Yup | 06:42 |
qwebirc35203 | Just came in for a quick check from webirc; switching to a better client as TNX. | 06:57 |
tonyb | Kk | 06:59 |
TNX | Good luck with resolving the issue, I'll see how it went later on. I hit that site issue while finishing some update tasks for the night; for me it's late for the work shift and early for the day. | 07:04 |
TNX | I got everything important done anyway, but thought I'd give notice about the issue | 07:05 |
tonyb | Thanks | 07:05 |
tonyb | I appreciate it | 07:05 |
tonyb | #status log The gitea load balancer on opendev.org is saturated and therefore new connections are timing out. Admins are investigating. | 07:21 |
opendevstatus | tonyb: finished logging | 07:21 |
tonyb | infra-root: gitea-lb02 is "flooded" CurrConns == Maxconn == 4000, I'm basically out of ideas. | 08:41 |
tonyb | https://etherpad.opendev.org/p/gitea-debugging is a summary of the last couple of hours. I'll be back online after dinner etc | 08:43 |
eandersson | I think something is broken again? e.g. this does not load for me at all https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt | 12:15 |
eandersson | oh or maybe the issue is just ongoing | 12:16 |
fungi | eandersson: yes, we suspect something has gone awry with internal networking in vexxhost's sjc1 region | 13:11 |
fungi | though not for everything... i'm able to reach https://mirror.sjc1.vexxhost.opendev.org/ with no problem | 13:13 |
fungi | https://gitea09.opendev.org:3000/ loads too | 13:14 |
fungi | i can ssh to the opendev.org load balancer with no problem, seems like it's just haproxy itself that's getting overloaded | 13:15 |
fungi | http://cacti.openstack.org/ graphs for gitea-lb02 are pretty striking. user cpu and load average both spike to fill the two available vcpus right at 04:00 utc | 13:18 |
fungi | at the same time, network traffic drops to almost imperceptible levels | 13:19 |
fungi | /var/log/haproxy.log has been silent since 07:35 utc, nearly 6 hours ago | 13:21 |
fungi | looks like haproxy was restarted at roughly 07:30, so maybe it was briefly logging traffic until it fell over again | 13:24 |
fungi | doing some analysis of packet capture samples to see if there's any common sources at the moment | 13:36 |
fungi | at the moment there's one host that seems to account for around 30% of all inbound packets for the ports haproxy is listening on | 13:38 |
fungi | i've blocked the two highest volume sources in iptables and am restarting the haproxy container to see if it regains its senses | 13:45 |
fungi | the site is loading for me at the moment, but i'm not sure that will last | 13:46 |
fungi | source addresses have changed now, majority of high-volume senders are known parties (red hat, suse, microsoft...) | 13:48 |
fungi | now the site's back to not responding for me | 13:50 |
fungi | yeah, 13:49:08 was the last request in the haproxy log | 13:51 |
Clark[m] | Haproxy should log to syslog iirc. You get better data about the requests on the gitea side since that shows you l7 URL paths and such. My first thought is this is the web crawler ddos we've seen before. | 13:52 |
Clark[m] | Haproxy logs the conclusion of connections so if we get no logs whatever it is may simply be holding connections open | 13:52 |
Clark[m] | Gitea will log the start of a request and the completion of a request | 13:53 |
Clark[m] | I think we want to look for started but not concluded requests in gitea and work back from there | 13:53 |
fungi | yeah, do you know how it differentiates them in the log? | 13:54 |
Clark[m] | It puts a string like started and finished in the log lines | 13:54 |
Clark[m] | Request started /foo/bar or something along those lines | 13:55 |
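A rough sketch of comparing those markers; the log path and the exact "Started"/"Completed" strings are assumptions, since the real wording depends on gitea's router log configuration (the log may also need to be read via docker logs on the backend):

```shell
# Count request starts vs completions in the gitea router log; a large gap
# points at requests that were opened but never finished.
grep -ci 'started '   /var/log/gitea/gitea.log
grep -ci 'completed ' /var/log/gitea/gitea.log
```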
fungi | thing is, /var/log/haproxy.log only has 13 entries between when i started it and when it stopped logging | 13:56 |
Clark[m] | Ya haproxy only logs when connections end iirc | 13:56 |
Clark[m] | On the lb side netstat/ss to list tcp conn state may be more helpful | 13:57 |
Clark[m] | Basically something appears to be making 4k connections and never closing them if I interpret this correctly. But it is early for me so take that with a grain of salt | 13:59 |
Clark[m] | Restarting helps things because you manually kill those connections and let other stuff reconnect until whatever is misbehaving comes back and grabs all the connections again | 13:59 |
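A sketch of that connection-state survey on the load balancer, assuming haproxy is listening on ports 80 and 443:

```shell
# Top remote addresses holding connections to the frontends. ss prints
# "State Recv-Q Send-Q Local:Port Peer:Port", so field 5 is the peer; the
# sed strips the client port.
ss -Htn '( sport = :443 or sport = :80 )' |
  awk '{print $5}' | sed 's/:[0-9]*$//' |
  sort | uniq -c | sort -rn | head

# Breakdown of those connections by TCP state (ESTAB, CLOSE-WAIT, ...).
ss -Htan '( sport = :443 or sport = :80 )' |
  awk '{print $1}' | sort | uniq -c | sort -rn
```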
fungi | yep | 14:00 |
Clark[m] | But also I think haproxy may log to syslog so double check there? | 14:04 |
Clark[m] | Sorry I can't really look myself without waking the entire house on a Saturday morning | 14:04 |
fungi | no worries, i've blocked the top 5 ip sources with open connections according to netstat and am restarting the container again | 14:06 |
Clark[m] | And ya cross checking against gitea web logs may give more insight into what the clients are attempting to accomplish | 14:08 |
fungi | if things stabilize, then i can slowly remove the firewall rejects one by one until the problem comes back too | 14:10 |
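For reference, a sketch of that block-and-restart cycle; the address is a placeholder and the compose directory is an assumption about where the haproxy service is deployed:

```shell
# Reject one misbehaving source on the frontend ports (use ip6tables for
# IPv6 sources), then recreate the haproxy container to drop the
# already-established sessions.
sudo iptables -I INPUT -s 198.51.100.7 -p tcp -m multiport --dports 80,443 -j REJECT
cd /etc/haproxy-docker && sudo docker-compose down && sudo docker-compose up -d
```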
fungi | now it's no longer loading for me again | 14:13 |
Clark[m] | Whatever it is probably requires characterization beyond simply the worst IPs. | 14:13 |
Clark[m] | Because there are more behind them | 14:14 |
fungi | yeah, i added a couple more, the new top two with open sockets shortly after things hung, so it's possible one of them is causing it | 14:14 |
fungi | they were several times higher than the runners-up | 14:15 |
Clark[m] | And we're reasonably happy that this isn't a failure of the backends because those work? | 14:15 |
fungi | i'll retest them all directly to be sure, i only spot-checked some | 14:16 |
Clark[m] | Ya backend responds immediately for me | 14:16 |
fungi | all 6 return the expected content for me on 3000/tcp | 14:17 |
fungi | one more try. this time blocked a few /24 networks that had a large number of open connections spread across many addresses | 14:23 |
Clark[m] | If that doesn't help I would check gitea logs as the next step and try to understand if clients are making it that far | 14:23 |
fungi | yeah, that will be my next stop, but trying to correlate those is a lot more time consuming so wanted to rule out the easy things first | 14:24 |
Clark[m] | Sometimes it's pretty apparent just because the patterns are "odd" but if not then ya | 14:24 |
fungi | at the moment, all /24 networks with clients that have sockets open through the load balancer are under 25 each | 14:26 |
fungi | cpu on the load balancer is still maxxed out though | 14:27 |
fungi | ~350 total sockets open for 80/tcp or 443/tcp | 14:28 |
fungi | 432 sockets open now and the service is back to not responding again | 14:30 |
Clark[m] | There must be more SYN'ing then to hit 4k? | 14:31 |
fungi | i forgot we have https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer | 14:32 |
fungi | maybe i'll find more clues there | 14:32 |
fungi | definitely reports ~4k concurrent sessions | 14:33 |
fungi | looks like tons of concurrent sessions to gitea12 | 14:36 |
Clark[m] | There is a clue in those graphs. Gitea12 has the most conns so it is hashing more things (or one thing) to there | 14:36 |
Clark[m] | I would look at logs on gitea12 | 14:36 |
fungi | right, so classifying the traffic it's seeing will help narrow this down | 14:36 |
Clark[m] | And work backward | 14:36 |
Clark[m] | ++ | 14:36 |
fungi | earlier in the day it was mostly hitting gitea11, looks like | 14:37 |
Clark[m] | The average session time graph is interesting: it basically says the average session is 0 seconds on 12? Could this be an old-fashioned tcp ddos? | 14:40 |
Clark[m] | Oh wait we probably need session to end to report that data | 14:40 |
Clark[m] | Same issue with logging not happening until sessions close | 14:40 |
fungi | apache access log doesn't seem to dereference the forwarded-for client, where were we recording that? | 14:44 |
fungi | oh, i guess we only do the source port | 14:45 |
Clark[m] | You have to rely on ports. Ya it's a bug in gitea | 14:45 |
Clark[m] | They have the info but their logging mangles it so we can't get it last I tried | 14:45 |
fungi | so figuring out the client address(es) is... hrm | 14:46 |
Clark[m] | It should be possible but also annoying | 14:47 |
Clark[m] | I think the haproxy command socket will give you mappings on the haproxy side since we don't have logs yet | 14:48 |
Clark[m] | And then align the port numbers with the backend? | 14:49 |
Clark[m] | But also the urls and user agents may be sufficient clues for proceeding, if it is the typical crawler dos we've seen | 14:49 |
Clark[m] | Oh I think I remember the forwarded-for issue. We go haproxy to Apache to gitea, and for some reason gitea can't parse out the top-level forwarding info. The Apache logs may be easier to work with? | 14:54 |
fungi | yeah, i was looking in the apache logs but they show the haproxy host's address as the client | 14:55 |
fungi | i thought we had protocol-level proxy client signalling turned on (whatever that's called) | 14:55 |
fungi | so that haproxy can communicate the original client address to apache | 14:55 |
fungi | but if so, that's not being reflected in apache's access log | 14:56 |
fungi | i'm fumbling around with the haproxy command socket right now, but no luck figuring out how to get it to dump a list of clients/sessions yet. the context help is quite terse, i'll probably need to crack open the manual and digest haproxy's terminology | 14:57 |
fungi | aha, show sess | 14:58 |
fungi | it has 2391 sessions matching be=balance_git_https srv=gitea12.opendev.org | 14:59 |
fungi | 2214 of those are from codesearch :/ | 15:01 |
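A sketch of the kind of tally that produces those numbers, again assuming the admin socket path from earlier; src= and srv= are fields haproxy prints in its show sess output:

```shell
# Dump the session table once, then group by backend server and by client.
echo "show sess" | nc -U /var/run/haproxy.stat > /tmp/sess.txt

# Sessions per backend server.
grep -o 'srv=[^ ]*' /tmp/sess.txt | sort | uniq -c | sort -rn

# Top client addresses (strip the trailing :port from src=addr:port).
grep -o 'src=[^ ]*' /tmp/sess.txt | sed 's/:[0-9]*$//' |
  sort | uniq -c | sort -rn | head
```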
Clark[m] | Maybe block code search and see if it helps? | 15:02 |
Clark[m] | If code search is struggling we can always fix that later | 15:03 |
corvus | fyi, hound has not recently restarted so seems unlikely to be a sudden behavior change there | 15:03 |
Clark[m] | Hrm | 15:03 |
fungi | agreed, that was the first thing i checked | 15:04 |
fungi | okay, restarted the haproxy container again, this time with codesearch blocked | 15:04 |
corvus | hound started reporting errors on dec 14 | 15:05 |
fungi | it was connecting over ipv6, but i've blocked its ipv4 address too just to make sure it doesn't fall back on that | 15:05 |
corvus | (hound had intermittent problems dec 14 and 15; looks much worse today) | 15:06 |
corvus | that may suggest a longer time window for whatever issue | 15:06 |
fungi | oddly the cpu usage and system load spiked up at precisely 04:00, which seems like suspicious timing | 15:07 |
fungi | haproxy is still basically maxing out cpu utilization | 15:08 |
fungi | but maybe it will settle after a bit | 15:08 |
corvus | (apropos of nothing, gitea11's ipv4 has a ptr record of test.example2.com) | 15:09 |
fungi | lovely | 15:09 |
fungi | nb02 has a bunch of open sessions through haproxy at the moment | 15:12 |
fungi | also something in osuosl, maybe a test node | 15:12 |
fungi | site's still responding though | 15:13 |
fungi | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer shows current sessions have exceeded 3k, climbing steadily but less rapidly than before | 15:14 |
fungi | yeah, the site's unresponsive again | 15:16 |
Clark[m] | Codesearch was probably failing due to whatever the real issue is and making it worse by quickly trying to fetch things | 15:16 |
fungi | ipv6 address of nb02 currently has the largest number of active sessions through the lb | 15:16 |
Clark[m] | Did the gitea or Apache logs show many commit requests by weird UAs? | 15:16 |
fungi | not that i spotted, lots of odd agent strings (Blackbox Exporter, SemrushBot, Faraday, ...) but i'll see if i can classify based on that | 15:18 |
fungi | largest count of agents on gitea12 since the last restart is python-requests/2.28.1 | 15:20 |
fungi | i'll check the other backends, but if it's a ddos it would in theory be hitting them all pretty equally | 15:21 |
Clark[m] | The UAs that typically gives us trouble are for old cellphones and Chinese android browsers | 15:21 |
fungi | gitea10 had 1599 connections for git/2.39.2 | 15:22 |
fungi | that stands out | 15:22 |
fungi | others are mostly search bots (yandex, bing, openai) | 15:23 |
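A sketch of that user-agent tally; the access log path is an assumption, and it presumes a combined-format log where the agent is the last quoted field:

```shell
# Count user agents seen by the backend's Apache since the last rotation.
awk -F'"' '{print $(NF-1)}' /var/log/apache2/access.log |
  sort | uniq -c | sort -rn | head -20
```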
fungi | yeah, after the last restart, grafana also says the bulk of current sessions were for gitea10 | 15:24 |
fungi | i wonder if that's where nb02 is getting sent | 15:25 |
Clark[m] | The pattern we've seen cause issues is requesting many commit urls in short periods of time. But I suspect that isn't an issue here because I can directly get urls like that without trouble talking to a backend | 15:25 |
fungi | src=2001:4800:7818:104:be76:4eff:fe02:5608:56772 fe=balance_git_https be=balance_git_https srv=gitea10.opendev.org | 15:26 |
fungi | so nb02 (presumably git cache dib element for an image build) made ~1.6k git requests in the few minutes haproxy was working | 15:27 |
fungi | should we block it and try again? | 15:28 |
Clark[m] | I feel like that is a symptom not the cause. But maybe it is git level operations not web operations that are the problem | 15:29 |
Clark[m] | Can you try cloning directly from a backend? | 15:29 |
fungi | i agree it seems unlikely | 15:29 |
eandersson | Would it be worth clearing the zuul queue? There are hundreds of jobs in the queue | 15:30 |
fungi | cloning https://gitea10.opendev.org:3000/openstack/nova is working, slowly but not unusually so | 15:31 |
corvus | eandersson: zuul is unlikely to be contributing to the problem | 15:31 |
corvus | eandersson: (it does not communicate with gitea) | 15:31 |
fungi | git clone is averaging around 1-1.5 MiB/s from gitea10 for me | 15:32 |
Clark[m] | Really seems like the issue is in the lb somehow. Maybe we block everything then run local requests against it and see if we can reproduce hanging connections. That seems like a shot in the dark though | 15:33 |
fungi | my nova clone from gitea10 just completed, so took about 4 minutes | 15:33 |
fungi | similar test against gitea09 seems to be running about as fast | 15:34 |
Clark[m] | The failed jobs in zuul thrashing the queues seem to be trying to fetch global requirements constraints from OpenDev. So that explains the fallout there but doesn't necessarily mean it is the problem | 15:35 |
Clark[m] | But perhaps it snowballed and now we're behind due to that thundering herd | 15:36 |
Clark[m] | Do periodic jobs start around 0400? | 15:37 |
eandersson | Yea that is what I was thinking. Seems like a safe option to at least restart zuul? | 15:37 |
TNX | That "python-requests/2.28.1" sounds lot like Openstack ansible jobs, I first noted the issue running an upgrade | 15:38 |
fungi | periodic triggers at 02:00 utc, periodic-weekly doesn't trigger until tomorrow | 15:38 |
Clark[m] | Restarting zuul doesn't dump the jobs but we could try that using the API maybe | 15:38 |
fungi | TNX: yeah, in the past we've also seen thundering herds from openstack-ansible users who try to upgrade a large network and don't maintain a central cache, so wind up flooding our git server farm from multiple servers all trying to clone the same things simultaneously | 15:39 |
fungi | but recent openstack-ansible versions are supposed to set a custom user agent string now in order to make that more obvious | 15:40 |
corvus | time wget -O - https://gitea10.opendev.org:3081/openstack/requirements/raw/branch/master/upper-constraints.txt | 15:41 |
corvus | real0m0.035s | 15:41 |
fungi | yeah, we should be able to satisfy thousands of those sorts of requests | 15:41 |
corvus | it doesn't look like fetching a constraints file is very taxing | 15:42 |
fungi | they don't (or at least shouldn't) stay open for any time at all | 15:42 |
fungi | so anyway, we don't have any periodic zuul pipelines that trigger at 04:00 on any day of the week, much less today (except our hourly deploy pipeline which fires every hour of the day of course) | 15:43 |
fungi | and judging from grafana it doesn't seem likely to be something that started in the periodic pipeline at 02:00 and gradually built up, the sessions chart shows it going from a steady ~30 sessions just before 04:00 to 4k within a couple of minutes | 15:45 |
fungi | i suppose we could try bisecting the entire internet. wouldn't be all that many steps | 15:47 |
corvus | haproxy is supposed to log every request on completion, right? and so far, it's logged a handful from the last restart | 15:48 |
fungi | yes, around 20 | 15:48 |
eandersson | https://github.com/openstack/project-config/blob/master/zuul.d/pipelines.yaml#L212 Isn't this starting at 2AM every day? | 15:48 |
corvus | which means that as far as its concerned, 39xx sessions haven't completed? | 15:48 |
fungi | eandersson: correct, and the problem started at exactly 04:00, two hours later | 15:49 |
TNX | Have you checked what that high cpu usage consists of? User, system, interrupts? Might give a pointer if there is something wrong with the system and not with the amount of connections per se. | 15:49 |
fungi | TNX: user | 15:49 |
fungi | it's entirely user, and attributed to the haproxy process | 15:50 |
corvus | so what's it take for a session to complete? in general... i'm seeing FINs on the small amount of traffic that goes between lb02 and gitea10, so it's not like those are being completely dropped | 15:50 |
corvus | according to strace, the haproxy process looks to be busy-waiting around 2 different epoll objects | 15:51 |
corvus | lb02 has 302 connections to gitea12 in CLOSE_WAIT; gitea12 has 0 connections to lb02 | 15:53 |
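A sketch of that asymmetry check; 203.0.113.12 stands in for gitea12's address:

```shell
# Connection states from the LB toward one backend; run the mirror command
# on the backend with dst set to the LB. CLOSE_WAIT piling up on only one
# side means that side received the peer's FIN but never closed its socket.
ss -Htan dst 203.0.113.12 | awk '{print $1}' | sort | uniq -c
```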
fungi | not seeing any packet loss between them at least | 15:54 |
Clark[m] | Did haproxy update maybe and change its TLS connection settings by default? | 15:55 |
Clark[m] | Or I suppose that could be packet loss of some sort | 15:56 |
Clark[m] | (hence fungi's comment) | 15:56 |
corvus | yeah, i got a pcap from both ends and am looking at a single session; i'm not seeing any red flags. i see a fin,ack from each side followed by an ack from each side, and that's the end | 15:57 |
fungi | oh, good point. i looked at dpkg.log, but haproxy is run from a container image | 15:57 |
fungi | haproxy latest b7b500699a22 22 hours ago 121MB | 15:57 |
corvus | that seems suspicious | 15:58 |
fungi | we don't seem to have old images preserved on the server either | 15:58 |
corvus | there is an image from 8 hours ago | 15:59 |
corvus | we could try rolling forward | 15:59 |
corvus | https://www.mail-archive.com/haproxy@formilux.org/msg44428.html | 16:00 |
corvus | at a guess: maybe our current image is 2.9.0 and the newer one is 2.9.1? | 16:01 |
Clark[m] | That seems like a good guess. I suspect the image can tell us somehow. If that doesn't work we can probably rollback too. Haproxy isn't very stateful across hard restarts | 16:02 |
fungi | haproxy_1 | 2023-12-16T15:04:08.174923451Z [NOTICE] (1) : haproxy version is 2.9.1-f72603c | 16:02 |
fungi | from docker-compose logs | 16:02 |
corvus | oh :( | 16:03 |
corvus | then we may have to go back to < 2.9.0 | 16:03 |
corvus | maybe 2.8.5? | 16:04 |
fungi | sounds worth a try | 16:04 |
corvus | it's an lts stable release :) | 16:04 |
fungi | also the "22 hours ago" is probably when the image was built, not when it was published nor when we downloaded and restarted onto it | 16:04 |
Clark[m] | I think 2.8.5 is fine as long as we down and up and don't do a graceful restart attempt | 16:05 |
corvus | yep | 16:05 |
Clark[m] | Our configs haven't been using fancy new stuff | 16:05 |
corvus | we could also use the tag "lts" | 16:05 |
Clark[m] | ++ | 16:05 |
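A sketch of the pin being discussed, assuming the compose file lives in the haproxy deployment directory and names the service haproxy (as the haproxy_1 log prefix above suggests):

```shell
# In docker-compose.yaml, replace the floating tag with the LTS one, e.g.:
#   services:
#     haproxy:
#       image: docker.io/library/haproxy:lts   # was :latest (2.9.1)
# then pull and recreate rather than attempting a graceful reload:
cd /etc/haproxy-docker && sudo docker-compose pull haproxy
sudo docker-compose down && sudo docker-compose up -d
```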
corvus | fungi: you want to manually make that change? want me to? | 16:05 |
fungi | i can in a moment, trying to see if the 04:00 deploy was when we pulled the new haproxy image | 16:06 |
corvus | ++ i will make breakfast | 16:06 |
fungi | bingo | 16:09 |
fungi | Pulling haproxy ... status: downloaded newer image fo... | 16:09 |
fungi | docker-compose up -d ran at 03:59:27.857109 | 16:09 |
fungi | working on the change now | 16:09 |
Clark[m] | I'm still not reaching the service. Did you down the container first? | 16:11 |
fungi | no, that was from the ansible log on bridge | 16:11 |
fungi | i'm working on downgrading and will then push a similar change for review | 16:12 |
Clark[m] | Oh I see | 16:12 |
fungi | i was just confirming that we upgraded to haproxy 2.9.1 at 04:00 on the nose, pretty much | 16:12 |
Clark[m] | ++ | 16:13 |
fungi | haproxy_1 | [NOTICE] (1) : haproxy version is 2.8.5-aaba8d0 | 16:15 |
fungi | i've added the lb to the emergency disable list too so we won't roll that back accidentally while reviewing | 16:16 |
corvus | i see connections growing and shrinking | 16:16 |
corvus | none staying in close_wait | 16:17 |
Clark[m] | Yay. I guess we check the haproxy issue tracker to see if this is known. But that can probably wait until Monday | 16:18 |
Clark[m] | https://github.com/haproxy/haproxy/issues/2387 maybe | 16:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Downgrade haproxy image from latest to lts https://review.opendev.org/c/opendev/system-config/+/903805 | 16:20 |
fungi | i'm going to work on unwinding all the iptables drop rules now | 16:21 |
fungi | that's done | 16:24 |
fungi | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer is still looking healthy | 16:24 |
Clark[m] | Haproxy has been so solid I didn't even consider it was a regression there | 16:26 |
fungi | status notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:26 |
fungi | not sending yet until we're sure, but does that work? | 16:26 |
Clark[m] | Yes lgtm | 16:27 |
fungi | this is definitely the longest it's been in working order without session count climbing since the incident began, so looks like we're in the clear | 16:28 |
fungi | still solid. load average has settled around 0.1 | 16:33 |
fungi | haproxy is using about 7% cpu now | 16:33 |
fungi | infra-root: any objection to me sending the status notice in a few minutes? | 16:34 |
fungi | still looking happy 30 minutes later, so sending it now | 16:44 |
fungi | #status notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:45 |
opendevstatus | fungi: sending notice | 16:45 |
-opendevstatus- NOTICE: Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:45 | |
opendevstatus | fungi: finished sending notice | 16:47 |
fungi | okay, switching to the stuff i was supposed to get done this morning, but i'll keep an eye on the graphs for a while just in case | 16:51 |
clarkb | fungi: corvus: I'm going to +2 but not approve the haproxy change. The reason is I remembered that zuul also uses haproxy, and I'm not sure if we need manual intervention there as well | 18:36 |
jrosser | TNX: did you ask the openstack-ansible team for help with your upgrade? | 19:07 |
jrosser | many requests to opendev.org could happen with releases from many versions ago if the upgrade guide was not followed | 19:08 |
jrosser | but there is now a specific circuit-breaker in the code to prevent that and halt the deployment | 19:09 |
TNX | jrosser: I'm fine, I have the latest version running now; my repo just failed one request against https://opendev.org/ for upper constraints while the issue was ongoing. | 19:32 |
fungi | also note that you can switch your base url to pull from the mirror at github.com/openstack for those repos if there's a problem (though we try to be fairly responsive to outages) | 19:37 |
fungi | this one was unusual because who expects a regression like that in a minor release bump for haproxy? | 19:38 |
fungi | i guess we do, now anyway | 19:38 |
TNX | Thanks for the tip, everything worked out fine in the end. I did the previous upgrade last night and the second one today. I got a nice rest while you were looking into the issue | 19:38 |
TNX | Haproxy is certainly one of the pieces of software you wouldn't expect to be the problem. But over a long enough run, even that will happen. | 19:40 |
jrosser | TNX: just FYI there is now a single variable you can set in OSA to swing all the git urls over to anywhere you like, including the GitHub mirrors | 19:40 |
TNX | jrosser: I'm probably getting too comfy with everything "just working" most of the time nowadays with OpenStack-Ansible. I've been running it for quite a long time and I remember things being quite a bit rougher at every upgrade. | 19:43 |
* fungi is happy to hear people say openstack upgrades aren't as painful as in the bad old days | 19:44 | |
tonyb | Thanks for getting to the bottom of the haproxy issue. I learnt a lot from reading the scrollback | 21:30 |
eandersson | Is it just me? I am still having some issues, e.g. zuul.opendev.org isn't loading for me :'( | 21:42 |
fungi | eandersson: i'll put money on it being the same haproxy problem... will check shortly but i agree it's not loading for me now (and as clarkb pointed out, it's the one other place we use haproxy) | 21:43 |
tonyb | I'm looking at it now | 21:44 |
tonyb | tonyb@zuul-lb01:~$ echo "show info" | sudo nc -U /var/haproxy/run/stats | grep -E '(Ulimit-n|Maxsock|Maxconn|Hard_maxconn|CurrConns|MaxconnReached)' | cat | 21:46 |
tonyb | Ulimit-n: 8037 | 21:46 |
tonyb | Maxsock: 8037 | 21:46 |
tonyb | Maxconn: 4000 | 21:46 |
tonyb | Hard_maxconn: 4000 | 21:46 |
tonyb | CurrConns: 4000 | 21:46 |
tonyb | MaxconnReached: 2042 | 21:46 |
tonyb | echo "show info" | sudo nc -U /var/haproxy/run/stats | 21:46 |
eandersson | <3 | 21:49 |
tonyb | I've manually applied the same change as on gitea | 21:50 |
fungi | tonyb: thanks! it's working for me now | 21:51 |
fungi | can you confirm that 903805 is going to apply to both gitea-lb and zuul-lb? pretty sure they use the same ansible role | 21:52 |
tonyb | Oh okay I can double check that | 21:52 |
fungi | tonyb: also, make sure to add zuul-lb01 to the emergency disable list on bridge or it will just get undone | 21:52 |
tonyb | I just added zuul-lb01 to emergency | 21:53 |
tonyb | I figured we'd need time to get a change landed to fix it | 21:53 |
tonyb | CurrConns on zuul-lb01 are staying low | 21:54 |
fungi | yeah, my expectation is that 903805 will fix both gitea and zuul lbs because they should use the same role | 21:56 |
tonyb | That makes sense, I'm just confirming that | 21:57 |
tonyb | Yup looks good to me. | 21:59 |
fungi | awesome. assuming everything's working as expected, let's worry about it next week (or this week on your end of the rock) | 22:00 |
tonyb | Cool Beans. | 22:01 |
fungi | my guess is that the zuul lb doesn't get nearly the volume of connections that the gitea lb does (probably multiple orders of magnitude fewer), so took longer to fill up | 22:05 |
tonyb | that sounds plausible | 22:06 |