Clark[m] | Deployment concluded successfully and a requirements change merged and replicated successfully | 02:10 |
---|---|---|
*** dhill is now known as Guest10533 | 03:02 | |
qwebirc35203 | Is there a known problem or maintenance causing opendev.org to be down? | 05:02 |
tonyb | qwebirc35203: can you be more specific? I'm not seeing an issue with review.opendev.org | 05:08 |
qwebirc35203 | Some subdomains seem to work, but the base domain, as in https://opendev.org/, is not opening for me (tested a few different providers and a few website-testing sites; all give timeouts) | 05:09 |
tonyb | qwebirc35203: hmm okay. I see it. I'll look into it | 05:11 |
qwebirc35203 | tonyb: Thank you. | 05:11 |
tonyb | #status notice Web, and possibly others, services on opendev.org appear to be down. Admins are investigating. | 05:18 |
opendevstatus | tonyb: sending notice | 05:18 |
-opendevstatus- NOTICE: Web, and possibly others, services on opendev.org appear to be down. Admins are investigating. | 05:18 | |
tonyb | Ooops that wasn't supposed to be notice :/ | 05:20 |
opendevstatus | tonyb: finished sending notice | 05:21 |
tonyb | qwebirc35203: Try again now. It's working for me | 05:46 |
tonyb | qwebirc35203: and stalled again | 05:56 |
qwebirc35203 | tonyb: Ok, I probably wasn't fast enough on testing, because I didn't get a successful load. | 05:57 |
qwebirc35203 | Routing seems ok as traceroute completes to the ip that resolves from opendev.org | 05:58 |
tonyb | Yup, and the systems behind the LoadBalancer seem fine, but not the LB itself. | 05:59 |
tonyb | it was fine after a process restart | 05:59 |
qwebirc35203 | Someone session flooding the lb? | 06:00 |
tonyb | qwebirc35203: Something like that. I'm struggling to verify that :/ | 06:10 |
qwebirc35203 | If you are running haproxy, I'm assuming you've checked hatop? | 06:12 |
tonyb | qwebirc35203: we don't seem to have that. | 06:14 |
qwebirc35203 | You might have the stats web endpoint; that's usually on a different port, so it should still be accessible | 06:16 |
qwebirc35203 | Or you can get the raw stats with: echo "show stat" | nc -U /var/run/haproxy.stat (your stat socket location may differ) | 06:19 |
tonyb | I can certainly do the last one. I'll fess up that I'm "new" here so I'm struggling to filter the normal from the "problem pointers" | 06:20 |
qwebirc35203 | echo "show info" | nc -U /var/run/haproxy.stat | 06:21 |
qwebirc35203 | That should give you the generic stats | 06:21 |
qwebirc35203 | If CurrConns is hitting Hard_maxconn, then something is hogging the connections | 06:22 |
tonyb | Okay. That is the current state. | 06:23 |
tonyb | Maxconn: 4000 | 06:23 |
tonyb | Hard_maxconn: 4000 | 06:23 |
tonyb | CurrConns: 4000 | 06:23 |
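As an aside, the same admin socket can break those totals down per frontend and backend. A minimal sketch, assuming the socket path from the example above; the scur/smax/slim columns of the "show stat" CSV show which proxy is actually saturated:

```shell
# Per-proxy session counts from the haproxy admin socket (path is an
# assumption; check the "stats socket" line in haproxy.cfg if it differs).
echo "show stat" | nc -U /var/run/haproxy.stat |
  awk -F',' 'NR==1 || $1=="" {next} {print $1, $2, "scur="$5, "smax="$6, "slim="$7}'
```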
qwebirc35203 | Ok, then it's a bit of an advanced topic to find out what is doing that. I would recommend forwarding the issue to someone who is used to solving that kind of thing | 06:24 |
tonyb | qwebirc35203: Thanks for your help. | 06:25 |
qwebirc35203 | Could be someone intentionally doing abuse, or just automations hammering the site after some downtime | 06:25 |
qwebirc35203 | Increasing logging or using tcpdump may be required to figure out what is happening and how to correct the situation. If it is single source, that can be firewalled or rate limited from haproxy | 06:26 |
qwebirc35203 | Correct solution heavily depends on what is actually happening | 06:27 |
qwebirc35203 | Increasing max connections shouldn't be a problem, but I wouldn't do that without consulting someone who has experience with that specific system | 06:29 |
tonyb | Sure, but given how quickly the connections filled up last time, I suspect any additional connections will also just fill up | 06:31 |
qwebirc35203 | Yes, that is likely. And more connections on the load balancer may cause issues for the servers behind it during normal operation. That's why changing the number must be an informed decision made with knowledge of the system. | 06:32 |
qwebirc35203 | I would probably do a tcpdump on port 443 with a limited packet count to check whether some IP has a high dominance in the packets received. If you don't limit the packet count on a remote system you might have trouble stopping the capture. But every admin has their preferred way; some would add verbosity to haproxy logging. | 06:36 |
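A minimal sketch of that kind of bounded capture, assuming the interface is eth0 and that HTTPS traffic on 443 is what matters; adjust both to the actual host:

```shell
# Sample a fixed number of packets and count source addresses. -c makes the
# capture stop on its own; -nn skips DNS and port name lookups. tcpdump
# prints "TIME IP SRC.PORT > DST.PORT: ...", so field 3 is the source and
# the sed strips its trailing ".PORT".
sudo tcpdump -nn -c 5000 -i eth0 'tcp dst port 443' 2>/dev/null |
  awk '{print $3}' | sed 's/\.[0-9]*$//' |
  sort | uniq -c | sort -rn | head
```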
tonyb | I'm confident I can up the maxconn and the load balancer and backend will handle it. But that's without knowing whether this is innocent or not | 06:37 |
qwebirc35203 | Tcpdump has the advantage of having no effect on logging (if the logs are processed or forwarded somewhere, changing verbosity may cause some trouble; same goes if disk space is at a premium) | 06:37 |
tonyb | qwebirc35203: Yeah I've looked at tcpdump and there isn't a visible hog. | 06:37 |
qwebirc35203 | You should be able to do 2x or 4x the connections without issues, at least on the haproxy side. If you do that, make sure the Ulimit-n value in the output increases accordingly | 06:40 |
qwebirc35203 | Because haproxy has a front connection and a back connection per session, you need double the sockets relative to maxconn | 06:41 |
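A sketch of checking that relationship before raising anything, reusing the same assumed socket path as earlier:

```shell
# Compare maxconn against the file-descriptor limit haproxy reports; with
# roughly two sockets per session (client side plus server side), Ulimit-n
# needs comfortable headroom over 2 * maxconn.
echo "show info" | nc -U /var/run/haproxy.stat |
  awk -F': ' '
    $1 == "Maxconn"  { maxconn = $2 }
    $1 == "Ulimit-n" { ulimit = $2 }
    END { printf "maxconn=%d ulimit-n=%d (want ulimit-n well above 2*maxconn)\n",
                 maxconn, ulimit }'
```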
tonyb | Yup | 06:42 |
qwebirc35203 | Just came in for a quick check from webirc; switching to a better client as TNX. | 06:57 |
tonyb | Kk | 06:59 |
TNX | Good luck with resolving the issue, I'll see how it went later on. I hit that site issue while finishing some update tasks for the night; for me it's late for the work shift and early for the day. | 07:04 |
TNX | I got everything important done anyway, but thought I'd give notice about the issue | 07:05 |
tonyb | Thanks | 07:05 |
tonyb | I appreciate it | 07:05 |
tonyb | #status log The gitea load balancer on opendev.org is saturated and therefore new connections are timing out. Admins are investigating. | 07:21 |
opendevstatus | tonyb: finished logging | 07:21 |
tonyb | infra-root: gitea-lb02 is "flooded" CurrConns == Maxconn == 4000, I'm basically out of ideas. | 08:41 |
tonyb | https://etherpad.opendev.org/p/gitea-debugging is a summary of the last couple of hours. I'll be back online after dinner etc | 08:43 |
eandersson | I think something is broken again? e.g. this does not load for me at all https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt | 12:15 |
eandersson | oh or maybe the issue is just ongoing | 12:16 |
fungi | eandersson: yes, we suspect something has gone awry with internal networking in vexxhost's sjc1 region | 13:11 |
fungi | though not for everything... i'm able to reach https://mirror.sjc1.vexxhost.opendev.org/ with no problem | 13:13 |
fungi | https://gitea09.opendev.org:3000/ loads too | 13:14 |
fungi | i can ssh to the opendev.org load balancer with no problem, seems like it's just haproxy itself that's getting overloaded | 13:15 |
fungi | http://cacti.openstack.org/ graphs for gitea-lb02 are pretty striking. user cpu and load average both spike to fill the two available vcpus right at 04:00 utc | 13:18 |
fungi | at the same time, network traffic drops to almost imperceptible levels | 13:19 |
fungi | /var/log/haproxy.log has been silent since 07:35 utc, nearly 6 hours ago | 13:21 |
fungi | looks like haproxy was restarted at roughly 07:30, so maybe it was briefly logging traffic until it fell over again | 13:24 |
fungi | doing some analysis of packet capture samples to see if there's any common sources at the moment | 13:36 |
fungi | at the moment there's one host that seems to account for around 30% of all inbound packets for the ports haproxy is listening on | 13:38 |
fungi | i've blocked the two highest volume sources in iptables and am restarting the haproxy container to see if it regains its senses | 13:45 |
fungi | the site is loading for me at the moment, but i'm not sure that will last | 13:46 |
fungi | source addresses have changed now, majority of high-volume senders are known parties (red hat, suse, microsoft...) | 13:48 |
fungi | now the site's back to not responding for me | 13:50 |
fungi | yeah, 13:49:08 was the last request in the haproxy log | 13:51 |
Clark[m] | Haproxy should log to syslog iirc. You get better data about the requests on the gitea side since that shows you l7 URL paths and such. My first thought is this is the web crawler ddos we've seen before. | 13:52 |
Clark[m] | Haproxy logs the conclusion of connections so if we get no logs whatever it is may simply be holding connections open | 13:52 |
Clark[m] | Gitea will log the start of a request and the completion of a request | 13:53 |
Clark[m] | I think we want to look for started but not concluded requests in gitea and work back from there | 13:53 |
fungi | yeah, do you know how it differentiates them in the log? | 13:54 |
Clark[m] | It puts a string like started and finished in the log lines | 13:54 |
Clark[m] | Request started /foo/bar or something along those lines | 13:55 |
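A rough sketch of comparing those markers; the log path and the exact "Started"/"Completed" strings are assumptions, since the real wording depends on gitea's router log configuration (the log may also need to be read via docker logs on the backend):

```shell
# Count request starts vs completions in the gitea router log; a large gap
# points at requests that were opened but never finished.
grep -ci 'started '   /var/log/gitea/gitea.log
grep -ci 'completed ' /var/log/gitea/gitea.log
```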
fungi | thing is, /var/log/haproxy.log only has 13 entries between when i started it and when it stopped logging | 13:56 |
Clark[m] | Ya haproxy only logs when connections end iirc | 13:56 |
Clark[m] | On the lb side netstat/ss to list tcp conn state may be more helpful | 13:57 |
Clark[m] | Basically something appears to be making 4k connections and never closing them if I interpret this correctly. But it is early for me so take that with a grain of salt | 13:59 |
Clark[m] | Restarting helps things because you manually kill those connections and let other stuff reconnect until whatever is misbehaving comes back and grabs all the connections again | 13:59 |
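A sketch of that connection-state survey on the load balancer, assuming haproxy is listening on ports 80 and 443:

```shell
# Top remote addresses holding connections to the frontends. ss prints
# "State Recv-Q Send-Q Local:Port Peer:Port", so field 5 is the peer; the
# sed strips the client port.
ss -Htn '( sport = :443 or sport = :80 )' |
  awk '{print $5}' | sed 's/:[0-9]*$//' |
  sort | uniq -c | sort -rn | head

# Breakdown of those connections by TCP state (ESTAB, CLOSE-WAIT, ...).
ss -Htan '( sport = :443 or sport = :80 )' |
  awk '{print $1}' | sort | uniq -c | sort -rn
```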
fungi | yep | 14:00 |
Clark[m] | But also I think haproxy may log to syslog so double check there? | 14:04 |
Clark[m] | Sorry I can't really look myself without waking the entire house on a Saturday morning | 14:04 |
fungi | no worries, i've blocked the top 5 ip sources with open connections according to netstat and am restarting the container again | 14:06 |
Clark[m] | And ya cross checking against gitea web logs may give more insight into what the clients are attempting to accomplish | 14:08 |
fungi | if things stabilize, then i can slowly remove the firewall rejects one by one until the problem comes back too | 14:10 |
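For reference, a sketch of that block-and-restart cycle; the address is a placeholder and the compose directory is an assumption about where the haproxy service is deployed:

```shell
# Reject one misbehaving source on the frontend ports (use ip6tables for
# IPv6 sources), then recreate the haproxy container to drop the
# already-established sessions.
sudo iptables -I INPUT -s 198.51.100.7 -p tcp -m multiport --dports 80,443 -j REJECT
cd /etc/haproxy-docker && sudo docker-compose down && sudo docker-compose up -d
```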
fungi | now it's no longer loading for me again | 14:13 |
Clark[m] | Whatever it is probably requires characterization beyond simply the worst IPs. | 14:13 |
Clark[m] | Because there are more behind them | 14:14 |
fungi | yeah, i added a couple more, the new top two with open sockets shortly after things hung, so it's possible one of them is causing it | 14:14 |
fungi | they were several times higher than the runners-up | 14:15 |
Clark[m] | And we're reasonably happy that this isn't a failure of the backends because those work? | 14:15 |
fungi | i'll retest them all directly to be sure, i only spot-checked some | 14:16 |
Clark[m] | Ya backend responds immediately for me | 14:16 |
fungi | all 6 return the expected content for me on 3000/tcp | 14:17 |
fungi | one more try. this time blocked a few /24 networks that had a large number of open connections spread across many addresses | 14:23 |
Clark[m] | If that doesn't help I would check gitea logs as the next step and try to understand if clients are making it that far | 14:23 |
fungi | yeah, that will be my next stop, but trying to correlate those is a lot more time consuming so wanted to rule out the easy things first | 14:24 |
Clark[m] | Sometimes it's pretty apparent just because the patterns are "odd" but if not then ya | 14:24 |
fungi | at the moment, all /24 networks with clients that have sockets open through the load balancer are under 25 each | 14:26 |
fungi | cpu on the load balancer is still maxxed out though | 14:27 |
fungi | ~350 total sockets open for 80/tcp or 443/tcp | 14:28 |
fungi | 432 sockets open now and the service is back to not responding again | 14:30 |
Clark[m] | There must be more SYN'ing then to hit 4k? | 14:31 |
fungi | i forgot we have https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer | 14:32 |
fungi | maybe i'll find more clues there | 14:32 |
fungi | definitely reports ~4k concurrent sessions | 14:33 |
fungi | looks like tons of concurrent sessions to gitea12 | 14:36 |
Clark[m] | There is a clue in those graphs. Gitea12 has the most conns so it is hashing more things (or one thing) to there | 14:36 |
Clark[m] | I would look at logs on gitea12 | 14:36 |
fungi | right, so classifying the traffic it's seeing will help narrow this down | 14:36 |
Clark[m] | And work backward | 14:36 |
Clark[m] | ++ | 14:36 |
fungi | earlier in the day it was mostly hitting gitea11, looks like | 14:37 |
Clark[m] | The average session time graph is interesting: it basically says the average session is 0 seconds on 12? Could this be an old-fashioned tcp ddos? | 14:40 |
Clark[m] | Oh wait we probably need session to end to report that data | 14:40 |
Clark[m] | Same issue with logging not happening until sessions close | 14:40 |
fungi | apache access log doesn't seem to dereference the forwarded-for client, where were we recording that? | 14:44 |
fungi | oh, i guess we only do the source port | 14:45 |
Clark[m] | You have to rely on ports. Ya it's a bug in gitea | 14:45 |
Clark[m] | They have the info but their logging mangles it so we can't get it last I tried | 14:45 |
fungi | so figuring out the client address(es) is... hrm | 14:46 |
Clark[m] | It should be possible but also annoying | 14:47 |
Clark[m] | I think the haproxy command socket will give you mappings on the haproxy side since we don't have logs yet | 14:48 |
Clark[m] | And then align the port numbers with the backend? | 14:49 |
Clark[m] | But also the urls and user agents may be sufficient clues for proceeding, if it is the typical crawler dos we've seen | 14:49 |
Clark[m] | Oh I think I remember the forwarded-for issue. We go haproxy to Apache to gitea, and for some reason gitea can't parse out the top-level forwarding info. The Apache logs may be easier to work with? | 14:54 |
fungi | yeah, i was looking in the apache logs but they show the haproxy host's address as the client | 14:55 |
fungi | i thought we had protocol-level proxy client signalling turned on (whatever that's called) | 14:55 |
fungi | so that haproxy can communicate the original client address to apache | 14:55 |
fungi | but if so, that's not being reflected in apache's access log | 14:56 |
fungi | i'm fumbling around with the haproxy command socket right now, but no luck figuring out how to get it to dump a list of clients/sessions yet. the context help is quite terse, i'll probably need to crack open the manual and digest haproxy's terminology | 14:57 |
fungi | aha, show sess | 14:58 |
fungi | it has 2391 sessions matching be=balance_git_https srv=gitea12.opendev.org | 14:59 |
fungi | 2214 of those are from codesearch :/ | 15:01 |
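A sketch of the kind of tally that produces those numbers, again assuming the admin socket path from earlier; src= and srv= are fields haproxy prints in its show sess output:

```shell
# Dump the session table once, then group by backend server and by client.
echo "show sess" | nc -U /var/run/haproxy.stat > /tmp/sess.txt

# Sessions per backend server.
grep -o 'srv=[^ ]*' /tmp/sess.txt | sort | uniq -c | sort -rn

# Top client addresses (strip the trailing :port from src=addr:port).
grep -o 'src=[^ ]*' /tmp/sess.txt | sed 's/:[0-9]*$//' |
  sort | uniq -c | sort -rn | head
```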
Clark[m] | Maybe block code search and see if it helps? | 15:02 |
Clark[m] | If code search is struggling we can always fix that later | 15:03 |
corvus | fyi, hound has not recently restarted so seems unlikely to be a sudden behavior change there | 15:03 |
Clark[m] | Hrm | 15:03 |
fungi | agreed, that was the first thing i checked | 15:04 |
fungi | okay, restarted the haproxy container again, this time with codesearch blocked | 15:04 |
corvus | hound started reporting errors on dec 14 | 15:05 |
fungi | it was connecting over ipv6, but i've blocked its ipv4 address too just to make sure it doesn't fall back on that | 15:05 |
corvus | (hound had intermittent problems dec 14 and 15; looks much worse today) | 15:06 |
corvus | that may suggest a longer time window for whatever issue | 15:06 |
fungi | oddly the cpu usage and system load spiked up at precisely 04:00, which seems like suspicious timing | 15:07 |
fungi | haproxy is still basically maxing out cpu utilization | 15:08 |
fungi | but maybe it will settle after a bit | 15:08 |
corvus | (apropos of nothing, gitea11's ipv4 has a ptr record of test.example2.com) | 15:09 |
fungi | lovely | 15:09 |
fungi | nb02 has a bunch of open sessions through haproxy at the moment | 15:12 |
fungi | also something in osuosl, maybe a test node | 15:12 |
fungi | site's still responding though | 15:13 |
fungi | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer shows current sessions have exceeded 3k, climbing steadily but less rapidly than before | 15:14 |
fungi | yeah, the site's unresponsive again | 15:16 |
Clark[m] | Codesearch was probably failing due to whatever the real issue is and making it worse by quickly trying to fetch things | 15:16 |
fungi | ipv6 address of nb02 currently has the largest number of active sessions through the lb | 15:16 |
Clark[m] | Did the gitea or Apache logs show many commit requests by weird UAs? | 15:16 |
fungi | not that i spotted, lots of odd agent strings (Blackbox Exporter, SemrushBot, Faraday, ...) but i'll see if i can classify based on that | 15:18 |
fungi | largest count of agents on gitea12 since the last restart is python-requests/2.28.1 | 15:20 |
fungi | i'll check the other backends, but if it's a ddos it would in theory be hitting them all pretty equally | 15:21 |
Clark[m] | The UAs that typically gives us trouble are for old cellphones and Chinese android browsers | 15:21 |
fungi | gitea10 had 1599 connections for git/2.39.2 | 15:22 |
fungi | that stands out | 15:22 |
fungi | others are mostly search bots (yandex, bing, openai) | 15:23 |
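A sketch of that user-agent tally; the access log path is an assumption, and it presumes a combined-format log where the agent is the last quoted field:

```shell
# Count user agents seen by the backend's Apache since the last rotation.
awk -F'"' '{print $(NF-1)}' /var/log/apache2/access.log |
  sort | uniq -c | sort -rn | head -20
```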
fungi | yeah, after the last restart, grafana also says the bulk of current sessions were for gitea10 | 15:24 |
fungi | i wonder if that's where nb02 is getting sent | 15:25 |
Clark[m] | The pattern we've seen cause issues is requesting many commit urls in short periods of time. But I suspect that isn't an issue here because I can directly get urls like that without trouble talking to a backend | 15:25 |
fungi | src=2001:4800:7818:104:be76:4eff:fe02:5608:56772 fe=balance_git_https be=balance_git_https srv=gitea10.opendev.org | 15:26 |
fungi | so nb02 (presumably git cache dib element for an image build) made ~1.6k git requests in the few minutes haproxy was working | 15:27 |
fungi | should we block it and try again? | 15:28 |
Clark[m] | I feel like that is a symptom not the cause. But maybe it is git level operations not web operations that are the problem | 15:29 |
Clark[m] | Can you try cloning directly from a backend? | 15:29 |
fungi | i agree it seems unlikely | 15:29 |
eandersson | Would it be worth clearing the zuul queue? There are hundreds of jobs in the queue | 15:30 |
fungi | cloning https://gitea10.opendev.org:3000/openstack/nova is working, slowly but not unusually so | 15:31 |
corvus | eandersson: zuul is unlikely to be contributing to the problem | 15:31 |
corvus | eandersson: (it does not communicate with gitea) | 15:31 |
fungi | git clone is averaging around 1-1.5 MiB/s from gitea10 for me | 15:32 |
Clark[m] | Really seems like the issue is in the lb somehow. Maybe we block everything then run local requests against it and see if we can reproduce hanging connections. That seems like a shot in the dark though | 15:33 |
fungi | my nova clone from gitea10 just completed, so took about 4 minutes | 15:33 |
fungi | similar test against gitea09 seems to be running about as fast | 15:34 |
Clark[m] | The failed jobs in zuul thrashing the queues seem to be trying to fetch global requirements constraints from OpenDev. So that explains the fallout there but doesn't necessarily mean it is the problem | 15:35 |
Clark[m] | But perhaps it snowballed and now we're behind due to that thundering herd | 15:36 |
Clark[m] | Do periodic jobs start around 0400? | 15:37 |
eandersson | Yea that is what I was thinking. Seems like a safe option to at least restart zuul? | 15:37 |
TNX | That "python-requests/2.28.1" sounds lot like Openstack ansible jobs, I first noted the issue running an upgrade | 15:38 |
fungi | periodic triggers at 02:00 utc, periodic-weekly doesn't trigger until tomorrow | 15:38 |
Clark[m] | Restarting zuul doesn't dump the jobs but we could try that using the API maybe | 15:38 |
fungi | TNX: yeah, in the past we've also seen thundering herds from openstack-ansible users who try to upgrade a large network and don't maintain a central cache, so wind up flooding our git server farm from multiple servers all trying to clone the same things simultaneously | 15:39 |
fungi | but recent openstack-ansible versions are supposed to set a custom user agent string now in order to make that more obvious | 15:40 |
corvus | time wget -O - https://gitea10.opendev.org:3081/openstack/requirements/raw/branch/master/upper-constraints.txt | 15:41 |
corvus | real0m0.035s | 15:41 |
fungi | yeah, we should be able to satisfy thousands of those sorts of requests | 15:41 |
corvus | it doesn't look like fetching a constraints file is very taxing | 15:42 |
fungi | they don't (or at least shouldn't) stay open for any time at all | 15:42 |
fungi | so anyway, we don't have any periodic zuul pipelines that trigger at 04:00 on any day of the week, much less today (except our hourly deploy pipeline which fires every hour of the day of course) | 15:43 |
fungi | and judging from grafana it doesn't seem likely to be something that started in the periodic pipeline at 02:00 and gradually built up, the sessions chart shows it going from a steady ~30 sessions just before 04:00 to 4k within a couple of minutes | 15:45 |
fungi | i suppose we could try bisecting the entire internet. wouldn't be all that many steps | 15:47 |
corvus | haproxy is supposed to log every request on completion, right? and so far, it's logged a handful from the last restart | 15:48 |
fungi | yes, around 20 | 15:48 |
eandersson | https://github.com/openstack/project-config/blob/master/zuul.d/pipelines.yaml#L212 Isn't this starting at 2AM every day? | 15:48 |
corvus | which means that as far as its concerned, 39xx sessions haven't completed? | 15:48 |
fungi | eandersson: correct, and the problem started at exactly 04:00, two hours later | 15:49 |
TNX | Have you checked what that high cpu usage consists of? User, system, interrupts? Might give a pointer if there is something wrong with the system and not with the amount of connections per se. | 15:49 |
fungi | TNX: user | 15:49 |
fungi | it's entirely user, and attributed to the haproxy process | 15:50 |
corvus | so what's it take for a session to complete? in general... i'm seeing FINs on the small amount of traffic that goes between lb02 and gitea10, so it's not like those are being completely dropped | 15:50 |
corvus | according to strace, the haproxy process looks to be busy-waiting around 2 different epoll objects | 15:51 |
corvus | lb02 has 302 connections to gitea12 in CLOSE_WAIT; gitea12 has 0 connections to lb02 | 15:53 |
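A sketch of that asymmetry check; 203.0.113.12 stands in for gitea12's address:

```shell
# Connection states from the LB toward one backend; run the mirror command
# on the backend with dst set to the LB. CLOSE_WAIT piling up on only one
# side means that side received the peer's FIN but never closed its socket.
ss -Htan dst 203.0.113.12 | awk '{print $1}' | sort | uniq -c
```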
fungi | not seeing any packet loss between them at least | 15:54 |
Clark[m] | Did haproxy update maybe and change its TLS connection settings by default? | 15:55 |
Clark[m] | Or I suppose that could be packet loss of some sort | 15:56 |
Clark[m] | (hence fungi's comment) | 15:56 |
corvus | yeah, i got a pcap from both ends and am looking at a single session; i'm not seeing any red flags. i see a fin,ack from each side followed by an ack from each side, and that's the end | 15:57 |
fungi | oh, good point. i looked at dpkg.log, but haproxy is run from a container image | 15:57 |
fungi | haproxy latest b7b500699a22 22 hours ago 121MB | 15:57 |
corvus | that seems suspicious | 15:58 |
fungi | we don't seem to have old images preserved on the server either | 15:58 |
corvus | there is an image from 8 hours ago | 15:59 |
corvus | we could try rolling forward | 15:59 |
corvus | https://www.mail-archive.com/haproxy@formilux.org/msg44428.html | 16:00 |
corvus | at a guess: maybe our current image is 2.9.0 and the newer one is 2.9.1? | 16:01 |
Clark[m] | That seems like a good guess. I suspect the image can tell us somehow. If that doesn't work we can probably rollback too. Haproxy isn't very stateful across hard restarts | 16:02 |
fungi | haproxy_1 | 2023-12-16T15:04:08.174923451Z [NOTICE] (1) : haproxy version is 2.9.1-f72603c | 16:02 |
fungi | from docker-compose logs | 16:02 |
corvus | oh :( | 16:03 |
corvus | then we may have to go back to < 2.9.0 | 16:03 |
corvus | maybe 2.8.5? | 16:04 |
fungi | sounds worth a try | 16:04 |
corvus | it's an lts stable release :) | 16:04 |
fungi | also the "22 hours ago" is probably when the image was built, not when it was published nor when we downloaded and restarted onto it | 16:04 |
Clark[m] | I think 2.8.5 is fine as long as we down and up and don't do a graceful restart attempt | 16:05 |
corvus | yep | 16:05 |
Clark[m] | Our configs haven't been using fancy new stuff | 16:05 |
corvus | we could also use the tag "lts" | 16:05 |
Clark[m] | ++ | 16:05 |
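A sketch of the pin being discussed, assuming the compose file lives in the haproxy deployment directory and names the service haproxy (as the haproxy_1 log prefix above suggests):

```shell
# In docker-compose.yaml, replace the floating tag with the LTS one, e.g.:
#   services:
#     haproxy:
#       image: docker.io/library/haproxy:lts   # was :latest (2.9.1)
# then pull and recreate rather than attempting a graceful reload:
cd /etc/haproxy-docker && sudo docker-compose pull haproxy
sudo docker-compose down && sudo docker-compose up -d
```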
corvus | fungi: you want to manually make that change? want me to? | 16:05 |
fungi | i can in a moment, trying to see if the 04:00 deploy was when we pulled the new haproxy image | 16:06 |
corvus | ++ i will make breakfast | 16:06 |
fungi | bingo | 16:09 |
fungi | Pulling haproxy ... status: downloaded newer image fo... | 16:09 |
fungi | docker-compose up -d ran at 03:59:27.857109 | 16:09 |
fungi | working on the change now | 16:09 |
Clark[m] | I'm still not reaching the service. Did you down the container first? | 16:11 |
fungi | no, that was from the ansible log on bridge | 16:11 |
fungi | i'm working on downgrading and will then push a similar change for review | 16:12 |
Clark[m] | Oh I see | 16:12 |
fungi | i was just confirming that we upgraded to haproxy 2.9.1 at 04:00 on the nose, pretty much | 16:12 |
Clark[m] | ++ | 16:13 |
fungi | haproxy_1 | [NOTICE] (1) : haproxy version is 2.8.5-aaba8d0 | 16:15 |
fungi | i've added the lb to the emergency disable list too so we won't roll that back accidentally while reviewing | 16:16 |
corvus | i see connections growing and shrinking | 16:16 |
corvus | none staying in close_wait | 16:17 |
Clark[m] | Yay. I guess we check the haproxy issue tracker to see if this is known. But that can probably wait until Monday | 16:18 |
Clark[m] | https://github.com/haproxy/haproxy/issues/2387 maybe | 16:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Downgrade haproxy image from latest to lts https://review.opendev.org/c/opendev/system-config/+/903805 | 16:20 |
fungi | i'm going to work on unwinding all the iptables drop rules now | 16:21 |
fungi | that's done | 16:24 |
fungi | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer is still looking healthy | 16:24 |
Clark[m] | Haproxy has been so solid I didn't even consider it was a regression there | 16:26 |
fungi | status notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:26 |
fungi | not sending yet until we're sure, but does that work? | 16:26 |
Clark[m] | Yes lgtm | 16:27 |
fungi | this is definitely the longest it's been in working order without session count climbing since the incident began, so looks like we're in the clear | 16:28 |
fungi | still solid. load average has settled around 0.1 | 16:33 |
fungi | haproxy is using about 7% cpu now | 16:33 |
fungi | infra-root: any objection to me sending the status notice in a few minutes? | 16:34 |
fungi | still looking happy 30 minutes later, so sending it now | 16:44 |
fungi | #status notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:45 |
opendevstatus | fungi: sending notice | 16:45 |
-opendevstatus- NOTICE: Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now | 16:45 | |
opendevstatus | fungi: finished sending notice | 16:47 |
fungi | okay, switching to the stuff i was supposed to get done this morning, but i'll keep an eye on the graphs for a while just in case | 16:51 |
clarkb | fungi: corvus: I'm going to +2 but not approve the haproxy change. The reason is I remembered that zuul also uses haproxy, and I'm not sure if we need manual intervention there as well | 18:36 |
jrosser | TNX: did you ask the openstack-ansible team for help with your upgrade? | 19:07 |
jrosser | many requests to opendev.org could happen with releases from many versions ago if the upgrade guide was not followed | 19:08 |
jrosser | but there is now a specific circuit-breaker in the code to prevent that and halt the deployment | 19:09 |
TNX | jrosser: I'm fine, I have the latest version running now; my repo just failed one request against https://opendev.org/ for upper constraints while the issue was ongoing. | 19:32 |
fungi | also note that you can switch your base url to pull from the mirror at github.com/openstack for those repos if there's a problem (though we try to be fairly responsive to outages) | 19:37 |
fungi | this one was unusual because who expects a regression like that in a minor release bump for haproxy? | 19:38 |
fungi | i guess we do, now anyway | 19:38 |
TNX | Thanks for the tip, everything worked out fine in the end. I did the previous upgrade last night and the second one today. I got a nice rest while you were looking into the issue | 19:38 |
TNX | Haproxy is certainly one of the pieces of software you wouldn't expect to be the problem. But over a long enough run, even that will happen. | 19:40 |
jrosser | TNX: just FYI there is now a single variable you can set in OSA to swing all the git urls over to anywhere you like, including the GitHub mirrors | 19:40 |
TNX | jrosser: I'm probably getting too comfy with everything "just working" most of the time nowadays with OpenStack-Ansible. I've been running it for quite a long time and I remember things being quite a bit rougher at every upgrade. | 19:43 |
* fungi is happy to hear people say openstack upgrades aren't as painful as in the bad old days | 19:44 | |
tonyb | Thanks for getting to the bottom of the haproxy issue. I learnt a lot from reading the scrollback | 21:30 |
eandersson | Is it just me? I am still having some issues, e.g. zuul.opendev.org isn't loading for me :'( | 21:42 |
fungi | eandersson: i'll put money on it being the same haproxy problem... will check shortly but i agree it's not loading for me now (and as clarkb pointed out, it's the one other place we use haproxy) | 21:43 |
tonyb | I'm looking at it now | 21:44 |
tonyb | tonyb@zuul-lb01:~$ echo "show info" | sudo nc -U /var/haproxy/run/stats | grep -E '(Ulimit-n|Maxsock|Maxconn|Hard_maxconn|CurrConns|MaxconnReached)' | cat | 21:46 |
tonyb | Ulimit-n: 8037 | 21:46 |
tonyb | Maxsock: 8037 | 21:46 |
tonyb | Maxconn: 4000 | 21:46 |
tonyb | Hard_maxconn: 4000 | 21:46 |
tonyb | CurrConns: 4000 | 21:46 |
tonyb | MaxconnReached: 2042 | 21:46 |
tonyb | echo "show info" | sudo nc -U /var/haproxy/run/stats | 21:46 |
eandersson | <3 | 21:49 |
tonyb | I've manually applied the same change as on gitea | 21:50 |
fungi | tonyb: thanks! it's working for me now | 21:51 |
fungi | can you confirm that 903805 is going to apply to both gitea-lb and zuul-lb? pretty sure they use the same ansible role | 21:52 |
tonyb | Oh okay I can double check that | 21:52 |
fungi | tonyb: also, make sure to add zuul-lb01 to the emergency disable list on bridge or it will just get undone | 21:52 |
tonyb | I just added zuul-lb01 to emergency | 21:53 |
tonyb | I figured we'd need time to get a change landed to fix it | 21:53 |
tonyb | CurrConns on zuul-lb01 are staying low | 21:54 |
fungi | yeah, my expectation is that 903805 will fix both gitea and zuul lbs because they should use the same role | 21:56 |
tonyb | That makes sense, I'm just confirming that | 21:57 |
tonyb | Yup looks good to me. | 21:59 |
fungi | awesome. assuming everything's working as expected, let's worry about it next week (or this week on your end of the rock) | 22:00 |
tonyb | Cool Beans. | 22:01 |
fungi | my guess is that the zuul lb doesn't get nearly the volume of connections that the gitea lb does (probably multiple orders of magnitude fewer), so took longer to fill up | 22:05 |
tonyb | that sounds plausible | 22:06 |