Nick | Message | Time |
---|---|---|
clarkb | and sent | 00:06 |
corvus | someone should be able to test that with personal (or non openstackci) creds too | 00:19 |
mnaser | i think opendev.org might be having some sort of dns issues or something | 01:53 |
mnaser | pages take a long time to load and they then load really fast | 01:53 |
mnaser | Powered by Gitea Version: v1.22.3 Page: 32819ms Template: 29ms | 01:53 |
mnaser | almost like a perfect 30s+2.8s to load page | 01:53 |
mnaser | refreshing same page after => Powered by Gitea Version: v1.22.3 Page: 33ms Template: 5ms | 01:54 |
fungi | mnaser: could it be your browser trying ipv6, timing out and then trying ipv4? | 01:58 |
mnaser | fungi: i dont think so since the actual page load in the footer of gitea shows 32s | 01:59 |
fungi | ah, yeah that seems like something slow behind the gitea service itself (database or filesystem maybe) | 02:01 |
fungi | i did just find a page that took 17943ms | 02:01 |
fungi | system load seems low on all the gitea backends and the haproxy load balancer | 02:03 |
slaweq | hi, I just tested this rtd issue with playbook from frickler on the ubuntu noble container and it failed for me with the "Status code was 403 and not [200]: HTTP Error 403: Forbidden" message too | 09:37 |
slaweq | same playbook run on F41 works fine and triggers job on rtd.org | 09:37 |
frickler | slaweq: yes, that's what ianw reported yesterday, too. debian trixie is also working fine, bookworm is broken. the big question is still: why?!? | 09:38 |
frickler | I've been thinking to set up a webserver locally to see whether it can be reproduced there, but I'll need some spare time to do that, wouldn't mind if someone else did | 09:40 |
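A cheap starting point for the local reproduction frickler describes might be to capture the raw request each distro's ansible sends and diff them (a sketch only; the port, file names, and netcat flag syntax are assumptions and vary between netcat variants):

```shell
# Dump the raw request the ansible uri module sends so the noble and f41
# variants can be compared byte for byte. The listener sends no response,
# so the playbook will eventually time out, but the headers are captured.
nc -l 8080 > request-noble.txt
# point the playbook's url at http://<this host>:8080/ from each container,
# then compare the two captures:
diff request-noble.txt request-f41.txt
```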
slaweq | I also don't have much time for this TBH | 10:02 |
frickler | ok, we have it on the agenda for the meeting tonight, maybe someone else wants to pick it up | 10:03 |
slaweq | ok, thx | 10:05 |
mnaser | I'm still getting 15-20s render times on pages at Opendev | 14:52 |
fungi | mnaser: just for pages with git data, or also for the main page? | 14:52 |
fungi | for https://opendev.org/ itself i get "Page: 1ms Template: 1ms" | 14:53 |
fungi | consistently | 14:54 |
fungi | for https://opendev.org/explore/repos i mostly get between 5-25ms for the page and 3-9ms for the template | 14:55 |
fungi | for the top-level urls of random git repos page is around 100ms and template around 15ms, though sometimes i see page times around 1000ms | 14:57 |
fungi | probably has to do with whether the backend i'm hitting has the content cached already or has to regenerate it from database/git inspection | 14:58 |
fungi | odds are we're getting crawled by new ai training bot armies whose user agents we haven't identified and blocked yet | 15:00 |
fungi | so probably both putting i/o strain on the backends and blowing out their caches to be mostly useless | 15:01 |
fungi | yeah, if i repeatedly reload a page i get low render times which is probably coming from the cache, but if i wait a minute and refresh the same page i tend to get an order of magnitude slower behaviors | 15:02 |
cardoe | Any chance the opendevreview bot has Slack integration as well? | 15:03 |
cardoe | Or maybe I can add it? | 15:03 |
fungi | cardoe: we have a matrix version, but no we only support protocols for open source platforms, not proprietary services | 15:03 |
fungi | we've collectively preferred to invest our time in supporting open source software | 15:04 |
cardoe | I only ask because the OpenStack Helm folks are on Slack. If I don't drop a link to my patch it takes longer to get noticed. | 15:04 |
fungi | yes, i think that was brought up with the tc as a reason to probably drop openstack-helm from openstack officially | 15:05 |
mnaser | yikes | 15:06 |
fungi | they could set up an irc-to-slack bridge if they were interested at all in collaborating as a part of openstack, but that doesn't seem to be a priority for their project participants | 15:06 |
kevko | Hi, I'm really desperate at this point , I'm using Neutron with OVN, and OVS starts spinning at 100% CPU, with the log showing entries https://paste.openstack.org/show/bnleYUdDaH5W7Ohk1LiS/ and the time just keeps increasing. The ovs-appctl command hangs, so I can’t use coverage. While ovs-vsctl show works, any attempt to change anything | 15:06 |
kevko | causes it to freeze. If I delete OVS (including the volume) and restart the server, when I reconfigure OVS (re-create it), everything is initially fine because there are no rules in OVS. At this stage, if I remove the interface from br-ex and call reconfigure on OVN (which sets the metadata), everything is still OK. However, as soon as I put the | 15:06 |
kevko | interface back into br-ex, it crashes again | 15:06 |
kevko | Kolla-ansible, kolla, images ubuntu 22.04, ovs 3.3.0 , 24.03.2 | 15:06 |
cardoe | Well their community is on Slack. They've got over 1100 people in their Slack channel. Which is hosted on the official Kubernetes Slack instance. | 15:06 |
mnaser | kevko: wrong place, #openstack-kolla is where you need to go | 15:06 |
kevko | well, this is not about kolla of course ... | 15:07 |
mnaser | ok cool, then #openstack-neutron | 15:07 |
mnaser | whatever it is, it's probably not here :) | 15:07 |
kevko | thanks | 15:07 |
mnaser | i didnt know the tc is going to drop openstack-helm for this, that seems like a massive knee jerk | 15:07 |
cardoe | I'm happy to help do what's needed to get the irc-to-slack bridge going, fungi | 15:07 |
fungi | kevko: #opendev is the people who manage the git servers, etherpad, mailing list platform, etc. we don't develop or directly support openstack software itself (the opendev collaboratory is not the openstack project) | 15:08 |
fungi | cardoe: well, if you did, then the slack channels you've bridged would get messages from the irc channels to which they're bridged, including irc messages from gerritbot. i don't really know anything about running or maintaining slack bridging solutions though, that would be up to slack users to figure out and take care of | 15:09 |
fungi | on the gitea performance front, it looks like the load averages on gitea09 and gitea13 are easily 2-4x that of the other backends, so it's possible our ip-based persistence in haproxy is feeding some more abusive addresses to those backends. i seem to be getting sent to gitea12 though and it's comparatively unloaded | 15:16 |
fungi | cardoe: a slack-to-irc bridge would also address the tc's main concern for openstack-helm, i think, which is that the founding principles of the project necessitate design discussions be open to everyone and have a durable, accessible record for purposes of transparency. if the slack channels where their design discussions take place were also bridged to irc channels, we could easily record | 15:21 |
fungi | those irc channels and publish the logs like we do for other openstack teams. also the project documentation which says they use the #openstack-helm irc channel wouldn't be entirely wrong any more | 15:21 |
clarkb | fungi: mnaser: yes gitea caches content but it has to render it at least once to cache it. So slow then fast is expected but not that slow | 16:02 |
clarkb | the assumption that we're getting crawled is probably a good one. has anyone tailed the apache or gitea webserver logs to see? in particular look for requests to specific sha based urls from odd or consistent user agents in short periods of time | 16:03 |
clarkb | based on what fungi says above I would start by looking at 09 and 13 | 16:05 |
clarkb | survey says amazon and meta need to take a hike | 16:10 |
clarkb | "openstack trained amazonbot does that make aws openstack?" | 16:11 |
clarkb | and bytedance | 16:12 |
clarkb | its amazing to me that everyone feels the need to crawl the same stuff all at once. Almost need a data broker | 16:12 |
clarkb | there is also something that appears to be grabbing all the puppet repos over and over again. maybe an inefficient deployment (except puppet uses a central control node so shouldn't need all the code on all the deployment nodes) | 16:13 |
clarkb | that said I'm seeing reasonable performance from gitea09 right now so maybe these bots aren't the problem and whatever was the problem is no longer hitting us? | 16:14 |
clarkb | I think I'm up to like 6 different AI crawler bots just naively looking at portions of the logs. No wonder cloudflare is all "we'll charge people for this" | 16:17 |
clarkb | aha here we go. Yes it appears finding the right time window is important | 16:18 |
clarkb | looking around when mnaser originally reported this there are more requests from the "odd" UAs | 16:19 |
clarkb | does anyone else want to compile a list and update our ruleset? | 16:19 |
fungi | i can take a look through logs in a bit | 16:24 |
clarkb | I did some quick grep/sort/uniq and the output is in gitea09:/root/ua_counts.txt | 16:27 |
fungi | i'm starting by looking at the haproxy logs, to see if backends are failing to respond to probes and being taken out of the pools | 16:27 |
clarkb | I filtered by things making requests to commit specific urls first then grabbed the user agents from those strings, sorted and uniq -c'd them | 16:27 |
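The pipeline clarkb describes would look roughly like the following (a sketch; the apache log path and the combined-log field positions are assumptions that may not match the actual gitea backends):

```shell
# Count user agents seen on commit-specific urls; in combined log format the
# user agent is the sixth double-quoted field.
grep '/commit/' /var/log/apache2/gitea-ssl-access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn > /root/ua_counts.txt
```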
fungi | (spoiler, they are) | 16:27 |
clarkb | and there are some clear trends that match historical issues we've had, also facebook/meta | 16:27 |
fungi | looks like gitea10 has been bouncing in and out a bunch. 13 some but far less often | 16:28 |
clarkb | 5 of the top 6 UAs are AI bots the other is one that fits our previous issues | 16:29 |
fungi | gitea09 was earlier (a few hours ago) | 16:29 |
clarkb | and I'm not sure they are respecting our crawl delay | 16:29 |
clarkb | oh wait no, the 6th is also related to google I think | 16:30 |
clarkb | https://developers.facebook.com/docs/sharing/webmasters/web-crawlers says rollout is expected to complete october 31 and they are the worst offender | 16:32 |
clarkb | I' | 16:32 |
clarkb | er | 16:32 |
fungi | yeah, gitea09 looks like it was at its worst between 11:52:37 and 12:00:28 utc based on haproxy logging constant failures to respond | 16:32 |
clarkb | I'm beginning to think that we should maybe block facebook and they can petition for things to be unblocked if they make their bot less problematic | 16:32 |
clarkb | or maybe the problem is that is a brand new bot so it is scraping absolutely everything from scratch whereas everyone else is just doing a slow trickle since they've been around a while | 16:33 |
fungi | "slow trickle" is not how i'd personally describe any of the ai training crawlers i've seen in our logs this year | 16:34 |
clarkb | ya I guess some of these may be normal search indexing. In fact I think you can see that difference/trend in my quick data compilation | 16:34 |
clarkb | what I think are search index bots are an order of magnitude fewer requests than the ai bots | 16:34 |
clarkb | annoyingly they share the same parent company and could be making a single request then processing the data on the backend multiple times | 16:35 |
clarkb | but no | 16:35 |
clarkb | anyway facebook/meta is by far the worst offender. I'm comfortable with blocking them but also happy for others to dig into the data and reach the same conclusions before we ban anything | 16:36 |
fungi | you'd think, being google, they'd already have most of this data and could just train their llm directly on their existing dataset | 16:36 |
fungi | you mentioned meta not respecting the crawl-delay in our robots.txt... did you happen to see them also fetching things in disallowed paths by any chance? | 16:37 |
clarkb | fungi: to be clear I wasn't sure if they were respecting it just based on how many there were in short sections of logs. However, they may be doing so and I need to look at timestamps more closely | 16:38 |
clarkb | let me look at a section of requests just from them | 16:38 |
clarkb | they made 64 requests in the 16:38:00-16:38:59.999999.... minute | 16:40 |
clarkb | crawl delay is set to 2 seconds | 16:40 |
fungi | looking at requests to gitea09 between 11:50 and 12:00 utc when it was getting kicked out of the load balancer a bunch, i see the vast majority had user agent "git/2.43.5" | 16:40 |
clarkb | so I think that confirms they are not respecting the crawl delay | 16:40 |
clarkb | the requests are to /*/*/src/commit/** which we do allow | 16:41 |
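A quick way to sanity-check the crawl-delay observation (a sketch; the UA string, log path, and combined-log timestamp field are placeholders/assumptions):

```shell
# Count requests per minute from a single crawler UA; with a 2 second
# crawl-delay anything much above ~30 requests per minute is not compliant.
UA='meta-externalagent'
grep "$UA" /var/log/apache2/gitea-ssl-access.log \
  | awk '{print substr($4, 2, 17)}' \
  | sort | uniq -c | sort -rn | head
```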
clarkb | fungi: those are likely to be git upload pack requests which is normal fetch/clone operations | 16:42 |
clarkb | if we were simply overwhelmed with requests then that would also cause these problems | 16:42 |
fungi | "normal" but possibly at abusive levels | 16:42 |
clarkb | maybe look at the request pattern for that | 16:42 |
clarkb | ya | 16:42 |
clarkb | I mean compared to the "I'm an AI bot requesting commit by commit as I crawl you" being abnormal | 16:42 |
clarkb | it would honestly be better for them to git clone and then index the git tree locally... | 16:43 |
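For comparison, the "clone once, index locally" approach would look something like this (the repo URL is just an example):

```shell
# One clone, then walk history and trees locally instead of fetching every
# /src/commit/... page over HTTP.
git clone --mirror https://opendev.org/opendev/system-config
git --git-dir=system-config.git log --all --oneline
git --git-dir=system-config.git ls-tree -r HEAD
```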
fungi | i have a feeling the really long response delays observed may be cases where the backend is entirely nonresponsive and haproxy pulls it out of the pool and moves the request to a different backend | 16:43 |
clarkb | fungi: I don't think so because gitea generates the timestamps on the backend | 16:43 |
clarkb | and you've got really long timestamps on the pages themselves | 16:44 |
fungi | oh, right, though if you've got a long enough delay in returning content, haproxy may use that as a reason to take it out of the pool, i guess it's just a higher threshold | 16:44 |
clarkb | haproxy should only use errors or disconnects to do that iirc | 16:45 |
clarkb | I don't think we have any response time metric balancing going on | 16:45 |
fungi | layer6 timeout is what it's mainly complaining about | 16:45 |
clarkb | I think those are timeouts making new connections and setting up ssl? | 16:46 |
clarkb | or maybe just even doing the heartbeat check over ssl | 16:46 |
fungi | looks like it expects a protocol-level response within 2 seconds, not necessarily content, so yeah different (but potentially related) behavior | 16:46 |
clarkb | so backend is getting overwhelmed, haproxy detects this as you cannot do tls and then stops sending new requests to that backend | 16:46 |
clarkb | but I don't think it will move existing requests they will either complete or fail on the backend they were originally assigned | 16:47 |
clarkb | then as pressure eases on the backend its heartbeat will resume normal operations and we'll put load there again | 16:47 |
fungi | right, i meant re-persist the client address hashes based on the pool change, so when the browser retries the request it ends up at another gitea server | 16:47 |
clarkb | ya. So failure mode seems to be request demand causes backend to stop negotiating tls which causes haproxy to remove that backend from the pool. All of that demand gets redirected to another server rinse and repeat | 16:49 |
clarkb | pretty similar to what we've seen in the past when clients ddos us | 16:49 |
clarkb | now I guess the question is whether or not the ai bots are the source of the load creating the problems or that is just background noise to a more acute problem | 16:49 |
clarkb | I guess the evidence you've dug up for when haproxy removes backends does point to something more acute since gitea09 is happy at the moment for example despite having the continuous stream of ai bot requests | 16:50 |
fungi | looks like tons of requests for puppet repos in that timeframe | 16:52 |
clarkb | ya about 30% | 16:53 |
clarkb | (thats napkin math so I may be off) | 16:53 |
clarkb | but the same git UA does do upload packs for a number of other repos. However, it also repeats the same repos over and over | 16:54 |
clarkb | so maybe puppet is a coincidence and it's just some git farm cloning things very inefficiently? | 16:54 |
clarkb | fungi: on the haproxy side maybe we can determine if all those requests come from similar IPs (we have to correlate via port numbers and timestamps iirc) | 16:54 |
fungi | yeah | 16:55 |
fungi | working on it | 16:55 |
fungi | i expect the request count would have been much higher, but haproxy kicked the backend out for a good chunk of that window | 16:56 |
clarkb | 50% of the requests to gitea09 during the 11:50 10 minute block are from a single IP | 16:58 |
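The tally behind a number like that could be produced along these lines (a sketch; the haproxy log path and field layout are assumptions about the syslog format in use):

```shell
# Top client addresses sent to gitea09 during the 11:50-11:59 window; in a
# typical haproxy syslog line the client ip:port is the sixth field.
grep 'gitea09' /var/log/haproxy.log \
  | grep ' 11:5[0-9]:' \
  | awk '{split($6, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head
```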
clarkb | whois says it is a cogent IP and traceroute has it landing in montreal. Is it possibly a vexxhost IP? | 17:00 |
fungi | whois for me says it's in a vexxhost allocation | 17:01 |
fungi | https://trunk-builder-centos9.rdoproject.org/ | 17:01 |
clarkb | looking at the log that IP shows up and just makes a ton of requests all at once then the server goes down and flaps back and forth a bit but that IP largely goes away | 17:01 |
fungi | maybe spotz can help, i can also ask around in #rdo | 17:02 |
clarkb | huh how'd you find that? dig -x didn't work | 17:02 |
clarkb | that IP definitely does seem to correlate to the falling over | 17:02 |
clarkb | also if I'm reading the logs properly it makes the same requests to the same repos with the same response bytes over and over in short periods of time | 17:03 |
clarkb | so maybe there is some easy "just remember the first request you made" optimization that can be done | 17:03 |
clarkb | oh I see vexxhost at the end of the whois data now. Not sure how you figured out the host maybe you hit it via https? | 17:04 |
fungi | clarkb: old school, on a whim i plugged the ip address into my web browser to see if there was a webserver running on it | 17:04 |
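The same check from the command line (the address below is a documentation placeholder, not the actual builder IP):

```shell
# Reverse DNS returns nothing useful, but the host identifies itself over
# HTTPS, which is how the rdo trunk builder name turned up.
dig -x 203.0.113.10 +short
curl -sk https://203.0.113.10/ | grep -i '<title>'
```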
clarkb | got it thanks | 17:04 |
fungi | anyway, i asked about it in #rdo and pinged spotz in there for good measure | 17:06 |
clarkb | thanks! | 17:06 |
clarkb | then I guess we have to decide if we want to do anything about the bots but if they aren't the acute problem then its probably best to let them run for now | 17:06 |
fungi | yeah, we could also block rdo's builder if worse comes to worse, but i'd rather give them some time to investigate if it's not causing massive problems | 17:07 |
clarkb | catching up on email and it looks like rax notified us of nodepool test nodes having problems. I don't think there is anything for us to do about that as nodepool should automatically clean them up after the jobs complete (and presumably fail?) | 17:09 |
fungi | i usually just ignore those tickets if the instance name is np*, or at most close them if they sit around open for too long | 17:10 |
clarkb | team meeting is less than 2 hours away now. I'll need to get the agenda sorted out locally before then (mostly a reminder to myself since the change to standard time has changed it relative to my local clock) | 17:10 |
clarkb | fungi: ++ | 17:10 |
fungi | and yes, my expectation is that zuul transparently retries the impacted build so users don't notice anyway | 17:10 |
gouthamr | hello o/ how do i get opendevreview, opendevstatus and opendevmeet to join the #openstack-eventlet-removal channel? i invited these as O/P.. but i feel there's something missing? | 17:41 |
clarkb | gouthamr: in the openstack/project-config repo you need to configure each of the bots to join the channel. | 17:42 |
fungi | gouthamr: see the sections in https://docs.opendev.org/opendev/system-config/latest/irc.html#statusbot for each bot | 17:42 |
clarkb | oh maybe only gerritbot and accessbot are in there? | 17:42 |
gouthamr | ah! thank you, reading | 17:42 |
fungi | er, i guess start with the https://docs.opendev.org/opendev/system-config/latest/irc.html#logging section | 17:42 |
fungi | system-config: inventory/service/group_vars/eavesdrop.yaml | 17:43 |
clarkb | ya gerritbot and accessbot are in project-config meetbot and statusbot are in system-config | 17:43 |
fungi | project-config: gerritbot/channels.yaml | 17:43 |
fungi | project-config: accessbot/channels.yaml | 17:43 |
fungi | oh, and system-config: hiera/common.yaml for statusbot | 17:44 |
fungi | anyway, each location is linked from the respective section in that document | 17:44 |
mnasiadka | I'm observing some weird slowness even with the main Gerrit webpage (https://review.opendev.org) - it mainly works under half a second, but every 4-5 requests - it responds in somewhere around 5 to 15 seconds - is it just my location or something? | 17:49 |
clarkb | gerrit like gitea does rely on caches. I think there is a 5 second timeout for diff generation for example | 17:53 |
clarkb | system load is "low" so I don't suspect general load problems there | 17:54 |
clarkb | I've clicked around in nova, neutron and kolla change lists opening random changes and file diffs and haven't been able to reproduce that behavior | 17:55 |
clarkb | everything seems to be pretty consistent. Larger changes take a little longer (but like still around a second?) | 17:56 |
clarkb | and that is expected | 17:56 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Add meetbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 17:57 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 17:58 |
fungi | mnasiadka: gerrit's webui is a javascript monolith that makes lots of rest api calls to the server, so if those rest api responses go missing or are delayed, that will result in slow rendering. also because it chains up api calls that depend on one another, round-trip latency can play a pretty big role in the responsiveness of pages displaying | 18:02 |
clarkb | ya may be worth checking for packet loss between you and the server? mtr is a quick way to get any idea if that is part of the issue | 18:02 |
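For example (report mode so the result can be pasted; the cycle count is arbitrary):

```shell
# Show per-hop packet loss and latency toward the gerrit server.
mtr --report --report-cycles 100 review.opendev.org
```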
mnasiadka | Yeah, seems like i’m sometimes going some weird route which ends up in packet loss :( | 18:08 |
opendevreview | Amy Marrich proposed opendev/irc-meetings master: Adjust meeting time to avoid confict https://review.opendev.org/c/opendev/irc-meetings/+/934167 | 18:10 |
opendevreview | Merged opendev/irc-meetings master: Adjust meeting time to avoid confict https://review.opendev.org/c/opendev/irc-meetings/+/934167 | 18:36 |
fungi | didn't come up in the meeting, but the discussion from #rdo on git impact from their delorean builders can be found here: https://meetings.opendev.org/irclogs/%23rdo/%23rdo.2024-11-05.log.html#t2024-11-05T17:03:49 | 20:00 |
clarkb | fungi: two comments reading that 1) there is no ssh option 2) they have multiple logical builders all hitting at once; they should be doing something more zuul-like, which is to have more centralized caches for N builders and push from those caches as necessary to avoid excessive updates | 20:03 |
clarkb | part of the problem with that setup is the use of a single IP | 20:04 |
clarkb | having a ton of logical builders behind one IP means we don't get load balancing goodness and we can see the round robin of dos failures instead | 20:04 |
clarkb | so moving the load balancing onto the other side of the IP would probably help | 20:04 |
clarkb | I have enabled bounce processing on service-discuss | 20:50 |
clarkb | I kept all of the defaults | 20:50 |
clarkb | ianw: thank you for putting 933700 together. I rechecked it because the -1 appears to be related to ssh errors not directly related to the change. Also left a few comments/nits/questions if you have time to look | 21:12 |
clarkb | now I'm wondering if I have an excuse to email service-discuss to see if any bounce scores change | 21:16 |
clarkb | corvus: for intermediate registry storage backend replacement, don't gate images publish to the upstream registry, then promote just updates the tags? | 21:23 |
clarkb | so in theory we do just need a recheck to rebuild and put things back into the registry? | 21:23 |
clarkb | corvus: https://etherpad.opendev.org/p/UfU_JiUB7UgpAMOimBbI this is what I've drafted so far | 21:24 |
fungi | lgtm | 21:26 |
ianw | clarkb: will do | 21:28 |
corvus | clarkb: oh yeah that is how we have it set up in opendev. so it should just affect current queue item dependencies. | 21:37 |
corvus | clarkb: msg lgtm | 21:38 |
clarkb | corvus: cool I'll send that out in a bit | 21:39 |
ianw | was thinking about the rtd thing ... that we put mitm in the path and it works. mitm sees the auth header; it seems impossible it's not there -- you can instrument where ansible adds it although actually getting in on the wire is deeper in httplib | 21:58 |
ianw | it leads me to think that cloudflare is doing some sort of fingerprinting and that it's collateral damage | 21:58 |
clarkb | and it's subtle enough that fedora 41 and debian trixie python http stuff don't trip it? | 21:59 |
clarkb | that does seem possible but also very difficult to debug | 21:59 |
ianw | yes, i mean basically impossible from our side :) | 21:59 |
ianw | i mean it could be something like "this fingerprint + POST + empty body + ???" that it just falls into | 22:00 |
ianw | i'd agree just making this a shell: curl call probably avoids it. we can report something to RTD, it's pretty easy to replicate just via a container ... if they want to ping CF about it they can | 22:01 |
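A plain curl version of that trigger might look like the following (everything here is a placeholder: the endpoint path, project slug, and basic-auth credentials would need to match whatever the existing playbook posts to):

```shell
# Empty-body POST with basic auth, mirroring what the ansible uri task does,
# but via curl/libcurl instead of python's httplib stack.
curl -v -u "$RTD_USERNAME:$RTD_PASSWORD" --data '' \
  "https://readthedocs.org/api/v2/webhook/<project-slug>/<id>/"
```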
ianw | i just quickly checked with ansible in noble + f41 -- same venv, exact same versions of everything including cryptography | 22:02 |
ianw | that would essentially mean same openssl library right, if it's using the wheel? | 22:04 |
clarkb | I'm not sure if it bundles all of openssl | 22:04 |
clarkb | or if they manage to link against whatever is on your system. I think the old cffi setup did build against your local openssl on demand but then they moved some stuff to rust? | 22:05 |
clarkb | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/7IK7KPGGDTHQ4VYA3GNSZ2QGGGFP66MC/ announcement sent for intermediate registry pruning | 22:16 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Fix doc for including statusbot in channels https://review.opendev.org/c/opendev/system-config/+/934189 | 22:25 |
clarkb | ianw: a recheck of your change got it to +1 | 22:57 |