Tuesday, 2024-11-05

clarkband sent00:06
corvussomeone should be able to test that with personal (or non openstackci) creds too00:19
mnaseri think opendev.org might be having some sort of dns issues or something01:53
mnaserpages take a long time to load and they then load really fast01:53
mnaserPowered by Gitea Version: v1.22.3 Page: 32819ms Template: 29ms01:53
mnaseralmost like a perfect 30s+2.8s to load page01:53
mnaserrefreshing same page after => Powered by Gitea Version: v1.22.3 Page: 33ms Template: 5ms01:54
fungimnaser: could it be your browser trying ipv6, timing out and then trying ipv4?01:58
mnaserfungi: i dont think so since the actual page load in the footer of gitea shows 32s01:59
fungiah, yeah that seems like something slow behind the gitea service itself (database or filesystem maybe)02:01
fungii did just find a page that took 17943ms02:01
fungisystem load seems low on all the gitea backends and the haproxy load balancer02:03
slaweqhi, I just tested this rtd issue with playbook from frickler on the ubuntu noble container and it failed for me with the "Status code was 403 and not [200]: HTTP Error 403: Forbidden" message too09:37
slaweqsame playbook run on F41 works fine and triggers job on rtd.org09:37
fricklerslaweq: yes, that's what ianw reported yesterday, too. debian trixie is also working fine, bookworm is broken. the big question is still: why?!?09:38
fricklerI've been thinking to set up a webserver locally to see whether it can be reproduced there, but I'll need some spare time to do that, wouldn't mind if someone else did09:40
slaweqI also don't have much time for this TBH10:02
fricklerok, we have it on the agenda for the meeting tonight, maybe someone else wants to pick it up10:03
slaweqok, thx10:05
mnaserI'm still getting 15-20s render times on pages at Opendev14:52
fungimnaser: just for pages with git data, or also for the main page?14:52
fungifor https://opendev.org/ itself i get "Page: 1ms  Template: 1ms"14:53
fungiconsistently14:54
fungifor https://opendev.org/explore/repos i mostly get between 5-25ms for the page and 3-9ms for the template14:55
fungifor the top-level urls of random git repos page is around 100ms and template around 15ms, though sometimes i see page times around 1000ms14:57
fungiprobably has to do with whether the backend i'm hitting has the content cached already or has to regenerate it from database/git inspection14:58
fungiodds are we're getting crawled by new ai training bot armies whose user agents we haven't identified and blocked yet15:00
fungiso probably both putting i/o strain on the backends and blowing out their caches to be mostly useless15:01
fungiyeah, if i repeatedly reload a page i get low render times which is probably coming from the cache, but if i wait a minute and refresh the same page i tend to get an order of magnitude slower behaviors15:02
cardoeAny chance the opendevreview bot has Slack integration as well?15:03
cardoeOr maybe I can add it?15:03
fungicardoe: we have a matrix version, but no we only support protocols for open source platforms, not proprietary services15:03
fungiwe've collectively preferred to invest our time in supporting open source software15:04
cardoeI only ask because the OpenStack Helm folks are on Slack. If I don't drop a link to my patch it takes longer to get noticed.15:04
fungiyes, i think that was brought up with the tc as a reason to probably drop openstack-helm from openstack officially15:05
mnaseryikes15:06
fungithey could set up an irc-to-slack bridge if they were interested at all in collaborating as a part of openstack, but that doesn't seem to be a priority for their project participants15:06
kevkoHi, I'm really desperate at this point ,  I'm using Neutron with OVN, and OVS starts spinning at 100% CPU, with the log showing entries  https://paste.openstack.org/show/bnleYUdDaH5W7Ohk1LiS/  and the time just keeps increasing. The ovs-appctl command hangs, so I can’t use coverage. While ovs-vsctl show works, any attempt to change anything15:06
kevkocauses it to freeze. If I delete OVS (including the volume) and restart the server, when I reconfigure OVS (re-create it), everything is initially fine because there are no rules in OVS. At this stage, if I remove the interface from br-ex and call reconfigure on OVN (which sets the metadata), everything is still OK. However, as soon as I put the15:06
kevkointerface back into br-ex, it crashes again15:06
kevkoKolla-ansible, kolla, images ubuntu 22.04, ovs 3.3.0 , 24.03.215:06
cardoeWell their community is on Slack. They've got over 1100 people in their Slack channel. Which is hosted on the official Kubernetes Slack instance.15:06
mnaserkevko: wrong place, #openstack-kolla is where you need to go15:06
kevkowell, this is not about kolla of course ...15:07
mnaserok cool, then #openstack-neutron15:07
mnaserwhatever it is, it's probably not here :)15:07
kevkothanks 15:07
mnaseri didnt know the tc is going to drop openstack-helm for this, that seems like a massive knee jerk15:07
cardoeI'm happy to help do what's needed to get the irc-to-slack bridge going, fungi 15:07
fungikevko: #opendev is the people who manage the git servers, etherpad, mailing list platform, etc. we don't develop or directly support openstack software itself (the opendev collaboratory is not the openstack project)15:08
fungicardoe: well, if you did, then the slack channels you've bridged would get messages from the irc channels to which they're bridged, including irc messages from gerritbot. i don't really know anything about running or maintaining slack bridging solutions though, that would be up to slack users to figure out and take care of15:09
fungion the gitea performance front, it looks like the load averages on gitea09 and gitea13 are easily 2-4x that of the other backends, so it's possible our ip-based persistence in haproxy is feeding some more abusive addresses to those backends. i seem to be getting sent to gitea12 though and it's comparatively unloaded15:16
fungicardoe: a slack-to-irc bridge would also address the tc's main concern for openstack-helm, i think, which is that the founding principles of the project necessitate design discussions be open to everyone and have a durable, accessible record for purposes of transparency. if the slack channels where their design discussions take place were also bridged to irc channels, we could easily record15:21
fungithose irc channels and publish the logs like we do for other openstack teams. also the project documentation which says they use the #openstack-helm irc channel wouldn't be entirely wrong any more15:21
clarkbfungi: mnaser: yes gitea caches content but it has to render it at least once to cache it. So slow then fast is expected but not that slow16:02
clarkbthe assumption that we're getting crawled is probably a good one. has anyone tailed the apache or gitea webserver logs to see? in particular look for requests to specific sha based urls from odd or consistent user agents in short periods of time16:03
clarkbbased on what fungi says above I would start by looking at 09 and 1316:05
clarkbsurvey says amazon and meta need to take a hike16:10
clarkb"openstack trained amazonbot does that make aws openstack?"16:11
clarkband bytedance16:12
clarkbits amazing to me that everyone feels the need to crawl the same stuff all at once. Almost need a data broker16:12
clarkbthere is also something that appears to be grabbing all the puppet repos over and over again. maybe an inefficient deployment (except puppet uses a central control node so shouldn't need all the code on all the deployment nodes)16:13
clarkbthat said I'm seeing reasonable performance from gitea09 right now so maybe these bots aren't the problem and whatever was the problem is no longer hitting us?16:14
clarkbI think I'm up to like 6 different AI crawler bots just naively looking at portions of the logs. No wonder cloudflare is all "we'll charge people for this"16:17
clarkbaha here we go. Yes it appears finding the right time window is important16:18
clarkblooking around when mnaser originally reported this there are more requests from the "odd" UAs16:19
clarkbdoes anyone else want to compile a list and update our ruleset?16:19
fungii can take a look through logs in a bit16:24
clarkbI did some quick grep/sort/uniq and the output is in gitea09:/root/ua_counts.txt16:27
fungii'm starting by looking at the haproxy logs, to see if backends are failing to respond to probes and being taken out of the pools16:27
clarkbI filtered by things making requests to commit specific urls first then grabbed the user agents from those strings, sorted and uniq -c'd them16:27
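
A rough reconstruction of the kind of pipeline clarkb is describing, assuming the backend's Apache access log is in the standard combined format; the log path and the commit-URL filter here are illustrative, not the exact commands used:

    # filter requests for commit-specific urls, pull the user-agent field
    # (field 6 when splitting a combined-format log line on double quotes),
    # then count occurrences per user agent
    grep ' "GET /[^"]*/commit/' /var/log/apache2/gitea-ssl-access.log \
      | awk -F'"' '{print $6}' \
      | sort | uniq -c | sort -rn > /root/ua_counts.txt
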
fungi(spoiler, they are)16:27
clarkband there are some clear trends that match historical issues we've had, also facebook/meta16:27
fungilooks like gitea10 has been bouncing in and out a bunch. 13 some but far less often16:28
clarkb5 of the top 6 UAs are AI bots the other is one that fits our previous issues16:29
fungigitea09 was earlier (a few hours ago)16:29
clarkband I'm not sure they are respecting our crawl delay16:29
clarkboh wait no, the 6th is also related to google I think16:30
clarkbhttps://developers.facebook.com/docs/sharing/webmasters/web-crawlers says rollout is expected to complete october 31 and they are the worst offender16:32
clarkbI'16:32
clarkber16:32
fungiyeah, gitea09 looks like it was at its worst between 11:52:37 and 12:00:28 utc based on haproxy logging constant failures to respond16:32
clarkbI'm beginning to think that we should maybe block facebook and they can petition for things to be unblocked if they make their bot less problematic16:32
clarkbor maybe the problem is that is a brand new bot so it is scraping absolutely everything from scratch whereas everyone else is just doing a slow trickle since they've been around a while16:33
fungi"slow trickle" is not how i'd personally describe any of the ai training crawlers i've seen in our logs this year16:34
clarkbya I guess some of these may be normal search indexing. In fact I think you can see that difference/trend in my quick data compilation16:34
clarkbwhat I think are search index bots are an order of magnitude fewer requests than the ai bots16:34
clarkbannoyingly they share the same parent company and could be making a single request then processing the data on the backend multiple times16:35
clarkbbut no16:35
clarkbanyway facebook/meta is by far the worst offender. I'm comfortable with blocking them but also happy for others to dig into the data and reach the same conclusions before we ban anything16:36
fungiyou'd think, being google, they'd already have most of this data and could just train their llm directly on their existing dataset16:36
fungiyou mentioned meta not respecting the crawl-delay in our robots.txt... did you happen to see them also fetching things in disallowed paths by any chance?16:37
clarkbfungi: to be clear I wasn't sure if they were respecting it just based on how many there were in short sections of logs. However, they may be doing so and I need to look at timestamps more closely16:38
clarkblet me look at a section of requests just from them16:38
clarkbthey made 64 requests in the 16:38:00-16:38:59.999999.... minute16:40
clarkbcrawl delay is set to 2 seconds16:40
fungilooking at requests to gitea09 between 11:50 and 12:00 utc when it was getting kicked out of the load balancer a bunch, i see the vast majority had user agent "git/2.43.5"16:40
clarkbso I think that confirms they are not respecting the crawl delay16:40
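
A 2-second crawl delay works out to at most roughly 30 requests per minute, so 64 in a single minute is about double the allowed rate. A minimal sketch of the per-minute count behind a number like that, with the crawler's user-agent string and the log path left as placeholders:

    # group one crawler's requests by minute; honoring a 2s Crawl-delay
    # should keep each minute at or below ~30 requests
    grep "$BOT_UA" /var/log/apache2/gitea-ssl-access.log \
      | awk '{print substr($4, 2, 17)}' \
      | sort | uniq -c | sort -rn | head
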
clarkbthe requests are to /*/*/src/commit/** which we do allow16:41
clarkbfungi: those are likely to be git upload pack requests which is normal fetch/clone operations16:42
clarkbif we were simply overwhelmed with requests then that would also cause these problems16:42
fungi"normal" but possibly at abusive levels16:42
clarkbmaybe look at the request pattern for that16:42
clarkbya16:42
clarkbI mean compared to the "I'm an AI bot requesting commit by commit as I crawl you" being abnormal16:42
clarkbit would honestly be better for them to git clone and then index the git tree locally...16:43
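
For contrast, the pattern being suggested to the crawler operators is a single mirror clone followed by walking the history locally, instead of one HTTP request per commit page; the repository here is just an example:

    # one mirror clone, then inspect every commit offline
    git clone --mirror https://opendev.org/opendev/system-config.git
    git -C system-config.git log --all --stat > history.txt
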
fungii have a feeling the really long response delays observed may be cases where the backend is entirely nonresponsive and haproxy pulls it out of the pool and moves the request to a different backend16:43
clarkbfungi: I don't think so because gitea generates the timestamps on the backend16:43
clarkband you've got really long timestamps on the pages themselves16:44
fungioh, right, though if you've got a long enough delay in returning content, haproxy may use that as a reason to take it out of the pool, i guess it's just a higher threshold16:44
clarkbhaproxy should only use errors or disconnects to do that iirc16:45
clarkbI don't think we have any response time metric balancing going on16:45
fungilayer6 timeout is what it's mainly complaining about16:45
clarkbI think those are timeouts making new connections and setting up ssl?16:46
clarkbor maybe just even doing the heartbeat check over ssl16:46
fungilooks like it expects a protocol-level response within 2 seconds, not necessarily content, so yeah different (but potentially related) behavior16:46
clarkbso backend is getting overwhelmed, haproxy detects this as you cannot do tls and then stops sending new requests to that backend16:46
clarkbbut I don't think it will move existing requests they will either complete or fail on the backend they were originally assigned16:47
clarkbthen as pressure eases on the backend its heartbeat will resume normal operations and we'll put load there again16:47
fungiright, i meant re-persist the client address hashes based on the pool change, so when the browser retries the request it ends up at another gitea server16:47
clarkbya. So failure mode seems to be request demand causes backend to stop negotiating tls which causes haproxy to remove that backend from the pool. All of that demand gets redirected to another server rinse and repeat16:49
clarkbpretty similar to what we've seen in the past when clients ddos us16:49
clarkbnow I guess the question is whether or not the ai bots are the source of the load creating the problems or that is just background noise to a more acute problem16:49
clarkbI guess the evidence you've dug up for when haproxy removes backends does point to something more acute since gitea09 is happy at the moment for example despite having the continuous stream of ai bot requests16:50
fungilooks like tons of requests for puppet repos in that timeframe16:52
clarkbya about 30%16:53
clarkb(thats napkin math so I may be off)16:53
clarkbbut the same git UA does do upload packs for a number of other repos. However, it also repeats the same repos over and over16:54
clarkbso maybe puppet is a coincidence and it's just some git farm cloning things very inefficiently?16:54
clarkbfungi: on the haproxy side maybe we can determine if all those requests come from similar IPs (we have to correlate via port numbers and timestamps iirc)16:54
fungiyeah16:55
fungiworking on it16:55
fungii expect the request count would have been much higher, but haproxy kicked the backend out for a good chunk of that window16:56
clarkb50% of the requests to gitea09 during the 11:50 10 minute block are from a single IP16:58
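
A sketch of the sort of per-source-address tally that produces a number like that, assuming the client address:port is the first field after the syslog prefix in the haproxy log; the field position, time filter and log path would all need adjusting for the real configuration:

    # count connections per client ip during the 11:50-11:59 utc window
    grep ' 11:5' /var/log/haproxy.log \
      | awk '{print $6}' | cut -d: -f1 \
      | sort | uniq -c | sort -rn | head
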
clarkbwhois says it is a cogent IP and traceroute has it landing in montreal. Is it possibly a vexxhost IP?17:00
fungiwhois for me says it's in a vexxhost allocation17:01
fungihttps://trunk-builder-centos9.rdoproject.org/17:01
clarkblooking at the log that IP shows up and just makes a ton of requests all at once then the server goes down and flaps back and forth a bit but that IP largely goes away17:01
fungimaybe spotz can help, i can also ask around in #rdo17:02
clarkbhuh how'd you find that? dig -x didn't work17:02
clarkbthat IP definitely does seem to correlate to the falling over17:02
clarkbalso if I'm reading the logs properly it makes the same requests to the same repos with the same response bytes over and over in short periods of time17:03
clarkbso maybe there is some easy "just remember the first request you made" optimization that can be done17:03
clarkboh I see vexxhost at the end of the whois data now. Not sure how you figured out the host maybe you hit it via https?17:04
fungiclarkb: old school, on a whim i plugged the ip address into my web browser to see if there was a webserver running on it17:04
clarkbgot it thanks17:04
fungianyway, i asked about it in #rdo and pinged spotz in there for good measure17:06
clarkbthanks!17:06
clarkbthen I guess we have to decide if we want to do anything about the bots, but if they aren't the acute problem then it's probably best to let them run for now17:06
fungiyeah, we could also block rdo's builder if worse comes to worst, but i'd rather give them some time to investigate if it's not causing massive problems17:07
clarkbcatching up on email and it looks like rax notified us of nodepool test nodes having problems. I don't think there is anything for us to do about that as nodepool should automatically clean them up after the jobs complete (and presumably fail?)17:09
fungii usually just ignore those tickets if the instance name is np*, or at most close them if they sit around open for too long17:10
clarkbteam meeting is less than 2 hours away now. I'll need to get the agenda sorted out locally before then (mostly a reminder to myself since the change to standard time has changed it relative to my local clock)17:10
clarkbfungi: ++17:10
fungiand yes, my expectation is that zuul transparently retries the impacted build so users don't notice anyway17:10
gouthamrhello o/ how do i get opendevreview, opendevstatus and opendevmeet to join the #openstack-eventlet-removal channel? i invited these as O/P.. but i feel there's something missing?17:41
clarkbgouthamr: in the openstack/project-config repo you need to configure each of the bots to join the channel.17:42
fungigouthamr: see the sections in https://docs.opendev.org/opendev/system-config/latest/irc.html#statusbot for each bot17:42
clarkboh maybe only gerritbot and accessbot are in there?17:42
gouthamrah! thank you, reading17:42
fungier, i guess start with the https://docs.opendev.org/opendev/system-config/latest/irc.html#logging section17:42
fungisystem-config: inventory/service/group_vars/eavesdrop.yaml17:43
clarkbya gerritbot and accessbot are in project-config meetbot and statusbot are in system-config17:43
fungiproject-config: gerritbot/channels.yaml17:43
fungiproject-config: accessbot/channels.yaml17:43
fungioh, and system-config: hiera/common.yaml for statusbot17:44
fungianyway, each location is linked from the respective section in that document17:44
mnasiadkaI'm observing some weird slowness even with the main Gerrit webpage (https://review.opendev.org) - it mainly works under half a second, but every 4-5 requests - it responds in somewhere around 5 to 15 seconds - is it just my location or something?17:49
clarkbgerrit like gitea does rely on caches. I think there is a 5 second timeout for diff generation for example17:53
clarkbsystem load is "low" so I don't suspect general load problems there17:54
clarkbI've clicked around in nova, neutron and kolla change lists opening random changes and file diffs and haven't been able to reproduce that behavior17:55
clarkbeverything seems to be pretty consistent. Larger changes take a little longer (but like still around a second?)17:56
clarkband that is expected17:56
opendevreviewGoutham Pacha Ravi proposed opendev/system-config master: Add meetbot to openstack-eventlet-removal  https://review.opendev.org/c/opendev/system-config/+/93416417:57
opendevreviewGoutham Pacha Ravi proposed opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal  https://review.opendev.org/c/opendev/system-config/+/93416417:58
fungimnasiadka: gerrit's webui is a javascript monolith that makes lots of rest api calls to the server, so if those rest api responses go missing or are delayed, that will result in slow rendering. also because it chains up api calls that depend on one another, round-trip latency can play a pretty big role in the responsiveness of pages displaying18:02
clarkbya may be worth checking for packet loss between you and the server? mtr is a quick way to get an idea if that is part of the issue18:02
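
Something along these lines gives a quick per-hop loss and latency summary (report mode, 100 probes; these are standard mtr options):

    # Loss% per hop points at where packets are being dropped on the path
    mtr --report --report-wide --report-cycles 100 review.opendev.org
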
mnasiadkaYeah, seems like i’m sometimes going some weird route which ends up in packet loss :(18:08
opendevreviewAmy Marrich proposed opendev/irc-meetings master: Adjust meeting time to avoid confict  https://review.opendev.org/c/opendev/irc-meetings/+/93416718:10
opendevreviewMerged opendev/irc-meetings master: Adjust meeting time to avoid confict  https://review.opendev.org/c/opendev/irc-meetings/+/93416718:36
fungididn't come up in the meeting, but the discussion from #rdo on git impact from their delorean builders can be found here: https://meetings.opendev.org/irclogs/%23rdo/%23rdo.2024-11-05.log.html#t2024-11-05T17:03:4920:00
clarkbfungi: two comments from reading that: 1) there is no ssh option 2) they have multiple logical builders all hitting at once; they should be doing something more zuul-like, which is to have more centralized caches for N builders and push from those caches as necessary to avoid excessive updates20:03
clarkbpart of the problem with that setup is the use of a single IP20:04
clarkbhaving a ton of logical builders behind one IP means we don't get load balancing goodness and we can see the round robin of dos failures instead20:04
clarkbso moving the load balancing onto the other side of the IP would probably help20:04
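
A minimal sketch of the local-cache pattern being described: one mirror per repository on a cache host, refreshed periodically, with the individual builders cloning from that cache so only the cache host ever talks to opendev.org (paths and repository chosen purely for illustration):

    # on the cache host: create and periodically refresh a mirror
    git clone --mirror https://opendev.org/openstack/nova.git /srv/git-cache/openstack/nova.git
    git -C /srv/git-cache/openstack/nova.git remote update --prune

    # on each builder: clone from the local cache instead of opendev.org
    git clone /srv/git-cache/openstack/nova.git nova
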
clarkbI have enabled bounce processing on service-discuss20:50
clarkbI kept all of the defaults20:50
clarkbianw: thank you for putting 933700 together. I rechecked it because the -1 appears to be related to ssh errors not directly related to the change. Also left a few comments/nits/questions if you have time to look21:12
clarkbnow I'm wondering if I have an excuse to email service-discuss to see if any bounce scores change21:16
clarkbcorvus: for the intermediate registry storage backend replacement, don't gate images publish to the upstream registry and then promote just updates the tags?21:23
clarkbso in theory we do just need a recheck to rebuild and put things back into the registry?21:23
clarkbcorvus: https://etherpad.opendev.org/p/UfU_JiUB7UgpAMOimBbI this is what I've drafted so far21:24
fungilgtm21:26
ianwclarkb: will do21:28
corvusclarkb: oh yeah that is how we have it set up in opendev.  so it should just affect current queue item dependencies.21:37
corvusclarkb: msg lgtm21:38
clarkbcorvus: cool I'll send that out in a bit21:39
ianwwas thinking about the rtd thing ... that we put mitm in the path and it works.  mitm sees the auth header; it seems impossible it's not there -- you can instrument where ansible adds it although actually getting in on the wire is deeper in httplib21:58
ianwit leads me to think that cloudflare is doing some sort of fingerprinting and that it's collateral damage21:58
clarkband it's subtle enough that fedora 41 and debian trixie python http stuff don't trip it?21:59
clarkbthat does seem possible but also very difficult to debug21:59
ianwyes, i mean basically impossible from our side :)21:59
ianwi mean it could be something like "this fingerprint + POST + empty body + ???" that it just falls into 22:00
ianwi'd agree just making this a shell: curl call probably avoids it.  we can report something to RTD, it's pretty easy to replicate just via a container ... if they want to ping CF about it they can22:01
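
A hedged sketch of what swapping the playbook's uri task for a curl call might look like; the webhook URL and credential variables are placeholders rather than the job's actual names:

    # hypothetical stand-in for the uri task; curl presents its own TLS/HTTP
    # fingerprint, which is the point of the experiment
    curl -sS -X POST -u "${RTD_USERNAME}:${RTD_PASSWORD}" "${RTD_WEBHOOK_URL}"
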
ianwi just quickly checked with ansible in noble + f41 -- same venv, exact same versions of everything including cryptography22:02
ianwthat would essentially mean same openssl library right, if it's using the wheel?22:04
clarkbI'm not sure if it bundles all of openssl22:04
clarkbor if they manage to link against whatever is on your system. I think the old cffi setup did build against your local openssl on demand but then they moved some stuff to rust?22:05
clarkbhttps://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/7IK7KPGGDTHQ4VYA3GNSZ2QGGGFP66MC/ announcement sent for intermediate registry pruning22:16
opendevreviewGoutham Pacha Ravi proposed opendev/system-config master: Fix doc for including statusbot in channels  https://review.opendev.org/c/opendev/system-config/+/93418922:25
clarkbianw: a recheck of your change got it to +122:57
