Nick | Message | Time |
---|---|---|
clarkb | and sent | 00:06 |
corvus | someone should be able to test that with personal (or non openstackci) creds too | 00:19 |
mnaser | i think opendev.org might be having some sort of dns issues or something | 01:53 |
mnaser | pages take a long time to load and they then load really fast | 01:53 |
mnaser | Powered by Gitea Version: v1.22.3 Page: 32819ms Template: 29ms | 01:53 |
mnaser | almost like a perfect 30s+2.8s to load page | 01:53 |
mnaser | refreshing same page after => Powered by Gitea Version: v1.22.3 Page: 33ms Template: 5ms | 01:54 |
fungi | mnaser: could it be your browser trying ipv6, timing out and then trying ipv4? | 01:58 |
mnaser | fungi: i dont think so since the actual page load in the footer of gitea shows 32s | 01:59 |
fungi | ah, yeah that seems like something slow behind the gitea service itself (database or filesystem maybe) | 02:01 |
fungi | i did just find a page that took 17943ms | 02:01 |
fungi | system load seems low on all the gitea backends and the haproxy load balancer | 02:03 |
slaweq | hi, I just tested this rtd issue with playbook from frickler on the ubuntu noble container and it failed for me with the "Status code was 403 and not [200]: HTTP Error 403: Forbidden" message too | 09:37 |
slaweq | same playbook run on F41 works fine and triggers job on rtd.org | 09:37 |
frickler | slaweq: yes, that's what ianw reported yesterday, too. debian trixie is also working fine, bookworm is broken. the big question is still: why?!? | 09:38 |
frickler | I've been thinking to set up a webserver locally to see whether it can be reproduced there, but I'll need some spare time to do that, wouldn't mind if someone else did | 09:40 |
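A cheap starting point for the local reproduction frickler describes might be to capture the raw request each distro's ansible sends and diff them (a sketch only; the port, file names, and netcat flag syntax are assumptions and vary between netcat variants):

```shell
# Dump the raw request the ansible uri module sends so the noble and f41
# variants can be compared byte for byte. The listener sends no response,
# so the playbook will eventually time out, but the headers are captured.
nc -l 8080 > request-noble.txt
# point the playbook's url at http://<this host>:8080/ from each container,
# then compare the two captures:
diff request-noble.txt request-f41.txt
```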
slaweq | I also don't have much time for this TBH | 10:02 |
frickler | ok, we have it on the agenda for the meeting tonight, maybe someone else wants to pick it up | 10:03 |
slaweq | ok, thx | 10:05 |
mnaser | I'm still getting 15-20s render times on pages at Opendev | 14:52 |
fungi | mnaser: just for pages with git data, or also for the main page? | 14:52 |
fungi | for https://opendev.org/ itself i get "Page: 1ms Template: 1ms" | 14:53 |
fungi | consistently | 14:54 |
fungi | for https://opendev.org/explore/repos i mostly get between 5-25ms for the page and 3-9ms for the template | 14:55 |
fungi | for the top-level urls of random git repos page is around 100ms and template around 15ms, though sometimes i see page times around 1000ms | 14:57 |
fungi | probably has to do with whether the backend i'm hitting has the content cached already or has to regenerate it from database/git inspection | 14:58 |
fungi | odds are we're getting crawled by new ai training bot armies whose user agents we haven't identified and blocked yet | 15:00 |
fungi | so probably both putting i/o strain on the backends and blowing out their caches to be mostly useless | 15:01 |
fungi | yeah, if i repeatedly reload a page i get low render times which is probably coming from the cache, but if i wait a minute and refresh the same page i tend to get an order of magnitude slower behaviors | 15:02 |
cardoe | Any chance the opendevreview bot has Slack integration as well? | 15:03 |
cardoe | Or maybe I can add it? | 15:03 |
fungi | cardoe: we have a matrix version, but no we only support protocols for open source platforms, not proprietary services | 15:03 |
fungi | we've collectively preferred to invest our time in supporting open source software | 15:04 |
cardoe | I only ask because the OpenStack Helm folks are on Slack. If I don't drop a link to my patch it takes longer to get noticed. | 15:04 |
fungi | yes, i think that was brought up with the tc as a reason to probably drop openstack-helm from openstack officially | 15:05 |
mnaser | yikes | 15:06 |
fungi | they could set up an irc-to-slack bridge if they were interested at all in collaborating as a part of openstack, but that doesn't seem to be a priority for their project participants | 15:06 |
kevko | Hi, I'm really desperate at this point , I'm using Neutron with OVN, and OVS starts spinning at 100% CPU, with the log showing entries https://paste.openstack.org/show/bnleYUdDaH5W7Ohk1LiS/ and the time just keeps increasing. The ovs-appctl command hangs, so I can’t use coverage. While ovs-vsctl show works, any attempt to change anything | 15:06 |
kevko | causes it to freeze. If I delete OVS (including the volume) and restart the server, when I reconfigure OVS (re-create it), everything is initially fine because there are no rules in OVS. At this stage, if I remove the interface from br-ex and call reconfigure on OVN (which sets the metadata), everything is still OK. However, as soon as I put the | 15:06 |
kevko | interface back into br-ex, it crashes again | 15:06 |
kevko | Kolla-ansible, kolla, images ubuntu 22.04, ovs 3.3.0 , 24.03.2 | 15:06 |
cardoe | Well their community is on Slack. They've got over 1100 people in their Slack channel. Which is hosted on the official Kubernetes Slack instance. | 15:06 |
mnaser | kevko: wrong place, #openstack-kolla is where you need to go | 15:06 |
kevko | well, this is not about kolla of course ... | 15:07 |
mnaser | ok cool, then #openstack-neutron | 15:07 |
mnaser | whatever it is, it's probably not here :) | 15:07 |
kevko | thanks | 15:07 |
mnaser | i didnt know the tc is going to drop openstack-helm for this, that seems like a massive knee jerk | 15:07 |
cardoe | I'm happy to help do what's needed to get the irc-to-slack bridge going, fungi | 15:07 |
fungi | kevko: #opendev is the people who manage the git servers, etherpad, mailing list platform, etc. we don't develop or directly support openstack software itself (the opendev collaboratory is not the openstack project) | 15:08 |
fungi | cardoe: well, if you did, then the slack channels you've bridged would get messages from the irc channels to which they're bridged, including irc messages from gerritbot. i don't really know anything about running or maintaining slack bridging solutions though, that would be up to slack users to figure out and take care of | 15:09 |
fungi | on the gitea performance front, it looks like the load averages on gitea09 and gitea13 are easily 2-4x that of the other backends, so it's possible our ip-based persistence in haproxy is feeding some more abusive addresses to those backends. i seem to be getting sent to gitea12 though and it's comparatively unloaded | 15:16 |
fungi | cardoe: a slack-to-irc bridge would also address the tc's main concern for openstack-helm, i think, which is that the founding principles of the project necessitate design discussions be open to everyone and have a durable, accessible record for purposes of transparency. if the slack channels where their design discussions take place were also bridged to irc channels, we could easily record | 15:21 |
fungi | those irc channels and publish the logs like we do for other openstack teams. also the project documentation which says they use the #openstack-helm irc channel wouldn't be entirely wrong any more | 15:21 |
clarkb | fungi: mnaser: yes gitea caches content but it has to render it at least once to cache it. So slow then fast is expected but not that slow | 16:02 |
clarkb | the assumption that we're getting crawled is probably a good one. has anyone tailed the apache or gitea webserver logs to see? in particular look for requests to specific sha based urls from odd or consistent user agents in short periods of time | 16:03 |
clarkb | based on what fungi says above I would start by looking at 09 and 13 | 16:05 |
clarkb | survey says amazon and meta need to take a hike | 16:10 |
clarkb | "openstack trained amazonbot does that make aws openstack?" | 16:11 |
clarkb | and bytedance | 16:12 |
clarkb | its amazing to me that everyone feels the need to crawl the same stuff all at once. Almost need a data broker | 16:12 |
clarkb | there is also something that appears to be grabbing all the puppet repos over and over again. maybe an inefficient deployment (except puppet uses a central control node so shouldn't need all the code on all the deployment nodes) | 16:13 |
clarkb | that said I'm seeing reasonable performance from gitea09 right now so maybe these bots aren't the problem and whatever was the problem is no longer hitting us? | 16:14 |
clarkb | I think I'm up to like 6 different AI crawler bots just naively looking at portions of the logs. No wonder cloudflare is all "we'll charge people for this" | 16:17 |
clarkb | aha here we go. Yes it appears finding the right time window is important | 16:18 |
clarkb | looking around when mnaser originally reported this there are more requests from the "odd" UAs | 16:19 |
clarkb | does anyone else want to compile a list and update our ruleset? | 16:19 |
fungi | i can take a look through logs in a bit | 16:24 |
clarkb | I did some quick grep/sort/uniq and the output is in gitea09:/root/ua_counts.txt | 16:27 |
fungi | i'm starting by looking at the haproxy logs, to see if backends are failing to respond to probes and being taken out of the pools | 16:27 |
clarkb | I filtered by things making requests to commit specific urls first then grabbed the user agents from those strings, sorted and uniq -c'd them | 16:27 |
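The pipeline clarkb describes would look roughly like the following (a sketch; the apache log path and the combined-log field positions are assumptions that may not match the actual gitea backends):

```shell
# Count user agents seen on commit-specific urls; in combined log format the
# user agent is the sixth double-quoted field.
grep '/commit/' /var/log/apache2/gitea-ssl-access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn > /root/ua_counts.txt
```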
fungi | (spoiler, they are) | 16:27 |
clarkb | and there are some clear trends that match historical issues we've had, also facebook/meta | 16:27 |
fungi | looks like gitea10 has been bouncing in and out a bunch. 13 some but far less often | 16:28 |
clarkb | 5 of the top 6 UAs are AI bots the other is one that fits our previous issues | 16:29 |
fungi | gitea09 was earlier (a few hours ago) | 16:29 |
clarkb | and I'm not sure they are respecting our crawl delay | 16:29 |
clarkb | oh wait no, the 6th is also related to google I think | 16:30 |
clarkb | https://developers.facebook.com/docs/sharing/webmasters/web-crawlers says rollout is expected to complete october 31 and they are the worst offender | 16:32 |
clarkb | I' | 16:32 |
clarkb | er | 16:32 |
fungi | yeah, gitea09 looks like it was at its worst between 11:52:37 and 12:00:28 utc based on haproxy logging constant failures to respond | 16:32 |
clarkb | I'm beginning to think that we should maybe block facebook and they can petition for things to be unblocked if they make their bot less problematic | 16:32 |
clarkb | or maybe the problem is that is a brand new bot so it is scraping absolutely everything from scratch whereas everyone else is just doing a slow trickle since they've been around a while | 16:33 |
fungi | "slow trickle" is not how i'd personally describe any of the ai training crawlers i've seen in our logs this year | 16:34 |
clarkb | ya I guess some of these may be normal search indexing. In fact I think you can see that difference/trend in my quick data compilation | 16:34 |
clarkb | what I think are search index bots are an order of magnitude fewer requests than the ai bots | 16:34 |
clarkb | annoyingly they share the same parent company and could be making a single request then processing the data on the backend multiple times | 16:35 |
clarkb | but no | 16:35 |
clarkb | anyway facebook/meta is by far the worst offender. I'm comfortable with blocking them but also happy for others to dig into the data and reach the same conclusions before we ban anything | 16:36 |
fungi | you'd think, being google, they'd already have most of this data and could just train their llm directly on their existing dataset | 16:36 |
fungi | you mentioned meta not respecting the crawl-delay in our robots.txt... did you happen to see them also fetching things in disallowed paths by any chance? | 16:37 |
clarkb | fungi: to be clear I wasn't sure if they were respecting it just based on how many there were in short sections of logs. However, they may be doing so and I need to look at timestamps more closely | 16:38 |
clarkb | let me look at a section of requests just from them | 16:38 |
clarkb | they made 64 requests in the 16:38:00-16:38:59.999999.... minute | 16:40 |
clarkb | crawl delay is set to 2 seconds | 16:40 |
fungi | looking at requests to gitea09 between 11:50 and 12:00 utc when it was getting kicked out of the load balancer a bunch, i see the vast majority had user agent "git/2.43.5" | 16:40 |
clarkb | so I think that confirms they are not respecting the crawl delay | 16:40 |
clarkb | the requests are to /*/*/src/commit/** which we do allow | 16:41 |
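A quick way to sanity-check the crawl-delay observation (a sketch; the UA string, log path, and combined-log timestamp field are placeholders/assumptions):

```shell
# Count requests per minute from a single crawler UA; with a 2 second
# crawl-delay anything much above ~30 requests per minute is not compliant.
UA='meta-externalagent'
grep "$UA" /var/log/apache2/gitea-ssl-access.log \
  | awk '{print substr($4, 2, 17)}' \
  | sort | uniq -c | sort -rn | head
```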
clarkb | fungi: those are likely to be git upload pack requests which is normal fetch/clone operations | 16:42 |
clarkb | if we were simply overwhelmed with requests then that would also cause these problems | 16:42 |
fungi | "normal" but possibly at abusive levels | 16:42 |
clarkb | maybe look at the request pattern for that | 16:42 |
clarkb | ya | 16:42 |
clarkb | I mean compared to the "I'm an AI bot requesting commit by commit as I crawl you" being abnormal | 16:42 |
clarkb | it would honestly be better for them to git clone and then index the git tree locally... | 16:43 |
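For comparison, the "clone once, index locally" approach would look something like this (the repo URL is just an example):

```shell
# One clone, then walk history and trees locally instead of fetching every
# /src/commit/... page over HTTP.
git clone --mirror https://opendev.org/opendev/system-config
git --git-dir=system-config.git log --all --oneline
git --git-dir=system-config.git ls-tree -r HEAD
```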
fungi | i have a feeling the really long response delays observed may be cases where the backend is entirely nonresponsive and haproxy pulls it out of the pool and moves the request to a different backend | 16:43 |
clarkb | fungi: I don't think so because gitea generates the timestamps on the backend | 16:43 |
clarkb | and you've got really long timestamps on the pages themselves | 16:44 |
fungi | oh, right, though if you've got a long enough delay in returning content, haproxy may use that as a reason to take it out of the pool, i guess it's just a higher threshold | 16:44 |
clarkb | haproxy should only use errors or disconnects to do that iirc | 16:45 |
clarkb | I don't think we have any response time metric balancing going on | 16:45 |
fungi | layer6 timeout is what it's mainly complaining about | 16:45 |
clarkb | I think those are timeouts making new connections and setting up ssl? | 16:46 |
clarkb | or maybe just even doing the heartbeat check over ssl | 16:46 |
fungi | looks like it expects a protocol-level response within 2 seconds, not necessarily content, so yeah different (but potentially related) behavior | 16:46 |
clarkb | so backend is getting overwhelmed, haproxy detects this as you cannot do tls and then stops sending new requests to that backend | 16:46 |
clarkb | but I don't think it will move existing requests they will either complete or fail on the backend they were originally assigned | 16:47 |
clarkb | then as pressure eases on the backend its heartbeat will resume normal operations and we'll put load there again | 16:47 |
fungi | right, i meant re-persist the client address hashes based on the pool change, so when the browser retries the request it ends up at another gitea server | 16:47 |
clarkb | ya. So failure mode seems to be request demand causes backend to stop negotiating tls which causes haproxy to remove that backend from the pool. All of that demand gets redirected to another server rinse and repeat | 16:49 |
clarkb | pretty similar to what we've seen in the past when clients ddos us | 16:49 |
clarkb | now I guess the question is whether or not the ai bots are the source of the load creating the problems or that is just background noise to a more acute problem | 16:49 |
clarkb | I guess the evidence you've dug up for when haproxy removes backends does point to something more acute since gitea09 is happy at the moment for example despite having the continuous stream of ai bot requests | 16:50 |
fungi | looks like tons of requests for puppet repos in that timeframe | 16:52 |
clarkb | ya about 30% | 16:53 |
clarkb | (thats napkin math so I may be off) | 16:53 |
clarkb | but the same git UA does do upload packs for a number of other repos. However, it also repeats the same repos over and over | 16:54 |
clarkb | so maybe puppet is a coincidence and it's just some git farm cloning things very inefficiently? | 16:54 |
clarkb | fungi: on the haproxy side maybe we can determine if all those requests come from similar IPs (we have to correlate via port numbers and timestamps iirc) | 16:54 |
fungi | yeah | 16:55 |
fungi | working on it | 16:55 |
fungi | i expect the request count would have been much higher, but haproxy kicked the backend out for a good chunk of that window | 16:56 |
clarkb | 50% of the requests to gitea09 during the 11:50 10 minute block are from a single IP | 16:58 |
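The tally behind a number like that could be produced along these lines (a sketch; the haproxy log path and field layout are assumptions about the syslog format in use):

```shell
# Top client addresses sent to gitea09 during the 11:50-11:59 window; in a
# typical haproxy syslog line the client ip:port is the sixth field.
grep 'gitea09' /var/log/haproxy.log \
  | grep ' 11:5[0-9]:' \
  | awk '{split($6, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head
```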
clarkb | whois says it is a cogent IP and traceroute has it landing in montreal. Is it possibly a vexxhost IP? | 17:00 |
fungi | whois for me says it's in a vexxhost allocation | 17:01 |
fungi | https://trunk-builder-centos9.rdoproject.org/ | 17:01 |
clarkb | looking at the log that IP shows up and just makes a ton of requests all at once then the server goes down and flaps back and forth a bit but that IP largely goes away | 17:01 |
fungi | maybe spotz can help, i can also ask around in #rdo | 17:02 |
clarkb | huh how'd you find that? dig -x didn't work | 17:02 |
clarkb | that IP definitely does seem to correlate to the falling over | 17:02 |
clarkb | also if I'm reading the logs properly it makes the same requests to the same repos with the same response bytes over and over in short periods of time | 17:03 |
clarkb | so maybe there is some easy "just remember the first request you made" optimization that can be done | 17:03 |
clarkb | oh I see vexxhost at the end of the whois data now. Not sure how you figured out the host maybe you hit it via https? | 17:04 |
fungi | clarkb: old school, on a whim i plugged the ip address into my web browser to see if there was a webserver running on it | 17:04 |
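The same check from the command line (the address below is a documentation placeholder, not the actual builder IP):

```shell
# Reverse DNS returns nothing useful, but the host identifies itself over
# HTTPS, which is how the rdo trunk builder name turned up.
dig -x 203.0.113.10 +short
curl -sk https://203.0.113.10/ | grep -i '<title>'
```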
clarkb | got it thanks | 17:04 |
fungi | anyway, i asked about it in #rdo and pinged spotz in there for good measure | 17:06 |
clarkb | thanks! | 17:06 |
clarkb | then I guess we have to decide if we want to do anything about the bots but if they aren't the acute problem then its probably best to let them run for now | 17:06 |
fungi | yeah, we could also block rdo's builder if worse comes to worse, but i'd rather give them some time to investigate if it's not causing massive problems | 17:07 |
clarkb | catching up on email and it looks like rax notified us of nodepool test nodes having problems. I don't think there is anything for us to do about that as nodepool should automatically clean them up after the jobs complete (and presumably fail?) | 17:09 |
fungi | i usually just ignore those tickets if the instance name is np*, or at most close them if they sit around open for too long | 17:10 |
clarkb | team meeting is less than 2 hours away now. I'll need to get the agenda sorted out locally before then (mostly a reminder to myself since the change to standard time has changed it relative to my local clock) | 17:10 |
clarkb | fungi: ++ | 17:10 |
fungi | and yes, my expectation is that zuul transparently retries the impacted build so users don't notice anyway | 17:10 |
gouthamr | hello o/ how do i get opendevreview, opendevstatus and opendevmeet to join the #openstack-eventlet-removal channel? i invited these as O/P.. but i feel there's something missing? | 17:41 |
clarkb | gouthamr: in the openstack/project-config repo you need to configure each of the bots to join the channel. | 17:42 |
fungi | gouthamr: see the sections in https://docs.opendev.org/opendev/system-config/latest/irc.html#statusbot for each bot | 17:42 |
clarkb | oh maybe only gerritbot and accessbot are in there? | 17:42 |
gouthamr | ah! thank you, reading | 17:42 |
fungi | er, i guess start with the https://docs.opendev.org/opendev/system-config/latest/irc.html#logging section | 17:42 |
fungi | system-config: inventory/service/group_vars/eavesdrop.yaml | 17:43 |
clarkb | ya gerritbot and accessbot are in project-config meetbot and statusbot are in system-config | 17:43 |
fungi | project-config: gerritbot/channels.yaml | 17:43 |
fungi | project-config: accessbot/channels.yaml | 17:43 |
fungi | oh, and system-config: hiera/common.yaml for statusbot | 17:44 |
fungi | anyway, each location is linked from the respective section in that document | 17:44 |
mnasiadka | I'm observing some weird slowness even with the main Gerrit webpage (https://review.opendev.org) - it mainly works under half a second, but every 4-5 requests - it responds in somewhere around 5 to 15 seconds - is it just my location or something? | 17:49 |
clarkb | gerrit like gitea does rely on caches. I think there is a 5 second timeout for diff generation for example | 17:53 |
clarkb | system load is "low" so I don't suspect general load problems there | 17:54 |
clarkb | I've clicked around in nova, neutron and kolla change lists opening random changes and file diffs and haven't been able to reproduce that behavior | 17:55 |
clarkb | everything seems to be pretty consistent. Larger changes take a little longer (but like still around a second?) | 17:56 |
clarkb | and that is expected | 17:56 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Add meetbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 17:57 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 17:58 |
fungi | mnasiadka: gerrit's webui is a javascript monolith that makes lots of rest api calls to the server, so if those rest api responses go missing or are delayed, that will result in slow rendering. also because it chains up api calls that depend on one another, round-trip latency can play a pretty big role in the responsiveness of pages displaying | 18:02 |
clarkb | ya may be worth checking for packet loss between you and the server? mtr is a quick way to get any idea if that is part of the issue | 18:02 |
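For example (report mode so the result can be pasted; the cycle count is arbitrary):

```shell
# Show per-hop packet loss and latency toward the gerrit server.
mtr --report --report-cycles 100 review.opendev.org
```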
mnasiadka | Yeah, seems like i’m sometimes going some weird route which ends up in packet loss :( | 18:08 |
opendevreview | Amy Marrich proposed opendev/irc-meetings master: Adjust meeting time to avoid confict https://review.opendev.org/c/opendev/irc-meetings/+/934167 | 18:10 |
opendevreview | Merged opendev/irc-meetings master: Adjust meeting time to avoid confict https://review.opendev.org/c/opendev/irc-meetings/+/934167 | 18:36 |
fungi | didn't come up in the meeting, but the discussion from #rdo on git impact from their delorean builders can be found here: https://meetings.opendev.org/irclogs/%23rdo/%23rdo.2024-11-05.log.html#t2024-11-05T17:03:49 | 20:00 |
clarkb | fungi: two comments reading that 1) there is no ssh option 2) they have multiple logical builders all hitting at once; they should be doing something more zuul-like, which is to have more centralized caches for N builders and push from those caches as necessary to avoid excessive updates | 20:03 |
clarkb | part of the problem with that setup is the use of a single IP | 20:04 |
clarkb | having a ton of logical builders behind one IP means we don't get load balancing goodness and we can see the round robin of dos failures instead | 20:04 |
clarkb | so moving the load balancing onto the other side of the IP would probably help | 20:04 |
clarkb | I have enabled bounce processing on service-discuss | 20:50 |
clarkb | I kept all of the defaults | 20:50 |
clarkb | ianw: thank you for putting 933700 together. I rechecked it because the -1 appears to be related to ssh errors not directly related to the change. Also left a few comments/nits/questions if you have time to look | 21:12 |
clarkb | now I'm wondering if I have an excuse to email service-discuss to see if any bounce scores change | 21:16 |
clarkb | corvus: for intermediate registry storage backend replacement, don't gate images publish to the upstream registry, then promote just updates the tags? | 21:23 |
clarkb | so in theory we do just need a recheck to rebuild and put things back into the registry? | 21:23 |
clarkb | corvus: https://etherpad.opendev.org/p/UfU_JiUB7UgpAMOimBbI this is what I've drafted so far | 21:24 |
fungi | lgtm | 21:26 |
ianw | clarkb: will do | 21:28 |
corvus | clarkb: oh yeah that is how we have it set up in opendev. so it should just affect current queue item dependencies. | 21:37 |
corvus | clarkb: msg lgtm | 21:38 |
clarkb | corvus: cool I'll send that out in a bit | 21:39 |
ianw | was thinking about the rtd thing ... that we put mitm in the path and it works. mitm sees the auth header; it seems impossible it's not there -- you can instrument where ansible adds it although actually getting in on the wire is deeper in httplib | 21:58 |
ianw | it leads me to think that cloudflare is doing some sort of fingerprinting and that it's collateral damage | 21:58 |
clarkb | and it's subtle enough that fedora 41 and debian trixie python http stuff don't trip it? | 21:59 |
clarkb | that does seem possible but also very difficult to debug | 21:59 |
ianw | yes, i mean basically impossible from our side :) | 21:59 |
ianw | i mean it could be something like "this fingerprint + POST + empty body + ???" that it just falls into | 22:00 |
ianw | i'd agree just making this a shell: curl call probably avoids it. we can report something to RTD, it's pretty easy to replicate just via a container ... if they want to ping CF about it they can | 22:01 |
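A plain curl version of that trigger might look like the following (everything here is a placeholder: the endpoint path, project slug, and basic-auth credentials would need to match whatever the existing playbook posts to):

```shell
# Empty-body POST with basic auth, mirroring what the ansible uri task does,
# but via curl/libcurl instead of python's httplib stack.
curl -v -u "$RTD_USERNAME:$RTD_PASSWORD" --data '' \
  "https://readthedocs.org/api/v2/webhook/<project-slug>/<id>/"
```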
ianw | i just quickly checked with ansible in noble + f41 -- same venv, exact same versions of everything including cryptography | 22:02 |
ianw | that would essentially mean same openssl library right, if it's using the wheel? | 22:04 |
clarkb | I'm not sure if it bundles all of openssl | 22:04 |
clarkb | or if they manage to link against whatever is on your system. I think the old cffi setup did build against your local openssl on demand but then they moved some stuff to rust? | 22:05 |
clarkb | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/7IK7KPGGDTHQ4VYA3GNSZ2QGGGFP66MC/ announcement sent for intermediate registry pruning | 22:16 |
opendevreview | Goutham Pacha Ravi proposed opendev/system-config master: Fix doc for including statusbot in channels https://review.opendev.org/c/opendev/system-config/+/934189 | 22:25 |
clarkb | ianw: a recheck of your change got it to +1 | 22:57 |