| @mnasiadka:matrix.org | I think releases.openstack.org got overwhelmed by crawlers or something similar | 13:25 |
|---|---|---|
| @gtema:matrix.org | docs.openstack.org are also not reachable | 13:28 |
| @garyx:matrix.org | tarballs.opendev.org is down as well | 13:47 |
| @mnasiadka:matrix.org | That's probably the same server :) | 13:53 |
| @garyx:matrix.org | yeah most likely, just reporting it as well :) | 13:53 |
| @mnaser:matrix.org | ah I was just about to join the train :) | 14:08 |
| @fungicide:matrix.org | i'm looking into it, seems like i can't even ssh into the server | 14:10 |
| @fungicide:matrix.org | sadly rackspace classic doesn't implement the console log, but what little i can see from the novnc console is showing blocked processes and hung kernel tasks, no idea how old those are though since it's timestamped by seconds since boot | 14:13 |
| @fungicide:matrix.org | server instance is in an active state and pressing return in the console gives me a login prompt, but we don't set passwords on any accounts and my ssh attempts to it time out both over ipv4 and ipv6 so best i can do is attempt a ctrl-alt-del from the console or reboot over nova api | 14:15 |
| @fungicide:matrix.org | even ping (icmpv4 and v6) are mostly dead. i did manage to get two v4 echo replies just now but that was like 95% lost | 14:17 |
| @garyx:matrix.org | yeah ping for me is 99% dead | 14:17 |
| @mnaser:matrix.org | are other rackspace servers properly working or not? maybe its a widespread network issue there | 14:17 |
| @fungicide:matrix.org | i can ssh into afs01.dfw.openstack.org in the same region just fine, but it also gets almost 100% packet loss trying to reach static02.opendev.org locally | 14:19 |
| @mnaser:matrix.org | ah so maybe vm or hypervisor local issue then | 14:19 |
| @fungicide:matrix.org | the secondary nic on static02 which is routed over a separate rfc1918 network is also almost 100% packet loss | 14:20 |
| @fungicide:matrix.org | we also don't seem to have any tickets from rackspace support notifying us of an impacting outage or anything, so i'll proceed with attempting a graceful reboot | 14:21 |
| @fungicide:matrix.org | i think the server instance was likely overloaded, because as soon as it started terminating processes during shutdown i suddenly stopped getting packet loss | 14:22 |
| @mnaser:matrix.org | yeah same here, and looks like its back | 14:23 |
| @mnaser:matrix.org | at least icmp wise :) | 14:23 |
| @fungicide:matrix.org | it's possible our recent increase in apache worker slots (in an attempt to handle more crawlers) was too generous | 14:23 |
| @fungicide:matrix.org | i can ssh into it again | 14:24 |
| @fungicide:matrix.org | and sites it serves return content for me now | 14:24 |
| @garyx:matrix.org | Same, content is working as far as I can tell. | 14:25 |
| @fungicide:matrix.org | #status log Rebooted static02.opendev.org in order to return it to working order, as it was unreachable and appeared to be overloaded; investigation is underway | 14:25 |
| @garyx:matrix.org | Do you have any monitoring running on this one to see the load and/or logs? | 14:26 |
| @fungicide:matrix.org | garyx: we do, it's a pain to get to at the moment because it's in severe need of being rebuilt/replaced and we've locked off public access in order to not expose any vulnerabilities in the seriously outdated software it runs | 14:27 |
| @fungicide:matrix.org | but i'll get an ssh tunnel set up to the cacti server from my workstation and pull up the graphs in a bit | 14:28 |
| @garyx:matrix.org | That's totally understandable, tech debt is a thing | 14:28 |
| @fungicide:matrix.org | i think i may see what was impacting the networking | 14:29 |
| @fungicide:matrix.org | dmesg is already flooded with "nf_conntrack: table full, dropping packet" | 14:29 |
| @mnaser:matrix.org | ahh | 14:29 |
| @garyx:matrix.org | if you have the memory you can resize the table. | 14:30 |
| @fungicide:matrix.org | so whatever is hammering it seems to be exhausting the default 64k entry limit on conntrack table size | 14:30 |
| @fungicide:matrix.org | well, allowing it to handle more connections than that may also just cause it to fall over for other reasons | 14:30 |
| @garyx:matrix.org | True, but I have made that table bigger in quite a few instances and usually it's fine. | 14:31 |
| @fungicide:matrix.org | separately, statusbot never acknowledged my log command, so it probably needs to be looked at as well | 14:31 |
| @garyx:matrix.org | You can always revert later. | 14:31 |
| @fungicide:matrix.org | yeah, it's an 8gb ram server instance and currently using about 6gb of that immediately after reboot, though 1.5g is occupied by buffers/cache | 14:32 |
| @fungicide:matrix.org | granted, the ram usage is all apache workers, so probably unusually high due to whatever bot army has decided to index the whole thing at once | 14:33 |
| @garyx:matrix.org | yeah, resizing that table does not use that much memory in my experience. Apache workers use much more. | 14:34 |
| @fungicide:matrix.org | but yeah, the web sites are already back to being unreachable for me, though my ssh session is still fine presumably due to having an established session | 14:34 |
| @garyx:matrix.org | Hopefully the army stays away but if the table fills up once, I usually see it happen again later. | 14:36 |
| @fungicide:matrix.org | okay, i doubled it with `sysctl -w net.netfilter.nf_conntrack_max=131072` | 14:36 |
| @garyx:matrix.org | You can save that also in /etc/sysctl.conf so it survives reboot. | 14:37 |
| @fungicide:matrix.org | well, yeah this is just testing for now | 14:37 |
| @garyx:matrix.org | Getcha | 14:37 |
| @fungicide:matrix.org | if we want it to survive we'll put it in configuration management so it survives more than just reboots | 14:37 |
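
The runtime bump and the persistence options discussed here can be sketched as follows (values are from this incident; the hashsize note is a general rule of thumb, not something applied here):

```
# Runtime change, lost on reboot (this is what was run during the incident):
sudo sysctl -w net.netfilter.nf_conntrack_max=524288

# Persistent form: a sysctl.d fragment, which in opendev's case would be
# deployed from configuration management rather than hand-edited:
#   # /etc/sysctl.d/99-conntrack.conf
#   net.netfilter.nf_conntrack_max = 524288

# The conntrack hash table is sized separately from the entry limit;
# a common rule of thumb is one bucket per four entries:
#   echo 131072 | sudo tee /sys/module/nf_conntrack/parameters/hashsize
```
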
| @fungicide:matrix.org | still getting table full messages even after doubling the limit | 14:38 |
| @fungicide:matrix.org | doubled again to 262144 now | 14:38 |
| @mnaser:matrix.org | i wonder if there's some form of bad clients that are opening and not closing connections | 14:39 |
| @fungicide:matrix.org | still getting those errors | 14:39 |
| @fungicide:matrix.org | increased to 524288 now | 14:40 |
| @fungicide:matrix.org | fwiw, this is similar to the situation we've been getting on wiki.openstack.org for the past few days now | 14:41 |
| @jim:acmegating.com | fungi: statusbot failed due to an error from wiki | 14:41 |
| @fungicide:matrix.org | go figure | 14:41 |
| @fungicide:matrix.org | i think the internet may finally have collapsed due to badly-configured llm training crawlers, and we can take up goat farming | 14:42 |
| @jim:acmegating.com | i concur | 14:42 |
| @garyx:matrix.org | Yay, goat farming might just be the pivot I was looking for from devops. | 14:43 |
| @garyx:matrix.org | Have you run `conntrack` to see if you have ip's hogging the connections table? | 14:44 |
| @fungicide:matrix.org | fwiw, i'm not getting new table full errors in dmesg yet at 524288, but i still can't load page content | 14:44 |
| @mnaser:matrix.org | I'm going to guess that it's probably because all slots are occupied in apache | 14:45 |
| @fungicide:matrix.org | at this point probably | 14:45 |
| @fungicide:matrix.org | also dmesg is periodically logging eth0 going in and out of promiscuous mode for a few seconds at a time | 14:45 |
| @fungicide:matrix.org | not sure if that's just normal operation | 14:45 |
| @mnaser:matrix.org | that's weird, anyone running tcpdump on that system? | 14:45 |
| @jim:acmegating.com | yep that's me | 14:45 |
| @jim:acmegating.com | most apache slots are in "R" reading request | 14:46 |
| @fungicide:matrix.org | i concur that apache worker slots are probably full, but i can't get it to respond to my request for server-status over the loopback to confirm, so will just assume | 14:46 |
| @fungicide:matrix.org | oh, you got a response | 14:46 |
| @jim:acmegating.com | most requests are for /developer/watcher/datasources/datasources/actions/strategi | 14:47 |
| @jim:acmegating.com | (it's truncated in the view) | 14:47 |
| @jim:acmegating.com | some other docs in that hierarchy too | 14:48 |
| @fungicide:matrix.org | sean-k-mooney: ^ any idea if that's something software could try to request programmatically or whether it's just documentation? | 14:48 |
| @jim:acmegating.com | a quick look at the ip addrs suggests that they are all unique. also, ipv4 and ipv6. | 14:51 |
| @fungicide:matrix.org | also nf_conntrack_count did eventually reach 524288 and it's hovering there again | 14:52 |
| @fungicide:matrix.org | i'm trying to figure out which table(s) it's in | 14:52 |
| @tafkamax:matrix.org | Self DOS would be a classic move though | 14:54 |
| @tafkamax:matrix.org | We have openvas that does it to our infra... | 14:54 |
| @tafkamax:matrix.org | need to tune alertmanager because of it. | 14:55 |
| @garyx:matrix.org | I mean who hasn't done that at least once in their career? | 14:55 |
| @jim:acmegating.com | i'm a little confused; in afs, i don't see a "datasources" directory under /afs/openstack.org/docs/developer/watcher | 14:57 |
| @jim:acmegating.com | indeed: "GET /developer/watcher/datasources/datasources/actions/datasources/actions/strategies/datasources/actions/contributor/actions/strategies/man/strategies/man/integrations/strategies/admin/architecture.html HTTP/1.1" 301 | 15:00 |
| @jim:acmegating.com | perhaps a redirect loop is involved | 15:00 |
| @fungicide:matrix.org | analyzing conntrack table entries, the majority of connections are from the server to itself, so presumably some activity (maybe hitting afs-hosted content since that's about the only thing it does) is resulting in local connections that aren't terminating | 15:00 |
| @fungicide:matrix.org | codesearch turns up a bunch of hits in openstack watcher and vitrage docs about datasources configuration | 15:03 |
| @jim:acmegating.com | fungi: it looks like something is generating requests for a bunch of bogus urls like above, perhaps a redirect loop or some other misconfiguration (either on our side or theirs). perhaps each of those is resulting in afs traffic in order to get a negative fstat result; so we're not benefitting from the afs cache in that case. | 15:09 |
| @fungicide:matrix.org | that would certainly make sense | 15:10 |
| @fungicide:matrix.org | looking at https://opendev.org/openstack/watcher/raw/branch/master/doc/source/admin/index.rst it seems like that could be the source of the initial links into nonexistent top-level datasources | 15:12 |
| @jim:acmegating.com | if i wget http://docs.openstack.org/developer/watcher/datasources/datasources/actions/strategies/actions/actions/strategies/man/strategies/actions/datasources/datasources/admin/install/admin/integrations/strategies/index.html i get a location header pointing to the same url | 15:12 |
| @fungicide:matrix.org | yeah, that certainly would lead to a circular trap for crawlers | 15:13 |
| @jim:acmegating.com | oh wait sorry, that was an http->https redirect; i'm retrying that correctly | 15:14 |
| @fungicide:matrix.org | fwiw those paths exist in the git repository but don't seem to have gotten installed into afs | 15:14 |
| @jim:acmegating.com | hrm, it looks like we don't split our http/https logs, so perhaps all those 301s are just http->https redirects also? | 15:15 |
| @fungicide:matrix.org | i wonder if their most recent docs build got interrupted during post-run and didn't write the whole thing | 15:15 |
| @jim:acmegating.com | root marker is: Project: openstack/watcher Ref: master Build: 1324c0b3665643018366489b7dbbb248 Revision: 5f179609d0ee145fc7957972c83593cce242884d | 15:16 |
| @jim:acmegating.com | i can't remember if that's written at the start or end | 15:16 |
| @tkajinam:matrix.org | https://zuul.opendev.org/t/openstack/builds?job_name=promote-openstack-tox-docs&project=openstack/watcher | 15:17 |
| @tkajinam:matrix.org | at least the promotion job succeeded without any error, according to zuul | 15:17 |
| @fungicide:matrix.org | the site's .htaccess file includes `redirectmatch 301 ^/developer/watcher($|/.*$) /watcher/latest/` and `redirectmatch 301 ^/watcher/?$ /watcher/latest/` (same as for other openstack projects) but watcher isn't doing versioned publishing either | 15:18 |
| @jim:acmegating.com | https://zuul.opendev.org/t/openstack/build/1324c0b3665643018366489b7dbbb248 this doesn't exist... that build should be in the openstack tenant, right? | 15:18 |
| @jim:acmegating.com | Jun 30 2017 .root-marker | 15:20 |
| @jim:acmegating.com | is /afs/openstack.org/docs/developer/watcher the right place? | 15:20 |
| @fungicide:matrix.org | oh! okay, `/afs/openstack.org/docs/developer/watcher` is stale content | 15:20 |
| @tkajinam:matrix.org | I guess that /developer path is an old path ? | 15:20 |
| @fungicide:matrix.org | the redirects above should be going into `/afs/openstack.org/docs/watcher/latest` which does exist | 15:21 |
| @fungicide:matrix.org | so whatever's making those requests isn't following the 301 redirect responses then | 15:21 |
| @jim:acmegating.com | or perhaps they are, but they're not getting an answer due to the afs contention | 15:22 |
| @fungicide:matrix.org | and e.g. `/afs/openstack.org/docs/watcher/latest/datasources` is there | 15:22 |
| @jim:acmegating.com | i haven't been able to complete a request to one of those urls yet | 15:23 |
| @fungicide:matrix.org | okay, so possible this is a symptom | 15:23 |
| @jim:acmegating.com | if i restart apache, is it possible i might get a request through? | 15:24 |
| @jim:acmegating.com | i'd really like to see the response | 15:24 |
| @fungicide:matrix.org | yes | 15:26 |
| @jim:acmegating.com | fungi: i'm going to try that | 15:26 |
| @fungicide:matrix.org | go for it | 15:26 |
| @fungicide:matrix.org | not like it will get any more broken than it already is | 15:26 |
| @jim:acmegating.com | Location: https://docs.openstack.org/watcher/latest/ [following] | 15:27 |
| @jim:acmegating.com | that lgtm | 15:27 |
| @jim:acmegating.com | it's starting to look like bad url handling on the side of the botnet? | 15:28 |
| @fungicide:matrix.org | okay, so presumably any redirect loops (if they're happening) are a cascade effect once things start breaking down | 15:28 |
| @fungicide:matrix.org | that does seem more likely | 15:28 |
| @fungicide:matrix.org | we could start adding those malformed urls to our waf rules temporarily, so that all the ip addresses requesting them get forbidden | 15:29 |
| @jim:acmegating.com | fungi: how do you feel about making /developer/watcher/datasources/.* a tripwire for the new system? | 15:29 |
| @jim:acmegating.com | yeah that :) | 15:29 |
| @fungicide:matrix.org | precisely, yes | 15:29 |
| @jim:acmegating.com | you've got my +2 if you want to do that | 15:30 |
| @fungicide:matrix.org | i'll temporarily add it to the vhost config and see what happens | 15:30 |
| @fungicide:matrix.org | oh! actually that depends on https://review.opendev.org/978118 but i can hack something in for now | 15:31 |
| @fungicide:matrix.org | hopefully clients requesting anything under /developer/watcher/datasources/ on docs.o.o are getting 403 denied responses now | 15:38 |
| @fungicide:matrix.org | `[Tue Mar 03 15:38:56.480726 2026] [:error] [pid 17077:tid 140594856805952] [client <redacted>:57798] [client <redacted>] ModSecurity: Access denied with code 403 (phase 1). String match "/developer/watcher/datasources/" at REQUEST_URI. [file "/etc/apache2/sites-enabled/50-docs.openstack.org.conf"] [line "24"] [id "9002"] [hostname "docs.openstack.org"] [uri "/developer/watcher/datasources/datasources/actions/strategies/actions/actions/strategies/man/strategies/actions/strategies/configuration/datasources/admin/datasources/man/integrations/index.html"] [unique_id "aacAkIBlhckR6gJpFHdf3AAADU8"]` | 15:39 |
| @fungicide:matrix.org | that does seem to work | 15:40 |
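
Based on the error log line and the action list quoted in this conversation, the temporary tripwire likely resembles the following (rule id 9002 and its actions are quoted verbatim above; the initcol/blocking rules and their ids 9001/9003 are assumptions for completeness):

```
# Collections must be initialized before ip.honeypot can be set
# (id 9001 is hypothetical):
SecAction "id:9001,phase:1,initcol:ip=%{REMOTE_ADDR},nolog,pass"

# The tripwire: deny and flag any client requesting the trap path
# (id and actions as quoted in the chat):
SecRule REQUEST_URI "@contains /developer/watcher/datasources/" \
    "id:9002,phase:1,t:lowercase,deny,setvar:ip.honeypot=+1,expirevar:ip.honeypot=86400"

# Deny all further requests from flagged clients (id 9003 hypothetical):
SecRule IP:honeypot "@gt 0" "id:9003,phase:1,deny"
```
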
| @fungicide:matrix.org | i'll put static02.opendev.org into the disable list for ansible updates temporarily, but i am starting to be able to get content again | 15:41 |
| @noonedeadpunk:matrix.org | jsut to mention that, quite right before the issue this patch was merged: https://review.opendev.org/c/openstack/openstack-ansible/+/949497 | 15:42 |
| @noonedeadpunk:matrix.org | which I would not think of creating issues, but it's kind of unconventional either | 15:42 |
| @fungicide:matrix.org | thanks for the pointer, i agree it's probably unrelated | 15:43 |
| @jim:acmegating.com | fungi: um how do i clear my ip? :) | 15:45 |
| @fungicide:matrix.org | checking... | 15:46 |
| @clarkb:matrix.org | corvus: fungi I think if you restart apache it will rebuild the table? | 15:48 |
| @clarkb:matrix.org | then you can let the bot reblock itself | 15:48 |
| @clarkb:matrix.org | maybe | 15:48 |
| @fungicide:matrix.org | testing that theory now | 15:48 |
| @clarkb:matrix.org | I'm not sure if there is a more precise apache mod security table edit method | 15:48 |
| @fungicide:matrix.org | i guess it depends on whether this is in-memory tables or persisted to disk | 15:48 |
| @clarkb:matrix.org | I'm 99% certain it is in memory not disk | 15:49 |
| @clarkb:matrix.org | Is there anything I can be doing to help at this point? | 15:51 |
| @fungicide:matrix.org | not sure yet. ideating? | 15:52 |
| @clarkb:matrix.org | I guess the service still isn't reachable with the waf block in place so that may not be the only issue? | 15:53 |
| @jim:acmegating.com | i think it's stored in /var/cache/modsecurity/ | 15:53 |
| @jim:acmegating.com | -rw-r----- 1 www-data www-data 3469814784 Mar 3 15:53 /var/cache/modsecurity/www-data-ip.pag | 15:53 |
| @fungicide:matrix.org | well, after restarting apache we went back to the conntrack table being full | 15:54 |
| @fungicide:matrix.org | which i think is why it's not responding again | 15:54 |
| @jim:acmegating.com | i wonder if apache is still doing some stats even with the mod_security rule? | 15:55 |
| @jim:acmegating.com | like, maybe it's looking for htaccess files in case they modify the request path, even though it ends up going through mod_security in the end | 15:55 |
| @jim:acmegating.com | also, there is some lock contention on the mod security database file | 15:56 |
| @fungicide:matrix.org | at least we don't seem to have a custom 403 page configured in that vhost (only 404) | 15:56 |
| @jim:acmegating.com | okay i picked a random apache process to strace; i'm not seeing any stats to afs for bad watcher paths, so i don't like my theory that it's still doing stats. | 15:58 |
| @clarkb:matrix.org | `sudo conntrack -L | cut -d' ' -f 10 | sort | uniq -c | sort` makes it look like a proper ddos | 15:58 |
| @jim:acmegating.com | Clark: i agree, i didn't see any duplicate ips | 15:59 |
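
Clark's `conntrack` one-liner can be illustrated on canned data; this is a fabricated sample (real `conntrack -L` output has more fields in variable positions, which is why the live command selected field 10):

```shell
# Fabricated sample resembling `sudo conntrack -L` output; addresses are
# documentation ranges, not real clients:
cat <<'EOF' > /tmp/ct_sample
tcp 6 120 SYN_SENT src=198.51.100.7 dst=203.0.113.10 sport=40001 dport=443
tcp 6 120 SYN_SENT src=198.51.100.8 dst=203.0.113.10 sport=40002 dport=443
tcp 6 120 SYN_SENT src=198.51.100.7 dst=203.0.113.10 sport=40003 dport=443
EOF

# Count connections per source address, busiest last; a DDoS from unique
# sources prints a long tail of count-1 entries rather than a few hogs:
grep -o 'src=[^ ]*' /tmp/ct_sample | sort | uniq -c | sort -n
```

In this sample one source appears twice; the observation on the live table was the opposite, essentially all-unique sources, i.e. no single IP worth blocking.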
| @jim:acmegating.com | i'm starting to see a higher proportion of legitimate requests... well, okay, requests from better-behaved llm crawlers, than bad ones | 16:01 |
| @fungicide:matrix.org | no more conntrack table full errors for the past several minutes | 16:02 |
| @fungicide:matrix.org | and i'm starting to get page content returned again | 16:02 |
| @fungicide:matrix.org | still taking a while to fetch server-status | 16:03 |
| @fungicide:matrix.org | finally came back, and full of about 50% reading request / 50% logging | 16:04 |
| @clarkb:matrix.org | cross referencing conntrack listed IPs to apache logs some show that request you identified as potentially problematic and others don't show up at all. I wonder if they simply never got far enough to process a request in apache | 16:04 |
| @fungicide:matrix.org | most likely, with the workers slap full | 16:05 |
| @clarkb:matrix.org | also many of the IPs seem to originate in a specific (but broad) location in the world | 16:05 |
| @fungicide:matrix.org | i'm hoping that the waf contention is apache busy writing new clients to the block list and will settle down soon | 16:06 |
| @jim:acmegating.com | yeah... except i haven't found a way to remove an ip from that list | 16:06 |
| @clarkb:matrix.org | where do we see the waf contention? | 16:06 |
| @jim:acmegating.com | Clark: flock for the dbm file in strace of apache | 16:07 |
| @fungicide:matrix.org | server-status indicates that basically all requests are for /developer/watcher/datasources/... | 16:07 |
| @clarkb:matrix.org | ack thanks | 16:07 |
| @fungicide:matrix.org | following `/var/log/apache2/docs.openstack.org_error.log` it's constantly updating | 16:08 |
| @jim:acmegating.com | i see more slots in "Logging" state than "Reading" | 16:09 |
| @jim:acmegating.com | i wonder if writing to the list happens in that phase | 16:09 |
| @fungicide:matrix.org | i wonder if we need to make the waf hits not log | 16:09 |
| @fungicide:matrix.org | spot checks don't show us logging any client address more than once though | 16:10 |
| @jim:acmegating.com | zuul logs a lot more data than that; i doubt it's the actual access log that's slow | 16:10 |
| @clarkb:matrix.org | fungi: I think you can add `,nolog` to the `"id:9002,phase:1,t:lowercase,deny,setvar:ip.honeypot=+1,expirevar:ip.honeypot=86400"` section to stop logging it | 16:10 |
| @fungicide:matrix.org | so far it looks like it only logs once as it adds the offending client anyway | 16:11 |
| @jim:acmegating.com | i don't think we should stop logging it | 16:11 |
| @fungicide:matrix.org | that'll require another apache restart, which i'd rather avoid after how long it was offline during the previous restart | 16:11 |
| @jim:acmegating.com | well, i've been trying to say we're going to have another one if we can't find a way to remove an ip | 16:12 |
| @fungicide:matrix.org | page content is returning quickly for me now too, so hopefully this has reached a happy state even though we're logging a constant flood of errors from waf | 16:12 |
| @clarkb:matrix.org | https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v3.x)#persistent-storage indicates you can store things in memory or on disk | 16:12 |
| @clarkb:matrix.org | but I don't see how to convert this to a memory store yet | 16:13 |
| @fungicide:matrix.org | maybe we can switch to memory during the restart and then let it rebuild the table of offenders | 16:13 |
| @clarkb:matrix.org | looks like it might be a global mod security configuration item | 16:15 |
| @clarkb:matrix.org | rather than table specific | 16:15 |
| @clarkb:matrix.org | we are actually using v2 so this document is more correct: https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x) | 16:17 |
| @fungicide:matrix.org | looking at some of the trapped clients' requests in the access logs, they're using varied user agent strings from one request to the next too | 16:17 |
| @fungicide:matrix.org | even from the same ip address | 16:17 |
| @fungicide:matrix.org | the addresses are from allocations managed by rirs all over the world too, looks like maybe mobile clients? | 16:19 |
| @fungicide:matrix.org | my guess is one of the larger compromised device bot armies has been asked to crawl the entire web | 16:19 |
| @fungicide:matrix.org | and badly | 16:19 |
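
The varied-user-agent observation can be checked mechanically; a sketch over a fabricated combined-log-format sample (paths, addresses, and agent strings here are all made up for illustration):

```shell
# Fabricated combined-log-format sample: one IP presenting two different
# User-Agent strings, matching the behavior described above:
cat <<'EOF' > /tmp/access_sample
198.51.100.7 - - [03/Mar/2026:15:40:00 +0000] "GET /a HTTP/1.1" 403 199 "-" "Mozilla/5.0 (Linux; Android 13)"
198.51.100.7 - - [03/Mar/2026:15:48:00 +0000] "GET /b HTTP/1.1" 403 199 "-" "Mozilla/5.0 (Windows NT 10.0)"
198.51.100.9 - - [03/Mar/2026:15:49:00 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"
EOF

# Distinct User-Agent strings per client IP; a count above 1 suggests a
# client rotating agents between requests:
awk -F'"' '{split($1, a, " "); print a[1] "\t" $6}' /tmp/access_sample \
  | sort -u | cut -f1 | uniq -c
```
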
| @clarkb:matrix.org | I suspect (but am not positive) that if we set SecDataDir to in /etc/apache2/mods-enabled/security2.conf to some dir that apache can't write to it would stop persisting the data. But that seems super hacky. I feel liek there must be some way to force memory only but haven't found it | 16:20 |
| @jim:acmegating.com | https://github.com/SpiderLabs/modsec-sdbm-util might be useful for editing | 16:24 |
| @clarkb:matrix.org | I'm half wondering if we should be looking at a non mod security solution so that we can switch to that then drop the db entirely (since that seems simpler than building a tool to edit the live db) | 16:30 |
| @jim:acmegating.com | it's an old repo but it compiles on jammy | 16:30 |
| @clarkb:matrix.org | but we've established a huge variety of IP source and different user agents. Maybe we just mod rewrite that location entirely for now? | 16:31 |
| @jim:acmegating.com | Clark: what do you mean? have mod_rewrite do what? | 16:31 |
| @clarkb:matrix.org | corvus: rewrite requests to /developer/watcher/datasources to a 403 response | 16:32 |
| @jim:acmegating.com | without the waf? | 16:32 |
| @fungicide:matrix.org | it would be nice if any addresses requesting that path didn't consume resources requesting other paths too | 16:33 |
| @clarkb:matrix.org | yes. Though I guess these IPs are requesting other paths as well? This is just the identifying bad location, so having the waf block them entirely may be what is making things better? | 16:33 |
| @clarkb:matrix.org | ya that | 16:33 |
| @fungicide:matrix.org | that's not the only thing they're requesting, it's just that those are the requests taking longer and eating more apache slots | 16:33 |
| @fungicide:matrix.org | because they're nonexistent files and so not getting a response back from the afs cache immediately | 16:34 |
| @clarkb:matrix.org | looking at conntrack the counts are slowly falling so I think the situation is improving? Just slowly | 16:34 |
| @jim:acmegating.com | i have built a copy of that utility and copied the database to the container where i built it and am testing it. it seems to read the database correctly. it has 86923 records. | 16:35 |
| @fungicide:matrix.org | yes, i think as we get more of the offending addresses blocked, they're not keeping connections open as long because apache responds immediately with a http/403 | 16:35 |
| @jim:acmegating.com | `__expire_KEY: 1772556167`<br>`KEY: <my ip>`<br>`TIMEOUT: 3600`<br>`__key: <my ip>`<br>`__name: ip`<br>`CREATE_TIME: 1772552567`<br>`UPDATE_COUNTER: 1`<br>`honeypot: 1`<br>`__expire_honeypot: 1772638967` | 16:37 |
| @fungicide:matrix.org | server-status is showing a bunch more worker slots waiting for new connections now | 16:37 |
| @jim:acmegating.com | that's what a record looks like... it looks like expire_honeypot is 24h, but expire is 1 hour | 16:37 |
| @jim:acmegating.com | i wonder if the 1 hour expiration would cause the entry to be removed even though the honeypot timeout is longer? | 16:38 |
| @fungicide:matrix.org | 1772638967 is 15:42:47 utc tomorrow | 16:38 |
| @jim:acmegating.com | the key expiration time is coming up in 4 minutes though. | 16:38 |
| @fungicide:matrix.org | yeah, if you fall out of the database and are no longer blocked i guess we'll know | 16:39 |
| @clarkb:matrix.org | the 24 hour expiry is the value that our config attempts to expire at. If you ip clears out automatically in 4 minutes then our config isn't doing quite what we expected but maybe good enough for now | 16:39 |
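
For reference, the epoch timestamps in the record above decode as follows (GNU `date` assumed), and the arithmetic confirms the 1-hour key expiry versus 24-hour honeypot expiry being discussed:

```shell
# Decode the record's timestamps (GNU date):
date -u -d @1772552567   # CREATE_TIME: Tue Mar  3 15:42:47 UTC 2026
date -u -d @1772556167   # __expire_KEY, one hour later
date -u -d @1772638967   # __expire_honeypot, the next day at 15:42:47 UTC

# The two expiry windows relative to CREATE_TIME:
echo $(( 1772556167 - 1772552567 ))   # 3600  (1 hour key TTL)
echo $(( 1772638967 - 1772552567 ))   # 86400 (24 hour honeypot TTL)
```
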
| @jim:acmegating.com | i will do some dishes then check to see if i'm still blocked. | 16:39 |
| @fungicide:matrix.org | #status log Implemented temporary Apache mod_security WAF rules to block clients that are collectively causing a distributed denial of service against our static site hosting | 16:41 |
| @status:opendev.org | @fungicide:matrix.org: finished logging | 16:41 |
| @fungicide:matrix.org | seems like whatever was impacting the wiki server has temporarily abated, since statusbot is working again | 16:41 |
| @clarkb:matrix.org | they moved from wiki to docs :) | 16:42 |
| @jim:acmegating.com | looks like i'm still blocked; i'm getting an updated copy of the db | 16:47 |
| @fungicide:matrix.org | the entry is refreshed there i guess? | 16:47 |
| @fungicide:matrix.org | i wonder if that tool's `-r` option needs to be used with apache offline | 16:48 |
| @fungicide:matrix.org | seems apache is getting even happier, seeing open slots with no process now even | 16:49 |
| @fungicide:matrix.org | and some gracefully finishing, indicating it's able to recycle worker processes again | 16:50 |
| @clarkb:matrix.org | we can test the SecDataDir to invalid location idea with our CI jobs if we want to try and use that hack to force things into memory only | 16:50 |
| @jim:acmegating.com | oh and now i'm unblocked | 16:51 |
| @fungicide:matrix.org | huh | 16:51 |
| @jim:acmegating.com | i suspect there may be a periodic cleanup that has to expire the keys | 16:51 |
| @jim:acmegating.com | the db copy i got just after my unblock time still had an entry for me with the same data | 16:51 |
| @jim:acmegating.com | i'll get a third copy and see if it's gone now | 16:51 |
| @fungicide:matrix.org | good to know, so our current 24h expiration isn't relevant because those entries get tossed after 1h anyway | 16:52 |
| @jim:acmegating.com | yeah, that's the hypothesis, gimme a min to confirm | 16:52 |
| @clarkb:matrix.org | so other than the ddos itself two things to look into are manipulating the db somehow (maybe by forcing it into memory only or via a tool to edit in place?) and better understanding the expiration situation | 16:53 |
| @jim:acmegating.com | yep, my entry is no longer in the db | 16:53 |
| @fungicide:matrix.org | neat | 16:54 |
| @clarkb:matrix.org | we do `expirevar=ip.honeypot` which does seem to set a 24 hour expiration for the honeypot value. But the key for the entry is what has an hour long expiry and that also results in clearing the item from the db early | 16:55 |
| @jim:acmegating.com | yep | 16:55 |
| @jim:acmegating.com | for completeness, i have used the tool to remove a key from the database. it appears to have worked. | 16:55 |
| @jim:acmegating.com | so if we want to keep using the dbm files, then i think it's worth building this tool. it's a pretty straightforward classic autoconf c tool, and builds easily on jammy. | 16:56 |
| @jim:acmegating.com | (oh, to clarify, i just removed an entry from my local copy, i have not manipulated the server) | 16:57 |
| @jim:acmegating.com | anyone have any other questions about modsec-sdbm-util before i delete my ephemeral container? | 16:57 |
| @fungicide:matrix.org | i wonder if it handles write locking properly or needs apache to be stopped | 16:58 |
| @fungicide:matrix.org | but useful either way | 16:58 |
| @fungicide:matrix.org | the documentation was a bit light and i haven't dug into the source to see | 16:59 |
| @jim:acmegating.com | it uses the apache runtime library; perhaps the `apr_sdbm_open` function handles locking? | 17:00 |
| @jim:acmegating.com | i see a lot of flock calls in strace | 17:02 |
| @clarkb:matrix.org | I wonder if we can do `expirevar:ip.KEY=86400` to increase the time there. | 17:02 |
| @fungicide:matrix.org | yeah, so maybe we can use it live in that case | 17:02 |
| @fungicide:matrix.org | even more convenient if so | 17:03 |
| @jim:acmegating.com | fungi: yeah, i'd say let's assume so and if we're wrong, then we just lose a database we're happy to throw away anyway. low stakes. :) | 17:03 |
| @fungicide:matrix.org | right, it's not a huge deal to test when things are a little more calm | 17:04 |
| @jim:acmegating.com | Clark: or maybe https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecCollectionTimeout | 17:05 |
| @jim:acmegating.com | not sure how you specify which collection (ie, "IP") that applies to? | 17:05 |
| @jim:acmegating.com | all of them? | 17:06 |
| @clarkb:matrix.org | corvus: I think that goes in the global conf so it probably applies to all of them. In this case that is fine since we only have the one? | 17:06 |
| @jim:acmegating.com | sgtm | 17:07 |
| @clarkb:matrix.org | `/etc/apache2/mods-enabled/security2.conf` this file appears to be the one | 17:07 |
| @clarkb:matrix.org | it does do some includes too so we could stash that in a .d dir to avoid modifying that package supplied file | 17:08 |
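
If the `SecCollectionTimeout` approach pans out, a minimal drop-in might look like this; the filename is hypothetical, and per the discussion the directive applies globally to all collections (its documented default is 3600 seconds, which matches the observed 1-hour key expiry):

```
# Hypothetical /etc/modsecurity/zz-collection-timeout.conf, picked up by
# the Include lines in /etc/apache2/mods-enabled/security2.conf.
# Raise the per-record GC timeout from the 3600s default to 24 hours so
# honeypot entries survive as long as their expirevar intends:
SecCollectionTimeout 86400
```
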
| @clarkb:matrix.org | corvus: I'm checking in with gerrit upstream and gerrit.googlesource.com appears to be slow again. Are you able to check your gerrit-review.googlesource.com test/logs to see if that is looking better? | 17:15 |
| @clarkb:matrix.org | I want to give them as much info as I can | 17:15 |
| @clarkb:matrix.org | corvus: the curl example you gave yesterday appears to be subsecond quick right now | 17:17 |
| @clarkb:matrix.org | sounds like the gerrit.googlesource.com slowness is `a different ongoing incident` | 17:19 |
| @fungicide:matrix.org | i wonder if they're getting flooded with requests for nonexistent pages from a global phone botnet | 17:23 |
| @clarkb:matrix.org | the thought did cross my mind :) | 17:24 |
| @clarkb:matrix.org | fungi: thinking out loud here: any idea if caching 404s for say 10 minutes or an hour would help with performance with afs? | 17:25 |
| @clarkb:matrix.org | we can cache them on the apache side maybe so that we're not hitting openafs for what are likely missing files | 17:25 |
| @fungicide:matrix.org | maybe... | 17:27 |
| @fungicide:matrix.org | i can't see an obvious downside to that | 17:27 |
| @fungicide:matrix.org | other than someone expectantly requesting a new page they've created after it's promoted but before the vos release happens seeing a somewhat longer delay before content finally appears, but that should be minor as inconveniences go | 17:28 |
| @clarkb:matrix.org | right I think if we keep it short to say 10 minutes we mitigate that problem but potentially take a lot of load off of afs? | 17:29 |
| @clarkb:matrix.org | then we're checking these files once every 10 minutes rather than 10k times a second | 17:29 |
| @fungicide:matrix.org | picking a random blocked client from the log, i see it made a second attempt at requesting a different nonexistent url under the same base path roughly 8 minutes after initially being blocked, so the same clients are continuing to try | 17:30 |
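For the 404-caching idea, one possible shape in Apache 2.4 is sketched below (untested here; the cache path and the 600-second lifetime are assumptions, and mod_cache, mod_cache_disk, and mod_headers would all need to be enabled):

```apache
# Answer cacheable requests before most request processing runs
CacheQuickHandler on
CacheEnable disk /
CacheRoot /var/cache/apache2/mod_cache_disk

# 404s carry no freshness info by default, so mod_cache won't store them;
# give them an explicit 10 minute lifetime
Header always set Cache-Control "max-age=600" "expr=%{REQUEST_STATUS} == 404"
```

Repeated probes for the same missing path would then be answered from the disk cache instead of each one going through OpenAFS, at the cost of a newly published page taking up to 10 extra minutes to appear.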
| -@gerrit:opendev.org- Michal Nasiadka proposed: [openstack/project-config] 978566: propose-updates: Add ansible-lint https://review.opendev.org/c/openstack/project-config/+/978566 | 17:44 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [openstack/project-config] 978566: propose-updates: Add ansible-lint target https://review.opendev.org/c/openstack/project-config/+/978566 | 17:44 | |
| @fungicide:matrix.org | current nf_conntrack_count/nf_conntrack_max on static02 is still 396818/524288 or 76% | 17:46 |
| @fungicide:matrix.org | though it's continuing to fall (albeit slowly) | 17:47 |
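The usage percentage above is easy to recompute; a small helper (function name made up for illustration), fed the values from the log since the /proc counters are host-specific:

```shell
conntrack_pct() {
  # $1 = nf_conntrack_count, $2 = nf_conntrack_max; prints integer percent
  echo $(( 100 * $1 / $2 ))
}
# On the server itself the inputs would come from:
#   /proc/sys/net/netfilter/nf_conntrack_count
#   /proc/sys/net/netfilter/nf_conntrack_max
conntrack_pct 396818 524288
```

This prints 75 with truncating integer division; the 76% quoted in the log rounds up.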
| @fungicide:matrix.org | #status log Rebooted wiki.openstack.org in order to get user session logins working again | 17:58 |
| @status:opendev.org | @fungicide:matrix.org: finished logging | 17:58 |
| @clarkb:matrix.org | re wiki I wonder if we can identify a similar marker we could waf (though setting that up is a bit more work there). And I feel bad I'm coming up with all of these crazy ideas and also stuck in meetings all day so can't really dig in too easily | 18:01 |
| @fungicide:matrix.org | same | 18:13 |
| @garyx:matrix.org | Thanks for your work today guys, it's really appreciated. | 18:47 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [openstack/project-config] 978566: propose-updates: Add pcu target https://review.opendev.org/c/openstack/project-config/+/978566 | 19:01 | |
| @clarkb:matrix.org | fungi: and I are in the opendevent meetpad: https://meetpad.opendev.org/opendevent-march-2026 if anyone else wants to join us. This is replacing the weekly team meeting today | 19:08 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [openstack/project-config] 978566: propose-updates: Add pcu target https://review.opendev.org/c/openstack/project-config/+/978566 | 19:30 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 978824: Upgrade Gerrit images to 3.11.9 and 3.12.5 https://review.opendev.org/c/opendev/system-config/+/978824 | 20:52 | |
| @clarkb:matrix.org | infra-root ^ as promised on meetpad. As a heads up I need to eat lunch now then I have to go do an appointment. So I'm not sure I'll be back in time today to babysit the upgrade. I think we can wait until tomorrow though and get it done first thing | 20:53 |
| @clarkb:matrix.org | if anyone else finds users confused about unexplained merge failures it looks like github was having an outage today and that impacted some merge requests for some jobs. | 21:46 |
| @clarkb:matrix.org | corvus: It doesn't look like we log the build that triggered the merge request, just the buildset, so it's a bit harder to track this to a specific build that has deps that maybe it doesn't need. Not sure if that is something that should be changed, though the logs are already quite verbose | 21:46 |
| @clarkb:matrix.org | corvus: specifically on zuul01 I see `2026-03-03 19:18:15,544 DEBUG zuul.Scheduler: [e: 9bca931fca214313b49a919aa0093dd0] Processing result event <MergeCompletedEvent job: 32e9baf77b9f4217a6f400abd4eb10eb buildset: c6513d9027874a8f8bbc26c6cad499a3 merged: False updated: False commit: None errors: []>` and was able to track that to zm03 | 21:48 |
| @clarkb:matrix.org | but not sure what specific build tripped over novnc | 21:48 |
| @clarkb:matrix.org | fungi: the gerrit image builds failed on a bazel error: https://zuul.opendev.org/t/openstack/build/18e3358146ee431f94894c1a740fae2c/log/job-output.txt#1214-1217 | 21:50 |
| @clarkb:matrix.org | Cross-checking against https://gerrit.googlesource.com/plugins/delete-project/+refs maybe the issue is that not all plugins got the new 3.11.9 and 3.12.5 tags? I just sort of assumed they would when I wrote the change | 21:51 |
| @clarkb:matrix.org | but we may need to check all of those and update the change accordingly. But I have to pop out in the next handful of minutes. If I get back at a reasonable hour I can try to run that down later | 21:52 |
| @clarkb:matrix.org | I wonder why that didn't fail in zuul | 21:52 |
| @fungicide:matrix.org | Clark: just saw that myself and found the error in the logs, yeah | 21:53 |
| @clarkb:matrix.org | looks like the hooks plugin hasn't gotten the tags either so this may just be the main repo that got tagged? I can check them all when I get back and update the change. Or feel free to do it and update the change too | 21:53 |
| @clarkb:matrix.org | fungi: my suspicion is that we're falling back to master since that is the default checkout ref | 21:53 |
| @clarkb:matrix.org | fungi: and master is not compatible with 3.11 and 3.12 | 21:53 |
| @fungicide:matrix.org | oh yeah | 21:53 |
| @clarkb:matrix.org | and if we update all the tags to valid values then it should be happy again | 21:54 |
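Checking which plugin repos actually carry the release tags can be scripted with `git ls-remote`; a hedged sketch (the helper name and the example plugin list are illustrative, the real set is whatever the image build clones):

```shell
check_tag() {
  # $1 = repo URL, $2 = tag name; prints "present" or "missing"
  if git ls-remote --tags "$1" "refs/tags/$2" | grep -q .; then
    echo present
  else
    echo missing
  fi
}
# e.g. (network calls, so shown as comments):
#   check_tag https://gerrit.googlesource.com/plugins/delete-project v3.11.9
#   check_tag https://gerrit.googlesource.com/plugins/hooks v3.12.5
```

Any plugin that reports "missing" would fall back to the default checkout ref (master), which matches the bazel failure seen against the 3.11/3.12 branches.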
| @clarkb:matrix.org | but gotta run now | 21:54 |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!