| @fungicide:matrix.org | it may be a little slow initially here as its afs client cache warms up | 00:02 |
|---|---|---|
| @fungicide:matrix.org | also it doesn't know about all the bad client addresses yet, so has to block them all again as they start to hit the tripwires | 00:03 |
| @fungicide:matrix.org | speaking of which, it would be good to get 980812 merged now that lp ppas aren't timing out, i'll approve it | 00:04 |
| @fungicide:matrix.org | and for the avoidance of doubt, the problem botnet did follow the dns update near-instantly | 00:08 |
| @fungicide:matrix.org | which has currently taken up all the available worker slots, and probably will for a while until enough of them manage to block themselves | 00:09 |
| @fungicide:matrix.org | unrelated, infra-prod-service-gitea failed in that deploy buildset, looks like a `/var/lib/apt/lists/lock` conflict on gitea13, probably just a temporary collision with unattended-upgrades | 00:12 |
| @fungicide:matrix.org | since all the docs.openstack.org load has shifted off of static02, i'm going to restart apache there again and see if it gets any better once it sheds some of the extra memory baggage the workers may still be carrying | 00:17 |
| @fungicide:matrix.org | nf_conntrack_count numbers seem to have shifted accordingly, 9920 on static02 (many of which may be stale) and 6369 on static03 | 00:18 |
| @fungicide:matrix.org | looks like things may have settled out, i'm getting near-instant responses from apache on both servers, static02 claims to be processing around 160 concurrent requests now and static03 around 850 | 00:20 |
| @fungicide:matrix.org | so as expected, the bulk of the requests followed docs.openstack.org when it moved | 00:21 |
| @fungicide:matrix.org | still climbing on both, with static03 far outpacing 02 | 00:21 |
| @fungicide:matrix.org | 03 has climbed to around 1100 concurrent requests and still hasn't dipped into swap | 00:22 |
| @fungicide:matrix.org | 02 is now around 200 requests and not really using swap yet either | 00:23 |
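The spot checks being quoted in this thread are basically the kernel's conntrack counter and mod_status worker counts; a minimal sketch, assuming mod_status is reachable on the server itself with ExtendedStatus enabled:

```
# connection tracking entries currently held by the kernel
cat /proc/sys/net/netfilter/nf_conntrack_count

# busy/idle apache workers and request rate (mod_status, ExtendedStatus On)
curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers|ReqPerSec'
```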
| @mnaser:matrix.org | sounds like a good time to take a small break (hopefully not) in case stuff gets poopy again | 00:23 |
| @fungicide:matrix.org | i want to keep an eye on it for a little longer until it seems to have (hopefully) reached a steady state | 00:23 |
| @fungicide:matrix.org | but yeah it's getting late on my end of the rock | 00:24 |
| @clarkb:matrix.org | Thank you for taking care of that. It is good info to know that the majority of the requests followed docs.openstack.org. we may be able to take advantage of that (perhaps making the dedicated server for that vhost permanent) | 00:26 |
| @fungicide:matrix.org | well, now concurrent requests on static02 have risen to 712 and on static03 they've fallen to 753 which may indicate that 03 is performing better and serving up responses faster | 00:29 |
| @fungicide:matrix.org | what we could do tomorrow is build a static04 to replace static02 | 00:30 |
| @clarkb:matrix.org | ++ and maybe start considering a load balancer front end | 00:31 |
| @fungicide:matrix.org | yeah, would be easy enough to replicate what we do for gitea-lb and zuul-lb | 00:31 |
| @clarkb:matrix.org | exactly. The main consideration with doing that is it will impact our waf rules which are ip based | 00:31 |
| @clarkb:matrix.org | they would need to look at x forwarded for instead or something along those lines. So we can't just slap it in front of them without some extra work | 00:32 |
| @fungicide:matrix.org | ohhhh, right, we'd need to look into the forwarded-for protocol thing haproxy adds | 00:32 |
| @clarkb:matrix.org | But even just a second larger server is likely to help while we figure that out | 00:32 |
| @mnaser:matrix.org | if you dont wanna complicate life too much with that, proxyv2 + tcp load balancing can also do the trick | 00:32 |
| @fungicide:matrix.org | x-forwarded-for won't be available unless we do https termination on the lb, but there is that http socket protocol thing that's outside the encrypted payload | 00:33 |
| @mnaser:matrix.org | https://www.haproxy.com/documentation/haproxy-configuration-tutorials/proxying-essentials/client-ip-preservation/enable-proxy-protocol/ | 00:33 |
| @fungicide:matrix.org | yes, that, thanks! | 00:33 |
| @fungicide:matrix.org | so the question is whether apache mod_proxy can make use of that for tracking and blocking client addresses | 00:34 |
| @clarkb:matrix.org | looks like mod remoteip is the apache side support for that | 00:34 |
| @fungicide:matrix.org | sounds like it would work in that case | 00:34 |
| @fungicide:matrix.org | er, not apache mod_proxy of course, i meant mod_security | 00:34 |
| @fungicide:matrix.org | my brain is turning to mush about now | 00:34 |
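A minimal sketch of the load balancer wiring discussed above, assuming TCP-mode haproxy (like gitea-lb/zuul-lb) sending PROXY protocol v2 and Apache 2.4.31 or newer restoring the client address with mod_remoteip; whether mod_security then tracks and blocks on the restored address is the open question and would need testing:

```
# haproxy.cfg fragment (sketch): TCP passthrough with PROXY protocol v2 to a backend
listen static-https
    bind :443
    mode tcp
    server static03 static03.opendev.org:443 check send-proxy-v2

# apache side (sketch): a2enmod remoteip, then on the receiving listener/vhost:
#   RemoteIPProxyProtocol On
```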
| @clarkb:matrix.org | things do look much happier now and we can pick this up in the morning (amongst all the meetings). Thanks again! | 00:35 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 980812: Tripwire the 10 most commonly requested docs 302s https://review.opendev.org/c/opendev/system-config/+/980812 | 00:35 | |
| @clarkb:matrix.org | I'll get the meeting agenda out in a bit | 00:35 |
| @clarkb:matrix.org | which includes this topic! | 00:35 |
| @fungicide:matrix.org | i'm mainly worried about the climbing concurrent connections on static02, though it seems to have maybe leveled off around 1800 and change now | 00:36 |
| @fungicide:matrix.org | nope, climbing again | 00:36 |
| @fungicide:matrix.org | it's still not to 50% of the configured worker slots though | 00:37 |
| @fungicide:matrix.org | okay, now it is | 00:37 |
| @fungicide:matrix.org | and swap use is starting to grow on static02 as well | 00:37 |
| @mnaser:matrix.org | static02 is not hosting docs anymore, so they're hitting something else? | 00:38 |
| @fungicide:matrix.org | yes, one or more of the other 32 sites we host on that server | 00:38 |
| @fungicide:matrix.org | i did see governance.openstack.org was getting crawled a lot, for example | 00:38 |
| @clarkb:matrix.org | They've all been getting crawled. But yes some more than others | 00:39 |
| @mnaser:matrix.org | maybe this is not the most popular opinion but since it's getting quite late and it seems like a never-ending battle, maybe since openstack.org is hosted at CF it might be worth toggling cf on overnight just for things to hum along and revisit this | 00:39 |
| @clarkb:matrix.org | I don't know that we have access to do that or budget approval. But also I don't think handing cf all of our traffic is something we should do before going to bed | 00:41 |
| @clarkb:matrix.org | It's likely to cause more problems imo | 00:41 |
| @tony.breeds:matrix.org | I know I've been basically absent for months but I have time to monitor this. Not that I want to give CF all our traffic | 00:42 |
| @mnaser:matrix.org | usually cf is free (or probably enabling it on a subdomain if the zone already has cf costs nothing) -- but if tony is here then that's breathing room :) | 00:42 |
| @clarkb:matrix.org | Neither tony nor I even have access to the DNS records | 00:42 |
| @clarkb:matrix.org | Only fungi does | 00:42 |
| @clarkb:matrix.org | So we that's the before bed time frame I'm operating on | 00:43 |
| @clarkb:matrix.org | * So that's the before bed time frame I'm operating on | 00:43 |
| @clarkb:matrix.org | I literally cannot do what you suggest and things seem somewhat improved right now particularly for docs.openstack.org | 00:44 |
| @clarkb:matrix.org | My preference would be to monitor then take what we learn tomorrow and continue to improve | 00:44 |
| @fungicide:matrix.org | yeah, at the moment static02 still seems snappy, so may not degrade | 00:45 |
| @fungicide:matrix.org | but static03 is doing amazingly well, which i expect is related to the lack of memory pressure, so hopefully tomorrow moving everything else besides docs.openstack.org to another larger vm will address that | 00:47 |
| @fungicide:matrix.org | static02 is currently handling 29.9 requests/sec while static03 is managing 62.2 requests/sec | 00:49 |
| @fungicide:matrix.org | and yet the concurrent requests on static02 are up at 3050 vs 88 on static03 | 00:50 |
| @fungicide:matrix.org | suggesting that static02 is serving half as many requests and yet not keeping up | 00:51 |
| @mnaser:matrix.org | i was playing with nginx in system-config but hit a wall with projects being able to use htaccess :( | 00:56 |
| @fungicide:matrix.org | tony.breeds: so the thing to be on the lookout for is that static02 may run out of memory and then the oom killer will probably target the parent apache supervisory process, in which case ssh into the server and `sudo systemctl start apache2` | 00:56 |
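A short sketch of the overnight recovery being described here:

```
# check whether the oom killer took out the apache parent
journalctl -k --since "1 hour ago" | grep -iE 'out of memory|oom-kill'
systemctl status apache2

# if the parent process is gone, bring apache back
sudo systemctl start apache2
```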
| @clarkb:matrix.org | Re CF concerns I have with a quick not planned transition (separate from my concerns with CF as a service): A) our waf rules will likely block CF servers if they let those requests through and B) only a subset of domains can be hosted by CF. If the ddos is sufficiently bad on the other vhosts we'll break anyway | 01:01 |
| @clarkb:matrix.org | Any move like that should be thought through and planned and monitored | 01:01 |
| @clarkb:matrix.org | This is what I mean by likely to create more problems | 01:01 |
| @fungicide:matrix.org | right, since every one of the problem requests is different, basic cdn services are of no real help in dealing with it. we'd essentially be pinning our hopes on cf's bot classifier identifying the problem requests in advance and never passing them to our backend | 01:03 |
| @fungicide:matrix.org | since otherwise every single one will be a cold cache hit | 01:04 |
| @mnaser:matrix.org | we can discuss this further too in the right time, but we can also rely on cf to fully cache certain paths in their edge and prevent the hit on our side entirely, but yes there is a bunch of stuff to consider | 01:04 |
| @fungicide:matrix.org | that said, their bot classifier is probably pretty good at spotting this attack if it's as widespread as i think it is | 01:04 |
| @clarkb:matrix.org | for example I imagine we would switch something like docs-cf.openstack.org first and see how it does. Or maybe do an A/B setup if their DNS supports that | 01:05 |
| @tony.breeds:matrix.org | fungi: thanks. I'll keep an eye out | 01:05 |
| @fungicide:matrix.org | not being personally familiar with cf's full caching option, i guess you have to tell them all the pages you want them to cache in advance? which would be a nontrivial thing to map out given the current state of redirects on these sites | 01:06 |
| @clarkb:matrix.org | anyway not something I just want to apply monday night and hope for the best. I really want to avoid it in the first place but if we're making that transition and giving up on the Internet then I want to do it right | 01:06 |
| @mnaser:matrix.org | yeah i am not a fan of it but its more to buy time than anything, but yea i get it | 01:06 |
| @clarkb:matrix.org | and if we're giving up on the Internet I think that poses a bunch of other followup questions I don't want to invite on a monday night either | 01:07 |
| @fungicide:matrix.org | because at least for the present situation, the goal would be to have cf not allow the requests through, so it would need a complete picture of what requests are valid ahead of time | 01:07 |
| @mnaser:matrix.org | it would be nice if we had things like a node_exporter/prometheus/some logs thing that folks can help look at | 01:07 |
| @mnaser:matrix.org | right now i can only just throw blind suggestions | 01:08 |
| @fungicide:matrix.org | where "valid" means doesn't end up redirecting to the same content | 01:08 |
| @fungicide:matrix.org | because of all these programmatically generated redirect match patterns for docs.openstack.org | 01:08 |
| @clarkb:matrix.org | we did, but then had to block it off from public access. There is a spec up to deploy a prometheus to replace it but as noted in IRC we've been busy and not gotten to it. mnasiadka expressed interest in it though | 01:08 |
| @fungicide:matrix.org | yes, basically nobody has shown up with the time and interest to set up a new monitoring system that would replace the old one we had to block because we didn't have time to keep limping it along | 01:09 |
| @mnaser:matrix.org | ill play around pushing something today to add prometheus as stage 1, and then stage 2 is rolling out node exporter and having prometheus scrape the nodes | 01:11 |
| @mnaser:matrix.org | maybe apache exporter would be nice as well to grab metrics from apache | 01:11 |
| @clarkb:matrix.org | yes I think the spec is focused on node level info first, but leaves the door open for additional sources. Zuul, gitea, gerrit, all have support too | 01:14 |
| @fungicide:matrix.org | cool! at least now you can see the spec page again ;) | 01:14 |
| @fungicide:matrix.org | anyway, i'm going to call it a night and then pick this up again in the morning in between meetings | 01:14 |
| @fungicide:matrix.org | thanks everyone! | 01:14 |
| @clarkb:matrix.org | see you tomorrow | 01:15 |
| -@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 01:35 | |
| @mnaser:matrix.org | im not sure where the dns records come from but i guess that cant come until a vm is setup for this | 01:37 |
| @clarkb:matrix.org | yes have to boot something first to have that info. But then it goes in https://opendev.org/opendev/zone-opendev.org/src/branch/master/zones/opendev.org/zone.db | 01:39 |
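For illustration, the sort of records that would be added to that zone file once the VM exists; the hostname and addresses below are placeholders, not a real allocation:

```
; zones/opendev.org/zone.db fragment (sketch, placeholder name and addresses)
prometheus01    IN  A       203.0.113.10
prometheus01    IN  AAAA    2001:db8::10
```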
| -@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 01:43 | |
| -@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 02:06 | |
| @mnaser:matrix.org | something's off... openstack-tox-pep8 has been queued for 115 hours apparently for a kolla-ansible job | 02:12 |
| @mnaser:matrix.org | and even tho the dashboard reports 0/0 events, my system-config change has been sitting with no jobs loaded for 7 minutes | 02:13 |
| @clarkb:matrix.org | It's the daily periodic job thundering herd | 02:15 |
| @clarkb:matrix.org | That doesn't explain the kolla pep8 job but does explain slowness in getting nodes for current things | 02:15 |
| @mnaser:matrix.org | ah, my change just got its jobs showing up in waiting status now | 02:16 |
| @clarkb:matrix.org | We time the herd for this time of day as general load is lower after North America ends its work day. Not zero but less than EU and NA working hours | 02:17 |
| -@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 03:34 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/980851 | 07:06 | |
| @mnasiadka:matrix.org | Clark: I posted ^^ because I noticed that on OpenMetal mirror02 we're getting duplicate repo entries, so maybe it's time to switch to deb822 :) | 07:24 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/980851 | 07:25 | |
| @harbott.osism.tech:regio.chat | docs.o.o is unusably slow for me again, but at least it did not oom. not sure whether we should do restarts until the situation improves? | 07:33 |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 07:42 | |
| @mnasiadka:matrix.org | mnaser: ^^ took the liberty to push this a little forward | 07:42 |
| @mnasiadka:matrix.org | Jens Harbott: It looks like it's just at the workers limit (based on the log message "[Tue Mar 17 06:52:37.295160 2026] [mpm_event:error] [pid 131868:tid 124365246039936] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.") | 07:44 |
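AH03490 means the scoreboard has run out of process slots (bounded by ServerLimit) even though MaxRequestWorkers has not been reached, typically because children stuck in graceful finishing still occupy slots; a sketch of the mpm_event knobs involved, with purely illustrative values rather than what these servers actually run:

```
# mpm_event fragment (sketch): illustrative values only
<IfModule mpm_event_module>
    ServerLimit              64
    ThreadsPerChild          64
    MaxRequestWorkers      2048    # must not exceed ServerLimit * ThreadsPerChild
    MaxConnectionsPerChild    0
</IfModule>
```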
| @tafkamax:matrix.org | Regarding monitoring. We are currently using prometheus, some exporters, promtail on the hosts and have two central monitoring hosts where all data can be queried. Loki handles the logs and thanos is the data retention backend that pushes to s3. | 07:47 |
| @tafkamax:matrix.org | We are upgrading it soon, with the poc finished, which uses prometheus and node exporter on the hosts, but for the "glue" we will try to use vector | 07:47 |
| @tafkamax:matrix.org | https://vector.dev/ | 07:47 |
| @tafkamax:matrix.org | I don't know if you have heard of it. | 07:48 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/980851 | 08:16 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 08:27 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 08:33 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 08:36 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/980851 | 10:00 | |
| @fungicide:matrix.org | starting to get going for the day, i see static02 is doing pretty well but static03 has basically all its apache workers busy now | 13:30 |
| @fungicide:matrix.org | so docs.openstack.org is back to being slow, but all the other static sites are faring fine | 13:31 |
| @mnaser:matrix.org | excellent. we _could_ add node exporter wiring to this but before i run too far forward it would be nice to get this deployed and validated first. | 13:34 |
| @fungicide:matrix.org | seems apache wants more ram for buffers/cache, it's actively using 4gb ram and another 10gb for buffers/cache, but has also paged out an additional 4gb to swap | 13:35 |
| @fungicide:matrix.org | instead of making another 15gb replacement for the sites on static02, we might think about moving docs.openstack.org to a server with 30gb ram | 13:36 |
| -@gerrit:opendev.org- Joaci Otaviano de Morais proposed wip: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/978113 | 14:04 | |
| -@gerrit:opendev.org- Joaci Otaviano de Morais marked as active: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/978113 | 14:18 | |
| @fungicide:matrix.org | seeing signs of memory pressure off and on for static02 as well, so i'm leaning toward moving docs.openstack.org to a new 30gb static04, then shifting the remaining sites off the 8gb static02 to the 15gb static03 | 14:30 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/zone-opendev.org] 980965: Add a static04.opendev.org server https://review.opendev.org/c/opendev/zone-opendev.org/+/980965 | 14:49 | |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 980966: Add a static04 server https://review.opendev.org/c/opendev/system-config/+/980966 | 14:49 | |
| @fungicide:matrix.org | in case we want it, that ^ exists now | 14:50 |
| @fungicide:matrix.org | my day is about to get more full of meetings, so wanted to knock it out beforehand to give us options | 14:51 |
| @clarkb:matrix.org | the system-config-run jobs allow you to validate it before anything merges and deploys properly | 15:00 |
| @mnaser:matrix.org | ok so im just trying to gauge a bar for getting this deployed, is it also including the scraping and addition of node exporter to the infrastructure? prometheus is already tested with system-config-run right now | 15:01 |
| @clarkb:matrix.org | I think my personal preference would be that we model the data ingestion too. Otherwise we may do a bunch of work to deploy something then have to start over after we do that | 15:02 |
| @clarkb:matrix.org | this is one of the major benefits of the way we set up zuul and our test jobs. We can see an end to end model of the final thing to have a good idea if it does what we expect | 15:03 |
| @clarkb:matrix.org | (it doesn't have to be in one change either to be clear) | 15:03 |
| @clarkb:matrix.org | so we might still deploy to production incrementally but we can have a picture of the final state ahead of time | 15:03 |
| @mnaser:matrix.org | shall i amend it all into one change or do i put the "add node exporter" stuff as a thing on top? | 15:04 |
| @clarkb:matrix.org | I would probably do it as a followup change as that will make the deployment part easier | 15:04 |
| @mnaser:matrix.org | yeah just to make the review more digestible | 15:05 |
| @clarkb:matrix.org | fungi: I'm +2 on both changes but didn't approve as I'm in the morning long meeting gauntlet (that you are too) | 15:09 |
| @clarkb:matrix.org | also I didn't send our meeting agenda yesterday evening like I said I would. I will do that in a bit | 15:10 |
| @clarkb:matrix.org | fungi: re static03, looking at system resources there is a fair bit of memory available (though it has dipped into swap for some reason), maybe we need to be increasing the apache mpm limits there? | 15:11 |
| @clarkb:matrix.org | as an alternative or addition to the static04 boot | 15:11 |
| @tafkamax:matrix.org | What is your swappiness level on the vms | 15:19 |
| @tafkamax:matrix.org | Imho i would leave it at 1 | 15:19 |
| @tafkamax:matrix.org | Out of 100 | 15:19 |
| @clarkb:matrix.org | I don't think we're currently setting it so it is whatever ubuntu's cloud image sets by default. We did at one time (not sure if we still do) tune it for the test nodes when setting up swap on them. Could be helpful here too as well | 15:23 |
| @tafkamax:matrix.org | When i enable swap i set it to 10 or lower always. Swap should be last possible resort imho. | 15:24 |
| @tafkamax:matrix.org | I am no swap expert, but in my use cases, I try to avoid it as much as possible. | 15:24 |
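The tunable being discussed, for reference; whether 1 or 10 is the right value for these hosts is a judgment call:

```
# current value, then a persistent override via sysctl.d
cat /proc/sys/vm/swappiness
echo 'vm.swappiness = 10' | sudo tee /etc/sysctl.d/60-swappiness.conf
sudo sysctl --system
```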
| @fungicide:matrix.org | it hasn't dipped "a bit" into swap. at the moment it's using 2/3 of its 15gb ram for buffers/cache and paged 5gb worth of allocations out to swap in order to get more buffers/cache space | 15:37 |
| @fungicide:matrix.org | chances are this is mod_security chewing up a ton of memory for buffering the concurrent requests, but the memory pressure is slowing down response speed causing those connections to pile up, exacerbating the issue | 15:38 |
| @fungicide:matrix.org | i'm hoping if we throw more memory at it, we'll hit a point where it doesn't tip over this way | 15:40 |
| @clarkb:matrix.org | yes I have no objection to proceeding with static04 as proposed. Should we approve those changes? (Zone update has to go first) | 15:41 |
| @fungicide:matrix.org | yeah, those changes should be non-impacting regardless, the magic happens when we update the cnames for the sites anyway | 15:42 |
| @fungicide:matrix.org | i'm also going to take static02 out of the disable list for ansible now that we've no longer got a divergence between configuration on the server and in git | 15:44 |
| @tkajinam:matrix.org | It seems that docs.openstack.org is down again ? | 15:48 |
| @tkajinam:matrix.org | hm it just loads right after I posted this .... | 15:48 |
| @tkajinam:matrix.org | * hm it just loads right after I posted this .... (though it's still slow) | 15:48 |
| @clarkb:matrix.org | its slow. The bigger server is helping, but hasn't been enough. We're going even bigger now | 15:48 |
| @fungicide:matrix.org | also separately, i'm getting a lot of timeouts and connection closed when browsing gitea too | 16:02 |
| @fungicide:matrix.org | tkajinam: yeah, the current bot flood has pushed us to expand our site hosting resources by 6x so far | 16:03 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/zone-opendev.org] 980965: Add a static04.opendev.org server https://review.opendev.org/c/opendev/zone-opendev.org/+/980965 | 16:03 | |
| @fungicide:matrix.org | deploy for that is waiting behind the hourly jobs | 16:09 |
| @fungicide:matrix.org | static04.opendev.org is resolving for me now, Clark do you want to approve 980966? | 16:15 |
| @fungicide:matrix.org | the deploy jobs have officially reported success now too | 16:17 |
| @fungicide:matrix.org | i need to step away for a few to grab a belated shower and snack before my next meeting, but can get static04 verified and docs.openstack.org moved over to it after 980966 | 16:19 |
| @fungicide:matrix.org | in case someone wants to approve it in the meantime | 16:19 |
| @clarkb:matrix.org | done | 16:19 |
| @fungicide:matrix.org | thanks! | 16:19 |
| @fungicide:matrix.org | bbiab | 16:19 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 980966: Add a static04 server https://review.opendev.org/c/opendev/system-config/+/980966 | 16:31 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/980993 | 16:39 | |
| -@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980994: Deploy node_exporter across all managed hosts https://review.opendev.org/c/opendev/system-config/+/980994 | 16:40 | |
| @clarkb:matrix.org | fungi: this is interesting. Apache on static03 reports fewer than 1k requests currently in process. It also reports that there are 0 idle workers. However we seem to have a full complement of mpm child processes implying that there should be a few thousand idle workers? | 16:53 |
| @clarkb:matrix.org | many of these processes are actually quite old. I wonder if something is keeping them around when we should be freeing them up | 16:54 |
| @clarkb:matrix.org | looking at the old pid I have straced it and it is sitting in an epoll not doing much | 16:56 |
| @clarkb:matrix.org | looking at its fds in /proc there are a number of things you'd expect (log files etc). But there are a few network connections and ss -np | grep pid shows they are all in CLOSE-WAIT | 16:56 |
| @clarkb:matrix.org | I wonder if the slowness here is that we are not freeing up these processes so that they can be refreshed providing new workers to the system due to tcp connections that won't die | 16:58 |
| @clarkb:matrix.org | this would explain why setting the max clients value to a lower number hurt rather than helped things | 16:58 |
| @clarkb:matrix.org | all that to say maybe the more problematic aspect of this ddos is clients not properly shutting down tcp keeping processes around (eating memory) but not processing subsequent requests | 16:59 |
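Roughly the inspection described above; the pid placeholder is whichever long-lived apache child is being examined:

```
# oldest apache children by elapsed running time
ps -C apache2 -o pid,etimes,cmd --sort=-etimes | head

# the sockets that child still holds, and which of them are stuck in CLOSE-WAIT
sudo ls -l /proc/<pid>/fd | grep socket
sudo ss -tnp state close-wait | grep 'pid=<pid>'
```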
| @fungicide:matrix.org | Clark: apache will spin down workers if they sit idle for too long | 17:00 |
| @clarkb:matrix.org | fungi: these are more than 12 hours old | 17:02 |
| @clarkb:matrix.org | I guess close wait means the remote requested a shutdown and it is apache that is failing to close the connection? | 17:02 |
| @clarkb:matrix.org | I guess this could be fallout of general performance issues related to being overloaded preventing apache from getting around to closing those connections and subsequently stopping the worker process? | 17:04 |
| @clarkb:matrix.org | I think if we restart apache right now docs.openstack.org will be happy again as we'd free up 75% of our capacity | 17:04 |
| @fungicide:matrix.org | possible | 17:04 |
| @clarkb:matrix.org | but maybe a bigger host doesn't get into this situation in the first place? | 17:05 |
| @fungicide:matrix.org | that's my main hope yes | 17:05 |
| @fungicide:matrix.org | repeatedly restarting apache to free up process memory isn't a great use of our time | 17:05 |
| @fungicide:matrix.org | infra-prod-service-static is in progress for the static04 deploy | 17:07 |
| @clarkb:matrix.org | I wonder if gracefulshutdowntimeout applies to mpm worker rotation or if that is just for apachectl initiated graceful shutdown requests | 17:08 |
| @fungicide:matrix.org | i would need to read up on it when not in a meeting | 17:08 |
| @clarkb:matrix.org | https://httpd.apache.org/docs/2.4/stopping.html#gracefulstop doesn't have a lot of indepth detail like that | 17:09 |
| @fungicide:matrix.org | infra-prod-service-gitea deploy failed again, if it's apt lockfile collision on the same backend again we may need a manual resolution | 17:13 |
| @clarkb:matrix.org | that won't prevent static from deploying at least | 17:14 |
| @fungicide:matrix.org | right | 17:14 |
| @fungicide:matrix.org | just noting it for later followup | 17:14 |
| @clarkb:matrix.org | apache's bugzilla is now only accessible to authenticated users due to an increase in abuse and AI scraping | 17:15 |
| @mnasiadka:matrix.org | Locking down the internet from abusers has started | 17:16 |
| @fungicide:matrix.org | formerly public information is becoming less and less public due to abuse | 17:17 |
| @fungicide:matrix.org | with `104.130.246.43 docs.openstack.org` overridden in my /etc/hosts i'm able to browse around docs.openstack.org through static04 now, so i'll update the dns record for the site to move its traffic over | 17:20 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/980993 | 17:20 | |
| @fungicide:matrix.org | and done | 17:20 |
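For reference, the override used for that pre-DNS spot check, plus a one-off variant that avoids editing /etc/hosts:

```
# /etc/hosts entry pointing the local browser at static04 before the DNS change
104.130.246.43 docs.openstack.org

# equivalent check without touching /etc/hosts
curl -sI --resolve docs.openstack.org:443:104.130.246.43 https://docs.openstack.org/
```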
| @clarkb:matrix.org | looking at server-status on static03, many, many of them are in a gracefully finishing state | 17:23 |
| @clarkb:matrix.org | and the vast majority of the event workers are in a stopping state (which is what I expected after my external digging) | 17:24 |
| @fungicide:matrix.org | #status log Further server resources have been deployed for the docs.openstack.org site, which should help relieve even more of the recent load increase on all of our static site hosting | 17:27 |
| @status:opendev.org | @fungicide:matrix.org: finished logging | 17:27 |
| @clarkb:matrix.org | https://github.com/owasp-modsecurity/ModSecurity/issues/2822 I don't know yet if this is an issue, but could explain it if so | 17:29 |
| @clarkb:matrix.org | no I can't find evidence we set SecAuditLog to anything | 17:30 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/zone-opendev.org] 981006: Move static hosting to static03 https://review.opendev.org/c/opendev/zone-opendev.org/+/981006 | 17:30 | |
| @mnaser:matrix.org | https://review.opendev.org/c/opendev/system-config/+/980994 - node_exporter as a follow up deployed everywhere by default, protected via iptables like snmp is for cacti, and node scraping config generated automatically :) | 17:31 |
| @mnaser:matrix.org | and the logic is mostly simplified because the prometheus community has a collection that has most of the magic :) | 17:32 |
| @clarkb:matrix.org | mnaser: cool and I think the inventory manipulation we do in test jobs should have the nodes in the test environment participating in the node exporter setup too so it should be tested. Re the collection it isn't clear to me how they are running the node exporter. Which may be important if some of our older nodes cannot do it or need older versions etc | 17:32 |
| @mnaser:matrix.org | i purposely picked the collection because it flat out just templated out systemd units and downloads the binaries directly from github | 17:33 |
| @mnaser:matrix.org | so no matter the os and how old it is, we should be in the clear | 17:33 |
| @mnaser:matrix.org | and in the ara report it showed that it was deployed on prometheus itself and the bridge :) | 17:33 |
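A rough picture of the stage-2 wiring being described, with hypothetical targets; in practice the target list would be generated from the ansible inventory rather than written by hand:

```
# prometheus.yml fragment (sketch): scrape node_exporter on managed hosts
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 'static03.opendev.org:9100'   # placeholder targets
          - 'static04.opendev.org:9100'
```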
| @clarkb:matrix.org | fungi: ok I think GracefulShutdownTimeout should help: https://httpd.apache.org/docs/2.4/mod/mpm_common.html#gracefulshutdowntimeout | 17:37 |
| @clarkb:matrix.org | (I mean it is possible that this creates new issues, but maybe we set the timeout to some reasonable value?) | 17:37 |
| @clarkb:matrix.org | its still possible that it won't process these as graceful shutdowns since the initiation is from the wrong source, but it is part of the mpm common configuration and server-status shows things "gracefully finishing" | 17:39 |
| @clarkb:matrix.org | in any case I think it is worth attempting an update for that. I'm working on a change now | 17:41 |
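For reference, the directive in question; its default of 0 means a gracefully stopping child waits indefinitely for its connections to drain, and the open question above is whether a finite value also bounds per-child graceful stops rather than only whole-server graceful restarts (the number here is illustrative, not the value in the change):

```
# apache mpm fragment (sketch): illustrative timeout, in seconds
<IfModule mpm_event_module>
    GracefulShutdownTimeout 300
</IfModule>
```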
| -@gerrit:opendev.org- Michal Nasiadka proposed wip: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 17:41 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed wip: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 17:42 | |
| -@gerrit:opendev.org- Michal Nasiadka marked as active: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 17:42 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981009: Apply GracefulShutdownTimeout to static apache tuning https://review.opendev.org/c/opendev/system-config/+/981009 | 17:46 | |
| @clarkb:matrix.org | fungi: it was gitea13 that failed to grab the dpkg lock | 17:47 |
| @clarkb:matrix.org | looks like the `/usr/lib/apt/apt.systemd.daily update` run from yesterday may be holding the lock | 17:48 |
| @clarkb:matrix.org | that is just doing an index update I think. Is it safe to kill the process? | 17:49 |
| @clarkb:matrix.org | mnaser: fwiw I think the general shape of those changes is about what I expected. But I haven't been able to focus and do a proper review. I think the things I want to understand better when I dig in are how storage is managed and then the underlying details of the ansible collection used to manage node exporter. Hopefully I can do that sort of focused review tomorrow? Today is my long day of meetings and the ongoing firefighting so doubt I will get to it today | 17:55 |
| @mnaser:matrix.org | no worries, just keep me posted and happy to ask for any questions | 17:55 |
| @mnaser:matrix.org | answer for any questions, not ask =) | 17:55 |
| @fungicide:matrix.org | Clark: yes, if you kill the process and can do a `sudo apt update` cleanly it should be fine after that | 17:56 |
| @clarkb:matrix.org | fungi: ok I did that and it immediately started a new process that did an update, and now an install | 17:57 |
| @clarkb:matrix.org | ok that completed so I'm running `sudo apt update` now just to be double sure | 17:58 |
| @fungicide:matrix.org | that sounds basically like what i saw on the osuosl mirror yesterday | 17:59 |
| @fungicide:matrix.org | though it had been hung since friday | 17:59 |
| @fungicide:matrix.org | i think the parent ansible script ends up retrying when you kill the apt process | 18:00 |
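A sketch of the kind of cleanup done here; the pid placeholder is whatever process fuser reports as holding the lock:

```
# find who holds the apt/dpkg locks
sudo fuser -v /var/lib/apt/lists/lock /var/lib/dpkg/lock-frontend

# after killing the stuck holder, confirm apt is clean again
sudo kill <pid>
sudo apt update
```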
| @fungicide:matrix.org | do we want to wait and test the GracefulShutdownTimeout change on static02 or go ahead and move the sites to the now-empty static03 with 981006 now? | 18:11 |
| @jim:acmegating.com | gracefulshutdown sounds promising and should be a quick test to do on static02. i think the info gathering is worth it. | 18:12 |
| @fungicide:matrix.org | static03 is down to 0 concurrent requests (or technically 1 when i check /server-status because it's handling my request) | 18:18 |
| @clarkb:matrix.org | ++ lets do the quick test | 18:20 |
| @clarkb:matrix.org | fungi: are you able to do that? I need to prep our infra meeting stuff and am in another meeting | 18:20 |
| @fungicide:matrix.org | sure, gimme a sec | 18:28 |
| @fungicide:matrix.org | though it may not be a useful test, thinking about it, because we'd need to restart apache when changing the config? | 18:29 |
| @fungicide:matrix.org | so we're not going to see these stuck connections hanging around past the restart regardless | 18:29 |
| @clarkb:matrix.org | correct. This is a set and wait for a day situation I think | 18:32 |
| @clarkb:matrix.org | though we might be able to see general trends in server-status | 18:33 |
| @fungicide:matrix.org | so i guess the question is really whether we want to wait a day before moving the other sites to static03 in order to give 981009 a chance to percolate | 18:34 |
| @clarkb:matrix.org | I guess moving to 03 gives us the best chance of happiness later so maybe we start there? | 18:35 |
| @clarkb:matrix.org | and incorporate the mpm event config update as we go | 18:35 |
| @jim:acmegating.com | you don't think we'll be able to see the graceful terminations quickly? | 18:36 |
| @clarkb:matrix.org | I think we will but not at the scale where its easy to tell if it helped or not | 18:38 |
| @clarkb:matrix.org | static02 currently has processes from just after 00:00 UTC today and that requires time to see. But if we don't see large numbers of G connections in an hour that is a good signal, then we check again tomorrow | 18:38 |
| @fungicide:matrix.org | got it, and it would be better to test that on static02 because its struggles with memory pressure might lead to more hung requests, i guess? | 18:40 |
| @jim:acmegating.com | yeah, this strikes me as a situation with two non-independent variables :) | 18:41 |
| @fungicide:matrix.org | so in that case, should we go ahead and merge 981009 to set the GracefulShutdownTimeout on all the static servers, or stick static02 back in the emergency file so ansible doesn't unwind manual application there? | 18:43 |
| @fungicide:matrix.org | i'll wip 981006 in the meantime | 18:44 |
| -@gerrit:opendev.org- Zuul merged on behalf of Michal Nasiadka: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/980856 | 19:34 | |
| @fungicide:matrix.org | it looks like some of the bots didn't get the memo about the docs.openstack.org dns change yet, there's a new surge hitting it on static03 | 19:39 |
| @jim:acmegating.com | are the bots written in java? | 19:40 |
| @fungicide:matrix.org | it had died down completely, but now has come back fairly hard | 19:40 |
| @fungicide:matrix.org | while also hitting static04 as intended, so i guess some of them are holding onto old resolution | 19:40 |
| @clarkb:matrix.org | DNS cache ttl: forever | 19:42 |
| @fungicide:matrix.org | kinda | 19:42 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 981009: Apply GracefulShutdownTimeout to static apache tuning https://review.opendev.org/c/opendev/system-config/+/981009 | 19:59 | |
| @clarkb:matrix.org | it has only been about half an hour but the server status on static02 doesn't have any G's in it | 20:31 |
| @clarkb:matrix.org | 04 has some G states | 20:33 |
| @clarkb:matrix.org | ok and all of those G states on 04 have gone away for the moment | 20:35 |
| @clarkb:matrix.org | as mentioned previously probably too early to say one way or another if this is helping, but it does look like the problem hasn't manifested yet | 20:36 |
| @fungicide:matrix.org | yeah, will probably have a better idea when we look at the servers tomorrow | 20:37 |
| @clarkb:matrix.org | now on static02 668163 is stopping and has 4 close wait tcp connections according to ss | 20:39 |
| @clarkb:matrix.org | now down to three | 20:39 |
| @fungicide:matrix.org | seems like it's working at least | 20:40 |
| @clarkb:matrix.org | though looks like we've been stuck at 3 for more than 5 minutes now. But maybe it isn't processing them every second. | 20:44 |
| @clarkb:matrix.org | we'll have to see how much longer this particular process hangs around and if the tcp connection count changes further. I do also wonder if we might get slightly different behavior between jammy and noble | 20:45 |
| @clarkb:matrix.org | I just saw static04 clear out another set of G connections and then the process went away. So in general things seem to be working. TBD if this is sufficient to keep ahead of it | 20:47 |
| @clarkb:matrix.org | that one process is hanging around on 02, but 04 looks like it is clearing them out consistently | 21:06 |
| @clarkb:matrix.org | ok that process finally disappeared on 02 | 21:16 |
| @fungicide:matrix.org | good sign it's probably working, at least | 21:20 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981042: Purge eavesdrop01 and refstack01 backups on the smaller backup server https://review.opendev.org/c/opendev/system-config/+/981042 | 21:21 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 893571: DNM Forced fail on Gerrit to test 3.12 and 3.13 upgrades and downgrades https://review.opendev.org/c/opendev/system-config/+/893571 | 21:31 | |
| @clarkb:matrix.org | ok I've got autoholds in place for ^ so that when I'm ready to dig into gerrit upgrade testing I don't need to wait | 21:32 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/980993 | 21:38 | |
| @clarkb:matrix.org | The problem with ^ was the lack of a listen directive for what is now the new backend. I thought it was the waf testing failing but we actually only do waf testing on the docs.opendev.org vhost which this change doesn't touch, so in theory we can have both things sort of side by side as vhosts and verify it works as expected in testing | 21:48 |
| -@gerrit:opendev.org- Zuul merged on behalf of Joaci Otaviano de Morais: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/978113 | 22:04 | |
| @clarkb:matrix.org | It definitely seems like the newer 04 host is much happier. Whether that is simply due to it being larger or maybe due to improvements to apache in noble I can't say. Assuming that things hold up overnight, I think what we should plan to do is move the other vhosts to 03 since it too is larger and is also noble. Then monitor that. Then maybe we even consider loading up 04 with more vhosts? I was thinking that having all the openstack vhosts on the one larger node may make sense since they seem to get hit the hardest | 22:04 |
| -@gerrit:opendev.org- Zuul merged on behalf of Hediberto Cavalcante da Silva: [openstack/project-config] 970439: Add LVM CSI App to StarlingX https://review.opendev.org/c/openstack/project-config/+/970439 | 22:05 | |
| @jim:acmegating.com | sounds like a plan | 22:06 |
| @fungicide:matrix.org | count me in | 22:07 |
| @fungicide:matrix.org | since everything seems to have actually settled down, i'm going to disappear for the evening and check it all when i wake up | 22:37 |
| @clarkb:matrix.org | Enjoy! | 22:50 |
| @clarkb:matrix.org | Apparently some of the free threading python devs hope that by 3.16 or 3.17 they will stop having a GIL'd build entirely | 23:05 |
| @clarkb:matrix.org | This is not something I want to worry about now but I guess if that is the direction we're headed then being thread safe is important | 23:05 |