Tuesday, 2026-03-17

@fungicide:matrix.orgit may be a little slow initially here as its afs client cache warms up00:02
@fungicide:matrix.orgalso it doesn't know about all the bad client addresses yet, so has to block them all again as they start to hit the tripwires00:03
@fungicide:matrix.orgspeaking of which, it would be good to get 980812 merged now that lp ppas aren't timing out, i'll approve it00:04
@fungicide:matrix.organd for avoidance of doubt, the problem botnet did follow the dns update near-instantly00:08
@fungicide:matrix.orgwhich has currently taken up all the available worker slots, and probably will for a while until enough of them manage to block themselves00:09
@fungicide:matrix.orgunrelated, infra-prod-service-gitea failed in that deploy buildset, looks like a `/var/lib/apt/lists/lock` conflict on gitea13, probably just a temporary collision with unattended-upgrades00:12
@fungicide:matrix.orgsince all the docs.openstack.org load has shifted off of static02, i'm going to restart apache there again and see if it gets any better once it sheds some of the extra memory baggage the workers may still be carrying00:17
@fungicide:matrix.orgnf_conntrack_count numbers seem to have shifted accordingly, 9920 on static02 (many of which may be stale) and 6369 on static0300:18
@fungicide:matrix.orglooks like things may have settled out, i'm getting near-instant responses from apache on both servers, static02 claims to be processing around 160 concurrent requests now and static03 around 85000:20
@fungicide:matrix.orgso as expected, the bulk of the requests followed docs.openstack.org when it moved00:21
@fungicide:matrix.orgstill climbing on both, with static03 far outpacing 0200:21
@fungicide:matrix.org03 has climbed to around 1100 concurrent requests and still hasn't dipped into swap00:22
@fungicide:matrix.org02 is now around 200 requests and not really using swap yet either00:23
@mnaser:matrix.orgsounds like a good time to take a small break (hopefully not) in case stuff gets poopy again00:23
@fungicide:matrix.orgi want to keep an eye on it for a little longer until it seems to have (hopefully) reached a steady state00:23
@fungicide:matrix.orgbut yeah it's getting late on my end of the rock00:24
@clarkb:matrix.orgThank you for taking care of that. It is good info to know that the majority of the requests followed docs.openstack.org. we may be able to take advantage of that (perhaps making the dedicated server for that vhost permanent)00:26
@fungicide:matrix.orgwell, now concurrent requests on static02 have risen to 712 and on static03 they've fallen to 753 which may indicate that 03 is performing better and serving up responses faster00:29
@fungicide:matrix.orgwhat we could do tomorrow is build a static04 to replace static0200:30
@clarkb:matrix.org++ and maybe start considering a load balancer front end00:31
@fungicide:matrix.orgyeah, would be easy enough to replicate what we do for gitea-lb and zuul-lb00:31
@clarkb:matrix.orgexactly. The main consideration wtih doing that is it will impact our waf rules which are ip based00:31
@clarkb:matrix.orgthey would need to look at x forwarded for instead or something along those lines. So we can't just slap it in front of them without some extra work00:32
@fungicide:matrix.orgohhhh, right, we'd need to look into the forwarded-for protocol thing haproxy adds00:32
@clarkb:matrix.orgBut even just a second larger server is likely to help while we figure that out00:32
@mnaser:matrix.orgif you dont wanna complicate life too much with that, proxyv2 + tcp load balancing can also do the trick00:32
@fungicide:matrix.orgx-forwarded-for won't be available unless we do https termination on the lb, but there is that http socket protocol thing that's outside the encrypted payload00:33
@mnaser:matrix.orghttps://www.haproxy.com/documentation/haproxy-configuration-tutorials/proxying-essentials/client-ip-preservation/enable-proxy-protocol/00:33
@fungicide:matrix.orgyes, that, thanks!00:33
@fungicide:matrix.orgso the question is whether apache mod_proxy can make use of that for tracking and blocking client addresses00:34
@clarkb:matrix.orglooks like mod remoteip is the apache side support for that00:34
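The haproxy-plus-mod_remoteip arrangement being discussed could look roughly like this (a sketch assuming plain TCP passthrough on the LB; `send-proxy-v2` and `RemoteIPProxyProtocol` are real directives, but the backend name and values here are illustrative, not our config):

```
# haproxy side: TCP mode passthrough, PROXY protocol v2 toward the backend
backend static
    mode tcp
    server static03 static03.opendev.org:443 send-proxy-v2

# Apache side (mod_remoteip, 2.4.31+): accept PROXY protocol and rewrite
# the client address so mod_security sees the real IP, not the LB's
<VirtualHost *:443>
    RemoteIPProxyProtocol On
</VirtualHost>
```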
@fungicide:matrix.orgsounds like it would work in that case00:34
@fungicide:matrix.orger, not apache mod_proxy of course, i meant mod_security00:34
@fungicide:matrix.orgmy brain is turning to mush about now00:34
@clarkb:matrix.orgthings do look much happier now and we can pick this up in the morning (amongst all the meetings). Thanks again!00:35
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 980812: Tripwire the 10 most commonly requested docs 302s https://review.opendev.org/c/opendev/system-config/+/98081200:35
@clarkb:matrix.orgI'll get the meeting agenda out in a bit00:35
@clarkb:matrix.orgwhich includes this topic!00:35
@fungicide:matrix.orgi'm mainly worried about the climbing concurrent connections on static02, though it seems to have maybe leveled off around 1800 and change now00:36
@fungicide:matrix.orgnope, climbing again00:36
@fungicide:matrix.orgit's still not to 50% of the configured worker slots though00:37
@fungicide:matrix.orgokay, now it is00:37
@fungicide:matrix.organd swap use is starting to grow on static02 as well00:37
@mnaser:matrix.orgstatic02 is not hosting docs anymore, so they're hitting something else?00:38
@fungicide:matrix.orgyes, one or more of the other 32 sites we host on that server00:38
@fungicide:matrix.orgi did see governance.openstack.org was getting crawled a lot, for example00:38
@clarkb:matrix.orgThey've all been getting crawled. But yes some more than others00:39
@mnaser:matrix.orgmaybe this is not the most popular opinion but since its getting much more late and it seems like a never ending battle, maybe since openstack.org is hosted at CF it might be worth toggling cf on overnight just for things to hum along and revisit this00:39
@clarkb:matrix.orgI don't know that we have access to do that or budget approval. But also I don't think handing cf all of our traffic is something we should do before going to bed 00:41
@clarkb:matrix.orgIt's likely to cause more problems imo 00:41
@tony.breeds:matrix.orgI know I've been basically absent for months but I have time to monitor this.   Not that I want to give CF all our traffic 00:42
@mnaser:matrix.orgusually cf is free (or probably enabling it to a subdomain if the zone already has cf costs nothing) -- but if tony is here then thats breathing room :)00:42
@clarkb:matrix.orgNeither tony nor I even have access to the DNS records00:42
@clarkb:matrix.orgOnly fungi does00:42
@clarkb:matrix.orgSo we that's the before bed time frame I'm operating on00:43
@clarkb:matrix.org* So that's the before bed time frame I'm operating on00:43
@clarkb:matrix.orgI literally cannot do what you suggest and things seem somewhat improved right now particularly for docs.openstack.org00:44
@clarkb:matrix.orgMy preference would be to monitor then take what we learn tomorrow and continue to improve00:44
@fungicide:matrix.orgyeah, at the moment static02 still seems snappy, so may not degrade00:45
@fungicide:matrix.orgbut static03 is doing amazingly well, which i expect is related to the lack of memory pressure, so hopefully tomorrow moving everything else besides docs.openstack.org to another larger vm will address that00:47
@fungicide:matrix.orgstatic02 is currently handling 29.9 requests/sec while static03 is managing 62.2 requests/sec00:49
@fungicide:matrix.organd yet the concurrent requests on static02 is up at 3050 vs 88 on static0300:50
@fungicide:matrix.orgsuggesting that static02 is serving half as many requests and yet not keeping up00:51
@mnaser:matrix.orgi was playing with nginx in system-config but hit a wall of projects being able to use htaccess :(00:56
@fungicide:matrix.orgtony.breeds: so the thing to be on the lookout for is that static02 may run out of memory and then the oom killer will probably target the parent apache supervisory process, in which case ssh into the server and `sudo systemctl start apache2`00:56
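A minimal check for that failure mode might look like the following sketch; the OOM-killer log pattern and the function name are assumptions, so verify against real `dmesg` output before relying on it:

```shell
# apache_oom_hit: succeed if the kernel log text on stdin shows the
# OOM killer taking out an apache2 process (pattern is an assumption
# about the kernel's "Out of memory: Killed process" message format).
apache_oom_hit() {
    grep -q 'Out of memory: Killed process .*apache2'
}

# Live usage on the server would be something like:
#   dmesg | apache_oom_hit && sudo systemctl start apache2
```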
@clarkb:matrix.orgRe CF concerns I have with a quick not planned transition (separate from my concerns with CF as a service): A) our waf rules will likely block CF servers if they let those requests through and B) only a subset of domains can be hosted by CF. If the ddos is sufficiently bad on the other vhosts we'll break anyway01:01
@clarkb:matrix.orgAny move like that should be thought through and planned and monitored 01:01
@clarkb:matrix.orgThis is what I mean by likely to create more problems01:01
@fungicide:matrix.orgright, since every one of the problem requests is different, basic cdn services are of no real help in dealing with it. we'd essentially be pinning our hopes on cf's bot classifier identifying the problem requests in advance and never passing them to our backend01:03
@fungicide:matrix.orgsince otherwise every single one will be a cold cache hit01:04
@mnaser:matrix.orgwe can discuss this further too in the right time, but we can also rely on cf to fully cache certain paths in their edge and prevent the hit on our side entirely, but yes there is a bunch of stuff to consider01:04
@fungicide:matrix.orgthat said, their bot classifier is probably pretty good at spotting this attack if it's as widespread as i think it is01:04
@clarkb:matrix.orgfor example I imagine we would switch something like docs-cf.openstack.org first and see how it does. Or maybe do an A/B setup if their DNS supports that01:05
@tony.breeds:matrix.orgfungi: thanks.   I'll keep an eye out01:05
@fungicide:matrix.orgnot being personally familiar with cf's full caching option, i guess you have to tell them all the pages you want them to cache in advance? which would be a nontrivial thing to map out given the current state of redirects on these sites01:06
@clarkb:matrix.organyway not something I just want to apply monday night and hope for the best. I really want to avoid it in the first place but if we're making that transition and giving up on the Internet then I want to do it right01:06
@mnaser:matrix.orgyeah i am not a fan of it but its more to buy time than anything, but yea i get it01:06
@clarkb:matrix.organd if we're giving up on the Internet I think that poses a bunch of other followup questions I don't want to invite on a monday night either01:07
@fungicide:matrix.orgbecause at least for the present situation, the goal would be to have cf not allow the requests through, so it would need a complete picture of what requests are valid ahead of time01:07
@mnaser:matrix.orgit would be nice if we had things like a node_exporter/prometheus/some logs thing that folks can help look at01:07
@mnaser:matrix.orgright now i can only just throw blind suggestions01:08
@fungicide:matrix.orgwhere "valid" means doesn't end up redirecting to the same content01:08
@fungicide:matrix.orgbecause of all these programmatically generated redirect match patterns for docs.openstack.org01:08
@clarkb:matrix.orgwe did, but then had to block it off from public access. There is a spec up to deploy a prometheus to replace it but as noted in IRC we've been busy and not gotten to it. mnasiadka expressed interest in it though01:08
@fungicide:matrix.orgyes, basically nobody has shown up with the time and interest to set up a new monitoring system that would replace the old one we had to block because we didn't have time to keep limping it along01:09
@mnaser:matrix.orgill play around pushing something today to add prometheus as stage 1, and then stage 2 is rolling out node exporter and having prometheus scrape the nodes01:11
@mnaser:matrix.orgmaybe apache exporter would be nice as well to grab metrics from apache01:11
@clarkb:matrix.orgyes I think the spec is focused on node level info first, but leaves the door open for additional sources. Zuul, gitea, gerrit, all have support too01:14
@fungicide:matrix.orgcool! at least now you can see the spec page again ;)01:14
@fungicide:matrix.organyway, i'm going to call it a night and then pick this up again in the morning in between meetings01:14
@fungicide:matrix.orgthanks everyone!01:14
@clarkb:matrix.orgsee you tomorrow01:15
-@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/98084001:35
@mnaser:matrix.orgim not sure where the dns records come from but i guess that cant come until a vm is setup for this01:37
@clarkb:matrix.orgyes have to boot something first to have that info. But then it goes in https://opendev.org/opendev/zone-opendev.org/src/branch/master/zones/opendev.org/zone.db01:39
-@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/98084001:43
-@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/98084002:06
@mnaser:matrix.orgsomething's off... openstack-tox-pep8 has been queued for 115 hours apparently for a kolla-ansible job02:12
@mnaser:matrix.organd even tho dashboard reports 0/0 events, my system-config change has been sitting with no jobs loaded for 7 minutes02:13
@clarkb:matrix.orgIt's the daily periodic job thundering herd02:15
@clarkb:matrix.orgThat doesn't explain the kolla pep8 job but does explain slowness in getting nodes for current things02:15
@mnaser:matrix.orgah, my change just had got the jobs show up in waiting status now02:16
@clarkb:matrix.orgWe time the herd for this time of day as general load is lower after North America ends its work day. Not zero but less than EU and NA working hours02:17
-@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/98084003:34
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/98085107:06
@mnasiadka:matrix.orgClark: I posted ^^ because I noticed that on OpenMetal mirror02 we're getting duplicate repo entries, so maybe it's time to switch to deb822 :)07:24
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/98085107:25
@harbott.osism.tech:regio.chatdocs.o.o is unusably slow for me again, but at least it did not oom. not sure whether we should do restarts until the situation improves?07:33
-@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/98084007:42
@mnasiadka:matrix.orgmnaser: ^^ took the liberty to push this a little forward07:42
@mnasiadka:matrix.orgJens Harbott: It looks like it's just at the workers limit (based on the log message "[Tue Mar 17 06:52:37.295160 2026] [mpm_event:error] [pid 131868:tid 124365246039936] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.")07:44
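For reference, AH03490 means the scoreboard slots (ServerLimit × ThreadsPerChild) ran out while MaxRequestWorkers still had headroom, so the fix is raising ServerLimit. A hedged mpm_event sketch with illustrative numbers, not our production tuning:

```
<IfModule mpm_event_module>
    # scoreboard slots = ServerLimit * ThreadsPerChild; AH03490 fires
    # when these are exhausted before MaxRequestWorkers is reached
    ServerLimit          64
    ThreadsPerChild      64
    ThreadLimit          64
    MaxRequestWorkers  4096
</IfModule>
```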
@tafkamax:matrix.orgRegarding monitoring. We are currently using prometheus, some exporters, promtail on the hosts and have two central monitoring hosts where all data can be queried. Loki handles the logs and thanos is the data retention backend that pushes to s3.07:47
@tafkamax:matrix.orgWe are upgrading it soon, with the poc finished, which uses prometheus and node exporter on the hosts, but for the "glue" we will try to use vector07:47
@tafkamax:matrix.orghttps://vector.dev/07:47
@tafkamax:matrix.orgI don't know if you have heard of it.07:48
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/98085108:16
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085608:27
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085608:33
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085608:36
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 980851: repos: Use sources.list.d on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/98085110:00
@fungicide:matrix.orgstarting to get going for the day, i see static02 is doing pretty well but static03 has basically all its apache workers busy now13:30
@fungicide:matrix.orgso docs.openstack.org is back to being slow, but all the other static sites are faring fine13:31
@mnaser:matrix.orgexcellent.   we _could_ add node exporter wiring to this but before i run too far forward it would be nice to get this deployed and validated first.13:34
@fungicide:matrix.orgseems apache wants more ram for buffers/cache, it's actively using 4gb ram and another 10gb for buffers/cache, but has also paged out an additional 4gb to swap13:35
@fungicide:matrix.orginstead of making another 15gb replacement for the sites on static02, we might think about moving docs.openstack.org to a server with 30gb ram13:36
-@gerrit:opendev.org- Joaci Otaviano de Morais proposed wip: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/97811314:04
-@gerrit:opendev.org- Joaci Otaviano de Morais marked as active: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/97811314:18
@fungicide:matrix.orgseeing signs of memory pressure off and on for static02 as well, so i'm leaning toward moving docs.openstack.org to a new 30gb static04, then shifting the remaining sites off the 8gb static02 to the 15gb static0314:30
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/zone-opendev.org] 980965: Add a static04.opendev.org server https://review.opendev.org/c/opendev/zone-opendev.org/+/98096514:49
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 980966: Add a static04 server https://review.opendev.org/c/opendev/system-config/+/98096614:49
@fungicide:matrix.orgin case we want it, that ^ exists now14:50
@fungicide:matrix.orgmy day is about to get more full of meetings, so wanted to knock it out beforehand to give us options14:51
@clarkb:matrix.orgthe system-config-run jobs allow you to validate it before anything merges and deploys properly15:00
@mnaser:matrix.orgok so im just trying to gauge a bar for getting this deployed, is it also including the scraping and addition of node exporter to the infrastructure?  prometheus is already tested with system-config-run right now15:01
@clarkb:matrix.orgI think my personal preference would be that we model the data ingestion too. Otherwise we may do a bunch of work to deploy something then have to start over after we do that15:02
@clarkb:matrix.orgthis is one of the major benefits of the way we set up zuul and our test jobs. We can see an end to end model of the final thing to have a good idea if it does what we expect15:03
@clarkb:matrix.org(it doesn't have to be in one change either to be clear)15:03
@clarkb:matrix.orgso we might still deploy to production incrementally but we can have a picture of the final state ahead of time15:03
@mnaser:matrix.orgshall i amend it all into one change or do i put the "add node exporter" stuff as a thing on top?15:04
@clarkb:matrix.orgI would probably do it as a followup change as that will make the deployment part easier15:04
@mnaser:matrix.orgyeah just to make the review more digestible15:05
@clarkb:matrix.orgfungi: I'm +2 on both changes but didn't approve as I'm in the morning long meeting gauntlet (that you are too)15:09
@clarkb:matrix.orgalso I didn't send our meeting agenda yesterday evening like I said I would. I will do that in a bit15:10
@clarkb:matrix.orgfungi: re static03 looking at system resources there is a fair bit of memory available (though it has dipped into swap for some reason) maybe we need to be increasing the apache mpm limits there?15:11
@clarkb:matrix.orgas an alternative or addition to the static04 boot15:11
@tafkamax:matrix.orgWhat is your swappiness level on the vms15:19
@tafkamax:matrix.orgImho i would leave it at 115:19
@tafkamax:matrix.orgOut of 10015:19
@clarkb:matrix.orgI don't think we're currently setting it so it is whatever ubuntu's cloud image sets by default. We did at one time (not sure if we still do) tune it for the test nodes when setting up swap on them. Could be helpful here too as well15:23
@tafkamax:matrix.orgWhen i enable swap i set it to 10 or lower always. Swap should be last possible resort imho. 15:24
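If we did pin that down, it would be a one-line sysctl drop-in; the value 10 is the one suggested above and the filename is hypothetical, neither reflects what is currently deployed:

```
# /etc/sysctl.d/60-swappiness.conf (hypothetical path)
vm.swappiness = 10
```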
@tafkamax:matrix.orgI am no swap expert, but in my use cases, I try to avoid it as much as possible. 15:24
@fungicide:matrix.orgit hasn't dipped "a bit" into swap. at the moment it's using 2/3 of its 15gb ram for buffers/cache and paged 5gb worth of allocations out to swap in order to get more buffers/cache space15:37
@fungicide:matrix.orgchances are this is mod_security chewing up a ton of memory for buffering the concurrent requests, but the memory pressure is slowing down response speed causing those connections to pile up, exacerbating the issue15:38
@fungicide:matrix.orgi'm hoping if we throw more memory at it, we'll hit a point where it doesn't tip over this way15:40
@clarkb:matrix.orgyes I have no objection to proceeding with static04 as proposed. Should we approve those changes? (Zone update has to go first)15:41
@fungicide:matrix.orgyeah, those changes should be non-impacting regardless, the magic happens when we update the cnames for the sites anyway15:42
@fungicide:matrix.orgi'm also going to take static02 out of the disable list for ansible now that we've no longer got a divergence between configuration on the server and in git15:44
@tkajinam:matrix.orgIt seems that docs.openstack.org is down again ?15:48
@tkajinam:matrix.orghm it just loads right after I posted this ....15:48
@tkajinam:matrix.org* hm it just loads right after I posted this .... (though it's still slow)15:48
@clarkb:matrix.orgits slow. The bigger server is helping, but hasn't been enough. We're going even bigger now15:48
@fungicide:matrix.orgalso separately, i'm getting a lot of timeouts and connection closed when browsing gitea too16:02
@fungicide:matrix.orgtkajinam: yeah, the current bot flood has pushed us to expand our site hosting resources by 6x so far16:03
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/zone-opendev.org] 980965: Add a static04.opendev.org server https://review.opendev.org/c/opendev/zone-opendev.org/+/98096516:03
@fungicide:matrix.orgdeploy for that is waiting behind the hourly jobs16:09
@fungicide:matrix.orgstatic04.opendev.org is resolving for me now, Clark do you want to approve 980966?16:15
@fungicide:matrix.orgthe deploy jobs have officially reported success now too16:17
@fungicide:matrix.orgi need to step away for a few to grab a belated shower and snack before my next meeting, but can get static04 verified and docs.openstack.org moved over to it after 98096616:19
@fungicide:matrix.orgin case someone wants to approve it in the meantime16:19
@clarkb:matrix.orgdone16:19
@fungicide:matrix.orgthanks!16:19
@fungicide:matrix.orgbbiab16:19
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 980966: Add a static04 server https://review.opendev.org/c/opendev/system-config/+/98096616:31
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/98099316:39
-@gerrit:opendev.org- Mohammed Naser proposed: [opendev/system-config] 980994: Deploy node_exporter across all managed hosts https://review.opendev.org/c/opendev/system-config/+/98099416:40
@clarkb:matrix.orgfungi: this is interesting. Apache on static03 reports fewer than 1k requests currently in process. It also reports that there are 0 idle workers. However we seem to have a full complement of mpm child processes implying that there should be a few thousand idle workers?16:53
@clarkb:matrix.orgmany of these processes are actually quite old. I wonder if something is keeping them around when we should be freeing them up16:54
@clarkb:matrix.orglooking at the old pid I have straced it and it is sitting in an epoll not doing much16:56
@clarkb:matrix.orglooking at its fds in /proc there are a number of things you'd expect (log files etc). But there are a few network connections and ss -np | grep pid shows they are all in CLOSE-WAIT16:56
@clarkb:matrix.orgI wonder if the slowness here is that we are not freeing up these processes so that they can be refreshed providing new workers to the system due to tcp connections that won't die16:58
@clarkb:matrix.orgthis would explain why setting the max clients value to a lower number hurt rather than helped things16:58
@clarkb:matrix.orgall that to say maybe the more problematic aspect of this ddos is clients not properly shutting down tcp keeping processes around (eating memory) but not processing subsequent requests16:59
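The CLOSE-WAIT check above can be turned into a small counter; this is a parsing sketch over `ss -nt`-style output, not tied to any one host:

```shell
# close_wait_count: count CLOSE-WAIT sockets in `ss -nt`-style output
# read from stdin (ss prints the state in the first column).
close_wait_count() {
    grep -c '^CLOSE-WAIT'
}

# On the server this would be run as:  ss -nt | close_wait_count
```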
@fungicide:matrix.orgClark: apache will spin down workers if they sit idle for too long17:00
@clarkb:matrix.orgfungi: these are more than 12 hours old17:02
@clarkb:matrix.orgI guess close wait means the remote requested a shutdown and it is apache that is failing to close the connection?17:02
@clarkb:matrix.orgI guess this could be fallout of general performance issues related to being overloaded preventing apache from getting around to closing those connections and subsequently stopping the worker process?17:04
@clarkb:matrix.orgI think if we restart apache right now docs.openstack.org will be happy again as we'd free up 75% of our capacity17:04
@fungicide:matrix.orgpossible17:04
@clarkb:matrix.orgbut maybe a bigger host doesn't get into this situation in the first place?17:05
@fungicide:matrix.orgthat's my main hope yes17:05
@fungicide:matrix.orgrepeatedly restarting apache to free up process memory isn't a great use of our time17:05
@fungicide:matrix.orginfra-prod-service-static is in progress for the static04 deploy17:07
@clarkb:matrix.orgI wonder if gracefulshutdowntimeout applies to mpm worker rotation or if that is just for apachectl initiated graceful shutdown requests17:08
@fungicide:matrix.orgi would need to read up on it when not in a meeting17:08
@clarkb:matrix.orghttps://httpd.apache.org/docs/2.4/stopping.html#gracefulstop doesn't have a lot of indepth detail like that17:09
@fungicide:matrix.orginfra-prod-service-gitea deploy failed again, if it's apt lockfile collision on the same backend again we may need a manual resolution17:13
@clarkb:matrix.orgthat won't prevent static from deploying at least17:14
@fungicide:matrix.orgright17:14
@fungicide:matrix.orgjust noting it for later followup17:14
@clarkb:matrix.orgapache's bugzilla is now only accessible to authenticated users due to an increase in abuse and AI scraping17:15
@mnasiadka:matrix.orgLocking down the internet from abusers has started17:16
@fungicide:matrix.orgformerly public information is becoming less and less public due to abuse17:17
@fungicide:matrix.orgwith `104.130.246.43 docs.openstack.org` overridden in my /etc/hosts i'm able to browse around docs.openstack.org through static04 now, so i'll update the dns record for the site to move its traffic over17:20
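For anyone repeating that pre-cutover test, the /etc/hosts entry looks like this, and curl's `--resolve` flag is a one-shot alternative that avoids editing the file (the IP is the one quoted above):

```
# /etc/hosts entry used to test static04 before the DNS change:
104.130.246.43 docs.openstack.org

# one-shot equivalent without touching /etc/hosts:
#   curl --resolve docs.openstack.org:443:104.130.246.43 \
#        https://docs.openstack.org/
```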
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/98099317:20
@fungicide:matrix.organd done17:20
@clarkb:matrix.orglooking at server-status on static03 many many are in a gracefully finishing state17:23
@clarkb:matrix.organd the vast majority of the event workers are in a stopping state (which is what I expected after my external digging)17:24
@fungicide:matrix.org#status log Further server resources have been deployed for the docs.openstack.org site, which should help relieve even more of the recent load increase on all of our static site hosting17:27
@status:opendev.org@fungicide:matrix.org: finished logging17:27
@clarkb:matrix.orghttps://github.com/owasp-modsecurity/ModSecurity/issues/2822 I don't know yet if this is an issue, but could explain it if so17:29
@clarkb:matrix.orgno I can't find evidence we set SecAuditLog to anything17:30
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/zone-opendev.org] 981006: Move static hosting to static03 https://review.opendev.org/c/opendev/zone-opendev.org/+/98100617:30
@mnaser:matrix.orghttps://review.opendev.org/c/opendev/system-config/+/980994 node_exporter as a follow up deployed everywhere by default, protected via iptables like snmp is for cacti, and node scraping config generated automatically :)17:31
@mnaser:matrix.organd the logic is mostly simplified because the prometheus community has a collection that has most of the magic :)17:32
@clarkb:matrix.orgmnaser: cool and I think the inventory manipulation we do in test jobs should have the nodes in the test environment participating in the node exporter setup too so it should be tested. Re the collection it isn't clear to me how they are running the node exporter. Which may be important if some of our older nodes cannot do it or need older versions etc17:32
@mnaser:matrix.orgi purposely picked the collection because it flat out just templated out systemd units and downloads the binaries directly from github17:33
@mnaser:matrix.orgso no matter the os and how old it is, we should be in the clear17:33
@mnaser:matrix.organd in the ara report it showed that it was deployed on prometheus itself and the bridge :)17:33
@clarkb:matrix.orgfungi: ok I think GracefulShutdownTimeout should help: https://httpd.apache.org/docs/2.4/mod/mpm_common.html#gracefulshutdowntimeout17:37
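The change under discussion would presumably end up as something like the fragment below; GracefulShutdownTimeout is a real mpm_common directive, but the 300-second value here is a guess, not a tested production number:

```
<IfModule mpm_event_module>
    # seconds to wait for gracefully-stopping children before forcing
    # them to exit (0, the default, means wait indefinitely)
    GracefulShutdownTimeout 300
</IfModule>
```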
@clarkb:matrix.org(I mean it is possible that this creates new issues, but maybe we set the timeout to some reasonable value?)17:37
@clarkb:matrix.orgits still possible that it won't process these as graceful shutdowns since the initiation is from the wrong source, but it is part of the mpm common configuration and server-status shows things "gracefully finishing"17:39
@clarkb:matrix.orgin any case I think it is worth attempting an update for that. I'm working on a change now17:41
-@gerrit:opendev.org- Michal Nasiadka proposed wip: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085617:41
-@gerrit:opendev.org- Michal Nasiadka proposed wip: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085617:42
-@gerrit:opendev.org- Michal Nasiadka marked as active: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085617:42
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981009: Apply GracefulShutdownTimeout to static apache tuning https://review.opendev.org/c/opendev/system-config/+/98100917:46
@clarkb:matrix.orgfungi: it was gitea13 that failed to grab the dpkg lock17:47
@clarkb:matrix.orglooks like the `/usr/lib/apt/apt.systemd.daily update` run from yesterday may be holding the lock17:48
@clarkb:matrix.orgthat is just doing an index update I think. Is it safe to kill the process?17:49
@clarkb:matrix.orgmnaser: fwiw I think the general shape of those changes is about what I expected, but I haven't been able to focus and do a proper review. The things I want to understand better when I dig in are how storage is managed and the underlying details of the ansible collection used to manage node exporter. Hopefully I can do that sort of focused review tomorrow? Today is my long day of meetings plus the ongoing firefighting, so I doubt I will get to it today17:55
@mnaser:matrix.orgno worries, just keep me posted, happy to answer any questions =)17:55
@fungicide:matrix.orgClark: yes, if you kill the process and can do a `sudo apt update` cleanly it should be fine after that17:56
@clarkb:matrix.orgfungi: ok I did that and it immediately started a new process that did an update, and now an install17:57
@clarkb:matrix.orgok that completed so I'm running `sudo apt update` now just to be double sure17:58
@fungicide:matrix.orgthat sounds basically like what i saw on the osuosl mirror yesterday17:59
@fungicide:matrix.orgthough it had been hung since friday17:59
@fungicide:matrix.orgi think the parent ansible script ends up retrying when you kill the apt process18:00
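For reference, the lock collision above is ordinary exclusive file locking: whoever opens and locks `/var/lib/apt/lists/lock` first wins, and the second apt invocation fails with a "could not get lock" error. A minimal sketch (apt actually uses fcntl/POSIX locks; flock is used here so the conflict shows up within one process, and a temp file stands in for apt's real lock file):

```python
import fcntl
import os
import tempfile

# Stand-in for /var/lib/apt/lists/lock (assumption: a temp file so
# this demo needs no root and touches no real apt state).
fd, path = tempfile.mkstemp()
os.close(fd)

holder = open(path, "w")             # e.g. the apt.systemd.daily run
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

contender = open(path, "w")          # e.g. our ansible-driven apt run
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked = False
except BlockingIOError:
    blocked = True                   # second locker loses, must retry

print(blocked)
```

Killing the holder releases the lock, which is why a clean `sudo apt update` afterwards confirms things are back to normal.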
@fungicide:matrix.orgdo we want to wait and test the GracefulShutdownTimeout change on static02 or go ahead and move the sites to the now-empty static03 with 981006 now?18:11
@jim:acmegating.comgracefulshutdown sounds promising and should be a quick test to do on static02.  i think the info gathering is worth it.18:12
@fungicide:matrix.orgstatic03 is down to 0 concurrent requests (or technically 1 when i check /server-status because it's handling my request)18:18
@clarkb:matrix.org++ lets do the quick test18:20
@clarkb:matrix.orgfungi: are you able to do that? I need to prep our infra meeting stuff and am in another meeting18:20
@fungicide:matrix.orgsure, gimme a sec18:28
@fungicide:matrix.orgthough it may not be a useful test, thinking about it, because we'd need to restart apache when changing the config?18:29
@fungicide:matrix.orgso we're not going to see these stuck connections hanging around past the restart regardless18:29
@clarkb:matrix.orgcorrect. This is a set and wait for a day situation I think18:32
@clarkb:matrix.orgthough we might be able to see general trends in server-status18:33
@fungicide:matrix.orgso i guess the question is really whether we want to wait a day before moving the other sites to static03 in order to give 981009 a chance to percolate18:34
@clarkb:matrix.orgI guess moving to 03 gives us the best chance of happiness later so maybe we start there?18:35
@clarkb:matrix.organd incorporate the mpm event config update as we go18:35
@jim:acmegating.comyou don't think we'll be able to see the graceful terminations quickly?18:36
@clarkb:matrix.orgI think we will but not at the scale where it's easy to tell if it helped or not18:38
@clarkb:matrix.orgstatic02 currently has processes from just after 00:00 UTC today and that requires time to see. But if we don't see large numbers of G connections in an hour that's a good signal, then we check again tomorrow18:38
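The "G" states being counted come from mod_status's scoreboard on /server-status; a quick sketch of tallying them from a scoreboard string (the sample scoreboard below is made up, not real server data):

```python
from collections import Counter

# mod_status scoreboard legend (subset): '_' waiting for connection,
# 'W' sending reply, 'K' keepalive, 'G' gracefully finishing,
# '.' open slot with no current process.
sample = "WW_G.KW_GGW..._W"   # made-up scoreboard, not real data

counts = Counter(sample)
print(counts["G"])   # workers stuck in graceful-finishing state
```

A large, persistent "G" count is the signature of the stuck-connection problem; seeing it drain after GracefulShutdownTimeout kicks in is the signal being watched for.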
@fungicide:matrix.orggot it, and it would be better to test that on static02 because its struggles with memory pressure might lead to more hung requests, i guess?18:40
@jim:acmegating.comyeah, this strikes me as a situation with two non-independent variables :)18:41
@fungicide:matrix.orgso in that case, should we go ahead and merge 981009 to set the GracefulShutdownTimeout on all the static servers, or stick static02 back in the emergency file so ansible doesn't unwind manual application there?18:43
@fungicide:matrix.orgi'll wip 981006 in the meantime18:44
-@gerrit:opendev.org- Zuul merged on behalf of Michal Nasiadka: [opendev/zone-opendev.org] 980856: Add mirror02.iad3.openmetal to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/98085619:34
@fungicide:matrix.orgit looks like some of the bots didn't get the memo about the docs.openstack.org dns change yet, there's a new surge hitting it on static0319:39
@jim:acmegating.comare the bots written in java?19:40
@fungicide:matrix.orgit had died down completely, but now has come back fairly hard19:40
@fungicide:matrix.orgwhile also hitting static04 as intended, so i guess some of them are holding onto old resolution19:40
@clarkb:matrix.orgDNS cache ttl: forever19:42
@fungicide:matrix.orgkinda19:42
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 981009: Apply GracefulShutdownTimeout to static apache tuning https://review.opendev.org/c/opendev/system-config/+/98100919:59
@clarkb:matrix.orgit has only been about half an hour but the server status on static02 doesn't have any G's in it20:31
@clarkb:matrix.org04 has some G states20:33
@clarkb:matrix.orgok and all of those G states on 04 have gone away for the moment20:35
@clarkb:matrix.orgas mentioned previously probably too early to say one way or another if this is helping, but it does look like the problem hasn't manifested yet20:36
@fungicide:matrix.orgyeah, will probably have a better idea when we look at the servers tomorrow20:37
@clarkb:matrix.orgnow on static02 668163 is stopping and has 4 close wait tcp connections according to ss20:39
@clarkb:matrix.orgnow down to three20:39
@fungicide:matrix.orgseems like it's working at least20:40
@clarkb:matrix.orgthough looks like we've been stuck at 3 for more than 5 minutes now. But maybe it isn't processing them every second.20:44
@clarkb:matrix.orgwe'll have to see how much longer this particular process hangs around and if the tcp connection count changes further. I do also wonder if we might get slightly different behavior between jammy and noble20:45
@clarkb:matrix.orgI just saw static04 clear out another set of G connections and then the process went away. So in general things seem to be working. TBD if this is sufficient to keep ahead of it20:47
@clarkb:matrix.orgthat one process is hanging around on 02, but 04 looks like it is clearing them out consistently21:06
@clarkb:matrix.orgok that process finally disappeared on 0221:16
@fungicide:matrix.orggood sign it's probably working, at least21:20
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981042: Purge eavesdrop01 and refstack01 backups on the smaller backup server https://review.opendev.org/c/opendev/system-config/+/98104221:21
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 893571: DNM Forced fail on Gerrit to test 3.12 and 3.13 upgrades and downgrades https://review.opendev.org/c/opendev/system-config/+/89357121:31
@clarkb:matrix.orgok I've got autoholds in place for ^ so that when I'm ready to dig into gerrit upgrade testing I don't need to wait21:32
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 980993: Start sketching out what an anubis deployment on static would look like https://review.opendev.org/c/opendev/system-config/+/98099321:38
@clarkb:matrix.orgThe problem with ^ was the lack of a Listen directive for what is now the new backend. I thought it was the waf testing failing, but we actually only do waf testing on the docs.opendev.org vhost, which this change doesn't touch, so in theory we can have both things side by side as vhosts and verify it works as expected in testing21:48
-@gerrit:opendev.org- Zuul merged on behalf of Joaci Otaviano de Morais: [openstack/project-config] 978113: Add Netapp Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/97811322:04
@clarkb:matrix.orgIt definitely seems like the newer 04 host is much happier. Whether that is simply due to it being larger or maybe due to improvements to apache in noble I can't say. Assuming things hold up overnight, I think what we should plan to do is move the other vhosts to 03, since it too is larger and also noble. Then monitor that. Then maybe we even consider loading up 04 with more vhosts? I was thinking that having all the openstack vhosts on the one larger node may make sense since they seem to get hit the hardest22:04
-@gerrit:opendev.org- Zuul merged on behalf of Hediberto Cavalcante da Silva: [openstack/project-config] 970439: Add LVM CSI App to StarlingX https://review.opendev.org/c/openstack/project-config/+/97043922:05
@jim:acmegating.comsounds like a plan22:06
@fungicide:matrix.orgcount me in22:07
@fungicide:matrix.orgsince everything seems to have actually settled down, i'm going to disappear for the evening and check it all when i wake up22:37
@clarkb:matrix.orgEnjoy!22:50
@clarkb:matrix.orgApparently some of the free threading python devs hope that by 3.16 or 3.17 they will stop having a GIL'd build entirely23:05
@clarkb:matrix.orgThis is not something I want to worry about now but I guess if that is the direction we're headed then being thread safe is important 23:05

Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!