Thursday, 2025-06-05

00:12 *** darmach6 is now known as darmach
01:06 <opendevreview> Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/949942
02:30 <opendevreview> Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/949942
02:32 <opendevreview> Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/949942
03:46 <opendevreview> Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/949942
05:40 <mnasiadka> Ramereth[m]: it seems you were able to fix it - it's now way better - thanks :)
05:59 <mnasiadka> morning
10:51 <opendevreview> Elod Illes proposed openstack/project-config master: Allow late EOL tagging of trailing projects  https://review.opendev.org/c/openstack/project-config/+/951842
11:08 <opendevreview> Elod Illes proposed openstack/project-config master: Allow late EOL tagging of trailing projects  https://review.opendev.org/c/openstack/project-config/+/951842
14:55 <clarkb> fungi: disks are full again
14:55 <clarkb> at least on gitea10 which I pulled up to look into doing a RO change
14:55 <clarkb> fungi: we could click buttons on gitea10 to clean it up then set the dir to ro then reboot and see how we do?
14:57 <clarkb> 10, 11, 13, and 14 seem to be in the bad state
14:57 <clarkb> infra-root ^ any other opinions on how to tackle that?
14:58 <clarkb> my concern is that gitea may just update the perms on the dir if it finds them unwritable. Which I guess is fine as a test in prod. I suspect that we might generate more 500 errors otherwise, which I think is also acceptable
14:58 <clarkb> so I think it's fine to try that on gitea10 and see what happens?
14:58 <corvus> seems like a great time to try that experiment since the service is impacted anyway.
14:59 <clarkb> ok I'll start on that in a few
15:00 <fungi> yeah, no objection to that test
15:00 <fungi> gotta do something anyway
15:01 <fungi> maybe we can also block the actual url pattern or something
15:02 <clarkb> ya that might cut down on 500 errors
15:02 <fungi> .*/archive/[^/]+\.(tar\.gz|zip)$
15:03 <clarkb> fungi: hrm I got a 500 error just trying to log in, did that happen to you?
15:03 <clarkb> maybe you had to free up some other disk space first?
15:03 <fungi> clarkb: yes, you need to free up a little space on the rootfs first
15:03 <clarkb> ack
15:04 <fungi> i did it with `sudo journalctl --vacuum-time=1d`
15:04 <fungi> free up a little space so it can write to the db, then log in, then clear the archive cache, then reboot
15:05 <fungi> archive/[^/]+ is too conservative, there are branches with / in their names, e.g. https://opendev.org/openstack/nova/archive/stable/2025.1.tar.gz
15:06 <fungi> so something more like /archive/.+\.(tar\.gz|zip)$ i guess
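For context, that relaxed pattern can be sanity-checked quickly (a sketch using Python's re module; Apache's ProxyPassMatch uses PCRE, which treats this particular pattern the same way):

```python
import re

# fungi's relaxed pattern: branch names can contain "/", so use .+ not [^/]+
ARCHIVE_RE = re.compile(r'/archive/.+\.(tar\.gz|zip)$')

# archive download urls match, including branches with "/" in their names
assert ARCHIVE_RE.search('/openstack/nova/archive/stable/2025.1.tar.gz')
assert ARCHIVE_RE.search('/x/wsme/archive/master.zip')

# ordinary repo browsing urls do not match...
assert not ARCHIVE_RE.search('/opendev/system-config/src/branch/master/README.rst')

# ...but browsing a file under a repo directory literally named "archive"
# does, which is the false positive raised later in the channel
assert ARCHIVE_RE.search('/opendev/system-config/src/branch/master/archive/foo.tar.gz')
```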
15:08 <clarkb> dr-xr-xr-x    2 1000 1000  4096 Jun  5 15:07 repo-archive
15:08 <clarkb> I'm going to reboot and start services again (I shut them down before setting up the new archive perms)
15:09 <clarkb> things are up and perms haven't changed
15:12 <clarkb> seems like normal operations continue to work on that host. I tried to download the tar.gz for x/wsme master and I see the client making the requests and getting a 200 back indicating the request has been accepted, but I don't see anything changing in the repo-archive dir
15:12 <clarkb> so I think this means we're successfully breaking this
15:13 <fungi> did it actually download?
15:13 <clarkb> fungi: no
15:14 <clarkb> I think the way this works is you make a request to start the download process, which is what creates the artifact and caches it, then when the file is available the browser client initiates the actual download
15:14 <clarkb> let me know if you want to do any additional testing before I proceed with 11, 13, and 14 (then we can do 09 and 12 after)
15:15 <clarkb> and then we can retrigger more replication
15:16 <clarkb> I picked x/wsme because it is small (so should tarball quickly) and is an easy thing to grep for
15:16 <fungi> no, lgtm. i've got a ProxyPassMatch reject line to add to the apache config, just need to write the test for it
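A reject rule along those lines would presumably look something like the following (an illustrative sketch, not the patch that was actually proposed; the `!` target tells mod_proxy not to forward matching requests to the backend):

```apache
# Refuse to proxy archive download URLs to the gitea backend;
# matching requests get an error response from Apache instead.
ProxyPassMatch "^/.*/archive/.+\.(tar\.gz|zip)$" "!"
```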
15:16 <clarkb> thanks
15:16 <clarkb> I'm actually going to manually trigger the admin task to delete the archives to see what happens there
15:17 <clarkb> since we run that in a daily-ish cron too I want to make sure we don't make the server sad with the updated perms doing that
15:18 <clarkb> ok I think it naively succeeded because I created a new empty dir for repo-archive. Looking at admin/monitor/cron it says the trigger I just did succeeded. It must've listed the dir, seen nothing to clean up, and exited immediately
15:18 <clarkb> which is perfect
15:19 <clarkb> proceeding with the other servers now
15:28 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
15:30 <slittle> Hi, is opendev experiencing any issues propagating merged reviews from review.opendev.org to opendev.org?
15:31 <slittle> e.g. https://review.opendev.org/c/starlingx/root/+/951867 merged 45 min ago, but isn't visible on opendev.org
15:31 <clarkb> slittle: yes, I'm busy correcting it now. If you pull up channel logs you'll see more details
15:31 <fungi> slittle: at the moment yes, some (presumably llm training) crawler is filling up the disk by trying to download tarballs and zipfiles of every repository state
15:32 <fungi> as soon as surgery is complete we'll force full re-replication from gerrit and they should catch up
15:33 <slittle> love those llm's
15:33 <fungi> we've got a couple of mitigations in flight to prevent it long-term (by disabling the gitea feature that lets users download tarballs and zips of repositories)
15:33 <slittle> I'll alert my community
15:35 <fungi> #status log Gitea archive caches filled up filesystems again today on most of the backends, cleanup is proceeding, Git repository states are lagging behind Gerrit by a few hours but will catch up soon.
15:35 <opendevstatus> fungi: finished logging
15:36 <clarkb> ok the four properly broken servers should be "fixed" now
15:36 * fungi shakes fist at the rogue ai takeover of the internet
15:37 <clarkb> I'm going to do 09 and 12 so they match, then we can trigger replication and write a system-config change to match the chmod 555
15:37 <fungi> sounds good, want that included in https://review.opendev.org/c/opendev/system-config/+/951873 or is it better as a separate change?
15:39 <fungi> looks like we have a lot of other chmods in docker/gitea-init/entrypoint.sh, is that a good place for this too or should ansible take care of it?
15:40 <clarkb> fwiw I moved aside the old dirs as some of them still had like 400MB of content in them
15:40 <clarkb> fungi: I was going to have ansible take care of it. I think that entrypoint stuff may come from upstream
15:40 <fungi> makes sense. i can put together the ansible change
15:40 <clarkb> fungi: we create the data dir in ansible and just need to create the repo-archive dir which lives under there at /var/gitea/data/gitea/repo-archive and set the mode
15:40 <clarkb> thanks
15:40 <fungi> yep
15:41 <clarkb> you might need to create the data/gitea/ dir also so that we can set the mode on that to not 555
15:41 <clarkb> which I think, if ansible does it recursively, will apply the same mode everywhere?
15:42 <fungi> does the file task recursively create directories?
15:46 <clarkb> I think it can in certain situations
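One way to sidestep the recursive-mode question entirely is to create each directory with its own explicit task (a sketch only; the paths are the ones mentioned above, while the 0755 mode and the 1000:1000 ownership are assumptions based on the earlier ls output, not the change that actually merged):

```yaml
- name: Ensure gitea data directory exists with normal perms
  ansible.builtin.file:
    path: /var/gitea/data/gitea
    state: directory
    owner: "1000"
    group: "1000"
    mode: "0755"

- name: Ensure repo-archive directory exists and is read-only
  ansible.builtin.file:
    path: /var/gitea/data/gitea/repo-archive
    state: directory
    owner: "1000"
    group: "1000"
    mode: "0555"
```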
15:46 <clarkb> all 6 giteas should be done now. I'm going to double check my work and make sure they all rebooted recently, have all containers running, still have mode 555 on repo-archive, and aren't consuming all the disk, then I can trigger gerrit replication
15:49 <clarkb> yup that all checks out. Triggering replication now
15:50 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
15:50 <fungi> 951873 failed system-config-run-gitea on a mariadb issue
15:51 <fungi> never got far enough to run the new testinfra test
15:52 <fungi> i should say, on a failure to download the mariadb container image
15:52 <clarkb> 502 response from quay I think?
15:52 <clarkb> ya
15:52 <fungi> looks like it
15:52 <fungi> ERROR: for mariadb  received unexpected HTTP status: 502 Bad Gateway
15:54 <clarkb> down to 11k replication tasks. That seems to be steadily falling
15:55 <clarkb> fungi: another thought would be to disable the archive cleanup cron entirely too
15:55 <clarkb> fungi: that would be in the gitea app.ini config file if you want to add that to your change
15:55 <clarkb> or set it to run less often in case something sneaks through somehow?
16:00 <fungi> i'm adding a disable for it for now
16:01 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
16:02 <fungi> if reviewers would prefer to just see the frequency reduced, i can adjust the change
16:02 <clarkb> I think this should be fine since we're no longer writing to those dirs
16:05 <clarkb> fungi: just posted a thought on the regex you're using too
16:06 <clarkb> slittle: I think we should be caught up now. Can you check your repo again?
16:08 <fungi> clarkb: i'm unclear on how your proposed regex would avoid matching attempts to browse a repository that has a directory named "archive" with files under it ending in those extensions
16:08 <fungi> also where did you see the bundle file downloads linked?
16:08 <clarkb> fungi: because it's rooted with two sections under /, one for org and one for repo, e.g. opendev/system-config/
16:08 <fungi> the archive feature in the webui only showed me .tar.gz and .zip
16:09 <clarkb> fungi: if you go to a repo page and click on the code dropdown it shows me .tar.gz .zip and .bundle
16:09 <fungi> aha, it doesn't do .bundle for the branch downloads, which is where i was looking
16:09 <clarkb> oh ya this is on the root page of a repo so it's doing it for the master branch there
16:10 <fungi> can we assume that repositories we're hosting will always have one and only one "/" in their names?
16:10 <fungi> if so, then yes your suggestion works
16:10 <clarkb> that is a good question
16:10 <clarkb> I think all do today. And I'm not sure if gerrit and gitea handle deeper nesting
16:11 <clarkb> the paths that we want to not break are of the form /opendev/system-config/src/branch/master/archive/foo.tar.gz
16:11 <fungi> i don't think gerrit nests anything, does it? so this would only be a gitea question, as it has an org concept
16:11 <clarkb> another approach would be to ensure that src does not occur before archive?
16:12 <fungi> that would need a negative lookahead, i guess apache's regex parser can handle those?
16:12 <clarkb> it's possible that the git protocols also use "archive" in paths?
16:12 <fungi> probably git protocol urls would never include those file extensions at the very end
16:12 <clarkb> quickly approaching "it would be nice if gitea just let us disable this feature"
16:12 <fungi> yes, indeed
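Both alternatives floated here can be checked the same way (a Python sketch; Apache's PCRE engine does support negative lookaheads, and the extension list includes .bundle per the code dropdown mentioned above):

```python
import re

# Option 1: anchor on exactly <org>/<repo> before /archive/, so the extra
# /src/branch/<branch>/ components in browsing urls prevent a match
ROOTED = re.compile(r'^/[^/]+/[^/]+/archive/.+\.(tar\.gz|zip|bundle)$')
assert ROOTED.match('/openstack/nova/archive/stable/2025.1.tar.gz')
assert not ROOTED.match('/opendev/system-config/src/branch/master/archive/foo.tar.gz')

# Option 2: a negative lookahead ensuring "/src/" never appears before /archive/
NO_SRC = re.compile(r'^/(?!.*/src/).*/archive/.+\.(tar\.gz|zip|bundle)$')
assert NO_SRC.match('/openstack/nova/archive/stable/2025.1.tar.gz')
assert not NO_SRC.match('/opendev/system-config/src/branch/master/archive/foo.tar.gz')
```

Option 1 relies on repos always being exactly one org and one name deep, which is the assumption questioned above; option 2 avoids that but rejects any download url containing a /src/ path segment.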
16:12 <fungi> i wonder if we can strip it out of the web template instead?
16:13 <clarkb> I suspect we can, though that may not prevent smarter bots
16:13 <fungi> yeah that wouldn't prevent anyone who used the direct url, but the filesystem perms adjustment breaks those and we don't care if they're confused
16:13 <clarkb> true
16:13 <clarkb> let me see what I can find in the templates
16:14 <fungi> not providing the links in the ui would be the most user-friendly solution
16:14 <fungi> i need to step away from the keyboard for a few minutes, brb
16:15 <clarkb> {{if and (not $.DisableDownloadSourceArchives) $.RefName}}
16:15 <clarkb> maybe we can disable this!
16:15 <fungi> whoa!
16:15 <fungi> that must have sneaked in and not gotten a mention in the changelog?
16:16 <clarkb> it's a per repo setting it looks like, so maybe something they added that, since it isn't a system wide setting, didn't get much attention?
16:16 <clarkb> but I suspect we may be able to update our repo management tooling to set the flag
16:17 <clarkb> or maybe it isn't exposed via the api at all? I'm having trouble finding it
16:23 <clarkb> fungi: left a comment with the way to disable them on your change
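Judging from the template variable quoted above, the corresponding app.ini knob would presumably look something like this (setting name inferred from `$.DisableDownloadSourceArchives`, so treat it as an assumption until checked against the gitea configuration docs):

```ini
[repository]
DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true
```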
16:24 <clarkb> feature was added in august 2022 so somewhat new, and ya it doesn't seem to have gotten much attention that I see
16:29 <clarkb> ctx.Error(http.StatusNotFound) <- that appears to be what gitea will return if you make the request, so I think your existing test cases may still be valid
16:46 <fungi> clarkb: should we also undo the chmod in that case, once this lands, instead of adding it to the ansible?
16:46 <fungi> or keep it in for belt-and-braces?
16:47 <clarkb> I could go either way on that. I guess undoing in prod is something we don't need to take a downtime for like we just did to apply it, since we're making things more permissive
16:47 <clarkb> so ya maybe drop the special config in ansible in your change to create the dir and chmod things, then after your change lands and applies we can re-chmod that dir back to its old perms
16:48 <clarkb> and confirm that we're still not getting flooded then
16:48 <clarkb> that wfm
16:49 <fungi> looks like gitea doesn't do typical html error pages like apache does, so the response only contains the string "Not found."
16:51 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
16:53 <clarkb> also note I'm not sure if updating app.ini will do automated restarts of gitea
16:54 <clarkb> I feel like it didn't at one time, then I have a vague memory of fixing that, but not sure if that is a hallucination in the AI system
16:55 <fungi> glitch in the matrix
17:40 <corvus> 2025-06-05 17:37:09,450 DEBUG zuul.Launcher: [e: 1856baa99bb84856855a99b450571e2c] [req: f315607a0dce4b7a95e3e065907789c6] Selected request main provider <OpenstackProvider canonical_name=opendev.org%2Fopendev%2Fzuul-providers/openmetal-iad3-main>
17:40 <corvus> 2025-06-05 17:37:19,075 ERROR zuul.Launcher: [e: 1856baa99bb84856855a99b450571e2c] [req: f315607a0dce4b7a95e3e065907789c6] Error in creating the server. Compute service reports fault: No valid host was found. There are not enough hosts available.
17:40 <corvus> seeing that error a bit in openmetal iad3
17:41 <corvus> (from niz)
17:41 <corvus> perhaps we just have more quota than actual hosts?
17:42 <corvus> 2025-06-05 17:37:19,773 DEBUG zuul.Launcher: [e: 1856baa99bb84856855a99b450571e2c] [req: f315607a0dce4b7a95e3e065907789c6] Provider quota including Zuul: {'instances': inf, 'cores': inf, 'ram': inf, 'volumes': inf, 'volume-gb': inf}
17:42 <corvus> yes!  yes we do :)
17:43 <corvus> i guess we depend on max-servers with nodepool.  and of course we could do the same with niz later.
17:43 <clarkb> corvus: should we set a quota in the cloud or set a limit on the zuul provider?
17:43 <clarkb> ya
17:43 <corvus> and in the interim, the two aren't going to be able to cooperate since they don't know the quota
17:43 <corvus> but also, i'm thinking that if we could set the quota, that would probably be good, because if we're going to allow node size diversity with niz, that will be important
17:44 <corvus> (ie, having 4, 8, 16gb nodes)
17:44 <clarkb> I think if we log into horizon in that cloud it gives us a quick overview of total resources which we can use to quickly math out some quotas
17:44 <clarkb> and setting quotas in horizon is probably easy too so one stop shopping there
17:44 <clarkb> I can look at doing that closer to lunch if no one beats me to it
17:45 <corvus> yeah, and if it somehow isn't, we can at least use those numbers to set niz max-resources (so basically, like max-servers, but fine-grained).  but quota would be best.
17:45 <corvus> that sounds great, thanks!
17:52 <corvus> we stopped running puppet on cacti because of an rce, but that also means we're not getting new hosts added
17:53 <corvus> i wonder if we should update the config for the host to make the blocking permanent and turn it back on
17:53 <corvus> or should we just run the create-graphs script manually
17:54 <corvus> i am going to run create_graphs.sh for zl01 manually, but let's think about whether we should update the ansible for that host
17:55 <clarkb> I think I'd be ok with trying to make it permanent
17:55 <clarkb> then reenabling it for now
17:55 <clarkb> corvus: I think that can be done without any puppet actually since we use ansible to do the firewall rules?
17:55 <clarkb> so it might be fairly straightforward
17:55 <corvus> yeah
18:01 <corvus> #status log ran create_graphs.sh to create cacti graphs for zl01, review03, nl05-8, nb05-7, codesearch02, grafana02
18:01 <opendevstatus> corvus: finished logging
18:01 <corvus> i just worked back in the history of the file until i found a host already created
18:11 <fungi> log stream says the testinfra test i added is failing, probably need to null-terminate the string or something
18:18 <fungi> yep, that, but also the string isn't quite the same as what i got from a random nonexistent url on gitea
18:18 <fungi> AssertionError: assert 'Not Found\n' == 'Not found.'
18:19 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
18:23 <clarkb> I've updated the quota in openmetal to 50 instances (matches nodepool max servers, though we are slightly over so may have leaked a small number of nodes), 520 vcpus (we seem to have capacity for 540, I think with oversubscription) which should give us some headroom for cloud services, and ~2TB of memory. Note horizon seems to only let you modify the default quotas, but since we
18:23 <clarkb> are the only users of the cloud I figured this was ok for now and we can be more specific later if necessary
18:23 <clarkb> corvus: ^ fyi
18:23 <clarkb> if anyone knows how to do this on a per project basis via horizon I'm happy to revert the defaults and set project specific values
18:24 <clarkb> oh I think I just found it actually. It's a hidden menu option for projects
18:25 <corvus> clarkb: do you know how many ips we have?
18:26 <clarkb> corvus: it's ~50. That's where the max server size comes from
18:27 <clarkb> s/size/count/
18:27 <clarkb> but I can see if I can find the actual pool allocation
18:28 <corvus> cool -- that's where i was going with that question.  thanks :)
18:28 <clarkb> for the record I reverted the defaults back to unlimited instances, memory, and vcpus. Then modified the zuul project to a limit of 51 instances (in the project settings it wouldn't let me set a quota lower than current consumption), 500 vcpus (that's 10 per instance), and 2TB of memory
18:28 <clarkb> corvus: ya in this cloud our primary limitation is IP addrs
18:29 <corvus> 2025-06-05 18:28:51,411 ERROR zuul.Launcher: [e: fd10e30d60474b68a40f8616bd7f1dec] [req: 18281e1a650b44b9b9c00f557759c4b6] Error in creating the server. Compute service reports fault: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 56fcb68f-206e-4aff-92bc-710417b30935. Last exception: sequence item 1: expected string, NoneType found
18:29 <corvus> that looks like a python error from openstack
18:30 <corvus> lemme check which cloud that is
18:30 <clarkb> corvus: we have a /26 which is 64 total IPs, then you have overhead for your neutron routers and dhcp agents iirc. We might get away with something closer to 55? but not much larger
18:30 <corvus> canonical_name=opendev.org%2Fopendev%2Fzuul-providers/rax-iad-main was the compute service fault
18:31 <clarkb> did the limit not land there?
18:31 <clarkb> there == zuul-provider/zuul-launcher since rax-iad is the one we wanted to disable
18:31 <corvus> i think that's a completely different error
18:31 <corvus> just bringing it up since it's new to me
18:31 <clarkb> well it should stop trying to boot instances with a limit of 0 right?
18:32 <corvus> oh that, sorry i misunderstood
18:32 <clarkb> so I'm mostly wondering why it tried to boot an instance. But maybe the error itself is manifesting in zuul launcher differently than it did with nodepool?
18:32 <corvus> i thought you were asking about the new limits from today, not the old limits from yesterday :)
18:32 <clarkb> same issue different symptom
18:33 <corvus> i'm checking on your question re resource-limits
18:33 <clarkb> corvus: one thing I noticed with your change is it applied to section not provider (or maybe vice versa), maybe that matters?
18:33 <corvus> yep, shouldn't, but could be a bug
18:39 <clarkb> #status log Set project quotas for instances, memory, and vcpus in the zuul openmetal cloud project.
18:39 <opendevstatus> clarkb: finished logging
18:52 <corvus> clarkb: i believe the setting is correct, but there is a bug in the launcher in provider selection that lets it slip through.  working on it.
18:54 <corvus> clarkb: re openmetal, this is what zl sees now: {'instances': 51, 'cores': 500, 'ram': 2097152, 'volumes': inf, 'volume-gb': inf}
18:57 <Clark[m]> That looks correct. Sorry, switched clients as I'm sorting out lunch now
19:47 <clarkb> apparently you can get mulch blown in like insulation. It is not quiet
19:47 <clarkb> fungi: AssertionError: assert 'Not Found\n' == 'Not found.'
19:48 <clarkb> fungi: I think you just need a newline on your test assertions, then your change should pass
19:49 <fungi> that was what i fixed in the most recent patchset
19:49 <fungi> oh! i only fixed one occurrence of it
19:50 <clarkb> fungi: it's still failing on the latest patchset
19:50 <clarkb> oh ok cool, I wanted to make sure I wasn't missing something
19:50 <fungi> yeah, i needed to fix it three times, not just once :/
19:51 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Block access to Gitea's archive feature  https://review.opendev.org/c/opendev/system-config/+/951873
19:51 * fungi sighs
19:51 <fungi> is it friday yet?
19:51 <clarkb> not quite but I feel that
20:29 <opendevreview> Merged opendev/zuul-providers master: Use zuul-supplied image formats  https://review.opendev.org/c/opendev/zuul-providers/+/949944
20:41 <corvus> re zuul-launcher, 2 things: 1) there's an error where we don't check whether a provider could ever possibly have enough quota for a node before we assign it.  that's an easy fix.
20:41 <corvus> 2) without the fix for #1 in place, we should see some nodes that are assigned to rax-iad and stay stuck there forever, but that didn't happen because at some point the quota went crazy and it thought it actually could proceed.
20:41 <corvus> i cannot find a code path for #2 to happen; i have no explanation.  there is a bit more debugging info in the latest image.  i'm going to restart zuul-launcher on that.  perhaps something in my monkeypatching to fix the zero division error caused it.
20:41 <corvus> but if not, hopefully we'll see it again with more debug info.
20:41 <clarkb> ack
21:02 <mnasiadka> hello
21:03 <mnasiadka> clarkb: do you think we'll get a second core review on https://review.opendev.org/c/opendev/glean/+/941672? I'd like to move on a bit - I know we're waiting for tonyb on the DIB switch to use devstack for testing - but still :)
21:04 <clarkb> mnasiadka: my hunch is that fungi is the most likely extra reviewer. fungi do you think you'd like to review that or should we proceed with my review?
21:13 <mnasiadka> thanks :)
21:25 <fungi> looking
21:30 <fungi> big diff, but all the new codepaths look sufficiently gated by conditionals that at worst it will only break rhel10/centos10, which wasn't working anyway
21:32 <clarkb> and TheJulia acked that this is unlikely to create problems for ironic
21:32 * TheJulia is summoned
21:32 <TheJulia> oh hai
21:32 * TheJulia reads
21:32 <clarkb> TheJulia just noting that the glean update for centos 10 stream seems safe enough as written
21:33 <TheJulia> Oh yeah, I concur
21:33 <clarkb> and that you chimed in to agree when it comes to ironic
21:34 <TheJulia> ... there is an aspect I was made aware of yesterday in general, but more so a quirk regarding cloud-init with the metadata, but that is entirely unrelated.
21:34 <TheJulia> (and I think glean does it right.)
21:36 * TheJulia may also be biased
21:37 * TheJulia steps back into the shadows
22:08 <opendevreview> Merged opendev/glean master: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/941672
22:14 <mnasiadka> yay :)
22:14 <clarkb> I think we need a new release for that too, but one step at a time
22:24 <clarkb> fungi: https://review.opendev.org/c/opendev/system-config/+/951873 passes now \o/ I've double checked our ansible and I don't believe that we will restart gitea automatically after that change lands
22:24 <clarkb> that's ok, I'm happy to work through it manually tomorrow (running out of energy for that today I think) if we land the change at some point between now and then
22:24 <clarkb> the process requires us to start the containers in the right order so that gerrit replication doesn't lose events
22:25 <clarkb> I'm also happy to walk anyone else through it if they are interested. It's straightforward, just repetitive
22:41 <corvus> https://review.opendev.org/951919 should address the main issue with the launcher attempting to create nodes in iad.  the parent should also fix an issue with request priority (which we haven't seen yet, but i think we would as we scale up).
22:42 <corvus> now that cacti is in place: the load average on zl01 is negligible.  it peaks at 0.1.
22:43 <corvus> it's not doing much yet, but we've had some bursts of activity.
22:47 <clarkb> corvus: for 951919 I'm not understanding where the limits come into play. It seems to only consider the provider's quota?
22:48 <clarkb> corvus: specifically getProviderQuota() doesn't seem to refer to limits
22:51 <clarkb> oh, we overlay resource-limits over the quota it looks like
22:53 <clarkb> ya ok I see it now. Basically we get the resource-limits and the cloud provided quota, then take the minimum value and use that as our quota
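That overlay can be sketched in a few lines (a simplification for illustration only; the dict shape follows the launcher log lines quoted above, not the actual zuul code):

```python
import math

def effective_quota(cloud_quota, resource_limits):
    # per-resource minimum of what the cloud reports and what the operator
    # configured; a missing resource-limit means "unlimited"
    return {
        resource: min(cloud_value, resource_limits.get(resource, math.inf))
        for resource, cloud_value in cloud_quota.items()
    }

# openmetal before the project quota was set: everything reported as inf,
# so the configured limits win
cloud = {'instances': math.inf, 'cores': math.inf, 'ram': math.inf}
limits = {'instances': 51, 'cores': 500, 'ram': 2097152}
assert effective_quota(cloud, limits) == {'instances': 51, 'cores': 500, 'ram': 2097152}
```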
23:20 <corvus> yep

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!