ianw | fungi: ^^^ see comments in https://review.opendev.org/c/zuul/zuul-jobs/+/933395. the user/pw ping does work afaics. i think there's something else going on ... possibly the lack of config file pointed out there | 06:11 |
ianw | i've just been through a similar thing with copr, but that api gives you back the build number when you ping the webhook, so you have something to poll to see if it worked or not. afaics there's nothing similar for RTD | 06:12 |
ianw | likely the RTD jobs should check for the .readthedocs.yaml file before the ping and exit with a sane failure; although being in the post pipeline it's easily missed | 06:13 |
fungi | ianw: if you get time, could you test with one of the repos which broke more recently, like x/tobiko? | 12:50 |
fungi | it looks like the job began failing for projects *with* a v2 rtd config within a few days after 2024-09-19 | 12:52 |
fungi | or i can try to recreate your test using the curl command from your comment | 12:54 |
fungi | i'm trying to repeat your test, but am clearly getting something wrong since curl keeps throwing an error back at me... | 13:02 |
fungi | stdout comes back with... {"build_triggered":true,"project":"tobiko","versions":["master"]} | 13:04 |
fungi | but stderr has this: | 13:04 |
fungi | curl: (3) URL rejected: Port number was not a decimal number between 0 and 65535 | 13:04 |
fungi | curl: (3) bad range in URL position 11: | 13:04 |
fungi | versions:[latest] | 13:04 |
frickler | hmm, the curl works for me without an error. I wonder where the build failure could be seen | 13:08 |
fungi | i suspect it's my shell eating some of the quoting, but if i try to wrap the json in single-quotes i get an error about nested brackets instead | 13:08 |
frickler | which quoting? the second line of ian's comment is the response from the server, not part of the command | 13:10 |
fungi | oh! i thought it was the post body | 13:10 |
fungi | okay, yeah, if i leave that part out it works for me | 13:10 |
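For reference, the manual trigger being repeated here is roughly the following shell sketch (the webhook id and credentials are placeholders, not the real values from ianw's comment; the second line in that comment is the server's response, not part of the command):

    # Trigger an RTD build for the tobiko project via its generic webhook,
    # authenticating with basic auth as the role does.
    # WEBHOOK_ID, RTD_USERNAME and RTD_PASSWORD are placeholders.
    curl -X POST \
      --user "${RTD_USERNAME}:${RTD_PASSWORD}" \
      "https://readthedocs.org/api/v2/webhook/tobiko/${WEBHOOK_ID}/"
    # Expected stdout on success:
    # {"build_triggered":true,"project":"tobiko","versions":["master"]}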
frickler | slaweq: you said you manually triggered the rtd build successfully, but the docs on the page still say they're version 0.8, not 0.8.1 | 13:11 |
frickler | slaweq: is it possible that the trigger worked, but the build still failed? not sure where the logs for that could be found | 13:11 |
fungi | regardless, it's also unclear why the trigger job is failing in that case | 13:11 |
fungi | maybe the ansible url module's behavior changed instead? | 13:13 |
frickler | yes, I was just thinking trying with ansible would be the next debugging step | 13:14 |
fungi | could that timing coincide with an ansible default version change we made for the tenant, maybe? | 13:17 |
frickler | I'm now checking whether with the login on the rtd site I can see any build logs | 13:17 |
frickler | "Lass Built: 7 min ago, successful". so that seems fine. and the page still only shows 0.8 as version. so that's either intentional or an unrelated bug | 13:20 |
frickler | also looks like tobiko does have a proper .readthedocs.yaml file | 13:21 |
frickler | I'm not sure when we changed ansible versions, some change in zuul might also be possible? anyway I'm going to do a local test with ansible now | 13:23 |
fungi | actually we haven't switched the openstack tenant to ansible 9 yet, it's still on 8 | 13:44 |
fungi | and the default nodeset change was back in august | 13:45 |
frickler | o.k., testing with ansible-core 2.17.5 on python 3.12.6 on trixie was successful, build triggered without failure | 13:59 |
frickler | not sure whether testing other versions would be worthwhile, or whether next up should be a test within zuul | 13:59 |
fungi | and that | 14:00 |
fungi | and that's with force_basic_auth: yes? | 14:01 |
frickler | yes, copied the task 1:1 from the role | 14:01 |
fungi | the window for the start of failures does straddle a weekend, so it could have started with the 2024-09-21 zuul upgrade | 14:05 |
fungi | maybe something changed with handling of variables? could rtd_webhook_id be ending up empty for example? | 14:06 |
* fungi tries to see what zuul changes merged between 2024-09-14 and 2024-09-21 | 14:07 |
frickler | rtd_webhook_id specifically is checked in https://review.opendev.org/c/zuul/zuul-jobs/+/933395/2/roles/trigger-readthedocs/tasks/main.yaml#4 | 14:08 |
fungi | good point, so we'd have a clear error in that case | 14:08 |
fungi | rtd_project_name isn't checked, but it defaults to zuul.project.short_name which seems unlikely to have broken | 14:09 |
fungi | also there was a lull in changes merging to zuul/zuul during that week, so fairly easy to check and i'm not seeing anything obvious that could have impacted this | 14:11 |
fungi | there are a handful of projects using this job that override rtd_project_name to an explicit string, but they also seem to have broken at the same time as those relying on the default | 14:13 |
Clark[m] | fungi: we did change Openstack to Ansible 9 by default | 14:37 |
Clark[m] | I don't think the timing of that works for when things broke though | 14:39 |
fungi | Reviewed-on: https://review.opendev.org/c/openstack/project-config/+/931320 | 14:45 |
fungi | Submitted-at: Tue, 08 Oct 2024 19:15:31 +0000 | 14:45 |
fungi | so yeah, it broke before then | 14:45 |
fungi | somehow i'd created a local branch named "origin/master" and didn't notice that when i did `git checkout origin/master` i was landing on a stale branch from september | 14:46 |
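A hypothetical cleanup for that stale-branch situation (not the commands actually run) could look like this:

    # A local branch literally named "origin/master" shadows the
    # remote-tracking ref, so "git checkout origin/master" lands on it.
    git branch --list 'origin/*'
    # Remove the stale local branch, then check out the real remote ref.
    git branch -D origin/master
    git fetch origin
    git checkout origin/master   # now a detached HEAD at the current remote tip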
fungi | judging from the pace of the backup volume filling up, i expect it'll be around 98% full on monday. should i go ahead and prune it today, or early next week (keeping in mind that'll be cutting it close) | 15:01 |
Clark[m] | I think we should prune today. | 15:03 |
fungi | i'll get it running now in that case | 15:03 |
Clark[m] | We can measure effectiveness of other cleanups through direct disk utilization checks so pruning now or later doesn't hurt that effort | 15:03 |
fungi | in progress in a root screen session on backup02.ca-ymq-1.vexxhost | 15:04 |
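The prune itself would be something along these lines, run per repository on the backup server (retention values here are illustrative, not necessarily what the opendev prune script uses):

    REPO=/opt/backups/borg-ethercalc02/backup   # one borg repo per backed-up server
    borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$REPO"
    borg compact "$REPO"   # on borg >= 1.2 this step is what actually frees the space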
clarkb | thanks! | 15:22 |
clarkb | I'm having a bit of a slow start today. It's like I'm already prepared for dropping DST | 15:22 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Update Nodepool image location in docs https://review.opendev.org/c/openstack/diskimage-builder/+/933923 | 15:35 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 96% to 77% | 15:41 |
opendevstatus | fungi: finished logging | 15:41 |
fungi | when i pruned it last, on 2024-10-07, it dropped to 75%, so we didn't really gain any ground with the ethercalc removal | 15:42 |
clarkb | ya ethercalc was only 1.1gb | 15:43 |
clarkb | review01 and review-dev01 will be much larger impacts | 15:43 |
fungi | agreed | 15:43 |
clarkb | fungi: https://paste.opendev.org/show/bx5rwZyRrefaDi8au2Km/ thats the breakdown | 15:43 |
clarkb | it's actually etherpad01 and review01 that will have the biggest impact | 15:44 |
clarkb | I think about 20% of our disk use is tied up in these old unused services/servers | 15:45 |
clarkb | so in theory we'd get down to 57% | 15:46 |
clarkb | effectively doubling our free space on pruning/cleanup? Not bad | 15:49 |
fungi | yep, sounds great | 15:55 |
clarkb | just in time for us to add a new review server that needs backing up :) | 15:56 |
clarkb | fungi: I went ahead and single core approved your docs update for mm3 admin access | 16:12 |
fungi | thanks! | 16:13 |
opendevreview | Merged opendev/system-config master: Add documentation about Django/Mailman super user https://review.opendev.org/c/opendev/system-config/+/933668 | 16:22 |
clarkb | fungi: I've noticed the wiki is slow today (trying to put some notes on the agenda so I don't forget over the weekend). It's still usable, but ya I suspect the AI crawler bots have new names | 17:39 |
fungi | likely | 17:49 |
clarkb | thoughts on adding screen to our test node images? I have to manually install it to do a test run through of our gerrit upgrade on a held gerrit node | 17:57 |
clarkb | actually wonder why the ansible to deploy gerrit doesn't pull that in as part of our standard stuff | 17:58 |
clarkb | maybe figuring that out is better | 17:58 |
clarkb | ok gerrit upgrade etherpad is now updated with my notes from actually performing the upgrade then downgrade on the held test node | 18:24 |
clarkb | I'm glad I did this because I found an issue with my naive approach to managing index backups in the downgrade process (basically I was copying the files as root to back them up, so when copying them back I need to chown them properly) | 18:25 |
clarkb | https://paste.opendev.org/show/bnt0hagi7Q4S3yZzXyVV/ captures that downgrade process if anyone is curious | 18:25 |
clarkb | there are no config changes in the diff so any config changes we want would be those we opt into (potentially for server log file rotation or other new config options which need further investigation) | 18:26 |
fungi | yay etherpad timeslider for seeing what you changed since i last read through it | 18:26 |
clarkb | every index does get upgraded but all of that is done online by default with this upgrade path | 18:26 |
clarkb | on the whole this seems pretty straightforward | 18:26 |
clarkb | as a side note I'm glad we don't maintain forks of the soy email templates | 18:28 |
clarkb | they change every single release | 18:28 |
clarkb | would be annoying to update our forks of all those different files if we had them | 18:28 |
clarkb | the reason for moving the index backups around like that is newer gerrit (I think starting with 3.8 or 3.9?) will use existing index content of the correct version to speed up a full reindex like we do in a downgrade | 18:34 |
clarkb | that isn't technically required but without it we should expect the reindexing to take about 35 minutes iirc. We don't have experience with reindexing from a backup starting point but I would expect it to be quite a bit quicker from what those who implemented the change say | 18:35 |
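A rough sketch of the index shuffle described above (the site path and gerrit2 user are assumptions; the actual steps are in the paste linked earlier):

    # Before downgrading, set aside a copy of the current index.
    sudo cp -a /home/gerrit2/review_site/index /home/gerrit2/review_site/index.backup
    # ... perform the downgrade ...
    # Restore the saved index and fix ownership, since the copy was made as
    # root; Gerrit can then reuse compatible index data to speed up the reindex.
    sudo rm -rf /home/gerrit2/review_site/index
    sudo mv /home/gerrit2/review_site/index.backup /home/gerrit2/review_site/index
    sudo chown -R gerrit2:gerrit2 /home/gerrit2/review_site/index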
fungi | just spotted this lost in the recent notices about the backup volume filling up: | 19:43 |
fungi | Inconsistency found in backup /opt/backups/borg-ethercalc02/backup on backup02 at Sun Oct 27 00:19:29 UTC 2024 | 19:43 |
fungi | i guess it only warned that once, i don't see any further notices about it | 19:43 |
fungi | presumably we should expect a similar notice each time we delete another server from backups? | 19:44 |
clarkb | I wonder if there is a list of things we need to remove that server from | 19:56 |
clarkb | that directory got removed so ya it has an inconsistency :) happy to clean it up and prevent the warning if we know where to do that | 19:56 |
clarkb | I'm looking at mm3 bounce processing options and I have no objections to enabling this on our lists | 20:28 |
clarkb | the one thing that seems to be a bit iffy to me is there doesn't appear to be any documentation that I can find on how the bounce score is calculated. Google search's AI summary says a hard bounce is worth one point and a soft bounce is half a point | 20:29 |
clarkb | by default list owners are notified of disabled and removed users which seems reasonable for tracking this after we enable it | 20:30 |
clarkb | oh and the listing under users for list members shows you a running bounce score | 20:31 |
clarkb | they are all zero on service-discuss I think because we have the functionality completely disabled so it doesn't even bother to track scores | 20:32 |
clarkb | I guess the main risk is that we'd remove people who do generally get emails but have a sad server for a short period of time. We can mitigate that by setting the threshold higher or increasing the number of warnings before removal? | 20:33 |
clarkb | probably best to just see how it does with the defaults and take it from there | 20:33 |
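For reference, the knobs being discussed live on the list's core config; inspecting and enabling them through the Mailman REST API would look roughly like this (list id, REST credentials, port, and the chosen values are assumptions, not what we'd necessarily deploy):

    # Show the current bounce-related settings for one list.
    curl -s -u restadmin:REST_PASSWORD \
        http://localhost:8001/3.1/lists/service-discuss.lists.opendev.org/config \
        | python3 -m json.tool | grep -i bounce
    # Turn on bounce processing with default-ish values; owners are notified
    # when a member is disabled or removed.
    curl -s -u restadmin:REST_PASSWORD -X PATCH \
        http://localhost:8001/3.1/lists/service-discuss.lists.opendev.org/config \
        -d process_bounces=True \
        -d bounce_score_threshold=5 \
        -d bounce_notify_owner_on_disable=True \
        -d bounce_notify_owner_on_removal=True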
clarkb | anyway I wanted to make sure I understood this well enough to discuss it next week and ended up thinking through it out loud here. | 20:34 |
fungi | yeah, dmarc enforcement was the primary driver for disabling bounce processing on lists under mm2, and we kept them that way for mm3 initially but it handles bounces differently than its predecessor (it sends a verp probe if a delivery bounces, so as to hopefully avoid counting bounces that were solely dependent on the message contents/headers) | 20:37 |