Friday, 2024-11-01

06:11 <ianw> fungi: ^^^ see comments in https://review.opendev.org/c/zuul/zuul-jobs/+/933395.  the user/pw ping does work afaics.  i think there's something else going on ... possibly the lack of config file pointed out there
06:12 <ianw> i've just been through a similar thing with copr, but that api gives you back the build number when you ping the webhook, so you have something to poll to see if it worked or not.  afaics there's nothing similar for RTD
06:13 <ianw> like the RTD jobs should check for the .readthedocs.yaml file before the ping and exit with a sane failure; although being in the post pipeline it's easily missed
06:13 <ianw> s/like/likely/
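A minimal sketch of the kind of pre-flight check ianw suggests, assuming a shell step run from the repository root; this is illustrative only, not the actual zuul-jobs role code, and the file names and error message are assumptions:

    # Bail out with a clear error if the project has no RTD v2 config file
    if [ ! -f .readthedocs.yaml ] && [ ! -f .readthedocs.yml ]; then
        echo "No .readthedocs.yaml found; the triggered RTD build would fail" >&2
        exit 1
    fi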
12:50 <fungi> ianw: if you get time, could you test with one of the repos which broke more recently, like x/tobiko?
12:52 <fungi> it looks like the job began failing for projects *with* a v2 rtd config within a few days after 2024-09-19
12:54 <fungi> or i can try to recreate your test using the curl command from your comment
13:02 <fungi> i'm trying to repeat your test, but am clearly getting something wrong since curl keeps throwing an error back at me...
13:04 <fungi> stdout comes back with... {"build_triggered":true,"project":"tobiko","versions":["master"]}
13:04 <fungi> but stderr has this:
13:04 <fungi> curl: (3) URL rejected: Port number was not a decimal number between 0 and 65535
13:04 <fungi> curl: (3) bad range in URL position 11:
13:04 <fungi> versions:[latest]
13:08 <frickler> hmm, the curl works for me without an error. I wonder where the build failure could be seen
13:08 <fungi> i suspect it's my shell eating some of the quoting, but if i try to wrap the json in single-quotes i get an error about nested brackets instead
13:10 <frickler> which quoting? the second line of ian's comment is the response from the server, not part of the command
13:10 <fungi> oh! i thought it was the post body
13:10 <fungi> okay, yeah, if i leave that part out it works for me
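For readers following along, the manual test being reproduced is roughly the following; the project slug, webhook id and credentials are placeholders here, and the exact options in ianw's review comment may differ:

    # Ping the Read the Docs webhook for a project using basic auth
    curl -s -u "$RTD_USERNAME:$RTD_PASSWORD" -X POST \
        "https://readthedocs.org/api/v2/webhook/tobiko/$RTD_WEBHOOK_ID/"
    # On success the response body looks like the stdout quoted above:
    # {"build_triggered":true,"project":"tobiko","versions":["master"]}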
13:11 <frickler> slaweq: you said you manually triggered the rtd build successfully, but the docs on the page still say they're version 0.8, not 0.8.1
13:11 <frickler> slaweq: is it possible that the trigger worked, but the build still failed? not sure where the logs for that could be found
13:11 <fungi> regardless, it's also unclear why the trigger job is failing in that case
13:13 <fungi> maybe the ansible uri module's behavior changed instead?
13:14 <frickler> yes, I was just thinking trying with ansible would be the next debugging step
13:17 <fungi> could that timing coincide with an ansible default version change we made for the tenant, maybe?
13:17 <frickler> I'm now checking whether with the login on the rtd site I can see any build logs
frickler"Lass Built: 7 min ago, successful". so that seems fine. and the page still only shows 0.8 as version. so that's either intentional or an unrelated bug13:20
13:21 <frickler> also looks like tobiko does have a proper .readthedocs.yaml file
13:23 <frickler> I'm not sure when we changed ansible versions, some change in zuul might also be possible? anyway I'm going to do a local test with ansible now
13:44 <fungi> actually we haven't switched the openstack tenant to ansible 9 yet, it's still on 8
13:45 <fungi> and the default nodeset change was back in august
13:59 <frickler> o.k., testing with ansible-core 2.17.5 on python 3.12.6 on trixie was successful, build triggered without failure
13:59 <frickler> not sure whether testing other versions would be worthwhile, or whether next up should be a test within zuul
14:00 <fungi> amd yjay
14:00 <fungi> bleagh
14:01 <fungi> and that's with force_basic_auth: yes?
14:01 <frickler> yes, copied the task 1:1 from the role
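frickler's reproduction used the role's uri task in a local playbook; an ad-hoc approximation of the same request (again with placeholder credentials and webhook id) would be something like:

    # Same POST via Ansible's uri module instead of curl
    ansible localhost -m ansible.builtin.uri -a "url=https://readthedocs.org/api/v2/webhook/tobiko/$RTD_WEBHOOK_ID/ method=POST force_basic_auth=yes url_username=$RTD_USERNAME url_password=$RTD_PASSWORD"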
14:05 <fungi> the window for the start of failures does straddle a weekend, so it could have started with the 2024-09-21 zuul upgrade
14:06 <fungi> maybe something changed with handling of variables? could rtd_webhook_id be ending up empty for example?
14:07 * fungi tries to see what zuul changes merged between 2024-09-14 and 2024-09-21
14:08 <frickler> rtd_webhook_id specifically is checked in https://review.opendev.org/c/zuul/zuul-jobs/+/933395/2/roles/trigger-readthedocs/tasks/main.yaml#4
14:08 <fungi> good point, so we'd have a clear error in that case
14:09 <fungi> rtd_project_name isn't checked, but it's filled by default with zuul.project.short_name, which seems unlikely to have broken
14:11 <fungi> also there was a lull in changes merging to zuul/zuul during that week, so fairly easy to check and i'm not seeing anything obvious that could have impacted this
14:13 <fungi> there are a handful of projects using this job who override rtd_project_name to an explicit string, but they also seem to have broken at the same time as those relying on the default
14:37 <Clark[m]> fungi: we did change Openstack to Ansible 9 by default
14:39 <Clark[m]> I don't think the timing of that works for when things broke though
14:45 <fungi> Reviewed-on: https://review.opendev.org/c/openstack/project-config/+/931320
14:45 <fungi> Submitted-at: Tue, 08 Oct 2024 19:15:31 +0000
14:45 <fungi> so yeah, it broke before then
14:46 <fungi> somehow i'd created a local branch named "origin/master" and didn't notice that when i did `git checkout origin/master` i was landing on a stale branch from september
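For anyone who hits the same trap: a local branch literally named origin/master shadows the remote-tracking ref of the same name, and git resolves the local branch first. A few illustrative commands (not what was actually run) to spot and clean it up:

    git show-ref origin/master     # lists both refs/heads/origin/master and refs/remotes/origin/master if the name is ambiguous
    git branch -D origin/master    # delete the stale local branch with the misleading name
    git checkout origin/master     # now detaches HEAD at the real remote-tracking branch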
15:01 <fungi> judging from the pace of the backup volume filling up, i expect it'll be around 98% full on monday. should i go ahead and prune it today, or early next week (keeping in mind that'll be cutting it close)?
15:03 <Clark[m]> I think we should prune today.
15:03 <fungi> i'll get it running now in that case
15:03 <Clark[m]> We can measure effectiveness of other cleanups through direct disk utilization checks so pruning now or later doesn't hurt that effort
15:04 <fungi> in progress in a root screen session on backup02.ca-ymq-1.vexxhost
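The prune itself is driven by the existing tooling on the backup server; for context, the general shape of a borg prune is roughly the following, where the retention values and repository path are placeholders rather than opendev's actual policy:

    # Keep a sliding window of archives and discard the rest
    borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "/opt/backups/borg-$SERVER/backup"
    # On borg >= 1.2 space is only reclaimed after compacting the repository
    borg compact "/opt/backups/borg-$SERVER/backup"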
15:22 <clarkb> thanks!
15:22 <clarkb> I'm having a bit of a slow start today. It's like I'm already prepared for dropping DST
15:35 <opendevreview> Clark Boylan proposed openstack/diskimage-builder master: Update Nodepool image location in docs  https://review.opendev.org/c/openstack/diskimage-builder/+/933923
15:41 <fungi> #status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 96% to 77%
15:41 <opendevstatus> fungi: finished logging
15:42 <fungi> when i pruned it last, on 2024-10-07, it dropped to 75%, so we didn't really gain any ground with the ethercalc removal
15:43 <clarkb> ya ethercalc was only 1.1gb
15:43 <clarkb> review01 and review-dev01 will be much larger impacts
15:43 <fungi> agreed
15:43 <clarkb> fungi: https://paste.opendev.org/show/bx5rwZyRrefaDi8au2Km/ that's the breakdown
15:44 <clarkb> it's actually etherpad01 and review01 that will have the biggest impact
15:45 <clarkb> I think about 20% of our disk use is tied up in these old unused services/servers
15:46 <clarkb> so in theory we'd get down to 57%
15:49 <clarkb> effectively doubling our free space on pruning/cleanup? Not bad
15:55 <fungi> yep, sounds great
15:56 <clarkb> just in time for us to add a new review server that needs backing up :)
16:12 <clarkb> fungi: I went ahead and single core approved your docs update for mm3 admin access
16:13 <fungi> thanks!
16:22 <opendevreview> Merged opendev/system-config master: Add documentation about Django/Mailman super user  https://review.opendev.org/c/opendev/system-config/+/933668
17:39 <clarkb> fungi: I've noticed the wiki is slow today (trying to put some notes on the agenda so I don't forget over the weekend). It's still usable but ya I suspect the AI crawler bots have new names
17:49 <fungi> likely
17:57 <clarkb> thoughts on adding screen to our test node images? I have to manually install it to do a test run through of our gerrit upgrade on a held gerrit node
17:58 <clarkb> actually wonder why the ansible to deploy gerrit doesn't pull that in as part of our standard stuff
17:58 <clarkb> maybe figuring that out is better
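If the deployment role (or the test node images) were to grow that dependency, it would presumably be a one-line package task; as a rough illustration of the manual step being worked around, with a hypothetical held-node hostname:

    # Install screen on a held node ad hoc; the inline inventory entry is a placeholder
    ansible all -i 'held-node.example.org,' -b -m ansible.builtin.package -a "name=screen state=present"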
18:24 <clarkb> ok gerrit upgrade etherpad is now updated with my notes from actually performing the upgrade then downgrade on the held test node
18:25 <clarkb> I'm glad I did this because I found an issue with my naive approach to managing index backups in the downgrade process (basically I was copying the files as root to back them up, so when copying them back I need to chown them properly)
18:25 <clarkb> https://paste.opendev.org/show/bnt0hagi7Q4S3yZzXyVV/ captures that downgrade process if anyone is curious
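The actual commands are in the paste above; the gist of the gotcha is roughly this, with the paths and the gerrit service account as assumptions for illustration:

    # Back up the Lucene index directory before downgrading, copying as root
    cp -r /home/gerrit2/review_site/index /root/index-backup
    # ... downgrade gerrit, then restore the saved index ...
    rm -rf /home/gerrit2/review_site/index
    cp -r /root/index-backup /home/gerrit2/review_site/index
    # Copies made as root end up root-owned, so fix ownership before starting gerrit
    chown -R gerrit2:gerrit2 /home/gerrit2/review_site/index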
18:26 <clarkb> there are no config changes in the diff so any config changes we want would be those we opt into (potentially for server log file rotation or other new config options which need further investigation)
18:26 <fungi> yay etherpad timeslider for seeing what you changed since i last read through it
18:26 <clarkb> every index does get upgraded but all of that is done online by default with this upgrade path
18:26 <clarkb> on the whole this seems pretty straightforward
18:28 <clarkb> as a side note I'm glad we don't maintain forks of the soy email templates
18:28 <clarkb> they change every single release
18:28 <clarkb> would be annoying to update our forks of all those different files if we had them
18:34 <clarkb> the reason for moving the index backups around like that is newer gerrit (I think starting with 3.8 or 3.9?) will use existing index content of the correct version to speed up a full reindex like we do in a downgrade
18:35 <clarkb> that isn't technically required but without it we should expect the reindexing to take about 35 minutes iirc. We don't have experience with reindexing from a backup starting point but I would expect it to be quite a bit quicker from what those who implemented the change say
19:43 <fungi> just spotted this lost in the recent notices about the backup volume filling up:
19:43 <fungi> Inconsistency found in backup /opt/backups/borg-ethercalc02/backup on backup02 at Sun Oct 27 00:19:29 UTC 2024
19:43 <fungi> i guess it only warned that once, i don't see any further notices about it
19:44 <fungi> presumably we should expect a similar notice each time we delete another server from backups?
19:56 <clarkb> I wonder if there is a list of things we need to remove that server from
19:56 <clarkb> that directory got removed so ya it has an inconsistency :) happy to clean it up and prevent the warning if we know where to do that
20:28 <clarkb> I'm looking at mm3 bounce processing options and I have no objections to enabling this on our lists
20:29 <clarkb> the one thing that seems to be a bit iffy to me is there doesn't appear to be any documentation that I can find on how the bounce score is calculated. Google search's AI summary says a hard bounce is worth one point and a soft bounce is half a point
20:30 <clarkb> by default list owners are notified of disabled and removed users which seems reasonable for tracking this after we enable it
20:31 <clarkb> oh and the listing under users for list members shows you a running bounce score
20:32 <clarkb> they are all zero on service-discuss I think because we have the functionality completely disabled so it doesn't even bother to track scores
20:33 <clarkb> I guess the main risk is that we'd remove people who do generally get emails but have a sad server for a short period of time. We can mitigate that by setting the threshold higher or increasing the number of warnings before removal?
20:33 <clarkb> probably best to just see how it does with the defaults and take it from there
20:34 <clarkb> anyway I wanted to make sure I understood this well enough to discuss it next week and ended up thinking through it out loud here.
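For reference, the knobs being weighed here are per-list settings exposed through the Mailman 3 core REST API; the attribute names below come from Mailman core, while the host, port, list id and credentials are assumptions:

    # Dump the bounce-related settings for one list
    curl -s -u restadmin:"$REST_PASSWORD" \
        "http://localhost:8001/3.1/lists/service-discuss.lists.opendev.org/config" \
        | python3 -m json.tool | grep -i bounce
    # Relevant settings: process_bounces, bounce_score_threshold,
    # bounce_you_are_disabled_warnings, bounce_you_are_disabled_warnings_interval,
    # bounce_notify_owner_on_disable, bounce_notify_owner_on_removal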
20:37 <fungi> yeah, dmarc enforcement was the primary driver for disabling bounce processing on lists under mm2, and we kept them that way for mm3 initially but it handles bounces differently than its predecessor (it sends a verp probe if a delivery bounces, so as to hopefully avoid counting bounces that were solely dependent on the message contents/headers)
