Friday, 2024-11-01

06:11 <ianw> fungi: ^^^ see comments in https://review.opendev.org/c/zuul/zuul-jobs/+/933395.  the user/pw ping does work afaics.  i think there's something else going on ... possibly the lack of config file pointed out there
06:12 <ianw> i've just been through a similar thing with copr, but that api gives you back the build number when you ping the webhook, so you have something to poll to see if it worked or not.  afaics there's nothing similar for RTD
06:13 <ianw> like the RTD jobs should check for the .readthedocs.yaml file before the ping and exit with a sane failure; although being in the post pipeline it's easily missed
06:13 <ianw> s/like/likely/
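A minimal sketch of the kind of pre-flight check ianw suggests, assuming a shell step run from the repository root; this is illustrative only, not the actual zuul-jobs role code, and the file names and error message are assumptions:

    # Bail out with a clear error if the project has no RTD v2 config file
    if [ ! -f .readthedocs.yaml ] && [ ! -f .readthedocs.yml ]; then
        echo "No .readthedocs.yaml found; the triggered RTD build would fail" >&2
        exit 1
    fi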
12:50 <fungi> ianw: if you get time, could you test with one of the repos which broke more recently, like x/tobiko?
12:52 <fungi> it looks like the job began failing for projects *with* a v2 rtd config within a few days after 2024-09-19
12:54 <fungi> or i can try to recreate your test using the curl command from your comment
13:02 <fungi> i'm trying to repeat your test, but am clearly getting something wrong since curl keeps throwing an error back at me...
13:04 <fungi> stdout comes back with... {"build_triggered":true,"project":"tobiko","versions":["master"]}
13:04 <fungi> but stderr has this:
13:04 <fungi> curl: (3) URL rejected: Port number was not a decimal number between 0 and 65535
13:04 <fungi> curl: (3) bad range in URL position 11:
13:04 <fungi> versions:[latest]
13:08 <frickler> hmm, the curl works for me without an error. I wonder where the build failure could be seen
13:08 <fungi> i suspect it's my shell eating some of the quoting, but if i try to wrap the json in single-quotes i get an error about nested brackets instead
13:10 <frickler> which quoting? the second line of ian's comment is the response from the server, not part of the command
13:10 <fungi> oh! i thought it was the post body
13:10 <fungi> okay, yeah, if i leave that part out it works for me
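For readers following along, the manual test being reproduced is roughly the following; the project slug, webhook id and credentials are placeholders here, and the exact options in ianw's review comment may differ:

    # Ping the Read the Docs webhook for a project using basic auth
    curl -s -u "$RTD_USERNAME:$RTD_PASSWORD" -X POST \
        "https://readthedocs.org/api/v2/webhook/tobiko/$RTD_WEBHOOK_ID/"
    # On success the response body looks like the stdout quoted above:
    # {"build_triggered":true,"project":"tobiko","versions":["master"]}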
13:11 <frickler> slaweq: you said you manually triggered the rtd build successfully, but the docs on the page still say they're version 0.8, not 0.8.1
13:11 <frickler> slaweq: is it possible that the trigger worked, but the build still failed? not sure where the logs for that could be found
13:11 <fungi> regardless, it's also unclear why the trigger job is failing in that case
13:13 <fungi> maybe the ansible uri module's behavior changed instead?
13:14 <frickler> yes, I was just thinking trying with ansible would be the next debugging step
13:17 <fungi> could that timing coincide with an ansible default version change we made for the tenant, maybe?
13:17 <frickler> I'm now checking whether with the login on the rtd site I can see any build logs
frickler"Lass Built: 7 min ago, successful". so that seems fine. and the page still only shows 0.8 as version. so that's either intentional or an unrelated bug13:20
13:21 <frickler> also looks like tobiko does have a proper .readthedocs.yaml file
13:23 <frickler> I'm not sure when we changed ansible versions, some change in zuul might also be possible? anyway I'm going to do a local test with ansible now
13:44 <fungi> actually we haven't switched the openstack tenant to ansible 9 yet, it's still on 8
13:45 <fungi> and the default nodeset change was back in august
13:59 <frickler> o.k., testing with ansible-core 2.17.5 on python 3.12.6 on trixie was successful, build triggered without failure
13:59 <frickler> not sure whether testing other versions would be worthwhile, or whether next up should be a test within zuul
14:00 <fungi> amd yjay
14:00 <fungi> bleagh
14:01 <fungi> and that's with force_basic_auth: yes?
14:01 <frickler> yes, copied the task 1:1 from the role
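frickler's reproduction used the role's uri task in a local playbook; an ad-hoc approximation of the same request (again with placeholder credentials and webhook id) would be something like:

    # Same POST via Ansible's uri module instead of curl
    ansible localhost -m ansible.builtin.uri -a "url=https://readthedocs.org/api/v2/webhook/tobiko/$RTD_WEBHOOK_ID/ method=POST force_basic_auth=yes url_username=$RTD_USERNAME url_password=$RTD_PASSWORD"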
14:05 <fungi> the window for the start of failures does straddle a weekend, so it could have started with the 2024-09-21 zuul upgrade
14:06 <fungi> maybe something changed with handling of variables? could rtd_webhook_id be ending up empty for example?
14:07 * fungi tries to see what zuul changes merged between 2024-09-14 and 2024-09-21
14:08 <frickler> rtd_webhook_id specifically is checked in https://review.opendev.org/c/zuul/zuul-jobs/+/933395/2/roles/trigger-readthedocs/tasks/main.yaml#4
14:08 <fungi> good point, so we'd have a clear error in that case
14:09 <fungi> rtd_project_name isn't checked, but it's filled by default with zuul.project.short_name, which seems unlikely to have broken
14:11 <fungi> also there was a lull in changes merging to zuul/zuul during that week, so fairly easy to check and i'm not seeing anything obvious that could have impacted this
14:13 <fungi> there are a handful of projects using this job who override rtd_project_name to an explicit string, but they also seem to have broken at the same time as those relying on the default
14:37 <Clark[m]> fungi: we did change Openstack to Ansible 9 by default
14:39 <Clark[m]> I don't think the timing of that works for when things broke though
14:45 <fungi> Reviewed-on: https://review.opendev.org/c/openstack/project-config/+/931320
14:45 <fungi> Submitted-at: Tue, 08 Oct 2024 19:15:31 +0000
14:45 <fungi> so yeah, it broke before then
14:46 <fungi> somehow i'd created a local branch named "origin/master" and didn't notice that when i did `git checkout origin/master` i was landing on a stale branch from september
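For anyone who hits the same trap: a local branch literally named origin/master shadows the remote-tracking ref of the same name, and git resolves the local branch first. A few illustrative commands (not what was actually run) to spot and clean it up:

    git show-ref origin/master     # lists both refs/heads/origin/master and refs/remotes/origin/master if the name is ambiguous
    git branch -D origin/master    # delete the stale local branch with the misleading name
    git checkout origin/master     # now detaches HEAD at the real remote-tracking branch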
15:01 <fungi> judging from the pace of the backup volume filling up, i expect it'll be around 98% full on monday. should i go ahead and prune it today, or early next week (keeping in mind that'll be cutting it close)?
15:03 <Clark[m]> I think we should prune today.
15:03 <fungi> i'll get it running now in that case
15:03 <Clark[m]> We can measure effectiveness of other cleanups through direct disk utilization checks so pruning now or later doesn't hurt that effort
15:04 <fungi> in progress in a root screen session on backup02.ca-ymq-1.vexxhost
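The prune itself is driven by the existing tooling on the backup server; for context, the general shape of a borg prune is roughly the following, where the retention values and repository path are placeholders rather than opendev's actual policy:

    # Keep a sliding window of archives and discard the rest
    borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "/opt/backups/borg-$SERVER/backup"
    # On borg >= 1.2 space is only reclaimed after compacting the repository
    borg compact "/opt/backups/borg-$SERVER/backup"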
15:22 <clarkb> thanks!
15:22 <clarkb> I'm having a bit of a slow start today. It's like I'm already prepared for dropping DST
15:35 <opendevreview> Clark Boylan proposed openstack/diskimage-builder master: Update Nodepool image location in docs  https://review.opendev.org/c/openstack/diskimage-builder/+/933923
15:41 <fungi> #status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 96% to 77%
15:41 <opendevstatus> fungi: finished logging
15:42 <fungi> when i pruned it last, on 2024-10-07, it dropped to 75%, so we didn't really gain any ground with the ethercalc removal
15:43 <clarkb> ya ethercalc was only 1.1gb
15:43 <clarkb> review01 and review-dev01 will be much larger impacts
15:43 <fungi> agreed
15:43 <clarkb> fungi: https://paste.opendev.org/show/bx5rwZyRrefaDi8au2Km/ that's the breakdown
15:44 <clarkb> it's actually etherpad01 and review01 that will have the biggest impact
15:45 <clarkb> I think about 20% of our disk use is tied up in these old unused services/servers
15:46 <clarkb> so in theory we'd get down to 57%
15:49 <clarkb> effectively doubling our free space on pruning/cleanup? Not bad
15:55 <fungi> yep, sounds great
15:56 <clarkb> just in time for us to add a new review server that needs backing up :)
16:12 <clarkb> fungi: I went ahead and single core approved your docs update for mm3 admin access
16:13 <fungi> thanks!
16:22 <opendevreview> Merged opendev/system-config master: Add documentation about Django/Mailman super user  https://review.opendev.org/c/opendev/system-config/+/933668
17:39 <clarkb> fungi: I've noticed the wiki is slow today (trying to put some notes on the agenda so I don't forget over the weekend). It's still usable but ya I suspect the AI crawler bots have new names
17:49 <fungi> likely
17:57 <clarkb> thoughts on adding screen to our test node images? I have to manually install it to do a test run through of our gerrit upgrade on a held gerrit node
17:58 <clarkb> actually wonder why the ansible to deploy gerrit doesn't pull that in as part of our standard stuff
17:58 <clarkb> maybe figuring that out is better
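If the deployment role (or the test node images) were to grow that dependency, it would presumably be a one-line package task; as a rough illustration of the manual step being worked around, with a hypothetical held-node hostname:

    # Install screen on a held node ad hoc; the inline inventory entry is a placeholder
    ansible all -i 'held-node.example.org,' -b -m ansible.builtin.package -a "name=screen state=present"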
18:24 <clarkb> ok gerrit upgrade etherpad is now updated with my notes from actually performing the upgrade then downgrade on the held test node
18:25 <clarkb> I'm glad I did this because I found an issue with my naive approach to managing index backups in the downgrade process (basically I was copying the files as root to back them up, so when copying them back I need to chown them properly)
18:25 <clarkb> https://paste.opendev.org/show/bnt0hagi7Q4S3yZzXyVV/ captures that downgrade process if anyone is curious
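The actual commands are in the paste above; the gist of the gotcha is roughly this, with the paths and the gerrit service account as assumptions for illustration:

    # Back up the Lucene index directory before downgrading, copying as root
    cp -r /home/gerrit2/review_site/index /root/index-backup
    # ... downgrade gerrit, then restore the saved index ...
    rm -rf /home/gerrit2/review_site/index
    cp -r /root/index-backup /home/gerrit2/review_site/index
    # Copies made as root end up root-owned, so fix ownership before starting gerrit
    chown -R gerrit2:gerrit2 /home/gerrit2/review_site/index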
18:26 <clarkb> there are no config changes in the diff so any config changes we want would be those we opt into (potentially for server log file rotation or other new config options which need further investigation)
18:26 <fungi> yay etherpad timeslider for seeing what you changed since i last read through it
18:26 <clarkb> every index does get upgraded but all of that is done online by default with this upgrade path
18:26 <clarkb> on the whole this seems pretty straightforward
18:28 <clarkb> as a side note I'm glad we don't maintain forks of the soy email templates
18:28 <clarkb> they change every single release
18:28 <clarkb> would be annoying to update our forks of all those different files if we had them
18:34 <clarkb> the reason for moving the index backups around like that is newer gerrit (I think starting with 3.8 or 3.9?) will use existing index content of the correct version to speed up a full reindex like we do in a downgrade
18:35 <clarkb> that isn't technically required but without it we should expect the reindexing to take about 35 minutes iirc. We don't have experience with reindexing from a backup starting point but I would expect it to be quite a bit quicker from what those who implemented the change say
19:43 <fungi> just spotted this lost in the recent notices about the backup volume filling up:
19:43 <fungi> Inconsistency found in backup /opt/backups/borg-ethercalc02/backup on backup02 at Sun Oct 27 00:19:29 UTC 2024
19:43 <fungi> i guess it only warned that once, i don't see any further notices about it
19:44 <fungi> presumably we should expect a similar notice each time we delete another server from backups?
19:56 <clarkb> I wonder if there is a list of things we need to remove that server from
19:56 <clarkb> that directory got removed so ya it has an inconsistency :) happy to clean it up and prevent the warning if we know where to do that
20:28 <clarkb> I'm looking at mm3 bounce processing options and I have no objections to enabling this on our lists
20:29 <clarkb> the one thing that seems to be a bit iffy to me is there doesn't appear to be any documentation that I can find on how the bounce score is calculated. Google search's AI summary says a hard bounce is worth one point and a soft bounce is half a point
20:30 <clarkb> by default list owners are notified of disabled and removed users which seems reasonable for tracking this after we enable it
20:31 <clarkb> oh and the listing under users for list members shows you a running bounce score
20:32 <clarkb> they are all zero on service-discuss I think because we have the functionality completely disabled so it doesn't even bother to track scores
20:33 <clarkb> I guess the main risk is that we'd remove people who do generally get emails but have a sad server for a short period of time. We can mitigate that by setting the threshold higher or increasing the number of warnings before removal?
20:33 <clarkb> probably best to just see how it does with the defaults and take it from there
20:34 <clarkb> anyway I wanted to make sure I understood this well enough to discuss it next week and ended up thinking through it out loud here.
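For reference, the knobs being weighed here are per-list settings exposed through the Mailman 3 core REST API; the attribute names below come from Mailman core, while the host, port, list id and credentials are assumptions:

    # Dump the bounce-related settings for one list
    curl -s -u restadmin:"$REST_PASSWORD" \
        "http://localhost:8001/3.1/lists/service-discuss.lists.opendev.org/config" \
        | python3 -m json.tool | grep -i bounce
    # Relevant settings: process_bounces, bounce_score_threshold,
    # bounce_you_are_disabled_warnings, bounce_you_are_disabled_warnings_interval,
    # bounce_notify_owner_on_disable, bounce_notify_owner_on_removal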
20:37 <fungi> yeah, dmarc enforcement was the primary driver for disabling bounce processing on lists under mm2, and we kept them that way for mm3 initially but it handles bounces differently than its predecessor (it sends a verp probe if a delivery bounces, so as to hopefully avoid counting bounces that were solely dependent on the message contents/headers)
