*** mtreinish_ is now known as mtreinish | 11:00 | |
*** mtreinish_ is now known as mtreinish | 12:02 | |
fungi | are we even still deploying elastic-recheck? that should probably have been retired years ago when we took down the elasticsearch cluster | 12:28 |
---|---|---|
clarkb | fungi: I'm not sure if the opensearch setup is still running elastic-recheck or not | 14:51 |
clarkb | I think we can retire the project if not | 14:51 |
fungi | i thought it was all rewritten from scratch | 14:52 |
fungi | i haven't seen dpawlik around for a while to ask though | 14:52 |
*** darmach43 is now known as darmach4 | 15:39 | |
clarkb | infra-root any objections to me proceeding with https://review.opendev.org/c/opendev/system-config/+/956828 then https://review.opendev.org/c/opendev/system-config/+/956829 ? | 15:58 |
fungi | none on my part, i just didn't want to approve them until you were settled in for the morning | 16:18 |
clarkb | cool I've eaten breakfast and taken care of paperwork I needed to get done. I'll approve the first one now | 16:19
fungi | i feel like i have an acre of paperwork every week | 16:20 |
clarkb | I've been reading up on mjolnir and haven't found any indication from EMS docs that they can run one for you (though I haven't logged into the control dashboard to confirm that this is the case). But they publish a container image that you supply a config to with account credentials for your homeserver, and that account gets configured as administrator on your channels. Then you add | 16:47
clarkb | the bot to the channels as well as a private moderation channel which is where the humans can receive reports. I think they can also send commands to the bot there or dm the bot | 16:47 |
clarkb | in the past you also needed to use a proxy to join encrypted channels but now mjolnir supports that natively. | 16:48 |
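As a rough illustration of the deployment model described above (image name, config file location, and key names are taken from a quick read of the linked upstream docs and are assumptions here, not a vetted setup):

```shell
# Hypothetical sketch: mjolnir reads a production.yaml from its bind-mounted
# data directory; the config names the bot account's homeserver/credentials
# and the private moderation ("management") room, and all persistent state
# stays under the same directory.
mkdir -p /var/lib/mjolnir/config
cat > /var/lib/mjolnir/config/production.yaml <<'EOF'
homeserverUrl: https://matrix.example.org        # placeholder homeserver
accessToken: "BOT-ACCOUNT-ACCESS-TOKEN"          # the moderator account's token
managementRoom: "#moderation:example.org"        # private room for reports/commands
dataPath: /data/storage                          # state (ban lists, etc.) lives here
EOF
docker run -d --name mjolnir -v /var/lib/mjolnir:/data matrixdotorg/mjolnir:latest
```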
fungi | so in theory we could add it to the fleet of bot containers on eavesdrop | 16:51 |
clarkb | ya that is what I'm thinking. I think it does store some data (ban lists for example) but that should be minimal. Doesn't seem to use a database either | 16:51 |
clarkb | or if it does it's something like a sqlite file directly on disk, not a separate service | 16:51
fungi | well, if it stores something then it's using some kind of database, yeah | 16:52 |
fungi | it might just be a flat text file, but that's still a sort of database | 16:52 |
clarkb | https://github.com/matrix-org/mjolnir/blob/main/docs/setup_docker.md ya I meant more that we don't need additional services based on this documentation | 16:52 |
clarkb | just a data volume | 16:52 |
fungi | perfect | 16:52 |
fungi | we have those | 16:53 |
clarkb | looks like the entire cinder volume on eavesdrop is bind mounted for limnoria now. So we'd need to either expand that fs and split the fs tree for multiple mount points or use a second volume | 16:54 |
clarkb | in any case I think this is doable | 16:54 |
fungi | i can imagine a quick container restart where we put eavesdrop02 into emergency disable, mkdir /var/lib/bots and /var/lib/limnoria/limnoria, stop the bot, mv all the other files in /var/lib/limnoria into /var/lib/limnoria/limnoria, umount /var/lib/limnoria, mount it at /var/lib/bots, adjust the volume source path in the compose file, start the bot, merge a change to make the config | 17:05 |
fungi | permanent | 17:05 |
fungi | then we can put other mappings in /var/lib/bots and they'll be on the cinder volume | 17:06 |
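Spelled out as commands, that plan might look roughly like this (compose file path, device name, and exact invocation are assumptions; only the directory layout comes from the discussion above):

```shell
# eavesdrop02 goes into the emergency disable list first so ansible doesn't fight us
mkdir -p /var/lib/bots /var/lib/limnoria/limnoria
docker compose -f /etc/limnoria-docker/compose.yaml down      # assumed path; stop the bot
# move everything except the new subdirectory down one level (stays on the cinder volume)
find /var/lib/limnoria -mindepth 1 -maxdepth 1 ! -name limnoria \
    -exec mv -t /var/lib/limnoria/limnoria {} +
umount /var/lib/limnoria
mount /dev/vdb /var/lib/bots                                  # assumed cinder volume device
# compose volume source changes from /var/lib/limnoria to /var/lib/bots/limnoria
docker compose -f /etc/limnoria-docker/compose.yaml up -d
# ...then merge the system-config change that makes the new layout permanent
```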
fungi | *or* we could just not care because the mjolnir state data is likely tiny and the only reason limnoria has its own cinder volume is lack of sufficient space on the rootfs | 17:07 |
fungi | also losing state for mjolnir probably isn't catastrophic since spammers usually burn their accounts within minutes of using them and getting banned everywhere, but we can lean on backups for emergencies too | 17:09 |
clarkb | thats a good point | 17:10 |
clarkb | can probably just bind mount off the rootfs | 17:10 |
fungi | i wouldn't worry about moving it to the cinder volume unless it's a lot of data, which i have a hard time imagining it would be | 17:16 |
corvus | it looks like there were 3 post-failures for the image upload role switch change; the rest succeeded | 17:16 |
corvus | s/3/4/ math is hard | 17:16 |
corvus | "cloud": "defaults" | 17:17 |
corvus | that doesn't look right | 17:17 |
corvus | https://zuul.opendev.org/t/opendev/build/e9084d621659404ab67a9428efd418ff/log/job-output.txt#10462 | 17:17 |
fungi | in actuality we have enough room on the eavesdrop02 rootfs for limnoria's data too, but it's enough data (20gb in cinder for now while there's only 33gb free on the rootfs) that this is reasonable future-proofing | 17:18 |
corvus | https://zuul.opendev.org/t/opendev/build/e9084d621659404ab67a9428efd418ff/console oh that's a much better error actually | 17:19 |
corvus | openstack.exceptions.HttpException: HttpException: 499: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_ac0fed44dbe4539d83485bcefc4e2d4b/images-7b7d44d25aa9/e9084d621659404ab67a9428efd418ff-centos-9-stream-arm64.raw.zst/000001, Client DisconnectThe client was disconnected during request. | 17:19 |
clarkb | was it disconnected because the cloud details were wrong so auth failed? | 17:20
clarkb | I would've expected a different http code in that case but maybe that explains it? | 17:20 |
fungi | misbehaving middlebox, poorly-configured idle state timeout, or just a random network glitch | 17:20 |
corvus | i think the "cloud" error is a red herring and my fault; i think the real error is the Disconnect, and that's just a normal glitch. i think clarkb mentioned that our change to "retry" didn't cover all the cases, and i think maybe this is evidence of that, and it's still a problem | 17:23 |
fungi | the common interpretation for a 499 is that a client closed the connection before the server responded. this can be a webserver reporting a timeout proxying/calling to a wsgi process | 17:23 |
corvus | so, in short, i think this is not evidence that there is something wrong with the new copy of the role; i think we're just seeing sporadic errors we've previously seen | 17:23 |
fungi | maybe we're finding the less common cases now that they're not drowned out in other noise | 17:24 |
corvus | the "cloud" thing is because i tried to helpfully put in some debug data in error responses, probably based on code copied from the logs roles, and that code doesn't work because it's dereferencing a variable that doesn't exist. we're lucky it did that instead of just crashing, actually. | 17:24 |
corvus | fungi: yeah, i think this may be the only one we've seen that we don't have a solution for | 17:25 |
corvus | i think we speculated that to actually have requests/urllib retry due to this error code, we would need to do some intrusive work... it's not supported in the api. | 17:25 |
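For a sense of what that intrusive work might look like, a purely illustrative sketch (hypothetical helper, not the actual role code, and no claim that requests/urllib3 can be told to do this transparently): the segment body is a file that gets consumed on the first attempt, so a retry has to re-open it and re-drive the PUT at the application level.

```python
# Hypothetical application-level retry for a single swift segment upload.
# The point is only that the body must be re-opened per attempt; error
# classification (which failures are worth retrying) is glossed over.
import requests

def put_segment_with_retries(url, path, headers, attempts=3):
    for attempt in range(attempts):
        try:
            with open(path, "rb") as body:          # fresh body each attempt
                resp = requests.put(url, data=body, headers=headers, timeout=300)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise                               # out of attempts, re-raise
```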
corvus | oh i think i get it. since we're not using a clouds.yaml file, our cloud doesn't have a name. so i guess cloud.name on the cloud object we get back from openstacksdk is just defaults, meaning "whatever you gave me as parameters" | 17:27 |
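If that reading is right, the behaviour could be reproduced with something like the following (placeholder credentials; a sketch of the presumed openstacksdk behaviour, not code from the role):

```python
# Connecting with explicit parameters instead of a named cloud from
# clouds.yaml; the resulting cloud config gets the generic name "defaults",
# which is what shows up in the error output above.
import openstack

conn = openstack.connect(
    auth_url="https://keystone.example.org/v3",   # placeholder endpoint
    project_name="example-project",
    username="example-user",
    password="example-password",
    region_name="SJC3",
)
print(conn.config.name)   # expected: "defaults"
```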
clarkb | aha | 17:28 |
corvus | i think i'm convinced that a "recheck" is okay because everything is status quo | 17:28 |
clarkb | sounds good | 17:28 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop syncing run_tests/Vagrantfiles for OSA https://review.opendev.org/c/openstack/project-config/+/956944 | 17:28 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Use label-defaults https://review.opendev.org/c/opendev/zuul-providers/+/956946 | 17:35 |
corvus | no rush, but ^ is in response to a proposed niz syntax change... i wanted to go ahead and write the change for opendev so we can take a look at the result and see if it looks sane (to help evaluate the upstream zuul change). | 17:36 |
clarkb | I guess the idea is to reorganize to better reflect that those values apply to label boots and not the cloud itself? | 17:37
clarkb | similarly with the image defaults being image specific | 17:38 |
clarkb | syntax wise that seems fine | 17:38 |
corvus | yeah, and especially that they apply to labels and there might be different defaults for the same attribute that would apply to images. the depends-on commit message goes into it a bit. | 17:44 |
opendevreview | Merged opendev/system-config master: Reapply "Migrate statsd sidecar container images to quay.io" https://review.opendev.org/c/opendev/system-config/+/956828 | 18:09 |
fungi | i guess hourlies got in first | 18:10 |
clarkb | the promote jobs don't wait | 18:11 |
clarkb | once they are done and we confirm the new content on quay I'll approve the followup to pull the image from there | 18:11 |
clarkb | looks like they both updated | 18:12 |
clarkb | I've approved the other change (956829) | 18:12 |
fungi | ah, right, it's the second change that deploys anyway | 18:25 |
clarkb | Once that change merges and deploys I'm going to pop out for lunch | 19:08
fungi | yeah, i need to run some errands once it's in | 19:11 |
opendevreview | Merged opendev/system-config master: Pull the haproxy and zookeeper statsd sidecars from quay https://review.opendev.org/c/opendev/system-config/+/956829 | 19:34 |
fungi | deploy jobs are already starting | 19:35 |
clarkb | both haproxy-statsd containers have restarted on the new images | 19:37
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc shows a little gap then resumed data | 19:38 |
clarkb | the zookeeper statsd containers also restarted | 19:38 |
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-5m&to=now&timezone=utc the bottom of this dashboard has the zk metrics | 19:39 |
clarkb | I think the 19:39:10 set are post restart so this also looks good | 19:39 |
fungi | yeah, looks right so far | 19:40 |
clarkb | the deploy buildset reported success and I'm happy with the statsd grafana results | 19:41 |
clarkb | I'm going to grab lunch now | 19:41 |
fungi | yep, i think we're good | 19:42 |
fungi | i'm going to pop out to run some errands while the tourists are hopefully all out on the water | 19:42 |
fungi | looks like everything's still working | 21:59 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!