*** mtreinish_ is now known as mtreinish | 11:00 | |
*** mtreinish_ is now known as mtreinish | 12:02 | |
fungi | are we even still deploying elastic-recheck? that should probably have been retired years ago when we took down the elasticsearch cluster | 12:28 |
---|---|---|
clarkb | fungi: I'm not sure if the opensearch setup is still running elastic-recheck or not | 14:51 |
clarkb | I think we can retire the project if not | 14:51 |
fungi | i thought it was all rewritten from scratch | 14:52 |
fungi | i haven't seen dpawlik around for a while to ask though | 14:52 |
*** darmach43 is now known as darmach4 | 15:39 | |
clarkb | infra-root any objections to me proceeding with https://review.opendev.org/c/opendev/system-config/+/956828 then https://review.opendev.org/c/opendev/system-config/+/956829 ? | 15:58 |
fungi | none on my part, i just didn't want to approve them until you were settled in for the morning | 16:18 |
clarkb | cool I've eaten breakfast and taken care of paperwork I needed to get done. I'll approve the first one now | 16:19
fungi | i feel like i have an acre of paperwork every week | 16:20 |
clarkb | I've been reading up on mjolnir and haven't found any indication from EMS docs that they can run one for you (though I haven't logged into the control dashboard to confirm that this is the case). But they publish a container image that you supply a config to with account credentials for your homeserver, and that account gets configured as administrator on your channels. Then you add | 16:47
clarkb | the bot to the channels as well as a private moderation channel which is where the humans can receive reports. I think they can also send commands to the bot there or dm the bot | 16:47 |
clarkb | in the past you also needed to use a proxy to join encrypted channels but now mjolnir supports that natively. | 16:48 |
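As a rough illustration of the deployment model described above (image name, config file location, and key names are taken from a quick read of the linked upstream docs and are assumptions here, not a vetted setup):

```shell
# Hypothetical sketch: mjolnir reads a production.yaml from its bind-mounted
# data directory; the config names the bot account's homeserver/credentials
# and the private moderation ("management") room, and all persistent state
# stays under the same directory.
mkdir -p /var/lib/mjolnir/config
cat > /var/lib/mjolnir/config/production.yaml <<'EOF'
homeserverUrl: https://matrix.example.org        # placeholder homeserver
accessToken: "BOT-ACCOUNT-ACCESS-TOKEN"          # the moderator account's token
managementRoom: "#moderation:example.org"        # private room for reports/commands
dataPath: /data/storage                          # state (ban lists, etc.) lives here
EOF
docker run -d --name mjolnir -v /var/lib/mjolnir:/data matrixdotorg/mjolnir:latest
```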
fungi | so in theory we could add it to the fleet of bot containers on eavesdrop | 16:51 |
clarkb | ya that is what I'm thinking. I think it does store some data (ban lists for example) but that should be minimal. Doesn't seem to use a database either | 16:51 |
clarkb | or if it does it's something like a sqlite file directly on disk, not a separate service | 16:51
fungi | well, if it stores something then it's using some kind of database, yeah | 16:52 |
fungi | it might just be a flat text file, but that's still a sort of database | 16:52 |
clarkb | https://github.com/matrix-org/mjolnir/blob/main/docs/setup_docker.md ya I meant more that we don't need additional services based on this documentation | 16:52 |
clarkb | just a data volume | 16:52 |
fungi | perfect | 16:52 |
fungi | we have those | 16:53 |
clarkb | looks like the entire cinder volume on eavesdrop is bind mounted for limnoria now. So we'd need to either expand that fs and split the fs tree for multiple mount points or use a second volume | 16:54 |
clarkb | in any case I think this is doable | 16:54 |
fungi | i can imagine a quick container restart where we put eavesdrop02 into emergency disable, mkdir /var/lib/bots and /var/lib/limnoria/limnoria, stop the bot, mv all the other files in /var/lib/limnoria into /var/lib/limnoria/limnoria, umount /var/lib/limnoria, mount it at /var/lib/bots, adjust the volume source path in the compose file, start the bot, merge a change to make the config | 17:05 |
fungi | permanent | 17:05 |
fungi | then we can put other mappings in /var/lib/bots and they'll be on the cinder volume | 17:06 |
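Spelled out as commands, that plan might look roughly like this (compose file path, device name, and exact invocation are assumptions; only the directory layout comes from the discussion above):

```shell
# eavesdrop02 goes into the emergency disable list first so ansible doesn't fight us
mkdir -p /var/lib/bots /var/lib/limnoria/limnoria
docker compose -f /etc/limnoria-docker/compose.yaml down      # assumed path; stop the bot
# move everything except the new subdirectory down one level (stays on the cinder volume)
find /var/lib/limnoria -mindepth 1 -maxdepth 1 ! -name limnoria \
    -exec mv -t /var/lib/limnoria/limnoria {} +
umount /var/lib/limnoria
mount /dev/vdb /var/lib/bots                                  # assumed cinder volume device
# compose volume source changes from /var/lib/limnoria to /var/lib/bots/limnoria
docker compose -f /etc/limnoria-docker/compose.yaml up -d
# ...then merge the system-config change that makes the new layout permanent
```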
fungi | *or* we could just not care because the mjolnir state data is likely tiny and the only reason limnoria has its own cinder volume is lack of sufficient space on the rootfs | 17:07 |
fungi | also losing state for mjolnir probably isn't catastrophic since spammers usually burn their accounts within minutes of using them and getting banned everywhere, but we can lean on backups for emergencies too | 17:09 |
clarkb | thats a good point | 17:10 |
clarkb | can probably just bind mount off the rootfs | 17:10 |
fungi | i wouldn't worry about moving it to the cinder volume unless it's a lot of data, which i have a hard time imagining it would be | 17:16 |
corvus | it looks like there were 3 post-failures for the image upload role switch change; the rest succeeded | 17:16 |
corvus | s/3/4/ math is hard | 17:16 |
corvus | "cloud": "defaults" | 17:17 |
corvus | that doesn't look right | 17:17 |
corvus | https://zuul.opendev.org/t/opendev/build/e9084d621659404ab67a9428efd418ff/log/job-output.txt#10462 | 17:17 |
fungi | in actuality we have enough room on the eavesdrop02 rootfs for limnoria's data too, but it's enough data (20gb in cinder for now while there's only 33gb free on the rootfs) that this is reasonable future-proofing | 17:18 |
corvus | https://zuul.opendev.org/t/opendev/build/e9084d621659404ab67a9428efd418ff/console oh that's a much better error actually | 17:19 |
corvus | openstack.exceptions.HttpException: HttpException: 499: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_ac0fed44dbe4539d83485bcefc4e2d4b/images-7b7d44d25aa9/e9084d621659404ab67a9428efd418ff-centos-9-stream-arm64.raw.zst/000001, Client DisconnectThe client was disconnected during request. | 17:19 |
clarkb | was it disconnected because the cloud details were wrong so auth failed? | 17:20
clarkb | I would've expected a different http code in that case but maybe that explains it? | 17:20 |
fungi | misbehaving middlebox, poorly-configured idle state timeout, or just a random network glitch | 17:20 |
corvus | i think the "cloud" error is a red herring and my fault; i think the real error is the Disconnect, and that's just a normal glitch. i think clarkb mentioned that our change to "retry" didn't cover all the cases, and i think maybe this is evidence of that, and it's still a problem | 17:23 |
fungi | the common interpretation for a 499 is that a client closed the connection before the server responded. this can be a webserver reporting a timeout proxying/calling to a wsgi process | 17:23 |
corvus | so, in short, i think this is not evidence that there is something wrong with the new copy of the role; i think we're just seeing sporadic errors we've previously seen | 17:23 |
fungi | maybe we're finding the less common cases now that they're not drowned out in other noise | 17:24 |
corvus | the "cloud" thing is because i tried to helpfully put in some debug data in error responses, probably based on code copied from the logs roles, and that code doesn't work because it's dereferencing a variable that doesn't exist. we're lucky it did that instead of just crashing, actually. | 17:24 |
corvus | fungi: yeah, i think this may be the only one we've seen that we don't have a solution for | 17:25 |
corvus | i think we speculated that to actually have requests/urllib retry due to this error code, we would need to do some intrusive work... it's not supported in the api. | 17:25 |
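For a sense of what that intrusive work might look like, a purely illustrative sketch (hypothetical helper, not the actual role code, and no claim that requests/urllib3 can be told to do this transparently): the segment body is a file that gets consumed on the first attempt, so a retry has to re-open it and re-drive the PUT at the application level.

```python
# Hypothetical application-level retry for a single swift segment upload.
# The point is only that the body must be re-opened per attempt; error
# classification (which failures are worth retrying) is glossed over.
import requests

def put_segment_with_retries(url, path, headers, attempts=3):
    for attempt in range(attempts):
        try:
            with open(path, "rb") as body:          # fresh body each attempt
                resp = requests.put(url, data=body, headers=headers, timeout=300)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise                               # out of attempts, re-raise
```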
corvus | oh i think i get it. since we're not using a clouds.yaml file, our cloud doesn't have a name. so i guess cloud.name on the cloud object we get back from openstacksdk is just defaults, meaning "whatever you gave me as parameters" | 17:27 |
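If that reading is right, the behaviour could be reproduced with something like the following (placeholder credentials; a sketch of the presumed openstacksdk behaviour, not code from the role):

```python
# Connecting with explicit parameters instead of a named cloud from
# clouds.yaml; the resulting cloud config gets the generic name "defaults",
# which is what shows up in the error output above.
import openstack

conn = openstack.connect(
    auth_url="https://keystone.example.org/v3",   # placeholder endpoint
    project_name="example-project",
    username="example-user",
    password="example-password",
    region_name="SJC3",
)
print(conn.config.name)   # expected: "defaults"
```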
clarkb | aha | 17:28 |
corvus | i think i'm convinced that a "recheck" is okay because everything is status quo | 17:28 |
clarkb | sounds good | 17:28 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop syncing run_tests/Vagrantfiles for OSA https://review.opendev.org/c/openstack/project-config/+/956944 | 17:28 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Use label-defaults https://review.opendev.org/c/opendev/zuul-providers/+/956946 | 17:35 |
corvus | no rush, but ^ is in response to a proposed niz syntax change... i wanted to go ahead and write the change for opendev so we can take a look at the result and see if it looks sane (to help evaluate the upstream zuul change). | 17:36 |
clarkb | I guess the idea is to reorganize to better reflect that those values apply to label boots and not the cloud itself? | 17:37
clarkb | similarly with the image defaults being image specific | 17:38 |
clarkb | syntax wise that seems fine | 17:38 |
corvus | yeah, and especially that they apply to labels and there might be different defaults for the same attribute that would apply to images. the depends-on commit message goes into it a bit. | 17:44 |
opendevreview | Merged opendev/system-config master: Reapply "Migrate statsd sidecar container images to quay.io" https://review.opendev.org/c/opendev/system-config/+/956828 | 18:09 |
fungi | i guess hourlies got in first | 18:10 |
clarkb | the promote jobs don't wait | 18:11 |
clarkb | once they are done and we confirm the new content on quay I'll approve the followup to pull the image from there | 18:11 |
clarkb | looks like they both updated | 18:12 |
clarkb | I've approved the other change (956829) | 18:12 |
fungi | ah, right, it's the second change that deploys anyway | 18:25 |
clarkb | Once that change merges and deploys I'm going to pop out for lunch | 19:08
fungi | yeah, i need to run some errands once it's in | 19:11 |
opendevreview | Merged opendev/system-config master: Pull the haproxy and zookeeper statsd sidecars from quay https://review.opendev.org/c/opendev/system-config/+/956829 | 19:34 |
fungi | deploy jobs are already starting | 19:35 |
clarkb | both haproxy-statsd containers have restarted on the new images | 19:37
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc shows a little gap then resumed data | 19:38 |
clarkb | the zookeeper statsd containers also restarted | 19:38 |
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-5m&to=now&timezone=utc the bottom of this dashboard has the zk metrics | 19:39 |
clarkb | I think the 19:39:10 set are post restart so this also looks good | 19:39 |
fungi | yeah, looks right so far | 19:40 |
clarkb | the deploy buildset reported success and I'm happy with the statsd grafana results | 19:41 |
clarkb | I'm going to grab lunch now | 19:41 |
fungi | yep, i think we're good | 19:42 |
fungi | i'm going to pop out to run some errands while the tourists are hopefully all out on the water | 19:42 |
fungi | looks like everything's still working | 21:59 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!