19:01:14 <clarkb> #startmeeting infra
19:01:14 <opendevmeet> Meeting started Tue Aug 16 19:01:14 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 <opendevmeet> The meeting name has been set to 'infra'
19:01:29 <clarkb> But if you watch #opendev you might not be able to tell the difference
19:01:38 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-August/000352.html Our Agenda
19:01:44 <clarkb> #topic Announcements
19:01:55 <clarkb> OpenDev service coordinator nominations end today.
19:02:10 <clarkb> I haven't seen any nominations yet. I'm taking that as a sign that everyone thinks I should do it again
19:03:18 <clarkb> I guess if no one else says they are interested I'll go ahead and make my own nomination official later this afternoon
19:04:52 <clarkb> #topic Topics
19:05:03 <clarkb> #topic Improving Grafana Management Tooling
19:05:33 <clarkb> I was going to call out https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/851955 as the last remaining change to close this out, but it looks like it merged sometime between when I sent the agenda and when our meeting started
19:05:44 <clarkb> ianw: is there anything else to call out on this subject or can we consider this completed?
19:06:21 <ianw> I think we can call it done, thanks
19:07:25 <clarkb> great, thank you for putting that together
19:07:33 <clarkb> #topic Bastion Host Updates
19:08:09 <clarkb> This got sidetracked by venv installation management iirc. Any other updates on this one?
19:09:02 <ianw> no, sorry, i've been sidetracked away from this, but it's 2nd on my todo list
19:09:18 <clarkb> No problem. I think those venv updates will be generally useful (as already indicated by the borg work)
19:09:30 <clarkb> #topic Upgrading Bionic servers to Focal/Jammy
19:09:31 <ianw> but yeah, i want to get things more isolated before we tackle the upgrade
19:09:53 <clarkb> that's a good lead-in to this subject. Things all end up related to each other.
19:09:59 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:10:20 <clarkb> In general though we're laying groundwork to make it possible to update to Jammy as well as make certain servers like bridge cleaner to update
19:12:17 <clarkb> Did anyone else have questions/concerns/plans for upgrades that they wanted to discuss?
19:12:57 <ianw> not really for me, i guess bridge is the first one I'd like to get into
19:13:37 <clarkb> onward then
19:13:44 <clarkb> #topic Mailman 3
19:13:50 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/851248 WIP change to deploy a mailman 3 instance
19:14:04 <clarkb> Much progress has been made with mailman 3 since our last meeting
19:14:32 <clarkb> In particular I believe that exim is properly configured to forward mail to mailman, and both mailman and exim know how to send mail outbound
19:15:03 <clarkb> A permissions issue with xapian write locations was addressed allowing hyperkitty to spin up successfully and list (empty) archives
19:15:34 <clarkb> I've also updated the testing to force mailman's hourly cron jobs to run during CI which makes things like hyperkitty bootstrap properly
19:16:21 <clarkb> There are now two major things to work on. The first is testing the migrations of our existing lists from mailman 2 to mailman 3. My plan is to work with fungi on that using a held CI node
19:16:54 <clarkb> We want to make sure that both the data (subscribers and archives) migrate cleanly as well as the configuration (handling dmarc, publicly accessible vs not, and so on)
19:17:54 <clarkb> The other big item is dealing with uid:gid mappings between the host and the containers. Currently I'm punting on this because there doesn't seem to be a good answer. But the tl;dr is that mailman is uid 100 in both mailman-web and mailman-core, and gid 100 in one of those and 65535 in the other
19:18:14 <clarkb> that makes mapping it onto users on the host a bit painful, particularly since the gids differ between the two images.
19:18:52 <clarkb> My current thinking on that is we can take the upstream images, inherit from them and change the uid:gid pair on both images to be consistent and also map them to something like 10500:10500 well out of the way of anything else on the system
19:19:11 <clarkb> That does mean we'll be building our own images though which adds some overhead, but nothing we haven't done for other pieces of software.
19:19:19 <clarkb> If others have better ideas or suggestions I am all ears :)
19:19:38 <clarkb> Oh and finally there is a currently held node that you should feel free to poke at.
19:20:42 <ianw> hrm, is the uid/gid thing a bug?
19:21:15 <clarkb> it might be. There are actually a couple of other things I've modified locally that are probably worth filing against upstream. Worst case they tell us that they won't fix it
19:21:32 <clarkb> Why don't I work on filing those issues today before I take any drastic steps like building our own images based on theirs
19:22:12 <ianw> sounds sane, ... maybe just nobody has tried running them for real on the same host or something?
19:22:30 <frickler> at least finding out if there is some hidden reasoning behind those choices would also have been my idea
19:22:33 <clarkb> ya or they use docker volumes exclusively and don't think about the mapping
19:23:41 <clarkb> I'll work on filing those today. The other issues are that ALLOWED_HOSTS is basically useless (it doesn't get interpolated into the django config properly) and that the django config hardcodes assumed docker hostnames
19:23:49 <ianw> wouldn't you still have the same issue if the volume was shared?
19:24:07 <clarkb> ianw: I think the uid overlap means it isn't an issue
19:24:20 <clarkb> definitely not very clean and still not good to have uid 100 in a volume mapping to _apt on the host
19:24:35 <clarkb> one issue is if they change things on their side they may break anyone using the images :/
19:24:55 <clarkb> not likely to be easy to fix, but making people aware of it and double checking we aren't missing something makes it worthwhile
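(A minimal sketch of the uid:gid remap described above, as it might appear in a build step of a derived image; the "mailman" account name, the /opt/mailman path, and the exact upstream image contents are assumptions, so the real steps would need to match what the upstream mailman-core and mailman-web images actually ship:)

    # Remap the mailman account in the upstream image to a consistent,
    # out-of-the-way uid:gid; editing /etc/passwd and /etc/group directly
    # avoids depending on usermod/groupmod being present in the base image.
    sed -i 's|^mailman:x:100:[0-9]*:|mailman:x:10500:10500:|' /etc/passwd
    sed -i 's|^mailman:x:[0-9]*:|mailman:x:10500:|' /etc/group
    # Re-own anything the image pre-created under the old ids
    # (the path is an assumption about the image layout).
    chown -R 10500:10500 /opt/mailman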
19:26:15 <clarkb> Anything else related to mailman 3? I'm happy to answer questions if people have them
19:27:47 <ianw> not for me, but thanks for working on it!
19:27:58 <clarkb> #topic Gitea 1.17
19:28:05 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/847204
19:28:25 <clarkb> Reviews on that change would be appreciated, but I do think waiting for 1.17.1 to happen before we upgrade is worthwhile
19:28:50 <clarkb> 1.17.1's milestone has quite a few bugs that seem like good things to have fixed before we upgrade
19:28:56 <clarkb> #link https://github.com/go-gitea/gitea/milestone/122
19:29:13 <clarkb> I do think a lot of them don't affect us because we are disabling the new package repo system they added
19:30:12 <clarkb> But we don't have an immediate need to update and this release seems like it could use some improving. I'm happy to wait :)
19:30:34 <ianw> ++
19:31:23 <clarkb> There are a lot of breaking changes in the release notes that I've noted in the commit message and tried to explain how they affect us, if at all
19:31:38 <clarkb> The early review is worthwhile simply to get through all of that
19:33:40 <clarkb> #topic Open Discussion
19:33:46 <clarkb> That was everything on the agenda
19:33:49 <clarkb> Anything else?
19:35:20 <ianw> not for me, i've been a bit side-tracked into some zuul regression testing for the console streaming bits after i broke that with the zuul restart on the weekend
19:35:47 <clarkb> Tripleo has been having ansible ssh failures in their jobs. From what I can tell the ssh connections are from 127.0.0.1 to 127.0.0.2 as the zuul user. Journald seems to record that the ssh connection succeeds but then claims the remote closes the connection, and ansible says "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote
19:35:49 <clarkb> host\r\nConnection closed by 127.0.0.2 port 22", "unreachable": true
19:36:56 <clarkb> I know that ansible will report ssh errors when it cannot do its remote node bootstrapping due to full disks and similar problems
19:37:14 <clarkb> I suspect this is some sort of ansible specific problem and that ssh itself is functioning, but that is still mostly a hunch at this point
19:38:16 <ianw> well i wasted about ... too many hours ... with a similar unreachable host
19:39:06 <clarkb> was that the fstrings thing?
19:39:19 <ianw> "ssh-keyscan localhost -p 2022" does *not* scan the ssh running on port 2022.  "ssh-keyscan -p 2022 localhost" does.  if you have a ssh on port 22, the first will give you the keys for that
19:39:43 <clarkb> oh fun
19:40:05 <ianw> and ansible will give you very little clue about host key mismatches
19:40:34 <clarkb> right, ansible's error reporting here is I think what is hampering further debugging. I've suggested they maybe increase verbosity to help sort it out
19:40:56 <ianw> yep then you can see the ssh calls
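(For reference, the ssh-keyscan argument-order behavior ianw describes above: options placed after the host name are not applied, so the first form scans the default port 22 rather than 2022:)

    # scans port 22 -- the trailing "-p 2022" is not applied to localhost
    ssh-keyscan localhost -p 2022
    # scans port 2022 as intended
    ssh-keyscan -p 2022 localhost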
19:43:40 <clarkb> Sounds like that may be everything?
19:44:35 <ianw> nothing more from me
19:45:00 <clarkb> Thank you everyone! We'll be back here next week at the same time and location
19:45:04 <clarkb> #endmeeting