19:01:14 <clarkb> #startmeeting infra
19:01:14 <opendevmeet> Meeting started Tue Aug 16 19:01:14 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 <opendevmeet> The meeting name has been set to 'infra'
19:01:29 <clarkb> But if you watch #opendev you might not be able to tell the difference
19:01:38 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-August/000352.html Our Agenda
19:01:44 <clarkb> #topic Announcements
19:01:55 <clarkb> OpenDev service coordinator nominations end today.
19:02:10 <clarkb> I haven't seen any nominations yet. I'm taking that as a sign that everyone thinks I should do it again
19:03:18 <clarkb> I guess if no one else says they are interested I'll go ahead and make my own nomination official later this afternoon
19:04:52 <clarkb> #topic Topics
19:05:03 <clarkb> #topic Improving Grafana Management Tooling
19:05:33 <clarkb> I was going to call out https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/851955 as the last remaining change to close this out, but it looks like it merged sometime between when I sent the agenda and the start of this meeting
19:05:44 <clarkb> ianw: is there anything else to call out on this subject, or can we consider this completed?
19:06:21 <ianw> I think we can call it done, thanks
19:07:25 <clarkb> great, thank you for putting that together
19:07:33 <clarkb> #topic Bastion Host Updates
19:08:09 <clarkb> This got sidetracked by venv installation management iirc. Any other updates on this one?
19:09:02 <ianw> no, sorry, I've been sidetracked away from this, but it's 2nd on my todo list
19:09:18 <clarkb> No problem. I think those venv updates will be generally useful (as already indicated by the borg work)
19:09:30 <clarkb> #topic Upgrading Bionic servers to Focal/Jammy
19:09:31 <ianw> but yeah, I want to get things more isolated before we tackle the upgrade
19:09:53 <clarkb> that's a good lead into this subject. Things all end up related to each other.
19:09:59 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:10:20 <clarkb> In general though we're laying groundwork to make it possible to update to Jammy, as well as to make certain servers like bridge cleaner to update
19:12:17 <clarkb> Did anyone else have questions/concerns/plans for upgrades that they wanted to discuss?
19:12:57 <ianw> not really for me, I guess bridge is the first one I'd like to get into
19:13:37 <clarkb> onward then
19:13:44 <clarkb> #topic Mailman 3
19:13:50 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/851248 WIP change to deploy a mailman 3 instance
19:14:04 <clarkb> Much progress has been made with mailman 3 since our last meeting
19:14:32 <clarkb> In particular I believe that exim is properly configured to forward mail to mailman, and mailman and exim know how to send mail outbound
19:15:03 <clarkb> A permissions issue with xapian write locations was addressed, allowing hyperkitty to spin up successfully and list (empty) archives
19:15:34 <clarkb> I've also updated the testing to force mailman's hourly cron jobs to run during CI, which makes things like hyperkitty bootstrap properly
19:16:21 <clarkb> There are now two major things to work on. The first is testing the migrations of our existing lists from mailman 2 to mailman 3. My plan is to work with fungi on that using a held CI node
19:16:54 <clarkb> We want to make sure that both the data (subscribers and archives) migrate cleanly, as well as the configuration (handling dmarc, publicly accessible vs not, and so on)
19:17:54 <clarkb> The other big item is dealing with uid:gid mappings between the host and the containers. Currently I'm punting on this because there doesn't seem to be a good answer. But the tl;dr is that mailman is uid 100 in both mailman-web and mailman-core, and gid 100 in one of those and 65535 in the other
19:18:14 <clarkb> that makes mapping it onto users on the host a bit painful, particularly since the gids differ between the two images.
19:18:52 <clarkb> My current thinking on that is we can take the upstream images, inherit from them, and change the uid:gid pair on both images to be consistent, mapping them to something like 10500:10500, well out of the way of anything else on the system
19:19:11 <clarkb> That does mean we'll be building our own images though, which adds some overhead, but nothing we haven't done for other pieces of software.
19:19:19 <clarkb> If others have better ideas or suggestions I am all ears :)
19:19:38 <clarkb> Oh and finally there is a currently held node that you should feel free to poke at.
19:20:42 <ianw> hrm, is the uid/gid thing a bug?
19:21:15 <clarkb> it might be. There are actually a couple of other things I've modified locally that are probably worth filing against upstream. Worst case they tell us that they won't fix it
19:21:32 <clarkb> Why don't I work on filing those issues today before I take any drastic steps like building our own images based on theirs
19:22:12 <ianw> sounds sane, ... maybe just nobody has tried running them for real on the same host or something?
19:22:30 <frickler> at least finding out if there is some hidden reasoning behind those choices would also have been my suggestion
19:22:33 <clarkb> ya, or they use docker volumes exclusively and don't think about the mapping
19:23:41 <clarkb> I'll work on filing those today. The other issues are that ALLOWED_HOSTS is basically useless as it doesn't get interpolated into the django config properly, and that the django config hardcodes assumed docker hostnames
19:23:49 <ianw> wouldn't you still have the same issue if the volume was shared?
19:24:07 <clarkb> ianw: I think the uid overlap means it isn't an issue
19:24:20 <clarkb> definitely not very clean, and still not good to have uid 100 in a volume mapping to _apt on the host
19:24:35 <clarkb> one issue is if they change things on their side they may break anyone using the images :/
19:24:55 <clarkb> not likely to be easy to fix, but making people aware of it and double checking we aren't missing something makes it worthwhile
19:26:15 <clarkb> Anything else related to mailman 3? I'm happy to answer questions if people have them
19:27:47 <ianw> not for me, but thanks for working on it!
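A quick way to confirm the uid:gid mismatch clarkb describes above is to ask each upstream container what the mailman user maps to. This is only a sketch: the image names and tags below are assumptions inferred from the container names mentioned in the meeting and may not match what is actually deployed.

    # Print the uid/gid of the "mailman" user inside each upstream image.
    # --entrypoint bypasses the images' startup scripts so no database is needed.
    docker run --rm --entrypoint id maxking/mailman-core:latest mailman
    docker run --rm --entrypoint id maxking/mailman-web:latest mailman

If the mismatch is as described, both commands report uid 100 but differing gids, which is what makes picking a single matching host-side user and group awkward.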
19:27:58 <clarkb> #topic Gitea 1.17
19:28:05 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/847204
19:28:25 <clarkb> Reviews on that change would be appreciated, but I do think waiting for 1.17.1 to happen before we upgrade is worthwhile
19:28:50 <clarkb> 1.17.1's milestone has quite a few bugs that seem like good things to have fixed before we upgrade
19:28:56 <clarkb> #link https://github.com/go-gitea/gitea/milestone/122
19:29:13 <clarkb> I do think a lot of them don't affect us because we are disabling the new package repo system they added
19:30:12 <clarkb> But we don't have an immediate need to update and this release seems like it could use some improving. I'm happy to wait :)
19:30:34 <ianw> ++
19:31:23 <clarkb> There are a lot of breaking changes in the release notes that I've noted in the commit message and tried to explain how they affect us, if at all
19:31:38 <clarkb> The early review is worthwhile simply to get through all of that
19:33:40 <clarkb> #topic Open Discussion
19:33:46 <clarkb> That was everything on the agenda
19:33:49 <clarkb> Anything else?
19:35:20 <ianw> not for me, I've been a bit sidetracked into some zuul regression testing for the console streaming bits after I broke that with the zuul restart over the weekend
19:35:47 <clarkb> Tripleo has been having ansible ssh failures in their jobs. From what I can tell the ssh connections are from 127.0.0.1 to 127.0.0.2 as the zuul user. Journald seems to record that the ssh connection succeeds but then claims the remote closes the connection, and ansible says "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote
19:35:49 <clarkb> host\r\nConnection closed by 127.0.0.2 port 22", "unreachable": true
19:36:56 <clarkb> I know that ansible will report ssh errors when it cannot do its remote node bootstrapping due to full disks and similar problems
19:37:14 <clarkb> I suspect this is some sort of ansible-specific problem and that ssh itself is functioning, but that is still mostly a hunch at this point
19:38:16 <ianw> well I wasted about ... too many hours ... with a similar unreachable host
19:39:06 <clarkb> was that the fstrings thing?
19:39:19 <ianw> "ssh-keyscan localhost -p 2022" does *not* scan the ssh running on port 2022. "ssh-keyscan -p 2022 localhost" does. if you have an ssh on port 22, the first will give you the keys for that
19:39:43 <clarkb> oh fun
19:40:05 <ianw> and ansible will give you very little clue about host key mismatches
19:40:34 <clarkb> right, ansible's error reporting here is I think what is hampering further debugging. I've suggested they maybe increase verbosity to help sort it out
19:40:56 <ianw> yep, then you can see the ssh calls
19:43:40 <clarkb> Sounds like that may be everything?
19:44:35 <ianw> nothing more from me
19:45:00 <clarkb> Thank you everyone! We'll be back here next week at the same time and location
19:45:04 <clarkb> #endmeeting
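A minimal illustration of the ssh-keyscan argument-order pitfall ianw describes above. The port number comes from his example, and the behaviour assumes stock OpenSSH ssh-keyscan, which stops option parsing at the first host name:

    # Arguments after the first host name are not parsed as options; they are
    # treated as more host names, so this scans localhost on the default port
    # 22 and returns the keys of the wrong daemon.
    ssh-keyscan localhost -p 2022

    # Options placed before the host name take effect and scan port 2022.
    ssh-keyscan -p 2022 localhost

Running ansible with increased verbosity (e.g. -vvv) prints the underlying ssh invocations, which is the suggestion made above for narrowing down the Tripleo failures.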