19:01:16 <clarkb> #startmeeting infra
19:01:16 <opendevmeet> Meeting started Tue Oct 25 19:01:16 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <opendevmeet> The meeting name has been set to 'infra'
19:01:25 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000369.html Our Agenda
19:01:46 <clarkb> #topic Announcements
19:01:52 <clarkb> No announcements so we can dive right in
19:02:10 <clarkb> #topic Bastion Host Changes
19:02:38 <clarkb> ianw: you've made a bunch of progress on this both with the zuul console log files and the virtualenv and upgrade work
19:03:12 <ianw> yep in short there is one change that is basically s/bridge.openstack.org/bridge01.opendev.org/ -> https://review.opendev.org/c/opendev/system-config/+/861112
19:03:21 <ianw> the new host is ready
19:03:23 <clarkb> ianw: at this point do we expect that we won't have any console logs written to the host? we updated the base jobs repo and system-config? Have we deleted the old files?
19:03:53 <ianw> oh, in terms of the console logs in /tmp -- yep they should be gone and i removed all the old files
19:04:13 <clarkb> I guess that is less important for bridge as we're replacing the host. But for static that is important
19:04:15 <clarkb> also great
19:04:16 <ianw> on bridge and static
19:04:45 <clarkb> For the bridge replacement I saw there were a couple of struggles with the overlap between testing and prod. Are any of those worth digging into?
19:05:22 <ianw> not at this point -- it was all about trying to minimise the number of places we hardcode literal "bridge.openstack.org"
19:05:47 <ianw> i think I have it down to about the bare minimum; so 861112 is basically it
19:06:02 <clarkb> For the new server the host vars and group vars and secrets files are moved over?
19:06:13 <clarkb> (since that requires a manual step)
19:06:21 <ianw> no, so i plan on doing that today if no objections
19:06:37 <ianw> there's a few manual steps -- copying the old secrets, and setting up zuul login
19:06:53 <ianw> and i am 100% sure there is something forgotten that will be revealed when we actually try it
19:07:06 <clarkb> ya I think the rough order of operations should be copying that content over, ask other roots to double check things and then land https://review.opendev.org/c/opendev/system-config/+/861112 ?
19:07:10 <ianw> but i plan to keep notes and add a small checklist for migrating bridge to system-config docs
19:07:14 <clarkb> ++
19:07:53 <clarkb> if we want we can do a pruning pass of that data first too (since we may have old hosts var files or similar)
19:07:56 <ianw> yep, that is about it
19:08:01 <clarkb> but that seems less critical and can be done on the new host afterwards too
19:08:15 <clarkb> ok sounds good to me
19:08:32 <ianw> yeah i think at this point i'd like to get the migration done and prod jobs working on it -- then we can move over any old ~ data and prune, etc.
19:09:01 <clarkb> anything else on this topic?
19:09:13 <ianw> nope, hopefully next week it won't be a topic! :)
19:09:13 <corvus> (and try to reconstruct our venvs!  ;)
19:09:27 <fungi> that'll be the hardest part!
19:09:29 <clarkb> it may be worth keeping the old bridge around for a bit too just in case
19:09:40 <clarkb> thank you for pushing this along. Great progress
19:09:40 <ianw> corvus: i've got a change out for us to have launch node in a venv setup by system-config, so we don't need separate ones
19:09:51 <corvus> ++
19:09:57 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/861284
19:09:58 <clarkb> #topic Upgrading Bionic Servers
19:10:18 <clarkb> We have our first jammy server in production. gitea-lb02 which fronts opendev.org
19:10:52 <clarkb> This server was booted in vexxhost which does/did not have a jammy image already. I took ubuntu's published image and converted it to raw and uploaded that to vexxhost
19:11:02 <clarkb> I did the raw conversion for maximum compatibility with vexxhost ceph
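For reference, a minimal sketch of that conversion and upload flow, assuming Ubuntu's stock jammy cloud image; exact file names, image names, and properties will vary per cloud:

    # grab Ubuntu's published jammy cloud image (qcow2) and convert it to raw
    wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
    qemu-img convert -f qcow2 -O raw jammy-server-cloudimg-amd64.img jammy-server-cloudimg-amd64.raw

    # upload the raw file so the cloud's ceph backend can consume it natively
    openstack image create --disk-format raw --container-format bare \
        --file jammy-server-cloudimg-amd64.raw ubuntu-jammy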
19:11:27 <clarkb> That seems to be working fine. But did require a modern paramiko in a venv to do ssh as jammy ssh seems to not want to do rsa + sha1
19:11:47 <clarkb> I thought about updating launch node to use an ed25519 key instead but paramiko doesn't have key generation routines for that key type like it does rsa
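As an aside, generating an ed25519 key outside of paramiko is easy with ssh-keygen (paramiko can still load and use the result), so that remains an option later; the path and comment below are illustrative:

    # generate an unencrypted ed25519 keypair for launch-node to use
    ssh-keygen -t ed25519 -N '' -f /tmp/launch-key -C 'launch-node temporary key'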
19:12:08 <clarkb> Anyway it mostly works except for the paramiko thing. I don't think there is much to add to this other than that ianw's bridge work should hopefully mitigate some of this
19:12:18 <clarkb> Otherwise I think we can go ahead and launch jammy nodes
19:12:31 <clarkb> #topic Removing snapd
19:12:34 <ianw> ++ the new bridge is jammy too
19:13:16 <clarkb> When doing the new jammy node I noticed that we don't remove snapd which is something I thought we were doing. Fungi did some excellent git history investigating and discovered we did remove snapd at one time but stopped so that we could install the kubectl snap
19:13:45 <clarkb> We aren't currently using kubectl for anything in production and even if we were I think we could find a different install method. This makes me wonder if we should go back to removing snapd?
19:14:03 <clarkb> I don't think we need to make a hard decision here in the meeting but wanted to call it out as something to think about and if you have thoughts I'm happy for them to be shared
19:14:24 <fungi> also we only needed to stop removing it from the server(s) where we installed kubectl
19:14:41 <fungi> and also there now seem to be more sane ways of installing an updated kubectl anyway
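As a rough sketch of what going back to removing snapd could look like, together with one of the saner kubectl install paths (upstream's static binary); this is illustrative rather than what our ansible currently does:

    # purge snapd and the seeded snaps that ship with the image
    sudo apt-get purge -y snapd
    sudo rm -rf /var/cache/snapd /root/snap

    # if kubectl is ever needed again, upstream publishes a static binary
    curl -LO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -m 0755 kubectl /usr/local/bin/kubectl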
19:15:22 <ianw> i hit something tangentially related with the screenshots -- i wanted to use firefox on the jammy hosts but the geckodriver  bits don't work because firefox is embedded in a snap
19:16:16 <ianw> which -- i guess i get why you want your browser sandboxed.  but it's also quite a departure from the traditional idea of a packaged system
19:16:27 <clarkb> probably something that deserves a bit more investigation to understand its broader impact then
19:16:50 <clarkb> I'll try to make time for that. One thing that might be good is listing snaps for which there aren't packages that we might end up using like kubectl or firefox
19:17:22 <clarkb> And then take it from there
19:17:29 <clarkb> #topic Mailman 3
19:17:33 <clarkb> Moving along so we don't run out of time
19:18:04 <clarkb> fungi: I think our testing is largely complete at this point. Are we ready to boot a new jammy server and if so have we decided where it should live?
19:18:53 <fungi> if folks are generally satisfied with our forked image strategy, yeah i guess next steps are deciding where to boot it and then booting it and getting it excluded from blocklists if needed
19:19:23 <clarkb> at this point I still haven't heard from the upstream image maintainer. I do think we should probably accept that we'll need to maintain our own images at least for now
19:19:31 <fungi> once we have ip addresses for the server, we can include those in communications around migration planning for lists.opendev.org and lists.zuul-ci.org as our first sites to move
19:20:01 <clarkb> re hosting location it occurred to me that we can't easily get reverse dns records outside of rax which makes me think rax is the best location for a mail server
19:21:11 <clarkb> But I think we could also host it in vexxhost if mnaser doesn't have concerns with email flowing through his IPs and he is willing to edit dns records for us
19:21:13 <fungi> perhaps, but rackspace also preemptively places their netblocks on the sbl
19:21:26 <fungi> which makes them less great for it
19:21:38 <fungi> er, on the pbl i mean
19:21:43 <clarkb> ya so maybe step 0 is send a feeler to mnaser about it
19:21:45 <fungi> (spamhaus policy blocklist)
19:21:57 <clarkb> to figure out how problematic the dns records and email traffic would be
19:22:00 <corvus> i think that's normal/expected behavior
19:22:04 <corvus> and removal from pbl is easy?
19:22:16 <fungi> exclusion from pbl used to be easier
19:22:20 <corvus> i think vexxhost can do reverse dns by request
19:22:38 <fungi> now they require you to periodically renew your pbl exclusion and there's no way to find out when it will run out that i can find
19:22:41 <corvus> is it not easy?  i thought it was click-and-done
19:23:09 <corvus> ah :(
19:23:34 <ianw> for review02 we did have to ask mnaser, but it was also easy :)
19:23:41 <clarkb> from our end being able to set reverse dns records was what came to mind. Sounds like pbl is also worth considering
19:23:57 <ianw> so there's already a lot of mail coming out of that
19:24:26 <fungi> corvus: at least i recall spotting that change recently, looking now for a clear quote i can link
19:25:46 <corvus> (either place seems good to me; seems like nothing's perfect)
19:26:01 <clarkb> I guess the two todos are for people to weigh in on whether or not we're comfortable with forked images and specify if they have a strong preference for hosting location
19:26:11 <clarkb> I agree sounds like we'll just deal with different things in either location
19:26:40 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/860157 Change to fork upstream mailman3 docker images
19:26:45 <fungi> i concur
19:26:46 <clarkb> maybe drop your thoughts there?
19:27:07 <fungi> also i suppose merging those changes will be a prerequisite to booting the new server
19:27:16 <fungi> there's a series of several
19:27:29 <clarkb> fungi: we can boot the new server first, it just won't do much until changes land
19:27:39 <clarkb> but I don't think the boot order is super important here
19:27:40 <fungi> good point
19:28:48 <clarkb> ok lets move on. Please leave thoughts on the change otherwise I expect we'll proceed
19:28:58 <clarkb> #topic Switching our base job nodeset to Jammy
19:29:04 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/862624
19:29:10 <clarkb> today is the day we said we would make this swap
19:29:26 <fungi> yeah, we can merge it after the meeting wraps up
19:29:40 <clarkb> ++ Mostly a heads up that this is changing and to be on the lookout for fallout
19:30:25 <clarkb> I did find a place in zuul-jobs that would likely break which was python3.8 jobs running without a nodeset specifier
19:30:36 <fungi> if anyone else wants to review that three-line change before we approve it, you have roughly half an hour
19:30:38 <clarkb> I expect that sort of thing to be the bulk of what we run into
19:31:30 <clarkb> #topic Updating our base python images to use pip wheel
19:31:59 <clarkb> About a week ago Nodepool could no longer build its container images. The issue was that we weren't using wheels built by the builder in the prod image
19:32:18 <clarkb> after a bunch of debugging it basically came down to pip 22.3 changed the location it caches wheels compared to 22.2.2 and prior
19:32:40 <clarkb> I think this is actually a pip bug (because it reduces the file integrity assertions that existed previously)
19:32:49 <fungi> or rather the layout of the cache directory
19:32:52 <fungi> changed
19:32:58 <clarkb> ya
19:32:59 <clarkb> #link https://github.com/pypa/pip/issues/11527
19:33:07 <clarkb> #link https://github.com/pypa/pip/pull/11538
19:33:34 <clarkb> I filed an issue upstream and wrote a patch. The patch is currently not passing CI due to a different git change (that zuul also ran into) that impacts their test suite. They've asked if I want to write a patch for that too but I haven't found time yet
19:33:58 <clarkb> Anyway part of the fallout from this is that pip says we shouldn't use the cache that way as it's more of an implementation detail for pip, which is a reasonable position
19:34:29 <clarkb> Their suggestion is to use `pip wheel` instead and explicitly fetch/build wheels and use them that way
19:34:34 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/862152
19:35:02 <clarkb> that change updates our base images to do this. I've tested it with a change to nodepool and diskimage builder which helps to exercise that the modifications actually work without breaking sibling installs and extras installs
19:35:38 <clarkb> This shouldn't actually change our images much, but should make our build process more reliable in the future
19:35:45 <clarkb> reviews and concerns appreciated.
19:35:52 <fungi> tonyb is looking into doing something similar with rewrites of the wheel cache builder jobs, i think
19:36:07 <fungi> and constraints generation jobs more generally
19:36:29 <clarkb> The other piece of feedback that came out of this is that other people do similar but instead of creating a wheel cache and copying that and performing another install on the prod image they do a pip install --user on the builder side then just copy over $USER/.local to the prod image
19:36:45 <clarkb> this has the upside of not needing wheel files in the final image which reduces the final image size
19:37:04 <fungi> as long as the path remains the same, right?
19:37:07 <clarkb> I think we should consider doing that as well, but all of our consuming images would need to be updated to find the executables in the local dir or a virtualenv
19:37:29 <clarkb> fungi: yes it only works if the two sides stay in sync for python versions (something we already attempt to do) and paths
19:37:32 <ianw> that would be ... interesting
19:37:37 <ianw> i think most would not
19:37:44 <fungi> well, but also venvs aren't supposed to be relocateable
19:37:49 <clarkb> it's the "and paths" bit that makes it difficult for us to transition as we'd need to update the consuming images
19:37:49 <ianw> (find things in a venv)
19:38:04 <clarkb> fungi: yes, except in this case they aren't relocating as far as they are concerned everything stays in the same spot
19:38:31 <fungi> the path of the venv inside the container image would need to be the same as where they're copied from on the host where they're built?
19:38:47 <fungi> or maybe i'm misunderstanding how docker image builds work
19:39:03 <clarkb> fungi: the way it works today is we have a builder image and a base image that becomes the prod image
19:39:23 <corvus> i think the global install has a lot going for it and prefer that to a user/venv install
19:39:24 <clarkb> the builder image makes wheels using compile time deps. We copy the wheels to the base prod image and install there which means we don't need build time deps in the prod image
19:39:35 <fungi> okay, so you're saying create the venv in the builder image but then copy it to the base image
19:39:36 <clarkb> in the venv case you'd make the venv on the builder and copy it to base
19:39:51 <fungi> in that case the paths would be identical, right
19:40:00 <clarkb> corvus: ya it would definitely be a lot of effort to switch considering existing assumptions so we better really like the smaller images
19:40:22 <corvus> why would they be smaller?
19:40:26 <clarkb> anyway I bring it up as it was mentioned and I do think it is a good idea if the tiniest image is the goal. I don't think we should shelve the pip wheel work in favor of that as it's a lot more effort
19:40:52 <clarkb> corvus: because we copy the wheels from the builder to the base image which increases the base image by the aggregate size of all the wheels. You don't have this step in the venv case
19:40:54 <fungi> corvus: because pip will cache the wheels while installing
19:41:08 <fungi> or otherwise needs a local copy of them
19:41:15 <corvus> we can just remove the wheel cache after installing them?
19:41:24 <clarkb> corvus: that doesn't reduce the size of the image unfortunately
19:42:00 <corvus> i think there are tools/techniques for that
19:42:08 <clarkb> because they are copied in using a docker COPY directive we get a layer with that copy. Then any removals are just another layer delta saying the files don't exist anymore. But the layer is still there with the contents
19:42:46 <clarkb> anyway we don't need to debug the sizes here. I just wanted to call it out as another alternative to what my changes propose. But one I think would require significantly more effort which is why I didn't change direction
19:43:13 <fungi> the recent changes to image building aren't making larger images than what we did before anyway
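For anyone who wants to see the layer accounting for themselves, docker history shows why a later deletion can't shrink an earlier COPY layer (image name and tag are illustrative):

    # every COPY/RUN is a separate layer; a later rm only masks files, the
    # space recorded in the earlier layer is still shipped
    docker history --no-trunc opendevorg/python-base:3.10-bullseye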
19:44:04 <clarkb> #topic Dropping python3.8 base docker images
19:44:19 <clarkb> related but not really is removing python3.8 base docker images to make room for yesterday's python3.11 release
19:44:25 <clarkb> #link https://review.opendev.org/q/status:open+(topic:use-new-python+OR+topic:docker-cleanups)
19:44:45 <clarkb> at this point we're ready to land the removal. I didn't +A it earlier since docker hub was having trouble but sounds like that may be over
19:45:57 <clarkb> then we should also look at updating python3.9 things to 3.10/3.11 but there is a lot more stuff on 3.9 than 3.8
19:46:10 <clarkb> Thank you for all the reviews and moving this along
19:46:24 <clarkb> #topic iweb cloud going away by the end of the year
19:46:46 <clarkb> leaseweb acquired iweb which was spun out of inap
19:46:59 <clarkb> leaseweb is a cloud provider but not primarily an openstack cloud provider.
19:47:28 <clarkb> They have told us that the openstack environment currently backing our iweb provider in nodepool will need to go away by the end of the year. But they said we could keep using it until then and to let them know when we stop using it
19:48:14 <clarkb> that pool gives us 200 nodes which is a fair bit.
19:48:35 <fungi> around 20-25% of our theoretical total quotas, i guess
19:48:50 <clarkb> The good news is that they were previously open to the idea of providing us test resources via cloudstack
19:49:23 <clarkb> this would require a new nodepool driver. I've got a meeting on friday to talk to them about whether or not this is still something they are interested in
19:49:49 <clarkb> I don't think we need to do anything today. And I should make a calendar reminder for mid december to shut down that provider in our nodepool config
19:50:05 <clarkb> And now you all know what I know :)
19:50:19 <clarkb> #topic Etherpad container log growth
19:50:51 <clarkb> During the PTG last week we discovered the etherpad server's root fs was filling up over time. It turned out to be the container log itself as there hasn't been an etherpad release to upgrade to in a while so the container has run for a while
19:51:09 <clarkb> To address that we docker-compose down'd then up'd the service which made a new container and cleared out the old large log file
19:51:37 <clarkb> My question here is if we would expect ianw's container syslogging stuff to mitigate this. If so we should convert etherpad to it
19:52:14 <clarkb> my understanding is that etherpad writes to stdout/stderr and docker accumulates that into a log file that never rotates
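A quick way to confirm that on the host (container name and paths are illustrative) is to ask docker where the json-file log lives and check its size:

    # locate the container's json-file log and see how large it has grown
    docker inspect --format '{{.LogPath}}' etherpad-docker_etherpad_1
    sudo du -sh /var/lib/docker/containers/*/*-json.log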
19:53:18 <ianw> it seems like putting that in /var/log/containers and having normal logrotate would help in that situation
19:53:55 <clarkb> ya I thought it would, but didn't feel like paging in how all that works in order to write a change before someone else agreed it would :)
19:54:00 <clarkb> sounds like something we should try and get done
19:54:09 <ianw> it should all be gate testable in that the logfile will be created, and you can confirm the output of docker-logs doesn't have it too
19:54:23 <clarkb> good point
19:54:25 <ianw> i can take a todo to update the etherpad installation
19:55:00 <clarkb> ianw: that would be great (I don't mind doing it either just to page in how it all works, but with the pip things and mailman things and so on time is always an issue)
19:55:11 <clarkb> #topic Open Discussion
19:55:21 <clarkb> Somehow I have more things to bring up that didn't make it to the agenda
19:55:29 <clarkb> corvus discovered we're underutilizing our quota in the inmotion cloud
19:55:41 <clarkb> I believe this to be due to leaked placement allocations in the placement service for that cloud
19:55:43 <clarkb> https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
19:55:55 <clarkb> That is nova docs on how to deal with it and this is something melwitt has helped with in the past
19:56:16 <clarkb> I've got that on my todo list to try and take a look but if anyone wants to look at nova debugging I'm happy to let some one else look
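For whoever picks that up, the general shape of the cleanup with the osc-placement plugin is sketched below (consumer UUIDs come from comparing placement's allocations against the servers nova actually knows about); the nova docs linked above are the authoritative procedure:

    # list resource providers (one per hypervisor) and their reported usage
    openstack resource provider list
    openstack resource provider usage show <rp_uuid>

    # inspect and delete allocations held by consumers (instances) that no longer exist
    openstack resource provider allocation show <consumer_uuid>
    openstack resource provider allocation delete <consumer_uuid>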
19:56:49 <clarkb> And finally, the foundation has sent email to various project mailing lists asking for feedback on the potential for a PTG colocated with the Vancouver summit. There is a survey you can fill out to give them your thoughts
19:57:13 <clarkb> Anything else?
19:57:35 <ianw> one minor thing is
19:57:38 <ianw> #link https://review.opendev.org/c/zuul/zuul-sphinx/+/862215
19:58:17 <ianw> see the links inline, but works around what i think is a docutils bug (no response on that bug from upstream, not sure how active they are)
19:58:44 <ianw> #link https://review.opendev.org/q/topic:ansible-lint-6.8.2
19:59:10 <ianw> is also out there -- but i just noticed that a bunch of the jobs stopped working because it seems part of the testing is to install zuul-client, which must have just dropped 3.6 support maybe?
19:59:40 <clarkb> ianw: yes it did. That came out of feedback for my docker image updates to zuul-client
19:59:42 <ianw> anyway, i'll have to loop back on some of the -1's there on some platforms to figure that out, but in general the changes can be looked at
20:00:19 <clarkb> and we are at time
20:00:25 <fungi> thanks clarkb!
20:00:34 <clarkb> Thank you everyone. Sorry for the long agenda. I guess that is what happens when you skip due to a ptg
20:00:37 <clarkb> #endmeeting