19:01:16 #startmeeting infra
19:01:16 Meeting started Tue Oct 25 19:01:16 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:25 #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000369.html Our Agenda
19:01:46 #topic Announcements
19:01:52 No announcements so we can dive right in
19:02:10 #topic Bastion Host Changes
19:02:38 ianw: you've made a bunch of progress on this both with the zuul console log files and the virtualenv and upgrade work
19:03:12 yep in short there is one change that is basically s/bridge.openstack.org/bridge01.opendev.org/ -> https://review.opendev.org/c/opendev/system-config/+/861112
19:03:21 the new host is ready
19:03:23 ianw: at this point do we expect that we won't have any console logs written to the host? we updated the base jobs repo and system-config? Have we deleted the old files?
19:03:53 oh, in terms of the console logs in /tmp -- yep they should be gone and i removed all the old files
19:04:13 I guess that is less important for bridge as we're replacing the host. But for static that is important
19:04:15 also great
19:04:16 on bridge and static
19:04:45 For the bridge replacement I saw there were a couple of struggles with the overlap between testing and prod. Are any of those worth digging into?
19:05:22 not at this point -- it was all about trying to minimise the number of places we hardcode literal "bridge.openstack.org"
19:05:47 i think I have it down to about the bare minimum; so 861112 is basically it
19:06:02 For the new server the host vars and group vars and secrets files are moved over?
19:06:13 (since that requires a manual step)
19:06:21 no, so i plan on doing that today if no objections
19:06:37 there's a few manual steps -- copying the old secrets, and setting up zuul login
19:06:53 and i am 100% sure there is something forgotten that will be revealed when we actually try it
19:07:06 ya I think the rough order of operations should be copying that content over, ask other roots to double check things and then land https://review.opendev.org/c/opendev/system-config/+/861112 ?
19:07:10 but i plan to keep notes and add a small checklist for migrating bridge to system-config docs
19:07:14 ++
19:07:53 if we want we can do a pruning pass of that data first too (since we may have old host vars files or similar)
19:07:56 yep, that is about it
19:08:01 but that seems less critical and can be done on the new host afterwards too
19:08:15 ok sounds good to me
19:08:32 yeah i think at this point i'd like to get the migration done and prod jobs working on it -- then we can move over any old ~ data and prune, etc.
19:09:01 anything else on this topic?
19:09:13 nope, hopefully next week it won't be a topic! :)
19:09:13 (and try to reconstruct our venvs! ;)
19:09:27 that'll be the hardest part!
19:09:29 it may be worth keeping the old bridge around for a bit too just in case
19:09:40 thank you for pushing this along. Great progress
19:09:40 corvus: i've got a change out for us to have launch node in a venv setup by system-config, so we don't need separate ones
19:09:51 ++
19:09:57 #link https://review.opendev.org/c/opendev/system-config/+/861284
19:09:58 #topic Upgrading Bionic Servers
19:10:18 We have our first jammy server in production. gitea-lb02 which fronts opendev.org
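For reference, a minimal sketch of the raw-image conversion and upload described in the next couple of messages. The download URL, file names and image name here are assumptions for illustration, not the exact commands that were run.

    # Fetch Ubuntu's published Jammy cloud image, convert qcow2 -> raw,
    # and upload it to the cloud (names are illustrative).
    wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
    qemu-img convert -f qcow2 -O raw \
        jammy-server-cloudimg-amd64.img jammy-server-cloudimg-amd64.raw
    openstack image create --disk-format raw --container-format bare \
        --file jammy-server-cloudimg-amd64.raw ubuntu-jammy

The raw format avoids any on-the-fly conversion by the cloud's ceph-backed image service, which is the compatibility point mentioned below.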
19:10:52 This server was booted in vexxhost which does/did not have a jammy image already. I took ubuntu's published image and converted it to raw and uploaded that to vexxhost
19:11:02 I did the raw conversion for maximum compatibility with vexxhost ceph
19:11:27 That seems to be working fine. But did require a modern paramiko in a venv to do ssh as jammy ssh seems to not want to do rsa + sha1
19:11:47 I thought about updating launch node to use an ed25519 key instead but paramiko doesn't have key generation routines for that key type like it does rsa
19:12:08 Anyway it mostly works except for the paramiko thing. I don't think there is much to add to this other than that ianw's bridge work should hopefully mitigate some of this
19:12:18 Otherwise I think we can go ahead and launch jammy nodes
19:12:31 #topic Removing snapd
19:12:34 ++ the new bridge is jammy too
19:13:16 When doing the new jammy node I noticed that we don't remove snapd which is something I thought we were doing. Fungi did some excellent git history investigating and discovered we did remove snapd at one time but stopped so that we could install the kubectl snap
19:13:45 We aren't currently using kubectl for anything in production and even if we were I think we could find a different install method. This makes me wonder if we should go back to removing snapd?
19:14:03 I don't think we need to make a hard decision here in the meeting but wanted to call it out as something to think about and if you have thoughts I'm happy for them to be shared
19:14:24 also we only needed to stop removing it from the server(s) where we installed kubectl
19:14:41 and also there now seem to be more sane ways of installing an updated kubectl anyway
19:15:22 i hit something tangentially related with the screenshots -- i wanted to use firefox on the jammy hosts but the geckodriver bits don't work because firefox is embedded in a snap
19:16:16 which -- i guess i get why you want your browser sandboxed. but it's also quite a departure from the traditional idea of a packaged system
19:16:27 probably something that deserves a bit more investigation to understand its broader impact then
19:16:50 I'll try to make time for that. One thing that might be good is listing snaps for which there aren't packages that we might end up using like kubectl or firefox
19:17:22 And then take it from there
19:17:29 #topic Mailman 3
19:17:33 Moving along so we don't run out of time
19:18:04 fungi: I think our testing is largely complete at this point. Are we ready to boot a new jammy server and if so have we decided where it should live?
19:18:53 if folks are generally satisfied with our forked image strategy, yeah i guess next steps are deciding where to boot it and then booting it and getting it excluded from blocklists if needed
19:19:23 at this point I still haven't heard from the upstream image maintainer. I do think we should probably accept that we'll need to maintain our own images at least for now
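Circling back to the snapd topic above: if we do resume removing it, a task along these lines could live in the base server role. This is a hedged sketch; the task placement and condition are assumptions, not what system-config previously carried.

    # sketch only: remove snapd where nothing we deploy depends on it
    - name: Remove snapd
      package:
        name: snapd
        state: absent
      when: ansible_facts['distribution'] == "Ubuntu"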
19:19:31 once we have ip addresses for the server, we can include those in communications around migration planning for lists.opendev.org and lists.zuul-ci.org as our first sites to move
19:20:01 re hosting location it occurred to me that we can't easily get reverse dns records outside of rax which makes me think rax is the best location for a mail server
19:21:11 But I think we could also host it in vexxhost if mnaser doesn't have concerns with email flowing through his IPs and he is willing to edit dns records for us
19:21:13 perhaps, but rackspace also preemptively places their netblocks on the sbl
19:21:26 which makes them less great for it
19:21:38 er, on the pbl i mean
19:21:43 ya so maybe step 0 is send a feeler to mnaser about it
19:21:45 (spamhaus policy blocklist)
19:21:57 to figure out how problematic the dns records and email traffic would be
19:22:00 i think that's normal/expected behavior
19:22:04 and removal from pbl is easy?
19:22:16 exclusion from pbl used to be easier
19:22:20 i think vexxhost can do reverse dns by request
19:22:38 now they require you to periodically renew your pbl exclusion and there's no way to find out when it will run out that i can find
19:22:41 is it not easy? i thought it was click-and-done
19:23:09 ah :(
19:23:34 for review02 we did have to ask mnaser, but it was also easy :)
19:23:41 from our end being able to set reverse dns records was what came to mind. Sounds like pbl is also worth considering
19:23:57 so there's already a lot of mail coming out of that
19:24:26 corvus: at least i recall spotting that change recently, looking now for a clear quote i can link
19:25:46 (either place seems good to me; seems like nothing's perfect)
19:26:01 I guess the two todos are for people to weigh in on whether or not we're comfortable with forked images and specify if they have a strong preference for hosting location
19:26:11 I agree sounds like we'll just deal with different things in either location
19:26:40 #link https://review.opendev.org/c/opendev/system-config/+/860157 Change to fork upstream mailman3 docker images
19:26:45 i concur
19:26:46 maybe drop your thoughts there?
19:27:07 also i suppose merging those changes will be a prerequisite to booting the new server
19:27:16 there's a series of several
19:27:29 fungi: we can boot the new server first it just won't do much until changes land
19:27:39 but I don't think the boot order is super important here
19:27:40 good point
19:28:48 ok let's move on. Please leave thoughts on the change otherwise I expect we'll proceed
19:28:58 #topic Switching our base job nodeset to Jammy
19:29:04 #link https://review.opendev.org/c/opendev/base-jobs/+/862624
19:29:10 today is the day we said we would make this swap
19:29:26 yeah, we can merge it after the meeting wraps up
19:29:40 ++ Mostly a heads up that this is changing and to be on the lookout for fallout
19:30:25 I did find a place in zuul-jobs that would likely break which was python3.8 jobs running without a nodeset specifier
19:30:36 if anyone else wants to review that three-line change before we approve it, you have roughly half an hour
19:30:38 I expect that sort of thing to be the bulk of what we run into
19:31:30 #topic Updating our base python images to use pip wheel
19:31:59 About a week ago Nodepool could no longer build its container images. The issue was that we weren't using wheels built by the builder in the prod image
19:32:18 after a bunch of debugging it basically came down to pip 22.3 changed the location it caches wheels compared to 22.2.2 and prior
19:32:40 I think this is actually a pip bug (because it reduces the file integrity assertions that existed previously)
19:32:49 or rather the layout of the cache directory
19:32:52 changed
19:32:58 ya
19:32:59 #link https://github.com/pypa/pip/issues/11527
19:33:07 #link https://github.com/pypa/pip/pull/11538
19:33:34 I filed an issue upstream and wrote a patch. The patch is currently not passing CI due to a different git change (that zuul also ran into) that impacts their test suite. They've asked if I want to write a patch for that too but I haven't found time yet
19:33:58 Anyway part of the fallout from this is that pip says we shouldn't use the cache that way as it's more of an implementation detail for pip which is a reasonable position
19:34:29 Their suggestion is to use `pip wheel` instead and explicitly fetch/build wheels and use them that way
19:34:34 #link https://review.opendev.org/c/opendev/system-config/+/862152
19:35:02 that change updates our base images to do this. I've tested it with a change to nodepool and diskimage builder which helps to exercise that the modifications actually work without breaking sibling installs and extras installs
19:35:38 This shouldn't actually change our images much, but should make our build process more reliable in the future
19:35:45 reviews and concerns appreciated.
19:35:52 tonyb is looking into doing something similar with rewrites of the wheel cache builder jobs, i think
19:36:07 and constraints generation jobs more generally
19:36:29 The other piece of feedback that came out of this is that other people do similar but instead of creating a wheel cache and copying that and performing another install on the prod image they do a pip install --user on the builder side then just copy over $USER/.local to the prod image
19:36:45 this has the upside of not needing wheel files in the final image which reduces the final image size
19:37:04 as long as the path remains the same, right?
19:37:07 I think we should consider doing that as well, but all of our consuming images would need to be updated to find the executables in the local dir or a virtualenv
19:37:29 fungi: yes it only works if the two sides stay in sync for python versions (something we already attempt to do) and paths
19:37:32 that would be ... interesting
19:37:37 i think most would not
19:37:44 well, but also venvs aren't supposed to be relocatable
19:37:46 it's the "and paths" bit that makes it difficult for us to transition as we'd need to update the consuming images
19:37:49 (find things in a venv)
19:38:04 fungi: yes, except in this case they aren't relocating as far as they are concerned everything stays in the same spot
19:38:31 the path of the venv inside the container image would need to be the same as where they're copied from on the host where they're built?
19:38:47 or maybe i'm misunderstanding how docker image builds work
19:39:03 fungi: the way it works today is we have a builder image and a base image that becomes the prod image
19:39:23 i think the global install has a lot going for it and prefer that to a user/venv install
19:39:24 the builder image makes wheels using compile time deps. We copy the wheels to the base prod image and install there which means we don't need build time deps in the prod image
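For reference, a minimal sketch of the builder/base split being described, using pip wheel as proposed in 862152. The base images, packages and paths are illustrative assumptions, not the actual opendev python-builder/python-base contents.

    # Build stage: compile-time deps (compilers, headers) live only here.
    FROM python:3.10-bullseye AS builder
    RUN apt-get update && apt-get install -y gcc libffi-dev libssl-dev
    COPY . /src
    # Explicitly build wheels for the project and all of its dependencies,
    # instead of relying on pip's internal cache layout.
    RUN pip wheel --wheel-dir /output/wheels /src

    # Prod stage: install only from the copied wheels, no compilers needed.
    FROM python:3.10-slim-bullseye
    COPY --from=builder /output/wheels /tmp/wheels
    RUN pip install --no-index --find-links /tmp/wheels /tmp/wheels/*.whl \
        && rm -rf /tmp/wheels
    # Note: the COPY layer above still carries the wheels, which is the image
    # size point discussed below; the rm only keeps the install layer small.

The explicit wheel directory makes the hand-off between the two stages part of the interface, which is what removes the dependency on where pip happens to cache things.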
19:39:35 okay, so you're saying create the venv in the builder image but then copy it to the base image
19:39:36 in the venv case you'd make the venv on the builder and copy it to base
19:39:51 in that case the paths would be identical, right
19:40:00 corvus: ya it would definitely be a lot of effort to switch considering existing assumptions so we better really like the smaller images
19:40:22 why would they be smaller?
19:40:26 anyway I bring it up as it was mentioned and I do think it is a good idea if the tiniest image is the goal. I don't think we should shelve the pip wheel work in favor of that as it's a lot more effort
19:40:52 corvus: because we copy the wheels from the builder to the base image which increases the base image by the aggregate size of all the wheels. You don't have this step in the venv case
19:40:54 corvus: because pip will cache the wheels while installing
19:41:08 or otherwise needs a local copy of them
19:41:15 we can just remove the wheel cache after installing them?
19:41:24 corvus: that doesn't reduce the size of the image unfortunately
19:42:00 i think there are tools/techniques for that
19:42:08 because they are copied in using a docker COPY directive we get a layer with that copy. Then any removals are just another layer delta saying the files don't exist anymore. But the layer is still there with the contents
19:42:46 anyway we don't need to debug the sizes here. I just wanted to call it out as another alternative to what my changes propose. But one I think would require significantly more effort which is why I didn't change direction
19:43:13 the recent changes to image building aren't making larger images than what we did before anyway
19:44:04 #topic Dropping python3.8 base docker images
19:44:19 related but not really is removing python3.8 base docker images to make room for yesterday's python3.11 release
19:44:25 #link https://review.opendev.org/q/status:open+(topic:use-new-python+OR+topic:docker-cleanups)
19:44:45 at this point we're ready to land the removal. I didn't +A it earlier since docker hub was having trouble but sounds like that may be over
19:45:57 then we should also look at updating python3.9 things to 3.10/3.11 but there is a lot more stuff on 3.9 than 3.8
19:46:10 Thank you for all the reviews and moving this along
19:46:24 #topic iweb cloud going away by the end of the year
19:46:46 leaseweb acquired iweb which was spun out of inap
19:46:59 leaseweb is a cloud provider but not primarily an openstack cloud provider.
19:47:28 They have told us that the openstack environment currently backing our iweb provider in nodepool will need to go away by the end of the year. But they said we could keep using it until then and to let them know when we stop using it
19:48:14 that pool gives us 200 nodes which is a fair bit.
19:48:35 around 20-25% of our theoretical total quotas, i guess
19:48:50 The good news is that they were previously open to the idea of providing us test resources via cloudstack
19:49:23 this would require a new nodepool driver. I've got a meeting on friday to talk to them about whether or not this is still something they are interested in
19:49:49 I don't think we need to do anything today. And I should make a calendar reminder for mid december to shut down that provider in our nodepool config
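As a reference for that calendar reminder, a rough sketch of what winding down a provider in a nodepool launcher config typically looks like; the provider, cloud and pool names below are hypothetical, not the real iweb entries.

    providers:
      - name: iweb            # hypothetical provider name
        driver: openstack
        cloud: iweb           # matches the corresponding clouds.yaml entry
        pools:
          - name: main
            max-servers: 0    # first step: stop booting new nodes
            labels: []        # labels omitted in this sketch
    # once the last nodes have been cleaned up, the provider stanza (and any
    # diskimage uploads for it) can be removed entirely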
19:50:05 And now you all know what I know :)
19:50:19 #topic Etherpad container log growth
19:50:51 During the PTG last week we discovered the etherpad server's root fs was filling up over time. It turned out to be the container log itself as there hasn't been an etherpad release to upgrade to in a while so the container has run for a while
19:51:09 To address that we docker-compose down'd then up'd the service which made a new container and cleared out the old large log file
19:51:37 My question here is if we would expect ianw's container syslogging stuff to mitigate this. If so we should convert etherpad to it
19:52:14 my understanding is that etherpad writes to stdout/stderr and docker accumulates that into a log file that never rotates
19:53:18 it seems like putting that in /var/log/containers and having normal logrotate would help in that situation
19:53:55 ya I thought it would, but didn't feel like paging in how all that works in order to write a change before someone else agreed it would :)
19:54:00 sounds like something we should try and get done
19:54:09 it should all be gate testable in that the logfile will be created, and you can confirm the output of docker logs doesn't have it too
19:54:23 good point
19:54:25 i can take a todo to update the etherpad installation
19:55:00 ianw: that would be great (I don't mind doing it either just to page in how it all works, but with the pip things and mailman things and so on time is always an issue)
19:55:11 #topic Open Discussion
19:55:21 Somehow I have more things to bring up that didn't make it to the agenda
19:55:29 corvus discovered we're underutilizing our quota in the inmotion cloud
19:55:41 I believe this to be due to leaked placement allocations in the placement service for that cloud
19:55:43 https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
19:55:55 That is nova docs on how to deal with it and this is something melwitt has helped with in the past
19:56:16 I've got that on my todo list to try and take a look but if anyone wants to look at nova debugging I'm happy to let someone else look
19:56:49 And finally, the foundation has sent email to various project mailing lists asking for feedback on the potential for a PTG colocated with the Vancouver summit. There is a survey you can fill out to give them your thoughts
19:57:13 Anything else?
19:57:35 one minor thing is
19:57:38 #link https://review.opendev.org/c/zuul/zuul-sphinx/+/862215
19:58:17 see the links inline, but works around what i think is a docutils bug (no response on that bug from upstream, not sure how active they are)
19:58:44 #link https://review.opendev.org/q/topic:ansible-lint-6.8.2
19:59:10 is also out there -- but i just noticed that a bunch of the jobs stopped working because it seems part of the testing is to install zuul-client, which must have just dropped 3.6 support maybe?
19:59:40 ianw: yes it did. That came out of feedback for my docker image updates to zuul-client
19:59:42 anyway, i'll have to loop back on some of the -1's there on some platforms to figure that out, but in general the changes can be looked at
20:00:19 and we are at time
20:00:25 thanks clarkb!
20:00:34 Thank you everyone. Sorry for the long agenda. I guess that is what happens when you skip due to a ptg
20:00:37 #endmeeting
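Following up on the etherpad log growth discussion above: one way to get a container's stdout/stderr under normal logrotate is docker's syslog log driver, roughly as sketched below. This is a hedged example; the image reference, tag and rsyslog rule are assumptions, and the actual system-config syslog role may be wired differently.

    # docker-compose.yaml fragment (sketch only)
    services:
      etherpad:
        image: docker.io/opendevorg/etherpad   # illustrative image reference
        logging:
          driver: syslog
          options:
            tag: docker-etherpad    # becomes the syslog programname
    # rsyslog can then file it somewhere logrotate already watches, e.g.
    #   :programname, isequal, "docker-etherpad" /var/log/containers/docker-etherpad.log
    # and with the syslog driver docker no longer accumulates a per-container
    # json log under /var/lib/docker, which is the growth seen on etherpad.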