clarkb | Just about meeting time | 18:59 |
---|---|---|
ianw | o/ | 19:00 |
clarkb | We do have a fairly large agenda so I'll try to keep things moving | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Oct 25 19:01:16 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000369.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | No announcements so we can dive right in | 19:01 |
clarkb | #topic Bastion Host Changes | 19:02 |
clarkb | ianw: you've made a bunch of progress on this both with the zuul console log files and the virtualenv and upgrade work | 19:02 |
ianw | yep in short there is one change that is basically s/bridge.openstack.org/bridge01.opendev.org/ -> https://review.opendev.org/c/opendev/system-config/+/861112 | 19:03 |
ianw | the new host is ready | 19:03 |
clarkb | ianw: at this point do we expect that we won't have any console logs written to the host? we updated the base jobs repo and system-config? Have we deleted the old files? | 19:03 |
ianw | oh, in terms of the console logs in /tmp -- yep they should be gone and i removed all the old files | 19:03 |
clarkb | I guess that is less important for bridge as we're replacing the host. But for static that is important | 19:04 |
clarkb | also great | 19:04 |
ianw | on bridge and static | 19:04 |
clarkb | For the bridge replacement I saw there were a couple of struggles with the overlap between testing and prod. Are any of those worth digging into? | 19:04 |
ianw | not at this point -- it was all about trying to minimise the number of places we hardcode literal "bridge.openstack.org" | 19:05 |
ianw | i think I have it down to about the bare minimum; so 861112 is basically it | 19:05 |
clarkb | For the new server the host vars and group vars and secrets files are moved over? | 19:06 |
clarkb | (since that requires a manual step) | 19:06 |
ianw | no, so i plan on doing that today if no objections | 19:06 |
ianw | there's a few manual steps -- copying the old secrets, and setting up zuul login | 19:06 |
ianw | and i am 100% sure there is something forgotten that will be revealed when we actually try it | 19:06 |
clarkb | ya I think the rough order of operations should be copying that content over, ask other roots to double check things and then land https://review.opendev.org/c/opendev/system-config/+/861112 ? | 19:07 |
ianw | but i plan to keep notes and add a small checklist for migrating bridge to system-config docs | 19:07 |
clarkb | ++ | 19:07 |
clarkb | if we want we can do a pruning pass of that data first too (since we may have old hosts var files or similar) | 19:07 |
ianw | yep, that is about it | 19:07 |
clarkb | but that seems less critical and can be done on the new host afterwards too | 19:08 |
clarkb | ok sounds good to me | 19:08 |
ianw | yeah i think at this point i'd like to get the migration done and prod jobs working on it -- then we can move over any old ~ data and prune, etc. | 19:08 |
clarkb | anything else on this topic? | 19:09 |
ianw | nope, hopefully next week it won't be a topic! :) | 19:09 |
corvus | (and try to reconstruct our venvs! ;) | 19:09 |
fungi | that'll be the hardest part! | 19:09 |
clarkb | it may be worth keeping the old bridge around for a bit too just in case | 19:09 |
clarkb | thank you for pushing this along. Great progress | 19:09 |
ianw | corvus: i've got a change out for us to have launch node in a venv setup by system-config, so we don't need separate ones | 19:09 |
corvus | ++ | 19:09 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/861284 | 19:09 |
clarkb | #topic Upgrading Bionic Servers | 19:09 |
clarkb | We have our first jammy server in production. gitea-lb02 which fronts opendev.org | 19:10 |
clarkb | This server was booted in vexxhost which does/did not have a jammy image already. I took ubuntu's published image and converted it to raw and uploaded that to vexxhost | 19:10 |
clarkb | I did the raw conversion for maximum compatibility with vexxhost ceph | 19:11 |
clarkb | That seems to be working fine. But did require a modern paramiko in a venv to do ssh as jammy ssh seems to not want to do rsa + sha1 | 19:11 |
clarkb | I thought about updating launch node to use an ed25519 key instead but paramiko doesn't have key generation routines for that key type like it does rsa | 19:11 |
clarkb | Anyway it mostly works except for the paramiko thing. I don't think there is much to add to this other than that ianw's bridge work should hopefully mitigate some of this | 19:12 |
clarkb | Otherwise I think we can go ahead and launch jammy nodes | 19:12 |
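
A rough sketch of the convert-and-upload step described above, assuming Ubuntu's published jammy cloud image; the file and image names here are illustrative rather than the exact ones used:

```sh
# Convert Ubuntu's published qcow2 cloud image to raw (for compatibility with
# the ceph-backed storage) and upload it to the cloud; names are illustrative.
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
qemu-img convert -f qcow2 -O raw \
  jammy-server-cloudimg-amd64.img jammy-server-cloudimg-amd64.raw
openstack image create --disk-format raw --container-format bare \
  --file jammy-server-cloudimg-amd64.raw ubuntu-jammy
```
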
clarkb | #topic Removing snapd | 19:12 |
ianw | ++ the new bridge is jammy too | 19:12 |
clarkb | When doing the new jammy node I noticed that we don't remove snapd which is something I thought we were doing. Fungi did some excellent git history investigating and discovered we did remove snapd at one time but stopped so that we could install the kubectl snap | 19:13 |
clarkb | We aren't currently using kubectl for anything in production and even if we were I think we could find a different install method. This makes me wonder if we should go back to removing snapd? | 19:13 |
clarkb | I don't think we need to make a hard decision here in the meeting but wanted to call it out as something to think about and if you have thoughts I'm happy for them to be shared | 19:14 |
fungi | also we only needed to stop removing it from the server(s) where we installed kubectl | 19:14 |
fungi | and also there now seem to be more sane ways of installing an updated kubectl anyway | 19:14 |
ianw | i hit something tangentially related with the screenshots -- i wanted to use firefox on the jammy hosts but the geckodriver bits don't work because firefox is embedded in a snap | 19:15 |
ianw | which -- i guess i get why you want your browser sandboxed. but it's also quite a departure from the traditional idea of a packaged system | 19:16 |
clarkb | probably something that deserves a bit more investigation to understand its broader impact then | 19:16 |
clarkb | I'll try to make time for that. One thing that might be good is listing snaps for which there aren't packages that we might end up using like kubectl or firefox | 19:16 |
clarkb | And then take it from there | 19:17 |
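
If we do go back to removing snapd, a minimal hand-run sketch of what that could look like; the pinning file is an assumption about how we'd keep it from coming back, not what the current base-server role does:

```sh
# Purge snapd (and any installed snaps with it), then pin it so a package
# dependency can't quietly pull it back in.
sudo apt-get purge -y snapd
printf 'Package: snapd\nPin: release *\nPin-Priority: -1\n' | \
  sudo tee /etc/apt/preferences.d/no-snapd
```
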
clarkb | #topic Mailman 3 | 19:17 |
clarkb | Moving along so we don't run out of time | 19:17 |
clarkb | fungi: I think our testing is largely complete at this point. Are we ready to boot a new jammy server and if so have we decided where it should live? | 19:18 |
fungi | if folks are generally satisfied with our forked image strategy, yeah i guess next steps are deciding where to boot it and then booting it and getting it excluded from blocklists if needed | 19:18 |
clarkb | at this point I still haven't heard from the upstream image maintainer. I do think we should probably accept that we'll need to maintain our own images at least for now | 19:19 |
fungi | once we have ip addresses for the server, we can include those in communications around migration planning for lists.opendev.org and lists.zuul-ci.org as our first sites to move | 19:19 |
clarkb | re hosting location it occurred to me that we can't easily get reverse dns records outside of rax which makes me think rax is the best location for a mail server | 19:20 |
clarkb | But I think we could also host it in vexxhost if mnaser doesn't have concerns with email flowing through his IPs and he is willing to edit dns records for us | 19:21 |
fungi | perhaps, but rackspace also preemptively places their netblocks on the sbl | 19:21 |
fungi | which makes them less great for it | 19:21 |
fungi | er, on the pbl i mean | 19:21 |
clarkb | ya so maybe step 0 is send a feeler to mnaser about it | 19:21 |
fungi | (spamhaus policy blocklist) | 19:21 |
clarkb | to figure out how problematic the dns records and email traffic would be | 19:21 |
corvus | i think that's normal/expected behavior | 19:22 |
corvus | and removal from pbl is easy? | 19:22 |
fungi | exclusion from pbl used to be easier | 19:22 |
corvus | i think vexxhost can do reverse dns by request | 19:22 |
fungi | now they require you to periodically renew your pbl exclusion and there's no way to find out when it will run out that i can find | 19:22 |
corvus | is it not easy? i thought it was click-and-done | 19:22 |
corvus | ah :( | 19:23 |
ianw | for review02 we did have to ask mnaser, but it was also easy :) | 19:23 |
clarkb | from our end being able to set reverse dns records was what came to mind. Sounds like pbl is also worth considering | 19:23 |
ianw | so there's already a lot of mail coming out of that | 19:23 |
fungi | corvus: at least i recall spotting that change recently, looking now for a clear quote i can link | 19:24 |
corvus | (either place seems good to me; seems like nothing's perfect) | 19:25 |
clarkb | I guess the two todos are for people to weigh in on whether or not we're comfortable with forked images and specify if they have a strong preference for hosting location | 19:26 |
clarkb | I agree sounds like we'll just deal with different things in either location | 19:26 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/860157 Change to fork upstream mailman3 docker images | 19:26 |
fungi | i concur | 19:26 |
clarkb | maybe drop your thoughts there? | 19:26 |
fungi | also i suppose merging those changes will be a prerequisite to booting the new server | 19:27 |
fungi | there's a series of several | 19:27 |
clarkb | fungi: we can boot the new server first, it just won't do much until changes land | 19:27 |
clarkb | but I don't think the boot order is super important here | 19:27 |
fungi | good point | 19:27 |
clarkb | ok lets move on. Please leave thoughts on the change otherwise I expect we'll proceed | 19:28 |
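
Once the server has addresses, confirming the reverse DNS setup is a quick check like the following; the IP and hostname here are placeholders:

```sh
# Check that the PTR record for the new server's address resolves to the
# expected name, and that the forward record agrees; values are placeholders.
dig +short -x 203.0.113.10
dig +short lists.opendev.org A
```
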
clarkb | #topic Switching our base job nodeset to Jammy | 19:28 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/862624 | 19:29 |
clarkb | today is the day we said we would make this swap | 19:29 |
fungi | yeah, we can merge it after the meeting wraps up | 19:29 |
clarkb | ++ Mostly a heads up that this is changing and to be on the lookout for fallout | 19:29 |
clarkb | I did find a place in zuul-jobs that would likely break which was python3.8 jobs running without a nodeset specifier | 19:30 |
fungi | if anyone else wants to review that three-line change before we approve it, you have roughly half an hour | 19:30 |
clarkb | I expect that sort of thing to be the bulk of what we run into | 19:30 |
clarkb | #topic Updating our base python images to use pip wheel | 19:31 |
clarkb | About a week ago Nodepool could no longer build its container images. The issue was that we weren't using wheels built by the builder in the prod image | 19:31 |
clarkb | after a bunch of debugging it basically came down to pip 22.3 changed the location it caches wheels compared to 22.2.2 and prior | 19:32 |
clarkb | I think this is actually a pip bug (because it reduces the file integrity assertions that existed previously) | 19:32 |
fungi | or rather the layout of the cache directory | 19:32 |
fungi | changed | 19:32 |
clarkb | ya | 19:32 |
clarkb | #link https://github.com/pypa/pip/issues/11527 | 19:32 |
clarkb | #link https://github.com/pypa/pip/pull/11538 | 19:33 |
clarkb | I filed an issue upstream and wrote a patch. The patch is currently not passing CI due to a different git change (that zuul also ran into) that impacts their test suite. They've asked if I want to write a patch for that too but I haven't found time yet | 19:33 |
clarkb | Anyway part of the fallout from this is that pip says we shouldn't use the cache that way as it's more of an implementation detail for pip, which is a reasonable position | 19:33 |
clarkb | Their suggestion is to use `pip wheel` instead and explicitly fetch/build wheels and use them that way | 19:34 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862152 | 19:34 |
clarkb | that change updates our base images to do this. I've tested it with a change to nodepool and diskimage builder which helps to exercise that the modifications actually work without breaking sibling installs and extras installs | 19:35 |
clarkb | This shouldn't actually change our images much, but should make our build process more reliable in the future | 19:35 |
clarkb | reviews and concerns appreciated. | 19:35 |
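
In rough terms, the pattern the change moves the base images to looks like this; a simplified sketch of the builder/final split with an illustrative wheel directory, not the literal Dockerfile contents:

```sh
# In the builder image (which has compilers and -dev packages): build or
# fetch wheels for everything explicitly instead of relying on pip's
# internal cache directory.
python3 -m pip wheel --wheel-dir /output/wheels -r requirements.txt

# In the final image, after /output/wheels has been copied across from the
# builder: install only from those wheels, so no build toolchain is needed.
python3 -m pip install --no-index --find-links /output/wheels -r requirements.txt
```
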
fungi | tonyb is looking into doing something similar with rewrites of the wheel cache builder jobs, i think | 19:35 |
fungi | and constraints generation jobs more generally | 19:36 |
clarkb | The other piece of feedback that came out of this is that other people do similar but instead of creating a wheel cache and copying that and performing another install on the prod image they do a pip install --user on the builder side then just copy over $USER/.local to the prod image | 19:36 |
clarkb | this has the upside of not needing wheel files in the final image which reduces the final image size | 19:36 |
fungi | as long as the path remains the same, right? | 19:37 |
clarkb | I think we should consider doing that as well, but all of our consuming images would need to be updated to find the executables in the local dir or a virtualenv | 19:37 |
clarkb | fungi: yes it only works if the two sides stay in sync for python versions (something we already attempt to do) and paths | 19:37 |
ianw | that would be ... interesting | 19:37 |
ianw | i think most would not | 19:37 |
fungi | well, but also venvs aren't supposed to be relocatable | 19:37 |
clarkb | its the and paths bit that makes it difficult for us to transition as we'd need to update the consuming images | 19:37 |
ianw | (find things in a venv) | 19:37 |
clarkb | fungi: yes, except in this case they aren't relocating; as far as they are concerned everything stays in the same spot | 19:38 |
fungi | the path of the venv inside the container image would need to be the same as where they're copied from on the host where they're built? | 19:38 |
fungi | or maybe i'm misunderstanding how docker image builds work | 19:38 |
clarkb | fungi: the way it works today is we have a builder image and a base image that becomes the prod image | 19:39 |
corvus | i think the global install has a lot going for it and prefer that to a user/venv install | 19:39 |
clarkb | the builder image makes wheels using compile time deps. We copy the wheels to the base prod image and install there which means we don't need build time deps in the prod image | 19:39 |
fungi | okay, so you're saying create the venv in the builder image but then copy it to the base image | 19:39 |
clarkb | in the venv case you'd make the venv on the builder and copy it to base | 19:39 |
fungi | in that case the paths would be identical, right | 19:39 |
clarkb | corvus: ya it would definitely be a lot of effort to switch considering existing assumptions so we better really like the smaller images | 19:40 |
corvus | why would they be smaller? | 19:40 |
clarkb | anyway I bring it up as it was mentioned and I do think it is a good idea if the tiniest image is the goal. I don't think we should shelve the pip wheel work in favor of that as its a lot more effort | 19:40 |
clarkb | corvus: because we copy the wheels from the builder to the base image which increases the base image by the aggregate size of all the wheels. You don't have this step in the venv case | 19:40 |
fungi | corvus: because pip will cache the wheels while installing | 19:40 |
fungi | or otherwise needs a local copy of them | 19:41 |
corvus | we can just remove the wheel cache after installing them? | 19:41 |
clarkb | corvus: that doesn't reduce the size of the image unfortunately | 19:41 |
corvus | i think there are tools/techniques for that | 19:42 |
clarkb | because they are copied in using a docker COPY directive we get a layer with that copy. Then any removals are just another layer delta saying the files don't exist anymore. But the layer is still there with the contents | 19:42 |
clarkb | anyway we don't need to debug the sizes here. I just wanted to call it out as another alternative to what my changes propose. But one I think would require significantly more effort which is why I didn't change direction | 19:42 |
fungi | the recent changes to image building aren't making larger images than what we did before anyway | 19:43 |
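
For comparison, the venv-copy alternative discussed above would look roughly like this; the /opt/app prefix is illustrative, and as noted the python version and path have to match exactly between the builder and final images:

```sh
# In the builder image: install straight into an isolated prefix.
python3 -m venv /opt/app
/opt/app/bin/pip install .

# In the final image the whole prefix is copied over unchanged (e.g.
# COPY --from=builder /opt/app /opt/app), so no wheel files ever land in the
# final layers -- but every consumer then has to call /opt/app/bin/<tool>
# or put that directory on PATH.
```
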
clarkb | #topic Dropping python3.8 base docker images | 19:44 |
clarkb | related but not really is removing python3.8 base docker images to make room for yesterday's python3.11 release | 19:44 |
clarkb | #link https://review.opendev.org/q/status:open+(topic:use-new-python+OR+topic:docker-cleanups) | 19:44 |
clarkb | at this point we're ready to land the removal. I didn't +A it earlier since docker hub was having trouble but sounds like that may be over | 19:44 |
clarkb | then we should also look at updating python3.9 things to 3.10/3.11 but there is a lot more stuff on 3.9 than 3.8 | 19:45 |
clarkb | Thank you for all the reviews and moving this along | 19:46 |
clarkb | #topic iweb cloud going away by the end of the year | 19:46 |
clarkb | leaseweb acquired iweb which was spun out of inap | 19:46 |
clarkb | leaseweb is a cloud provider but not primarily an openstack cloud provider. | 19:46 |
clarkb | They have told us that the openstack environment currently backing our iweb provider in nodepool will need to go away by the end of the year. But they said we could keep using it until then and to let them know when we stop using it | 19:47 |
clarkb | that pool gives us 200 nodes which is a fair bit. | 19:48 |
fungi | around 20-25% of our theoretical total quotas, i guess | 19:48 |
clarkb | The good news is that they were previously open to the idea of providing us test resources via cloudstack | 19:48 |
clarkb | this would require a new nodepool driver. I've got a meeting on friday to talk to them about whether or not this is still something they are interested in | 19:49 |
clarkb | I don't think we need to do anything today. And I should make a calendar reminder for mid december to shut down that provider in our nodepool config | 19:49 |
clarkb | And now you all know what I know :) | 19:50 |
clarkb | #topic Etherpad container log growth | 19:50 |
clarkb | During the PTG last week we discovered the etherpad server's root fs was filling up over time. It turned out to be the container log itself as there hasn't been an etherpad release to upgrade to in a while so the container has run for a while | 19:50 |
clarkb | To address that we docker-compose down'd then up'd the service which made a new container and cleared out the old large log file | 19:51 |
clarkb | My question here is if we would expect ianw's container syslogging stuff to mitigate this. If so we should convert etherpad to it | 19:51 |
clarkb | my understanding is that etherpad writes to stdout/stderr and docker accumulates that into a log file that never rotates | 19:52 |
ianw | it seems like putting that in /var/log/containers and having normal logrotate would help in that situation | 19:53 |
clarkb | ya I thought it would, but didn't feel like paging in how all that works in order to write a change before someone else agreed it would :) | 19:53 |
clarkb | sounds like something we should try and get done | 19:54 |
ianw | it should all be gate testable in that the logfile will be created, and you can confirm the output of docker-logs doesn't have it too | 19:54 |
clarkb | good point | 19:54 |
ianw | i can take a todo to update the etherpad installation | 19:54 |
clarkb | ianw: that would be great (I don't mind doing it either just to page in how it all works, but with the pip things and mailman things and so on time is always an issue) | 19:55 |
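
In the meantime the runaway log is easy to confirm by hand; docker's default json-file driver keeps one unrotated file per container, and something like the following (container name assumed) shows where it lives and how big it has grown:

```sh
# Locate the json-file log docker keeps for the etherpad container and check
# its size; the container name here is an assumption.
sudo docker inspect --format '{{.LogPath}}' etherpad_etherpad_1 | xargs sudo du -h
```
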
clarkb | #topic Open Discussion | 19:55 |
clarkb | Somehow I have more things to bring up that didn't make it to the agenda | 19:55 |
clarkb | corvus discovered we're underutilizing our quota in the inmotion cloud | 19:55 |
clarkb | I believe this to be due to leaked placement allocations in the placement service for that cloud | 19:55 |
clarkb | https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html | 19:55 |
clarkb | That is nova docs on how to deal with it and this is something melwitt has helped with in the past | 19:55 |
clarkb | I've got that on my todo list to try and take a look but if anyone wants to look at nova debugging I'm happy to let some one else look | 19:56 |
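
For whoever picks this up, the linked nova doc boils down to roughly the following; it assumes admin credentials, the osc-placement client plugin, and access to a nova controller, so treat it as a sketch rather than a runbook:

```sh
# Compare what placement thinks is allocated with what really exists.
openstack resource provider list
openstack resource provider show <provider-uuid> --allocations

# On a nova controller: report orphaned allocations, then delete them once
# we're confident they really are leaked.
nova-manage placement audit --verbose
nova-manage placement audit --verbose --delete
```
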
clarkb | And finally, the foundation has sent email to various project mailing lists asking for feedback on the potential for a PTG colocated with the Vancouver summit. There is a survey you can fill out to give them your thoughts | 19:56 |
clarkb | Anything else? | 19:57 |
ianw | one minor thing is | 19:57 |
ianw | #link https://review.opendev.org/c/zuul/zuul-sphinx/+/862215 | 19:57 |
ianw | see the links inline, but works around what i think is a docutils bug (no response on that bug from upstream, not sure how active they are) | 19:58 |
ianw | #link https://review.opendev.org/q/topic:ansible-lint-6.8.2 | 19:58 |
ianw | is also out there -- but i just noticed that a bunch of the jobs stopped working because it seems part of the testing is to install zuul-client, which must have just dropped 3.6 support maybe? | 19:59 |
clarkb | ianw: yes it did. That came out of feedback for my docker image updates to zuul-client | 19:59 |
ianw | anyway, i'll have to loop back on some of the -1's there on some platforms to figure that out, but in general the changes can be looked at | 19:59 |
clarkb | and we are at time | 20:00 |
fungi | thanks clarkb! | 20:00 |
clarkb | Thank you everyone. Sorry for the long agenda. I guess that is what happens when you skip due to a ptg | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Oct 25 20:00:37 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.log.html | 20:00 |