19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Mar  1 19:01:06 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-March/000323.html Our Agenda
19:01:18 <clarkb> #topic Announcements
19:01:30 <clarkb> First up I won't be able to run next week's meeting as I have other meetings
19:01:36 <ianw> o/
19:02:02 <fungi> i expect to be in a similar situation
19:02:05 <clarkb> I've proposed in the agenda that we skip it, but if others want a meeting feel free to update the agenda and send it out. I just won't be able to participate
19:02:28 <frickler> I won't travel but have a holiday, so fine with skipping
19:02:42 <ianw> i can host it, but if fungi is out too probably ok to skip
19:03:08 <clarkb> cool consider it skipped then
19:03:14 <clarkb> #topic Actions from last meeting
19:03:19 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-02-22-19.01.txt minutes from last meeting
19:03:23 <clarkb> There were no actions recorded
19:03:33 <clarkb> #topic Topics
19:03:36 <clarkb> Time to dive in
19:03:45 <clarkb> #topic Improving CD throughput
19:03:55 <clarkb> ianw: Did all the logs changes end up landing?
19:04:09 <clarkb> that info is now available to us via zuul if we add our gpg keys ya?
19:04:34 <ianw> not yet
19:04:37 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/830784
19:04:56 <ianw> is the one that turns it on globally -- i was hoping for some extra reviews on that but i'm happy to babysit it
19:05:11 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/830785
19:05:18 <clarkb> ah
19:05:26 <clarkb> I guess I saw it only on the codesearch jobs
19:05:33 <ianw> is a doc update, that turned into quite a bit more than just how to add your gpg keys but it's in there
19:05:45 <clarkb> ya seemed like good improvements to the docs
19:06:18 <ianw> i felt like we were all ok with it, but didn't want to single approve 830784 just in case
19:06:35 <clarkb> ++
19:06:48 <clarkb> Good to get lots of eyeballs on changes like this
19:07:10 <clarkb> Anything else on this topic?
19:07:42 <fungi> oh, i totally missed 830785, thanks
19:08:33 <fungi> i approved 830784 now
19:08:48 <clarkb> #topic Container Maintenance
19:08:51 <ianw> nope, that's it for now, thanks
19:09:25 <clarkb> jentoio: and I met up last week and discussed what needed to be done for giving containers dedicated users. We decided to look at updating insecure ci registry to start since it doesn't write to the fs
19:09:41 <clarkb> That will help get the shape of things in place before we tackle some of the more complicated containers
19:09:51 <jentoio> I'll be working on zuul-registry this afternoon
19:09:58 <clarkb> jentoio: thanks again!
19:10:08 <jentoio> finally allocated some time to focus on it
19:10:29 <clarkb> so ya some progress here. Reviews appreciated once we have changes
19:11:09 <clarkb> #topic Cleaning Up Old Reviews
19:11:12 <clarkb> #link https://review.opendev.org/q/topic:retirement+status:open Changes to retire all the unused repos.
19:11:50 <clarkb> We're basically at the point of actually retiring content from the repos. I would appreciate it if we could start picking some of those off. I don't think they are difficult reviews; you should be able to fetch them and ls the repo contents to make sure nothing got left behind and check the README content, but there are a lot of them
19:12:43 <clarkb> Once those changes land I'll do bulk abandons for open changes on those repos and then push up the zuul removal cleanups
19:12:49 <clarkb> as well as gerrit acl updates
19:12:53 <ianw> ++ sorry i meant to look at them, will do
19:13:07 <clarkb> thanks!
19:13:46 <fungi> yeah, same. i'm going to blame the fact that i had unsubbed from those repos in gertty a while back
19:14:28 <clarkb> #topic Gitea 1.16
19:15:14 <clarkb> I think it is worth waiting until 1.16.3 releases and then making a push to upgrade
19:15:32 <clarkb> https://github.com/go-gitea/gitea/milestone/113 indicates the 1.16.3 release should happen soon. Only one issue left to address
19:15:56 <clarkb> The reason for this is that the issues we've discovered (diff rendering of images and build inconsistencies with a dep) should both be fixed by 1.16.3
19:16:23 <clarkb> I think we can hold off on reviewing things for now until 1.16.3 happens. I'll get a change pushed for that and update our hold to inspect it
19:16:42 <clarkb> Mostly a heads up that I haven't forgotten about this, but waiting until it stabilizes a bit more
19:17:05 <clarkb> #topic Rocky Linux
19:17:10 <clarkb> #link https://review.opendev.org/c/zuul/nodepool/+/831108 Need new dib for next steps
19:17:20 <clarkb> This is a nodepool change that will upgrade dib in our builder images
19:17:35 <clarkb> This new dib should address new issues found with Rocky Linux builds
19:18:11 <clarkb> Please review that if you get a chance
19:18:26 <clarkb> also keep an eye out for any unexpected new glean behavior since it updated a couple of days ago
19:18:56 <clarkb> #topic Removing airship-citycloud nodepool provider
19:19:09 <clarkb> This morning I woke up to an email that this provider is going away for us today
19:19:47 <clarkb> The first two changes to remove it have landed. After this meeting I need to check if nodepool is complaining about a node in that provider that wouldn't delete. I'll run nodepool erase airship-citycloud if so
19:19:48 <fungi> ericsson had been paying for that citycloud account in support of airship testing, but has decided to discontinue it
19:20:01 <clarkb> right
19:20:32 <clarkb> Once nodepool is done we can land the system-config and dns update changes. Then we can delete the mirror node if it hasn't gone away already
19:20:33 <frickler> do we need to inform the airship team or do they know already?
19:20:47 <clarkb> frickler: this was apparently discussed with them months ago. They just neglected to tell us :/
19:21:06 <frickler> ah, o.k.
19:21:39 <clarkb> I'll keep driving this along today to help ensure it doesn't cause problems for us.
19:21:47 <clarkb> Thank you for the help with reviews.
19:22:09 <clarkb> Oh the mirror node is already in the emergency file to avoid ansible errors if they remove it quicker than we can take it out of our inventory
19:23:51 <clarkb> #topic zuul-registry bugs
19:24:15 <clarkb> We've discovered some zuul-registry bugs. Specifically the way it handles concurrent uploads of the same image layer blobs is broken
19:24:54 <clarkb> What happens is the second upload notices that the first one is running and exits early. When it exits early, that second uploading client HEADs the object to get its size back (even though it already knows the size) and we get a short read because the first upload isn't completed yet
19:25:15 <clarkb> then when the second client uses that short size to push a manifest using that blob, it errors because the server validates the input sizes
19:25:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul-registry/+/831235 Should address these bugs. Please monitor container jobs after this lands.
19:25:50 <ianw> ++ will review.  i got a little sidetracked with podman issues that got uncovered too
19:25:55 <clarkb> This change addresses that by forcing all uploads to run to their own completion. Then we handle the data such that it should always be valid when read (using atomic moves on the filesystem and swift eventual consistency stuff)
19:26:17 <clarkb> There is also a child change that makes the testing of zuul-registry much more concurrent to try and catch these issues
19:26:29 <clarkb> it caught problems with one of my first attempts at fixing this so seems to be helpful already
19:26:53 <clarkb> In general this entire problem is avoided if you don't push the same image blobs at the same time, which is why we likely haven't noticed until recently
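A minimal sketch of the write-then-atomically-rename pattern described above, assuming a simple filesystem backend; this is illustrative only, not the actual zuul-registry fix in 831235, and the function and path names are made up:

    import os
    import tempfile

    def store_blob(storage_dir, digest, data):
        """Store a blob so readers only ever see complete data.

        Every uploader writes its own temporary file and runs to
        completion, then atomically renames the file into place.
        Concurrent uploads of the same digest simply overwrite each
        other with identical content, so short reads are avoided.
        """
        final_path = os.path.join(storage_dir, digest)
        fd, tmp_path = tempfile.mkstemp(dir=storage_dir)
        try:
            with os.fdopen(fd, 'wb') as blob_file:
                blob_file.write(data)
            os.rename(tmp_path, final_path)  # atomic within one filesystem
        except Exception:
            os.unlink(tmp_path)
            raise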
19:27:52 <clarkb> #topic PTG Planning
19:28:10 <clarkb> I was reminded that the deadline for teams signing up to the PTG is approaching in about 10 days
19:28:20 <clarkb> My initial thought was that we could skip this one since last one was very quiet for us
19:28:34 <clarkb> But if there is interest in having a block of time or two let me know and I'm happy to sign up and manage that for us
19:28:59 <fungi> we've not really had any takers on our office hours sessions in the past
19:29:31 <clarkb> yup. I think if we did sign up for time this time around we should dedicate it to things we want to cover and not do office hours
19:31:09 <clarkb> Anyway no rush on that decision. Let me know in the next week or so and I can get us signed up if we like. But happy to avoid it. I know some of us end up in a lot of other sessions so having less opendev stuff might be helpful there too
19:31:57 <clarkb> #topic Open Discussion
19:32:02 <clarkb> That was what I had on the agenda. Anything else?
19:32:41 <frickler> any idea regarding the buster ensure-pip issue?
19:33:05 <clarkb> I've managed to miss this issue.
19:33:15 <frickler> if you missed it earlier, we break installing tox on py2 with our self-built wheels
19:33:21 <ianw> frickler: sorry i only briefly saw your notes, but haven't dug into it yet
19:34:03 <frickler> because the wheel gets built as a *py2.py3* wheel and we don't serve the metadata to let pip know that it really is only for python>=3.6
19:34:49 <frickler> we've seen similar issues multiple times and usually worked around them by capping the associated pkgs
19:35:02 <clarkb> is that because tox doesn't set those flags upstream properly?
19:35:12 <clarkb> which results in downstream artifacts being wrong?
19:35:41 <frickler> no, we only serve the file, so pip chooses the package only based on the filename
19:36:23 <frickler> pluggy-1.0.0-py2.py3-none-any.whl
19:37:01 <clarkb> right but when we build it it shouldn't build a py2 wheel if upstream sets their metadata properly iirc
19:37:11 <ianw> (i've had a spec out to add metadata but never got around to it ... but it has details https://review.opendev.org/c/opendev/infra-specs/+/703916/1/specs/wheel-modernisation.rst)
19:37:17 <frickler> of course it is also a bug in that package to not specify the proper settings
19:37:20 <fungi> yeah, a big part of the problem is that our wheel builder solution assumes that since a wheel can be built on a platform it can be used on it, so we build wheels with python3 and then python2 picks them up
19:37:28 <frickler> but we can't easily fix that
19:37:35 <ianw> "Making PEP503 compliant indexes"
19:37:42 <fungi> and unlike the simple api, we have no way to tell pip those wheels aren't suitable for python2
19:38:36 <clarkb> ya so my concern is that if upstream is broken it's hard for us to prevent this. We could address it after the fact one way or another, but if we rely on upstream to specify versions and they get them wrong we'll build bad wheels
19:39:02 <fungi> yeah, we'd basically need to emulate the pypi simple api, and mark the entries as appropriate only for the python versions they built with, if we're supporting multiple interpreter versions on a single platform
19:39:50 <fungi> note this could also happen between different python3 versions if we made multiple versions of python3 available on the same platform, some of which weren't sufficient for wheels we cache for those platforms
19:40:38 <fungi> i suppose an alternative would be to somehow override universalness, and force them to build specific cp27 or cp36 abis
19:41:00 <frickler> apart from fixing the general issue, we should also discuss how to fix zuul-jobs builds in the short term
19:41:04 <clarkb> not all wheels support that iirc as your C linking is more limited
19:41:06 <fungi> could be as simple as renaming the files
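For context, the metadata in question is the data-requires-python attribute from the PEP 503 simple API, which lets pip skip a wheel whose filename claims py2.py3 support but whose project metadata says otherwise. A rough Python illustration of the kind of index entry a compliant mirror would serve; the helper function and the hash placeholder are made up for this example and are not part of the existing wheel mirror tooling:

    from html import escape

    def simple_index_entry(filename, url, requires_python=None):
        """Render one anchor tag for a PEP 503 "simple" index page."""
        attrs = ''
        if requires_python:
            # pip reads this attribute and skips the file on older pythons
            attrs = ' data-requires-python="%s"' % escape(
                requires_python, quote=True)
        return '<a href="%s"%s>%s</a>' % (url, attrs, filename)

    # With this entry a python2 pip would ignore the wheel even though
    # the filename says py2.py3:
    print(simple_index_entry(
        'pluggy-1.0.0-py2.py3-none-any.whl',
        'pluggy-1.0.0-py2.py3-none-any.whl#sha256=...',
        requires_python='>=3.6'))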
19:41:25 <clarkb> frickler: I think my preference for the short term would be to pin the dependency if we know it cannot work
19:41:30 <clarkb> s/pin/cap/
19:41:37 <frickler> https://review.opendev.org/c/zuul/zuul-jobs/+/831136 is what is currently blocked
19:42:12 <frickler> well the failure happens in our test which does a simple "pip install tox"
19:42:44 <frickler> we could add "pluggy\<1" to it globally or add a constraints file that does this only for py2.7
19:42:59 <frickler> I haven't been able to get the latter to work directly without a file
19:43:58 <frickler> the other option that I've proposed would be to not use the wheel mirror on buster
19:44:03 <clarkb> ya I think pip install tox pluggy\<1 is a good way to address it
19:44:12 <clarkb> then we can address mirroring without it being an emergency
19:44:58 <fungi> ahh, so pluggy is a dependency of tox?
19:45:04 <frickler> at least until the latest tox requires pluggy>=1
19:45:09 <frickler> fungi: yes
19:47:13 <frickler> o.k., I'll propose a patch for that
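One way the constraints-file option discussed above could look, as a sketch only (the actual zuul-jobs patch may take a different approach); the environment marker confines the cap to python2 so newer interpreters keep getting the latest pluggy:

    import subprocess
    import tempfile

    # Cap pluggy below 1.0 only where python2 is in play.
    constraints = 'pluggy<1; python_version < "3"\n'

    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
        f.write(constraints)
        constraints_file = f.name

    subprocess.check_call(['pip', 'install', '-c', constraints_file, 'tox'])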
19:48:06 <frickler> another thing, I've been considering whether some additions to nodepool might be useful
19:48:32 <clarkb> like preinstalling tox?
19:48:46 <frickler> not sure whether that would better be discussed in #zuul but I wanted to mention them here
19:48:48 <clarkb> I think we said we'd be ok with that sort of thing as long as we stashed things in virtualenvs to avoid conflicts with the system
19:49:06 <clarkb> ya feel free to bring it up here. We can always take it to #zuul later
19:49:21 <frickler> no, unrelated to that issue. although pre-provisioning nodes might also be interesting
19:49:53 <frickler> these came up while I'm setting up a local zuul environment to test our (osism) kolla deployments
19:50:19 <frickler> currently we use terraform and deploy nodes with fixed IPs and with additional volumes
19:50:53 <ianw> (note we do install a tox venv in nodepool -> https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/infra-package-needs/install.d/40-install-tox)
19:51:01 <clarkb> ianw: thanks
19:51:10 <frickler> so in particular the additional volumes should be easy to do in nodepool, just mark them to be deleted together with the server
19:51:27 <frickler> so then after creation and attaching, nodepool can forget about them
19:51:58 <clarkb> frickler: ya there has been talk about managing additional resources in nodepool. The problem is you have to track those resources just like you track instances because they can be leaked or have "alien" counterparts too
19:52:14 <clarkb> Unfortunately nodepool can't completely forget due to the leaking
19:52:20 <clarkb> (cinder is particularly bad about leaking resources)
19:52:32 <clarkb> But if nodepool were modified to track extra resources that would be doable.
19:53:11 <clarkb> And I agree this is probably a better discussion for #zuul as it may require some design
19:53:26 <clarkb> in particular we'll want to think about what this looks like for EBS volumes and other clouds too
19:54:27 <frickler> o.k., so I need to find some stable homeserver and then move over to #zuul, fine
19:54:31 <fungi> i don't think cinder was particularly bad about leaking resources, just that the nova api was particularly indifferent to whether cinder successfully deleted things
19:55:34 <frickler> do you know if someone already tried this or was it just some idea?
19:55:52 <clarkb> frickler: spamaps had talked about it at one time but I think it was mostly in the idea stage
19:58:04 <frickler> o.k., great to hear that I'm not the only one with weird ideas ;)
19:58:21 <frickler> that would be it from me for now
19:58:26 <clarkb> We are just about at time. Thank you everyone for joining. Remember no meeting next week. We'll see you in two weeks.
20:00:33 <clarkb> #endmeeting