19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Mar 1 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-March/000323.html Our Agenda
19:01:18 <clarkb> #topic Announcements
19:01:30 <clarkb> First up I won't be able to run next week's meeting as I have other meetings
19:01:36 <ianw> o/
19:02:02 <fungi> i expect to be in a similar situation
19:02:05 <clarkb> I've proposed we skip it in the agenda, but if others want a meeting feel free to update the agenda and send it out. I just won't be able to participate
19:02:28 <frickler> I won't travel but have a holiday, so fine with skipping
19:02:42 <ianw> i can host it, but if fungi is out too probably ok to skip
19:03:08 <clarkb> cool consider it skipped then
19:03:14 <clarkb> #topic Actions from last meeting
19:03:19 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-02-22-19.01.txt minutes from last meeting
19:03:23 <clarkb> There were no actions recorded
19:03:33 <clarkb> #topic Topics
19:03:36 <clarkb> Time to dive in
19:03:45 <clarkb> #topic Improving CD throughput
19:03:55 <clarkb> ianw: Did all the logs changes end up landing?
19:04:09 <clarkb> that info is now available to us via zuul if we add our gpg keys ya?
19:04:34 <ianw> not yet
19:04:37 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/830784
19:04:56 <ianw> is the one that turns it on globally -- i was hoping for some extra reviews on that but i'm happy to babysit it
19:05:11 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/830785
19:05:18 <clarkb> ah
19:05:26 <clarkb> I guess I saw it only on the codesearch jobs
19:05:33 <ianw> is a doc update, that turned into quite a bit more than just how to add your gpg keys but it's in there
19:05:45 <clarkb> ya seemed like good improvements to the docs
19:06:18 <ianw> i felt like we were all ok with it, but didn't want to single approve 830784 just in case
19:06:35 <clarkb> ++
19:06:48 <clarkb> Good to get lots of eyeballs on changes like this
19:07:10 <clarkb> Anything else on this topic?
19:07:42 <fungi> oh, i totally missed 830785, thanks
19:08:33 <fungi> i approved 830784 now
19:08:48 <clarkb> #topic Container Maintenance
19:08:51 <ianw> nope, that's it for now, thanks
19:09:25 <clarkb> jentoio: and I met up last week and discussed what needed to be done for giving containers dedicated users. We decided to look at updating the insecure ci registry to start since it doesn't write to the fs
19:09:41 <clarkb> That will help get the shape of things in place before we tackle some of the more complicated containers
19:09:51 <jentoio> I'll be working on zuul-registry this afternoon
19:09:58 <clarkb> jentoio: thanks again!
19:10:08 <jentoio> finally allocated some time to focus on it
19:10:29 <clarkb> so ya some progress here. Reviews appreciated once we have changes
19:11:09 <clarkb> #topic Cleaning Up Old Reviews
19:11:12 <clarkb> #link https://review.opendev.org/q/topic:retirement+status:open Changes to retire all the unused repos.
19:11:50 <clarkb> We're basically at actually retiring content from the repos. I would appreciate it if we could start picking some of those off. I don't think they are difficult reviews, you should be able to fetch them and ls the repo contents to make sure nothing got left behind and check the README content, but there are a lot of them
19:12:43 <clarkb> Once those changes land I'll do bulk abandons for open changes on those repos and then push up the zuul removal cleanups
19:12:49 <clarkb> as well as gerrit acl updates
19:12:53 <ianw> ++ sorry i meant to look at them, will do
19:13:07 <clarkb> thanks!
19:13:46 <fungi> yeah, same. i'm going to blame the fact that i had unsubbed from those repos in gertty a while back
19:14:28 <clarkb> #topic Gitea 1.16
19:15:14 <clarkb> I think it is worth waiting until 1.16.3 releases and then make a push to upgrade
19:15:32 <clarkb> https://github.com/go-gitea/gitea/milestone/113 indicates the 1.16.3 release should happen soon. Only one remaining issue left to address
19:15:56 <clarkb> The reason for this is that of the issues we've discovered (diff rendering of images and build inconsistencies with a dep) they should both be fixed by 1.16.3
19:16:23 <clarkb> I think we can hold off on reviewing things for now until 1.16.3 happens. I'll get a change pushed for that and update our hold to inspect it
19:16:42 <clarkb> Mostly a heads up that I haven't forgotten about this, but waiting until it stabilizes a bit more
19:17:05 <clarkb> #topic Rocky Linux
19:17:10 <clarkb> #link https://review.opendev.org/c/zuul/nodepool/+/831108 Need new dib for next steps
19:17:20 <clarkb> This is a nodepool change that will upgrade dib in our builder images
19:17:35 <clarkb> This new dib should address new issues found with Rocky Linux builds
19:18:11 <clarkb> Please review that if you get a chance
19:18:26 <clarkb> also keep an eye out for any unexpected new glean behavior since it updated a couple of days ago
19:18:56 <clarkb> #topic Removing airship-citycloud nodepool provider
19:19:09 <clarkb> This morning I woke up to an email that this provider is going away for us today
19:19:47 <clarkb> The first two changes to remove it have landed. After this meeting I need to check if nodepool is complaining about a node in that provider that wouldn't delete. I'll run nodepool erase airship-citycloud if so
19:19:48 <fungi> ericsson had been paying for that citycloud account in support of airship testing, but has decided to discontinue it
19:20:01 <clarkb> right
19:20:32 <clarkb> Once nodepool is done we can land the system-config and dns update changes. Then we can delete the mirror node if it hasn't gone away already
19:20:33 <frickler> do we need to inform the airship team or do they know already?
19:20:47 <clarkb> frickler: this was apparently discussed with them months ago. They just neglected to tell us :/
19:21:06 <frickler> ah, o.k.
19:21:39 <clarkb> I'll keep driving this along today to help ensure it doesn't cause problems for us.
19:21:47 <clarkb> Thank you for the help with reviews.
19:22:09 <clarkb> Oh the mirror node is already in the emergency file to avoid ansible errors if they remove it quicker than we can take it out of our inventory
19:23:51 <clarkb> #topic zuul-registry bugs
19:24:15 <clarkb> We've discovered some zuul-registry bugs. Specifically the way it handles concurrent uploads of the same image layer blobs is broken
19:24:54 <clarkb> What happens is the second upload notices that the first one is running and it exits early. When it exits early that second uploading client HEADs the object to get its size back (even though it already knows the size) and we get a short read because the first upload isn't completed yet
19:25:15 <clarkb> then when the second client uses that second short size to push a manifest using that blob it errors because the server validates the input sizes
19:25:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul-registry/+/831235 Should address these bugs. Please monitor container jobs after this lands.
19:25:50 <ianw> ++ will review. i got a little sidetracked with podman issues that got uncovered too
19:25:55 <clarkb> This change addresses that by forcing all uploads to run to their own completion. Then we handle the data such that it should always be valid when read (using atomic moves on the filesystem and swift eventual consistency stuff)
19:26:17 <clarkb> There is also a child change that makes the testing of zuul-registry much more concurrent to try and catch these issues
19:26:29 <clarkb> it caught problems with one of my first attempts at fixing this so seems to be helpful already
19:26:53 <clarkb> In general this entire problem is avoided if you don't push the same image blobs at the same time which is why we likely haven't noticed until recently
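As a rough illustration of the "atomic moves on the filesystem" idea clarkb describes above, the sketch below writes a blob to a temporary file and renames it into place, so a concurrent reader only ever sees either nothing or a complete object. This is a minimal standalone example, not the actual zuul-registry change in 831235; the function name and path handling are assumptions for illustration only.

    import os
    import tempfile

    def atomic_put(path, data):
        """Store data so readers never observe a partially written blob.

        Illustrative only: the real fix in 831235 also forces concurrent
        uploads to each run to completion and relies on swift semantics
        for the object storage backend.
        """
        directory = os.path.dirname(path)
        # Create the temporary file in the destination directory so the
        # final rename stays on a single filesystem and is atomic.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            # os.replace() atomically swaps the finished file into place.
            os.replace(tmp_path, path)
        except Exception:
            os.unlink(tmp_path)
            raise
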
19:27:52 <clarkb> #topic PTG Planning
19:28:10 <clarkb> I was reminded that the deadline for teams signing up to the PTG is approaching in about 10 days
19:28:20 <clarkb> My initial thought was that we could skip this one since the last one was very quiet for us
19:28:34 <clarkb> But if there is interest in having a block of time or two let me know and I'm happy to sign up and manage that for us
19:28:59 <fungi> we've not really had any takers on our office hours sessions in the past
19:29:31 <clarkb> yup. I think if we did time this time around we should dedicate it to things we want to cover and not do office hours
19:31:09 <clarkb> Anyway no rush on that decision. Let me know in the next week or so and I can get us signed up if we like. But happy to avoid it. I know some of us end up in a lot of other sessions so having less opendev stuff might be helpful there too
19:31:57 <clarkb> #topic Open Discussion
19:32:02 <clarkb> That was what I had on the agenda. Anything else?
19:32:41 <frickler> any idea regarding the buster ensure-pip issue?
19:33:05 <clarkb> I've managed to miss this issue.
19:33:15 <frickler> if you missed it earlier, we break installing tox on py2 with our self-built wheels
19:33:21 <ianw> frickler: sorry i only briefly saw your notes, but haven't dug into it yet
19:34:03 <frickler> because the wheel gets built as a *py2.py3* wheel and we don't serve the metadata to let pip know that this really only is for >=py3.6
19:34:49 <frickler> we've seen similar issues multiple times and usually worked around by capping the associated pkgs
19:35:02 <clarkb> is that because tox doesn't set those flags upstream properly?
19:35:12 <clarkb> which results in downstream artifacts being wrong?
19:35:41 <frickler> no, we only serve the file, so pip chooses the package only based on the filename
19:36:23 <frickler> pluggy-1.0.0-py2.py3-none-any.whl
19:37:01 <clarkb> right but when we build it it shouldn't build a py2 wheel if upstream sets their metadata properly iirc
19:37:11 <ianw> (i've had a spec out to add metadata but never got around to it ... but it has details https://review.opendev.org/c/opendev/infra-specs/+/703916/1/specs/wheel-modernisation.rst)
19:37:17 <frickler> of course it is also a bug in that package to not specify the proper settings
19:37:20 <fungi> yeah, a big part of the problem is that our wheel builder solution assumes that since a wheel can be built on a platform it can be used on it, so we build wheels with python3 and then python picks them up
19:37:28 <frickler> but we can't easily fix that
19:37:35 <ianw> "Making PEP503 compliant indexes"
19:37:42 <fungi> and unlike the simple api, we have no way to tell pip those wheels aren't suitable for python
19:38:36 <clarkb> ya so my concern is that if upstream is broken it's hard for us to prevent this. We could address it after the fact one way or another, but if we rely on upstream to specify versions and they get them wrong we'll build bad wheels
19:39:02 <fungi> yeah, we'd basically need to emulate the pypi simple api, and mark the entries as appropriate only for the python versions they built with, if we're supporting multiple interpreter versions on a single platform
19:39:50 <fungi> note this could also happen between different python3 versions if we made multiple versions of python3 available on the same platform, some of which weren't sufficient for wheels we cache for those platforms
19:40:38 <fungi> i suppose an alternative would be to somehow override universalness, and force them to build specific cp27 or cp36 abis
19:41:00 <frickler> apart from fixing the general issue, we should also discuss how to short-term fix zuul-jobs builds
19:41:04 <clarkb> not all wheels support that iirc as your C linking is more limited
19:41:06 <fungi> could be as simple as renaming the files
19:41:25 <clarkb> frickler: I think my preference for the short term would be to pin the dependency if we know it cannot work
19:41:30 <clarkb> s/pin/cap/
19:41:37 <frickler> https://review.opendev.org/c/zuul/zuul-jobs/+/831136 is what is currently blocked
19:42:12 <frickler> well the failure happens in our test which does a simple "pip install tox"
19:42:44 <frickler> we could add "pluggy\<1" to it globally or add a constraints file that does this only for py2.7
19:42:59 <frickler> I haven't been able to get the latter to work directly without a file
19:43:58 <frickler> the other option that I've proposed would be to not use the wheel mirror on buster
19:44:03 <clarkb> ya I think pip install tox pluggy\<1 is a good way to address it
19:44:12 <clarkb> then we can address mirroring without it being an emergency
19:44:58 <fungi> ahh, so pluggy is a dependency of tox?
19:45:04 <frickler> at least until latest tox requires pluggy>=1
19:45:09 <frickler> fungi: yes
19:47:13 <frickler> o.k., I'll propose a patch for that
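For background on why pip on python 2.7 accepts that wheel: when only a flat file listing is served, pip decides compatibility purely from the tags encoded in the filename, and py2.py3-none-any matches every interpreter, while the Requires-Python >=3.6 metadata that would exclude it never reaches pip. A small sketch (assuming the packaging library, 20.9 or newer, is available) shows what the filename alone advertises; only the filename comes from the discussion above, everything else is illustrative.

    from packaging.utils import parse_wheel_filename

    # The filename frickler quoted; nothing else about the file is inspected.
    name, version, build, tags = parse_wheel_filename(
        "pluggy-1.0.0-py2.py3-none-any.whl")

    print(name, version)  # pluggy 1.0.0
    # The tag set contains py2-none-any and py3-none-any, so a filename-only
    # index looks installable to python 2.7 even though the package's own
    # metadata restricts it to python >= 3.6.
    for tag in sorted(str(t) for t in tags):
        print(tag)
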
19:48:06 <frickler> another thing, I've been considering whether some additions to nodepool might be useful
19:48:32 <clarkb> like preinstalling tox?
19:48:46 <frickler> not sure whether that would better be discussed in #zuul but I wanted to mention them here
19:48:48 <clarkb> I think we said we'd be ok with that sort of thing as long as we stashed things in virtualenvs to avoid conflicts with the system
19:49:06 <clarkb> ya feel free to bring it up here. We can always take it to #zuul later
19:49:21 <frickler> no, unrelated to that issue. although pre-provisioning nodes might also be interesting
19:49:53 <frickler> these came up while I'm setting up a local zuul environment to test our (osism) kolla deployments
19:50:19 <frickler> currently we use terraform and deploy nodes with fixed IPs and with additional volumes
19:50:53 <ianw> (note we do install a tox venv in nodepool -> https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/infra-package-needs/install.d/40-install-tox)
19:51:01 <clarkb> ianw: thanks
19:51:10 <frickler> so in particular the additional volumes should be easily done in nodepool, just mark them to be deleted together with the server
19:51:27 <frickler> so then after creation and attaching, nodepool can forget about them
19:51:58 <clarkb> frickler: ya there has been talk about managing additional resources in nodepool. The problem is you have to track those resources just like you track instances because they can be leaked or have "alien" counterparts too
19:52:14 <clarkb> Unfortunately nodepool can't completely forget due to the leaking
19:52:20 <clarkb> (cinder is particularly bad about leaking resources)
19:52:32 <clarkb> But if nodepool were modified to track extra resources that would be doable.
19:53:11 <clarkb> And I agree this is probably a better discussion for #zuul as it may require some design
19:53:26 <clarkb> in particular we'll want to think about what this looks like for EBS volumes and other clouds too
19:54:27 <frickler> o.k., so I need to find some stable homeserver and then move over to #zuul, fine
19:54:31 <fungi> i don't think cinder was particularly bad about leaking resources, just that the nova api was particularly indifferent to whether cinder successfully deleted things
19:55:34 <frickler> do you know if someone already tried this or was it just some idea?
19:55:52 <clarkb> frickler: spamaps had talked about it at one time but I think it was mostly in the idea stage
19:58:04 <frickler> o.k., great to hear that I'm not the only one with weird ideas ;)
19:58:21 <frickler> that would be it from me for now
19:58:26 <clarkb> We are just about at time. Thank you everyone for joining. Remember no meeting next week. We'll see you in two weeks.
20:00:33 <clarkb> #endmeeting