#opendev-meeting log

19:00:36 <clarkb> #startmeeting infra
19:00:36 <opendevmeet> Meeting started Tue Oct 24 19:00:36 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:36 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:36 <opendevmeet> The meeting name has been set to 'infra'
19:00:39 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5OF3FQDLDDDDH7XQAJXMRBA6EEAAUBQY/ Our Agenda
19:00:47 <frickler> I'm around-ish
19:01:06 <clarkb> #topic Announcements
19:01:13 <clarkb> As just mentioned, the PTG is happening this week
19:01:33 <clarkb> I think many of us have been participating so no surprises, but please do keep that in mind if you are making changes to etherpad/meetpad
19:01:38 <clarkb> #topic Mailman 3
19:01:56 <clarkb> I don't think fungi is here today so I'll try to recap as best I can
19:02:13 <clarkb> Since the last meeting the RH email problems have been resolved. We didn't make any changes on our end
19:02:26 <clarkb> The change to remove lists.openstack.org from our inventory has merged
19:03:10 <clarkb> The next step will be to snapshot the old server and shut it down. Before doing that I think we should extract the latest kernel and put it in place to ensure that snapshot is bootable
19:03:53 <clarkb> There is still the question of adding MX records. I think we can proceed with that if we choose to, but it doesn't seem directly related to the mm3 migration (we didn't have them before, don't have them now, and stuff seems to generally work)
19:04:55 <clarkb> On the usage side mm3 seems to be working. I haven't seen any major complaints
19:05:09 <clarkb> One thing I noticed today is that mm3 doesn't indicate who the various list admins are for lists like mm2 did
19:05:13 <clarkb> But that is a minor thing
19:05:30 <clarkb> #topic LE Certcheck List Building Failures
19:05:38 <clarkb> Getting `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`
19:05:51 <clarkb> This happens less than 100% of the time. I suspect this is an ansible problem
19:05:56 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging
19:06:05 <clarkb> this change is intended to improve our logging of the situation so that we can debug it better
19:06:20 <clarkb> reviews welcome. Also Ansible 8 may be the solution
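
A minimal Python sketch of the kind of extra debugging discussed in this topic, assuming the per-host variables can be modelled as plain dicts. It is not Ansible code and not the content of the linked change; `build_certcheck_list` and the example hostnames are hypothetical. The idea is to report which hosts lack `letsencrypt_certcheck_domains` rather than failing on the first missing attribute.

```python
def build_certcheck_list(hostvars, key="letsencrypt_certcheck_domains"):
    """Collect certcheck domains from every host and note hosts lacking the key.

    `hostvars` is a plain dict of {hostname: {variable: value}}; this is an
    assumed stand-in for Ansible's hostvars, not the real object.
    """
    domains = []
    missing = []
    for host in sorted(hostvars):
        value = hostvars[host].get(key)
        if value is None:
            # Record the host so the failure can be debugged later.
            missing.append(host)
        else:
            domains.extend(value)
    return domains, missing


# Hypothetical example data.
example = {
    "a.example.org": {"letsencrypt_certcheck_domains": ["a.example.org:443"]},
    "b.example.org": {},
}
domains, missing = build_certcheck_list(example)
print("domains:", domains)
print("hosts missing the variable:", missing)
```

Logging the `missing` list on every run would show whether the same hosts are affected each time the failure occurs.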
19:06:28 <clarkb> #topic Ansible 8 Upgrade on Bridge
19:06:36 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/898505 will update us to Ansible 8 on Bridge. Should be merged when we can monitor
19:06:43 <clarkb> I'd like to merge this when fungi is back from PTO
19:06:54 <clarkb> mostly so that we can ensure the most eyeballs are present when that happens
19:07:19 <frickler> +1
19:07:38 <clarkb> consider this a heads up that we'll be doing that upgrade soon. The upgrade on the zuul side went well, as did system-config-run-* testing with ansible 8, so it's probably fine but good to be careful with global changes like this
19:08:03 <clarkb> #topic Server Upgrades
19:08:08 <clarkb> No movement on this recently
19:08:15 <clarkb> #topic Python container updates
19:08:21 <clarkb> #link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open
19:08:29 <clarkb> Next up here is removing builds of our python3.9 images
19:08:48 <clarkb> I think https://review.opendev.org/c/opendev/system-config/+/898480 should be an easy review to do that removal if anyone has time
19:09:26 <clarkb> once that is done I'll see about syncing with osc and zuul-operator folks to figure out migrations of those images
19:10:22 <clarkb> #topic Gitea 1.21 Upgrade
19:10:39 <clarkb> There is an rc2 for this release now, but still not seeing a changelog
19:11:01 <clarkb> I guess I should update https://review.opendev.org/c/opendev/system-config/+/897679/3/docker/gitea/Dockerfile to rc2 and see if our testing complains
19:11:45 <clarkb> #topic Linaro Cloud SSL Cert Updates
19:12:29 <clarkb> As noted recently our zuul service was not marking jobs NODE_FAILURE for invalid labels because the linaro cloud had an invalid cert and processing for that cloud couldn't progress far enough to register node denials
19:13:04 <frickler> is that something that should be improved in nodepool?
19:13:06 <clarkb> I sent email to kevinz and his response was to quote an email ianw had sent about this in the past. Basically we have access to the server and can run acme.sh to reprovision LE certs and then run kolla-ansible to apply the cert to the system
19:13:22 <clarkb> frickler: possibly. If the cloud is broken due to certs then I could see that being an implied denial
19:13:41 <clarkb> others may argue that issues like that may be temporary and it is better to have the cloud continue to retry?
19:14:01 <clarkb> I think my preference would be to go ahead and fail faster, but we should see if other nodepool users agree
19:14:24 <frickler> IMO at least after some timeout (couple of hours?) the request should fail
19:14:33 <clarkb> I went through the process kevinz shared and reprovisioned things and now all is well.
19:14:42 <frickler> getting jobs stuck for days isn't nice
19:15:02 <clarkb> I want to note that our access is via bridge's root key as we don't have our users created on that server
19:15:21 <frickler> ah, I was just about to ask how to access it
19:15:30 <clarkb> and I stuck a text document with the process on the server itself. Easier to find there than in email I think. But maybe that should go into our actual docs somewhere?
19:16:11 <clarkb> oh and our monitoring of the name for ssl cert expiry used the wrong name. That has been fixed so we'll get reminders in two months to redo this
19:16:40 <frickler> depends on whether we can make progress with automating it. if we need to do this ourselves now every 2-3 months, that may increase the motivation to do so
19:16:56 <clarkb> ya ianw was working on that before other tasks took over
19:16:58 <clarkb> I think it is doable
19:17:02 <clarkb> but needs some work
19:17:38 <clarkb> I think that might look like a daily job that runs acme.sh, copies files if they change, then runs kolla-ansible if there are updates
19:17:53 <clarkb> I don't think we want to integrate it into our existing acme.sh runs because the infrastructure for the domain is all different
19:18:03 <clarkb> would just get ugly doing that and instead we could do the naive simple thing
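
The daily job described above (run acme.sh, copy files if they changed, run kolla-ansible only on updates) could be sketched roughly as follows. This is a hedged illustration, not an existing script: `renew_and_deploy`, the file paths, and the exact command lines are assumptions, and a real deployment would need the correct cert and key locations for that cloud.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def renew_and_deploy(issued_cert, deployed_cert, renew_cmd, deploy_cmd,
                     run=subprocess.run):
    """Renew, copy the cert if it changed, and redeploy only on change.

    Returns True when a new cert was copied and the deploy command ran.
    `run` is injectable so the flow can be exercised without real commands.
    """
    # Step 1: let the renewal tool decide whether a renewal is due.
    run(renew_cmd, check=True)
    # Step 2: compare what was issued against what is currently deployed.
    new = digest(issued_cert)
    if new is None or new == digest(deployed_cert):
        return False
    shutil.copyfile(issued_cert, deployed_cert)
    # Step 3: apply the updated cert to the cloud.
    run(deploy_cmd, check=True)
    return True


# Hypothetical invocation; every path and argument here is a placeholder:
# renew_and_deploy(
#     "/path/to/issued/fullchain.cer",
#     "/path/to/deployed/cert.pem",
#     ["acme.sh", "--cron"],
#     ["kolla-ansible", "-i", "/path/to/inventory", "reconfigure"],
# )
```

Keeping the change detection separate from the renewal command means the kolla-ansible step is skipped on the many days when nothing was renewed.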
19:19:27 <clarkb> #topic Gerrit 3.8 Upgrade Planning
19:19:35 <clarkb> This is the major task I'm trying to drive to completion right now
19:19:40 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.8
19:20:04 <clarkb> Good news is this upgrade looks straightforward. There are no schema changes and we can trivially downgrade with an offline reindex if necessary
19:20:23 <clarkb> That said there are a number of breaking changes in the release notes that I'm slowly working through in that etherpad to sort out if they affect us before we take the plunge
19:20:42 <clarkb> The current one is https://review.opendev.org/c/opendev/system-config/+/898989
19:21:04 <clarkb> I had set up autoholds for that yesterday before calling it a day but they all failed on the ruamel thing so I had to recycle those and should have new holds soon
19:21:18 <clarkb> I want to make sure that commentlink changes don't come with any behavior differences, then we can update our config on 3.7 first
19:21:26 <clarkb> and then it will be on to the next thing in the list
19:22:10 <clarkb> As for the upgrade itself I'm thinking November 17 or December 1. They are both Fridays. November 10 doesn't work as I'm out that day. November 24 is part of the US Thanksgiving holiday which is a conflict too
19:22:44 <clarkb> by next week I should have an idea if November 17 is doable and we can announce it at that point
19:23:24 <clarkb> #topic Open Discussion
19:23:30 <clarkb> That was all I had on the proper agenda.
19:23:44 <clarkb> Worth noting we capped ruamel.yaml in our bridge ansible installs because ARA doesn't work with the latest versions
19:23:59 <clarkb> this broke system-config-run-* jobs and led to my failed node holds for gerrit the first time around
19:24:39 <clarkb> Anything else?
19:24:58 <frickler> not from me
19:25:53 <clarkb> thank you for your time today and for all the help keeping things running
19:26:07 <clarkb> As usual feel free to bring things up outside the meeting in #opendev or on the mailing list
19:26:25 <clarkb> enjoy the PTG! hope it is a productive week for everyone
19:26:30 <clarkb> #endmeeting