19:00:36 <clarkb> #startmeeting infra
19:00:36 <opendevmeet> Meeting started Tue Oct 24 19:00:36 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:36 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:36 <opendevmeet> The meeting name has been set to 'infra'
19:00:39 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5OF3FQDLDDDDH7XQAJXMRBA6EEAAUBQY/ Our Agenda
19:00:47 <frickler> I'm around-ish
19:01:06 <clarkb> #topic Announcements
19:01:13 <clarkb> As just mentioned the PTG is happening this week
19:01:33 <clarkb> I think many of us have been participating so no surprises, but please do keep that in mind if you are making changes to etherpad/meetpad
19:01:38 <clarkb> #topic Mailman 3
19:01:56 <clarkb> I don't think fungi is here today so I'll try to recap as best I can
19:02:13 <clarkb> Since the last meeting the RH email problems have been resolved. We didn't make any changes on our end
19:02:26 <clarkb> The change to remove lists.openstack.org from our inventory has merged
19:03:10 <clarkb> The next step will be to snapshot the old server and shut it down. Before doing that I think we should extract the latest kernel and put it in place to ensure that snapshot is bootable
19:03:53 <clarkb> There is still the question of adding MX records. I think we can proceed with that if we choose to but it doesn't seem directly related to the mm3 migration (we didn't have them before, don't have them now, and stuff seems to generally work)
19:04:55 <clarkb> On the usage side mm3 seems to be working. I haven't seen any major complaints
19:05:09 <clarkb> One thing I noticed today is that mm3 doesn't indicate who the various list admins are for lists like mm2 did
19:05:13 <clarkb> But that is a minor thing
19:05:30 <clarkb> #topic LE Certcheck List Building Failures
19:05:38 <clarkb> Getting `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`
19:05:51 <clarkb> This happens less than 100% of the time. I suspect this is an ansible problem
19:05:56 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging
19:06:05 <clarkb> this change is intended to improve our logging of the situation so that we can debug it better
19:06:20 <clarkb> reviews welcome. Also Ansible 8 may be the solution
19:06:28 <clarkb> #topic Ansible 8 Upgrade on Bridge
19:06:36 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/898505 will update us to Ansible 8 on Bridge. Should be merged when we can monitor
19:06:43 <clarkb> I'd like to merge this when fungi is back from PTO
19:06:54 <clarkb> mostly so that we can ensure the most eyeballs are present when that happens
19:07:19 <frickler> +1
19:07:38 <clarkb> consider this a heads up that we'll be doing that upgrade soon. The upgrade on the zuul side went well as did system-config-run-* testing with ansible 8 so it's probably fine but good to be careful with global changes like this
19:08:03 <clarkb> #topic Server Upgrades
19:08:08 <clarkb> No movement on this recently
19:08:15 <clarkb> #topic Python container updates
19:08:21 <clarkb> #link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open
19:08:29 <clarkb> Next up here is removing builds of our python3.9 images
19:08:48 <clarkb> I think https://review.opendev.org/c/opendev/system-config/+/898480 should be an easy review to do that removal if anyone has time
19:09:26 <clarkb> once that is done I'll see about syncing with osc and zuul-operator folks to figure out migrations of those images
19:10:22 <clarkb> #topic Gitea 1.21 Upgrade
19:10:39 <clarkb> There is an rc2 for this release now, but still not seeing a changelog
19:11:01 <clarkb> I guess I should update https://review.opendev.org/c/opendev/system-config/+/897679/3/docker/gitea/Dockerfile to rc2 and see if our testing complains
19:11:45 <clarkb> #topic Linaro Cloud SSL Cert Updates
19:12:29 <clarkb> As noted recently our zuul service was not marking jobs NODE_FAILURE for invalid labels because the linaro cloud had an invalid cert and processing for that cloud couldn't progress far enough to register node denials
19:13:04 <frickler> is that something that should be improved in nodepool?
19:13:06 <clarkb> I sent email to kevinz and his response was to quote an email ianw had sent about this in the past. Basically we have access to the server and can run acme.sh to reprovision LE certs and then run kolla-ansible to apply the cert to the system
19:13:22 <clarkb> frickler: possibly. If the cloud is broken due to certs then I could see that being an implied denial
19:13:41 <clarkb> others may argue that issues like that may be temporary and it is better to have the cloud continue to retry?
19:14:01 <clarkb> I think my preference would be to go ahead and fail faster, but we should see if other nodepool users agree
19:14:24 <frickler> IMO at least after some timeout (couple of hours?) the request should fail
19:14:33 <clarkb> I went through the process kevinz shared and reprovisioned things and now all is well.
19:14:42 <frickler> getting jobs stuck for days isn't nice
19:15:02 <clarkb> I want to note that our access is via bridge's root key as we don't have our users created on that server
19:15:21 <frickler> ah, I was just about to ask how to access it
19:15:30 <clarkb> and I stuck a text document with the process on the server itself. Easier to find there than in email I think. But maybe that should go into our actual docs somewhere?
19:16:11 <clarkb> oh and our monitoring of the name for ssl cert expiry used the wrong name. That has been fixed so we'll get reminders in two months to redo this
19:16:40 <frickler> depends on whether we can make progress with automating it. if we need to do this ourselves now every 2-3 months, that may increase the motivation to do so
19:16:56 <clarkb> ya ianw was working on that before other tasks took over
19:16:58 <clarkb> I think it is doable
19:17:02 <clarkb> but needs some work
19:17:38 <clarkb> I think that might look like a daily job that runs acme.sh, copies files if they change, then runs kolla-ansible if there are updates
19:17:53 <clarkb> I don't think we want to integrate it into our existing acme.sh runs because the infrastructure for the domain is all different
19:18:03 <clarkb> would just get ugly doing that and instead we could do the naive simple thing
19:19:27 <clarkb> #topic Gerrit 3.8 Upgrade Planning
19:19:35 <clarkb> This is the major task I'm trying to drive to completion right now
19:19:40 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.8
19:20:04 <clarkb> Good news is this upgrade looks straightforward. There are no schema changes and we can trivially downgrade with an offline reindex if necessary
19:20:23 <clarkb> That said there are a number of breaking changes in the release notes that I'm slowly working through in that etherpad to sort out if they affect us before we take the plunge
19:20:42 <clarkb> The current one is https://review.opendev.org/c/opendev/system-config/+/898989
19:21:04 <clarkb> I had set up autoholds for that yesterday before calling it a day but they all failed on the ruamel thing so I had to recycle those and should have new holds soon
19:21:18 <clarkb> I want to make sure that commentlink changes don't come with any behavior differences then we can update our config on 3.7 first
19:21:26 <clarkb> and then it will be on to the next thing in the list
19:22:10 <clarkb> As for the upgrade itself I'm thinking November 17 or December 1. They are both Fridays. November 10 doesn't work as I'm out that day. November 24 is part of the US Thanksgiving holiday which is a conflict too
19:22:44 <clarkb> by next week I should have an idea if November 17 is doable and we can announce it at that point
19:23:24 <clarkb> #topic Open Discussion
19:23:30 <clarkb> That was all I had on the proper agenda.
19:23:44 <clarkb> Worth noting we capped ruamel.yaml in our bridge ansible installs because ARA doesn't work with the latest versions
19:23:59 <clarkb> this broke system-config-run-* jobs and led to my failed node holds for gerrit the first time around
19:24:39 <clarkb> Anything else?
19:24:58 <frickler> not from me
19:25:53 <clarkb> thank you for your time today and for all the help keeping things running
19:26:07 <clarkb> As usual feel free to bring things up outside the meeting in #opendev or on the mailing list
19:26:25 <clarkb> enjoy the PTG! hope it is a productive week for everyone
19:26:30 <clarkb> #endmeeting
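
The daily job described under the Linaro Cloud SSL Cert Updates topic (run acme.sh, copy files if they change, then run kolla-ansible only if there are updates) might look roughly like the Python sketch below. This is an illustration under assumptions, not the team's implementation: the function names, the directory arguments, and the idea of comparing SHA-256 digests are all hypothetical, and the renew and deploy commands are supplied by the caller so that no acme.sh or kolla-ansible flags are assumed here.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_changed_files(src_dir: Path, dst_dir: Path) -> list[str]:
    """Copy files from src_dir that are missing or different in dst_dir.

    Returns the names of the files that were copied (hypothetical helper).
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    changed = []
    for src in sorted(p for p in src_dir.iterdir() if p.is_file()):
        dst = dst_dir / src.name
        if not dst.exists() or digest(src) != digest(dst):
            shutil.copy2(src, dst)
            changed.append(src.name)
    return changed


def daily_cert_job(renew_cmd, deploy_cmd, src_dir, dst_dir, run=subprocess.run):
    """Renew certs, copy any changed files, and deploy only when needed.

    renew_cmd and deploy_cmd are argument lists chosen by the caller
    (assumed to wrap acme.sh and kolla-ansible respectively). The run
    parameter exists so the job can be exercised without real commands.
    """
    run(renew_cmd, check=True)
    changed = sync_changed_files(Path(src_dir), Path(dst_dir))
    if changed:
        # Only touch the running cloud when a certificate file changed.
        run(deploy_cmd, check=True)
    return changed
```

Keeping this separate from the existing acme.sh runs matches the "naive simple thing" suggested in the meeting: the job has no dependency on the rest of the certificate infrastructure, and on most days it does nothing beyond the renewal check.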