19:00:12 <clarkb> #startmeeting infra
19:00:12 <opendevmeet> Meeting started Tue Nov  5 19:00:12 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:12 <opendevmeet> The meeting name has been set to 'infra'
19:00:15 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/TW6HC5TVBZNEAUWX6HBSATZKK7USHXAB/ Our Agenda
19:00:19 <clarkb> #topic Announcements
19:00:30 <clarkb> No big announcements today
19:01:45 <clarkb> I will be out Thursday morning local time for an appointment then I'm also out Friday and Monday (long holiday weekend with family)
19:02:14 <clarkb> did anyone else have anything to announce?
19:03:08 <clarkb> sounds like no
19:03:10 <clarkb> #topic Zuul-launcher image builds
19:03:36 <clarkb> yesterday corvus reported that we ran a job on an image built and uploaded by nodepool-in-zuul
19:03:40 <clarkb> trying to find that link now
19:04:10 <corvus> #link niz build https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:11 <clarkb> #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:15 <clarkb> #undo
19:04:15 <opendevmeet> Removing item from minutes: #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:48 <corvus> still some work to do on the implementation (we're missing important functionality), but we've got the basics in place
19:05:14 <clarkb> I assume the bulk of that is on the nodepool in zuul side of things? On the opendev side of things we still need jobs to build the various images we have?
19:05:32 <corvus> yep, still a bit of niz implementation to do
19:05:36 <corvus> on the opendev side:
19:05:56 <corvus> * need to fix the image upload to use the correct expire-after headers
19:06:08 <corvus> (since all 3 ways we tried to do that failed; so something needs fixing)
19:06:29 <corvus> * should probably test with a raw image upload for benchmarking
19:06:34 <clarkb> do we know if the problem is the header value itself or something in the client? I'm guessing we don't actually know yet?
19:06:39 <corvus> (but that should wait until after some more niz performance improvements)
19:06:53 <corvus> * need to add more image build jobs (no rush, but not blocked by anything)
19:07:32 <corvus> clarkb: tim said that it needs to be set for both the individual parts and the manifest; i take that to mean that swift cli client is only setting it on one of those
19:07:37 <corvus> so one of the fixes is fix swiftclient to set both
19:07:44 <corvus> another fix could be to fix openstacksdk
19:07:48 <clarkb> got it
19:08:08 <corvus> a third fix could be to do something custom in an ansible module
19:08:19 <clarkb> side note: swift has been a thing for like 13 years? kind of amazing we're the first to hit this issue
19:08:21 <corvus> honestly don't know the relative merits of those, so step 1 is to evaluate those choices :)
19:08:51 <corvus> yeah... i guess people want to keep big files?  :)
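For context on what "set it on both the parts and the manifest" would mean: the Swift API accepts an X-Delete-After (or X-Delete-At) header on any object PUT, and for a segmented upload that header has to land on every segment as well as on the manifest object. A rough sketch with plain requests, using made-up URLs, token, and object names, just to show the shape of the fix whichever tool (swiftclient, openstacksdk, or an ansible module) ends up carrying it:

```python
import requests

# Hypothetical storage URL, token, and object names for illustration only.
storage_url = "https://swift.example.com/v1/AUTH_abc123"
headers = {"X-Auth-Token": "gAAAA...", "X-Delete-After": "604800"}  # expire in 7 days

# Each uploaded segment needs the expiry header...
with open("image.qcow2.part0001", "rb") as part:
    requests.put(f"{storage_url}/images_segments/image.qcow2/0001",
                 data=part, headers=headers).raise_for_status()

# ...and so does the (DLO-style) manifest object, otherwise either the
# segments or the manifest outlives the other.
requests.put(
    f"{storage_url}/images/image.qcow2",
    headers={**headers, "X-Object-Manifest": "images_segments/image.qcow2/"},
).raise_for_status()
```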
19:09:18 <clarkb> anything else?
19:09:28 <corvus> anyway, that's about it i think
19:09:39 <clarkb> thank you for the update. It's cool to see it work end to end like that
19:09:51 <clarkb> #topic Backup Server Pruning
19:09:53 <corvus> yw; ++
19:10:12 <clarkb> As previously mentioned I did some manual cleanup of ethercalc02 backups
19:10:18 <clarkb> I then documented this process
19:10:20 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/933354 Documentation of cleanup process applied to ethercalc02 backups.
19:10:54 <clarkb> Couple of things to note. The first is that there is still some discussion over whether or not this is the best approach
19:11:08 <clarkb> in particular ianw has proposed an alternative which automates most of this process
19:11:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/933700 Automate backup deletions for old servers/services
19:11:41 <clarkb> Reviewing this change is high on my todo list (hoping to look after lunch today) as I do think it would be an improvement on what I've done
19:12:24 <clarkb> The other thing to note is that the backup server checks noticed the backup dir for ethercalc02 was abnormal after the cleanup. This is true since the dir was removed but we should probably avoid having it emit the warning in the first place if we're intentionally cleaning things up
19:12:50 <clarkb> oh and fungi ran our normal pruning process on the vexxhost backup server as we were running low on space, and we're still working through the process for clearing unneeded content
19:13:09 <clarkb> the good news is once we have a process for this and apply it I think we'll free up about 20% of the total disk space on the server which is a good chunk
19:13:11 <fungi> yep, i did that
19:13:33 <fungi> worth noting the ethercalc removal didn't really free up any observable space, but we didn't expect it to
19:13:38 <clarkb> so anyway please review the two changes above and weigh in on whether or not you think either approach is worth pursuing further
19:14:18 <clarkb> #topic Upgrading old servers
19:14:51 <clarkb> I don't see any updates on tonyb's mediawiki stack
19:15:13 <clarkb> tonyb: anything to note? I know there was a pile of comments on that stack so want to make sure that I was clear enough and also not off base on what I wrote
19:16:30 <clarkb> separately, for Gerrit I decided that I would like to upgrade to 3.10 first and then figure out the server upgrade. The reason is that the service update is more straightforward and I think a better candidate for sorting out over the next while of holidays and such
19:16:38 <clarkb> once that is done I'll have to look at server replacement
19:16:50 <clarkb> more on Gerrit 3.10 later in the meeting
19:17:53 <clarkb> any other server upgrade notes to make?
19:19:14 <clarkb> #topic Docker compose plugin with podman service for servers
19:19:29 <clarkb> let's move on. I didn't expect any updates on this topic since last week but wanted to give anyone a chance to chime in if there was movement
19:20:29 <clarkb> ok sounds like there wasn't anything new. We can move on
19:20:32 <clarkb> #topic Enabling mailman3 bounce processing
19:20:36 <clarkb> and now for topics new to this week
19:21:11 <clarkb> frickler added this one and the question aiui is basically can we enable mailman3's automatic bounce processing which removes mailing list members after a certain number of bounces
19:21:36 <frickler> yes, so basically I just stumbled upon this when looking at the mailman list admin UI
19:21:58 <clarkb> Last friday I logged into lists.opendev.org and looked at the options and tried to understand how it works. Basically there are two values we can change: the score threshold beyond which members are removed, and the time period a score is valid for before being reset
19:22:06 <frickler> and since we do see a large number of bounces in our exim logs, I thought maybe give it a try
19:22:40 <clarkb> by default the threshold is 5 (I think you get 1 score point for a hard bounce and half a point for a soft bounce; not sure what the difference between hard and soft bounces is) and that value is reset weekly
19:23:09 <frickler> yes, the default looked pretty sane to me
19:23:21 <clarkb> one thing that occurred to me is we can enable this on service-discuss and since we only get about one email a week we should avoid removing anyone too quickly while we see if it works as expected
19:23:33 <clarkb> but otherwise it does seem like a good idea to remove all the old addresses that are no longer valid
19:23:54 <clarkb> fungi: corvus: ^ any thoughts or concerns on enabling this on some/all of our lists? I think dmarc/dkim validation was one concern?
19:24:13 <frickler> iirc one could also set it to report to the list admin instead of auto-remove addresses?
19:24:28 <clarkb> frickler: yes, you can also do both things.
19:24:39 <clarkb> I figured we'd have it alert list owners as well as disabling/removing people
19:24:41 <corvus> we definitely should if we can; i'm not familiar with the dkim issues that apparently caused us to turn it off
19:24:58 <fungi> as mentioned earlier when we discussed it, spurious dmarc-related rejections are less of a concern with mm3 because it doesn't just score on bounced posts, it follows up with a verp probe and uses the results of that to increase the bounce score
19:25:36 <clarkb> got it so previous concerns are theoretically a non issue in mm3. In that case should I go ahead and enable it on service-discuss and see what happens from there?
19:25:56 <fungi> basically mm3 tries to avoid disabling subscriptions in cases where the bounce could have been related to the message content rather than an actual delivery problem at the destination
19:25:57 <clarkb> oh also if you login to a list user member list page it shows the current score for all the members
19:26:12 <clarkb> which is another way to keep track of how it is processing people
19:26:32 <clarkb> then maybe enabling it on busier lists next week if nothing goes haywire?
19:26:41 <fungi> sounds good to me
19:27:04 <clarkb> cool I'll do that this afternoon for service-discuss too
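For reference, the knobs discussed above correspond to per-list settings in the Mailman 3 REST API (the same ones shown in the web UI). A minimal sketch with mailmanclient, assuming a local REST endpoint and placeholder credentials; exact attribute names and value formats should be double-checked against the running Mailman version:

```python
from mailmanclient import Client

# Placeholder REST endpoint and credentials.
client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
mlist = client.get_list("service-discuss@lists.opendev.org")

settings = mlist.settings
settings["process_bounces"] = True           # enable automatic bounce processing
settings["bounce_score_threshold"] = 5       # hard bounce = 1.0, soft bounce = 0.5
settings["bounce_info_stale_after"] = "7d"   # score resets after a week without bounces
settings["bounce_notify_owner_on_disable"] = True  # also alert list owners
settings.save()
```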
19:27:31 <clarkb> anything else on this topic?
19:28:26 <clarkb> #topic Failures of insecure registry
19:28:57 <clarkb> Recently we had a bunch of issues with the insecure-ci-registry not communicating with container image push/pull clients and that resulted in job timeouts for jobs pushing images in particular
19:29:05 <clarkb> #link https://zuul.opendev.org/t/openstack/builds?job_name=osc-build-image&result=POST_FAILURE&skip=0 Some example build failures
19:29:52 <frickler> seems to have been resolved by the latest image update? or did I miss something?
19:29:53 <clarkb> After restarting the container a few times I noticed that the tracebacks recorded in the logs were happening in cheroot which had a more recent release than our current container image build. I pushed up a change to zuul-registry which rebuilt with latest cheroot as well as updating other system libs like openssl
19:30:21 <clarkb> and yes that image update seems to have made things happier.
19:30:59 <clarkb> Looking at logs it seems that some clients try to negotiate with invalid/unsupported tls versions and the registry is properly rejecting them now. But my theory is this wasn't working previously and we'd eat up threads or some other limited resource on the server
19:31:17 <clarkb> one thing that seemed to back this up is if you listed connections on the server prior to the update there were many tcp connections hanging around
19:31:22 <clarkb> but now that isn't the case.
19:31:32 <clarkb> No concrete evidence that this was the problem but it seems to be much happier now
19:31:54 <clarkb> if you notice things going unhappy again please say something and check the container logs for any tracebacks
19:32:01 <frickler> +1
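If the lingering-connection symptom comes back, a quick spot-check could look roughly like the sketch below; the port is an assumption and should be adjusted to whatever the registry actually listens on:

```python
import subprocess

PORT = ":9000"  # assumption: replace with the registry's actual listening port

# ss -Htan: all TCP sockets, numeric addresses, no header line
out = subprocess.run(["ss", "-Htan"], capture_output=True, text=True, check=True).stdout
conns = [line for line in out.splitlines() if PORT in line]
print(f"{len(conns)} TCP connections mentioning {PORT}")
for state in ("ESTAB", "CLOSE-WAIT", "FIN-WAIT", "TIME-WAIT"):
    print(state, sum(1 for line in conns if line.startswith(state)))
```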
19:33:11 <clarkb> #topic Gerrit 3.10 Upgrade Planning
19:33:19 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33:26 <clarkb> I've been looking at upgrading Gerrit recently
19:33:39 <corvus> wait
19:33:43 <corvus> *** We should plan cleanup of some sort for the backing store. Either test the prune command or swap to a new container and cleanup the old one.
19:33:49 <clarkb> oh right
19:33:51 <clarkb> #undo
19:33:51 <opendevmeet> Removing item from minutes: #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10
19:33:51 <corvus> ^ should decide that
19:34:11 <clarkb> the other thing that was brought up last week was that our swift container backing the insecure ci registry is quite large
19:34:27 <clarkb> there is a zuul registry pruning command but corvus reports the last time we tried it things didn't go well
19:34:39 <corvus> last time we ran prune, it manifested weird errors.  i think zuul-registry is better now, but i don't know if whatever problem happened was specifically identified and fixed.
19:34:59 <clarkb> I think our options are to try the prune command again or instead we can point the registry at a new container then work on deleting the old container in its entirety after the fact
19:35:09 <corvus> so i think if we want to clean it up, we should prepare to replace the container (ie, be ready to make a new one and swap it in). then run the prune command.  if we see issues, swap to the new container.
19:35:20 <fungi> note that cleaning up the old container is likely to require asking the cloud operators to use admin privileges, since the swift bulk delete has to recursively delete each object in the container and i think is limited to something like 1k (maybe it was 10k?) objects per api call
19:35:39 <corvus> fungi: well, we could run it in screen for weeks
19:35:50 <fungi> yeah, i mean, that's also an option
19:36:07 <frickler> doesn't rclone also work for that?
19:36:12 <corvus> prune would be doing similar, so... run that in screen if you do :)
19:36:29 <clarkb> sounds like there is probably some value in trying the prune command since that produces more feedback to the registry codebase and our fallback is the same either way (new container)
19:36:32 <fungi> doing that in rackspace classic for some of our old build log containers would take years of continuous running to complete
19:36:55 <corvus> registry has fewer, larger, objects
19:37:04 <fungi> frickler: i think rclone was one of the suggested tools to use for bulk deletes, but it's still limited by what the swift api supports
19:37:15 <corvus> so hopefully lower value for "years" :)
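If we do end up emptying the old container ourselves, a long-running loop in screen could look roughly like this with openstacksdk; the cloud and container names are placeholders for the real values in our clouds.yaml and registry config:

```python
import openstack

# Placeholder cloud and container names.
conn = openstack.connect(cloud="registry-cloud")
container = "old-registry-container"

# Swift won't delete a non-empty container, and bulk delete is capped at a
# few thousand objects per call, so just walk the listing and delete objects
# one by one -- slow, hence the screen session.
count = 0
for obj in conn.object_store.objects(container):
    conn.object_store.delete_object(obj, container=container)
    count += 1
    if count % 1000 == 0:
        print(f"deleted {count} objects")

conn.object_store.delete_container(container)
```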
19:37:38 <clarkb> need a volunteer to 1) prep a fallback container in swift 2) start zuul-registry prune command in screen then if there are problems 3) swap registry to fallback container and delete old container one way or another
19:38:06 <corvus> if we swap, promotes won't work, obviously.  so it's a little disruptive.
19:38:21 <clarkb> right, people would need to recheck things to regenerate images and upload them again
19:38:41 <clarkb> so maybe there is a 0) which is announce this to service-announce as a potential outcome (rechecks required)
19:39:07 <corvus> well, promote comes from gate, so there's typically no recourse other than "merge another change"
19:39:24 <clarkb> oh right since promote is running post merge
19:39:58 <clarkb> I should be able to work through this sometime next week. Maybe we announce it today for a wednesday or thursday implementation next week? that gives everyone a week of notice
19:40:06 <clarkb> but also happy for someone else to drive it and set a schedule
19:40:18 <corvus> might be good to do on a friday/weekend.  but if we're going to prune, no idea when we'd actually see problems come up, if they do.  could be at any time, and if the duration is "months", then that's hard to announce/schedule.
19:41:01 <clarkb> I see. So maybe a warning that the work is happening and we'd like to know if unexpected results occur
19:41:16 <clarkb> I should be able to start it a week from this friday but not this friday
19:41:18 <corvus> yeah, that sounds like it fits our expected outcomes :)
19:41:26 <fungi> wfm
19:42:01 <clarkb> ok I'll work on a draft email and make sure corvus reviews it for accuracy before sending sometime this week with a plan to perform the work November 15
19:42:11 <fungi> thanks!
19:42:21 <clarkb> and happy for others to help or drive any point along the way :)
19:42:22 <corvus> clarkb: ++
19:43:19 <clarkb> #topic Gerrit 3.10 Upgrade Planning
19:43:39 <clarkb> oh shoot I'm just realizing I didn't undo enough items
19:43:41 <clarkb> oh well
19:43:55 <clarkb> the records will look weird but I think we can get by
19:44:02 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:44:12 <clarkb> so I've been looking at Gerrit 3.10 upgrade planning
19:44:36 <clarkb> per our usual process I've put a document together and tried to identify areas of concern, breaking changes, and then go through and decide what if any impact any of them have on us
19:44:45 <clarkb> I've also manually tested the upgrade (and we automatically test it too)
19:44:59 <clarkb> overall this seems like another straightforward change for us but there are a couple of things to note
19:45:18 <clarkb> one is that gerrit 3.10 can delete old log files itself so I pushed a change to configure that and we can remove our cronjob which does so post upgrade
19:45:43 <clarkb> another is that robot comments are deprecated (I believe zuul uses this for inline comments)
19:46:06 <clarkb> and that, because project imports can lead to change number collisions, some searches may now require project+changenumber
19:46:20 <clarkb> we haven't imported any projects from another gerrit server so I don't think this last item is a real concern for us
19:46:42 <fungi> what does gerrit recommend switching to, away from robot comments?
19:46:54 <clarkb> fungi: using a checks api plugin I think
19:46:54 <fungi> is there a new comment type for that purpose?
19:47:07 <fungi> huh, i thought the checks api plugin was also deprecated
19:47:17 <clarkb> see this is where things get very very confusing
19:47:25 <corvus> nope that's the "checks plugin" :)
19:47:30 <fungi> d'oh
19:47:45 <clarkb> the system where you register jobs with gerrit and it triggers them is deprecated
19:47:53 <corvus> (which is now a deprecated backend for the checks api)
19:47:53 <fungi> i have a feeling i'm going to regret asking questions now ;)
19:48:01 <clarkb> there is a different system where you just teach gerrit how to query your ci system for info, and that one isn't deprecated
19:48:33 <clarkb> but ya my understanding is that direct CI system integrations in Gerrit are expected to go through this system, so things like robot comments are deprecated
19:48:40 <corvus> does the checks api plugin system support line comments?
19:48:42 <clarkb> worst case we send normal comments from zuul
19:48:59 <clarkb> corvus: I don't know but I guess that is a good question since that is what robot comments were doing
19:49:18 <clarkb> but also the deprecation is listed as a breaking change but as far as I can tell they have only deprecated them not actually broken them
19:49:29 <clarkb> so this is a problem that doesn't need solving for 3.10 we just need to be aware of it eventually needing a solution
19:49:41 <corvus> since there isn't currently a zuul checks api implementation, if/when they break we will probably just fall back to regular human-comments
19:49:54 <corvus> should be a 2-line change to zuul
19:50:07 <clarkb> ++
19:50:21 <clarkb> another thing I wanted to call out is that gerrit made reindexing faster if you start with an old index
19:50:28 <fungi> there is no robot only zuul
19:50:50 <clarkb> for this reason our upgrade process is slightly modified in that document to back up the indexes then copy them back in place if we do a downgrade. In theory this will speed up our downgrade process
19:50:57 <corvus> this reads to me like a check result can point to a single line of code and that's it.  https://gerrit.googlesource.com/gerrit/+/master/polygerrit-ui/app/api/checks.ts#435
19:51:30 <corvus> if that's correct, then i don't think the checks api is an adequate replacement, so falling back to human-comments is the best option for what zuul does.
19:51:56 <clarkb> I also tested the downgrade on a held node with 3 changes so that process works but I can't really comment on whether or not it is faster
19:52:11 <clarkb> we are running out of time so I'll end this topic here
19:52:26 <clarkb> but please look over the document and the 3.10 release notes and call out any additional concerns that you feel need investigation
19:52:39 <clarkb> otherwise I'm looking at December 6, 2024 as the upgrade date as that is after the big thanksgiving holiday
19:52:43 <clarkb> should be plenty of time to get prepared by then
19:52:55 <clarkb> #topic RTD Build Trigger Requests
19:53:10 <clarkb> Really quickly before we run out of time I wanted to call out the read the docs build trigger api request failures in ansible
19:53:28 <clarkb> tl;dr seems to be that making the same request with curl or python requests works but having ansible's uri module do so fails with an unauthorized error?
19:53:46 <clarkb> additional debugging with a mitm proxy hasn't shown any clear reason for why this is happening
19:53:50 <frickler> but only on some distros like bookworm or noble, not on f41 or trixie
19:53:58 <frickler> so very weird indeed
19:54:25 <clarkb> I think we could probably rewrite the jobs to use python requests or curl or something and just move on
19:54:26 <fungi> something is causing http basic auth to return a 403 error from the zuul-executor containers, reproducible from some specific distro platforms but works from others
19:54:38 <clarkb> but it is also interesting from a "what is ansible even doing here" perspective that may warrant further debugging
19:55:12 <fungi> it does seem likely to be related to a shared library which will be fixed when we eventually move zuul images from bookworm to trixie
19:55:23 <fungi> but hard to say what exactly
19:55:31 <clarkb> do we think that just rebuilding the image may correct it too?
19:55:42 <clarkb> (I don't think so since iirc we do update the platform when we build zuul images)
19:56:05 <clarkb> another option to try could be updating to python3.12 in the zuul images
19:56:06 <fungi> also strange that we didn't change zuul's container base recently, but the problem only started in late september
19:56:08 <frickler> well trixie is only in testing as of now
19:56:11 <fungi> yes
19:56:29 <clarkb> corvus: is python3.12 something that we should consider generally for zuul or is that not in the cards yet?
19:56:39 <clarkb> for the container images I mean. We're already testing with 3.12
19:56:43 <frickler> and py3.12 on noble seems broken/affected, too, so that wouldn't help
19:56:49 <clarkb> ack
19:57:07 <corvus> not in a rush to change in general :)
19:57:19 <corvus> (but happy to if we think it's a good idea)
19:57:48 <corvus> but all things equal, id maybe just leave it for the next os upgrade?  unless there's something pressing that 3.12 would make better
19:57:56 <clarkb> more mitmproxy testing or setting up a server to log headers and then diffing between platforms is probably the next debugging step if anyone wants to do that
19:58:12 <clarkb> corvus: I don't think there is a pressing need. unittests on 3.11 are faster too...
19:58:14 <frickler> either that or try the curl solution
19:58:32 <clarkb> ya we could just give up for now and switch to a working different implementation then revert when trixie happens
19:58:33 <fungi> my guess is that there was some regression which landed in bookworm 6 weeks ago, or in a change backported into an ansible point release maybe
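If we do fall back to python requests for the trigger, the workaround is basically the small sketch below; the URL and credentials are placeholders standing in for the real webhook endpoint and the job's Zuul secrets, and it only mirrors what the ansible uri task is attempting:

```python
import requests

# Placeholder webhook URL and credentials; the real values come from the
# job's Zuul secrets.
resp = requests.post(
    "https://readthedocs.org/api/v2/webhook/example-project/12345/",
    auth=("rtd-user", "rtd-password"),
)
print(resp.status_code, resp.reason)
resp.raise_for_status()
```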
19:58:44 <frickler> just before we close I'd also want to mention the promote-openstack-manuals-developer failures https://zuul.opendev.org/t/openstack/build/618e3a431a2145afb4344809a9aa84fa/console
19:58:48 <fungi> or a python point release
19:59:10 <frickler> no idea yet what's different there compared to promote-openstack-manuals runs
19:59:29 <clarkb> it's a different target so I suspect that fungi's original idea is the right path
19:59:39 <clarkb> just not complete yet? basically need to get the paths and destinations all in alignment?
19:59:42 <fungi> yeah, but my change didn't seem to fix the error
20:00:04 <clarkb> (side note: another reason why I think developer doesn't need a different target is that it makes the publications more complicated)
20:00:13 <clarkb> and we are at time. Thank you everyone
20:00:30 <clarkb> we can continue discussion in #opendev or on the mailing list as necessary but I don't want to keep anyone here longer than the prescribed hour
20:00:33 <clarkb> #endmeeting