19:00:12 #startmeeting infra
19:00:12 Meeting started Tue Nov 5 19:00:12 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:12 The meeting name has been set to 'infra'
19:00:15 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/TW6HC5TVBZNEAUWX6HBSATZKK7USHXAB/ Our Agenda
19:00:19 #topic Announcements
19:00:30 No big announcements today
19:01:45 I will be out Thursday morning local time for an appointment then I'm also out Friday and Monday (long holiday weekend with family)
19:02:14 did anyone else have anything to announce?
19:03:08 sounds like no
19:03:10 #topic Zuul-launcher image builds
19:03:36 yesterday corvus reported that we ran a job on a nodepool-in-zuul built and uploaded image
19:03:40 trying to find that link now
19:04:10 #link niz build https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:11 #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:15 #undo
19:04:15 Removing item from minutes: #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6
19:04:48 still some work to do on the implementation (we're missing important functionality), but we've got the basics in place
19:05:14 I assume the bulk of that is on the nodepool-in-zuul side of things? On the opendev side of things we still need jobs to build the various images we have?
19:05:32 yep, still a bit of niz implementation to do
19:05:36 on the opendev side:
19:05:56 * need to fix the image upload to use the correct expire-after headers
19:06:08 (since all 3 ways we tried to do that failed; so something needs fixing)
19:06:29 * should probably test with a raw image upload for benchmarking
19:06:34 do we know if the problem is the header value itself or something in the client?
I'm guessing we don't actually know yet?
19:06:39 (but that should wait until after some more niz performance improvements)
19:06:53 * need to add more image build jobs (no rush, but not blocked by anything)
19:07:32 clarkb: tim said that it needs to be set for both the individual parts and the manifest; i take that to mean that the swift cli client is only setting it on one of those
19:07:37 so one of the fixes is to fix swiftclient to set both
19:07:44 another fix could be to fix openstacksdk
19:07:48 got it
19:08:08 a third fix could be to do something custom in an ansible module
19:08:19 side note: swift has been a thing for like 13 years? kind of amazing we're the first to hit this issue
19:08:21 honestly don't know the relative merits of those, so step 1 is to evaluate those choices :)
19:08:51 yeah... i guess people want to keep big files? :)
19:09:18 anything else?
19:09:28 anyway, that's about it i think
19:09:39 thank you for the update. It's cool to see it work end to end like that
19:09:51 #topic Backup Server Pruning
19:09:53 yw; ++
19:10:12 As previously mentioned I did some manual cleanup of ethercalc02 backups
19:10:18 I then documented this process
19:10:20 #link https://review.opendev.org/c/opendev/system-config/+/933354 Documentation of cleanup process applied to ethercalc02 backups.
19:10:54 A couple of things to note. The first is that there is still some discussion over whether or not this is the best approach
19:11:08 in particular ianw has proposed an alternative which automates most of this process
19:11:21 #link https://review.opendev.org/c/opendev/system-config/+/933700 Automate backup deletions for old servers/services
19:11:41 Reviewing this change is high on my todo list (hoping to look after lunch today) as I do think it would be an improvement on what I've done
19:12:24 The other thing to note is that the backup server checks noticed the backup dir for ethercalc02 was abnormal after the cleanup.
This is true since the dir was removed, but we should probably avoid having it emit the warning in the first place if we're intentionally cleaning things up
19:12:50 oh and fungi ran our normal pruning process on the vexxhost backup server as we were running low on space and we're still working through the process for clearing unneeded content
19:13:09 the good news is once we have a process for this and apply it I think we'll free up about 20% of the total disk space on the server which is a good chunk
19:13:11 yep, i did that
19:13:33 worth noting the ethercalc removal didn't really free up any observable space, but we didn't expect it to
19:13:38 so anyway please review the two changes above and weigh in on whether or not you think either approach is worth pursuing further
19:14:18 #topic Upgrading old servers
19:14:51 I don't see any updates on tonyb's mediawiki stack
19:15:13 tonyb: anything to note? I know there was a pile of comments on that stack so want to make sure that I was clear enough and also not off base on what I wrote
19:16:30 separately I decided with Gerrit that I would like to upgrade Gerrit to 3.10 first then figure out the server upgrade. The reason for that is the service update is more straightforward and I think a better candidate for sorting out over the next while of holidays and such
19:16:38 once that is done I'll have to look at server replacement
19:16:50 more on Gerrit 3.10 later in the meeting
19:17:53 any other server upgrade notes to make?
19:19:14 #topic Docker compose plugin with podman service for servers
19:19:29 let's move on. I didn't expect any updates on this topic since last week but wanted to give a chance for anyone to chime in if there was movement
19:20:29 ok sounds like there wasn't anything new.
We can move on
19:20:32 #topic Enabling mailman3 bounce processing
19:20:36 and now for topics new to this week
19:21:11 frickler added this one and the question aiui is basically: can we enable mailman3's automatic bounce processing, which removes mailing list members after a certain number of bounces
19:21:36 yes, so basically I just stumbled upon this when looking at the mailman list admin UI
19:21:58 Last friday I logged into lists.opendev.org and looked at the options and tried to understand how it works. Basically there are two values we can change: the score threshold that, when exceeded, gets members removed, and the time period your score is valid for before being reset
19:22:06 and since we do see a large number of bounces in our exim logs, I thought maybe give it a try
19:22:40 by default the threshold is 5 (I think you get 1 score point for a hard bounce and half a point for a soft bounce; not sure what the difference between hard and soft bounces is) and that value is reset weekly
19:23:09 yes, the default looked pretty sane to me
19:23:21 one thing that occurred to me is we can enable this on service-discuss, and since we only get about one email a week we should avoid removing anyone too quickly while we see if it works as expected
19:23:33 but otherwise it does seem like a good idea to remove all the old addresses that are no longer valid
19:23:54 fungi: corvus: ^ any thoughts or concerns on enabling this on some/all of our lists? I think dmarc/dkim validation was one concern?
19:24:13 iirc one could also set it to report to the list admin instead of auto-remove addresses?
19:24:28 frickler: yes, you can also do both things.
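The scoring model described in the discussion above (1 point per hard bounce, half a point per soft bounce, removal above a threshold of 5, and a weekly reset of stale scores) can be sketched as a small simulation. This is a hedged illustration of the behavior as described in the meeting, not Mailman's actual implementation; the exact weights, reset semantics, and class names here are assumptions.

```python
# Hedged sketch of the mailman3 bounce-scoring model as described in the
# meeting: hard bounces add 1.0 to a member's score, soft bounces add 0.5,
# a score that has gone quiet for a week is reset, and delivery is disabled
# once the score exceeds the threshold. Mailman's real internals may differ.
from datetime import datetime, timedelta

THRESHOLD = 5.0                    # default bounce score threshold per the meeting
STALE_AFTER = timedelta(days=7)    # assumed "reset weekly" staleness window

class MemberBounceTracker:
    def __init__(self):
        self.score = 0.0
        self.last_bounce = None

    def record_bounce(self, when, hard=True):
        """Apply one bounce; return True when the member should be disabled."""
        # Reset a stale score before applying the new bounce.
        if self.last_bounce is not None and when - self.last_bounce > STALE_AFTER:
            self.score = 0.0
        self.score += 1.0 if hard else 0.5
        self.last_bounce = when
        return self.score > THRESHOLD

if __name__ == "__main__":
    t = MemberBounceTracker()
    day = datetime(2024, 11, 1)
    # Five hard bounces within a week leave the member exactly at the threshold;
    # a sixth pushes them over it and would disable delivery.
    for i in range(5):
        disabled = t.record_bounce(day + timedelta(days=i))
    print(t.score, disabled)                              # 5.0 False
    print(t.record_bounce(day + timedelta(days=5)))       # True
```

This also illustrates why enabling it on a low-traffic list like service-discuss is a gentle trial: with roughly one post a week, scores reset before they can accumulate unless an address is consistently bouncing.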
19:24:39 I figured we'd have it alert list owners as well as disabling/removing people
19:24:41 we definitely should if we can; i'm not familiar with the dkim issues that apparently caused us to turn it off
19:24:58 as mentioned earlier when we discussed it, spurious dmarc-related rejections are less of a concern with mm3 because it doesn't just score on bounced posts, it follows up with a verp probe and uses the results of that to increase the bounce score
19:25:36 got it, so previous concerns are theoretically a non-issue in mm3. In that case should I go ahead and enable it on service-discuss and see what happens from there?
19:25:56 basically mm3 tries to avoid disabling subscriptions in cases where the bounce could have been related to the message content rather than an actual delivery problem at the destination
19:25:57 oh also if you log in to a list's member list page it shows the current score for all the members
19:26:12 which is another way to keep track of how it is processing people
19:26:32 then maybe enabling it on busier lists next week if nothing goes haywire?
19:26:41 sounds good to me
19:27:04 cool I'll do that this afternoon for service-discuss too
19:27:31 anything else on this topic?
19:28:26 #topic Failures of insecure registry
19:28:57 Recently we had a bunch of issues with the insecure-ci-registry not communicating with container image push/pull clients, and that resulted in job timeouts for jobs pushing images in particular
19:29:05 #link https://zuul.opendev.org/t/openstack/builds?job_name=osc-build-image&result=POST_FAILURE&skip=0 Example build failures
19:29:52 seems to have been resolved by the latest image update? or did I miss something?
19:29:53 After restarting the container a few times I noticed that the tracebacks recorded in the logs were happening in cheroot, which had a more recent release than our current container image build.
I pushed up a change to zuul-registry which rebuilt with latest cheroot as well as updating other system libs like openssl
19:30:21 and yes that image update seems to have made things happier.
19:30:59 Looking at logs it seems that some clients try to negotiate with invalid/unsupported tls versions and the registry is properly rejecting them now. But my theory is this wasn't working previously and we'd eat up threads or some other limited resource on the server
19:31:17 one thing that seemed to back this up is that if you listed connections on the server prior to the update there were many tcp connections hanging around
19:31:22 but now that isn't the case.
19:31:32 No concrete evidence that this was the problem but it seems to be much happier now
19:31:54 if you notice things going unhappy again please say something and check the container logs for any tracebacks
19:32:01 +1
19:33:11 #topic Gerrit 3.10 Upgrade Planning
19:33:19 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33:26 I've been looking at upgrading Gerrit recently
19:33:39 wait
19:33:43 *** We should plan cleanup of some sort for the backing store. Either test the prune command or swap to a new container and clean up the old one.
19:33:49 oh right
19:33:51 #undo
19:33:51 Removing item from minutes: #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10
19:33:51 ^ should decide that
19:34:11 the other thing that was brought up last week was that our swift container backing the insecure ci registry is quite large
19:34:27 there is a zuul-registry pruning command but corvus reports the last time we tried it things didn't go well
19:34:39 last time we ran prune, it manifested weird errors. i think zuul-registry is better now, but i don't know if whatever problem happened was specifically identified and fixed.
19:34:59 I think our options are to try the prune command again, or instead we can point the registry at a new container and then work on deleting the old container in its entirety after the fact
19:35:09 so i think if we want to clean it up, we should prepare to replace the container (ie, be ready to make a new one and swap it in). then run the prune command. if we see issues, swap to the new container.
19:35:20 note that cleaning up the old container is likely to require asking the cloud operators to use admin privileges, since the swift bulk delete has to recursively delete each object in the container and i think is limited to something like 1k (maybe it was 10k?) objects per api call
19:35:39 fungi: well, we could run it in screen for weeks
19:35:50 yeah, i mean, that's also an option
19:36:07 doesn't rclone also work for that?
19:36:12 prune would be doing similar, so... run that in screen if you do :)
19:36:29 sounds like there is probably some value in trying the prune command since that produces more feedback to the registry codebase and our fallback is the same either way (new container)
19:36:32 doing that in rackspace classic for some of our old build log containers would take years of continuous running to complete
19:36:55 registry has fewer, larger, objects
19:37:04 frickler: i think rclone was one of the suggested tools to use for bulk deletes, but it's still limited by what the swift api supports
19:37:15 so hopefully lower value for "years" :)
19:37:38 need a volunteer to 1) prep a fallback container in swift 2) start the zuul-registry prune command in screen then if there are problems 3) swap the registry to the fallback container and delete the old container one way or another
19:38:06 if we swap, promotes won't work, obviously. so it's a little disruptive.
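The "run it in screen for weeks" point above comes from how swift deletion works: listings are paginated with a marker, deletions happen per object, and there is no single "drop the container and everything in it" call for a non-empty container. A hedged sketch of that loop, written against a swiftclient-style interface (the page-size cap and client shape are assumptions; a real run would use an authenticated `swiftclient.client.Connection`):

```python
# Hedged sketch of emptying a large swift container the way the API forces
# you to: list object names a page at a time (listings are capped, commonly
# around 10k names per call) and delete each object individually. `conn` is
# anything exposing swiftclient-style get_container()/delete_object(); the
# tests use a fake in place of a real authenticated connection.

def empty_container(conn, container, page_size=10000):
    """Delete every object in `container`, page by page.

    Returns the number of objects deleted. For a container the size of the
    registry's backing store this can run for a very long time, hence the
    suggestion in the meeting to run it inside screen.
    """
    deleted = 0
    marker = ""
    while True:
        # swiftclient's get_container returns (headers, [{'name': ...}, ...]);
        # `marker` resumes the listing after the last name already seen.
        _headers, objects = conn.get_container(
            container, marker=marker, limit=page_size)
        if not objects:
            return deleted
        for obj in objects:
            conn.delete_object(container, obj["name"])
            deleted += 1
        marker = objects[-1]["name"]
```

Tools like rclone or swift's bulk-delete middleware batch the per-object deletes, but as noted above they are bounded by the same API limits, so the overall shape (and duration) is similar.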
19:38:21 right, people would need to recheck things to regenerate images and upload them again
19:38:41 so maybe there is a 0) which is announce this to service-announce as a potential outcome (rechecks required)
19:39:07 well, promote comes from gate, so there's typically no recourse other than "merge another change"
19:39:24 oh right, since promote is running post merge
19:39:58 I should be able to work through this sometime next week. Maybe we announce it today for a wednesday or thursday implementation next week? that gives everyone a week of notice
19:40:06 but also happy for someone else to drive it and set a schedule
19:40:18 might be good to do on a friday/weekend. but if we're going to prune, no idea when we'd actually see problems come up, if they do. could be at any time, and if the duration is "months", then that's hard to announce/schedule.
19:41:01 I see. So maybe a warning that the work is happening and we'd like to know if unexpected results occur
19:41:16 I should be able to start it a week from this friday but not this friday
19:41:18 yeah, that sounds like it fits our expected outcomes :)
19:41:26 wfm
19:42:01 ok I'll work on a draft email and make sure corvus reviews it for accuracy before sending sometime this week with a plan to perform the work November 15
19:42:11 thanks!
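Circling back to the expire-after header problem from the zuul-launcher topic earlier: per tim's diagnosis, `X-Delete-After` has to be present on both the segment uploads and the manifest upload, and the suspicion is that the existing clients set it on only one of those. A hedged sketch of what "set it on both" means at the HTTP level; the endpoint, container, and object names are made up for illustration, nothing is sent on the wire here, and a real upload would go through swiftclient or openstacksdk with authentication:

```python
# Hedged sketch of the fix described in the meeting: when a large image is
# uploaded to swift as segments plus an SLO manifest, the X-Delete-After
# header must be carried by every segment PUT *and* by the manifest PUT.
# Only prepared requests are built here (nothing is sent), so the headers
# can be inspected; URLs and names below are placeholders.
import requests

EXPIRE_AFTER = str(14 * 24 * 3600)  # e.g. expire image objects after 14 days

def build_upload_requests(base_url, container, name, segment_names):
    """Return prepared PUT requests for each segment and the SLO manifest,
    every one of them carrying X-Delete-After."""
    reqs = []
    for seg in segment_names:
        reqs.append(requests.Request(
            "PUT", "%s/%s_segments/%s" % (base_url, container, seg),
            headers={"X-Delete-After": EXPIRE_AFTER}).prepare())
    # The manifest needs the header too; per the meeting, setting it on only
    # one of manifest/segments is what leaves objects behind past expiry.
    reqs.append(requests.Request(
        "PUT", "%s/%s/%s" % (base_url, container, name),
        params={"multipart-manifest": "put"},
        headers={"X-Delete-After": EXPIRE_AFTER}).prepare())
    return reqs
```

Whichever of the three fixes is chosen (swiftclient, openstacksdk, or a custom ansible module), the end state should look like this: the header on every PUT involved in the upload.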
19:42:21 and happy for others to help or drive any point along the way :)
19:42:22 clarkb: ++
19:43:19 #topic Gerrit 3.10 Upgrade Planning
19:43:39 oh shoot I'm just realizing I didn't undo enough items
19:43:41 oh well
19:43:55 the records will look weird but I think we can get by
19:44:02 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:44:12 so I've been looking at Gerrit 3.10 upgrade planning
19:44:36 per our usual process I've put a document together and tried to identify areas of concern, breaking changes, and then go through and decide what if any impact any of them have on us
19:44:45 I've also manually tested the upgrade (and we automatically test it too)
19:44:59 overall this seems like another straightforward change for us but there are a couple of things to note
19:45:18 one is that gerrit 3.10 can delete old log files itself, so I pushed a change to configure that and we can remove our cronjob which does so post upgrade
19:45:43 another is that robot comments are deprecated (I believe zuul uses this for inline comments)
19:46:06 and that due to project imports being a thing that can lead to change number collisions, some searches now may require project+changenumber
19:46:20 we haven't imported any projects from another gerrit server so I don't think this last item is a real concern for us
19:46:42 what does gerrit recommend switching to, away from robot comments?
19:46:54 fungi: using a checks api plugin I think
19:46:54 is there a new comment type for that purpose?
19:47:07 huh, i thought the checks api plugin was also deprecated
19:47:17 see this is where things get very very confusing
19:47:25 nope that's the "checks plugin" :)
19:47:30 d'oh
19:47:45 the system where you register jobs with gerrit and it triggers them is deprecated
19:47:53 (which is now a deprecated backend for the checks api)
19:47:53 i have a feeling i'm going to regret asking questions now ;)
19:48:01 there is a different system where you just teach gerrit to query your ci system for info, and that one isn't deprecated
19:48:33 but ya my understanding is that integrations with CI systems directly in Gerrit are expected to go through this system, so things like robot comments are deprecated
19:48:40 does the checks api plugin system support line comments?
19:48:42 worst case we send normal comments from zuul
19:48:59 corvus: I don't know but I guess that is a good question since that is what robot comments were doing
19:49:18 but also the deprecation is listed as a breaking change, though as far as I can tell they have only deprecated them, not actually broken them
19:49:29 so this is a problem that doesn't need solving for 3.10; we just need to be aware of it eventually needing a solution
19:49:41 since there isn't currently a zuul checks api implementation, if/when they break we will probably just fall back to regular human comments
19:49:54 should be a 2-line change to zuul
19:50:07 ++
19:50:21 another thing I wanted to call out is that gerrit made reindexing faster if you start with an old index
19:50:28 there is no robot only zuul
19:50:50 for this reason our upgrade process is slightly modified in that document to back up the indexes then copy them back in place if we do a downgrade. In theory this will speed up our downgrade process
19:50:57 this reads to me like a check result can point to a single line of code and that's it.
https://gerrit.googlesource.com/gerrit/+/master/polygerrit-ui/app/api/checks.ts#435
19:51:30 if that's correct, then i don't think the checks api is an adequate replacement, so falling back to human comments is the best option for what zuul does.
19:51:56 I also tested the downgrade on a held node with 3 changes so that process works, but I can't really comment on whether or not it is faster
19:52:11 we are running out of time so I'll end this topic here
19:52:26 but please look over the document and the 3.10 release notes and call out any additional concerns that you feel need investigation
19:52:39 otherwise I'm looking at December 6, 2024 as the upgrade date as that is after the big thanksgiving holiday
19:52:43 should be plenty of time to get prepared by then
19:52:55 #topic RTD Build Trigger Requests
19:53:10 Really quickly before we run out of time I wanted to call out the read the docs build trigger api request failures in ansible
19:53:28 tl;dr seems to be that making the same request with curl or python requests works, but having ansible's uri module do so fails with an unauthorized error
19:53:46 additional debugging with a mitm proxy hasn't shown any clear reason for why this is happening
19:53:50 but only on some distros like bookworm or noble, not on f41 or trixie
19:53:58 so very weird indeed
19:54:25 I think we could probably rewrite the jobs to use python requests or curl or something and just move on
19:54:26 something is causing http basic auth to return a 403 error from the zuul-executor containers, reproducible from some specific distro platforms but works from others
19:54:38 but it is also interesting from a "what is ansible even doing here" perspective that may warrant further debugging
19:55:12 it does seem likely to be related to a shared library which will be fixed when we eventually move zuul images from bookworm to trixie
19:55:23 but hard to say what exactly
19:55:31 do we think that just rebuilding the image may correct it too?
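The "diff the headers between platforms" debugging step described above can be sketched with python requests: build the same authenticated POST the working clients send, without actually transmitting it, so its `Authorization` header can be compared against a mitmproxy capture of what ansible's uri module emits. The endpoint and credentials below are placeholders, not the real RTD webhook details:

```python
# Hedged reproducer sketch for the RTD debugging above: prepare (but do not
# send) the authenticated POST that python requests would issue, so its
# headers can be diffed against what ansible's uri module puts on the wire
# on the affected platforms. URL and credentials are placeholders.
import base64
import requests

def prepared_trigger(url, user, token):
    """Return the prepared request, with basic auth applied, unsent."""
    return requests.Request("POST", url, auth=(user, token)).prepare()

if __name__ == "__main__":
    p = prepared_trigger(
        "https://readthedocs.example/api/v2/webhook/demo/123/",
        "build-user", "s3cret")
    # requests encodes HTTP basic auth itself; this is the header to compare
    # against a packet capture of the failing ansible uri module request.
    print(p.method, p.url)
    print(p.headers["Authorization"])
```

If the header bytes match and the server still answers 403 only for ansible, the difference has to live elsewhere (TLS negotiation, other headers, or a shared library under python), which fits the bookworm-vs-trixie observations in the meeting.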
(I don't think so since iirc we do update the platform when we build zuul images)
19:56:05 another option to try could be updating to python3.12 in the zuul images
19:56:06 also strange that we didn't change zuul's container base recently, but the problem only started in late september
19:56:08 well trixie is only in testing as of now
19:56:11 yes
19:56:29 corvus: is python3.12 something that we should consider generally for zuul or is that not in the cards yet?
19:56:39 for the container images I mean. We're already testing with 3.12
19:56:43 and py3.12 on noble seems broken/affected too, so that wouldn't help
19:56:49 ack
19:57:07 not in a rush to change in general :)
19:57:19 (but happy to if we think it's a good idea)
19:57:48 but all things equal, i'd maybe just leave it for the next os upgrade? unless there's something pressing that 3.12 would make better
19:57:56 more mitmproxy testing or setting up a server to log headers and then diffing between platforms is probably the next debugging step if anyone wants to do that
19:58:12 corvus: I don't think there is a pressing need. unittests on 3.11 are faster too...
19:58:14 either that or try the curl solution
19:58:32 ya we could just give up for now and switch to a different working implementation, then revert when trixie happens
19:58:33 my guess is that there was some regression which landed in bookworm 6 weeks ago, or maybe in a change backported into an ansible point release
19:58:44 just before we close I'd also want to mention the promote-openstack-manuals-developer failures https://zuul.opendev.org/t/openstack/build/618e3a431a2145afb4344809a9aa84fa/console
19:58:48 or a python point release
19:59:10 no idea yet what's different there compared to promote-openstack-manuals runs
19:59:29 it's a different target so I suspect that fungi's original idea is the right path
19:59:39 just not complete yet? basically need to get the paths and destinations all in alignment?
19:59:42 yeah, but my change didn't seem to fix the error
20:00:04 (side note: another reason why I think developer doesn't need a different target is that it makes the publications more complicated)
20:00:13 and we are at time. Thank you everyone
20:00:30 we can continue discussion in #opendev or on the mailing list as necessary, but I don't want to keep anyone here longer than the prescribed hour
20:00:33 #endmeeting