19:00:14 #startmeeting infra
19:00:14 Meeting started Tue Jun 11 19:00:14 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:14 The meeting name has been set to 'infra'
19:00:20 I made it back in time to host the meeting
19:00:27 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/4BD6VNMXEEQWEEANCGX2JWB2LK6XY5RT/ Our Agenda
19:00:32 \o/
19:00:54 #topic Announcements
19:01:06 I'm going to be out from June 14-19
19:01:31 This means I'll miss next week's meeting. More than happy for y'all to have one without me or to skip it if that is easier
19:01:38 I'm relocating back to AU on the 14th
19:01:54 sounds like skipping would be fine
19:02:12 yeah, i'll be around but don't particularly need a meeting either
19:02:21 works for me
19:02:28 #agreed The June 18th meeting is Cancelled
19:02:35 anything else to announce?
19:02:41 and I have started a project via the OIF university partnerships to help us get keycloak talking to Ubuntu SSO
19:02:58 Also my IRC is super laggy :(
19:03:14 tonyb: cool! is that related to the folks saying hello in #opendev earlier today?
19:03:14 thanks tonyb!!!
19:03:27 clarkb: Yes yes it is :)
19:03:29 that's the main blocker to being able to make progress on the spec
19:04:02 knikolla had some ideas on how to approach it and had written something similar, if you manage to catch him at some point
19:04:09 Yup so fingers crossed we'll be able to make solid progress in the next month
19:04:16 fungi: Oh awesome!
19:04:20 he might be able to provide additional insight
19:04:43 something about having written other simplesamlphp bridges in the past
19:05:24 I'll reach out
19:06:28 #topic Upgrading Old Servers
19:06:49 I've been trying to keep up with tonyb but failing to do so. I think there has been progress in getting the wiki changes into shape?
19:07:00 tonyb: do you still need reviews on the change for new docker compose stuff?
19:07:35 Yes please
19:08:13 Adding compose v2 should be ready pending https://review.opendev.org/c/opendev/system-config/+/921764?usp=search passing CI
19:08:36 #link https://review.opendev.org/c/opendev/system-config/+/920760 Please review this change which adds the newer docker compose tool to our images (which should be safe to do alongside the old tool as they are different commands)
19:08:45 (side note) once that passes CI I could use it to migrate list3 to the golang tools
19:09:07 neat
19:09:11 and thank you for pushing that along
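For context on the "different commands" point above: Compose v1 is the standalone docker-compose entry point while Compose v2 ships as a "docker compose" CLI plugin, so the two can be installed side by side. A minimal sketch (illustrative only, not the actual system-config role) of preferring the v2 plugin and falling back to v1 when the plugin is absent:

    import shutil
    import subprocess

    def compose_command() -> list[str]:
        """Return the command prefix for whichever Compose flavor is available.

        Compose v2 is the "docker compose" plugin subcommand; Compose v1 is the
        separate docker-compose executable. Because they are different commands,
        both can coexist on the same host.
        """
        try:
            # "docker compose version" only succeeds when the v2 plugin is installed.
            subprocess.run(
                ["docker", "compose", "version"],
                check=True,
                capture_output=True,
            )
            return ["docker", "compose"]
        except (FileNotFoundError, subprocess.CalledProcessError):
            pass
        if shutil.which("docker-compose"):
            return ["docker-compose"]
        raise RuntimeError("neither the Compose v2 plugin nor docker-compose v1 was found")

    if __name__ == "__main__":
        # Prints e.g. ['docker', 'compose', 'up', '-d'] or ['docker-compose', 'up', '-d']
        print(compose_command() + ["up", "-d"])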
19:09:21 mediawiki is closer I have it building 2 versions of our custom container
19:09:30 the role is getting closer to failing in testinfra
19:09:44 what are the two different versions for?
19:09:54 from there I'll have questions about more complex testing
19:10:09 two different mediawiki releases?
19:10:16 oh right we want to do upgrades
19:10:23 1.28.2 (the current version) and 1.35.x the version we'll need to go to next
19:10:34 got it thanks
19:10:37 yeah, the plan is to deploy a container of the same version we're running, then upgrade once we're containered up
19:10:47 I wanted to make sure I had the Dockerfile and job more or less right
19:11:03 similar to how we do gerrit today with the current and next version queued up
19:11:20 Yup heavily inspired by gerrit :)
19:11:39 On noble I've been trying to move that ball along
19:11:49 with a mirror node based on the openafs questions
19:12:02 currently stuck on openafs on noble
19:12:11 having thought about it I think the best thing is to simply skip the ppa on noble until we need to build new packages for noble
19:12:35 I think we can (should?) do that for jammy too but didn't because we didn't realize we were using the distro package until well after we had been using jammy
19:12:45 I tried that and hit a failure later where we install openafs-dkms so I need to cover that also
19:12:55 at this point it's probably fine to leave jammy in that weird state where we install the ppa then ignore it. But for noble I think we can just skip the ppa
19:13:21 Okay we can decide on Jammy during the review
19:13:22 tonyb: ok the packages may have different names in noble now?
19:13:35 it's possible you're hitting debian bug 1060896
19:13:36 they were the same in jammy but our ppa had an older version so apt picked the distro version instead
19:13:53 fungi: is that the arm bug? I think the issue would only be seen on arm
19:13:56 #link https://bugs.debian.org/1060896 openafs-modules-dkms: module build fails for Linux 6.7: osi_file.c:178:42: error: 'struct inode' has no member named 'i_mtime'
19:14:06 I'll look, it only just happened.
19:14:08 oh no that's a different thing
19:14:10 oh, is openafs-modules-dkms only failing on arm?
19:14:37 1060896 is what's keeping me from using it with linux >=6.7
19:14:40 fungi: it has an arm failure, the one I helped debug and posted a bug to debian for. I just couldn't remember the number. That specific issue was arm specific
19:14:53 Oh and I hit another issue where CentOS 9 Stream has moved forward from our image so the headers don't match
19:15:10 ianw has also been pushing for us to use kafs in the mirror nodes. We could test that as an alternative
19:15:19 #link https://bugs.debian.org/1069781 openafs kernel module not loadable on ARM64 Bookworm
19:15:37 tonyb: you can manually trigger an image rebuild in nodepool if you want to speed up rebuilding images after the mirrors update
19:16:04 Yeah I'm keen on that too (kAFS) but I was thinking later
19:16:23 clarkb: That'd be handy we can do that after the meeting
19:16:23 ++ especially if we can decouple using noble from that. I'm a fan of doing one thing at a time if possible
19:16:36 if the challenges with openafs become too steep, kafs might be an easier path forward, yeah
19:16:43 It's a $newthing for me
19:16:49 tonyb: yup can help with that
19:16:56 cool cool
19:17:06 and thank you for pushing forward on all these things. Even if we don't have all the answers it is good to be aware of the potential traps ahead of us
19:17:10 i should try using kafs locally to get around bug 1060896 keeping me from accessing afs on my local systems
19:17:32 anything else on the topic of upgrading servers?
19:17:46 Nope I think that's good
19:18:09 #topic Cleaning up AFS Mirrors
19:18:37 i sense a theme
19:18:50 I don't have any new progress on this. I'm pulled in enough directions that this is an easy deprioritization item. That said I wanted to call out that our centos-8-stream images are basically unusable and maybe unbuildable
19:19:07 Upstream cleared out the repos and our mirrors faithfully reflected that empty state
19:19:21 that means all that's left for us to do really is to remove the nodes and image builds from nodepool
19:19:28 8-stream should be okay? I thought it was just CentOS-8 that went away?
19:19:30 oh and the nodeset stuff from zuul
19:19:39 tonyb: no, 8-stream is EOL as of a week ago
19:19:48 Oh #rats
19:19:55 Okay
19:20:02 where are we at with Xenial?
19:20:16 and they deleted the repos (I think technically the repos go somewhere else, but I don't want to set an expectation that we're going to chase eol centos repo mirroring locations if the release is eol)
19:20:36 with xenial devstack-gate has been wound down which had a fair bit of fallout.
19:20:58 yeah, we could try to start mirroring from the centos archive or whatever they call it, but since the platform's eol and we've said we don't make extra effort to maintain eol platforms
19:20:58 I think the next steps for xenial removal are going to be removing projects from the zuul tenant config and/or merging changes to projects to drop xenial jobs
19:21:22 Okay so we need to correct some of that fallout? Or can we look at removing all the xenial jobs and images?
19:21:23 we'll have to make educated guesses on which is appropriate (probably based on the last time anything happened in gerrit for the projects)
19:21:55 tonyb: I don't think we need to correct the devstack-gate removal fallout. It's long been deprecated and most of the fallout is in dead projects which will probably get dealt with when we remove them from the tenant config
19:22:08 Okay.
19:22:19 basically we should focus on that sort of cleanup first^ then we can figure out what is still on fire and if it is worth intervention at that point
19:22:26 Is there any process or order for removing xenial jobs?
19:22:33 or in some very old openstack branches that have escaped eol/deletion
19:23:03 for example system-config-zuul-role-integration-xenial? to pick on one I noticed today
19:23:03 tonyb: the most graceful way to do it is in the child jobs first then remove the parents. I think zuul will enforce this for the most part too
19:23:30 tonyb: for system-config in particular I think we can just proceed with cleanup. Some of my system-config xenial cleanup changes have already merged
19:23:55 clarkb: cool beans
19:24:00 if you do push changes for xenial removal please use that topic: https://review.opendev.org/q/topic:drop-ubuntu-xenial
19:24:07 noted
19:24:59 one last thought before we move to the next topic: maybe we pause centos-8-stream image builds in nodepool now as a step 0 for that distro release. As I'm like 98% positive those image builds will fail now
19:25:09 I should be able to push up a change for that
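The pause mentioned above is a one-line setting on the centos-8-stream diskimage in the nodepool builder configuration, normally proposed as a Gerrit change rather than scripted. Purely as an illustration of the intended edit, assuming the usual diskimages list layout and a hypothetical config file name:

    import yaml  # PyYAML

    # Hypothetical path; the real builder config is edited via a Gerrit change,
    # not modified in place like this.
    CONFIG = "nodepool/nb01.example.yaml"

    with open(CONFIG) as f:
        config = yaml.safe_load(f)

    # Mark the centos-8-stream diskimage as paused so the builder stops
    # attempting (and failing) to rebuild it now that the upstream repos are gone.
    for image in config.get("diskimages", []):
        if image.get("name") == "centos-8-stream":
            image["pause"] = True

    with open(CONFIG, "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False)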
19:25:32 #topic Gitea 1.22 Upgrade
19:25:51 Similar to the last topic I haven't made much progress on this after I did the simple doctor command testing a few weeks ago
19:26:09 I was hoping that they would have a 1.22.1 release before really diving into this as they typically do a bugfix release shortly after the main release.
19:26:29 there is a 1.22.1 milestone that was due like yesterday in github so I do expect that soon and will pick this back up again once available
19:26:46 there are also some fun bugs in that milestone list though I don't think any would really affect our use case
19:27:02 (things like PRs are not mergeable due to git bugs in gitea)
19:27:30 #topic Fixing Cloud Launcher
19:27:50 This was on the agenda to ensure it didn't get lost if the last fix didn't work. But tonyb reports the job was successful last night
19:27:52 seems this has passed today
19:28:00 we just had to provide a proper path to the safe.directory config entries
19:28:16 Which is great because it means we're ready to take advantage of the tool for bootstrapping some of the openmetal stuff
19:28:28 Huzzah
19:28:38 which takes us to the next topic
19:28:45 #topic OpenMetal Cloud Rebuild
19:28:57 The inmotion cloud is gone and we have a new openmetal cloud on the same hardware!
19:29:20 The upside to all this is we have a much more modern platform and openstack deployment to take advantage of. The downside is we have to reenroll the cloud into our systems
19:29:43 I got the ball rolling and over the last 24 hours or so tonyb has really picked up todo items as I've been distracted by other stuff
19:29:55 #link https://etherpad.opendev.org/p/openmetal-cloud-bootstrapping captures our todo list for bootstrapping the new cloud
19:30:17 I think https://review.opendev.org/c/opendev/system-config/+/921765?usp=search is ready now
19:30:25 https://review.opendev.org/c/opendev/system-config/+/921765 is the next item that needs to happen, then we can update cloud launcher to configure security groups and ssh keys, then we can keep working down the todo list in the etherpad
19:31:17 I'm happy to work with whomever to keep that rolling
19:31:20 so anyway reviews welcome as we get changes up to move things along. At this point I expect the vast majority of things to go through gerrit
19:31:36 tonyb: thanks! I should be able to help through thursday at least if we don't get it done by then
19:31:43 It's a good learning opportunity
19:32:19 I was hoping the new mirror node would be on noble but that's far from essential
19:32:55 ya I wouldn't prioritize noble here if there aren't easy fixes for the openafs stuff
19:33:02 Yeah
19:33:14 also easy to replace later
19:33:17 ++
19:33:33 the last thing I wanted to call out on this is that I think fungi and corvus may still have account creation for the system outstanding
19:33:40 I haven't been consistent about removing inmotion as I add openmetal figuring we can do a complete cleanup "at the end"
19:33:51 feel free to reach out to me if you have questions (I worked through the process and while not an expert did work through a couple odd things)
19:33:52 yep sorry :)
19:34:08 tonyb: ya that should be fine too. I have a note in the todo list to find those things that we missed already
19:34:20 though that's also not too relevant for operating the cloud itself
19:34:26 Oh and datadog ... My invitation expired and I think clarkb was going to ask about dropping it
19:34:29 i'm not, like, blocking things right?
19:34:35 all infra-root ssh keys should be on all nodes already
19:34:46 corvus: no you are not. I'm more worried that the invites might expire and we'll have to get openmetal to send new ones
19:34:56 but frickler is right your ssh keys should be sufficient for the openstack side of things
19:34:59 oh i see. i will get that done soon then.
19:35:10 It looks like we can send them ourselves
19:35:18 tonyb: oh cool in that case it's far less urgent
19:35:42 it's mostly for issue tracking and "above OpenStack" issues
19:35:49 oh, i'll try to find the invites
19:35:58 i think i may have missed mine
19:36:09 fungi: mine got sorted into the spam folder, but once dug out it was usable
19:36:17 noted, thanks
19:36:20 anything else on the subject of the new openmetal cloud?
19:36:36 well ... as far as html mails with three-line-links are usable
19:36:53 do we have to discuss details like nested virt yet?
19:37:09 frickler: I figured we could wait on that until after we have "normal" nodes booting
19:37:27 adding nested virt labels to nodepool at that point should be trivial. That said I think we can do that as long as the cloud supports it (which I believe it does)
19:38:03 ack, will need some testing first. but let's plan to do that before going "full steam ahead"
19:38:13 That works
19:38:24 sounds good
19:38:55 Do we have/need a logo to add to the "sponsors" page?
19:39:10 tonyb: they are already on the opendev.org page
19:39:28 because the inmotion stuff was really openmetal for the last little while. We were just waiting for the cloud rebuild to go through the trouble of renaming everything
19:39:36 Cool
19:40:10 though it's worth double-checking that the donors listed there are still current, and cover everyone we think needs to be listed
19:40:39 the main questions there would be do we add osuosl and linaro/equinix/worksonarm
19:40:48 Rax, Vexx, OVH and OpenMetal
19:41:07 that might be a good discussion outside of the meeting though as we have a couple more topics to get into and I can see that taking some time to get through
19:41:13 yep
19:41:16 not urgent
19:41:22 kk
19:41:30 #topic Improving Mailman Throughput
19:42:09 We increased the number of out runners to 4 from 1. This unfortunately didn't fix the openstack-discuss delivery slowness. But now that we understand things better it should prevent openstack-discuss slowness from impacting other lists so a (small) win anyway
19:42:46 yeah, there seems to be an odd tuning mismatch between mailman and exim
19:42:47 fungi did more digging and found that both mailman and exim have a batch size for message delivery of 10. However it almost seems like exim's might be 9? because it complains when mailman gives it 10 messages and goes into queuing mode?
19:43:23 yeah, exim's error says it's deferring delivery because there were more than 10 recipients, but there are exactly 10 the way mailman is batching them up (confirmed from exim logs even)
19:43:34 my suggestion was that we could bump the exim number quite a bit to see if it stops complaining mailman is giving it too many messages at once. Then if that helps we can further bump the mailman number up to a closer value but give exim some headroom
19:43:53 so i think mailman's default is chosen trying to avoid exceeding exim's default limit, except it's just hitting it instead
19:43:58 say bump exim up to 50 or 100. Then bump mailman up to 45 or 95
19:44:39 All sounds good to me
19:44:44 could also decrease exim's queue interval
19:44:59 yeah, right now the delay is basically waiting for the next scheduled queue run
19:45:30 oh that's a piece of info I wasn't aware of. So exim will process each batch in the queue with a delay between them?
19:45:38 not exactly
19:45:40 (i think synchronous is better if the system can handle it, but if not, then queuing and running more often could be a solution)
19:45:54 exim will process the message ~immediately if the recipient count is below the threshold
19:46:11 otherwise it just chucks it into the big queue and waits for the batch processing to get it later
19:46:16 got it
19:46:36 fungi: was pushing up changes to do some version of ^ something that you were planning to do?
19:46:50 except that with our mailing lists, every message exim receives outbound is going to exceed that threshold and end up batched instead
19:47:16 clarkb: i can, just figured it wasn't urgent and we could brainstorm in the meeting, which we've done now
19:47:27 thanks!
19:47:48 #topic Testing Rackspace's new cloud offering
19:47:50 so with synchronous being preferable, increasing the recipient limit in exim sounds like the way to go
19:47:54 ++
19:48:10 and it doesn't sound like there are any objections with 50 instead of 10
19:48:16 wfm
19:48:23 (where we'll really just be submitting 45 anyway)
19:49:41 #makeitso
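The arithmetic behind the 50/45 suggestion above: mailman hands exim the recipients in fixed-size batches, and the observed behavior suggests exim defers any batch that reaches its limit rather than only those strictly above it, so equal settings (10 and 10) push every full batch into the queue. A small illustrative sketch, not mailman or exim code, using only the values discussed above:

    def batches(recipients, batch_size):
        """Split a recipient list into the chunks the list server hands to the MTA."""
        return [recipients[i:i + batch_size] for i in range(0, len(recipients), batch_size)]

    def deferred(recipients, batch_size, mta_limit):
        """Count batches the MTA would queue instead of delivering immediately,
        assuming it defers any batch whose recipient count reaches the limit."""
        return sum(1 for b in batches(recipients, batch_size) if len(b) >= mta_limit)

    subscribers = [f"user{i}@example.org" for i in range(1000)]

    # Current tuning: batches of 10 against a limit of 10, so every full batch is
    # deferred to the next queue run, which is where the delivery delay comes from.
    print(deferred(subscribers, batch_size=10, mta_limit=10))   # 100

    # Proposed tuning: batches of 45 against a limit of 50, so everything stays
    # below the threshold and is delivered synchronously.
    print(deferred(subscribers, batch_size=45, mta_limit=50))   # 0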
19:49:49 re rax's new thing they want us to help test. I haven't had time to ask for more info on that. However, enrolling it will be very similar to the openmetal cloud (though without setting flavors and stuff I expect) so we're ensuring the process generally works and are familiar with it by going through the process for openmetal
19:50:06 part of the reason I have avoided digging into this is I didn't want to start something like that then disappear for a week
19:50:19 but for completeness thought I'd keep it on the agenda in case anyone else had info
19:51:01 I was supposed to reach out to cloudnull and/or the RAX helpdesk for more information
19:51:01 is this something they sent to all customers or is it specifically targeted to opendev?
19:51:22 frickler: I am not sure. The way it was sent to us was via a ticket to our nodepool ci account
19:51:28 i am a paying rackspace customer and did not receive a personal invite, so i think it's selective
19:51:53 I don't think our other rax accounts got invites either. Cloudnull was a big proponent of what we did back in the osic days and may have asked his team to target us
19:52:13 ok
19:52:48 coincidentally I've seen a number of tempest job timeouts that were all happening on rax recently
19:53:11 ya I think it can potentially be useful for both us and them
19:53:15 so assuming the new cloud also has better performance, that might be really interesting
19:53:18 we just need to find the time to dig into it.
19:54:00 I can certainly look into helping with testing, just not sure about communications due to time difference
19:54:02 yes, the aging rackspace public cloud hardware and platform has been implicated in a number of different performance discrepancies/variances in jobs for a while
19:54:38 also with it being kvm, migrating to it eventually could mean we can stop worrying about "some of the test nodes you get might be xen"
19:55:11 +1
19:55:12 at least i think it's kvm, but don't have much info either
19:55:28 one thing we should clarify is what happens to the resources after the test period. If we aren't welcome to continue using it in some capacity then the amount of effort may not be justified
19:55:57 sounds like we need more info in general. tonyb if you don't manage to find out before I get back from vacation I'll put it on my todo list at that point as I will be less likely to just disappear :)
19:56:04 #topic Open Discussion
19:56:14 We have about 4 minutes left. Anything else that was missed in the agenda?
19:56:23 Perfect
19:56:38 Nothing from me
19:56:49 i got nothin'
19:57:44 just ftr I finished the pypi cleanup for openstack today
19:58:12 but noticed some repos weren't covered yet, like dib. so a bit more work coming up possibly
19:58:20 frickler: Wow!
19:58:27 That's very impressive
19:58:30 frickler: ack thanks for the heads up
19:58:35 oh, also you wanted to ask about similar cleanup for opendev-maintained packages
19:58:40 like glean
19:59:06 yeah, but that's not urgent, so can wait for a less full agenda
19:59:15 for our packages I think it would be good to have a central account that is the opendev pypi owner account
19:59:20 though for glean specifically mordred has already indicated he has no need to be a maintainer on it any longer
19:59:44 we can add that account as owner on our packages then remove anyone that isn't the zuul upload account (also make the zuul upload account a normal maintainer rather than owner)
19:59:45 yes, and not sure how many other things we actually have on pypi?
19:59:53 yeah, the biggest challenge is that there's no api, so we can't script adding maintainers/collaborators
20:00:04 frickler: git-review, glean, jeepyb maybe?
20:00:09 oh and gerritlib possibly
20:00:17 still a totally manual clicky process unless someone wants to fix the feature request in warehouse
20:00:22 fungi: ya thankfully our problem space is much smaller than openstack's
20:00:25 agreed
20:00:30 bindep too
20:00:36 git-restack, etc
20:01:37 I've also confirmed that centos 8 stream image builds are failing
20:01:47 and we are at time
20:02:00 thank you everyone. Feel free to continue discussion in #opendev or on the mailing list, but I'll end the meeting here
20:02:02 #endmeeting