clarkb | Just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Sep 10 19:00:07 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/LIHCPHNOQTZLO26CYOYY6Y5LNGOOB2CV/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
NeilHanlon | o/ | 19:01 |
fungi | ohai | 19:01 |
clarkb | This didn't make it to the agenda because it was made official earlier today but https://www.socallinuxexpo.org/scale/22x/events/open-infra-days is happening in March 2025. You have until November 1 to get your cfp submissions in if interested | 19:01 |
clarkb | also not really an announcement, but I suspect my scheduling will be far more normal at this point. School started again, family returned to their homes, and I'm back from the Asia summit | 19:02 |
clarkb | #topic Upgrading Old Servers | 19:03 |
clarkb | It's been a little while since I reviewed the wiki config management changes, any updates there? | 19:03 |
clarkb | looks like no new patchsets since my last review | 19:04 |
clarkb | fungi deployed a new mirror in the new raxflex region and had to make a small edit to the dns keys package installation order | 19:04 |
clarkb | are there any other server replacement/upgrade/addition items to be aware of? | 19:04 |
clarkb | sounds like no. Let's move on and we can get back to this later if necessary | 19:06 |
clarkb | #topic AFS Mirror Cleanups | 19:06 |
clarkb | When we updated the prepare-workspace-git role to do more of its work in python rather than ansible tasks we did end up breaking xenial due to the use of f-strings | 19:07 |
clarkb | I pointed out that I have a change to remove xenial testing from system-config | 19:07 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/922680 but rather than merge that we dropped the use of f-strings in prepare-workspace-git | 19:07 |
clarkb | anyway calling that out as this sort of cleanup is a necessary precursor to removing the content in the mirrors | 19:07 |
clarkb | other than that I don't have any updates here | 19:08 |
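(Background on the f-string breakage above: f-strings were added in Python 3.6, while Xenial's python3 is 3.5. A minimal sketch of the kind of fallback the role switched to; the example value is made up:)

```python
# f-strings (PEP 498) require Python >= 3.6; Ubuntu Xenial ships python3 3.5,
# so code that still needs to run there has to use str.format() or %-formatting.
ref = "refs/changes/80/922680/1"  # made-up example value

# new_style = f"fetching {ref}"        # SyntaxError when parsed by Python 3.5
old_style = "fetching {}".format(ref)  # works on 3.5 and newer
print(old_style)
```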
clarkb | #topic Rackspace Flex Cloud | 19:08 |
fungi | seems to be working well | 19:09 |
fungi | though our ram quota is only enough for 32 of our standard nodes for now | 19:09 |
clarkb | yup since we last met fungi et al have set this cloud up in nodepool and we are running with max-servers set to 32 utilizing all of our quota there | 19:09 |
frickler | but cardoe also offered to bump our quota if we want to go further | 19:10 |
clarkb | ya I think the next step is to put together an email to rax folks with what we've learned (the ssh keyscan timeout thing is weird but also seems to be mitigated?) | 19:11 |
corvus | (or just not happening today) | 19:11 |
clarkb | and then they can decide if they are interested in bumping our quota and if so on what schedule | 19:11 |
fungi | hard to be sure, yep | 19:11 |
corvus | the delay between keyscan and ready we noted earlier today could be a nodepool problem, or it could be a symptom of some other issue; it's probably worth analyzing a bit more | 19:12 |
corvus | that doesn't currently rise to the level of "problem" only "weird" | 19:12 |
corvus | (but weird can be a precursor to problem) | 19:12 |
clarkb | right, it's information not accusation | 19:12 |
clarkb | anyway thank you everyone for putting this together I ended up mostly busy with travel and other things so wasn't as helpful as I'd hoped | 19:13 |
clarkb | and it seems to be working. Be aware that some jobs may notice the use of floating IPs like swift did. They were trying to bind to the public fip address which isn't configured on the servers and that failed | 19:13 |
clarkb | switching over to 0.0.0.0/127.0.0.1 or ipv6 equivalents would work as would using the private host ip (I think the private ip is what swift switched to) | 19:14 |
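(A minimal sketch of the bind failure described above, with a hypothetical floating IP from the TEST-NET-3 documentation range; this is not swift's actual configuration:)

```python
import socket

# Binding to the wildcard address works: the kernel listens on every local
# interface, including the private one the floating IP is NAT'ed to.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 8080))
s.close()

# Binding directly to the public floating IP fails, because that address is
# never configured on the server itself.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("203.0.113.10", 8080))  # hypothetical floating IP
except OSError as err:
    print(err)  # typically EADDRNOTAVAIL: "Cannot assign requested address"
finally:
    s.close()
```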
fungi | oh, something else worth pointing out, we resized/renumbered the internal address cidr | 19:14 |
fungi | from /24 to /20 | 19:14 |
clarkb | I can work on a draft email in an etherpad after lunch if we want to capture the things we noticed (like the ephemeral drive mounting) and the networking | 19:15 |
clarkb | fungi: did we do that via the cloud launcher stuff or in the cloud directly? | 19:15 |
fungi | so the addresses now may be anywhere in the range of 10.0.16.0-10.0.31.255 | 19:15 |
fungi | cloud launcher config | 19:15 |
corvus | why? (curious) | 19:15 |
fungi | in the inventory | 19:15 |
fungi | in case we get a quota >250 nodes | 19:15 |
fungi | frickler's suggestion | 19:16 |
fungi | easier to resize it before it was in use | 19:16 |
corvus | oh this is on a subnet we make | 19:16 |
fungi | since it involves deleting the network/router/interface from neutron | 19:16 |
fungi | yep | 19:17 |
corvus | makes sense thx | 19:17 |
fungi | there's no shared provider net (hence the fips) | 19:17 |
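(A quick sanity check of the resize using Python's ipaddress module; the exact /24 that was replaced is an assumption here:)

```python
import ipaddress

old = ipaddress.ip_network("10.0.16.0/24")  # assumed original /24; the log only says "/24"
new = ipaddress.ip_network("10.0.16.0/20")

print(old.num_addresses)  # 256  -- too small for a quota much beyond ~250 nodes
print(new.num_addresses)  # 4096
print(new[0], new[-1])    # 10.0.16.0 10.0.31.255, matching the range quoted above
```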
clarkb | anything else re rax flex? | 19:17 |
fungi | i didn't have anything | 19:17 |
clarkb | #topic Etherpad 2.2.4 Upgrade | 19:17 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/926078 Change implementing the upgrade | 19:17 |
clarkb | tl;dr here is we are on 2.1.1. 2.2.2 broke ep_headings plugin so we switched to ep_headings2 which appears to work with our old markup (thank you fungi for organizing that testing) | 19:18 |
fungi | yeah, proposed upgrade lgtm | 19:18 |
clarkb | the current release is 2.2.4 and I think we should make a push to upgrade to that version. The first step is in the parent to 926078 where we'll tag a 2.1.1 etherpad image so that we can rollback if the headings plugin explodes spectacularly | 19:18 |
fungi | also changelog makes a vague mention of security fixes, so probably should take it as somewhat urgent | 19:19 |
clarkb | maybe after the meeting we can land that first change and make sure we get the fallback image tagged and all that. Then do the upgrade afterwards or first thing tomorrow? | 19:19 |
fungi | and we should remember to test meetpad/jitsi integration after we upgrade | 19:19 |
clarkb | ++ | 19:19 |
fungi | especially with the ptg coming up next month | 19:19 |
clarkb | ya they changed some paths which required editing the mod rewrite rules in apache | 19:19 |
clarkb | it's theoretically possible that meetpad would also be angry about the new paths | 19:20 |
clarkb | reviews and assistance welcome. Let me know if you have any concerns that I need to address before we proceed | 19:21 |
clarkb | #topic OSUOSL ARM Cloud Issues | 19:21 |
clarkb | The nodepool builder appears to have successfully built images for arm things | 19:22 |
clarkb | that resolves one issue there | 19:22 |
clarkb | As for job slowness things appear to have improved a bit but aren't consistently good. The kolla image build job does succeed occasionally now but does also still timeout a fair bit | 19:22 |
frickler | also ramereth did some ram upgrades iiuc | 19:22 |
fungi | seems like Ramereth's upgrade work helped the slow jobs | 19:22 |
clarkb | I suspect that these improvements are due to upgraded hardware that they have been rolling in | 19:22 |
clarkb | ya | 19:22 |
clarkb | there was also a ceph disk that was sad and got replaced? | 19:23 |
fungi | yep | 19:23 |
clarkb | So anyway not completely resolved but actions being taken appear to be improving the situation. I'll probably drop this off of the next meeting's agenda but I wanted to followup today and thank those who helped debug and improve things. Thank you! | 19:23 |
clarkb | and if things start trending in the wrong direction again let us know and we'll see what we can do | 19:24 |
fungi | though getting a second arm provider again would still be good | 19:24 |
fungi | even if just for resiliency | 19:24 |
clarkb | ++ | 19:24 |
fungi | though Ramereth did seem to imply that we could ask for more quota in osuosl too | 19:24 |
fungi | if we decide we need it | 19:24 |
clarkb | also worth noting that most jobs are fine on that cloud | 19:25 |
clarkb | the issues specifically appear to be IO related, so IO-heavy work like kolla image builds was hit with slowness, but unittests and similar run in reasonable amounts of time | 19:25 |
clarkb | #topic Updating ansible+ansible-lint versions in our repos | 19:26 |
clarkb | Now that we have updated centos 9 arm images I was able to get the ozj change related to this passing | 19:27 |
clarkb | #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970 | 19:27 |
clarkb | There is also a project-config change but it has passed all along | 19:27 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/926848 | 19:27 |
clarkb | neither of these changes are particularly fun reviews but hopefully we can get them done then not bother with this stuff too much for another few years | 19:27 |
clarkb | and they are largely mechanical updates | 19:28 |
clarkb | do be on the lookout for any potential behavior changes in those updates because there shouldn't be any | 19:28 |
clarkb | getting this done will get these jobs running on noble too which is nice | 19:28 |
clarkb | #topic Zuul-launcher Image Builds | 19:29 |
clarkb | zuul-launcher (zuul's internal daemon for managing historically nodepool related things) has landed | 19:29 |
fungi | yay! | 19:30 |
clarkb | the next step in proving out that all works is to get opendev building test node images using jobs and zuul-launcher to upload them rather than nodepool itself | 19:30 |
clarkb | There are dib roles in zuul-jobs to simplify the job side | 19:30 |
clarkb | corvus has a question about where the resulting jobs and configs should live for this | 19:30 |
corvus | i wanted to discuss where we might put those image jobs | 19:30 |
clarkb | my hunch is that opendev/project-config is the appropriate location | 19:31 |
clarkb | since we're building these images for use across opendev and want to have control of that and this would be one less thing to migrate out of openstack/project-config if we manage it there | 19:31 |
corvus | yes -- but if we put them in an untrusted-project, then speculative changes would be possible | 19:31 |
clarkb | and you're suggesting that would be useful in this situation? | 19:32 |
corvus | i don't think we have any untrusted central repos like that (like openstack-zuul-jobs is for openstack) | 19:32 |
clarkb | correct opendev doesn't currently have a repo set up that way | 19:32 |
corvus | perhaps! no one knows yet... but it certainly was useful when i was working on the initial test job | 19:32 |
clarkb | but if that is useful for image builds then maybe it should. Something like opendev/zuul-configs ? | 19:33 |
corvus | it would let people propose changes to dib settings and see if they work | 19:33 |
clarkb | (that name is maybe not descriptive enough) | 19:33 |
corvus | yeah, we can do something like that, or go for something with "image" in the name if we want to narrow the scope a bit | 19:34 |
clarkb | thinking out loud here: if we had an untrusted base jobs like repo in opendev we could potentially use that to refactor the base jobs a bit to do less work | 19:34 |
fungi | that's basically the model recommended in zuul's pti document | 19:35 |
clarkb | I know this is something we've talked about for ages but never done it because the current system mostly works, but if there are other reasons to split the configs into trusted and untrusted then maybe we'll finally do that refactoring too | 19:35 |
fungi | back when we were operating on a single zuul tenant that's what we used the openstack/openstack-zuul-jobs repo for | 19:35 |
fungi | and we still have some stuff in there that's not openstack-specific | 19:36 |
corvus | yeah -- though a lot of that thinking was originally related to base role testing, and we've pretty much got that nailed down now, so it's not quite as important as it was. but it's still valid and could be useful, and if we had it then i probably would have said lets start there and not started this conversation :) | 19:36 |
clarkb | in any case a new repo with a better name than the one I suggested makes sense to me. It can house stuff a level above base jobs as well as image build things | 19:36 |
corvus | so if we feel like "generic opendev untrusted low-level jobs" is a useful repo to have then sure let's start there :) | 19:36 |
clarkb | ++ | 19:37 |
corvus | incidentally, there is a related thing: | 19:37 |
corvus | not only will we define the image build jobs as we just discussed, but we will have zuul configuration objects for the images themselves, and we want to include those in every tenant. we will eventually want to decide if we want to put those in this repo, or some other repo dedicated to images. | 19:38 |
corvus | we can kick that down the road a bit, and move things around later. and the ability to include and exclude config objects in each tenant gives us a lot of flexibility | 19:38 |
corvus | but i just wanted to point that out as a distinct but related concern as we start to develop our mental model around this stuff | 19:39 |
clarkb | if we use this repo for "base" job stuff then we may end up wanting to include it everywhere for jobs anyway | 19:39 |
corvus | yep | 19:39 |
clarkb | yup its a good callout since they are distinct config items | 19:39 |
corvus | we will definitely only want to run the image build jobs in one tenant, so if we put them in the "untrusted base job" repo, we will only include the project stanza in the opendev tenant | 19:39 |
corvus | i think that would be normal behavior for that repo anyway | 19:40 |
corvus | okay, i think that's enough to proceed; i'll propose a new repo and then sync up with tonyb on making jobs | 19:40 |
fungi | thanks! | 19:41 |
clarkb | sounds good thanks | 19:41 |
clarkb | #topic Open Discussion | 19:41 |
clarkb | Anything else before we end the meeting? | 19:41 |
NeilHanlon | i have something | 19:42 |
clarkb | go for it | 19:42 |
NeilHanlon | wanted to discuss adding rocky mirrors, following up from early august | 19:42 |
clarkb | this was brought up in the context of I think kolla and the two things I suggested to start there are 1) determine if the jobs are failing due to mirror/internet flakiness (so that we're investing effort where it has most impact) and 2) determine how large the mirror content is to determine if it will fit in the existing afs filesystem | 19:43 |
clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1 shows current utilization | 19:44 |
clarkb | hrm the graphs don't show us free space just utilization and used space (which we can use to math out free space; must be 5tb total or about 1.2tb currently free?) | 19:45 |
fungi | related to utilization, openeuler is already huge/full and is asking to add an additional major release | 19:45 |
fungi | #link https://review.opendev.org/927462 Update openEuler mirror repo | 19:45 |
clarkb | then assuming the mirrors will help and we've got space we just need to add an rsync mirroring script for rocky (as well as the afs volume bits I guess) | 19:46 |
clarkb | fungi: in openEuler's case I think we replace the old with new instead of carrying both | 19:46 |
clarkb | fungi: the old version was never really used much aiui and there probably isn't much reason to double the capacity needs there | 19:46 |
clarkb | NeilHanlon: do you know how big the mirror size would be for rockylinux 9? | 19:47 |
NeilHanlon | yeah, it came up again in OSA a few weeks back, not directly related to jobs, but not unrelated, either | 19:47 |
clarkb | NeilHanlon: I guess we could also do x86 or arm or both so that may be more than one number | 19:47 |
fungi | yeah, wrt openeuler i was hinting at that as well, though the author left a comment about some existing jobs continuing to rely on the old version | 19:47 |
clarkb | fungi: right I'm saying delete those jobs like we deleted centos 7 and opensuse and hopefully soon xenial jobs :) | 19:48 |
clarkb | it's not ideal but pruning like that enables us to do more interesting things, so rather than try and keep all things alive with minimal use we just need to move on | 19:48 |
clarkb | NeilHanlon: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update this is what the centos stream sync script looks like | 19:48 |
NeilHanlon | 8 and 9, with images, it's about 800GiB; w/o, less.. and with only x86/arm and no debuginfo/source... probably under 300 | 19:49 |
clarkb | NeilHanlon: it should be fairly consistent rpm mirror to rpm mirror. We just need to tailor the source location and rsync excludes to the specific mirror | 19:49 |
NeilHanlon | it should be nominally the same as whatever the c9s one is, really | 19:49 |
clarkb | NeilHanlon: ya we won't mirror images or debuginfo/source stuff. The thought there is you can fetch those when necessary from the source but the vast majority of ci jobs don't need that info | 19:50 |
NeilHanlon | right, makes sense | 19:50 |
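(Roughly the shape of the sync, shown as a Python sketch rather than the shell scripts opendev actually keeps under playbooks/roles/mirror-update/files/; the upstream rsync URL, AFS path, and exact exclude patterns are placeholders reflecting the images/debuginfo/source discussion above:)

```python
import subprocess

# Placeholder source and destination -- the real values would be picked when
# a rocky-mirror-update script is actually written.
SOURCE = "rsync://mirror.example.org/rocky/9/"
DEST = "/afs/.openstack.org/mirror/rocky/9/"

subprocess.run(
    [
        "rsync", "-rlptDvz", "--delete", "--delete-excluded",
        "--exclude=*/images/*",  # skip install/cloud images
        "--exclude=*/isos/*",    # skip ISO images
        "--exclude=*/debug/*",   # skip debuginfo packages
        "--exclude=*/source/*",  # skip source packages
        SOURCE,
        DEST,
    ],
    check=True,
)
```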
clarkb | NeilHanlon: oh that's good input, c9s is currently 303GB | 19:50 |
clarkb | (from that grafana dashboard I linked above) | 19:50 |
clarkb | so napkin math says we can fit a rockylinux mirror | 19:51 |
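(The napkin math spelled out, using the approximate numbers quoted above; all figures are rough:)

```python
afs_free_gb = 1200       # roughly 1.2 TB currently free out of ~5 TB total
c9s_used_gb = 303        # existing centos-stream-9 volume, for comparison
rocky_estimate_gb = 300  # rocky 8+9, x86_64 + aarch64, no images/debuginfo/source

print(afs_free_gb - rocky_estimate_gb)  # roughly 900 GB of headroom would remain
```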
clarkb | a good next step would be to just confirm that the failures we're seeing are expected to be addressed by a mirror then push up a change to add the syncing script | 19:52 |
NeilHanlon | excellent :) I do agree we should have some firm "evidence" it's needed, even making the assumption that it will fix some percentage of test failures | 19:52 |
clarkb | an opendev admin will need to create the afs volume which they can do as part of the coordination for merging the change that adds the sync script | 19:52 |
NeilHanlon | that 'some percentage' should be quantifiable, somehow. I'll take an action to try and scrape some logs for commonalities | 19:52 |
clarkb | NeilHanlon: I think even if it is "in the last week we had X failures related to mirror download failures" that would be sufficient. Just enough to know we're actually solving a problem and not simply assuming it will help | 19:53 |
NeilHanlon | roger | 19:53 |
clarkb | (because one of the reasons I didn't want to mirror rocky initially was to challenge the assumption we've had that it is necessary; if evidence proves it is still necessary then let's fix that) | 19:53 |
NeilHanlon | in fairness, I think it's gone really well all things considered, though I don't know that.. just a feeling :) | 19:54 |
NeilHanlon | alright, well, thank you. I've got my marching orders 🫡 | 19:55 |
clarkb | fungi: openeuler doesn't appear to have a nodeset in opendev/base-jobs as just a point of reference | 19:55 |
clarkb | fungi: and codesearch doesn't show any obvious CI usage of that platform | 19:55 |
clarkb | so ya I'd like to see evidence that we're going to break something important before we do a replacement rather than a side by side rollover | 19:56 |
clarkb | just a few more minutes, last call for anything else | 19:56 |
fungi | looks like kolla dropped openeuler testing this cycle | 19:58 |
fungi | per a comment in https://review.opendev.org/927466 | 19:58 |
fungi | though kolla-ansible stable branches may be running jobs on it | 19:59 |
clarkb | and we are at time | 20:00 |
clarkb | Thank you everyone | 20:00 |
clarkb | we'll be back here same time and location next week | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Sep 10 20:00:20 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.log.html | 20:00 |
NeilHanlon | thanks clarkb, fungi, all :) | 20:01 |