19:00:21 <clarkb> #startmeeting infra
19:00:21 <opendevmeet> Meeting started Tue Aug 27 19:00:21 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:21 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:21 <opendevmeet> The meeting name has been set to 'infra'
19:00:32 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/CGWURWK2YK4LLA7VPHS5KXF63I47EOYJ/ Our Agenda
19:00:41 <clarkb> #topic Announcements
19:00:56 <clarkb> Due to timezones and travel and conference obligations I won't make it to next week's meeting.
19:01:52 <clarkb> #topic Upgrading old servers
19:02:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921321 Wiki replacement ansible stack
19:03:01 <clarkb> Looks like a couple of us have reviewed that stack since the meeting last week. Overall things look good to me. My main concern was how the ansible is set up to stop and start things on every run. I think we can probably live with that before we do the cutover if we prefer not to fix it upfront
19:03:10 <clarkb> or we can fix it upfront and avoid unnecessary restarts
19:03:39 <clarkb> looks like frickler found some functional issues that need correcting in the job setup as well
19:05:04 <clarkb> Not sure if frickler or tonyb are around for the meeting, but are there any questions about the reviews?
19:06:41 <clarkb> sounds like no, at least for now
19:07:14 <clarkb> separately tonyb also got some new Noble mirrors running. https://mirror02.sjc1.vexxhost.opendev.org/ I believe that is one of them and it appears to be working
19:07:29 <clarkb> we should probably go ahead and cut dns over and start thinking about cleaning up the old servers
19:07:31 <frickler> I'm around, but not sure about the question?
19:07:59 <clarkb> frickler: I was mostly opening the door for tonyb to provide feedback on our reviews if there was any. I know I ended up writing a number of comments
19:08:30 <frickler> ok
19:09:15 <tonyb> yup they're very helpful.
19:09:54 <clarkb> tonyb: any questions or concerns or updates?
19:10:01 <tonyb> I'm working on addressing them, just slowly due to running up and down a mountain
19:10:12 <tonyb> nope nothing specific yet
19:10:24 <clarkb> cool. Thank you for continuing to push this along
19:10:32 <clarkb> #topic AFS Mirror Cleanups
19:10:53 <fungi> if it was a sacred mountain, i hope you wore curse-resistant footwear
19:10:54 <clarkb> I don't have anything new here. I've been distracted by new clouds and summit/travel prep and this is an easy thing to deprioritize...
19:11:07 <clarkb> #topic Rackspace Flex Cloud
19:11:21 <clarkb> But we do have info about rackspace's new cloud setup and it sounds very promising
19:11:26 <fungi> it's ready to be flexed
19:11:58 <clarkb> basically they are rolling out a new cloud deployment generation. It's currently still in some sort of pre-release state but they are happy for us to start kicking the tires on it.
19:12:37 <clarkb> Our existing accounts work with it if we use a different keystone and region. fungi set up clouds.yaml for us and it seems to be working. I think we should treat this as a separate cloud though because it is so different even though the credentials align
19:12:49 <clarkb> so we have new clouds.yaml entries for it and we'll have separate nodepool providers and so on
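
For reference, a separate clouds.yaml entry reusing the existing credentials against the new keystone might look roughly like the sketch below. The cloud name, endpoint, and region here are placeholders for illustration, not the values from the actual change:

    clouds:
      rax-flex:                # hypothetical cloud name
        auth:
          auth_url: https://flex.example.rackspace.invalid/v3   # placeholder keystone endpoint
          username: ...        # same credentials as the classic rackspace entries
          password: ...
          project_name: ...
          user_domain_name: Default
          project_domain_name: Default
        region_name: SJC3      # placeholder region
        identity_api_version: 3
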
19:12:58 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/927214 Enroll new cloud region into cloud launcher
19:13:16 <fungi> yeah, the open change splits the credential vars in our private store, even though they're just copies of the same values at the moment
19:13:19 <clarkb> I believe this is the next step in rolling out our usage of the flex cloud. Basically configure networking, ssh keys, and security groups
19:13:44 <clarkb> Then when that is done we can figure out flavors and quotas, deploy a mirror node, then point nodepool at it
19:14:04 <clarkb> it does appear they have a noble image so we don't have to upload our own like tonyb did with other clouds, but we can do that too if we want things to be in sync
19:14:49 <fungi> i already identified --flavor=gp.0.4.8 as being equivalent to our standard for job nodes
19:15:15 <fungi> that's 8gb ram, 4 vcpus, 80gb rootfs
19:15:29 <fungi> also has a 128gb ephemeral disk
19:15:35 <frickler> iiuc our standard is 8 vcpus?
19:15:44 <fungi> depends on how fast they are
19:15:50 <clarkb> ya on osic we did 4vcpus
19:15:53 <fungi> these are supposedly "very fast"
19:16:08 <corvus> we've traditionally considered ram more important
19:16:11 <clarkb> and it sounded like if we have feedback on that they are open to it
19:16:22 <clarkb> for example if 4vcpus aren't enough we could probably ask for an 8vcpu flavor
19:16:29 <corvus> as in, more important to keep consistent across providers
19:16:55 <clarkb> but ya they seemed confident these should be much quicker so hopefully we can get away with 4vcpu
19:17:04 <fungi> the only other 8gb flavor i saw had a smaller rootfs and no ephemeral disk
19:17:18 <frickler> I also saw that we have a quota of 50 instances, but only 256GB ram, so that would only be 32 x 8 GB unless I miscalculated
19:17:46 <fungi> they said it was a starter quota, so we can test it out and then let them know when we want to scale up
19:18:06 <clarkb> they also said they may need to build out capacity, but once it's there it should be easy for us to update the max-servers number
19:18:14 <fungi> but yes, we should check the limits and adjust our initial max-servers accordingly
19:18:29 <frickler> ah, o.k., so we should limit on the nodepool side for now, fine then
19:18:47 <fungi> yep
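
Capping on the nodepool side would look something like the sketch below; the provider and label names are hypothetical, and only the flavor and the 32-node math come from the discussion above (image plumbing elided):

    providers:
      - name: rax-flex            # hypothetical provider name
        cloud: rax-flex           # matches the hypothetical clouds.yaml entry above
        region-name: SJC3         # placeholder region
        pools:
          - name: main
            max-servers: 32       # 256GB ram starter quota / 8GB per node
            labels:
              - name: ubuntu-noble
                flavor-name: gp.0.4.8   # the flavor fungi identified
                diskimage: ubuntu-noble
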
19:19:59 <clarkb> so ya I think we keep pushing this forward and we should hopefully have a nice shiny cloud to use soon
19:20:07 <clarkb> #topic Etherpad 2.2.2 Upgrade
19:21:00 <clarkb> As a reminder, the concern with this upgrade is that 2.2.2 breaks how code is imported into etherpad, which appears to break the ep_headings plugin that we've used for years. We swapped that out with ep_headings2 in the 2.2.2 image build. fungi then tested importing a production etherpad dump into the held 2.2.2 node and it looks like ep_headings2 works with existing pads
19:21:42 <clarkb> I think this means we can go ahead and upgrade (maybe after the summit?) as long as we take a couple of extra precautions. Specifically, do a manual db dump prior to the upgrade to make rollbacks easier, and maybe also give the current etherpad image a tag other than latest to make rolling back to it easier too
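
Those precautions could be as simple as the following sketch, assuming a MySQL/MariaDB backend and the opendevorg/etherpad image; the database name, dump filename, and rollback tag are illustrative:

    # Dump the database so a rollback can restore the pre-upgrade state.
    mysqldump --single-transaction etherpad > etherpad-pre-2.2.2-$(date +%F).sql

    # Pin the currently running image under a tag other than latest so it
    # remains easy to roll back to after latest moves to 2.2.2.
    docker tag opendevorg/etherpad:latest opendevorg/etherpad:pre-2.2.2
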
19:21:54 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/926078 WIP change implementing the upgrade
19:22:16 <clarkb> That change is WIP only because I was concerned about the compatibility between plugins, but maybe I'll keep it WIP until we're comfortable with that upgrade path
19:22:24 <fungi> the only thing i didn't find time to do was identify a pad which has level 5 or 6 headings to check, since the new plugin only has up to 4 heading levels
19:22:53 <clarkb> and if those break we can probably do a pad export then reimport without the formatting
19:22:59 <clarkb> annoying but workable
19:23:29 <fungi> also level 5 and 6 headings were uselessly small, so i doubt they saw much use
19:23:45 <fungi> (smaller than the normal text size)
19:24:45 <clarkb> I'll try to plan for that after I return from the summit
19:24:50 <clarkb> should go quickly once we actually do it
19:24:55 <clarkb> #topic Service Coordinator Election
19:25:21 <clarkb> The only nomination I saw during the nomination period was the one I sent. Based on our previous meeting I'm not surprised :)
19:25:44 <clarkb> That means I'm service coordinator again by default unless I missed any nominations. If there was one that was missed please call that out, otherwise I'll consider this election activity done
19:27:20 <clarkb> #topic OSUOSL ARM Cloud Issues
19:27:40 <clarkb> There were two distinct issues noticed in the OSUOSL arm cloud since the linaro cloud shutdown
19:28:18 <clarkb> the first is that our nodepool builder for arm images (nb04) had run out of disk. I cleared out /opt/dib_tmp but image builds continued to fail, which was due to losetup loopback devices all being consumed. I did a reboot to clear out that state and that seems to have corrected things
19:29:07 <clarkb> We have had at least one successful image build since I made those changes. Unfortunately those image builds are very slow (~7 hours). It sounds like some of that slowness may be due to how cinder volumes are implemented there. ramereth says that an ssd-backed volume can be used if some cloud changes are made, which may help
19:29:20 <clarkb> This would be a good improvement, but I think we can limp along as is; it will just be slow
19:29:42 <clarkb> Separately the kolla team noticed that their container image build jobs are super slow and timing out on osuosl since the linaro shutdown too
19:30:21 <clarkb> after some digging it appears that fio shows poor io against the root disk and ephemeral disks in that cloud. Which is good because now we have some concrete measurable problems that we can feed back to osuosl and hopefully improve things with
19:32:22 <clarkb> at this point I think we've got what we need to provide feedback so not much more to do
19:32:33 <clarkb> #topic Updating ansible+ansible-lint versions in our repos
19:32:57 <clarkb> After we updated the default nodeset to ubuntu noble we ran into issues with the versions of ansible + ansible-lint in our linter jobs
19:33:30 <clarkb> basically older ansible and old ansible-lint can't run under python3.12. Things work if we update ansible to ansible 8 to match what zuul runs and ansible-lint to latest. But doing so introduces new errors
19:33:35 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/926848
19:33:39 <clarkb> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:33:49 <clarkb> I've got these two changes which correct the problem for project-config and openstack-zuul-jobs.
19:34:21 <clarkb> The ozj change doesn't pass CI because we update the playbook to build openafs RPMs and that fails on arm64 due to a stale kernel that doesn't match package headers. The fixes to nb04 above should correct that in the next day or two I hope
19:36:17 <clarkb> reviews welcome, most of it is mechanical updates to make the linter happy. I didn't just turn off all the rules because the majority seem to make some sense (like naming plays and using fully qualified names for action modules)
19:36:35 <clarkb> I was less happy about capitalizing words and reordering yaml dicts to someone's preference for order
19:37:55 <corvus> what's the yaml order thing?
19:39:12 <clarkb> trying to find an example, so many changes
19:39:15 <fungi> ansible-lint now cares what order certain associative array elements appear in
19:39:19 <frickler> like this? https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970/4/playbooks/ansible-role-functional/pre.yaml
19:39:23 <clarkb> corvus: but basically they want when to go at the beginning of the block not the end
19:39:25 <mordred> I'm guessing like this: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970/4/playbooks/ansible-role-functional/pre.yaml ?
19:39:38 <clarkb> frickler: that is a variant of it
19:39:39 <mordred> I'm guessing that's "name should come first"?
19:39:48 <clarkb> ya one is name comes first but also when shouldn't be at the end
19:39:52 <fungi> it's more than just name
19:39:57 <mordred> jeez
19:40:00 <mordred> that's a dumb rule
19:40:04 <fungi> then "when" conditions also should be second, i think?
19:40:11 <fungi> yeah, that
19:40:28 <fungi> like much of style linting, it's "someone has an opinion about this"
19:40:33 <mordred> I don't know about you, but frequently I think a play reads better when "when" is at the end
19:40:57 <fungi> i'd have been fine with marking that rule skipped (most of the rules, or even the entire job, honestly)
19:41:21 <corvus> fungi speaks for me
19:42:53 <fungi> i still think at least 95% of the issues ansible-lint catches for us would also be caught by a basic yaml parser, so when you weigh the remaining 5% against the time spent updating style for working code over and over...
19:43:02 <clarkb> heh, happy for followups to refine the ruleset either in followup changes or new patchsets
19:43:11 <corvus> (and fwiw, i usually put when at the beginning)
19:43:20 <clarkb> but this works and it does catch some useful things like the mode thing and being better about using modern names for things
19:43:48 <fungi> i agree the mode check is relevant, because 0644 and '0644' are different data types
19:44:34 <fungi> and the latter is getting interpreted/cast by ansible as octal 644 rather than decimal 644
19:44:47 <fungi> hence entirely different numbers
19:45:04 <corvus> that should be in ansible itself
19:45:17 <clarkb> that would be nice, unfortunately....
19:45:19 <fungi> ideally yes
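
To make the rules being debated concrete, here is an illustrative task (not one from the actual changes) showing what newer ansible-lint flags and the shape it wants:

    # Flagged by newer ansible-lint:
    - copy:
        src: foo.conf
        dest: /etc/foo.conf
        mode: 0644                       # risky-octal: YAML reads this as the integer 420
      when: foo_enabled

    # What it wants instead:
    - name: Install foo configuration    # name first, and capitalized
      when: foo_enabled                  # key-order: "when" near the top, not last
      ansible.builtin.copy:              # fqcn: fully qualified action module name
        src: foo.conf
        dest: /etc/foo.conf
        mode: "0644"                     # quoted so the octal permission bits survive
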
19:45:21 <clarkb> anyway reviews welcome
19:45:33 <clarkb> This was a followup to noble stuff so I wanted to ensure it didn't get forgotten
19:45:56 <clarkb> on the whole though I think the noble default nodeset switch went relatively well. We had some things break but all were fixable in a straightforward manner
19:46:09 <corvus> ++
19:46:21 <fungi> it was less churn than i anticipated
19:47:41 <clarkb> #topic Open Discussion
19:48:09 <clarkb> I wanted to note that zuul-jobs was updated to make prepare-workspace-git faster. This was done by moving the implementation of that role from ansible tasks to ansible library python code
19:48:33 <clarkb> This should speed up jobs quite a bit. The impact will be greater the more repos are involved in a job
19:49:38 <clarkb> be on the lookout for any issues related to it, though I did some spot checking and it seems to be working and is a speedup
19:53:30 <clarkb> Sounds like that may be everything?
19:53:45 <clarkb> Thank you for your time. I'll let you work out if you want a meeting next week before next tuesday
19:53:48 <clarkb> #endmeeting