19:00:21 <clarkb> #startmeeting infra
19:00:21 <opendevmeet> Meeting started Tue Aug 27 19:00:21 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:21 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:21 <opendevmeet> The meeting name has been set to 'infra'
19:00:32 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/CGWURWK2YK4LLA7VPHS5KXF63I47EOYJ/ Our Agenda
19:00:41 <clarkb> #topic Announcements
19:00:56 <clarkb> Due to timezones and travel and conference obligations I won't make it to next weeks meeting.
19:01:52 <clarkb> #topic Upgrading old servers
19:02:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921321 Wiki replacement ansible stack
19:03:01 <clarkb> Looks like a couple of us have reviewed that stack since the meeting last week. Overall things look good to me. My main concern was how the ansible is set up to stop and start things on every run. I think we can probably live with that before we do the cutover if we prefer not to fix that upfront
19:03:10 <clarkb> or we can fix it upfront and avoid unnecessary restarts
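[A minimal sketch of the handler-based pattern that would avoid restarting on unchanged runs; the hosts, paths, and service names here are hypothetical, not the actual wiki roles:]

```yaml
- hosts: wiki
  tasks:
    - name: Write mediawiki config
      ansible.builtin.template:
        src: LocalSettings.php.j2
        dest: /etc/mediawiki/LocalSettings.php
        mode: '0640'
      notify: Restart mediawiki    # only fires if the file changed
  handlers:
    - name: Restart mediawiki
      ansible.builtin.service:
        name: mediawiki
        state: restarted
```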
19:03:39 <clarkb> looks like frickler found some functional issues that need correcting in the job setup as well
19:05:04 <clarkb> Not sure if frickler or tonyb are around for the meeting, but are there any questions about the reviews?
19:06:41 <clarkb> sounds like no at least for now
19:07:14 <clarkb> separately tonyb also got some new Noble mirrors running. https://mirror02.sjc1.vexxhost.opendev.org/ I believe that is one of them and it appears to be working
19:07:29 <clarkb> we should probably go ahead and cut dns over and start thinking about cleaning up the old servers
19:07:31 <frickler> I'm around, but not sure about the question?
19:07:59 <clarkb> frickler: I was mostly opening the door for tonyb to provide feedback on our reviews if there was any. I know I ended up writing a number of comments
19:08:30 <frickler> ok
19:09:15 <tonyb> yup they're very helpful.
19:09:54 <clarkb> tonyb: any questions or concerns or updates?
19:10:01 <tonyb> I'm working on addressing them, just slowly due to running up and down a mountain
19:10:12 <tonyb> nope nothing specific yet
19:10:24 <clarkb> cool. Thank you for continuing to push this along
19:10:32 <clarkb> #topic AFS Mirror Cleanups
19:10:53 <fungi> if it was a sacred mountain, i hope you wore curse-resistant footwear
19:10:54 <clarkb> I don't have anything new here. I've been distracted by new clouds and summit/travel prep and this is an easy thing to deprioritize...
19:11:07 <clarkb> #topic Rackspace Flex Cloud
19:11:21 <clarkb> But we have info about rackspace's new cloud setup and it sounds very promising
19:11:26 <fungi> it's ready to be flexed
19:11:58 <clarkb> basically they are rolling out a new cloud deployment generation. It's currently still in some sort of pre-release state but they are happy for us to start kicking the tires on it.
19:12:37 <clarkb> Our existing accounts work with it if we use a different keystone and region. fungi set up clouds.yaml for us and it seems to be working. I think we should treat this as a separate cloud though because it is so different even though the credentials align
19:12:49 <clarkb> so we have new clouds.yaml entries for it and we'll have separate nodepool providers and so on
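[For illustration, a sketch of what treating the region as a separate cloud entry looks like; the cloud name, auth_url, project, and region here are placeholders, not the real values from the private clouds.yaml:]

```yaml
clouds:
  rax-flex:                    # hypothetical cloud name
    auth:
      auth_url: https://keystone.flex.example.rackspace.com/v3  # placeholder
      project_name: opendev    # placeholder; real creds live in the private store
    region_name: SJC3          # placeholder region
```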
19:12:58 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/927214 Enroll New cloud region into cloud launcher
19:13:16 <fungi> yeah, the open change splits the credential vars in our private store, even though they're just copies of the same values at the moment
19:13:19 <clarkb> I believe this is the next step in rolling out our usage of the flex cloud. Basically configure networking, ssh keys, and security groups
19:13:44 <clarkb> Then when that is done we can figure out flavors and quotas, deploy a mirror node, then point nodepool at it
19:14:04 <clarkb> it does appear they have a noble image, so we don't have to upload our own like tonyb did with other clouds, but we can do that too if we want things to be in sync
19:14:49 <fungi> i already identified --flavor=gp.0.4.8 as being equivalent to our standard for job nodes
19:15:15 <fungi> that's 8gb ram, 4 vcpus, 80gb rootfs
19:15:29 <fungi> also has a 128gb ephemeral disk
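[The flavor details above can be confirmed with the standard openstackclient; the cloud name matches the hypothetical clouds.yaml entry sketched earlier:]

```shell
openstack --os-cloud rax-flex flavor show gp.0.4.8
```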
19:15:35 <frickler> iiuc our standard is 8 vcpus?
19:15:44 <fungi> depends on how fast they are
19:15:50 <clarkb> ya on osic we did 4vcpus
19:15:53 <fungi> these are supposedly "very fast"
19:16:08 <corvus> we've traditionally considered ram more important
19:16:11 <clarkb> and it sounded like if we have feedback on that they are open to it
19:16:22 <clarkb> for example if 4vcpus aren't enough we could probably ask for an 8vcpu flavor
19:16:29 <corvus> as in, more important to keep consistent across providers
19:16:55 <clarkb> but ya they seemed confident these should be much quicker so hopefully we can get away with 4vcpu
19:17:04 <fungi> the only other 8gb flavor i saw had a smaller rootfs and no ephemeral disk
19:17:18 <frickler> I also saw that we have a quota of 50 instances, but only 256GB ram, so that would only be 32 x 8 GB unless I miscalculated
19:17:46 <fungi> they said it was a starter quota, so we can test it out and then let them know when we want to scale up
19:18:06 <clarkb> they also said they may need to build out capacity, but once it's there it should be easy for us to update the max-servers number
19:18:14 <fungi> but yes, we should check the limits and adjust our initial max-servers accordingly
19:18:29 <frickler> ah, o.k., so we should limit on the nodepool side for now, fine then
19:18:47 <fungi> yep
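[The math above translated into a nodepool cap, as a sketch: 256GB of RAM quota at 8GB per node allows at most 32 instances despite the 50-instance quota. Provider, pool, and label names are hypothetical:]

```yaml
providers:
  - name: raxflex-sjc3         # hypothetical provider name
    cloud: rax-flex            # matches the clouds.yaml sketch above
    pools:
      - name: main
        max-servers: 32        # 256GB ram quota / 8GB per node
        labels:
          - name: ubuntu-noble
            flavor-name: gp.0.4.8
```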
19:19:59 <clarkb> so ya I think we keep pushing this forward and we should hopefully have a nice shiny cloud to use soon
19:20:07 <clarkb> #topic Etherpad 2.2.2 Upgrade
19:21:00 <clarkb> As a reminder, the concern with this upgrade is that 2.2.2 breaks how plugin code is imported into etherpad, which appears to break the ep_headings plugin that we've used for years. We swapped that out for ep_headings2 in the 2.2.2 image build. fungi then imported a production etherpad dump into the held 2.2.2 node and it looks like ep_headings2 works with existing pads
19:21:42 <clarkb> I think this means we can go ahead and upgrade (maybe after the summit?) as long as we take a couple of extra precautions. Specifically do a manual db dump prior to the upgrade to make rollbacks easier and maybe also give the current etherpad image a tag other than latest to make rolling back to it easier too
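[Roughly, the two precautions amount to something like this; the database name, image name, and tag are assumptions for illustration:]

```shell
# manual db dump prior to the upgrade, for easier rollback
mysqldump --single-transaction etherpad > etherpad-pre-upgrade.sql
# pin the current image under a non-latest tag before pulling 2.2.2
docker tag opendevorg/etherpad:latest opendevorg/etherpad:rollback
```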
19:21:54 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/926078 WIP Change implementing the upgrade
19:22:16 <clarkb> That change is WIP only because I was concerned about the compatibility between plugins but maybe I'll keep it WIP until we're comfortable with that upgrade path
19:22:24 <fungi> the only thing i didn't find time to do was identify a pad which has level 5 or 6 headings to check, since the new plugin only has up to 4 heading levels
19:22:53 <clarkb> and if those break we can probably do a pad export then reimport without the formatting
19:22:59 <clarkb> annoying but workable
19:23:29 <fungi> also level 5 and 6 headings were uselessly small, so i doubt they saw much use
19:23:45 <fungi> (smaller than the normal text size)
19:24:45 <clarkb> I'll try to plan for that after I return from the summit
19:24:50 <clarkb> should go quickly once we actually do it
19:24:55 <clarkb> #topic Service Coordinator Election
19:25:21 <clarkb> The only nomination I saw during the nomination period was the one I sent. Based on our previous meeting I'm not surprised :)
19:25:44 <clarkb> That means I'm service coordinator again by default unless I missed any nominations. If there was one that was missed please call that out otherwise I'll consider this election activity done
19:27:20 <clarkb> #topic OSUOSL ARM Cloud Issues
19:27:40 <clarkb> There were two distinct issues that have been noticed in the OSUOSL arm cloud since the linaro cloud shutdown
19:28:18 <clarkb> the first is that our nodepool builder for arm images (nb04) had run out of disk. I cleared out /opt/dib_tmp but image builds continued to fail, which was due to losetup loopback devices all being consumed. I did a reboot to clear out that state and that seems to have corrected things
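[The loopback state can be inspected with losetup; a reboot is what actually cleared it here, but the equivalent manual checks and cleanup look like:]

```shell
sudo losetup -a    # list all attached loop devices
sudo losetup -D    # detach every loop device
```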
19:29:07 <clarkb> We have had at least one successful image build since I made those changes. Unfortunately those image builds are very slow (~7 hours). It sounds like some of that slowness may be due to how cinder volumes are implemented there. ramereth says that an SSD-backed volume can be used if some cloud changes are made, which may help
19:29:20 <clarkb> This would be a good improvement, but I think we can limp along as is; it will just be slow
19:29:42 <clarkb> Separately the kolla team noticed that their container image build jobs are super slow and timing out on osuosl since the linaro shutdown too
19:30:21 <clarkb> after some digging it appears that fio shows poor io against the root disk and ephemeral disks in that cloud. Which is good because now we have some concrete, measurable problems that we can feed back to osuosl and hopefully improve things
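[An example of the kind of fio measurement described, assuming fio is installed on a test node; the parameters are illustrative, not the exact ones used:]

```shell
fio --name=randwrite --rw=randwrite --bs=4k --size=1g \
    --direct=1 --runtime=60 --time_based --end_fsync=1
```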
19:32:22 <clarkb> at this point I think we've got what we need to provide feedback so not much more to do
19:32:33 <clarkb> #topic Updating ansible+ansible-lint versions in our repos
19:32:57 <clarkb> After we updated the default nodeset to ubuntu noble we ran into issues with the versions of ansible + ansible-lint in our linter jobs
19:33:30 <clarkb> basically older ansible and old ansible-lint can't run under python3.12. Things work if we update ansible to ansible 8 to match what zuul runs and ansible-lint to latest. But doing so introduces new errors
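[In practice the bump amounts to something like the following pins; these are illustrative, not the exact requirements files:]

```shell
pip install 'ansible>=8,<9' ansible-lint   # ansible 8 matches zuul; lint at latest
```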
19:33:35 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/926848
19:33:39 <clarkb> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:33:49 <clarkb> I've got these two changes which correct the problem for project-config and openstack-zuul-jobs.
19:34:21 <clarkb> The ozj change doesn't pass CI because we update the playbook to build openafs RPMs and that fails on arm64 due to a stale kernel that doesn't match package headers. The fixes to nb04 above should correct that in the next day or two I hope
19:36:17 <clarkb> reviews welcome, most of it is mechanical updates to make the linter happy. I didn't just turn off all the rules because the majority seem to make some sense (like naming plays and using fully qualified paths for action modules)
19:36:17 <clarkb> I was less happy about capitalizing words and reordering yaml dicts to someone's preference for order
19:37:55 <corvus> what's the yaml order thing?
19:39:12 <clarkb> trying to find an example, so many changes
19:39:15 <fungi> ansible-lint now cares what order certain associative array elements appear in
19:39:19 <frickler> like this? https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970/4/playbooks/ansible-role-functional/pre.yaml
19:39:23 <clarkb> corvus: but basically they want when to go at the beginning of the block not the end
19:39:25 <mordred> I'm guessing like this: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970/4/playbooks/ansible-role-functional/pre.yaml ?
19:39:38 <clarkb> frickler: that is a variant of it
19:39:39 <mordred> I'm guessing that's "name should come first" ?
19:39:48 <clarkb> ya one is name comes first but also when shouldn't be at the end
19:39:52 <fungi> it's more than just name
19:39:57 <mordred> jeez
19:40:00 <mordred> that's a dumb rule
19:40:04 <fungi> then "when" conditions also should be second, i think?
19:40:11 <fungi> yeah, that
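[A hypothetical before/after of what the ordering rule flags, per the discussion above:]

```yaml
# flagged: when at the end of the task
- name: Install wiki packages
  ansible.builtin.package:
    name: mediawiki
    state: present
  when: wiki_enabled | bool

# what ansible-lint wants: name first, when second
- name: Install wiki packages
  when: wiki_enabled | bool
  ansible.builtin.package:
    name: mediawiki
    state: present
```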
19:40:28 <fungi> like much of style linting, it's "someone has an opinion about this"
19:40:28 <mordred> I don't know about you, but frequently I think a play reads better when 'when' is at the end
19:40:57 <fungi> i'd have been fine with marking that rule skipped (most of the rules, or even the entire job, honestly)
19:41:21 <corvus> fungi speaks for me
19:42:53 <fungi> i still think at least 95% of the issues ansible-lint catches for us would also be caught by a basic yaml parser, so when you weigh the remaining 5% against the time spent updating style for working code over and over...
19:43:02 <clarkb> heh happy for followups to refine the ruleset either in followup changes or new patchsets
19:43:11 <corvus> (and fwiw, i usually put when at the beginning)
19:43:20 <clarkb> but this works and it does catch some useful things like the mode thing and being better about using modern names for things
19:43:48 <fungi> i agree the mode check is relevant, because 0644 and '0644' are different data types
19:44:34 <fungi> and the latter is getting interpreted/cast by ansible as octal 644 rather than decimal 644
19:44:47 <fungi> hence entirely different numbers
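[A sketch of the pitfall being described: unquoted numeric modes are easy to get wrong (a bare decimal 644 is not octal 0644), so the lint rule wants the quoted string form. The path here is hypothetical:]

```yaml
- name: Install a config file
  ansible.builtin.copy:
    src: settings.conf
    dest: /etc/example/settings.conf   # hypothetical path
    mode: '0644'   # quoted string; unambiguously octal 644
```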
19:45:04 <corvus> that should be in ansible itself
19:45:17 <clarkb> that would be nice, unfortunately....
19:45:19 <fungi> ideally yes
19:45:21 <clarkb> anyway reviews welcome
19:45:33 <clarkb> This was a followup to noble stuff so I wanted to ensure it didn't get forgotten
19:45:56 <clarkb> on the whole though I think the noble default nodeset switch went relatively well. We had some things break but all were fixable in a straightforward manner
19:46:09 <corvus> ++
19:46:21 <fungi> it was less churn than i anticipated
19:47:41 <clarkb> #topic Open Discussion
19:48:09 <clarkb> I wanted to note that zuul-jobs was updated to make prepare-workspace-git faster. This was done by moving the implementation of that role from ansible tasks to ansible library python code
19:48:33 <clarkb> This should speed up jobs quite a bit. The impact will be greater the more repos are involved in a job
19:49:38 <clarkb> be on the lookout for any issues related to it, though I did some spot checking and it seems to be working and is a speedup
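[A greatly simplified sketch of the library-module approach; this is not the actual zuul-jobs code, and the argument names and logic are illustrative. The point is that one python module does all the per-repo work, avoiding the per-task overhead of looping ansible tasks over many repositories:]

```python
#!/usr/bin/python
import subprocess
from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(argument_spec=dict(
        repos=dict(type='list', required=True),  # repo paths to prepare
        cache=dict(type='str', required=True),   # on-node git cache root
    ))
    for repo in module.params['repos']:
        # one process handles every repo; no per-repo task startup cost
        subprocess.run(
            ['git', 'clone', '%s/%s' % (module.params['cache'], repo), repo],
            check=True)
    module.exit_json(changed=True)


if __name__ == '__main__':
    main()
```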
19:53:30 <clarkb> Sounds like that may be everything?
19:53:45 <clarkb> Thank you for your time. I'll let you work out if you want a meeting next week before next tuesday
19:53:48 <clarkb> #endmeeting