19:01:06 #startmeeting infra
19:01:06 Meeting started Tue Jul 19 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 The meeting name has been set to 'infra'
19:01:12 #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000345.html Our Agenda
19:01:29 There were no announcements, so we'll dive right into our topic list
19:01:32 #topic Topics
19:01:43 #topic Improving OpenDev's CD throughput
19:02:01 Our new Zuul management seems to be working well (no issues related to the auto upgrades yet that I am aware of)
19:02:25 Probably worth picking the core of this topic back up again if anyone has time for it.
19:02:51 o/
19:03:24 ++; perpetually on my todo stack but somehow new things keep getting pushed on it!
19:03:42 I know the feeling :)
19:03:58 I don't think this is a rush. We've been slowly chipping away at it and making improvements along the way.
19:04:23 Just wanted to point out that the most recent of those improvements don't appear to have created any problems. Please do let us know if you notice any issues with the auto upgrades
19:05:12 #topic Improving Grafana management
19:05:18 #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:05:31 I've been meaning to follow up on this thread but too much travel and vacation recently.
19:05:45 ianw: are we mostly looking for feedback from those who had concerns previously?
19:06:44 cc corvus ^ if you get a chance to look over that and provide your feedback to the thread that would be great. This way we can consolidate all the feedback asynchronously and don't have to rely on the meeting for it
19:06:47 concerns, comments, whatever :) it's much more concrete as there are actual examples that work now :)
19:08:44 #topic Bastion Host Updates
19:09:10 I'm not sure there has been much movement on this yet. But I've also been distracted lately (hopefully back to normal-ish starting today)
19:09:29 ianw: are there any changes for the venv work yet or any other related updates?
19:09:56 not yet, another todo item sorry
19:10:17 no problem. I bring it up because it is slightly related to the next topic
19:10:23 #topic Bionic Server Upgrades
19:10:43 We've got a number of Bionic servers out there that we need to start upgrading to either focal or jammy
19:10:50 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:11:06 I've begun taking notes there and trying to classify and call out the difficulty and process for the various servers
19:11:13 if you want to add more notes about bridge there please feel free
19:11:43 Also if you notice I've missed any feel free to add them (this data set started with the ansible fact cache so should be fairly complete)
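As a side note on the data set mentioned just above: a minimal sketch of how a per-release server list could be pulled out of an Ansible jsonfile fact cache. The cache path, and the assumption that each host's cached facts are one JSON document per host with the usual ansible_distribution_* keys, are illustrative rather than taken from the actual bridge configuration.

    import json
    import pathlib
    from collections import defaultdict

    def hosts_by_release(cache_dir="/var/cache/ansible/facts"):  # placeholder path
        """Group cached hosts by their reported distribution release."""
        releases = defaultdict(list)
        for path in pathlib.Path(cache_dir).iterdir():
            facts = json.loads(path.read_text())
            release = facts.get("ansible_distribution_release", "unknown")
            releases[release].append(path.name)
        return releases

    if __name__ == "__main__":
        for release, hosts in sorted(hosts_by_release().items()):
            print(f"{release}: {len(hosts)} hosts: {', '.join(sorted(hosts))}")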
19:12:18 I did have a couple of things to call out. The first is that I'm not sure we have openafs packages for jammy and/or have tested jammy with its openafs packages
19:12:29 I think we have asserted we prefer to just run our own packages for now though
19:13:09 We can either update servers to focal and punt on the jammy openafs work or start figuring out jammy openafs first. I think either option is fine and probably depends on who starts where :)
19:13:12 we do have jammy packages for that @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs
19:13:26 oh right for the wheel mirror stuff
19:13:34 any idea if we have exercised them outside of the wheel mirror context?
19:13:57 in CI; but only R/O
19:14:08 I'll make a note of that on the etherpad really quickly
19:14:56 one thing that is another perpetual todo list item is to use kafs; it might be worth considering for jammy
19:15:16 it has had a ton of work, more than i can keep up with
19:15:18 ianw: fwiw I tried using it on tumbleweed a while back and didn't have great luck with it. But that was many kernel releases ago and probably worth testing again
19:15:20 worth noting, the current openafs packages don't build with very new kernels
19:15:37 we'll have to keep an eye out if we add hwe kernels on jammy or something
19:15:44 noted
19:16:07 fungi: yeah, although usually there are patches in their gerrit pretty quickly (one of the reasons we keep our ppa :)
19:16:37 clarkb: ianw i have no further feedback on grafyaml, thanks
19:17:41 (i think ianw incorporated the concerns and responses/mitigations i raised earlier in that message)
19:18:06 The other thing I noticed is that our nameservers are needing upgrades. I've got a discussion going in the etherpad on whether we should do replacements or upgrade them in place. I think doing replacements would be best, but we have to update info with our registrars to do that. The opendev.org domain is managed by the foundation's registrar service and fungi and I should be able to update their records. zuul-ci.org is managed by corvus. We can probably coordinate stuff if necessary without too much trouble, but it is an important step not to miss
19:18:21 corvus: thanks
19:19:00 One reason to do replacement ns servers is that we can have them running on focal or jammy and test that nsd and bind continue to work as expected before handing production load to them
19:19:10 we'll want to double-check that any packet filtering or axfr/also-notify we do in bind is updated for new ip addresses
19:19:31 i feel like history may show we did those in place last time?
19:19:43 ianw: I think they started life as bionic and haven't been updated since?
19:19:46 (i'm certain we did for openafs)
19:19:46 But I could be wrong
19:19:54 i'd have to double-check, but typically those are configured by ip address due to chicken and egg problems
19:19:56 that could also be true :)
19:20:21 fungi: yes I think that is all managed by ansible and we should be able to have adns2 and ns3 and ns4 completely up and running before we change anything if we like
19:20:39 the only registrar coordination we should need to do is opendev.org itself
19:21:03 zuul-ci.org references nsX.opendev.org by name, but nsX.opendev.org has glue records in .org
19:21:03 adns1 and ns1 have a last modified date of 2018 on their /etc/hosts files, so i suspect an in-place upgrade previously
19:21:23 fungi: I don't think they existed prior to that
19:21:54 (if we elect to make ns3/ns4, then zuul-ci.org would need to update its ns records, but that's just a regular zone update)
19:22:04 oh, yes they're running ubuntu 18.04 so an october 2018 creation time would imply that
19:22:21 corvus: ah right
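As a rough illustration of the delegation and glue checks discussed above, a sketch of a pre-cutover sanity check. It assumes the third-party dnspython library, and the ns3/ns4 names and addresses are placeholders, not actual plans: the idea is simply to confirm that the NS records for each zone, and the addresses those names resolve to, match the replacement servers before production traffic is pointed at them.

    import dns.resolver  # third-party: dnspython

    # Placeholder names/addresses for hypothetical replacement servers.
    EXPECTED = {
        "ns3.opendev.org.": "203.0.113.13",
        "ns4.opendev.org.": "203.0.113.14",
    }

    def check_delegation(zone):
        """Print whether each NS of the zone resolves to the expected address."""
        for ns in dns.resolver.resolve(zone, "NS"):
            name = str(ns.target)
            addresses = {a.address for a in dns.resolver.resolve(name, "A")}
            status = "OK" if EXPECTED.get(name) in addresses else "CHECK"
            print(f"{zone} NS {name} -> {sorted(addresses)} [{status}]")

    check_delegation("opendev.org")
    check_delegation("zuul-ci.org")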
19:23:07 anyway, it won't be hard, but i do think that replacing the nameservers is likely to need more thorough planning than we'll be able to do in today's meeting
19:23:25 fungi: yup I mostly wanted to call it out and have people leave feedback on the etherpad
19:23:36 wfm
19:23:41 I definitely think we'll be doing jammy upgrades on other services before the nameservers as well
19:24:03 Also if anyone would really like to upgrade any specific service on that etherpad doc feel free to indicate so there
19:24:15 I'm hoping I can dig into jammy node replacements for something in the next week or so
19:24:32 maybe zp01 since it should be straightforward to swap out
19:25:51 #topic Zuul and Ansible v5 and Glibc deadlock
19:26:16 Yesterday all of our executors were updated to run the new zuul executor image with bookworm's glibc
19:26:40 Since then https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0 seems to show no more post failures
19:26:58 sorry about the first attempt, i didn't realise the restart playbook didn't pull images. i guess that is by design?
19:27:04 Considering we had post failures with half the executors on old glibc I strongly suspect this is corrected.
19:27:27 ianw: ya there is a separate pull playbook because sometimes you just need to restart without updating the image
19:28:06 i did think of using ansible's prompt thing to write an interactive playbook, which would say "do you want to restart" and then "do you want to pull new images"
19:28:36 now that i've made the mistake, i guess i won't forget next time :)
19:29:03 It also looks like upstream ansible has closed their issue and says the fix is to upgrade glibc
19:29:18 clarkb: yes, almost certainly; i managed to eventually catch a hung process and it traced back into the exact code discussed in those issues
19:29:22 good news is that is what we have done. Bad news is I guess we're running backported glibc on our executors until bookworm releases
19:29:50 so... 10...20 years?
19:29:53 (for posterity, discussions @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-17.log.html#t2022-07-17T23:30:09)
19:30:10 corvus: ha, zigo made it sound like this winter. So maybe next spring :)
19:30:51 i wonder if a glibc fix could make it into stable?
19:30:52 (NA seasons)
19:31:41 corvus: not a bad idea since it affects real software like ansible. Though like rhel I wonder if debian won't prioritize it due to their packaged version of ansible being fine (however other software may struggle with this bug and we just don't know about it yet)
19:31:43 it seemed a quite small fix and also seems serious enough; it's very annoying to have things deadlock like that
19:32:31 is the process for that similar to ubuntu? Open a bug with a link to the issue and ask for a stable backport with justification?
19:32:45 bookworm release is expected in 2023 yeah
19:33:03 the freeze starts late this year
19:33:48 hard freeze is next march
19:34:14 clarkb: that's where i'd start; i can take a closer look at the patch and send something up
19:34:30 actually i misspoke, the transitions deadline is late this year, soft freeze is early next year according to the schedule
19:34:46 ianw: thanks
19:35:22 mmm, in fact even the transition deadline is actually early next year (january)
19:35:37 https://release.debian.org/
19:36:59 Seems like we've got a good next step and we can likely run that glibc until 2023 if necessary. It is limited to the executors which should limit impact
19:37:15 #topic Other Zuul job POST_FAILURES
19:37:53 yesterday it occurred to me that the POST_FAILURES that OSA was seeing could be due to the same issue. I asked jrosser if they had updated their job ansible version to v5 early (because they had these errors prior to us updating our default), but unfortunately they had not
19:38:07 that means their issue is very likely independent of our issue that we believe is now fixed.
19:38:13 #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs
19:38:37 for that reason I think we should try and get this change landed to add the extra swift location debugging info to our base jobs. Please cross check against the base-test job and leave a review
19:39:16 ianw did manage to strace a live one last week and found that it appeared to be our end waiting on read()s returning from the remote swift server
19:39:57 which provider?
19:40:01 the remote was ovh. I suspect we may need to gather some evidence that it is consistently ovh using the above change, then take that to ovh and see if they know why it is breaking
19:41:31 one thing we noticed was this one included an ARA report, which is a lot of little files
19:42:03 at first i thought that might be in common with our hang issues (they generate ara), but as discussed it turned out to be completely separate
19:42:28 https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-13.log.html#t2022-07-13T23:19:34
19:42:45 is the discussion of what i found on the stuck process
19:43:52 I suspect that the remote service is having some sort of problem. If we can collect stronger evidence of that (through consistent failures against a single backend) and take that evidence with timestamps to the provider, they should be able to debug it
19:44:07 which makes that base job change important. But ya we can take it from there if/when we get that evidence
19:45:19 #topic Gerrit 3.5 Larger Caches
19:45:58 Last week our gerrit installation ran out of disk space due to the growth of its caches. Turns out that Gerrit 3.5 added some new caches that were expected to grow quite large but upstream didn't properly document this change in their release notes
19:46:34 Upstream has said they should update those release notes now. They also noted part of the problem is a deficiency in h2, where it spends only 200ms compacting its backing file for the database before giving up, which is insufficient to make a very large file smaller
19:47:18 That said they did think that the backing files would essentially stop growing after a certain size (a size big enough to fit all the data we use in a day)
19:47:45 To accommodate that we've increased the size of our volume and filesystem on the gerrit installation (thank you ianw and fungi for getting that done)
19:48:20 We are now in a period of monitoring the consumption to check the assumptions from upstream about growth eventually plateauing.
19:48:29 #link http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70356&rra_id=all Cacti graphs of review02 disk consumption
19:49:09 I did test using held nodes (which I should delete now I guess) that you can safely delete these two large cache files while gerrit is offline and it will recreate them when needed after starting again
19:49:30 If we find ourselves in a "gerrit is crashing" situation I think ^ is a good short term fix.
19:49:35 #link https://review.opendev.org/c/opendev/system-config/+/849886 is our fallback if the cache disk usage doesn't stabilize.
19:49:53 ++
19:49:56 does that graph mean that we're approximately at nominal size for those files now since it's leveled out?
19:50:14 And this change which is marked WIP is the long term fix if we end up having problems again. It basically says we won't cache this info on disk (it will still maintain memory caches for the data)
19:50:41 corvus: yes I think we're seeing that we are close to leveling out. The bigger of the two files hasn't grown much at all since the original problem was discovered. The smaller of the two has grown by about 25%
19:51:26 time will tell though. It is also possible that we're just quieter than usual since we had the problem and if development picks up again we'll see growth
19:51:43 I think it is still a bit early to draw any major conclusions, but it is looking good at least
19:52:34 For future upgrades we might want to start capturing any new cache files too. I bet our upgrade job could actually be modified to check which caches are present pre-upgrade and diff that against the list post-upgrade
19:52:39 the cache size does strike me as being unnecessarily large though. searching through that much disk is unlikely to result in a performance benefit
19:52:46 I can look at updating our upgrade job to try and do that
19:53:05 fungi: it may actually help because computing diffs is not cheap
19:53:25 oh, is it git diffs/interdiffs?
19:53:38 fungi: while doing a large disk lookup may not be quick either, diffing is an expensive operation on top of the file io to perform the diff
19:53:40 then yeah, i could see how that could need a lot of space
19:53:58 fungi: yes it is caching results of various diff operations
19:54:25 but can anything ever leave the cache, if h2 can't clean it?
19:54:45 ianw: yes items leave the cache aiui. It is just that the backing file isn't trimmed to match
19:54:58 so that very large file may only be 1% used
19:55:37 then why do we think it'll level out at a daily max? won't people (bots actually) be continually diffing new things and adding to the cache and therefore backing file?
19:55:54 because it isn't append-only aiui
19:56:08 it can reuse the allocated blocks that are free in h2 (but not on disk) to cache new things
19:56:14 backing file space gets reclaimed but not deleted?
19:56:44 from h2's perspective the database may have plenty of room and not need to grow the disk. But the disk once expanded doesn't really shrink
19:56:58 i guess it needs to, basically, defrag itself?
19:57:08 ianw: ya and spend more than 200ms doing so apparently
19:57:12 sounds like it -- and defrag is too computationally expensive
19:57:28 moreso the larger the set
19:57:56 well i've spent plenty of time staring at defrag moving blocks around in the dos days :) never did me any harm ... kids (databases?) these days ...
19:58:00 it's like a python list. It will expand and expand to add more records. But once you del stuff out of it you don't free up that room
19:58:13 but if you need the list to be bigger again you don't need to allocate more memory; you just reuse what was already there
19:58:16 at least that's a consistent set of behaviors i can understand, even if it's .... suboptimal.
19:58:45 it seems that part of the upgrade steps is probably to delete the caches though?
19:59:01 a super-defrag
19:59:03 ianw: upstream was talking about doing that automatically for us
19:59:09 so ya might make sense to just do it
19:59:38 ++ one to keep in mind for the checklist
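The backing-file behaviour described in that exchange can be illustrated with a toy model. This is only a sketch of the behaviour as discussed (freed blocks inside the file are reused for new cache entries, but the file itself never shrinks), not how h2 is actually implemented:

    class BackingFile:
        """Toy model: space inside the file is reused, the file never shrinks."""

        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.blocks = []    # the file grows by appending blocks
            self.free = set()   # indexes of blocks whose entries were evicted

        def allocate(self):
            if self.free:
                return self.free.pop()          # reuse space inside the file
            self.blocks.append(bytearray(self.block_size))  # otherwise grow
            return len(self.blocks) - 1

        def release(self, index):
            self.free.add(index)                # freed, but file size unchanged

        def size_on_disk(self):
            return len(self.blocks) * self.block_size

    cache = BackingFile()
    blocks = [cache.allocate() for _ in range(3)]   # file grows to 3 blocks
    cache.release(blocks[0])                        # entry evicted, file unchanged
    cache.allocate()                                # reuses the freed block
    print(cache.size_on_disk())                     # still 3 * 4096

Under that model the file sits at its high-water mark: once it is large enough to hold roughly a day's worth of diff entries, new entries land in previously freed blocks rather than growing the file, which matches the "levels out at a daily max" expectation discussed above.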
19:59:44 We are almost at time and I had a couple of small changes I wanted to call out before we run out of time
19:59:48 #topic Open Discussion
20:00:17 #link https://review.opendev.org/c/opendev/system-config/+/850268 Add a new mailing list for computing force network to lists.opendev.org
20:00:32 fungi: ^ I think you wanted to make sure others were comfortable with adding this list under the opendev.org domain?
20:00:51 (we have a small number of similar working-group-like lists already so no objection from me)
20:00:53 yeah, even just a heads up that it's proposed
20:01:07 a second review would be appreciated
20:01:17 #link https://review.opendev.org/c/openstack/project-config/+/850252 Fix our new project CI checks to handle gitea/git trying to auth when repo doesn't exist
20:01:51 When you try to git clone a repo that doesn't exist in gitea you don't get a 404 and instead get prompted to provide credentials, as the assumption is you need to authenticate to see a repo which isn't public
20:02:09 This change works around that by doing an http fetch of the gitea ui page instead of a git clone to determine if the repo exists already
20:02:22 (note this isn't really gitea specific, github will do the same thing)
20:02:51 And we are at time. Thank you everyone
20:03:10 Feel free to continue discussion on the mailing list or in #opendev. I'll end this meeting so we can return to $meal and other tasks :)
20:03:12 #endmeeting
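For reference, the approach described for change 850252 above can be sketched roughly like this (the function name and example repo are illustrative only, not the actual project-config implementation): fetch the repository's Gitea web UI page and treat a 404 as "does not exist", instead of running git clone, which prompts for credentials when the repo is missing.

    import urllib.error
    import urllib.request

    def repo_exists(base_url, repo):
        """Return True if the repo's web UI page answers with a 2xx status."""
        try:
            with urllib.request.urlopen(f"{base_url}/{repo}", timeout=30):
                return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False    # missing repos 404 in the UI instead of asking for auth
            raise

    print(repo_exists("https://opendev.org", "opendev/system-config"))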