19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Jul 19 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:12 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000345.html Our Agenda
19:01:29 <clarkb> There were no announcements, so we'll dive right into our topic list
19:01:32 <clarkb> #topic Topics
19:01:43 <clarkb> #topic Improving OpenDev's CD throughput
19:02:01 <clarkb> Our new Zuul management seems to be working well (no issues related to the auto upgrades yet that I am aware of)
19:02:25 <clarkb> Probably worth picking the core of this topic back up again if anyone has time for it.
19:02:51 <ianw> o/
19:03:24 <ianw> ++; perpetually on my todo stack but somehow new things keep getting pushed on it!
19:03:42 <clarkb> I know the feeling :)
19:03:58 <clarkb> I don't think this is a rush. We've been slowly chipping at it and making improvements along the way.
19:04:23 <clarkb> Just wanted to point out that the most recent of those improvements haven't appeared to create any problems. Please do let us know if you notice any issues with the auto upgrades
19:05:12 <clarkb> #topic Improving Grafana management
19:05:18 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:05:31 <clarkb> I've been meaning to follow up on this thread but too much travel and vacation recently.
19:05:45 <clarkb> ianw: are we mostly looking for feedback from those that had concerns previously?
19:06:44 <clarkb> cc corvus ^ if you get a chance to look over that and provide your feedback to the thread that would be great. This way we can consolidate all the feedback asynchronously and don't have to rely on the meeting for it
19:06:47 <ianw> concerns, comments, whatever :) it's much more concrete as there's actual examples that work now :)
19:08:44 <clarkb> #topic Bastion Host Updates
19:09:10 <clarkb> I'm not sure there has been much movement on this yet. But I've also been distracted lately (hopefully back to normalish starting today)
19:09:29 <clarkb> ianw: are there any changes for the venv work yet or any other related updates?
19:09:56 <ianw> not yet, another todo item sorry
19:10:17 <clarkb> no problem. I bring it up because it is slightly related to the next topic
19:10:23 <clarkb> #topic Bionic Server Upgrades
19:10:43 <clarkb> We've got a number of Bionic servers out there that we need to start upgrading to either focal or jammy
19:10:50 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:11:06 <clarkb> I've begun taking notes there and trying to classify and call out the difficulty and process for the various servers
19:11:13 <clarkb> if you want to add more notes about bridge there please feel free
19:11:43 <clarkb> Also if you notice I've missed any feel free to add them (this data set started with the ansible fact cache so should be fairly complete)
19:12:18 <clarkb> I did have a couple of things to call out. The first is that I'm not sure we have openafs packages for jammy and/or have tested jammy with its openafs packages
19:12:29 <clarkb> I think we have asserted we prefer to just run our own packages for now though
19:13:09 <clarkb> We can either update servers to focal and punt on the jammy openafs work or start figuring out jammy openafs first. I think either option is fine and probably depends on who starts where :)
19:13:12 <ianw> we do have jammy packages for that @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs
19:13:26 <clarkb> oh right for the wheel mirror stuff
19:13:34 <clarkb> any idea if we have exercised them outside of the wheel mirror context?
19:13:57 <ianw> in CI; but only R/O
19:14:08 <clarkb> I'll make note of that on the etherpad really quickly
19:14:56 <ianw> one thing that is another perpetual todo list item is to use kafs; it might be worth considering for jammy
19:15:16 <ianw> it has had a ton of work, more than i can keep up with
19:15:18 <clarkb> ianw: fwiw I tried using it on tumbleweed a while back and didn't have great luck with it. But that was many kernel releases ago and probably worth testing again
19:15:20 <fungi> worth noting, the current openafs packages don't build with very new kernels
19:15:37 <fungi> we'll have to keep an eye out if we add hwe kernels on jammy or something
19:15:44 <clarkb> noted
19:16:07 <ianw> fungi: yeah, although usually there's patches in their gerrit pretty quickly (one of the reasons we keep our ppa :)
19:16:37 <corvus> clarkb: ianw i have no further feedback on grafyaml, thanks
19:17:41 <corvus> (i think ianw incorporated the concerns and responses/mitigations i raised earlier in that message)
19:18:06 <clarkb> The other thing I noticed is that our nameservers are needing upgrades. I've got a discussion in the etherpad going on whether or not we should do replacements or do them in place. I think doing replacements would be best, but we have to update info with our registrars to do that. The opendev.org domain is managed by the foundation's registrar service and fungi and I should
19:18:08 <clarkb> be able to update their records. zuul-ci.org is managed by corvus. We can probably coordinate stuff if necessary without too much trouble, but it is an important step to not miss
19:18:21 <clarkb> corvus: thanks
19:19:00 <clarkb> One reason to do replacement ns servers is that we can have them running on focal or jammy and test that nsd and bind continue to work as expected before handing production load to them
19:19:10 <fungi> we'll want to double-check that any packet filtering or axfr/also-notify we do in bind is updated for new ip addresses
19:19:31 <ianw> i feel like history may show we did those in place last time?
19:19:43 <clarkb> ianw: I think they started life as bionic and haven't been updated since?
19:19:46 <ianw> (i'm certain we did for openafs)
19:19:46 <clarkb> But I could be wrong
19:19:54 <fungi> i'd have to double-check, but typically those are configured by ip address due to chicken and egg problems
19:19:56 <ianw> that could also be true :)
19:20:21 <clarkb> fungi: yes I think that is all managed by ansible and we should be able to have adns2 and ns3 and ns4 completely up and running before we change anything if we like
19:20:39 <corvus> the only registrar coord we should need to do is opendev.org itself
19:21:03 <corvus> zuul-ci.org references nsX.opendev.org by name, but nsX.opendev.org has glue records in .org
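A minimal sketch of the delegation corvus describes, for checking where each zone's NS records point before and after a swap. The use of the dnspython library is an assumption for illustration only, not part of the team's tooling:

    # Rough sketch, assuming the dnspython package is installed: show which
    # nameservers each zone currently delegates to.
    import dns.resolver

    for zone in ("opendev.org", "zuul-ci.org"):
        answer = dns.resolver.resolve(zone, "NS")
        print(zone, sorted(str(rdata.target) for rdata in answer))

    # opendev.org's NS set (and its glue in .org) has to change at the registrar,
    # while zuul-ci.org only needs a regular zone update to its own NS records,
    # since it points at nsX.opendev.org by name.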
19:21:03 <fungi> adns1 and ns1 have a last modified date of 2018 on their /etc/hosts files, so i suspect an in-place upgrade previously
19:21:23 <clarkb> fungi: I don't think they existed prior to that
19:21:54 <corvus> (if we elect to make ns3/ns4, then zuul-ci.org would need to update its ns records, but that's just a regular zone update)
19:22:04 <fungi> oh, yes they're running ubuntu 18.04 so an october 2018 creation time would imply that
19:22:21 <clarkb> corvus: ah right
19:23:07 <fungi> anyway, it won't be hard, but i do think that replacing the nameservers is likely to need some more thorough planning than we'll be able to do in today's meeting
19:23:25 <clarkb> fungi: yup I mostly wanted to call it out and have people leave feedback on the etherpad
19:23:36 <fungi> wfm
19:23:41 <clarkb> I definitely think we'll be doing jammy upgrades on other services before the nameservers as well
19:24:03 <clarkb> Also if anyone would really like to upgrade any specific service on that etherpad doc feel free to indicate so there
19:24:15 <clarkb> I'm hoping I can dig into jammy node replacements for something in the next week or so
19:24:32 <clarkb> maybe zp01 since it should be straightforward to swap out
19:25:51 <clarkb> #topic Zuul and Ansible v5 and Glibc deadlock
19:26:16 <clarkb> Yesterday all of our executors were updated to run the new zuul executor image with bookworm's glibc
19:26:40 <clarkb> Since then https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0 seems to show no more post failures
19:26:58 <ianw> sorry about the first attempt, i didn't realise the restart playbook didn't pull images. i guess that is by design?
19:27:04 <clarkb> Considering we had post failures with half the executors on old glibc I strongly suspect this is corrected.
19:27:27 <clarkb> ianw: ya there is a separate pull playbook because sometimes you just need to restart without updating the image
19:28:06 <ianw> i did think of using ansible's prompt thing to write an interactive playbook, which would say "do you want to restart <scheduler|merger|executor> <y/n>" and then "do you want to pull new images <y|n>"
19:28:36 <ianw> now i've made the mistake, i guess i won't forget next time :)
19:29:03 <clarkb> It also looks like upstream ansible has closed their issue and says the fix is to upgrade glibc
19:29:18 <ianw> clarkb: yes, almost certainly; i managed to eventually catch a hung process and it traced back into the exact code discussed in those issues
19:29:22 <clarkb> good news is that is what we have done. Bad news is I guess we're running backported glibc on our executors until bookworm releases
19:29:50 <corvus> so... 10...20 years?
19:29:53 <ianw> (for posterity, discussions @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-17.log.html#t2022-07-17T23:30:09)
19:30:10 <clarkb> corvus: ha, zigo made it sound like this winter. So maybe next spring :)
19:30:51 <corvus> i wonder if a glibc fix could make it into stable?
19:30:52 <clarkb> (NA seasons)
19:31:41 <clarkb> corvus: not a bad idea since it affects real software like ansible. Though like rhel I wonder if debian won't prioritize it due to their packaged version of ansible being fine (however other software may struggle with this bug and we just don't know about it yet)
19:31:43 <ianw> it seemed a quite small fix and also seems serious enough, it's very annoying to have things deadlock like that
19:32:31 <clarkb> is the process for that similar to ubuntu? Open a bug with a link to the issue and ask for a stable backport with justification?
19:32:45 <fungi> bookworm release is expected in 2023 yeah
19:33:03 <fungi> the freeze starts late this year
19:33:48 <fungi> hard freeze is next march
19:34:14 <ianw> clarkb: that's where i'd start; i can take a closer look at the patch and send something up
19:34:30 <fungi> actually i misspoke, the transitions deadline is late this year, soft freeze is early next year according to the schedule
19:34:46 <clarkb> ianw: thanks
19:35:22 <fungi> mmm, in fact even the transition deadline is actually early next year (january)
19:35:37 <fungi> https://release.debian.org/
19:36:59 <clarkb> Seems like we've got a good next step and we can likely run that glibc until 2023 if necessary. It is limited to the executors which should limit impact
19:37:15 <clarkb> #topic Other Zuul job POST_FAILURES
19:37:53 <clarkb> yesterday it occurred to me that the POST_FAILURES that OSA was seeing could be due to the same issue. I asked jrosser if they had updated their job ansible version to v5 early (because they had these errors prior to us updating our default), but unfortunately they had not
19:38:07 <clarkb> that means their issue is very likely independent of our issue that we believe is now fixed.
19:38:13 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs
19:38:37 <clarkb> for that reason I think we should try and get this change landed to add the extra swift location debugging info to our base jobs. Please cross check against the base-test job and leave a review
19:39:16 <clarkb> ianw did manage to strace a live one last week and found that it appeared to be our end waiting on read()s returning from the remote swift server
19:39:57 <corvus> which provider?
19:40:01 <clarkb> the remote was ovh. I suspect we may need to gather some evidence that it is consistently ovh using the above change then take that to ovh and see if they know why it is breaking
19:41:31 <ianw> one thing we noticed was this one included an ARA report, which is a lot of little files
19:42:03 <ianw> at first i thought that might be in common with our hang issues (they generate ara), but as discussed turned out to be completely separate
19:42:28 <ianw> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-13.log.html#t2022-07-13T23:19:34
19:42:45 <ianw> is the discussion of what i found on the stuck process
19:43:52 <clarkb> I suspect that the remote service is having some sort of problem and if we can collect stronger evidence of that (through consistent failures against a single backend) and take that evidence with timestamps to them, then the provider will be able to debug
19:44:07 <clarkb> which makes that base job change important. But ya we can take it from there if/when we get that evidence
19:45:19 <clarkb> #topic Gerrit 3.5 Larger Caches
19:45:58 <clarkb> Last week our gerrit installation ran out of disk space due to the growth of its caches. Turns out that Gerrit 3.5 added some new caches that were expected to grow quite large but upstream didn't properly document this change in their release notes
19:46:34 <clarkb> Upstream has said they should update those release notes now. They also noted part of the problem is a deficiency in h2 where it spends only 200ms compacting its backing file for the database before giving up, which is insufficient to make a very large file smaller
19:47:18 <clarkb> That said they did think that the backing files would essentially stop growing after a certain size (a size big enough to fit all the data we use in a day)
19:47:45 <clarkb> To accommodate that we've increased the size of our volume and filesystem on the gerrit installation (thank you ianw and fungi for getting that done)
19:48:20 <clarkb> We are now in a period of monitoring the consumption to check the assumptions from upstream about growth eventually plateauing.
19:48:29 <clarkb> #link http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70356&rra_id=all Cacti graphs of review02 disk consumption
19:49:09 <clarkb> I did test using held nodes (which I should delete now I guess) that you can safely delete these two large cache files while gerrit is offline and it will recreate them when needed after starting again
19:49:30 <clarkb> If we find ourselves in a gerrit is crashing situation I think ^ is a good short term fix.
19:49:35 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/849886 is our fallback if the cache disk usage doesn't stabilize.
19:49:53 <ianw> ++
19:49:56 <corvus> does that graph mean that we're approximately at nominal size for those files now since it's leveled out?
19:50:14 <clarkb> And this change which is marked WIP is the long term fix if we end up having problems again. It basically says we won't cache this info on disk (it will still maintain memory caches for the data)
19:50:41 <clarkb> corvus: yes I think we're seeing that we are close to leveling out. The bigger of the two files hasn't grown much at all since the original problem was discovered. The smaller of the two has grown by about 25%
19:51:26 <clarkb> time will tell though. It is also possible that we're just quieter than usual since we had the problem and if development picks up again we'll see growth
19:51:43 <clarkb> I think it is still a bit early to draw any major conclusions, but it is looking good at least
19:52:34 <clarkb> For future upgrades we might want to start capturing any new cache files too. I bet our upgrade job could actually be modified to check which caches are present pre upgrade and diff that against the list post upgrade
19:52:39 <fungi> the cache size does strike me as being unnecessarily large though. searching through that much disk is unlikely to result in a performance benefit
19:52:46 <clarkb> I can look at updating our upgrade job to try and do that
19:53:05 <clarkb> fungi: it may actually help because computing diffs is not cheap
19:53:25 <fungi> oh, is it git diffs/interdiffs?
19:53:38 <clarkb> fungi: while doing a large disk lookup may not be quick either, diffing is an expensive operation on top of the file io to perform the diff
19:53:40 <fungi> then yeah, i could see how that could need a lot of space
19:53:58 <clarkb> fungi: yes it is caching results of various diff operations
19:54:25 <ianw> but can anything ever leave the cache, if h2 can't clean it?
19:54:45 <clarkb> ianw: yes items leave the cache aiui. It is just that the backing file isn't trimmed to match
19:54:58 <clarkb> so that very large file may only be 1% used
19:55:37 <corvus> then why do we think it'll level out at a daily max? won't people (bots actually) be continually diffing new things and adding to the cache and therefore backing file?
19:55:54 <clarkb> because it isn't append only aiui
19:56:08 <clarkb> it can reuse the allocated blocks that are free in h2 (but not on disk) to cache new things
19:56:14 <corvus> backing file space gets reclaimed but not deleted?
19:56:44 <clarkb> from h2's perspective the database may have plenty of room and not need to grow the disk. But the disk once expanded doesn't really shrink
19:56:58 <ianw> i guess it needs to, basically, defrag itself?
19:57:08 <clarkb> ianw: ya and spend more than 200ms doing so apparently
19:57:12 <corvus> sounds like it -- and defrag is too computationally expensive
19:57:28 <fungi> moreso the larger the set
19:57:56 <ianw> well i've spent plenty of time staring at defrag moving blocks around in the dos days :) never did me any harm ... kids (databases?) these days ...
19:58:00 <clarkb> it's like a python list. It will expand and expand to add more records. But once you del stuff out of it you don't free up that room
19:58:13 <clarkb> but if you need the list to be bigger again you don't need to allocate more memory you just reuse what was already there
19:58:16 <corvus> at least that's a consistent set of behaviors i can understand, even if it's .... suboptimal.
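A minimal sketch of clarkb's python list analogy (this illustrates CPython list behaviour only, not h2's actual internals): the backing allocation grows with peak usage, modest deletions leave it in place, and later growth reuses the slack instead of allocating more.

    import sys

    cache = [None] * 1_000_000          # "cache" fills up; the allocation grows to match
    peak = sys.getsizeof(cache)

    del cache[600_000:]                 # evict 40% of the entries; the allocation stays put
    after_evict = sys.getsizeof(cache)

    cache.extend([None] * 300_000)      # new entries reuse the freed slots, no new allocation
    after_refill = sys.getsizeof(cache)

    print(peak, after_evict, after_refill)  # typically all the same number on CPython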
19:58:45 <ianw> it seems though that part of the upgrade steps is probably to delete the caches?
19:59:01 <ianw> a super-defrag
19:59:03 <clarkb> ianw: upstream was talking about doing that automatically for us
19:59:09 <clarkb> so ya might make sense to just do it
19:59:38 <ianw> ++ one to keep in mind for the checklist
19:59:44 <clarkb> We are almost at time and I had a couple of small changes I wanted to call out before we run out of time
19:59:48 <clarkb> #topic Open Discussion
20:00:17 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/850268 Add a new mailing list for computing force network to lists.opendev.org
20:00:32 <clarkb> fungi: ^ I think you wanted to make sure others were comfortable with adding this list under the opendev.org domain?
20:00:51 <clarkb> (we have a small number of similar working group like lists already so no objection from me)
20:00:53 <fungi> yeah, even just a heads up that it's proposed
20:01:07 <fungi> a second review would be appreciated
20:01:17 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/850252 Fix our new project CI checks to handle gitea/git trying to auth when repo doesn't exist
20:01:51 <clarkb> When you try to git clone a repo that doesn't exist in gitea you don't get a 404 and instead get prompted to provide credentials, as the assumption is you need to authenticate to see a repo which isn't public
20:02:09 <clarkb> This change works around that by doing an http fetch of the gitea ui page instead of a git clone to determine if the repo exists already
20:02:22 <clarkb> (note this isn't really gitea specific, github will do the same thing)
20:02:51 <clarkb> And we are at time. Thank you everyone
20:03:10 <clarkb> Feel free to continue discussion on the mailing list or in #opendev. I'll end this meeting so we can return to $meal and other tasks :)
20:03:12 <clarkb> #endmeeting
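The existence check discussed under change 850252 can be illustrated with a small standalone sketch (this is not the actual project-config implementation; the repo names and the use of urllib are illustrative only): probe the repo's gitea UI page and treat a 404 as "does not exist", since a plain git clone of a missing repo prompts for credentials instead of failing cleanly.

    import urllib.error
    import urllib.request

    def repo_exists(repo, base="https://opendev.org"):
        # Fetch the gitea UI page for the repo; a 200 means it already exists.
        try:
            with urllib.request.urlopen(f"{base}/{repo}", timeout=30) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    print(repo_exists("opendev/system-config"))            # existing public repo
    print(repo_exists("opendev/this-repo-does-not-exist"))  # hypothetical missing repo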