clarkb | meeting time | 19:00 |
---|---|---|
clarkb | we'll get started in a minute | 19:00 |
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Jul 19 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000345.html Our Agenda | 19:01 |
clarkb | There were no announcements. So we'll dive right into our topic list | 19:01 |
clarkb | #topic Topics | 19:01 |
clarkb | #topic Improving OpenDev's CD throughput | 19:01 |
clarkb | Our new Zuul management seems to be working well (no issues related to the auto upgrades yet that I am aware of) | 19:02 |
clarkb | Probably worth picking the core of this topic back up again if anyone has time for it. | 19:02 |
ianw | o/ | 19:02 |
ianw | ++; perpetually on my todo stack but somehow new things keep getting pushed on it! | 19:03 |
clarkb | I know the feeling :) | 19:03 |
clarkb | I don't think this is a rush. We've been slowly chipping at it and making improvements along the way. | 19:03 |
clarkb | Just wanted to point out that the most recent of those improvements haven't appeared to create any problems. Please do let us know if you notice any issues with the auto upgrades | 19:04 |
clarkb | #topic Improving Grafana management | 19:05 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html | 19:05 |
clarkb | I've been meaning to followup on this thread but too much travel and vacation recently. | 19:05 |
clarkb | ianw: are we mostly looking for feedback from those that had concerns previously? | 19:05 |
clarkb | cc corvus ^ if you get a chance to look over that and provide your feedback to the thread that would be great. This way we can consolidate all the feedback asynchronously and don't have to rely on the meeting for it | 19:06 |
ianw | concerns, comments, whatever :) it's much more concrete as there's actual examples that work now :) | 19:06 |
clarkb | #topic Bastion Host Updates | 19:08 |
clarkb | I'm not sure there has been much movement on this yet. But I've also been distracted lately (hopefully back to normalish starting today) | 19:09 |
clarkb | ianw: are there any changes for the venv work yet or any other related updates? | 19:09 |
ianw | not yet, another todo item sorry | 19:09 |
clarkb | no problem. I bring it up because it is slightly related to the next topic | 19:10 |
clarkb | #topic Bionic Server Upgrades | 19:10 |
clarkb | We've got a number of Bionic servers out there that we need to start upgrading to either focal or jammy | 19:10 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done. | 19:10 |
clarkb | I've begun taking notes there and trying to classify and call out the difficulty and process for the various servers | 19:11 |
clarkb | if you want to add more notes about bridge there please feel free | 19:11 |
clarkb | Also if you notice I've missed any feel free to add them (this data set started with the ansible fact cache so should be fairly complete) | 19:11 |
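As an aside on how a list like this can be assembled, here is a minimal sketch that scans an Ansible fact cache for hosts still on Bionic. It assumes the jsonfile fact cache plugin (one JSON file per host); the cache path and exact fact key names are assumptions and may differ on bridge.

```python
"""List hosts still on Ubuntu Bionic by reading the Ansible fact cache."""
import json
from pathlib import Path

FACT_CACHE = Path("/var/cache/ansible/facts")  # assumed location of the jsonfile cache

for fact_file in sorted(FACT_CACHE.glob("*")):
    try:
        facts = json.loads(fact_file.read_text())
    except (OSError, json.JSONDecodeError):
        continue  # skip unreadable or partially written fact files
    # Key names assume facts are cached with the ansible_ prefix.
    if facts.get("ansible_distribution_release") == "bionic":
        print(fact_file.name, facts.get("ansible_distribution_version", "?"))
```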
clarkb | I did have a couple of things to call out. The first is that I'm not sure we have openafs packages for jammy and/or have tested jammy with its openafs packages | 19:12 |
clarkb | I think we have asserted we prefer to just run our own packages for now though | 19:12 |
clarkb | We can either update servers to focal and punt on the jammy openafs work or start figuring out jammy openafs first. I think either option is fine and probably depends on who starts where :) | 19:13 |
ianw | we do have jammy packages for that @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs | 19:13 |
clarkb | oh right for the wheel mirror stuff | 19:13 |
clarkb | any idea if we have exercised them outside of the wheel mirror context? | 19:13 |
ianw | in CI; but only R/O | 19:13 |
clarkb | I'll make note of that on the etherpad really quickly | 19:14 |
ianw | one thing that is another perpetual todo list item is to use kafs; it might be worth considering for jammy | 19:14 |
ianw | it has had a ton of work, more than i can keep up with | 19:15 |
clarkb | ianw: fwiw I tried using it on tumbleweed a while back and didn't have great luck with it. But that was many kernel releases ago and probably worth testing again | 19:15 |
fungi | worth noting, the current openafs packages don't build with very new kernels | 19:15 |
fungi | we'll have to keep an eye out if we add hwe kernels on jammy or something | 19:15 |
clarkb | noted | 19:15 |
ianw | fungi: yeah, although usually there's patches in their gerrit pretty quickly (one of the reasons we keep our ppa :) | 19:16 |
corvus | clarkb: ianw i have no further feedback on grafyaml, thanks | 19:16 |
corvus | (i think ianw incorporated the concerns and responses/mitigations i raised earlier in that message) | 19:17 |
clarkb | The other thing I noticed is that our nameservers are needing upgrades. I've got a discussion in the etherpad going on whether or not we should do replacements or do them in place. I think doing replacements would be best, but we have to update info with our registrars to do that. The opendev.org domain is managed by the foundation's registrar service and fungi and I should | 19:18 |
clarkb | be able to update their records. zuul-ci.org is managed by corvus. We can probably coordinate stuff if necessary without too much trouble, but it is an important step to not miss | 19:18 |
clarkb | corvus: thanks | 19:18 |
clarkb | One reason to do replacement ns servers is that we can have them running on focal or jammy and test that nsd and bind continue to work as expected before handing production load to them | 19:19 |
fungi | we'll want to double-check that any packet filtering or axfr/also-notify we do in bind is updated for new ip addresses | 19:19 |
ianw | i feel like history may show we did those in place last time? | 19:19 |
clarkb | ianw: I think they started life as bionic and haven't been updated since? | 19:19 |
ianw | (i'm certain we did for openafs) | 19:19 |
clarkb | But I could be wrong | 19:19 |
fungi | i'd have to double-check, but typically those are configured by ip address due to chicken and egg problems | 19:19 |
ianw | that could also be true :) | 19:19 |
clarkb | fungi: yes I think that is all managed by ansible and we should be able to have adns2 and ns3 and ns4 completely up and running before we change anything if we like | 19:20 |
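A sketch of the kind of pre-cutover check being described: compare the SOA serials served by the existing primary and the candidate replacements for each zone. It requires the dnspython package, and the server IPs are placeholders since the replacement hosts don't exist yet.

```python
"""Compare SOA serials across current and candidate nameservers."""
import dns.message
import dns.query

ZONES = ["opendev.org", "zuul-ci.org"]
SERVERS = {
    "adns1 (current)": "203.0.113.1",  # placeholder IP
    "ns3 (candidate)": "203.0.113.3",  # placeholder IP
    "ns4 (candidate)": "203.0.113.4",  # placeholder IP
}

for zone in ZONES:
    query = dns.message.make_query(zone, "SOA")
    for label, ip in SERVERS.items():
        response = dns.query.udp(query, ip, timeout=5)
        # A nameserver that is serving the zone answers with its SOA record.
        serial = response.answer[0][0].serial if response.answer else None
        print(f"{zone:12} {label:18} serial={serial}")
```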
corvus | the only registrar coord we should need to do is opendev.org itself | 19:20 |
corvus | zuul-ci.org references nsX.opendev.org by name. but nsX.opendev.org has glue records in .org | 19:21 |
fungi | adns1 and ns1 have a last modified date of 2018 on their /etc/hosts files, so i suspect an in-place upgrade previously | 19:21 |
clarkb | fungi: I don't think they existed prior to that | 19:21 |
corvus | (if we elect to make ns3/ns4, then zuul-ci.org would need to update its ns records, but that's just a regular zone update) | 19:21 |
fungi | oh, yes they're running ubuntu 18.04 so an october 2018 creation time would imply that | 19:22 |
clarkb | corvus: ah right | 19:22 |
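For reference, a hedged sketch (dnspython again) of how to inspect what the .org servers currently delegate for opendev.org, including the glue addresses a registrar update would have to change:

```python
"""Show the .org delegation and glue records for opendev.org."""
import dns.message
import dns.query
import dns.resolver

# Find one authoritative server for the parent zone (.org).
tld_ns = str(dns.resolver.resolve("org.", "NS")[0])
tld_ip = str(dns.resolver.resolve(tld_ns, "A")[0])

# Ask it about opendev.org; the referral carries the delegation NS records
# (authority section) and the glue A/AAAA records (additional section).
response = dns.query.udp(dns.message.make_query("opendev.org", "NS"), tld_ip, timeout=5)
print("Delegation:")
for rrset in response.authority:
    print(" ", rrset)
print("Glue:")
for rrset in response.additional:
    print(" ", rrset)
```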
fungi | anyway, it won't be hard, but i do think that replacing the nameservers is likely to need some more thorough planning that we'll be able to do in today's meeting | 19:23 |
clarkb | fungi: yup I mostly wanted to call it out and have people leave feedback on the etherpad | 19:23 |
fungi | wfm | 19:23 |
clarkb | I definitely think we'll be doing jammy upgrades on other services before the nameservers as well | 19:23 |
clarkb | Also if anyone would really like to upgrade any specific service on that etherpad doc feel free to indicate so there | 19:24 |
clarkb | I'm hoping I can dig into jammy node replacements for something in the next week or so | 19:24 |
clarkb | maybe zp01 since it should be straightforward to swap out | 19:24 |
clarkb | #topic Zuul and Ansible v5 and Glibc deadlock | 19:25 |
clarkb | Yesterday all of our executors were updated to run the new zuul executor image with bookworm's glibc | 19:26 |
clarkb | Since then https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0 seems to show no more post failures | 19:26 |
ianw | sorry about the first attempt, i didn't realise the restart playbook didn't pull images. i guess that is by design? | 19:26 |
clarkb | Considering we had post failures with half the executors on old glibc I strongly suspect this is corrected. | 19:27 |
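For anyone who wants to double check from a script rather than the dashboard, a small sketch against the Zuul builds API; the query parameters mirror the dashboard URL linked above and are assumed to be accepted by the API.

```python
"""List recent POST_FAILURE results in the opendev-prod-hourly pipeline."""
import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {"pipeline": "opendev-prod-hourly", "result": "POST_FAILURE", "limit": 50}
for build in requests.get(API, params=params, timeout=30).json():
    print(build["end_time"], build["job_name"], build["result"])
```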
clarkb | ianw: ya there is a separate pull playbook because sometimes you just need to restart without updating the image | 19:27 |
ianw | i did think of using ansible's prompt thing to write an interactive playbook, which would say "do you want to restart <scheduler|merger|executor> <y/n>" and then "do you want to pull new images <y|n>" | 19:28 |
ianw | now i've made the mistake, i guess i won't forget next time :) | 19:28 |
clarkb | It also looks like upstream ansible has closed their issue and says the fix is to upgrade glibc | 19:29 |
ianw | clarkb: yes, almost certainly; i managed to eventually catch a hung process and it traced back into the exact code discussed in those issues | 19:29 |
clarkb | good news is that is what we have done. Bad news is I guess we're running backported glibc on our executors until bookworm releases | 19:29 |
corvus | so... 10...20 years? | 19:29 |
ianw | (for posterity, discussions @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-17.log.html#t2022-07-17T23:30:09) | 19:29 |
clarkb | corvus: ha, zigo made it sound like this winter. So maybe next spring :) | 19:30 |
corvus | i wonder if a glibc fix could make it into stable? | 19:30 |
clarkb | (NA seasons) | 19:30 |
clarkb | corvus: not a bad idea since it affects real software like ansible. Though like rhel I wonder if debian won't prioritize it due to their packaged version of ansible being fine (however other software may struggle with this bug and we just don't know about it yet) | 19:31 |
ianw | it seemed a quite small fix and also seems serious enough, it's very annoying to have things deadlock like that | 19:31 |
clarkb | is the process for that similar to ubuntu? Open a bug with a link to the issue and ask for a stable backport with justification? | 19:32 |
fungi | bookworm release is expected in 2023 yeah | 19:32 |
fungi | the freeze starts late this year | 19:33 |
fungi | hard freeze is next march | 19:33 |
ianw | clarkb: that's where i'd start; i can take a closer look at the patch and send something up | 19:34 |
fungi | actually i misspoke, the transitions deadline is late this year, soft freeze is early next year according to the schedule | 19:34 |
clarkb | ianw: thanks | 19:34 |
fungi | mmm, in fact even the transition deadline is actually early next year (january) | 19:35 |
fungi | https://release.debian.org/ | 19:35 |
clarkb | Seems like we've got a good next step and we can likely run that glibc until 2023 if necessary. It is limited to the executors which should limit impact | 19:36 |
clarkb | #topic Other Zuul job POST_FAILURES | 19:37 |
clarkb | yesterday it occurred to me that the POST_FAILURES that OSA was seeing could be due to the same issue. I asked jrosser if they had updated their job ansible version to v5 early (because they had these errors prior to us updating our default), but unfortunately they had not | 19:37 |
clarkb | that means their issue is very likely independent of our issue that we believe is now fixed. | 19:38 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs | 19:38 |
clarkb | for that reason I think we should try and get this change landed to add the extra swift location debugging info to our base jobs. Please cross check against the base-test job and leave a review | 19:38 |
clarkb | ianw did manage to strace a live one last week and found that it appeared to be our end waiting on read()s returning from the remote swift server | 19:39 |
corvus | which provider? | 19:39 |
clarkb | the remote was ovh. I suspect we may need to gather some evidence that it is consistently ovh using the above change then take that to ovh and see if they know why it is breaking | 19:40 |
ianw | one thing we noticed was this one included an ARA report, which is a lot of little files | 19:41 |
ianw | at first i thought that might be in-common with our hang issues (they generate ara), but as discussed turned out to be completely separate | 19:42 |
ianw | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-13.log.html#t2022-07-13T23:19:34 | 19:42 |
ianw | is the discussion of what i found on the stuck process | 19:42 |
clarkb | I suspect that the remote service is having some sort of problem. If we can collect stronger evidence of that (through consistent failures against a single backend) and take it with timestamps to the provider, they should be able to debug | 19:43 |
clarkb | which makes that base job change important. But ya we can take it from there if/when we get that evidence | 19:44 |
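A sketch of the sort of evidence gathering being described: tally the log storage endpoints of recent POST_FAILURE builds from the Zuul API. Builds whose upload failed outright may have no log_url at all (which is exactly what the base-jobs debugging change is meant to cover), so those are grouped separately.

```python
"""Tally which log storage hosts recent POST_FAILURE builds uploaded to."""
from collections import Counter
from urllib.parse import urlparse

import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {"result": "POST_FAILURE", "limit": 200}
hosts = Counter()
for build in requests.get(API, params=params, timeout=30).json():
    log_url = build.get("log_url")
    hosts[urlparse(log_url).netloc if log_url else "(no log_url recorded)"] += 1

for host, count in hosts.most_common():
    print(f"{count:4d}  {host}")
```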
clarkb | #topic Gerrit 3.5 Larger Caches | 19:45 |
clarkb | Last week our gerrit installation ran out of disk space due to the growth of its caches. Turns out that Gerrit 3.5 added some new caches that were expected to grow quite large but upstream didn't properly document this change in their release notes | 19:45 |
clarkb | Upstream has said they should update those release notes now. They also noted part of the problem is a deficiency in h2 where it spends only 200ms compacting its backing file for the database before giving up which is insufficient to make a very large file smaller | 19:46 |
clarkb | That said they did think that the backing files would essentially stop growing after a certain size (a size big enough to fit all the data we use in a day) | 19:47 |
clarkb | To accommodate that we've increased the size of our volume and filesystem on the gerrit installation (thank you ianw and fungi for getting that done) | 19:47 |
clarkb | We are now in a period of monitoring the consumption to check the assumptions from upstream about growth eventually plateauing. | 19:48 |
clarkb | #link http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70356&rra_id=all Cacti graphs of review02 disk consumption | 19:48 |
clarkb | I did test using held nodes (which I should delete now I guess) that you can safely delete these two large cache files while gerrit is offline and it will recreate them when needed after starting again | 19:49 |
clarkb | If we find ourselves in a gerrit is crashing situation I think ^ is a good short term fix. | 19:49 |
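To make that short term fix less error prone, a minimal sketch that just reports the largest H2 cache files under the Gerrit site before anyone deletes anything. The site path is a placeholder, and the assumption that the biggest files are the new 3.5 diff caches should be verified against the actual file names.

```python
"""Report the largest H2 cache files under the Gerrit site directory."""
from pathlib import Path

CACHE_DIR = Path("/home/gerrit2/review_site/cache")  # placeholder path

files = sorted(CACHE_DIR.glob("*.h2.db"), key=lambda p: p.stat().st_size, reverse=True)
for path in files:
    print(f"{path.stat().st_size / 1024**3:8.2f} GiB  {path.name}")
# Only remove cache files while Gerrit is stopped; it recreates them on startup.
```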
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/849886 is our fallback if the cache disk usage doesn't stabilize. | 19:49 |
ianw | ++ | 19:49 |
corvus | does that graph mean that we're approximately at nominal size for those files now since it's leveled out? | 19:49 |
clarkb | And this change which is marked WIP is the long term fix if we end up having problems again. It basically says we won't cache this info on disk (it will still maintain memory caches for the data) | 19:50 |
clarkb | corvus: yes I think we're seeing that we are close to leveling out. The bigger of the two files hasn't grown much at all since the original problem was discovered. The smaller of the two has grown by about 25% | 19:50 |
clarkb | time will tell though. It is also possible that we're just quieter than usual since we had the problem and if development picks up again we'll see growth | 19:51 |
clarkb | I think it is still a bit early to draw any major conclusions, but it is looking good at least | 19:51 |
clarkb | For future upgrades we might want to start capturing any new cache files too. I bet our upgrade job could actually be modified to check which caches are present pre upgrade and diff that against the list post upgrade | 19:52 |
fungi | the cache size does strike me as being unnecessarily large though. searching through that much disk is unlikely to result in a performance benefit | 19:52 |
clarkb | I can look at updating our upgrade job to try and do that | 19:52 |
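A rough sketch of that pre/post comparison, assuming the upgrade job keeps a copy of the cache directory from before the upgrade and passes both directories as arguments:

```python
"""Diff the set of H2 cache files present before and after a Gerrit upgrade."""
import sys
from pathlib import Path


def cache_snapshot(cache_dir: Path) -> set:
    """Return the set of H2 backing file names in a cache directory."""
    return {path.name for path in cache_dir.glob("*.h2.db")}


before = cache_snapshot(Path(sys.argv[1]))  # copy captured pre-upgrade
after = cache_snapshot(Path(sys.argv[2]))   # cache directory post-upgrade
print("caches added by upgrade:  ", sorted(after - before))
print("caches removed by upgrade:", sorted(before - after))
```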
clarkb | fungi: it may actually help because computing diffs is not cheap | 19:53 |
fungi | oh, is it git diffs/interdiffs? | 19:53 |
clarkb | fungi: while doing a large disk lookup may not be quick either, diffing is an expensive operation on top of the file io to perform the diff | 19:53 |
fungi | then yeah, i could see how that could need a lot of space | 19:53 |
clarkb | fungi: yes it is caching results of various diff operations | 19:53 |
ianw | but can anything ever leave the cache, if h2 can't clean it? | 19:54 |
clarkb | ianw: yes items leave the cache aiui. It is just that the backing file isn't trimmed to match | 19:54 |
clarkb | so that very large file may only be 1% used | 19:54 |
corvus | then why do we think it'll level out at a daily max? won't people (bots actually) be continually diffing new things and adding to the cache and therefore backing file? | 19:55 |
clarkb | because it isn't append only aiui | 19:55 |
clarkb | it can reuse the allocated blocks that are free in h2 (but not on disk) to cache new things | 19:56 |
corvus | backing file space gets reclaimed but not deleted? | 19:56 |
clarkb | from h2's perspective the database may have plenty of room and not need to grow the disk. But the disk once expanded doesn't really shrink | 19:56 |
ianw | i guess it needs to, basically, defrag itself? | 19:56 |
clarkb | ianw: ya and spend more than 200ms doing so apparently | 19:57 |
corvus | sound like it -- and defrag is too computationally expensive | 19:57 |
fungi | moreso the larger the set | 19:57 |
ianw | well i've spent plenty of time staring at defrag moving blocks around in the dos days :) never did me any harm ... kids (databases?) these days ... | 19:57 |
clarkb | its like a python list. It will expand and expand to add more records. But once you del stuff out of it you don't free up that room | 19:58 |
clarkb | but if you need the list to be bigger again you don't need to allocate more memory you just reuse what was already there | 19:58 |
corvus | at least that's a consistent set of behaviors i can understand, even if it's .... suboptimal. | 19:58 |
ianw | it seems though that part of upgrade steps is probably to delete the caches though? | 19:58 |
ianw | a super-defrag | 19:59 |
clarkb | ianw: upstream was talking about doing that automatically for us | 19:59 |
clarkb | so ya might make sense to just do it | 19:59 |
ianw | ++ one to keep in mind for the checklist | 19:59 |
clarkb | We are almost at time and I had a couple of small changes I wanted to call out before we run out of time | 19:59 |
clarkb | #topic Open Discussion | 19:59 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/850268 Add a new mailing list for computing force network to lists.opendev.org | 20:00 |
clarkb | fungi: ^ I think you wanted to make sure others were comfortable with adding this list under the opendev.org domain? | 20:00 |
clarkb | (we have a small number of similar working group like lists already so no objection from me) | 20:00 |
fungi | yeah, even just a heads up that it's proposed | 20:00 |
fungi | a second review would be appreciated | 20:01 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/850252 Fix our new project CI checks to handle gitea/git trying to auth when repo doesn't exist | 20:01 |
clarkb | When you try to git clone a repo that doesn't exist in gitea you don't get a 404 and instead get prompted to provide credentials as the assumption is you need to authenticate to see a repo which isn't public | 20:01 |
clarkb | This change works around that by doing an http fetch of the gitea ui page instead of a git clone to determine if the repo exists already | 20:02 |
clarkb | (note this isn't really gitea specific, github will do the same thing) | 20:02 |
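A minimal sketch of that workaround; the project names below are just examples, and the real check lives in the project-config change linked above.

```python
"""Check whether a repo exists by probing the gitea web UI instead of cloning."""
import requests


def repo_exists(base_url: str, project: str) -> bool:
    """True if gitea serves a UI page for the project, False on a 404."""
    response = requests.get(f"{base_url}/{project}", timeout=30)
    if response.status_code == 404:
        return False
    response.raise_for_status()  # surface anything unexpected (5xx, etc.)
    return True


print(repo_exists("https://opendev.org", "opendev/system-config"))   # expect True
print(repo_exists("https://opendev.org", "opendev/does-not-exist"))  # expect False
```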
clarkb | And we are at time. Thank you everyone | 20:02 |
clarkb | Feel free to continue discussion on the mailing list or in #opendev. I'll end this meeting so we can return to $meal and other tasks :) | 20:03 |
clarkb | #endmeeting | 20:03 |
opendevmeet | Meeting ended Tue Jul 19 20:03:12 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:03 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.html | 20:03 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.txt | 20:03 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.log.html | 20:03 |
*** Guest5201 is now known as diablo_rojo_phone | 20:57 |