fungi | meeting in about 10 minutes | 18:50 |
---|---|---|
fungi | meeting time! | 19:00 |
fungi | i've volunteered to chair this week since clarkb is feeling under the weather | 19:00 |
fungi | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Dec 19 19:00:52 2023 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
fungi | #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting Our Agenda | 19:01 |
fungi | #topic Announcements | 19:01 |
fungi | #info The OpenDev weekly meeting is cancelled for the next two weeks owing to lack of availability for many participants; we're skipping December 26 and January 2, resuming as usual on January 9. | 19:02 |
fungi | i'm also skipping the empty boilerplate topics | 19:03 |
fungi | #topic Upgrading Bionic servers to Focal/Jammy (clarkb) | 19:03 |
fungi | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades | 19:03 |
tonyb | mirrors are done and need to be cleaned up | 19:04 |
fungi | there's a note here in the agenda about updating cnames and cleaning up old servers for mirror replacements | 19:04 |
fungi | yeah, that | 19:04 |
fungi | are there open changes for dns still? | 19:04 |
tonyb | I started doing this yesterday but wanted additional eyes as it's my first time | 19:04 |
fungi | or do we just need to delete servers/volumes? | 19:04 |
tonyb | no open changes ATM | 19:04 |
tonyb | that one | 19:04 |
fungi | what specifically do you want an extra pair of eyes on? happy to help | 19:04 |
tonyb | fungi: the server and volume deletes; I understand the process | 19:05 |
fungi | i'm around to help after the meeting if you want, or you can pick another better time | 19:06 |
tonyb | fungi: after the meeting is good for me | 19:06 |
fungi | sounds good, thanks! | 19:07 |
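For context, the post-swap cleanup being arranged here amounts to roughly the following; the record and resource names are placeholders, not the actual mirror servers or volumes:

```shell
# Confirm the CNAME already points at the replacement mirror before
# deleting anything (hypothetical FQDN).
dig +short CNAME mirror.example-region.provider.opendev.org

# Then remove the retired server and its volumes in the hosting cloud.
openstack server show old-mirror01.example-region.provider.opendev.org
openstack server delete old-mirror01.example-region.provider.opendev.org
openstack volume list --name old-mirror01-main
openstack volume delete old-mirror01-main
```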
tonyb | I've started looking at jvb and meetpad for upgrades | 19:07 |
fungi | that's a huge help | 19:07 |
tonyb | I'm thinking we'll bring up 3 new servers and then do a cname switch. | 19:07 |
fungi | that should be fine. there's not a lot of utilization on them at this time of year anyway | 19:08 |
fungi | #topic DIB bionic support (ianw) | 19:08 |
tonyb | I was considering a more complex process for growing the jvb pool but I think that way is unneeded | 19:08 |
fungi | i think this got covered last week. was there any followup we needed to do? | 19:08 |
fungi | seems like there was some work to fix the dib unit tests? | 19:09 |
fungi | i'm guessing this no longer needed to be on the agenda, just making sure | 19:09 |
fungi | #topic Python container updates (tonyb) | 19:10 |
fungi | zuul-operator seems to still need addressing | 19:11 |
tonyb | no updates this week | 19:11 |
fungi | no worries, just checking. thanks! | 19:11 |
fungi | Gitea 1.21.1 Upgrade (clarkb) | 19:11 |
fungi | er... | 19:11 |
fungi | #topic Gitea 1.21.1 Upgrade (clarkb) | 19:11 |
tonyb | Yup I intend to update the roles to enhance container logging and then we'll have a good platform to understand the problem | 19:11 |
fungi | we were planning to do the gitea upgrade at the beginning of the week, but with lingering concerns after the haproxy incident over the weekend we decided to postpone | 19:12 |
tonyb | I think we're safe to remove the 2 LBs from emergency right? | 19:12 |
fungi | #link https://review.opendev.org/903805 Downgrade haproxy image from latest to lts | 19:13 |
fungi | that hasn't been approved yet | 19:13 |
fungi | so not until it merges at least | 19:13 |
tonyb | Ah | 19:13 |
fungi | but upgrading gitea isn't necessarily blocked on the lb being updated | 19:14 |
fungi | different system, separate software | 19:14 |
tonyb | Fair point | 19:14 |
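Review 903805 is titled as a downgrade from the `latest` tag to `lts`; a change like that presumably amounts to a one-line edit in the load balancer's compose file followed by a pull and container recreation, roughly as sketched below (paths and exact image spelling are assumptions, not taken from the actual change):

```shell
# In the haproxy service's docker-compose.yaml, swap the mutable tag:
#   image: docker.io/library/haproxy:latest
# for the slower-moving one:
#   image: docker.io/library/haproxy:lts
# then refresh the running container:
docker-compose pull
docker-compose up -d
```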
fungi | anyway, with people also vacationing and/or ill this is probably still not a good time for a gitea upgrade. if the situation changes later in the week we can decide to do it then, i think | 19:15 |
tonyb | Okay | 19:15 |
fungi | #topic Updating Zuul's database server (clarkb) | 19:15 |
tonyb | I suspect there hasn't been much progress this week. | 19:16 |
fungi | i'm not sure where we ended up on this, there was research being done, but also an interest in temporarily dumping/importing on a replacement trove instance in the meantime | 19:16 |
fungi | we can revisit next year | 19:16 |
fungi | #topic Annual Report Season (clarkb) | 19:17 |
fungi | #link OpenDev's 2023 Annual Report Draft will live here: https://etherpad.opendev.org/p/2023-opendev-annual-report | 19:17 |
fungi | we need to get that to the foundation staff coordinator for the overall annual report by the end of the week, so we're about out of time for further edits if you wanted to check it over | 19:17 |
fungi | #topic EMS discontinuing legacy/consumer hosting plans (fungi) | 19:18 |
fungi | we received a notice last week that element matrix services (ems) who hosts our opendev.org matrix homeserver for us is changing their pricing and eliminating the low-end plan we had the foundation paying for | 19:19 |
fungi | the lowest "discounted" option they're offering us comes in at 10x what we've been paying, and has to be paid a year ahead in one lump sum | 19:20 |
fungi | (we were paying monthly before) | 19:20 |
tonyb | when? | 19:20 |
tonyb | does the plan need to be purchased | 19:20 |
fungi | we have until 2024-02-07 to upgrade to a business hosting plan or move elsewhere | 19:20 |
tonyb | phew | 19:20 |
fungi | so ~1.5 months to decide on and execute a course of action | 19:21 |
tonyb | not a lot of lead time but also some lead time | 19:21 |
corvus | is the foundation interested in upgrading? | 19:22 |
fungi | i've so far not heard anyone say they're keen to work on deploying a matrix homeserver in our infrastructure, and i looked at a few (4 i think?) other hosting options but they were either as expensive or problematic in various ways, and also we'd have to find time to export/import our configuration and switch dns resulting in some downtime | 19:22 |
fungi | i've talked to the people who hold the pursestrings on the foundation staff and it sounds like we could go ahead and buy a year of business service from ems since we do have several projects utilizing it at this point | 19:23 |
fungi | which would buy us more time to decide if we want to keep doing that or work on our own solution | 19:24 |
tonyb | A *very* quick look implies that hosting our own server wouldn't be too bad. the hardest part will be the export/import and downtime | 19:24 |
frickler | another option might be dropping the homeserver and moving the rooms to matrix.org? | 19:24 |
tonyb | I suspect that StarlingX will be the "most impacted" | 19:24 |
frickler | I tried running a homeserver privately some time ago but it was very opaque and not debuggable | 19:25 |
fungi | maybe, but with as many channels as they have they're still not super active on them (i lurk in all their channels and they average a few messages a day tops) | 19:25 |
corvus | fungi: does the business plan support more than one hostname? the foundation may be able to eke out some more value if they can use the same plan to host internal comms. | 19:26 |
fungi | looking at https://element.io/pricing it's not clear to me how that's covered exactly | 19:28 |
fungi | maybe? | 19:28 |
corvus | ok. just a thought :) | 19:28 |
frickler | also, is that "discounted" option a special price or does that match the public pricing? | 19:28 |
fungi | the "discounted" rate they offered us to switch is basically the normal business cloud option on that page, but with a reduced minimum user count of 20 instead of 50 | 19:29 |
fungi | anyway, mostly wanted to put this on the agenda so folks know it's coming and have some time to think about options | 19:30 |
fungi | we can discuss again in the next meeting which will be roughly a month before the deadline | 19:31 |
corvus | if the foundation is comfortable paying for it, i'd lean that direction | 19:31 |
fungi | yeah, i'm feeling similarly. i don't think any of us has a ton of free time for another project just now | 19:31 |
corvus | (i think there are good reasons to do so, including the value of the service provided compared to our time and materials cost of running it ourselves, and also supporting open source projects) | 19:32 |
fungi | agreed, and while it's 10x what we've been paying, there wasn't a lot of surprise at a us$1.2k/yr+tax price tag | 19:32 |
fungi | helps from a budget standpoint that it's due in the beginning of the year | 19:33 |
corvus | tbh i thought the original price was way too low for an org (i'm personally sad that it isn't an option for individuals any more though) | 19:33 |
fungi | yeah, we went with it mainly because they didn't have any open source community discounts, which we'd have otherwise opted for | 19:34 |
fungi | any other comments before we move to other topics? | 19:34 |
fungi | #topic Followup on 20231216 incident (frickler) | 19:35 |
fungi | you have the floor | 19:35 |
frickler | well I just collected some things that came to my mind on sunday | 19:35 |
frickler | first question: Do we want to pin external images like haproxy and only bump them after testing? (Not sure that would've helped for the current issue though) | 19:36 |
fungi | there's a similar question from corvus in 903805 about whether we want to make the switch from "latest" to "lts" permanent | 19:36 |
fungi | testing wouldn't have caught it though i don't think | 19:37 |
corvus | yeah, unlike gerrit/gitea where there's stuff to test, i don't think we're going to catch haproxy bugs in advance | 19:37 |
fungi | but maybe someone with a higher tolerance for the bleeding edge would have spotted it before latest became lts | 19:37 |
fungi | also it's not like we use recent/advanced features of haproxy | 19:38 |
corvus | for me, i think maybe permanently switching to tracking the lts tag is the right balance of auto-upgrade with hopefully low probability of breakage | 19:38 |
fungi | so i think the answer is "it depends, but we can be conservative on haproxy and similar components" | 19:38 |
frickler | are there other images we consume that could cause similar issues? | 19:38 |
frickler | and I'm fine with haproxy:lts as a middle ground for now | 19:39 |
fungi | i don't know off the top of my head, but if someone wants to `git grep :latest$` and do some digging, i'm happy to review a change | 19:39 |
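The audit suggested here is a one-liner against a system-config checkout; something like the following, with the second form catching tags that are not at end of line:

```shell
# Find image references pinned to the mutable "latest" tag.
git grep -n ':latest$'
# Broader net, in case a tag is followed by trailing content:
git grep -n 'image:.*:latest'
```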
frickler | ok, second thing: Use docker prune less aggressively for easier rollback? | 19:39 |
frickler | We do so for some services, like https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea/tasks/main.yaml#L71-L76, might want to duplicate for all containers? Bump the hold time to 7d? | 19:39 |
corvus | (also, honestly i think the fact that haproxy is usually rock solid is why it took us so long to diagnose it. normally checking restart times would be near the top of the list of things to check) | 19:40 |
fungi | fwiw, when i switched gitea-lb's compose file from latest to lts and did a pull, nothing was downloaded, the image was still in the cache | 19:40 |
frickler | so "docker prune" doesn't clear the cache? | 19:41 |
tonyb | IIRC it did download on zuul-lb | 19:41 |
corvus | i sort of wonder what we're trying to achieve there? resiliency against upstream retroactively changing a tag? or shortened download times? or ability to guess what versions we were running by inspecting the cache? | 19:41 |
frickler | being able to have a fast revert of an image upgrade by just checking "docker images" locally | 19:42 |
fungi | i guess the concern is that we're tracking lts, upstream moves lts to a broken image, and we've pruned the image that lts used to point to so we have to redownload it when we change the tag? | 19:42 |
frickler | also I don't think we are having disk space issues that make fast pruning essential | 19:43 |
corvus | if it's resiliency against changes, i agree that 7d is probably a good idea. otherwise, 1-3 days is probably okay... if we haven't cared enough to look after 3 days, we can probably check logs or dockerhub, etc... | 19:43 |
fungi | but also the download times are generally on the order of seconds, not minutes | 19:43 |
fungi | it might buy us a little time but it's far from the most significant proportion of any related outage | 19:44 |
frickler | the 3d is only in effect for gitea, most other images are pruned immediately after upgrading | 19:44 |
corvus | (we're probably going to revert to a tag, which, since we can download in a few seconds, means the local cache isn't super important) | 19:44 |
fungi | i'm basically +/-0 on adjusting image cache times. i agree that we can afford the additional storage, but want to make sure it doesn't grow without bound | 19:45 |
frickler | ok, so leave it at that for now | 19:45 |
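If the retention window does get bumped later, the knob under discussion is docker's `until` filter on the prune call; a minimal sketch follows, with 168h matching the 7d suggestion (whether the roles invoke prune exactly this way is an assumption):

```shell
# Only prune images that have been unused for more than 7 days, so a
# recently replaced image stays available locally for a quick rollback.
docker image prune --all --force --filter "until=168h"
```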
frickler | next up: Add timestamps to zuul_reboot.log? | 19:45 |
frickler | https://opendev.org/opendev/system-config/src/branch/master/playbooks/service-bridge.yaml#L41-L55 Also this is running on Saturdays (weekday: 6), do we want to fix the comment or the dow? | 19:45 |
fungi | also having too many images cached makes it a pain to dig through when you're looking for a recent-ish one | 19:45 |
fungi | is zuul_reboot.log a file? on bridge? | 19:46 |
frickler | yes | 19:46 |
frickler | the code above shows how it is generated | 19:46 |
fungi | aha, /var/log/ansible/zuul_reboot.log | 19:47 |
corvus | adding timestamps sounds good to me; i like the current time so i'd say change the comment | 19:47 |
fungi | i have no objection to adding timestamps | 19:47 |
fungi | to, well, anything really | 19:47 |
frickler | ok, so I'll look into that | 19:47 |
fungi | more time-based context is preferable to less | 19:47 |
fungi | thanks! | 19:47 |
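One lightweight way to add the timestamps frickler is volunteering to look into is to filter the playbook output where it is redirected into the log; a sketch assuming `ts` from moreutils is available on bridge (the playbook path is hypothetical, the log path is the one mentioned above):

```shell
# Prefix each line of output with a timestamp before appending to the log.
ansible-playbook /path/to/zuul_reboot.yaml 2>&1 \
  | ts '%Y-%m-%d %H:%M:%S' >> /var/log/ansible/zuul_reboot.log
```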
frickler | final one: Do we want to document or implement a procedure for rolling back zuul upgrades? Or do we assume that issues can always be fixed in a forward going way? | 19:47 |
fungi | i think the challenge there is that "downgrading" may mean manually undoing database migrations | 19:48 |
frickler | like what would we have done if we hadn't found a fast fix for the timer issue? | 19:48 |
fungi | the details of which will differ from version to version | 19:48 |
fungi | frickler: if the solution hadn't been obvious i was going to propose a revert of the offending change and try to get that fast-tracked | 19:49 |
frickler | ok, what if no clear bad patch had been identied? | 19:49 |
frickler | identified | 19:49 |
frickler | anyway, we don't need to discuss this at length right now, more something to think about medium term | 19:50 |
fungi | for zuul we're in a special situation where several of us are maintainers, so we've generally been able to solve things like that quickly one way or another | 19:50 |
corvus | i agree with fungi, any downgrade procedure is dependent on the revisions in scope, so i don't think there's a generic process we can do | 19:50 |
fungi | it'll be an on-the-spot determination as to whether its less work to roll forward or try to unwind things | 19:51 |
frickler | ok, time's tight, so lets move to AFS? | 19:51 |
fungi | yep! | 19:51 |
fungi | #topic AFS quota issues (frickler) | 19:51 |
frickler | mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months | 19:52 |
frickler | ubuntu mirrors are also getting close, but we might have another couple of months time there | 19:52 |
frickler | mirror.centos-stream seems to have a steep increase in the last two months and might also run into quota limits soon | 19:52 |
frickler | project.zuul with the latest releases is getting close to its tight limit of 1GB (sic), I suggest to simply double that | 19:52 |
frickler | the last one is easy I think. for openeuler instead of bumping the quota someone may want to look into cleanup options first? | 19:52 |
frickler | the others are more of something to keep an eye on | 19:53 |
fungi | broken openeuler mirrors that nobody brought to our attention would indicate they're not being used, but yes it's possible we can filter out some things like we do for centos | 19:53 |
fungi | i'll try to figure out based on git blame who added the openeuler mirror and see if they can propose improvements before lunar new year | 19:53 |
frickler | well they are being used in devstack, but being out of date for some weeks does not yet break jobs | 19:53 |
corvus | feel free to action me on the zuul thing if no one else wants to do it | 19:54 |
fungi | i agree just bumping the zuul quota is fine | 19:54 |
fungi | #action fungi Reach out to someone about cleaning up OpenEuler mirroring | 19:54 |
fungi | #action corvus Increase project.zuul AFS volume quota | 19:55 |
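For reference, checking and doubling the project.zuul quota would look roughly like this; the AFS mount point is an assumption, and `fs` takes the limit in kilobytes (2097152 KB = 2 GB, double the current 1 GB):

```shell
# Inspect current usage versus quota.
fs listquota /afs/openstack.org/project/zuul
# Raise the quota on the read-write volume to 2 GB.
fs setquota /afs/openstack.org/project/zuul -max 2097152
# Push the change out to the read-only replicas.
vos release project.zuul
```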
fungi | let's move to the last topic | 19:55 |
fungi | #topic Broken wheel build issues (frickler) | 19:55 |
fungi | centos8 wheel builds are the only ones that are thoroughly broken currently? | 19:55 |
fungi | i'm pleasantly surprised if so | 19:56 |
frickler | fungi: https://review.opendev.org/c/openstack/devstack/+/900143 is the last patch on devstack that a quick search showed me for openeuler | 19:56 |
frickler | I think centos9 too? | 19:56 |
fungi | oh, >+8 | 19:56 |
fungi | >=8 | 19:56 |
fungi | got it | 19:56 |
frickler | though depends on what you mean by thoroughly (8 months vs. just 1) | 19:57 |
fungi | how much centos testing is going on these days now that tripleo has basically closed up shop? | 19:57 |
fungi | wondering how much time we're saving by not rebuilding some stuff from sdist in centos jobs | 19:57 |
frickler | not sure, I think some usage is still there for special reqs like FIPS | 19:58 |
tonyb | yup FIPS still needs it. | 19:58 |
frickler | at least people are still concerned enough about devstack global_venv being broken on centos | 19:58 |
fungi | for 9 or 8 too? | 19:58 |
frickler | both I think | 19:59 |
tonyb | I can work with ade_lee to verify what *exactly* is needed and fix or prune as appropriate | 19:59 |
fungi | we can quite easily stop running the wheel build jobs, if the resources for running those every day are a concern | 19:59 |
fungi | i guess we can discuss options in #opendev since we're past the end of the hour | 20:00 |
frickler | the question is then do we want to keep the outdated builds or purge them too? | 20:00 |
fungi | keeping them doesn't hurt anything, i don't think | 20:00 |
fungi | it's just an extra index url for pypi | 20:00 |
fungi | and either the desired wheel is there or it's not | 20:00 |
fungi | and if it's not, the job grabs the sdist from pypi and builds it | 20:00 |
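That fallback is plain pip behaviour once an extra index is configured; the consuming jobs effectively do something like the following (the mirror URL and package name are placeholders):

```shell
# If a prebuilt wheel is published on the extra index it gets used;
# otherwise pip falls back to the sdist on PyPI and builds locally.
pip install \
  --extra-index-url https://mirror.example.opendev.org/wheel/example-platform/ \
  some-package
```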
tonyb | and storage? | 20:00 |
frickler | it does mask build errors that can happen for people that do not have access to those wheels | 20:00 |
frickler | if the build was working 6 months ago but has broken since then | 20:01 |
frickler | but anyway, not urgent, we can also continue next year | 20:01 |
fungi | it's a good point, we considered that as a balance between using job resources continually building the same wheels over and over | 20:01 |
fungi | and projects forgetting to list the necessary requirements for building the wheels for things they depend on that lack them | 20:01 |
fungi | okay, let's continue in #opendev. thanks everyone! | 20:02 |
fungi | #endmeeting | 20:02 |
opendevmeet | Meeting ended Tue Dec 19 20:02:18 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.html | 20:02 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.txt | 20:02 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.log.html | 20:02 |
frickler | thx fungi | 20:02 |
tonyb | Thanks all | 20:02 |