nick | message | time |
---|---|---|
clarkb | https://213.32.76.236:3081/opendev/system-config this is the held node and it confirms the concern that we have to reorganize our image assets | 00:31 |
clarkb | I won't be able to dig into that today | 00:31 |
clarkb | but yay for testing | 00:31 |
tonyb | yay | 01:55 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:37 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:39 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Do not concat stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:50 |
opendevreview | Nate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0 https://review.opendev.org/c/ttygroup/gertty/+/901166 | 14:52 |
opendevreview | Nate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0 https://review.opendev.org/c/ttygroup/gertty/+/901166 | 14:53 |
corvus | i'm going to gracefully stop ze01 and upgrade it so i can observe the new git repo behavior | 15:18 |
fungi | corvus: thanks! what new behavior is that again? the shallowness? | 15:18 |
corvus | i wouldn't use that word since it suggests the git "shallow clone" feature which is definitely not what's going on; but rather that we (a) don't checkout a workspace when cloning and (b) are more efficient setting refs | 15:20 |
fungi | oh, right i forgot about skipping the checkout | 15:22 |
fungi | and yeah, the ref replication improvements aren't really shallow in the clone sense, it's shallow copying? | 15:22 |
fungi | i'll check what terminology ended up in the release note | 15:23 |
fungi | "thin" (not shallow) per the commit message | 15:24 |
corvus | yeah i tried to find a new word. and i used it as a verb too. :) | 15:25 |
corvus | because we're not changing the result, just the process. | 15:25 |
fungi | wfm, it's great for clarity | 15:26 |
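A rough illustration of the two "thin" behaviors corvus describes, using plain git porcelain; Zuul's executor does this internally, so the repo URL, clone path, and branch below are placeholders, not the actual mechanism.

```sh
# Illustrative sketch only -- approximates the two changes with plain git:
# (a) clone without populating a working tree
git clone --no-checkout https://opendev.org/opendev/system-config /tmp/system-config
cd /tmp/system-config
# (b) update a ref in place rather than re-cloning or re-checking-out
git fetch origin refs/heads/master
git update-ref refs/heads/master FETCH_HEAD
```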
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.21.0 https://review.opendev.org/c/opendev/system-config/+/897679 | 16:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 16:22 |
clarkb | I'm going to rotate the autohold for ^ | 16:22 |
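A minimal sketch of rotating an autohold with zuul-client; the tenant, project, job, and reason below are placeholders, and the exact flag spellings should be checked against the installed zuul-client.

```sh
# Hypothetical example -- tenant/project/job names are placeholders
zuul-client autohold-list --tenant openstack
zuul-client autohold-delete --tenant openstack <request-id>   # id from the list output
zuul-client autohold --tenant openstack \
    --project opendev/system-config \
    --job system-config-run-gitea \
    --reason "holding gitea upgrade test node" \
    --count 1
```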
fungi | cool, i'm disappearing for lunch but should be back by 17:30 utc | 16:27 |
tonyb | I'm going to be out for the rest of the day. | 16:28 |
clarkb | looks like I have extra reason to be up early tomorrow. Starship launch window opens at 5am local time | 16:46 |
corvus | let's hope that's the only source of guaranteed excitement :) | 17:01 |
corvus | there is a zuul sql db schema migration ready to merge that, at least in our case, should be treated as planned full downtime. i have run the migration locally on my workstation and it takes 22 minutes. i don't have a factor to translate that to an estimate on our particular database server; it seems like using a scotty factor of 2x might be a good idea for safety. so we're looking at a 22-44 minute outage, which fits within the already | 17:13 |
corvus | scheduled maintenance window for gerrit. i propose that i take zuul down during the gerrit outage and perform the sql migration then. | 17:13 |
corvus | clarkb: fungi ^ | 17:13 |
corvus | (oh and ze01 finally exited, so i will be restarting it now with the new repo stuff) | 17:13 |
clarkb | corvus: we also haven't done a full restart since we updated the github api usage right? so unsure if that will take a number of iterations (but we don't expect it to at this point) | 17:14 |
clarkb | from a gerrit upgrade perspective the gerrit upgrade process doesn't rely on zuul until we land the change to reflect what we've already done | 17:14 |
clarkb | and any downgrade that might happen would be manual outside of zuul as well | 17:14 |
clarkb | I think for this reason I'm ok with it happening during the gerrit changes as they are well decoupled | 17:14 |
corvus | clarkb: yeah, i believe we don't expect github to be a problem at this point (but also, we shouldn't actually need to do a full-reconfigure so we don't need to trigger the "list all branches on github" code path) | 17:15 |
corvus | (but if something goes wrong, we might need to, so good to consider that for planning) | 17:15 |
clarkb | corvus: oh that is because we aren't clearing the zk data right? | 17:15 |
clarkb | in my head shutting everything down still requires a from scratch rebuild of the configs but we cache in zk now so that isn't the case | 17:16 |
corvus | yep | 17:17 |
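In other words, a cold restart now reads the cached layout back out of ZooKeeper instead of rebuilding every tenant configuration, and the expensive GitHub branch-listing path is only hit if an operator explicitly asks for a rebuild. A sketch of the relevant operator commands, assuming they are run on the scheduler host:

```sh
# Sketch, assuming current zuul-scheduler CLI behavior:
# rebuild all tenant layouts from scratch (re-queries connections)
zuul-scheduler full-reconfigure
# or only reload tenants whose configuration actually changed
zuul-scheduler smart-reconfigure
```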
fungi | clarkb: corvus: i'm cool with a full zuul restart during or coordinated around the gerrit restart too | 17:42 |
clarkb | fungi's message made it through the matrix bridge before it made it through the oftc network to my normal irc connection | 17:45 |
fungi | christine has an eye appointment (only a few minutes from home) at 13:30 utc that i need to drive her to because she might not be able to see well enough after to drive herself back, they've been really slow in the past so there's a possibility i might be stuck working on my phone from a parking lot at 15:30 (i really hope not), but all that's to say don't count on me being 100% useful until | 17:46 |
fungi | later in the maintenance | 17:46 |
corvus | ze01 seems to have produced the usual complement of results from jobs. so far so good. | 17:51 |
clarkb | logos are back on : https://104.130.127.229:3081/ | 17:57 |
clarkb | so ya I think that gitea change is now ready for review with the asterisk next to the ssh key length verification removal | 17:57 |
clarkb | happy to rotate keys first then do the upgrade | 17:57 |
fungi | yeah, i think (without having closely reviewed changes yet) that week after next we should do the key rotation, check that replication is still working, upgrade gitea, and then force a full re-replication just to be sure | 17:59 |
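A sketch of what the final "force a full re-replication" step might look like via the Gerrit replication plugin's SSH command; the admin user and host below are placeholders for whatever account actually has that capability.

```sh
# Hypothetical invocation -- user/host are placeholders; --wait blocks until
# the replication plugin has finished pushing to all configured remotes
ssh -p 29418 admin@review.opendev.org replication start --all --wait
```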
clarkb | we will need another change to do the key updates on the gerrit side. I think we can land the change to add a new key to gitea first, then add that key to gerrit with a .ssh/config there selecting the key then restart gerrit to put it into use | 18:03 |
clarkb | shouldn't be too difficult | 18:03 |
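A minimal sketch of the key-selection idea clarkb describes, assuming an OpenSSH-style config on the Gerrit host; the hostname pattern, user, and key path are assumptions, not the real deployment.

```sh
# Sketch: pin the new replication key for the gitea backends via ssh_config.
# Host pattern, user, and key filename are placeholders.
cat >> ~/.ssh/config <<'EOF'
Host gitea*.opendev.org
    User git
    IdentityFile ~/.ssh/id_ed25519_replication
    IdentitiesOnly yes
EOF
```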
clarkb | the most difficult thing is deciding what key type and size we should use | 18:03 |
fungi | you might be surprised that, as a security professional, i consider that decision mostly irrelevant. newer algorithm, larger key, whatever. malfeasors won't be attacking our keys, they'll look for easier ways in regardless | 18:05 |
fungi | we should pick whatever makes sense and is simplest for long-term maintenance | 18:06 |
fungi | people who argue over key size or which algorithm is stronger than which other algorithm are missing the bigger security picture | 18:07 |
clarkb | in that case I think I'd lean towards ed25519 since it has a single size? We'd only replace it if the entire algorithm/protocol is decided to be insecure vs doing an rsa key length extension every 5 years | 18:07 |
fungi | that sounds fine to me | 18:07 |
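A sketch of generating the agreed-on key with standard OpenSSH tooling; the filename and comment are placeholders.

```sh
# Ed25519 has a fixed key size, so there is no -b length choice to revisit later.
# Filename and comment are placeholders.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_replication -C "gerrit-gitea-replication" -N ""
```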
corvus | https://tracing.opendev.org/trace/6b5bc808fec911b1abf607f6fc7ea37b has some of the new tracing info | 18:08 |
fungi | this is quite literally one service we control talking to other services we control, and also restricted by firewall rules i think? we could just about do no encryption at all and not care | 18:08 |
fungi | the only real concern is someone launching a middle-node attack between gerrit and gitea in order to inject or subvert git replication | 18:10 |
corvus | is there an etherpad for tomorrow? | 21:29 |
tonyb | https://etherpad.opendev.org/p/gerrit-upgrade-3.8 | 21:30 |
fungi | corvus: ^ that | 21:30 |
fungi | thanks tonyb! | 21:30 |
tonyb | #nailedit | 21:31 |
corvus | cool; i'm thinking of adding the zuul hosts to the emergency file now-ish, just to avoid any surprises when the change merges. that means that changes to the tenant config file (ie, adding new projects) won't take effect. does that sound reasonable? should i wait till later? | 21:32 |
fungi | clarkb: we want to do steps 1-2 (or 1-3) an hour before, so ~14:30 utc? or should it be earlier? | 21:32 |
corvus | (the change = the schema migration change; it shouldn't auto-deploy, but it would if something went wrong overnight) | 21:32 |
fungi | corvus: sounds fine to me | 21:33 |
fungi | i don't think we're merging anything new for those in the interim | 21:33 |
fungi | i expect we could even do all of step #2 nowish | 21:34 |
clarkb | corvus: seems fine to me | 22:32 |
clarkb | fungi: I think an hour should be plenty | 22:32 |
corvus | okay i'm editing emergency now | 22:32 |
corvus | er, remind me -- can i put a group in here? or is it just hostnames? | 22:34 |
clarkb | corvus: I've always just done hostnames. I'm not sure if it will recursively expand groups into the disabled group | 22:35 |
clarkb | s/hostnames/the names that appear in the inventory/ | 22:35 |
corvus | if i'm reading this right, only one of the hosts in emergency is actually an ansible host | 22:37 |
corvus | storyboard-dev01.opendev.org | 22:37 |
corvus | and yeah, i don't think recursive groups works | 22:37 |
clarkb | corvus: yes I think that file can use some clearing out | 22:37 |
corvus | `ansible localhost -m debug -a 'var=groups["disabled"]'` is useful | 22:38 |
corvus | i left notes in that file; maybe someone can double check that and we can clean it up later | 22:40 |
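For the later cleanup, a couple of read-only checks that make it easy to see which emergency-file entries still correspond to real inventory hosts; the inventory location is assumed to come from the existing ansible.cfg, so no -i path is given here.

```sh
# Read-only checks (inventory path assumed to come from ansible.cfg):
# show the full group tree, including everything that lands in "disabled"
ansible-inventory --graph
# list which inventory hosts a given pattern actually matches
ansible 'zuul*' --list-hosts
```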
corvus | #status log added zuul hosts to ansible emergency file to prepare for 2023-11-17 maintenance | 22:41 |
opendevstatus | corvus: finished logging | 22:41 |
clarkb | ++ to cleanup but lets do that after we're done with tomorrow's fun :) | 22:44 |
corvus | yep | 22:49 |