opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 10:47 |
---|---|---|
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:02 |
fungi | stepping out briefly to get a walk in while we have a day of slightly cooler temperatures between the storms, should hopefully be back around 13:30 utc | 11:27 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:34 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:38 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 12:13 |
tkajinam | I wonder if there is a better channel to ask for reviews of https://review.opendev.org/c/zuul/zuul-jobs/+/924057 ? | 12:34 |
tkajinam | the repo is in the zuul org but the change is more about CI job definition | 12:34 |
fungi | tkajinam: the zuul community has a presence on opendev's matrix homeserver, #zuul:opendev.org | 13:51 |
fungi | but also some of us in here do review changes for the zuul/zuul-jobs repo too... i'll take a look shortly | 13:52 |
clarkb | tkajinam: I left a comment on it. But ya generally that discussion should happen in the zuul matrix room fungi listed above | 15:37 |
clarkb | fungi: fwiw I'm around at this point. Should we take the crowdstrike situation as an omen and go for an easy friday or send it and do the upgrade anyway? | 15:37 |
fungi | i think it's the universe suggesting we buckle up and enjoy the ride? ;) | 15:37 |
fungi | i don't personally see a reason to avoid a gitea upgrade today | 15:38 |
clarkb | ya logically they should be completely unrelated events :) | 15:38 |
clarkb | fungi: I'm working on some quick breakfast. Do you want to approve the change or should I after eating? | 15:42 |
fungi | i'll approve now so zuul can do its thing | 15:43 |
clarkb | frickler: not sure if you've seen the emails yet but cirros' ssl cert is going to expire in 26 days. Do you still have connections for getting that updated? | 15:45 |
frickler | clarkb: I saw that and pinged smoser about it. he claims dreamhost will update automatically, not sure about the interval for that | 15:46 |
clarkb | ack I guess we can ignore the warnings until it gets down to just a few days | 15:46 |
frickler | I'll check once more in a week or so, ack | 15:46 |
opendevreview | Merged opendev/system-config master: Update Gitea to version 1.22 https://review.opendev.org/c/opendev/system-config/+/920580 | 16:54 |
clarkb | I'm trying to watch ^ | 16:55 |
fungi | yeah, promote succeeded, deploy is running now | 16:55 |
clarkb | ok there is an issue | 16:57 |
fungi | looks like it's restarting on 90? | 16:57 |
fungi | 09 | 16:57 |
clarkb | yes it's broken around some token write issue | 16:57 |
clarkb | I think this should end up failing the deploy entirely | 16:58 |
clarkb | (which is preferable so that 10-14 stay up) | 16:58 |
fungi | yeah, the rest still seem to be up | 16:59 |
fungi | handy safety catch there | 16:59 |
fungi | the lb should also stop sending new connections to 09 automatically | 16:59 |
clarkb | the playbook is currently waiting for the web services to come up, which they won't because of this app.ini permissions issue. I really dislike when software writes back to its config files | 16:59 |
fungi | gitea-web_1 | 2024/07/19 16:58:03 ...es/setting/oauth2.go:152:loadOAuth2From() [F] save oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied | 17:00 |
fungi | i guess that's what you saw | 17:00 |
clarkb | save oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied | 17:00 |
clarkb | ya | 17:00 |
clarkb | that file is 644 root:root because we don't want gitea writing to it | 17:00 |
clarkb | it isn't clear to me yet how/why this didn't break in CI | 17:00 |
fungi | and i guess it didn't until now | 17:00 |
clarkb | once I have confirmed the playbook ends up aborting and not touching 10-14 I'm going to dig into the problem more. But I want to make sure that we don't slowly break the entire cluster first | 17:01 |
clarkb | the wait for service to start has just under 3 minutes left | 17:02 |
fungi | /var/gitea/conf/app.ini is definitely root:root 0644 on your held job node too | 17:02 |
clarkb | and we set the value it wants to update according to the log line specifically to avoid it trying to write that value back | 17:03 |
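As context for the config being discussed, here is a minimal sketch of what the relevant [oauth2] section is expected to look like, using the key names from the log line above; the secret value and the check around it are illustrative, not OpenDev's actual templates:

```python
# Hedged sketch, not OpenDev's deployment code: if JWT_SECRET is already
# present in the [oauth2] section, Gitea should have no reason to write the
# file back at startup -- which matters because app.ini is root:root 0644.
import configparser

sample = """
[oauth2]
ENABLED = false
JWT_SECRET = ReplaceWithAGeneratedSecretValue
"""

cfg = configparser.ConfigParser()
cfg.optionxform = str  # keep Gitea's upper-case key names as-is
cfg.read_string(sample)
assert cfg["oauth2"]["JWT_SECRET"], "an empty secret would trigger a write-back"
```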
fungi | no mention of app.ini in any of the gitea logs on the held node either | 17:04 |
clarkb | the value for that key in the held node matches what we set in the fake secrets vars too | 17:04 |
clarkb | I feel like this is some issue that is only appearing in prod for some reason | 17:04 |
clarkb | we may end up modifying perms and letting it update the file then compare? | 17:05 |
clarkb | this is aggravating because it's for a feature we explicitly disable too, and I've filed a bug about this but upstream doesn't really care | 17:06 |
clarkb | ok confirmed that the deployment stopped and didn't proceed to gitea10 so we're in a less than ideal but workable steady state for now | 17:07 |
clarkb | I'm going to add gitea09-14 to the emergency file to ensure we don't unexpectedly leave that state | 17:07 |
fungi | i wonder if this codepath is trying to do something only on upgrade of an existing deployment | 17:07 |
clarkb | fungi: ya could be | 17:07 |
fungi | sounds like a safe idea, thanks | 17:07 |
clarkb | after I've added the host to the emergency file I'm going to change perms to allow gitea to write back to the file and then see what happens | 17:07 |
fungi | maybe backup the app.ini in that directory for easier diffing | 17:08 |
clarkb | I'll remove it from the load balancer explicitly before I do that too | 17:08 |
clarkb | ++ | 17:08 |
fungi | watch it try to just write the exact same content back into the file | 17:08 |
clarkb | servers are all in the emergency file and gitea09 has been explicitly disabled in the load balancer | 17:09 |
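For reference, taking one backend out of rotation can be done over HAProxy's admin socket; this is only a rough sketch, and the socket path plus the backend/server names below are placeholders rather than OpenDev's actual configuration:

```python
# Sketch: send a runtime command to an HAProxy admin socket (assumed paths/names).
import socket

def haproxy_cmd(command: str, sock_path: str = "/var/haproxy/run/stats") -> str:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((command + "\n").encode())
        return s.recv(4096).decode()

# "disable server" stops new connections to that server; existing ones drain off.
print(haproxy_cmd("disable server balance_git_https/gitea09.opendev.org"))
```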
fungi | yeah, the more i think about this, the more i wonder if it's trying to "upgrade" the config file to change the entire [oauth2] section to apply the ENABLE->ENABLED change | 17:10 |
clarkb | proceeding to shutdown gitea on 09, update perms on the file, then start the service again just as soon as I figure out the perms to set | 17:10 |
clarkb | fungi: it's already ENABLED though? | 17:10 |
fungi | gitea's running as user 1000 so should be sufficient to temporarily chown it? | 17:10 |
fungi | right, i mean notices this is an upgrade and is unilaterally rewriting that section whether it needs it or not | 17:11 |
clarkb | ya, need to chown it, I was just trying to figure out what the uid was | 17:12 |
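A sketch of that one-off experiment, assuming the container runs Gitea as uid 1000 as noted above (the file path and the compose invocation are placeholders): back up app.ini, let uid 1000 write it, restart, then diff and restore ownership.

```python
# Illustrative only -- run as root on the gitea host; paths are placeholders.
import shutil
import subprocess

APP_INI = "/var/gitea/conf/app.ini"

shutil.copy2(APP_INI, "/root/app.ini.before")                 # keep a copy for diffing
subprocess.run(["chown", "1000:1000", APP_INI], check=True)   # let the container user write it
subprocess.run(["docker-compose", "restart"], cwd="/placeholder/gitea-compose-dir", check=True)
# ... watch the logs, then compare what it changed and restore ownership:
subprocess.run(["diff", "-u", "/root/app.ini.before", APP_INI])
subprocess.run(["chown", "root:root", APP_INI], check=True)
```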
clarkb | oh maybe | 17:12 |
clarkb | it's up now | 17:14 |
clarkb | fungi: it changed the secret value. I don't think it actually matters all that much honestly | 17:15 |
clarkb | maybe the current value isn't appropriate for some reason? | 17:16 |
clarkb | fungi: here's my idea for "fixing it": I'll copy the newly generated value into secret vars. Then I'll push a change for the other formatting edit it made which will cause things to redeploy. We can remove gitea10-14 from emergency and land that change and in theory it should work. If it doesn't we can write a second change to simply chown to 1000:1000 for now and then figure it out | 17:18 |
clarkb | from there? | 17:18 |
clarkb | but git clone and the web ui on gitea09 seem to be working so this is just an upgrade path hitch I hope | 17:19 |
fungi | huh, i wonder why it decided to just change the secret | 17:22 |
fungi | it does look a bit more like the LFS_JWT_SECRET now as far as included characters go | 17:23 |
fungi | i agree with your proposed fix | 17:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix the formatting of the gitea app.ini file https://review.opendev.org/c/opendev/system-config/+/924528 | 17:25 |
clarkb | fungi: ^ I've already updated secret vars but not committed it yet (will do once this is all resolved). That change covers the other diff in the delta | 17:25 |
clarkb | the old copy of the file is in root's homedir if you want to diff it yourself to confirm | 17:26 |
clarkb | we literally don't use this feature and have it explicitly disabled so the fact that it can break startup during upgrade is the worst | 17:26 |
clarkb | once 924528 is approved I'll remove the hosts from the emergency file | 17:27 |
clarkb | oh I'm going to trigger replication against gitea09 now as well, as it was down for a fair bit and I'm not sure if anything got missed there | 17:28 |
fungi | aha, thanks, i was trying to figure out where you'd stashed it | 17:34 |
fungi | lgtm | 17:34 |
clarkb | thanks | 17:35 |
clarkb | looks like we have about an hour before it goes into the gate. I'll hold off on removing nodes from the emergency file for a bit due to that delay. Any concern with me putting gitea09 back into service in the load balancer? | 17:35 |
clarkb | I did a git clone test from gitea09 and that worked and did some lightweight web ui checks as well | 17:36 |
clarkb | the replication to gitea09 is down to ~1400 tasks from ~2300. I guess I can wait for that to complete before putting it back into service in the lb | 17:36 |
clarkb | as far as what went wrong my best guess is they have changed the jwt secret somehow such that the existing value has to be rewritten in a new format maybe? | 17:40 |
clarkb | but in ci since we deploy fresh it's using the old value as if it is in the new format and it just works? | 17:40 |
clarkb | definitely a weird situation | 17:40 |
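If the format hypothesis is right, the value Gitea generates for itself should just be urlsafe base64 of 32 random bytes with the padding stripped (43 characters); that understanding of the format is an assumption here, but it matches the "included characters" observation above. A quick way to produce a value in that shape:

```python
# Sketch: generate a secret in the shape Gitea's own generator is believed to
# use (urlsafe base64 of 32 random bytes, no padding) -- assumption, not docs.
import base64
import secrets

jwt_secret = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
print(jwt_secret, len(jwt_secret))  # 43 characters
```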
fungi | seems like it should be fine to add back into the pool | 17:40 |
clarkb | cool I'll do so as soon as replication is complete | 17:41 |
clarkb | actually before I do I'll shutdown gitea09 again and bring it back up using the same process the normal upgrade path uses (to avoid needing to rerun replication) but I'll do so with the file set back to root:root just to be sure it doesn't do the same thing again | 17:43 |
clarkb | if it does then I will update my change to update perms on the file | 17:44 |
clarkb | side note: infra-root please avoid approving any changes to projects in gerrit/gitea until we're happy with this upgrade | 17:45 |
clarkb | ok did that and the restart was fine, so this isn't an "always rewrite the file" problem or a "really wants it to be 0600 my uid" problem. I think that means the proposed fix is worth pursuing then pivoting from there if anything changes | 17:52 |
clarkb | I'll put gitea09 back into the lb rotation now | 17:52 |
fungi | yep, seems like it's fine this way | 17:52 |
fungi | though does make the idea of a gitea upgrade testing job compelling | 17:53 |
fungi | because it didn't seem to care on a fresh install with that same config/secret format | 17:53 |
clarkb | agreed a good followup would be an upgrade job | 17:54 |
clarkb | we would need to stop using :latest I think which may be a good idea anyway for the docker images | 17:54 |
clarkb | I am slightly concerned that setting the secret to the new value won't be good enough to skip whatever check it is doing internally... | 17:56 |
clarkb | and if that happens we'll need to manage this secret value for each gitea node separately which would be a pita. Or maybe we can just accept what it generates then flip back to the common value | 17:57 |
clarkb | anyway we know that if gitea10 fails on the second pass 11-14 should be left alone and we can try again. | 17:57 |
fungi | yep | 17:57 |
clarkb | https://github.com/go-gitea/gitea/issues/31169 | 18:00 |
clarkb | specifically https://github.com/go-gitea/gitea/issues/31169#issuecomment-2138907332 | 18:00 |
clarkb | that would seem to confirm that using the new value should work. Separately I should probably plan to update the test value (though that one seems to work magically by chance?) | 18:01 |
fungi | i think it won't if we add an upgrade test | 18:02 |
fungi | at least that's my bet | 18:02 |
clarkb | ya could be. Though that comment says it only tries to rewrite it if it can't use the secret that is there | 18:03 |
clarkb | also I'm not happy that was marked not a bug. It is an upgrade bug at the very least | 18:03 |
clarkb | but separately, ENABLED = false then requiring a secret and exploding if it's not set properly is a bug! | 18:04 |
fungi | "broken as intended" | 18:04 |
clarkb | there is also an lfs jwt secret but it never tried to rewrite that one. Should I ask gitea to generate a secret for that or just accept that if it hasn't rewritten it, it is happy with it? | 18:20 |
clarkb | I'm leaning towards just accept that it is happy with it for now | 18:21 |
fungi | yeah, i wouldn't poke the bear | 18:21 |
clarkb | oh I think that is what the last comment on that issue means. | 18:21 |
clarkb | they are calling out that there is yet another thing they need to fix in their code so we may hit this with that var in a future upgrade but for now I suspect we leave it as is | 18:21 |
fungi | maybe it'll prove a useful test of our eventual upgrade job | 18:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET https://review.opendev.org/c/opendev/system-config/+/924536 | 18:24 |
clarkb | that's mostly so I don't forget. I don't think it is urgent and for now let's just focus on getting prod happy again | 18:24 |
fungi | yep | 18:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET https://review.opendev.org/c/opendev/system-config/+/924536 | 18:27 |
clarkb | I wanted to be a bit more safe around string formatting so added quotes | 18:27 |
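The quoting concern is the usual YAML scalar-typing hazard: certain unquoted values parse as something other than the literal string, so quoting the secret removes any ambiguity. An illustrative example (not the actual secret value):

```python
# Sketch of why unquoted YAML scalars can surprise you.
import yaml  # PyYAML

print(yaml.safe_load("secret: 0x2f1e"))    # {'secret': 12062} -- parsed as a hex integer
print(yaml.safe_load("secret: '0x2f1e'"))  # {'secret': '0x2f1e'} -- stays a string
```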
clarkb | change is in the gate now | 18:38 |
clarkb | I'll go ahead and remove the nodes from the emergency file under the assumption it will land and then trigger a deploy | 18:38 |
Clark[m] | And now working on some early lunch while I wait | 18:42 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost bringing volume usage down from 91% to 75% | 18:46 |
opendevstatus | fungi: finished logging | 18:46 |
clarkb | the fixup change is merging momentarily | 19:57 |
opendevreview | Merged opendev/system-config master: Fix the formatting of the gitea app.ini file https://review.opendev.org/c/opendev/system-config/+/924528 | 19:57 |
clarkb | it should run on gitea09 and noop then we'll see when it gets to gitea10 if this resolved the upgrade | 19:57 |
frickler | do we still have a contact for software factory CI? it is leaving a lot of nonsense reviews see e.g. https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 19:59 |
clarkb | frickler: tristanC (who isn't in here right now) | 20:00 |
clarkb | looks like gitea10 updated properly | 20:00 |
clarkb | still working to confirm that but this looks much happier | 20:00 |
fungi | agreed, looks like the gitea web process did start this time | 20:01 |
fungi | and the token seems to be updated in the app.ini file as well | 20:01 |
clarkb | yup and git clone works and web ui reports the new version. I think it's already done 11 as well | 20:01 |
clarkb | so I think we just confirm functionality at this point for each of the nodes as we usually do | 20:02 |
clarkb | and I'll commit the new secret when the playbook is done running | 20:02 |
clarkb | ok all 6 giteas have upgraded at this point | 20:06 |
clarkb | the job was successful overall too | 20:06 |
clarkb | I have committed the new value now that this appears happy | 20:08 |
clarkb | I think we are good to start the db doctoring process whenever we want now. But maybe that is a Monday task. I think the process I'd like to use is pull node out of the lb, shutdown gitea services, backup the db, run the command, restart gitea, spot check things then add back to the lb. Do that in a loop until done | 20:10 |
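As a loose sketch of that loop (the host list aside, every command string and path below is a placeholder; the point is only the ordering, so the rest of the pool stays up while one node is being doctored):

```python
# Illustrative per-node maintenance loop, not the real playbook.
import subprocess

GITEA_NODES = [f"gitea{n:02d}.opendev.org" for n in range(9, 15)]  # gitea09-14

def ssh(host: str, command: str) -> None:
    # Remote commands run through the remote shell, so redirects work.
    subprocess.run(["ssh", f"root@{host}", command], check=True)

for host in GITEA_NODES:
    ssh("lb.example.org", f"drain-backend {host}")        # placeholder: pull out of the LB
    ssh(host, "docker-compose -f /placeholder/docker-compose.yaml stop")
    ssh(host, "dump-gitea-db > /root/pre-doctor.sql")     # placeholder: back up the db
    ssh(host, "run-the-db-doctor-command")                # placeholder: the doctoring step
    ssh(host, "docker-compose -f /placeholder/docker-compose.yaml up -d")
    subprocess.run(                                       # spot check: anonymous clone works
        ["git", "ls-remote", f"https://{host}/opendev/system-config"], check=True)
    ssh("lb.example.org", f"enable-backend {host}")       # placeholder: back into the pool
```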
clarkb | frickler: tristanC is on matrix I'll message there asking about that | 20:10 |
fungi | yeah, given the upgrade change hit a bit of an unexpected roadblock, it's already getting late-ish here and getting an early start on it monday might be better just to make sure we don't end up eating into weekend time unnecessarily | 20:18 |
clarkb | ++ | 20:19 |
tristanC[m] | I'm looking into the sf ci failure, neither the project config nor the job changed recently, maybe that's a new Zuul behavior since the v11 upgrade? | 21:28 |
clarkb | maybe, though it's talking about the run playbook and v11 should've only dealt with post and cleanup run? | 21:30 |
tristanC[m] | I mean, this job doesn't have a run playbook since https://review.rdoproject.org/r/plugins/gitiles/rdo-jobs/+/64ea2f252bb6160c74d137e3cfacaee5cc3ee797%5E%21/#F6 , so perhaps the error was somehow not reported before the v11. | 21:38 |
clarkb | if there is no parent set (so it parents to base) and no run playbook why even have the job? | 21:42 |
tristanC[m] | Yes, I hope this will fix the issue: https://review.rdoproject.org/r/c/config/+/53843 | 21:42 |
opendevreview | Merged openstack/diskimage-builder master: Run functest jobs on bookworm https://review.opendev.org/c/openstack/diskimage-builder/+/924019 | 22:04 |