Friday, 2024-07-19

opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291110:47
opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291111:02
fungistepping out briefly to get a walk in while we have a day of slightly cooler temperatures between the storms, should hopefully be back around 13:30 utc11:27
opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291111:34
opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291111:38
opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291112:13
tkajinamI wonder if there is a better channel to ask for reviews of https://review.opendev.org/c/zuul/zuul-jobs/+/924057 ?12:34
tkajinamthe repo is in the zuul org but the change is more about CI job definition12:34
fungitkajinam: the zuul community has a presence on opendev's matrix homeserver, #zuul:opendev.org13:51
fungibut also some of us in here do review changes for the zuul/zuul-jobs repo too... i'll take a look shortly13:52
clarkbtkajinam: I left a comment on it. But ya generally that discussion should happen in the zuul matrix room fungi listed above15:37
clarkbfungi: fwiw I'm around at this point. Should we take the crowdstrike situation as an omen and go for an easy friday or send it and do the upgrade anyway?15:37
fungii think it's the universe suggesting we buckle up and enjoy the ride? ;)15:37
fungii don't personally see a reason to avoid a gitea upgrade today15:38
clarkbya logically they should be completely unrelated events :)15:38
clarkbfungi: I'm working on some quick breakfast. Do you want to approve the change or should I after eating?15:42
fungii'll approve now so zuul can do its thing15:43
clarkbfrickler: not sure if you've seen the emails yet but cirros' ssl cert is going to expire in 26 days. Do you still have connections for getting that updated?15:45
fricklerclarkb: I saw that and pinged smoser about it. he claims dreamhost will update automatically, not sure about the interval for that15:46
clarkback I guess we can ignore the warnings until it gets down to just a few days15:46
fricklerI'll check once more in a week or so, ack15:46
opendevreviewMerged opendev/system-config master: Update Gitea to version 1.22  https://review.opendev.org/c/opendev/system-config/+/92058016:54
clarkbI'm trying to watch ^16:55
fungiyeah, promote succeeded, deploy is running now16:55
clarkbok there is an issue16:57
fungilooks like it's restarting on 90?16:57
fungi0916:57
clarkbyes it's broken around some token write issue16:57
clarkbI think this should end up failing the deploy entirely16:58
clarkb(which is preferable so that 10-14 stay up)16:58
fungiyeah, the rest still seem to be up16:59
fungihandy safety catch there16:59
fungithe lb should also stop sending new connections to 09 automatically16:59
clarkbthe playbook is currently waiting for the web services to come up, which they won't because of this app.ini permissions issue. I really dislike when software writes back to its config files16:59
fungigitea-web_1  | 2024/07/19 16:58:03 ...es/setting/oauth2.go:152:loadOAuth2From() [F] save oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied17:00
fungii guess that's what you saw17:00
clarkbsave oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied17:00
clarkbya17:00
clarkbthat file is 644 root:root because we don't want gitea writing to it17:00
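The ownership state described here can be sketched in a sandbox (a temp dir stands in for /var/gitea/conf; the chown at the end is the temporary workaround used later in the session and would need root on the real host):

```shell
# Sandbox sketch of the app.ini state under discussion; the real path is
# /var/gitea/conf/app.ini, reproduced here in a temp dir.
tmp=$(mktemp -d)
touch "$tmp/app.ini"
chmod 0644 "$tmp/app.ini"          # root:root 0644 on the real host
stat -c '%a' "$tmp/app.ini"        # prints 644
# temporary debugging workaround (requires root on the real server):
chown 1000:1000 "$tmp/app.ini" 2>/dev/null || true
```

With the file root-owned and 0644, the gitea process (running as uid 1000) gets the `permission denied` seen in the container log when it tries to save a generated secret back into its own config.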
clarkbit isn't clear to me yet how/why this didn't break in CI17:00
fungiand i guess it didn't until now17:00
clarkbonce I have confirmed the playbook ends up aborting and not touching 10-14 I'm going to dig into the problem more. But I want to make sure that we don't slowly break the entire cluster first17:01
clarkbthe wait for service to start has just under 3 minutes left17:02
fungi/var/gitea/conf/app.ini is definitely root:root 0644 on your held job node too17:02
clarkband we set the value it wants to update according to the log line specifically to avoid it trying to write that value back17:03
fungino mention of app.ini in any of the gitea logs on the held node either17:04
clarkbthe value for that key in the held node matches what we set in the fake secrets vars too17:04
clarkbI feel like this is some issue that is only appearing in prod for some reason17:04
clarkbwe may end up modifying perms and letting it update the file then compare?17:05
clarkbthis is aggravating because its for a feature we explicitly disable too and I've filed a bug about this but upstream doesn't really care17:06
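For context, the app.ini section in question looks roughly like this; the key names come from the gitea configuration docs, and the secret value here is fake:

```ini
; illustrative fragment of /custom/conf/app.ini -- values are fake
[oauth2]
ENABLED = false
; pre-set so gitea has no reason to generate a secret and write it back
JWT_SECRET = ZmFrZS1zZWNyZXQtZm9yLWlsbHVzdHJhdGlvbi1vbmx5
```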
clarkbok confirmed that the deployment stopped and didn't proceed to gitea10 so we're in a less than ideal but workable steady state for now17:07
clarkbI'm going to add gitea09-14 to the emergency file to ensure we don't unexpectedly leave that state17:07
fungii wonder if this codepath is trying to do something only on upgrade of an existing deployment17:07
clarkbfungi: ya could be17:07
fungisounds like a safe idea, thanks17:07
clarkbafter I've added the host to the emergency file I'm going to change perms to allow gitea to write back to the file and then see what happens17:07
fungimaybe backup the app.ini in that directory for easier diffing17:08
clarkbI'll remove it from the load balancer explicitly before I do that too17:08
clarkb++17:08
fungiwatch it try to just write the exact same content back into the file17:08
clarkbservers are all in the emergency file and gitea09 has been explicitly disabled in the load balancer17:09
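Disabling one backend by hand like this is typically done over the haproxy admin socket; the socket path and the proxy/server names below are assumptions for illustration, not the real load balancer config:

```shell
# Hedged sketch: pull one gitea backend out of rotation via the haproxy
# runtime API. Socket path and proxy/server names are illustrative.
SOCK=/var/haproxy/run/stats
lb() { printf '%s\n' "$*" | socat stdio "$SOCK"; }
if [ -S "$SOCK" ]; then
  lb disable server balance_git_https/gitea09.opendev.org
  lb show servers state balance_git_https
fi
```

Re-adding the server later would be the matching `enable server` command on the same name.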
fungiyeah, the more i think about this, the more i wonder if it's trying to "upgrade" the config file to change the entire [oauth2] section to apply the ENABLE->ENABLED change17:10
clarkbproceeding to shutdown gitea on 09, update perms on the file, then start the service again just as soon as I figure out the perms to set17:10
clarkbfungi: it's already ENABLED though?17:10
fungigitea's running as user 1000 so should be sufficient to temporarily chown it?17:10
fungiright, i mean notices this is an upgrade and is unilaterally rewriting that section whether it needs it or not17:11
clarkbya need to chown it, I was just trying to figure out what the uid was17:12
clarkboh maybe17:12
clarkbits up now17:14
clarkbfungi: it changed the secret value. I don't think it actually matters all that much honestly17:15
clarkbmaybe the current value isn't appropriate for some reason?17:16
clarkbfungi: here's my idea for "fixing it": I'll copy the newly generated value into secret vars. Then I'll push a change for the other formatting edit it made, which will cause things to redeploy. We can remove gitea10-14 from emergency and land that change and in theory it should work. If it doesn't we can write a second change to simply chown to 1000:1000 for now and then figure it out17:18
clarkbfrom there?17:18
clarkbbut git clone and the web ui on gitea09 seem to be working so this is just an upgrade path hitch I hope17:19
fungihuh, i wonder why it decided to just change the secret17:22
fungiit does look a bit more like the LFS_JWT_SECRET now as far as included characters go17:23
fungii agree with your proposed fix17:23
opendevreviewClark Boylan proposed opendev/system-config master: Fix the formatting of the gitea app.ini file  https://review.opendev.org/c/opendev/system-config/+/92452817:25
clarkbfungi: ^ I've already updated secret vars but not committed it yet (will do once this is all resolved). That change covers the other diff in the delta17:25
clarkbthe old copy of the file is in root's homedir if you want to diff it yourself to confirm17:26
clarkbwe literally don't use this feature and have it explicitly disabled so the fact that it can break startup during upgrade is the worst17:26
clarkbonce 924528 is approved I'll remove the hosts from the emergency file17:27
clarkboh I'm going to trigger replication against gitea09 now as well as it was done for a fair bit and not sure if anything got missed there17:28
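Re-triggering replication to a single backend is normally done through the Gerrit replication plugin's SSH command; the account, host, and URL filter below are assumptions about how the replication remotes are named:

```shell
# Hedged sketch of kicking off Gerrit replication to one gitea backend via
# the replication plugin's SSH interface; account and URL filter are
# illustrative, not taken from the real replication.config.
replicate_to() {
  ssh -p 29418 admin@review.opendev.org replication start --url "$1"
}
# Example invocation (not run here): replicate_to gitea09.opendev.org
```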
fungiaha, thanks, i was trying to figure out where you'd stashed it17:34
fungilgtm17:34
clarkbthanks17:35
clarkblooks like we have about an hour before it goes into the gate. I'll hold off on removing nodes from the emergency file for a bit due to that delay. Any concern with me putting gitea09 back into service in the load balancer?17:35
clarkbI did a git clone test from gitea09 and that worked and did some lightweight web ui checks as well17:36
clarkbthe replication to gitea09 is down to ~1400 tasks from ~2300. I guess I can wait for that to complete before putting it back into service in the lb17:36
clarkbas far as what went wrong my best guess is they have changed the jwt secret somehow such that the existing value has to be rewritten in a new format maybe?17:40
clarkbbut in ci since we deploy fresh it's using the old value as if it is in the new format and it just works?17:40
clarkbdefinitely a weird situation17:40
fungiseems like it should be fine to add back into the pool17:40
clarkbcool I'll do so as soon as replication is complete17:41
clarkbactually before I do I'll shutdown gitea09 again and bring it back up using the same process the normal upgrade path uses (to avoid needing to rerun replication) but I'll do so with the file set back to root:root just to be sure it doesn't do the same thing again17:43
clarkbif it does then I will update my change to update perms on the file17:44
clarkbside note: infra-root please avoid approving any changes to projects in gerrit/gitea until we're happy with this upgrade17:45
clarkbok did that and the restart was fine so this isn't an "always rewrite the file" problem or an "it really wants it to be 0600 owned by my uid" problem. I think that means the proposed fix is worth pursuing then pivoting from there if anything changes17:52
clarkbI'll put gitea09 back into the lb rotation now17:52
fungiyep, seems like it's fine this way17:52
fungithough does make the idea of a gitea upgrade testing job compelling17:53
fungibecause it didn't seem to care on a fresh install with that same config/secret format17:53
clarkbagreed a good followup would be an upgrade job17:54
clarkbwe would need to stop using :latest I think which may be a good idea anyway for the docker images17:54
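Pinning the image would look something like this in a compose file; the service name and image path are assumptions about the deployment layout, and the tag shown is only illustrative:

```yaml
# illustrative docker-compose fragment -- pin a tag instead of :latest so an
# upgrade job can deliberately deploy the old version and then the new one
services:
  gitea-web:
    image: docker.io/opendevorg/gitea:1.22.0
```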
clarkbI am slightly concerned that setting the secret to the new value won't be good enough to skip whatever check it is doing internally...17:56
clarkband if that happens we'll need to manage this secret value for each gitea node separately which would be a pita. Or maybe we can just accept what it generates then flip back to the common value17:57
clarkbanyway we know that if gitea10 fails on the second pass 11-14 should be left alone and we can try again.17:57
fungiyep17:57
clarkbhttps://github.com/go-gitea/gitea/issues/3116918:00
clarkbspecifically https://github.com/go-gitea/gitea/issues/31169#issuecomment-213890733218:00
clarkbthat would seem to confirm that using the new value should work. Separately I should probably plan to update the test value (though that one seems to work magically by chance?)18:01
fungii think it won't if we add an upgrade test18:02
fungiat least that's my bet18:02
clarkbya could be. Though that comment says it only tries to rewrite it if it can't use the secret that is there18:03
clarkbalso I'm not happy that was marked not a bug. It is an upgrade bug at the very least18:03
clarkbbut separately, ENABLED = false then requiring a secret and exploding if not set properly is a bug!18:04
fungi"broken as intended"18:04
clarkbthere is also an lfs jwt secret but it never tried to rewrite that one. Should I ask gitea to generate a secret for that or just accept that if it hasn't rewritten it it is happy with it?18:20
clarkbI'm leaning towards just accept that it is happy with it for now18:21
fungiyeah, i wouldn't poke the bear18:21
clarkboh I think that is what the last comment on that issue means.18:21
clarkbthey are calling out that there is yet another thing they need to fix in their code so we may hit this with that var in a future upgrade but for now I suspect we leave it as is18:21
fungimaybe it'll prove a useful test of our eventual upgrade job18:22
opendevreviewClark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET  https://review.opendev.org/c/opendev/system-config/+/92453618:24
clarkbthat's mostly so I don't forget. I don't think it is urgent and for now let's just focus on getting prod happy again18:24
fungiyep18:25
opendevreviewClark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET  https://review.opendev.org/c/opendev/system-config/+/92453618:27
clarkbI wanted to be a bit more safe around string formatting so added quotes18:27
clarkbchange is in the gate now18:38
clarkbI'll go ahead and remove the nodes from the emergency file under the assumption it will land and then trigger a deploy18:38
Clark[m]And now working on some early lunch while I wait18:42
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost bringing volume usage down from 91% to 75%18:46
opendevstatusfungi: finished logging18:46
clarkbthe fixup change is merging momentarily19:57
opendevreviewMerged opendev/system-config master: Fix the formatting of the gitea app.ini file  https://review.opendev.org/c/opendev/system-config/+/92452819:57
clarkbit should run on gitea09 and noop then we'll see when it gets to gitea10 if this resolved the upgrade19:57
fricklerdo we still have a contact for software factory CI? it is leaving a lot of nonsense reviews, see e.g. https://review.opendev.org/c/openstack/diskimage-builder/+/92398519:59
clarkbfrickler: tristanC (who isn't in here right now)20:00
clarkblooks like gitea10 updated properly20:00
clarkbstill working to confirm that but this looks much happier20:00
fungiagreed, looks like the gitea web process did start this time20:01
fungiand the token seems to be updated in the app.ini file as well20:01
clarkbyup and git clone works and web ui reports the new version. I think its already done 11 as well20:01
clarkbso I think we just confirm functionality at this point for each of the nodes as we usually do20:02
clarkband I'll commit the new secret when the playbook is done running20:02
clarkbok all 6 giteas have upgraded at this point20:06
clarkbthe job was successful overall too20:06
clarkbI have committed the new value now that this appears happy20:08
clarkbI think we are good to start the db doctoring process whenever we want now. But maybe that is a Monday task. I think the process I'd like to use is pull node out of the lb, shutdown gitea services, backup the db, run the command, restart gitea, spot check things then add back to the lb. Do that in a loop until done20:10
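The per-node rotation described here, as a sketch; the hostnames match the ones in this log, but every command that would touch a server is a placeholder, and the doctoring command itself is left unspecified since it isn't named in the log:

```shell
# Sketch of the proposed per-node db doctoring loop; everything that would
# touch a real server is left as a comment/placeholder.
for host in gitea09 gitea10 gitea11 gitea12 gitea13 gitea14; do
  echo "would doctor $host"
  # 1. disable $host in the load balancer
  # 2. stop the gitea containers on $host
  # 3. back up the database
  # 4. run the doctoring command (not specified in this log)
  # 5. start gitea, spot check clone + web ui, re-enable $host in the lb
done
```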
clarkbfrickler: tristanC is on matrix I'll message there asking about that20:10
fungiyeah, given the upgrade change hit a bit of an unexpected roadblock, it's already getting late-ish here and getting an early start on it monday might be better just to make sure we don't end up eating into weekend time unnecessarily20:18
clarkb++20:19
tristanC[m]I'm looking into the sf ci failure, neither the project config nor the job changed recently, maybe that's a new Zuul behavior since the v11 upgrade?21:28
clarkbmaybe, though it's talking about the run playbook and v11 should've only dealt with post and cleanup run?21:30
tristanC[m]I mean, this job doesn't have a run playbook since https://review.rdoproject.org/r/plugins/gitiles/rdo-jobs/+/64ea2f252bb6160c74d137e3cfacaee5cc3ee797%5E%21/#F6 , so perhaps the error was somehow not reported before the v11.21:38
clarkbif there is no parent set (so it parents to base) and no run playbook why even have the job?21:42
tristanC[m]Yes, I hope this will fix the issue: https://review.rdoproject.org/r/c/config/+/5384321:42
opendevreviewMerged openstack/diskimage-builder master: Run functest jobs on bookworm  https://review.opendev.org/c/openstack/diskimage-builder/+/92401922:04

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!