opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 10:47 |
---|---|---|
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:02 |
fungi | stepping out briefly to get a walk in while we have a day of slightly cooler temperatures between the storms, should hopefully be back around 13:30 utc | 11:27 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:34 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 11:38 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 12:13 |
tkajinam | I wonder if there is a better channel to ask for reviews of https://review.opendev.org/c/zuul/zuul-jobs/+/924057 ? | 12:34 |
tkajinam | the repo is in the zuul org but the change is more about CI job definition | 12:34 |
fungi | tkajinam: the zuul community has a presence on opendev's matrix homeserver, #zuul:opendev.org | 13:51 |
fungi | but also some of us in here do review changes for the zuul/zuul-jobs repo too... i'll take a look shortly | 13:52 |
clarkb | tkajinam: I left a comment on it. But ya generally that discussion should happen in the zuul matrix room fungi listed above | 15:37 |
clarkb | fungi: fwiw I'm around at this point. Should we take the crowdstrike situation as an omen and go for an easy friday or send it and do the upgrade anyway? | 15:37 |
fungi | i think it's the universe suggesting we buckle up and enjoy the ride? ;) | 15:37 |
fungi | i don't personally see a reason to avoid a gitea upgrade today | 15:38 |
clarkb | ya logically they should be completely unrelated events :) | 15:38 |
clarkb | fungi: I'm working on some quick breakfast. Do you want to approve the change or should I after eating? | 15:42 |
fungi | i'll approve now so zuul can do its thing | 15:43 |
clarkb | frickler: not sure if you've seen the emails yet but cirros' ssl cert is going to expire in 26 days. Do you still have connections for getting that updated? | 15:45 |
frickler | clarkb: I saw that and pinged smoser about it. he claims dreamhost will update automatically, not sure about the interval for that | 15:46 |
clarkb | ack I guess we can ignore the warnings until it gets down to just a few days | 15:46 |
frickler | I'll check once more in a week or so, ack | 15:46 |
opendevreview | Merged opendev/system-config master: Update Gitea to version 1.22 https://review.opendev.org/c/opendev/system-config/+/920580 | 16:54 |
clarkb | I'm trying to watch ^ | 16:55 |
fungi | yeah, promote succeeded, deploy is running now | 16:55 |
clarkb | ok there is an issue | 16:57 |
fungi | looks like it's restarting on 90? | 16:57 |
fungi | 09 | 16:57 |
clarkb | yes it's broken around some token write issue | 16:57 |
clarkb | I think this should end up failing the deploy entirely | 16:58 |
clarkb | (which is preferable so that 10-14 stay up) | 16:58 |
fungi | yeah, the rest still seem to be up | 16:59 |
fungi | handy safety catch there | 16:59 |
fungi | the lb should also stop sending new connections to 09 automatically | 16:59 |
clarkb | the playbook is currently waiting for the web services to come up, which they won't because of this app.ini permissions issue. I really dislike when software writes back to its config files | 16:59 |
fungi | gitea-web_1 | 2024/07/19 16:58:03 ...es/setting/oauth2.go:152:loadOAuth2From() [F] save oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied | 17:00 |
fungi | i guess that's what you saw | 17:00 |
clarkb | save oauth2.JWT_SECRET failed: failed to save "/custom/conf/app.ini": open /custom/conf/app.ini: permission denied | 17:00 |
clarkb | ya | 17:00 |
clarkb | that file is 644 root:root because we don't want gitea writing to it | 17:00 |
clarkb | it isn't clear to me yet how/why this didn't break in CI | 17:00 |
fungi | and i guess it didn't until now | 17:00 |
clarkb | once I have confirmed the playbook ends up aborting and not touching 10-14 I'm going to dig into the problem more. But I want to make sure that we don't slowly break the entire cluster first | 17:01 |
clarkb | the wait for service to start has just under 3 minutes left | 17:02 |
fungi | /var/gitea/conf/app.ini is definitely root:root 0644 on your held job node too | 17:02 |
clarkb | and we set the value it wants to update according to the log line specifically to avoid it trying to write that value back | 17:03 |
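As context for the config being discussed, here is a minimal sketch of what the relevant [oauth2] section is expected to look like, using the key names from the log line above; the secret value and the check around it are illustrative, not OpenDev's actual templates:

```python
# Hedged sketch, not OpenDev's deployment code: if JWT_SECRET is already
# present in the [oauth2] section, Gitea should have no reason to write the
# file back at startup -- which matters because app.ini is root:root 0644.
import configparser

sample = """
[oauth2]
ENABLED = false
JWT_SECRET = ReplaceWithAGeneratedSecretValue
"""

cfg = configparser.ConfigParser()
cfg.optionxform = str  # keep Gitea's upper-case key names as-is
cfg.read_string(sample)
assert cfg["oauth2"]["JWT_SECRET"], "an empty secret would trigger a write-back"
```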
fungi | no mention of app.ini in any of the gitea logs on the held node either | 17:04 |
clarkb | the value for that key in the held node matches what we set in the fake secrets vars too | 17:04 |
clarkb | I feel like this is some issue that is only appearing in prod for some reason | 17:04 |
clarkb | we may end up modifying perms and letting it update the file then compare? | 17:05 |
clarkb | this is aggravating because it's for a feature we explicitly disable too, and I've filed a bug about this but upstream doesn't really care | 17:06 |
clarkb | ok confirmed that the deployment stopped and didn't proceed to gitea10 so we're in a less than ideal but workable steady state for now | 17:07 |
clarkb | I'm going to add gitea09-14 to the emergency file to ensure we don't unexpectedly leave that state | 17:07 |
fungi | i wonder if this codepath is trying to do something only on upgrade of an existing deployment | 17:07 |
clarkb | fungi: ya could be | 17:07 |
fungi | sounds like a safe idea, thanks | 17:07 |
clarkb | after I've added the host to the emergency file I'm going to change perms to allow gitea to write back to the file and then see what happens | 17:07 |
fungi | maybe backup the app.ini in that directory for easier diffing | 17:08 |
clarkb | I'll remove it from the load balancer explicitly before I do that too | 17:08 |
clarkb | ++ | 17:08 |
fungi | watch it try to just write the exact same content back into the file | 17:08 |
clarkb | servers are all in the emergency file and gitea09 has been explicitly disabled in the load balancer | 17:09 |
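For reference, taking one backend out of rotation can be done over HAProxy's admin socket; this is only a rough sketch, and the socket path plus the backend/server names below are placeholders rather than OpenDev's actual configuration:

```python
# Sketch: send a runtime command to an HAProxy admin socket (assumed paths/names).
import socket

def haproxy_cmd(command: str, sock_path: str = "/var/haproxy/run/stats") -> str:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((command + "\n").encode())
        return s.recv(4096).decode()

# "disable server" stops new connections to that server; existing ones drain off.
print(haproxy_cmd("disable server balance_git_https/gitea09.opendev.org"))
```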
fungi | yeah, the more i think about this, the more i wonder if it's trying to "upgrade" the config file to change the entire [oauth2] section to apply the ENABLE->ENABLED change | 17:10 |
clarkb | proceeding to shutdown gitea on 09, update perms on the file, then start the service again just as soon as I figure out the perms to set | 17:10 |
clarkb | fungi: it's already ENABLED though? | 17:10 |
fungi | gitea's running as user 1000 so should be sufficient to temporarily chown it? | 17:10 |
fungi | right, i mean notices this is an upgrade and is unilaterally rewriting that section whether it needs it or not | 17:11 |
clarkb | ya, need to chown it, I was just trying to figure out what the uid was | 17:12 |
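A sketch of that one-off experiment, assuming the container runs Gitea as uid 1000 as noted above (the file path and the compose invocation are placeholders): back up app.ini, let uid 1000 write it, restart, then diff and restore ownership.

```python
# Illustrative only -- run as root on the gitea host; paths are placeholders.
import shutil
import subprocess

APP_INI = "/var/gitea/conf/app.ini"

shutil.copy2(APP_INI, "/root/app.ini.before")                 # keep a copy for diffing
subprocess.run(["chown", "1000:1000", APP_INI], check=True)   # let the container user write it
subprocess.run(["docker-compose", "restart"], cwd="/placeholder/gitea-compose-dir", check=True)
# ... watch the logs, then compare what it changed and restore ownership:
subprocess.run(["diff", "-u", "/root/app.ini.before", APP_INI])
subprocess.run(["chown", "root:root", APP_INI], check=True)
```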
clarkb | oh maybe | 17:12 |
clarkb | it's up now | 17:14 |
clarkb | fungi: it changed the secret value. I don't think it actually matters all that much honestly | 17:15 |
clarkb | maybe the current value isn't appropriate for some reason? | 17:16 |
clarkb | fungi: here's my idea for "fixing it": I'll copy the newly generated value into secret vars. Then I'll push a change for the other formatting edit it made which will cause things to redeploy. We can remove gitea10-14 from emergency and land that change and in theory it should work. If it doesn't we can write a second change to simply chown to 1000:1000 for now and then figure it out | 17:18 |
clarkb | from there? | 17:18 |
clarkb | but git clone and the web ui on gitea09 seem to be working so this is just an upgrade path hitch I hope | 17:19 |
fungi | huh, i wonder why it decided to just change the secret | 17:22 |
fungi | it does look a bit more like the LFS_JWT_SECRET now as far as included characters go | 17:23 |
fungi | i agree with your proposed fix | 17:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix the formatting of the gitea app.ini file https://review.opendev.org/c/opendev/system-config/+/924528 | 17:25 |
clarkb | fungi: ^ I've already updated secret vars but not committed it yet (will do once this is all resolved). That change covers the other diff in the delta | 17:25 |
clarkb | the old copy of the file is in root's homedir if you want to diff it yourself to confirm | 17:26 |
clarkb | we literally don't use this feature and have it explicitly disabled so the fact that it can break startup during upgrade is the worst | 17:26 |
clarkb | once 924528 is approved I'll remove the hosts from the emergency file | 17:27 |
clarkb | oh I'm going to trigger replication against gitea09 now as well, as it was down for a fair bit and I'm not sure if anything got missed there | 17:28 |
fungi | aha, thanks, i was trying to figure out where you'd stashed it | 17:34 |
fungi | lgtm | 17:34 |
clarkb | thanks | 17:35 |
clarkb | looks like we have about an hour before it goes into the gate. I'll hold off on removing nodes from the emergency file for a bit due to that delay. Any concern with me putting gitea09 back into service in the load balancer? | 17:35 |
clarkb | I did a git clone test from gitea09 and that worked and did some lightweight web ui checks as well | 17:36 |
clarkb | the replication to gitea09 is down to ~1400 tasks from ~2300. I guess I can wait for that to complete before putting it back into service in the lb | 17:36 |
clarkb | as far as what went wrong my best guess is they have changed the jwt secret somehow such that the existing value has to be rewritten in a new format maybe? | 17:40 |
clarkb | but in ci since we deploy fresh it's using the old value as if it is in the new format and it just works? | 17:40 |
clarkb | definitely a weird situation | 17:40 |
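If the format hypothesis is right, the value Gitea generates for itself should just be urlsafe base64 of 32 random bytes with the padding stripped (43 characters); that understanding of the format is an assumption here, but it matches the "included characters" observation above. A quick way to produce a value in that shape:

```python
# Sketch: generate a secret in the shape Gitea's own generator is believed to
# use (urlsafe base64 of 32 random bytes, no padding) -- assumption, not docs.
import base64
import secrets

jwt_secret = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
print(jwt_secret, len(jwt_secret))  # 43 characters
```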
fungi | seems like it should be fine to add back into the pool | 17:40 |
clarkb | cool I'll do so as soon as replication is complete | 17:41 |
clarkb | actually before I do I'll shutdown gitea09 again and bring it back up using the same process the normal upgrade path uses (to avoid needing to rerun replication) but I'll do so with the file set back to root:root just to be sure it doesn't do the same thing again | 17:43 |
clarkb | if it does then I will update my change to update perms on the file | 17:44 |
clarkb | side note: infra-root please avoid approving any changes to projects in gerrit/gitea until we're happy with this upgrade | 17:45 |
clarkb | ok did that and the restart was fine, so this isn't an "always rewrite the file" problem or a "really wants it to be 0600 my uid" problem. I think that means the proposed fix is worth pursuing then pivoting from there if anything changes | 17:52 |
clarkb | I'll put gitea09 back into the lb rotation now | 17:52 |
fungi | yep, seems like it's fine this way | 17:52 |
fungi | though does make the idea of a gitea upgrade testing job compelling | 17:53 |
fungi | because it didn't seem to care on a fresh install with that same config/secret format | 17:53 |
clarkb | agreed a good followup would be an upgrade job | 17:54 |
clarkb | we would need to stop using :latest I think which may be a good idea anyway for the docker images | 17:54 |
clarkb | I am slightly concerned that setting the secret to the new value won't be good enough to skip whatever check it is doing internally... | 17:56 |
clarkb | and if that happens we'll need to manage this secret value for each gitea node separately which would be a pita. Or maybe we can just accept what it generates then flip back to the common value | 17:57 |
clarkb | anyway we know that if gitea10 fails on the second pass 11-14 should be left alone and we can try again. | 17:57 |
fungi | yep | 17:57 |
clarkb | https://github.com/go-gitea/gitea/issues/31169 | 18:00 |
clarkb | specifically https://github.com/go-gitea/gitea/issues/31169#issuecomment-2138907332 | 18:00 |
clarkb | that would seem to confirm that using the new value should work. Separately I should probably plan to update the test value (though that one seems to work magically by chance?) | 18:01 |
fungi | i think it won't if we add an upgrade test | 18:02 |
fungi | at least that's my bet | 18:02 |
clarkb | ya could be. Though that comment says it only tries to rewrite it if it can't use the secret that is there | 18:03 |
clarkb | also I'm not happy that was marked not a bug. It is an upgrade bug at the very least | 18:03 |
clarkb | but separately, ENABLED = false then requiring a secret and exploding if it's not set properly is a bug! | 18:04 |
fungi | "broken as intended" | 18:04 |
clarkb | there is also an lfs jwt secret but it never tried to rewrite that one. Should I ask gitea to generate a secret for that or just accept that if it hasn't rewritten it, it is happy with it? | 18:20 |
clarkb | I'm leaning towards just accept that it is happy with it for now | 18:21 |
fungi | yeah, i wouldn't poke the bear | 18:21 |
clarkb | oh I think that is what the last comment on that issue means. | 18:21 |
clarkb | they are calling out that there is yet another thing they need to fix in their code so we may hit this with that var in a future upgrade but for now I suspect we leave it as is | 18:21 |
fungi | maybe it'll prove a useful test of our eventual upgrade job | 18:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET https://review.opendev.org/c/opendev/system-config/+/924536 | 18:24 |
clarkb | that's mostly so I don't forget. I don't think it is urgent and for now let's just focus on getting prod happy again | 18:24 |
fungi | yep | 18:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update test gitea's JWT_SECRET https://review.opendev.org/c/opendev/system-config/+/924536 | 18:27 |
clarkb | I wanted to be a bit more safe around string formatting so added quotes | 18:27 |
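The quoting concern is the usual YAML scalar-typing hazard: certain unquoted values parse as something other than the literal string, so quoting the secret removes any ambiguity. An illustrative example (not the actual secret value):

```python
# Sketch of why unquoted YAML scalars can surprise you.
import yaml  # PyYAML

print(yaml.safe_load("secret: 0x2f1e"))    # {'secret': 12062} -- parsed as a hex integer
print(yaml.safe_load("secret: '0x2f1e'"))  # {'secret': '0x2f1e'} -- stays a string
```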
clarkb | change is in the gate now | 18:38 |
clarkb | I'll go ahead and remove the nodes from the emergency file under the assumption it will land and then trigger a deploy | 18:38 |
Clark[m] | And now working on some early lunch while I wait | 18:42 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost bringing volume usage down from 91% to 75% | 18:46 |
opendevstatus | fungi: finished logging | 18:46 |
clarkb | the fixup change is merging momentarily | 19:57 |
opendevreview | Merged opendev/system-config master: Fix the formatting of the gitea app.ini file https://review.opendev.org/c/opendev/system-config/+/924528 | 19:57 |
clarkb | it should run on gitea09 and noop then we'll see when it gets to gitea10 if this resolved the upgrade | 19:57 |
frickler | do we still have a contact for software factory CI? it is leaving a lot of nonsense reviews see e.g. https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 19:59 |
clarkb | frickler: tristanC (who isn't in here right now) | 20:00 |
clarkb | looks like gitea10 updated properly | 20:00 |
clarkb | still working to confirm that but this looks much happier | 20:00 |
fungi | agreed, looks like the gitea web process did start this time | 20:01 |
fungi | and the token seems to be updated in the app.ini file as well | 20:01 |
clarkb | yup and git clone works and web ui reports the new version. I think it's already done 11 as well | 20:01 |
clarkb | so I think we just confirm functionality at this point for each of the nodes as we usually do | 20:02 |
clarkb | and I'll commit the new secret when the playbook is done running | 20:02 |
clarkb | ok all 6 giteas have upgraded at this point | 20:06 |
clarkb | the job was successful overall too | 20:06 |
clarkb | I have committed the new value now that this appears happy | 20:08 |
clarkb | I think we are good to start the db doctoring process whenever we want now. But maybe that is a Monday task. I think the process I'd like to use is pull node out of the lb, shutdown gitea services, backup the db, run the command, restart gitea, spot check things then add back to the lb. Do that in a loop until done | 20:10 |
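As a loose sketch of that loop (the host list aside, every command string and path below is a placeholder; the point is only the ordering, so the rest of the pool stays up while one node is being doctored):

```python
# Illustrative per-node maintenance loop, not the real playbook.
import subprocess

GITEA_NODES = [f"gitea{n:02d}.opendev.org" for n in range(9, 15)]  # gitea09-14

def ssh(host: str, command: str) -> None:
    # Remote commands run through the remote shell, so redirects work.
    subprocess.run(["ssh", f"root@{host}", command], check=True)

for host in GITEA_NODES:
    ssh("lb.example.org", f"drain-backend {host}")        # placeholder: pull out of the LB
    ssh(host, "docker-compose -f /placeholder/docker-compose.yaml stop")
    ssh(host, "dump-gitea-db > /root/pre-doctor.sql")     # placeholder: back up the db
    ssh(host, "run-the-db-doctor-command")                # placeholder: the doctoring step
    ssh(host, "docker-compose -f /placeholder/docker-compose.yaml up -d")
    subprocess.run(                                       # spot check: anonymous clone works
        ["git", "ls-remote", f"https://{host}/opendev/system-config"], check=True)
    ssh("lb.example.org", f"enable-backend {host}")       # placeholder: back into the pool
```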
clarkb | frickler: tristanC is on matrix I'll message there asking about that | 20:10 |
fungi | yeah, given the upgrade change hit a bit of an unexpected roadblock, it's already getting late-ish here and getting an early start on it monday might be better just to make sure we don't end up eating into weekend time unnecessarily | 20:18 |
clarkb | ++ | 20:19 |
tristanC[m] | I'm looking into the sf ci failure, neither the project config nor the job changed recently, maybe that's a new Zuul behavior since the v11 upgrade? | 21:28 |
clarkb | maybe, though it's talking about the run playbook and v11 should've only dealt with post and cleanup run? | 21:30 |
tristanC[m] | I mean, this job doesn't have a run playbook since https://review.rdoproject.org/r/plugins/gitiles/rdo-jobs/+/64ea2f252bb6160c74d137e3cfacaee5cc3ee797%5E%21/#F6 , so perhaps the error was somehow not reported before the v11. | 21:38 |
clarkb | if there is no parent set (so it parents to base) and no run playbook why even have the job? | 21:42 |
tristanC[m] | Yes, I hope this will fix the issue: https://review.rdoproject.org/r/c/config/+/53843 | 21:42 |
opendevreview | Merged openstack/diskimage-builder master: Run functest jobs on bookworm https://review.opendev.org/c/openstack/diskimage-builder/+/924019 | 22:04 |