nick | message | time |
---|---|---|
clarkb | https://213.32.76.236:3081/opendev/system-config this is the held node and it confirms the concern that we have to reorganize our image assets | 00:31 |
clarkb | I won't be able to dig into that today | 00:31 |
clarkb | but yay for testing | 00:31 |
tonyb | yay | 01:55 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:37 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:39 |
opendevreview | Roman Kuznecov proposed zuul/zuul-jobs master: tox: Do not concat stdout and stderr in getting siblings https://review.opendev.org/c/zuul/zuul-jobs/+/901072 | 14:50 |
opendevreview | Nate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0 https://review.opendev.org/c/ttygroup/gertty/+/901166 | 14:52 |
opendevreview | Nate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0 https://review.opendev.org/c/ttygroup/gertty/+/901166 | 14:53 |
corvus | i'm going to gracefully stop ze01 and upgrade it so i can observe the new git repo behavior | 15:18 |
fungi | corvus: thanks! what new behavior is that again? the shallowness? | 15:18 |
corvus | i wouldn't use that word since it suggests the git "shallow clone" feature which is definitely not what's going on; but rather that we (a) don't checkout a workspace when cloning and (b) are more efficient setting refs | 15:20 |
fungi | oh, right i forgot about skipping the checkout | 15:22 |
fungi | and yeah, the ref replication improvements aren't really shallow in the clone sense, it's shallow copying? | 15:22 |
fungi | i'll check what terminology ended up in the release note | 15:23 |
fungi | "thin" (not shallow) per the commit message | 15:24 |
corvus | yeah i tried to find a new word. and i used it as a verb too. :) | 15:25 |
corvus | because we're not changing the result, just the process. | 15:25 |
fungi | wfm, it's great for clarity | 15:26 |
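A rough illustration of the two "thin" behaviors corvus describes, using plain git porcelain; Zuul's executor does this internally, so the repo URL, clone path, and branch below are placeholders, not the actual mechanism.

```sh
# Illustrative sketch only -- approximates the two changes with plain git:
# (a) clone without populating a working tree
git clone --no-checkout https://opendev.org/opendev/system-config /tmp/system-config
cd /tmp/system-config
# (b) update a ref in place rather than re-cloning or re-checking-out
git fetch origin refs/heads/master
git update-ref refs/heads/master FETCH_HEAD
```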
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.21.0 https://review.opendev.org/c/opendev/system-config/+/897679 | 16:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 16:22 |
clarkb | I'm going to rotate the autohold for ^ | 16:22 |
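A minimal sketch of rotating an autohold with zuul-client; the tenant, project, job, and reason below are placeholders, and the exact flag spellings should be checked against the installed zuul-client.

```sh
# Hypothetical example -- tenant/project/job names are placeholders
zuul-client autohold-list --tenant openstack
zuul-client autohold-delete --tenant openstack <request-id>   # id from the list output
zuul-client autohold --tenant openstack \
    --project opendev/system-config \
    --job system-config-run-gitea \
    --reason "holding gitea upgrade test node" \
    --count 1
```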
fungi | cool, i'm disappearing for lunch but should be back by 17:30 utc | 16:27 |
tonyb | I'm going to be out for the rest of the day. | 16:28 |
clarkb | looks like I have extra reason to be up early tomorrow. Starship launch window opens at 5am local time | 16:46 |
corvus | let's hope that's the only source of guaranteed excitement :) | 17:01 |
corvus | there is a zuul sql db schema migration ready to merge that, at least in our case, should be treated as planned full downtime. i have run the migration locally on my workstation and it takes 22 minutes. i don't have a factor to translate that to an estimate on our particular database server; it seems like using a scotty factor of 2x might be a good idea for safety. so we're looking at a 22-44 minute outage, which fits within the already | 17:13 |
corvus | scheduled maintenance window for gerrit. i propose that i take zuul down during the gerrit outage and perform the sql migration then. | 17:13 |
corvus | clarkb: fungi ^ | 17:13 |
corvus | (oh and ze01 finally exited, so i will be restarting it now with the new repo stuff) | 17:13 |
clarkb | corvus: we also haven't done a full restart since we updated the github api usage right? so unsure if that will take a number of iterations (but we don't expect it to at this point) | 17:14 |
clarkb | from a gerrit upgrade perspective the gerrit upgrade process doesn't rely on zuul until we land the change to reflect what we've already done | 17:14 |
clarkb | and any downgrade that might happen would be manual outside of zuul as well | 17:14 |
clarkb | I think for this reason I'm ok with it happening during the gerrit changes as they are well decoupled | 17:14 |
corvus | clarkb: yeah, i believe we don't expect github to be a problem at this point (but also, we shouldn't actually need to do a full-reconfigure so we don't need to trigger the "list all branches on github" code path) | 17:15 |
corvus | (but if something goes wrong, we might need to, so good to consider that for planning) | 17:15 |
clarkb | corvus: oh that is because we aren't clearing the zk data right? | 17:15 |
clarkb | in my head shutting everything down still requires a from scratch rebuild of the configs but we cache in zk now so that isn't the case | 17:16 |
corvus | yep | 17:17 |
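In other words, a cold restart now reads the cached layout back out of ZooKeeper instead of rebuilding every tenant configuration, and the expensive GitHub branch-listing path is only hit if an operator explicitly asks for a rebuild. A sketch of the relevant operator commands, assuming they are run on the scheduler host:

```sh
# Sketch, assuming current zuul-scheduler CLI behavior:
# rebuild all tenant layouts from scratch (re-queries connections)
zuul-scheduler full-reconfigure
# or only reload tenants whose configuration actually changed
zuul-scheduler smart-reconfigure
```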
fungi | clarkb: corvus: i'm cool with a full zuul restart during or coordinated around the gerrit restart too | 17:42 |
clarkb | fungi's message made it through the matrix bridge before it made it through the oftc network to my normal irc connection | 17:45 |
fungi | christine has an eye appointment (only a few minutes from home) at 13:30 utc that i need to drive her to because she might not be able to see well enough after to drive herself back, they've been really slow in the past so there's a possibility i might be stuck working on my phone from a parking lot at 15:30 (i really hope not), but all that's to say don't count on me being 100% useful until | 17:46 |
fungi | later in the maintenance | 17:46 |
corvus | ze01 seems to have produced the usual complement of results from jobs. so far so good. | 17:51 |
clarkb | logos are back on : https://104.130.127.229:3081/ | 17:57 |
clarkb | so ya I think that gitea change is now ready for review with the asterisk next to the ssh key length verification removal | 17:57 |
clarkb | happy to rotate keys first then do the upgrade | 17:57 |
fungi | yeah, i think (without having closely reviewed changes yet) that week after next we should do the key rotation, check that replication is still working, upgrade gitea, and then force a full re-replication just to be sure | 17:59 |
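A sketch of what the final "force a full re-replication" step might look like via the Gerrit replication plugin's SSH command; the admin user and host below are placeholders for whatever account actually has that capability.

```sh
# Hypothetical invocation -- user/host are placeholders; --wait blocks until
# the replication plugin has finished pushing to all configured remotes
ssh -p 29418 admin@review.opendev.org replication start --all --wait
```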
clarkb | we will need another change to do the key updates on the gerrit side. I think we can land the change to add a new key to gitea first, then add that key to gerrit with a .ssh/config there selecting the key then restart gerrit to put it into use | 18:03 |
clarkb | shouldn't be too difficult | 18:03 |
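A minimal sketch of the key-selection idea clarkb describes, assuming an OpenSSH-style config on the Gerrit host; the hostname pattern, user, and key path are assumptions, not the real deployment.

```sh
# Sketch: pin the new replication key for the gitea backends via ssh_config.
# Host pattern, user, and key filename are placeholders.
cat >> ~/.ssh/config <<'EOF'
Host gitea*.opendev.org
    User git
    IdentityFile ~/.ssh/id_ed25519_replication
    IdentitiesOnly yes
EOF
```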
clarkb | the most difficult thing is deciding what key type and size we should use | 18:03 |
fungi | you might be surprised that, as a security professional, i consider that decision mostly irrelevant. newer algorithm, larger key, whatever. malfeasors won't be attacking our keys, they'll look for easier ways in regardless | 18:05 |
fungi | we should pick whatever makes sense and is simplest for long-term maintenance | 18:06 |
fungi | people who argue over key size or which algorithm is stronger than which other algorithm are missing the bigger security picture | 18:07 |
clarkb | in that case I think I'd lean towards ed25519 since it has a single size? We'd only replace it if the entire algorithm/protocol is decided to be insecure vs doing an rsa key length extension every 5 years | 18:07 |
fungi | that sounds fine to me | 18:07 |
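A sketch of generating the agreed-on key with standard OpenSSH tooling; the filename and comment are placeholders.

```sh
# Ed25519 has a fixed key size, so there is no -b length choice to revisit later.
# Filename and comment are placeholders.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_replication -C "gerrit-gitea-replication" -N ""
```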
corvus | https://tracing.opendev.org/trace/6b5bc808fec911b1abf607f6fc7ea37b has some of the new tracing info | 18:08 |
fungi | this is quite literally one service we control talking to other services we control, and also restricted by firewall rules i think? we could just about do no encryption at all and not care | 18:08 |
fungi | the only real concern is someone launching a middle-node attack between gerrit and gitea in order to inject or subvert git replication | 18:10 |
corvus | is there an etherpad for tomorrow? | 21:29 |
tonyb | https://etherpad.opendev.org/p/gerrit-upgrade-3.8 | 21:30 |
fungi | corvus: ^ that | 21:30 |
fungi | thanks tonyb! | 21:30 |
tonyb | #nailedit | 21:31 |
corvus | cool; i'm thinking of adding the zuul hosts to the emergency file now-ish, just to avoid any surprises when the change merges. that means that changes to the tenant config file (ie, adding new projects) won't take effect. does that sound reasonable? should i wait till later? | 21:32 |
fungi | clarkb: we want to do steps 1-2 (or 1-3) an hour before, so ~14:30 utc? or should it be earlier? | 21:32 |
corvus | (the change = the schema migration change; it shouldn't auto-deploy, but it would if something went wrong overnight) | 21:32 |
fungi | corvus: sounds fine to me | 21:33 |
fungi | i don't think we're merging anything new for those in the interim | 21:33 |
fungi | i expect we could even do all of step #2 nowish | 21:34 |
clarkb | corvus: seems fine to me | 22:32 |
clarkb | fungi: I think an hour should be plenty | 22:32 |
corvus | okay i'm editing emergency now | 22:32 |
corvus | er, remind me -- can i put a group in here? or is it just hostnames? | 22:34 |
clarkb | corvus: I've always just done hostnames. I'm not sure if it will recursively expand groups into the disabled group | 22:35 |
clarkb | s/hostnames/the names that appear in the inventory/ | 22:35 |
corvus | if i'm reading this right, only one of the hosts in emergency is actually an ansible host | 22:37 |
corvus | storyboard-dev01.opendev.org | 22:37 |
corvus | and yeah, i don't think recursive groups works | 22:37 |
clarkb | corvus: yes I think that file can use some clearing out | 22:37 |
corvus | `ansible localhost -m debug -a 'var=groups["disabled"]'` is useful | 22:38 |
corvus | i left notes in that file; maybe someone can double check that and we can clean it up later | 22:40 |
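For the later cleanup, a couple of read-only checks that make it easy to see which emergency-file entries still correspond to real inventory hosts; the inventory location is assumed to come from the existing ansible.cfg, so no -i path is given here.

```sh
# Read-only checks (inventory path assumed to come from ansible.cfg):
# show the full group tree, including everything that lands in "disabled"
ansible-inventory --graph
# list which inventory hosts a given pattern actually matches
ansible 'zuul*' --list-hosts
```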
corvus | #status log added zuul hosts to ansible emergency file to prepare for 2023-11-17 maintenance | 22:41 |
opendevstatus | corvus: finished logging | 22:41 |
clarkb | ++ to cleanup but lets do that after we're done with tomorrow's fun :) | 22:44 |
corvus | yep | 22:49 |