fungi | if tonyb doesn't beat me to it, i'll send the one-hour status notice at 14:30 utc, add the indicated hosts to the disable list and pre-create the root screen session on review02 per https://etherpad.opendev.org/p/gerrit-upgrade-3.8 | 13:08 |
---|---|---|
fungi | i'm headed out now but have all that prepped so i can do it from my phone if i'm not home by then | 13:08 |
Clark[m] | Thanks! I'm working on being awake now. If possible a larger than 80x24 screen window would be good :) | 13:44 |
tonyb | lies! all terminals are 80x24 ;P | 13:45 |
tonyb | disabled list updated | 13:45 |
tonyb | And the correct nodes are in groups["disabled"] | 13:48 |
fungi | thanks! | 13:52 |
fungi | turns out the cell signal in this parking lot is nearly nonexistent | 13:53 |
tonyb | eeek | 13:54 |
fungi | my phone claims to be doing cellular data with a single bar of 2g signal | 13:55 |
fungi | occasionally switches to 4g for a minute and then drops back to 2g again | 13:56 |
tonyb | screen session created in a 100x50 terminal | 13:57 |
tonyb | That's super frustrating. | 13:57 |
fungi | attached | 13:57 |
tonyb | Hopefully the geometry works out okay. | 13:59 |
tonyb | Is there some way to verify I can use statusbot ahead of time? | 14:00 |
fungi | yeah, american cell providers have gotten really terrible about dropping roaming agreements, which makes it extra bad if you live in a remote location with spotty coverage | 14:00 |
fungi | tonyb: i suppose you could #status log something, but might as well just try to do the notice and see what happens | 14:01 |
fungi | we're 90 minutes from start anyway | 14:02 |
tonyb | #status notice Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ | 14:05 |
opendevstatus | tonyb: sending notice | 14:05 |
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ | 14:05 | |
opendevstatus | tonyb: finished sending notice | 14:08 |
tonyb | Okay, looks like we're good for clarkb to carry on from https://etherpad.opendev.org/p/gerrit-upgrade-3.8#L109 | 14:09 |
fungi | awesome | 14:10 |
tonyb | I verified that the screen logging is working ... mostly to check I did it right ;P | 14:12 |
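A rough sketch of how a shared, logged root screen session like the one described above might be created; the session name and log path are assumptions, not what was actually typed on review02:

```bash
# start a named root screen session with logging enabled
# (session name and log file are placeholders)
sudo -s
screen -S gerrit-upgrade -L -Logfile /root/gerrit-upgrade-screen.log

# a second admin can attach to the same running session with:
sudo screen -x gerrit-upgrade
```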
* corvus yawns | 14:15 | |
clarkb | I've realized that we run gerrit init with the mysql db stopped. In this upgrade that isn't a problem because there are no schema changes, but to avoid problems in the future I'm going to add a command to step 11 to start the db before running init | 14:27 |
clarkb | and once I've got tea I'll load ssh keys and hop into screen and get ready for the rest of the fun | 14:28 |
fungi | good call | 14:28 |
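A very rough sketch of what the step 11 addition mentioned above might look like; the compose path and service name are assumptions based on a typical docker-compose Gerrit deployment, not the actual etherpad contents:

```bash
# bring only the database container up before gerrit init is run
# (path and service name are placeholders)
cd /etc/gerrit-compose
docker-compose up -d mariadb
docker-compose ps    # confirm the db is running before proceeding with init
```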
*** blarnath is now known as d34dh0r53 | 14:38 | |
clarkb | I confirm that screen logging appears to be working | 14:41 |
clarkb | and am attached to the screen | 14:41 |
clarkb | and emergency file looks good. Thank you for taking care of that | 14:42 |
fungi | home again with time to spare | 14:50 |
fungi | yeah, i checked the emergency list from the car and it looked right | 14:50 |
tonyb | Nice. | 14:51 |
fungi | also notified the openstack release team during their meeting | 14:51 |
tonyb | I really like the ansible that corvus shared. | 14:51 |
corvus | me too! hope it's accurate! :) | 14:52 |
tonyb | fungi: awesome | 14:52 |
corvus | https://etherpad.opendev.org/p/Av9otg2ML-52q2Nxiyi9 is my plan for zuul | 15:01 |
clarkb | corvus: do you need an inline gzip on the mysql dump to keep file sizes down? (not sure how big that will be and if zuul01's disk is large enough) | 15:03 |
corvus | clarkb: good point; as of 1 month ago it was 18g uncompressed. i'll probably dump it into /opt uncompressed which has plenty of space then compress it later | 15:04 |
clarkb | sounds good | 15:05 |
corvus | a month ago it was 2.6G compressed | 15:06 |
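For illustration, the two approaches being weighed above as a hedged sketch; the database name and paths are placeholders, not the real backup commands:

```bash
# inline compression keeps the on-disk footprint small from the start
mysqldump --single-transaction zuul | gzip > /opt/zuul-pre-migration.sql.gz

# or dump uncompressed to a filesystem with plenty of room, compress later
mysqldump --single-transaction zuul > /opt/zuul-pre-migration.sql
gzip /opt/zuul-pre-migration.sql
```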
corvus | i'm running the docker-compose pulls now | 15:07 |
corvus | pulls complete | 15:17 |
corvus | hrm, we don't have a root mysql user for zuul do we? | 15:23 |
clarkb | heh now both backup servers are at or above 90% | 15:23 |
corvus | (it's not important, but the zuul user lacks some privileges to inspect what the innodb engine is doing, which can be useful for monitoring progress) | 15:23 |
clarkb | that shouldn't affect our backups for the gerrit upgrade but we'll want to address that soon | 15:23 |
corvus | (yet another reason to just run our own db server) | 15:24 |
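A hedged illustration of the kind of inspection corvus is referring to; SHOW ENGINE INNODB STATUS requires the PROCESS privilege, which a restricted application account typically lacks, and without it SHOW PROCESSLIST only shows the account's own threads:

```bash
# both of these need the PROCESS privilege (or root) to show the full picture
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW FULL PROCESSLIST;"
```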
tonyb | Oh yeah. I was going to prune some older backups. | 15:24 |
clarkb | 5 minutes until we start | 15:25 |
tonyb | ++ | 15:25 |
clarkb | and so it's clear I plan to "drive" | 15:27 |
clarkb | I'm awake and here so may as well :) | 15:27 |
fungi | great. i'm standing by to help test and troubleshoot | 15:28 |
* tonyb is watching from the cheap seats | 15:28 | |
tonyb | and is happy to help as directed | 15:28 |
clarkb | yup I'll be sure to mention things in here if I need help | 15:29 |
clarkb | My clock has ticked over to 1530. I'm proceeding now | 15:30 |
corvus | stopping zuul | 15:30 |
clarkb | giving mariadb a few seconds to start up before I proceed with backups which talk to it | 15:31 |
clarkb | fs backups are complete and exited 0 | 15:32 |
clarkb | db backups are in progress | 15:32 |
clarkb | db backups also report rc 0 so I'm proceeding | 15:33 |
clarkb | I'm up to the point where I pull images. The edits to the docker-compose.yaml file lgtm so I am proceeding | 15:36 |
clarkb | The hash under RepoDigests near the top of the screen window seems to match the one I've got in the etherpad (and I checked that version was up to date yesterday) | 15:38 |
clarkb | fungi: tonyb: next step is the actual upgrade. Any reason to not proceed? | 15:38 |
tonyb | clarkb: Not that I can see. | 15:38 |
fungi | none i'm aware of | 15:38 |
clarkb | ok proceeding | 15:38 |
clarkb | [2023-11-17T15:41:00.681Z] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.8.2-53-gfbcfb3e1e5-dirty ready | 15:41 |
clarkb | I'm going to check reindexing maybe yall can look at web stuff? | 15:41 |
fungi | yep, looking | 15:42 |
clarkb | I see reindexing is in progress according to `gerrit show-queue` | 15:42 |
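For reference, a hedged sketch of checking reindex progress over Gerrit's admin SSH interface; the host and account here are assumptions:

```bash
# list the task queue, which includes the online reindex tasks
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
```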
fungi | Powered by Gerrit Code Review (3.8.2-53-gfbcfb3e1e5-dirty) | 15:42 |
clarkb | web is up for me and I see the expected version | 15:42 |
fungi | poking around the webui and i don't see anything out of the ordinary yet | 15:43 |
tonyb | UI looks good, logout login seem to work | 15:43 |
clarkb | thanks. The config diff has the one email template that updated as expected so that's good | 15:44 |
clarkb | pushing a change/patchset and then checking it replicates is probably the last remaining functionality thing we can do until zuul is with us again? | 15:44 |
clarkb | maybe one of you can do that? | 15:44 |
fungi | yeah, i have a dnm change i think | 15:45 |
clarkb | task queue has dropped by ~600 items since I first checked | 15:45 |
corvus | zuul db backup is complete and migration has started | 15:45 |
SvenKieske | +1 reviews work as well :) | 15:45 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: DNM: Test bindep with PYTHONWARNINGS=error https://review.opendev.org/c/opendev/bindep/+/818672 | 15:46 |
fungi | that's commit 2a9a8b0751f27e97820c2444aa2df6df6fb8f54b | 15:46 |
clarkb | https://opendev.org/opendev/bindep/commit/2a9a8b0751f27e97820c2444aa2df6df6fb8f54b I see it | 15:47 |
clarkb | push and replication seem good. I think the next major item then is waiting for reindexing to complete and then verifying zuul interactions once zuul is back | 15:47 |
tonyb | You guys are so quick :) | 15:47 |
fungi | yeah, https://opendev.org/opendev/bindep/commit/2a9a8b07 comes up for me too | 15:47 |
fungi | tonyb: it's not the first time we've done this ;) | 15:47 |
clarkb | tonyb: we are practiced :) | 15:47 |
tonyb | :) | 15:48 |
clarkb | gerrit's error log doesn't show anything unexpected. I see the expected exception from the plugin manager and reindexing updates in the log are at 33% complete | 15:49 |
clarkb | I do note that at least one user has invalid project watches. I believe this is a known thing and not new | 15:49 |
corvus | is it me? | 15:50 |
clarkb | corvus: no | 15:50 |
clarkb | it's a relatively new account, which surprises me. | 15:50 |
corvus | we're at the "copy the build table" portion of the migration; i want to say that's like 8 minutes... | 15:51 |
corvus | no, 11 minutes locally according to my notes | 15:52 |
clarkb | just over halfway done on the reindex according to the log file | 15:53 |
fungi | we'll probably also want a status notice at the end to remind people some changes may need to be rechecked? maybe something like... | 15:55 |
fungi | status notice The Gerrit upgrade is now complete; be aware we had Zuul offline in parallel for a lengthy schema migration, so any events occurring prior to XX:XX UTC may need a recheck to trigger jobs. | 15:55 |
clarkb | ++ | 15:56 |
clarkb | side note: I think we upgraded to 3.7 about a week before the 3.8 release. This upgrade is about a week before the 3.9 release. It's cool to see we're keeping up. Also they are crazy for releasing over thanksgiving | 15:58 |
tonyb | The release should be fine, it's the consumers of the release that may disagree ;P | 16:01 |
corvus | oh cool, i didn't quite catch the end, but i'm pretty sure the table copy took no longer than my local run, meaning my time/performance estimate should be pretty close to prod | 16:03 |
clarkb | corvus: nice | 16:04 |
corvus | we are 2 steps away from the point of no return on the db migration | 16:04 |
clarkb | reindexing completed and gerrit reports it is using the new gerrit index version. There were three errors against two changes being reindexed. Both changes have id's <20k | 16:04 |
* fungi grabs holds onto his seat | 16:04 | |
clarkb | I think we've seen that before and these are problems with old changes that we've basically accepted because what are you going to do | 16:05 |
fungi | yes, we have a handful of "bad" changes that can't be indexed | 16:05 |
clarkb | (if we want those to go away we could possibly try deleting the changes) | 16:05 |
fungi | all very old | 16:05 |
clarkb | show queue output looks good too. I'm marking that step done now | 16:05 |
fungi | i can't remember now, but vaguely recall they're unreachable in the ui too | 16:05 |
clarkb | ya | 16:06 |
corvus | we are past the point of no return on the zuul migration (if there is an error, we will need to fix it or restore from backup) | 16:07 |
clarkb | ack | 16:07 |
fungi | noted | 16:08 |
clarkb | fwiw on the gerrit side if infra-prod-service-review runs before we want it to at this point that's mostly safe. It will only update the docker-compose.yaml file to use the 3.7 image but on gerrit we don't let it manage service restarts | 16:10 |
clarkb | so as long as we get the change to update to 3.8 landed quickly we should be fine. All that to say I think we'll defer to corvus on when he is ready to clean up the emergency file and take it from there | 16:10 |
fungi | wfm | 16:11 |
tonyb | Sounds good. | 16:11 |
corvus | there was an error, i'm trying to sort out the logs | 16:13 |
corvus | 2023-11-17 15:43:47,859 DEBUG zuul.SQLConnection: Current migration revision: 151893067f91 | 16:19 |
corvus | 2023-11-17 16:10:03,931 DEBUG zuul.Scheduler: Configured logging: 9.2.1.dev47 | 16:19 |
corvus | it appears the scheduler restarted at that time; i don't see any indication why | 16:20 |
clarkb | agreed `docker ps -a` shows the container running for 10 minutes | 16:20 |
clarkb | dmesg doesn't report OOMKiller kicking in | 16:21 |
corvus | i wonder if there's a way to get the docker output from the previous container | 16:21 |
fungi | did we have it copying to syslog? | 16:21 |
clarkb | https://paste.opendev.org/show/bPTXJSq4qhmM81MofoBU/ I see this from docker in syslog | 16:23 |
clarkb | I don't see any ansible tasks | 16:24 |
clarkb | (so no unexpected ansible triggered this as far as I can tell) | 16:24 |
corvus | so it could be a zuul crash where the logs only go to stderr | 16:24 |
clarkb | I don't see anything in /var/log/containers which is where we've done our other docker log redirect to syslog output | 16:25 |
fungi | yeah, directory is entirely empty | 16:26 |
corvus | /var/lib/docker/containers only has the current container with no log from a previous run | 16:26 |
clarkb | we probably just haven't set it up for the zuul services | 16:26 |
fungi | right, nothing for log settings in /etc/zuul-scheduler/docker-compose.yaml | 16:26 |
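A hedged way to confirm which logging driver the scheduler container uses (the container name is a guess); with the default json-file driver the log lives under /var/lib/docker/containers/<id>/ and disappears when the container is removed and recreated, which matches what was observed above:

```bash
# show the logging driver configured for the running container
docker inspect --format '{{.HostConfig.LogConfig.Type}}' zuul-scheduler_scheduler_1

# any docker daemon messages that did make it to syslog
grep dockerd /var/log/syslog | tail -n 50
```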
corvus | okay i think the best thing we can do now is shut down the scheduler and then i'll try to figure out where in the migration it was and see if i can reconstruct the error, assuming there was one | 16:27 |
fungi | that sounds reasonable to me | 16:28 |
corvus | any objections to shutting down the running scheduler (which is in a loop trying to redo the migration but it can't)? | 16:28 |
clarkb | no objection from me | 16:28 |
tonyb | none | 16:28 |
fungi | please do | 16:28 |
corvus | in case it's useful in the future, the current (reconstituted) container is b6d98a4420b035c1eab11088d2764849afc6f36d8096ef91525f1a83b1346380 | 16:28 |
corvus | | 13869276 | zuul | 10.223.160.47:42130 | zuul | Query | 166 | altering table | CREATE INDEX zuul_build_uuid_buildset_id_idx ON zuul_build (uuid, buildset_id) | | 16:29 |
corvus | that's the last thing i saw; trying to see if it proceeded past that | 16:29 |
clarkb | should we do something like #status notice The Gerrit upgrade to 3.8 is complete but Zuul remains offline due to a problem with database migrations in a Zuul upgrade that was being performed in the same outage window. We will notify when both Gerrit and Zuul are happy again. | 16:29 |
fungi | status notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:30 |
corvus | yes, but you might want to indicate whether or not you think ppl should use gerrit | 16:30 |
fungi | hah, i was just typing something similar | 16:30 |
corvus | i have updated the zuul etherpad with the dump of all the sql statements i'm working from | 16:31 |
clarkb | corvus: at this point I think it is fine to use gerrit to post reviews. The only major item we haven't confirmed is working is the zuul integration | 16:31 |
clarkb | but good point. Maybe something along the lines of "it should be safe to post changes and reviews to Gerrit but you will not get CI results" | 16:31 |
clarkb | I think fungi's message covers that actually | 16:32 |
fungi | feel free to reword mine if you prefer | 16:32 |
clarkb | fungi: no I think yours is good if you want to send it | 16:32 |
fungi | #status notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:32 |
opendevstatus | fungi: sending notice | 16:32 |
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:33 | |
corvus | i think we're at line 192 in the etherpad | 16:34 |
clarkb | corvus: are you thinking roll forward from there and see if you can reproduce an error? | 16:35 |
opendevstatus | fungi: finished sending notice | 16:35 |
corvus | yes; and also, if able to proceed without an error, then just finish the migration manually | 16:36 |
clarkb | ok | 16:36 |
fungi | makes sense to me | 16:36 |
corvus | the statement that presumably failed involves fk constraints; we should have disabled them already, but i wonder if this old database/server behaves differently | 16:36 |
corvus | i'm going to set fk checks off and then run line 192 | 16:37 |
corvus | btw i do have a screen session on zuul02 (second window) if anyone wants to join | 16:38 |
corvus | okay, we're running mysql 5.7 and it does not support "alter table drop constraint" | 16:39 |
fungi | zuul01? | 16:40 |
corvus | yep | 16:40 |
corvus | sorry | 16:40 |
fungi | np, attached now | 16:40 |
fungi | we're still using a trove instance in rax for this, right? | 16:41 |
corvus | yep | 16:41 |
fungi | i guess we should be thinking about upgrading that instance and/or setting up our own db cluster instead | 16:44 |
corvus | yep. i'd like to try the statement on line 203 | 16:45 |
clarkb | that looks fine to me (but I'm not monty sql wizard) | 16:46 |
fungi | looks the same as the one at line 192. does it differ in ways i'm not spotting? | 16:46 |
clarkb | fungi: it moves from dropping constraint to dropping foreign key | 16:46 |
corvus | s/constraint/foreign key/ | 16:46 |
clarkb | seems to be a syntax thing | 16:46 |
fungi | oh! right | 16:47 |
corvus | Query OK, 0 rows affected (0.06 sec) | 16:47 |
fungi | but yeah, i'm a bit out of my depth when it comes to foreign key constraints | 16:47 |
corvus | show create table on that lgtm now | 16:48 |
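To illustrate the syntax difference just worked around (table and constraint names here are placeholders, not the real migration statements): ALTER TABLE ... DROP CONSTRAINT only exists in MySQL 8.0+, while DROP FOREIGN KEY is accepted by 5.7 as well.

```bash
# fails on MySQL 5.7; DROP CONSTRAINT is only understood by MySQL 8.0+
mysql -e "ALTER TABLE zuul.zuul_build DROP CONSTRAINT zuul_build_buildset_id_fk;"

# equivalent statement that MySQL 5.7 accepts
mysql -e "ALTER TABLE zuul.zuul_build DROP FOREIGN KEY zuul_build_buildset_id_fk;"

# verify the foreign key is gone
mysql -e "SHOW CREATE TABLE zuul.zuul_build\G"
```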
corvus | shall i continue running the statements in the etherpad manually? | 16:48 |
fungi | yes please | 16:48 |
clarkb | corvus: I think if you are comfortable doing that (you wrote the migration so should be pretty knowledgeable about what is needed from here) I would say go for it | 16:48 |
clarkb | I think I would be more concerned if this was software we weren't super familiar with just because the chance of missing something is high | 16:49 |
corvus | the scheduler in its attempt to re-run the migration failed at step1, so i don't think it did any damage | 16:49 |
clarkb | ack | 16:50 |
fungi | stroke of luck there, i suppose | 16:50 |
corvus | (it's sort of symmetrical; same table is at the beginning and end of the migration) | 16:52 |
corvus | i'm double checking the etherpad statements with the python code | 16:52 |
corvus | (just to make sure nothing changed) | 16:52 |
corvus | while we're waiting on this alter table; i think to fix zuul we can try just making this change and let testing tell us if that works in current mysql/postgres | 16:54 |
clarkb | ++ | 16:54 |
corvus | okay migration is complete | 16:56 |
corvus | shall we startup the executor again now? | 16:56 |
fungi | i think so | 16:56 |
fungi | scheduler you mean? | 16:56 |
corvus | ha yes lets do that one :) | 16:56 |
clarkb | wfm | 16:56 |
fungi | cool, yes then ;) | 16:56 |
corvus | seems to be happy and not doing any sql things | 16:57 |
fungi | yay! | 16:57 |
fungi | thanks!!! | 16:57 |
corvus | i will proceed with restarting the rest | 16:57 |
fungi | once it's up i can recheck that dnm change from earlier and see if it gets enqueued | 16:57 |
clarkb | I guess let us know when you think we should recheck fungi's bindep DNM change to check the zuul + gerrit intercommunication | 16:58 |
clarkb | ++ | 16:58 |
corvus | rebooting all other hosts | 16:58 |
corvus | starting zuul-web on zuul01 | 16:59 |
corvus | clarkb: looks like we *are* going through the github branch listing | 17:00 |
corvus | but it was relatively fast this time | 17:00 |
clarkb | huh | 17:00 |
fungi | hopefully won't trigger any api rate limits | 17:00 |
clarkb | the updates we made should avoid that now | 17:01 |
clarkb | fingers crossed anyway | 17:01 |
corvus | yeah it's done | 17:01 |
corvus | starting up zuul02 | 17:01 |
corvus | starting mergers and executors | 17:02 |
fungi | dashboard is returning content now | 17:03 |
fungi | i guess all of the queue items will end up retrying their prior builds? | 17:04 |
corvus | the builds and buildsets tabs produce data in a reasonable amount of time | 17:04 |
corvus | yep, it's firing them off now | 17:04 |
fungi | oh, nice, it's just the builds which were in progress that are being retried, all the ones which had completed remain so | 17:04 |
corvus | yep | 17:05 |
corvus | a neat side effect of that is that we immediately have new build database records (for the "RETRY" results) | 17:05 |
clarkb | should we recheck the bindep chagne now? | 17:05 |
fungi | should i go ahead and recheck our test change? | 17:05 |
corvus | yep i think it's gtg | 17:06 |
fungi | done | 17:06 |
clarkb | I see it enqueued | 17:06 |
clarkb | one reason we explicitly test rechecks is that they have changed the stream event format around comment data before | 17:06 |
clarkb | but that seems good | 17:06 |
clarkb | and jobs are starting | 17:06 |
clarkb | I've marked the two zuul related items on step 14 as done based on the bindep change | 17:07 |
clarkb | that takes us to step 17 which is to quit the screen and save the log. Are we ready for that? | 17:07 |
fungi | yeah, https://zuul.opendev.org/t/opendev/status lgtm | 17:07 |
fungi | i suppose we'll want to see it successfully comment in gerrit too, those jobs should hopefully be relatively fats | 17:08 |
fungi | er, fast | 17:08 |
clarkb | I've also removed my WIP on https://review.opendev.org/c/opendev/system-config/+/899609 and think we can approve that whenever corvus is ready | 17:08 |
clarkb | fungi: ++ | 17:08 |
fungi | some builds for 818672 have already succeeded | 17:08 |
corvus | ready for 899609 | 17:09 |
corvus | all zuul components are running now | 17:09 |
clarkb | tonyb: fungi: you cool with me closing the screen now and saving the log? | 17:09 |
corvus | and i think it's fine to remove zuul from emergency now | 17:09 |
tonyb | clarkb: Yup. | 17:09 |
clarkb | corvus: do you want to do the emergency file cleanup? I think you can remove review related stuff as well | 17:09 |
corvus | con do | 17:10 |
corvus | can do | 17:10 |
fungi | clarkb: go for it | 17:10 |
fungi | status notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:10 |
clarkb | step 17 to stop screen and move the log file is done | 17:11 |
fungi | does that cover what we want folks to know? | 17:11 |
clarkb | fungi: lgtm | 17:11 |
corvus | i have removed today's maintenance entries from emergency. | 17:11 |
fungi | thanks! | 17:11 |
corvus | do we want to also remove the unrelated things we think we can clean up from that file, or leave that for another day? | 17:11 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/899609 | 17:11 |
fungi | #status notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:11 |
opendevstatus | fungi: sending notice | 17:11 |
clarkb | corvus: I'm happy to do that another day :) | 17:11 |
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:11 | |
clarkb | corvus: I can make a note to myself to do that monday | 17:11 |
corvus | ack | 17:11 |
fungi | yeah, i'm beginning to get a smidge peckish | 17:12 |
clarkb | ok to recap where we are on the gerrit side of things: The upgrade is done, the checks we have performed have all checked out. Nothing crazy or unexpected in the gerrit error_log and we got things we expected which is double good. We have since removed services from the emergency file and approved the change to set the 3.8 image in docker-compose.yaml on review02. We need to confirm | 17:13 |
clarkb | that the file looks good after infra-prod-service-review runs | 17:13 |
clarkb | infra-prod-service-review does not start and stop gerrit though so it should be very safe even if we got something wrong there | 17:13 |
opendevstatus | fungi: finished sending notice | 17:14 |
tonyb | clarkb: I didn't know that infra-prod-service-review does not stop/start gerrit but everything else matches my understanding | 17:16 |
clarkb | tonyb: ya, many of our services we let ansible automatically do that stuff. Gerrit is special enough and has lots of rules that change between versions about whether or not you need to init or reindex or both, and it's also disruptive to restart even when we do updates within a single version. All that means we let ansible write the configs and then we manually restart things | 17:17 |
tonyb | clarkb: makes sense. | 17:19 |
clarkb | corvus: fwiw the builds search feature in zuul works for me. As does buildsets. | 17:20 |
clarkb | https://review.opendev.org/c/starlingx/update/+/898850 is a post upgrade zuul comment against gerrit | 17:20 |
clarkb | it lgtm | 17:21 |
* clarkb takes a break while waiting for job results | 17:22 | |
fungi | the test change is still waiting on two nodes | 17:28 |
fungi | and is in a failing state (build log shows the reason) | 17:29 |
fungi | so looks like it's working the way it should so far | 17:29 |
* tonyb goes afk for a bit | 17:30 | |
fungi | yeah, christine's being very patient waiting for me to take her to lunch, but i may have to assume this will work and check back on it after i return | 17:31 |
fungi | node request backlog is down to around 65 now | 17:32 |
clarkb | zuul is busy today. I didn't expect the friday before a major holiday (for at least some contributors) to be so busy | 17:33 |
fungi | are those "Will not fetch project branches as read-only is set" errors for the opendev tenant expected? | 17:33 |
clarkb | fungi: yes/no They are not new. But we do need to debug them | 17:34 |
fungi | ah, okay. thanks | 17:34 |
fungi | i have a feeling the node request for that remaining opendev-nox-docs build got accepted by a provider that's repeatedly timing out booting an instance for it | 17:35 |
clarkb | https://review.opendev.org/c/starlingx/stx-puppet/+/900806 is a merged change after the upgrade fwiw | 17:36 |
fungi | we're at about 30 minutes since the change was enqueued, and all its builds have completed except that one | 17:36 |
clarkb | fungi: really we have enough other changes that have done stuff that I think its fine | 17:36 |
corvus | ftr, no that alter table syntax does not work with postgres, but it does work with mysql 8.x so i've updated the patch with a conditional. i do expect it to pass tests now. | 17:36 |
fungi | cool, i'm going to step out for an hour, back soon | 17:36 |
clarkb | corvus: the fix gets a +2 from me | 17:37 |
clarkb | oh the tooz thing is causing the openstack gate to thrash which could explain why it is busy | 17:40 |
clarkb | more coincidence than anything else I think | 17:40 |
corvus | i wonder if we should start promoting the regex-based early failure detection | 17:41 |
corvus | seems to be working pretty well in the zuul-project jobs | 17:42 |
corvus | maybe i should send an email next week | 17:42 |
clarkb | ++ | 17:43 |
clarkb | https://opendev.org/starlingx/stx-puppet/commits/branch/master shows the above merged change replicated and the master branch updated properly | 17:47 |
clarkb | just more sanity checks of replication | 17:47 |
corvus | i added another zuul change to address the missing error log problem | 17:51 |
clarkb | good idea +2 there as well | 17:51 |
clarkb | as a heads up the opendev hourly jobs have enqueued. They will run against zuul but not review | 18:02 |
clarkb | the hourly jobs should wrap up in just a couple of minutes. Then a few minutes after that the change to set the image version in the docker-compose file should merge and apply (which should noop) | 18:20 |
opendevreview | Merged opendev/system-config master: Upgrade Gerrit to Gerrit 3.8 https://review.opendev.org/c/opendev/system-config/+/899609 | 18:32 |
tonyb | gerrit-compose on review02 looks good to me (still contains 3.8) | 18:34 |
clarkb | yup the job is still running though according to zuul | 18:34 |
clarkb | not sure when ansible will try to modify it so we should double check after the job completes | 18:34 |
clarkb | job is complete now and the file wasn't modified according to the timestamp | 18:35 |
clarkb | yup looks good too. I think we are basically done at this point | 18:35 |
tonyb | Oh so it is. I waited for it to merge, but of course that isn't the "important" run; I needed to wait for the deploy pipeline | 18:35 |
clarkb | I made a note to myself to do the autohold cleanup Monday | 18:35 |
clarkb | and then in a few days / a week we can work to clean up the 3.7 image stuff if we don't revert between now and then | 18:36 |
clarkb | tonyb: the other things I've got on my list for today are to swap in the new mirror and to do db pruning. I think fungi mentioned he would help with the db pruning. Do you have a change up for the dns update to swap in the new mirror yet? | 18:37 |
tonyb | I don't. Gimme 5 and I will ;P | 18:37 |
clarkb | ok I can review it :) | 18:37 |
opendevreview | Tony Breeds proposed opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 | 18:46 |
tonyb | clarkb: I'd be keen to shadow you as you do the db purging. | 18:46 |
clarkb | er sorry not db pruning. Backup pruning | 18:48 |
clarkb | tonyb: I was going to defer that to you and fungi since fungi mentioned the other day being willing to do that with you | 18:48 |
tonyb | Sorry that's what I meant | 18:48 |
tonyb | clarkb: Okay cool | 18:48 |
clarkb | ya I typoed backups as db earlier :) just wanted to be clear | 18:48 |
clarkb | tonyb: https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#managing-backup-storage is the relevant documentation | 18:49 |
tonyb | Thanks | 18:51 |
clarkb | when fungi returns from lunch he can double check the dns update and then let us know if backup pruning isn't in the cards for today | 18:53 |
clarkb | if not I can refresh on it | 18:53 |
tonyb | Okay. Sounds good. I have a few things to do in a couple of hours but if that conflicts I'll shadow next time. | 18:55 |
clarkb | fyi it has been reported in #openstack-infra that editing files in the web UI isn't working. Specifically you can enter edit mode, but when you open a file to edit a specific file it never loads the file contents in the editor. You can then exit edit mode successfully | 18:59 |
fungi | okay, backl | 19:07 |
fungi | back too | 19:07 |
fungi | tonyb: i can work with you on that now if you like | 19:07 |
fungi | or later, either works | 19:07 |
tonyb | fungi: If now works for you it works for me | 19:08 |
fungi | starlingx folks are asking about errors from pip install... i'm looking into the logs now | 19:08 |
tonyb | fungi: where? I suspect that's another venue I should hang out/monitor | 19:09 |
fungi | pinged me directly in the starlingx general matrix room | 19:09 |
tonyb | Ah okay | 19:10 |
clarkb | fungi: can you weigh in on the editor being broken first? | 19:12 |
clarkb | I want to make sure we're comfortable with that not working for now | 19:12 |
fungi | oh, i missed the broken editor | 19:13 |
clarkb | I'm putting notes in the etherpad. I don't think we need to rollback for this. Its annoying but not vital | 19:13 |
tonyb | Shoot it's working for me now | 19:14 |
clarkb | wut | 19:15 |
clarkb | tonyb: it == editor in web ui? | 19:15 |
tonyb | clarkb: Yup. | 19:15 |
* clarkb retests | 19:15 | |
tonyb | clarkb: I reproduced exactly what was seen on 900435. I was poking around in the console/developer tools | 19:16 |
clarkb | tonyb: it still doesn't work for me. | 19:16 |
tonyb | clarkb: I closed the window by mistake and now I have a functional editor | 19:16 |
clarkb | maybe we need to hard refresh because something is cached? | 19:16 |
clarkb | ya maybe that is it /me tries | 19:16 |
clarkb | yup that did it | 19:16 |
clarkb | I think the plugin html must not be versioned like the main gerrit js/html/css is so we don't get the auto eviction stuff | 19:18 |
fungi | strangely, starlingx is seeing the reverse of https://review.opendev.org/c/openstack/project-config/+/897545 in this job, i think: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f | 19:18 |
* clarkb updates the etherpad. That gives us something to send upstream | 19:18 |
fungi | they're trying to access wheels under debian-11.8 instead of debian-11 | 19:18 |
clarkb | fungi: we updated zuul; maybe ansible reverted that behavior / swapped it around | 19:19 |
fungi | well, i think it's that we ended up with newer libs in the ansible venvs | 19:19 |
clarkb | fungi: yes we changed the mirror stuff because ansible updated and changed the behavior of those vars. I'm wondering if they swapped back to the old behavior | 19:20 |
clarkb | and did it since last Friday because we last upgraded zuul around then and just upgraded it a few hours ago | 19:20 |
fungi | what we fixed in 897545 was the jobs that build the wheel mirrors | 19:21 |
fungi | the problem they're seeing is that jobs are now looking for wheels in the location that the broken wheel mirror jobs wanted to publish to | 19:21 |
fungi | so i don't think this is a revert | 19:21 |
fungi | it looks more like a delayed reaction, where the playbook setting up the mirror urls in jobs is now exhibiting similar behavior to how we saw wheel publication break before we fixed it | 19:22 |
clarkb | they use the same ansible versions though | 19:22 |
clarkb | it should all be the ansible version in zuul's venv for ansible | 19:22 |
clarkb | unless maybe they are doing nested ansible? | 19:23 |
fungi | it's happening in that job when opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml is invoked, doesn't look nested | 19:24 |
fungi | it's parented to tox-py39 | 19:25 |
clarkb | maybe we needed to update the client side and didn't realize it was broken after we fixed the generation side? | 19:26 |
clarkb | and it's just now getting bubbled up? | 19:26 |
fungi | that's what it seems like, i'm just trying to find where that happens | 19:26 |
fungi | it's possible all jobs using debian nodes are exhibiting this now | 19:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars https://review.opendev.org/c/opendev/system-config/+/900379 | 19:27 |
fungi | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml#L12 is where it's coming from, i think, we want ansible_distribution_major_version instead of ansible_distribution_version on debian now | 19:30 |
fungi | odd that centos is working though | 19:32 |
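A quick hedged way to see the two facts side by side; on a Debian 11 node ansible_distribution_version reports something like "11.8" while ansible_distribution_major_version is just "11", which matches the wheel path mismatch described above:

```bash
# dump the distribution version facts from a node
ansible localhost -m setup -a 'filter=ansible_distribution_*version'
```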
corvus | any idea why it's only showing up now? is it maybe the case that this mirror doesn't get used often? | 19:32 |
corvus | (it's unfortunate that job doesn't save tox logs so we can compare to successful runs) | 19:32 |
clarkb | corvus: I'm beginning to suspect its been broken for a while and noone noticed | 19:32 |
clarkb | we fixed the mirror generation side and the consumers didn't test it or if they did didn't check back in with us to say it doesn't work | 19:33 |
clarkb | but maybe zuul's build history can confirm | 19:33 |
corvus | the most recent runs succeeded | 19:33 |
corvus | https://zuul.opendev.org/t/openstack/builds?job_name=sysinv-tox-py39&project=starlingx/config | 19:33 |
fungi | yeah, i just noticed that too | 19:33 |
corvus | failure we're looking at is 3rd newest | 19:33 |
fungi | all for the same change | 19:34 |
corvus | that's what made me think that the "use mirror" path may not be used often? but we can't tell on the successful jobs | 19:34 |
tonyb | The successful one seemed to use a mirror: https://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#747 | 19:36 |
tonyb | a non RAX mirror | 19:36 |
fungi | https://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#747 is in a build of that job that succeeded | 19:36 |
tonyb | snap | 19:36 |
corvus | that mirror url is constructed with major.minor | 19:37 |
fungi | yeah, that's what i'm saying, not sure why it didn't cause a problem in that build | 19:37 |
corvus | so it seems like we are getting consistent behavior from the jobs in using that variable. | 19:37 |
corvus | maybe it's not using that particular wheel mirror; either getting it from somewhere else or building it? | 19:38 |
clarkb | https://etherpad.opendev.org/p/tvAyWLRV07MNayX3Bbc3 this is my draft to the gerrit repo-discuss list about cache stuff | 19:39 |
fungi | right, the wheel mirror may be unavailable/incorrect in both builds, the error for it in the failing build might be secondary and the real problem could be elsewhere | 19:39 |
clarkb | pip is supposed to fallback when it can't find additional indexes to the indexes it does fine | 19:39 |
clarkb | *it does find | 19:39 |
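Roughly how the job's pip index configuration is laid out, as a hedged sketch (the URLs are illustrative): the wheel mirror is configured as an extra index, so a package missing there should fall back to the primary index.

```bash
# the region mirror serves the primary index, with the per-distro wheel
# mirror as an extra index (paths are illustrative)
pip install \
    --index-url https://mirror.ord.rax.opendev.org/pypi/simple/ \
    --extra-index-url https://mirror.ord.rax.opendev.org/wheel/debian-11-x86_64/ \
    pygments
```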
corvus | anyway, it seems like that probably means we can exclude all of todays/yesterday's changes from the list of suspects; the fix is to update the variable; and the remaining mystery is why it only sometimes manifests (and i'm pretty sure saving the tox logs would help with that too, so i'd recommend that) | 19:40 |
clarkb | ++ | 19:40 |
clarkb | I'm going to need to stop and find food soon. I skipped breakfast and it is now almost lunch time and my body is protesting. I think remaining todos are to fire off that email to repo-discuss and then other tasks unrelated to gerrit like backup pruning and ord mirror swap out via dns | 19:46 |
corvus | clarkb: etherpad lgtm | 19:48 |
fungi | okay, it looks like this was the actual cause of the starlingx job failure: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f/log/job-output.txt#61245-61286 | 20:07 |
fungi | or immediately above there anyway... Could not fetch URL https://mirror-int.ord.rax.opendev.org/pypi/simple/pygments/: connection error: HTTPSConnectionPool(host='mirror-int.ord.rax.opendev.org', port=443): Read timed out. - skipping | 20:09 |
clarkb | I'm going to send that email to repo-discuss now. Thank you all for reading it | 20:23 |
fungi | sorry, just read it now but lgtm | 20:25 |
clarkb | fungi: I think we can land https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 when you are ready | 20:27 |
clarkb | this will swap in the new ord mirror | 20:28 |
clarkb | fungi: and were you still planning to do backup pruning with tonyb today? | 20:28 |
clarkb | I'm about to eat lunch and expect to be afk for a bit | 20:28 |
clarkb | https://groups.google.com/g/repo-discuss/c/DTrYQtY0j1k/m/7riBbIa5BwAJ | 20:32 |
tonyb | heading to the gym now. back in about 90mins | 20:34 |
fungi | oh, yep. i'll be around when tonyb is back from the gym | 21:05 |
fungi | also approved 901332 | 21:06 |
opendevreview | Merged opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 | 21:12 |
fungi | deploy succeeded | 21:54 |
fungi | $ host mirror.ord.rax.opendev.org | 21:55 |
fungi | mirror.ord.rax.opendev.org is an alias for mirror02.ord.rax.opendev.org. | 21:55 |
fungi | https://mirror.ord.rax.opendev.org/ has a working ssl cert and i can browse it as expected | 21:57 |
tonyb | Awesome. | 22:05 |
fungi | tonyb: so when you have a moment, take a look at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#backups if you haven't already | 22:31 |
fungi | currently the two backup servers are backup01.ord.rax.opendev.org and backup02.ca-ymq-1.vexxhost.opendev.org | 22:32 |
fungi | the active backup volume on both of those is /opt/backups-202010 | 22:32 |
fungi | when the active backup volume exceeds 90% used, we try to prune it | 22:33 |
fungi | which basically boils down to running /usr/local/bin/prune-borg-backups as root on the relevant backup server | 22:34 |
tonyb | fungi: Just got back | 22:34 |
tonyb | Ahh got it. I misread the docs and thought it was on the backup client. | 22:35 |
fungi | it will take a while (upwards of an hour or two) so safest to run it in a screen session in case your ssh connection gets interrupted | 22:35 |
tonyb | Okay | 22:35 |
fungi | you'll probably also want to crack open the prune-borg-backups script to see the goodness within | 22:35 |
tonyb | Okay so I'll start 2 80x24 terminals (one on each server) with sudo -s ; su - ; screen <CTRL>-H | 22:36 |
fungi | it does log the output, so hardcopy of the screen session, while it doesn't hurt, isn't all that necessary | 22:36 |
tonyb | Ah okay | 22:36 |
fungi | if you look in the script, you'll see it logs to a file in /opt/backups/ called prune-<timestamp>.log | 22:37 |
fungi | ultimately, the real command is a few lines from the end of the script, it's calling `/opt/borg/bin/borg prune ...` | 22:38 |
fungi | the rest is so much window dressing to save us from having to fiddle options | 22:39 |
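For a sense of what the wrapper wraps, a hedged sketch of a borg prune invocation of that shape; the retention policy and repo path are placeholders, not the script's actual settings:

```bash
# prune one backup repo, keeping a sliding window of archives
# (retention values and repo path are placeholders)
/opt/borg/bin/borg prune --verbose --list \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 \
    /opt/backups-202010/borg-review02/backup
```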
tonyb | Okay screen sessions created | 22:39 |
tonyb | I'm looking at the prune-borg-backup script on backup02.ca-ymq-1.vexxhost.opendev.org | 22:39 |
fungi | oh, and you might also want to take a peek at one of the prune logs just so you know what they include | 22:40 |
tonyb | Will do | 22:41 |
fungi | (they're extremely verbose, you won't really get much on stdout in your terminal other than whether it succeeded or, rarely and hopefully not, failed) | 22:41 |
fungi | you'll see it records the exit code of each prune command it runs too | 22:42 |
fungi | ideally rc 0 obviously | 22:42 |
fungi | there is a dry run option you might want to try first | 22:42 |
tonyb | Sounds good. | 22:43 |
fungi | `/usr/local/bin/prune-borg-backups noop` is the dry run syntax | 22:44 |
fungi | `/usr/local/bin/prune-borg-backups prune` is the actually do it syntax | 22:44 |
tonyb | Okay, that's going to take a small amount of time to digest, as the script does a 'read' so I was expecting it to prompt and wait etc, but you're passing the mode as an argument | 22:47 |
fungi | oh, actually yes | 22:47 |
fungi | you're right | 22:47 |
fungi | `/usr/local/bin/prune-borg-backups` and then enter "noop" | 22:48 |
tonyb | Okay | 22:48 |
fungi | if you enter anything other than "noop" or "prune" it will lol at you | 22:48 |
tonyb | I thought it was some mode of read I didn't know about | 22:48 |
fungi | yes, the special kind that i forget about when it's something i only run every few months ;) | 22:48 |
tonyb | LOL | 22:49 |
tonyb | So I understand the process as is; I need to do a little more reading to really get the way (and reasons) borg is set up, but so far it looks super neat | 22:53 |
tonyb | fungi: Should I serialize the servers or do them in parallel? It seems like parallel is "safe" | 22:54 |
fungi | yes, completely safe | 22:56 |
fungi | clients push similar backups to two different servers every day, we have them in different service providers just as an insurance policy | 22:57 |
fungi | they're purely for redundancy, not performance reasons | 22:58 |
fungi | usually we don't, simply because they tend not to fill up in the same week | 22:58 |
fungi | so there's almost only ever one that needs pruning at any given point in time anyway | 22:58 |
tonyb | https://borgbackup.readthedocs.io/en/stable/usage/prune.html says "Important: Repository disk space is not freed until you run borg compact." I don't see anywhere there is a compact run | 22:59 |
fungi | possible it's implied somehow | 23:00 |
tonyb | Okay. | 23:02 |
fungi | i mean, it visibly reduces disk space on the volume when we run prune, so it's happening somehow i guess | 23:03 |
tonyb | Yeah, Just trying to understand as much as I can. | 23:04 |
Clark[m] | We are still borg 1.6 iirc and those docs may be for 2.0 and maybe it changes? | 23:18 |
tonyb | Oh okay. I'll check that | 23:18 |
tonyb | These are the 1.2.6 docs | 23:23 |
tonyb | Okay so even the noop run takes $some_time | 23:29 |
tonyb | It looks like both servers are at > 90% so I'll prune them both | 23:30 |
fungi | yep, thanks! | 23:31 |
tonyb | The noop on backup01 returned success so starting the actual prune in 5 mins unless someone says "NO!" | 23:32 |
fungi | when i do it, i just leave it running and then check back on it after a few hours or the next morning | 23:32 |
fungi | i say go for it | 23:32 |
tonyb | That's my plan. | 23:32 |
tonyb | It's in a screen session as described | 23:33 |
tonyb | backup01 is pruning; screen:0 is where it's running, screen:1 is a tail -f of the log ... in case anyone wants to check in | 23:34 |
tonyb | ditto for backup02 | 23:35 |
tonyb | So WRT the mirror updates, mirror02.ord.rax is now the mirror for that region | 23:38 |
Clark[m] | I trust the script | 23:39 |
tonyb | IIUC Assuming there are no issues I need to remove mirror01.ord.rax from DNS, and then from system-config inventory and LE handlers and then delete the server | 23:39 |
tonyb | Once I've done that I can do the OVH ones (doing them in serial is just about caution, nothing technical) | 23:40 |
tonyb | as discussed mirror.dfw.rax will be last because extra careful | 23:41 |
tonyb | Sound about right? | 23:41 |
Clark[m] | Yup | 23:41 |
Clark[m] | We will also need to delete the volume attached to it | 23:42 |
Clark[m] | Otherwise I think that is complete and correct | 23:42 |
tonyb | <emote character="Mr Burns"> Excellent </emote> | 23:42 |
tonyb | That's a good point. | 23:43 |
* fungi nods in agreement | 23:43 | |
tonyb | Cool. I can do that next week. | 23:44 |
tonyb | I'll also reach out to rosmaita and the i18n SIG about the future of translate | 23:46 |
tonyb | I was looking at the wiki, I think with a little work we can use the mediawiki container images in the same way we do for many other services; that would at least decouple the OS upgrades and give us some testing. There is an image for the version we're running which *hopefully* will make it easy to switch to. From there we can look at rolling updates forward to get to a newer version. | 23:48 |
Clark[m] | I think the main thing is getting all the plugins and stuff going but maybe we don't care about the theming so much anymore | 23:50 |
Clark[m] | Re prune vs compact I'm confused. Maybe compact does extra cleanup beyond what prune does? In particular prune cleans up incomplete archives automatically; maybe that's all we clean up? | 23:52 |
tonyb | It is pruning other things | 23:52 |
tonyb | eg Pruning archive: review02-filesystem-2022-10-31T05:46:02 Mon, 2022-10-31 05:46:02 [SNIP] (39/39) | 23:53 |
tonyb | so I think it's more than incomplete archives | 23:53 |
tonyb | https://borgbackup.readthedocs.io/en/stable/usage/compact.html indicates it's useful after a prune. | 23:55 |
tonyb | So I'm going to suggest that early next week (perhaps right after the team meeting), we try running a compact on a repo and see what happens to the disk utilisation. | 23:56 |
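If the installed borg turns out to be 1.2 or newer, the follow-up step being proposed would look roughly like this (the repo path is a placeholder); on 1.2, prune only marks segments for deletion and compact is what actually frees the space:

```bash
# reclaim the space freed by an earlier prune (borg >= 1.2)
/opt/borg/bin/borg compact --progress /opt/backups-202010/borg-review02/backup

# then check the effect on the volume
df -h /opt/backups-202010
```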
tonyb | I don't think that Friday afternoon is a good time to play with that kind of thing ;P | 23:57 |
fungi | right there with you | 23:59 |