corvus1 | clarkb: fungi morning | 14:08 |
---|---|---|
corvus1 | looks like the restart has finished | 14:09 |
corvus1 | i've started a db dump from the old server | 14:10 |
corvus1 | that means this is the start time for the data loss window (jobs completed between now and when we finish won't be in the db) | 14:10 |
corvus1 | i've approved the zuul change to add the index hints (and also one other simple db related fix) | 14:14 |
Clark[m] | Ack | 14:22 |
corvus1 | dump is finished, copying now | 14:31 |
corvus1 | restore in progress on screen on zuul-db01. | 14:34 |
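For reference, the dump-and-restore being described roughly follows the shape below; the hostnames, paths, and flags are illustrative assumptions, not the exact commands that were run.

```shell
# On the old server: dump the zuul database (flags and paths are assumptions).
mysqldump --single-transaction --routines zuul > /root/zuul-dump.sql

# Copy the dump over to the new database server.
scp /root/zuul-dump.sql zuul-db01:/root/

# On zuul-db01: run the restore inside screen so it survives a disconnect.
screen -S import
mysql zuul < /root/zuul-dump.sql
```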
corvus1 | i think we still need the dburi change, yeah? | 14:35 |
corvus1 | oh heh it's actually just a secret var | 14:37 |
Clark[m] | Yes I think so | 14:37 |
corvus1 | so i'm going to edit secrets and update it there. | 14:37 |
Clark[m] | And we run zuul updates hourly so it should go into effect quickly if you don't do it by hand | 14:38 |
Clark[m] | Just keep that in mind I guess as it may go in before you're ready too? | 14:39 |
corvus1 | okay, the dburi is updated in secret hostvars. for now, let's see if the ansible deploy gets it before the import is done? and otherwise we can edit by hand. | 14:40 |
corvus1 | i don't think we're going to automatically restart zuul based on that file changing, right? so since all the zuul restarts are done now, i think we're good to let it go in any time? | 14:40 |
Clark[m] | Oh yes that is correct | 14:41 |
corvus1 | i suppose it's possible we might do a full-reconfigure? and that might reload the connections? i don't know | 14:41 |
Clark[m] | We trigger a zuul config reload but that applies to the tenant config not the ini config | 14:41 |
corvus1 | yeah, we only trigger a smart-reconfigure; i don't think that touches connections | 14:41 |
fungi | morning! | 14:45 |
fungi | looks like you started just as i was getting ready to grab a shower | 14:46 |
corvus1 | perfect timing! | 14:47 |
corvus1 | i think we're idle for another 90m or so right now. btw, import is running in screen. | 14:47 |
Clark[m] | The zuul hourly job will likely run in about a half an hour and again an hour after that | 14:48 |
Clark[m] | Just for timing on when the config might update. | 14:48 |
Clark[m] | The docs explicitly mention tenant config and don't say anything about connections configs in relation to smart reconfiguration | 14:50 |
corvus1 | ++ if i'm wrong about what zuul will do with the update, then i would expect it to simply fail inserts due to missing tables; or, if we happen to have all the tables a query needs, it might insert rows anyway (that's okay because the schema has the insert id in it, so they would go at the end); or it might hit a table lock and just freeze until that table is done. | 14:50 |
corvus1 | i went through the code to double check, and it does reload the zuul.conf file, and it does tell the connections that a reconfiguration has happened, but the sql driver ignores that event since it doesn't typically need to do anything. | 14:51 |
corvus1 | my takeaway is that it shouldn't affect us, but also, if we wanted to make it easy to update the dburi without a restart, that would be a very simple change. :) | 14:52 |
Clark[m] | It seems like keeping the database update explicit is useful for reasoning about data integrity. But I can see how reloading without a restart would be useful in some situations | 14:53 |
Clark[m] | (like with redundant active standby DB setups) | 14:54 |
corvus1 | yup | 14:54 |
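A minimal sketch of the reconfigure commands discussed above, assuming a containerized scheduler (the compose service name is an assumption). Per the code check, neither command re-establishes the SQL connection, so a new dburi still requires a scheduler restart to take effect.

```shell
# Re-reads the tenant config (and zuul.conf), but the SQL driver ignores the event:
docker-compose exec scheduler zuul-scheduler smart-reconfigure

# A full reconfiguration behaves the same way with respect to connections:
docker-compose exec scheduler zuul-scheduler full-reconfigure
```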
corvus1 | artifacts done; halfway through builds | 15:31 |
fungi | thinking either my terminal is too small or i'm on the wrong window of the screen session | 15:32 |
fungi | this is on zuul-db01 yeah? | 15:32 |
fungi | there are like 4 windows but the first one seems blank, i'm assuming that's where the progress output is showing up for the rest of you | 15:33 |
corvus1 | yeah, it's a big window sorry | 15:34 |
corvus1 | i've resized mine to 80x24 now | 15:34 |
corvus1 | when it's done, something should show up in the first blank one; that's the import | 15:34 |
corvus1 | i'm using the second one to show processlist to check progress | 15:35 |
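One way that progress check in the second window might look; the exact invocation is a guess.

```shell
# Poll the running import from another screen window (the interval is arbitrary):
watch -n 10 'mysql -e "SHOW FULL PROCESSLIST"'
```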
corvus1 | zuul changes merged | 15:52 |
fungi | no worries, i don't think screen handles dynamic resizing anyway. maybe someday we can talk about using tmux for similar purposes, but it's not a big deal | 15:54 |
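For anyone joining the shared session, a quick sketch of the screen basics involved here; this is generic usage, not the exact keystrokes anyone used.

```shell
# Attach to the shared root screen session without detaching other viewers:
sudo screen -x

# Inside screen:
#   Ctrl-a "   list windows and switch to the import or processlist one
#   Ctrl-a F   fit the current window to your own terminal size
```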
corvus1 | looks like it's done | 16:24 |
corvus1 | i'm going to docker-compose pull on the schedulers, then check the image tags and also dburi and make sure everything is staged | 16:26 |
Clark[m] | ++ | 16:28 |
corvus1 | the image log looks correct, has the latest change id | 16:29 |
corvus1 | dburi looks correct | 16:29 |
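The pull-verify-restart sequence being worked through looks roughly like this; the compose directory and service names are assumptions for illustration.

```shell
# On each scheduler host (zuul01, then zuul02):
cd /etc/zuul-scheduler        # assumed location of the compose file
docker-compose pull           # stage the new image

# After checking the image tag and dburi, restart one component at a time:
docker-compose up -d web        # zuul-web first, host by host
docker-compose up -d scheduler  # then the schedulers, host by host
```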
corvus1 | i'm going to restart zuul-web on zuul01 now | 16:30 |
fungi | cool | 16:32 |
corvus1 | we're going from dev82 to dev85 | 16:34 |
corvus1 | zuul01 web is up, restarting zuul02 web now | 16:35 |
corvus1 | https://zuul.opendev.org/t/openstack/builds works now and is instantaneous for me | 16:36 |
corvus1 | restarting zuul01 scheduler now | 16:37 |
fungi | ~instantneous for me as well | 16:37 |
Clark[m] | And me | 16:37 |
fungi | instantaneous | 16:37 |
corvus1 | faster than you can spell instantaneous for sure | 16:37 |
fungi | faster than i can type it at the very least | 16:39 |
corvus1 | both web servers are up now | 16:41 |
corvus1 | scheduler 1 is up, restarting scheduler 2 now | 16:42 |
corvus1 | that means we should be at the end of the data loss period | 16:42 |
fungi | right | 16:42 |
fungi | writing all further results to the new (temporary) database now | 16:43 |
corvus1 | so job results will be missing from the database and webui from about 14:10 to 16:43 on 2024-04-06 | 16:43 |
fungi | thanks! i was about to calculate that as well, saved me the bother | 16:44 |
corvus1 | how about: status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:45 |
fungi | lgtm | 16:45 |
corvus1 | #status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:46 |
corvus1 | maybe i'm not authed? | 16:47 |
fungi | checking | 16:47 |
fungi | it's in channel at least | 16:47 |
fungi | #status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:48 |
opendevstatus | fungi: sending notice | 16:48 |
-opendevstatus- NOTICE: Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:48 | |
*** corvus is now known as Guest269 | 16:48 | |
*** corvus1 is now known as corvus | 16:48 | |
fungi | yeah, corvus1 isn't logged in according to /whois | 16:48 |
corvus | i am now :) | 16:48 |
corvus | https://zuul.opendev.org/t/openstack/build/7cdf35ccdddb46fd804638d4c8305516 is post-migration | 16:49 |
fungi | agreed | 16:49 |
Clark[m] | And now we can see if general responsiveness improves as well | 16:49 |
corvus | all components are up now | 16:49 |
fungi | awesome! heating up the wok to make lunch, but will be around to keep an eye out for any problem reports | 16:49 |
opendevstatus | fungi: finished sending notice | 16:50 |
corvus | i dequeued the sandbox post items | 16:51 |
corvus | there's a periodic item in there now that's stuck on a node request? | 16:51 |
corvus | doesn't look like i can dequeue it | 16:52 |
Clark[m] | That's the change vlotorev got in to reproduce that problem... | 16:52 |
Clark[m] | Or at least I think it is /me double checks | 16:53 |
corvus | the ones i dequeued, yes; not the periodic one though | 16:53 |
Clark[m] | Ya the sandbox change | 16:53 |
corvus | there were 3 queue items. 2 in "post" one in "periodic". i dequeued the 2 in "post". the one in "periodic" remains because i am unable to dequeue it through the web ui. | 16:54 |
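The CLI equivalent of that dequeue would look something like the following; the ref is a guess, since periodic items are enqueued by ref rather than by change, and an authenticated zuul-client configuration is assumed.

```shell
# Hypothetical invocation against the openstack tenant:
zuul-client dequeue --tenant openstack --pipeline periodic \
    --project opendev/sandbox --ref refs/heads/master
```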
Clark[m] | And it will return if we don't revert the change in sandbox first. We should probably do that then figure out the dequeue | 16:55 |
corvus | they will return the next time someone merges 2 changes in sandbox in rapid succession. but yes, it should all be reverted. | 16:55 |
Clark[m] | Ya if we revert then the problem goes away for new trigger events | 16:56 |
Clark[m] | And I plan to update the change to address this monday | 16:57 |
Clark[m] | Which may also allow them to run normally | 16:57 |
corvus | it's a real travesty what happened to that exquisite corpse | 17:06 |
corvus | remote: https://review.opendev.org/c/opendev/sandbox/+/915197 Revert to an earlier state of the repo [NEW] | 17:13 |
corvus | remote: https://review.opendev.org/c/opendev/sandbox/+/915198 Update test.py for python3 [NEW] | 17:13 |
corvus | Clark: fungi ^ some cleanup | 17:13 |
Clark[m] | corvus: I guess there wasn't one specific change we could revert? I don't see a periodic job in the zuul.yaml diff | 17:18 |
corvus | Clark: oh i'm sure that could be smaller, but the repo was a mess. just doing 2 things at once. | 17:19 |
Clark[m] | I see | 17:19 |
fungi | also worth consideration: a dedicated sandbox tenant in zuul for further isolation | 17:22 |
corvus | maybe with only check and gate pipelines | 17:23 |
fungi | yeah, that would make sense. simplicity helps the demonstration | 17:26 |
mnaser | https://releases.openstack.org/ is not responding to me | 21:30 |
mnaser | cc infra-root ^ | 21:32 |
mnaser | it seems like it's responding very slowly now :X | 21:33 |
fungi | looking | 22:55 |
fungi | it's responding instantly to me now, but i'll check the system resource graphs for static.o.o | 22:56 |
fungi | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=71254&rra_id=all | 22:57 |
fungi | load average spiked right when mnaser was noticing issues | 22:58 |
fungi | there was also a connection count spike, but that could be a symptom of connections waiting much longer than usual for a response | 22:58 |
fungi | server started complaining about losing contact with afs01.dfw (in the same cloud provider and region) at 21:12:12 utc, continuing with some frequency until 21:23:46 utc | 23:01 |
fungi | [Sat Apr 6 21:31:06 2024] INFO: task jbd2/dm-0-8:549 blocked for more than 120 seconds. | 23:03 |
fungi | that's the start of errors in dmesg on afs01.dfw | 23:03 |
fungi | "This message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main04, [...] at 2024-04-06T21:34:09.152730." | 23:06 |
fungi | we also got a "returned to service" update on that same ticket at 21:59:21 utc | 23:07 |
fungi | i guess i'll reboot afs01.dfw and then, if lvm comes back up clean, see what might be stuck | 23:09 |
mnaser | yeah looked very io-y :\ | 23:09 |
fungi | i/o errors in dmesg ceased at 21:45:10 utc, looks like | 23:09 |
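A quick way to scan for the hung-task and i/o errors mentioned above, with human-readable timestamps; illustrative, not necessarily the exact check used.

```shell
# Run on afs01.dfw:
dmesg -T | grep -Ei 'blocked for more than|i/o error'
```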
fungi | server's back up now | 23:20 |
fungi | reboot took about 10 minutes | 23:20 |
fungi | looks like /vicepa came up mounted and writeable | 23:22 |
fungi | [ OK ] Finished File System Check on /dev/main/vicepa. | 23:24 |
fungi | boot.log confirms it did get checked | 23:24 |
fungi | `vos status` says "No active transactions on 104.130.138.161" | 23:26 |
fungi | same for the other two fileservers, so nothing to abort | 23:27 |
fungi | i'll start a root screen session and work through manually releasing the list of volumes | 23:29 |
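The manual release pass might look roughly like the loop below; the volume listing, filter, and auth flag are assumptions, and as noted later, volumes without read-only replicas are skipped.

```shell
# Rough sketch only, run on an AFS admin host:
for vol in $(vos listvldb | awk '/^[a-z]/ {print $1}'); do
    # only release volumes that actually have read-only replica sites
    if vos listvldb -name "$vol" | grep -q 'RO Site'; then
        vos release "$vol" -localauth
    fi
done
```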
fungi | about 1/3 of the way through now | 23:40 |
fungi | about 2/3 of the way through now | 23:48 |
fungi | and done | 23:56 |
fungi | except for the volumes without any read-only replicas configured, of course, which i skipped | 23:56 |