*** ryohayakawa has joined #opendev | 00:04 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Setup gate inventory in /etc/ansible on bridge https://review.opendev.org/740605 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Setup gate inventory in /etc/ansible on bridge https://review.opendev.org/740605 | 01:30 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 01:30 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 01:30 |
*** sgw1 has quit IRC | 02:00 | |
openstackgerrit | Merged openstack/diskimage-builder master: Switch from unittest2 compat methods to Python 3.x methods https://review.opendev.org/739645 | 02:21 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use upload-docker-image in periodic jobs https://review.opendev.org/740560 | 02:25 |
*** sgw1 has joined #opendev | 02:29 | |
*** sgw1 has quit IRC | 02:48 | |
*** weshay_ruck is now known as weshay_pto | 03:04 | |
*** sgw1 has joined #opendev | 03:19 | |
openstackgerrit | wu.chunyang proposed openstack/diskimage-builder master: remove py35 in "V" cycle https://review.opendev.org/740607 | 03:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Copy generated inventory to bridge logs https://review.opendev.org/740605 | 03:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 03:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 03:41 |
*** DSpider has joined #opendev | 03:48 | |
*** sgw1 has quit IRC | 03:59 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 04:07 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: testinfra: silence yaml.load() warnings https://review.opendev.org/740608 | 04:10 |
*** raukadah is now known as chandankumar | 04:20 | |
*** sgw1 has joined #opendev | 04:23 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 04:29 |
*** sgw1 has quit IRC | 04:32 | |
*** sgw1 has joined #opendev | 04:33 | |
*** bhagyashris|away is now known as bhagyashris | 04:41 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 05:05 |
*** cloudnull has quit IRC | 05:23 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 05:25 |
*** marios has joined #opendev | 05:35 | |
*** fressi has joined #opendev | 05:36 | |
*** cloudnull has joined #opendev | 05:48 | |
*** ysandeep|away is now known as ysandeep | 05:50 | |
*** ysandeep is now known as ysandeep|afk | 05:51 | |
ianw | infra-root: https://review.opendev.org/#/q/topic:host-keys+(status:open+OR+status:merged) is a little stack to add host keys to our inventory, and automatically deploy them on bridge, and do a little cleanup | 06:14 |
*** halali_ has quit IRC | 06:15 | |
ianw | fungi: i haven't done a full debug, but 740609 failed in system-config-run-lists which looks unrelated -- https://zuul.opendev.org/t/openstack/build/a80165194c7f4f42a44477642a304c31 | 06:29 |
ianw | fungi: Error: Execution of '/usr/sbin/newlist mailman nobody@openstack.org notarealpassword' returned 1: Create a new, unpopulated mailing list. ... i wonder if the job is not happy? | 06:31 |
*** halali_ has joined #opendev | 06:33 | |
*** tosky has joined #opendev | 06:39 | |
*** ysandeep|afk is now known as ysandeep|rover | 07:53 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** ysandeep|rover is now known as ysandeep|lunch | 08:16 | |
*** dtantsur|afk is now known as dtantsur | 08:34 | |
*** ysandeep|lunch is now known as ysandeep|rover | 08:55 | |
*** sshnaidm|afk is now known as sshnaidm | 09:09 | |
openstackgerrit | Iury Gregory Melo Ferreira proposed openstack/diskimage-builder master: Update ipa jobs https://review.opendev.org/740642 | 09:27 |
frickler | infra-root: mirror01.london.linaro-london.openstack.org seems to still be running and trying to send mails, failing for lack of DNS records, any way to get that shut down? that region seems no longer to be in our clouds.yaml either | 09:29 |
*** halali_ has quit IRC | 09:43 | |
*** zbr has joined #opendev | 10:34 | |
*** finucannot is now known as stephenfin | 10:57 | |
*** halali_ has joined #opendev | 11:00 | |
*** bhagyashris is now known as bhagyashris|afk | 11:33 | |
*** tkajinam has quit IRC | 11:37 | |
*** rh-jelabarre has joined #opendev | 12:12 | |
*** rh-jelabarre has quit IRC | 12:12 | |
*** rh-jelabarre has joined #opendev | 12:12 | |
*** ryohayakawa has quit IRC | 12:29 | |
*** osmanlicilegi has quit IRC | 12:33 | |
*** bhagyashris|afk is now known as bhagyashris | 12:35 | |
*** osmanlicilegi has joined #opendev | 12:44 | |
*** zbr|ruck has quit IRC | 13:27 | |
*** zbr|ruck has joined #opendev | 13:28 | |
*** noonedeadpunk has quit IRC | 13:31 | |
*** noonedeadpunk has joined #opendev | 13:33 | |
*** bhagyashris is now known as bhagyashris|afk | 13:56 | |
*** dviroel has joined #opendev | 14:05 | |
*** ysandeep|rover is now known as ysandeep|away | 14:32 | |
fungi | ianw: looks like it complained about "illegal list name: <foo>@ lists" for every <foo> it tried | 14:33 |
fungi | i wonder if this is a behavior change with newer mailman | 14:33 |
fungi | ianw: though it's ubuntu xenial so that seems unlikely | 14:35 |
fungi | maybe we changed something related to name resolution on the nodes? | 14:36 |
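Context for the name-resolution theory: Mailman 2 builds new list addresses from its configured email host (DEFAULT_EMAIL_HOST in mm_cfg), so a node that cannot resolve its own name could plausibly produce the "illegal list name" failure quoted above. A hedged diagnostic sketch for a held test node; the exact failure mode here is an assumption:

    # Re-run the failing command from the job log by hand:
    /usr/sbin/newlist mailman nobody@openstack.org notarealpassword
    # What email host does Mailman think it serves? (stock Mailman 2 layout)
    python2 -c 'from Mailman import mm_cfg; print(mm_cfg.DEFAULT_EMAIL_HOST)'
    # And what does the node resolve for itself?
    hostname -f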
fungi | frickler: git history for system-config says that server's last known ip address is 213.146.141.37 | 14:39 |
fungi | and i can still ssh in, i could just locally initiate a poweroff for it | 14:40 |
fungi | but yeah, deleting will most likely require excavating the old api credentials from the private hostvars git history, assuming the api is still reachable | 14:41 |
clarkb | hrw may be able to rm that mirror node too if the api is not accessible | 14:41 |
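If the old API credentials can be excavated from the private hostvars history as fungi suggests, the cleanup is routine; the clouds.yaml entry name below is hypothetical, and whether the endpoint still answers is exactly the open question:

    # Sketch, assuming recovered credentials in a local cloud entry:
    openstack --os-cloud linaro-london server list
    openstack --os-cloud linaro-london server delete \
        mirror01.london.linaro-london.openstack.org
    # Failing that, ssh to 213.146.141.37 and poweroff locally, which
    # at least stops the failing mail attempts.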
*** mlavalle has joined #opendev | 14:55 | |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: maintain-github-mirror: add requests dependency https://review.opendev.org/740711 | 15:10 |
fungi | infra-root: rackspace says it had to reboot the hypervsior host for ze05 a few hours ago... i'll check it over | 15:13 |
fungi | oh, nope zm05 | 15:13 |
fungi | #status log zm05 rebooted by provider at 12:02 utc due to hypervisor host problem, provider trouble ticket 200713-ord-0000367 | 15:14 |
openstackstatus | fungi: finished logging | 15:14 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update to gitea v1.12.2 https://review.opendev.org/740716 | 15:34 |
clarkb | infra-root catching up on some things that happened last week and I was most curious about the bup git indexes. Did we end up rm'ing them and everything was fine afterwards? if so should we do the same to review.o.o? | 15:40 |
fungi | the zuul-merger process on zm05 seems to be running and getting used, and no obvious errors in its logs, so i'll close out the rax trouble ticket | 15:40 |
fungi | grafana says we're down one merger though, so seeing if i can work out which one is out to lunch | 15:41 |
fungi | aha, nevermind. it's ze01, so known | 15:42 |
clarkb | though we landed the change to vendor geard so we can probably turn it back on again? | 15:42 |
clarkb | it being ze01's executor | 15:42 |
fungi | yeah, i think we just didn't want to do it while nobody was around | 15:42 |
clarkb | I'm around if we want to do that nowish | 15:43 |
clarkb | mostly just digging through scrollback and emails to ensure I'm not missing anything important | 15:43 |
corvus | clarkb: yes i rm'd it; let me see if it looks like everything is fine | 15:43 |
corvus | hrm, last entry on the remote side is wed jul 8, so i think everything is not fine | 15:45 |
clarkb | I wonder if we need to create a new remote backup target if we reset the local indexes | 15:46 |
corvus | there are 2 bup processes currently running on zuul01 | 15:47 |
corvus | jul 9 and 10 | 15:47 |
corvus | i wonder if one of them is stuck due to the disk being full before | 15:47 |
corvus | how about i kill them (in reverse order) and see if the next one runs okay? | 15:48 |
clarkb | sounds good | 15:48 |
fungi | though status log says you removed /root/.bup from zuul01 2020-07-08 16:14:11 utc | 15:50 |
fungi | so if the oldest running backup started on 2020-07-09 that would be after the cleanup | 15:50 |
corvus | well, hrm. | 15:51 |
corvus | i dunno then. | 15:52 |
corvus | the remote side is getting pretty full. | 15:52 |
corvus | maybe we should just go ahead and do a rotation there anyway. | 15:52 |
*** marios is now known as marios|out | 15:54 | |
*** mlavalle has quit IRC | 15:57 | |
*** mlavalle has joined #opendev | 16:05 | |
fungi | yeah, looks like we didn't zero the root reserved allocation for the current volume | 16:06 |
fungi | oh, nevermind i'm looking at the older volume | 16:06 |
fungi | so yes, we're below half a percent free there | 16:07 |
fungi | and we last rotated it a little over a year ago, judging from the volume name | 16:08 |
fungi | would we also keep the volume currently mounted at /opt/backups-201711 or just blow it away and swap them? | 16:09 |
clarkb | rotating like that would simplify things, rather than making a new volume | 16:12 |
*** marios|out has quit IRC | 16:13 | |
corvus | i'd advocate blowing away 2017 and using its pvs to make 2020 | 16:18 |
*** fressi has left #opendev | 16:19 | |
*** diablo_rojo__ has joined #opendev | 16:21 | |
fungi | i'm in favor of that plan. i have to assume nothing writes to /opt/backups-201711 currently anyway | 16:25 |
clarkb | process for that is something like remount current backups to some new path, clear oldest backups, remount oldest backups fs to current backups? | 16:25 |
clarkb | fungi: ya we've basically done a rotation to keep the oldest set around | 16:25 |
fungi | i'm happy to work on that unless someone else already is | 16:26 |
clarkb | we may want to double check with ianw as ianw has done some backup stuff in the past but I think in this case repurposing space for oldest backups is safe | 16:26 |
clarkb | and no I'm not working on it | 16:26 |
*** diablo_rojo__ is now known as diablo_rojo | 16:26 | |
fungi | but yeah, the only thing i need to figure out is what's currently telling it to write into /opt/backups-201903 (a symlink?) and whether we need to bup init all the trees and chown stuff on the new fs | 16:27 |
fungi | aha, yep. symlink | 16:28 |
fungi | /opt/backups -> backups-201903 | 16:28 |
corvus | https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#rotating-backup-storage | 16:30 |
fungi | i guess we should make fresh homedirs for each of the users (copying from /etc/skel?), carry over their .ssh/authorized_keys files and then bup init as each of them to create an empty ~/.bup | 16:30 |
fungi | aha, we have docs, right! ;) | 16:30 |
fungi | (why do i always assume stuff like this isn't documented?) | 16:31 |
corvus | that looks fairly complete :) | 16:31 |
fungi | indeed it does, thanks | 16:31 |
corvus | it has a step of running a test backup manually | 16:31 |
corvus | we may want to do 2 of those | 16:31 |
fungi | sure | 16:31 |
corvus | a "normal" server and zuul01, in case zuul01 is still somehow broken | 16:31 |
fungi | so the remaining question, i know ianw wanted to build a new bup server, anyone happen to know the state of that? | 16:32 |
corvus | i don't, but i feel like we can/should consider that a second server | 16:32 |
fungi | i recall he was talking about doing the next rotation to a replacement server, but yeah, having redundant backups again would be good | 16:32 |
corvus | it's time to rotate the volumes on the primary server anyway, so i vote we just do that for now | 16:32 |
clarkb | system-config-run-backup exists and you can probably go from there to figure out general state but I agree with corvus that can be a second server rather than a replacement | 16:33 |
fungi | okay, cool, i'll get started on that here momentarily | 16:40 |
fungi | infra-root: any objections to ditching the main/backups volume (currently mounted at /opt/backups-201711) on backup01.ord.rax.ci.openstack.org and repurposing it for future backups so the currently near-full backup volume can be rotated out? | 16:42 |
corvus | fungi: no objections (you knew that already, but ftr) | 16:44 |
fungi | thanks | 16:46 |
clarkb | ya I think that is fine | 16:47 |
clarkb | corvus: can you think of any reason to not start ze01's executor again? the vendored gear code in the ansible role should address the last known problem with it right? | 16:57 |
clarkb | I'll go ahead and do that if that is the understanding | 16:57 |
corvus | clarkb: ++; i need to take care of some errands; should be back in a bit. feel free to start if you have time, or i can later | 17:03 |
clarkb | ya I can do it | 17:03 |
clarkb | I've got the errands this afternoon (school district is doing q&a on restarting schools in the fall at 2pm) so trying to be useful now | 17:04 |
clarkb | and done. I'll keep an eye on it | 17:04 |
clarkb | #status log Restarted zuul-executor container on ze01 now that we vendor gear in the logstash job submission role. | 17:05 |
openstackstatus | clarkb: finished logging | 17:05 |
clarkb | infra-root I'll plan to land https://review.opendev.org/737885 once I'm satisfied ze01 can be left alone. And https://review.opendev.org/740716 would be good to land too. Both are gitea improvements/upgrades | 17:06 |
*** dtantsur is now known as dtantsur|afk | 17:15 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use upload-docker-image in periodic jobs https://review.opendev.org/740560 | 17:22 |
clarkb | ze01 seems happy. I've now identified a tox job I'm following specifically | 17:34 |
clarkb | will use that as a canary | 17:34 |
clarkb | 2020-07-13 17:37:40,225 DEBUG zuul.AnsibleJob.output: [e: abb23df574fd4ababf35797c0dcbcae3] [build: ff644ee4f74b4e8596416af21bd31757] Ansible output: b"An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'gear'" | 17:38 |
clarkb | seems the vendoring hasn't fully worked (and we wouldn't have noticed until ze01 was turned on?) | 17:38 |
corvus | i just saw that :/ | 17:38 |
corvus | i'm guessing it has to do with the cwd of the python process when it loads the module | 17:39 |
clarkb | do we need a __init__.py in the library/ dir to make it a valid python module? | 17:39 |
clarkb | oh that could be too | 17:39 |
clarkb | corvus: we vendor toml things somewhere or am I imagining that? I wonder if we can replicate how that is done | 17:40 |
clarkb | I seem to recall something related to serialization like that anyway | 17:40 |
corvus | ya, i'll look into it | 17:40 |
fungi | so stopping ze01 now? | 17:40 |
corvus | clarkb: meanwhile, we could graceful ze01, or leave it running if we don't mind missing a few logstashes | 17:41 |
clarkb | I don't mind missing that data personally | 17:41 |
clarkb | e-r says we're way behind right now anyway | 17:41 |
corvus | okay, let's give me a few mins to see if there's a quick fix | 17:41 |
clarkb | and leaving it up will make it easier to test the fix | 17:41 |
fungi | yeah, i expect the worst side effect is users getting confused by the failed tasks in their successful jobs | 17:41 |
corvus | fungi: they'll never see it | 17:41 |
corvus | this is strictly post-log-upload | 17:42 |
fungi | oh, right, it's after log collection | 17:42 |
corvus | (you'd have to watch the streamer in real-time) | 17:42 |
fungi | (NECESSARILY after log collection, since we're processing collected logs) | 17:42 |
corvus | clarkb: the special ansible.module_utils thing is what you're thinking of that we use with toml | 17:42 |
corvus | i'm going to take a few mins to set up a repro env locally so we don't burn all day on this :) | 17:44 |
clarkb | I'm guessing if I look at logstash worker logs we'll find some new giant log files that are causing problems with indexing (and that is why we are behind) | 17:44 |
clarkb | corvus: ++ | 17:44 |
fungi | this is the tool which grew out of pip's vendoring approach: https://pypi.org/project/vendoring/ | 17:44 |
*** qchris has quit IRC | 18:08 | |
fungi | #status log old volume and volume group for main/backups unmounted, deactivated and deleted on backup01.ord.rax.ci.openstack.org | 18:16 |
openstackstatus | fungi: finished logging | 18:16 |
fungi | i suppose we want to continue to keep the pvs tied to separate vgs instead of extending a vg across all of them? | 18:17 |
fungi | we can tie specific pvs to specific lvs either way, but i guess we can revisit how these are organized when we build the new server | 18:18 |
clarkb | separating them seems good if we want to have more than one failure domain? | 18:18 |
fungi | it's irrelevant as far as that's concerned. vgs don't have any notion of consistency anyway unless you use mirroring | 18:19 |
clarkb | but if we mixed vgs across pvs losing a pv would lose both vgs? | 18:20 |
fungi | i was suggesting we think about putting a single volume group across the physical volumes, you can still tell it which physical volumes should contain the blocks for which logical volumes | 18:21 |
fungi | it's really mostly namespacing | 18:21 |
clarkb | gotcha | 18:21 |
fungi | anyway, irrelevant for the moment, i've already created the new vg across the old repurposed pvs | 18:22 |
fungi | main-202007 | 18:22 |
*** qchris has joined #opendev | 18:22 | |
fungi | okay, we've got 3.0T free on /opt/backups-202007 | 18:27 |
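For the record, the volume juggling described above reduces to standard LVM; device names below are hypothetical, while the vg and lv names match the log:

    umount /opt/backups-201711
    vgchange -an main && vgremove -f main     # retire the 2017 main/backups vg
    vgcreate main-202007 /dev/xvdb /dev/xvdc  # rebuild across the freed PVs
    lvcreate -l 100%FREE -n backups-202007 main-202007
    mkfs.ext4 -m 0 /dev/main-202007/backups-202007   # -m 0: no root reserve,
                                                     # per the note at 16:06
    mount /dev/main-202007/backups-202007 /opt/backups-202007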
fungi | i need to take a break to do some dinner prep, and will then tackle the rest of the cutover | 18:28 |
clarkb | thanks! | 18:28 |
openstackgerrit | James E. Blair proposed opendev/base-jobs master: Really vendor gear for log processing https://review.opendev.org/740744 | 18:40 |
corvus | clarkb: ^ i think that should do it (it at least gets past import errors in my local testing) | 18:41 |
clarkb | gotcha so there is an ansible method for doing that lgtm | 18:42 |
fungi | ah, module_utils is a special namespace i guess? | 19:43 |
fungi | ansible magic | 19:43 |
corvus | yep | 19:51 |
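What the 740744 fix amounts to: a bare `import gear` only works where the module file was written, because Ansible copies modules to the remote node before executing them; imports routed through the magic ansible.module_utils namespace get bundled into that payload. An illustrative layout, with role and file names that are assumptions rather than copies from the actual change:

    # roles/submit-logstash-jobs/            (illustrative role name)
    #   library/submit_log_processor_jobs.py
    #     -> uses: from ansible.module_utils import gear
    #   module_utils/gear.py                 (the vendored gear client)
    #
    # Ansible resolves ansible.module_utils imports against module_utils/
    # directories adjacent to the playbook or role and ships the matching
    # files to the remote alongside the module itself.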
fungi | mordred: not sure if you're around today, but i'm catching up on old stuff in my "i'll look at this later" pile, and a few weeks ago rax alerted us that "MySQL Replica on testmt-01-replica-2017-06-19-07-19-15-replica-2020-06-27-15-27-00 is CRITICAL" | 19:51 |
fungi | i assume this is some old trove replication test we no longer care about but have forgotten was set up, and we should delete it? | 19:52 |
clarkb | I've approved https://review.opendev.org/#/c/737885/7 to paginate more gitea requests for project management | 19:59 |
clarkb | it should be pretty well tested at this point but definitely say something if you notice it operating oddly | 19:59 |
fungi | corvus: you killed the bup processes on zuul01, right? just making sure i'm not overlooking them | 20:00 |
openstackgerrit | Merged opendev/base-jobs master: Really vendor gear for log processing https://review.opendev.org/740744 | 20:01 |
clarkb | fungi: corvus we should consider rm'ing review.o.o's bup indexes and restarting them on the new volume if zuul01 shows it is happy that way | 20:02 |
clarkb | it is quite large there as well | 20:02 |
fungi | i expect so, yes | 20:03 |
fungi | when we were looking into it, sounded like it would just maybe reduce performance of the next bup run since there would be no cache | 20:04 |
openstackgerrit | Merged zuul/zuul-jobs master: Strip path from default ensure_pip_from_upstream_interpreters https://review.opendev.org/740505 | 20:04 |
fungi | but if the next run is also a full backup, then it's probably irrelevant anyway | 20:04 |
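For reference, the step being weighed for review.o.o is what was already done on zuul01: the local ~/.bup holds only the client-side index and dedup cache, so clearing it loses no backed-up data. A sketch:

    rm -rf /root/.bup
    bup init    # fresh, empty local repository and cache
    # The next save re-hashes the whole tree (slow, and the remote can
    # grow, since bup no longer knows which packs it already sent);
    # runs after that are incremental again, as confirmed below.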
clarkb | we didn't rename any projects last week right? | 20:10 |
* clarkb is putting together our agenda for tomorrow | 20:11 | |
fungi | we did not, no | 20:16 |
fungi | at least we didn't take any downtime for it | 20:16 |
clarkb | thanks for confirming | 20:16 |
corvus | fungi: yes i killed the bups | 20:21 |
fungi | thanks for confirming | 20:22 |
fungi | i'll proceed with stopping sshd, switching mounts around, setting the old volume read-only and priming the new homedir copies | 20:23 |
*** shtepanie has joined #opendev | 20:43 | |
openstackgerrit | Merged opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 20:53 |
*** DSpider has quit IRC | 21:16 | |
fungi | while prepping homedirs for the new backups volume, i took the opportunity to omit a few which had no content on the prior volume (likely were already not being backed up by the time of the last rotation) as well as a couple where the servers had been replaced since the last rotation and so are no longer getting new data | 21:17 |
corvus | fungi: ++ | 21:17 |
*** boyvinall has joined #opendev | 21:17 | |
*** diablo_rojo has quit IRC | 21:18 | |
fungi | oh, also one service which has been decommissioned since the last rotation (groups.o.o) | 21:18 |
fungi | that leaves us with the following nine bup-* accounts: ask01 ethercalc02 etherpad lists review storyboard translate wiki zuulv3 | 21:19 |
fungi | hopefully there's nothing anyone thinks we're backing up which doesn't appear in that list | 21:20 |
fungi | and all the ~/.bup dirs in them have been initialized | 21:21 |
*** JayF has quit IRC | 21:23 | |
fungi | and the symlink pointed at the new file tree and sshd reenabled | 21:23 |
*** JayF has joined #opendev | 21:24 | |
fungi | based on previous backup sizes, i should probably start by testing ethercalc if i have any desire for it to wrap up in a reasonable amount of time | 21:25 |
fungi | i have a root screen session going on the ethercalc server where i'm testing the bup command from its crontab | 21:26 |
fungi | it spewed a warning about missing indices | 21:27 |
fungi | that's presumably to be expected | 21:27 |
fungi | warning: index pack-ac94d2c7004625e772e9c1cc623163ab30d9b37a.idx missing used by midx-36b8c644cf750bbfe70298e7b8453dd3da9f3b28.midx | 21:28 |
fungi | et cetera | 21:28 |
fungi | the size of ~bup-ethercalc02 (now on the new volume) is growing | 21:30 |
fungi | and it seems to have finished, exited 0 | 21:30 |
*** boyvinall has quit IRC | 21:30 | |
fungi | total size on the backup server 764M | 21:30 |
fungi | now to test zuul | 21:31 |
fungi | i have a root screen session going on zuul01 where i'm testing the bup command from its crontab now | 21:32 |
fungi | same missing index warnings | 21:32 |
fungi | ~/bup-zuulv3 growing on the backup server | 21:33 |
*** boyvinall has joined #opendev | 21:33 | |
corvus | i deem that to be promising :) | 21:40 |
fungi | 7.5gb accumulated for it on the backup server already | 21:40 |
*** boyvinall has quit IRC | 21:42 | |
*** boyvinall has joined #opendev | 21:43 | |
fungi | completed, exited 0, 22gb as stored on the backup server now | 22:06 |
ianw | fungi: system-config-run-lists re-ran and went ok, so i don't know, i guess it was just a transient error | 22:06 |
fungi | ianw: yeah, that was really, really strange. it looked like it could have been a name resolution problem | 22:06 |
fungi | corvus: should i do a second backup on zuul.o.o now to confirm it goes more quickly once primed? | 22:07 |
ianw | fungi: i think that these days we wouldn't need to pre-seed the storage volume on backups; the ansible roles should create things as required | 22:07 |
clarkb | fungi: how big is the bup stuff on zuul01 now? | 22:07 |
fungi | corvus: also the ~root/.bup dir on zuul.o.o is now 3.5gb | 22:07 |
clarkb | is it in the same magnitude as the previous contents? | 22:07 |
fungi | heh, you read my mind | 22:07 |
clarkb | ah cool so we saved like 20GB or something | 22:07 |
clarkb | in that case I think we should do similar with review | 22:08 |
fungi | yeah, seems safe | 22:08 |
fungi | i'll fire a second backup on zuul now to see how much faster it completes than the initial transfer | 22:09 |
fungi | no missing index warnings this time | 22:09 |
corvus | fungi: huzzah, thanks! | 22:09 |
corvus | so if we wanted to clear out the .bup dir on gerrit, now would probably be the time | 22:10 |
fungi | yes, i think so | 22:10 |
corvus | i'm in favor | 22:10 |
fungi | looks like it's 15gb | 22:10 |
clarkb | I'm still deep into school district q&a so can't help right now but I am also in favor | 22:10 |
fungi | i'll remove it and then start a backup there under screen | 22:10 |
clarkb | and maybe we update docs to say that we can clear that server dir if we rotate the remote backups too | 22:11 |
clarkb | I can write that change since its less time sensitive | 22:11 |
fungi | have at it | 22:11 |
fungi | ianw: i think the benefit of the rsync step is that you can do it while the backup server is offline. disabling sshd means ansible can't prepopulate homedirs, so you risk having a backup attempted when the homedir doesn't exist... though maybe that's fine after all, the end result is probably the same as if a backup is attempted with sshd stopped? | 22:13 |
fungi | oh, also the second zuul backup completed in a few minutes, and both local and remote .bups are still basically the same size | 22:20 |
fungi | as one expects | 22:21 |
fungi | okay, i've removed ~root/.bup on review01 and run `bup init` as root | 22:22 |
fungi | now running the backup command from its crontab in a root screen session there | 22:22 |
fungi | #status log rotated backup volume to main-202007/backups-202007 logical volume on backup01.ord.rax.ci.openstack.org | 22:24 |
openstackstatus | fungi: finished logging | 22:24 |
fungi | before i forget | 22:24 |
ianw | fungi: also, yeah these are the old puppet hosts; the ansible hosts are backing up to vexxhost | 22:24 |
fungi | oh, do we have a second backup server already? | 22:24 |
fungi | indeed, i did not notice that review.o.o is backing up to backup01.ca-ymq-1.vexxhost.opendev.org | 22:25 |
fungi | ianw: zuul01 isn't really an "old puppet host" though | 22:26 |
fungi | is it just awaiting switching to the new server? | 22:26 |
ianw | yeah, i think where things got stalled was converting everything to ansible-based backups, and then starting a new rax backup, and doing dual backups | 22:26 |
ianw | the ansible roles are all written so that we just drop another server in the backup-server group and it should "just work" ... install a separate cron job | 22:27 |
ianw | fungi: zuul may be a hybrid; i don't think we've started completely fresh servers, let me see... | 22:29 |
fungi | also, i guess shortly we'll have the answer to what happens if you blow away the local .bup but not the remote one | 22:30 |
ianw | it's not in the "ansible" backup list ... https://opendev.org/opendev/system-config/src/branch/master/inventory/service/groups.yaml#L23 | 22:31 |
fungi | got it, so its backups are still being configured by puppet, even though everything else on the server is ansible | 22:32 |
ianw | in fact, there's probably a good chance backups are not being configured | 22:32 |
ianw | they're just left over | 22:32 |
ianw | since the switch to containers | 22:32 |
fungi | which i guess is fine if we don't anticipate rebuilding those servers as-is | 22:32 |
clarkb | for zuul01 we want to backup the keys iirc | 22:33 |
clarkb | which should be backed up properly since they are bind mounted | 22:33 |
fungi | so on the new backup server so far the only things being backed up are review, review-dev and etherpad | 22:35 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add Zuul to backups group https://review.opendev.org/740824 | 22:38 |
ianw | fungi: ^ so we should probably go with that | 22:38 |
ianw | the idea was that as we dropped bup::site that would replace it | 22:39 |
ianw | until there were no more puppet hosts; then as i say, we drop in another backup server to have dual offsites | 22:39 |
clarkb | I think it would be good to go back to two remotes if possible | 22:39 |
ianw | that's certainly possible; all that needs to happen is to bring up another backup host and put it in the backup-server group | 22:41 |
ianw | s/host/server/ just to keep the terms consistent | 22:41 |
*** tosky has quit IRC | 22:42 | |
*** tkajinam has joined #opendev | 22:54 | |
fungi | okay, so this is strange | 22:57 |
fungi | on review01, trying to perform a backup is (eventually) failing with "IOError: [Errno 28] No space left on device" | 22:59 |
clarkb | is it filling / | 22:59 |
fungi | doesn't seem like it | 22:59 |
clarkb | or maybe /var/backups or similar type of spool? | 22:59 |
fungi | i don't see a full fs on either end | 22:59 |
fungi | rerunning again because i wasn't doing it under screen the first time | 23:00 |
fungi | but i wonder if this has been failing for a while | 23:00 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=32&rra_id=all shows a spike but not being full (could be it hit the limit then immediately went back under though?) | 23:01 |
fungi | oh, yep, i bet so | 23:02 |
*** boyvinall has quit IRC | 23:02 | |
fungi | the initial drop is from where i cleared .bup | 23:02 |
clarkb | maybe we need to clean up /home/gerrit2 before bup will be happy | 23:03 |
clarkb | I keep avoiding that because I'm scared | 23:03 |
fungi | i guess we don't have enough free space for the spool | 23:03 |
clarkb | we do it as a stream on the command line but bup itself must spool in order to chunk and checksum? | 23:04 |
fungi | yeah, especially if we're not actually successfully backing it up | 23:04 |
fungi | seems that way | 23:07 |
*** mlavalle has quit IRC | 23:08 | |
fungi | we could likely clear out a ton of ~gerrit2/index.backup.* files which may reduce the volume of data we're backing up (won't free up space at rest on the rootfs though as that's on a separate fs) | 23:08 |
clarkb | ya but the spooling is likely related to the input? | 23:08 |
fungi | i doubt those are of much use except to roll back if a reindex fails | 23:08 |
clarkb | fungi: any idea where the growth is? | 23:09 |
fungi | also some bundles like gerrit_backup_2016-04-11_maint.sql.gz and gerrit-to-restore-2017-09-21.sql.gz | 23:10 |
fungi | can you clarify what growth you mean? | 23:10 |
clarkb | "IOError: [Errno 28] No space left on device" <- basically what causes that | 23:10 |
fungi | oh, as in what file is it spooling to on the rootfs. i'll see if i can find out | 23:11 |
fungi | lsof will likely say what's open | 23:11 |
clarkb | cacti seems happy now at least | 23:11 |
clarkb | have you hit the issue more than once? | 23:11 |
fungi | we have a /tmp/repos dir we could clean up to free 1.2gb of the rootfs | 23:12 |
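Sizing up the cleanup candidates named above is cheap and nondestructive; the paths come from the log, and the age cutoff is an arbitrary illustration:

    du -sh /home/gerrit2/index.backup.* /tmp/repos
    ls -lh /home/gerrit2/gerrit_backup_2016-04-11_maint.sql.gz \
           /home/gerrit2/gerrit-to-restore-2017-09-21.sql.gz
    # index backups older than a year are likely shortlist material:
    find /home/gerrit2 -maxdepth 1 -name 'index.backup.*' -mtime +365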
clarkb | I had a set of notes around this | 23:14 |
clarkb | but then every time I sit down to deal with it I get paranoid about deleting things I shouldn't | 23:14 |
* clarkb trying to find it now | 23:14 | |
fungi | and yeah, i'm not seeing any unconstrained growth on the rootfs during this backup attempt | 23:16 |
clarkb | http://paste.openstack.org/show/BoP6WhVAe5XbXtf8gDUC/ | 23:17 |
clarkb | and then I made an etherpad from that | 23:17 |
clarkb | the rest of my day today has been completely shot by school stuff | 23:17 |
clarkb | I'll try to dig up the rest of my notes on ^ tomorrow as we should do that clenaup anyway | 23:17 |
fungi | sounds good, thanks | 23:19 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Backup all hosts with Ansible https://review.opendev.org/740827 | 23:21 |
ianw | fungi/clarkb: ^ so i think that lays out a plan ... the fatal flaw was probably that the puppet side was supposed to disappear more quickly than it has | 23:22 |
ianw | Data could not be sent to remote host "23.253.56.128". Make sure this host can be reached over ssh: Load key "/root/.ssh/id_rsa": invalid format | 23:27 |
ianw | that's a new one | 23:27 |
ianw | https://zuul.opendev.org/t/openstack/build/c3676edcffdd4c2583aaa823516ce01c | 23:27 |
fungi | new key format with old openssh? | 23:46 |
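If fungi's guess is right, the cause is that newer OpenSSH writes private keys in its own "-----BEGIN OPENSSH PRIVATE KEY-----" container, which older clients and some ssh libraries reject as invalid format. Converting in place is quick; the copy is just caution, and an unencrypted key is assumed:

    cp -p /root/.ssh/id_rsa /root/.ssh/id_rsa.orig
    ssh-keygen -p -P '' -N '' -m PEM -f /root/.ssh/id_rsa   # rewrite as PEM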
fungi | also the rootfs disk utilization on review01 is starting to grow again, but not as fast as during the previous backup attempt | 23:48 |
fungi | i can't find where the additional files are. entirely possible they're unlinked but open fds somewhere | 23:56 |
fungi | which would explain how they were immediately cleaned up when bup crashed rather than left behind | 23:57 |
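One way to confirm the unlinked-but-open theory during the next backup attempt: lsof's +L1 mode lists open files whose on-disk link count has dropped to zero. A sketch:

    lsof +L1 /      # deleted-but-still-open files on the rootfs
    # or inspect the running bup process directly ("(deleted)" targets):
    ls -l /proc/$(pgrep -f 'bup save' | head -n1)/fd | grep deleted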