*** ryohayakawa has joined #opendev | 00:04 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Setup gate inventory in /etc/ansible on bridge https://review.opendev.org/740605 | 01:04 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Setup gate inventory in /etc/ansible on bridge https://review.opendev.org/740605 | 01:30 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 01:30 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 01:30 |
*** sgw1 has quit IRC | 02:00 | |
openstackgerrit | Merged openstack/diskimage-builder master: Switch from unittest2 compat methods to Python 3.x methods https://review.opendev.org/739645 | 02:21 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use upload-docker-image in periodic jobs https://review.opendev.org/740560 | 02:25 |
*** sgw1 has joined #opendev | 02:29 | |
*** sgw1 has quit IRC | 02:48 | |
*** weshay_ruck is now known as weshay_pto | 03:04 | |
*** sgw1 has joined #opendev | 03:19 | |
openstackgerrit | wu.chunyang proposed openstack/diskimage-builder master: remove py35 in "V" cycle https://review.opendev.org/740607 | 03:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Copy generated inventory to bridge logs https://review.opendev.org/740605 | 03:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 03:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 03:41 |
*** DSpider has joined #opendev | 03:48 | |
*** sgw1 has quit IRC | 03:59 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 04:07 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: testinfra: silence yaml.load() warnings https://review.opendev.org/740608 | 04:10 |
*** raukadah is now known as chandankumar | 04:20 | |
*** sgw1 has joined #opendev | 04:23 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 04:29 |
*** sgw1 has quit IRC | 04:32 | |
*** sgw1 has joined #opendev | 04:33 | |
*** bhagyashris|away is now known as bhagyashris | 04:41 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 05:05 |
*** cloudnull has quit IRC | 05:23 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Fix junit error, add HTML report https://review.opendev.org/740609 | 05:25 |
*** marios has joined #opendev | 05:35 | |
*** fressi has joined #opendev | 05:36 | |
*** cloudnull has joined #opendev | 05:48 | |
*** ysandeep|away is now known as ysandeep | 05:50 | |
*** ysandeep is now known as ysandeep|afk | 05:51 | |
ianw | infra-root: https://review.opendev.org/#/q/topic:host-keys+(status:open+OR+status:merged) is a little stack to add host keys to our inventory, and automatically deploy them on bridge, and do a little cleanup | 06:14 |
*** halali_ has quit IRC | 06:15 | |
ianw | fungi: i haven't done a full debug, but 740609 failed in system-config-run-lists which looks unrelated -- https://zuul.opendev.org/t/openstack/build/a80165194c7f4f42a44477642a304c31 | 06:29 |
ianw | fungi: Error: Execution of '/usr/sbin/newlist mailman nobody@openstack.org notarealpassword' returned 1: Create a new, unpopulated mailing list. ... i wonder if the job is not happy? | 06:31 |
*** halali_ has joined #opendev | 06:33 | |
*** tosky has joined #opendev | 06:39 | |
*** ysandeep|afk is now known as ysandeep|rover | 07:53 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** ysandeep|rover is now known as ysandeep|lunch | 08:16 | |
*** dtantsur|afk is now known as dtantsur | 08:34 | |
*** ysandeep|lunch is now known as ysandeep|rover | 08:55 | |
*** sshnaidm|afk is now known as sshnaidm | 09:09 | |
openstackgerrit | Iury Gregory Melo Ferreira proposed openstack/diskimage-builder master: Update ipa jobs https://review.opendev.org/740642 | 09:27 |
frickler | infra-root: mirror01.london.linaro-london.openstack.org seems to still be running and trying to send mails, failing for lack of DNS records, any way to get that shut down? that region seems no longer to be in our clouds.yaml either | 09:29 |
*** halali_ has quit IRC | 09:43 | |
*** zbr has joined #opendev | 10:34 | |
*** finucannot is now known as stephenfin | 10:57 | |
*** halali_ has joined #opendev | 11:00 | |
*** bhagyashris is now known as bhagyashris|afk | 11:33 | |
*** tkajinam has quit IRC | 11:37 | |
*** rh-jelabarre has joined #opendev | 12:12 | |
*** rh-jelabarre has quit IRC | 12:12 | |
*** rh-jelabarre has joined #opendev | 12:12 | |
*** ryohayakawa has quit IRC | 12:29 | |
*** osmanlicilegi has quit IRC | 12:33 | |
*** bhagyashris|afk is now known as bhagyashris | 12:35 | |
*** osmanlicilegi has joined #opendev | 12:44 | |
*** zbr|ruck has quit IRC | 13:27 | |
*** zbr|ruck has joined #opendev | 13:28 | |
*** noonedeadpunk has quit IRC | 13:31 | |
*** noonedeadpunk has joined #opendev | 13:33 | |
*** bhagyashris is now known as bhagyashris|afk | 13:56 | |
*** dviroel has joined #opendev | 14:05 | |
*** ysandeep|rover is now known as ysandeep|away | 14:32 | |
fungi | ianw: looks like it complained about "illegal list name: <foo>@ lists" for every <foo> it tried | 14:33 |
fungi | i wonder if this is a behavior change with newer mailman | 14:33 |
fungi | ianw: though it's ubuntu xenial so that seems unlikely | 14:35 |
fungi | maybe we changed something related to name resolution on the nodes? | 14:36 |
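Context for the name-resolution theory: Mailman 2 builds new list addresses from its configured email host (DEFAULT_EMAIL_HOST in mm_cfg), so a node that cannot resolve its own name could plausibly produce the "illegal list name" failure quoted above. A hedged diagnostic sketch for a held test node; the exact failure mode here is an assumption:

    # Re-run the failing command from the job log by hand:
    /usr/sbin/newlist mailman nobody@openstack.org notarealpassword
    # What email host does Mailman think it serves? (stock Mailman 2 layout)
    python2 -c 'from Mailman import mm_cfg; print(mm_cfg.DEFAULT_EMAIL_HOST)'
    # And what does the node resolve for itself?
    hostname -f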
fungi | frickler: git history for system-config says that server's last known ip address is 213.146.141.37 | 14:39 |
fungi | and i can still ssh in, i could just locally initiate a poweroff for it | 14:40 |
fungi | but yeah, deleting will most likely require excavating the old api credentials from the private hostvars git history, assuming the api is still reachable | 14:41 |
clarkb | hrw may be able to rm that mirror node too if the api is not accessible | 14:41 |
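If the old API credentials can be excavated from the private hostvars history as fungi suggests, the cleanup is routine; the clouds.yaml entry name below is hypothetical, and whether the endpoint still answers is exactly the open question:

    # Sketch, assuming recovered credentials in a local cloud entry:
    openstack --os-cloud linaro-london server list
    openstack --os-cloud linaro-london server delete \
        mirror01.london.linaro-london.openstack.org
    # Failing that, ssh to 213.146.141.37 and poweroff locally, which
    # at least stops the failing mail attempts.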
*** mlavalle has joined #opendev | 14:55 | |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: maintain-github-mirror: add requests dependency https://review.opendev.org/740711 | 15:10 |
fungi | infra-root: rackspace says it had to reboot the hypervsior host for ze05 a few hours ago... i'll check it over | 15:13 |
fungi | oh, nope zm05 | 15:13 |
fungi | #status log zm05 rebooted by provider at 12:02 utc due to hypervisor host problem, provider trouble ticket 200713-ord-0000367 | 15:14 |
openstackstatus | fungi: finished logging | 15:14 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update to gitea v1.12.2 https://review.opendev.org/740716 | 15:34 |
clarkb | infra-root catching up on some things that happened last week and I was most curious about the bup git indexes. Did we end up rm'ing them and everything was fine afterwards? if so should we do the same to review.o.o? | 15:40 |
fungi | the zuul-merger process on zm05 seems to be running and getting used, and no obvious errors in its logs, so i'll close out the rax trouble ticket | 15:40 |
fungi | grafana says we're down one merger though, so seeing if i can work out which one is out to lunch | 15:41 |
fungi | aha, nevermind. it's ze01, so known | 15:42 |
clarkb | though we landed the change to vendor geard so we can probably turn it back on again? | 15:42 |
clarkb | it being ze01's executor | 15:42 |
fungi | yeah, i think we just didn't want to do it while nobody was around | 15:42 |
clarkb | I'm around if we want to do that nowish | 15:43 |
clarkb | mostly just digging through scrollback and emails to ensure I'm not missing anything important | 15:43 |
corvus | clarkb: yes i rm'd it; let me see if it looks like everything is fine | 15:43 |
corvus | hrm, last entry on the remote side is wed jul 8, so i think everything is not fine | 15:45 |
clarkb | I wonder if we need to create a new remote backup target if we reset the local indexes | 15:46 |
corvus | there are 2 bup processes currently running on zuul01 | 15:47 |
corvus | jul 9 and 10 | 15:47 |
corvus | i wonder if one of them is stuck due to the disk being full before | 15:47 |
corvus | how about i kill them (in reverse order) and see if the next one runs okay? | 15:48 |
clarkb | sounds good | 15:48 |
fungi | though status log says you removed /root/.bup from zuul01 2020-07-08 16:14:11 utc | 15:50 |
fungi | so if the oldest running backup started on 2020-07-09 that would be after the cleanup | 15:50 |
corvus | well, hrm. | 15:51 |
corvus | i dunno then. | 15:52 |
corvus | the remote side is getting pretty full. | 15:52 |
corvus | maybe we should just go ahead and do a rotation there anyway. | 15:52 |
*** marios is now known as marios|out | 15:54 | |
*** mlavalle has quit IRC | 15:57 | |
*** mlavalle has joined #opendev | 16:05 | |
fungi | yeah, looks like we didn't zero the root reserved allocation for the current volume | 16:06 |
fungi | oh, nevermind i'm looking at the older volume | 16:06 |
fungi | so yes, we're below half a percent free there | 16:07 |
fungi | and we last rotated it a little over a year ago, judging from the volume name | 16:08 |
fungi | would we also keep the volume currently mounted at /opt/backups-201711 or just blow it away and swap them? | 16:09 |
clarkb | rotating like that would simplify things, rather than making a new volume | 16:12 |
*** marios|out has quit IRC | 16:13 | |
corvus | i'd advocate blowing away 2017 and using its pvs to make 2020 | 16:18 |
*** fressi has left #opendev | 16:19 | |
*** diablo_rojo__ has joined #opendev | 16:21 | |
fungi | i'm in favor of that plan. i have to assume nothing writes to /opt/backups-201711 currently anyway | 16:25 |
clarkb | process for that is something like remount current backups to some new path, clear oldest backups, remount oldest backups fs to current backups? | 16:25 |
clarkb | fungi: ya we've basically done a rotation to keep the oldest set around | 16:25 |
fungi | i'm happy to work on that unless someone else already is | 16:26 |
clarkb | we may want to double check with ianw as ianw has done some backup stuff in the past but I think in this case repurposing space for oldest backups is safe | 16:26 |
clarkb | and no I'm not working on it | 16:26 |
*** diablo_rojo__ is now known as diablo_rojo | 16:26 | |
fungi | but yeah, the only thing i need to figure out is what's currently telling it to write into /opt/backups-201903 (a symlink?) and whether we need to bup init all the trees and chown stuff on the new fs | 16:27 |
fungi | aha, yep. symlink | 16:28 |
fungi | /opt/backups -> backups-201903 | 16:28 |
corvus | https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#rotating-backup-storage | 16:30 |
fungi | i guess we should make fresh homedirs for each of the users (copying from /etc/skel?), carry over their .ssh/authorized_keys files and then bup init as each of them to create an empty ~/.bup | 16:30 |
fungi | aha, we have docs, right! ;) | 16:30 |
fungi | (why do i always assume stuff like this isn't documented?) | 16:31 |
corvus | that looks fairly complete :) | 16:31 |
fungi | indeed it does, thanks | 16:31 |
corvus | it has a step of running a test backup manually | 16:31 |
corvus | we may want to do 2 of those | 16:31 |
fungi | sure | 16:31 |
corvus | a "normal" server and zuul01, in case zuul01 is still somehow broken | 16:31 |
fungi | so the remaining question, i know ianw wanted to build a new bup server, anyone happen to know the state of that? | 16:32 |
corvus | i don't, but i feel like we can/should consider that a second server | 16:32 |
fungi | i recall he was talking about doing the next rotation to a replacement server, but yeah, having redundant backups again would be good | 16:32 |
corvus | it's time to rotate the volumes on the primary server anyway, so i vote we just do that for now | 16:32 |
clarkb | system-config-run-backup exists and you can probably go from there to figure out general state but I agree with corvus that can be a second server rather than a replacement | 16:33 |
fungi | okay, cool, i'll get started on that here momentarily | 16:40 |
fungi | infra-root: any objections to ditching the main/backups volume (currently mounted at /opt/backups-201711) on backup01.ord.rax.ci.openstack.org and repurposing it for future backups so the currently near-full backup volume can be rotated out? | 16:42 |
corvus | fungi: no objections (you knew that already, but ftr) | 16:44 |
fungi | thanks | 16:46 |
clarkb | ya I think that is fine | 16:47 |
clarkb | corvus: can you think of any reason to not start ze01's executor again? the vendored gear code in the ansible role should address the last known problem with it right? | 16:57 |
clarkb | I'll go ahead and do that if that is the understanding | 16:57 |
corvus | clarkb: ++; i need to take care of some errands; should be back in a bit. feel free to start if you have time, or i can later | 17:03 |
clarkb | ya I can do it | 17:03 |
clarkb | I've got the errands this afternoon (school district is doing q&a on restarting schools in the fall at 2pm) so trying to be useful now | 17:04 |
clarkb | and done. I'll keep an eye on it | 17:04 |
clarkb | #status log Restarted zuul-executor container on ze01 now that we vendor gear in the logstash job submission role. | 17:05 |
openstackstatus | clarkb: finished logging | 17:05 |
clarkb | infra-root I'll plan to land https://review.opendev.org/737885 once I'm satisfied ze01 can be left alone. And https://review.opendev.org/740716 would be good to land too. Both are gitea improvements/upgrades | 17:06 |
*** dtantsur is now known as dtantsur|afk | 17:15 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use upload-docker-image in periodic jobs https://review.opendev.org/740560 | 17:22 |
clarkb | ze01 seems happy. I've now identified a tox job I'm following specifically | 17:34 |
clarkb | will use that as a canary | 17:34 |
clarkb | 2020-07-13 17:37:40,225 DEBUG zuul.AnsibleJob.output: [e: abb23df574fd4ababf35797c0dcbcae3] [build: ff644ee4f74b4e8596416af21bd31757] Ansible output: b"An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'gear'" | 17:38 |
clarkb | seems the vendoring hasn't fully worked (and we wouldn't have noticed until ze01 was turned on?) | 17:38 |
corvus | i just saw that :/ | 17:38 |
corvus | i'm guessing it has to do with the cwd of the python process when it loads the module | 17:39 |
clarkb | do we need a __init__.py in the library/ dir to make it a valid python module? | 17:39 |
clarkb | oh that could be too | 17:39 |
clarkb | corvus: we vendor toml things somewhere or am I imagining that? I wonder if we can replicate how that is done | 17:40 |
clarkb | I seem to recall something related to serialization like that anyway | 17:40 |
corvus | ya, i'll look into it | 17:40 |
fungi | so stopping ze01 now? | 17:40 |
corvus | clarkb: meanwhile, we could graceful ze01, or leave it running if we don't mind missing a few logstashes | 17:41 |
clarkb | I don't mind missing that data personally | 17:41 |
clarkb | e-r says we're way behind right now anyway | 17:41 |
corvus | okay, let's give me a few mins to see if there's a quick fix | 17:41 |
clarkb | and leaving it up will make it easier to test the fix | 17:41 |
fungi | yeah, i expect the worst side effect is users getting confused by the failed tasks in their successful jobs | 17:41 |
corvus | fungi: they'll never see it | 17:41 |
corvus | this is strictly post-log-upload | 17:42 |
fungi | oh, right, it's after log collection | 17:42 |
corvus | (you'd have to watch the streamer in real-time) | 17:42 |
fungi | (NECESSARILY after log collection, since we're processing collected logs) | 17:42 |
corvus | clarkb: the special ansible.module_utils thing is what you're thinking of that we use with toml | 17:42 |
corvus | i'm going to take a few mins to set up a repro env locally so we don't burn all day on this :) | 17:44 |
clarkb | I'm guessing if I look at logstash worker logs we'll find some new giant log files that are causing problems with indexing (and that is why we are behind) | 17:44 |
clarkb | corvus: ++ | 17:44 |
fungi | this is the tool which grew out of pip's vendoring approach: https://pypi.org/project/vendoring/ | 17:44 |
*** qchris has quit IRC | 18:08 | |
fungi | #status log old volume and volume group for main/backups unmounted, deactivated and deleted on backup01.ord.rax.ci.openstack.org | 18:16 |
openstackstatus | fungi: finished logging | 18:16 |
fungi | i suppose we want to continue to keep the pvs tied to separate vgs instead of extending a vg across all of them? | 18:17 |
fungi | we can tie specific pvs to specific lvs either way, but i guess we can revisit how these are organized when we build the new server | 18:18 |
clarkb | separating them seems good if we want to have more than one failure domain? | 18:18 |
fungi | it's irrelevant as far as that's concerned. vgs don't have any notion of consistency anyway unless you use mirroring | 18:19 |
clarkb | but if we mixed vgs across pvs losing a pv would lose both vgs? | 18:20 |
fungi | i was suggesting we think about putting a single volume group across the physical volumes, you can still tell it which physical volumes should contain the blocks for which logical volumes | 18:21 |
fungi | it's really mostly namespacing | 18:21 |
clarkb | gotcha | 18:21 |
fungi | anyway, irrelevant for the moment, i've already created the new vg across the old repurposed pvs | 18:22 |
fungi | main-202007 | 18:22 |
*** qchris has joined #opendev | 18:22 | |
fungi | okay, we've got 3.0T free on /opt/backups-202007 | 18:27 |
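For the record, the volume juggling described above reduces to standard LVM; device names below are hypothetical, while the vg and lv names match the log:

    umount /opt/backups-201711
    vgchange -an main && vgremove -f main     # retire the 2017 main/backups vg
    vgcreate main-202007 /dev/xvdb /dev/xvdc  # rebuild across the freed PVs
    lvcreate -l 100%FREE -n backups-202007 main-202007
    mkfs.ext4 -m 0 /dev/main-202007/backups-202007   # -m 0: no root reserve,
                                                     # per the note at 16:06
    mount /dev/main-202007/backups-202007 /opt/backups-202007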
fungi | i need to take a break to do some dinner prep, and will then tackle the rest of the cutover | 18:28 |
clarkb | thanks! | 18:28 |
openstackgerrit | James E. Blair proposed opendev/base-jobs master: Really vendor gear for log processing https://review.opendev.org/740744 | 18:40 |
corvus | clarkb: ^ i think that should do it (it at least gets past import errors in my local testing) | 18:41 |
clarkb | gotcha so there is an ansible method for doing that lgtm | 18:42 |
fungi | ah, module_utils is a special namespace i guess? | 19:43 |
fungi | ansible magic | 19:43 |
corvus | yep | 19:51 |
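What the 740744 fix amounts to: a bare `import gear` only works where the module file was written, because Ansible copies modules to the remote node before executing them; imports routed through the magic ansible.module_utils namespace get bundled into that payload. An illustrative layout, with role and file names that are assumptions rather than copies from the actual change:

    # roles/submit-logstash-jobs/            (illustrative role name)
    #   library/submit_log_processor_jobs.py
    #     -> uses: from ansible.module_utils import gear
    #   module_utils/gear.py                 (the vendored gear client)
    #
    # Ansible resolves ansible.module_utils imports against module_utils/
    # directories adjacent to the playbook or role and ships the matching
    # files to the remote alongside the module itself.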
fungi | mordred: not sure if you're around today, but i'm catching up on old stuff in my "i'll look at this later" pile, and a few weeks ago rax alerted us that "MySQL Replica on testmt-01-replica-2017-06-19-07-19-15-replica-2020-06-27-15-27-00 is CRITICAL" | 19:51 |
fungi | i assume this is some old trove replication test we no longer care about but have forgotten was set up, and we should delete it? | 19:52 |
clarkb | I've approved https://review.opendev.org/#/c/737885/7 to paginate more gitea requests for project management | 19:59 |
clarkb | it should be pretty well tested at this point but definitely say something if you notice it operating oddly | 19:59 |
fungi | corvus: you killed the bup processes on zuul01, right? just making sure i'm not overlooking them | 20:00 |
openstackgerrit | Merged opendev/base-jobs master: Really vendor gear for log processing https://review.opendev.org/740744 | 20:01 |
clarkb | fungi: corvus we should consider rm'ing review.o.o's bup indexes and restarting them on the new volume if zuul01 shows it is happy that way | 20:02 |
clarkb | it is quite large there as well | 20:02 |
fungi | i expect so, yes | 20:03 |
fungi | when we were looking into it, sounded like it would just maybe reduce performance of the next bup run since there would be no cache | 20:04 |
openstackgerrit | Merged zuul/zuul-jobs master: Strip path from default ensure_pip_from_upstream_interpreters https://review.opendev.org/740505 | 20:04 |
fungi | but if the next run is also a full backup, then it's probably irrelevant anyway | 20:04 |
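For reference, the step being weighed for review.o.o is what was already done on zuul01: the local ~/.bup holds only the client-side index and dedup cache, so clearing it loses no backed-up data. A sketch:

    rm -rf /root/.bup
    bup init    # fresh, empty local repository and cache
    # The next save re-hashes the whole tree (slow, and the remote can
    # grow, since bup no longer knows which packs it already sent);
    # runs after that are incremental again, as confirmed below.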
clarkb | we didn't rename any projects last week right? | 20:10 |
* clarkb is putting together our agenda for tomorrow | 20:11 | |
fungi | we did not, no | 20:16 |
fungi | at least we didn't take any downtime for it | 20:16 |
clarkb | thanks for confirming | 20:16 |
corvus | fungi: yes i killed the bups | 20:21 |
fungi | thanks for confirming | 20:22 |
fungi | i'll proceed with stopping sshd, switching mounts around, setting the old volume read-only and priming the new homedir copies | 20:23 |
*** shtepanie has joined #opendev | 20:43 | |
openstackgerrit | Merged opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 20:53 |
*** DSpider has quit IRC | 21:16 | |
fungi | while prepping homedirs for the new backups volume, i took the opportunity to omit a few which had no content on the prior volume (likely were already not being backed up by the time of the last rotation) as well as a couple where the servers had been replaced since the last rotation and so are no longer getting new data | 21:17 |
corvus | fungi: ++ | 21:17 |
*** boyvinall has joined #opendev | 21:17 | |
*** diablo_rojo has quit IRC | 21:18 | |
fungi | oh, also one service which has been decommissioned since the last rotation (groups.o.o) | 21:18 |
fungi | that leaves us with the following nine bup-* accounts: ask01 ethercalc02 etherpad lists review storyboard translate wiki zuulv3 | 21:19 |
fungi | hopefully there's nothing anyone thinks we're backing up which doesn't appear in that list | 21:20 |
fungi | and all the ~/.bup dirs in them have been initialized | 21:21 |
*** JayF has quit IRC | 21:23 | |
fungi | and the symlink pointed at the new file tree and sshd reenabled | 21:23 |
*** JayF has joined #opendev | 21:24 | |
fungi | based on previous backup sizes, i should probably start by testing ethercalc if i have any desire for it to wrap up in a reasonable amount of time | 21:25 |
fungi | i have a root screen session going on the ethercalc server where i'm testing the bup command from its crontab | 21:26 |
fungi | it spewed a warning about missing indices | 21:27 |
fungi | that's presumably to be expected | 21:27 |
fungi | warning: index pack-ac94d2c7004625e772e9c1cc623163ab30d9b37a.idx missing used by midx-36b8c644cf750bbfe70298e7b8453dd3da9f3b28.midx | 21:28 |
fungi | et cetera | 21:28 |
fungi | the size of ~bup-ethercalc02 (now on the new volume) is growing | 21:30 |
fungi | and it seems to have finished, exited 0 | 21:30 |
*** boyvinall has quit IRC | 21:30 | |
fungi | total size on the backup server 764M | 21:30 |
fungi | now to test zuul | 21:31 |
fungi | i have a root screen session going on zuul01 where i'm testing the bup command from its crontab now | 21:32 |
fungi | same missing index warnings | 21:32 |
fungi | ~/bup-zuulv3 growing on the backup server | 21:33 |
*** boyvinall has joined #opendev | 21:33 | |
corvus | i deem that to be promising :) | 21:40 |
fungi | 7.5gb accumulated for it on the backup server already | 21:40 |
*** boyvinall has quit IRC | 21:42 | |
*** boyvinall has joined #opendev | 21:43 | |
fungi | completed, exited 0, 22gb as stored on the backup server now | 22:06 |
ianw | fungi: system-config-run-lists re-ran and went ok, so i don't know, i guess it was just a transient error | 22:06 |
fungi | ianw: yeah, that was really, really strange. it looked like it could have been a name resolution problem | 22:06 |
fungi | corvus: should i do a second backup on zuul.o.o now to confirm it goes more quickly once primed? | 22:07 |
ianw | fungi: i think that these days we wouldn't need to pre-seed the storage volume on backups; the ansible roles should create things as required | 22:07 |
clarkb | fungi: how big is the bup stuff on zuul01 now? | 22:07 |
fungi | corvus: also the ~root/.bup dir on zuul.o.o is now 3.5gb | 22:07 |
clarkb | is it in the same magnitude as the previous contents? | 22:07 |
fungi | heh, you read my mind | 22:07 |
clarkb | ah cool so we saved like 20GB or something | 22:07 |
clarkb | in that case I think we should do similar with review | 22:08 |
fungi | yeah, seems safe | 22:08 |
fungi | i'll fire a second backup on zuul now to see how much faster it completes than the initial transfer | 22:09 |
fungi | no missing index warnings this time | 22:09 |
corvus | fungi: huzzah, thanks! | 22:09 |
corvus | so if we wanted to clear out the .bup dir on gerrit, now would probably be the time | 22:10 |
fungi | yes, i think so | 22:10 |
corvus | i'm in favor | 22:10 |
fungi | looks like it's 15gb | 22:10 |
clarkb | I'm still deep into school district q&a so can't help right now but I am also in favor | 22:10 |
fungi | i'll remove it and then start a backup there under screen | 22:10 |
clarkb | and maybe we update docs to say that we can clear that server dir if we rotate the remote backups too | 22:11 |
clarkb | I can write that change since its less time sensitive | 22:11 |
fungi | have at it | 22:11 |
fungi | ianw: i think the benefit of the rsync step is that you can do it while the backup server is offline. disabling sshd means ansible can't prepopulate homedirs, so you risk having a backup attempted when the homedir doesn't exist... though maybe that's fine after all, the end result is probably the same as if a backup is attempted with sshd stopped? | 22:13 |
fungi | oh, also the second zuul backup completed in a few minutes, and both local and remote .bups are still basically the same size | 22:20 |
fungi | as one expects | 22:21 |
fungi | okay, i've removed ~root/.bup on review01 and run `bup init` as root | 22:22 |
fungi | now running the backup command from its crontab in a root screen session there | 22:22 |
fungi | #status log rotated backup volume to main-202007/backups-202007 logical volume on backup01.ord.rax.ci.openstack.org | 22:24 |
openstackstatus | fungi: finished logging | 22:24 |
fungi | before i forget | 22:24 |
ianw | fungi: also, yeah these are the old puppet hosts; the ansible hosts are backing up to vexxhost | 22:24 |
fungi | oh, do we have a second backup server already? | 22:24 |
fungi | indeed, i did not notice that review.o.o is backing up to backup01.ca-ymq-1.vexxhost.opendev.org | 22:25 |
fungi | ianw: zuul01 isn't really an "old puppet host" though | 22:26 |
fungi | is it just awaiting switching to the new server? | 22:26 |
ianw | yeah, i think where things got stalled was converting everything to ansible-based backups, and then starting a new rax backup, and doing dual backups | 22:26 |
ianw | the ansible roles are all written so that we just drop another server in the backup-server group and it should "just work" ... install a separate cron job | 22:27 |
ianw | fungi: zuul may be a hybrid; i don't think we've started completely fresh servers, let me see... | 22:29 |
fungi | also, i guess shortly we'll have the answer to what happens if you blow away the local .bup but not the remote one | 22:30 |
ianw | it's not in the "ansible" backup list ... https://opendev.org/opendev/system-config/src/branch/master/inventory/service/groups.yaml#L23 | 22:31 |
fungi | got it, so its backups are still being configured by puppet, even though everything else on the server is ansible | 22:32 |
ianw | in fact, there's probably a good chance backups are not being configured | 22:32 |
ianw | they're just left over | 22:32 |
ianw | since the switch to containers | 22:32 |
fungi | which i guess is fine if we don't anticipate rebuilding those servers as-is | 22:32 |
clarkb | for zuul01 we want to backup the keys iirc | 22:33 |
clarkb | which should be backed up properly since they are bind mounted | 22:33 |
fungi | so on the new backup server so far the only things being backed up are review, review-dev and etherpad | 22:35 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add Zuul to backups group https://review.opendev.org/740824 | 22:38 |
ianw | fungi: ^ so we should probably go with that | 22:38 |
ianw | the idea was that as we dropped bup::site that would replace it | 22:39 |
ianw | until there were no more puppet hosts; then as i say, we drop in another backup server to have dual offsites | 22:39 |
clarkb | I think it would be good to go back to two remotes if possible | 22:39 |
ianw | that's certainly possible; all that needs to happen is to bring up another backup host and put it in the backup-server group | 22:41 |
ianw | s/host/server/ just to keep the terms consistent | 22:41 |
*** tosky has quit IRC | 22:42 | |
*** tkajinam has joined #opendev | 22:54 | |
fungi | okay, so this is strange | 22:57 |
fungi | on review01, trying to perform a backup is (eventually) failing with "IOError: [Errno 28] No space left on device" | 22:59 |
clarkb | is it filling / | 22:59 |
fungi | doesn't seem like it | 22:59 |
clarkb | or maybe /var/backups or similar type of spool? | 22:59 |
fungi | i don't see a full fs on either end | 22:59 |
fungi | rerunning again because i wasn't doing it under screen the first time | 23:00 |
fungi | but i wonder if this has been failing for a while | 23:00 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=32&rra_id=all shows a spike but not being full (could be it hit the limit then immediately went back under though?) | 23:01 |
fungi | oh, yep, i bet so | 23:02 |
*** boyvinall has quit IRC | 23:02 | |
fungi | the initial drop is from where i cleared .bup | 23:02 |
clarkb | maybe we need to clean up /home/gerrit2 before bup will be happy | 23:03 |
clarkb | I keep avoiding that because I'm scared | 23:03 |
fungi | i guess we don't have enough free space for the spool | 23:03 |
clarkb | we do it as a stream on the command line but bup itself must spool in order to chunk and checksum? | 23:04 |
fungi | yeah, especially if we're not actually successfully backing it up | 23:04 |
fungi | seems that way | 23:07 |
*** mlavalle has quit IRC | 23:08 | |
fungi | we could likely clear out a ton of ~gerrit2/index.backup.* files which may reduce the volume of data we're backing up (won't free up space at rest on the rootfs though as that's on a separate fs) | 23:08 |
clarkb | ya but the spooling is likely related to the input? | 23:08 |
fungi | i doubt those are of much use except to roll back if a reindex fails | 23:08 |
clarkb | fungi: any idea where the growth is? | 23:09 |
fungi | also some bundles like gerrit_backup_2016-04-11_maint.sql.gz and gerrit-to-restore-2017-09-21.sql.gz | 23:10 |
fungi | can you clarify what growth you mean? | 23:10 |
clarkb | "IOError: [Errno 28] No space left on device" <- basically what causes that | 23:10 |
fungi | oh, as in what file is it spooling to on the rootfs. i'll see if i can find out | 23:11 |
fungi | lsof will likely say what's open | 23:11 |
clarkb | cacti seems happy now at least | 23:11 |
clarkb | have you hit the issue more than once? | 23:11 |
fungi | we have a /tmp/repos dir we could clean up to free 1.2gb of the rootfs | 23:12 |
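Sizing up the cleanup candidates named above is cheap and nondestructive; the paths come from the log, and the age cutoff is an arbitrary illustration:

    du -sh /home/gerrit2/index.backup.* /tmp/repos
    ls -lh /home/gerrit2/gerrit_backup_2016-04-11_maint.sql.gz \
           /home/gerrit2/gerrit-to-restore-2017-09-21.sql.gz
    # index backups older than a year are likely shortlist material:
    find /home/gerrit2 -maxdepth 1 -name 'index.backup.*' -mtime +365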
clarkb | I had a set of notes around this | 23:14 |
clarkb | but then every time I sit down to deal with it I get paranoid about deleting things I shouldn't | 23:14 |
* clarkb trying to find it now | 23:14 | |
fungi | and yeah, i'm not seeing any unconstrained growth on the rootfs during this backup attempt | 23:16 |
clarkb | http://paste.openstack.org/show/BoP6WhVAe5XbXtf8gDUC/ | 23:17 |
clarkb | and then I made an etherpad from that | 23:17 |
clarkb | the rest of my day today has been completely shot by school stuff | 23:17 |
clarkb | I'll try to dig up the rest of my notes on ^ tomorrow as we should do that clenaup anyway | 23:17 |
fungi | sounds good, thanks | 23:19 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Backup all hosts with Ansible https://review.opendev.org/740827 | 23:21 |
ianw | fungi/clarkb: ^ so i think that lays out a plan ... the fatal flaw was probably that the puppet side was supposed to disappear more quickly than it has | 23:22 |
ianw | Data could not be sent to remote host "23.253.56.128". Make sure this host can be reached over ssh: Load key "/root/.ssh/id_rsa": invalid format | 23:27 |
ianw | that's a new one | 23:27 |
ianw | https://zuul.opendev.org/t/openstack/build/c3676edcffdd4c2583aaa823516ce01c | 23:27 |
fungi | new key format with old openssh? | 23:46 |
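If fungi's guess is right, the cause is that newer OpenSSH writes private keys in its own "-----BEGIN OPENSSH PRIVATE KEY-----" container, which older clients and some ssh libraries reject as invalid format. Converting in place is quick; the copy is just caution, and an unencrypted key is assumed:

    cp -p /root/.ssh/id_rsa /root/.ssh/id_rsa.orig
    ssh-keygen -p -P '' -N '' -m PEM -f /root/.ssh/id_rsa   # rewrite as PEM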
fungi | also the rootfs disk utilization on review01 is starting to grow again, but not as fast as during the previous backup attempt | 23:48 |
fungi | i can't find where the additional files are. entirely possible they're unlinked but open fds somewhere | 23:56 |
fungi | which would explain how they were immediately cleaned up when bup crashed rather than left behind | 23:57 |
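One way to confirm the unlinked-but-open theory during the next backup attempt: lsof's +L1 mode lists open files whose on-disk link count has dropped to zero. A sketch:

    lsof +L1 /      # deleted-but-still-open files on the rootfs
    # or inspect the running bup process directly ("(deleted)" targets):
    ls -l /proc/$(pgrep -f 'bup save' | head -n1)/fd | grep deleted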