*** tosky has quit IRC | 00:02 | |
ianw | ok, now we have a problem that the backup volume is full in vexxhost so i can't create the new user/home for wiki backup | 00:02 |
---|---|---|
ianw | i'm going to try that ethercalc prune (noop first) | 00:03 |
*** iurygregory has quit IRC | 00:06 | |
ianw | OSError: [Errno 28] No space left on device | 00:07 |
ianw | hrmmm | 00:07 |
clarkb | ianw: I wonder if we should've been setting the additional_free_space setting | 00:09 |
ianw | i'm moving borg-translate01 (22g) to /opt directly to free up some space temporarily | 00:09 |
clarkb | ok | 00:10 |
ianw | clarkb: i guess http://paste.openstack.org/show/801749/ looks about right? | 00:14 |
ianw | i feel like give it a go and see how much gets freed | 00:14 |
clarkb | ianw: ya that looks about right. my only other thought is --keep-monthly 12 would probably be nice | 00:15 |
clarkb | but unlikely to have much effect here since borg is recent | 00:15 |
clarkb | (its backwards to the output from borg list so took me a second to reverse sort) | 00:15 |
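To make the retention discussion above concrete, here is a minimal sketch of how daily/weekly/monthly keep rules select archives. It is a simplified model written for illustration, not borg's own code: the 60-day history and the `keep_daily=7` and `keep_weekly=4` values are assumptions (only `--keep-monthly 12` is mentioned in the channel), and the function name `prune_keep` is hypothetical.

```python
from datetime import date, timedelta


def prune_keep(archive_dates, keep_daily=0, keep_weekly=0, keep_monthly=0):
    """Return the set of archive dates a daily/weekly/monthly policy would keep.

    Simplified model: rules are applied from the finest to the coarsest
    period, the newest archive in each period is the candidate, and an
    archive already kept by an earlier rule does not use up a later
    rule's quota.
    """
    newest_first = sorted(archive_dates, reverse=True)
    rules = [
        (keep_daily, lambda d: d),                     # one per calendar day
        (keep_weekly, lambda d: d.isocalendar()[:2]),  # one per ISO week
        (keep_monthly, lambda d: (d.year, d.month)),   # one per month
    ]
    kept = set()
    for quota, period_of in rules:
        selected = 0
        last_period = None
        for d in newest_first:
            if selected >= quota:
                break
            period = period_of(d)
            if period == last_period:
                continue  # only the newest archive of a period is a candidate
            last_period = period
            if d not in kept:
                kept.add(d)
                selected += 1
    return kept


# Hypothetical history: one archive per day for 60 days ending 2021-01-20.
archives = [date(2021, 1, 20) - timedelta(days=n) for n in range(60)]
kept = prune_keep(archives, keep_daily=7, keep_weekly=4, keep_monthly=12)
print(f"{len(kept)} of {len(archives)} archives kept")
```

With a short history like this one, the monthly quota of 12 is never exhausted, which matches the remark that `--keep-monthly 12` would have little effect while the borg backups are still recent.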
*** artom has quit IRC | 00:16 | |
ianw | 106G borg-ask01 | 00:16 |
ianw | 1.8G borg-ethercalc02 | 00:16 |
ianw | 188G borg-etherpad01 | 00:16 |
ianw | 203G borg-gitea01 | 00:16 |
ianw | 5.2G borg-lists | 00:16 |
ianw | 4.9G borg-review-dev01 | 00:16 |
ianw | 446G borg-review01 | 00:16 |
ianw | 29G borg-storyboard01 | 00:16 |
*** artom has joined #opendev | 00:16 | |
ianw | 0 borg-translate01 | 00:16 |
ianw | 4.3G borg-zuul01 | 00:16 |
ianw | for reference | 00:16 |
ianw | etherpad seems too large, we probably should look at exclusions more closely there | 00:17 |
clarkb | ianw: it's probably due to the large database backups there | 00:17 |
clarkb | and ya maybe we can make that better by not keeping as many local db backups | 00:17 |
mordred | is it backing up any old historical backups? | 00:17 |
clarkb | or instruct borg to only backup the most recent db backup | 00:17 |
mordred | yeah | 00:17 |
mordred | I think that | 00:17 |
mordred | backing up the rotated backups is wasteful | 00:18 |
clarkb | ++ I like having the local db backups if we can keep them and telling borg to only look at the most recent one is a good workaround to that I guess | 00:18 |
mordred | ++ yeah - local rotated backups is super helpful for ease of use | 00:18 |
ianw | i'm going to try that prune on ethercalc, even though it's small, now | 00:18 |
mordred | maybe an exclusion with the .gz$ or [0-9].gz or whatever | 00:18 |
clarkb | ianw: ++ | 00:18 |
mordred | ++ | 00:18 |
clarkb | mordred: ianw ya keep in mind though that I think logrotate makes it weird where we end up with a 0 byte file and its the .1.gz that is most recent | 00:19 |
clarkb | but ya I assume we can do it with a matcher of some sort | 00:19 |
mordred | ++ | 00:19 |
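As a rough sketch of the matcher idea above: a regular expression can skip older rotated dumps while leaving the `.1.gz` generation (the most recent real dump under the logrotate behaviour clarkb describes) in the backup. The file names below are hypothetical, and the pattern is an illustration rather than the exclusion that was eventually deployed. Borg does accept regular-expression excludes with an `re:` prefix, but check how it anchors paths before reusing this pattern there.

```python
import re

# Hypothetical naming: logrotate leaves "<name>.sql.gz" (possibly empty),
# "<name>.sql.gz.1.gz" (most recent real dump), then ".2.gz", ".3.gz", ...
# The pattern matches rotation numbers 2 and above, but not 1.
OLDER_ROTATIONS = re.compile(r"\.(?:[2-9]|[1-9][0-9]+)\.gz$")


def should_exclude(path):
    """Return True for rotated dumps older than the .1.gz generation."""
    return OLDER_ROTATIONS.search(path) is not None


# Made-up paths for illustration only.
for name in (
    "/var/backups/etherpad-mariadb/etherpad.sql.gz",
    "/var/backups/etherpad-mariadb/etherpad.sql.gz.1.gz",
    "/var/backups/etherpad-mariadb/etherpad.sql.gz.2.gz",
    "/var/backups/etherpad-mariadb/etherpad.sql.gz.10.gz",
):
    print("exclude" if should_exclude(name) else "keep   ", name)
```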
ianw | heh, so it pruned to ... 1.8Gborg-ethercalc02 | 00:20 |
ianw | i guess its deltas are very efficient | 00:20 |
clarkb | that service might be a bad test because ya that | 00:20 |
clarkb | ianw: it sounds like there is also a --compress option to the backup step | 00:20 |
clarkb | are we using that? If not I bet that would help with space usage too | 00:20 |
clarkb | ianw: ask and gitea01 also do local db backups with rotation so may be good indicators | 00:22 |
clarkb | (review too) | 00:22 |
ianw | /var/backups/etherpad-mariadb# du -h | 00:22 |
ianw | 5.6G . | 00:22 |
ianw | but i wonder if having it as .gz files prevents more effective delta updates? | 00:23 |
clarkb | ianw: likely yes | 00:23 |
clarkb | I wonder if --compress is smart about that somehow | 00:24 |
ianw | it would probably be better to have the latest uncompressed, then have logrotate compress and rotate that locally | 00:24 |
clarkb | we may not have enough disk space for that though on etherpad | 00:24 |
fungi | okay, doing mirror.epel now | 00:24 |
clarkb | but ya that is a potential thing we could try | 00:24 |
clarkb | also are gitea backups doing another set of git backups I wonder | 00:25 |
clarkb | between review and gitea I mean | 00:25 |
clarkb | not the worst thing but maybe another place we can prune | 00:25 |
ianw | yeah i wasn't sure if we needed gitea at all | 00:25 |
clarkb | ianw: on gitea we want the db backups as that preserves our redirects in the database | 00:26 |
clarkb | ianw: but I don't think we need anything else from it | 00:26 |
ianw | yeah i think we're definitely getting the git trees ... | 00:26 |
*** DSpider has quit IRC | 00:28 | |
ianw | clarkb: so do you think we can exclude /var/gitea? | 00:33 |
clarkb | ianw: I think so, if not the entirety of that dir at least /var/gitea/data/git (I think that is the path but going from memory there) | 00:34 |
clarkb | since we're backing those up on the gerrit side | 00:34 |
ianw | /var/backups/gitea-mariadb# du -h | 00:34 |
ianw | 97M . | 00:34 |
clarkb | ya the gitea db is very small | 00:34 |
clarkb | its largely just we have these projects and redirects since we don't do issues and wiki and users | 00:34 |
ianw | yeah i can't see anything under /var/gitea that isn't covered by config mgmt | 00:35 |
clarkb | ssl certs may be the only thing? | 00:35 |
ianw | access logs | 00:35 |
clarkb | oh ya ++ to those | 00:35 |
*** iurygregory has joined #opendev | 00:43 | |
*** artom has quit IRC | 00:45 | |
fungi | ssl certs are presumably not valuable because le will just issue more automatically, right? | 01:00 |
*** stevebaker has quit IRC | 01:03 | |
*** mlavalle has quit IRC | 01:05 | |
ianw | it seems like you can't "--exclude /var/lib/gitea" --include "/var/lib/gitea/logs" | 01:07 |
ianw | # The file '/home/user/cache/important' is *not* backed up: | 01:07 |
ianw | $ borg create -e /home/user/cache/ backup / /home/user/cache/important | 01:07 |
ianw | the etherpad dump is 15905104958 bytes | 01:20 |
ianw | 16gb | 01:20 |
*** hamalq has quit IRC | 01:26 | |
fungi | full release of mirror.epel finished and we're already well into the catch-up pass across the volumes. once they're done i'll remove my locks and we can look at getting ianw's release serialization change deployed i think | 02:02 |
ianw | fungi: where is the content to backup; in /srv/mediawiki? | 02:07 |
fungi | it's scattered all throughout there. in the puppeted version i've extracted the stateful data away from the deployed software and configuration, but on that production server it's quite comingled | 02:11 |
fungi | and honestly, since the deployment and configuration aren't well understood yet, we probably need to be backing them up there anyway | 02:12 |
fungi | oh, maybe i misunderstood your question, yes we should back up (all of) /srv/mediawiki | 02:13 |
ianw | hrm, srv is 11G, but I guess fairly stable? | 02:13 |
fungi | yeah, images are the main thing which change on it (that's where uploaded files wind up) | 02:13 |
fungi | and the lucene index lives in there so it changes when it's regenerated | 02:13 |
ianw | i have everything deployed but we're going to need to free up some space or get some more | 02:13 |
ianw | (backup space) | 02:14 |
fungi | i was half following, sorry, were you able to work out the pruning? | 02:19 |
ianw | fungi: umm, sort of. i think we've uncovered a number of things | 02:31 |
ianw | pruning down to weekly, monthly we can do on command line | 02:31 |
ianw | space efficiency is something to think about: gzipping the database removes borg's ability to de-dup | 02:32 |
ianw | gzipping the daily database dumps | 02:32 |
ianw | and we can prune a bunch of directories from gitea at least | 02:32 |
*** stevebaker has joined #opendev | 02:37 | |
auristor | ianw: I see that the afs01.dfw volserver is idle. just to note in case it was missed that the "docs" and "mirror.fedora" RO volumes are still new on afs01 and old on afs02. "docs" is also locked which might mean a release was in flight when afs01 died. | 02:44 |
ianw | auristor: thanks for looking in! :) it looks like fungi has dropped locks and the fedora mirror process is running now, so that's expected | 02:46 |
fungi | well, i'm still holding a (non-afs) lock which prevents our normal mirror content updates, and have been steadily going through them in a serialized fashion until we get them caught up to present | 02:46 |
fungi | which i'm hoping will be in the next hour or two | 02:47 |
ianw | fungi: sorry, i just unlocked the docs volume, but somewhat accidentally pasted in the release command too | 02:49 |
ianw | i can kill it or just let it run; i think i'm tending to the latter | 02:51 |
ianw | i don't know why it failed to release but given all the recent commotion nothing would surprise me | 02:51 |
*** openstackgerrit has joined #opendev | 02:58 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup: prune after successful backup https://review.opendev.org/c/opendev/system-config/+/771531 | 02:58 |
*** hemanth_n has joined #opendev | 03:12 | |
fungi | ianw: i wasn't holding any lock for the docs volume, just the mirror volumes | 03:52 |
fungi | and now i've released them all as the updates indicate having all completed | 03:53 |
ianw | fungi: thanks! great to be back. i'm just letting the docs one run now | 03:56 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea backup: prune some large directories https://review.opendev.org/c/opendev/system-config/+/771534 | 05:02 |
*** ykarel|away has joined #opendev | 05:03 | |
*** hemanth_n has quit IRC | 05:07 | |
*** hemanth_n has joined #opendev | 05:07 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup: fix logrotate name https://review.opendev.org/c/opendev/system-config/+/771557 | 05:12 |
*** iurygregory has quit IRC | 05:33 | |
ianw | ok, i've run rdiff on the two gzipped mysql dumps and the delta is as large as the file itself | 05:40 |
ianw | from etherpad | 05:41 |
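The rdiff result above can be reproduced in miniature. The sketch below builds two synthetic SQL dumps that differ in a few rows, splits them into content-defined chunks, and counts how many chunks of the second dump already exist in the first, once for the plain dumps and once for their compressed forms. Everything here is an assumption made for illustration: the dump contents are fabricated, the chunker is a simple gear rolling hash and not borg's own chunker, and `zlib` stands in for gzip because both use the same deflate algorithm.

```python
import hashlib
import random
import zlib

# Deterministic table of random values for the gear rolling hash.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 10) - 1  # a boundary roughly every 1 KiB on average


def chunks(data):
    """Split data at content-defined boundaries using a gear rolling hash."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def shared_fraction(old, new):
    """Fraction of new's chunks whose hash already exists among old's chunks."""
    seen = {hashlib.sha256(c).digest() for c in chunks(old)}
    new_chunks = chunks(new)
    hits = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    return hits / len(new_chunks)


def fake_dump(changed_rows=()):
    """Build a synthetic SQL dump; rows in changed_rows get new, longer values."""
    rows = []
    for i in range(4000):
        edited = i in changed_rows
        seed = f"v2-{i}" if edited else f"v1-{i}"
        value = hashlib.sha256(seed.encode()).hexdigest()
        if edited:
            value += "-edited"
        rows.append(f"INSERT INTO store VALUES ('pad:{i}', '{value}');\n")
    return "".join(rows).encode()


monday = fake_dump()
tuesday = fake_dump(changed_rows=set(range(0, 4000, 100)))  # 1% of rows change

raw = shared_fraction(monday, tuesday)
gz = shared_fraction(zlib.compress(monday), zlib.compress(tuesday))
print(f"chunks reusable, plain dumps:      {raw:.0%}")
print(f"chunks reusable, compressed dumps: {gz:.0%}")
```

The plain dumps should share most of their chunks, while the compressed streams should share very few, because a small change early in the input alters the compressed bytes that follow it. That is the effect that makes a gzipped daily dump look almost entirely new to a deduplicating backup.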
ianw | it looks like we can actually create a borg archive from stdin. i.e. dump the db directly into borg as a separate archive. | 05:57 |
ianw | i think that's going to be better; hosts can still dump their db's to disk but we can just ignore that in the backups | 05:58 |
ianw | that'll be tomorrow, if clarkb doesn't beat me to it :) | 05:59 |
*** zbr5 has joined #opendev | 06:04 | |
*** zbr has quit IRC | 06:06 | |
*** zbr5 is now known as zbr | 06:06 | |
*** ykarel_ has joined #opendev | 06:17 | |
*** ykarel|away has quit IRC | 06:19 | |
*** marios has joined #opendev | 06:22 | |
*** ykarel_ is now known as ykarel | 06:26 | |
*** slaweq has joined #opendev | 07:04 | |
*** slaweq has quit IRC | 07:30 | |
openstackgerrit | Rico Lin proposed openstack/project-config master: Add ubuntu-bionic-arm64-xlarge https://review.opendev.org/c/openstack/project-config/+/771565 | 07:30 |
*** eolivare has joined #opendev | 07:31 | |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: Use urlencoded filenames in test fixtures https://review.opendev.org/c/zuul/zuul-jobs/+/771566 | 07:38 |
*** slaweq has joined #opendev | 08:00 | |
*** hashar has joined #opendev | 08:03 | |
*** fressi has joined #opendev | 08:07 | |
*** andrewbonney has joined #opendev | 08:09 | |
*** sboyron_ has joined #opendev | 08:12 | |
*** rpittau|afk is now known as rpittau | 08:17 | |
*** sboyron__ has joined #opendev | 08:38 | |
*** sboyron_ has quit IRC | 08:41 | |
*** hemanth_n has quit IRC | 08:41 | |
*** stevebaker has quit IRC | 08:41 | |
*** hemanth_n has joined #opendev | 08:41 | |
*** akahat|rover is now known as akahat|lunch | 08:46 | |
*** tosky has joined #opendev | 08:47 | |
*** DSpider has joined #opendev | 08:48 | |
*** jpena|off is now known as jpena | 08:54 | |
*** raukadah has quit IRC | 09:18 | |
*** tristanC has quit IRC | 09:18 | |
*** raukadah has joined #opendev | 09:20 | |
*** tristanC has joined #opendev | 09:20 | |
*** brinzhang_ has quit IRC | 09:34 | |
*** ysandeep is now known as ysandeep|afk | 09:45 | |
*** brinzhang has joined #opendev | 09:52 | |
*** klonn has joined #opendev | 10:07 | |
*** akahat|lunch is now known as akahat|rover | 10:09 | |
*** ysandeep|afk is now known as ysandeep | 10:17 | |
*** rpittau is now known as rpittau|bbl | 10:20 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: bindep: remove set_fact usage when converting string to list https://review.opendev.org/c/zuul/zuul-jobs/+/771585 | 10:24 |
*** priteau has joined #opendev | 10:24 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 10:43 | |
*** dtantsur|afk is now known as dtantsur | 10:44 | |
*** hashar has quit IRC | 10:50 | |
*** rpittau|bbl is now known as rpittau | 11:24 | |
*** iurygregory has joined #opendev | 11:27 | |
*** sboyron__ has quit IRC | 11:30 | |
*** klonn has quit IRC | 11:31 | |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445 | 11:47 |
*** jpena is now known as jpena|lunch | 12:29 | |
*** sboyron has joined #opendev | 12:45 | |
*** klonn has joined #opendev | 12:47 | |
openstackgerrit | Radosław Piliszek proposed opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule https://review.opendev.org/c/opendev/irc-meetings/+/771642 | 12:49 |
openstackgerrit | Merged opendev/git-review master: Drop support for py27 https://review.opendev.org/c/opendev/git-review/+/770556 | 13:04 |
openstackgerrit | Merged opendev/git-review master: Assure git-review works with py37 and py38 https://review.opendev.org/c/opendev/git-review/+/770641 | 13:05 |
*** artom has joined #opendev | 13:22 | |
*** ysandeep is now known as ysandeep|afk | 13:24 | |
auristor | ianw fungi: the "docs" volume has still not released properly. Looking more carefully, its second RO site is afs01.ord not afs02.dfw and afs01.ord is not responding. | 13:25 |
*** jpena|lunch is now known as jpena | 13:28 | |
*** whoami-rajat___ has joined #opendev | 13:31 | |
*** brinzhang has quit IRC | 13:37 | |
openstackgerrit | Merged opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule https://review.opendev.org/c/opendev/irc-meetings/+/771642 | 13:38 |
*** michael-mcaleer has joined #opendev | 13:43 | |
*** sboyron has quit IRC | 13:48 | |
*** brinzhang has joined #opendev | 13:49 | |
*** brinzhang has quit IRC | 13:51 | |
*** sboyron has joined #opendev | 13:51 | |
*** brinzhang has joined #opendev | 13:51 | |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445 | 14:13 |
*** hemanth_n has quit IRC | 14:20 | |
*** zoharm has joined #opendev | 14:37 | |
fungi | auristor: interesting, i agree vos status says it's not reachable. when i ssh into it afsd is running and the openafs lkm is loaded, i'll have to dig deeper into it after some morning errands and meetings | 14:38 |
fungi | thanks for the heads up! | 14:38 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add policy about overriding role input variables https://review.opendev.org/c/zuul/zuul-jobs/+/771655 | 15:00 |
*** whoami-rajat___ is now known as whoami-rajat__ | 15:02 | |
*** hashar has joined #opendev | 15:04 | |
*** klonn has quit IRC | 15:06 | |
*** d34dh0r53 has quit IRC | 15:12 | |
*** d34dh0r53 has joined #opendev | 15:19 | |
*** slaweq has quit IRC | 15:21 | |
*** slaweq has joined #opendev | 15:23 | |
*** ysandeep|afk is now known as ysandeep | 15:31 | |
clarkb | that is the one we upgraded to 1.8 right? | 15:32 |
clarkb | maybe the key conversion thing didn't go properly? | 15:32 |
*** fressi has quit IRC | 15:39 | |
*** sboyron has quit IRC | 15:45 | |
*** klonn has joined #opendev | 15:50 | |
*** sboyron has joined #opendev | 16:02 | |
*** ykarel has quit IRC | 16:21 | |
*** mlavalle has joined #opendev | 16:26 | |
auristor | fungi: afsd is the client not the servers | 16:37 |
fungi | oh, right | 16:37 |
auristor | the servers are bosserver, dafileserver, davolserver, dasalvageserver | 16:37 |
fungi | clarkb: they're all upgraded to 1.8 | 16:37 |
clarkb | oh thats already done? /me so far behind | 16:37 |
fungi | auristor: yep, i think those are what's not running. maybe they didn't get started automatically at boot, i'll be able to fiddle with it in a couple hours | 16:38 |
fungi | er, nevermind, bad grep. bosserver, dafileserver, davolserver are all in the process table (no dasalvageserver though) | 16:39 |
fungi | in a couple more hours i should be in a position to be able to start digging in logs | 16:40 |
auristor | firewall rules? | 16:40 |
fungi | unlikely any of that has changed, and they should be consistent across afs01.dfw.openstack.org, afs02.dfw.openstack.org and afs01.ord.openstack.org, but i'll compare them all once i have a moment | 16:49 |
*** fbo has quit IRC | 16:51 | |
*** fbo has joined #opendev | 16:52 | |
*** artom has quit IRC | 17:15 | |
*** michael-mcaleer has quit IRC | 17:23 | |
*** dtantsur is now known as dtantsur|afk | 17:23 | |
*** rpittau is now known as rpittau|afk | 17:26 | |
*** ysandeep is now known as ysandeep|away | 17:26 | |
*** marios is now known as marios|out | 17:27 | |
openstackgerrit | Sorin Sbârnea proposed opendev/git-review master: Allow the default of notopic to be configurable https://review.opendev.org/c/opendev/git-review/+/697448 | 17:44 |
openstackgerrit | Sorin Sbârnea proposed opendev/git-review master: Fix bug in git_credentials() https://review.opendev.org/c/opendev/git-review/+/753946 | 17:44 |
openstackgerrit | Sorin Sbârnea proposed opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded https://review.opendev.org/c/opendev/git-review/+/399779 | 17:44 |
*** artom has joined #opendev | 17:47 | |
*** artom has quit IRC | 17:47 | |
openstackgerrit | Sorin Sbârnea proposed opendev/git-review master: Support spaces and other characters in topic https://review.opendev.org/c/opendev/git-review/+/681906 | 17:47 |
*** artom has joined #opendev | 17:47 | |
*** ralonsoh has quit IRC | 17:58 | |
*** klonn has quit IRC | 18:07 | |
*** cloudnull has quit IRC | 18:19 | |
*** cloudnull has joined #opendev | 18:20 | |
*** eolivare has quit IRC | 18:21 | |
*** cloudnull5 has joined #opendev | 18:26 | |
*** cloudnull has quit IRC | 18:27 | |
*** cloudnull5 is now known as cloudnull | 18:27 | |
*** jpena is now known as jpena|off | 18:32 | |
openstackgerrit | Merged opendev/git-review master: Allow the default of notopic to be configurable https://review.opendev.org/c/opendev/git-review/+/697448 | 18:41 |
openstackgerrit | Merged opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded https://review.opendev.org/c/opendev/git-review/+/399779 | 18:41 |
*** marios|out has quit IRC | 18:43 | |
*** sboyron has quit IRC | 18:44 | |
*** hashar is now known as hasharAway | 19:00 | |
*** andrewbonney has quit IRC | 19:09 | |
*** akrpan-pure has joined #opendev | 19:18 | |
akrpan-pure | If I'm having an issue with the devstack-gate-wrap (openstack) script in third party CI, is there a good channel to go to? #openstack-third-party-ci is pretty dead it seems like | 19:19 |
clarkb | akrpan-pure: devstack-gate is effectively dead at this point | 19:20 |
clarkb | your best bet is likely to migrate away from it | 19:21 |
fungi | akrpan-pure: devstack-gate is effectively unmaintained these days, upstream jobs parent to a zuul v3 native "devstack" job in the openstack/devstack repository | 19:21 |
akrpan-pure | Urkkkkk, I guess I should've expected that at this point | 19:27 |
akrpan-pure | Alright, I'll continue down the longer path of updating to those jobs too. Thanks! | 19:28 |
*** zoharm has quit IRC | 19:40 | |
ianw | did we get to the bottom of the ORD issue ... looking now | 19:43 |
ianw | Wed Jan 20 08:43:59 2021 fssync: breaking all call backs for volume 536870992 | 19:46 |
ianw | Starting transaction on cloned volume 536870992... done | 19:47 |
ianw | Deleting extant RO_DONTUSE site on afs01.ord.openstack.org... done | 19:47 |
ianw | Creating new volume 536870992 on replication site afs01.ord.openstack.org: done | 19:47 |
ianw | This will be a full dump: previous release failed | 19:47 |
ianw | Starting ForwardMulti from 536870992 to 536870992 on afs01.ord.openstack.org (entire volume). | 19:47 |
ianw | Failed to set correct names and ids: Possible communication failure | 19:47 |
ianw | Could not end transaction on a ro volume: Possible communication failure | 19:47 |
clarkb | ianw: no sorry, gerrit account issue is current distraction | 19:47 |
ianw | fun | 19:49 |
fungi | ianw: no, i haven't looked deeper other than to confirm the server uptime and which services are running | 19:51 |
ianw | there's stuff in here about the volume being salvaged Tue Jan 19 02:45:57 2021 fileserver requested salvage of clone 536870992; scheduling salvage of volume group 536870991... | 19:51 |
auristor | ianw: rxdebug to afs01.ord on ports 7000, 7005, and 7007 all fail to receive a response. | 19:51 |
fungi | i expect you're on the money with it being a firewall issue. looks like we may have reverted iptables to our basic ruleset (ssh and snmp) | 19:52 |
fungi | so the next question is why | 19:52 |
ianw | ohhhhhhh | 19:53 |
auristor | icmp reply destination unreachable - host administratively prohibited. so definitely firewall rules | 19:53 |
ianw | i bet it's ansible | 19:53 |
fungi | looks like /etc/iptables/rules.* were last updated today at 06:23z | 19:53 |
ianw | i'd say we relied on puppet. looking into it. | 19:54 |
fungi | so yes, i think we should focus there first | 19:54 |
ianw | inventory/service/group_vars/afs.yaml:iptables_extra_public_udp_ports: | 19:54 |
fungi | probably just a matter of adding the ports to our group vars for those servers | 19:54 |
ianw | yeah, i changed the group name to afs-1.8 | 19:55 |
fungi | yeah, that | 19:55 |
fungi | mmm | 19:55 |
ianw | ok, that should be changed back, let me see where that got to (the group name) | 19:55 |
clarkb | should be able to copy the group vars for afs to afs-1.8 to address that | 19:55 |
clarkb | and/or switch everything back to afs if we are ready now | 19:55 |
fungi | the change for that is up, maybe not merged yet | 19:55 |
* fungi checks | 19:55 | |
auristor | not reachable yet | 19:56 |
ianw | https://review.opendev.org/c/opendev/system-config/+/771293 | 19:56 |
ianw | it has a linter error on the group matching bits, let me fix | 19:57 |
fungi | yeah, looks like we can just merge that then | 19:57 |
fungi | (once linters are passing) | 19:57 |
ianw | WARNING Couldn't open /home/iwienand/programs/openstack-infra/system-config/playbooks/roles/letsencrypt-create-certs/roles/letsencrypt-create-certs/handlers/restart_gitea.yaml - No such file or directory [try:2] | 20:07 |
ianw | i'm not sure why ansible-lint looks for stuff there, and now not sure why it tries to open the nonexistent file 3 times :/ | 20:07 |
clarkb | ianw: that looks buggy there is extra pathing in the middle there | 20:08 |
clarkb | like maybe its assuming it knows where the location of handlers are and doing so poorly | 20:08 |
clarkb | I wonder if we should just disable it | 20:08 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove afs-1.8 group https://review.opendev.org/c/opendev/system-config/+/771293 | 20:08 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Manage afsdb servers with Ansible https://review.opendev.org/c/opendev/system-config/+/771340 | 20:08 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove AFS puppet https://review.opendev.org/c/opendev/system-config/+/771342 | 20:08 |
ianw | it's only a warning, but it seems to try to find it and then sleep for (maybe?) a second and try it again x 3, which kind of adds up when it's doing 3 times for about 7 handlers | 20:11 |
openstackgerrit | Kendall Nelson proposed openstack/project-config master: Remove Karbor projects from infra https://review.opendev.org/c/openstack/project-config/+/767057 | 20:12 |
ianw | clarkb: if and when you get this gerrit issue sorted, a few pruning things @ https://review.opendev.org/q/topic:%22backup-prune%22 from yesterday | 20:12 |
ianw | we're still space constrained and need space if we're going to get wiki backed up, so still working on it | 20:13 |
clarkb | ya I just sent email to gerrit upstream about the account thing. I can review those next | 20:15 |
clarkb | but then i need to find lunch | 20:15 |
zbr | ianw: the 3 retries no longer happens on newer versions. | 20:16 |
clarkb | ianw: that topic lgtm | 20:18 |
ianw | zbr: do you know why it's constructing the wrong path for the handler? | 20:19 |
ianw | clarkb: not sure if you saw, but what i'm thinking of doing is piping the output of mysqldump directly into borg as a separate archive, via it's stdin reader | 20:20 |
ianw | in theory, we then only keep incremental db updates that should deduplicate | 20:21 |
clarkb | ianw: ya I saw some thoughts on that but wasn't sure if I fully grokked them. You mean do something like tee it into borg and onto disk and then stop borg from looking at the on disk stuff? | 20:21 |
ianw | more like "mysqldump | borg create --stdin-name dump" | 20:22 |
clarkb | and just do the local copies separately? | 20:22 |
ianw | we can keep a local dump too; but not put that in the backups | 20:23 |
clarkb | ya got it | 20:23 |
clarkb | and then because it's plain text we'd get better incrementalness | 20:23 |
ianw | yeah, and the local dumps can be compressed for size | 20:23 |
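The "mysqldump | borg create --stdin-name dump" idea can be sketched as a small wrapper. This is a hedged illustration only: the repository path, database name, and archive naming scheme are placeholders, and the helper names `build_pipeline` and `run_pipeline` are hypothetical. It relies on borg reading file data from stdin when the path argument is `-`, with `--stdin-name` setting the file name stored inside the archive.

```python
import subprocess
from datetime import datetime, timezone


def build_pipeline(repo, db_name, now=None):
    """Return (dump_cmd, borg_cmd) argv lists for streaming a dump into borg.

    repo and the archive naming scheme are placeholders, not the real setup.
    """
    now = now or datetime.now(timezone.utc)
    archive = f"{repo}::{db_name}-mysql-{now:%Y-%m-%dT%H:%M:%S}"
    dump_cmd = ["mysqldump", "--single-transaction", db_name]
    # "-" tells borg to read the file contents from stdin; --stdin-name sets
    # the name that data gets inside the archive.
    borg_cmd = ["borg", "create", "--stdin-name", f"{db_name}.sql", archive, "-"]
    return dump_cmd, borg_cmd


def run_pipeline(dump_cmd, borg_cmd):
    """Run dump_cmd | borg_cmd without writing the dump to disk."""
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    borg = subprocess.Popen(borg_cmd, stdin=dump.stdout)
    dump.stdout.close()  # so mysqldump gets SIGPIPE if borg exits early
    borg_rc = borg.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or borg_rc != 0:
        raise RuntimeError(f"backup failed: mysqldump={dump_rc} borg={borg_rc}")


# Only print the commands here; call run_pipeline() on a host that has both tools.
dump_cmd, borg_cmd = build_pipeline(
    "/path/to/repo",
    "etherpad",
    now=datetime(2021, 1, 20, 21, 35, tzinfo=timezone.utc),
)
print(" ".join(dump_cmd), "|", " ".join(borg_cmd))
```

Checking both exit codes matters in this design: a failed mysqldump can still leave borg with a truncated but apparently successful archive.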
clarkb | ianw: also did you see that borg supports a compressed backups option. Not sure if we are doing that or if it does it by default | 20:23 |
ianw | that's the theory anyway | 20:23 |
clarkb | but that may be another option available to us, I think bup was compressing by default so maybe that explains the difference in growth | 20:23 |
clarkb | or some of it anyway | 20:24 |
clarkb | ok I'm told lunch is waiting for me, back in a bit | 20:24 |
ianw | hrm, i don't think we are; might be an option. i generally worry a little with things like that if it can turn a small corruption into a big corruption | 20:24 |
clarkb | ya, just calling it out as I'm 95% sure bup was doing it due to its git like packfiles (git compresses packfiles) | 20:25 |
ianw | yeah, very true, and that format was very "interlinked" as well (yes you can pull things out of corrupt git trees, sort of, but not something anyone wants to do) | 20:27 |
fungi | there is such a thing as "diffable compression" but just not compressing is likely easier | 20:28 |
*** klonn has joined #opendev | 20:29 | |
fungi | also if borg used a copy-on-write scheme it could theoretically have deduplicated differential/incremental backups where the most recent data is de facto complete, but i expect there are reasons it doesn't | 20:32 |
ianw | alright, getting some breakfast, will push that ord fix and monitor as soon as it passes. i'm just leaving it rather than messing up the iptables state by doing something by hand | 20:34 |
fungi | it's not urgent so long as some untoward incident doesn't knock afs01.dfw offline | 20:35 |
* fungi gives rax a long sideways look | 20:35 | |
*** zimmerry has joined #opendev | 20:35 | |
zbr | ianw: likely the unsupported repo layout with nested roles directory may be involved. Afaik, include paths works fine for official layout: only one roles/ folder at root. But I may be wrong. | 20:37 |
fungi | roles directory parallel to the location of the playbook is no longer supported? | 20:40 |
*** tosky has quit IRC | 20:41 | |
*** fbo has quit IRC | 20:42 | |
*** tosky has joined #opendev | 20:42 | |
*** raukadah has quit IRC | 20:42 | |
*** fbo has joined #opendev | 20:42 | |
*** stevebaker has joined #opendev | 20:42 | |
*** raukadah has joined #opendev | 20:43 | |
zbr | i need to check tomorrow, remind me if i forget | 20:56 |
zbr | the guessing inside the linter is a bit of a mess, i wanted to work on it but never got enough time | 20:57 |
clarkb | fungi: ianw what do you think about approving https://review.opendev.org/c/opendev/system-config/+/769226 now? are all the fires sufficiently contained? | 21:03 |
clarkb | that is the gitea 1.13.1 upgrade change. | 21:03 |
fungi | i think it should be safe to move forward there, yeah | 21:06 |
ianw | ++ agree | 21:06 |
clarkb | alright I'm approving it now then | 21:06 |
ianw | i'm going to try the mysql dump to borg archive on etherpad manually, maybe run it again manually tomorrow and see if it gets us the de-duplication we hope for | 21:35 |
ianw | btw, we're using lz4 compression with borg, so it does have higher compression options but we have something | 21:36 |
clarkb | oh cool | 21:37 |
openstackgerrit | Merged opendev/system-config master: Remove afs-1.8 group https://review.opendev.org/c/opendev/system-config/+/771293 | 21:39 |
*** whoami-rajat__ has quit IRC | 21:51 | |
ianw | infra-prod-base is running which should hopefully restore the iptables rules for ord | 21:53 |
clarkb | heh its also gonna do all the things because it affected groups | 21:55 |
clarkb | so will be a little while for the gitea upgrade once it lands (I should still be around for a number of hours today so not a big deal) | 21:56 |
clarkb | hrm I wonder if it is possible that we'll get ordering slightly wrong though | 21:58 |
clarkb | if the gitea image updates when the change lands, then the old system-config version does a pull and compose down then up it will restart on the new version but without the template updates? | 21:58 |
clarkb | oh wait no the template updates are all in the image | 21:59 |
clarkb | so the only issue would be https://review.opendev.org/c/opendev/system-config/+/769226/3/playbooks/roles/gitea/templates/app.ini.j2 ? | 21:59 |
clarkb | that's probably minor enough that we'll be fine | 21:59 |
clarkb | may just need to roll through and restart things again once app.ini updates | 21:59 |
clarkb | I thought containers were supposed to fix all these problems :P | 22:00 |
fungi | containers == magic pixie dust | 22:01 |
mordred | you're a container | 22:02 |
ianw | ok, afs01.ord is back with the right iptables rules | 22:05 |
ianw | i guess i'll try the docs update again | 22:05 |
ianw | actually the cron job seems to be running it | 22:11 |
*** hasharAway has quit IRC | 22:14 | |
clarkb | I think it runs every 5 minutes or so | 22:14 |
fungi | yup, along with the rest of the static site updates | 22:14 |
clarkb | once that finishes can we switch back to using the RO path for static/ | 22:14 |
fungi | i switched us back to that already over the weekend | 22:15 |
fungi | i think i status logged it | 22:16 |
clarkb | oh cool | 22:16 |
clarkb | it does look like base and le failed so all the things behind them skipped too fwiw | 22:17 |
fungi | ahh, didn't status log, but https://review.opendev.org/770857 deployed 2021-01-16 23:25:14 | 22:17 |
fungi | so saturday | 22:17 |
clarkb | nb03 is unreachable | 22:18 |
clarkb | and nb01 and nb02 both failed in ansible | 22:18 |
* fungi checks if the mirror there is also | 22:18 | |
clarkb | that appears to be why the LE playbook failed | 22:18 |
clarkb | fungi: can you reboot nb03 if necessary? | 22:18 |
fungi | gladly | 22:18 |
clarkb | nb01 and nb02 have full /opts | 22:19 |
fungi | mirror02.regionone.linaro-us.opendev.org is up for 5 days | 22:19 |
fungi | the gentoo images may be filling disk when they fail? | 22:19 |
fungi | i have a change up to pause them again until we can get a new dib release | 22:19 |
clarkb | its possible. I think I'll start by stopping nodepool-builder on both, disabling the service, then rebooting and see what has leaked? | 22:19 |
fungi | or has that already happened? | 22:19 |
clarkb | gentoo pause is false | 22:20 |
fungi | yeah, https://review.opendev.org/771104 if we want them to stop again | 22:20 |
fungi | i proposed that when it was clear they were still broken, but we were hip-deep in other fire | 22:21 |
fungi | i was like "i'll just put this over here with the rest of the fire" | 22:21 |
clarkb | Failed to stop nodepool-builder.service: Unit nodepool-builder.service not loaded | 22:21 |
clarkb | systemctl list-units -a shows it knows nothing about nodepool | 22:22 |
clarkb | oh right I'm a derp | 22:22 |
clarkb | its docker compose now | 22:22 |
fungi | anyway, i think i approved all prometheanfire's gentoo element fixes for dib, but we still need a dib release before we'll use them on the builders | 22:22 |
clarkb | both are rebooting now, then we can see what leaked in /opt and trim | 22:23 |
clarkb | fungi: if they aren't expected to build then pausing them makes sense to me | 22:23 |
corvus | ianw, fungi, clarkb, mordred: if ansible-lint is continuing to have more problems with the contents of system-config, maybe we should get more consensus on disabling it for that repo: https://review.opendev.org/733406 | 22:23 |
corvus | 3 people in favor of that, but i'd love for ianw and clarkb to weigh in | 22:24 |
openstackgerrit | Merged opendev/system-config master: Update gitea to 1.13.1 https://review.opendev.org/c/opendev/system-config/+/769226 | 22:24 |
*** hamalq has joined #opendev | 22:25 | |
fungi | console log show nb03.opendev.org says "Guest does not have a console available." and server list shows the instance in SHUTOFF state. booting it now | 22:27 |
ianw | kevinz: ^ i think you made some scheduler changes? | 22:27 |
clarkb | /opt/dib_tmp did leak dib_build* dib_image* and profiledirs on both servers. I'm cleaning those up first to see what that frees up | 22:28 |
clarkb | gitea should be upgrading nowish | 22:28 |
ianw | fungi: oh, i got totally distracted on a dib release. i got into a state, i can do a release now. but still quite a lag as we need to push into nodepool and update images | 22:29 |
fungi | ianw: yeah, we may still want to re-pause the gentoo image builds | 22:30 |
fungi | i was hesitant to tag dib without some more eyeballs on the changes which went in or may be pending | 22:31 |
ianw | yeah, i went through the queue, thanks for looking in on it too :) pushed 3.6.0 | 22:32 |
*** slaweq has quit IRC | 22:32 | |
fungi | thanks! | 22:32 |
clarkb | https://gitea01.opendev.org:3000/ has updated | 22:33 |
fungi | prometheanfire: ^ we still need to get that into nodepool container images and deploy them, but closer at least | 22:33 |
clarkb | looks good to me at first glance. I'll follow it as it goes through the list | 22:33 |
prometheanfire | fungi: do I need to do anything? | 22:34 |
fungi | prometheanfire: i don't think so yet. once we get it deployed you'll want to take another look at gentoo image build logs | 22:34 |
clarkb | I may need to put nb01 and nb02 in the emergency file as their hourly deploy is queued up to happen soon | 22:35 |
clarkb | I'll go ahead and do that now | 22:35 |
clarkb | and done | 22:35 |
prometheanfire | cool | 22:36 |
clarkb | cleaning up /opt/dib_tmp on nb01 freed 67GB which is unlikely to be sufficient for very long | 22:37 |
clarkb | I'll look at any leaked images in /opt/nodepool_dib once nb02's dib_tmp is cleaned up | 22:37 |
clarkb | fungi: I notice we're still building stretch images. Any idea if those are used by anything? | 22:39 |
fungi | not without digging in codesearch, no | 22:39 |
prometheanfire | the gentoo image does try and cache binpkgs, for quicker (re)builds | 22:40 |
fungi | we probably eventually need a better way to answer questions like that | 22:40 |
clarkb | found two leaked focal images on nb01. Will clean those up. Likely need to look through all the other images and see if they have leaked too | 22:42 |
*** cloudnull8 has joined #opendev | 22:44 | |
*** cloudnull has quit IRC | 22:46 | |
*** cloudnull8 is now known as cloudnull | 22:46 | |
ianw | This archive: 15.92 GB 4.17 GB 208.56 MB | 22:49 |
ianw | clarkb: ^ that's a more-or-less back-to-back run of dumping the etherpad db directly, so it looks like an incremental is ~208MB | 22:49 |
clarkb | which seems to support your theory that we'd be better off doing it that way | 22:50 |
clarkb | rather than ~4GB compressed each time or whatever it is (I think it is in that range) | 22:50 |
ianw | yeah, about 5gb | 22:50 |
clarkb | all 8 giteas have upgraded now | 22:50 |
clarkb | the zuul/zuul frontpage loads for me | 22:51 |
clarkb | and things look generally correct | 22:51 |
clarkb | nb01 now has 157GB of disk free after cleaning two leaked nb01 images and two intermediate.bak files from old builds in nodepool_dib | 22:51 |
clarkb | all other images in nodepool_dib look legit | 22:51 |
clarkb | cleaning up the dib_build.* on nb02 freed about 100GB and cleaning dib_image.* freed another 260GB or so | 22:53 |
clarkb | I'm checking nb02 for stale content in nodepool_dib now | 22:53 |
clarkb | hrm I think nb02 hasn't built an image in a long while | 22:54 |
clarkb | dib-image-list | grep nb02 shows that everything has failed there except for gentoo forever ago | 22:54 |
clarkb | I'll clean up the stale images there except for gentoo then maybe we start it alone for a bit and let it take some of the load off of nb01? | 22:54 |
clarkb | also as a side note the gentoo images that we have attributed to nb02 in zk don't appear to be on disk | 22:57 |
clarkb | ok nb02 is cleaned up. I will start its builder now | 23:01 |
clarkb | I'll remove nb02 from the emergency file but keep nb01 in it so that nb02 can pick up some of the slack for a bit | 23:02 |
clarkb | #status Log Upgraded gitea to 1.13.1 | 23:03 |
openstackstatus | clarkb: finished logging | 23:03 |
openstackgerrit | Merged opendev/system-config master: borg-backup: prune after successful backup https://review.opendev.org/c/opendev/system-config/+/771531 | 23:03 |
clarkb | #status log Cleaned up /opt on nb01 and nb02 to remove stale image build data from dib_tmp and nodepool_dib. nb02's builder has been started as it has much more free space and we want it to "steal" builds from nb01. | 23:04 |
openstackstatus | clarkb: finished logging | 23:04 |
ianw | hrm i went through all the builders just before christmas | 23:04 |
clarkb | ianw: most of these appeared stale since mid november | 23:04 |
clarkb | but maybe they were active in december and only recently rolled out? | 23:04 |
ianw | i think the failure case, where one fills up and guarantees the other will then fill up is something to think about | 23:04 |
clarkb | agreed. One thing I've thought about is having them weight their job grabs based on how full their disk is | 23:05 |
clarkb | which should trend to sharing the load over time | 23:05 |
ianw | i need to get back to https://review.opendev.org/c/zuul/nodepool/+/764280 | 23:05 |
ianw | that will refuse to start a build if it knows it's going to run out of disk | 23:05 |
clarkb | one simple way to do the weight thing I was thinking of is to do a sleep before grabbing a new build based on how much free space there is | 23:06 |
clarkb | but I think that may fail if the sleep is less than a typical image build runtime | 23:06 |
clarkb | ianw: maybe at the end of your day you can check how many images nb02 has built and if it is in the range of say 4 start up nb01? otherwise I can start nb01 tomorrow morning? | 23:07 |
clarkb | actually nb02 cannot take over more than half of the images since we keep the current and previous image | 23:08 |
clarkb | in that case it should be safe to let it run for 24 hours before starting nb01 | 23:08 |
clarkb | I'll start nb01 tomorrow given ^ | 23:08 |
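The two ideas discussed above (delaying a build claim in proportion to how full the disk is, and refusing a build that would not fit) could be sketched roughly as follows. This is only an illustration, not nodepool's implementation: the function names, the linear weighting, and the `max_delay` default are all assumptions made for the example.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def claim_delay(free_frac, max_delay=600.0):
    """Seconds to wait before claiming a build (hypothetical linear weighting).

    A builder with an empty disk waits 0 seconds; a builder with a full disk
    waits max_delay seconds, so emptier builders tend to win the race.
    """
    free_frac = min(max(free_frac, 0.0), 1.0)  # clamp out-of-range input
    return max_delay * (1.0 - free_frac)


def should_refuse(free_bytes, estimated_build_bytes):
    """True if the build is expected to need more space than is free."""
    return free_bytes < estimated_build_bytes
```

As noted in the discussion, a delay like this only spreads load if `max_delay` is comparable to or longer than a typical image build; otherwise the fuller builder still claims the next build while the emptier one is busy.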
*** lbragstad has quit IRC | 23:13 | |
*** lbragstad_ has joined #opendev | 23:13 | |
*** bodgix has joined #opendev | 23:14 | |
*** bodgix_ has quit IRC | 23:14 | |
clarkb | heh i took nb02 out of emergency which caused it to restart a couple of minutes ago when ansible ran against it | 23:16 |
clarkb | took me a minute to figure out why the centos 7 image build it was doing just disappeared | 23:16 |
clarkb | ianw: ^ fwiw that does appear to have leaked a build in dib_tmp | 23:16 |
clarkb | ianw: I think nb02:/opt/dib_tmp/dib_build.dVZ8L3kD dib_image.igFXlwm2 and profiledir.MhYYhz belonged to the build that was aborted due to a restart | 23:17 |
clarkb | for about 5.6GB of disk use | 23:17 |
clarkb | I'm going to watch it closer and see if the current build stuff goes away when that image build finishes and if so I think I can be confident I've found the correct leaked files and manually clean them up | 23:18 |
*** brinzhang has quit IRC | 23:20 | |
*** brinzhang has joined #opendev | 23:29 | |
*** tosky has quit IRC | 23:44 |
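The manual hunt for leaked `dib_build.*`, `dib_image.*` and `profiledir.*` entries in `/opt/dib_tmp` described in this log could be approximated with a small report script. This is a minimal sketch under assumptions: the age threshold is a hypothetical heuristic (an in-progress build owns entries with the same prefixes, so recent ones must be left alone), and the script only lists candidates and never deletes anything.

```python
import os
import time

# Prefixes named in the log as leaking in the builder's temp directory.
LEAK_PREFIXES = ("dib_build.", "dib_image.", "profiledir.")


def tree_size(path):
    """Apparent size in bytes of a file, or of all files under a directory."""
    if os.path.islink(path) or not os.path.isdir(path):
        return os.lstat(path).st_size
    total = 0
    for root, _dirs, files in os.walk(path):  # does not follow symlinks
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished while walking; skip it
    return total


def find_stale(tmp_dir, min_age_seconds, now=None):
    """Return (path, size) pairs for matching entries older than the threshold."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(os.listdir(tmp_dir)):
        if not entry.startswith(LEAK_PREFIXES):
            continue
        path = os.path.join(tmp_dir, entry)
        age = now - os.lstat(path).st_mtime
        if age >= min_age_seconds:
            stale.append((path, tree_size(path)))
    return stale
```

For example, `find_stale("/opt/dib_tmp", 2 * 24 * 3600)` would list entries untouched for two days. A directory's modification time only changes when its direct children change, so the results still need a manual check against any running build before removal.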