Wednesday, 2021-01-20

*** tosky has quit IRC00:02
ianwok, now we have a problem that the backup volume is full in vexxhost so i can't create the new user/home for wiki backup00:02
ianwi'm going to try that ethercalc prune (noop first)00:03
*** iurygregory has quit IRC00:06
ianwOSError: [Errno 28] No space left on device00:07
clarkbianw: I wonder if we should've been setting the additional_free_space setting00:09
ianwi'm moving borg-translate01 (22g) to /opt directly to free up some space temporarily00:09
ianwclarkb: i guess looks about right?00:14
ianwi feel like give it a go and see how much gets freed00:14
clarkbianw: ya that looks about right. my only other thought is --keep-monthly 12 would probably be nice00:15
clarkbbut unlikely to have much effect here since borg is recent00:15
clarkb(its backwards to the output from borg list so took me a second to reverse sort)00:15
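The retention idea discussed above (keep a few dailies, weeklies and monthlies, such as --keep-monthly 12) can be pictured with a small model. This is only a simplified sketch of "keep the newest archive in each of the last N periods"; it is not borg's actual prune implementation, and the function name and archive dates are made up for illustration.

```python
from datetime import date, timedelta


def select_keepers(archives, keep_daily=0, keep_weekly=0, keep_monthly=0):
    """Simplified keep-N-per-period model; NOT borg's real prune logic."""
    rules = [
        (keep_daily, lambda d: d.isoformat()),
        (keep_weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda d: (d.year, d.month)),
    ]
    newest_first = sorted(archives, reverse=True)
    keep = set()
    for limit, period_of in rules:
        seen_periods = set()
        for day in newest_first:
            if len(seen_periods) >= limit:
                break
            period = period_of(day)
            if period not in seen_periods:
                # The newest archive in each period is the one that survives.
                seen_periods.add(period)
                keep.add(day)
    return sorted(keep)


# Hypothetical history: one archive per day for the 90 days up to 2021-01-20.
history = [date(2021, 1, 20) - timedelta(days=n) for n in range(90)]
for kept in select_keepers(history, keep_daily=7, keep_weekly=4, keep_monthly=12):
    print(kept)
```

With a short history like this one the monthly rule has few months to choose from, which matches the remark that it would have little effect while the backups are still recent.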
*** artom has quit IRC00:16
*** artom has joined #opendev00:16
ianwfor reference00:16
ianwetherpad seems too large, we probably should look at exclusions more closely there00:17
clarkbianw: it's probably due to the large database backups there00:17
clarkband ya maybe we can make that better by not keeping as many local db backups00:17
mordredis it backing up any old historical backups?00:17
clarkbor instruct borg to only backup the most recent db backup00:17
mordredI think that00:17
mordredbacking up the rotated backups is wasteful00:18
clarkb++ I like having the local db backups if we can keep them and telling borg to only look at the most recent one is a good workaround to that I guess00:18
mordred++ yeah - local rotated backups is super helpful for ease of use00:18
ianwi'm going to try that prune on ethercalc, even though it's small, now00:18
mordredmaybe an exclusion with the .gz$ or [0-9].gz or whatever00:18
clarkbianw: ++00:18
clarkbmordred: ianw ya keep in mind though that I think logrotate makes it weird where we end up with a 0 byte file and it's the .1.gz that is most recent00:19
clarkbbut ya I assume we can do it with a matcher of some sort00:19
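A minimal sketch of such a matcher, assuming logrotate-style names where the live file may be empty and ".1.gz" is then the newest real dump. The filenames and sizes below are hypothetical; the real layout under /var/backups may differ.

```python
import re

# Matches logrotate-style numbered suffixes such as "dump.sql.gz.1.gz".
ROTATION = re.compile(r"\.(\d+)\.gz$")


def rotation_rank(name):
    """0 for the live (unrotated) file, N for a ".N.gz" rotation."""
    match = ROTATION.search(name)
    return int(match.group(1)) if match else 0


def split_for_backup(files):
    """files maps filename -> size in bytes. Returns (newest, excludes)."""
    non_empty = [name for name, size in files.items() if size > 0]
    if not non_empty:
        return None, sorted(files)
    # Lowest rank among non-empty files is the most recent usable dump.
    newest = min(non_empty, key=rotation_rank)
    return newest, sorted(name for name in files if name != newest)


# Hypothetical directory listing.
listing = {
    "etherpad.sql.gz": 0,
    "etherpad.sql.gz.1.gz": 500,
    "etherpad.sql.gz.2.gz": 480,
    "etherpad.sql.gz.10.gz": 400,
}
newest, excludes = split_for_backup(listing)
print("back up:", newest)
print("exclude:", excludes)
```

Skipping empty files is what handles the 0-byte case; a plain "[0-9].gz" exclusion alone would also drop the .1.gz that is actually wanted.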
ianwheh, so it pruned to ... 1.8G borg-ethercalc0200:20
ianwi guess its deltas are very efficient00:20
clarkbthat service might be a bad test because ya that00:20
clarkbianw: it sounds like there is also a --compress option to the backup step00:20
clarkbare we using that? If not I bet that would help with space usage too00:20
clarkbianw: ask and gitea01 also do local db backups with rotation so may be good indicators00:22
clarkb(review too)00:22
ianw /var/backups/etherpad-mariadb# du -h00:22
ianwbut i wonder if having it as gzip files destroys more effective delta updates?00:23
clarkbianw: likely yes00:23
clarkbI wonder if --compress is smart about that somehow00:24
ianwit would probably be better to have the latest uncompressed, then have logrotate compress and rotate that locally00:24
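The concern can be tried out with a toy experiment. The sketch below hashes fixed-size chunks, which is only a stand-in for borg's content-defined chunking, and builds a fake SQL dump whose table and row contents are invented. It counts how many chunks of a second dump would be new to a store that already holds the first, both uncompressed and gzipped.

```python
import gzip
import hashlib

CHUNK = 4096  # fixed-size chunks; an assumption, borg chunks differently


def chunk_hashes(data):
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def new_chunks(old, new):
    """Count chunks of `new` that a store already holding `old` lacks."""
    known = set(chunk_hashes(old))
    return sum(1 for h in chunk_hashes(new) if h not in known)


def fake_dump(changed_row=None):
    """Build a made-up dump with fixed-width rows; one row may be altered."""
    rows = []
    for i in range(20000):
        payload = hashlib.sha256(str(i).encode()).hexdigest()
        if i == changed_row:
            payload = "x" * len(payload)  # same length, different content
        rows.append(f"INSERT INTO pad VALUES ({i:06d}, '{payload}');\n")
    return "".join(rows).encode()


monday = fake_dump()
tuesday = fake_dump(changed_row=10)

print("raw new chunks: ", new_chunks(monday, tuesday),
      "of", len(chunk_hashes(tuesday)))
gz_mon = gzip.compress(monday, mtime=0)
gz_tue = gzip.compress(tuesday, mtime=0)
print("gzip new chunks:", new_chunks(gz_mon, gz_tue),
      "of", len(chunk_hashes(gz_tue)))
```

In the uncompressed case only the chunk containing the changed row is new. The gzipped count is left for the reader to run, since it depends on how the compressor happens to lay out its output.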
clarkbwe may not have enough disk space for that though on etherpad00:24
fungiokay, doing mirror.epel now00:24
clarkbbut ya that is a potential thing we could try00:24
clarkbalso are gitea backups doing another set of git backups I wonder00:25
clarkbbetween review and gitea I mean00:25
clarkbnot the worst thing but maybe another place we can prune00:25
ianwyeah i wasn't sure if we needed gitea at all00:25
clarkbianw: on gitea we want the db backups as that preserves our redirects in the database00:26
clarkbianw: but I don't think we need anything else from it00:26
ianwyeah i think we're definitely getting the git trees ...00:26
*** DSpider has quit IRC00:28
ianwclarkb: so do you think we can exclude /var/gitea?00:33
clarkbianw: I think so, if not the entirety of that dir at least /var/gitea/data/git (I think that is the path but going from memory there)00:34
clarkbsince we're backing those up on the gerrit side00:34
ianw /var/backups/gitea-mariadb# du -h00:34
clarkbya the gitea db is very small00:34
clarkbit's largely just the projects and redirects we have, since we don't do issues and wiki and users00:34
ianwyeah i can't see anything under /var/gitea that isn't covered by config mgmt00:35
clarkbssl certs may be the only thing?00:35
ianwaccess logs00:35
clarkboh ya ++ to those00:35
*** iurygregory has joined #opendev00:43
*** artom has quit IRC00:45
fungissl certs are presumably not valuable because le will just issue more automatically, right?01:00
*** stevebaker has quit IRC01:03
*** mlavalle has quit IRC01:05
ianwit seems like you can't "--exclude /var/lib/gitea" --include "/var/lib/gitea/logs"01:07
ianw# The file '/home/user/cache/important' is *not* backed up:01:07
ianw$ borg create -e /home/user/cache/ backup / /home/user/cache/important01:07
ianwthe etherpad dump is 15905104958 bytes01:20
*** hamalq has quit IRC01:26
fungifull release of mirror.epel finished and we're already well into the catch-up pass across the volumes. once they're done i'll remove my locks and we can look at getting ianw's release serialization change deployed i think02:02
ianwfungi: where is the content to backup; in /srv/mediawiki?02:07
fungiit's scattered all throughout there. in the puppeted version i've extracted the stateful data away from the deployed software and configuration, but on that production server it's quite commingled02:11
fungiand honestly, since the deployment and configuration aren't well understood yet, we probably need to be backing them up there anyway02:12
fungioh, maybe i misunderstood your question, yes we should back up (all of) /srv/mediawiki02:13
ianwhrm, srv is 11G, but I guess fairly stable?02:13
fungiyeah, images are the main thing which change on it (that's where uploaded files wind up)02:13
fungiand the lucene index lives in there so it changes when it's regenerated02:13
ianwi have everything deployed but we're going to need to free up some space or get some more02:13
ianw(backup space)02:14
fungii was half following, sorry, were you able to work out the pruning?02:19
ianwfungi: umm, sort of.  i think we've uncovered a number of things02:31
ianwpruning down to weekly, monthly we can do on command line02:31
ianwthe space efficiency lost by gzipping the database, which removes borg's ability to de-dup, is something to think about02:32
ianwgzipping the daily database dumps02:32
ianwand we can prune a bunch of directories from gitea at least02:32
*** stevebaker has joined #opendev02:37
auristorianw: I see that the afs01.dfw volserver is idle.    just to note in case it was missed that the "docs" and "mirror.fedora" RO volumes are still new on afs01 and old on afs02.   "docs" is also locked which might mean a release was in flight when afs01 died.02:44
ianwauristor: thanks for looking in! :)  it looks like fungi has dropped locks and the fedora mirror process is running now, so that's expected02:46
fungiwell, i'm still holding a (non-afs) lock which prevents our normal mirror content updates, and have been steadily going through them in a serialized fashion until we get them caught up to present02:46
fungiwhich i'm hoping will be in the next hour or two02:47
ianwfungi: sorry, i just unlocked the docs volume, but somewhat accidentally pasted in the release command too02:49
ianwi can kill it or just let it run; i think i'm tending to the latter02:51
ianwi don't know why it failed to release but given all the recent commotion nothing would surprise me02:51
*** openstackgerrit has joined #opendev02:58
openstackgerritIan Wienand proposed opendev/system-config master: borg-backup: prune after successful backup
*** hemanth_n has joined #opendev03:12
fungiianw: i wasn't holding any lock for the docs volume, just the mirror volumes03:52
fungiand now i've released them all as the updates indicate having all completed03:53
ianwfungi: thanks!  great to be back.  i'm just letting the docs one run now03:56
openstackgerritIan Wienand proposed opendev/system-config master: gitea backup: prune some large directories
*** ykarel|away has joined #opendev05:03
*** hemanth_n has quit IRC05:07
*** hemanth_n has joined #opendev05:07
openstackgerritIan Wienand proposed opendev/system-config master: borg-backup: fix logrotate name
*** iurygregory has quit IRC05:33
ianwok, i've run rdiff on the two gzipped mysql dumps and the delta is as large as the file itself05:40
ianwfrom etherpad05:41
ianwit looks like we can actually create a borg archive from stdin.  i.e. dump the db directly into borg as a separate archive.05:57
ianwi think that's going to be better; hosts can still dump their db's to disk but we can just ignore that in the backups05:58
ianwthat'll be tomorrow, if clarkb doesn't beat me to it :)05:59
*** zbr5 has joined #opendev06:04
*** zbr has quit IRC06:06
*** zbr5 is now known as zbr06:06
*** ykarel_ has joined #opendev06:17
*** ykarel|away has quit IRC06:19
*** marios has joined #opendev06:22
*** ykarel_ is now known as ykarel06:26
*** slaweq has joined #opendev07:04
*** slaweq has quit IRC07:30
openstackgerritRico Lin proposed openstack/project-config master: Add ubuntu-bionic-arm64-xlarge
*** eolivare has joined #opendev07:31
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: Use urlencoded filenames in test fixtures
*** slaweq has joined #opendev08:00
*** hashar has joined #opendev08:03
*** fressi has joined #opendev08:07
*** andrewbonney has joined #opendev08:09
*** sboyron_ has joined #opendev08:12
*** rpittau|afk is now known as rpittau08:17
*** sboyron__ has joined #opendev08:38
*** sboyron_ has quit IRC08:41
*** hemanth_n has quit IRC08:41
*** stevebaker has quit IRC08:41
*** hemanth_n has joined #opendev08:41
*** akahat|rover is now known as akahat|lunch08:46
*** tosky has joined #opendev08:47
*** DSpider has joined #opendev08:48
*** jpena|off is now known as jpena08:54
*** raukadah has quit IRC09:18
*** tristanC has quit IRC09:18
*** raukadah has joined #opendev09:20
*** tristanC has joined #opendev09:20
*** brinzhang_ has quit IRC09:34
*** ysandeep is now known as ysandeep|afk09:45
*** brinzhang has joined #opendev09:52
*** klonn has joined #opendev10:07
*** akahat|lunch is now known as akahat|rover10:09
*** ysandeep|afk is now known as ysandeep10:17
*** rpittau is now known as rpittau|bbl10:20
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: bindep: remove set_fact usage when converting string to list
*** priteau has joined #opendev10:24
*** sshnaidm|afk is now known as sshnaidm|ruck10:43
*** dtantsur|afk is now known as dtantsur10:44
*** hashar has quit IRC10:50
*** rpittau|bbl is now known as rpittau11:24
*** iurygregory has joined #opendev11:27
*** sboyron__ has quit IRC11:30
*** klonn has quit IRC11:31
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap
*** jpena is now known as jpena|lunch12:29
*** sboyron has joined #opendev12:45
*** klonn has joined #opendev12:47
openstackgerritRadosław Piliszek proposed opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule
openstackgerritMerged opendev/git-review master: Drop support for py27
openstackgerritMerged opendev/git-review master: Assure git-review works with py37 and py38
*** artom has joined #opendev13:22
*** ysandeep is now known as ysandeep|afk13:24
auristorianw fungi: the "docs" volume has still not released properly.   Looking more carefully, its second RO site is afs01.ord not afs02.dfw and afs01.ord is not responding.13:25
*** jpena|lunch is now known as jpena13:28
*** whoami-rajat___ has joined #opendev13:31
*** brinzhang has quit IRC13:37
openstackgerritMerged opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule
*** michael-mcaleer has joined #opendev13:43
*** sboyron has quit IRC13:48
*** brinzhang has joined #opendev13:49
*** brinzhang has quit IRC13:51
*** sboyron has joined #opendev13:51
*** brinzhang has joined #opendev13:51
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap
*** hemanth_n has quit IRC14:20
*** zoharm has joined #opendev14:37
fungiauristor: interesting, i agree vos status says it's not reachable. when i ssh into it afsd is running and the openafs lkm is loaded, i'll have to dig deeper into it after some morning errands and meetings14:38
fungithanks for the heads up!14:38
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add policy about overriding role input variables
*** whoami-rajat___ is now known as whoami-rajat__15:02
*** hashar has joined #opendev15:04
*** klonn has quit IRC15:06
*** d34dh0r53 has quit IRC15:12
*** d34dh0r53 has joined #opendev15:19
*** slaweq has quit IRC15:21
*** slaweq has joined #opendev15:23
*** ysandeep|afk is now known as ysandeep15:31
clarkbthat is the one we upgraded to 1.8 right?15:32
clarkbmaybe the key conversion thing didn't go properly?15:32
*** fressi has quit IRC15:39
*** sboyron has quit IRC15:45
*** klonn has joined #opendev15:50
*** sboyron has joined #opendev16:02
*** ykarel has quit IRC16:21
*** mlavalle has joined #opendev16:26
auristorfungi: afsd is the client not the servers16:37
fungioh, right16:37
auristorthe servers are bosserver, dafileserver, davolserver, dasalvageserver16:37
fungiclarkb: they're all upgraded to 1.816:37
clarkboh thats already done? /me so far behind16:37
fungiauristor: yep, i think those are what's not running. maybe they didn't get started automatically at boot, i'll be able to fiddle with it in a couple hours16:38
fungier, nevermind, bad grep. bosserver, dafileserver, davolserver are all in the process table (no dasalvageserver though)16:39
fungiin a couple more hours i should be in a position to be able to start digging in logs16:40
auristorfirewall rules?16:40
fungiunlikely any of that has changed, and they should be consistent across the servers, but i'll compare them all once i have a moment16:49
*** fbo has quit IRC16:51
*** fbo has joined #opendev16:52
*** artom has quit IRC17:15
*** michael-mcaleer has quit IRC17:23
*** dtantsur is now known as dtantsur|afk17:23
*** rpittau is now known as rpittau|afk17:26
*** ysandeep is now known as ysandeep|away17:26
*** marios is now known as marios|out17:27
openstackgerritSorin Sbârnea proposed opendev/git-review master: Allow the default of notopic to be configurable
openstackgerritSorin Sbârnea proposed opendev/git-review master: Fix bug in git_credentials()
openstackgerritSorin Sbârnea proposed opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded
*** artom has joined #opendev17:47
*** artom has quit IRC17:47
openstackgerritSorin Sbârnea proposed opendev/git-review master: Support spaces and other characters in topic
*** artom has joined #opendev17:47
*** ralonsoh has quit IRC17:58
*** klonn has quit IRC18:07
*** cloudnull has quit IRC18:19
*** cloudnull has joined #opendev18:20
*** eolivare has quit IRC18:21
*** cloudnull5 has joined #opendev18:26
*** cloudnull has quit IRC18:27
*** cloudnull5 is now known as cloudnull18:27
*** jpena is now known as jpena|off18:32
openstackgerritMerged opendev/git-review master: Allow the default of notopic to be configurable
openstackgerritMerged opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded
*** marios|out has quit IRC18:43
*** sboyron has quit IRC18:44
*** hashar is now known as hasharAway19:00
*** andrewbonney has quit IRC19:09
*** akrpan-pure has joined #opendev19:18
akrpan-pureIf I'm having an issue with the devstack-gate-wrap (openstack) script in third party CI, is there a good channel to go to? #openstack-third-party-ci is pretty dead it seems like19:19
clarkbakrpan-pure: devstack-gate is effectively dead at this point19:20
clarkbyour best bet is likely to migrate away from it19:21
fungiakrpan-pure: devstack-gate is effectively unmaintained these days, upstream jobs parent to a zuul v3 native "devstack" job in the openstack/devstack repository19:21
akrpan-pureUrkkkkk, I guess I should've expected that at this point19:27
akrpan-pureAlright, I'll continue down the longer path of updating to those jobs too. Thanks!19:28
*** zoharm has quit IRC19:40
ianwdid we get to the bottom of the ORD issue ... looking now19:43
ianwWed Jan 20 08:43:59 2021 fssync: breaking all call backs for volume 53687099219:46
ianwStarting transaction on cloned volume 536870992... done19:47
ianwDeleting extant RO_DONTUSE site on done19:47
ianwCreating new volume 536870992 on replication site  done19:47
ianwThis will be a full dump: previous release failed19:47
ianwStarting ForwardMulti from 536870992 to 536870992 on (entire volume).19:47
ianwFailed to set correct names and ids: Possible communication failure19:47
ianwCould not end transaction on a ro volume: Possible communication failure19:47
clarkbianw: no sorry, gerrit account issue is current distraction19:47
fungiianw: no, i haven't looked deeper other than to confirm the server uptime and which services are running19:51
ianwthere's stuff in here about the volume being salvaged Tue Jan 19 02:45:57 2021 fileserver requested salvage of clone 536870992; scheduling salvage of volume group 536870991...19:51
auristorianw: rxdebug to afs01.ord on ports 7000, 7005, and 7007 all fail to receive a response.19:51
fungii expect you're on the money with it being a firewall issue. looks like we may have reverted iptables to our basic ruleset (ssh and snmp)19:52
fungiso the next question is why19:52
auristoricmp reply destination unreachable - host administratively prohibited.    so definitely firewall rules19:53
ianwi bet it's ansible19:53
fungilooks like /etc/iptables/rules.* were last updated today at 06:23z19:53
ianwi'd say we relied on puppet. looking into it.19:54
fungiso yes, i think we should focus there first19:54
fungiprobably just a matter of adding the ports to our group vars for those servers19:54
ianwyeah, i changed the group name to afs-1.819:55
fungiyeah, that19:55
ianwok, that should be changed back, let me see where that got to (the group name)19:55
clarkbshould be able to copy the group vars for afs to afs-1.8 to address that19:55
clarkband/or switch everything back to afs if we are ready now19:55
fungithe change for that is up, maybe not merged yet19:55
* fungi checks19:55
auristornot reachable yet19:56
ianwit has a linter error on the group matching bits, let me fix19:57
fungiyeah, looks like we can just merge that then19:57
fungi(once linters are passing)19:57
ianwWARNING  Couldn't open /home/iwienand/programs/openstack-infra/system-config/playbooks/roles/letsencrypt-create-certs/roles/letsencrypt-create-certs/handlers/restart_gitea.yaml - No such file or directory [try:2]20:07
ianwi'm not sure why ansible-lint looks for stuff there, and now not sure why it tries to open the nonexistent file 3 times :/20:07
clarkbianw: that looks buggy there is extra pathing in the middle there20:08
clarkblike maybe it's assuming it knows where the handlers are located and doing so poorly20:08
clarkbI wonder if we should just disable it20:08
openstackgerritIan Wienand proposed opendev/system-config master: Remove afs-1.8 group
openstackgerritIan Wienand proposed opendev/system-config master: Manage afsdb servers with Ansible
openstackgerritIan Wienand proposed opendev/system-config master: Remove AFS puppet
ianwit's only a warning, but it seems to try to find it and then sleep for (maybe?) a second and try it again x 3, which kind of adds up when it's doing 3 times for about 7 handlers20:11
openstackgerritKendall Nelson proposed openstack/project-config master: Remove Karbor projects from infra
ianwclarkb: if and when you get this gerrit issue sorted, a few pruning things @ from yesterday20:12
ianwwe're still space constrained and need space if we're going to get wiki backed up, so still working on it20:13
clarkbya I just sent email to gerrit upstream about the account thing. I can review those next20:15
clarkbbut then i need to find lunch20:15
zbrianw: the 3 retries no longer happens on newer versions.20:16
clarkbianw: that topic lgtm20:18
ianwzbr: do you know why it's constructing the wrong path for the handler?20:19
ianwclarkb: not sure if you saw, but what i'm thinking of doing is piping the output of mysqldump directly into borg as a separate archive, via its stdin reader20:20
ianwin theory, we then only keep incremental db updates that should deduplicate20:21
clarkbianw: ya I saw some thoughts on that but wasn't sure if I fully grokked them. You mean do something like tee it into borg and onto disk and then stop borg from looking at the on disk stuff?20:21
ianwmore like "mysqldump | borg create --stdin-name dump"20:22
clarkband just do the local copies separately?20:22
ianwwe can keep a local dump too; but not put that in the backups20:23
clarkbya got it20:23
clarkband then because it's plain text we'd get better incrementalness20:23
ianwyeah, and the local dumps can be compressed for size20:23
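A rough sketch of that pipeline driven from Python, assuming a borg version that accepts --stdin-name together with "-" as the path to read from stdin. The repository URL, archive name, stdin name and the mysqldump options are placeholders chosen for illustration, not the deployed configuration.

```python
import subprocess


def build_commands(repo, archive, stdin_name="mysqldump.sql"):
    """Return (dump_cmd, borg_cmd) argv lists; nothing is executed here."""
    dump_cmd = ["mysqldump", "--all-databases", "--single-transaction"]
    # "-" as the path asks borg create to read the file contents from stdin.
    borg_cmd = ["borg", "create", "--stdin-name", stdin_name,
                f"{repo}::{archive}", "-"]
    return dump_cmd, borg_cmd


def stream_dump(repo, archive):
    """Pipe the uncompressed dump straight into borg, bypassing local disk."""
    dump_cmd, borg_cmd = build_commands(repo, archive)
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    borg = subprocess.Popen(borg_cmd, stdin=dump.stdout)
    dump.stdout.close()  # lets mysqldump get SIGPIPE if borg exits early
    borg_rc = borg.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or borg_rc != 0:
        raise RuntimeError(f"mysqldump={dump_rc} borg={borg_rc}")


# Hypothetical repository and archive names, printed rather than run.
for cmd in build_commands("ssh://backup.example/./repo", "etherpad-mysql-example"):
    print(" ".join(cmd))
```

Checking both exit codes matters here because a failed dump would otherwise be stored as a truncated but apparently successful archive.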
clarkbianw: also did you see that borg supports a compressed backups option. Not sure if we are doing that or if it does it by default20:23
ianwthat's the theory anyway20:23
clarkbbut that may be another option available to us, I think bup was compressing by default so maybe that explains the difference in growth20:23
clarkbor some of it anyway20:24
clarkbok I'm told lunch is waiting for me, back in a bit20:24
ianwhrm, i don't think we are; might be an option.  i generally worry a little with things like that if it can turn a small corruption into a big corruption20:24
clarkbya, just calling it out as I'm 95% sure bup was doing it due to its git like packfiles (git compresses packfiles)20:25
ianwyeah, very true, and that format was very "interlinked" as well (yes you can pull things out of corrupt git trees, sort of, but not something anyone wants to do)20:27
fungithere is such a thing as "diffable compression" but just not compressing is likely easier20:28
*** klonn has joined #opendev20:29
fungialso if borg used a copy-on-write scheme it could theoretically have deduplicated differential/incremental backups where the most recent data is de facto complete, but i expect there are reasons it doesn't20:32
ianwalright, getting some breakfast, will push that ord fix and monitor as soon as it passes.  i'm just leaving it rather than messing up the iptables state by doing something by hand20:34
fungiit's not urgent so long as some untoward incident doesn't knock afs01.dfw offline20:35
* fungi gives rax a long sideways look20:35
*** zimmerry has joined #opendev20:35
zbrianw: likely the unsupported repo layout with nested roles directory may be involved. Afaik, include paths work fine for official layout: only one roles/ folder at root. But I may be wrong.20:37
fungiroles directory parallel to the location of the playbook is no longer supported?20:40
*** tosky has quit IRC20:41
*** fbo has quit IRC20:42
*** tosky has joined #opendev20:42
*** raukadah has quit IRC20:42
*** fbo has joined #opendev20:42
*** stevebaker has joined #opendev20:42
*** raukadah has joined #opendev20:43
zbri need to check tomorrow, remind me if i forget20:56
zbrthe guessing inside the linter is a bit of a mess, i wanted to work on it but never got enough time20:57
clarkbfungi: ianw what do you think about approving now? are all the fires sufficiently contained?21:03
clarkbthat is the gitea 1.13.1 upgrade change.21:03
fungii think it should be safe to move forward there, yeah21:06
ianw++ agree21:06
clarkbalright I'm approving it now then21:06
ianwi'm going to try the mysql dump to borg archive on etherpad manually, maybe run it again manually tomorrow and see if it gets us the de-duplication we hope for21:35
ianwbtw, we're using lz4 compression with borg, so it does have higher compression options but we have something21:36
clarkboh cool21:37
openstackgerritMerged opendev/system-config master: Remove afs-1.8 group
*** whoami-rajat__ has quit IRC21:51
ianwinfra-prod-base is running which should hopefully restore the iptables rules for ord21:53
clarkbheh its also gonna do all the things because it affected groups21:55
clarkbso will be a little while for the gitea upgrade once it lands (I should still be around for a number of hours today so not a big deal)21:56
clarkbhrm I wonder if it is possible that we'll get ordering slightly wrong though21:58
clarkbif the gitea image updates when the change lands, then the old system-config version does a pull and compose down then up it will restart on the new version but without the template updates?21:58
clarkboh wait no the template updates are all in the image21:59
clarkbso the only issue would be ?21:59
clarkbthats probably minor enough that we'll be fine21:59
clarkbmay just need to roll through and restart things again once app.ini updates21:59
clarkbI thought containers were supposed to fix all these problems :P22:00
fungicontainers == magic pixie dust22:01
mordredyou're a container22:02
ianwok, afs01.ord is back with the right iptables rules22:05
ianwi guess i'll try the docs update again22:05
ianwactually the cron job seems to be running it22:11
*** hasharAway has quit IRC22:14
clarkbI think it runs every 5 minutes or so22:14
fungiyup, along with the rest of the static site updates22:14
clarkbonce that finishes can we switch back to using the RO path for static/22:14
fungii switched us back to that already over the weekend22:15
fungii think i status logged it22:16
clarkboh cool22:16
clarkbit does look like base and le failed so all the things behind them skipped too fwiw22:17
fungiahh, didn't status log, but deployed 2021-01-16 23:25:1422:17
fungiso saturday22:17
clarkbnb03 is unreachable22:18
clarkband nb01 and nb02 both failed in ansible22:18
* fungi checks if the mirror there is also22:18
clarkbthat appears to be why the LE playbook failed22:18
clarkbfungi: can you reboot nb03 if necessary?22:18
clarkbnb01 and nb02 have full /opts22:19
 is up for 5 days22:19
fungithe gentoo images may be filling disk when they fail?22:19
fungii have a change up to pause them again until we can get a new dib release22:19
clarkbits possible. I think I'll start by stopping nodepool-builder on both, disabling the service, then rebooting and see what has leaked?22:19
fungior has that already happened?22:19
clarkbgentoo pause is false22:20
fungiyeah, if we want them to stop again22:20
fungii proposed that when it was clear they were still broken, but we were hip-deep in other fire22:21
fungii was like "i'll just put this over here with the rest of the fire"22:21
clarkbFailed to stop nodepool-builder.service: Unit nodepool-builder.service not loaded22:21
clarkbsystemctl list-units -a shows it knows nothing about nodepool22:22
clarkboh right I'm a derp22:22
clarkbits docker compose now22:22
fungianyway, i think i approved all prometheanfire's gentoo element fixes for dib, but we still need a dib release before we'll use them on the builders22:22
clarkbboth are rebooting now, then we can see what leaked in /opt and trim22:23
clarkbfungi: if they aren't expected to build then pausing them makes sense to me22:23
corvusianw, fungi, clarkb, mordred: if ansible-lint is continuing to have more problems with the contents of system-config, maybe we should get more consensus on disabling it for that repo:
corvus3 people in favor of that, but i'd love for ianw and clarkb to weigh in22:24
openstackgerritMerged opendev/system-config master: Update gitea to 1.13.1
*** hamalq has joined #opendev22:25
fungiconsole log show says "Guest does not have a console available." and server list shows the instance in SHUTOFF state. booting it now22:27
ianwkevinz: ^ i think you made some scheduler changes?22:27
clarkb/opt/dib_tmp did leak dib_build* dib_image* and profiledirs on both servers. I'm cleaning those up first to see what that frees up22:28
clarkbgitea should be upgrading nowish22:28
ianwfungi: oh, i got totally distracted on a dib release.  i got into a state, i can do a release now.  but still quite a lag as we need to push into nodepool and update images22:29
fungiianw: yeah, we may still want to re-pause the gentoo image builds22:30
fungii was hesitant to tag dib without some more eyeballs on the changes which went in or may be pending22:31
ianwyeah, i went through the queue, thanks for looking in on it too :)  pushed 3.6.022:32
*** slaweq has quit IRC22:32
clarkb has updated22:33
fungiprometheanfire: ^ we still need to get that into nodepool container images and deploy them, but closer at least22:33
clarkblooks good to me at first glance. I'll follow it as it goes through the list22:33
prometheanfirefungi: do I need to do anything?22:34
fungiprometheanfire: i don't think so yet. once we get it deployed you'll want to take another look at gentoo image build logs22:34
clarkbI may need to put nb01 and nb02 in the emergency file as their hourly deploy is queued up to happen soon22:35
clarkbI'll go ahead and do that now22:35
clarkband done22:35
clarkbcleaning up /opt/dib_tmp on nb01 freed 67GB which is unlikely to be sufficient for very long22:37
clarkbI'll look at any leaked images in /opt/nodepool_dib once nb02's dib_tmp is cleaned up22:37
clarkbfungi: I notice we're still building stretch images. Any idea if those are used by anything?22:39
funginot without digging in codesearch, no22:39
prometheanfirethe gentoo image does try and cache binpkgs, for quicker (re)builds22:40
fungiwe probably eventually need a better way to answer questions like that22:40
clarkbfound two leaked focal images on nb01. Will clean those up. Likely need to look through all the other images and see if they have leaked too22:42
*** cloudnull8 has joined #opendev22:44
*** cloudnull has quit IRC22:46
*** cloudnull8 is now known as cloudnull22:46
ianwThis archive:               15.92 GB              4.17 GB            208.56 MB22:49
ianwclarkb: ^ that's a more-or-less back-to-back run of dumping the etherpad db directly, so it looks like an incremental is ~208MB22:49
clarkbwhich seems to support your theory that we'd be better of doing it that way22:50
clarkbrather than ~4GB compressed each time or whatever it is (I think it is in that range)22:50
ianwyeah, about 5gb22:50
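The arithmetic behind those figures, as a small sketch. It assumes the three columns in the pasted line are original, compressed and deduplicated size (the usual order in borg's stats output) and that the units are decimal.

```python
UNITS = {"MB": 10**6, "GB": 10**9}  # assumes decimal (SI) units


def to_bytes(text):
    value, unit = text.split()
    return float(value) * UNITS[unit]


original = to_bytes("15.92 GB")
compressed = to_bytes("4.17 GB")
deduplicated = to_bytes("208.56 MB")

# Fraction of the compressed dump that actually had to be stored this run.
print(f"stored vs compressed dump: {deduplicated / compressed:.1%}")
print(f"stored vs original dump:   {deduplicated / original:.1%}")
print(f"saved per run: {(compressed - deduplicated) / 10**9:.2f} GB")
```

Under those assumptions the second run stored roughly a twentieth of what a fresh compressed copy would have taken.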
clarkball 8 giteas have upgraded now22:50
clarkbthe zuul/zuul frontpage loads for me22:51
clarkband things look generally correct22:51
clarkbnb01 now has 157GB of disk free after cleaning two leaked nb01 images and two intermediate.bak files from old builds in nodepool_dib22:51
clarkball other images in nodepool_dib look legit22:51
clarkbcleaning up the dib_build.* on nb02 freed about 100GB and cleaning dib_image.* freed another 260GB or so22:53
clarkbI'm checking nb02 for stale content in nodepool_dib now22:53
clarkbhrm I think nb02 hasn't built an image in a long while22:54
clarkbdib-image-list | grep nb02 shows that everything has failed there except for gentoo forever ago22:54
clarkbI'll clean up the stale images there except for gentoo then maybe we start it alone for a bit and let it take some of the load off of nb01?22:54
clarkbalso as a side note the gentoo images that we have attributed to nb02 in zk don't appear to be on disk22:57
clarkbok nb02 is cleaned up. I will start its builder now23:01
clarkbI'll remove nb02 from the emergency file but keep nb01 in it so that nb02 can pick up some of the slack for a bit23:02
clarkb#status Log Upgraded gitea to 1.13.123:03
openstackstatusclarkb: finished logging23:03
openstackgerritMerged opendev/system-config master: borg-backup: prune after successful backup
clarkb#status log Cleaned up /opt on nb01 and nb02 to remove stale image build data from dib_tmp and nodepool_dib. nb02's builder has been started as it has much more free space and we want it to "steal" builds from nb01.23:04
openstackstatusclarkb: finished logging23:04
ianwhrm i went through all the builders just before christmas23:04
clarkbianw: most of these appeared stale since mid november23:04
clarkbbut maybe they were active in december and only recently rolled out?23:04
ianwi think the failure case, where one fills up and guarantees the other will then fill up is something to think about23:04
clarkbagreed. One thing I've thought about is having them weight their job grabs based on how full their disk is23:05
clarkbwhich should tend toward sharing the load over time23:05
ianwi need to get back to
ianwthat will refuse to start a build if it knows it's going to run out of disk23:05
clarkbone simple way to do the weight thing I was thinking of is to do a sleep before grabbing a new build based on how much free space there is23:06
clarkbbut I think that may fail if the sleep is less than a typical image build runtime23:06
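Both ideas (refusing a build that would not fit, and delaying a claim in proportion to how full the disk is) can be sketched in a few lines. This is not nodepool code; the function names, the 1.5x headroom factor and the 600 second cap are arbitrary assumptions for illustration.

```python
import shutil


def can_start_build(free_bytes, expected_build_bytes, headroom=1.5):
    """Refuse a build unless free space covers its expected peak usage."""
    return free_bytes >= expected_build_bytes * headroom


def claim_delay(free_bytes, total_bytes, max_delay=600.0):
    """Seconds to wait before claiming a build; fuller disks wait longer."""
    used_fraction = 1.0 - (free_bytes / total_bytes)
    return max(0.0, min(max_delay, used_fraction * max_delay))


def delay_for_path(path="/", max_delay=600.0):
    """Same calculation, fed from the real filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return claim_delay(usage.free, usage.total, max_delay)


# Two hypothetical builders with 1 TB disks: one nearly full, one mostly empty.
TB = 10**12
print("nearly full builder waits:", claim_delay(0.1 * TB, TB), "s")
print("mostly empty builder waits:", claim_delay(0.8 * TB, TB), "s")
```

As noted above, the delay only influences which builder wins a claim when the difference between the delays is meaningful compared with how long a build takes, so the cap would need tuning against real build times.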
clarkbianw: maybe at the end of your day you can check how many images nb02 has built and if it is in the range of say 4 start up nb01? otherwise I can start nb01 tomorrow morning?23:07
clarkbactually nb02 cannot take over more than half of the images since we keep the current and previous image23:08
clarkbin that case it should be safe to let it run for 24 hours before starting nb0123:08
clarkbI'll start nb01 tomorrow given ^23:08
*** lbragstad has quit IRC23:13
*** lbragstad_ has joined #opendev23:13
*** bodgix has joined #opendev23:14
*** bodgix_ has quit IRC23:14
clarkbheh i took nb02 out of emergency which caused it to restart a couple of minutes ago when ansible ran against it23:16
clarkbtook me a minute to figure out why the centos 7 image build it was doing just disappeared23:16
clarkbianw: ^ fwiw that does appear to have leaked a build in dib_tmp23:16
clarkbianw: I think nb02:/opt/dib_tmp/dib_build.dVZ8L3kD dib_image.igFXlwm2 and profiledir.MhYYhz belonged to the build that was aborted due to a restart23:17
clarkbfor about 5.6GB of disk use23:17
clarkbI'm going to watch it closer and see if the current build stuff goes away when that image build finishes and if so I think I can be confident I've found the correct leaked files and manually clean them up23:18
*** brinzhang has quit IRC23:20
*** brinzhang has joined #opendev23:29
*** tosky has quit IRC23:44
