Dereckson | Hello. Do we have a specific channel for git-review? | 00:18 |
fungi | Dereckson: yep, this one ;) | 00:19 |
Dereckson | ah thanks :) | 00:19 |
Dereckson | I maintain the FreeBSD port for git-review, and I've a request on our bug tracker to include the commit fixing the rebase for Git 2.34 | 00:19 |
Dereckson | I've also noticed Fedora released a "2.2.0" rpm in advance of an actual release. | 00:20 |
fungi | Dereckson: this one? https://pypi.org/project/git-review/2.2.0/ | 00:21 |
fungi | https://opendev.org/opendev/git-review/src/tag/2.2.0 | 00:22 |
fungi | Dereckson: https://lists.opendev.org/pipermail/service-announce/2021-November/000028.html | 00:23 |
fungi | that was the release announcement last november | 00:23 |
fungi | but if you specifically want to backport the fix you mentioned, it's commit 7182166ec00ad3645821435d72c5424b4629165f | 00:24 |
fungi | reviewed in https://review.opendev.org/c/opendev/git-review/+/818219 | 00:24 |
Dereckson | I'd like to avoid the backport and release a new version, but I was still tracking https://docs.opendev.org/opendev/git-review/latest/ | 00:24 |
Dereckson | as we already have 2.2.0, problem solved, thanks | 00:25 |
fungi | ahh, yeah, we need to figure out a way to force a docs refresh on tag push, it would say 2.2.0 there the next time we merge a commit, but nothing's merged since i pushed that tag | 00:25 |
fungi | sorry for the confusion, i'll spend some more time thinking about how we can make that smoother | 00:27 |
fungi | in a high-activity project it tends to go unnoticed, but git-review doesn't update all that often | 00:28 |
fungi | i can't remember why we had decided previously that re-running our docs jobs on tag events was problematic | 00:29 |
fungi | clarkb: ianw: do either of you recall? | 00:29 |
opendevreview | Merged opendev/glean master: distro: sync to 3.6 https://review.opendev.org/c/opendev/glean/+/830536 | 00:30 |
ianw | umm, i do not. i feel like it would be ok, as it would just push new contents to AFS? | 00:30 |
fungi | right, like simply adding the docs job we already run in post to the tag or release pipelines | 00:31 |
Dereckson | Still supported under Python 3.6 by the way? | 00:31 |
Dereckson | (I only noted 2.0 is Python 3+) | 00:32 |
fungi | Dereckson: python-requires = >= 3.5 | 00:32 |
fungi | so yep | 00:32 |
Dereckson | ok | 00:32 |
ianw | fungi: yeah, perhaps we were worried about overwriting or something? if you pushed a tag not at HEAD? | 00:33 |
fungi | Dereckson: the patch mentioned above was automatically tested with 3.5, 3.6, 3.7, 3.8 and 3.9 | 00:33 |
fungi | (or some point releases thereof for each of those, anyway) | 00:34 |
fungi | ianw: i suppose if you tagged an earlier commit than the branch tip that could roll back the content, right | 00:34 |
fungi | now that zuul has the ability to guess a most likely branch for any tag, we could maybe reset to the branch tip if the ref is a tag, but that will require some thought | 00:35 |
fungi | in a single-branch project like git-review that's obviously safe, but for projects maintaining multiple branches it may need some additional care to make sure the correct branch is chosen | 00:36 |
clarkb | I don't recall why that changed. It used to be we published the main docs and the version specific docs when we pushed tags. But now our tag documentation jobs generally only push the tag specific version iirc | 00:45 |
clarkb | zuul suffers from this too | 00:45 |
clarkb | oh I think I remember. It is because the tag version docs can overwrite the latest master builds as they can race each other | 00:46 |
clarkb | I agree that is something git-review doesn't really need to worry about | 00:46 |
fungi | oh, right, i forgot about the version-specific doc builds | 00:47 |
fungi | so, yes, we'd need to run two docs jobs on any tag event: one to publish the version-specific docs, and one which resets to the guessed branch for that tag and refreshes the branch-specific docs | 00:47 |
Dereckson | Thanks for the support fungi, patch submitted to the FreeBSD ports system so we can have 2.2.0 there. Tested locally, works like a charm. | 01:08 |
fungi | Dereckson: great! thanks for confirming it's working there | 01:57 |
*** rlandy|ruck|bbl is now known as rlandy|out | 02:06 |
*** lajoskatona_ is now known as lajoskatona | 02:25 | |
opendevreview | wangxiyuan proposed zuul/zuul-jobs master: Add openEuler to iptalbe firewall persist https://review.opendev.org/c/zuul/zuul-jobs/+/830706 | 02:32 |
*** pojadhav|out is now known as pojadhav|ruck | 02:52 | |
*** frenzy_friday is now known as frenzyfriday|rover | 04:26 | |
*** ysandeep|out is now known as ysandeep | 04:53 | |
*** ysandeep is now known as ysandeep|away | 05:55 | |
*** ysandeep|away is now known as ysandeep | 07:04 | |
*** amoralej|off is now known as amoralej | 07:09 | |
*** jpena|off is now known as jpena | 08:34 | |
opendevreview | Ian Wienand proposed opendev/system-config master: encrypt-logs: turn on for all prod playbooks https://review.opendev.org/c/opendev/system-config/+/830784 | 08:40 |
opendevreview | Ian Wienand proposed opendev/system-config master: docs: reorganise around a open infrastructure overview https://review.opendev.org/c/opendev/system-config/+/830785 | 08:40 |
frickler | this may affect our mailman setup going forward https://bugs.launchpad.net/ubuntu/+source/mailman3/+bug/1960547 | 09:06 |
*** ysandeep is now known as ysandeep|lunch | 09:07 | |
opendevreview | Merged openstack/diskimage-builder master: rhel: work around RHEL-9 BLS issues https://review.opendev.org/c/openstack/diskimage-builder/+/829620 | 10:04 |
opendevreview | Merged openstack/diskimage-builder master: Detect boot and EFI partitions in extract-image https://review.opendev.org/c/openstack/diskimage-builder/+/828617 | 10:04 |
*** ysandeep|lunch is now known as ysandeep | 10:19 | |
opendevreview | Will Szumski proposed openstack/diskimage-builder master: Always Use linuxefi with DIB_BLOCK_DEVICE=efi https://review.opendev.org/c/openstack/diskimage-builder/+/830801 | 10:20 |
opendevreview | Merged opendev/glean master: Add Rocky Linux support https://review.opendev.org/c/opendev/glean/+/830539 | 10:55 |
*** rlandy|out is now known as rlandy|ruck | 11:18 | |
*** dviroel_ is now known as dviroel | 11:28 | |
*** bhagyashris_ is now known as bhagyashris | 11:57 | |
*** amoralej is now known as amoralej|lunch | 12:19 | |
*** sshnaidm is now known as sshnaidm|off | 12:23 | |
*** ysandeep is now known as ysandeep|afk | 12:33 | |
*** ysandeep|afk is now known as ysandeep | 12:54 | |
*** amoralej|lunch is now known as amoralej | 13:19 | |
opendevreview | Merged opendev/glean master: Remove rebuild-test-output.sh https://review.opendev.org/c/opendev/glean/+/830540 | 13:39 |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Revert "Detect boot and EFI partitions in extract-image" https://review.opendev.org/c/openstack/diskimage-builder/+/830717 | 13:42 |
*** dviroel is now known as dviroel|brb | 14:01 | |
fungi | frickler: thanks for the heads up! while i did the poc with the debian/ubuntu packages, the current plan is to use https://docs.mailman3.org/en/latest/install/docker.html | 14:08 |
fungi | and yeah, i actually bookmarked the rfh bug back when it was opened in november, in case i find time to help maintain the debian packages (which i might if i decide to run them on some of my servers for my own projects) | 14:18 |
fungi | looks like a couple of wmf folks volunteered to pitch in though | 14:20 |
fungi | and they're actually dds, which i'm not, so probably have more chance of being useful as their uploads won't need sponsoring | 14:21 |
*** dviroel|brb is now known as dviroel | 14:23 | |
*** pojadhav|ruck is now known as pojadhav|brb | 14:28 | |
*** pojadhav|brb is now known as pojadhav|afk | 14:32 | |
mgagne_ | fungi: my (new) employer would like to reach out to you. Would your first name at openinfra.dev be the email to use? | 15:46 |
fungi | mgagne_: sure, that works. thanks! | 15:51 |
mgagne_ | good | 15:51 |
*** dviroel is now known as dviroel|lunvh | 16:03 | |
*** dviroel|lunvh is now known as dviroel|lunch | 16:03 | |
*** ysandeep is now known as ysandeep|out | 16:17 | |
*** pojadhav|afk is now known as pojadhav|out | 16:25 | |
clarkb | https://github.com/go-gitea/gitea/pull/18799 is the fix for our gitea 1.16 diff issue. Looks like they noticed before I filed the bug, but I couldn't find an issue for it | 16:29 |
*** ykarel is now known as ykarel|away | 16:31 | |
fungi | hah, convenient! | 16:34 |
clarkb | Looks like there may be a good number of other fixes in 1.16.2. I'm thinking we can wait for that version whether or not it includes the diff fix and maybe update to that instead | 16:35 |
fungi | sure | 16:37 |
fungi | wfm | 16:37 |
opendevreview | Merged openstack/diskimage-builder master: Revert "Detect boot and EFI partitions in extract-image" https://review.opendev.org/c/openstack/diskimage-builder/+/830717 | 16:45 |
*** marios is now known as marios|out | 16:51 | |
*** dviroel|lunch is now known as dviroel | 16:58 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Etherpad to 1.8.17 https://review.opendev.org/c/opendev/system-config/+/830874 | 17:11 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM forcing failure to hold a node and check new etherpad https://review.opendev.org/c/opendev/system-config/+/830875 | 17:11 |
clarkb | I'll put a hold on the etherpad job for that second change and we can check the etherpad there and land the parent if it looks good | 17:12 |
fungi | ooh, thanks! | 17:13 |
clarkb | fungi: 0000000024 is an autohold for the gitea links checking. I had mentioned to frickler we can use that one to double check the mergeability checking update on a running gerrit. I'll see if I can get to that today, then we are good to delete the autohold? | 17:14 |
fungi | clarkb: yeah, i have no more need of it | 17:29 |
*** amoralej is now known as amoralej|off | 17:42 | |
*** jpena is now known as jpena|off | 17:53 | |
clarkb | infra-root https://104.130.132.239/ but you need to hit it with that ip set as etherpad.opendev.org in /etc/hosts due to redirects | 17:57 |
clarkb | I'm in the isitbroken pad | 17:57 |
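The /etc/hosts override described above amounts to a single entry mapping the production hostname to the held node's address (both taken from the message above):

```
104.130.132.239 etherpad.opendev.org
```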
*** rlandy|ruck is now known as rlandy|ruck|mtg | 17:59 | |
clarkb | it seems to be working happily. If you agree we should be able to proceed with https://review.opendev.org/c/opendev/system-config/+/830874 | 17:59 |
corvus | checking | 18:00 |
corvus | it does not seem broken | 18:01 |
corvus | i have closed etherpad | 18:07 |
clarkb | me too. Let me know if I should rejoin to perform additional testing but I think corvus and I are happy with it | 18:08 |
clarkb | we did note that unset names can apparently be set by anyone. Then once set, only the controlling session can change the name | 18:09 |
corvus | i think that's an existing behavior, just not one we exercise often | 18:11 |
corvus | +ing | 18:11 |
clarkb | oh one other thing. I checked to see if 1.8.17 has upstream docker images published yet and it does not. Seems like it was a good move to stop relying on them for these | 18:20 |
clarkb | I was hopeful that was a one-off issue with 1.8.16 but seems not | 18:21 |
fungi | interesting | 18:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test if we can build gitea release/v1.16 https://review.opendev.org/c/opendev/system-config/+/830885 | 18:38 |
*** rlandy|ruck|mtg is now known as rlandy|ruck | 19:19 | |
mnasiadka | hello | 19:20 |
mnasiadka | regarding Rocky Linux - does it work now after merging the glean patches? | 19:20 |
clarkb | mnasiadka: no we need a glean release and that needs to be coordinated with a dib release. One thing complicating this is it changes how glean is executed by udev a bit and ideally we'll be able to monitor that as it goes in | 19:26 |
clarkb | that said rocky is bootable in at least one of the clouds now via a fallback to dhcp so it kind of works | 19:26 |
mnasiadka | ok, so I can run a job now using rocky nodeset? | 19:26 |
mnasiadka | or do I have to wait for the coordinated release? | 19:27 |
clarkb | I think you can try running a job now and see what happens | 19:29 |
clarkb | I'm not going to commit to it working reliably at this point, but it should hopefully be sufficient to start seeing if rocky works for your stuff and preparing for when it is reliable | 19:29 |
clarkb | if ianw would like to do those releases early in his day I can help monitor today and tomorrow. | 19:30 |
clarkb | I think worst case we end up tagging an old commit of glean as a new version to revert | 19:30 |
clarkb | but there is a lot to observe as we need to ensure all the images continue to boot in all the clouds :/ | 19:30 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up two retired mailing lists https://review.opendev.org/c/opendev/system-config/+/830893 | 19:34 |
clarkb | fungi: isn't it great when ansible is helpful :) | 19:35 |
fungi | so great | 19:35 |
clarkb | Looking at glean I think we'd do a 1.20.0 for the executable restructuring, gentoo fixes, rocky addition and distro sync to 3.6 | 19:54 |
fungi | yeah, it needs a minor rev, not patch | 19:54 |
fungi | agreed | 19:54 |
fungi | the rocky addition and vendored distro update both warrant it | 19:55 |
clarkb | corvus: is https://zuul.opendev.org/t/openstack/build/cde0c6eedcc54d67acac0e2ee31245ca a zuul-registry bug? | 20:14 |
clarkb | fungi: also I think if you want you could update the gerrit testing jobs to not rebuild the images on your config update | 20:14 |
clarkb | fungi: that might be more reliable now while we sort out that problem which seems fairly consistent | 20:15 |
clarkb | and now I need to eat lunch | 20:15 |
corvus | clarkb: either that, or a client bug, or a client/registry interaction bug | 20:21 |
fungi | oh, huh | 20:21 |
corvus | or maybe corrupted data? i dunno. it usually takes a while to triage those | 20:21 |
corvus | `Manifest has invalid size for layer sha256:41de18d1833d2d5e6cb6111780577f3eab0afd9b1bf6c0c1756c5227abfa645c (size:44359680 actual:203253446)` is the meat of it | 20:22 |
fungi | yeah, that definitely seems wrong | 20:23 |
fungi | looks like it could be an interrupted upload | 20:24 |
fungi | though size<actual which would imply the reverse? | 20:24 |
fungi | i'm misunderstanding why we'd see that on upload of a newly rebuilt image each time though | 20:27 |
fungi | and yeah, i misread the error originally and thought it was dockerhub rejecting the upload, didn't spot that it was the buildset registry | 20:28 |
fungi | 829975 is really not at all urgent though, i'm perfectly happy to sort this problem out first | 20:29 |
fungi | though before i do anything else, i should probably sort out dinner | 20:38 |
clarkb | hrm we've had these size mismatches before I think ianw debugged them | 20:39 |
clarkb | the gist was we calculated things off by one in the registry iirc and some clients became more strict? | 20:40 |
fungi | something something skopeo something? ;) | 20:40 |
clarkb | maybe we've got another client being more strict and we're catching another existing bug somewhere | 20:40 |
clarkb | https://zuul.opendev.org/t/openstack/build/ad14ff4e385640beba9d43b9de108973 so it did succeed in check | 20:45 |
clarkb | but then seems to consistently fail in gate? | 20:45 |
clarkb | oh gate and check run two slightly different versions of the image build jobs. One uploads only to the insecure ci registry and the other to both docker and the insecure ci registry | 20:46 |
clarkb | so in this case cherrypy is reporting back that the manifest reported a size of 44359680 but it got 203253446 bytes. The tool doing the push is docker itself not skopeo | 20:47 |
clarkb | this upload is to the buildset registry not the insecure ci registry. | 20:50 |
fungi | the server itself doesn't seem to be in any particular distress, at least | 20:51 |
fungi | if anything it looks extremely underutilized | 20:52 |
clarkb | well this is the buildset registry which is per buildset | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt is the log for the one that failed | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#767 here is the server side reporting the 400 error | 20:53 |
*** dviroel is now known as dviroel|afk | 20:54 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Revert "Revert "Detect boot and EFI partitions in extract-image"" https://review.opendev.org/c/openstack/diskimage-builder/+/830900 | 20:54 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#711 that shows the docker client HEADing the blob and getting back the actual size rather than the size in the manifest | 20:55 |
clarkb | the docker version is from december | 20:56 |
clarkb | oh now this is really curious | 20:56 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#667 vs https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#710 | 20:57 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#706 that is when it reports the upload is complete | 20:57 |
clarkb | I think it must be checking the size before the upload is complete and using that smaller value. Then by the time it tries to use that smaller value, the up to date and correct value is known and there is a conflict | 20:58 |
clarkb | I don't know enough about the docker image upload protocols to know if we are supposed to kick back a 404 or what if the upload isn't completed | 20:58 |
clarkb | corvus: ^ do you know off the top of your head? | 20:59 |
fungi | oh, yeah, i misread you, i was looking at insecure-ci-registry, sorry | 20:59 |
ianw | o/ | 21:00 |
ianw | istr that there was a podman 4 release very recently. have we pulled in something different? | 21:01 |
clarkb | ianw: this should all be docker not podman | 21:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#552-555 at this point which is before 667 with the shorter size it seems to recognize 203253446 is the size of something | 21:03 |
ianw | clarkb: i can do the glean release ... but given that it will have global effect i'm thinking that doing it on my monday might be a better idea. rolling it back will be an exercise in rebuilding things and having a longer runway when things are quiet would be helpful | 21:04 |
clarkb | ianw: wfm | 21:04 |
clarkb | ianw: I think if things get really bad with a glean release the easiest thing may be to pull the release from pypi | 21:04 |
clarkb | then the next round of image builds will use the prior good release which we know to work | 21:05 |
clarkb | but ++ to doing it when everything can be monitored and action taken to address problems if necessary | 21:05 |
clarkb | for the registry thing I think line 667 of the registry log is the thing we need to understand better. It is returning the short value that things complain about and it is doing so after the registry seems to know about the larger value logged on 552-555 | 21:06 |
ianw | hrm maybe not podman, but IIRC the use of buildkit was involved | 21:07 |
clarkb | zuul registry's filesystem implementation does return os.stat(path).st_size to get the size that HEAD returns | 21:08 |
clarkb | I think the issue here may be that we've told the registry it is this size, but before all that data is written out to disk we've queried the disk for how much it has and fed that back to docker | 21:08 |
clarkb | thinking out loud here. Maybe we need to write to a tmp name then do an atomic mv? | 21:09 |
clarkb | then the os.path.exists will fail for the early HEAD and we report the proper result when we know it? Though that might also break docker because it may need that early HEAD to succeed | 21:10 |
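A minimal sketch of the failure mode described above, assuming a filesystem backend that concatenates chunks into the blob's final path and answers HEAD from os.stat(); the layout and handler shape are illustrative, not the actual zuul-registry code:

```python
import os

BLOB_ROOT = "/var/lib/registry/blobs"  # illustrative path, not zuul-registry's real layout


def head_blob(digest):
    """Sketch of a HEAD /v2/<repo>/blobs/<digest> handler.

    If chunks are concatenated into the final path while the upload is
    still in flight, a HEAD arriving mid-upload sees the partial file and
    reports a short size (e.g. 44359680 of an eventual 203253446 bytes),
    which the client then bakes into the manifest it PUTs.
    """
    path = os.path.join(BLOB_ROOT, digest)
    if not os.path.exists(path):
        return 404, {}
    # Races with the in-progress upload: st_size only reaches the real
    # blob size once the final chunk has been written out.
    return 200, {"Content-Length": str(os.stat(path).st_size)}
```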
ianw | https://opendev.org/zuul/zuul-registry/commit/7100c360b31dfd1b9f4413eb2df3db3f229a5e26 seems similar | 21:12 |
clarkb | I've just confirmed the actual_size value is also from disk. So by the time we go to validate the manifest the size is correct on disk but the PUT for the manifest must've supplied the short value from the earlier head | 21:16 |
fungi | so are we racing a fd flush? | 21:18 |
clarkb | "When this response is received, the client can assume that the layer is already available in the registry under the given name and should take no further action to upload the layer." this is what it says about the HEAD on a layer (which is expected to return the size) | 21:19 |
clarkb | I think there is a race here. We shouldn't be responding positively to the HEAD request when the value is short. | 21:19 |
clarkb | fungi: no I think the upload hasn't completed yet based on the logs | 21:19 |
clarkb | fungi: even if we flushed everything we'd still be short because the client is sending us more data | 21:20 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#706 is where the upload for the layer completes but https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#667 is responding that the layer exists prior to that with a smaller size | 21:20 |
fungi | oh, the short read is happening server-side? | 21:21 |
clarkb | yes, it seems like midway through the upload the client is asking the server in a separate thread to tell it how big the not yet completely uploaded layer is | 21:23 |
clarkb | the server is checking what it has on disk which is short because it hasn't finished receiving the data and responds with that value. Then the client constructs a manifest that it sends to the server with that short value | 21:23 |
clarkb | (it's a bit insane to me that we need to track sizes like this when the client is the originator of the info and everything is hashed, just check if hashes are valid) | 21:24 |
clarkb | and zuul-registry doesn't seem to be stateful enough to know whether or not a file on disk represents a complete upload? | 21:25 |
fungi | #zuul | 21:30 |
fungi | hah, you're not my buffer command | 21:30 |
clarkb | I suspect that this may be a bug in the client HEADing a layer before it has finished uploading the layer, which exposes a bug in zuul-registry: it returns the layer info before the layer is complete | 21:32 |
clarkb | hrm there may be some state here though as we seem to record chunks and those go to a subdir? | 21:34 |
clarkb | and then later they end up in the path that the HEAD looks at | 21:34 |
clarkb | oh actually https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#554 that may be where we finish the upload which is before the short read so maybe the HEAD isn't too soon and we just aren't synced? | 21:36 |
clarkb | I think I see it | 21:39 |
clarkb | or at least something that can cause it. Fix on the way | 21:39 |
opendevreview | Merged opendev/system-config master: Clean up two retired mailing lists https://review.opendev.org/c/opendev/system-config/+/830893 | 21:44 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul-registry/+/830905 Atomically concatenate blob objects | 21:52 |
clarkb | It is possible that the fsync flushing is also to blame I suppose | 21:53 |
clarkb | and we may not actually solve this with the move, but I think the move is more correct either way | 21:53 |
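A rough sketch of the write-to-a-temp-name-then-rename approach described above; the helper name and chunk handling are assumptions, not the actual 830905 patch:

```python
import os
import tempfile


def finalize_blob(chunk_paths, blob_path):
    """Concatenate uploaded chunks, then publish the blob atomically.

    The blob only appears at its final path via os.rename(), which is
    atomic within a filesystem, so a concurrent HEAD either sees no blob
    yet (404) or the complete file, never a partial size.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(blob_path))
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk_path in chunk_paths:
                with open(chunk_path, "rb") as chunk:
                    out.write(chunk.read())
            out.flush()
            os.fsync(out.fileno())  # also covers the fsync/flush theory above
        os.rename(tmp_path, blob_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```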
clarkb | also I think this may be somewhat cloud specific | 22:01 |
clarkb | due to iops availability? | 22:01 |
clarkb | rechecking until it goes through is probably viable. but also we should reconsider building gerrit images for apache config changes | 22:01 |
clarkb | the two shouldn't be related | 22:01 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Revert "Revert "Detect boot and EFI partitions in extract-image"" https://review.opendev.org/c/openstack/diskimage-builder/+/830900 | 22:17 |
ianw | i wonder how hard it would be to set up a job to build and push into the registry via mitmproxy and keep a dump of the traffic. that was how i debugged the original issues | 22:17 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/830874 could use another reviewer. I'm happy to approve that tomorrow morning and watch it after wards | 22:28 |
clarkb | the child change has a held etherpad you can test with too. corvus and I did testing earlier with the isitbroken pad on the test node and it seemed fine (note you have to update your /etc/hosts to make the redirects work) | 22:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove Gerrit's JVM GC logs https://review.opendev.org/c/opendev/system-config/+/830912 | 22:44 |
clarkb | Ok I tested the mergeability update and it seemed fine. I think we can proceed with that when we have a good time to restart gerrit | 22:44 |
clarkb | since I was on the test node gerrit I wanted to see if I could fix the jvm_gc log rotation errors we always get on gerrit startup and 830912 is the best I could come up with. Basically stop telling it to write those files entirely. Note that we have to stop gerrit and move the old files aside before starting it again or we get the same errors on startup | 22:45 |
clarkb | The extent of my mergeability testing was to check the error_log for any indication a full reindex occurred and I didn't see any | 22:45 |
clarkb | I'll leave that server up until tomorrow in case frickler wants to take a look but then I think we can delete the old | 22:46 |
clarkb | *delete the zuul autohold | 22:47 |
fungi | okay, back from dinner/errands | 22:50 |
fungi | i guess a depends-on to 830905 from a failing system-config change isn't going to exercise the updated registry code, right? | 22:53 |
fungi | looks like base deploys may be failing again, digging into it now | 22:56 |
fungi | the failed build i'm looking at claimed codesearch was unreachable at 2022-02-24T21:49:42Z | 22:58 |
fungi | i can ssh into it and it's been "up 29 days" so maybe just a network blip | 22:59 |
fungi | oh! i should have read further | 22:59 |
fungi | it was "unreachable" because "mkdir: cannot create directory ...: No space left on device" | 22:59 |
fungi | rootfs is indeed full | 22:59 |
fungi | i'll see what's gone nuts | 23:00 |
ianw | fungi: no, it will really have to be rolled out to the actual registry i think | 23:01 |
clarkb | ianw: fungi: the buildset registry uses the :latest tag so i think once my change lands and :latest is updated it will be used afterwards | 23:01 |
fungi | got it | 23:01 |
clarkb | fungi: ianw: re codesearch seems it is largely /var/lib/hound/data | 23:02 |
clarkb | which is where the indexed data lives | 23:02 |
ianw | i wonder if we're not removing things we should be; seems there's like daily files, but stuff from back in november | 23:03 |
clarkb | I do half wonder if some of the data in there is stale though. There are 2831 entries and I don't think we're indexing that many repos | 23:03 |
ianw | not sure if it's additive | 23:03 |
clarkb | ianw: ya exactly | 23:03 |
fungi | yeah, that's 87% of the fs | 23:04 |
fungi | there's a ton of old files in there | 23:05 |
fungi | even over a year old | 23:05 |
fungi | i thought it regenerated all its data at restart | 23:05 |
ianw | we reload it if project-config changes | 23:06 |
clarkb | ya I wonder if every time we update it or restart it it leaks some data | 23:06 |
fungi | yeah, there's content dated back to 2020-11-20 | 23:06 |
clarkb | we might be able to stop it. Delete things, then start it again? This might also be slow? | 23:06 |
fungi | it might be slow, yes | 23:07 |
clarkb | and maybe we should consider not bindmounting that data so that it auto flushes every time we restart | 23:07 |
clarkb | assuming that is the issue | 23:07 |
ianw | /var/log/resync-hound.log, which is supposed to log resyncs, is empty, i'm guessing a pipe error | 23:07 |
ianw | /usr/local/bin/resync-hound >> /var/log/resync-hound.log 2>&1 | 23:07 |
fungi | the data tree contains 817421 regular files and 257558 directories | 23:07 |
ianw | seems right ... wonder why it's blank | 23:08 |
fungi | ianw: maybe the disk filled up and then the logfile was rotated? | 23:08 |
fungi | no space, no way to write to the new log | 23:08 |
clarkb | also is the redirect happening in the container context? I don't think so based on the quoting | 23:08 |
clarkb | but /var/log in the container is separate from /var/log on the host so maybe? | 23:09 |
ianw | the theory there at least is to take whatever comes out on stdout of the run and save it locally, i don't think that passes through | 23:09 |
fungi | should i hold off stopping the container? (or trying to stop it at least? that may also fail if it needs to write to the rootfs) | 23:10 |
fungi | we'll also want to reboot, seeing as how it's the rootfs which filled up completely, there's no telling what else may have broken in the process | 23:11 |
clarkb | I expect stopping it will fail due to the full disk. What I've done in the past is had journald prune some data, then you get enough headroom to stop things and make changes | 23:12 |
fungi | but yes, my vote is to stop the container if we can, blow away the data tree, reboot the server, then wait for everything to reindex and see what the utilization looks like after that | 23:12 |
clarkb | `journalctl --vacuum-size=500M` or similar then docker-compose down, then debug further? | 23:12 |
clarkb | fungi: ya that sounds good | 23:12 |
fungi | thanks for the journalctl tip, i'll do that first just for safety | 23:13 |
ianw | i think yes, start again and we can monitor what's going on when we issue resyncs | 23:13 |
fungi | downing the container now | 23:13 |
fungi | it claims to have downed the container successfully | 23:13 |
fungi | okay, so are we agreed on just recursively deleting everything under /var/lib/hound/data? | 23:14 |
clarkb | my understanding has been that all of that data is ephemeral and can be rebuilt from the source repos | 23:16 |
clarkb | so ya I'm good with that | 23:16 |
ianw | ++ | 23:16 |
fungi | thanks, removing now | 23:17 |
ianw | so when we update, we run update-hound-config, which diffs project-config and, if it is different, runs supervisorctl restart houndd | 23:21 |
ianw | i note that in the docs, they don't seem to say anything with the container about mapping in a volume for index data | 23:22 |
ianw | i'm not sure if supervisorctl stop && rm -rf /index/data/* && supervisorctl start would work ... or if docker would think the container was dead? | 23:23 |
clarkb | ianw: I think we could just stop/start the container instead? | 23:25 |
clarkb | er and use ephemeral /var/lib/hound/data | 23:25 |
ianw | the problem is that the update checking happens inside the container | 23:26 |
clarkb | ianw: can we have an external prcess trigger the check within the container and then take action externally if necessary? | 23:26 |
fungi | okay, deletion complete, rootfs utilization is at a meagre 13% | 23:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: codesearch: remove index when resyncing https://review.opendev.org/c/opendev/system-config/+/830916 | 23:30 |
fungi | am i clear to reboot the server now so we can make sure fundamental processes are sane, in case anything else broke when it could no longer write? | 23:30 |
ianw | clarkb: ^ maybe with more fiddling; that may be another approach | 23:31 |
clarkb | ianw: ya that seems like it could work | 23:31 |
clarkb | fungi: ya I think so | 23:31 |
ianw | although, i guess we should not map in the index directory either | 23:32 |
fungi | okay, rebooting the server now | 23:32 |
clarkb | ianw: oh ya because we also stop and start the service externally | 23:33 |
fungi | server has rebooted, should we start the container again or are we wanting to get some changes deployed to it first? | 23:37 |
ianw | i think restart, i'm just thinking about it | 23:37 |
clarkb | ya I think we can start it and see what we learn | 23:37 |
*** dviroel|afk is now known as dviroel | 23:37 |
fungi | okay, starting it up now | 23:37 |
*** rlandy|ruck is now known as rlandy|out | 23:38 |
fungi | looks like it's running again | 23:38 |
fungi | reindexing will presumably take ~hours | 23:38 |
clarkb | if you tail the log it gives you some indication of what it is doing iirc | 23:38 |
clarkb | /var/log/containers/docker-hound.log | 23:39 |
fungi | yep | 23:39 |
ianw | that's why i wonder if a complete reindex on project-config updates is the best way | 23:44 |
ianw | i wonder if you can just delete idx-* | 23:47 |
clarkb | says its done | 23:50 |
clarkb | so that was about 13 minutes? | 23:50 |
clarkb | I get results searching in it too | 23:50 |
clarkb | and its only using ~6GB now? | 23:50 |
clarkb | so ya we definitely leak | 23:50 |
ianw | it's probably worth putting a todo for next week and coming back to look at it | 23:51 |
ianw | and checking the old directories on disk, and see if they've been re-cloned | 23:52 |
clarkb | ++ | 23:52 |
ianw | i've soured on the idea of just removing the whole index on restart; but maybe that's the only solution | 23:53 |
clarkb | or maybe we can identify the old data somehow and delete it after codesearch is restarted | 23:54 |
ianw | yeah, maybe an mtime type find & delete | 23:54 |
ianw | if we see that the project has been re-cloned and the old one hasn't been accessed | 23:55 |
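A rough sketch of that mtime-based sweep, assuming hound keeps one directory per indexed copy under the data root; the path is from the discussion above, but the age cutoff is an illustrative guess:

```python
import shutil
import time
from pathlib import Path

DATA_ROOT = Path("/var/lib/hound/data")  # the bind-mounted index tree
MAX_AGE_SECONDS = 7 * 24 * 3600          # illustrative cutoff, not tuned


def prune_stale_entries():
    """Remove data-tree entries hound has not touched recently.

    Once a resync re-clones a project into a fresh directory, the leaked
    old copy's mtime stops advancing, so an mtime sweep catches it
    without disturbing live indexes.
    """
    cutoff = time.time() - MAX_AGE_SECONDS
    for entry in DATA_ROOT.iterdir():
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
```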
ianw | i don't see any references to this in the hound github, open or closed issues | 23:59 |