Dereckson | Hello. Do we have a specific channel for git-review? | 00:18 |
fungi | Dereckson: yep, this one ;) | 00:19 |
Dereckson | ah thanks :) | 00:19 |
Dereckson | I maintain the FreeBSD port for git-review, and I've a request on our bug tracker to include the commit fixing the rebase for Git 2.34 | 00:19 |
Dereckson | I've also noticed Fedora released a "2.2.0" rpm in advance of an actual release. | 00:20 |
fungi | Dereckson: this one? https://pypi.org/project/git-review/2.2.0/ | 00:21 |
fungi | https://opendev.org/opendev/git-review/src/tag/2.2.0 | 00:22 |
fungi | Dereckson: https://lists.opendev.org/pipermail/service-announce/2021-November/000028.html | 00:23 |
fungi | that was the release announcement last november | 00:23 |
fungi | but if you specifically want to backport the fix you mentioned, it's commit 7182166ec00ad3645821435d72c5424b4629165f | 00:24 |
fungi | reviewed in https://review.opendev.org/c/opendev/git-review/+/818219 | 00:24 |
Dereckson | I'd like to avoid the backport and release a new version, but I was still tracking https://docs.opendev.org/opendev/git-review/latest/ | 00:24 |
Dereckson | as we already have 2.2.0, problem solved, thanks | 00:25 |
fungi | ahh, yeah, we need to figure out a way to force a docs refresh on tag push, it would say 2.2.0 there the next time we merge a commit, but nothing's merged since i pushed that tag | 00:25 |
fungi | sorry for the confusion, i'll spend some more time thinking about how we can make that smoother | 00:27 |
fungi | in a high-activity project it tends to go unnoticed, but git-review doesn't update all that often | 00:28 |
fungi | i can't remember why we had decided previously that re-running our docs jobs on tag events was problematic | 00:29 |
fungi | clarkb: ianw: do either of you recall? | 00:29 |
opendevreview | Merged opendev/glean master: distro: sync to 3.6 https://review.opendev.org/c/opendev/glean/+/830536 | 00:30 |
ianw | umm, i do not. i feel like it would be ok, as it would just push new contents to AFS? | 00:30 |
fungi | right, like simply adding the docs job we already run in post to the tag or release pipelines | 00:31 |
Dereckson | Still supported under Python 3.6 by the way? | 00:31 |
Dereckson | (I only noted 2.0 is Python 3+) | 00:32 |
fungi | Dereckson: python-requires = >= 3.5 | 00:32 |
fungi | so yep | 00:32 |
Dereckson | ok | 00:32 |
ianw | fungi: yeah, perhaps we were worried about overwriting or something? if you pushed a tag not at HEAD? | 00:33 |
fungi | Dereckson: the patch mentioned above was automatically tested with 3.5, 3.6, 3.7, 3.8 and 3.9 | 00:33 |
fungi | (or some point releases thereof for each of those, anyway) | 00:34 |
fungi | ianw: i suppose if you tagged an earlier commit than the branch tip that could roll back the content, right | 00:34 |
fungi | now that zuul has the ability to guess a most likely branch for any tag, we could maybe reset to the branch tip if the ref is a tag, but that will require some thought | 00:35 |
fungi | in a single-branch project like git-review that's obviously safe, but for projects maintaining multiple branches it may need some additional care to make sure the correct branch is chosen | 00:36 |
clarkb | I don't recall why that changed. It used to be we published the main docs and the version specific docs when we pushed tags. But now our tag documentation jobs generally only push the tag specific version iirc | 00:45 |
clarkb | zuul suffers from this too | 00:45 |
clarkb | oh I think I remember. It is because the tag version docs can overwrite the latest master builds as they can race each other | 00:46 |
clarkb | I agree that is something git-review doesn't really need to worry about | 00:46 |
fungi | oh, right, i forgot about the version-specific doc builds | 00:47 |
fungi | so, yes, we'd need to run two docs jobs on any tag event: one to publish the version-specific docs, and one which resets to the guessed branch for that tag and refreshes the branch-specific docs | 00:47 |
Dereckson | Thanks for the support fungi, patch submitted to the FreeBSD ports system so we can have 2.2.0 there. Tested locally, works like a charm. | 01:08 |
fungi | Dereckson: great! thanks for confirming it's working there | 01:57 |
*** rlandy|ruck|bbl is now known as rlandy|out | 02:06 |
*** lajoskatona_ is now known as lajoskatona | 02:25 | |
opendevreview | wangxiyuan proposed zuul/zuul-jobs master: Add openEuler to iptalbe firewall persist https://review.opendev.org/c/zuul/zuul-jobs/+/830706 | 02:32 |
*** pojadhav|out is now known as pojadhav|ruck | 02:52 | |
*** frenzy_friday is now known as frenzyfriday|rover | 04:26 | |
*** ysandeep|out is now known as ysandeep | 04:53 | |
*** ysandeep is now known as ysandeep|away | 05:55 | |
*** ysandeep|away is now known as ysandeep | 07:04 | |
*** amoralej|off is now known as amoralej | 07:09 | |
*** jpena|off is now known as jpena | 08:34 | |
opendevreview | Ian Wienand proposed opendev/system-config master: encrypt-logs: turn on for all prod playbooks https://review.opendev.org/c/opendev/system-config/+/830784 | 08:40 |
opendevreview | Ian Wienand proposed opendev/system-config master: docs: reorganise around a open infrastructure overview https://review.opendev.org/c/opendev/system-config/+/830785 | 08:40 |
frickler | this may affect our mailman setup going forward https://bugs.launchpad.net/ubuntu/+source/mailman3/+bug/1960547 | 09:06 |
*** ysandeep is now known as ysandeep|lunch | 09:07 | |
opendevreview | Merged openstack/diskimage-builder master: rhel: work around RHEL-9 BLS issues https://review.opendev.org/c/openstack/diskimage-builder/+/829620 | 10:04 |
opendevreview | Merged openstack/diskimage-builder master: Detect boot and EFI partitions in extract-image https://review.opendev.org/c/openstack/diskimage-builder/+/828617 | 10:04 |
*** ysandeep|lunch is now known as ysandeep | 10:19 | |
opendevreview | Will Szumski proposed openstack/diskimage-builder master: Always Use linuxefi with DIB_BLOCK_DEVICE=efi https://review.opendev.org/c/openstack/diskimage-builder/+/830801 | 10:20 |
opendevreview | Merged opendev/glean master: Add Rocky Linux support https://review.opendev.org/c/opendev/glean/+/830539 | 10:55 |
*** rlandy|out is now known as rlandy|ruck | 11:18 | |
*** dviroel_ is now known as dviroel | 11:28 | |
*** bhagyashris_ is now known as bhagyashris | 11:57 | |
*** amoralej is now known as amoralej|lunch | 12:19 | |
*** sshnaidm is now known as sshnaidm|off | 12:23 | |
*** ysandeep is now known as ysandeep|afk | 12:33 | |
*** ysandeep|afk is now known as ysandeep | 12:54 | |
*** amoralej|lunch is now known as amoralej | 13:19 | |
opendevreview | Merged opendev/glean master: Remove rebuild-test-output.sh https://review.opendev.org/c/opendev/glean/+/830540 | 13:39 |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Revert "Detect boot and EFI partitions in extract-image" https://review.opendev.org/c/openstack/diskimage-builder/+/830717 | 13:42 |
*** dviroel is now known as dviroel|brb | 14:01 | |
fungi | frickler: thanks for the heads up! while i did the poc with the debian/ubuntu packages, the current plan is to use https://docs.mailman3.org/en/latest/install/docker.html | 14:08 |
fungi | and yeah, i actually bookmarked the rfh bug back when it was opened in november, in case i find time to help maintain the debian packages (which i might if i decide to run them on some of my servers for my own projects) | 14:18 |
fungi | looks like a couple of wmf folks volunteered to pitch in though | 14:20 |
fungi | and they're actually dds, which i'm not, so probably have more chance of being useful as their uploads won't need sponsoring | 14:21 |
*** dviroel|brb is now known as dviroel | 14:23 | |
*** pojadhav|ruck is now known as pojadhav|brb | 14:28 | |
*** pojadhav|brb is now known as pojadhav|afk | 14:32 | |
mgagne_ | fungi: my (new) employer would like to reach out to you. Would your first name at openinfra.dev be the email to use? | 15:46 |
fungi | mgagne_: sure, that works. thanks! | 15:51 |
mgagne_ | good | 15:51 |
*** dviroel is now known as dviroel|lunvh | 16:03 | |
*** dviroel|lunvh is now known as dviroel|lunch | 16:03 | |
*** ysandeep is now known as ysandeep|out | 16:17 | |
*** pojadhav|afk is now known as pojadhav|out | 16:25 | |
clarkb | https://github.com/go-gitea/gitea/pull/18799 is the fix for our gitea 1.16 diff issue. Looks like they noticed before I filed the bug, but I couldn't find an issue for it | 16:29 |
*** ykarel is now known as ykarel|away | 16:31 | |
fungi | hah, convenient! | 16:34 |
clarkb | Looks like there may be a good number of other fixes in 1.16.2. I'm thinking we can wait for that version whether or not it includes the diff fix and maybe update to that instead | 16:35 |
fungi | sure | 16:37 |
fungi | wfm | 16:37 |
opendevreview | Merged openstack/diskimage-builder master: Revert "Detect boot and EFI partitions in extract-image" https://review.opendev.org/c/openstack/diskimage-builder/+/830717 | 16:45 |
*** marios is now known as marios|out | 16:51 | |
*** dviroel|lunch is now known as dviroel | 16:58 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Etherpad to 1.8.17 https://review.opendev.org/c/opendev/system-config/+/830874 | 17:11 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM forcing failure to hold a node and check new etherpad https://review.opendev.org/c/opendev/system-config/+/830875 | 17:11 |
clarkb | I'll put a hold on the etherpad job for that second change and we can check the etherpad there and land the parent if it looks good | 17:12 |
fungi | ooh, thanks! | 17:13 |
clarkb | fungi: 0000000024 is an autohold for the gitea links checking. I had mentioned to frickler we can use that one to double check the mergeability checking update on a running gerrit. I'll see if I can get to that today, then we are good to delete the autohold? | 17:14 |
fungi | clarkb: yeah, i have no more need of it | 17:29 |
*** amoralej is now known as amoralej|off | 17:42 | |
*** jpena is now known as jpena|off | 17:53 | |
clarkb | infra-root https://104.130.132.239/ but you need to hit it with that ip set as etherpad.opendev.org in /etc/hosts due to redirects | 17:57 |
clarkb | I'm in the isitbroken pad | 17:57 |
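The /etc/hosts override described above amounts to a single entry mapping the production hostname to the held node's address (both taken from the message above):

```
104.130.132.239 etherpad.opendev.org
```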
*** rlandy|ruck is now known as rlandy|ruck|mtg | 17:59 | |
clarkb | it seems to be working happily. If you agree we should be able to proceed with https://review.opendev.org/c/opendev/system-config/+/830874 | 17:59 |
corvus | checking | 18:00 |
corvus | it does not seem broken | 18:01 |
corvus | i have closed etherpad | 18:07 |
clarkb | me too. Let me know if I should rejoin to perform additional testing but I think corvus and I are happy with it | 18:08 |
clarkb | we did note that unset names can apparently be set by anyone. Then once set, only the controlling session can change the name | 18:09 |
corvus | i think that's an existing behavior, just not one we exercise often | 18:11 |
corvus | +ing | 18:11 |
clarkb | oh one other thing. I checked to see if 1.8.17 has upstream docker images published yet and it does not. Seems like it was a good move to stop relying on them for these | 18:20 |
clarkb | I was hopeful that was a one-off issue with 1.8.16 but seems not | 18:21 |
fungi | interesting | 18:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test if we can build gitea release/v1.16 https://review.opendev.org/c/opendev/system-config/+/830885 | 18:38 |
*** rlandy|ruck|mtg is now known as rlandy|ruck | 19:19 | |
mnasiadka | hello | 19:20 |
mnasiadka | regarding Rocky Linux - does it work now after merging the glean patches? | 19:20 |
clarkb | mnasiadka: no we need a glean release and that needs to be coordinated with a dib release. One thing complicating this is it changes how glean is executed by udev a bit and ideally we'll be able to monitor that as it goes in | 19:26 |
clarkb | that said rocky is bootable in at least one of the clouds now via a fallback to dhcp so it kind of works | 19:26 |
mnasiadka | ok, so I can run a job now using rocky nodeset? | 19:26 |
mnasiadka | or do I have to wait for the coordinated release? | 19:27 |
clarkb | I think you can try running a job now and see what happens | 19:29 |
clarkb | I'm not going to commit to it working reliably at this point, but it should hopefully be sufficient to start seeing if rocky works for your stuff and preparing for when it is reliable | 19:29 |
clarkb | if ianw would like to do those releases early in his day I can help monitor today and tomorrow. | 19:30 |
clarkb | I think worst case we end up tagging an old commit of glean as a new version to revert | 19:30 |
clarkb | but there is a lot to observe as we need to ensure all the images continue to boot in all the clouds :/ | 19:30 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up two retired mailing lists https://review.opendev.org/c/opendev/system-config/+/830893 | 19:34 |
clarkb | fungi: isn't it great when ansible is helpful :) | 19:35 |
fungi | so great | 19:35 |
clarkb | Looking at glean I think we'd do a 1.20.0 for the executable restructuring, gentoo fixes, rocky addition and distro sync to 3.6 | 19:54 |
fungi | yeah, it needs a minor rev, not patch | 19:54 |
fungi | agreed | 19:54 |
fungi | the rocky addition and vendored distro update both warrant it | 19:55 |
clarkb | corvus: is https://zuul.opendev.org/t/openstack/build/cde0c6eedcc54d67acac0e2ee31245ca a zuul-registry bug? | 20:14 |
clarkb | fungi: also I think if you want you could update the gerrit testing jobs to not rebuild the images on your config update | 20:14 |
clarkb | fungi: that might be more reliable now while we sort out that problem which seems fairly consistent | 20:15 |
clarkb | and now I need to eat lunch | 20:15 |
corvus | clarkb: either that, or a client bug, or a client/registry interaction bug | 20:21 |
fungi | oh, huh | 20:21 |
corvus | or maybe corrupted data? i dunno. it usually takes a while to triage those | 20:21 |
corvus | `Manifest has invalid size for layer sha256:41de18d1833d2d5e6cb6111780577f3eab0afd9b1bf6c0c1756c5227abfa645c (size:44359680 actual:203253446)` is the meat of it | 20:22 |
fungi | yeah, that definitely seems wrong | 20:23 |
fungi | looks like it could be an interrupted upload | 20:24 |
fungi | though size<actual which would imply the reverse? | 20:24 |
fungi | i'm misunderstanding why we'd see that on upload of a newly rebuilt image each time though | 20:27 |
fungi | and yeah, i misread the error originally and thought it was dockerhub rejecting the upload, didn't spot that it was the buildset registry | 20:28 |
fungi | 829975 is really not at all urgent though, i'm perfectly happy to sort this problem out first | 20:29 |
fungi | though before i do anything else, i should probably sort out dinner | 20:38 |
clarkb | hrm we've had these size mismatches before I think ianw debugged them | 20:39 |
clarkb | the gist was we calculated things off by one in the registry iirc and some clients became more strict? | 20:40 |
fungi | something something skopeo something? ;) | 20:40 |
clarkb | maybe we've got another client being more strict and we're catching another existing bug somewhere | 20:40 |
clarkb | https://zuul.opendev.org/t/openstack/build/ad14ff4e385640beba9d43b9de108973 so it did succeed in check | 20:45 |
clarkb | but then seems to consistently fail in gate? | 20:45 |
clarkb | oh gate and check run two slightly different versions of the image build jobs. One uploads only to the insecure ci registry and the other to both docker and the insecure ci registry | 20:46 |
clarkb | so in this case cherrypy is reporting back that the manifest reported a size of 44359680 but it got 203253446 bytes. The tool doing the push is docker itself not skopeo | 20:47 |
clarkb | this upload is to the buildset registry not the insecure ci registry. | 20:50 |
fungi | the server itself doesn't seem to be in any particular distress, at least | 20:51 |
fungi | if anything it looks extremely underutilized | 20:52 |
clarkb | well this is the buildset registry which is per buildset | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt is the log for the one that failed | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#767 here is the server side reporting the 400 error | 20:53 |
*** dviroel is now known as dviroel|afk | 20:54 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Revert "Revert "Detect boot and EFI partitions in extract-image"" https://review.opendev.org/c/openstack/diskimage-builder/+/830900 | 20:54 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#711 that shows the docker client HEADing the blob and getting back the actual size rather than the size in the manifest | 20:55 |
clarkb | the docker version is from december | 20:56 |
clarkb | oh now this is really curious | 20:56 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#667 vs https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#710 | 20:57 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#706 that is when it reports the upload is complete | 20:57 |
clarkb | I think it must be checking the size before the upload is complete and using that smaller value. Then by the time it tries to use that smaller value, the up to date and correct value is known and there is a conflict | 20:58 |
clarkb | I don't know enough about the docker image upload protocols to know if we are supposed to kick back a 404 or what if the upload isn't completed | 20:58 |
clarkb | corvus: ^ do you know off the top of your head? | 20:59 |
fungi | oh, yeah, i misread you, i was looking at insecure-ci-registry, sorry | 20:59 |
ianw | o/ | 21:00 |
ianw | istr that there was a podman 4 release very recently. have we pulled in something different? | 21:01 |
clarkb | ianw: this should all be docker not podman | 21:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#552-555 at this point which is before 667 with the shorter size it seems to recognize 203253446 is the size of something | 21:03 |
ianw | clarkb: i can do the glean release ... but given that it will have global effect i'm thinking that doing it on my monday might be a better idea. rolling it back will be an exercise in rebuilding things and having a longer runway when things are quiet would be helpful | 21:04 |
clarkb | ianw: wfm | 21:04 |
clarkb | ianw: I think if things get really bad with a glean release the easiest thing may be to pull the release from pypi | 21:04 |
clarkb | then the next round of image builds will use the prior good release which we know to work | 21:05 |
clarkb | but ++ to doing it when everything can be monitored and action taken to address problems if necessary | 21:05 |
clarkb | for the registry thing I think line 667 of the registry log is the thing we need to understand better. It is returning the short value that things complain about and it is doing so after the registry seems to know about the larger value logged on 552-555 | 21:06 |
ianw | hrm maybe not podman, but IIRC the use of buildkit was involved | 21:07 |
clarkb | zuul registry's filesystem implementation does return os.stat(path).st_size to get the size that HEAD returns | 21:08 |
clarkb | I think the issue here may be that we've told the registry it is this size, but before all that data is written out to disk we've queried the disk for how much it has and fed that back to docker | 21:08 |
clarkb | thinking out loud here. Maybe we need to write to a tmp name then do an atomic mv? | 21:09 |
clarkb | then the os.path.exists will fail for the early HEAD and we report the proper result when we know it? Though that might also break docker because it may need that early HEAD to succeed | 21:10 |
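A minimal sketch of the failure mode described above, assuming a filesystem backend that concatenates chunks into the blob's final path and answers HEAD from os.stat(); the layout and handler shape are illustrative, not the actual zuul-registry code:

```python
import os

BLOB_ROOT = "/var/lib/registry/blobs"  # illustrative path, not zuul-registry's real layout


def head_blob(digest):
    """Sketch of a HEAD /v2/<repo>/blobs/<digest> handler.

    If chunks are concatenated into the final path while the upload is
    still in flight, a HEAD arriving mid-upload sees the partial file and
    reports a short size (e.g. 44359680 of an eventual 203253446 bytes),
    which the client then bakes into the manifest it PUTs.
    """
    path = os.path.join(BLOB_ROOT, digest)
    if not os.path.exists(path):
        return 404, {}
    # Races with the in-progress upload: st_size only reaches the real
    # blob size once the final chunk has been written out.
    return 200, {"Content-Length": str(os.stat(path).st_size)}
```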
ianw | https://opendev.org/zuul/zuul-registry/commit/7100c360b31dfd1b9f4413eb2df3db3f229a5e26 seems similar | 21:12 |
clarkb | I've just confirmed the actual_size value is also from disk. So by the time we go to validate the manifest the size is correct on disk but the PUT for the manifest must've supplied the short value from the earlier head | 21:16 |
fungi | so are we racing a fd flush? | 21:18 |
clarkb | "When this response is received, the client can assume that the layer is already available in the registry under the given name and should take no further action to upload the layer." this is what it says about the HEAD on a layer (which is expected to return the size) | 21:19 |
clarkb | I think there is a race here. We shouldn't be responding positively to the HEAD request when the value is short. | 21:19 |
clarkb | fungi: no I think the upload hasn't completed yet based on the logs | 21:19 |
clarkb | fungi: even if we flushed everything we'd still be short because the client is sending us more data | 21:20 |
clarkb | https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#706 is where the upload for the layer completes but https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#667 is responding that the layer exists prior to that with a smaller size | 21:20 |
fungi | oh, the short read is happening server-side? | 21:21 |
clarkb | yes, it seems like midway through the upload the client is asking the server in a separate thread to tell it how big the not yet completely uploaded layer is | 21:23 |
clarkb | the server is checking what it has on disk which is short because it hasn't finished receiving the data and responds with that value. Then the client constructs a manifest that it sends to the server with that short value | 21:23 |
clarkb | (it's a bit insane to me that we need to track sizes like this when the client is the originator of the info and everything is hashed, just check if hashes are valid) | 21:24 |
clarkb | and zuul-registry doesn't seem to be stateful enough to know whether or not a file on disk represents a complete upload? | 21:25 |
fungi | #zuul | 21:30 |
fungi | hah, you're not my buffer command | 21:30 |
clarkb | I suspect that this may be a bug in the client HEADing a layer before it has finished uploading the layer, which exposes a bug in zuul-registry: it returns the layer info before the layer is complete | 21:32 |
clarkb | hrm there may be some state here though as we seem to record chunks and those go to a subdir? | 21:34 |
clarkb | and then later they end up in the path that the HEAD looks at | 21:34 |
clarkb | oh actually https://zuul.opendev.org/t/openstack/build/d80aa9c8e84241cdbbfffc7d7c4f43fc/log/docker/buildset_registry.txt#554 that may be where we finish the upload which is before the short read so maybe the HEAD isn't too soon and we just aren't synced? | 21:36 |
clarkb | I think I see it | 21:39 |
clarkb | or at least something that can cause it. Fix on the way | 21:39 |
opendevreview | Merged opendev/system-config master: Clean up two retired mailing lists https://review.opendev.org/c/opendev/system-config/+/830893 | 21:44 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul-registry/+/830905 Atomically concatenate blob objects | 21:52 |
clarkb | It is possible that the fsync flushing is also to blame I suppose | 21:53 |
clarkb | and we may not actually solve this with the move, but I think the move is more correct either way | 21:53 |
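A rough sketch of the write-to-a-temp-name-then-rename approach described above; the helper name and chunk handling are assumptions, not the actual 830905 patch:

```python
import os
import tempfile


def finalize_blob(chunk_paths, blob_path):
    """Concatenate uploaded chunks, then publish the blob atomically.

    The blob only appears at its final path via os.rename(), which is
    atomic within a filesystem, so a concurrent HEAD either sees no blob
    yet (404) or the complete file, never a partial size.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(blob_path))
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk_path in chunk_paths:
                with open(chunk_path, "rb") as chunk:
                    out.write(chunk.read())
            out.flush()
            os.fsync(out.fileno())  # also covers the fsync/flush theory above
        os.rename(tmp_path, blob_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```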
clarkb | also I think this may be somewhat cloud specific | 22:01 |
clarkb | due to iops availability? | 22:01 |
clarkb | rechecking until it goes through is probably viable. but also we should reconsider building gerrit images for apache config changes | 22:01 |
clarkb | the two shouldn't be related | 22:01 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Revert "Revert "Detect boot and EFI partitions in extract-image"" https://review.opendev.org/c/openstack/diskimage-builder/+/830900 | 22:17 |
ianw | i wonder how hard it would be to set up a job to build and push into the registry via mitmproxy and keep a dump of the traffic. that was how i debugged the original issues | 22:17 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/830874 could use another reviewer. I'm happy to approve that tomorrow morning and watch it after wards | 22:28 |
clarkb | the child change has a held etherpad you can test with too. corvus and I did testing earlier with the isitbroken pad on the test node and it seemed fine (note you have to update your /etc/hosts to make the redirects work) | 22:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove Gerrit's JVM GC logs https://review.opendev.org/c/opendev/system-config/+/830912 | 22:44 |
clarkb | Ok I tested the mergeability update and it seemed fine. I think we can proceed with that when we have a good time to restart gerrit | 22:44 |
clarkb | since I was on the test node gerrit I wanted to see if I could fix the jvm_gc log rotation errors we always get on gerrit startup and 830912 is the best I could come up with. Basically stop telling it to write those files entirely. Note that we have to stop gerrit and move the old files aside before starting it again or we get the same errors on startup | 22:45 |
clarkb | The extent of my mergeability testing was to check the error_log for any indication a full reindex occurred and I didn't see any | 22:45 |
clarkb | I'll leave that server up until tomorrow in case frickler wants to take a look but then I think we can delete the old | 22:46 |
clarkb | *delete the zuul autohold | 22:47 |
fungi | okay, back from dinner/errands | 22:50 |
fungi | i guess a depends-on to 830905 from a failing system-config change isn't going to exercise the updated registry code, right? | 22:53 |
fungi | looks like base deploys may be failing again, digging into it now | 22:56 |
fungi | the failed build i'm looking at claimed codesearch was unreachable at 2022-02-24T21:49:42Z | 22:58 |
fungi | i can ssh into it and it's been "up 29 days" so maybe just a network blip | 22:59 |
fungi | oh! i should have read further | 22:59 |
fungi | it was "unreachable" because "mkdir: cannot create directory ...: No space left on device" | 22:59 |
fungi | rootfs is indeed full | 22:59 |
fungi | i'll see what's gone nuts | 23:00 |
ianw | fungi: no, it will really have to be rolled out to the actual registry i think | 23:01 |
clarkb | ianw: fungi: the buildset registry uses the :latest tag so i think once my change lands and :latest is updated it will be used afterwards | 23:01 |
fungi | got it | 23:01 |
clarkb | fungi: ianw: re codesearch seems it is largely /var/lib/hound/data | 23:02 |
clarkb | which is where the indexed data lives | 23:02 |
ianw | i wonder if we're not removing things we should be; seems there's like daily files, but stuff from back in november | 23:03 |
clarkb | I do half wonder if some of the data in there is stale though. There are 2831 entries and I don't think we're indexing that many repos | 23:03 |
ianw | not sure if it's additive | 23:03 |
clarkb | ianw: ya exactly | 23:03 |
fungi | yeah, that's 87% of the fs | 23:04 |
fungi | there's a ton of old files in there | 23:05 |
fungi | even over a year old | 23:05 |
fungi | i thought it regenerated all its data at restart | 23:05 |
ianw | we reload it if project-config changes | 23:06 |
clarkb | ya I wonder if every time we update it or restart it it leaks some data | 23:06 |
fungi | yeah, there's content dated back to 2020-11-20 | 23:06 |
clarkb | we might be able to stop it. Delete things, then start it again? This might also be slow? | 23:06 |
fungi | it might be slow, yes | 23:07 |
clarkb | and maybe we should consider not bindmounting that data so that it auto flushes every time we restart | 23:07 |
clarkb | assuming that is the issue | 23:07 |
ianw | /var/log/resync-hound.log, which is supposed to log resyncs, is empty, i'm guessing a pipe error | 23:07 |
ianw | /usr/local/bin/resync-hound >> /var/log/resync-hound.log 2>&1 | 23:07 |
fungi | the data tree contains 817421 regular files and 257558 directories | 23:07 |
ianw | seems right ... wonder why it's blank | 23:08 |
fungi | ianw: maybe the disk filled up and then the logfile was rotated? | 23:08 |
fungi | no space, no way to write to the new log | 23:08 |
clarkb | also is the redirect happening in the container context? I don't think so based on the quoting | 23:08 |
clarkb | but /var/log in the container is separate from /var/log on the host so maybe? | 23:09 |
ianw | the theory there at least is to take whatever comes out on stdout of the run and save it locally, i don't think that passes through | 23:09 |
fungi | should i hold off stopping the container? (or trying to stop it at least? that may also fail if it needs to write to the rootfs) | 23:10 |
fungi | we'll also want to reboot, seeing as how it's the rootfs which filled up completely, there's no telling what else may have broken in the process | 23:11 |
clarkb | I expect stopping it will fail due to the full disk. What I've done in the past is had journald prune some data, then you get enough headroom to stop things and make changes | 23:12 |
fungi | but yes, my vote is to stop the container if we can, blow away the data tree, reboot the server, then wait for everything to reindex and see what the utilization looks like after that | 23:12 |
clarkb | `journalctl --vacuum-size=500M` or similar then docker-compose down, then debug further? | 23:12 |
clarkb | fungi: ya that sounds good | 23:12 |
fungi | thanks for the journalctl tip, i'll do that first just for safety | 23:13 |
ianw | i think yes, start again and we can monitor what's going on when we issue resyncs | 23:13 |
fungi | downing the container now | 23:13 |
fungi | it claims to have downed the container successfully | 23:13 |
fungi | okay, so are we agreed on just recursively deleting everything under /var/lib/hound/data? | 23:14 |
clarkb | my understanding has been that all of that data is ephemeral and can be rebuilt from the source repos | 23:16 |
clarkb | so ya I'm good with that | 23:16 |
ianw | ++ | 23:16 |
fungi | thanks, removing now | 23:17 |
ianw | so when we update, we run update-hound-config, which diffs project-config and, if it is different, runs supervisorctl restart houndd | 23:21 |
ianw | i note that in the docs, they don't seem to say anything with the container about mapping in a volume for index data | 23:22 |
ianw | i'm not sure if supervisorctl stop && rm -rf /index/data/* && supervisorctl start would work ... or if docker would think the container was dead? | 23:23 |
clarkb | ianw: I think we could just stop/start the container instead? | 23:25 |
clarkb | er and use ephemeral /var/lib/hound/data | 23:25 |
ianw | the problem is that the update checking happens inside the container | 23:26 |
clarkb | ianw: can we have an external prcess trigger the check within the container and then take action externally if necessary? | 23:26 |
fungi | okay, deletion complete, rootfs utilization is at a meagre 13% | 23:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: codesearch: remove index when resyncing https://review.opendev.org/c/opendev/system-config/+/830916 | 23:30 |
fungi | am i clear to reboot the server now so we can make sure fundamental processes are sane, in case anything else broke when it could no longer write? | 23:30 |
ianw | clarkb: ^ maybe with more fiddling; that may be another approach | 23:31 |
clarkb | ianw: ya that seems like it could work | 23:31 |
clarkb | fungi: ya I think so | 23:31 |
ianw | although, i guess we should not map in the index directory either | 23:32 |
fungi | okay, rebooting the server now | 23:32 |
clarkb | ianw: oh ya because we also stop and start the service externally | 23:33 |
fungi | server has rebooted, should we start the container again or are we wanting to get some changes deployed to it first? | 23:37 |
ianw | i think restart, i'm just thinking about it | 23:37 |
clarkb | ya I think we can start it and see what we learn | 23:37 |
*** dviroel|afk is now known as dviroel | 23:37 |
fungi | okay, starting it up now | 23:37 |
*** rlandy|ruck is now known as rlandy|out | 23:38 |
fungi | looks like it's running again | 23:38 |
fungi | reindexing will presumably take ~hours | 23:38 |
clarkb | if you tail the log it gives you some indication of what it is doing iirc | 23:38 |
clarkb | /var/log/containers/docker-hound.log | 23:39 |
fungi | yep | 23:39 |
ianw | that's why i wonder if a complete reindex on project-config updates is the best way | 23:44 |
ianw | i wonder if you can just delete idx-* | 23:47 |
clarkb | says its done | 23:50 |
clarkb | so that was about 13 minutes? | 23:50 |
clarkb | I get results searching in it too | 23:50 |
clarkb | and its only using ~6GB now? | 23:50 |
clarkb | so ya we definitely leak | 23:50 |
ianw | it's probably worth putting a todo for next week and coming back to look at it | 23:51 |
ianw | and checking the old directories on disk, and see if they've been re-cloned | 23:52 |
clarkb | ++ | 23:52 |
ianw | i've soured on the idea of just removing the whole index on restart; but maybe that's the only solution | 23:53 |
clarkb | or maybe we can identify the old data somehow and delete it after codesearch is restarted | 23:54 |
ianw | yeah, maybe an mtime type find & delete | 23:54 |
ianw | if we see that the project has been re-cloned and the old one hasn't been accessed | 23:55 |
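A rough sketch of that mtime-based sweep, assuming hound keeps one directory per indexed copy under the data root; the path is from the discussion above, but the age cutoff is an illustrative guess:

```python
import shutil
import time
from pathlib import Path

DATA_ROOT = Path("/var/lib/hound/data")  # the bind-mounted index tree
MAX_AGE_SECONDS = 7 * 24 * 3600          # illustrative cutoff, not tuned


def prune_stale_entries():
    """Remove data-tree entries hound has not touched recently.

    Once a resync re-clones a project into a fresh directory, the leaked
    old copy's mtime stops advancing, so an mtime sweep catches it
    without disturbing live indexes.
    """
    cutoff = time.time() - MAX_AGE_SECONDS
    for entry in DATA_ROOT.iterdir():
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
```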
ianw | i don't see any references to this in the hound github, open or closed issues | 23:59 |