opendevreview | James E. Blair proposed openstack/project-config master: Fix zuul status node requests graph https://review.opendev.org/c/openstack/project-config/+/953914 | 00:38 |
---|---|---|
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove mirror from experimental debian jobs https://review.opendev.org/c/openstack/diskimage-builder/+/953256 | 01:57 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove nodepool based testing https://review.opendev.org/c/openstack/diskimage-builder/+/952953 | 01:57 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove testing for f37 https://review.opendev.org/c/openstack/diskimage-builder/+/952954 | 01:57 |
opendevreview | Merged openstack/project-config master: Fix zuul status node requests graph https://review.opendev.org/c/openstack/project-config/+/953914 | 09:12 |
stephenfin | clarkb: fungi: gate fixes for pbr starting here, when you have a chance. I'd like to get that going to unblock a few other open changes, before moving onto further testing | 11:13 |
stephenfin | note that I'm simply disabling tests with newer setuptools since the fixes are likely to be involved, and this way we can selectively re-enable them as we fix or delete them if it's no longer sustainable | 11:13 |
frickler | is this a new feature in the gerrit UI that in the "reply" popup I can remove reviewers/CCs by clicking on the "x" next to them, but then when I try to submit my comment/reviews, I get "Error 403 (Forbidden): remove reviewer not permitted" and I also have no way of restoring to the original state without losing the comment I typed? | 12:21 |
mnasiadka | Good feature | 12:21 |
fungi | frickler: if you changed permission levels for your account (adding to/removing from the administrators group for example) you'll need to force-refresh gerrit in your browser since the client caches a lot of permission lookups client-side | 12:26 |
frickler | hmm, last permission changes were a very long time ago. still I'd argue they should be treated consistently within the UI | 12:28 |
fungi | headed out to run some errands, should be back soon | 13:00 |
mnasiadka | corvus/frickler: willing to take a look at https://review.opendev.org/c/opendev/zuul-providers/+/953908? ;-) | 13:30 |
clarkb | stephenfin: ya I should be able to take a look today | 14:54 |
stephenfin | ty | 14:54 |
clarkb | since things seem otherwise quiet I'm going to take the opportunity for some local system updates first though. I've been neglecting those due to a busy early week | 14:54 |
opendevreview | James E. Blair proposed openstack/project-config master: Revise zuul-launcher dashboards https://review.opendev.org/c/openstack/project-config/+/953973 | 16:18 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 16:23 |
corvus | i'm restarting the launchers with a fix that should address a situation where the launchers were only using half the quota | 16:25 |
corvus | https://i.imgur.com/jO8j7dL.png | 16:32 |
corvus | that looks a lot better | 16:32 |
corvus | you can see the usage increase when i restarted, and it's now fully saturated | 16:33 |
corvus | that's from the new style dashboard i just uploaded in 953973 | 16:33 |
clarkb | corvus: which launcher change was that? | 16:33 |
corvus | since osuosl is the only provider for arm64 labels, all of the arm64 node requests get assigned there. which means, for the first time, we can see the backlog for that particular provider (the red line that goes way above the max line) | 16:34 |
corvus | clarkb: https://review.opendev.org/953925 | 16:34 |
corvus | wrote it late last night and early this morning | 16:34 |
clarkb | ah a data lookup bug in quota calculations | 16:35 |
clarkb | and ya that graph is neat | 16:35 |
clarkb | and for reviewing the graph update change I'm going to look at screenshots and if they look good give it a +2 rather than try and parse all the json. | 16:36 |
corvus | yeah, i was sad about the json, but i'm pretty sure the beta-feature of the "config from query results" transform is not supportable in our old yaml... | 17:01 |
clarkb | corvus: looking at the screenshots I see that other providers also have the red backlog line above the max server capacity. Is that the backlog that those providers have grabbed from the queue? | 17:03 |
clarkb | it doesn't seem to reflect the global backlog as it differs between them | 17:03 |
corvus | yes, and figuring out why those are getting assigned to providers instead of waiting in the nodeset request queue is next on my list | 17:04 |
corvus | (so the graphs are showing what i think is a real thing that should be improved) | 17:04 |
clarkb | nice. Data helping make things better | 17:05 |
clarkb | I +2'd the change; the graphs seem to work in the screenshots | 17:05 |
corvus | like, 6 hours ago 300 nodes got assigned to rax-dfw, and 50 got assigned to rax-ord. why? i dunno. :) | 17:05 |
clarkb | not sure if anyone else wants to review that or if we should go ahead and approve it | 17:05 |
corvus | it is interesting to see that in ovh, the controlling limit is cores. we use 20% of our ram limit when we're at 100% of cores. | 17:06 |
fungi | i'll take a quick peek | 17:07 |
corvus | osuosl is, unsurprisingly, perfectly balanced -- we're at 100% of instances, cores, ram all at the same time. :) | 17:08 |
clarkb | stephenfin: fungi: ok I have reviewed the pbr changes. I have a concern in the first change that we're over restricting setuptools to <80 even in cases we don't need to and I'm concerned that the precommit update in the last change switches to pulling hacking from an uncached location rather than pypi which is cached in our CI system | 17:30 |
stephenfin | thanks, looking now before I wrap up for the day | 17:32 |
clarkb | fwiw I wish I had kept up with the precommit change | 17:32 |
clarkb | I -1'd it then it was merged despite my request that we not do the thing that the last precommit change does too | 17:32 |
clarkb | precommit is a bad choice in CI environments that attempt to cache things like packages because no one sets it up to install from packages | 17:33 |
clarkb | instead installing from random git repo refs that are not cached | 17:33 |
clarkb | I feel like this proves my point that it's a bad tool. You can use it in less bad ways but no one does even when I explicitly asked that we use it correctly... | 17:34 |
clarkb | fungi: fyi I suspect that precommit is also why the fixup change includes unrelated formatting updates | 17:36 |
clarkb | but I'm not positive of that (there are a bunch of rules in the file for pre-commit-hooks) | 17:36 |
corvus | why would precommit need the ci caches... it's not run in the gate, right? | 17:37 |
clarkb | corvus: it is run in the gate | 17:37 |
opendevreview | Merged openstack/project-config master: Revise zuul-launcher dashboards https://review.opendev.org/c/openstack/project-config/+/953973 | 17:37 |
clarkb | corvus: the way openstack has configured pre-commit is that the pep8/linters targets now all run pre-commit if pre-commit is added to the repo | 17:37 |
clarkb | this way you don't end up with flake8 called directly failing a commit that was pre-commit checked locally due to a mismatch in versions or whatever | 17:38 |
corvus | https://governance.openstack.org/tc/reference/project-testing-interface.html | 17:38 |
corvus | pre-commit does not appear there | 17:38 |
clarkb | corvus: I think the rules in that document are loose enough "every project must enforce code style" | 17:39 |
corvus | i feel like that would have been a really good place to talk about how to integrate pre-commit in a way that doesn't break testing | 17:39 |
clarkb | whether you use flake8 directly or pre-commit to call flake8 is fine under that rule | 17:39 |
clarkb | but yes the problem isn't pre-commit existing. The problem is no one configures pre-commit to use CI caches | 17:39 |
clarkb | even when I explicitly ask them to apparently | 17:40 |
corvus | yeah, that doc used to mostly be about making sure tox was set up correctly | 17:40 |
corvus | so i feel like bypassing the dep installation of tox warrants a note :) | 17:40 |
frickler | jrosser reported some unusual cluster of job timeouts, like in https://zuul.opendev.org/t/openstack/buildset/c9c0373292f445a28033086f4ab4ff4c seems to be mostly on rax-ord/dfw, but I didn't look closer yet and won't get to that today. just as a note in case more such failures pop up | 17:41 |
clarkb | tl;dr you should have pre-commit install python packages by name and version; that way pip will look at pypi which in our CI environment has caching | 17:42 |
corvus | clarkb: ++ | 17:42 |
stephenfin | clarkb: I recall this coming up, but it's a matter of competing priorities. For me, I selfishly care far more about the dev ex win that pre-commit represents than I do about something abstract (to me) like caching | 17:43 |
clarkb | stephenfin: right but you don't run pre-commit 10k times a day | 17:43 |
corvus | i mean, you can have both | 17:43 |
clarkb | the CI system does and we should be better citizens of the Internet (though in the age of AI crawlers this seems like a drop in the bucket) | 17:44 |
clarkb | and ya no one is saying don't use pre-commit. Just asking that you not point at a repo and sha | 17:44 |
clarkb | which I know is the pre-commit default but aiui doesn't have to be run that way | 17:44 |
stephenfin | by doing local configuration, you lose the ability to auto-bump dependencies | 17:45 |
stephenfin | client-side caching too, iirc | 17:45 |
clarkb | stephenfin: you have to bump them in your tox config (which may be requirements.txt or whatever) | 17:45 |
stephenfin | and you need to copy the configuration from the upstream pre-commit hooks | 17:46 |
clarkb | this is a bug in pre-commit fwiw | 17:46 |
clarkb | it should be easier to use like this. | 17:46 |
stephenfin | I think the issue is that there's no way to get all the pre-commit context from the installed package | 17:46 |
stephenfin | that's a known file (.pre-commit-hooks.yaml) in the root of the repo, not something included in the package | 17:47 |
clarkb | I'm not sure I follow. flake8 installed from a git hash or from a pypi package should contain the exact same code as long as the commit hash matches that pypi package release | 17:47 |
clarkb | stephenfin: if there are files missing from the packages then that is a broken package | 17:48 |
clarkb | and that should be fixable (though you may need a new release) | 17:49 |
stephenfin | Alas, no. A .pre-commit-hooks.yaml file won't be included since it lives in the root of the repo, not in the package, so it's a data file (which is a deprecated thing) rather than a package data file | 17:50 |
clarkb | we're still able to include things like READMEs and AUTHORS files | 17:50 |
clarkb | why is this any different? | 17:50 |
stephenfin | the README is included because it's a blessed thing for package management tools. The AUTHORS is included because pbr (and only pbr, right?) has a soon-to-be-broken hook for that. Adding another file to dist-info will require a new hooking mechanism, which I can only assume would necessitate a PEP | 17:53 |
clarkb | as a sanity check I grabbed https://files.pythonhosted.org/packages/e7/7f/2143758ec2ed791b9fe506a4721fed680452291f7d8bfb39b397d9a86687/zuul-12.1.0.tar.gz and it contains Changelog and even the input MANIFEST.in | 17:53 |
clarkb | I guess what you're saying is setuptools is going to break our ability to properly package software further and we won't be able to do that in the future? | 17:54 |
clarkb | but as far as I can tell this does work today with existing packaging | 17:54 |
fungi | stephenfin: no, AUTHORS gets included in packages automatically by setuptools as a license-related file, auto-detected similar to Copying and LICENSE et cetera | 17:55 |
fungi | the fact that we generate it with pbr is orthogonal to that | 17:55 |
stephenfin | fungi: thanks, I wasn't sure about that. ChangeLog too, or is that one pbr only? | 17:56 |
fungi | setuptools is not, afaik, planning to break inclusion of license files | 17:56 |
clarkb | the .zuul.yaml is also in the sdist so I think this would work fine for pre-commit. At least with current setuptools and pbr | 17:56 |
fungi | again, ChangeLog is generated by pbr but its inclusion is a matter of being entered in the manifest | 17:56 |
stephenfin | clarkb: no. Nothing includes .pre-commit-hooks.yaml. You can verify that by pulling a package that includes one like hacking | 17:56 |
clarkb | I don't know enough of their future plans to know if that will break | 17:56 |
clarkb | stephenfin: right but you can is my point | 17:56 |
clarkb | just like zuul includes .zuul.yaml in it | 17:56 |
clarkb | if that information is required for pre-commit to work then not having it in your package is a package bug imo | 17:57 |
fungi | perhaps one of the things i should have added to the pbr features etherpad is that it generates the dist manifest based on the results of git ls-files | 17:57 |
clarkb | not a fundamental reason to not use packages | 17:57 |
stephenfin | okay, I get your point now | 17:57 |
fungi | though setuptools also has its own manifest auto-generator these days that scans for python modules, but can be given a list of additional files to include | 17:57 |
clarkb | stephenfin: hacking includes the precommit file in its package | 17:57 |
clarkb | stephenfin: https://files.pythonhosted.org/packages/f7/19/cf7a61cb63288c226bf2fa012ddcda51e4baad3039dbb4fc4b4e1a2b8e16/hacking-7.0.0.tar.gz extract that and you'll see the file | 17:58 |
stephenfin | I was going to say no, it's not there, but forgot nautilus (or whatever the graphical file browser in Fedora is) doesn't show hidden files by default | 17:58 |
fungi | stephenfin: anyway, in future it would be good to separate random format updating and comment typo corrections into their own patch, if you're going to auto-style files every time you touch them | 17:59 |
fungi | i didn't -1 for that, but it's distracting to reviewers | 17:59 |
clarkb | and yes it's likely that pre-commit doesn't have the necessary plumbing today to make all of that work. But it could. I don't think there is a fundamental reason that it wouldn't work and most of the pieces are there. It's just pre-commit itself missing the necessary bits | 17:59 |
stephenfin | err, which review are we talking about | 17:59 |
fungi | stephenfin: the pbr reviews i left nit comments/questions on | 18:00 |
clarkb | https://review.opendev.org/c/openstack/pbr/+/953892/3/pbr/tests/test_integration.py and https://review.opendev.org/c/openstack/pbr/+/953839/5/pbr/tests/test_core.py for example | 18:00 |
fungi | about sudden appearance of extra blank lines, reflowing function parameters that weren't being changed, there was also a mistyped word in a random code comment that i didn't point out but was seemingly unrelated to the patch | 18:01 |
fungi | maybe you're running some tool that's randomly altering files behind your back, and not checking the diff yourself so didn't notice? | 18:01 |
clarkb | fungi: we just got an email indicating the lists backups failed. These failures have occurred occasionally over the last week | 18:03 |
fungi | not a huge deal, but like i said distracting when it's unrelated to and in some cases distantly removed from your actual intended edits in the file | 18:03 |
clarkb | fungi: I assume (but haven't checked) that this is going to be load related to the crawling stuff | 18:03 |
clarkb | I guess I should look at the logs and see if we have more rules to add to the UA filter | 18:03 |
fungi | clarkb: possible, that could easily lead to timeouts | 18:03 |
fungi | the ua filter change i started earlier in the week just includes the one ua you pointed out, i haven't looked at the logs to see if there are others | 18:04 |
fungi | also it's still open and can be amended | 18:04 |
clarkb | fungi: ya or affecting timing in such a way that it overlaps with tasks on the backup server side like backup validation that I think cause new backups to error if they happen concurrently | 18:04 |
stephenfin | fungi: no, that's me alright. I'm relying on Gerrit highlighting significant changes differently to newlines/rewraps | 18:05 |
stephenfin | others like https://review.opendev.org/c/openstack/pbr/+/953839/5/pbr/tests/test_integration.py were to make code more comprehensible. I can't drag that out into a precursor patch like I normally would since the gate is currently broken | 18:06 |
fungi | stephenfin: makes sense, thanks | 18:06 |
fungi | clarkb: right, in the past what we saw was colliding backup runs from the two separate servers | 18:08 |
clarkb | fungi: `cut -d' ' -f 12- lists.opendev.org-ssl-access.log | sort | uniq -c | sort` says chatgpt and claude are the worst offenders | 18:08 |
fungi | usually related to long backup times causing them to overlap when they normally wouldn't | 18:08 |
clarkb | fungi: /pipermail is the old url compatibility shim right? | 18:09 |
clarkb | I should maybe grep -v pipermail and see what things look like then | 18:09 |
stephenfin | clarkb: as for pre-commit, I need time to think about that more. I can surely wrangle a solution, but the biggest issue is that the pre-commit author is one of the least pleasant people I've had the misfortune of working with and I actively avoid contributing to either that or flake8 nowadays | 18:09 |
stephenfin | if only astral would come up with their own variant of that too... | 18:09 |
clarkb | oof | 18:10 |
corvus | i think all the zuul-launcher anomalies i have investigated can be explained by the recently fixed bug; so i'm going to disregard past issues and just look for new weirdness. | 18:10 |
clarkb | and ya for something like PBR it's relatively minor due to the lack of activity. I'm more just frustrated that it's been years of trying to get people to be more cautious with pre-commit to make it CI friendly but that is never anyone else's priority | 18:10 |
stephenfin | least pleasant might be a big harsh: least agreeable is perhaps more apt :) | 18:10 |
stephenfin | *bit | 18:11 |
fungi | /pipermail is the copy of the old mm2 archives, which are redundant (but not able to be automatically mapped/redirected from their mm3 counterparts) in some cases, though uniquely archival for lists that were retired before we migrated. i guess they ignore the crawl-delay we set in robots.txt? | 18:11 |
clarkb | stephenfin: fwiw I can think of other tools that suffer similar problems. Docker comes immediately to mind where they made the image protocol annoying to cache and now they are enforcing strict request limits on everyone | 18:11 |
clarkb | fungi: I haven't checked timestamps to see if they honor the crawl delay | 18:11 |
clarkb | fungi: but a lot of requests are to pipermail for chatgpt at least | 18:12 |
fungi | well, we set it to 2 so i wouldn't expect a ton of requests at that speed | 18:12 |
clarkb | `grep -v '/pipermail/' lists.opendev.org-ssl-access.log | cut -d' ' -f 12- | sort | uniq -c | sort` gives a very different result | 18:12 |
fungi | at crawl-delay: 2 the bot should top out around 43200/day | 18:12 |
fungi | but also, just because a ua claims to be claude or chat-gpt, doesn't mean it actually is | 18:13 |
clarkb | yes but once I ignore pipermail I think the picture is more clear | 18:13 |
clarkb | serving pipermail is basically free compared to the django stuff | 18:13 |
fungi | right, it's direct file handoff for the most part, while django is database queries | 18:14 |
opendevreview | Merged opendev/zuul-providers master: Drop niz- label prefix from nodesets https://review.opendev.org/c/opendev/zuul-providers/+/953835 | 18:14 |
opendevreview | Merged opendev/zuul-providers master: Remove "normal" labels, etc https://review.opendev.org/c/opendev/zuul-providers/+/953836 | 18:15 |
opendevreview | Merged opendev/zuul-providers master: Remove gentoo-17 nodeset https://review.opendev.org/c/opendev/zuul-providers/+/952723 | 18:16 |
opendevreview | Merged opendev/zuul-providers master: Remove ubuntu-xenial nodeset https://review.opendev.org/c/opendev/zuul-providers/+/952726 | 18:17 |
corvus | clarkb: ^ you highlighted that there is a possibility of fallout from that (but not expected) | 18:18 |
clarkb | corvus: ya and it would be in places like system-config | 18:19 |
clarkb | so infra-root keep your eyes open and let us know if you see something | 18:19 |
stephenfin | clarkb: https://review.opendev.org/c/openstack/pbr/+/953982 I've kept it separate for now lest it fail in CI | 18:21 |
clarkb | stephenfin: ack that works for me. Probably easier to whittle down what needs further fixing that way too | 18:22 |
stephenfin | that's my thinking, yes | 18:22 |
stephenfin | (start from a stable base rather than house on fire) | 18:22 |
clarkb | I +2d but didn't approve as I figured fungi may want to weigh in on the nits | 18:22 |
clarkb | fungi: but feel free to approve the stack now if you're happy | 18:23 |
fungi | looking | 18:23 |
fungi | i'll give it a bit to see how tests are doing with that change before i approve the whole lot | 18:26 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove no_log for image upload tasks https://review.opendev.org/c/zuul/zuul-jobs/+/953983 | 18:28 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Switch to zuul-jobs upload-image-swift https://review.opendev.org/c/opendev/zuul-providers/+/951018 | 18:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add more UA filters https://review.opendev.org/c/opendev/system-config/+/953904 | 18:48 |
clarkb | fungi: ^ that is an updated set of rules. In some cases I found we already had regex type rules that I expanded. In others I just went more verbatim because it is easy | 18:49 |
clarkb | for testing of that change our gitea testinfra requests go to port 3081 which should be the apache there which should check that we don't completely break apache with those rules which is nice | 18:51 |
clarkb | something we should be aware of as those rules are applied pretty broadly these days. If we break apache with a bad rule then we'll have widespread sadness | 18:51 |
clarkb | in addition to that change https://review.opendev.org/c/opendev/system-config/+/953848 and https://review.opendev.org/c/opendev/system-config/+/953846 are straightforward simple improvements to zuul and zookeeper docker compose | 19:01 |
clarkb | and now I'm finding lunch | 19:01 |
opendevreview | Merged opendev/system-config master: Cleanup zookeeper config management https://review.opendev.org/c/opendev/system-config/+/953846 | 20:05 |
opendevreview | Merged opendev/system-config master: Remove docker compose version from zuul services https://review.opendev.org/c/opendev/system-config/+/953848 | 20:17 |
clarkb | that first change made the trailing space correction to the inventory so it's running all the infra prod jobs | 20:24 |
clarkb | I'm keeping an eye on it | 20:24 |
clarkb | corvus: the deployment job for grafana against 953846 (so an unrelated change) failed with Exception: Duplicate dashboard found in '/grafana/zuul-launcher-ovh.yaml: 'Zuul Launcher: OVH' already defined | 20:30 |
clarkb | corvus: I'm wondering if we need to do manual cleanup for the existing launcher dashboards as it can't resolve the deltas for some reason? | 20:30 |
clarkb | it does look like the dashboard I see currently on grafana is the old version (no memory, cores, etc graphs) | 20:31 |
fungi | the discussion in #openstack-nova about why glean is still needed may be relevant to the interests of some in here | 20:31 |
clarkb | is it ongoing or should I just look at logs? | 20:32 |
fungi | (cropped up earlier based on discussion about nova long-term plans to get rid of configdrive) | 20:32 |
fungi | it's in progress | 20:32 |
fungi | semi-synchronous due to tz differences between conversants | 20:33 |
fungi | so it stretches back to early this morning western time | 20:33 |
clarkb | the zookeeper docker-compose.yaml update did end up doing a rolling restart of the cluster fwiw | 20:44 |
clarkb | but all seems well from what I can see here | 20:44 |
clarkb | looking at ansible logs that may be because we changed the docker-compose.yaml config (even in a noop way) which caused it to repull images? | 20:47 |
clarkb | but ya we went server by server and the cluster seems happy so I think we're ok | 20:47 |
clarkb | and remote puppet else failed with Execution of '/usr/bin/git fetch origin' returned 128: fatal: unable to access 'https://opendev.org/openstack/project-config/': GnuTLS recv error (-54): Error in the pull function. which is suspiciously like our image build errors | 20:50 |
clarkb | I wonder if the gitea upgrade may be contributing to this... | 20:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add apache-ua-filter file path matches where used https://review.opendev.org/c/opendev/system-config/+/953993 | 20:58 |
clarkb | fungi: corvus ^ you reviewed the UA filter change. I noticed ^ when checking on the progress of the gating. Basically I don't think we'll auto deploy those updates to the zuul and mailman services currently | 20:59 |
clarkb | hrm that change didn't actually run the jobs I expected it to | 21:16 |
clarkb | I'll make a noop change to the ua filters in it so that it actually drives updates to all those services when it lands | 21:16 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add apache-ua-filter file path matches where used https://review.opendev.org/c/opendev/system-config/+/953993 | 21:17 |
clarkb | fungi: ^ sorry I noticed that after you reviewed it | 21:17 |
fungi | np | 21:17 |
clarkb | also I really like that git-review (gerrit really) tells you what the label removals are when you push | 21:19 |
clarkb | makes it easy to catch things like this | 21:19 |
clarkb | corvus: I think I've learned something about the zk average latency values. It's the latency to clients, not within the cluster. zk03 is currently at 0 because it has no connections (it was restarted last in the rolling restart just a bit ago so all connections went to zk02 or zk01) | 21:21 |
opendevreview | Merged opendev/system-config master: Add more UA filters https://review.opendev.org/c/opendev/system-config/+/953904 | 21:29 |
corvus | clarkb: yeah, i was afraid we'd need to delete the dashboards.... hopefully we have an admin user that can do that. | 21:57 |
corvus | i'll try to get to it in a bit | 21:57 |
clarkb | corvus: I seem to recall there is a token we can use like etherpad used to do | 22:09 |
clarkb | corvus: ya looks like there are the bits needed for admin api access in /etc/grafana/secrets | 22:10 |
clarkb | corvus: if you get a moment for https://review.opendev.org/c/opendev/system-config/+/953993 that should actually apply the new UA filters to the mailman server (or we'll wait for periodic this evening) | 22:13 |
clarkb | I can still access gitea so it seems to work in general | 22:13 |
corvus | +3 | 22:14 |
corvus | i deleted the zuul-launcher dashboards | 22:30 |
corvus | i logged in as the admin user through the web ui normally, which worked, but deleting the dashboards failed with a 403 origin not allowed | 22:31 |
corvus | so i set up an ssh port forward for 3000:localhost:3000 and repeated it that way, and it worked | 22:31 |
corvus | i have re-enqueued the deploy buildset | 22:32 |
clarkb | oh right we set up a limitation for admin access. I can't remember why but I seem to recall doing that | 22:33 |
clarkb | corvus: it failed again. i think the hint as to why is earlier in the log: DEBUG:grafana_dashboards.cache:Using cache: /root/.cache/grafyaml/cache.dbm | 22:40 |
clarkb | corvus: I think it may be complaining that the cache.dbm database has the duplicate and not grafana itself? | 22:40 |
clarkb | hrm i don't see a /root/.cache/grafyaml/ | 22:43 |
clarkb | oh that is from the container perspective | 22:45 |
clarkb | maybe it lives elsewhere | 22:45 |
clarkb | hrm we don't seem to bind mount anything to that path. So maybe it is building that db and finding stale content within grafana somewhere? | 22:47 |
clarkb | I'm not sure I understand why this is happening | 22:54 |
corvus | looks like /opt/project-config/grafana has yaml and json files | 22:54 |
corvus | that dir gets bind-mounted into a temporary container to run grafyaml | 22:55 |
corvus | i don't understand why it has both | 22:55 |
clarkb | hrm I think that comes from project-config syncing? | 22:55 |
corvus | i think we rsync it? | 22:55 |
corvus | do we check it out on bridge then rsync it to grafana02? | 22:56 |
clarkb | that sounds right. Looking on bridge there are only json files for zuul-launcher | 22:56 |
corvus | do we need to add an extra argument to sync to delete? | 22:56 |
corvus | https://docs.ansible.com/ansible/latest/collections/ansible/posix/synchronize_module.html#parameter-delete | 22:57 |
clarkb | I'm trying to find where role sync-project-config lives | 22:57 |
corvus | that's used in a number of places... | 22:57 |
corvus | https://paste.opendev.org/show/boOAIOkIsHjDJfYvPPUz/ | 22:57 |
corvus | it's in system-config | 22:57 |
corvus | do we think it's okay to set that for all of them? | 22:58 |
clarkb | the synchronize does indeed only set a source and a destination | 22:58 |
clarkb | corvus: I think the main one I would be worried about is gerrit | 22:59 |
corvus | why's that? | 22:59 |
clarkb | Because jeepyb is reading from files in there (potentially, maybe we copy out acls and project config?) but acls in particular I could see there being a problem if we only have an acl file because we haven't deleted it | 23:00 |
corvus | that sounds like this would still correct a bug, no? | 23:00 |
clarkb | whereas things like eavesdrop are reading gerritbot config and accessbot configs and are less likely to have the "file missing oops" problem | 23:00 |
corvus | also, we would accidentally correct this situation every time we deploy a new gerrit server, so if we are relying on zombie acl files, the problem would only extend to there, right? | 23:01 |
clarkb | corvus: well yes, but it may break creation of new projects? Though I think jeepyb will try to create all projects and continue along then report success/failure rather than short circuiting | 23:01 |
opendevreview | James E. Blair proposed opendev/system-config master: Update sync-project-config to delete https://review.opendev.org/c/opendev/system-config/+/953999 | 23:01 |
clarkb | all of the other systems grab data out of project-config in ways that I think are less prone to errors like this | 23:02 |
clarkb | because they look at singular files | 23:02 |
corvus | yeah... to me it sounds like a bandaid we should rip off... i think it's unlikely to be a problem | 23:02 |
clarkb | but yes I think booting the new review03 server is an indication that this is also unlikely to be a problem for gerrit since that was semi recent | 23:02 |
corvus | should be able to check by listing the files on the gerrit server | 23:02 |
clarkb | and yes fixing it seems ideal. I'm just thinking through where/what the fallout could be | 23:03 |
corvus | i'm going to manually delete the files on grafana so we don't have to rush that change. | 23:03 |
clarkb | ack | 23:03 |
corvus | deleted and re-enqueued again | 23:03 |
clarkb | I +2'd and left notes summarizing the above discussion | 23:05 |
clarkb | I think this is a correct change. Just one to think about carefully and monitor when landing | 23:05 |
clarkb | corvus: thinking out loud 953999 isn't running the grafana job or gerrit jobs. I think due to the way we setup a fake bridge and all of that the change would be self testing if we did so (at least in some capacity) | 23:07 |
clarkb | corvus: maybe we should update job trigger file matchers as part of that change and get a bit of test coverage that way? | 23:08 |
*** prometheanfire is now known as Guest21259 | 23:09 | |
corvus | want to basically add these? https://paste.opendev.org/show/boOAIOkIsHjDJfYvPPUz/ | 23:09 |
clarkb | corvus: ya though I think run-accessbot and service-eavesdrop are both covered by the eavesdrop job | 23:12 |
clarkb | so we'd add playbooks/roles/sync-project-config to the gerrit jobs (there are three), zuul, nodepool, grafana, and eavesdrop? | 23:12 |
corvus | just for the test jobs or the deploy ones as well? | 23:12 |
clarkb | I feel like adding it for both is probably more complete? that will ensure we sync the deletions when we land the change rather than waiting for hourlies or dailies | 23:13 |
opendevreview | Merged opendev/system-config master: Add apache-ua-filter file path matches where used https://review.opendev.org/c/opendev/system-config/+/953993 | 23:13 |
clarkb | might make monitoring the updates easier | 23:13 |
clarkb | the new graphs are present on the server now | 23:13 |
*** Guest21259 is now known as prometheanfire | 23:14 | |
opendevreview | James E. Blair proposed opendev/system-config master: Update sync-project-config to delete https://review.opendev.org/c/opendev/system-config/+/953999 | 23:15 |
corvus | would be cool if we did static analysis on the playbooks to generate the file list | 23:15 |
corvus | the gauges are not quite right; our grafana may be too old | 23:18 |
clarkb | corvus: two comments about jobs getting the matchers. I think there is a mismatch in each of the updated files | 23:18 |
clarkb | re gauges and too old grafana I think we can upgrade grafana these days since we caught up semi recently? | 23:19 |
clarkb | I can probably look into that more closely if you think it would be helpful | 23:19 |
clarkb | I think lists got its apache config reloaded at 23:17 UTC | 23:20 |
fungi | load average on the server is reasonable, but looks like it was not under significant load before that either | 23:21 |
clarkb | previously the biggest cpu hog was mariadb iirc | 23:22 |
clarkb | mariadb is still up there but its using less now (it was a full cpu before) | 23:23 |
clarkb | probably easiest to check in tomorrow looking at requests for the next day and see if they're more reasonable. Also two of the 5 apache processes are still old | 23:23 |
clarkb | oh those have just disappeared now | 23:24 |
opendevreview | James E. Blair proposed opendev/system-config master: Upgrade grafana to 12.0.2 https://review.opendev.org/c/opendev/system-config/+/954000 | 23:26 |
opendevreview | James E. Blair proposed opendev/system-config master: Update sync-project-config to delete https://review.opendev.org/c/opendev/system-config/+/953999 | 23:28 |
corvus | okay i think that's got all the things | 23:28 |
clarkb | ya that looks better. Then for grafana we just want to confirm we get graphs out of testing as I think the service itself is fairly stateless | 23:32 |
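
For readers following the 18:08 to 18:13 exchange about crawler load on the lists server, here is a minimal Python sketch of the two calculations discussed there: the per-user-agent tally that clarkb ran with `cut`/`sort`/`uniq -c`, and fungi's upper bound for a bot that honors `crawl-delay: 2`. The function names are hypothetical, and the code assumes (as the shell commands in the log do) that the user agent starts at the 12th space-separated field of each access log line.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60


def max_requests_per_day(crawl_delay_seconds):
    """Upper bound on daily requests for a bot that honors Crawl-delay."""
    return SECONDS_PER_DAY // crawl_delay_seconds


def count_user_agents(lines, exclude_substring=None):
    """Tally the text from the 12th space-separated field onward.

    Roughly mirrors `cut -d' ' -f 12- | sort | uniq -c`. Assumption: the
    user agent begins at field 12, as in the commands quoted in the log.
    """
    counts = Counter()
    for line in lines:
        if exclude_substring and exclude_substring in line:
            continue  # same effect as `grep -v` on that substring
        fields = line.rstrip("\n").split(" ")
        counts[" ".join(fields[11:])] += 1
    return counts
```

Passing an open log file as `lines` and `exclude_substring="/pipermail/"` would approximate the second command from the log; `counts.most_common()` then lists the heaviest user agents first. With a crawl delay of 2 seconds, `max_requests_per_day(2)` gives the 43200 figure fungi quotes.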
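
The 17:56 to 17:58 exchange turns on whether a dotfile such as `.pre-commit-hooks.yaml` is actually shipped in an sdist, and a graphical file browser hid it by default. A minimal sketch of checking that programmatically with Python's standard `tarfile` module follows; the helper name `sdist_contains` is hypothetical, and the check assumes a gzip-compressed tarball like the ones linked in the log.

```python
import tarfile


def sdist_contains(tar_path, filename):
    """Return True if any member of the sdist has this base name.

    Hidden files (names starting with '.') are listed like any other
    member, so nothing is skipped the way a file browser might skip it.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        return any(
            name.rsplit("/", 1)[-1] == filename for name in tar.getnames()
        )
```

Called on a locally downloaded copy of an sdist, for example `sdist_contains("hacking-7.0.0.tar.gz", ".pre-commit-hooks.yaml")`, this answers the question without extracting anything.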