| @jim:acmegating.com | no worries, thanks for the fix! | 00:01 |
|---|---|---|
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/zone-gating.dev] 989307: Update address records for domain root https://review.opendev.org/c/opendev/zone-gating.dev/+/989307 | 00:01 | |
| @fungicide:matrix.org | anyway, now that's done and dusted, i'm really going to offline before i break something else | 00:04 |
| -@gerrit:opendev.org- OpenStack Proposal Bot proposed: [openstack/project-config] 989320: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/989320 | 02:58 | |
| -@gerrit:opendev.org- Zuul merged: [openstack/project-config] 989320: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/989320 | 05:54 | |
| @harbott.osism.tech:regio.chat | I haven't seen any zuul issue mentioned yesterday, I'm wondering why https://review.opendev.org/c/openstack/requirements/+/988934 didn't transition into gate after the recheck | 07:46 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 989377: Shadow mirror.logs RO replica with RW original https://review.opendev.org/c/opendev/system-config/+/989377 | 13:05 | |
| @fungicide:matrix.org | Jens Harbott: my only guess is that there's an unreported merge conflict because the parent commit is about 70 behind the current state of the master branch | 13:11 |
| -@gerrit:opendev.org- Mauricio Harley proposed: [openstack/project-config] 988770: Add #openstack-pqc IRC channel https://review.opendev.org/c/openstack/project-config/+/988770 | 13:51 | |
| -@gerrit:opendev.org- Zuul merged on behalf of Mauricio Harley: [openstack/project-config] 988770: Add #openstack-pqc IRC channel https://review.opendev.org/c/openstack/project-config/+/988770 | 14:09 | |
| @fungicide:matrix.org | https://review.opendev.org/c/opendev/zuul-providers/+/989136 hit a mix of http/499 client disconnects from rackspace flex swift and enospc during image conversion, how did we get around those previously? | 14:39 |
| @clarkb:matrix.org | fungi: the 499 is new. I don't think we've seen that before. The enospace problems were somewhat mitigated by the changes corvus wrote to change the order of image conversions and otherwise be slightly more efficient during some steps. But they didn't fully go away | 14:45 |
| @clarkb:matrix.org | I would look at the 499s more closely to start since I think those are unexpected | 14:46 |
| @fungicide:matrix.org | what's odd is that 9 uploaded fine, 15 failed (though only some of those got as far as trying to upload) | 14:48 |
| -@gerrit:opendev.org- Clark Boylan proposed on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 | 14:56 | |
| @clarkb:matrix.org | fungi: ^ I think that may get the test job to pass | 14:56 |
| @mordred:waterwanders.com | Clark: did you ever figure out the public_ip thing from yesterday? I scanned scrollback at dinner but wasn't quite sure. happy to help look if it's still a mystery | 14:58 |
| @mordred:waterwanders.com | (I'd buy the missing stuff being related to fetching fast given backend eventual consistency) | 15:00 |
| @mordred:waterwanders.com | `openstack.cloud.meta.add_server_interfaces` *definitely* does a *lot* of shenanigans to try to set those two fields properly so it's certainly not unreasonable for a bug there to leave you with the wrong value - depending on what the raw openstack json looks like | 15:02 |
| @fungicide:matrix.org | mordred: yes, turns out it was an intentional choice 6 years ago, because we use that address to populate the /etc/hosts entry and to prefer backend/private network communication between nodes in providers that have them | 15:04 |
| @mordred:waterwanders.com | AH nod | 15:04 |
| @mordred:waterwanders.com | yeah. actually - I think I remember that discussion from 6 years ago :) | 15:04 |
| @mordred:waterwanders.com | feature not bug! | 15:05 |
| @fungicide:matrix.org | the plan, aiui, is to put together a more deliberate solution that separates those concerns and labels them more clearly as what they are | 15:05 |
| @clarkb:matrix.org | We didn't figure out why public_v6 is empty though | 15:05 |
| @mordred:waterwanders.com | gremlins | 15:05 |
| @mordred:waterwanders.com | they love stealing v6 addresses. just don't feed them after midnight | 15:05 |
| @fungicide:matrix.org | right, that one's a head-scratcher, though the theory about async assignment in the cloud racing our api response makes sense | 15:05 |
| @clarkb:matrix.org | Right when we fetch the data against instances that have been running the data comes back as expected. But zuul is fetching it immediately after creation so theory is some race with the data being available from the cloud | 15:07 |
| @mordred:waterwanders.com | it's a completely reasonable theory. this is rackspace right? like classic? So no floating? | 15:13 |
| @clarkb:matrix.org | correct. Also ovh which is also no floating ip. osuosl doesn't have this problem (not sure if that is floating or not off the top of my head) | 15:13 |
| @fungicide:matrix.org | osuosl is not floating ip | 15:16 |
| @fungicide:matrix.org | the only provider currently doing floating ip for our nodes is rackspace flex | 15:17 |
| @clarkb:matrix.org | wow ansible-lint doesn't list all failures up front. I "fixed" the first error and now have 134 errors. Time to drop the linting job instead | 15:17 |
| @mordred:waterwanders.com | yeah - that should just be coming directly from the server dict, but a race on the backend would not be surprising at all. sdk doesn't make any _additional_ calls on the backend for that case (which otherwise would, of course, add latency, and increase chance that nova would catch up) so I'd put my vote in with "zuul is fetching quickly enough to catch the race" and "in the ansible case with sdk there's enough lag time it's not quick enough to catch the race" | 15:17 |
| -@gerrit:opendev.org- Clark Boylan proposed on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 | 15:20 | |
| @clarkb:matrix.org | fungi: ^ now with the linter job disabled | 15:21 |
| -@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [opendev/bindep] 989416: Rewrite bindep in rust https://review.opendev.org/c/opendev/bindep/+/989416 | 15:28 | |
| @mordred:waterwanders.com | So ... I did a thing ^^ that I absolutely do not expect to get merged or really even reviewed, but I wanted to push it up anyway | 15:28 |
| @mordred:waterwanders.com | I think the better approach is to just make a new personal repo called bindep-rs to hold that, but I also think "ask before fork" is always the correct first step. so anyway - there's that | 15:30 |
| @clarkb:matrix.org | fungi: ok I think https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 https://review.opendev.org/c/opendev/system-config/+/989022 and https://review.opendev.org/c/opendev/system-config/+/988698 are all now in a mergeable state which should hopefully close out the ansible 9 upgrade fallout | 15:31 |
| @fungicide:matrix.org | mordred: i don't object. i don't feel like i'd personally be well-equipped to maintain a rust version of bindep, but it has stabilized to the point where at least someone maintaining a separate one in parallel probably doesn't have to worry much about divergence due to bitrot | 15:32 |
| @mordred:waterwanders.com | ++ - there's actually a test suite for "install python bindep, run it and ensure compat" | 15:33 |
| @fungicide:matrix.org | we occasionally merge a change to add support for another distro, but aside from that the majority of recent changes are just to keep up with python packaging conventions | 15:34 |
| @mordred:waterwanders.com | (this came up because I went crazy and decided to revive drizzle - which needs to start out on 12.04 - which does not have python3 but where I can happily copy in a static binary) | 15:35 |
| @fungicide:matrix.org | oh cool! | 15:35 |
| @fungicide:matrix.org | sadly, i think spi let the domain registration go recently | 15:35 |
| @fungicide:matrix.org | whois is redacted and useless these days and i don't feel like figuring out how to log into spi's account at gandi but maybe we do still control it actually, there's just no site being served there | 15:38 |
| @fungicide:matrix.org | we did disassociate the project 7 years ago though: https://www.spi-inc.org/corporate/resolutions/2019/2019-08-11.tp.1/ | 15:39 |
| @fungicide:matrix.org | oh, i think it's the trademark we finally let lapse | 15:39 |
| @mordred:waterwanders.com | wow - 2019 was 7 years ago | 15:40 |
| @fungicide:matrix.org | because paying for trademark registration renewals didn't make sense | 15:40 |
| @mordred:waterwanders.com | well - I'm glad you think that's cool ... | 15:40 |
| -@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/989417 | 15:40 | |
| @fungicide:matrix.org | heh | 15:40 |
| @mordred:waterwanders.com | if you launch a 12.04 container (or use the containerfile in the updated repo) - it TOTALLY builds and passes all its tests | 15:41 |
| @fungicide:matrix.org | and yeah, now that i think about it, i believe spi did continue to hold onto the drizzle.org domain because gandi doesn't charge us for domain registrations | 15:41 |
| @mordred:waterwanders.com | neat. so - if I get ambitious, I could _probably_ reach out to spi and ask for the domain to be pointed somewhere then | 15:42 |
| @mordred:waterwanders.com | should probably see if I can get it upcycled to run on resolute before bothering | 15:43 |
| @fungicide:matrix.org | mordred: since it's considered an asset of value held by an irs 501(c)(3) charity, i think your options are to either pursue reassociation with spi or with another c3 charity. transferring assets from a charity to any non-charity org is legally hard/risky | 15:46 |
| @mordred:waterwanders.com | oh - yeah - I think reassociation would be the most sensible thing, assuming it gets that far. I definitely wouldn't think it's ok to transfer control just to a me | 15:48 |
| @fungicide:matrix.org | (this has come up before for spi, a notable example was when we gave up opensource.org to osi) | 15:48 |
| @fungicide:matrix.org | but if you did end up reassociating with spi, for example, you could still host the dns and web presence for drizzle.org here in opendev if you wanted | 15:49 |
| @mordred:waterwanders.com | ++ cool | 15:50 |
| @fungicide:matrix.org | would just need spi's admins to point gandi at the opendev nameservers | 15:50 |
| @mordred:waterwanders.com | speaking of - my opendev usage has ramped up quite a bit since leaving the warm embrace of pappa larry. might not be a bad idea to start helping out again. my time comes in unpredictable bursts, so I don't think I could be considered "reliable" (but then again, was I ever?) | 15:55 |
| -@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/989417 | 15:57 | |
| @clarkb:matrix.org | we appreciate any help we can get | 16:00 |
| @clarkb:matrix.org | unrelated I accidentally ran ed just a few minutes ago. Took me a minute to figure out what was happening | 16:01 |
| @clarkb:matrix.org | fungi: I'm going to finish up some breakfast stuff, but then I'm going to look into redoing the gerrit 3.13 upgrade testing with the new container images. The set of ansible 9 followup changes 989028, 989022 and 988698 should all be good to go whenever you're ready I think | 16:02 |
| @fungicide:matrix.org | yeah, i just finished lunch and planning to catch up on those next | 16:17 |
| @tkajinam:matrix.org | I'm wondering what's the future of storyboard. It hasn't been updated for a few years and its CI has been badly broken (see https://review.opendev.org/c/opendev/storyboard/+/922699 which is still incomplete ...) | 16:23 |
| @tkajinam:matrix.org | I have a growing concern about it, seeing recent mass vulnerability detection using AI tools. | 16:24 |
| @clarkb:matrix.org | it hasn't been maintained for many years. I don't think that is any different today | 16:24 |
| @tkajinam:matrix.org | yes. I agree | 16:24 |
| @clarkb:matrix.org | Unfortunately, it's relatively low on the priority list for everyone and probably even lower with the current state of the world | 16:25 |
| @clarkb:matrix.org | if there is interest in resurrecting it I think we could support that. But right now I am not personally able to dedicate much if any time to it | 16:26 |
| @mordred:waterwanders.com | funny story ... | 16:26 |
| @mordred:waterwanders.com | I was literally JUST about to make a suggestion/propose a question about that | 16:26 |
| @tkajinam:matrix.org | considering migration out from storyboard (as was done by a few projects already) might be a better approach | 16:28 |
| @clarkb:matrix.org | and I guess to be blunt I think OpenDev could use more investment generally. I really appreciate mnasiadka volunteering recently to help. Recently with some outages in opendev and ubuntu there have been a ton of complaints. But reality is there is a small group of us working hard to keep the lights on and it's only getting harder to do so | 16:28 |
| @clarkb:matrix.org | without investment, and with additional overhead (largely driven by LLMs), more outages and problems can probably be expected | 16:29 |
| @fungicide:matrix.org | part of the challenge there is that it's still being deployed with puppet and running under python 2.7, but because storyboard dropped python 2 support at one point we're not actually deploying the latest version which has fixes implemented for a variety of bugs and performance issues we're not taking advantage of | 16:29 |
| @mordred:waterwanders.com | there's another pattern emerging in the agentic dev space which is "point agents at the api of an appropriate task tracker and use that for coordination, so that as it's working on one thing and notices a future todo, it can just plop it down and another agent can pick it up without it getting lost" Temporal is what people seem to like - but I don't really like closed source things, so I was thinking "what if I updated storyboard - it's api-first based anyway - and added it to the gerrit/zuul set I'm already using" | 16:30 |
| @fungicide:matrix.org | mordred even got it set up to build container images at one point, but nobody's had time to convert our ancient deployment to ansible orchestrating those containers | 16:30 |
| @clarkb:matrix.org | I think we stopped building those containers? It may require resurrecting that effort too | 16:30 |
| @clarkb:matrix.org | in any case I brought this up at the PTG. It feels like we're on an LLM runner's high but instead of realizing that this requires new investment in infrastructure to support it we're just letting everyone NIH new features like crazy and it's more of a survival of the fittest situation | 16:31 |
| @clarkb:matrix.org | essentially from my point of view LLMs should create a reevaluation of where the human priorities are in the software development process. This isn't happening. Instead we're just doubling down on existing processes and hoping that if we ramp them up to 11 the rest of the system won't break | 16:32 |
| @fungicide:matrix.org | and then there's an architectural issue with storyboard, which is that it fell into the same trap a lot of projects do chasing the javascript framework du jour (angular in this case) which was inevitably abandoned with no clear migration path to any successor | 16:32 |
| @clarkb:matrix.org | fungi: I think if you write any js you just have to accept rewriting everything at least once every 2 years :) | 16:33 |
| @fungicide:matrix.org | yes, that's the impression i get from the handful of javascript-based applications i'm brave enough to touch with a 10 meter pole | 16:33 |
| @mordred:waterwanders.com | yeah- so - migrating from old angular to using the same js stuff we use for zuul wouldn't be hard with agentic help. I don't know that we have a team of humans who would feel super comfortable reviewing that, but with zuul preview pointing at prod perhaps that's review enough? | 16:34 |
| @mordred:waterwanders.com | but - honestly - I'm not sure we have reviewers for storyboard even with hand-written patches - so the question would be, sort of along clark's line earlier- what's the right application of limited human resources for this? | 16:35 |
| @fungicide:matrix.org | tkajinam: to be clear, there has been (to my knowledge) no "migration out of storyboard" by any project, they just end up abandoning what they have on sb and starting from scratch somewhere like lp, i've never seen anyone move any history from sb to lp | 16:36 |
| @tkajinam:matrix.org | * yes I'm aware of that. | 16:36 |
| @fungicide:matrix.org | there *is* a rest api in sb that you could use to export all public history for a project, but i don't know of tooling anyone's written to import that data somewhere else | 16:37 |
| @tkajinam:matrix.org | fungi: I agree. The major projects I'm aware of are ironic and telemetry but both of these rely on existing ones left (and visible) in storyboard | 16:39 |
| @fungicide:matrix.org | in contrast, we did develop tooling to extract bug history from launchpad and import it into storyboard | 16:40 |
| @fungicide:matrix.org | which was what we used originally for projects migrating from lp to sb | 16:41 |
| -@gerrit:opendev.org- Takashi Kajinami proposed: [opendev/storyboard] 922699: Fix test executions https://review.opendev.org/c/opendev/storyboard/+/922699 | 16:42 | |
| @tkajinam:matrix.org | we could technically do the reverse but that would be tricky. wondering if there is any short-cut solution to provide only a view of existing items in sb... | 16:43 |
| @fungicide:matrix.org | you could do it by checking for equivalent bug numbers in lp, up to the point where we started to collide | 16:44 |
| @fungicide:matrix.org | we originally reserved all story numbers below 2000000 for imports from lp, so other than already open stories that then got updated, the only stories which weren't lp imports and were created originally on sb are numbered 2000000 and above | 16:46 |
| @fungicide:matrix.org | we should have done numbers below 10m or something instead, because lp use accelerated and it hit 2m a few years after we began using sb, but by then there weren't really any projects that migrated in anyway | 16:47 |
| @tkajinam:matrix.org | the imported bugs, which were closed in sb after migration, might be tricky though I expect (I hope) these are not many. | 16:51 |
| @tkajinam:matrix.org | sorry I have to leave soon. Notification from gerrit reminds me of that long-remaining fix-ci patch and I just wanted to raise the situation somewhere so that I don't forget and leave it for another year... | 16:54 |
| @tkajinam:matrix.org | based on the current tricky situation, we may not get a clear path soon but for the time being I'll have a chat with some heat folks regarding migration back (because heat is still staying in storyboard). | 16:55 |
| @clarkb:matrix.org | ok I have tested the Gerrit upgrade from 3.12.7 -> 3.13.6 and then downgraded back to 3.12.7 again. My notes in https://etherpad.opendev.org/p/gerrit-upgrade-3.13 have been updated with new infos. One thing I noticed this time (and actually last time but didn't understand it super well) is that after the 3.13 upgrade there are web client errors that pop up about fetching robot comments. The reason appears to be that the 3.12 client in the browser is still trying to fetch robot comments which are gone in 3.13. A hard refresh fixes this. Similarly when you downgrade the 3.13 client tries to fetch is-flows-enabled from gerrit and 3.12 knows nothing about it so you get a similar error. A hard refresh also corrects that | 17:02 |
| @clarkb:matrix.org | I think in the email comms for the upgrade I'll note the robot comments errors and suggest a browser refresh if anyone runs into that | 17:03 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989428: Fix system-config-run-review-3.13 images requires https://review.opendev.org/c/opendev/system-config/+/989428 ; [opendev/system-config] 989429: Upgrade Gerrit to 3.13 https://review.opendev.org/c/opendev/system-config/+/989429 | 17:07 | |
| @clarkb:matrix.org | infra-root 989428 is a bugfix for the current system-config jobs and can be landed now. 989429 is prep work for the upgrade process and I'll WIP it as a result | 17:08 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 | 17:18 | |
| @clarkb:matrix.org | I've started a draft gerrit upgrade announcement here: https://etherpad.opendev.org/p/SFnVUJwITKx-8Ed3k7Ev | 17:22 |
| @fungicide:matrix.org | announcement lgtm, made some small edits | 17:44 |
| @fungicide:matrix.org | do you happen to know how the new auth tokens would be implemented by rest api clients? e.g. will zuul, gertty and some of our custom scripts that currently use passwords need to change how they authenticate? | 17:45 |
| @fungicide:matrix.org | or can the tokens just be passed as if they're passwords, like how pypi's upload tokens work? | 17:46 |
| @fungicide:matrix.org | i guess it's not urgent to figure out, and will be easier to experiment with once we have 3.13 in production | 17:47 |
| @mordred:waterwanders.com | I think we get to use both passwords and tokens in parallel for this release, right? | 17:50 |
| @fungicide:matrix.org | correct, which is why i say it's not urgent to figure out | 17:51 |
| @fungicide:matrix.org | possibly for the next many releases, they don't say when they're removing legacy http password support | 17:51 |
| @clarkb:matrix.org | fungi: you use them as basic auth password like you currently do. Or you can use them as jwt like auth tokens in headers | 17:51 |
| @fungicide:matrix.org | okay perfect | 17:51 |
| @mordred:waterwanders.com | cool | 17:52 |
| @clarkb:matrix.org | But my understanding is basic auth with the value will continue to work for compatibility | 17:52 |
| @fungicide:matrix.org | so in theory no changes are needed to clients | 17:52 |
| @clarkb:matrix.org | Ya that is my read of it | 17:52 |
| @fungicide:matrix.org | even once legacy passwords are gone entirely | 17:52 |
| @fungicide:matrix.org | that's how pypi's upload tokens work too, i just wanted to be sure it wasn't locking us into a multi-step auth handshake instead | 17:53 |
| @jim:acmegating.com | i believe it is difficult/impossible to generate new legacy auth | 17:53 |
| @jim:acmegating.com | so that's their upgrade-forcing path | 17:54 |
| @fungicide:matrix.org | makes sense | 17:54 |
| @fungicide:matrix.org | so it's ultimately just a change in the schema for what's encoded in the string that's passed as a basic auth password, in that case | 17:55 |
| @fungicide:matrix.org | rather than being a "simple" kdf over some random data, it's got signed metadata for things like expiration | 17:56 |
| @fungicide:matrix.org | a certificate, essentially | 17:56 |
| @clarkb:matrix.org | maybe? I think the main change is how they are storing the data on the backend | 17:57 |
| @clarkb:matrix.org | because now you can have multiple tokens instead of just one aiui | 17:58 |
| @clarkb:matrix.org | I don't know if they rely on signatures to expire them or just a db record | 17:58 |
| @clarkb:matrix.org | since they are the issuer and validator they don't need to rely on signatures to validate that stuff | 17:58 |
| @clarkb:matrix.org | I was asked in the upstream discord why we don't just convert as part of the upgrade and the reason in my head is that may make downgrading more painful | 17:59 |
| @clarkb:matrix.org | I think it's good to upgrade and be happy with 3.13 generally then commit to the new 3.13 passwords | 17:59 |
| @fungicide:matrix.org | at least with pypi's tokens you can redelegate by creating your own rescoped versions of them, e.g. using https://pypi.org/project/pypitoken-cli/ | 17:59 |
| -@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/989417 | 18:01 | |
| @mordred:waterwanders.com | you'd think I could get a projects.yaml right in less than three revisions, but you'd be wrong | 18:02 |
| @fungicide:matrix.org | this is why we have fast linters configured to run in check | 18:04 |
| @mordred:waterwanders.com | ++ | 18:13 |
| @mordred:waterwanders.com | yay! I got it right after 3 tries. I'm so proud of myself. maybe I'll get to switch to bigboy pants one day | 18:18 |
| @clarkb:matrix.org | etherpad is hitting `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory` and restarting occasionally. Likely fallout from the recent upgrade | 18:23 |
| @clarkb:matrix.org | I will look at that after this meeting and lunch. I believe nodejs has a very conservative heap size and can be enlarged | 18:24 |
| @fungicide:matrix.org | ooh, that would explain why i saw my client flashing up very brief errors about reconnecting earlier | 18:28 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989428: Fix system-config-run-review-3.13 images requires https://review.opendev.org/c/opendev/system-config/+/989428 | 18:28 | |
| @clarkb:matrix.org | looking at top I think there is a memory leak actually | 18:29 |
| @clarkb:matrix.org | so this may not be an easy fix | 18:30 |
| @fungicide:matrix.org | but also the server is only a 4gb flavor | 18:30 |
| @clarkb:matrix.org | ya but etherpad has 19GB virtual and only 840MB resident but is steadily climbing | 18:30 |
| @clarkb:matrix.org | we just cleared 1gb resident | 18:31 |
| @fungicide:matrix.org | right now the server has a dedicated 4gb swap partition on the ephemeral disk, shared with /opt which is using the remaining blocks but does have some content | 18:31 |
| @clarkb:matrix.org | virtual memory is not climbing as quickly | 18:32 |
| @fungicide:matrix.org | i think if we put the server in the disable list for a bit to avoid battling ansible, we can umount and shrink /opt to make more room for swap | 18:32 |
| @fungicide:matrix.org | or just add a swapfile on /opt and swapon it | 18:32 |
| @fungicide:matrix.org | maybe that's the faster band-aid | 18:32 |
| @clarkb:matrix.org | it's restarting every ~15 minutes | 18:33 |
| @fungicide:matrix.org | i wonder if this is fixed in 3.1 | 18:33 |
| @fungicide:matrix.org | i don't see it called out in the release notes if so | 18:35 |
| @clarkb:matrix.org | ya not finding any issues for memory leaks recently either open or closed | 18:35 |
| @fungicide:matrix.org | could it be crawler-induced? | 18:36 |
| @fungicide:matrix.org | can't use etherpad without js so we could frontend with anubis i suppose | 18:36 |
| @fungicide:matrix.org | if we decide that's the source of the memory consumption | 18:36 |
| @clarkb:matrix.org | that is possible I guess | 18:36 |
| @clarkb:matrix.org | but anything reasonably causing the memory to increase is probably running js? | 18:37 |
| -@gerrit:opendev.org- Sabbir Ahmed proposed: [openstack/project-config] 989446: Add starlingx/app-machine-operator project https://review.opendev.org/c/openstack/project-config/+/989446 | 18:40 | |
| @clarkb:matrix.org | we just hit 2gb resident | 18:40 |
| @clarkb:matrix.org | looks like nodejs also changed its gc implementation like python 3.14 | 18:41 |
| @clarkb:matrix.org | so maybe if we downgrade to nodejs22 it would be happier or maybe we can configure it more? | 18:42 |
| @clarkb:matrix.org | ok it just restarted around 2gb resident | 18:42 |
| @fungicide:matrix.org | is nodejs set to avoid consuming more memory than that? | 18:49 |
| @fungicide:matrix.org | because yeah, i haven't seen it really dip into swap | 18:49 |
| @clarkb:matrix.org | yes I think the default is 4gb but we only have 4gb present so it must tune that down to 2gb | 18:50 |
| @clarkb:matrix.org | I haven't confirmed this though. but based on the evidence of having OOM errors at around 2gb it appears likely | 18:50 |
| @clarkb:matrix.org | I think it may only be garbage collecting every 3 seconds? | 18:51 |
| @jim:acmegating.com | that's still < 15m so that shouldn't be a prob? | 19:00 |
| @clarkb:matrix.org | ya I just expected it to be much more often (like every few hundred ms) | 19:03 |
| @clarkb:matrix.org | I suspect the issue is a memory leak and not nodejs gc behaviors | 19:03 |
| @clarkb:matrix.org | I think what I would like to do is hold a node and see if we can profile this outside of production | 19:04 |
| @clarkb:matrix.org | (that is unlikely to be fast, but seems necessary to get to the bottom of the issue) | 19:04 |
| @clarkb:matrix.org | then maybe we can file a bug or even fix it etc | 19:04 |
| @clarkb:matrix.org | I have rechecked 840972 and requested an autohold on its etherpad job to do this | 19:08 |
| @clarkb:matrix.org | https://github.com/ether/etherpad-memory-leak-test exists too and I can point that at the held node if I can figure out running it | 19:11 |
| -@gerrit:opendev.org- Sabbir Ahmed proposed: [openstack/project-config] 989446: Add starlingx/app-machine-operator project https://review.opendev.org/c/openstack/project-config/+/989446 | 19:17 | |
| @clarkb:matrix.org | looks like we need the ep_stats plugin to get the /stats endpoint? | 19:18 |
| @clarkb:matrix.org | I tried fetching stats on the prod server but it doesn't work. Yet another reason that a held node may be helpful I guess | 19:18 |
| @fungicide:matrix.org | possible we block access to it in the vhost config? | 19:19 |
| @clarkb:matrix.org | anyway lunch now. I'll try to look closer in a bit. There is also a new gitea 1.26.2 release as of a few minutes ago if anyone wants to push a change for that | 19:19 |
| @clarkb:matrix.org | fungi: no this was fetching against http://localhost:9001 to bypass apache | 19:19 |
| @fungicide:matrix.org | ah | 19:22 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989448: Upgrade Gitea to 1.26.2 https://review.opendev.org/c/opendev/system-config/+/989448 | 19:38 | |
| @clarkb:matrix.org | there is a quick change to get gitea 1.26.2 rolling. We'll see how testing goes | 19:38 |
| @clarkb:matrix.org | aha /stats is controlled by the enableMetrics config setting which we set to false. I can enable that on the held node and then test some things | 19:40 |
| @fungicide:matrix.org | whoa, in gerrit 3.13 we'll finally be able to delete unused groups! | 19:45 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989453: Enable etherpad's stats endpoint for local requests https://review.opendev.org/c/opendev/system-config/+/989453 | 20:19 | |
| @clarkb:matrix.org | testing on the held node shows that the /stats endpoint is likely helpful here so I'm proposing we enable it in a limited capacity to aid debugging further heap issues in production | 20:20 |
| @fungicide:matrix.org | the testinfra tests are a good addition to make sure we don't accidentally expose it, thanks! | 20:22 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989453: Enable etherpad's stats endpoint for local requests https://review.opendev.org/c/opendev/system-config/+/989453 | 21:09 | |
| @mordred:waterwanders.com | MADNESS! | 21:20 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed wip: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/989464 | 22:25 | |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-jobs] 980499: Add and test an ensure-validate-pyproject role https://review.opendev.org/c/zuul/zuul-jobs/+/980499 | 22:29 | |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org marked as active: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/989464 | 22:38 | |
| @fungicide:matrix.org | Clark: you were at that ^ meeting today, if you have time to review/approve the new ml for the wg | 22:39 |
| @clarkb:matrix.org | fungi: is it ok if I +2 it and let you approve? I'm deep into nodejs debugging weeds right now | 22:42 |
| @clarkb:matrix.org | I should have a change up for ^ soon too. But I'm testing it end to end on the held node first (and it's just more debugging tooling not any fixing) | 22:42 |
| @fungicide:matrix.org | sure, no problem | 22:44 |
| @fungicide:matrix.org | unrelated, the persistent system-config-run-letsencrypt failures on https://review.opendev.org/c/opendev/system-config/+/989022 are baffling, something seems to be breaking dns resolution for the bridge99 test node | 22:45 |
| @fungicide:matrix.org | this must be new, i can't help but wonder if it's related to the recent unbound or bind package updates from earlier today | 22:47 |
| @fungicide:matrix.org | er, i guess it's actually the adns99 node that can't resolve | 22:48 |
| @fungicide:matrix.org | so maybe some sort of conflict between bind and unbound, since they both run on that node | 22:49 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989465: Enable nodejs heap snapshots with SIGUSR2 against etherpad https://review.opendev.org/c/opendev/system-config/+/989465 | 22:50 | |
| @clarkb:matrix.org | fungi: it's also not happening consistently since it passes in check but not the gate | 22:51 |
| @fungicide:matrix.org | oh, right | 22:51 |
| @clarkb:matrix.org | fungi: so maybe comparing where those things ran to see if there are obvious network differences that would explain it might help. It's possible that the change to modify the inventory is creating problems somehow. Like maybe the value in /etc/hosts which is the private address needs to match the address in inventory? | 22:51 |
| @clarkb:matrix.org | and for some clouds private == public and it is fine but others it isn't? | 22:52 |
| @fungicide:matrix.org | oh, yeah that could explain it | 22:52 |
| @clarkb:matrix.org | re 989465 I have successfully taken two heap snapshots on the held etherpad node using that and loaded them into chrome. I think this is a viable way to inspect memory growth | 22:52 |
| @fungicide:matrix.org | so for 989465 we don't need any additional modules to support the heapsnapshot feature? | 22:52 |
| @clarkb:matrix.org | I don't think I'm in a rush to land that though as I'm running out of steam to keep debugging this late into the day | 22:52 |
| @clarkb:matrix.org | fungi: nope, nodejs supports it, you just have to enable it which that NODE_OPTIONS entry does | 22:53 |
| @fungicide:matrix.org | very neat | 22:53 |
| @fungicide:matrix.org | i'm +2 on it for whenever you're ready to resume investigation | 22:53 |
| @fungicide:matrix.org | though about to step away for the evening myself at this point | 22:53 |
| @clarkb:matrix.org | thanks. I think first thing tomorrow should be fine | 22:54 |
| @clarkb:matrix.org | the held node isn't doing enough interesting things to see big deltas. But just toggling between the two snapshots in the viewer I can see some things do change. So I'm hopeful that with the big changes we see in prod it will be apparent where the memory is going | 22:55 |
| @fungicide:matrix.org | zuul's remaining time estimate for 989464 in the gate is strange. it seems to imply that it expects opendev-buildset-registry to run for an additional 30 minutes once it unpauses | 22:59 |
| @clarkb:matrix.org | Yes because it estimates based on job runtime and doesn't look at other jobs | 23:02 |
| @clarkb:matrix.org | So jobs that pause with variable length sets of children have broken estimates | 23:02 |
| @fungicide:matrix.org | ah, i guess it takes average total runtime into account, including any paused duration | 23:02 |
| @clarkb:matrix.org | Basically we're poisoning the data set | 23:02 |
| @fungicide:matrix.org | yep, makes sense | 23:03 |
| @fungicide:matrix.org | i'm used to looking at system-config changes with longer-than-average running jobs, so the opendev-buildset-registry runtime comes in shorter than the buildset estimate | 23:04 |
| @fungicide:matrix.org | essentially underestimating that job on those changes, but overestimating it on this one | 23:05 |
| @fungicide:matrix.org | i suppose it could have a multi-segment estimate where it knows the average runtime before pause and after pause and can add the latter to the longest depending job that needs to complete before unpausing, but that would get obnoxiously complicated | 23:06 |
| @clarkb:matrix.org | the most accurate would be to lookup the runtimes for the jobs the pause is waiting on and use that number | 23:07 |
| @clarkb:matrix.org | it has the data it just doesn't do the transitive lookup. I'm not sure how difficult it would be to find that data from the db given the current context though | 23:07 |
| @fungicide:matrix.org | right, basically what i was getting at, it would also need to know how much longer to expect that job to take after unpausing too | 23:08 |
| @clarkb:matrix.org | that info may not be available | 23:09 |
| @clarkb:matrix.org | I think it only has complete runtime for each build | 23:09 |
| @fungicide:matrix.org | which i suppose it could calculate by subtracting the paused duration from the runtime when updating the average, then subtract the time spent before pause on the current run from the average to get the remaining time that it estimates will be needed after unpause | 23:10 |
| @clarkb:matrix.org | multi level estimating | 23:11 |
| @fungicide:matrix.org | but yeah, not worth the complexity for something that's ultimately cosmetic | 23:13 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/989464 | 23:21 | |
| @jim:acmegating.com | we'll just make it say 99% complete like a windows copy dialog | 23:23 |
| @fungicide:matrix.org | or make it sometimes go backwards | 23:24 |
| @fungicide:matrix.org | infra-prod-service-zuul in hourly is taking quite a while. looking at service-zuul.yaml.log on bridge it's been idle for 20 minutes after pulling container images | 23:28 |
| @fungicide:matrix.org | oh, maybe it's waiting on one of the servers to finish pulling and it hasn't reported back yet | 23:29 |
| @fungicide:matrix.org | yeah, ze07 has been running `docker-compose pull` since 23:07 | 23:32 |
| @clarkb:matrix.org | those pulls are from quay not docker hub (just for information) | 23:33 |
| @fungicide:matrix.org | all other recent runs for that job have been between 4 and 8 minutes, so this is well into anomalous territory | 23:33 |
| @clarkb:matrix.org | the executor image is largeish but it should be pullable within a few minutes | 23:34 |
| @clarkb:matrix.org | maybe some sort of slowness between that particular node and quay due to network issues? | 23:34 |
| @fungicide:matrix.org | yeah, seems likely something is stuck for ze07, the other servers returned quickly | 23:34 |
| @fungicide:matrix.org | yeah, the job eventually timed out, i wonder what that's going to mean for subsequent runs if there's a stuck pull on ze07 (the process is still there on the server) | 23:38 |
| @fungicide:matrix.org | maybe it'll finally give up and terminate on its own | 23:38 |
| @fungicide:matrix.org | okay, i'm really checking out for the evening now, but will try to remember to check on ze07 again in the morning and see if it's fixed itself or needs some manual intervention | 23:49 |
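
Editor's note: the pause-aware estimate discussed between 23:06 and 23:10 can be sketched roughly as below. This is not Zuul code and uses no Zuul API. The function names, the `(total_runtime, paused_duration)` history format, and the sample numbers are all hypothetical, and the sketch assumes the paused duration of past builds is recorded, which the discussion suggests may not be the case.

```python
from statistics import mean


def average_active_runtime(history):
    """Average runtime of past builds with the paused time subtracted.

    history: list of (total_runtime, paused_duration) tuples in seconds.
    Hypothetical format; assumes paused durations are available.
    """
    return mean(total - paused for total, paused in history)


def remaining_estimate(history, active_before_pause, children_remaining):
    """Estimate seconds left for a paused job.

    active_before_pause: seconds the current build ran before pausing.
    children_remaining: estimated seconds left for each job the pause
    is waiting on (the longest one determines when the job unpauses).
    """
    wait_for_children = max(children_remaining, default=0)
    # Work expected after unpausing; clamp at zero if this build has
    # already run longer than the historical active average.
    after_unpause = max(0.0, average_active_runtime(history) - active_before_pause)
    return wait_for_children + after_unpause


# Made-up sample data: three past builds, each with 600s of active time.
history = [(3600, 3000), (1500, 900), (2400, 1800)]
estimate = remaining_estimate(history, active_before_pause=480,
                              children_remaining=[300, 120])
print(estimate)
```

With these sample numbers the longest dependent job needs 300 more seconds and the paused job is expected to need 120 seconds after unpausing, so the estimate no longer depends on how long unrelated buildsets kept the job paused in the past.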
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!