Wednesday, 2026-05-20

@jim:acmegating.comno worries, thanks for the fix!00:01
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/zone-gating.dev] 989307: Update address records for domain root https://review.opendev.org/c/opendev/zone-gating.dev/+/98930700:01
@fungicide:matrix.organyway, now that's done and dusted, i'm really going to offline before i break something else00:04
-@gerrit:opendev.org- OpenStack Proposal Bot proposed: [openstack/project-config] 989320: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/98932002:58
-@gerrit:opendev.org- Zuul merged: [openstack/project-config] 989320: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/98932005:54
@harbott.osism.tech:regio.chatI haven't seen any zuul issue mentioned yesterday, I'm wondering why https://review.opendev.org/c/openstack/requirements/+/988934 didn't transition into gate after the recheck07:46
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 989377: Shadow mirror.logs RO replica with RW original https://review.opendev.org/c/opendev/system-config/+/98937713:05
@fungicide:matrix.orgJens Harbott: my only guess is that there's an unreported merge conflict because the parent commit is about 70 behind the current state of the master branch13:11
-@gerrit:opendev.org- Mauricio Harley proposed: [openstack/project-config] 988770: Add #openstack-pqc IRC channel https://review.opendev.org/c/openstack/project-config/+/98877013:51
-@gerrit:opendev.org- Zuul merged on behalf of Mauricio Harley: [openstack/project-config] 988770: Add #openstack-pqc IRC channel https://review.opendev.org/c/openstack/project-config/+/98877014:09
@fungicide:matrix.orghttps://review.opendev.org/c/opendev/zuul-providers/+/989136 hit a mix of http/499 client disconnects from rackspace flex swift and enospc during image conversion, how did we get around those previously?14:39
@clarkb:matrix.orgfungi: the 499 is new. I don't think we've seen that before. The enospc problems were somewhat mitigated by the changes corvus wrote to change the order of image conversions and otherwise be slightly more efficient during some steps. But they didn't fully go away14:45
@clarkb:matrix.orgI would look at the 499s more closely to start since I think those are unexpected14:46
@fungicide:matrix.orgwhat's odd is that 9 uploaded fine, 15 failed (though only some of those got as far as trying to upload)14:48
-@gerrit:opendev.org- Clark Boylan proposed on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/98902814:56
@clarkb:matrix.orgfungi: ^ I think that may get the test job to pass14:56
@mordred:waterwanders.comClark: did you ever figure out the public_ip thing from yesterday? I scanned scrollback at dinner but wasn't quite sure. happy to help look if it's still a mystery14:58
@mordred:waterwanders.com(I'd buy the missing stuff being related to fetching fast given backend eventual consistency)15:00
@mordred:waterwanders.com`openstack.cloud.meta.add_server_interfaces` *definitely* does a *lot* of shenanigans to try to set those two fields properly so it's certainly not unreasonable for a bug there to leave you with the wrong value - depending on what the raw openstack json looks like15:02
@fungicide:matrix.orgmordred: yes, turns out it was an intentional choice 6 years ago, because we use that address to populate the /etc/hosts entry and to prefer backend/private network communication between nodes in providers that have them15:04
@mordred:waterwanders.comAH nod15:04
@mordred:waterwanders.comyeah. actually - I think I remember that discussion from 6 years ago :)15:04
@mordred:waterwanders.comfeature not bug!15:05
@fungicide:matrix.orgthe plan, aiui, is to put together a more deliberate solution that separates those concerns and labels them more clearly as what they are15:05
@clarkb:matrix.orgWe didn't figure out why public_v6 is empty though15:05
@mordred:waterwanders.comgremlins15:05
@mordred:waterwanders.comthey love stealing v6 addresses. just don't feed them after midnight15:05
@fungicide:matrix.orgright, that one's a head-scratcher, though the theory about async assignment in the cloud racing our api response makes sense15:05
@clarkb:matrix.orgRight when we fetch the data against instances that have been running the data comes back as expected. But zuul is fetching it immediately after creation so theory is some race with the data being available from the cloud15:07
@mordred:waterwanders.comit's a completely reasonable theory. this is rackspace right? like classic? So no floating?15:13
@clarkb:matrix.orgcorrect. Also ovh which is also no floating ip. osuosl doesn't have this problem (not sure if that is floating or not off the top of my head)15:13
@fungicide:matrix.orgosuosl is not floating ip15:16
@fungicide:matrix.orgthe only provider currently doing floating ip for our nodes is rackspace flex15:17
@clarkb:matrix.orgwow ansible-lint doesn't list all failures up front. I "fixed" the first error and now have 134 errors. Time to drop the linting job instead15:17
@mordred:waterwanders.comyeah - that should just be coming directly from the server dict, but a race on the backend would not be surprising at all. sdk doesn't make any _additional_ calls on the backend for that case (which otherwise would, of course, add latency, and increase chance that nova would catch up) so I'd put my vote in with "zuul is fetching quickly enough to catch the race" and "in the ansible case with sdk there's enough lag time it's not quick enough to catch the race"15:17
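The race theory above (the address is missing only when the server record is read immediately after creation) suggests one possible workaround: re-read the record until the field is populated. The following is a minimal sketch of that idea only, not what zuul or the sdk actually does. The `wait_for_field` helper, its parameters, and the `fetch_server` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Mapping, Optional


def wait_for_field(
    fetch_server: Callable[[], Mapping[str, str]],
    field: str = "public_v6",
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Re-read a server record until `field` is non-empty or attempts run out.

    `fetch_server` is a hypothetical callable that returns the current server
    record as a mapping; a real caller would wrap its cloud API request in it.
    """
    for attempt in range(attempts):
        value = fetch_server().get(field)
        if value:
            return value
        # Only pause between attempts, not after the final one.
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Whether a bounded retry like this is appropriate depends on how long the cloud takes to catch up, which the discussion leaves unconfirmed.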
-@gerrit:opendev.org- Clark Boylan proposed on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/98902815:20
@clarkb:matrix.orgfungi: ^ now with the linter job disabled15:21
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [opendev/bindep] 989416: Rewrite bindep in rust https://review.opendev.org/c/opendev/bindep/+/98941615:28
@mordred:waterwanders.comSo ... I did a thing ^^ that I absolutely do not expect to get merged or really even reviewed, but I wanted to push it up anyway15:28
@mordred:waterwanders.comI think the better approach is to just make a new personal repo called bindep-rs to hold that, but I also think "ask before fork" is always the correct first step. so anyway - there's that15:30
@clarkb:matrix.orgfungi: ok I think https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 https://review.opendev.org/c/opendev/system-config/+/989022 and https://review.opendev.org/c/opendev/system-config/+/988698 are all now in a mergeable state which should hopefully close out the ansible 9 upgrade fallout15:31
@fungicide:matrix.orgmordred: i don't object. i don't feel like i'd personally be well-equipped to maintain a rust version of bindep, but it has stabilized to the point where at least someone maintaining a separate one in parallel probably doesn't have to worry much about divergence due to bitrot15:32
@mordred:waterwanders.com++ - there's actually a test suite for "install python bindep, run it and ensure compat"15:33
@fungicide:matrix.orgwe occasionally merge a change to add support for another distro, but aside from that the majority of recent changes are just to keep up with python packaging conventions15:34
@mordred:waterwanders.com(this came up because I went crazy and decided to revive drizzle - which needs to start out on 12.04 - which does not have python3 but where I can happily copy in a static binary)15:35
@fungicide:matrix.orgoh cool!15:35
@fungicide:matrix.orgsadly, i think spi let the domain registration go recently15:35
@fungicide:matrix.orgwhois is redacted and useless these days and i don't feel like figuring out how to log into spi's account at gandi but maybe we do still control it actually, there's just no site being served there15:38
@fungicide:matrix.orgwe did disassociate the project 7 years ago though: https://www.spi-inc.org/corporate/resolutions/2019/2019-08-11.tp.1/15:39
@fungicide:matrix.orgoh, i think it's the trademark we finally let lapse15:39
@mordred:waterwanders.comwow - 2019 was 7 years ago15:40
@fungicide:matrix.orgbecause paying for trademark registration renewals didn't make sense15:40
@mordred:waterwanders.comwell - I'm glad you think that's cool ...15:40
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/98941715:40
@fungicide:matrix.orgheh15:40
@mordred:waterwanders.comif you launch a 12.04 container (or use the containerfile in the updated repo) - it TOTALLY builds and passes all its tests15:41
@fungicide:matrix.organd yeah, now that i think about it, i believe spi did continue to hold onto the drizzle.org domain because gandi doesn't charge us for domain registrations15:41
@mordred:waterwanders.comneat. so - if I get ambitious, I could _probably_ reach out to spi and ask for the domain to be pointed somewhere then15:42
@mordred:waterwanders.comshould probably see if I can get it upcycled to run on resolute before bothering15:43
@fungicide:matrix.orgmordred: since it's considered an asset of value held by an irs 501(c)(3) charity, i think your options are to either pursue reassociation with spi or with another c3 charity. transferring assets from a charity to any non-charity org is legally hard/risky15:46
@mordred:waterwanders.comoh - yeah - I think reassociation would be the most sensible thing, assuming it gets that far. I definitely wouldn't think it's ok to transfer control just to a me15:48
@fungicide:matrix.org(this has come up before for spi, a notable example was when we gave up opensource.org to osi)15:48
@fungicide:matrix.orgbut if you did end up reassociating with spi, for example, you could still host the dns and web presence for drizzle.org here in opendev if you wanted15:49
@mordred:waterwanders.com++ cool15:50
@fungicide:matrix.orgwould just need spi's admins to point gandi at the opendev nameservers15:50
@mordred:waterwanders.comspeaking of - my opendev usage has ramped up quite a bit since leaving the warm embrace of pappa larry. might not be a bad idea to start helping out again. my time comes in unpredictable bursts, so I don't think I could be considered "reliable" (but then again, was I ever?) 15:55
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/98941715:57
@clarkb:matrix.orgwe appreciate any help we can get16:00
@clarkb:matrix.orgunrelated I accidentally ran ed just a few minutes ago. Took me a minute to figure out what was happening16:01
@clarkb:matrix.orgfungi: I'm going to finish up some breakfast stuff, but then I'm going to look into redoing the gerrit 3.13 upgrade testing with the new container images. The set of ansible 9 followup changes 989028, 989022 and 988698 should all be good to go whenever you're ready I think16:02
@fungicide:matrix.orgyeah, i just finished lunch and planning to catch up on those next16:17
@tkajinam:matrix.orgI'm wondering what's the future of storyboard. It hasn't been updated for a few years and its CI has been badly broken (see https://review.opendev.org/c/opendev/storyboard/+/922699 which is still incomplete ...)16:23
@tkajinam:matrix.orgI have a growing concern about it, seeing recent mass vulnerability detection using AI tools.16:24
@clarkb:matrix.orgit hasn't been maintained for many years. I don't think that is any different today16:24
@tkajinam:matrix.orgyes. I agree16:24
@clarkb:matrix.orgUnfortunately, it's relatively low on the priority list for everyone and probably even lower with the current state of the world16:25
@clarkb:matrix.orgif there is interest in resurrecting it I think we could support that. But right now I am not personally able to dedicate much if any time to it16:26
@mordred:waterwanders.comfunny story ...16:26
@mordred:waterwanders.comI was literally JUST about to make a suggestion/propose a question about that16:26
@tkajinam:matrix.orgconsidering migration out from storyboard (as was done by a few projects already) might be a better approach16:28
@clarkb:matrix.organd I guess to be blunt I think OpenDev could use more investment generally. I really appreciate mnasiadka volunteering recently to help. Recently with some outages in opendev and ubuntu there have been a ton of complaints. But reality is there is a small group of us working hard to keep the lights on and it's only getting harder to do so16:28
@clarkb:matrix.orgwithout investment and additional overhead (largely driven by LLMs) more outages and problems can probably be expected16:29
@fungicide:matrix.orgpart of the challenge there is that it's still being deployed with puppet and running under python 2.7, but because storyboard dropped python 2 support at one point we're not actually deploying the latest version which has fixes implemented for a variety of bugs and performance issues we're not taking advantage of16:29
@mordred:waterwanders.comthere's another pattern emerging in the agentic dev space which is "point agents at the api of an appropriate task tracker and use that for coordination, so that as it's working on one thing and notices a future todo, it can just plop it down and another agent can pick it up without it getting lost" Temporal is what people seem to like - but I don't really like closed source things, so I was thinking "what if I updated storyboard - it's api-first based anyway - and added it to the gerrit/zuul set I'm already using"16:30
@fungicide:matrix.orgmordred even got it set up to build container images at one point, but nobody's had time to convert our ancient deployment to ansible orchestrating those containers16:30
@clarkb:matrix.orgI think we stopped building those containers? It may require resurrecting that effort too16:30
@clarkb:matrix.orgin any case I brought this up at the PTG. It feels like we're on an LLM runner's high but instead of realizing that this requires new investment in infrastructure to support it we're just letting everyone NIH new features like crazy and it's more of a survival of the fittest situation16:31
@clarkb:matrix.orgessentially from my point of view LLMs should create a reevaluation of where the human priorities are in the software development process. This isn't happening. Instead we're just doubling down on existing processes and hoping that if we ramp them up to 11 the rest of the system won't break16:32
@fungicide:matrix.organd then there's an architectural issue with storyboard, which is that it fell into the same trap a lot of projects do chasing the javascript framework du jour (angular in this case) which was inevitably abandoned with no clear migration path to any successor16:32
@clarkb:matrix.orgfungi: I think if you write any js you just have to accept rewriting everything at least once every 2 years :)16:33
@fungicide:matrix.orgyes, that's the impression i get from the handful of javascript-based applications i'm brave enough to touch with a 10 meter pole16:33
@mordred:waterwanders.comyeah- so - migrating from old angular to using the same js stuff we use for zuul wouldn't be hard with agentic help. I don't know that we have a team of humans who would feel super comfortable reviewing that, but with zuul preview pointing at prod perhaps that's review enough?16:34
@mordred:waterwanders.combut - honestly - I'm not sure we have reviewers for storyboard even with hand-written patches - so the question would be, sort of along clark's line earlier- what's the right application of limited human resources for this?16:35
@fungicide:matrix.orgtkajinam: to be clear, there has been (to my knowledge) no "migration out of storyboard" by any project, they just end up abandoning what they have on sb and starting from scratch somewhere like lp, i've never seen anyone move any history from sb to lp16:36
@tkajinam:matrix.org* yes I'm aware of that.16:36
@fungicide:matrix.orgthere *is* a rest api in sb that you could use to export all public history for a project, but i don't know of tooling anyone's written to import that data somewhere else16:37
@tkajinam:matrix.orgfungi: I agree. The major projects I'm aware of are ironic and telemetry but both of these rely on existing ones left (and visible) in storyboard16:39
@fungicide:matrix.orgin contrast, we did develop tooling to extract bug history from launchpad and import it into storyboard16:40
@fungicide:matrix.orgwhich was what we used originally for projects migrating from lp to sb16:41
-@gerrit:opendev.org- Takashi Kajinami proposed: [opendev/storyboard] 922699: Fix test executions https://review.opendev.org/c/opendev/storyboard/+/92269916:42
@tkajinam:matrix.orgwe could technically do the reverse but that would be tricky. wondering if there is any short-cut solution to provide only a view of existing items in sb...16:43
@fungicide:matrix.orgyou could do it by checking for equivalent bug numbers in lp, up to the point where we started to collide16:44
@fungicide:matrix.orgwe originally reserved all story numbers below 2000000 for imports from lp, so other than already open stories that then got updated, the only stories which weren't lp imports and were created originally on sb are numbered 2000000 and above16:46
@fungicide:matrix.orgwe should have done numbers below 10m or something instead, because lp use accelerated and it hit 2m a few years after we began using sb, but by then there weren't really any projects that migrated in anyway16:47
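The numbering scheme described above can be sketched as a simple threshold check: story numbers below 2000000 were reserved for Launchpad imports, and stories created directly on StoryBoard are numbered 2000000 and above. The function and constant names here are hypothetical and not part of any StoryBoard tooling. The number alone also says nothing about whether an imported story was later updated on StoryBoard.

```python
# Story numbers below this value were reserved for imports from Launchpad,
# per the discussion above. The constant name is illustrative only.
LP_RESERVED_LIMIT = 2_000_000


def story_origin(story_id: int) -> str:
    """Classify a StoryBoard story number by where it was originally created."""
    if story_id <= 0:
        raise ValueError("story numbers are positive integers")
    if story_id < LP_RESERVED_LIMIT:
        # Imported stories keep the number of the equivalent Launchpad bug.
        return "launchpad-import"
    return "storyboard-native"
```

A reverse migration tool could use a check like this to decide whether to look for an equivalent bug number in Launchpad, bearing in mind that Launchpad's own bug numbers later passed 2000000, so the mapping is only unambiguous below that point.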
@tkajinam:matrix.orgthe imported bugs, which were closed in sb after migration, might be tricky though I expect (I hope) these are not many.16:51
@tkajinam:matrix.orgsorry I have to leave soon. Notification from gerrit reminds me of that long-remaining fix-ci patch and I just wanted to raise the situation somewhere so that I don't forget and leave it for another year...16:54
@tkajinam:matrix.orgbased on the current tricky situation, we may not get a clear path soon but for the time being I'll have a chat with some heat folks regarding migration back (because heat is still staying in storyboard).16:55
@clarkb:matrix.orgok I have tested the Gerrit upgrade from 3.12.7 -> 3.13.6 and then downgraded back to 3.12.7 again. My notes in https://etherpad.opendev.org/p/gerrit-upgrade-3.13 have been updated with new infos. One thing I noticed this time (and actually last time but didn't understand it super well) is that after the 3.13 upgrade there are web client errors that pop up about fetching robot comments. The reason appears to be that the 3.12 client in the browser is still trying to fetch robot comments which are gone in 3.13. A hard refresh fixes this. Similarly when you downgrade the 3.13 client tries to fetch is-flows-enabled from gerrit and 3.12 knows nothing about it so you get a similar error. A hard refresh also corrects that17:02
@clarkb:matrix.orgI think in the email comms for the upgrade I'll note the robot comments errors and suggest a browser refresh if anyone runs into that17:03
-@gerrit:opendev.org- Clark Boylan proposed:17:07
- [opendev/system-config] 989428: Fix system-config-run-review-3.13 images requires https://review.opendev.org/c/opendev/system-config/+/989428
- [opendev/system-config] 989429: Upgrade Gerrit to 3.13 https://review.opendev.org/c/opendev/system-config/+/989429
@clarkb:matrix.orginfra-root 989428 is a bugfix for the current system-config jobs and can be landed now. 989429 is prep work for the upgrade process and I'll WIP it as a result17:08
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/98902817:18
@clarkb:matrix.orgI've started a draft gerrit upgrade announcement here: https://etherpad.opendev.org/p/SFnVUJwITKx-8Ed3k7Ev17:22
@fungicide:matrix.organnouncement lgtm, made some small edits17:44
@fungicide:matrix.orgdo you happen to know how the new auth tokens would be implemented by rest api clients? e.g. will zuul, gertty and some of our custom scripts that currently use passwords need to change how they authenticate?17:45
@fungicide:matrix.orgor can the tokens just be passed as if they're passwords, like how pypi's upload tokens work?17:46
@fungicide:matrix.orgi guess it's not urgent to figure out, and will be easier to experiment with once we have 3.13 in production17:47
@mordred:waterwanders.comI think we get to use both passwords and tokens in parallel for this release, right?17:50
@fungicide:matrix.orgcorrect, which is why i say it's not urgent to figure out17:51
@fungicide:matrix.orgpossibly for the next many releases, they don't say when they're removing legacy http password support17:51
@clarkb:matrix.orgfungi: you use them as basic auth password like you currently do. Or you can use them as jwt like auth tokens in headers17:51
@fungicide:matrix.orgokay perfect17:51
@mordred:waterwanders.comcool17:52
@clarkb:matrix.orgBut my understanding is basic auth with the value will continue to work for compatibility 17:52
@fungicide:matrix.orgso in theory no changes are needed to clients17:52
@clarkb:matrix.orgYa that is my read of it17:52
@fungicide:matrix.orgeven once legacy passwords are gone entirely17:52
@fungicide:matrix.orgthat's how pypi's upload tokens work too, i just wanted to be sure it wasn't locking us into a multi-step auth handshake instead17:53
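If tokens are accepted in place of the HTTP password as described above, a REST client would build the same Basic auth header either way. The following is a minimal sketch under that assumption, using only the Python standard library. The `gerrit_request` helper, the example hostname, and the credentials are hypothetical. The `/a/` prefix is Gerrit's existing convention for authenticated REST paths.

```python
import base64
import urllib.request


def gerrit_request(base_url: str, path: str, username: str, secret: str) -> urllib.request.Request:
    """Build an authenticated Gerrit REST request without sending it.

    `secret` may be a legacy HTTP password or, assuming the behavior described
    in the discussion, a new-style auth token; the header is built identically.
    """
    credentials = f"{username}:{secret}".encode("utf-8")
    header = "Basic " + base64.b64encode(credentials).decode("ascii")
    # Authenticated Gerrit REST endpoints live under the /a/ prefix.
    url = f"{base_url.rstrip('/')}/a/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"Authorization": header})


# Hypothetical host and credentials, for illustration only.
request = gerrit_request("https://review.example.org", "accounts/self", "alice", "example-token")
```

Because nothing in the request shape changes, clients that already send a password this way would only need the stored secret swapped, which matches the "no changes are needed to clients" reading above.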
@jim:acmegating.comi believe it is difficult/impossible to generate new legacy auth17:53
@jim:acmegating.comso that's their upgrade-forcing path17:54
@fungicide:matrix.orgmakes sense17:54
@fungicide:matrix.orgso it's ultimately just a change in the schema for what's encoded in the string that's passed as a basic auth password, in that case17:55
@fungicide:matrix.orgrather than being a "simple" kdf over some random data, it's got signed metadata for things like expiration17:56
@fungicide:matrix.orga certificate, essentially17:56
@clarkb:matrix.orgmaybe? I think the main change is how they are storing the data on the backend17:57
@clarkb:matrix.orgbecause now you can have multiple tokens instead of just one aiui17:58
@clarkb:matrix.orgI don't know if they rely on signatures to expire them or just a db record17:58
@clarkb:matrix.orgsince they are the issuer and validator they don't need to rely on signatures to validate that stuff17:58
@clarkb:matrix.orgI was asked in the upstream discord why we don't just convert as part of the upgrade and the reason in my head is that may make downgrading more painful17:59
@clarkb:matrix.orgI think it's good to upgrade and be happy with 3.13 generally then commit to the new 3.13 passwords17:59
@fungicide:matrix.orgat least with pypi's tokens you can redelegate by creating your own rescoped versions of them, e.g. using https://pypi.org/project/pypitoken-cli/17:59
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 989417: Import drizzle revival and add a bindep-rs support project https://review.opendev.org/c/openstack/project-config/+/98941718:01
@mordred:waterwanders.comyou'd think I could get a projects.yaml right in less than three revisions, but you'd be wrong18:02
@fungicide:matrix.orgthis is why we have fast linters configured to run in check18:04
@mordred:waterwanders.com++18:13
@mordred:waterwanders.comyay! I got it right after 3 tries. I'm so proud of myself. maybe I'll get to switch to bigboy pants one day18:18
@clarkb:matrix.orgetherpad is hitting `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory` and restarting occasionally. Likely fallout from the recent upgrade18:23
@clarkb:matrix.orgI will look at that after this meeting and lunch. I believe nodejs has a very conservative heap size and can be enlarged18:24
@fungicide:matrix.orgooh, that would explain why i saw my client flashing up very brief errors about reconnecting earlier18:28
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989428: Fix system-config-run-review-3.13 images requires https://review.opendev.org/c/opendev/system-config/+/98942818:28
@clarkb:matrix.orglooking at top I think there is a memory leak actually18:29
@clarkb:matrix.orgso this may not be an easy fix18:30
@fungicide:matrix.orgbut also the server is only a 4gb flavor18:30
@clarkb:matrix.orgya but etherpad has 19GB virtual and only 840MB resident but is steadily climbing18:30
@clarkb:matrix.orgwe just cleared 1gb resident18:31
@fungicide:matrix.orgright now the server has a dedicated 4gb swap partition on the ephemeral disk, shared with /opt which is using the remaining blocks but does have some content18:31
@clarkb:matrix.orgvirtual memory is not climbing as quickly18:32
@fungicide:matrix.orgi think if we put the server in the disable list for a bit to avoid battling ansible, we can umount and shrink /opt to make more room for swap18:32
@fungicide:matrix.orgor just add a swapfile on /opt and swapon it18:32
@fungicide:matrix.orgmaybe that's the faster band-aid18:32
@clarkb:matrix.orgit's restarting every ~15 minutes18:33
@fungicide:matrix.orgi wonder if this is fixed in 3.118:33
@fungicide:matrix.orgi don't see it called out in the release notes if so18:35
@clarkb:matrix.orgya not finding any issues for memory leaks recently either open or closed18:35
@fungicide:matrix.orgcould it be crawler-induced?18:36
@fungicide:matrix.orgcan't use etherpad without js so we could frontend with anubis i suppose18:36
@fungicide:matrix.orgif we decide that's the source of the memory consumption18:36
@clarkb:matrix.orgthat is possible I guess18:36
@clarkb:matrix.orgbut anything reasonably causing the memory to increase is probably running js?18:37
-@gerrit:opendev.org- Sabbir Ahmed proposed: [openstack/project-config] 989446: Add starlingx/app-machine-operator project https://review.opendev.org/c/openstack/project-config/+/98944618:40
@clarkb:matrix.orgwe just hit 2gb resident18:40
@clarkb:matrix.orglooks like nodejs also changed its gc implementation like python 3.1418:41
@clarkb:matrix.orgso maybe if we downgrade to nodejs22 it would be happier or maybe we can configure it more?18:42
@clarkb:matrix.orgok it just restarted around 2gb resident18:42
@fungicide:matrix.orgis nodejs set to avoid consuming more memory than that?18:49
@fungicide:matrix.orgbecause yeah, i haven't seen it really dip into swap18:49
@clarkb:matrix.orgyes I think the default is 4gb but we only have 4gb present so it must tune that down to 2gb18:50
@clarkb:matrix.orgI haven't confirmed this though. but based on the evidence of having OOM errors at around 2gb it appears likely18:50
@clarkb:matrix.orgI think it may only be garbage collecting every 3 seconds?18:51
@jim:acmegating.comthat's still < 15m so that shouldn't be a prob?19:00
@clarkb:matrix.orgya I just expected it to be much more often (like every few hundred ms)19:03
@clarkb:matrix.orgI suspect the issue is a memory leak and not nodejs gc behaviors19:03
@clarkb:matrix.orgI think what I would like to do is hold a node and see if we can profile this outside of production19:04
@clarkb:matrix.org(that is unlikely to be fast, but seems necessary to get to the bottom of the issue)19:04
@clarkb:matrix.orgthen maybe we can file a bug or even fix it etc19:04
@clarkb:matrix.orgI have rechecked 840972 and requested an autohold on its etherpad job to do this19:08
@clarkb:matrix.orghttps://github.com/ether/etherpad-memory-leak-test exists too and I can point that at the held node if I can figure out running it19:11
-@gerrit:opendev.org- Sabbir Ahmed proposed: [openstack/project-config] 989446: Add starlingx/app-machine-operator project https://review.opendev.org/c/openstack/project-config/+/98944619:17
@clarkb:matrix.orglooks like we need the ep_stats plugin to get the /stats endpoint?19:18
@clarkb:matrix.orgI tried fetching stats on the prod server but it doesn't work. Yet another reason that a held node may be helpful I guess19:18
@fungicide:matrix.orgpossible we block access to it in the vhost config?19:19
@clarkb:matrix.organyway lunch now. I'll try to look closer in a bit.  There is also a new gitea 1.26.2 release as of a few minutes ago if anyone wants to push a change for that19:19
@clarkb:matrix.orgfungi: no this was fetching against http://localhost:9001 to bypass apache19:19
@fungicide:matrix.orgah19:22
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989448: Upgrade Gitea to 1.26.2 https://review.opendev.org/c/opendev/system-config/+/98944819:38
@clarkb:matrix.orgthere is a quick change to get gitea 1.26.2 rolling. We'll see how testing goes19:38
@clarkb:matrix.orgaha /stats is controlled by the enableMetrics config setting which we set to false. I can enable that on the held node and then test some things19:40
@fungicide:matrix.orgwhoa, in gerrit 3.13 we'll finally be able to delete unused groups!19:45
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989453: Enable etherpad's stats endpoint for local requests https://review.opendev.org/c/opendev/system-config/+/98945320:19
@clarkb:matrix.orgtesting on the held node shows that the /stats endpoint is likely helpful here so I'm proposing we enable it in a limited capacity to aid further debugging of heap issues in production20:20
@fungicide:matrix.orgthe testinfra tests are a good addition to make sure we don't accidentally expose it, thanks!20:22
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989453: Enable etherpad's stats endpoint for local requests https://review.opendev.org/c/opendev/system-config/+/98945321:09
@mordred:waterwanders.comMADNESS!21:20
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed wip: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/98946422:25
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-jobs] 980499: Add and test an ensure-validate-pyproject role https://review.opendev.org/c/zuul/zuul-jobs/+/98049922:29
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org marked as active: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/98946422:38
@fungicide:matrix.orgClark: you were at that ^ meeting today, if you have time to review/approve the new ml for the wg22:39
@clarkb:matrix.orgfungi: is it ok if I +2 it and let you approve? I'm deep into nodejs debugging weeds right now22:42
@clarkb:matrix.orgI should have a change up for ^ soon too. But I'm testing it end to end on the held node first (and it's just more debugging tooling not any fixing)22:42
@fungicide:matrix.orgsure, no problem22:44
@fungicide:matrix.orgunrelated, the persistent system-config-run-letsencrypt failures on https://review.opendev.org/c/opendev/system-config/+/989022 are baffling, something seems to be breaking dns resolution for the bridge99 test node22:45
@fungicide:matrix.orgthis must be new, i can't help but wonder if it's related to the recent unbound or bind package updates from earlier today22:47
@fungicide:matrix.orger, i guess it's actually the adns99 node that can't resolve22:48
@fungicide:matrix.orgso maybe some sort of conflict between bind and unbound, since they both run on that node22:49
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989465: Enable nodejs heap snapshots with SIGUSR2 against etherpad https://review.opendev.org/c/opendev/system-config/+/98946522:50
@clarkb:matrix.orgfungi: it's also not happening consistently since it passes in check but not the gate22:51
@fungicide:matrix.orgoh, right22:51
@clarkb:matrix.orgfungi: so maybe comparing where those things ran to see if there are obvious network differences that would explain it might help. It's possible that the change to modify the inventory is creating problems somehow. Like maybe the value in /etc/hosts which is the private address needs to match the address in inventory?22:51
@clarkb:matrix.organd for some clouds private == public and it is fine but others it isn't?22:52
@fungicide:matrix.orgoh, yeah that could explain it22:52
@clarkb:matrix.orgre 989465 I have successfully taken two heap snapshots on the held etherpad node using that and loaded them into chrome. I think this is a viable way to inspect memory growth22:52
@fungicide:matrix.orgso for 989465 we don't need any additional modules to support the heapsnapshot feature?22:52
@clarkb:matrix.orgI don't think I'm in a rush to land that though as I'm running out of steam to keep debugging this late into the day22:52
@clarkb:matrix.orgfungi: nope, nodejs supports it, you just have to enable it, which that NODE_OPTIONS entry does22:53
@fungicide:matrix.orgvery neat22:53
@fungicide:matrix.orgi'm +2 on it for whenever you're ready to resume investigation22:53
@fungicide:matrix.orgthough about to step away for the evening myself at this point22:53
@clarkb:matrix.orgthanks. I think first thing tomorrow should be fine22:54
@clarkb:matrix.orgthe held node isn't doing enough interesting things to see big deltas. But just toggling between the two snapshots in the viewer I can see some things do change. So I'm hopeful that with the big changes we see in prod it will be apparent where the memory is going22:55
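[Editor's note] The heap snapshot approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the content of change 989465: it assumes the NODE_OPTIONS entry is Node.js's `--heapsnapshot-signal` flag set to SIGUSR2, and the helper names `node_env_with_heap_snapshots` and `request_heap_snapshot` are hypothetical.

```python
import os
import signal


def node_env_with_heap_snapshots(base_env=None, sig="SIGUSR2"):
    """Return a copy of the environment with heap snapshots enabled.

    Assumption: the service reads NODE_OPTIONS at startup and the flag
    used is Node.js's --heapsnapshot-signal.
    """
    env = dict(os.environ if base_env is None else base_env)
    flag = f"--heapsnapshot-signal={sig}"
    existing = env.get("NODE_OPTIONS", "")
    # Append the flag only once so repeated calls stay idempotent.
    if flag not in existing.split():
        env["NODE_OPTIONS"] = f"{existing} {flag}".strip()
    return env


def request_heap_snapshot(pid):
    """Ask a running Node.js process (started with the flag) for a snapshot."""
    os.kill(pid, signal.SIGUSR2)
```

With the flag in place, sending SIGUSR2 to the node process should make it write a `.heapsnapshot` file, which can then be loaded into Chrome's developer tools for comparison, as described in the conversation.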
@fungicide:matrix.orgzuul's remaining time estimate for 989464 in the gate is strange. it seems to imply that it expects opendev-buildset-registry to run for an additional 30 minutes once it unpauses22:59
@clarkb:matrix.orgYes because it estimates based on job runtime and doesn't look at other jobs23:02
@clarkb:matrix.orgSo jobs that pause with variable length sets of children have broken estimates23:02
@fungicide:matrix.orgah, i guess it takes average total runtime into account, including any paused duration23:02
@clarkb:matrix.orgBasically we're poisoning the data set23:02
@fungicide:matrix.orgyep, makes sense23:03
@fungicide:matrix.orgi'm used to looking at system-config changes with longer-than-average running jobs, so the opendev-buildset-registry runtime comes in shorter than the buildset estimate23:04
@fungicide:matrix.orgessentially underestimating that job on those changes, but overestimating it on this one23:05
@fungicide:matrix.orgi suppose it could have a multi-segment estimate where it knows the average runtime before pause and after pause and can add the latter to the longest depending job that needs to complete before unpausing, but that would get obnoxiously complicated23:06
@clarkb:matrix.orgthe most accurate would be to lookup the runtimes for the jobs the pause is waiting on and use that number23:07
@clarkb:matrix.orgit has the data, it just doesn't do the transitive lookup. I'm not sure how difficult it would be to find that data from the db given the current context though23:07
@fungicide:matrix.orgright, basically what i was getting at, it would also need to know how much longer to expect that job to take after unpausing too23:08
@clarkb:matrix.orgthat info may not be available23:09
@clarkb:matrix.orgI think it only has complete runtime for each build23:09
@fungicide:matrix.orgwhich i suppose it could calculate by subtracting the paused duration from the runtime when updating the average, then subtract the time spent before pause on the current run from the average to get the remaining time that it estimates will be needed after unpause23:10
@clarkb:matrix.orgmulti level estimating23:11
@fungicide:matrix.orgbut yeah, not worth the complexity for something that's ultimately cosmetic23:13
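[Editor's note] The multi-segment estimate described above can be worked through with a small sketch. This is not how Zuul computes estimates; it only illustrates the arithmetic from the conversation. The `PastBuild` record and its `paused_duration` field are hypothetical, and as noted in the discussion, the paused duration may not actually be recorded per build.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PastBuild:
    total_runtime: float    # wall-clock seconds, including time spent paused
    paused_duration: float  # hypothetical: seconds spent paused


def average_active_runtime(history):
    """Average runtime with the paused portion subtracted from each build."""
    return mean(b.total_runtime - b.paused_duration for b in history)


def estimate_remaining(history, elapsed_before_pause, children_remaining):
    """Estimate seconds left for a currently paused job.

    Remaining time is the longest estimate among the jobs the pause is
    waiting on, plus the expected work left after unpausing.
    """
    after_unpause = max(0.0, average_active_runtime(history) - elapsed_before_pause)
    wait_for_children = max(children_remaining, default=0.0)
    return wait_for_children + after_unpause


history = [PastBuild(3600.0, 3000.0), PastBuild(1500.0, 1000.0)]
# Active runtimes are 600 and 500 seconds, so the average is 550.
# With 400 seconds spent before pausing, 150 seconds remain after unpause.
# The slowest child needs 300 more seconds, giving 450 in total.
print(estimate_remaining(history, 400.0, [300.0, 120.0]))  # 450.0
```

Averaging the raw totals instead (3600 and 1500 seconds) would give 2550 seconds regardless of how long the current set of child jobs takes, which is the kind of skew observed for opendev-buildset-registry.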
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 989464: Add ai-policy-wg@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/98946423:21
@jim:acmegating.comwe'll just make it say 99% complete like a windows copy dialog23:23
@fungicide:matrix.orgor make it sometimes go backwards23:24
@fungicide:matrix.orginfra-prod-service-zuul in hourly is taking quite a while. looking at service-zuul.yaml.log on bridge it's been idle for 20 minutes after pulling container images23:28
@fungicide:matrix.orgoh, maybe it's waiting on one of the servers to finish pulling and it hasn't reported back yet23:29
@fungicide:matrix.orgyeah, ze07 has been running `docker-compose pull` since 23:0723:32
@clarkb:matrix.orgthose pulls are from quay not docker hub (just for information)23:33
@fungicide:matrix.orgall other recent runs for that job have been between 4 and 8 minutes, so this is well into anomalous territory23:33
@clarkb:matrix.orgthe executor image is largeish but it should be pullable within a few minutes23:34
@clarkb:matrix.orgmaybe some sort of slowness between that particular node and quay due to network issues?23:34
@fungicide:matrix.orgyeah, seems likely something is stuck for ze07, the other servers returned quickly23:34
@fungicide:matrix.orgyeah, the job eventually timed out, i wonder what that's going to mean for subsequent runs if there's a stuck pull on ze07 (the process is still there on the server)23:38
@fungicide:matrix.orgmaybe it'll finally give up and terminate on its own23:38
@fungicide:matrix.orgokay, i'm really checking out for the evening now, but will try to remember to check on ze07 again in the morning and see if it's fixed itself or needs some manual intervention23:49

Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!