| @mnaser:matrix.org | In order to make more efficient use of donor infra.. does it make sense to set up reporting that we can look at from time to time? | 03:40 |
|---|---|---|
| @mnaser:matrix.org | From the start of this month, for example, Kolla has spent 1590 h, aka 66.25 days of compute, all on non-voting jobs that are constantly failing .. Cinder has 630 hours, Manila has 509 hrs, etc | 03:43 |
| -@gerrit:opendev.org- chandan kumar proposed: [openstack/project-config] 977194: Add snu-csl/nvmevirt to available repos https://review.opendev.org/c/openstack/project-config/+/977194 | 06:13 | |
| @fungicide:matrix.org | mnaser: Clark has put together ad hoc reports from time to time totalling the node-hours consumed by openstack sub-projects, though i think they were put together by scraping logs or doing low-level database queries, there wasn't a running tally kept up all the time | 11:56 |
| @fungicide:matrix.org | though i think it's been a few years, at the time the largest consumer (by far) was tripleo | 11:57 |
| @fungicide:matrix.org | mnaser: depending on how you're querying that now, you'll want to make sure you multiply the build elapsed time by the number of nodes (and now the node types matter too since we no longer use just one node size, so maybe we need to work out scaling measurements by node ram as well) | 11:58 |
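The weighting fungi describes could be sketched as follows: multiply each build's elapsed time by its node count, scaling each node by its RAM relative to a baseline flavor so differently sized labels stay comparable. The record shape and the 8 GB baseline are illustrative assumptions, not real Zuul API output.

```python
BASELINE_RAM_GB = 8  # assumed reference flavor size, not an OpenDev constant

def weighted_node_hours(builds):
    """Sum elapsed hours across builds, weighting each node by its RAM."""
    total = 0.0
    for build in builds:
        hours = build["duration_s"] / 3600.0
        for node in build["nodes"]:
            total += hours * (node["ram_gb"] / BASELINE_RAM_GB)
    return total

builds = [
    {"duration_s": 3600, "nodes": [{"ram_gb": 8}, {"ram_gb": 8}]},  # 2 node-hours
    {"duration_s": 1800, "nodes": [{"ram_gb": 16}]},                # 0.5 h at 2x RAM
]
print(weighted_node_hours(builds))  # 3.0
```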
| @fungicide:matrix.org | i'm going to use the zuul webui's "build image" button on https://zuul.opendev.org/t/opendev/image/ubuntu-noble to get an updated os-testr venv containing testtools 2.8.4 from about an hour ago | 12:59 |
| @fungicide:matrix.org | specifically for this commit in the new release: https://github.com/testing-cabal/testtools/commit/eca83db | 13:06 |
| @fungicide:matrix.org | doing ubuntu-jammy as well (apparently debian-trixie is fine, the problem that update addresses is on earlier python versions) | 13:34 |
| @fungicide:matrix.org | the last round of image builds for ubuntu-noble seems to have succeeded, but the uploads are all in "pending" state for the past 9+ hours... is there any way to tell from the webui why they didn't upload yet? | 13:40 |
| @fungicide:matrix.org | actually, it looks like previous images also waited around 19-20 hours between build completion and uploading, i guess they just operate on independent timers? | 13:42 |
| @harbott.osism.tech:regio.chat | or is this our "standard" backlog because uploads are so slow? can we maybe revert to older images still instead? likely not after the fresh rebuild | 14:05 |
| @fungicide:matrix.org | oh that could indeed be delay due to volume | 14:15 |
| @fungicide:matrix.org | but yeah, the ubuntu-noble images at least finished uploading shortly before their replacements were built (perhaps even while they were building) | 14:16 |
| @jim:acmegating.com | the launchers handle one image at a time, but upload to multiple endpoints simultaneously. | 14:20 |
| zl01 is uploading debian-trixie since 2026-02-19T13:46:59 | ||
| zl02 is uploading debian-trixie-arm64 since 2026-02-19T14:02:34 | ||
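The scheduling corvus describes might be modeled roughly like this: each launcher walks its image queue serially, but fans each image out to every cloud endpoint in parallel. The names (`upload_one`, `ENDPOINTS`) are invented for illustration and are not Zuul's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINTS = ["cloud-a", "cloud-b", "cloud-c"]  # hypothetical providers

def upload_one(image, endpoint):
    # Stand-in for the real upload; just record what happened.
    return f"{image}->{endpoint}"

def process_queue(images):
    results = []
    for image in images:  # one image at a time...
        with ThreadPoolExecutor() as pool:
            # ...but all endpoints for that image concurrently
            results.extend(pool.map(lambda e: upload_one(image, e), ENDPOINTS))
    return results

print(process_queue(["debian-trixie", "debian-trixie-arm64"]))
```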
| @jim:acmegating.com | if the previous image would be sufficient, then perhaps we should delete the most current image and roll back? that's our usual procedure, i think. | 14:23 |
| @jim:acmegating.com | looks like we have images from 02-17 and 02-18; i assume the 2-18 ones are the problem and we can roll back to 2-17? | 14:24 |
| @jim:acmegating.com | i think the procedure would be to delete the image build artifact, since we want all the uploads to go away (and we don't want zuul to try uploading them again). deleting the artifact should cause the launchers (after a short delay) to mark all the uploads for deletion. | 14:39 |
| unfortunately, i think because of the backlog, it won't actually delete them for a while, and due to an oversight, will continue to use them. i can make a live patch to the launchers to fix that though. equivalent to: https://review.opendev.org/c/zuul/zuul/+/977326 Don't use un-ready image uploads [NEW] | ||
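A minimal sketch of what "don't use un-ready image uploads" could mean: filter the candidate uploads to those already in "ready" state before choosing one to boot from. The dict shape is an assumption for illustration, not Zuul's actual data model.

```python
def ready_uploads(uploads):
    # Keep only uploads that have finished and are safe to boot from.
    return [u for u in uploads if u["state"] == "ready"]

uploads = [
    {"id": "a", "state": "ready"},
    {"id": "b", "state": "pending"},
    {"id": "c", "state": "uploading"},
]
print([u["id"] for u in ready_uploads(uploads)])  # ['a']
```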
| @jim:acmegating.com | i'll start working on the live patch; but i'm going to leave it to fungi or Jens Harbott to decide about deleting the image. | 14:39 |
| @jim:acmegating.com | fungi: that patch is in place, if you want to delete the image build artifact(s) then i think the rollback should work | 14:46 |
| @mnaser:matrix.org | Ah yes good point.. I scraped the API for jobs which were in check pipeline .. non voting .. failing and never passed in the same range | 14:51 |
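One way to reproduce mnaser's query offline: take build records (of the kind Zuul's builds API returns), keep check-pipeline non-voting jobs, and report jobs that never succeeded in the window. The records and field names here are fabricated for the sketch.

```python
from collections import defaultdict

def never_passing_nonvoting(builds):
    """Job names that ran non-voting in check and never had a SUCCESS."""
    results = defaultdict(set)
    for b in builds:
        if b["pipeline"] == "check" and not b["voting"]:
            results[b["job_name"]].add(b["result"])
    return sorted(job for job, res in results.items() if "SUCCESS" not in res)

builds = [
    {"job_name": "job-a", "pipeline": "check", "voting": False, "result": "FAILURE"},
    {"job_name": "job-a", "pipeline": "check", "voting": False, "result": "FAILURE"},
    {"job_name": "job-b", "pipeline": "check", "voting": False, "result": "SUCCESS"},
    {"job_name": "job-c", "pipeline": "gate",  "voting": True,  "result": "FAILURE"},
]
print(never_passing_nonvoting(builds))  # ['job-a']
```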
| @fungicide:matrix.org | corvus: in this case i don't know if the previous image would be helpful, unless it's old enough to pre-date testtools 2.8.3 | 14:55 |
| @jim:acmegating.com | i don't remember this happening yesterday | 14:56 |
| @jim:acmegating.com | 2.8.3 is Tue, 17 Feb 2026 15:17:00 GMT from https://pypi.org/rss/project/testtools/releases.xml | 14:57 |
| @fungicide:matrix.org | 15:17 utc on 2026-02-17 is when testtools 2.8.3 packages were uploaded to pypi, so images built and uploaded after that carry the breakage | 14:57 |
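The check corvus did against the PyPI RSS feed can be sketched with stdlib XML parsing. The feed snippet below is a trimmed, hand-written stand-in for https://pypi.org/rss/project/testtools/releases.xml, not live data.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the real PyPI releases feed.
FEED = """<rss version="2.0"><channel>
  <item><title>2.8.3</title><pubDate>Tue, 17 Feb 2026 15:17:00 GMT</pubDate></item>
</channel></rss>"""

def release_dates(feed_xml):
    """Map each release version in the feed to its publication date string."""
    root = ET.fromstring(feed_xml)
    return {item.findtext("title"): item.findtext("pubDate")
            for item in root.iter("item")}

print(release_dates(FEED)["2.8.3"])  # Tue, 17 Feb 2026 15:17:00 GMT
```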
| @fungicide:matrix.org | but if the upload is happening on nearly a day delay, then it wouldn't have started happening until yesterday | 14:58 |
| @jim:acmegating.com | https://zuul.opendev.org/t/opendev/build/0428a64a82ae403b9b3e4fd0966c51f3 Completed at 2026-02-17 03:56:12 | 14:58 |
| @fungicide:matrix.org | okay, so should work | 14:59 |
| @jim:acmegating.com | right -- today, the 19th we're using images built on the 18th with the problem. yesterday, the 18th, we were using images built on the 17th without the problem. thus i'm thinking that deleting the bad images from the 18th would leave us with good images from the 17th. | 14:59 |
| @jim:acmegating.com | fungi: for clarity, i'm under the impression you'll issue the image build artifact delete :) | 15:04 |
| @fungicide:matrix.org | corvus: i guess there are 3 builds for the different image formats? judging from https://zuul.opendev.org/t/opendev/image/ubuntu-noble | 15:06 |
| @fungicide:matrix.org | just confirming those are the 3 i'm deleting for ubuntu-noble | 15:07 |
| @fungicide:matrix.org | d375a2cc24f44a9783842874c6d4bf2c, e34a5225598a4c8cbca7914a2482179d and 8341ce5f1f5045d9b3204db8226c43fe from 2026-02-19T04:08:55 | 15:08 |
| @fungicide:matrix.org | i've sufficiently convinced myself those are the correct images to delete, so doing that now | 15:09 |
| @jim:acmegating.com | aren't those the artifacts for the pending uploads? | 15:10 |
| @fungicide:matrix.org | those are the old pending uploads from 10 hours ago, yeah | 15:10 |
| @fungicide:matrix.org | oh, i guess we also need to delete the three that are in use, more importantly | 15:10 |
| @jim:acmegating.com | to roll back, we need to delete the artifacts for the currently in-use images: f7a3c43cb4b34e689a3698e00f8460f3 e0f7d874b84c4f3aa028fe90717f75a0 04aec63688d3485cb6ff05c98cc605e3 | 15:10 |
| @fungicide:matrix.org | yep, done now, so i told it to delete all 6 (the three most recent in-use and the three older pending uploads) | 15:12 |
| @fungicide:matrix.org | the three most recent pending uploads were built on-demand after the fixed testtools made it onto pypi, so i left those untouched | 15:12 |
| @fungicide:matrix.org | not sure how long it takes for those to transition to deleting or disappear | 15:13 |
| @jim:acmegating.com | me neither -- i'm hoping that the launcher switches the uploads to 'deleting' soon; but if it doesn't we'll have to do that ourselves. let's give it a little bit. | 15:14 |
| @fungicide:matrix.org | okay they updated to deleting state | 15:16 |
| @fungicide:matrix.org | well, the builds updated to deleting anyway | 15:16 |
| @fungicide:matrix.org | the uploads are still ready and pending | 15:17 |
| @fungicide:matrix.org | or does deleting the image build artifacts not automatically cascade to deleting the uploads? | 15:18 |
| @jim:acmegating.com | i think it should, i'm checking on what could cause a delay there | 15:18 |
| @fungicide:matrix.org | i guess i can manually select "delete upload" on each of them | 15:18 |
| @fungicide:matrix.org | okay, i'll hold off | 15:19 |
| @jim:acmegating.com | fungi: okay, i think it's the same queue as the uploads. i think i want to adjust my patch to take the artifact state into account | 15:23 |
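The adjustment corvus describes, taking the artifact state into account, could look roughly like this: an upload is usable only if both its own state and its parent build artifact's state are "ready". Field names are assumptions based on the conversation, not Zuul's real data model.

```python
def usable_uploads(uploads, artifacts):
    """Return uploads safe to boot from: upload AND parent artifact ready."""
    ready_artifacts = {a["uuid"] for a in artifacts if a["state"] == "ready"}
    return [u for u in uploads
            if u["state"] == "ready" and u["artifact_uuid"] in ready_artifacts]

artifacts = [
    {"uuid": "art-1", "state": "ready"},
    {"uuid": "art-2", "state": "deleting"},  # a rolled-back build
]
uploads = [
    {"id": 1, "state": "ready",   "artifact_uuid": "art-1"},
    {"id": 2, "state": "ready",   "artifact_uuid": "art-2"},  # parent deleting
    {"id": 3, "state": "pending", "artifact_uuid": "art-1"},  # not yet ready
]
print([u["id"] for u in usable_uploads(uploads, artifacts)])  # [1]
```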
| @fungicide:matrix.org | so i should go ahead and ask it to delete the uploads individually as well? | 15:25 |
| @jim:acmegating.com | that shouldn't be necessary | 15:26 |
| @fungicide:matrix.org | oh, i guess the launcher won't boot nodes from those uploads now that the corresponding image builds are deleting? | 15:26 |
| @jim:acmegating.com | that will be the case once i patch the launchers (but it isn't right now) | 15:27 |
| @fungicide:matrix.org | okay, thanks | 15:28 |
| @jim:acmegating.com | all right, both launchers are patched, so nodes created after this point should not use those images | 15:30 |
| @fungicide:matrix.org | awesome! | 15:31 |
| @jim:acmegating.com | now we need to make a change to zuul. there are two ways we could do this: | 15:32 |
| 1) make zuul behave the way that i just patched it: if you delete the artifact, then it won't use the uploads for that artifact. but the uploads still show in "ready" state. | ||
| 2) change the image delete api call so that it also marks all the uploads for deletion when it marks the artifact for deletion | ||
| @jim:acmegating.com | right now, i'm thinking i don't love the idea that the web ui says the artifact is deleting but the upload is ready, and as a human, we have to look at both of those to determine the state. so i'm kind of leaning toward #2 as the long-term fix. | 15:32 |
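Option 2 from the list above, sketched: the artifact-delete call also marks every child upload for deletion, so the UI never shows a "deleting" artifact with "ready" uploads. The structures are invented for illustration.

```python
def delete_artifact(artifact, uploads):
    """Mark an artifact and all of its uploads for deletion in one step."""
    artifact["state"] = "deleting"
    for u in uploads:
        if u["artifact_uuid"] == artifact["uuid"]:
            u["state"] = "deleting"

artifact = {"uuid": "art-1", "state": "ready"}
uploads = [
    {"artifact_uuid": "art-1", "state": "ready"},
    {"artifact_uuid": "art-2", "state": "ready"},  # other artifact, untouched
]
delete_artifact(artifact, uploads)
print([u["state"] for u in uploads])  # ['deleting', 'ready']
```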
| @fungicide:matrix.org | once upon a time we had talked about only preserving images on disk for one image format, as a space-saving measure. does that affect build deletion? | 15:33 |
| @fungicide:matrix.org | like, would that require being able to boot from ready uploads for builds that were deleted, or is the backing file on disk independent from whether the "build" is deleted? | 15:34 |
| @jim:acmegating.com | now that the actual storage location is in swift, we don't do that any more, so they're all available in the cloud as long as the corresponding artifact is there. but we could delete only the qcow2 artifact, for example, and it won't affect the others. they are independent. | 15:34 |
| @fungicide:matrix.org | oh right, there is no longer a need to keep any copies locally | 15:35 |
| @fungicide:matrix.org | so anyway, i agree #2 makes the most sense to me | 15:36 |
| @jim:acmegating.com | ok, i'll work on a change to do that. | 15:37 |
| @jim:acmegating.com | fungi: https://review.opendev.org/977326 and https://review.opendev.org/977333 should get zuul to our desired behavior. | 15:47 |
| @fungicide:matrix.org | yep, thanks again! | 16:08 |
| @fungicide:matrix.org | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py310&skip=0 indicates that the image rollback seems to have worked | 16:13 |
| @priteau:matrix.org | Thanks fungi, I am rechecking our blazar-nova change, it seems that we were the last POST_FAILURE | 16:18 |
| @priteau:matrix.org | success 😄 | 16:21 |
| @fungicide:matrix.org | excellent | 16:28 |
| -@gerrit:opendev.org- Zuul merged on behalf of Dr. Jens Harbott: [openstack/diskimage-builder] 976345: Add tox-py313 job https://review.opendev.org/c/openstack/diskimage-builder/+/976345 | 18:20 | |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [openstack/project-config] 977380: Replace 2026.1/Gazpacho key with 2026.2/Hibiscus https://review.opendev.org/c/openstack/project-config/+/977380 | 22:33 | |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!