*** ykarel_ is now known as ykarel | 08:53 | |
frickler | oh, adding labels for trixie likely would also be helpful ... :-D | 08:58 |
---|---|---|
opendevreview | Dr. Jens Harbott proposed opendev/zuul-providers master: Add labels for debian-trixie https://review.opendev.org/c/opendev/zuul-providers/+/954701 | 09:05 |
opendevreview | Dr. Jens Harbott proposed opendev/zuul-providers master: Add labels for debian-trixie https://review.opendev.org/c/opendev/zuul-providers/+/954701 | 09:18 |
frickler | corvus: ^^ the config error for PS1 was ... not too helpful, not sure if that could be improved? | 09:19 |
frickler | I'm going to self-approve the update now in order to get some testing done, feel free to revert/amend later. like I only added new style labels now, assuming that we've passed the testing phase. one thing to discuss might be whether we want to keep some default labels without RAM spec to keep nodeset definitions simpler | 09:21 |
opendevreview | Merged opendev/zuul-providers master: Add labels for debian-trixie https://review.opendev.org/c/opendev/zuul-providers/+/954701 | 09:22 |
opendevreview | Michal Nasiadka proposed opendev/system-config master: docker-mirror: Add Ubuntu 24.04 and Debian Bookworm/Trixie mirrors https://review.opendev.org/c/opendev/system-config/+/954703 | 09:23 |
frickler | it lives \o/ https://zuul.opendev.org/t/openstack/build/4feda999fae44f22bc54175a0da0a8f6 ... and it fails quite fast, holding a node now for checking (and testing autohold with niz ;) | 09:36 |
*** clarkb is now known as Guest21734 | 10:51 | |
opendevreview | Dr. Jens Harbott proposed opendev/zuul-providers master: Fix unbound installation for trixie https://review.opendev.org/c/opendev/zuul-providers/+/954716 | 11:22 |
frickler | ^^ that's the fix for unbound, do we still want to keep the project-config/nodepool version in sync? | 11:23 |
*** dmellado6 is now known as dmellado | 13:02 | |
fungi | frickler: i'm inclined to say we don't care about fixing that for nodepool-built images, since we're not building trixie with nodepool and planning to turn off the nodepool services at any moment | 13:18 |
frickler | and still waiting for CI results :( | 13:31 |
fungi | frickler: no longer! | 13:37 |
fungi | all green | 13:38 |
fungi | looks like the bionic and focal builds worked, so i guess p.u.c just prunes data about earlier lts versions | 13:39 |
frickler | yes, I checked that when creating the patch, commented on the review. there is still a small chance that this change might break unbound for those, do we have a way to emergency-delete images with zuul-launcher? or would we have to wait for a revert to merge and get promoted in the worst case? I'm a bit reluctant to simply self-approve because of this | 13:44 |
corvus | frickler: i think we should only have the ram-suffixed labels, and not the ones without it (going forward). one reason: zuul thinks they are different enough that a ready node for one can't be used for another. most users will just use pre-defined nodesets, and their names can continue to be simple (ie, just "debian-trixie") | 13:57 |
corvus | frickler: the api and the web ui can both be used to delete uploads or builds. you must use the "opendev" tenant (the tenant where the images are built) | 13:58 |
corvus | i agree, the error messages need some work :) | 14:02 |
corvus | for the first time, the node graphs for last night's periodic jobs are the shape i've been looking for: https://imgur.com/a/F1KjyeR | 14:10 |
corvus | (when we're at quota, we want more requests in the "requested" state and fewer nodes in the "requested" state) | 14:12 |
corvus | clarkb: would you mind a re-review on https://review.opendev.org/931824 ? i switched the test fixture and made the validation optional. | 14:28 |
Guest21734 | corvus: done. Though I seem to have been guestified. I'll work on fixing that next | 14:45 |
Guest21734 | frickler: fungi I went ahead and approved the unbound fixup too. I left a comment with why I believe this is safe | 14:48 |
fungi | thanks Guest21734! | 14:51 |
fungi | ;) | 14:51 |
*** Guest21734 is now known as clarkb | 14:53 | |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop requirements branch override for translations https://review.opendev.org/c/openstack/project-config/+/954747 | 14:58 |
clarkb | fungi: do you have a quick moment to re-review https://review.opendev.org/c/opendev/system-config/+/954624 now with scrolling on the grafana pages so all graphs render? | 14:59 |
clarkb | and then should we go ahead and land the specs cleanup and fixup changes? | 14:59 |
clarkb | my main concern is less with the fixup and more with my application of cleanups which may be biased. But I think it's easy to undo that sort of documentation change if we wish | 15:00 |
fungi | yeah | 15:03 |
fungi | all lgtm now | 15:03 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop requirements branch override for translations https://review.opendev.org/c/openstack/project-config/+/954747 | 15:08 |
clarkb | fungi: thinking about https://review.opendev.org/c/zuul/zuul-jobs/+/954280 more. Maybe the easiest least impactful choice is to drop backports from our debian image builds? | 15:25 |
clarkb | that isn't my personal preference but I think going that route avoids any potential conflict with other people running debian images with the configure mirrors role | 15:26 |
clarkb | that might require updates to dib though | 15:30 |
clarkb | which makes me wonder if anyone would be building images this way anyway | 15:30 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add 32GB labels https://review.opendev.org/c/opendev/zuul-providers/+/954749 | 15:37 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add 32GB labels and flavors https://review.opendev.org/c/opendev/zuul-providers/+/954749 | 15:40 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add 32GB labels to vexxhost https://review.opendev.org/c/opendev/zuul-providers/+/954752 | 15:40 |
opendevreview | Merged opendev/system-config master: Scroll grafana pages to force all graphs to load https://review.opendev.org/c/opendev/system-config/+/954624 | 15:40 |
opendevreview | Merged opendev/zuul-providers master: Add 32GB labels and flavors https://review.opendev.org/c/opendev/zuul-providers/+/954749 | 15:46 |
fungi | clarkb: yeah, i'm not sure. as long as we don't need any backported packages during image building then they'll still get enabled by default by that role at job runtime so shouldn't result in regressions for anyone | 15:52 |
fungi | so i agree that's probably the least impactful, as it doesn't require changing anything in zuul-jobs | 15:52 |
clarkb | ya but I'm 99% sure it requires changes to dib. | 15:52 |
clarkb | I don't think dib is actually using backports to install any packages, but it is configuring backports as a repo | 15:53 |
fungi | it's controlled by a variable we set | 15:53 |
fungi | if you look at the dib elements for ubuntu-minimal and debian-minimal there's a list of source suites that are passed in | 15:54 |
fungi | we could *probably* do it in our dib configuration without even altering the dib elements themselves | 15:54 |
clarkb | right we could override the entire DIB_APT_SOURCES_CONF_DEFAULT | 15:54 |
fungi | but also changing the defaults in dib is a possibility | 15:55 |
clarkb | actually maybe not, that doesn't seem to accept a different value. Where is it used | 15:55 |
clarkb | DIB_APT_SOURCES_CONF this is the var to override | 15:55 |
fungi | there were two vars, i think, and it's the other one you want | 15:55 |
fungi | ah, yeah that | 15:55 |
clarkb | ya the _DEFAULT is the default value for the one without _DEFAULT | 15:56 |
clarkb | my concern with this approach is it seems less correct from a build a debian image perspective | 15:56 |
clarkb | debian upstream cloud images include backports, dib image builds include backports. backports are only used if explicitly requested for a package | 15:56 |
clarkb | the accepted practice seems to be that you should configure backports | 15:56 |
fungi | well, about that. after going back over the discussions from 2015/2016 i think that got undone | 15:57 |
fungi | so my recollection is outdated | 15:57 |
fungi | there was a time when it was enabled because they needed newer versions of cloud-init in the images | 15:58 |
clarkb | in upstream cloud images you mean? | 15:58 |
fungi | yeah, but now i think the cloud team policy is to just keep updating cloud-init rather than trying to keep it stable in stable debian versions | 15:58 |
fungi | i don't use the stable cloud images myself, always testing/unstable, so i hadn't noticed they weren't adding backports in the stable cloud images | 15:59 |
clarkb | I see | 15:59 |
clarkb | given that I kinda think that changing dib itself to drop backports to match is maybe better than us overriding that default var list | 15:59 |
fungi | so not including backports in our images these days would probably be more consistent with official debian cloud images | 15:59 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add 32GB flavor and labels to vexxhost https://review.opendev.org/c/opendev/zuul-providers/+/954752 | 16:00 |
clarkb | I'll work on a change | 16:01 |
opendevreview | Merged opendev/zuul-providers master: Add 32GB flavor and labels to vexxhost https://review.opendev.org/c/opendev/zuul-providers/+/954752 | 16:01 |
corvus | frickler: hrm, it looks like we may have been intending to have folks switch to -ram suffixed nodesets too, so... maybe strike my comment from earlier about that. but, i think we could consider keeping the non-ram-suffixed nodesets if we like that idea. i don't think it would cause a problem. | 16:02 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move ubuntu-bionic/focal nodeset definition https://review.opendev.org/c/opendev/zuul-providers/+/954756 | 16:04 |
corvus | i think the only remaining nodepool labels are now: 'ubuntu-bionic-arm64' and 'ubuntu-focal-arm64' | 16:04 |
corvus | we apparently did not see fit to add those images to zuul-launcher. but there have been a couple of requests for them. | 16:05 |
corvus | publish-wheel-cache-ubuntu-focal-arm64 publish-wheel-cache-ubuntu-bionic-arm64 requested them | 16:06 |
corvus | i think the node requests are coming from inside the house | 16:06 |
fungi | i think we could probably just ditch those jobs | 16:07 |
fungi | at least that would be my first preference | 16:07 |
fungi | i don't believe they're providing anything useful if there are no other focal or bionic jobs running to take advantage of what they're producing | 16:07 |
clarkb | ++ lets drop those jobs | 16:08 |
opendevreview | James E. Blair proposed openstack/project-config master: Remove bionic/focal arm64 wheel jobs https://review.opendev.org/c/openstack/project-config/+/954758 | 16:09 |
corvus | tonyb: can we delete this autohold? https://zuul.opendev.org/t/openstack/autohold/0000000208 | 16:12 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Remove nodepool-labels file https://review.opendev.org/c/opendev/zuul-providers/+/954759 | 16:14 |
corvus | ^ when that merges, we can shut down nodepool. | 16:15 |
fungi | are the centos-9-stream config errors there expected? | 16:17 |
corvus | oops deleted wrong file hah :) | 16:18 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Remove nodepool-nodesets file https://review.opendev.org/c/opendev/zuul-providers/+/954759 | 16:19 |
opendevreview | Merged openstack/project-config master: Remove bionic/focal arm64 wheel jobs https://review.opendev.org/c/openstack/project-config/+/954758 | 16:25 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Drop backports from debian-minimal by default https://review.opendev.org/c/openstack/diskimage-builder/+/954760 | 16:25 |
clarkb | fungi: ^ something like that maybe. | 16:25 |
corvus | remote: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954761 Remove bionic- and focal- arm64 jobs [NEW] | 16:28 |
corvus | oh that repo is not in good shape | 16:28 |
corvus | apparently there are jobs in that repo that refer to an "ubuntu-xenial" nodeset which is not defined | 16:30 |
clarkb | I wonder if it is sufficient to remove those two ubuntu-xenial jobs along with the old arm64 nodeset using jobs | 16:31 |
corvus | yeah... i think what we did was we stopped loading nodesets from opendev/base-jobs, instead loading them from zuul-providers | 16:31 |
corvus | and there is a xenial defined in base-jobs, but not zuul-providers | 16:31 |
corvus | i think we decided not to copy it over because it was unused | 16:32 |
corvus | i think if we still believe that, then, yeah, let's try yanking those jobs and see if it really is unused | 16:32 |
clarkb | Hopefully all the stable branches with references to py35 jobs have been deleted and nothing will complain | 16:32 |
clarkb | I've gone ahead and self approved https://review.opendev.org/c/opendev/infra-specs/+/954670 to make infra specs buildable again | 16:34 |
clarkb | that should only result in cosmetic changes | 16:34 |
clarkb | The followup makes some judgement calls in https://review.opendev.org/c/opendev/infra-specs/+/954662 and I'll probably do the same in an hour or two if there is no additional feedback on that | 16:34 |
clarkb | I'd like to get that sorted out before I start on the matrix for comms spec as I don't want to mix spec writing and spec management in the same stack | 16:35 |
clarkb | arg looks like things refer to py35 still | 16:35 |
clarkb | but maybe we can do a quick update to project-config to remove all those and be good? | 16:36 |
corvus | what about py38-arm64? | 16:38 |
clarkb | https://codesearch.opendev.org/?q=py38-arm64&i=nope&literal=nope&files=&excludeFiles=&repos= this seems to indicate that swift may have a usage but not through the jobs defined in openstack-zuul-jobs? | 16:39 |
clarkb | same nodeset, different job tree | 16:40 |
corvus | i'm starting to think this exceeds my current openstack expertise.... i think i may need to turn this work over to someone else. | 16:40 |
clarkb | fundamentally the problem here is that zuul doesn't really allow you to ignore tech debt in the ci job + pipeline configuration. Which is great if you're willing to tend to the garden but openstack has struggled with that | 16:41 |
clarkb | thinking out loud here: I wonder if an escape hatch is to define the nodeset but make it empty? | 16:41 |
clarkb | then we don't need to build and boot and manage the images. The jobs are still configured with what they think is a valid nodeset and then the jobs themselves will effectively become less efficient noops? | 16:42 |
opendevreview | Merged opendev/infra-specs master: Make infra specs buildable again https://review.opendev.org/c/opendev/infra-specs/+/954670 | 16:42 |
clarkb | https://docs.opendev.org/opendev/infra-specs/latest/ has updated due to ^ | 16:45 |
corvus | clarkb: that's fine with me. how about we define that nodeset only in the openstack tenant? | 16:46 |
clarkb | corvus: ya I think that makes sense | 16:46 |
clarkb | corvus: taking that idea a step further I wonder if something like a noop job builtin but for nodeset labels along the lines of INVALID_LABEL might be a way to force the jobs to fail at runtime but validate configuration? | 16:48 |
clarkb | that is probably a lot of extra logic to encode into zuul for something that is already solvable if addressed properly | 16:48 |
clarkb | (whereas noop solves a fundamental issue of trying to make a noop job use as few resources as possible) | 16:48 |
corvus | well, we can still request a label that doesn't exist, which will NODE_ERROR; that's what will happen today (and what i'm about to propose for openstack-zuul-jobs) | 16:49 |
corvus | but if we add labels to config validation like we discussed, then that won't work, and maybe we should think about what you just suggested | 16:49 |
clarkb | oh I thought we would do cross validation of the labels. But you're right we couldn't do that with nodepool before and must not with zuul-launcher at least not yet | 16:49 |
clarkb | ya | 16:49 |
corvus | yeah, we won't do that at least until the nodepool deprecation period ends... and it hasn't started yet :) | 16:50 |
clarkb | corvus: and then we can still remove the specific jobs you identified to build wheels as we don't need them anymore | 16:50 |
clarkb | corvus: in that latest update you used the ubuntu-xenial name, wouldn't we fallback to nodepool for that? | 16:52 |
clarkb | based on my codesearch search earlier I'm hopeful that cleaning out the project-template definitions that use the removed jobs in openstack-zuul-jobs will make that mergeable | 16:53 |
corvus | oh, ha... you know what, we are missing a switch in zuul to turn off nodepool fallback :) | 16:56 |
fungi | sorry, stepped away for a few and just catching up, but i agree the best solution is one which forces node_error results on jobs and doesn't block us from removing deprecated configuraiton | 16:56 |
fungi | configuration | 16:56 |
fungi | allowing projects to perform config cleanup (or not) on their schedule without impacting ours | 16:57 |
corvus | i still don't understand how to unwind https://review.opendev.org/954761 | 16:57 |
clarkb | corvus: can we update the project-templates in the same change? | 16:58 |
corvus | sure! | 16:58 |
corvus | i'm just saying, openstack's job cornfiguration is outside of my area of expertise | 16:59 |
clarkb | ya I think the problem is it's outside of anyone's at this point | 16:59 |
clarkb | so doing the minimal we can get away with makes sense to me | 16:59 |
fungi | i think move stable/2024.1 to the branches list in openstack-tox-py39-arm64 instead | 16:59 |
fungi | gmaan: ^ ? | 17:00 |
corvus | i mean, strictly speaking, this is not blocking niz. we can merge https://review.opendev.org/954759 and all it will do is introduce more errors into the openstack tenant, but for things that are presumably already broken or disused | 17:01 |
corvus | i'm going to bow out of this and leave it to others with more openstack expertise | 17:02 |
corvus | okay, one more thought: maybe the easiest thing is to add a dummy nodeset for xenial, and both arm labels? | 17:03 |
clarkb | none of these arm jobs were ever voting. Let me push up a change that drops them from the project-templates and if that goes green then we can proceed with that. If not we can use the dummy nodeset instead | 17:04 |
gmaan | fungi: but we do not test it on stable/2024.1 right? https://github.com/openstack/openstack-zuul-jobs/blob/master/zuul.d/project-templates.yaml#L1206 | 17:04 |
gmaan | yeah, I am not sure anyone is working to make it voting or not, many times I thought of removing those | 17:05 |
fungi | gmaan: thanks, yeah mainly for now we just need a solution that lets us drop ubuntu-focal-arm64 nodes | 17:06 |
fungi | if removing that non-voting testing entirely makes sense, fine by me | 17:06 |
gmaan | yeah, I am not sure why we are keeping these non voting on stable. keeping it on master makes sense if anyone comes up and fixes it but if it went non voting to stable we should remove | 17:08 |
clarkb | gmaan: because no one is doing the gardening | 17:08 |
clarkb | I don't think there is any intention behind it but keeping things pruned and tended to requires effort that no one is doing | 17:09 |
fungi | weeding the garden especially | 17:09 |
gmaan | let me propose the change and raise on ML if no objection we can go that way next week or so | 17:10 |
clarkb | gmaan: I don't think we need to wait that long | 17:11 |
fungi | gmaan: in the near term being able to at least drop the openstack-tox-py39-arm64 somewhat immediately would help to not block our work on removing nodepool | 17:11 |
clarkb | like I think if this works we can land it right now | 17:11 |
fungi | er, openstack-tox-py38-arm64 i mean | 17:12 |
gmaan | sure, that works fine for me. seeing no one interested in those for so many cycles, it is ok for me | 17:12 |
fungi | the one that needs focal nodes, specifically | 17:12 |
gmaan | whenever I prepare the new cycle template I ask these question to myself that why we have these non voting jobs for so long | 17:13 |
fungi | the only branch even running those is due to become unmaintained in <3 months | 17:13 |
clarkb | corvus: I restored your change and updated it with my proposal above. But zuul is still complaining about the project-template definition even though it is updated in the same change. Is this something zuul will force us to do in multiple steps? | 17:13 |
gmaan | fungi: yeah that one but I will just say not to continue non voting things on stable gate | 17:13 |
clarkb | gmaan: I have asked the release team to make cleaning up this stuff part of the new branch creation process rather than the branch deletion process so that we get ahead of it but I think the main issue is no one is really around to clean this stuff up | 17:13 |
gmaan | yeah | 17:14 |
clarkb | ok I think the stack that begins at https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954763 may resolve this for us cc corvus | 17:22 |
clarkb | seems I needed to start with the fundamental config issue first then work my way up | 17:22 |
clarkb | though it looks like the last change in the stack is still failing because of neutron | 17:23 |
corvus | fungi: do you mind reviewing https://review.opendev.org/954756 (you reviewed its child). it's pretty pro-forma | 17:28 |
opendevreview | Merged opendev/zuul-providers master: Move ubuntu-bionic/focal nodeset definition https://review.opendev.org/c/opendev/zuul-providers/+/954756 | 17:30 |
fungi | yep, approved, that was straightforward | 17:30 |
corvus | clarkb: actually, i think we can start doing label validation early -- we can tie it to the "i have no nodepool" tenant switch, which we need in order to remove nodepool anyway. | 17:31 |
clarkb | corvus: that would break the hack to make a valid xenial nodeset then? | 17:32 |
clarkb | corvus: digging into ubuntu-focal-arm64 more it does seem to be used in a few more places: https://codesearch.opendev.org/?q=ubuntu-focal-arm64&i=nope&literal=nope&files=&excludeFiles=&repos= | 17:32 |
clarkb | however it's still fairly minimal | 17:33 |
corvus | re xenial: yep! | 17:33 |
clarkb | should we not do that then? | 17:33 |
clarkb | I feel like this is a nice hack to separate the image management side of things from the job configuration side of things without forcing us to force merge changes or become involved in many independent job configurations in various projects | 17:34 |
corvus | i mean, it's a valid configuration error.. openstack has 285 at this point. | 17:34 |
corvus | i don't think it's actually a problem | 17:34 |
clarkb | I guess we'd have to force merge one change that adds the invalid nodeset label | 17:35 |
corvus | i believe the only thing we're doing here is volunteering to weed openstack's garden | 17:35 |
corvus | i don't think we need to force merge anything today, or even if we had label validation in place | 17:35 |
corvus | the only time zuul is going to stop us is if it breaks the opendev tenant | 17:36 |
clarkb | right I guess the force merge is in the openstack config | 17:36 |
corvus | i mean, it's possible we get annoyed by excessive non-blocking errors from other tenants... | 17:36 |
clarkb | corvus: to understand your next step you want to remove these two nodesets right: https://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/nodesets.yaml#L44-L54 ? | 17:37 |
corvus | oh yes, if someone wants to merge anything to openstack-zuul-jobs then they would need to fix the network of errors involving that repo | 17:37 |
clarkb | I started looking at dummy nodeset options for those two arm nodesets and found we're still defining them unlike the xenial nodeset so want to make sure I understand what is going on there | 17:38 |
corvus | hrm, i guess so | 17:38 |
corvus | i didn't realize those were there too | 17:38 |
corvus | but yes, if we have decided that opendev doesn't provide those labels, then we should remove that too | 17:39 |
clarkb | we don't currently build those images right? | 17:39 |
clarkb | (just making sure I understand what prompted all of this) | 17:40 |
corvus | wait, those nodeset defs don't matter | 17:40 |
corvus | those are unused now, we can remove the whole file | 17:40 |
clarkb | corvus: they are used | 17:40 |
corvus | no, we exclude that file from config loading | 17:40 |
clarkb | hrm when I updated openstack-zuul-jobs it complained about that but maybe I'm getting my wires crossed and becoming confused | 17:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop bionic and focal arm64 testing https://review.opendev.org/c/opendev/system-config/+/954765 | 17:41 |
corvus | these are the real definition: https://review.opendev.org/954759 and yes i want to remove them | 17:41 |
clarkb | aha that is the piece of info I was missing thanks | 17:42 |
corvus | so on that change (759) zuul is saying "we can merge this, but, btw, this will add some config errors to the openstack tenant" and we are being polite and trying to avoid adding those errors | 17:42 |
clarkb | corvus: so I think 954765 is something we need to do on our (opendev) end. Then https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954763/ and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954761/ are incomplete cleanups on the openstack side that some of us may or may not help shuffle along | 17:43 |
corvus | to progress 761 you will need to make a change to neutron's stable branch, right? | 17:45 |
clarkb | corvus: ya to all of the stable branches too | 17:45 |
clarkb | which is why I was thinking I might go ahead with the dummy nodeset again but now I'm less sure of that | 17:45 |
corvus | are you thinking there might be legit testing using it? should we add focal/bionic arm64? | 17:46 |
clarkb | corvus: based on what I've seen so far I think swift and liberasurecode have valid arm64 focal based test jobs | 17:48 |
clarkb | corvus: everything else seems to be non voting as part of the early days of dipping toes into the arm64 waters | 17:48 |
corvus | okay, so it seems like our assumption that "no one is using arm64 bionic/focal" may have been wrong and we should either consider adding those images, or deciding that now is the cut-off time for that | 17:49 |
clarkb | which IMO was valid while those branches were tip and now with stable branch policies are no longer valid | 17:49 |
clarkb | idea: openstack release process should drop all these jobs that were maybe informative but not gating and providing stability assurances when branches become stable | 17:50 |
clarkb | corvus: I think for xenial we should be drawing the line in the sand at this point for sure | 17:51 |
clarkb | which we have already done | 17:51 |
fungi | yeah, i can see pushing for a policy that stable branches have no long-term nonvoting jobs | 17:51 |
clarkb | I think I'm leaning a bit towards being ok with building a focal arm64 job | 17:51 |
clarkb | I haven't found evidence of bionic arm64 jobs that make sense to me | 17:52 |
corvus | i'll start on a change to add a focal image build | 17:52 |
clarkb | and I'll update the other changes we've been pushing up to drop bionic arm64 but not focal arm64 | 17:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop bionic arm64 testing https://review.opendev.org/c/opendev/system-config/+/954765 | 17:55 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add ubuntu-focal-arm64 images and labels https://review.opendev.org/c/opendev/zuul-providers/+/954768 | 18:02 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add ubuntu-focal-arm64 image builds https://review.opendev.org/c/opendev/zuul-providers/+/954769 | 18:02 |
opendevreview | Merged opendev/zuul-providers master: Add ubuntu-focal-arm64 images and labels https://review.opendev.org/c/opendev/zuul-providers/+/954768 | 18:05 |
fungi | corvus: did 954769 get mixed up with a change for debian-trixie? | 18:06 |
corvus | no, that was a reorg because trixie was out of order... | 18:07 |
fungi | i see you commented in the change too | 18:07 |
clarkb | I'm slowly backporting the cleanup in openstack/requirements for the wheel cache build job then once I've got all of those pushed I'll update the openstack-zuul-job cleanup change | 18:07 |
corvus | okay, that change is syntactically correct.... | 18:07 |
fungi | cool | 18:08 |
corvus | clarkb: fungi due to the extreme load on the arm nodes, i think the best thing we could do would be to review and approve https://review.opendev.org/954769 now and send it straight to gate. even with that, it's going to take a long time to merge. | 18:08 |
fungi | and yeah i misread the diff, confused by the trixie definition getting relocated which made it look like that was the new addition | 18:09 |
corvus | yeah that diff did not end up great, and i should have mentioned it in the commit msg | 18:09 |
fungi | no worries, looks right to me | 18:09 |
corvus | but if you look at the resultant file, it's really easy to see the changes from each image to the next as we progress through bionic, focal, focal-arm64, jammy, ... | 18:10 |
fungi | i went ahead and approved to save testing tie | 18:10 |
fungi | time | 18:10 |
corvus | thanks, we have hours if anyone else wants to review | 18:10 |
clarkb | c +2 from me | 18:10 |
clarkb | there is an unmaintained/2023.1 but no unmaintained/2023.2 | 18:11 |
corvus | this might actually be a good real-world test of relative priority... it should jump the line and beat out the next openstack job | 18:11 |
clarkb | I know this makes sense to some but not to me | 18:11 |
fungi | clarkb: non-slurp branches don't transition to unmaintained, since upgrading between slurp branches is tested to work | 18:12 |
fungi | they just go eol immediately once stable maintenance ends | 18:12 |
fungi | that was one of the compromises made to keep the branch count down | 18:13 |
clarkb | fungi: I'll be honest this feels like the opposite of keeping the branch count down. There are 10 branches that need this commit made to it | 18:13 |
clarkb | and about 60% of them are not trivial backports (they merge conflict in a mostly trivial way at least) | 18:14 |
fungi | pre-slurp unmaintained branches still need to get explicitly eol'd and that hasn't happened yet | 18:14 |
fungi | but breaking them is fine, let the unmaintainers sort out any resulting mess | 18:14 |
fungi | maybe elodilles wants to help with ^ | 18:15 |
clarkb | I'm almost done at this point | 18:15 |
corvus | (relative priority did work!) | 18:15 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Remove nodepool-nodesets file https://review.opendev.org/c/opendev/zuul-providers/+/954759 | 18:18 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move ubuntu-focal-arm64 nodeset https://review.opendev.org/c/opendev/zuul-providers/+/954778 | 18:18 |
corvus | do we want to add back any of the focal-arm64 stuff we removed from openstack/project-config? like the wheel build jobs? | 18:19 |
clarkb | corvus: yes I'm working on that now | 18:19 |
corvus | oh ok. | 18:19 |
clarkb | I have to get my 10 depends on in a row first :) | 18:22 |
fungi | i don't think we need the wheel build jobs, to be honest | 18:23 |
clarkb | fungi: ya I figure we can cleanup focal stuff a bit less urgently though | 18:24 |
clarkb | I think what this should be a signal for to openstack is that xenial, bionic, and focal things should start to be pruned | 18:24 |
clarkb | xenial and some cases of bionic are getting more upfront forced cleanups on the opendev side | 18:24 |
clarkb | but anything that isn't forcefully removed in that list should still be cleaned up | 18:25 |
fungi | openstack's stable constraints lists are frozen, and the last stable branch using focal (it shouldn't have been) reaches end of maintenance in a few months | 18:25 |
fungi | https://governance.openstack.org/tc/reference/runtimes/2024.1.html says it only required testing on jammy | 18:26 |
clarkb | https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954761 has been updated with requirements depends on and I leave focal for now | 18:26 |
fungi | maybe the python 3.8 jobs were being kept for other reasons and needed focal to supply an old enough interpreter, i really dunno | 18:27 |
clarkb | I think that concludes my hacking on openstack-zuul-jobs. gmaan I think you can apply your cleanups on top now | 18:28 |
gmaan | k | 18:28 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/954765 should be a quick review and any reason to not approve https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954763 now? | 18:30 |
fungi | approved | 18:31 |
gmaan | clarkb: you want to rebase this on top of latest parent https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954762 | 18:33 |
clarkb | gmaan: I can, I stopped updating that one since it seemed to (logically at least) conflict with the change you pushed | 18:33 |
clarkb | gmaan: do you want me to rebase that change or do you want to take over and merge it into your change? | 18:34 |
gmaan | clarkb: I can merge | 18:34 |
clarkb | ack thanks | 18:34 |
frickler | do we know something about the swift upload errors on https://review.opendev.org/c/opendev/zuul-providers/+/954716 ? are we maybe hitting quota limits or might these need longer timeouts? | 18:38 |
opendevreview | Merged opendev/system-config master: Drop bionic arm64 testing https://review.opendev.org/c/opendev/system-config/+/954765 | 18:39 |
clarkb | frickler: that sort of error isn't something I would expect client side timeouts to help with | 18:39 |
clarkb | I read that as the client got an EOF in violation of the protocol. Possibly because the server side closed the connection? | 18:40 |
corvus | it's worth noting that requests doesn't have a retry policy by default. i don't think that openstacksdk has a setting for that. but we might be able to create a retry object and configure the keystoneauth session to use that | 18:44 |
corvus | that feels a little hacky though? | 18:44 |
corvus | but we might need something like that; it may be asking too much to expect to push that much data over http without retrying on error | 18:46 |
corvus | (i'm guessing that's a result of something like a load balancer shift or similar) | 18:46 |
clarkb | ya that makes sense to me | 18:47 |
clarkb | ok last call on https://review.opendev.org/c/opendev/infra-specs/+/954662 for anyone to object with my continued cleanup in that repo | 19:02 |
clarkb | I'd like to start on the matrix spec after lunch so intend on approving that soon | 19:02 |
elodilles | fungi: ACK, i'll keep an eye on those patches | 19:37 |
gmaan | clarkb: this is ready, I pinged neutron core to review neutron backports https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/954781 | 19:49 |
gmaan | elodilles: ^^ as you are here, can you please review unmaintained branches backports https://review.opendev.org/q/Ieb1b116c4d0866bd8208f9b8b440c1e274c82b1c | 19:50 |
clarkb | gmaan: thanks I approved it which means anyone should be able to recheck once the depends on are merged | 19:58 |
gmaan | ++ | 19:58 |
opendevreview | Merged opendev/infra-specs master: Update existing specs to match the current reality https://review.opendev.org/c/opendev/infra-specs/+/954662 | 20:02 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add backoff handling to swift upload https://review.opendev.org/c/opendev/zuul-providers/+/954804 | 20:10 |
corvus | that is not self-testing, but i think we could merge it and just revert if it fails. | 20:11 |
corvus | (i don't think it's worth bothering to make that self-testing since we're trying to switch to the zuul-jobs version of that role anyway) | 20:12 |
Clark[m] | corvus: I stepped away from the computer but that change and approach seems reasonable to me | 20:23 |
opendevreview | Merged opendev/zuul-providers master: Add backoff handling to swift upload https://review.opendev.org/c/opendev/zuul-providers/+/954804 | 21:07 |
clarkb | corvus: I've approved that change | 21:07 |
clarkb | oh wow it approved quickly | 21:07 |
corvus | 954769, the change to add the new arm build, failed with a post-failure; i re-enqueued it so it'll be the test of whether that works | 21:19 |
corvus | (954716 also failed, but started before that merged; it's a severed head now based on the old upload code) | 21:19 |
corvus | clarkb: i made the change to add an option to turn off nodepool and validate labels; i put some thought into what that would look like based on our earlier conversation and included thoughts in the commit message. https://review.opendev.org/954825 | 21:21 |
corvus | (we don't need to rush that; maybe we can land that next week) | 21:21 |
corvus | i'm going to dequeue 716 and re-enqueue since it's failing and using a bunch of arm nodes | 21:23 |
corvus | https://zuul.opendev.org/t/opendev/build/a986072e8c344bba9ce5164820a9012d is our canary build for the upload retries | 21:25 |
clarkb | ack I'll try to review the label validation change once I get this spec written and pushed | 21:27 |
opendevreview | Clark Boylan proposed opendev/infra-specs master: Add spec to use Matrix for OpenDev comms https://review.opendev.org/c/opendev/infra-specs/+/954826 | 22:07 |
clarkb | corvus: left a couple of thoughts/questions but overall looks about how I would expect it to | 22:20 |
clarkb | https://zuul.opendev.org/t/opendev/build/a30c8fcb64774249ab312d79374bd929 looks like we get a new upload error but I don't think the code itself is fundamentally flawed | 22:22 |
clarkb | seems like this is similar to the old errors we were trying to work around but they are bubbling up differently now? | 22:23 |
clarkb | there are also some successful builds so the change itself isn't 100% fatal | 22:27 |
corvus | clarkb: thanks, replied and updated. | 22:29 |
corvus | clarkb: yeah, that's a fascinating new error. i agree, i don't think we need to revert, but it clearly didn't solve all the probs | 22:32 |
corvus | i'll try to figure out what "499: Client Error for url" even means | 22:32 |
clarkb | +2 thanks | 22:32 |
corvus | https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html i wonder if we need to do something with status codes there; it's all a bit muddled | 22:38 |
corvus | and i wonder what issued the 499? is that a load balancer or something? | 22:40 |
clarkb | https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry.RETRY_AFTER_STATUS_CODES that list I guess? | 22:40 |
corvus | yeah... i'm not entirely sure what the default behavior is! | 22:40 |
clarkb | "client disconnected from the server before the server could send a response" | 22:40 |
clarkb | almost seems like maybe we have a client side timeout after all? | 22:41 |
clarkb | the original failures looked like server side timeouts to me but frickler thought changing timeouts might help and maybe that is the case | 22:41 |
corvus | right, but for us to get a 499 code -- obviously we're not the client, unless that's some urllib internal thing where it makes up a 499 code for an internal timeout. | 22:41 |
clarkb | ya thats what I'm wondering. Or if the server issues 499 as part of the write out to handle client disconnect? | 22:41 |
clarkb | definitely muddled | 22:41 |
corvus | (like, if a server literally sent us a 499 response code, it's obvious that we didn't timeout -- but it might have if it is acting as a client for a backend) | 22:42 |
clarkb | oh! right | 22:42 |
clarkb | the proxy may be the one that disconnected then sends a 499 to us the real client | 22:42 |
corvus | yeah. i've found some stackoverflows that suggest haproxy may do that | 22:42 |
corvus | but maybe since it's a weird code urllib3 retries don't retry it by default | 22:44 |
Clark[m] | Maybe add 499 to the default status retry list and then set that on the retry object? Sorry had to pop out for an errand so it's a bit difficult to get all the terms right | 22:57 |
Clark[m] | Also james_denton and dan_with may be interested in looking at the proxy | 22:58 |
corvus | i'm still learning, but i think the status_forcelist is only used in a case that doesn't apply to us | 22:59 |
corvus | i think the retry logic works like this: 1) if it is a connection error, then increment the connection retry counter and retry | 22:59 |
corvus | 2) if it is a read error, increment the read error counter and retry | 22:59 |
corvus | 3) if the server sent a "Retry-After: ..." header in the response, then consult the retry_after options including status_forcelist to decide whether to retry | 23:00 |
corvus | iow, i think status_forcelist is only used in that last case, which makes it basically: "if the server sent retry-after, and we are configured to honor retry-after, and the status code is one of the codes in RETRY_AFTER_STATUS_CODES or in status_forcelist, then retry" | 23:01 |
Clark[m] | Thinking out loud here but maybe we'd get more consistent regular server responses if we use smaller objects in multipart uploads. Not sure if that is also an option | 23:01 |
corvus | so case 3 would only apply if the 499 code arrives with a retry-after header, and ... i doubt it? but i don't know. | 23:01 |
Clark[m] | Like if that is 5gb now maybe 1gb sharding would be more reliable | 23:02 |
corvus | case 1 is what we thought we were doing when we started this: dealing with connection errors. we may have actually addressed that EOF that we got due to case 1, but then maybe on the retry we got a 499? and the 499 is falling through because it doesn't match case 1 or case 3. (i'm still checking case 2) | 23:03 |
corvus | case 1: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/retry.py#L365 | 23:04 |
corvus | case 2: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/retry.py#L373 | 23:04 |
corvus | case 3: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/retry.py#L387 | 23:04 |
corvus | so yeah, i'm like 95% sure that if we get a 499, we would only retry it if it comes with a Retry-After header | 23:05 |
Clark[m] | Got it | 23:05 |
corvus | Clark: i agree that we should consider whether our behavior is causing or contributing to this and if we should change it :) | 23:07 |
corvus | we are... trying to push data quickly. | 23:07 |
corvus | it's unclear whether there's a correlation with multiple uploads from different hosts. are we triggering some kind of account limit? or is it coincidence and the proxies are just having a bad day right as we're adding more images? | 23:08 |
corvus | Clark: we currently upload 500MB chunks | 23:10 |
corvus | if we want 499 to cause a retry, I think we can implement our own subclass of Retry and override "_is_connection_error" (though we'd be overriding a "private" method). the nearest "public" method to override would be "increment", but that's pretty complex. it'd probably be okay to override _is_connection_error. | 23:16 |
corvus | jamesdenton: hi! we're encountering some unusual errors when uploading large objects to swift in flex. the total object size is many gigabytes, and we're using SLO with 500MB chunks. | 23:20 |
corvus | one example error is at 2025-07-11 22:03:28.180894 -- HttpException: 499: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_ac0fed44dbe4539d83485bcefc4e2d4b/images-7b7d44d25aa9/cfe16fd7553c4921bfe241b237d4a2f8-rockylinux-8.vhd.zst/000006, Client Disconnect The client was disconnected during request. | 23:20 |
corvus | is that due to an error on the cloud side? should we retry if that happens? or are we inadvertently causing a problem due to the way we're uploading files and should we do something different? | 23:22 |
corvus | here's a whole set of errors from the most recent buildset: https://paste.opendev.org/show/bnlkcRemnDVqN5xqn7vZ/ | 23:27 |
corvus | i'm not seeing much in the way of commonalities in times; not much clustering of the error times, nor of the segment numbers (so it happens at different points in the upload processes) | 23:28 |
clarkb | ya I wonder if we should consider the semaphore again but thats a very big stick. Also we really only need to rate limit the uploads not the builds. | 23:33 |
clarkb | Not sure I have any good ideas for doing that | 23:33 |
clarkb | I guess staggering them out like nodepool did if we can express that through zuul | 23:33 |
corvus | we can also reduce our parallelism for individual uploads | 23:33 |
corvus | i just don't want to guess. if there's some limit, let's find out what it is. otherwise, we might be 2xing our upload time just because someone is doing maintenance on a load balancer today | 23:34 |
opendevreview | Merged openstack/project-config master: Drop requirements branch override for translations https://review.opendev.org/c/openstack/project-config/+/954747 | 23:36 |
clarkb | corvus: ++ | 23:49 |
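
The retry discussion above (retry on connection errors, and possibly also treat a proxy-issued 499 as retryable) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the merged "Add backoff handling to swift upload" change: `HttpError`, `RETRYABLE_STATUSES`, and `upload_with_backoff` are hypothetical names, and the choice of status codes and delays is a guess rather than anything confirmed in the log.

```python
import random
import time


class HttpError(Exception):
    """Hypothetical stand-in for an SDK exception carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


# Assumption: statuses treated as transient. 499 is the "client disconnect"
# code from the log, which appears to be issued by a proxy in front of swift.
RETRYABLE_STATUSES = frozenset({499, 502, 503, 504})


def upload_with_backoff(upload, attempts=5, base_delay=1.0, max_delay=30.0,
                        sleep=time.sleep):
    """Call upload() until it succeeds, retrying transient failures.

    Connection errors (for example an unexpected EOF) and HTTP errors whose
    status is in RETRYABLE_STATUSES are retried with exponential backoff and
    full jitter. Any other error, or the last failed attempt, is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return upload()
        except (ConnectionError, HttpError) as exc:
            retryable = (isinstance(exc, ConnectionError)
                         or exc.status in RETRYABLE_STATUSES)
            if not retryable or attempt == attempts:
                raise
        # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
```

In a sketch like this each 500MB segment upload would be wrapped separately, so a single dropped connection costs one segment rather than the whole image. Whether retrying a 499 is actually the right response is the open question put to the cloud operators above.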