Monday, 2025-10-06

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/96255702:15
*** mrunge_ is now known as mrunge05:50
*** dtantsur_ is now known as dtantsur05:54
*** darmach9 is now known as darmach10:39
*** g-a-c0 is now known as g-a-c10:51
*** efoley_ is now known as efoley12:21
fungiinfra-root: more interesting behavior that may be related to the weekend's zuul upgrade... an autohold from last week (0000000241 in the openstack tenant) had its held node spontaneously deleted. even more odd is that it can't be deleted now12:48
fungithe autohold can't be deleted now, i mean12:49
fungii wonder if this is at all related to the "Exception loading ZKObject" errors tonyb noted on saturday12:49
fungialso https://zuul.opendev.org/t/openstack/autohold/0000000243 which i just created shows me data for 0000000241 instead12:52
fungipossibly because i recreated it with the same parameters as before?12:53
fungi(in between those i created 0000000242 but had the change format wrong so deleted it immediately, and that one was successfully removed)12:54
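
(For context on the autohold workflow being discussed, a minimal sketch using zuul-client; the job name, reason, project form, and exact flags are illustrative and may differ by zuul-client version:)

    # hold nodes from the next failure of a job on a change (values illustrative)
    zuul-client autohold --tenant openstack \
        --project opendev/zuul-providers --job some-job \
        --change 963173 --reason "debugging image build failure" --count 1

    # list current hold requests and delete one by its request id
    zuul-client autohold-list --tenant openstack
    zuul-client autohold-delete --tenant openstack 0000000241
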
fungihttps://review.opendev.org/c/zuul/zuul/+/960417 "Add subnode support to launcher" merged last week, and does make additions to the model12:58
fungii'm trying to track down the node deletion in the scheduler debug logs but not having much luck yet, maybe it got reaped by cleanup on one of the launchers?13:05
fungi2025-10-06 12:45:15,959 ERROR zuul.web:   AttributeError: property 'min_request_version' of 'OpenstackProviderNode' object has no setter13:08
fungithat's from web-debug.log for my autohold-delete attempt on 0000000241 a few minutes ago13:09
opendevreviewPierre Riteau proposed opendev/zuul-providers master: Add cirros 0.6.3 images to cache  https://review.opendev.org/c/opendev/zuul-providers/+/96317313:10
fungizl01 and zl02 both logged a "zuul.zk.ZooKeeper: Error playing back event WatchedEvent(type='NONE', state='CONNECTED', path='/zuul/nodes/nodes/c97202e2fa644f8cb8038b62466cc101')" on 2025-10-04 around 08:40 utc (a couple of minutes apart)13:12
fungithat's probably when they were upgraded13:13
opendevreviewPierre Riteau proposed opendev/zuul-providers master: Add cirros 0.6.3 images to cache  https://review.opendev.org/c/opendev/zuul-providers/+/96317313:13
fungithe launchers reported a slew of errors like that, i'm guessing one for each existing node at that point in time13:18
fungiand yeah, the underlying exception was the same AttributeError on min_request_version as my autohold-delete met with13:20
fungii can't say for sure the exceptions tonyb saw mentioned in the log are from the same issue, because ZKObject._loadData() doesn't preserve tracebacks so all we know is that there was *some* exception involving a request revision property, but it seems likely13:46
*** ralonsoh_ is now known as ralonsoh14:00
mnasiadkaHmm, https://review.opendev.org/c/opendev/zuul-providers/+/963173 failed two jobs with some gitea related clone/checkout problems14:27
fungimnasiadka: both on cfn/computing-offload looks like, so maybe there's something broken with that repo on one of our gitea backends... i'll try to isolate it14:43
mnasiadkayeah, that felt weird that it’s on the same repo14:44
corvusfungi: the revision errors are a red herring14:44
fungicorvus: oh, good to know. thanks!14:45
clarkbfungi: mnasiadka iirc that repo is the one with a bunch of opaque binary data in it and it is larger than nova. I asked horace to see if he could reach out to them but not sure if that happened14:45
corvusfungi: so aside from anything weird that happened during the upgrade, the current problem is that 243 has the wrong info?  what is it supposed to say?14:47
fungicorvus: it's supposed to say 243 and not 24114:48
fungialso 241 is not deletable, and the node that was held for 241 (and all other held nodes i think) vanished, maybe all the nodes that existed got "cleaned up" during the upgrade?14:49
corvusfungi: maybe try reloading?  because 243 says 243 for me14:49
fungioh, yep now it says 243. i wonder if the old page content got cached for some reason?14:50
corvusah yeah, it looks like in the web ui, if you click on an autohold, it just shows you the first one it loaded14:50
fungior old api response14:50
fungigot it, so that was just confusing me and unrelated14:50
fungimain issue(s) seem to be unexpected node deletion during the upgrade and pre-upgrade autoholds no longer being deleteable14:50
corvusokay, i think the min_request_version error is a plausible cause for the node being deleted during the upgrade.  i'll check and make sure that isn't an ongoing problem.  i think i have enough info to try deleting the autoholds now (which i'll try after refreshing the web ui)14:55
fungion mnasiadka's job failures, i directly cloned cfn/computing-offload successfully from all 6 gitea backends, no errors14:55
mnasiadkaSo then recheck it is14:56
fungiso it doesn't seem like it's corrupt on a random backend at least14:56
corvusah, it's the same error, so that answers that question.  i'll get a fix soon.14:56
fungicorvus: yeah, sounds right, i haven't seen any node-related problems for any resources created after the upgrade completed14:58
fungiand like i said, i was able to delete an autohold i created today, it's just the one i created friday is undeletable15:00
fungiwhich would seem to point to them having different data15:00
fungicfn/computing-offload is currently 487M after a fresh clone, and takes me 37 seconds to download. the last change to merge to the master branch was just over a month ago, so no recent changes that i can see15:06
fungioh, though i think dib downloads all branches and tags too15:06
clarkbfungi: yes dib should fetch all branches and refs, but it should also do so from the state of the last build's cache15:07
fungiwhich in a check job should be empty15:07
clarkbit's possible the problems were simply internet connectivity related and we're more likely to experience an issue with the larger repos?15:07
clarkbfungi: no the cache on the image15:07
fungioh! right15:07
clarkbin /opt/cache or whatever the path is, not the zuul repos15:07
fungiyeah, https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories#L214-L215 is the relevant update command i think15:12
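
(Roughly, the cached-repo refresh in that element amounts to something like the following; the cache path, upstream URL, and refspecs here are illustrative rather than dib's exact code:)

    # refresh the on-image git cache in place; a pre-existing corrupt object in
    # the cache makes this fetch (or a later checkout) fail rather than the clone
    git -C /opt/git/opendev.org/cfn/computing-offload fetch -q --prune \
        https://opendev.org/cfn/computing-offload \
        '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
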
clarkbdo we have a log from the failure?15:15
clarkbthe fetches are supposed to retry now iirc. In the past we had problems where we would try to fetch over flip flopping ip addresses15:16
clarkb`error: corrupt loose object '8c215a8f38400bfce658e8f66efff79010d764aa'`15:18
fungihttps://zuul.opendev.org/t/opendev/build/aa1eca2c3be144e6937360b080247ba5/log/job-output.txt#2095-210315:18
clarkb`fatal: loose object 8c215a8f38400bfce658e8f66efff79010d764aa (stored in /opt/dib_tmp/dib_build.IwkOLp7G/mnt/opt/git/opendev.org/cfn/computing-offload/.git/objects/8c/215a8f38400bfce658e8f66efff79010d764aa) is corrupt`15:18
clarkbso I don't think that is the flip flopping ip address issue15:19
clarkbthat seems like the underlying git cache is corrupt and we may need to clear that out. But first we have to determine if the corruption is in gerrit, gitea, or the disk cache on test node images15:19
fungiyeah, so i wonder if we somehow ended up with corrupt git caches of that repo just on both the debian-trixie and alma-10 images15:19
fungionly those two platforms failed for that buildset anyway15:20
clarkbfungi: the image builds for trixie and alma build on noble iirc15:20
clarkb(just keep in mind they don't build themselves)15:20
clarkbya the inconsistent behavior implies gerrit is not at fault. Could be a specific gitea backend or a specific cloud image15:20
fungioh, are those the only two platforms we're building on noble?15:20
clarkbfungi: no I think everything builds on noble15:20
fungiyeah, just wondering if those were the two most recently uploaded images and that's the reason only they failed so far15:21
fungii need to pop out to lunch, but can keep digging into this in about an hour15:21
clarkbor a bit flipped in some specific cloud, or a specific test node image got corrupted and everywhere was still running an older version due to upload delays, or was running a newer version and these jobs were scheduled to older ones that were sad15:22
clarkbI think step 1 here is to check each gitea backend15:22
clarkband if those look good then we dig into the test images themselves15:22
clarkbboth failed builds ran in rax flex sjc315:22
clarkbsuccessful jobs ran in rax flex dfw3, vexxhost ca-ymq-1, and rax ord (maybe others too)15:23
clarkbhaven't found a successful job in sjc3 yet so it could be specific to the image upload in sjc3 (this is where the inability to confirm hashes of cloud images via glance becomes problematic)15:24
clarkbI've got clones going against 09, 10, and 11 right now. These are not very fast. When complete I'll do 12, 13, and 1415:28
clarkbgitea09 looks good15:30
clarkboh though I'm realizing I'm just doing a git clone not the broader command15:31
clarkb9, 10, and 11 all clone cleanly at least15:35
clarkbas does 12. I'm beginning to suspect this is an issue with the repo cache15:36
clarkbin theory the repo cache will automatically get cleared out as the next round of builds will either fail on the broken images (so not propagate) or succeed on happy caches and then the result of those successes will be uploaded to all the clouds including the ones with currently broken images15:37
clarkbcorvus: ^ assuming that the noble image rax flex sjc3 does have a corrupted git repo in its git cache do you think ^ is accurate and it will eventually "self heal"15:38
clarkbI think we can manually delete that image from sjc3 if we want to speed the process up15:38
clarkbI also wish we recorded if the git clone from the cache or the update after the cache clone was the command that failed15:40
clarkbI suspect the latter because the clone would just copy the objects and packfiles as-is iirc, then when we try to operate on them in the subsequent command it complains15:41
clarkball 6 clones from each respective gitea backend succeeded for me15:41
clarkbI'm going to git fsck each one now just to double check15:42
clarkb`git fsck --full --strict --progress` reports no issues in any of the local fresh clones15:44
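
(The per-backend check boils down to something like this; the gitea backend hostnames and the https port are assumptions for illustration:)

    # clone directly from each gitea backend and fsck the result
    for n in 09 10 11 12 13 14; do
        git clone --quiet --mirror \
            "https://gitea${n}.opendev.org:3081/cfn/computing-offload" "backend-${n}.git" &&
            git -C "backend-${n}.git" fsck --full --strict ||
            echo "problem on gitea${n}"
    done
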
clarkbhttps://zuul.opendev.org/t/opendev/build/e6d17f2733f3406f83daec9a962674bc/log/job-output.txt is another failure from the recheck and also in rax flex sjc3 so ya I think that particular cloud image is bad15:48
clarkbI'll manually boot a test node on that image in that cloud to see if we can learn more15:48
clarkbbut first I need to reboot for local system updates15:48
clarkbubuntu-noble-cd24b1135a324bd98beeebb955bee932 is the newer of the two noble images in raxflex sjc3 so I believe this is the one in use15:55
clarkbinfra-root: root@159.135.206.48 will get you onto my test node16:00
clarkbgit fsck on the cache repo on that image shows the same error as the image builds do16:02
clarkbthe object file does exist (so not the missing portion of the error but the corrupt portion) and is ~2.9MB large16:03
clarkbI'm fscking every repo in the cache on that image now16:06
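
(The bulk check is essentially a loop like this; the /opt/git cache path is assumed from the error paths quoted above:)

    # walk the on-image git cache and report any repo that fails fsck
    find /opt/git -type d -name .git | while read -r gitdir; do
        repo=$(dirname "$gitdir")
        git -C "$repo" fsck --full --strict >/dev/null 2>&1 || echo "CORRUPT: $repo"
    done
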
clarkbwe had a successful build in dfw3 so I'll boot a test node there and run the same fsck to compare16:08
corvuscatching up16:11
clarkb174.143.59.196 is the ip address for the dfw3 node16:12
clarkba number of other repos had issues in the sjc3 node but so far only some minor warnings in the dfw3 node16:15
corvusclarkb: without https://review.opendev.org/955797 my guess is we will struggle to self-heal because we'll keep failing builds.  with that, i think we would slowly self-heal.  in either case, deleting the known bad image should speed things up.16:15
clarkbcorvus: ah for some reason I thought 955797 was already in place16:16
corvus(and to be clear, without https://review.opendev.org/955797 and deleting the known bad image should mean immediate healing)16:16
corvusyeah i keep thinking that too :)16:16
clarkbI think this is another data point indicating that 955797 is a good idea16:16
clarkbcorvus: what is the process for deleting ubuntu-noble-cd24b1135a324bd98beeebb955bee932?16:16
clarkbif my test in dfw3 comes back clean I think we should proceed with that as it appears to be a problem isolated to this specific cloud image16:17
corvusclarkb: i usually use the web ui, i think there's a zuul-client command.16:17
clarkback I'll do that once my tests conclude if they show the issue isn't more widespread16:17
clarkbI'm going to assume that this is some bitflip or data conversion corruption error on the cloud side after we've uploaded the data since the same data has apparently uploaded elsewhere without issue16:18
corvusclarkb: note you must use the build tenant: opendev16:18
clarkbcorvus: and that will then affect all other tenants?16:18
corvusyep16:19
clarkbdangling commit b0cd0dfe6eb8ba133fcfc09e6a6b91edf5de6f96, dangling commit 5ad6519069f2ce78fd90366abb131be265c40d9c, and a bunch of gitignoreSymlink: .gitignore is a symlink warnings are what is found on dfw316:20
clarkbnone of which are fatal16:20
clarkbon sjc3 starlingx/test, openstack/openstack-ansible, openstack/openstack-manuals, openstack/puppet-openstack-cookiecutter, x/ci-cd-pipeline-app-murano, cfn/computing-offload, and possibly more have corruption issues16:22
clarkbso ya seems like a fairly widespread data corruption issue with this particular image upload that isn't replicated by other uploads. I'll work on deleting the image and then clean up my test nodes16:22
clarkbif anyone wants to do more investigation let me know and I can keep my test nodes up16:23
fungiclarkb: yeah, that's why i linked to the command in dib, i had already done a clone and even a clone --mirror from all the backends successfully16:24
clarkbubuntu-noble-cd24b1135a324bd98beeebb955bee932 has been deleted from sjc316:25
clarkbinfra-root is there any interest in keeping my test nodes around at this point or should I delete them?16:26
fungii think you've gotten all the info from it that i would have16:27
clarkbcorvus: is it safe to dequeue a buildset that is doing image builds? This is for check which I think should be fine since that doesn't produce any stored results16:28
clarkbI want to dequeue 963173 so that we can reenqueue it quicker16:29
clarkbnote I checked the cloud image listing and it was cleared out there16:31
clarkbalso looks like we may have leaked old nodepool images16:32
clarkbimages like ubuntu-noble-1752791456. I'm guessing those should all be deleted manually at this point. I'll make a note for that but not sure when I'll be able to get to it16:32
corvusclarkb: absolutely safe to dequeue16:38
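
(For reference, the dequeue itself is a one-liner with zuul-client; the project form and patchset number are illustrative and the change is given as number,patchset:)

    # drop the in-flight check buildset so the recheck starts fresh
    zuul-client dequeue --tenant opendev --pipeline check \
        --project opendev/zuul-providers --change 963173,2
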
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/963203 Fix providernode assignment upgrade [NEW]16:38
opendevreviewMerged openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho  https://review.opendev.org/c/openstack/project-config/+/95746716:38
corvusclarkb: fungi the providernode change above is something we should get into opendev production; we're going to have lingering bad data until we restart with it.16:39
fungithanks16:39
clarkbmy test nodes have been deleted and I dequeued and rechecked the cirros 0.6.3 change16:40
clarkbcorvus: I'll review that change now16:41
clarkbas I think I've concluded the bad image builds situation for the moment16:42
fungisame16:42
clarkbside note there are so many new admin buttons in the zuul ui now16:42
fungiare you logged in?16:42
fungii hadn't noticed them, but i don't tend to authenticate to the webui16:42
clarkbyes I used the web ui to delete the image from sjc316:44
clarkb(and that required logging in first)16:44
fungineat16:45
clarkbcorvus: one question on https://review.opendev.org/c/zuul/zuul/+/96320316:53
fungiit seems like that test case is only temporarily useful anyway, so i don't see much point in overengineering it16:54
corvusreplied (and ++ to fungi's point)16:58
clarkback +2 from me16:59
fungiyeah, lgtm. thanks for the quick fix!17:08
clarkbfungi: you may want to weigh in on https://review.opendev.org/95579717:13
clarkbalso we could add a git fsck step to the image builds to force them to fail early and not upload if the corruption happens at that point (though I don't think that was the case here since only one region seemed affected)17:14
opendevreviewMerged opendev/project-config master: Use built images even if some failed  https://review.opendev.org/c/opendev/project-config/+/95579717:15
fungiah yeah i was meaning to look at that one, merged now!17:15
fungidoes make image validation testing harder in the future if we want to add it, i think? but we can figure that out when we get there17:17
corvusit might be worth an audit of checksums -- i'm 97% sure we're passing md5 and sha256 all the way to openstacksdk for the image uploads; but i don't know what happens after that.17:17
corvus(also, if these don't actually do anything -- boy oh boy could we make zuul-launcher and the image build jobs a lot simpler :)17:18
clarkbcorvus: I think they do something but I'm not sure how deep that validation goes with glance17:21
clarkblike I know that if glance converts the image on the backend after download there is no way as the end user to check that the image you uploaded with glance is the one it received after the fact because the recorded checksum is for the translated/converted data17:22
clarkbI remember looking into this years and years ago after we had a corrupted image with nodepool that was a short write; checksums have some value but not the entire value you expect them to have. But I don't recall all the details and things may have changed since17:22
corvusmmm... and that conversion is another step where error could be introduced too17:23
clarkbyup17:23
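
(A hedged sketch of the kind of audit being discussed: hash the artifact locally, then compare against what glance reports, remembering that a backend format conversion changes the stored checksum. The local filename is hypothetical, the image name is just the one mentioned earlier, and the fields assume glance's multihash support:)

    # checksums of the local artifact before upload
    sha256sum ubuntu-noble.qcow2
    md5sum ubuntu-noble.qcow2

    # what the cloud thinks it stored; if the backend converted the image
    # (e.g. qcow2 -> raw) these describe the converted data, not the upload
    openstack image show ubuntu-noble-cd24b1135a324bd98beeebb955bee932 \
        -f value -c checksum -c os_hash_algo -c os_hash_value
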
corvuswe could use zuul image validation jobs to check for this case.  we've always resisted having a validation that checked whether our jobs should succeed -- but one that did basic checks like "does it boot? are the git repos ok?" seems like it could be in scope.17:24
clarkbya checking fundamental attributes of the image itself seems safe compared to only allowing images up that allow specific test jobs to pass17:25
corvusthat is a feature that is implemented now in zuul-launcher.  i don't have time to make jobs for that, but if someone wants to, i'm happy to answer questions.17:25
corvusjobs + pipeline config i should say17:26
clarkbis that a check that can be configured to run against the image upload in each cloud region or would it just be once per upload?17:26
clarkbI think this will make a good preptg topic (which starts tomorrow) so I'll add it there in a bit17:27
corvusi may be having trouble parsing the question because i think: once per upload == image upload in each cloud region17:28
corvusmaybe one of those was supposed to be "once per image build artifact"?17:28
clarkbcorvus: yes sorry is it once per build artifact or can we do it once per upload17:29
clarkbI think for this particular concern we need to check each upload in each cloud region independently17:29
corvusgot it.  it's once per upload.  so yes, it addresses this use case.17:30
clarkbchecking the central build artifact still has some potential value (if the corruption happens early for example)17:30
corvusyeah, the way to check the artifact would be in the current build job (like was proposed earlier)17:30
fungiright, we had plenty of trouble back when devstack smoketest was our image validator, but i agree something simpler and less prone to random breakage could make sense17:32
clarkbok this item is added to the preptg agenda18:21
clarkbI guess it's a good time to remind people that now is a great time to get your ideas on that agenda as we'll be diving into things starting tomorrow18:21
clarkbhttps://review.opendev.org/c/opendev/zuul-providers/+/963173 passes now after the noble image cleanup18:29
clarkbinfra-root I think if we approve ^ that should cause us to update images after a few days of not doing so due to the noble image problem18:29
clarkbthen as followup we should consider if we can/should delete any of those older cirros images18:30
*** g-a-c3 is now known as g-a-c18:47
clarkbTheJulia: re the dib and ai code review topics for the opendev pre ptg since you can't make wednesday and there is the possibility we don't need the extra time on Thursday tomorrow is probably the best time to join. Does something like 1900 UTC tomorrow work for you? I can ensure we get to those two topics around that timeframe if so21:15
clarkbalternatively if tomorrow doesn't work then we can just use the Thursday time for those two items if we finish everything else tomorrow and wednesday21:16
clarkbthat would be 1500 Thursday as the alternative21:17
TheJuliaYeah, that should be able to work21:19
TheJulia1900 that is21:20
clarkbgreat I'll pencil that in on the etherpad so we don't forget. See you there21:20
TheJuliaThanks!21:20
opendevreviewMerged opendev/zuul-providers master: Add cirros 0.6.3 images to cache  https://review.opendev.org/c/opendev/zuul-providers/+/96317321:47
clarkbcorvus: do we not have a status for the image uploads to clouds? ^ that change landed and promoted successfully. But if I go here: https://zuul.opendev.org/t/opendev/image/ubuntu-noble I only see the old uploads (no in progress uploads for example)22:17
fungiclarkb: at the bottom i see a bunch that are in a "pending" state22:20
clarkbfungi: I think those have just shown up so I may have been too impatient and checked before the launchers started processing the promoted builds22:21
clarkbso i guess the gap here is between the promotion job running and the launcher processing it (maybe because it is doing one image at a time and if I checked other images I would've seen them sooner?)22:22
corvusit's actually the scheduler that makes those on report and it's all synchronous.  but zuul-web relies on a cache which could have a small lag.  regardless, i would expect that clicking "refresh" after 5 seconds or so should show the right data.22:24
corvus(btw, the pending uploads are the ones you're looking for; you can match the build id of the artifacts on that page to https://zuul.opendev.org/t/opendev/build/f0fd8c6fdc134142971c083839b73229)22:27
corvus(eventually the build id on that page should be a link to that)22:27
clarkbthat appears to be a link already22:29
clarkbsounds like maybe I just wasn't refreshing hard enough. I'll try to keep that in mind for next time and check if it works more like how I would expect it to22:30
corvusno i mean on https://zuul.opendev.org/t/openstack/image/ubuntu-noble the text "f0fd8c6fdc134142971c083839b73229" will someday be a link to https://zuul.opendev.org/t/opendev/build/f0fd8c6fdc134142971c083839b7322922:35
corvustoday you just have to copy/paste22:35
corvuseither way, that's how you find out that those image uploads correspond to the artifacts from that job22:37
corvusah, it is a link in the opendev tenant, just not openstack22:37
fungii do love that you can click through from an image to the build results with the image build log and the image itself as a downloadable artifact22:38
clarkbah I am in the opendev tenant since you said I had to be there earlier to delete the bad image upload22:38
corvusi guess i did get around to implementing that ;)22:38
fungiyeah, i keep initially forgetting and looking at images in the openstack tenant and wondering why they don't link anywhere, then remembering you have to be in the opendev tenant since that's where they built22:39
corvusfungi: yeah, there should be no mystery/guessing; everything is traceable :)22:39
fungiexactly, a very elegant design!22:39
fungiawesome work22:40
clarkbcorvus: did you see my note about what I think are the old nodepool image uploads? Those should be safe to delete now like the old instances right?23:23
corvusyep!23:27
clarkbok I'll try to get around to that between meetings tomorrow23:33
clarkboh I also have a dentist appointment tomorrow afternoon so once meetings are done I may not be around much23:33
