Tuesday, 2025-09-23

ykarelThanks clarkb will keep an eye on the jobs03:55
ykarelat least as per opensearch it looks good, no failure since last night04:07
*** ralonsoh_ is now known as ralonsoh07:33
*** jroll01 is now known as jroll008:29
seunghunleeHello, Is there a way to check whether Zuul had a problem with a certain project last week?10:24
tonybseunghunlee: I'm not sure I understand; zuul is either functional or not.  There aren't typically outages per project/tenant 10:26
tonybseunghunlee: can you give me a little more detail?10:26
seunghunleeHello tonyb, I saw one of the CI jobs of openstack/kolla-ansible failing around 6pm UTC on the 18th. It was passing before then and there was no significant change merged in between. Today I tried again and that CI job is passing. So, I wanted to know whether it was really caused by something in Zuul.10:30
tonyboh.   that is helpful.   I don't have the full details but yes there were some generic problems.   it's probable that you hit those.10:32
tonybI was intrigued/confused by the "certain project" part of your query.10:33
tonybseunghunlee: if you share the change link I can take a quick look to confirm 10:34
seunghunleeThank you tonyb, here's the change https://review.opendev.org/c/openstack/kolla-ansible/+/96167510:37
seunghunleeAh sorry, I misread my log. Correction: CI was okay on 18th but failed at 12 pm UTC on 19th10:40
seunghunleeand now it's fine10:40
tonybbased on my read it was a network outage/issue.   I think you can ignore it10:47
seunghunleeThanks for confirming tonyb10:53
fungitypically, i recommend looking at a failing build result and at least identifying the most obvious error message from its logs13:42
fungithis was the result from the only voting job that failed for the linked change on the 19th: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b32413:44
fungithe "Run test-mariadb.sh script" task exited nonzero (2) in the run phase: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/job-output.txt#1574-158113:48
fungithe task detail is at https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/console#2/1/55/primary and i think indicates an ssh connection problem13:50
fungibut the task ran for almost 30 minutes according to the console13:51
fungithis is the task definition: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/run.yml#L523-L53113:54
fungidoesn't seem to pass any parameters to the script, but it does seem to log the output from that: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/test-mariadb.sh#L8913:55
fungiwhich is collected and available here: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb13:56
fungihttps://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#13753 says the mariadb_recovery.yml playbook is what actually exited 213:58
fungithat's this playbook: https://opendev.org/openstack/kolla-ansible/src/branch/master/ansible/mariadb_recovery.yml14:00
fungii'm lost on where/whether that results in any additional logging, but https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/kolla/mariadb/mariadb.txt#2492 does indicate some sort of local connectivity problem reaching 4567/tcp on a virtual interface with gcomm/gcs (whatever that is)14:11
fungithat's about 18 minutes into the test-mariadb.sh task, 12 minutes before it ends, so not sure it's a smoking gun14:13
fungithough it does seem to coincide with this mariadb restart task which reports an ssh error, that's from the nested ansible running on the primary test node: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#12975-1312714:17
fungii'm not sure what the network topology looks like between the test nodes, whether 192.0.2.1 is local to the primary node or across a tunnel to one of the secondary nodes14:22
fungianyway, this ran in ovh gra1 which is not somewhere we had known network issues on friday, and the log detail looks more like mariadb failed to start during recovery testing14:32
fungiseunghunlee: tonyb: that's my takeaway ^14:33
fungisomeone with more kolla familiarity would need to confirm14:33
*** sean-k-mooney is now known as sean-k-mooney-pto14:51
*** NeilHanlon_ is now known as NeilHanlon15:04
opendevreviewChristian Berendt proposed openstack/project-config master: Remove refstack projects  https://review.opendev.org/c/openstack/project-config/+/96211615:08
mnaserhttps://tarballs.opendev.org/openstack/openstack-helm/ any reason why this isn't loading?16:31
mnaserit's been spinning for a few minutes for me..16:31
mnaseri hit enter and a few seconds later it loads.. so.. nvm16:32
clarkbthere was an afs blip that affected docs.openstack.org on the 15th. It didn't last very long16:38
clarkbI wonder if this is the same issue16:38
mnasiadkamnaser: I’ve been getting the same for some time, also for zuul.opendev.org - wonder if that’s some problem on the Rackspace side16:38
clarkboh ok, if you're seeing something similar with other resources hosted in rackspace that are not backed by openafs then it's probably different16:39
mnasiadkaclarkb: on zuul I experience that on “task summary” tab in job view, so it might be different16:41
mnasiadkaThe page with tenants list always loads correctly16:41
clarkbthe task summary has to load the data from the swift log storage backend (that goes straight to your browser) so the issue may be between you and the swift install16:43
clarkbbut also sometimes that happens if the logs have timed out and been deleted after 30 days or if the logs are too large for your browser to process16:43
mnasiadkaNo, these are today's jobs16:44
mnasiadkaI can try to trace the swift backend in question16:44
clarkbif you have a link to the build I can also see if it reproduces for me16:44
clarkbfor the tarballs I wonder if all of the new releases caused openafs to notice its cache was stale then it was either not able to update the cache as quickly as mnaser expected or it failed to update it for some time due to an issue16:45
clarkbas a heads up my network woes resurfaced this morning so I've got someone from my isp coming out to check on things upstream of my router sometime after 2000 UTC today16:53
clarkbI won't be surprised if the debugging process has me disconnected from the internet for some time16:53
fungiit does seem inconsistent and maybe afs-related, some directories are loading promptly for me16:54
fungithough it might also be because they publish a tarball for every single change that merges, 11159 files in /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm/ right now16:55
clarkbya I would expect that to make the cache unhappy at times16:56
fungithat is, a tarball for every change and project they cover16:56
fungiso e.g. watcher-2025.1.2+5bfc19d67.tgz16:56
fungiyeah that directory listing is massive16:57
fungithousands of tiny (~60-70kb) tarballs16:57
clarkbdmesg says 'afs: Warning: it took us a long time (around 4 seconds) to try to trim our stat cache down to a reasonable size. This may indicate someone is accessing an excessive number of files, or something is wrong with the AFS cache.' and 'afs: Consider raising the afsd -stat parameter (current setting: 15000, current vcount: 23578), or figure out what is accessing so many files.'16:58
clarkbbut that is from yesterday not today16:58
clarkbbut maybe it is related due to the large file counts16:58
fungi"an excessive number of files" describes this case to a t16:58
clarkbthinking out loud here: it might make sense to split tarballs onto its own frontend host with a separate afs installation to shard out the total number of files afs is managing on any one instance?16:59
clarkbcurrently static is hosting all the things16:59
clarkbbut also maybe we can just tune afs on that one node and get it happy17:00
mnasiadkaOh boy, openstack-helm uploads a tarball for each commit sha in a given branch? huh17:00
fungii don't think splitting the tarballs site to a different client would help this particular case17:00
fungiopenstack-helm treeing their tarballs would make a difference17:01
fungior pruning the ones they don't need any more, or just not uploading so many17:01
fungiapache mod_autoindex is performing admirably when asked to render 11k file links at once17:02
fungii shudder in horror at the thought of how many llm training crawlers are hitting that directory and then following every link17:03
clarkbnarrator: "it was all of them"17:03
clarkbincluding the ones that identify as 3 year old chrome installations17:05
clarkbre rax flex double nic situation: I had a thought that maybe it is related to setting up the floating ip. Put another way maybe we boot with one network then when we attach the fip we're attaching another or something along those lines. In any case I'm going to try and put together a minimal reproducer script and see if I can get it to happen outside of zuul launcher17:36
fungioh, like asking for a fip automatically asks for a port too?17:39
clarkbya maybe. This is just a debugging hunch to track down and rule out. I don't have evidence this is the cause yet17:41
fungispillover from the openstack tc meeting, it looks like debian's package of gertty is one of the reverse-depends for python3-ply17:49
mnasiadkaclarkb: https://zuul.opendev.org/t/openstack/build/20843e8dd2894c80940e01e964c20e9e - that one is awfully slow for me18:04
mnasiadkaclarkb: but clicking through the logs tab is fast, so I have no clue what’s happening :)18:04
clarkbthat should grab the manifest file, then use that to find the job-output.json file, which it then parses through to find the failed task. https://17a8cbac2c4522dc47fa-749ac6329d801cba1d5f2575f7d4079d.ssl.cf5.rackcdn.com/openstack/20843e8dd2894c80940e01e964c20e9e/ shows the swift log location as well as file sizes for all of that18:05
clarkbI don't see anything that looks terribly bad18:05
clarkbmnasiadka: I think once it finds the failing task it tries to find the last N lines of output. Maybe that is slow if there is a lot of output for a specific failing task?18:07
fungithe landing page for the build also parses the console log18:07
fungiyeah, that18:07
clarkbfungi: I think it parses it out of the json not the text though18:07
clarkbbut yes maybe that parsing is what is slow and not the actual file download18:08
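A rough sketch of the parsing flow described above; Zuul's actual task-summary rendering runs client-side in JavaScript, the job-output.json structure is assumed from Zuul's Ansible JSON callback, and the log URL is a placeholder:

    import json
    import urllib.request

    def failed_tasks(job_output):
        # Walk the assumed structure: one entry per playbook run, each with
        # plays -> tasks -> per-host results that may carry a "failed" flag.
        failed = []
        for playbook in job_output:
            for play in playbook.get("plays", []):
                for task in play.get("tasks", []):
                    for host, result in task.get("hosts", {}).items():
                        if result.get("failed"):
                            failed.append((playbook.get("phase"), task["task"]["name"], host))
        return failed

    # Placeholder URL; a real one is the build's swift log location plus job-output.json.
    with urllib.request.urlopen("https://example-cdn/openstack/<build-uuid>/job-output.json") as resp:
        print(failed_tasks(json.load(resp)))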
fungiit's definitely slow to load for me too18:08
fungithe console log is only 1.2k lines, so not that long18:09
clarkbya the files are all of reasonable file size too18:11
fungiwhere "slow" is 15-20 seconds between when it renders the heading for the page and shows the "Fetching info..." spinner18:12
fungiand when it actually loads the task summary18:13
corvusthe debug panel says it takes 29 seconds for the object storage system to produce a 404 on job-output.json.gz before it falls back to fetching job-output.json which returns a 200 in 11ms.18:13
clarkbcorvus: huh is there a .json.gz in the manifest?18:14
corvusperhaps the object storage system cdn is slow to deal with cache misses18:14
clarkblooks like there is only a job-output.json not a job-output.json.gz18:15
clarkbin the manifest I mean18:15
fungipresumably zuul tries both but expects a 404 to be basically instantaneous to return18:15
fungiat least i wouldn't expect "i don't have the file you requested" to be a particularly slow thing to determine18:16
clarkbunless as corvus points out you're a cdn and not having data is not itself an indication it doesn't exist18:17
corvusyep.  that's because some of the upload roles gzip after the manifest is created.18:17
clarkb(a lot of cdns work by having 404 handlers in the edge endpoints do a lookup for data at the source before actually replying with a 404 so ya maybe that is slow)18:17
fungithough i'd hope they also incorporate some sort of negative cache for those responses as well, even if with a shorter expiry18:18
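A minimal sketch of the fallback corvus describes, assuming the compressed name is requested first; with a CDN that is slow to confirm cache misses, waiting for the 404 on the first request becomes the expensive step even though the eventual 200 is fast:

    import gzip
    import json

    import requests

    def fetch_job_output(base_url):
        # Try the gzipped name first, then fall back to the plain file.
        resp = requests.get(base_url + "job-output.json.gz", timeout=60)
        if resp.status_code == 200:
            return json.loads(gzip.decompress(resp.content))
        # The slow part above is waiting for the 404; this request returns quickly.
        resp = requests.get(base_url + "job-output.json", timeout=60)
        resp.raise_for_status()
        return resp.json()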
opendevreviewChristian Berendt proposed openstack/project-config master: Remove refstack projects  https://review.opendev.org/c/openstack/project-config/+/96211618:44
opendevreviewChristian Berendt proposed openstack/project-config master: Remove refstack projects  https://review.opendev.org/c/openstack/project-config/+/96211618:52
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/959892 is approved20:02
clarkbI +2'd https://review.opendev.org/c/opendev/system-config/+/961530 but did not approve it as I am not sure if we just want to send it at this point cc corvus 20:03
clarkband now lunch20:03
corvussent20:04
opendevreviewMerged opendev/system-config master: Clean up OpenEuler mirroring infrastructure  https://review.opendev.org/c/opendev/system-config/+/95989220:33
opendevreviewMerged opendev/system-config master: Delete statsd gauges after 24h of inactivity  https://review.opendev.org/c/opendev/system-config/+/96153020:33
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Remove nodepool based testing  https://review.opendev.org/c/openstack/diskimage-builder/+/95295320:44
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Remove testing for f37  https://review.opendev.org/c/openstack/diskimage-builder/+/95295420:44
tonybAny reason I shouldn't +A https://review.opendev.org/c/opendev/system-config/+/946316 ?20:47
clarkbI can't think of one20:47
tonybokay I've approved it20:48
* tonyb is trying to clear out low hanging fruit 20:49
clarkbinfra-root ok I think I have a hacked up first draft of a script based on how zuul launches nodes that I'd like to run against the clouds.yaml on zl01. Do you think I should try to copy it into the launcher container and run it with the container's python, or just create a new venv on the host and install openstacksdk there?20:56
clarkbthis first draft doesn't bother with floating IPs yet. I figure if we get one interface with no floating ip then we can add fips. If we get two without fips then we can focus on that and ignore fips20:56
opendevreviewMerged opendev/system-config master: Add tony's AFS admin user to UserList  https://review.opendev.org/c/opendev/system-config/+/94631620:59
clarkbI think I convinced myself to copy the script into the container so that everything matches (runtime and library versions)21:00
tonybmy gut feeling is that it doesn't matter, but until we verify the repro tool works as expected, probably use the container21:00
clarkbok no floating IP management seems to produce a node with one interface21:02
clarkbwhich is unfortunate in that it means I need to make this reproduction test case more complicated. But it may also point towards fips being the source of the issue21:02
tonybYup, and Yup :/21:03
clarkbcorvus: in the state machine for openstack server creation we check against self.label.auto_floating_ip. The string auto_floating_ip seems to only appear on that one line in the entire zuul codebase. Do you know how we're ever matching that as true?21:09
corvusclarkb: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackprovider.py#L12821:30
corvusclarkb: default is true21:30
corvusprotip: grep for auto.floating.ip21:30
clarkbthanks, so we definitely are falling into that block. Fwiw I've been testing as if that was the case as it's the only way it makes sense that we get a floating IP21:42
clarkbI have not yet been able to reproduce the issue. I'll share my script in a few21:43
clarkbcorvus: https://paste.opendev.org/show/bedm2UM2rjLgnJHnZASF/ this produces a single interface on the node with one floating ip attached21:47
clarkbso now I'm thinking it may have to do with the inputs to create_server. I've edited that script with what I believe to be the relevant inputs but maybe the network list doesn't look the way I expect it to?21:47
clarkband I ran that script within the zuul launcher container so all of the libraries and the version of python should be the same21:48
clarkbI do wonder why we create the floating ip (which should attach it) then call attach_ip_to_server. I thought maybe something about those redundant calls might not do what we expect but it seems to work fine21:48
clarkband when I manually delete the server after checking its details the fip automatically deletes too21:50
clarkbeverything works when I'm doing it :P21:50
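Approximately what the pasted reproducer does, as a sketch using the openstacksdk cloud layer; the cloud name and the image/flavor/network values are placeholders taken from the surrounding discussion, not a copy of the actual script:

    import openstack

    # Uses the same clouds.yaml the launcher uses; the cloud name is a placeholder.
    conn = openstack.connect(cloud="rax-flex-sjc3")

    server = conn.create_server(
        name="clarkb-nic-test",
        image="ubuntu-noble",
        flavor="gp.0.4.8",
        network=["opendevzuul-network1"],  # the network list is the input under suspicion
        auto_ip=False,                     # mimic the launcher and handle the fip explicitly
        wait=True,
    )

    # create_floating_ip(server=...) both allocates the fip and attaches it,
    # which is why the later _attach_ip_to_server call looks redundant.
    conn.create_floating_ip(server=server, wait=True)

    # Inspect how many interfaces/addresses the node ended up with.
    for net, addrs in conn.get_server(server.id)["addresses"].items():
        print(net, [a["addr"] for a in addrs])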
clarkbthe other thing that may be different between that script and the launcher is I've ripped out the state machine machinery21:51
clarkbmaybe somehow we're getting things mixed up in the state machine and we're executing actions against the same node twice?21:52
corvusi don't think we call _attach_ip_to_server21:52
clarkbah I guess it depends on whether or not the port id is None. fwiw I tested it both ways with and without that last line and both produce the same result of a single interface with the floating ip attached21:53
clarkbif the port id is none then we call _attach_ip_to_server, so I guess that's a belt-and-suspenders measure in case the create doesn't attach the fip21:54
corvusthere are 646 create_server calls for DFW3 in the log and 638 create_floating_ip21:56
clarkbthose extra 8 are probably due to failures before we get to deciding we want a floating ip so we bail out early21:56
corvusi think that's consistent with 1:1, assuming that some create_server calls produced errors21:57
clarkbyup21:57
clarkbI could test listing the same network twice in the networks list to see if that reproduces. I think we expect it would but I'm not sure that this is why we're seeing that behavior when configuring networks via the launcher21:59
clarkbunless there is some earlier config data processing error duplicating data there? maybe via the inheritance tree or something?21:59
corvuswe can check that, 1 sec.22:01
corvuswell, more like 1 min22:01
corvuswhat label?22:03
clarkbcorvus: ubuntu-noble-8GB or ubuntu-noble I think22:04
corvusoh of course this wouldn't be the live configuration22:05
clarkbya my old server listings in my terminal scrollback show noble nodes with flavor gp.0.4.8 hit the issue22:05
clarkboh right yes22:05
clarkbsorry we pulled it out since it was broken22:05
corvusi can confirm that currently the networks list is empty for that label22:06
clarkbok that is good22:06
corvusi can't exclude the hypothesis that it may have had multiple entries in the other config22:06
clarkbdo you think there is value in testing what happens if we list that network twice?22:06
corvusyes, that's way easier than any other next steps22:07
clarkbok I'll work on that in a bit22:07
clarkbby updating line 39 of my pasted script to list the same network twice and then checking what happens22:07
corvus++22:07
corvusyou're using the current prod clouds.yaml right?22:08
corvusthat means that your script is telling us it's okay to have the networks listed in clouds.yaml, and then also specify the same network once in the create call (and the create call should override clouds.yaml as we expect)22:08
Clark[m]Yes exactly. Sorry had to step away from the computer. Kids just got home22:10
corvusif that's all true, then we can create a new provider (we could put it in only one tenant, or we could put it in all of them and just set the limits to 0).  then we can investigate the inheritance hypothesis or others.22:10
corvusClark: there was a semi-recent change that altered the inheritance mechanism.  i'm not sure about the timing, but it's plausible, especially considering a possible delay before anyone noticed.  https://review.opendev.org/961534 changes it again, slightly.22:12
corvusi think we can and should land that change, and just be aware of it as a variable.22:13
corvusprobably the best sequence would be to: 1) confirm double network behavior with that script; 2) introduce a new provider and inspect it to see if we get double inheritance; 3) merge that change and see if the problem persists; 4) try to fix it.22:15
clarkbcorvus: I edited networks to ['opendevzuul-network1', 'opendevzuul-network1'] and commented out the last line as we don't believe we're calling _attach_ip_to_server and ran it. This does seem to reproduce22:30
clarkbso ya my best guess at this point is that we're inputting multiple networks so that we get multiple nics22:30
clarkbcorvus: one question about your proposed plan: we were setting the networks value in the parent section called raxflex-base. If we add a new provider that inherits from that we may not get the same behavior. Do you think we need both a new base and a child provider section?22:32
clarkbI guess I'm not entirely sure what the safest and most reproduction-worthy approach is. I'm fairly certain we don't want to reupload images either, for example22:33
clarkbalso any reason to keep my test node around or should I go ahead and delete it22:33
* clarkb will make an attempt at a new pair of provider sections to test this out22:34
corvusclarkb: there is a shocking amount of code dedicated to not uploading images multiple times to the same endpoints22:34
corvusas long as the image definitions are the same, it should detect it22:34
clarkback22:35
clarkbcorvus: does that mean the connection: foo value should be the same in both?22:35
corvusi think we should create a new base provider and rely on that, so we have fidelity to what we had before22:35
corvusyep, same connection22:36
corvusthat way we don't have to update zuul.conf22:36
opendevreviewClark Boylan proposed opendev/zuul-providers master: Add a test provider for rax flex sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/96214322:41
clarkbcorvus: ^ something like that?22:41
clarkbI'm going to clean up my test node now as I don't think it is useful and I don't want to forget it22:41
corvusclarkb: that lgtm.  if we don't see a problem then we should add the other regions.22:42
corvusi'm going to +3 that22:43
opendevreviewMerged opendev/zuul-providers master: Add a test provider for rax flex sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/96214322:43
clarkbok22:43
corvus['opendevzuul-network1', 'opendevzuul-network1']22:45
corvusclarkb: looks like you win :)22:45
clarkbwow ok that was a fun one to run down22:45
corvusi'm going to approve https://review.opendev.org/961534 now22:46
clarkbok I guess we'll need to manually update the launchers then recheck the list entries?22:46
corvusyep22:47
corvusclarkb: i think i see the problem; and i don't think 534 will affect it either way.  i'll work on a test-and-fix when i finish what i'm working on22:49
clarkbsounds good and thanks for helping track this down22:49
corvusnp, thank you :)22:49

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!