| ykarel | Thanks clarkb will keep an eye on the jobs | 03:55 |
|---|---|---|
| ykarel | at least as per opensearch it looks good, no failures since last night | 04:07 |
| *** ralonsoh_ is now known as ralonsoh | 07:33 | |
| *** jroll01 is now known as jroll0 | 08:29 | |
| seunghunlee | Hello, is there a way to check whether Zuul had a problem with a certain project last week? | 10:24 |
| tonyb | seunghunlee: I'm not sure I understand; zuul is either functional or not for the whole tenant. there typically aren't outages per project | 10:26 |
| tonyb | seunghunlee: can you give me a little more detail? | 10:26 |
| seunghunlee | Hello tonyb, I saw one of the CI jobs for openstack/kolla-ansible failing at around 6pm UTC on the 18th. But it was passing before then and no significant change merged in between. Today I tried again and that CI is passing. So, I wanted to know if it was really caused by something in Zuul. | 10:30 |
| tonyb | oh. that is helpful. I don't have the full details but yes there were some generic problems. it's probable that you hit those. | 10:32 |
| tonyb | I was intrigued/confused by the "certain project" part of your query. | 10:33 |
| tonyb | seunghunlee: if you share the change link I can take a quick look to confirm | 10:34 |
| seunghunlee | Thank you tonyb, here's the change https://review.opendev.org/c/openstack/kolla-ansible/+/961675 | 10:37 |
| seunghunlee | Ah sorry, I misread my log. Correction: CI was okay on 18th but failed at 12 pm UTC on 19th | 10:40 |
| seunghunlee | and now it's fine | 10:40 |
| tonyb | based on my read it was a network outage/issue. I think you can ignore it | 10:47 |
| seunghunlee | Thanks for confirming tonyb | 10:53 |
| fungi | typically, i recommend looking at a failing build result and at least identifying the most obvious error message from its logs | 13:42 |
| fungi | this was the result from the only voting job that failed for the linked change on the 19th: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324 | 13:44 |
| fungi | the "Run test-mariadb.sh script" task exited nonzero (2) in the run phase: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/job-output.txt#1574-1581 | 13:48 |
| fungi | the task detail is at https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/console#2/1/55/primary and i think indicates an ssh connection problem | 13:50 |
| fungi | but the task ran for almost 30 minutes according to the console | 13:51 |
| fungi | this is the task definition: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/run.yml#L523-L531 | 13:54 |
| fungi | doesn't seem to pass any parameters to the script, but it does seem to log the output from that: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/test-mariadb.sh#L89 | 13:55 |
| fungi | which is collected and available here: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb | 13:56 |
| fungi | https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#13753 says the mariadb_recovery.yml playbook is what actually exited 2 | 13:58 |
| fungi | that's this playbook: https://opendev.org/openstack/kolla-ansible/src/branch/master/ansible/mariadb_recovery.yml | 14:00 |
| fungi | i'm lost on where/whether that results in any additional logging, but https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/kolla/mariadb/mariadb.txt#2492 does indicate some sort of local connectivity problem reaching 4567/tcp on a virtual interface with gcomm/gcs (whatever that is) | 14:11 |
| fungi | that's about 18 minutes into the test-mariadb.sh task, 12 minutes before it ends, so not sure it's a smoking gun | 14:13 |
| fungi | though it does seem to coincide with this mariadb restart task which reports an ssh error, that's from the nested ansible running on the primary test node: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#12975-13127 | 14:17 |
| fungi | i'm not sure what the network topology looks like between the test nodes, whether 192.0.2.1 is local to the primary node or across a tunnel to one of the secondary nodes | 14:22 |
| fungi | anyway, this ran in ovh gra1 which is not somewhere we had known network issues on friday, and the log detail looks more like mariadb failed to start during recovery testing | 14:32 |
| fungi | seunghunlee: tonyb: that's my takeaway ^ | 14:33 |
| fungi | someone with more kolla familiarity would need to confirm | 14:33 |
| *** sean-k-mooney is now known as sean-k-mooney-pto | 14:51 | |
| *** NeilHanlon_ is now known as NeilHanlon | 15:04 | |
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 15:08 |
| mnaser | https://tarballs.opendev.org/openstack/openstack-helm/ any reason why this isn't loading? | 16:31 |
| mnaser | it's been spinning for a few minutes for me.. | 16:31 |
| mnaser | i hit enter and a few seconds later it loads.. so.. nvm | 16:32 |
| clarkb | there was an afs blip that affected docs.openstack.org on the 15th. It didn't last very long | 16:38 |
| clarkb | I wonder if this is the same issue | 16:38 |
| mnasiadka | mnaser: I’ve been getting the same for some time, also for zuul.opendev.org - wonder if that’s some problem on Rackspace's side | 16:38 |
| clarkb | oh ok if you're seeing similar to other resources hosted in rackspace that are not backed by openafs then its probably different | 16:39 |
| mnasiadka | clarkb: on zuul I experience that on “task summary” tab in job view, so it might be different | 16:41 |
| mnasiadka | The page with tenants list always loads correctly | 16:41 |
| clarkb | the task summary has to load the data from the swift log storage backend (that goes straight to your browser) so the issue may be between you and the swift install | 16:43 |
| clarkb | but also sometimes that happens if the logs have timed out and been deleted after 30 days or if the logs are too large for your browser to process | 16:43 |
| mnasiadka | No, these are today's jobs | 16:44 |
| mnasiadka | I can try to trace the swift backend in question | 16:44 |
| clarkb | if you have a link to the build I can also see if it reproduces for me | 16:44 |
| clarkb | for the tarballs I wonder if all of the new releases caused openafs to notice its cache was stale then it was either not able to update the cache as quickly as mnaser expected or it failed to update it for some time due to an issue | 16:45 |
| clarkb | as a heads up my network woes resurfaced this morning so I've got someone from my isp coming out to check on things upstream of my router sometime after 2000 UTC today | 16:53 |
| clarkb | I won't be surprised if the debugging process has me disconnected from the internet for some time | 16:53 |
| fungi | it does seem inconsistent and maybe afs-related, some directories are loading promptly for me | 16:54 |
| fungi | though it might also be because they publish a tarball for every single change that merges, 11159 files in /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm/ right now | 16:55 |
| clarkb | ya I would expect that to make the cache unhappy at times | 16:56 |
| fungi | that is, tarball for every change and project they cover | 16:56 |
| fungi | so e.g. watcher-2025.1.2+5bfc19d67.tgz | 16:56 |
| fungi | yeah that directory listing is massive | 16:57 |
| fungi | thousands of tiny (~60-70kb) tarballs | 16:57 |
| clarkb | dmesg says 'afs: Warning: it took us a long time (around 4 seconds) to try to trim our stat cache down to a reasonable size. This may indicate someone is accessing an excessive number of files, or something is wrong with the AFS cache.' and 'afs: Consider raising the afsd -stat parameter (current setting: 15000, current vcount: 23578), or figure out what is accessing so many files.' | 16:58 |
| clarkb | but that is from yesterday not today | 16:58 |
| clarkb | but maybe it is related due to the large file counts | 16:58 |
| fungi | "an excessive number of files" describes this case to a t | 16:58 |
| clarkb | thinking out loud here: it might make sense to split tarballs onto its own frontend host with a separate afs installation to shard out the total number of files afs is managing on any one instance? | 16:59 |
| clarkb | currently static is hosting all the things | 16:59 |
| clarkb | but also maybe we can just tune afs on that one node and get it happy | 17:00 |
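The dmesg warning above literally suggests raising afsd's `-stat` parameter. A hedged sketch of that tuning, assuming a Debian-style openafs-client where extra afsd flags are appended via `/etc/openafs/afs.conf` (the file location and the 50000 value are illustrative assumptions, not tested against this host):

```shell
# /etc/openafs/afs.conf (assumed Debian-style location; adjust per distro)
# Raise the stat/vcache entry limit well above the observed vcount (23578)
# so directory scans of huge volumes don't thrash the stat cache.
OPTIONS="$OPTIONS -stat 50000"

# Restart the client to pick up the new flag:
#   sudo systemctl restart openafs-client
```

The right value depends on available memory; the dmesg message prints the current vcount, which gives a lower bound to size against.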
| mnasiadka | Oh boy, openstack-helm uploads a tarball for each commit sha in a given branch? huh | 17:00 |
| fungi | i don't think splitting the tarballs site to a different client would help this particular case | 17:00 |
| fungi | openstack-helm treeing their tarballs would make a difference | 17:01 |
| fungi | or pruning the ones they don't need any more, or just not uploading so many | 17:01 |
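"Treeing" the tarballs would mean sharding the flat 11k-entry directory into prefix subdirectories so no single autoindex page has to render them all. A minimal sketch of one possible scheme (the layout and function name are hypothetical, not what openstack-helm actually publishes):

```python
from pathlib import PurePosixPath

def shard_path(filename: str, depth: int = 1) -> PurePosixPath:
    """Place a tarball under a short prefix directory, e.g.
    watcher-2025.1.2+5bfc19d67.tgz -> w/watcher-2025.1.2+5bfc19d67.tgz,
    so each autoindex page lists a fraction of the files."""
    prefix = filename[:depth].lower() or "_"
    return PurePosixPath(prefix) / filename
```

Sharding by project name or release series instead of first letter would work equally well; the point is bounding the per-directory entry count.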
| fungi | apache mod_autoindex is performing admirably when asked to render 11k file links at once | 17:02 |
| fungi | i shudder in horror at the thought of how many llm training crawlers are hitting that directory and then following every link | 17:03 |
| clarkb | narrator: "it was all of them" | 17:03 |
| clarkb | including the ones that identify as 3 year old chrome installations | 17:05 |
| clarkb | re rax flex double nic situation: I had a thought that maybe it is related to setting up the floating ip. Put another way maybe we boot with one network then when we attach the fip we're attaching another or something along those lines. In any case I'm going to try and put together a minimal reproducer script and see if I can get it to happen outside of zuul launcher | 17:36 |
| fungi | oh, like asking for a fip automatically asks for a port too? | 17:39 |
| clarkb | ya maybe. This is just a debugging hunch to track down and rule out. I don't have evidence this is the cause yet | 17:41 |
| fungi | spillover from the openstack tc meeting, it looks like debian's package of gertty is one of the reverse-depends for python3-ply | 17:49 |
| mnasiadka | clarkb: https://zuul.opendev.org/t/openstack/build/20843e8dd2894c80940e01e964c20e9e - that one is awfully slow for me | 18:04 |
| mnasiadka | clarkb: but clicking through the logs tab is fast, so I have no clue what’s happening :) | 18:04 |
| clarkb | that should grab the manifest file then use that to find the job-output.json file, which it then parses through to find the failed task. https://17a8cbac2c4522dc47fa-749ac6329d801cba1d5f2575f7d4079d.ssl.cf5.rackcdn.com/openstack/20843e8dd2894c80940e01e964c20e9e/ shows the swift log location as well as file sizes for all of that | 18:05 |
| clarkb | I don't see anything that looks terribly bad | 18:05 |
| clarkb | mnasiadka: I think once it finds the failing task it tries to find the last N lines of output. Maybe that is slow if there is a lot of output for a specific failing task? | 18:07 |
| fungi | the landing page for the build also parses the console log | 18:07 |
| fungi | yeah, that | 18:07 |
| clarkb | fungi: I think it parses it out of the json not the text though | 18:07 |
| clarkb | but yes maybe that parsing is what is slow and not the actual file download | 18:08 |
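The parsing being discussed is the dashboard walking job-output.json to find the failed task. A rough illustration of that kind of scan, assuming a simplified nested structure (the real Zuul schema from the zuul_json callback has more fields; this is only a sketch):

```python
def find_failed_tasks(job_output):
    """Walk a Zuul-style job-output.json structure (a list of playbook
    results) and yield (playbook, task name, host) for each failed host."""
    for playbook in job_output:
        for play in playbook.get("plays", []):
            for task in play.get("tasks", []):
                for host, result in task.get("hosts", {}).items():
                    if result.get("failed"):
                        yield (playbook.get("playbook"),
                               task["task"]["name"],
                               host)
```

On a console log of ~1.2k tasks this walk is cheap, which supports the suspicion that the slowness is in fetching rather than parsing.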
| fungi | it's definitely slow to load for me too | 18:08 |
| fungi | the console log is only 1.2k lines, so not that long | 18:09 |
| clarkb | ya the files are all of reasonable file size too | 18:11 |
| fungi | where "slow" is 15-20 seconds between when it renders the heading for the page and shows the "Fetching info..." spinner | 18:12 |
| fungi | and when it actually loads the task summary | 18:13 |
| corvus | the debug panel says it takes 29 seconds for the object storage system to produce a 404 on job-output.json.gz before it falls back to fetching job-output.json which returns a 200 in 11ms. | 18:13 |
| clarkb | corvus: huh is there a .json.gz in the manifest? | 18:14 |
| corvus | perhaps the object storage system cdn is slow to deal with cache misses | 18:14 |
| clarkb | looks like there is only a job-output.json not a job-output.json.gz | 18:15 |
| clarkb | in the manifest I mean | 18:15 |
| fungi | presumably zuul tries both but expects a 404 to be basically instantaneous to return | 18:15 |
| fungi | at least i wouldn't expect "i don't have the file you requested" to be a particularly slow thing to determine | 18:16 |
| clarkb | unless as corvus points out you're a cdn and not having data is not itself an indication it doesn't exist | 18:17 |
| corvus | yep. that's because some of the upload roles gzip after the manifest is created. | 18:17 |
| clarkb | (a lot of cdns work by having 404 handlers in the edge endpoints do a lookup for data at the source before actually replying with a 404 so ya maybe that is slow) | 18:17 |
| fungi | though i'd hope they also incorporate some sort of negative cache for those responses as well, even if with a shorter expiry | 18:18 |
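Since the manifest already lists exactly which objects were uploaded, one client-side mitigation would be to consult it before issuing any request, so the slow CDN 404 on a nonexistent .gz variant is never paid. A sketch (function name and shape are hypothetical, not the dashboard's actual code):

```python
def pick_log_url(manifest_names, base,
                 candidates=("job-output.json.gz", "job-output.json")):
    """Return the URL of the first preferred log variant that the upload
    manifest says actually exists, or None if neither was uploaded.
    Avoids probing the object store for files that were never created."""
    for name in candidates:
        if name in manifest_names:
            return base + name
    return None
```

In the build above, the manifest contains only job-output.json, so this would skip the 29-second .gz miss entirely.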
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 18:44 |
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 18:52 |
| clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/959892 is approved | 20:02 |
| clarkb | I +2'd https://review.opendev.org/c/opendev/system-config/+/961530 but did not approve it as I am not sure if we just want to send it at this point cc corvus | 20:03 |
| clarkb | and now lunch | 20:03 |
| corvus | sent | 20:04 |
| opendevreview | Merged opendev/system-config master: Clean up OpenEuler mirroring infrastructure https://review.opendev.org/c/opendev/system-config/+/959892 | 20:33 |
| opendevreview | Merged opendev/system-config master: Delete statsd gauges after 24h of inactivity https://review.opendev.org/c/opendev/system-config/+/961530 | 20:33 |
| opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove nodepool based testing https://review.opendev.org/c/openstack/diskimage-builder/+/952953 | 20:44 |
| opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove testing for f37 https://review.opendev.org/c/openstack/diskimage-builder/+/952954 | 20:44 |
| tonyb | Any reason I shouldn't +A https://review.opendev.org/c/opendev/system-config/+/946316 ? | 20:47 |
| clarkb | I can't think of one | 20:47 |
| tonyb | okay I've approved it | 20:48 |
| * tonyb is trying to clear out low hanging fruit | 20:49 | |
| clarkb | infra-root ok I think I have a hacked up first draft of a script based on how zuul launches nodes that I'd like to run against the clouds.yaml on zl01. Do you think I should try to copy it into the launcher container and run it out of the container python or just create a new venv on the host and install openstacksdk there? | 20:56 |
| clarkb | this first draft doesn't bother with floating IPs yet. I figure if we get one interface with no floating ip then we can add fips. If we get two without fips then we can focus on that and ignore fips | 20:56 |
| opendevreview | Merged opendev/system-config master: Add tony's AFS admin user to UserList https://review.opendev.org/c/opendev/system-config/+/946316 | 20:59 |
| clarkb | I think I convinced myself to copy the script into the container so that everything matches (runtime and library versions) | 21:00 |
| tonyb | my gut feeling is that it doesn't matter, but use the container until we verify the repro tool works as expected | 21:00 |
| clarkb | ok no floating IP management seems to produce a node with one interface | 21:02 |
| clarkb | which is unfortunate in that it means I need to make this reproduction test case more complicated. But it may also point towards fips being the source of the issue | 21:02 |
| tonyb | Yup, and Yup :/ | 21:03 |
| clarkb | corvus: in the state machine for openstack server creation we check against self.label.auto_floating_ip. The string auto_floating_ip seems to only appear on that one line in the entire zuul codebase. Do you know how we're ever matching that as true? | 21:09 |
| corvus | clarkb: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackprovider.py#L128 | 21:30 |
| corvus | clarkb: default is true | 21:30 |
| corvus | protip: grep for auto.floating.ip | 21:30 |
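corvus's protip works because `.` in the grep pattern is a regex wildcard, so `auto.floating.ip` matches the attribute spelling `auto_floating_ip` as well as a hyphenated config key. The same trick in Python's re module:

```python
import re

# '.' matches any single character, covering '_' and '-' spellings at once
pattern = re.compile(r"auto.floating.ip")

assert pattern.search("self.label.auto_floating_ip")   # attribute access
assert pattern.search("auto-floating-ip: true")        # config key
assert not pattern.search("floating_ip")               # no leading 'auto'
```

This is why a literal grep for the underscore form alone missed the config-side definition that sets the default.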
| clarkb | thanks, so we definitely are falling into that block. fwiw I've been testing as if that was the case, as it's the only way it makes sense that we get a floating IP | 21:42 |
| clarkb | I have not yet been able to reproduce the issue. I'll share my script in a few | 21:43 |
| clarkb | corvus: https://paste.opendev.org/show/bedm2UM2rjLgnJHnZASF/ this produces a single interface on the node with one floating ip attached | 21:47 |
| clarkb | so now I'm thinking it may have to do with the inputs to create_server. I've edited that script with what I believe to be the relevant inputs, but maybe the network list doesn't look the way I expect it to? | 21:47 |
| clarkb | and I ran that script within the zuul launcher container so all of the libraries and the version of python should be the same | 21:48 |
| clarkb | I do wonder why we create the floating ip (which should attach it) then call attach_ip_to_server. I thought maybe those redundant calls might not do what we expect, but it seems to work fine | 21:48 |
| clarkb | and when I manually delete the server after checking its details the fip automatically deletes too | 21:50 |
| clarkb | everything works when I'm doing it :P | 21:50 |
| clarkb | the other thing that may be different between that script and the launcher is I've ripped out the state machine machinery | 21:51 |
| clarkb | maybe somehow we're getting things mixed up in the state machine and we're executing actions against the same node twice? | 21:52 |
| corvus | i don't think we call _attach_ip_to_server | 21:52 |
| clarkb | ah I guess it depends on whether or not the port id is None. fwiw I tested it both ways with and without that last line and both produce the same result of a single interface with the floating ip attached | 21:53 |
| clarkb | if the port id is none then we call _attach_ip_to_server so I guess thats a belts and suspenders if the create doesn't attach the fip | 21:54 |
| corvus | there are 646 create_server calls for DFW3 in the log and 638 create_floating_ip | 21:56 |
| clarkb | those extra 8 are probably due to failures before we get to deciding we want a floating ip so we bail out early | 21:56 |
| corvus | i think that's consistent with 1:1, assuming that some create_server calls produced errors | 21:57 |
| clarkb | yup | 21:57 |
| clarkb | I could test listing the same network twice in the networks list to see if that reproduces. I think we expect it would but I'm not sure that this is why we're seeing that behavior when configuring networks via the launcher | 21:59 |
| clarkb | unless there is some earlier config data processing error duplicating data there? maybe via the inheritance tree or something? | 21:59 |
| corvus | we can check that, 1 sec. | 22:01 |
| corvus | well, more like 1 min | 22:01 |
| corvus | what label? | 22:03 |
| clarkb | corvus: ubuntu-noble-8GB or ubuntu-noble I think | 22:04 |
| corvus | oh of course this wouldn't be the live configuration | 22:05 |
| clarkb | ya my old server listings in my terminal scrollback show noble nodes with flavor gp.0.4.8 hit the issue | 22:05 |
| clarkb | oh right yes | 22:05 |
| clarkb | sorry we pulled it out since it was broken | 22:05 |
| corvus | i can confirm that currently the networks list is empty for that label | 22:06 |
| clarkb | ok that is good | 22:06 |
| corvus | i can't exclude the hypothesis that it may have had multiple entries in the other config | 22:06 |
| clarkb | do you think there is value in testing what happens if we list that network twice? | 22:06 |
| corvus | yes, that's way easier than any other next steps | 22:07 |
| clarkb | ok I'll work on that in a bit | 22:07 |
| clarkb | by updating like 39 of my pasted script to list the same network twice and then check what happens | 22:07 |
| corvus | ++ | 22:07 |
| corvus | you're using the current prod clouds.yaml right? | 22:08 |
| corvus | that means that your script is telling us it's okay to have the networks listed in clouds.yaml, and then also specify the same network once in the create call (and the create call should override clouds.yaml as we expect) | 22:08 |
| Clark[m] | Yes exactly. Sorry had to step away from the computer. Kids just got home | 22:10 |
| corvus | if that's all true, then we can create a new provider (we could put it in only one tenant, or we could put it in all of them and just set the limits to 0). then we can investigate the inheritance hypothesis or others. | 22:10 |
| corvus | Clark: there was a semi-recent change that altered the inheritance mechanism. i'm not sure about the timing, but it's plausible, especially considering a possible delay before anyone noticed. https://review.opendev.org/961534 changes it again, slightly. | 22:12 |
| corvus | i think we can and should land that change, and just be aware of it as a variable. | 22:13 |
| corvus | probably the best sequence would be to: 1) confirm double network behavior with that script; 2) introduce a new provider and inspect it to see if we get double inheritance; 3) merge that change and see if the problem persists; 4) try to fix it. | 22:15 |
| clarkb | corvus: I edited networks to ['opendevzuul-network1', 'opendevzuul-network1'] and commented out the last line as we don't believe we're calling _attach_ip_to_server and ran it. This does seem to reproduce | 22:30 |
| clarkb | so ya my best guess at this point is that we're inputting multiple networks so that we get multiple nics | 22:30 |
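Given that a duplicated entry in the networks list reproduces the double NIC, one defensive client-side guard would be order-preserving deduplication before the create_server call. A sketch of just that guard (not the actual launcher fix, which lands in the config inheritance code):

```python
def dedup_networks(networks):
    """Collapse repeated network names/ids while preserving order, so a
    list like ['opendevzuul-network1', 'opendevzuul-network1'] requests
    only one NIC. dict.fromkeys keeps first-seen order in Python 3.7+."""
    return list(dict.fromkeys(networks))
```

This treats the symptom; as the discussion below concludes, the real bug was the provider config inheritance producing the duplicate list in the first place.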
| clarkb | corvus: one question about your proposed plan: we were setting the networks value in the parent section called raxflex-base. If we add a new provider that inherits from that we may not get the same behavior. DO you think we need both a new base and child provider section? | 22:32 |
| clarkb | I guess I'm not entirely sure what the safest and most reproduction-worthy approach is. I'm fairly certain we don't want to reupload images either, for example | 22:33 |
| clarkb | also any reason to keep my test node around or should I go ahead and delete it | 22:33 |
| * clarkb will make an attempt at a new pair of provider sections to test this out | 22:34 | |
| corvus | clarkb: there is a shocking amount of code dedicated to not uploading images multiple times to the same endpoints | 22:34 |
| corvus | as long as the image definitions are the same, it should detect it | 22:34 |
| clarkb | ack | 22:35 |
| clarkb | corvus: does that mean the connection: foo value should be the same in both? | 22:35 |
| corvus | i think we should create a new base provider and rely on that, so we have fidelity to what we had before | 22:35 |
| corvus | yep, same connection | 22:36 |
| corvus | that way we don't have to update zuul.conf | 22:36 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Add a test provider for rax flex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/962143 | 22:41 |
| clarkb | corvus: ^ something like that? | 22:41 |
| clarkb | I'm going to clean up my test node now as I don't think it is useful and I don't want to forget it | 22:41 |
| corvus | clarkb: that lgtm. if we don't see a problem then we should add the other regions. | 22:42 |
| corvus | i'm going to +3 that | 22:43 |
| opendevreview | Merged opendev/zuul-providers master: Add a test provider for rax flex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/962143 | 22:43 |
| clarkb | ok | 22:43 |
| corvus | ['opendevzuul-network1', 'opendevzuul-network1'] | 22:45 |
| corvus | clarkb: looks like you win :) | 22:45 |
| clarkb | wow ok that was a fun one to run down | 22:45 |
| corvus | i'm going to approve https://review.opendev.org/961534 now | 22:46 |
| clarkb | ok I guess we'll need to manually update the launchers then recheck the list entries? | 22:46 |
| corvus | yep | 22:47 |
| corvus | clarkb: i think i see the problem; and i don't think 534 will affect it either way. i'll work on a test-and-fix when i finish what i'm working on | 22:49 |
| clarkb | sounds good and thanks for helping track this down | 22:49 |
| corvus | np, thank you :) | 22:49 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!