| ykarel | Thanks clarkb will keep an eye on the jobs | 03:55 |
|---|---|---|
| ykarel | at least as per opensearch it looks good, no failures since last night | 04:07 |
| *** ralonsoh_ is now known as ralonsoh | 07:33 | |
| *** jroll01 is now known as jroll0 | 08:29 | |
| seunghunlee | Hello, is there a way to check whether Zuul had a problem with a certain project last week? | 10:24 |
| tonyb | seunghunlee: I'm not sure I understand; zuul is either functional or not for the whole tenant. there typically aren't outages per project | 10:26 |
| tonyb | seunghunlee: can you give me a little more detail? | 10:26 |
| seunghunlee | Hello tonyb, I saw one of the CI jobs for openstack/kolla-ansible failing at around 6pm UTC on the 18th. But it was passing before then and no significant change merged in between. Today I tried again and that CI is passing. So, I wanted to know if it was really caused by something in Zuul. | 10:30 |
| tonyb | oh. that is helpful. I don't have the full details but yes there were some generic problems. it's probable that you hit those. | 10:32 |
| tonyb | I was intrigued/confused by the "certain project" part of your query. | 10:33 |
| tonyb | seunghunlee: if you share the change link I can take a quick look to confirm | 10:34 |
| seunghunlee | Thank you tonyb, here's the change https://review.opendev.org/c/openstack/kolla-ansible/+/961675 | 10:37 |
| seunghunlee | Ah sorry, I misread my log. Correction: CI was okay on 18th but failed at 12 pm UTC on 19th | 10:40 |
| seunghunlee | and now it's fine | 10:40 |
| tonyb | based on my read it was a network outage/issue. I think you can ignore it | 10:47 |
| seunghunlee | Thanks for confirming tonyb | 10:53 |
| fungi | typically, i recommend looking at a failing build result and at least identifying the most obvious error message from its logs | 13:42 |
| fungi | this was the result from the only voting job that failed for the linked change on the 19th: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324 | 13:44 |
| fungi | the "Run test-mariadb.sh script" task exited nonzero (2) in the run phase: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/job-output.txt#1574-1581 | 13:48 |
| fungi | the task detail is at https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/console#2/1/55/primary and i think indicates an ssh connection problem | 13:50 |
| fungi | but the task ran for almost 30 minutes according to the console | 13:51 |
| fungi | this is the task definition: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/run.yml#L523-L531 | 13:54 |
| fungi | doesn't seem to pass any parameters to the script, but it does seem to log the output from that: https://opendev.org/openstack/kolla-ansible/src/branch/master/tests/test-mariadb.sh#L89 | 13:55 |
| fungi | which is collected and available here: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb | 13:56 |
| fungi | https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#13753 says the mariadb_recovery.yml playbook is what actually exited 2 | 13:58 |
| fungi | that's this playbook: https://opendev.org/openstack/kolla-ansible/src/branch/master/ansible/mariadb_recovery.yml | 14:00 |
| fungi | i'm lost on where/whether that results in any additional logging, but https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/kolla/mariadb/mariadb.txt#2492 does indicate some sort of local connectivity problem reaching 4567/tcp on a virtual interface with gcomm/gcs (whatever that is) | 14:11 |
| fungi | that's about 18 minutes into the test-mariadb.sh task, 12 minutes before it ends, so not sure it's a smoking gun | 14:13 |
| fungi | though it does seem to coincide with this mariadb restart task which reports an ssh error, that's from the nested ansible running on the primary test node: https://zuul.opendev.org/t/openstack/build/f1f7256727b243369245a621c840b324/log/primary/logs/ansible/test-mariadb#12975-13127 | 14:17 |
| fungi | i'm not sure what the network topology looks like between the test nodes, whether 192.0.2.1 is local to the primary node or across a tunnel to one of the secondary nodes | 14:22 |
| fungi | anyway, this ran in ovh gra1 which is not somewhere we had known network issues on friday, and the log detail looks more like mariadb failed to start during recovery testing | 14:32 |
| fungi | seunghunlee: tonyb: that's my takeaway ^ | 14:33 |
| fungi | someone with more kolla familiarity would need to confirm | 14:33 |
| *** sean-k-mooney is now known as sean-k-mooney-pto | 14:51 | |
| *** NeilHanlon_ is now known as NeilHanlon | 15:04 | |
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 15:08 |
| mnaser | https://tarballs.opendev.org/openstack/openstack-helm/ any reason why this isn't loading? | 16:31 |
| mnaser | it's been spinning for a few minutes for me.. | 16:31 |
| mnaser | i hit enter and a few seconds later it loads.. so.. nvm | 16:32 |
| clarkb | there was an afs blip that affected docs.openstack.org on the 15th. It didn't last very long | 16:38 |
| clarkb | I wonder if this is the same issue | 16:38 |
| mnasiadka | mnaser: I’ve been getting the same for some time, also for zuul.opendev.org - wonder if that’s some problem on Rackspace's side | 16:38 |
| clarkb | oh ok if you're seeing similar to other resources hosted in rackspace that are not backed by openafs then its probably different | 16:39 |
| mnasiadka | clarkb: on zuul I experience that on “task summary” tab in job view, so it might be different | 16:41 |
| mnasiadka | The page with tenants list always loads correctly | 16:41 |
| clarkb | the task summary has to load the data from the swift log storage backend (that goes straight to your browser) so the issue may be between you and the swift install | 16:43 |
| clarkb | but also sometimes that happens if the logs have timed out and been deleted after 30 days or if the logs are too large for your browser to process | 16:43 |
| mnasiadka | No, these are today's jobs | 16:44 |
| mnasiadka | I can try to trace the swift backend in question | 16:44 |
| clarkb | if you have a link to the build I can also see if it reproduces for me | 16:44 |
| clarkb | for the tarballs I wonder if all of the new releases caused openafs to notice its cache was stale then it was either not able to update the cache as quickly as mnaser expected or it failed to update it for some time due to an issue | 16:45 |
| clarkb | as a heads up my network woes resurfaced this morning so I've got someone from my isp coming out to check on things upstream of my router sometime after 2000 UTC today | 16:53 |
| clarkb | I won't be surprised if the debugging process has me disconnected from the internet for some time | 16:53 |
| fungi | it does seem inconsistent and maybe afs-related, some directories are loading promptly for me | 16:54 |
| fungi | though it might also be because they publish a tarball for every single change that merges, 11159 files in /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm/ right now | 16:55 |
| clarkb | ya I would expect that to make the cache unhappy at times | 16:56 |
| fungi | that is, tarball for every change and project they cover | 16:56 |
| fungi | so e.g. watcher-2025.1.2+5bfc19d67.tgz | 16:56 |
| fungi | yeah that directory listing is massive | 16:57 |
| fungi | thousands of tiny (~60-70kb) tarballs | 16:57 |
| clarkb | dmesg says 'afs: Warning: it took us a long time (around 4 seconds) to try to trim our stat cache down to a reasonable size. This may indicate someone is accessing an excessive number of files, or something is wrong with the AFS cache.' and 'afs: Consider raising the afsd -stat parameter (current setting: 15000, current vcount: 23578), or figure out what is accessing so many files.' | 16:58 |
| clarkb | but that is from yesterday not today | 16:58 |
| clarkb | but maybe it is related due to the large file counts | 16:58 |
| fungi | "an excessive number of files" describes this case to a t | 16:58 |
| clarkb | thinking out loud here: it might make sense to split tarballs onto its own frontend host with a separate afs installation to shard out the total number of files afs is managing on any one instance? | 16:59 |
| clarkb | currently static is hosting all the things | 16:59 |
| clarkb | but also maybe we can just tune afs on that one node and get it happy | 17:00 |
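The dmesg warning above literally suggests raising afsd's `-stat` parameter. A hedged sketch of that tuning, assuming a Debian-style openafs-client where extra afsd flags are appended via `/etc/openafs/afs.conf` (the file location and the 50000 value are illustrative assumptions, not tested against this host):

```shell
# /etc/openafs/afs.conf (assumed Debian-style location; adjust per distro)
# Raise the stat/vcache entry limit well above the observed vcount (23578)
# so directory scans of huge volumes don't thrash the stat cache.
OPTIONS="$OPTIONS -stat 50000"

# Restart the client to pick up the new flag:
#   sudo systemctl restart openafs-client
```

The right value depends on available memory; the dmesg message prints the current vcount, which gives a lower bound to size against.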
| mnasiadka | Oh boy, openstack-helm uploads a tarball for each commit sha in a given branch? huh | 17:00 |
| fungi | i don't think splitting the tarballs site to a different client would help this particular case | 17:00 |
| fungi | openstack-helm treeing their tarballs would make a difference | 17:01 |
| fungi | or pruning the ones they don't need any more, or just not uploading so many | 17:01 |
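"Treeing" the tarballs would mean sharding the flat 11k-entry directory into prefix subdirectories so no single autoindex page has to render them all. A minimal sketch of one possible scheme (the layout and function name are hypothetical, not what openstack-helm actually publishes):

```python
from pathlib import PurePosixPath

def shard_path(filename: str, depth: int = 1) -> PurePosixPath:
    """Place a tarball under a short prefix directory, e.g.
    watcher-2025.1.2+5bfc19d67.tgz -> w/watcher-2025.1.2+5bfc19d67.tgz,
    so each autoindex page lists a fraction of the files."""
    prefix = filename[:depth].lower() or "_"
    return PurePosixPath(prefix) / filename
```

Sharding by project name or release series instead of first letter would work equally well; the point is bounding the per-directory entry count.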
| fungi | apache mod_autoindex is performing admirably when asked to render 11k file links at once | 17:02 |
| fungi | i shudder in horror at the thought of how many llm training crawlers are hitting that directory and then following every link | 17:03 |
| clarkb | narrator: "it was all of them" | 17:03 |
| clarkb | including the ones that identify as 3 year old chrome installations | 17:05 |
| clarkb | re rax flex double nic situation: I had a thought that maybe it is related to setting up the floating ip. Put another way maybe we boot with one network then when we attach the fip we're attaching another or something along those lines. In any case I'm going to try and put together a minimal reproducer script and see if I can get it to happen outside of zuul launcher | 17:36 |
| fungi | oh, like asking for a fip automatically asks for a port too? | 17:39 |
| clarkb | ya maybe. This is just a debugging hunch to track down and rule out. I don't have evidence this is the cause yet | 17:41 |
| fungi | spillover from the openstack tc meeting, it looks like debian's package of gertty is one of the reverse-depends for python3-ply | 17:49 |
| mnasiadka | clarkb: https://zuul.opendev.org/t/openstack/build/20843e8dd2894c80940e01e964c20e9e - that one is awfully slow for me | 18:04 |
| mnasiadka | clarkb: but clicking through the logs tab is fast, so I have no clue what’s happening :) | 18:04 |
| clarkb | that should grab the manifest file then use that to find the job-output.json file, which it then parses through to find the failed task. https://17a8cbac2c4522dc47fa-749ac6329d801cba1d5f2575f7d4079d.ssl.cf5.rackcdn.com/openstack/20843e8dd2894c80940e01e964c20e9e/ shows the swift log location as well as file sizes for all of that | 18:05 |
| clarkb | I don't see anything that looks terribly bad | 18:05 |
| clarkb | mnasiadka: I think once it finds the failing task it tries to find the last N lines of output. Maybe that is slow if there is a lot of output for a specific failing task? | 18:07 |
| fungi | the landing page for the build also parses the console log | 18:07 |
| fungi | yeah, that | 18:07 |
| clarkb | fungi: I think it parses it out of the json not the text though | 18:07 |
| clarkb | but yes maybe that parsing is what is slow and not the actual file download | 18:08 |
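The parsing being discussed is the dashboard walking job-output.json to find the failed task. A rough illustration of that kind of scan, assuming a simplified nested structure (the real Zuul schema from the zuul_json callback has more fields; this is only a sketch):

```python
def find_failed_tasks(job_output):
    """Walk a Zuul-style job-output.json structure (a list of playbook
    results) and yield (playbook, task name, host) for each failed host."""
    for playbook in job_output:
        for play in playbook.get("plays", []):
            for task in play.get("tasks", []):
                for host, result in task.get("hosts", {}).items():
                    if result.get("failed"):
                        yield (playbook.get("playbook"),
                               task["task"]["name"],
                               host)
```

On a console log of ~1.2k tasks this walk is cheap, which supports the suspicion that the slowness is in fetching rather than parsing.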
| fungi | it's definitely slow to load for me too | 18:08 |
| fungi | the console log is only 1.2k lines, so not that long | 18:09 |
| clarkb | ya the files are all of reasonable file size too | 18:11 |
| fungi | where "slow" is 15-20 seconds between when it renders the heading for the page and shows the "Fetching info..." spinner | 18:12 |
| fungi | and when it actually loads the task summary | 18:13 |
| corvus | the debug panel says it takes 29 seconds for the object storage system to produce a 404 on job-output.json.gz before it falls back to fetching job-output.json which returns a 200 in 11ms. | 18:13 |
| clarkb | corvus: huh is there a .json.gz in the manifest? | 18:14 |
| corvus | perhaps the object storage system cdn is slow to deal with cache misses | 18:14 |
| clarkb | looks like there is only a job-output.json not a job-output.json.gz | 18:15 |
| clarkb | in the manifest I mean | 18:15 |
| fungi | presumably zuul tries both but expects a 404 to be basically instantaneous to return | 18:15 |
| fungi | at least i wouldn't expect "i don't have the file you requested" to be a particularly slow thing to determine | 18:16 |
| clarkb | unless as corvus points out you're a cdn and not having data is not itself an indication it doesn't exist | 18:17 |
| corvus | yep. that's because some of the upload roles gzip after the manifest is created. | 18:17 |
| clarkb | (a lot of cdns work by having 404 handlers in the edge endpoints do a lookup for data at the source before actually replying with a 404 so ya maybe that is slow) | 18:17 |
| fungi | though i'd hope they also incorporate some sort of negative cache for those responses as well, even if with a shorter expiry | 18:18 |
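Since the manifest already lists exactly which objects were uploaded, one client-side mitigation would be to consult it before issuing any request, so the slow CDN 404 on a nonexistent .gz variant is never paid. A sketch (function name and shape are hypothetical, not the dashboard's actual code):

```python
def pick_log_url(manifest_names, base,
                 candidates=("job-output.json.gz", "job-output.json")):
    """Return the URL of the first preferred log variant that the upload
    manifest says actually exists, or None if neither was uploaded.
    Avoids probing the object store for files that were never created."""
    for name in candidates:
        if name in manifest_names:
            return base + name
    return None
```

In the build above, the manifest contains only job-output.json, so this would skip the 29-second .gz miss entirely.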
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 18:44 |
| opendevreview | Christian Berendt proposed openstack/project-config master: Remove refstack projects https://review.opendev.org/c/openstack/project-config/+/962116 | 18:52 |
| clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/959892 is approved | 20:02 |
| clarkb | I +2'd https://review.opendev.org/c/opendev/system-config/+/961530 but did not approve it as I am not sure if we just want to send it at this point cc corvus | 20:03 |
| clarkb | and now lunch | 20:03 |
| corvus | sent | 20:04 |
| opendevreview | Merged opendev/system-config master: Clean up OpenEuler mirroring infrastructure https://review.opendev.org/c/opendev/system-config/+/959892 | 20:33 |
| opendevreview | Merged opendev/system-config master: Delete statsd gauges after 24h of inactivity https://review.opendev.org/c/opendev/system-config/+/961530 | 20:33 |
| opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove nodepool based testing https://review.opendev.org/c/openstack/diskimage-builder/+/952953 | 20:44 |
| opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Remove testing for f37 https://review.opendev.org/c/openstack/diskimage-builder/+/952954 | 20:44 |
| tonyb | Any reason I shouldn't +A https://review.opendev.org/c/opendev/system-config/+/946316 ? | 20:47 |
| clarkb | I can't think of one | 20:47 |
| tonyb | okay I've approved it | 20:48 |
| * tonyb is trying to clear out low hanging fruit | 20:49 | |
| clarkb | infra-root ok I think I have a hacked up first draft of a script based on how zuul launches nodes that I'd like to run against the clouds.yaml on zl01. Do you think I should try to copy it into the launcher container and run it out of the container python or just create a new venv on the host and install openstacksdk there? | 20:56 |
| clarkb | this first draft doesn't bother with floating IPs yet. I figure if we get one interface with no floating ip then we can add fips. If we get two without fips then we can focus on that and ignore fips | 20:56 |
| opendevreview | Merged opendev/system-config master: Add tony's AFS admin user to UserList https://review.opendev.org/c/opendev/system-config/+/946316 | 20:59 |
| clarkb | I think I convinced myself to copy the script into the container so that everything matches (runtime and library versions) | 21:00 |
| tonyb | my gut feeling is that it doesn't matter, but use the container until we verify the repro tool works as expected | 21:00 |
| clarkb | ok no floating IP management seems to produce a node with one interface | 21:02 |
| clarkb | which is unfortunate in that it means I need to make this reproduction test case more complicated. But it may also point towards fips being the source of the issue | 21:02 |
| tonyb | Yup, and Yup :/ | 21:03 |
| clarkb | corvus: in the state machine for openstack server creation we check against self.label.auto_floating_ip. The string auto_floating_ip seems to only appear on that one line in the entire zuul codebase. Do you know how we're ever matching that as true? | 21:09 |
| corvus | clarkb: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackprovider.py#L128 | 21:30 |
| corvus | clarkb: default is true | 21:30 |
| corvus | protip: grep for auto.floating.ip | 21:30 |
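corvus's protip works because `.` in the grep pattern is a regex wildcard, so `auto.floating.ip` matches the attribute spelling `auto_floating_ip` as well as a hyphenated config key. The same trick in Python's re module:

```python
import re

# '.' matches any single character, covering '_' and '-' spellings at once
pattern = re.compile(r"auto.floating.ip")

assert pattern.search("self.label.auto_floating_ip")   # attribute access
assert pattern.search("auto-floating-ip: true")        # config key
assert not pattern.search("floating_ip")               # no leading 'auto'
```

This is why a literal grep for the underscore form alone missed the config-side definition that sets the default.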
| clarkb | thanks, so we definitely are falling into that block. fwiw I've been testing as if that was the case, as it's the only way it makes sense that we get a floating IP | 21:42 |
| clarkb | I have not yet been able to reproduce the issue. I'll share my script in a few | 21:43 |
| clarkb | corvus: https://paste.opendev.org/show/bedm2UM2rjLgnJHnZASF/ this produces a single interface on the node with one floating ip attached | 21:47 |
| clarkb | so now I'm thinking it may have to do with the inputs to create_server. I've edited that script with what I believe to be the relevant inputs, but maybe the network list doesn't look the way I expect it to? | 21:47 |
| clarkb | and I ran that script within the zuul launcher container so all of the libraries and the version of python should be the same | 21:48 |
| clarkb | I do wonder why we create the floating ip (which should attach it) then call attach_ip_to_server. I thought maybe those redundant calls might not do what we expect, but it seems to work fine | 21:48 |
| clarkb | and when I manually delete the server after checking its details the fip automatically deletes too | 21:50 |
| clarkb | everything works when I'm doing it :P | 21:50 |
| clarkb | the other thing that may be different between that script and the launcher is I've ripped out the state machine machinery | 21:51 |
| clarkb | maybe somehow we're getting things mixed up in the state machine and we're executing actions against the same node twice? | 21:52 |
| corvus | i don't think we call _attach_ip_to_server | 21:52 |
| clarkb | ah I guess it depends on whether or not the port id is None. fwiw I tested it both ways with and without that last line and both produce the same result of a single interface with the floating ip attached | 21:53 |
| clarkb | if the port id is none then we call _attach_ip_to_server so I guess thats a belts and suspenders if the create doesn't attach the fip | 21:54 |
| corvus | there are 646 create_server calls for DFW3 in the log and 638 create_floating_ip | 21:56 |
| clarkb | those extra 8 are probably due to failures before we get to deciding we want a floating ip so we bail out early | 21:56 |
| corvus | i think that's consistent with 1:1, assuming that some create_server calls produced errors | 21:57 |
| clarkb | yup | 21:57 |
| clarkb | I could test listing the same network twice in the networks list to see if that reproduces. I think we expect it would but I'm not sure that this is why we're seeing that behavior when configuring networks via the launcher | 21:59 |
| clarkb | unless there is some earlier config data processing error duplicating data there? maybe via the inheritance tree or something? | 21:59 |
| corvus | we can check that, 1 sec. | 22:01 |
| corvus | well, more like 1 min | 22:01 |
| corvus | what label? | 22:03 |
| clarkb | corvus: ubuntu-noble-8GB or ubuntu-noble I think | 22:04 |
| corvus | oh of course this wouldn't be the live configuration | 22:05 |
| clarkb | ya my old server listings in my terminal scrollback show noble nodes with flavor gp.0.4.8 hit the issue | 22:05 |
| clarkb | oh right yes | 22:05 |
| clarkb | sorry we pulled it out since it was broken | 22:05 |
| corvus | i can confirm that currently the networks list is empty for that label | 22:06 |
| clarkb | ok that is good | 22:06 |
| corvus | i can't exclude the hypothesis that it may have had multiple entries in the other config | 22:06 |
| clarkb | do you think there is value in testing what happens if we list that network twice? | 22:06 |
| corvus | yes, that's way easier than any other next steps | 22:07 |
| clarkb | ok I'll work on that in a bit | 22:07 |
| clarkb | by updating like 39 of my pasted script to list the same network twice and then check what happens | 22:07 |
| corvus | ++ | 22:07 |
| corvus | you're using the current prod clouds.yaml right? | 22:08 |
| corvus | that means that your script is telling us it's okay to have the networks listed in clouds.yaml, and then also specify the same network once in the create call (and the create call should override clouds.yaml as we expect) | 22:08 |
| Clark[m] | Yes exactly. Sorry had to step away from the computer. Kids just got home | 22:10 |
| corvus | if that's all true, then we can create a new provider (we could put it in only one tenant, or we could put it in all of them and just set the limits to 0). then we can investigate the inheritance hypothesis or others. | 22:10 |
| corvus | Clark: there was a semi-recent change that altered the inheritance mechanism. i'm not sure about the timing, but it's plausible, especially considering a possible delay before anyone noticed. https://review.opendev.org/961534 changes it again, slightly. | 22:12 |
| corvus | i think we can and should land that change, and just be aware of it as a variable. | 22:13 |
| corvus | probably the best sequence would be to: 1) confirm double network behavior with that script; 2) introduce a new provider and inspect it to see if we get double inheritance; 3) merge that change and see if the problem persists; 4) try to fix it. | 22:15 |
| clarkb | corvus: I edited networks to ['opendevzuul-network1', 'opendevzuul-network1'] and commented out the last line as we don't believe we're calling _attach_ip_to_server and ran it. This does seem to reproduce | 22:30 |
| clarkb | so ya my best guess at this point is that we're inputting multiple networks so that we get multiple nics | 22:30 |
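Given that a duplicated entry in the networks list reproduces the double NIC, one defensive client-side guard would be order-preserving deduplication before the create_server call. A sketch of just that guard (not the actual launcher fix, which lands in the config inheritance code):

```python
def dedup_networks(networks):
    """Collapse repeated network names/ids while preserving order, so a
    list like ['opendevzuul-network1', 'opendevzuul-network1'] requests
    only one NIC. dict.fromkeys keeps first-seen order in Python 3.7+."""
    return list(dict.fromkeys(networks))
```

This treats the symptom; as the discussion below concludes, the real bug was the provider config inheritance producing the duplicate list in the first place.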
| clarkb | corvus: one question about your proposed plan: we were setting the networks value in the parent section called raxflex-base. If we add a new provider that inherits from that we may not get the same behavior. DO you think we need both a new base and child provider section? | 22:32 |
| clarkb | I guess I'm not entirely sure what the safest and most reproduction-worthy approach is. I'm fairly certain we don't want to reupload images either, for example | 22:33 |
| clarkb | also any reason to keep my test node around or should I go ahead and delete it | 22:33 |
| * clarkb will make an attempt at a new pair of provider sections to test this out | 22:34 | |
| corvus | clarkb: there is a shocking amount of code dedicated to not uploading images multiple times to the same endpoints | 22:34 |
| corvus | as long as the image definitions are the same, it should detect it | 22:34 |
| clarkb | ack | 22:35 |
| clarkb | corvus: does that mean the connection: foo value should be the same in both? | 22:35 |
| corvus | i think we should create a new base provider and rely on that, so we have fidelity to what we had before | 22:35 |
| corvus | yep, same connection | 22:36 |
| corvus | that way we don't have to update zuul.conf | 22:36 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Add a test provider for rax flex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/962143 | 22:41 |
| clarkb | corvus: ^ something like that? | 22:41 |
| clarkb | I'm going to clean up my test node now as I don't think it is useful and I don't want to forget it | 22:41 |
| corvus | clarkb: that lgtm. if we don't see a problem then we should add the other regions. | 22:42 |
| corvus | i'm going to +3 that | 22:43 |
| opendevreview | Merged opendev/zuul-providers master: Add a test provider for rax flex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/962143 | 22:43 |
| clarkb | ok | 22:43 |
| corvus | ['opendevzuul-network1', 'opendevzuul-network1'] | 22:45 |
| corvus | clarkb: looks like you win :) | 22:45 |
| clarkb | wow ok that was a fun one to run down | 22:45 |
| corvus | i'm going to approve https://review.opendev.org/961534 now | 22:46 |
| clarkb | ok I guess we'll need to manually update the launchers then recheck the list entries? | 22:46 |
| corvus | yep | 22:47 |
| corvus | clarkb: i think i see the problem; and i don't think 534 will affect it either way. i'll work on a test-and-fix when i finish what i'm working on | 22:49 |
| clarkb | sounds good and thanks for helping track this down | 22:49 |
| corvus | np, thank you :) | 22:49 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!