Tuesday, 2025-11-25

dmsimard[m]clarkb: I think we wanted to talk about ara things, but it can be in open discussion01:18
fungithe boot-timeout increase for osuosl seems to have eliminated our earlier node_failure results, but my last recheck of 966200 ran into timeouts for 3 arm64 image builds (all right at 3 hours)13:44
mnasiadkafungi: that sort of reminds me of the storage problems there last time, are we sure it’s not happening again?14:27
fungientirely possible14:28
fungii'm never sure of anything14:28
fungi"Rackspace Engineers will be performing a service disrupting software release in our DFW3 data center. Planned start date and time:  2025-12-03 08:00:00 CST" (window is listed as 14 hours)15:51
fungioh, though "During this maintenance, brief packet loss on the data-plane up to three packets consecutively on all networks. As always, your Support Team will be available by phone to assist you." so maybe we don't care15:52
fungii generally assume the internet overall to be worse than that15:52
clarkbwe asked osuosl if there were issues with storage and they didn't think so. But we said we would address the image format selection (which we half did, deciding to try to get by with just the timeout and then let the automatic selection take over after the initial build) and then we also increased boot timeouts16:02
clarkblooks like that did fix the trixie builds16:02
clarkbthe latest recheck didn't fail on them16:02
fungiyeah, it's other random arm64 image builds that are timing out16:03
fungiand some of the ones that did succeed completed just under the wire16:04
fungi2h 58m 03s, 2h 52m 35s16:04
clarkband none of those would be building and converting the extra image formats since only trixie is the new one16:04
clarkbit would probably be good if someone can look at those logs and determine where the time is lost, then we can decide if we want to escalate to osuosl again with more info (keep in mind it is a holiday week in this part of the world)16:05
fungihttps://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 is a ~46-minute gap16:10
fungithat's the biggest window of silence i spot in that build by far16:11
clarkbfungi: I think that is the same step that we saw was slow last week16:12
fungisimilarly https://zuul.opendev.org/t/opendev/build/ed4e661d215a43619f8cab1aa986eb6b/log/job-output.txt#9313-931416:12
fungithose are the two that completed just shy of the 3-hour timeout16:13
clarkbwe could update dib to log some of the attributes of that transfer (total file count, file sizes, etc)16:13
fungiso slightly easier to analyze16:13
clarkblooking at the code it seems like it's literally just a single fstab file16:15
fungisame step silent for ~48 minutes on one of the builds that timed out: https://zuul.opendev.org/t/opendev/build/74472dac822b467aa8f282c3d5f15807/log/job-output.txt#9494-949516:15
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/block_device/blockdevice.py#L406 this is the origination point of that log entry I think16:15
fungithough the timeout killed the task in that case16:16
clarkbthe mkdir is measured in subsecond time as expected but for some reason the cp is much more costly16:16
fungicould it be triggering an implicit fsync?16:17
fungimaybe there's a write cache flushing16:17
clarkboh ya maybe. Looking upward in the log it seems we already have a bunch of data written to the 'built' paths and then that fstab entry follows up16:18
clarkbmaybe cp is triggering some sort of sync that we've avoided up to that point16:18
clarkbI think the fact that we see this in existing images that are only building raw images and we've increased timeouts and we seem to hit it doing the cp of a single file (which maybe causes a sync to occur) is all good feedback to osuosl particularly since it doesn't seem to be a problem 100% of the time16:19
clarkbdmsimard[m]: I'm happy to discuss ara things but I don't think anyone added it to the agenda. But yes, if we have time we can follow up after the other items16:20
clarkbfungi: https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L372-L531 this is the portion of code leading up to and including the fstab write16:46
clarkbfungi: It looks like the shell script is invoking multiple python commands to do each of the block device steps, finishing with the fstab write. I would expect that a python process exiting is sufficient to force flushes/syncs but maybe not?16:47
clarkbfungi: if we want to test this further we could add an explicit sync at https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L528 this line and see if the slowness moves to that location16:47
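For reference, a minimal sketch of the "separate the copy from the flush" experiment described above, in Python and purely illustrative (the real change would be a `sync` call in dib's disk-image-create shell driver; the paths and helper name here are made up):

```python
# Illustrative only -- not actual diskimage-builder code. Times the fstab copy
# and an explicit sync separately, so a slow flush shows up as its own step
# instead of being attributed to cp.
import os
import shutil
import time


def copy_fstab_with_timing(src, dst):
    start = time.monotonic()
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)
    copied = time.monotonic()
    os.sync()  # flush dirty page cache; if the 45+ minutes land here, the copy itself is fine
    synced = time.monotonic()
    print("copy took %.2fs, sync took %.2fs, size=%d bytes"
          % (copied - start, synced - copied, os.path.getsize(dst)))


# Hypothetical paths, for illustration only:
# copy_fstab_with_timing('/tmp/dib-build/fstab', '/tmp/dib-mount/etc/fstab')
```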
fungithat does seem likely16:50
clarkbfungi: that == python would sync or that if we add a sync we'd see it slow at that location?16:53
fungii expect exiting python would account for a flush of buffers closed when the process does17:02
clarkback so probably something about writing the fstab? or cp behavior differences?17:03
clarkbit does appear to be io related at least and we expect that fstab is quite small17:04
clarkbit is copying from one filesystem to another though17:04
fungiyeah, my guess is that exiting python started a fsync, and then the cp blocks on waiting for that to complete17:06
clarkbactually it may not be a different fs. The different fs is at /mnt not /built17:07
clarkbfungi: aha17:07
clarkbso yes this seems to point at slow io17:07
fungii suppose we could add performance monitoring and see what the i/o throughput looks like for that silent 45-ish minutes, but that may be overkill17:08
clarkbRamereth[m]: to catch you up, we went ahead and did our best to remove the extra work caused by creating raw, qcow2, and vhd images when we only need raw images. We have also increased our build job timeouts from 2 hours to 3 hours. Despite these two changes we're still able to reproduce the timeout problems in image builds and17:09
clarkbhttps://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 seems to illustrate the problem. Basically when we're writing data into the new image it is slow. fungi's theory is that we're waiting for background syncs to catch up when running the cp of the fstab from the source to the dest17:09
corvusthere should be some dstat roles already, we use them in zuul jobs at least17:09
corvusfungi: ^^ dstat msg17:09
clarkbRamereth[m]: I think we've largely isolated what is different on our end and can now say that the performance seems to have regressed independently of what we are doing.17:10
corvusfungi: https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/pre.yaml and https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/post-system-logs.yaml17:11
clarkbRamereth[m]: if you can think of any additional debugging or logging that might be useful I think we'd be happy to help.17:11
fungiyeah, adding dstat wouldn't be much work in that case, still not sure whether it's overkill for proving this one case or if we'd have longer-term use for including it in the job17:11
clarkbMaybe we should try and manually boot a few instances and do some fio benchmarks?17:11
clarkbsimilar to what I did with lists.o.o17:11
clarkbdstat is another good idea that may capture peak iops for us in the background17:11
corvusi think it'd be a good long-term addition so it's there when we need it17:15
clarkb++17:15
fungiwfm. i'm in meetings for most of my afternoon but can work on adding it this evening or tomorrow17:15
opendevreviewClark Boylan proposed opendev/system-config master: Bump Gerrit container image to 3.11  https://review.opendev.org/c/opendev/system-config/+/96834917:16
clarkbya I'm in meetings until noon today then I'm going to try and do a family lunch afterwards. When I get back my goal is to start testing and taking notes on the gerrit 3.10 -> 3.11 upgrade process as we'll do it in production using the held node17:20
clarkbthat is the last major piece of the gerrit upgrade planning process17:20
Ramereth[m]clarkb: okay, I'll see what I can do18:20
dtantsurhey folks, do we by chance have any sort of a dockerhub mirror to avoid hitting its infamous rate limits?18:58
fungidtantsur: in fact we do!19:00
clarkbdtantsur: we host some commonly used images at quay.io/opendevmirror but it isn't a generic mirror of everything19:00
fungiyeah, i guess technically we don't *have* a mirror so much as we mirror to someone else's repository that's not dockerhub19:01
dtantsurclarkb: you do have registry, which is exactly the one we need :)19:01
fungiwe have zuul jobs which can mirror images to arbitrary namespaces, though we do have a namespace on quay we use for our more general opendev dependencies too19:02
fungiclarkb is probably a little distracted running the opendev meeting that just started a few seconds ago19:02
dtantsurbasically, we're trying to start a container registry19:04
dtantsurso if you folks already maintain https://quay.io/repository/opendevmirror/registry, I guess we can use it19:04
dtantsurbtw are you planning to mirror registry:3 in addition to registry:2?19:05
* dtantsur calls it a day and will check the answer tomorrow if his client does not eat it19:06
fungidtantsur: yeah, trying to get you some examples of uses of the mirror-container-images job from the zuul standard library, opendev uses it and so do a number of openstack projects i think19:07
fungisyncing up tomorrow when we're not in a meeting might be faster19:07
clarkbI think someone proposed mirroring :3 and then decided they didn't need :3 because the issue was elsewhere19:07
clarkbbut yes I think we could mirror :3 too19:07
clarkbdtantsur: depending on what you are doing with the registry the zuul buildset registry may make sense too19:12
clarkbdtantsur: if you're building a central image then reconsuming it in subsequent jobs then the zuul buildset registry (and job) is designed to address that need19:12
fungigood for a temporary registry to put images into for one job to produce and another to consume within a buildset or dependent change series19:12
tonybthe in progress ansible-devel patches install ara from git/master again FWIW20:04
clarkbdmsimard[m]: ya I took a quick look and ara isn't in the git repo list of required projects for that job anymore. I think we ended up in a situation where latest ara and latest ansible were not compatible so we switched back to installing ara from pip in that job20:04
clarkbthe git log should have details though20:04
clarkbtonyb: ah cool20:04
clarkband now I'm popping out for lunch20:04
tonybI didn't add it as a required repo as that needed a connection to GitHub previously and that was one of the reasons we switched to pip. Now that ARA is on codeberg we could maybe have a connection?20:06
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Instrument dstat in image build jobs  https://review.opendev.org/c/opendev/zuul-providers/+/96838620:07
fungicorvus: clarkb: ^ i cribbed that from the zuul-nox playbooks20:07
fungihappy to rebase 966200 onto it if that helps20:08
fungigetting node_failure results for a lot of the arm64 builds on 968386 now (buildset still in progress), so maybe the boot-timeout increase for osuosl didn't fix it after all20:38
corvusclarkb: looks like 174.143.59.58 from https://zuul.opendev.org/t/openstack/build/44dfa5023f66491681250bad557b4bc7 might be the host you were mentioning where we lost the ip?21:10
corvushrm, that one looks like it was lost before november 19, since i see that ip being used a lot in the logs.  i'll wait for clarkb to return to see if there's some other node21:14
clarkbcorvus: it is the ip listed on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 line 621:16
clarkbcorvus: np6d8b42238b674 is the instance in rax flex dfw3. I just confirmed via openstack server list that the fip is gone21:17
clarkbcorvus: I think the logs on zl01 do confirm this is the process deleting the fip. Interestingly it seems to have attempted to do so many times21:18
clarkbzl02 doesn't seem to log it at all.21:18
corvusclarkb: do you know the build for that?21:23
clarkbcorvus: I believe it is the build you identified. But it had its original fip deleted, then I manually attached this new one21:24
corvusoooh21:24
mordredI may be out of the loop, but I feel like FIPs getting deleted unexpectedly isn't ideal21:25
corvusmordred: yeah -- here's the story: https://review.opendev.org/96789121:26
mordredoh fun!21:26
corvusclarkb: it only shows up on zl01 because of the cooperative thing the launchers do: they decided that zl01 was going to handle cleanups for this region21:27
clarkbcorvus: got it21:28
mordreddoes rax flex support the "pass in the server to create_floating_ip"? 21:28
clarkbmordred: I think that is the only way that zuul launcher supports using floating ips21:28
corvuslog output from that new log line: https://paste.opendev.org/show/bQtRqdlVru2fvwDOdubD/21:28
corvusmordred: clarkb yes/correct re server arg21:29
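For context, the cloud-layer pattern being referred to looks roughly like this. This is a sketch rather than zuul-launcher code; the cloud name and region are placeholders, the server name is the held node mentioned above, and the exact attributes on the returned object may differ by sdk version:

```python
# Sketch: passing the server to create_floating_ip so the FIP is created and
# attached in one step rather than created unattached and associated later.
import openstack

conn = openstack.connect(cloud="raxflex", region_name="DFW3")  # placeholder cloud entry
server = conn.get_server("np6d8b42238b674")  # held node name from the notes above
fip = conn.create_floating_ip(server=server, wait=True)
print(fip.floating_ip_address, fip.port_id)
```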
mordredoh - that's even more fascinating. it's using the proper way and also is only doing cleanup on one server21:29
clarkbmordred: that was one of the first things I checked when debugging this the first time around since the docstring for the cleanup method in sdk had the big warning about it21:30
clarkbI'm wondering if clouds are no longer holding whatever criteria the sdk needs to do this safely21:31
mordredI feel like anytime I bothered to write a long docstring, it must be bad :)21:31
fungior frightening, at the very least21:32
mordredI mean - neutron would have had to change the model for how ports work. 21:33
corvuslooking at the logs around that time, it deleted a bunch of fips around the same time.  so whatever happened it happened to a lot.21:33
mordredone-off occurrence or ongoing issue?21:33
corvusthis has happened at least twice?  clark would know21:34
corvusbut we're only noticing because it's happening to held nodes21:34
mordredawesome21:34
corvusit could very well happen more often and we wouldn't notice due to turnover21:34
clarkbit has happened twice now against two different held servers21:34
clarkbthough the second server was only hit the first time because I didn't manually attach a new fip21:35
mordredyeah. it would just show up as a bad node and zuul would happily delete and retry in the normal case yeah?21:35
corvusyeah due to the "unreachable" retry case21:35
clarkbthat cleanup routine runs frequently though (every time we list servers)21:35
clarkbso I think it isn't happening frequently compared to the number of total attempts. But the conditions that cause it seem to happen often enough that we notice after a few days21:35
mordredfwiw - there is an http-level tracing you can enable with python logging incantations ... but that'll get super chatty21:36
mordreddoes seem like a derpy server-side case though :( 21:37
corvusit deleted 18 fips at once.  according to the graphs, we had 17 nodes in use in raxflex-dfw3 at the time (i think we may be missing a graph for held nodes, so that would be the 18th).21:38
corvusso it looks like all the fips for that region switched to having a null value for their port at the same time.21:38
mordredwow. db level derp perhaps?21:39
fungiso maybe some bad/empty api response caused it to think they were all orphaned?21:39
mordredwe printed out other pieces of that API response though - so it wasn't an empty response21:40
corvusi think it's a single api call from the zuul/sdk perspective (we should double check sdk isn't doing a second one behind our back), so yeah, assuming that holds, i think we got a good response with bad data21:41
clarkbexciting21:41
corvusclient.list_floating_ips() is the single api call we're using for this21:41
clarkbcorvus: and the port field in that listing for each fip is null/none?21:43
clarkb(at least we infer that based on the behavior?)21:43
corvusinferring that it's falsey based on behavior: if not bool(ip.port_id):21:43
corvusmight be worth an update to the debug string to include that21:44
corvus(just in case "None" vs "0" is useful debugging)21:44
mordredchecked. it's a single GET under the covers21:46
mordredso this shouldn't be a case of anything going behind the scenes and doing additional calls to fill in missing attributes21:46
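A sketch of what the extra debug detail corvus suggests, plus a belt-and-braces guard, could look like around that single call. This is not the actual zuul-launcher code, and the "skip if everything looks unattached" guard is only an idea, not current behavior:

```python
# Sketch only: log the raw port_id value and refuse to mass-delete when the
# listing claims every FIP is unattached, which is the suspicious pattern in
# the paste above.
import logging

log = logging.getLogger("zuul.openstack")


def cleanup_leaked_fips(client, dry_run=False):
    fips = client.list_floating_ips()
    unattached = [ip for ip in fips if not ip.port_id]
    for ip in unattached:
        # %r makes "None" vs "" vs 0 visible in the log line
        log.debug("FIP %s (%s) has port_id=%r", ip.id, ip.floating_ip_address, ip.port_id)
    if fips and len(unattached) == len(fips):
        # Every FIP looking orphaned at once is more likely bad API data than
        # a real leak; skip this pass instead of deleting them all.
        log.warning("All %d floating ips appear unattached; skipping cleanup", len(fips))
        return
    for ip in unattached:
        if not dry_run:
            client.delete_floating_ip(ip.id)
```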
dmsimard[m]I have nothing useful to contribute but hi mordred long time no see :P21:46
corvusclarkb: here's the mass-deletion: https://paste.opendev.org/show/bsRarJmfCZdWpz2CMJWm/  maybe we should send that to the flex folks and let them know that based on behavior, it looks like all the fips had missing or incorrect port ids (so that it looked like none of those ips had ports)21:47
* mordred waves to dmsimard :) 21:47
mordredif you need it for followup debugging - set the keystoneauth logger to DEBUG and it'll log full request/response interactions. massive sledgehammer of course21:48
clarkbcorvus: can do. I'll include that in the email about the checksum behavior in sjc3. I'll do that as soon as my gerrit 3.11 upgrade and downgrade is done21:48
mordredthat should be security safe to do - it won't log auth creds21:48
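Roughly the incantation mordred is describing, assuming the stock keystoneauth1 logger hierarchy; it is very chatty, so it would be best scoped to a dedicated launcher as discussed next:

```python
# keystoneauth1 logs full request/response detail under the "keystoneauth"
# logger hierarchy when it is set to DEBUG. Massive sledgehammer.
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))

ksa = logging.getLogger("keystoneauth")
ksa.setLevel(logging.DEBUG)
ksa.addHandler(handler)
```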
corvusmordred: yeah, that would be quite the firehose... i think if we decide we're at that point, we probably need to make a dedicated launcher process and restrict it to just this provider (and remove the providers from the others).21:50
mordred++ agree21:50
clarkbmy test gerrit 3.10 -> 3.11 upgrade went very smoothly. It doesn't even reindex because the indexes are not updated. No soy email template changes either21:53
corvusi spot checked one of the other ips.  zuul started the cleanup for the node it was attached to after the first message about deleting the leaked fip.  (which supports the idea that the cloud just returned bad data, and not that this was some cascade resulting from zuul's deletion process)21:55
clarkbcorvus: any idea if other regions have been affected?21:56
clarkbwe know dfw3 is but I wonder about iad3 (sjc3 is disabled)21:56
corvusthe only other instances are sjc3 for 2025-11-25 03:05:49,151 DEBUG zuul.openstack.raxflex-SJC3: Deleting unattached floating ip 3000e926-9dea-40dd-9ac5-17440060e86e 66.70.103.23122:01
corvusbut we only have a day's worth of these logs22:01
corvusso far22:01
mordredI love that we're still leaking FIPs at all ;) 22:02
corvusmordred: yeah -- i think we observed that leaks still happen, right clarkb ?  or do we need to do another test?22:03
clarkbcorvus: they were happening when we first brought up raxflex22:05
clarkbcorvus: that is why we toggled the cleanup feature on. I don't know if we know that things are leaking today though22:06
clarkbinfra-root ok I have upgraded the 3.10 held node (noted in the etherpad https://etherpad.opendev.org/p/gerrit-upgrade-3.11) to 3.11 and captured notes on the process. Then I downgraded it back to 3.10 and captured notes on that process22:06
clarkbI don't see anything concerning and in fact it's maybe the easiest one we've done in terms of steps since indexes don't change22:07
clarkbI didn't test the cleanup of the large h2 cache files because it occurred to me after I upgraded that I should do that. I will try to do another pass at another time and test the cleanup of the large h2 cache files just to double check it doesn't hurt anything22:07
clarkbbut that will be another day likely22:07
clarkbcorvus: I'll draft an email to rax that roughly says we notice all checksums in glance in sjc3 don't match our checksum. This implies something is happening on the backend to modify the images. Could it be possible that process is sometimes corrupting the images? This doesn't seem to be the case in dfw3 or iad3 where the checksums do match. Additionally we've noticed that22:10
clarkbopenstacksdk's delete_unattached_floating_ips method will sometimes delete all of our floating ips in dfw3 (possibly elsewhere too, but this is where we have logs catching it happen, see this paste). We suspect this is due to the underlying list_floating_ips call returning null/falsey port_id values. This is unexpected and doesn't happen frequently, but does happen often enough that every few22:10
clarkbdays we lose floating ips22:10
clarkbcorvus: does that seem to roughly capture what we have learned and is accurate etc?22:11
clarkbinfra-root also if anyone wants to run through the test upgrade and downgrade with me (tonyb maybe?) it's a good way of building familiarity with the whole process22:13
corvusclarkb: info lgtm22:23
clarkbemail has been sent22:29
clarkbshould we go ahead and reenable raxflex-sjc3 now that we suspect things are happier overall?22:30
clarkbor do we want to leave it largely disabled until we have a better understanding of the failures in the first place?22:30
