Tuesday, 2025-11-25

dmsimard[m]clarkb: I think we wanted to talk about ara things, but it can be in open discussion01:18
fungithe boot-timeout increase for osuosl seems to have eliminated our earlier node_failure results, but my last recheck of 966200 ran into timeouts for 3 arm64 image builds (all right at 3 hours)13:44
mnasiadkafungi: that sort of reminds me of the storage problems there last time, are we sure it’s not happening again?14:27
fungientirely possible14:28
fungii'm never sure of anything14:28
fungi"Rackspace Engineers will be performing a service disrupting software release in our DFW3 data center. Planned start date and time:  2025-12-03 08:00:00 CST" (window is listed as 14 hours)15:51
fungioh, though "During this maintenance, brief packet loss on the data-plane up to three packets consecutively on all networks. As always, your Support Team will be available by phone to assist you." so maybe we don't care15:52
fungii generally assume the internet overall to be worse than that15:52
clarkbwe asked osuosl if there were issues with storage and they didn't think so. But we said we would address the image format selection (which we half did, deciding to try to get by with just the timeout and then let the automatic selection take over after the initial build) and then we also increased boot timeouts16:02
clarkblooks like that did fix the trixie builds16:02
clarkbthe latest recheck didn't fail on them16:02
fungiyeah, it's other random arm64 image builds that are timing out16:03
fungiand some of the ones that did succeed completed just under the wire16:04
fungi2h 58m 03s, 2h 52m 35s16:04
clarkband none of those would be building and converting the extra image formats since only trixie is the new one16:04
clarkbit would probably be good if someone can look at those logs and determine where the time is lost, then we can decide if we want to escalate to osuosl again with more info (keep in mind it is a holiday week in this part of the world)16:05
fungihttps://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 is a ~46-minute gap16:10
fungithat's the biggest window of silence i spot in that build by far16:11
clarkbfungi: I think that is the same step that we saw was slow last week16:12
fungisimilarly https://zuul.opendev.org/t/opendev/build/ed4e661d215a43619f8cab1aa986eb6b/log/job-output.txt#9313-931416:12
fungithose are the two that completed just shy of the 3-hour timeout16:13
clarkbwe could update dib to log some of the attributes of that transfer (total file count, file sizes, etc)16:13
fungiso slightly easier to analyze16:13
clarkblooking at the code it seems like it's literally just a single fstab file16:15
fungisame step silent for ~48 minutes on one of the builds that timed out: https://zuul.opendev.org/t/opendev/build/74472dac822b467aa8f282c3d5f15807/log/job-output.txt#9494-949516:15
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/block_device/blockdevice.py#L406 this is the origination point of that log entry I think16:15
fungithough the timeout killed the task in that case16:16
clarkbthe mkdir is measured in subsecond time as expected but for some reason the cp is much more costly16:16
fungicould it be triggering an implicit fsync?16:17
fungimaybe there's a write cache flushing16:17
clarkboh ya maybe. Looking upward in the log it seems we already have a bunch of data written to the 'built' paths and then that fstab entry follows up16:18
clarkbmaybe cp is triggering some sort of sync that we've avoided up to that point16:18
clarkbI think the fact that we see this in existing images that are only building raw images and we've increased timeouts and we seem to hit it doing the cp of a single file (which maybe causes a sync to occur) is all good feedback to osuosl particularly since it doesn't seem to be a problem 100% of the time16:19
clarkbdmsimard[m]: I'm happy to discuss ara things but I don't think anyone added it to the agenda. But yes, if we have time we can follow up after the other items16:20
clarkbfungi: https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L372-L531 this is the portion of code leading up to and including the fstab write16:46
clarkbfungi: It looks like the shell script is invoking multiple python commands to do each of the block device steps, finishing with the fstab write. I would expect that a python process exiting is sufficient to force flushes/syncs but maybe not?16:47
clarkbfungi: if we want to test this further we could add an explicit sync at https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L528 this line and see if the slowness moves to that location16:47
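For reference, a minimal sketch of the "separate the copy from the flush" experiment described above, in Python and purely illustrative (the real change would be a `sync` call in dib's disk-image-create shell driver; the paths and helper name here are made up):

```python
# Illustrative only -- not actual diskimage-builder code. Times the fstab copy
# and an explicit sync separately, so a slow flush shows up as its own step
# instead of being attributed to cp.
import os
import shutil
import time


def copy_fstab_with_timing(src, dst):
    start = time.monotonic()
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)
    copied = time.monotonic()
    os.sync()  # flush dirty page cache; if the 45+ minutes land here, the copy itself is fine
    synced = time.monotonic()
    print("copy took %.2fs, sync took %.2fs, size=%d bytes"
          % (copied - start, synced - copied, os.path.getsize(dst)))


# Hypothetical paths, for illustration only:
# copy_fstab_with_timing('/tmp/dib-build/fstab', '/tmp/dib-mount/etc/fstab')
```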
fungithat does seem likely16:50
clarkbfungi: that == python would sync or that if we add a sync we'd see it slow at that location?16:53
fungii expect exiting python would account for a flush of buffers closed when the process does17:02
clarkback so probably something about writing the fstab? or cp behavior differences?17:03
clarkbit does appear to be io related at least and we expect that fstab is quite small17:04
clarkbit is copying from one filesystem to another though17:04
fungiyeah, my guess is that exiting python started a fsync, and then the cp blocks on waiting for that to complete17:06
clarkbactually it may not be a different fs. The different fs is at /mnt not /built17:07
clarkbfungi: aha17:07
clarkbso yes this seems to point at slow io17:07
fungii suppose we could add performance monitoring and see what the i/o throughput looks like for that silent 45-ish minutes, but that may be overkill17:08
clarkbRamereth[m]: to catch you up, we went ahead and did our best to remove the extra work caused by creating raw, qcow2, and vhd images when we only need raw images. We have also increased our build job timeouts from 2 hours to 3 hours. Despite these two changes we're still able to reproduce the timeout problems in image builds and17:09
clarkbhttps://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 seems to illustrate the problem. Basically when we're writing data into the new image it is slow. fungi's theory is that we're waiting for background syncs to catch up when running the cp of the fstab from the source to the dest17:09
corvusthere should be some dstat roles already, we use them in zuul jobs at least17:09
corvusfungi: ^^ dstat msg17:09
clarkbRamereth[m]: I think we've largely isolated what is different on our end and can now say that the performance seems to have regressed independently of what we are doing.17:10
corvusfungi: https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/pre.yaml and https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/post-system-logs.yaml17:11
clarkbRamereth[m]: if you can think of any additional debugging or logging that might be useful I think we'd be happy to help.17:11
fungiyeah, adding dstat wouldn't be much work in that case, still not sure whether it's overkill for proving this one case or if we'd have longer-term use for including it in the job17:11
clarkbMaybe we should try and manually boot a few instances and do some fio benchmarks?17:11
clarkbsimilar to what I did with lists.o.o17:11
clarkbdstat is another good idea that may capture peak iops for us in the background17:11
corvusi think it'd be a good long-term addition so it's there when we need it17:15
clarkb++17:15
fungiwfm. i'm in meetings for most of my afternoon but can work on adding it this evening or tomorrow17:15
opendevreviewClark Boylan proposed opendev/system-config master: Bump Gerrit container image to 3.11  https://review.opendev.org/c/opendev/system-config/+/96834917:16
clarkbya I'm in meetings until noon today then I'm going to try and do a family lunch afterwards. When I get back my goal is to start testing and taking notes on the gerrit 3.10 -> 3.11 upgrade process as we'll do it in production using the held node17:20
clarkbthat is the last major piece of the gerrit upgrade planning process17:20
Ramereth[m]clarkb: okay, I'll see what I can do18:20
dtantsurhey folks, do we by chance have any sort of a dockerhub mirror to avoid hitting its infamous rate limits?18:58
fungidtantsur: in fact we do!19:00
clarkbdtantsur: we host some commonly used images at quay.io/opendevmirror but it isn't a generic mirror of everything19:00
fungiyeah, i guess technically we don't *have* a mirror so much as we mirror to someone else's repository that's not dockerhub19:01
dtantsurclarkb: you do have registry, which is exactly the one we need :)19:01
fungiwe have zuul jobs which can mirror images to arbitrary namespaces, though we do have a namespace on quay we use for our more general opendev dependencies too19:02
fungiclarkb is probably a little distracted running the opendev meeting that just started a few seconds ago19:02
dtantsurbasically, we're trying to start a container registry19:04
dtantsurso if you folks already maintain https://quay.io/repository/opendevmirror/registry, I guess we can use it19:04
dtantsurbtw are you planning to mirror registry:3 in addition to registry:2?19:05
* dtantsur calls it a day and will check the answer tomorrow if his client does not eat it19:06
fungidtantsur: yeah, trying to get you some examples of uses of the mirror-container-images job from the zuul standard library, opendev uses it and so do a number of openstack projects i think19:07
fungisyncing up tomorrow when we're not in a meeting might be faster19:07
clarkbI think someone proposed mirroring :3 and then decided they didn't need :3 because the issue was elsewhere19:07
clarkbbut yes I think we could mirror :3 too19:07
clarkbdtantsur: depending on what you are doing with the registry the zuul buildset registry may make sense too19:12
clarkbdtantsur: if you're building a central image then reconsuming it in subsequent jobs then the zuul buildset registry (and job) is designed to address that need19:12
fungigood for a temporary registry to put images into for one job to produce and another to consume within a buildset or dependent change series19:12
tonybthe in progress ansible-devel patches install ara from git/master again FWIW20:04
clarkbdmsimard[m]: ya I took a quick look and ara isn't in the git repo list of required projects for that job anymore. I think we ended up in a situation where latest ara and latest ansible were not compatible so we switched back to installing ara from pip in that job20:04
clarkbthe git log should have details though20:04
clarkbtonyb: ah cool20:04
clarkband now I'm popping out for lunch20:04
tonybI didn't add it as a required repo as that needed a connection to GitHub previously and that was one of the reasons we switched to pip. Now that ARA is on codeberg we could maybe have a connection?20:06
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Instrument dstat in image build jobs  https://review.opendev.org/c/opendev/zuul-providers/+/96838620:07
fungicorvus: clarkb: ^ i cribbed that from the zuul-nox playbooks20:07
fungihappy to rebase 966200 onto it if that helps20:08
fungigetting node_failure results for a lot of the arm64 builds on 968386 now (buildset still in progress), so maybe the boot-timeout increase for osuosl didn't fix it after all20:38
corvusclarkb: looks like 174.143.59.58 from https://zuul.opendev.org/t/openstack/build/44dfa5023f66491681250bad557b4bc7 might be the host you were mentioning where we lost the ip?21:10
corvushrm, that one looks like it was lost before november 19, since i see that ip being used a lot in the logs.  i'll wait for clarkb to return to see if there's some other node21:14
clarkbcorvus: it is the ip listed on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 line 621:16
clarkbcorvus: np6d8b42238b674 is the instance in rax flex dfw3. I just confirmed via openstack server list that the fip is gone21:17
clarkbcorvus: I think the logs on zl01 do confirm this is the process deleting the fip. Interestingly it seems to have attempted to do so many times21:18
clarkbzl02 doesn't seem to log it at all.21:18
corvusclarkb: do you know the build for that?21:23
clarkbcorvus: I believe it is the build you identified. But it had its original fip deleted, then I manually attached this new one21:24
corvusoooh21:24
mordredI may be out of the loop, but I feel like FIPs getting deleted unexpectedly isn't ideal21:25
corvusmordred: yeah -- here's the story: https://review.opendev.org/96789121:26
mordredoh fun!21:26
corvusclarkb: it only shows up on zl01 because of the cooperative thing the launchers do: they decided that zl01 was going to handle cleanups for this region21:27
clarkbcorvus: got it21:28
mordreddoes rax flex support the "pass in the server to create_floating_ip"? 21:28
clarkbmordred: I think that is the only way that zuul launcher supports using floating ips21:28
corvuslog output from that new log line: https://paste.opendev.org/show/bQtRqdlVru2fvwDOdubD/21:28
corvusmordred: clarkb yes/correct re server arg21:29
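For context, the cloud-layer pattern being referred to looks roughly like this. This is a sketch rather than zuul-launcher code; the cloud name and region are placeholders, the server name is the held node mentioned above, and the exact attributes on the returned object may differ by sdk version:

```python
# Sketch: passing the server to create_floating_ip so the FIP is created and
# attached in one step rather than created unattached and associated later.
import openstack

conn = openstack.connect(cloud="raxflex", region_name="DFW3")  # placeholder cloud entry
server = conn.get_server("np6d8b42238b674")  # held node name from the notes above
fip = conn.create_floating_ip(server=server, wait=True)
print(fip.floating_ip_address, fip.port_id)
```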
mordredoh - that's even more fascinating. it's using the proper way and also is only doing cleanup on one server21:29
clarkbmordred: that was one of the first things I checked when debugging this the first time around since the docstring for the cleanup method in sdk had the big warning about it21:30
clarkbI'm wondering if clouds are no longer holding whatever criteria the sdk needs to do this safely21:31
mordredI feel like anytime I bothered to write a long docstring, it must be bad :)21:31
fungior frightening, at the very least21:32
mordredI mean - neutron would have had to change the model for how ports work. 21:33
corvuslooking at the logs around that time, it deleted a bunch of fips around the same time.  so whatever happened it happened to a lot.21:33
mordredone-off occurrence or ongoing issue?21:33
corvusthis has happened at least twice?  clark would know21:34
corvusbut we're only noticing because it's happening to held nodes21:34
mordredawesome21:34
corvusit could very well happen more often and we wouldn't notice due to turnover21:34
clarkbit has happened twice now against two different held servers21:34
clarkbthough the second server was only hit the first time because I didn't manually attach a new fip21:35
mordredyeah. it would just show up as a bad node and zuul would happily delete and retry in the normal case yeah?21:35
corvusyeah due to the "unreachable" retry case21:35
clarkbthat cleanup routine runs frequently though (every time we list servers)21:35
clarkbso I think it isn't happening frequently compared to the number of total attempts. But the conditions that cause it seem to happen often enough that we notice after a few days21:35
mordredfwiw - there is an http-level tracing you can enable with python logging incantations ... but that'll get super chatty21:36
mordreddoes seem like a derpy server-side case though :( 21:37
corvusit deleted 18 fips at once.  according to the graphs, we had 17 nodes in use in raxflex-dfw3 at the time (i think we may be missing a graph for held nodes, so that would be the 18th).21:38
corvusso it looks like all the fips for that region switched to having a null value for their port at the same time.21:38
mordredwow. db level derp perhaps?21:39
fungiso maybe some bad/empty api response caused it to think they were all orphaned?21:39
mordredwe printed out other pieces of that API response though - so it wasn't an empty response21:40
corvusi think it's a single api call from the zuul/sdk perspective (we should double check sdk isn't doing a second one behind our back), so yeah, assuming that holds, i think we got a good response with bad data21:41
clarkbexciting21:41
corvusclient.list_floating_ips() is the single api call we're using for this21:41
clarkbcorvus: and the port field in that listing for each fip is null/none?21:43
clarkb(at least we infer that based on the behavior?)21:43
corvusinferring that it's falsey based on behavior: if not bool(ip.port_id):21:43
corvusmight be worth an update to the debug string to include that21:44
corvus(just in case "None" vs "0" is useful debugging)21:44
mordredchecked. it's a single GET under the covers21:46
mordredso this shouldn't be a case of anything going behind the scenes and doing additional calls to fill in missing attributes21:46
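A sketch of what the extra debug detail corvus suggests, plus a belt-and-braces guard, could look like around that single call. This is not the actual zuul-launcher code, and the "skip if everything looks unattached" guard is only an idea, not current behavior:

```python
# Sketch only: log the raw port_id value and refuse to mass-delete when the
# listing claims every FIP is unattached, which is the suspicious pattern in
# the paste above.
import logging

log = logging.getLogger("zuul.openstack")


def cleanup_leaked_fips(client, dry_run=False):
    fips = client.list_floating_ips()
    unattached = [ip for ip in fips if not ip.port_id]
    for ip in unattached:
        # %r makes "None" vs "" vs 0 visible in the log line
        log.debug("FIP %s (%s) has port_id=%r", ip.id, ip.floating_ip_address, ip.port_id)
    if fips and len(unattached) == len(fips):
        # Every FIP looking orphaned at once is more likely bad API data than
        # a real leak; skip this pass instead of deleting them all.
        log.warning("All %d floating ips appear unattached; skipping cleanup", len(fips))
        return
    for ip in unattached:
        if not dry_run:
            client.delete_floating_ip(ip.id)
```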
dmsimard[m]I have nothing useful to contribute but hi mordred long time no see :P21:46
corvusclarkb: here's the mass-deletion: https://paste.opendev.org/show/bsRarJmfCZdWpz2CMJWm/  maybe we should send that to the flex folks and let them know that based on behavior, it looks like all the fips had missing or incorrect port ids (so that it looked like none of those ips had ports)21:47
* mordred waves to dmsimard :) 21:47
mordredif you need it for followup debugging - set the keystoneauth logger to DEBUG and it'll log full request/response interactions. massive sledgehammer of course21:48
clarkbcorvus: can do. I'll include that in the email about the checksum behavior in sjc3. I'll do that as soon as my gerrit 3.11 upgrade and downgrade is done21:48
mordredthat should be security safe to do - it won't log auth creds21:48
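Roughly the incantation mordred is describing, assuming the stock keystoneauth1 logger hierarchy; it is very chatty, so it would be best scoped to a dedicated launcher as discussed next:

```python
# keystoneauth1 logs full request/response detail under the "keystoneauth"
# logger hierarchy when it is set to DEBUG. Massive sledgehammer.
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))

ksa = logging.getLogger("keystoneauth")
ksa.setLevel(logging.DEBUG)
ksa.addHandler(handler)
```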
corvusmordred: yeah, that would be quite the firehose... i think if we decide we're at that point, we probably need to make a dedicated launcher process and restrict it to just this provider (and remove the providers from the others).21:50
mordred++ agree21:50
clarkbmy test gerrit 3.10 -> 3.11 upgrade went very smoothly. It doesn't even reindex because the indexes are not updated. No soy email template changes either21:53
corvusi spot checked one of the other ips.  zuul started the cleanup for the node it was attached to after the first message about deleting the leaked fip.  (which supports the idea that the cloud just returned bad data, and not that this was some cascade resulting from zuul's deletion process)21:55
clarkbcorvus: any idea if other regions have been affected?21:56
clarkbwe know dfw3 is but I wonder about iad3 (sjc3 is disabled)21:56
corvusthe only other instances are sjc3 for 2025-11-25 03:05:49,151 DEBUG zuul.openstack.raxflex-SJC3: Deleting unattached floating ip 3000e926-9dea-40dd-9ac5-17440060e86e 66.70.103.23122:01
corvusbut we only have a day's worth of these logs22:01
corvusso far22:01
mordredI love that we're still leaking FIPs at all ;) 22:02
corvusmordred: yeah -- i think we observed that leaks still happen, right clarkb ?  or do we need to do another test?22:03
clarkbcorvus: they were happening when we first brought up raxflex22:05
clarkbcorvus: that is why we toggled the cleanup feature on. I don't know if we know that things are leaking today though22:06
clarkbinfra-root ok I have upgraded the 3.10 held node (noted in the etherpad https://etherpad.opendev.org/p/gerrit-upgrade-3.11) to 3.11 and captured notes on the process. Then I downgraded it back to 3.10 and captured notes on that process22:06
clarkbI don't see anything concerning and in fact it's maybe the easiest one we've done in terms of steps since indexes don't change22:07
clarkbI didn't test the cleanup of the large h2 cache files because it occurred to me after I upgraded that I should do that. I will try to do another pass at another time and test the cleanup of the large h2 cache files just to double check it doesn't hurt anything22:07
clarkbbut that will be another day likely22:07
clarkbcorvus: I'll draft an email to rax that roughly says we notice all checksums in glance in sjc3 don't match our checksum. This implies something is happening on the backend to modify the images. Could it be possible that process is sometimes corrupting the images? This doesn't seem to be the case in dfw3 or iad3 where the checksums do match. Additionally we've noticed that22:10
clarkbopenstacksdk's delete_unattached_floating_ips method will sometimes delete all of our floating ips in dfw3 (possibly elsewhere too, but this is where we have logs catching it happen, see this paste). We suspect this is due to the underlying list_floating_ips call returning null/falsey port_id values. This is unexpected and doesn't happen frequently, but does happen often enough that every few22:10
clarkbdays we lose floating ips22:10
clarkbcorvus: does that seem to roughly capture what we have learned and is accurate etc?22:11
clarkbinfra-root also if anyone wants to run through the test upgrade and downgrade with me (tonyb maybe?) it's a good way of building familiarity with the whole process22:13
corvusclarkb: info lgtm22:23
clarkbemail has been sent22:29
clarkbshould we go ahead and reenable raxflex-sjc3 now that we suspect things are happier overall?22:30
clarkbor do we want to leave it largely disabled until we have a better understanding of the failures in the first place?22:30
