| dmsimard[m] | clarkb: I think we wanted to talk about ara things, but it can be in open discussion | 01:18 |
|---|---|---|
| fungi | the boot-timeout increase for osuosl seems to have eliminated our earlier node_failure results, but my last recheck of 966200 ran into timeouts for 3 arm64 image builds (all right at 3 hours) | 13:44 |
| mnasiadka | fungi: that sort of reminds me of the storage problems there last time, are we sure it’s not happening again? | 14:27 |
| fungi | entirely possible | 14:28 |
| fungi | i'm never sure of anything | 14:28 |
| fungi | "Rackspace Engineers will be performing a service disrupting software release in our DFW3 data center. Planned start date and time: 2025-12-03 08:00:00 CST" (window is listed as 14 hours) | 15:51 |
| fungi | oh, though "During this maintenance, brief packet loss on the data-plane up to three packets consecutively on all networks. As always, your Support Team will be available by phone to assist you." so maybe we don't care | 15:52 |
| fungi | i generally assume the internet overall to be worse than that | 15:52 |
| clarkb | we asked osuosl if there were issues with storage and they didn't think so. But we said we would address the image format selection (which we half did, deciding to try to get by with just the timeout and then let the automatic selection take over after the initial build) and then we also increased boot timeouts | 16:02 |
| clarkb | looks like that did fix the trixie builds | 16:02 |
| clarkb | the latest recheck didn't fail on them | 16:02 |
| fungi | yeah, it's other random arm64 image builds that are timing out | 16:03 |
| fungi | and some of the ones that did succeed completed just under the wire | 16:04 |
| fungi | 2h 58m 03s, 2h 52m 35s | 16:04 |
| clarkb | and none of those would be building and converting the extra image formats since only trixie is the new one | 16:04 |
| clarkb | it would probably be good if someone can look at those logs and determine where the time is lost, then we can decide if we want to escalate to osuosl again with more info (keep in mind it is a holiday week in this part of the world) | 16:05 |
| fungi | https://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 is a ~46-minute gap | 16:10 |
| fungi | that's the biggest window of silence i spot in that build by far | 16:11 |
| clarkb | fungi: I think that is the same step that we saw was slow last week | 16:12 |
| fungi | similarly https://zuul.opendev.org/t/opendev/build/ed4e661d215a43619f8cab1aa986eb6b/log/job-output.txt#9313-9314 | 16:12 |
| fungi | those are the two that completed just shy of the 3-hour timeout | 16:13 |
| clarkb | we could update dib to log some of the attributes of that transfer (total file count, file sizes, etc) | 16:13 |
| fungi | so slightly easier to analyze | 16:13 |
| clarkb | looking at the code it seems like its literally just a single fstab file | 16:15 |
| fungi | same step silent for ~48 minutes on one of the builds that timed out: https://zuul.opendev.org/t/opendev/build/74472dac822b467aa8f282c3d5f15807/log/job-output.txt#9494-9495 | 16:15 |
| clarkb | https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/block_device/blockdevice.py#L406 this is the origination point of that log entry I think | 16:15 |
| fungi | though the timeout killed the task in that case | 16:16 |
| clarkb | the mkdir is measured in subsecond time as expected but for some reason the cp is much more costly | 16:16 |
| fungi | could it be triggering an implicit fsync? | 16:17 |
| fungi | maybe there's a write cache flushing | 16:17 |
| clarkb | oh ya maybe. Looking upward in the log it seems we already have a bunch of data written to the 'built' paths and then that fstab entry follows up | 16:18 |
| clarkb | maybe cp is triggering some sort of sync that we've avoided up to that point | 16:18 |
| clarkb | I think the fact that we see this in existing images that only build raw images, that we've increased timeouts, and that we seem to hit it doing the cp of a single file (which maybe causes a sync to occur) is all good feedback for osuosl, particularly since it doesn't seem to be a problem 100% of the time | 16:19 |
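A minimal sketch of the extra logging clarkb floats at 16:13, recording the size and wall-clock time of the copy so a slow backing store shows up in the job output; the helper and logger names are illustrative rather than actual diskimage-builder code:

```python
# Hypothetical helper, not actual diskimage-builder code: copy a file and log
# how long the copy took.
import logging
import os
import shutil
import time

logger = logging.getLogger(__name__)


def copy_with_timing(src, dst):
    size = os.path.getsize(src)
    start = time.monotonic()
    # shutil.copy2 can block for a long time if the filesystem is busy
    # flushing dirty pages (the suspected write-cache/fsync stall).
    shutil.copy2(src, dst)
    elapsed = time.monotonic() - start
    logger.info("copied %s (%d bytes) to %s in %.2fs", src, size, dst, elapsed)
```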
| clarkb | dmsimard[m]: I'm happy to discuss ara things but I don't think anyone added it to the agenda. But yes if we have time we can follow up after the other items | 16:20 |
| clarkb | fungi: https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L372-L531 this is the portion of code leading up to and including the fstab write | 16:46 |
| clarkb | fungi: It looks like the shell script is invoking multiple python commands to do each of the block device steps, finishing with the fstab write. I would expect that a python process exiting is sufficient to force flushes/syncs, but maybe not? | 16:47 |
| clarkb | fungi: if we want to test this further we could add an explicit sync at https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/disk-image-create#L528 and see if the slowness moves to that location | 16:47 |
| fungi | that does seem likely | 16:50 |
| clarkb | fungi: that == python would sync or that if we add a sync we'd see it slow at that location? | 16:53 |
| fungi | i expect exiting python would account for a flush of the buffers that are closed when the process exits | 17:02 |
| clarkb | ack so probably something about writing the fstab? or cp behavior differences? | 17:03 |
| clarkb | it does appear to be io related at least and we expect that fstab is quite small | 17:04 |
| clarkb | it is copying from one filesystem to another though | 17:04 |
| fungi | yeah, my guess is that exiting python started a fsync, and then the cp blocks on waiting for that to complete | 17:06 |
| clarkb | actually it may not be a different fs. The different fs is at /mnt not /built | 17:07 |
| clarkb | fungi: aha | 17:07 |
| clarkb | so yes this seems to point at slow io | 17:07 |
| fungi | i suppose we could add performance monitoring and see what the i/o throughput looks like for that silent 45-ish minutes, but that may be overkill | 17:08 |
| clarkb | Ramereth[m]: to catch you up, we went ahead and did our best to remove the extra work caused by creating raw, qcow2, and vhd images when we only need raw images. We have also increased our build job timeouts from 2 hours to 3 hours. Despite these two changes we're still able to reproduce the timeout problems in image builds and | 17:09 |
| clarkb | https://zuul.opendev.org/t/opendev/build/d6c2d278bbdd477fb21a0ed613c8b5c6/log/job-output.txt#8892-8893 seems to illustrate the problem. Basically when we're writing data into the new image it is slow. fungi's theory is that we're waiting for background syncs to catch up when running the cp of the fstab from the source to the dest | 17:09 |
| corvus | there should be some dstat roles already, we use them in zuul jobs at least | 17:09 |
| corvus | fungi: ^^ dstat msg | 17:09 |
| clarkb | Ramereth[m]: I think we've largely isolated what is different on our end and can now say that the performance seems to have regressed independently of what we are doing. | 17:10 |
| corvus | fungi: https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/pre.yaml and https://opendev.org/zuul/zuul/src/branch/master/playbooks/zuul-nox/post-system-logs.yaml | 17:11 |
| clarkb | Ramereth[m]: if you can think of any additional debugging or logging that might be useful I think we'd be happy to help. | 17:11 |
| fungi | yeah, adding dstat wouldn't be much work in that case, still not sure whether it's overkill for proving this one case or if we'd have longer-term use for including it in the job | 17:11 |
| clarkb | Maybe we should try and manually boot a few instances and do some fio benchmarks? | 17:11 |
| clarkb | similar to what I did with lists.o.o | 17:11 |
| clarkb | dstat is another good idea that may capture peak iops for us in the background | 17:11 |
| corvus | i think it'd be a good long-term addition so it's there when we need it | 17:15 |
| clarkb | ++ | 17:15 |
| fungi | wfm. i'm in meetings for most of my afternoon but can work on adding it this evening or tomorrow | 17:15 |
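A rough sketch of the manual fio benchmark clarkb suggests at 17:11, assuming fio is installed on a test instance booted in the affected region; the job parameters below are illustrative, not a prescribed test plan:

```python
# Run a short random-write fio job and print the measured IOPS.
# Assumes fio is installed on the test node; parameters are illustrative.
import json
import subprocess

cmd = [
    "fio",
    "--name=randwrite",
    "--rw=randwrite",        # random writes are the pattern most sensitive to slow storage
    "--bs=4k",
    "--size=1G",
    "--runtime=60",
    "--time_based",
    "--direct=1",            # bypass the page cache so the device itself is measured
    "--ioengine=libaio",
    "--output-format=json",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
stats = json.loads(result.stdout)
print("write iops:", stats["jobs"][0]["write"]["iops"])
```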
| opendevreview | Clark Boylan proposed opendev/system-config master: Bump Gerrit container image to 3.11 https://review.opendev.org/c/opendev/system-config/+/968349 | 17:16 |
| clarkb | ya I'm in meetings until noon today then I'm going to try and do a family lunch afterwards. When I get back my goal is to start testing and taking notes on the gerrit 3.10 -> 3.11 upgrade process as we'll do it in production using the held node | 17:20 |
| clarkb | that is the last major piece of the gerrit upgrade planning process | 17:20 |
| Ramereth[m] | clarkb: okay, I'll see what I can do | 18:20 |
| dtantsur | hey folks, do we by chance have any sort of a dockerhub mirror to avoid hitting its infamous rate limits? | 18:58 |
| fungi | dtantsur: in fact we do! | 19:00 |
| clarkb | dtantsur: we host some commonly used images at quay.io/opendevmirror but it isn't a generic mirror of everything | 19:00 |
| fungi | yeah, i guess technically we don't *have* a mirror so much as we mirror to someone else's repository that's not dockerhub | 19:01 |
| dtantsur | clarkb: you do have registry, which is exactly the one we need :) | 19:01 |
| fungi | we have zuul jobs which can mirror images to arbitrary namespaces, though we do have a namespace on quay we use for our more general opendev dependencies too | 19:02 |
| fungi | clarkb is probably a little distracted running the opendev meeting that just started a few seconds ago | 19:02 |
| dtantsur | basically, we're trying to start a container registry | 19:04 |
| dtantsur | so if you folks already maintain https://quay.io/repository/opendevmirror/registry, I guess we can use it | 19:04 |
| dtantsur | btw are you planning to mirror registry:3 in addition to registry:2? | 19:05 |
| * dtantsur | calls it a day and will check the answer tomorrow if his client does not eat it | 19:06 |
| fungi | dtantsur: yeah, trying to get you some examples of uses of the mirror-container-images job from the zuul standard library, opendev uses it and so do a number of openstack projects i think | 19:07 |
| fungi | syncing up tomorrow when we're not in a meeting might be faster | 19:07 |
| clarkb | I think someone proposed mirroring :3 and then decided they didn't need :3 because the issue was elsewhere | 19:07 |
| clarkb | but yes I think we could mirror :3 too | 19:07 |
| clarkb | dtantsur: depending on what you are doing with the registry the zuul buildset registry may make sense too | 19:12 |
| clarkb | dtantsur: if you're building a central image then reconsuming it in subsequent jobs then the zuul buildset registry (and job) is designed to address that need | 19:12 |
| fungi | good for a temporary registry to put images into for one job to produce and another to consume within a buildset or dependent change series | 19:12 |
| tonyb | the in progress ansible-devel patches install ara from git/master again FWIW | 20:04 |
| clarkb | dmsimard[m]: ya I took a quick look and ara isn't in the git repo list of required projects for that job anymore. I think we ended up in a situation where latest ara and latest ansible were not compatible so we switched back to installing ara from pip in that job | 20:04 |
| clarkb | the git log should have details though | 20:04 |
| clarkb | tonyb: ah cool | 20:04 |
| clarkb | and now I'm popping out for lunch | 20:04 |
| tonyb | I didn't add it as a required repo as that needed a connection to GitHub previously and that was one of the reasons we switched to pip. now that ARA is on codeberg we could maybe have a connection? | 20:06 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Instrument dstat in image build jobs https://review.opendev.org/c/opendev/zuul-providers/+/968386 | 20:07 |
| fungi | corvus: clarkb: ^ i cribbed that from the zuul-nox playbooks | 20:07 |
| fungi | happy to rebase 966200 onto it if that helps | 20:08 |
| fungi | getting node_failure results for a lot of the arm64 builds on 968386 now (buildset still in progress), so maybe the boot-timeout increase for osuosl didn't fix it after all | 20:38 |
| corvus | clarkb: looks like 174.143.59.58 from https://zuul.opendev.org/t/openstack/build/44dfa5023f66491681250bad557b4bc7 might be the host you were mentioning where we lost the ip? | 21:10 |
| corvus | hrm, that one looks like it was lost before november 19, since i see that ip being used a lot in the logs. i'll wait for clarkb to return to see if there's some other node | 21:14 |
| clarkb | corvus: it is the ip listed on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 line 6 | 21:16 |
| clarkb | corvus: np6d8b42238b674 is the instance in rax flex dfw3. I just confirmed via openstack server list that the fip is gone | 21:17 |
| clarkb | corvus: I think the logs on zl01 do confirm this is the process deleting the fip. Interestingly it seems to have attempted to do so many times | 21:18 |
| clarkb | zl02 doesn't seem to log it at all. | 21:18 |
| corvus | clarkb: do you know the build for that? | 21:23 |
| clarkb | corvus: I believe it is the build you identified. But it had its original fip deleted then I manually attached this new one | 21:24 |
| corvus | oooh | 21:24 |
| mordred | I may be out of the loop, but I feel like FIPs getting deleted unexpectedly isn't ideal | 21:25 |
| corvus | mordred: yeah -- here's the story: https://review.opendev.org/967891 | 21:26 |
| mordred | oh fun! | 21:26 |
| corvus | clarkb: it only shows up on zl01 because of the cooperative thing the launchers do: they decided that zl01 was going to handle cleanups for this region | 21:27 |
| clarkb | corvus: got it | 21:28 |
| mordred | does rax flex support the "pass in the server to create_floating_ip"? | 21:28 |
| clarkb | mordred: I think that is the only way that zuul launcher supports using floating ips | 21:28 |
| corvus | log output from that new log line: https://paste.opendev.org/show/bQtRqdlVru2fvwDOdubD/ | 21:28 |
| corvus | mordred: clarkb yes/correct re server arg | 21:29 |
| mordred | oh - that's even more fascinating. it's using the proper way and also is only doing cleanup on one server | 21:29 |
| clarkb | mordred: that was one of the first things I checked when debugging this the first time around since the docstring for the cleanup method in sdk had the big warning about it | 21:30 |
| clarkb | I'm wondering if clouds are no longer holding whatever criteria the sdk needs to do this safely | 21:31 |
| mordred | I feel like anytime I bothered to write a long docstring, it must be bad :) | 21:31 |
| fungi | or frightening, at the very least | 21:32 |
| mordred | I mean - neutron would have had to change the model for how ports work. | 21:33 |
| corvus | looking at the logs around that time, it deleted a bunch of fips around the same time. so whatever happened it happened to a lot. | 21:33 |
| mordred | one off occurrence or ongoing issue? | 21:33 |
| corvus | this has happened at least twice? clark would know | 21:34 |
| corvus | but we're only noticing because it's happening to held nodes | 21:34 |
| mordred | awesome | 21:34 |
| corvus | it could very well happen more often and we wouldn't notice due to turnover | 21:34 |
| clarkb | it has happened twice now against two different held servers | 21:34 |
| clarkb | though the second server was only hit the first time because I didn't manually attach a new fip | 21:35 |
| mordred | yeah. it would just show up as a bad node and zuul would happily delete and retry in the normal case yeah? | 21:35 |
| corvus | yeah due to the "unreachable" retry case | 21:35 |
| clarkb | that cleanup routine runs frequently though (every time we list servers) | 21:35 |
| clarkb | so I think it isn't happening frequently compared to the number of total attempts. But the conditions that cause it seem to happen often enough that we notice after a few days | 21:35 |
| mordred | fwiw - there is an http-level tracing you can enable with python logging incantations ... but that'll get super chatty | 21:36 |
| mordred | does seem like a derpy server-side case though :( | 21:37 |
| corvus | it deleted 18 fips at once. according to the graphs, we had 17 nodes in use in raxflex-dfw3 at the time (i think we may be missing a graph for held nodes, so that would be the 18th). | 21:38 |
| corvus | so it looks like all the fips for that region switched to having a null value for their port at the same time. | 21:38 |
| mordred | wow. db level derp perhaps? | 21:39 |
| fungi | so maybe some bad/empty api response caused it to think they were all orphaned? | 21:39 |
| mordred | we printed out other pieces of that API response though - so it wasn't an empty | 21:40 |
| corvus | i think it's a single api call from the zuul/sdk perspective (we should double check sdk isn't doing a second one behind our back), so yeah, assuming that holds, i think we got a good response with bad data | 21:41 |
| clarkb | exciting | 21:41 |
| corvus | client.list_floating_ips() is the single api call we're using for this | 21:41 |
| clarkb | corvus: and the port field in that listing for each fip is null/none? | 21:43 |
| clarkb | (at least we infer that based on the behavior?) | 21:43 |
| corvus | inferring that it's falsey based on behavior: if not bool(ip.port_id): | 21:43 |
| corvus | might be worth an update to the debug string to include that | 21:44 |
| corvus | (just in case "None" vs "0" is useful debugging) | 21:44 |
| mordred | checked. it's a single GET under the covers | 21:46 |
| mordred | so this shouldn't be a case of anything going behind the scenes and doing additional calls to fill in missing attributes | 21:46 |
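A rough approximation of the cleanup check being discussed, not Zuul's actual implementation: list the floating IPs with the single call corvus names at 21:41 and treat any with a falsey port_id as leaked. A response with null port_id values would therefore make every floating IP look unattached. The cloud name below is illustrative.

```python
# Rough approximation of the leak cleanup under discussion; not Zuul's code.
import openstack

conn = openstack.connect(cloud="raxflex-dfw3")  # cloud name is illustrative

for ip in conn.list_floating_ips():
    if not bool(ip.port_id):  # falsey port_id is treated as an unattached/leaked fip
        print("Deleting unattached floating ip", ip.id, ip.floating_ip_address)
        conn.delete_floating_ip(ip.id)
```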
| dmsimard[m] | I have nothing useful to contribute but hi mordred long time no see :P | 21:46 |
| corvus | clarkb: here's the mass-deletion: https://paste.opendev.org/show/bsRarJmfCZdWpz2CMJWm/ maybe we should send that to the flex folks and let them know that based on behavior, it looks like all the fips had missing or incorrect port ids (so that it looked like none of those ips had ports) | 21:47 |
| * mordred | waves to dmsimard :) | 21:47 |
| mordred | if you need it for followup debugging - set the keystoneauth logger to DEBUG and it'll log full request/response interactions. massive sledgehammer of course | 21:48 |
| clarkb | corvus: can do. I'll include that in the email about the checksum behavior in sjc3. I'll do that as soon as my gerrit 3.11 upgrade and downgrade is done | 21:48 |
| mordred | that should be security safe to do - it won't log auth creds | 21:48 |
| corvus | mordred: yeah, that would be quite the firehose... i think if we decide we're at that point, we probably need to make a dedicated launcher process and restrict it to just this provider (and remove the providers from the others). | 21:50 |
| mordred | ++ agree | 21:50 |
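A minimal sketch of the keystoneauth debug logging mordred describes at 21:48, using the standard Python logging module; it logs full request/response traffic, so it only makes sense on a dedicated launcher process as discussed above:

```python
# Minimal sketch of mordred's suggestion: set the keystoneauth logger to DEBUG
# so full HTTP request/response interactions are logged. Extremely verbose.
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("keystoneauth").setLevel(logging.DEBUG)
```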
| clarkb | my test gerrit 3.10 -> 3.11 upgrade went very smoothly. It doesn't even reindex because the indexes are not updated. No soy email template changes either | 21:53 |
| corvus | i spot checked one of the other ips. zuul started the cleanup for the node it was attached to after the first message about deleting the leaked fip. (which supports the idea that the cloud just returned bad data, and not that this was some cascade resulting from zuul's deletion process) | 21:55 |
| clarkb | corvus: any idea if other regions have been affected? | 21:56 |
| clarkb | we know dfw3 is but I wonder about iad3 (sjc3 is disabled) | 21:56 |
| corvus | the only other instances are sjc3 for 2025-11-25 03:05:49,151 DEBUG zuul.openstack.raxflex-SJC3: Deleting unattached floating ip 3000e926-9dea-40dd-9ac5-17440060e86e 66.70.103.231 | 22:01 |
| corvus | but we only have a day's worth of these logs | 22:01 |
| corvus | so far | 22:01 |
| mordred | I love that we're still leaking FIPs at all ;) | 22:02 |
| corvus | mordred: yeah -- i think we observed that leaks still happen, right clarkb ? or do we need to do another test? | 22:03 |
| clarkb | corvus: they were happening when we first brought up raxflex | 22:05 |
| clarkb | corvus: that is why we toggled the cleanup feature on. I don't know if we know that things are leaking today though | 22:06 |
| clarkb | infra-root ok I have upgraded the 3.10 held node (noted in the etherpad https://etherpad.opendev.org/p/gerrit-upgrade-3.11) to 3.11 and captured notes on the process. Then I downgraded it back to 3.10 and captured notes on that process | 22:06 |
| clarkb | I don't see anything concerning and in fact it's maybe the easiest one we've done in terms of steps since indexes don't change | 22:07 |
| clarkb | I didn't test the cleanup of the large h2 cache files because it occurred to me after I upgraded that I should do that. I will try to do another pass at another time and test the cleanup of the large h2 cache files just to double check it doesn't hurt anything | 22:07 |
| clarkb | but that will likely be another day | 22:07 |
| clarkb | corvus: I'll draft an email to rax that roughly says we notice none of the checksums in glance in sjc3 match our checksums. This implies something is happening on the backend to modify the images. Could it be possible that process is sometimes corrupting the images? This doesn't seem to be the case in dfw3 or iad3 where the checksums do match. Additionally we've noticed that | 22:10 |
| clarkb | openstacksdk's delete_unattached_floating_ips method will sometimes delete all of our floating ips in dfw3 (possibly elsewhere too, but this is where we have logs catching it happen, see this paste), we suspect due to its list_floating_ips call returning null/falsey port_id values. This is unexpected and doesn't happen frequently but does happen often enough that every few | 22:10 |
| clarkb | days we lose floating ips | 22:10 |
| clarkb | corvus: does that seem to roughly capture what we have learned and is accurate etc? | 22:11 |
| clarkb | infra-root also if anyone wants to run through the test upgrade and downgrade with me (tonyb maybe?) it's a good way of building familiarity with the whole process | 22:13 |
| corvus | clarkb: info lgtm | 22:23 |
| clarkb | email has been sent | 22:29 |
| clarkb | should we go ahead and reenable raxflex-sjc3 now that we suspect things are happier overall? | 22:30 |
| clarkb | or do we want to leave it largely disabled until we have a better understanding of the failures in the first place? | 22:30 |