Monday, 2024-08-26

*** elodilles_pto is now known as elodilles07:24
*** liuxie is now known as liushy07:55
*** ykarel_ is now known as ykarel08:36
mnasiadkaHello there - I noticed that now we have only one aarch64-capable nodepool provider with 15 max instances (not counting its slowness) - are there any planned activities to get aarch64 testing back in shape?14:15
fungimnasiadka: i think we need another donor to volunteer some resources now that works-on-arm pulled out of supplying the hardware for the linaro cloud14:25
fungiwe might be able to get some additional test resources out of osuosl, but that doesn't give us any redundancy/resiliency14:26
mnasiadkaespecially since osuosl seems to have some i/o problems or something similar, looking at kolla build jobs14:27
mnasiadkahttps://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&job_name=kolla-build-ubuntu-aarch64&job_name=kolla-build-rocky9-aarch64&skip=014:27
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: DNM: test base-test  https://review.opendev.org/c/zuul/zuul-jobs/+/92714814:46
clarkbright this is all dependent on volunteers and donated resources. Currently we've got one set of resources to work with. If you can characterize/debug the timeouts and slowness, that may be useful, actionable input on osuosl's side15:06
clarkbnb04 arm image builds are still failing, I think because we may have used up all the losetup devices. I'm going to stop services and reboot to clear out any of those stale mounts and devices15:21
clarkband maybe do another round of /opt/dib_tmp cleanup when it comes back too15:22
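For reference, a rough sketch of the kind of cleanup being described here (stop the builder, detach stale loop devices and mounts, purge /opt/dib_tmp). The systemd unit name and exact commands are illustrative assumptions; in practice a reboot was used to clear the stale devices.

```bash
# illustrative sketch only; the service name is an assumption
systemctl stop nodepool-builder        # stop the builder before touching its mounts

losetup -l -a                          # list all loop devices, including stale ones

# unmount anything still mounted under the dib temp area
for mnt in $(mount | awk '/\/opt\/dib_tmp/ {print $3}'); do
    umount "$mnt"
done

losetup -D                             # detach all loop devices no longer in use
rm -rf /opt/dib_tmp/*                  # purge leftover build trees (slow, io heavy)
```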
clarkbfungi: I guess the next step for raxflex is to boot a mirror node and then add a small number of nodes to a new provider?15:23
fungii think first we need to check out the networking situation there, take some stock of available flavors, check the relevant quotas, and upload a copy of the image tonyb made15:26
fungiwe'll also need a change to add the region to our cloud setup routine, i think?15:26
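As context, the pre-flight checks listed above roughly correspond to openstack CLI calls like the following; the cloud name "raxflex" and the image file name are assumptions for illustration.

```bash
export OS_CLOUD=raxflex

openstack network list    # what networking already exists in the new region
openstack flavor list     # available flavors
openstack quota show      # project quotas (instances, cores, ram, ...)

# upload a noble image if the provider does not already ship one
openstack image create \
    --disk-format qcow2 --container-format bare \
    --file noble-server-cloudimg-amd64.img \
    ubuntu-noble
```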
clarkboh right15:26
clarkbyes, all of those seem like good precursor steps that I was too excited to remember15:27
corvusupload copy of what image?15:30
fricklerI checked the flavors earlier and didn't find any 8/8 variant. do you still have contact with rax and could ask for a custom flavor?15:30
fricklercorvus: iirc a custom noble build that was used for the latest mirror instances15:30
clarkbcorvus: of the ubuntu noble image since most (all?) clouds don't have noble images up15:30
clarkbfrickler: it isn't a custom image just the upstream one15:31
fungiwhich may be the same as just downloading an official noble cloud image from ubuntu.com, i don't recall15:31
clarkbfungi: yes15:31
fungiokay, so not made, merely downloaded to bridge for upload into glance15:32
fungithough raxflex does include "Ubuntu-24.04" in their image list15:32
fricklerwell raxflex offers Ubuntu-24.04, so that might not be necessary15:32
frickler^5 ;)15:32
clarkbnice, I think that may be the first cloud region to have one officially15:33
corvusack thx15:34
clarkbnb04 is rebooted. losetup -l -a looks better to me. I've started a new round of cleanups in dib_tmp and will let ansible restart the builder service for me in ~45 minutes15:35
clarkb(just to minimize the amount of io while things delete)15:35
clarkbnodepool-builder is running on nb04 again (via the hourly nodepool job)16:11
mnasiadkaclarkb: simple timing comparisons for kolla x86 vs aarch64 builds in opendev CI - https://ethercalc.net/kolla-aarch64-timing (aarch64 is osuosl, x86 is ovh)16:49
mnasiadkaclarkb: I don't think that's cpu exhaustion, rather i/o (storage) issues16:50
clarkbmnasiadka: I'm not sure comparing x86 to aarch64 is super productive. They are completely different architectures including how io is going to work aiui. Do we maybe have data from before the queue grew this long to see if this is a new problem or was preexisting?16:52
clarkbIdeally we can point out "these sorts of operations are a problem" rather than just "x86 is faster", that's a given16:53
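One way to get that historical data is the Zuul builds API behind the dashboard link above; a rough sketch, assuming jq is available and that the duration (seconds) and start_time fields are what we want to compare:

```bash
# pull recent build results and durations for the aarch64 kolla job
curl -s "https://zuul.openstack.org/api/builds?job_name=kolla-build-debian-aarch64&limit=100" \
  | jq -r '.[] | [.start_time, .result, .duration] | @tsv'
```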
clarkbkolla-toolbox looks particularly bad16:54
mnasiadkashould be available, x86 builds finish around 30mins, and aarch64 until recently was similar, so timing out after the 3hr limit is a bit problematic for us16:54
clarkbmaybe focusing on whatever that is doing would be good16:54
clarkbalso the keystone timing illustrates why I'm concerned about comparing the two. It is slower than other things on x86 but faster than them on aarch6416:54
clarkb*keystone-base16:54
clarkbgnocchi-base seems to be that way too16:57
clarkbnova-libvirt and ovsdpdk also seem slow, I suspect that those may be doing compilation steps for tools? So maybe CPU is at play, might be worth checking whether those run in parallel and if they can be made to do so (but I'm sort of making assumptions there)17:03
opendevreviewMerged zuul/zuul-jobs master: Fix k8s-crio buildset registry test  https://review.opendev.org/c/zuul/zuul-jobs/+/92601317:07
opendevreviewMerged zuul/zuul-jobs master: Update ensure-kubernetes with podman support  https://review.opendev.org/c/zuul/zuul-jobs/+/92497017:15
clarkbonce nb04 uploads a new centos 9 stream arm image the ozj change to bump up ansible-lint and run things on noble should be mergeable17:34
clarkbthe project-config change that does similar is already passing. Note this does make minor edits to a large number of playbooks so not sure what our comfort level is with doing that17:34
clarkbthe vast majority of those edits should be inconsequential (add names to plays and reorder some yaml dict things for ansible tasks). The ones I'm primarily worried about have to do with renaming action modules to use their fully qualified forms17:35
clarkbjust in case the lookups find different code or something17:35
mnasiadkaclarkb: sadly the last successful builds that fit the good timings (~30 minute runs - actually this one has 21 minutes) don't have the logs anymore - awkwardly looking at history (https://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&skip=50) it seems the problems started on the 29th of July - which sort of fits the time of merging the first linaro removal change (https://review.opendev.org/c/openstack/project-config/+/925029)17:37
clarkbmnasiadka: yes that seems right. It's possible these jobs were never successful on the osuosl cloud and only succeeded on linaro17:38
mnasiadkaI have a hunch that in the past the builds that timed out, or succeeded but took over 2hrs, were not run on Linaro17:38
fungiit might also just mean that we started more heavily loading osuosl for longer spans of time, and so we're adversely impacting the performance of resources there as a result17:39
mnasiadkamight be, we just merged a patch that should remove 2/3 of our aarch64 build jobs (stop building rocky and ubuntu - leave only debian, which is the only one we do test deploys in kolla-ansible with)17:39
mnasiadkaI'll pay some attention to the aarch64 jobs in Kolla/Kolla-Ansible to see how they are behaving17:42
mnasiadka(but I assume there's not going to be a lot of improvements)17:43
clarkbI think if we can identify specific areas of concern then we can feed that back to osuosl and maybe there are tweaks that can be made17:44
clarkbI suspect that we're limited by available hardware, but it wouldn't be the first time we've had poor performance in a cloud that was identified and mitigated via our job experiences17:44
corvushttps://imgur.com/screenshot-TKG3y6h  for the visually inclined18:08
fungiclearly two sets of groupings there. analysis ftw!18:19
fungibut some fast results recently too18:19
fungioh, but those are failures so may have just ended early18:19
clarkblooking at neutron unittest jobs they do take longer on osuosl than the x86 clouds but less than double the time18:20
fungilooks like there were probably some successes in osuosl until we were only running the jobs there?18:20
clarkbwhich is a clear difference from the kolla job behavior. I suspect that does point to io, as previously called out, as the neutron and nova unittests are very cpu bound iirc18:20
fungiif we assume the 1.5-3.0hr cluster is osuosl18:21
fungijob run times in that cluster seem to have been relatively consistent for a few months18:21
mnasiadkaDo you want to say Linaro just had better hardware/storage?18:25
clarkbmnasiadka: it almost certainly did since it was running on a fairly beefy ampere server iirc18:25
clarkbbut arm shut it down and didn't renew us with new hardware18:26
mnasiadkaWell then, maybe the only sensible option is to bump up the debian jobs timeout and see how that goes - until things improve18:31
clarkbagain I think there is value in determining where the specific slowness occurs so that we can provide that feedback then measure if any changes are possible18:40
clarkbfungi: I found https://opendev.org/opendev/system-config/commit/f1df36145d0d2d27b210b027e9fd1008238297f3 as an example for using cloud launcher to bootstrap the new cloud and its networking20:54
clarkbpulling together meeting agenda stuff and wanted to have that on there20:54
clarkbbut we can probably go ahead and spin up changes like that?20:54
fungiclarkb: yeah, i should be free to put that together here shortly21:04
clarkbI've made a bunch of updates to the meeting agenda; if anyone else would like to add content or clean things up let me know21:04
fungitesting booting a server instance in raxflex, it does seem to have a suitable flavor for our test nodes, almost like it was tailor-made. gp.0.4.8 has 8gb ram, 4 vcpus, 80gb rootfs, and 64gb ephemeral disk on top of that22:02
clarkbnice22:03
fungialso i think i confirmed that we definitely need to create a network. there is a "PUBLICNET" in the network list but setting --network=PUBLICNET results in a scheduling fault with "No valid service subnet for the given device owner..."22:04
clarkbya the commit above shows how we did that with clouds in the past (no current cloud does that though)22:05
clarkbbecause you have to create a network, a subnet, and a router and tie them all together. IIRC they said there isn't ipv6 yet so we don't have to worry about that either at this point (though at some point we probably do want to add ipv6)22:05
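A hand-rolled sketch of the create-and-tie-together sequence that the cloud launcher profiles automate; the resource names are made up, while PUBLICNET and the gp.0.4.8 flavor come from the discussion above.

```bash
export OS_CLOUD=raxflex

openstack network create opendev-ci-net
openstack subnet create --network opendev-ci-net \
    --subnet-range 10.0.16.0/24 opendev-ci-subnet
openstack router create opendev-ci-router
openstack router add subnet opendev-ci-router opendev-ci-subnet
openstack router set opendev-ci-router --external-gateway PUBLICNET

# with the tenant network in place, a test boot against it should schedule
openstack server create --flavor gp.0.4.8 --image ubuntu-noble \
    --network opendev-ci-net test-node
```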
fungilooks like that goes in inventory/service/group_vars/bastion.yaml these days22:08
clarkbya it moved into the group var file22:08
fungiduring the bridge replacement, yes22:09
fungiwe still have the opendevci-networking and opendevzuul-networking profiles, even better22:09
fungidoes "network: External" mean the router connects to an existing provider network named "External"? in which case for this we should do "network: PUBLICNET" instead?22:15
clarkbfungi: yes and that is why we don't have a profile for the router as it is cloud and potentially region specific. Essentially you're determining the other interface on the router22:15
fungigot it, thanks!22:16
opendevreviewJeremy Stanley proposed opendev/system-config master: Set up networking for Rackspace Flex tenants  https://review.opendev.org/c/opendev/system-config/+/92721422:22
clarkbfungi: https://docs.openstack.org/api-ref/network/v2/index.html#create-router I believe the network in the ansible var gets translated to external_gateway_info.network_id?22:23
clarkbsomething like that22:23
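If that reading is right, the launcher's external network setting ends up as the external_gateway_info field on the neutron router, which is also what the CLI's --external-gateway option populates; roughly:

```bash
# look up the uuid that external_gateway_info.network_id would be set to
openstack network show PUBLICNET -f value -c id

# the corresponding neutron API call is approximately:
#   PUT /v2.0/routers/<router-id>
#   {"router": {"external_gateway_info": {"network_id": "<PUBLICNET uuid>"}}}
```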
clarkbupdated the meeting agenda to point at that change now22:25
fungithanks!22:26
clarkbI'll get the agenda out before I call it a day just in case there is anything else to add before then. Also looks like nb04 did successfully create a loopback device with losetup23:01
clarkbassuming that was the last remaining issue I expect the current image build to succeed23:02
