*** elodilles_pto is now known as elodilles | 07:24 | |
*** liuxie is now known as liushy | 07:55 | |
*** ykarel_ is now known as ykarel | 08:36 | |
mnasiadka | Hello there - I noticed that we now have only one aarch64-capable nodepool provider with 15 max instances (not counting its slowness) - are there any planned activities to get aarch64 testing back in shape? | 14:15 |
fungi | mnasiadka: i think we need another donor to volunteer some resources now that works-on-arm pulled out of supplying the hardware for the linaro cloud | 14:25 |
fungi | we might be able to get some additional test resources out of osuosl, but that doesn't give us any redundancy/resiliency | 14:26 |
mnasiadka | especially since osuosl seems to have some i/o problems or something similar, judging by the kolla build jobs | 14:27 |
mnasiadka | https://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&job_name=kolla-build-ubuntu-aarch64&job_name=kolla-build-rocky9-aarch64&skip=0 | 14:27 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: DNM: test base-test https://review.opendev.org/c/zuul/zuul-jobs/+/927148 | 14:46 |
clarkb | right this is all dependent on volunteers and donated resources. Currently we've got one set of resources to work with. If you can characterize/debug the timeouts and slowness, that may be useful, actionable input for osuosl's side | 15:06 |
clarkb | nb04 arm image builds are still failing I think because maybe we've used up all the losetup devices. I'm going to stop services and reboot to clear out any of those stale mounts and devices | 15:21 |
clarkb | and maybe do another round of /opt/dib_tmp cleanup when it comes back too | 15:22 |
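A minimal sketch of the cleanup described above, assuming root on the builder; how the builder service itself gets stopped (ansible/containers) is omitted here:

```sh
# Inspect loop devices for stale diskimage-builder attachments
losetup -l -a

# After stopping the builder (and/or rebooting to release stale mounts),
# detach any leftover devices and clear the build scratch space noted above
sudo losetup -D
sudo rm -rf /opt/dib_tmp/*
```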
clarkb | fungi: I guess the next step for raxflex is to boot a mirror node and then add a small number of nodes to a new provider? | 15:23 |
fungi | i think first we need to check out the networking situation there, take some stock of available flavors, check the relevant quotas, and upload a copy of the image tonyb made | 15:26 |
fungi | we'll also need a change to add the region to our cloud setup routine, i think? | 15:26 |
clarkb | oh right | 15:26 |
clarkb | yes, all of those seem like good precursor steps that I was too excited to remember | 15:27 |
corvus | upload copy of what image? | 15:30 |
frickler | I checked the flavors earlier and didn't find any 8/8 variant. do you still have contact with rax and could you ask for a custom flavor? | 15:30 |
frickler | corvus: iirc a custom noble build that was used for the latest mirror instances | 15:30 |
clarkb | corvus: of the ubuntu noble image since most (all?) clouds don't have noble images up | 15:30 |
clarkb | frickler: it isn't a custom image just the upstream one | 15:31 |
fungi | which may be the same as just downloading an official noble cloud image from ubuntu.com, i don't recall | 15:31 |
clarkb | fungi: yes | 15:31 |
fungi | okay, so not made, merely downloaded to bridge for upload into glance | 15:32 |
fungi | though raxflex does include "Ubuntu-24.04" in their image list | 15:32 |
frickler | well raxflex offers Ubuntu-24.04, so that might not be necessary | 15:32 |
frickler | ^5 ;) | 15:32 |
clarkb | nice, I think that may be the first cloud region to have one officially | 15:33 |
corvus | ack thx | 15:34 |
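A rough sketch of the survey steps mentioned above using the openstack CLI; the cloud name "raxflex" is a placeholder for whatever entry lands in clouds.yaml:

```sh
# Take stock of flavors, published images, and quotas in the new region
openstack --os-cloud raxflex flavor list
openstack --os-cloud raxflex image list | grep -i ubuntu
openstack --os-cloud raxflex quota show
```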
clarkb | nb04 is rebooted. losetup -l -a looks better to me. I've started a new round of cleanups in dib_tmp and will let ansible restart the builder service for me in ~45 minutes | 15:35 |
clarkb | (just to minimize the amount of io while things delete) | 15:35 |
clarkb | nodepool-builder is running on nb04 again (via the hourly nodepool job) | 16:11 |
mnasiadka | clarkb: simple timing comparisons for kolla x86 vs aarch64 builds in opendev CI - https://ethercalc.net/kolla-aarch64-timing (aarch64 is osuosl, x86 is ovh) | 16:49 |
mnasiadka | clarkb: I don't think that's cpu exhaustion, rather i/o (storage) issues | 16:50 |
clarkb | mnasiadka: I'm not sure comparing x86 to aarch64 is super productive. They are completely different architectures including how io is going to work aiui. Do we maybe have data from before the queue grew this long to see if this is a new problem or was preexisting? | 16:52 |
clarkb | Ideally we can point out "these sorts of operations are a problem" rather than just "x86 is faster" - that's a given | 16:53 |
clarkb | kolla-toolbox looks particularly bad | 16:54 |
mnasiadka | should be available, x86 builds finish around 30mins and aarch64 until recently was similar, so timing out at the 3hr limit is a bit problematic for us | 16:54 |
clarkb | maybe focusing on whatever that is doing would be good | 16:54 |
clarkb | also the keystone timing illustrates why I'm concerned about comparing the two: it is slower than other images on x86 but faster than them on aarch64 | 16:54 |
clarkb | *keystone-base | 16:54 |
clarkb | gnocchi-base seems to be that way too | 16:57 |
clarkb | nova-libvirt and ovsdpdk also seem slow, I suspect that those may be doing compilation steps for tools? So maybe CPU is at play; might be worth checking whether those run in parallel, or if they can be made to do so (but I'm sort of making assumptions there) | 17:03 |
opendevreview | Merged zuul/zuul-jobs master: Fix k8s-crio buildset registry test https://review.opendev.org/c/zuul/zuul-jobs/+/926013 | 17:07 |
opendevreview | Merged zuul/zuul-jobs master: Update ensure-kubernetes with podman support https://review.opendev.org/c/zuul/zuul-jobs/+/924970 | 17:15 |
clarkb | once nb04 uploads a new centos 9 stream arm image the ozj change to bump up ansible-lint and run things on noble should be mergeable | 17:34 |
clarkb | the project-config change that does similar is already passing. Note this does make minor edits to a large number of playbooks so I'm not sure what our comfort level is with doing that | 17:34 |
clarkb | the vast majority of those edits should be inconsequential (add names to plays and reorder some yaml dict things for ansible tasks). The ones I'm primarily worried about have to do with renaming action modules to use their fully qualified forms | 17:35 |
clarkb | just in case the lookups find different code or something | 17:35 |
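As an aside, recent ansible-lint releases can apply most of these rewrites themselves; a hedged sketch, assuming a version new enough to support --fix:

```sh
# Auto-apply lint fixes (including fully qualified module names), then
# review the resulting diff before pushing anything
ansible-lint --fix playbooks/
git diff --stat
```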
mnasiadka | clarkb: sadly the last successful builds with good timings (~30 minute runs - actually this one took 21 minutes) don't have their logs anymore - but looking at the history (https://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&skip=50) it seems the problems started on the 29th of July - which sort of fits the time the first linaro removal change merged (https://review.opendev.org/c/openstack/project-config/+/925029) | 17:37 |
clarkb | mnasiadka: yes that seems right. It's possible these jobs were never successful on the osuosl cloud and only succeeded on linaro | 17:38 |
mnasiadka | I have a hunch that in the past the builds that timed out or took over 2hrs to succeed were not run on Linaro | 17:38 |
fungi | it might also just mean that we started more heavily loading osuosl for longer spans of time, and so we're adversely impacting the performance of resources there as a result | 17:39 |
mnasiadka | might be, we just merged a patch that should remove 2/3 of our aarch64 build jobs (stop building rocky and ubuntu - leave only debian, which is the only one we do test deploys with in kolla-ansible) | 17:39 |
mnasiadka | I'll pay some attention to the aarch64 jobs in Kolla/Kolla-Ansible to see how they are behaving | 17:42 |
mnasiadka | (but I assume there's not going to be a lot of improvements) | 17:43 |
clarkb | I think if we can identify specific areas of concern then we can feed that back to osuosl and maybe there are tweaks that can be made | 17:44 |
clarkb | I suspect that we're limited by available hardware, but it wouldn't be the first time we've had poor performance in a cloud that was identified and mitigated via our job experiences | 17:44 |
corvus | https://imgur.com/screenshot-TKG3y6h for the visually inclined | 18:08 |
fungi | clearly two sets of groupings there. analysis ftw! | 18:19 |
fungi | but some fast results recently too | 18:19 |
fungi | oh, but those are failures so may have just ended early | 18:19 |
clarkb | looking at neutron unittest jobs they do take longer on osuosl than the x86 clouds but less than double the time | 18:20 |
fungi | looks like there were probably some successes in osuosl until we were only running the jobs there? | 18:20 |
clarkb | which is a clear difference from the kolla job behavior. I suspect that does point to io as previously called out, as the neutron and nova unittests are very cpu bound iirc | 18:20 |
fungi | if we assume the 1.5-3.0hr cluster is osuosl | 18:21 |
fungi | job run times in that cluster seem to have been relatively consistent for a few months | 18:21 |
mnasiadka | Do you want to say Linaro just had better hardware/storage? | 18:25 |
clarkb | mnasiadka: it almost certainly did since it was running on a fairly beefy ampere server iirc | 18:25 |
clarkb | but arm shut it down and didn't renew us with new hardware | 18:26 |
mnasiadka | Well then, maybe the only sensible option is to bump up the debian jobs timeout and see how that goes - until things improve | 18:31 |
clarkb | again I think there is value in determining where the specific slowness occurs so that we can provide that feedback then measure if any changes are possible | 18:40 |
clarkb | fungi: I found https://opendev.org/opendev/system-config/commit/f1df36145d0d2d27b210b027e9fd1008238297f3 as an example for using cloud launcher to bootstrap the new cloud and its networking | 20:54 |
clarkb | pulling together meeting agenda stuff and wanted to have that on there | 20:54 |
clarkb | but we can probably go ahead and spin up changes like that? | 20:54 |
fungi | clarkb: yeah, i should be free to put that together here shortly | 21:04 |
clarkb | I've made a bunch of updates to the meeting agenda; if anyone else would like to add content or clean things up let me know | 21:04 |
fungi | testing booting a server instance in raxflex, it does seem to have a suitable flavor for our test nodes, almost like it was tailor-made. gp.0.4.8 has 8gb ram, 4 vcpus, 80gb rootfs, and 64gb ephemeral disk on top of that | 22:02 |
clarkb | nice | 22:03 |
fungi | also i think i confirmed that we definitely need to create a network. there is a "PUBLICNET" in the network list but setting --network=PUBLICNET results in a scheduling fault with "No valid service subnet for the given device owner..." | 22:04 |
clarkb | ya the commit above shows how we did that with clouds in the past (no current cloud does that though) | 22:05 |
clarkb | because you have to create a network, a subnet, and a router and tie them all together. IIRC they said there isn't ipv6 yet so we don't have to worry about that either at this point (though at some point we probably do want to add ipv6) | 22:05 |
fungi | looks like that goes in inventory/service/group_vars/bastion.yaml these days | 22:08 |
clarkb | ya it moved into the group var file | 22:08 |
fungi | during the bridge replacement, yes | 22:09 |
fungi | we still have the opendevci-networking and opendevzuul-networking profiles, even better | 22:09 |
fungi | does "network: External" mean the router connects to an existing provider network named "External"? in which case for this we should do "network: PUBLICNET" instead? | 22:15 |
clarkb | fungi: yes and that is why we don't have a profile for the router, as it is cloud and potentially region specific. Essentially you're determining the other interface on the router | 22:15 |
fungi | got it, thanks! | 22:16 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Set up networking for Rackspace Flex tenants https://review.opendev.org/c/opendev/system-config/+/927214 | 22:22 |
clarkb | fungi: https://docs.openstack.org/api-ref/network/v2/index.html#create-router I believe the network in the ansible var gets translated to external_gateway_info.network_id? | 22:23 |
clarkb | something like that | 22:23 |
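For reference, the manual equivalent of what the cloud-launcher profiles automate; the names and subnet range below are illustrative placeholders, not the values in the actual change:

```sh
# Create a tenant network, subnet, and router, then tie them together
openstack network create opendev-ci-net
openstack subnet create opendev-ci-subnet \
    --network opendev-ci-net --subnet-range 10.0.16.0/24
openstack router create opendev-ci-router
openstack router add subnet opendev-ci-router opendev-ci-subnet
# "network: PUBLICNET" in the launcher config maps to the router's external
# gateway (external_gateway_info.network_id in the neutron API)
openstack router set opendev-ci-router --external-gateway PUBLICNET
```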
clarkb | updated the meeting agenda to point at that change now | 22:25 |
fungi | thanks! | 22:26 |
clarkb | I'll get the agenda out before I call it a day just in case there is anything else to add before then. Also looks like nb04 did successfully create a loopback device with losetup | 23:01 |
clarkb | assuming that was the last remaining issue I expect the current image build to succeed | 23:02 |