Monday, 2024-08-26

*** elodilles_pto is now known as elodilles07:24
*** liuxie is now known as liushy07:55
*** ykarel_ is now known as ykarel08:36
mnasiadkaHello there - I noticed that now we have only one aarch64-capable nodepool provider with 15 max instances (not counting its slowness) - are there any planned activities to get aarch64 testing back in shape?14:15
fungimnasiadka: i think we need another donor to volunteer some resources now that works-on-arm pulled out of supplying the hardware for the linaro cloud14:25
fungiwe might be able to get some additional test resources out of osuosl, but that doesn't give us any redundancy/resiliency14:26
mnasiadkaespecially since osuosl seems to have some i/o problems or something similar, looking at kolla build jobs14:27
mnasiadkahttps://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&job_name=kolla-build-ubuntu-aarch64&job_name=kolla-build-rocky9-aarch64&skip=014:27
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: DNM: test base-test  https://review.opendev.org/c/zuul/zuul-jobs/+/92714814:46
clarkbright this is all dependent on volunteers and donated resources. Currently we've got one set of resources to work with. If you can characterize/debug the timeouts and slowness, that may be useful, actionable input on osuosl's side15:06
clarkbnb04 arm image builds are still failing, I think because we may have used up all the losetup devices. I'm going to stop services and reboot to clear out any of those stale mounts and devices15:21
clarkband maybe do another round of /opt/dib_tmp cleanup when it comes back too15:22
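For reference, a rough sketch of the kind of cleanup being described here (stop the builder, detach stale loop devices and mounts, purge /opt/dib_tmp). The systemd unit name and exact commands are illustrative assumptions; in practice a reboot was used to clear the stale devices.

```bash
# illustrative sketch only; the service name is an assumption
systemctl stop nodepool-builder        # stop the builder before touching its mounts

losetup -l -a                          # list all loop devices, including stale ones

# unmount anything still mounted under the dib temp area
for mnt in $(mount | awk '/\/opt\/dib_tmp/ {print $3}'); do
    umount "$mnt"
done

losetup -D                             # detach all loop devices no longer in use
rm -rf /opt/dib_tmp/*                  # purge leftover build trees (slow, io heavy)
```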
clarkbfungi: I guess the next step for raxflex is to boot a mirror node and then add a small number of nodes to a new provider?15:23
fungii think first we need to check out the networking situation there, take some stock of available flavors, check the relevant quotas, and upload a copy of the image tonyb made15:26
fungiwe'll also need a change to add the region to our cloud setup routine, i think?15:26
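As context, the pre-flight checks listed above roughly correspond to openstack CLI calls like the following; the cloud name "raxflex" and the image file name are assumptions for illustration.

```bash
export OS_CLOUD=raxflex

openstack network list    # what networking already exists in the new region
openstack flavor list     # available flavors
openstack quota show      # project quotas (instances, cores, ram, ...)

# upload a noble image if the provider does not already ship one
openstack image create \
    --disk-format qcow2 --container-format bare \
    --file noble-server-cloudimg-amd64.img \
    ubuntu-noble
```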
clarkboh right15:26
clarkbyes, all of those seem like good precursor steps that I was too excited to remember15:27
corvusupload copy of what image?15:30
fricklerI checked the flavors earlier and didn't find any 8/8 variant. do you still have contact with rax and could ask for a custom flavor?15:30
fricklercorvus: iirc a custom noble build that was used for the latest mirror instances15:30
clarkbcorvus: of the ubuntu noble image since most (all?) clouds don't have noble images up15:30
clarkbfrickler: it isn't a custom image just the upstream one15:31
fungiwhich may be the same as just downloading an official noble cloud image from ubuntu.com, i don't recall15:31
clarkbfungi: yes15:31
fungiokay, so not made, merely downloaded to bridge for upload into glance15:32
fungithough raxflex does include "Ubuntu-24.04" in their image list15:32
fricklerwell raxflex offers Ubuntu-24.04, so that might not be necessary15:32
frickler^5 ;)15:32
clarkbnice, I think that may be the first cloud region to have one officially15:33
corvusack thx15:34
clarkbnb04 is rebooted. losetup -l -a looks better to me. I've started a new round of cleanups in dib_tmp and will let ansible restart the builder service for me in ~45 minutes15:35
clarkb(just to minimize the amount of io while things delete)15:35
clarkbnodepool-builder is running on nb04 again (via the hourly nodepool job)16:11
mnasiadkaclarkb: simple timing comparisons for kolla x86 vs aarch64 builds in opendev CI - https://ethercalc.net/kolla-aarch64-timing (aarch64 is osuosl, x86 is ovh)16:49
mnasiadkaclarkb: I don't think that's cpu exhaustion, rather i/o (storage) issues16:50
clarkbmnasiadka: I'm not sure comparing x86 to aarch64 is super productive. They are completely different architectures including how io is going to work aiui. Do we maybe have data from before the queue grew this long to see if this is a new problem or was preexisting?16:52
clarkbIdeally we can point out "these sorts of operations are a problem" rather than just "x86 is faster", that's a given16:53
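One way to get that historical data is the Zuul builds API behind the dashboard link above; a rough sketch, assuming jq is available and that the duration (seconds) and start_time fields are what we want to compare:

```bash
# pull recent build results and durations for the aarch64 kolla job
curl -s "https://zuul.openstack.org/api/builds?job_name=kolla-build-debian-aarch64&limit=100" \
  | jq -r '.[] | [.start_time, .result, .duration] | @tsv'
```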
clarkbkolla-toolbox looks particularly bad16:54
mnasiadkashould be available, x86 builds finish around 30mins, and aarch64 until recently was similar, so timing out after the 3hr limit is a bit problematic for us16:54
clarkbmaybe focusing on whatever that is doing would be good16:54
clarkbalso the keystone timing illustrates why I'm concerned about comparing the two. It is slower than other things on x86 but faster than them on aarch6416:54
clarkb*keystone-base16:54
clarkbgnocchi-base seems to be that way too16:57
clarkbnova-libvirt and ovsdpdk also seem slow, I suspect that those may be doing compilation steps for tools? So maybe CPU is at play, might be worth checking whether those run in parallel and if they can be made to do so (but I'm sort of making assumptions there)17:03
opendevreviewMerged zuul/zuul-jobs master: Fix k8s-crio buildset registry test  https://review.opendev.org/c/zuul/zuul-jobs/+/92601317:07
opendevreviewMerged zuul/zuul-jobs master: Update ensure-kubernetes with podman support  https://review.opendev.org/c/zuul/zuul-jobs/+/92497017:15
clarkbonce nb04 uploads a new centos 9 stream arm image the ozj change to bump up ansible-lint and run things on noble should be mergeable17:34
clarkbthe project-config change that does similar is already passing. Note this does make minor edits to a large number of playbooks so not sure what our comfort level is with doing that17:34
clarkbthe vast majority of those edits should be inconsequential (add names to plays and reorder some yaml dict things for ansible tasks). The ones I'm primarily worried about have to do with renaming action modules to use their fully qualified forms17:35
clarkbjust in case the lookups find different code or something17:35
mnasiadkaclarkb: sadly the last successful builds that fit the good timings (~30 minute runs - actually this one has 21 minutes) don't have the logs anymore - awkwardly looking at history (https://zuul.openstack.org/builds?job_name=kolla-build-debian-aarch64&skip=50) it seems the problems started on the 29th of July - which sort of fits the time of merging the first linaro removal change (https://review.opendev.org/c/openstack/project-config/+/925029)17:37
clarkbmnasiadka: yes that seems right. It's possible these jobs were never successful on the osuosl cloud and only succeeded on linaro17:38
mnasiadkaI have a hunch that in the past the builds that timed out, or succeeded but took over 2hrs, were not run on Linaro17:38
fungiit might also just mean that we started more heavily loading osuosl for longer spans of time, and so we're adversely impacting the performance of resources there as a result17:39
mnasiadkamight be, we just merged a patch that should remove 2/3 of our aarch64 build jobs (stop building rocky and ubuntu - leave only debian, which is the only one we do test deploys in kolla-ansible with)17:39
mnasiadkaI'll pay some attention to the aarch64 jobs in Kolla/Kolla-Ansible to see how they are behaving17:42
mnasiadka(but I assume there's not going to be a lot of improvements)17:43
clarkbI think if we can identify specific areas of concern then we can feed that back to osuosl and maybe there are tweaks that can be made17:44
clarkbI suspect that we're limited by available hardware, but it wouldn't be the first time we've had poor performance in a cloud that was identified and mitigated via our job experiences17:44
corvushttps://imgur.com/screenshot-TKG3y6h  for the visually inclined18:08
fungiclearly two sets of groupings there. analysis ftw!18:19
fungibut some fast results recently too18:19
fungioh, but those are failures so may have just ended early18:19
clarkblooking at neutron unittest jobs they do take longer on osuosl than the x86 clouds but less than double the time18:20
fungilooks like there were probably some successes in osuosl until we were only running the jobs there?18:20
clarkbwhich is a clear difference from the kolla job behavior. I suspect that does point to io, as previously called out, as the neutron and nova unittests are very cpu bound iirc18:20
fungiif we assume the 1.5-3.0hr cluster is osuosl18:21
fungijob run times in that cluster seem to have been relatively consistent for a few months18:21
mnasiadkaDo you want to say Linaro just had better hardware/storage?18:25
clarkbmnasiadka: it almost certainly did since it was running on a fairly beefy ampere server iirc18:25
clarkbbut arm shut it down and didn't renew us with new hardware18:26
mnasiadkaWell then, maybe the only sensible option is to bump up the debian jobs timeout and see how that goes - until things improve18:31
clarkbagain I think there is value in determining where the specific slowness occurs so that we can provide that feedback then measure if any changes are possible18:40
clarkbfungi: I found https://opendev.org/opendev/system-config/commit/f1df36145d0d2d27b210b027e9fd1008238297f3 as an example for using cloud launcher to bootstrap the new cloud and its networking20:54
clarkbpulling together meeting agenda stuff and wanted to have that on there20:54
clarkbbut we can probably go ahead and spin up changes like that?20:54
fungiclarkb: yeah, i should be free to put that together here shortly21:04
clarkbI've made a bunch of updates to the meeting agenda; if anyone else would like to add content or clean things up let me know21:04
fungitesting booting a server instance in raxflex, it does seem to have a suitable flavor for our test nodes, almost like it was tailor-made. gp.0.4.8 has 8gb ram, 4 vcpus, 80gb rootfs, and 64gb ephemeral disk on top of that22:02
clarkbnice22:03
fungialso i think i confirmed that we definitely need to create a network. there is a "PUBLICNET" in the network list but setting --network=PUBLICNET results in a scheduling fault with "No valid service subnet for the given device owner..."22:04
clarkbya the commit above shows how we did that with clouds in the past (no current cloud does that though)22:05
clarkbbecause you have to create a network, a subnet, and a router and tie them all together. IIRC they said there isn't ipv6 yet so we don't have to worry about that either at this point (though at some point we probably do want to add ipv6)22:05
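A hand-rolled sketch of the create-and-tie-together sequence that the cloud launcher profiles automate; the resource names are made up, while PUBLICNET and the gp.0.4.8 flavor come from the discussion above.

```bash
export OS_CLOUD=raxflex

openstack network create opendev-ci-net
openstack subnet create --network opendev-ci-net \
    --subnet-range 10.0.16.0/24 opendev-ci-subnet
openstack router create opendev-ci-router
openstack router add subnet opendev-ci-router opendev-ci-subnet
openstack router set opendev-ci-router --external-gateway PUBLICNET

# with the tenant network in place, a test boot against it should schedule
openstack server create --flavor gp.0.4.8 --image ubuntu-noble \
    --network opendev-ci-net test-node
```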
fungilooks like that goes in inventory/service/group_vars/bastion.yaml these days22:08
clarkbya it moved into the group var file22:08
fungiduring the bridge replacement, yes22:09
fungiwe still have the opendevci-networking and opendevzuul-networking profiles, even better22:09
fungidoes "network: External" mean the router connects to an existing provider network named "External"? in which case for this we should do "network: PUBLICNET" instead?22:15
clarkbfungi: yes and that is why we don't have a profile for the router as it is cloud and potentially region specific. Essentially you're determining the other interface on the router22:15
fungigot it, thanks!22:16
opendevreviewJeremy Stanley proposed opendev/system-config master: Set up networking for Rackspace Flex tenants  https://review.opendev.org/c/opendev/system-config/+/92721422:22
clarkbfungi: https://docs.openstack.org/api-ref/network/v2/index.html#create-router I believe the network in the ansible var gets translated to external_gateway_info.network_id?22:23
clarkbsomething like that22:23
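If that reading is right, the launcher's external network setting ends up as the external_gateway_info field on the neutron router, which is also what the CLI's --external-gateway option populates; roughly:

```bash
# look up the uuid that external_gateway_info.network_id would be set to
openstack network show PUBLICNET -f value -c id

# the corresponding neutron API call is approximately:
#   PUT /v2.0/routers/<router-id>
#   {"router": {"external_gateway_info": {"network_id": "<PUBLICNET uuid>"}}}
```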
clarkbupdated the meeting agenda to point at that change now22:25
fungithanks!22:26
clarkbI'll get the agenda out before I call it a day just in case there is anything else to add before then. Also looks like nb04 did successfully create a loopback device with losetup23:01
clarkbassuming that was the last remaining issue I expect the current image build to succeed23:02
