Tuesday, 2024-08-27

mnasiadkahello, seems the aarch64 situation is not getting better - I'll probably disable the debian build jobs in Kolla as well12:03
mnasiadkaclarkb: out of curiosity - what bandwidth would be required to stand up a nodepool provider? (I assume mirror traffic takes some portion of that)12:07
fricklermnasiadka: you are talking about donating cloud resources for CI? I don't think we have any data on the data consumption we produce12:11
mnasiadkafrickler: yeah, that would be useful to see if I can convince somebody to improve the aarch64 situation12:11
fricklerwe could check with Ramereth whether he has any accounting data, other than that I wouldn't even have an idea how to collect that. we could maybe get some lower bound by looking at the traffic on the mirrors, though12:21
NeilHanloneyo clarkb - g'morning :) I recently picked up maintaining the python-gerritlib (https://opendev.org/opendev/gerritlib) package in Fedora and the version being shipped fails to build from source with python 3.13 (version 0.6.0...) -- anyways, I'm bumping it now to a newer version and was going to just use the latest git commit for the package. My14:17
NeilHanlonquestion is: is there a plan to cut more tags for gerritlib and/or push releases to pypi? No worries in any direction, just will inform how I go about packaging it moving forward (i.e., as a 0.11.0 pre-release, or just as a 0.10.0-$downstreamreleasebump)14:17
clarkbNeilHanlon: we primarily use gerritlib with jeepyb which consumes gerritlib releases (so we're using 0.10.0 there with python3.11 on debian bookworm I think). I didn't have any plans for a release but I would expect the next one to be 0.11.0 because we dropped older python support15:43
clarkbI probably won't get to that myself this week as I'm going to need to really focus on travel prep stuff now, but if someone else did that would be fine15:43
clarkbNeilHanlon: does latest commit work with python3.13? or do we need to make additional changes for that? If we do need additional changes it might be good to do that before a 0.11.0 release15:44
NeilHanlonclarkb: yeah it appears to work on python 3.13, at least, from a building perspective. I will actually give it a try in a little bit15:46
clarkbunrelated to ^ there is an email on the gerrit list today about how an update to SSHD on gerrit master may have broken ssh connectivity15:54
clarkbI don't think we have a job set up to test that, but I'm checking now to see if they backported to stable 3.9 or 3.10 as we can test on those branches15:54
clarkbnope neither stable release has the SSHD_VERS bump. Will just have to see what upstream debugging says15:57
Ramereth[m]<frickler> "we could check with Ramereth..." <- What exactly do you need? Have the I/O issues not improved since we last talked?15:58
clarkbRamereth[m]: whatever the issue is seems to be persisting at least as measured by our CI job runtimes (and timeouts). I was hoping that someone more familiar with what those jobs are doing would look into the logs more closely to try and find more of a concrete cause but I don't think that has happened15:59
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add a role to convert diskimages between formats  https://review.opendev.org/c/zuul/zuul-jobs/+/92291216:00
clarkbwe did successfully build an ubuntu focal arm64 image over night (looks like the bionic image build was interrupted by a nodepool container update...) so thats good we've got things working there again16:00
fricklerRamereth[m]: that was only semi-related to the current issue, mnasiadka wanted to know how much traffic our CI is producing in order to give that information to some potential new cloud donor16:06
clarkbMINA has already been debugging the gerrit issue I called out above. Turns out the bug has existed since 2015 and gerrit worked around it elsewhere but not in the new kex handling16:15
opendevreviewMerged zuul/zuul-jobs master: Synchronize test-prepare-workspace-git to prepare-workspace-git  https://review.opendev.org/c/zuul/zuul-jobs/+/92554016:29
corvusthat change has been extensively tested, but it does affect every job; please be aware of it and ping me if you see any errors related to git workspace setup16:29
clarkback16:31
opendevreviewMerged zuul/zuul-jobs master: Add ensure-dib role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291016:40
clarkbcorvus: fwiw I do see jobs succeeding after that merged16:47
clarkbstill a small sample size and there could be corner cases but it isn't just a hard fail16:48
clarkbcorvus: one thing I notice is the streaming console log is far less verbose now which is maybe not ideal. We'll have to go to the post job ansible console afterwards to see that info16:49
clarkb(it's just a nice way of confirming the job is running with the expected hashes when it starts up)16:49
clarkbdevstack jobs are taking ~5 seconds to set up git now though which seems like a reasonable compromise16:50
clarkboh maybe it was closer to 7 :) still great16:51
clarkbhttps://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/11/ubuntu-jammy the info is available here after the job completes so we didn't lose the info entirely16:52
corvusyeah, i included all the details in the result object, and more actually, so it should be easy (maybe even easier) to debug16:53
corvusand i'm kinda hoping that the workspace setup is fast enough no one has time to notice there isn't a bunch of chatty git output.  :)16:54
clarkbya I usually only care when I'm trying to confirm the right thing was checked out as the job is running. A rare occurrence and usually only to see if a depends on did what I expected. I can just check after the fact16:55
corvusclarkb: note this: https://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/9/ubuntu-jammy16:55
corvus"initial_state":"cloned-from-cache"  is a new thing16:56
clarkbneat. I think I remember that from the reviews now16:57
opendevreviewMerged zuul/zuul-jobs master: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291116:57
opendevreviewMerged zuul/zuul-jobs master: Add build_diskimage_environment role variable  https://review.opendev.org/c/zuul/zuul-jobs/+/92622416:57
opendevreviewMerged zuul/zuul-jobs master: Add a diskimage-builder job  https://review.opendev.org/c/zuul/zuul-jobs/+/92622516:57
clarkbbut also speeding up jobs by 30 seconds a piece * several thousand a day is a really nice improvement16:58
opendevreviewMerged zuul/zuul-jobs master: Add a role to convert diskimages between formats  https://review.opendev.org/c/zuul/zuul-jobs/+/92291217:00
Ramereth[m]<clarkb> "Ramereth: whatever the issue..." <- I ask because I noticed there was an issue with one of the Ceph nodes and the RAID controller, but I resolved that a few days ago. If it's still happening I'll have to take a closer look again and narrow down what is going on.17:05
Ramereth[m]<frickler> "Ramereth: that was only semi-..." <- When you say traffic, do you mean network traffic from the VMs? or something else?17:05
clarkbRamereth[m]: ya I think it is still going on, one sec and I'll get a log from one of our image builds that took 7 hours17:07
clarkbhttps://nb04.opendev.org/ubuntu-focal-arm64-09615a17683a4f149aae1600d1fa49ed.log I think this generally takes about 90 minutes on our x86 builders. I haven't spent a ton of time digging into that log to see what is slow yet17:07
Ramereth[m]FWIW I also adjusted the RAM ratio last week so it might be hitting swap a little more on VMs. I ordered more RAM that should arrive on Friday. Depending on when I get it I'm hoping to get those installed on the newer Mt. Collins nodes17:08
Ramereth[m]Something else I realized is that I didn't properly add the SSD cinder pool into the cluster, so that might help things too, but you would have to use cinder boot volumes to utilize that17:09
clarkboh that is an interesting thought. I think swap could explain what we've seen17:09
clarkbya if you look at timestamp 2024-08-27 12:38:24.257 the step where we copy from the filesystem to the image mounted as a loopback device with losetup takes over an hour17:10
clarkbthat occurs on a cinder volume though so maybe related to the ssd cinder pool miss more than anything else?17:11
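A rough way to isolate sequential write throughput on that cinder-backed volume, independent of dib itself, is a timed streaming write with an fsync at the end; a minimal Python sketch, with a hypothetical target path on the builder:

    import os
    import time

    # Hypothetical target path on the cinder-backed filesystem the builder uses.
    target = "/opt/dib_tmp/throughput-test.bin"

    chunk = b"\0" * (4 * 1024 * 1024)   # 4 MiB writes
    total = 1024 * 1024 * 1024          # 1 GiB overall

    start = time.monotonic()
    with open(target, "wb") as f:
        written = 0
        while written < total:
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())            # make sure the data actually hit the volume
    elapsed = time.monotonic() - start

    print(f"wrote {total / 2**20:.0f} MiB in {elapsed:.1f}s "
          f"({total / 2**20 / elapsed:.1f} MiB/s)")
    os.remove(target)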
Ramereth[m]Did the slowness start on or around Aug 6th? That's when I bumped the RAM ratio17:13
clarkbit's hard to tell because it appears the other cloud (which has gone away) was papering over things and the kolla team didn't notice until that cloud went away17:15
Ramereth[m]Well, let me work on getting the SSD cinder pool added and see if that helps at least. I'll ping you when it's ready. Hopefully later this afternoon17:16
clarkbJuly 26 is when we removed the linaro cloud. I think mnasiadka had data for it occurring as far back as then17:17
clarkbRamereth[m]: the cinder volume would most likely help our image builds but not the CI jobs themselves as we don't use cinder volumes for the CI jobs17:17
clarkbI guess I'm trying to say I don't think that particular item is urgent so don't feel rushed on it17:17
clarkb(it would be good to improve but image builds can run in the background slowly and we'll live; it is the CI jobs themselves where people are noticing the pain)17:18
clarkbmnasiadka: if you are still around you may have more concrete details? Maybe you can point at the particular bad parts of those kolla jobs (via the logs?)17:19
fricklerRamereth[m]: yes, network traffic to the uplink, as in "how much upstream capacity in terms of Mbit/s would one need to pay for when running a cloud for opendev"17:27
fricklerclarkb: fungi: ^^ am I right in assuming that we don't have this kind of data for any of our clouds so far?17:29
clarkbfrickler: we may have it for openmetal if someone logs into datadog, but otherwise that is correct17:29
fricklerI mentioned earlier that one could use traffic on the mirror node as lower bound, but even that's difficult since it mixes uplink traffic with traffic to CI nodes17:29
fricklerah, looking at openmetal might be an option, maybe just looking at interface counters or similar for a bit17:30
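For a rough Mbit/s figure, sampling the kernel's per-interface byte counters over an interval is usually enough; a minimal sketch, where the interface name and sampling window are assumptions:

    import time

    IFACE = "eth0"    # assumed uplink interface name
    INTERVAL = 60     # seconds between samples

    def read_counters(iface):
        # /proc/net/dev: after the "iface:" prefix, field 0 is rx bytes, field 8 is tx bytes
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])
        raise ValueError(f"interface {iface} not found")

    rx1, tx1 = read_counters(IFACE)
    time.sleep(INTERVAL)
    rx2, tx2 = read_counters(IFACE)

    print(f"rx {(rx2 - rx1) * 8 / INTERVAL / 1e6:.1f} Mbit/s, "
          f"tx {(tx2 - tx1) * 8 / INTERVAL / 1e6:.1f} Mbit/s")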
mnasiadkaclarkb: it's the build process that is bad - as in running kolla-build (which is building around 197 container images), I've run that on an aarch64 VM that was kindly provided by jrosser and I was able to get the kolla-build command finished in 10 to 15 minutes - and in OSUOSL it takes over 3 hours. What is interesting kolla-ansible jobs (which only deploy containers downloaded from quay.io) are not timing out in OSUOSL.17:42
clarkbmnasiadka: right can you link directly to that stuff?17:43
clarkbit's the concrete info that I've been asking for because that may lead to actionable corrections. Without that information it's really difficult to provide feedback to Ramereth[m] other than that it is slow17:43
clarkbfor example I was able to link to file copies through loopback devices on a cinder volume being slow above17:43
clarkbthe ci nodes don't use cinder volumes though so that can be an independent issue17:43
clarkbbut that led to "oh the cinder volumes aren't backed by ssds" so could be something we can improve17:44
mnasiadkaclarkb: https://384c64727f31e7a7ec3c-70aee045aa856a76767e0cd0433cf359.ssl.cf2.rackcdn.com/925712/9/check-arm64/kolla-build-debian-aarch64/71d53f4/job-output.txt - we're even using ephemeral0 device that seems to be available on OSUOSL if that helps in anything17:47
clarkbmnasiadka: if you use the zuul log rendering you should be able to link to specific areas of timeloss. Or provide timestamps like I did above17:48
clarkbI'm not asking for a top level job log I'm asking for "here's a cp or a gcc compile etc" that illustrates where things are particularly slow17:48
clarkbbecause something concrete like that can usually identify bottlenecks as well as a measurable activity we can use to check if things improve if any changes are made17:48
clarkbthe job-output.txt captures really high level stuff and indicates the job timed out. But it isn't showing us specific information on what is slow17:50
mnasiadkawell, we're not really compiling anything in the dockerfiles - it's just plain rpm/deb/pip install and downloading files from the mirror or over the internet17:51
clarkbright ideally someone who understands the kolla jobs and their logs would dig in a bit and find which operations indicate particularly bad timeloss17:52
mnasiadkaif it helps I can post timings from an aarch64 VM with fast storage to the ethercalc I shared yesterday but generally what takes over 3 hours in OSUOSL (just running kolla-build), took 11 minutes17:52
mnasiadkayou can see here -> https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#581 that templating out 197 Dockerfiles is quite fast - but that's not a lot of data17:53
clarkbI think what would help is if you can link to a specific part of the job logs that shows slow operations compared to the donor node that you used17:53
mnasiadkaok, let me run it on the donor node with timestamps to be able to easily compare it to job log17:56
mnasiadkaclarkb: https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#587 vs https://paste.openstack.org/show/b8YDoInK5RMrJex4KIE7/ (donor node)18:11
clarkbmnasiadka: those are still high level kolla build tasks. Can we pick a slow one and then dig into why that particular one is slow?18:12
clarkbwe know that building kolla is slow, what I'm hoping we can do is point to say docker COPY operations or installation of some specific package (these are just examples I don't actually know what it will be) that we can say "oh ya this clearly indicates that io on $device is slow" or "we're going out to the internet here and it is slow" or "compiling the dkms package for this kernel18:14
clarkbmodule is slow"18:14
mnasiadkayeah, for that I would need to get some timestamps from docker-py build routines, let me try - but I'm not going to come back in 5 minutes :)18:19
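For reference, getting per-line timestamps out of a docker-py build doesn't require changes deep in kolla; the low-level API streams the build output as it happens. A minimal sketch (the build context path and tag are hypothetical, and kolla's own build code wraps this differently):

    import datetime
    import docker

    # Low-level client so we see the raw streaming build output.
    client = docker.APIClient(base_url="unix://var/run/docker.sock")

    # Hypothetical build context and tag, purely to illustrate the idea.
    for chunk in client.build(path="docker/base", tag="kolla/base:test", decode=True):
        line = chunk.get("stream", "").rstrip()
        if line:
            print(datetime.datetime.now().isoformat(timespec="seconds"), line)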
fricklerat least the download speeds reported by apt etc. didn't look significantly slow to me18:21
fungiclarkb: NeilHanlon: no promises but i can look at trying to put together a gerritlib release this week, time permitting18:22
mnasiadkathere's also this - I did a test with running randrw fio on a 4G file x86: https://zuul.opendev.org/t/openstack/build/47e953e015a048e3ad6d15873ff3f49e/log/job-output.txt#545 - aarch64 is still running https://zuul.openstack.org/stream/000bddb724f343648aab14f937cc821d?logfile=console.log (which doesn't look good) - I know that test was not ideal and put some strain on the env, but still that should finish already (and it's still running 18:23
mnasiadkafor nearly an hour?)18:23
clarkbcool that's the sort of thing that helps because now we can point at something specific (this is disk io on the VM root disk) and it is measurable (we can time how long specific randrw data sizes take and determine if improvements have been made)18:26
clarkbRamereth[m]: ^ fyi I think thats the sort of concrete info we're looking for18:26
mnasiadkaclarkb: aarch64 finished, took 1 hr compared to 33 seconds - x86 reports 31.2k IOPS and aarch64 is 287 IOPS18:32
mnasiadkait's a bit more than a single HDD IOPS18:33
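For anyone wanting to reproduce that comparison, something along these lines drives fio and pulls the IOPS out of its JSON output; the exact fio options used above aren't in the log, so these are illustrative, and fio has to be installed on the node:

    import json
    import subprocess

    # Illustrative fio run: 4k random read/write against a 4G test file.
    result = subprocess.run(
        [
            "fio", "--name=randrw", "--filename=/var/lib/docker/fio-test",
            "--size=4G", "--rw=randrw", "--bs=4k", "--ioengine=libaio",
            "--iodepth=16", "--direct=1", "--runtime=60", "--time_based",
            "--output-format=json",
        ],
        check=True, capture_output=True, text=True,
    )

    job = json.loads(result.stdout)["jobs"][0]
    print(f"read {job['read']['iops']:.0f} IOPS, write {job['write']['iops']:.0f} IOPS")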
clarkbhttps://review.opendev.org/c/openstack/kolla/+/927210/ is the change if we need to dig up logs later18:33
mnasiadkaI used /var/lib/docker since that's the place where we mount ephemeral0 if it exists on a given cloud18:34
clarkbmnasiadka: wait you mount a different device?18:36
clarkband this isn't the root fs device?18:36
clarkbI wonder if the performance is different18:36
NeilHanlonfungi: appreciate that! :) 18:36
mnasiadkaclarkb: we mount that, since on some clouds we were running out of space and were recommended that some clouds that have smaller root fs device expose an additional disk usually labeled ephemeral018:37
fungisounds like it was mainly for rackspace nodes then18:37
clarkbmnasiadka: yes rax in particular. I'm just wondering if that is happening here or not. It sounds like you are saying it does happen which makes me wonder if that device performs worse than the root device if so18:38
clarkbhttps://zuul.opendev.org/t/openstack/build/000bddb724f343648aab14f937cc821d/log/job-output.txt#360-396 looks like it does happen18:39
mnasiadkaclarkb: seems / has 60G on osuosl and ephemeral0 has another 80G18:39
clarkbso ya I would also test if the root device is any different18:40
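To be sure which block device a given path actually lives on before benchmarking it (/ vs the ephemeral0 mount), the longest matching entry in /proc/self/mounts tells you; a minimal sketch:

    import os

    def backing_device(path):
        """Return (device, mountpoint) for the filesystem containing path."""
        path = os.path.realpath(path)
        best = ("", "")
        with open("/proc/self/mounts") as f:
            for line in f:
                device, mountpoint = line.split()[:2]
                if path == mountpoint or path.startswith(mountpoint.rstrip("/") + "/"):
                    if len(mountpoint) > len(best[1]):
                        best = (device, mountpoint)
        return best

    for p in ("/", "/var/lib/docker"):
        dev, mnt = backing_device(p)
        print(f"{p}: on {dev} (mounted at {mnt})")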
mnasiadkawell, if I post a patch to get tested now, the queue looks a bit overloaded and we'll see some results maybe tomorrow ;-)18:40
fungimaybe this is where Ramereth[m] can speculate on the relative performances of whatever the backing storage for the rootfs and ephemeral disk are18:41
fungiif they're different at all, that is18:41
fungisome focused benchmarking might also be possible18:42
mnasiadkaOk, I updated 927210 to also create an fio file in / - if there's any option to get all jobs in check-arm64 queue aborted just to get that running - we might see some results sooner than tomorrow18:51
mnasiadkaall kolla jobs in check-arm64 of course18:52
corvuson it18:56
corvus210 is running18:58
clarkbthanks!18:59
mnasiadkait seems that the root disk is as bad as the additional ephemeral0 disk19:11
fungiquite possible they're both using exactly the same storage underneath19:12
fungior they share a common write cache19:12
mnasiadkawell, at least we know what's the real problem19:14
clarkbya, but it is good to confirm the behavior is consistent as that is additional debugging information19:14
