Tuesday, 2024-08-27

mnasiadkahello, seems the aarch64 situation is not getting better - I'll probably disable the debian build jobs in Kolla as well12:03
mnasiadkaclarkb: out of curiosity - what bandwidth would be required to stand up a nodepool provider? (I assume mirror traffic takes some portion of that)12:07
fricklermnasiadka: you are talking about donating cloud resources for CI? I don't think we have any data on the data consumption we produce12:11
mnasiadkafrickler: yeah, that would be useful to see if I can convince somebody to improve the aarch64 situation12:11
fricklerwe could check with Ramereth whether he has any accounting data, other than that I wouldn't even have an idea how to collect that. we could maybe get some lower bound by looking at the traffic on the mirrors, though12:21
NeilHanloneyo clarkb - g'morning :) I recently picked up maintaining the python-gerritlib (https://opendev.org/opendev/gerritlib) package in Fedora and the version being shipped fails to build from source with python 3.13 (version 0.6.0...) -- anyways, I'm bumping it now to a newer version and was going to just use the latest git commit for the package. My14:17
NeilHanlonquestion is: is there a plan to cut more tags for gerritlib and/or push releases to pypi? No worries in any direction, just will inform how I go about packaging it moving forward (i.e., as a 0.11.0 pre-release, or just as a 0.10.0-$downstreamreleasebump)14:17
clarkbNeilHanlon: we primarily use gerritlib with jeepyb which consumes gerritlib releases (so we're using 0.10.0 there with python3.11 on debian bookworm I think). I didn't have any plans for a release but I would expect the next one to be 0.11.0 because we dropped older python support15:43
clarkbI probably won't get to that myself this week as I'm going to need to really focus on travel prep stuff now, but if someone else did that would be fine15:43
clarkbNeilHanlon: does latest commit work with python3.13? or do we need to make additional changes for that? If we do need additional changes it might be good to do that before a 0.11.0 release15:44
NeilHanlonclarkb: yeah it appears to work on python 3.13, at least, from a building perspective. I will actually give it a try in a little bit15:46
clarkbunrelated to ^ there is an email on the gerrit list today about how an update to SSHD on gerrit master may have broken ssh connectivity15:54
clarkbI don't think we have a job set up to test that, but I'm checking now to see if they backported to stable 3.9 or 3.10 as we can test on those branches15:54
clarkbnope neither stable release has the SSHD_VERS bump. Will just have to see what upstream debugging says15:57
Ramereth[m]<frickler> "we could check with Ramereth..." <- What exactly do you need? Have the I/O issues not improved since we last talked?15:58
clarkbRamereth[m]: whatever the issue is seems to be persisting at least as measured by our CI job runtimes (and timeouts). I was hoping that someone more familiar with what those jobs are doing would look into the logs more closely to try and find more of a concrete cause but I don't think that has happened15:59
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add a role to convert diskimages between formats  https://review.opendev.org/c/zuul/zuul-jobs/+/92291216:00
clarkbwe did successfully build an ubuntu focal arm64 image over night (looks like the bionic image build was interrupted by a nodepool container update...) so thats good we've got things working there again16:00
fricklerRamereth[m]: that was only semi-related to the current issue, mnasiadka wanted to know how much traffic our CI is producing in order to give that information to some potential new cloud donor16:06
clarkbMINA has already been debugging the gerrit issue I called out above. Turns out the bug has existed since 2015 and gerrit worked around it elsewhere but not in the new kex handling16:15
opendevreviewMerged zuul/zuul-jobs master: Synchronize test-prepare-workspace-git to prepare-workspace-git  https://review.opendev.org/c/zuul/zuul-jobs/+/92554016:29
corvusthat change has been extensively tested, but it does affect every job; please be aware of it and ping me if you see any errors related to git workspace setup16:29
clarkback16:31
opendevreviewMerged zuul/zuul-jobs master: Add ensure-dib role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291016:40
clarkbcorvus: fwiw I do see jobs succeeding after that merged16:47
clarkbstill a small sample size and there could be corner cases but it isn't just a hard fail16:48
clarkbcorvus: one thing I notice is the streaming console log is far less verbose now which is maybe not ideal. We'll have to go to the post job ansible console afterwards to see that info16:49
clarkb(it's just a nice way of confirming the job is running with the expected hashes when it starts up)16:49
clarkbdevstack jobs are taking ~5 seconds to set up git now though which seems like a reasonable compromise16:50
clarkboh maybe it was closer to 7 :) still great16:51
clarkbhttps://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/11/ubuntu-jammy the info is available here after the job completes so we didn't lose the info entirely16:52
corvusyeah, i included all the details in the result object, and more actually, so it should be easy (maybe even easier) to debug16:53
corvusand i'm kinda hoping that the workspace setup is fast enough no one has time to notice there isn't a bunch of chatty git output.  :)16:54
clarkbya I usually only care when I'm trying to confirm the right thing was checked out as the job is running. A rare occurrence and usually only to see if a depends on did what I expected. I can just check after the fact16:55
corvusclarkb: note this: https://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/9/ubuntu-jammy16:55
corvus"initial_state":"cloned-from-cache"  is a new thing16:56
clarkbneat. I think I remember that from the reviews now16:57
opendevreviewMerged zuul/zuul-jobs master: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291116:57
opendevreviewMerged zuul/zuul-jobs master: Add build_diskimage_environment role variable  https://review.opendev.org/c/zuul/zuul-jobs/+/92622416:57
opendevreviewMerged zuul/zuul-jobs master: Add a diskimage-builder job  https://review.opendev.org/c/zuul/zuul-jobs/+/92622516:57
clarkbbut also speeding up jobs by 30 seconds a piece * several thousand a day is a really nice improvement16:58
opendevreviewMerged zuul/zuul-jobs master: Add a role to convert diskimages between formats  https://review.opendev.org/c/zuul/zuul-jobs/+/92291217:00
Ramereth[m]<clarkb> "Ramereth: whatever the issue..." <- I ask because I noticed there was an issue with one of the Ceph nodes and the RAID controller, but I resolved that a few days ago. If it's still happening I'll have to take a closer look again and narrow down what is going on.17:05
Ramereth[m]<frickler> "Ramereth: that was only semi-..." <- When you say traffic, do you mean network traffic from the VMs? or something else?17:05
clarkbRamereth[m]: ya I think it is still going on, one sec and I'll get a log from one of our image builds that took 7 hours17:07
clarkbhttps://nb04.opendev.org/ubuntu-focal-arm64-09615a17683a4f149aae1600d1fa49ed.log I think this generally takes about 90 minutes on our x86 builders. I haven't spent a ton of time digging into that log to see what is slow yet17:07
Ramereth[m]FWIW I also adjusted the RAM ratio last week so it might be hitting swap a little more on VMs. I ordered more RAM that should arrive on Friday. Depending on when I get it I'm hoping to get those installed on the newer Mt. Collins nodes17:08
Ramereth[m]Something else I realized is that I didn't properly add the SSD cinder pool into the cluster, so that might help things too, but you would have to use cinder boot volumes to utilize that17:09
clarkboh that is an interesting thought. I think swap could explain what we've seen17:09
clarkbya if you look at timestamp 2024-08-27 12:38:24.257 the step where we copy from the filesystem to the image mounted as a loopback device with losetup takes over an hour17:10
clarkbthat occurs on a cinder volume though so maybe related to the ssd cinder pool miss more than anything else?17:11
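A rough way to isolate sequential write throughput on that cinder-backed volume, independent of dib itself, is a timed streaming write with an fsync at the end; a minimal Python sketch, with a hypothetical target path on the builder:

    import os
    import time

    # Hypothetical target path on the cinder-backed filesystem the builder uses.
    target = "/opt/dib_tmp/throughput-test.bin"

    chunk = b"\0" * (4 * 1024 * 1024)   # 4 MiB writes
    total = 1024 * 1024 * 1024          # 1 GiB overall

    start = time.monotonic()
    with open(target, "wb") as f:
        written = 0
        while written < total:
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())            # make sure the data actually hit the volume
    elapsed = time.monotonic() - start

    print(f"wrote {total / 2**20:.0f} MiB in {elapsed:.1f}s "
          f"({total / 2**20 / elapsed:.1f} MiB/s)")
    os.remove(target)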
Ramereth[m]Did the slowness start on or around Aug 6th? That's when I bumped the RAM ratio17:13
clarkbit's hard to tell because it appears the other cloud (which has gone away) was papering over things and the kolla team didn't notice until that cloud went away17:15
Ramereth[m]Well, let me work on getting the SSD cinder pool added and see if that helps at least. I'll ping you when it's ready. Hopefully later this afternoon17:16
clarkbJuly 26 is when we removed the linaro cloud. I think mnasiadka had data for it occurring as far back as then17:17
clarkbRamereth[m]: the cinder volume would most likely help our image builds but not the CI jobs themselves as we don't use cinder volumes for the CI jobs17:17
clarkbI guess I'm trying to say I don't think that particular item is urgent so don't feel rushed on it17:17
clarkb(it would be good to improve but image builds can run in the background slowly and we'll live; it is the CI jobs themselves where people are noticing the pain)17:18
clarkbmnasiadka: if you are still around you may have more concrete details? Maybe you can point at the particular bad parts of those kolla jobs (via the logs?)17:19
fricklerRamereth[m]: yes, network traffic to the uplink, as in "how much upstream capacity in terms of Mbit/s would one need to pay for when running a cloud for opendev"17:27
fricklerclarkb: fungi: ^^ am I right in assuming that we don't have this kind of data for any of our clouds so far?17:29
clarkbfrickler: we may have it for openmetal if someone logs into datadog, but otherwise that is correct17:29
fricklerI mentioned earlier that one could use traffic on the mirror node as lower bound, but even that's difficult since it mixes uplink traffic with traffic to CI nodes17:29
fricklerah, looking at openmetal might be an option, maybe just looking at interface counters or similar for a bit17:30
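For a rough Mbit/s figure, sampling the kernel's per-interface byte counters over an interval is usually enough; a minimal sketch, where the interface name and sampling window are assumptions:

    import time

    IFACE = "eth0"    # assumed uplink interface name
    INTERVAL = 60     # seconds between samples

    def read_counters(iface):
        # /proc/net/dev: after the "iface:" prefix, field 0 is rx bytes, field 8 is tx bytes
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])
        raise ValueError(f"interface {iface} not found")

    rx1, tx1 = read_counters(IFACE)
    time.sleep(INTERVAL)
    rx2, tx2 = read_counters(IFACE)

    print(f"rx {(rx2 - rx1) * 8 / INTERVAL / 1e6:.1f} Mbit/s, "
          f"tx {(tx2 - tx1) * 8 / INTERVAL / 1e6:.1f} Mbit/s")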
mnasiadkaclarkb: it's the build process that is bad - as in running kolla-build (which is building around 197 container images), I've run that on an aarch64 VM that was kindly provided by jrosser and I was able to get the kolla-build command finished in 10 to 15 minutes - and in OSUOSL it takes over 3 hours. What is interesting kolla-ansible jobs (which only deploy containers downloaded from quay.io) are not timing out in OSUOSL.17:42
clarkbmnasiadka: right can you link directly to that stuff?17:43
clarkbit's the concrete info that I've been asking for because that may lead to actionable corrections. Without that information it's really difficult to provide feedback to Ramereth[m] other than that it is slow17:43
clarkbfor example I was able to link to file copies through loopback devices on a cinder volume being slow above17:43
clarkbthe ci nodes don't use cinder volumes though so that can be an independent issue17:43
clarkbbut that led to "oh the cinder volumes aren't backed by ssds" so could be something we can improve17:44
mnasiadkaclarkb: https://384c64727f31e7a7ec3c-70aee045aa856a76767e0cd0433cf359.ssl.cf2.rackcdn.com/925712/9/check-arm64/kolla-build-debian-aarch64/71d53f4/job-output.txt - we're even using ephemeral0 device that seems to be available on OSUOSL if that helps in anything17:47
clarkbmnasiadka: if you use the zuul log rendering you should be able to link to specific areas of timeloss. Or provide timestamps like I did above17:48
clarkbI'm not asking for a top level job log I'm asking for "here's a cp or a gcc compile etc" that illustrates where things are particularly slow17:48
clarkbbecause something concrete like that can usually identify bottlenecks as well as a measurable activity we can use to check if things improve if any changes are made17:48
clarkbthe job-output.txt captures really high level stuff and indicates the job timed out. But it isn't showing us specific information on what is slow17:50
mnasiadkawell, we're not really compiling anything in the dockerfiles - it's just plain rpm/deb/pip install and downloading files from the mirror or over the internet17:51
clarkbright ideally someone who understands the kolla jobs and their logs would dig in a bit and find which operations indicate particularly bad timeloss17:52
mnasiadkaif it helps I can post timings from an aarch64 VM with fast storage to the ethercalc I shared yesterday but generally what takes over 3 hours in OSUOSL (just running kolla-build), took 11 minutes17:52
mnasiadkayou can see here -> https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#581 that templating out 197 Dockerfiles is quite fast - but that's not a lot of data17:53
clarkbI think what would help is if you can link to a specific part of the job logs that shows slow operations compared to the donor node that you used17:53
mnasiadkaok, let me run it on the donor node with timestamps to be able to easily compare it to job log17:56
mnasiadkaclarkb: https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#587 vs https://paste.openstack.org/show/b8YDoInK5RMrJex4KIE7/ (donor node)18:11
clarkbmnasiadka: those are still high level kolla build tasks. Can we pick a slow one and then dig into why that particular one is slow?18:12
clarkbwe know that building kolla is slow, what I'm hoping we can do is point to say docker COPY operations or installation of some specific package (these are just examples I don't actually know what it will be) that we can say "oh ya this clearly indicates that io on $device is slow" or "we're going out to the internet here and it is slow" or "compiling the dkms package for this kernel18:14
clarkbmodule is slow"18:14
mnasiadkayeah, for that I would need to get some timestamps from docker-py build routines, let me try - but I'm not going to come back in 5 minutes :)18:19
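For reference, getting per-line timestamps out of a docker-py build doesn't require changes deep in kolla; the low-level API streams the build output as it happens. A minimal sketch (the build context path and tag are hypothetical, and kolla's own build code wraps this differently):

    import datetime
    import docker

    # Low-level client so we see the raw streaming build output.
    client = docker.APIClient(base_url="unix://var/run/docker.sock")

    # Hypothetical build context and tag, purely to illustrate the idea.
    for chunk in client.build(path="docker/base", tag="kolla/base:test", decode=True):
        line = chunk.get("stream", "").rstrip()
        if line:
            print(datetime.datetime.now().isoformat(timespec="seconds"), line)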
fricklerat least the download speeds reported by apt etc. didn't look significantly slow to me18:21
fungiclarkb: NeilHanlon: no promises but i can look at trying to put together a gerritlib release this week, time permitting18:22
mnasiadkathere's also this - I did a test with running randrw fio on a 4G file x86: https://zuul.opendev.org/t/openstack/build/47e953e015a048e3ad6d15873ff3f49e/log/job-output.txt#545 - aarch64 is still running https://zuul.openstack.org/stream/000bddb724f343648aab14f937cc821d?logfile=console.log (which doesn't look good) - I know that test was not ideal and put some strain on the env, but still that should finish already (and it's still running 18:23
mnasiadkafor nearly an hour?)18:23
clarkbcool that's the sort of thing that helps because now we can point at something specific (this is disk io on the VM root disk) and it is measurable (we can time how long specific randrw data sizes take and determine if improvements have been made)18:26
clarkbRamereth[m]: ^ fyi I think thats the sort of concrete info we're looking for18:26
mnasiadkaclarkb: aarch64 finished, took 1 hr compared to 33 seconds - x86 reports 31.2k IOPS and aarch64 is 287 IOPS18:32
mnasiadkait's a bit more than a single HDD IOPS18:33
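For anyone wanting to reproduce that comparison, something along these lines drives fio and pulls the IOPS out of its JSON output; the exact fio options used above aren't in the log, so these are illustrative, and fio has to be installed on the node:

    import json
    import subprocess

    # Illustrative fio run: 4k random read/write against a 4G test file.
    result = subprocess.run(
        [
            "fio", "--name=randrw", "--filename=/var/lib/docker/fio-test",
            "--size=4G", "--rw=randrw", "--bs=4k", "--ioengine=libaio",
            "--iodepth=16", "--direct=1", "--runtime=60", "--time_based",
            "--output-format=json",
        ],
        check=True, capture_output=True, text=True,
    )

    job = json.loads(result.stdout)["jobs"][0]
    print(f"read {job['read']['iops']:.0f} IOPS, write {job['write']['iops']:.0f} IOPS")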
clarkbhttps://review.opendev.org/c/openstack/kolla/+/927210/ is the change if we need to dig up logs later18:33
mnasiadkaI used /var/lib/docker since that's the place where we mount ephemeral0 if it exists on a given cloud18:34
clarkbmnasiadka: wait you mount a different device?18:36
clarkband this isn't the root fs device?18:36
clarkbI wonder if the performance is different18:36
NeilHanlonfungi: appreciate that! :) 18:36
mnasiadkaclarkb: we mount that, since on some clouds we were running out of space and were recommended that some clouds that have smaller root fs device expose an additional disk usually labeled ephemeral018:37
fungisounds like it was mainly for rackspace nodes then18:37
clarkbmnasiadka: yes rax in particular. I'm just wondering if that is happening here or not. It sounds like you are saying it does happen which makes me wonder if that device performs worse than the root device if so18:38
clarkbhttps://zuul.opendev.org/t/openstack/build/000bddb724f343648aab14f937cc821d/log/job-output.txt#360-396 looks like it does happen18:39
mnasiadkaclarkb: seems / has 60G on osuosl and ephemeral0 has another 80G18:39
clarkbso ya I would also test if the root device is any different18:40
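To be sure which block device a given path actually lives on before benchmarking it (/ vs the ephemeral0 mount), the longest matching entry in /proc/self/mounts tells you; a minimal sketch:

    import os

    def backing_device(path):
        """Return (device, mountpoint) for the filesystem containing path."""
        path = os.path.realpath(path)
        best = ("", "")
        with open("/proc/self/mounts") as f:
            for line in f:
                device, mountpoint = line.split()[:2]
                if path == mountpoint or path.startswith(mountpoint.rstrip("/") + "/"):
                    if len(mountpoint) > len(best[1]):
                        best = (device, mountpoint)
        return best

    for p in ("/", "/var/lib/docker"):
        dev, mnt = backing_device(p)
        print(f"{p}: on {dev} (mounted at {mnt})")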
mnasiadkawell, if I post a patch to get tested now, the queue looks a bit overloaded and we'll see some results maybe tomorrow ;-)18:40
fungimaybe this is where Ramereth[m] can speculate on the relative performances of whatever the backing storage for the rootfs and ephemeral disk are18:41
fungiif they're different at all, that is18:41
fungisome focused benchmarking might also be possible18:42
mnasiadkaOk, I updated 927210 to also create an fio file in / - if there's any option to get all jobs in check-arm64 queue aborted just to get that running - we might see some results sooner than tomorrow18:51
mnasiadkaall kolla jobs in check-arm64 of course18:52
corvuson it18:56
corvus210 is running18:58
clarkbthanks!18:59
mnasiadkait seems that the root disk is as bad as the additional ephemeral0 disk19:11
fungiquite possible they're both using exactly the same storage underneath19:12
fungior they share a common write cache19:12
mnasiadkawell, at least we know what's the real problem19:14
clarkbya, but it is good to confirm the behavior is consistent as that is additional debugging information19:14
