mnasiadka | hello, seems the aarch64 situation is not getting better - I'll probably disable the debian build jobs in Kolla as well | 12:03 |
---|---|---|
mnasiadka | clarkb: out of curiosity - what bandwidth would be required to stand up a nodepool provider? (I assume mirror traffic takes some portion of that) | 12:07 |
frickler | mnasiadka: you are talking about donating cloud resources for CI? I don't think we have any data on the data consumption we produce | 12:11 |
mnasiadka | frickler: yeah, that would be useful to see if I can convince somebody to improve the aarch64 situation | 12:11 |
frickler | we could check with Ramereth whether he has any accounting data, other than that I wouldn't even have an idea how to collect that. we could maybe get some lower bound by looking at the traffic on the mirrors, though | 12:21 |
NeilHanlon | eyo clarkb - g'morning :) I recently picked up maintaining the python-gerritlib (https://opendev.org/opendev/gerritlib) package in Fedora and the version being shipped fails to build from source with python 3.13 (version 0.6.0...) -- anyways, I'm bumping it now to a newer version and was going to just use the latest git commit for the package. My | 14:17 |
NeilHanlon | question is: is there a plan to cut more tags for gerritlib and/or push releases to pypi? No worries in any direction, just will inform how I go about packaging it moving forward (i.e., as a 0.11.0 pre-release, or just as a 0.10.0-$downstreamreleasebump) | 14:17 |
clarkb | NeilHanlon: we primarily use gerritlib with jeepyb which consumes gerritlib releases (so we're using 0.10.0 there with python3.11 on debian bookworm I think). I didn't have any plans for a release but I would expect the next one to be 0.11.0 because we dropped older python support | 15:43 |
clarkb | I probably won't get to that myself this week as I'm going to need to really focus on travel prep stuff now, but if someone else did that would be fine | 15:43 |
clarkb | NeilHanlon: does latest commit work with python3.13? or do we need to make additional changes for that? If we do need additional changes it might be good to do that before a 0.11.0 release | 15:44 |
NeilHanlon | clarkb: yeah it appears to work on python 3.13, at least, from a building perspective. I will actually give it a try in a little bit | 15:46 |
clarkb | unrelated to ^ there is an email on the gerrit list today about how an update to SSHD on gerrit master may have broken ssh connectivity | 15:54 |
clarkb | I don't think we have a job set up to test that, but I'm checking now to see if they backported to stable 3.9 or 3.10 as we can test on those branches | 15:54 |
clarkb | nope, neither stable release has the SSHD_VERS bump. Will just have to see what upstream debugging says | 15:57 |
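A minimal connectivity probe for that sort of job is sketched below, assuming key-based auth is already set up for the account; the host, port, and username are hypothetical placeholders, not the real review server configuration.

```python
# Minimal sketch: probe a Gerrit server's SSH interface by running the
# built-in "gerrit version" command. Host, port, and user are hypothetical
# placeholders; assumes the account's SSH key is already registered.
import subprocess

def gerrit_ssh_ok(host="review.example.org", port=29418, user="ci-probe"):
    result = subprocess.run(
        ["ssh", "-p", str(port), f"{user}@{host}", "gerrit", "version"],
        capture_output=True, text=True, timeout=30,
    )
    # A healthy SSHD stack answers with "gerrit version X.Y.Z" on stdout.
    return result.returncode == 0 and result.stdout.startswith("gerrit version")

if __name__ == "__main__":
    print("ssh OK" if gerrit_ssh_ok() else "ssh FAILED")
```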
Ramereth[m] | <frickler> "we could check with Ramereth..." <- What exactly do you need? Have the I/O issues not improved since we last talked? | 15:58 |
clarkb | Ramereth[m]: whatever the issue is seems to be persisting at least as measured by our CI job runtimes (and timeouts). I was hoping that someone more familiar with what those jobs are doing would look into the logs more closely to try and find more of a concrete cause but I don't think that has happened | 15:59 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add a role to convert diskimages between formats https://review.opendev.org/c/zuul/zuul-jobs/+/922912 | 16:00 |
clarkb | we did successfully build an ubuntu focal arm64 image overnight (looks like the bionic image build was interrupted by a nodepool container update...) so that's good, we've got things working there again | 16:00 |
frickler | Ramereth[m]: that was only semi-related to the current issue, mnasiadka wanted to know how much traffic our CI is producing in order to give that information to some potential new cloud donor | 16:06 |
clarkb | MINA has already been debugging the gerrit issue I called out above. Turns out the bug has existed since 2015 and gerrit worked around it elsewhere but not in the new kex handling | 16:15 |
opendevreview | Merged zuul/zuul-jobs master: Synchronize test-prepare-workspace-git to prepare-workspace-git https://review.opendev.org/c/zuul/zuul-jobs/+/925540 | 16:29 |
corvus | that change has been extensively tested, but it does affect every job; please be aware of it and ping me if you see any errors related to git workspace setup | 16:29 |
clarkb | ack | 16:31 |
opendevreview | Merged zuul/zuul-jobs master: Add ensure-dib role https://review.opendev.org/c/zuul/zuul-jobs/+/922910 | 16:40 |
clarkb | corvus: fwiw I do see jobs succeeding after that merged | 16:47 |
clarkb | still a small sample size and there could be corner cases but it isn't just a hard fail | 16:48 |
clarkb | corvus: one thing I notice is the streaming console log is far less verbose now which is maybe not ideal. We'll have to go to the post-job ansible console afterwards to see that info | 16:49 |
clarkb | (it's just a nice way of confirming the job is running with the expected hashes when it starts up) | 16:49 |
clarkb | devstack jobs are taking ~5 seconds to set up git now though which seems like a reasonable compromise | 16:50 |
clarkb | oh maybe it was closer to 7 :) still great | 16:51 |
clarkb | https://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/11/ubuntu-jammy the info is available here after the job completes so we didn't lose the info entirely | 16:52 |
corvus | yeah, i included all the details in the result object, and more actually, so it should be easy (maybe even easier) to debug | 16:53 |
corvus | and i'm kinda hoping that the workspace setup is fast enough no one has time to notice there isn't a bunch of chatty git output. :) | 16:54 |
clarkb | ya I usually only care when I'm trying to confirm the right thing was checked out as the job is running. A rare occurrence and usually only to see if a depends-on did what I expected. I can just check after the fact | 16:55 |
corvus | clarkb: note this: https://zuul.opendev.org/t/openstack/build/b95e3358fdde4e20ab97b339bfa3fde6/console#0/3/9/ubuntu-jammy | 16:55 |
corvus | "initial_state":"cloned-from-cache" is a new thing | 16:56 |
clarkb | neat. I think I remember that from the reviews now | 16:57 |
opendevreview | Merged zuul/zuul-jobs master: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 16:57 |
opendevreview | Merged zuul/zuul-jobs master: Add build_diskimage_environment role variable https://review.opendev.org/c/zuul/zuul-jobs/+/926224 | 16:57 |
opendevreview | Merged zuul/zuul-jobs master: Add a diskimage-builder job https://review.opendev.org/c/zuul/zuul-jobs/+/926225 | 16:57 |
clarkb | but also speeding up jobs by 30 seconds apiece * several thousand a day is a really nice improvement | 16:58 |
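As a rough worked figure (assuming on the order of 3,000 builds per day, an illustrative number rather than a measured one): 30 s × 3,000 ≈ 90,000 s, or roughly 25 node-hours of CI time saved each day.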
opendevreview | Merged zuul/zuul-jobs master: Add a role to convert diskimages between formats https://review.opendev.org/c/zuul/zuul-jobs/+/922912 | 17:00 |
Ramereth[m] | <clarkb> "Ramereth: whatever the issue..." <- I ask because I noticed there was an issue with one of the Ceph nodes and the RAID controller, but I resolved that a few days ago. If it's still happening I'll have to take a closer look again and narrow down what is going on. | 17:05 |
Ramereth[m] | <frickler> "Ramereth: that was only semi-..." <- When you say traffic, do you mean network traffic from the VMs? or something else? | 17:05 |
clarkb | Ramereth[m]: ya I think it is still going on; one sec and I'll get a log from one of our image builds that took 7 hours | 17:07 |
clarkb | https://nb04.opendev.org/ubuntu-focal-arm64-09615a17683a4f149aae1600d1fa49ed.log I think this generally takes about 90 minutes on our x86 builders. I haven't spent a ton of time digging into that log to see what is slow yet | 17:07 |
Ramereth[m] | FWIW I also adjusted the RAM ratio last week so it might be hitting swap a little more on VMs. I ordered more RAM that should arrive on Friday. Depending on when I get it I'm hoping to get those installed on the newer Mt. Collins nodes | 17:08 |
Ramereth[m] | Something else I realized is that I didn't properly add the SSD cinder pool into the cluster, so that might help things too, but you would have to use cinder boot volumes to utilize that | 17:09 |
clarkb | oh that is an interesting thought. I think swap could explain what we've seen | 17:09 |
clarkb | ya, if you look at timestamp 2024-08-27 12:38:24.257, the step where we copy from the filesystem to the image mounted as a loopback device with losetup takes over an hour | 17:10 |
clarkb | that occurs on a cinder volume though, so maybe it's related to the missing SSD cinder pool more than anything else? | 17:11 |
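A quick way to sanity-check raw write throughput on the builder's cinder-backed working directory is to time a large sequential write. This is a minimal sketch only: the path and size are assumptions, and it is not the loopback copy code path diskimage-builder itself uses.

```python
# Minimal sketch: time a sequential write on a mount point to get a rough
# throughput number. The path and size are assumptions for illustration;
# this is not the losetup/copy step diskimage-builder itself performs.
import os
import time

def write_throughput(path="/opt/dib_tmp/throughput-test.bin", size_mb=1024):
    block = b"\0" * (1 << 20)  # 1 MiB of zeroes
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the device
    elapsed = time.monotonic() - start
    os.unlink(path)
    return size_mb / elapsed  # MiB/s

if __name__ == "__main__":
    print(f"~{write_throughput():.1f} MiB/s sequential write")
```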
Ramereth[m] | Did the slowness start on or around Aug 6th? That's when I bumped the RAM ratio | 17:13 |
clarkb | it's hard to tell because it appears the other cloud (which has gone away) was papering over things and the kolla team didn't notice until that cloud went away | 17:15 |
Ramereth[m] | Well, let me work on getting the SSD cinder pool added and see if that helps at least. I'll ping you when it's ready. Hopefully later this afternoon | 17:16 |
clarkb | July 26 is when we removed the linaro cloud. I think mnasiadka had data for it occurring as far back as then | 17:17 |
clarkb | Ramereth[m]: the cinder volume would most likely help our image builds but not the CI jobs themselves as we don't use cinder volumes for the CI jobs | 17:17 |
clarkb | I guess I'm trying to say I don't think that particular item is urgent so don't feel rushed on it | 17:17 |
clarkb | (it would be good to improve, but image builds can run slowly in the background and we'll live; it is the CI jobs themselves where people are noticing the pain) | 17:18 |
clarkb | mnasiadka: if you are still around you may have more concrete details? Maybe you can point at the particular bad parts of those kolla jobs (via the logs?) | 17:19 |
frickler | Ramereth[m]: yes, network traffic to the uplink, as in "how much upstream capacity in terms of Mbit/s would one need to pay for when running a cloud for opendev" | 17:27 |
frickler | clarkb: fungi: ^^ am I right in assuming that we don't have this kind of data for any of our clouds so far? | 17:29 |
clarkb | frickler: we may have it for openmetal if someone logs into datadog, but otherwise that is correct | 17:29 |
frickler | I mentioned earlier that one could use traffic on the mirror node as a lower bound, but even that's difficult since it mixes uplink traffic with traffic to CI nodes | 17:29 |
frickler | ah, looking at openmetal might be an option, maybe just looking at interface counters or similar for a bit | 17:30 |
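A lower-bound estimate of that sort could be as simple as sampling a node's interface byte counters over an interval. This is a minimal sketch, and the interface name is an assumption (it may be ens3, eth0, etc. on a real mirror node).

```python
# Minimal sketch: estimate outbound bandwidth by sampling an interface's
# tx_bytes counter from sysfs over a fixed interval. The interface name is
# an assumption; adjust it for the actual node being measured.
import time

def tx_mbit_per_s(iface="eth0", interval=60):
    path = f"/sys/class/net/{iface}/statistics/tx_bytes"
    with open(path) as f:
        start = int(f.read())
    time.sleep(interval)
    with open(path) as f:
        end = int(f.read())
    return (end - start) * 8 / interval / 1_000_000

if __name__ == "__main__":
    print(f"~{tx_mbit_per_s():.1f} Mbit/s outbound over the last minute")
```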
mnasiadka | clarkb: it's the build process that is bad - as in running kolla-build (which builds around 197 container images). I've run that on an aarch64 VM that was kindly provided by jrosser and was able to get the kolla-build command to finish in 10 to 15 minutes - in OSUOSL it takes over 3 hours. What is interesting is that kolla-ansible jobs (which only deploy containers downloaded from quay.io) are not timing out in OSUOSL. | 17:42 |
clarkb | mnasiadka: right can you link directly to that stuff? | 17:43 |
clarkb | it's the concrete info that I've been asking for, because that may lead to actionable corrections. Without that information it's really difficult to provide feedback to Ramereth[m] other than that it is slow | 17:43 |
clarkb | for example I was able to link to file copies through loopback devices on a cinder volume being slow above | 17:43 |
clarkb | the ci nodes don't use cinder volumes though so that can be an independent issue | 17:43 |
clarkb | but that led to "oh the cinder volumes aren't backed by SSDs", so it could be something we can improve | 17:44 |
mnasiadka | clarkb: https://384c64727f31e7a7ec3c-70aee045aa856a76767e0cd0433cf359.ssl.cf2.rackcdn.com/925712/9/check-arm64/kolla-build-debian-aarch64/71d53f4/job-output.txt - we're even using the ephemeral0 device that seems to be available on OSUOSL, if that helps at all | 17:47 |
clarkb | mnasiadka: if you use the zuul log rendering you should be able to link to specific areas of time loss. Or provide timestamps like I did above | 17:48 |
clarkb | I'm not asking for a top level job log; I'm asking for "here's a cp or a gcc compile etc" that illustrates where things are particularly slow | 17:48 |
clarkb | because something concrete like that can usually identify bottlenecks, and it gives us a measurable activity we can use to check whether things improve if any changes are made | 17:48 |
clarkb | the job-output.txt captures really high level stuff and indicates the job timed out. But it isn't showing us specific information on what is slow | 17:50 |
mnasiadka | well, we're not really compiling anything in the dockerfiles - it's just plain rpm/deb/pip install and downloading files from the mirror or over the internet | 17:51 |
clarkb | right, ideally someone who understands the kolla jobs and their logs would dig in a bit and find which operations indicate particularly bad time loss | 17:52 |
mnasiadka | if it helps I can post timings from an aarch64 VM with fast storage to the ethercalc I shared yesterday, but generally what takes over 3 hours in OSUOSL (just running kolla-build) took 11 minutes there | 17:52 |
mnasiadka | you can see here -> https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#581 that templating out 197 Dockerfiles is quite fast - but that's not a lot of data | 17:53 |
clarkb | I think what would help is if you can link to a specific part of the job logs that shows slow operations compared to the donor node that you used | 17:53 |
mnasiadka | ok, let me run it on the donor node with timestamps to be able to easily compare it to job log | 17:56 |
mnasiadka | clarkb: https://zuul.openstack.org/build/71d53f41e317442f94339124073f6b0e/log/job-output.txt#587 vs https://paste.openstack.org/show/b8YDoInK5RMrJex4KIE7/ (donor node) | 18:11 |
clarkb | mnasiadka: those are still high level kolla build tasks. Can we pick a slow one and then dig into why that particular one is slow? | 18:12 |
clarkb | we know that building kolla is slow, what I'm hoping we can do is point to say docker COPY operations or installation of some specific package (these are just examples I don't actually know what it will be) that we can say "oh ya this clearly indicates that io on $device is slow" or "we're going out to the internet here and it is slow" or "compiling the dkms package for this kernel | 18:14 |
clarkb | module is slow" | 18:14 |
mnasiadka | yeah, for that I would need to get some timestamps from docker-py build routines, let me try - but I'm not going to come back in 5 minutes :) | 18:19 |
frickler | at least the download speeds reported by apt etc. didn't look significantly slow to me | 18:21 |
fungi | clarkb: NeilHanlon: no promises but i can look at trying to put together a gerritlib release this week, time permitting | 18:22 |
mnasiadka | there's also this - I did a test with running randrw fio on a 4G file. x86: https://zuul.opendev.org/t/openstack/build/47e953e015a048e3ad6d15873ff3f49e/log/job-output.txt#545 - aarch64 is still running https://zuul.openstack.org/stream/000bddb724f343648aab14f937cc821d?logfile=console.log (which doesn't look good) - I know that test was not ideal and put some strain on the env, but still it should have finished already (and it's still running | 18:23 |
mnasiadka | for nearly an hour?) | 18:23 |
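For reference, a random read/write fio test along those lines can be driven from a job roughly like this. The exact flags below are an assumption about how such a test might be set up, not the options actually used in the kolla change, and fio must be installed on the node.

```python
# Minimal sketch of a randrw fio run against a 4G file, invoked via
# subprocess. Target directory, block size, and runtime cap are assumptions,
# not the exact options used in the kolla test change; requires fio.
import subprocess

def run_fio(target_dir="/var/lib/docker"):
    cmd = [
        "fio",
        "--name=randrw-test",
        f"--directory={target_dir}",
        "--rw=randrw",
        "--size=4G",
        "--bs=4k",
        "--direct=1",
        "--ioengine=libaio",
        "--runtime=600",        # cap the run so it cannot hang a job for hours
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_fio()
```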
clarkb | cool, that's the sort of thing that helps, because now we can point at something specific (this is disk IO on the VM root disk) and it is measurable (we can time how long specific randrw data sizes take and determine if improvements have been made) | 18:26 |
clarkb | Ramereth[m]: ^ fyi I think that's the sort of concrete info we're looking for | 18:26 |
mnasiadka | clarkb: aarch64 finished, took 1 hr compared to 33 seconds - x86 reports 31.2k IOPS and aarch64 is 287 IOPS | 18:32 |
mnasiadka | it's only a bit more than a single HDD's worth of IOPS | 18:33 |
clarkb | https://review.opendev.org/c/openstack/kolla/+/927210/ is the change if we need to dig up logs later | 18:33 |
mnasiadka | I used /var/lib/docker since that's the place where we mount ephemeral0 if it exists on a given cloud | 18:34 |
clarkb | mnasiadka: wait you mount a different device? | 18:36 |
clarkb | and this isn't the root fs device? | 18:36 |
clarkb | I wonder if the performance is different | 18:36 |
NeilHanlon | fungi: appreciate that! :) | 18:36 |
mnasiadka | clarkb: we mount that since on some clouds we were running out of space, and we were advised that clouds with a smaller root fs device expose an additional disk usually labeled ephemeral0 | 18:37 |
fungi | sounds like it was mainly for rackspace nodes then | 18:37 |
clarkb | mnasiadka: yes rax in particular. I'm just wondering if that is happening here or not. It sounds like you are saying it does happen, which makes me wonder if that device performs worse than the root device | 18:38 |
clarkb | https://zuul.opendev.org/t/openstack/build/000bddb724f343648aab14f937cc821d/log/job-output.txt#360-396 looks like it does happen | 18:39 |
mnasiadka | clarkb: seems / has 60G on osuosl and ephemeral0 has another 80G | 18:39 |
clarkb | so ya I would also test if the root device is any different | 18:40 |
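One quick way to confirm whether /var/lib/docker really sits on a separate block device from / is to compare the device IDs the kernel reports for the two paths; a minimal sketch (paths assumed from the discussion above):

```python
# Minimal sketch: check whether two paths live on the same block device by
# comparing the st_dev field from os.stat(). Paths match the ones discussed
# above; adjust as needed for a given node.
import os

def same_device(a="/", b="/var/lib/docker"):
    return os.stat(a).st_dev == os.stat(b).st_dev

if __name__ == "__main__":
    if same_device():
        print("/var/lib/docker is on the root filesystem")
    else:
        print("/var/lib/docker is on a separate device (e.g. ephemeral0)")
```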
mnasiadka | well, if I post a patch to get tested now, the queue looks a bit overloaded and we'll see some results maybe tomorrow ;-) | 18:40 |
fungi | maybe this is where Ramereth[m] can speculate on the relative performances of whatever the backing storage for the rootfs and ephemeral disk are | 18:41 |
fungi | if they're different at all, that is | 18:41 |
fungi | some focused benchmarking might also be possible | 18:42 |
mnasiadka | Ok, I updated 927210 to also create an fio file in / - if there's any option to get all jobs in check-arm64 queue aborted just to get that running - we might see some results sooner than tomorrow | 18:51 |
mnasiadka | all kolla jobs in check-arm64 of course | 18:52 |
corvus | on it | 18:56 |
corvus | 210 is running | 18:58 |
clarkb | thanks! | 18:59 |
mnasiadka | it seems that the root disk is as bad as the additional ephemeral0 disk | 19:11 |
fungi | quite possible they're both using exactly the same storage underneath | 19:12 |
fungi | or they share a common write cache | 19:12 |
mnasiadka | well, at least we know what the real problem is | 19:14 |
clarkb | ya, but it is good to confirm the behavior is consistent as that is additional debugging information | 19:14 |