*** tosky has quit IRC | 00:02 | |
*** dingyichen has joined #openstack-infra | 00:03 | |
*** dychen has quit IRC | 00:03 | |
dansmith | clarkb: am I wrong that running unit and functional tests on multiple python versions on every patch is maybe contributing to why queue times are so long lately? | 00:09 |
dansmith | don't those all consume a worker for almost as much time during setup as actually running? | 00:10 |
dansmith | today nova is running py36,38,39 for unit and 38,39 for functional which seems a little much | 00:10 |
dansmith | and also, couldn't we combine like py38 with cover and maybe even pep8 to make those all run on one worker and avoid setup time? | 00:11 |
dansmith | yeah, 15m to run tests, but ~30m for the job | 00:17 |
dansmith | (for unit) | 00:17 |
*** dciabrin_ has quit IRC | 00:17 | |
dansmith | 9m to run 2m of pep8 | 00:17 |
dansmith | 15m to run 8m of cover | 00:18 |
fungi | the devstack jobs take many times longer, and some of them occupy multiple nodes | 00:20 |
dansmith | 15+2+8=25 minutes of actual runtime, vs 55 minutes of "cpu time" to run them separate I think | 00:21 |
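A quick sketch of that arithmetic, using the rough per-job figures dansmith quotes above (approximate numbers from the conversation, not measured data):

```python
# Rough figures quoted above, in minutes: (time spent running tests, total job wall time)
jobs = {"py38 unit": (15, 30), "pep8": (2, 9), "cover": (8, 15)}

test_time = sum(t for t, _ in jobs.values())   # 25 minutes of actual test runtime
node_time = sum(w for _, w in jobs.values())   # ~54 minutes of node time as separate jobs

print(f"test time: {test_time}m, node time: {node_time}m, "
      f"setup/teardown overhead: {node_time - test_time}m")
# A combined job (e.g. one worker running tox -e py38,pep8,cover) would pay that
# setup/teardown cost roughly once instead of three times.
```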
*** lxkong0 has joined #openstack-infra | 00:21 | |
fungi | so yes reducing job count could help some, but reducing devstack jobs would have more impact than reducing unit test or linting jobs | 00:21 |
dansmith | yeah I know, I'm just wondering if we're wasting worker time re-setting up a basically identical environment just so we can see each test broken out in the report? | 00:21 |
dansmith | I really wonder if we need to be running grenade *and* grenade-multinode for example | 00:23 |
fungi | well, re-setting up an environment which is "nearly identical" to some small percentage of the overall jobs run. we have hundreds of such "not quite the same" environments and trying to maintain pre-set-up copies of all of them would also require a ton of resources (and increase complexity many times over) | 00:23 |
dansmith | aside from bugs in grenade, I wouldn't think that we're actually missing any coverage on the multinode case | 00:23 |
fungi | we've already tried to use pre-set-up environments in the past and keeping them maintained ends up being far more work than letting them get created at job runtime, but also makes changes to the environment setup itself testable directly | 00:24 |
fungi | er, building them at job runtime makes them more directly testable i mean | 00:24 |
dansmith | fungi: right, but if I make a job run tox -epy38,pep8,cover, those should run all in the same pre-set-up worker and just burn some pip time right? | 00:24 |
dansmith | fungi: yeah, I'm talking about using a more common base config and running more tox envs in a row on it, not trying to make the image more specific | 00:25 |
fungi | they would, though you'll need to go digging in a much larger job log to figure out whether what broke was linting, unit tests or your coverage report | 00:25 |
dansmith | yeah, I'm okay with that | 00:26 |
fungi | improving efficiency of the environment setup might be more effective | 00:26 |
dansmith | I guess that seems like more infra work to me, and thought you were arguing against it just a few lines ago :) | 00:26 |
fungi | like not installing things the job doesn't need just because it's easier to maintain a single list of dependencies than task-specific dependency sets | 00:26 |
fungi | nope, talking about job configuration | 00:27 |
dansmith | I'm just saying, if we're spending 50% of our time booting a "basic ubuntu image" worker to run 2m of something, that seems like a waste just so we can have the jobs called out separately | 00:27 |
fungi | not infrastructure | 00:27 |
dansmith | okay, I see.. obviously if we can make the setup time faster then that's good, | 00:27 |
fungi | we spend on average closer to 30 seconds to boot an ubuntu image i think. job setup spends a lot of time installing things people aren't actually using in their jobs | 00:28 |
dansmith | but if jamming a few things into the same worker means we don't have to micro-optimize, I dunno.. seems easier | 00:28 |
fungi | might be, i'm not arguing against trying it | 00:28 |
fungi | folks already do that with linters in some projects | 00:29 |
fungi | i've not seen anyone cram linting and unit tests into the same job, but it should be doable | 00:29 |
dansmith | so, the pep8 job I'm looking at, | 00:29 |
dansmith | ran pep8 for 2 minutes, and was done with that 5m into the job, but the job took another 4 minutes, presumably to clean up and post logs | 00:29 |
dansmith | well, I run tox -epy38,pep8 locally a lot.. because the pep8 output is short enough that I can scroll up and see the unit test output above it, even if I have a few lines of pep8 fail | 00:30 |
dansmith | well, those numbers aren't quite right because it looks like the job start time must not be at time zero in the log, so maybe it's more front-loaded.. about 40s of cleanup after we're done with pep8 | 00:31 |
dansmith | so I assume that's create and boot time or something, which is part of what would be saved | 00:32 |
dansmith | anyway, I'm just really worried that we're at an 8h turnaround time on a monday | 00:32 |
dansmith | and looking at everything we're running in a nova job these days, it seems like we should pare that down | 00:33 |
fungi | as we've said in the past, the biggest impact you can make on node utilization is to ferret out nondeterministic failures in projects/tests which burn a ton of nodes by having to retest changes and discarding lots of other builds | 00:33 |
dansmith | sure, and I'm still trying to land such a fix from last week :) | 00:34 |
dansmith | I definitely continue to push on people to do that, | 00:34 |
dansmith | but as we noted last week, it sounds like maybe some job configs have grown a little heavy | 00:34 |
dansmith | clarkb said he wasn't seeing a lot of resets when we had a >24h turnaround time last week | 00:34 |
fungi | the longer gate queues and the gate failures i see at the moment are for tripleo, yeah | 00:36 |
fungi | but also zuul's very nearly caught up from earlier today at this point | 00:37 |
dansmith | fungi: the head of the nova queue is 8h old | 00:37 |
fungi | we peaked at a backlog of 1.7k nodes and are down to just being 400 behind now | 00:37 |
dansmith | if tripleo has a lot of fails (and we know they have heavy jobs) we probably also need to focus attention there | 00:38 |
fungi | in the next hour or so i expect all changes in all pipelines will have node assignments filled | 00:38 |
fungi | looks like puppet-openstack also just tagged 26 releases | 00:44 |
fungi | well, "just" nearly three hours ago | 00:45 |
*** jamesmcarthur has joined #openstack-infra | 00:46 | |
*** jamesmcarthur has quit IRC | 00:52 | |
*** JanZerebecki[m] has joined #openstack-infra | 01:17 | |
*** jamesmcarthur has joined #openstack-infra | 01:19 | |
*** jamesmcarthur has quit IRC | 01:19 | |
*** jamesmcarthur has joined #openstack-infra | 01:20 | |
*** jamesmcarthur has quit IRC | 01:20 | |
*** jamesmcarthur has joined #openstack-infra | 01:26 | |
*** jamesmcarthur has quit IRC | 01:57 | |
*** ysandeep|away is now known as ysandeep | 02:07 | |
*** jamesmcarthur has joined #openstack-infra | 02:18 | |
*** jamesmcarthur has quit IRC | 02:23 | |
*** jamesmcarthur has joined #openstack-infra | 02:23 | |
*** rcernin has quit IRC | 02:26 | |
*** jamesmcarthur has quit IRC | 02:29 | |
*** jamesmcarthur has joined #openstack-infra | 02:33 | |
*** jamesmcarthur has quit IRC | 02:34 | |
*** rcernin has joined #openstack-infra | 02:42 | |
*** rcernin has quit IRC | 02:44 | |
*** rcernin has joined #openstack-infra | 02:44 | |
*** jamesmcarthur has joined #openstack-infra | 02:56 | |
*** verdurin has quit IRC | 03:02 | |
*** verdurin has joined #openstack-infra | 03:07 | |
*** jhesketh_ has joined #openstack-infra | 03:40 | |
*** jhesketh has quit IRC | 03:41 | |
*** jhesketh_ is now known as jhesketh | 03:43 | |
*** lxkong0 is now known as lxkong | 03:47 | |
*** zzzeek has quit IRC | 03:48 | |
*** zzzeek has joined #openstack-infra | 03:51 | |
*** ricolin has joined #openstack-infra | 03:54 | |
*** ysandeep is now known as ysandeep|pto | 03:55 | |
*** lbragstad has quit IRC | 04:16 | |
*** ykarel has joined #openstack-infra | 04:18 | |
*** zzzeek has quit IRC | 04:33 | |
*** zzzeek has joined #openstack-infra | 04:35 | |
*** guillaumec has quit IRC | 05:40 | |
*** guillaumec has joined #openstack-infra | 05:44 | |
ykarel | frickler, hberaud fyi tarballs are available and our jobs are passing now | 06:00 |
*** ykarel_ has joined #openstack-infra | 06:16 | |
*** ykarel has quit IRC | 06:19 | |
*** vishalmanchanda has joined #openstack-infra | 06:20 | |
*** ykarel_ is now known as ykarel | 06:29 | |
*** sboyron has joined #openstack-infra | 06:33 | |
*** jamesmcarthur has quit IRC | 07:06 | |
*** amoralej|off is now known as amoralej | 07:15 | |
*** rcernin has quit IRC | 07:19 | |
*** xek has joined #openstack-infra | 07:22 | |
*** ralonsoh has joined #openstack-infra | 07:27 | |
*** lpetrut has joined #openstack-infra | 07:39 | |
hberaud | ykarel: ack, thanks for the heads up | 07:41 |
*** nightmare_unreal has joined #openstack-infra | 07:44 | |
*** eolivare has joined #openstack-infra | 07:47 | |
*** slaweq has joined #openstack-infra | 07:48 | |
*** yamamoto has quit IRC | 07:53 | |
*** jcapitao has joined #openstack-infra | 07:57 | |
*** dciabrin_ has joined #openstack-infra | 08:00 | |
*** dchen has quit IRC | 08:01 | |
*** rpittau|afk is now known as rpittau | 08:07 | |
*** yamamoto has joined #openstack-infra | 08:11 | |
*** andrewbonney has joined #openstack-infra | 08:13 | |
*** zzzeek has quit IRC | 08:14 | |
*** zzzeek has joined #openstack-infra | 08:16 | |
*** hashar has joined #openstack-infra | 08:22 | |
*** tosky has joined #openstack-infra | 08:39 | |
*** gfidente has joined #openstack-infra | 08:40 | |
*** jpena|off is now known as jpena | 08:58 | |
*** lucasagomes has joined #openstack-infra | 09:04 | |
*** jamesmcarthur has joined #openstack-infra | 09:06 | |
*** sboyron has quit IRC | 09:06 | |
*** sboyron_ has joined #openstack-infra | 09:06 | |
*** jamesmcarthur has quit IRC | 09:11 | |
*** ricolin has quit IRC | 09:12 | |
*** ricolin has joined #openstack-infra | 09:13 | |
*** ociuhandu has joined #openstack-infra | 09:15 | |
amoralej | hi, may i get some attention on https://review.opendev.org/c/zuul/zuul-jobs/+/771105/ and https://review.opendev.org/c/zuul/zuul-jobs/+/770815 ? | 09:32 |
amoralej | we need it to get proper repos configured in centos8 stream jobs | 09:33 |
*** rcernin has joined #openstack-infra | 09:43 | |
*** sboyron has joined #openstack-infra | 09:43 | |
*** sboyron_ has quit IRC | 09:43 | |
*** derekh has joined #openstack-infra | 09:43 | |
openstackgerrit | Hervé Beraud proposed openstack/project-config master: Adding irc notification for missing oslo projects https://review.opendev.org/c/openstack/project-config/+/771392 | 09:49 |
*** hashar is now known as hasharOut | 09:59 | |
*** sboyron_ has joined #openstack-infra | 10:05 | |
*** sboyron has quit IRC | 10:08 | |
*** ociuhandu has quit IRC | 10:19 | |
*** ociuhandu has joined #openstack-infra | 10:24 | |
*** ociuhandu has quit IRC | 10:24 | |
*** ociuhandu has joined #openstack-infra | 10:25 | |
*** rcernin has quit IRC | 10:28 | |
*** ociuhandu has quit IRC | 10:29 | |
*** ociuhandu has joined #openstack-infra | 10:58 | |
*** ociuhandu has quit IRC | 11:12 | |
*** ociuhandu has joined #openstack-infra | 11:12 | |
*** rcernin has joined #openstack-infra | 11:13 | |
*** ysandeep|pto is now known as ysandeep | 11:14 | |
*** ociuhandu has quit IRC | 11:17 | |
*** jcapitao is now known as jcapitao_lunch | 11:26 | |
geguileo | amoralej: is the second patch going to fix the centos-8 jobs that are trying https://mirror.bhs1.ovh.opendev.org/wheel/centos-8.3-x86_64 instead of the right one? | 11:27 |
geguileo | amoralej: ignore me, it's not, that patch is for stream | 11:27 |
geguileo | and I'm talking about centos-8 | 11:27 |
amoralej | geguileo, yes, it's probably a different issue | 11:28 |
geguileo | amoralej: maybe you can point me in the right direction then... | 11:30 |
geguileo | centos-8 nodeset is using an incorrect wheel mirror which is making some jobs fail... | 11:31 |
geguileo | it's trying https://mirror.bhs1.ovh.opendev.org/wheel/centos-8.3-x86_64 | 11:31 |
geguileo | and it should be https://mirror.bhs1.ovh.opendev.org/wheel/centos-8-x86_64 | 11:31 |
amoralej | geguileo, can you point me to a failing job? | 11:33 |
geguileo | amoralej: https://zuul.opendev.org/t/openstack/build/6d6fb0dde981476ab9981fe80a093bf1 | 11:34 |
geguileo | amoralej: I think the problem is the definition of "wheel_mirror" that uses {{ ansible_distribution_version }} instead of just the major version... | 11:35 |
geguileo | because we don't have a default in roles/configure-mirrors/vars/CentOS.yaml | 11:37 |
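A minimal illustration of the mismatch geguileo describes, assuming the role builds the wheel mirror path from the standard Ansible distribution facts (the real logic is a Jinja2 template in the configure-mirrors role; this is only a hypothetical reconstruction):

```python
# Hypothetical reconstruction of how the wheel mirror URL ends up wrong.
facts = {
    "ansible_distribution_version": "8.3",        # full version -> centos-8.3-x86_64 (missing on the mirror)
    "ansible_distribution_major_version": "8",    # major version -> centos-8-x86_64 (what actually exists)
}

mirror = "https://mirror.bhs1.ovh.opendev.org/wheel"

broken = f"{mirror}/centos-{facts['ansible_distribution_version']}-x86_64"
fixed = f"{mirror}/centos-{facts['ansible_distribution_major_version']}-x86_64"

print(broken)  # https://mirror.bhs1.ovh.opendev.org/wheel/centos-8.3-x86_64
print(fixed)   # https://mirror.bhs1.ovh.opendev.org/wheel/centos-8-x86_64
```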
*** rcernin has quit IRC | 11:37 | |
amoralej | yes, looks like so | 11:37 |
geguileo | do you know if all pip URLs use centos-8-x86_64 or if some use centos-8.3-x86_64 format? | 11:41 |
amoralej | geguileo, curiously it doesn't fail in centos7.9 jobs | 11:41 |
amoralej | even if the wheels dir does not exist | 11:41 |
amoralej | geguileo, i have no idea tbh | 11:41 |
geguileo | amoralej: maybe because we are "lucky" and this bug is affecting that ansible version... https://github.com/ansible/ansible/issues/50141 | 11:41 |
geguileo | which reports 7 when it should be saying 7.9 | 11:42 |
amoralej | no | 11:42 |
amoralej | i see it's using 7.9 | 11:42 |
amoralej | in fact in a centos8 run from some days ago it worked | 11:42 |
amoralej | https://zuul.opendev.org/t/openstack/build/1d67a1289d3d417188d13a5f4451c60e/console | 11:42 |
geguileo | mmmmm, and what's the wheel mirror used to build alembic? | 11:42 |
geguileo | amoralej: on that job alembic was already present in the system | 11:43 |
amoralej | yes | 11:44 |
amoralej | that's what i'm seeing | 11:44 |
geguileo | so it didn't have to build it... | 11:44 |
amoralej | anyway it's clear that the mirror url is wrong | 11:45 |
geguileo | amoralej: did you get to see the wheel mirror URL anywhere on that job? | 11:45 |
*** yamamoto has quit IRC | 11:45 | |
amoralej | it's what you pointed | 11:45 |
amoralej | in configure-mirrors/defaults/main.yaml | 11:46 |
*** yamamoto has joined #openstack-infra | 11:46 | |
*** yamamoto has quit IRC | 11:46 | |
geguileo | amoralej: yeah, but I meant the one actually being used by that job | 11:46 |
geguileo | as in seen it in the logs | 11:46 |
*** yamamoto has joined #openstack-infra | 11:47 | |
geguileo | found it (I think) | 11:47 |
amoralej | it needs to be overridden in CentOS.yaml | 11:47 |
*** yamamoto has quit IRC | 11:47 | |
geguileo | amoralej: that's what I'd like to confirm... | 11:48 |
amoralej | geguileo, i think it's in https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/vars/CentOS.yaml | 11:48 |
geguileo | amoralej: yeah, that's where I need to add it | 11:48 |
*** yamamoto has joined #openstack-infra | 11:48 | |
*** yamamoto has quit IRC | 11:48 | |
geguileo | amoralej: but I don't want to break 7.9 jobs just to fix 8 jobs | 11:48 |
*** yamamoto has joined #openstack-infra | 11:48 | |
geguileo | XD | 11:48 |
amoralej | it loads in https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks/mirror.yaml#L6 | 11:49 |
amoralej | well, you may even create a CentOS-8.yaml | 11:49 |
amoralej | if you prefer | 11:49 |
*** yamamoto has quit IRC | 11:49 | |
amoralej | but i'd say it should not break centos7 | 11:49 |
geguileo | amoralej: well, we either have the centos7 jobs with the wrong wheel mirrors now, or it could break them | 11:50 |
amoralej | it has wrong mirror url | 11:51 |
geguileo | amoralej: ok, will send the fix now | 11:52 |
amoralej | check in https://zuul.opendev.org/t/openstack/build/b221eba358a5443990f6fd5809bde2b7 | 11:52 |
geguileo | amoralej: yup, thanks | 11:53 |
*** ramishra has quit IRC | 11:55 | |
*** ociuhandu has joined #openstack-infra | 11:57 | |
*** ramishra has joined #openstack-infra | 11:57 | |
*** ricolin has quit IRC | 11:57 | |
*** dpawlik has quit IRC | 11:57 | |
*** logan- has quit IRC | 11:57 | |
*** paladox has quit IRC | 11:57 | |
*** fresta has quit IRC | 11:57 | |
*** lifeless has quit IRC | 11:57 | |
*** abhishekk has quit IRC | 11:57 | |
*** gryf has quit IRC | 11:57 | |
*** DinaBelova has quit IRC | 11:57 | |
*** markmcclain has quit IRC | 11:57 | |
*** paladox has joined #openstack-infra | 11:57 | |
*** fresta has joined #openstack-infra | 11:58 | |
*** ricolin has joined #openstack-infra | 11:58 | |
*** lpetrut_ has joined #openstack-infra | 11:58 | |
*** lifeless has joined #openstack-infra | 11:58 | |
*** DinaBelova has joined #openstack-infra | 11:58 | |
*** markmcclain has joined #openstack-infra | 11:58 | |
*** abhishekk has joined #openstack-infra | 11:59 | |
*** gryf has joined #openstack-infra | 11:59 | |
*** logan- has joined #openstack-infra | 12:00 | |
*** lpetrut has quit IRC | 12:01 | |
*** ociuhandu has quit IRC | 12:10 | |
*** ociuhandu has joined #openstack-infra | 12:13 | |
*** eolivare_ has joined #openstack-infra | 12:14 | |
*** dpawlik has joined #openstack-infra | 12:15 | |
*** eolivare has quit IRC | 12:16 | |
*** ociuhandu has quit IRC | 12:18 | |
*** ajitha_ has joined #openstack-infra | 12:23 | |
*** ociuhandu has joined #openstack-infra | 12:29 | |
*** ociuhandu has quit IRC | 12:30 | |
*** ociuhandu has joined #openstack-infra | 12:32 | |
*** jpena is now known as jpena|lunch | 12:34 | |
*** ajitha_ is now known as ajitha | 12:34 | |
*** jcapitao_lunch is now known as jcapitao | 12:44 | |
*** hasharOut is now known as hashar | 12:45 | |
*** ttx has quit IRC | 12:51 | |
*** rlandy has joined #openstack-infra | 12:51 | |
*** eolivare_ has quit IRC | 12:53 | |
*** yamamoto has joined #openstack-infra | 12:58 | |
*** amoralej is now known as amoralej|lunch | 12:59 | |
*** ykarel has quit IRC | 13:01 | |
*** ttx has joined #openstack-infra | 13:04 | |
openstackgerrit | Luigi Toscano proposed openstack/project-config master: cursive: prepare to move the jobs in-tree https://review.opendev.org/c/openstack/project-config/+/771443 | 13:12 |
*** ykarel has joined #openstack-infra | 13:24 | |
*** jamesmcarthur has joined #openstack-infra | 13:26 | |
*** eolivare_ has joined #openstack-infra | 13:26 | |
*** jpena|lunch is now known as jpena | 13:34 | |
*** lbragstad has joined #openstack-infra | 13:37 | |
*** zul has joined #openstack-infra | 13:40 | |
*** yamamoto has quit IRC | 13:42 | |
*** _erlon_ has joined #openstack-infra | 13:49 | |
*** amoralej|lunch is now known as amoralej | 13:58 | |
*** yamamoto has joined #openstack-infra | 14:15 | |
*** yamamoto has quit IRC | 14:26 | |
*** slittle1 has quit IRC | 14:32 | |
*** akantek has joined #openstack-infra | 14:33 | |
*** akantek has quit IRC | 14:34 | |
*** dave-mccowan has quit IRC | 14:38 | |
*** rcernin has joined #openstack-infra | 14:42 | |
*** rcernin has quit IRC | 15:01 | |
*** derekh has quit IRC | 15:20 | |
*** derekh has joined #openstack-infra | 15:20 | |
openstackgerrit | Merged openstack/project-config master: Add PTP Notification app to StarlingX https://review.opendev.org/c/openstack/project-config/+/771235 | 15:22 |
*** gryf has quit IRC | 15:26 | |
*** ociuhandu has quit IRC | 15:27 | |
*** hashar is now known as hasharKids | 15:28 | |
*** gryf has joined #openstack-infra | 15:31 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 15:37 | |
*** ysandeep is now known as ysandeep|dinner | 15:37 | |
clarkb | dansmith: fungi ya when we have looked at numbers in the past, the long running multinode jobs so completely dwarf the other jobs that trying to optimize linting or even unittests for projects won't have a large effect | 15:42 |
*** dklyle has joined #openstack-infra | 15:43 | |
dansmith | clarkb: okay but, if every nova review is running five jobs that it doesn't need, even if they're small, I would think that would add up | 15:43 |
fungi | we do have data we can use to attempt to calculate how many node-hours we spend on various jobs and per project | 15:44 |
dansmith | clarkb: I know it would be zuul changes and maybe some more noise, but have we ever considered batching the jobs into long and short? so that we get a quicker report of unit, functional, linting and then a later report of the heavy stuff? | 15:45 |
fungi | which would probably lead to a more useful analysis, and less abstract conjecture | 15:45 |
clarkb | dansmith: it is >0 but not significant. Last time we ran numbers tripleo alone was like 35% or something like that of resource usage and they don't do linting and unittests really | 15:45 |
clarkb | it's all their multiple 3 hour jobs that quickly dominate | 15:45 |
clarkb | I can see if I can run that script again today | 15:46 |
clarkb | I think it also outputs job consumption which helps see it from the linting/unittests vs integration angle | 15:46 |
*** ysandeep|dinner is now known as ysandeep | 15:47 | |
dansmith | clarkb: can we figure out some relative stat? like hours per review or something like that? | 15:47 |
clarkb | dansmith: yes, we've actually done that. What we found when we tried it is you get a lot more round trips and it doesn't help on the whole | 15:47 |
fungi | dansmith: some projects do hold longer jobs until their shorter jobs pass, but the down sides to that are 1. you may need additional patchsets when you find out that you have more than one error exposed in different jobs some of which weren't run the first time, and 2. it'll take longer to get a result because the jobs are no longer run in parallel | 15:47 |
*** zul has quit IRC | 15:47 | |
dansmith | clarkb: like, I want to compare them against other projects to say "tripleo has half as many patches as nova, but consumes 4x the bandwidth" | 15:48 |
clarkb | dansmith: you should be able to do that as a derivative from the numbers the existing script prints out | 15:48 |
dansmith | okay | 15:48 |
dansmith | clarkb: a while ago we discussed per-project throttling such that 100-patch series in nova didn't swap single-patch glance reviews from getting timely results | 15:50 |
fungi | yeah, zuul has been doing that for a while now | 15:51 |
dansmith | is that (a) still happening and (b) do the long serialized jobs defeat that because they use a lot of nodes and run a long time? | 15:51 |
fungi | yes, a change which runs 20 3-hour multinode jobs gets weighted the same as a change which runs a docs build, from a "fair queuing" perspective | 15:51 |
dansmith | okay | 15:52 |
dansmith | and is the fairness across the project level or the git dep chain? | 15:52 |
dansmith | I ask because I can't tell (by the seat of my pants) that my single-patch glance reviews go any quicker than my nova ones, when there's literally nothing in the queue for glance | 15:53 |
fungi | it's per project queue, so in check that's basically at the project level, in gate it's at the dependent queue level (but you rarely observe that because the gate pipeline gets top priority anyway) | 15:53 |
dansmith | ack, okay | 15:53 |
dansmith | so these long wide heavy jobs must be putting both glance and nova patches so deep into the "not even considered yet" queue that I can't tell | 15:54 |
fungi | and it doesn't necessarily affect how fast the jobs run, it's just about prioritizing node requests, so if there's a backlog of node requests the projects with fewer changes get their node requests filled sooner | 15:54 |
dansmith | aye | 15:55 |
*** ociuhandu has joined #openstack-infra | 15:55 | |
fungi | from a "fairness" perspective it's far from perfect, but it's the best mechanism we were able to fit to the available data model and control points in the system | 15:57 |
clarkb | dansmith: http://paste.openstack.org/show/jD6kAP9tHk7PZr2nhv8h/ the aggregation there uses openstack/governance/reference/projects.yaml to decide what is tripleo and neutron and so on | 15:57 |
dansmith | dear $deity | 15:57 |
dansmith | tripleo and neutron together use over 50%? | 15:57 |
clarkb | that shows things in openstack goverance consumed 95.5% of used cpu time. 30% of the total was tripleo jobs. 22% neutron and so on | 15:57 |
clarkb | yes | 15:57 |
clarkb | note neutron runs tripleo jobs too | 15:58 |
dansmith | yeah | 15:58 |
slaweq | clarkb: we just discussed in our ci meeting to move some of those jobs to periodic queue | 15:58 |
slaweq | I will propose patch in few minutes | 15:58 |
dansmith | slaweq: ++ | 15:58 |
dansmith | slaweq: note that nova is 5% on that chart :) | 15:59 |
clarkb | all openstack-tox-py36 jobs used about 1% of consumed resources | 15:59 |
dansmith | clarkb: is there number-of-reviews data in that paste that I'm missing? | 15:59 |
clarkb | so if we say the "lightweight" jobs are maybe 5% total you can optimize that but the dent is tiny | 15:59 |
clarkb | dansmith: no you need to grab that from gerrit's api | 15:59 |
clarkb | the date range is in my report and it breaks it down by repo too | 16:00 |
clarkb | which should be enough to ask gerrit for data (I think fungi may even have a script that does that bit?) | 16:00 |
*** ykarel is now known as ykarel|away | 16:00 | |
*** ociuhandu has quit IRC | 16:01 | |
*** diablo_rojo has joined #openstack-infra | 16:01 | |
fungi | https://review.opendev.org/729293 aggregates by git namespace so all of openstack/ gets lumped together, but you could tweak the aggregation (it shards the queries by repo already for better pagination stability) | 16:01 |
*** sshnaidm|afk is now known as sshnaidm|ruck | 16:03 | |
*** amoralej is now known as amoralej|off | 16:11 | |
*** ociuhandu has joined #openstack-infra | 16:11 | |
*** ykarel|away has quit IRC | 16:17 | |
*** yamamoto has joined #openstack-infra | 16:26 | |
*** yamamoto has quit IRC | 16:34 | |
*** armax has joined #openstack-infra | 16:35 | |
*** derekh has quit IRC | 16:43 | |
*** lbragstad_ has joined #openstack-infra | 16:46 | |
*** lpetrut_ has quit IRC | 16:48 | |
*** slaweq has quit IRC | 16:48 | |
*** rlandy_ has joined #openstack-infra | 16:48 | |
*** slaweq has joined #openstack-infra | 16:49 | |
*** jamesdenton has quit IRC | 16:49 | |
*** gryf has quit IRC | 16:49 | |
*** lbragstad has quit IRC | 16:49 | |
*** rlandy has quit IRC | 16:49 | |
*** jamesdenton has joined #openstack-infra | 16:49 | |
*** rlandy_ is now known as rlandy | 16:50 | |
*** gryf has joined #openstack-infra | 16:50 | |
*** lbragstad_ is now known as lbragstad | 16:58 | |
zbr | i do believe that we could improve the developer experience if we can find a way to prioritize low-resource jobs. | 16:59 |
clarkb | zuul does already support it, developers can opt into it by modifying their job pipeline graphs | 17:00 |
zbr | so far we have used queues, but those are more per-project. | 17:00 |
clarkb | I don't think it will be helpful, but the tool allows it and some are trying it aiui | 17:00 |
*** lucasagomes has quit IRC | 17:01 | |
*** jamesmcarthur has quit IRC | 17:03 | |
*** jamesmcarthur has joined #openstack-infra | 17:03 | |
*** jcapitao has quit IRC | 17:05 | |
fungi | prioritizing low-resource jobs wouldn't necessarily get you results any sooner, unless those were all you were running | 17:05 |
dansmith | fungi: well, that's why I was asking about two batches.. if you're relying on zuul to run pep8 for you then it's not going to help, but if you rely on it to run and find problems with python versions you don't have, then maybe | 17:08 |
dansmith | fungi: there's also some locality of review, where I'd +2 something I saw pass functional tests and let the gate sort out merging based on whether devstack jobs worked, | 17:09 |
dansmith | but otherwise, I'll pretty much wait until I see the results, which right now is often "not today" | 17:09 |
dansmith | my queuing isn't as good as zuul, which means it might be "not until $owner pings me again" | 17:10 |
zbr | while it is possible for each project to optimize how jobs are triggered (dependencies and fail-fast), there is very little incentive for them to do it mainly because that means "slow yourself down and spend extra effort doing it, for the greater good". | 17:10 |
dansmith | not the worst thing, but the whole point of this is to make machines improve life for humans :) | 17:10 |
clarkb | dansmith: yes but at the same time the machines have a limited set of resources (which seems to only shrink) | 17:11 |
dansmith | zbr: that's certainly true, but it sounds like some projects obsess over their job optimization more than others, which makes some of us angry :) | 17:11 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 17:11 | |
fungi | and an ever shrinking number of people managing them and developing the systems which run on top of them | 17:11 |
dansmith | clarkb: I assure you I have a limited set of resources | 17:11 |
clarkb | yes me too | 17:11 |
clarkb | but I keep getting asked to work miracles :) | 17:11 |
clarkb | reality is this problem has existed for years | 17:12 |
dansmith | are you referring to me? | 17:12 |
clarkb | I've called it out for years | 17:12 |
clarkb | and no one has really cared until it all melts down and then its too late | 17:12 |
zbr | if we look at the problem from the CI system point of view, where you want to optimize resource usage and maximize how fast jobs are tested in general, you may want to promote good-players (low resource users). | 17:12 |
clarkb | dansmith: not just you, but it seems the demands on this team are higher than ever and we're smaller than ever | 17:12 |
clarkb | zbr: low resources users aren't necessarily "good-players" | 17:12 |
*** _erlon_ has quit IRC | 17:13 | |
clarkb | it could be that low resource users allow more bugs in, which causes more gate resets in the long run | 17:13 |
dansmith | clarkb: I'm sorry, I think I've asked you only for help understanding so far in 2021.. if that's asking too much then I'll go away | 17:13 |
clarkb | dansmith: its not asking too much, its just really difficult to hear a lot of suggestions when we have been asking for help for years and we get the opposite. And that isn't to say you are the problem. Its systemic in the community | 17:14 |
clarkb | that script was originally written because the TC and others kept blaming new small projects for the queue backups when in reality it was openstack itself (and a small number of resource hogs) | 17:15 |
clarkb | and the timestamp on that file is ~2018 | 17:15 |
clarkb | I'm just trying to keep the lights on most days anymore | 17:16 |
*** ociuhandu_ has joined #openstack-infra | 17:16 | |
clarkb | being able to add features to zuul (or even fix bugs in zuul) seems like a luxury | 17:16 |
clarkb | another related issue is the swapping in devstack jobs | 17:17 |
dansmith | clarkb: sorry, I'm missing something.. I can't help with things like cloud quota or giving you warm bodies, all I can help with is either trying to understand, brainstorming other technical improvements, or trying to convince people to tweak/shrink their jobs | 17:17 |
*** zbr3 has joined #openstack-infra | 17:18 | |
clarkb | dansmith: yes, I think the ball has been in openstack's court for fixing these queue issues for several years now | 17:18 |
dansmith | clarkb: if you've interpreted the brainstorming as zuul feature demands, then I'm really sorry and clearly communicated poorly | 17:18 |
clarkb | openstack runs a number of large inefficiencies in its CI jobs, devstack being central to a number of them. For example you can cut devstack spin-up time by at least around half simply by not using osc and writing a python script to do the keystone setup (because osc startup time is bad and it doesn't cache tokens) | 17:19 |
fungi | to clarify though, we've literally deployed all this with open source software and configured with code reviewed continuously deployed configuration management, much of which is self-testing now, so the things which require privileged access to systems isn't that much | 17:19 |
*** ociuhandu has quit IRC | 17:19 | |
clarkb | Devstack also swaps in many of its jobs which create performance issues as well | 17:19 |
clarkb | Er as well as stability issues | 17:19 |
*** zbr has quit IRC | 17:20 | |
*** zbr3 is now known as zbr | 17:20 | |
clarkb | tuning the devstack jobs to not swap or even better improving openstack's memory consumption in its services would go a long way for making the jobs run quicker and also be more reliable | 17:20 |
*** ociuhandu_ has quit IRC | 17:20 | |
clarkb | the tripleo side of things is harder for me to characterize because it changes often and uses tools I'm less familiar with, but I expect there are similar improvements that can be made there | 17:20 |
fungi | putting openstack on a diet and revisiting devstack's and tripleo's frameworks with an eye toward efficiency would certainly have a huge impact compared to messing around with reordering jobs or trying to cram two lightweight jobs into one | 17:21 |
clarkb | I did a poc for the osc replacement in devstack but was told it was too complicated for users | 17:21 |
clarkb | so instead we spend about 10-15 minutes per devstack job running osc instead of like 20 seconds for a python script | 17:21 |
fungi | well, also the qa team didn't like that it wasn't using all separate openstackclient commands | 17:22 |
fungi | daemon mode osc would have also probably had similar performance impact, but that never got completed | 17:22 |
clarkb | (and again I don't think its any one person's fault, but it seems there are systemic issues that specifically oppose solving these problems on the job end and instead we tend to prefer pushing that to the hosting providers) | 17:24 |
clarkb | but we've largely run out of our ability to scale up the hosting provider | 17:24 |
zbr | there is one aspect that affects our performance: number of jobs X random-failure rate. Projects with lots of jobs are far more likely to fail at the gate, it's just statistics. | 17:24 |
zbr | Assuming a 2% random failure rate, if you have 15 jobs this translates to a ~26% chance of failing. | 17:27 |
zbr | sadly nobody was able to count the real number of random failures, but i guess that we could compute it based on "successful rechecks". | 17:29 |
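For reference, the arithmetic behind zbr's estimate, assuming each job fails independently at the same rate (a simplification):

```python
# Chance that at least one of n independent jobs hits a random failure.
def gate_failure_chance(per_job_rate: float, n_jobs: int) -> float:
    return 1 - (1 - per_job_rate) ** n_jobs

print(f"{gate_failure_chance(0.02, 15):.0%}")  # ~26% with a 2% per-job rate and 15 jobs
```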
clarkb | zbr: yes, that coupled with gate states being dependent on their parents is what makes gate resets so painful | 17:29 |
clarkb | but also "random failures" tend to be pretty low in historical tracking we've done | 17:29 |
clarkb | a significant portion of failures represent actual bugs somewhere | 17:30 |
clarkb | granted those may be in places we don't have any hope of fixing (like nested virt crashing due to particular combos of kernels in a provider, or a provider reusing an ip address improperly) | 17:30 |
*** ociuhandu has joined #openstack-infra | 17:31 | |
*** jamesmcarthur has quit IRC | 17:31 | |
*** jamesmcarthur has joined #openstack-infra | 17:32 | |
clarkb | dansmith: interpreting those things as zuul feature demands is likely my personal bias, because it seems any time I push on improving the job side the response is no, we need to change $zuul thing. I'll try to view these issues with less of that bias | 17:35 |
*** ociuhandu has quit IRC | 17:35 | |
dansmith | clarkb: sorry man, really (really) just trying to come up with ideas | 17:36 |
dansmith | I just fixed an OOM in tempest (yes actually tempest) the other day, trying to chase down stability things to make things better | 17:37 |
*** jamesmcarthur has quit IRC | 17:37 | |
dansmith | been messing around with something in devstack today to address osc latency | 17:37 |
*** rlandy is now known as rlandy|brb | 17:37 | |
dansmith | I doubt I could really make complex changes to zuul in a reasonable amount of time, | 17:38 |
*** jamesmcarthur has joined #openstack-infra | 17:38 | |
dansmith | but in a lot of cases, I don't know what I don't know (like if we're still fair queuing across projects) so I was just asking | 17:38 |
dansmith | fwiw, I too feel like the cadre of people that care about the infra are all gone | 17:39 |
dansmith | so it's hard to continue to care instead of just making sure my shit is tight with local testing | 17:39 |
* fungi is still here ;) | 17:40 | |
*** hamalq has joined #openstack-infra | 17:40 | |
dansmith | fungi: yeah I mean people on projects who care to spend time working on non-project infra, common infra, or understanding infra issues to make changes in their projects | 17:40 |
fungi | but yeah, we've lost sdague, matt, second matt... :( | 17:40 |
dansmith | fungi: really glad you're still here tho :) | 17:41 |
dansmith | right, they were always better than me anyway | 17:41 |
fungi | melwitt has been doing great stuff lately in that vein | 17:41 |
*** jamesmcarthur has quit IRC | 17:43 | |
*** gfidente is now known as gfidente|afk | 17:43 | |
clarkb | dansmith: re the osc thing, is that via improving osc startup time and or token reuse? Those seemed to be the big reasons why osc was slow when I looked in the past, but both were somewhat complicated to address. Startup time because python entrypoint libs and tokens due to security concerns | 17:44 |
dansmith | clarkb: well, neither and more crazy.. trying to just make devstack less single-threaded, | 17:45 |
dansmith | but maybe that'll make too much memory pressure | 17:45 |
clarkb | dansmith: oh interesting | 17:45 |
fungi | i get the impression the memory pressure in those jobs is more in the tempest phase, so devstack setup may benefit from greater parallelism | 17:46 |
*** d34dh0r53 has quit IRC | 17:47 | |
clarkb | fungi: yes I think that is the case. Basically it is the use of the cloud that balloons the memory use | 17:47 |
dansmith | ack | 17:47 |
dansmith | so I was toying with being able to start named jobs async, and then say "okay if you get to here make sure $future is done" | 17:48 |
dansmith | parallelizing the init_project parts for example | 17:48 |
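Devstack itself is bash, so this is only a rough sketch (in Python, with made-up phase names) of the "start named jobs async, then block on the future when you need it" shape dansmith is describing:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder phases; in devstack these would be existing shell functions.
def setup_neutron(): ...
def setup_glance(): ...
def setup_swift(): ...

executor = ThreadPoolExecutor(max_workers=4)

# Start named phases asynchronously.
futures = {name: executor.submit(fn) for name, fn in
           [("neutron", setup_neutron), ("glance", setup_glance), ("swift", setup_swift)]}

# ... other serial setup work happens here ...

# "If you get to here, make sure $future is done" before depending on its results.
futures["neutron"].result()  # blocks until that phase finishes, re-raises any error
```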
dansmith | and also the creation of service accounts as another quick example which seems to take EFFING MINUTES | 17:48 |
fungi | service accounts at the system level? like with adduser command or whatever? | 17:49 |
dansmith | no keystone service accounts | 17:49 |
fungi | oh, okay. i wonder how many osc calls that's implemented with | 17:49 |
dansmith | yeah, it's a lot of osc overhead, but it also seems like some keystone slowness I dunno why | 17:50 |
*** d34dh0r53 has joined #openstack-infra | 17:50 | |
dansmith | I also wonder if we couldn't wrap osc shell mode and delegate commands that we run to it | 17:50 |
dansmith | like I wonder if that would offend anyone, if I could make it work | 17:50 |
dansmith | ten minutes of osc overhead sounds pretty juicy to me | 17:50 |
clarkb | dansmith: what my poc did was replace osc for service accounts and catalog bits with a script that used the sdk. That script was then able to cache the token for many requests and have a single startup time | 17:51 |
clarkb | its been a while but my maths were something like 7 minutes just for keystone setup then a few minutes of other things like create this network and that flavor and so on | 17:51 |
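Roughly the shape of that PoC: one authenticated SDK connection reused for every call, instead of a fresh osc process (and token) per command. This is only a sketch; the cloud name, account names and role below are illustrative, not the actual devstack setup:

```python
import openstack

# One connection == one token, reused for every request below.
conn = openstack.connect(cloud="devstack-admin")  # assumes a clouds.yaml entry with admin creds

project = conn.identity.find_project("service")
role = conn.identity.find_role("admin")

for name in ("glance", "cinder", "neutron"):  # illustrative service account names
    user = conn.identity.create_user(name=name, password="secret", domain_id="default")
    conn.identity.assign_project_role_to_user(project, user, role)
```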
dansmith | yeah, the keystone stuff is stupid slow | 17:52 |
dansmith | I'm also parallelizing things like neutron setup (db creation, etc) with things like swift and glance and placement which should be mostly isolated I think | 17:53 |
dansmith | but the iops may not work out in a cloud worker such that there's benefit | 17:53 |
zbr | do we have the meeting in one hour? | 17:56 |
clarkb | yes | 17:56 |
clarkb | (I sent out an agenda to the list yesterday too if you're curious to see what is on it) | 17:57 |
* zbr goes out for a while, aiming to return in one hour. | 17:58 | |
*** eolivare_ has quit IRC | 18:02 | |
*** jamesmcarthur has joined #openstack-infra | 18:02 | |
*** gyee has joined #openstack-infra | 18:16 | |
*** jpena is now known as jpena|off | 18:18 | |
*** rlandy|brb is now known as rlandy | 18:20 | |
*** bdodd has quit IRC | 18:23 | |
*** dtantsur is now known as dtantsur|afk | 18:23 | |
*** ricolin has quit IRC | 18:34 | |
*** hasharKids has quit IRC | 18:34 | |
*** rpittau is now known as rpittau|afk | 18:41 | |
*** jamesmcarthur has quit IRC | 18:59 | |
*** jamesmcarthur has joined #openstack-infra | 18:59 | |
gmann | mnaser: fungi clarkb these project-config changes lgtm and quick to review- https://review.opendev.org/c/openstack/project-config/+/771443 https://review.opendev.org/c/openstack/project-config/+/771392 https://review.opendev.org/c/openstack/project-config/+/771066 https://review.opendev.org/c/openstack/project-config/+/770538 | 19:14 |
*** nightmare_unreal has quit IRC | 19:14 | |
fungi | thanks gmann! i guess you're watching the conversation in the opendev meeting | 19:14 |
fungi | we were just talking about that right now | 19:15 |
gmann | ah did not see that. | 19:15 |
gmann | nice | 19:15 |
fungi | yeah, that's the current topic in the meeting, looking for volunteers for config reviewing | 19:16 |
gmann | I was checking in #opendev | 19:18 |
fungi | heh, yeah sorry we have a separate meeting channel but you've found it | 19:22 |
fungi | we use that for weekly meetings but also scheduled maintenance activities and incident management | 19:22 |
*** lifeless has quit IRC | 19:27 | |
*** lifeless has joined #openstack-infra | 19:27 | |
*** andrewbonney has quit IRC | 19:42 | |
*** slaweq has quit IRC | 19:43 | |
*** ajitha has quit IRC | 20:01 | |
*** Jeffrey4l has quit IRC | 20:04 | |
*** openstackgerrit has quit IRC | 20:12 | |
*** Jeffrey4l has joined #openstack-infra | 20:13 | |
*** zbr5 has joined #openstack-infra | 20:14 | |
*** zbr has quit IRC | 20:16 | |
*** zbr5 is now known as zbr | 20:16 | |
*** bdodd has joined #openstack-infra | 20:29 | |
*** yamamoto has joined #openstack-infra | 20:32 | |
*** stevebaker has quit IRC | 20:35 | |
*** yamamoto has quit IRC | 20:36 | |
*** vishalmanchanda has quit IRC | 20:39 | |
*** Jeffrey4l has quit IRC | 20:50 | |
*** Jeffrey4l has joined #openstack-infra | 20:51 | |
*** stevebaker has joined #openstack-infra | 21:03 | |
*** ociuhandu has joined #openstack-infra | 21:07 | |
*** harlowja has joined #openstack-infra | 21:14 | |
*** jamesmcarthur has quit IRC | 21:17 | |
*** jamesmcarthur has joined #openstack-infra | 21:19 | |
*** sboyron_ has quit IRC | 21:26 | |
*** priteau has quit IRC | 21:35 | |
*** jamesmcarthur has quit IRC | 21:42 | |
*** xek has quit IRC | 21:44 | |
*** jamesmcarthur has joined #openstack-infra | 21:46 | |
*** arne_wiebalck has quit IRC | 21:49 | |
*** arne_wiebalck has joined #openstack-infra | 21:51 | |
*** matt_kosut has quit IRC | 22:01 | |
*** matt_kosut has joined #openstack-infra | 22:02 | |
*** matt_kosut has quit IRC | 22:07 | |
*** rcernin has joined #openstack-infra | 22:09 | |
*** yamamoto has joined #openstack-infra | 22:10 | |
*** jamesmcarthur has quit IRC | 22:16 | |
*** jamesmcarthur has joined #openstack-infra | 22:23 | |
*** iurygregory has quit IRC | 22:28 | |
*** jamesmcarthur has quit IRC | 22:30 | |
*** jamesmcarthur has joined #openstack-infra | 22:33 | |
*** iurygregory has joined #openstack-infra | 22:37 | |
*** ociuhandu has quit IRC | 22:47 | |
*** ociuhandu has joined #openstack-infra | 22:47 | |
*** openstackgerrit has joined #openstack-infra | 22:50 | |
openstackgerrit | Merged openstack/project-config master: cursive: prepare to move the jobs in-tree https://review.opendev.org/c/openstack/project-config/+/771443 | 22:50 |
openstackgerrit | Merged openstack/project-config master: Adding irc notification for missing oslo projects https://review.opendev.org/c/openstack/project-config/+/771392 | 22:50 |
openstackgerrit | Merged openstack/project-config master: Combine acl file for all interop source code repo https://review.opendev.org/c/openstack/project-config/+/771066 | 22:50 |
openstackgerrit | Merged openstack/project-config master: Move snaps ACL to x https://review.opendev.org/c/openstack/project-config/+/770538 | 22:50 |
gagehugo | Is review.opendev.org sign-in now switched to openid? | 22:54 |
clarkb | gagehugo: its always been openid as far as I know | 22:54 |
gagehugo | the login page changed, was just wondering | 22:55 |
*** thogarre has joined #openstack-infra | 22:55 | |
clarkb | hrm that shouldn't have changed | 22:55 |
clarkb | it should take you to the ubuntu one openid login page | 22:55 |
clarkb | oh except I think I discovered a bug where you can't hit the login button from the diff viewer as the redirects don't work from there? | 22:56 |
fungi | when you click sign in it should take you to https://login.ubuntu.com/ yeah | 22:56 |
gagehugo | https://usercontent.irccloud-cdn.com/file/dBmWTQfP/image.png | 22:56 |
fungi | huh, that's the page we've usually seen if login.ubuntu.com is down for some reason | 22:56 |
gagehugo | ah ok | 22:56 |
clarkb | fwiw it just worked for me | 22:56 |
fungi | same here | 22:56 |
clarkb | so maybe a blip on the remote side | 22:56 |
gagehugo | someone from our team was having that issue as well so I figured I'd check, thanks! | 22:58 |
*** snapiri has quit IRC | 22:58 | |
*** openstackgerrit has quit IRC | 22:59 | |
clarkb | gagehugo: if it persists I would double check dns resolution and firewall access for login.ubuntu.com | 23:02 |
gagehugo | ok | 23:02 |
*** snapiri has joined #openstack-infra | 23:03 | |
*** snapiri has quit IRC | 23:08 | |
*** CrayZee has joined #openstack-infra | 23:08 | |
*** matt_kosut has joined #openstack-infra | 23:17 | |
*** jamesmcarthur has quit IRC | 23:25 | |
*** matt_kosut has quit IRC | 23:27 | |
fungi | yeah, maybe access is being blocked or something | 23:29 |
fungi | or i suppose it could be a new browser security feature, blocking refresh-redirect to another domain? | 23:30 |
fungi | something or other breaking openid workflow | 23:30 |
*** dchen has joined #openstack-infra | 23:31 | |
*** jamesmcarthur has joined #openstack-infra | 23:43 | |
*** thogarre has quit IRC | 23:52 | |
*** ociuhandu has quit IRC | 23:58 | |
*** ociuhandu has joined #openstack-infra | 23:58 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!