corvus | zuul likes it | 00:00 |
---|---|---|
corvus | ah, i'm guessing it's not quite tuned for ubuntu -- from the nodepool-functional-k8s job: ubuntu-bionic | "msg": "No package matching 'java-latest-openjdk' is available" | 00:01 |
corvus | 'default-jdk-headless' maybe? | 00:02 |
clarkb | ya or default-jdk | 00:02 |
clarkb | not sure what the difference is but headless seems appropriate here | 00:02 |
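As an aside, the distro-conditional package choice being discussed could look roughly like the Ansible task below; the task name, the Debian check, and the non-Debian fallback package are illustrative assumptions, not the contents of the actual ensure-zookeeper fix.

```yaml
# Illustrative sketch only: install a headless JDK on Debian/Ubuntu and fall
# back to Fedora's java-latest-openjdk elsewhere. The real fix lives in the
# zuul-jobs ensure-zookeeper role.
- name: Install a JDK for ZooKeeper
  become: true
  package:
    name: "{{ 'default-jdk-headless' if ansible_os_family == 'Debian' else 'java-latest-openjdk' }}"
    state: present
```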
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: ensure-zookeeper: add use_tls role var https://review.opendev.org/c/zuul/zuul-jobs/+/776290 | 00:06 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS. https://review.opendev.org/c/zuul/nodepool/+/776286 | 00:07 |
clarkb | corvus: looking at the WIP change it seems to have a lot of the same stuff as the zj role. Is the intent that nodepool would bootstrap itself? | 00:09 |
corvus | clarkb: i don't want to require 'ensure-zookeeper' for the tox jobs | 00:10 |
corvus | i like the idea of test-setup.sh doing that in a way that works for the zuul tox jobs and devs | 00:11 |
clarkb | ok just making sure I understand the various bits there | 00:11 |
corvus | we could switch the functional jobs to use test-setup, but those are running k8s and openshift, so there's a pretty good argument for not running something in docker in those jobs :) | 00:11 |
corvus | and they are not especially standard; they already have their own playbooks | 00:12 |
clarkb | corvus: in test-setup.sh why use docker-compose rm -sf over docker-compose down? (If you use down you can drop the condition and I think down does roughly the same thing?) | 00:13 |
corvus | clarkb: no idea, copied that from zuul's test-setup-docker.sh | 00:13 |
corvus | i'll incorporate that in the next rev | 00:13 |
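For context, a minimal sketch of the simplification clarkb is suggesting; the path and service name here are assumptions rather than the real test-setup.sh contents.

```bash
#!/bin/bash
# Sketch only: "docker-compose down" stops and removes the compose project's
# containers (and default network), so the separate "docker-compose rm -sf"
# plus its guard condition become unnecessary.
cd "$(dirname "$0")"            # assumed location of the compose file
docker-compose down             # replaces the conditional rm -sf step
docker-compose up -d zk         # assumed service name for zookeeper
```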
clarkb | corvus: found a minor thing on the WIP change (left it as a comment) | 00:15 |
corvus | ah thx | 00:16 |
corvus | that will probably fail some unit tests but not all of them; i'm going to let it run a bit more before i push up a fix | 00:17 |
*** tosky has quit IRC | 00:30 | |
corvus | i'm confused about why the py3x tests timed out; they seem to be mostly running; i'm not sure what they were doing when time was up | 01:02 |
clarkb | corvus: https://9e7a0909972c63991fa4-da3822d63841e990242061d65cb4e6c4.ssl.cf5.rackcdn.com/776286/8/check/tox-py36/be3b5ed/tmpqv4jwhl2 is an attempt at deciphering that | 01:03 |
clarkb | that is the partial subunit stream | 01:03 |
* clarkb pulls it up | 01:04 | |
corvus | yeah, it ends with a successful test | 01:04 |
corvus | i think i have it running the bad test locally, i just still don't know which one it is | 01:06 |
corvus | i don't see a test name in my stestr output | 01:06 |
clarkb | corvus: grep ^test: $thatsubunitfile | wc -l is 91 | 01:06 |
corvus | clarkb: i don't know how to use that info | 01:09 |
clarkb | mostly just pointing out it only ran 91 tests before it timed out (I think nodepool has several hundred tests overall) | 01:10 |
corvus | 277 | 01:11 |
corvus | the 3 tests around line 91 of 'python -m testtools.run discover --list' are fine | 01:11 |
corvus | i'm really confused because we have a TESTING.rst file that says if i run "stestr run" it will print out the name of each test as it runs | 01:12 |
corvus | but all i see is log output | 01:12 |
clarkb | corvus: the invocation in the job log uses --no-subunit-trace which suppresses the behavior you want I think | 01:13 |
corvus | ok. well i'm running plain "stestr run" now, and i do see some job names among the logs | 01:14 |
corvus | so maybe if it hangs i'll see it this time | 01:14 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/tox.ini#L21 | 01:14 |
corvus | okay, running that way and then hitting ctrl-c when it hung has given me the names of the 7 running tests that were hanging | 01:16 |
corvus | so i can iterate now | 01:16 |
clarkb | and ya it seems like the file only writes complete subunit results (to avoid interleaving?) but the proper tracing should print them as they go | 01:16 |
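In practical terms, the difference discussed above looks like this (hedged, since the exact tox.ini command may differ):

```bash
# CI runs stestr with --no-subunit-trace, which suppresses the per-test trace
# output, so a timed-out run only shows log lines. Running stestr directly
# restores the trace and prints each test name as it completes:
stestr run
# Counting completed tests in the partial subunit stream, as above:
grep '^test:' tmpqv4jwhl2 | wc -l    # filename taken from the job log link
```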
corvus | okay, i think i have fixes for the latest errors in the k8s/openshift jobs and these tests; i'm just running locally again to find any more | 01:24 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 01:32 |
corvus | at this point, i expect some of these jobs to start passing | 01:32 |
*** harrymichal has quit IRC | 01:49 | |
corvus | clarkb, tobiash: the disadvantage of using docker for the unit tests is: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit | 01:52 |
corvus | maybe we should leave the docker setup for devs only and switch the tox jobs to ensure-zookeeper | 01:53 |
corvus | tobiash, felixedel, swest: if you start to look at moving zk connection inside zuul services, note the work going on in nodepool; see the lines above ^ and also https://review.opendev.org/776290 and https://review.opendev.org/776286 | 02:02 |
*** dmsimard8 has joined #zuul | 02:56 | |
*** dmsimard has quit IRC | 02:59 | |
*** aluria has quit IRC | 02:59 | |
*** SpamapS has quit IRC | 02:59 | |
*** gouthamr has quit IRC | 02:59 | |
*** tflink has quit IRC | 02:59 | |
*** dmsimard8 is now known as dmsimard | 02:59 | |
*** ianw has quit IRC | 02:59 | |
*** tflink has joined #zuul | 03:00 | |
*** SpamapS has joined #zuul | 03:00 | |
*** ianw has joined #zuul | 03:00 | |
*** gouthamr has joined #zuul | 03:03 | |
*** aluria has joined #zuul | 03:04 | |
*** ykarel has joined #zuul | 05:00 | |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #zuul | 05:33 | |
*** jfoufas1 has joined #zuul | 06:00 | |
*** gouthamr has joined #zuul | 06:01 | |
*** rlandy|bbl has quit IRC | 06:17 | |
*** vishalmanchanda has joined #zuul | 06:34 | |
*** reiterative has quit IRC | 06:48 | |
*** reiterative has joined #zuul | 06:48 | |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:01 |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:03 |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:04 |
*** jpena|off is now known as jpena | 07:36 | |
*** harrymichal has joined #zuul | 07:58 | |
*** jcapitao has joined #zuul | 08:13 | |
*** rpittau|afk is now known as rpittau | 08:14 | |
*** hashar has joined #zuul | 08:21 | |
*** tosky has joined #zuul | 08:45 | |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 08:51 |
*** jfoufas1 has quit IRC | 08:56 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 09:05 |
*** ykarel_ has joined #zuul | 09:05 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 09:07 |
*** ykarel has quit IRC | 09:08 | |
*** nils has joined #zuul | 09:22 | |
*** ykarel_ is now known as ykarel | 09:49 | |
*** wuchunyang has joined #zuul | 10:39 | |
*** wuchunyang has quit IRC | 10:44 | |
openstackgerrit | Merged zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 11:31 |
*** harrymichal has quit IRC | 11:38 | |
*** harrymichal has joined #zuul | 11:39 | |
*** jcapitao is now known as jcapitao_lunch | 12:08 | |
*** rlandy has joined #zuul | 12:28 | |
*** jpena is now known as jpena|lunch | 12:30 | |
*** ikhan has joined #zuul | 12:50 | |
*** jcapitao_lunch is now known as jcapitao | 13:19 | |
*** jpena|lunch is now known as jpena | 13:31 | |
*** iurygregory has quit IRC | 13:47 | |
*** iurygregory has joined #zuul | 13:50 | |
*** ykarel_ has joined #zuul | 14:14 | |
*** ykarel has quit IRC | 14:17 | |
pabelanger | I know some folks are using prometheus with zuul metrics, is that right? if so, do people have it documented any place? | 14:27 |
*** ykarel_ has quit IRC | 14:27 | |
pabelanger | we'll likely have to use that given no one offers graphite much any more | 14:27 |
corvus | pabelanger: you run your own control plane right? or are you moving to a hosted control plane? | 14:28 |
corvus | (not sure who needs to "offer" graphite in this scenario) | 14:29 |
pabelanger | yah, we still run our own. But we don't want to manage more services, so looking to other teams to provide something | 14:30 |
pabelanger | and prometheus is the thing, not graphite | 14:30 |
pabelanger | and basically don't have time / effort to stand up our own now | 14:30 |
corvus | pabelanger: well, fwiw, graphite isn't required... you can use grafana with influxdb as your statsd receiver as well. | 14:31 |
pabelanger | we're kind of at a crossroads, of looking for managed CI again. Which may or may not be zuul | 14:31 |
pabelanger | :( | 14:31 |
corvus | pabelanger: if you do need prometheus, i think tobiash has a patch in review under zuul/zuul for a prometheus statsd exporter | 14:33 |
avass | where is logging configured in zuul and what decides what variables are passed when zuul (python?) logs something? | 14:47 |
avass | not that well versed in how logging in python actually works | 14:47 |
corvus | avass: uses standard python logging library: https://docs.python.org/3/library/logging.html | 14:47 |
avass | anyhow, I don't think buildset gets logged nicely, and if it is I can't find it except in the actual message | 14:47 |
corvus | avass: that configures the formatting, etc. but if you want extra messages or info/metadata along with the messages, those are source code changes | 14:48 |
avass | corvus: oh nvm, think I found it. the get_annotated_logger function is probably what I'm looking for | 14:49 |
avass | it would be nice if buildset was part of the logs somehow as well | 14:50 |
tristanC | corvus: note that influxdb statsd support is a bit tricky, we couldn't use the graphite query in grafana and we had to configure continuous queries for nodepool | 14:50 |
*** tosky has quit IRC | 14:51 | |
*** tosky_ has joined #zuul | 14:51 | |
corvus | avass: ack... i wonder if it would be better to have it in every line related to a build, or just have a clear "starting build X for buildset Y" for cross-referencing | 14:51 |
*** tosky_ is now known as tosky | 14:51 | |
avass | corvus: for console logs that should be enough but for our splunk logs it would be nice to have that as part of the data for each log entry | 14:52 |
avass | because doing buildset=<hash> levelname=ERROR could be convenient :) | 14:53 |
avass | but event_id should do pretty much the same thing | 14:55 |
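A rough Python sketch of the pattern avass found; the exact get_annotated_logger signature and the idea of threading a buildset id through it are assumptions here, not a description of current zuul code.

```python
import logging

from zuul.lib.logutil import get_annotated_logger

log = logging.getLogger("zuul.ExecutorServer")


def start_build(build, event):
    # Wrap the logger so each message carries the event id (and, per the
    # suggestion above, could also carry the buildset id) as structured
    # fields that a log shipper such as splunk can index and query on.
    job_log = get_annotated_logger(log, event, build=build.uuid)
    job_log.info("Starting build")
```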
tristanC | pabelanger: we have been using the node_exporter text file feature to get zuul metrics in prometheus | 15:01 |
pabelanger | sorry, I got pulled into a meeting. catching up | 15:02 |
tristanC | pabelanger: here are the rules https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/monitoring/rules-zuul.yaml and the script to generate the metrics is: https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/roles/custom-exporter/files/exporter.py | 15:02 |
pabelanger | so, the feedback I am getting is 'we need more zuul metric', which is solved already. But I have a very limited amount of time or ability to launch new services to support it. Given teams already have prometheus (I am unsure about influxdb), the plan would be to use those services, over asking them to run graphite. | 15:05 |
pabelanger | statsd is already in place, so we don't have to provision that (aside from exporters) | 15:05 |
pabelanger | tristanC: thanks, I'll look | 15:06 |
tristanC | pabelanger: the links i gave you are more about alerting when something breaks, you'll need to look at tobiash's statsd exporter configuration for metrics | 15:13 |
pabelanger | https://review.opendev.org/c/zuul/zuul/+/660472 | 15:18 |
pabelanger | thanks | 15:18 |
*** hashar has quit IRC | 15:18 | |
pabelanger | that looks like it might be easy to setup and test | 15:18 |
*** rlandy is now known as rlandy|training | 15:39 | |
corvus | pabelanger: yep that's the one | 15:43 |
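Independent of that patch, one commonly used option is the standalone prometheus statsd_exporter with a mapping config that relabels zuul's dotted statsd names; the mapping below is a hedged illustration rather than a tested configuration.

```yaml
# statsd_exporter mapping sketch (illustrative): expose a zuul pipeline
# gauge as a prometheus metric with tenant/pipeline labels.
mappings:
  - match: "zuul.tenant.*.pipeline.*.current_changes"
    name: "zuul_pipeline_current_changes"
    labels:
      tenant: "$1"
      pipeline: "$2"
```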
*** ykarel_ has joined #zuul | 15:48 | |
tristanC | corvus: about ensure-zookeeper tls support, if i understand correctly it fixed the k8s/openshift integration jobs, but we now need to use it for the regular tox job? | 15:48 |
tristanC | thank you for taking over the change by the way | 15:49 |
corvus | tristanC: yes. i wanted to use docker in test-setup.sh so that it was the same for devs and CI, and we wouldn't have to add a playbook to the tox jobs. however, with the dockerhub rate limits, i think your idea of using it for the tox jobs is better. | 15:50 |
corvus | tristanC: so i think we just need to update 776286 to run ensure-zookeeper and then set env variables for the unit test tox runs | 15:50 |
corvus | i can do that in a little bit | 15:50 |
corvus | tristanC: and thank you for the ensure-zk change -- i love timezone teamwork :) | 15:51 |
clarkb | corvus: no objection from me using ensure-zk for unittests too | 16:02 |
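A hedged sketch of that wiring; the playbook, variable names, and port are assumptions, not the contents of 776286 or 776290.

```yaml
# Illustrative pre-run playbook for the tox jobs: start a TLS-enabled
# ZooKeeper via the zuul-jobs role discussed above (use_tls being the new
# role var from 776290).
- hosts: all
  roles:
    - role: ensure-zookeeper
      use_tls: true
```

The tox jobs would then export the ZooKeeper address and TLS cert paths to the unit tests, for example through the tox role's environment passthrough, so CI and developers running test-setup.sh locally both end up pointing at a TLS-enabled ZooKeeper.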
*** vishalmanchanda has quit IRC | 16:02 | |
*** rlandy|training is now known as rlandy | 16:16 | |
*** ykarel_ is now known as ykarel | 16:21 | |
*** mvadkert has joined #zuul | 16:22 | |
mvadkert | tristanC: hi, I looked at your Fedora lightning talk, do you have examples where you are now using Dhall in Zuul and for what? | 16:23 |
mvadkert | tristanC: we are looking to use cuelang.org for some of our services, but we want to look at Dhall also ... | 16:24 |
avass | mvadkert: it's widely used for the zuul-operator: https://opendev.org/zuul/zuul-operator | 16:24 |
mvadkert | avass: thanks, will check that out | 16:26 |
*** ykarel has quit IRC | 16:29 | |
clarkb | corvus: looks like unit tests passed in your latest recheck (but not some of the functional jobs), though it has not finished and reported yet | 16:36 |
tristanC | mvadkert: fbo wrote a blog post about it here: https://www.softwarefactory-project.io/using-dhall-to-generate-fedora-ci-zuul-config.html | 16:44 |
tristanC | mvadkert: you can also find several examples in the binding documentation here: https://github.com/softwarefactory-project/dhall-zuul | 16:45 |
tristanC | mvadkert: and finally, you can also find an ambitious work in progress expression to manage a full tenant configuration there: https://github.com/softwarefactory-project/bootstrap-your-zuul | 16:47 |
*** jpena is now known as jpena|off | 16:51 | |
*** ikhan has quit IRC | 16:52 | |
*** hashar has joined #zuul | 17:16 | |
mvadkert | tristanC: cool ty! | 17:20 |
*** jcapitao has quit IRC | 17:24 | |
tristanC | mvadkert: you're welcome, i'd be interested to see a zuul configuration in cue if you already made one | 17:36 |
mvadkert | tristanC: not for zuul, we plan to use it for other services, but the work has just started | 17:38 |
mvadkert | tristanC: so we are mostly looking around for similar use cases | 17:38 |
*** rpittau is now known as rpittau|afk | 17:47 | |
*** rlandy is now known as rlandy|training | 18:01 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 18:06 |
*** akrpan-pure has joined #zuul | 18:13 | |
akrpan-pure | Hello! I have a zuul run working fine on opendev/ci-sandbox, but I just switched it over to also run on openstack/cinder and the runs are working fine, but uploading the result mysteriously fails with a "Gerrit error executing gerrit review <rest of command>" and no comment on the change request | 18:14 |
akrpan-pure | Anyone either seen this before, or know how I could turn on more debugging info for response reporting through gerrit? | 18:14 |
corvus | akrpan-pure: you may be able to run zuul with the "-d" argument to get more debugging info | 18:15 |
clarkb | akrpan-pure: are your jobs trying to vote +/-1 verified? | 18:15 |
*** wuchunyang has joined #zuul | 18:15 | |
clarkb | I think that most openstack projects don't allow third party CI systems to vote and the reviews can fail if you try to vote | 18:16 |
akrpan-pure | .... oh I bet I know what it is, I made a new account for our third party CI, I bet it hasn't been approved | 18:17 |
akrpan-pure | *as a CI account yet, because our old one could vote just fine | 18:17 |
clarkb | akrpan-pure: cinder-ci has no members, they don't want anyone voting | 18:17 |
akrpan-pure | Oh indeed, I'm not sure what I was remembering but I could've sworn I saw votes from other CI systems including ours in the past | 18:19 |
akrpan-pure | Looking back now I see none | 18:19 |
akrpan-pure | Thanks for pointing that out! Easy enough to fix | 18:20 |
*** wuchunyang has quit IRC | 18:20 | |
*** hashar has quit IRC | 18:32 | |
*** akrpan-pure has quit IRC | 19:00 | |
*** mgoddard has quit IRC | 19:17 | |
*** mgoddard has joined #zuul | 19:18 | |
avass | tristanC: oh yep overrides are starting to click a bit now | 20:01 |
tristanC | i've published an haskell library to interact with zuul, it comes with a cli to list the nodepool labels used in projects pipeline configuration: https://hackage.haskell.org/package/zuul | 20:01 |
tristanC | avass: good, that seems like the best solution to modify a set; if you look in zuul-nix, override could be used to tweak the existing requirements instead of creating new ones when there is a mismatch | 20:04 |
*** ikhan has joined #zuul | 20:05 | |
*** nils has quit IRC | 20:14 | |
*** jamesmcarthur has joined #zuul | 20:26 | |
*** rlandy|training is now known as rlandy | 20:33 | |
pabelanger | anybody seen the following decode issue before? | 20:58 |
pabelanger | http://paste.openstack.org/show/802799/ | 20:58 |
pabelanger | trying to stand up new executor and hitting it | 20:58 |
tristanC | pabelanger: perhaps a jwt library update? | 20:59 |
tristanC | pabelanger: without https://review.opendev.org/c/zuul/zuul/+/768312 , you need to pin it to <2 | 21:00 |
pabelanger | tristanC: thanks! that look to be it | 21:01 |
pabelanger | manually downgrading to 1.7.1 fixed it | 21:01 |
tristanC | i guess pip installing zuul 3.19.1 hasn't been working since december | 21:02 |
pabelanger | yah, possible | 21:02 |
pabelanger | at least for github | 21:02 |
clarkb | you can always use a contraints file if necessary | 21:04 |
pabelanger | true | 21:08 |
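For reference, a hedged example of the workaround described above (the real fix being 768312):

```bash
# PyJWT 2.0 changed its API (e.g. jwt.encode() now returns str rather than
# bytes), which breaks zuul releases from before 768312. Pinning below 2,
# or carrying the pin in a constraints file, works around it:
pip install 'PyJWT<2'
# or:
echo 'PyJWT<2' > constraints.txt
pip install -c constraints.txt zuul==3.19.1
```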
*** mgagne has quit IRC | 21:08 | |
*** jamesmcarthur has quit IRC | 21:12 | |
*** jamesmcarthur has joined #zuul | 21:14 | |
*** jamesmcarthur has quit IRC | 21:26 | |
*** jamesmcarthur has joined #zuul | 21:37 | |
clarkb | corvus: left a couple of notes on the WIP nodepool change. | 21:39 |
*** jamesmcarthur has quit IRC | 21:39 | |
*** jamesmcarthur has joined #zuul | 21:40 | |
clarkb | corvus: and for the functional jobs I see where ensure-zookeeper started zookeeper but then later logs complain about being unable to connect. Not quite sure what is going on there | 21:42 |
clarkb | looks like it is connecting to port 2181 | 21:44 |
clarkb | ok left another comment to address ^ I think. | 21:45 |
avass | tristanC: I think overrides might not make much sense after all... | 21:58 |
avass | well, at least I tried to build a newer rustc version by overriding pkgs.rustc.version; for some reason that produces logs as if it downloaded a 1.50.0 tarball, but it turns out it built 1.41.0 anyway.. :( | 22:00 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 22:01 |
corvus | clarkb: thx updated | 22:01 |
tristanC | avass: it is confusing indeed :-) I think it's necessary when a function expects a package set, for example if you want to rebuild zuul-nix with a modified set, then you need to create it with an override | 22:01 |
avass | tristanC: yeah I read through the pills explaining overrides and config.nix and followed along with the examples and that made sense. but I feel like I have no idea why that didn't work for a real package | 22:02 |
tristanC | avass: the override implementation may vary between package sets, it's just a nix function after all | 22:02 |
avass | it would help if the packages were better documented with what can be overridden | 22:04 |
tristanC | avass: for sure, i would recommend checking the source directly :) for rust, have you tried https://github.com/mozilla/nixpkgs-mozilla#using-in-nix-expressions ? | 22:06 |
avass | tristanC: that's what I'm doing currently. but having to read through the source to use a package manager will be a bit hard to motivate when people are used to the simplicity of containers | 22:07 |
avass | will take a look at that | 22:07 |
tristanC | avass: that's true, though once the expression is working, it's quite easy to use. And unless you are doing complicated things, you shouldn't have to use an override just to use a different toolchain, for rust mozilla provides every set already. | 22:15 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: scheduler: add statsd metric for enqueue time https://review.opendev.org/c/zuul/zuul/+/776287 | 22:21 |
clarkb | corvus: just noticed another thing that is important but maybe do that update after we see how the functional jobs do? | 22:22 |
clarkb | that actually had me really confused for a minute too. Was wondering why the new pre run wasn't used | 22:22 |
corvus | clarkb: derp yes thx. updated locally, will wait for results to push. | 22:23 |
clarkb | nodepool is just starting now in the functional openstack job | 22:31 |
clarkb | it isn't spamming about connection problems so I think it is looking good. | 22:32 |
clarkb | will have to see if image builds and launches all agree | 22:32 |
avass | tristanC: thanks that worked a lot better :) | 22:36 |
*** jamesmcarthur has quit IRC | 22:38 | |
tristanC | avass: that's good to know! note that you can find similar package sets for other toolchains outside of nixpkgs, for example you would pull zuul-nix using an equivalent import expression | 22:45 |
clarkb | corvus: the functional openstack jobs succeeded | 22:48 |
clarkb | I think this is very close once you swap out the unittest jobs | 22:48 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Add python-logstash-async to container images https://review.opendev.org/c/zuul/zuul/+/776551 | 22:49 |
tristanC | corvus: thank you for the suggestion, i've added a statsd service to my benchmark and the metrics are more finegrained, for enqueue time from 776287 i get: mean 0.00139 std dev 0.00018 (compared to the mqtt measure: mean 0.01406 std dev 0.00120) | 22:50 |
tristanC | for reference, here is the updated script: https://github.com/TristanCacqueray/zuul-nix/commit/2083b33005e568115a5e71b2847a953b1dd5d62c | 22:51 |
corvus | oh interesting, i wasn't expecting them to necessarily be different as much as being something that could be compared with prod. that's good to know and i'm glad that worked out | 22:52 |
corvus | clarkb: looks like maybe it's far enough along i should just push up the fix and see if it passes for real? | 22:55 |
tristanC | well the timestamps are probably not collected at the same place, the statsd one measures from sched.addEvent to the end of manager.addChange | 22:55 |
clarkb | corvus: ya I think so | 22:55 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 22:55 |
*** harrymichal has quit IRC | 22:57 | |
*** harrymichal has joined #zuul | 22:57 | |
pabelanger | I may have raised this before, but cannot remember. We have a lot of multi-node jobs, and sometimes when we are close to quota capacity we end up with a lot of unlocked ready nodes. Here is an example: http://paste.openstack.org/show/802801/ | 23:05 |
pabelanger | nodes 0001147143 and 0001147144 are ready, but because node 0001147128 failed, they seem to be idle now (using quota). | 23:07 |
pabelanger | and given we've hit quota, they sometimes sit idle for upwards of 8 hours | 23:07 |
*** jamesmcarthur has joined #zuul | 23:07 | |
pabelanger | I am curious if we could add something so that if the whole nodeset isn't allocated, we also delete the unlocked ready nodes | 23:08 |
pabelanger | and give them back to the pool | 23:08 |
clarkb | pabelanger: from memory I believe they are technically part of the pool already. If another job came by looking for those labels they could be used for them instead | 23:10 |
clarkb | but they aren't gone so they do consume those resources until they time out or are used | 23:10 |
clarkb | you can reduce the timeout for cleaning ready nodes iirc | 23:11 |
pabelanger | Hmm, so I need to confirm but we are not seeing them get used by other jobs. That would actually be good if they did | 23:12 |
pabelanger | we do set a low 'max-ready-age' but that doesn't seem to work in this case | 23:12 |
pabelanger | eg: max-ready-age: 600 | 23:13 |
pabelanger | but, they will stay online well over 8 hours | 23:13 |
pabelanger | https://dashboard.zuul.ansible.com/t/ansible/nodes | 23:13 |
pabelanger | is an example of the amount right now | 23:14 |
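For reference, a hedged sketch of where that knob lives; the label name and exact placement are assumptions based on the discussion rather than a copy of the real config.

```yaml
# nodepool.yaml sketch (illustrative): ready nodes of this label that sit
# unlocked longer than max-ready-age seconds should be deleted by the
# cleanup worker and their quota returned to the pool.
labels:
  - name: centos-8-1vcpu
    min-ready: 0
    max-ready-age: 600
```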
*** harrymichal has quit IRC | 23:15 | |
*** jamesmcarthur has quit IRC | 23:15 | |
*** jamesmcarthur_ has joined #zuul | 23:15 | |
clarkb | that doesn't show locked state though | 23:16 |
clarkb | is it possible the ready nodes are locked? | 23:16 |
pabelanger | | 0001147027 | limestone-us-dfw-1 | centos-8-1vcpu | ad65b3b4-aa9a-4778-9d12-14f58577a11b | 162.253.43.28 | 2607:ff68:100:a::13e | ready | 00:01:05:32 | unlocked | | 23:16 |
pabelanger | no | 23:16 |
pabelanger | I _think_ they are still part of original node request | 23:17 |
pabelanger | and not free, until the missing node comes online | 23:17 |
pabelanger | which is a chicken and egg issue, as the ready unlocked nodes are taking up quota | 23:17 |
clarkb | oh the original node request is still alive in that provider? | 23:18 |
clarkb | then ya I think that would hold those resources for that request (though they should stay locked until then?) | 23:18 |
pabelanger | yah, it is | 23:18 |
pabelanger | just confirmed | 23:18 |
pabelanger | | 200-0000346078 | 2 | requested | zs01.sjc1.vexxhost.zuul.ansible.com | centos-8-1vcpu,vmware-vcsa-7.0.0,esxi-6.7.0-without-nested,esxi-6.7.0-without-nested | | nl01-26418-PoolWorker.ec2-us-east-2-main,nl01-26418-PoolWorker.limestone-us-dfw-1-s1.small,nl01-26418-PoolWorker.limestone-us-slc-s1.small,nl02-10946-PoolWorker.vexxhost-sjc1-v2-highcpu-1 | 224c7760-7232-11eb-86ac-5a443aaf3adf | | 23:18 |
pabelanger | I need to confirm what the Priority value (2) means | 23:19 |
pabelanger | if higher or lower means, it gets requests sooner | 23:19 |
clarkb | pabelanger: I think it would be helpful to specifically figureout what state the node request is in while those nodes sit ready | 23:23 |
clarkb | reading the max-ready-age cleanup code it seems the only checks are: is this ready and is the age greater than the max age; if so, lock it then delete it | 23:23 |
clarkb | (that is why I asked about being locked, because an unlocked node for a label with a ready age that has passed should be cleanable) | 23:24 |
clarkb | I wonder if the cleanup doesn't run often enough | 23:24 |
pabelanger | sure, let me see if I can figure it out | 23:24 |
pabelanger | I believe I can see the cleanup handler running | 23:25 |
clarkb | if you grep 'exceeds max ready age' in debug logs you should see if it is ever trying to delete a node | 23:26 |
clarkb | looks like those cleanups just run in a loop one after another so it should be serviced fairly frequently | 23:27 |
pabelanger | yah, I can see it for other nodes | 23:27 |
pabelanger | 2021-02-18 23:13:50,840 DEBUG nodepool.CleanupWorker: Node 0001147424 exceeds max ready age: 1300.8019075393677 >= 600 | 23:27 |
pabelanger | so that is odd | 23:28 |
pabelanger | that node was in another node-request, which didn't fully come online: 200-0000346163 | 23:29 |
clarkb | pabelanger: looking at your original paste that request was released and I would expect 0001147127 0001147128 0001147143 and 0001147144 to be cleaned up by the max-ready-age cleanup | 23:30 |
clarkb | 2021-02-18 22:24:02,138 DEBUG nodepool.PoolWorker.limestone-us-slc-s1.small: [e: 224c7760-7232-11eb-86ac-5a443aaf3adf] [node_request: 200-0000346078] Removing request handler <- is the line that I think says I'm giving up | 23:31 |
clarkb | and so those 4 nodes should either be used by subsequent requests or get cleaned up | 23:31 |
pabelanger | okay, I am not going to touch anything and will grep for them in a bit | 23:31 |
clarkb | might be good to double check those 4 nodes and see what happened with them | 23:31 |
clarkb | and work from there | 23:31 |
*** tosky has quit IRC | 23:31 | |
pabelanger | I see a lot of exceeds max ready age well above 600 | 23:32 |
pabelanger | 2021-02-18 20:17:19,868 DEBUG nodepool.CleanupWorker: Node 0001145171 exceeds max ready age: 4105.30643081665 >= 600 | 23:32 |
clarkb | ya I suspect that may be the logging of what you are seeing and now need to work backward from there to see why the age is so high before it gets removed | 23:33 |
clarkb | maybe it was part of a request that lived for over an hour and it wasn't until the request went away that you get the cleanup | 23:34 |
pabelanger | maybe | 23:39 |
pabelanger | I can still see 200-0000346078 in the request-list | 23:39 |
pabelanger | but nothing in logs since | 23:39 |
pabelanger | 2021-02-18 22:24:02,970 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0001147128 (state: failed, allocated_to: 200-0000346078) | 23:39 |
pabelanger | over an hour ago | 23:40 |
clarkb | 200-0000346078 should stay in the request list until all providers fail it or one succeeds | 23:40 |
clarkb | that log line may be the breadcrumb you need | 23:40 |
clarkb | I would expect that to be deletable because that provider isn't handling that request anymore but maybe there is a bug where if the request is still outstanding on any provider it is considered alive | 23:41 |
pabelanger | yah | 23:44 |
pabelanger | I can't see anything specific now | 23:44 |
pabelanger | I think I just have to wait, and see what happens | 23:44 |
pabelanger | then try to debug after the fact | 23:44 |
pabelanger | okay, have to run | 23:45 |
pabelanger | will let you know | 23:45 |
*** jamesmcarthur_ has quit IRC | 23:53 | |
*** jamesmcarthur has joined #zuul | 23:54 | |
*** jamesmcarthur has quit IRC | 23:59 |