corvus | zuul likes it | 00:00 |
---|---|---|
corvus | ah, i'm guessing it's not quite tuned for ubuntu -- from the nodepool-functional-k8s job: ubuntu-bionic | "msg": "No package matching 'java-latest-openjdk' is available" | 00:01 |
corvus | 'default-jdk-headless' maybe? | 00:02 |
clarkb | ya or default-jdk | 00:02 |
clarkb | not sure what the difference is but headless seems appropriate here | 00:02 |
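As an aside, the distro-conditional package choice being discussed could look roughly like the Ansible task below; the task name, the Debian check, and the non-Debian fallback package are illustrative assumptions, not the contents of the actual ensure-zookeeper fix.

```yaml
# Illustrative sketch only: install a headless JDK on Debian/Ubuntu and fall
# back to Fedora's java-latest-openjdk elsewhere. The real fix lives in the
# zuul-jobs ensure-zookeeper role.
- name: Install a JDK for ZooKeeper
  become: true
  package:
    name: "{{ 'default-jdk-headless' if ansible_os_family == 'Debian' else 'java-latest-openjdk' }}"
    state: present
```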
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: ensure-zookeeper: add use_tls role var https://review.opendev.org/c/zuul/zuul-jobs/+/776290 | 00:06 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS. https://review.opendev.org/c/zuul/nodepool/+/776286 | 00:07 |
clarkb | corvus: looking at the WIP change it seems to have a lot of the same stuff as the zj role. Is the intent that nodepool would bootstrap itself? | 00:09 |
corvus | clarkb: i don't want to require 'ensure-zookeeper' for the tox jobs | 00:10 |
corvus | i like the idea of test-setup.sh doing that in a way that works for the zuul tox jobs and devs | 00:11 |
clarkb | ok just making sure I understand the various bits there | 00:11 |
corvus | we could switch the functional jobs to use test-setup, but those are running k8s and openshift, so there's a pretty good argument for not running something in docker in those jobs :) | 00:11 |
corvus | and they are not especially standard; they already have their own playbooks | 00:12 |
clarkb | corvus: in test-setup.sh why use docker-compose rm -sf over docker-compose down? (If you use down you can drop the condition and I think down does roughly the same thing?) | 00:13 |
corvus | clarkb: no idea, copied that from zuul's test-setup-docker.sh | 00:13 |
corvus | i'll incorporate that in the next rev | 00:13 |
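For context, a minimal sketch of the simplification clarkb is suggesting; the path and service name here are assumptions rather than the real test-setup.sh contents.

```bash
#!/bin/bash
# Sketch only: "docker-compose down" stops and removes the compose project's
# containers (and default network), so the separate "docker-compose rm -sf"
# plus its guard condition become unnecessary.
cd "$(dirname "$0")"            # assumed location of the compose file
docker-compose down             # replaces the conditional rm -sf step
docker-compose up -d zk         # assumed service name for zookeeper
```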
clarkb | corvus: found a minor thing on the WIP change (left it as a comment) | 00:15 |
corvus | ah thx | 00:16 |
corvus | that will probably fail some unit tests but not all of them; i'm going to let it run a bit more before i push up a fix | 00:17 |
*** tosky has quit IRC | 00:30 | |
corvus | i'm confused about why the py3x tests timed out; they seem to be mostly running; i'm not sure what they were doing when time was up | 01:02 |
clarkb | corvus: https://9e7a0909972c63991fa4-da3822d63841e990242061d65cb4e6c4.ssl.cf5.rackcdn.com/776286/8/check/tox-py36/be3b5ed/tmpqv4jwhl2 is an attempt at deciphering that | 01:03 |
clarkb | that is the partial subunit stream | 01:03 |
* clarkb pulls it up | 01:04 | |
corvus | yeah, it ends with a successful test | 01:04 |
corvus | i think i have it running the bad test locally, i just still don't know which one it is | 01:06 |
corvus | i don't see a test name in my stestr output | 01:06 |
clarkb | corvus: grep ^test: $thatsubunitfile | wc -l is 91 | 01:06 |
corvus | clarkb: i don't know how to use that info | 01:09 |
clarkb | mostly just pointing out it only ran 91 tests before it timed out (I think nodepool has several hundred tests overall) | 01:10 |
corvus | 277 | 01:11 |
corvus | the 3 tests around line 91 of 'python -m testtools.run discover --list' are fine | 01:11 |
corvus | i'm really confused because we have a TESTING.rst file that says if i run "stestr run" it will print out the name of each test as it runs | 01:12 |
corvus | but all i see is log output | 01:12 |
clarkb | corvus: the invocation in the job log uses --no-subunit-trace which suppresses the behavior you want I think | 01:13 |
corvus | ok. well i'm running plain "stestr run" now, and i do see some job names among the logs | 01:14 |
corvus | so maybe if it hangs i'll see it this time | 01:14 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/tox.ini#L21 | 01:14 |
corvus | okay, running that way and then hitting ctrl-c when it hung has given me the names of the 7 running tests that were hanging | 01:16 |
corvus | so i can iterate now | 01:16 |
clarkb | and ya it seems like the file only writes complete subunit results (to avoid interleaving?) but the proper tracing should print them as they go | 01:16 |
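In practical terms, the difference discussed above looks like this (hedged, since the exact tox.ini command may differ):

```bash
# CI runs stestr with --no-subunit-trace, which suppresses the per-test trace
# output, so a timed-out run only shows log lines. Running stestr directly
# restores the trace and prints each test name as it completes:
stestr run
# Counting completed tests in the partial subunit stream, as above:
grep '^test:' tmpqv4jwhl2 | wc -l    # filename taken from the job log link
```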
corvus | okay, i think i have fixes for the latest errors in the k8s/openshift jobs and these tests; i'm just running locally again to find any more | 01:24 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 01:32 |
corvus | at this point, i expect some of these jobs to start passing | 01:32 |
*** harrymichal has quit IRC | 01:49 | |
corvus | clarkb, tobiash: the disadvantage of using docker for the unit tests is: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit | 01:52 |
corvus | maybe we should leave the docker setup for devs only and switch the tox jobs to ensure-zookeeper | 01:53 |
corvus | tobiash, felixedel, swest: if you start to look at moving zk connection inside zuul services, note the work going on in nodepool; see the lines above ^ and also https://review.opendev.org/776290 and https://review.opendev.org/776286 | 02:02 |
*** dmsimard8 has joined #zuul | 02:56 | |
*** dmsimard has quit IRC | 02:59 | |
*** aluria has quit IRC | 02:59 | |
*** SpamapS has quit IRC | 02:59 | |
*** gouthamr has quit IRC | 02:59 | |
*** tflink has quit IRC | 02:59 | |
*** dmsimard8 is now known as dmsimard | 02:59 | |
*** ianw has quit IRC | 02:59 | |
*** tflink has joined #zuul | 03:00 | |
*** SpamapS has joined #zuul | 03:00 | |
*** ianw has joined #zuul | 03:00 | |
*** gouthamr has joined #zuul | 03:03 | |
*** aluria has joined #zuul | 03:04 | |
*** ykarel has joined #zuul | 05:00 | |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #zuul | 05:33 | |
*** jfoufas1 has joined #zuul | 06:00 | |
*** gouthamr has joined #zuul | 06:01 | |
*** rlandy|bbl has quit IRC | 06:17 | |
*** vishalmanchanda has joined #zuul | 06:34 | |
*** reiterative has quit IRC | 06:48 | |
*** reiterative has joined #zuul | 06:48 | |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:01 |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:03 |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 07:04 |
*** jpena|off is now known as jpena | 07:36 | |
*** harrymichal has joined #zuul | 07:58 | |
*** jcapitao has joined #zuul | 08:13 | |
*** rpittau|afk is now known as rpittau | 08:14 | |
*** hashar has joined #zuul | 08:21 | |
*** tosky has joined #zuul | 08:45 | |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 08:51 |
*** jfoufas1 has quit IRC | 08:56 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 09:05 |
*** ykarel_ has joined #zuul | 09:05 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 09:07 |
*** ykarel has quit IRC | 09:08 | |
*** nils has joined #zuul | 09:22 | |
*** ykarel_ is now known as ykarel | 09:49 | |
*** wuchunyang has joined #zuul | 10:39 | |
*** wuchunyang has quit IRC | 10:44 | |
openstackgerrit | Merged zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 11:31 |
*** harrymichal has quit IRC | 11:38 | |
*** harrymichal has joined #zuul | 11:39 | |
*** jcapitao is now known as jcapitao_lunch | 12:08 | |
*** rlandy has joined #zuul | 12:28 | |
*** jpena is now known as jpena|lunch | 12:30 | |
*** ikhan has joined #zuul | 12:50 | |
*** jcapitao_lunch is now known as jcapitao | 13:19 | |
*** jpena|lunch is now known as jpena | 13:31 | |
*** iurygregory has quit IRC | 13:47 | |
*** iurygregory has joined #zuul | 13:50 | |
*** ykarel_ has joined #zuul | 14:14 | |
*** ykarel has quit IRC | 14:17 | |
pabelanger | I know some folks are using prometheus with zuul metrics, is that right? if so, do people have it documented any place? | 14:27 |
*** ykarel_ has quit IRC | 14:27 | |
pabelanger | we'll likely have to use that given no one offers graphite much any more | 14:27 |
corvus | pabelanger: you run your own control plane right? or are you moving to a hosted control plane? | 14:28 |
corvus | (not sure who needs to "offer" graphite in this scenario) | 14:29 |
pabelanger | yah, we still run our own. But we don't want to manage more services, so looking to other teams to provide something | 14:30 |
pabelanger | and prometheus is the thing, not graphite | 14:30 |
pabelanger | and basically don't have time / effort to stand up our own now | 14:30 |
corvus | pabelanger: well, fwiw, graphite isn't required... you can use grafana with influxdb as your statsd receiver as well. | 14:31 |
pabelanger | we're kind of at a crossroads, of looking for managed CI again. Which may or may not be zuul | 14:31 |
pabelanger | :( | 14:31 |
corvus | pabelanger: if you do need prometheus, i think tobiash has a patch in review under zuul/zuul for a prometheus statsd exporter | 14:33 |
avass | where is logging configured in zuul and what decides what variables are passed when zuul (python?) logs something? | 14:47 |
avass | not that well versed in how logging in python actually works | 14:47 |
corvus | avass: uses standard python logging library: https://docs.python.org/3/library/logging.html | 14:47 |
avass | anyhow, I don't think buildset gets logged nicely, and if it is I can't find it except in the actual message | 14:47 |
corvus | avass: that configures the formatting, etc. but if you want extra messages or info/metadata along with the messages, those are source code changes | 14:48 |
avass | corvus: oh nvm, think I found it. the get_annotated_logger function is probably what I'm looking for | 14:49 |
avass | it would be nice if buildset was part of the logs somehow as well | 14:50 |
tristanC | corvus: note that influxdb statsd support is a bit tricky, we couldn't use the graphite query in grafana and we had to configure continuous queries for nodepool | 14:50 |
*** tosky has quit IRC | 14:51 | |
*** tosky_ has joined #zuul | 14:51 | |
corvus | avass: ack... i wonder if it would be better to have it in every line related to a build, or just have a clear "starting build X for buildset Y" for cross-referencing | 14:51 |
*** tosky_ is now known as tosky | 14:51 | |
avass | corvus: for console logs that should be enough but for our splunk logs it would be nice to have that as part of the data for each log entry | 14:52 |
avass | because doing buildset=<hash> levelname=ERROR could be convenient :) | 14:53 |
avass | but event_id should do pretty much the same thing | 14:55 |
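A rough Python sketch of the pattern avass found; the exact get_annotated_logger signature and the idea of threading a buildset id through it are assumptions here, not a description of current zuul code.

```python
import logging

from zuul.lib.logutil import get_annotated_logger

log = logging.getLogger("zuul.ExecutorServer")


def start_build(build, event):
    # Wrap the logger so each message carries the event id (and, per the
    # suggestion above, could also carry the buildset id) as structured
    # fields that a log shipper such as splunk can index and query on.
    job_log = get_annotated_logger(log, event, build=build.uuid)
    job_log.info("Starting build")
```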
tristanC | pabelanger: we have been using the node_exporter text file feature to get zuul metrics in prometheus | 15:01 |
pabelanger | sorry, I got pulled into a meeting. catching up | 15:02 |
tristanC | pabelanger: here are the rules https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/monitoring/rules-zuul.yaml and the script to generate the metrics is: https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/roles/custom-exporter/files/exporter.py | 15:02 |
pabelanger | so, the feedback I am getting is 'we need more zuul metric', which is solved already. But I have a very limited amount of time or ability to launch new services to support it. Given teams already have prometheus (I am unsure about influxdb), the plan would be to use those services, over asking them to run graphite. | 15:05 |
pabelanger | statsd is already in place, so we don't have to provision that (aside from exporters) | 15:05 |
pabelanger | tristanC: thanks, I'll look | 15:06 |
tristanC | pabelanger: the links i gave you are more about alerting when something breaks, you'll need to look at tobiash's statsd exporter configuration for metrics | 15:13 |
pabelanger | https://review.opendev.org/c/zuul/zuul/+/660472 | 15:18 |
pabelanger | thanks | 15:18 |
*** hashar has quit IRC | 15:18 | |
pabelanger | that looks like it might be easy to setup and test | 15:18 |
*** rlandy is now known as rlandy|training | 15:39 | |
corvus | pabelanger: yep that's the one | 15:43 |
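Independent of that patch, one commonly used option is the standalone prometheus statsd_exporter with a mapping config that relabels zuul's dotted statsd names; the mapping below is a hedged illustration rather than a tested configuration.

```yaml
# statsd_exporter mapping sketch (illustrative): expose a zuul pipeline
# gauge as a prometheus metric with tenant/pipeline labels.
mappings:
  - match: "zuul.tenant.*.pipeline.*.current_changes"
    name: "zuul_pipeline_current_changes"
    labels:
      tenant: "$1"
      pipeline: "$2"
```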
*** ykarel_ has joined #zuul | 15:48 | |
tristanC | corvus: about ensure-zookeeper tls support, if i understand correctly it fixed the k8s/openshift integration jobs, but we now need to use it for the regular tox job? | 15:48 |
tristanC | thank you for taking over the change by the way | 15:49 |
corvus | tristanC: yes. i wanted to use docker in test-setup.sh so that it was the same for devs and CI, and we wouldn't have to add a playbook to the tox jobs. however, with the dockerhub rate limits, i think your idea of using it for the tox jobs is better. | 15:50 |
corvus | tristanC: so i think we just need to update 776286 to run ensure-zookeeper and then set env variables for the unit test tox runs | 15:50 |
corvus | i can do that in a little bit | 15:50 |
corvus | tristanC: and thank you for the ensure-zk change -- i love timezone teamwork :) | 15:51 |
clarkb | corvus: no objection from me using ensure-zk for unittests too | 16:02 |
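A hedged sketch of that wiring; the playbook, variable names, and port are assumptions, not the contents of 776286 or 776290.

```yaml
# Illustrative pre-run playbook for the tox jobs: start a TLS-enabled
# ZooKeeper via the zuul-jobs role discussed above (use_tls being the new
# role var from 776290).
- hosts: all
  roles:
    - role: ensure-zookeeper
      use_tls: true
```

The tox jobs would then export the ZooKeeper address and TLS cert paths to the unit tests, for example through the tox role's environment passthrough, so CI and developers running test-setup.sh locally both end up pointing at a TLS-enabled ZooKeeper.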
*** vishalmanchanda has quit IRC | 16:02 | |
*** rlandy|training is now known as rlandy | 16:16 | |
*** ykarel_ is now known as ykarel | 16:21 | |
*** mvadkert has joined #zuul | 16:22 | |
mvadkert | tristanC: hi, I looked at your Fedora lightning talk, do you have examples where you are now using Dhall in Zuul and for what? | 16:23 |
mvadkert | tristanC: we are looking to use cuelang.org for some of our services, but we want to look at Dhall also ... | 16:24 |
avass | mvadkert: it's widely used for the zuul-operator: https://opendev.org/zuul/zuul-operator | 16:24 |
mvadkert | avass: thanks, will check that out | 16:26 |
*** ykarel has quit IRC | 16:29 | |
clarkb | corvus: looks like unit tests passed in your latest recheck (but not some of the functional jobs), though it has not finished and reported yet | 16:36 |
tristanC | mvadkert: fbo wrote a blog post about it here: https://www.softwarefactory-project.io/using-dhall-to-generate-fedora-ci-zuul-config.html | 16:44 |
tristanC | mvadkert: you can also find several examples in the binding documentation here: https://github.com/softwarefactory-project/dhall-zuul | 16:45 |
tristanC | mvadkert: and finally, you can also find an ambitious work in progress expression to manage a full tenant configuration there: https://github.com/softwarefactory-project/bootstrap-your-zuul | 16:47 |
*** jpena is now known as jpena|off | 16:51 | |
*** ikhan has quit IRC | 16:52 | |
*** hashar has joined #zuul | 17:16 | |
mvadkert | tristanC: cool ty! | 17:20 |
*** jcapitao has quit IRC | 17:24 | |
tristanC | mvadkert: you're welcome, i'd be interested to see a zuul configuration in cue if you already made one | 17:36 |
mvadkert | tristanC: not for zuul, we plan to use it for other services, but the work has just started | 17:38 |
mvadkert | tristanC: so we are mostly looking around for similar use cases | 17:38 |
*** rpittau is now known as rpittau|afk | 17:47 | |
*** rlandy is now known as rlandy|training | 18:01 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 18:06 |
*** akrpan-pure has joined #zuul | 18:13 | |
akrpan-pure | Hello! I have a zuul run working fine on opendev/ci-sandbox, but I just switched it over to also run on openstack/cinder and the runs are working fine, but uploading the result mysteriously fails with a "Gerrit error executing gerrit review <rest of command>" and no comment on the change request | 18:14 |
akrpan-pure | Anyone either seen this before, or know how I could turn on more debugging info for response reporting through gerrit? | 18:14 |
corvus | akrpan-pure: you may be able to run zuul with the "-d" argument to get more debugging info | 18:15 |
clarkb | akrpan-pure: are your jobs trying to vote +/-1 verified? | 18:15 |
*** wuchunyang has joined #zuul | 18:15 | |
clarkb | I think that most openstack projects don't allow third party CI systems to vote and the reviews can fail if you try to vote | 18:16 |
akrpan-pure | .... oh I bet I know what it is, I made a new account for our third party CI, I bet it hasn't been approved | 18:17 |
akrpan-pure | *as a CI account yet, because our old one could vote just fine | 18:17 |
clarkb | akrpan-pure: cinder-ci has no members, they don't want anyone voting | 18:17 |
akrpan-pure | Oh indeed, I'm not sure what I was remembering but I could've sworn I saw votes from other CI systems including ours in the past | 18:19 |
akrpan-pure | Looking back now I see none | 18:19 |
akrpan-pure | Thanks for pointing that out! Easy enough to fix | 18:20 |
*** wuchunyang has quit IRC | 18:20 | |
*** hashar has quit IRC | 18:32 | |
*** akrpan-pure has quit IRC | 19:00 | |
*** mgoddard has quit IRC | 19:17 | |
*** mgoddard has joined #zuul | 19:18 | |
avass | tristanC: oh yep overrides are starting to click a bit now | 20:01 |
tristanC | i've published an haskell library to interact with zuul, it comes with a cli to list the nodepool labels used in projects pipeline configuration: https://hackage.haskell.org/package/zuul | 20:01 |
tristanC | avass: good, that seems like the best solution to modify a set; if you look in zuul-nix, override could be used to tweak the existing requirements instead of creating new ones when there is a mismatch | 20:04 |
*** ikhan has joined #zuul | 20:05 | |
*** nils has quit IRC | 20:14 | |
*** jamesmcarthur has joined #zuul | 20:26 | |
*** rlandy|training is now known as rlandy | 20:33 | |
pabelanger | anybody seen the following decode issue before? | 20:58 |
pabelanger | http://paste.openstack.org/show/802799/ | 20:58 |
pabelanger | trying to stand up new executor and hitting it | 20:58 |
tristanC | pabelanger: perhaps a jwt library update? | 20:59 |
tristanC | pabelanger: without https://review.opendev.org/c/zuul/zuul/+/768312 , you need to pin it to <2 | 21:00 |
pabelanger | tristanC: thanks! that look to be it | 21:01 |
pabelanger | manually downgrading to 1.7.1 fixed it | 21:01 |
tristanC | i guess pip installing zuul 3.19.1 hasn't been working since december | 21:02 |
pabelanger | yah, possible | 21:02 |
pabelanger | at least for github | 21:02 |
clarkb | you can always use a contraints file if necessary | 21:04 |
pabelanger | true | 21:08 |
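For reference, a hedged example of the workaround described above (the real fix being 768312):

```bash
# PyJWT 2.0 changed its API (e.g. jwt.encode() now returns str rather than
# bytes), which breaks zuul releases from before 768312. Pinning below 2,
# or carrying the pin in a constraints file, works around it:
pip install 'PyJWT<2'
# or:
echo 'PyJWT<2' > constraints.txt
pip install -c constraints.txt zuul==3.19.1
```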
*** mgagne has quit IRC | 21:08 | |
*** jamesmcarthur has quit IRC | 21:12 | |
*** jamesmcarthur has joined #zuul | 21:14 | |
*** jamesmcarthur has quit IRC | 21:26 | |
*** jamesmcarthur has joined #zuul | 21:37 | |
clarkb | corvus: left a couple of notes on the WIP nodepool change. | 21:39 |
*** jamesmcarthur has quit IRC | 21:39 | |
*** jamesmcarthur has joined #zuul | 21:40 | |
clarkb | corvus: and for the functional jobs I see where ensure-zookeeper started zookeeper but then later logs complain about being unable to connect. Not quite sure what is going on there | 21:42 |
clarkb | looks like it is connecting to port 2181 | 21:44 |
clarkb | ok left another comment to address ^ I think. | 21:45 |
avass | tristanC: I think overrides might not make much sense after all... | 21:58 |
avass | well, at least I tried to build a newer rustc version by overriding pkgs.rustc.version; for some reason that produces logs as if it downloaded a 1.50.0 tarball, but it turns out it built 1.41.0 anyway.. :( | 22:00 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 22:01 |
corvus | clarkb: thx updated | 22:01 |
tristanC | avass: it is confusing indeed :-) I think it's necessary when a function expects a package set, for example if you want to rebuild zuul-nix with a modified set, then you need to create it with an override | 22:01 |
avass | tristanC: yeah I read through the pills explaining overrides and config.nix and followed along with the examples and that made sense. but I feel like I have no idea why that didn't work for a real package | 22:02 |
tristanC | avass: the override implementation may vary between package sets, it's just a nix function after all | 22:02 |
avass | it would help if the packages were better documented with what can be overridden | 22:04 |
tristanC | avass: for sure, i would recommend checking the source directly :) for rust, have you tried https://github.com/mozilla/nixpkgs-mozilla#using-in-nix-expressions ? | 22:06 |
avass | tristanC: that's what I'm doing currently. but having to read through the source to use a package manager will be a bit hard to motivate when people are used to the simplicity of containers | 22:07 |
avass | will take a look at that | 22:07 |
tristanC | avass: that's true, though once the expression is working, it's quite easy to use. And unless you are doing complicated things, you shouldn't have to use an override just to use a different toolchain, for rust mozilla provides every set already. | 22:15 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: scheduler: add statsd metric for enqueue time https://review.opendev.org/c/zuul/zuul/+/776287 | 22:21 |
clarkb | corvus: just noticed another thing that is important but maybe do that update after we see how the functional jobs do? | 22:22 |
clarkb | that actually had me really confused for a minute too. Was wondering why the new pre run wasn't used | 22:22 |
corvus | clarkb: derp yes thx. updated locally, will wait for results to push. | 22:23 |
clarkb | nodepool is just starting now in the functional openstack job | 22:31 |
clarkb | it isn't spamming about connection problems so I think it is looking good. | 22:32 |
clarkb | will have to see if image builds and launches all agree | 22:32 |
avass | tristanC: thanks that worked a lot better :) | 22:36 |
*** jamesmcarthur has quit IRC | 22:38 | |
tristanC | avass: that's good to know! note that you can find similar package sets for other toolchains outside of nixpkgs, for example you would pull zuul-nix using an equivalent import expression | 22:45 |
clarkb | corvus: the functional openstack jobs succeeded | 22:48 |
clarkb | I think this is very close once you swap out the unittest jobs | 22:48 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Add python-logstash-async to container images https://review.opendev.org/c/zuul/zuul/+/776551 | 22:49 |
tristanC | corvus: thank you for the suggestion, i've added a statsd service to my benchmark and the metrics are more finegrained, for enqueue time from 776287 i get: mean 0.00139 std dev 0.00018 (compared to the mqtt measure: mean 0.01406 std dev 0.00120) | 22:50 |
tristanC | for reference, here is the updated script: https://github.com/TristanCacqueray/zuul-nix/commit/2083b33005e568115a5e71b2847a953b1dd5d62c | 22:51 |
corvus | oh interesting, i wasn't expecting them to necessarily be different as much as being something that could be compared with prod. that's good to know and i'm glad that worked out | 22:52 |
corvus | clarkb: looks like maybe it's far enough along i should just push up the fix and see if it passes for real? | 22:55 |
tristanC | well the timestamps are probably not collected at the same place, the statsd one measures from sched.addEvent to the end of manager.addChange | 22:55 |
clarkb | corvus: ya I think so | 22:55 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Require TLS https://review.opendev.org/c/zuul/nodepool/+/776286 | 22:55 |
*** harrymichal has quit IRC | 22:57 | |
*** harrymichal has joined #zuul | 22:57 | |
pabelanger | I may have raised this before, but cannot remember. We have a lot of multi-node jobs, and sometimes when we are close to quota capacity we end up with a lot of unlocked ready nodes. Here is an example: http://paste.openstack.org/show/802801/ | 23:05 |
pabelanger | nodes 0001147143 and 0001147144 are ready, but because node 0001147128 failed, they seem to be idle now (using quota). | 23:07 |
pabelanger | and given we've hit quota, they sometimes sit idle for upwards of 8 hours | 23:07 |
*** jamesmcarthur has joined #zuul | 23:07 | |
pabelanger | I am curious if we could add something so that if the whole nodeset isn't allocated, we also delete the unlocked ready nodes | 23:08 |
pabelanger | and give them back to the pool | 23:08 |
clarkb | pabelanger: from memory I believe they are technically part of the pool already. If another job came by looking for those labels they could be used for them instead | 23:10 |
clarkb | but they aren't gone so they do consume those resources until they time out or are used | 23:10 |
clarkb | you can reduce the timeout for cleaning ready nodes iirc | 23:11 |
pabelanger | Hmm, so I need to confirm but we are not seeing them get used by other jobs. That would actually be good if they did | 23:12 |
pabelanger | we do set a low 'max-ready-age' but that doesn't seem to work in this case | 23:12 |
pabelanger | eg: max-ready-age: 600 | 23:13 |
pabelanger | but, they will stay online well over 8 hours | 23:13 |
pabelanger | https://dashboard.zuul.ansible.com/t/ansible/nodes | 23:13 |
pabelanger | is an example of the amount right now | 23:14 |
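For reference, a hedged sketch of where that knob lives; the label name and exact placement are assumptions based on the discussion rather than a copy of the real config.

```yaml
# nodepool.yaml sketch (illustrative): ready nodes of this label that sit
# unlocked longer than max-ready-age seconds should be deleted by the
# cleanup worker and their quota returned to the pool.
labels:
  - name: centos-8-1vcpu
    min-ready: 0
    max-ready-age: 600
```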
*** harrymichal has quit IRC | 23:15 | |
*** jamesmcarthur has quit IRC | 23:15 | |
*** jamesmcarthur_ has joined #zuul | 23:15 | |
clarkb | that doesn't show locked state though | 23:16 |
clarkb | is it possible the ready nodes are locked? | 23:16 |
pabelanger | | 0001147027 | limestone-us-dfw-1 | centos-8-1vcpu | ad65b3b4-aa9a-4778-9d12-14f58577a11b | 162.253.43.28 | 2607:ff68:100:a::13e | ready | 00:01:05:32 | unlocked | | 23:16 |
pabelanger | no | 23:16 |
pabelanger | I _think_ they are still part of original node request | 23:17 |
pabelanger | and not free, until the missing node comes online | 23:17 |
pabelanger | which is a chicken and egg issue, as the ready unlocked nodes are taking up quota | 23:17 |
clarkb | oh the original node request is still alive in that provider? | 23:18 |
clarkb | then ya I think that would hold those resources for that request (though they should stay locked until then?) | 23:18 |
pabelanger | yah, it is | 23:18 |
pabelanger | just confirmed | 23:18 |
pabelanger | | 200-0000346078 | 2 | requested | zs01.sjc1.vexxhost.zuul.ansible.com | centos-8-1vcpu,vmware-vcsa-7.0.0,esxi-6.7.0-without-nested,esxi-6.7.0-without-nested | | nl01-26418-PoolWorker.ec2-us-east-2-main,nl01-26418-PoolWorker.limestone-us-dfw-1-s1.small,nl01-26418-PoolWorker.limestone-us-slc-s1.small,nl02-10946-PoolWorker.vexxhost-sjc1-v2-highcpu-1 | 224c7760-7232-11eb-86ac-5a443aaf3adf | | 23:18 |
pabelanger | I need to confirm what the Priority value (2) means | 23:19 |
pabelanger | if higher or lower means, it gets requests sooner | 23:19 |
clarkb | pabelanger: I think it would be helpful to specifically figureout what state the node request is in while those nodes sit ready | 23:23 |
clarkb | reading the max-ready-age cleanup code it seems the only checks are: is this ready and is the age greater than the max age; if so, lock it then delete it | 23:23 |
clarkb | (that is why I asked about being locked, because an unlocked node for a label with a ready age that has passed should be cleanable) | 23:24 |
clarkb | I wonder if the cleanup doesn't run often enough | 23:24 |
pabelanger | sure, let me see if I can figure it out | 23:24 |
pabelanger | I believe I can see the cleanup handler running | 23:25 |
clarkb | if you grep 'exceeds max ready age' in debug logs you should see if it is ever trying to delete a node | 23:26 |
clarkb | looks like those cleanups just run in a loop one after another so it should be serviced fairly frequently | 23:27 |
pabelanger | yah, I can see it for other nodes | 23:27 |
pabelanger | 2021-02-18 23:13:50,840 DEBUG nodepool.CleanupWorker: Node 0001147424 exceeds max ready age: 1300.8019075393677 >= 600 | 23:27 |
pabelanger | so that is odd | 23:28 |
pabelanger | that node was in another node-request, which didn't fully come online: 200-0000346163 | 23:29 |
clarkb | pabelanger: looking at your original paste that request was released and I would expect 0001147127 0001147128 0001147143 and 0001147144 to be cleaned up by the max-ready-age cleanup | 23:30 |
clarkb | 2021-02-18 22:24:02,138 DEBUG nodepool.PoolWorker.limestone-us-slc-s1.small: [e: 224c7760-7232-11eb-86ac-5a443aaf3adf] [node_request: 200-0000346078] Removing request handler <- is the line that I think says I'm giving up | 23:31 |
clarkb | and so those 4 nodes should either be used by subsequent requests or get cleaned up | 23:31 |
pabelanger | okay, I am not going to touch anything and will grep for them in a bit | 23:31 |
clarkb | might be good to double check those 4 nodes and see what happened with them | 23:31 |
clarkb | and work from there | 23:31 |
*** tosky has quit IRC | 23:31 | |
pabelanger | I see a lot of exceeds max ready age well above 600 | 23:32 |
pabelanger | 2021-02-18 20:17:19,868 DEBUG nodepool.CleanupWorker: Node 0001145171 exceeds max ready age: 4105.30643081665 >= 600 | 23:32 |
clarkb | ya I suspect that may be the logging of what you are seeing and now need to work backward from there to see why the age is so high before it gets removed | 23:33 |
clarkb | maybe it was part of a request that lived for over an hour and it wasn't until the request went away that you get the cleanup | 23:34 |
pabelanger | maybe | 23:39 |
pabelanger | I can still see 200-0000346078 in the request-list | 23:39 |
pabelanger | but nothing in logs since | 23:39 |
pabelanger | 2021-02-18 22:24:02,970 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0001147128 (state: failed, allocated_to: 200-0000346078) | 23:39 |
pabelanger | over an hour ago | 23:40 |
clarkb | 200-0000346078 should stay in the request list until all providers fail it or one succeeds | 23:40 |
clarkb | that log line may be the breadcrumb you need | 23:40 |
clarkb | I would expect that to be deletable because that provider isn't handling that request anymore but maybe there is a bug where if the request is still outstanding on any provider it is considered alive | 23:41 |
pabelanger | yah | 23:44 |
pabelanger | I can't see anything specific now | 23:44 |
pabelanger | I think I just have to wait, and see what happens | 23:44 |
pabelanger | then try to debug after the fact | 23:44 |
pabelanger | okay, have to run | 23:45 |
pabelanger | will let you know | 23:45 |
*** jamesmcarthur_ has quit IRC | 23:53 | |
*** jamesmcarthur has joined #zuul | 23:54 | |
*** jamesmcarthur has quit IRC | 23:59 |