Saturday, 2021-07-03

opendevreview	James E. Blair proposed zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub https://review.opendev.org/c/zuul/zuul-jobs/+/799338	00:00
corvus	fungi, clarkb: apparently that xenial node is failing to boot in bhs1	00:25
clarkb	oh hrm	00:25
corvus	it's on attempt #2	00:25
fungi	timeouts?	00:26
clarkb	corvus: do you have the node id so we can see the traceback?	00:26
corvus	yeah	00:26
corvus	2021-07-03 00:17:16,437 INFO nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Node 6aae1b30-8d51-4c9f-95bb-ee96be8bd1e3 scheduled for cleanup	00:26
corvus	from nl04	00:26
corvus	that was attempt #1	00:26
fungi	we unfortunately see frequent timeouts in a number of our providers, i wonder if we need to tune the launchers to wait longer	00:26
clarkb	openstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up.	00:27
corvus	fungi: i'm not sure i want to wait >10 minutes for a node to boot	00:27
corvus	though tbh, i don't know why we wait 10 minutes 3 times	00:27
corvus	i would like it to just fail immediately so we can actually fall back on another provider	00:28
clarkb	we retry 3 times in the code regardless of failure type and then timeouts can cause failure	00:28
clarkb	but ya in the case of timeouts falling back to another provider may be a good idea if you have >1 able to fulfill the request	00:28
corvus	yeah, i just think we've had a lot of push-pull over the years about failure conditions	00:28
fungi	i agree treating timeout failures differently makes sense	00:28
clarkb	you do want to retry if it is your only provider remaining I think	00:29
corvus	it doesn't make sense to set a timeout then wait 3 times, because it would (as fungi suggested initially) be better to wait 30m once than 10m 3 times :)	00:29
fungi	though it does probably increase the chances that we return node_failure if we don't have a given label in many providers	00:29
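Note: the policy debated above (retry most failures, but treat a boot timeout as a reason to fall back to another provider unless it is the only one left) can be sketched roughly as follows. This is a minimal illustration, not nodepool's launcher code: `LaunchTimeout`, `launch_with_fallback`, and the `(name, launch)` provider pairs are hypothetical names invented for the example.

```python
class LaunchTimeout(Exception):
    """Hypothetical stand-in for a boot timeout such as ResourceTimeout."""


def launch_with_fallback(providers, max_attempts=3):
    """Try each provider in order and return (node, history).

    providers is a list of (name, launch) pairs, where launch() returns a
    node or raises. A timeout moves on to the next provider right away,
    unless this is the last provider, in which case it is retried like any
    other failure, up to max_attempts.
    """
    history = []
    for index, (name, launch) in enumerate(providers):
        is_last = index == len(providers) - 1
        for attempt in range(1, max_attempts + 1):
            try:
                node = launch()
            except LaunchTimeout:
                history.append((name, attempt, "timeout"))
                if not is_last:
                    break  # fall back instead of waiting out another timeout
            except Exception:
                history.append((name, attempt, "error"))
            else:
                history.append((name, attempt, "ok"))
                return node, history
    # Every provider was exhausted; a caller would report node_failure here.
    return None, history
```

With a single provider this behaves like the existing "3 attempts regardless of failure type" loop; with several, a timeout costs one wait per provider rather than three. As fungi notes, the price is a higher chance of node_failure when few providers carry the label.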
clarkb	ok I need to check on dinner plans and related activities. I'll watch the zuul status	00:29
corvus	2021-07-03 00:27:18,229 ERROR nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Launch attempt 2/3 failed:	00:30
corvus	so only another 8 minutes of waiting until we can move on	00:30
fungi	another possibility is that something has happened with our xenial images and they're going to timeout everywhere	00:30
corvus	yeah, or xenial on bhs	00:30
fungi	i guess we'll find out when it switches to another provider/region	00:30
corvus	oh hey it's running	00:31
corvus	attempt #3 worked	00:31
fungi	third try's a charm? ;)	00:31
fungi	yeesh	00:31
corvus	maybe 3x 10m timeout isn't as terrible as i thought? i dunno	00:31
corvus	i'm happy to be proven wrong if it means the job starts running ;)	00:32
opendevreview	Merged zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub https://review.opendev.org/c/zuul/zuul-jobs/+/799338	00:35
corvus	i'm restarting all of zuul again to pick up today's bugfixes	01:06
corvus	#status log restarted all of zuul on commit 10966948d723ea75ca845f77d22b8623cb44eba4 to pick up stats and zk watch bugfixes	01:09
opendevstatus	corvus: finished logging	01:09
corvus	restoring queues	01:16
clarkb	corvus: a few jobs error'd in the openstack periodic queue	01:19
clarkb	but check seems happy	01:19
corvus	clarkb: i'm guessing that's an artifact of the re-enqueue?	01:19
fungi	are we going to need to clean up any leaked znodes from before?	01:19
clarkb	corvus: ya I'm wondering if periodic jobs don't reenqueue cleanly	01:20
corvus	fungi: no znodes leaked afaik, only watches, which have already disappeared due to closing the connections they were on	01:20
corvus	i think i saw that from the last re-enqueue	01:20
fungi	oh, got it, so watches aren't represented by their own znodes, they're a separate structure?	01:21
clarkb	ya a watch is a mechanism on top of the znodes (that may or may not be there)	01:21
corvus	yep, it's entirely an in-memory construct on a single zk server and associated with a single client connection; as soon as that connection is broken, it's gone	01:22
fungi	right, for some reason i had it in my head that watches were also znodes, pointing at znode paths which may or may not exist	01:22
fungi	makes more sense now	01:22
fungi	so restarting zuul would have also temporarily cleaned up the leaked watches	01:23
clarkb	yes	01:23
fungi	even before the fix	01:23
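Note: the distinction drawn above (znodes are stored data, while watches are per-connection state held in server memory) can be pictured with a toy model. This is only an illustration of the description given in the conversation, not ZooKeeper or kazoo code. `ToyZkServer` and its methods are hypothetical, and ephemeral znodes are ignored.

```python
from collections import defaultdict


class ToyZkServer:
    """Toy model: watches live in server memory, keyed by connection."""

    def __init__(self):
        self.znodes = set()              # the stored data tree
        self.watches = defaultdict(set)  # connection id -> watched paths

    def create(self, path):
        self.znodes.add(path)

    def add_watch(self, conn_id, path):
        # The watched path may or may not exist as a znode.
        self.watches[conn_id].add(path)

    def close(self, conn_id):
        # Closing the connection drops its watches; znodes are untouched.
        self.watches.pop(conn_id, None)

    def watch_count(self):
        return sum(len(paths) for paths in self.watches.values())
```

In this picture, restarting a client closes its connection, so any watches it leaked vanish with it, which matches why a restart cleaned things up even before the fix.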
clarkb	zk04 went from 720 to 715 watches according to graphite	01:24
clarkb	that is promising	01:24
corvus	also, i don't know how much a watch really "costs"; we may have been able to handle considerable growth before it became a problem	01:27
corvus	i'm just not in the mood to find out that way :)	01:27
clarkb	ya the admin docs warn against large numbers of watches (but don't specify a scale against hardware) when running the wchp command, but the command returned instantly for us in the 10-20k range without issue	01:28
clarkb	I suspect that "large numbers" in this case means much larger. ++ to not finding out the hard way though	01:28
corvus	none of zk04's cacti graphs show any kind of linear growth during the day, nor does the response time or anything like that, so they're probably relatively cheap	01:29
corvus	clarkb: yeah, seems like the 10k order of magnitude is experimentally okay :)	01:29
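Note: a rough sketch of how the wchp check mentioned above could be scripted. Four-letter commands are sent as plain text to the ZooKeeper client port and the server replies and closes the connection. The helper names here are hypothetical, and the parsing assumes that each watched path is followed by one indented session-id line per watcher, which should be checked against real output before relying on it.

```python
import socket


def four_letter_command(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def count_watches(wchp_text):
    """Count watches in wchp-style output.

    Assumes (unverified) that path lines start in column zero and that each
    watcher appears as an indented line beneath its path.
    """
    return sum(
        1
        for line in wchp_text.splitlines()
        if line.strip() and line[0] in " \t"
    )
```

Usage would look like `count_watches(four_letter_command("localhost", 2181, "wchp"))`, where the host and port are assumptions. Newer ZooKeeper releases may also require the command to be enabled in the server configuration.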
opendevreview	Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: defined DIB_FEDORA_SUBRELEASE for f3{3,4} https://review.opendev.org/c/openstack/diskimage-builder/+/799339	01:54
*** odyssey4me is now known as Guest1321		02:24
opendevreview	Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: reuse DIB_FEDORA_SUBRELEASE if set https://review.opendev.org/c/openstack/diskimage-builder/+/799340	02:27
opendevreview	Gonéri Le Bouder proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34 https://review.opendev.org/c/openstack/diskimage-builder/+/799341	02:27
*** cloudnull7 is now known as cloudnull		23:44
