clarkb | the zuul key backup cron failed as expected and sent email saying that was the case | 00:11 |
---|---|---|
clarkb | corvus: ^ this has me thinking. The command doesn't write any secret data to stdout or stderr right? as we don't want that getting emailed if the cronjob fails | 00:12 |
clarkb | pretty sure it doesn't and we're good | 00:12 |
clarkb | it will output if a key couldn't be backed up for some reason iirc but not share any of the key data | 00:12 |
corvus | clarkb: correct | 00:16 |
corvus | by design | 00:16 |
opendevreview | Merged opendev/base-jobs master: buildset-registry: add flag to make job fail https://review.opendev.org/c/opendev/base-jobs/+/806818 | 01:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933 | 01:30 |
ianw | :/ that didn't appear to work | 02:06 |
opendevreview | Ian Wienand proposed opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933 | 02:16 |
ianw | works better when you spell things right | 02:39 |
*** ysandeep|out is now known as ysandeep | 06:16 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 06:46 |
*** jpena|off is now known as jpena | 07:39 | |
*** ykarel is now known as ykarel|lunch | 08:13 | |
*** ysandeep is now known as ysandeep|lunch | 08:16 | |
*** ysandeep|lunch is now known as ysandeep | 09:12 | |
*** ykarel|lunch is now known as ykarel | 09:56 | |
*** ysandeep is now known as ysandeep|afk | 11:15 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:18 |
*** dviroel|out is now known as dviroel|ruck | 11:19 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:22 |
*** jpena is now known as jpena|lunch | 11:38 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:39 |
*** ysandeep|afk is now known as ysandeep | 12:13 | |
*** jpena|lunch is now known as jpena | 12:40 | |
*** frenzy_friday is now known as anbanerj|ruck | 12:46 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 12:47 |
*** dviroel|ruck is now known as dviroel | 12:48 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 13:05 |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 13:22 |
*** diablo_rojo_phone is now known as Guest6075 | 13:28 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:23 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:42 |
clarkb | infra-root it looks like the inap provider use has largely fallen off. There is one in-use server, perhaps a held node? I'll take a look shortly, but I suspect if that is the case we can delete it, then push a change to clean up the images in that cloud, then remove it entirely? | 14:52 |
clarkb | mgagne: ^ fyi | 14:52 |
fungi | clarkb: two nodes, one is held with "mnaser debug multi-arch containers" and the other is an almost-day-old centos-8-stream node which will likely get recycled in about 45 minutes | 14:54 |
clarkb | ah thanks, you have your ssh keys loaded :) | 14:54 |
clarkb | mnaser: ^ do you still need that hold? if not it will help us clean up a nodepool provider that got renamed | 14:55 |
mnaser | clarkb, fungi: mnaser debug multi-arch containers sounds like a really old hold that i don't need | 14:55 |
clarkb | thanks | 14:56 |
fungi | er, actually the centos-8-stream node is "in-use" for 00:23:13:02 so maybe leaked? or there's a stuck job... | 14:56 |
fungi | mnaser: thanks, i'll take out that autohold | 14:56 |
mnaser | clarkb, fungi: inap is now iweb? | 14:57 |
fungi | mnaser: yep | 14:57 |
mnaser | iweb used to be private, then became publicly traded, then private, then acquired by inap, and now i guess acquired by leaseweb (according to their site) to become an independent iweb again :P | 14:58 |
fungi | yeah, i lost track about halfway through the history there ;) | 14:58 |
mnaser | >In 2021 iWeb Technologies was acquired by Leaseweb. | 14:58 |
mnaser | ah yep | 14:58 |
opendevreview | Gage Hugo proposed opendev/irc-meetings master: Update openstack-helm irc meeting channel https://review.opendev.org/c/opendev/irc-meetings/+/805094 | 14:59 |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 15:12 |
mgagne | mnaser: this is correct regarding ownership. | 15:15 |
clarkb | I have run the gerrit user audit across the 33 remaining conflicts and the yaml file is in the typical location | 15:29 |
clarkb | I don't expect I'll get to reviewing those today, but hopefully soon and will start sending out emails and writing changes in an All-Users checkout | 15:30 |
clarkb | fungi: corvus that last inap instance is the paused job for 806901,2 in openstack's check queue | 15:36 |
clarkb | it appears we have ~4 changes that have leaked in the check queue there based on age | 15:37 |
clarkb | I'm not in a good spot to look at that right this moment as I've got a meeting in a few minutes, but I wonder if we've got a stale launcher that needs restarting or a similar issue to before where we forget requests? | 15:37 |
clarkb | 4e03748821a74fe6b755bc2155838125 is the zuul event id for the 806901,2 enqueue | 15:38 |
fungi | i'll try to dig into stuck jobs in a few, but also in meetings | 15:38 |
clarkb | what if it is stuck that way because we set max-servers to 0? | 15:56 |
clarkb | the job that is paused started on inap so all the jobs behind it want to run in inap too but cannot | 15:56 |
clarkb | That doesn't explain the other leaked changes in the queue though, but could explain this one | 15:56 |
opendevreview | Ananya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/807176 | 16:04 |
fungi | it's also possible we haven't completely squashed the hung/stale node requests problem, i haven't dug into it yet | 16:16 |
*** jpena is now known as jpena|off | 16:47 | |
*** ysandeep is now known as ysandeep|out | 16:47 | |
*** sshnaidm is now known as sshnaidm|off | 17:01 | |
clarkb | 299-0015272873 is one of the node requests that was submitted after the job paused | 17:04 |
clarkb | /etc/nodepool/nodepool.yaml on nl03 updated at 15:50 UTC yesterday. The above node request was submitted at 16:49 yesterday | 17:05 |
clarkb | I think it is very likely that we cannot provision those nodes because max-servers is 0 on the only provider that is capable of running those jobs | 17:05 |
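A quick way to sanity-check that reasoning is to look up which pools can supply a given label and what their max-servers is. This is only a minimal diagnostic sketch assuming the usual nodepool.yaml layout (providers → pools → labels); the label name below is a stand-in, not the one from the stuck request.

```python
# Illustrative sketch: list the pools that can supply a label and their
# max-servers, assuming the standard nodepool.yaml layout.
import yaml

label = 'ubuntu-xenial'  # hypothetical; substitute the stuck request's label

with open('/etc/nodepool/nodepool.yaml') as f:
    config = yaml.safe_load(f)

for provider in config.get('providers', []):
    for pool in provider.get('pools', []):
        if any(l.get('name') == label for l in pool.get('labels', [])):
            print(provider['name'], pool.get('name'),
                  'max-servers=%s' % pool.get('max-servers'))
```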
opendevreview | Ananya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/807176 | 17:06 |
clarkb | that doesn't explain the other three though | 17:06 |
clarkb | corvus: ^ is there a better way to remove a provider in nodepool? maybe we just accept this risk and stop, then reenqueue the change in these cases? | 17:06 |
clarkb | infra-root ^ I'll do that for 806901,2 shortly if there is no objection. The other three probably deserve looking at as I doubt they have this particular issue | 17:07 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:08 |
fungi | oh, was it trying to provide nodes for the child job from the same provider? yeah i guess that's intentional? | 17:08 |
clarkb | yes I believe zuul enforces that behavior | 17:09 |
corvus | yeah, i think that is an unexpected corner case (those 2 features were designed at different times) | 17:09 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:10 |
clarkb | for the airship change that is stuck it has a node request (300-0015270257) for https://opendev.org/airship/airshipctl/src/branch/master/zuul.d/nodesets.yaml#L39 that doesn't seem to be getting fulfilled | 17:10 |
corvus | i don't have a good solution to that. i think we just have to muddle through. dequeue/enqueue sounds good. | 17:11 |
clarkb | that label for the airship job can currently only be provided by the airship cloud provider | 17:11 |
clarkb | corvus: ok I'll dequeue enqueue for that one. | 17:11 |
clarkb | then I'll keep looking at this airship node request I guess | 17:12 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:14 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:14 |
clarkb | for the airship one it failed to boot three times in the airship cloud. Every other cloud should decline the request which should result in a node failure | 17:15 |
clarkb | not sure why that hasn't happened yet | 17:15 |
clarkb | the new iweb provider did decline the request when it came online according to the logs on nl03 | 17:18 |
clarkb | if launchers.issubset(set(self.request.declined_by)): <- is the condition we fail on | 17:21 |
clarkb | I'll have to check the zk db it looks like | 17:21 |
corvus | clarkb: if you want to see the declined list, you can run nodepool list with --detail | 17:22 |
corvus | request-list | 17:22 |
clarkb | corvus: thanks, but I need to see what the set of registered launchers is and I think that is only available from the db? | 17:22 |
clarkb | but then I can compare that against the decline by list | 17:22 |
corvus | yeah i think so | 17:23 |
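Doing that comparison by hand against the ZooKeeper data looks roughly like the sketch below. It is a minimal, illustrative snippet: the znode paths, JSON field names, and zk host are assumptions based on this discussion, not a supported nodepool interface.

```python
# Minimal sketch: compare registered launchers against a node request's
# declined_by list. Znode paths, field names and host are assumptions.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk01.opendev.org:2181')  # hypothetical host
zk.start()

launchers = set()
for child in zk.get_children('/nodepool/launchers'):
    data, _ = zk.get('/nodepool/launchers/' + child)
    launchers.add(json.loads(data)['id'])

data, _ = zk.get('/nodepool/requests/300-0015270257')
declined_by = set(json.loads(data).get('declined_by', []))

# Mirrors the launchers.issubset(...) condition quoted above: the request
# only fails once every registered launcher has declined it.
print('would fail:', launchers.issubset(declined_by))
print('still waiting on:', sorted(launchers - declined_by))

zk.stop()
```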
clarkb | zk-shell appears to pull in twitter packages now? | 17:24 |
corvus | always has | 17:24 |
clarkb | huh | 17:24 |
clarkb | corvus: I think we have two extra launchers registered: nl01.opendev.org-PoolWorker.rax-ord-main-06e2062bc61f4957bb40e1de543722ca and nl03.opendev.org-PoolWorker.iweb-mtl01-main-1a2f7610ca4941e4a6af286f5758da8b | 17:31 |
clarkb | however, those registrations are ephemeral so now I'm extra confused about how they exist | 17:32 |
clarkb | Maybe the inap one is related to the held node or the paused node and will get cleaned up automatically post dequeue enqueue? | 17:33 |
clarkb | maybe I can do a thread dump and match up the launcher id to a thread? | 17:34 |
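I believe the services can log a dump on a signal, but the underlying mechanism is just sys._current_frames(); a standalone sketch of that (illustrative only) looks like this:

```python
# Standalone sketch of a Python thread dump: enumerate live threads and
# print a stack for each, so thread names can be matched to launcher ids.
import sys
import threading
import traceback

def dump_threads():
    frames = sys._current_frames()
    for thread in threading.enumerate():
        print('Thread: %s (ident=%s, daemon=%s)'
              % (thread.name, thread.ident, thread.daemon))
        frame = frames.get(thread.ident)
        if frame is not None:
            print(''.join(traceback.format_stack(frame)))

dump_threads()
```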
clarkb | there is a single pool worker for inap main and yet we had two registered launchers for it | 17:38 |
clarkb | I didn't think we shared the zk client between launchers but that might explain it if the current launcher has kept track of it | 17:39 |
clarkb | oh sorry the iweb one is valid I think | 17:43 |
clarkb | nl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce is the invalid one | 17:43 |
clarkb | nl03.opendev.org-PoolWorker.inap-mtl01-main-b88877414cc0474caa11e7c383b17a07 shows up in the logs as declining requests but nl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce hasn't logged anything in the current logfile. It definitely seems like we haven't deregistered properly | 17:45 |
clarkb | looking back a few days I don't see 2d4c222d1678460c90f88c158e41b2ce in the logs. That implies it wasn't the shift from 120 to 0 max-servers that leaked it? | 17:46 |
corvus | we probably do share the zk client, so it's up to the launcher code to de-register the pool provider properly | 17:48 |
clarkb | based on the mtime on the znode it was last modified on July 30 ish UTC | 17:49 |
clarkb | a bit more confident that this isn't related to the provider update given that; it's just a coincidence that one of the two leaked launchers is also the one we are renaming | 17:50 |
corvus | do we expect frequent updates of that znode? | 17:50 |
corvus | i guess the question is really: did something in nodepool's config change to cause that pool-provider to reconfigure in the past month | 17:51 |
clarkb | looks like no, many of the other launchers have ctime == mtime | 17:52 |
corvus | if no, then this event may still be the first to trigger that code path | 17:52 |
clarkb | yes we updated its max-servers value yesterday | 17:52 |
clarkb | note there is a leaked rax-ord launcher as well and we haven't updated that recently as far as I know | 17:52 |
clarkb | looking at the code we seem to only call deregisterLauncher() if stop() on the thread is called. Looking at the thread dump it seems there is only one PoolWorker thread for inap currently which would have to be for the non leaked one that is logging declined requests. Is it possible the thread crashed hard for some reason and stop() was never called? | 17:53 |
clarkb | the run loop does catch all exceptions though | 17:54 |
clarkb | oh wait no it doesn't if the exception can happen early in the run loop | 17:54 |
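That would fit with how ephemeral znodes behave: they belong to the ZooKeeper session, not to the thread that created them, so if a PoolWorker dies without its stop()/deregister path running, the registration lingers as long as the shared client keeps its session. A toy illustration follows; the paths are made up and this is not nodepool's code.

```python
# Toy illustration: an ephemeral znode outlives the thread that created it
# as long as the (shared) session stays connected; only an explicit delete
# or a session loss removes it. Paths here are made up for the example.
from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# A worker registers itself.
zk.create('/example/launchers/worker-a', b'{}', ephemeral=True, makepath=True)

# If that worker's thread now crashes without calling its deregister method,
# nothing happens to the znode -- the session owning it is still alive.

# Cleanup requires an explicit delete (or losing the whole session):
zk.delete('/example/launchers/worker-a')
zk.stop()
```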
clarkb | grepping for exceptions in the last two days I don't see any that seem related to zookeeper breaking when trying to deregister or a thread being killed due to an uncaught exception. However I'm filtering stuff that I guess could be related? | 18:01 |
clarkb | corvus: fungi: any objection to me restarting nl03 and nl01's launchers to clear the ephemeral nodes? I'm not sure we'll get much more debugging out of the launcher that isn't in the logs already? | 18:02 |
clarkb | no ZooKeeper suspended messages either | 18:03 |
corvus | clarkb: wfm | 18:06 |
fungi | sounds fine to me | 18:07 |
clarkb | alright nl03 is done. will do nl01 when nl03 shows it is happy restarting | 18:08 |
fungi | thanks | 18:08 |
clarkb | corvus: the other thing I notice is that in nodepool.launcher.NodePool.run() we seem to check for stale pools based on provider name + pool name, which are the same for these leaked pools as for the good pool | 18:09 |
clarkb | corvus: however, the thread dump says the thread isn't there at all so that shouldn't be an issue | 18:09 |
*** Guest6075 is now known as diablo_rojo | 18:10 | |
*** diablo_rojo is now known as Guest6089 | 18:11 | |
*** Guest6089 is now known as diablo_rojo_phone | 18:12 | |
clarkb | I think we only check if we need to fail a node request when we decline a node request | 18:17 |
clarkb | oh maybe not seems the change finally got evicted | 18:17 |
fungi | the scheduler eventually times out the node request and issues a new one, right? | 18:18 |
clarkb | https://review.opendev.org/c/airship/airshipctl/+/772890/ <- has a node failure reported to it | 18:18 |
clarkb | fungi: no I think this is entirely driven by nodepool | 18:18 |
fungi | aah | 18:19 |
clarkb | It appears that restarting nl01's launcher caused it to check all open requests, which caused it to decline the request again, which caused it to be failed | 18:22 |
clarkb | That is good; it means it does what it can to reach a consistent state. Just need to figure out how we are leaking these launcher registrations in the first place | 18:22 |
clarkb | those restarts appear to have unstuck one of the neutron changes stuck in check as it is running the job it wanted now | 18:23 |
clarkb | 803582,1 being the last remaining stuck entry | 18:23 |
clarkb | I'm going to try and write some changes really quickly before I look at that one, to capture some of what I've learned though | 18:24 |
clarkb | it looks like August 18 was when we added the previous current inap launcher to zk | 18:35 |
clarkb | and our logs don't go back that far | 18:35 |
clarkb | But I've found a bit of code that I think may be the cause. I'll get a change up shortly | 18:36 |
clarkb | corvus: it looks like we don't create a new pool worker thread when the config updates? | 18:39 |
clarkb | Pretty sure we did in the past but now it appears we only create new provider worker threads when a new provider shows up or if the thread has died for some reason | 18:40 |
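Roughly the shape of the behaviour being described, paraphrased rather than copied from nodepool: workers are keyed by provider/pool name, so a config update to an existing pool reuses the running thread, and a replacement is only started when the provider is new or the old thread has died. Class and variable names below are placeholders.

```python
# Paraphrased sketch of the worker-management behaviour described above;
# names are placeholders, not nodepool's own.
import threading
import time

class PoolWorker(threading.Thread):
    def __init__(self, provider, pool):
        super().__init__(name='PoolWorker.%s-%s' % (provider, pool), daemon=True)
        self.running = True

    def run(self):
        while self.running:
            time.sleep(1)  # placeholder for the request-handling loop

pool_workers = {}

def update_workers(config_pools):
    """config_pools: iterable of (provider, pool) names from the config."""
    for provider, pool in config_pools:
        key = '%s-%s' % (provider, pool)
        worker = pool_workers.get(key)
        # A live worker is reused even if the pool's settings (e.g.
        # max-servers) changed; a new thread only appears for a new
        # provider/pool or after the old one has died.
        if worker is None or not worker.is_alive():
            worker = PoolWorker(provider, pool)
            worker.start()
            pool_workers[key] = worker

update_workers([('inap-mtl01', 'main'), ('rax-ord', 'main')])
```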
fungi | infra-root: proposed announcement for the planned listserv upgrade maintenance: https://etherpad.opendev.org/p/lists-maint-2021-09-12 | 18:50 |
fungi | i can send that shortly if nobody has any edits | 18:51 |
fungi | just dawned on me i meant to write that up tuesday evening and got sidetracked | 18:51 |
clarkb | fungi: ya I'll take a look as I think I convinced myself that my nodepool changes are likely to be correct. Trying to sort out the relationship between provider managers, pool workers, and request handler threads is something I seem to have to do every time I look at nodepool :/ | 18:55 |
clarkb | fungi: you caught the freenode thing. That was going to be my only comment. lgtm | 18:56 |
fungi | hah | 18:56 |
fungi | yeah, some unfortunate finger memory | 18:56 |
fungi | i had to step away and then come back and re-read it to spot that | 18:56 |
opendevreview | Clark Boylan proposed openstack/project-config master: Set empty nodepool resource lists on inap https://review.opendev.org/c/openstack/project-config/+/807204 | 19:01 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove the inap provider from nodepool https://review.opendev.org/c/openstack/project-config/+/807205 | 19:01 |
clarkb | infra-root ^ that first change should be mergeable now I think. The second should only merge after the first has run in production long enough for nodepool to clean up after itself | 19:01 |
opendevreview | Luciano Lo Giudice proposed openstack/project-config master: Add the cinder-lvm charm to Openstack charms https://review.opendev.org/c/openstack/project-config/+/807206 | 19:01 |
clarkb | mgagne: ^ fyi moving things along there | 19:01 |
mgagne | clarkb: thanks for the follow up. I think you all know how to deal with it better than me =) | 19:05 |
clarkb | corvus: the last change stuck in check is probably most interesting to you. 2021-08-26 08:15:40,530 INFO zuul.ExecutorClient: [e: b850d6c6a02b4d779cec37b56d5b610e] Execute job neutron-functional (uuid: 06426d3af17643a5a05e6b42928af8a0) on nodes <NodeSet devstack-single-node [<Node 0026112614 ('primary',):ubuntu-xenial>]> that node is currently unlocked and ready in nodepool | 19:06 |
clarkb | corvus: /var/log/zuul/debug.log.7.gz contains the zuul logs for that line | 19:06 |
clarkb | corvus: 2021-08-26 08:15:40,533 DEBUG zuul.ExecutorQueue: [e: b850d6c6a02b4d779cec37b56d5b610e] Submitting job request to ZooKeeper <BuildRequest 06426d3af17643a5a05e6b42928af8a0, state=requested, path=/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0 zone=None> that seems to be the last zuul did with it. Meaning we've lost the build request in zk? | 19:07 |
clarkb | corvus: I need to grab lunch now but I won't touch that change in the queue because I think that is potentially interesting to the build requests in zk work | 19:07 |
corvus | clarkb: i'm back from lunch, will take a look | 20:34 |
corvus | clarkb: thanks for spotting that. there was a zk connection loss on the executor. we have some good tracebacks (on ze09, log .7). i'll look into handling that. | 20:42 |
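For context, the awkward part of a connection loss during a submit is that the writer can't tell whether the znode actually landed. A minimal kazoo illustration of that ambiguity follows; it is not Zuul's code, and only the request path is copied from the log above.

```python
# Minimal illustration of the ambiguity a ConnectionLoss creates around a
# ZooKeeper write; not Zuul's code, the path is just the one from the log.
from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionLoss, NodeExistsError

zk = KazooClient(hosts='localhost:2181')
zk.start()

path = '/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0'
try:
    zk.create(path, b'{"state": "requested"}', makepath=True)
except ConnectionLoss:
    # The write may or may not have landed; a robust submitter has to
    # re-check after the session recovers and resubmit if needed.
    zk.retry(zk.ensure_path, path)
except NodeExistsError:
    pass

zk.stop()
```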
clarkb | corvus: cool, should we dequeue and enqueue or wait for a bit? | 20:43 |
corvus | i believe restarting ze09 (and i think we wanted to restart the whole system anyway) will correct that. dequeue/enqueue might work too. | 20:43 |
clarkb | ok I've got to pop out shortly to deal with some start of school year stuff but will be back later this afternoon and can do that if others don't beat me to it | 20:46 |
*** dviroel is now known as dviroel|out | 21:01 | |
clarkb | I'm going to dequeue and enqueue and see if that is sufficient since that is the last impactful change | 22:14 |
clarkb | that seemed to make it happy fwiw | 22:45 |
opendevreview | melanie witt proposed openstack/project-config master: Set launchpad bug Fix Released after adding comment https://review.opendev.org/c/openstack/project-config/+/801376 | 22:45 |
fungi | notifications about the lists upgrade maintenance for the 12th have been sent to the most active mailing lists for each of our 5 current mailman domains | 23:28 |