clarkb | the zuul key backup cron failed as expected and sent email saying that was the case | 00:11 |
---|---|---|
clarkb | corvus: ^ this has me thinking. The command doesn't write any secret data to stdout or stderr right? as we don't want that getting emailed if the cronjob fails | 00:12 |
clarkb | pretty sure it doesn't and we're good | 00:12 |
clarkb | it will output if a key couldn't be backed up for some reason iirc but not share any of the key data | 00:12 |
corvus | clarkb: correct | 00:16 |
corvus | by design | 00:16 |
opendevreview | Merged opendev/base-jobs master: buildset-registry: add flag to make job fail https://review.opendev.org/c/opendev/base-jobs/+/806818 | 01:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933 | 01:30 |
ianw | :/ that didn't appear to work | 02:06 |
opendevreview | Ian Wienand proposed opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933 | 02:16 |
ianw | works better when you spell things right | 02:39 |
*** ysandeep|out is now known as ysandeep | 06:16 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 06:46 |
*** jpena|off is now known as jpena | 07:39 | |
*** ykarel is now known as ykarel|lunch | 08:13 | |
*** ysandeep is now known as ysandeep|lunch | 08:16 | |
*** ysandeep|lunch is now known as ysandeep | 09:12 | |
*** ykarel|lunch is now known as ykarel | 09:56 | |
*** ysandeep is now known as ysandeep|afk | 11:15 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:18 |
*** dviroel|out is now known as dviroel|ruck | 11:19 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:22 |
*** jpena is now known as jpena|lunch | 11:38 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/807118 | 11:39 |
*** ysandeep|afk is now known as ysandeep | 12:13 | |
*** jpena|lunch is now known as jpena | 12:40 | |
*** frenzy_friday is now known as anbanerj|ruck | 12:46 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 12:47 |
*** dviroel|ruck is now known as dviroel | 12:48 | |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 13:05 |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 13:22 |
*** diablo_rojo_phone is now known as Guest6075 | 13:28 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:23 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:42 |
clarkb | infra-root it looks like the inap provider use has largely fallen off. There is one in-use server, perhaps a held node? I'll take a look shortly, but I suspect if that is the case we can delete it, then push a change to clean up the images in that cloud, then remove it entirely? | 14:52 |
clarkb | mgagne: ^ fyi | 14:52 |
fungi | clarkb: two nodes, one is held with "mnaser debug multi-arch containers" and the other is an almost-day-old centos-8-stream node which will likely get recycled in about 45 minutes | 14:54 |
clarkb | ah thanks, you have your ssh keys loaded :) | 14:54 |
clarkb | mnaser: ^ do you still need that hold? if not it will help us clean up a nodepool provider that got renamed | 14:55 |
mnaser | clarkb, fungi: mnaser debug multi-arch containers sounds like a really old hold that i don't need | 14:55 |
clarkb | thanks | 14:56 |
fungi | er, actually the centos-8-stream node is "in-use" for 00:23:13:02 so maybe leaked? or there's a stuck job... | 14:56 |
fungi | mnaser: thanks, i'll take out that autohold | 14:56 |
mnaser | clarkb, fungi: inap is now iweb? | 14:57 |
fungi | mnaser: yep | 14:57 |
mnaser | iweb used to be private, then became publicly traded, then private, then acquired by inap, and now i guess acquired by leaseweb (according to their site) to become an independent iweb again :P | 14:58 |
fungi | yeah, i lost track about halfway through the history there ;) | 14:58 |
mnaser | >In 2021 iWeb Technologies was acquired by Leaseweb. | 14:58 |
mnaser | ah yep | 14:58 |
opendevreview | Gage Hugo proposed opendev/irc-meetings master: Update openstack-helm irc meeting channel https://review.opendev.org/c/opendev/irc-meetings/+/805094 | 14:59 |
opendevreview | Artem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/807132 | 15:12 |
mgagne | mnaser: this is correct regarding ownership. | 15:15 |
clarkb | I have run the gerrit user audit across the 33 remaining conflicts and the yaml file is in the typical location | 15:29 |
clarkb | I don't expect I'll get to reviewing those today, but hopefully soon and will start sending out emails and writing changes in an All-Users checkout | 15:30 |
clarkb | fungi: corvus that last inap instance is the paused job for 806901,2 in openstack's check queue | 15:36 |
clarkb | it appears we have ~4 changes that have leaked in the check queue there based on age | 15:37 |
clarkb | I'm not in a good spot to look at that right this moment as I've got a meeting in a few minutes, but I wonder if we've got a stale launcher that needs restarting or a similar issue to before where we forget requests? | 15:37 |
clarkb | 4e03748821a74fe6b755bc2155838125 is the zuul event id for the 806901,2 enqueue | 15:38 |
fungi | i'll try to dig into stuck jobs in a few, but also in meetings | 15:38 |
clarkb | what if it is stuck that way because we set max-servers to 0? | 15:56 |
clarkb | the job that is paused started on inap so all the jobs behind it want to run in inap too but cannot | 15:56 |
clarkb | That doesn't explain the other leaked changes in the queue though, but could explain this one | 15:56 |
opendevreview | Ananya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/807176 | 16:04 |
fungi | it's also possible we haven't completely squashed the hung/stale node requests problem, i haven't dug into it yet | 16:16 |
*** jpena is now known as jpena|off | 16:47 | |
*** ysandeep is now known as ysandeep|out | 16:47 | |
*** sshnaidm is now known as sshnaidm|off | 17:01 | |
clarkb | 299-0015272873 is one of the node requests that was submitted after the job paused | 17:04 |
clarkb | /etc/nodepool/nodepool.yaml on nl03 updated at 15:50 UTC yesterday. The above node request was submitted at 16:49 yesterday | 17:05 |
clarkb | I think it is very likely that we cannot provision those nodes because max-servers is 0 on the only provider that is capable of running those jobs | 17:05 |
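A quick way to sanity-check that reasoning is to look up which pools can supply a given label and what their max-servers is. This is only a minimal diagnostic sketch assuming the usual nodepool.yaml layout (providers → pools → labels); the label name below is a stand-in, not the one from the stuck request.

```python
# Illustrative sketch: list the pools that can supply a label and their
# max-servers, assuming the standard nodepool.yaml layout.
import yaml

label = 'ubuntu-xenial'  # hypothetical; substitute the stuck request's label

with open('/etc/nodepool/nodepool.yaml') as f:
    config = yaml.safe_load(f)

for provider in config.get('providers', []):
    for pool in provider.get('pools', []):
        if any(l.get('name') == label for l in pool.get('labels', [])):
            print(provider['name'], pool.get('name'),
                  'max-servers=%s' % pool.get('max-servers'))
```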
opendevreview | Ananya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/807176 | 17:06 |
clarkb | that doesn't explain the other three though | 17:06 |
clarkb | corvus: ^ is there a better way to remove a provider in nodepool? maybe we just accept this risk and stop, then reenqueue the change in these cases? | 17:06 |
clarkb | infra-root ^ I'll do that for 806901,2 shortly if there is no objection. The other three probably deserve looking at as I doubt they have this particular issue | 17:07 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:08 |
fungi | oh, was it trying to provide nodes for the child job from the same provider? yeah i guess that's intentional? | 17:08 |
clarkb | yes I believe zuul enforces that behavior | 17:09 |
corvus | yeah, i think that is an unexpected corner case (those 2 features were designed at different times) | 17:09 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:10 |
clarkb | for the airship change that is stuck it has a node request (300-0015270257) for https://opendev.org/airship/airshipctl/src/branch/master/zuul.d/nodesets.yaml#L39 that doesn't seem to be getting fulfilled | 17:10 |
corvus | i don't have a good solution to that. i think we just have to muddle through. dequeue/enqueue sounds good. | 17:11 |
clarkb | that label for the airship job can currently only be provided by the airship cloud provider | 17:11 |
clarkb | corvus: ok I'll dequeue enqueue for that one. | 17:11 |
clarkb | then I'll keep looking at this airship node request I guess | 17:12 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:14 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:14 |
clarkb | for the airship one it failed to boot three times in the airship cloud. Every other cloud should decline the request which should result in a node failure | 17:15 |
clarkb | not sure why that hasn't happened yet | 17:15 |
clarkb | the new iweb provider did decline the request when it came online according to the logs on nl03 | 17:18 |
clarkb | if launchers.issubset(set(self.request.declined_by)): <- is the condition we fail on | 17:21 |
clarkb | I'll have to check the zk db it looks like | 17:21 |
corvus | clarkb: if you want to see the declined list, you can run nodepool list with --detail | 17:22 |
corvus | request-list | 17:22 |
clarkb | corvus: thanks, but I need to see what the set of registered launchers is and I think that is only available from the db? | 17:22 |
clarkb | but then I can compare that against the decline by list | 17:22 |
corvus | yeah i think so | 17:23 |
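Doing that comparison by hand against the ZooKeeper data looks roughly like the sketch below. It is a minimal, illustrative snippet: the znode paths, JSON field names, and zk host are assumptions based on this discussion, not a supported nodepool interface.

```python
# Minimal sketch: compare registered launchers against a node request's
# declined_by list. Znode paths, field names and host are assumptions.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk01.opendev.org:2181')  # hypothetical host
zk.start()

launchers = set()
for child in zk.get_children('/nodepool/launchers'):
    data, _ = zk.get('/nodepool/launchers/' + child)
    launchers.add(json.loads(data)['id'])

data, _ = zk.get('/nodepool/requests/300-0015270257')
declined_by = set(json.loads(data).get('declined_by', []))

# Mirrors the launchers.issubset(...) condition quoted above: the request
# only fails once every registered launcher has declined it.
print('would fail:', launchers.issubset(declined_by))
print('still waiting on:', sorted(launchers - declined_by))

zk.stop()
```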
clarkb | zk-shell appears to pull in twitter packages now? | 17:24 |
corvus | always has | 17:24 |
clarkb | huh | 17:24 |
clarkb | corvus: I think we have two extra launchers registered: nl01.opendev.org-PoolWorker.rax-ord-main-06e2062bc61f4957bb40e1de543722ca and nl03.opendev.org-PoolWorker.iweb-mtl01-main-1a2f7610ca4941e4a6af286f5758da8b | 17:31 |
clarkb | however, those registrations are ephemeral so now I'm extra confused about how they exist | 17:32 |
clarkb | Maybe the inap one is related to the held node or the paused node and will get cleaned up automatically post dequeue enqueue? | 17:33 |
clarkb | maybe I can do a thread dump and match up the launcher id to a thread? | 17:34 |
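I believe the services can log a dump on a signal, but the underlying mechanism is just sys._current_frames(); a standalone sketch of that (illustrative only) looks like this:

```python
# Standalone sketch of a Python thread dump: enumerate live threads and
# print a stack for each, so thread names can be matched to launcher ids.
import sys
import threading
import traceback

def dump_threads():
    frames = sys._current_frames()
    for thread in threading.enumerate():
        print('Thread: %s (ident=%s, daemon=%s)'
              % (thread.name, thread.ident, thread.daemon))
        frame = frames.get(thread.ident)
        if frame is not None:
            print(''.join(traceback.format_stack(frame)))

dump_threads()
```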
clarkb | there is a single pool worker for inap main and yet we had two registered launchers for it | 17:38 |
clarkb | I didn't think we shared the zk client between launchers but that might explain it if the current launcher has kept track of it | 17:39 |
clarkb | oh sorry the iweb one is valid I think | 17:43 |
clarkb | nl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce is the invalid one | 17:43 |
clarkb | nl03.opendev.org-PoolWorker.inap-mtl01-main-b88877414cc0474caa11e7c383b17a07 shows up in the logs as declining requests but nl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce hasn't logged anything in the current logfile. It definitely seems like we haven't deregistered properly | 17:45 |
clarkb | looking back a few days I don't see 2d4c222d1678460c90f88c158e41b2ce in the logs. That implies it wasn't the shift from 120 to 0 max-servers that leaked it? | 17:46 |
corvus | we probably do share the zk client, so it's up to the launcher code to de-register the pool provider properly | 17:48 |
clarkb | based on the mtime on the znode it was last modified on July 30 ish UTC | 17:49 |
clarkb | a bit more confident that this isn't related to the provider update given that; it's just a coincidence that one of the two leaked launchers is also the one we are renaming | 17:50 |
corvus | do we expect frequent updates of that znode? | 17:50 |
corvus | i guess the question is really: did something in nodepool's config change to cause that pool-provider to reconfigure in the past month | 17:51 |
clarkb | looks like no, many of the other launchers have ctime == mtime | 17:52 |
corvus | if no, then this event may still be the first to trigger that code path | 17:52 |
clarkb | yes we updated its max-servers value yesterday | 17:52 |
clarkb | note there is a leaked rax-ord launcher as well and we haven't updated that recently as far as I know | 17:52 |
clarkb | looking at the code we seem to only call deregisterLauncher() if stop() on the thread is called. Looking at the thread dump it seems there is only one PoolWorker thread for inap currently which would have to be for the non leaked one that is logging declined requests. Is it possible the thread crashed hard for some reason and stop() was never called? | 17:53 |
clarkb | the run loop does catch all exceptions though | 17:54 |
clarkb | oh wait no it doesn't if the exception can happen early in the run loop | 17:54 |
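That would fit with how ephemeral znodes behave: they belong to the ZooKeeper session, not to the thread that created them, so if a PoolWorker dies without its stop()/deregister path running, the registration lingers as long as the shared client keeps its session. A toy illustration follows; the paths are made up and this is not nodepool's code.

```python
# Toy illustration: an ephemeral znode outlives the thread that created it
# as long as the (shared) session stays connected; only an explicit delete
# or a session loss removes it. Paths here are made up for the example.
from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# A worker registers itself.
zk.create('/example/launchers/worker-a', b'{}', ephemeral=True, makepath=True)

# If that worker's thread now crashes without calling its deregister method,
# nothing happens to the znode -- the session owning it is still alive.

# Cleanup requires an explicit delete (or losing the whole session):
zk.delete('/example/launchers/worker-a')
zk.stop()
```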
clarkb | grepping for exceptions in the last two days I don't see any that seem related to zookeeper breaking when trying to deregister or a thread being killed due to an uncaught exception. However I'm filtering stuff that I guess could be related? | 18:01 |
clarkb | corvus: fungi: any objection to me restarting nl03 and nl01's launchers to clear the ephemeral nodes? I'm not sure we'll get much more debugging out of the launcher that isn't in the logs already? | 18:02 |
clarkb | no ZooKeeper suspended messages either | 18:03 |
corvus | clarkb: wfm | 18:06 |
fungi | sounds fine to me | 18:07 |
clarkb | alright nl03 is done. will do nl01 when nl03 shows it is happy restarting | 18:08 |
fungi | thanks | 18:08 |
clarkb | corvus: the other thing I notice is that in nodepool.launcher.NodePool.run() we seem to check for stale pools based on provider name + pool name, which are the same for these leaked pools as for the good pool | 18:09 |
clarkb | corvus: however, the thread dump says the thread isn't there at all so that shouldn't be an issue | 18:09 |
*** Guest6075 is now known as diablo_rojo | 18:10 | |
*** diablo_rojo is now known as Guest6089 | 18:11 | |
*** Guest6089 is now known as diablo_rojo_phone | 18:12 | |
clarkb | I think we only check if we need to fail a node request when we decline a node request | 18:17 |
clarkb | oh maybe not seems the change finally got evicted | 18:17 |
fungi | the scheduler eventually times out the node request and issues a new one, right? | 18:18 |
clarkb | https://review.opendev.org/c/airship/airshipctl/+/772890/ <- has a node failure reported to it | 18:18 |
clarkb | fungi: no I think this is entirely driven by nodepool | 18:18 |
fungi | aah | 18:19 |
clarkb | It appears that restarting nl01's launcher caused it to check all open requests, which caused it to decline the request again, which caused it to be failed | 18:22 |
clarkb | That is good; it means it does what it can to reach a consistent state. Just need to figure out how we are leaking these launcher registrations in the first place | 18:22 |
clarkb | those restarts appear to have unstuck one of the neutron changes stuck in check as it is running the job it wanted now | 18:23 |
clarkb | 803582,1 being the last remaining stuck entry | 18:23 |
clarkb | I'm going to try and write some changes really quickly before I look at that one, to capture some of what I've learned though | 18:24 |
clarkb | it looks like August 18 was when we added the previous current inap launcher to zk | 18:35 |
clarkb | and our logs don't go back that far | 18:35 |
clarkb | But I've found a bit of code that I think may be the cause. I'll get a change up shortly | 18:36 |
clarkb | corvus: it looks like we don't create a new pool worker thread when the config updates? | 18:39 |
clarkb | Pretty sure we did in the past but now it appears we only create new provider worker threads when a new provider shows up or if the thread has died for some reason | 18:40 |
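Roughly the shape of the behaviour being described, paraphrased rather than copied from nodepool: workers are keyed by provider/pool name, so a config update to an existing pool reuses the running thread, and a replacement is only started when the provider is new or the old thread has died. Class and variable names below are placeholders.

```python
# Paraphrased sketch of the worker-management behaviour described above;
# names are placeholders, not nodepool's own.
import threading
import time

class PoolWorker(threading.Thread):
    def __init__(self, provider, pool):
        super().__init__(name='PoolWorker.%s-%s' % (provider, pool), daemon=True)
        self.running = True

    def run(self):
        while self.running:
            time.sleep(1)  # placeholder for the request-handling loop

pool_workers = {}

def update_workers(config_pools):
    """config_pools: iterable of (provider, pool) names from the config."""
    for provider, pool in config_pools:
        key = '%s-%s' % (provider, pool)
        worker = pool_workers.get(key)
        # A live worker is reused even if the pool's settings (e.g.
        # max-servers) changed; a new thread only appears for a new
        # provider/pool or after the old one has died.
        if worker is None or not worker.is_alive():
            worker = PoolWorker(provider, pool)
            worker.start()
            pool_workers[key] = worker

update_workers([('inap-mtl01', 'main'), ('rax-ord', 'main')])
```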
fungi | infra-root: proposed announcement for the planned listserv upgrade maintenance: https://etherpad.opendev.org/p/lists-maint-2021-09-12 | 18:50 |
fungi | i can send that shortly if nobody has any edits | 18:51 |
fungi | just dawned on me i meant to write that up tuesday evening and got sidetracked | 18:51 |
clarkb | fungi: ya I'll take a look as I think I convinced myself that my nodepool changes are likely to be correct. Trying to sort out the relationship between provider managers, pool workers, and request handler threads is something I seem to have to do every time I look at nodepool :/ | 18:55 |
clarkb | fungi: you caught the freenode thing. That was going to be my only comment. lgtm | 18:56 |
fungi | hah | 18:56 |
fungi | yeah, some unfortunate finger memory | 18:56 |
fungi | i had to step away and then come back and re-read it to spot that | 18:56 |
opendevreview | Clark Boylan proposed openstack/project-config master: Set empty nodepool resource lists on inap https://review.opendev.org/c/openstack/project-config/+/807204 | 19:01 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove the inap provider from nodepool https://review.opendev.org/c/openstack/project-config/+/807205 | 19:01 |
clarkb | infra-root ^ that first change should be mergeable now I think. The second should only merge after the first has run in production long enough for nodepool to clean up after itself | 19:01 |
opendevreview | Luciano Lo Giudice proposed openstack/project-config master: Add the cinder-lvm charm to Openstack charms https://review.opendev.org/c/openstack/project-config/+/807206 | 19:01 |
clarkb | mgagne: ^ fyi moving things along there | 19:01 |
mgagne | clarkb: thanks for the follow up. I think you all know how to deal with it better than me =) | 19:05 |
clarkb | corvus: the last change stuck in check is probably most interesting to you. 2021-08-26 08:15:40,530 INFO zuul.ExecutorClient: [e: b850d6c6a02b4d779cec37b56d5b610e] Execute job neutron-functional (uuid: 06426d3af17643a5a05e6b42928af8a0) on nodes <NodeSet devstack-single-node [<Node 0026112614 ('primary',):ubuntu-xenial>]> that node is currently unlocked and ready in nodepool | 19:06 |
clarkb | corvus: /var/log/zuul/debug.log.7.gz contains the zuul logs for that line | 19:06 |
clarkb | corvus: 2021-08-26 08:15:40,533 DEBUG zuul.ExecutorQueue: [e: b850d6c6a02b4d779cec37b56d5b610e] Submitting job request to ZooKeeper <BuildRequest 06426d3af17643a5a05e6b42928af8a0, state=requested, path=/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0 zone=None> that seems to be the last zuul did with it. Meaning we've lost the build request in zk? | 19:07 |
clarkb | corvus: I need to grab lunch now but I won't touch that change in the queue because I think that is potentially interesting to the build requests in zk work | 19:07 |
corvus | clarkb: i'm back from lunch, will take a look | 20:34 |
corvus | clarkb: thanks for spotting that. there was a zk connection loss on the executor. we have some good tracebacks (on ze09, log .7). i'll look into handling that. | 20:42 |
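For context, the awkward part of a connection loss during a submit is that the writer can't tell whether the znode actually landed. A minimal kazoo illustration of that ambiguity follows; it is not Zuul's code, and only the request path is copied from the log above.

```python
# Minimal illustration of the ambiguity a ConnectionLoss creates around a
# ZooKeeper write; not Zuul's code, the path is just the one from the log.
from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionLoss, NodeExistsError

zk = KazooClient(hosts='localhost:2181')
zk.start()

path = '/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0'
try:
    zk.create(path, b'{"state": "requested"}', makepath=True)
except ConnectionLoss:
    # The write may or may not have landed; a robust submitter has to
    # re-check after the session recovers and resubmit if needed.
    zk.retry(zk.ensure_path, path)
except NodeExistsError:
    pass

zk.stop()
```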
clarkb | corvus: cool, should we dequeue and enqueue or wait for a bit? | 20:43 |
corvus | i believe restarting ze09 (and i think we wanted to restart the whole system anyway) will correct that. dequeue/enqueue might work too. | 20:43 |
clarkb | ok I've got to pop out shortly to deal with some start of school year stuff but will be back later this afternoon and can do that if others don't beat me to it | 20:46 |
*** dviroel is now known as dviroel|out | 21:01 | |
clarkb | I'm going to dequeue and enqueue and see if that is sufficient since that is the last impactful change | 22:14 |
clarkb | that seemed to make it happy fwiw | 22:45 |
opendevreview | melanie witt proposed openstack/project-config master: Set launchpad bug Fix Released after adding comment https://review.opendev.org/c/openstack/project-config/+/801376 | 22:45 |
fungi | notifications about the lists upgrade maintenance for the 12th have been sent to the most active mailing lists for each of our 5 current mailman domains | 23:28 |