*** mattw4 has quit IRC | 00:00 | |
*** tosky has quit IRC | 00:21 | |
zenkuro | corvus: zuul indeed shows “Loading...” if there were no builds | 00:27 |
*** jamesmcarthur has joined #zuul | 00:35 | |
*** mattw4 has joined #zuul | 00:46 | |
*** jamesmcarthur has quit IRC | 00:51 | |
*** jamesmcarthur has joined #zuul | 00:52 | |
*** jamesmcarthur has quit IRC | 00:57 | |
*** felixedel has joined #zuul | 01:09 | |
*** felixedel has quit IRC | 01:10 | |
*** jamesmcarthur has joined #zuul | 01:22 | |
*** jamesmcarthur has quit IRC | 01:23 | |
*** jamesmcarthur has joined #zuul | 01:23 | |
*** sgw has quit IRC | 01:32 | |
*** jamesmcarthur has quit IRC | 01:55 | |
*** jamesmcarthur has joined #zuul | 02:00 | |
*** jamesmcarthur has quit IRC | 02:31 | |
*** jamesmcarthur has joined #zuul | 02:43 | |
*** jamesmcarthur has quit IRC | 02:49 | |
*** num8lock has joined #zuul | 02:50 | |
*** num8lock has quit IRC | 02:51 | |
*** num8lock has joined #zuul | 02:51 | |
*** num8lock has left #zuul | 02:52 | |
*** jamesmcarthur has joined #zuul | 02:53 | |
*** rlandy|bbl is now known as rlandy | 02:56 | |
zenkuro | corvus: the real source of the errors was mysql. in particular, the database name created in mysql has to match exactly the one listed in the zuul config, which is kinda self-evident >_< | 03:06 |
*** jamesmcarthur has quit IRC | 03:07 | |
*** jamesmcarthur has joined #zuul | 03:31 | |
*** rlandy has quit IRC | 03:44 | |
*** jamesmcarthur has quit IRC | 03:58 | |
*** bolg has joined #zuul | 05:09 | |
*** mattw4 has quit IRC | 05:24 | |
*** evrardjp has quit IRC | 05:34 | |
*** evrardjp has joined #zuul | 05:35 | |
*** sgw has joined #zuul | 05:35 | |
*** swest has joined #zuul | 05:38 | |
*** reiterative has quit IRC | 05:40 | |
*** reiterative has joined #zuul | 05:41 | |
*** sgw has quit IRC | 05:50 | |
*** saneax has joined #zuul | 05:56 | |
*** felixedel has joined #zuul | 06:32 | |
*** Defolos has joined #zuul | 06:46 | |
*** dangtrinhnt has joined #zuul | 07:06 | |
*** felixedel has quit IRC | 07:19 | |
*** felixedel has joined #zuul | 07:28 | |
*** dangtrinhnt has quit IRC | 07:56 | |
*** dangtrinhnt has joined #zuul | 07:57 | |
*** jcapitao_off has joined #zuul | 07:58 | |
*** jcapitao_off is now known as jcapitao | 08:00 | |
*** dangtrinhnt has quit IRC | 08:01 | |
*** raukadah is now known as chandankumar | 08:13 | |
*** tosky has joined #zuul | 08:17 | |
*** dangtrinhnt has joined #zuul | 08:20 | |
*** bolg has quit IRC | 08:33 | |
*** bolg has joined #zuul | 08:34 | |
*** jpena|off is now known as jpena | 08:51 | |
*** jfoufas178 has joined #zuul | 08:54 | |
*** decimuscorvinus has quit IRC | 08:58 | |
*** decimuscorvinus_ has joined #zuul | 08:58 | |
*** avass has joined #zuul | 09:21 | |
*** jfoufas178 has quit IRC | 09:24 | |
*** dangtrinhnt has quit IRC | 09:43 | |
*** dangtrinhnt has joined #zuul | 09:43 | |
*** dangtrinhnt has quit IRC | 09:51 | |
*** dmellado has quit IRC | 09:59 | |
*** sshnaidm|afk has joined #zuul | 10:17 | |
*** carli has joined #zuul | 10:18 | |
*** sshnaidm|afk is now known as sshnaidm | 10:29 | |
*** carli is now known as carli|afk | 10:51 | |
*** felixedel has quit IRC | 10:52 | |
*** felixedel has joined #zuul | 10:56 | |
*** jcapitao is now known as jcapitao_lunch | 11:14 | |
*** carli|afk is now known as carli | 12:02 | |
*** jpena is now known as jpena|lunch | 12:35 | |
*** jamesmcarthur has joined #zuul | 12:36 | |
*** Goneri has joined #zuul | 12:50 | |
*** rlandy has joined #zuul | 12:56 | |
*** jamesmcarthur has quit IRC | 13:00 | |
*** jamesmcarthur has joined #zuul | 13:00 | |
*** dmellado has joined #zuul | 13:05 | |
*** sshnaidm_ has joined #zuul | 13:05 | |
*** jamesmcarthur has quit IRC | 13:06 | |
fbo | hi, just a question regarding the Zuul artifacts require/provide feature. Zuul does not expose artifacts of depends-on changes if the dependencies have been merged. That looks ok to me, but in some cases it would be great if zuul still exposed the artifacts in zuul.artifacts. | 13:07 |
fbo | for instance, in the packaging context for Fedora, a merged change on a distgit does not mean the package is built and available. | 13:08 |
*** sshnaidm has quit IRC | 13:08 | |
*** jamesmcarthur has joined #zuul | 13:10 | |
*** jcapitao_lunch is now known as jcapitao | 13:17 | |
fbo | so when the dependent changes are not yet merged, the dependent artifacts (here rpms) are exposed to the current project's rpm-(build|test) jobs. When the dependent changes are merged, those artifacts (rpms) are no longer exposed to the current project's rpm-(build|test) jobs, and in the meantime they are not accessible because they are not yet built and published on the final rpm repository. | 13:18 |
fbo | The question is: do you think it could make sense to have an option to tell zuul to still add artifacts to zuul.artifacts even if the dependencies are merged? | 13:19 |
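For context, a minimal sketch of the provides/requires wiring fbo is describing; the job names, playbook paths and artifact name are hypothetical:

```yaml
# Hypothetical build job that publishes rpm artifacts for later jobs/changes.
- job:
    name: rpm-build
    provides: rpm-packages
    run: playbooks/rpm-build.yaml

# Hypothetical consumer job; while a depends-on change is still unmerged,
# the artifacts it provided appear in the zuul.artifacts job variable here.
- job:
    name: rpm-test
    requires: rpm-packages
    run: playbooks/rpm-test.yaml
```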
*** plaurin has joined #zuul | 13:25 | |
plaurin | Hello irc people | 13:25 |
plaurin | So glad to see this patch! https://review.opendev.org/#/c/709261/ Really anxious to use this :D | 13:26 |
plaurin | As soon as it's merged I will be using this in production | 13:26 |
*** rfolco has joined #zuul | 13:28 | |
*** rfolco has quit IRC | 13:29 | |
*** jamesmcarthur has quit IRC | 13:32 | |
*** jamesmcarthur has joined #zuul | 13:32 | |
*** jpena|lunch is now known as jpena | 13:34 | |
*** jamesmcarthur has quit IRC | 13:47 | |
*** jamesmcarthur_ has joined #zuul | 13:47 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Optimize canMerge using graphql https://review.opendev.org/709836 | 14:28 |
*** sgw has joined #zuul | 14:32 | |
*** jamesmcarthur_ has quit IRC | 14:35 | |
*** sgw has quit IRC | 14:39 | |
*** jamesmcarthur has joined #zuul | 14:40 | |
corvus | plaurin, mordred: i think based on conversations with tristanC i want to revise that a bit -- i think we should add socat to our executor images first to ensure that will work for folks running zuul in k8s. | 14:41 |
corvus | i've set it to WIP | 14:41 |
mordred | corvus: ah - socat | 14:44 |
*** jamesmcarthur has quit IRC | 14:48 | |
tristanC | corvus: btw i've added port-forward to k1s and i was about to test it with the zuul change | 14:49 |
corvus | i'm writing a job now where i wish i could attach the secret just to the pre-playbook of the job, not the main playbook. i think i have to add an extra layer of inheritance just to do that. but maybe we should consider a config syntax change to allow that. | 14:50 |
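A sketch of the extra-inheritance-layer workaround corvus describes, relying on the fact that secrets are only available to playbooks defined in the same job variant as the secret; the job, secret and playbook names are hypothetical:

```yaml
# Parent job: defines the pre-run playbook and attaches the secret, so
# only this pre-run playbook can see it.
- job:
    name: base-with-credentials
    pre-run: playbooks/fetch-credentials.yaml
    secrets:
      - name: site_credentials
        secret: site-credentials

# Child job: its main playbook runs after the parent's pre-run but does
# not have access to site_credentials.
- job:
    name: deploy-site
    parent: base-with-credentials
    run: playbooks/deploy-site.yaml
```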
corvus | tristanC: great! | 14:50 |
tristanC | turns out port-forward uses the same protocol as exec, so that was a simple addition | 14:50 |
*** felixedel has quit IRC | 14:56 | |
fungi | what is k1s? i'm getting lost in all the abbrevs | 14:57 |
mnaser | i've heard about k3s but never heard of k1s :p | 15:00 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Stream output from kubectl pods https://review.opendev.org/709261 | 15:02 |
*** bolg has quit IRC | 15:03 | |
corvus | tristanC, mordred, plaurin, Shrews: ^ i added socat to the executor image and also a release note. does that seem good? or should i update the patch so that it doesn't fail if kubectl isn't present? | 15:03 |
Shrews | corvus: i was thinking this morning that maybe we need a release note that mentions a need for a specific version of nodepool to support that | 15:05 |
Shrews | which i guess would be 3.12.0 ? | 15:05 |
corvus | Shrews: yeah, i reckon so | 15:06 |
corvus | Shrews: i'll amend for that | 15:06 |
corvus | i think at this point, i'll be more comfortable if we don't fail the job; i think i'll add that | 15:07 |
*** mattw4 has joined #zuul | 15:10 | |
mordred | corvus: ++ | 15:10 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: executor: blacklist dangerous ansible host vars https://review.opendev.org/710287 | 15:15 |
*** mattw4 has quit IRC | 15:17 | |
*** goneri_ has joined #zuul | 15:19 | |
*** goneri_ has quit IRC | 15:21 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Stream output from kubectl pods https://review.opendev.org/709261 | 15:23 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Add destructor for SshAgent https://review.opendev.org/709609 | 15:23 |
corvus | okay that should take care of it | 15:23 |
*** avass has quit IRC | 15:24 | |
*** sgw has joined #zuul | 15:28 | |
*** jamesmcarthur has joined #zuul | 15:38 | |
mordred | corvus: the self._log in the stream callback is displayed to the user? or the operator? | 15:41 |
*** chandankumar is now known as raukadah | 15:41 | |
corvus | mordred: user | 15:41 |
*** sgw has quit IRC | 15:42 | |
mordred | corvus: k. then the only nit I might suggest is to say something like "on the executor" or "on the executor, contact your admin" - so it's clear it's not just something they need to add to their job config | 15:47 |
mordred | otherwise looks great | 15:47 |
corvus | mordred: good point | 15:47 |
corvus | mordred: "[Zuul] Kubectl and socat must be installed on the Zuul executor for streaming output from pods" how's that | 15:49 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Stream output from kubectl pods https://review.opendev.org/709261 | 15:51 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Add destructor for SshAgent https://review.opendev.org/709609 | 15:51 |
*** rishabhhpe has joined #zuul | 15:54 | |
mordred | corvus: yeah | 15:54 |
*** sshnaidm_ is now known as sshnaidm | 15:58 | |
Shrews | corvus: lgtm. +3'd | 15:58 |
*** sgw has joined #zuul | 16:00 | |
*** mattw4 has joined #zuul | 16:17 | |
*** sgw has quit IRC | 16:24 | |
*** sgw has joined #zuul | 16:25 | |
*** carli has quit IRC | 16:32 | |
*** NBorg has joined #zuul | 16:37 | |
*** plaurin has quit IRC | 16:42 | |
*** armstrongs has joined #zuul | 16:45 | |
*** jpena is now known as jpena|brb | 16:45 | |
*** jamesmcarthur has quit IRC | 16:45 | |
*** jamesmcarthur has joined #zuul | 16:46 | |
*** electrofelix has joined #zuul | 16:46 | |
*** armstrongs has quit IRC | 16:51 | |
*** erbarr has joined #zuul | 17:14 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: executor: blacklist dangerous ansible host vars https://review.opendev.org/710287 | 17:16 |
*** jpena|brb is now known as jpena | 17:23 | |
*** evrardjp has quit IRC | 17:35 | |
*** evrardjp has joined #zuul | 17:35 | |
NBorg | How does zuul handle submodules? We have a dependency on a repo on a gitlab server. We don't need (or want) to trigger anything on updates in that repo. Workarounds I can think of: keep a clone in gerrit and update it semi-regularly, use a submodule, or add a role for cloning that specific repo. Any suggestions? | 17:36 |
*** felixedel has joined #zuul | 17:36 | |
*** felixedel has quit IRC | 17:36 | |
corvus | NBorg: we've been discussing submodules a bit lately, and what zuul could do with them to be useful. right now, it ignores them completely, so the job needs to take care of it itself. | 17:37 |
corvus | NBorg: gerrit supports submodule subscriptions which automatically update submodule pointers if 2 repos are on the same system, but that won't help if you're using gitlab | 17:38 |
corvus | NBorg: (but maybe if you kept a copy in gerrit, you could use the submodule subscriptions to automatically update the super-repo) | 17:38 |
*** igordc has joined #zuul | 17:39 | |
corvus | NBorg: aside from that, i think the options are to do the cloning/submodule init in a job. let me get a pointer to some stuff i did for the gerrit project recently | 17:39 |
corvus | NBorg: take a look at this role, it may have some things that could be useful: https://gerrit.googlesource.com/zuul/jobs/+/refs/heads/master/roles/prepare-gerrit-repos/ | 17:40 |
corvus | NBorg: personally, i would say if you have a choice to avoid git submodules, i would | 17:41 |
corvus | NBorg: they're often confusing for everybody. so if i were in your place, i would try to avoid that and either just clone the repo in the job, or get the repo into zuul somehow (gitlab driver, git driver, or copy into gerrit) and add it as a required-project if using depends-on was important. | 17:43 |
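A rough sketch of the git-driver option: the connection itself would be defined in zuul.conf with the git driver, and the connection, project and job names here are hypothetical:

```yaml
# Tenant config: include the external repo via a hypothetical git-driver
# connection named "external-git" so zuul mirrors it read-only.
- tenant:
    name: example
    source:
      external-git:
        untrusted-projects:
          - dependency-repo

# Job config: list it as a required project so it is checked out into the
# job workspace alongside the change under test.
- job:
    name: build-with-dependency
    required-projects:
      - dependency-repo
```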
SpamapS | Let me just +1 corvus's point. submodules are a misfeature in git, and should be avoided. | 17:43 |
SpamapS | There are 100 other things that work better. | 17:43 |
fungi | in almost all cases i treat a git submodule as "just another repo" which i happen to checkout inside some repo's worktree, and steer clear of the fancy submodule commands entirely | 17:45 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Fix GCE volume parameters https://review.opendev.org/710324 | 17:49 |
NBorg | I didn't consider the git driver. That seems like what I want. Thank you! | 17:49 |
corvus | NBorg: i think it'll update every 2 hours, which sounds like it may work for your case | 17:50 |
NBorg | You've even made it configurable with poll_delay. :) | 17:51 |
corvus | oh neat | 17:52 |
corvus | NBorg: if you get bored and wanted to try out the work-in-progress gitlab driver, that'd be cool too :) | 17:52 |
corvus | (i think it's in-tree but undocumented since it's incomplete; but it's probably good enough for cloning a repo) | 17:54 |
corvus | Shrews, mordred: https://review.opendev.org/710324 i just noticed that we weren't actually getting the 40GB volumes i requested on gerrit's zuul; i tested this against gce and it seems to improve the situation | 17:58 |
*** igordc has quit IRC | 18:01 | |
*** jcapitao is now known as jcapitao_off | 18:02 | |
NBorg | Hehe, being bored is the least of my problems. I'll contribute some code when I have the time to clean it up. Any day now (maybe next year). | 18:02 |
*** igordc has joined #zuul | 18:04 | |
*** mattw4 has quit IRC | 18:05 | |
*** mattw4 has joined #zuul | 18:05 | |
rishabhhpe | Hi All, after triggering a zuul job on a nodepool-spawned instance on which i installed devstack, i am not able to run any openstack command. it says "Failed to discover auth URL". | 18:14 |
corvus | rishabhhpe: you might want to ask in #openstack-infra, #openstack-qa | 18:14 |
rishabhhpe | corvus: ok | 18:18 |
SpamapS | feedback on UI: in the build screen, Console tab, Zuul shows tasks that had "ignore_errors: true" as "FAILED" .. would be good to show "IGNORED" instead. | 18:31 |
*** mattw4 has quit IRC | 18:31 | |
SpamapS | (from a member of my team) | 18:32 |
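For reference, the kind of task being described: it reports a failure to Ansible but the play keeps going, so the console could reasonably label it IGNORED rather than FAILED. The task name and command are made up:

```yaml
# A task that is allowed to fail; zuul's console currently shows it as FAILED.
- name: Best-effort cleanup of a previous run
  command: rm -rf /tmp/build-cache
  ignore_errors: true
```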
*** mattw4 has joined #zuul | 18:32 | |
*** jamesmcarthur has quit IRC | 18:33 | |
*** jamesmcarthur has joined #zuul | 18:33 | |
corvus | SpamapS: yeah, i've been noticing that too. i bet we have the info to do that. | 18:33 |
*** jpena is now known as jpena|off | 18:33 | |
*** mattw4 has quit IRC | 18:38 | |
*** mattw4 has joined #zuul | 18:39 | |
*** tjgresha has quit IRC | 18:46 | |
*** tjgresha has joined #zuul | 18:46 | |
mordred | corvus: gce patch looks sensible | 19:01 |
*** rlandy is now known as rlandy|brb | 19:07 | |
*** jamesmcarthur has quit IRC | 19:12 | |
-openstackstatus- NOTICE: Memory pressure on zuul.opendev.org is causing connection timeouts resulting in POST_FAILURE and RETRY_LIMIT results for some jobs since around 06:00 UTC today; we will be restarting the scheduler shortly to relieve the problem, and will follow up with another notice once running changes are reenqueued. | 19:13 | |
*** jcapitao_off has quit IRC | 19:15 | |
*** rishabhhpe has quit IRC | 19:16 | |
*** rlandy|brb is now known as rlandy | 19:19 | |
*** electrofelix has quit IRC | 19:20 | |
*** jamesmcarthur has joined #zuul | 19:40 | |
-openstackstatus- NOTICE: The scheduler for zuul.opendev.org has been restarted; any changes which were in queues at the time of the restart have been reenqueued automatically, but any changes whose jobs failed with a RETRY_LIMIT, POST_FAILURE or NODE_FAILURE build result in the past 14 hours should be manually rechecked for fresh results | 19:46 | |
openstackgerrit | Merged zuul/nodepool master: Fix GCE volume parameters https://review.opendev.org/710324 | 20:02 |
*** felixedel has joined #zuul | 20:13 | |
SpamapS | is OpenDev running Zuul in GCE now? | 20:51 |
SpamapS | or rather, some zuul | 20:52 |
corvus | SpamapS: no, gerrit is! :) | 20:52 |
fungi | nope, the gerrit zuul is | 20:52 |
corvus | SpamapS: https://ci.gerritcodereview.com/tenants | 20:52 |
*** jamesmcarthur has quit IRC | 20:54 | |
corvus | SpamapS: actually, strictly speaking, gerrit is running Zuul in GKE but using GCE for test nodes | 20:54 |
SpamapS | Cool! | 20:57 |
*** erbarr has quit IRC | 21:11 | |
*** jcapitao_off has joined #zuul | 21:17 | |
*** Goneri has quit IRC | 21:19 | |
*** sgw has quit IRC | 21:23 | |
*** mattw4 has quit IRC | 21:30 | |
*** mattw4 has joined #zuul | 21:31 | |
*** sgw has joined #zuul | 21:38 | |
*** dpawlik has quit IRC | 21:43 | |
NBorg | That automatic reenqueue is a nice feature. Is that new, or not part of zuul itself? I don't think I've seen that happen in any of our tenants. | 21:44 |
mordred | NBorg: it's not part of zuul itself at the moment - we saved the job queues before the restart and then restored the job queues after the restart | 21:53 |
*** Goneri has joined #zuul | 21:53 | |
*** jamesmcarthur has joined #zuul | 21:54 | |
corvus | NBorg: here's the script though: https://opendev.org/zuul/zuul/src/branch/master/tools/zuul-changes.py | 21:54 |
corvus | hopefully soon it will be obsolete when we have HA schedulers :) | 21:54 |
fungi | yeah, it's just a workaround for the fact that it's state held in memory by the spof scheduler | 21:59 |
fungi | zuul v2 *had* some solution where the queue was exported as a python pickle i want to say? but that was only viable if the data structure remained unchanged | 22:00 |
openstackgerrit | Merged zuul/zuul master: executor: blacklist dangerous ansible host vars https://review.opendev.org/710287 | 22:02 |
corvus | fungi: well, it had (and still has) a facility for saving the event queue, but that's more or less useless without being able to save the running queue | 22:06 |
NBorg | Thanks! That'll save some time. | 22:07 |
*** jamesmcarthur has quit IRC | 22:08 | |
*** jamesmcarthur has joined #zuul | 22:11 | |
*** NBorg has quit IRC | 22:12 | |
corvus | Shrews: re zombie node requests from #openstack-infra -- nodepool/driver/__init__.py line 679 is i think the key here -- that's failing to store a node (which disappeared) that was assigned to a request (which disappeared) | 22:20 |
Shrews | yeah. we're missing exception handling around it | 22:20 |
fungi | corvus: oh, right, that was events/results | 22:20 |
Shrews | corvus: i think we can safely capture NoNodeError there, log it, and move to the next node | 22:20 |
corvus | Shrews: agreed; you want to type that up? | 22:21 |
Shrews | though i wonder how the node znode disappeared? | 22:21 |
Shrews | corvus: yeah | 22:21 |
clarkb | corvus: Shrews do we need to restart our launcher to clear this out in the running system? | 22:21 |
clarkb | Shrews: because the scheduler was restarted | 22:21 |
corvus | clarkb: yes, i don't think it will recover (but i think the only problem is that the logs are spammy?) | 22:22 |
corvus | yeah, but something still deleted those nodes... | 22:22 |
Shrews | clarkb: how would that cause node znodes to disappear? | 22:22 |
clarkb | Shrews: aren't the znodes ephemeral with the connection? | 22:22 |
Shrews | no | 22:22 |
clarkb | that's why memory pressure problems cause node failures | 22:22 |
clarkb | the connection dies and znodes go away | 22:22 |
Shrews | node requests are | 22:22 |
corvus | and node locks are, but not nodes | 22:23 |
clarkb | ah I see | 22:23 |
corvus | Shrews: i'll come up with a timeline and see if we can spot how it happened | 22:25 |
tristanC | we are seeing NODE_FAILURE after restarting the services, could this be related? | 22:25 |
fungi | tristanC: discussion of opendev failures is taking place in #openstack-infra but, yes, sounds like maybe a launcher restart there is in order | 22:26 |
corvus | Shrews, clarkb: the sequence: https://etherpad.openstack.org/p/tujgF5HLpJ | 22:29 |
corvus | Shrews: i'm guessing that when the request disappeared, the launcher unlocked the nodes even though they were still building (though i don't see a log message for that) | 22:31 |
*** Goneri has quit IRC | 22:32 | |
corvus | hrm, i can't back that up | 22:33 |
openstackgerrit | David Shrewsbury proposed zuul/nodepool master: Fix for clearing assigned nodes that have vanished https://review.opendev.org/710343 | 22:33 |
Shrews | corvus: if they're building, they wouldn't be allocated to the request | 22:34 |
corvus | i thought we allocated them as soon as we started building? | 22:34 |
Shrews | they'd have to be in the READY and unlocked state before being allocated | 22:34 |
corvus | if that's the case, "Locked building node 0014861474 for request 200-0007660291" is a very misleading message | 22:34 |
Shrews | oh, but wait a sec... | 22:34 |
Shrews | corvus: you're right. if we build a new node specifically for a request, the allocated_to is set. existing nodes were what i was thinking of | 22:35 |
corvus | so if we unlocked the node because the request disappeared, i would expect to see the "node request disappeared" log message around then, but i don't (that only comes much later) | 22:36 |
corvus | (basically i'm focusing on the lines at 18:06 -- i would not expect the node to be unlocked then...) | 22:39 |
Shrews | ok, so while it is building, the request disappears, the launcher cleanup thread runs, deletes the instance and znodes, request handler wakes up and gets confused | 22:41 |
Shrews | that makes sense | 22:41 |
corvus | Shrews: except i don't understand why the node was unlocked; it should be locked while building | 22:41 |
clarkb | should we be checking if the node can fulfill another request before deleting it? | 22:41 |
corvus | clarkb: that's actually what's supposed to happen | 22:42 |
Shrews | corvus: THAT is a good question. possibly, if the scheduler was seeing disconnects, the launcher was too and lost locks? | 22:43 |
Shrews | any lost session errors? | 22:43 |
corvus | good q | 22:43 |
Shrews | i don't recall what the message would be. likely a zk exception | 22:44 |
Shrews | SessionExpiredError maybe | 22:44 |
corvus | 2020-02-27 18:04:59,092 INFO kazoo.client: Zookeeper session lost, state: EXPIRED_SESSION | 22:45 |
corvus | bingo | 22:45 |
Shrews | so hard to program defensively enough for that | 22:46 |
corvus | ok, i think that explains it, and i think under the circumstances, that's fine, and the only thing we really need to do is plug the hole with https://review.opendev.org/710343 | 22:46 |
clarkb | ok so it was related to the connection loss after all, just in an unexpected manner | 22:47 |
corvus | yep, we didn't notice the launcher also lost its connection at the same time | 22:48 |
Shrews | glad we figured it out | 22:48 |
clarkb | looking at 710343, that causes us to ignore not being able to write the node because it has been deleted? | 22:49 |
clarkb | (seems like it was the request that was deleted but we could still reuse the host?) | 22:49 |
*** jamesmcarthur has quit IRC | 22:49 | |
clarkb | I guess my question comes down to: will 710343 cause us to leak some other state? (it probably won't cause jobs to fail, but maybe we have to wait 8 hours for cleanup to run) | 22:51 |
corvus | clarkb: 343 covers the case where both the request and the node have disappeared from zk. it lets us clear the request out of the list of active requests the launcher is tracking | 22:53 |
Shrews | what cleanup runs every 8 hours? | 22:53 |
Shrews | iirc, launcher cleanup thread runs 1x per minute | 22:53 |
corvus | i don't think it will cause anything else to leak (if a server leaks during that process, the normal leaked instance thing will catch it) | 22:53 |
Shrews | our cleanup thread is pretty thorough as long as znodes are unlocked or just not present | 22:54 |
Shrews | sort of the housekeeper of nodepool, sweeping bugs under the carpet that we just can't program against :) | 22:55 |
clarkb | Shrews: node cleanup is on 8 hour timeout iirc | 22:56 |
clarkb | corvus: right the request and node have disappeared from zk but what about the actual cloud node | 22:57 |
clarkb | if the intent is to reuse that in another request have we leaked it? | 22:57 |
Shrews | clarkb: not in nodepool. hard coded to 60s: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L874 | 22:58 |
clarkb | Shrews: that's the interval, but then we check age after? I guess if the node znode is gone then we short-circuit. The 8 hour timeout is only for other nodes? | 22:58 |
clarkb | Shrews: basically cleanup runs that often but we don't take action on objects until they are older | 22:58 |
Shrews | are you refering to max-age? that's different than cleanup | 22:59 |
clarkb | I guess? basically whatever would remove a now orphaned node | 22:59 |
clarkb | in the cloud | 22:59 |
corvus | clarkb: i think in this case, the ship has sailed. we lost our claim on the node which means it and the underlying instance are subject to deletion (and, in fact, in this case both were deleted). it doesn't make sense to put it back because we can't recover the state accurately. | 22:59 |
Shrews | or max-ready-age (whatever we named it). that just removes unused nodes after they've been alive for some set time | 22:59 |
Shrews | clarkb: orphaned nodes would get cleaned up within the 60s timeframe | 23:00 |
corvus | clarkb: (to answer your question another way, in this case we did not leak the underlying instance, we deleted it, but if we didn't for some reason (launcher crash?) the thing you're talking with Shrews about would catch it) | 23:00 |
clarkb | got it | 23:00 |
corvus | clarkb: also, to clarify an earlier point, we *do* reuse nodes if the request disappears; only if things are so bad that the launcher can't hold on to its own node locks do we get into the case where we might lose both. | 23:01 |
clarkb | basically, if the node znode had survived then this would be a non-issue (and would allow reuse), but because the node znode did not survive, any launcher could then delete the instance, so we can't recycle it at that point | 23:01 |
Shrews | clarkb: correct | 23:02 |
corvus | (node request disappearing is normal behavior and we optimize for it -- no failures are required to hit that) | 23:02 |
clarkb | and the way to avoid that is to keep network connectivity and memory use happy | 23:02 |
*** mattw4 has quit IRC | 23:02 | |
*** mattw4 has joined #zuul | 23:06 | |
*** jamesmcarthur has joined #zuul | 23:25 | |
openstackgerrit | Merged zuul/nodepool master: Fix for clearing assigned nodes that have vanished https://review.opendev.org/710343 | 23:38 |