opendevreview | James E. Blair proposed zuul/zuul master: Use the nodeset build parameter instead of hosts/groups https://review.opendev.org/c/zuul/zuul/+/799127 | 00:26 |
corvus | clarkb: ^ in response to your comments; tristanC ^ can you look at that and check if that will work with the runner (i suspect it may need a minor update) | 00:28 |
opendevreview | James E. Blair proposed zuul/zuul master: WIP: Remove nodes/groups from build parameters https://review.opendev.org/c/zuul/zuul/+/799128 | 00:28 |
opendevreview | Merged zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 00:48 |
opendevreview | Ian Wienand proposed zuul/nodepool master: [wip] make /sys ro in container https://review.opendev.org/c/zuul/nodepool/+/799125 | 00:53 |
opendevreview | Merged zuul/zuul master: Remove ExecutorApi.update() call from tests https://review.opendev.org/c/zuul/zuul/+/798778 | 03:23 |
opendevreview | Merged zuul/zuul master: Add some jitter to apscheduler interval cleanup jobs https://review.opendev.org/c/zuul/zuul/+/798953 | 03:47 |
opendevreview | Merged zuul/zuul master: Add comment for test method https://review.opendev.org/c/zuul/zuul/+/798954 | 03:47 |
*** jpena|off is now known as jpena | 06:52 | |
opendevreview | Simon Westphahl proposed zuul/zuul master: Refactor pipeline processing in run handler https://review.opendev.org/c/zuul/zuul/+/795985 | 09:40 |
opendevreview | Tobias Henkel proposed zuul/nodepool master: Speed up node list https://review.opendev.org/c/zuul/nodepool/+/760542 | 11:18 |
opendevreview | Tobias Henkel proposed zuul/nodepool master: Delete init nodes when resetting lost requests https://review.opendev.org/c/zuul/nodepool/+/744107 | 11:21 |
opendevreview | Tobias Henkel proposed zuul/zuul master: Call manageLoad during pause and unpause https://review.opendev.org/c/zuul/zuul/+/755765 | 11:23 |
*** jpena is now known as jpena|lunch | 11:36 | |
*** bhagyashris_ is now known as bhagyashris|ruck | 12:15 | |
*** jpena|lunch is now known as jpena | 12:36 | |
opendevreview | Tobias Henkel proposed zuul/zuul master: Optimize stats reporting per node request https://review.opendev.org/c/zuul/zuul/+/754967 | 12:53 |
opendevreview | Denis proposed zuul/zuul master: [api][cors] Access-Control-Allow-Origin * for all routes https://review.opendev.org/c/zuul/zuul/+/767691 | 14:09 |
corvus | i've just restarted opendev's zuul on master | 14:31 |
tobiash[m] | yay, fingers crossed | 14:33 |
corvus | some jobs have run to completion, so yay! i'm not quite sure what to make of the changes in zk behavior yet | 14:35 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix zuul.executors.accepting stats bug https://review.opendev.org/c/zuul/zuul/+/799219 | 15:07 |
corvus | tobiash: ^ | 15:07 |
corvus | tobiash: so we still are lacking a method of calculating the executor queue, right? | 15:07 |
corvus | tobiash: i think that was in an old version of the executor api change, but somehow got dropped | 15:11 |
tobiash[m] | I think so | 15:11 |
corvus | i found it in commit 0004c2865c (an old unmerged commit of I5de26afdf6774944b35472e2054b93d12fe21793) | 15:12 |
corvus | i'll see if i can extract it | 15:12 |
corvus | the total is easy; the zone stuff is going to require a little extra work | 15:19 |
corvus | so i think this will be original code instead of just extracting the old stuff | 15:20 |
opendevreview | James E. Blair proposed zuul/zuul master: Correct executor queue stats https://review.opendev.org/c/zuul/zuul/+/799223 | 15:38 |
corvus | tobiash, fungi: ^ i think that should do it | 15:38 |
corvus | we may have a steadily increasing number of zk watches -- we should keep an eye on that and see if it levels out or keeps increasing | 15:40 |
clarkb | those should grow with the number of nodes right? since we add watches under each znode with contents worth watching? If we see znode count fall dramatically but watches continue to climb that could indicate a leak | 15:41 |
corvus | yeah, nodes are holding steady at 40k while watches are increasing linearly | 15:42 |
corvus | it's possible we're not clearing them from the executor api | 15:43 |
corvus | but i want to give it a bit more time before i call it a pattern | 15:43 |
clarkb | corvus: super small thing on the stats fix | 15:47 |
*** jpena is now known as jpena|off | 15:53 | |
opendevreview | James E. Blair proposed zuul/zuul master: Correct executor queue stats https://review.opendev.org/c/zuul/zuul/+/799223 | 16:09 |
corvus | clarkb: ++ | 16:09 |
clarkb | corvus: are we able to query zk for a list of watches via the admin side? | 16:31 |
clarkb | corvus: if so maybe we take a listing now then wait $timeperiod and take another listing and check the delta? | 16:31 |
clarkb | if we see watches going away we know we aren't leaking those watches but if some category does not go away then there is a good chance those do leak? | 16:32 |
opendevreview | Merged zuul/zuul master: Fix zuul.executors.accepting stats bug https://review.opendev.org/c/zuul/zuul/+/799219 | 16:34 |
clarkb | wchs, wchc, and wchp are commands that will do that (I don't think we have any enabled via the socket but we can hit them via the rest api?). Note that wchc and wchp come with warnings that those operations can be expensive and impact the server | 16:49 |
clarkb | corvus: I think wchp is what we want: "Lists detailed information on watches for the server, by path. This outputs a list of paths (znodes) with associated sessions. Note, depending on the number of watches this operation may be expensive (ie impact server performance), use it carefully." | 16:59 |
clarkb | of course now we have to decide if we are willing to try it (maybe run it on zk04 which has the least number of watches?) | 16:59 |
clarkb | corvus: running wchs (the one not annotated with be careful messages) shows the number of paths and the number of watches. The number of paths seems to be growing far less than the number of watches | 17:11 |
clarkb | but both do seem to grow | 17:12 |
corvus | clarkb: sorry was away; good ideas | 17:24 |
clarkb | that's ok I've just been monitoring wchs informally and both num paths and watches go up | 17:25 |
corvus | clarkb: i'd like to do either a wchc or wchp; which do you think? i'm thinking wchp | 17:25 |
* clarkb rereads the docs | 17:25 | |
clarkb | corvus: ya wchp seems better since it is watches by path. I suspect that mapping will make it easier to find any potential leaks | 17:26 |
clarkb | corvus: do you want me to go ahead and run that on zk04 or do you want to do it? It appears that it pretty prints json too though I should test that redirected | 17:27 |
clarkb | interestingly zk05 had a large drop in watches but zk04 and zk06 have not | 17:29 |
corvus | clarkb: why don't you do it and save the output since you have a session going | 17:32 |
clarkb | ok | 17:33 |
clarkb | corvus: zk04:~clarkb/wchp.20210702-first | 17:34 |
corvus | i'm logged into zk04 (also, this is borderline opendev ops related, but i figure since we're suspecting a zuul issue, we can keep going in this channel) | 17:34 |
clarkb | ya I figured it would be of interest to this group | 17:34 |
corvus | as expected, many build request watches | 17:34 |
clarkb | I can rerun it in say 5 minutes then we can look at diffs I guess | 17:34 |
corvus | ++ | 17:34 |
corvus | i'm going to look up some uuids while we wait | 17:34 |
corvus | clarkb: oh, actually, i'd like to know who's on the other end of sessions 288312462838726666, 288312462838726667, 288312462838726659 | 17:35 |
clarkb | also it ran pretty quickly implying that this wasn't too expensive for us | 17:35 |
corvus | do we have a way to figure that out? | 17:36 |
clarkb | I'm sure there is a way but I don't know off the top of my head | 17:36 |
clarkb | https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html may tell us | 17:36 |
clarkb | cons maybe | 17:36 |
corvus | clarkb: are you just doing "nc localhost 2181"? | 17:36 |
corvus | oh you're using the rest api? | 17:37 |
clarkb | corvus: no I'm using the rest api because the wch* commands are not whitelisted | 17:37 |
corvus | clarkb: what's that look like? i've never used it | 17:37 |
corvus | thx for the pm :) | 17:38 |
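[Editor's note: the wch* four-letter words aren't in the default 4lw whitelist on the plain client socket, but ZooKeeper 3.5+ also serves them over the AdminServer HTTP interface, which is what clarkb is using here. A rough sketch of that access path follows; the long command names are taken from the ZooKeeper admin documentation, and the default AdminServer port of 8080 plus the host names are assumptions.]

```python
import json
import urllib.request

# The AdminServer (ZooKeeper >= 3.5) exposes each four-letter word as an
# HTTP endpoint under /commands/<long-name>, by default on port 8080.
FOUR_LETTER_TO_COMMAND = {
    "wchs": "watch_summary",    # counts of watched paths and watches
    "wchc": "watches",          # watches grouped by session (expensive)
    "wchp": "watches_by_path",  # watches grouped by znode path (expensive)
    "cons": "connections",      # per-session connection details
}

def admin_command_url(host, four_letter, port=8080):
    """Build the AdminServer URL for a four-letter-word command."""
    return "http://%s:%d/commands/%s" % (
        host, port, FOUR_LETTER_TO_COMMAND[four_letter])

def run_admin_command(host, four_letter, port=8080):
    """Fetch and decode one command's JSON output from a live server."""
    with urllib.request.urlopen(admin_command_url(host, four_letter, port)) as resp:
        return json.load(resp)
```

A snapshot like the one saved to `zk04:~clarkb/wchp.20210702-first` would then be `run_admin_command("zk04", "wchp")` redirected to a file.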
clarkb | 6659 is zuul02, 6667 is ze12, and 6666 is ze05 | 17:40 |
corvus | it is interesting that all of the requests have exactly those hosts... | 17:42 |
clarkb | there are only 6 total sessions from when I ran cons, this is I think expected because other clients will be talking to other servers in the cluster | 17:42 |
corvus | ah of course | 17:42 |
clarkb | let me see what the other sessions are, they could all be mergers maybe? | 17:42 |
clarkb | 288312462838726660 is also zuul02. Everything else is zuul mergers or nodepool builders | 17:44 |
corvus | clarkb: adding up all the watchers for /zuul/build-requests/unzoned/ca6cd1ae56cd406eb330ce0d5b6abbc4 gives us 13 sessions | 17:44 |
corvus | so we know every build request is getting a watch from every executor plus the scheduler | 17:44 |
clarkb | I'm running my followup wchp now | 17:45 |
clarkb | diffing the two shows that no build request watches went away, we only added new ones (which could be coincidence given the time some of those take to run?) | 17:47 |
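[Editor's note: the two-snapshot comparison described here can be mechanized. A minimal sketch, assuming each wchp snapshot has been decoded into a mapping of znode path to session ids (the real JSON shape may differ); the function name and path prefix are illustrative.]

```python
def diff_watch_snapshots(first, second, prefix="/zuul/build-requests/"):
    """Compare two wchp-style snapshots ({path: [session ids]}).

    Returns (removed, added): paths under `prefix` whose watches went
    away between snapshots, and paths that gained watches.  If `removed`
    stays empty while `added` keeps growing, the watches are leaking.
    """
    old = {p for p in first if p.startswith(prefix)}
    new = {p for p in second if p.startswith(prefix)}
    return sorted(old - new), sorted(new - old)
```

With the pattern observed in the channel, repeated runs would show an empty `removed` list and a steadily growing `added` list.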
corvus | i checked the first item in the first list: ff56c3afd7cc4ac0b143c54cfdc53a72 and zuul says it finished at 15:33:00,761 which is almost exactly the time of your first query. but it's still there in the second query. | 17:47 |
corvus | i think that's a leak | 17:47 |
clarkb | ya seems like it | 17:47 |
corvus | okay, i'm going to go off and stare at some code | 17:47 |
clarkb | corvus: maybe double check that we aren't returning an exception in the logs in the watch methods which prevents the watch from being cleaned up? | 17:48 |
corvus | (and maybe think about calling wchp from unit tests...) | 17:48 |
clarkb | since we need to return False alone iirc | 17:48 |
clarkb | corvus: does the request still exist too? | 17:48 |
corvus | good q | 17:49 |
clarkb | that might be another thing to check is if we are leaking the znode entirely? (Not sure how znode deletions interact with watches) | 17:49 |
corvus | we do have some exceptions with the executor client trying to return nodesets, we should look at that but it's likely a separate problem | 17:49 |
corvus | clarkb: Path /zuul/build-requests/unzoned/ff56c3afd7cc4ac0b143c54cfdc53a72 doesn't exist | 17:51 |
corvus | so just a watch leak | 17:51 |
corvus | (in general, you can watch paths that don't exist in zk) | 17:52 |
clarkb | got it | 17:53 |
clarkb | thinking out loud here if this was simply a matter of watch callbacks not being called quickly enough due to contention we would expect some of the watches to go away but more slowly than they are created. The diffs indicate this is unlikely the case as we only see new watches and none removed | 17:54 |
corvus | ++ | 17:55 |
corvus | clarkb: if there were an exception in our callback, it would prevent the watch from being removed. but i'm pretty sure it would be logged under kazoo.recipe.watchers | 17:59 |
corvus | i believe https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/watchers.py#L155 is the relevant kazoo code | 17:59 |
corvus | we probably should make our callback more robust, but i don't immediately see the error since i think that isn't happening. | 18:00 |
clarkb | corvus: could it be the if not content: return code at the top of the callback? | 18:02 |
clarkb | that would only trigger on the changed case but if CHANGED and DELETED are both triggered by deletion maybe that is the problem? | 18:03 |
clarkb | hrm except that we should get it fired for both events in that case | 18:03 |
clarkb | unless event.type is a mask | 18:03 |
clarkb | https://kazoo.readthedocs.io/en/latest/api/protocol/states.html#kazoo.protocol.states.EventType doesn't say anything about masks | 18:06 |
corvus | testing says we just get deleted on deleted | 18:08 |
clarkb | perhaps something to do with the closure | 18:13 |
clarkb | like maybe that has messed up kazoo's ability to count arguments and pass in event properly? (It shouldn't since we hide that from kazoo with the watch() signature) | 18:13 |
corvus | i can observe the behavior in tests | 18:16 |
corvus | by adding a wchp call to the end of a test | 18:16 |
clarkb | side note: a third wchp continues to show the same behavior, no build request watches are being cleaned up | 18:16 |
corvus | i think this may be a characteristic of the datawatch kazoo recipe | 18:20 |
corvus | i think there would have to be one more update after the delete in order for the watch to be triggered and not re-set | 18:22 |
corvus | i think https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/watchers.py#L191-L192 causes the watcher to be re-set after the delete happens | 18:23 |
clarkb | corvus: isn't that running before the callback on line 206? | 18:26 |
corvus | the java api has a removeWatches call, but i don't see that in kazoo | 18:26 |
clarkb | distinct to https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/watchers.py#L168 ? | 18:27 |
corvus | clarkb: exactly: delete in zk happens; zk sends client a "delete" event for the watcher; Datawatch calls getdata(); getdata() calls 'get' which fails, then calls 'exists' which succeeds and sets a watch. then our callback happens, and we tell it to stop. | 18:28 |
corvus | the system is left with a watch in place because of the exists call. | 18:28 |
corvus | i think the only way to get a datawatch to stop watching a path is to return False before the node is deleted, and then have one more event (which could be deletion) happen. i think the 'exists' handling was added so you could watch a node that doesn't exist yet; i don't think they thought through watching a node that gets deleted. | 18:30 |
corvus | basically, we want a datawatch which doesn't call exists | 18:31 |
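[Editor's note: the leak mechanism corvus describes can be simulated without a real cluster. The stub below models only the relevant behavior: on a delete event, the stock DataWatch recipe's data fetch falls back from `get` to `exists` and re-arms the watch, while the "don't call exists" variant leaves no watch behind. `FakeClient` and `on_delete_event` are illustrative stand-ins, not kazoo's actual internals.]

```python
class FakeClient:
    """Minimal stand-in for the parts of a ZooKeeper client that matter here."""
    def __init__(self):
        self.znodes = {}      # path -> data
        self.watches = set()  # paths with a watch armed

    def get(self, path, watch=False):
        if path not in self.znodes:
            raise KeyError(path)  # models kazoo.exceptions.NoNodeError
        if watch:
            self.watches.add(path)
        return self.znodes[path]

    def exists(self, path, watch=False):
        if watch:
            self.watches.add(path)
        return path in self.znodes

def on_delete_event(client, path, exists_fallback):
    """Model DataWatch's data fetch reacting to a DELETED event."""
    client.watches.discard(path)  # the fired watch is consumed
    try:
        client.get(path, watch=True)
    except KeyError:
        if exists_fallback:
            # stock DataWatch: fall back to exists(), re-arming the watch
            client.exists(path, watch=True)
        # variant without the fallback: the watch stays gone

c = FakeClient()
c.znodes["/zuul/build-requests/unzoned/abc"] = b"..."
del c.znodes["/zuul/build-requests/unzoned/abc"]

on_delete_event(c, "/zuul/build-requests/unzoned/abc", exists_fallback=True)
leaked = bool(c.watches)   # a watch is re-armed on the deleted path

on_delete_event(c, "/zuul/build-requests/unzoned/abc", exists_fallback=False)
clean = not c.watches      # no watch left behind
```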
clarkb | and the exists call sets a watch that is _get_data so it is recursive | 18:33 |
clarkb | and _get_data never seems to return False? | 18:34 |
clarkb | corvus: can we fix this with an event type check in _get_data? basically don't call the get or exists if event is deleted? we already know it isn't existing | 18:35 |
corvus | clarkb: yes -- but it may be an uphill battle to get that accepted; i could see the argument that they may have users that want watches to persist after deletion. so at least i think it would need to be a new optional argument to the constructor or something | 18:38 |
opendevreview | James E. Blair proposed zuul/zuul master: WIP datawatch https://review.opendev.org/c/zuul/zuul/+/799317 | 18:39 |
clarkb | corvus: ya that seems reasonable, or even a subclass: DeletableDataWatch | 18:39 |
corvus | clarkb: ^ that has the behavior we want (i removed the exists check, since we're not calling it until after we know it exists) | 18:39 |
clarkb | right we only add those watches when we have a child watch that indicates the entity exists. I suppose there is a race there if the buildrequest is immediately deleted for some reason? | 18:40 |
clarkb | but in that case we still don't want to watch it | 18:41 |
clarkb | we just want to move on. I think that means your fix is appropriate if you ignore private api overrides | 18:41 |
corvus | clarkb: yeah, either approach would work for us: option (clarkb): datawatch can be created regardless of whether the znode exists, but once it's deleted, watch is removed. option (corvus): watch can only be created if node exists, watch is removed if node deleted. | 18:41 |
corvus | yeah, maybe mine is slightly preferable in that it helps keep things tidy in the case of the create-then-delete race? | 18:42 |
clarkb | ya. Also I think you need to set data and stat since they may not be set if an exception is raised? | 18:42 |
clarkb | they don't get None values they will be undefined iirc | 18:42 |
clarkb | are watches an implied ephemeral type? if the connection goes away we'll auto purge the leaked watches? (thinking about what cleanup might entail) | 18:46 |
clarkb | corvus: we also use DataWatch() elsewhere in existing code. Any idea why this doesn't seem to be a problem for them? | 18:49 |
corvus | yeah, when clients disconnect, watches disappear | 18:49 |
clarkb | NodeRequests in particular seem to be using DataWatches | 18:49 |
corvus | clarkb: we stop watching node requests once they are fulfilled or canceled | 18:52 |
corvus | but deletion is a separate step | 18:52 |
clarkb | got it that stops the watching entirely so when the deletion event happens there is no watch to recursively keep watching with at that point | 18:52 |
corvus | yep | 18:53 |
corvus | there may be some edge cases there, but that's the typical workflow. i expect it would be okay to use whatever we come up with here for that as well | 18:53 |
clarkb | though that could potentially race, it is unlikely to happen in bulk like this (and possibly never at all depending on timing) | 18:53 |
clarkb | ++ | 18:53 |
corvus | another approach we could do is to avoid using DataWatch and set the watches ourselves... but honestly, datawatch is a relatively simple recipe, and we'd probably just end up reimplementing most of it. | 18:54 |
clarkb | This seems like a reasonable enough use case that we should maybe try upstreaming your child class? | 18:56 |
clarkb | Basically a case where you only want to watch something as long as it exists | 18:56 |
corvus | clarkb: re data+stat -- i don't think we need to set those? because of the return? | 18:56 |
clarkb | oh I missed there is a return in the exception handler. I thought it was falling through. Don't we need to fall through and call our watcher one last time for the DELETED event? | 18:57 |
corvus | yes | 18:57 |
clarkb | the deletion will happen and the watcher will fire for the DELETED event. The get will fail on NoNodeError and we then need to set None types for those values and fall through and let _log_func_exception run one more time I think | 18:58 |
clarkb | and then maybe try upstreaming that as ExistingDataWatch with some updates to the code to check self._ever_called and short circuit if that is false | 18:59 |
clarkb | ya I think in the exception handler block you want to do what you currently have if not self._ever_called else fallthrough and let the handler run one last time | 19:01 |
clarkb | Then you should have a watcher that does nothing if the data does not exist when created and will stop doing things when the data is eventually deleted | 19:03 |
clarkb | (if the watcher chooses to handle the deleted event I guess) | 19:03 |
corvus | or we could still allow it to be called the first time, just with null data/stat/event; then the callback would know that it was deleted | 19:04 |
clarkb | ya that would work too | 19:04 |
clarkb | if not data and not event then you don't exist at initialization. | 19:05 |
clarkb | and i guess letting the callback decide what to do in that case is more flexible | 19:05 |
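[Editor's note: the contract settled on here — the callback fires one last time with null data/stat on deletion, and returning False removes the watch — can be sketched from the consumer's side. The function name below is hypothetical; the `(data, stat, event)` signature and the return-False convention follow kazoo's DataWatch recipe as discussed above.]

```python
def build_request_callback(data, stat, event):
    """Callback for an ExistingDataWatch-style watcher.

    Called with (data, stat, event); event is None on the initial call.
    Returning False removes the watch.
    """
    if data is None:
        # Either the znode never existed (event is None at init) or it
        # was just deleted; either way there is nothing left to watch.
        return False
    # ... handle the updated build request data here ...
    return True

# Exercising the contract with fake events:
class FakeEvent:
    def __init__(self, type_):
        self.type = type_

assert build_request_callback(None, None, None) is False             # never existed
assert build_request_callback(b"{}", object(), FakeEvent("CHANGED")) is True
assert build_request_callback(None, None, FakeEvent("DELETED")) is False
```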
opendevreview | James E. Blair proposed zuul/zuul master: WIP datawatch https://review.opendev.org/c/zuul/zuul/+/799317 | 19:05 |
corvus | something like that ^ | 19:05 |
corvus | i'm going to get lunch, then i'll polish it up | 19:05 |
clarkb | left a comment. I think it is really close | 19:11 |
clarkb | And thinking out loud more, maybe it is better to vendor the parent class in the child (and separate them) so that we don't run into private api change problems? at least until we can get something upstreamed? | 19:14 |
clarkb | kazoo is apache2 licensed so that shouldn't pose a problem. We just do some attribution and should be good? | 19:15 |
corvus | clarkb: replied... can you take a look? | 20:47 |
clarkb | corvus: yup looking now | 20:54 |
clarkb | corvus: what about the case where you might want to keep listening after handling the DELETE event. I guess we're basically saying that isn't a thing here | 20:56 |
clarkb | this watcher is only good on valid nodes until they are deleted? | 20:57 |
clarkb | corvus: your comment makes sense though. And I agree with it as long as we don't intend on letting the DELETE event handler decide if the watch should stop or not | 20:59 |
corvus | clarkb: yes that's the intent; structurally, it's very difficult to let the callback decide that, because the datawatcher calls the zk method that would set the watch before calling the callback | 21:04 |
corvus | by very difficult, i think i mean 'impossible within the remaining design constraints of the DataWatch recipe' | 21:05 |
clarkb | I think you would have to use the self._ever_called and event.type attributes to decide that | 21:05 |
clarkb | if self._ever_called and event.type == DELETED and not data then you know it has gone away | 21:06 |
corvus | clarkb: no it's too late | 21:06 |
corvus | maybe i'm misunderstanding the suggestion | 21:06 |
clarkb | keeping the behavior simple like you've done makes sense to me regardless of trying to figure out the state machine | 21:06 |
corvus | i just want to check something | 21:07 |
corvus | you know the zookeeper watch is actually set by the "client.get" call, right? | 21:07 |
clarkb | corvus: I think you can call retry(exists, watcher) if the callback didn't return false and self._ever_called is true and data is None | 21:08 |
clarkb | corvus: yes both the get and the exists set it | 21:08 |
corvus | ok, then it's me that's missing something... :) | 21:08 |
corvus | but the callback is called after the exists call | 21:08 |
corvus | so how do we let the callback decide whether or not to set the watch? | 21:08 |
josefwells | Hey zuul-friends, getting close to my zuul on k8s. My executor logs show: 2021-07-02 18:54:47,152 DEBUG zuul.AnsibleJob.output: [e: e16b8800-db66-11eb-8e5e-3951beb15980] [build: c7f4c441db6b4223b861f95dd37b2deb] Ansible output: b'bwrap: Failed to make / slave: Permission denied' | 21:09 |
clarkb | corvus: you can move that call after I think | 21:09 |
clarkb | corvus: and then wrap it in the conditional to check if the callback wants you to put it back again | 21:09 |
clarkb | corvus: its more trouble than it is worth | 21:09 |
corvus | clarkb: it's before so that we pass in the stat | 21:09 |
corvus | clarkb: yeah that would mostly work, but i think it's potentially racy | 21:10 |
corvus | the "exists+watch" combo provides an important assurance that we don't miss any changes | 21:10 |
josefwells | but I'm not sure where to start looking for the issue.. any debug pointers? Everything is suspect to me :) | 21:10 |
corvus | (or get+watch) | 21:10 |
corvus | josefwells: make sure the executor container is running privileged | 21:11 |
josefwells | corvus: ok, thanks | 21:12 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:24 |
opendevreview | James E. Blair proposed zuul/zuul master: Add ExistingDataWatch class https://review.opendev.org/c/zuul/zuul/+/799317 | 21:26 |
corvus | clarkb: ^ final version | 21:26 |
clarkb | corvus: cool looking | 21:26 |
corvus | well, i mean, finally ready for real review :) | 21:26 |
corvus | i'll work on a kazoo pr meanwhile | 21:27 |
josefwells | corvus: ok, lets just say for the sake of hypotheticals that I can't run a privileged container in my cluster.. is there some other option? | 21:42 |
clarkb | josefwells: I think it is possible you could enable the specific items that bwrap needs to create its containers | 21:43 |
clarkb | a privileged container is a shortcut for that. That said there is a good chance that won't work because those may be considered to be effectively the same as privileged | 21:43 |
josefwells | bummer town, guess I just need to bite the bullet and own another server | 21:44 |
josefwells | trying to get out of that business | 21:44 |
josefwells | clarkb: where can I find a list of privileges I would need? | 21:45 |
opendevreview | Merged zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:45 |
josefwells | clarkb: Oh, if it is ability to create containers, yeah, seems like that would be.. | 21:46 |
clarkb | josefwells: https://github.com/containers/bubblewrap#sandboxing I think that covers the possible options. I know the the user namespaces are required for zuul's use case. Not sure about the others | 21:46 |
clarkb | josefwells: I know that tobiash[m] has/had to run zuul on a separate openshift from their normal openshift in order to allow for privileged containers. corvus also helps manage the gerrit zuul which runs on gke. I think on gke it is a non issue though | 21:52 |
clarkb | it is possible tobiash[m] or tristanC may have input on how to make it run more happily | 21:52 |
opendevreview | Merged zuul/zuul master: Correct executor queue stats https://review.opendev.org/c/zuul/zuul/+/799223 | 21:53 |
josefwells | clarkb: thanks.. I've checked with my people.. I'll figure something out | 21:55 |
clarkb | corvus: I did leave a comment on the latest patchset to make sure I understand the test watch thing properly. Otherwise I think we can go with that | 22:11 |
corvus | clarkb: cool; want to take a quick look at https://github.com/python-zk/kazoo/pull/648 ? | 22:18 |
corvus | (while i review your comment) | 22:18 |
opendevreview | James E. Blair proposed zuul/zuul master: Add ExistingDataWatch class https://review.opendev.org/c/zuul/zuul/+/799317 | 22:22 |
corvus | clarkb: i did change something in that test method ^ | 22:22 |
clarkb | corvus: For the PR: small typo in the docstring for the new class: "functiol" instead of "function". The tests you added look good. The only thing I would caution with them is the event.wait(0.2) lines may be too short depending on how long the watchers take to process. But waiting for longer will make the tests slower | 22:25 |
corvus | oof, thx ill fix the typo | 22:25 |
corvus | 0.2 was used in other tests with similar 'expected failure' so i went with that; i agree it's concerning tho | 22:26 |
corvus | zuul-maint: if anyone else is around to review https://review.opendev.org/799317 that would be great; otherwise, i'll probably merge it in a little while just with +2 from clarkb and me, so we can restart opendev with that before the watches get too out of hand (i'm not too worried about them, but would like to nip it in the bud before it becomes a problem :) | 22:30 |
clarkb | I need to take a short break but I can be around for that later today | 22:36 |
clarkb | the merging and restart and all that I mean | 22:36 |
corvus | cool, i went ahead and +w'd it; if someone sees an issue, feel free to block it; otherwise, hopefully it'll merge within an hour or so and we can restart; i'm going to afk and get out of this chair for a bit :) | 22:39 |
fungi | heh, was about to say i'll take a look | 22:46 |
fungi | but there's still time if i catch a problem with it | 22:46 |
fungi | not that i expect to | 22:46 |
fungi | ahh, yeah this was tested on the z-j change i reviewed | 22:47 |
clarkb | The upstream PR is maybe a bit easier to read | 23:00 |
clarkb | you can pick out the new stuff a bit easier in the diff there, though corvus annotated the vendored code with comments | 23:00 |
fungi | yep | 23:01 |
clarkb | less than 11 minutes zuul says | 23:32 |
opendevreview | Merged zuul/zuul master: Add ExistingDataWatch class https://review.opendev.org/c/zuul/zuul/+/799317 | 23:38 |
corvus | yay! | 23:38 |
corvus | hrm, promote failed; i'll re-enqueue that | 23:41 |
corvus | oh actually, it failed in the cleanup part; i think it's okay, i'll double check dockerhub | 23:41 |
clarkb | ok | 23:42 |
corvus | no, i think it needs to re-run; i'll re-enqueue | 23:43 |
clarkb | corvus: I think it may be old still | 23:43 |
clarkb | ya | 23:44 |
clarkb | I'll notify the openstack release team though it should be a noop since it is the weekend for them | 23:44 |
corvus | something's fishy; it failed again, and there's no change_799317_latest tag | 23:46 |
clarkb | huh did the cleanup possibly fail because the image wasn't uploaded in the first place? | 23:48 |
clarkb | no it seems we uploaded to docker hub | 23:51 |
clarkb | https://zuul.opendev.org/t/zuul/build/08da96babad640a29136acdbb2244b49/log/job-output.txt#11370 is where that should have happened | 23:51 |
corvus | it's almost like dockerhub isn't being consistent | 23:51 |
corvus | but the "Get manifest" task shouldn't have worked without that tag | 23:53 |
corvus | docker pull zuul/zuul:change_799317_latest works for me | 23:53 |
corvus | so basically, the tag actually exists in the registry, but it doesn't show up in the web ui and the api can't delete it | 23:54 |
clarkb | weird | 23:55 |
clarkb | that would explain the failure then I guess | 23:55 |
corvus | yeah; we could manually pull that tag on all the hosts and tag it as latest; that's not great though since it could restart into something else | 23:56 |
clarkb | https://status.docker.com/pages/533c6539221ae15e3f000031 indicates an active issue with email | 23:56 |
clarkb | I doubt that is related to the issue we are seeing | 23:56 |
clarkb | corvus: any idea if we can manually update latest on dockerhub? | 23:56 |
corvus | maybe smtp is their distributed consensus algorithm | 23:56 |
clarkb | or maybe we already tried that | 23:56 |
corvus | clarkb: yeah i think it should work; basically do what the job does without the 'delete' call | 23:57 |
clarkb | ya that | 23:57 |
corvus | we could also add a 'failed_when: false' to that delete method | 23:57 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!