opendevreview | James E. Blair proposed zuul/zuul-registry master: Build images on bionic https://review.opendev.org/c/zuul/zuul-registry/+/799728 | 00:28 |
opendevreview | James E. Blair proposed zuul/zuul-registry master: Add a restricted mode (read authentication required) https://review.opendev.org/c/zuul/zuul-registry/+/788389 | 00:46 |
opendevreview | James E. Blair proposed zuul/zuul-registry master: Make TLS optional https://review.opendev.org/c/zuul/zuul-registry/+/788390 | 00:46 |
opendevreview | James E. Blair proposed zuul/zuul-registry master: Add content-length headers and debug messages https://review.opendev.org/c/zuul/zuul-registry/+/791068 | 00:46 |
opendevreview | James E. Blair proposed zuul/zuul-registry master: Add missing size to image configs https://review.opendev.org/c/zuul/zuul-registry/+/799726 | 00:46 |
opendevreview | James E. Blair proposed zuul/zuul-registry master: Also include missing size attributes in layers https://review.opendev.org/c/zuul/zuul-registry/+/799729 | 00:53 |
*** iury|holiday is now known as iurygregory | 05:53 | |
opendevreview | Simon Westphahl proposed zuul/zuul master: Store tenant layout state in Zookeeper https://review.opendev.org/c/zuul/zuul/+/771461 | 06:52 |
boha[m] | <avass[m] "boha: apparently we are using th"> Thanks, that did the trick ;-). Is the task "debug" with hello message from the quick start guide supposed to work on windows? I might have messed something else up in the process, but I only see pre and post jobs running. | 07:27 |
*** jpena|off is now known as jpena | 07:46 | |
*** bhagyashris_ is now known as bhagyashris|ruck | 08:37 | |
*** rpittau|afk is now known as rpittau | 09:42 | |
avass[m] | boha: yes that should work, usually when tasks don't show up it's because they target a node that does not exist in inventory and the task is skipped | 11:27 |
*** jpena is now known as jpena|lunch | 11:39 | |
opendevreview | Simon Westphahl proposed zuul/zuul master: Fix race waiting for watches to be registered https://review.opendev.org/c/zuul/zuul/+/799797 | 12:09 |
opendevreview | Simon Westphahl proposed zuul/zuul master: dnm: try to reproduce race condition https://review.opendev.org/c/zuul/zuul/+/799798 | 12:21 |
*** jpena|lunch is now known as jpena | 12:43 | |
*** eliadcohen__ is now known as eliadcohen | 13:47 | |
opendevreview | Felix Edel proposed zuul/zuul master: Switch to ZooKeeper backed NodesProvisionedEvents https://review.opendev.org/c/zuul/zuul/+/799833 | 13:50 |
*** bhagyashris is now known as bhagyashris|ruck | 13:59 | |
*** bhagyashris_ is now known as bhagyashris|ruck | 14:24 | |
clarkb | corvus: Is the next set of v5 work finishing up the result event migration to zk and the storage of tenant configs in zk? | 15:29 |
corvus | clarkb: yes, there are some changes ready for review under "hashtag:sos" -- i plan on reviewing them today | 15:30 |
clarkb | why sos? | 15:30 |
corvus | scale out scheduler | 15:31 |
clarkb | that is a fun collision :) | 15:31 |
corvus | ;) | 15:31 |
corvus | there's one nodepool change in there too -- we're going to need a little more information in the node requests to account for the fact that the schedulers won't be tracking node requests in memory any more (and a scheduler may have to process a node request submitted by another scheduler) | 15:33 |
clarkb | https://review.opendev.org/q/hashtag:%2522sos%2522+(status:open) for anyone else wanting to see the changes. | 15:33 |
clarkb | I'll try to dig into those today as well | 15:34 |
corvus | so there's a change to basically add some extra data to the noderequest that is opaque to nodepool, but zuul can see it and know what node requests belong to what builds | 15:34 |
corvus | that's https://review.opendev.org/798746 -- it might be good to go ahead and get that merged and in a nodepool release so that the next zuul release can depend on it | 15:35 |
clarkb | I'll start there then | 15:35 |
clarkb | corvus: in that nodepool change we need nodepool to be aware of the data (but then ignore it) so that when it serializes it back into zk after updates we don't lose that information? | 15:42 |
corvus | clarkb: exactly | 15:43 |
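The "opaque data" idea discussed above can be sketched roughly like this (a hypothetical illustration, not nodepool's actual code; the field name `requestor_data` matches the change under review, but the serialization shape here is invented for the example): nodepool parses a node request, updates the fields it owns, and writes the request back without interpreting, or dropping, the requestor-supplied blob.

```python
import json

def update_request(raw, state):
    """Update a serialized node request's state, preserving other fields."""
    request = json.loads(raw)
    request["state"] = state  # a field nodepool owns and modifies
    return json.dumps(request)

# Zuul attaches data that nodepool never inspects:
original = json.dumps({
    "state": "requested",
    "requestor_data": {"build_set_uuid": "abc123", "job": "unit-tests"},
})
updated = update_request(original, "fulfilled")

# The round trip must not lose the opaque field -- that is the whole point:
assert json.loads(updated)["requestor_data"]["build_set_uuid"] == "abc123"
```

This is why nodepool needs to be aware of the field even though it ignores it: a deserialize/reserialize cycle that only knows its own keys would silently drop zuul's data.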
clarkb | https://review.opendev.org/c/zuul/nodepool/+/798746 it is actually a very straightforward change if anyone else has time to review it | 15:44 |
clarkb | tobiash[m]: tristanC ^ maybe? | 15:44 |
tobiash[m] | I already have it open ;) | 15:44 |
*** marios is now known as marios|out | 16:06 | |
*** rpittau is now known as rpittau|afk | 16:09 | |
opendevreview | Merged zuul/nodepool master: Add requestor_data field to node request https://review.opendev.org/c/zuul/nodepool/+/798746 | 16:45 |
*** jpena is now known as jpena|off | 16:52 | |
*** mgoddard- is now known as mgoddard | 16:54 | |
opendevreview | James E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind https://review.opendev.org/c/zuul/zuul-operator/+/785763 | 17:31 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: Fix error with multiple nodepool providers https://review.opendev.org/c/zuul/zuul-operator/+/799873 | 17:31 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test https://review.opendev.org/c/zuul/zuul-operator/+/799874 | 17:31 |
clarkb | https://review.opendev.org/c/zuul/zuul/+/799797 failed both unittest runs on test_cancel_starting_build. That change doesn't affect that particular test, so I will reapprove it when done with my review, but this may indicate a race or bug in that particular test | 17:55 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test https://review.opendev.org/c/zuul/zuul-operator/+/799874 | 18:04 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind https://review.opendev.org/c/zuul/zuul-operator/+/785763 | 18:04 |
clarkb | I've got `tox -e py38 -- --until-failure test_cancel_starting_build` running locally now | 18:05 |
corvus | clarkb: yeah swest is trying to get more info on that in https://review.opendev.org/799798 | 18:05 |
clarkb | we'll see if I can trip over it locally | 18:05 |
clarkb | oh cool | 18:05 |
opendevreview | James E. Blair proposed zuul/zuul master: dnm: try to reproduce race condition https://review.opendev.org/c/zuul/zuul/+/799798 | 18:07 |
corvus | i'll set up a local loop too | 18:07 |
clarkb | corvus: ok I caught one but I was running against master. Let me rerun on top of swest's change | 18:12 |
clarkb | corvus: do we maybe need to do similar to https://review.opendev.org/c/zuul/zuul/+/799797/1/tests/unit/test_zk.py and wait for our watch to be in place before cancelling the build? | 18:15 |
clarkb | we have the ordering correct right now (watch before cancel) but can the watch creation happen on the backend after creating the object? | 18:16 |
clarkb | corvus: I got one with the extra logging. Let me push it up somewhere | 18:18 |
clarkb | I need to take a break though, but can read through it after | 18:18 |
clarkb | corvus: test_cancel_starting_build.log in my homedir on the zuul scheduler | 18:21 |
clarkb | ok back in a bit | 18:21 |
clarkb | looking at that log the watcher seems to be in place first and even hits the "znode doesn't exist yet" path | 18:49 |
clarkb | I think I see it. The watch event fires, then we delete the znode, then we get the data via the watch event | 18:50 |
clarkb | I think I see the fix too. | 18:51 |
clarkb | it almost seems like if the delete races the watcher, the data watch function isn't called | 19:24 |
clarkb | which may be related to the issue we had to fix last week? | 19:24 |
corvus | clarkb: i'm back; i'll take a look at your log | 19:27 |
corvus | oh also my local run caught something | 19:28 |
clarkb | corvus: if you look at the log you can see the watch start, then later the cancel znode gets created, but very soon after it also gets deleted. By the time the watch handler tries to read the data from zk it gets a NoNodeError | 19:29 |
clarkb | corvus: I've got a change that rewrites the data watch function to try and handle that better but it is still failing occasionally so I'm adding more debugging in my log | 19:29 |
clarkb | and it seems like maybe we get called with a None event more often than I expected | 19:30 |
clarkb | oh wait | 19:31 |
clarkb | I need send_event? | 19:31 |
clarkb | reading the docs it seems send_event is only necessary on a childwatch | 19:32 |
corvus | clarkb: if we set two threading.Events, we can confirm that we have performed the initial get which sets the watch before we cancel the build, then the second call should be for the created event (which we should get even if the znode is deleted by the time we call 'get') | 19:33 |
clarkb | corvus: yup that is basically what my current fix does, but it tries to check the event type is CREATED too on the second event | 19:33 |
clarkb | and I think I had at least one case where that wasn't the case | 19:33 |
clarkb | I'll push up what I have now anyway I guess and we can improve it if necessary | 19:33 |
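The two-Event synchronization corvus describes can be sketched as a standalone model (hedged: `data_watch` mimics the `(data, stat, event)` callback signature kazoo's DataWatch uses, and `FakeEvent` is an invented stand-in for a kazoo WatchedEvent; kazoo's `EventType.CREATED` really is the string "CREATED"):

```python
import threading

# watch_registered: set on the initial callback (event is None), proving
# the zk watch is in place before the test cancels the build.
# cancel_seen: set when the watch later fires for the created znode.
watch_registered = threading.Event()
cancel_seen = threading.Event()

def data_watch(data, stat, event):
    if event is None:
        watch_registered.set()       # initial get; watch now registered
    elif event.type == "CREATED":    # kazoo's EventType.CREATED == "CREATED"
        cancel_seen.set()

# Simulated call sequence: registration first, then the created event.
data_watch(None, None, None)
assert watch_registered.wait(timeout=1)

class FakeEvent:                     # stand-in for kazoo's WatchedEvent
    type = "CREATED"

data_watch(b"", None, FakeEvent())
assert cancel_seen.wait(timeout=1)
```

The test would block on `watch_registered` before issuing the cancel, then block on `cancel_seen` before asserting the build was cancelled.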
clarkb | I'm up to 81 successful runs after adding more debug logging to try and determine why it wasn't a CREATED event | 19:34 |
clarkb | also I suppose it is possible I derped when running tests and I didn't :w properly | 19:36 |
corvus | heh :) | 19:39 |
opendevreview | Clark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build https://review.opendev.org/c/zuul/zuul/+/799798 | 19:40 |
clarkb | I reused the same change id as the extra logging change but cleaned it up a bit to make it more mergeable. I also removed the parent change dependencies as those are unrelated | 19:41 |
clarkb | feel free to move it back or restore the logging in that change etc | 19:41 |
clarkb | I'm running that test locally in a loop again to build confidence in what I pushed. It failed on me but I also did rebase and stuff and it is possible that tox caught something while the git repo was changing states | 19:42 |
clarkb | I'll leave git alone now | 19:42 |
corvus | clarkb: oh, that's not quite what i was imagining; let me flesh it out and i'll push up another | 19:43 |
clarkb | corvus: yours should be about the same, its just using the watcher state to track the first event in your example | 19:44 |
clarkb | *mine is just using the watcher state to track the first event in your example | 19:44 |
clarkb | ok mine just failed locally without any changes to disk going on in the repo at the time | 19:48 |
clarkb | I'm rerunning again with the logging put back in | 19:49 |
corvus | just working on the commit message now | 19:50 |
corvus | clarkb: i'm doubting my fix right now... | 19:55 |
corvus | clarkb: looking at your log again... i don't see a response to this transaction: | 19:56 |
corvus | 2021-07-07 11:13:40,043 kazoo.client DEBUG Sending request(xid=22): Exists(path='/test/uTMrboPF_19071_tests.unit.test_scheduler.TestScheduler.test_cancel_starting_build/zuul/build-requests/unzoned/e4268fef7b0541ca814ba540803f8d18/cancel', watcher=<bound method DataWatch._watcher of <kazoo.recipe.watchers.DataWatch object at 0x7f754c148fa0>>) | 19:56 |
corvus | then again, i also don't see that in a normal test run | 19:57 |
clarkb | corvus: ya so that seems to be in the first pass of the watch path | 19:58 |
clarkb | it first calls the GetData and that returns NoNodeError so it tries Exists next | 19:58 |
corvus | yep; and it's exists that actually sets the zk watch | 19:58 |
corvus | so seeing a response to that would give me a warm fuzzy | 19:59 |
clarkb | ya and if that never gets a response maybe we aren't hitting the first pass | 19:59 |
corvus | but i think it's a red herring -- i don't think kazoo logs responses to exists | 19:59 |
corvus | i think for the moment we must assume exists is called and correctly sets the watch | 20:00 |
clarkb | I'm going to try and add some richer logging to the data watch function | 20:03 |
clarkb | and see what it shows rather than just the kazoo logging | 20:03 |
clarkb | corvus: with the code I have pushed and only adding a self.log.debug(event) to the top of data_watch() it seems that data_watch() is only called once | 20:05 |
corvus | that should not be the case | 20:06 |
clarkb | ya it is definitely unexpected | 20:07 |
clarkb | I'm putting some unique strings in that logger so that I don't just get None back out (this way I'm confident that is the log line I see) | 20:08 |
corvus | i see it called 2x | 20:08 |
corvus | but i'm not running your code | 20:08 |
clarkb | I think the old code could fail in the 2x case too | 20:08 |
clarkb | my change should only fail if it is called 1x? but I'm trying to confirm that now | 20:08 |
clarkb | yup confirmed now | 20:09 |
clarkb | with my change when it fails it seems to fail when data_watch is only called once | 20:09 |
corvus | clarkb: which call do you get? | 20:10 |
clarkb | corvus: event is None so I think it is the initial setup call | 20:10 |
corvus | that makes sense. that one should be more-or-less synchronous, in that we shouldn't return from creating the DataWatch until that has happened. | 20:11 |
corvus | but that does sort of suggest something along the lines of "exists() did not set the watch" | 20:12 |
clarkb | Received EVENT: Watch(type=1, state=3, path='/test/kfSdnawG_25854_tests.unit.test_scheduler.TestScheduler.test_cancel_starting_build/zuul/build-requests/unzoned/b3160cca502446c291b4f6e8cd51456b/cancel') it does seem to get the created event notification at least | 20:14 |
clarkb | would it get that if it wasn't watching the node? | 20:14 |
corvus | nope -- so that makes me think more along the lines of "kazoo didn't map the event to the watcher" | 20:14 |
clarkb | Then the Delete for that path happens, then the GetData from the watch handler, then the exists | 20:14 |
clarkb | but then it doesn't seem to call our supplied function | 20:15 |
clarkb | corvus: I think the issue is in kazoo | 20:16 |
corvus | so there was a get_data callback? | 20:16 |
clarkb | https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/watchers.py#L205 initial_version == self._version because we have never seen any real data | 20:16 |
clarkb | corvus: yes I think it is getting to ^ that line then not calling our function again | 20:17 |
corvus | clarkb: i agree | 20:19 |
clarkb | kazoo is basically going "this state is the same as the previous state so I noop" | 20:19 |
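A rough standalone model of the suppression clarkb points at (explicitly not kazoo's actual source, just the shape of the version check at the linked line): DataWatch remembers the last znode version it saw, and when a create-then-delete race leaves both reads seeing "no node", the before and after versions match and the user callback is skipped.

```python
class ModelDataWatch:
    """Toy model of kazoo DataWatch's version comparison; not the real code."""

    def __init__(self, func):
        self._func = func
        self._version = None   # "no node" -- we never saw real data
        self._ran = False

    def _get_data(self, version):
        initial_version = self._version
        if not self._ran:
            self._ran = True
            self._version = version
            self._func(version)              # initial call always fires
        elif version != initial_version:
            self._version = version
            self._func(version)
        # else: same version as last time -> no-op, callback suppressed

calls = []
watch = ModelDataWatch(calls.append)
watch._get_data(None)   # initial get: NoNodeError, so version is None
watch._get_data(None)   # watch fired, but the znode was already deleted
assert calls == [None]  # only the initial call reached the test's handler
```

In other words the recipe treats "node absent before" and "node absent after a quick create/delete" as the same state, so the CREATED notification is effectively swallowed.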
corvus | clarkb: maybe for this we should just use a low-level exists call with a watch | 20:19 |
corvus | instead of a DataWatch | 20:19 |
corvus | we don't even need to handle reconnection events, it's a unit test | 20:20 |
clarkb | corvus: could we use a childwatch maybe? | 20:20 |
clarkb | on the parent side | 20:20 |
corvus | i think that's subject to race | 20:21 |
clarkb | Then just assume if that gets triggered the cancel node was created | 20:21 |
corvus | i think an exists with a watcher is exactly what we want | 20:21 |
corvus | it will get called when the node exists :) | 20:21 |
clarkb | ok I'll try that | 20:22 |
clarkb | any idea what the type of the watch method is there? | 20:23 |
corvus | clarkb: def _watcher(self, event): | 20:24 |
corvus | like that | 20:24 |
clarkb | that seems to be working. Let me put this together | 20:26 |
corvus | yeah, i've got it running in a loop locally too. i think it's going to be less code than before actually :) | 20:27 |
clarkb | yup, it's quite small | 20:27 |
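The exists-with-watch approach settled on above might look roughly like this (a sketch, not the merged change; `client` stands in for a kazoo `KazooClient`, whose `exists(path, watch=...)` call shape this follows, and `wait_for_znode` is an invented name):

```python
import threading

def wait_for_znode(client, path, timeout=30):
    """Block until `path` exists, using a one-shot exists() watch."""
    created = threading.Event()

    def _watcher(event):
        # The zk watch fires once; CREATED means the znode now exists.
        if event.type == "CREATED":
            created.set()

    # exists() registers the watch whether or not the znode is there yet;
    # if it already exists, there is nothing to wait for.
    if client.exists(path, watch=_watcher):
        return True
    return created.wait(timeout)
```

Because the watch is registered atomically by the exists() call itself, a znode that is created and deleted quickly still delivers its CREATED event, which is exactly what the DataWatch version check was swallowing.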
opendevreview | Clark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build https://review.opendev.org/c/zuul/zuul/+/799798 | 20:31 |
clarkb | something like that I think | 20:31 |
clarkb | oh i have an unused var in there one sec | 20:31 |
opendevreview | Clark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build https://review.opendev.org/c/zuul/zuul/+/799798 | 20:31 |
corvus | clarkb: i left 3 notes :) | 20:32 |
opendevreview | Clark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build https://review.opendev.org/c/zuul/zuul/+/799798 | 20:33 |
clarkb | corvus: ^ all three should be fixed in that ps | 20:33 |
corvus | clarkb: cool, i think we can probably +w that, and ping tobiash, swest, and felixedel so they're aware of that DataWatch behavior nuance | 20:34 |
clarkb | ++ | 20:34 |
clarkb | I'm up to 107 successful runs so far with that code | 20:35 |
avass[m] | today was a productive day. I just spent five hours trying to figure out why my new router couldn't get a dhcp lease. turns out the cable plugged into the wall was loose and it happened to work by a fluke when i plugged it in directly to my desktop... | 20:35 |
corvus | avass: wow | 20:35 |
clarkb | avass[m]: the best kind of bug | 20:35 |
fungi | that sure sounds like a wednesday | 20:36 |
avass[m] | best part is that we were two people debugging it and we even started reading the source code for udhcpc to figure out what was happening :) | 20:44 |
corvus | avass: if you had just called tech support, they would have asked "did you try unplugging it and plugging it back in?" :) | 20:45 |
avass[m] | oh right, they were also able to see my mac address (?) and saw that I got an ip address when I plugged it into my desktop. | 20:46 |
avass[m] | in my experience tech support is usually pretty good here as long as you describe the steps you've already taken | 20:47 |
avass[m] | at least we've got an answer to "How many software engineers do you need to fix a hardware problem" | 20:48 |
avass[m] | if we just ignore that I studied electrical engineering for a moment... :) | 20:49 |
clarkb | corvus: swest you may want to check my simplification suggestions on https://review.opendev.org/c/zuul/zuul/+/795985 I didn't -1 because the code isn't wrong, but I didn't +2 because I think the code will be a lot easier to follow with that simplification | 21:02 |
clarkb | but if you prefer I can go ahead and approve it as is and we can sort that out in followups | 21:02 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind https://review.opendev.org/c/zuul/zuul-operator/+/785763 | 21:04 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test https://review.opendev.org/c/zuul/zuul-operator/+/799874 | 21:04 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: Fix error with multiple nodepool providers https://review.opendev.org/c/zuul/zuul-operator/+/799917 | 21:04 |
corvus | clarkb: i think we can wait for swest to update that | 21:06 |
corvus | i agree that will make it easier to read | 21:06 |
avass[m] | So how close is v5 anyway? Do we have an approximate timeline? | 21:08 |
corvus | avass: on the order of months | 21:09 |
clarkb | progress is picking up though, it seems. I'm trying to dedicate more time to helping it along, at least via reviews | 21:11 |
corvus | what do folks think about bumping nodepool to 4.7.0, so that the next zuul version can be 4.7.0 and can depend on the same nodepool version? or should we just do 4.2.0 for nodepool? | 21:14 |
corvus | (this is because zuul will need the requestor_data nodepool change) | 21:16 |
clarkb | I don't think syncing up versions is super important. But if users expect things to line up maybe that is the best option | 21:16 |
fungi | we've not made attempts to sync up minor revisions in the past, but if there's going to be a zuul dependency on a new nodepool feature then maybe that's a good enough excuse to skip a few versions | 21:16 |
corvus | okay, maybe just 4.2.0 then? | 21:17 |
clarkb | wfm | 21:17 |
corvus | zuul-maint: how does this look for a nodepool release? commit 9757e3f1ceade7ca1a866edc95243f6e29fc9522 (HEAD -> master, tag: 4.2.0, origin/master) | 21:17 |
clarkb | One thing that comes to mind is that dib is likely to get a new release if it hasn't already to fix centos builds iirc. | 21:19 |
fungi | yeah, that matches origin/master for me, and i agree there are quite a few feature changes in there since the last tag (4.1.0) so looks correct | 21:19 |
clarkb | might want to bump that dep too if a new dib version exists /me looks | 21:19 |
clarkb | a new version does not exist. Lets do 4.2.0 /me checks the sha | 21:20 |
clarkb | 9757e3f1ceade7ca1a866edc95243f6e29fc9522 at 4.2.0 looks correct to me | 21:20 |
corvus | k, i'll give that some more time in case others want to weigh in, then i'll push in a bit | 21:21 |
avass[m] | Syncing versions between zuul and nodepool would make things slightly easier, but i assume there's a release note to tell people which versions are compatible and I don't see any reason why new installations would want to run an older version of zuul or nodepool. So that doesn't seem like a big deal to me | 21:33 |
avass[m] | Unless you've got multiple zuuls that use the same nodepools | 21:34 |
corvus | it's a one way dependency, so that's not a problem | 21:35 |
corvus | (upgrade nodepool first before upgrading any zuuls) | 21:35 |
avass[m] | Ah yeah | 21:35 |
fungi | can multiple individual schedulers share a launcher? | 21:35 |
fungi | i didn't even realize that was possible | 21:36 |
avass[m] | I just assumed that's possible, but maybe there would be key collisions if someone tried that | 21:37 |
corvus | fungi: yes, but i think we may have inadvertently made that more difficult/impossible due to the v5 work. i don't think there's a way to separate the /zuul hierarchy from the nodepool root. | 21:37 |
corvus | but that's not too hard to fix | 21:37 |
corvus | (it's not a structural problem, more of an oversight) | 21:37 |
fungi | ahh, yeah i guess as long as they all shared the same zk cluster | 21:37 |
corvus | in v3, only nodepool was in zk, so you just pointed X zuuls at the same chroot in zk; zuul's data was all local to that zuul | 21:38 |
corvus | but now zuul data are stored in /zuul under the same chroot where /nodepool is | 21:38 |
corvus | so we probably need to either make the nodepool zk connection a second connection, or add an extra path element for zuul | 21:39 |
corvus | since no one has complained about that, i'm guessing there are zero such users, so it's probably not our most pressing concern | 21:40 |
corvus | (zuul's multi-tenancy sort of mitigates that) | 21:41 |
opendevreview | Merged zuul/zuul master: Fix data watch race in test_cancel_starting_build https://review.opendev.org/c/zuul/zuul/+/799798 | 21:56 |
opendevreview | Merged zuul/zuul master: Fix race waiting for watches to be registered https://review.opendev.org/c/zuul/zuul/+/799797 | 21:56 |
corvus | pushed nodepool 4.2.0 | 22:26 |
fungi | thanks! | 22:28 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test https://review.opendev.org/c/zuul/zuul-operator/+/799874 | 22:46 |
opendevreview | James E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test https://review.opendev.org/c/zuul/zuul-operator/+/799874 | 23:26 |
corvus | nodepool release announcement sent | 23:31 |
corvus | swest: https://review.opendev.org/799463 approved but has a +2 comment from clarkb | 23:45 |
clarkb | it's a super minor thing | 23:47 |
clarkb | corvus: can you check my comments on https://review.opendev.org/c/zuul/zuul/+/799464 particularly the third one? | 23:58 |
clarkb | tl;dr is I'm not sure that will properly handle tenant deletion | 23:58 |
clarkb | the old code made a new abide which created a new abide.tenants OrderedDict but now we reuse the abide and I think we never discard old tenants that go away | 23:59 |
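The concern can be illustrated with a toy dict (hypothetical names; the real structure is `abide.tenants`, and the real reconfiguration code is more involved): rebuilding the container from scratch drops removed tenants for free, whereas reusing it requires an explicit prune step.

```python
# abide.tenants was previously rebuilt from scratch; now it is reused.
abide_tenants = {"tenant-a": object(), "tenant-b": object()}

# A reconfiguration in which tenant-b was removed from the tenant config:
new_config = {"tenant-a": object()}

# Reusing the dict means deleted tenants linger unless pruned explicitly:
for name in list(abide_tenants):
    if name not in new_config:
        del abide_tenants[name]
abide_tenants.update(new_config)

assert "tenant-b" not in abide_tenants
```

Without the prune loop, `tenant-b` would survive every subsequent reconfiguration, which is the behavior clarkb is flagging in the review.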
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!