*** iurygregory is now known as iurygregory|holiday | 02:26 | |
opendevreview | Merged opendev/zone-opendev.org master: Cleanup etherpad DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880169 | 02:47 |
opendevreview | Ching Kuo proposed opendev/system-config master: [dnm] Test Hound Build https://review.opendev.org/c/opendev/system-config/+/880912 | 02:54 |
opendevreview | Ian Wienand proposed opendev/zone-opendev.org master: Add Jammy refresh NS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880577 | 03:06 |
opendevreview | Ian Wienand proposed opendev/zone-opendev.org master: Remove old nameservers https://review.opendev.org/c/opendev/zone-opendev.org/+/880709 | 03:06 |
ianw | genekuo: ^ i noticed you sent that back in and i quickly pulled it up in my browser | 03:32 |
ianw | it looks broken to me in both firefox and chrome | 03:34 |
ianw | https://104.130.253.48/ | 03:34 |
ianw | so it's not an artifact of selenium, nor, it would seem, of python | 03:35 |
ianw | https://github.com/hound-search/hound/issues/453 same issue | 03:48 |
genekuo | ah, ok, I was only searching with my browser last time and didn't open the advanced panel. | 03:49 |
genekuo | In this case shall we postpone the update? | 03:49 |
ianw | well we definitely don't want to pull in this version of hound | 03:50 |
opendevreview | Ching Kuo proposed opendev/system-config master: Update accessbot to Use Python 3.11 Base Images https://review.opendev.org/c/opendev/system-config/+/881161 | 03:54 |
ianw | genekuo: ok, here's something weird | 03:57 |
ianw | i still have the window up from the test run | 03:57 |
ianw | <div data-reactid=".0.0.1.0" style="height: 0px; padding: 0px;" id="adv"> | 03:59 |
ianw | note that's id="adv" -- not "advanced" as changed in https://github.com/hound-search/hound/commit/d25b221872426b03d9be8cf6924327e5eab6c314 | 03:59 |
ianw | if i change that id to "advanced" in the inspector, it looks right | 03:59 |
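(For anyone reproducing this outside the CI job, a minimal selenium sketch along these lines shows the same thing. It assumes a local houndd on its default port 6080 and a local geckodriver; both are assumptions, not details from the log.)

    # Minimal sketch: check which id the advanced panel renders with.
    # Assumes a local houndd on its default port 6080 and a local
    # geckodriver install; adjust the URL for a real deployment.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("http://localhost:6080/")
        for panel_id in ("adv", "advanced"):
            try:
                el = driver.find_element(By.ID, panel_id)
                # With the stale UI assets, the element still renders as
                # id="adv", so css written for "#advanced" never matches.
                print(panel_id, "found, style:", el.get_attribute("style"))
            except NoSuchElementException:
                print(panel_id, "not present")
    finally:
        driver.quit()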
genekuo | seems like the run has ended | 04:05 |
genekuo | Let me try to run hound locally and see if I can fix it | 04:05 |
ianw | i'm 99.99% sure it's that adv change upstream | 04:26 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] build houndd directly https://review.opendev.org/c/opendev/system-config/+/881163 | 04:52 |
ianw | genekuo: ^ it's really weird ... when i build it locally it seems to look alright? i wonder if "go get" is messing something up? | 04:53 |
ianw | https://review.opendev.org/c/opendev/system-config/+/880580?tab=change-view-tab-header-zuul-results-summary has now hit Verified -2 twice with what seems like nodes going away | 05:04 |
ianw | huh, it looks like that ?tab= is also new along with the ?usp= | 05:04 |
*** amoralej|off is now known as amoralej | 06:08 | |
genekuo | ianw: the local build doesn't work for me, the advanced tab is open | 07:00 |
ianw | genekuo: i think it's something to do with "go get" -- https://review.opendev.org/c/opendev/system-config/+/881163 looks good. perhaps it is grabbing some sort of cached component? | 07:11 |
ianw | we should build in a separate layer, feel free to take over 881163 if you're interested | 07:11 |
genekuo | got it, not sure why local build doesn't work for me though | 07:18 |
ianw | did you "go get" ... I just typed "make". may also be a go version thing? | 07:23 |
opendevreview | Ching Kuo proposed opendev/system-config master: [wip] build houndd directly https://review.opendev.org/c/opendev/system-config/+/881163 | 07:26 |
genekuo | ianw: I ran go build directly instead of using make | 07:27 |
opendevreview | Ian Wienand proposed openstack/project-config master: Indent Gerrit ACL options https://review.opendev.org/c/openstack/project-config/+/879906 | 07:29 |
opendevreview | Ian Wienand proposed openstack/project-config master: tools/normalize_acl.py: Add some human readable output https://review.opendev.org/c/openstack/project-config/+/880898 | 07:29 |
ianw | genekuo: weird ... make does do some stuff to update ui/bindata.go which is where all the bits are stuffed as gzipped strings ... maybe that has something to do with it? | 07:42 |
opendevreview | Merged opendev/system-config master: dns: abstract names https://review.opendev.org/c/opendev/system-config/+/880580 | 07:43 |
genekuo | ianw: seems like that's the issue, I did one build without running make ui/bindata.go and one after running it, and the issue is resolved in the latter build | 07:45 |
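(The mechanism here: hound's Makefile regenerates ui/bindata.go from the files under ui/, snapshotting them into generated source as gzipped strings, so a bare "go build" compiles whatever stale bindata.go is already on disk. A rough Python analogy of the pattern, purely illustrative since the real generator is go-bindata:)

    # Purely illustrative analogy -- hound uses go-bindata, not this.
    # Assets are snapshotted into generated source at generation time,
    # so editing ui/ without regenerating the bindata file keeps
    # shipping the old bytes.
    import base64
    import gzip
    import pathlib

    def generate(asset_dir: str, out_module: str) -> None:
        """Embed every file under asset_dir into a generated module."""
        lines = ["ASSETS = {"]
        for path in sorted(pathlib.Path(asset_dir).rglob("*")):
            if path.is_file():
                blob = base64.b64encode(gzip.compress(path.read_bytes()))
                lines.append(f"    {str(path)!r}: {blob!r},")
        lines.append("}")
        pathlib.Path(out_module).write_text("\n".join(lines) + "\n")

    def load(assets: dict, name: str) -> bytes:
        """What the server does at runtime: serve the embedded copy."""
        return gzip.decompress(base64.b64decode(assets[name]))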
opendevreview | Ching Kuo proposed opendev/system-config master: [wip] build houndd directly https://review.opendev.org/c/opendev/system-config/+/881163 | 07:48 |
frickler | still more interesting zuul behavior on https://review.opendev.org/c/openstack/releases/+/878864 even after the gerrit side of things seems to have gotten resolved earlier | 07:52 |
ianw | genekuo: i filed https://github.com/hound-search/hound/pull/456 ... the project isn't very active so i don't hold out a lot of hope, but we'll see if anyone says anything. i definitely do get a diff in the hound.js binary bit | 07:53 |
genekuo | If it gets merged in a few days, I'd prefer to download the built binaries instead of building them ourselves. | 07:57 |
ianw | i don't think we were ever using upstream binaries as such with "go get", which i think is also deprecated? honestly building it like this seems ok to me, it's not a lot of maintenance overhead | 07:59 |
genekuo | I see | 08:01 |
genekuo | installing binaries with go get was deprecated and dropped in go 1.18, in favor of go install | 08:01 |
frickler | infra-root: looks like we may be having two stuck jobs again? one of them is at the head of the gate queue, effectively blocking integrated merges https://review.opendev.org/881142 | 09:20 |
*** dhill is now known as Guest11773 | 11:21 | |
fungi | frickler: on 878864 i wonder if zuul cached the result, and since the votes hadn't changed (only the rules for what those votes mean), maybe it didn't re-evaluate submittability until the next activity after the cache aged out | 11:23 |
fungi | looks like the swift-tox-func-encryption-py38 build for 881142 is waiting for a node assignment, i'll see if i can find where it's ended up | 11:27 |
fungi | node 0033802341 was assigned for it at 23:39:12z | 11:35 |
fungi | er, no, i misread the log; that was for a related change | 11:41 |
fungi | nr 200-0021026179 (a single ubuntu-focal node) was issued for it at 22:51:43z | 11:42 |
fungi | 2023-04-20 23:01:59,804 DEBUG nodepool.driver.NodeRequestHandler[nl01.opendev.org-PoolWorker.rax-ord-main-48b69e922e3f4d33a1c2ea0aa9544520]: [e: f09fe23c7aee43f0bae5a32c33f9bdac] [node_request: 200-0021026179] Accepting node request | 11:44 |
fungi | that was the second launcher to attempt to build it. first one was ovh-gra1 on nl04 (accepted at 22:51:46) | 11:46 |
fungi | i see now what corvus meant by a nodepool change which increases log chattiness | 11:48 |
fungi | the last node i see it trying to build to satisfy that nr was 0033802172, and i can see in the nl01 debug logs where it decides to delete it, but i can't find any corresponding launch failure logged | 12:01 |
fungi | 2023-04-20 23:09:13,590 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0033802172 (state: building, allocated_to: 200-0021026179) | 12:03 |
fungi | i also don't see any further mention of 200-0021026179 in the log after it deleted that node | 12:04 |
fungi | i need to have some coffee and attend to deferred morning activities, maybe someone with better eyes can spot what i'm missing in the log | 12:15 |
fungi | otherwise i suppose i can trigger a thread dump on nl01 and try to see if there's a hung thread for this, but beyond that we're probably looking at restarting the launcher to free up the lock on that nr? | 12:16 |
*** amoralej is now known as amoralej|lunch | 12:43 | |
opendevreview | Ching Kuo proposed opendev/system-config master: [wip] build houndd directly https://review.opendev.org/c/opendev/system-config/+/881163 | 12:50 |
opendevreview | Ching Kuo proposed opendev/system-config master: [wip] build houndd directly https://review.opendev.org/c/opendev/system-config/+/881163 | 13:07 |
*** amoralej|lunch is now known as amoralej | 13:49 | |
opendevreview | Ching Kuo proposed opendev/system-config master: Build houndd Directly https://review.opendev.org/c/opendev/system-config/+/881163 | 14:05 |
opendevreview | Merged openstack/project-config master: Stop using Storyboard for ovn-bgp-agent https://review.opendev.org/c/openstack/project-config/+/880938 | 14:13 |
clarkb | fungi: corvus: looking at 200-0021026179 and 0033802172, it looks like there was a building node, then nodepool was restarted to deploy the update to wait limits, which switched the node to deleting. As far as I can tell it does manage to delete the node from the cloud and zookeeper, but then it never seems to perform another launch attempt for that node request. | 14:50 |
clarkb | fungi: did you do a thread dump yet? | 14:50 |
clarkb | I think we want to see where the request handler is for that request | 14:50 |
clarkb | I think this is different than the previous restart induced lockup because that occurred in the node deletion process which appears to have completed in this instance | 14:51 |
clarkb | the bugfix for suddenly non-public gitea APIs is in their merge queue. I don't think there is a fix yet for basic auth needing to be forced. However, I think that bug was preexisting, as we already have a bunch of force settings in our ansible tasks. | 14:56 |
fungi | clarkb: i didn't do a thread dump yet, but the restart is a key insight i missed | 14:57 |
fungi | i'll trigger one now | 14:57 |
fungi | i got distracted by an unrelated openstack release job oddity | 14:58 |
fungi | running this on nl01 now: `sudo kill -USR2 2807717;sleep 60;sudo kill -USR2 2807717` | 14:59 |
corvus | i think the next question is whether that request is locked | 14:59 |
fungi | so we should have two thread dumps 60s apart in the debug log momentarily | 14:59 |
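(For context, the usual python pattern behind a SIGUSR2 stack dump is walking sys._current_frames() and formatting a traceback per thread; a minimal sketch in that style, not nodepool's actual handler:)

    # Minimal sketch of a SIGUSR2 stack-dump handler, in the style of
    # what zuul/nodepool services install; not the actual nodepool code.
    import signal
    import sys
    import threading
    import traceback

    def dump_stacks(signum, frame):
        # Map thread ids to names so the dump is readable.
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, stack in sys._current_frames().items():
            print(f"Thread {names.get(ident, '?')} ({ident}):")
            print("".join(traceback.format_stack(stack)))

    signal.signal(signal.SIGUSR2, dump_stacks)

    # `kill -USR2 <pid>` now makes the process print every thread's
    # stack; two dumps 60s apart let you spot a thread that has not
    # moved between them, i.e. one that is likely stuck.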
corvus | i'll try to figure that out | 15:00 |
fungi | thanks! | 15:00 |
clarkb | I suspect the reason we've seen restarts induce these problems is that we'll have a fairly large number of building nodes that suddenly all go to delete, and that probably creates a thundering herd of zookeeper contention? | 15:02 |
clarkb | that just makes it far more likely we'll catch existing races than during normal operation, I bet | 15:03 |
corvus | it is locked by nl01 | 15:03 |
fungi | first stack dump starts at 2023-04-21 14:59:36,245 in the log, second at 15:00:36,295 | 15:04 |
corvus | there is no log message for the second lock, and looking at the code, the way for that to happen is via the cleanupLostRequests method | 15:08 |
corvus | which is what we want to happen here | 15:08 |
corvus | wow that thread dump for cleanupLostRequests looks exactly like the thread deadlock we just fixed | 15:10 |
fungi | cleaned up stack dump output is now in nl01:~fungi/stack_dump.2023-04-21 in case it's helpful | 15:11 |
corvus | i see the problem | 15:30 |
corvus | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/statemachine.py#L341-L342 | 15:31 |
corvus | we never unlock a node when we delete it, because deleting it unlocks it from zk's pov. but that means we never release the local thread lock. | 15:31 |
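(A simplified sketch of the pattern corvus is describing, not the actual nodepool code: the node pairs a zookeeper lock, released implicitly when the znode is deleted, with a process-local threading lock that only an explicit unlock releases:)

    # Simplified sketch of the deadlock, not nodepool's actual code.
    import threading

    class Node:
        """A node whose lock pairs a zookeeper lock with a local one."""
        def __init__(self, node_id):
            self.id = node_id
            self._local_lock = threading.Lock()

        def lock(self):
            self._local_lock.acquire()
            # ... the real code also takes the zookeeper lock here ...

        def unlock(self):
            # ... and releases the zookeeper lock here ...
            self._local_lock.release()

    def delete_node(zk_delete, node):
        node.lock()
        zk_delete(node.id)  # znode gone, so the zk-side lock is gone too
        # Pre-fix behavior: returning here looked safe from zk's point
        # of view, but _local_lock was never released, so a later
        # cleanupLostRequests-style pass blocked forever on node.lock().
        node.unlock()  # the fix: always release the local lock as well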
fungi | aha | 15:36 |
corvus | fix proposed in https://review.opendev.org/881237 and i also wrote https://review.opendev.org/881238 in response to the log deluge | 15:39 |
clarkb | both lgtm thanks! | 15:40 |
corvus | resolution of the immediate issue should be the same as last time: we can restart the launcher at any point now | 15:40 |
clarkb | though that may cause us to trip over the same thing? | 15:40 |
clarkb | but it would at least get that change moving, presumably | 15:40 |
corvus | yes, with a new set of deleted nodes | 15:40 |
corvus | if we can wait an hour or two, restarting with those 2 changes might be nice | 15:41 |
corvus | if we're in more of a hurry, we could set max_servers to 0 on those regions before shutting down, then we will have no nodes to delete on startup. | 15:41 |
clarkb | and the reason this wasn't an issue until we restarted is that, for whatever reason, we don't have multiple threads trying to grab that node lock under normal circumstances (I guess that makes sense, since the launch thread normally has the lock and does all of the processing, but with bulk deletion like that we trigger cleanups?) | 15:42 |
clarkb | I think we can probably wait if fungi is able to review the changes nowish | 15:42 |
fungi | i already approved the first one and am looking over the second now | 15:42 |
clarkb | excellent | 15:43 |
corvus | clarkb: yeah it's only if we try to lock a recently deleted node; so it could happen in other cases, but the restart increases the probability | 15:43 |
fungi | yep, both lgtm | 15:43 |
*** amoralej is now known as amoralej|off | 15:56 | |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Move containerfile setting in container build https://review.opendev.org/c/zuul/zuul-jobs/+/881252 | 17:20 |
clarkb | once corvus' changes to move zuul, nodepool, zuul-registry, zuul-preview etc to quay.io land we'll want to update where we pull them from in opendev | 17:31 |
corvus | the nodepool changes from earlier have merged and promoted, so nl can be pulled and restarted at will | 17:54 |
corvus | i need to afk for a bit now so don't plan on doing that myself | 17:55 |
fungi | our hourly jobs do that anyway, right? | 17:59 |
fungi | so are in theory kicking off in mere seconds | 17:59 |
clarkb | the first change actually deployed a while ago and got things moving again. | 18:36 |
clarkb | The second one should auto deploy and may have already done so ... yup, it has | 18:36 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/881277 | 20:16 |
clarkb | infra-root I've completed an initial pass of container image syncs from docker hub to quay.io | 20:24 |
clarkb | I think we have two big next steps: switching over an image or a few at a time to the new image build jobs and updating our prod consumption location, and updating the new image build jobs to create the repository if it doesn't exist (not necessary for those I just synced) | 20:25 |
clarkb | Zuul is also working through some of this right now so I'll focus on helping there since what we fix there should be applicable to us | 20:25 |
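(For the "create the repository if it doesn't exist" step, a hedged sketch of the shape against quay's v1 REST API; the token and names are placeholders, and the real logic is the ensure-quay-repo role under review in 877834:)

    # Hedged sketch of "create the quay.io repository if missing" via
    # quay's v1 REST API. QUAY_TOKEN and the names are placeholders;
    # the real implementation is the ensure-quay-repo role (877834).
    import requests

    QUAY_API = "https://quay.io/api/v1"
    QUAY_TOKEN = "..."  # placeholder OAuth application token

    def ensure_repo(namespace: str, name: str) -> None:
        headers = {"Authorization": f"Bearer {QUAY_TOKEN}"}
        resp = requests.get(
            f"{QUAY_API}/repository/{namespace}/{name}", headers=headers)
        if resp.status_code == 200:
            return  # already exists
        resp = requests.post(f"{QUAY_API}/repository", headers=headers,
                             json={"namespace": namespace,
                                   "repository": name,
                                   "visibility": "public",
                                   "description": ""})
        resp.raise_for_status()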
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Pin virtualenv in tox environments https://review.opendev.org/c/zuul/zuul-jobs/+/881279 | 20:40 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Pin virtualenv in tox environments https://review.opendev.org/c/zuul/zuul-jobs/+/881279 | 20:43 |
opendevreview | Merged zuul/zuul-jobs master: Pin virtualenv in tox environments https://review.opendev.org/c/zuul/zuul-jobs/+/881279 | 20:57 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role https://review.opendev.org/c/zuul/zuul-jobs/+/877834 | 21:31 |
opendevreview | Merged zuul/zuul-jobs master: Move containerfile setting in container build https://review.opendev.org/c/zuul/zuul-jobs/+/881252 | 21:34 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role https://review.opendev.org/c/zuul/zuul-jobs/+/877834 | 22:04 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role https://review.opendev.org/c/zuul/zuul-jobs/+/877834 | 22:08 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role https://review.opendev.org/c/zuul/zuul-jobs/+/877834 | 22:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Base jobs for quay.io image publishing https://review.opendev.org/c/opendev/system-config/+/881285 | 22:20 |
clarkb | infra-root ^ feedback on those two changes would be much appreciated. I think it sketches out what our publishing process will look like for opendev images to quay. But if I've made silly mistakes I'd love that feedback | 22:20 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/881277 | 22:26 |
clarkb | The other thing I've realized is that some images like gerritbot are outside of system-config so we may end up needing to do this a few times | 22:27 |
clarkb | but one step at a time | 22:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Base jobs for quay.io image publishing https://review.opendev.org/c/opendev/system-config/+/881285 | 22:55 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/881277 | 23:33 |
Clark[m] | wow that last patchset might actually work. I'm not sure I'm a fan of that approach but if others don't mind it ... | 23:39 |
clarkb | er that was meant for the zuul matrix room | 23:39 |