@clarkb:matrix.org | https://sdv.eclipse.org/ was announced today. Considering zuul's existing use in the space I thought people here might be interested | 00:06 |
---|---|---|
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 00:29 | |
- [zuul/zuul] 814281: Remove toDict from FrozenJob https://review.opendev.org/c/zuul/zuul/+/814281 | ||
- [zuul/zuul] 814243: Make FrozenJob a ZKObject https://review.opendev.org/c/zuul/zuul/+/814243 | ||
- [zuul/zuul] 814329: Implement frozen job serialization/deserialization https://review.opendev.org/c/zuul/zuul/+/814329 | ||
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815911: Remove false-negative in zk test https://review.opendev.org/c/zuul/zuul/+/815911 | 01:44 | |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 815343: CI image requires consistency cleanup https://review.opendev.org/c/zuul/zuul/+/815343 | 01:44 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 814679: Store FrozenJob data in separate znodes https://review.opendev.org/c/zuul/zuul/+/814679 | 02:09 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814544: Cleanup stale items after refreshing a pipeline https://review.opendev.org/c/zuul/zuul/+/814544 | 06:03 | |
-@gerrit:opendev.org- Simon Westphahl proposed: | 06:27 | |
- [zuul/zuul] 815787: Refresh pipelines in tests when settled https://review.opendev.org/c/zuul/zuul/+/815787 | ||
- [zuul/zuul] 815788: wip: Allow refreshing project branches https://review.opendev.org/c/zuul/zuul/+/815788 | ||
- [zuul/zuul] 815278: DNM: execute tests with two schedulers https://review.opendev.org/c/zuul/zuul/+/815278 | ||
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 814996: Make the ConfigLoader work independently of the Scheduler https://review.opendev.org/c/zuul/zuul/+/814996 | 08:04 | |
-@gerrit:opendev.org- Zuul merged on behalf of Felix Edel: [zuul/zuul] 815844: Provide zstat version when updating Node Request in ZooKeeper https://review.opendev.org/c/zuul/zuul/+/815844 | 08:25 | |
@ecsantos:matrix.org | Hello folks | 08:52 |
@ecsantos:matrix.org | I have a question regarding Zuul playbooks and roles | 08:52 |
@ecsantos:matrix.org | For example, this line on [1]: ` path: "{{ cached_repos_root }}/{{ zj_project.canonical_name }}"` | 08:52 |
[1] https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/tasks/main.yaml#L3 | ||
@ecsantos:matrix.org | Where are these kinds of variables declared (such as `zj_project`)? | 08:53 |
@avass:vassast.org | ecsantos: `zj_project` is set by `loop_control { loop_var: zj_project }` and is there because of the loop var policy: https://www.zuul-ci.org/docs/zuul-jobs/policy.html#loops-in-roles | 09:49 |
@avass:vassast.org | ecsantos: `cached_repos_root` is set as a default var in the role: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/defaults/main.yaml | 09:49 |
-@gerrit:opendev.org- Simon Westphahl proposed: | 10:11 | |
- [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/814772 | ||
- [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/814571 | ||
- [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/814570 | ||
- [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/814862 | ||
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773 | ||
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899 | ||
- [zuul/zuul] 815309: Cancel jobs before resetting builds https://review.opendev.org/c/zuul/zuul/+/815309 | ||
- [zuul/zuul] 815111: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | ||
- [zuul/zuul] 815276: Add change queues to change queue managers https://review.opendev.org/c/zuul/zuul/+/815276 | ||
- [zuul/zuul] 815277: Refresh pipelines before checking for empty queues https://review.opendev.org/c/zuul/zuul/+/815277 | ||
- [zuul/zuul] 815428: Fix GitHub PR (de-)serialization https://review.opendev.org/c/zuul/zuul/+/815428 | ||
- [zuul/zuul] 815429: Add missing logger to Build and BuildSet classes https://review.opendev.org/c/zuul/zuul/+/815429 | ||
- [zuul/zuul] 815450: Create bundle items during queue deserialization https://review.opendev.org/c/zuul/zuul/+/815450 | ||
- [zuul/zuul] 815495: Fix Gerrit change (de-)serialization https://review.opendev.org/c/zuul/zuul/+/815495 | ||
- [zuul/zuul] 815616: Only reset the pipeline state if needed https://review.opendev.org/c/zuul/zuul/+/815616 | ||
- [zuul/zuul] 815617: Ensure same layout UUID across schedulers https://review.opendev.org/c/zuul/zuul/+/815617 | ||
- [zuul/zuul] 815787: Refresh pipelines in tests when settled https://review.opendev.org/c/zuul/zuul/+/815787 | ||
- [zuul/zuul] 815788: wip: Allow refreshing project branches https://review.opendev.org/c/zuul/zuul/+/815788 | ||
- [zuul/zuul] 815278: DNM: execute tests with two schedulers https://review.opendev.org/c/zuul/zuul/+/815278 | ||
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 10:11 | |
- [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/815154 | ||
- [zuul/zuul] 815565: Remove unecessary assignment in re-enqueue https://review.opendev.org/c/zuul/zuul/+/815565 | ||
- [zuul/zuul] 815744: Use a metaclass to deserialize event objects https://review.opendev.org/c/zuul/zuul/+/815744 | ||
- [zuul/zuul] 815764: Add a pipeline change list object to ZK https://review.opendev.org/c/zuul/zuul/+/815764 | ||
- [zuul/zuul] 815916: Reduce use of OrderedDict in PipelineState https://review.opendev.org/c/zuul/zuul/+/815916 | ||
- [zuul/zuul] 815917: Update Pipeline for symmetry https://review.opendev.org/c/zuul/zuul/+/815917 | ||
@westphahl:matrix.org | zuul-maint: reordered the sos stack a bit since 814570 had legit test failures. 814772 is now the first change in the stack | 10:21 |
@mordred:inaugust.com | swest: it's so exciting seeing all of that | 10:38 |
@westphahl:matrix.org | mordred: yeah I'm looking forward to having multiple schedulers running. we don't have much runway left running with a single scheduler | 10:42 |
@ecsantos:matrix.org | > <@avass:vassast.org> ecsantos: `zj_project` is set by `loop_control { loop_var: zj_project }` and is there because of the loop var policy: https://www.zuul-ci.org/docs/zuul-jobs/policy.html#loops-in-roles | 10:46 |
Albin Vass: Thanks! That made it clearer for me | ||
@avass:vassast.org | ecsantos: no problem! | 10:50 |
@ecsantos:matrix.org | Sorry to bother but I have one more question :p | 10:56 |
@ecsantos:matrix.org | I'm hitting the following error on multiple plays (e.g., "Check growroot logs", "configure-mirrors : Update apt cache", "persistent-firewall : List current ipv4 rules"): `Timeout exception waiting for the logger. Please check connectivity to [<IP address>:19885]` | 10:56 |
@ecsantos:matrix.org | What is this "logger" Zuul is referring to? | 10:56 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/814772 | 11:44 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/814571 | 11:48 | |
@fungicide:matrix.org | ecsantos: it's the daemon which streams the ansible output (console log), started by this role: https://www.zuul-ci.org/docs/zuul-jobs/general-roles.html#role-start-zuul-console | 12:47 |
@fungicide:matrix.org | does your job maybe adjust firewall rules to block access to 19885/tcp? | 12:48 |
@fungicide:matrix.org | specifically, the executor will need to be able to reach that port | 12:50 |
@fungicide:matrix.org | also i notice we don't do a great job of documenting that, the only actual documentation reference to that port number seems to be in https://www.zuul-ci.org/docs/zuul/howtos/nodepool_static.html#node-requirements | 12:54 |
@fungicide:matrix.org | if i get a moment later today i'll try to leave some breadcrumbs in the start-zuul-console role readme and the components chapter of the zuul documentation | 12:55 |
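One quick way to test fungi's suggestion is a plain TCP connect from the executor to port 19885 on the job node. The following is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper name and is not part of Zuul.

```python
import socket


def can_reach(host: str, port: int = 19885, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running `can_reach("<node address>")` on the executor host should return True if the console daemon is listening and no firewall is in the way. A False result only says the connection failed; it does not say why.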
@ecsantos:matrix.org | > <@fungicide:matrix.org> does your job maybe adjust firewall rules to block access to 19885/tcp? | 13:08 |
fungi: I'm running the same base job as OpenDev, same playbooks and roles, only difference is that I upload my logs to a logserver and not Swift | ||
@ecsantos:matrix.org | For example, the play `Check growroot logs` for me results in `localhost | Timeout exception waiting for the logger` for the localhost, and `controller | ok` for my Ubuntu node | 13:10 |
@ecsantos:matrix.org | Inspecting my logs I see | 13:15 |
```
2021-10-29 08:25:30.915301 | TASK [start-zuul-console : Start zuul_console daemon.]
2021-10-29 08:25:31.889466 | controller | ok
```
It's only running for the Ubuntu node and not the localhost. Maybe I need to run this role on my localhost as well (Zuul is running on localhost)? | ||
@fungicide:matrix.org | ecsantos: in opendev, we set up default firewall rules on our nodes during the image build process, and so are doing it there: https://opendev.org/openstack/project-config/src/commit/30fd4b45491c4d0aa054be66dd8763a7ca89c1ec/nodepool/elements/nodepool-base/install.d/20-iptables#L60 | 13:39 |
@fungicide:matrix.org | start-zuul-console should only apply to the job nodes, not "localhost" (which is the executor) | 13:39 |
@ecsantos:matrix.org | fungi: Yeah, I'm using the same DIB elements on my diskimage, including `nodepool-base` | 13:41 |
@ecsantos:matrix.org | Oh okay, I thought "all" included the localhost, my mistake, I'm new to Ansible as well :) | 13:41 |
@spamaps:spamaps.ems.host | I'm returning to zuul-land from a long hiatus. Is there a thing like nodepool-builder, but that builds using Dockerfiles and pushes container images instead? I really want that (I may just build it using periodic jobs) | 13:42 |
@jim:acmegating.com | spamaps: not that pushes images; i think i'd just use periodic jobs for that. opendev does its base container image building all in zuul jobs. | 13:43 |
@jim:acmegating.com | spamaps: (nodepool can use dockerfiles for building vm images; not what you're asking for, but an interesting related new feature) | 13:44 |
@spamaps:spamaps.ems.host | that's.. weird! | 13:45 |
@spamaps:spamaps.ems.host | I like it? ;) | 13:45 |
@jim:acmegating.com | spamaps: but cool! it isolates the build environment, so you don't need a builder host with tooling for every platform (which is occasionally not possible) | 13:46 |
@jim:acmegating.com | (also, it's a little lighter weight to write a dockerfile than an element) | 13:47 |
@jim:acmegating.com | it's all implemented with an element of course, called 'containerfile' | 13:47 |
@fungicide:matrix.org | well, it's not all as simple as it seems. for example we need podman from debian/unstable to get a new enough version to be able to handle the glibc in latest fedora | 13:48 |
@spamaps:spamaps.ems.host | 🍿 | 13:49 |
@spamaps:spamaps.ems.host | Glad to be home. I hope it's not a short visit. | 13:49 |
@fungicide:matrix.org | spamaps: details of that particular slice of cake can be found here: https://review.opendev.org/815766 | 13:49 |
@fungicide:matrix.org | containers! magic that lets you run pretty much anything anywhere, right? well, nope, as it turns out ;) | 13:50 |
@fungicide:matrix.org | can anybody spot what's going sideways with the tenant quota in this failed test? seems to crop up at random (we just hit it on the podman version change above): https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_677/815766/3/gate/nodepool-tox-py38/677848b/testr_results.html | 14:02 |
@jim:acmegating.com | fungi: it looks like either the last node was never deleted from zk, or the internal node cache is out of sync with zk. i can't tell more about the problem from that bit of logging. | 14:12 |
@jim:acmegating.com | will probably need to add a bit more logging and run locally | 14:16 |
@clarkb:matrix.org | corvus: the next few unapproved changes in the zuul sos stack were quick easy reviews that I've snuck in before my meeting today. I can pick reviews up again after the meeting (I'm at https://review.opendev.org/c/zuul/zuul/+/815111). I think that there is enough in the gate queue that I should get ahead of it this morning | 14:20 |
@jim:acmegating.com | Clark: awesome, thanks! | 14:20 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/814570 | 14:42 | |
@goneri:matrix.org | Hi, Can I get the final +A on https://review.opendev.org/c/zuul/zuul-jobs/+/815901? | 14:43 |
@goneri:matrix.org | thanks! :-) | 14:53 |
@clarkb:matrix.org | swest: corvus few consistency questions in https://review.opendev.org/c/zuul/zuul/+/815111 not worthy of a -1 so I +2'd feel free to approve if you don't have any additional concerns from my comments | 15:37 |
@clarkb:matrix.org | corvus: I guess I didn't realize that pipeline queues are branch specific? https://review.opendev.org/c/zuul/zuul/+/815276/8/zuul/model.py In the case of say openstack's gate queue we set that to None because we run all the branches in that queue? | 15:45 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815979: Use activeContext instead of explicit _save calls https://review.opendev.org/c/zuul/zuul/+/815979 | 15:48 | |
@jim:acmegating.com | Clark: ^ replied on 111 and made that followup ^ | 15:48 |
@jim:acmegating.com | Clark: they are optionally branch-specific | 15:48 |
@fungicide:matrix.org | Clark: i think that changed somewhat recently in order to support users who want separate queues per branch | 15:48 |
@jim:acmegating.com | i believe the default is they aren't | 15:48 |
@clarkb:matrix.org | I see I think I missed that | 15:49 |
@clarkb:matrix.org | corvus: swest question on https://review.opendev.org/c/zuul/zuul/+/815276 | 15:49 |
@clarkb:matrix.org | I've approved https://review.opendev.org/c/zuul/zuul/+/815111/ and +2'd the followup thanks | 15:51 |
@jim:acmegating.com | Clark: does https://review.opendev.org/814773 address your question? | 15:58 |
@jim:acmegating.com | in general, pipeline processing should make the zk data eventually consistent wrt configuration changes and re-enqueues. i don't know if that addresses your specific concern though. | 15:59 |
@clarkb:matrix.org | I think that half answers my question. We're unlikely to suffer major problems from my concern on 815276 | 16:00 |
@clarkb:matrix.org | The other half is whether or not we should be doing inferred associations at all? and instead directly address them in the database | 16:00 |
@clarkb:matrix.org | When everything was in memory our python references were our direct associations and we lose that in this instance | 16:00 |
@clarkb:matrix.org | but that would require making the queues object more complex to refer to a change queue manager? or the other way around. The change queue manager/pipeline state will have to refer to specific queues? | 16:01 |
@jim:acmegating.com | i think what should happen is that for a project to move between queues, you would need a re-enqueue pass, so everything would need to move from old_queues to queues. and the new queues would be constructed with the new set of shared projects. | 16:02 |
@clarkb:matrix.org | gotcha. So sched 1 might fail but you haven't moved yet. Then sched2 can recover and do the move | 16:05 |
@jim:acmegating.com | Clark: yep, that's my understanding of the theory. | 16:06 |
@jim:acmegating.com | i think the pipeline.state.layout_uuid lets us know that the objects in zk for a pipeline match the current config; if that doesn't match, re-enqueue needs to happen | 16:07 |
@clarkb:matrix.org | I have approved 815276 and left a note with some hints to this information | 16:07 |
@jim:acmegating.com | and re-enqueue will always happen with the current config | 16:07 |
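The layout_uuid check corvus describes can be pictured roughly as follows. This is only a sketch under stated assumptions: `PipelineState`, its fields, and `reenqueue_if_stale` are hypothetical stand-ins and do not reflect Zuul's actual classes or ZooKeeper storage.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PipelineState:
    """Hypothetical, in-memory stand-in for the pipeline state kept in ZK."""
    layout_uuid: str
    queues: List[str] = field(default_factory=list)
    old_queues: List[str] = field(default_factory=list)


def reenqueue_if_stale(state: PipelineState, current_layout_uuid: str) -> bool:
    """Run a re-enqueue pass only when the stored UUID differs from the config.

    Returns True if a re-enqueue pass ran, False if the state was current.
    """
    if state.layout_uuid == current_layout_uuid:
        return False
    # Park the existing items, then move them into freshly built queues.
    state.old_queues, state.queues = state.queues, []
    while state.old_queues:
        state.queues.append(state.old_queues.pop(0))
    # Record the new UUID last, so an interrupted pass is retried.
    state.layout_uuid = current_layout_uuid
    return True
```

Because the UUID is updated last in this sketch, a scheduler that fails midway leaves the mismatch in place and another scheduler can pick up the pass, which matches the recovery scenario Clark describes.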
@clarkb:matrix.org | My goal today is to get through the end of the stack that is ready to land. Then reward myself with a bike ride while zuul gates everything :) | 16:30 |
@jim:acmegating.com | solid plan | 16:30 |
@clarkb:matrix.org | tobiash: you don't have a recent review on https://review.opendev.org/c/zuul/zuul/+/815450/ (this is my next change to review). Did you want to review that before it gets approved? | 16:38 |
@tobias.henkel:matrix.org | Clark: I had +2 on that in the past | 16:38 |
@tobias.henkel:matrix.org | looks like that vanished during rebase | 16:38 |
@clarkb:matrix.org | ah ok so probably fine to approve, thanks for checking | 16:39 |
@jim:acmegating.com | i think there was a small fix to that since then; not substantial | 16:39 |
@clarkb:matrix.org | I've approved it | 16:43 |
@spamaps:spamaps.ems.host | I seem to have caused nodepool to go to plaid... | 16:47 |
```
launcher_1 | 2021-10-29 16:45:18,535 ERROR nodepool.NodePool: if isinstance(other, GCEProviderConfig):
launcher_1 | 2021-10-29 16:45:18,535 ERROR nodepool.NodePool: RecursionError: maximum recursion depth exceeded while calling a Python object
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: Exception in main loop:
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: Traceback (most recent call last):
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: File "/usr/local/lib/python3.9/site-packages/nodepool/launcher.py", line 1095, in run
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: self.updateConfig()
```
@spamaps:spamaps.ems.host | Seems there's a way that GCEProviderConfig might self-reference. | 16:55 |
@spamaps:spamaps.ems.host | It's happening when Nodepool tries to determine if new/old are different | 16:58 |
@spamaps:spamaps.ems.host | I think there may be somewhere that parts of the config are shared between new and old. | 16:59 |
@spamaps:spamaps.ems.host | happens on startup too. :( | 17:00 |
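The kind of self-reference spamaps suspects can be reproduced in isolation. The sketch below is not Nodepool's code: `Provider`, `Pool`, `build`, and `configs_differ` are hypothetical names, and the only point is that two `__eq__` methods which compare each other's back-references recurse until Python raises RecursionError.

```python
class Pool:
    def __init__(self, name, provider):
        self.name = name
        self.provider = provider  # back-reference to the owning provider

    def __eq__(self, other):
        if isinstance(other, Pool):
            # Comparing the back-reference re-enters Provider.__eq__.
            return self.name == other.name and self.provider == other.provider
        return False


class Provider:
    def __init__(self, name):
        self.name = name
        self.pools = {}

    def __eq__(self, other):
        if isinstance(other, Provider):
            # Comparing the pools dict re-enters Pool.__eq__ for each value.
            return self.name == other.name and self.pools == other.pools
        return False


def build(name):
    """Build a provider whose single pool points back at it."""
    provider = Provider(name)
    provider.pools["main"] = Pool("main", provider)
    return provider


def configs_differ(old, new):
    """Return True/False, or None if the comparison recursed without end."""
    try:
        return old != new
    except RecursionError:
        return None
```

Comparing two separately built but otherwise identical configs, as in `configs_differ(build("gce"), build("gce"))`, hits the recursion. Configs that differ in an early field short-circuit and compare fine, which is one way such a bug could go unnoticed.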
@tobias.henkel:matrix.org | corvus: q on 815764 | 17:01 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/814862 | 17:03 | |
@jim:acmegating.com | tobiash: replied | 17:04 |
@tobias.henkel:matrix.org | thanks, lgtm | 17:05 |
@clarkb:matrix.org | swest: corvus can you check my comment on https://review.opendev.org/c/zuul/zuul/+/815617 it isn't clear to me how we are setting an initial layout uuid to a valid uuid | 17:09 |
@clarkb:matrix.org | hrm maybe I just figured it out. We're setting it in the Layout object then that gets passed to the LayoutState | 17:10 |
@clarkb:matrix.org | I was reading it as LayoutState always sets the uuid on Layout but that is only true if a LayoutState exists already. If it doesn't exist we get it from Layout /me updates the review | 17:10 |
@clarkb:matrix.org | I've approved https://review.opendev.org/c/zuul/zuul/+/815617 but left a comment about a small test update | 17:13 |
@clarkb:matrix.org | spamaps: I think the gerrit zuul installation is using nodepool with gce. I wonder why it doesn't hit that | 17:13 |
@clarkb:matrix.org | tobiash: if you are able to review https://review.opendev.org/c/zuul/zuul/+/815744/ that would be helpful as I'm the only current reviewer on it | 17:15 |
@jim:acmegating.com | Clark: tobiash previously reviewed that one and there have been only insubstantial changes since then, so i think we can carry it over | 17:17 |
@jim:acmegating.com | (variable names/comments were the only changes) | 17:17 |
@tobias.henkel:matrix.org | I've just re-reviewed it | 17:17 |
@jim:acmegating.com | that works too :) | 17:18 |
@tobias.henkel:matrix.org | just when I thought the gate got more stable the next reset happened... | 17:18 |
@jim:acmegating.com | yeah looking at the failure now | 17:19 |
@westphahl:matrix.org | Clark: responded in 815617 | 17:19 |
@tobias.henkel:matrix.org | corvus: the first thing I'm seeing there is a zk session loss: https://ddb474164b5093260c27-91acd78cd46015ce54b5f888f723113e.ssl.cf1.rackcdn.com/814773/14/gate/zuul-tox-py36/7e35fa2/testr_results.html | 17:20 |
@tobias.henkel:matrix.org | if that's the real reason we could look into reducing the parallel tests by one or maybe increase the zk session timeout more in the tests | 17:21 |
@jim:acmegating.com | well, that's an intentional zk disconnect test | 17:21 |
@tobias.henkel:matrix.org | oh, then scratch that | 17:21 |
@jim:acmegating.com | i already "fixed" that once though... | 17:21 |
@spamaps:spamaps.ems.host | > <@clarkb:matrix.org> spamaps: I think the gerrit zuul installation is using nodepool with gce. I wonder why it doesn't hit that | 17:22 |
I'm wondering if maybe there's cruft in zk from before I wrote the config ... I can't figure out how to log in to zk though.. zkCli gives SASL errors. | ||
@clarkb:matrix.org | swest: yup after I left my comment I realized we are passing that all the way down to Layout during tenant layout parsing to have it generate a new uuid for us | 17:22 |
@jim:acmegating.com | spamaps: zk is not relevant for that problem | 17:22 |
@spamaps:spamaps.ems.host | ah ok | 17:22 |
@clarkb:matrix.org | our zk has local no ssl connectivity to work around that (if it were a zk issue) | 17:22 |
@jim:acmegating.com | those should only be local python objects | 17:22 |
@spamaps:spamaps.ems.host | | 17:23 |
```
zookeeper-servers:
  - host: zk
    port: 2281
zookeeper-tls:
  cert: /var/certs/certs/client.pem
  key: /var/certs/keys/clientkey.pem
  ca: /var/certs/certs/cacert.pem
labels:
  - name: ubuntu-focal
    min-ready: 2
providers:
  - name: static-vms
    driver: static
    pools:
      - name: main
        nodes:
          - name: node
            labels: ubuntu-focal
            host-key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOgHJYejINIKzUiuSJ2MN8uPc+dfFrZ9JH1hLWS8gI+g"
            python-path: /usr/bin/python3
            username: root
  - name: gce-uscentral1
    driver: gce
    project: spotify-zuul-ci
    region: us-central1
    zone: us-central1-a
    cloud-images:
      - name: ubuntu-focal
        image-project: ubuntu-os-cloud
        image-family: ubuntu-2004-lts
        username: ubuntu
    pools:
      - name: ubuntu-focal-vms-gce-uscentral1a
        use-internal-ip: true
        labels:
          - name: ubuntu-focal
            cloud-image: ubuntu-focal
            instance-type: e2-standard-8
            volume-type: pd-standard
            volume-size: 500
```
@spamaps:spamaps.ems.host | There's the config (that host key is the static one in the docs ;) | 17:23 |
@jim:acmegating.com | spamaps: for reference though, in opendev we allow non-ssl connections to zk on localhost only (protected by firewall) to aid in debugging with zk-shell. | 17:23 |
@jim:acmegating.com | tobiash: it got stuck on self.stats_thread.join() weird | 17:27 |
@tobias.henkel:matrix.org | Corvus: maybe because of the crashed stats reporter election | 17:32 |
@jim:acmegating.com | tobiash: yeah, i think it's a race with the election | 17:32 |
@tobias.henkel:matrix.org | There is an exception about that in the log | 17:32 |
@jim:acmegating.com | like it may have started running the election after the stop signal | 17:33 |
@jim:acmegating.com | i think it's a pretty rare race. we should probably have scheduler.stop cancel the election after setting the stop event. i think that would take care of it. i'll do that next time i have a clean working tree :) | 17:33 |
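The fix corvus proposes (cancel the election after setting the stop event) can be sketched with plain threads. Everything here is an assumption for illustration: `FakeElection` stands in for a blocking leader election, and `Scheduler` is a toy class, not Zuul's scheduler.

```python
import threading


class FakeElection:
    """Hypothetical election: run() blocks until cancel() is called."""

    def __init__(self):
        self._cancelled = threading.Event()

    def run(self, leader_func):
        # A real election would call leader_func once leadership is won;
        # this stand-in never wins and only returns when cancelled.
        self._cancelled.wait()

    def cancel(self):
        self._cancelled.set()


class Scheduler:
    def __init__(self):
        self.stop_event = threading.Event()
        self.election = FakeElection()
        self.stats_thread = threading.Thread(
            target=self._run_stats, daemon=True)

    def _run_stats(self):
        while not self.stop_event.is_set():
            self.election.run(lambda: None)

    def start(self):
        self.stats_thread.start()

    def stop(self, timeout=5.0):
        self.stop_event.set()
        # Without this cancel, a thread already blocked in run() would
        # never notice the stop event and join() would hang.
        self.election.cancel()
        self.stats_thread.join(timeout)
        return not self.stats_thread.is_alive()
```

Setting the stop event first and cancelling second means the stats thread either sees the event before entering the election or is released from it by the cancel, so the join returns in both orderings.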
@spamaps:spamaps.ems.host | > <@jim:acmegating.com> spamaps: for reference though, in opendev we allow non-ssl connections to zk on localhost only (protected by firewall) to aid in debugging with zk-shell. | 17:40 |
I may submit a patch for the quickstart to do the same thing | ||
@jim:acmegating.com | spamaps: i don't think we should do that | 17:43 |
@jim:acmegating.com | spamaps: no end-user of zuul should ever have to touch zk. i just said that to you as a zuul developer expert | 17:44 |
@jim:acmegating.com | and exposing un-encrypted zk could compromise the system and therefore the integrity of the code that's tested. it's super dangerous | 17:45 |
@jim:acmegating.com | (we actually added a zuul cli tool so that if something goes wrong with zk, you can just delete all of zuul's state and start over; that's the level of end-user zk interaction i expect) | 17:47 |
@fungicide:matrix.org | yeah, the risk in the quickstart is that there are lots of additional components being started on the same system as the zk service. if you have zk confined to entirely separate servers it's less of a concern to do that | 17:52 |
@fungicide:matrix.org | (and that's the case in opendev's deployment, for example) | 17:53 |
@fungicide:matrix.org | in particular, a job could run on the co-located executor to connect to the unprotected loopback port | 17:55 |
@spamaps:spamaps.ems.host | Ok.. I figured quick start was more of "try this out, if you like it use a real install method" | 18:04 |
@fungicide:matrix.org | it should still be made reasonably "safe" for small scale use cases | 18:05 |
@jim:acmegating.com | since we know people would do it anyway, we designed it to be able to 'evolve' to production use, mostly by adjusting the nodepool config. either way, i'd want to set a good example there. :) | 18:09 |
@jim:acmegating.com | (which is why it goes to all the trouble to use an encrypted zk connection on an all-in-one self-contained isolated-network localhost-only system) | 18:10 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: | 18:19 | |
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773 | ||
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899 | ||
-@gerrit:opendev.org- Clark Boylan proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815979: Use activeContext instead of explicit _save calls https://review.opendev.org/c/zuul/zuul/+/815979 | 18:21 | |
@clarkb:matrix.org | That ^ fixes a linter error on the last change in the stack | 18:21 |
@clarkb:matrix.org | The stack is largely approved now. There are still a variety of failures but I think corvus has been tracking those down and the majority are races or load issues and not directly related to the stack (though the stack does more zk stuff so can make races and load worse) | 18:22 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/815154 | 18:33 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815309: Cancel jobs before resetting builds https://review.opendev.org/c/zuul/zuul/+/815309 | 18:33 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: | 19:52 | |
- [zuul/zuul] 815111: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | ||
- [zuul/zuul] 815276: Add change queues to change queue managers https://review.opendev.org/c/zuul/zuul/+/815276 | ||
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815277: Refresh pipelines before checking for empty queues https://review.opendev.org/c/zuul/zuul/+/815277 | 19:52 | |
@tobias.henkel:matrix.org | corvus: this is the next failure: https://11a58606918618718cd5-a21c5a2f7ec31a719791313ddc031133.ssl.cf5.rackcdn.com/815764/4/gate/zuul-tox-py38/a1e5f86/testr_results.html | 21:22 |
in this case iterate timeout timed out waiting for a build to be in starting phase because the load governor paused the executor | ||
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815428: Fix GitHub PR (de-)serialization https://review.opendev.org/c/zuul/zuul/+/815428 | 21:33 | |
@tobias.henkel:matrix.org | yet another gate reset, this time it seems to have hit a very slow node | 21:37 |
@tobias.henkel:matrix.org | however I've also seen a real zk session loss further down in the queue | 21:38 |
@jim:acmegating.com | tobiash: regarding the governor, it spent 30s over load average of 20, so 815389 combined either with a load multiplier increase or reducing concurrency would probably fix it. | 21:40 |
@jim:acmegating.com | i'm leaning increasingly toward reducing concurrency | 21:40 |
@tobias.henkel:matrix.org | either that or slightly increasing the load multiplier | 21:41 |
@tobias.henkel:matrix.org | I guess both would probably work | 21:41 |
@jim:acmegating.com | most of the time that job was around LA=12, which is high but not crazy for an 8 core machine | 21:41 |
@jim:acmegating.com | so i think it's still reasonable to keep trying concurrency=-1 (what we have now) and increase the multiplier for tests. | 21:42 |
@tobias.henkel:matrix.org | reducing concurrency would probably also reduce the risk of zk session losses | 21:42 |
@jim:acmegating.com | i don't actually care that the governor works in tests, i just want the code running. | 21:42 |
@jim:acmegating.com | tobiash: can you link to the build page of the job where the session was lost? | 21:43 |
@jim:acmegating.com | (by work, i mean i don't want the governor to stop jobs, i just want it running) | 21:43 |
@tobias.henkel:matrix.org | need to check my browser history, just a sec | 21:43 |
@jim:acmegating.com | (so as far as i'm concerned, we can change the multiplier to 100x in tests) | 21:44 |
@tobias.henkel:matrix.org | corvus: https://9646a7fb82b47fbe6288-a22e2178400a1d74c0dfc0d0570ba9cf.ssl.cf2.rackcdn.com/815744/3/gate/zuul-tox-py36/e30809a/testr_results.html | 21:44 |
@tobias.henkel:matrix.org | second failed test | 21:44 |
@jim:acmegating.com | thanks for the link -- fwiw i find the build page more helpful than the direct link | 21:45 |
@tobias.henkel:matrix.org | I didn't find that anymore in my history | 21:46 |
@jim:acmegating.com | no prob, here it is https://zuul.opendev.org/t/zuul/build/e30809a40c5a44ab8bdbee8181c2e3be | 21:46 |
@tobias.henkel:matrix.org | the timed out job (https://zuul.opendev.org/t/zuul/build/b0dc7a8e8ec94bb89b0e46afe25aedca) shows a lot of steal time in dstat, so likely an overloaded compute node | 21:48 |
@jim:acmegating.com | tobiash: that one is due to load too | 21:48 |
@jim:acmegating.com | in the e30809a build: 2021-10-29 20:32:11,139 zuul.ExecutorServer INFO Unregistering due to high system load 22.06 > 20.0 | 21:49 |
@tobias.henkel:matrix.org | so I guess we should go with 815389 and reducing concurrency by one? | 21:49 |
@jim:acmegating.com | that's the first failure, and i don't always trust connection losses, etc, after the first failure (they may be real, or they may be a side effect of sloppy test teardown) | 21:50 |
@jim:acmegating.com | tobiash: we could try 815389 with a multiplier adjustment before reducing concurrency if you want | 21:50 |
@jim:acmegating.com | i do think we know that 815389 alone isn't enough at this point | 21:50 |
@jim:acmegating.com | actually, if we're going to increase the multiplier we probably don't really need 815389 | 21:51 |
@tobias.henkel:matrix.org | yeah | 21:51 |
@jim:acmegating.com | but it's harmless; i could go either way. might speed up the actual governor tests :) | 21:52 |
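The numbers in this exchange fit a simple threshold model. The sketch below assumes the governor's limit is the CPU count times the load multiplier. The 2.5 value is inferred from the logged limit of 20.0 on an 8 core machine, not taken from the code, and `max_load` and `over_limit` are hypothetical helper names.

```python
def max_load(cpu_count: int, load_multiplier: float) -> float:
    """Assumed limit: number of CPUs times the configured multiplier."""
    return cpu_count * load_multiplier


def over_limit(load_avg: float, cpu_count: int, load_multiplier: float) -> bool:
    """True if the load average would trip the governor under this model."""
    return load_avg > max_load(cpu_count, load_multiplier)


# Values from the discussion: 8 cores, limit 20.0, observed load 22.06.
limit_now = max_load(8, 2.5)              # 20.0
tripped = over_limit(22.06, 8, 2.5)       # True, executor unregisters
typical = over_limit(12.0, 8, 2.5)        # False at the usual LA of 12
with_100x = over_limit(22.06, 8, 100.0)   # False, limit becomes 800.0
```

Under this model a multiplier of 100 keeps the governor code running in tests while making it practically impossible for load alone to pause the executor.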
-@gerrit:opendev.org- Tobias Henkel proposed: [zuul/zuul] 816072: Increase load_multiplier in tests https://review.opendev.org/c/zuul/zuul/+/816072 | 21:57 | |
@jim:acmegating.com | Clark: ^ | 22:00 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 816073: Cancel stats election on shutdown https://review.opendev.org/c/zuul/zuul/+/816073 | 22:14 | |
@jim:acmegating.com | tobiash, Clark: that's the change i promised earlier ^ | 22:14 |
@clarkb:matrix.org | tobiash: corvus do we think we need to poll the sensors more often too? | 22:31 |
@clarkb:matrix.org | I've got a change up that does that if so | 22:31 |
@jim:acmegating.com | Clark: i don't think it's necessary, but i'm happy to merge the change regardless | 22:32 |
@jim:acmegating.com | (basically, if the multiplier is so high it never trips, it shouldn't matter how often we poll) | 22:32 |
@clarkb:matrix.org | well now that I see the multiplier is 100 I agree it shouldn't matter :) | 22:32 |
@clarkb:matrix.org | yup exactly | 22:32 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815429: Add missing logger to Build and BuildSet classes https://review.opendev.org/c/zuul/zuul/+/815429 | 22:48 |