openstackgerrit | Merged zuul/nodepool master: Switch to collect-container-logs https://review.opendev.org/701869 | 00:03 |
---|---|---|
clarkb | corvus: left a few notes on your docs changes | 00:21 |
clarkb | I like the state this is headed towards | 00:21 |
pabelanger | clarkb: yup, seen that before. Fix merged, we need a release | 00:22 |
clarkb | tobiash: I left comments on https://review.opendev.org/#/c/702237/3 and https://review.opendev.org/#/c/702828/2 you aren't the author but I don't think they are here and you can probably make sure they see that before the weekend. Thanks! | 00:23 |
clarkb | pabelanger: ya eventually got there | 00:23 |
pabelanger | For a while, i was manually applying the patch, but haven't in a few upgrades | 00:23 |
pabelanger | so, would be nice to get github3 release for it | 00:24 |
*** mattw4 has quit IRC | 00:25 | |
pabelanger | zuul job design question. If a parent job sets job.files: foo/bar.py, can a child set job.files: [] to remove the filematch? | 00:25 |
pabelanger | doesn't look like it | 00:37 |
pabelanger | basically, trying to see if I can set job.files directly on a parent job stanza, over doing it in pipeline | 00:37 |
pabelanger | but child needs to remove all matchers | 00:37 |
pabelanger | k, so job.files: [] on child doesn't work, but job.files: .* will | 00:39 |
openstackgerrit | Merged zuul/zuul master: Report buildset result in MQTT reporter https://review.opendev.org/702838 | 00:50 |
openstackgerrit | Merged zuul/zuul master: Document the buildsets endpoint https://review.opendev.org/702127 | 00:56 |
*** openstackgerrit has quit IRC | 00:57 | |
*** sgw has joined #zuul | 02:15 | |
*** zxiiro has quit IRC | 02:33 | |
*** logan- has quit IRC | 02:48 | |
*** logan_ has joined #zuul | 02:50 | |
*** logan_ is now known as logan- | 02:50 | |
*** johanssone has quit IRC | 02:56 | |
*** johanssone has joined #zuul | 02:57 | |
*** rfolco has joined #zuul | 02:57 | |
*** rfolco has quit IRC | 03:02 | |
*** bhavikdbavishi has joined #zuul | 03:06 | |
*** bhavikdbavishi1 has joined #zuul | 03:09 | |
*** bhavikdbavishi has quit IRC | 03:10 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:10 | |
*** openstackgerrit has joined #zuul | 03:11 | |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-tox: improve pip detection https://review.opendev.org/702978 | 03:11 |
*** bhavikdbavishi has quit IRC | 03:56 | |
*** bhavikdbavishi has joined #zuul | 04:04 | |
*** rlandy has quit IRC | 04:06 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: Add main configuration file https://review.opendev.org/703013 | 04:21 |
*** raukadah is now known as chandankumar | 04:46 | |
*** dustinc is now known as dustinc|PTO | 04:52 | |
*** bhavikdbavishi1 has joined #zuul | 04:58 | |
*** bhavikdbavishi has quit IRC | 05:00 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 05:00 | |
*** decimuscorvinus_ has quit IRC | 05:21 | |
*** decimuscorvinus has joined #zuul | 05:22 | |
*** kmalloc has quit IRC | 05:33 | |
*** kmalloc has joined #zuul | 05:33 | |
*** evrardjp has quit IRC | 05:34 | |
*** evrardjp has joined #zuul | 05:34 | |
*** saneax has joined #zuul | 06:06 | |
*** swest has joined #zuul | 06:11 | |
*** sgw has quit IRC | 06:12 | |
*** saneax has quit IRC | 06:28 | |
*** sgw has joined #zuul | 06:30 | |
*** wxy-xiyuan has joined #zuul | 06:31 | |
*** sanjay__u has quit IRC | 06:39 | |
*** sanjay__u has joined #zuul | 06:41 | |
*** michael-beaver has quit IRC | 06:53 | |
*** notnone has joined #zuul | 07:15 | |
*** bhavikdbavishi has quit IRC | 07:18 | |
*** SotK has quit IRC | 07:19 | |
*** klindgren_ has quit IRC | 07:19 | |
*** ianw has quit IRC | 07:19 | |
*** fbo has quit IRC | 07:19 | |
*** at_work has quit IRC | 07:19 | |
*** tobberydberg has quit IRC | 07:19 | |
*** EmilienM has quit IRC | 07:19 | |
*** openstackstatus has quit IRC | 07:20 | |
*** saneax has joined #zuul | 07:29 | |
*** bhavikdbavishi has joined #zuul | 08:05 | |
*** SotK has joined #zuul | 08:05 | |
*** klindgren_ has joined #zuul | 08:05 | |
*** ianw has joined #zuul | 08:05 | |
*** fbo has joined #zuul | 08:05 | |
*** tobberydberg has joined #zuul | 08:05 | |
*** EmilienM has joined #zuul | 08:05 | |
*** armstrongs has joined #zuul | 08:14 | |
*** bhavikdbavishi has quit IRC | 08:16 | |
*** armstrongs has quit IRC | 08:24 | |
*** avass has joined #zuul | 08:26 | |
*** reiterative has joined #zuul | 08:26 | |
*** tosky has joined #zuul | 08:29 | |
*** dmellado has quit IRC | 08:34 | |
*** themroc has joined #zuul | 08:34 | |
*** dmellado has joined #zuul | 08:35 | |
*** bhavikdbavishi has joined #zuul | 08:36 | |
*** jpena|off is now known as jpena | 08:48 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: JWT drivers: Deprecate RS256withJWKS, introduce OpenIDConnect https://review.opendev.org/701972 | 08:49 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: JWT drivers: Deprecate RS256withJWKS, introduce OpenIDConnect https://review.opendev.org/701972 | 09:06 |
openstackgerrit | Benjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver https://review.opendev.org/702854 | 09:08 |
*** themroc has quit IRC | 09:10 | |
*** themroc has joined #zuul | 09:15 | |
*** sanjay__u has quit IRC | 09:20 | |
openstackgerrit | Benjamin Schanzel proposed zuul/zuul master: Handle Erroneous Cron Strings in TimerDriver https://review.opendev.org/702237 | 09:23 |
openstackgerrit | Benjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver https://review.opendev.org/702854 | 09:23 |
*** bhavikdbavishi has quit IRC | 09:35 | |
*** bhavikdbavishi has joined #zuul | 09:35 | |
*** bhavikdbavishi has quit IRC | 09:42 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Handle jobs with dependencies on job page https://review.opendev.org/703045 | 09:43 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Remove trusty testing https://review.opendev.org/703046 | 09:45 |
AJaeger | corvus,clarkb: fine to remove the trusty testing on zuul-jobs? ^ | 09:47 |
reiterative | I'm trying to (manually) set up a node for use via the Nodepool Static driver, but Zuul is failing with RETRY_LIMIT when trying to use it. Can anyone point me at some info about the config / setup requirements for a node? I have confirmed that I can ssh to the node from my Nodepool instance using the configured user, and can see it logging on in the syslog. I've been trying to follow the setup from the quick start tutorial demo, but I suspect that I'm | 09:54 |
reiterative | missing something... | 09:54 |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Match tag items against containing branches https://review.opendev.org/578557 | 09:59 |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Optionally support mitogen for job execution https://review.opendev.org/657024 | 10:01 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: OIDCAuthenticator: add capabilities, scope option https://review.opendev.org/702275 | 10:03 |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Report retried builds in a build set via mqtt. https://review.opendev.org/632727 | 10:03 |
*** jangutter has joined #zuul | 10:06 | |
*** hashar has joined #zuul | 10:12 | |
*** jangutter has quit IRC | 10:13 | |
*** jangutter has joined #zuul | 10:13 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: docker-install: workaround for centos-8 conflicts https://review.opendev.org/703053 | 10:49 |
openstackgerrit | Benjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver https://review.opendev.org/702854 | 11:15 |
*** pcaruana has joined #zuul | 11:59 | |
*** sshnaidm|afk is now known as sshnaidm|off | 12:00 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Don't expand change panel on middle click https://review.opendev.org/703064 | 12:08 |
*** zbr|rover is now known as zbr|drover | 12:09 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 12:09 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 12:11 |
*** rfolco has joined #zuul | 12:27 | |
avass | reiterative: I can take a look at it if you link a paste: http://paste.openstack.org/ | 12:31 |
*** bhavikdbavishi has joined #zuul | 12:31 | |
reiterative | Thanks very much avass - what info would you like me to share? | 12:33 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 12:34 |
*** bhavikdbavishi1 has joined #zuul | 12:34 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 12:34 |
avass | reiterative: Do you get any build logs during the pre-run? If so that's a good place to start. Otherwise the executor log during the build should contain some information as well :) | 12:35 |
*** bhavikdbavishi has quit IRC | 12:35 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 12:35 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 12:37 |
avass | reiterative: Here are the requirements for the nodes anyway: https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#managed-node-requirements | 12:38 |
reiterative | avass Executor log is here http://paste.openstack.org/show/788531/ | 12:39 |
reiterative | Where would I look for build logs? | 12:39 |
*** dmellado has quit IRC | 12:41 | |
tristanC | corvus: could the unknown configuration be caused by the fact that there are two jobs that depend on the build-image job? Is there an existing (stable) project pipeline config with more than one job that depends on the buildset registry? | 12:42 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: DNM: docker-install: test existing jobs https://review.opendev.org/703068 | 12:43 |
*** dmellado has joined #zuul | 12:44 | |
*** jpena is now known as jpena|lunch | 12:46 | |
avass | reiterative: There will be a bit more information if you run the executor in debug mode... which I don't remember how to configure *shrug* | 12:50 |
avass | reiterative: Otherwise you can use the fetch-output role from zuul-jobs repo: https://zuul-ci.org/docs/zuul-jobs/log-roles.html#role-fetch-output | 12:51 |
avass | But I guess that would fail as well. Can you see anything in the web dashboard during the run? | 12:51 |
openstackgerrit | Benjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver https://review.opendev.org/702854 | 12:54 |
reiterative | Cheers I'll try that | 12:58 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 13:09 |
*** rlandy has joined #zuul | 13:11 | |
*** bhavikdbavishi has quit IRC | 13:23 | |
openstackgerrit | Alan Pevec proposed zuul/zuul-jobs master: Add phoronix-test-suite job https://review.opendev.org/679082 | 13:25 |
tristanC | zuul-maint : could you please have a look at https://review.opendev.org/679082 . If zuul-jobs is not the right place for such job, should we consider a zuul-jobs-extra project, or should this live outside of the opendev/zuul org? | 13:30 |
mnaser | small ping about https://review.opendev.org/#/c/701868/ :) | 13:32 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal https://review.opendev.org/703065 | 13:38 |
*** jpena|lunch is now known as jpena | 13:39 | |
*** pcaruana has quit IRC | 13:45 | |
*** saneax has quit IRC | 14:06 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: Manage operator scaffolding using a function and configuration file https://review.opendev.org/703013 | 14:10 |
*** pcaruana has joined #zuul | 14:22 | |
*** bhavikdbavishi has joined #zuul | 14:31 | |
*** bhavikdbavishi1 has joined #zuul | 14:38 | |
*** bhavikdbavishi has quit IRC | 14:40 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 14:40 | |
*** jangutter has quit IRC | 14:51 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: revoke-sudo: improve sudo removal https://review.opendev.org/703065 | 15:02 |
*** jangutter has joined #zuul | 15:02 | |
openstackgerrit | Merged zuul/zuul-registry master: Switch to collect-container-logs https://review.opendev.org/701868 | 15:04 |
*** jangutter has quit IRC | 15:06 | |
*** zbr|drover has quit IRC | 15:08 | |
sugaar | Hi, I am deploying zuul in kubernetes. For that reason, I took the zuul example docker-compose and I am implementing it in k8s "style". I am having a problem with zuul-scheduler because when the script wait-for-gearman.sh is executed I get "./wait-to-start.sh: line 9: mysql: Name or service not known | 15:12 |
sugaar | ./wait-to-start.sh: line 9: /dev/tcp/mysql/3306: Invalid argument | 15:12 |
sugaar | " | 15:12 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Shard py35 and py37 test cases https://review.opendev.org/702473 | 15:13 |
sugaar | So I went into /dev/ and there is no tcp/ directory, so I was wondering how does it work in the docker-compose? Why doesn't it break there? | 15:13 |
tristanC | sugaar: did you setup a service named 'mysql' ? | 15:14 |
sugaar | I am using mariadb which is what it is used in the docker-compose. By service you mean a k8s service object? | 15:14 |
tristanC | sugaar: yes, the k8s service, iiuc that's what makes the `mysql` name resolve in dns | 15:15 |
tobiash | sugaar: I guess that dir is not forwarded to pods in k8s, however in k8s you won't really need that script as this is used to avoid race conditions in ci | 15:15 |
Shrews | clarkb: I read you comment you pointed me to in https://review.opendev.org/702828. I think it's difficult for us to judge the impact of that change without seeing results, so maybe we should let someone who does use the static driver cast a vote first. (cc: tobiash) | 15:15 |
Shrews | wow, such bad grammar i have today | 15:16 |
*** zbr has joined #zuul | 15:16 | |
*** avass has quit IRC | 15:16 | |
*** zbr is now known as zbr|drover | 15:16 | |
tobiash | Shrews: we're using the static driver a lot (with user managed static nodes) and when they're doing maintenance (by firewalling ssh port off nodepool) our logs really get spammed by the exception traces | 15:16 |
sugaar | tristanC I have the service but I gave a different name to it, I will change the name to see what happens. | 15:17 |
sugaar | tobiash do you reckon I could get rid of the script? | 15:17 |
tobiash | so for us this is 'normal operations' where one would not want exceptions in the log | 15:17 |
Shrews | tobiash: sure. i trust your judgement on it, so if you're ok with the change, i'm happy to +3 it | 15:17 |
tobiash | Shrews: I'm fine with the change | 15:18 |
tristanC | sugaar: fwiw, you might be interested by the work we are doing with the zuul-helm and the zuul-operator project | 15:18 |
Shrews | tobiash: cool. just needs your +2 :) | 15:18 |
tobiash | Shrews: I'll review in a bit :) | 15:18 |
clarkb | tobiash: did you see the case I was calling out that probably is still an error though? | 15:21 |
tobiash | clarkb: user triggered maintenance would be ideally implemented by the user if the port is firewalled off directly after a job (which would then also hit the re-registration code path) | 15:23 |
tobiash | further nodepool anyway does a periodic check of node connectivity and still periodically logs warnings about that | 15:23 |
corvus | clarkb, tobiash, Shrews: how about log.error and include the errno message? | 15:24 |
corvus | (i think the main thing the traceback gets you there is *why* the connection failed (no route/refused/dns/etc)) | 15:25 |
tobiash | I guess that would be a reasonable compromise | 15:25 |
tobiash | swest: what do you think? ^ | 15:26 |
*** electrofelix has joined #zuul | 15:26 | |
clarkb | ok so it is valid to hit that post job and deregister. In that case I guess warning is fine though error might be better? | 15:27 |
corvus | tristanC: i think opendev/system-config has multiple jobs that depend on the registry, but they have file matchers, so i don't know how often they run at once. | 15:27 |
clarkb | mostly I don't want the cause of the deregister to get lost | 15:27 |
openstackgerrit | Antoine Musso proposed zuul/zuul master: doc: add links to components documentation https://review.opendev.org/703105 | 15:28 |
corvus | clarkb, tobiash, Shrews, swest: i guess the message should also say "failed to connect after liveness check" or something so we know where (that's the other thing the TB gives us) | 15:28 |
tobiash | sounds good | 15:29 |
sugaar | tristanC this one right? https://opendev.org/zuul/zuul-helm | 15:31 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: add admin reference section https://review.opendev.org/702997 | 15:35 |
corvus | sugaar: yes -- there's no documentation yet, but looking at it may help | 15:41 |
openstackgerrit | Antoine Musso proposed zuul/zuul master: Fix release note for a 3.0.2 feature https://review.opendev.org/703109 | 15:41 |
tristanC | sugaar: yes, and i'm also working on a zuul operator with this topic: https://review.opendev.org/#/q/topic:zuul-crd | 15:44 |
pabelanger | I really don't like that depends-on in commit messages does not work for github | 16:02 |
pabelanger | and I would like to see how to fix that | 16:02 |
tobiash | pabelanger: use the hub tool ;) | 16:02 |
pabelanger | I do use CLI tool | 16:03 |
pabelanger | but github issue templates overwrite it | 16:03 |
pabelanger | also, it forces users who use gerrit into a 2nd workflow | 16:03 |
pabelanger | pull, if you edit first PR comment, when job is running, zuul will abort job and not reenqueue | 16:03 |
pabelanger | you have to manually recheck | 16:04 |
pabelanger | s/pull/plus | 16:04 |
clarkb | pabelanger: to be fair github and gerrit workflows are already miles apart | 16:04 |
sugaar | tristanC thanks I will have a look to that too | 16:04 |
tobiash | pabelanger: having the depends-on in the commit message doesn't really work with github as this information is only available after doing the merge and even then it'll be difficult as you can have all sorts of git trees in a pr | 16:04 |
pabelanger | clarkb: yah, I agree. Just very frustrating to update it in multiple places | 16:04 |
pabelanger | tobiash: can you explain available after merge? I am not following | 16:06 |
tobiash | pabelanger: in order to get the depends-on from commit messages you need to analyse all commits that are part of the pr (which can be quite a lot) | 16:07 |
pabelanger | yes, I agree that would be needed. But, is that a lot of overhead? | 16:08 |
tobiash | so you would need either hammering the github url or use the mergers to get all depends-on headers of all involved commits | 16:08 |
pabelanger | could we not support both | 16:08 |
tobiash | and this needs to be done before enqueuing into any pipeline | 16:08 |
tobiash | s/github url/github api | 16:09 |
pabelanger | yup, agree | 16:09 |
pabelanger | what is merges did squash commits, that should get all commits in PR, with depends-on into into single commit | 16:10 |
*** hashar has quit IRC | 16:11 | |
pabelanger | if* | 16:11 |
pabelanger | I _really_ need to slow down on typing | 16:12 |
tobiash | the thing is not how to get those commit messages, but the added complexity and system load for doing that | 16:15 |
tobiash | it could be doable with https://developer.github.com/enterprise/2.19/v3/pulls/#list-commits-on-a-pull-request | 16:15 |
tobiash | (with the limitation that only the depends-on headers of max 250 commits get found) | 16:15 |
tobiash | but if we'd support this it should be optional because of increased system load when using this feature | 16:16 |
tobiash | an important thing is also that especially with multi-commit prs users can be quickly confused why dependencies are pulled in because on the pr page you won't see any dependency header of commit messages if you don't also add them to the pr description | 16:17 |
tobiash | so for me this means: more support tickets of inexperienced users wondering why or why not dependencies are pulled in | 16:18 |
tobiash | that's why I'd turn that feature off in my deployment | 16:18 |
pabelanger | yah, I have the opposite issue. gerrit users of zuul, are creating support issues because depends-on isn't working, from commit messages. | 16:19 |
pabelanger | but yah, agree it would be nice to make it optional or some way to support | 16:19 |
fungi | well. and another problem is that the more popular github workflow is to not replace commits but merely add new child commits in the pr, so what's the expected behavior if there's a depends-on in a commit message for a non-leaf commit in the pr? | 16:19 |
pabelanger | this is about the main issue I have with github driver today, so #firstworldproblem I figure :) | 16:19 |
tobiash | pabelanger: but I'm open to adding this as an optional feature (probably best default off) | 16:20 |
pabelanger | k, that is good to know | 16:20 |
fungi | and how do you "remove" or "modify" a depends-on without rewriting that commit, in a workflow where rewriting commits is discouraged? | 16:20 |
tobiash | fungi: in that case all depends-on headers would be squashed into a combined list | 16:20 |
*** mattw4 has joined #zuul | 16:20 | |
tobiash | remove would be only possible by force push | 16:20 |
fungi | right, and there are projects which are strongly opposed to force-push in pull requests | 16:21 |
tobiash | (which is mostly not that much discouraged on pre-merge pr branches) | 16:21 |
fungi | because they want to retain the history of the pr development | 16:21 |
tobiash | in that case it's not possible to remove dependencies | 16:21 |
fungi | exactly | 16:21 |
fungi | (that was basically my point) | 16:21 |
fungi | there are popular github workflows which, as clarkb already indicated, are simply incompatible with gerrit workflows, so trying to have zuul pretend people use them the same way is likely to just lead to more problems | 16:22 |
tobiash | pabelanger: so as long as you're willing to accept those shortcomings and the feature is optional and off by default I'm fine with it | 16:23 |
pabelanger | tobiash: Yup, I think it would be a good idea to get them all down some place, then see if worth it. | 16:24 |
pabelanger | I don't have the time to work on it today, but is helpful to understand how others deal with it | 16:24 |
tobiash | I personally deal with it by adding the depends-on when creating the pr via commandline (which actually feels pretty much like creating a commit) ;) | 16:25 |
pabelanger | yah, we have issue templates. for some reason, they don't read commit messages by default | 16:26 |
pabelanger | so, I often forget to add it there | 16:27 |
pabelanger | Hmm, I noticed I have a locked ready node for 16hours | 16:30 |
pabelanger | need to debug that, but if scheduler is holding lock, there is no way to unlock it without restart right? | 16:30 |
tobiash | pabelanger: yes, we also face this issue, zuul occasionally leaks nodes in some corner cases, but I didn't have time yet to deeply debug this | 16:31 |
pabelanger | okay, thanks. | 16:31 |
pabelanger | would be nice to have way to unlocking, without full restart | 16:32 |
tobiash | pabelanger: there is a hack to unlock them by deleting the lock znode via zookeeper cli | 16:32 |
tobiash | but that is a hack | 16:32 |
pabelanger | tobiash: k, I'd be okay with that right now | 16:32 |
pabelanger | nodepool delete --force --delete-lock | 16:33 |
pabelanger | would be nice | 16:33 |
tobiash | well, we should fix the leak instead ;) | 16:33 |
pabelanger | yes, that is better | 16:33 |
tobiash | btw, this fixes one of the leaks: https://review.opendev.org/666643 | 16:33 |
tobiash | but there is still one left somewhere | 16:34 |
Shrews | i do not endorse a way to manually break locks | 16:34 |
tobiash | Shrews: I know, that's why I emphasized it's a hack ;) | 16:34 |
Shrews | yeah, i meant the nodepool cli command | 16:35 |
pabelanger | yah, in our case we are getting billed by the locked node, but I don't have bandwidth to debug or restart zuul | 16:35 |
pabelanger | so, looking for some solution | 16:35 |
pabelanger | I could delete node behind nodepool / zuul back, but that doesn't feel good either | 16:35 |
tobiash | pabelanger: in your case I would delete the lock and let nodepool handle the delete, but I agree with Shrews that we shouldn't add such a functionality into nodepool | 16:36 |
tobiash | that's what we do currently until I have enough time to analyze and fix the leak | 16:37 |
corvus | 643 lgtm. that was originally an optimization for speed in zuul v2, but i think we can do without it now. | 16:38 |
pabelanger | tobiash: do you use zk-shell? | 16:39 |
pabelanger | tobiash: like to see if you use rm or rmr | 16:40 |
tobiash | corvus: thanks, at least I didn't notice any negative performance impact in our system (we're running with this fix already for quite some time) | 16:40 |
tobiash | pabelanger: zkCli.sh rmr /nodepool/nodes/<nodeid>/lock | 16:41 |
pabelanger | thanks | 16:41 |
clarkb | tobiash: do you configure timeouts with gearman so that if an executor is lost eventually that work is handed to another executor? | 16:43 |
clarkb | I seem to recall you were involved in some changes to gear to make that possible? | 16:43 |
pabelanger | leaked node gone, I'll see if I can figure out why it locked in a bit | 16:44 |
tobiash | clarkb: that's done by using tcp keepalive and default in zuul | 16:44 |
tobiash | clarkb: support for that in gear: https://review.opendev.org/599567 | 16:44 |
tobiash | clarkb: and usage in zuul: https://review.opendev.org/599573 | 16:45 |
clarkb | tobiash: ok so it should just work already? | 16:46 |
clarkb | (we're seeing that in some cases we have changes stay queued waiting for a job to start for many hours, ~48, after an executor dies) | 16:46 |
tobiash | actually yes, just read in -infra, did you verify that this job ran on the lost executor? | 16:46 |
*** themroc has quit IRC | 16:47 | |
tobiash | zuul did handle executors going away quite ok as far as I remember | 16:48 |
clarkb | yup we regularly stop them and start them for upgrades and jobs typically restart as expected. This is likely a corner case of some sort | 16:48 |
clarkb | I'm not sure if fungi managed to confirm that the lost executor was handling this queued job | 16:48 |
pabelanger | clarkb: did executor stop, or die? | 16:50 |
clarkb | pabelanger: it was not asked to stop if that is what you mean | 16:53 |
fungi | pabelanger: tobiash: ooh! i didn't know about rmr, i've been using tab completion to manually recurse for removals with multiple rm commands instead | 16:53 |
pabelanger | clarkb: looking for old bug, might have idea | 16:54 |
tobiash | clarkb: gear server should also use keepalive | 16:54 |
fungi | clarkb: nope, i haven't had time to dig into the executors yet. still working my way around to it, but last few times this behavior was reported that was precisely the cause | 16:54 |
fungi | (executor is already on its way out, cpu is spiking or whatever, but it manages to accept the build request moments before becoming entirely unresponsive) | 16:55 |
tobiash | clarkb: so a freezing vm should be detected (but not a freezing executor process that continues to keep the connection running) | 16:55 |
pabelanger | I remember an issue, if zuul executor was killed by rackspace, we leaked jobs | 16:55 |
pabelanger | I think I opened a bug about it | 16:55 |
fungi | tobiash: what's the definition of "keep the connection running?" does that include the server disappearing without closing the connection? or is there some sort of dead peer detection which is supposed to notice when it's no longer responsive? | 16:57 |
tobiash | pabelanger: if you find that bug, this was probably the fix: https://review.opendev.org/425248 (vm going away without terminating the connection) | 16:57 |
pabelanger | clarkb: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-07-18.log.html#t2018-07-18T17:32:27 | 16:57 |
pabelanger | still looking | 16:57 |
tobiash | fungi: gear server as well as client use tcp keepalive, so a vm going away will be detected | 16:57 |
tobiash | what won't be detected is a freezing executor python process (no idea how that could happen) because tcp keepalive is handled by the kernel | 16:58 |
pabelanger | tobiash: yah, that's what I remember, VM dies, server reboots, stuck job | 16:58 |
tobiash | or what's not handled is vm gets unresponsive because of process explosion, also in this case kernel still handles keepalive probably | 16:59 |
pabelanger | is there no way to get more than 10 items in storyboard? | 17:00 |
pabelanger | https://storyboard.openstack.org/#!/project/679 | 17:00 |
clarkb | pabelanger: in your user settings you can check the paging size | 17:00 |
tobiash | or just beneath the page selector | 17:00 |
pabelanger | ah, thanks see it | 17:01 |
fungi | the little gear icon | 17:04 |
Shrews | hrm, yet another nodepool-zuul-functional test failure during managed_ansible processing: https://zuul.opendev.org/t/zuul/build/a77e4d2b427745f3b8855249d6f05725/log/job-output.txt#851 | 17:05 |
tobiash | Shrews: I have a hunch that multithreaded installation of ansible into different venvs sometimes confuses pip | 17:06 |
pabelanger | Shrews: I've seen that before, but not sure why it happens | 17:06 |
pabelanger | tobiash: I think that might be the case | 17:07 |
tobiash | maybe some tempfile/cache race | 17:07 |
tobiash | at least it got worse with more supported versions | 17:07 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Limit parallelity when installing ansible https://review.opendev.org/703126 | 17:10 |
fungi | pip does use a couple of on-disk caches by default, so i could see concurrent writes potentially clobbering each other in those from time to time | 17:10 |
tobiash | Shrews: I'd suggest to try two threads as a compromise between test failure risk and time to install 4 different versions. | 17:11 |
fungi | it has an http cache and a wheel cache | 17:11 |
tobiash | if that's not enough we can disable parallelity entirely | 17:11 |
fungi | but there are pip command-line options to set the locations of its caches, so we could in theory separate them out and see if that helps | 17:11 |
fungi | at the expense of retrieving/building some packages multiple times, of course | 17:12 |
clarkb | tobiash: is that what caused my docs change job to fail? | 17:12 |
clarkb | tobiash: it was definitely unhappy about something very deep into python itself | 17:12 |
clarkb | (importlib to be specific) | 17:12 |
tobiash | which one? | 17:12 |
tobiash | I think I remember one and I think it was during ansible installation | 17:13 |
clarkb | tobiash: let me find it | 17:14 |
clarkb | tobiash: https://zuul.opendev.org/t/zuul/build/c26b73191cf2443a8e083f7b1668052e that one | 17:14 |
tobiash | clarkb: yes, exactly this issue | 17:15 |
corvus | tobiash: see my comment on 703126 | 17:16 |
corvus | i think we're really going to want a comment for that in the future :) | 17:16 |
tobiash | corvus: thanks, fixing | 17:16 |
tobiash | you're absolutely right :) | 17:16 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Limit parallelity when installing ansible https://review.opendev.org/703126 | 17:18 |
*** chandankumar is now known as raukadah | 17:20 | |
*** tosky has quit IRC | 17:21 | |
Shrews | corvus: As an alternative to https://review.opendev.org/702992, all of the remaining common Reference links are admin oriented *except* glossary and dev guide. We could also add an admin reference section, move glossary back under the "Indices and tables", then expand the Dev Guide by itself (thus replacing the common References section). That would expose the dev stuff more clearly on the root page. | 17:24 |
Shrews | oh, some of that is done in deeper reviews, i see now | 17:26 |
Shrews | at least the admin reference stuff | 17:26 |
openstackgerrit | Merged zuul/zuul master: Handle Erroneous Cron Strings in TimerDriver https://review.opendev.org/702237 | 17:27 |
Shrews | i guess the Overview content move complicates my suggestion | 17:27 |
*** evrardjp has quit IRC | 17:34 | |
*** evrardjp has joined #zuul | 17:34 | |
corvus | Shrews, clarkb: would you mind reviewing my doc changes despite the -1s, then i'll rebase them and push them through? conflicts are getting annoying | 17:35 |
corvus | maybe if we like all 4 changes, i can squash them too | 17:35 |
clarkb | corvus: ya I can rereview them | 17:36 |
Shrews | corvus: yeah, i'm doing that now (thus the ramblings above) :) | 17:36 |
Shrews | already approved the lead change | 17:38 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: flatten directory structure https://review.opendev.org/703135 | 17:38 |
corvus | cool, i'll start fixing the conflicts then | 17:39 |
corvus | oh, i think i'm just going to have to wait for those to merge, then rebase | 17:41 |
Shrews | i still think eliminating the common ref section and expanding the dev guide is worth some consideration down the road, but these all lgtm | 17:41 |
corvus | Shrews: yeah, what's left in common ref is ambiguous as to audience. | 17:42 |
corvus | if i had to choose, i'd say monitoring is more admin, rest is more user. | 17:42 |
corvus | (but opendev users make use of the monitoring reference) | 17:42 |
corvus | this is where separating by audience doesn't matter any more | 17:43 |
fungi | i guess we don't document the signal handler in zuul at all? looks like there's documentation for the (virtually identical) one in nodepool. should we copy that into zuul's docs? (also i'm getting an extreme sense of déjà vu here) | 17:43 |
fungi | maybe there's already a proposed change to do that | 17:44 |
corvus | fungi: https://zuul-ci.org/docs/zuul/discussion/components.html#reconfiguration | 17:45 |
*** jpena is now known as jpena|off | 17:45 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: re-order reference index https://review.opendev.org/702962 | 17:46 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: move project config docs to user reference https://review.opendev.org/702992 | 17:46 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: move overview section to reference https://review.opendev.org/702995 | 17:46 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: add admin reference section https://review.opendev.org/702997 | 17:46 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: flatten directory structure https://review.opendev.org/703135 | 17:46 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Docs: fix styling in reconfigure commands https://review.opendev.org/703138 | 17:46 |
corvus | oh phooey | 17:46 |
corvus | those are just rebases; i'll self +3 them. | 17:46 |
fungi | corvus: hrm, that only talks about SIGHUP though. are we doing something similar to pass SIGUSR1 and SIGUSR2 from the cli? | 17:47 |
corvus | fungi: oh, the debug thing. yep that's missing and could be copied. | 17:48 |
fungi | right, the paragraph at the end of https://zuul-ci.org/docs/nodepool/operation.html#daemon-usage | 17:48 |
fungi | should that go in https://zuul-ci.org/docs/zuul/howtos/admins/troubleshooting.html do you think? | 17:48 |
fungi | or is there a better place? | 17:49 |
corvus | fungi: that sounds good to me | 17:49 |
fungi | thanks, will do shortly | 17:49 |
*** michael-beaver has joined #zuul | 17:50 | |
clarkb | note the yappi profiling is only available when yappi is installed manually. However the thread dumps should always work | 17:55 |
pabelanger | clarkb: corvus: when time, do you happen to have thoughts on http://eavesdrop.openstack.org/irclogs/%23zuul/latest.log.html#t2020-01-17T00:25:12 | 17:58 |
fungi | clarkb: yep, but we install it, so we do get that as well | 17:59 |
*** zxiiro has joined #zuul | 17:59 | |
fungi | however, the resulting double-dump with yappi deets is 64kb/1180 lines so it's probably too large for paste.o.o | 18:00 |
corvus | pabelanger: if you're dealing with things like that, i would suggest restructuring the job hierarchy so you're not setting files stanzas too early | 18:01 |
pabelanger | corvus: yah, that is possible. What we have today is a little complex, with multiple child jobs. | 18:02 |
pabelanger | however, so far I haven't figured out a better way | 18:02 |
corvus | pabelanger: to put it succinctly, if you're undoing a files matcher in a pipeline, it shouldn't be on the job in the first place. think about making 2 versions of the job, or something. | 18:04 |
pabelanger | Yup, that is fair. This job is a little odd, as we run it in ansible-zuul-jobs. But it's really a parent for jobs that run in ansible/ansible. So, mostly only want to run it if we update playbooks in ansible-zuul-jobs, and always for ansible/ansible. For now, filematcher in pipeline has been way forward | 18:06 |
corvus | pabelanger: then 3 jobs: abstract parent in azj with no file matcher; child in azj with file matcher - this one runs on azj changes; child in aa with different file matcher - this runs on aa changes | 18:08 |
corvus | jobs are free. make as many as you need :) | 18:08 |
fungi | tobiash: here's a thread dump from our hung scheduler... http://paste.openstack.org/show/788548/ | 18:09 |
fungi | anything look familiar? | 18:09 |
corvus | pabelanger: (or, just use the file matchers in the pipelines. you can use yaml anchors to avoid duplication) | 18:09 |
pabelanger | yah, that is fair. Mostly want to keep a->b->c order over a->b a->c, but will work | 18:09 |
pabelanger | ++ | 18:09 |
pabelanger | ack, thanks for help | 18:09 |
corvus | fungi: is there a stuck thread? | 18:09 |
fungi | corvus: what's the best way to tell? | 18:10 |
corvus | fungi: sorry, i mean, i'm asking what you're debugging | 18:10 |
corvus | i recall a conversation about a stuck job | 18:10 |
fungi | oh, there's a build hung waiting for over a day, and so went looking and it was handled by ze08 shortly before its service/debug log went dead silent | 18:11 |
fungi | but it responded to sigusr2 and wrote a thread dump | 18:11 |
corvus | so this is the output from ze08 and the question is: "why is it not doing anything?" | 18:11 |
fungi | right | 18:12 |
fungi | the hung build is 7f3f0f8e86324b22968daec67d7ef28c | 18:12 |
fungi | which does seem to have a thread at line 264 | 18:12 |
fungi | though there are a number of other build-* threads too | 18:13 |
fungi | but more curious is that the executor seems to have ceased doing anything at all roughly 48 hours ago | 18:13 |
fungi | not even periodic wakeups in the debug log, making me suspect a stuck thread | 18:14 |
corvus | fungi: thread 140609457223424 looks suspicious | 18:16 |
corvus | like it's involved in a long running git transaction. it's holding a lock that all other jobs need. | 18:16 |
fungi | indeed, and the forked zuul daemon process is hung at "read(5, " according to strace | 18:17 |
fungi | and there are two other `git cat-file ...` child processes as well | 18:17 |
fungi | though those predate the cessation of logging by a couple hours | 18:17 |
fungi | so i wasn't certain they were necessarily related | 18:17 |
fungi | but this does tie all that together a bit | 18:18 |
corvus | fungi: can you tell what git files are open? | 18:18 |
fungi | checking | 18:18 |
fungi | by those specific processes, or across the whole filesystem? | 18:19 |
fungi | but looks like mostly nova | 18:19 |
corvus | fungi: at this point, probably any open git files | 18:19 |
*** hashar has joined #zuul | 18:19 | |
fungi | /var/lib/zuul/executor-git/opendev.org/openstack/nova/.git/objects/pack/<various>.idx | 18:19 |
corvus | this all looks to be local filesystem access, so there could be a bad git repo involved. i'm not sure how difficult it would be to add a timeout here -- i think it would have to be something pretty high level (like something that forcefully killed the thread). | 18:19 |
corvus | fungi: i would suggest stopping the executor, removing /var/lib/zuul/executor-git/opendev.org/openstack/nova/ and restarting | 18:20 |
fungi | also /var/lib/zuul/executor-git/git.openstack.org/openstack/charm-vault/.git | 18:20 |
corvus | maybe that one too then | 18:20 |
fungi | and yeah, it doesn't seem to be underlying device issues, or at least nothing that the kernel has logged to dmesg | 18:20 |
fungi | i'll move the nova repo out of the way so we can investigate whether its contents are hinky | 18:22 |
fungi | once the executor is fully stopped | 18:22 |
fungi | should i also move charm-vau | 18:22 |
corvus | yeah | 18:22 |
fungi | er, charm-vault | 18:22 |
fungi | okay, i'll move both repos | 18:22 |
corvus | (and some "kill"ing may be required to get it to stop) | 18:22 |
fungi | sure, i won't be surprised there | 18:23 |
*** dtroyer has joined #zuul | 18:29 | |
*** bhavikdbavishi has quit IRC | 18:29 | |
fungi | after restart it's cloning those again | 18:30 |
fungi | so far, so good | 18:30 |
*** fbo is now known as fbo|off | 18:41 | |
tobiash | fungi: you wrote that there were hanging git cat-file processes? | 18:46 |
tobiash | That would fit to that dump | 18:46 |
*** electrofelix has quit IRC | 18:46 | |
tobiash | There are a lot of blocking threads in the inner update loop | 18:47 |
tobiash | Aka every blocked thread there blocks a job | 18:47 |
tobiash | So if git blocks the update loop it blocks the whole executor | 18:47 |
tobiash | I'm not at computer atm, but we should check where the git calls come from and maybe check if some sort of timeout is helpful there | 18:48 |
tobiash | oh I guess I should have read the whole backscroll | 18:49 |
fungi | tobiash: yep, several bits of evidence point to hung `git cat-file ...` commands, possibly as a result of some sort of corrupt on-disk repositories (they seemed to be reading from various pack files before i killed them) | 18:50 |
fungi | i moved them aside and will try a git fsck on a copy shortly to see if that theory pans out | 18:51 |
*** pcaruana has quit IRC | 18:51 | |
fungi | of the two repos which were open for read on the filesystem, the smaller one completed fsck with no errors (but most of the open file handles were to the larger one which will take a few minutes to fsck) | 18:57 |
tobiash | corvus: I guess git cat-file is only used by cat commands? | 18:57 |
tobiash | Maybe that interferes with parallel fetches | 18:58 |
tobiash | I'm not sure if those use the repo locks | 18:58 |
fungi | git fsck on the larger repo also didn't report any errors, just some ~800 dangling commits but that's probably normal | 19:00 |
fungi | the git cat-file processes were clearly stuck on a read call though, according to strace. maybe something changed out from under them? | 19:01 |
tobiash | Maybe a garbage collect triggered by a concurrent fetch | 19:01 |
tobiash | If that truncates the pack file while cat is reading... | 19:02 |
fungi | yeah, seems like there are opportunities for races there at least | 19:06 |
*** hashar has quit IRC | 19:08 | |
tobiash | fungi: hrm, the only thing in the stack trace that reads a stream is actually this: http://paste.openstack.org/show/788549/ | 19:26 |
openstackgerrit | Jeremy Stanley proposed zuul/zuul master: Add notes on thread dumping and yappi https://review.opendev.org/703185 | 19:26 |
tobiash | that's probably reading one of those pack indexes | 19:26 |
fungi | tobiash: yep, that's the same one corvus pointed out too | 19:27 |
fungi | sounds like some consensus at least | 19:27 |
tobiash | and I cannot find anything else that accesses the repo and isn't protected by the lock | 19:27 |
tobiash | so could also be infrastructure related | 19:28 |
fungi | possible, though the kernel didn't think there was any problem with the server's storage layer | 19:33 |
fungi | also, zuul-maint: yeads up that in #openstack-infra we're discussing what looks like a rapid scheduler memory leak a day or two after restarting on what was tagged as 3.15.0 | 19:34 |
fungi | er, heads up i mean | 19:34 |
*** tosky has joined #zuul | 19:41 | |
openstackgerrit | Merged zuul/zuul master: Defer setting build result to event queue https://review.opendev.org/666643 | 19:45 |
*** michael-beaver has quit IRC | 20:00 | |
*** openstackstatus has joined #zuul | 20:02 | |
*** ChanServ sets mode: +v openstackstatus | 20:02 | |
*** dtroyer has quit IRC | 20:21 | |
*** armstrongs has joined #zuul | 20:28 | |
*** dtroyer has joined #zuul | 20:28 | |
*** armstrongs has quit IRC | 20:38 | |
*** zxiiro has quit IRC | 20:42 | |
*** jamesmcarthur has joined #zuul | 20:51 | |
*** jamesmcarthur has quit IRC | 20:53 | |
*** jamesmcarthur has joined #zuul | 20:55 | |
*** jamesmcarthur has quit IRC | 21:20 | |
openstackgerrit | Merged zuul/zuul master: Fix release note for a 3.0.2 feature https://review.opendev.org/703109 | 21:22 |
openstackgerrit | Merged zuul/zuul master: Docs: re-order reference index https://review.opendev.org/702962 | 21:35 |
*** rlandy has quit IRC | 21:36 | |
clarkb | One thing I've noticed looking at this memory leak is that we seem to compile new config schemas for every config object | 21:49 |
clarkb | I think we could probably compile them once and use them over and over since the config object rules don't change without restarting zuul (and getting new code) | 21:49 |
openstackgerrit | Merged zuul/zuul master: Docs: move project config docs to user reference https://review.opendev.org/702992 | 21:55 |
*** hashar has joined #zuul | 22:04 | |
openstackgerrit | Merged zuul/zuul master: Docs: move overview section to reference https://review.opendev.org/702995 | 22:12 |
hashar | bah I looked at Zuul doc earlier today and the version from this morning is already obsolete! | 22:15 |
corvus | hashar: yeah sorry, we're moving stuff around. once the outstanding changes land, i'll re-do the redirects | 22:18 |
corvus | clarkb: speaking of which, are you interested in a +3 on https://review.opendev.org/703135 ? | 22:18 |
hashar | i usually just tox -e docs ;] | 22:18 |
hashar | turns out I have ready doc from earlier this week | 22:19 |
corvus | hashar: oh, then you'll want to use grep! i did that earlier.... :) | 22:19 |
clarkb | corvus: looking | 22:19 |
corvus | clarkb: re schemas -- that's probably true for most of them except maybe project-pipelines | 22:20 |
hashar | zuul/zuul: tox-py37 SUCCESS in 41m 57s , doh that has grown tremendously | 22:20 |
openstackgerrit | Merged zuul/zuul master: Docs: add admin reference section https://review.opendev.org/702997 | 22:37 |
*** hashar has quit IRC | 23:02 | |
openstackgerrit | Merged zuul/zuul master: Docs: flatten directory structure https://review.opendev.org/703135 | 23:03 |
*** mattw4 has quit IRC | 23:57 |
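
The thread dump mentioned in the log (the executor "responded to sigusr2 and wrote a thread dump") can be illustrated with a minimal Python sketch. This is not Zuul's actual handler: the names `format_thread_dump` and `install_dump_handler` are hypothetical, and the default signal assumes a POSIX system where `signal.SIGUSR2` exists.

```python
import io
import signal
import sys
import threading
import traceback


def format_thread_dump():
    """Return a text dump of the current stack of every live thread."""
    # Map thread identifiers to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    out = io.StringIO()
    for ident, frame in sys._current_frames().items():
        out.write("Thread: %s %s\n" % (ident, names.get(ident, "<unknown>")))
        out.write("".join(traceback.format_stack(frame)))
        out.write("\n")
    return out.getvalue()


def install_dump_handler(signum=None, stream=sys.stderr):
    """Write a thread dump to ``stream`` whenever ``signum`` arrives."""
    if signum is None:
        signum = signal.SIGUSR2  # assumption: POSIX only

    def handler(received_signum, frame):
        stream.write(format_thread_dump())
        stream.flush()

    signal.signal(signum, handler)
```

With such a handler installed, sending the signal to the process makes it write one stack per thread, which is the kind of output in which a thread blocked on a long read while holding a lock would stand out.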
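
fungi's idea of giving each parallel pip run its own cache, combined with tobiash's suggestion of two workers, could look roughly like the following. This is a sketch under assumptions: the helper names, the cache layout, and the idea of passing one interpreter path per target virtualenv are all hypothetical; `--cache-dir` is pip's option for relocating its cache.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pip_install_command(python, requirement, cache_root):
    """Build a pip command line that uses a cache directory of its own."""
    # Derive a filesystem-safe directory name from the requirement string.
    safe = "".join(c if c.isalnum() else "_" for c in requirement)
    cache_dir = os.path.join(cache_root, safe)
    return [python, "-m", "pip", "install",
            "--cache-dir", cache_dir, requirement]


def install_all(targets, cache_root, max_workers=2):
    """Run the installs with at most ``max_workers`` in parallel.

    ``targets`` is a list of (python, requirement) pairs, one per
    virtualenv (hypothetical layout). Returns the pip exit codes.
    """
    commands = [pip_install_command(p, r, cache_root) for p, r in targets]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, commands))
```

As noted in the log, the cost of separate caches is that some packages are retrieved or built more than once.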