Wednesday, 2022-02-16

00:09 <clarkb> ianw: left a couple of notes on the first change in the wheel python stack
00:14 <ianw> clarkb: thanks; new version posted
00:43 <clarkb> ianw: ok left comments on the gpg encrypted log files stack too
00:47 <ianw> clarkb: it was a bit of a conscious choice to not tee the logs there, as i wondered if it was all quite long and not quite necessary in the console output, because it's not formatted that well for that context
00:50 <clarkb> ianw: hrm I tend to use them. But I know the ara is there as an alternative (as would the new log files)
00:50 <clarkb> I have an old habit of looking at the job output file before anything else. I can be convinced this is a bad idea :)
00:51 <ianw> as it changes the status quo, i can put in a "tee" there for now, and propose stopping doing this in a separate change
00:51 <clarkb> I'm happy to see if others have a preference
00:51 <clarkb> and just remove it if that is what people prefer
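The encrypt-file role under review wraps GnuPG. As a rough illustrative sketch only (the key id, filenames, and options below are placeholders invented here, not the role's actual contents), encrypting a log to a recipient key looks something like:

```shell
# Illustrative only -- not the actual encrypt-file role. Generate a
# throwaway recipient key in a temporary keyring, then encrypt a log
# file to it; a real deployment would import operators' public keys.
export GNUPGHOME=$(mktemp -d)
chmod 700 "$GNUPGHOME"
gpg --batch --pinentry-mode loopback --passphrase '' \
    --quick-generate-key log-recipient@example.org default default never
echo 'example production log line' > ansible.log
gpg --batch --yes --trust-model always \
    --recipient log-recipient@example.org \
    --output ansible.log.gpg --encrypt ansible.log
```

Only holders of the matching private key can then recover the log with `gpg --decrypt ansible.log.gpg`.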
00:52 *** rlandy|ruck is now known as rlandy|out
<opendevreview> Merged openstack/diskimage-builder master: Update platform support to describe stable testing
01:23 <ianw> thanks for the reviews, i will tweak things after some lunch
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
02:35 <Clark[m]> fungi: email indicates your GitHub fix is likely working
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
05:36 *** ysandeep|out is now known as ysandeep
<opendevreview> Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log
06:49 *** ykarel_ is now known as ykarel
07:38 *** dwhite449 is now known as dwhite44
08:03 *** amoralej|off is now known as amoralej
08:33 *** jpena|off is now known as jpena
08:59 *** ysandeep is now known as ysandeep|lunch
08:59 <lourot> fungi, re: github mirroring fix, this worked, thanks a lot!
09:00 <fungi> perfect. at least the missing repos were created. content may not show up for them until a new change merges in each
10:03 *** ysandeep|lunch is now known as ysandeep
10:25 *** pojadhav- is now known as pojadhav
10:31 *** ykarel_ is now known as ykarel
11:08 *** rlandy|out is now known as rlandy|ruck
11:21 *** dviroel|out is now known as dviroel
<opendevreview> Merged openstack/project-config master: Add Rocky Linux to nodepool elements tooling
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
13:02 *** amoralej is now known as amoralej|lunch
13:18 *** pojadhav is now known as pojadhav|brb
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
13:39 <frickler> kevinz_: hi, any update on the certificate? I'm still seeing the expiry warnings
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name
14:01 *** amoralej|lunch is now known as amoralej
14:16 <admin1> hi all.. i am setting up openstack-ansible and got this message
14:16 <admin1> You are not building wheels while running role against multiple hosts. This might result in DOS-ing OpenDev infrastructure servers. In order to proceed, please ensure that you have repo servers for selected OS version and architecture. If you want to avoid building wheel on purpose, ensure that you run playbook in serial manner. In case of causing
14:16 <admin1> unreasonable load on the git servers, your access may be blocked to protect other users and the OpenDev CI infrastructure which are reliant on this service.
14:19 <fungi> admin1: please ask in #openstack-ansible
14:20 <admin1> they sent me here :D
14:21 <fungi> all the opendev sysadmins know is that sometimes openstack-ansible users flood our git servers with repository clone requests from hundreds of systems at once and knock us offline, so the openstack-ansible maintainers added that warning after diagnosing the cause
<opendevreview> Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration
14:51 *** pojadhav|brb is now known as pojadhav
15:02 *** weechat1 is now known as amorin
15:21 *** pojadhav is now known as pojadhavdinner
15:21 *** pojadhavdinner is now known as pojadhav|dinner
15:26 *** ysandeep is now known as ysandeep|out
<fungi> another open source videoconferencing platform i hadn't seen before:
15:29 <fungi> looks like the underlying webrtc gateway implementation came from which i'd also never heard of
15:29 <fungi> part of the pandemic wfh vc explosion bubble, i guess
15:31 *** dviroel is now known as dviroel|lunch
16:00 *** ysandeep|out is now known as ysandeep
16:01 <fungi> nope, i guessed wrong: "Meetecho was born in 2009 as an official academic spin-off of the University of Napoli Federico II."
16:04 *** ysandeep is now known as ysandeep|out
16:22 *** rlandy|ruck is now known as rlandy|ruck|mtg
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name
16:37 *** rlandy|ruck|mtg is now known as rlandy|ruck
16:39 *** dviroel|lunch is now known as dviroel
17:04 *** marios is now known as marios|out
17:08 *** pojadhav|dinner is now known as pojadhav
17:31 <corvus> i'm going to look into the log streaming issue that ianw reported; anything new i should be aware of?
17:32 *** jpena is now known as jpena|off
17:33 <clarkb> not that I know of
17:33 <clarkb> I think we've all been distracted by other stuff and haven't had a chance to look at it closer
17:37 <clarkb> infra-root I'm approving and then will be ready to land. This last one could use one more review
17:46 <fungi> thanks for the heads up, looking now
17:49 <clarkb> fungi: re gerrit gitea I think I can technically land that change now. But historically davido has submitted changes for me when he has +2'd them so I'm thinking maybe he wants extra review on these?
17:50 <fungi> yeah, that's what i was wondering
17:50 <fungi> i didn't notice him adding other requested reviewers, but may have missed it
17:50 <clarkb> I can ping davido on their slack and get his opinion so that it is clear
<opendevreview> Merged opendev/system-config master: Remove configuration management for wiki servers
<opendevreview> Merged opendev/system-config master: Stop using puppet repos that will be retired
18:27 <clarkb> corvus: is non-responsive right now and zuul-web is spinning a cpu
18:27 *** pojadhav is now known as pojadhav|out
18:27 <clarkb> is this possibly related to your debugging? Should we restart zuul-web?
18:28 <fungi> or did he already restart zuul-web and it's still doing its smart-reconfig?
18:28 <clarkb> oh maybe?
18:29 <clarkb> the process is old but it does appear it is reloading its configs
18:29 <fungi> if memory serves, it takes zuul-web 15-20 minutes to restart now that it's been reimplemented as basically another scheduler
18:29 <clarkb> tailing the debug log shows it talking about config files
18:35 <corvus> i did sigusr2, so yappi is running... maybe that's slowing this all down
18:35 <clarkb> yes that could do it
18:35 <fungi> from the logs, i suspect we ended up doing a reconfigure with yappi going
18:35 <corvus> i hit it again, hopefully it speeds up now
18:36 <corvus> check out the grafana
18:36 <corvus> peaked at 200 queued requests
18:37 <corvus> back to normal now
19:00 *** rlandy|ruck is now known as rlandy|ruck|mtg
20:14 *** rlandy|ruck|mtg is now known as rlandy|ruck
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
<opendevreview> Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
20:43 *** rlandy|ruck is now known as rlandy|ruck|mtg
20:48 <clarkb> ianw: if you get a chance can you review for improved haproxying with gitea?
20:51 <clarkb> fungi: looks like davido merged the gerrit changes
20:52 <clarkb> fungi: I think we can test it without the depends-on now
20:53 <fungi> oh, yep! on it
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Use Gitea for Gerrit's code browser URLs
<opendevreview> Jeremy Stanley proposed opendev/system-config master: DNM: Fail our Gerrit testing for an autohold
21:00 <fungi> i've set a fresh autohold for the dnm change there and released the previous hold
21:02 <clarkb> and then if we update for that we'll pull in the ls-members fix too
21:04 *** rlandy|ruck|mtg is now known as rlandy|ruck
21:10 *** amoralej is now known as amoralej|off
21:20 <ianw> clarkb: why do we need to "verify none" for the production case -- wouldn't we have valid SSL certificates there?
21:21 <clarkb> ianw: we do have valid ssl certs in prod. The problem is testing it since you have to provide the ca files as well. Mostly just worried that if we land the change we'll suddenly have no valid backends because testing is difficult
21:22 <ianw> clarkb: perhaps as a follow-up we should switch one to do ssl checks in production, and if it's ok, switch the rest?
21:22 <fungi> well, even in prod we'd have to point to a ca file
21:23 <clarkb> fungi: ya but if they are all verify none but one the ca-file should be pretty non impactful for the other 7
21:23 <clarkb> ianw: I think that approach works
21:23 <fungi> ianw: keep in mind that this isn't a regression over what we already had with the tcp checks, even before apache, and we're separately checking and alerting on cert validity anyway outside of haproxy
21:23 <clarkb> we need to bind mount the /etc/ssl/certs/ca-certificates.crt file into the container then set the config to use it
21:23 <clarkb> and if one backend has a sad that isn't the end of the world
21:24 <ianw> ahh, i had sort of assumed the LB container would have the certs set up for LE
21:24 <ianw> i'm fine with it btw, just thinking through what we could do
21:25 <clarkb> ianw: we consume haproxy from upstream and they don't seem to include any certs
21:25 <clarkb> which kinda makes sense for the target audience of the image I guess
21:25 <fungi> the odds that one of our gitea servers would spontaneously have an invalid cert while the others are fine is fairly low, and having the load balancer make decisions based on cert validity also increases complexity, thus the chances that it might decide to take all the backends out of the pool because of an error somewhere
21:25 <clarkb> but we can bind mount it in from the host
21:25 <clarkb> fungi: ya also that
21:25 <clarkb> it does add more complexity and that is always opportunity for unexpected failure
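For reference, the sort of backend stanza under discussion looks roughly like this. This is a hypothetical sketch, not the actual opendev config: the backend and server names are invented, and the two server lines just illustrate the trade-off between "verify none" and verifying against a bind-mounted CA bundle.

```
backend balance_git_https
    option httpchk GET /
    # as merged: layer-7 HTTPS check without certificate verification
    server gitea01 gitea01.example.org:3081 check check-ssl verify none
    # the discussed follow-up: verify one backend against the host CA
    # bundle bind-mounted into the haproxy container
    server gitea02 gitea02.example.org:3081 check check-ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt
```

With `check-ssl verify none` haproxy still exercises the full TLS handshake and HTTP check, it just skips validating the certificate chain.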
21:26 *** dviroel is now known as dviroel|out
21:26 <fungi> my primary fear with load balancers is that they'll hit a condition in their check logic which causes them to invalidate all backends
21:27 <ianw> fair points, i guess it's more about the LB knowing who it's talking to in the back-end
21:27 <fungi> and this is fear borne from experience managing very expensive commercial load balancers for decades in a past life. it definitely happens
21:29 <fungi> telling the customer that a minor change to their website caused the load balancer cluster to suddenly decide none of their servers were viable destinations was never fun, and inevitably led them to question why they were even using load balancers if they could cause the site to go offline rather than preventing it
21:32 <fungi> so, yeah, simpler checks are better. the complexity of the layer-7 check is a mitigation for the reverse-proxy causing the service to seem up for simpler tcp socket checks when it isn't, but i think we need to carefully weigh any increase in check complexity against the benefits provided
21:33 <fungi> in particular, we know that we take the gitea containers offline when the container images are replaced
21:33 <fungi> so i think not sending traffic to them during those (fairly frequent) outages is worth the added risk
21:35 <clarkb> ya I think we definitely want a check that covers both apache and gitea
21:35 <clarkb> currently we only have apache. The proposed change should also cover gitea
21:45 <clarkb> looking at the timing of the change landing it appears I'll be starting my walk to get kids from school around when the job should merge. infra-root if you'd like I can put the lb in the emergency file now and remove it and run the playbook when I return
21:46 <clarkb> Considering it is tested I'm not too worried about it. But I won't be able to fix it for a little bit after it lands if something goes wrong
<opendevreview> Merged zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
21:59 <fungi> i should be done with dinner by then
22:04 <clarkb> cool I'll leave it be then
<opendevreview> Merged opendev/system-config master: Haproxy http checks for Gitea
22:09 <clarkb> it will end up behind the hourly jobs. Those take about half an hour iirc. I'll check when the school run is done
22:14 <ianw> i'm around, so can watch too
22:19 <rcastillo|rover> Hello. We're running into some issues with our tripleo centos 9 content provider jobs. They're failing on retry and the running jobs don't seem to print any logs
22:20 <rcastillo|rover> some issue with stream 9 nodes?
22:20 <ianw> rcastillo|rover: hrm, let me have a look
22:20 <ianw> let's check that one, seems the latest
22:24 <ianw> Adding node request <NodeRequest 300-0017269841 ['centos-9-stream']> for job <FrozenJob tripleo-ci-centos-9-content-provider> to item <QueueItem 4cd6063e4397459dabb0793b8f36afbf for <Change 0x7f1cc4175250 openstack/tripleo-ansible 828920,3> in check>
22:24 <ianw> 2022-02-16 22:05:04,927 DEBUG zuul.nodepool: [e: 1a7616b0e3c74112b17d45540797e1ab] Node request <NodeRequest 300-0017269841 ['centos-9-stream']> fulfilled
22:32 <ianw> 2022-02-16 22:05:04,884 DEBUG nodepool.driver.NodeRequestHandler[]: [e: 1a7616b0e3c74112b17d45540797e1ab] [node_request: 300-0017269841] Fulfilled node request
22:32 <ianw> nl04 fulfilled the request, and it went to ovh-bhs1
22:44 <ianw> 2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob: [e: 1a7616b0e3c74112b17d45540797e1ab] [build: e89d10ecb10a4812b1dd9e4ea50fdb1a] Exception while executing job
22:44 <ianw> on ze03 ... this might be a clue
22:48 <ianw> is the full error
22:49 <ianw> 2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob:   ValueError: SHA b'01e50ed7dac6cd25ce268d01cc457910633ccbf0' could not be resolved, git returned: b'01e50ed7dac6cd25ce268d01cc457910633ccbf0 missing'
22:49 <ianw> is the interesting bit
22:49 <fungi> corrupt repo cache?
22:49 <ianw> corvus: ^ might be able to short-cut my further investigations :)
22:49 <ianw> yeah, possibly
<ianw> 2022-02-16 22:04:24,899 DEBUG zuul.Repo.Ref: Create reference refs/heads/patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 at 01e50ed7dac6cd25ce268d01cc457910633ccbf0 in /var/lib/zuul/builds/e89d10ecb10a4812b1dd9e4ea50fdb1a/work/src/
22:50 <ianw> this is before it
22:52 <ianw> does not have a depends-on
22:52 <ianw> so that's something
22:53 <clarkb> looks like haproxy update hasn't happened yet and I'm back so can watch it
22:54 <ianw> root@ze03:/var/lib/zuul/executor-git/ git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:54 <ianw> fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:55 <clarkb> ianw: I think we need to check if that object is broken upstream too
22:55 <clarkb> if it is then there isn't much we can do. If it isn't then we likely need to take the executor out of rotation, remove the repo and let it repopulate after starting the executor back up again
22:55 <clarkb> considering this seems to be somewhat widespread I suspect this isn't a problem on our end but something we are managing to fetch from upstream (but we should confirm that)
22:56 <ianw> this must have run several times across executors you'd think
22:56 <ianw> yeah, what you said :)
22:56 <ianw> [iwienand@fedora19 community.general (main)]$ git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:56 <ianw> fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:57 <clarkb> that is unfortunate for ansible, but I'm not sure we can do anything about it?
22:57 <clarkb> eg that should be fixed upstream
22:57 <ianw> i guess i will file an issue with ansible, i feel like github may have to get involved
22:58 <clarkb> ya it is possible
22:58 <clarkb> I guess on our end we might be able to better capture the error somehow
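A quieter way to perform the check above is `git cat-file -e`, which exits non-zero for a missing object instead of printing a fatal error. The sketch below demonstrates the technique against a throwaway repo it creates itself; to reproduce the actual check, point `REPO` at the executor's git cache (the path and SHA from this incident are just pasted in as the values of interest).

```shell
# Sketch: test whether a SHA resolves to a commit without parsing
# "git show" output. Demonstrated on a freshly created placeholder repo.
REPO=$(mktemp -d)
git -C "$REPO" init -q
git -C "$REPO" -c user.email=t@example.org -c user.name=t \
    commit -q --allow-empty -m 'placeholder commit'
SHA=01e50ed7dac6cd25ce268d01cc457910633ccbf0   # the object zuul could not resolve
if git -C "$REPO" cat-file -e "${SHA}^{commit}" 2>/dev/null; then
    echo "object present"
else
    echo "object missing"
fi
```

Run here it prints "object missing", matching what both ze03's cache and a fresh clone reported for that SHA.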
22:59 *** prometheanfire is now known as Guest2
23:02 <ianw> i'm not exactly sure how zuul came up with 01e50ed...
23:03 <clarkb> looking at the traceback it is doing a setRepoState which sets up all the refs/branches iirc
23:03 <ianw> grep -r '01e50ed' * in .git/refs doesn't match anything
23:04 <ianw> seems related
23:05 *** Guest2 is now known as prometheanfire
23:05 <clarkb> the commit for that PR seems to be 01250ed
23:05 <clarkb> if I can type it right :)
23:06 *** dhill is now known as Guest5
23:06 <ianw> the branch was deleted
23:06 *** Guest5 is now known as dhill_
23:07 <dhill_> rlandy|ruck, hi
23:07 <ianw> i just want to make sure before blaming github that this is not some generic error you see if *you* have an old ref
23:08 <clarkb> ianw: the thing Zuul is attempting to do there is set all the branches to the right state. So maybe we aren't handling a branch being deleted properly if that is what has happened? However what is odd is that the ref is in the repo so you would've expected to find it anyway
23:08 <rlandy|ruck> dhill_: hey
23:08 <clarkb> ianw: how do we know the branch was deleted? Does github tell us that?
23:08 <rlandy|ruck> join the party
23:09 <dhill_> rlandy|ruck, we have too many irc networks
23:09 <clarkb> or just that clicking on the branch in the PR is a 404?
23:09 <rlandy|ruck> discussion in progress
23:09 <ianw> $ git show aaaaaed7dac6cd25ce268d01cc457910633ccbf0
23:09 <ianw> fatal: bad object aaaaaed7dac6cd25ce268d01cc457910633ccbf0
23:09 <ianw> so we get the same error with a bogus ref anyway
23:09 <ianw> which makes me think it's zuul's fault here, somehow
23:09 <ianw> clarkb: i'm looking at the last comment in
23:09 <clarkb> ianw: aha thanks
23:10 <clarkb> ianw: as far as zuul doing something wrong. I think the expectation is that Zuul sets every branch to the ref that is pointed to upstream. If zuul somehow missed the delete branch event and then didn't notice the branch was gone when listing branches to set state on that could happen?
23:11 <clarkb> I think another thing that makes this confusing is this repo seems to use a rebase not merge method of committing PRs (the sha for the PR commit changed in the target branch and there is no merge commit)
23:11 <clarkb> basically the branch and the commit get discarded. And ya maybe it is possible for zuul to get confused in that state
23:12 <ianw> that is what ze03's git cache repo has
23:13 <ianw> oh, pr-4136 is ! 4208
23:13 <clarkb> and that is a valid branch
23:14 <clarkb> ianw: did it log which branch repo state setting it was angry about?
23:14 <ianw> let me go back and find the exception to see what's above it more
23:15 <clarkb> I'm curious if we asked zuul to try again and see if it works now and this is a race with deleting branches and trying to update them locally
23:15 <clarkb> Zuul checks repo and finds PR branch with sha foo, repo deletes branch with sha foo, zuul tries to set branch with sha foo and cannot find sha foo sort of ordering
23:16 <ianw> seems to happen between 21:53:16 and 22:04:41 so that's a big window
23:16 <ianw> like 8 retries
23:16 <dhill_> ianw, so if I "recheck" my failing patch we'll know ?
23:16 <ianw> dhill_: yes, it's worth trying at least please
23:17 <dhill_> ianw, I just rechecked this one
23:18 <ianw> clarkb: that's all we have above exception for this repo ->
23:18 <clarkb> 829610 appears to have been in the check queue for an hour and 16 minutes already
23:19 <dhill_> it fails here
23:19 <dhill_> tripleo-ci-centos-9-content-provider : RETRY_LIMIT in 23s
23:19 <ianw> felixfontein deleted the patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 branch 1 hour ago
23:19 <clarkb> dhill_: ya but that is from 21:54 and it is 23:19 now
23:19 <clarkb> dhill_: and it is in check for an hour and 16 minutes so ~22:02 ish?
23:21 <clarkb> if you hover the one hour ago it gives you a proper timestamp. I suspect that this is a race
23:21 <ianw> hovering that tells me it was 8:52am local time, which was ... 1 hour 28 minutes ago, so UTC 21:52
23:21 <clarkb> ya and the failure was a couple minutes after that. Depending on how long it took zuul to construct a state for the job (which isn't always quick depending on merges etc) we could hit this race
23:21 <clarkb> In a repo that merged commits this would be a non issue
23:22 <clarkb> you'd have the commit in the repo still and could checkout a branch to that (even if the branch shouldn't exist anymore)
23:22 <ianw> ok, that sort of makes sense, but why is it still failing on recheck
23:22 <clarkb> I think it is specifically the combo of deleting while zuul is processing state for jobs containing that repo and the repo must use lossy merge process
23:22 <clarkb> ianw: I don't think it is?
23:23 <clarkb> the change dhill_ linked to is in check and has been since ~22:02 ish
23:23 <dhill_> clarkb, yeah it failed with a retry
23:23 <clarkb> 829610 is currently in check and has been since 22:02
23:23 <clarkb> I think the recheck worked and it's been running?
23:23 <clarkb> I feel like I'm missing something
23:26 <ianw> yeah, i agree; tripleo-ci-centos-9-content-provider is and currently paused
23:27 <clarkb> infra-root the haproxy config update happened but we don't reload the haproxy config when that happens. We do reload when the docker compose file is updated. I'm going to manually run the graceful reload command. Then push up a change to fix our config management
23:27 <fungi> sounds good, thanks
23:29 <clarkb> the docker-compose kill -s HUP haproxy command has been run and there is a new process. The old ones go away when they are done serving content iirc. I can still reach so that's good
<opendevreview> Clark Boylan proposed opendev/system-config master: Reload haproxy when its config updates
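The fix pushed up amounts to handler wiring along these lines. This is a hypothetical sketch only: the host group, paths, and task names are invented for illustration rather than copied from system-config.

```yaml
# Sketch: re-render the haproxy config and gracefully reload the
# container only when the rendered file actually changed.
- hosts: gitea-lb
  tasks:
    - name: Write out the haproxy config
      template:
        src: haproxy.cfg.j2
        dest: /var/haproxy/etc/haproxy.cfg
      notify: Reload haproxy
  handlers:
    - name: Reload haproxy
      command: docker-compose -f /etc/haproxy-docker/docker-compose.yaml kill -s HUP haproxy
```

The `template` task only reports "changed" (and thus fires the handler) when the rendered content differs, so steady-state runs leave the running haproxy processes alone.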
23:36 *** diablo_rojo_phone is now known as Guest7
23:39 <clarkb> dhill_: rlandy|ruck rcastillo|rover ianw reporting some info back from the zuul channel. A typical solution to this would be setting however that only helps if the repo in question protects the branches you care about (and I'm not sure how to determine that)
23:40 <clarkb> is an alternative that would help us but it hasn't merged yet. You'd specify which branches you cared about in those repos regardless of their protection strategy
23:43 * rlandy|ruck reads back
23:45 <rlandy|ruck> hmmm ... good question
23:47 <clarkb> corvus notes that the second option presented by 804177 presents its own challenges because now you're getting a window on the repository and if you need something outside that window you may fail too
23:47 <clarkb> Really the best option would be to not delete commits :)
23:47 <clarkb> but I suspect we won't get very far making that suggestion to people
23:47 <rlandy|ruck> I am not sure I have a good answer on the branches
23:48 <dhill_> I don't understand anything of this lol
23:49 <rlandy|ruck> we'd need to set up something like:
23:49 <rlandy|ruck> exclude-unprotected-branches:
23:49 <rlandy|ruck> but be clear what exact branches we need
23:50 <rlandy|ruck> which is not always constant
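In tenant-config terms the two options being weighed look roughly like this. A sketch only: the tenant name and branch list are made up, and `include-branches` assumes the syntax from the still-unmerged change 804177.

```yaml
- tenant:
    name: example-tenant
    source:
      github:
        untrusted-projects:
          - ansible-collections/community.general:
              # only helps if upstream actually protects the branches
              # this tenant cares about
              exclude-unprotected-branches: true
              # proposed alternative (change 804177, not yet merged):
              # an explicit allow list of branches zuul will consider
              include-branches:
                - main
                - stable-4
```

Either way the burden shifts to keeping this list in sync with upstream, which is the sustainability concern raised below.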
23:50 <clarkb> dhill_: let me try to explain a different way. For every buildset (collection of jobs) that zuul runs it attempts to configure a consistent repository state for every repository the jobs know about. This can take some time as it has to inspect the git content of all the places the jobs are configured. If one of those repos (like the ansible collection repo) does lossy deletions while
23:50 <clarkb> zuul is processing that state this can happen.
23:50 <clarkb> Basically people in github are deleting content from their repo while zuul is trying to process that content and that results in brokenness. Rerunning does seem to work though
23:51 <clarkb> rlandy|ruck: ya that is what 804177 would add to zuul and ya you'd be potentially trying to keep up with upstream changes that way
23:51 <rlandy|ruck> which doesn't seem sustainable
23:51 <dhill_> no failures yet
23:51 <rlandy|ruck> considering the number of repos we include
23:52 <clarkb> In general I think my suggestion would be that people don't perform lossy operations on their git repos as a good first step. But people do it because they like the cleanliness of it
23:52 <rlandy|ruck> clarkb: ianw: fungi: thanks for following this through
23:52 <rlandy|ruck> and yeah - it's a possible hit for us on any repo
23:52 <rlandy|ruck> at any time
23:53 <clarkb> rlandy|ruck: well only those that allow lossy operations. Those in gerrit shouldn't
23:53 <clarkb> I suppose another option here is to stop using those repos via zuul and clone them from github instead
23:53 <rlandy|ruck> that has other challenges
23:53 <clarkb> you'd trade lossy operation races for network bw and failures
23:53 <rlandy|ruck> we have done that in the past
23:53 <fungi> and could no longer depends-on pull requests
23:53 <rlandy|ruck> and decided zuul was far more reliable
23:54 <rlandy|ruck> I think we go forward as is - seems like the most maintainable solution
23:54 <clarkb> if you talk to your upstreams they may be willing to protect the branches that you care about then you can set that flag
23:54 <clarkb> and/or suggest they stop deleting content :)
23:55 <rlandy|ruck> I'll pick it up tomorrow morning with our internal infra people
23:56 <rlandy|ruck> we need a standard line here wrt repos we intend to include
23:56 <rlandy|ruck> thanks again ... I'm out
23:56 *** rlandy|ruck is now known as rlandy|out

Generated by 2.17.3 by Marius Gedminas - find it at!