opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784 | 04:20 |
---|---|---|
*** yadnesh|away is now known as yadnesh | 04:48 | |
opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784 | 05:30 |
*** dasm|off is now known as Guest188 | 05:30 | |
opendevreview | Cedric Jeanneret proposed opendev/base-jobs master: Ensure NetworkManager will not override /etc/resolv.conf https://review.opendev.org/c/opendev/base-jobs/+/865428 | 08:13 |
opendevreview | Cedric Jeanneret proposed opendev/base-jobs master: Ensure NetworkManager will not override /etc/resolv.conf https://review.opendev.org/c/opendev/base-jobs/+/865428 | 08:21 |
opendevreview | Cedric Jeanneret proposed openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf https://review.opendev.org/c/openstack/project-config/+/865433 | 08:25 |
*** ysandeep__ is now known as ysandeep_afk | 08:37 | |
*** jpena|off is now known as jpena | 08:42 | |
akahat | please review: https://review.rdoproject.org/r/c/rdo-jobs/+/46183, https://review.rdoproject.org/r/c/rdo-infra/ci-config/+/45898, https://review.rdoproject.org/r/c/rdo-infra/ci-config/+/46112 | 08:49 |
akahat | ^^ ysandeep pojadhav|ruck bhagyashris chandankumar arxcruz frenzy_friday|rover | 08:50 |
*** prometheanfire is now known as Guest217 | 09:11 | |
*** yadnesh is now known as yadnesh|afk | 09:43 | |
*** ysandeep_afk is now known as ysandeep_ | 09:56 | |
*** yadnesh|afk is now known as yadnesh | 10:12 | |
*** ysandeep_ is now known as ysandeep_brb | 10:20 | |
*** ysandeep_brb is now known as ysandeep_afk | 10:29 | |
*** anbanerj is now known as frenzy_friday|rover | 10:47 | |
*** dviroel|out is now known as dviroel | 10:58 | |
apevec | akahat: that would be for #_oftc_#rdo:matrix.org | 11:41 |
opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784 | 11:47 |
*** ysandeep_afk is now known as ysandeep_ | 12:04 | |
*** frenzy_friday|rover is now known as frenzy_friday|rover|food | 12:22 | |
*** frenzy_friday|rover|food is now known as frenzy_friday|rover | 13:38 | |
*** Guest188 is now known as dasm | 13:58 | |
*** rcastillo|rover is now known as rcastillo | 13:58 | |
*** yadnesh is now known as yadnesh|away | 13:59 | |
*** dviroel is now known as dviroel|afk | 14:55 | |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Use kolla.config for kolla-ansible in gerrit https://review.opendev.org/c/openstack/project-config/+/865686 | 14:59 |
*** ysandeep_ is now known as ysandeep_dinner | 15:41 | |
fungi | infra-root: our let's encrypt cert renewals seem to have started breaking coincident with the release of acme.sh 3.0.5 last week (hence the increasing list of certs which are expiring in less than a month) | 15:52 |
fungi | for some reason even though we cname e.g. _acme-challenge.eavesdrop01.opendev.org to acme.opendev.org the script now wants to find _acme-challenge.acme.opendev.org txt records instead | 15:52 |
fungi | (which don't exist) | 15:53 |
fungi | 3.0.5 includes 6 months of updates from their development project, so narrowing down the cause could take time. we may want to roll back to 3.0.4 in the meantime | 15:53 |
fungi | though we do still have a few weeks to figure it out if we don't want to pin | 15:55 |
clarkb | fungi: can you expand a bit on what specific acme record it wants for say eavesdrop01? | 16:17 |
clarkb | fungi: also I think we may install acme.sh from tip of master and not use releases which might simplify tracking down the issue? | 16:17 |
opendevreview | Cedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content https://review.opendev.org/c/opendev/system-config/+/865869 | 16:19 |
Tengu | fungi: -^^ not really sure, but it seems to be the right thing... | 16:20 |
*** marios is now known as marios|out | 16:35 | |
*** ysandeep_dinner is now known as ysandeep_out | 16:37 | |
opendevreview | Cedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content https://review.opendev.org/c/opendev/system-config/+/865869 | 16:39 |
Tengu | let's see if testing is working. | 16:40 |
*** pojadhav|ruck is now known as pojadhav|out | 16:55 | |
fungi | clarkb: well, the commits which coincide with the start of the failures are not much better. basically "merge pr to sync with development before release" | 17:04 |
fungi | their workflow seems to be that they use a completely separate git repository for developing acme.sh and then they duplicate everything from it into the acme.sh master branch when they get ready to tag it | 17:05 |
opendevreview | Cedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content https://review.opendev.org/c/opendev/system-config/+/865869 | 17:06 |
clarkb | fungi: ya I'm looking and we use the dev branch | 17:08 |
fungi | clarkb: so as for the error, this is what we have in the log when we run `/opt/acme.sh/driver.sh renew -d eavesdrop01.opendev.org`: https://paste.opendev.org/show/bngNOQMTAPjK5PA26wsl/ | 17:08 |
clarkb | fungi: so it was something recent on that branch and not directly related to the release aiui | 17:08 |
fungi | last successful run of infra-prod-letsencrypt was 2022-11-23 03:41:22 and the first failure was 2022-11-24 03:41:10 so it likely happened sometime on 2022-11-23 | 17:09 |
fungi | https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-letsencrypt | 17:09 |
fungi | oh, i see, dev is actually a branch in the acme.sh repo after all. i got confused by their workflow involving pull requests without a fork | 17:11 |
clarkb | fungi: the other thing it could be is our updated ansible lists mangling for ansible 6 | 17:12 |
clarkb | perhaps we're not constructing valid data for acme anymore and it confuses acme.sh | 17:12 |
fungi | er, or maybe not? i'm confused because the dev branch has commits merged to it like "Merge pull request #4406 from acmesh-official/dev" | 17:12 |
fungi | almost like they have a hidden https://github.com/acmesh-official/dev repository | 17:12 |
clarkb | even if they did we don't consume the release | 17:13 |
clarkb | so we aren't pulling from it | 17:13 |
clarkb | oh the merge is on the dev branch | 17:13 |
fungi | yeah | 17:13 |
fungi | anyway, it does look, based on scant commit messages, as though they also started merging things that day which they expect to go into 3.0.6 so a regression in that batch makes sense | 17:14 |
fungi | it's almost as if the tool has started dereferencing cname records before deciding what cn to use for the cert | 17:16 |
clarkb | ianw's changes to system-config for ansible 6 have not landed yet so it isn't those (just to rule out the other idea I had earlier) | 17:17 |
fungi | yeah, i went looking for possible changes on our side which could have broken it before turning to digging through upstream commits and issues | 17:21 |
*** jpena is now known as jpena|off | 17:54 | |
fungi | i need a bit of a break, but will start looking through our options for fixing the le job, since i expect that's going to break my ability to deploy mailman3 on the new lists01 server i booted last week | 18:39 |
*** dviroel|afk is now known as dviroel | 18:39 | |
clarkb | agreed, that seems a high priority even if the existing certs are valid for a while specifically for that reason | 18:39 |
fungi | but once we get that sorted i'm hopefully we can merge the remaining topic:mailman3 changes | 18:44 |
fungi | er, hopeful | 18:45 |
*** rlandy is now known as rlandy|afk | 19:06 | |
clarkb | fungi: one thing I notice when digging into acme.sh is that they recently updated the key type to ecdsa by default. This shouldn't override any key types for existing installs but I think once fixed mailman3 will get ecdsa if not overridden. | 19:14 |
clarkb | This may limit which clients can talk to the site. Though maybe it's worth seeing if it is a problem before worrying about it | 19:14 |
clarkb | fungi: also the issue may have been introduced prior to the window of time you identified. The reason for this is we may not have had any certs that needed refreshing for a bit | 19:34 |
clarkb | our window is ~2 months since we renew after 2 months | 19:34 |
clarkb | however it is likely smaller than that as we have other certs that almost certainly have renewed more recently | 19:34 |
clarkb | our driver.sh script passes --challenge-alias acme.opendev.org | 19:37 |
clarkb | https://github.com/acmesh-official/acme.sh/blob/dev/acme.sh#L4737 I think that is the line that appends _acme-challenge to acme.opendev.org | 19:38 |
clarkb | as far as I can tell the code around this hasn't changed much | 19:42 |
clarkb | the TXT records are in place at acme.opendev.org not _acme-challenge.acme.opendev.org. So something has changed to cause it to do that prepending when we didn't do it previously? | 19:52 |
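The name acme.sh asks for can be reproduced from its documented DNS alias mode: the `--challenge-alias` value has `_acme-challenge.` prepended, while a `--domain-alias` value is used verbatim. The minimal Python sketch below assumes those documented semantics (the function name is hypothetical, not part of acme.sh) and shows why passing `--challenge-alias acme.opendev.org` leads acme.sh to name `_acme-challenge.acme.opendev.org`, even though the TXT records are published at `acme.opendev.org` itself.

```python
from typing import Optional


def challenge_txt_name(domain: str,
                       challenge_alias: Optional[str] = None,
                       domain_alias: Optional[str] = None) -> str:
    """Return the name where a DNS-01 TXT record is expected.

    Assumes acme.sh's documented DNS alias mode semantics:
    --domain-alias is used as-is, --challenge-alias gets
    "_acme-challenge." prepended, otherwise the domain itself does.
    """
    if domain_alias:
        return domain_alias
    if challenge_alias:
        return f"_acme-challenge.{challenge_alias}"
    return f"_acme-challenge.{domain}"


# The alias opendev's driver.sh passes, per the discussion above.
print(challenge_txt_name("eavesdrop01.opendev.org",
                         challenge_alias="acme.opendev.org"))
# prints _acme-challenge.acme.opendev.org
```

Under this reading, the name in the error message follows from the alias flag rather than from the CNAME, which would be consistent with the message being a red herring.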
clarkb | maybe we were falling into https://github.com/acmesh-official/acme.sh/blob/dev/acme.sh#L4734-L4735 previously somehow? | 19:53 |
ianw | interesting, thanks for pointing it out. i can look in a little | 20:03 |
clarkb | ianw: I'm beginning to think it is related to your exit code 3 thing | 20:04 |
clarkb | ianw: when we issue we check for rc 3 but when we renew we don't. It seems to fail because the renew path goes through the issue path and exits 3 | 20:04 |
clarkb | I suspect that thing might actually be working properly but acme.sh is bailing out early with the unrecognized/new error code? But I don't know how long ago those changes landed (haven't checked yet) just noticing that in our logs the rc is 3 and renew calls issue | 20:06 |
fungi | oh, yeah that's a great point, the task basically ends by saying the command exited nonzero, so maybe that's what changed? | 20:08 |
ianw | i don't think any of that changed recently, but i could be wrong | 20:08 |
fungi | but the error message does very specifically say to create a dns record at a place we don't have any record, so not sure if that's the script getting smarter and checking ahead of us? | 20:08 |
clarkb | ianw: ya I'm thinking maybe something earlier in issue() is not exiting earlier due to being called as part of renew? | 20:09 |
ianw | hrm, we should be logging the acme calls in more detail, let me pull up logs | 20:09 |
fungi | could be that message about the dns record is a red herring | 20:09 |
fungi | oh, good point, we have a separate acme log i didn't think to look at, i'm just going by the output recorded by ansible | 20:10 |
clarkb | ianw: we don't seem to set the debug flag fwiw | 20:10 |
clarkb | but maybe we should push a change that runs it with debug set in testing? | 20:10 |
clarkb | fungi: the acme log is the same as the ansible log I think | 20:11 |
ianw | [Mon Nov 28 03:52:05 UTC 2022] The dns manual mode can not renew automatically, you must issue it again manually. You'd better use the other modes instead. | 20:11 |
ianw | Unknown failure: 3 | 20:11 |
clarkb | ya we tee it | 20:11 |
clarkb | ianw: yup and rc 3 is what you added. And we handle it on the initial issue() call | 20:11 |
ianw | on each host in /var/log/acme.sh ... so yeah | 20:11 |
fungi | i don't see anything in it which is different from the stdout recorded by ansible though | 20:12 |
clarkb | but then after ansible has run to update the acme.opendev.org dns domain (the records are there) it runs the renew command and that exits 3 because its hitting that rc portion of the issue function | 20:12 |
clarkb | fungi: they are the same. we tee it | 20:12 |
fungi | okay, so no new info to be gleaned from the log | 20:12 |
fungi | so maybe we just didn't notice this right away because we didn't need to renew any certs for a while after the exit code change. when/where was that added? | 20:14 |
clarkb | it was added in like april I think | 20:14 |
clarkb | and we handle it on the issue side. | 20:14 |
clarkb | I suspect something else is side effecting the renew call which goes through issue to fall all the way through to exiting 3 | 20:14 |
clarkb | not that the exit 3 is directly at fault | 20:15 |
clarkb | basically when you do the renew call the dns records should already exist so it shouldn't fall through | 20:15 |
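The two-pass flow described here (issue exits 3 so the records can be published, then renew should find them already in place) can be sketched as a classifier for a wrapper around acme.sh. This is a hypothetical illustration: the function and constant names are not taken from driver.sh, and only the value 3 comes from the logs above.

```python
# Exit code seen in the logs above when acme.sh in manual DNS mode
# still wants TXT records published ("Unknown failure: 3").
RC_DNS_RECORDS_NEEDED = 3


def classify_acme_rc(action: str, rc: int) -> str:
    """Hypothetical classifier for a wrapper that runs acme.sh twice."""
    if rc == 0:
        return "ok"
    if action == "issue" and rc == RC_DNS_RECORDS_NEEDED:
        # First pass: expected; publish the TXT records, then renew.
        return "publish-records"
    # Second pass (renew): the records should already exist, so any
    # non-zero code, including 3, means the DNS step was not skipped.
    return "failed"


for action, rc in [("issue", 3), ("renew", 0), ("renew", 3)]:
    print(action, rc, classify_acme_rc(action, rc))
```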
ianw | feels similar to https://github.com/acmesh-official/acme.sh/issues/2763 | 20:18 |
ianw | unfortunately, on the server, we seem to rotate out /var/log/acme.sh sufficiently that we don't have the logs from the last renewal to compare. | 20:21 |
clarkb | I'm having a hard time understanding how this ever worked, because it iterates through the list of entries and if it doesn't create a dns record for them it exits with an error (now 3 but previously 1) | 20:21 |
clarkb | and git log -p isn't showing me the deletion of any code that would've skipped ahead in the case of manual dns | 20:22 |
ianw | iirc it's going based on a file in the cert store | 20:25 |
clarkb | ya there is the .conf file in the cert store dir | 20:26 |
ianw | /etc/letsencrypt-certs/eavesdrop01.opendev.org/*.conf | 20:26 |
ianw | yep, that's it | 20:26 |
ianw | i think that the issue writes something in there, then the renewal path should pick that up | 20:27 |
ianw | Le_Vlist looks like it | 20:27 |
clarkb | ok it does check if verification is done already | 20:27 |
clarkb | I was initially reading that as checking if web verification had succeeded then skip dns verification but maybe it is checking dns earlier too | 20:27 |
clarkb | _savedomainconf "Le_OrderFinalize" "$Le_OrderFinalize" | 20:31 |
clarkb | that appears related to manual dns | 20:31 |
clarkb | but that seems to be processed on renew after we've already exited? | 20:33 |
ianw | https://github.com/acmesh-official/acme.sh/commit/38778f8adca0d016b27ad0f2a2fc367055c90091 touched the renew path ... but not that recently | 20:33 |
ianw | the logs are saying "Renew to" | 20:34 |
clarkb | yes we're failing on the second pass | 20:34 |
clarkb | and we are failing because it seems to be ignoring that we've already done the dns steps | 20:35 |
clarkb | basically it is bailing out on the `issue` path where it wants you to go manually edit dns. But we've already done that and want it to renew | 20:35 |
ianw | yeah, but i guess it *is* going into the renew path, at least at the start, from the "renew to" message | 20:37 |
clarkb | yes that renew() function calls issue() | 20:39 |
clarkb | just below where it emits "Renew to" | 20:39 |
clarkb | In issue() it does a [ -z $vlist ] | 20:39 |
clarkb | let me get a link. But I think we may expect that condition to be false to skip this stuff when renewing | 20:40 |
clarkb | https://github.com/acmesh-official/acme.sh/blob/dev/acme.sh#L4482 | 20:40 |
clarkb | with manual dns we write that list out to the config when we issue. Then when we renew we check if it is empty and if not we can skip all the issue steps. But where do we read it back? | 20:42 |
clarkb | it is set in the config too | 20:42 |
ianw | i'm also suspicious of -> https://github.com/acmesh-official/acme.sh/commit/38778f8adca0d016b27ad0f2a2fc367055c90091?diff=split -- it seems like in the renew path we're unconditionally going into _initpath | 20:42 |
clarkb | I don't see where Le_Vlist is ever read | 20:43 |
clarkb | so maybe it is always empty and something else was avoiding this problem previously but is no longer there? | 20:44 |
clarkb | eval "export $_rac_key=$_rac_value" ok I think that might try to read in all the config | 20:47 |
clarkb | but that function is not called by anything anymore | 20:48 |
clarkb | oh it is used outside of acme.sh | 20:48 |
clarkb | `git log -p` says `_readdomainconf Le_Vlist` has never been in the code base... | 20:52 |
clarkb | but also that -z $vlist condition hasn't changed | 20:54 |
ianw | i think it might just read in the .conf file and set each variable in there | 20:55 |
ianw | so it may not be explicitly reading the value of Le_Vlist if that makes sense | 20:55 |
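If acme.sh does load every `key='value'` line of the domain `.conf` file into shell variables, `Le_Vlist` would be populated without ever being read by name. The minimal Python sketch below assumes that single-quoted `key='value'` layout; the helper names and sample contents are hypothetical, and `must_redo_dns` is only a stand-in for the `[ -z $vlist ]` check, not a model of everything acme.sh does there.

```python
import shlex
from typing import Dict


def read_domain_conf(text: str) -> Dict[str, str]:
    """Parse key='value' lines (assumed layout of an acme.sh domain .conf)."""
    conf: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex removes shell quoting, e.g. 'value' -> value, '' -> ""
        conf[key.strip()] = " ".join(shlex.split(raw))
    return conf


def must_redo_dns(conf: Dict[str, str]) -> bool:
    """Stand-in for the [ -z $vlist ] check: empty list => redo DNS step."""
    return not conf.get("Le_Vlist")


# Hypothetical file contents; real Le_Vlist values are not shown in the log.
sample = "Le_Domain='eavesdrop01.opendev.org'\nLe_Vlist=''\n"
conf = read_domain_conf(sample)
print(conf["Le_Domain"], must_redo_dns(conf))
```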
ianw | i think code inspection via blame might be a luxury afforded by projects that take more care with their changelogs to give people context about WTF is happening, which unfortunately isn't the case here | 20:56 |
ianw | i might have to setup something to be able to git bisect test | 20:57 |
clarkb | the other place where we'd skip returning 3 is if the verification happens. However I'm still not sure if that is checking dns validation | 20:57 |
clarkb | but maybe the issue here is not in acme.sh but that le isn't validating us | 20:57 |
ianw | at least we know it must have worked a few months ago | 20:57 |
clarkb | and that causes acme.sh to try and do a new request | 20:57 |
*** dviroel is now known as dviroel|afk | 21:00 | |
clarkb | I think _initAPI may do some of this checking | 21:01 |
ianw | sorry got to afk for a little but will come back to it ... | 21:07 |
clarkb | ya I should take a break too. This 1000 line shell function is making my brain melt | 21:09 |
clarkb | I guess our CI for this doesn't run `issue` or `renew` commands so we can't test it that way | 21:11 |
clarkb | we're testing that ansible coordination not the acme.sh script | 21:11 |
clarkb | for some reason I thought we had it talk to the dev LE servers | 21:11 |
clarkb | at a quick check that may have been how we did things historically but we don't do that anymore? | 21:13 |
clarkb | I suspect our next step is to run acme.sh on a server with trouble manually and set the debug flag | 21:13 |
*** rlandy|afk is now known as rlandy | 21:39 | |
*** dasm is now known as dasm|off | 22:01 | |
ianw | yeah i think a manual run and trying to go backwards with acme.sh versions to try and narrow down what's changed | 22:05 |
ianw | i think i can probably run the LE playbook with --limit flags for a single host | 22:07 |
ianw | we do request the TXT records in CI, so we exercise that part of the process. but yeah, it's the actual issue that is problematic, because we don't make those records live obviously | 22:08 |
ianw | (of course that's the bit that's broken) | 22:09 |
ianw | /usr/ansible-venv/bin/ansible-playbook --limit 'adns*.opendev.org,eavesdrop01.opendev.org' ./playbooks/letsencrypt.yaml | 22:19 |
ianw | seems to be about right | 22:19 |
clarkb | ya that looks right to me. But you need to modify it to do debugging? | 22:19 |
clarkb | I don't think it does debugging as is | 22:20 |
ianw | no acme doesn't. i think probably if this fails (it's in a dns propagation pause) I can walk back the versions until it doesn't on this host, that will give us a clue where to start | 22:20 |
ianw | ... ok it did fail | 22:21 |
clarkb | I think that playbook will update the version too | 22:21 |
clarkb | I was thinking we would have to run things manually on eavesdrop to get around that | 22:21 |
ianw | so trying with 3.0.4 seems like the next step. i'll just manually edit that into my copy of the playbook | 22:21 |
ianw | yeah, currently we run against "dev" | 22:22 |
ianw | which is probably not totally great in itself, but anyway, one thing at a time :) | 22:22 |
ianw | technically i guess 3.0.5 might be behind dev, let's start there | 22:23 |
clarkb | ya it's a small number of commits behind | 22:24 |
ianw | hrm, that doesn't seem to want to renew the cert ... | 22:25 |
clarkb | ianw: there were changes around code to handle that. I think maybe even fixes? | 22:26 |
clarkb | it's possible that 3.0.5 is also broken due to that | 22:27 |
clarkb | though meetings.o.o's cert was renewed on November 20 | 22:27 |
clarkb | I don't see eavesdrop in my complaint emails | 22:28 |
clarkb | maybe whatever is broken in latest dev is causing eavesdrop to try and renew when it doesn't need to | 22:28 |
ianw | Valid and current certificate found | 22:29 |
ianw | it renewed eavesdrop01 | 22:29 |
ianw | https://paste.opendev.org/show/bF4NFlbN6rkOUQtgZmtg/ is the output from the first run (with -dev) | 22:31 |
clarkb | thats the renew command as before and it has an rc 3 too | 22:32 |
clarkb | but eavesdrop isn't in the emails warning about expirations | 22:32 |
clarkb | does that mean it renewed early? | 22:33 |
ianw | we also have -> drwxr-s--- 2 root letsencrypt 4.0K Nov 24 03:45 ptg.opendev.org_ecc | 22:33 |
ianw | i guess we now make ecc certs | 22:35 |
clarkb | ianw: yes that was one of the changes since 3.0.5. However I didn't expect it to do that for existing configs | 22:36 |
clarkb | since it should read the config out of the existing files and continue to do rsa? | 22:36 |
ianw | so, when I ran with -dev on eavesdrop01, it successfully renewed eavesdrop01, but failed on the SAN ptgbot cert | 22:43 |
ianw | then i ran it again with 3.0.5, and it decided not to renew either | 22:44 |
ianw | this extra _ecc popped up on 24th | 22:45 |
ianw | *maybe* -dev branch is forcing renewals to get the _ecc cert -- and failing with SAN certs? | 22:45 |
ianw | i am going to try running it again with -dev | 22:45 |
ianw | it should skip eavesdrop01, and we'll see if tries to renew ptgbot | 22:46 |
ianw | yep, that did happen -- except -- the renewal worked | 22:49 |
clarkb | ianw: the renewal for all certs worked? | 22:50 |
ianw | yep -- here's a comparison https://paste.opendev.org/show/bzUcLkEhKQeSNNkfa5Ci/ | 22:52 |
clarkb | ianw: ya notice on the working pass it does the verifying which when verified skips the requests which causes it to try and dns | 22:53 |
clarkb | ianw: I still have no idea why it wasn't verifying before | 22:53 |
clarkb | but maybe it was a bug that got fixed very recently? | 22:54 |
clarkb | (we should be on the lookout for issues with ecc I guess) | 22:54 |
ianw | well first time i ran it manually, it failed with the "3" error. then i reran it and it didn't | 22:55 |
clarkb | and its been failing for several days before that. So what changed? | 22:56 |
ianw | right now i remain confused | 22:57 |
clarkb | (ya sorry I don't expect anyone to have the answer to that, just talking out loud) | 22:57 |
clarkb | maybe we need to check LE for turkey outage? | 22:58 |
clarkb | (its possible verification didn't work due to their end?) | 22:58 |
ianw | i think the dev version, ecc certs, and multiple renewals at the same time all have something to do with it | 22:59 |
clarkb | ianw: that Verifying: line occurs after the rc 3 from before | 23:03 |
clarkb | which implies that it is skipping that now for whatever reason | 23:03 |
clarkb | (we don't get the log lines for the DNS records) | 23:03 |
clarkb | ianw: maybe we should set debugging always? | 23:06 |
clarkb | I don't know what gets included and how chatty that is to say if that is a problem | 23:06 |
clarkb | but might help future debugging if we run into problems | 23:06 |
ianw | yeah we can do that | 23:07 |
clarkb | I think there are multiple levels of debug logging too. Maybe stick to the least verbose one for now? | 23:08 |
*** Guest217 is now known as prometheanfire | 23:12 | |
ianw | so the grafana cert expires on Wed, 18 Jan 2023 03:13:17 GMT | 23:23 |
ianw | 60 days is November 19, 2022. | 23:25 |
ianw | i'm running it manually, with --debug 2 | 23:30 |
ianw | it wants to renew | 23:30 |
clarkb | maybe it is checking the key type and noticing it is rsa vs ecdsa and deciding it must renew then something gets wedged in that process? | 23:32 |
ianw | ... and it did renew :/ | 23:33 |
ianw | Not After | 23:34 |
ianw | Sun, 26 Feb 2023 22:31:17 GMT | 23:34 |
clarkb | ianw: does it try to renew if you rerun now. I wonder if LE rate limited us and caused the errors before | 23:35 |
clarkb | if it is trying to renew on every run for example | 23:35 |
clarkb | and even if a rerun doesn't try to renew I suppose we may have rate limited previously as we try to bash against those limits to cycle out all our certs... | 23:36 |
ianw | if i rerun it doesn't try to renew anything | 23:36 |
ianw | i.e. it knows it's an up to date cert | 23:37 |
clarkb | ok thats good. I'm beginning to suspect it could be a rate limit issue in converting all our certs all at once | 23:37 |
ianw | (this is all with -dev) | 23:37 |
clarkb | and some certs that will actually expire soon are the fallout | 23:37 |
ianw | https://paste.opendev.org/show/bH4IlmJfele3Djv14wZE/ is the failure list | 23:41 |
clarkb | ya and not all of them are expiring soon | 23:42 |
clarkb | so that very well could be what is happening? | 23:42 |
clarkb | maybe we should manually refresh those that are expiring soon (or did that already happen when you reran?) and avoid those expiring soon | 23:42 |
ianw | i haven't re-run globally, only targeted at etherpad and grafana | 23:43 |
clarkb | aha | 23:44 |
ianw | static might be an interesting one, because that has many certs | 23:44 |
clarkb | I wonder if there is an easy way to check if we are getting rate limited | 23:44 |
clarkb | alternatively we can stick to rsa and see if it settles down? | 23:44 |
clarkb | (that would try and reissue any ecdsa certs which might pendulum swing the opposite direction) | 23:45 |
ianw | i dunno -- the rsa cert on grafana was up for renewal | 23:47 |
clarkb | right, I think because they changed the default to ecdsa and it sees the difference and is trying to renew everything | 23:48 |
clarkb | I'm beginning to suspect the underlying issue is we have been trying to renew everything all at once and tripped some sort of rate limit | 23:48 |
clarkb | if we explicitly do rsa we'd stop trying to renew everything except for those close to expiration or what has already converted to ecdsa? | 23:48 |
ianw | static.opendev.org for example is Not After | 23:49 |
ianw | Wed, 11 Jan 2023 02:30:16 GMT | 23:49 |
ianw | so that's up for renewal too | 23:49 |
clarkb | ianw: well it wouldn't renew normally now | 23:50 |
clarkb | we renew with one month left which is ~december 11 | 23:50 |
ianw | isn't it 60 days? | 23:50 |
clarkb | LE gives us 90 day certs and we renew after 60 days (this is LE's suggested timeline) | 23:50 |
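The renewal timing works out as follows. The minimal sketch below assumes exactly 90 days of validity and the 60-day renewal point mentioned here (the function name is hypothetical). For the grafana cert expiring 18 Jan 2023 it gives a renewal date of 19 Dec 2022, and for static.opendev.org (expiring 11 Jan 2023) about 12 Dec 2022. November 19 is 60 days before expiry, which is only 30 days after issuance.

```python
from datetime import datetime, timedelta

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt certificate validity
RENEW_AFTER = timedelta(days=60)    # renewal point discussed above


def renewal_due(not_after: datetime) -> datetime:
    """Date a cert becomes due for renewal, given its Not After time."""
    issued = not_after - CERT_LIFETIME  # approximate issuance time
    return issued + RENEW_AFTER         # i.e. 30 days before expiry


grafana = datetime(2023, 1, 18, 3, 13, 17)
static = datetime(2023, 1, 11, 2, 30, 16)
print(renewal_due(grafana))  # 2022-12-19 03:13:17
print(renewal_due(static))   # 2022-12-12 02:30:16
```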
clarkb | I think the update to defaulting to ecdsa in acme.sh very likely has caused us to try and renew everything early and all at once | 23:51 |
clarkb | which could be tripping rate limits | 23:51 |
clarkb | I think we are well under the total names allowed but may be tripping the distinct cert limit? | 23:52 |
ianw | ahh, hrm, 60 days old | 23:54 |
clarkb | possibly also they are complaining we are renewing within the 60 day window | 23:54 |
clarkb | I wonder if they have a different rate limit for that too | 23:54 |
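One way to test the rate-limit theory offline is to count recent issuances for the exact same set of names. Let's Encrypt documents a Duplicate Certificate limit keyed on the exact hostname set (ignoring case and order), not on the key type, so an RSA-to-ECDSA switch for unchanged names would count against it. The sketch below assumes a limit of 5 per rolling 7 days (check LE's current rate-limit documentation before relying on it). It also assumes the issuance history is already available, for example from Certificate Transparency logs. The function is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def duplicate_limit_hit(history: Iterable[Tuple[datetime, List[str]]],
                        names: List[str],
                        now: datetime,
                        limit: int = 5,
                        window: timedelta = timedelta(days=7)) -> bool:
    """True if `names` already has `limit` issuances inside `window`.

    A duplicate is the exact same hostname set, ignoring case and order;
    key type (RSA vs ECDSA) plays no part.
    """
    wanted = frozenset(n.lower() for n in names)
    recent = [
        issued for issued, issued_names in history
        if frozenset(n.lower() for n in issued_names) == wanted
        and now - issued < window
    ]
    return len(recent) >= limit


# Illustrative history: five issuances of one SAN set in the last 5 days.
now = datetime(2022, 11, 28)
history = [(now - timedelta(days=d), ["a.example.org", "b.example.org"])
           for d in range(5)]
print(duplicate_limit_hit(history, ["B.example.org", "a.example.org"], now))
```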
ianw | let me try static with debug on | 23:55 |
ianw | i need to make the driver script dump the domain its working on too | 23:56 |
ianw | ok, it failed, i have a debug=2 dump of it | 23:58 |