Monday, 2022-11-28

opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories
*** yadnesh|away is now known as yadnesh04:48
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories
*** dasm|off is now known as Guest18805:30
opendevreviewCedric Jeanneret proposed opendev/base-jobs master: Ensure NetworkManager will not override /etc/resolv.conf
opendevreviewCedric Jeanneret proposed opendev/base-jobs master: Ensure NetworkManager will not override /etc/resolv.conf
opendevreviewCedric Jeanneret proposed openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf
*** ysandeep__ is now known as ysandeep_afk08:37
*** jpena|off is now known as jpena08:42
akahatplease review:,,
akahat^^ ysandeep pojadhav|ruck bhagyashris chandankumar arxcruz frenzy_friday|rover 08:50
*** prometheanfire is now known as Guest21709:11
*** yadnesh is now known as yadnesh|afk09:43
*** ysandeep_afk is now known as ysandeep_09:56
*** yadnesh|afk is now known as yadnesh10:12
*** ysandeep_ is now known as ysandeep_brb10:20
*** ysandeep_brb is now known as ysandeep_afk10:29
*** anbanerj is now known as frenzy_friday|rover10:47
*** dviroel|out is now known as dviroel10:58
apevecakahat:  that would be for 11:41
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories
*** ysandeep_afk is now known as ysandeep_12:04
*** frenzy_friday|rover is now known as frenzy_friday|rover|food12:22
*** frenzy_friday|rover|food is now known as frenzy_friday|rover13:38
*** Guest188 is now known as dasm13:58
*** rcastillo|rover is now known as rcastillo13:58
*** yadnesh is now known as yadnesh|away13:59
*** dviroel is now known as dviroel|afk14:55
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Use kolla.config for kolla-ansible in gerrit
*** ysandeep_ is now known as ysandeep_dinner15:41
fungiinfra-root: our let's encrypt cert renewals seem to have started breaking coincident with the release of 3.0.5 last week (hence the increasing list of certs which are expiring in less than a month)15:52
fungifor some reason even though we cname e.g. to the script now wants to find txt records instead15:52
fungi(which don't exist)15:53
fungi3.0.5 includes 6 months of updates from their development project, so narrowing down the cause could take time. we may want to roll back to 3.0.4 in the meantime15:53
fungithough we do still have a few weeks to figure it out if we don't want to pin15:55
clarkbfungi: can you expand a bit on what specific acme record it wants for say eavesdrop01?16:17
clarkbfungi: also I think we may install from tip of master and not use releases which might simplify tracking down the issue?16:17
opendevreviewCedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content
Tengufungi: -^^   not really sure, but it seems to be the right thing...16:20
*** marios is now known as marios|out16:35
*** ysandeep_dinner is now known as ysandeep_out16:37
opendevreviewCedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content
Tengulet's see if testing is working.16:40
*** pojadhav|ruck is now known as pojadhav|out16:55
fungiclarkb: well, the commits which coincide with the start of the failures are not much better. basically "merge pr to sync with development before release"17:04
fungitheir workflow seems to be that they use a completely separate git repository for developing and then they duplicate everything from it into the master branch when they get ready to tag it17:05
opendevreviewCedric Jeanneret proposed opendev/system-config master: Allow to cache ansible-galaxy content
clarkbfungi: ya I'm looking and we use the dev branch17:08
fungiclarkb: so as for the error, this is what we have in the log when we run `/opt/ renew -d`:
clarkbfungi: so it was something recent on that branch and not directly related to the release aiui17:08
fungilast successful run of infra-prod-letsencrypt was 2022-11-23 03:41:22 and the first failure was 2022-11-24 03:41:10 so it likely happened sometime on 2022-11-2317:09
fungioh, i see, dev is actually a branch in the repo after all. i got confused by their workflow involving pull requests without a fork17:11
clarkbfungi: the other thing it could be is our updated ansible lists mangling for ansible 617:12
clarkbperhaps we're not constructing valid data for acme anymore and it confuses acme.sh17:12
fungier, or maybe not? i'm confused because the dev branch has commits merged to it like "Merge pull request #4406 from acmesh-official/dev"17:12
fungialmost like they have a hidden repository17:12
clarkbeven if they did we don't consume the release17:13
clarkbso we aren't pulling from it17:13
clarkboh the merge is on the dev branch17:13
fungianyway, it does look, based on scant commit messages, as though they also started merging things that day which they expect to go into 3.0.6 so a regression in that batch makes sense17:14
fungiit's almost as if the tool has started dereferencing cname records before deciding what cn to use for the cert17:16
clarkbianw's changes to system-config for ansible 6 have not landed yet so it isn't those (just to rule out the other idea I had earlier)17:17
fungiyeah, i went looking for possible changes on our side which could have broken it before turning to digging through upstream commits and issues17:21
*** jpena is now known as jpena|off17:54
fungii need a bit of a break, but will start looking through our options for fixing the le job, since i expect that's going to break my ability to deploy mailman3 on the new lists01 server i booted last week18:39
*** dviroel|afk is now known as dviroel18:39
clarkbagreed, that seems a high priority even if the existing certs are valid for a while specifically for that reason18:39
fungibut once we get that sorted i'm hopefully we can merge the remaining topic:mailman3 changes18:44
fungier, hopeful18:45
*** rlandy is now known as rlandy|afk19:06
clarkbfungi: one thing I notice when digging into is that they recently updated the key type to ecdsa by default. This shouldn't override any key types for existing installs but I think once fixed mailman3 will get ecdsa if not overridden.19:14
clarkbThis may limit clients that can talk to the site. Though maybe its worth seeing if it is a problem before worrying about it19:14
clarkbfungi: also the issue may have been introduced prior to the window of time you identified. The reason for this is we may not have had any certs that needed refreshing for a bit19:34
clarkbour window is ~2 months since we renew after 2 months19:34
clarkbhowever it is likely smaller than that as we have other certs that almost certainly have renewed more recently19:34
clarkbour script passes --challenge-alias acme.opendev.org19:37
clarkb I think that is the line that appends _acme-challenge to acme.opendev.org19:38
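The alias arrangement discussed here can be sketched as below. This is a minimal illustration, assuming acme.sh's DNS alias mode works as described in the conversation: the per-host `_acme-challenge` name is a CNAME, and the TXT record is looked up under `_acme-challenge` prefixed to the `--challenge-alias` value. The function name `challenge_names` and the host `host.example.org` are hypothetical; only `acme.opendev.org` comes from the log.

```python
def challenge_names(domain: str, challenge_alias: str) -> tuple[str, str]:
    """Return (cname_name, txt_name) for a DNS-01 challenge in alias mode.

    Assumption: the validation name for a domain is "_acme-challenge.<domain>",
    and with a challenge alias the TXT record lives at
    "_acme-challenge.<challenge_alias>" instead.
    """
    cname_name = f"_acme-challenge.{domain}"
    txt_name = f"_acme-challenge.{challenge_alias}"
    return cname_name, txt_name


# Hypothetical host; the alias is the one passed by the script in the log.
cname_name, txt_name = challenge_names("host.example.org", "acme.opendev.org")
print(f"{cname_name} CNAME -> {txt_name} (TXT record created here)")
```

If the tool instead looked for the TXT record directly under the host's own `_acme-challenge` name, it would find nothing, which matches the symptom described earlier in the log.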
clarkbas far as I can tell the code around this hasn't changed much19:42
clarkbthe TXT records are in place at not So something has changed to cause it to do that prepending when we didn't do it previously?19:52
clarkbmaybe we were falling into previously somehow?19:53
ianwinteresting, thanks for pointing it out.  i can look in a little20:03
clarkbianw: I'm beginning to think it is related to your exit code 3 thing20:04
clarkbianw: when we issue we check for rc 3 but when we renew we don't. It seems to fail because the renew path goes through the issue path and exits 320:04
clarkbI suspect that thing might actually be working properly but is bailing out early with the unrecognized/new error code? But I don't know how long ago those changes landed (haven't checked yet) just noticing that in our logs the rc is 3 and renew calls issue20:06
fungioh, yeah that's a great point, the task basically ends by saying the command exited nonzero, so maybe that's what changed?20:08
ianwi don't think any of that changed recently, but i could be wrong20:08
fungibut the error message does very specifically say to create a dns record at a place we don't have any record, so not sure if that's the script getting smarter and checking ahead of us?20:08
clarkbianw: ya I'm thinking maybe something earlier in issue() is not exiting earlier due to being called as part of renew?20:09
ianwhrm, we should be logging the acme calls in more detail, let me pull up logs20:09
fungicould be that message about the dns record is a red herring20:09
fungioh, good point, we have a separate acme log i didn't think to look at, i'm just going by the output recorded by ansible20:10
clarkbianw: we don't seem to set the debug flag fwiw20:10
clarkbbut maybe we should push a change that runs it with debug set in testing?20:10
clarkbfungi: the acme log is the same as the ansible log I think20:11
ianw[Mon Nov 28 03:52:05 UTC 2022] The dns manual mode can not renew automatically, you must issue it again manually. You'd better use the other modes instead.20:11
ianwUnknown failure: 320:11
clarkbya we tee it20:11
clarkbianw: yup and rc 3 is what you added. And we handle it on the initial issue() call20:11
ianwon each host in var/log/ ... so yeah20:11
fungii don't see anything in it which is different from the stdout recorded by ansible though20:12
clarkbbut then after ansible has run to update the dns domain (the records are there) it runs the renew command and that exits 3 because its hitting that rc portion of the issue function20:12
clarkbfungi: they are the same. we tee it20:12
fungiokay, so no new info to be gleaned from the log20:12
fungiso maybe we just didn't notice this right away because we didn't need to renew any certs for a while after the exit code change. when/where was that added?20:14
clarkbit was added in like april I think20:14
clarkband we handle it on the issue side.20:14
clarkbI suspect something else is side effecting the renew call which goes through issue to fall all the way through to exiting 320:14
clarkbnot that the exit 3 is directly at fault20:15
clarkbbasically when you do the renew call the dns records should already exist so it shouldn't fall through20:15
ianwfeels similar to
ianwunfortunately, on the server, we seem to rotate out /var/log/ sufficiently that we don't have the logs from the last renewal to compare.20:21
clarkbI'm having a hard time understanding how this ever worked, because it iterates through the list of entries and if it doesn't create a dns record for them it exits with an error (now 3 but previously 1)20:21
clarkband git log -p isn't showing me the deletion of any code that would've skipped ahead in the case of manual dns20:22
ianwiirc it's going based on a file in the cert store20:25
clarkbya there is the .conf file in the cert store dir20:26
ianw /etc/letsencrypt-certs/*.conf20:26
ianwyep, that's it20:26
ianwi think that the issue writes something in there, then the renewal path should pick that up20:27
ianwLe_Vlist looks like it 20:27
clarkbok it does check if verification is done already20:27
clarkbI was initially reading that as checking if web verification had succeeded then skip dns verification but maybe it is checking dns earlier too20:27
clarkb_savedomainconf "Le_OrderFinalize" "$Le_OrderFinalize"20:31
clarkbthat appears related to manual dns20:31
clarkbbut that seems to be processed on renew after we've already exited?20:33
ianw touched the renew path ... but not that recently20:33
ianwthe logs are saying "Renew to"20:34
clarkbyes we're failing on the second pass20:34
clarkband we are failing because it seems to be ignoring that we've already done the dns steps20:35
clarkbbasically it is bailing out on the `issue` path where it wants you to go manually edit dns. But we've already done that and want it to renew20:35
ianwyeah, but i guess it *is* going into the renew path, at least at the start, from the "renew to" message20:37
clarkbyes that renew() function calls issue()20:39
clarkbjust below where it emits "Renew to"20:39
clarkbIn issue() it does a [ -z $vlist ]20:39
clarkblet me get a link. But I think we may expect that condition to be false to skip this stuff when renewing20:40
clarkbwith manual dns we write that list out to the config when we issue. Then when we renew we check if it is empty and if not we can skip all the issue steps. But where do we read it back?20:42
clarkbit is set in the config too20:42
ianwi'm also suspicious of -> -- it seems like in the renew path we're unconditionally going into _initpath20:42
clarkbI don't see where Le_Vlist is ever read20:43
clarkbso maybe it is always empty and something else was avoiding this problem previously but is no longer there?20:44
clarkbeval "export $_rac_key=$_rac_value" ok I think that might try to read in all the config20:47
clarkbbut that function is not called by anything anymore20:48
clarkboh it is used outside of acme.sh20:48
clarkb`git log -p` says `_readdomainconf Le_Vlist` has never been in the code base...20:52
clarkbbut also that -z $vlist condition hasn't changed20:54
ianwi think it might just read in the .conf file and set each variable in there20:55
ianwso it may not be explicitly reading the value of Le_Vlist if that makes sense20:55
ianwi think code inspection via blame might be a luxury afforded by projects that take more care with their changelogs to give people context about WTF is happening, which unfortunately isn't the case here20:56
ianwi might have to setup something to be able to git bisect test20:57
clarkbthe other place where we'd skip returning 3 is if the verification happens. However I'm still not sure if that is checking dns validation20:57
clarkbbut maybe the issue here is not in but that le isn't validating us20:57
ianwat least we know it must have worked a few months ago20:57
clarkband that causes to try and do a new request20:57
*** dviroel is now known as dviroel|afk21:00
clarkbI think _initAPI may do some of this checking21:01
ianwsorry got to afk for a little but will come back to it ...21:07
clarkbya I should take a break too. This 1000 line shell function is making my brain melt21:09
clarkbI guess our CI for this doesn't run `issue` or `renew` commands so we can't test it that way21:11
clarkbwe're testing that ansible coordination not the script21:11
clarkbfor some reason I thought we had it talk to the dev LE servers21:11
clarkbat a quick check that may have been how we did things historically but we don't do that anymore?21:13
clarkbI suspect our next step is to run on a server with trouble manually and set the debug flag21:13
*** rlandy|afk is now known as rlandy21:39
*** dasm is now known as dasm|off22:01
ianwyeah i think a manual run and trying to go backwards with versions to try and narrow down what's changed22:05
ianwi think i can probably run the LE playbook with --limit flags for a single host 22:07
ianwwe do request the TXT records in CI, so we exercise that part of the process.  but yeah, it's the actual issue that is problematic, because we don't make those records live obviously22:08
ianw(of course that's the bit that's broken)22:09
ianw /usr/ansible-venv/bin/ansible-playbook --limit 'adns*,' ./playbooks/letsencrypt.yaml22:19
ianwseems to be about right22:19
clarkbya that looks right to me. But you need to modify it to do debugging?22:19
clarkbI don't think it does debugging as is22:20
ianwno acme doesn't.  i think probably if this fails (it's in a dns propagation pause) I can walk back the versions until it doesn't on this host, that will give us a clue where to start22:20
ianw... ok it did fail22:21
clarkbI think that playbook will update the version too22:21
clarkbI was thinking we would have to run things manually on eavesdrop to get around that22:21
ianwso trying with 3.0.4 seems like the next step.  i'll just manually edit that into my copy of the playbook22:21
ianwyeah, currently we run against "dev"22:22
ianwwhich is probably not totally great in itself, but anyway, one thing at a time :)22:22
ianwtechnically i guess 3.0.5 might be behind dev, let's start there22:23
clarkbya its a small number of commits behind22:24
ianwhrm, that doesn't seem to want to renew the cert ...22:25
clarkbianw: there were changes around code to handle that. I think maybe even fixes?22:26
clarkbits possible that 3.0.5 is also broken due to that22:27
clarkbthough meetings.o.o's cert was renewed on November 2022:27
clarkbI don't see eavesdrop in my complaint emails22:28
clarkbmaybe whatever is broken in latest dev is causing eavesdrop to try and renew when it doesn't need to22:28
ianwValid and current certificate found22:29
ianwit renewed eavesdrop0122:29
ianw is the output from the first run (with -dev)22:31
clarkbthats the renew command as before and it has an rc 3 too22:32
clarkbbut eavesdrop isn't in the emails warning about expirations22:32
clarkbdoes that mean it renewed early?22:33
ianwwe also have -> drwxr-s--- 2 root letsencrypt 4.0K Nov 24 03:45 ptg.opendev.org_ecc22:33
ianwi guess we now make ecc certs22:35
clarkbianw: yes that was one of the changes since 3.0.5. However I didn't expect it to do that for existing configs22:36
clarkbsince it should read the config out of the existing files and continue to do rsa?22:36
ianwso, when I ran with -dev on eavesdrop01, it successfully renewed eavesdrop01, but failed on the SAN ptgbot cert22:43
ianwthen i ran it again with 3.0.5, and it decided not to renew either22:44
ianwthis extra _ecc popped up on 24th22:45
ianw*maybe* -dev branch is forcing renewals to get the _ecc cert -- and failing with SAN certs?22:45
ianwi am going to try running it again with -dev22:45
ianwit should skip eavesdrop01, and we'll see if it tries to renew ptgbot22:46
ianwyep, that did happen -- except -- the renewal worked22:49
clarkbianw: the renewal for all certs worked?22:50
ianwyep -- here's a comparison
clarkbianw: ya notice on the working pass it does the verifying which when verified skips the requests which causes it to try and dns22:53
clarkbianw: I still have no idea why it wasn't verifying before22:53
clarkbbut maybe it was a bug that got fixed very recently?22:54
clarkb(we should be on the lookout for issues with ecc I guess)22:54
ianwwell first time i ran it manually, it failed with the "3" error.  then i reran it and it didn't22:55
clarkband its been failing for several days before that. So what changed?22:56
ianwright now i remain confused22:57
clarkb(ya sorry I don't expect anyone to have the answer to that just talking out loud)22:57
clarkbmaybe we need to check LE for turkey outage?22:58
clarkb(its possible verification didn't work due to their end?)22:58
ianwi think the dev version, ecc certs, and multiple renewals at the same time all have something to do with it22:59
clarkbianw: that Verifying: line occurs after the rc 3 from before23:03
clarkbwhich implies that it is skipping that now for whatever reason23:03
clarkb(we don't get the log lines for the DNS records)23:03
clarkbianw: maybe we should set debugging always?23:06
clarkbI don't know what gets included and how chatty that is to say if that is a problem23:06
clarkbbut might help future debugging if we run into problems23:06
ianwyeah we can do that23:07
clarkbI think there are multiple levels of debug logging too. Maybe stick to the least verbose one for now?23:08
*** Guest217 is now known as prometheanfire23:12
ianwso the grafana cert expires on Wed, 18 Jan 2023 03:13:17 GMT23:23
ianw60 days is November 19, 2022.23:25
ianwi'm running it manually, with --debug 223:30
ianwit wants to renew23:30
clarkbmaybe it is checking the key type and noticing it is rsa vs ecdsa and deciding it must renew then something gets wedged in that process?23:32
ianw... and it did renew :/23:33
ianwNot After23:34
ianwSun, 26 Feb 2023 22:31:17 GMT23:34
clarkbianw: does it try to renew if you rerun now. I wonder if LE rate limited us and caused the errors before23:35
clarkbif it is trying to renew on every run for example23:35
clarkband even if a rerun doesn't try to renew I suppose we may have rate limited previously as we try to bash against those limits to cycle out all our certs...23:36
ianwif i rerun it doesn't try to renew anything23:36
ianwi.e. it knows it's an up to date cert23:37
clarkbok thats good. I'm beginning to suspect it could be a rate limit issue in converting all our certs all at once23:37
ianw(this is all with -dev)23:37
clarkband some certs that will actually expire soon are the fallout23:37
ianw is the failure list23:41
clarkbya and not all of them are expiring soon23:42
clarkbso that very well could be what is happening?23:42
clarkbmaybe we should manually refresh those that are expiring soon (or did that already happen when you reran?) and avoid those expiring soon23:42
ianwi haven't re-run globally, only targeted at etherpad and grafana23:43
ianwstatic might be an interesting one, because that has many certs23:44
clarkbI wonder if there is an easy way to check if we are getting rate limited23:44
clarkbalternatively we can stick to rsa and see if it settles down?23:44
clarkb(that would try and reissue any ecdsa certs which might pendulum swing the opposite direction)23:45
ianwi dunno -- the rsa cert on grafana was up for renewal23:47
clarkbright, I think because they changed the default to ecdsa and it sees the difference and is trying to renew everything23:48
clarkbI'm beginning to suspect the underlying issue is we have been trying to renew everything all at once and tripped some sort of rate limit23:48
clarkbif we explicitly do rsa we'd stop trying to renew everything except for those close to expiration or what has already converted to ecdsa?23:48
ianw for example is Not After23:49
ianwWed, 11 Jan 2023 02:30:16 GMT23:49
ianwso that's up for renewal too23:49
clarkbianw: well it wouldn't renew normally now23:50
clarkbwe renew with one month left which is ~december 1123:50
ianwisn't it 60 days?23:50
clarkbLE gives us 90 day certs and we renew after 60 days (this is LE's suggested timeline)23:50
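The renewal arithmetic in this exchange can be worked through with a short sketch. It assumes only what is stated in the log: certificates last 90 days and are renewed once they are 60 days old. The two Not After values are the ones quoted in the conversation; the function name `renewal_due` is hypothetical, and the real issue time may differ slightly from Not After minus 90 days.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)  # certificate lifetime, per the discussion
RENEW_AFTER = timedelta(days=60)    # renew once the cert is 60 days old


def renewal_due(not_after: datetime) -> datetime:
    """Estimate when a cert becomes due for renewal from its Not After time."""
    issued = not_after - CERT_LIFETIME  # approximate issue time
    return issued + RENEW_AFTER


# Not After values quoted in the log.
cert_a = datetime(2023, 1, 18, 3, 13, 17, tzinfo=timezone.utc)
cert_b = datetime(2023, 1, 11, 2, 30, 16, tzinfo=timezone.utc)

print(renewal_due(cert_a).date())  # 2022-12-19
print(renewal_due(cert_b).date())  # 2022-12-12
```

Under these assumptions neither cert would normally be due on 2022-11-28, which is consistent with the point that something other than age is triggering the renewals.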
clarkbI think the update to defaulting to ecdsa in very likely has caused us to try and renew everything early and all at once23:51
clarkbwhich could be tripping rate limits23:51
clarkbI think we are well under the total names allowed but may be tripping the distinct cert limit?23:52
ianwahh, hrm, 60 days old23:54
clarkbpossibly also they are complaining we are renewing within the 60 day window23:54
clarkbI wonder if they have a different rate limit for that too23:54
ianwlet me try static with debug on23:55
ianwi need to make the driver script dump the domain its working on too23:56
ianwok, it failed, i have a debug=2 dump of it23:58
