Wednesday, 2021-09-29

clarkbcorvus: the same set of tripleo changes that we evicted last restart is apparently near merging and the estimates have them merging just before the zuul change. I'm thinking if those estimates are off it might be nice to give them a few more minutes just to see if we can flush them out of the queue pre restart00:02
corvusclarkb: wfm00:10
clarkbzuul just flushed them out. They beat the zuul change00:18
clarkbhrm zuul -2'd the zuul change00:18
corvuslooking at tests00:19
clarkbBoth unittest jobs failed and not timeouts00:19
corvusyeah, the new test... it may be racy :/00:20
clarkblooks like assertfinal state isn't empty because the periodic jobs are racing in00:20
clarkbcorvus: you should be able to update the config to remove the jobs at the end of the test?00:20
corvuswait i thought i had that in there00:20
corvusdoh... i deleted that with some extra test prints00:21
corvusthat was dumb; sorry :/00:21
clarkbno worries, happy to rereview00:22
corvusmerged; awaiting promote01:49
corvuspromote succeeded01:53
clarkbok keys loaded and ready if needed01:54
corvuspull done; i'll restart now01:56
clarkbcorvus: I think it is up now?02:08
corvusyep, will re-enqueue02:08
fungii came back just in time for the fun!02:08
corvusi got a couple of transient errors on re-enqueue that are worth looking into later, but i don't think they're critical now02:16
corvusi re-ran that and it worked 2nd time02:16
corvusre-enqueue finished02:17
corvus#status log restarted all of zuul on commit 659ba07f63dcc79bbfe62788d951a004ea4582f8 to pick up change cache fix for periodic jobs02:17
opendevstatuscorvus: finished logging02:17
corvusgenerally looking good to me.  hopefully tomorrow we won't see any misplaced changes in periodic pipelines.02:18
corvusi think we saw them in zuul and vexxhost, so those are the tenants to check02:18
fungimalformed entries in the cache?02:18
corvus(it's the 0600 entries that would show it; the hourly pipelines don't trigger the bug)02:19
corvusfungi: this bug:
fungibut yeah, doesn't look urgent02:19
corvusoh sorry the paste02:19
corvusyeah, not sure about that one yet02:19
corvusmight be a cache collision, like the cache is being written to or something02:20
fungiright, the paste, seems to have raised in zuul.zk.change_cache02:20
corvusit might be a place where i missed changing from tuple change keys to structured cache keys02:20
fungiahh, that seems plausible02:21
corvusyeah, i bet that's it.  should be straightforward to track down (and is probably missing test coverage)02:21
corvusi think i'll afk now02:22
fungihave a good evening!02:23
*** ykarel|away is now known as ykarel05:18
*** ysandeep|out is now known as ysandeep05:43
*** jpena|off is now known as jpena07:28
*** ykarel is now known as ykarel|lunch08:49
*** ykarel|lunch is now known as ykarel09:52
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report
*** jpena is now known as jpena|lunch11:24
*** jpena|lunch is now known as jpena12:22
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report
opendevreviewMerged opendev/bindep master: Add Rocky Linux support
*** cloudnull8 is now known as cloudnull14:20
*** ykarel is now known as ykarel|away14:59
fungiinfra-root: just a reminder to be on our toes tomorrow, the old letsencrypt root cert (IdenTrust DST Root CA X3) is expiring, so long as our certs are already signed by the new root (ISRG Root X1) we should be fine, pretty sure we checked previously but might not hurt to look one more time in case we need to force renewals somewhere15:18
fungipretty sure they switched to the new root cert long enough ago that any certs we had prior to it would themselves be long expired now anyway15:19
clarkbThey switched at the beginning of this year or end of last iirc15:20
clarkbit was a while ago15:20
fungiyep. though also need to be on the lookout for breakage with openssl <= 1.0.2 and the like15:20
*** dviroel is now known as dviroel|ruck15:23
*** marios is now known as marios|out15:49
clarkbfungi: I actually think a number of our servers are not serving the new intermediate. That surprises me. Maybe isn't updating the intermediate cert?16:00
clarkbno I don't think that is it since paste has the old chain too. What is going on here16:04
clarkb has a --preferred-chain flag and if you don't specify it the default offered chain is used. That implies LE is offering people chains that will expire in less time than the cert validity period?16:19
clarkbthat makes no sense but wouldn't surprise me given all the moving parts here16:19
opendevreviewClark Boylan proposed opendev/system-config master: Use the new LE chain to avoid expiring chain
clarkbinfra-root ^ I think that we want to land that then force reissuing all of our LE certs16:22
clarkbI'm going to fetch that onto bridge and run manually against paste16:24
*** jpena is now known as jpena|off16:38
clarkbI've managed to get a new cert but it appears to have used preexisting verification and the same chain on paste0116:49
clarkbhrm it actually gave us a new ca.cer file16:50
clarkbbut it is for the same chain as far as I can tell16:50
clarkbfungi: do you know how to force it to try again from scratch? it seems --force is insufficient16:51
clarkbI can try moving the /root/ stuff aside16:51
fungistill catching up, i'll double-check the certs first to make sure i understand the concern16:51
clarkbI'm really annoyed that LE is issuing certs this way at all16:54
fungiclarkb: when looking at the https cert for in firefox, it's serving two additional chain certs, the let's encrypt r3 cert, and the isrg root x1 cert16:54
clarkbThey should've stopped >3 months before the cert expires16:54
clarkbfungi: in firefox I see the r3 cert and the DST root cert not the isrg root16:55
fungifor etherpad?16:55
fungithat cert was renewed earlier today too, from the look of its "not before" field16:55
clarkbthough looking at chrome I see the ISRG cert, how can that happen?16:55
clarkbhuh chrome is showing me ISRG cert but firefox shows the DST16:56
fungi"Browsers (Chrome, Safari, Edge, Opera) generally trust the same root certificates as the operating system they are running on. Firefox is the exception: it has its own root store."
fungimaybe related?16:56
clarkbmaybe? This isn't the first time I've had problems with firefox's cert viewer either16:58
clarkbthe etherpad cert appears to be etherpad -> R3 -> ISRG -> DST (the primary chain for LE) in chrome16:58
clarkber sorry with s_client16:58
clarkbreview is review -> R3 -> ISRG (the alternate chain for LE) with s_client16:58
clarkbI'm tempted to set my local desktop clock to tomorrow and see what happens16:59
clarkb(though half worried that will make certain things angry)16:59
fungii want to say debian's firefox package actually hacks it to use the system trust instead of its own, so maybe that's why i'm getting different results than you are with it16:59
clarkbI'm using mozilla's beta build distribution17:00
clarkbre review's chain above I'm sorry I was wrong. Review is the same as etherpad according to s_client it is paste which I just updated using my change and --force that has the ISRG only chain17:01
clarkbI'm beginning to suspect that firefox's cert viewer isn't showing me the full chain and is instead showing the very end17:01
fungithere's a -no_alt_chains option to s_client which might change that behavior too17:01
clarkbno change with -no_alt_chains as it is a single linear chain aiui. The verification process is just supposed to stop before it gets to the DST root if something before it is trusted and verifies17:02
fungitesting via `openssl s_client -connect -showcerts` i get -> Let's Encrypt R3 -> ISRG Root X1 -> DST Root CA X317:03
clarkbyup and now if you do you'll lack the DST Root due to my preferred-cert and --force attempt17:03
fungiagreed the result with -no_alt_chains is identical17:03
clarkbI can revert that back by removing my --preferred-chain if we like?17:03
clarkband I guess the problem here is firefox17:04 -> Let's Encrypt R3 -> ISRG Root X117:04
fungiso the concern is that serving an expired ca cert in the bundle will cause problems for some clients even if they already trust a later ca in the chain provided?17:05
clarkbfungi: that is what old openssl et al will fail on17:05
fungigot it17:05
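The concern in the exchange above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: certificates are reduced to subject/issuer names, signatures are not checked, and the two functions `verify_trusted_first` and `verify_chain_first` are hypothetical stand-ins for "newer" and "older" client behaviour, not real library APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str


# Default served chain as described in the log:
# site -> R3 -> ISRG Root X1 (cross-signed by DST Root CA X3).
DEFAULT_CHAIN = [
    Cert("site", "R3"),
    Cert("R3", "ISRG Root X1"),
    Cert("ISRG Root X1", "DST Root CA X3"),
]
# Alternate chain: the cross-signed certificate is simply not served.
ALTERNATE_CHAIN = DEFAULT_CHAIN[:2]

# Toy trust stores after the DST root expires: name -> still valid?
MODERN_STORE = {"ISRG Root X1": True, "DST Root CA X3": False}
LEGACY_STORE = {"DST Root CA X3": False}  # never learned about ISRG Root X1


def verify_trusted_first(chain, store):
    """Stop at the first certificate whose issuer is a valid trusted root."""
    for cert in chain:
        if store.get(cert.issuer):
            return cert.issuer
    return None


def verify_chain_first(chain, store):
    """Follow the served chain to its end, then look only at that anchor."""
    anchor = chain[-1].issuer
    return anchor if store.get(anchor) else None
```

In this model a trusted-first client anchors at ISRG Root X1 with either chain, a chain-first client fails on the default chain because it ends at the expired DST root, and a client that only knows the DST root fails with both.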
clarkbhere is an interesting one. If I open the cert viewer for paste in FF I still see the DST root17:06
fungiso we do want to try to force removal of the old DST Root CA X3 from the bundle on all the servers (which we could do by hand in theory, just editing the chain bundle to delete it)17:06
clarkbI wonder if firefox is doing magic around their R317:06
clarkbfungi: I think I've shown that will do that the next time we issue certs17:06
fungiright, i concur, but that will happen gradually over the course of ~two months17:07
clarkbif we do that then certain older systems that don't trust the ISRG root will start to fail. Some of them will already fail due to the presence of the DST root. Others will succeed because they have a DST root hack (android)17:07
clarkbfungi: no it won't because LE's default chain is the one with DST in it to make older android happy17:07
fungii mean after we merge 81174917:07
fungiit won't take effect across all our servers instantly17:07
clarkboh right we would have to wait for normal renewals. I guess it is up to us whether or not we'd prefer to have the primary default chain in which case we do no updates or force the ISRG cert?17:08
clarkbMy initial thought on that is the alternate chain might be better because you can always explicitly add the ISRG cert to your trust chain then make old clients work17:08
clarkbBut if you have an old client and it sees the DST root in many cases there is no real workaround other than upgrading17:09
fungithe resulting bundle is the same except that --preferred-chain "X1" will omit DST Root CA X3 right?17:10
clarkbfungi: yup I think so. If you look in etc/letsencrypt-certs/ and etc/letsencrypt-certs/ you can compare them17:11
fungiso if we're worried we'll bump up against renewal limits for le's acme api, we could make that same edit locally is what i'm saying17:12
clarkbgot it17:12
clarkbto summarize: my firefox install is being really derpy (hopefully it doesn't break tomorrow but if it does it will be a local problem). Currently all our sites seem to have the site -> R3 -> ISRG -> DST chain except for paste01 where I manually reran a forced update with which removed DST from the chain17:13
clarkbAll that to say modern clients and our servers shouldn't have problems tomorrow when DST expires. Old clients may start to have problems. We can attempt to mitigate those either by upgrading or adding ISRG to the trust chain on systems and dropping DST from our served chains17:14
clarkbfungi: ^ does that seem about right?17:14
fungiyeah, so if we do nothing, things will presumably keep working with the default chain except old (openssl?) which will choke on there being an expired root cert in the bundle17:14
fungii agree. probably hard to decide on a particular course of action until we see exactly what the fallout winds up being17:15
clarkbfungi: I'm thinking maybe lets leave paste as is for now and we can use it to cross check if old clients show up17:15
fungiinfra-root: to summarize, we'll take no action now, but if clients run into problems validating https certs for our services on/after tomorrow we can get them to double-check against paste.o.o and consider merging 811749 if that ends up being the better behavior17:16
clarkbnote merging 811749 alone wasn't sufficient to have it reissue files I had to add a --force in my checkout on system-config to the --issue call17:17
clarkbhowever we can land 811749 then separately copy the ca.cer from paste.o.o to avoid reissuing things17:17
clarkbI think17:17
fungii concur17:18
clarkbfungi: One more question. If you look at the ca.cer and fullchain.cer files that is writing out it seems to not include the R3 cert, Just the root(s depending on where you look there is one or two)17:37
clarkbfungi: any idea why that is?17:37
clarkband yet somehow browsers and s_client all see the R3 in the middle17:38
fungii'll have to parse them with openssl x509 to confirm17:39
opendevreviewMerged zuul/zuul-jobs master: Make default tox run more strict about interpreter version
clarkbfungi: ok I think I've figured it out. The cert in ca.cer and fullchain.cer on paste01 is the R3 cert17:49
clarkbfungi: on other systems it is the R3 cert + the cross signed ISRG Root17:50
clarkbIn both cases the root most cert is omitted (ISRG and DST respectively) because you're expected to have that locally17:50
fungii just finished hacking the files apart because i couldn't work out how to get openssl x509 -in to parse multiple certs out of a single file17:50
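For the bundle-splitting problem mentioned above, a minimal Python sketch is one way to avoid hacking the files apart by hand. It assumes the bundle is ordinary PEM text; the function name `split_pem_bundle` is made up for illustration and is not part of any tool discussed in the log.

```python
import re

# Matches one PEM certificate block, including its BEGIN/END marker lines.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----\r?\n.*?\r?\n-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text):
    """Return each certificate in a PEM bundle as its own string.

    Anything outside the BEGIN/END markers (comments, blank lines)
    is ignored. The certificates themselves are not decoded or checked.
    """
    return [match.group(0) + "\n" for match in PEM_CERT_RE.finditer(text)]
```

Each returned string can then be written to its own file and inspected with `openssl x509 -noout -subject -issuer -in FILE`.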
clarkbIn my firefox case it must be getting the R3 cert then using its local files rather than provided chain to verify from there which is how I end up with DST17:51
clarkbI don't know what to expect locally tomorrow, but hopefully the blast radius is small and I can use chrome temporarily if necessary17:51
clarkbstill not great of firefox to do this imo17:52
fungiyeah, probably ff works its way up the chain from the server cert until it reaches a cert signed by something it considers valid from its trust store17:52
fungiso it hits the cross-signed cert, identifies one of the signers which is in its trust store, and reports that (even though the other signer is probably in its trust store too)17:53
clarkbyup and the other signer has a much better validity period17:53
fungifor all i know it could just match oldest first17:54
clarkboh this is even more interesting. I think the R3 cert firefox shows is different than the one we serve. It's almost like firefox has its own R3 cert and is trusting them from that point on rather than the next level down18:00
clarkbHopefully tomorrow it will use what is supplied if necessary18:01
*** ysandeep is now known as ysandeep|out18:53
*** timburke__ is now known as timburke19:39
melwittgrafana is ... in a state rn
clarkbmelwitt: it seems to load for me20:52
clarkbmaybe try a hard refresh?20:52
melwittoh weird20:52
melwitthm, hard refresh same result. all panels say "N/A" and the error is "Failed to fetch". maybe something with my vpn connection?20:55
clarkbpossibly? Grafana loads the graphite data from on the client side iirc20:56
clarkbso it could be an issue getting to the backend data store from your client.20:56
melwittjust checked with my phone that's not on vpn and same result. hm.20:56
clarkblet me try another browser20:56
clarkb loads for me20:57
clarkbin both firefox and chrome20:57
clarkbI wonder if it could be an ipv6 issue?20:58
melwittyou're right, shows "No Data" in graphite for me too20:58
clarkboh interesting20:58
clarkbmelwitt: does something like that show data?20:59
melwittclarkb: yes, that does show data21:00
clarkbok so you can directly reach graphite which is where grafana pulls data from21:00
clarkbdo any of the other graph dashboard work on grafana? wonder if it could be specific to this one for some reason21:00
clarkbalso I just checked on my phone and it works too21:01
melwittoh it's working now21:01
melwittyeah, whatever was wrong resolved itself. how strange21:01
clarkbweird. I guess let us know if it happens again and we can try to dig into it more21:01
melwittbut yeah earlier I tried the ceph fail rate dashboard and the logstash queue dashboard and both were erroring in the same way21:02
melwittok, thanks21:02
melwittit's still in an error state on my phone but it's working on my laptop @_@21:07
fungicould also be a cross-site blocker in your browser21:13
clarkbdo mobile browsers have the dev debugging tools? Might be good to see what exactly is failing like if it is a specific request?21:28
*** dviroel|ruck is now known as dviroel|out21:34
melwittI'll see what I can do21:45
fungilooks like we may have mirrored a broken fedora 34 package state...
clarkblooks like it can't find packages implying the index updated before the packages were present21:51
clarkbunfortunately for rpm mirrors we don't have a tool like reprepro to check things are valid before publishing21:51
fungiyeah, if the mirror has refreshed since that failure, i'll just recheck and see if it was a temporary state21:51
fungiand it has (just a few minutes ago)21:52
melwittclarkb: this is what I got with web inspector on my phone "TypeError: The certificate for this server is invalid. You might be connecting to a server that is pretending to be “” which could put your confidential information at risk."21:59
clarkbmelwitt: ok that is super helpful actually. We have a known issue with apache where asking it to reload to pick up cert rotations doesn't always work because it waits nicely for existing processes to finish up22:00
melwittI'm not sure why it didn't just say that in the browser somehow... instead it just shows broken graphs with "No Data"22:00
clarkbmelwitt: its possible that we did a cert rotation and then have a backend apache process that is still using the stale cert. I will check that now22:00
clarkbhrm graphite doesn't do apache22:01
clarkbgraphite's cert rotated on august 27. Its docker container which I expect is running nginx last restarted 4 weeks ago which is in line with a restart for new cert22:02
fungimelwitt: did it provide any additional explanation for why the certificate was invalid? like expiration or wrong hostname?22:04
clarkbs_client says the cert chain is just graphite -> R322:05
fungii wonder if this is the let's encrypt root key expiration taking effect22:05
melwittfungi: I didn't see more info but I'll look again. I'm not super familiar with web inspector22:06
clarkbfungi: its about 14 hours early if so22:06
fungiwe could try removing the old root cert from the intermediate bundle to see if that solves it, i suppose22:06
fungitime is an illusion, time on cell phones doubly so22:06
clarkbfungi: right but even my local system with s_client complains22:06
clarkbit does not complain with etherpad22:07
fungioh, openssl s_client is erroring?22:07
clarkbfungi: yes on my desktop22:07
fungiyeah, maybe the nginx inside that container does something weird with cert bundles22:07
fungimelwitt: have you tried going directly to to see if that gives you errors as well?22:09
melwittoh yeah there it goes again22:09
melwittlet's see...22:09
clarkbok we don't serve the chain at all looking at the nginx config22:11
fungiclarkb: i think it's not actually serving any chain certs? if i use -showcerts with s_client i only see the server cert22:11
clarkbfungi: should I try changing ssl_certificate /etc/letsencrypt-certs/; to ssl_certificate /etc/letsencrypt-certs/; ?22:11
fungiyeah, looks like we both arrived at the same conclusion22:11
fungiclarkb: for nginx you may need it, yeah22:12
clarkbok let me try that. Can always revert easily enough22:12
fungiunless it has a separate option for the chain bundle22:12
fungimelwitt: this seems likely to be the cause. your phone probably doesn't have the let's encrypt r3 root cert in its trust store, but your desktop might22:13
clarkb says fullchain is what we want I think22:13
melwittit's doing it on my desktop again22:13
melwittThis page is not secure (broken HTTPS).22:13
melwittCertificate - missing22:13
melwittThis site is missing a valid, trusted certificate (net::ERR_CERT_DATE_INVALID).22:13
clarkbI'll do the fullchain switch and restart and see if that makes it happy22:13
fungimelwitt: yep, okay this is likely the fix for the server side then22:13
fungithanks for the detail!22:13
fungii wonder what caused it to suddenly break22:14
fungimaybe the graphite container has changed?22:14
clarkbmelwitt: can you try it again22:15
clarkbfungi: this is config we supply22:15
clarkbfungi: I think what changed was LE started using the R3 intermediate and no one noticed until now22:15
fungiyeah, i'm wondering if something changed in configuration we don't supply, but i can't imagine what22:15
clarkbmelwitt: if it works now then I think this is the fix and I'll work on a change to make it permanent22:15
clarkbfungi: the R3 intermediate is new in LE22:15
melwittclarkb: looks like that worked22:15
funginew like today new?22:15
clarkbI don't know how new but relatively22:15
clarkbfungi: no older than that. Like february ish? but maybe they weren't giving it out to all requests yet22:15
fungibut we've got it at least as far back as the last cert refresh, so i wonder why it's gone unnoticed until now22:16
fungimaybe most of our users have the r3 cert in their browsers' trust stores already22:16
fungibecause we've been serving a cert relying on it there for over a month22:17
clarkbyes that is my expectation22:17
clarkball of my browsers were totally happy with it22:17
fungisame. bonkers22:18
fungigit grep says the graphite container is the only place we're doing this22:19
opendevreviewClark Boylan proposed opendev/system-config master: Use fullchain.cer on graphite for nginx
clarkbfungi: I think gitea too?22:19
clarkboh gitea would've before we switched it to apache22:19
clarkbnow I don't think it cares because apache22:19
clarkbmelwitt: thanks for working through this with us. Config error on our end22:20
fungiyeah, no matches on ssl_certificate outside the graphite config template22:20
clarkbfungi: oh I was grepping for fullchain. gitea did its own termination in its golang https server22:20
fungiahh, yeah i was looking for other nginx server configs we might have missed adding the fullchain.cer file to22:21
clarkbgood idea :)22:21
melwittclarkb: np. it's still doing it on my laptop for some reason ... but my phone is working now22:22
clarkbmelwitt: you might need to create new ssl connections?22:22
clarkbits also possible that ansible has helpfully undone what I did22:22
clarkbno it hasn't undone it yet22:22
clarkbbut s_client verifies properly for me now so I'm reasonably confident this was the fix22:23
melwittclarkb: sorry what does that mean?22:23
fungimelwitt: possible that your laptop is complaining for a different reason than your phone did22:24
clarkbmelwitt: openssl s_client is openssl's command line tool to make ssl connections and it prints a bunch of debugging info about the ssl connection setup. Previously I could confirm that s_client did not like the ssl cert setup on that server as it had no way to verify the cert through the intermediate. But I've since fixed that22:24
clarkbmelwitt: `openssl s_client -connect` if you want to try it locally. Look for Verification: OK22:24
melwittoh, sorry, I'm dumb. says "No Data" when you visit it without navigating anywhere and I didn't realize that22:24
clarkbmelwitt: ah yup you have to select a graph22:25
clarkber not a graph but things to graph22:25
melwittgrafana dashboards all looking good on phone and laptop22:25
melwittthanks both!22:25
clarkbexcellent, and thank you as I probably wouldn't have thought to check ssl22:25
clarkb(since it was working in my browser)22:26
fungiyeah, the fix should hopefully merge and deploy shortly, i've already approved it22:26
fungifedora 34 mirror is still broken by the way, we probably need to wait for the site we're mirroring from to refresh22:36
fungior switch to a different source mirror if it persists for too long22:37
fungiunfortunately i've got a whole stack of tox role fixes for zuul-jobs i've been trying to get merged for a month and by each time they get approved there's some new regression which becomes a blocker22:39
funginow it's f34 testing22:40
clarkbfungi: don't worry tomorrow will make you forget all about tox :P22:40
clarkb"its the final countdown"22:40
mordredclarkb: what exciting thing is happening tomorrow?22:58
fungiit's a week until openstack xena releases23:00
clarkbmordred: lets encrypt's old root signing cert expires23:01
fungiand that, yep23:01
mordredOh joy23:01
clarkbmordred: a bunch of stuff is expected to stop working. And one of the compromises they are making is that by default they are doing a thing that will work for android but break openssl and gnutls if they are too old23:02
clarkbI'm sure that compromise was made because there are billions of android devices that are way out of date but I can't help but feel that is the wrong prioritization23:02
fungibecause who cares about crufty old servers, as long as old cellphones can still browse stuff23:02
clarkbwe have a number of options available to us including removing the android hack from our cert chains and adding the new root explicitly to our servers trust chains23:03
clarkbI think ESM also addresses this23:03
clarkband so on23:03
clarkbIt is hard to know right now what the total impact might be but fungi and I poked around this morning and made sure that our existing LE certs are at least doing the right thing (except for the one that melwitt found but that was on our end not or LEs)23:04
clarkband I prepared a change to drop the older android compat if that helps us (since older android isn't really a primary user of our services)23:04
fungiyeah, that was more just that we had pretty terribly misconfigured nginx in the graphite container23:04
fungibecause basically nowhere else do we use nginx23:04
clarkbbut we expect some non zero fallout and will try to debug and workaround it best we can23:04
clarkbthe nuclear option is we replace LE certs with sectigo certs23:05
clarkbthen use the year that gives us to upgrade things appropriately23:05
fungii have some spare change under my sofa cushions23:05
opendevreviewMerged opendev/system-config master: Use fullchain.cer on graphite for nginx
mordredWow. That does sound like it'll make everyone forget about tox 23:07
fungicarefully timed for maximum impact on openstack release week23:07
clarkbalso firefox shows me the ISRG root now23:09
clarkbI have no idea what happened there23:09
clarkbthe one good thing about requests using its own chain by default :)23:09
