Wednesday, 2024-11-06

opendevreview  Merged opendev/system-config master: Fix doc for including statusbot in channels  https://review.opendev.org/c/opendev/system-config/+/934189  03:10
opendevreview  Merged opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal  https://review.opendev.org/c/opendev/system-config/+/934164  08:30
slaweq  fungi frickler ianw: hi, is my understanding of the meeting logs correct that you want to simply try to use "curl" in ansible to trigger the rtd webhook?  08:32
slaweq  if yes, I can propose such a change  08:32
*** marios_ is now known as marios  09:20
opendevreview  Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/934045  12:29
fungi  slaweq: yes, basically switch it to a shell task calling curl, looks like our executors already have that installed so it should "just work"  12:50
slaweq  fungi: ok, I will propose a patch for this today  12:55
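The shell task being discussed could look roughly like this (a hedged sketch only; the webhook URL and credentials below are made-up placeholders, not the real rtd.org values — the actual proposal landed later as 934243):

```shell
# Hypothetical sketch of triggering an rtd.org webhook with curl and
# HTTP basic auth. URL and credentials are placeholders.
RTD_WEBHOOK_URL="https://readthedocs.org/api/v2/webhook/example-project/12345/"
RTD_USERNAME="example-user"
RTD_PASSWORD="example-secret"

# Build the invocation as an array and echo it instead of hitting the
# network, so the command shape can be inspected (dry run).
cmd=(curl --fail --silent --show-error
     --user "${RTD_USERNAME}:${RTD_PASSWORD}"
     --request POST
     "${RTD_WEBHOOK_URL}")
echo "${cmd[@]}"
```

In a real job the array would simply be executed (`"${cmd[@]}"`) rather than echoed, with the credentials coming from a Zuul secret.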
fungi  something changed on nb04 monday-ish that has been causing infra-prod-letsencrypt to fail in the "Install acme.sh client" task with error "Local modifications exist in the destination: /opt/acme.sh"  13:18
fungi  nobody has logged into the server for the past month, so it doesn't look like the file was manually altered  13:19
fungi  comparing to that git repo on nb01, both have head at commits with identical diffs but different commit ids  13:28
opendevreview  Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/934045  13:30
fungi  aha, that's because the "Patch for issue 4659" task does git cherry-picks at each run, so the resulting commit ids will always differ  13:35
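The effect fungi describes is easy to reproduce in a throwaway repo: cherry-picking the same patch twice produces identical diffs but distinct commit ids, because the committer timestamp is part of what gets hashed into the commit id. A self-contained demo (temporary repo, made-up content; assumes git is installed):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.email demo@example.com
git config user.name demo

echo base > file && git add file && git commit -qm base
echo change > file && git add file && git commit -qm change
patch=$(git rev-parse HEAD)

# Cherry-pick the same patch twice onto the same base commit, with
# different committer dates (as two daily ansible runs would have):
git checkout -qb run1 "${patch}~1"
GIT_COMMITTER_DATE='2024-11-04T00:00:00 +0000' git cherry-pick "$patch" >/dev/null
id1=$(git rev-parse HEAD)

git checkout -qb run2 "${patch}~1"
GIT_COMMITTER_DATE='2024-11-05T00:00:00 +0000' git cherry-pick "$patch" >/dev/null
id2=$(git rev-parse HEAD)

[ "$id1" != "$id2" ] && echo "identical diffs, different commit ids"
```

This is exactly why the nb01/nb04 clones matched diff-for-diff while disagreeing on ids.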
ykarel  frickler, fungi: wrt the translation update job breakage which i raised a few days ago  13:35
ykarel  any way to get it tested? i had pushed https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/934034  13:36
ykarel  with depends-on it fails as: Unable to freeze job graph: Job <Job propose-translation-update branches: None source: opendev/base-jobs/zuul.d/jobs.yaml@master#25> which is defined in opendev.org/openstack/project-config is protected and cannot be inherited from other projects.  13:36
fungi  ykarel: jobs defined in trusted config projects don't use speculative execution for safety reasons, since someone could propose unmerged patches to disclose things like the zanata account credentials trusted to that job  13:38
ykarel  yeap, so have to get it merged  13:38
fungi  yes  13:39
ykarel  ok, will push removing WIP  13:41
opendevreview  Slawek Kaplonski proposed zuul/zuul-jobs master: Use "curl" to trigger rtd.org webhook with http basic auth  https://review.opendev.org/c/zuul/zuul-jobs/+/934243  13:43
slaweq  fungi frickler clarkb ^^ please check when you have a minute  13:43
slaweq  also ianw  13:43
slaweq  thx for all the help with this  13:44
fungi  so for the git repo on nb04, it looks like it somehow ended up with a soft reset of the cherry-picked commits; basically the reflog has head at the cherry-picked state, but the worktree matches the pre-cherry-picked state instead (upstream tag 3.0.5)  13:46
fungi  and that's preventing ansible's git module from proceeding. but no sign in the prior build logs of what could have left it in this state  13:47
fungi  anyway, i'm thinking this is nothing nefarious and the cherry-pick simply got interrupted before the worktree was updated. if other infra-root folks aren't particularly concerned, i'm happy to reset the repository state and check the next build  13:58
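The manual reset being proposed might look like this (a hedged sketch, not the exact commands that were run; `repair_repo` is a made-up helper name): check whether tracked files diverge from HEAD, and if so hard-reset and clean.

```shell
# If the worktree's tracked files no longer match HEAD (the state
# fungi found on nb04), throw the local modifications away. Only safe
# once you have decided, as here, that the divergence is benign.
repair_repo() {
  repo="$1"
  if ! git -C "$repo" diff --quiet HEAD --; then
    git -C "$repo" reset -q --hard HEAD
    git -C "$repo" clean -qfdx
  fi
}
```

On nb04 this would amount to `repair_repo /opt/acme.sh` followed by re-running the deploy job.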
frickler  fungi: +1  13:59
opendevreview  Jeremy Stanley proposed opendev/system-config master: Force acme.sh Git repo refresh  https://review.opendev.org/c/opendev/system-config/+/934247  14:47
fungi  infra-root: there's ^ an alternative option  14:47
opendevreview  Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/934045  15:09
frickler  just in case someone else notices broken reqs-check jobs: https://zuul.opendev.org/t/openstack/builds?job_name=requirements-check&skip=0 -- a revert is in progress; might even consider force-merging https://review.opendev.org/c/openstack/requirements/+/934249  15:22
opendevreview  Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/934045  15:39
frickler  fungi: did you look at the nb04 failure because of the deploy issues? https://zuul.opendev.org/t/openstack/buildset/d71c9b202bb6409ea7830341e0eb704d  15:39
corvus  frickler: is there no gate job on the requirements repo that runs the requirements-check job?  15:40
frickler  corvus: I was caught by surprise by that too, else I'd have done some extra checking  15:49
clarkb  fungi: the acme.sh thing happens when we run out of disk  15:54
clarkb  fungi: we discussed this previously, and one of the things I wanted to see us do was stop updating the repo after it was in the state we wanted  15:55
clarkb  the problem is every time we run (so at least once a day) we completely reset the git repo state. Which explodes if we have no more disk, but it's also unnecessary  15:55
corvus  frickler: i'm wondering if we can just run it targeting the requirements repo: testing it out in https://review.opendev.org/934251 and https://review.opendev.org/934252  15:56
clarkb  unfortunately I think this is another case where the ansible git module just isn't great and we might have to replace it with manual management of git  15:56
corvus  clarkb, fungi: or we could just make a temporary fork on github or something  15:56
clarkb  corvus: ya, that's another option. I think if we do that then the version we point at has to be the HEAD ref (so master/main/whatever), otherwise the git module always resets things  15:58
clarkb  we can also check if we still need the patched setup. I suspect we do, unfortunately  15:59
clarkb  https://github.com/acmesh-official/acme.sh/issues/4416 is still open :/  16:00
clarkb  reading that, I guess we could just switch to the ecc path maybe?  16:03
clarkb  but ya, the simplest thing now may be to have a fork and just use that: apply the security-related patches and check out dev/master/main, whatever we set HEAD to  16:03
clarkb  actually I think the script is entirely self contained  16:05
clarkb  what if we just vendor it in its entirety and use an ansible file copy instead?  16:05
fungi  frickler: yes, the deploy failure on https://review.opendev.org/934164 is what got me looking into the broken letsencrypt git state  16:07
clarkb  tried to summarize all of that on the change so that we don't lose the context later and can figure out what we want to do from there  16:07
clarkb  fungi: /dev/mapper/main-main  787G  787G     0 100% /opt -- the disk is full, so I don't think your change will fix it fwiw  16:08
fungi  clarkb: does acme.sh not use the driver.sh file or any of the scripts under the subdirectories?  16:09
fungi  oh, hah  16:09
fungi  what's the usual cleanup routine? down the nodepool-builder container and then delete all of dib's tempfiles?  16:10
corvus  frickler: the answer to that question appears to be "no"; so i made this instead: https://review.opendev.org/c/openstack/requirements/+/934253 WIP: Update requirements-check job to run on arbitrary targets [NEW]  16:10
fungi  clarkb: specifically delete all the /opt/dib_tmp/* directories?  16:11
clarkb  fungi: yes, down the service then start a delete of the /opt/dib_tmp contents. The hourly jobs will start the service back up again while that's running (which will probably take a day or so), but that hasn't been a huge issue  16:11
fungi  on it  16:11
clarkb  fungi: also, if losetup has no more available devices then doing a reboot is helpful  16:11
fungi  k  16:12
clarkb  losetup -a gives you a list of all loopback devices. I think the max is like 14 or 16 or something, so if we're well under that then we're good  16:12
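The loop-device check above is easy to script; factoring the counting out of `losetup -a` (which prints one attached device per line) makes it testable on canned output without root. A sketch (`count_loop_devices` is a made-up helper; the sample paths are invented):

```shell
# Count lines naming a loop device. Pipe `losetup -a` into this on a
# real host; it reads stdin so it can also be fed sample output.
count_loop_devices() {
  grep -c '^/dev/loop'
}

# Example with canned losetup-style output (paths are made up):
sample='/dev/loop0: []: (/opt/dib_tmp/build1/image0.raw)
/dev/loop1: []: (/opt/dib_tmp/build2/image0.raw)'
printf '%s\n' "$sample" | count_loop_devices   # prints 2
```

On nb04 the real check would be `losetup -a | count_loop_devices`, compared against the device limit clarkb estimates.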
clarkb  re the subfolders, I thought we just copy the acme.sh script out and run it from another location, but I'll check on that  16:13
clarkb  I'm wrong about that: the driver.sh executes acme.sh out of the git repo dir  16:14
clarkb  so vendoring is potentially problematic  16:14
clarkb  idea: put the acme.sh install stuff into a block that only runs when a hash of acme.sh does not match what we expect  16:19
fungi  tempfile cleanup is in progress in a root screen session on nb04  16:19
clarkb  I'll write that up as an alternative we can consider  16:19
fungi  currently there are 10 loop devs on the server  16:20
clarkb  but that should work for new installs and existing installs, and be more resilient to disks filling  16:20
clarkb  fungi: ok, so not at the limit but getting close. Reboot probably a good idea after deletions are done ~tomorrow  16:20
fungi  agreed  16:20
opendevreview  Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary  https://review.opendev.org/c/opendev/system-config/+/934256  16:33
clarkb  something like that maybe  16:33
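The hash gate sketched here presumably boils down to something like the following (an assumed shape only, not the actual tasks in 934256; `acme_up_to_date` is a hypothetical helper and the expected sha is a placeholder):

```shell
# Succeed when the installed script exists and matches the expected
# sha256 of the patched acme.sh we want deployed.
acme_up_to_date() {
  script="$1"       # e.g. /opt/acme.sh/acme.sh
  expected="$2"     # known-good sha256 (placeholder in real use)
  [ -f "$script" ] &&
    [ "$(sha256sum "$script" | cut -d' ' -f1)" = "$expected" ]
}

# The install block then only runs when the gate fails:
# if ! acme_up_to_date /opt/acme.sh/acme.sh "$EXPECTED_SHA"; then
#   ... clone + cherry-pick the security patches ...
# fi
```

The nice property, as noted above, is that the gate also verifies the security patches are applied: an unpatched upstream checkout hashes differently.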
corvus  frickler: https://zuul.opendev.org/t/openstack/build/79171c3eabc540bfa37eb58af8e00e4b shows the expected failure.  i will update https://review.opendev.org/934253 in a bit to make it ready to merge  16:34
clarkb  https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_780/934256/1/check/system-config-run-static/78081f2/bridge99.opendev.org/ara-report/playbooks/3.html shows 934256 working in the case where no acme.sh client was previously installed  17:00
clarkb  I wonder if we should update system-config-run-letsencrypt to do two passes so we can confirm the other case?  17:01
clarkb  any opinions on ^? I think that should be doable if we don't mind the job taking longer  17:01
opendevreview  Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt  https://review.opendev.org/c/opendev/system-config/+/934258  17:08
clarkb  did it as a followup change, as we may decide we don't want it after confirming the behavior in 934256  17:08
fungi  clarkb: yeah, an idempotency test there is a great idea  17:11
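One cheap way to assert idempotency in such a two-pass job (a sketch; the real change may check this differently) is to grep the second run's PLAY RECAP line for zero changes:

```shell
# Succeed only if a recap line reports no changes and no failures.
recap_idempotent() {
  grep -Eq 'changed=0.*failed=0'
}

# Example recap line from a hypothetical second pass:
printf 'letsencrypt01 : ok=42 changed=0 unreachable=0 failed=0\n' \
  | recap_idempotent && echo "second pass was idempotent"
```

In the real job the second `ansible-playbook` run's output would be piped through this check, failing the build if anything still reported a change.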
corvus  frickler: https://review.opendev.org/c/openstack/requirements/+/934253 is ready for review  17:23
corvus  that will close the gate hole for the requirements-check job and should make changes like updating the nodeset, or fixing whatever problems there are with upgrading to noble, easier  17:24
opendevreview  Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary  https://review.opendev.org/c/opendev/system-config/+/934256  17:32
opendevreview  Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt  https://review.opendev.org/c/opendev/system-config/+/934258  17:32
clarkb  it caught a bug \o/  17:32
fungi  i went ahead and hard reset and cleaned the acme.sh clone on nb04  17:51
fungi  since the cause is known  17:51
fungi  dib cleanup has freed about 20gb so far  17:52
clarkb  https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_599/934258/2/check/system-config-run-letsencrypt/5995444/bridge99.opendev.org/ara-report/playbooks/5.html shows the skip working now  18:05
clarkb  that doesn't solve the disk filling but should make us more resilient to it for cert renewals  18:06
fungi  agreed, looks great  18:08
clarkb  I think it also shows the driver.sh copy is changed: false, so it's nooping properly too  18:08
clarkb  corvus: I've done some code review of zuul-registry's prune method. The thing that stands out to me is that if this list https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L287-L288 is incomplete (or empty) then https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L292 will delete every blob that is an hour or more old. I wonder if we should set upload_target to manifest_target out of an abundance of caution on line 292?  21:59
clarkb  we might still be broken if that occurs, but less so. Another thought I had is to break up prune into several passes, so that we only prune manifests, then check on things, then prune blobs, ensuring the layers list looks right? I don't know how helpful that is if the above theory is the issue  21:59
ianw  there wasn't any grand plan pulling acme.sh from git; rather just an anti-bitrot incentive to keep up with upstream.  as time has gone on, we're not doing that and have pinned for various reasons anyway.  i don't suppose checking a file sha is really functionally any different than, say, checking a pinned .tar.gz release file  22:00
clarkb  ianw: ya, the upside to checking the sha is it checks that we've applied the desired security fixes too, without needing to fork or maintain our own package etc. Bit hacky but simple  22:01
ianw  on the positive side acme isn't changing, neither is sh, and we've got ecc certificates that seem fairly future proof ... so in theory i guess it can just sit there doing its thing  22:01
clarkb  corvus: and as for why that list may be incomplete, I'm thinking that could be a side effect of swift client behavior? similar to how we can only delete 10k objects per request, maybe we only ever get back 10k objects (or some other limit) at a time?  22:02
clarkb  ianw: ya, I think the main risk is if we have to patch for other security concerns. If we hit that point we're probably rewriting things to use the other directory?  22:02
clarkb  ianw: the issue is just that the certs go into the "wrong" directory, right? so everything has to update to look for them there?  22:02
ianw  it's been a long time -- i think the problem is because of our two-step process, where we create certs and get the TXT records, install them, then come back and use renew to pick up where we left off -- acme.sh isn't picking up where it left off; it doesn't correctly recognise the _ecc dir and tries to start the process again  22:06
clarkb  aha  22:07
ianw  i think that we're fairly unique in having this two-step process -- most people seem to be using an acme.sh plugin to their DNS-as-a-service API, where acme.sh essentially pauses and then carries on  22:07
clarkb  so not just a matter of us picking the right new directory, but acme.sh internally needs to see the certs in the correct dir  22:07
ianw  right, the "renew" path needs fixing ... i'm sure it could be done, it's just a matter of time and effort, like everything :)  22:09
clarkb  ya, I assume they would take a patch. /me puts it in the pile  22:09
fungi  in theory we could set up dynamic dns updates, or just delegate the subdomain we use to something that has its own update api  22:10
clarkb  unrelated to ^: does anyone have an excuse to send mail to service-discuss? I haven't seen any bounce processing yet, and I think that is because we need new mail to generate bounces  22:10
fungi  nothing springs to mind  22:13
ianw  fungi: heh, well we (corvus mostly, i guess) spent a lot of effort to get us _off_ rax cloud dns :)  22:13
ianw  perhaps we should just investigate using the standard python client.  the idea was that acme.sh was small and zero-dependency, but the reality is that having a python venv is not really going to be a concern for anything that needs a cert  22:15
fungi  yeah, i meant something we run, but i agree it's sort of silly to add yet another service just for that one case  22:15
clarkb  I do really like that the dns-based verification allows us to provision things all in one go without a chicken-and-egg dance with the webservers  22:20
clarkb  but it does seem to be far less common, and then on top of that, as you mention, we're using it in the less common way  22:20
ianw  i think the dns-based is common, but what's not common is doing it with two separate invocations of acme.sh  22:23
corvus  yeah, the dns-based process is great  22:30
corvus  clarkb: the number of layers per manifest should be like 5-20, not 10k, so i doubt the api limit is at play here, but i agree that if something goes wrong there, deleting all blobs older than 1hr is what we would do, and that sounds like the sort of thing that might have happened before.  maybe we should add a --dry-run argument to see what it would do?  22:36
clarkb  oh ya, dry-run may be a good option; have it print out paths and ctimes?  22:37
corvus  yeah.  i'll do that real quick  22:37
clarkb  corvus: re the count of things, the issue is not in getting the layer blobs but potentially in listing all the to-keep manifests. If that list is short then we won't keep everything, and then we'll delete new stuff maybe?  22:38
clarkb  in any case it seems like deleting things more than an hour old may be the underlying problem, even if I'm not certain of how that would happen, so ++ to the dry-run option  22:39
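The --dry-run being proposed would follow the usual pattern of routing every destructive operation through one guard. A generic sketch (in shell, not zuul-registry's actual python code; names are illustrative):

```shell
# With DRY_RUN=1, report what would be deleted (path plus ctime, as
# suggested above) instead of deleting it.
DRY_RUN=1
prune_blob() {
  path="$1" ctime="$2"
  if [ "$DRY_RUN" = "1" ]; then
    echo "would delete blob $path (ctime $ctime)"
  else
    rm -f -- "$path"
  fi
}

blob=$(mktemp)
prune_blob "$blob" "2024-11-05T20:59:00Z"
[ -f "$blob" ] && echo "dry run left the blob in place"
```

Running the real prune in this mode first would reveal whether the mark/sweep list is missing manifests (and hence about to over-delete) before anything is destroyed.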
corvus  yeah, manifests could be a problem now, but maybe not a long time ago....  22:40
clarkb  er, and I didn't mean do upload_target = manifest_target. I meant pass manifest_target where we pass upload_target today in that one call  22:42
clarkb  I think an hour grace period for in-flight uploads is likely to be fine  22:42
corvus  clarkb: i think upload_target is used there with the idea that any blob in an existing finalized manifest should be handled by the mark/sweep list, but unknown blobs may show up because of in-progress uploads, so anything within the upload window should be kept.  22:45
corvus  clarkb: put that another way: that value could be "older than now" except that we want to keep in-progress upload blobs.  22:46
corvus  all to say, i think upload_target makes sense there  22:46
clarkb  corvus: actually, theory two: since blobs are sha256-addressed, is it possible that we're deleting shared blobs?  22:46
clarkb  I guess that shouldn't be the case, because we would return those shared blobs in the kept list  22:47
clarkb  assuming they are used by newer manifests, they would end up in that list and then we shouldn't delete them  22:47
corvus  clarkb: right, that's the theory.  the potential bug there would be "there's something we're not accounting for", like a different kind of manifest format, or recursive manifests with multi-arch or something.  22:48
corvus  in other words, to do this correctly, we have to understand all the manifest formats we handle  22:48
clarkb  ah, yup, that could explain it too  22:48
clarkb  the unknown-manifest log message should expose that. Another reason to do a dry run if possible  22:49
opendevreview  Jay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API  https://review.opendev.org/c/openstack/diskimage-builder/+/907691  23:08
clarkb  do we want to approve the letsencrypt changes today before the periodic jobs run them for real? I have to pop out to take the kids to an activity in less than an hour, so I won't be able to watch it  23:46
clarkb  Maybe tomorrow is better?  23:46
fungi  i can approve and watch them deploy  23:49
fungi  doing it now  23:49
fungi  and i'll abandon my more naive bandaid  23:50
clarkb  sounds good, thanks  23:53

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!