opendevreview | Merged opendev/system-config master: Fix doc for including statusbot in channels https://review.opendev.org/c/opendev/system-config/+/934189 | 03:10 |
opendevreview | Merged opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 08:30 |
slaweq | fungi frickler ianw hi, is my understanding of the meeting logs correct that you want to simply try to use "curl" in ansible to trigger rtd webhook? | 08:32 |
slaweq | if yes, I can propose such change | 08:32 |
*** marios_ is now known as marios | 09:20 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 12:29 |
fungi | slaweq: yes, basically switch it to a shell task calling curl, looks like our executors already have that installed so it should "just work" | 12:50 |
slaweq | fungi: ok, I will propose a patch for this today | 12:55 |
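The replacement task fungi describes boils down to a curl call with HTTP basic auth. A minimal sketch; the webhook URL and credentials below are placeholders, not the real job parameters (those live in the job's secrets):

```shell
#!/bin/sh
# Sketch of the curl invocation a shell task could run to trigger the
# readthedocs.org webhook. URL, user, and password are placeholders.
trigger_rtd_webhook() {
    url="$1"; user="$2"; password="$3"
    # --fail: exit non-zero on HTTP errors so the Ansible task (and the
    # job) fails visibly instead of silently succeeding.
    curl --fail --silent --show-error \
        --user "${user}:${password}" \
        --request POST \
        "$url"
}

# Example (placeholder values):
# trigger_rtd_webhook "https://readthedocs.org/api/v2/webhook/example/12345/" rtd-user rtd-secret
```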
fungi | something changed on nb04 monday-ish that has been causing infra-prod-letsencrypt to fail in the "Install acme.sh client" task with error "Local modifications exist in the destination: /opt/acme.sh" | 13:18 |
fungi | nobody has logged into the server for the past month, so it doesn't look like the file was manually altered | 13:19 |
fungi | comparing to that git repo on nb01, both have head at commits with identical diffs but different commit ids | 13:28 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 13:30 |
fungi | aha, that's because the "Patch for issue 4659" task does git cherry-picks at each run, so the resulting commit ids will always differ | 13:35 |
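This follows from how git hashes commits: the committer timestamp is part of the commit object, so re-running the same cherry-pick on different days produces identical diffs with different ids. A quick demonstration in a throwaway repo:

```shell
#!/bin/sh
# Re-cherry-picking the same patch with different committer dates gives
# identical diffs but different commit ids, because the committer
# timestamp is hashed into the commit object.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git commit -q --allow-empty -m base
echo fix > fix.txt && git add fix.txt
git commit -q -m "some patch"
patch=$(git rev-parse HEAD)

pick() {
    # cherry-pick the patch onto a fresh branch at a fixed "run time"
    git checkout -q -b "$1" "$patch~1"
    GIT_COMMITTER_DATE="$2" git cherry-pick "$patch" >/dev/null
    git rev-parse HEAD
}
id1=$(pick run1 "2024-11-05T06:00:00")
id2=$(pick run2 "2024-11-06T06:00:00")
if [ "$id1" != "$id2" ]; then
    echo "same diff, different commit ids"
fi
```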
ykarel | frickler, fungi wrt translation update job broken which i raised days before | 13:35 |
ykarel | any way to get it tested i had pushed https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/934034 | 13:36 |
ykarel | with depends-on fails as Unable to freeze job graph: Job <Job propose-translation-update branches: None source: opendev/base-jobs/zuul.d/jobs.yaml@master#25> which is defined in opendev.org/openstack/project-config is protected and cannot be inherited from other projects. | 13:36 |
fungi | ykarel: jobs defined in trusted config projects don't use speculative execution for safety reasons, since someone could propose unmerged patches to disclose things like the zanata account credentials trusted to that job | 13:38 |
ykarel | yeap, so have to get it merged | 13:38 |
fungi | yes | 13:39 |
ykarel | ok will push removing WIP | 13:41 |
opendevreview | Slawek Kaplonski proposed zuul/zuul-jobs master: Use "curl" to trigger rtd.org webhook with http basic auth https://review.opendev.org/c/zuul/zuul-jobs/+/934243 | 13:43 |
slaweq | fungi frickler clarkb ^^ please check when you will have a minute | 13:43 |
slaweq | also ianw | 13:43 |
slaweq | thx for all the help with this | 13:44 |
fungi | so for the git repo on nb04, it looks like it somehow ended up with a soft reset of the cherry-picked commits, basically the reflog has head at the cherry-picked state, but the worktree matches the pre-cherry-picked state instead (upstream tag 3.0.5) | 13:46 |
fungi | and that's preventing ansible's git module from proceeding. but no sign in the prior build logs for what could have left it in this state | 13:47 |
fungi | anyway, i'm thinking this is nothing nefarious and the cherry-pick simply got interrupted before the worktree was updated. if other infra-root folks aren't particularly concerned i'm happy to reset the repository state and check the next build | 13:58 |
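The state described here can be inspected and repaired with plain git. A sketch, written as functions so they can be pointed at any clone (/opt/acme.sh is the path from the log):

```shell
#!/bin/sh
# Helpers matching the situation above: HEAD and the reflog sit at the
# cherry-picked commits while the worktree matches the older tag, so
# git reports local modifications.
show_repo_state() {
    repo="$1"
    git -C "$repo" log --oneline -3   # where HEAD actually points
    git -C "$repo" reflog -5          # how HEAD got there
    git -C "$repo" status --short     # the "local modifications"
}

reset_repo_state() {
    # The recovery discussed: force the worktree back to HEAD and
    # remove untracked leftovers.
    git -C "$1" reset --hard HEAD
    git -C "$1" clean -dffx
}

# e.g. show_repo_state /opt/acme.sh; reset_repo_state /opt/acme.sh
```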
frickler | fungi: +1 | 13:59 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Force acme.sh Git repo refresh https://review.opendev.org/c/opendev/system-config/+/934247 | 14:47 |
fungi | infra-root: there's ^ an alternative option | 14:47 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 15:09 |
frickler | just in case someone else notices broken reqs-check jobs https://zuul.opendev.org/t/openstack/builds?job_name=requirements-check&skip=0 revert is in progress, might even consider force-merging https://review.opendev.org/c/openstack/requirements/+/934249 | 15:22 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 15:39 |
frickler | fungi: did you look at the nb04 failure because of the deploy issues? https://zuul.opendev.org/t/openstack/buildset/d71c9b202bb6409ea7830341e0eb704d | 15:39 |
corvus | frickler: is there no gate job on the requirements repo that runs the requirements check job? | 15:40 |
frickler | corvus: I was caught by surprise by that, too, else I'd have done some extra checking | 15:49 |
clarkb | fungi: the acme.sh thing happens when we run out of disk | 15:54 |
clarkb | fungi: we discussed this previously and one of the things I wanted to see us do was stop updating the repo after it was in the state we wanted | 15:55 |
clarkb | the problem is every time we run (so at least once a day) we completely reset the git repo state. Which explodes if we have no more disk, but it's also unnecessary | 15:55 |
corvus | frickler: i'm wondering if we can just run it targeting the requirements repo: testing it out in https://review.opendev.org/934251 and https://review.opendev.org/934252 | 15:56 |
clarkb | unfortunately I think this is another case where the ansible git module just isn't great and we might have to replace it with manual management of git | 15:56 |
corvus | clarkb: fungi or we could just make a temporary fork on github or something | 15:56 |
clarkb | corvus: ya that's another option. I think if we do that then the version we point at has to be the HEAD ref (so master/main/whatever) otherwise the git module always resets things | 15:58 |
clarkb | we can also check if we still need the patched setup. I suspect we do unfortunately | 15:59 |
clarkb | https://github.com/acmesh-official/acme.sh/issues/4416 is still open :/ | 16:00 |
clarkb | reading that I guess we could just switch to the ecc path maybe? | 16:03 |
clarkb | but ya simplest thing now may be to have a fork and just use that, apply the security related patches and check out dev/master/main whatever we set HEAD to | 16:03 |
clarkb | actually I think the script is entirely self contained | 16:05 |
clarkb | what if we just vendor it in its entirety and use an ansible file copy instead? | 16:05 |
fungi | frickler: yes, deploy failure on https://review.opendev.org/934164 is what got me looking into the broken letsencrypt git state | 16:07 |
clarkb | tried to summarize all of that on the change so that we don't lose the context later and we can figure out what we want to do from there | 16:07 |
clarkb | fungi: /dev/mapper/main-main 787G 787G 0 100% /opt the disk is full so I don't think your change will fix it fwiw | 16:08 |
fungi | clarkb: does acme.sh not use the driver.sh file or any of the scripts under the subdirectories? | 16:09 |
fungi | oh, hah | 16:09 |
fungi | what's the usual cleanup routine? down the nodepool-builder container and then delete all of dib's tempfiles? | 16:10 |
corvus | frickler: answer to that question appears to be "no"; so i made this instead: https://review.opendev.org/c/openstack/requirements/+/934253 WIP: Update requirements-check job to run on arbitrary targets [NEW] | 16:10 |
fungi | clarkb: specifically delete all the /opt/dib_tmp/* directories? | 16:11 |
clarkb | fungi: yes, down the service then start a delete of the /opt/dib_tmp contents. The hourly jobs will start the service back up again while that's running (which will probably take a day or so) but that hasn't been a huge issue | 16:11 |
fungi | on it | 16:11 |
clarkb | fungi: also if losetup has no more available devices then doing a reboot is helpful | 16:11 |
fungi | k | 16:12 |
clarkb | losetup -a gives you a list of all loopback devices. I think the max is like 14 or 16 or something, so if we're well under that then we're good | 16:12 |
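Putting the routine together, roughly. The docker-compose file location below is an assumption about the deployment; /opt/dib_tmp and the losetup check come straight from the discussion:

```shell
#!/bin/sh
# Sketch of the full-disk cleanup routine for a nodepool builder. The
# compose-file path is an assumed placeholder.
count_loop_devices() {
    losetup -a | wc -l
}

cleanup_builder() {
    # 1. Stop the builder so nothing keeps writing to /opt/dib_tmp
    docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml down
    # 2. Delete dib's temp dirs (can take a day or so on a full volume)
    rm -rf /opt/dib_tmp/*
    # 3. If loop devices are exhausted, a reboot frees them
    echo "loop devices in use: $(count_loop_devices)"
}
```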
clarkb | re the subfolders I thought we just copy the acme.sh script out and run it from another location but I'll check on that | 16:13 |
clarkb | I'm wrong about that the driver.sh executes acme.sh out of the git repo dir | 16:14 |
clarkb | so vendoring is potentially problematic | 16:14 |
clarkb | idea: put the acme.sh install stuff into a block that only runs when a hash of acme.sh does not match what we expect | 16:19 |
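A minimal sketch of that guard. EXPECTED_SHA256 is a placeholder; the real value would be the digest of acme.sh after the desired patches are applied:

```shell
#!/bin/sh
# Run the install/patch block only when the installed acme.sh does not
# match the expected (post-patch) content.
EXPECTED_SHA256="0000000000000000000000000000000000000000000000000000000000000000"

acme_needs_update() {
    # exit 0 (true) when the install steps should run
    script="$1"
    [ -f "$script" ] || return 0                  # not installed yet
    actual=$(sha256sum "$script" | awk '{print $1}')
    [ "$actual" != "$EXPECTED_SHA256" ]           # content drifted
}

# e.g. if acme_needs_update /opt/acme.sh/acme.sh; then ...install...; fi
```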
fungi | tempfile cleanup is in progress in a root screen session on nb04 | 16:19 |
clarkb | I'll write that up as an alternative we can consider | 16:19 |
fungi | currently there are 10 loop devs on the server | 16:20 |
clarkb | but that should work for new installs and existing installs and be more resilient to disks filling | 16:20 |
clarkb | fungi: ok so not at the limit but getting close. Reboot probably a good idea after deletions are done ~tomorrow | 16:20 |
fungi | agreed | 16:20 |
opendevreview | Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary https://review.opendev.org/c/opendev/system-config/+/934256 | 16:33 |
clarkb | something like that maybe | 16:33 |
corvus | frickler: https://zuul.opendev.org/t/openstack/build/79171c3eabc540bfa37eb58af8e00e4b shows the expected failure. i will update https://review.opendev.org/934253 in a bit to make it ready to merge | 16:34 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_780/934256/1/check/system-config-run-static/78081f2/bridge99.opendev.org/ara-report/playbooks/3.html shows 934256 working in the no previously installed acme.sh client case | 17:00 |
clarkb | I wonder if we should update system-config-run-letsencrypt to do two passes so we can confirm the other case? | 17:01 |
clarkb | any opinions on ^ I think that should be doable if we don't mind the job taking longer | 17:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt https://review.opendev.org/c/opendev/system-config/+/934258 | 17:08 |
clarkb | did it as a followup change as we may decide we don't want it after confirming the behavior in 934256 | 17:08 |
fungi | clarkb: yeah, an idempotency test there is a great idea | 17:11 |
corvus | frickler: https://review.opendev.org/c/openstack/requirements/+/934253 is ready for review | 17:23 |
corvus | that will close the gate hole for the requirements-check job and should make changes like updating the nodeset or fixing whatever problems there are with upgrading to noble easier | 17:24 |
opendevreview | Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary https://review.opendev.org/c/opendev/system-config/+/934256 | 17:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt https://review.opendev.org/c/opendev/system-config/+/934258 | 17:32 |
clarkb | it caught a bug \o/ | 17:32 |
fungi | i went ahead and hard reset and cleaned the acme.sh clone on nb04 | 17:51 |
fungi | since the cause is known | 17:51 |
fungi | dib cleanup has freed about 20gb so far | 17:52 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_599/934258/2/check/system-config-run-letsencrypt/5995444/bridge99.opendev.org/ara-report/playbooks/5.html shows the skip working now | 18:05 |
clarkb | that doesn't solve the disk filling but should make us more resilient to it for cert renewals | 18:06 |
fungi | agreed, looks great | 18:08 |
clarkb | I think it also shows the driver.sh copy is changed: false so nooping properly too | 18:08 |
clarkb | corvus: I've done some code review of zuul-registry's prune method. The thing that stands out to me is that if https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L287-L288 this list is incomplete (or empty) then https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L292 will delete every blob that is an hour or more old. I | 21:59 |
clarkb | wonder if we should set upload_target to manifest_target out of an abundance of caution on line 292? | 21:59 |
clarkb | corvus: we might still be broken if that occurs, but less so. Another thought I had is to break up prune into several passes so that we only prune manifests, then check on things, then prune blobs ensuring the layers list looks right? I don't know how helpful that is if the above theory is the issue | 21:59 |
ianw | there wasn't any grand plan pulling acme.sh from git; rather just an anti-bitrot incentive to keep up with upstream. as time has gone on, we're not doing that and pinned for various reasons anyway. i don't suppose checking a file sha is really functionally any different than say, checking a pinned .tar.gz release file | 22:00 |
clarkb | ianw: ya the upside to checking the sha is it checks that we've applied the desired security fixes too without needing to fork or maintain our own package etc. Bit hacky but simple | 22:01 |
ianw | on the positive side acme isn't changing, neither is sh, and we've got ecc certificates that seem fairly future proof ... so in theory i guess it can just sit there doing its thing | 22:01 |
clarkb | corvus: and as for why that list may be incomplete I'm thinking that could be a side effect of swift client behavior? similar to how we can only delete 10k objects per request maybe we only ever get back 10k objects (or some other limit) at a time? | 22:02 |
clarkb | ianw: ya I think the main risk is if we have to patch for other security concerns. If we hit that point we're probably rewriting things to use the other directory? | 22:02 |
clarkb | ianw: the issue is just that the certs go into the "wrong" directory, right? so everything has to update to look for them there? | 22:02 |
ianw | it's been a long time -- i think the problem is because of our two-step process, where we create certs and get the TXT records, install them, then come back and use renew to pick up where we left off -- acme.sh isn't picking up where it left off, it doesn't correctly recognise the _ecc dir and tries to start the process again | 22:06 |
clarkb | aha | 22:07 |
ianw | i think that we're fairly unique in having this two step process -- most people seem to be using an acme.sh plugin to their DNS-as-a-service API where acme.sh essentially pauses and then carries on | 22:07 |
clarkb | so not just a matter of us picking the right new directory but acme.sh internally needs to see the certs in the correct dir | 22:07 |
ianw | right, the "renew" path needs fixing ... i'm sure it could be done, it's just a matter of time and effort, like everything :) | 22:09 |
clarkb | ya I assume they would take a patch. /me puts it in the pile | 22:09 |
fungi | in theory we could set up dynamic dns updates, or just delegate the subdomain we use to something that has its own update api | 22:10 |
clarkb | unrelated to ^ does anyone have an excuse to send mail to service-discuss? I haven't seen any bounce processing yet and I think that is because we need new mail to generate bounces | 22:10 |
fungi | nothing springs to mind | 22:13 |
ianw | fungi: heh, well we (corvus mostly i guess) spent a lot of effort to get us _off_ rax cloud dns :) | 22:13 |
ianw | perhaps we should just investigate using the standard python client. the idea was that acme.sh was small and zero-dependency, but the reality is that having a python venv is not really going to be a concern for anything that needs a cert | 22:15 |
fungi | yeah, i meant something we run, but i agree it's sort of silly to add yet another service just for that one case | 22:15 |
clarkb | I do really like that the dns based verification allows us to provision things all in one go without a chicken and egg dance with the webservers | 22:20 |
clarkb | but it does seem to be far less common and then on top of that as you mention we're using it in the less common way | 22:20 |
ianw | i think the dns-based is common, but what's not common is doing it with two separate invocations of acme.sh | 22:23 |
corvus | yeah the dns based process is great | 22:30 |
corvus | clarkb: the number of layers per manifest should be like 5-20, not 10k, so i doubt the api limit is at play here, but i agree that if something goes wrong there, deleting all blobs older than 1hr is what we would do, and that sounds like the sort of thing that might have happened before. maybe we should add a --dry-run argument to see what it would do? | 22:36 |
clarkb | oh ya dry-run may be a good option and have it print out paths and ctimes? | 22:37 |
corvus | yeah. i'll do that real quick | 22:37 |
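The shape of such a dry-run flag, sketched in shell for illustration only (zuul-registry itself is Python; this just shows the print-instead-of-delete pattern with path and ctime that the discussion suggests):

```shell
#!/bin/sh
# Dry-run pattern for a prune pass: with DRY_RUN=1, print what would be
# deleted (path plus change time) instead of deleting it.
DRY_RUN=${DRY_RUN:-1}

prune_blob() {
    path="$1"
    if [ "$DRY_RUN" = "1" ]; then
        # stat -c %z is GNU coreutils: the file's change (ctime) stamp
        echo "would delete: $path ($(stat -c %z "$path"))"
    else
        rm -f "$path"
    fi
}
```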
clarkb | corvus: re the count of things the issue is not in getting the layer blobs but in listing all the keep manifests potentially. If that list is short then we won't keep everything and then we'll delete new stuff maybe? | 22:38 |
clarkb | in any case it seems like deleting things more than an hour old may be the underlying problem even if I'm not certain of how that would happen so ++ to dry-run option | 22:39 |
corvus | yeah, manifests could be a problem now, but maybe not a long time ago.... | 22:40 |
clarkb | er and I didn't mean do upload_target = manifest_target. I meant pass manifest_target where we pass upload_target today in that one call | 22:42 |
clarkb | I think an hour grace period for inflight uploads is likely to be fine | 22:42 |
corvus | clarkb: i think upload_target is used there with the idea that any blob in an existing finalized manifest should be handled by the mark/sweep list, but unknown blobs may show up because of in-progress uploads, so anything within the upload window should be kept. | 22:45 |
corvus | clarkb: put that another way: that value could be "older than now" except that we want to keep in-progress upload blobs. | 22:46 |
corvus | all to say, i think upload_target makes sense there | 22:46 |
clarkb | corvus: actually theory two: since blobs are sha256sum addressed is it possible that we're deleting shared blobs? | 22:46 |
clarkb | I guess that shouldn't be the case because we would return those shared blobs in the kept list | 22:47 |
clarkb | assuming they are used by newer manifests they would end up in that list and then we shouldn't delete them | 22:47 |
corvus | clarkb: right, that's the theory. the potential bug there would be "there's something we're not accounting for" like a different kind of manifest format, or recursive manifests with multi arch or something. | 22:48 |
corvus | in other words, to do this correctly, we have to understand all the manifest formats we handle | 22:48 |
clarkb | ah yup that could explain it too | 22:48 |
clarkb | the unknown manifest log message should expose that. Another reason to do a dry run if possible | 22:49 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API https://review.opendev.org/c/openstack/diskimage-builder/+/907691 | 23:08 |
clarkb | do we want to approve the letsencrypt changes today before the periodic jobs run them for real? I have to pop out to take kids to an activity in less than an hour so I won't be able to watch it | 23:46 |
clarkb | Maybe tomorrow is better? | 23:46 |
fungi | i can approve and watch them deploy | 23:49 |
fungi | doing it now | 23:49 |
fungi | and i'll abandon my more naive bandaid | 23:50 |
clarkb | sounds good thanks | 23:53 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!