opendevreview | Merged opendev/system-config master: Fix doc for including statusbot in channels https://review.opendev.org/c/opendev/system-config/+/934189 | 03:10 |
opendevreview | Merged opendev/system-config master: Add meetbot and statusbot to openstack-eventlet-removal https://review.opendev.org/c/opendev/system-config/+/934164 | 08:30 |
slaweq | fungi frickler ianw hi, is my understanding of the meeting logs correct that you want to simply try to use "curl" in ansible to trigger rtd webhook? | 08:32 |
slaweq | if yes, I can propose such change | 08:32 |
*** marios_ is now known as marios | 09:20 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 12:29 |
fungi | slaweq: yes, basically switch it to a shell task calling curl, looks like our executors already have that installed so it should "just work" | 12:50 |
slaweq | fungi: ok, I will propose a patch for this today | 12:55 |
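The replacement task fungi describes boils down to a curl call with HTTP basic auth. A minimal sketch; the webhook URL and credentials below are placeholders, not the real job parameters (those live in the job's secrets):

```shell
#!/bin/sh
# Sketch of the curl invocation a shell task could run to trigger the
# readthedocs.org webhook. URL, user, and password are placeholders.
trigger_rtd_webhook() {
    url="$1"; user="$2"; password="$3"
    # --fail: exit non-zero on HTTP errors so the Ansible task (and the
    # job) fails visibly instead of silently succeeding.
    curl --fail --silent --show-error \
        --user "${user}:${password}" \
        --request POST \
        "$url"
}

# Example (placeholder values):
# trigger_rtd_webhook "https://readthedocs.org/api/v2/webhook/example/12345/" rtd-user rtd-secret
```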
fungi | something changed on nb04 monday-ish that has been causing infra-prod-letsencrypt to fail in the "Install acme.sh client" task with error "Local modifications exist in the destination: /opt/acme.sh" | 13:18 |
fungi | nobody has logged into the server for the past month, so it doesn't look like the file was manually altered | 13:19 |
fungi | comparing to that git repo on nb01, both have head at commits with identical diffs but different commit ids | 13:28 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 13:30 |
fungi | aha, that's because the "Patch for issue 4659" task does git cherry-picks at each run, so the resulting commit ids will always differ | 13:35 |
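This follows from how git hashes commits: the committer timestamp is part of the commit object, so re-running the same cherry-pick on different days produces identical diffs with different ids. A quick demonstration in a throwaway repo:

```shell
#!/bin/sh
# Re-cherry-picking the same patch with different committer dates gives
# identical diffs but different commit ids, because the committer
# timestamp is hashed into the commit object.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git commit -q --allow-empty -m base
echo fix > fix.txt && git add fix.txt
git commit -q -m "some patch"
patch=$(git rev-parse HEAD)

pick() {
    # cherry-pick the patch onto a fresh branch at a fixed "run time"
    git checkout -q -b "$1" "$patch~1"
    GIT_COMMITTER_DATE="$2" git cherry-pick "$patch" >/dev/null
    git rev-parse HEAD
}
id1=$(pick run1 "2024-11-05T06:00:00")
id2=$(pick run2 "2024-11-06T06:00:00")
if [ "$id1" != "$id2" ]; then
    echo "same diff, different commit ids"
fi
```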
ykarel | frickler, fungi wrt translation update job broken which i raised days before | 13:35 |
ykarel | any way to get it tested i had pushed https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/934034 | 13:36 |
ykarel | with depends-on fails as Unable to freeze job graph: Job <Job propose-translation-update branches: None source: opendev/base-jobs/zuul.d/jobs.yaml@master#25> which is defined in opendev.org/openstack/project-config is protected and cannot be inherited from other projects. | 13:36 |
fungi | ykarel: jobs defined in trusted config projects don't use speculative execution for safety reasons, since someone could propose unmerged patches to disclose things like the zanata account credentials trusted to that job | 13:38 |
ykarel | yeap, so have to get it merged | 13:38 |
fungi | yes | 13:39 |
ykarel | ok will push removing WIP | 13:41 |
opendevreview | Slawek Kaplonski proposed zuul/zuul-jobs master: Use "curl" to trigger rtd.org webhook with http basic auth https://review.opendev.org/c/zuul/zuul-jobs/+/934243 | 13:43 |
slaweq | fungi frickler clarkb ^^ please check when you will have a minute | 13:43 |
slaweq | also ianw | 13:43 |
slaweq | thx for all the help with this | 13:44 |
fungi | so for the git repo on nb04, it looks like it somehow ended up with a soft reset of the cherry-picked commits, basically the reflog has head at the cherry-picked state, but the worktree matches the pre-cherry-picked state instead (upstream tag 3.0.5) | 13:46 |
fungi | and that's preventing ansible's git module from proceeding. but no sign in the prior build logs for what could have left it in this state | 13:47 |
fungi | anyway, i'm thinking this is nothing nefarious and the cherry-pick simply got interrupted before the worktree was updated. if other infra-root folks aren't particularly concerned i'm happy to reset the repository state and check the next build | 13:58 |
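The state described here can be inspected and repaired with plain git. A sketch, written as functions so they can be pointed at any clone (/opt/acme.sh is the path from the log):

```shell
#!/bin/sh
# Helpers matching the situation above: HEAD and the reflog sit at the
# cherry-picked commits while the worktree matches the older tag, so
# git reports local modifications.
show_repo_state() {
    repo="$1"
    git -C "$repo" log --oneline -3   # where HEAD actually points
    git -C "$repo" reflog -5          # how HEAD got there
    git -C "$repo" status --short     # the "local modifications"
}

reset_repo_state() {
    # The recovery discussed: force the worktree back to HEAD and
    # remove untracked leftovers.
    git -C "$1" reset --hard HEAD
    git -C "$1" clean -dffx
}

# e.g. show_repo_state /opt/acme.sh; reset_repo_state /opt/acme.sh
```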
frickler | fungi: +1 | 13:59 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Force acme.sh Git repo refresh https://review.opendev.org/c/opendev/system-config/+/934247 | 14:47 |
fungi | infra-root: there's ^ an alternative option | 14:47 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 15:09 |
frickler | just in case someone else notices broken reqs-check jobs https://zuul.opendev.org/t/openstack/builds?job_name=requirements-check&skip=0 revert is in progress, might even consider force-merging https://review.opendev.org/c/openstack/requirements/+/934249 | 15:22 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 15:39 |
frickler | fungi: did you look at the nb04 failure because of the deploy issues? https://zuul.opendev.org/t/openstack/buildset/d71c9b202bb6409ea7830341e0eb704d | 15:39 |
corvus | frickler: is there no gate job on the requirements repo that runs the requirements check job? | 15:40 |
frickler | corvus: I was caught by surprise by that, too, else I'd have done some extra checking | 15:49 |
clarkb | fungi: the acme.sh thing happens when we run out of disk | 15:54 |
clarkb | fungi: we discussed this previously and one of the things I wanted to see us do was stop updating the repo after it was in the state we wanted | 15:55 |
clarkb | the problem is every time we run (so at least once a day) we completely reset the git repo state. Which explodes if we have no more disk, but it's also unnecessary | 15:55 |
corvus | frickler: i'm wondering if we can just run it targeting the requirements repo: testing it out in https://review.opendev.org/934251 and https://review.opendev.org/934252 | 15:56 |
clarkb | unfortunately I think this is another case where the ansible git module just isn't great and we might have to replace it with manual management of git | 15:56 |
corvus | clarkb: fungi or we could just make a temporary fork on github or something | 15:56 |
clarkb | corvus: ya that's another option. I think if we do that then the version we point at has to be the HEAD ref (so master/main/whatever) otherwise the git module always resets things | 15:58 |
clarkb | we can also check if we still need the patched setup. I suspect we do unfortunately | 15:59 |
clarkb | https://github.com/acmesh-official/acme.sh/issues/4416 is still open :/ | 16:00 |
clarkb | reading that I guess we could just switch to the ecc path maybe? | 16:03 |
clarkb | but ya simplest thing now may be to have a fork and just use that, apply the security related patches and check out dev/master/main whatever we set HEAD to | 16:03 |
clarkb | actually I think the script is entirely self contained | 16:05 |
clarkb | what if we just vendor it in its entirety and use an ansible file copy instead? | 16:05 |
fungi | frickler: yes, deploy failure on https://review.opendev.org/934164 is what got me looking into the broken letsencrypt git state | 16:07 |
clarkb | tried to summarize all of that on the change so that we don't lose the context later and we can figure out what we want to do from there | 16:07 |
clarkb | fungi: /dev/mapper/main-main 787G 787G 0 100% /opt the disk is full so I don't think your change will fix it fwiw | 16:08 |
fungi | clarkb: does acme.sh not use the driver.sh file or any of the scripts under the subdirectories? | 16:09 |
fungi | oh, hah | 16:09 |
fungi | what's the usual cleanup routine? down the nodepool-builder container and then delete all of dib's tempfiles? | 16:10 |
corvus | frickler: answer to that question appears to be "no"; so i made this instead: https://review.opendev.org/c/openstack/requirements/+/934253 WIP: Update requirements-check job to run on arbitrary targets [NEW] | 16:10 |
fungi | clarkb: specifically delete all the /opt/dib_tmp/* directories? | 16:11 |
clarkb | fungi: yes, down the service then start a delete of the /opt/dib_tmp contents. The hourly jobs will start the service back up again while that's running (which will probably take a day or so) but that hasn't been a huge issue | 16:11 |
fungi | on it | 16:11 |
clarkb | fungi: also if losetup has no more available devices then doing a reboot is helpful | 16:11 |
fungi | k | 16:12 |
clarkb | losetup -a gives you a list of all loopback devices. I think the max is like 14 or 16 or something, so if we're well under that then we're good | 16:12 |
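Putting the routine together, roughly. The docker-compose file location below is an assumption about the deployment; /opt/dib_tmp and the losetup check come straight from the discussion:

```shell
#!/bin/sh
# Sketch of the full-disk cleanup routine for a nodepool builder. The
# compose-file path is an assumed placeholder.
count_loop_devices() {
    losetup -a | wc -l
}

cleanup_builder() {
    # 1. Stop the builder so nothing keeps writing to /opt/dib_tmp
    docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml down
    # 2. Delete dib's temp dirs (can take a day or so on a full volume)
    rm -rf /opt/dib_tmp/*
    # 3. If loop devices are exhausted, a reboot frees them
    echo "loop devices in use: $(count_loop_devices)"
}
```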
clarkb | re the subfolders I thought we just copy the acme.sh script out and run it from another location but I'll check on that | 16:13 |
clarkb | I'm wrong about that the driver.sh executes acme.sh out of the git repo dir | 16:14 |
clarkb | so vendoring is potentially problematic | 16:14 |
clarkb | idea: put the acme.sh install stuff into a block that only runs when a hash of acme.sh does not match what we expect | 16:19 |
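A minimal sketch of that guard. EXPECTED_SHA256 is a placeholder; the real value would be the digest of acme.sh after the desired patches are applied:

```shell
#!/bin/sh
# Run the install/patch block only when the installed acme.sh does not
# match the expected (post-patch) content.
EXPECTED_SHA256="0000000000000000000000000000000000000000000000000000000000000000"

acme_needs_update() {
    # exit 0 (true) when the install steps should run
    script="$1"
    [ -f "$script" ] || return 0                  # not installed yet
    actual=$(sha256sum "$script" | awk '{print $1}')
    [ "$actual" != "$EXPECTED_SHA256" ]           # content drifted
}

# e.g. if acme_needs_update /opt/acme.sh/acme.sh; then ...install...; fi
```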
fungi | tempfile cleanup is in progress in a root screen session on nb04 | 16:19 |
clarkb | I'll write that up as an alternative we can consider | 16:19 |
fungi | currently there are 10 loop devs on the server | 16:20 |
clarkb | but that should work for new installs and existing installs and be more resilient to disks filling | 16:20 |
clarkb | fungi: ok so not at the limit but getting close. Reboot probably a good idea after deletions are done ~tomorrow | 16:20 |
fungi | agreed | 16:20 |
opendevreview | Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary https://review.opendev.org/c/opendev/system-config/+/934256 | 16:33 |
clarkb | something like that maybe | 16:33 |
corvus | frickler: https://zuul.opendev.org/t/openstack/build/79171c3eabc540bfa37eb58af8e00e4b shows the expected failure. i will update https://review.opendev.org/934253 in a bit to make it ready to merge | 16:34 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_780/934256/1/check/system-config-run-static/78081f2/bridge99.opendev.org/ara-report/playbooks/3.html shows 934256 working in the no previously installed acme.sh client case | 17:00 |
clarkb | I wonder if we should update system-config-run-letsencrypt to do two passes so we can confirm the other case? | 17:01 |
clarkb | any opinions on ^ I think that should be doable if we don't mind the job taking longer | 17:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt https://review.opendev.org/c/opendev/system-config/+/934258 | 17:08 |
clarkb | did it as a followup change as we may decide we don't want it after confirming the behavior in 934256 | 17:08 |
fungi | clarkb: yeah, an idempotency test there is a great idea | 17:11 |
corvus | frickler: https://review.opendev.org/c/openstack/requirements/+/934253 is ready for review | 17:23 |
corvus | that will close the gate hole for the requirements-check job and should make changes like updating the nodeset or fixing whatever problems there are with upgrading to noble easier | 17:24 |
opendevreview | Clark Boylan proposed opendev/system-config master: Only update acme.sh if necessary https://review.opendev.org/c/opendev/system-config/+/934256 | 17:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run letsencrypt twice in system-config-run-letsencrypt https://review.opendev.org/c/opendev/system-config/+/934258 | 17:32 |
clarkb | it caught a bug \o/ | 17:32 |
fungi | i went ahead and hard reset and cleaned the acme.sh clone on nb04 | 17:51 |
fungi | since the cause is known | 17:51 |
fungi | dib cleanup has freed about 20gb so far | 17:52 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_599/934258/2/check/system-config-run-letsencrypt/5995444/bridge99.opendev.org/ara-report/playbooks/5.html shows the skip working now | 18:05 |
clarkb | that doesn't solve the disk filling but should make us more resilient to it for cert renewals | 18:06 |
fungi | agreed, looks great | 18:08 |
clarkb | I think it also shows the driver.sh copy is changed: false so nooping properly too | 18:08 |
clarkb | corvus: I've done some code review of zuul-registry's prune method. The thing that stands out to me is that if https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L287-L288 this list is incomplete (or empty) then https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/storage.py#L292 will delete every blob that is an hour or more old. I | 21:59 |
clarkb | wonder if we should set upload_target to manifest_target out of an abundance of caution on line 292? | 21:59 |
clarkb | corvus: we might still be broken if that occurs, but less so. Another thought I had is to break up prune into several passes so that we only prune manifests, then check on things, then prune blobs ensuring the layers list looks right? I don't know how helpful that is if the above theory is the issue | 21:59 |
ianw | there wasn't any grand plan pulling acme.sh from git; rather just an anti-bitrot incentive to keep up with upstream. as time has gone on, we're not doing that and pinned for various reasons anyway. i don't suppose checking a file sha is really functionally any different than say, checking a pinned .tar.gz release file | 22:00 |
clarkb | ianw: ya the upside to checking the sha is it checks that we've applied the desired security fixes too without needing to fork or maintain our own package etc. Bit hacky but simple | 22:01 |
ianw | on the positive side acme isn't changing, neither is sh, and we've got ecc certificates that seem fairly future proof ... so in theory i guess it can just sit there doing its thing | 22:01 |
clarkb | corvus: and as for why that list may be incomplete I'm thinking that could be a side effect of swift client behavior? similar to how we can only delete 10k objects per request maybe we only ever get back 10k objects (or some other limit) at a time? | 22:02 |
clarkb | ianw: ya I think the main risk is if we have to patch for other security concerns. If we hit that point we're probably rewriting things to use the other directory? | 22:02 |
clarkb | ianw: the issue is just that the certs go into the "wrong" directory, right? so everything has to update to look for them there? | 22:02 |
ianw | it's been a long time -- i think the problem is because of our two-step process, where we create certs and get the TXT records, install them, then come back and use renew to pick up where we left off -- acme.sh isn't picking up where it left off, it doesn't correctly recognise the _ecc dir and tries to start the process again | 22:06 |
clarkb | aha | 22:07 |
ianw | i think that we're fairly unique in having this two step process -- most people seem to be using an acme.sh plugin to their DNS-as-a-service API where acme.sh essentially pauses and then carries on | 22:07 |
clarkb | so not just a matter of us picking the right new directory but acme.sh internally needs to see the certs in the correct dir | 22:07 |
ianw | right, the "renew" path needs fixing ... i'm sure it could be done, it's just a matter of time and effort, like everything :) | 22:09 |
clarkb | ya I assume they would take a patch. /me puts it in the pile | 22:09 |
fungi | in theory we could set up dynamic dns updates, or just delegate the subdomain we use to something that has its own update api | 22:10 |
clarkb | unrelated to ^ does anyone have an excuse to send mail to service-discuss? I haven't seen any bounce processing yet and I think that is because we need new mail to generate bounces | 22:10 |
fungi | nothing springs to mind | 22:13 |
ianw | fungi: heh, well we (corvus mostly i guess) spent a lot of effort to get us _off_ rax cloud dns :) | 22:13 |
ianw | perhaps we should just investigate using the standard python client. the idea was that acme.sh was small and zero-dependency, but the reality is that having a python venv is not really going to be a concern for anything that needs a cert | 22:15 |
fungi | yeah, i meant something we run, but i agree it's sort of silly to add yet another service just for that one case | 22:15 |
clarkb | I do really like that the dns based verification allows us to provision things all in one go without a chicken and egg dance with the webservers | 22:20 |
clarkb | but it does seem to be far less common and then on top of that as you mention we're using it in the less common way | 22:20 |
ianw | i think the dns-based is common, but what's not common is doing it with two separate invocations of acme.sh | 22:23 |
corvus | yeah the dns based process is great | 22:30 |
corvus | clarkb: the number of layers per manifest should be like 5-20, not 10k, so i doubt the api limit is at play here, but i agree that if something goes wrong there, deleting all blobs older than 1hr is what we would do, and that sounds like the sort of thing that might have happened before. maybe we should add a --dry-run argument to see what it would do? | 22:36 |
clarkb | oh ya dry-run may be a good option and have it print out paths and ctimes? | 22:37 |
corvus | yeah. i'll do that real quick | 22:37 |
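The shape of such a dry-run flag, sketched in shell for illustration only (zuul-registry itself is Python; this just shows the print-instead-of-delete pattern with path and ctime that the discussion suggests):

```shell
#!/bin/sh
# Dry-run pattern for a prune pass: with DRY_RUN=1, print what would be
# deleted (path plus change time) instead of deleting it.
DRY_RUN=${DRY_RUN:-1}

prune_blob() {
    path="$1"
    if [ "$DRY_RUN" = "1" ]; then
        # stat -c %z is GNU coreutils: the file's change (ctime) stamp
        echo "would delete: $path ($(stat -c %z "$path"))"
    else
        rm -f "$path"
    fi
}
```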
clarkb | corvus: re the count of things the issue is not in getting the layer blobs but in listing all the keep manifests potentially. If that list is short then we won't keep everything and then we'll delete new stuff maybe? | 22:38 |
clarkb | in any case it seems like deleting things more than an hour old may be the underlying problem even if I'm not certain of how that would happen so ++ to dry-run option | 22:39 |
corvus | yeah, manifests could be a problem now, but maybe not a long time ago.... | 22:40 |
clarkb | er and I didn't mean do upload_target = manifest_target. I meant pass manifest_target where we pass upload_target today in that one call | 22:42 |
clarkb | I think an hour grace period for inflight uploads is likely to be fine | 22:42 |
corvus | clarkb: i think upload_target is used there with the idea that any blob in an existing finalized manifest should be handled by the mark/sweep list, but unknown blobs may show up because of in-progress uploads, so anything within the upload window should be kept. | 22:45 |
corvus | clarkb: put that another way: that value could be "older than now" except that we want to keep in-progress upload blobs. | 22:46 |
corvus | all to say, i think upload_target makes sense there | 22:46 |
clarkb | corvus: actually theory two: since blobs are sha256sum addressed is it possible that we're deleting shared blobs? | 22:46 |
clarkb | I guess that shouldn't be the case because we would return those shared blobs in the kept list | 22:47 |
clarkb | assuming they are used by newer manifests they would end up in that list and then we shouldn't delete them | 22:47 |
corvus | clarkb: right, that's the theory. the potential bug there would be "there's something we're not accounting for" like a different kind of manifest format, or recursive manifests with multi arch or something. | 22:48 |
corvus | in other words, to do this correctly, we have to understand all the manifest formats we handle | 22:48 |
clarkb | ah yup that could explain it too | 22:48 |
clarkb | the unknown manifest log message should expose that. Another reason to do a dry run if possible | 22:49 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API https://review.opendev.org/c/openstack/diskimage-builder/+/907691 | 23:08 |
clarkb | do we want to approve the letsencrypt changes today before the periodic jobs run them for real? I have to pop out to take kids to an activity in less than an hour so I won't be able to watch it | 23:46 |
clarkb | Maybe tomorrow is better? | 23:46 |
fungi | i can approve and watch them deploy | 23:49 |
fungi | doing it now | 23:49 |
fungi | and i'll abandon my more naive bandaid | 23:50 |
clarkb | sounds good thanks | 23:53 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!