Tuesday, 2025-02-25

clarkbthis is neat. I'm guessing people that live in different parts of the country have more experience with this, but the NWS just put out a severe thunderstorm warning for the next half hour or so based on current radar-indicated conditions. The area they warned is super specific and went straight to my phone00:09
ianwheh i get a text message from my car insurance company warning me there's a storm coming and to move the car inside00:23
ianwjoke's on them because it's a 20-year-old corolla that won't die.  it laughs in the face of hail :)00:25
clarkbit seems to be over for now and no destructive hail thankfully00:26
ianw\o/ another 2025 weather even survived.  on to the next one ... :)00:28
ianws/even/event/00:28
ianwclarkb: on that parallel stuff ... i just can't think of any way to test it other than trying it.  i'm not sure i'm really confident i have the time to monitor and revert potential issue-causing production changes though00:30
clarkbianw: ya I think even for my initial change copying system-config, testing is tough. I think you are right that we pick a time when making changes to those foundational bits, monitor closely, and jump in if necessary. And totally understood that you don't have time to drive that00:31
clarkbI'm trying to get server replacements in a spot where they are rolling forward at a reasonable rate then I should be able to take a look at it if no one else beats me to it00:31
ianw++ i'm mostly around :)  let's sync one day when you feel you've got some bandwidth too and maybe we can try it00:34
mnaserclarkb: cool thank you, fwiw there is valkey as a redis alternative which is a fork that lives under the LF00:51
fricklerinfra-root: I'm wondering why zuul didn't pick up https://review.opendev.org/c/openstack/requirements/+/925057 when I approved it yesterday. the dependency merged like 10 minutes before that, is there some larger delay to be expected? reworkflowing now seems to have resolved the issue07:37
opendevreviewKarolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/94167210:56
fungifrickler: hard to say without checking debug logs on the schedulers for how they handled the comment-added event at 18:03:56 utc13:11
fungior whether the event ever got to zuul13:11
slittle_please add me as first core for group starlingx-app-distributed-cloud-core14:24
slittle_thanks14:31
fungislittle_: https://review.opendev.org/admin/groups/starlingx-app-distributed-cloud-core,members already has tons of members, including you. are you thinking of a different group maybe?14:33
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265015:09
clarkbI'm trying to confirm ^ is actually using memcached and I think it isn't yet for some reason (in particular, testing locally with netcat I get client connection log messages with -vv set, and I don't get those messages in the logs for 942650)16:49
clarkbok I think I see it. Sometimes you just need to talk about it16:51
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265016:53
clarkbdebugging that would've been easier if gitea logged the invalid cache adapter name somewhere but it doesn't appear to do so16:57
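For context on the adapter-name confusion above: Gitea selects its cache backend in the [cache] section of app.ini, and the adapter value it recognizes is "memcache" (not "memcached"). A minimal sketch, with an illustrative host:port rather than whatever value 942650 actually uses:

    ; app.ini sketch: point Gitea's cache at a memcached instance
    [cache]
    ADAPTER = memcache
    HOST = 127.0.0.1:11211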
opendevreviewJeremy Stanley proposed opendev/system-config master: Install ssl-cert-check from distro package not Git  https://review.opendev.org/c/opendev/system-config/+/93918718:17
opendevreviewJeremy Stanley proposed opendev/system-config master: Add certcheck functionality to bridge servers  https://review.opendev.org/c/opendev/system-config/+/94271918:17
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265018:35
clarkbI think that may be ready for review now (depends on whether or not that new test case functions I guess)18:35
fungihttps://github.com/aio-libs/aiosmtpd/issues/278 talks about the logging module detecting the new file and switching correctly, but mailman copies log streams from other modules to its own logs i guess?20:02
clarkband then it refers back to the issue you posted initially, which is still open, as the spot that should track fixing it20:02
clarkbso presumably this is still a problem20:02
clarkbbut ya it seems like python logging loses track of where logs are actually going due to mailman decisions20:03
fungihttps://docs.mailman3.org/en/latest/faq.html#how-to-enable-debug-logging-in-mailman-core indicates you can do logging configuration in mailman.cfg20:03
fungihttps://github.com/maxking/docker-mailman/issues/535 has a suggested logrotate config20:09
fungirelying on `docker-compose restart`20:10
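A sketch of a logrotate rule along those lines; the log path, compose file location, and service name are assumptions for illustration, and the postrotate step is the `docker-compose restart` approach mentioned above:

    # rotate mailman core logs, then restart the container so it reopens its log files
    /var/lib/mailman/core/var/logs/*.log {
        weekly
        rotate 4
        compress
        delaycompress
        missingok
        notifempty
        sharedscripts
        postrotate
            docker-compose -f /srv/mailman/docker-compose.yaml restart mailman-core
        endscript
    }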
fungiit's some kind of awesome that they have a github action that just closes open feature requests as "not planned" if nobody comments on them20:12
clarkbI forgot to mention that I have a dentist appointment tomorrow at 10am local20:26
fungii'll be around20:27
opendevreviewClark Boylan proposed opendev/system-config master: Mirror memcached:latest from docker hub to quay.io/opendevmirror  https://review.opendev.org/c/opendev/system-config/+/94273020:30
clarkbI believe it is safe to land the change updating job and project config for infra-prod-* jobs while the hourlies are running because zuul will use the config compiled when enqueuing the buildset20:58
clarkbanyway that's me double-checking we don't have to wait to approve the infra-prod refactor change until after hourlies, and I don't think we do20:58
clarkbdo we still want to approve it nowish?20:59
fungiyeah, i'm good going ahead with it20:59
clarkback21:00
clarkbthe 2200 hourly runs are the ones we will want to pay attention to21:01
clarkbactually I may have just found a bug looking at the current hourlies21:02
clarkbservice-bridge and bootstrap-bridge are running concurrently21:02
clarkbservice-bridge should depend on bootstrap-bridge to prevent that. I think the change as written is fine because of the semaphore but we should probably fix everything all at once?21:02
clarkboh the reason this happens is that dependencies are not transitive21:03
clarkbservice-bridge depends on infra-prod-base. We don't run infra-prod-base in hourlies so maybe this is ok as is21:03
clarkbrely on the semaphore to prevent concurrent runs like with everything else running for now21:03
clarkbya I think I'm ok with this as is after double checking that21:04
clarkbfungi: if you agree do you want to approve it or should I?21:04
fungii approved it just now21:05
clarkbI put some notes at the end of the todo list on the meetup etherpad pointing at the followup change, plus a note about non-transitive dependencies implying we may want to explicitly depend on both bootstrap-bridge and infra-prod-base21:07
clarkbthey are sort of two halves of the same bootstrapping. One bootstraps ansible, the other bootstraps the node (users, firewall etc)21:08
clarkbbut I don't think any of that is a concern in the parent change due to the use of the semaphore21:08
clarkbits just when we refactor to use less semaphore control that we need to be careful21:08
opendevreviewMerged opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos  https://review.opendev.org/c/opendev/system-config/+/94230721:09
clarkbat 2200 we will want to check that bootstrap-bridge successfully copies over the system-config state (which should be a noop unless we merge another change after the hourlies, because the hourlies will sync over the current state as they are periodic) and that bootstrap-bridge and service-bridge run one after another and not concurrently21:10
clarkbcorvus: ianw: looking at https://review.opendev.org/c/opendev/system-config/+/942439 more closely I think we may want/need to do two additional things. First we need to make infra-prod-playbook depend on infra-prod-bootstrap-bridge with a hard dependency (and drop the files: matcher in bootstrap-bridge so that it always runs). This ensures that we always update system-config before21:18
clarkbrunning, and the hard dependency is what allows the child jobs to trigger after the job pauses, right? Then we might want a second semaphore that infra-prod-playbook holds to limit the total number of those jobs that can run on bridge at the same time (say 5 or 10). Does that make sense to you?21:18
clarkbI guess depending on how inheritance works for job dependencies we may need to make every job have a hard dependency on bootstrap-bridge21:20
clarkbI'm not sure and would have to go check that behavior21:20
clarkbanyway let me know what you think and I can start on those edits in that change to keep things moving forward21:21
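A sketch of roughly what that proposal would look like in Zuul job configuration; the semaphore name and max value are placeholders, and soft: false expresses the hard dependency:

    # limit how many infra-prod playbook jobs run against bridge at once
    - semaphore:
        name: infra-prod-playbook-limit
        max: 5

    - job:
        name: infra-prod-playbook
        dependencies:
          # hard dependency (soft: false) on the bootstrap job
          - name: infra-prod-bootstrap-bridge
            soft: false
        semaphores:
          - name: infra-prod-playbook-limit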
corvusclarkb: i think your original thought is right, no need to duplicate the dep; just inherit it from playbook21:26
clarkbcorvus: for the multiple semaphore thing do you know if it is possible for jobs to grab semaphores before their dependencies have completed?21:31
clarkbcorvus: I'm wondering if I can deadlock things with two semaphores and I think it comes down to when zuul jobs acquire the locks21:31
corvusclarkb: by default semaphores are acquired before the node request is submitted; https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.semaphores.resources-first switches it to after -- but i'm not sure that matters here.  who's the other actor for the potential deadlock?21:33
corvusi think it would be a periodic pipeline running the same job structure right, so also trying to acquire the "outer" semaphore?21:34
clarkbcorvus: yes21:34
corvusif i'm following correctly, i think that means the "inner" semaphores are all protected by the outer one, so we shouldn't deadlock.21:34
clarkback thanks for confirming21:34
corvusbut if we have playbook jobs running without the bootstrap job, that could become a problem21:34
corvusso that would just need to be a thing we write notes about and remember not to approve changes that would do that.  :)21:35
clarkbgot it21:35
corvus(but if the playbook job has a hard dependency on the bootstrap job, it shouldn't be possible for us to make that mistake anyway)21:36
corvus(if we tried to do that, then we'd get a job graph freeze error)21:36
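For reference, the acquisition-order switch corvus links above is set per semaphore on the job; a minimal sketch with placeholder names:

    - job:
        name: example-deploy
        semaphores:
          - name: outer-deploy-lock
            # acquire the semaphore only after node resources are allocated
            resources-first: true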
opendevreviewClark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243921:43
clarkbI think ^ that latest patchset covers these improvements21:44
clarkbI'm half tempted for 942439's deployment to remove a bunch of the infra-prod-* jobs from pipelines to reduce the potential blast radius. That limits what we can deploy until we're confident in it21:59
clarkbbut maybe that is a good compromise between lack of testing and sending it22:00
clarkbalso hourlies should be enqueuing now22:00
clarkblooks like they haven't yet. There is a small amount of variance in the pipeline config iirc22:00
clarkb2 minute jitter22:00
clarkbwe are enqueued22:01
clarkbonly bootstrap-bridge appears to be running as expected so we've excluded service-bridge from overlapping (we want that)22:02
clarkbthe job succeeded and now service-bridge is running22:03
corvusclarkb: super nit on a comment (but i thought it might confuse us later so prob worth fixing)22:04
clarkbcorvus: ack22:05
corvus(i don't think the comment is technically incorrect, just could be interpreted multiple ways)22:05
clarkbfwiw I don't think the first change has created any problems but I'm having a hard time confirming we actually set up the system-config sources in the way we anticipated22:05
corvuswhat do you mean?22:06
clarkbhttps://zuul.opendev.org/t/openstack/build/8b3a3efe585b4ab1b8f9fc79891e0392/log/job-output.txt#63-84 this shows logs for setting up keys but not for the source22:06
clarkbthe actual source code on bridge looked fine fwiw. But the earlier hourlies would've updated to the latest state. So I'm just having a hard time confirming we actually synchronized things based on the log22:06
clarkbthe console tab similarly lacks info22:06
clarkbI think we actually did noop and it is because we have things in separate plays22:07
clarkbso the add_host magic isn't applying22:07
corvusyeah, i don't expect that source setup task ran22:08
clarkbyup I'm like 99% certain that is what happened looking at opendev/base-jobs22:08
clarkbso we're in the same boat as before where we don't actually update system-config yet. But I can push a change up to fix that22:08
opendevreviewClark Boylan proposed opendev/base-jobs master: Fix dynamic inventory in -src setup job  https://review.opendev.org/c/opendev/base-jobs/+/94274022:14
clarkbI think ^ is the fix for that. Note that we should give that change the same level of care in review and the same thought-out merge window22:14
clarkbsince this is what now effectively attempts to make the previously planned change22:15
opendevreviewClark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243922:17
clarkbit is worth noting that I think any inventory update will fail on its initial pass now as we have stricter locking across jobs and we aren't setting up system-config as expected yet22:18
clarkbbut that is a fail-safe behavior. Otherwise it's just annoying :)22:18
corvuswhat was the purpose of using two individual playbooks instead of pre.yaml?22:19
corvus(that is, what was the motivation for that in the recent change)22:19
corvusregardless of the answer, i think we should +w https://review.opendev.org/942740 now22:20
clarkbcorvus: this was all part of ianw's original attempt at refactoring this. I think he liked the clarity of organization that provided. The job shows it explicitly does this playbook then that playbook rather than needing to open the pre.yaml playbook to see that22:20
corvusi +2d it but leave it to you clarkb to decide if it needs more review22:20
clarkbfungi: any chance you want to review 942740 before the 2300 UTC hourlies so that we can check again then?22:21
corvuswhy were there 2 playbooks to begin with?22:21
corvusbecause that used to be in 2 jobs?22:21
clarkbcorvus: because after the refactor only this very base job needs to update system-config on bridge. Everything else only needs to update keys and inventory22:21
corvusoh22:22
corvusso we're going to call keys from playbook and both from bootstrap22:22
clarkbother way around. Pre refactor we have a single base job that was using pre.yaml. After the refactor we have a base job for setting up keys and source on bridge that will only run in one place. Then we also have a base job for everything else that only needs the keys to be set up22:22
clarkbya22:22
clarkband since that refactor never completed we just never got far enough along to realize having two playbooks like that doesn't work unless you explicitly set up the inventory twice22:23
clarkbI think setting it up twice is likely to be more error prone so I switched back to using pre.yaml for the singular thing that needs to update sources22:23
corvusokay.  we typically use the pattern of playbooks listing roles for that.  doing that here might be a good idea.  that would get the clarity we're seeking while keeping the playbooks simple to understand.22:23
clarkbya I guess we could refactor this more into a role to update inventory, a role to update keys, and a role to update system-config. Then have one playbook that does all three and one playbook that only does update inventory and keys22:24
corvus++ but not urgent22:25
clarkbin this situation I'm still trying to keep the diffs as small as possible to get working known_hosts values out the other side. Happy to incorporate more refactoring as part of the larger make things nicer and parallel effort22:25
clarkbhttps://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml basically you can see this just includes the two playbooks we were running before22:26
clarkbbut instead of running them as separate processes they are in one process so the dynamic inventory only needs to be updated once22:26
clarkbhttps://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-keys.yaml and https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-src.yaml are what we then run22:27
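One way to express that single-process arrangement, as a sketch (the actual pre.yaml may inline the plays rather than importing them, but the effect is the same single ansible-playbook run):

    # pre.yaml sketch: run both setup playbooks in one ansible-playbook
    # invocation so add_host changes persist into the later plays
    - import_playbook: setup-keys.yaml
    - import_playbook: setup-src.yaml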
clarkbthe other infra-prod-playbook jobs were inheriting from the job that runs pre.yaml so this should already be well exercised and just work22:27
clarkbnow that I've said that I'm less sure :) but that was part of my reasoning for using this approach22:27
corvusseems plausible to me :)22:33
clarkbjobs only take a handful of minutes but do you think I should go ahead and approve it without fungi's review to ensure we exercise it during the 2300 UTC hourly runs?22:34
corvusyes, i think so; i did a more-than-surface review, and i think it's a good time to inspect and verify the solution before the problem starts to impact other work.22:38
clarkbok I'll approve it now22:39
fungii can take a look in a moment too22:41
opendevreviewMerged opendev/base-jobs master: Fix dynamic inventory in -src setup job  https://review.opendev.org/c/opendev/base-jobs/+/94274022:47
clarkbthanks! let's see if we get better results on the next pass22:48
clarkbshould enqueue in the next few minutes22:59
ianwyeah my intent was to have a setup-src which sets up the source, and a setup-keys that sets up the keys, and pre.yaml should have been removed because jobs would only need to setup keys.  which we sort of got to but then some reverts happened and TBH i never got back to it23:00
clarkbwe're back to using pre.yaml more but I think we can clean it up as part of the larger refactor later23:00
clarkbif we put things in roles then we just have a playbook that does source setup including both roles and one that does keys only using only one role23:01
clarkbthen it's one playbook in both cases with dynamic inventory all happy23:01
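A sketch of that role-based split, with hypothetical role names and illustrative host targeting:

    # playbook for the bootstrap job: full setup including system-config source
    - hosts: localhost
      roles:
        - update-inventory
        - update-keys
        - update-system-config

    # playbook for every other infra-prod job: inventory and keys only
    - hosts: localhost
      roles:
        - update-inventory
        - update-keys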
clarkbbased on the streaming log we updated the system-config repo this time around23:02
ianw++ agree on that23:02
clarkbthe git log lgtm on bridge for that repo23:03
clarkband looking at https://zuul.opendev.org/t/openstack/build/0301b592d20c47679ae911327dcd8b3c/ I don't see anything amiss23:04
clarkbI think I'm happy with that23:05
corvuslgtm23:06
clarkbsweet we should have known_hosts that updates from an accurate inventory now23:07
clarkbthen we can followup on further refactors as more people review the existing change23:07
clarkbcurious what others think of temporarily reducing the total number of infra-prod jobs to reduce blast radius23:07
clarkbwe probably still need infra-prod-base to run but then can maybe limit to paste/etherpad/tracing things that are disconnected from gerrit/zuul if we like23:08
clarkbor maybe that seems like too much effort since we're already limiting things to running one at a time to start23:08
clarkbI'm going to work on deleting tracing01 now23:09
corvusi lean towards "not necessary" because we're not actually changing what these jobs do... i guess the worry is that somehow we end up running the "zuul" playbook on the "gerrit" host or something?  that seems pretty unlikely, but i guess it's relatively cheap insurance.  i guess i feel it's not more unlikely than messing that up with any other change.  :)23:11
ianwclarkb: probably with the extra semaphore (excellent idea) hopefully that would stop a massive mess 23:11
clarkbcorvus: I think it's either running in the wrong place, which seems highly unlikely, or running a playbook with old/stale/improper system-config state23:11
clarkband ya I think the risk of anything catastrophic is low. The system-config state being wrong would most likely occur if we allow two jobs to update system-config and end up in some intermediate state?23:12
clarkbbut even that seems unlikely given git is careful about managing state23:12
corvusi think stale system-config state is an easily recoverable error; so even though it's a more likely error, i don't worry too much about it and don't think it warrants that extra caution.23:12
clarkb#status log Deleted tracing01.opendev.org (ae3bb36a-af9e-40e0-8cc4-2bf4890ec5d4) as it has been replaced with tracing02.opendev.org23:12
corvusi agree with ianw and your comment (clarkb) that starting with a low parallel mutex reduces risk too23:12
opendevstatusclarkb: finished logging23:12
clarkback in that case I think the change as proposed is probably ready for scrutiny23:13
ianwcool.  it in general looks like what i think it should look like.  the idea for me has been that in a DR scenario for bridge, we can bring up the node, ensure secrets are in place (from our distributed backup :) and zuul can log into it, and that should be it.  that was why i was keen on the "bootstrap" job that gets bridge into an "able to run ansible" state23:18
clarkbI think the distributed backup is still a wip too :/23:20
clarkbbut one step at a time23:20
clarkbcan't boil the ocean all at once23:20
ianwheh, yeah that's OK.  i just remember replacing bridge last time being a very scary notion.  but that's how a lot of the "abstract and bootstrap bridge" work kind of got tangled up with "run jobs in parallel".  the bootstrap makes for a good synchronization point for system-config, which is how they get mixed in with each other23:23
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories  https://review.opendev.org/c/opendev/system-config/+/86578423:30
opendevreviewIan Wienand proposed opendev/system-config master: tools/make-backup-key.sh  https://review.opendev.org/c/opendev/system-config/+/86643023:30
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: add some extraction instructions  https://review.opendev.org/c/opendev/system-config/+/87558723:30
