Tuesday, 2025-02-25

clarkbthis is neat. I'm guessing people that live in different parts of the country have more experience with this, but the NWS just put out a severe thunderstorm warning for the next half hour or so based on current radar-indicated conditions. The area they warned is super specific and went straight to my phone00:09
ianwheh i get a text message from my car insurance company warning me there's a storm coming and to move the car inside00:23
ianwjoke's on them because it's a 20-year-old corolla that won't die.  it laughs in the face of hail :)00:25
clarkbit seems to be over for now and no destructive hail thankfully00:26
ianw\o/ another 2025 weather even survived.  on to the next one ... :)00:28
ianws/even/event/00:28
ianwclarkb: on that parallel stuff ... i just can't think of any way to test it other than trying it.  i'm not sure i'm really confident i have the time to monitor and revert potential issue-causing production changes though00:30
clarkbianw: ya I think even for my initial change copying system-config, testing is tough. I think you are right that we pick a time when making changes to those foundational bits, monitor closely, and jump in if necessary. And totally understood that you don't have time to drive that00:31
clarkbI'm trying to get server replacements in a spot where they are rolling forward at a reasonable rate then I should be able to take a look at it if no one else beats me to it00:31
ianw++ i'm mostly around :)  let's sync one day when you feel you've got some bandwidth too and maybe we can try it00:34
mnaserclarkb: cool thank you, fwiw there is valkey as a redis alternative which is a fork that lives under the LF00:51
fricklerinfra-root: I'm wondering why zuul didn't pick up https://review.opendev.org/c/openstack/requirements/+/925057 when I approved it yesterday. the dependency merged like 10 minutes before that, is there some larger delay to be expected? reworkflowing now seems to have resolved the issue07:37
opendevreviewKarolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/94167210:56
fungifrickler: hard to say without checking debug logs on the schedulers for how they handled the comment-added event at 18:03:56 utc13:11
fungior whether the event ever got to zuul13:11
slittle_please add me as first core for group starlingx-app-distributed-cloud-core14:24
slittle_thanks14:31
fungislittle_: https://review.opendev.org/admin/groups/starlingx-app-distributed-cloud-core,members already has tons of members, including you. are you thinking of a different group maybe?14:33
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265015:09
clarkbI'm trying to confirm ^ is actually using memcached and I think it isn't yet for some reason (in particular, testing locally with netcat I get client connection log messages with -vv set, and I don't get those messages in the logs for 942650)16:49
clarkbok I think I see it. Sometimes you just need to talk about it16:51
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265016:53
clarkbdebugging that would've been easier if gitea logged the invalid cache adapter name somewhere but it doesn't appear to do so16:57
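For context on the adapter-name confusion above: Gitea selects its cache backend in the [cache] section of app.ini, and the adapter value it recognizes is "memcache" (not "memcached"). A minimal sketch, with an illustrative host:port rather than whatever value 942650 actually uses:

    ; app.ini sketch: point Gitea's cache at a memcached instance
    [cache]
    ADAPTER = memcache
    HOST = 127.0.0.1:11211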
opendevreviewJeremy Stanley proposed opendev/system-config master: Install ssl-cert-check from distro package not Git  https://review.opendev.org/c/opendev/system-config/+/93918718:17
opendevreviewJeremy Stanley proposed opendev/system-config master: Add certcheck functionality to bridge servers  https://review.opendev.org/c/opendev/system-config/+/94271918:17
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265018:35
clarkbI think that may be ready for review now (depends on whether or not that new test case functions I guess)18:35
fungihttps://github.com/aio-libs/aiosmtpd/issues/278 talks about the logging module detecting the new file and switching correctly, but mailman copies log streams from other modules to its own logs i guess?20:02
clarkband then it refers back to the issue you posted initially, which is still open, as the spot that should track fixing it20:02
clarkbso presumably this is still a problem20:02
clarkbbut ya it seems like python logging loses track of where logs are actually going due to mailman decisions20:03
fungihttps://docs.mailman3.org/en/latest/faq.html#how-to-enable-debug-logging-in-mailman-core indicates you can do logging configuration in mailman.cfg20:03
fungihttps://github.com/maxking/docker-mailman/issues/535 has a suggested logrotate config20:09
fungirelying on `docker-compose restart`20:10
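A sketch of a logrotate rule along those lines; the log path, compose file location, and service name are assumptions for illustration, and the postrotate step is the `docker-compose restart` approach mentioned above:

    # rotate mailman core logs, then restart the container so it reopens its log files
    /var/lib/mailman/core/var/logs/*.log {
        weekly
        rotate 4
        compress
        delaycompress
        missingok
        notifempty
        sharedscripts
        postrotate
            docker-compose -f /srv/mailman/docker-compose.yaml restart mailman-core
        endscript
    }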
fungiit's some kind of awesome that they have a github action that just closes open feature requests as "not planned" if nobody comments on them20:12
clarkbI forgot to mention that I have a dentist appointment tomorrow at 10am local20:26
fungii'll be around20:27
opendevreviewClark Boylan proposed opendev/system-config master: Mirror memcached:latest from docker hub to quay.io/opendevmirror  https://review.opendev.org/c/opendev/system-config/+/94273020:30
clarkbI believe it is safe to land the change updating job and project config for infra-prod-* jobs while the hourlies are running because zuul will use the config compiled when enqueuing the buildset20:58
clarkbanyway that's me double-checking we don't have to wait to approve the infra-prod refactor change until after hourlies, and I don't think we do20:58
clarkbdo we still want to approve it nowish?20:59
fungiyeah, i'm good going ahead with it20:59
clarkback21:00
clarkbthe 2200 hourly runs are the ones we will want to pay attention to21:01
clarkbactually I may have just found a bug looking at the current hourlies21:02
clarkbservice-bridge and bootstrap-bridge are running concurrently21:02
clarkbservice-bridge should depend on bootstrap-bridge to prevent that. I think the change as written is fine because of the semaphore but we should probably fix everything all at once?21:02
clarkboh the reason this happens is that dependencies are not transitive21:03
clarkbservice-bridge depends on infra-prod-base. We don't run infra-prod-base in hourlies so maybe this is ok as is21:03
clarkbrely on the semaphore to prevent concurrent runs like with everything else running for now21:03
clarkbya I think I'm ok with this as is after double checking that21:04
clarkbfungi: if you agree do you want to approve it or should I?21:04
fungii approved it just now21:05
clarkbI put some notes at the end of the todo list on the meetup etherpad pointing at the followup change, plus a note about non-transitive dependencies implying we may want to explicitly depend on both bootstrap-bridge and infra-prod-base21:07
clarkbthey are sort of two halves of the same bootstrapping. One bootstraps ansible, the other bootstraps the node (users, firewall etc)21:08
clarkbbut I don't think any of that is a concern in the parent change due to the use of the semaphore21:08
clarkbits just when we refactor to use less semaphore control that we need to be careful21:08
opendevreviewMerged opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos  https://review.opendev.org/c/opendev/system-config/+/94230721:09
clarkbat 2200 we will want to check that bootstrap-bridge successfully copies over the system-config state (which should be a noop unless we merge another change after the hourlies, because the hourlies will sync over the current state as they are periodic) and that bootstrap-bridge and service-bridge run one after another and not concurrently21:10
clarkbcorvus: ianw: looking at https://review.opendev.org/c/opendev/system-config/+/942439 more closely I think we may want/need to do two additional things. First we need to make infra-prod-playbook depend on infra-prod-bootstrap-bridge with a hard dependency (and drop the files: matcher in bootstrap-bridge so that it always runs). This ensures that we always update system-config before21:18
clarkbrunning, and the hard dependency is what allows the child jobs to trigger after the job pauses, right? Then we might want a second semaphore that infra-prod-playbook holds to limit the total number of those jobs that can run on bridge at the same time (say 5 or 10). Does that make sense to you?21:18
clarkbI guess depending on how inheritance works for job dependencies we may need to make every job have a hard dependency on bootstrap-bridge21:20
clarkbI'm not sure and would have to go check that behavior21:20
clarkbanyway let me know what you think and I can start on those edits in that change to keep things moving forward21:21
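A sketch of roughly what that proposal would look like in Zuul job configuration; the semaphore name and max value are placeholders, and soft: false expresses the hard dependency:

    # limit how many infra-prod playbook jobs run against bridge at once
    - semaphore:
        name: infra-prod-playbook-limit
        max: 5

    - job:
        name: infra-prod-playbook
        dependencies:
          # hard dependency (soft: false) on the bootstrap job
          - name: infra-prod-bootstrap-bridge
            soft: false
        semaphores:
          - name: infra-prod-playbook-limit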
corvusclarkb: i think your original thought is right, no need to duplicate the dep; just inherit it from playbook21:26
clarkbcorvus: for the multiple semaphore thing do you know if it is possible for jobs to grab semaphores before their dependencies have completed?21:31
clarkbcorvus: I'm wondering if I can deadlock things with two semaphores and I think it comes down to when zuul jobs acquire the locks21:31
corvusclarkb: by default semaphores are acquired before the node request is submitted; https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.semaphores.resources-first switches it to after -- but i'm not sure that matters here.  who's the other actor for the potential deadlock?21:33
corvusi think it would be a periodic pipeline running the same job structure right, so also trying to acquire the "outer" semaphore?21:34
clarkbcorvus: yes21:34
corvusif i'm following correctly, i think that means the "inner" semaphores are all protected by the outer one, so we shouldn't deadlock.21:34
clarkback thanks for confirming21:34
corvusbut if we have playbook jobs running without the bootstrap job, that could become a problem21:34
corvusso that would just need to be a thing we write notes about and remember not to approve changes that would do that.  :)21:35
clarkbgot it21:35
corvus(but if the playbook job has a hard dependency on the bootstrap job, it shouldn't be possible for us to make that mistake anyway)21:36
corvus(if we tried to do that, then we'd get a job graph freeze error)21:36
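For reference, the acquisition-order switch corvus links above is set per semaphore on the job; a minimal sketch with placeholder names:

    - job:
        name: example-deploy
        semaphores:
          - name: outer-deploy-lock
            # acquire the semaphore only after node resources are allocated
            resources-first: true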
opendevreviewClark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243921:43
clarkbI think ^ that latest patchset covers these improvements21:44
clarkbI'm half tempted for 942439's deployment to remove a bunch of the infra-prod-* jobs from pipelines to reduce the potential blast radius. That limits what we can deploy until we're confident in it21:59
clarkbbut maybe that is a good compromise between lack of testing and sending it22:00
clarkbalso hourlies should be enqueuing now22:00
clarkblooks like they haven't yet. There is a small amount of variance in the pipeline config iirc22:00
clarkb2 minute jitter22:00
clarkbwe are enqueued22:01
clarkbonly bootstrap-bridge appears to be running as expected so we've excluded service-bridge from overlapping (we want that)22:02
clarkbthe job succeeded and now service-bridge is running22:03
corvusclarkb: super nit on a comment (but i thought it might confuse us later so prob worth fixing)22:04
clarkbcorvus: ack22:05
corvus(i don't think the comment is technically incorrect, just could be interpreted multiple ways)22:05
clarkbfwiw I don't think the first change has created any problems but I'm having a hard time confirming we actually set up the system-config sources in the way we anticipated22:05
corvuswhat do you mean?22:06
clarkbhttps://zuul.opendev.org/t/openstack/build/8b3a3efe585b4ab1b8f9fc79891e0392/log/job-output.txt#63-84 this shows logs for setting up keys but not for the source22:06
clarkbthe actual source code on bridge looked fine fwiw. But the earlier hourlies would've updated to the latest state. So I'm just having a hard time confirming we actually synchronized things based on the log22:06
clarkbthe console tab similarly lacks info22:06
clarkbI think we actually did noop and it is because we have things in separate plays22:07
clarkbso the add_host magic isn't applying22:07
corvusyeah, i don't expect that source setup task ran22:08
clarkbyup I'm like 99% certain that is what happened looking at opendev/base-jobs22:08
clarkbso we're in the same boat as before where we don't actually update system-config yet. But I can push a change up to fix that22:08
opendevreviewClark Boylan proposed opendev/base-jobs master: Fix dynamic inventory in -src setup job  https://review.opendev.org/c/opendev/base-jobs/+/94274022:14
clarkbI think ^ is the fix for that. Note that we should give that change the same level of care in review and the same thought-out merge window22:14
clarkbsince this is what now effectively attempts to make the previously planned change22:15
opendevreviewClark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243922:17
clarkbit is worth noting that I think any inventory update will fail on its initial pass now as we have stricter locking across jobs and we aren't setting up system-config as expected yet22:18
clarkbbut that is a fail-safe behavior. Otherwise it's just annoying :)22:18
corvuswhat was the purpose of using two individual playbooks instead of pre.yaml?22:19
corvus(that is, what was the motivation for that in the recent change)22:19
corvusregardless of the answer, i think we should +w https://review.opendev.org/942740 now22:20
clarkbcorvus: this was all part of ianw's original attempt at refactoring this. I think he liked the clarity of organization that provided. The job shows it explicitly does this playbook then that playbook rather than needing to open the pre.yaml playbook to see that22:20
corvusi +2d it but leave it to you clarkb to decide if it needs more review22:20
clarkbfungi: any chance you want to review 942740 before the 2300 UTC hourlies so that we can check again then?22:21
corvuswhy were there 2 playbooks to begin with?22:21
corvusbecause that used to be in 2 jobs?22:21
clarkbcorvus: because after the refactor only this very base job needs to update system-config on bridge. Everything else only needs to update keys and inventory22:21
corvusoh22:22
corvusso we're going to call keys from playbook and both from bootstrap22:22
clarkbother way around. Pre refactor we have a single base job that was using pre.yaml. After the refactor we have a base job for setting up keys and source on bridge that will only run in one place. Then we also have a base job for everything else that only needs the keys to be set up22:22
clarkbya22:22
clarkband since that refactor never completed we just never got far enough along to realize having two playbooks like that doesn't work unless you explicitly set up the inventory twice22:23
clarkbI think setting it up twice is likely to be more error prone so I switched back to using pre.yaml for the singular thing that needs to update sources22:23
corvusokay.  we typically use the pattern of playbooks listing roles for that.  doing that here might be a good idea.  that would get the clarity we're seeking while keeping the playbooks simple to understand.22:23
clarkbya I guess we could refactor this more into a role to update inventory, a role to update keys, and a role to update system-config. Then have one playbook that does all three and one playbook that only does update inventory and keys22:24
corvus++ but not urgent22:25
clarkbin this situation I'm still trying to keep the diffs as small as possible to get working known_hosts values out the other side. Happy to incorporate more refactoring as part of the larger make things nicer and parallel effort22:25
clarkbhttps://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml basically you can see this just includes the two playbooks we were running before22:26
clarkbbut instead of running them as separate processes they are in one process so the dynamic inventory only needs to be updated once22:26
clarkbhttps://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-keys.yaml and https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-src.yaml are what we then run22:27
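One way to express that single-process arrangement, as a sketch (the actual pre.yaml may inline the plays rather than importing them, but the effect is the same single ansible-playbook run):

    # pre.yaml sketch: run both setup playbooks in one ansible-playbook
    # invocation so add_host changes persist into the later plays
    - import_playbook: setup-keys.yaml
    - import_playbook: setup-src.yaml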
clarkbthe other infra-prod-playbook jobs were inheriting from the job that runs pre.yaml so this should already be well exercised and just work22:27
clarkbnow that I've said that I'm less sure :) but that was part of my reasoning for using this approach22:27
corvusseems plausible to me :)22:33
clarkbjobs only take a handful of minutes but do you think I should go ahead and approve it without fungi's review to ensure we exercise it during the 2300 UTC hourly runs?22:34
corvusyes, i think so; i did a more-than-surface review, and i think it's a good time to inspect and verify the solution before the problem starts to impact other work.22:38
clarkbok I'll approve it now22:39
fungii can take a look in a moment too22:41
opendevreviewMerged opendev/base-jobs master: Fix dynamic inventory in -src setup job  https://review.opendev.org/c/opendev/base-jobs/+/94274022:47
clarkbthanks! let's see if we get better results on the next pass22:48
clarkbshould enqueue in the next few minutes22:59
ianwyeah my intent was to have a setup-src which sets up the source, and a setup-keys that sets up the keys, and pre.yaml should have been removed because jobs would only need to setup keys.  which we sort of got to but then some reverts happened and TBH i never got back to it23:00
clarkbwe're back to using pre.yaml more but I think we can clean it up as part of the larger refactor later23:00
clarkbif we put things in roles then we just have a playbook that does source setup including both roles and one that does keys only using only one role23:01
clarkbthen it's one playbook in both cases with dynamic inventory all happy23:01
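A sketch of that role-based split, with hypothetical role names and illustrative host targeting:

    # playbook for the bootstrap job: full setup including system-config source
    - hosts: localhost
      roles:
        - update-inventory
        - update-keys
        - update-system-config

    # playbook for every other infra-prod job: inventory and keys only
    - hosts: localhost
      roles:
        - update-inventory
        - update-keys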
clarkbbased on the streaming log we updated the system-config repo this time around23:02
ianw++ agree on that23:02
clarkbthe git log lgtm on bridge for that repo23:03
clarkband looking at https://zuul.opendev.org/t/openstack/build/0301b592d20c47679ae911327dcd8b3c/ I don't see anything amiss23:04
clarkbI think I'm happy with that23:05
corvuslgtm23:06
clarkbsweet we should have known_hosts that updates from an accurate inventory now23:07
clarkbthen we can followup on further refactors as more people review the existing change23:07
clarkbcurious what others think of temporarily reducing the total number of infra-prod jobs to reduce blast radius23:07
clarkbwe probably still need infra-prod-base to run but then can maybe limit to paste/etherpad/tracing things that are disconnected from gerrit/zuul if we like23:08
clarkbor maybe that seems like too much effort since we're already limiting things to running one at a time to start23:08
clarkbI'm going to work on deleting tracing01 now23:09
corvusi lean towards "not necessary" because we're not actually changing what these jobs do... i guess the worry is that somehow we end up running the "zuul" playbook on the "gerrit" host or something?  that seems pretty unlikely, but i guess it's relatively cheap insurance.  i guess i feel it's not more unlikely than messing that up with any other change.  :)23:11
ianwclarkb: probably with the extra semaphore (excellent idea) hopefully that would stop a massive mess 23:11
clarkbcorvus: I think it's either running in the wrong place, which seems highly unlikely, or running a playbook with old/stale/improper system-config state23:11
clarkband ya I think the risk of anything catastrophic is low. The system-config state being wrong would most likely occur if we allow two jobs to update system-config and end up in some intermediate state?23:12
clarkbbut even that seems unlikely given git is careful about managing state23:12
corvusi think stale system-config state is an easily recoverable error; so even though it's a more likely error, i don't worry too much about it and don't think it warrants that extra caution.23:12
clarkb#status log Deleted tracing01.opendev.org (ae3bb36a-af9e-40e0-8cc4-2bf4890ec5d4) as it has been replaced with tracing02.opendev.org23:12
corvusi agree with ianw and your comment (clarkb) that starting with a low parallel mutex reduces risk too23:12
opendevstatusclarkb: finished logging23:12
clarkback in that case I think the change as proposed is probably ready for scrutiny23:13
ianwcool.  it in general looks like what i think it should look like.  the idea for me has been that in a DR scenario for bridge, we can bring up the node, ensure secrets are in place (from our distributed backup :) and zuul can log into it, and that should be it.  that was why i was keen on the "bootstrap" job that gets bridge into an "able to run ansible" state23:18
clarkbI think the distributed backup is still a wip too :/23:20
clarkbbut one step at a time23:20
clarkbcan't boil the ocean all at once23:20
ianwheh, yeah that's OK.  i just remember replacing bridge last time being a very scary notion.  but that's how a lot of the "abstract and bootstrap bridge" work kind of got tangled up with "run jobs in parallel".  the bootstrap makes for a good synchronization point for system-config, which is how they get mixed in with each other23:23
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories  https://review.opendev.org/c/opendev/system-config/+/86578423:30
opendevreviewIan Wienand proposed opendev/system-config master: tools/make-backup-key.sh  https://review.opendev.org/c/opendev/system-config/+/86643023:30
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: add some extraction instructions  https://review.opendev.org/c/opendev/system-config/+/87558723:30
