clarkb | this is neat. I'm guessing people that live in different parts of the country have more experience with this but the NWS just put out a severe thunderstorm warning for the next half hour or so based on current radar indicated conditions. The area they warned is super specific and went straight to my phone | 00:09 |
---|---|---|
ianw | heh i get a text message from my car insurance company warning me there's a storm coming and to move the car inside | 00:23 |
ianw | jokes on them because it's a 20 year old corolla that won't die. it laughs in the face of hail :) | 00:25 |
clarkb | it seems to be over for now and no destructive hail thankfully | 00:26 |
ianw | \o/ another 2025 weather even survived. on to the next one ... :) | 00:28 |
ianw | s/even/event/ | 00:28 |
ianw | clarkb: on that parallel stuff ... i just can't think of anything to test it other than trying it. i'm not sure i'm really confident i have the time to monitor and revert potential issue-causing production changes though | 00:30 |
clarkb | ianw: ya I think even for me initial change of copying system-config testing is tough. I think you are right that we pick a time when making changes to those foundational bits and monitor closely and jump in if necessary. And totally understood that you don't have time to drive that | 00:31 |
clarkb | I'm trying to get server replacements in a spot where they are rolling forward at a reasonable rate then I should be able to take a look at it if no one else beats me to it | 00:31 |
ianw | ++ i'm mostly around :) let's sync one day when you feel you've got some bandwidth too and maybe we can try it | 00:34 |
mnaser | clarkb: cool thank you, fwiw there is valkey as a redis alternative which is a fork that lives under the LF | 00:51 |
frickler | infra-root: I'm wondering why zuul didn't pick up https://review.opendev.org/c/openstack/requirements/+/925057 when I approved it yesterday. the dependency merged like 10 minutes before that, is there some larger delay to be expected? reworkflowing now seems to have resolved the issue | 07:37 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 10:56 |
fungi | frickler: hard to say without checking debug logs on the schedulers for how they handled the comment-added event at 18:03:56 utc | 13:11 |
fungi | or whether the event ever got to zuul | 13:11 |
slittle_ | please add me as first core for group starlingx-app-distributed-cloud-core | 14:24 |
slittle_ | thanks | 14:31 |
fungi | slittle_: https://review.opendev.org/admin/groups/starlingx-app-distributed-cloud-core,members already has tons of members, including you. are you thinking of a different group maybe? | 14:33 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter https://review.opendev.org/c/opendev/system-config/+/942650 | 15:09 |
clarkb | I'm trying to confirm ^ is actually using memcached and I think it isn't yet for some reason (in particular locally testing with netcat I get client connection log messages with -vv set and I don't get those messages in logs for 942650) | 16:49 |
clarkb | ok I think I see it. Sometimes you just need to talk about it | 16:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter https://review.opendev.org/c/opendev/system-config/+/942650 | 16:53 |
clarkb | debugging that would've been easier if gitea logged the invalid cache adapter name somewhere but it doesn't appear to do so | 16:57 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Install ssl-cert-check from distro package not Git https://review.opendev.org/c/opendev/system-config/+/939187 | 18:17 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add certcheck functionality to bridge servers https://review.opendev.org/c/opendev/system-config/+/942719 | 18:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter https://review.opendev.org/c/opendev/system-config/+/942650 | 18:35 |
clarkb | I think that may be ready for review now (depends on whether or not that new test case functions I guess) | 18:35 |
fungi | https://github.com/aio-libs/aiosmtpd/issues/278 talks about the logging module detecting the new file and switching correctly, but mailman copies log streams from other modules to its own logs i guess? | 20:02 |
clarkb | and then refers back to that issue you posted initially which is still open as the spot that should track fixing it | 20:02 |
clarkb | so presumably this is still a problem | 20:02 |
clarkb | but ya it seems like python logging loses track of where logs are actually going to due to mailman decisions | 20:03 |
fungi | https://docs.mailman3.org/en/latest/faq.html#how-to-enable-debug-logging-in-mailman-core indicates you can do logging configuration in mailman.cfg | 20:03 |
fungi | https://github.com/maxking/docker-mailman/issues/535 has a suggested logrotate config | 20:09 |
fungi | relying on `docker-compose restart` | 20:10 |
fungi | it's some kind of awesome that they have a github action that just closes open feature requests as "not planned" if nobody comments on them | 20:12 |
clarkb | I forgot to mention that I have a dentist appointment tomorrow at 10am local | 20:26 |
fungi | i'll be around | 20:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: Mirror memcached:latest from docker hub to quay.io/opendevmirror https://review.opendev.org/c/opendev/system-config/+/942730 | 20:30 |
clarkb | I believe it is safe to land the change updating job and project config for infra-prod-* jobs while the hourlies are running because zuul will use the config compiled when enqueuing the buildset | 20:58 |
clarkb | anyway that's me double checking we don't have to wait to approve the infra-prod refactor change until after hourlies and I don't think we do | 20:58 |
clarkb | do we still want to approve it nowish? | 20:59 |
fungi | yeah, i'm good going ahead with it | 20:59 |
clarkb | ack | 21:00 |
clarkb | the 2200 hourly runs are the ones we will want to pay attention to | 21:01 |
clarkb | actually I may have just found a bug looking at the current hourlies | 21:02 |
clarkb | service-bridge and bootstrap-bridge are running concurrently | 21:02 |
clarkb | service-bridge should depend on bootstrap-bridge to prevent that. I think the change as written is fine because of the semaphore but we should probably fix everything all at once? | 21:02 |
clarkb | oh the reason this happens is that dependencies are not transitive | 21:03 |
clarkb | service-bridge depends on infra-prod-base. We don't run infra-prod-base in hourlies so maybe this is ok as is | 21:03 |
clarkb | rely on the semaphore to prevent concurrent runs like with everything else running for now | 21:03 |
clarkb | ya I think I'm ok with this as is after double checking that | 21:04 |
clarkb | fungi: if you agree do you want to approve it or should I? | 21:04 |
fungi | i approved it just now | 21:05 |
clarkb | I put some notes at the end of the todo list on the meetup etherpad pointing at the followup change and a note about non transitive dependencies implying we may want to explicitly depend on both bootstrap-bridge and infra-prod-base | 21:07 |
clarkb | they are sort of two halves of the same bootstrapping. One bootstraps ansible the other bootstraps the node (users, firewall etc) | 21:08 |
clarkb | but I don't think any of that is a concern in the parent change due to the use of the semaphore | 21:08 |
clarkb | it's just when we refactor to use less semaphore control that we need to be careful | 21:08 |
opendevreview | Merged opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos https://review.opendev.org/c/opendev/system-config/+/942307 | 21:09 |
clarkb | at 2200 we will want to check that bootstrap-bridge successfully copies over the system-config state (which should be a noop unless we merge another change after the hourlies because the hourlies will sync over the current state as they are periodic) and that bootstrap-bridge and service-bridge run one after another and not concurrently | 21:10 |
clarkb | corvus: ianw: looking at https://review.opendev.org/c/opendev/system-config/+/942439 more closely I think we may want/need to do two additional things. First we need to make infra-prod-playbook depend on infra-prod-bootstrap-bridge with a hard dependency (and drop the files: matcher in bootstrap-bridge so that it always runs). This ensures that we always update system-config before | 21:18 |
clarkb | running and the hard dependency is what allows the child jobs to trigger after the job pauses right? Then we might want a second semaphore that infra-prod-playbook holds to limit the total number of those jobs that can run on bridge at the same time (say 5 or 10). Does that make sense to you? | 21:18 |
clarkb | I guess depending on how inheritance works for job dependencies we may need to make every job have a hard dependency on bootstrap-bridge | 21:20 |
clarkb | I'm not sure and would have to go check that behavior | 21:20 |
clarkb | anyway let me know what you think and I can start on those edits in that change to keep things moving forward | 21:21 |
corvus | clarkb: i think your original thought is right, no need to duplicate the dep; just inherit it from playbook | 21:26 |
clarkb | corvus: for the multiple semaphore thing do you know if it is possible for jobs to grab semaphores before their dependencies have completed? | 21:31 |
clarkb | corvus: I'm wondering if I can deadlock things with two semaphores and I think it comes down to when zuul jobs acquire the locks | 21:31 |
corvus | clarkb: by default semaphores are acquired before the node request is submitted; https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.semaphores.resources-first switches it to after -- but i'm not sure that matters here. who's the other actor for the potential deadlock? | 21:33 |
corvus | i think it would be a periodic pipeline running the same job structure right, so also trying to acquire the "outer" semaphore? | 21:34 |
clarkb | corvus: yes | 21:34 |
corvus | if i'm following correctly, i think that means the "inner" semaphores are all protected by the outer one, so we shouldn't deadlock. | 21:34 |
clarkb | ack thanks for confirming | 21:34 |
corvus | but if we have playbook jobs running without the bootstrap job, that could become a problem | 21:34 |
corvus | so that would just need to be a thing we write notes about and remember not to approve changes that would do that. :) | 21:35 |
clarkb | got it | 21:35 |
corvus | (but if the playbook job has a hard dependency on the bootstrap job, it shouldn't be possible for us to make that mistake anyway) | 21:36 |
corvus | (if we tried to do that, then we'd get a job graph freeze error) | 21:36 |
opendevreview | Clark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job https://review.opendev.org/c/opendev/system-config/+/942439 | 21:43 |
clarkb | I think ^ that latest patchset covers these improvements | 21:44 |
clarkb | I'm half tempted for 942439's deployment to remove a bunch of the infra-prod-* jobs from pipelines to reduce the potential blast radius. That limits what we can deploy until we're confident in it | 21:59 |
clarkb | but maybe that is a good compromise between lack of testing and sending it | 22:00 |
clarkb | also hourlies should be enqueuing now | 22:00 |
clarkb | looks like they haven't yet. There is a small amount of variance in the pipeline config iirc | 22:00 |
clarkb | 2 minute jitter | 22:00 |
clarkb | we are enqueued | 22:01 |
clarkb | only bootstrap-bridge appears to be running as expected so we've excluded service-bridge from overlapping (we want that) | 22:02 |
clarkb | the job succeeded and now service-bridge is running | 22:03 |
corvus | clarkb: super nit on a comment (but i thought it might confuse us later so prob worth fixing) | 22:04 |
clarkb | corvus: ack | 22:05 |
corvus | (i don't think the comment is technically incorrect, just could be interpreted multiple ways) | 22:05 |
clarkb | fwiw I don't think the first change has created any problems but I'm having a hard time confirming we actually set up the system-config sources in the way we anticipated | 22:05 |
corvus | what do you mean? | 22:06 |
clarkb | https://zuul.opendev.org/t/openstack/build/8b3a3efe585b4ab1b8f9fc79891e0392/log/job-output.txt#63-84 this shows logs for setting up keys but not for the source | 22:06 |
clarkb | the actual source code on bridge looked fine fwiw. But the earlier hourlies would've updated to the latest state. So I'm just having a hard time confirming we actually synchronized things based on the log | 22:06 |
clarkb | the console tab similarly lacks info | 22:06 |
clarkb | I think we actually did noop and it is because we have things in separate plays | 22:07 |
clarkb | so the add host magic isn't applying | 22:07 |
corvus | yeah, i don't expect that source setup task ran | 22:08 |
clarkb | yup I'm like 99% certain that is what happened looking at opendev/base-jobs | 22:08 |
clarkb | so we're in the same boat as before where we don't actually update system-config yet. But I can push a change up to fix that | 22:08 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Fix dynamic inventory in -src setup job https://review.opendev.org/c/opendev/base-jobs/+/942740 | 22:14 |
clarkb | I think ^ is the fix for that. Note that we should give that change the same level of care in reviewing and a thought out merge window | 22:14 |
clarkb | since this is what now effectively attempts to make the previously planned change | 22:15 |
opendevreview | Clark Boylan proposed opendev/system-config master: Bootstrap-bridge as top-level job https://review.opendev.org/c/opendev/system-config/+/942439 | 22:17 |
clarkb | it is worth noting that I think any inventory update will fail on its initial pass now as we have stricter locking across jobs and we aren't setting up system-config as expected yet | 22:18 |
clarkb | but that is a fail safe behavior. Otherwise it's just annoying :) | 22:18 |
corvus | what was the purpose of using two individual playbooks instead of pre.yaml? | 22:19 |
corvus | (that is, what was the motivation for that in the recent change) | 22:19 |
corvus | regardless of the answer, i think we should +w https://review.opendev.org/942740 now | 22:20 |
clarkb | corvus: this was all part of ianw's original attempt at refactoring this. I think he liked the clarity of organization that provided. The job shows it explicitly does this playbook then that playbook rather than needing to open the pre.yaml playbook to see that | 22:20 |
corvus | i +2d it but leave it to you clarkb to decide if it needs more review | 22:20 |
clarkb | fungi: any chance you want to review 942740 before the 2300 UTC hourlies so that we can check again then? | 22:21 |
corvus | why were there 2 playbooks to begin with? | 22:21 |
corvus | because that used to be in 2 jobs? | 22:21 |
clarkb | corvus: because after the refactor only this very base job needs to update system-config on bridge. Everything else only needs to update keys and inventory | 22:21 |
corvus | oh | 22:22 |
corvus | so we're going to call keys from playbook and both from bootstrap | 22:22 |
clarkb | other way around. Pre refactor we have a single base job that was using pre.yaml. After the refactor we have a base job for setting up keys and source on bridge that will only run in one place. Then we also have a base job for everything else that only needs the keys to be set up | 22:22 |
clarkb | ya | 22:22 |
clarkb | and since that refactor never completed we just never got far enough along to realize having two playbooks like that doesn't work unless you explicitly set up the inventory twice | 22:23 |
clarkb | I think setting it up twice is likely to be more error prone so I switched back to using pre.yaml for the singular thing that needs to update sources | 22:23 |
corvus | okay. we typically use the pattern of playbooks listing roles for that. doing that here might be a good idea. that would get the clarity we're seeking while keeping the playbooks simple to understand. | 22:23 |
clarkb | ya I guess we could refactor this more into a role to update inventory, a role to update keys, and a role to update system-config. Then have one playbook that does all three and one playbook that only does update inventory and keys | 22:24 |
corvus | ++ but not urgent | 22:25 |
clarkb | in this situation I'm still trying to keep the diffs as small as possible to get working known_hosts values out the other side. Happy to incorporate more refactoring as part of the larger make things nicer and parallel effort | 22:25 |
clarkb | https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml basically you can see this just includes the two playbooks we were running before | 22:26 |
clarkb | but instead of running them as separate processes they are in one process so the dynamic inventory only needs to be updated once | 22:26 |
clarkb | https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-keys.yaml and https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/setup-src.yaml are what we then run | 22:27 |
clarkb | the other infra-prod-playbook jobs were inheriting from the job that runs pre.yaml so this should already be well exercised and just work | 22:27 |
clarkb | now that I've said that I'm less sure :) but that was part of my reasoning for using this approach | 22:27 |
corvus | seems plausible to me :) | 22:33 |
clarkb | jobs only take a handful of minutes but do you think I should go ahead and approve it without fungi's review to ensure we exercise it during the 2300 UTC hourly runs? | 22:34 |
corvus | yes, i think so; i did a more-than-surface review, and i think it's a good time to inspect and verify the solution before the problem starts to impact other work. | 22:38 |
clarkb | ok I'll approve it now | 22:39 |
fungi | i can take a look in a moment too | 22:41 |
opendevreview | Merged opendev/base-jobs master: Fix dynamic inventory in -src setup job https://review.opendev.org/c/opendev/base-jobs/+/942740 | 22:47 |
clarkb | thanks! let's see if we get better results on the next pass | 22:48 |
clarkb | should enqueue in the next few minutes | 22:59 |
ianw | yeah my intent was to have a setup-src which sets up the source, and a setup-keys that sets up the keys, and pre.yaml should have been removed because jobs would only need to setup keys. which we sort of got to but then some reverts happened and TBH i never got back to it | 23:00 |
clarkb | we're back to using pre.yaml more but I think we can clean it up as part of the larger refactor later | 23:00 |
clarkb | if we put things in roles then we just have a playbook that does source setup including both roles and one that does keys only using only one role | 23:01 |
clarkb | then it's one playbook in both cases with dynamic inventory all happy | 23:01 |
clarkb | based on the streaming log we updated the system-config repo this time around | 23:02 |
ianw | ++ agree on that | 23:02 |
clarkb | the git log lgtm on bridge for that repo | 23:03 |
clarkb | and looking at https://zuul.opendev.org/t/openstack/build/0301b592d20c47679ae911327dcd8b3c/ I don't see anything amiss | 23:04 |
clarkb | I think I'm happy with that | 23:05 |
corvus | lgtm | 23:06 |
clarkb | sweet we should have known_hosts that updates from an accurate inventory now | 23:07 |
clarkb | then we can followup on further refactors as more people review the existing change | 23:07 |
clarkb | curious what others think of temporarily reducing the total number of infra-prod jobs to reduce blast radius | 23:07 |
clarkb | we probably still need infra-prod-base to run but then can maybe limit to paste/etherpad/tracing/ things that are disconnected from gerrit/zuul if we like | 23:08 |
clarkb | or maybe that seems like too much effort since we're already limiting things to running one at a time to start | 23:08 |
clarkb | I'm going to work on deleting tracing01 now | 23:09 |
corvus | i lean towards "not necessary" because we're not actually changing what these jobs do... i guess the worry is that somehow we end up running the "zuul" playbook on the "gerrit" host or something? that seems pretty unlikely, but i guess it's relatively cheap insurance. i guess i feel it's not more unlikely than messing that up with any other change. :) | 23:11 |
ianw | clarkb: probably with the extra semaphore (excellent idea) hopefully that would stop a massive mess | 23:11 |
clarkb | corvus: I think either run in the wrong place which seems highly unlikely or run playbook with old/stale/improper system-config state | 23:11 |
clarkb | and ya I think the risk of anything catastrophic is low. The system-config state being wrong would most likely occur if we allow two jobs to update system-config and end up in some intermediate state? | 23:12 |
clarkb | but even that seems unlikely given git is careful about managing state | 23:12 |
corvus | i think stale system-config state is an easily recoverable error; so even though it's a more likely error, i don't worry too much about it and don't think it warrants that extra caution. | 23:12 |
clarkb | #status log Deleted tracing01.opendev.org (ae3bb36a-af9e-40e0-8cc4-2bf4890ec5d4) as it has been replaced with tracing02.opendev.org | 23:12 |
corvus | i agree with ianw and your comment (clarkb) that starting with a low parallel mutex reduces risk too | 23:12 |
opendevstatus | clarkb: finished logging | 23:12 |
clarkb | ack in that case I think the change as proposed is probably ready for scrutiny | 23:13 |
ianw | cool. it in general looks like what i think it should look like. the idea for me has been that in a DR scenario for bridge, we can bring up the node, ensure secrets are in place (from our distributed backup :) and zuul can log into it, and that should be it. that was why i was keen on the "bootstrap" job that gets bridge into an "able to run ansible" state | 23:18 |
clarkb | I think the distributed backup is still a wip too :/ | 23:20 |
clarkb | but one step at a time | 23:20 |
clarkb | can't boil the ocean all at once | 23:20 |
ianw | heh, yeah that's OK. i just remember replacing bridge last time being a very scary notion. but that's how a lot of the "abstract and bootstrap bridge" work kind of got tangled up with "run jobs in parallel". the bootstrap makes for a good synchronization point for system-config, which is how they get mixed in with each other | 23:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784 | 23:30 |
opendevreview | Ian Wienand proposed opendev/system-config master: tools/make-backup-key.sh https://review.opendev.org/c/opendev/system-config/+/866430 | 23:30 |
opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: add some extraction instructions https://review.opendev.org/c/opendev/system-config/+/875587 | 23:30 |
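
The 21:02 to 21:07 exchange about non-transitive dependencies can be illustrated with a small toy model. This is only a sketch and not Zuul's implementation: the `Job` class, `freeze`, and `runs_after` are hypothetical names made up here, and the edge from infra-prod-base to bootstrap-bridge (and the choice of soft dependencies) is an assumption inferred from the discussion. It models the behaviour described in the log: a soft dependency on a job that is not part of the run is dropped without inheriting that job's own dependencies, while a missing hard dependency fails the graph freeze.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Maps the name of each *direct* dependency to True when it is soft.
    deps: dict = field(default_factory=dict)


class FreezeError(Exception):
    """Raised when a hard dependency is missing from the run."""


def freeze(jobs, selected):
    """Return {job name: set of direct dependencies} for the selected jobs."""
    graph = {}
    for name in selected:
        direct = set()
        for dep, soft in jobs[name].deps.items():
            if dep in selected:
                direct.add(dep)
            elif not soft:
                raise FreezeError(f"{name} requires {dep}, which is not in the run")
            # A missing soft dependency is dropped here. Nothing that the
            # missing job depended on is inherited: not transitive.
        graph[name] = direct
    return graph


def runs_after(graph, job, other):
    """True if `job` is ordered after `other` in the frozen graph."""
    stack, seen = list(graph[job]), set()
    while stack:
        current = stack.pop()
        if current == other:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return False


# Assumed job layout, loosely based on the discussion above.
JOBS = {
    "bootstrap-bridge": Job("bootstrap-bridge"),
    "infra-prod-base": Job("infra-prod-base", {"bootstrap-bridge": True}),
    "service-bridge": Job("service-bridge", {"infra-prod-base": True}),
}
# The hourly run does not include infra-prod-base.
HOURLY = {"bootstrap-bridge", "service-bridge"}

if __name__ == "__main__":
    hourly = freeze(JOBS, HOURLY)
    print("hourly ordered:", runs_after(hourly, "service-bridge", "bootstrap-bridge"))
    full = freeze(JOBS, set(JOBS))
    print("full ordered:", runs_after(full, "service-bridge", "bootstrap-bridge"))
```

In this model the hourly run leaves service-bridge with no ordering relative to bootstrap-bridge, so only the semaphore keeps them from overlapping. Listing both bootstrap-bridge and infra-prod-base as explicit dependencies of service-bridge restores the ordering, which matches the note left on the meetup etherpad.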