*** ralonsoh_ is now known as ralonsoh | 07:04 | |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 07:52 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:04 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:14 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:22 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:33 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:37 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:44 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:01 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:06 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:15 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:22 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:35 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:21 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:24 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:31 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 12:28 |
*** dhill is now known as Guest11027 | 13:02 | |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/943488 is ready for review at this point if we're generally happy with the new infra-prod deployment process. I did find one additional potential conflict on bridge that this change addresses via some new job dependencies. corvus you may be interested in that as it has to do with the opendev-ca used by zookeeper, | 14:50 |
clarkb | nodepool, zuul, and jaeger | 14:50 |
clarkb | fungi: did you see that python was tripping over a clang 19 bug? I don't know which compiler you use locally for your python builds but if you are using clang19 then you probably want to rebuild with mitigations in place: https://blog.nelhage.com/post/cpython-tail-call/ | 14:55 |
clarkb | it's just performance issues thankfully. Not a security or crash type bug | 14:55 |
fungi | good question... i have clang 16, 17, 18 and 19 installed... | 14:56 |
fungi | thanks for the heads up! | 14:56 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 14:57 |
clarkb | I've started thinking about the process for a gerrit server replacement and I think the main consideration is that replication cannot be enabled until after we're running on the new server (to avoid overwriting content on the giteas that is stale or even just wiping out content) | 15:06 |
clarkb | otherwise I think the process is very similar to our regular process. Boot new server, attach volume, enroll in inventory and deploy an "empty" gerrit. Check things look ok. Synchronize content over. Check again. Schedule outage for final synchronization and dns update then enable replication on the new server and disable it on the old server? | 15:07 |
clarkb | I want people to start noodling on that if possible so that we can begin the process soonish with a plan for cutting over soon after the openstack release | 15:08 |
fungi | the synchronize content step will need a (probably somewhat lengthy) outage, though maybe not too bad if we use a pre-warmed rsync and just finalize it offline | 15:12 |
fungi | could also do that step in parallel with dns change propagation, to shorten the outage further | 15:12 |
fungi | i assume we'll want the zuul schedulers offline during that time as well | 15:13 |
clarkb | ya I think we can prestage warm data to speed things up. I guess the database will want a backup restore rather than an rsync? | 15:14 |
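A minimal sketch of the pre-warmed rsync plus dump/restore approach being discussed, assuming illustrative hostnames (review-old/review-new), paths, and database name; the real Gerrit data layout would differ:

```bash
# Pre-warm while the old server still serves traffic; safe to repeat.
rsync -avH --delete review-old.opendev.org:/home/gerrit2/ /home/gerrit2/

# During the outage window: stop Gerrit on the old server, then run one
# final (much shorter) incremental pass to pick up the remaining delta.
rsync -avH --delete review-old.opendev.org:/home/gerrit2/ /home/gerrit2/

# Move the reviewed-flags database with a dump/restore rather than rsync.
ssh review-old.opendev.org "mysqldump --single-transaction reviewdb" | mysql reviewdb
```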
clarkb | the other option is to use the current prod volume and move it over | 15:15 |
fungi | true, the db is hopefully fairly minimal these days? | 15:15 |
clarkb | probably spin up the new server in that case with a temporary volume (to avoid polluting the root fs with gerrit stuff) | 15:15 |
clarkb | fungi: it's all the reviewed flags for each user on each file in every change they have looked at. I think it can get somewhat large but also that data isn't critical if we decide to drop it | 15:15 |
clarkb | if we did the volume swap instead of rsyncing we might do something like a volume clone and validate the data works with the deployment then during our actual cutover swap out the volumes for the actual prod one? | 15:16 |
fungi | we ought to temporarily disable the optimize-git-repos cronjob on the new server too | 15:16 |
clarkb | can cinder do a clone of an active volume? That might be a good option if we think it will be reliable (I know we've had slow attaches in the past etc) | 15:16 |
fungi | not sure | 15:17 |
fungi | if you think it would make more sense than rsync/mysqldump, i suppose we could try and see whether it's possible | 15:20 |
clarkb | Just thinking if it works it might be the shortest downtime option as it would essentially be umount, volume detach, volume attach, mount | 15:22 |
clarkb | but I seem to recall us trying to do similar things in the past and it not being very reliable (because detaches may not work and attaches may be slow and and and...) | 15:22 |
clarkb | rsync and mysqldump are probably the best choices as they are stable known quantities and while we may wait a bit longer we're less likely to have exceptional behavior | 15:23 |
clarkb | looks like the cinder api supports taking a snapshot of an attached volume: "Prior to API version 3.66, a ‘force’ flag was required to create a snapshot of an in-use volume, but this is no longer the case. From API version 3.66, the ‘force’ flag is invalid when passed in a volume snapshot request" | 15:25 |
clarkb | so in theory this is an option available to us. Make a snapshot of the prod volume for testing purposes and boot the new server with a new volume based on that snapshot attached | 15:25 |
clarkb | then when we do the cut over replace the temporary copy of the volume with the actual volume | 15:25 |
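Roughly what that snapshot-based test copy could look like with the OpenStack client, assuming volume API microversion 3.66 or later; the volume, snapshot, and server names are placeholders:

```bash
# Snapshot the in-use production volume (no --force needed at >= 3.66).
openstack --os-volume-api-version 3.66 volume snapshot create \
    --volume gerrit-prod-data gerrit-data-snap

# Build a throwaway volume from the snapshot and attach it to the new
# server for validation; the real cutover would swap in the prod volume.
openstack volume create --snapshot gerrit-data-snap gerrit-data-test
openstack server add volume review-new gerrit-data-test
```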
clarkb | but as mentioned this has a lot more unknown unknowns and requires a lot of different tools to interact with each other properly | 15:26 |
fungi | also rsync/mysqldump gives us the option to switch regions or providers, while a volume move would only work for adjacent server instances | 15:29 |
clarkb | yup | 15:30 |
clarkb | also I think we use lvm. That probably complicates the volume swap a bit (as we'd need to also refresh the lvm configuration as part of the mount process?) | 15:30 |
fungi | lvm should be fine as long as we deactivate the volume groups before the switch | 15:34 |
clarkb | on which side? Maybe both to clear out the config on the destination and to allow the source to become source on the destination? | 15:35 |
clarkb | anyway I think I'm leaning towards rsync and mysqldump simply because it gives us more control over the process and less leap of faith type actions | 15:36 |
fungi | vgchange -an on both servers before detaching the volumes from them, then vgchange -ay on the new server after attaching again | 15:39 |
fungi | and yeah, i'm not advocating for a volume swap, but that's what the process would look like. it's tractable | 15:40 |
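For reference, a sketch of that volume-swap sequence assuming an LVM volume group called "main" with a gerrit logical volume; service, device, and mount point names are illustrative:

```bash
# Old server: quiesce and release the volume.
systemctl stop gerrit            # or however the service is actually run
umount /home/gerrit2
vgchange -an main
# ...detach the cinder volume from the old server, attach it to the new one...

# New server: reactivate and mount.
vgchange -ay main
mount /dev/main/gerrit /home/gerrit2
```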
clarkb | the other two changes I've currently got up are https://review.opendev.org/c/opendev/system-config/+/940928 and child. Though those might make good followups to the parallel infra-prod execution as sanity checks if the hourlies look good first | 16:06 |
clarkb | but reviews now might be helpful | 16:06 |
fungi | both lgtm | 16:09 |
clarkb | should we make a bindep release? | 16:16 |
fungi | ah, yes probably... infra-root: last call for input on whether 938570 and 940711 make sense as a pattern to align our projects with broader python packaging community expectations | 16:20 |
fungi | if there's a general preference for keeping requirements.txt, test-requirements.txt and doc/requirements.txt files in our packages, or a desire to implement the latter two with "extras" in the pyproject.toml and reference those from our noxfile.py or something, then we can release bindep as-is | 16:22 |
fungi | the python packaging community argument against requirements.txt in particular (938570) is that we're marking our package dependencies as dynamically resolved at wheel build time, which means the wheel has to be downloaded and built in order to identify them instead of just checking published metadata | 16:24 |
clarkb | I think if I had a preference keeping the test requirements separate to make it easier to run without using nox seems like a good idea. But rolling true deps into the pyproject.toml to match upstream expectations makes sense to me | 16:24 |
fungi | note that for 940711 we could stick those into extras in the pyproject.toml and then list bindep[docs] or bindep[linters] or bindep[coverage] or bindep[unit] or whatever in noxfile.py | 16:25 |
fungi | that way people who want to install them some other way still can | 16:26 |
fungi | i can work up an alternative if people want to see it in action | 16:26 |
clarkb | those might become an attractive nuisance for people thinking if they install them they get extra features/functionality | 16:28 |
clarkb | (which is typically how extras are used aiui) | 16:28 |
corvus | clarkb: i think opendev-ca is idempotent -- here's what it would be doing for each server https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/opendev-ca.sh#L98 | 16:28 |
corvus | clarkb: at least, once it's established | 16:28 |
corvus | the first time it ever runs, it would need to create the ca cert, that could race | 16:28 |
clarkb | maybe I should update my todo to move opendev-ca calls into bootstrap-bridge then we can remove the serialization of those jobs? | 16:29 |
clarkb | maybe not move opendev-ca there but call it there first at least? | 16:29 |
corvus | yeah, if we call it once with, say, a dummy server name, that would ensure that the ca part is done once on bootstrap | 16:30 |
corvus | (we could also change the script to make it so it doesn't need a server name, but as of today, it expects one) | 16:30 |
clarkb | ok let me update the parallelization change to keep serializing those 4 jobs but I'll add a TODO that indicates we can stop serializing once we have bootstrap bridge calling opendev-ca | 16:30 |
corvus | hrm | 16:31 |
corvus | i'm trying to avoid serializing them | 16:31 |
clarkb | I guess today its safe to not serialize them because we have an existing bridge and ca | 16:31 |
corvus | can we just make that change now? | 16:31 |
corvus | right, the only issue would be in DR | 16:31 |
corvus | (or we decide to make a new ca) | 16:32 |
clarkb | maybe. The "problem" is bootstrap-bridge is actually doing minimal stuff on bridge and only running ansible from zuul not from bridge aiui | 16:32 |
clarkb | so I'm not sure how to port that into bootstrap-bridge yet | 16:32 |
corvus | oh, then... we could flock | 16:32 |
clarkb | oh ya that would work. I can write that edit | 16:33 |
clarkb | I'll update the parallelization change to drop serializing those jobs and add a flock to the opendev-ca script | 16:34 |
corvus | cool, that sounds great | 16:34 |
corvus | the other thing i looked at is https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/tasks/main.yaml -- but i think those copy commands will be idempotent, so should be fine. | 16:35 |
clarkb | oh there is already a flock in there | 16:38 |
clarkb | heh | 16:38 |
clarkb | there isn't a timeout on that flock call. Which is probably ok but I could add one | 16:39 |
corvus | oh it's in the role! | 16:39 |
clarkb | I'll leave it as is for now since it has apparently been working | 16:39 |
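For context, the kind of flock guard being discussed looks roughly like this; the lock file path and the 5 minute timeout are assumptions, not what the role actually uses:

```bash
(
    # Fail rather than hang forever if another run already holds the CA lock.
    flock --timeout 300 9 || { echo "timed out waiting for opendev-ca lock" >&2; exit 1; }
    # ...CA and per-server cert generation happens here under the lock...
) 9>/var/lock/opendev-ca.lock
```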
opendevreview | Clark Boylan proposed opendev/system-config master: Run infra-prod jobs in parallel https://review.opendev.org/c/opendev/system-config/+/943488 | 16:40 |
clarkb | the actual period of time where things contend for the ca should be very small after the initial setup so this is probably fine as is | 16:41 |
clarkb | if nothing else I feel a lot more confident that the ca won't give us unexpected problems now :) | 16:42 |
opendevreview | Merged zuul/zuul-jobs master: Update zookeeper latest version to 3.9.3 https://review.opendev.org/c/zuul/zuul-jobs/+/943861 | 17:05 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop requirements.txt https://review.opendev.org/c/opendev/bindep/+/938570 | 17:18 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop auxiliary requirements files https://review.opendev.org/c/opendev/bindep/+/940711 | 17:18 |
fungi | clarkb: ^ redid 940711 based on extras, and also elaborated on the rationale for both in their commit messages | 17:19 |
clarkb | having the test- prefix on the extras should help mitigate confusion over whether that adds new functionality | 17:19 |
fungi | yep, exactly why i went with that after you raised the concern | 17:20 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop setup.cfg https://review.opendev.org/c/opendev/bindep/+/943976 | 17:21 |
clarkb | fungi: question on https://review.opendev.org/c/opendev/bindep/+/940711 about behavior differences between bindep[] and .[] usages | 17:21 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop setup.py https://review.opendev.org/c/opendev/bindep/+/943977 | 17:22 |
fungi | 943976 and 943977 are wip, just want somewhere to test more pbr improvements later | 17:22 |
clarkb | ++ | 17:22 |
clarkb | fungi: ok I think we should probably expand the coverage deps in that case (I left a note on the change about it) | 17:31 |
fungi | clarkb: actually, no it works exactly how we want now that i think about it | 17:32 |
fungi | because right now bindep[test-unit] doesn't exist anywhere except locally in that change | 17:32 |
fungi | and it still gets the right deps installed | 17:33 |
clarkb | oh right so if the coverage test is happy then it should work | 17:33 |
clarkb | and that job was successful, let's see if we ran tests and collected coverage | 17:33 |
clarkb | seems to have done so. Ok no change necessary | 17:34 |
fungi | yeah, i ran all those locally before i pushed them | 17:34 |
clarkb | I updated my review to +2 I think both of those changes lgt | 17:35 |
clarkb | *lgtm | 17:35 |
fungi | as for 943976 and 943977 the failure modes are subtly different between the two of them | 17:37 |
clarkb | ya for 943977 we don't have a way of enabling pbr so pbr never runs aiui | 17:37 |
fungi | though ultimately the exception for both is the same | 17:38 |
fungi | ERROR Backend subprocess exited when trying to invoke get_requires_for_build_sdist | 17:38 |
clarkb | it seems that we're using an old distutils flag=value support option via setup() being called explicitly there that we can't set up another way | 17:38 |
clarkb | though we may be able to hook into a different system to enable the pbr subsystem | 17:38 |
clarkb | Any idea how things like setuptools-scm do it? | 17:40 |
clarkb | thats probably the thing to look at if we want to take the next step in PBR updates to make it fit into the modern stuff better | 17:40 |
fungi | i haven't looked, though probably step 1 is working out how to fix 943976 | 17:40 |
clarkb | ya I guess those may be related in subtle ways as it all comes down to feeding options/config into setuptools | 17:41 |
clarkb | infra-root thoughts on if/when we want to try https://review.opendev.org/c/opendev/system-config/+/943488 and run infra-prod jobs in parallel? | 17:41 |
fungi | right now that comes down to https://opendev.org/openstack/pbr/src/branch/master/pbr/core.py#L98 insisting on looking for a setup.cfg file | 17:42 |
fungi | i wonder if it would be possible to just drop that check | 17:42 |
fungi | oh, though i think there's going to be an earlier failure to fix too | 17:43 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop metadata.name from setup.cfg https://review.opendev.org/c/opendev/bindep/+/943981 | 17:44 |
fungi | that ^ should demonstrate | 17:44 |
clarkb | random followup notes on things from last week: rax-flex is looking happy in both sjc3 and dfw3, gitea response times continue to appear to be stable so I'm feeling more and more confident in the memcached fix for gitea's broken internal memory caching, and manually checking docker hub rate limits I still personally appear to get 100 per 6 hours | 17:49 |
clarkb | if jamesdenton returns to the channel we should ask about bumping quotas up if they are in a position to support that in rax flex | 17:50 |
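The manual rate-limit check clarkb mentions is typically done against Docker Hub's ratelimitpreview endpoint, roughly as below; treat the exact invocation as an assumption based on Docker's documented procedure rather than something from this log:

```bash
# Fetch an anonymous pull token, then read the RateLimit headers from a HEAD request.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer $TOKEN" \
    https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
```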
fungi | okay, so https://opendev.org/openstack/pbr/src/branch/master/pbr/hooks/metadata.py#L32 is where pbr insists on pulling metadata.name from setup.cfg but i suppose trying to rip that out is going to unravel a much longer thread | 17:51 |
clarkb | we'd need to discover the name from somewhere else. In theory we could get it form wherever setuptools stores it if it has read it by the time pbr is called and needs it | 17:52 |
fungi | it wants to pass the package name to packaging.get_version() | 17:52 |
fungi | yeah | 17:52 |
clarkb | could also just do a if os.path.exists(pyproject.toml) load toml and get name from there else look in setup.cfg fallback | 17:53 |
fungi | elsewhere it's relying on setuptools.Command.distribution.get_name() | 17:55 |
fungi | i wonder if that could work as a fallback | 17:55 |
fungi | or if that's actually plumbed into the same pbr.hooks.metadata.MetadataConfig.get_name() call | 17:56 |
clarkb | I'm just going to remove docker hub rate limit discussion from the meeting agenda due to my testing showing things haven't changed yet | 17:57 |
clarkb | fungi: its possible setuptools hasn't bootstrapped enough state at that point too | 17:57 |
clarkb | but ya if that works that may be the most correct option | 17:57 |
clarkb | my first pass of agenda updates are in | 18:17 |
clarkb | let me know if we want to add/remove/edit anything else | 18:18 |
fungi | i'm digging into the possible pbr change for pulling the package name from somewhere other than setup.cfg, but need to take a break to go grab a late lunch. should be back in under an hour | 18:18 |
clarkb | you're still on standard time scheduling (then it's not a super late lunch) | 18:29 |
fungi | too true | 19:26 |
fungi | anyway, it's done now regardless | 19:27 |
fungi | having mulled it over, i suspect querying setuptools for the project name in the pbr hook is going to end up being circular, so probably need to find a way to break that cycle | 19:28 |
clarkb | ya I wondered about that | 20:33 |
fungi | the underlying problem is that other calls pbr makes need to pass the project name, so that has to be resolved first. i agree finding a way to pull that out of pyproject.toml as an alternative to setup.cfg is probably a cleaner solution | 20:37 |
fungi | just treat it as another source of metadata configuration essentially. i'm coming around to that idea | 20:37 |
clarkb | ya that seems like a simple solution | 20:40 |
clarkb | and you can continue to fallback on setup.cfg if pyproject.toml data isn't present | 20:40 |
clarkb | corvus: tonyb did you want to weigh in on the parallel infra-prod stuff or should we proceed with that now? | 20:42 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/943488 is the change | 20:42 |
clarkb | we'll want to monitor the hourly runs after that lands and then decide if we need to revert or proceed with daily runs at 0200 | 20:42 |
fungi | i expect to be around at 02:00 utc (especially now that it's earlier relative to local time for me), so can keep an eye on it as well | 20:45 |
clarkb | ya I can probably be around as well | 20:45 |
corvus | clarkb: that change is much shorter now! lgtm | 20:48 |
clarkb | ok I guess I'll approve it now? | 20:50 |
clarkb | I approved it | 20:50 |
fungi | thanks! | 20:54 |
opendevreview | Merged opendev/system-config master: Run infra-prod jobs in parallel https://review.opendev.org/c/opendev/system-config/+/943488 | 21:00 |
clarkb | the 2100 hourlies should apply that I think | 21:00 |
clarkb | as they haven't enqueued yet | 21:00 |
fungi | i've noticed a lag of ~2 minutes for hourly enqueues | 21:01 |
clarkb | ya the jitter is set to 120 seconds | 21:01 |
clarkb | it does seem like it always runs 120 seconds late | 21:01 |
fungi | aha, that would indeed explain it | 21:01 |
clarkb | rather than some random value between 0 and 120 seconds | 21:01 |
fungi | 120 was randomly selected as a value between 0 and 120 | 21:01 |
clarkb | yup the deploy job is running first before the hourlies so I think the hourlies should include it (we should know in just a few minutes) | 21:01 |
fungi | https://xkcd.com/221/ | 21:02 |
clarkb | hourly has begun. It is doing the single bootstrap bridge job to start though | 21:03 |
clarkb | it paused and two jobs are starting now as expected | 21:04 |
clarkb | service-bridge and service-nodepool | 21:04 |
fungi | so far so good | 21:05 |
clarkb | ps -elf on bridge confirms both playbooks are running | 21:05 |
clarkb | system load is ~1 | 21:05 |
clarkb | peaked at 1.23 and falling now | 21:06 |
clarkb | zuul has more nodes to interact with and may spike load higher. But it's good we have that check | 21:06 |
fungi | hopefully we can crank it up well above 2 in that case | 21:06 |
clarkb | ya there are 8 vcpus. I'm guessing 5-6 may be a good sweet spot | 21:06 |
clarkb | service-bridge succeeded. Service-registry is running | 21:07 |
clarkb | it will also vary by the scheduling overlap of some jobs. We could probably micromanage ordering to get a good bucket fill on cpu demand | 21:07 |
clarkb | but I suspect that may be more effort than it is worth maintaining over time. Just take the wins if this works and call it a day | 21:08 |
clarkb | registry succeeded. Now zuul and nodepool are running concurrently | 21:08 |
clarkb | nodepool succeeded | 21:11 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/7d0d93da5bbe4739b8438ef25bfba7cb says we ended 20:55 minutes:seconds after the hour | 21:12 |
clarkb | sorry for the 1900 hourly run | 21:12 |
clarkb | 20:00 run ended at 27:51 | 21:13 |
clarkb | this one is on track to run at 20:15 ish | 21:13 |
clarkb | *21:15 ish | 21:13 |
clarkb | my local wifi is being randomly flaky with high rtt times | 21:14 |
clarkb | but things are looking good in hourlies so far | 21:14 |
clarkb | fungi: it occurs to me if we really want to exercise all the things before the dailies we might be able to do that by landing your change to clean up the old sjc3 config | 21:15 |
clarkb | do we want to do that? | 21:15 |
clarkb | and I might need to go manually reboot the AP | 21:15 |
clarkb | the zuul job is pulling new images so it is slow | 21:16 |
clarkb | at least that is what it looks like at the tail of the log. | 21:16 |
fungi | yeah, 943625 would be a good way to flush the pipes | 21:17 |
fungi | approved it just now | 21:17 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/578abc171bae4bc6a75666168d588ddb completed with success so hourlies are happy with the parallel execution | 21:21 |
clarkb | ianw: ^ fyi | 21:21 |
clarkb | 943625 should land shortly providing us with round two of parallel infra-prod feedback | 21:39 |
opendevreview | Merged opendev/system-config master: Clean up old Rackspace Flex SJC3 project https://review.opendev.org/c/opendev/system-config/+/943625 | 21:42 |
fungi | here goes | 21:42 |
clarkb | it didn't enqueue as much stuff as I expected but still a good shakedown | 21:43 |
clarkb | there is basically no overlap with the hourlies | 21:43 |
clarkb | the two bridge jobs are it | 21:43 |
clarkb | cool infra-prod-base is running alone as expected too | 21:43 |
clarkb | (everything depends on that job) | 21:43 |
fungi | at least a good double-check | 21:44 |
clarkb | I think I see a small bug, one we can live with for now I suspect. Patch momentarily | 21:45 |
opendevreview | Clark Boylan proposed opendev/system-config master: Have puppet depend on letsencrypt https://review.opendev.org/c/opendev/system-config/+/943992 | 21:48 |
clarkb | it's ^ basically some puppet services do use LE certs so we want LE to run first, then puppet can copy the cert over and use it. Minor issue because I think we're eventually consistent there | 21:49 |
ianw | omg thanks for pushing this through clarkb! i knew it was kind of there in the background and cool to see it working | 21:51 |
clarkb | hrm infra-prod base failed | 21:51 |
fungi | thanks for your work on it ianw! | 21:52 |
clarkb | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz File has unexpected size (364788 != 364784). Mirror sync in progress? | 21:52 |
clarkb | on mirror01.regionone.osuosl.opendev.org | 21:52 |
clarkb | so not related to parallel execution | 21:52 |
clarkb | fungi: I think you can safely reenqueue that change if you think we should rather than wait for hourlies | 21:53 |
clarkb | sorry dailies | 21:53 |
fungi | i'll reenqueue it, would be good to have thorough testing | 21:53 |
clarkb | ++ | 21:53 |
ianw | mirror01.regionone.osuosl.opendev.org : ok=36 changed=2 unreachable=0 failed=1 skipped=6 rescued=0 ignored=0 | 21:54 |
fungi | reenqueued 943625,1 into deploy | 21:54 |
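The re-enqueue itself is done with the Zuul client, along these lines; the tenant name and exact invocation are assumed from typical OpenDev usage rather than quoted from this log:

```bash
zuul-client enqueue --tenant openstack --pipeline deploy \
    --project opendev/system-config --change 943625,1
```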
ianw | oh, haha you found it | 21:54 |
clarkb | I wonder why we trigger manage projects on that update | 21:55 |
clarkb | it shouldn't matter but I think it should noop too | 21:55 |
fungi | yeah, not sure, i don't see any files that are likely candidates | 21:55 |
clarkb | we don't have a file matcher on that job | 21:56 |
clarkb | so it must run a lot? | 21:56 |
fungi | oh, yeah i guess it runs all the time | 21:56 |
clarkb | I wonder if part of the reason for that is we trigger it from project-config changes too and file matchers for system-config won't necessarily match project-config file matchers and vice versa? | 21:58 |
clarkb | though we should be able to have two disjoint lists as any match causes the job to run | 21:58 |
clarkb | oh the file matchers are in project.yaml and we match on inventory/.* | 22:04 |
clarkb | not sure we need to do that but seems safe enough for now | 22:04 |
clarkb | fungi: I think it failed again | 22:04 |
clarkb | should we add the server to the emergency list and try again | 22:06 |
clarkb | oddly nb04 has no problems (not trying to install packages I guess) | 22:06 |
fungi | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz File has unexpected size (364788 != 364784). Mirror sync in progress? [IP: 185.125.190.17 80] | 22:07 |
clarkb | ya and the timestamp on the releases file is from 20:30 ish today | 22:07 |
fungi | that seems to be the error coming from mirror01.regionone.osuosl | 22:08 |
clarkb | so seems likely that there is an in progress mirror update | 22:08 |
clarkb | and they haven't updated the signatures yet? | 22:08 |
fungi | maybe | 22:08 |
fungi | haven't updated the hashes anyway | 22:08 |
fungi | Release file created at: Mon, 10 Mar 2025 20:31:52 +0000 | 22:09 |
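A quick manual check of the failing index, using the URL from the error above, would look something like this (the grepped headers are just the ones relevant to the size mismatch):

```bash
# Compare the advertised size/mtime of the index against what apt expects.
curl -sI http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz \
    | grep -iE 'content-length|last-modified'
sudo apt-get update    # rerun once the mirror has finished syncing
```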
clarkb | I think it is safe to reenqueue even though the hourlies are running because we haven't merged any other changes fwiw | 22:09 |
clarkb | so hourlies updated to the state that change was looking at | 22:09 |
fungi | enqueued again | 22:09 |
clarkb | (just something to consider when reenqueuing even before the refactor) | 22:09 |
clarkb | fungi: do you want me to put that server in the emergency file? | 22:09 |
fungi | let's see if it fails a third time | 22:10 |
clarkb | ack | 22:10 |
fungi | as for sequencing issues, if reenqueuing deploy for the most recently merged change is unsafe, that would be good to know | 22:10 |
fungi | but as far as i can tell we should never install anything newer than the newest state in the hourly run | 22:11 |
clarkb | fungi: yup I was just making that explicit | 22:11 |
fungi | time machines notwithstanding | 22:11 |
clarkb | I think if we merged another change now we'd be in an awkward spot. But since you are reenqueing the last merged change its all the same between deploy and hourly | 22:12 |
fungi | yeah, same goes for most projects really. if i'm reenqueuing into a post-merge pipeline i never inject anything other than the most recently merged change | 22:13 |
clarkb | ++ | 22:13 |
fungi | anything else could cause rollback problems | 22:13 |
clarkb | kids should be home from school soon so I may step out for a bit to get a rundown of how that went. But if it fails again I think the next step may be to put that one host in the emergency file and reenqueue for a fourth time | 22:13 |
fungi | yeah, can do | 22:14 |
fungi | i'll keep an eye on it | 22:14 |
fungi | and also troubleshoot locally on the server if it persists | 22:14 |
fungi | probably something went sideways with ubuntu's arm64 package mirrors | 22:14 |
fungi | yeah, still broken | 22:24 |
fungi | running an apt update locally on mirror01.regionone.osuosl doesn't error | 22:26 |
fungi | i'll reenqueue it one more time | 22:26 |
clarkb | weird that it wouldn't error locally unless doing it manually unstuck something or chose a different mirror | 22:31 |
fungi | probably just timing | 22:32 |
clarkb | seems to be running this time | 22:38 |
clarkb | so far things look good here too which is reassuring | 22:39 |
fungi | yeah, we probably just had to give consistency time to eventual itself | 22:39 |
clarkb | if we do decide that something is horribly wrong during periodic jobs we can either run the disable ansible command to write out the file and reason or move the zuul user authorized keys aside | 22:40 |
clarkb | either way should sufficiently stop jobs from continuing to execute | 22:40 |
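A hedged sketch of those two stop options on bridge; the helper name, marker reason handling, and paths are assumptions about the local tooling rather than verified commands:

```bash
# Option 1: use the disable-ansible helper to write the marker file with a reason.
sudo disable-ansible "pausing infra-prod while we evaluate parallel runs"

# Option 2: move the zuul user's authorized_keys aside so Zuul's executors
# can no longer log in to bridge to run playbooks.
sudo mv /home/zuul/.ssh/authorized_keys /home/zuul/.ssh/authorized_keys.disabled
```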
fungi | and we'll still have a few hours of breathing room until the 02:00 run | 22:40 |
clarkb | yup | 22:40 |
clarkb | and if we do that all we need to go back to a safe state within the zuul config is revert the change that set the semaphore limit to 2 | 22:41 |
clarkb | we ran with it at 1 for some time and should be fine back there | 22:41 |
clarkb | system load was higher for this change than hourlies but doesn't seem like it was by much | 22:47 |
clarkb | probably our next step up will be a semaphore limit of 4 to double what we currently have then check system load from there | 22:48 |
clarkb | but we should run with just 2 for a bit first | 22:48 |
fungi | yeah, later this week would be good | 22:48 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/3424e9e068da43069256154677105514 that buildset was successful | 22:51 |
fungi | yup, lgtm | 23:05 |
clarkb | I guess we have no signal indicating we should roll back. In that case I'll do my best to check periodic after dinner etc | 23:17 |
clarkb | last call on meeting agenda items | 23:37 |
fungi | yeah, still lgtm | 23:53 |
fungi | i may not stick around for the 02:00 utc run, but can check it when i get back to the keyboard in the morning | 23:54 |