*** ralonsoh_ is now known as ralonsoh | 07:04 | |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 07:52 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:04 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:14 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:22 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:33 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:37 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 08:44 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:01 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:06 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:15 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:22 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 09:35 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:21 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:24 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 10:31 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 12:28 |
*** dhill is now known as Guest11027 | 13:02 | |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/943488 is ready for review at this point if we're generally happy with the new infra-prod deployment process. I did find one additional potential conflict on bridge that this change addresses via some new job dependencies. corvus you may be interested in that as it has to do with the opendev-ca used by zookeeper, | 14:50 |
clarkb | nodepool, zuul, and jaeger | 14:50 |
clarkb | fungi: did you see that python was tripping over a clang 19 bug? I don't know which compiler you use locally for your python builds but if you are using clang19 then you probably want to rebuild with mitigations in place: https://blog.nelhage.com/post/cpython-tail-call/ | 14:55 |
clarkb | it's just performance issues thankfully. Not a security or crash type bug | 14:55 |
fungi | good question... i have clang 16, 17, 18 and 19 installed... | 14:56 |
fungi | thanks for the heads up! | 14:56 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 14:57 |
clarkb | I've started thinking about the process for a gerrit server replacement and I think the main consideration is that replication cannot be enabled until after we're running on the new server (to avoid overwriting content on the giteas that is stale or even just wiping out content) | 15:06 |
clarkb | otherwise I think the process is very similar to our regular process. Boot new server, attach volume, enroll in inventory and deploy an "empty" gerrit. Check things look ok. Synchronize content over. Check again. Schedule outage for final synchronization and dns update then enable replication on the new server and disable it on the old server? | 15:07 |
clarkb | I want people to start noodling on that if possible so that we can begin the process soonish with a plan for cutting over soon after the openstack release | 15:08 |
fungi | the synchronize content step will need a (probably somewhat lengthy) outage, though maybe not too bad if we use a pre-warmed rsync and just finalize it offline | 15:12 |
fungi | could also do that step in parallel with dns change propagation, to shorten the outage further | 15:12 |
fungi | i assume we'll want the zuul schedulers offline during that time as well | 15:13 |
clarkb | ya I think we can prestage warm data to speed things up. I guess the database will want a backup restore rather than an rsync? | 15:14 |
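A minimal sketch of the pre-warmed rsync plus dump/restore approach being discussed, assuming illustrative hostnames (review-old/review-new), paths, and database name; the real Gerrit data layout would differ:

```bash
# Pre-warm while the old server still serves traffic; safe to repeat.
rsync -avH --delete review-old.opendev.org:/home/gerrit2/ /home/gerrit2/

# During the outage window: stop Gerrit on the old server, then run one
# final (much shorter) incremental pass to pick up the remaining delta.
rsync -avH --delete review-old.opendev.org:/home/gerrit2/ /home/gerrit2/

# Move the reviewed-flags database with a dump/restore rather than rsync.
ssh review-old.opendev.org "mysqldump --single-transaction reviewdb" | mysql reviewdb
```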
clarkb | the other option is to use the current prod volume and move it over | 15:15 |
fungi | true, the db is hopefully fairly minimal these days? | 15:15 |
clarkb | probably spin up the new server in that case with a temporary volume (to avoid polluting the root fs with gerrit stuff) | 15:15 |
clarkb | fungi: it's all the reviewed flags for each user on each file in every change they have looked at. I think it can get somewhat large but also that data isn't critical if we decide to drop it | 15:15 |
clarkb | if we did the volume swap instead of rsyncing we might do something like a volume clone and validate the data works with the deployment then during our actual cutover swap out the volumes for the actual prod one? | 15:16 |
fungi | we ought to temporarily disable the optimize-git-repos cronjob on the new server too | 15:16 |
clarkb | can cinder do a clone of an active volume? That might be a good option if we think it will be reliable (I know we've had slow attaches in the past etc) | 15:16 |
fungi | not sure | 15:17 |
fungi | if you think it would make more sense than rsync/mysqldump, i suppose we could try and see whether it's possible | 15:20 |
clarkb | Just thinking if it works it might be the shortest downtime option as it would essentially be umount, volume detach, volume attach, mount | 15:22 |
clarkb | but I seem to recall us trying to do similar things in the past and it not being very reliable (because detaches may not work and attaches may be slow and and and...) | 15:22 |
clarkb | rsync and mysqldump are probably the best choices as they are stable known quantities and while we may wait a bit longer we're less likely to have exceptional behavior | 15:23 |
clarkb | looks like the cinder api supports taking a snapshot of an attached volume: "Prior to API version 3.66, a ‘force’ flag was required to create a snapshot of an in-use volume, but this is no longer the case. From API version 3.66, the ‘force’ flag is invalid when passed in a volume snapshot request" | 15:25 |
clarkb | so in theory this is an option available to us. Make a snapshot of the prod volume for testing purposes and boot the new server with a new volume based on that snapshot attached | 15:25 |
clarkb | then when we do the cut over replace the temporary copy of the volume with the actual volume | 15:25 |
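Roughly what that snapshot-based test copy could look like with the OpenStack client, assuming volume API microversion 3.66 or later; the volume, snapshot, and server names are placeholders:

```bash
# Snapshot the in-use production volume (no --force needed at >= 3.66).
openstack --os-volume-api-version 3.66 volume snapshot create \
    --volume gerrit-prod-data gerrit-data-snap

# Build a throwaway volume from the snapshot and attach it to the new
# server for validation; the real cutover would swap in the prod volume.
openstack volume create --snapshot gerrit-data-snap gerrit-data-test
openstack server add volume review-new gerrit-data-test
```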
clarkb | but as mentioned this has a lot more unknown unknowns and requires a lot of different tools to interact with each other properly | 15:26 |
fungi | also rsync/mysqldump gives us the option to switch regions or providers, while a volume move would only work for adjacent server instances | 15:29 |
clarkb | yup | 15:30 |
clarkb | also I think we use lvm. That probably complicates the volume swap a bit (as we'd need to also refresh the lvm configuration as part of the mount process?) | 15:30 |
fungi | lvm should be fine as long as we deactivate the volume groups before the switch | 15:34 |
clarkb | on which side? Maybe both to clear out the config on the destination and to allow the source to become source on the destination? | 15:35 |
clarkb | anyway I think I'm leaning towards rsync and mysqldump simply because it gives us more control over the process and less leap of faith type actions | 15:36 |
fungi | vgchange -an on both servers before detaching the volumes from them, then vgchange -ay on the new server after attaching again | 15:39 |
fungi | and yeah, i'm not advocating for a volume swap, but that's what the process would look like. it's tractable | 15:40 |
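For reference, a sketch of that volume-swap sequence assuming an LVM volume group called "main" with a gerrit logical volume; service, device, and mount point names are illustrative:

```bash
# Old server: quiesce and release the volume.
systemctl stop gerrit            # or however the service is actually run
umount /home/gerrit2
vgchange -an main
# ...detach the cinder volume from the old server, attach it to the new one...

# New server: reactivate and mount.
vgchange -ay main
mount /dev/main/gerrit /home/gerrit2
```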
clarkb | the other two changes I've currently got up are https://review.opendev.org/c/opendev/system-config/+/940928 and child. Though those might make good followups to the parallel infra-prod execution as sanity checks if the hourlies look good first | 16:06 |
clarkb | but reviews now might be helpful | 16:06 |
fungi | both lgtm | 16:09 |
clarkb | should we make a bindep release? | 16:16 |
fungi | ah, yes probably... infra-root: last call for input on whether 938570 and 940711 make sense as a pattern to align our projects with broader python packaging community expectations | 16:20 |
fungi | if there's a general preference for keeping requirements.txt, test-requirements.txt and doc/requirements.txt files in our packages, or a desire to implement the latter two with "extras" in the pyproject.toml and reference those from our noxfile.py or something, then we can release bindep as-is | 16:22 |
fungi | the python packaging community argument against requirements.txt in particular (938570) is that we're marking our package dependencies as dynamically resolved at wheel build time, which means the wheel has to be downloaded and built in order to identify them instead of just checking published metadata | 16:24 |
clarkb | I think if I had a preference keeping the test requirements separate to make it easier to run without using nox seems like a good idea. But rolling true deps into the pyproject.toml to match upstream expectations makes sense to me | 16:24 |
fungi | note that for 940711 we could stick those into extras in the pyproject.toml and then list bindep[docs] or bindep[linters] or bindep[coverage] or bindep[unit] or whatever in noxfile.py | 16:25 |
fungi | that way people who want to install them some other way still can | 16:26 |
fungi | i can work up an alternative if people want to see it in action | 16:26 |
clarkb | those might become an attractive nuisance for people thinking if they install them they get extra features/functionality | 16:28 |
clarkb | (which is typically how extras are used aiui) | 16:28 |
corvus | clarkb: i think opendev-ca is idempotent -- here's what it would be doing for each server https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/opendev-ca.sh#L98 | 16:28 |
corvus | clarkb: at least, once it's established | 16:28 |
corvus | the first time it ever runs, it would need to create the ca cert, that could race | 16:28 |
clarkb | maybe I should update my todo to move opendev-ca calls into bootstrap-bridge then we can remove the serialization of those jobs? | 16:29 |
clarkb | maybe not move opendev-ca there but call it there first at least? | 16:29 |
corvus | yeah, if we call it once with, say, a dummy server name, that would ensure that the ca part is done once on bootstrap | 16:30 |
corvus | (we could also change the script to make it so it doesn't need a server name, but as of today, it expects one) | 16:30 |
clarkb | ok let me update the parallelization change to keep serializing those 4 jobs but I'll add a TODO that indicates we can stop serializing once we have bootstrap bridge calling opendev-ca | 16:30 |
corvus | hrm | 16:31 |
corvus | i'm trying to avoid serializing them | 16:31 |
clarkb | I guess today its safe to not serialize them because we have an existing bridge and ca | 16:31 |
corvus | can we just make that change now? | 16:31 |
corvus | right, the only issue would be in DR | 16:31 |
corvus | (or we decide to make a new ca) | 16:32 |
clarkb | maybe. The "problem" is bootstrap-bridge is actually doing minimal stuff on bridge and only running ansible from zuul not from bridge aiui | 16:32 |
clarkb | so I'm not sure how to port that into bootstrap-bridge yet | 16:32 |
corvus | oh, then... we could flock | 16:32 |
clarkb | oh ya that would work. I can write that edit | 16:33 |
clarkb | I'll update the parallelization change to drop serializing those jobs and add a flock to the opendev-ca script | 16:34 |
corvus | cool, that sounds great | 16:34 |
corvus | the other thing i looked at is https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/tasks/main.yaml -- but i think those copy commands will be idempotent, so should be fine. | 16:35 |
clarkb | oh there is already a flock in there | 16:38 |
clarkb | heh | 16:38 |
clarkb | there isn't a timeout on that flock call. Which is probably ok but I could add one | 16:39 |
corvus | oh it's in the role! | 16:39 |
clarkb | I'll leave it as is for now since it has apparently been working | 16:39 |
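For context, the kind of flock guard being discussed looks roughly like this; the lock file path and the 5 minute timeout are assumptions, not what the role actually uses:

```bash
(
    # Fail rather than hang forever if another run already holds the CA lock.
    flock --timeout 300 9 || { echo "timed out waiting for opendev-ca lock" >&2; exit 1; }
    # ...CA and per-server cert generation happens here under the lock...
) 9>/var/lock/opendev-ca.lock
```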
opendevreview | Clark Boylan proposed opendev/system-config master: Run infra-prod jobs in parallel https://review.opendev.org/c/opendev/system-config/+/943488 | 16:40 |
clarkb | the actual period of time where things contend for the ca should be very small after the initial setup so this is probably fine as is | 16:41 |
clarkb | if nothing else I feel a lot more confident that the ca won't give us unexpected problems now :) | 16:42 |
opendevreview | Merged zuul/zuul-jobs master: Update zookeeper latest version to 3.9.3 https://review.opendev.org/c/zuul/zuul-jobs/+/943861 | 17:05 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop requirements.txt https://review.opendev.org/c/opendev/bindep/+/938570 | 17:18 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop auxiliary requirements files https://review.opendev.org/c/opendev/bindep/+/940711 | 17:18 |
fungi | clarkb: ^ redid 940711 based on extras, and also elaborated on the rationale for both in their commit messages | 17:19 |
clarkb | having the test- prefix on the extras should help mitigate confusion over whether that adds new functionality | 17:19 |
fungi | yep, exactly why i went with that after you raised the concern | 17:20 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop setup.cfg https://review.opendev.org/c/opendev/bindep/+/943976 | 17:21 |
clarkb | fungi: question on https://review.opendev.org/c/opendev/bindep/+/940711 about behavior differences between bindep[] and .[] usages | 17:21 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop setup.py https://review.opendev.org/c/opendev/bindep/+/943977 | 17:22 |
fungi | 943976 and 943977 are wip, just want somewhere to test more pbr improvements later | 17:22 |
clarkb | ++ | 17:22 |
clarkb | fungi: ok I think we should probably expand the coverage deps in that case (I left a note on the change about it) | 17:31 |
fungi | clarkb: actually, no it works exactly how we want now that i think about it | 17:32 |
fungi | because right now bindep[test-unit] doesn't exist anywhere except locally in that change | 17:32 |
fungi | and it still gets the right deps installed | 17:33 |
clarkb | oh right so if the coverage test is happy then it should work | 17:33 |
clarkb | and that job was successful, let's see if we ran tests and collected coverage | 17:33 |
clarkb | seems to have done so. Ok no change necessary | 17:34 |
fungi | yeah, i ran all those locally before i pushed them | 17:34 |
clarkb | I updated my review to +2 I think both of those changes lgt | 17:35 |
clarkb | *lgtm | 17:35 |
fungi | as for 943976 and 943977 the failure modes are subtly different between the two of them | 17:37 |
clarkb | ya for 943977 we don't have a way of enabling pbr so pbr never runs aiui | 17:37 |
fungi | though ultimately the exception for both is the same | 17:38 |
fungi | ERROR Backend subprocess exited when trying to invoke get_requires_for_build_sdist | 17:38 |
clarkb | it seems that we're using an old distutils flag=value support option via setup() being called explicitly there that we can't set up another way | 17:38 |
clarkb | though we may be able to hook into a different system to enable the pbr subsystem | 17:38 |
clarkb | Any idea how things like setuptools-scm do it? | 17:40 |
clarkb | thats probably the thing to look at if we want to take the next step in PBR updates to make it fit into the modern stuff better | 17:40 |
fungi | i haven't looked, though probably step 1 is working out how to fix 943976 | 17:40 |
clarkb | ya I guess those may be related in subtle ways as it all comes down to feeding options/config into setuptools | 17:41 |
clarkb | infra-root thoughts on if/when we want to try https://review.opendev.org/c/opendev/system-config/+/943488 and run infra-prod jobs in parallel? | 17:41 |
fungi | right now that comes down to https://opendev.org/openstack/pbr/src/branch/master/pbr/core.py#L98 insisting on looking for a setup.cfg file | 17:42 |
fungi | i wonder if it would be possible to just drop that check | 17:42 |
fungi | oh, though i think there's going to be an earlier failure to fix too | 17:43 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop metadata.name from setup.cfg https://review.opendev.org/c/opendev/bindep/+/943981 | 17:44 |
fungi | that ^ should demonstrate | 17:44 |
clarkb | random followup notes on things from last week: rax-flex is looking happy in both sjc3 and dfw3, gitea response times continue to appear to be stable so I'm feeling more and more confident in the memcached fix for gitea's broken internal memory caching, and manually checking docker hub rate limits I still personally appear to get 100 per 6 hours | 17:49 |
clarkb | if jamesdenton returns to the channel we should ask about bumping quotas up if they are in a position to support that in rax flex | 17:50 |
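The manual rate-limit check clarkb mentions is typically done against Docker Hub's ratelimitpreview endpoint, roughly as below; treat the exact invocation as an assumption based on Docker's documented procedure rather than something from this log:

```bash
# Fetch an anonymous pull token, then read the RateLimit headers from a HEAD request.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer $TOKEN" \
    https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
```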
fungi | okay, so https://opendev.org/openstack/pbr/src/branch/master/pbr/hooks/metadata.py#L32 is where pbr insists on pulling metadata.name from setup.cfg but i suppose trying to rip that out is going to unravel a much longer thread | 17:51 |
clarkb | we'd need to discover the name from somewhere else. In theory we could get it form wherever setuptools stores it if it has read it by the time pbr is called and needs it | 17:52 |
fungi | it wants to pass the package name to packaging.get_version() | 17:52 |
fungi | yeah | 17:52 |
clarkb | could also just do a if os.path.exists(pyproject.toml) load toml and get name from there else look in setup.cfg fallback | 17:53 |
fungi | elsewhere it's relying on setuptools.Command.distribution.get_name() | 17:55 |
fungi | i wonder if that could work as a fallback | 17:55 |
fungi | or if that's actually plumbed into the same pbr.hooks.metadata.MetadataConfig.get_name() call | 17:56 |
clarkb | I'm just going to remove docker hub rate limit discussion from the meeting agenda due to my testing showing things haven't changed yet | 17:57 |
clarkb | fungi: its possible setuptools hasn't bootstrapped enough state at that point too | 17:57 |
clarkb | but ya if that works that may be the most correct option | 17:57 |
clarkb | my first pass of agenda updates are in | 18:17 |
clarkb | let me know if we want to add/remove/edit anything else | 18:18 |
fungi | i'm digging into the possible pbr change for pulling the package name from somewhere other than setup.cfg, but need to take a break to go grab a late lunch. should be back in under an hour | 18:18 |
clarkb | you're still on standard time scheduling (then it's not a super late lunch) | 18:29 |
fungi | too true | 19:26 |
fungi | anyway, it's done now regardless | 19:27 |
fungi | having mulled it over, i suspect querying setuptools for the project name in the pbr hook is going to end up being circular, so probably need to find a way to break that cycle | 19:28 |
clarkb | ya I wondered about that | 20:33 |
fungi | the underlying problem is that other calls pbr makes need to pass the project name, so that has to be resolved first. i agree finding a way to pull that out of pyproject.toml as an alternative to setup.cfg is probably a cleaner solution | 20:37 |
fungi | just treat it as another source of metadata configuration essentially. i'm coming around to that idea | 20:37 |
clarkb | ya that seems like a simple solution | 20:40 |
clarkb | and you can continue to fallback on setup.cfg if pyproject.toml data isn't present | 20:40 |
clarkb | corvus: tonyb did you want to weigh in on the parallel infra-prod stuff or should we proceed with that now? | 20:42 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/943488 is the change | 20:42 |
clarkb | we'll want to monitor the hourly runs after that lands and then decide if we need to revert or proceed with daily runs at 0200 | 20:42 |
fungi | i expect to be around at 02:00 utc (especially now that it's earlier relative to local time for me), so can keep an eye on it as well | 20:45 |
clarkb | ya I can probably be around as well | 20:45 |
corvus | clarkb: that change is much shorter now! lgtm | 20:48 |
clarkb | ok I guess I'll approve it now? | 20:50 |
clarkb | I approved it | 20:50 |
fungi | thanks! | 20:54 |
opendevreview | Merged opendev/system-config master: Run infra-prod jobs in parallel https://review.opendev.org/c/opendev/system-config/+/943488 | 21:00 |
clarkb | the 2100 hourlies should apply that I think | 21:00 |
clarkb | as they haven't enqueued yet | 21:00 |
fungi | i've noticed a lag of ~2 minutes for hourly enqueues | 21:01 |
clarkb | ya the jitter is set to 120 seconds | 21:01 |
clarkb | it does seem like it always runs 120 seconds late | 21:01 |
fungi | aha, that would indeed explain it | 21:01 |
clarkb | rather than some random value between 0 and 120 seconds | 21:01 |
fungi | 120 was randomly selected as a value between 0 and 120 | 21:01 |
clarkb | yup the deploy job is running first before the hourlies so I think the hourlies should include it (we should know in just a few minutes) | 21:01 |
fungi | https://xkcd.com/221/ | 21:02 |
clarkb | hourly has begun. It is doing the single bootstrap bridge job to start though | 21:03 |
clarkb | it paused and two jobs are starting now as expected | 21:04 |
clarkb | service-bridge and service-nodepool | 21:04 |
fungi | so far so good | 21:05 |
clarkb | ps -elf on bridge confirms both playbooks are running | 21:05 |
clarkb | system load is ~1 | 21:05 |
clarkb | peaked at 1.23 and falling now | 21:06 |
clarkb | zuul has more nodes to interact with and may spike load higher. But it's good we have that check | 21:06 |
fungi | hopefully we can crank it up well above 2 in that case | 21:06 |
clarkb | ya there are 8 vcpus. I'm guessing 5-6 may be a good sweet spot | 21:06 |
clarkb | service-bridge succeeded. Service-registry is running | 21:07 |
clarkb | it will also vary by the scheduling overlap of some jobs. We could probably micromanage ordering to get a good bucket fill on cpu demand | 21:07 |
clarkb | but I suspect that may be more effort than it is worth maintaining over time. Just take the wins if this works and call it a day | 21:08 |
clarkb | registry succeeded. Now zuul and nodepool are running concurrently | 21:08 |
clarkb | nodepool succeeded | 21:11 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/7d0d93da5bbe4739b8438ef25bfba7cb says we ended 20:55 minutes:seconds after the hour | 21:12 |
clarkb | sorry for the 1900 hourly run | 21:12 |
clarkb | 20:00 run ended at 27:51 | 21:13 |
clarkb | this one is on track to run at 20:15 ish | 21:13 |
clarkb | *21:15 ish | 21:13 |
clarkb | my local wifi is being randomly flaky with high rtt times | 21:14 |
clarkb | but things are looking good in hourlies so far | 21:14 |
clarkb | fungi: it occurs to me if we really want to exercise all the things before the dailies we might be able to do that by landing your change to clean up the old sjc3 config | 21:15 |
clarkb | do we want to do that? | 21:15 |
clarkb | and I might need to go manually reboot the AP | 21:15 |
clarkb | the zuul job is pulling new images so it is slow | 21:16 |
clarkb | at least that is what it looks like at the tail of the log. | 21:16 |
fungi | yeah, 943625 would be a good way to flush the pipes | 21:17 |
fungi | approved it just now | 21:17 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/578abc171bae4bc6a75666168d588ddb completed with success so hourlies are happy with the parallel execution | 21:21 |
clarkb | ianw: ^ fyi | 21:21 |
clarkb | 943625 should land shortly providing us with round two of parallel infra-prod feedback | 21:39 |
opendevreview | Merged opendev/system-config master: Clean up old Rackspace Flex SJC3 project https://review.opendev.org/c/opendev/system-config/+/943625 | 21:42 |
fungi | here goes | 21:42 |
clarkb | it didn't enqueue as much stuff as I expected but still a good shakedown | 21:43 |
clarkb | there is basically no overlap with the hourlies | 21:43 |
clarkb | the two bridge jobs are it | 21:43 |
clarkb | cool infra-prod-base is running alone as expected too | 21:43 |
clarkb | (everything depends on that job) | 21:43 |
fungi | at least a good double-check | 21:44 |
clarkb | I think I see a small bug, one we can live with for now I suspect. Patch momentarily | 21:45 |
opendevreview | Clark Boylan proposed opendev/system-config master: Have puppet depend on letsencrypt https://review.opendev.org/c/opendev/system-config/+/943992 | 21:48 |
clarkb | it's ^ basically some puppet services do use LE certs so we want LE to run first, then puppet can copy the cert over and use it. Minor issue because I think we're eventually consistent there | 21:49 |
ianw | omg thanks for pushing this through clarkb! i knew it was kind of there in the background and cool to see it working | 21:51 |
clarkb | hrm infra-prod base failed | 21:51 |
fungi | thanks for your work on it ianw! | 21:52 |
clarkb | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz File has unexpected size (364788 != 364784). Mirror sync in progress? | 21:52 |
clarkb | on mirror01.regionone.osuosl.opendev.org | 21:52 |
clarkb | so not related to parallel execution | 21:52 |
clarkb | fungi: I think you can safely reenqueue that change if you think we should rather than wait for hourlies | 21:53 |
clarkb | sorry dailies | 21:53 |
fungi | i'll reenqueue it, would be good to have thorough testing | 21:53 |
clarkb | ++ | 21:53 |
ianw | mirror01.regionone.osuosl.opendev.org : ok=36 changed=2 unreachable=0 failed=1 skipped=6 rescued=0 ignored=0 | 21:54 |
fungi | reenqueued 943625,1 into deploy | 21:54 |
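The re-enqueue itself is done with the Zuul client, along these lines; the tenant name and exact invocation are assumed from typical OpenDev usage rather than quoted from this log:

```bash
zuul-client enqueue --tenant openstack --pipeline deploy \
    --project opendev/system-config --change 943625,1
```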
ianw | oh, haha you found it | 21:54 |
clarkb | I wonder why we trigger manage projects on that update | 21:55 |
clarkb | it shouldn't matter but I think it should noop too | 21:55 |
fungi | yeah, not sure, i don't see any files that are likely candidates | 21:55 |
clarkb | we don't have a file matcher on that job | 21:56 |
clarkb | so it must run a lot? | 21:56 |
fungi | oh, yeah i guess it runs all the time | 21:56 |
clarkb | I wonder if part of the reason for that is we trigger it from project-config changes too and file matchers for system-config won't necessarily match project-config file matchers and vice versa? | 21:58 |
clarkb | though we should be able to have two disjoint lists as any match causes the job to run | 21:58 |
clarkb | oh the file matchers are in project.yaml and we match on inventory/.* | 22:04 |
clarkb | not sure we need to do that but seems safe enough for now | 22:04 |
clarkb | fungi: I think it failed again | 22:04 |
clarkb | should we add the server to the emergency list and try again | 22:06 |
clarkb | oddly nb04 has no problems (not trying to install packages I guess) | 22:06 |
fungi | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz File has unexpected size (364788 != 364784). Mirror sync in progress? [IP: 185.125.190.17 80] | 22:07 |
clarkb | ya and the timestamp on the releases file is from 20:30 ish today | 22:07 |
fungi | that seems to be the error coming from mirror01.regionone.osuosl | 22:08 |
clarkb | so seems likely that there is an in progress mirror update | 22:08 |
clarkb | and they haven't updated the signatures yet? | 22:08 |
fungi | maybe | 22:08 |
fungi | haven't updated the hashes anyway | 22:08 |
fungi | Release file created at: Mon, 10 Mar 2025 20:31:52 +0000 | 22:09 |
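A quick manual check of the failing index, using the URL from the error above, would look something like this (the grepped headers are just the ones relevant to the size mismatch):

```bash
# Compare the advertised size/mtime of the index against what apt expects.
curl -sI http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz \
    | grep -iE 'content-length|last-modified'
sudo apt-get update    # rerun once the mirror has finished syncing
```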
clarkb | I think it is safe to reenqueue even though the hourlies are running because we haven't merged any other changes fwiw | 22:09 |
clarkb | so hourlies updated to the state that change was looking at | 22:09 |
fungi | enqueued again | 22:09 |
clarkb | (just something to consider when reenqueuing even before the refactor) | 22:09 |
clarkb | fungi: do you want me to put that server in the emergency file? | 22:09 |
fungi | let's see if it fails a third time | 22:10 |
clarkb | ack | 22:10 |
fungi | as for sequencing issues, if reenqueuing deploy for the most recently merged change is unsafe, that would be good to know | 22:10 |
fungi | but as far as i can tell we should never install anything newer than the newest state in the hourly run | 22:11 |
clarkb | fungi: yup I was just making that explicit | 22:11 |
fungi | time machines notwithstanding | 22:11 |
clarkb | I think if we merged another change now we'd be in an awkward spot. But since you are reenqueing the last merged change its all the same between deploy and hourly | 22:12 |
fungi | yeah, same goes for most projects really. if i'm reenqueuing into a post-merge pipeline i never inject anything other than the most recently merged change | 22:13 |
clarkb | ++ | 22:13 |
fungi | anything else could cause rollback problems | 22:13 |
clarkb | kids should be home from school soon so I may step out for a bit to get a rundown of how that went. But if it fails again I think the next step may be to put that one host in the emergency file and reenqueue for a fourth time | 22:13 |
fungi | yeah, can do | 22:14 |
fungi | i'll keep an eye on it | 22:14 |
fungi | and also troubleshoot locally on the server if it persists | 22:14 |
fungi | probably something went sideways with ubuntu's arm64 package mirrors | 22:14 |
fungi | yeah, still broken | 22:24 |
fungi | running an apt update locally on mirror01.regionone.osuosl doesn't error | 22:26 |
fungi | i'll reenqueue it one more time | 22:26 |
clarkb | weird that it wouldn't error locally unless doing it manually unstuck something or chose a different mirror | 22:31 |
fungi | probably just timing | 22:32 |
clarkb | seems to be running this time | 22:38 |
clarkb | so far things look good here too which is reassuring | 22:39 |
fungi | yeah, we probably just had to give consistency time to eventual itself | 22:39 |
clarkb | if we do decide that something is horribly wrong during periodic jobs we can either run the disable ansible command to write out the file and reason or move the zuul user authorized keys aside | 22:40 |
clarkb | either way should sufficiently stop jobs from continuing to execute | 22:40 |
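A hedged sketch of those two stop options on bridge; the helper name, marker reason handling, and paths are assumptions about the local tooling rather than verified commands:

```bash
# Option 1: use the disable-ansible helper to write the marker file with a reason.
sudo disable-ansible "pausing infra-prod while we evaluate parallel runs"

# Option 2: move the zuul user's authorized_keys aside so Zuul's executors
# can no longer log in to bridge to run playbooks.
sudo mv /home/zuul/.ssh/authorized_keys /home/zuul/.ssh/authorized_keys.disabled
```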
fungi | and we'll still have a few hours of breathing room until the 02:00 run | 22:40 |
clarkb | yup | 22:40 |
clarkb | and if we do that all we need to go back to a safe state within the zuul config is revert the change that set the semaphore limit to 2 | 22:41 |
clarkb | we ran with it at 1 for some time and should be fine back there | 22:41 |
clarkb | system load was higher for this change than hourlies but doesn't seem like it was by much | 22:47 |
clarkb | probably our next step up will be a semaphore limit of 4 to double what we currently have then check system load from there | 22:48 |
clarkb | but we should run with just 2 for a bit first | 22:48 |
fungi | yeah, later this week would be good | 22:48 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/3424e9e068da43069256154677105514 that buildset was successful | 22:51 |
fungi | yup, lgtm | 23:05 |
clarkb | I guess we have no signal indicating we should roll back. In that case I'll do my best to check periodic after dinner etc | 23:17 |
clarkb | last call on meeting agenda items | 23:37 |
fungi | yeah, still lgtm | 23:53 |
fungi | i may not stick around for the 02:00 utc run, but can check it when i get back to the keyboard in the morning | 23:54 |