Monday, 2025-03-10

*** ralonsoh_ is now known as ralonsoh  07:04
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  07:52
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:04
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:14
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:22
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:33
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:37
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  08:44
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  09:01
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  09:06
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  09:15
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  09:22
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  09:35
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  10:21
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  10:24
<opendevreview> Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/943586  10:31
<opendevreview> Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/941672  12:28
*** dhill is now known as Guest11027  13:02
<clarkb> infra-root https://review.opendev.org/c/opendev/system-config/+/943488 is ready for review at this point if we're generally happy with the new infra-prod deployment process. I did find an additional potential conflict on bridge that this change addresses via some new job dependencies. corvus you may be interested in that as it has to do with the opendev-ca used by zookeeper,  14:50
<clarkb> nodepool, zuul, and jaeger  14:50
<clarkb> fungi: did you see that python was tripping over a clang 19 bug? I don't know which compiler you use locally for your python builds but if you are using clang 19 then you probably want to rebuild with mitigations in place: https://blog.nelhage.com/post/cpython-tail-call/  14:55
<clarkb> it's just performance issues thankfully. Not a security or crash type bug  14:55
<fungi> good question... i have clang 16, 17, 18 and 19 installed...  14:56
<fungi> thanks for the heads up!  14:56
<opendevreview> Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/941672  14:57
<clarkb> I've started thinking about the process for a gerrit server replacement and I think the main consideration is that replication cannot be enabled until after we're running on the new server (to avoid overwriting content on the giteas that is stale or even just wiping out content)  15:06
<clarkb> otherwise I think the process is very similar to our regular process. Boot new server, attach volume, enroll in inventory and deploy an "empty" gerrit. Check things look ok. Synchronize content over. Check again. Schedule outage for final synchronization and dns update then enable replication on the new server and disable it on the old server?  15:07
<clarkb> I want people to start noodling on that if possible so that we can begin the process soonish with a plan for cutting over soon after the openstack release  15:08
<fungi> the synchronize content step will need a (probably somewhat lengthy) outage, though maybe not too bad if we use a pre-warmed rsync and just finalize it offline  15:12
<fungi> could also do that step in parallel with dns change propagation, to shorten the outage further  15:12
<fungi> i assume we'll want the zuul schedulers offline during that time as well  15:13
<clarkb> ya I think we can prestage warm data to speed things up. I guess the database will want a backup restore rather than an rsync?  15:14
<clarkb> the other option is to use the current prod volume and move it over  15:15
<fungi> true, the db is hopefully fairly minimal these days?  15:15
<clarkb> probably spin up the new server in that case with a temporary volume (to avoid polluting the root fs with gerrit stuff)  15:15
<clarkb> fungi: it's all the reviewed flags for each user on each file in every change they have looked at. I think it can get somewhat large but also that data isn't critical if we decide to drop it  15:15
<clarkb> if we did the volume swap instead of rsyncing we might do something like a volume clone and validate the data works with the deployment then during our actual cutover swap out the volumes for the actual prod one?  15:16
<fungi> we ought to temporarily disable the optimize-git-repos cronjob on the new server too  15:16
<clarkb> can cinder do a clone of an active volume? That might be a good option if we think it will be reliable (I know we've had slow attaches in the past etc)  15:16
<fungi> not sure  15:17
<fungi> if you think it would make more sense than rsync/mysqldump, i suppose we could try and see whether it's possible  15:20
<clarkb> Just thinking if it works it might be the shortest downtime option as it would essentially be umount, volume detach, volume attach, mount  15:22
<clarkb> but I seem to recall us trying to do similar things in the past and it not being very reliable (because detaches may not work and attaches may be slow and and and...)  15:22
<clarkb> rsync and mysqldump are probably the best choices as they are stable known quantities and while we may wait a bit longer we're less likely to have exceptional behavior  15:23
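To make the rsync/mysqldump approach concrete, a minimal sketch of the pre-warm plus final-cutover sequence might look like the following; the hostnames, paths, service commands, and database name are placeholders rather than the actual production values.

    # Pre-warm while the old server stays in production; rerun as often
    # as needed, later passes only transfer deltas.
    rsync -avz --delete review-old:/home/gerrit2/ /home/gerrit2/

    # During the outage window: stop Gerrit, take a final delta sync,
    # copy the database over, then start Gerrit on the new server.
    ssh review-old 'systemctl stop gerrit'
    rsync -avz --delete review-old:/home/gerrit2/ /home/gerrit2/
    ssh review-old 'mysqldump --single-transaction reviewdb' | mysql reviewdb
    systemctl start gerrit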
<clarkb> looks like the cinder api supports taking a snapshot of an attached volume: "Prior to API version 3.66, a ‘force’ flag was required to create a snapshot of an in-use volume, but this is no longer the case. From API version 3.66, the ‘force’ flag is invalid when passed in a volume snapshot request"  15:25
<clarkb> so in theory this is an option available to us. Make a snapshot of the prod volume for testing purposes and boot the new server with a new volume based on that snapshot attached  15:25
<clarkb> then when we do the cut over replace the temporary copy of the volume with the actual volume  15:25
<clarkb> but as mentioned this has a lot more unknown unknowns and requires a lot of different tools to interact with each other properly  15:26
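If the snapshot route were explored, the openstack CLI sequence would look roughly like this; the volume and server names are invented for illustration.

    # Snapshot the attached production volume (API 3.66+ no longer
    # requires a force flag for in-use volumes, per the docs quoted above).
    openstack volume snapshot create --volume review02-data review02-data-snap

    # Build a test volume from the snapshot and attach it to the
    # replacement server for validation.
    openstack volume create --snapshot review02-data-snap review03-data-test
    openstack server add volume review03 review03-data-test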
<fungi> also rsync/mysqldump gives us the option to switch regions or providers, while a volume move would only work for adjacent server instances  15:29
<clarkb> yup  15:30
<clarkb> also I think we use lvm. That probably complicates the volume swap a bit (as we'd need to also refresh the lvm configuration as part of the mount process?)  15:30
<fungi> lvm should be fine as long as we deactivate the volume groups before the switch  15:34
<clarkb> on which side? Maybe both to clear out the config on the destination and to allow the source to become the source on the destination?  15:35
<clarkb> anyway I think I'm leaning towards rsync and mysqldump simply because it gives us more control over the process and fewer leap of faith type actions  15:36
<fungi> vgchange -an on both servers before detaching the volumes from them, then vgchange -ay on the new server after attaching again  15:39
<fungi> and yeah, i'm not advocating for a volume swap, but that's what the process would look like. it's tractable  15:40
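For reference, the volume-swap sequence fungi outlines would amount to roughly the following during the cutover; the volume group, logical volume, server, and volume names here are placeholders.

    # old server: stop gerrit, unmount, deactivate the VG, detach
    umount /home/gerrit2
    vgchange -an main
    openstack server remove volume review02 review02-data

    # new server: attach, reactivate, mount
    openstack server add volume review03 review02-data
    vgchange -ay main
    mount /dev/main/gerrit2 /home/gerrit2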
<clarkb> the other two changes I've currently got up are https://review.opendev.org/c/opendev/system-config/+/940928 and child. Though those might make good followups to the parallel infra-prod execution as sanity checks if the hourlies look good first  16:06
<clarkb> but reviews now might be helpful  16:06
<fungi> both lgtm  16:09
<clarkb> should we make a bindep release?  16:16
<fungi> ah, yes probably... infra-root: last call for input on whether 938570 and 940711 make sense as a pattern to align our projects with broader python packaging community expectations  16:20
<fungi> if there's a general preference for keeping requirements.txt, test-requirements.txt and doc/requirements.txt files in our packages, or a desire to implement the latter two with "extras" in the pyproject.toml and reference those from our noxfile.py or something, then we can release bindep as-is  16:22
<fungi> the python packaging community argument against requirements.txt in particular (938570) is that we're marking our package dependencies as dynamically resolved at wheel build time, which means the wheel has to be downloaded and built in order to identify them instead of just checking published metadata  16:24
<clarkb> I think if I had a preference, keeping the test requirements separate to make it easier to run without using nox seems like a good idea. But rolling true deps into the pyproject.toml to match upstream expectations makes sense to me  16:24
<fungi> note that for 940711 we could stick those into extras in the pyproject.toml and then list bindep[docs] or bindep[linters] or bindep[coverage] or bindep[unit] or whatever in noxfile.py  16:25
<fungi> that way people who want to install them some other way still can  16:26
<fungi> i can work up an alternative if people want to see it in action  16:26
<clarkb> those might become an attractive nuisance for people thinking if they install them they get extra features/functionality  16:28
<clarkb> (which is typically how extras are used aiui)  16:28
<corvus> clarkb: i think opendev-ca is idempotent -- here's what it would be doing for each server https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/opendev-ca.sh#L98  16:28
<corvus> clarkb: at least, once it's established  16:28
<corvus> the first time it ever runs, it would need to create the ca cert, that could race  16:28
<clarkb> maybe I should update my todo to move opendev-ca calls into bootstrap-bridge then we can remove the serialization of those jobs?  16:29
<clarkb> maybe not move opendev-ca there but call it there first at least?  16:29
<corvus> yeah, if we call it once with, say, a dummy server name, that would ensure that the ca part is done once on bootstrap  16:30
<corvus> (we could also change the script to make it so it doesn't need a server name, but as of today, it expects one)  16:30
<clarkb> ok let me update the parallelization change to keep serializing those 4 jobs but I'll add a TODO that indicates we can stop serializing once we have bootstrap-bridge calling opendev-ca  16:30
<corvus> hrm  16:31
<corvus> i'm trying to avoid serializing them  16:31
<clarkb> I guess today it's safe to not serialize them because we have an existing bridge and ca  16:31
<corvus> can we just make that change now?  16:31
<corvus> right, the only issue would be in DR  16:31
<corvus> (or we decide to make a new ca)  16:32
<clarkb> maybe. The "problem" is bootstrap-bridge is actually doing minimal stuff on bridge and only running ansible from zuul not from bridge aiui  16:32
<clarkb> so I'm not sure how to port that into bootstrap-bridge yet  16:32
<corvus> oh, then... we could flock  16:32
<clarkb> oh ya that would work. I can write that edit  16:33
<clarkb> I'll update the parallelization change to drop serializing those jobs and add a flock to the opendev-ca script  16:34
<corvus> cool, that sounds great  16:34
<corvus> the other thing i looked at is https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/opendev-ca/tasks/main.yaml -- but i think those copy commands will be idempotent, so should be fine.  16:35
<clarkb> oh there is already a flock in there  16:38
<clarkb> heh  16:38
<clarkb> there isn't a timeout on that flock call. Which is probably ok but I could add one  16:39
<corvus> oh it's in the role!  16:39
<clarkb> I'll leave it as is for now since it has apparently been working  16:39
<opendevreview> Clark Boylan proposed opendev/system-config master: Run infra-prod jobs in parallel  https://review.opendev.org/c/opendev/system-config/+/943488  16:40
<clarkb> the actual period of time where things contend for the ca should be very small after the initial setup so this is probably fine as is  16:41
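The flock pattern under discussion is roughly the following; this is a sketch rather than the actual opendev-ca role, and the lock path and timeout are illustrative.

    # Serialize CA operations across concurrently running playbooks.
    # The -w timeout is the optional addition mentioned above; dropping
    # it makes the script wait indefinitely for the lock.
    exec 9>/var/lib/opendev-ca/.lock
    flock -w 600 9 || { echo "timed out waiting for CA lock" >&2; exit 1; }
    # ... create the CA on first run, then issue the per-server cert ...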
<clarkb> if nothing else I feel a lot more confident that the ca won't give us unexpected problems now :)  16:42
<opendevreview> Merged zuul/zuul-jobs master: Update zookeeper latest version to 3.9.3  https://review.opendev.org/c/zuul/zuul-jobs/+/943861  17:05
<opendevreview> Jeremy Stanley proposed opendev/bindep master: Drop requirements.txt  https://review.opendev.org/c/opendev/bindep/+/938570  17:18
<opendevreview> Jeremy Stanley proposed opendev/bindep master: Drop auxiliary requirements files  https://review.opendev.org/c/opendev/bindep/+/940711  17:18
<fungi> clarkb: ^ redid 940711 based on extras, and also elaborated on the rationale for both in their commit messages  17:19
<clarkb> having the test- prefix on the extras should help mitigate confusion over whether that adds new functionality  17:19
<fungi> yep, exactly why i went with that after you raised the concern  17:20
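The extras-based layout being described would look something like the excerpt below; the group names and listed packages are hypothetical and may not match the actual 940711 revision.

    # pyproject.toml (excerpt)
    [project.optional-dependencies]
    test-unit = ["stestr", "testtools"]
    test-cover = ["coverage"]
    docs = ["sphinx"]

A noxfile session can then install the project as .[test-unit] (or bindep[test-unit] once published), which is the distinction raised in the review question that follows.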
<opendevreview> Jeremy Stanley proposed opendev/bindep master: Drop setup.cfg  https://review.opendev.org/c/opendev/bindep/+/943976  17:21
<clarkb> fungi: question on https://review.opendev.org/c/opendev/bindep/+/940711 about behavior differences between bindep[] and .[] usages  17:21
<opendevreview> Jeremy Stanley proposed opendev/bindep master: Drop setup.py  https://review.opendev.org/c/opendev/bindep/+/943977  17:22
<fungi> 943976 and 943977 are wip, just want somewhere to test more pbr improvements later  17:22
<clarkb> ++  17:22
<clarkb> fungi: ok I think we should probably expand the coverage deps in that case (I left a note on the change about it)  17:31
<fungi> clarkb: actually, no it works exactly how we want now that i think about it  17:32
<fungi> because right now bindep[test-unit] doesn't exist anywhere except locally in that change  17:32
<fungi> and it still gets the right deps installed  17:33
<clarkb> oh right so if the coverage test is happy then it should work  17:33
<clarkb> and that job was successful, let's see if we ran tests and collected coverage  17:33
<clarkb> seems to have done so. Ok no change necessary  17:34
<fungi> yeah, i ran all those locally before i pushed them  17:34
<clarkb> I updated my review to +2 I think both of those changes lgt  17:35
<clarkb> *lgtm  17:35
<fungi> as for 943976 and 943977 the failure modes are subtly different between the two of them  17:37
<clarkb> ya for 943977 we don't have a way of enabling pbr so pbr never runs aiui  17:37
<fungi> though ultimately the exception for both is the same  17:38
<fungi> ERROR Backend subprocess exited when trying to invoke get_requires_for_build_sdist  17:38
<clarkb> it seems that we're using an old distutils flag=value support option via setup() being called explicitly there that we can't set up another way  17:38
<clarkb> though we may be able to hook into a different system to enable the pbr subsystem  17:38
<clarkb> Any idea how things like setuptools-scm do it?  17:40
<clarkb> that's probably the thing to look at if we want to take the next step in PBR updates to make it fit into the modern stuff better  17:40
<fungi> i haven't looked, though probably step 1 is working out how to fix 943976  17:40
<clarkb> ya I guess those may be related in subtle ways as it all comes down to feeding options/config into setuptools  17:41
<clarkb> infra-root thoughts on if/when we want to try https://review.opendev.org/c/opendev/system-config/+/943488 and run infra-prod jobs in parallel?  17:41
<fungi> right now that comes down to https://opendev.org/openstack/pbr/src/branch/master/pbr/core.py#L98 insisting on looking for a setup.cfg file  17:42
<fungi> i wonder if it would be possible to just drop that check  17:42
<fungi> oh, though i think there's going to be an earlier failure to fix too  17:43
<opendevreview> Jeremy Stanley proposed opendev/bindep master: Drop metadata.name from setup.cfg  https://review.opendev.org/c/opendev/bindep/+/943981  17:44
<fungi> that ^ should demonstrate  17:44
<clarkb> random followup notes on things from last week: rax-flex is looking happy in both sjc3 and dfw3, gitea response times continue to appear to be stable so I'm feeling more and more confident in the memcached fix for gitea's broken internal memory caching, and manually checking docker hub rate limits I still personally appear to get 100 per 6 hours  17:49
<clarkb> if jamesdenton returns to the channel we should ask about bumping quotas up if they are in a position to support that in rax flex  17:50
<fungi> okay, so https://opendev.org/openstack/pbr/src/branch/master/pbr/hooks/metadata.py#L32 is where pbr insists on pulling metadata.name from setup.cfg but i suppose trying to rip that out is going to unravel a much longer thread  17:51
<clarkb> we'd need to discover the name from somewhere else. In theory we could get it from wherever setuptools stores it if it has read it by the time pbr is called and needs it  17:52
<fungi> it wants to pass the package name to packaging.get_version()  17:52
<fungi> yeah  17:52
<clarkb> could also just do a if os.path.exists(pyproject.toml) load toml and get name from there else look in setup.cfg fallback  17:53
<fungi> elsewhere it's relying on setuptools.Command.distribution.get_name()  17:55
<fungi> i wonder if that could work as a fallback  17:55
<fungi> or if that's actually plumbed into the same pbr.hooks.metadata.MetadataConfig.get_name() call  17:56
<clarkb> I'm just going to remove docker hub rate limit discussion from the meeting agenda due to my testing showing things haven't changed yet  17:57
<clarkb> fungi: it's possible setuptools hasn't bootstrapped enough state at that point too  17:57
<clarkb> but ya if that works that may be the most correct option  17:57
<clarkb> my first pass of agenda updates is in  18:17
<clarkb> let me know if we want to add/remove/edit anything else  18:18
<fungi> i'm digging into the possible pbr change for pulling the package name from somewhere other than setup.cfg, but need to take a break to go grab a late lunch. should be back in under an hour  18:18
<clarkb> you're still on standard time scheduling (then it's not a super late lunch)  18:29
<fungi> too true  19:26
<fungi> anyway, it's done now regardless  19:27
<fungi> having mulled it over, i suspect querying setuptools for the project name in the pbr hook is going to end up being circular, so probably need to find a way to break that cycle  19:28
<clarkb> ya I wondered about that  20:33
<fungi> the underlying problem is that other calls pbr makes need to pass the project name, so that has to be resolved first. i agree finding a way to pull that out of pyproject.toml as an alternative to setup.cfg is probably a cleaner solution  20:37
<fungi> just treat it as another source of metadata configuration essentially. i'm coming around to that idea  20:37
<clarkb> ya that seems like a simple solution  20:40
<clarkb> and you can continue to fall back on setup.cfg if pyproject.toml data isn't present  20:40
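A rough sketch of that fallback, assuming it were added to pbr's metadata handling; the function name and surrounding wiring are hypothetical, not the eventual implementation.

    import configparser
    import os
    import tomllib  # Python 3.11+; older interpreters would need the tomli backport


    def get_project_name():
        """Return the package name from pyproject.toml, else from setup.cfg."""
        if os.path.exists("pyproject.toml"):
            with open("pyproject.toml", "rb") as f:
                name = tomllib.load(f).get("project", {}).get("name")
            if name:
                return name
        # Fall back on the existing setup.cfg [metadata] section.
        parser = configparser.ConfigParser()
        parser.read("setup.cfg")
        return parser.get("metadata", "name", fallback=None)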
<clarkb> corvus: tonyb did you want to weigh in on the parallel infra-prod stuff or should we proceed with that now?  20:42
<clarkb> https://review.opendev.org/c/opendev/system-config/+/943488 is the change  20:42
<clarkb> we'll want to monitor the hourly runs after that lands and then decide if we need to revert or proceed with daily runs at 0200  20:42
<fungi> i expect to be around at 02:00 utc (especially now that it's earlier relative to local time for me), so can keep an eye on it as well  20:45
<clarkb> ya I can probably be around as well  20:45
<corvus> clarkb: that change is much shorter now! lgtm  20:48
<clarkb> ok I guess I'll approve it now?  20:50
<clarkb> I approved it  20:50
<fungi> thanks!  20:54
<opendevreview> Merged opendev/system-config master: Run infra-prod jobs in parallel  https://review.opendev.org/c/opendev/system-config/+/943488  21:00
<clarkb> the 2100 hourlies should apply that I think  21:00
<clarkb> as they haven't enqueued yet  21:00
<fungi> i've noticed a lag of ~2 minutes for hourly enqueues  21:01
<clarkb> ya the jitter is set to 120 seconds  21:01
<clarkb> it does seem like it always runs 120 seconds late  21:01
<fungi> aha, that would indeed explain it  21:01
<clarkb> rather than some random value between 0 and 120 seconds  21:01
<fungi> 120 was randomly selected as a value between 0 and 120  21:01
<clarkb> yup the deploy job is running first before the hourlies so I think the hourlies should include it (we should know in just a few minutes)  21:01
<fungi> https://xkcd.com/221/  21:02
<clarkb> hourly has begun. It is doing the single bootstrap-bridge job to start though  21:03
<clarkb> it paused and two jobs are starting now as expected  21:04
<clarkb> service-bridge and service-nodepool  21:04
<fungi> so far so good  21:05
<clarkb> ps -elf on bridge confirms both playbooks are running  21:05
<clarkb> system load is ~1  21:05
<clarkb> peaked at 1.23 and falling now  21:06
<clarkb> zuul has more nodes to interact with and may spike load higher. But it's good we have that check  21:06
<fungi> hopefully we can crank it up well above 2 in that case  21:06
<clarkb> ya there are 8 vcpus. I'm guessing 5-6 may be a good sweet spot  21:06
<clarkb> service-bridge succeeded. service-registry is running  21:07
<clarkb> it will also vary by the scheduling overlap of some jobs. We could probably micromanage ordering to get a good bucket fill on cpu demand  21:07
<clarkb> but I suspect that may be more effort than it is worth maintaining over time. Just take the wins if this works and call it a day  21:08
<clarkb> registry succeeded. Now zuul and nodepool are running concurrently  21:08
<clarkb> nodepool succeeded  21:11
<clarkb> https://zuul.opendev.org/t/openstack/buildset/7d0d93da5bbe4739b8438ef25bfba7cb says we ended 20:55 minutes:seconds after the hour  21:12
<clarkb> sorry, for the 1900 hourly run  21:12
<clarkb> 20:00 run ended at 27:51  21:13
<clarkb> this one is on track to run at 20:15 ish  21:13
<clarkb> *21:15 ish  21:13
<clarkb> my local wifi is being randomly flaky with high rtt times  21:14
<clarkb> but things are looking good in hourlies so far  21:14
<clarkb> fungi: it occurs to me if we really want to exercise all the things before the dailies we might be able to do that by landing your change to clean up the old sjc3 config  21:15
<clarkb> do we want to do that?  21:15
<clarkb> and I might need to go manually reboot the AP  21:15
<clarkb> the zuul job is pulling new images so it is slow  21:16
<clarkb> at least that is what it looks like at the tail of the log.  21:16
<fungi> yeah, 943625 would be a good way to flush the pipes  21:17
<fungi> approved it just now  21:17
<clarkb> https://zuul.opendev.org/t/openstack/buildset/578abc171bae4bc6a75666168d588ddb completed with success so hourlies are happy with the parallel execution  21:21
<clarkb> ianw: ^ fyi  21:21
<clarkb> 943625 should land shortly providing us with round two of parallel infra-prod feedback  21:39
<opendevreview> Merged opendev/system-config master: Clean up old Rackspace Flex SJC3 project  https://review.opendev.org/c/opendev/system-config/+/943625  21:42
<fungi> here goes  21:42
<clarkb> it didn't enqueue as much stuff as I expected but still a good shakedown  21:43
<clarkb> there is basically no overlap with the hourlies  21:43
<clarkb> the two bridge jobs are it  21:43
<clarkb> cool, infra-prod-base is running alone as expected too  21:43
<clarkb> (everything depends on that job)  21:43
<fungi> at least a good double-check  21:44
<clarkb> I think I see a small bug, one we can live with for now I suspect. Patch momentarily  21:45
<opendevreview> Clark Boylan proposed opendev/system-config master: Have puppet depend on letsencrypt  https://review.opendev.org/c/opendev/system-config/+/943992  21:48
<clarkb> it's ^ basically some puppeted services do use LE certs so we want LE to run first then puppet can copy the cert over and use it. Minor issue because I think we're eventually consistent there  21:49
<ianw> omg thanks for pushing this through clarkb!  i knew it was kind of there in the background and cool to see it working  21:51
<clarkb> hrm, infra-prod-base failed  21:51
<fungi> thanks for your work on it ianw!  21:52
<clarkb> Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz  File has unexpected size (364788 != 364784). Mirror sync in progress?  21:52
<clarkb> on mirror01.regionone.osuosl.opendev.org  21:52
<clarkb> so not related to parallel execution  21:52
<clarkb> fungi: I think you can safely reenqueue that change if you think we should rather than wait for hourlies  21:53
<clarkb> sorry dailies  21:53
<fungi> i'll reenqueue it, would be good to have thorough testing  21:53
<clarkb> ++  21:53
<ianw> mirror01.regionone.osuosl.opendev.org : ok=36   changed=2    unreachable=0    failed=1    skipped=6    rescued=0    ignored=0  21:54
<fungi> reenqueued 943625,1 into deploy  21:54
<ianw> oh, haha you found it  21:54
<clarkb> I wonder why we trigger manage-projects on that update  21:55
<clarkb> it shouldn't matter but I think it should noop too  21:55
<fungi> yeah, not sure, i don't see any files that are likely candidates  21:55
<clarkb> we don't have a file matcher on that job  21:56
<clarkb> so it must run a lot?  21:56
<fungi> oh, yeah i guess it runs all the time  21:56
<clarkb> I wonder if part of the reason for that is we trigger it from project-config changes too and file matchers for system-config won't necessarily match project-config file matchers and vice versa?  21:58
<clarkb> though we should be able to have two disjoint lists as any match causes the job to run  21:58
<clarkb> oh the file matchers are in project.yaml and we match on inventory/.*  22:04
<clarkb> not sure we need to do that but seems safe enough for now  22:04
<clarkb> fungi: I think it failed again  22:04
<clarkb> should we add the server to the emergency list and try again  22:06
<clarkb> oddly nb04 has no problems (not trying to install packages I guess)  22:06
<fungi> Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-updates/main/binary-arm64/Packages.xz  File has unexpected size (364788 != 364784). Mirror sync in progress? [IP: 185.125.190.17 80]  22:07
<clarkb> ya and the timestamp on the Release file is from 20:30 ish today  22:07
<fungi> that seems to be the error coming from mirror01.regionone.osuosl  22:08
<clarkb> so seems likely that there is an in progress mirror update  22:08
<clarkb> and they haven't updated the signatures yet?  22:08
<fungi> maybe  22:08
<fungi> haven't updated the hashes anyway  22:08
<fungi> Release file created at: Mon, 10 Mar 2025 20:31:52 +0000  22:09
<clarkb> I think it is safe to reenqueue even though the hourlies are running because we haven't merged any other changes fwiw  22:09
<clarkb> so hourlies updated to the state that change was looking at  22:09
<fungi> enqueued again  22:09
<clarkb> (just something to consider when reenqueuing even before the refactor)  22:09
<clarkb> fungi: do you want me to put that server in the emergency file?  22:09
<fungi> let's see if it fails a third time  22:10
<clarkb> ack  22:10
<fungi> as for sequencing issues, if reenqueuing deploy for the most recently merged change is unsafe, that would be good to know  22:10
<fungi> but as far as i can tell we should never install anything newer than the newest state in the hourly run  22:11
<clarkb> fungi: yup I was just making that explicit  22:11
<fungi> time machines notwithstanding  22:11
<clarkb> I think if we merged another change now we'd be in an awkward spot. But since you are reenqueuing the last merged change it's all the same between deploy and hourly  22:12
<fungi> yeah, same goes for most projects really. if i'm reenqueuing into a post-merge pipeline i never inject anything other than the most recently merged change  22:13
<clarkb> ++  22:13
<fungi> anything else could cause rollback problems  22:13
<clarkb> kids should be home from school soon so I may step out for a bit to get a rundown of how that went. But if it fails again I think the next step may be to put that one host in the emergency file and reenqueue for a fourth time  22:13
<fungi> yeah, can do  22:14
<fungi> i'll keep an eye on it  22:14
<fungi> and also troubleshoot locally on the server if it persists  22:14
<fungi> probably something went sideways with ubuntu's arm64 package mirrors  22:14
<fungi> yeah, still broken  22:24
<fungi> running an apt update locally on mirror01.regionone.osuosl doesn't error  22:26
<fungi> i'll reenqueue it one more time  22:26
<clarkb> weird that it wouldn't error locally unless doing it manually unstuck something or chose a different mirror  22:31
<fungi> probably just timing  22:32
<clarkb> seems to be running this time  22:38
<clarkb> so far things look good here too which is reassuring  22:39
<fungi> yeah, we probably just had to give consistency time to eventual itself  22:39
<clarkb> if we do decide that something is horribly wrong during periodic jobs we can either run the disable ansible command to write out the file and reason or move the zuul user authorized keys aside  22:40
<clarkb> either way should sufficiently stop jobs from continuing to execute  22:40
<fungi> and we'll still have a few hours of breathing room until the 02:00 run  22:40
<clarkb> yup  22:40
<clarkb> and if we do that all we need to go back to a safe state within the zuul config is revert the change that set the semaphore limit to 2  22:41
<clarkb> we ran with it at 1 for some time and should be fine back there  22:41
<clarkb> system load was higher for this change than hourlies but doesn't seem like it was by much  22:47
<clarkb> probably our next step up will be a semaphore limit of 4 to double what we currently have then check system load from there  22:48
<clarkb> but we should run with just 2 for a bit first  22:48
<fungi> yeah, later this week would be good  22:48
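The limit being discussed is a Zuul semaphore, so the later bump would be a one-line change along these lines; the semaphore name shown here is a guess at what system-config actually uses.

    - semaphore:
        name: infra-prod-playbook
        max: 2  # raise to 4 once running at 2 has proven stable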
<clarkb> https://zuul.opendev.org/t/openstack/buildset/3424e9e068da43069256154677105514 that buildset was successful  22:51
<fungi> yup, lgtm  23:05
<clarkb> I guess we have no signal indicating we should rollback. In that case I'll do my best to check periodic after dinner etc  23:17
<clarkb> last call on meeting agenda items  23:37
<fungi> yeah, still lgtm  23:53
<fungi> i may not stick around for the 02:00 utc run, but can check it when i get back to the keyboard in the morning  23:54
