Monday, 2025-08-18

*** mrunge_ is now known as mrunge14:09
clarkbtoday is the day we announced we would switch all opendev zuul tenants to ansible 11 by default. Before we do that I would like to land https://review.opendev.org/c/zuul/zuul-jobs/+/957188 to fix a known issue with ansible 11. Then I don't think we have a change to the zuul tenant config yet; should I write one? cc infra-root and probably corvus in particular15:01
corvusclarkb: +3 on 188 ; i can write the tenant change in a few mins15:05
clarkbperfect thanks!15:05
fungii'm going to pop out for a quick lunch, should be back soon15:10
clarkbthe other item on my monday todo list is checking on zuul upgrades/restarts and I think it may have failed again...15:19
clarkblooks like the issue there according to the log is that the zuul launcher servers ran out of disk space, causing ansible to fail pulling more docker images? I'm guessing (without looking yet) that log files are still consuming a lot of disk15:20
clarkbwe should probably consider running the rolling upgrades and reboots playbook manually out of band once we've got the launchers in a happier state15:21
clarkbconfirmed there is 26GB in /var/log/zuul15:23
clarkbwhich is >50% of available disk space15:23
clarkbhuh it looks like on zl01 we're still failing to rotate the log files properly15:23
clarkbbut there is also a .old log file that is 17GB large15:24
clarkbI wonder if log rotation isn't happening because python doesn't have enough disk space to do it?15:24
clarkbif it does a copy-truncate I think we need at least as much room on disk to copy the file before the old file gets trimmed, which may not be the case here15:25
clarkbI wonder if we can configure python to keep a single -debug log file and limit its total size to say 5GB or something along those lines15:26
clarkbthen keep a longer history in the non debug file (which is much smaller)15:26
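A capped-size debug log along those lines could look like this hypothetical sketch using Python's stdlib logging handlers — the paths, sizes, and logger names here are illustrative assumptions, not the actual zuul configuration:

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical sketch; real handlers would point at /var/log/zuul.
log_dir = tempfile.mkdtemp()

# Cap the -debug log at ~5GB total via size-based rotation:
# 1GB per file with 4 backups kept means at most ~5GB on disk.
debug_handler = logging.handlers.RotatingFileHandler(
    os.path.join(log_dir, "launcher-debug.log"),
    maxBytes=1 * 1024**3,
    backupCount=4,
)
debug_handler.setLevel(logging.DEBUG)

# Keep a longer history of the much smaller non-debug log.
info_handler = logging.handlers.TimedRotatingFileHandler(
    os.path.join(log_dir, "launcher.log"),
    when="midnight",
    backupCount=30,
)
info_handler.setLevel(logging.INFO)

logger = logging.getLogger("zuul-sketch")
logger.setLevel(logging.DEBUG)
logger.addHandler(debug_handler)
logger.addHandler(info_handler)
logger.debug("debug-only line")
logger.info("info line")
```

A side benefit: RotatingFileHandler rotates by renaming rather than copy-truncating, so rotation does not need free disk space equal to the log's size.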
corvuswell, the new version should have much smaller logs15:29
corvusso how about we just do whatever is needed to get the upgrade done, then re-evaluate15:29
clarkbworks for me. I think we can probably start by deleting the .old files then optionally rebooting the server (since /tmp was full?) and then run the upgrade playbook in a screen on bridge?15:30
clarkbbut I'll defer to you on what bits of the current log files are most likely to be useful15:30
corvusi deleted some stuff in my homedir on those which freed up some space15:31
corvusi think we can dump all the logs15:32
clarkbso maybe stop service, delete logs, reboot, start service?15:33
corvusstop; delete; pull; reboot; start ?15:33
clarkbya that sounds good. Do you want to do that or should I? (and I guess one launcher at a time to continue to provide continuous service)15:34
corvusclarkb: go for it15:34
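The agreed stop/delete/pull/reboot/start sequence might be scripted roughly like this dry-run sketch — the unit names, log paths, and commands are all assumptions for illustration, not the actual deployment:

```python
import subprocess

# Hypothetical per-launcher sequence: stop the service, delete the
# oversized logs, pull new images, reboot, then start again.
STEPS = [
    ["systemctl", "stop", "zuul-launcher"],
    ["rm", "-f", "/var/log/zuul/launcher-debug.log"],
    ["docker-compose", "pull"],
    ["reboot"],
    # after the host is back up:
    ["systemctl", "start", "zuul-launcher"],
]

def run_steps(steps, dry_run=True):
    """Print (dry run) or execute each step in order."""
    executed = []
    for cmd in steps:
        if dry_run:
            print("+", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed
```

Running `run_steps(STEPS)` just prints the plan; a real run per launcher (one at a time, as discussed) would pass `dry_run=False` up to the reboot, then start the service after the host returns.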
clarkbok I'll start on zl0115:34
opendevreviewJames E. Blair proposed openstack/project-config master: Update all tenants to default to Ansible 11  https://review.opendev.org/c/openstack/project-config/+/95774815:35
clarkbcorvus: the non debug logs are relatively small should I keep those?15:35
corvusnah15:35
clarkbok rebooting zl01 now15:37
clarkbunbound didn't start up properly (I suspect due to the race where it tries to bind to an interface that isn't up yet) so I'm double checking on that before I proceed15:39
clarkberror: failed to read /var/lib/unbound/root.key15:39
corvusmaybe fallout from full disk15:40
clarkbya I'm remembering now this file gets updated periodically and when disk is full we write a truncated version15:40
clarkbI think the solution is to reinstall the package but I'm trying to remember that now15:40
opendevreviewMerged zuul/zuul-jobs master: Vendor openvswitch_bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/95718815:40
clarkbdns-root-data I think it is this package15:41
clarkboh heh with dns failing I can't run apt updates15:41
clarkbfungi is at lunch and I think he fixed this last time. Trying to remember what the secret was. Maybe it's a local command15:42
corvusunbound-anchor may install it15:42
corvusnope, that didn't fix zl0215:43
clarkbI think there may be a unit file somewhere15:44
corvushuh, stop/start of unbound fixed it on zl0215:44
johnsomThe file is not unique, you could grab one from another instance15:44
clarkbcorvus: oh ok I'll just try that then15:45
corvusclarkb: i just did it on zl0115:45
corvusi removed the file first15:45
clarkbcorvus: oh heh I did it too. Seems to be running now15:45
corvusso i think the incantation may be: rm /var/lib/unbound/root.key; systemctl start unbound15:45
clarkbcorvus: I'll start zuul launcher there15:46
clarkbcorvus: how long do I need to wait before doing this process on zl02?15:47
corvusno time at all15:47
clarkbcool I'm starting on zl02 now then15:47
clarkband done15:50
clarkbcorvus: do you think we should start the zuul reboot playbook now? Or is it better to let the dust settle on ansible 11 or just wait until this evening local time which tends to be quieter (or at least fewer humans expecting shorter round trip times)15:52
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/957750 Remove some launcher log messages [NEW] 15:52
corvussome more logspam cleanup15:52
corvusi think it's fine to run the playbook now15:52
corvusi think we have more executor capacity than node capacity at the moment, so people shouldn't notice.15:53
clarkback I'll start a root screen on bridge and run it there15:53
clarkbit is running15:54
clarkbjohnsom: ya I figured there was a way to workaround it I just remembered there was a process to make it easy and corvus figured it out15:55
clarkbas a side note I wonder if unbound could be more defensive about this like write the file to a tmp location then atomically rename it if the content is complete15:56
johnsomCool, I know it checks for updates on startup, I just wasn't sure if it would work without a resolver or not.15:56
clarkbyou're probably still risking running out of inodes that way but that is less common for us at least15:56
johnsomYeah, it certainly could use the rename/hard link trick to keep the old one until a new one is complete. Probably worth a bug report.15:59
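The write-to-temp-then-rename pattern described above is a standard defense; a minimal sketch in Python (a hypothetical helper, not unbound's actual code):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, fsync, then
    # atomically rename over the old file. On a full disk the
    # write or fsync fails first, leaving the old file untouched.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the rename only happens after a successful write, a full disk can never leave a truncated root.key behind — the failure mode seen here.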
fungicatching up to find out what i fixed16:21
fungiclarkb: i think unbound may install a cronjob to update the root key?16:23
fungiso when the rootfs is full and it tries to overwrite the file it ends up empty because it doesn't bother to do an atomic replace instead16:24
fungihttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-06-06.log.html#opendev.2025-06-06.log.html%23t2025-06-06T17:28:49 is where i fixed it last time, for reference16:28
fungiremoved the empty file and restarted the service, looks like16:28
clarkbyup that is what corvus did too16:31
clarkbI think we are ready to land https://review.opendev.org/c/openstack/project-config/+/957748 at any time today. The ovs bridge fix landed and https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/EUTEAOIUQZ6YJUQWD75QO34H5TDGBLKP/ announced the update for today16:48
fungiapproved16:50
fungii'll send something to openstack-discuss too16:50
clarkbI guess one question is whether or not the version of zuul running on the executors includes 1116:55
clarkb/var/lib/zuul/ansible has an 11 in it so we should be fine16:55
clarkband ze01 is waiting on 2 jobs to complete before it gets upgraded and restarted16:57
opendevreviewMerged openstack/project-config master: Update all tenants to default to Ansible 11  https://review.opendev.org/c/openstack/project-config/+/95774817:00
clarkbfungi: one thing on my todo list after last week's meeting was to sync up with you on upgrading some of the remaining servers in our upgrade list17:11
clarkbI think you were thinking of doing the kerberos and afs related servers in place?17:11
fungiyeah, i haven't started on the planning for those yet. i expect the kerberos servers might actually be the most sensitive just because of version changes, the openafs version likely won't change as we already build backported packages of newer openafs we're using17:13
clarkbya I think some of the afs servers may be a minor version update behind due to some updates they weren't restarted for17:13
clarkbbut its the same general version17:13
clarkbre kerberos: has it changed much in the last say 35 years?17:13
funginope, which is why i'm not really worried about it either17:14
clarkbwe can also do one distro version at a time on both kerberos servers before going to the next distro version just to give them a chance to sync up with a smaller delta17:14
clarkbthe infra-prod-service-zuul job to update to ansible 11 by default just succeeded17:15
clarkbso any jobs started from ~nowish should default to 1117:15
clarkbthough actually the hourly jobs may have updated it about 8 minutes ago too17:17
clarkbI'm seeing successful jobs with ansible version 2.18.7 which I believe maps to 1117:26
fungihttps://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html indicates that ansible community 11.x maps to ansible-core 2.18.x17:28
fungiagreed17:28
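That correspondence can be captured in a tiny hypothetical helper, with versions per the Ansible release and maintenance table linked above (community 9.x ships ansible-core 2.16, 10.x ships 2.17, 11.x ships 2.18):

```python
# Map an ansible-core major.minor to the Ansible community
# package major that bundles it (illustrative helper).
CORE_TO_COMMUNITY = {
    "2.16": 9,
    "2.17": 10,
    "2.18": 11,
}

def community_version(core_version):
    """Return the community major for an ansible-core version string."""
    major_minor = ".".join(core_version.split(".")[:2])
    return CORE_TO_COMMUNITY[major_minor]
```

So a job reporting ansible version 2.18.7 is indeed running under the Ansible 11 default.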
clarkbfor the meeting agenda I plan to drop the ansible 11 topic unless something comes up related to it, drop the voip.ms cleanup as that is well underway and in the hands of the service provider now, add rax-flex iad3 bootstrapping. Anything else to change/edit/add/delete?17:52
clarkboh also no one has nominated for service coordinator. I feel like we should play rock paper scissors to see who is next :P17:58
* corvus touches nose18:06
clarkbI've been spot checking job failures and so far the ones I have checked all appear related to the job payload not the ansible version18:12
stephenfinclarkb: fungi: jfyi, looks like there's a bit more to be done with pbr yet https://bugs.launchpad.net/pbr/+bug/211145918:20
stephenfinlikely won't be this week (OpenShift 4.20 feature freeze)18:20
clarkbstephenfin: so I think the primary use of easy_install is to get the script writer class, which I think even then the main functionality is around looking up paths to install to?18:33
clarkbstephenfin: but for regular cli scripts I think we had decided that pip install is writing them out in a way that doesn't force slow/expensive entrypoint lookups (this is the reason pbr overrode the ones in setuptools which are still slow but if everyone is using pip install and not setup.py install it doesn't matter anymore)18:34
clarkbthen pbr also grew the wsgi script support using the same mechanism. It may be possible to just treat wsgi scripts as cli scripts?18:35
clarkbevery project would need to write the shim that wsgi-ifies things and set that as the cli script target though18:35
clarkbI do think there is a light at the end of this tunnel but there may be some slight inconvenience to how wsgi entry point scripts are written18:37
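Treating a wsgi app as a plain console script could look like this sketch — the trivial app and the stdlib wsgiref server are illustrative stand-ins for whatever shim a project would actually write:

```python
from wsgiref.simple_server import make_server

def application(environ, start_response):
    # The project's real WSGI app would be imported here; this
    # trivial app is just a placeholder for illustration.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

def main():
    # A plain console_scripts entry point wrapping the WSGI app,
    # avoiding any pbr-specific wsgi_scripts machinery.
    with make_server("", 8000, application) as server:
        server.serve_forever()
```

With `main` registered under console_scripts, pip writes the script out the normal fast way and no separate wsgi script mechanism is needed.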
clarkbnow I'm going to pop out for lunch. Back in a bit18:41
*** dhill is now known as Guest2440918:43
fungiinteresting idea: pypi started checking the domains for e-mail addresses on accounts and locking them out when their domains expire... https://blog.pypi.org/posts/2025-08-18-preventing-domain-resurrections/19:27
opendevreviewMerged zuul/zuul-jobs master: Add default for build diskimage image name  https://review.opendev.org/c/zuul/zuul-jobs/+/95621920:56
clarkbthe zuul reboot playbook is waiting for ze04 now which is the last one that got updated the last time we managed to do updates21:04
clarkbcorvus: the launcher debug log is up to 1.2GB on zl01 fwiw. https://review.opendev.org/c/zuul/zuul/+/957750 hit a gate failure so I rechecked it21:06
opendevreviewClark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management  https://review.opendev.org/c/opendev/system-config/+/95781021:21
clarkbI've made my edits to the meeting agenda21:26
clarkbwill send that out in an hour or two21:27
corvusclarkb: ack; we might need to manually clean out the logs and restart when that change merges21:28
clarkbinfra-root I think ansible 11 doesn't like our bionic test nodes21:38
corvusi think they dropped some old python versions21:38
clarkbthe base job on 957810 seems to fail during fact discovery on some json load error. If all of the job retries fail for that change then I guess I'll push up a change to pin to ansible 9 for now while we sort out dropping bionic or doing something else etc21:38
clarkbcorvus: ya I suspect that is the cause21:38
corvushttps://docs.ansible.com/ansible/latest/porting_guides/porting_guide_10.html#command-line21:39
corvusactually happened with ansible 1021:39
fungii guess we can pin those jobs back to 9 until we clear out the rest of the bionic servers21:43
fungiare there many left?21:44
clarkbfungi: mostly the ones of the upgrade todo list that we're already talking about21:45
clarkbthough I think many of those are focal too21:45
clarkbso ya I think we pin then trim21:45
fungiright, i mean not the focal ones since they're still fine for now21:46
clarkbI'm already updating the jobs as a result of this for mirrors. We don't have any focal or bionic mirrors so we can drop those mirror node types from testing21:46
fungioh even better21:46
fungiso mostly the krb/afs servers then21:46
fungier, at least some of those are focal actually21:48
fungibionic is 18.04/lts right?21:48
clarkbyes21:49
fungiyeah, looks like the afs servers were upgraded to or rebuilt on focal a while back21:51
corvusi don't want to be pessimistic -- i'm optimistic the bionic nodes will be gone soon!  but... considering that ansible 10 dropped it, and we skipped that in zuul, and 9 is way eol, we should probably entertain the idea that ansible 9 might not stay in zuul as long as we have bionic nodes, and think about what we would do in that case.21:54
corvusmaybe just switch the test jobs to the oldest supported platform and note the discrepancy?21:54
clarkbya that is probably the next best thing21:54
clarkbI think for the systems that are impacted the delta between bionic and focal is also small21:55
corvussounds reasonable then21:55
opendevreviewClark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management  https://review.opendev.org/c/opendev/system-config/+/95781021:55
opendevreviewClark Boylan proposed opendev/system-config master: Fix system-config-run for Ansible 9 vs 11  https://review.opendev.org/c/opendev/system-config/+/95781121:55
clarkbthis is the first step.21:55
clarkbbasically it's testing that our base roles work on bionic (iptables, etc which we don't change often and if they work on focal are likely to work on bionic) and that we can backup bionic nodes. But we use a virtualenv install of borg for that so it too should be very consistent across platforms21:56
fungiwfm22:00
clarkbcorvus: the launcher debug message cleanup just landed22:40
clarkbI'm working on sending some emails but can do restarts of the service to pick up that change in a few22:40
corvusack, thanks, i'm occupied for a bit, so if you have cycles, that's appreciated22:51
clarkbyup emails are sent. I'll start on zl01. Do you want me to rm the debug log there as part of the restarts?22:52
corvusyep, that's fine22:52
clarkbok zl01 is done22:54
clarkband zl02 is done22:56
clarkbunfortunately I think we're in a quieter time of day so it is hard to say if the lower rate of log accumulation (it does seem to be slowish right now) is due to the updates or just lack of load/demand22:56
corvusyeah, i think we'd be getting the same log lines with the previous version22:57
corvusbut we'll theoretically see benefits in a couple of hours :)22:57
clarkbUnable to change directory before execution: [Errno 2] No such file or directory: b'/etc/zuul-merger' <- this error caused the fixup for ansible 11 running against bionic to fail in the zuul deployment job23:06
clarkbI'm just going to recheck as I didn't change anything about zuul or its mergers23:06
clarkbI think ze05 is getting package updates and should be back on the new zuul version shortly. The reboot playbook is running in screen and it should continue overnight23:46
clarkbbut I need to pop out shortly for dinner so will have to check back in in the morning23:46

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!