Monday, 2025-08-18

*** mrunge_ is now known as mrunge14:09
clarkbtoday is the day we announced we would switch all opendev zuul tenants to ansible 11 by default. Before we do that I would like to land https://review.opendev.org/c/zuul/zuul-jobs/+/957188 to fix a known issue with ansible 11. Then I don't think we have a change to the zuul tenant config yet; should I write one? cc infra-root and probably corvus in particular15:01
corvusclarkb: +3 on 188 ; i can write the tenant change in a few mins15:05
clarkbperfect thanks!15:05
fungii'm going to pop out for a quick lunch, should be back soon15:10
clarkbthe other item on my monday todo list is checking on zuul upgrades/restarts and I think it may have failed again...15:19
clarkblooks like the issue there according to the log is that the zuul launcher servers ran out of disk space, causing ansible to fail pulling more docker images? I'm guessing (without looking yet) that log files are still consuming a lot of disk15:20
clarkbwe should probably consider running the rolling upgrades and reboots playbook manually out of band once we've got the launchers in a happier state15:21
clarkbconfirmed there is 26GB in /var/log/zuul15:23
clarkbwhich is >50% of available disk space15:23
clarkbhuh it looks like on zl01 we're still failing to rotate the log files properly15:23
clarkbbut there is also a .old log file that is 17GB large15:24
clarkbI wonder if log rotation isn't happening because python doesn't have enough disk space to do it?15:24
clarkbif it does a copy-truncate I think we need at least as much room on disk to copy the file before the old file gets trimmed, which may not be the case here15:25
clarkbI wonder if we can configure python to keep a single -debug log file and limit its total size to say 5GB or something along those lines15:26
clarkbthen keep a longer history in the non debug file (which is much smaller)15:26
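A capped-size debug log along those lines could look like this hypothetical sketch using Python's stdlib logging handlers — the paths, sizes, and logger names here are illustrative assumptions, not the actual zuul configuration:

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical sketch; real handlers would point at /var/log/zuul.
log_dir = tempfile.mkdtemp()

# Cap the -debug log at ~5GB total via size-based rotation:
# 1GB per file with 4 backups kept means at most ~5GB on disk.
debug_handler = logging.handlers.RotatingFileHandler(
    os.path.join(log_dir, "launcher-debug.log"),
    maxBytes=1 * 1024**3,
    backupCount=4,
)
debug_handler.setLevel(logging.DEBUG)

# Keep a longer history of the much smaller non-debug log.
info_handler = logging.handlers.TimedRotatingFileHandler(
    os.path.join(log_dir, "launcher.log"),
    when="midnight",
    backupCount=30,
)
info_handler.setLevel(logging.INFO)

logger = logging.getLogger("zuul-sketch")
logger.setLevel(logging.DEBUG)
logger.addHandler(debug_handler)
logger.addHandler(info_handler)
logger.debug("debug-only line")
logger.info("info line")
```

A side benefit: RotatingFileHandler rotates by renaming rather than copy-truncating, so rotation does not need free disk space equal to the log's size.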
corvuswell, the new version should have much smaller logs15:29
corvusso how about we just do whatever is needed to get the upgrade done, then re-evaluate15:29
clarkbworks for me. I think we can probably start by deleting the .old files then optionally rebooting the server (since /tmp was full?) and then run the upgrade playbook in a screen on bridge?15:30
clarkbbut I'll defer to you on what bits of the current log files are most likely to be useful15:30
corvusi deleted some stuff in my homedir on those which freed up some space15:31
corvusi think we can dump all the logs15:32
clarkbso maybe stop service, delete logs, reboot, start service?15:33
corvusstop; delete; pull; reboot; start ?15:33
clarkbya that sounds good. Do you want to do that or should I? (and I guess one launcher at a time to continue to provide continuous service)15:34
corvusclarkb: go for it15:34
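The agreed stop/delete/pull/reboot/start sequence might be scripted roughly like this dry-run sketch — the unit names, log paths, and commands are all assumptions for illustration, not the actual deployment:

```python
import subprocess

# Hypothetical per-launcher sequence: stop the service, delete the
# oversized logs, pull new images, reboot, then start again.
STEPS = [
    ["systemctl", "stop", "zuul-launcher"],
    ["rm", "-f", "/var/log/zuul/launcher-debug.log"],
    ["docker-compose", "pull"],
    ["reboot"],
    # after the host is back up:
    ["systemctl", "start", "zuul-launcher"],
]

def run_steps(steps, dry_run=True):
    """Print (dry run) or execute each step in order."""
    executed = []
    for cmd in steps:
        if dry_run:
            print("+", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed
```

Running `run_steps(STEPS)` just prints the plan; a real run per launcher (one at a time, as discussed) would pass `dry_run=False` up to the reboot, then start the service after the host returns.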
clarkbok I'll start on zl0115:34
opendevreviewJames E. Blair proposed openstack/project-config master: Update all tenants to default to Ansible 11  https://review.opendev.org/c/openstack/project-config/+/95774815:35
clarkbcorvus: the non debug logs are relatively small should I keep those?15:35
corvusnah15:35
clarkbok rebooting zl01 now15:37
clarkbunbound didn't start up properly (I suspect due to the race where it tries to bind to an interface that isn't up yet) so I'm double checking on that before I proceed15:39
clarkberror: failed to read /var/lib/unbound/root.key15:39
corvusmaybe fallout from full disk15:40
clarkbya I'm remembering now this file gets updated periodically and when disk is full we write a truncated version15:40
clarkbI think the solution is to reinstall the package but I'm trying to remember that now15:40
opendevreviewMerged zuul/zuul-jobs master: Vendor openvswitch_bridge  https://review.opendev.org/c/zuul/zuul-jobs/+/95718815:40
clarkbdns-root-data I think it is this package15:41
clarkboh heh with dns failing I can't run apt updates15:41
clarkbfungi is at lunch and I think he fixed this last time. Trying to remember what the secret was. Maybe it's a local command15:42
corvusunbound-anchor may install it15:42
corvusnope, that didn't fix zl0215:43
clarkbI think there may be a unit file somewhere15:44
corvushuh, stop/start of unbound fixed it on zl0215:44
johnsomThe file is not unique, you could grab one from another instance15:44
clarkbcorvus: oh ok I'll just try that then15:45
corvusclarkb: i just did it on zl0115:45
corvusi removed the file first15:45
clarkbcorvus: oh heh I did it too. Seems to be running now15:45
corvusso i think the incantation may be: rm /var/lib/unbound/root.key; systemctl start unbound15:45
clarkbcorvus: I'll start zuul launcher there15:46
clarkbcorvus: how long do I need to wait before doing this process on zl02?15:47
corvusno time at all15:47
clarkbcool I'm starting on zl02 now then15:47
clarkband done15:50
clarkbcorvus: do you think we should start the zuul reboot playbook now? Or is it better to let the dust settle on ansible 11 or just wait until this evening local time which tends to be quieter (or at least fewer humans expecting shorter round trip times)15:52
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/957750 Remove some launcher log messages [NEW] 15:52
corvussome more logspam cleanup15:52
corvusi think it's fine to run the playbook now15:52
corvusi think we have more executor capacity than node capacity at the moment, so people shouldn't notice.15:53
clarkback I'll start a root screen on bridge and run it there15:53
clarkbit is running15:54
clarkbjohnsom: ya I figured there was a way to workaround it I just remembered there was a process to make it easy and corvus figured it out15:55
clarkbas a side note I wonder if unbound could be more defensive about this like write the file to a tmp location then atomically rename it if the content is complete15:56
johnsomCool, I know it checks for updates on startup, I just wasn't sure if it would work without a resolver or not.15:56
clarkbyou're probably still risking running out of inodes that way but that is less common for us at least15:56
johnsomYeah, it certainly could use the rename/hard link trick to keep the old one until a new one is complete. Probably worth a bug report.15:59
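The write-to-temp-then-rename pattern described above is a standard defense; a minimal sketch in Python (a hypothetical helper, not unbound's actual code):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, fsync, then
    # atomically rename over the old file. On a full disk the
    # write or fsync fails first, leaving the old file untouched.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the rename only happens after a successful write, a full disk can never leave a truncated root.key behind — the failure mode seen here.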
fungicatching up to find out what i fixed16:21
fungiclarkb: i think unbound may install a cronjob to update the root key?16:23
fungiso when the rootfs is full and it tries to overwrite the file it ends up empty because it doesn't bother to do an atomic replace instead16:24
fungihttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-06-06.log.html#opendev.2025-06-06.log.html%23t2025-06-06T17:28:49 is where i fixed it last time, for reference16:28
fungiremoved the empty file and restarted the service, looks like16:28
clarkbyup that is what corvus did too16:31
clarkbI think we are ready to land https://review.opendev.org/c/openstack/project-config/+/957748 at any time today. The ovs bridge fix landed and https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/EUTEAOIUQZ6YJUQWD75QO34H5TDGBLKP/ announced the update for today16:48
fungiapproved16:50
fungii'll send something to openstack-discuss too16:50
clarkbI guess one question is whether or not the version of zuul running on the executors includes 1116:55
clarkb/var/lib/zuul/ansible has an 11 in it so we should be fine16:55
clarkband ze01 is waiting on 2 jobs to complete before it gets upgraded and restarted16:57
opendevreviewMerged openstack/project-config master: Update all tenants to default to Ansible 11  https://review.opendev.org/c/openstack/project-config/+/95774817:00
clarkbfungi: one thing on my todo list after last week's meeting was to sync up with you on upgrading some of the remaining servers in our upgrade list17:11
clarkbI think you were thinking of doing the kerberos and afs related servers in place?17:11
fungiyeah, i haven't started on the planning for those yet. i expect the kerberos servers might actually be the most sensitive just because of version changes, the openafs version likely won't change as we already build backported packages of newer openafs we're using17:13
clarkbya I think some of the afs servers may be a minor version update behind due to some updates they weren't restarted for17:13
clarkbbut its the same general version17:13
clarkbre kerberos: has it changed much in the last say 35 years?17:13
funginope, which is why i'm not really worried about it either17:14
clarkbwe can also do one distro version at a time on both kerberos servers before going to the next distro version just to give them a chance to sync up with a smaller delta17:14
clarkbthe infra-prod-service-zuul job to update to ansible 11 by default just succeeded17:15
clarkbso any jobs started from ~nowish should default to 1117:15
clarkbthough actually the hourly jobs may have updated it about 8 minutes ago too17:17
clarkbI'm seeing successful jobs with ansible version 2.18.7 which I believe maps to 1117:26
fungihttps://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html indicates that ansible community 11.x maps to ansible-core 2.18.x17:28
fungiagreed17:28
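That correspondence can be captured in a tiny hypothetical helper, with versions per the Ansible release and maintenance table linked above (community 9.x ships ansible-core 2.16, 10.x ships 2.17, 11.x ships 2.18):

```python
# Map an ansible-core major.minor to the Ansible community
# package major that bundles it (illustrative helper).
CORE_TO_COMMUNITY = {
    "2.16": 9,
    "2.17": 10,
    "2.18": 11,
}

def community_version(core_version):
    """Return the community major for an ansible-core version string."""
    major_minor = ".".join(core_version.split(".")[:2])
    return CORE_TO_COMMUNITY[major_minor]
```

So a job reporting ansible version 2.18.7 is indeed running under the Ansible 11 default.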
clarkbfor the meeting agenda I plan to drop the ansible 11 topic unless something comes up related to it, drop the voip.ms cleanup as that is well underway and in the hands of the service provider now, add rax-flex iad3 bootstrapping. Anything else to change/edit/add/delete?17:52
clarkboh also no one has nominated for service coordinator. I feel like we should play rock paper scissors to see who is next :P17:58
* corvus touches nose18:06
clarkbI've been spot checking job failures and so far the ones I have checked all appear related to the job payload not the ansible version18:12
stephenfinclarkb: fungi: jfyi, looks like there's a bit more to be done with pbr yet https://bugs.launchpad.net/pbr/+bug/211145918:20
stephenfinlikely won't be this week (OpenShift 4.20 feature freeze)18:20
clarkbstephenfin: so I think the primary use of easy_install is to get the script writer class, which I think even then the main functionality is around looking up paths to install to?18:33
clarkbstephenfin: but for regular cli scripts I think we had decided that pip install is writing them out in a way that doesn't force slow/expensive entrypoint lookups (this is the reason pbr overrode the ones in setuptools which are still slow but if everyone is using pip install and not setup.py install it doesn't matter anymore)18:34
clarkbthen pbr also grew the wsgi script support using the same mechanism. It may be possible to just treat wsgi scripts as cli scripts?18:35
clarkbevery project would need to write the shim that wsgi-ifies things and set that as the cli script target though18:35
clarkbI do think there is a light at the end of this tunnel but there may be some slight inconvenience to how wsgi entry point scripts are written18:37
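Treating a wsgi app as a plain console script could look like this sketch — the trivial app and the stdlib wsgiref server are illustrative stand-ins for whatever shim a project would actually write:

```python
from wsgiref.simple_server import make_server

def application(environ, start_response):
    # The project's real WSGI app would be imported here; this
    # trivial app is just a placeholder for illustration.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

def main():
    # A plain console_scripts entry point wrapping the WSGI app,
    # avoiding any pbr-specific wsgi_scripts machinery.
    with make_server("", 8000, application) as server:
        server.serve_forever()
```

With `main` registered under console_scripts, pip writes the script out the normal fast way and no separate wsgi script mechanism is needed.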
clarkbnow I'm going to pop out for lunch. Back in a bit18:41
*** dhill is now known as Guest2440918:43
fungiinteresting idea: pypi started checking the domains for e-mail addresses on accounts and locking them out when their domains expire... https://blog.pypi.org/posts/2025-08-18-preventing-domain-resurrections/19:27
opendevreviewMerged zuul/zuul-jobs master: Add default for build diskimage image name  https://review.opendev.org/c/zuul/zuul-jobs/+/95621920:56
clarkbthe zuul reboot playbook is waiting for ze04 now which is the last one that got updated the last time we managed to do updates21:04
clarkbcorvus: the launcher debug log is up to 1.2GB on zl01 fwiw. https://review.opendev.org/c/zuul/zuul/+/957750 hit a gate failure so I rechecked it21:06
opendevreviewClark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management  https://review.opendev.org/c/opendev/system-config/+/95781021:21
clarkbI've made my edits to the meeting agenda21:26
clarkbwill send that out in an hour or two21:27
corvusclarkb: ack; we might need to manually clean out the logs and restart when that change merges21:28
clarkbinfra-root I think ansible 11 doesn't like our bionic test nodes21:38
corvusi think they dropped some old python versions21:38
clarkbthe base job on 957810 seems to fail during fact discovery on some json load error. If all of the job retries fail for that change then I guess I'll push up a change to pin to ansible 9 for now while we sort out dropping bionic or doing something else etc21:38
clarkbcorvus: ya I suspect that is the cause21:38
corvushttps://docs.ansible.com/ansible/latest/porting_guides/porting_guide_10.html#command-line21:39
corvusactually happened with ansible 1021:39
fungii guess we can pin those jobs back to 9 until we clear out the rest of the bionic servers21:43
fungiare there many left?21:44
clarkbfungi: mostly the ones of the upgrade todo list that we're already talking about21:45
clarkbthough I think many of those are focal too21:45
clarkbso ya I think we pin then trim21:45
fungiright, i mean not the focal ones since they're still fine for now21:46
clarkbI'm already updating the jobs as a result of this for mirrors. We don't have any focal or bionic mirrors so we can drop those mirror node types from testing21:46
fungioh even better21:46
fungiso mostly the krb/afs servers then21:46
fungier, at least some of those are focal actually21:48
fungibionic is 18.04/lts right?21:48
clarkbyes21:49
fungiyeah, looks like the afs servers were upgraded to or rebuilt on focal a while back21:51
corvusi don't want to be pessimistic -- i'm optimistic the bionic nodes will be gone soon!  but... considering that ansible 10 dropped it, and we skipped that in zuul, and 9 is way eol, we should probably entertain the idea that ansible 9 might not stay in zuul as long as we have bionic nodes, and think about what we would do in that case.21:54
corvusmaybe just switch the test jobs to the oldest supported platform and note the discrepancy?21:54
clarkbya that is probably the next best thing21:54
clarkbI think for the systems that are impacted the delta between bionic and focal is also small21:55
corvussounds reasonable then21:55
opendevreviewClark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management  https://review.opendev.org/c/opendev/system-config/+/95781021:55
opendevreviewClark Boylan proposed opendev/system-config master: Fix system-config-run for Ansible 9 vs 11  https://review.opendev.org/c/opendev/system-config/+/95781121:55
clarkbthis is the first step.21:55
clarkbbasically it's testing that our base roles work on bionic (iptables, etc which we don't change often and if they work on focal are likely to work on bionic) and that we can backup bionic nodes. But we use a virtualenv install of borg for that so it too should be very consistent across platforms21:56
fungiwfm22:00
clarkbcorvus: the launcher debug message cleanup just landed22:40
clarkbI'm working on sending some emails but can do restarts of the service to pick up that change in a few22:40
corvusack, thanks, i'm occupied for a bit, so if you have cycles, that's appreciated22:51
clarkbyup emails are sent. I'll start on zl01. Do you want me to rm the debug log there as part of the restarts?22:52
corvusyep, that's fine22:52
clarkbok zl01 is done22:54
clarkband zl02 is done22:56
clarkbunfortunately I think we're in a quieter time of day so it is hard to say if the lower rate of log accumulation (it does seem to be slowish right now) is due to the updates or just lack of load/demand22:56
corvusyeah, i think we'd be getting the same log lines with the previous version22:57
corvusbut we'll theoretically see benefits in a couple of hours :)22:57
clarkbUnable to change directory before execution: [Errno 2] No such file or directory: b'/etc/zuul-merger' <- this error caused the fixup for ansible 11 running against bionic to fail in the zuul deployment job23:06
clarkbI'm just going to recheck as I didn't change anything about zuul or its mergers23:06
clarkbI think ze05 is getting package updates and should be back on the new zuul version shortly. The reboot playbook is running in screen and it should continue overnight23:46
clarkbbut I need to pop out shortly for dinner so will have to check back in in the morning23:46

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!