*** mrunge_ is now known as mrunge | 14:09 | |
clarkb | today is the day that we announced we should switch all opendev zuul tenants to ansible 11 by default. Before we do that I would like to land https://review.opendev.org/c/zuul/zuul-jobs/+/957188 to fix a known issue with ansible 11. Beyond that I don't think we have a change to the zuul tenant config yet; should I write one? cc infra-root and probably corvus in particular | 15:01 |
corvus | clarkb: +3 on 188 ; i can write the tenant change in a few mins | 15:05 |
clarkb | perfect thanks! | 15:05 |
fungi | i'm going to pop out for a quick lunch, should be back soon | 15:10 |
clarkb | the other item on my monday todo list is checking on zuul upgrades/restarts and I think it may have failed again... | 15:19 |
clarkb | looks like the issue there according to the log is the zuul launcher servers ran out of disk space causing ansible to fail pulling more docker images? I'm guessing (without looking yet) that log files are still consuming a lot of disk | 15:20 |
clarkb | we should probably consider running the rolling upgrades and reboots playbook manually out of band once we've got the launchers in a happier state | 15:21 |
clarkb | confirmed there is 26GB in /var/log/zuul | 15:23 |
clarkb | which is >50% of available disk space | 15:23 |
clarkb | huh it looks like on zl01 we're still failing to rotate the log files properly | 15:23 |
clarkb | but there is also a .old log file that is 17GB large | 15:24 |
clarkb | I wonder if log rotation isn't happening because python doesn't have enough disk space to do it? | 15:24 |
clarkb | if it does a copy-truncate I think we need at least as much free space on disk to copy the file before the old file gets trimmed, which may not be the case here | 15:25 |
clarkb | I wonder if we can configure python to keep a single -debug log file and limit its total size to say 5GB or something along those lines | 15:26 |
clarkb | then keep a longer history in the non debug file (which is much smaller) | 15:26 |
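The idea floated above, a size-capped debug log alongside a smaller non-debug log with longer history, maps onto Python's stdlib log rotation. A minimal sketch with illustrative file names and sizes (not Zuul's actual logging configuration):

```python
# Sketch: cap a chatty debug log's total on-disk footprint with stdlib
# rotation, while the quieter info log keeps a longer history.
# File names and sizes here are illustrative, not Zuul's real config.
import logging
import logging.handlers

logger = logging.getLogger("launcher")
logger.setLevel(logging.DEBUG)

# RotatingFileHandler rolls over by renaming the file, not copy-truncate,
# so it does not need free space equal to the log's size; total usage is
# bounded by roughly maxBytes * (backupCount + 1).
debug_handler = logging.handlers.RotatingFileHandler(
    "launcher-debug.log",
    maxBytes=1024 * 1024 * 1024,  # 1GB per file
    backupCount=4,                # ~5GB total for the debug log
)
debug_handler.setLevel(logging.DEBUG)
logger.addHandler(debug_handler)

# The non-debug log is much smaller, so it can afford more backups.
info_handler = logging.handlers.RotatingFileHandler(
    "launcher.log",
    maxBytes=100 * 1024 * 1024,   # 100MB per file
    backupCount=30,
)
info_handler.setLevel(logging.INFO)
logger.addHandler(info_handler)
```

Because rotation is a rename within the same filesystem, it sidesteps the copy-truncate disk-space problem discussed above.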
corvus | well, the new version should have much smaller logs | 15:29 |
corvus | so how about we just do whatever is needed to get the upgrade done, then re-evaluate | 15:29 |
clarkb | works for me. I think we can probably start by deleting the .old files then optionally rebooting the server (since /tmp was full?) and then run the upgrade playbook in a screen on bridge? | 15:30 |
clarkb | but I'll defer to you on what bits of the current log files are most likely to be useful | 15:30 |
corvus | i deleted some stuff in my homedir on those which freed up some space | 15:31 |
corvus | i think we can dump all the logs | 15:32 |
clarkb | so maybe stop service, delete logs, reboot, start service? | 15:33 |
corvus | stop; delete; pull; reboot; start ? | 15:33 |
clarkb | ya that sounds good. Do you want to do that or should I? (and I guess one launcher at a time to continue to provide continuous service) | 15:34 |
corvus | clarkb: go for it | 15:34 |
clarkb | ok I'll start on zl01 | 15:34 |
opendevreview | James E. Blair proposed openstack/project-config master: Update all tenants to default to Ansible 11 https://review.opendev.org/c/openstack/project-config/+/957748 | 15:35 |
clarkb | corvus: the non debug logs are relatively small should I keep those? | 15:35 |
corvus | nah | 15:35 |
clarkb | ok rebooting zl01 now | 15:37 |
clarkb | unbound didn't start up properly (I suspect due to the race where it tries to bind to an interface that isn't up yet) so I'm double checking on that before I proceed | 15:39 |
clarkb | error: failed to read /var/lib/unbound/root.key | 15:39 |
corvus | maybe fallout from full disk | 15:40 |
clarkb | ya I'm remembering now this file gets updated periodically and when disk is full we write a truncated version | 15:40 |
clarkb | I think the solution is to reinstall the package but I'm trying to remember that now | 15:40 |
opendevreview | Merged zuul/zuul-jobs master: Vendor openvswitch_bridge https://review.opendev.org/c/zuul/zuul-jobs/+/957188 | 15:40 |
clarkb | dns-root-data I think it is this package | 15:41 |
clarkb | oh heh with dns failing I can't run apt updates | 15:41 |
clarkb | fungi is at lunch and I think he fixed this last time. Trying to remember what the secret was. Maybe it's a local command | 15:42 |
corvus | unbound-anchor may install it | 15:42 |
corvus | nope, that didn't fix zl02 | 15:43 |
clarkb | I think there may be a unit file somewhere | 15:44 |
corvus | huh, stop/start of unbound fixed it on zl02 | 15:44 |
johnsom | The file is not unique, you could grab one from another instance | 15:44 |
clarkb | corvus: oh ok I'll just try that then | 15:45 |
corvus | clarkb: i just did it on zl01 | 15:45 |
corvus | i removed the file first | 15:45 |
clarkb | corvus: oh heh I did it too. Seems to be running now | 15:45 |
corvus | so i think the incantation may be: rm /var/lib/unbound/root.key; systemctl start unbound | 15:45 |
clarkb | corvus: I'll start zuul launcher there | 15:46 |
clarkb | corvus: how long do I need to wait before doing this process on zl02? | 15:47 |
corvus | no time at all | 15:47 |
clarkb | cool I'm starting on zl02 now then | 15:47 |
clarkb | and done | 15:50 |
clarkb | corvus: do you think we should start the zuul reboot playbook now? Or is it better to let the dust settle on ansible 11, or just wait until this evening local time which tends to be quieter (or at least fewer humans expecting shorter round trip times) | 15:52 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/957750 Remove some launcher log messages [NEW] | 15:52 |
corvus | some more logspam cleanup | 15:52 |
corvus | i think it's fine to run the playbook now | 15:52 |
corvus | i think we have more executor capacity than node capacity at the moment, so people shouldn't notice. | 15:53 |
clarkb | ack I'll start a root screen on bridge and run it there | 15:53 |
clarkb | it is running | 15:54 |
clarkb | johnsom: ya I figured there was a way to workaround it I just remembered there was a process to make it easy and corvus figured it out | 15:55 |
clarkb | as a side note I wonder if unbound could be more defensive about this like write the file to a tmp location then atomically rename it if the content is complete | 15:56 |
johnsom | Cool, I know it checks for updates on startup, I just wasn't sure if it would work without a resolver or not. | 15:56 |
clarkb | you're probably still risking running out of inodes that way but that is less common for us at least | 15:56 |
johnsom | Yeah, it certainly could use the rename/hard link trick to keep the old one until a new one is complete. Probably worth a bug report. | 15:59 |
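The write-then-rename trick discussed above can be sketched in Python: stage the update in a temp file in the same directory, and only replace the target once the write completed, so a full disk leaves the old file intact rather than truncated. This is an illustrative sketch, not unbound's actual code:

```python
# Sketch of an atomic file update: a full disk makes the write to the
# temp file fail, and the existing file is never touched.
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    dirname = os.path.dirname(path) or "."
    # Temp file must live on the same filesystem for rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)          # raises ENOSPC here on a full disk...
            f.flush()
            os.fsync(f.fileno())   # ...before anything has been renamed
        os.replace(tmp, path)      # atomic on POSIX: old content or new,
                                   # never a truncated in-between state
    except BaseException:
        os.unlink(tmp)             # clean up the partial temp file
        raise
```

Running out of inodes while creating the temp file is still possible, as noted above, but the target file itself survives any failure.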
fungi | catching up to find out what i fixed | 16:21 |
fungi | clarkb: i think unbound may install a cronjob to update the root key? | 16:23 |
fungi | so when the rootfs is full and it tries to overwrite the file it ends up empty because it doesn't bother to do an atomic replace instead | 16:24 |
fungi | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-06-06.log.html#opendev.2025-06-06.log.html%23t2025-06-06T17:28:49 is where i fixed it last time, for reference | 16:28 |
fungi | removed the empty file and restarted the service, looks like | 16:28 |
clarkb | yup that is what corvus did too | 16:31 |
clarkb | I think we are ready to land https://review.opendev.org/c/openstack/project-config/+/957748 at any time today. The ovs bridge fix landed and https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/EUTEAOIUQZ6YJUQWD75QO34H5TDGBLKP/ announced the update for today | 16:48 |
fungi | approved | 16:50 |
fungi | i'll send something to openstack-discuss too | 16:50 |
clarkb | I guess one question is whether or not the version of zuul running on the executors includes 11 | 16:55 |
clarkb | /var/lib/zuul/ansible has an 11 in it so we should be fine | 16:55 |
clarkb | and ze01 is waiting on 2 jobs to complete before it gets upgraded and restarted | 16:57 |
opendevreview | Merged openstack/project-config master: Update all tenants to default to Ansible 11 https://review.opendev.org/c/openstack/project-config/+/957748 | 17:00 |
clarkb | fungi: one thing on my todo list after last week's meeting was to sync up with you on upgrading some of the remaining servers in our upgrade list | 17:11 |
clarkb | I think you were thinking of doing the kerberos and afs related servers in place? | 17:11 |
fungi | yeah, i haven't started on the planning for those yet. i expect the kerberos servers might actually be the most sensitive just because of version changes; the openafs version likely won't change as we already build backported packages of newer openafs we're using | 17:13 |
clarkb | ya I think some of the afs servers may be a minor version update behind due to some updates they weren't restarted for | 17:13 |
clarkb | but its the same general version | 17:13 |
clarkb | re kerberos: has it changed much in the last say 35 years? | 17:13 |
fungi | nope, which is why i'm not really worried about it either | 17:14 |
clarkb | we can also do one distro version at a time on both kerberos servers before going to the next distro version just to give them a chance to sync up with a smaller delta | 17:14 |
clarkb | the infra-prod-service-zuul job to update to ansible 11 by default just succeeded | 17:15 |
clarkb | so any jobs started from ~nowish should default to 11 | 17:15 |
clarkb | though actually the hourly jobs may have updated it about 8 minutes ago too | 17:17 |
clarkb | I'm seeing successful jobs with ansible version 2.18.7 which I believe maps to 11 | 17:26 |
fungi | https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html indicates that ansible community 11.x maps to ansible-core 2.18.x | 17:28 |
fungi | agreed | 17:28 |
clarkb | for the meeting agenda I plan to drop the ansible 11 topic unless something comes up related to it, drop the voip.ms cleanup as that is well underway and in the hands of the service provider now, add rax-flex iad3 bootstrapping. Anything else to change/edit/add/delete? | 17:52 |
clarkb | oh also no one has nominated for service coordinator. I feel like we should play rock paper scissors to see who is next :P | 17:58 |
* corvus touches nose | 18:06 | |
clarkb | I've been spot checking job failures and so far the ones I have checked all appear related to the job payload not the ansible version | 18:12 |
stephenfin | clarkb: fungi: jfyi, looks like there's a bit more to be done with pbr yet https://bugs.launchpad.net/pbr/+bug/2111459 | 18:20 |
stephenfin | likely won't be this week (OpenShift 4.20 feature freeze) | 18:20 |
clarkb | stephenfin: so I think the primary use of easy_install is to get the script writer class, which I think even then the main functionality is around looking up paths to install to? | 18:33 |
clarkb | stephenfin: but for regular cli scripts I think we had decided that pip install is writing them out in a way that doesn't force slow/expensive entrypoint lookups (this is the reason pbr overrode the ones in setuptools which are still slow but if everyone is using pip install and not setup.py install it doesn't matter anymore) | 18:34 |
clarkb | then pbr also grew the wsgi script support using the same mechanism. It may be possible to just treat wsgi scripts as cli scripts? | 18:35 |
clarkb | every project would need to write the shim that wsgiifies things and set that as the cli script target though | 18:35 |
clarkb | I do think there is a light at the end of this tunnel but there may be some slight inconvenience to how wsgi entry point scripts are written | 18:37 |
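The "treat wsgi scripts as cli scripts" idea above could look something like the following hypothetical shim: the project ships this module and points an ordinary `[console_scripts]` entry at `main()` instead of relying on pbr's wsgi_scripts support. All names here (`application`, `myservice`) are illustrative, not from any real project:

```python
# Hypothetical shim exposing a WSGI app through a plain console-script
# entry point, e.g.  myservice = myservice.shim:main  in setup.cfg.
from wsgiref.simple_server import make_server


def application(environ, start_response):
    # Stand-in for the project's real WSGI application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"myservice is running\n"]


def main():
    # Serve the app directly; a real shim would parse host/port args,
    # or operators could point uwsgi/gunicorn at `application` instead.
    with make_server("127.0.0.1", 8000, application) as server:
        server.serve_forever()


if __name__ == "__main__":
    main()
```

The slight inconvenience mentioned above is exactly this: each project has to carry a small module like this rather than having pbr generate the wsgi wrapper at install time.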
clarkb | now I'm going to pop out for lunch. Back in a bit | 18:41 |
*** dhill is now known as Guest24409 | 18:43 | |
fungi | interesting idea: pypi started checking the domains for e-mail addresses on accounts and locking them out when their domains expire... https://blog.pypi.org/posts/2025-08-18-preventing-domain-resurrections/ | 19:27 |
opendevreview | Merged zuul/zuul-jobs master: Add default for build diskimage image name https://review.opendev.org/c/zuul/zuul-jobs/+/956219 | 20:56 |
clarkb | the zuul reboot playbook is waiting for ze04 now which is the last one that got updated the last time we managed to do updates | 21:04 |
clarkb | corvus: the launcher debug log is up to 1.2GB on zl01 fwiw. https://review.opendev.org/c/zuul/zuul/+/957750 hit a gate failure so I rechecked it | 21:06 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management https://review.opendev.org/c/opendev/system-config/+/957810 | 21:21 |
clarkb | I've made my edits to the meeting agenda | 21:26 |
clarkb | will send that out in an hour or two | 21:27 |
corvus | clarkb: ack; we might need to manually clean out the logs and restart when that change merges | 21:28 |
clarkb | infra-root I think ansible 11 doesn't like our bionic test nodes | 21:38 |
corvus | i think they dropped some old python versions | 21:38 |
clarkb | the base job on 957810 seems to fail during fact discovery on some json load error. If all of the job retries fail for that change then I guess I'll push up a change to pin to ansible 9 for now while we sort out dropping bionic or doing something else etc | 21:38 |
clarkb | corvus: ya I suspect that is the cause | 21:38 |
corvus | https://docs.ansible.com/ansible/latest/porting_guides/porting_guide_10.html#command-line | 21:39 |
corvus | actually happened with ansible 10 | 21:39 |
fungi | i guess we can pin those jobs back to 9 until we clear out the rest of the bionic servers | 21:43 |
fungi | are there many left? | 21:44 |
clarkb | fungi: mostly the ones on the upgrade todo list that we're already talking about | 21:45 |
clarkb | though I think many of those are focal too | 21:45 |
clarkb | so ya I think we pin then trim | 21:45 |
fungi | right, i mean not the focal ones since they're still fine for now | 21:46 |
clarkb | I'm already updating the jobs as a result of this for mirrors. We don't have any focal or bionic mirrors so we can drop those mirror node types from testing | 21:46 |
fungi | oh even better | 21:46 |
fungi | so mostly the krb/afs servers then | 21:46 |
fungi | er, at least some of those are focal actually | 21:48 |
fungi | bionic is 18.04/lts right? | 21:48 |
clarkb | yes | 21:49 |
fungi | yeah, looks like the afs servers were upgraded to or rebuilt on focal a while back | 21:51 |
corvus | i don't want to be pessimistic -- i'm optimistic the bionic nodes will be gone soon! but... considering that ansible 10 dropped it, and we skipped that in zuul, and 9 is way eol, we should probably entertain the idea that ansible 9 might not stay in zuul as long as we have bionic nodes, and think about what we would do in that case. | 21:54 |
corvus | maybe just switch the test jobs to the oldest supported platform and note the discrepancy? | 21:54 |
clarkb | ya that is probably the next best thing | 21:54 |
clarkb | I think for the systems that are impacted the delta between bionic and focal is also small | 21:55 |
corvus | sounds reasonable then | 21:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add rax flex iad3 to clouds.yaml and cloud management https://review.opendev.org/c/opendev/system-config/+/957810 | 21:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix system-config-run for Ansible 9 vs 11 https://review.opendev.org/c/opendev/system-config/+/957811 | 21:55 |
clarkb | this is the first step. | 21:55 |
clarkb | basically it's: test our base roles work on bionic (iptables, etc, which we don't change often and if they work on focal are likely to work on bionic) and that we can back up bionic nodes. But we use a virtualenv install of borg for that so it too should be very consistent across platforms | 21:56 |
fungi | wfm | 22:00 |
clarkb | corvus: the launcher debug message cleanup just landed | 22:40 |
clarkb | I'm working on sending some emails but can do restarts of the service to pick up that change in a few | 22:40 |
corvus | ack, thanks, i'm occupied for a bit, so if you have cycles, that's appreciated | 22:51 |
clarkb | yup emails are sent. I'll start on zl01. Do you want me to rm the debug log there as part of the restarts? | 22:52 |
corvus | yep, that's fine | 22:52 |
clarkb | ok zl01 is done | 22:54 |
clarkb | and zl02 is done | 22:56 |
clarkb | unfortunately I think we're in a quieter time of day so it is hard to say if the lower rate of log accumulation (it does seem to be slowish right now) is due to the updates or just lack of load/demand | 22:56 |
corvus | yeah, i think we'd be getting the same log lines with the previous version | 22:57 |
corvus | but we'll theoretically see benefits in a couple of hours :) | 22:57 |
clarkb | Unable to change directory before execution: [Errno 2] No such file or directory: b'/etc/zuul-merger' <- this error caused the fixup for ansible 11 running against bionic to fail in the zuul deployment job | 23:06 |
clarkb | I'm just going to recheck as I didn't change anything about zuul or its mergers | 23:06 |
clarkb | I think ze05 is getting package updates and should be back on the new zuul version shortly. The reboot playbook is running in screen and it should continue overnight | 23:46 |
clarkb | but I need to pop out shortly for dinner so will have to check back in in the morning | 23:46 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!