15:00:59 <noonedeadpunk> #startmeeting openstack_ansible_meeting
15:01:00 <opendevmeet> Meeting started Tue Jul 26 15:00:59 2022 UTC and is due to finish in 60 minutes.  The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:00 <opendevmeet> The meeting name has been set to 'openstack_ansible_meeting'
15:01:06 <noonedeadpunk> #topic office hours
15:01:23 <mgariepy> right on time :D
15:02:30 <noonedeadpunk> so for infra jobs yeah, we probably should also add ansible-role-requirements and ansible-collection-requirements
15:02:46 <noonedeadpunk> to "files"
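    (A minimal sketch of the kind of Zuul "files" matcher being discussed, i.e. making the infra jobs also trigger when the requirements files change. The job name and exact regex patterns here are illustrative; the real definitions live in the repository's zuul.d/ configuration:

        - job:
            name: openstack-ansible-deploy-infra_lxc-ubuntu-focal
            files:
              - ^ansible-role-requirements\.yml$
              - ^ansible-collection-requirements\.yml$
    )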
15:03:18 <noonedeadpunk> stream-9 distro jobs - well, I intended to fix them one day
15:03:47 <noonedeadpunk> they're NV as of today, so why not. But I had a dependency on smth to get them working, can't really recall
15:04:02 * noonedeadpunk fixed alarm clock :D
15:04:18 <damiandabrowski> :D
15:04:33 <opendevreview> Jonathan Rosser proposed openstack/openstack-ansible master: Remove centos-9-stream distro jobs  https://review.opendev.org/c/openstack/openstack-ansible/+/851041
15:05:28 <jrosser> o/ hello
15:05:58 <noonedeadpunk> Will try to spend some time on them maybe next week
15:07:16 <noonedeadpunk> oh, ok, so the reason why distro jobs are failing, is that zed packages are not published yet
15:07:32 <noonedeadpunk> and we need openstacksdk 0.99 with our collection versions
15:09:32 <noonedeadpunk> So. I was looking today at our lxc jobs. And wondering if we should try switching to overlayfs instead of dir by default?
15:09:42 <noonedeadpunk> maybe it's more a ptg topic though
15:10:15 <jrosser> is there a benefit?
15:10:20 <noonedeadpunk> But I guess as of today, all supported distros (well, except focal - do we support it?) have kernel 3.13 by default
15:10:41 <noonedeadpunk> Well, yes? You don't copy root each time, but just use cow and snapshot?
15:11:12 <noonedeadpunk> I have no idea if it's working at all tbh as of today, as I haven't touched that part for real
15:11:28 <noonedeadpunk> but you should save quite some space with that on controllers
15:12:02 <jrosser> i think we have good tests for all of these, so it should be easy to see
15:12:03 <noonedeadpunk> Each container is ~500MB
15:12:12 <noonedeadpunk> And diff would be like 20MB tops
15:12:33 <noonedeadpunk> Nah, more. Image is 423MB
15:12:56 <noonedeadpunk> So we would save 423MB for each container that runs
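    (A minimal sketch of the switch being discussed, assuming the existing lxc_container_backing_store variable is the relevant knob; with overlayfs the base image becomes a shared read-only lower layer instead of being copied into every container rootfs:

        # /etc/openstack_deploy/user_variables.yml
        # Use an overlayfs copy-on-write layer on top of the base image
        # instead of copying the whole rootfs for each container ('dir').
        lxc_container_backing_store: overlayfs
    )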
15:13:43 <noonedeadpunk> Another thing that we totally forgot to mention during the previous PTG is getting rid of dash-separated groups and using underscores only
15:13:58 <noonedeadpunk> We've been postponing this for quite a while now, to have that said...
15:14:11 <noonedeadpunk> But it might not be that big a chunk of work.
15:14:24 <noonedeadpunk> I will try to look into that, but not now for sure
15:15:20 <noonedeadpunk> Right now I'm trying to work on the AZ concept and what needs to be done from the OSA side to get this working. Will publish a scenario to the docs once done if there are no objections
15:15:45 <noonedeadpunk> env.d overrides are quite massive as of today, to have that said
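    (Purely illustrative sketch of the env.d override style being referred to; the per-AZ group names below are made up for the example and are not an agreed layout:

        # /etc/openstack_deploy/env.d/az1.yml
        physical_skel:
          az1_containers:
            belongs_to:
              - all_containers
          az1_hosts:
            belongs_to:
              - hosts
    )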
15:16:18 <jrosser> i wonder if we prune journals properly
15:16:31 <opendevreview> Marc GariƩpy proposed openstack/openstack-ansible master: Fix cloud.openstack.* modules  https://review.opendev.org/c/openstack/openstack-ansible/+/851038
15:16:41 <jrosser> i am already using COW on my controllers (zfs) and the base size of the container is like tiny compared to the total
15:17:24 <noonedeadpunk> ah. well, then there's no profit in overlayfs for you :D
15:18:27 <noonedeadpunk> But I kind of wonder if there are consequences to using overlayfs with other bind-mounts, as it kind of changes the representation of these, from what I read
15:18:33 <jrosser> i guess i mean that the delta doesn't always stay small
15:18:43 <jrosser> it will be when you first create them
15:19:45 <noonedeadpunk> well, delta would be basically venvs size I guess?
15:20:01 <jrosser> for $time-since-reinstall on one i'm looking at 25G for 13 containers even with COW
15:20:03 <noonedeadpunk> as logs and journals and databases are passed into the container and not root
15:20:46 <noonedeadpunk> that is kind of weird?
15:28:42 <mgariepy> wow.. https://github.com/ansible/ansible/issues/78344
15:29:50 <jrosser> huh
15:30:08 <jrosser> they're going to complain next at the slightly unusual use of the setup module, i can just feel it
15:30:09 <mgariepy> let's make the control persist 4 hours.. ;p
15:30:49 <mgariepy> what good responses..
15:31:07 <jrosser> noonedeadpunk: in my container /openstack is nearly 1G for the keystone container, each keystone venv is ~200M so for a couple of upgrades this adds up pretty quick
15:32:02 <noonedeadpunk> btw, retention of old venvs is a good topic
15:32:27 <noonedeadpunk> should we create some playbook at least for ops repo for that?
15:35:54 <jrosser> it's pretty coupled to the inventory so might be hard for the ops repo?
15:36:47 <jrosser> mgariepy: i will reproduce with pipelining=False and show that it handles -13 in that case
15:36:55 <jrosser> i'm sure i saw it doing that
15:37:46 <mgariepy> hmm i haven't been able to reproduce it with pipelining false.
15:37:58 <noonedeadpunk> because it re-tries :)
15:38:12 <mgariepy> no
15:38:26 <mgariepy> because scp catches it to transfer the file.
15:38:33 <mgariepy> then reactivates the error
15:38:54 <mgariepy> error / socket
15:40:00 <jrosser> well - question is if it throws AnsibleControlPersistBrokenPipeError and handles it properly when pipelining is false
15:40:13 <mgariepy> when pipelining is false
15:40:18 <mgariepy> it will transfer the file.
15:40:43 <mgariepy> if there is an error it most likely ends up being exit(255).
15:40:50 <mgariepy> then it retries.
15:40:56 <mgariepy> no ?
15:41:16 <jrosser> lets test it
15:41:19 <mgariepy> you never catch the same race with pipelining false.
15:41:37 <jrosser> no, but my code that was printing p.returncode was showing what was happening though
15:42:38 <jrosser> anyway, let's do the meeting some more, then look at that
15:44:30 <jrosser> we did a Y upgrade in the lab last week
15:44:48 <jrosser> which went really quite smoothly, just a couple of issues
15:44:57 <spotz_> nice
15:45:27 <jrosser> galaxy seems to have some mishandling of http proxies so we had to override all the things to be https://github....
15:46:08 <jrosser> also lxc_hosts_container_build_command is not taking account of lxc_apt_mirror
15:46:42 <anskiy> noonedeadpunk: there is already one: https://opendev.org/openstack/openstack-ansible-ops/src/branch/master/ansible_tools/playbooks/cleanup-venvs.yml
15:47:07 <jrosser> but the biggest trouble was from `2022-07-22 08:04:35 upgrade lxc:all 1:4.0.6-0ubuntu1~20.04.1 1:4.0.12-0ubuntu1~20.04.1`
15:47:07 <noonedeadpunk> anskiy: ah, well :)
15:48:14 <noonedeadpunk> oh, well, it would restart all containers
15:48:23 <jrosser> yeah, pretty much
15:48:42 <jrosser> and i think setup-hosts does that across * at the same time
15:49:03 <noonedeadpunk> what I had to do with that, is to systemctl stop lxc, systemctl mask lxc, then update package, unmask and reboot
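    (A rough Ansible rendering of the manual workaround described above, as a sketch rather than an existing OSA playbook; the idea is that the lxc unit is masked so the package maintainer scripts cannot restart it mid-upgrade:

        - hosts: lxc_hosts
          serial: 1
          tasks:
            - name: Stop lxc so containers go down in a controlled way
              ansible.builtin.systemd:
                name: lxc.service
                state: stopped

            - name: Mask lxc so the package upgrade cannot restart it
              ansible.builtin.systemd:
                name: lxc.service
                masked: true

            - name: Upgrade the lxc packages while the unit is masked
              ansible.builtin.apt:
                name: lxc
                state: latest
                update_cache: true

            - name: Unmask lxc again
              ansible.builtin.systemd:
                name: lxc.service
                masked: false

            - name: Reboot the host to bring containers back cleanly
              ansible.builtin.reboot:
    )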
15:49:41 <noonedeadpunk> if we pass package_state: latest, then yeah...
15:49:46 <jrosser> i don't know if we want to manage that a bit more
15:50:06 <jrosser> but that might be a big surprise for people expecting the upgrades to not hose everything at once
15:51:16 <noonedeadpunk> I bet this is causing troubles https://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/run-upgrade.sh#L186
15:53:02 <jrosser> we also put that in the documentation https://opendev.org/openstack/openstack-ansible/src/branch/master/doc/source/admin/upgrades/minor-upgrades.rst#L52
15:53:04 <noonedeadpunk> So we would need to add serial here https://opendev.org/openstack/openstack-ansible/src/branch/master/playbooks/containers-lxc-host.yml#L24 and override it in script
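    (A sketch of the serial idea, assuming a new override variable, here called lxc_hosts_serial, that run-upgrade.sh could set with -e; the play header is paraphrased from containers-lxc-host.yml, not copied from it:

        - name: Basic lxc host setup
          hosts: "{{ lxc_host_group | default('lxc_hosts') }}"
          serial: "{{ lxc_hosts_serial | default('100%') }}"
    )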
15:53:36 <noonedeadpunk> ok, that is a good point.
15:53:50 <jrosser> well i don't know if serial is the right thing to do
15:54:47 <noonedeadpunk> are you thinking about putting lxc on hold?
15:55:09 <noonedeadpunk> I really can't recall how it's done on centos though
15:57:11 <noonedeadpunk> but yes, I do agree we need to fix that for ppl
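    (One way the hold could look on apt-based hosts, as a sketch; for CentOS something like a versionlock plugin would be needed instead, which is the open question above:

        - name: Hold the lxc package so a blanket dist upgrade cannot restart all containers
          ansible.builtin.dpkg_selections:
            name: lxc
            selection: hold
          when: ansible_facts['os_family'] == 'Debian'
    )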
15:58:16 <noonedeadpunk> not upgrading packages - well, it's also a thing. Maybe we can adjust the docs so that `-e package_state=latest` is not mentioned as something "default", but as something that can be added if you want to update packages, with some precautions
15:58:39 <noonedeadpunk> As at the very least this also affects neutron, as ovs being upgraded also causes disturbances
15:59:33 <jrosser> andrewbonney is back tomorrow - i'm sure we have some use of unattended upgrades here with config to prevent $bad-things but i can't find where that is just now
16:00:01 <noonedeadpunk> well, we just disabled unattended upgrades...
16:00:19 <jrosser> yeah i might be confused here, but it is the same sort of area
16:01:10 <noonedeadpunk> so my intention when I added `-e package_state=latest` was - you're planning an upgrade and thus a maintenance. During maintenance interruptions happen.
16:01:25 <noonedeadpunk> But in the case of LXC you would need to re-bootstrap galera at the very least
16:01:43 <noonedeadpunk> as all goes down at the same time
16:01:48 <jrosser> right, it's not at all self-healing just by running some playbooks
16:01:53 <jrosser> which is kind of what people expect
16:02:51 <noonedeadpunk> and what I see we can do - either manage serial, so that restarting lxc will likely self-heal as the others stay alive, or put some packages on hold and add another flag to upgrade LXC so it's done intentionally
16:03:15 <noonedeadpunk> as I don't think we are able to pin the version
16:04:02 <noonedeadpunk> ok, let's wait for andrewbonney then to make a decision
16:04:06 <noonedeadpunk> #endmeeting