Sunday, 2024-08-04

06:24 <f0o_> jrosser: regarding https://github.com/openstack/openstack-ansible/blob/master/tests/roles/bootstrap-host/tasks/prepare_nfs.yml#L81 - how do you force nova/cinder/glance to all use UID/GID 10000 in this case?
06:26 <f0o_> also, perhaps more interestingly, why did this become an issue now and not much earlier? this was running for quite a while without issues...
06:33 <f0o_> I guess I have to override nova_system_user_uid/group_gid as well as cinder's and glance's? it does say that changing these is not really supported... hrm
07:03 <jrosser> f0o_: i would not expect that you have to override those, otherwise surely the ansible code would have some special case for the nfs backend and deal with it for you
07:03 <jrosser> all_squash and the anonymous uid/gid will only take effect for newly written items, not existing ones, on your nfs server
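For context, the squashing being discussed is set per-export on the NFS server. A minimal sketch of an /etc/exports entry that maps every client write to the anonymous UID/GID 10000, in line with the linked prepare_nfs.yml - the path and client range here are placeholders, not taken from the log:

    /srv/nfs/cinder 10.0.0.0/24(rw,sync,no_subtree_check,all_squash,anonuid=10000,anongid=10000)

After editing, `exportfs -ra` on the server reloads the export table.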
07:09 <f0o_> I know, but the new files will be UID/GID 10000, which cinder/glance/nova/libvirt don't have access to - so that's failing
07:09 <f0o_> so unless I override the UID/GID of cinder/nova/glance and disable dynamic_ownership in /etc/libvirt/qemu.conf I don't see how it should work
07:11 <jrosser> why not build an AIO with the reference nfs deployment?
07:13 <f0o_> because I got 20 instances in error state right now and both nova and cinder are throwing perm denied after the run-upgrade, which was the staging for production. So this issue will be present in our current prod and I'd like to understand what I can do to mitigate it prior to running the upgrade
07:13 <f0o_> having an AIO that doesn't have to deal with X hosts' UID/GID deviances doesn't really represent the reality where some LXC containers have different UID/GIDs now
07:14 <jrosser> ok sorry, i'll leave you to it
07:14 <f0o_> I just fail to see how the AIO would help me see the requirement (or lack thereof) for those UID/GID overrides
07:15 <f0o_> but really I'm just baffled that this actually became an issue out of seemingly nowhere. We were able to create and migrate instances without issues, so the perm problem should've hit us much earlier. It just feels like something changed fundamentally
07:20 <jrosser> there will be no matching of the uid/gid in the AIO between the lxc containers and the nfs server running on the host
07:21 <jrosser> that would be a pretty good representation of your current situation and allow you to see how it either does, or does not, work properly in the configuration we test in CI
07:22 <jrosser> if it doesn't work properly that would be easily reproducible and the chances of getting that fixed would be pretty high
07:23 <jrosser> the setting of all_squash should be the thing that permits deviations of uid/gid across hosts
07:27 <jrosser> oh, also there is this https://github.com/openstack/openstack-ansible-os_nova/blob/3d385e9d3f96d51957e6b8b5bec91d13f93cd725/defaults/main.yml#L86-L96
07:27 <f0o_> let me try to figure out how I can mount the NFS through VPN and give it a shot then - all_squash does force UID/GID 10000 on all new files, but the issue is that the user who created those files has no perms because it's running as 999:999 (cinder) or 997:997 (nova)
07:28 <f0o_> yeah, those are the overrides I mentioned earlier after googling - that warning should definitely be a bit more prominent
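For reference, those overrides would go into user_variables.yml before the first deployment. A sketch, where the nova variable names come from the defaults linked above and the cinder/glance analogues are assumptions, not verified here:

    # user_variables.yml - set once, before initial deployment, then never changed
    nova_system_user_uid: 10000
    nova_system_group_gid: 10000
    cinder_system_user_uid: 10000     # assumed analogue in os_cinder
    cinder_system_group_gid: 10000
    glance_system_user_uid: 10000     # assumed analogue in os_glance
    glance_system_group_gid: 10000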
07:32 <jrosser> it is in the docs for the nova role https://docs.openstack.org/openstack-ansible-os_nova/latest/configure-nova.html#shared-storage-and-synchronized-uid-gid
07:32 <f0o_> >> These values should only be set once before deploying an OpenStack environment and then never changed. - Into the rabbit hole I go :D
07:33 <jrosser> yeah - i mean clearly you could change those, it's just quite some work and quite possibly some downtime whilst you do it
07:34 <f0o_> do you happen to know which ansible role changes /etc/libvirt/qemu.conf? doesn't seem to be os-nova
07:34 <jrosser> i had a look at the nova and cinder roles and really nothing about this has changed for many years
07:34 <f0o_> never mind, it was os-nova https://github.com/openstack/openstack-ansible-os_nova/blob/3d385e9d3f96d51957e6b8b5bec91d13f93cd725/tasks/drivers/kvm/nova_compute_kvm.yml#L84
07:35 <jrosser> so i can only think that something inside nova/cinder/libvirt/wherever is now tighter or more specific with the permissions used at runtime
07:36 <f0o_> I'm starting to believe that it's libvirt's dynamic_ownership, which seems to default to 1 now - so it will change the ownership of the volumes to libvirt:kvm (or equivalent), which deviates from nova/cinder
07:38 <f0o_> so maybe I can just specify qemu_conf_dict entries to set it to 0... gonna give it a shot
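A minimal sketch of that override, assuming entries in qemu_conf_dict are rendered verbatim into /etc/libvirt/qemu.conf (the variable lives in the os_nova defaults linked just below):

    # user_variables.yml - a sketch, not a tested configuration
    qemu_conf_dict:
      dynamic_ownership: 0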
07:40 <jrosser> yeah, the comment is slightly wrong here https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/defaults/main.yml#L582
07:40 <jrosser> it lets you add additional config fields
07:42 <jrosser> if you think that we should change the defaults here please either make a patch or submit a bug report
07:43 <f0o_> I think a safe default for this could be `user = nova` `group = nova` - then libvirt/qemu will run as nova:nova and dynamic_ownership will just chown all volumes to nova:nova if needed - if we assume that nova:nova == cinder:cinder == glance:glance for NFS then all perm issues are resolved
07:43 <f0o_> the drawback is running qemu as nova:nova, which may or may not be a can of worms for apparmor/selinux policies
07:44 <jrosser> unfortunately the active contributors to openstack-ansible are mostly using ceph so we don't get much real feedback on other storage backends
07:45 <f0o_> I guess the alternative is to add the default libvirt user (which unfortunately differs across distros) to the nova group, specify dynamic_ownership=0, and rely on all volumes being at least chmod 660 (through umask or similar)
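A sketch of that second alternative on a Debian/Ubuntu compute host, where the default libvirt/qemu user is libvirt-qemu (the user name is distro-specific, as noted above):

    # add the distro's qemu user to the nova group ...
    usermod -a -G nova libvirt-qemu
    # ... and keep libvirt from chowning volumes at runtime, via qemu.conf:
    #   dynamic_ownership = 0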
07:45 <f0o_> not sure which path is "best"
07:46 <jrosser> no indeed, there are lots of moving parts and it would be easy to break something else
07:47 <f0o_> ok so my nova user is actually part of the kvm group, which libvirt-qemu is also in. So alternative #2 feels more reasonable
07:53 <jrosser> https://github.com/openstack/openstack-ansible-os_nova/blob/master/tasks/drivers/kvm/nova_compute_kvm.yml#L38-L45
07:59 <f0o_> ideally glance/cinder/nova/libvirt would accept an additional group as a configuration parameter so all NFS stuff could be owned by that group. Then there are no conflicts with preexisting groups or SELinux/AppArmor based on deviating custom UID/GIDs...
08:00 <f0o_> but that seems like a very ugly patch
08:19 <f0o_> at least for glance the uid/gid change was relatively seamless
08:19 <f0o_> gonna see how cinder/nova goes
08:20 <f0o_> opted for running qemu as nova:nova and without dynamic_ownership - hoping apparmor plays ball with it
08:20 <f0o_> if it does, then this could be a simple documentation issue where NFS users need to set custom UID/GIDs and qemu settings
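The end state described above would leave /etc/libvirt/qemu.conf looking roughly like this (a sketch of the described configuration, not a copy of the paste below):

    # /etc/libvirt/qemu.conf
    user = "nova"
    group = "nova"
    dynamic_ownership = 0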
09:08 <f0o_> alright, this seems to be working pretty well now
09:08 <f0o_> https://paste.opendev.org/show/bCOVeSboEuLh0CqeDUVo/
09:08 <f0o_> with the all_squash exports config to make all writes uid/gid 10000
09:09 <f0o_> all instances are running as nova now and there are no perm issues anymore
17:38 <f0o> should https://github.com/openstack/openstack-ansible-os_cinder/blob/stable/2024.1/defaults/main.yml#L35 be master?
17:43 <f0o> I got nearly everything operational, just one cinder-volume instance keeps having an issue with the new quorum queues... it keeps crashing with a precondition failure on durable/auto_delete, which AFAIK was deprecated in favor of quorum queues... I have verified the rabbitmq settings against a working node but this one just won't start... idk if I'm just missing some dependency somewhere
17:44 <f0o> anyway, tomorrow's issue
