Monday, 2024-10-14

07:09 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Add autocomplete script for playbooks  https://review.opendev.org/c/openstack/openstack-ansible/+/932220
07:12 *** gaudenz__ is now known as gaudenz
08:14 <noonedeadpunk> fwiw, I failed to get CAPI working over the weekend.
08:15 <noonedeadpunk> first weird thing was that the magnum-system namespace wasn't created, so I had to create it manually.
08:15 <noonedeadpunk> Then the code somehow fails if you've omitted dns_nameservers in coe templates
08:16 <noonedeadpunk> but last - it just freezes in create_in_progress, and when I check inside the k8s control cluster - it doesn't have any progress on creation either
08:17 <noonedeadpunk> https://paste.openstack.org/show/bKbpA5igK1IDKzPw17V5/
08:17 <noonedeadpunk> and - no openstack resources are being created
08:28 <noonedeadpunk> so if someone (looking at jrosser) has some advice on what I could be doing wrong or how to trace that - it would be appreciated
09:56 <opendevreview> Merged openstack/openstack-ansible-rabbitmq_server master: Cleanup unneeded upgrade tasks  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/931973
13:32 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Freeze roles for 30.0.0.0b1 release  https://review.opendev.org/c/openstack/openstack-ansible/+/931611
13:34 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Unfreeze roles after milestone release  https://review.opendev.org/c/openstack/openstack-ansible/+/931612
13:36 <noonedeadpunk> it would be nice to have some more reviews on https://review.opendev.org/q/topic:%22bump_osa%22+status:open
14:22 <opendevreview> Merged openstack/openstack-ansible-rabbitmq_server master: Add erlang package defenition to defaults  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/931794
15:15 <jrosser> noonedeadpunk: so i never had to create the magnum-system namespace manually
15:15 <noonedeadpunk> huh
15:15 <noonedeadpunk> so smth went terribly wrong then in my case...
15:16 <jrosser> the ci jobs don't do that after all
15:16 <noonedeadpunk> yeah, true
15:16 <noonedeadpunk> but also I'm not sure they're passing today either
15:17 <noonedeadpunk> ok, they are...
15:17 <noonedeadpunk> I destroyed the k8s containers and re-spawned them multiple times as well...
15:18 <noonedeadpunk> wonder if there's some folder that persisted on the control plane
15:18 <noonedeadpunk> but the namespace is not created by the vexxhost roles at least
15:24 <jrosser> i would guess it originates somewhere like here https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/resources.py#L134
15:24 <jrosser> but i guess mnaser would have a quick answer to this :)
15:24 <noonedeadpunk> I was even thinking about here: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/sync.py#L34
15:24 <noonedeadpunk> as there was smth related to failed locking
15:25 <mnaser> https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L63
15:25 <mnaser> The namespace gets created when create_cluster gets called
15:25 <mnaser> Did you make sure you had the correct kubeconfig file in $HOME/.kube/config ?
15:26 <noonedeadpunk> on the magnum side I assume?
15:26 <mnaser> yeah, magnum-conductor needs to be able to read that file
15:27 <noonedeadpunk> as once I created the namespace manually - at least things progressed a little bit, to the point of https://paste.openstack.org/show/bKbpA5igK1IDKzPw17V5/
15:27 <jrosser> ah well, if you deleted/recreated the control plane k8s but did not run the rest of the playbook to redo the magnum integration, that would be wrong
15:27 <noonedeadpunk> and `kube-vn6vf` is what I got from the magnum api
15:27 <mnaser> okay so it seems to have progressed a bit
15:27 <mnaser> id check the logs of the capo-system namespace
15:28 <mnaser> kubectl -n capo-system logs deploy/capo-controller-manager
15:28 <mnaser> -f will get you a follow, it'll talk a bit about what it's trying to do
15:28 <noonedeadpunk> ok, I think that's the issue indeed, thanks mnaser
15:29 <noonedeadpunk> it complains about a failure to verify the cert
15:29 <noonedeadpunk> I just wasn't sure what logs to check where :)
15:29 <mnaser> PR to the docs would be welcome :P
15:29 <noonedeadpunk> (fwiw, the cert is a valid let's encrypt one there)
15:29 <noonedeadpunk> ++
15:29 <noonedeadpunk> I'm trying to get my head around it first before pushing to the docs
15:30 <noonedeadpunk> mnaser: btw, I've pushed a couple of things, but I'm confused a bit about how/who should launch pipelines there?
15:31 <mnaser> noonedeadpunk: like for the repos?  you can create a pr to the repo and it's a combination of zuul/old github actions we're slowly getting rid of
15:31 <noonedeadpunk> as it seems it's only repo maintainers who run pipelines?
15:32 <mnaser> no, zuul should automatically run on the PR, but i dont see any :p
15:32 <mnaser> oooh, for https://github.com/vexxhost/ansible-collection-kubernetes/pulls
15:32 <mnaser> yeah, that one i need to get around to moving to zuul
15:32 <noonedeadpunk> yeah, ie https://github.com/vexxhost/ansible-collection-kubernetes/pull/136
15:32 <noonedeadpunk> there're some zuul bits there though
15:33 <mnaser> yeah i need to zuulify that repo like i did for the others, we moved away from GHA
15:33 <mnaser> noonedeadpunk: just fyi, in github when you get a PR from a first-time contributor, a maintainer needs to approve it before the CI runs for the repo
15:34 <mnaser> once you have your first landed change, GHA will run automatically in the future; i guess it's to protect from someone being malicious or something
15:34 <noonedeadpunk> I somehow thought it depends on the repo config, but yeah, not 100% sure
15:35 <noonedeadpunk> but anyway, the point was that I'm trying to contribute back when I have smth on my hands :)
15:36 <mnaser> yeah no, i appreciate it
15:36 <noonedeadpunk> hm, okay, so why in the hell is let's encrypt an unknown authority for k8s....
15:37 <mnaser> noonedeadpunk: by default, the capo container has no "os", so we pass the CA for it dynamically in the config
15:37 <mnaser> i get it from certifi and push it if none is provided
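The fallback mnaser describes can be sketched roughly like this (a hedged illustration, not the actual magnum-cluster-api code: `resolve_ca_bundle` is a hypothetical name, and the stdlib default trust store stands in for certifi here):

```python
import os
import ssl


def resolve_ca_bundle(configured_ca_path=None):
    """Pick the CA bundle to inject into the CAPO controller config."""
    # If the deployer supplied a CA bundle, pass it through as-is.
    if configured_ca_path and os.path.exists(configured_ca_path):
        return configured_ca_path
    # Otherwise fall back to a default trust store; magnum-cluster-api
    # gets this from certifi.where(), the stdlib default stands in here.
    return ssl.get_default_verify_paths().cafile
```

This is why a Let's Encrypt cert can still fail to verify: if the operator passes a CA (e.g. an internal one), it replaces the default bundle entirely rather than extending it.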
15:38 <noonedeadpunk> aha, and I passed it the internal CA and it tries to reach the public endpoint
15:38 <mnaser> noonedeadpunk: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/resources.py#L587-L607 yea
15:39 <mnaser> https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/utils.py#L89-L96
15:41 <noonedeadpunk> though I do have `[capi_client]/endpoint = internal`
15:42 <noonedeadpunk> ok, but knowing where the logs are really explained a lot, so thanks!
15:42 <noonedeadpunk> I guess I should be able to proceed from that
15:43 <noonedeadpunk> I kinda wonder if that's what makes it connect to the public URL https://github.com/vexxhost/magnum-cluster-api/commit/178b4be4202ce3338d28aab1644b6be4f7040592
15:43 <noonedeadpunk> which makes total sense
15:44 <noonedeadpunk> but then we need to pass just all the system certs, I guess?
15:45 <noonedeadpunk> though likely I'm not there yet anyway, as no resources were created in the first place. and then the client cluster is spawned on ubuntu and it can have system certs (theoretically)
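Passing "just all the system certs" in practice means concatenating the internal CA onto the system bundle, so both the public (Let's Encrypt) and internal endpoints verify. A minimal sketch of that idea (hypothetical helper, not part of any of the tools discussed):

```python
def build_combined_bundle(system_bundle, internal_ca, out_path):
    """Write a PEM bundle containing the system roots plus the
    deployment's internal CA, for handing to the capi client."""
    with open(out_path, "w") as out:
        for src in (system_bundle, internal_ca):
            with open(src) as f:
                # Normalize trailing newlines so concatenated PEM
                # blocks stay parseable.
                out.write(f.read().rstrip("\n") + "\n")
    return out_path
```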
15:45 <noonedeadpunk> anyway
16:05 <jrosser> noonedeadpunk: have been round all this CA stuff a lot here :)
16:05 <jrosser> let me know if you want me to check any specific vars blah blah
16:06 <jrosser> and afaik, the config in the AIO should deal with this as there's the pki role on the internal vip
16:06 <jrosser> so you have the worst-case setup, where it's not a trusted cert, so the config has to be spot-on everywhere
16:06 <noonedeadpunk> the problem now is that in the logs it tries to reach keystone over the public interface, while I totally see the internal uri in the config
16:06 <noonedeadpunk> and I've supplied the path to the internal CA in the config...
16:07 <jrosser> what does :) there are many things
16:07 <jrosser> andrews patch is for the workload cluster
16:07 <noonedeadpunk> yeah, the workload is not there yet....
16:07 <jrosser> you might also need to patch magnum
16:08 <jrosser> maybe you ran into this? https://bugs.launchpad.net/magnum/+bug/2060194
16:09 <noonedeadpunk> oh, well
16:09 <noonedeadpunk> that would explain it
16:10 <noonedeadpunk> I just thought this would actually be used: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/resources.py#L580-L583
16:10 <noonedeadpunk> but I'm gonna try and pass just the whole system trust, to be frank
16:10 <jrosser> that needs a proper fix, as imho this should all be automagic from parsing the client config options with keystoneauth
16:10 <jrosser> but /o\ every service does this differently and it's all messy
16:10 <noonedeadpunk> oh yes
16:11 <jrosser> andrews patch is just a sticking plaster really
16:11 <jrosser> ok, so you have an issue in the magnum code about endpoints
16:12 <noonedeadpunk> btw there was some reply in the ML about db pooling being broken, but I still didn't look into the patch in detail
16:12 <noonedeadpunk> yeah, likely.
16:12 <jrosser> the thing you linked is about openstack clients that run in the k8s control plane, which also need to select the correct endpoint
16:12 <jrosser> so it has to be right in two completely separate places
16:12 <noonedeadpunk> or well, in that env I don't care _much_ about public/internal, just different certs for them
16:12 <jrosser> well, at least two
16:13 <noonedeadpunk> yeah, that's why I looked into the k8s code, as the error I see on the k8s control cluster is https://paste.openstack.org/show/bL8Q0BBUbKjgtatJ63NF/
16:15 <jrosser> tbh there is likely stuff we can improve in the docs
16:15 <jrosser> what i have pushed is basically all good for the AIO/CI
16:15 <jrosser> but having said that, we make a bunch of overrides for things like this in actual deployments
16:17 <noonedeadpunk> well, I'm playing in a multinode physical sandbox, so I'm not under any real restrictions there _yet_
16:17 <mnaser> noonedeadpunk: there is a secret in the magnum-system ns
16:17 <mnaser> it should contain the CA that is in use
16:17 <jrosser> i have written all this up
16:18 <jrosser> should make a patch for some "how it works" docs
16:18 <noonedeadpunk> so if I didn't have the magnum-system ns - it's highly unlikely I got the secret....
16:19 <mnaser> kubectl -n magnum-system get secret
16:19 <mnaser> what does that give you?
16:19 <noonedeadpunk> no resources found
16:20 <noonedeadpunk> ok, let me drop all clusters and namespaces
16:20 <noonedeadpunk> and restart magnum-conductor, I assume
16:20 <mnaser> yeah, i think something went terribly wrong here,
16:20 <mnaser> the secret with the CA is missing, so it's probably not using any CAs at all
16:21 <mnaser> id watch the logs of m-cond, send the create, and see what gets logged
16:21 <noonedeadpunk> ++
16:21 <noonedeadpunk> it looked like it was failing very early, when it tried to create some lock, with a KeyError
16:21 <noonedeadpunk> that the namespace is not found
16:22 <jrosser> mnaser: btw did you ever try multiple interfaces on a workload cluster?
16:23 <mnaser> jrosser: never got around to it
16:23 <mnaser> jrosser: im going over all the open PRs -- are you still using https://github.com/vexxhost/ansible-collection-containers/pull/21/files ?
16:23 <jrosser> so we do make some progress, but are running into this https://medium.com/@kanrangsan/how-to-specify-internal-ip-for-kubernetes-worker-node-24790b2884fd
16:24 <mnaser> so you need to template the ip of the server into extraArgs
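Templating the server IP into extraArgs would look roughly like this in a Cluster API bootstrap template (a hedged sketch: the field names follow the kubeadm bootstrap API, but the metadata variable is illustrative and depends on the datasource):

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            # pin the kubelet to the node's internal address;
            # the templated value here is an assumption, not a
            # variable guaranteed by any provider
            node-ip: "{{ ds.meta_data.local_ipv4 }}"
```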
16:27 <jrosser> so for https://github.com/vexxhost/ansible-collection-containers/pull/21/files yes, we are using that
16:27 <jrosser> and because there was no review of that yet, i did not create a PR for the companion here https://github.com/jrosser/ansible-collection-kubernetes/tree/download-artifacts
16:28 <jrosser> but that is pretty much ready to go (subject to getting everything rebased and back up to date......)
16:30 <noonedeadpunk> btw, another thing is that the driver moves the cluster to failure if there's an even number of members, but the API accepts the request to create the cluster. Was there some discussion around that in magnum? As it feels like something that should totally be validated before accepting the request.
16:31 <mnaser> noonedeadpunk: i think this is because it's a few layers below where we can't validate this; i think adding stuff to the api to decline this would be helpful, but we havent gotten around to bubbling this stuff up
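The API-side check being discussed would be small; a hypothetical sketch of it (Magnum does not currently do this, and the function name is made up):

```python
def validate_master_count(master_count: int) -> int:
    # etcd quorum needs an odd number of control plane members, so an
    # even count is doomed to fail later in the driver; rejecting it at
    # request time gives the user an error instead of CREATE_FAILED.
    if master_count < 1 or master_count % 2 == 0:
        raise ValueError(
            f"master_count={master_count}: an odd number of control "
            "plane members is required for etcd quorum")
    return master_count
```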
16:32 <noonedeadpunk> so I dropped the namespace and cluster creation fails on it again: https://paste.openstack.org/show/bx3OvvOVIreomHH1tYsG/
16:32 <noonedeadpunk> oh, yeah, I totally get that it's verified way down the line
16:33 <mnaser> noonedeadpunk: i think you're seeing an unrelated issue here
16:33 <mnaser> `magnum.service.periodic.ClusterUpdateJob.update_status` runs to update the status of clusters, and i think what happened was that the magnum-system ns did not exist when that periodic job ran
16:34 <noonedeadpunk> oh
16:35 <noonedeadpunk> ok, yes, now the namespace is there, huh
16:35 <noonedeadpunk> so it was just a red herring
16:37 <noonedeadpunk> yeah, and at least one machine was created once I passed the system trust
16:40 <mnaser> the thing is, magnum has no, like... coordination system
16:40 <mnaser> so N conductors means N requests :p
16:40 <mnaser> or M*N requests where M is clusters
16:41 <noonedeadpunk> yeah, sounds like adopting tooz would be beneficial for magnum
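The M*N problem above is the classic case for a distributed lock: each conductor tries to take a per-cluster lock and only the holder runs the periodic job. A minimal in-process sketch of that pattern (in a real deployment the lock would come from tooz, backed by something like etcd; everything here, including the class and function names, is illustrative):

```python
import threading


class LocalLock:
    """In-memory stand-in for a tooz distributed lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, blocking=False):
        return self._lock.acquire(blocking)

    def release(self):
        self._lock.release()


def run_update_status(lock, do_update):
    # Non-blocking acquire: whichever conductor gets the lock runs the
    # job and the rest skip it, so M clusters mean M updates, not M*N.
    if not lock.acquire(blocking=False):
        return False
    try:
        do_update()
        return True
    finally:
        lock.release()
```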
16:42 <mnaser> yep
16:44 <noonedeadpunk> oh, btw, you folks are using horizon for magnum, right? Am I blind, or is there no good way to fetch the magnum config together with the cert embedded, like you'd get with the CLI?
16:44 <jrosser> i think there was a patch to fix that
16:45 <noonedeadpunk> /o\, okay
16:45 <noonedeadpunk> it's the same with the heat driver, just to be clear
16:45 <noonedeadpunk> I just haven't used Horizon for a while
16:45 <jrosser> oh https://review.opendev.org/c/openstack/magnum-ui/+/917913
16:46 <jrosser> doh
16:46 <noonedeadpunk> been a while as well...
16:49 <jrosser> mnaser: if you are looking at PRs then doing something (anything?!) about Noble deployments would be useful - afaik you need something like https://github.com/vexxhost/ansible-collection-kubernetes/pull/127
16:49 <jrosser> otherwise we will soon release OSA pointing to my fork, which would not be ideal
16:50 <mnaser> jrosser: ok sounds good, im working my way up the stack starting from the collections
16:50 <jrosser> ok cool - thanks
16:51 * noonedeadpunk will check on the patch to add support for more modern control cluster versions soonish
16:52 <mnaser> jrosser: after this https://github.com/vexxhost/ansible-collection-containers/pull/21 i will tag/release a new version of vexxhost.containers
16:53 <jrosser> cool
16:54 <jrosser> i will check on my counterpart patches for the kubernetes collection to go with that PR
16:54 <jrosser> i re-use the plugin from the containers collection to parse the list of binaries needed by the kubernetes roles
16:55 <jrosser> this is likely out of date / conflicting now, as there's been more change in the k8s collection
17:04 <mnaser> yeah no problem, i just wanted to know if the containers collection is missing any other pieces so i can release
17:11 <opendevreview> Merged openstack/openstack-ansible stable/2023.1: Ensure that the inventory tox job runs on an ubuntu-jammy node  https://review.opendev.org/c/openstack/openstack-ansible/+/932080
17:18 <opendevreview> Merged openstack/openstack-ansible stable/2023.1: Bump SHAs for 2023.1  https://review.opendev.org/c/openstack/openstack-ansible/+/931742
17:46 <mnaser> jrosser: stage 1 done :) https://galaxy.ansible.com/ui/repo/published/vexxhost/containers/
18:01 <noonedeadpunk> damn, I've introduced quite a bug to the neutron role recently
18:04 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_neutron master: Ensure that services that intended to stay disabled are not started  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/932357
18:04 <noonedeadpunk> or well, it was there for a while but became harmful lately
18:08 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_neutron master: Ensure that services that intended to stay disabled are not started  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/932357
18:28 <noonedeadpunk> I'd say we need to include that in the release .....
20:45 <mnaser> noonedeadpunk: did you use 1.31 for your k8s cluster in your tests?
20:45 <mnaser> i am seeing some changes in 1.29 that probably caused you to see your issues
22:32 <opendevreview> Merged openstack/openstack-ansible stable/2024.1: Bumps SHAs for 2024.1  https://review.opendev.org/c/openstack/openstack-ansible/+/931740

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!