Monday, 2024-10-14

07:09 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Add autocomplete script for playbooks  https://review.opendev.org/c/openstack/openstack-ansible/+/932220
07:12 *** gaudenz__ is now known as gaudenz
08:14 <noonedeadpunk> fwiw, I failed to get CAPI working over the weekend.
08:15 <noonedeadpunk> first weird thing was that the magnum-system namespace wasn't created, so I had to create it manually.
08:15 <noonedeadpunk> Then the code somehow fails if you've omitted dns_nameservers in coe templates
08:16 <noonedeadpunk> but last - it just freezes in create_in_progress, and when I check inside the k8s control cluster - it doesn't have any progress on creation either
08:17 <noonedeadpunk> https://paste.openstack.org/show/bKbpA5igK1IDKzPw17V5/
08:17 <noonedeadpunk> and - no openstack resources are being created
08:28 <noonedeadpunk> so if someone (looking at jrosser) has some advice on what I could be doing wrong or how to trace that - it would be appreciated
09:56 <opendevreview> Merged openstack/openstack-ansible-rabbitmq_server master: Cleanup unneeded upgrade tasks  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/931973
13:32 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Freeze roles for 30.0.0.0b1 release  https://review.opendev.org/c/openstack/openstack-ansible/+/931611
13:34 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Unfreeze roles after milestone release  https://review.opendev.org/c/openstack/openstack-ansible/+/931612
13:36 <noonedeadpunk> it would be nice to have some more reviews on https://review.opendev.org/q/topic:%22bump_osa%22+status:open
14:22 <opendevreview> Merged openstack/openstack-ansible-rabbitmq_server master: Add erlang package defenition to defaults  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/931794
15:15 <jrosser> noonedeadpunk: so i never had to create the magnum-system namespace manually
15:15 <noonedeadpunk> huh
15:15 <noonedeadpunk> so smth went terribly wrong then in my case...
15:16 <jrosser> the ci jobs don't do that after all
15:16 <noonedeadpunk> yeah, true
15:16 <noonedeadpunk> but also I'm not sure they're passing today either
15:17 <noonedeadpunk> ok, they are...
15:17 <noonedeadpunk> I destroyed the k8s containers and re-spawned them multiple times as well...
15:18 <noonedeadpunk> wonder if there's some folder that persisted on the control plane
15:18 <noonedeadpunk> but the namespace is not created by the vexxhost roles at least
15:24 <jrosser> i would guess it originates somewhere like here https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/resources.py#L134
15:24 <jrosser> but i guess mnaser would have a quick answer to this :)
15:24 <noonedeadpunk> I was even thinking about here: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/sync.py#L34
15:24 <noonedeadpunk> as there was smth related to failed locking
15:25 <mnaser> https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L63
15:25 <mnaser> The namespace gets created when create_cluster gets called
15:25 <mnaser> Did you make sure you had the correct kubeconfig file in $HOME/.kube/config ?
15:26 <noonedeadpunk> on the magnum side I assume?
15:26 <mnaser> yeah, magnum-conductor needs to be able to read that file
15:27 <noonedeadpunk> as once I created the namespace manually - at least things progressed a little bit, to the point of https://paste.openstack.org/show/bKbpA5igK1IDKzPw17V5/
15:27 <jrosser> ah well, if you deleted/recreated the control plane k8s but did not run the rest of the playbook to redo the magnum integration, that would be wrong
15:27 <noonedeadpunk> and `kube-vn6vf` is what I got from the magnum api
15:27 <mnaser> okay so it seems to have progressed a bit
15:27 <mnaser> id check the logs of the capo-system namespace
15:28 <mnaser> kubectl -n capo-system logs deploy/capo-controller-manager
15:28 <mnaser> -f will get you a follow, it'll talk a bit about what it's trying to do
15:28 <noonedeadpunk> ok, I think that's the issue indeed, thanks mnaser
15:29 <noonedeadpunk> it complains about a failure to verify the cert
15:29 <noonedeadpunk> I just wasn't sure what logs to check where :)
15:29 <mnaser> PR to the docs would be welcome :P
15:29 <noonedeadpunk> (fwiw, the cert is a valid let's encrypt one there)
15:29 <noonedeadpunk> ++
15:29 <noonedeadpunk> I'm trying to get my head around it first before pushing to the docs
15:30 <noonedeadpunk> mnaser: btw, I've pushed a couple of things, but I'm confused a bit about how/who should launch pipelines there?
15:31 <mnaser> noonedeadpunk: like for the repos?  you can create a pr to the repo and it's a combination of zuul/old github actions we're slowly getting rid of
15:31 <noonedeadpunk> as it seems it's only repo maintainers who run pipelines?
15:32 <mnaser> no, zuul should automatically run on the PR, but i dont see any :p
15:32 <mnaser> oooh, for https://github.com/vexxhost/ansible-collection-kubernetes/pulls
15:32 <mnaser> yeah, that one i need to get around to moving to zuul
15:32 <noonedeadpunk> yeah, ie https://github.com/vexxhost/ansible-collection-kubernetes/pull/136
15:32 <noonedeadpunk> there're some zuul bits there though
15:33 <mnaser> yeah i need to zuulify that repo like i did for the others, we moved away from GHA
15:33 <mnaser> noonedeadpunk: just fyi, in github when you get a PR from a first-time contributor, a maintainer needs to approve it before the CI runs for the repo
15:34 <mnaser> once you have your first landed change, GHA will run automatically in the future; i guess it's to protect from someone being malicious or something
15:34 <noonedeadpunk> I somehow thought it depends on the repo config, but yeah, not 100% sure
15:35 <noonedeadpunk> but anyway, the point was that I'm trying to contribute back when I have smth on my hands :)
15:36 <mnaser> yeah no, i appreciate it
15:36 <noonedeadpunk> hm, okay, so why in the hell is let's encrypt an unknown authority for k8s....
15:37 <mnaser> noonedeadpunk: by default, the capo container has no "os", so we pass the CA for it dynamically in the config
15:37 <mnaser> i get it from certifi and push it if none is provided
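The fallback mnaser describes can be sketched roughly like this (a hedged illustration, not the actual magnum-cluster-api code: `resolve_ca_bundle` is a hypothetical name, and the stdlib default trust store stands in for certifi here):

```python
import os
import ssl


def resolve_ca_bundle(configured_ca_path=None):
    """Pick the CA bundle to inject into the CAPO controller config."""
    # If the deployer supplied a CA bundle, pass it through as-is.
    if configured_ca_path and os.path.exists(configured_ca_path):
        return configured_ca_path
    # Otherwise fall back to a default trust store; magnum-cluster-api
    # gets this from certifi.where(), the stdlib default stands in here.
    return ssl.get_default_verify_paths().cafile
```

This is why a Let's Encrypt cert can still fail to verify: if the operator passes a CA (e.g. an internal one), it replaces the default bundle entirely rather than extending it.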
15:38 <noonedeadpunk> aha, and I passed it the internal CA and it tries to reach the public endpoint
15:38 <mnaser> noonedeadpunk: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/resources.py#L587-L607 yea
15:39 <mnaser> https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/utils.py#L89-L96
15:41 <noonedeadpunk> though I do have `[capi_client]/endpoint = internal`
15:42 <noonedeadpunk> ok, but knowing where the logs are really explained a lot, so thanks!
15:42 <noonedeadpunk> I guess I should be able to proceed from that
15:43 <noonedeadpunk> I kinda wonder if that's what makes it connect to the public URL https://github.com/vexxhost/magnum-cluster-api/commit/178b4be4202ce3338d28aab1644b6be4f7040592
15:43 <noonedeadpunk> which makes total sense
15:44 <noonedeadpunk> but then we need to pass just all the system certs, I guess?
15:45 <noonedeadpunk> though likely I'm not there yet anyway, as no resources were created in the first place. and then the client cluster is spawned on ubuntu and it can have system certs (theoretically)
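Passing "just all the system certs" in practice means concatenating the internal CA onto the system bundle, so both the public (Let's Encrypt) and internal endpoints verify. A minimal sketch of that idea (hypothetical helper, not part of any of the tools discussed):

```python
def build_combined_bundle(system_bundle, internal_ca, out_path):
    """Write a PEM bundle containing the system roots plus the
    deployment's internal CA, for handing to the capi client."""
    with open(out_path, "w") as out:
        for src in (system_bundle, internal_ca):
            with open(src) as f:
                # Normalize trailing newlines so concatenated PEM
                # blocks stay parseable.
                out.write(f.read().rstrip("\n") + "\n")
    return out_path
```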
15:45 <noonedeadpunk> anyway
16:05 <jrosser> noonedeadpunk: have been round all this CA stuff a lot here :)
16:05 <jrosser> let me know if you want me to check any specific vars blah blah
16:06 <jrosser> and afaik, the config in the AIO should deal with this as there's the pki role on the internal vip
16:06 <jrosser> so you have the worst-case setup, where it's not a trusted cert, so the config has to be spot-on everywhere
16:06 <noonedeadpunk> the problem now is that in the logs it tries to reach keystone over the public interface, while I totally see the internal uri in the config
16:06 <noonedeadpunk> and I've supplied the path to the internal CA in the config...
16:07 <jrosser> what does :) there are many things
16:07 <jrosser> andrews patch is for the workload cluster
16:07 <noonedeadpunk> yeah, the workload is not there yet....
16:07 <jrosser> you might also need to patch magnum
16:08 <jrosser> maybe you ran into this? https://bugs.launchpad.net/magnum/+bug/2060194
16:09 <noonedeadpunk> oh, well
16:09 <noonedeadpunk> that would explain it
16:10 <noonedeadpunk> I just thought this would actually be used: https://github.com/vexxhost/magnum-cluster-api/blob/178b4be4202ce3338d28aab1644b6be4f7040592/magnum_cluster_api/resources.py#L580-L583
16:10 <noonedeadpunk> but I'm gonna try and pass just the whole system trust, to be frank
16:10 <jrosser> that needs a proper fix, as imho this should all be automagic from parsing the client config options with keystoneauth
16:10 <jrosser> but /o\ every service does this differently and it's all messy
16:10 <noonedeadpunk> oh yes
16:11 <jrosser> andrews patch is just a sticking plaster really
16:11 <jrosser> ok, so you have an issue in the magnum code about endpoints
16:12 <noonedeadpunk> btw there was some reply in the ML about db pooling being broken, but I still didn't look into the patch in detail
16:12 <noonedeadpunk> yeah, likely.
16:12 <jrosser> the thing you linked is about openstack clients that run in the k8s control plane, which also need to select the correct endpoint
16:12 <jrosser> so it has to be right in two completely separate places
16:12 <noonedeadpunk> or well, in that env I don't care _much_ about public/internal, just different certs for them
16:12 <jrosser> well, at least two
16:13 <noonedeadpunk> yeah, that's why I looked into the k8s code, as the error I see on the k8s control cluster is https://paste.openstack.org/show/bL8Q0BBUbKjgtatJ63NF/
16:15 <jrosser> tbh there is likely stuff we can improve in the docs
16:15 <jrosser> what i have pushed is basically all good for the AIO/CI
16:15 <jrosser> but having said that, we make a bunch of overrides for things like this in actual deployments
16:17 <noonedeadpunk> well, I'm playing in a multinode physical sandbox, so I'm not under any real restrictions there _yet_
16:17 <mnaser> noonedeadpunk: there is a secret in the magnum-system ns
16:17 <mnaser> it should contain the CA that is in use
16:17 <jrosser> i have written all this up
16:18 <jrosser> should make a patch for some "how it works" docs
16:18 <noonedeadpunk> so if I didn't have the magnum-system ns - it's highly unlikely I got the secret....
16:19 <mnaser> kubectl -n magnum-system get secret
16:19 <mnaser> what does that give you?
16:19 <noonedeadpunk> no resources found
16:20 <noonedeadpunk> ok, let me drop all clusters and namespaces
16:20 <noonedeadpunk> and restart magnum-conductor, I assume
16:20 <mnaser> yeah, i think something went terribly wrong here,
16:20 <mnaser> the secret with the CA is missing, so it's probably not using any CAs at all
16:21 <mnaser> id watch the logs of m-cond, send the create, and see what gets logged
16:21 <noonedeadpunk> ++
16:21 <noonedeadpunk> it looked like it was failing very early, when it tried to create some lock, with a KeyError
16:21 <noonedeadpunk> that the namespace is not found
16:22 <jrosser> mnaser: btw did you ever try multiple interfaces on a workload cluster?
16:23 <mnaser> jrosser: never got around to it
16:23 <mnaser> jrosser: im going over all the open PRs -- are you still using https://github.com/vexxhost/ansible-collection-containers/pull/21/files ?
16:23 <jrosser> so we do make some progress, but are running into this https://medium.com/@kanrangsan/how-to-specify-internal-ip-for-kubernetes-worker-node-24790b2884fd
16:24 <mnaser> so you need to template the ip of the server into extraArgs
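Templating the server IP into extraArgs would look roughly like this in a Cluster API bootstrap template (a hedged sketch: the field names follow the kubeadm bootstrap API, but the metadata variable is illustrative and depends on the datasource):

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            # pin the kubelet to the node's internal address;
            # the templated value here is an assumption, not a
            # variable guaranteed by any provider
            node-ip: "{{ ds.meta_data.local_ipv4 }}"
```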
16:27 <jrosser> so for https://github.com/vexxhost/ansible-collection-containers/pull/21/files yes, we are using that
16:27 <jrosser> and because there was no review of that yet, i did not create a PR for the companion here https://github.com/jrosser/ansible-collection-kubernetes/tree/download-artifacts
16:28 <jrosser> but that is pretty much ready to go (subject to getting everything rebased and back up to date......)
16:30 <noonedeadpunk> btw, another thing is that the driver moves the cluster to failure if there's an even number of members, but the API accepts the request to create the cluster. Was there some discussion around that in magnum? As it feels like something that should totally be validated before accepting the request.
16:31 <mnaser> noonedeadpunk: i think this is because it's a few layers below where we can't validate this; i think adding stuff to the api to decline this would be helpful, but we havent gotten around to bubbling this stuff up
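The API-side check being discussed would be small; a hypothetical sketch of it (Magnum does not currently do this, and the function name is made up):

```python
def validate_master_count(master_count: int) -> int:
    # etcd quorum needs an odd number of control plane members, so an
    # even count is doomed to fail later in the driver; rejecting it at
    # request time gives the user an error instead of CREATE_FAILED.
    if master_count < 1 or master_count % 2 == 0:
        raise ValueError(
            f"master_count={master_count}: an odd number of control "
            "plane members is required for etcd quorum")
    return master_count
```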
16:32 <noonedeadpunk> so I dropped the namespace and cluster creation fails on it again: https://paste.openstack.org/show/bx3OvvOVIreomHH1tYsG/
16:32 <noonedeadpunk> oh, yeah, I totally get that it's verified way down the line
16:33 <mnaser> noonedeadpunk: i think you're seeing an unrelated issue here
16:33 <mnaser> `magnum.service.periodic.ClusterUpdateJob.update_status` runs to update the status of clusters, and i think what happened was that the magnum-system ns did not exist when that periodic job ran
16:34 <noonedeadpunk> oh
16:35 <noonedeadpunk> ok, yes, now the namespace is there, huh
16:35 <noonedeadpunk> so it was just a red herring
16:37 <noonedeadpunk> yeah, and at least one machine was created once I passed the system trust
16:40 <mnaser> the thing is, magnum has no, like... coordination system
16:40 <mnaser> so N conductors means N requests :p
16:40 <mnaser> or M*N requests where M is clusters
16:41 <noonedeadpunk> yeah, sounds like adopting tooz would be beneficial for magnum
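The M*N problem above is the classic case for a distributed lock: each conductor tries to take a per-cluster lock and only the holder runs the periodic job. A minimal in-process sketch of that pattern (in a real deployment the lock would come from tooz, backed by something like etcd; everything here, including the class and function names, is illustrative):

```python
import threading


class LocalLock:
    """In-memory stand-in for a tooz distributed lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, blocking=False):
        return self._lock.acquire(blocking)

    def release(self):
        self._lock.release()


def run_update_status(lock, do_update):
    # Non-blocking acquire: whichever conductor gets the lock runs the
    # job and the rest skip it, so M clusters mean M updates, not M*N.
    if not lock.acquire(blocking=False):
        return False
    try:
        do_update()
        return True
    finally:
        lock.release()
```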
16:42 <mnaser> yep
16:44 <noonedeadpunk> oh, btw, you folks are using horizon for magnum, right? Am I blind, or is there no good way to fetch the magnum config together with the cert embedded, like you'd get with the CLI?
16:44 <jrosser> i think there was a patch to fix that
16:45 <noonedeadpunk> /o\, okay
16:45 <noonedeadpunk> it's the same with the heat driver, just to be clear
16:45 <noonedeadpunk> I just haven't used Horizon for a while
16:45 <jrosser> oh https://review.opendev.org/c/openstack/magnum-ui/+/917913
16:46 <jrosser> doh
16:46 <noonedeadpunk> been a while as well...
16:49 <jrosser> mnaser: if you are looking at PRs then doing something (anything?!) about Noble deployments would be useful - afaik you need something like https://github.com/vexxhost/ansible-collection-kubernetes/pull/127
16:49 <jrosser> otherwise we will soon release OSA pointing to my fork, which would not be ideal
16:50 <mnaser> jrosser: ok sounds good, im working my way up the stack starting from the collections
16:50 <jrosser> ok cool - thanks
16:51 * noonedeadpunk will check on the patch to add support for more modern control cluster versions soonish
16:52 <mnaser> jrosser: after this https://github.com/vexxhost/ansible-collection-containers/pull/21 i will tag/release a new version of vexxhost.containers
16:53 <jrosser> cool
16:54 <jrosser> i will check on my counterpart patches for the kubernetes collection to go with that PR
16:54 <jrosser> i re-use the plugin from the containers collection to parse the list of binaries needed by the kubernetes roles
16:55 <jrosser> this is likely out of date / conflicting now, as there's been more change in the k8s collection
17:04 <mnaser> yeah no problem, i just wanted to know if the containers collection is missing any other pieces so i can release
17:11 <opendevreview> Merged openstack/openstack-ansible stable/2023.1: Ensure that the inventory tox job runs on an ubuntu-jammy node  https://review.opendev.org/c/openstack/openstack-ansible/+/932080
17:18 <opendevreview> Merged openstack/openstack-ansible stable/2023.1: Bump SHAs for 2023.1  https://review.opendev.org/c/openstack/openstack-ansible/+/931742
17:46 <mnaser> jrosser: stage 1 done :) https://galaxy.ansible.com/ui/repo/published/vexxhost/containers/
18:01 <noonedeadpunk> damn, I've introduced quite a bug to the neutron role recently
18:04 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_neutron master: Ensure that services that intended to stay disabled are not started  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/932357
18:04 <noonedeadpunk> or well, it was there for a while but became harmful lately
18:08 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_neutron master: Ensure that services that intended to stay disabled are not started  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/932357
18:28 <noonedeadpunk> I'd say we need to include that in the release .....
20:45 <mnaser> noonedeadpunk: did you use 1.31 for your k8s cluster in your tests?
20:45 <mnaser> i am seeing some changes in 1.29 that probably caused you to see your issues
22:32 <opendevreview> Merged openstack/openstack-ansible stable/2024.1: Bumps SHAs for 2024.1  https://review.opendev.org/c/openstack/openstack-ansible/+/931740

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!