harun | hi all, there is no communication among these containers via the br-mgmt bridge. i cannot reach the containers of a host from other hosts after keepalived is installed; when i reboot all hosts, i can reach containers from different hosts. Why do i encounter this issue? I would appreciate it if you could help me, thank you. Tcpdump stdout: https://paste.openstack.org/show/bI2qiJpMbCsX8AgnS3OX/ | 07:05 |
---|---|---|
noonedeadpunk | o/ | 07:32 |
noonedeadpunk | harun: um, I think you actually need to check first if there's communication over br-mgmt between hosts. As it feels like either some kind of firewalling or just a misconfigured bridge itself | 07:34 |
noonedeadpunk | I'm not sure if keepalived would cause such issues, but ofc would depend on your configuration | 07:35 |
jrosser | you need a unique mgmt address ip on each bridge, quite apart from whatever vip keepalived is managing | 07:35 |
gokhan_ | noonedeadpunk, jrosser harun is my teammate. we can ping br-mgmt ips between hosts. All of the ips are unique. | 07:39 |
gokhan_ | we can not get arp reply from other hosts. | 07:40 |
jrosser | are you being completely specific about hosts/containers here? | 07:41 |
noonedeadpunk | well we kinda had specific issues with ARP replies between computes quite recently, but that was related to NIC firmware and kernel version | 07:42 |
jrosser | from memory keepalived adjusts routes too? | 07:43 |
noonedeadpunk | and what we saw was VMs on some compute nodes were not able to communicate over tunnel (vxlan) networks due to arp being just dropped in one direction | 07:43 |
noonedeadpunk | yes, it does add a network which is defined for VIP | 07:43 |
noonedeadpunk | *add a route | 07:43 |
gokhan_ | we are getting this issue in all of our environments, also in a customer environment which we installed. after reboot the issue is resolved | 07:43 |
noonedeadpunk | so if you define vip with some weird netmask... but then there would be an issue with communication between controllers as well | 07:44 |
noonedeadpunk | could it be that you've somehow dropped ip_forward from sysctl for the runtime? | 07:44 |
jrosser | gokhan_: it is still not completely clear to me what breaks | 07:46 |
jrosser | i.e, if you can still ping containers on the same host, from the host br-mgmt | 07:47 |
jrosser | or if you can still ping br-mgmt <> br-mgmt between hosts, but just the container<>container is broken | 07:47 |
gokhan_ | this sysctl conf https://paste.openstack.org/show/bi97dP7ADRPtOB7hAAUk/ | 07:47 |
gokhan_ | jrosser, also container<>otherhostsbr-mgmt is broken but container<>samehostbr-mgmt is working | 07:49 |
jrosser | and what about one host br-mgmt to another host br-mgmt? | 07:50 |
gokhan_ | jrosser, sorry we also can not ping containers on the same host | 07:51 |
gokhan_ | we can only ping from containers to its host br-mgmt ip | 07:52 |
gokhan_ | we restarted the lxc-dnsmasq service but it did not work. | 07:54 |
jrosser | that only deals with eth0 in the container | 07:55 |
jrosser | did you do other things like check that the routing table looks reasonable? | 07:56 |
gokhan_ | jrosser, https://paste.openstack.org/show/blJXkzVFYUpesUza31NQ/ | 07:58 |
gokhan_ | it seems ok | 07:59 |
jrosser | docker? | 07:59 |
gokhan_ | also ceph is installed with cephadm and it is using docker | 08:01 |
jrosser | all i can recommend is starting bottom up with really basic connectivity checks | 08:04 |
jrosser | arp/ping with tcpdump at both ends between two host br-mgmt | 08:04 |
jrosser | we do not test cephadm on the same hosts as openstack-ansible so that would be up to you to check there is no bad interaction | 08:05 |
jrosser | it is also possible that docker is installing iptables rules | 08:09 |
gokhan_ | jrosser, this is the ping and tcpdump output for a ping between 2 hosts. https://paste.openstack.org/show/bcdc92jPLvzqYrewviQs/ | 08:11 |
gokhan_ | this is iptables rule list https://paste.openstack.org/show/bR62RCLyZQeKgd5OEFix/ | 08:12 |
gokhan_ | I am using cephadm and osa on the same host in multiple environments, I didn't get any issues with that. | 08:13 |
gokhan_ | the weird behaviour it is working after the reboot :( | 08:14 |
gokhan_ | but we are trying to find root cause of this. | 08:14 |
gokhan_ | jrosser, can the apparmor service affect container networking? | 08:16 |
jrosser | you would see anything that apparmor blocks in the kernel log | 08:21 |
jrosser | have you checked that br-mgmt has all the members you'd expect | 08:22 |
gokhan_ | jrosser, these are dmesg logs https://paste.openstack.org/show/bDEQa5sxPHJcf7UeHC59/ | 08:25 |
gokhan_ | there are profile replace logs from apparmor | 08:26 |
gokhan_ | jrosser, br-mgmt has all members https://paste.openstack.org/show/bd2p0LIGnUCfEpoVk0uh/ | 08:30 |
gokhan_ | now I am rebooting one of hosts try to see difference | 08:36 |
gokhan_ | jrosser, after the reboot, now containers on rebooted host can ping between themselves | 08:44 |
gokhan_ | the only difference I see is lxc-monitord service is not working | 08:45 |
jrosser | noonedeadpunk: is this correct? https://github.com/openstack/openstack-ansible/blob/master/scripts/gate-check-commit.sh#L68 | 08:51 |
jrosser | should it be 2024.1? | 08:51 |
gokhan_ | jrosser, it seems I find the issue | 08:56 |
gokhan_ | after the reboot iptables rule has changed | 08:57 |
gokhan_ | this is rebooted host https://paste.openstack.org/show/bO3R9QEdzZlzFeivqcIo/ | 08:57 |
noonedeadpunk | jrosser: should be 2024.1, yes | 08:58 |
gokhan_ | in other host iptables rules are Chain INPUT (policy ACCEPT) | 08:58 |
gokhan_ | target prot opt source destination | 08:58 |
gokhan_ | ACCEPT tcp -- anywhere anywhere tcp dpt:domain | 08:58 |
gokhan_ | ACCEPT udp -- anywhere anywhere udp dpt:domain | 08:58 |
gokhan_ | ACCEPT tcp -- anywhere anywhere tcp dpt:67 | 08:58 |
gokhan_ | ACCEPT udp -- anywhere anywhere udp dpt:bootps | 08:58 |
gokhan_ | Chain FORWARD (policy DROP) | 08:58 |
gokhan_ | target prot opt source destination | 08:58 |
gokhan_ | ACCEPT all -- anywhere anywhere | 08:58 |
gokhan_ | ACCEPT all -- anywhere anywhere | 08:58 |
gokhan_ | DOCKER-USER all -- anywhere anywhere | 08:58 |
gokhan_ | DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere | 08:58 |
gokhan_ | ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED | 08:58 |
gokhan_ | DOCKER all -- anywhere anywhere | 08:58 |
gokhan_ | ACCEPT all -- anywhere anywhere | 08:58 |
gokhan_ | ACCEPT all -- anywhere anywhere | 08:58 |
gokhan_ | Chain OUTPUT (policy ACCEPT) | 08:58 |
gokhan_ | target prot opt source destination | 08:58 |
gokhan_ | Chain DOCKER (1 references) | 08:58 |
gokhan_ | target prot opt source destination | 08:58 |
gokhan_ | Chain DOCKER-ISOLATION-STAGE-1 (1 references) | 08:58 |
gokhan_ | target prot opt source destination | 08:58 |
gokhan_ | DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere | 08:58 |
gokhan_ | RETURN all -- anywhere anywhere | 08:58 |
gokhan_ | Chain DOCKER-ISOLATION-STAGE-2 (1 references) | 08:59 |
gokhan_ | target prot opt source destination | 08:59 |
gokhan_ | DROP all -- anywhere anywhere | 08:59 |
gokhan_ | RETURN all -- anywhere anywhere | 08:59 |
gokhan_ | Chain DOCKER-USER (1 references) | 08:59 |
gokhan_ | target prot opt source destination | 08:59 |
gokhan_ | RETURN all -- anywhere anywhere | 08:59 |
gokhan_ | sorry :( | 08:59 |
gokhan_ | https://paste.openstack.org/show/bK8axT4XP81rb1l580Lt/ | 08:59 |
gokhan_ | the change: FORWARD policy is DROP on unrebooted hosts | 08:59 |
gokhan_ | how can we apply iptables rule for lxc containers | 08:59 |
gokhan_ | it seems they are not applied | 08:59 |
jrosser | like i say we do not test/support having docker and lxc on the same host | 08:59 |
jrosser | this is very well known to cause trouble for both lxc and lxd | 08:59 |
jrosser | there might be some config you can change on the docker side about this - but i have no idea about that really | 09:01 |
jrosser | openstack-ansible does not do any management of iptables rules at all, so this feels like a docker issue | 09:01 |
gokhan_ | thanks jrosser for helping to find the issue. as you have said, it seems there are issues when installing docker and lxc on same host. as a workaround we will change iptables rules as expected. | 09:05 |
jrosser | it may be that restarting some service on the docker side has the same effect as the reboot, whichever one is responsible for inserting the iptables rules | 09:06 |
gokhan_ | the weird thing is docker is working as expected. ceph mon daemon can communicate between themselves. | 09:08 |
gokhan_ | I will restart ceph.target and see if there is any change on the iptables side. | 09:08 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible master: Fix upgrade job on master to upgrade from 2024.1 to master https://review.opendev.org/c/openstack/openstack-ansible/+/928771 | 09:17 |
noonedeadpunk | actually - we do iptables rules for LXC | 09:20 |
noonedeadpunk | and they should be re-loaded/applied with restart of lxc-dnsmasq service iirc | 09:20 |
noonedeadpunk | https://opendev.org/openstack/openstack-ansible-lxc_hosts/src/branch/master/templates/lxc-system-manage.j2#L76-L111 | 09:21 |
noonedeadpunk | and yes, lxc-dnsmasq would remove/add iptables rules | 09:22 |
noonedeadpunk | https://opendev.org/openstack/openstack-ansible-lxc_hosts/src/branch/master/tasks/lxc_net.yml#L89-L104 | 09:22 |
noonedeadpunk | but you also can `/usr/local/bin/lxc-system-manage iptables-recreate` | 09:23 |
jrosser | oh wow i completely missed that! | 09:23 |
jrosser | gokhan_: ^ this is stuff to know about | 09:25 |
noonedeadpunk | eventually we can add some "custom" rules to that template if that's gonna help | 09:41 |
opendevreview | Merged openstack/openstack-ansible stable/2023.2: Remove the get_md5 parameter from ansible stat tasks https://review.opendev.org/c/openstack/openstack-ansible/+/927720 | 10:08 |
gokhan_ | noonedeadpunk, thanks noonedeadpunk , we restarted lxc-dnsmasq but they are not applied. I am trying now | 10:11 |
gokhan_ | noonedeadpunk, it is not changed. Chain FORWARD (policy DROP) > policy is DROP but on rebooted host it is Chain FORWARD (policy ACCEPT) | 10:26 |
gokhan_ | same iptables rules are recreated | 10:26 |
noonedeadpunk | so the service totally does not change the default policy on chains | 10:36 |
noonedeadpunk | I don't think docker does this either | 10:36 |
noonedeadpunk | ah.... | 10:37 |
noonedeadpunk | service ensures forward only for lxc_bridge, not mgmt_bridge | 10:37 |
gokhan_ | network connection issue is solved by running "sudo iptables -P FORWARD ACCEP" | 10:38 |
gokhan_ | network connection issue is solved by running "sudo iptables -P FORWARD ACCEPT" | 10:39 |
gokhan_ | noonedeadpunk, I didn't find another solution except the one above | 10:40 |
noonedeadpunk | iptables -I FORWARD -i "br-mgmt" -j ACCEPT ? | 10:41 |
gokhan_ | docker restart also did not work | 10:41 |
gokhan_ | noonedeadpunk, I am trying | 10:42 |
gokhan_ | it also worked | 10:45 |
gokhan_ | noonedeadpunk, jrosser I have tested with docker installation on a vm, docker is changing iptables forward chain policy from accept to drop. | 10:48 |
noonedeadpunk | well.... | 10:48 |
noonedeadpunk | this used to be really nice role to manage iptables rules: https://github.com/logan2211/ansible-iptables | 10:49 |
jrosser | we use that ^ | 10:51 |
jrosser | but we also have an unmerged PR there for 4 years :( | 10:52 |
noonedeadpunk | oops, quite a crucial one btw | 11:19 |
jrosser | looks like logan- is still here in irc..... | 11:20 |
noonedeadpunk | at worst I hope seeing him in couple of months, so potentially can bug him about things :p | 11:24 |
noonedeadpunk | One thing has been reported to me here. Apparently, magnum with heat driver (at least with heat) does try to use `amphora` octavia_provider which is default | 11:43 |
noonedeadpunk | and I've proposed patch (which we've merged) which removes this provider and leaves only amphorav2 | 11:43 |
noonedeadpunk | so I was wondering if we should maybe rollback (or not) and have `amphora` provider along with `amphorav2` | 11:55 |
noonedeadpunk | as `amphora` will call the v2 anyway. | 11:55 |
noonedeadpunk | no idea though if it's going to be the same in the future or not. Or Magnum should adjust the default to point to v2 | 11:56 |
jrosser | noonedeadpunk: there is `octavia_provider` label but having that be the old value by default is not good | 12:19 |
noonedeadpunk | yeah, it's "old" default. | 12:20 |
noonedeadpunk | johnsom: any insight if the `amphora` provider is expected to exist in deployments, or is having just `amphorav2` fine? | 12:21 |
noonedeadpunk | and if `amphora` is going to be kept in octavia for the future as well? | 12:21 |
jrosser | i wonder if we should revert converting the repo server to apache | 12:29 |
noonedeadpunk | I wanna fix mpms this week for sure | 12:30 |
noonedeadpunk | and backport to 2024.1 | 12:30 |
noonedeadpunk | as seems that skyline/keystone is already an issue | 12:30 |
jrosser | are you going to look at fixing up everything being on apache? | 12:34 |
jrosser | if so i will leave it alone | 12:34 |
jrosser | there are some surprise failures to come as we've not been testing the right upgrades too | 12:34 |
noonedeadpunk | I was going to iterate through mpm modules and disable all except one that's being defined | 12:39 |
noonedeadpunk | and introduce global variable to set the mpm | 12:40 |
jrosser | ah i think also the wrong upgrade branch is why we are missing a bunch of logs of /etc for upgrade jobs | 12:46 |
jrosser | the log collection at the end depends on tools which should have been installed from the starting branch, and they are missing (like parallel) | 12:47 |
jrosser | and the same will affect slurp upgrades as we need the tools for master but set things up initially two branches back | 12:48 |
noonedeadpunk | yeah | 12:49 |
jrosser | i think that's a simple fix | 12:49 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible stable/2023.2: Ensure "parallel" package is installed for CI log collection https://review.opendev.org/c/openstack/openstack-ansible/+/928790 | 12:56 |
noonedeadpunk | which mpm do we want as the default? event as in keystone or worker as in horizon? | 12:56 |
noonedeadpunk | I frankly can't recall exact difference between these 2 already :( | 12:57 |
jrosser | i have no idea tbh :/ | 12:57 |
noonedeadpunk | sounds like event is better | 12:59 |
noonedeadpunk | or well, like it's improved worker | 12:59 |
opendevreview | Merged openstack/openstack-ansible stable/2024.1: Remove extra slash character from horizon haproxy healthcheck url. https://review.opendev.org/c/openstack/openstack-ansible/+/927264 | 13:20 |
opendevreview | Merged openstack/openstack-ansible-os_neutron master: Improve OVN cluster setup idempotence report https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/928618 | 13:37 |
opendevreview | Merged openstack/openstack-ansible-os_neutron master: Do not kill ipsec on L3 cleanup https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/927992 | 13:37 |
opendevreview | Merged openstack/openstack-ansible-plugins master: Add infrastructure playbooks to openstack-ansible-plugins collection https://review.opendev.org/c/openstack/openstack-ansible-plugins/+/924171 | 13:38 |
opendevreview | Merged openstack/openstack-ansible-os_ceilometer stable/2024.1: Add support for Magnum notifications https://review.opendev.org/c/openstack/openstack-ansible-os_ceilometer/+/927812 | 13:41 |
opendevreview | Merged openstack/openstack-ansible-os_neutron master: Remove ns-metadata-proxy cleanuop handler https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/927993 | 13:53 |
johnsom | noonedeadpunk amphora will be permanent, amphorav2 may go away at some point. | 13:59 |
johnsom | So, yeah, keep using amphora | 14:01 |
opendevreview | Merged openstack/openstack-ansible-ops master: Update magnum-cluster-api version https://review.opendev.org/c/openstack/openstack-ansible-ops/+/928613 | 14:01 |
noonedeadpunk | johnsom: and `octavia` is just removed? | 14:04 |
noonedeadpunk | for some reason I thought that it will remain with v2 :( | 14:06 |
noonedeadpunk | probably completely misunderstood some discussion | 14:06 |
johnsom | Yeah, at some point "octavia" might go away. people didn't like that one as we have multiple providers now, so lobbied to change to "amphora" | 14:13 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_octavia master: Return `amphora` provider back https://review.opendev.org/c/openstack/openstack-ansible-os_octavia/+/928815 | 14:13 |
noonedeadpunk | well, versioned amphora also makes sense to me kinda | 14:13 |
johnsom | noonedeadpunk as of master branch, they are all the same now | 14:13 |
johnsom | Yeah, but the code for v1 is going away | 14:13 |
noonedeadpunk | yeah, that part I know, though thought some were marked for removal anyway in the future | 14:14 |
noonedeadpunk | and that is why having `amphora` felt a bit confusing I guess | 14:14 |
johnsom | For a deployment project, "amphora" will always be the right answer | 14:14 |
noonedeadpunk | as `amphorav2` feels more natural given that the code is quite different | 14:14 |
noonedeadpunk | ok, yeah, I see | 14:14 |
noonedeadpunk | we just switched `default_provider_driver = amphorav2` some time ago.... | 14:15 |
noonedeadpunk | #startmeeting openstack_ansible_meeting | 15:00 |
opendevmeet | Meeting started Tue Sep 10 15:00:20 2024 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'openstack_ansible_meeting' | 15:00 |
noonedeadpunk | #topic rollcall | 15:00 |
noonedeadpunk | o/ | 15:00 |
hamburgler | o/ | 15:00 |
NeilHanlon | o/ | 15:01 |
jrosser | o/ hello | 15:01 |
noonedeadpunk | #topic office hours | 15:03 |
noonedeadpunk | so, noble test jobs finally merged | 15:03 |
noonedeadpunk | though we've missed moving noble with playbooks | 15:04 |
noonedeadpunk | and the fix failed on gate intermittently and currently in recheck | 15:05 |
noonedeadpunk | #link https://review.opendev.org/c/openstack/openstack-ansible-plugins/+/928592/3 | 15:05 |
noonedeadpunk | There is also a current issue with apache on metal | 15:05 |
noonedeadpunk | as we're using different MPMs across roles, which causes upgrade job failures | 15:06 |
noonedeadpunk | (once upgrade jobs track correct branch) | 15:06 |
noonedeadpunk | so whatever fix is needed should be backported to 2024.1 | 15:07 |
jrosser | i found that by trying to understand the job failures in more depth | 15:07 |
noonedeadpunk | and i guess this should be kinda last thing for backport before doing first minor release | 15:07 |
noonedeadpunk | Ah, except octavia thing that I realized just today | 15:07 |
noonedeadpunk | #link https://review.opendev.org/c/openstack/openstack-ansible-os_octavia/+/928815 | 15:08 |
jrosser | do we have broken apache/metal on 2024.1? | 15:08 |
noonedeadpunk | yeah | 15:08 |
jrosser | oh dear, ok | 15:08 |
noonedeadpunk | I think that second run of playbooks will break it | 15:08 |
jrosser | fixing the upgrade job branch could bring more CI trouble, just a release earlier | 15:09 |
noonedeadpunk | yeah, true | 15:11 |
noonedeadpunk | so there's quite some things to work on, but not sure what needs deeper discussion | 15:14 |
jrosser | i found the horizon compress failure is not specifically an OSA issue | 15:16 |
noonedeadpunk | oh | 15:16 |
jrosser | it apparently occurs when installing UCA packages, as part of building debian packages, and also in devstack | 15:17 |
jrosser | there is a bug which is now correctly assigned to the horizon project https://bugs.launchpad.net/horizon/+bug/2045394 | 15:17 |
jrosser | i also spent some time looking at why jobs fail to get u-c when that should be from the disk | 15:18 |
jrosser | and unfortunately that happens a lot in upgrade jobs and there are insufficient logs collected | 15:19 |
jrosser | this (+ a backport) should address the log collection https://review.opendev.org/c/openstack/openstack-ansible/+/928790 | 15:20 |
jrosser | but that is kind of hard to test | 15:20 |
noonedeadpunk | it looks reasonable enough | 15:35 |
jrosser | for the u-c errors it is clear that the code takes the path for the url being https:// rather than file:// | 15:38 |
jrosser | but why it does that is not obvious yet - it could be that we have changed the way that the redirection of the URLs to files works between releases | 15:39 |
jrosser | so what is set up for the initial upgrade branch does not do the right thing for the target branch | 15:39 |
jrosser | i think this is the most likely explanation for those kind of errors | 15:40 |
noonedeadpunk | so if that for upgrade jobs only - that might be the case | 15:47 |
noonedeadpunk | as there we kind of ignore zuul-provided repos | 15:47 |
noonedeadpunk | just to leave them in "original" state to preserve depends-on | 15:48 |
noonedeadpunk | which could explain why upgrade on N-1 might try to do web fetch of u-c | 15:48 |
jrosser | how do i discover where the opensearch log collection service is? | 15:53 |
jrosser | ^ for CI jobs | 15:53 |
jrosser | ML says https://opensearch.logs.openstack.org/_dashboards/app/discover?security_tenant=global | 15:56 |
noonedeadpunk | #endmeeting | 16:06 |
opendevmeet | Meeting ended Tue Sep 10 16:06:38 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:06 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2024/openstack_ansible_meeting.2024-09-10-15.00.html | 16:06 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2024/openstack_ansible_meeting.2024-09-10-15.00.txt | 16:06 |
opendevmeet | Log: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2024/openstack_ansible_meeting.2024-09-10-15.00.log.html | 16:06 |
jrosser | https://mariadb.com/newsroom/press-releases/k1-acquires-a-leading-database-software-company-mariadb-and-appoints-new-ceo/ | 16:31 |
noonedeadpunk | wow | 16:41 |
noonedeadpunk | no good example of k1 investments in examples.... | 16:42 |
opendevreview | Merged openstack/openstack-ansible-plugins master: Verify OS for containers installation https://review.opendev.org/c/openstack/openstack-ansible-plugins/+/928591 | 18:25 |
opendevreview | Merged openstack/openstack-ansible-plugins master: Add Ubuntu 24.04 to supported by playbook versions https://review.opendev.org/c/openstack/openstack-ansible-plugins/+/928592 | 18:25 |
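
The FORWARD-chain diagnosis reached in the 08:56 to 10:48 part of the log can be sketched as a small read-only shell check. This is a minimal sketch, not something shipped by openstack-ansible or Docker: the helper names `forward_policy`, `bridge_forward_allowed` and `check_bridge_forwarding` are hypothetical, and the script assumes the rule-listing format printed by `iptables -S FORWARD` (lines such as `-P FORWARD DROP` and `-A FORWARD -i br-mgmt -j ACCEPT`). It ignores rule ordering and any other chains, so treat its answer as a hint rather than a verdict.

```shell
#!/bin/sh
# Hypothetical helpers (not part of openstack-ansible or Docker).
# All of them read `iptables -S FORWARD` style output on stdin.

# Print the default policy of the FORWARD chain (e.g. ACCEPT or DROP).
forward_policy() {
    awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}

# Succeed if an unconditional ACCEPT rule exists for traffic entering
# via the given bridge, e.g. "-A FORWARD -i br-mgmt -j ACCEPT".
bridge_forward_allowed() {
    grep -qxF -e "-A FORWARD -i $1 -j ACCEPT"
}

# Report "blocked" only when the policy is DROP and no ACCEPT rule
# for the bridge is present; otherwise report "ok".
check_bridge_forwarding() {
    bridge="$1"
    rules="$(cat)"
    policy="$(printf '%s\n' "$rules" | forward_policy)"
    if [ "$policy" = "DROP" ] &&
        ! printf '%s\n' "$rules" | bridge_forward_allowed "$bridge"; then
        echo "blocked"
    else
        echo "ok"
    fi
}

# Usage on a host (needs root):
#   iptables -S FORWARD | check_bridge_forwarding br-mgmt
# If it prints "blocked", the rule suggested in the log is:
#   iptables -I FORWARD -i br-mgmt -j ACCEPT
```

Inserting a per-bridge ACCEPT rule, as suggested at 10:41, is narrower than switching the whole chain policy back to ACCEPT, which is why the check looks for that specific rule rather than only at the policy.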