*** elodilles is now known as elodilles_pto | 06:33 | |
*** tkajinam is now known as Guest8639 | 09:04 | |
sean-k-mooney | o/ | 11:18 |
---|---|---|
sean-k-mooney | we just noticed the openstackerview bot is not posting in #openstack-nova | 11:18 |
sean-k-mooney | any idea what can cause that and if we can just turn it off and on again to fix it? | 11:19 |
sean-k-mooney | i should say opendevreview | 11:20 |
sean-k-mooney | i think its https://opendev.org/opendev/gerritbot | 11:22 |
sean-k-mooney | nothing obvious in https://opendev.org/openstack/project-config/src/branch/master/gerritbot/channels.yaml | 11:24 |
noonedeadpunk | ++ same here | 12:06 |
fungi | last gerritbot service restart was on 2025-01-31 at least | 12:58 |
fungi | my guess is it got disconnected during the unexpected gerrit outage over the weekend and doesn't notice the stream socket it's listening on is dead and needs to be reopened | 12:59 |
fungi | i'll restart it | 13:00 |
fungi | sean-k-mooney: noonedeadpunk: it should be resolved as of about 15 minutes ago | 13:17 |
fungi | do let us know if it continues to be silent for new events now | 13:18 |
noonedeadpunk | thanks! at least the bot is in the channel now | 13:23 |
sean-k-mooney | cool ill let you know if i notice any issues | 13:35 |
sean-k-mooney | its one bot for all the channels yes? | 13:36 |
sean-k-mooney | we dont need to restart it per project | 13:36 |
fungi | correct. note that it doesn't always hang out in every channel, it dynamically joins channels it has messages for and parts the least recently used channel when necessary in order to keep under the joined channels limit for the irc network | 13:43 |
sean-k-mooney | ack so it will show up the first time there is a gerrit event for one of the channels listed in its config | 13:47 |
sean-k-mooney | and rejoin as needed | 13:47 |
fungi | yes | 13:59 |
sean-k-mooney | cool well its working we just got the first notification, thanks for fixing it :) | 14:03 |
fungi | no problem, we probably ought to implement some sort of dead peer detection to catch that situation and reopen the stream connection, but this comes up pretty infrequently | 14:07 |
sean-k-mooney | ya i dont recall the last time i noticed | 14:07 |
sean-k-mooney | so yes but meh | 14:08 |
fungi | basically not a problem for a clean gerrit shutdown, but if the hypervisor host where its vm is running crashes out from under everything, we get pathological behaviors | 14:10 |
sean-k-mooney | i dont know how you run it, but it might be best for the bot to just exit if the gerrit connection drops and rely on systemd/docker to restart it | 14:11 |
sean-k-mooney | maybe with a limited reconnect ie. try reconnecting 1-3 times before bailing | 14:12 |
sean-k-mooney | we do something like that in the nova-compute, if we lose the rabbit connection oslo-messaging tries to reconnect up to 3 times then we just exit and wait for the service manager to restart us | 14:13 |
sean-k-mooney | service manager being systemd or docker or whatever is running the agent | 14:13 |
sean-k-mooney | but if it happens very rarely its probably not worth fixing now | 14:14 |
fungi | sean-k-mooney: well, the challenge is detecting that the connection has died | 14:16 |
fungi | the bot establishes a tcp connection to the gerrit server and then receives event messages on that connection. if the server side of the connection goes away without shutting down the connection, then the client has no way of knowing because it's essentially a one-way stream so there's no reason for the client to send any new packets to the server | 14:18 |
fungi | a dpd implementation would address that (wherein the client sends periodic no-op packets to the server, and would therefore receive a tcp/rst back if that socket is no longer valid when the server comes back up) | 14:19 |
fungi | (or in our case possibly an icmp/admin-prohibited error since it would get caught by iptables) | 14:21 |
sean-k-mooney | well wont the tcp connection drop eventually | 14:21 |
fungi | no, not without something to force it to drop | 14:21 |
sean-k-mooney | would tcp keepalive fix that | 14:21 |
fungi | potentially, since that would eventually result in a keepalive message getting rejected | 14:22 |
fungi | that's one way of doing dead peer detection | 14:22 |
sean-k-mooney | right we would not get the ack | 14:22 |
sean-k-mooney | its what we rely on for rabbit mainly. | 14:22 |
sean-k-mooney | we have both our application heartbeat and tcp keepalive | 14:22 |
sean-k-mooney | sending packets periodically and we detect the drop as a result | 14:23 |
fungi | an ssh keepalive might also work in this case since that's the layer 7 protocol, though i don't know whether the embedded mina-sshd inside gerrit supports it | 14:23 |
sean-k-mooney | i assume that is what zuul does too | 14:23 |
sean-k-mooney | sure although ssh is just tcp so again the kernel tcp keepalive can help if the application server cant | 14:23 |
fungi | i think so, though i don't know that it does any direct socket manipulation since these are ssh connections handled by a separate library | 14:23 |
fungi | so the lib that opens the connection would need to have a feature for that | 14:24 |
fungi | or at least expose the resulting tcp socket object | 14:24 |
sean-k-mooney | i dont want to nerd snipe you with this by the way if you have better things to do | 14:26 |
sean-k-mooney | ah so the base bot is coming from an external lib https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L23 | 14:27 |
fungi | nah, it's worth revisiting. looking at the chain of what we use, gerritbot relies on gerritlib which in turn uses paramiko to make the connection to the gerrit server's api port and invoke its stream-events command | 14:27 |
fungi | zuul similarly relies on paramiko for the gerrit stream-events connection | 14:28 |
sean-k-mooney | right using the gerrit client via ssh is what most external systems do to get the stream-events | 14:28 |
sean-k-mooney | i dont think its available via the rest api | 14:29 |
fungi | zuul does do a paramiko.SSHClient().connect().get_transport().set_keepalive(somevalue) | 14:29 |
sean-k-mooney | ya so that might be all thats needed, set it to say 10 seconds or some small value | 14:30 |
sean-k-mooney | well small in that it should be less than a few minutes but large enough to not add much load on gerrit | 14:30 |
sean-k-mooney | i would guess zuul is somewhere in the 10-300 second range? | 14:31 |
fungi | gerritlib similarly calls set_keepalive() if specified, defaulting to off | 14:31 |
sean-k-mooney | ack so may just change https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L208-L209 | 14:32 |
fungi | it appears we do not specify a keep_alive value in gerritbot when instantiating the GerritConnection class | 14:32 |
fungi | so yes, probably a one-line fix | 14:33 |
sean-k-mooney | best type of fix other than a negative line fix :) | 14:33 |
sean-k-mooney | so ya it defaults to 0 https://opendev.org/opendev/gerritlib/src/branch/master/gerritlib/gerrit.py#L219-L231 | 14:34 |
fungi | right, which keeps it off | 14:34 |
fungi | looks like keep_alive_interval was added in https://review.opendev.org/750849 and appears in 0.10.0 (latest version is 0.11.0) | 14:37 |
sean-k-mooney | 4 years ago. i think we can assume its safe to use | 14:38 |
fungi | yep, mainly i was just confirming that we actually released a version with that feature | 14:38 |
sean-k-mooney | ack | 14:39 |
fungi | and to know what to set the minimum gerritlib version to now in requirements.txt | 14:39 |
TheJulia | Curious, has anyone observed MTU issues with Cirros? | 14:58 |
fungi | sean-k-mooney: https://review.opendev.org/c/opendev/gerritbot/+/941117 | 15:22 |
fungi | TheJulia: not that i've heard, what's the context? booting in devstack? what sort of network is attached? | 15:23 |
TheJulia | fungi: one of ironics ovn jobs. Looks like it is just bad config which impacts packet sizing | 15:29 |
fungi | makes sense, and yes that's where i tend to look first when hitting mtu problems, whether it's pmtud black holes, excessive fragmentation, et cetera | 15:30 |
fungi | it's almost always mismatched configuration somewhere (unfortunately when you're experiencing it across the internet, "somewhere" is often not under your control) | 15:31 |
TheJulia | I semi-suspect cirros doesn't have pmtud | 15:42 |
sean-k-mooney | it depend on the version | 15:42 |
sean-k-mooney | are you using 0.6.x | 15:42 |
TheJulia | 0.6.2 or 0.6.3 | 15:42 |
sean-k-mooney | i think it should have path mtu then let me see if i can find that commit quickly | 15:43 |
opendevreview | Merged openstack/project-config master: Revert "Temporarily remove zuul-providers from zuul tenant" https://review.opendev.org/c/openstack/project-config/+/940902 | 15:43 |
sean-k-mooney | i might be misremembering but i know between 0.5 and 0.6 the dhcp stuff was rewritten i think mainly for ipv6 | 15:44 |
TheJulia | sean-k-mooney: thanks, In the meantime I suspect I need to review recent changes to the mtu code for the hard wired sub interfaces because I think that is actually what is causing things to go sideways in a static state. Full OSes like Centos Stream seem like they don't blink at it nor generate dropped packets | 15:44 |
sean-k-mooney | TheJulia: well you can only have one mtu on an l2 broadcast domain i.e on one port | 15:45 |
sean-k-mooney | vlan sub interface are a gray area. | 15:46 |
fungi | yeah, pmtud only helps you with layer 3 | 15:46 |
sean-k-mooney | in principle i guess it would be valid for sub interfaces to be different but not really | 15:46 |
TheJulia | eh, vlans are not that gray, but it looks like its blowing up on layer3 traffic | 15:46 |
fungi | if there's no router, there's nothing to emit the necessary responses | 15:46 |
sean-k-mooney | well vlans are still part of the same l2 broadcast domain as the non tagged traffic | 15:46 |
TheJulia | anyway, I see a mismatch in the config somehow, so I need to hunt that down first. | 15:46 |
sean-k-mooney | so i was thinking of https://github.com/cirros-dev/cirros/blob/main/ChangeLog#L20-L22 | 15:47 |
TheJulia | so, based upon the dropped packets, its definitely getting what it thinks it should be from dhcp | 15:48 |
sean-k-mooney | maybe related to this https://github.com/cirros-dev/cirros/pull/113/files | 15:48 |
TheJulia | because ovn's logs match | 15:49 |
TheJulia | okay, cool, so that at least confirms it | 15:49 |
TheJulia | I know where the problem is then | 15:49 |
* TheJulia makes more coffee first | 15:49 | |
sean-k-mooney | so that was fixed 7 months ago but i dont know if its in a release | 15:50 |
sean-k-mooney | apparently should be in 0.6.3 https://github.com/cirros-dev/cirros/compare/0.6.3...main | 15:51 |
clarkb | openstack/ovs/ovn etc networking really breaks the whole pmtu thing due to the proliferation of l2 devices | 15:55 |
TheJulia | sean-k-mooney: so interesting because I think the date on 0.6.3 in the mirror is from like 2023 | 15:55 |
clarkb | I believe that neutron is now supposed to figure this out and set it via dhcp because you can't rely on the network doing it for you | 15:55 |
sean-k-mooney | yes and no | 15:56 |
sean-k-mooney | neutron asked us to not set it in the metadata if dhcp is enabled | 15:56 |
sean-k-mooney | so nova wont add the mtu if dhcp is enabled on the subnet | 15:56 |
sean-k-mooney | if dhcp is not enabled on the subnet we will provide the mtu so that glean etc. can statically configure it when setting the ip | 15:57 |
sean-k-mooney | however i dont think we ever provide metadata for neutron trunk ports | 15:57 |
clarkb | right because neutron is supposed to be the overseer that figures out what the minimum mtu is across the board then set it via dhcp | 15:57 |
sean-k-mooney | and ironic port groups dont exist in either neutron or nova | 15:57 |
sean-k-mooney | so neither project knows about them | 15:57 |
clarkb | the fundamental problem is that overlay networking is a bunch of l2 devices tied together and none of those can do mtu discovery | 15:58 |
sean-k-mooney | clarkb: the mtu is actually something you can set when creating the network as long as you request one equal to or below what the operator configures in the neutron server config | 15:59 |
clarkb | which is something you can't know... | 15:59 |
sean-k-mooney | yep | 15:59 |
fungi | right, path mtu discovery is more about "my packets are going to traverse a number of networks connected by routers, and some of those segments may have a lower mtu than mine, so i want to limit the size of my packets in order to avoid them getting fragmented when they pass through those networks" | 15:59 |
sean-k-mooney | so this is why i say vlan sub interfaces should not have a different mtu from the parent | 15:59 |
sean-k-mooney | path mtu discovery should not really care about any of this if the routes support it | 16:00 |
sean-k-mooney | cause thats working at the path/l3 level | 16:00 |
clarkb | right the problem is that it isn't sufficient so you end up with misconfigured interfaces if you don't have something else managing it | 16:00 |
sean-k-mooney | but pmtu is kind of flaky | 16:00 |
clarkb | the point I'm trying to get at is no one should rely on pmtu within openstack or ovn networking | 16:01 |
sean-k-mooney | TheJulia: actually one regression in ovn vs ml2/ovs | 16:01 |
clarkb | so imo it is wrong to debug if cirros has pmtud installed and running. It doesn't matter | 16:01 |
TheJulia | sean-k-mooney: I know, the bug is noted in ironic's docs | 16:01 |
sean-k-mooney | ovn does not support creating routers between networks with different mtus | 16:01 |
sean-k-mooney | you can create them | 16:01 |
sean-k-mooney | but it wont do fragmentation and reassembly | 16:02 |
TheJulia | a good chunk of this entire back and forth has been discussed 2-4 prior times here :) | 16:02 |
sean-k-mooney | like the standalone neutron router used with ml2/ovs | 16:02 |
sean-k-mooney | ack | 16:02 |
fungi | well, pmtud can have a role to play, but when you're using overlay networks it's the underlay that needs to worry about pmtud for the hops it manages since those are transparent virtual layer 2 from the guest's perspective | 16:02 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 17:40 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Remove Freezer DR from infra https://review.opendev.org/c/openstack/project-config/+/938184 | 18:20 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 18:20 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 18:22 |
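
The one-line fix discussed between 14:29 and 14:33 amounts to turning on paramiko's transport keepalive, which gerritlib leaves off by default. Below is a minimal sketch of that call sequence, assuming an already-connected `paramiko.SSHClient`. The helper name `enable_ssh_keepalive` and the 60-second default are illustrative assumptions, not values taken from gerritbot, gerritlib, or zuul.

```python
def enable_ssh_keepalive(client, interval_seconds=60):
    """Enable SSH-level keepalives on an already-connected SSH client.

    `client` is expected to be a paramiko.SSHClient (or any object exposing
    the same get_transport() method). The helper name and the 60-second
    default are hypothetical, chosen only for illustration.
    """
    transport = client.get_transport()
    if transport is None:
        # paramiko returns None here when the client has no open session.
        raise RuntimeError("SSH client is not connected")
    # After this many seconds without sending data, paramiko sends a
    # keepalive packet; an interval of 0 turns keepalives off.
    transport.set_keepalive(interval_seconds)
    return transport
```

With real paramiko this would be called right after `client.connect(...)`. Note that `connect()` itself returns `None`, so the chained form quoted in the log is shorthand for these separate steps. Because the client now writes to the socket periodically, a server that vanished without closing the connection should eventually produce an error on the client side instead of leaving the stream silently dead.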
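
The alternative floated at 14:11 to 14:13 (retry a few times, then exit and let systemd or docker restart the process) can be sketched as follows. This is not how nova-compute or oslo.messaging implement it; the function name `run_with_limited_reconnects`, its parameters, and the choice of `ConnectionError` as the failure signal are all assumptions made for the example.

```python
import sys
import time


def run_with_limited_reconnects(connect_and_stream, max_attempts=3,
                                delay_seconds=1.0, sleep=time.sleep):
    """Call connect_and_stream() until it returns cleanly or attempts run out.

    Returns the number of attempts used when a call returns without error.
    Raises SystemExit(1) after max_attempts ConnectionError failures in a
    row, so an external service manager can restart the process.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect_and_stream()
            return attempt
        except ConnectionError as exc:
            print(f"attempt {attempt}/{max_attempts} failed: {exc}",
                  file=sys.stderr)
            if attempt < max_attempts:
                # Pause briefly before the next reconnect attempt.
                sleep(delay_seconds)
    # Out of attempts: exit non-zero and leave recovery to the supervisor.
    sys.exit(1)
```

As fungi points out in the log, this only helps once the client can tell the connection is dead, so it complements a keepalive rather than replacing one.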