Monday, 2025-02-10

*** elodilles is now known as elodilles_pto06:33
*** tkajinam is now known as Guest863909:04
sean-k-mooneyo/11:18
sean-k-mooneywe just noticed the openstackreview bot is not posting in #openstack-nova11:18
sean-k-mooneyany idea what can cause that and if we can just turn it off and on again to fix it?11:19
sean-k-mooneyi should say opendevreview11:20
sean-k-mooneyi think its https://opendev.org/opendev/gerritbot11:22
sean-k-mooneynothing obvious in https://opendev.org/openstack/project-config/src/branch/master/gerritbot/channels.yaml11:24
noonedeadpunk++ same here12:06
fungilast gerritbot service restart was on 2025-01-31 at least12:58
fungimy guess is it got disconnected during the unexpected gerrit outage over the weekend and doesn't notice the stream socket it's listening on is dead and needs to be reopened12:59
fungii'll restart it13:00
fungisean-k-mooney: noonedeadpunk: it should be resolved as of about 15 minutes ago13:17
fungido let us know if it continues to be silent for new events now13:18
noonedeadpunkthanks! at least the bot is in the channel now13:23
sean-k-mooneycool ill let you know if i notice any issues13:35
sean-k-mooneyits one bot for all the channels yes?13:36
sean-k-mooneywe dont need to restart it per project13:36
fungicorrect. note that it doesn't always hang out in every channel, it dynamically joins channels it has messages for and parts the least recently used channel when necessary in order to keep under the joined channels limit for the irc network13:43
sean-k-mooneyack so it will show up the first time there is a gerrit event for one of the channels listed in its config13:47
sean-k-mooneyand rejoin as needed13:47
fungiyes13:59
sean-k-mooneycool well its working we just got the first notification, thanks for fixing it :) 14:03
fungino problem, we probably ought to implement some sort of dead peer detection to catch that situation and reopen the stream connection, but this comes up pretty infrequently14:07
sean-k-mooneyya i dont recall the last time i noticed14:07
sean-k-mooneyso yes but meh14:08
fungibasically not a problem for a clean gerrit shutdown, but if the hypervisor host where its vm is running crashes out from under everything, we get pathological behaviors14:10
sean-k-mooneyi dont know how you run it, but it might be best for the bot to just exit if the gerrit connection drops and rely on systemd/docker to restart it14:11
sean-k-mooneymaybe with a limited reconnect i.e. try reconnecting 1-3 times before bailing14:12
sean-k-mooneywe do something like that in nova-compute, if we lose the rabbit connection oslo-messaging tries to reconnect up to 3 times then we just exit and wait for the service manager to restart us14:13
sean-k-mooneyservice manager being systemd or docker or whatever is running the agent14:13
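
The "retry a few times, then exit and let the service manager restart us" idea above could look roughly like the sketch below. This is not gerritbot or nova code: `connect` and `consume` are hypothetical callables, and the sketch assumes they raise `ConnectionError` when the connection is lost.

```python
import time


def run_with_limited_reconnects(connect, consume, max_attempts=3,
                                delay=1.0, sleep=time.sleep):
    """Return 0 on a clean end of stream, 1 after max_attempts failures in a row.

    connect() and consume(conn) are hypothetical callables supplied by the
    caller; both are assumed to raise ConnectionError on a lost connection.
    """
    failures = 0
    while failures < max_attempts:
        try:
            conn = connect()
        except ConnectionError:
            failures += 1      # another failed connection attempt
            sleep(delay)
            continue
        failures = 0           # connected, so start counting from zero again
        try:
            consume(conn)
            return 0           # stream ended cleanly
        except ConnectionError:
            failures += 1      # dropped mid-stream: go around and reconnect
            sleep(delay)
    return 1                   # give up; systemd/docker is expected to restart us
```

A caller would typically end with `sys.exit(run_with_limited_reconnects(...))` so that a non-zero exit status triggers the restart policy. This only helps once the drop is actually detected, which is the harder part.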
sean-k-mooneybut if it happens very rarely its probably not worth fixing now14:14
fungisean-k-mooney: well, the challenge is detecting that the connection has died14:16
fungithe bot establishes a tcp connection to the gerrit server and then receives event messages on that connection. if the server side of the connection goes away without shutting down the connection, then the client has no way of knowing because it's essentially a one-way stream so there's no reason for the client to send any new packets to the server14:18
fungia dpd implementation would address that (wherein the client sends periodic no-op packets to the server, and would therefore receive a tcp/rst back if that socket is no longer valid when the server comes back up)14:19
fungi(or in our case possibly an icmp/admin-prohibited error since it would get caught by iptables)14:21
sean-k-mooneywell wont the tcp connection drop eventually14:21
fungino, not without something to force it to drop14:21
sean-k-mooneywould tcp keepalive fix that14:21
fungipotentially, since that would eventually result in a keepalive message getting rejected14:22
fungithat's one way of doing dead peer detection14:22
sean-k-mooneyright we would not get the ack14:22
sean-k-mooneyits what we rely on for rabbit mainly.14:22
sean-k-mooneywe have both our application heartbeat and tcp keepalive14:22
sean-k-mooneysending packets periodically and we detect the drop as a result14:23
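
For reference, kernel-level TCP keepalive as discussed above can be switched on per socket. The sketch below uses only Python's standard `socket` module. The helper name and the timing values are illustrative assumptions, and the `TCP_KEEP*` tuning options are platform-dependent, so they are applied only where the platform defines them.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive probes for an existing socket.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel declares the peer dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values a silently dead peer would be noticed after roughly `idle + interval * count` seconds, at which point a blocked read on the socket fails instead of waiting forever.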
fungian ssh keepalive might also work in this case since that's the layer 7 protocol, though i don't know whether the embedded mina-sshd inside gerrit supports it14:23
sean-k-mooneyi assume that is what zuul does too14:23
sean-k-mooneysure although ssh is just tcp so again the kernel tcp keepalive can help if the application server cant14:23
fungii think so, though i don't know that it does any direct socket manipulation since these are ssh connections handled by a separate library14:23
fungiso the lib that opens the connection would need to have a feature for that14:24
fungior at least expose the resulting tcp socket object14:24
sean-k-mooneyi dont want to nerd snipe you with this by the way if you have better things to do14:26
sean-k-mooneyah so the base bot is coming from an external lib https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L2314:27
funginah, it's worth revisiting. looking at the chain of what we use, gerritbot relies on gerritlib which in turn uses paramiko to make the connection to the gerrit server's api port and invoke its stream-events command14:27
fungizuul similarly relies on paramiko for the gerrit stream-events connection14:28
sean-k-mooneyright using the gerrit client via ssh is what most external systems do to get the stream-events14:28
sean-k-mooneyi dont think its available via the rest api14:29
fungizuul does do a paramiko.SSHClient().connect().get_transport().set_keepalive(somevalue)14:29
sean-k-mooneyya so that might be all thats needed, set it to say 10 seconds or some small value14:30
sean-k-mooneywell small in that it should be less than a few minutes but large enough to not add much load on gerrit14:30
sean-k-mooneyi would guess zuul is somewhere in the 10-300 second range?14:31
fungigerritlib similarly calls set_keepalive() if specified, defaulting to off14:31
sean-k-mooneyack so may just change https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L208-L20914:32
fungiit appears we do not specify a keep_alive value in gerritbot when instantiating the GerritConnection class14:32
fungiso yes, probably a one-line fix14:33
sean-k-mooneybest type of fix other than a negative line fix :)14:33
sean-k-mooneyso ya it defaults to 0 https://opendev.org/opendev/gerritlib/src/branch/master/gerritlib/gerrit.py#L219-L23114:34
fungiright, which keeps it off14:34
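
As a rough illustration of the SSH-level keepalive mentioned above: paramiko's `Transport.set_keepalive(interval)` makes the client send a keepalive packet every `interval` seconds, and `0` disables it. The helper below is a hypothetical sketch, not the gerritlib or zuul code, and the 60-second default is only an assumed value.

```python
def enable_ssh_keepalive(client, interval=60):
    """Ask the SSH transport to send a keepalive packet every `interval` seconds.

    `client` is expected to be a connected paramiko.SSHClient, or any object
    exposing the same get_transport() method.
    """
    transport = client.get_transport()
    if transport is None:
        raise RuntimeError("SSH client is not connected")
    transport.set_keepalive(interval)
    return transport


# Hypothetical usage; host and username are placeholders:
#   import paramiko
#   client = paramiko.SSHClient()
#   client.load_system_host_keys()
#   client.connect("gerrit.example.org", port=29418, username="bot")
#   enable_ssh_keepalive(client, interval=60)
```

Note that `SSHClient.connect()` itself returns `None`, so the transport has to be fetched from the client object after connecting.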
fungilooks like keep_alive_interval was added in https://review.opendev.org/750849 and appears in 0.10.0 (latest version is 0.11.0)14:37
sean-k-mooney4 years ago. i think we can assume its safe to use14:38
fungiyep, mainly i was just confirming that we actually released a version with that feature14:38
sean-k-mooneyack14:39
fungiand to know what to set the minimum gerritlib version to now in requirements.txt14:39
TheJuliaCurious, has anyone observed MTU issues with Cirros?14:58
fungisean-k-mooney: https://review.opendev.org/c/opendev/gerritbot/+/94111715:22
fungiTheJulia: not that i've heard, what's the context? booting in devstack? what sort of network is attached?15:23
TheJuliafungi: one of ironics ovn jobs. Looks like it is just bad config which impacts packet sizing15:29
fungimakes sense, and yes that's where i tend to look first when hitting mtu problems, whether it's pmtud black holes, excessive fragmentation, et cetera15:30
fungiit's almost always mismatched configuration somewhere (unfortunately when you're experiencing it across the internet, "somewhere" is often not under your control)15:31
TheJuliaI semi-suspect cirros doesn't have pmtud15:42
sean-k-mooneyit depend on the version15:42
sean-k-mooneyare you using 0.6.x15:42
TheJulia0.6.2 or 0.6.315:42
sean-k-mooneyi think it should have path mtu then let me see if i can find that commit quickly15:43
opendevreviewMerged openstack/project-config master: Revert "Temporarily remove zuul-providers from zuul tenant"  https://review.opendev.org/c/openstack/project-config/+/94090215:43
sean-k-mooneyi might be misremembering but i know between 0.5 and 0.6 the dhcp stuff was rewritten i think mainly for ipv615:44
TheJuliasean-k-mooney: thanks, In the mean time I suspect I need to review recent changes to the mtu code for the hard wired sub interfaces because I think that is actually what is causing things to go sideways in a static state. Full OSes like CentOS Stream seem like they don't blink at it nor generate dropped packets15:44
sean-k-mooneyTheJulia: well you can only have one mtu on an l2 broadcast domain i.e on one port15:45
sean-k-mooneyvlan sub interface are a gray area.15:46
fungiyeah, pmtud only helps you with layer 315:46
sean-k-mooneyin principle i guess it would be valid for sub interfaces to be different but not really 15:46
TheJuliaeh, vlans are not that gray, but it looks like its blowing up on layer3 traffic15:46
fungiif there's no router, there's nothing to emit the necessary responses15:46
sean-k-mooneywell vlans are still part of the same l2 broadcast domain as the non tagged traffic15:46
TheJuliaanyway, I see a mismatch in the config somehow, so I need to hunt that down first.15:46
sean-k-mooneyso i was thinking of https://github.com/cirros-dev/cirros/blob/main/ChangeLog#L20-L2215:47
TheJuliaso, based upon the dropped packets, its definitely getting what it thinks it should be from dhcp15:48
sean-k-mooneymaybe related to this https://github.com/cirros-dev/cirros/pull/113/files 15:48
TheJuliabecause ovn's logs match15:49
TheJuliaokay, cool, so that at least confirms it15:49
TheJuliaI know where the problem is then15:49
* TheJulia makes more coffee first15:49
sean-k-mooneyso that was fixed 7 months ago but i dont know if its in a release15:50
sean-k-mooneyapparently should be in 0.6.3 https://github.com/cirros-dev/cirros/compare/0.6.3...main15:51
clarkbopenstack/ovs/ovn etc networking really breaks the whole pmtu thing due to the proliferation of l2 devices 15:55
TheJuliasean-k-mooney: so interesting because I think the date on 0.6.3 in the mirror is from like 202315:55
clarkbI believe that neutron is now supposed to figure this out and set it via dhcp because you can't rely on the network doing it for you15:55
sean-k-mooneyyes and no15:56
sean-k-mooneyneutron asked us to not set it in the metadata if dhcp is enabled15:56
sean-k-mooneyso nova wont add the mtu if dhcp is enabled on the subnet15:56
sean-k-mooneyif dhcp is not enabled on the subnet we will provide the mtu so that glean etc. can statically configure it when setting the ip15:57
sean-k-mooneyhowever i dont think we ever provide metadata for neutron trunk ports15:57
clarkbright because neutron is supposed to be the overseer that figures out what the minimum mtu is across the board then set it via dhcp15:57
sean-k-mooneyand ironic port groups dont exist in either neutron or nova15:57
sean-k-mooneyso neither project knows about them15:57
clarkbthe fundamental problem is that overlay networking is a bunch of l2 devices tied together and none of those can do mtu discovery15:58
sean-k-mooneyclarkb: the mtu is actually something you can set when creating the network as long as you request one equal to or below what the operator configures in the neutron server config15:59
clarkbwhich is something you can't know...15:59
sean-k-mooneyyep15:59
fungiright, path mtu discovery is more about "my packets are going to traverse a number of networks connected by routers, and some of those segments may have a lower mtu than mine, so i want to limit the size of my packets in order to avoid them getting fragmented when they pass through those networks"15:59
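
The end result that path MTU discovery aims for can be stated in a couple of lines: the usable packet size is bounded by the smallest MTU among the routed segments on the path. The sketch below only models that outcome with made-up numbers. Real PMTUD learns the value from ICMP "fragmentation needed" / "packet too big" replies sent by routers, which is why it cannot help across pure layer 2 hops.

```python
def path_mtu(segment_mtus):
    """Return the largest packet size that fits every segment on the path."""
    if not segment_mtus:
        raise ValueError("at least one segment MTU is required")
    return min(segment_mtus)


def fits_without_fragmentation(packet_size, segment_mtus):
    """True if a packet of packet_size bytes crosses the path unfragmented."""
    return packet_size <= path_mtu(segment_mtus)
```

For example, `path_mtu([1500, 1442, 9000])` gives `1442`, so a 1500-byte packet would need fragmenting (or be dropped) on that hypothetical path.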
sean-k-mooneyso this is why i say vlan sub interfaces should not have a different mtu from the parent15:59
sean-k-mooneypath mtu discovery should not really care about any of this if the routes support it16:00
sean-k-mooneycause thats working at the path/l3 level16:00
clarkbright the problem is that it isn't sufficient so you end up with misconfigured interfaces if you don't have something else managing it16:00
sean-k-mooneybut pmtu is kind of flaky16:00
clarkbthe point I'm trying to get at is no one should rely on pmtu within openstack or ovn networking16:01
sean-k-mooneyTheJulia: actually one regression in ovn vs ml2/ovs16:01
clarkbso imo it is wrong to debug if cirros has pmtud installed and running. It doesn't matter16:01
TheJuliasean-k-mooney: I know, the bug is noted in ironic's docs16:01
sean-k-mooneyovn does not support creating routers between networks with different mtus16:01
sean-k-mooneyyou can create them 16:01
sean-k-mooneybut it wont do fragmentation and reassembly16:02
TheJuliaa good chunk of this entire back and forth has been discussed 2-4 prior times here :)16:02
sean-k-mooneylike the standalone neutron router used with ml2/ovs16:02
sean-k-mooneyack16:02
fungiwell, pmtud can have a role to play, but when you're using overlay networks it's the underlay that needs to worry about pmtud for the hops it manages since those are transparent virtual layer 2 from the guest's perspective16:02
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/94113817:40
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Remove Freezer DR from infra  https://review.opendev.org/c/openstack/project-config/+/93818418:20
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/94113818:20
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/94113818:22
