Monday, 2025-02-10

*** elodilles is now known as elodilles_pto		06:33
*** tkajinam is now known as Guest8639		09:04
sean-k-mooney	o/	11:18
sean-k-mooney	we just noticed the openstackreview bot is not posting in #openstack-nova	11:18
sean-k-mooney	any idea what can cause that and if we can just turn it off and on again to fix it?	11:19
sean-k-mooney	i should say opendevreview	11:20
sean-k-mooney	i think it's https://opendev.org/opendev/gerritbot	11:22
sean-k-mooney	nothing obvious in https://opendev.org/openstack/project-config/src/branch/master/gerritbot/channels.yaml	11:24
noonedeadpunk	++ same here	12:06
fungi	last gerritbot service restart was on 2025-01-31 at least	12:58
fungi	my guess is it got disconnected during the unexpected gerrit outage over the weekend and doesn't notice the stream socket it's listening on is dead and needs to be reopened	12:59
fungi	i'll restart it	13:00
fungi	sean-k-mooney: noonedeadpunk: it should be resolved as of about 15 minutes ago	13:17
fungi	do let us know if it continues to be silent for new events now	13:18
noonedeadpunk	thanks! at least the bot is in the channel now	13:23
sean-k-mooney	cool, i'll let you know if i notice any issues	13:35
sean-k-mooney	it's one bot for all the channels, yes?	13:36
sean-k-mooney	we don't need to restart it per project	13:36
fungi	correct. note that it doesn't always hang out in every channel, it dynamically joins channels it has messages for and parts the least recently used channel when necessary in order to keep under the joined channels limit for the irc network	13:43
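
The join/part behaviour described here amounts to a least-recently-used set of joined channels. A minimal sketch of that idea follows; the `ChannelSet` class, its method names, and the limit value are hypothetical illustrations and not gerritbot's actual code.

```python
from collections import OrderedDict


class ChannelSet:
    """Hypothetical LRU tracker for the IRC channels a bot has joined."""

    def __init__(self, limit):
        self.limit = limit           # maximum number of joined channels
        self.joined = OrderedDict()  # oldest use first, newest use last

    def touch(self, channel):
        """Record that a message is headed to `channel`.

        Returns the channel that had to be parted to make room,
        or None if nothing was parted.
        """
        parted = None
        if channel in self.joined:
            # Already joined: just mark it as most recently used.
            self.joined.move_to_end(channel)
        else:
            if len(self.joined) >= self.limit:
                # At the limit: part the least recently used channel.
                parted, _ = self.joined.popitem(last=False)
            self.joined[channel] = True
        return parted
```

With a limit of 2, touching `#a`, `#b`, `#a`, then `#c` parts `#b`, because `#a` was used more recently.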
sean-k-mooney	ack so it will show up the first time there is a gerrit event for one of the channels listed in its config	13:47
sean-k-mooney	and rejoin as needed	13:47
fungi	yes	13:59
sean-k-mooney	cool, well it's working, we just got the first notification, thanks for fixing it :)	14:03
fungi	no problem, we probably ought to implement some sort of dead peer detection to catch that situation and reopen the stream connection, but this comes up pretty infrequently	14:07
sean-k-mooney	ya i don't recall the last time i noticed	14:07
sean-k-mooney	so yes but meh	14:08
fungi	basically not a problem for a clean gerrit shutdown, but if the hypervisor host where its vm is running crashes out from under everything, we get pathological behaviors	14:10
sean-k-mooney	i don't know how you run it, but it might be best for the bot to just exit if the gerrit connection drops and rely on systemd/docker to restart it	14:11
sean-k-mooney	maybe with a limited reconnect, i.e. try reconnecting 1-3 times before bailing	14:12
sean-k-mooney	we do something like that in nova-compute: if we lose the rabbit connection, oslo.messaging tries to reconnect up to 3 times, then we just exit and wait for the service manager to restart us	14:13
sean-k-mooney	service manager being systemd or docker or whatever is running the agent	14:13
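
The "limited reconnect, then exit" pattern suggested here can be sketched roughly as below. The names `run_until_unrecoverable`, `connect`, and `consume` are hypothetical, and the sketch assumes both callables raise `ConnectionError` on failure; it is not taken from gerritbot, nova-compute, or oslo.messaging.

```python
import time


def run_until_unrecoverable(connect, consume, max_attempts=3, delay=1.0,
                            sleep=time.sleep):
    """Reconnect a few times, then give up with a non-zero status.

    `connect()` returns a connection or raises ConnectionError.
    `consume(conn)` processes events until the connection drops.
    The caller would pass the return value to sys.exit() so that a
    service manager (systemd, docker, ...) restarts the process.
    """
    failures = 0  # consecutive failed connection attempts
    while failures < max_attempts:
        try:
            conn = connect()
        except ConnectionError:
            failures += 1
            sleep(delay)
            continue
        failures = 0  # a successful connect resets the counter
        try:
            consume(conn)
        except ConnectionError:
            pass  # connection dropped: loop around and reconnect
    return 1
```

This only helps once the drop is actually noticed, which is the harder part for a one-way event stream.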
sean-k-mooney	but if it happens very rarely it's probably not worth fixing now	14:14
fungi	sean-k-mooney: well, the challenge is detecting that the connection has died	14:16
fungi	the bot establishes a tcp connection to the gerrit server and then receives event messages on that connection. if the server side of the connection goes away without shutting down the connection, then the client has no way of knowing because it's essentially a one-way stream so there's no reason for the client to send any new packets to the server	14:18
fungi	a dpd implementation would address that (wherein the client sends periodic no-op packets to the server, and would therefore receive a tcp/rst back if that socket is no longer valid when the server comes back up)	14:19
fungi	(or in our case possibly an icmp/admin-prohibited error since it would get caught by iptables)	14:21
sean-k-mooney	well won't the tcp connection drop eventually	14:21
fungi	no, not without something to force it to drop	14:21
sean-k-mooney	would tcp keepalive fix that	14:21
fungi	potentially, since that would eventually result in a keepalive message getting rejected	14:22
fungi	that's one way of doing dead peer detection	14:22
sean-k-mooney	right we would not get the ack	14:22
sean-k-mooney	it's what we rely on for rabbit mainly.	14:22
sean-k-mooney	we have both our application heartbeat and tcp keepalive	14:22
sean-k-mooney	sending packets periodically and we detect the drop as a result	14:23
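
For reference, kernel-level TCP keepalive on a plain socket can be enabled roughly as follows. The helper name `enable_tcp_keepalive` and the timing values are illustrative assumptions; `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are platform-specific options (present on Linux), so the sketch only sets them where Python exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive probes for an existing socket.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel reports the peer dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values on Linux, a silently dead peer would be reported after roughly 60 + 3 × 10 = 90 seconds, at which point a blocked read fails instead of waiting forever.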
fungi	an ssh keepalive might also work in this case since that's the layer 7 protocol, though i don't know whether the embedded mina-sshd inside gerrit supports it	14:23
sean-k-mooney	i assume that is what zuul does too	14:23
sean-k-mooney	sure, although ssh is just tcp, so again the kernel's tcp keepalive can help if the application server can't	14:23
fungi	i think so, though i don't know that it does any direct socket manipulation since these are ssh connections handled by a separate library	14:23
fungi	so the lib that opens the connection would need to have a feature for that	14:24
fungi	or at least expose the resulting tcp socket object	14:24
sean-k-mooney	i don't want to nerd snipe you with this by the way if you have better things to do	14:26
sean-k-mooney	ah so the base bot is coming from an external lib https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L23	14:27
fungi	nah, it's worth revisiting. looking at the chain of what we use, gerritbot relies on gerritlib which in turn uses paramiko to make the connection to the gerrit server's api port and invoke its stream-events command	14:27
fungi	zuul similarly relies on paramiko for the gerrit stream-events connection	14:28
sean-k-mooney	right, using the gerrit client via ssh is what most external systems do too to get the stream-events	14:28
sean-k-mooney	i don't think it's available via the rest api	14:29
fungi	zuul does do a paramiko.SSHClient().connect().get_transport().set_keepalive(somevalue)	14:29
sean-k-mooney	ya so that might be all that's needed, set it to say 10 seconds or some small value	14:30
sean-k-mooney	well small in that it should be less than a few minutes but large enough to not add much load on gerrit	14:30
sean-k-mooney	i would guess zuul is somewhere in the 10-300 second range?	14:31
fungi	gerritlib similarly calls set_keepalive() if specified, defaulting to off	14:31
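
The paramiko call chain mentioned above reads as shorthand, since `SSHClient.connect()` returns `None`; in practice the transport is fetched from the client after connecting. A minimal sketch, assuming a hypothetical helper name `enable_ssh_keepalive` and an illustrative 60 second interval (not the value zuul or gerritlib uses):

```python
def enable_ssh_keepalive(client, interval=60):
    """Enable SSH-level keepalives on a connected paramiko.SSHClient.

    `client` is duck-typed: anything exposing get_transport() that
    returns an object with set_keepalive(seconds) will work.
    An interval of 0 disables keepalives.
    """
    transport = client.get_transport()
    if transport is None:
        # paramiko returns None here when the client is not connected.
        raise RuntimeError("SSH client is not connected")
    transport.set_keepalive(interval)
    return transport
```

Typical use would be to call `client.connect(...)` on a `paramiko.SSHClient` first and then pass the client to this helper before starting to read the event stream.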
sean-k-mooney	ack so may just change https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L208-L209	14:32
fungi	it appears we do not specify a keep_alive value in gerritbot when instantiating the GerritConnection class	14:32
fungi	so yes, probably a one-line fix	14:33
sean-k-mooney	best type of fix other than a negative line fix :)	14:33
sean-k-mooney	so ya it defaults to 0 https://opendev.org/opendev/gerritlib/src/branch/master/gerritlib/gerrit.py#L219-L231	14:34
fungi	right, which keeps it off	14:34
fungi	looks like keep_alive_interval was added in https://review.opendev.org/750849 and appears in 0.10.0 (latest version is 0.11.0)	14:37
sean-k-mooney	4 years ago. i think we can assume it's safe to use	14:38
fungi	yep, mainly i was just confirming that we actually released a version with that feature	14:38
sean-k-mooney	ack	14:39
fungi	and to know what to set the minimum gerritlib version to now in requirements.txt	14:39
TheJulia	Curious, has anyone observed MTU issues with Cirros?	14:58
fungi	sean-k-mooney: https://review.opendev.org/c/opendev/gerritbot/+/941117	15:22
fungi	TheJulia: not that i've heard, what's the context? booting in devstack? what sort of network is attached?	15:23
TheJulia	fungi: one of ironic's ovn jobs. Looks like it is just bad config which impacts packet sizing	15:29
fungi	makes sense, and yes that's where i tend to look first when hitting mtu problems, whether it's pmtud black holes, excessive fragmentation, et cetera	15:30
fungi	it's almost always mismatched configuration somewhere (unfortunately when you're experiencing it across the internet, "somewhere" is often not under your control)	15:31
TheJulia	I semi-suspect cirros doesn't have pmtud	15:42
sean-k-mooney	it depends on the version	15:42
sean-k-mooney	are you using 0.6.x	15:42
TheJulia	0.6.2 or 0.6.3	15:42
sean-k-mooney	i think it should have path mtu then, let me see if i can find that commit quickly	15:43
opendevreview	Merged openstack/project-config master: Revert "Temporarily remove zuul-providers from zuul tenant" https://review.opendev.org/c/openstack/project-config/+/940902	15:43
sean-k-mooney	i might be misremembering but i know between 0.5 and 0.6 the dhcp stuff was rewritten, i think mainly for ipv6	15:44
TheJulia	sean-k-mooney: thanks, in the meantime I suspect I need to review recent changes to the mtu code for the hard wired sub interfaces because I think that is actually what is causing things to go sideways in a static state. Full OSes like Centos Stream seem like they don't blink at it nor generate dropped packets	15:44
sean-k-mooney	TheJulia: well you can only have one mtu on an l2 broadcast domain, i.e. on one port	15:45
sean-k-mooney	vlan sub interface are a gray area.	15:46
fungi	yeah, pmtud only helps you with layer 3	15:46
sean-k-mooney	in principle i guess it would be valid for sub interfaces to be different but not really	15:46
TheJulia	eh, vlans are not that gray, but it looks like it's blowing up on layer3 traffic	15:46
fungi	if there's no router, there's nothing to emit the necessary responses	15:46
sean-k-mooney	well vlans are still part of the same l2 broadcast domain as the non tagged traffic	15:46
TheJulia	anyway, I see a mismatch in the config somehow, so I need to hunt that down first.	15:46
sean-k-mooney	so i was thinking of https://github.com/cirros-dev/cirros/blob/main/ChangeLog#L20-L22	15:47
TheJulia	so, based upon the dropped packets, it's definitely getting what it thinks it should be from dhcp	15:48
sean-k-mooney	maybe related to this https://github.com/cirros-dev/cirros/pull/113/files	15:48
TheJulia	because ovn's logs match	15:49
TheJulia	okay, cool, so that at least confirms it	15:49
TheJulia	I know where the problem is then	15:49
* TheJulia makes more coffee first		15:49
sean-k-mooney	so that was fixed 7 months ago but i don't know if it's in a release	15:50
sean-k-mooney	apparently should be in 0.6.3 https://github.com/cirros-dev/cirros/compare/0.6.3...main	15:51
clarkb	openstack/ovs/ovn etc networking really breaks the whole pmtu thing due to the proliferation of l2 devices	15:55
TheJulia	sean-k-mooney: so interesting because I think the date on 0.6.3 in the mirror is from like 2023	15:55
clarkb	I believe that neutron is now supposed to figure this out and set it via dhcp because you can't rely on the network doing it for you	15:55
sean-k-mooney	yes and no	15:56
sean-k-mooney	neutron asked us to not set it in the metadata if dhcp is enabled	15:56
sean-k-mooney	so nova won't add the mtu if dhcp is enabled on the subnet	15:56
sean-k-mooney	if dhcp is not enabled on the subnet we will provide the mtu so that glean etc. can statically configure it when setting the ip	15:57
sean-k-mooney	however i don't think we ever provide metadata for neutron trunk ports	15:57
clarkb	right because neutron is supposed to be the overseer that figures out what the minimum mtu is across the board then sets it via dhcp	15:57
sean-k-mooney	and ironic port groups don't exist in either neutron or nova	15:57
sean-k-mooney	so neither project knows about them	15:57
clarkb	the fundamental problem is that overlay networking is a bunch of l2 devices tied together and none of those can do mtu discovery	15:58
sean-k-mooney	clarkb: the mtu is actually something you can set when creating the network as long as you request one equal to or below what the operator configures in the neutron server config	15:59
clarkb	which is something you can't know...	15:59
sean-k-mooney	yep	15:59
fungi	right, path mtu discovery is more about "my packets are going to traverse a number of networks connected by routers, and some of those segments may have a lower mtu than mine, so i want to limit the size of my packets in order to avoid them getting fragmented when they pass through those networks"	15:59
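
As a small worked illustration of that description: the usable path MTU is the smallest MTU of any routed segment on the path, and the sender sizes its packets to fit it. The function names and the MTU values below are made up for the example, and the header sizes assume IPv4 and TCP without options (20 bytes each).

```python
def path_mtu(segment_mtus):
    """Smallest MTU across the routed segments of a path."""
    if not segment_mtus:
        raise ValueError("path must contain at least one segment")
    return min(segment_mtus)


def max_tcp_payload(segment_mtus, ip_header=20, tcp_header=20):
    """Largest TCP payload that crosses the path without fragmentation.

    Defaults assume IPv4 and TCP headers with no options.
    """
    return path_mtu(segment_mtus) - ip_header - tcp_header


# Example path: local 1500-byte LAN, a 1450-byte hop, a jumbo-frame core.
example = [1500, 1450, 9000]
print(path_mtu(example))         # 1450
print(max_tcp_payload(example))  # 1410
```

This only models hops joined by routers; layer 2 devices in between never send the signals that discovery depends on, which is the limitation discussed in this thread.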
sean-k-mooney	so this is why i say vlan sub interfaces should not have a different mtu from the parent	15:59
sean-k-mooney	path mtu discovery should not really care about any of this if the routers support it	16:00
sean-k-mooney	cause that's working at the path/l3 level	16:00
clarkb	right the problem is that it isn't sufficient so you end up with misconfigured interfaces if you don't have something else managing it	16:00
sean-k-mooney	but pmtu is kind of flaky	16:00
clarkb	the point I'm trying to get at is no one should rely on pmtu within openstack or ovn networking	16:01
sean-k-mooney	TheJulia: actually one regression in ovn vs ml2/ovs	16:01
clarkb	so imo it is wrong to debug if cirros has pmtud installed and running. It doesn't matter	16:01
TheJulia	sean-k-mooney: I know, the bug is noted in ironic's docs	16:01
sean-k-mooney	ovn does not support creating routers between networks with different mtus	16:01
sean-k-mooney	you can create them	16:01
sean-k-mooney	but it won't do fragmentation and reassembly	16:02
TheJulia	a good chunk of this entire back and forth has been discussed 2-4 prior times here :)	16:02
sean-k-mooney	like the standalone neutron router used with ml2/ovs	16:02
sean-k-mooney	ack	16:02
fungi	well, pmtud can have a role to play, but when you're using overlay networks it's the underlay that needs to worry about pmtud for the hops it manages since those are transparent virtual layer 2 from the guest's perspective	16:02
opendevreview	Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138	17:40
opendevreview	Dmitriy Rabotyagov proposed openstack/project-config master: Remove Freezer DR from infra https://review.opendev.org/c/openstack/project-config/+/938184	18:20
opendevreview	Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138	18:20
opendevreview	Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138	18:22
