14:02:21 <haleyb> #startmeeting neutron_drivers
14:02:21 <opendevmeet> Meeting started Fri Sep 26 14:02:21 2025 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:21 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:21 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:02:27 <haleyb> sorry, was in another meeting
14:02:34 <haleyb> Ping list: ykarel, mlavalle, mtomaska, slaweq, tobias-urdin, lajoskatona, haleyb, ralonsoh
14:02:36 <mlavalle> \o
14:02:41 <ralonsoh> hello
14:02:47 <ralonsoh> slaweq cannot attend today
14:02:55 <mtomaska> o/
14:02:55 <lajoskatona> o/
14:03:23 <haleyb> ralonsoh: thanks, i think we'd still have quorum with the rest of us
14:04:36 <haleyb> we did have a couple of topics, not sure if Dejan is here
14:04:41 <haleyb> #link https://bugs.launchpad.net/neutron/+bug/2123836
14:04:52 <haleyb> [RFE] Set a local TXT record in the DHCP agent/dnsmasq
14:06:17 <haleyb> #link https://review.opendev.org/c/openstack/neutron/+/950486 is the proposed patch
14:06:57 <haleyb> dsan: are you here?
14:07:07 <dsan> yep
14:07:23 <haleyb> ah, can you talk a little about your RFE?
14:07:33 <dsan> yeah sure
14:07:58 <dsan> i tried to explain it a bit on launchpad
14:08:22 <dsan> we use the ability to have a local txt record in dnsmasq
14:08:51 <dsan> as kind of a side effect, which allows us to monitor it
14:09:20 <dsan> we thought it might be useful to others
14:09:31 <dsan> so a good candidate to upstream
14:09:42 <ralonsoh> Who is monitoring this process? How does it work?
14:11:21 <dsan> it's at the infra level, we wish to ensure that dhcp is really available
14:11:42 <ralonsoh> so what process is monitoring the dnsmasq process?
14:11:52 <ralonsoh> outside openstack, I guess
14:12:09 <dsan> i think it's blackbox exporter running the actual dns queries
14:12:11 <mlavalle> I think he means outside OpenStack
14:12:55 <mlavalle> and that leaves options open for other potential adopters
14:13:01 <mlavalle> right?
14:13:10 <haleyb> so is it going into each network namespace and running a command?
14:14:52 <ralonsoh> dsan, are you still there?
14:14:58 <dsan> yep
14:15:12 <dsan> was checking what tool was doing the work
14:15:45 <dsan> so it's an internal tool that does the query
14:16:25 <ralonsoh> why is that needed? The DHCP agent is in charge of this process
14:16:36 <ralonsoh> dnsmasq is a child process of the DHCP agent
14:16:44 <ralonsoh> that is launched with a wrapper
14:17:00 <ralonsoh> if the child process fails and dies, the DHCP agent will respawn it
14:17:44 <haleyb> it just doesn't do periodic queries to see if it's able to respond
14:19:53 <ralonsoh> ok, I still have very little information about this
14:20:03 <ralonsoh> for example, how to make this check, things like this
14:20:32 <ralonsoh> dsan, how is the check done?
14:20:50 <dsan> with a custom go tool
14:21:09 <dsan> that happens to also do the dns queries in each namespace
14:21:29 <ralonsoh> yeah, and do you need this txt field for that?
14:21:52 <ralonsoh> I mean, you can use "dig" against the dnsmasq process to check its liveness
14:21:56 <ralonsoh> why is this field needed?
14:22:49 <lajoskatona> some reference for usage in random docs, i.e.: https://www.ibm.com/docs/en/i/7.4.0?topic=td-problem-dns-records-are-not-being-updated-by-dhcp
14:24:14 <dsan> it's when dnsmasq_local_resolv is set to true
14:24:28 <dsan> that part was on launchpad
14:24:46 <dsan> https://bugs.launchpad.net/neutron/+bug/2123836
14:27:35 <lajoskatona> ok, so if the monitor can't get the txt record from dnsmasq in the namespace, you know that the clients in the VMs also have issues, so the dnsmasq process is failing and you have to do something on the network node?
14:28:41 <dsan> yes
14:29:22 <dsan> we know, there's some kind of issue in dnsmasq/the DHCP agent/the underlying network/etc.
14:29:25 <lajoskatona> ok thanks
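A minimal sketch of the kind of probe described above, assuming the monitor runs inside the DHCP network namespace and sends a TXT query straight to dnsmasq. The record name, server address, and function names are hypothetical and are not taken from the RFE or the proposed patch; the real monitor is described only as a custom Go tool.

```python
import socket
import struct

TXT_RECORD_TYPE = 16  # DNS QTYPE for TXT
IN_CLASS = 1          # DNS QCLASS for Internet


def build_txt_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a single-question DNS query for a TXT record."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", TXT_RECORD_TYPE, IN_CLASS)


def response_has_answer(query_id: int, response: bytes) -> bool:
    """Check the DNS header only: matching ID, response bit, NOERROR, >= 1 answer."""
    if len(response) < 12:
        return False
    resp_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return resp_id == query_id and is_response and rcode == 0 and ancount >= 1


def probe_txt(server: str, name: str, timeout: float = 2.0, port: int = 53) -> bool:
    """Return True if the server answers the TXT query within the timeout."""
    query_id = 0x1234
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_txt_query(name, query_id), (server, port))
            response, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    return response_has_answer(query_id, response)
```

A failed probe only says that something between the monitor and dnsmasq is broken (dnsmasq, the DHCP agent, or the underlying network); it does not say which.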
14:29:46 <ralonsoh> ok, I'm not going to insist more on this because it is difficult to obtain information from you. I would like to have a description of the process you implemented to use this txt record, and how you are doing that
14:29:56 <ralonsoh> please, add this in the launchpad
14:30:01 <haleyb> to let others take advantage of this option, it seems to me like it should be in the dhcp-agent, then it can respawn and log a warning?
14:30:16 <dsan> ok, i'll gather more details
14:31:06 <haleyb> ok thanks dsan
14:31:15 <dsan> there's also the HA part/underlying network aspect
14:31:39 <dsan> it's one thing that the DHCP agent ensures that dnsmasq is running
14:32:04 <dsan> and another that the environment is also OK
14:32:22 <haleyb> understood
14:32:28 <dsan> anyway thanks for your time
14:32:29 <haleyb> we had one more item
14:32:39 <haleyb> i had forgotten to add to agenda
14:32:53 <haleyb> #link https://bugs.launchpad.net/neutron/+bug/2124215
14:32:55 <haleyb> [RFE] Implement more graceful handling of dhcp_lease_duration reduction
14:33:20 <haleyb> jcmoore: are you here?
14:33:24 <jcmoore> Yes, I'm here
14:34:28 <haleyb> i think i understand the ask, can you just give a quick overview, don't know if others have read it all
14:34:43 <ralonsoh> yes, and it is a legit bug, IMO
14:35:36 <haleyb> and i think we can work around it based on the comments
14:35:55 <jcmoore> Sure. In the event that the dhcp lease time is reduced by an amount greater than half of the previous lease duration, the lease will expire before the client has an opportunity to renew
14:36:32 <jcmoore> For Windows clients, this means that the next time the client tries to renew, dnsmasq will have expired the lease, therefore there will be no active lease to renew
14:36:51 <ralonsoh> ^^ what happens at this point?
14:37:07 <jcmoore> As a result, dnsmasq will issue a NAK and that will cause Windows to completely release the IP (dropping all active connections) and perform a new DORA cycle
14:37:33 <ralonsoh> hmmm that's bad, for sure
14:38:09 <jcmoore> Linux is much more forgiving, it retains the IP while it's working to perform the DORA
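The timing argument above can be sketched as simple arithmetic. This is an illustration only, assuming the client renews at the default T1 of 50% of its lease (RFC 2131) and assuming the worst case where the client renewed just as the leases file was rewritten with the new, shorter duration. The function name is hypothetical.

```python
RENEWAL_FRACTION = 0.5  # default T1 from RFC 2131: renew at half the lease time


def lease_expires_before_renewal(old_duration: int, new_duration: int) -> bool:
    """Return True if the server-side lease expires before the client's next renewal.

    Assumes the client holds a lease of old_duration seconds obtained at the
    moment the server re-initialised its lease entry with new_duration seconds.
    """
    client_renews_after = old_duration * RENEWAL_FRACTION
    return new_duration < client_renews_after


# Reducing a 24 h lease to 1 h: the client waits 12 h, the lease is gone after 1 h.
print(lease_expires_before_renewal(86400, 3600))
# Reducing a 24 h lease to 18 h: the client renews before the entry expires.
print(lease_expires_before_renewal(86400, 64800))
```

Under these assumptions the first call reports a problem and the second does not, which matches the "reduced by more than half" condition described above.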
14:38:47 <haleyb> so based on the comments, it seems ok to have the lease as infinite in the file to avoid the NAK, since we only have leases for known ports anyway.
14:39:26 <ralonsoh> I'm trying to figure out what could be the problem with this
14:39:32 <haleyb> we continue to advertise the lease interval in responses based on the config value
14:39:38 <jcmoore> That's my take, given the very specific way that Neutron is driving/using dnsmasq
14:39:53 <ralonsoh> what if the subnet DHCP range is reduced?
14:40:03 <ralonsoh> you'll leave a lease in the file
14:41:57 <ralonsoh> my concern here is that we introduce, by accident, a regression with this infinite timeout
14:42:17 <ralonsoh> that we could introduce*
14:42:24 <haleyb> ralonsoh: so a port changes from being in the allocation range to out?
14:42:29 <ralonsoh> no
14:44:28 <haleyb> ralonsoh: so the lease duration is reduced?
14:44:40 <ralonsoh> no, it isn't
14:44:51 <ralonsoh> the port IP assignment is not changed
14:45:00 <haleyb> i guess i don't understand your question "what if the subnet DHCP range is reduced?"
14:45:46 <ralonsoh> I'm just thinking of any situation that could leave a record indefinitely in the leases file
14:45:49 <ralonsoh> because of this change
14:46:52 <ralonsoh> in any case, this file belongs to the dnsmasq process that is a child process of the DHCP agent
14:47:04 <ralonsoh> so it should be handled by the DHCP agent
14:47:32 <jcmoore> I think _release_unused_leases() should clean up any leases for ports which are no longer valid, right?
14:47:42 <ralonsoh> yes
14:49:20 <jcmoore> So if we init the leases file with infinite leases for only valid ports, dnsmasq will take care of updating the lease timeout upon the next renewal by a client. If a client never renews but the port is valid, it will remain in the leases file with an infinite lease.
14:49:52 <jcmoore> Is that a "don't care" case?
14:49:55 <haleyb> we'd have to add tests for any corner cases, but i think the check to match the entries might be easier not having the lease time there perhaps?
14:51:12 <ralonsoh> right, we need to add proper testing for possible corner cases
14:51:25 <ralonsoh> in any case, as commented, this is a legit bug
14:51:25 <jcmoore> That works but there's an edge case with that also in the event there is no existing leases file upon startup
14:52:17 <jcmoore> If the duration has been reduced and there is no existing leases file to parse, then we'd default to the existing behavior of using the current lease duration to init the leases file
14:52:47 <jcmoore> Likely an even smaller probability of hitting this but not unlikely
14:54:25 <jcmoore> Using an infinite lease seems like it would solve both of these issues, without the additional work of parsing/retaining existing leases, if existing leases are present
14:56:04 <haleyb> right, and i think the number of entries in all these dnsmasq files is the same as before since it should only be existing ports that are there
14:56:24 <mlavalle> jcmoore, would you implement it?
14:56:41 <jcmoore> Correct. They would just start out with 0 instead of the currently configured lease duration
14:57:35 <jcmoore> Sure, if we want to init to 0, that's easy enough to implement.
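A rough sketch of what "init to 0" could look like, assuming the dnsmasq leases file uses one line per lease in the form "expiry-epoch MAC IP hostname client-id", with `*` for unknown fields and an expiry of 0 meaning the lease never expires. The function name and the sample MAC and IP are hypothetical; this is not Neutron's actual code.

```python
INFINITE_EXPIRY = 0  # assumed dnsmasq convention: 0 means the lease never expires


def init_lease_line(mac: str, ip: str, expiry: int = INFINITE_EXPIRY) -> str:
    """Format one initial leases-file entry for a known port."""
    # Hostname and client-id are unknown at init time, so use '*' placeholders.
    return f"{expiry} {mac} {ip} * *"


print(init_lease_line("fa:16:3e:00:00:01", "10.0.0.5"))
```

The lease time advertised to clients would still come from the configured dhcp_lease_duration, as noted earlier in the discussion; only the initial entry in the file changes.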
14:58:15 <haleyb> it's more about testing we don't break things
14:58:23 <lajoskatona> +1
14:58:29 <haleyb> should we vote?
14:58:35 <ralonsoh> +1
14:58:39 <lajoskatona> +1
14:58:53 <mlavalle> +1
14:58:59 <haleyb> i'm +1, and don't think we need a spec as it's really a bug
14:59:05 <ralonsoh> agree
14:59:41 <haleyb> jcmoore: can you just put any info from above regarding possible test cases in the bug? so we don't forget them?
15:00:12 <jcmoore> Yes, I'll be sure to capture as many edge cases as we can currently foresee
15:00:29 <haleyb> and thanks for finding the issue and working on it, reach out if you have questions on submitting patches
15:00:45 <haleyb> i'll mark approved
15:01:06 <haleyb> thanks for attending everyone, have another meeting to run to
15:01:10 <haleyb> and have a good weekend
15:01:14 <haleyb> #endmeeting