14:02:21 <haleyb> #startmeeting neutron_drivers
14:02:21 <opendevmeet> Meeting started Fri Sep 26 14:02:21 2025 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:21 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:21 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:02:27 <haleyb> sorry was in other meeting
14:02:34 <haleyb> Ping list: ykarel, mlavalle, mtomaska, slaweq, tobias-urdin, lajoskatona, haleyb, ralonsoh
14:02:36 <mlavalle> \o
14:02:41 <ralonsoh> hello
14:02:47 <ralonsoh> slaweq, cannot attend today
14:02:55 <mtomaska> o/
14:02:55 <lajoskatona> o/
14:03:23 <haleyb> ralonsoh: thanks, i think we'd still have quorum with the rest of us
14:04:36 <haleyb> we did have a couple of topics, not sure if Dejan is here
14:04:41 <haleyb> #link https://bugs.launchpad.net/neutron/+bug/2123836
14:04:52 <haleyb> [RFE] Set a local TXT record in the DHCP agent/dnsmasq
14:06:17 <haleyb> #link https://review.opendev.org/c/openstack/neutron/+/950486 is the proposed patch
14:06:57 <haleyb> dsan: are you here?
14:07:07 <dsan> yep
14:07:23 <haleyb> ah, can you talk a little about your RFE?
14:07:33 <dsan> yeah sure
14:07:58 <dsan> i tried to explain it a bit on launchpad
14:08:22 <dsan> we use the ability to have a local txt record in dnsmasq
14:08:51 <dsan> as kind of a side effect which allows to monitor it
14:09:20 <dsan> we thought it might be useful to others
14:09:31 <dsan> so a good candidate to upstream
14:09:42 <ralonsoh> Who is monitoring this process? How does it work?
14:11:21 <dsan> it's at the infra level, we wish to ensure that dhcp is really available
14:11:42 <ralonsoh> so what process is monitoring the dnsmasq process?
14:11:52 <ralonsoh> outside openstack, I guess
14:12:09 <dsan> i think it's blackbox exporter running the actual dns queries
14:12:11 <mlavalle> I think he means outside OpenStack
14:12:55 <mlavalle> and that leaves options open for other potential adopters
14:13:01 <mlavalle> right?
14:13:10 <haleyb> so is it going into each network namespace and running a command?
14:14:52 <ralonsoh> dsan, are you still there?
14:14:58 <dsan> yep
14:15:12 <dsan> was checking what tool was doing the work
14:15:45 <dsan> so it's an internal tool that does the query
14:16:25 <ralonsoh> why is that needed? The DHCP agent is in charge of this process
14:16:36 <ralonsoh> dnsmasq is a child process of the DHCP agent
14:16:44 <ralonsoh> that is launched with a wrapper
14:17:00 <ralonsoh> if the child process fails and dies, the DHCP agent will respawn it
14:17:44 <haleyb> it just doesn't do periodic queries to see if it's able to respond
14:19:53 <ralonsoh> ok, I still have very little information about this
14:20:03 <ralonsoh> for example, how to make this check, things like this
14:20:32 <ralonsoh> dsan, how is the check done?
14:20:50 <dsan> with a custom go tool
14:21:09 <dsan> that happens to also do the dns queries in each namespace
14:21:29 <ralonsoh> yeah, and do you need this txt field for that?
14:21:52 <ralonsoh> I mean, you can use "dig" against the dnsmasq process to check its liveness
14:21:56 <ralonsoh> why is this field needed?
14:22:49 <lajoskatona> some reference for usage in random docs, i.e.: https://www.ibm.com/docs/en/i/7.4.0?topic=td-problem-dns-records-are-not-being-updated-by-dhcp
14:24:14 <dsan> it's when dnsmasq_local_resolv is set to true
14:24:28 <dsan> that part was on launchpad
14:24:46 <dsan> https://bugs.launchpad.net/neutron/+bug/2123836
14:27:35 <lajoskatona> ok, so if the monitor can't get the txt record from dnsmasq in the namespace, you know that the clients in the VMs also have issues, so the dnsmasq process is failing and you have to do something on the network node?
14:28:41 <dsan> yes
14:29:22 <dsan> we know, there's some kind of issue in dnsmasq/the DHCP agent/the underlying network/etc.
14:29:25 <lajoskatona> ok thanks
14:29:46 <ralonsoh> ok, I'm not going to insist more on this because it is difficult to obtain information from you. I would like to have a description of the process you implemented to use this txt record, how you are doing that
14:29:56 <ralonsoh> please, add this in the launchpad
14:30:01 <haleyb> to let others take advantage of this option, it seems to me like it should be in the dhcp-agent, then it can respawn and log a warning?
14:30:16 <dsan> ok, i'll gather more details
14:31:06 <haleyb> ok thanks dsan
14:31:15 <dsan> there's also the HA part/underlying network aspect
14:31:39 <dsan> it's one thing that the DHCP agent ensures that dnsmasq is running
14:32:04 <dsan> and another that the environment is also OK
14:32:22 <haleyb> understood
14:32:28 <dsan> anyway thanks for your time
14:32:29 <haleyb> we had one more item
14:32:39 <haleyb> i had forgotten to add to agenda
14:32:53 <haleyb> #link https://bugs.launchpad.net/neutron/+bug/2124215
14:32:55 <haleyb> [RFE] Implement more graceful handling of dhcp_lease_duration reduction
14:33:20 <haleyb> jcmoore: are you here?
14:33:24 <jcmoore> Yes, I'm here
14:34:28 <haleyb> i think i understand the ask, can you just give a quick overview, don't know if others have read it all
14:34:43 <ralonsoh> yes, and it is a legit bug, IMO
14:35:36 <haleyb> and i think we can work around it based on the comments
14:35:55 <jcmoore> Sure. In the event that the dhcp lease time is reduced by an amount greater than half of the previous lease duration, the lease will expire before the client has an opportunity to renew
14:36:32 <jcmoore> For Windows clients, this means that the next time the client tries to renew, dnsmasq will have expired the lease, therefore there will be no active lease to renew
14:36:51 <ralonsoh> ^^ what happens at this point?
14:37:07 <jcmoore> As a result, dnsmasq will issue a NAK and that will cause Windows to completely release the IP (dropping all active connections) and perform a new DORA cycle
14:37:33 <ralonsoh> hmmm that's bad, for sure
14:38:09 <jcmoore> Linux is much more forgiving, it retains the IP while it's working to perform the DORA
14:38:47 <haleyb> so based on the comments, it seems ok to have the lease as infinite in the file to avoid the NAK, since we only have leases for known ports anyway.
14:39:26 <ralonsoh> I'm trying to figure out what could be the problem with this
14:39:32 <haleyb> we continue to advertise the lease interval in responses based on the config value
14:39:38 <jcmoore> That's my take, given the very specific way that Neutron is driving/using dnsmasq
14:39:53 <ralonsoh> what if the subnet DHCP range is reduced?
14:40:03 <ralonsoh> you'll leave a lease in the file
14:41:57 <ralonsoh> my concern here is that we introduce, by accident, a regression with this infinite timeout
14:42:17 <ralonsoh> that we could introduce*
14:42:24 <haleyb> ralonsoh: so a port changes from being in the allocation range to out?
14:42:29 <ralonsoh> no
14:44:28 <haleyb> ralonsoh: so the lease duration is reduced?
14:44:40 <ralonsoh> no, it isn't
14:44:51 <ralonsoh> the port IP assignment is not changed
14:45:00 <haleyb> i guess i don't understand your question "what if the subnet DHCP range is reduced?"
14:45:46 <ralonsoh> I'm just thinking of any situation that could leave a record indefinitely in the leases file
14:45:49 <ralonsoh> because of this change
14:46:52 <ralonsoh> in any case, this file belongs to the dnsmasq process that is a child process of the DHCP agent
14:47:04 <ralonsoh> so it should be handled by the DHCP agent
14:47:32 <jcmoore> I think _release_unused_leases() should clean up any leases for ports which are no longer valid, right?
14:47:42 <ralonsoh> yes
14:49:20 <jcmoore> So if we init the leases file with infinite leases for only valid ports, dnsmasq will take care of updating the lease timeout upon the next renewal by a client. If a client never renews but the port is valid, it will remain in the leases file with an infinite lease.
14:49:52 <jcmoore> Is that a "don't care" case?
14:49:55 <haleyb> we'd have to add tests for any corner cases, but i think the check to match the entries might be easier not having the lease time there perhaps?
14:51:12 <ralonsoh> right, we need to add proper testing for possible corner cases
14:51:25 <ralonsoh> in any case, as commented, this is a legit bug
14:51:25 <jcmoore> That works but there's an edge case with that also in the event there is no existing leases file upon startup
14:52:17 <jcmoore> If the duration has been reduced and there is no existing leases file to parse, then we'd default to the existing behavior of using the current lease duration to init the leases file
14:52:47 <jcmoore> Likely an even smaller probability of hitting this but not unlikely
14:54:25 <jcmoore> Using an infinite lease seems like it would solve both of these issues, without the additional work of parsing/retaining existing leases, if existing leases are present
14:56:04 <haleyb> right, and i think the number of entries in all these dnsmasq files is the same as before since it should only be existing ports that are there
14:56:24 <mlavalle> jcmoore, would you implement it?
14:56:41 <jcmoore> Correct. They would just start out with 0 instead of the currently configured lease duration
14:57:35 <jcmoore> Sure, if we want to init to 0, that's easy enough to implement.
14:58:15 <haleyb> it's more about testing we don't break things
14:58:23 <lajoskatona> +1
14:58:29 <haleyb> should we vote?
14:58:35 <ralonsoh> +1
14:58:39 <lajoskatona> +1
14:58:53 <mlavalle> +1
14:58:59 <haleyb> i'm +1, and don't think we need a spec as it's really a bug
14:59:05 <ralonsoh> agree
14:59:41 <haleyb> jcmoore: can you just put any info from above regarding possible test cases in the bug? so we don't forget them?
15:00:12 <jcmoore> Yes, I'll be sure to capture as many edge cases as we can currently foresee
15:00:29 <haleyb> and thanks for finding the issue and working on it, reach out if you have questions on submitting patches
15:00:45 <haleyb> i'll mark approved
15:01:06 <haleyb> thanks for attending everyone, have other meeting to run to
15:01:10 <haleyb> and have a good weekend
15:01:14 <haleyb> #endmeeting
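
Note on the dhcp_lease_duration item (bug 2124215): the timing problem jcmoore described can be illustrated with a little arithmetic. The sketch below is not Neutron code. It assumes that a client tries to renew at half of the lease it was originally given (the default T1 in RFC 2131, which matches the "half of the previous lease duration" remark in the log), and that the leases file is initialized with "now + current lease duration" as described at 14:52:17. All function names are hypothetical, and "infinite" stands for the proposal to initialize entries with 0.

```python
"""Illustrative sketch of the lease-reduction race (hypothetical helpers, not Neutron code)."""

from typing import Optional

# Assumption: clients renew at T1 = 50% of the lease they were granted (RFC 2131 default).
T1_FRACTION = 0.5


def client_renewal_time(acquired_at: int, old_duration: int) -> float:
    """Time at which a client holding a lease of old_duration seconds first tries to renew."""
    return acquired_at + T1_FRACTION * old_duration


def initial_lease_expiry(rewritten_at: int, new_duration: int,
                         infinite: bool = False) -> Optional[int]:
    """Expiry recorded when the leases file is (re)initialized.

    Returns None for an infinite lease, i.e. an entry that never expires
    on the server side until the client renews.
    """
    if infinite:
        return None
    return rewritten_at + new_duration


def expires_before_renewal(acquired_at: int, old_duration: int,
                           rewritten_at: int, new_duration: int,
                           infinite: bool = False) -> bool:
    """True if the server-side entry expires before the client's first renewal attempt."""
    expiry = initial_lease_expiry(rewritten_at, new_duration, infinite)
    if expiry is None:
        return False  # an infinite entry cannot expire ahead of the renewal
    return expiry < client_renewal_time(acquired_at, old_duration)
```

For example, with a 24 hour lease acquired at t=0, a leases file rewritten at t=600 and a new duration of 1 hour, expires_before_renewal(0, 86400, 600, 3600) is True: the entry expires at t=4200, long before the client renews at t=43200. With infinite=True the same call returns False, which is the behavior the meeting agreed to pursue.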