14:00:12 <slaweq> #startmeeting neutron_drivers
14:00:13 <openstack> Meeting started Fri Aug 21 14:00:12 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 <slaweq> hi
14:00:16 <openstack> The meeting name has been set to 'neutron_drivers'
14:00:24 <dosaboy> o/ hi
14:00:50 <pprincipeza_> Hello.
14:00:57 <dosaboy> slaweq: im @hopem (from lp 1892200) fwiw
14:00:59 <openstack> Launchpad bug 1892200 in neutron "Make keepalived healthcheck more configurable" [Wishlist,New] https://launchpad.net/bugs/1892200
14:01:15 <mlavalle> o/
14:01:15 <haleyb> hi
14:01:21 <amotoki> hi
14:01:25 <slaweq> welcome dosaboy :)
14:01:31 <pprincipeza_> slaweq, and I'm pprincipeza from lp 1891334. :)
14:01:32 <openstack> Launchpad bug 1891334 in neutron "[RFE] Enable change of CIDR on a subnet" [Wishlist,New] https://launchpad.net/bugs/1891334
14:01:38 <slaweq> welcome pprincipeza_ :)
14:01:58 <slaweq> hi mlavalle haleyb and amotoki
14:02:02 <slaweq> :)
14:02:12 <slaweq> ralonsoh and njohnston are on pto this week
14:02:23 <slaweq> so lets just wait 2 more minutes for yamamoto
14:02:51 <pprincipeza_> ACK!
14:04:35 <slaweq> ok, lets start
14:04:48 <slaweq> even without yamamoto we have quorum so we should be good to go
14:04:56 <slaweq> #topic RFEs
14:05:02 <slaweq> we have 2 RFEs today
14:05:04 <slaweq> first one:
14:05:11 <slaweq> #link https://bugs.launchpad.net/neutron/+bug/1891334
14:05:12 <openstack> Launchpad bug 1891334 in neutron "[RFE] Enable change of CIDR on a subnet" [Wishlist,New]
14:07:05 <pprincipeza_> I submitted that on behalf of a customer, who would like to have the ability to expand the subnet currently in use.  He has already implemented the alternative (new subnets); his wish would be to avoid creating these distinct subnets and keep the servers under a single subnet (even though this is completely virtual and internal to OpenStack).
14:08:07 <pprincipeza_> I understand the limitations that this would imply, as in having to repopulate existing Instances with new IP/Mask/GW information, and this would definitely need downtime. :/
14:10:09 <pprincipeza_> I'd imagine this would be less "painful" on subnets with DHCP allocation, as this info *should* come with the lease renewal?
14:10:21 <dosaboy> i assume this would only be expected to work as increase and not decrease of size
14:10:49 <pprincipeza_> dosaboy, yes.
14:11:55 <dosaboy> pprincipeza_: is there a particular use-case? would just adding an extra subnet to the network not solve the issue?
14:12:32 <haleyb> technically, it should be possible to scale down a subnet if the number of ports is small enough, just don't know why you'd want to do that
14:12:43 <slaweq> but how do You want to force changes of e.g. the mask in the existing instances?
14:13:18 <pprincipeza_> dosaboy, the use-case is more on a "systems management" side than in functionality.  He has implemented other subnets, and everything is working fine.
14:14:51 <pprincipeza_> slaweq, my only thought there would be to do the change with instances down, as I imagine the new information from the subnet would come in upon lease renewal?
14:14:59 <pprincipeza_> (And I expect that to happen at boot time?)
14:15:13 <haleyb> slaweq: scaling-up by just changing the mask might not cause a big disruption, but i can't see being able to do that if pools are being used, would need to change to a new subnet, bleck
14:15:23 <mlavalle> so the system manager wants to save himself / herself management work and instead have the users reconfigure their vms?
14:15:36 <dosaboy> haleyb: i guess either way i worry about the number of places that update would have to be applied to
14:16:52 <pprincipeza_> mlavalle, yes, it doesn't sound very reasonable when thinking of end-users of the Cloud. :/
14:17:29 <pprincipeza_> And the change on lb-mgmt for Octavia would also be a use-case, I believe?
14:17:49 <haleyb> i can see how this change would help with certain deployers too, for example where the undercloud was made too small, but i don't know how the cloud admin could do this successfully; for a tenant it seems more doable
14:18:28 <haleyb> pprincipeza_: so is the use case more the end-user/tenant?
14:18:58 <mlavalle> I understood the opposite, but I might be wrong
14:19:26 <slaweq> on neutron's side it would be a change in the dhcp agent and the neutron db, right? other things are on the user, who needs to e.g. reboot vms or force a lease renewal, is that correct or am I missing something here?
14:19:46 <dosaboy> if it's the octavia lb-mgmt net then it's the case where it runs out of addresses for amphora vms etc
14:19:47 <pprincipeza_> haleyb, ^ thanks, slaweq.  that sums it up.
14:20:08 <haleyb> mlavalle: it's not clear to me
14:20:09 <mlavalle> pprincipeza_: you also mention Octavia, which is not mentioned in the RFE. Is there an additional conversation going on that we haven't seen in this meeting?
14:21:20 <pprincipeza_> mlavalle, this was a use-case not initially added to the RFE I submitted, but discussing this with dosaboy, that use-case came up as an addition.
14:21:35 <dosaboy> mlavalle: tbh we've seen the octavia issue with older deployments before we switched to using v6 networks for the lb-mgmt net
14:21:49 <pprincipeza_> mlavalle, I can certainly add that mention to the LP Bug, if that's needed.
14:22:06 <dosaboy> but i havent yet tested if adding an extra subnet could fix that, planning to try that
14:22:59 <johnsom> Octavia also has an rfe to add multiple subnets for the lb-mgmt-net. It just hasn’t come up that anyone needed it, so is low priority.
14:23:13 <dosaboy> i don't know if that's pprincipeza_ original use-case though
14:23:22 <dosaboy> johnsom: oh interesting i was not aware of that
14:24:55 <mlavalle> but pprincipeza_'s aim is to avoid adding subnets, isn't it?
14:24:55 <slaweq> I'm still not really convinced about that rfe and whether we should implement something that may in fact be painful for users later
14:25:21 <amotoki> I had trouble with my internet connection and am just following the discussion. As you already discussed, expanding a subnet CIDR leads to a subnet mask change. It sometimes leads to a problem between an existing vm and a newly booted vm if the existing vm does not update the mask. If all communications happen between the gateway and VMs, we will hit fewer problems.
14:25:51 <pprincipeza_> mlavalle, yes, that's my initial aim.
14:26:25 <mlavalle> in that case, the Octavia RFE mentioned above is not related, if I understand correctly
14:27:37 <haleyb> amotoki: and it might be you can't just change the mask, but need a new cidr due to overlap, right?
14:27:57 <pprincipeza_> Yes.
14:28:03 <amotoki> haleyb: yes
14:28:27 <johnsom> In fact, using the AZ code in Ussuri Octavia allows for multiple lb-mgmt-nets today.
14:28:28 <haleyb> so then it's new subnet and live migration, etc
14:28:32 <amotoki> haleyb: I think the overlapping case can be covered on the API side. we can check for overlapping CIDRs
14:29:53 <haleyb> amotoki: right, i was just thinking the simple case of changing from /26 to /24, same cidr, which causes little disruption, otherwise it's reboot everything
14:30:10 <mlavalle> right
14:30:33 <amotoki> haleyb: thanks. I am on the same page.
14:30:57 <pprincipeza_> And if rebooting is needed, it is needed, I don't see that feature being added without some "pain" for the Instances. :)
14:32:25 <mlavalle> and the RFE refers to "changing the CIDR of a subnet", so it's not just going from /26 to /24
14:32:42 <haleyb> pprincipeza_: if you need to reboot, is a new subnet with new instances easier?  no downtime if these instances are part of a pool?
14:33:16 <slaweq> mlavalle: exactly, how it would be if You would e.g. changed from 10.0.0.0/24 to 192.168.0.0/16 ?
14:33:26 <pprincipeza_> haleyb, the new subnet with new instances is already in place as a functional way out of the "expansion" limitation.
14:33:29 <slaweq> that may be much harder :)
14:34:37 <haleyb> this has a ripple effect through security groups as well
14:35:03 <pprincipeza_> mlavalle, slaweq, my initial use-case involved a minor /26 to /24 scenario, but ultimately the bigger change (a whole new CIDR) was requested.
14:36:03 <slaweq> personally I can imagine that we allow extending the cidr, so the old cidr has to be inside the new, bigger one
14:36:08 <amotoki> does it work if the RFE is rephrased from "changing" to "expanding"?
14:36:12 <slaweq> but for other use cases I'm not sure
14:36:49 <mlavalle> amotoki: so the /26 to /24 case?
14:36:54 <amotoki> mlavalle: yes
14:37:12 <amotoki> * with limitations
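
A minimal sketch of the API-side check amotoki and slaweq describe, assuming the expansion-only semantics discussed above (the new CIDR must fully contain the old one). This is not Neutron code; the function name is hypothetical and only the Python standard library is used. Checking overlap with the network's other subnets, as haleyb and amotoki note, would be an additional validation on top of this.

import ipaddress

def is_valid_expansion(old_cidr, new_cidr):
    """Hypothetical helper: accept only a strict expansion of the old CIDR."""
    old = ipaddress.ip_network(old_cidr)
    new = ipaddress.ip_network(new_cidr)
    if old.version != new.version:
        return False
    # subnet_of() is True when 'old' lies entirely inside 'new'.
    return old != new and old.subnet_of(new)

print(is_valid_expansion("10.0.0.0/26", "10.0.0.0/24"))     # True: pure expansion
print(is_valid_expansion("10.0.0.0/24", "192.168.0.0/16"))  # False: a different CIDR
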
14:40:03 <slaweq> so should we vote on that rfe or do You want some more clarifications and discuss that again next week?
14:40:05 <haleyb> amotoki: it's more reasonable if just expanding, but i guess there will still be connectivity issues since the gateway will have changed?
14:40:23 <mlavalle> haleyb: I think so
14:40:54 <amotoki> haleyb: the gateway address is already assigned so I don't think we need to change it.
14:40:59 <slaweq> haleyb: if we just expand, why would the gateway need to be changed?
14:42:17 <haleyb> slaweq: i just didn't know if the expansion changed the ".1" address to be different. i.e. 2.1 to 0.1 or something
14:42:48 <haleyb> or the gateway stays the same...
14:43:30 <amotoki> I thought the gateway stays the same. In my understanding, the current logic is applied only when a gateway is not specified.
14:43:47 <slaweq> amotoki: I think the same
14:44:46 <haleyb> amotoki: yes, i was just thinking out loud, shouldn't be an issue after thinking about it
14:45:18 <amotoki> haleyb: thanks. we are careful enough :)
14:45:55 <slaweq> so are we ok to approve this rfe as "expansion of subnet's cidr" and discuss details in the review of the spec?
14:46:19 <amotoki> +1 for expanding a subnet CIDR. This operation may require additional workarounds, including an instance reboot, as we discussed. it is worth documenting in the API ref or somewhere.
14:46:36 <rafaelweingartne> If it is just about expanding CIDRS, +1
14:47:07 <slaweq> pprincipeza_: will that work for You?
14:47:29 <pprincipeza_> slaweq, it works for me.
14:47:39 <pprincipeza_> Thank you very much for considering it!
14:47:40 <mlavalle> +1 from me, then
14:47:52 <slaweq> haleyb ?
14:48:43 <haleyb> +1 from me
14:48:57 <slaweq> thx, so I will mark this rfe as approved
14:49:05 <slaweq> with note about "expanding cidr" only
14:49:16 <slaweq> ok, lets quickly look into second rfe
14:49:23 <slaweq> #link https://bugs.launchpad.net/neutron/+bug/1892200
14:49:24 <openstack> Launchpad bug 1892200 in neutron "Make keepalived healthcheck more configurable" [Wishlist,New]
14:49:27 <pprincipeza_> Awesome, thank you very much slaweq haleyb mlavalle!
14:49:38 <dosaboy> ok so,
14:49:43 <dosaboy> 1892200 is related to an issue that we have recently observed in an env using l3ha
14:49:47 <dosaboy> the conditions to hit the issue are somewhat protracted and described in the LP but
14:49:50 <dosaboy> long story short is that while the check was failing for a valid reason, the result of it failing ended up causing more problems
14:49:53 <dosaboy> and since the original cause of the test failure was transient, there was no real need to failover
14:49:56 <dosaboy> therefore a slightly more intelligent test than simply doing a single ping would be preferable
14:50:03 <dosaboy> in terms of solutions I know that protocols like BFD are much better at dealing with this kind of thing and are available in OVN
14:50:06 <dosaboy> but this is really for those users that will be stuck with L3HA for the foreseeable future
14:50:09 <dosaboy> so adapting what we have, e.g. just trying more pings before we declare a failure, would be better than the current behaviour imho
14:50:12 <dosaboy> on top of that, the suggestion to move the current code to use a template seems like a good idea
14:50:17 <dosaboy> right now the healthcheck script is entirely built from code
14:50:20 <dosaboy> if we used a template it would also provide the opportunity to make the path to the template configurable
14:50:24 <dosaboy> thus allowing for it to be modified without changing neutron code
14:50:28 <dosaboy> I've not looked too deeply into this yet but was thinking something along the lines of a jinja template
14:50:31 <dosaboy> thoughts?
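
A minimal sketch of the jinja2 idea, assuming the health-check script were rendered from a template with a configurable retry count instead of being built entirely in code. The template body, variable names, and retry logic are illustrative assumptions, not the script Neutron currently generates; jinja2 is assumed to be installed.

from jinja2 import Template

# Hypothetical template: retry the gateway ping a few times so a single
# dropped ICMP reply does not immediately report failure to keepalived.
HEALTHCHECK_TEMPLATE = Template("""\
#!/bin/bash -eu
for i in $(seq 1 {{ retries }}); do
    ping -c 1 -w {{ timeout }} {{ gateway_ip }} && exit 0
done
exit 1
""")

print(HEALTHCHECK_TEMPLATE.render(gateway_ip="192.168.100.1", retries=3, timeout=1))
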
14:52:52 <slaweq> dosaboy: I was thinking about jinja2 too :)
14:52:58 <rafaelweingartne> The idea of a customizable template looks great, as long as there is a default one already in place.
14:53:26 <slaweq> rafaelweingartne: yes, my idea here was that we should basically provide default template which will be the same as what we have now
14:53:27 <dosaboy> rafaelweingartne: yeah absolutely, a default that can be overridden via a config path
14:53:29 <mlavalle> I think that is the idea, rafaelweingartne
14:53:55 <dosaboy> slaweq: yep
14:54:54 <amotoki> generally, templating it sounds great. it potentially makes our bug triage more complex. my question is whether we will keep the current configuration options (though I haven't checked if we have any).
14:54:58 <dosaboy> so i guess there's two ways to look at the request, either we "improve" the existing default test and/or make it templatised to allow user-override
14:55:35 <dosaboy> amotoki: the only config currently iirc is the interval i.e. how often to run the check
14:55:37 <slaweq> amotoki: are You asking about configuration for specific process which is spawned for router?
14:55:41 <dosaboy> that can remain
14:56:09 <dosaboy> slaweq: no process here fwiw, the check is run directly by keepalived
14:56:26 <amotoki> slaweq: what i mention is keepalived config
14:56:33 <dosaboy> neutron generates the keepalived conf with the test enabled and a path to the test
14:56:43 <haleyb> slaweq: i like your thought on keepalived options for a router, maybe an extension?  instead of adding more config options?  or were you thinking something else?
14:57:01 <haleyb> that was in the bug comments
14:57:27 <dosaboy> haleyb: this isnt really about how keepalived drives the test though, not in my experience anyway
14:57:38 <dosaboy> our problem was really the test itself
14:57:54 <slaweq> currently, when the neutron-l3-agent configures a new router, it generates the keepalived config file through https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py
14:58:16 <slaweq> and this config file is stored somewhere in /var/lib/neutron/ha_confs/<router_id> (IIRC)
14:58:31 <dosaboy> ... that fails when a single icmp reply is missed within 1s
14:58:56 <slaweq> and my idea here was that we can somehow change the classes in this module so that it will generate the keepalived config file based on some template
14:58:56 <dosaboy> slaweq: correct
14:59:06 <dosaboy> right
14:59:10 <slaweq> so it will still set correct interface names, ip addresses and other variables
14:59:41 <slaweq> but the user will be able to set other things in the template, like some timeouts, etc.
14:59:52 <dosaboy> im totally +1 on that and make the path to the template configurable so that users can modify it and put it somewhere else
14:59:56 <slaweq> I hope that it is clear and that it matches what dosaboy wants :)
14:59:59 <haleyb> slaweq: per-router or per-cloud?
15:00:08 <haleyb> i.e. admin-controlled template
15:00:15 <slaweq> haleyb: template would be "per l3 agent" in fact
15:00:16 <dosaboy> slaweq: yeah i think thats it
15:00:29 <slaweq> but in practice it should be per cloud
15:00:31 <dosaboy> yep per l3agent
15:00:40 <dosaboy> thats enough for us anyways
15:00:51 <slaweq> as You shouldn't have different configs on different network nodes
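
A sketch of how a per-l3-agent template path could be exposed, using oslo.config; the option name and wording are assumptions for illustration, not an agreed interface, and assume oslo.config is available.

from oslo_config import cfg

# Hypothetical option: where the l3 agent looks for a custom keepalived
# template. When unset, a built-in default matching today's generated
# config would be used.
keepalived_template_opts = [
    cfg.StrOpt('keepalived_template_path',
               default=None,
               help='Path to a custom keepalived configuration template used '
                    'by the l3 agent; unset means use the built-in default.'),
]

cfg.CONF.register_opts(keepalived_template_opts)
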
15:01:41 <amotoki> I am +1 for introducing template (and also we can keep better default configs)
15:01:41 <slaweq> ok, we are out of time today
15:01:41 <dosaboy> slaweq: yeah sorry i'm confusing things, it would be the same everywhere but that's just because we would configure all l3-agents the same
15:01:51 <dosaboy> ok thanks for reviewing
15:01:51 <slaweq> I think we need to get back to this next week
15:01:52 <haleyb> it now makes sense to me
15:02:10 <slaweq> or is it clear for You and You want to vote now on that quickly ?
15:02:47 <haleyb> what about others?
15:02:59 <haleyb> i don't think there's a meeting right after us...
15:02:59 <mlavalle> it makes sense to me
15:03:33 <mlavalle> i'm comfortable casting a +1
15:03:38 <slaweq> ok, so it seems that mlavalle amotoki haleyb and I are ok to approve it now
15:03:40 <slaweq> right?
15:03:49 <amotoki> I think so
15:04:33 <slaweq> haleyb: you ok with that proposal, right?
15:04:36 <haleyb> yes, i'd +1, seems useful for cloud admins
15:04:42 <slaweq> ok, thx a lot
15:04:48 <slaweq> so I will mark this one as approved too
15:04:54 <slaweq> thx for proposing it dosaboy
15:05:07 <dosaboy> thanks guys, much appreciated
15:05:08 <slaweq> and sorry that I kept You here longer than usual :)
15:05:17 <slaweq> have a great weekend and see You all next week
15:05:19 <slaweq> o/
15:05:23 <slaweq> #endmeeting