16:01:38 <gibi> #startmeeting nova
16:01:38 <opendevmeet> Meeting started Tue Aug 17 16:01:38 2021 UTC and is due to finish in 60 minutes.  The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:38 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:38 <opendevmeet> The meeting name has been set to 'nova'
16:01:38 <melwitt> gibi: ++ thanks
16:01:55 <gibi> sean-k-mooney: thanks
16:02:45 <gibi> #topic Bugs (stuck/critical)
16:02:52 <gibi> no critical bug open
16:02:56 <gibi> 15 new untriaged bugs (+4 since the last meeting): #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
16:03:22 <gibi> is there any specific bug to discuss today?
16:04:16 <gibi> I see ganso has one in the open discussion, let's bring that up here
16:04:24 <gibi> (ganso): bug "Compute node deletes itself if rebooted without DNS": https://bugs.launchpad.net/nova/+bug/1939920
16:04:29 <gibi> was this a design choice? acceptable solutions discussion
16:04:33 <gibi> EOM
16:04:37 <ganso> gibi thanks
16:05:00 <ganso> so, IMO this is a critical bug, and after reading the code and the way it works it kinda feels like a design choice
16:05:29 <ganso> because it seems like it was intentionally implemented to scan for "orphan compute nodes" and delete them, clear the allocations and RP, etc
16:05:42 <gibi> yes that was intentional
16:05:51 <ganso> but it is producing this effect which is very undesirable
16:06:33 <gibi> so in your infra the compute host can change hostname and that is causing the issue
16:06:33 <ganso> as I suggested in the bug, a possible solution I see is to compare the host field in the nova.compute_nodes table
16:06:43 <ganso> if it is the same, then we would skip this
16:06:57 <ganso> gibi: it is not that it "can" change the hostname. But it happens due to external reasons
16:07:09 <ganso> like, lack of connectivity when it boots, a DNS outage, etc
16:07:26 <melwitt> it will recover in that it will create a new compute node etc. the main thing that is "unique" is the hostname, that's what's stored in the instance.host and a whole lot of other places. so changing the name you break all the associations and in reality you essentially have a new/different service and compute node
16:08:39 <melwitt> if the associations were done using UUID it would be a different story. but unfortunately it is what it is and it would take a lot of work to change it IMHO
16:08:39 <ganso> melwitt: right, so the instance.host captures the entire FQDN, and the FQDN is what is changing, therefore when that changes, running instances are no longer identifiable as running in that node
16:08:58 <melwitt> right
16:09:26 <ganso> melwitt: so that is another side-effect of that FQDN changing problem, but I am not proposing changing that. I am just proposing to skip this "deletion" step if the compute_nodes.host field does not change
16:09:33 <ganso> it will avoid part of the issues
16:10:00 <ganso> melwitt it will not avoid the issue you described, but 1 issue is better than 2 I think
16:10:24 <dansmith> I'm missing the distinction I think
16:11:22 <gibi> but compute_nodes.host comes from the DB, doesn't it? so it won't ever change
16:11:24 <sean-k-mooney> ganso: right so nova does not support compute hosts changing hostname today
16:11:52 <ganso> gibi: doesn't it derive from the FQDN it reads from the system?
16:11:57 <sean-k-mooney> so if it is changing for external reasons that is not expected to work out of the box
16:12:34 <dansmith> sean-k-mooney: ++
16:12:42 <ganso> sean-k-mooney: right, but I'm not proposing that it support that, just that it stop doing what it is doing today. That thing about orphan compute nodes isn't supposed to address changing hostnames either
16:13:05 <gibi> ganso: when the ComputeNode is created then yes, it is coming from the hostname reported by libvirt, but never changes after
16:13:35 <sean-k-mooney> ganso: even if we did not clean up the orphan compute nodes the instance.host is used to make rpc calls to the host that the instance is on
16:14:00 <sean-k-mooney> so unless you hardcode the host parameter in the nova.conf so it does not change
16:14:06 <ganso> dansmith: when FQDN changes from "host.domain" to "host.domain1" or just "host" it causes the compute node to delete itself from the DB, clear allocations, RP, etc, and the new name will not match the instances.host field as melwitt mentioned. Out of all those consequences, I'd suggest skipping the compute node deletion, because this is an error state, to avoid deleting all allocations and RPs, so the node can more easily go
16:14:06 <ganso> back to normal once the FQDN is fixed and the service is restarted
16:14:12 <sean-k-mooney> that will still break
16:14:32 <sean-k-mooney> ganso: the compute service will not do that by default
16:14:41 <sean-k-mooney> the compute service will auto register
16:14:44 <ganso> sean-k-mooney: yes, that will still be broken, as it is today, no need to fix that right now
16:14:45 <sean-k-mooney> but it won't auto delete
16:15:17 <ganso> sean-k-mooney: well it does, it thinks there was an orphan and deletes it
16:15:27 <sean-k-mooney> ganso: what deletes it
16:15:33 <sean-k-mooney> i think i missed that
16:15:39 <ganso> sean-k-mooney: https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997
16:16:08 <sean-k-mooney> is this a clustered hypervisor?
16:16:24 <ganso> sean-k-mooney: "host.domain" changes to "host", so it deletes "host.domain" from the compute nodes table and creates a new one, as if the node was brand new
16:16:36 <sean-k-mooney> e.g.  ironic or hyperv or something like vmware
16:16:52 <dansmith> ganso: because that's a hostname change
16:16:57 <ganso> sean-k-mooney: no, it is just a regular compute node with a libvirt compute service
16:17:05 <dansmith> ganso: arrange for that not to happen, that's the solution, IMHO
16:17:22 <ganso> dansmith: unfortunately it is beyond control
16:17:37 <sean-k-mooney> well for the libvirt driver that is entirely unsupported
16:17:51 <sean-k-mooney> the other way to fix this is to make sure your canonical hostname is not the fqdn
16:17:51 <ganso> my proposal is to leave it in an error state to prevent it from deleting allocations and RP
16:18:42 <sean-k-mooney> e.g. in /etc/hosts set <ip> <short hostname> <fqdn>
16:19:09 <dansmith> sean-k-mooney: or /etc/domainname, but yeah, totally fixable, IMHO
16:19:33 <sean-k-mooney> I'm still confused how nodenames is a list in this case
16:19:33 <ganso> sean-k-mooney: hmm I see, that would override the one currently being provided by the domain provider
16:20:07 <sean-k-mooney> or rather how, when the fqdn changes, we are actually getting anything back from the db
16:20:14 <sean-k-mooney> i was expecting it to not match anything
16:20:26 <dansmith> can't you set the hostname nova uses in the config anyway? to hard-code it per host so it doesn't change, I thought we had that
16:20:44 <sean-k-mooney> unless you have something like host1.<domain1> host1.<domain2>
16:20:53 <ganso> dansmith: looking in the code now
16:21:06 <sean-k-mooney> dansmith: you can set the hostname used by the compute service
16:21:09 <dansmith> we might not want to hold up the meeting to discuss this to completion
16:21:11 <sean-k-mooney> not the hypervisor_hostname
16:21:17 <sean-k-mooney> which in this case comes from libvirt
16:21:25 <dansmith> sean-k-mooney: ah, so that's fixed but the nodename is always from hostname, right okay
16:21:28 <ganso> oh yea, console_host
16:21:28 <ganso> default=socket.gethostname()
16:21:48 <sean-k-mooney> ganso: not console host but I'll get the link and send it to you
16:21:57 <sean-k-mooney> i think we can move on and come back to this after the meeting
16:22:08 <ganso> sean-k-mooney: thanks, yes. Thanks for the suggestions!
16:22:10 <gibi> let's come back to this
16:22:23 <gibi> thanks sean-k-mooney dansmith melwitt
16:22:34 <sean-k-mooney> ganso: https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L52-L70
16:22:35 <gibi> any other bug that needs attention?
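Note on the orphan cleanup discussed above: the behaviour can be illustrated with a small, simplified model. This is only a sketch and not nova's actual code; `ComputeNodeRow`, `orphaned_nodes`, and the sample host names are hypothetical. It assumes, as described in the discussion, that compute node records are looked up by the service host name (which can be pinned in nova.conf), while the node name is reported by the driver and follows the system hostname.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNodeRow:
    """Hypothetical stand-in for a row in the compute_nodes table."""
    host: str                  # service host name (stable if pinned in config)
    hypervisor_hostname: str   # node name as reported by the virt driver


def orphaned_nodes(rows, service_host, driver_nodenames):
    """Return rows owned by service_host that the driver no longer reports."""
    reported = set(driver_nodenames)
    return [
        row for row in rows
        if row.host == service_host
        and row.hypervisor_hostname not in reported
    ]


# Record created while DNS worked, so the driver reported the FQDN.
rows = [ComputeNodeRow(host="host", hypervisor_hostname="host.domain")]

# Normal restart: the driver still reports the FQDN, nothing is orphaned.
print(orphaned_nodes(rows, "host", ["host.domain"]))

# Restart without DNS: the driver reports the short name, so the existing
# record no longer matches and would be treated as an orphan.
print(orphaned_nodes(rows, "host", ["host"]))
```

In this model the name mismatch alone is what marks the existing record as an orphan, which is why keeping the reported node name stable (for example through /etc/hosts, as suggested above) avoids the cleanup.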
16:23:30 <gibi> #topic Gate status
16:23:35 <gibi> Nova gate bugs #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure
16:23:42 <gibi> I don't see new gate bugs in that list
16:24:24 <gibi> and also I pushed plenty of patches yesterday without many failures from Zuul
16:24:24 <gibi> so I think master CI looks good
16:24:24 <sean-k-mooney> :)
16:24:25 <gibi> any recent failures?
16:24:44 <melwitt> yeah I think the troubling one is the libvirt/qemu one.. the one we're making the live migration job non-voting over
16:25:05 <gibi> yeah, skipping that helped a lot
16:25:37 <gibi> placement periodic jobs are green too #link https://zuul.openstack.org/builds?project=openstack%2Fplacement&pipeline=periodic-weekly
16:25:46 <gibi> anything else about the gate?
16:26:17 <gibi> #topic Release Planning
16:26:24 <gibi> Milestone 3 and therefore Feature Freeze is on the 3rd of September, which is in 2 weeks.
16:26:46 <gibi> let's land things :)
16:26:52 <gibi> Non client library freeze is this week.
16:26:56 <gibi> os-vif: https://review.opendev.org/q/project:openstack/os-vif+status:open+branch:master nothing important seems to be pending
16:27:04 <gibi> os-resource-classes: https://review.opendev.org/q/project:openstack/os-resource-classes+status:open ditto nothing is pending
16:27:06 <sean-k-mooney> yes I might try to address https://bugs.launchpad.net/os-vif/+bug/1939542
16:27:09 <gibi> os-traits: https://review.opendev.org/q/project:openstack/os-traits+status:open there seem to be pending reviews for traits needed by ongoing features e.g.: COMPUTE_GRAPHICS_MODEL_BOCHS and HW_FIRMWARE_UEFI
16:27:16 <sean-k-mooney> but I'm fine with backporting it too
16:27:37 <gibi> sean-k-mooney: sure, bugs are easy, as the fix is backportable
16:28:10 <gibi> but os-traits has some new trait proposals, and if they do not land then the features depending on them are blocked in Xena
16:28:42 <gibi> so let's close those this week
16:28:47 <sean-k-mooney> I'm not sure about HW_FIRMWARE_UEFI
16:29:04 <sean-k-mooney> but I'll review it
16:29:22 <sean-k-mooney> technically that is stating that the host has uefi boot capability
16:29:39 <gibi> the BOCHS trait probably needs kashyap's answer as stephenfin had some feedback on https://review.opendev.org/c/openstack/os-traits/+/794807
16:29:44 <sean-k-mooney> as in the host can boot in uefi mode, not that it can virtualise it
16:30:10 <sean-k-mooney> so I think HW_FIRMWARE_UEFI should be COMPUTE_FIRMWARE_UEFI
16:30:34 <gibi> ohh, that is a good point
16:30:45 <gibi> stephenfin: ^^ :)
16:30:53 <sean-k-mooney> the BOCHS trait looks correct but I'll read stephen's comments
16:31:04 <gibi> sean-k-mooney: thanks
16:31:15 <gibi> anything else about the coming lib feature freeze?
16:32:36 <gibi> #topic PTG Planning
16:32:40 <gibi> all the info is in the PTG etherpad #link https://etherpad.opendev.org/p/nova-yoga-ptg
16:32:48 <gibi> If you see a need for a specific cross project section then please let me know
16:33:03 <gibi> s/section/session/
16:34:22 <gibi> any questions about the PTG?
16:35:09 <gibi> #topic Stable Branches
16:35:14 <gibi> stable/queens is blocked (tempest-full-py3 @ "Starting Horizon", probably due to queens-eol of horizon)
16:35:18 <gibi> all the other branches' gates look OK
16:35:21 <gibi> EOM from elodilles
16:35:56 <elodilles> i've proposed a quick fix for queens gate: https://review.opendev.org/c/openstack/devstack/+/804889
16:36:13 <gibi> elodilles: thanks
16:36:19 <gibi> any other news from stable-land?
16:36:48 <elodilles> nothing from me
16:37:19 <gibi> OK moving on
16:37:32 <gibi> I'm skipping libvirt subteam as bauzas_away is on PTO
16:37:39 <gibi> #topic Open discussion
16:37:43 <gibi> (melwitt): unified limits series is ready for review (https://blueprints.launchpad.net/nova/+spec/unified-limits-nova) https://review.opendev.org/q/topic:bp/unified-limits-nova
16:37:59 <gibi> I started on that ^^ and will continue tomorrow
16:38:10 <opendevreview> Merged openstack/nova stable/wallaby: libvirt: Do not destroy volume secrets during _hard_reboot  https://review.opendev.org/c/openstack/nova/+/796258
16:38:10 <gibi> but one more core is needed
16:38:40 <gibi> who feels the power?
16:39:00 <melwitt> yeah just wanted to give a quick heads up that this is up-to-date, as some know it was stalled for awhile. it's a "tech preview" status where the legacy quota APIs are read-only and there are no quota migration tools, it is DIY for operators to try out
16:39:20 <sean-k-mooney> dansmith: lyarwood  do ye have time to review the unified limits series
16:39:30 <melwitt> I have added some tempest test coverage that Depends-On it that can be looked at to see it working
16:39:44 <dansmith> do I? no. should I? yes. Will I? I'll try :)
16:39:53 <sean-k-mooney> :)
16:39:58 <melwitt> hehe ++
16:40:41 <gibi> :)
16:40:42 <melwitt> thanks all for listening, we can move on I think
16:40:46 <gibi> ok
16:41:06 <gibi> there is one more topic on the wiki
16:41:07 <gibi> (gibi): PTL nomination is open. As I noted in my Xena nomination, I will not run for the 4th time as Nova PTL.
16:41:50 <gibi> if you have questions about the role as you consider running for it then feel free to ask me
16:43:37 <gibi> nothing else on the agenda
16:43:47 <gibi> is there anything else to discuss today?
16:44:56 <gibi> if not then thanks for joining
16:45:08 <ganso> gibi: I will do more testing with the hostname config and /etc/hosts later today and I will mark that bug as invalid if successful (probably will be) =)
16:45:18 <gibi> ganso: cool, thanks
16:45:20 <gibi> #endmeeting