16:01:38 #startmeeting nova
16:01:38 Meeting started Tue Aug 17 16:01:38 2021 UTC and is due to finish in 60 minutes. The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:38 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:38 The meeting name has been set to 'nova'
16:01:38 gibi: ++ thanks
16:01:55 sean-k-mooney: thanks
16:02:45 #topic Bugs (stuck/critical)
16:02:52 no critical bug open
16:02:56 #link 15 new untriaged bugs (+4 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
16:03:22 is there any specific bug to discuss today?
16:04:16 I see ganso has one in the open discussion, let's bring that up here
16:04:24 (ganso): bug "Compute node deletes itself if rebooted without DNS": https://bugs.launchpad.net/nova/+bug/1939920
16:04:29 was this a design choice? acceptable solutions discussion
16:04:33 EOM
16:04:37 gibi: thanks
16:05:00 so, IMO this is a critical bug, and after reading the code and the way it works it feels like a design choice
16:05:29 because it seems like it was intentionally implemented to scan for "orphan" compute nodes and delete them, clear the allocations and resource provider, etc.
16:05:42 yes, that was intentional
16:05:51 but it is producing this effect, which is very undesirable
16:06:33 so in your infra the compute host can change hostname, and that is causing the issue
16:06:33 as I suggested in the bug, a possible solution I see is to compare the host field in the nova.compute_nodes table
16:06:43 if it is the same, then we would skip this
16:06:57 gibi: it is not that it "can" change the hostname, but that it happens due to external reasons
16:07:09 like lack of connectivity when it boots, a DNS outage, etc.
16:07:26 it will recover in that it will create a new compute node etc. the main thing that is "unique" is the hostname; that's what's stored in instance.host and a whole lot of other places.
so by changing the name you break all the associations, and in reality you essentially have a new/different service and compute node
16:08:39 if the associations were done using UUIDs it would be a different story, but unfortunately it is what it is, and it would take a lot of work to change it, IMHO
16:08:39 melwitt: right, so instance.host captures the entire FQDN, and the FQDN is what is changing; therefore when that changes, running instances are no longer identifiable as running on that node
16:08:58 right
16:09:26 melwitt: so that is another side effect of the FQDN-changing problem, but I am not proposing changing that. I am just proposing to skip this "deletion" step if the compute_nodes.host field does not change
16:09:33 it will avoid part of the issues
16:10:00 melwitt: it will not avoid the issue you described, but one issue is better than two, I think
16:10:24 I'm missing the distinction, I think
16:11:22 but compute_nodes.host comes from the DB, doesn't it? so it won't ever change
16:11:24 ganso: right, so nova does not support compute hosts changing hostname today
16:11:52 gibi: doesn't it derive from the FQDN it reads from the system?
16:11:57 so if it is changing for external reasons, that is not expected to work out of the box
16:12:34 sean-k-mooney: ++
16:12:42 sean-k-mooney: right, but I'm not proposing that it support that; I'm just proposing it stop doing what it is doing today.
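(A minimal sketch of the association problem discussed above, using plain dicts rather than real nova objects; the field names mirror the DB columns mentioned in the discussion, everything else is illustrative.)

```python
# Simplified stand-ins for the DB records discussed above; not real
# nova objects. instance.host records the service hostname (the FQDN
# here) at the time the instance was scheduled.
instance = {"uuid": "inst-1", "host": "host.domain"}

# After a DNS outage the OS reports a different name for the same box:
current_hostname = "host"

# Nova ties instances to compute services by hostname string, not by
# UUID, so the changed name breaks the association: the instance no
# longer appears to live on this host.
assert instance["host"] != current_hostname
```

This is why melwitt notes that keying the associations by UUID would avoid the whole class of problem, at the cost of a large migration.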
That thing about orphan compute nodes isn't supposed to address changing hostnames either
16:13:05 ganso: when the ComputeNode is created then yes, it comes from the hostname reported by libvirt, but it never changes afterwards
16:13:35 ganso: even if we did not clean up the orphan compute nodes, instance.host is used to make RPC calls to the host that the instance is on
16:14:00 so unless you hardcode the host parameter in nova.conf so it does not change
16:14:06 dansmith: when the FQDN changes from "host.domain" to "host.domain1" or just "host", it causes the compute node to delete itself from the DB, clear allocations, the resource provider, etc., and the new name will not match the instances.host field, as melwitt mentioned. Out of all those consequences, I'd suggest skipping the compute node deletion, because this is an error state, to avoid deleting all the allocations and resource providers, so the node can more easily go
16:14:06 back to normal once the FQDN is fixed and the service is restarted
16:14:12 that will still break
16:14:32 ganso: the compute service will not do that by default
16:14:41 the compute service will auto-register
16:14:44 sean-k-mooney: yes, that will still be broken, as it is today; no need to fix that right now
16:14:45 but it won't auto-delete
16:15:17 sean-k-mooney: well it does, it thinks there was an orphan and deletes it
16:15:27 ganso: what deletes it?
16:15:33 I think I missed that
16:15:39 sean-k-mooney: https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997
16:16:08 is this a clustered hypervisor
16:16:24 sean-k-mooney: "host.domain" changes to "host", so it deletes "host.domain" from the compute_nodes table and creates a new one, as if the node was brand new
16:16:36 e.g.
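(A hedged sketch of one possible reading of ganso's proposal above; the function name, data shapes, and the exact skip condition are illustrative assumptions, not the actual nova.compute.manager code linked in the discussion.)

```python
def delete_orphan_compute_nodes(db_nodes, driver_nodenames, this_host):
    """Return the DB compute nodes the manager would treat as orphans.

    db_nodes: list of dicts mimicking compute_nodes rows.
    driver_nodenames: set of node names currently reported by the driver.
    this_host: the service's current host name.

    ganso's proposal (as sketched here): when a node's recorded ``host``
    still matches this service, treat the mismatch as a hostname change
    (an error state) and skip the deletion, preserving allocations and
    the resource provider until the FQDN is fixed.
    """
    orphans = []
    for node in db_nodes:
        if node["hypervisor_hostname"] in driver_nodenames:
            continue  # still reported by the hypervisor: not an orphan
        if node["host"] == this_host:
            continue  # proposed skip: likely an FQDN flap, not a removal
        orphans.append(node)  # would also clear allocations and the RP
    return orphans
```

For example, with a node recorded as "host.domain" and a driver now reporting only "host", today's behaviour deletes the old record; under the sketched skip, a node whose `host` still matches the service would be left alone.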
ironic or hyperv or something like vmware?
16:16:52 ganso: because that's a hostname change
16:16:57 sean-k-mooney: no, it is just a regular compute node with a libvirt compute service
16:17:05 ganso: arrange for that not to happen; that's the solution, IMHO
16:17:22 dansmith: unfortunately it is beyond our control
16:17:37 well, for the libvirt driver that is entirely unsupported
16:17:51 the other way to fix this is to make sure your canonical hostname is not the FQDN
16:17:51 my proposal is to leave it in an error state to prevent it from deleting allocations and the resource provider
16:18:42 e.g. set it in /etc/hosts
16:19:09 sean-k-mooney: or /etc/domainname, but yeah, totally fixable, IMHO
16:19:33 I'm still confused how nodenames is a list in this case
16:19:33 sean-k-mooney: hmm I see, that would override the one currently being provided by the domain provider
16:20:07 or rather how, when the FQDN changes, we are actually getting anything back from the DB
16:20:14 I was expecting it to not match anything
16:20:26 can't you set the hostname nova uses in the config anyway? to hard-code it per host so it doesn't change; I thought we had that
16:20:44 unless you have something like host1. host1.
16:20:53 dansmith: looking in the code now
16:21:06 dansmith: you can set the hostname used by the compute service
16:21:09 we might not want to hold up the meeting to discuss this to completion
16:21:11 but not the hypervisor_hostname
16:21:17 which in this case comes from libvirt
16:21:25 sean-k-mooney: ah, so that's fixed but the nodename is always from the hostname, right okay
16:21:28 oh yeah, console_host
16:21:28 default=socket.gethostname()
16:21:48 ganso: not console_host, but I'll get the link and send it to you
16:21:57 I think we can move on and come back to this after the meeting
16:22:08 sean-k-mooney: thanks, yes. Thanks for the suggestions!
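(The `default=socket.gethostname()` line quoted above refers to nova's `[DEFAULT] host` option in nova/conf/netconf.py; the snippet below just demonstrates where that default comes from. The nova.conf value shown in the comment is an example, not a recommendation.)

```python
import socket

# When [DEFAULT]/host is unset in nova.conf, nova falls back to the
# OS-reported hostname, so the service name follows DNS/FQDN changes:
default_host = socket.gethostname()

# Pinning it explicitly makes the service name immune to DNS flaps:
#   [DEFAULT]
#   host = compute-01.example.org   # example value; pick a stable name
print(default_host)
```

Note the caveat from the discussion: this pins the service's `host`, but the libvirt driver's `hypervisor_hostname` still comes from libvirt, which is why the /etc/hosts canonical-hostname fix was also suggested.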
16:22:10 let's come back to this
16:22:23 thanks sean-k-mooney dansmith melwitt
16:22:34 ganso: https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L52-L70
16:22:35 any other bug that needs attention?
16:23:30 #topic Gate status
16:23:35 Nova gate bugs #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure
16:23:42 I don't see new gate bugs in that list
16:24:24 and also I pushed plenty of patches yesterday without many failures from Zuul
16:24:24 so I think master CI looks good
16:24:24 :)
16:24:25 any recent failures?
16:24:44 yeah, I think the troubling one is the libvirt/qemu one that we made the live migration job non-voting over
16:25:05 yeah, skipping that helped a lot
16:25:37 placement periodic jobs are green too #link https://zuul.openstack.org/builds?project=openstack%2Fplacement&pipeline=periodic-weekly
16:25:46 anything else about the gate?
16:26:17 #topic Release Planning
16:26:24 Milestone 3, and therefore Feature Freeze, is on the 3rd of September, which is in 2 weeks.
16:26:46 let's land things :)
16:26:52 Non-client library freeze is this week.
16:26:56 os-vif: https://review.opendev.org/q/project:openstack/os-vif+status:open+branch:master nothing important seems to be pending
16:27:04 os-resource-classes: https://review.opendev.org/q/project:openstack/os-resource-classes+status:open ditto, nothing is pending
16:27:06 yes, I might try to address https://bugs.launchpad.net/os-vif/+bug/1939542
16:27:09 os-traits: https://review.opendev.org/q/project:openstack/os-traits+status:open there are pending reviews for traits needed by ongoing features, e.g. COMPUTE_GRAPHICS_MODEL_BOCHS and HW_FIRMWARE_UEFI
16:27:16 but I'm fine with backporting it too
16:27:37 sean-k-mooney: sure, bugs are easy, as the fix is backportable
16:28:10 but os-traits has some new trait proposals; if they don't land, the features depending on them are blocked in Xena
16:28:42 so let's close those this week
16:28:47 I'm not sure about HW_FIRMWARE_UEFI
16:29:04 but I'll review it
16:29:22 technically that is stating that the host has UEFI boot capability
16:29:39 the BOCHS trait probably needs an answer from kashyap, as stephenfin had some feedback on https://review.opendev.org/c/openstack/os-traits/+/794807
16:29:44 as in, the host can boot in UEFI mode, not that it can virtualise it
16:30:10 so I think HW_FIRMWARE_UEFI should be COMPUTE_FIRMWARE_UEFI
16:30:34 ohh, that is a good point
16:30:45 stephenfin: ^^ :)
16:30:53 the BOCHS trait looks correct but I'll read stephenfin's comments
16:31:04 sean-k-mooney: thanks
16:31:15 anything else about the coming lib feature freeze?
16:32:36 #topic PTG Planning
16:32:40 all the info is in the PTG etherpad #link https://etherpad.opendev.org/p/nova-yoga-ptg
16:32:48 If you see a need for a specific cross-project section then please let me know
16:33:03 s/section/session/
16:34:22 any question about the PTG?
16:35:09 #topic Stable Branches
16:35:14 stable/queens is blocked (tempest-full-py3 hangs at "Starting Horizon", probably due to queens-eol of horizon)
16:35:18 all the other branches' gates look OK
16:35:21 EOM from elodilles
16:35:56 I've proposed a quick fix for the queens gate: https://review.opendev.org/c/openstack/devstack/+/804889
16:36:13 elodilles: thanks
16:36:19 any other news from stable-land?
16:36:48 nothing from me
16:37:19 OK, moving on
16:37:32 I'm skipping the libvirt subteam as bauzas_away is on PTO
16:37:39 #topic Open discussion
16:37:43 (melwitt): unified limits series is ready for review (https://blueprints.launchpad.net/nova/+spec/unified-limits-nova) https://review.opendev.org/q/topic:bp/unified-limits-nova
16:37:59 I started on that ^^ and will continue tomorrow
16:38:10 Merged openstack/nova stable/wallaby: libvirt: Do not destroy volume secrets during _hard_reboot https://review.opendev.org/c/openstack/nova/+/796258
16:38:10 but one more core is needed
16:38:40 who feels the power?
16:39:00 yeah, just wanted to give a quick heads-up that this is up to date, as some know it was stalled for a while. it's at "tech preview" status where the legacy quota APIs are read-only and there are no quota migration tools; it is DIY for operators to try out
16:39:20 dansmith, lyarwood: do ye have time to review the unified limits series?
16:39:30 I have added some tempest test coverage that Depends-On it, which can be looked at to see it working
16:39:44 do I? no. should I? yes. Will I? I'll try :)
16:39:53 :)
16:39:58 hehe ++
16:40:41 :)
16:40:42 thanks all for listening, we can move on I think
16:40:46 ok
16:41:06 there is one more topic on the wiki
16:41:07 (gibi): PTL nomination is open. As I noted in my Xena nomination, I will not run for a 4th time as Nova PTL.
16:41:50 if you have questions about the role as you consider running for it, feel free to ask me
16:43:37 nothing else on the agenda
16:43:47 is there anything else to discuss today?
16:44:56 if not, then thanks for joining
16:45:08 gibi: I will do more testing with the hostname config and /etc/hosts later today and I will mark that bug as invalid if successful (probably will be) =)
16:45:18 ganso: cool, thanks
16:45:20 #endmeeting