| *** mhen_ is now known as mhen | 02:04 | |
| *** bbezak_ is now known as bbezak | 06:22 | |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2025.1: Fix 'nova-manage image_property set' command https://review.opendev.org/c/openstack/nova/+/958834 | 06:53 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: Reproduce bug/2115905 https://review.opendev.org/c/openstack/nova/+/954336 | 08:33 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: [PCI tracker]Remove non configured devs when freed https://review.opendev.org/c/openstack/nova/+/954613 | 08:33 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec https://review.opendev.org/c/openstack/nova/+/954149 | 08:33 |
| gibi | sean-k-mooney: ^^ I fixed the commit messages and pulled out two constants for the log message asserts | 08:34 |
| sean-k-mooney[m] | cool, I'll review it after I grab a coffee and start my day | 08:36 |
| mmessah | Hi all, `/v2.1/os-hypervisors/detail` requests take too long after upgrading to Epoxy. I saw a similar issue for server list, https://bugs.launchpad.net/nova/+bug/2106380, but we don't have the issue for server list. We tested manually with curl. With OpenStack-API-Version 2.88 and above, we wait about 20 seconds; with 2.87, the result is returned immediately. Is anyone else experiencing this? | 09:01 |
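
A minimal sketch of the timing comparison mmessah describes, assuming a reachable Nova endpoint and a valid Keystone token (both placeholders here); the `OpenStack-API-Version` header selects the compute microversion:

```python
import time
import requests

# Placeholders: substitute your own endpoint and Keystone token.
NOVA = "http://controller:8774/v2.1"
HEADERS_BASE = {"X-Auth-Token": "<keystone-token>"}

for version in ("2.87", "2.88"):
    headers = dict(HEADERS_BASE, **{"OpenStack-API-Version": f"compute {version}"})
    start = time.monotonic()
    resp = requests.get(f"{NOVA}/os-hypervisors/detail", headers=headers)
    print(f"{version}: HTTP {resp.status_code} in {time.monotonic() - start:.1f}s")
```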
| jkulik | 2.88 added the "uptime" field into the hypervisor details. we experienced slowness with that, too ... and thus added a config setting downstream to patch it out: https://github.com/sapcc/nova/commit/686f44c4c65eba6c851c852804e82a3e00309945 | 09:06 |
| frickler | jkulik: interesting, that looks like a different source of slowness to me, did you ever try to upstream your patch? it does seem pretty reasonable to me (caveat I'm no nova developer, just another deployer) | 09:11 |
| jkulik | frickler: no, it stems from a time with less upstream-involvement. especially after an upgrade, downstream times are quite ... stressful :) I'd be happy to propose that, though | 09:12 |
| masahito | we hit the same issue from v2.88. calling the RPC to all hypervisors serially causes slow os-hypervisors/detail API responses. | 09:13 |
| masahito | large-scale deployments hit the same issue. we just wait a few minutes in the 10K-HV case, or use limit and marker to split up the API calls. | 09:15 |
| frickler | yes, iterating over all hypervisors for an API call (assuming there's no caching, didn't check the code) sounds pretty much unscalable | 09:19 |
| jkulik | hm ... Isn't there some policy that the API response shouldn't change depending on operator configs? our downstream patch would break that policy. | 09:26 |
| jkulik | api-ref says "Only reported for active hosts where the virt driver supports this feature." so ... maybe it's still fine, because clients would have to handle those cases anyways? | 09:28 |
| bauzas | hey folks, do we have any upstream tracker for that bug? | 09:34 |
| bauzas | just looked over the code and yeah, I'm afraid this is indeed a performance issue | 09:34 |
| bauzas | when calling _view_hypervisor(), we then call the host API for each of the hypervisors | 09:35 |
| bauzas | gibi: ^ | 09:35 |
| jkulik | I haven't found one. Also no review-request. | 09:35 |
| bauzas | https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L100-L101 | 09:36 |
| opendevreview | Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API https://review.opendev.org/c/openstack/nova/+/959420 | 09:40 |
| jkulik | ready to work on making ^ better. I guess it needs a release note at least? but before that, I'd like some feedback on whether this approach should be pursued | 09:42 |
| bauzas | eek, https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/virt/libvirt/driver.py#L12319-L12322 | 09:42 |
| bauzas | so now when gathering statistics, we make N external process calls | 09:43 |
| bauzas | jkulik: I need to review your patch but we can't have API behaviour changed by configuration | 09:43 |
| jkulik | yeah, figured. it's already behaving like that for a subset of the HVs, though | 09:44 |
| bauzas | hmmm, unfortunately I need to leave in the next few minutes | 09:45 |
| bauzas | jkulik: where do we conditionally change the API behaviour ? | 09:46 |
| bauzas | I mean, already | 09:46 |
| jkulik | depending on whether the calls succeed or raise NotImplemented, we return None instead of an actual uptime string | 09:46 |
| jkulik | so this patch changes to "always None" | 09:46 |
| bauzas | atm, if we return None, this is not because we change the configuration, right? this is just whether the call works or not | 09:48 |
| bauzas | the concern with a configurable API is about interop | 09:48 |
| jkulik | yes. for some hypervisors, this call never works. api-ref states for "uptime": "Only reported for active hosts where the virt driver supports this feature." | 09:48 |
| jkulik | so imho, a client cannot rely on this being set | 09:49 |
| * bauzas needs to leave :( | 09:49 | |
| stephenfin | jkulik: You're focused on the vmware driver, right? And I assume the noise you're talking about is the tracebacks in nova-compute? Couldn't you just implement the method for that driver but return None? | 09:57 |
| stephenfin | That could likely be backported too | 09:57 |
| jkulik | sure. doesn't bring any speedup at all though, so the _actual_ issue of os-hypervisor/detail being slow doesn't go away | 10:01 |
| opendevreview | sean mooney proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec https://review.opendev.org/c/openstack/nova/+/954149 | 10:05 |
| opendevreview | Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API https://review.opendev.org/c/openstack/nova/+/959420 | 10:19 |
| sean-k-mooney | jkulik: yes, we have a general rule against config-driven API behaviour | 10:34 |
| sean-k-mooney | jkulik: also, the uptime performance issue is known | 10:34 |
| sean-k-mooney | jkulik: the way we plan to fix this is to add it to the stats in the compute node db table and read it from there instead of doing the RPC | 10:35 |
| sean-k-mooney | so in the future you will get a cached value but won't have the performance hit | 10:35 |
| sean-k-mooney | it will be in the data retrieved with the other compute node values in the same SQL query | 10:35 |
| sean-k-mooney | jkulik: gibi and I debugged this a month or two ago but didn't have time to work on proposing the fix | 10:36 |
| sean-k-mooney | basically the plan to make this "just work" across upgrades is to make the RPC conditional and only fetch the uptime if it's not in the stats blob we get from the DB | 10:38 |
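
A minimal sketch of the read path sean-k-mooney describes, with hypothetical names (the real change would live in the os-hypervisors API code); it prefers the cached stats value and only falls back to the RPC for not-yet-upgraded computes:

```python
from typing import Any, Optional

def uptime_for(context: Any, stats: dict, host: str, host_api: Any) -> Optional[str]:
    # Prefer the value cached in the compute node's unversioned stats blob;
    # it arrives in the same SQL query as the rest of the node data.
    cached = stats.get('uptime')
    if cached is not None:
        return cached
    # Fall back to the per-host RPC only for computes that were not yet
    # upgraded and so don't publish uptime into their stats.
    try:
        return host_api.get_host_uptime(context, host)
    except NotImplementedError:
        return None
```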
| jkulik | sean-k-mooney: thanks for sharing. I'll abandon my RR then. | 10:55 |
| jkulik | sean-k-mooney: so this means all drivers that don't currently implement get_host_uptime() should at least put a None value into the stats, so we don't still hit the performance problem, right? | 10:55 |
| sean-k-mooney | yep, that or we could just remove the fallback entirely | 11:02 |
| sean-k-mooney | jkulik: we briefly discussed just removing uptime from the API in a new microversion | 11:02 |
| sean-k-mooney | we kind of wanted to get wider input on that before moving forward however | 11:03 |
| sean-k-mooney | jkulik: we originally wanted to remove uptime in the original spec but some operators wanted us to keep it, so we did | 11:03 |
| sean-k-mooney | but im not sure that was the right choice | 11:03 |
| sean-k-mooney | jkulik: does vmware support uptime at all? | 11:04 |
| sean-k-mooney | jkulik: we could also make the fallback to RPC conditional on whether the driver supports it | 11:04 |
| jkulik | sean-k-mooney: no, not at all, because in VMware we have a whole cluster as one hypervisor in Nova and each node in the cluster has a different uptime | 11:05 |
| sean-k-mooney | and then just exclude vmware if it does not | 11:05 |
| jkulik | we're also introducing libvirt, so I'd still like the detail call to become fast ;) | 11:05 |
| sean-k-mooney | right, so we know the hypervisor type and could limit the fallback to just libvirt (qemu) | 11:05 |
| sean-k-mooney | so for all other drivers we could return None | 11:06 |
| sean-k-mooney | if not in stats | 11:06 |
| jkulik | ... or patch everyone to return None, which would be fine for me, too | 11:07 |
| sean-k-mooney | well, again, we proposed removing uptime when we removed the hypervisor statistics API | 11:08 |
| sean-k-mooney | and some folks objected, so we can't actually do that without a new microversion | 11:08 |
| sean-k-mooney | i.e. it would break our API rules to just stop returning it without a new microversion | 11:08 |
| sean-k-mooney | but instead of a config option https://review.opendev.org/c/openstack/nova/+/959420/2/nova/api/openstack/compute/hypervisors.py#105 | 11:09 |
| sean-k-mooney | you could have checked whether the hypervisor type was qemu and only fetched the uptime if it was | 11:09 |
| sean-k-mooney | eventually we also want to stash it in the stats to make libvirt fast, but that would fix vmware for now | 11:10 |
| jkulik | I mean, we have the patch downstream. there's no pressure from our side. I'd go with the proper "put in stats" solution. | 11:10 |
| sean-k-mooney | ack. we may want to put this on the PTG agenda so we don't forget, or have a dedicated bug for it | 11:11 |
| sean-k-mooney | https://bugs.launchpad.net/nova/+bug/2106380 is about server list, not the hypervisors API, so we should avoid reusing it | 11:12 |
| sean-k-mooney | with that said, I wonder if gibi already filed an upstream bug | 11:12 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix 'nova-manage image_property set' command https://review.opendev.org/c/openstack/nova/+/959429 | 11:38 |
| * gibi reads back | 11:39 | |
| gibi | bauzas: jkulik: the uptime issue rings a bell for me but I think it was only reported downstream. I agree that this needs to be fixed | 11:41 |
| gibi | until then the workaround is to use an older microversion | 11:42 |
| gibi | jkulik: yeah, a config-driven API is a no-no. But we can revisit why we added the uptime to the main GET call, or we can try to speed it up via scatter-gather, or reverse the relationship and push the uptime data to the DB periodically from the compute so reading it from the API will be quick | 11:43 |
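
To make the cost concrete, here is a toy simulation of the serial fan-out versus the scatter-gather option gibi mentions; the host count and the 20 ms per-RPC latency are assumptions for illustration, not measurements:

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_HOSTS = 1000
RPC_ROUND_TRIP = 0.02  # assumed seconds per get_host_uptime() round trip

def fake_rpc(host: str) -> str:
    time.sleep(RPC_ROUND_TRIP)  # stand-in for the RPC plus the 'uptime' exec
    return f"{host}: up 42 days"

hosts = [f"compute-{i}" for i in range(NUM_HOSTS)]

start = time.monotonic()
for h in hosts:  # today's behaviour: one call per hypervisor, serially
    fake_rpc(h)
print(f"serial: {time.monotonic() - start:.1f}s")          # ~20s

start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as ex:  # the scatter-gather idea
    list(ex.map(fake_rpc, hosts))
print(f"scatter-gather: {time.monotonic() - start:.1f}s")  # ~0.4s
```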
| gibi | hm, the latter rings another bell: I think this was dansmith's idea | 11:43 |
| gibi | so we might have discussed this upstream already as well | 11:43 |
| gibi | aand there is another semi-related thread: sending that uptime RPC to computes that are down, but not yet marked as down in the DB, is another source of slowness | 11:48 |
| jkulik | I'd go with the "sending to the DB" - not sure how to solve that for computes being down, though. | 11:50 |
| gibi | jkulik: we already have the periodic heartbeat to know when a compute is down, so we can use that info to avoid reporting stale uptime data on the API | 11:51 |
| gibi | I did not find this uptime discussion in my IRC logs so this probably happened downstream. I will file a bug right away summarizing both problems and the solution suggestion | 12:07 |
| gibi | aand I found the downstream report https://issues.redhat.com/browse/OSPRH-17563 so we probably have motivation to try to fix it :) | 12:16 |
| gibi | filed the bug https://bugs.launchpad.net/nova/+bug/2122036 | 12:23 |
| gibi | mmessah: ^^ | 12:25 |
| mmessah | Thanks for discussing the situation so quickly :) I hope it will be resolved as soon as possible. We recently upgraded our test environment to Epoxy. We will continue to test all services. If we hit anything, we will share it on IRC. | 12:39 |
| gibi | mmessah: thanks | 12:41 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix case-sensitivity for metadata keys https://review.opendev.org/c/openstack/nova/+/959435 | 12:42 |
| gibi | dansmith: sean-k-mooney: when you are referring to "large-ops" job from the past (in relation to scale testing eventlet removal) do you mean https://review.opendev.org/c/openstack/project-config/+/278978 ? | 12:49 |
| sean-k-mooney | I was not referring to that because I did not know it existed | 12:50 |
| sean-k-mooney | but maybe dan was | 12:50 |
| gibi | sean-k-mooney: ahh, sorry, I did not remember where this input was coming from :) | 12:51 |
| sean-k-mooney | gibi: I had just mentioned that kolla and devstack can fake many compute nodes per physical host | 12:51 |
| sean-k-mooney | my experience of this was as part of the OSIC lab scale testing | 12:51 |
| sean-k-mooney | where Intel and Rackspace gave access to, I think, 500? bare-metal hosts, half x86 and the rest ppc | 12:52 |
| sean-k-mooney | for the community to do some scale testing | 12:52 |
| sean-k-mooney | gibi: I wonder if the rally jobs have any existing usage of the fake agents | 12:54 |
| sean-k-mooney | or if the tempest jobs in rally itself are using real agents | 12:54 |
| gibi | yeah rally is also on my radar to look into, especially neutron's jobs | 12:55 |
| sean-k-mooney | so rally and browbeat are the two scaling frameworks we have | 12:55 |
| sean-k-mooney | at least those are the two I am vaguely aware of | 12:56 |
| gibi | yeah, I never saw a real browbeat scenario but heard the name before | 12:56 |
| sean-k-mooney | it might be a wrapper around rally, looking at the repo | 12:57 |
| sean-k-mooney | https://opendev.org/x/browbeat/ | 12:57 |
| sean-k-mooney | it has no CI jobs so that won't help us | 12:58 |
| bauzas | gibi: sean-k-mooney: well, I think we should quickly remove the uptime call IMHO | 12:58 |
| gibi | ohh, browbeat refers to "generate_tripleo_inventory" | 12:58 |
| gibi | that puts me off :) | 12:58 |
| gibi | bauzas: in a new microversion? in a non backportable way? | 12:58 |
| bauzas | if you look at how this works, we call the driver for every compute, and then, for example, the driver directly calls the OS | 12:58 |
| bauzas | so when you have 10000 hypervisors, each of them calls the OS directly | 12:59 |
| sean-k-mooney | looks like rally does not have any devstack jobs https://github.com/openstack/rally/blob/master/.zuul.d/zuul.yaml although I'm not sure what rally's tox-functional does | 12:59 |
| bauzas | gibi: I don't know how to fix this, I'd like to discuss that with gmaan and dansmith | 12:59 |
| sean-k-mooney | bauzas: that would require an API microversion | 12:59 |
| bauzas | because when I see https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L12319-L12322 I'm pretty done | 13:00 |
| bauzas | sean-k-mooney: well, if that's due to a bug, maybe not | 13:00 |
| sean-k-mooney | bauzas: it's not | 13:00 |
| sean-k-mooney | stephen wanted to remove it in the spec | 13:00 |
| bauzas | like, HTTP 500 -> HTTP 200 doesn't need a new microversion | 13:00 |
| sean-k-mooney | and we told them no | 13:00 |
| sean-k-mooney | because ops asked for it to be kept | 13:01 |
| bauzas | here we don't have an HTTP 500, but it can take more than a minute to get a response | 13:01 |
| sean-k-mooney | that does not mean we can't make it better | 13:01 |
| gibi | HTTP 500 -> HTTP 200 is only acceptable as a bugfix in my eyes if it still provides the data that the non-500 case provides | 13:02 |
| sean-k-mooney | https://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/modernize-os-hypervisors-api.html#problem-description | 13:02 |
| bauzas | I saw it | 13:02 |
| bauzas | already and I saw the implementation change too | 13:02 |
| sean-k-mooney | yes, this was done years ago; to be clear, it's not recent | 13:02 |
| sean-k-mooney | it has been this way since wallaby | 13:02 |
| gibi | I agree that we need to fix it but maybe we can fix it in a way that preserve the functionality without the performance hit | 13:02 |
| bauzas | but the problem is now we call the OS for all of the hosts automatically | 13:02 |
| bauzas | instead of a specific API | 13:03 |
| sean-k-mooney | you opt into it with the microversion | 13:03 |
| sean-k-mooney | we don't if you use the old one | 13:03 |
| bauzas | sure | 13:03 |
| sean-k-mooney | but there is a way to achieve both | 13:03 |
| sean-k-mooney | first, for any host that is not using the qemu virt_type | 13:03 |
| sean-k-mooney | we can skip the call entirely | 13:03 |
| bauzas | but atm, 2.88 has a performance problem, right? | 13:03 |
| sean-k-mooney | and return None | 13:03 |
| sean-k-mooney | and then we can start adding the uptime to the compute node stats | 13:04 |
| sean-k-mooney | and if it's in the stats we can skip it even for libvirt | 13:04 |
| gibi | (ahh the idea was from sean-k-mooney not from dansmith sorry for the misattribution above) | 13:04 |
| sean-k-mooney | finally, if we want to remove it from the API entirely, we can do that in a new microversion | 13:04 |
| jkulik | instead of having nova-compute call out into the OS on every HV, we could add it once on startup and add a timestamp to it. the API call can then just add `time.time() - timestamp` to the originally reported uptime :) | 13:05 |
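
A minimal sketch of jkulik's suggestion (names hypothetical; using time.monotonic() rather than time.time() here to be safe against wall-clock jumps):

```python
import time

class UptimeCache:
    """Read the host uptime once at agent startup; afterwards derive the
    current value arithmetically instead of execing 'uptime' per request."""

    def __init__(self, initial_uptime_seconds: float):
        self._initial = initial_uptime_seconds  # read once, e.g. at startup
        self._started = time.monotonic()

    def current(self) -> float:
        # originally reported uptime + time elapsed since the agent started
        return self._initial + (time.monotonic() - self._started)
```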
| sean-k-mooney | so we have 2 backportable fixes (only do the RPC if the driver supports it, and cache the value in the stats generated via update_available_resource) and one non-backportable fix removing it in a new microversion | 13:05 |
| sean-k-mooney | jkulik: well we can also do that too yes | 13:06 |
| sean-k-mooney | but that would be different | 13:06 |
| jkulik | then we don't need a regular task | 13:06 |
| jkulik | and also no rpc call | 13:06 |
| sean-k-mooney | we check the host so the value does not go backwards if you restart the agent on the host | 13:06 |
| sean-k-mooney | jkulik: we have a periodic that is updating the stats already | 13:06 |
| sean-k-mooney | we would just be making it add one more key to the JSON blob | 13:07 |
| sean-k-mooney | so that's pretty much free | 13:07 |
| jkulik | but is it calling out to a process every time? sounded to me like someone didn't like that very much | 13:07 |
| sean-k-mooney | if we want to change the meaning to be the compute agent uptime instead of the host uptime, we could do the time.time() - timestamp change as well | 13:07 |
| jkulik | out, err = processutils.execute('env', 'LANG=C', 'uptime') | 13:07 |
| sean-k-mooney | well, we can get that info from libvirt too, I think, or there are other ways to do it than calling the uptime command | 13:08 |
| jkulik | we call that on agent startup and add the time since agent startup -> meaning stays the same | 13:08 |
| sean-k-mooney | that's not what uptime is | 13:08 |
| sean-k-mooney | the uptime is not the time since the agent started | 13:08 |
| sean-k-mooney | it's the uptime of the host | 13:08 |
| sean-k-mooney | and if you restart nova-compute without a reboot those are not the same | 13:09 |
| jkulik | yes. hence we add the agent-uptime to the initially-read host uptime | 13:09 |
| sean-k-mooney | or we just don't provide this in the future, since you should not be getting this from nova at all | 13:09 |
| jkulik | fine by me :D | 13:09 |
| sean-k-mooney | in wallaby we really wanted to remove this from nova, but ops wanted to keep it instead of pulling it from a monitoring tool like prometheus or collectd | 13:10 |
| sean-k-mooney | the compromise is we can provide a cached value from the existing periodic, which is basically free | 13:10 |
| sean-k-mooney | for older microversions, and delete this in a future one | 13:10 |
| jkulik | sounded like bauzas wouldn't like calling `uptime` in the periodic on all hosts all the time. so we'd also need a cheaper replacement for that. | 13:12 |
| sean-k-mooney | the important thing is to fix the performance regression. we could also choose, in a new microversion, to make retrieval of the uptime controllable by a query arg and default it off, but that feels overly complicated | 13:12 |
| sean-k-mooney | jkulik: well, we can get that from libvirt as well, I'm pretty sure | 13:12 |
| sean-k-mooney | but that is an entirely separate issue | 13:12 |
| sean-k-mooney | nova has always done that | 13:12 |
| sean-k-mooney | it was not a regression introduced by 2.88 | 13:13 |
| jkulik | periodically? only on request, right? and we're going to change how often Nova calls that, if we move it to the periodic task | 13:13 |
| jkulik | it's probably™ not much overhead, so we could also fix that later if we see it being a problem. | 13:13 |
| sean-k-mooney | jkulik: my suggestion is to only update it when the update_available_resource periodic runs | 13:14 |
| sean-k-mooney | which is every 5 mins | 13:14 |
| sean-k-mooney | we could stash the start time of the agent in the stats instead | 13:14 |
| bauzas | sean-k-mooney: maybe I'm not right, but https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/hypervisors.py#L206 calls _view_hypervisor for each of the hypervisors in the list | 13:15 |
| sean-k-mooney | and compute the uptime in the API | 13:15 |
| bauzas | then in https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L97-L101 we unconditionally call the compute API to get the uptime | 13:15 |
| sean-k-mooney | bauzas: the RPC happens if you do a detail list or, I believe, hypervisor show with 2.88+ | 13:16 |
| bauzas | not only show right? | 13:16 |
| bauzas | that's my point | 13:16 |
| sean-k-mooney | right, it's in detail list as well | 13:16 |
| sean-k-mooney | which is what was approved in the spec | 13:16 |
| bauzas | so for a detailed list of HVs using 2.88, we then call every compute for getting the uptime, right? | 13:17 |
| sean-k-mooney | yep | 13:17 |
| bauzas | each of those computes then eventually calls the driver, and in the case of libvirt, it execs an OS process to get the uptime | 13:18 |
| sean-k-mooney | yep | 13:18 |
| sean-k-mooney | no one is saying this is good or scalable | 13:18 |
| sean-k-mooney | but it's not an API bug | 13:18 |
| bauzas | so that means that if I have a list of 1000 computes all of them running libvirt, then I'd get 1000 RPC calls and each of them would access the OS before returning | 13:18 |
| bauzas | that's what I call not scalable, per se | 13:18 |
| gibi | (I think we all agree that it is bad) | 13:19 |
| sean-k-mooney | right, we could also speed it up with scatter-gather, but unless we want to break our API contract we can't fix it by just not including it | 13:19 |
| gibi | we cannot just rip out of 2.88 something that works for 10 computes but not for 1000 | 13:19 |
| bauzas | previously, in 2.87, we were not making N RPC calls; that's why I can't say this is not a regression | 13:19 |
| bauzas | if you have to opt in, for sure | 13:20 |
| bauzas | MHO is that either we declare 2.88 unusable and fix it in another microversion, or we somehow try to fix 2.88 itself | 13:21 |
| bauzas | my preference would be the latter for downstream reasons :) | 13:21 |
| sean-k-mooney | bauzas: look, I'm just going to go fix this because I'm -2 on just ripping it out | 13:21 |
| bauzas | sean-k-mooney: I'm just waiting for gmaan and dansmith | 13:22 |
| sean-k-mooney | I'll file a bug and work on it this/next week | 13:22 |
| dansmith | sorry, I'm on another call | 13:22 |
| bauzas | there are multiple ways to address that, but I'd prefer that we, as a team, find a solution we all agree on | 13:22 |
| bauzas | before even working on a fix | 13:23 |
| jkulik | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/2122036 bug is filed already | 13:23 |
| bauzas | dansmith: doh, I should attend that call too | 13:23 |
| sean-k-mooney | jkulik: ah perfect thanks gibi | 13:23 |
| sean-k-mooney | oh right | 13:23 |
| * sean-k-mooney well, it's almost over so no point joining now | 13:24 | |
| sean-k-mooney | jkulik: we already depend on psutil in nova, so I can do time.time() - psutil.boot_time() and cache it on the first invocation | 13:25 |
| sean-k-mooney | well, I can cache the boot_time part | 13:25 |
| jkulik | nice | 13:25 |
| gibi | sean-k-mooney: btw, devstack with the fake driver works like a charm. I just rebuilt my two-VM devstack with 5 fake computes in each VM | 13:26 |
| sean-k-mooney | gibi: cool, it's been a very long time since I used it, so I'm glad it still works | 13:27 |
| opendevreview | ribaudr proposed openstack/nova stable/2025.1: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959445 | 13:28 |
| opendevreview | ribaudr proposed openstack/nova stable/2025.1: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959446 | 13:28 |
| sean-k-mooney | apparently time.clock_gettime(time.CLOCK_BOOTTIME) is also a thing, without the psutil dep, on py 3.7+ | 13:29 |
| sean-k-mooney | we don't have many deps on psutil, so if we can avoid adding more that would be good | 13:30 |
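
The two ways of getting seconds-since-boot that sean-k-mooney mentions, side by side; psutil is a third-party package (already a nova dependency, per the discussion), and CLOCK_BOOTTIME is Linux-specific and needs Python 3.7+:

```python
import time
import psutil

# Option 1: wall-clock now minus the absolute boot timestamp.
uptime_psutil = time.time() - psutil.boot_time()

# Option 2: stdlib only; CLOCK_BOOTTIME counts seconds since boot,
# including time spent suspended.
uptime_clock = time.clock_gettime(time.CLOCK_BOOTTIME)

print(f"psutil: {uptime_psutil:.0f}s, clock_gettime: {uptime_clock:.0f}s")
```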
| gibi | (and just booted 200 fake VMs successfully) | 13:34 |
| sean-k-mooney | did the fake networking agent work, by the way, or did you do that without networking? | 13:34 |
| sean-k-mooney | ironic has a fake driver too but I'm not sure about cinder | 13:35 |
| sean-k-mooney | glance kind of doesn't need one, as you can just have tiny or even 0-byte images | 13:36 |
| gibi | just used --nic none | 13:36 |
| sean-k-mooney | same with keystone and placement; they just talk to a DB | 13:36 |
| sean-k-mooney | although they don't use eventlet, so keystone and placement are not really relevant to testing nova at scale | 13:37 |
| sean-k-mooney | at least not in this context | 13:37 |
| jkulik | regarding scale: we have a patch downstream that makes the server-group listing much faster with lots of server-groups around (e.g. when an admin lists them all). We changed the DB queries to actually do the pagination instead of doing it after fetching _all_ DB objects. | 13:43 |
| jkulik | Would that be something I can propose just like that - with a bug-report - or does that need some special process? | 13:43 |
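
A hedged sketch of the kind of change jkulik describes, pushing pagination into the query instead of slicing after fetching everything; the table and column names here are illustrative, not nova's actual schema:

```python
import sqlalchemy as sa

def list_server_groups(session, project_id: str, limit: int, offset: int):
    # Before (the slow pattern): fetch every row, then paginate in Python:
    #   rows = session.execute(sa.select(sg)).all()
    #   return rows[offset:offset + limit]
    #
    # After: let the database do the pagination.
    sg = sa.table('instance_groups',
                  sa.column('id'), sa.column('project_id'), sa.column('name'))
    query = (
        sa.select(sg)
        .where(sg.c.project_id == project_id)
        .order_by(sg.c.id)
        .limit(limit)
        .offset(offset)
    )
    return session.execute(query).all()
```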
| dansmith | bauzas: what do you want me to look at? | 13:45 |
| sean-k-mooney | jkulik: yes for master; whether it would be backportable depends on the actual change | 13:45 |
| bauzas | nothing specific to look at, just being around to agree on a solution | 13:45 |
| jkulik | sean-k-mooney: I don't need backports, as we have it running. would just be nice to share if possible :) | 13:46 |
| sean-k-mooney | jkulik: it's always valid to report performance bugs and fix them on master. whether it's backportable will depend on whether you're using any features from newer SQLAlchemy etc., or whether it has any other upgrade impact | 13:46 |
| bauzas | tl;dr: since 2.88, a detailed hypervisor list calls every compute, and we wait on the compute's OS to return an uptime | 13:46 |
| dansmith | bauzas: oh yeah I read the bug | 13:46 |
| dansmith | bauzas: seems like the solution there _has_ to be to start reporting uptime somewhere we can persist it and fix that to query that location instead of making the (egads) rpc call | 13:52 |
| bauzas | something that could then be backported to Epoxy, if we can | 13:53 |
| dansmith | I agree we can't just stop doing it, and making it only present if you don't list, or don't list enough is also not great | 13:53 |
| bauzas | if we can't backport it, then 2.88 will have the same issue | 13:53 |
| sean-k-mooney | the hypervisor API is already doing an SQL query to the compute nodes DB table and getting the stats | 13:54 |
| sean-k-mooney | we can put it in that JSON blob, which is entirely unversioned today | 13:54 |
| sean-k-mooney | so if we want to backport it we can | 13:54 |
| dansmith | yeah was just going to say, ComputeNode has several places we could stash it I think | 13:55 |
| dansmith | but, it will be absent during an upgrade so we probably need to fall back to the old behavior for nodes that aren't upgraded | 13:55 |
| sean-k-mooney | the bit I'm currently checking is whether it's purely the time or we're including the load info too | 13:55 |
| sean-k-mooney | I'm hoping it's just the former, but some of our docs imply the latter | 13:56 |
| dansmith | doesn't much matter | 13:57 |
| sean-k-mooney | well, it just changes what we need to stash | 13:58 |
| sean-k-mooney | but yeah, it doesn't matter in the long run | 13:58 |
| sean-k-mooney | the deprecated API https://docs.openstack.org/api-ref/compute/#show-hypervisor-uptime-deprecated | 13:58 |
| bauzas | if we can stuff the uptime value somewhere, sure | 13:58 |
| sean-k-mooney | was the raw output of uptime | 13:58 |
| bauzas | without needing to have a new DB field, cool | 13:58 |
| sean-k-mooney | but I think in 2.88 it's just the time in seconds | 13:58 |
| sean-k-mooney | but I just need to check that on a master devstack to confirm | 13:59 |
| bauzas | sean-k-mooney: my concern is not with hypervisor show, again. | 13:59 |
| sean-k-mooney | bauzas: yep, I know | 13:59 |
| bauzas | if we continue to call the OS for hypervisor show, meh | 13:59 |
| bauzas | even if that's bad, that's not really an issue | 13:59 |
| bauzas | but, if we are able to persist some uptime value for detailed-list, then of course we could modify hypervisor show | 14:00 |
| jkulik | I don't see anything parsing the `uptime` command output, so it seems to be indeed the full one | 14:01 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.2: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959525 | 14:01 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.2: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959526 | 14:01 |
| sean-k-mooney | it is, yes, I just confirmed that | 14:01 |
| sean-k-mooney | https://paste.opendev.org/show/bJ59n0G92NoP06hDsfAu/ | 14:02 |
| jkulik | including a newline? :D | 14:02 |
| sean-k-mooney | so I can't just return the time, but I can stash the value in the stats and skip the RPC call if it's there | 14:02 |
| sean-k-mooney | and yeah, it also has a leading space | 14:03 |
| dansmith | can't what? | 14:03 |
| sean-k-mooney | sorry, I can't just retrieve the time since boot and return that value in seconds | 14:04 |
| bauzas | dansmith: given we return the whole uptime, we can't just calculate it | 14:04 |
| sean-k-mooney | dansmith: jkulik suggested doing that to avoid calling the uptime command | 14:04 |
| sean-k-mooney | in a future microversion we can, just not in the backportable fix | 14:04 |
| dansmith | ack, sorry I just wasn't getting that meaning from the translation | 14:04 |
| bauzas | this is honestly super weird to have that in our API | 14:05 |
| sean-k-mooney | yeah, that was a PEBKAC; my keyboard was not aligned so I was typing off-center | 14:05 |
| bauzas | I don't disagree with sean-k-mooney when he said we should have stopped providing it | 14:05 |
| bauzas | but now we have it in detailed-list instead of a specific API call... so that's even worse | 14:05 |
| dansmith | yep, just got to fix it, not the end of the world | 14:06 |
| gibi | ^^ +1 | 14:08 |
| bauzas | dansmith: using a periodic that would cache some raw value? | 14:09 |
| dansmith | bauzas: just use whatever periodic we use to update the metrics and/or RT IMHO | 14:10 |
| bauzas | I guess we would need to modify the API docs to say that the uptime value can be a bit stale | 14:12 |
| dansmith | does the api guarantee or suggest that it's live? | 14:12 |
| jkulik | with the call taking 40s, I guess it wasn't live already for a lot of them :D | 14:13 |
| dansmith | jkulik: excellent point :D | 14:13 |
| bauzas | dansmith: good point, /me looks at the existing docs | 14:19 |
| bauzas | well, nothing is specifically said, so I guess we can just fix it silently | 14:21 |
| dansmith | :) | 14:21 |
| bauzas | "stats" can be a good CN object field candidate for persisting the value | 14:23 |
| bauzas | that's a dict of strings | 14:23 |
| bauzas | sean-k-mooney: ^ | 14:23 |
| dansmith | IIRC both stats and metrics are actually used for something, but I don't remember all the details | 14:24 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix case sensitive comparison https://review.opendev.org/c/openstack/nova/+/959534 | 14:25 |
| bauzas | dansmith: I just remembered where stats is updated | 14:27 |
| sean-k-mooney | bauzas: you realise I literally said that many, many times | 14:27 |
| bauzas | first bit of good news: that's a periodically updated field, via the RT's update_available_resource() | 14:27 |
| sean-k-mooney | dansmith: the api-ref just says "The total uptime of the hypervisor and information about average load." | 14:28 |
| bauzas | sean-k-mooney: well, then I missed the many times when you said that | 14:29 |
| sean-k-mooney | and everything else is cached at the period of the update_available_resource periodic | 14:29 |
| dansmith | sean-k-mooney: yeah, it's unfortunate that's in our api ref at all, especially since all that is very os-specific | 14:30 |
| bauzas | anyway, sounds like we have a plan | 14:31 |
| sean-k-mooney | that's what the deprecated API says, and this is what the hypervisor detail API says: ```The total uptime of the hypervisor and information about average load. Only reported for active hosts where the virt driver supports this feature``` | 14:31 |
| bauzas | we can fix 2.88 and drop that field in a later microversion | 14:31 |
| bauzas | if ops complain about that missing field, then we can tell them to use whatever tool for calling the OS by themselves | 14:32 |
| sean-k-mooney | bauzas: just for extra context, gibi and I talked about this a few weeks ago, and before that I talked to others about this too. | 14:32 |
| bauzas | ansible, prometheus or nagios, I don't care | 14:32 |
| sean-k-mooney | so this is not the first conversation we have had, upstream or downstream, on how to fix this | 14:32 |
| bauzas | few weeks ago, I was on PTO :) | 14:32 |
| sean-k-mooney | you were on pto for the prior ones | 14:32 |
| bauzas | so I had no context, but the good thing is that we arrived at the same solution, isn't that good? | 14:33 |
| sean-k-mooney | yes, I'm aware, but we were not actively fixing it because of time pressure for FF | 14:33 |
| bauzas | I think this is a reasonable bugfix for RC1 | 14:33 |
| bauzas | not RC2, of course, as that's not a Flamingo regression per se | 14:33 |
| bauzas | but I know you know | 14:34 |
| dansmith | bauzas: ++ for rc1 | 14:34 |
| sean-k-mooney | maybe I'll see if I can fix it tomorrow, or at least have a draft for review | 14:34 |
| dansmith | if it's reasonable to do so | 14:34 |
| bauzas | extending the Stats class by stuffing in a new key pair sounds like a reasonable solution | 14:34 |
| bauzas | but sean already found the solution; I'll let him say whether he's OK with it | 14:35 |
| sean-k-mooney | let's just review the patch and we can backport as normal. if it's ready for RC1, cool; if not, we can cherry-pick after we release 2025.2 | 14:36 |
| dansmith | sean-k-mooney: there's no patch up to review yet, right? just one in your head? | 14:38 |
| sean-k-mooney | exactly. I'm still checking whether any other virt driver ever supported this RPC | 14:38 |
| sean-k-mooney | basically, I'm not sure I want to stash a sentinel value in the stats for other virt drivers | 14:39 |
| sean-k-mooney | if I can show it's only libvirt by inspecting the code, I can skip that and guard the RPC call behind both the virt type and the absence of the key in the dict | 14:40 |
| sean-k-mooney | that way I can avoid stashing garbage values for ironic or vmware compute nodes | 14:40 |
| sean-k-mooney | looks like zvm also supports it | 14:45 |
| sean-k-mooney | although if you're using zvm, the content of the response is different, I think, based on the unit tests | 14:46 |
| sean-k-mooney | so zvm and libvirt support it; the rest all raise NotImplementedError. OK, that's good to know | 14:48 |
| sean-k-mooney | I will probably put the zvm fix in a separate commit, but it will be effectively the same | 14:49 |
| dansmith | super expensive to make that rpc call for every hypervisor only to have them raise NotImplemented :/ | 14:51 |
| sean-k-mooney | yep, that's why I don't want to even try :) | 14:53 |
| bauzas | stats are already provided by the virt driver and we stuff them in the Stats object using update_stats() | 14:54 |
| bauzas | so we can just return the uptime as part of the virt resources | 14:55 |
| sean-k-mooney | yep, that's the plan | 14:55 |
| sean-k-mooney | but the uptime is meant to be the uptime of the node, which is driver-dependent; it's not the uptime of the compute service | 14:56 |
| bauzas | if the driver doesn't provide that uptime value, fine, we already provide None | 14:56 |
| bauzas | that's why I don't see the problem | 14:56 |
| sean-k-mooney | for ironic, if we chose to implement it, it would be the uptime of the ironic instance. but I'm not suggesting we do that | 14:56 |
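
And a sketch of the write side of the plan: the driver contributes uptime to the stats it already reports, so the existing update_available_resource periodic persists it for free. Names here are hypothetical, except get_host_uptime(), which the discussion confirms is implemented by libvirt and zvm only:

```python
def collect_driver_stats(driver) -> dict:
    # Called from the existing update_available_resource periodic; the
    # result lands in the unversioned compute node stats JSON blob, so the
    # API can read it in the same SQL query as the rest of the node data.
    stats = {}
    try:
        # Only libvirt and zvm implement this; every other driver raises
        # NotImplementedError, and we simply don't add the key.
        stats['uptime'] = driver.get_host_uptime()
    except NotImplementedError:
        pass
    return stats
```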
| opendevreview | ribaudr proposed openstack/nova stable/2024.1: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959540 | 14:58 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.1: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959541 | 14:58 |
| bauzas | I don't see the problem | 14:58 |
| bauzas | (sorry) | 14:58 |
| bauzas | each of the virt drivers can have get_host_uptime() defined | 14:59 |
| bauzas | the simple fix, to me, seems to be calling get_host_uptime(), if defined, directly in the virt driver method that returns the resources | 14:59 |
| bauzas | and we could later remove the RPC API and make that method private | 15:00 |
| dansmith | _is_ there a problem? | 15:00 |
| bauzas | I dunno hence my question | 15:01 |
| bauzas | AFAICS, only zvm and libvirt do implement the method | 15:02 |
| bauzas | vmware is not providing it | 15:02 |
| dansmith | right, so no problem | 15:02 |
| dansmith | honestly, | 15:02 |
| dansmith | the API also provides for =None for all these drivers | 15:02 |
| dansmith | and the "does this specific driver provide this data" couldn't possibly be part of the API contract, IMHO | 15:03 |
| dansmith | like, if we started remotely managing libvirt nodes from a central one, we wouldn't be able to provide this anymore | 15:03 |
| bauzas | that API field has to be gone IMHO | 15:03 |
| bauzas | but for the moment, let's fix the performance issue | 15:03 |
| dansmith | I'm not saying we shouldn't just solve this the way we've prescribed here, but... :) | 15:03 |
| bauzas | in a cloud world, I'm still surprised to see that we continue to provide hypervisor specific fields | 15:04 |
| sean-k-mooney | bauzas: well, https://review.opendev.org/c/openstack/nova-specs/+/765797/comment/8926d415_9c5a7022/ we all said we would prefer to remove it, but you wanted to discuss it with operators first. | 15:09 |
| sean-k-mooney | we tried to remove it once before and we decided to reverse course because people asked to keep it | 15:10 |
| sean-k-mooney | that does not mean we can't revisit that and fix it going forward | 15:10 |
| sean-k-mooney | but we should not rewrite history | 15:10 |
| bauzas | LOL | 15:11 |
| bauzas | I love when something pops up out of context :) | 15:11 |
| bauzas | 1/ I was only talking about hypervisor show | 15:11 |
| bauzas | 2/ I was just trying to explain that we should remove that but by another spec, not that one | 15:11 |
| sean-k-mooney | well, that was update 3 to the spec | 15:12 |
| bauzas | and AFAIK, the spec you mentioned wasn't the one we had for 2.88, right? | 15:12 |
| sean-k-mooney | we originally agreed to remove it, then we updated the spec after the fact to re-add it | 15:12 |
| sean-k-mooney | it was the spec for 2.88; it just went through many iterations | 15:12 |
| sean-k-mooney | you can see the iterations if you really care: https://github.com/openstack/nova-specs/commits/abfb91875c7d6b1e2df45d5bc2244956bc662884/specs/wallaby/approved/modernize-os-hypervisors-api.rst | 15:14 |
| sean-k-mooney | I always remember this spec because stephen used a 💩 emoji | 15:16 |
| gmaan | ++, I agree with continuing to show it in the same API instead of a separate one (a separate API, even for the old behaviour, would be a microversion bump). reporting a cached value is the better solution, and backportable. | 16:33 |
| opendevreview | Lajos Katona proposed openstack/nova master: Use SDK for Neutron networks https://review.opendev.org/c/openstack/nova/+/928022 | 16:46 |
| opendevreview | Sylvain Bauza proposed openstack/nova master: change uptime in hypervisor to be provided async using stats https://review.opendev.org/c/openstack/nova/+/959571 | 17:31 |
| bauzas | sean-k-mooney: gmaan: dansmith: I made a quick POC 'that works on my devstack' | 17:32 |
| bauzas | jkulik too: https://review.opendev.org/c/openstack/nova/+/959571 | 17:33 |
| sean-k-mooney | I have a slightly different quick POC partly done | 17:59 |
| opendevreview | sean mooney proposed openstack/nova master: hypervisors: Optimize uptime retrieval for better performance https://review.opendev.org/c/openstack/nova/+/959604 | 20:45 |