Thursday, 2025-09-04

02:04 *** mhen_ is now known as mhen
06:22 *** bbezak_ is now known as bbezak
06:53 <opendevreview> Rajesh Tailor proposed openstack/nova stable/2025.1: Fix 'nova-manage image_property set' command  https://review.opendev.org/c/openstack/nova/+/958834
08:33 <opendevreview> Balazs Gibizer proposed openstack/nova master: Reproduce bug/2115905  https://review.opendev.org/c/openstack/nova/+/954336
08:33 <opendevreview> Balazs Gibizer proposed openstack/nova master: [PCI tracker]Remove non configured devs when freed  https://review.opendev.org/c/openstack/nova/+/954613
08:33 <opendevreview> Balazs Gibizer proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec  https://review.opendev.org/c/openstack/nova/+/954149
08:34 <gibi> sean-k-mooney: ^^ I fixed the commit messages and pulled out two constants for the log message asserts
08:36 <sean-k-mooney[m]> cool, I'll review it after I grab a coffee and start my day
09:01 <mmessah> Hi all, `/v2.1/os-hypervisors/detail` requests take too long after upgrading to Epoxy. I saw a similar issue for server list (https://bugs.launchpad.net/nova/+bug/2106380), but we don't have the issue for server list. We tested manually with a curl request: if I use OpenStack-API-Version 2.88 or above, we wait about 20 seconds; if I use OpenStack-API-Version 2.87, the result is returned immediately. Is there anyone else experiencing this?
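The microversion pin mmessah describes is just a request header. A minimal sketch of the comparison (the token and the behaviour comments are placeholders/assumptions, not values from this log):

```python
# Build headers for a Nova API request pinned to a specific microversion.
# The token value is a placeholder for illustration.
def mv_headers(token, microversion):
    return {
        "X-Auth-Token": token,
        # 2.88+ makes GET /os-hypervisors/detail fan out a per-compute
        # uptime RPC (discussed below); 2.87 does not.
        "OpenStack-API-Version": "compute %s" % microversion,
    }

fast = mv_headers("TOKEN", "2.87")  # returned immediately in mmessah's test
slow = mv_headers("TOKEN", "2.88")  # ~20 seconds in mmessah's deployment
```

Sending the same `GET /v2.1/os-hypervisors/detail` with each header set reproduces the timing difference discussed in the rest of this log.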
09:06 <jkulik> 2.88 added the "uptime" field to the hypervisor details. We experienced slowness with that too ... and thus added a config setting downstream to patch it out: https://github.com/sapcc/nova/commit/686f44c4c65eba6c851c852804e82a3e00309945
09:11 <frickler> jkulik: interesting, that looks like a different source of slowness to me. Did you ever try to upstream your patch? It seems pretty reasonable to me (caveat: I'm no nova developer, just another deployer)
09:12 <jkulik> frickler: no, it stems from a time with less upstream involvement. Especially after an upgrade, downstream times are quite ... stressful :) I'd be happy to propose it, though
09:13 <masahito> we hit the same issue with v2.88. Calling the RPC to all hypervisors in serial causes the slow os-hypervisors/detail API response.
09:15 <masahito> Large-scale deployments hit the same issue. We just wait a few minutes in a 10K-HV situation, or use limit and marker to split up the API calls.
09:19 <frickler> yes, iterating over all hypervisors for an API call (assuming there's no caching, didn't check the code) sounds pretty much not scalable
09:26 <jkulik> hm ... isn't there some policy that the API response shouldn't change depending on operator config? Our downstream patch would break that policy.
09:28 <jkulik> api-ref says "Only reported for active hosts where the virt driver supports this feature." so ... maybe it's still fine, because clients have to handle those cases anyway?
09:34 <bauzas> hey folks, do we have any upstream tracker for that bug?
09:34 <bauzas> just looked over the code and yeah, I'm afraid this is indeed a performance issue
09:35 <bauzas> when calling _view_hypervisor(), we then call the host API for each of the hypervisors
09:35 <bauzas> gibi: ^
09:35 <jkulik> I haven't found one. Also no review request.
09:36 <bauzas> https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L100-L101
09:40 <opendevreview> Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API  https://review.opendev.org/c/openstack/nova/+/959420
09:42 <jkulik> ready to work on making ^ better. I guess it needs a release note at least? But before that, I'd like to get some feedback on whether this approach should be continued
09:42 <bauzas> eek, https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/virt/libvirt/driver.py#L12319-L12322
09:43 <bauzas> so now, when calling statistics, we make N calls to an external process
09:43 <bauzas> jkulik: I need to review your patch, but we can't have API behaviour changed by configuration
09:44 <jkulik> yeah, figured. It's already behaving like that for a subset of the HVs, though
09:45 <bauzas> hmmm, unfortunately I need to leave in the next few minutes
09:46 <bauzas> jkulik: where do we conditionally change the API behaviour?
09:46 <bauzas> I mean, already
09:46 <jkulik> depending on whether the call succeeds or raises NotImplementedError, we return None instead of an actual uptime string
09:46 <jkulik> so this patch changes it to "always None"
09:48 <bauzas> atm, if we return None, that's not because we changed the configuration, right? It's just whether the call works or not
09:48 <bauzas> the concern with a configurable API is about interop
09:48 <jkulik> yes. For some hypervisors, this call never works. api-ref states for "uptime": "Only reported for active hosts where the virt driver supports this feature."
09:49 <jkulik> so imho, a client cannot rely on this being set
09:49 * bauzas needs to leave :(
09:57 <stephenfin> jkulik: You're focused on the vmware driver, right? And I assume the noise you're talking about is the tracebacks in nova-compute? Couldn't you just implement the method for that driver but return None?
09:57 <stephenfin> That could likely be backported too
10:01 <jkulik> sure. Doesn't bring any speedup at all though, so the _actual_ issue of os-hypervisors/detail being slow doesn't go away
10:05 <opendevreview> sean mooney proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec  https://review.opendev.org/c/openstack/nova/+/954149
10:19 <opendevreview> Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API  https://review.opendev.org/c/openstack/nova/+/959420
10:34 <sean-k-mooney> jkulik: yes, we have a general rule about no config-driven API behaviour
10:34 <sean-k-mooney> jkulik: also, the uptime performance issue is known
10:35 <sean-k-mooney> jkulik: the way we plan to fix this is to add it to the stats in the compute_nodes DB table and read it from there instead of doing the RPC
10:35 <sean-k-mooney> so in the future you will get a cached value but won't have the performance hit
10:35 <sean-k-mooney> it will be in the data retrieved with the other compute node values in the same SQL query
10:36 <sean-k-mooney> jkulik: gibi and i debugged this a month or two ago but didn't have time to work on proposing the fix
10:38 <sean-k-mooney> basically, the plan to make this "just work" across upgrades is to make the RPC conditional and get the uptime value only if it's not in the stats blob we get from the DB
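The upgrade-safe lookup sean-k-mooney describes can be sketched as a cache-then-fallback read. This is illustrative only (the function names are not nova's actual internals): the stats blob arrives with the same SQL query that fetches the compute node, so the cached path costs nothing extra, and the RPC only fires for nodes that have not yet started publishing the key.

```python
# Sketch: read uptime from the compute node's cached stats blob if present,
# fall back to the per-compute RPC only when it is absent (e.g. a compute
# that has not been upgraded yet). Names are hypothetical.
def get_uptime(compute_node_stats, rpc_get_uptime):
    cached = compute_node_stats.get("uptime")
    if cached is not None:
        return cached        # free: came along with the same DB query
    return rpc_get_uptime()  # slow path: one RPC round-trip per compute

# Upgraded node: value served from stats, the RPC callable is never invoked.
served = get_uptime({"uptime": " 13:37 up 42 days ..."}, lambda: "via-rpc")
# Not-yet-upgraded node: the key is missing, so we pay the RPC once.
fallback = get_uptime({}, lambda: "via-rpc")
```

This mirrors the "make the RPC conditional" plan: old behaviour is preserved during a rolling upgrade, and the fan-out disappears once every compute reports the key.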
10:55 <jkulik> sean-k-mooney: thanks for sharing. I'll abandon my review request then.
10:55 <jkulik> sean-k-mooney: so this means all drivers currently not implementing get_host_uptime() should at least provide a None value in the stats to not still hit the performance problem, right?
11:02 <sean-k-mooney> yep, that or we could just remove the fallback entirely
11:02 <sean-k-mooney> jkulik: we briefly discussed just removing uptime from the API in a new microversion
11:03 <sean-k-mooney> we kind of wanted to get wider input on that before moving forward, however
11:03 <sean-k-mooney> jkulik: we originally wanted to remove uptime in the original spec, but some operators wanted us to keep it, so we did
11:03 <sean-k-mooney> but I'm not sure that was the right choice
11:04 <sean-k-mooney> jkulik: does vmware support uptime at all?
11:04 <sean-k-mooney> jkulik: we could also make the fallback to RPC conditional on whether the driver supports it
11:05 <jkulik> sean-k-mooney: no, not at all, because with VMware we have a whole cluster as the hypervisor in Nova, and each node in the cluster has a different uptime
11:05 <sean-k-mooney> and then just exclude vmware if it doesn't
11:05 <jkulik> we're also introducing libvirt, so I'd still like the details call to become fast ;)
11:05 <sean-k-mooney> right, so we know the hypervisor type, so we could limit the fallback to just libvirt (qemu)
11:06 <sean-k-mooney> so for all other drivers we could return None
11:06 <sean-k-mooney> if not in stats
11:07 <jkulik> ... or patch everyone to return None, which would be fine for me, too
11:08 <sean-k-mooney> well, again, we proposed removing uptime when we removed the hypervisor stats API
11:08 <sean-k-mooney> and some folks objected to that, so we can't actually do that without a new microversion
11:08 <sean-k-mooney> i.e. it would break our API rules to just stop returning it without a new microversion
11:09 <sean-k-mooney> but instead of a config option https://review.opendev.org/c/openstack/nova/+/959420/2/nova/api/openstack/compute/hypervisors.py#105
11:09 <sean-k-mooney> you could have checked if the hypervisor type was qemu and only gotten the uptime if it was
11:10 <sean-k-mooney> eventually we want to also stash it in the stats to make libvirt fast, but that would fix vmware for now
11:10 <jkulik> I mean, we have the patch downstream. There's no pressure from our side. I'd go with the proper "put it in stats" solution.
11:11 <sean-k-mooney> ack. we may want to put this on the PTG agenda to not forget, or have a dedicated bug for it
11:12 <sean-k-mooney> https://bugs.launchpad.net/nova/+bug/2106380 is about server list, not the hypervisors API, so we should avoid reusing it
11:12 <sean-k-mooney> with that said, i wonder if gibi already filed an upstream bug
11:38 <opendevreview> Rajesh Tailor proposed openstack/nova stable/2024.2: Fix 'nova-manage image_property set' command  https://review.opendev.org/c/openstack/nova/+/959429
11:39 * gibi reads back
11:41 <gibi> bauzas: jkulik: the uptime issue rings a bell for me, but I think it was only reported downstream. I agree that this needs to be fixed
11:42 <gibi> until then, the workaround is using an older microversion
11:43 <gibi> jkulik: yeah, a config-driven API is a no-no. But we can revisit why we added the uptime to the main GET call, or we can try to speed it up via scatter-gather, or reverse the relationship and send the uptime data up to the DB periodically from the compute, so reading it from the API will be quick
11:43 <gibi> hm, that latter rings another bell that this was dansmith's idea
11:43 <gibi> so we might have discussed this already upstream as well
11:48 <gibi> aand there is another semi-related thread: sending that uptime RPC to computes that are down, but not yet marked as down in the DB, is another source of slowness
11:50 <jkulik> I'd go with the "sending to the DB" - not sure how to solve that for computes being down, though.
11:51 <gibi> jkulik: we already have the periodic heartbeat to know when a compute is down, so we can use that info to avoid reporting back stale uptime data on the API
12:07 <gibi> I did not find this uptime discussion in my IRC logs, so this probably happened downstream. I will file a bug right away summarizing both problems and the suggested solution
12:16 <gibi> aand I found the downstream report https://issues.redhat.com/browse/OSPRH-17563 so we probably have motivation to try to fix it :)
12:23 <gibi> filed the bug https://bugs.launchpad.net/nova/+bug/2122036
12:25 <gibi> mmessah: ^^
12:39 <mmessah> Thanks for discussing the situation so quickly :) I hope it will be resolved as soon as possible. We recently upgraded our test environment to Epoxy. We will continue to test all services. If anything comes up, we will share it on IRC.
12:41 <gibi> mmessah: thanks
12:42 <opendevreview> Rajesh Tailor proposed openstack/nova stable/2024.2: Fix case-sensitivity for metadata keys  https://review.opendev.org/c/openstack/nova/+/959435
12:49 <gibi> dansmith: sean-k-mooney: when you are referring to the "large-ops" job from the past (in relation to scale testing the eventlet removal), do you mean https://review.opendev.org/c/openstack/project-config/+/278978 ?
12:50 <sean-k-mooney> i was not referring to that, because i did not know that existed
12:50 <sean-k-mooney> but maybe dan was
12:51 <gibi> sean-k-mooney: ahh sorry, I did not remember where this input was coming from :)
12:51 <sean-k-mooney> gibi: i had just mentioned that kolla and devstack can fake many compute nodes per physical host
12:51 <sean-k-mooney> my experience of this was as part of the OSIC lab scale testing
12:52 <sean-k-mooney> where intel and rackspace gave access to, i think, 500? baremetal hosts, half x86 and the rest ppc
12:52 <sean-k-mooney> for the community to do some scale testing
12:54 <sean-k-mooney> gibi: i wonder if the rally jobs have any existing usage of the fake agents
12:54 <sean-k-mooney> or if the tempest jobs in rally itself are using real agents
12:55 <gibi> yeah, rally is also on my radar to look into, especially neutron's jobs
12:55 <sean-k-mooney> so rally and browbeat are the two scaling frameworks we have
12:56 <sean-k-mooney> at least those are the two i am vaguely aware of
12:56 <gibi> yeah, I never saw a real browbeat scenario but heard the name before
12:57 <sean-k-mooney> it might be a wrapper around rally, looking at the repo
12:57 <sean-k-mooney> https://opendev.org/x/browbeat/
12:58 <sean-k-mooney> it has no ci jobs so that won't help us
12:58 <bauzas> gibi: sean-k-mooney: well, I think we should quickly remove the uptime call IMHO
12:58 <gibi> ohh, browbeat refers to "generate_tripleo_inventory"
12:58 <gibi> that puts me off :)
12:58 <gibi> bauzas: in a new microversion? in a non-backportable way?
12:58 <bauzas> if you look at how this works, we call the driver for every compute, and then for example the driver directly calls the OS
12:59 <bauzas> so when you have 10000 hypervisors, each of them calls the OS directly
12:59 <sean-k-mooney> looks like rally does not have any devstack jobs https://github.com/openstack/rally/blob/master/.zuul.d/zuul.yaml although i'm not sure what rally's tox-functional does
12:59 <bauzas> gibi: I don't know how to fix this, I'd like to discuss it with gmaan and dansmith
12:59 <sean-k-mooney> bauzas: that would require an API microversion
13:00 <bauzas> because when I see https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L12319-L12322 I'm pretty done
13:00 <bauzas> sean-k-mooney: well, if it's due to a bug, maybe not
13:00 <sean-k-mooney> bauzas: it's not
13:00 <sean-k-mooney> stephen wanted to remove it in the spec
13:00 <bauzas> like, HTTP 500 -> HTTP 200 doesn't need a new microversion
13:00 <sean-k-mooney> and we told them no
13:01 <sean-k-mooney> because ops asked for it to be kept
13:01 <bauzas> here, we don't have an HTTP 500, but it can take more than one minute to get a response
13:01 <sean-k-mooney> that does not mean we can't make it better
13:02 <gibi> HTTP 500 -> HTTP 200 is only acceptable as a bugfix in my eyes if it still provides the data that the non-HTTP-500 case provides
13:02 <sean-k-mooney> https://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/modernize-os-hypervisors-api.html#problem-description
13:02 <bauzas> I saw it
13:02 <bauzas> already, and I saw the implementation change too
13:02 <sean-k-mooney> yes, this was done years ago, to be clear; it's not recent
13:02 <sean-k-mooney> it has been this way since wallaby
13:02 <gibi> I agree that we need to fix it, but maybe we can fix it in a way that preserves the functionality without the performance hit
13:02 <bauzas> but the problem is now we call the OS for all of the hosts automatically
13:03 <bauzas> instead of via a specific API
13:03 <sean-k-mooney> you opt into it with the microversion
13:03 <sean-k-mooney> we don't if you use the old one
13:03 <bauzas> sure
13:03 <sean-k-mooney> but there is a way to achieve both
13:03 <sean-k-mooney> first, for any host that is not using the qemu virt_type
13:03 <sean-k-mooney> we can skip the call entirely
13:03 <bauzas> but atm, 2.88 has a performance problem, right?
13:03 <sean-k-mooney> and return None
13:04 <sean-k-mooney> and then we can start adding the uptime to the compute node stats
13:04 <sean-k-mooney> and if it's in the stats we can also skip it even for libvirt
13:04 <gibi> (ahh, the idea was from sean-k-mooney, not from dansmith, sorry for the misattribution above)
13:04 <sean-k-mooney> finally, if we want to remove it from the API entirely, we can do that in a new microversion
13:05 <jkulik> instead of having nova-compute call out into the OS on every HV, we could read it once on startup and add a timestamp to it. The API call can then just add `time.time() - timestamp` to the originally reported uptime :)
13:05 <sean-k-mooney> so we have 2 backportable fixes (only do the RPC if the driver supports it, and cache the value in the stats generated via update_available_resource) and one non-backportable fix removing it in a new microversion
13:06 <sean-k-mooney> jkulik: well, we can also do that too, yes
13:06 <sean-k-mooney> but that would be different
13:06 <jkulik> then we don't need a regular task
13:06 <jkulik> and also no rpc call
13:06 <sean-k-mooney> we'd have to check the host so the value does not go backwards if you restart the agent on the host
13:06 <sean-k-mooney> jkulik: we have a periodic that is updating the stats already
13:07 <sean-k-mooney> we would just be making it add 1 more key to the json blob
13:07 <sean-k-mooney> so that's pretty much free
13:07 <jkulik> but is it calling a process every time? sounded to me like someone didn't like that very much
13:07 <sean-k-mooney> if we want to change the meaning to be the compute agent uptime instead of the host uptime, we could do the time.time()-timestamp change as well
13:07 <jkulik> out, err = processutils.execute('env', 'LANG=C', 'uptime')
13:08 <sean-k-mooney> well, we can get that info from libvirt too, i think, or there are other ways to do it than calling the uptime command
13:08 <jkulik> we call that on agent startup and add the time since agent startup -> meaning stays the same
13:08 <sean-k-mooney> that's not what uptime is
13:08 <sean-k-mooney> the uptime is not the time since the agent started
13:08 <sean-k-mooney> it's the uptime of the host
13:09 <sean-k-mooney> and if you restart nova-compute without a reboot, those are not the same
13:09 <jkulik> yes. hence we add the agent uptime to the initially-read host uptime
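jkulik's read-once-then-derive idea can be sketched with a small cache; this is illustrative only (the class name is made up, and nova's real code shells out to the `uptime` command and returns its full text rather than a number of seconds):

```python
# Sketch: record the host uptime once at agent startup, remember when we
# read it, and derive the current value on demand instead of re-querying.
import time

class UptimeCache:
    def __init__(self, initial_uptime_seconds):
        self._initial = initial_uptime_seconds
        # Monotonic clock, so wall-clock jumps cannot make uptime go backwards.
        self._read_at = time.monotonic()

    def current(self):
        return self._initial + (time.monotonic() - self._read_at)

cache = UptimeCache(initial_uptime_seconds=3600.0)  # read once at startup
later = cache.current()  # cheap: no RPC, no process spawn
```

As sean-k-mooney notes just above, the initial reading must be the host uptime (not the agent's start time), and it must be re-read on every agent restart, otherwise the derived value diverges from the real host uptime.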
13:09 <sean-k-mooney> or we just don't provide this in the future, since you should not be getting this from nova at all
13:09 <jkulik> fine by me :D
13:10 <sean-k-mooney> in wallaby we really wanted to remove this from nova, but ops wanted to keep it instead of pulling it from a monitoring tool like prometheus or collectd
13:10 <sean-k-mooney> the compromise is we can provide a cached value from the existing periodic, which is basically free
13:10 <sean-k-mooney> for older microversions, and delete this in a future one
13:12 <jkulik> sounded like bauzas wouldn't like calling `uptime` in the periodic on all hosts all the time. so we'd also need a cheaper replacement for that.
13:12 <sean-k-mooney> the important thing is to fix the performance regression. we could also choose, in a new microversion, to make the retrieval of the uptime controllable by a query arg and default it to off, but that feels overly complicated
13:12 <sean-k-mooney> jkulik: well, we can get that from libvirt as well, i'm pretty sure
13:12 <sean-k-mooney> but that is an entirely separate issue
13:12 <sean-k-mooney> nova has always done that
13:13 <sean-k-mooney> it was not a regression introduced by 2.88
13:13 <jkulik> periodically? only on request, right? and we're going to change how often Nova calls that, if we move it to the periodic task
13:13 <jkulik> it's probably™ not much overhead, so we could also fix that later if we see it being a problem.
13:14 <sean-k-mooney> jkulik: my suggestion is to only update it when the update_available_resource periodic runs
13:14 <sean-k-mooney> which is every 5 mins
13:14 <sean-k-mooney> we could stash the start time of the agent in the stats instead
13:15 <bauzas> sean-k-mooney: maybe I'm not right, but https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/hypervisors.py#L206 calls _view_hypervisor for each of the hypervisors listed
13:15 <sean-k-mooney> and compute the uptime in the api
13:15 <bauzas> then in https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L97-L101 we unconditionally call the compute API to get the uptime
13:16 <sean-k-mooney> bauzas: the RPC happens if you do a detail list or, i believe, hypervisor show with 2.88+
13:16 <bauzas> not only show, right?
13:16 <bauzas> that's my point
13:16 <sean-k-mooney> right, it's in detail list as well
13:16 <sean-k-mooney> which is what was approved in the spec
13:17 <bauzas> so for a detailed list of HVs using 2.88, we then call every compute to get the uptime, right?
13:17 <sean-k-mooney> yep
13:18 <bauzas> each of those computes then eventually calls the driver, but in the case of libvirt, it calls an OS process to get the uptime
13:18 <sean-k-mooney> yep
13:18 <sean-k-mooney> no one is saying this is good or scalable
13:18 <sean-k-mooney> but it's not an api bug
13:18 <bauzas> so that means that if I have a list of 1000 computes, all of them running libvirt, then I'd get 1000 RPC calls, and each of them would access the OS before returning
13:18 <bauzas> that's what I call not scalable, per se
13:19 <gibi> (I think we all agree that it is bad)
13:19 <sean-k-mooney> right, we could also speed it up with scatter-gather, but unless we want to break our api contract, we can't fix it by just not including it
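The scatter-gather idea mentioned here is to issue the per-compute calls concurrently rather than serially, so a detail listing costs roughly one round-trip instead of N. A generic sketch (the fetch function is a stand-in for the real per-host RPC, and the names are not nova's actual scatter-gather helpers):

```python
# Sketch: fan out a per-host call across a thread pool and collect the
# results, instead of awaiting each host one after another.
from concurrent.futures import ThreadPoolExecutor

def gather_uptimes(hosts, fetch_uptime, max_workers=50):
    """Call fetch_uptime(host) concurrently and map host -> result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(hosts, pool.map(fetch_uptime, hosts)))

uptimes = gather_uptimes(["cn1", "cn2", "cn3"],
                         lambda host: "up 1 day (%s)" % host)
```

This only hides the latency; the N RPCs (and N `uptime` process spawns on the computes) still happen, which is why the log converges on caching the value in the stats blob instead.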
13:19 <gibi> we cannot just rip out from 2.88 something that works for 10 computes but not for 1000
13:19 <bauzas> previously, we were not doing that in 2.87 (making N RPC calls), that's why I can't say this is not a regression
13:20 <bauzas> if you have to opt in, for sure
13:21 <bauzas> MHO is that either we write off 2.88 as unusable and fix it in another microversion, or we try somehow to fix 2.88
13:21 <bauzas> my preference would be the latter, for downstream reasons :)
13:21 <sean-k-mooney> bauzas: look, i'm just going to go fix this, because i'm -2 on just ripping it out
13:22 <bauzas> sean-k-mooney: I'm just awaiting gmaan and dansmith
13:22 <sean-k-mooney> i'll file a bug and work on it this/next week
13:22 <dansmith> sorry, I'm on another call
13:22 <bauzas> there are multiple ways to address that, but I'd rather prefer that we, as a team, find a solution we all agree on
13:23 <bauzas> before even working on a fix
13:23 <jkulik> sean-k-mooney: https://bugs.launchpad.net/nova/+bug/2122036 bug is filed already
13:23 <bauzas> dansmith: doh, I should attend that call too
13:23 <sean-k-mooney> jkulik: ah perfect, thanks gibi
13:23 <sean-k-mooney> oh right
13:24 * sean-k-mooney well, it's almost over, so no point joining now
13:25 <sean-k-mooney> jkulik: we already depend on psutil in nova, so i can do time.time() - psutil.boot_time() and cache it on the first invocation
13:25 <sean-k-mooney> well, i can cache the boot time part
13:25 <jkulik> nice
13:26 <gibi> sean-k-mooney: btw, devstack with the fake driver works like a charm. I just rebuilt my two-VM devstack with 5 fake computes in each VM
13:27 <sean-k-mooney> gibi: cool, it's been a very long time since i used it, so i'm glad it still works
13:28 <opendevreview> ribaudr proposed openstack/nova stable/2025.1: Reproducer for bug 2114951  https://review.opendev.org/c/openstack/nova/+/959445
13:28 <opendevreview> ribaudr proposed openstack/nova stable/2025.1: Fix bug 2114951  https://review.opendev.org/c/openstack/nova/+/959446
13:29 <sean-k-mooney> apparently time.clock_gettime(time.CLOCK_BOOTTIME) is also a thing, without the psutil dep, on py 3.7+
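The stdlib alternative sean-k-mooney mentions can be sketched as below. Note the caveats: `time.CLOCK_BOOTTIME` is Linux-only (the sketch falls back to `CLOCK_MONOTONIC` elsewhere, which may exclude time spent suspended), and this yields seconds since boot rather than the full `uptime` command output that 2.88 currently returns.

```python
# Sketch: seconds since boot without spawning the `uptime` process and
# without the psutil dependency.
import time

def host_uptime_seconds():
    # CLOCK_BOOTTIME (Linux) includes time the system was suspended;
    # CLOCK_MONOTONIC is a portable approximation used as a fallback here.
    clock = getattr(time, "CLOCK_BOOTTIME", time.CLOCK_MONOTONIC)
    return time.clock_gettime(clock)

seconds = host_uptime_seconds()
```

With psutil already a nova dependency, the equivalent would be `time.time() - psutil.boot_time()`, with the boot time cached on first use as discussed above.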
13:30 <sean-k-mooney> we don't have many deps on psutil, so if we can avoid adding more, that would be good
13:34 <gibi> (and just booted 200 fake VMs successfully)
13:34 <sean-k-mooney> did the fake networking agent work, by the way, or did you do that without networking?
13:35 <sean-k-mooney> ironic has a fake driver too, but i'm not sure about cinder
13:36 <sean-k-mooney> glance kind of does not need it, as you can just have tiny or even 0-byte images
13:36 <gibi> just used --nic none
13:36 <sean-k-mooney> same with keystone and placement, they just talk to a db
13:37 <sean-k-mooney> although they don't use eventlet, so keystone and placement are not really relevant to testing nova at scale
13:37 <sean-k-mooney> at least not in this context
13:43 <jkulik> regarding scale: we have a patch downstream that makes server-group listing much faster with lots of server-groups around (e.g. when an admin lists them all). We changed the DB queries to actually do the pagination instead of doing it after fetching _all_ DB objects.
13:43 <jkulik> Would that be something I can propose just like that - with a bug report - or does that need some special process?
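The change jkulik describes is pushing pagination into the SQL query itself instead of fetching every row and slicing in Python. A self-contained sketch using sqlite (the schema and data are made up for illustration; nova uses SQLAlchemy, not raw SQL):

```python
# Sketch: keyset (marker-based) pagination done by the database, so only
# `limit` rows ever leave it, instead of loading all server groups and
# paginating the resulting Python list.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE server_groups (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO server_groups (name) VALUES (?)",
               [("sg-%04d" % i,) for i in range(1000)])

def list_groups_paginated(conn, limit, marker_id=0):
    rows = conn.execute(
        "SELECT id, name FROM server_groups WHERE id > ? ORDER BY id LIMIT ?",
        (marker_id, limit))
    return rows.fetchall()

page = list_groups_paginated(db, limit=10)                  # first page
next_page = list_groups_paginated(db, 10, marker_id=page[-1][0])  # marker = last id
```

This is the same limit-and-marker pattern masahito mentions using against os-hypervisors earlier in the log, applied one layer down at the query level.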
13:45 <dansmith> bauzas: what do you want me to look at?
13:45 <sean-k-mooney> jkulik: yes, for master. Whether it would be backportable would depend on the actual change
13:45 <bauzas> not specifically looking at anything, just being there for accepting a solution
13:46 <jkulik> sean-k-mooney: I don't need backports, as we have it running. Would just be nice to share if possible :)
13:46 <sean-k-mooney> jkulik: it's always valid to report performance bugs and fix them on master. Whether it's backportable will depend on whether you're using any feature from newer sqlalchemy etc., or if it has any other upgrade impact
13:46 <bauzas> tl;dr: with 2.88, we now call every compute if you call the hypervisor detailed list, and we wait on the compute's OS to return us an uptime
13:46 <dansmith> bauzas: oh yeah, I read the bug
13:52 <dansmith> bauzas: seems like the solution there _has_ to be to start reporting uptime somewhere we can persist it, and fix that to query that location instead of making the (egads) rpc call
13:53 <bauzas> something that could then be backportable to Epoxy if we can
13:53 <dansmith> I agree we can't just stop doing it, and making it only present if you don't list, or don't list enough, is also not great
13:53 <bauzas> if we can't backport it, then 2.88 will have the same issue
13:54 <sean-k-mooney> the hypervisors API is already doing an SQL query to the compute_nodes db table and getting the stats
13:54 <sean-k-mooney> we can put it in that json blob, which is entirely unversioned today
13:54 <sean-k-mooney> so if we want to backport it, we can
13:55 <dansmith> yeah, was just going to say, ComputeNode has several places we could stash it, I think
13:55 <dansmith> but it will be absent during an upgrade, so we probably need to fall back to the old behavior for nodes that aren't upgraded
13:55 <sean-k-mooney> the bit i'm currently checking is whether it's purely the time, or if we're including the load info too
13:56 <sean-k-mooney> i'm hoping it's just the former, but some of our docs imply the latter
13:57 <dansmith> doesn't much matter
13:58 <sean-k-mooney> well, it just changes what we need to stash
13:58 <sean-k-mooney> but ya, it does not matter in the long run
13:58 <sean-k-mooney> the deprecated api https://docs.openstack.org/api-ref/compute/#show-hypervisor-uptime-deprecated
13:58 <bauzas> if we can stuff the uptime value somewhere, sure
13:58 <sean-k-mooney> was the raw output of uptime
13:58 <bauzas> without needing a new DB field, cool
13:58 <sean-k-mooney> but i think in 2.88 it's just the time in seconds
13:59 <sean-k-mooney> but i just need to check that on a master devstack to confirm
13:59 <bauzas> sean-k-mooney: my concern is not with hypervisor show, again.
13:59 <sean-k-mooney> bauzas: yep, i know
13:59 <bauzas> if we continue to call the OS for hypervisor show, meh
13:59 <bauzas> even if that's bad, that's not really an issue
14:00 <bauzas> but if we are able to persist some uptime value for detailed list, then of course we could modify hypervisor show
14:01 <jkulik> I don't see anything parsing the `uptime` command output, so it seems to be indeed the full one
14:01 <opendevreview> ribaudr proposed openstack/nova stable/2024.2: Reproducer for bug 2114951  https://review.opendev.org/c/openstack/nova/+/959525
14:01 <opendevreview> ribaudr proposed openstack/nova stable/2024.2: Fix bug 2114951  https://review.opendev.org/c/openstack/nova/+/959526
14:01 <sean-k-mooney> it is, yes, i just confirmed that
14:02 <sean-k-mooney> https://paste.opendev.org/show/bJ59n0G92NoP06hDsfAu/
14:02 <jkulik> including a newline? :D
14:02 <sean-k-mooney> so i can't just return the time, but i can just stash the value in the stats and skip the rpc call if it's there
14:03 <sean-k-mooney> and ya, it also has a leading space
14:03 <dansmith> can't what?
14:04 <sean-k-mooney> sorry, i can't just retrieve the time since boot and return that value in seconds
14:04 <bauzas> dansmith: given we return the whole uptime output, we can't just calculate it
14:04 <sean-k-mooney> dansmith: jkulik suggested doing that to avoid calling the uptime command
14:04 <sean-k-mooney> in a future microversion we can; just not in the backportable fix
14:04 <dansmith> ack, sorry, I just wasn't getting that meaning from the translation
14:05 <bauzas> this is honestly super weird to have in our API
14:05 <sean-k-mooney> ya, that was a pebkac, my keyboard was not aligned so i was typing off-center
14:05 <bauzas> I don't disagree with sean-k-mooney when he says we should have stopped providing it
14:05 <bauzas> but now we have it in detailed list instead of a specific API call... so that's even worse
14:06 <dansmith> yep, just got to fix it, not the end of the world
14:08 <gibi> ^^ +1
14:09 <bauzas> dansmith: using a periodic that would cache some raw value?
14:10 <dansmith> bauzas: just use whatever periodic we use to update the metrics and/or RT, IMHO
14:12 <bauzas> I guess we would need to modify the API docs to say that the uptime value can be a bit stale
14:12 <dansmith> does the api guarantee or suggest that it's live?
14:13 <jkulik> with the call taking 40s, I guess it wasn't live already for a lot of them :D
14:13 <dansmith> jkulik: excellent point :D
14:19 <bauzas> dansmith: good point, /me looks at the existing docs
14:21 <bauzas> well, nothing is specifically said, so I guess we can just fix it silently
14:21 <dansmith> :)
14:23 <bauzas> "stats" can be a good CN object field candidate for persisting the value
14:23 <bauzas> that's a dict of strings
14:23 <bauzas> sean-k-mooney: ^
14:24 <dansmith> IIRC both stats and metrics are actually used for something, but I don't remember all the details
opendevreviewRajesh Tailor proposed openstack/nova stable/2024.2: Fix case sensitive comparison  https://review.opendev.org/c/openstack/nova/+/95953414:25
bauzasdansmith: I just remember now where stats is called14:27
sean-k-mooneybauzas: you realise i litrally said that many many times14:27
bauzasfirst good news, that's a periodically updated field using the RT update_avail_resources()14:27
sean-k-mooneydansmith: the api refe just say "The total uptime of the hypervisor and information about average load."14:28
bauzassean-k-mooney: well, then I missed the many times when you said that14:29
sean-k-mooneyand everythin else is cached at the period of the update_aviable_resouces perodic14:29
dansmithsean-k-mooney: yeah, it's unfortunate that's in our api ref at all, especially since all that is very os-specific14:30
bauzasanyway, sounds like we have a plan14:31
sean-k-mooneythat what the deprecate api says and thsi is what the hypervior deail api say ```The total uptime of the hypervisor and information about average load. Only reported for active hosts where the virt driver supports this feature```14:31
bauzaswe can fix 2.88 and drop that field in a later microversion14:31
bauzasif ops complain about that missing field, then we can tell them to use whatever tool for calling the OS by themselves14:32
sean-k-mooneybauzas: just for extra context gibi and i talk about this a few weeks ago and before that i talk to others about this too.14:32
bauzasansible, prometheus or nagios, I don't care14:32
sean-k-mooneyso this is not the first converation we have had upstream or downstream on how to fix this14:32
bauzasfew weeks ago, I was on PTO :)14:32
sean-k-mooneyyou were on pto for the prior ones14:32
bauzasso I had no context, but the good thing is that we went to the same solution, isn't that good ?14:33
sean-k-mooneyyes im aware but we were not activly fixing it because of time pressuer for FF14:33
bauzasI think this is a reasonable bugfix for RC114:33
bauzasnot RC2 of course as that's a not a Flamingo regression per say14:33
bauzasbut I know you know14:34
dansmithbauzas: ++ for rc114:34
sean-k-mooneymaybe ill see if i can fix it tomrrwo or at least ahve a draft for review14:34
dansmithif it's reasonable to do so14:34
bauzasextending the Stats class by stuffing a new keypair sounds a reasonable solution14:34
bauzasbut sean already found the solution, I leave him tell whether he's OK14:35
sean-k-mooneylets just review the patch and we can backprot as normal. if its ready for rc1 cool if not we chan cherry pick after we release 2025.214:36
dansmithsean-k-mooney: there's no patch up to review yet, right? just one in your  head?14:38
sean-k-mooneyexactly im still checking if any other virt driver ever supported this rpc14:38
sean-k-mooneybasically im not sure i want to stash a sentinel value in the stats for other virt drivers 14:39
sean-k-mooneyif i can show its only libvirt by inspecting the code i can skip that and guard the rpc call behind both the virt type and the absence of the key in the dict14:40
sean-k-mooneythat way i can avoid stashing garbage values for ironic or vmware compute nodes14:40
sean-k-mooneylooks like zvm also supports it14:45
sean-k-mooneyalthough if you're using zvm the content of the response is i think different based on the unit tests14:46
sean-k-mooneyso zvm and libvirt support it and the rest all raise NotImplemented, ok that's good to know14:48
sean-k-mooneyi will probably put the zvm fix in a separate commit but it will be effectively the same14:49
dansmithsuper expensive to make that rpc call for every hypervisor only to have them raise NotImplemented :/14:51
sean-k-mooneyyep that's why i dont want to even try :)14:53
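[editor's note] The guard being discussed — only attempting the per-host uptime RPC when the virt driver is known to implement it, and preferring a cached value — could look roughly like the sketch below. The helper name `get_uptime`, the dict shapes, and the driver set are illustrative assumptions, not the actual Nova code under review.

```python
# Hypothetical sketch of the guard discussed above: skip the expensive
# per-hypervisor RPC entirely unless the virt driver is known to
# implement get_host_uptime() (per the discussion, only libvirt and zvm do).

UPTIME_CAPABLE_DRIVERS = {"libvirt", "zvm"}  # assumption from the log above


def get_uptime(hypervisor, stats, rpc_call):
    """Return a cached uptime if present; make the RPC only when supported."""
    # Prefer a value the driver already stashed in its stats dict.
    if "uptime" in stats:
        return stats["uptime"]
    # Fall back to the RPC only for capable drivers, avoiding a
    # guaranteed NotImplementedError round trip for ironic, vmware, etc.
    if hypervisor["hypervisor_type"] in UPTIME_CAPABLE_DRIVERS:
        return rpc_call(hypervisor["host"])
    return None
```

With this shape, a hypervisor list over N incapable hosts makes zero RPC calls instead of N failing ones.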
bauzasstats are already provided by the virt driver and we stuff them in the Stats object using update_stats()14:54
bauzasso we can just return the uptime as part of the virt resources14:55
sean-k-mooneyyep thats the plan14:55
sean-k-mooneybut the uptime is meant to be the uptime of the node which is driver dependent. its not the uptime of the compute service14:56
bauzasif the driver doesn't provide that uptime value, fine, we already provide None14:56
bauzasthat's why I don't see the problem14:56
sean-k-mooneyfor ironic if we choose to implement it it would be the uptime of the ironic instance. but im not suggesting we do that14:56
opendevreviewribaudr proposed openstack/nova stable/2024.1: Reproducer for bug 2114951  https://review.opendev.org/c/openstack/nova/+/95954014:58
opendevreviewribaudr proposed openstack/nova stable/2024.1: Fix bug 2114951  https://review.opendev.org/c/openstack/nova/+/95954114:58
bauzasI don't see the problem14:58
bauzas(sorry)14:58
bauzaseach of the virt drivers can have get_host_uptime() defined14:59
bauzasthe simple fix seems to me to be calling get_host_uptime(), if defined, directly in the virt driver method that returns the resources14:59
bauzasand we could later remove the RPC API and make that method private15:00
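[editor's note] A rough sketch of the stats-based approach outlined here — the driver reporting uptime alongside its other resources during the periodic update, so the API can serve it without a live RPC. Class and method shapes are illustrative stand-ins, not the real Nova driver interfaces:

```python
# Illustrative sketch (not actual Nova code) of reporting uptime through
# the periodically-collected resources instead of a per-request RPC call.


class FakeLibvirtDriver:
    """Stands in for a virt driver that implements get_host_uptime()."""

    def get_host_uptime(self):
        return "10:01:16 up 1:36, 6 users, load average: 0.21, 0.16, 0.19"

    def get_available_resource(self):
        resources = {"vcpus": 8, "memory_mb": 16384, "stats": {}}
        # Stash uptime in stats during the periodic resource update, so
        # the API layer can read it from the DB with no per-host RPC.
        try:
            resources["stats"]["uptime"] = self.get_host_uptime()
        except NotImplementedError:
            pass  # drivers without uptime just omit the key -> None in the API
        return resources
```

Drivers that raise NotImplementedError simply never set the key, which matches the API already documenting the field as only reported where the driver supports it.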
dansmith_is_ there a problem?15:00
bauzasI dunno hence my question15:01
bauzasAFAICS, only zvm and libvirt do implement the method15:02
bauzasvmware is not providing it15:02
dansmithright, so no problem15:02
dansmithhonestly,15:02
dansmiththe API also provides for =None for all these drivers15:02
dansmithand the "does this specific driver provide this data" couldn't possibly be part of the API contract, IMHO15:03
dansmithlike, if we started remotely managing libvirt nodes from a central one, we wouldn't be able to provide this anymore15:03
bauzasthat API field has to be gone IMHO15:03
bauzasbut for the moment, let's fix the performance issue15:03
dansmithI'm not saying we shouldn't just solve this the way we've prescribed here, but... :)15:03
bauzasin a cloud world, I'm still surprised to see that we continue to provide hypervisor specific fields 15:04
sean-k-mooneybauzas: well https://review.opendev.org/c/openstack/nova-specs/+/765797/comment/8926d415_9c5a7022/ we all said we would prefer to remove it but you wanted to discuss it with operators first.15:09
sean-k-mooneywe tried to remove it once before and we decided to reverse course because people asked to keep it15:10
sean-k-mooneythat does not mean we cant revisit that and fix it going forward15:10
sean-k-mooneybut we should not rewrite history15:10
bauzasLOL15:11
bauzasI love when something pops up out of context :)15:11
bauzas1/ I was only talking about hypervisor show15:11
bauzas2/ I was just trying to explain that we should remove that but by another spec, not that one15:11
sean-k-mooneywell that was update 3 to the spec15:12
bauzasand AFAIK, the spec you mentioned wasn't the one we had for 2.88, right?15:12
sean-k-mooneywe originally agreed to remove it then we updated it after the fact to re-add it15:12
sean-k-mooneyit was the spec for 2.88 it just went through many iterations 15:12
sean-k-mooneyyou can see the iterations if you really care https://github.com/openstack/nova-specs/commits/abfb91875c7d6b1e2df45d5bc2244956bc662884/specs/wallaby/approved/modernize-os-hypervisors-api.rst15:14
sean-k-mooneyi always remember this spec because stephen used a 💩 emoji15:16
gmaan++, I agree to continue showing it in the same API instead of a separate one (a separate API, even in the old API, would be a microversion bump). reporting the cached value is a better solution and backportable. 16:33
opendevreviewLajos Katona proposed openstack/nova master: Use SDK for Neutron networks  https://review.opendev.org/c/openstack/nova/+/92802216:46
opendevreviewSylvain Bauza proposed openstack/nova master: change uptime in hypervisor to be provided async using stats  https://review.opendev.org/c/openstack/nova/+/95957117:31
bauzassean-k-mooney: gmaan: dansmith: I made a quick POC 'that works on my devstack'17:32
bauzasjkulik too : https://review.opendev.org/c/openstack/nova/+/95957117:33
sean-k-mooneyi have a slightly different quick poc partly done17:59
opendevreviewsean mooney proposed openstack/nova master: hypervisors: Optimize uptime retrieval for better performance  https://review.opendev.org/c/openstack/nova/+/95960420:45

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!