| *** mhen_ is now known as mhen | 02:04 | |
| *** bbezak_ is now known as bbezak | 06:22 | |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2025.1: Fix 'nova-manage image_property set' command https://review.opendev.org/c/openstack/nova/+/958834 | 06:53 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: Reproduce bug/2115905 https://review.opendev.org/c/openstack/nova/+/954336 | 08:33 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: [PCI tracker]Remove non configured devs when freed https://review.opendev.org/c/openstack/nova/+/954613 | 08:33 |
| opendevreview | Balazs Gibizer proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec https://review.opendev.org/c/openstack/nova/+/954149 | 08:33 |
| gibi | sean-k-mooney: ^^ I fixed the commit messages and pulled out two constants for the log message asserts | 08:34 |
| sean-k-mooney[m] | cool, I'll review it after I grab a coffee and start my day | 08:36 |
| mmessah | Hi all, `/v2.1/os-hypervisors/detail` requests take too long after upgrading to Epoxy. I saw a similar issue for server list, https://bugs.launchpad.net/nova/+bug/2106380, but we don't have the issue for server list. We tested manually with curl. With OpenStack-API-Version 2.88 and above, we wait about 20 seconds; with 2.87, the result is returned immediately. Is anyone else experiencing this? | 09:01 |
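
A minimal sketch of the timing comparison mmessah describes, assuming a reachable Nova endpoint and a valid Keystone token (both placeholders here); the `OpenStack-API-Version` header selects the compute microversion:

```python
import time
import requests

# Placeholders: substitute your own endpoint and Keystone token.
NOVA = "http://controller:8774/v2.1"
HEADERS_BASE = {"X-Auth-Token": "<keystone-token>"}

for version in ("2.87", "2.88"):
    headers = dict(HEADERS_BASE, **{"OpenStack-API-Version": f"compute {version}"})
    start = time.monotonic()
    resp = requests.get(f"{NOVA}/os-hypervisors/detail", headers=headers)
    print(f"{version}: HTTP {resp.status_code} in {time.monotonic() - start:.1f}s")
```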
| jkulik | 2.88 added the "uptime" field into the hypervisor details. we experienced slowness with that, too ... and thus added a config setting downstream to patch it out: https://github.com/sapcc/nova/commit/686f44c4c65eba6c851c852804e82a3e00309945 | 09:06 |
| frickler | jkulik: interesting, that looks like a different source of slowness to me, did you ever try to upstream your patch? it does seem pretty reasonable to me (caveat I'm no nova developer, just another deployer) | 09:11 |
| jkulik | frickler: no, it stems from a time with less upstream-involvement. especially after an upgrade, downstream times are quite ... stressful :) I'd be happy to propose that, though | 09:12 |
| masahito | we hit the same issue from v2.88. calling the RPC to all hypervisors serially causes slow os-hypervisors/detail API responses. | 09:13 |
| masahito | large-scale deployments hit the same issue. we just wait a few minutes in the 10K-HV case, or use limit and marker to split up the API calls. | 09:15 |
| frickler | yes, iterating over all hypervisors for an API call (assuming there's no caching, didn't check the code) sounds pretty much unscalable | 09:19 |
| jkulik | hm ... Isn't there some policy that the API response shouldn't change depending on operator configs? our downstream patch would break that policy. | 09:26 |
| jkulik | api-ref says "Only reported for active hosts where the virt driver supports this feature." so ... maybe it's still fine, because clients would have to handle those cases anyways? | 09:28 |
| bauzas | hey folks, do we have any upstream tracker for that bug? | 09:34 |
| bauzas | just looked over the code and yeah, I'm afraid this is indeed a performance issue | 09:34 |
| bauzas | when calling _view_hypervisor(), we then call the host API for each of the hypervisors | 09:35 |
| bauzas | gibi: ^ | 09:35 |
| jkulik | I haven't found one. Also no review-request. | 09:35 |
| bauzas | https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L100-L101 | 09:36 |
| opendevreview | Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API https://review.opendev.org/c/openstack/nova/+/959420 | 09:40 |
| jkulik | ready to work on making ^ better. I guess it needs a release note at least? but before that, I'd like some feedback on whether this approach should be pursued | 09:42 |
| bauzas | eek, https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/virt/libvirt/driver.py#L12319-L12322 | 09:42 |
| bauzas | so now when gathering statistics, we make N external process calls | 09:43 |
| bauzas | jkulik: I need to review your patch but we can't have API behaviour changed by configuration | 09:43 |
| jkulik | yeah, figured. it's already behaving like that for a subset of the HVs, though | 09:44 |
| bauzas | hmmm, unfortunately I need to leave in the next few minutes | 09:45 |
| bauzas | jkulik: where do we conditionally change the API behaviour ? | 09:46 |
| bauzas | I mean, already | 09:46 |
| jkulik | depending on whether the calls succeed or raise NotImplemented, we return None instead of an actual uptime string | 09:46 |
| jkulik | so this patch changes to "always None" | 09:46 |
| bauzas | atm, if we return None, this is not because we change the configuration, right? this is just whether the call works or not | 09:48 |
| bauzas | the concern with a configurable API is about interop | 09:48 |
| jkulik | yes. for some hypervisors, this call never works. api-ref states for "uptime": "Only reported for active hosts where the virt driver supports this feature." | 09:48 |
| jkulik | so imho, a client cannot rely on this being set | 09:49 |
| * bauzas needs to leave :( | 09:49 | |
| stephenfin | jkulik: You're focused on the vmware driver, right? And I assume the noise you're talking about is the tracebacks in nova-compute? Couldn't you just implement the method for that driver but return None? | 09:57 |
| stephenfin | That could likely be backported too | 09:57 |
| jkulik | sure. doesn't bring any speedup at all though, so the _actual_ issue of os-hypervisor/detail being slow doesn't go away | 10:01 |
| opendevreview | sean mooney proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec https://review.opendev.org/c/openstack/nova/+/954149 | 10:05 |
| opendevreview | Johannes Kulik proposed openstack/nova master: Conditionally disable hypervisor uptime in details API https://review.opendev.org/c/openstack/nova/+/959420 | 10:19 |
| sean-k-mooney | jkulik: yes, we have a general rule against config-driven API behaviour | 10:34 |
| sean-k-mooney | jkulik: also, the uptime performance issue is known | 10:34 |
| sean-k-mooney | jkulik: the way we plan to fix this is to add it to the stats in the compute node db table and read it from there instead of doing the RPC | 10:35 |
| sean-k-mooney | so in the future you will get a cached value but won't have the performance hit | 10:35 |
| sean-k-mooney | it will be in the data retrieved with the other compute node values in the same SQL query | 10:35 |
| sean-k-mooney | jkulik: gibi and I debugged this a month or two ago but didn't have time to work on proposing the fix | 10:36 |
| sean-k-mooney | basically the plan to make this "just work" across upgrades is to make the RPC conditional and only fetch the uptime if it's not in the stats blob we get from the DB | 10:38 |
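
A minimal sketch of the read path sean-k-mooney describes, with hypothetical names (the real change would live in the os-hypervisors API code); it prefers the cached stats value and only falls back to the RPC for not-yet-upgraded computes:

```python
from typing import Any, Optional

def uptime_for(context: Any, stats: dict, host: str, host_api: Any) -> Optional[str]:
    # Prefer the value cached in the compute node's unversioned stats blob;
    # it arrives in the same SQL query as the rest of the node data.
    cached = stats.get('uptime')
    if cached is not None:
        return cached
    # Fall back to the per-host RPC only for computes that were not yet
    # upgraded and so don't publish uptime into their stats.
    try:
        return host_api.get_host_uptime(context, host)
    except NotImplementedError:
        return None
```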
| jkulik | sean-k-mooney: thanks for sharing. I'll abandon my RR then. | 10:55 |
| jkulik | sean-k-mooney: so this means all drivers that don't currently implement get_host_uptime() should at least put a None value into the stats, so we don't still hit the performance problem, right? | 10:55 |
| sean-k-mooney | yep, that or we could just remove the fallback entirely | 11:02 |
| sean-k-mooney | jkulik: we briefly discussed just removing uptime from the API in a new microversion | 11:02 |
| sean-k-mooney | we kind of wanted to get wider input on that before moving forward however | 11:03 |
| sean-k-mooney | jkulik: we originally wanted to remove uptime in the original spec but some operators wanted us to keep it, so we did | 11:03 |
| sean-k-mooney | but im not sure that was the right choice | 11:03 |
| sean-k-mooney | jkulik: does vmware support uptime at all? | 11:04 |
| sean-k-mooney | jkulik: we could also make the fallback to RPC conditional on whether the driver supports it | 11:04 |
| jkulik | sean-k-mooney: no, not at all, because in VMware we have a whole cluster as one hypervisor in Nova and each node in the cluster has a different uptime | 11:05 |
| sean-k-mooney | and then just exclude vmware if it does not | 11:05 |
| jkulik | we're also introducing libvirt, so I'd still like the detail call to become fast ;) | 11:05 |
| sean-k-mooney | right, so we know the hypervisor type and could limit the fallback to just libvirt (qemu) | 11:05 |
| sean-k-mooney | so for all other drivers we could return None | 11:06 |
| sean-k-mooney | if not in stats | 11:06 |
| jkulik | ... or patch everyone to return None, which would be fine for me, too | 11:07 |
| sean-k-mooney | well, again, we proposed removing uptime when we removed the hypervisor statistics API | 11:08 |
| sean-k-mooney | and some folks objected, so we can't actually do that without a new microversion | 11:08 |
| sean-k-mooney | i.e. it would break our API rules to just stop returning it without a new microversion | 11:08 |
| sean-k-mooney | but instead of a config option https://review.opendev.org/c/openstack/nova/+/959420/2/nova/api/openstack/compute/hypervisors.py#105 | 11:09 |
| sean-k-mooney | you could have checked whether the hypervisor type was qemu and only fetched the uptime if it was | 11:09 |
| sean-k-mooney | eventually we also want to stash it in the stats to make libvirt fast, but that would fix vmware for now | 11:10 |
| jkulik | I mean, we have the patch downstream. there's no pressure from our side. I'd go with the proper "put in stats" solution. | 11:10 |
| sean-k-mooney | ack. we may want to put this on the PTG agenda so we don't forget, or have a dedicated bug for it | 11:11 |
| sean-k-mooney | https://bugs.launchpad.net/nova/+bug/2106380 is about server list, not the hypervisors API, so we should avoid reusing it | 11:12 |
| sean-k-mooney | with that said, I wonder if gibi already filed an upstream bug | 11:12 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix 'nova-manage image_property set' command https://review.opendev.org/c/openstack/nova/+/959429 | 11:38 |
| * gibi reads back | 11:39 | |
| gibi | bauzas: jkulik: the uptime issue rings a bell for me but I think it was only reported downstream. I agree that this needs to be fixed | 11:41 |
| gibi | until then the workaround is to use an older microversion | 11:42 |
| gibi | jkulik: yeah, a config-driven API is a no-no. But we can revisit why we added the uptime to the main GET call, or we can try to speed it up via scatter-gather, or reverse the relationship and push the uptime data to the DB periodically from the compute so reading it from the API will be quick | 11:43 |
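
To make the cost concrete, here is a toy simulation of the serial fan-out versus the scatter-gather option gibi mentions; the host count and the 20 ms per-RPC latency are assumptions for illustration, not measurements:

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_HOSTS = 1000
RPC_ROUND_TRIP = 0.02  # assumed seconds per get_host_uptime() round trip

def fake_rpc(host: str) -> str:
    time.sleep(RPC_ROUND_TRIP)  # stand-in for the RPC plus the 'uptime' exec
    return f"{host}: up 42 days"

hosts = [f"compute-{i}" for i in range(NUM_HOSTS)]

start = time.monotonic()
for h in hosts:  # today's behaviour: one call per hypervisor, serially
    fake_rpc(h)
print(f"serial: {time.monotonic() - start:.1f}s")          # ~20s

start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as ex:  # the scatter-gather idea
    list(ex.map(fake_rpc, hosts))
print(f"scatter-gather: {time.monotonic() - start:.1f}s")  # ~0.4s
```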
| gibi | hm, the latter rings another bell: I think this was dansmith's idea | 11:43 |
| gibi | so we might have discussed this upstream already as well | 11:43 |
| gibi | aand there is another semi-related thread: sending that uptime RPC to computes that are down, but not yet marked as down in the DB, is another source of slowness | 11:48 |
| jkulik | I'd go with the "sending to the DB" - not sure how to solve that for computes being down, though. | 11:50 |
| gibi | jkulik: we already have the periodic heartbeat to know when a compute is down, so we can use that info to avoid reporting stale uptime data on the API | 11:51 |
| gibi | I did not find this uptime discussion in my IRC logs so this probably happened downstream. I will file a bug right away summarizing both problems and the solution suggestion | 12:07 |
| gibi | aand I found the downstream report https://issues.redhat.com/browse/OSPRH-17563 so we probably have motivation to try to fix it :) | 12:16 |
| gibi | filed the bug https://bugs.launchpad.net/nova/+bug/2122036 | 12:23 |
| gibi | mmessah: ^^ | 12:25 |
| mmessah | Thanks for discussing the situation so quickly :) I hope it will be resolved as soon as possible. We recently upgraded our test environment to Epoxy. We will continue to test all services. If we hit anything, we will share it on IRC. | 12:39 |
| gibi | mmessah: thanks | 12:41 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix case-sensitivity for metadata keys https://review.opendev.org/c/openstack/nova/+/959435 | 12:42 |
| gibi | dansmith: sean-k-mooney: when you are referring to "large-ops" job from the past (in relation to scale testing eventlet removal) do you mean https://review.opendev.org/c/openstack/project-config/+/278978 ? | 12:49 |
| sean-k-mooney | I was not referring to that because I did not know it existed | 12:50 |
| sean-k-mooney | but maybe dan was | 12:50 |
| gibi | sean-k-mooney: ahh, sorry, I did not remember where this input was coming from :) | 12:51 |
| sean-k-mooney | gibi: I had just mentioned that kolla and devstack can fake many compute nodes per physical host | 12:51 |
| sean-k-mooney | my experience of this was as part of the OSIC lab scale testing | 12:51 |
| sean-k-mooney | where Intel and Rackspace gave access to, I think, 500? bare-metal hosts, half x86 and the rest ppc | 12:52 |
| sean-k-mooney | for the community to do some scale testing | 12:52 |
| sean-k-mooney | gibi: I wonder if the rally jobs have any existing usage of the fake agents | 12:54 |
| sean-k-mooney | or if the tempest jobs in rally itself are using real agents | 12:54 |
| gibi | yeah rally is also on my radar to look into, especially neutron's jobs | 12:55 |
| sean-k-mooney | so rally and browbeat are the two scaling frameworks we have | 12:55 |
| sean-k-mooney | at least those are the two I am vaguely aware of | 12:56 |
| gibi | yeah, I never saw a real browbeat scenario but heard the name before | 12:56 |
| sean-k-mooney | it might be a wrapper around rally, looking at the repo | 12:57 |
| sean-k-mooney | https://opendev.org/x/browbeat/ | 12:57 |
| sean-k-mooney | it has no CI jobs so that won't help us | 12:58 |
| bauzas | gibi: sean-k-mooney: well, I think we should quickly remove the uptime call IMHO | 12:58 |
| gibi | ohh, browbeat refers to "generate_tripleo_inventory" | 12:58 |
| gibi | that puts me off :) | 12:58 |
| gibi | bauzas: in a new microversion? in a non backportable way? | 12:58 |
| bauzas | if you look at how this works, we call the driver for every compute, and then, for example, the driver directly calls the OS | 12:58 |
| bauzas | so when you have 10000 hypervisors, each of them calls the OS directly | 12:59 |
| sean-k-mooney | looks like rally does not have any devstack jobs https://github.com/openstack/rally/blob/master/.zuul.d/zuul.yaml although I'm not sure what rally's tox-functional does | 12:59 |
| bauzas | gibi: I don't know how to fix this, I'd like to discuss that with gmaan and dansmith | 12:59 |
| sean-k-mooney | bauzas: that would require an API microversion | 12:59 |
| bauzas | because when I see https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L12319-L12322 I'm pretty done | 13:00 |
| bauzas | sean-k-mooney: well, if that's due to a bug, maybe not | 13:00 |
| sean-k-mooney | bauzas: it's not | 13:00 |
| sean-k-mooney | stephen wanted to remove it in the spec | 13:00 |
| bauzas | like, HTTP 500 -> HTTP 200 doesn't need a new microversion | 13:00 |
| sean-k-mooney | and we told them no | 13:00 |
| sean-k-mooney | because ops asked for it to be kept | 13:01 |
| bauzas | here we don't have an HTTP 500, but it can take more than a minute to get a response | 13:01 |
| sean-k-mooney | that does not mean we can't make it better | 13:01 |
| gibi | HTTP 500 -> HTTP 200 is only acceptable as a bugfix in my eyes if it still provides the data that the non-500 case provides | 13:02 |
| sean-k-mooney | https://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/modernize-os-hypervisors-api.html#problem-description | 13:02 |
| bauzas | I saw it | 13:02 |
| bauzas | already and I saw the implementation change too | 13:02 |
| sean-k-mooney | yes, this was done years ago; to be clear, it's not recent | 13:02 |
| sean-k-mooney | it has been this way since wallaby | 13:02 |
| gibi | I agree that we need to fix it but maybe we can fix it in a way that preserve the functionality without the performance hit | 13:02 |
| bauzas | but the problem is now we call the OS for all of the hosts automatically | 13:02 |
| bauzas | instead of a specific API | 13:03 |
| sean-k-mooney | you opt into it with the microversion | 13:03 |
| sean-k-mooney | we don't if you use the old one | 13:03 |
| bauzas | sure | 13:03 |
| sean-k-mooney | but there is a way to achieve both | 13:03 |
| sean-k-mooney | first, for any host that is not using the qemu virt_type | 13:03 |
| sean-k-mooney | we can skip the call entirely | 13:03 |
| bauzas | but atm, 2.88 has a performance problem, right? | 13:03 |
| sean-k-mooney | and return None | 13:03 |
| sean-k-mooney | and then we can start adding the uptime to the compute node stats | 13:04 |
| sean-k-mooney | and if it's in the stats we can skip it even for libvirt | 13:04 |
| gibi | (ahh the idea was from sean-k-mooney not from dansmith sorry for the misattribution above) | 13:04 |
| sean-k-mooney | finally, if we want to remove it from the API entirely, we can do that in a new microversion | 13:04 |
| jkulik | instead of having nova-compute call out into the OS on every HV, we could add it once on startup and add a timestamp to it. the API call can then just add `time.time() - timestamp` to the originally reported uptime :) | 13:05 |
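
A minimal sketch of jkulik's suggestion (names hypothetical; using time.monotonic() rather than time.time() here to be safe against wall-clock jumps):

```python
import time

class UptimeCache:
    """Read the host uptime once at agent startup; afterwards derive the
    current value arithmetically instead of execing 'uptime' per request."""

    def __init__(self, initial_uptime_seconds: float):
        self._initial = initial_uptime_seconds  # read once, e.g. at startup
        self._started = time.monotonic()

    def current(self) -> float:
        # originally reported uptime + time elapsed since the agent started
        return self._initial + (time.monotonic() - self._started)
```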
| sean-k-mooney | so we have 2 backportable fixes (only do the RPC if the driver supports it, and cache the value in the stats generated via update_available_resource) and one non-backportable fix removing it in a new microversion | 13:05 |
| sean-k-mooney | jkulik: well we can also do that too yes | 13:06 |
| sean-k-mooney | but that would be different | 13:06 |
| jkulik | then we don't need a regular task | 13:06 |
| jkulik | and also no rpc call | 13:06 |
| sean-k-mooney | we check the host so the value does not go backwards if you restart the agent on the host | 13:06 |
| sean-k-mooney | jkulik: we have a periodic that is updating the stats already | 13:06 |
| sean-k-mooney | we would just be making it add one more key to the JSON blob | 13:07 |
| sean-k-mooney | so that's pretty much free | 13:07 |
| jkulik | but is it calling out to a process every time? sounded to me like someone didn't like that very much | 13:07 |
| sean-k-mooney | if we want to change the meaning to be the compute agent uptime instead of the host uptime, we could do the time.time() - timestamp change as well | 13:07 |
| jkulik | out, err = processutils.execute('env', 'LANG=C', 'uptime') | 13:07 |
| sean-k-mooney | well, we can get that info from libvirt too, I think, or there are other ways to do it than calling the uptime command | 13:08 |
| jkulik | we call that on agent startup and add the time since agent startup -> meaning stays the same | 13:08 |
| sean-k-mooney | that's not what uptime is | 13:08 |
| sean-k-mooney | the uptime is not the time since the agent started | 13:08 |
| sean-k-mooney | it's the uptime of the host | 13:08 |
| sean-k-mooney | and if you restart nova-compute without a reboot those are not the same | 13:09 |
| jkulik | yes. hence we add the agent-uptime to the initially-read host uptime | 13:09 |
| sean-k-mooney | or we just don't provide this in the future, since you should not be getting this from nova at all | 13:09 |
| jkulik | fine by me :D | 13:09 |
| sean-k-mooney | in wallaby we really wanted to remove this from nova, but ops wanted to keep it instead of pulling it from a monitoring tool like prometheus or collectd | 13:10 |
| sean-k-mooney | the compromise is we can provide a cached value from the existing periodic, which is basically free | 13:10 |
| sean-k-mooney | for older microversions, and delete this in a future one | 13:10 |
| jkulik | sounded like bauzas wouldn't like calling `uptime` in the periodic on all hosts all the time. so we'd also need a cheaper replacement for that. | 13:12 |
| sean-k-mooney | the important thing is to fix the performance regression. we could also choose, in a new microversion, to make retrieval of the uptime controllable by a query arg and default it off, but that feels overly complicated | 13:12 |
| sean-k-mooney | jkulik: well, we can get that from libvirt as well, I'm pretty sure | 13:12 |
| sean-k-mooney | but that is an entirely separate issue | 13:12 |
| sean-k-mooney | nova has always done that | 13:12 |
| sean-k-mooney | it was not a regression introduced by 2.88 | 13:13 |
| jkulik | periodically? only on request, right? and we're going to change how often Nova calls that, if we move it to the periodic task | 13:13 |
| jkulik | it's probably™ not much overhead, so we could also fix that later if we see it being a problem. | 13:13 |
| sean-k-mooney | jkulik: my suggestion is to only update it when the update_available_resource periodic runs | 13:14 |
| sean-k-mooney | which is every 5 mins | 13:14 |
| sean-k-mooney | we could stash the start time of the agent in the stats instead | 13:14 |
| bauzas | sean-k-mooney: maybe I'm not right, but https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/hypervisors.py#L206 calls _view_hypervisor for each of the hypervisors in the list | 13:15 |
| sean-k-mooney | and compute the uptime in the API | 13:15 |
| bauzas | then in https://github.com/openstack/nova/blob/9f156aa954fb57eeaf407cc13a0f782b8af1a5d3/nova/api/openstack/compute/hypervisors.py#L97-L101 we unconditionally call the compute API to get the uptime | 13:15 |
| sean-k-mooney | bauzas: the RPC happens if you do a detail list or, I believe, hypervisor show with 2.88+ | 13:16 |
| bauzas | not only show right? | 13:16 |
| bauzas | that's my point | 13:16 |
| sean-k-mooney | right, it's in detail list as well | 13:16 |
| sean-k-mooney | which is what was approved in the spec | 13:16 |
| bauzas | so for a detailed list of HVs using 2.88, we then call every compute for getting the uptime, right? | 13:17 |
| sean-k-mooney | yep | 13:17 |
| bauzas | each of those computes then eventually calls the driver, and in the case of libvirt, it execs an OS process to get the uptime | 13:18 |
| sean-k-mooney | yep | 13:18 |
| sean-k-mooney | no one is saying this is good or scalable | 13:18 |
| sean-k-mooney | but it's not an API bug | 13:18 |
| bauzas | so that means that if I have a list of 1000 computes all of them running libvirt, then I'd get 1000 RPC calls and each of them would access the OS before returning | 13:18 |
| bauzas | that's what I call not scalable, per se | 13:18 |
| gibi | (I think we all agree that it is bad) | 13:19 |
| sean-k-mooney | right, we could also speed it up with scatter-gather, but unless we want to break our API contract we can't fix it by just not including it | 13:19 |
| gibi | we cannot just rip out of 2.88 something that works for 10 computes but not for 1000 | 13:19 |
| bauzas | previously, in 2.87, we were not making N RPC calls; that's why I can't say this is not a regression | 13:19 |
| bauzas | if you have to opt in, for sure | 13:20 |
| bauzas | MHO is that either we declare 2.88 unusable and fix it in another microversion, or we somehow try to fix 2.88 itself | 13:21 |
| bauzas | my preference would be the latter for downstream reasons :) | 13:21 |
| sean-k-mooney | bauzas: look, I'm just going to go fix this because I'm -2 on just ripping it out | 13:21 |
| bauzas | sean-k-mooney: I'm just waiting for gmaan and dansmith | 13:22 |
| sean-k-mooney | I'll file a bug and work on it this/next week | 13:22 |
| dansmith | sorry, I'm on another call | 13:22 |
| bauzas | there are multiple ways to address that, but I'd prefer that we, as a team, find a solution we all agree on | 13:22 |
| bauzas | before even working on a fix | 13:23 |
| jkulik | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/2122036 bug is filed already | 13:23 |
| bauzas | dansmith: doh, I should attend that call too | 13:23 |
| sean-k-mooney | jkulik: ah perfect thanks gibi | 13:23 |
| sean-k-mooney | oh right | 13:23 |
| * sean-k-mooney well, it's almost over so no point joining now | 13:24 | |
| sean-k-mooney | jkulik: we already depend on psutil in nova, so I can do time.time() - psutil.boot_time() and cache it on the first invocation | 13:25 |
| sean-k-mooney | well, I can cache the boot_time part | 13:25 |
| jkulik | nice | 13:25 |
| gibi | sean-k-mooney: btw, devstack with the fake driver works like a charm. I just rebuilt my two-VM devstack with 5 fake computes in each VM | 13:26 |
| sean-k-mooney | gibi: cool, it's been a very long time since I used it, so I'm glad it still works | 13:27 |
| opendevreview | ribaudr proposed openstack/nova stable/2025.1: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959445 | 13:28 |
| opendevreview | ribaudr proposed openstack/nova stable/2025.1: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959446 | 13:28 |
| sean-k-mooney | apparently time.clock_gettime(time.CLOCK_BOOTTIME) is also a thing, without the psutil dep, on py 3.7+ | 13:29 |
| sean-k-mooney | we don't have many deps on psutil, so if we can avoid adding more that would be good | 13:30 |
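
The two ways of getting seconds-since-boot that sean-k-mooney mentions, side by side; psutil is a third-party package (already a nova dependency, per the discussion), and CLOCK_BOOTTIME is Linux-specific and needs Python 3.7+:

```python
import time
import psutil

# Option 1: wall-clock now minus the absolute boot timestamp.
uptime_psutil = time.time() - psutil.boot_time()

# Option 2: stdlib only; CLOCK_BOOTTIME counts seconds since boot,
# including time spent suspended.
uptime_clock = time.clock_gettime(time.CLOCK_BOOTTIME)

print(f"psutil: {uptime_psutil:.0f}s, clock_gettime: {uptime_clock:.0f}s")
```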
| gibi | (and just booted 200 fake VMs successfully) | 13:34 |
| sean-k-mooney | did the fake networking agent work, by the way, or did you do that without networking? | 13:34 |
| sean-k-mooney | ironic has a fake driver too but I'm not sure about cinder | 13:35 |
| sean-k-mooney | glance kind of doesn't need one, as you can just have tiny or even 0-byte images | 13:36 |
| gibi | just used --nic none | 13:36 |
| sean-k-mooney | same with keystone and placement; they just talk to a DB | 13:36 |
| sean-k-mooney | although they don't use eventlet, so keystone and placement are not really relevant to testing nova at scale | 13:37 |
| sean-k-mooney | at least not in this context | 13:37 |
| jkulik | regarding scale: we have a patch downstream that makes the server-group listing much faster with lots of server-groups around (e.g. when an admin lists them all). We changed the DB queries to actually do the pagination instead of doing it after fetching _all_ DB objects. | 13:43 |
| jkulik | Would that be something I can propose just like that - with a bug-report - or does that need some special process? | 13:43 |
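
A hedged sketch of the kind of change jkulik describes, pushing pagination into the query instead of slicing after fetching everything; the table and column names here are illustrative, not nova's actual schema:

```python
import sqlalchemy as sa

def list_server_groups(session, project_id: str, limit: int, offset: int):
    # Before (the slow pattern): fetch every row, then paginate in Python:
    #   rows = session.execute(sa.select(sg)).all()
    #   return rows[offset:offset + limit]
    #
    # After: let the database do the pagination.
    sg = sa.table('instance_groups',
                  sa.column('id'), sa.column('project_id'), sa.column('name'))
    query = (
        sa.select(sg)
        .where(sg.c.project_id == project_id)
        .order_by(sg.c.id)
        .limit(limit)
        .offset(offset)
    )
    return session.execute(query).all()
```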
| dansmith | bauzas: what do you want me to look at? | 13:45 |
| sean-k-mooney | jkulik: yes for master; whether it would be backportable depends on the actual change | 13:45 |
| bauzas | nothing specific to look at, just being around to agree on a solution | 13:45 |
| jkulik | sean-k-mooney: I don't need backports, as we have it running. would just be nice to share if possible :) | 13:46 |
| sean-k-mooney | jkulik: it's always valid to report performance bugs and fix them on master. whether it's backportable will depend on whether you're using any features from newer SQLAlchemy etc., or whether it has any other upgrade impact | 13:46 |
| bauzas | tl;dr: since 2.88, a detailed hypervisor list calls every compute, and we wait on the compute's OS to return an uptime | 13:46 |
| dansmith | bauzas: oh yeah I read the bug | 13:46 |
| dansmith | bauzas: seems like the solution there _has_ to be to start reporting uptime somewhere we can persist it and fix that to query that location instead of making the (egads) rpc call | 13:52 |
| bauzas | something that could then be backported to Epoxy, if we can | 13:53 |
| dansmith | I agree we can't just stop doing it, and making it only present if you don't list, or don't list enough is also not great | 13:53 |
| bauzas | if we can't backport it, then 2.88 will have the same issue | 13:53 |
| sean-k-mooney | the hypervisor API is already doing an SQL query to the compute nodes DB table and getting the stats | 13:54 |
| sean-k-mooney | we can put it in that JSON blob, which is entirely unversioned today | 13:54 |
| sean-k-mooney | so if we want to backport it we can | 13:54 |
| dansmith | yeah was just going to say, ComputeNode has several places we could stash it I think | 13:55 |
| dansmith | but, it will be absent during an upgrade so we probably need to fall back to the old behavior for nodes that aren't upgraded | 13:55 |
| sean-k-mooney | the bit I'm currently checking is whether it's purely the time or we're including the load info too | 13:55 |
| sean-k-mooney | I'm hoping it's just the former, but some of our docs imply the latter | 13:56 |
| dansmith | doesn't much matter | 13:57 |
| sean-k-mooney | well, it just changes what we need to stash | 13:58 |
| sean-k-mooney | but yeah, it doesn't matter in the long run | 13:58 |
| sean-k-mooney | the deprecated API https://docs.openstack.org/api-ref/compute/#show-hypervisor-uptime-deprecated | 13:58 |
| bauzas | if we can stuff the uptime value somewhere, sure | 13:58 |
| sean-k-mooney | was the raw output of uptime | 13:58 |
| bauzas | without needing to have a new DB field, cool | 13:58 |
| sean-k-mooney | but I think in 2.88 it's just the time in seconds | 13:58 |
| sean-k-mooney | but I just need to check that on a master devstack to confirm | 13:59 |
| bauzas | sean-k-mooney: my concern is not with hypervisor show, again. | 13:59 |
| sean-k-mooney | bauzas: yep, I know | 13:59 |
| bauzas | if we continue to call the OS for hypervisor show, meh | 13:59 |
| bauzas | even if that's bad, that's not really an issue | 13:59 |
| bauzas | but, if we are able to persist some uptime value for detailed-list, then of course we could modify hypervisor show | 14:00 |
| jkulik | I don't see anything parsing the `uptime` command output, so it seems to be indeed the full one | 14:01 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.2: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959525 | 14:01 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.2: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959526 | 14:01 |
| sean-k-mooney | it is, yes, I just confirmed that | 14:01 |
| sean-k-mooney | https://paste.opendev.org/show/bJ59n0G92NoP06hDsfAu/ | 14:02 |
| jkulik | including a newline? :D | 14:02 |
| sean-k-mooney | so I can't just return the time, but I can stash the value in the stats and skip the RPC call if it's there | 14:02 |
| sean-k-mooney | and yeah, it also has a leading space | 14:03 |
| dansmith | can't what? | 14:03 |
| sean-k-mooney | sorry, I can't just retrieve the time since boot and return that value in seconds | 14:04 |
| bauzas | dansmith: given we return the whole uptime, we can't just calculate it | 14:04 |
| sean-k-mooney | dansmith: jkulik suggested doing that to avoid calling the uptime command | 14:04 |
| sean-k-mooney | in a future microversion we can, just not in the backportable fix | 14:04 |
| dansmith | ack, sorry I just wasn't getting that meaning from the translation | 14:04 |
| bauzas | this is honestly super weird to have that in our API | 14:05 |
| sean-k-mooney | yeah, that was a PEBKAC; my keyboard was not aligned so I was typing off-center | 14:05 |
| bauzas | I don't disagree with sean-k-mooney when he said we should have stopped providing it | 14:05 |
| bauzas | but now we have it in detailed-list instead of a specific API call... so that's even worse | 14:05 |
| dansmith | yep, just got to fix it, not the end of the world | 14:06 |
| gibi | ^^ +1 | 14:08 |
| bauzas | dansmith: using a periodic that would cache some raw value? | 14:09 |
| dansmith | bauzas: just use whatever periodic we use to update the metrics and/or RT IMHO | 14:10 |
| bauzas | I guess we would need to modify the API docs to say that the uptime value can be a bit stale | 14:12 |
| dansmith | does the api guarantee or suggest that it's live? | 14:12 |
| jkulik | with the call taking 40s, I guess it wasn't live already for a lot of them :D | 14:13 |
| dansmith | jkulik: excellent point :D | 14:13 |
| bauzas | dansmith: good point, /me looks at the existing docs | 14:19 |
| bauzas | well, nothing is specifically said, so I guess we can just fix it silently | 14:21 |
| dansmith | :) | 14:21 |
| bauzas | "stats" can be a good CN object field candidate for persisting the value | 14:23 |
| bauzas | that's a dict of strings | 14:23 |
| bauzas | sean-k-mooney: ^ | 14:23 |
| dansmith | IIRC both stats and metrics are actually used for something, but I don't remember all the details | 14:24 |
| opendevreview | Rajesh Tailor proposed openstack/nova stable/2024.2: Fix case sensitive comparison https://review.opendev.org/c/openstack/nova/+/959534 | 14:25 |
| bauzas | dansmith: I just remembered where stats is updated | 14:27 |
| sean-k-mooney | bauzas: you realise I literally said that many, many times | 14:27 |
| bauzas | first bit of good news: that's a periodically updated field, via the RT's update_available_resource() | 14:27 |
| sean-k-mooney | dansmith: the api-ref just says "The total uptime of the hypervisor and information about average load." | 14:28 |
| bauzas | sean-k-mooney: well, then I missed the many times when you said that | 14:29 |
| sean-k-mooney | and everything else is cached at the period of the update_available_resource periodic | 14:29 |
| dansmith | sean-k-mooney: yeah, it's unfortunate that's in our api ref at all, especially since all that is very os-specific | 14:30 |
| bauzas | anyway, sounds like we have a plan | 14:31 |
| sean-k-mooney | that's what the deprecated API says, and this is what the hypervisor detail API says: ```The total uptime of the hypervisor and information about average load. Only reported for active hosts where the virt driver supports this feature``` | 14:31 |
| bauzas | we can fix 2.88 and drop that field in a later microversion | 14:31 |
| bauzas | if ops complain about that missing field, then we can tell them to use whatever tool for calling the OS by themselves | 14:32 |
| sean-k-mooney | bauzas: just for extra context, gibi and I talked about this a few weeks ago, and before that I talked to others about this too. | 14:32 |
| bauzas | ansible, prometheus or nagios, I don't care | 14:32 |
| sean-k-mooney | so this is not the first conversation we have had, upstream or downstream, on how to fix this | 14:32 |
| bauzas | few weeks ago, I was on PTO :) | 14:32 |
| sean-k-mooney | you were on pto for the prior ones | 14:32 |
| bauzas | so I had no context, but the good thing is that we arrived at the same solution, isn't that good? | 14:33 |
| sean-k-mooney | yes, I'm aware, but we were not actively fixing it because of time pressure for FF | 14:33 |
| bauzas | I think this is a reasonable bugfix for RC1 | 14:33 |
| bauzas | not RC2, of course, as that's not a Flamingo regression per se | 14:33 |
| bauzas | but I know you know | 14:34 |
| dansmith | bauzas: ++ for rc1 | 14:34 |
| sean-k-mooney | maybe I'll see if I can fix it tomorrow, or at least have a draft for review | 14:34 |
| dansmith | if it's reasonable to do so | 14:34 |
| bauzas | extending the Stats class by stuffing in a new key pair sounds like a reasonable solution | 14:34 |
| bauzas | but sean already found the solution; I'll let him say whether he's OK with it | 14:35 |
| sean-k-mooney | let's just review the patch and we can backport as normal. if it's ready for RC1, cool; if not, we can cherry-pick after we release 2025.2 | 14:36 |
| dansmith | sean-k-mooney: there's no patch up to review yet, right? just one in your head? | 14:38 |
| sean-k-mooney | exactly. I'm still checking whether any other virt driver ever supported this RPC | 14:38 |
| sean-k-mooney | basically, I'm not sure I want to stash a sentinel value in the stats for other virt drivers | 14:39 |
| sean-k-mooney | if I can show it's only libvirt by inspecting the code, I can skip that and guard the RPC call behind both the virt type and the absence of the key in the dict | 14:40 |
| sean-k-mooney | that way I can avoid stashing garbage values for ironic or vmware compute nodes | 14:40 |
| sean-k-mooney | looks like zvm also supports it | 14:45 |
| sean-k-mooney | although if you're using zvm, the content of the response is different, I think, based on the unit tests | 14:46 |
| sean-k-mooney | so zvm and libvirt support it; the rest all raise NotImplementedError. OK, that's good to know | 14:48 |
| sean-k-mooney | I will probably put the zvm fix in a separate commit, but it will be effectively the same | 14:49 |
| dansmith | super expensive to make that rpc call for every hypervisor only to have them raise NotImplemented :/ | 14:51 |
| sean-k-mooney | yep, that's why I don't want to even try :) | 14:53 |
| bauzas | stats are already provided by the virt driver and we stuff them in the Stats object using update_stats() | 14:54 |
| bauzas | so we can just return the uptime as part of the virt resources | 14:55 |
| sean-k-mooney | yep, that's the plan | 14:55 |
| sean-k-mooney | but the uptime is meant to be the uptime of the node, which is driver-dependent; it's not the uptime of the compute service | 14:56 |
| bauzas | if the driver doesn't provide that uptime value, fine, we already provide None | 14:56 |
| bauzas | that's why I don't see the problem | 14:56 |
| sean-k-mooney | for ironic, if we chose to implement it, it would be the uptime of the ironic instance. but I'm not suggesting we do that | 14:56 |
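
And a sketch of the write side of the plan: the driver contributes uptime to the stats it already reports, so the existing update_available_resource periodic persists it for free. Names here are hypothetical, except get_host_uptime(), which the discussion confirms is implemented by libvirt and zvm only:

```python
def collect_driver_stats(driver) -> dict:
    # Called from the existing update_available_resource periodic; the
    # result lands in the unversioned compute node stats JSON blob, so the
    # API can read it in the same SQL query as the rest of the node data.
    stats = {}
    try:
        # Only libvirt and zvm implement this; every other driver raises
        # NotImplementedError, and we simply don't add the key.
        stats['uptime'] = driver.get_host_uptime()
    except NotImplementedError:
        pass
    return stats
```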
| opendevreview | ribaudr proposed openstack/nova stable/2024.1: Reproducer for bug 2114951 https://review.opendev.org/c/openstack/nova/+/959540 | 14:58 |
| opendevreview | ribaudr proposed openstack/nova stable/2024.1: Fix bug 2114951 https://review.opendev.org/c/openstack/nova/+/959541 | 14:58 |
| bauzas | I don't see the problem | 14:58 |
| bauzas | (sorry) | 14:58 |
| bauzas | each of the virt drivers can have get_host_uptime() defined | 14:59 |
| bauzas | the simple fix, to me, seems to be calling get_host_uptime(), if defined, directly in the virt driver method that returns the resources | 14:59 |
| bauzas | and we could later remove the RPC API and make that method private | 15:00 |
| dansmith | _is_ there a problem? | 15:00 |
| bauzas | I dunno hence my question | 15:01 |
| bauzas | AFAICS, only zvm and libvirt do implement the method | 15:02 |
| bauzas | vmware is not providing it | 15:02 |
| dansmith | right, so no problem | 15:02 |
| dansmith | honestly, | 15:02 |
| dansmith | the API also provides for =None for all these drivers | 15:02 |
| dansmith | and the "does this specific driver provide this data" couldn't possibly be part of the API contract, IMHO | 15:03 |
| dansmith | like, if we started remotely managing libvirt nodes from a central one, we wouldn't be able to provide this anymore | 15:03 |
| bauzas | that API field has to be gone IMHO | 15:03 |
| bauzas | but for the moment, let's fix the performance issue | 15:03 |
| dansmith | I'm not saying we shouldn't just solve this the way we've prescribed here, but... :) | 15:03 |
| bauzas | in a cloud world, I'm still surprised to see that we continue to provide hypervisor specific fields | 15:04 |
| sean-k-mooney | bauzas: well, https://review.opendev.org/c/openstack/nova-specs/+/765797/comment/8926d415_9c5a7022/ we all said we would prefer to remove it, but you wanted to discuss it with operators first. | 15:09 |
| sean-k-mooney | we tried to remove it once before and we decided to reverse course because people asked to keep it | 15:10 |
| sean-k-mooney | that does not mean we can't revisit that and fix it going forward | 15:10 |
| sean-k-mooney | but we should not rewrite history | 15:10 |
| bauzas | LOL | 15:11 |
| bauzas | I love when something pops up out of context :) | 15:11 |
| bauzas | 1/ I was only talking about hypervisor show | 15:11 |
| bauzas | 2/ I was just trying to explain that we should remove that but by another spec, not that one | 15:11 |
| sean-k-mooney | well, that was update 3 to the spec | 15:12 |
| bauzas | and AFAIK, the spec you mentioned wasn't the one we had for 2.88, right? | 15:12 |
| sean-k-mooney | we originally agreed to remove it, then we updated the spec after the fact to re-add it | 15:12 |
| sean-k-mooney | it was the spec for 2.88; it just went through many iterations | 15:12 |
| sean-k-mooney | you can see the iterations if you really care: https://github.com/openstack/nova-specs/commits/abfb91875c7d6b1e2df45d5bc2244956bc662884/specs/wallaby/approved/modernize-os-hypervisors-api.rst | 15:14 |
| sean-k-mooney | I always remember this spec because stephen used a 💩 emoji | 15:16 |
| gmaan | ++, I agree with continuing to show it in the same API instead of a separate one (a separate API, even for the old behaviour, would be a microversion bump). reporting a cached value is the better solution, and backportable. | 16:33 |
| opendevreview | Lajos Katona proposed openstack/nova master: Use SDK for Neutron networks https://review.opendev.org/c/openstack/nova/+/928022 | 16:46 |
| opendevreview | Sylvain Bauza proposed openstack/nova master: change uptime in hypervisor to be provided async using stats https://review.opendev.org/c/openstack/nova/+/959571 | 17:31 |
| bauzas | sean-k-mooney: gmaan: dansmith: I made a quick POC 'that works on my devstack' | 17:32 |
| bauzas | jkulik too: https://review.opendev.org/c/openstack/nova/+/959571 | 17:33 |
| sean-k-mooney | I have a slightly different quick POC partly done | 17:59 |
| opendevreview | sean mooney proposed openstack/nova master: hypervisors: Optimize uptime retrieval for better performance https://review.opendev.org/c/openstack/nova/+/959604 | 20:45 |