bauzas | good morning Nova | 08:50 |
---|---|---|
gibi | good morning | 08:50 |
* bauzas is about to go back to bed as it looks like it's still night outside my window | 08:51 | |
bauzas | you appreciate winter time when you have to turn the lights on | 08:51 |
*** simondodsley_ is now known as simondodsley | 09:31 | |
*** erlon_ is now known as erlon | 09:32 | |
*** TheJulia_ is now known as TheJulia | 09:32 | |
*** EugenMayer3 is now known as EugenMayer | 10:31 | |
gibi | bauzas: I think this spec is a quick +A https://review.opendev.org/c/openstack/nova-specs/+/810868 if you have a minute | 13:16 |
gibi | :) | 13:16 |
bauzas | gibi: I'll need to taxi my daughter in a few mins but I'll look at it after | 13:17 |
sean-k-mooney | gibi: I kind of agree | 13:19 |
gibi | bauzas: OK, sure | 13:19 |
sean-k-mooney | I can take a look but I'll leave the +W to bauzas | 13:19 |
Henriqueof | Are there any articles/guides on how overcommitting CPU/RAM degrades performance? | 13:26 |
sean-k-mooney | Henriqueof: well, overcommitting RAM is easy to understand: once you actually start overcommitting the in-use RAM, the host will start swapping to disk | 13:34 |
sean-k-mooney | using something like zram as first-level swap and an actual swap partition as second-level swap, with 1 swap partition per NUMA node, can in some cases help, but only if the storage is also NUMA-aligned | 13:35 |
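A rough sketch of the two-level swap idea described above, assuming a single zram device and a spare disk partition (device names and sizes are illustrative; the per-NUMA-node variant would need one zram device and one partition per node):

```sh
# Compressed in-RAM swap as the first level (higher swap priority is used first)
modprobe zram
echo 8G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0

# Regular disk swap partition as the second level (lower priority)
swapon -p 10 /dev/nvme0n1p3
```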
sean-k-mooney | Henriqueof: for CPUs it is really just a matter of contention and context-switching overhead | 13:36 |
sean-k-mooney | if your VMs are mostly idle it will be fine to overcommit them, but once the host load starts to exceed the number of CPUs the performance will degrade | 13:36 |
sean-k-mooney | you will likely hit memory bandwidth, disk IO or network IO bottlenecks too, depending on your workloads | 13:37 |
sean-k-mooney | Henriqueof: my recommendations are: never overcommit CPUs by more than 4:1, always reserve at least 1 core per NUMA node for the host, and if you are using hyperthreading reserve the hyperthread sibling of each host core too. | 13:38 |
sean-k-mooney | Henriqueof: hyperthreading only gives you about a 1.4x increase in throughput, by the way, so don't expect to actually be able to service a load equal to nproc if you are using HT | 13:39 |
sean-k-mooney | Henriqueof: for memory I normally recommend never overcommitting and using hugepages, but if you must overcommit, allocate swap equal to total memory * overcommit ratio | 13:40 |
sean-k-mooney | I would not really overcommit more than about 2-4x your RAM either, but as I said I recommend keeping the memory overcommit ratio at 1.0, so no overcommit in most cases. | 13:41 |
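In nova.conf terms these recommendations map roughly onto the allocation-ratio and host-reservation options; the values below are a sketch following the advice above, not defaults, and need tuning per host:

```ini
[DEFAULT]
cpu_allocation_ratio = 4.0        # at most 4:1 CPU overcommit
ram_allocation_ratio = 1.0        # no memory overcommit
reserved_host_cpus = 2            # capacity equivalent to 1 core per NUMA node on a 2-socket host
reserved_host_memory_mb = 4096    # memory kept back for the host itself
```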
Henriqueof | sean-k-mooney: You actually answered most of my questions, thank you! | 14:00 |
Henriqueof | I find it odd that the OpenStack docs say CPU and RAM are overcommitted by default, but kolla-ansible doesn't seem to do that. | 14:02 |
sean-k-mooney | we have our defaults set to overcommit CPU by 16:1 and RAM by 1.5:1 | 14:03 |
kashyap | Henriqueof: Who are the users of 'kolla-ansible'? | 14:03 |
kashyap | (Do people use it manually, or do tools use it mostly?) | 14:03 |
sean-k-mooney | they are old defaults from when OpenStack was used by NASA and Rackspace mainly for web hosting/data storage | 14:04 |
sean-k-mooney | kashyap: it's one of the more popular installers; it's often used via Kayobe, which is supported by StackHPC https://www.stackhpc.com/pages/kayobe.html | 14:05 |
sean-k-mooney | kashyap: the company johnthetubaguy[m] works at, if he has not moved on. | 14:06 |
kashyap | sean-k-mooney: Right; I vaguely know the tool is an installer. Didn't know how much it is actually used in production | 14:06 |
kashyap | I see, noted. | 14:06 |
sean-k-mooney | kashyap: so most of the users are HPC or scientific users, or government/university installations, I believe | 14:06 |
sean-k-mooney | kashyap: the highest-profile use is probably SKA, the Square Kilometre Array telescope | 14:07 |
kashyap | Cool; good to know :) | 14:08 |
Henriqueof | sean-k-mooney: Really? Until now I thought kolla-ansible was one of the most popular deployment tools. | 14:08 |
sean-k-mooney | Henriqueof: it is, yes | 14:10 |
sean-k-mooney | Henriqueof: I'm not sure how much market share it has vs TripleO, OpenStack Charms and OpenStack-Ansible | 14:11 |
sean-k-mooney | but those are the big 4 deployment tools | 14:11 |
sean-k-mooney | looking at https://www.openstack.org/analytics | 14:14 |
sean-k-mooney | if you go to Deployment Decisions | 14:14 |
Henriqueof | Yeah, it is a very straightforward and stable tool, so I never felt the need to experiment with the others. | 14:14 |
sean-k-mooney | 29% of respondents used kolla-ansible | 14:14 |
sean-k-mooney | which is about the same as Juju/TripleO/OSA combined | 14:15 |
sean-k-mooney | that does not tell you how big the deployments are, however | 14:15 |
sean-k-mooney | so there might be more respondents using kolla-ansible, but that does not mean there are more servers managed by it; it at least gives some indication of its popularity | 14:16 |
kashyap | stephenfin: Hey, have you ever used this? - sphinxcontrib-spelling | 14:21 |
kashyap | [https://sphinxcontrib-spelling.readthedocs.io/en/latest/] | 14:21 |
sean-k-mooney | kashyap: I would expect it to have issues with the terms we use, like extra-spec | 14:23 |
kashyap | Right; but I still wonder whether it is overall a net win or not | 14:24 |
sean-k-mooney | we could likely include a dictionary with those, but that might get tedious | 14:24 |
kashyap | sean-k-mooney: Yes, it can use project-specific dictionaries | 14:24 |
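A minimal sketch of how that could look in a Sphinx conf.py, assuming a project wordlist file named spelling_wordlist.txt next to it (the filename is an assumption):

```python
# conf.py
extensions = [
    # ... existing extensions ...
    'sphinxcontrib.spelling',
]

# Project-specific dictionary for terms like "extra-spec"
spelling_word_list_filename = 'spelling_wordlist.txt'
spelling_show_suggestions = True
```

The check would then be run with the dedicated builder, e.g. `sphinx-build -b spelling doc/source doc/build/spelling`.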
opendevreview | Merged openstack/nova-specs master: Repropose Add libvirt support for flavor and image defined ephemeral encryption https://review.opendev.org/c/openstack/nova-specs/+/810868 | 15:25 |
opendevreview | Dan Smith proposed openstack/nova master: Allow per-context rule in error messages https://review.opendev.org/c/openstack/nova/+/816865 | 15:38 |
opendevreview | Dan Smith proposed openstack/nova master: Revert project-specific APIs for servers https://review.opendev.org/c/openstack/nova/+/816206 | 15:38 |
dansmith | gmann: johnthetubaguy[m]: Removed the WIPs from these ^ as I'm assuming there are no more fundamental concerns | 15:38 |
gmann | dansmith: ack, I will check today. thanks | 15:39 |
dansmith | can we get this merged? https://review.opendev.org/c/openstack/nova/+/817030 | 15:48 |
dansmith | it's already being used to debug gate and real VIF plugging event failures | 15:48 |
gibi | dansmith: done | 15:49 |
dansmith | gibi: thanks | 15:51 |
kashyap | In CirrOS latest 0.5.2, where is this file? /etc/cirros-init/config? | 16:51 |
kashyap | Is it moved to somewhere else? /me didn't find it in a quick libguestfs inspection | 16:51 |
kashyap | Actually, ignore me. It's still there. | 16:57 |
opendevreview | Merged openstack/nova master: Log instance event wait times https://review.opendev.org/c/openstack/nova/+/817030 | 17:35 |
*** lucasagomes_ is now known as lucasagomes | 18:22 | |
opendevreview | Merged openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance https://review.opendev.org/c/openstack/nova/+/811716 | 18:39 |
mnaser | hi y'all | 18:56 |
mnaser | has anyone run into an issue where the API stops responding if the notification transport is failing? | 18:56 |
mnaser | i.e. oslo_messaging_notifications/transport_url = rabbit://foobar, where foobar goes down while the DEFAULT/transport_url is still up, but I guess the threads all get blocked until it grinds to a halt? | 18:57 |
mnaser | I've repro'd it on a customer environment that is deployed by OSA, but I'm trying to get a devstack up right now and get a GMR to see how it hangs | 18:58 |
sean-k-mooney | it might be related to the heartbeat | 18:59 |
sean-k-mooney | or the WSGI server | 18:59 |
sean-k-mooney | if you are using mod_wsgi under Apache, each worker will only ever service 1 API request at a time | 19:00 |
sean-k-mooney | we may monkey-patch the API, but that will never allow the Apache process to service a second request in parallel, as that is managed by Apache | 19:01 |
sean-k-mooney | if all the API workers are trying to do something that needs rabbit, then it will stop responding until the request or RPC timeout fires and it returns an error | 19:01 |
sean-k-mooney | I don't know if uwsgi is better in that regard | 19:02 |
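For comparison, a rough uwsgi sketch for nova-api under the same reasoning; the script path and counts are assumptions, but the point stands that processes * threads bounds how many requests can be in flight, so a handful of blocked notification sends can consume every worker:

```ini
[uwsgi]
master = true
# pbr-generated WSGI script; the path is an assumption and varies by install
wsgi-file = /usr/local/bin/nova-api-wsgi
# each process handles requests independently
processes = 4
# with a single thread, one blocked notification send ties up the whole worker
threads = 1
enable-threads = true
http-socket = 127.0.0.1:8774
```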
mnaser | sean-k-mooney: OSA deploys with uwsgi | 19:03 |
mnaser | sean-k-mooney: I'm still doing my research, but I also suspect this affects n-cond too | 19:03 |
mnaser | and anything rabbit-related; it seems like the notification blocks the main process | 19:03 |
mnaser | or maybe when the queue of unsent messages gets too big, the whole process bogs down | 19:04 |
mnaser | or it has a limit on how many threads it will grow to, and then the whole process stops responding | 19:05 |
sean-k-mooney | it's possible that the eventlet thread pool will fill up eventually | 19:05 |
sean-k-mooney | hopefully this is something I can detect as part of the health check work | 19:06 |
mnaser | AFAIK I think the default timeout or retry is set to 0 for notifications | 19:06 |
sean-k-mooney | well, notifications are off by default | 19:06 |
sean-k-mooney | or rather we use the noop driver | 19:06 |
mnaser | right yes, but if you turn them on, retries=0 so retry forever | 19:06 |
sean-k-mooney | I would expect 0 to be retry never | 19:07 |
sean-k-mooney | and -1 to be retry forever | 19:07 |
mnaser | 0 is retry forever in the notifier I think, let me double-check | 19:07 |
mnaser | sean-k-mooney: btw I suggest looking at how we do health checks in openstack-helm; it has some neat things where it actually makes an RPC call to the local instance and makes sure we get an error back saying "not valid call" | 19:07 |
mnaser | there's some neat stuff there that might provide inspiration | 19:07 |
mnaser | sean-k-mooney: https://opendev.org/openstack/openstack-helm/src/branch/master/neutron/templates/bin/_health-probe.py.tpl | 19:07 |
sean-k-mooney | mnaser: I wanted to do active probes, but the direction at the PTG was that that was not OK | 19:08 |
mnaser | this one pretty much runs the check when it's asked | 19:08 |
sean-k-mooney | maybe after the initial work is done we can add a probe endpoint, but it will initially be based on cached state | 19:08 |
sean-k-mooney | mnaser: yeah, that is what I was going to do, but it was rejected when I proposed it | 19:08 |
mnaser | https://opendev.org/openstack/openstack-helm/src/branch/master/nova/templates/bin/_health-probe.py.tpl is how its done for nova | 19:08 |
sean-k-mooney | mnaser: well that is writing to the nova message bus | 19:09 |
sean-k-mooney | so that is not allowed for anything that is not a part of nova | 19:09 |
mnaser | yes it might not be very clean but it works(tm) | 19:09 |
sean-k-mooney | sure, and it will void any downstream support you have with your vendor | 19:10 |
mnaser | fair enough, my downstream support is me =P | 19:10 |
sean-k-mooney | but yeah, probing the queue was one of the things I wanted to do | 19:10 |
sean-k-mooney | we might add a way to do that at some point | 19:11 |
mnaser | btw, you were right, -1 is indefinite, and it defaults to that => https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/notify/notifier.py#L55-L58 | 19:11 |
sean-k-mooney | ack | 19:11 |
mnaser | I guess oslo.messaging doesn't have a timeout | 19:11 |
sean-k-mooney | yeah, I'm not sure | 19:19 |
opendevreview | Artom Lifshitz proposed openstack/nova master: api-ref: Adjust BFV rescue non-support note. https://review.opendev.org/c/openstack/nova/+/818823 | 19:19 |
sean-k-mooney | likely you should change the default to be, say, 10 or similar | 19:19 |
sean-k-mooney | in OSA | 19:19 |
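A sketch of the relevant notification settings in nova.conf, following the suggestion above (the broker URL is illustrative and the retry value is just an example of a finite limit):

```ini
[oslo_messaging_notifications]
driver = messagingv2
# Separate broker for notifications, as in the scenario being debugged
transport_url = rabbit://user:pass@notification-broker:5672/
# Default is -1 (retry forever); a small finite value lets a failing send give up
retry = 10
```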
opendevreview | Merged openstack/nova stable/xena: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/818515 | 20:04 |
mnaser | sean-k-mooney: well, it sounds like maybe that's not a great default value, I guess | 20:16 |
sean-k-mooney | mnaser: I assume notifications are off in OSA by default. If they're enabled and no default is specified for retry, I would probably default to 0, 1 or 3, but not -1 | 20:18 |
sean-k-mooney | or just make it an error | 20:18 |
mnaser | sean-k-mooney: yeah, I'm thinking more of saner oslo.messaging defaults | 20:18 |
sean-k-mooney | require it to be set | 20:18 |
sean-k-mooney | well, again it depends on your setup; you might rely on notifications | 20:19 |
sean-k-mooney | but if you do, then you also need to have monitoring in place to know that there are rabbit issues | 20:19 |
sean-k-mooney | and correct that | 20:19 |
mnaser | sean-k-mooney: yeah but to me it sounds like notifications failing should not result in nova falling apart | 20:20 |
sean-k-mooney | well, it should not, but that might just mean that -1 for retry is not a valid value | 20:21 |
sean-k-mooney | -1 presumably means you must keep every notification in memory | 20:21 |
sean-k-mooney | until it's sent | 20:21 |
sean-k-mooney | with a cooperative threading model like eventlet, if you have enough notification eventlets pending, that will eventually degrade the performance of the service | 20:22 |
mnaser | yeah, I'm trying to repro right now | 20:23 |
opendevreview | Stanislav Dmitriev proposed openstack/nova master: Retry image download if it's corrupted https://review.opendev.org/c/openstack/nova/+/818503 | 21:21 |
opendevreview | Dmitrii Shcherbakov proposed openstack/nova master: [yoga] Add PCI VPD Capability Handling https://review.opendev.org/c/openstack/nova/+/808199 | 22:06 |
opendevreview | Dmitrii Shcherbakov proposed openstack/nova master: [yoga] Support remote-managed SmartNIC DPU ports https://review.opendev.org/c/openstack/nova/+/812111 | 22:06 |