jrosser | NeilHanlon: thanks for the pointer to `rocky-release-kernel` i think there's a missing build for aarch64 64k page size? | 08:31 |
---|---|---|
jrosser | hamburgler: was there anything specific you were interested in for the radosgw config? | 08:53 |
hamburgler | jrosser: hey :), i was curious how the mappings were working from storage-policy (swift) to placement-group(ceph) as for some reason in horizon I was unable to see these from the drop down menu in horizon so I could not create a container that way. I ended up upgrading to 18.2.1 from 18.2.0 and it seemed to resolve the issue lol :D | 19:50 |
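[Editor's note: for context, the Swift storage policies discussed above are how radosgw exposes its placement targets to Swift clients. A hedged sketch of how to inspect and select them; the container and policy names are illustrative, not from this conversation:]

```shell
# List the placement targets in the zonegroup; each one is surfaced
# to Swift clients as a storage policy.
radosgw-admin zonegroup placement list

# A Swift client picks one at container-creation time via the
# X-Storage-Policy header (policy name == placement target id).
swift post mycontainer -H "X-Storage-Policy: default-placement"
```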
hamburgler | was able to get a multi-site config setup running in lab, really quite neat :) | 19:51 |
jrosser | hamburgler: interesting - is that with replication? | 19:53 |
hamburgler | yes sir, so for demo purposes, two ceph azs, one is master, other is secondary, when i write to first az it gets replicated to second | 19:54 |
hamburgler | lowered garbage collection times to run faster to see that syncs as well | 19:55 |
jrosser | did you ever investigate performance with lots of objects in a bucket | 19:55 |
jrosser | delete time for the bucket for example | 19:55 |
jrosser | ^ not related to multisite | 19:56 |
jrosser | we see a delete rate of about 500 objects/second which can make it extremely long to delete massive buckets | 19:58 |
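[Editor's note: a quick bit of arithmetic shows why a fixed per-gateway delete rate hurts at scale. The 500 objects/second figure is the rate observed above; the bucket size is a hypothetical example:]

```python
def delete_eta_hours(num_objects: int, rate_per_s: float = 500.0) -> float:
    """Hours needed to delete num_objects at a sustained rate_per_s objects/second."""
    return num_objects / rate_per_s / 3600.0

# At ~500 objects/s, a hypothetical 100M-object bucket takes more than
# two days of continuous deleting.
print(f"{delete_eta_hours(100_000_000):.1f} hours")  # → 55.6 hours
```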
hamburgler | not yet with any measurable workload - but from observation it looks pretty quick with a single bucket, that really doesn't mean much though I suppose :D, if I delete an object in a bucket, i can see the pool drop in size (this must be a marker saying objects are to be deleted), then since i set rgw_gc_processor_period and rgw_gc_obj_min_wait to 60s, the objects within a bucket are gone i'd say within 2 minutes | 19:59 |
hamburgler | everything is removed | 19:59 |
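[Editor's note: the two GC options named above can be applied at runtime; a minimal sketch, using the 60-second lab values described in the conversation rather than recommended production settings:]

```shell
# Seconds a deleted object must wait before it is eligible for GC
# (default is much higher; 60s is the lab value discussed above).
ceph config set client.rgw rgw_gc_obj_min_wait 60

# Seconds between garbage-collection processing cycles.
ceph config set client.rgw rgw_gc_processor_period 60
```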
hamburgler | did you adjust the default delete io rate limit for objects in buckets? | 19:59 |
jrosser | almost certainly not | 19:59 |
hamburgler | i haven't played with that yet myself | 19:59 |
jrosser | generally we've focussed on getting throughput up | 19:59 |
hamburgler | gotcha, what type of disks do you use? are your pools mapped to specific crush rules/drive types - such as I read the red hat docs a bit ago and it looks like certain pools related to metadata should be mapped to ssd at a minimum, ideally NVMe - then data pools likely hdd with an ssd/nvme journal/wal | 20:01 |
jrosser | all the rgw metadata is on an nvme pool | 20:01 |
jrosser | but then there is several PB of hdd for the default placement group | 20:02 |
hamburgler | replication or EC? | 20:02 |
jrosser | and we made an extra placement group "fast" which is placed on the nvme | 20:02 |
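[Editor's note: a hypothetical sketch of creating an nvme-backed placement target like the "fast" one mentioned above; the pool names and the use of the default zonegroup/zone are assumptions, not details from this cluster:]

```shell
# Register the placement target in the zonegroup, then bind it to
# pools in the zone. Pool names here are illustrative.
radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id fast
radosgw-admin zone placement add --rgw-zone default --placement-id fast \
    --data-pool default.rgw.fast.data \
    --index-pool default.rgw.fast.index \
    --data-extra-pool default.rgw.fast.non-ec
# radosgw instances need a restart to pick up the new placement target
```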
jrosser | all replicated, no EC | 20:02 |
jrosser | initially i didn't have enough chassis to sensibly do EC | 20:02 |
hamburgler | yeah it looks like quite a few nodes are needed for that, I'm truly not sure if it is a benefit over replication as I've barely touched EC for anything | 20:03 |
hamburgler | I would say since you have metadata on nvme pool that's probably not the limit of 500 objects/second - wonder if it's the io rate but again haven't really played with it | 20:04 |
hamburgler | is it different from web to cli? | 20:05 |
jrosser | i don't think so | 20:05 |
jrosser | we did discuss it briefly and decided there must be some rate limit | 20:05 |
jrosser | but bigger problem is dealing with quota exceeded :) | 20:06 |
hamburgler | haha - is that an issue on a per tenant basis? | 20:06 |
jrosser | that returns an error code immediately from the radosgw and is absolutely not bound by the latency of doing any storage I/O | 20:07 |
jrosser | so if the client is written naively, then that can badly hurt your radosgw node | 20:07 |
jrosser | we need to come up with some rate limiting for that case | 20:08 |
hamburgler | ahh you mean that if there is lots of objects being written it triggers quota exceeded? | 20:08 |
hamburgler | sorry if I am not following there | 20:09 |
jrosser | once you exceed your quota and the client just retries over and over, you can DOS the radosgw pretty easily | 20:09 |
jrosser | particularly if the client is highly parallel | 20:09 |
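[Editor's note: the well-behaved-client side of this is retry backoff. A minimal, generic sketch of full-jitter exponential backoff, not tied to any particular S3 SDK; the function name and parameters are illustrative:]

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a uniform random delay in
    [0, min(cap, base * 2**attempt)] seconds for retry number `attempt`."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# A client would sleep(backoff_delay(n)) before retry n of a request that
# came back QuotaExceeded, instead of retrying immediately in a tight,
# highly parallel loop that hammers the radosgw.
```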
hamburgler | oh shoot that's good to know | 20:09 |
jrosser | so there's two sides to that, client needs to behave appropriately | 20:09 |
jrosser | but object store needs to be able to cope with a stupid client | 20:10 |
hamburgler | hmm could haproxy not handle a rate limit? | 20:11 |
hamburgler | via source | 20:11 |
jrosser | it could | 20:11 |
jrosser | but we need to revisit as the haproxy on the front is L4 | 20:11 |
jrosser | and SSL termination is done on a big bank of radosgw behind that | 20:11 |
jrosser | so that needs to become a bit fancier architecture | 20:12 |
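[Editor's note: the per-source rate limit floated above could look roughly like this in haproxy. A hedged sketch only: it requires an L7 (`mode http`) tier, so with the L4 frontend described above it would have to live on the haproxy layer behind SSL termination instead; names, ports, and thresholds are illustrative:]

```
frontend rgw_front
    mode http
    bind :8080
    # Track per-source-IP request rate over a 10s window
    stick-table type ip size 1m expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    # Reject clients exceeding the threshold with 429 instead of
    # letting them hammer the radosgw with quota-exceeded retries
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1000 }
    default_backend rgw_back
```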
hamburgler | yeah absolutely, I'm not sure what we would do horizon side, but I imagine we will likely set the public endpoint through a different set of haproxy nodes - currently my lab is using the orchestrator to deploy keepalived/haproxy for ceph internal to openstack | 20:13 |
jrosser | that's what we do - a separate pair of haproxy for the public endpoint | 20:14 |
jrosser | the radosgw sit in a sort of dmz | 20:14 |
---|---|---|
jrosser | and then there's some magic which allows access to those radosgw directly from instances | 20:15 |
jrosser | without having to go through a neutron router | 20:15 |
hamburgler | oh interesting never thought of that use case! | 20:15 |
hamburgler | is your object storage built on its own cluster or dedicated root where rbd pools sit? | 20:16 |
jrosser | it's the same cluster for everything | 20:17 |
jrosser | but we had some particular requirements around object store throughput MB/s, rather than rbd iops | 20:17 |
hamburgler | gotcha, we have multiple roots for different tiers at the moment, was debating about throwing object storage on its own dedicated cluster but again $ :| lol | 20:18 |
jrosser | i have opportunity to expand the object store significantly | 20:19 |
jrosser | but the disk chassis are 84 disks each | 20:19 |
jrosser | which is some pretty wild number for OSDs | 20:19 |
hamburgler | oh yeah :O, those are HDD nodes for data pools? | 20:20 |
jrosser | yes | 20:20 |
hamburgler | was gunna say if that was NVMe or even SSD those poor processors :D bottleneck for days | 20:20 |
hamburgler | *with that many disks on one node :) | 20:24 |
hamburgler | actually pretty happy with 18.2.1 - not that I am much of a fan of dashboards but the multi-site and overview are a nice touch | 20:26 |
hamburgler | jrosser: btw ty, appreciate the chat! | 20:30 |
jrosser | heh no problem | 20:30 |
jrosser | tbh we had a lot of trouble so far with the radosgw dashboard | 20:30 |
jrosser | buckets with loads of objects did not go well | 20:31 |
hamburgler | I imagine it will be the same for me when not in a quiet lab env :D | 20:32 |
NeilHanlon | jrosser: not sure about the 64k kernel--but will check! | 20:40 |
jrosser | NeilHanlon: that would be great | 20:41 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!