cloudnull | noonedeadpunk sorry, I'm replying late. I'm out of the loop, but the strategy plugin may not actually be needed with modern Ansible. Much of the strategy plugin was trying to do forward lookups on tasks and omit them when conditions were not met, all without ever loading the task into the Ansible engine. We also were changing the connection plugin | 01:43 |
---|---|---|
cloudnull | when dealing with specific scenarios to use paramiko or ssh. All of this is probably not really needed any more. | 01:43 |
cloudnull | I think the only thing that would actually need to be kept is the magic variable mapping | 01:44 |
cloudnull | https://github.com/openstack/openstack-ansible-plugins/blob/master/plugins/strategy/linear.py#L48 | 01:44 |
cloudnull | but then again, I'm not really sure. | 01:44 |
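To experiment with dropping the custom strategy, the play-level `strategy` keyword can pin the stock linear strategy for a single run. A minimal sketch, assuming the standard OSA `all_containers` group and a working connection plugin in inventory (the play name and task are hypothetical):

```yaml
# Hypothetical test play: force the stock 'linear' strategy on one play to
# compare behaviour/performance against the OSA strategy override.
- name: Smoke-test the stock linear strategy
  hosts: all_containers
  gather_facts: false
  strategy: linear
  tasks:
    - name: Ping each container through the configured connection plugin
      ansible.builtin.ping:
```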
cloudnull | ThiagoCMC I've seen that error before; the issue for me was that the Galera cluster health check with xinetd (I think OSA uses something else now) wasn't permitting the traffic from the HAProxy node. To fix it I had to change the allow address for the health check config; in general I would set it to something like 0.0.0.0/0 for testing, then lock | 01:50 |
cloudnull | it down once it was working again. | 01:50 |
cloudnull | https://github.com/openstack/openstack-ansible-galera_server/blob/master/defaults/main.yml#L82 | 01:51 |
cloudnull | here's the implementation - https://github.com/openstack/openstack-ansible-galera_server/blob/8a8d29ea490fba6695e3356831846466f6991089/tasks/galera_server_post_install.yml#L60 | 01:52 |
cloudnull | I guess that's using clustercheck these days. | 01:53 |
cloudnull | but same basic idea | 01:53 |
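As a sketch of the workaround described above, the allowed source for the health check can be overridden from the deployment host, assuming the variable is still `galera_monitoring_allowed_source` as in the linked defaults file:

```yaml
# /etc/openstack_deploy/user_variables.yml
# Open the Galera monitoring/health check to everything while debugging,
# then lock it back down to the HAProxy/internal subnet once it works.
galera_monitoring_allowed_source: "0.0.0.0/0"
```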
prometheanfire | cloudnull: ohai | 02:29 |
noonedeadpunk | cloudnull: aha, thanks for explaining it a bit. The SSH connection plugin is still needed though, as we don't have ssh inside LXC and it's used to do lxc-attach from lxc_hosts | 08:12 |
noonedeadpunk | and indeed it looks like ansible-core 2.13 doesn't really need our strategy, but I definitely see a performance improvement with 2.11 | 08:12 |
jamesdenton | prometheanfire makes complete sense and demonstrates my lack of testing with LXC, only metal | 13:44 |
jamesdenton | So we may need to adjust that strategy, then | 13:44 |
Mohaa | "Host evacuation [for Masakri] requires shared storage and some method of fencing nodes, likely provided by Pacemaker/STONITH and access to the OOB management network. Given these requirements and an incomplete implementation within OpenStack-Ansible at this time, I’ll skip this demonstration." (jamesdenton's blog post, 20201) | 14:12 |
jamesdenton | hi | 14:13 |
Mohaa | (: | 14:13 |
Mohaa | Hi | 14:13 |
jamesdenton | i haven't looked at it again since then | 14:14 |
Mohaa | Is that feature not yet implemented in openstack-ansible? | 14:14 |
Mohaa | Ah, ok | 14:15 |
jamesdenton | Well, TBH I am not sure about any sort of pacemaker role in OSA, and I'm pretty sure we don't have any sort of reference arch for this | 14:17 |
Mohaa | Oops! Instance HA is necessary for a production environment! | 14:26 |
jamesdenton | depends on who you ask. pets vs cattle, etc. | 14:30 |
prometheanfire | jamesdenton: I'm willing to test patches for it :D | 15:32 |
spatel | what SSDs do you guys prefer for Ceph nowadays? | 15:48 |
spatel | Intel or Samsung and what model? | 15:48 |
damiandabrowski | can't help you with the vendor, but recently I learned that enterprise-grade SSDs may behave a lot better for Ceph due to the way they handle fsyncs: https://forum.proxmox.com/threads/ceph-shared-ssd-for-db-wal.80165/post-397544 | 15:56 |
damiandabrowski | so even though my Samsung 980 Pros' performance looks good on paper, they probably suck for Ceph. I haven't had time for a real comparison though | 15:57 |
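One way to check the fsync behaviour mentioned above is a single-job 4k synchronous write test. A sketch as an ad-hoc playbook, assuming fio is installed on the target machines and that a `storage_hosts` group exists in inventory (both are assumptions); enterprise drives with power-loss protection usually hold up here, consumer drives collapse:

```yaml
# Hypothetical benchmark play: 4k synchronous writes at queue depth 1 --
# roughly the pattern a Ceph journal/WAL imposes on a device.
- name: Compare sync-write behaviour of candidate SSDs
  hosts: storage_hosts
  tasks:
    - name: Run a 60s 4k sync write test against a scratch file
      ansible.builtin.command: >
        fio --name=synctest --filename=/var/tmp/fio-synctest --size=1G
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --sync=1
        --runtime=60 --time_based --group_reporting
      register: fio_result

    - name: Show the results
      ansible.builtin.debug:
        var: fio_result.stdout_lines
```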
admin1 | spatel.. Intel.. it generally has very high endurance compared to Samsungs | 15:58 |
spatel | Model? | 16:00 |
admin1 | p* series | 16:00 |
spatel | damiandabrowski Samsung 980 pro is consumer SSD correct? | 16:00 |
admin1 | dont recall the exact model | 16:00 |
damiandabrowski | spatel: yeah | 16:01 |
spatel | They suck..! Ask me.. I had to upgrade my whole cluster to PM883s 4 years ago.. | 16:01 |
spatel | consumer SSDs suck.. | 16:01 |
spatel | admin1 thanks, I will try to find a P* model and see which one fits my pocket :) | 16:02 |
admin1 | spatel, S4500 also has a lot of endurance | 16:02 |
spatel | Assuming we don't need a dedicated journal SSD with Intel SSDs | 16:02 |
admin1 | if you are going to have a bigger cluster or a chance of growth, then an NVMe journal might become a bottleneck | 16:03 |
spatel | https://www.amazon.com/Intel-SSDSC2KB019T701-S4500-Reseller-Single/dp/B074QSB52M/ref=sr_1_3?crid=1B4LWHP6M79LY&keywords=Intel+S4500+2TB&qid=1677254585&s=electronics&sprefix=intel+s4500+2tb%2Celectronics%2C79&sr=1-3&ufe=app_do%3Aamzn1.fos.304cacc1-b508-45fb-a37f-a2c47c48c32f | 16:03 |
damiandabrowski | admin1: oh, so you generally do not recommend having NVMe WAL&DB for SSD OSDs? | 16:04 |
admin1 | I do not if you are going to have 100s of SSDs, or say a public cloud where you need to add dozens per month for growth, for example | 16:04 |
spatel | Why do we need a dedicated journal for SSDs? Why create a single point of failure? | 16:05 |
spatel | i can understand with HDD | 16:05 |
admin1 | this is where *endurance* comes into play | 16:05 |
damiandabrowski | I guess performance. Ceph docs suggest putting the WAL on a faster device type, and NVMe performance is generally better than standard SSDs | 16:06 |
damiandabrowski | but i'd love to see some benchmarks comparing colocated vs. external WAL for SSD OSDs | 16:06 |
damiandabrowski | and also what is the optimal number of SSD OSDs per NVMe WAL (I have 4 OSDs per NVMe WAL but not sure if it's the best ratio) | 16:08 |
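For reference, with cephadm this kind of split can be expressed declaratively in an OSD service spec (drive groups), applied with `ceph orch apply -i osd_spec.yml`. A sketch only; the device filters below are hypothetical and need to match the actual drives in the hosts:

```yaml
# osd_spec.yml -- example drive-group spec; filters are assumptions
service_type: osd
service_id: ssd_osds_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0          # non-rotational drives (the SATA SSDs) become OSDs
  db_devices:
    model: 'EXAMPLE-NVME'  # hypothetical model filter for the NVMe DB/WAL drives
```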
spatel | damiandabrowski you do use NVMe for the SSD journal? | 16:09 |
damiandabrowski | yeah | 16:09 |
spatel | How many SSDs do you have per server? | 16:10 |
spatel | I meant OSDs | 16:10 |
damiandabrowski | it's a small cluster, I have 3 storage servers, each one has: 6 HDD OSDs + 1 SSD journal and 4 SSD OSDs + 1 NVMe journal | 16:13 |
damiandabrowski | but honestly, I don't know if it's a good ratio or not :D cluster performance is quite bad, but probably due to consumer grade SSDs | 16:13 |
spatel | consumer-grade SSDs are really really bad for Ceph (I went through that pain) | 16:17 |
spatel | Do you have dedicated MON nodes, or are they running with OSDs? Because of budget I am running MON+OSD on the same node | 16:18 |
spatel | Bad, bad idea.. but the plan is to split them when we have money | 16:18 |
damiandabrowski | I'm running MONs on openstack controllers | 16:18 |
damiandabrowski | (in LXC containers) | 16:20 |
spatel | ohh | 16:21 |
damiandabrowski | and yeah..probably I'll have to replace my samsungs sooner or later :| | 16:21 |
spatel | I am deploying with cephadm | 16:21 |
spatel | I did replace with PM883 SSD and happy with performance. | 16:21 |
damiandabrowski | well, probably you can create a ceph-mon group in OSA, let it prepare containers and do the rest with cephadm | 16:23 |
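The suggestion above would look roughly like this in the deployment config. A sketch assuming the `ceph-mon_hosts` group from OSA's ceph-ansible integration and example hostnames/IPs:

```yaml
# /etc/openstack_deploy/openstack_user_config.yml (fragment, example values)
# OSA creates LXC containers for the monitors on these hosts.
ceph-mon_hosts:
  infra1:
    ip: 172.29.236.11
  infra2:
    ip: 172.29.236.12
  infra3:
    ip: 172.29.236.13
```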
spatel | cephadm uses docker/podman | 16:23 |
damiandabrowski | ahhh that's true, forgot about it :| | 16:23 |
spatel | haha!! | 16:23 |
spatel | what a mess.. | 16:23 |
spatel | which numbers should we look at in this smartctl output - https://paste.opendev.org/show/bB5XMOXki9y1Tt0igBdV/ | 16:24 |
damiandabrowski | regarding containers for ceph: i love that title :D https://www.youtube.com/watch?v=pPZsN_urpqw | 16:25 |
damiandabrowski | regarding smartctl, I think most attributes are meaningful, but instead of manually reading smartctl output, I prefer to use smartd and just wait for emails :D | 16:29 |
prometheanfire | jamesdenton: as far as metal vs lxc, is metal becoming the more 'supported' option by OSA? | 16:29 |
damiandabrowski | https://github.com/stuvusIT/smartd - i used this role some time ago | 16:30 |
mgariepy | damiandabrowski, yeah I saw this one last year :D they almost killed ceph-ansible because of that though.. :P | 16:41 |
damiandabrowski | mgariepy: oops, but at least with ceph-ansible you can choose whether you want a containerized deployment or not :D https://github.com/ceph/ceph-ansible/blob/main/group_vars/all.yml.sample#L583 | 16:44 |
damiandabrowski | can't say the same about cephadm :| | 16:44 |
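That toggle is the one referenced in the linked sample file:

```yaml
# group_vars/all.yml (ceph-ansible)
# false = package-based (bare-metal) daemons, true = containerized daemons
containerized_deployment: false
```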
spatel | damiandabrowski I will look into smartd | 16:47 |
spatel | But how do I know how much life is left on my SSD here https://paste.opendev.org/show/bB5XMOXki9y1Tt0igBdV/ | 16:48 |
jamesdenton | prometheanfire all of our deploys have been "metal" since Stein | 16:50 |
spatel | damiandabrowski what RAID controller card do you prefer for Ceph? | 16:55 |
damiandabrowski | spatel: I think you can consider this disk as worn out, but if you want to save some money, it should be ok to keep it until you see some errors (like Uncorrectable_Error_Cnt) | 17:01 |
spatel | +1 | 17:02 |
damiandabrowski | regarding the RAID controller, I just use some random HBA. AFAIK Ceph does not recommend RAID controllers | 17:02 |
prometheanfire | noonedeadpunk: is metal considered more supported now than lxc, where dev effort / support is likely to be going? asking for a new install | 17:02 |
spatel | What would that random HBA be? | 17:05 |
spatel | Any model I should follow.. | 17:05 |
spatel | is it possible to use RAID + JBOD on the same controller? | 17:05 |
spatel | Like RAID for the OS disks and JBOD for the rest | 17:05 |
damiandabrowski | don't remember the exact model, something from Broadcom | 17:15 |
damiandabrowski | I think it depends on the controller; a few years ago I was working with one that didn't support JBOD at all | 17:15 |
damiandabrowski | and I had to configure RAID0 separately on each disk xD | 17:16 |
damiandabrowski | or maybe i had to create RAID1 on each disk because RAID0 was not supported? don't remember, it was a long time ago | 17:20 |
spatel | That is what I have in my old Ceph, all RAID0 | 17:30 |
admin1 | yeah . | 17:30 |
spatel | I never worked with JBOD so no idea what models there are and how good they are | 17:30 |
spatel | Do you enable RAID0 with write-back cache on the HBA controller? | 17:31 |
admin1 | spatel, does the controller have its own battery unit? | 17:35 |
spatel | admin1 yes controller has battery | 18:09 |
spatel | small one :) | 18:09 |
noonedeadpunk | prometheanfire: I don't think it matters much, to be frank. But metal with Ceph will be troublesome if both glance and cinder use it as a backend, so you will need to isolate some of them. And troublesome not during deployment, but during further operations | 18:18 |
noonedeadpunk | prometheanfire: I personally am still running LXC and not going to migrate off it | 18:18 |
prometheanfire | good to know about ceph | 18:18 |
noonedeadpunk | (as of today) | 18:18 |
noonedeadpunk | so it's more about your preference, to be frank | 18:19 |
prometheanfire | ya, I like the containers, being able to blow things away is nice | 18:19 |
noonedeadpunk | Well, if you have ironic you can blow away controllers as well... But yeah, limiting impact might be good sometimes | 18:19 |
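For what it's worth, opting individual hosts (or everything) out of LXC is done per host in the user config. A sketch with an example host and IP, assuming the `no_containers` flag still works as documented:

```yaml
# /etc/openstack_deploy/openstack_user_config.yml (fragment, example values)
shared-infra_hosts:
  infra1:
    ip: 172.29.236.11
    no_containers: true   # deploy this host's services on metal, no LXC
```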
prometheanfire | ironic for the controllers? snake eating its own tail :P | 18:36 |
spatel | prometheanfire haha!! controller running ironic and ironic building the controller.. too much fun | 19:17 |
noonedeadpunk | it can be standalone like bifrost for instance | 20:04 |
prometheanfire | true | 20:18 |
admin1 | anyone tried Trove? | 20:23 |