*** rbudden has quit IRC | 01:08 | |
*** priteau has joined #scientific-wg | 01:50 | |
*** priteau has quit IRC | 01:55 | |
*** rbudden has joined #scientific-wg | 02:02 | |
*** rbudden has quit IRC | 02:02 | |
masber | I see, we would like to deploy openstack to provision HPC but also VMs through nova and containers using zun+kuryr. Right now we use rocks cluster for deployment and to provision SGE (job scheduler); does openstack/ironic have any integration with a job scheduler? | 02:25 |
jmlowe | rbudden and trandles are the ones to ask about that | 02:26 |
jmlowe | masber I take it you are not in North America? | 02:26 |
masber | jmlowe, correct I am in Sydney (Australia) | 02:27 |
jmlowe | always seem to be on in the middle of the night for most of us | 02:27 |
jmlowe | b1airo might be on, he's from Nectar | 02:28 |
jmlowe | I'm up late babysitting my ceph cluster converting from filestore to bluestore | 02:28 |
masber | I would love to get in contact with the people from Nectar | 02:29 |
masber | jmlowe, bluestore looks quite nice, apparently it is a better fit for all-flash ceph environments? | 02:30 |
jmlowe | I'm estimating by Saturday when this is all said and done I will have moved around 750TB of data / 35M objects | 02:30 |
jmlowe | I think it's just better all around | 02:30 |
jmlowe | it really does CoW instead of emulating it with atomic filesystem operations | 02:31 |
jmlowe | gets rid of double write penalty | 02:31 |
jmlowe | if you are all flash, should make your system twice as fast and last twice as long | 02:32 |
masber | jmlowe, nice, I am playing with jewel as it is the one used by openstack-kolla | 02:32 |
jmlowe | all data is checksummed on disk | 02:32 |
jmlowe | on disk inline compression, they use some trick similar to zfs to make erasure coded pools for rbd usable | 02:33 |
jmlowe | the rados gw gets swift public buckets, great for hosting data sets | 02:33 |
jmlowe | I can't say enough good things about this luminous update | 02:34 |
masber | jmlowe, I tried to configure it by reading a document from intel, they said to pin each nvme drive to the cpu... couldn't figure out how to do that, in the end I just reduced the size of the objects (my files are not big) to get better performance, but I guess I can get much more than what I am doing now | 02:34 |
jmlowe | right, what you would do is pin the osd process to the cpu | 02:34 |
masber | yes, I am using NUMA which makes things a little bit more complicated I guess | 02:35 |
masber | I am not an expert, just reading and trying things around | 02:36 |
jmlowe | apparently you use numactl to do the pinning | 02:36 |
jmlowe | http://www.glennklockwood.com/hpc-howtos/process-affinity.html | 02:36 |
jmlowe | cpu pinning is an old hpc trick | 02:36 |
jmlowe | when you are counting floating point operations per clock tick getting yourself on the wrong numa node is to be avoided at all costs | 02:37 |
masber | yes something like this --> Ceph startup scripts need change with setaffinity=" numactl --membind=0 --cpunodebind=0 " | 02:38 |
masber | this is the docs I was reading http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments | 02:38 |
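For reference, the pinning that tuning doc describes can be approximated from the shell. A minimal sketch, assuming osd.0 is backed by nvme0; the device name, OSD id, and CPU list are illustrative, and the exact startup-script hook differs between distributions:

```bash
# Which NUMA node is the NVMe attached to? (PCI attribute; prints e.g. 0)
cat /sys/class/nvme/nvme0/device/numa_node

# One-off pinning of an already-running OSD process to that node's CPUs
# (replace <osd-pid> and the CPU list with your own values).
taskset -pc 0-15 <osd-pid>

# What the guide's setaffinity hook amounts to: launch the daemon under numactl
# so both its CPU scheduling and its memory allocations stay on the local node.
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 --setuser ceph --setgroup ceph
```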
masber | jmlowe, may I ask, have you used or heard about Panasas? | 02:40 |
jmlowe | Yeah, they were always too expensive for us, we have traditionally been a gpfs (came with the cluster from IBM) and lustre shop | 02:41 |
masber | jmlowe, we are currently using Panasas | 02:43 |
masber | yes very expensive | 02:43 |
masber | jmlowe, how do you use ceph? | 02:59 |
jmlowe | backs the openstack for jetstream-cloud.org | 03:01 |
masber | wow | 03:03 |
masber | spinning disks I guess? and are you using 25Gb/s network? | 03:05 |
jmlowe | 10gbs, 25Gbs wasn't really a thing when we were putting together the proposal 3 years ago | 03:05 |
jmlowe | from the time we found out we were getting the award to the time the hardware arrived was about 14 months if I remember correctly | 03:06 |
jmlowe | but yes 20x12x4TB 12Gbs sas spinning disks | 03:09 |
jmlowe | bonded 2x10Gbs networking | 03:10 |
*** priteau has joined #scientific-wg | 03:51 | |
*** priteau has quit IRC | 03:56 | |
b1airo | hi masber | 04:43 |
b1airo | just noticed this conversation | 04:43 |
masber | b1airo, hi | 04:43 |
b1airo | and hey jmlowe | 04:43 |
b1airo | what process are you using to convert those OSDs? | 04:43 |
b1airo | to your earlier question masber, Nova/Ironic don't currently have any official HPC scheduler integrations, but the placement-api and scheduler decoupling work that has been a focus of Nova dev for the last couple of cycles is going to make deep integration easier | 04:46 |
b1airo | there are people out there today doing interesting things with SLURM, e.g., using a SLURM job to essentially request a VM and a job prolog to convert a regular HPC compute node into a Nova node for the VM | 04:47 |
b1airo | i think that is something the folks at PSC running Bridges were doing | 04:47 |
masber | b1airo, so the job scheduler will provision the vm to run the job? | 04:48 |
b1airo | in the case i was talking about, the job is "a VM" | 04:51 |
b1airo | so once the job has started the user's new VM should be up and running for them within the cloud environment | 04:52 |
b1airo | rather than going to the dashboard or using IaaS APIs they requested it through SLURM | 04:52 |
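To make that pattern concrete (this is not the PSC/Bridges plugin itself, just a hedged sketch), a SLURM batch job whose whole payload is booting a VM through the OpenStack CLI could look roughly like the following; the flavor, image, network, and key names are placeholders and the auth environment is assumed to be pre-loaded:

```bash
#!/bin/bash
#SBATCH --job-name=vm-request
#SBATCH --nodes=1
#SBATCH --time=08:00:00

# Assumes OS_* credentials are already in the environment (e.g. sourced from an openrc file).
openstack server create --flavor m1.large --image centos7 \
    --network cluster-net --key-name mykey --wait "vm-${SLURM_JOB_ID}"

# Report the address back to the user, then hold the allocation for the job's lifetime;
# a real deployment would tear the VM down again in the job epilog.
openstack server show "vm-${SLURM_JOB_ID}" -f value -c addresses
sleep infinity
```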
masber | b1airo, I see, sounds interesting. I was wondering about deploying the OS and the job scheduler integration with the master using ironic | 04:53 |
b1airo | that is a possible approach to a more narrow and highly-managed HPC-integrated cloud service, which would probably be of interest if you are an HPC centre/shop but need some way to offer more flexible on-demand systems | 04:53 |
masber | and having the master on a vm already running | 04:53 |
b1airo | sorry, what do you mean by master in this case? | 04:54 |
masber | like the master host for sge | 04:55 |
masber | I was thinking of having the master on a vm and the clients on the baremetal | 04:58 |
masber | not sure if I explained properly | 05:04 |
b1airo | masber - ah right, your cluster master controller/s, at first i thought you meant Ironic master branch o_0 | 05:10 |
b1airo | i'd say having those sorts of service machines as VMs is pretty normal practice even without cloud thrown in | 05:12 |
masber | the key to me would be to find a way to install all the packages needed and do the integration with the job scheduler automatically | 05:12 |
masber | same if I want to shrink the HPC cluster | 05:12 |
masber | that way I could do things like assign nodes to HPC or to hadoop or VMs or containers | 05:13 |
masber | vms provided by nova, containers by zun_kuryr | 05:13 |
masber | for hadoop I was thinking of using apache ambari or sahara | 05:13 |
masber | but I am not sure how to provision the HPC after ironic | 05:14 |
masber | maybe ansible? | 05:14 |
masber | do you have an opinion about that? maybe I am crazy? | 05:14 |
*** simon-AS559 has joined #scientific-wg | 05:15 | |
*** simon-AS559 has quit IRC | 05:33 | |
*** priteau has joined #scientific-wg | 05:52 | |
*** priteau has quit IRC | 05:57 | |
*** priteau has joined #scientific-wg | 07:53 | |
*** priteau has quit IRC | 07:58 | |
*** priteau has joined #scientific-wg | 09:02 | |
*** jmlowe has quit IRC | 11:38 | |
*** jmlowe has joined #scientific-wg | 11:41 | |
*** rbudden has joined #scientific-wg | 11:57 | |
jmlowe | b1airo set the crush weight to zero, go have lunch, come back, take a nap, make dinner, get a good night's sleep, have brunch, then stop osd and follow the manual osd replacement procedure from the docs http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/#adding-osds-manual | 12:20 |
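Written out as commands, that drain-and-rebuild cycle for a single OSD looks roughly like this (a sketch only: osd.12 and /dev/sdX are placeholders, and the re-create step assumes luminous's ceph-volume):

```bash
# Drain: stop CRUSH from placing data on the OSD, then wait for backfill to finish.
ceph osd crush reweight osd.12 0
watch ceph -s                      # wait until every PG is active+clean again

# Manual removal, per the add-or-rm-osds doc linked above.
systemctl stop ceph-osd@12
ceph osd out 12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12

# Re-create the OSD as bluestore on the freed device.
ceph-volume lvm create --bluestore --data /dev/sdX
```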
jmlowe | masber rbudden uses puppet to deploy hpc (slurm) following ironic for https://www.psc.edu/bridges | 12:21 |
b1airo | masber, at Monash we use Ansible for those sorts of tasks, seems to work pretty well and is definitely a better choice for quick and simple stuff compared to Puppet | 12:22 |
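One plausible way to wire that together (a sketch; the flavor, image, playbook, and host names are all placeholders): let Nova/Ironic deploy the bare-metal node, then point Ansible at it to install the scheduler packages and join it to the cluster.

```bash
# Nova schedules this onto an Ironic node via the matching bare-metal flavor.
openstack server create --flavor bare-metal.compute --image centos7-hpc \
    --network cluster-net --key-name mykey --wait compute-01

# Pull out the address Nova reports and run the HPC provisioning play against it
# (the trailing comma makes ansible-playbook treat the value as an inline inventory).
IP=$(openstack server show compute-01 -f value -c addresses | grep -oE '([0-9]+\.){3}[0-9]+' | head -1)
ansible-playbook -i "${IP}," -u centos --become sge-exec-node.yml
```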
jmlowe | b1airo 17 hours later I'm still waiting for the first 1/3 to drain | 12:23 |
jmlowe | 5842308/90961059 objects misplaced (6.423%) | 12:23 |
jmlowe | 11152 active+clean | 12:23 |
jmlowe | 1198 active+remapped+backfill_wait | 12:23 |
jmlowe | 248 active+remapped+backfilling | 12:23 |
jmlowe | I did one node, it was relatively painless except for the draining | 12:24 |
b1airo | jmlowe, so you are letting the cluster recover completely before re-provisioning those OSDs? our biggest cluster has about 1400 OSDs now, so I think we'll probably have to take a shorter route to make it less painful and just reformat the current set of OSDs we're working on without doing any recovery | 12:24 |
jmlowe | well depending on how much free space you have.... | 12:25 |
b1airo | sure, and the crush topology | 12:25 |
jmlowe | I'm 1/3 full so I did 1/3 of the osd's and I expect the osd's to go up to 50% used | 12:25 |
jmlowe | I'll set the weights back up to normal and the next 1/3 down to zero so in another 18-20 hours I'll be able to convert the next 1/3 | 12:27 |
jmlowe | so I'm guessing with sleep the whole process is going to take about 96 hours | 12:28 |
jmlowe | I'll give this whole process a proper writeup somewhere | 12:30 |
jmlowe | I also did some contrarian things: I had my disks in raid0 pairs to cut down on the number of osd's; with async messaging and no journaling for the data, I'm splitting them and going jbod as part of the conversion | 12:32 |
jmlowe | raid0 pairs meant I had 50GB ssd journals rather than 25GB, performance was markedly better than jbod with smaller disks that the texas jetstream cloud uses | 12:33 |
jmlowe | I'm a bit sleep deprived, texas cloud has smaller journals and per disk osd's on identical hardware, if that wasn't clear | 12:35 |
jmlowe | b1airo if you drain there is no blocking, it's a lot less disruptive | 12:37 |
b1airo | shouldn't cause blocking either way unless you have an undersize issue? | 12:44 |
b1airo | jmlowe, interesting that the raid0 setup was better performing - for what benchmarks? i guess you get to a point where you have enough OSDs to soak up your peak workload so adding extra OSDs will just give you more headroom for extra clients that you don't need, whereas making existing OSDs faster (such as RAID0) will mean individual clients get better performance | 12:48 |
jmlowe | the only scenario where more osd's did better was small writes, once you got up above 4k for writes the larger journals kicked in | 12:49 |
jmlowe | I've got a graph somewhere | 12:50 |
b1airo | are you sure it was to do with the journal sizes? once you add them all up you should need a lot of sustained writes at high bandwidth before their size matters much. did you also change the sync interval? | 12:51 |
b1airo | have you started CephFS-ing in JetStream yet? | 12:52 |
jmlowe | not yet, I'm looking at manila and I'm kind of freaked out by the idea of the instances being able to directly talk to the ceph cluster | 12:52 |
jmlowe | I'd also have to change all of my ip addresses | 12:53 |
jmlowe | that being said, it's something I think I want to do, doubling my osd's will give me the overhead I need to add the cephfs pools, I was at 300 pgs/osd before | 12:55 |
jmlowe | I'm really not sure how raid controller cache with raid0 pairs is going to compare with being able to enable on disk cache | 13:00 |
b1airo | you mean the ~64MB on-disk cache? | 13:01 |
b1airo | in one Nova AZ I forced that into write-caching mode - but it is usually not power safe so... | 13:02 |
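For reference, that per-drive volatile write cache is the one hdparm can toggle (a sketch; /dev/sdX is a placeholder, and as noted it is generally not power-safe to leave it enabled without some form of power-loss protection):

```bash
hdparm -W /dev/sdX     # query the current write-cache state
hdparm -W1 /dev/sdX    # enable the drive's volatile write cache
hdparm -W0 /dev/sdX    # turn it back off (the conservative default)
```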
b1airo | re. CephFS + Manila, Red Hat is working towards NFS gateway as the standard mode I think | 13:03 |
b1airo | would expect to hear more on that in Sydney | 13:03 |
jmlowe | oh, I like the sound of that | 13:03 |
b1airo | we're going to start out soon running in native mode but only give access to "trusted" tenants | 13:04 |
b1airo | we have the ganesha cephfs-nfs gateway up on one cluster too, seems to work but our HA setup needs more testing | 13:05 |
jmlowe | I can live with the trusted service vm exporting cephfs over nfs, similar to the generic manila setup | 13:12 |
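For completeness, the tenant-side workflow for a CephFS share exposed over NFS via Manila looks roughly like this (a sketch, assuming the operator has published a share type here called cephfsnfs; names, sizes, and addresses are placeholders):

```bash
# Create a 100GB share on the CephFS-over-NFS backend and whitelist one client by IP.
manila create NFS 100 --name myshare --share-type cephfsnfs
manila access-allow myshare ip 10.0.0.15

# Look up the export path the Ganesha gateway publishes, then mount it as plain NFS.
manila share-export-location-list myshare
mount -t nfs <export-path-from-above> /mnt/myshare
```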
jmlowe | re cache, 768MB of dram vs 2048MB of flash | 13:37 |
*** simon-AS559 has joined #scientific-wg | 16:26 | |
*** simon-AS559 has quit IRC | 16:58 | |
trandles | masber: +1 what b1airo and jmlowe said about everything, they're fonts of knowledge ;) | 18:36 |
trandles | I'll be in Sydney for the Summit in November if you're attending. I'm happy to sit down and go over how we're using OpenStack with HPC at Los Alamos | 18:37 |
jmlowe | I'm not sure I'd call what I'm spewing knowledge :) | 18:37 |
trandles | there are several orthogonal use cases that we're addressing that largely fall along user-support and system-management lines | 18:37 |
trandles | to answer your immediate question, there's no direct integration between something like SLURM and OpenStack, but I've written SLURM plugins to do some integration | 18:38 |
trandles | the plugin source code release is being held up by the lab's legal process though :( Not for any technical reason, but for bureaucratic reasons | 18:39 |
trandles | jmlowe: you're just being modest :) | 18:39 |
*** jmlowe_ has joined #scientific-wg | 20:04 | |
*** jmlowe has quit IRC | 20:06 | |
*** simon-AS559 has joined #scientific-wg | 20:46 | |
*** simon-AS5591 has joined #scientific-wg | 20:50 | |
*** simon-AS559 has quit IRC | 20:51 | |
*** simon-AS5591 has quit IRC | 21:22 | |
*** priteau has quit IRC | 21:39 |