*** rbudden has quit IRC | 01:08 | |
*** priteau has joined #scientific-wg | 01:50 | |
*** priteau has quit IRC | 01:55 | |
*** rbudden has joined #scientific-wg | 02:02 | |
*** rbudden has quit IRC | 02:02 | |
masber | I see, we would like to deploy openstack to provision HPC but also VMs through nova and containers using zun+kuryr. Right now we use rocks cluster for deployment and to provision SGE (job scheduler); does openstack/ironic have any integration with a job scheduler? | 02:25 |
jmlowe | rbudden and trandles are the ones to ask about that | 02:26 |
jmlowe | masber I take it you are not in North America? | 02:26 |
masber | jmlowe, correct I am in Sydney (Australia) | 02:27 |
jmlowe | always seem to be on in the middle of the night for most of us | 02:27 |
jmlowe | b1airo might be on, he's from Nectar | 02:28 |
jmlowe | I'm up late babysitting my ceph cluster converting from filestore to bluestore | 02:28 |
masber | I would love to get in contact with the people from Nectar | 02:29 |
masber | jmlowe, bluestore looks quite nice, apparently it is a better fit for all-flash ceph environments? | 02:30 |
jmlowe | I'm estimating by Saturday when this is all said and done I will have moved around 750TB of data / 35M objects | 02:30 |
jmlowe | I think it's just better all around | 02:30 |
jmlowe | it really does CoW instead of emulating it with atomic filesystem operations | 02:31 |
jmlowe | gets rid of double write penalty | 02:31 |
jmlowe | if you are all flash, should make your system twice as fast and last twice as long | 02:32 |
masber | jmlowe, nice, I am playing with jewel as it is the one used by openstack-kolla | 02:32 |
jmlowe | all data is checksummed on disk | 02:32 |
jmlowe | on disk inline compression, they use some trick similar to zfs to make erasure coded pools for rbd usable | 02:33 |
jmlowe | the rados gw gets swift public buckets, great for hosting data sets | 02:33 |
jmlowe | I can't say enough good things about this luminous update | 02:34 |
masber | jmlowe, I tried to configure it by reading a document from intel, they said to pin each nvme drive to the cpu... couldn't figure out how to do that, in the end I just reduced the size of the objects (my files are not big) to get better performance, but I guess I can get much more than what I am doing now | 02:34 |
jmlowe | right, what you would do is pin the osd process to the cpu | 02:34 |
masber | yes, I am using NUMA which makes things a little bit more complicated I guess | 02:35 |
masber | I am not an expert, just reading and trying things around | 02:36 |
jmlowe | apparently you use numactl to do the pinning | 02:36 |
jmlowe | http://www.glennklockwood.com/hpc-howtos/process-affinity.html | 02:36 |
jmlowe | cpu pinning is an old hpc trick | 02:36 |
jmlowe | when you are counting floating point operations per clock tick getting yourself on the wrong numa node is to be avoided at all costs | 02:37 |
masber | yes something like this --> Ceph startup scripts need change with setaffinity=" numactl --membind=0 --cpunodebind=0 " | 02:38 |
masber | this is the docs I was reading http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments | 02:38 |
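For reference, the pinning that tuning doc describes can be approximated from the shell. A minimal sketch, assuming osd.0 is backed by nvme0; the device name, OSD id, and CPU list are illustrative, and the exact startup-script hook differs between distributions:

```bash
# Which NUMA node is the NVMe attached to? (PCI attribute; prints e.g. 0)
cat /sys/class/nvme/nvme0/device/numa_node

# One-off pinning of an already-running OSD process to that node's CPUs
# (replace <osd-pid> and the CPU list with your own values).
taskset -pc 0-15 <osd-pid>

# What the guide's setaffinity hook amounts to: launch the daemon under numactl
# so both its CPU scheduling and its memory allocations stay on the local node.
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 --setuser ceph --setgroup ceph
```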
masber | jmlowe, may I ask, have you used or heard about Panasas? | 02:40 |
jmlowe | Yeah, they were always too expensive for us, we have traditionally been a gpfs (came with the cluster from IBM) and lustre shop | 02:41 |
masber | jmlowe, we are currently using Panasas | 02:43 |
masber | yes very expensive | 02:43 |
masber | jmlowe, how do you use ceph? | 02:59 |
jmlowe | backs the openstack for jetstream-cloud.org | 03:01 |
masber | wow | 03:03 |
masber | spinning disks I guess? and are you using 25Gb/s network? | 03:05 |
jmlowe | 10gbs, 25Gbs wasn't really a thing when we were putting together the proposal 3 years ago | 03:05 |
jmlowe | from the time we found out we were getting the award to the time the hardware arrived was about 14 months if I remember correctly | 03:06 |
jmlowe | but yes 20x12x4TB 12Gbs sas spinning disks | 03:09 |
jmlowe | bonded 2x10Gbs networking | 03:10 |
*** priteau has joined #scientific-wg | 03:51 | |
*** priteau has quit IRC | 03:56 | |
b1airo | hi masber | 04:43 |
b1airo | just noticed this conversation | 04:43 |
masber | b1airo, hi | 04:43 |
b1airo | and hey jmlowe | 04:43 |
b1airo | what process are you using to convert those OSDs? | 04:43 |
b1airo | to your earlier question masber, Nova/Ironic don't currently have any official HPC scheduler integrations, but the placement-api and scheduler decoupling work that has been a focus of Nova dev for the last couple of cycles is going to make deep integration easier | 04:46 |
b1airo | there are people out there today doing interesting things with SLURM, e.g., using a SLURM job to essentially request a VM and a job prolog to convert a regular HPC compute node into a Nova node for the VM | 04:47 |
b1airo | i think that is something the folks at PSC running Bridges were doing | 04:47 |
masber | b1airo, so the job scheduler will provision the vm to run the job? | 04:48 |
b1airo | in the case i was talking about, the job is "a VM" | 04:51 |
b1airo | so once the job has started the user's new VM should be up and running for them within the cloud environment | 04:52 |
b1airo | rather than going to the dashboard or using IaaS APIs they requested it through SLURM | 04:52 |
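To make that pattern concrete (this is not the PSC/Bridges plugin itself, just a hedged sketch), a SLURM batch job whose whole payload is booting a VM through the OpenStack CLI could look roughly like the following; the flavor, image, network, and key names are placeholders and the auth environment is assumed to be pre-loaded:

```bash
#!/bin/bash
#SBATCH --job-name=vm-request
#SBATCH --nodes=1
#SBATCH --time=08:00:00

# Assumes OS_* credentials are already in the environment (e.g. sourced from an openrc file).
openstack server create --flavor m1.large --image centos7 \
    --network cluster-net --key-name mykey --wait "vm-${SLURM_JOB_ID}"

# Report the address back to the user, then hold the allocation for the job's lifetime;
# a real deployment would tear the VM down again in the job epilog.
openstack server show "vm-${SLURM_JOB_ID}" -f value -c addresses
sleep infinity
```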
masber | b1airo, I see, sounds interesting. I was wondering about deploying the OS and the job scheduler integration with the master using ironic | 04:53 |
b1airo | that is a possible approach to a more narrow and highly-managed HPC-integrated cloud service, which would probably be of interest if you are an HPC centre/shop but need some way to offer more flexible on-demand systems | 04:53 |
masber | and having the master on a vm already running | 04:53 |
b1airo | sorry, what do you mean by master in this case? | 04:54 |
masber | like the master host for sge | 04:55 |
masber | I was thinking of having the master on a vm and the clients on the baremetal | 04:58 |
masber | not sure if I explained properly | 05:04 |
b1airo | masber - ah right, your cluster master controller/s, at first i thought you meant Ironic master branch o_0 | 05:10 |
b1airo | i'd say having those sorts of service machines as VMs is pretty normal practice even without cloud thrown in | 05:12 |
masber | the key to me would be to find a way to install all the packages needed and do the integration with the job scheduler automatically | 05:12 |
masber | same if I want to shrink the HPC cluster | 05:12 |
masber | that way I could do things like assign nodes to HPC or to hadoop or VMs or containers | 05:13 |
masber | vms provided by nova, containers by zun_kuryr | 05:13 |
masber | for hadoop I was thinking of using apache ambari or sahara | 05:13 |
masber | but I am not sure how to provision the HPC after ironic | 05:14 |
masber | maybe ansible? | 05:14 |
masber | do you have an opinion about that? maybe I am crazy? | 05:14 |
*** simon-AS559 has joined #scientific-wg | 05:15 | |
*** simon-AS559 has quit IRC | 05:33 | |
*** priteau has joined #scientific-wg | 05:52 | |
*** priteau has quit IRC | 05:57 | |
*** priteau has joined #scientific-wg | 07:53 | |
*** priteau has quit IRC | 07:58 | |
*** priteau has joined #scientific-wg | 09:02 | |
*** jmlowe has quit IRC | 11:38 | |
*** jmlowe has joined #scientific-wg | 11:41 | |
*** rbudden has joined #scientific-wg | 11:57 | |
jmlowe | b1airo set the crush weight to zero, go have lunch, come back, take a nap, make dinner, get a good night's sleep, have brunch, then stop osd and follow the manual osd replacement procedure from the docs http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/#adding-osds-manual | 12:20 |
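Written out as commands, that drain-and-rebuild cycle for a single OSD looks roughly like this (a sketch only: osd.12 and /dev/sdX are placeholders, and the re-create step assumes luminous's ceph-volume):

```bash
# Drain: stop CRUSH from placing data on the OSD, then wait for backfill to finish.
ceph osd crush reweight osd.12 0
watch ceph -s                      # wait until every PG is active+clean again

# Manual removal, per the add-or-rm-osds doc linked above.
systemctl stop ceph-osd@12
ceph osd out 12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12

# Re-create the OSD as bluestore on the freed device.
ceph-volume lvm create --bluestore --data /dev/sdX
```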
jmlowe | masber rbudden uses puppet to deploy hpc (slurm) following ironic for https://www.psc.edu/bridges | 12:21 |
b1airo | masber, at Monash we use Ansible for those sorts of tasks, seems to work pretty well and is definitely a better choice for quick and simple stuff compared to Puppet | 12:22 |
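One plausible way to wire that together (a sketch; the flavor, image, playbook, and host names are all placeholders): let Nova/Ironic deploy the bare-metal node, then point Ansible at it to install the scheduler packages and join it to the cluster.

```bash
# Nova schedules this onto an Ironic node via the matching bare-metal flavor.
openstack server create --flavor bare-metal.compute --image centos7-hpc \
    --network cluster-net --key-name mykey --wait compute-01

# Pull out the address Nova reports and run the HPC provisioning play against it
# (the trailing comma makes ansible-playbook treat the value as an inline inventory).
IP=$(openstack server show compute-01 -f value -c addresses | grep -oE '([0-9]+\.){3}[0-9]+' | head -1)
ansible-playbook -i "${IP}," -u centos --become sge-exec-node.yml
```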
jmlowe | b1airo 17 hours later I'm still waiting for the first 1/3 to drain | 12:23 |
jmlowe | 5842308/90961059 objects misplaced (6.423%) | 12:23 |
jmlowe | 11152 active+clean | 12:23 |
jmlowe | 1198 active+remapped+backfill_wait | 12:23 |
jmlowe | 248 active+remapped+backfilling | 12:23 |
jmlowe | I did one node, it was relatively painless except for the draining | 12:24 |
b1airo | jmlowe, so you are letting the cluster recover completely before re-provisioning those OSDs? our biggest cluster has about 1400 OSDs now, so I think we'll probably have to take a shorter route to make it less painful and just reformat the current set of OSDs we're working on without doing any recovery | 12:24 |
jmlowe | well depending on how much free space you have.... | 12:25 |
b1airo | sure, and the crush topology | 12:25 |
jmlowe | I'm 1/3 full so I did 1/3 of the osd's and I expect the osd's to go up to 50% used | 12:25 |
jmlowe | I'll set the weights back up to normal and the next 1/3 down to zero so in another 18-20 hours I'll be able to convert the next 1/3 | 12:27 |
jmlowe | so I'm guessing with sleep the whole process is going to take about 96 hours | 12:28 |
jmlowe | I'll give this whole process a proper writeup somewhere | 12:30 |
jmlowe | I also did some contrarian things: I had my disks in raid0 pairs to cut down on the number of osd's; with async messaging and no journaling for the data, I'm splitting them and going jbod as part of the conversion | 12:32 |
jmlowe | raid0 pairs meant I had 50GB ssd journals rather than 25GB, performance was markedly better than jbod with smaller disks that the texas jetstream cloud uses | 12:33 |
jmlowe | I'm a bit sleep deprived, texas cloud has smaller journals and per disk osd's on identical hardware, if that wasn't clear | 12:35 |
jmlowe | b1airo if you drain there is no blocking, it's a lot less disruptive | 12:37 |
b1airo | shouldn't cause blocking either way unless you have an undersize issue? | 12:44 |
b1airo | jmlowe, interesting that the raid0 setup was better performing - for what benchmarks? i guess you get to a point where you have enough OSDs to soak up your peak workload so adding extra OSDs will just give you more headroom for extra clients that you don't need, whereas making existing OSDs faster (such as RAID0) will mean individual clients get better performance | 12:48 |
jmlowe | the only scenario where more osd's did better was small writes, once you got up above 4k for writes the larger journals kicked in | 12:49 |
jmlowe | I've got a graph somewhere | 12:50 |
b1airo | are you sure it was to do with the journal sizes? once you add them all up you should need a lot of sustained writes at high bandwidth before their size matters much. did you also change the sync interval? | 12:51 |
b1airo | have you started CephFS-ing in JetStream yet? | 12:52 |
jmlowe | not yet, I'm looking at manila and I'm kind of freaked out by the idea of the instances being able to directly talk to the ceph cluster | 12:52 |
jmlowe | I'd also have to change all of my ip addresses | 12:53 |
jmlowe | that being said, it's something I think I want to do, doubling my osd's will give me the overhead I need to add the cephfs pools, I was at 300 pgs/osd before | 12:55 |
jmlowe | I'm really not sure how raid controller cache with raid0 pairs is going to compare with being able to enable on disk cache | 13:00 |
b1airo | you mean the ~64MB on-disk cache? | 13:01 |
b1airo | in one Nova AZ I forced that into write-caching mode - but it is usually not power safe so... | 13:02 |
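For reference, that per-drive volatile write cache is the one hdparm can toggle (a sketch; /dev/sdX is a placeholder, and as noted it is generally not power-safe to leave it enabled without some form of power-loss protection):

```bash
hdparm -W /dev/sdX     # query the current write-cache state
hdparm -W1 /dev/sdX    # enable the drive's volatile write cache
hdparm -W0 /dev/sdX    # turn it back off (the conservative default)
```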
b1airo | re. CephFS + Manila, Red Hat is working towards NFS gateway as the standard mode I think | 13:03 |
b1airo | would expect to hear more on that in Sydney | 13:03 |
jmlowe | oh, I like the sound of that | 13:03 |
b1airo | we're going to start out soon running in native mode but only give access to "trusted" tenants | 13:04 |
b1airo | we have the ganesha cephfs-nfs gateway up on one cluster too, seems to work but our HA setup needs more testing | 13:05 |
jmlowe | I can live with the trusted service vm exporting cephfs over nfs, similar to the generic manila setup | 13:12 |
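For completeness, the tenant-side workflow for a CephFS share exposed over NFS via Manila looks roughly like this (a sketch, assuming the operator has published a share type here called cephfsnfs; names, sizes, and addresses are placeholders):

```bash
# Create a 100GB share on the CephFS-over-NFS backend and whitelist one client by IP.
manila create NFS 100 --name myshare --share-type cephfsnfs
manila access-allow myshare ip 10.0.0.15

# Look up the export path the Ganesha gateway publishes, then mount it as plain NFS.
manila share-export-location-list myshare
mount -t nfs <export-path-from-above> /mnt/myshare
```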
jmlowe | re cache, 768MB of dram vs 2048MB of flash | 13:37 |
*** simon-AS559 has joined #scientific-wg | 16:26 | |
*** simon-AS559 has quit IRC | 16:58 | |
trandles | masber: +1 what b1airo and jmlowe said about everything, they're fonts of knowledge ;) | 18:36 |
trandles | I'll be in Sydney for the Summit in November if you're attending. I'm happy to sit down and go over how we're using OpenStack with HPC at Los Alamos | 18:37 |
jmlowe | I'm not sure I'd call what I'm spewing knowledge :) | 18:37 |
trandles | there are several orthogonal use cases that we're addressing that largely fall along user-support and system-management lines | 18:37 |
trandles | to answer your immediate question, there's no direct integration between something like SLURM and OpenStack, but I've written SLURM plugins to do some integration | 18:38 |
trandles | the plugin source code release is being held up by the lab's legal process though :( Not for any technical reason, but for bureaucratic reasons | 18:39 |
trandles | jmlowe: you're just being modest :) | 18:39 |
*** jmlowe_ has joined #scientific-wg | 20:04 | |
*** jmlowe has quit IRC | 20:06 | |
*** simon-AS559 has joined #scientific-wg | 20:46 | |
*** simon-AS5591 has joined #scientific-wg | 20:50 | |
*** simon-AS559 has quit IRC | 20:51 | |
*** simon-AS5591 has quit IRC | 21:22 | |
*** priteau has quit IRC | 21:39 |