21:00:15 <oneswig> #startmeeting scientific-sig
21:00:16 <openstack> Meeting started Tue Aug 21 21:00:15 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:19 <openstack> The meeting name has been set to 'scientific_sig'
21:00:29 <oneswig> Hi
21:00:42 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_August_21st_2018
21:01:04 <oneswig> Martial sends his apologies as he is travelling
21:01:10 <rbudden> hello everyone
21:01:29 <oneswig> Blair's hoping to join us later - overlapping meeting
21:01:39 <oneswig> Hey rbudden! :-)
21:01:46 <rbudden> hello hello :)
21:01:54 <oneswig> How's BRIDGES?
21:02:12 <rbudden> it’s doing well
21:02:41 <rbudden> Our Queens HA setup is about to get some friendly users/staff for testing :)
21:02:55 <oneswig> I've been having almost too much fun working with these guys on a Queens system: https://www.euclid-ec.org
21:03:23 <oneswig> Mapping the dark matter of OpenStack...
21:03:23 <rbudden> Nice
21:03:59 <oneswig> It boils down to a bunch of heat stacks and 453 nodes in a Slurm partition with some whistles and bells attached
21:04:17 <rbudden> Sounds interesting
21:04:44 <oneswig> Found the scaling limits of Heat on our system surprisingly quickly!
21:05:06 <rbudden> Hehe, I haven’t used Heat much
21:05:09 <oneswig> rbudden: did you ever get your work on Slurm into the public domain?
21:05:28 <rbudden> I took the lazy programmer approach and wrote it all in bash originally, then converted it mostly to Ansible
21:05:34 <rbudden> Actually, I did not
21:05:48 <rbudden> I have piles that need to be pushed to our public GitHub
21:06:07 <oneswig> There's some interest from various places in common Ansible for OpenHPC+Slurm
21:06:10 <rbudden> It’s on my never ending todo list somewhere in there
21:06:21 <rbudden> ;)
21:06:26 <oneswig> Right, somewhere beyond the event horizon...
21:07:45 <oneswig> Are you going to the PTG rbudden? It's merged with the Ops meetup too this time
21:08:30 <rbudden> I am not
21:08:52 <rbudden> Unfortunately I don’t have much funding for travel right now
21:08:56 <oneswig> No, me neither - Martial's leading a session there
21:09:31 <rbudden> I’d have loved to get more involved with the PTG and Ops, but time and funding have always kept me away unfortunately!
21:09:41 <oneswig> Some of our team will be there but I'm of limited use being less hands-on than I'd like to be nowadays
21:10:06 <oneswig> What's new on BRIDGES, what have you been working on?
21:10:09 <rbudden> As long as we have someone there that’s good
21:10:35 <rbudden> Largely the Queens HA and looking into scaling Ironic better
21:10:50 <rbudden> We’ve also had a large influx of requests for Windows VMs from up on CMU campus
21:11:01 <oneswig> bare metal windows?
21:11:04 <rbudden> so we’ve been trying to patch together some infra to make those easier
21:11:10 <rbudden> no, virtual ATM
21:11:18 <oneswig> figures.
21:11:37 <rbudden> I’m starting to plan for our next major upgrade as well
21:11:54 <oneswig> what are you fighting with Ironic scaling?
21:12:05 <rbudden> so getting some multiconductor testing and attempting boot over OPA are high on the list
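[A minimal sketch, not from the meeting, of the kind of conductor-load check this multi-conductor discussion implies: tally Ironic nodes per conductor (or conductor group) and per provision state with openstacksdk. The clouds.yaml entry name "bridges" is hypothetical, and the "conductor" field is only exposed by newer Ironic APIs.]

    #!/usr/bin/env python3
    # Sketch: count Ironic nodes per conductor/conductor group and per provision
    # state, as a sanity check before multi-conductor testing.
    # Assumes openstacksdk and a clouds.yaml entry named "bridges" (hypothetical).
    from collections import Counter

    import openstack

    conn = openstack.connect(cloud="bridges")  # hypothetical cloud name

    per_conductor = Counter()
    per_state = Counter()

    for node in conn.baremetal.nodes(details=True):
        # "conductor" requires a newer Ironic API version; fall back gracefully.
        conductor = (getattr(node, "conductor", None)
                     or getattr(node, "conductor_group", None)
                     or "default")
        per_conductor[conductor] += 1
        per_state[node.provision_state] += 1

    print("nodes per conductor/group:", dict(per_conductor))
    print("nodes per provision state:", dict(per_state))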
21:12:40 <oneswig> IPoIB interface on OPA?
21:12:43 <rbudden> yes
21:13:00 <rbudden> i had read the 7.4 centos kernel has stock drivers that could be used for the deploy
21:13:07 <rbudden> I’m not sure if anyone has tried it though
21:13:33 <oneswig> The OPA system I've been working on has the Intel HFI driver packages
21:13:43 <rbudden> i’d love to deploy nodes via IPoIB vs our lonely GigE private network ;)
21:13:50 <oneswig> I'd be surprised if the in-box driver is new enough.
21:14:14 <rbudden> our last upgrade to 7.4 took longer than expected pushing out updates via puppet/scripts
21:14:17 <oneswig> Last time I looked was end of last year though
21:14:30 <rbudden> so we’d like to attempt to reimage all 900+ nodes as fast as possible via Ironic this time
21:14:51 <rbudden> yeah, I admit I haven’t done much looking other than finding the pieces are there
21:14:58 <rbudden> i.e. PXE boot via OPA has been done
21:15:00 <oneswig> This might interest you - proposal for Ironic native RBD boot from Ceph volume - https://storyboard.openstack.org/#!/story/2003515
21:15:16 <rbudden> generic OPA IPoIB drivers should be in the mainline kernel now (whether they work is another story)
21:15:29 <rbudden> cool, thx, i’ll check it out
21:16:04 <oneswig> Just a framework for an idea right now but one we've already tried to explore ourselves, because it's a natural fit
21:16:32 <rbudden> we’ve only had limited experience with Ceph
21:16:42 <rbudden> and so far we’ve seen incredibly bad performance
21:16:56 <rbudden> I’ve been meaning to solicit some advice/stats from the group
21:17:04 <rbudden> on setups/performance numbers
21:17:39 <oneswig> Ceph's single-node performance is lame in comparison but the scaling's excellent, particularly for object and block. File is still a little doubtful.
21:17:52 <oneswig> by single-node I mean single-client
21:18:02 <rbudden> ok, interesting
21:18:09 <rbudden> that’s what we were curious about
21:18:15 <rbudden> especially given this:
21:18:16 <rbudden> https://docs.openstack.org/performance-docs/latest/test_results/ceph_testing/index.html#ceph-rbd-performance-results-50-osd
21:18:32 <oneswig> Our team has recently been working on BeeGFS deployed via ansible
21:18:35 <rbudden> it looks like single-client performance is a small fraction of what the hardware is capable of
21:19:00 <rbudden> but i was curious if that was due to scaling/sustaining those numbers in parallel across a large number of clients
21:19:05 <oneswig> That matches my experience, but 20 clients will get 75% of the raw performance of the hardware
21:19:19 <rbudden> we have a very small amount of hardware to dedicate
21:19:26 <oneswig> I bet I have less :-)
21:19:27 <rbudden> say 3-4 servers for Ceph
21:19:33 <oneswig> yup, won hands down :-)
21:20:06 <rbudden> so far the best we’ve seen was 15 MB/s writes, though reads were maybe between 50-100 MB/s
21:20:13 <rbudden> pretty atrocious
21:20:14 <oneswig> The EUCLID deploy's home dir is CephFS, served from 2 OSDs. It's a bit ad-hoc, all we had available at the time.
21:20:35 <oneswig> rbudden: wow, that's not good!
21:20:39 <oneswig> by any measure
21:20:41 <rbudden> there must be some configuration issues
21:21:00 <rbudden> yeah, not sure what’s wrong just yet
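[A minimal sketch for sanity-checking single-client numbers like the 15 MB/s quoted above: a quick RBD write loop using the python3-rados / python3-rbd bindings. The pool name "scratch" is hypothetical, this is not what PSC ran, and a proper tool such as fio with the rbd engine is the better measure.]

    #!/usr/bin/env python3
    # Quick-and-dirty single-client RBD write throughput check.
    # Writes 1 GiB in 4 MiB chunks to a scratch image, then removes it.
    import time

    import rados
    import rbd

    CHUNK = 4 * 1024 * 1024        # 4 MiB writes, roughly RBD object-sized
    TOTAL = 1024 * 1024 * 1024     # 1 GiB in total

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("scratch")      # hypothetical pool name
    try:
        rbd.RBD().create(ioctx, "bench-img", TOTAL)
        image = rbd.Image(ioctx, "bench-img")
        try:
            data = b"\0" * CHUNK
            start = time.time()
            for offset in range(0, TOTAL, CHUNK):
                image.write(data, offset)
            image.flush()
            elapsed = time.time() - start
            print("single-client write: %.1f MB/s" % (TOTAL / elapsed / 1e6))
        finally:
            image.close()
            rbd.RBD().remove(ioctx, "bench-img")
    finally:
        ioctx.close()
        cluster.shutdown()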
21:21:38 <rbudden> we pulled in Mike from IU to try and spot something obvious ;)
21:21:56 <oneswig> On performance, one of the team at Cambridge has been benchmarking BeeGFS vs Lustre on the same OPA hardware.
21:22:05 <rbudden> oh yeah?
21:22:11 <rbudden> we have both at PSC as well
21:22:15 <oneswig> Mike's a great call :-)!
21:22:24 <oneswig> Write performance was equivalent.
21:22:28 <rbudden> our Olympus cluster has both Lustre and BeeGFS
21:22:40 <oneswig> BeeGFS ~20% faster for reads
21:22:45 <rbudden> very interesting
21:22:50 <oneswig> Quite a surprise
21:22:56 <rbudden> we’ve been toying with using BeeGFS on Bridges
21:23:25 <rbudden> since we already mount Olympus filesystems on both
21:23:26 <oneswig> There seems to be good momentum behind it currently. It's pretty easy to set up and use
21:23:37 <rbudden> I’ll have to relay the performance numbers to ppl here
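[A minimal sketch of how a read-bandwidth comparison like the one above could be tabulated from two fio runs (fio --output-format=json) against a BeeGFS mount and a Lustre mount. The result file names are hypothetical; the numbers quoted in the meeting came from the Cambridge tests, not from this script.]

    #!/usr/bin/env python3
    # Compare aggregate read bandwidth from two fio JSON result files.
    import json

    def read_bw_mbs(path):
        """Sum read bandwidth across all jobs in one fio JSON result, in MB/s."""
        with open(path) as f:
            result = json.load(f)
        # fio reports per-job "bw" in KiB/s in its JSON output
        return sum(job["read"]["bw"] for job in result["jobs"]) * 1024 / 1e6

    beegfs = read_bw_mbs("fio-beegfs.json")    # hypothetical file names
    lustre = read_bw_mbs("fio-lustre.json")
    print("BeeGFS reads : %.0f MB/s" % beegfs)
    print("Lustre reads : %.0f MB/s" % lustre)
    print("BeeGFS/Lustre: %.2fx" % (beegfs / lustre))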
21:23:40 <oneswig> Olympus?
21:23:54 <rbudden> it’s a small cluster from our Public Health group
21:23:58 <oneswig> I'll see if I can get graphs for you.
21:24:01 <rbudden> exact hardware as Bridges
21:24:21 <rbudden> just a separate Slurm scheduler to manage them
21:24:33 <rbudden> we’ve even folded their cluster under Ironic now
21:25:12 <oneswig> Nice work. Hopefully not twice the effort for you now!
21:25:15 <rbudden> I think that might put us over 1000 nodes on a single conductor :P
21:25:44 <rbudden> not that i’d normally advise that… but we only image a few nodes a week at this point
21:25:54 <oneswig> Any additional concerns taken into account for public health data on that system?
21:26:16 <rbudden> not to my knowledge. we don’t do any classified work, etc.
21:26:34 <rbudden> for nothing fancy with network isolation or anything of that nature
21:26:40 <rbudden> s/for/so/
21:26:59 <oneswig> Got it.
21:27:30 <oneswig> It's a requirement that is proliferating though
21:27:36 <rbudden> indeed
21:27:52 <rbudden> we’ve done some small projects that required HIPAA with the local medical community
21:28:03 <oneswig> on Bridges?
21:28:24 <rbudden> i’m not sure if it was Bridges or a small subproject on our DXC hardware
21:28:38 <rbudden> i’d have to check and make sure i’m not mixing things up ;)
21:28:54 <rbudden> I know being compliant for securing big data is on our radar
21:29:11 <oneswig> hipaa compliance?
21:29:34 <rbudden> that’s one, but i believe there are others
21:29:50 <rbudden> familiar with HIPAA?
21:30:28 <oneswig> faintly at best..
21:30:37 <rbudden> it’s a health care act to help protect sensitive patient data
21:30:45 <rbudden> or should i say regulation ;)
21:31:30 <rbudden> oneswig: sorry to jet, but i need to step offline for daycare pickup!
21:31:43 <rbudden> we should follow up offline, it’s been a while since we chatted in Vancouver
21:31:44 <oneswig> NP, good to talk
21:31:54 <oneswig> I'll mail re: cambridge data
21:31:54 <rbudden> i’ve been meaning to reach out
21:32:00 <rbudden> sounds good
21:32:04 <oneswig> rbudden: always a pleasure
21:32:14 <rbudden> indeed
21:32:18 <rbudden> tty soon
21:38:53 <b1air> am i too late?
21:39:03 <oneswig> b1air: howzit?
21:39:15 <b1air> a'ight, you?
21:39:19 <oneswig> Was just chatting with rbudden, he's dropped off
21:39:34 <oneswig> Good thanks. How's the new role going?
21:40:16 <b1air> i'm in the thick of having my head exploded by all the context i'm trying to soak up
21:40:29 <b1air> but otherwise good
21:40:58 <oneswig> Are you going to be hands-on with the Cray?
21:41:24 <b1air> at least i now have a vague idea of what preconceptions the organisation has about my role, i honestly didn't know what i was really signing up for to start off with, just that it would probably be quite different from what i was doing at Monash
21:41:57 <b1air> i will be unlikely to have root access there anytime soon
21:42:33 <b1air> NIWA seem to like to keep systems stuff pretty closely guarded
21:43:08 <b1air> how was your holiday?
21:43:47 <oneswig> Very good indeed. We hiked in the Pyrenees and it was ruddy hot
21:44:59 <b1air> nice
21:45:24 <oneswig> Perhaps we should close the meeting, and then I'll reply to your email :-)
21:45:35 <b1air> i need to come up with a hiking/biking/sailing itinerary for summer here!
21:45:48 <oneswig> b1air: in NZ, it's not hard, surely mate
21:46:05 <b1air> the poor boat needs a bit of love first
21:46:17 <oneswig> sandpaper = tough love ?
21:46:48 <b1air> lol, water blaster might be enough i hope! anyway, sure, let's continue to be penpals :-)
21:47:10 <oneswig> ok bud, have a good day down there
21:47:20 <b1air> cheers, over and out
21:47:28 <oneswig> until next time
21:47:30 <oneswig> #endmeeting