21:00:15 <oneswig> #startmeeting scientific-sig
21:00:16 <openstack> Meeting started Tue Aug 21 21:00:15 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:19 <openstack> The meeting name has been set to 'scientific_sig'
21:00:29 <oneswig> Hi
21:00:42 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_August_21st_2018
21:01:04 <oneswig> Martial sends his apologies as he is travelling
21:01:10 <rbudden> hello everyone
21:01:29 <oneswig> Blair's hoping to join us later - overlapping meeting
21:01:39 <oneswig> Hey rbudden! :-)
21:01:46 <rbudden> hello hello :)
21:01:54 <oneswig> How's BRIDGES?
21:02:12 <rbudden> it’s doing well
21:02:41 <rbudden> Our Queens HA setup is about to get some friendly users/staff for testing :)
21:02:55 <oneswig> I've been having almost too much fun working with these guys on a Queens system: https://www.euclid-ec.org
21:03:23 <oneswig> Mapping the dark matter of OpenStack...
21:03:23 <rbudden> Nice
21:03:59 <oneswig> It boils down to a bunch of heat stacks and 453 nodes in a Slurm partition with some whistles and bells attached
21:04:17 <rbudden> Sounds interesting
21:04:44 <oneswig> Found the scaling limits of Heat on our system surprisingly quickly!
21:05:06 <rbudden> Hehe, I haven’t used Heat much
21:05:09 <oneswig> rbudden: did you ever get your work on Slurm into the public domain?
21:05:28 <rbudden> I took the lazy programmer approach and wrote it all in bash originally, then converted it mostly to Ansible
21:05:34 <rbudden> Actually, I did not
21:05:48 <rbudden> I have piles that need to be pushed to our public GitHub
21:06:07 <oneswig> There's some interest from various places in common Ansible for OpenHPC+Slurm
21:06:10 <rbudden> It’s on my never ending todo list somewhere in there
21:06:21 <rbudden> ;)
21:06:26 <oneswig> Right, somewhere beyond the event horizon...
21:07:45 <oneswig> Are you going to the PTG rbudden? It's merged with the Ops meetup too this time
21:08:30 <rbudden> I am not
21:08:52 <rbudden> Unfortunately I don’t have much funding for travel right now
21:08:56 <oneswig> No, me neither - Martial's leading a session there
21:09:31 <rbudden> I’d have loved to get more involved with the PTG and Ops, but time and funding have always kept me away unfortunately!
21:09:41 <oneswig> Some of our team will be there but I'm of limited use being less hands-on than I'd like to be nowadays
21:10:06 <oneswig> What's new on BRIDGES, what have you been working on?
21:10:09 <rbudden> As long as we have someone there that’s good
21:10:35 <rbudden> Largely the Queens HA and looking into scaling Ironic better
21:10:50 <rbudden> We’ve also had a large influx of requests for Windows VMs from up on CMU campus
21:11:01 <oneswig> bare metal windows?
21:11:04 <rbudden> so we’ve been trying to patch together some infra to make those easier
21:11:10 <rbudden> no, virtual ATM
21:11:18 <oneswig> figures.
21:11:37 <rbudden> I’m starting to plan for our next major upgrade as well
21:11:54 <oneswig> what are you fighting with Ironic scaling?
21:12:05 <rbudden> so getting some multiconductor testing and attempting boot over OPA are high on the list
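[A minimal sketch, not from the meeting, of the kind of conductor-load check this multi-conductor discussion implies: tally Ironic nodes per conductor (or conductor group) and per provision state with openstacksdk. The clouds.yaml entry name "bridges" is hypothetical, and the "conductor" field is only exposed by newer Ironic APIs.]

    #!/usr/bin/env python3
    # Sketch: count Ironic nodes per conductor/conductor group and per provision
    # state, as a sanity check before multi-conductor testing.
    # Assumes openstacksdk and a clouds.yaml entry named "bridges" (hypothetical).
    from collections import Counter

    import openstack

    conn = openstack.connect(cloud="bridges")  # hypothetical cloud name

    per_conductor = Counter()
    per_state = Counter()

    for node in conn.baremetal.nodes(details=True):
        # "conductor" requires a newer Ironic API version; fall back gracefully.
        conductor = (getattr(node, "conductor", None)
                     or getattr(node, "conductor_group", None)
                     or "default")
        per_conductor[conductor] += 1
        per_state[node.provision_state] += 1

    print("nodes per conductor/group:", dict(per_conductor))
    print("nodes per provision state:", dict(per_state))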
21:12:40 <oneswig> IPoIB interface on OPA?
21:12:43 <rbudden> yes
21:13:00 <rbudden> i had read the 7.4 centos kernel has stock drivers that could be used for the deploy
21:13:07 <rbudden> I’m not sure if anyone has tried it though
21:13:33 <oneswig> The OPA system I've been working on has the Intel HFI driver packages
21:13:43 <rbudden> i’d love to deploy nodes via IPoIB vs our lonely GigE private network ;)
21:13:50 <oneswig> I'd be surprised if the in-box driver is new enough.
21:14:14 <rbudden> our last upgrade to 7.4 took longer than expected pushing out updates via puppet/scripts
21:14:17 <oneswig> Last time I looked was end of last year though
21:14:30 <rbudden> so we’d like to attempt to reimage all 900+ nodes as fast as possible via Ironic this time
21:14:51 <rbudden> yeah, I admit I haven’t done much looking other than finding the pieces are there
21:14:58 <rbudden> i.e. PXE boot via OPA has been done
21:15:00 <oneswig> This might interest you - proposal for Ironic native RBD boot from Ceph volume - https://storyboard.openstack.org/#!/story/2003515
21:15:16 <rbudden> generic OPA IPoIB drivers should be in the mainline kernel now (whether they work is another story)
21:15:29 <rbudden> cool, thx, i’ll check it out
21:16:04 <oneswig> Just a framework for an idea right now but one we've already tried to explore ourselves, because it's a natural fit
21:16:32 <rbudden> we’ve only had limited experience with Ceph
21:16:42 <rbudden> and so far we’ve seen incredibly bad performance
21:16:56 <rbudden> I’ve been meaning to solicit some advice/stats from the group
21:17:04 <rbudden> on setups/performance numbers
21:17:39 <oneswig> Ceph's single-node performance is lame in comparison but the scaling's excellent, particularly for object and block. File is still a little doubtful.
21:17:52 <oneswig> by single-node I mean single-client
21:18:02 <rbudden> ok, interesting
21:18:09 <rbudden> that’s what we were curious about
21:18:15 <rbudden> especially given this:
21:18:16 <rbudden> https://docs.openstack.org/performance-docs/latest/test_results/ceph_testing/index.html#ceph-rbd-performance-results-50-osd
21:18:32 <oneswig> Our team has recently been working on BeeGFS deployed via ansible
21:18:35 <rbudden> it looks like single-client performance is a small fraction of what the hardware is capable of
21:19:00 <rbudden> but i was curious if that was due to scaling/sustaining those numbers in parallel across a large number of clients
21:19:05 <oneswig> That matches my experience, but 20 clients will get 75% of the raw performance of the hardware
21:19:19 <rbudden> we have a very small amount of hardware to dedicate
21:19:26 <oneswig> I bet I have less :-)
21:19:27 <rbudden> say 3-4 servers for Ceph
21:19:33 <oneswig> yup, won hands down :-)
21:20:06 <rbudden> so far the best we’ve seen was 15 MB/s writes, though reads were maybe between 50-100 MB/s
21:20:13 <rbudden> pretty atrocious
21:20:14 <oneswig> The EUCLID deploy's home dir is CephFS, served from 2 OSDs. It's a bit ad-hoc, all we had available at the time.
21:20:35 <oneswig> rbudden: wow, that's not good!
21:20:39 <oneswig> by any measure
21:20:41 <rbudden> there must be some configuration issues
21:21:00 <rbudden> yeah, not sure what’s wrong just yet
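[A minimal sketch for sanity-checking single-client numbers like the 15 MB/s quoted above: a quick RBD write loop using the python3-rados / python3-rbd bindings. The pool name "scratch" is hypothetical, this is not what PSC ran, and a proper tool such as fio with the rbd engine is the better measure.]

    #!/usr/bin/env python3
    # Quick-and-dirty single-client RBD write throughput check.
    # Writes 1 GiB in 4 MiB chunks to a scratch image, then removes it.
    import time

    import rados
    import rbd

    CHUNK = 4 * 1024 * 1024        # 4 MiB writes, roughly RBD object-sized
    TOTAL = 1024 * 1024 * 1024     # 1 GiB in total

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("scratch")      # hypothetical pool name
    try:
        rbd.RBD().create(ioctx, "bench-img", TOTAL)
        image = rbd.Image(ioctx, "bench-img")
        try:
            data = b"\0" * CHUNK
            start = time.time()
            for offset in range(0, TOTAL, CHUNK):
                image.write(data, offset)
            image.flush()
            elapsed = time.time() - start
            print("single-client write: %.1f MB/s" % (TOTAL / elapsed / 1e6))
        finally:
            image.close()
            rbd.RBD().remove(ioctx, "bench-img")
    finally:
        ioctx.close()
        cluster.shutdown()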
21:21:38 <rbudden> we pulled in Mike from IU to try and spot something obvious ;)
21:21:56 <oneswig> On performance, one of the team at Cambridge has been benchmarking BeeGFS vs Lustre on the same OPA hardware.
21:22:05 <rbudden> oh yeah?
21:22:11 <rbudden> we have both at PSC as well
21:22:15 <oneswig> Mike's a great call :-)!
21:22:24 <oneswig> Write performance was equivalent.
21:22:28 <rbudden> our Olympus cluster has both Lustre and BeeGFS
21:22:40 <oneswig> BeeGFS ~20% faster for reads
21:22:45 <rbudden> very interesting
21:22:50 <oneswig> Quite a surprise
21:22:56 <rbudden> we’ve been toying with using BeeGFS on Bridges
21:23:25 <rbudden> since we already mount Olympus filesystems on both
21:23:26 <oneswig> There seems to be good momentum behind it currently. It's pretty easy to set up and use
21:23:37 <rbudden> I’ll have to relay the performance numbers to ppl here
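[A minimal sketch of how a read-bandwidth comparison like the one above could be tabulated from two fio runs (fio --output-format=json) against a BeeGFS mount and a Lustre mount. The result file names are hypothetical; the numbers quoted in the meeting came from the Cambridge tests, not from this script.]

    #!/usr/bin/env python3
    # Compare aggregate read bandwidth from two fio JSON result files.
    import json

    def read_bw_mbs(path):
        """Sum read bandwidth across all jobs in one fio JSON result, in MB/s."""
        with open(path) as f:
            result = json.load(f)
        # fio reports per-job "bw" in KiB/s in its JSON output
        return sum(job["read"]["bw"] for job in result["jobs"]) * 1024 / 1e6

    beegfs = read_bw_mbs("fio-beegfs.json")    # hypothetical file names
    lustre = read_bw_mbs("fio-lustre.json")
    print("BeeGFS reads : %.0f MB/s" % beegfs)
    print("Lustre reads : %.0f MB/s" % lustre)
    print("BeeGFS/Lustre: %.2fx" % (beegfs / lustre))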
21:23:40 <oneswig> Olympus?
21:23:54 <rbudden> it’s a small cluster from our Public Health group
21:23:58 <oneswig> I'll see if I can get graphs for you.
21:24:01 <rbudden> exact hardware as Bridges
21:24:21 <rbudden> just a separate Slurm scheduler to manage them
21:24:33 <rbudden> we’ve even folded their cluster under Ironic now
21:25:12 <oneswig> Nice work. Hopefully not twice the effort for you now!
21:25:15 <rbudden> I think that might put us over 1000 nodes on a single conductor :P
21:25:44 <rbudden> not that i’d normally advise that… but we only image a few nodes a week at this point
21:25:54 <oneswig> Any additional concerns taken into account for public health data on that system?
21:26:16 <rbudden> not to my knowledge. we don’t do any classified work, etc.
21:26:34 <rbudden> for nothing fancy with network isolation or anything of that nature
21:26:40 <rbudden> s/for/so/
21:26:59 <oneswig> Got it.
21:27:30 <oneswig> It's a requirement that is proliferating though
21:27:36 <rbudden> indeed
21:27:52 <rbudden> we’ve done some small projects that required HIPAA with the local medical community
21:28:03 <oneswig> on Bridges?
21:28:24 <rbudden> i’m not sure if it was Bridges or a small subproject on our DXC hardware
21:28:38 <rbudden> i’d have to check and make sure i’m not mixing things up ;)
21:28:54 <rbudden> I know being compliant for securing big data is on our radar
21:29:11 <oneswig> hipaa compliance?
21:29:34 <rbudden> that’s one, but i believe there are others
21:29:50 <rbudden> familiar with HIPAA?
21:30:28 <oneswig> faintly at best..
21:30:37 <rbudden> it’s a health care act to help protect sensitive patient data
21:30:45 <rbudden> or should i say regulation ;)
21:31:30 <rbudden> oneswig: sorry to jet, but i need to step offline for daycare pickup!
21:31:43 <rbudden> we should follow up offline, it’s been a while since we chatted in Vancouver
21:31:44 <oneswig> NP, good to talk
21:31:54 <oneswig> I'll mail re: cambridge data
21:31:54 <rbudden> i’ve been meaning to reach out
21:32:00 <rbudden> sounds good
21:32:04 <oneswig> rbudden: always a pleasure
21:32:14 <rbudden> indeed
21:32:18 <rbudden> tty soon
21:38:53 <b1air> am i too late?
21:39:03 <oneswig> b1air: howzit?
21:39:15 <b1air> a'ight, you?
21:39:19 <oneswig> Was just chatting with rbudden, he's dropped off
21:39:34 <oneswig> Good thanks. How's the new role going?
21:40:16 <b1air> i'm in the thick of having my head exploded by all the context i'm trying to soak up
21:40:29 <b1air> but otherwise good
21:40:58 <oneswig> Are you going to be hands-on with the Cray?
21:41:24 <b1air> at least i now have a vague idea of what preconceptions the organisation has about my role, i honestly didn't know what i was really signing up for to start off with, just that it would probably be quite different from what i was doing at Monash
21:41:57 <b1air> i will be unlikely to have root access there anytime soon
21:42:33 <b1air> NIWA seem to like to keep systems stuff pretty closely guarded
21:43:08 <b1air> how was your holiday?
21:43:47 <oneswig> Very good indeed. We hiked in the Pyrenees and it was ruddy hot
21:44:59 <b1air> nice
21:45:24 <oneswig> Perhaps we should close the meeting, and then I'll reply to your email :-)
21:45:35 <b1air> i need to come up with a hiking/biking/sailing itinerary for summer here!
21:45:48 <oneswig> b1air: in NZ, it's not hard, surely mate
21:46:05 <b1air> the poor boat needs a bit of love first
21:46:17 <oneswig> sandpaper = tough love ?
21:46:48 <b1air> lol, water blaster might be enough i hope! anyway, sure, let's continue to be penpals :-)
21:47:10 <oneswig> ok bud, have a good day down there
21:47:20 <b1air> cheers, over and out
21:47:28 <oneswig> until next time
21:47:30 <oneswig> #endmeeting