11:00:24 <oneswig> #startmeeting scientific-wg
11:00:25 <openstack> Meeting started Wed Jul 19 11:00:24 2017 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:28 <openstack> The meeting name has been set to 'scientific_wg'
11:00:31 <verdurin> Afternoon, oneswig
11:00:40 <oneswig> Precisely so
11:00:46 <oneswig> afternoon verdurin
11:01:08 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_July_19th_2017
11:01:16 <Pelalil> oneswig, tom here, thanks for the email reminder 30min ago :)
11:01:25 <daveholland> (I'm in a training course so just lurking in this meeting)
11:01:30 <oneswig> Hi Tom, very welcome
11:01:40 <oneswig> #chair martial
11:01:41 <openstack> Current chairs: martial oneswig
11:02:52 <oneswig> We have a shortish agenda for today but the main event is an interesting presentation on research by James at Sanger
11:03:19 <zz9pzza> Hi
11:03:27 <oneswig> greetings :-)
11:03:41 <verdurin> Hello
11:03:51 <oneswig> #topic OpenStack and Lustre
11:04:06 <oneswig> #link presentation on Lustre is at https://docs.google.com/presentation/d/1kGRzcdVQX95abei1bDVoRzxyC02i89_m5_sOfp8Aq6o/edit#slide=id.g22aee564af_0_0
11:04:54 <oneswig> zz9pzza: can you tell us a bit about the background?
11:05:13 <zz9pzza> One of the things we have been worrying about is POSIX file access in OpenStack
11:05:57 <b1airo> o/
11:06:03 <oneswig> Hi b1airo
11:06:06 <oneswig> #chair b1airo
11:06:06 <zz9pzza> We did some work with DDN ( not using anything non-standard or closed-source ) to see if we could allow access to Lustre from different tenants to different bits of a Lustre filesystem
11:06:07 <openstack> Current chairs: b1airo martial oneswig
11:06:08 <b1airo> evening
11:06:12 <zz9pzza> Hi
11:06:36 <zz9pzza> The short story is I think we could provide it as a service  and we are definitely considering doing so
11:06:46 <zz9pzza> We have one tenant about to use it in anger
11:06:53 <verdurin> Hello b1airo
11:06:54 <oneswig> Does it depend on the DDN Lustre distribution?
11:06:56 <zz9pzza> No
11:07:07 <zz9pzza> You need to have at least Lustre 2.9
11:07:14 <zz9pzza> And 2.10 is an LTS release
11:07:27 <zz9pzza> There's an LU bug that needs fixing but the fix is public
11:07:42 <verdurin> Ah - was going to ask if any fixes had gone upstream
11:07:48 <zz9pzza> https://jira.hpdd.intel.com/browse/LU-9289
11:08:03 <zz9pzza> it's a one-line fix for a null-termination error
11:08:10 <zz9pzza> ( from memory )
11:08:32 <Pelalil> zz9pzza, are you using the ddn software stack, or just the ddn hardware with your own lustre install?
11:08:32 <zz9pzza> It says it is in 2.10
11:09:16 <zz9pzza> We use DDN hardware. I believe the software is a ddn release but it is not required. DDN reinstalled a set of old servers for us as we didn't have the time
11:09:54 <zz9pzza> ( We used to install our own lustre servers but people wanted a more appliance approach  ) We could still do it but we are a bit busy
11:10:14 <zz9pzza> Basically the approach is to use Lustre routers to isolate the tenant
11:10:40 <zz9pzza> Use the 2.9 filesets and UID mapping so each tenant gets a different @tcp network
11:11:05 <zz9pzza> We tried with both physical and virtual Lustre routers; both work
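[Editor's note: the isolation mechanism described here rests on Lustre 2.9+ nodemaps: clients behind a tenant's LNET router arrive from a distinct NID range, which the servers map to a restricted fileset and squashed/remapped IDs. A minimal sketch of the server-side `lctl` commands follows; every name and number (tenant1, the NID range, the uid/gid values, the tcp1 network) is purely illustrative, not taken from the Sanger setup.]

```shell
# Create a nodemap for the tenant and attach the NID range its clients
# arrive from (here, addresses behind the tenant's router on tcp1)
lctl nodemap_add tenant1
lctl nodemap_add_range --name tenant1 --range '192.168.10.[1-254]@tcp1'

# Drop admin/trusted so tenant root is squashed and IDs are remapped
lctl nodemap_modify --name tenant1 --property admin --value 0
lctl nodemap_modify --name tenant1 --property trusted --value 0

# Confine the tenant to a subdirectory of the filesystem (2.9 filesets)
lctl set_param -P nodemap.tenant1.fileset=/tenant1

# Map the tenant-side uid/gid onto a per-tenant server-side ID; the
# single-ID model discussed below needs just one entry of each type
lctl nodemap_add_idmap --name tenant1 --idtype uid --idmap 1000:20001
lctl nodemap_add_idmap --name tenant1 --idtype gid --idmap 1000:20001

# Turn nodemap enforcement on
lctl nodemap_activate 1
```

These commands require a running Lustre MGS and are shown as a configuration fragment only.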
11:11:36 <b1airo> why the need for LNET routing, is that to integrate with OpenStack tenant/private networks ?
11:12:03 <zz9pzza> It's because you can't trust the tenant not to alter the network, and the isolation is based on the Lustre NID
11:12:39 <zz9pzza> The routers are in the control of the openstack/filesystem operators
11:12:46 <b1airo> ahh ok, don't know anything about how the isolation works, guess that is a hard requirement then!
11:13:19 <verdurin> zz9pzza: could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?
11:13:24 <oneswig> zz9pzza: how does the special role fit in?
11:14:26 <zz9pzza> "could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?" I can imagine doing it all as part of a tenant creation SOP/script however we havn't done it. I would pre create the provisioning networks otherwise you would need to restart neutron on a tenant creation
11:14:48 <zz9pzza> What do you mean by special role oneswig ?
11:15:08 <oneswig> The Lnet-5 role
11:15:38 <zz9pzza> So each filesystem isolation area needs to have a role associated with it.
11:15:41 <oneswig> This role enables the creation of ports, the admins create the network and routers - is that it?
11:16:09 <zz9pzza> That role allows the user to create a machine that has access to the provisioning network that has a lustre router on it
11:16:35 <zz9pzza> So the role allows a user to create a machine where that machine has all the access to the data in that tenant
11:17:04 <zz9pzza> My experiments only used a single uid/gid, as root in the tenant could change their user to be whatever they wished.
11:17:12 <zz9pzza> Does that make sense ?
11:17:30 <zz9pzza> ( and yes "the admins create the network and routers - is that it?" )
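[Editor's note: the admin-side setup just confirmed above, a pre-created provider network plus a gating role, can be sketched with standard `openstack` CLI calls. The network name, VLAN segment, physical network, project, user, and role names below are all hypothetical stand-ins; only the "lnet-5" role name comes from the discussion.]

```shell
# Provider VLAN network carrying the tenant's LNET traffic; created by
# the cloud admins and not shared globally
openstack network create lnet-tenant1 \
    --provider-network-type vlan \
    --provider-physical-network physnet1 \
    --provider-segment 200 \
    --no-share

openstack subnet create lnet-tenant1-subnet \
    --network lnet-tenant1 \
    --subnet-range 192.168.10.0/24

# Expose the provisioning network to the single tenant allowed to use it
openstack network rbac create \
    --target-project tenant1 \
    --action access_as_shared \
    --type network lnet-tenant1

# The gating role ("Lnet-5" in the talk); holders may boot instances
# with a port on the provisioning network
openstack role create lnet-5
openstack role add --project tenant1 --user alice lnet-5
```

Tying the role to port creation on that specific network would be done through Neutron policy configuration; the details are deployment-specific.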
11:18:14 <zz9pzza> If you had a tenant which had real multiple users in it you could expose more uid/gid space to that tenant
11:18:33 <oneswig> I think so thanks.
11:18:36 <zz9pzza> ( but root in the tenant could still read all the data of the tenant )
11:19:50 <oneswig> It makes sense for situations where there are project power users who create and set up infra for others on the team
11:20:10 <zz9pzza> it could do yes
11:20:53 <oneswig> The VLAN interface on the instances, have you tried using direct-bound ports on that, ie SR-IOV?
11:21:49 <zz9pzza> We haven't; we have a tendency to keep things as simple as we can and optimise when we are more confident
11:22:15 <zz9pzza> The performance was better than I expected without it.
11:22:51 <oneswig> Sounds reasonable.  What's the hypervisor networking configuration - a standard Red Hat setup or tuned in any particular way?
11:23:24 <zz9pzza> The original runs were done completely suboptimally. We had a kernel where we had to turn all hardware acceleration off.
11:23:44 <zz9pzza> We upgraded the kernel and got a large performance boost with hardware acceleration on.
11:24:35 <b1airo> zz9pzza, have you submitted this for a talk in Sydney by any chance?
11:24:36 <oneswig> I think I saw something similar in our tests, but that was for offloads on VXLAN-tunneled packets
11:25:11 <zz9pzza> When is Sydney  ( and I present really really badly )
11:25:43 <zz9pzza> I don't think I can make that time in November
11:26:04 <zz9pzza> If any one wanted to talk around it I wouldn't object and would help :)
11:26:28 <oneswig> Alas, the formal deadline for presentation submissions was last Friday
11:26:48 <b1airo> pity to waste a perfectly good slide-deck though!
11:27:33 <zz9pzza> And it does work and solves a problem we in scientific research do have
11:27:33 <oneswig> agreed!
11:27:45 <zz9pzza> And it is all free software
11:28:07 <oneswig> The results on slide 23 - what is the missing text in the legend "bare metal with..."?
11:28:37 <zz9pzza> Just looking
11:30:32 <zz9pzza> I think it is dual physical routers, but I think that is an artifact of how our first instance of OpenStack uses bonding
11:30:45 <oneswig> For comparison, the Ceph filesystem was giving you about 200MB/s read performance, and you're seeing around 1300MB/s for single-client Lustre bandwidth?
11:30:50 <zz9pzza> Our hypervisors are not using l3/l4
11:31:26 <oneswig> ... but about 3000MB/s in an equivalent bare metal test
11:31:36 <zz9pzza> Let me see if I can find the spreadsheet of doom which has raw numbers in it ( and the headings are less than obvious )
11:32:53 <b1airo> two instances of openstack = two completely separate clouds? or different regions, AZs, Cells, or something else...?
11:33:25 <zz9pzza> So our next OpenStack instance is on the same set of physical hardware
11:33:43 <zz9pzza> We have enough controller type machines to have two instances running at once
11:33:44 <martial> very nice speed up
11:33:49 <zz9pzza> The next one is newton
11:34:30 <zz9pzza> ( which has a better kernel and better bonding config )
11:35:39 <zz9pzza> https://docs.google.com/spreadsheets/d/1E-YOso5-aDTzn2m6lKoYwBNWHvgAIcrxM5SpxSBw0_Y/edit?usp=sharing
11:35:44 <zz9pzza> I think that is it :)
11:36:59 <zz9pzza> sorry, it's a bit incomprehensible
11:37:49 <zz9pzza> So a single client got up to about 900MB/s write
11:38:02 <oneswig> zz9pzza: there's a certain amount of admin setup (the provider networks, role definitions etc). Is this a blocker for orchestrating through Manila?  It does sound quite similar to the CephFS sharing model
11:38:27 <zz9pzza> I don't know enough about Manila to comment.
11:38:44 <zz9pzza> I would have to check if Manila is supported in Red Hat's RDO as well
11:38:47 <b1airo> i'm curious as to why you ended up with Arista switching and Mellanox HCAs?
11:38:54 <zz9pzza> And I am still a bit nervous about CephFS
11:39:26 <oneswig> zz9pzza: my experiments with it confirm those nerves, but I have great hope for luminous
11:39:35 <zz9pzza> So the Mellanox HCAs appear to have lots of hardware acceleration and were not significantly more expensive
11:40:00 <zz9pzza> The Arista switches have very good software and, for the ports supported, are very cost effective
11:40:21 <oneswig> How are you doing 25G - breakout cables?
11:40:29 <zz9pzza> Dual 32-port 100/50/40/25/10 gig switches are very reasonable
11:40:46 <zz9pzza> Yes, the 25Gb breakout is a cable
11:41:03 <zz9pzza> so we could have 128 ports of 25Gb/s
11:41:12 <zz9pzza> We have a mixture of 25 and 100
11:41:39 <zz9pzza> We hope to use the Arista to do the VXLAN encoding in the next iteration as well
11:41:43 <b1airo> any RDMA traffic?
11:42:05 <zz9pzza> We don't do that right now; the next iteration of Ceph should have it and the cards do support it
11:42:40 <zz9pzza> ( For VXLAN encoding in the switch, Red Hat's lack of being able to set the region name is a complete pain )
11:42:41 <oneswig> zz9pzza: You have a mix of VLANs and VXLANs?
11:42:48 <zz9pzza> yes its fine
11:43:18 <zz9pzza> This iteration all the tenant networking is double encapsulated.
11:43:28 <zz9pzza> Once on the switch, once on the hypervisor
11:43:35 <oneswig> b1airo: did you have any data that compares lnet tcp and o2ib approaches?
11:44:37 <oneswig> zz9pzza: your presentation is very thorough, thanks for that - is there anything that isn't working yet that you're still working on?
11:44:43 <b1airo> no, both our Lustres are in o2ib mode at the moment, but we have considered going back to TCP on the ConnectX-3 based system due to some recurring LNET errors
11:45:16 <zz9pzza> I think I would be happy enough putting it into production; I am not sure if virtual or physical Lustre routers are the right approach, both have advantages.
11:45:20 <b1airo> not easy with it all in production and long job queues
11:45:36 <zz9pzza> I think I would buy physical routers in the first instance.
11:46:11 <zz9pzza> ( physical routers makes it easy to bridge ethernet and ib too )
11:46:24 <oneswig> zz9pzza: is 1300MB/s fast enough for your needs or are you looking at closing that gap to bare metal?
11:46:49 <zz9pzza> Well a single vm got to 900MB/s write and 1.2GB/s read
11:47:09 <zz9pzza> which is not too bad for a 6 year old server set up
11:47:21 <zz9pzza> ( and not at the same time )
11:47:42 <zz9pzza> ( look at single machine 4MB/2VCPU New Kernel )
11:47:53 <zz9pzza> ( it should be 4GB )
11:50:03 <b1airo> are LNET routers particularly resource hungry? i'm guessing they don't need much memory, just decent connectivity and a bit of CPU (maybe not even much of that if using RDMA) ?
11:50:24 <zz9pzza> No you just need a bit of cpu
11:50:26 <oneswig> Did you try more VCPUs and the new kernel?  Just interested in exploring what the bottleneck is cf bare metal
11:50:48 <zz9pzza> I didn't; the new kernel came very close to my deadline
11:51:17 <b1airo> zz9pzza, that spreadsheet is all Lustre numbers? what was the ceph comparison you mentioned?
11:51:48 <zz9pzza> At the very beginning on our first ceph I just used dd,
11:51:54 <oneswig> I saw some numbers on slide 9
11:52:00 <zz9pzza> Let me look
11:52:26 <zz9pzza> That was a dd in a volume, not really comparable
11:52:28 <b1airo> ah ok, so no direct comparison, that was just a single client dd "does this work" test?
11:52:33 <zz9pzza> Yes
11:52:42 <zz9pzza> I could do that now and I am happy to do so
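[Editor's note: the single-client dd "does this work" test referred to here is straightforward to reproduce. A sketch follows; the target path is a stand-in, point it at a file on the Cinder/Ceph volume or Lustre mount actually under test. A fuller benchmark would use a tool with direct I/O and concurrency rather than a single dd stream.]

```shell
#!/bin/sh
set -e
# Stand-in path; override TARGET with a file on the filesystem to test
TARGET=${TARGET:-./ddtest.bin}

# Write 16 x 4MiB blocks; conv=fsync forces the data to storage before
# dd reports, so the throughput line isn't just the page cache
dd if=/dev/zero of="$TARGET" bs=4M count=16 conv=fsync

# Sanity-check the size: 16 * 4MiB = 67108864 bytes
wc -c < "$TARGET"

# Read it back (without O_DIRECT this can be served from cache)
dd if="$TARGET" of=/dev/null bs=4M

rm -f "$TARGET"
```

dd prints its throughput figures on stderr after each transfer; the write figure is the one comparable to the numbers quoted above.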
11:52:56 <zz9pzza> ( our ceph cluster has also just doubled in size )
11:53:13 <b1airo> fun times
11:53:13 <zz9pzza> And we are getting another 2.5PB usable in a couple of weeks
11:53:39 <oneswig> nice!  At that scale, I guess you're erasure coding?
11:53:42 <zz9pzza> No.
11:54:00 <zz9pzza> We are not confident yet.
11:54:04 <b1airo> do you have any support now? i see you discounted RH on price (can understand that)
11:54:04 <oneswig> I'd be interested to hear experiences with that
11:54:34 <zz9pzza> Red Hat have got back to us on pricing and it is worth talking to them again if you are academic
11:54:35 <b1airo> oneswig, can't do EC for RBD or CephFS yet unless you're on BlueStore etc
11:55:01 <zz9pzza> ( we have yet to buy support but I think we will  in the long term )
11:55:02 <oneswig> ah, thanks b1airo
11:55:26 <b1airo> we do have RH support, at a significant discount, it is still fairly pricey compared to others (e.g. SUSE, Canonical)
11:55:58 <zz9pzza> I would check again
11:55:59 <b1airo> oneswig, however we use EC for radosgw objects
11:56:27 <zz9pzza> ( I mean I would get an up-to-date quote for Ceph support )
11:56:35 <b1airo> it works just fine - main difference operationally is that things like scrub and recovery/backfill take much longer as the PGs are much wider
11:57:03 <oneswig> b1airo: and does it become CPU limited, or is that not a concern?
11:57:16 <oneswig> for the OSDs that is
11:57:29 <b1airo> not really a concern for HTTP object store workloads (at least that we have hit yet!)
11:57:44 <oneswig> fair point
11:57:51 <zz9pzza> What kind of performance do you get through radosgw ?
11:58:02 <oneswig> We are close to time... final questions
11:58:32 <verdurin> Just a shame it's Lustre, rather than GPFS...
11:58:38 <zz9pzza> :)
11:58:56 <verdurin> Thanks a lot, though zz9pzza - very impressive
11:58:57 <oneswig> verdurin: well timed to wind up the debate :-)
11:59:15 <b1airo> it can scale pretty well, think we tested to over a GB/s with just three 2-core radosgw VM hosts
11:59:25 <oneswig> thanks indeed, that presentation is thorough and very informative zz9pzza
11:59:43 <b1airo> careful with big buckets though - if you are not prepared for them it can be a real pain to get rid of them!
11:59:44 <martial> yes, thank you. I will share the log to a couple colleagues
11:59:47 <zz9pzza> Feel free to steal bits if any one wants to
12:00:01 <b1airo> yes thanks zz9pzza - great presentation
12:00:08 <oneswig> "share and enjoy" you mean :-)
12:00:12 <zz9pzza> Indeed
12:00:23 <oneswig> OK, time to wrap up all
12:00:26 <verdurin> Bye all
12:00:26 <oneswig> thanks everyone
12:00:32 <oneswig> #endmeeting