11:00:24 <oneswig> #startmeeting scientific-wg
11:00:25 <openstack> Meeting started Wed Jul 19 11:00:24 2017 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:28 <openstack> The meeting name has been set to 'scientific_wg'
11:00:31 <verdurin> Afternoon, oneswig
11:00:40 <oneswig> Precisely so
11:00:46 <oneswig> afternoon verdurin
11:01:08 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_July_19th_2017
11:01:16 <Pelalil> oneswig, tom here, thanks for the email reminder 30min ago :)
11:01:25 <daveholland> (I'm in a training course so just lurking in this meeting)
11:01:30 <oneswig> Hi Tom, very welcome
11:01:40 <oneswig> #chair martial
11:01:41 <openstack> Current chairs: martial oneswig
11:02:52 <oneswig> We have a shortish agenda for today but the main event is an interesting presentation on research by James at Sanger
11:03:19 <zz9pzza> Hi
11:03:27 <oneswig> greetings :-)
11:03:41 <verdurin> Hello
11:03:51 <oneswig> #topic OpenStack and Lustre
11:04:06 <oneswig> #link presentation on Lustre is at https://docs.google.com/presentation/d/1kGRzcdVQX95abei1bDVoRzxyC02i89_m5_sOfp8Aq6o/edit#slide=id.g22aee564af_0_0
11:04:54 <oneswig> zz9pzza: can you tell us a bit about the background?
11:05:13 <zz9pzza> One of the things we have been worrying about is POSIX file access in openstack
11:05:57 <b1airo> o/
11:06:03 <oneswig> Hi b1airo
11:06:06 <oneswig> #chair b1airo
11:06:06 <zz9pzza> We did some work with DDN ( not using any non-standard or non-open software ) to see if we could allow access to lustre from different tenants to different bits of a lustre filesystem
11:06:07 <openstack> Current chairs: b1airo martial oneswig
11:06:08 <b1airo> evening
11:06:12 <zz9pzza> Hi
11:06:36 <zz9pzza> The short story is I think we could provide it as a service and we are definitely considering doing so
11:06:46 <zz9pzza> We have one tenant about to use it in anger
11:06:53 <verdurin> Hello b1airo
11:06:54 <oneswig> Does it depend on the DDN Lustre distribution?
11:06:56 <zz9pzza> No
11:07:07 <zz9pzza> You need to have at least lustre 2.9
11:07:14 <zz9pzza> And 2.10 is an LTS release
11:07:27 <zz9pzza> There's an LU bug that needs fixing but the fix is public
11:07:42 <verdurin> Ah - was going to ask if any fixes had gone upstream
11:07:48 <zz9pzza> https://jira.hpdd.intel.com/browse/LU-9289
11:08:03 <zz9pzza> it's a one-line fix for a null-termination error
11:08:10 <zz9pzza> ( from memory )
11:08:32 <Pelalil> zz9pzza, are you using the DDN software stack, or just the DDN hardware with your own lustre install?
11:08:32 <zz9pzza> It says it is in 2.10
11:09:16 <zz9pzza> We use DDN hardware. I believe the software is a DDN release but it is not required.
DDN reinstalled a set of old servers for us as we didn't have the time
11:09:54 <zz9pzza> ( We used to install our own lustre servers but people wanted a more appliance approach ) We could still do it but we are a bit busy
11:10:14 <zz9pzza> Basically the approach is to use lustre routers to isolate the tenant
11:10:40 <zz9pzza> Use the 2.9 filesets and uid mapping so each tenant gets a different @tcp network
11:11:05 <zz9pzza> We tried with both physical and virtual lustre routers, both work
11:11:36 <b1airo> why the net LNET routing, is that to integrate with OpenStack tenant/private networks ?
11:11:44 <b1airo> s/net/need/
11:12:03 <zz9pzza> It's because you can't trust the tenant not to alter the network, and the isolation is based on the lustre NID
11:12:39 <zz9pzza> The routers are in the control of the openstack/filesystem operators
11:12:46 <b1airo> ahh ok, don't know anything about how the isolation works, guess that is a hard requirement then!
11:13:19 <verdurin> zz9pzza: could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?
11:13:24 <oneswig> zz9pzza: how does the special role fit in?
11:14:26 <zz9pzza> "could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?" I can imagine doing it all as part of a tenant creation SOP/script, however we haven't done it. I would pre-create the provisioning networks, otherwise you would need to restart neutron on a tenant creation
11:14:48 <zz9pzza> What do you mean by special role oneswig ?
11:15:08 <oneswig> The Lnet-5 role
11:15:38 <zz9pzza> So each filesystem isolation area needs to have a role associated with it.
11:15:41 <oneswig> This role enables the creation of ports, the admins create the network and routers - is that it?
11:16:09 <zz9pzza> That role allows the user to create a machine that has access to the provisioning network that has a lustre router on it
11:16:35 <zz9pzza> So the role allows a user to create a machine where that machine has all the access to the data in that tenant
11:17:04 <zz9pzza> My experiments only used a single uid/gid, as root in the tenant could change their user to be whatever they wished.
11:17:12 <zz9pzza> Does that make sense ?
11:17:30 <zz9pzza> ( and yes, "the admins create the network and routers - is that it?" )
11:18:14 <zz9pzza> If you had a tenant which had real multiple users in it you could expose more uid/gid space to that tenant
11:18:33 <oneswig> I think so thanks.
11:18:36 <zz9pzza> ( but root in the tenant could still read all the data of the tenant )
11:19:50 <oneswig> It makes sense for situations where there are project power users who create and set up infra for others on the team
11:20:10 <zz9pzza> it could do yes
11:20:53 <oneswig> The VLAN interface on the instances, have you tried using direct-bound ports on that, ie SR-IOV?
11:21:49 <zz9pzza> We haven't, we have a tendency to keep things as simple as we can and optimise when we are more confident
11:22:15 <zz9pzza> The performance was better than I expected without it.
11:22:51 <oneswig> Sounds reasonable. What's the hypervisor networking configuration - a standard Red Hat setup or tuned in any particular way?
11:23:24 <zz9pzza> The original runs were done completely unoptimised. We had a kernel where we had to turn all hardware acceleration off.
11:23:44 <zz9pzza> We upgraded the kernel and got a large performance boost with hardware acceleration on.
11:24:35 <b1airo> zz9pzza, have you submitted this for a talk in Sydney by any chance?
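[Editor's note: the isolation scheme described above - an operator-controlled LNet router per tenant, a distinct @tcp network, and Lustre 2.9 nodemap filesets with uid mapping - might be sketched roughly as follows. This is an illustrative reconstruction only; all names, interfaces, NID ranges and the fileset path (tenant1, tcp1, eth0/eth1, /tenant1, the id mapping) are invented assumptions, not Sanger's actual configuration.]

```shell
# --- Illustrative sketch only; all names/addresses are invented examples ---

# On the LNet router (under operator control), forward between the
# server-side network (tcp0) and the tenant-facing network (tcp1):
lnetctl set routing 1
lnetctl net add --net tcp0 --if eth0   # towards MDS/OSS
lnetctl net add --net tcp1 --if eth1   # towards the tenant

# On the MGS, confine the tenant's NID range to a sub-tree of the
# filesystem (the 2.9 "fileset" feature) and map its ids:
lctl nodemap_add tenant1
lctl nodemap_add_range --name tenant1 --range 10.1.0.[2-254]@tcp1
lctl nodemap_modify --name tenant1 --property trusted --value 0
lctl nodemap_modify --name tenant1 --property admin --value 0
lctl set_param -P nodemap.tenant1.fileset=/tenant1
lctl nodemap_add_idmap --name tenant1 --idtype uid --idmap 1000:51000

# Tenant clients reach the servers only via the router:
lnetctl route add --net tcp0 --gateway 10.1.0.1@tcp1
```

Because the isolation keys off the client NID and the routers sit outside the tenant, a tenant altering its own network cannot escape its nodemap, which matches the "can't trust the tenant" point above.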
11:24:36 <oneswig> I think I saw something similar in our tests, but that was for offloads on VXLAN-tunneled packets
11:25:11 <zz9pzza> When is Sydney ( and I present really really badly )
11:25:43 <zz9pzza> I don't think I can make that time in November
11:26:04 <zz9pzza> If anyone wanted to talk around it I wouldn't object and would help :)
11:26:28 <oneswig> Alas, the formal deadline for presentation submissions was last Friday
11:26:48 <b1airo> pity to waste a perfectly good slide-deck though!
11:27:33 <zz9pzza> And it does work and solves a problem we in scientific research do have
11:27:33 <oneswig> agreed!
11:27:45 <zz9pzza> And it is all free software
11:28:07 <oneswig> The results on slide 23 - what is the missing text in the legend "bare metal with..."?
11:28:37 <zz9pzza> Just looking
11:30:32 <zz9pzza> I think it is dual physical routers, but I think that is an artifact of how our first instance of openstack uses bonding
11:30:45 <oneswig> For comparison, the Ceph filesystem was giving you about 200MB/s read performance, and you're seeing around 1300MB/s for single-client Lustre bandwidth?
11:30:50 <zz9pzza> Our hypervisors are not using l3/l4
11:31:26 <oneswig> ... but about 3000MB/s in an equivalent bare metal test
11:31:36 <zz9pzza> Let me see if I can find the spreadsheet of doom which has raw numbers in it ( and the headings are less than obvious )
11:32:53 <b1airo> two instances of openstack = two completely separate clouds? or different regions, AZs, Cells, or something else...?
11:33:25 <zz9pzza> So our next openstack instance is on the same set of physical hardware
11:33:43 <zz9pzza> We have enough controller type machines to have two instances running at once
11:33:44 <martial> very nice speed up
11:33:49 <zz9pzza> The next one is newton
11:34:30 <zz9pzza> ( which has a better kernel and better bonding config )
11:35:39 <zz9pzza> https://docs.google.com/spreadsheets/d/1E-YOso5-aDTzn2m6lKoYwBNWHvgAIcrxM5SpxSBw0_Y/edit?usp=sharing
11:35:44 <zz9pzza> I think that is it :)
11:36:59 <zz9pzza> sorry it's a bit incomprehensible
11:37:49 <zz9pzza> So a single client got up to about 900MB/s write
11:38:02 <oneswig> zz9pzza: there's a certain amount of admin setup (the provider networks, role definitions etc). Is this a blocker for orchestrating through Manila? It does sound quite similar to the CephFS sharing model
11:38:27 <zz9pzza> I don't know enough about Manila to comment.
11:38:44 <zz9pzza> I would have to check if Manila is supported in redhat's RDO as well
11:38:47 <b1airo> i'm curious as to why you ended up with Arista switching and Mellanox HCAs?
11:38:54 <zz9pzza> And I am still a bit nervous about cephfs
11:39:26 <oneswig> zz9pzza: my experiments with it confirm those nerves, but I have great hope for luminous
11:39:35 <zz9pzza> So the Mellanox HCAs appear to have lots of hardware acceleration and were not significantly more expensive
11:40:00 <zz9pzza> The Arista switches have very good software and for the ports supported are very cost effective
11:40:21 <oneswig> How are you doing 25G - breakout cables?
11:40:29 <zz9pzza> Dual 32 port 100/50/40/25/10 gig switches are very reasonable
11:40:46 <zz9pzza> Yes the 25Gb breakout is a cable
11:41:03 <zz9pzza> so we could have 128 ports of 25Gb/s
11:41:12 <zz9pzza> We have a mixture of 25 and 100
11:41:39 <zz9pzza> We hope to use the Arista to do the vxlan encoding in the next iteration as well
11:41:43 <b1airo> any RDMA traffic?
11:42:05 <zz9pzza> We don't do that right now, the next iteration of ceph should have it and the cards do support it
11:42:40 <zz9pzza> ( For vxlan encoding in the switch, redhat's lack of being able to set the region name is a complete pain )
11:42:41 <oneswig> zz9pzza: You have a mix of VLANs and VXLANs?
11:42:48 <zz9pzza> yes, it's fine
11:43:18 <zz9pzza> This iteration all the tenant networking is double encapsulated.
11:43:28 <zz9pzza> Once on the switch, once on the hypervisor
11:43:35 <oneswig> b1airo: did you have any data that compares lnet tcp and o2ib approaches?
11:44:37 <oneswig> zz9pzza: your presentation is very thorough, thanks for that - is there anything that isn't working yet that you're still working on?
11:44:43 <b1airo> no, both our Lustres are in o2ib mode at the moment, but we have considered going back to TCP on the ConnectX-3 based system due to some recurring LNET errors
11:45:16 <zz9pzza> I think I would be happy enough putting it into production, I am not sure if virtual or physical lustre routers are the right approach, both have advantages.
11:45:20 <b1airo> not easy with it all in production and long job queues
11:45:36 <zz9pzza> I think I would buy physical routers in the first instance.
11:46:11 <zz9pzza> ( physical routers make it easy to bridge ethernet and ib too )
11:46:24 <oneswig> zz9pzza: is 1300MB/s fast enough for your needs or are you looking at closing that gap to bare metal?
11:46:49 <zz9pzza> Well a single vm got to 900MB/s write and 1.2GB/s read
11:47:09 <zz9pzza> which is not too bad for a 6 year old server set up
11:47:21 <zz9pzza> ( and not at the same time )
11:47:42 <zz9pzza> ( look at single machine 4MB/2VCPU New Kernel )
11:47:53 <zz9pzza> ( it should be 4GB )
11:50:03 <b1airo> are LNET routers particularly resource hungry? i'm guessing they don't need much memory, just decent connectivity and a bit of CPU (maybe not even much of that if using RDMA) ?
11:50:24 <zz9pzza> No, you just need a bit of cpu
11:50:26 <oneswig> Did you try more VCPUs and the new kernel? Just interested in exploring what the bottleneck is cf bare metal
11:50:48 <zz9pzza> I didn't, the new kernel came very close to my deadline
11:51:17 <b1airo> zz9pzza, that spreadsheet is all Lustre numbers? what was the ceph comparison you mentioned?
11:51:48 <zz9pzza> At the very beginning on our first ceph I just used dd,
11:51:54 <oneswig> I saw some numbers on slide 9
11:52:00 <zz9pzza> Let me look
11:52:26 <zz9pzza> That was a dd in a volume, not really comparable
11:52:28 <b1airo> ah ok, so no direct comparison, that was just a single client dd "does this work" test?
11:52:33 <zz9pzza> Yes
11:52:42 <zz9pzza> I could do that now and I am happy to do so
11:52:56 <zz9pzza> ( our ceph cluster has also just doubled in size )
11:53:13 <b1airo> fun times
11:53:13 <zz9pzza> And we are getting another 2.5PB of useable in a couple of weeks
11:53:39 <oneswig> nice! At that scale, I guess you're erasure coding?
11:53:42 <zz9pzza> No.
11:54:00 <zz9pzza> We are not confident yet.
11:54:04 <b1airo> do you have any support now? i see you discounted RH on price (can understand that)
11:54:04 <oneswig> I'd be interested to hear experiences with that
11:54:34 <zz9pzza> Redhat have got back to us on pricing and it is worth talking to them again if you are academic
11:54:35 <b1airo> oneswig, can't EC for RBD or CephFS yet unless you're on Bluestore etc
11:55:01 <zz9pzza> ( we have yet to buy support but I think we will in the long term )
11:55:02 <oneswig> ah, thanks b1airo
11:55:26 <b1airo> we do have RH support, at a significant discount, it is still fairly pricey compared to others (e.g. SUSE, Canonical)
11:55:58 <zz9pzza> I would check again
11:55:59 <b1airo> oneswig, however we use EC for radosgw objects
11:56:27 <zz9pzza> ( I mean I would get an up to date quote for ceph support )
11:56:35 <b1airo> it works just fine - main difference operationally is that things like scrub and recovery/backfill take much longer as the PGs are much wider
11:57:03 <oneswig> b1airo: and does it become CPU limited, or is that not a concern?
11:57:16 <oneswig> for the OSDs that is
11:57:29 <b1airo> not really a concern for HTTP object store workloads (at least that we have hit yet!)
11:57:44 <oneswig> fair point
11:57:51 <zz9pzza> What kind of performance do you get through rados gw ?
11:58:02 <oneswig> We are close to time... final questions
11:58:32 <verdurin> Just a shame it's Lustre, rather than GPFS...
11:58:38 <zz9pzza> :)
11:58:56 <verdurin> Thanks a lot, though zz9pzza - very impressive
11:58:57 <oneswig> verdurin: well timed to wind up the debate :-)
11:59:15 <b1airo> it can scale pretty well, think we tested to over a GB/s with just three 2core radosgw VM hosts
11:59:25 <oneswig> thanks indeed, that presentation is thorough and very informative zz9pzza
11:59:43 <b1airo> careful with big buckets though - if you are not prepared for them it can be a real pain to get rid of them!
11:59:44 <martial> yes, thank you. I will share the log with a couple of colleagues
11:59:47 <zz9pzza> Feel free to steal bits if anyone wants to
12:00:01 <b1airo> yes thanks zz9pzza - great presentation
12:00:08 <oneswig> "share and enjoy" you mean :-)
12:00:12 <zz9pzza> Indeed
12:00:23 <oneswig> OK, time to wrap up all
12:00:26 <verdurin> Bye all
12:00:26 <oneswig> thanks everyone
12:00:32 <oneswig> #endmeeting
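[Editor's note: the single-client numbers discussed in the meeting (the early Ceph "dd in a volume" check and the 900MB/s write / 1.2GB/s read Lustre figures) come from crude streaming tests. A minimal sketch of that kind of test is below; the TARGET path is an invented example and should point at the mounted filesystem under test, and without a cache drop or direct I/O the read figure is cache-inflated.]

```shell
# Rough single-client streaming "does this work" test, dd-style as in the log.
# TARGET is an example path; aim it at the filesystem being measured.
TARGET=${TARGET:-/tmp/dd-bw-test.bin}

# Write pass: 256 MiB in 4 MiB blocks; dd reports throughput on its last line.
dd if=/dev/zero of="$TARGET" bs=4M count=64 conv=fsync 2>&1 | tail -n 1

# Read pass (drop caches first, or use iflag=direct, for honest numbers):
dd if="$TARGET" of=/dev/null bs=4M 2>&1 | tail -n 1

rm -f "$TARGET"
```

Larger block sizes and multiple parallel streams (or a proper tool such as fio) give numbers closer to the per-client bandwidths quoted above; a single dd mostly shows whether the path works at all.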