11:00:24 #startmeeting scientific-wg
11:00:25 Meeting started Wed Jul 19 11:00:24 2017 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:28 The meeting name has been set to 'scientific_wg'
11:00:31 Afternoon, oneswig
11:00:40 Precisely so
11:00:46 afternoon verdurin
11:01:08 #link Agenda for today https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_July_19th_2017
11:01:16 oneswig, tom here, thanks for the email reminder 30 min ago :)
11:01:25 (I'm in a training course so just lurking in this meeting)
11:01:30 Hi Tom, very welcome
11:01:40 #chair martial
11:01:41 Current chairs: martial oneswig
11:02:52 We have a shortish agenda for today, but the main event is an interesting presentation on research by James at Sanger
11:03:19 Hi
11:03:27 greetings :-)
11:03:41 Hello
11:03:51 #topic OpenStack and Lustre
11:04:06 #link presentation on Lustre is at https://docs.google.com/presentation/d/1kGRzcdVQX95abei1bDVoRzxyC02i89_m5_sOfp8Aq6o/edit#slide=id.g22aee564af_0_0
11:04:54 zz9pzza: can you tell us a bit about the background?
11:05:13 One of the things we have been worrying about is POSIX file access in OpenStack
11:05:57 o/
11:06:03 Hi b1airo
11:06:06 #chair b1airo
11:06:06 We did some work with DDN ( not using any non-standard or non-open software ) to see if we could allow access from different tenants to different bits of a Lustre filesystem
11:06:07 Current chairs: b1airo martial oneswig
11:06:08 evening
11:06:12 Hi
11:06:36 The short story is I think we could provide it as a service, and we are definitely considering doing so
11:06:46 We have one tenant about to use it in anger
11:06:53 Hello b1airo
11:06:54 Does it depend on the DDN Lustre distribution?
11:06:56 No
11:07:07 You need to have at least Lustre 2.9
11:07:14 And 2.10 is an LTS release
11:07:27 There's a Lustre bug that needs fixing, but the fix is public
11:07:42 Ah - was going to ask if any fixes had gone upstream
11:07:48 https://jira.hpdd.intel.com/browse/LU-9289
11:08:03 it's a one-line fix for a null-termination error
11:08:10 ( from memory )
11:08:32 zz9pzza, are you using the DDN software stack, or just the DDN hardware with your own Lustre install?
11:08:32 It says it is in 2.10
11:09:16 We use DDN hardware. I believe the software is a DDN release, but it is not required. DDN reinstalled a set of old servers for us as we didn't have the time
11:09:54 ( We used to install our own Lustre servers, but people wanted a more appliance-like approach ) We could still do it, but we are a bit busy
11:10:14 Basically the approach is to use Lustre routers to isolate the tenant
11:10:40 Use the 2.9 filesets and UID mapping so each tenant gets a different @tcp network
11:11:05 We tried with both physical and virtual Lustre routers; both work
11:11:36 why the net LNET routing, is that to integrate with OpenStack tenant/private networks?
11:11:44 s/net/need/
11:12:03 It's because you can't trust the tenant not to alter the network, and the isolation is based on the Lustre NID
11:12:39 The routers are in the control of the OpenStack/filesystem operators
11:12:46 ahh ok, don't know anything about how the isolation works, guess that is a hard requirement then!
11:13:19 zz9pzza: could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?
11:13:24 zz9pzza: how does the special role fit in?
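As a rough illustration of the LNET routing scheme described above - a minimal sketch only, with made-up network names (tcp0 for the storage side, tcp5 for one tenant), interface names and router NIDs, not the actual Sanger configuration:

    # Tenant instance, /etc/modprobe.d/lustre.conf: lives only on its own LNET
    # network and reaches the servers via the operator-controlled LNET router
    options lnet networks="tcp5(eth1)" routes="tcp0 10.5.0.2@tcp5"

    # LNET router (operator-controlled): one leg on each network, forwarding on
    options lnet networks="tcp0(eth0),tcp5(eth1)" forwarding="enabled"

    # Lustre servers: return route to the tenant network via the same router
    options lnet networks="tcp0(eth0)" routes="tcp5 10.0.0.2@tcp0"

The point of routing rather than connecting clients directly, as noted above, is that the servers only ever see tenant NIDs arriving through routers the operators control, even if the tenant alters its own network.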
11:14:26 "could all the background Lustre work be triggered by tenant creation in OpenStack? Or does it have to be handled separately?" I can imagine doing it all as part of a tenant creation SOP/script however we havn't done it. I would pre create the provisioning networks otherwise you would need to restart neutron on a tenant creation 11:14:48 What do you mean by special role oneswig ? 11:15:08 The Lnet-5 role 11:15:38 So each filesystem isolation area needs to have a role associated with it. 11:15:41 This role enables the creation of ports, the admins create the network and routers - is that it? 11:16:09 That role allows the user to create a machine that has access to the provisioning network that has a lustre router on it 11:16:35 So the role allows a user to create a machine where that machine has all the access to the data in that tenant 11:17:04 My experiments only used a single uid/guid as root in the tenant could change their user to be what ever they wished. 11:17:12 Does that make sense ? 11:17:30 ( and yes "the admins create the network and routers - is that it?" ) 11:18:14 If you had a tenant which had real multiple users in it you could expose more uid/guid space to that tenant 11:18:33 I think so thanks. 11:18:36 ( but root in the tenant could still read all the data of the tenant ) 11:19:50 It makes sense for situations where there are project power users who create and set up infra for others on the team 11:20:10 it could do yes 11:20:53 The VLAN interface on the instances, have you tried using direct-bound ports on that, ie SR-IOV? 11:21:49 We haven't we have a tendancy to keep things as simple as we can and optimise when we are more confident 11:22:15 The performance was better than I expected without it. 11:22:51 Sounds reasonable. What's the hypervisor networking configuration - a standard Red Hat setup or tuned in any particular way? 11:23:24 The orignal runs were done completely in optimally. We had a kernel where we had to turn all hardware acceleration off. 11:23:44 We upgraded the kernel and got a large performance boost with hardware acceleration on. 11:24:35 zz9pzza, have you submitted this for a talk in Sydney by any chance? 11:24:36 I think I saw something similar in our tests, but that was for offloads on VXLAN-tunneled packets 11:25:11 When is Sydney ( and I present really really badly ) 11:25:43 I don't think I can make that time in November 11:26:04 If any one wanted to talk around it I wouldn't object and would help :) 11:26:28 Alas, the formal deadline for presentation submissions was last Friday 11:26:48 pity to waste a perfectly good slide-deck though! 11:27:33 And it does work and solves a problem we in scientific research do have 11:27:33 agreed! 11:27:45 And it is all free software 11:28:07 The results on slide 23 - what is the missing text in the legend "bare metal with..."? 11:28:37 Just looking 11:30:32 I think it is dual physical routers, but I think that is an artifact of how our first instance of openstance uses bonding 11:30:45 For comparison, the Ceph filesystem was giving you about 200MB/s read performance, and you're seeing around 1300MB/s for single-client Lustre bandwidth? 11:30:50 Our hypevisors are not using l3/l4 11:31:26 ... but about 3000MB/s in an equivalent bare metal test 11:31:36 Let me see if I can find the spreadsheet of doom which has raw numbers in it ( and the headings are less than obvious ) 11:32:53 two instances of openstack = two completely separate clouds? or different regions, AZs, Cells, or something else...? 
11:33:25 So our next OpenStack instance is on the same set of physical hardware
11:33:43 We have enough controller-type machines to have two instances running at once
11:33:44 very nice speed-up
11:33:49 The next one is Newton
11:34:30 ( which has a better kernel and better bonding config )
11:35:39 https://docs.google.com/spreadsheets/d/1E-YOso5-aDTzn2m6lKoYwBNWHvgAIcrxM5SpxSBw0_Y/edit?usp=sharing
11:35:44 I think that is it :)
11:36:59 sorry, it's a bit incomprehensible
11:37:49 So a single client got up to about 900MB/s write
11:38:02 zz9pzza: there's a certain amount of admin setup (the provider networks, role definitions etc.). Is this a blocker for orchestrating through Manila? It does sound quite similar to the CephFS sharing model
11:38:27 I don't know enough about Manila to comment.
11:38:44 I would have to check if Manila is supported in Red Hat's RDO as well
11:38:47 i'm curious as to why you ended up with Arista switching and Mellanox HCAs?
11:38:54 And I am still a bit nervous about CephFS
11:39:26 zz9pzza: my experiments with it confirm those nerves, but I have great hope for Luminous
11:39:35 So the Mellanox HCAs appear to have lots of hardware acceleration and were not significantly more expensive
11:40:00 The Arista switches have very good software and, for the ports supported, are very cost-effective
11:40:21 How are you doing 25G - breakout cables?
11:40:29 Dual 32-port 100/50/40/25/10 Gig switches are very reasonable
11:40:46 Yes, the 25Gb breakout is a cable
11:41:03 so we could have 128 ports of 25Gb/s
11:41:12 We have a mixture of 25 and 100
11:41:39 We hope to use the Arista to do the VXLAN encoding in the next iteration as well
11:41:43 any RDMA traffic?
11:42:05 We don't do that right now; the next iteration of Ceph should have it and the cards do support it
11:42:40 ( For VXLAN encoding in the switch, Red Hat's lack of being able to set the region name is a complete pain )
11:42:41 zz9pzza: You have a mix of VLANs and VXLANs?
11:42:48 yes, it's fine
11:43:18 This iteration all the tenant networking is double encapsulated.
11:43:28 Once on the switch, once on the hypervisor
11:43:35 b1airo: did you have any data that compares LNET tcp and o2ib approaches?
11:44:37 zz9pzza: your presentation is very thorough, thanks for that - is there anything that isn't working yet that you're still working on?
11:44:43 no, both our Lustres are in o2ib mode at the moment, but we have considered going back to TCP on the ConnectX-3 based system due to some recurring LNET errors
11:45:16 I think I would be happy enough putting it into production; I am not sure if virtual or physical Lustre routers are the right approach - both have advantages.
11:45:20 not easy with it all in production and long job queues
11:45:36 I think I would buy physical routers in the first instance.
11:46:11 ( physical routers make it easy to bridge Ethernet and IB too )
11:46:24 zz9pzza: is 1300MB/s fast enough for your needs or are you looking at closing that gap to bare metal?
11:46:49 Well, a single VM got to 900MB/s write and 1.2GB/s read
11:47:09 which is not too bad for a 6-year-old server setup
11:47:21 ( and not at the same time )
11:47:42 ( look at "single machine 4MB/2VCPU New Kernel" )
11:47:53 ( it should be 4GB )
11:50:03 are LNET routers particularly resource hungry? i'm guessing they don't need much memory, just decent connectivity and a bit of CPU (maybe not even much of that if using RDMA)?
11:50:24 No, you just need a bit of CPU
11:50:26 Did you try more VCPUs and the new kernel?
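For the kind of single-client streaming figures quoted above (roughly 900MB/s write and 1.2GB/s read from one VM), a direct-I/O dd from inside a guest gives a quick sanity check - a sketch only, with a made-up mount point and file size, and not the benchmark behind the spreadsheet:

    # Streaming write, bypassing the client page cache
    dd if=/dev/zero of=/lustre/tenant1/ddtest bs=4M count=4096 oflag=direct

    # Streaming read of the same file back
    dd if=/lustre/tenant1/ddtest of=/dev/null bs=4M iflag=direct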
Just interested in exploring what the bottleneck is cf. bare metal
11:50:48 I didn't; the new kernel came very close to my deadline
11:51:17 zz9pzza, that spreadsheet is all Lustre numbers? what was the Ceph comparison you mentioned?
11:51:48 At the very beginning, on our first Ceph, I just used dd
11:51:54 I saw some numbers on slide 9
11:52:00 Let me look
11:52:26 That was a dd in a volume, not really comparable
11:52:28 ah ok, so no direct comparison, that was just a single-client dd "does this work" test?
11:52:33 Yes
11:52:42 I could do that now and I am happy to do so
11:52:56 ( our Ceph cluster has also just doubled in size )
11:53:13 fun times
11:53:13 And we are getting another 2.5PB of usable in a couple of weeks
11:53:39 nice! At that scale, I guess you're erasure coding?
11:53:42 No.
11:54:00 We are not confident yet.
11:54:04 do you have any support now? i see you discounted RH on price (can understand that)
11:54:04 I'd be interested to hear experiences with that
11:54:34 Red Hat have got back to us on pricing and it is worth talking to them again if you are academic
11:54:35 oneswig, can't EC for RBD or CephFS yet unless you're on BlueStore etc.
11:55:01 ( we have yet to buy support but I think we will in the long term )
11:55:02 ah, thanks b1airo
11:55:26 we do have RH support, at a significant discount; it is still fairly pricey compared to others (e.g. SUSE, Canonical)
11:55:58 I would check again
11:55:59 oneswig, however we use EC for radosgw objects
11:56:27 ( I mean I would get an up-to-date quote for Ceph support )
11:56:35 it works just fine - the main difference operationally is that things like scrub and recovery/backfill take much longer as the PGs are much wider
11:57:03 b1airo: and does it become CPU limited, or is that not a concern?
11:57:16 for the OSDs, that is
11:57:29 not really a concern for HTTP object store workloads (at least that we have hit yet!)
11:57:44 fair point
11:57:51 What kind of performance do you get through radosgw?
11:58:02 We are close to time... final questions
11:58:32 Just a shame it's Lustre, rather than GPFS...
11:58:38 :)
11:58:56 Thanks a lot though, zz9pzza - very impressive
11:58:57 verdurin: well timed to wind up the debate :-)
11:59:15 it can scale pretty well; think we tested to over a GB/s with just three 2-core radosgw VM hosts
11:59:25 thanks indeed, that presentation is thorough and very informative, zz9pzza
11:59:43 careful with big buckets though - if you are not prepared for them it can be a real pain to get rid of them!
11:59:44 yes, thank you. I will share the log with a couple of colleagues
11:59:47 Feel free to steal bits if anyone wants to
12:00:01 yes, thanks zz9pzza - great presentation
12:00:08 "share and enjoy" you mean :-)
12:00:12 Indeed
12:00:23 OK, time to wrap up all
12:00:26 Bye all
12:00:26 thanks everyone
12:00:32 #endmeeting
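As a footnote to the erasure-coding discussion above: a minimal sketch of backing the radosgw bucket-data pool with an EC pool while leaving RBD/CephFS pools replicated. The profile name, k/m values and PG counts are made up for illustration, and option names vary between Ceph releases (e.g. ruleset- vs crush-failure-domain), so check against your own version:

    # Define an erasure-code profile: 4 data chunks + 2 coding chunks per object
    ceph osd erasure-code-profile set rgw-ec k=4 m=2 crush-failure-domain=host

    # Create the bucket-data pool as erasure-coded; index/metadata pools stay replicated
    ceph osd pool create default.rgw.buckets.data 1024 1024 erasure rgw-ec

    # Tag the pool for rgw (Luminous and later)
    ceph osd pool application enable default.rgw.buckets.data rgw

As noted in the discussion, the wider EC PGs mainly show up operationally as longer scrub and recovery/backfill times rather than as extra OSD CPU load for this kind of object workload.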