11:00:42 <oneswig> #startmeeting scientific-sig
11:00:43 <openstack> Meeting started Wed Jun 20 11:00:42 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:47 <openstack> The meeting name has been set to 'scientific_sig'
11:00:58 <oneswig> Hi there
11:01:08 <janders> hey Stig :)
11:01:08 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_June_20th_2018
11:01:17 <oneswig> Hey janders, good evening
11:01:30 <priteau> Hi everyone!
11:01:45 <janders> ceph-RDMA! looking forward to the discussion :)
11:01:46 <oneswig> Greetings priteau!
11:01:55 <oneswig> janders: have you got experience of this?
11:02:20 <janders> oneswig: not yet, but it's an area of strong interest
11:02:30 <oneswig> Would be a good fit in your place, perhaps...
11:02:41 <oneswig> How's the production roll-out going?
11:02:50 <janders> indeed! :)
11:02:58 <martial__> Good day
11:03:09 <oneswig> Morning martial__!
11:03:12 <oneswig> #chair martial__
11:03:13 <openstack> Current chairs: martial__ oneswig
11:03:16 <priteau> Good morning martial__
11:03:22 <oneswig> up early again
11:03:41 <oneswig> OK we must give apologies in advance that this meeting might not run the full hour
11:03:44 <martial__> but only able to stay for 20 minutes or so
11:04:01 <janders> oneswig: making progress - multirail issues are better understood now. Working on adding some virtualisation capability for the bits that don't need bare-metal.
11:04:08 <b1airo> o/
11:04:17 <oneswig> Hey b1airo
11:04:22 <oneswig> #chair b1airo
11:04:23 <openstack> Current chairs: b1airo martial__ oneswig
11:04:23 <janders> Hey Blair
11:04:28 <b1airo> how goes it?
11:04:31 <oneswig> We should make the most of it and get cracking
11:04:37 <janders> indeed!
11:04:43 <oneswig> #topic DockerCon roundup
11:04:49 <oneswig> martial__ has been on the road...
11:04:52 <oneswig> How was it?
11:04:59 <martial__> I have indeed been doing the rounds :)
11:05:11 <martial__> well, it was DockerCon, but it felt like a Docker-sponsored Kubernetes conference
11:05:26 <martial__> most of the talks were about using K8s with Docker
11:05:26 <b1airo> yeah, before i fall asleep and drool on my laptop
11:05:37 <martial__> very good conversation indeed
11:05:56 <oneswig> there was an AI/HPC session, right?
11:06:08 <martial__> the talk by Christine Lovett and Christian Kniep (who presented here) was the HPC talk
11:06:21 <martial__> their slides/video are not available yet
11:06:30 <oneswig> man, these 2-bit conferences :-)
11:06:31 <martial__> but it talked about the future of Docker
11:06:43 <b1airo> lol
11:07:02 <b1airo> no fntech i guess
11:07:18 <martial__> #link https://www.linkedin.com/feed/update/urn:li:activity:6413174151317590016
11:07:47 <martial__> the link above is a post I did with a few pictures from that presentation
11:07:59 <martial__> There was also a very interesting talk about ML
11:08:21 <martial__> #link https://www.linkedin.com/feed/update/urn:li:activity:6413098707868221440
11:08:41 <martial__> by two Microsoft people, lots of interesting technology presented there
11:09:12 <janders> have NVIDIA presented anything on their DGX1/docker stack?
11:09:24 <martial__> and then there was a "Black Belt" talk on orchestration
11:09:40 <martial__> #link https://www.linkedin.com/feed/update/urn:li:activity:6412796018143821824
11:10:01 <martial__> very interesting in context, but still not ready for HPC usage
11:10:05 <oneswig> Interesting claim on your slides Martial "no integration with upstream ecosystem" ... that overlooks all the Docker composition tools
11:10:23 <oneswig> sorry - correction - your photo of their slides :-)
11:10:28 <martial__> no NVIDIA that I could see
11:11:14 <martial__> yes, Christian wants to make a clear distinction from true HPC usage
11:11:38 <martial__> among the cool stuff I saw:
11:11:49 <oneswig> martial__: were there many OpenStack / HPC folks there?
11:12:32 <martial__> Helm and openwhisk are two other popular topics of conversation
11:12:46 <oneswig> openwhisk?  What's that
11:12:50 <martial__> (oneswig: a few, that fellow from Sandia and Chris Hoge)
11:13:08 <priteau> Lambdas right?
11:13:23 <martial__> Dotmesh
11:13:44 <martial__> There was a demo of a live debugging service (Squash and more) that seemed powerful: https://www.solo.io (also on GitHub)
11:13:55 <oneswig> dotmesh - I know the founder, is the technology an evolution of the work they did with stateful containers at ClusterHQ?
11:14:17 <martial__> Calico https://www.projectcalico.org/getting-started/docker/
11:14:22 <martial__> Helm https://helm.sh
11:14:29 <martial__> Prometheus https://prometheus.io
11:14:30 <martial__> Istio https://istio.io
11:14:38 <martial__> Jaeger https://github.com/jaegertracing/jaeger
11:14:49 <martial__> and For istio https://github.com/IBM/istio101
11:15:19 <martial__> openwhisk https://openwhisk.apache.org/
11:15:39 <martial__> pierre: yes, like AWS Lambda
11:15:49 <martial__> a lot of serverless conversation
11:16:21 <martial__> oneswig: exactly
11:16:49 <oneswig> Without too much irony, OpenStack would benefit from a method for resilient event-driven function processing.  Would save all this polling
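A minimal sketch of the event-driven model under discussion, using Apache OpenWhisk's Python runtime (the action name and payload here are illustrative):

```
# hello.py: an OpenWhisk action fires on an event instead of polling state
cat > hello.py <<'EOF'
def main(params):
    # params carries the triggering event's payload
    name = params.get("name", "world")
    return {"greeting": "Hello " + name}
EOF
# deploy and invoke with the OpenWhisk CLI
wsk action create notify hello.py --kind python:3
wsk action invoke notify --result --param name ceph
```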
11:16:51 <martial__> oneswig: although I think they discussed it being a continuation of the work started a year ago
11:17:25 <oneswig> Be interesting to hear more about how it is being used.
11:17:41 <priteau> Is there talk of using serverless more in HPC use cases, or is it still mainly for web services?
11:17:41 <martial__> and that is it for my report from DockerCon ... talks and slides are said to be available soon
11:17:56 <martial__> just how soon is not clear
11:18:25 <oneswig> OK, we're on a shorter session today, should we move on? Thanks martial__
11:18:38 <b1airo> oneswig: isn't that what Mistral is for?
11:18:47 <martial__> oneswig: 👍
11:19:04 <janders> Ceph-RDMA! :)
11:19:08 <oneswig> b1airo: perhaps it is.  Never seems to make things faster though.
11:19:12 <oneswig> #topic Ceph-RDMA update
11:19:21 <oneswig> you ask, we give :-)
11:19:42 <oneswig> OK, so I wanted to follow up on the digging I've been doing on this subject since Vancouver.
11:20:05 <oneswig> I came away from there with a few new areas of activity which I've (briefly) explored
11:20:31 <oneswig> First up - anyone here used RDMA with Ceph?
11:21:10 <b1airo> the closest i've come is testing RoCE between OSD nodes
11:21:20 <b1airo> but not with actual OSD traffic
11:21:47 <oneswig> OK, so the way it works is to implement a derivation of Ceph's async messenger classes
11:22:01 <oneswig> There are currently two competing implementations
11:22:05 <janders> I haven't. Last time I checked (a while ago) it was experimental and used accelio
11:22:20 <oneswig> One from XSky in China that builds on IB verbs directly
11:22:35 <oneswig> A new one from Mellanox that uses a userspace messaging library called UCX
11:22:44 <oneswig> Which, in overall pitch, sounds a lot like Accelio
11:23:18 <oneswig> The XSky project has been running for probably a couple of years
11:23:46 <oneswig> I have the privilege right now of 3 different systems to test on: RoCE, IB and OPA.
11:23:50 <oneswig> It only works on RoCE.
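For reference, a sketch of what enabling the XSky-style messenger looks like in ceph.conf; the device name mlx5_0 and the memlock step are assumptions to check against your own HCA (ibv_devices):

```
# append RDMA messenger settings to /etc/ceph/ceph.conf
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 1000   # busy-poll interval; lower = more CPU
EOF
# RDMA pins memory, so the ceph daemons need a high memlock limit
printf 'ceph soft memlock unlimited\nceph hard memlock unlimited\n' >> /etc/security/limits.conf
```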
11:23:59 <janders> from a quick look it seems UCX replaced Accelio
11:24:26 <janders> http://www.mellanox.com/page/products_dyn?product_family=79&mtag=roce
11:24:38 <oneswig> janders: that would make sense.  This is the repo: http://www.openucx.org/
11:25:20 <oneswig> The async+ucx messenger is a separate implementation from async+rdma
11:25:41 <oneswig> It's still being actively developed but there is code available
11:25:51 <oneswig> https://github.com/Mellanox/ceph/tree/vasily-ucx
11:26:24 <oneswig> Unfortunately, it uses a new API for UCX that is not yet upstream, but may be "in a few weeks"
11:26:39 <oneswig> So we cannot test this code yet, unfortunately.
11:27:00 <martial__> (need to get going guys, sorry)
11:27:16 <b1airo> see ya martial__
11:27:30 <oneswig> martial__: thanks!
11:27:30 <janders> do you have any experience benchmarking "vanilla" ceph vs RDMA/ceph implementations?
11:27:43 <oneswig> janders: a bit, not enough yet.
11:27:49 <janders> I'm curious about 1) throughput 2) CPU utilisation
11:27:59 <oneswig> I'm working on a third way - there's a branch from Intel in China
11:28:00 <oneswig> https://github.com/tanghaodong25/ceph
11:28:24 <oneswig> This adds iWARP support (which I don't care about) and RDMA-CM support (which *should* enable me to use IB and OPA)
11:28:34 <oneswig> I'm working on that right now.
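An untested sketch against that branch; the option names (ms_async_rdma_cm, ms_async_rdma_type) are taken from that line of work and should be treated as assumptions until it merges:

```
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
ms_type = async+rdma
ms_async_rdma_cm = true      # use librdmacm for connection setup
ms_async_rdma_type = iwarp   # or omit, for IB/OPA over RDMA-CM
EOF
```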
11:29:23 <oneswig> So for testing, what I've seen is that people typically get performance by enabling busy polling, which totally obliterates the efficiency claims.  But if that's not enabled, it should (in theory) do better
11:29:43 <oneswig> Anyone used memstore for testing?
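For anyone wanting to try it, a sketch of a memstore setup that takes disk I/O out of the picture so the messenger is the bottleneck (sizes are illustrative):

```
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd objectstore = memstore
memstore device bytes = 4294967296   # 4 GiB of RAM per OSD
EOF
# or, from a Ceph source tree, spin up a throwaway dev cluster:
# MON=1 OSD=3 ../src/vstart.sh -n --memstore
```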
11:29:46 <b1airo> RDMA-CM sounds like it is probably the most logical and sustainable option
11:29:47 <janders> iWARP.. is that long-range RDMA that can be carried over commodity networks (eg Internet)?
11:30:01 <oneswig> It's RDMA over TCP, so it's routable.
11:30:21 <b1airo> oneswig: UDP actually i think?
11:30:23 <oneswig> Was a side-note in history (NetEffect, remember them?), but Intel has revived it
11:30:37 <b1airo> oh sorry, i was meaning RoCE v2
11:30:43 <oneswig> Apparently it now comes on-board on Intel Purley platforms
11:30:48 <janders> UDP sounds about right, but it's been a while since I looked into that
11:30:53 <oneswig> b1airo: could be, I thought TCP but am probably wrong
11:31:06 <b1airo> janders: RoCE v2 is routable too
11:31:22 <janders> do Mellanox call RoCEv2 RROCE?
11:31:30 <janders> (Routable ROCE)
11:31:49 <oneswig> Could lead to hilarity in mispronunciation attempts...
11:32:06 <b1airo> no you're right oneswig, iWARP is TCP - i was getting ahead of myself
11:32:19 <b1airo> janders: yes sometimes
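A quick way to see which flavour each GID on a RoCE HCA speaks, via sysfs (device and port are assumptions):

```
# prints e.g. "0: IB/RoCE v1" and "1: RoCE v2" per populated GID slot
for g in /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/*; do
    t=$(cat "$g" 2>/dev/null) && printf '%s: %s\n' "${g##*/}" "$t"
done
```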
11:32:24 <oneswig> janders: in terms of performance I have a specific issue to face
11:33:07 <oneswig> If I can avoid using IPoIB, I'm expecting a huge boost for free, which kind of distorts the uplift
11:33:57 <oneswig> I'm also experimenting with Mimic but the ceph-volume preparation keeps hanging for some reason.
11:34:11 <oneswig> Anyone see that?  This is with the CentOS packages
11:34:52 <b1airo> ceph-volume does seem a little buggy still, early days
11:35:00 <oneswig> Probably an answer to be found on the ceph mailing list
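A sketch of how one might chase the hang by hand (the device path is illustrative):

```
# run the stuck step interactively rather than via deployment tooling
ceph-volume lvm prepare --bluestore --data /dev/sdb
# in another terminal, watch what it is doing
tail -f /var/log/ceph/ceph-volume.log
# afterwards, confirm what (if anything) was created
ceph-volume lvm list
```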
11:35:28 <b1airo> oneswig: so did you get any tests working?
11:35:54 <oneswig> Only RoCE so far, and that's old data - just picking up these new pieces of work now
11:36:28 <b1airo> would you expect much of a difference with the other connection management options?
11:36:35 <oneswig> The RoCE system was I/O bound, not network bound, so it didn't make masses of difference
11:36:39 <janders> oneswig: regarding hangs, it's not anywhere near a "gathering keys" step, is it?
11:36:47 <oneswig> The connection management is an enabler, that's all
11:36:54 <oneswig> Without it, the parameters seem to be all wrong
11:37:08 <oneswig> janders: not in this instance.
11:37:56 <oneswig> The nodes are running ceph-disk or ceph-volume, but not making any progress
11:38:07 <oneswig> Well, I'll keep plugging on that.
11:38:14 <oneswig> janders: you're using BeeGFS, right?
11:38:18 <oneswig> Or was it Lustre?
11:38:42 <janders> it'll be BeeGFS - we don't have it yet, kit is ordered and in flight
11:39:00 <janders> we do have some GPFS too
11:39:02 <janders> no Lustre
11:39:09 <oneswig> Excellent - been hearing some good things on BeeGFS
11:39:24 <janders> this is a little off topic, but given you mentioned IPoIB - do you have any experience running software vxlan over IPoIB?
11:39:43 <oneswig> I think the summary on Ceph-RDMA is still "watch this space".
11:39:48 <janders> I'm not after outstanding performance (SRIOV will cover this part), more after flexibility
11:40:48 <oneswig> I keep getting reminded to get our BeeGFS playbooks into better shape so they are respectable
11:40:55 <b1airo> i don't, janders, but i would expect it to work given both are L3
11:40:58 <oneswig> One of these days...
11:41:28 <oneswig> Ah - VXLAN - just saw that.  What's the use case?
11:41:49 <janders> a bit of virtualised capacity on the baremetal cloud
11:41:58 <oneswig> It might have a better chance of working with the kernel module than with (say) OVS
11:42:12 <janders> without adding an extra ethernet fabric
11:42:40 <janders> (if it was mostly KVM w/o SRIOV, I'd probably go with Mellanox ethernet + vxlan offload)
11:43:15 <b1airo> yeah i was thinking kernel too
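A rough sketch of the kernel-module approach, assuming ib0 holds 10.0.0.1/24 and the peer endpoint is 10.0.0.2 (VNI, port and addresses are illustrative):

```
# point-to-point VXLAN over the IPoIB underlay, no OVS involved
ip link add vxlan100 type vxlan id 100 dev ib0 dstport 4789 \
    local 10.0.0.1 remote 10.0.0.2
# leave headroom for the ~50-byte VXLAN/UDP overhead inside the IPoIB MTU
ip link set vxlan100 mtu 1950 up
ip addr add 192.168.100.1/24 dev vxlan100
```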
11:44:16 <janders> noted! thanks guys
11:44:59 <oneswig> janders: the mix of bare metal + virt is interesting to us, can you come back some time and tell us how your solution's working out?
11:45:13 <janders> oneswig: sure! :)
11:45:19 <janders> will do
11:45:47 <oneswig> b1airo: we should wrap up given your timing
11:45:52 <oneswig> #topic AOB
11:46:04 <oneswig> not that we've rigidly kept to topic thus far.
11:46:09 <oneswig> Any other news?
11:46:11 <janders> maybe just quickly as this was missed last week
11:46:16 <b1airo> on the Ceph note, anyone seen any material comparing bluestore with rocksdb metadata.db on ssd/flash versus using underlying block-caching layer and presenting a single cached device to bluestore ?
11:46:16 <janders> focus areas for the cycle
11:46:21 <janders> I propose: spot instances
11:46:41 <janders> (I think I saw Belmiro joining :)
11:47:13 <b1airo> there were a couple of presentations from the Beijing Cephalocon, but they are in Chinese and no slide decks were published
11:47:16 <oneswig> b1airo: got a link?
11:47:24 <oneswig> ah
11:47:32 <b1airo> i'm after a link! :-)
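For context, a sketch of the two layouts b1airo is asking about (device names are illustrative):

```
# (a) explicit RocksDB DB device on flash, data on HDD
ceph-volume lvm prepare --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# (b) block-level cache (bcache here) presenting one cached device
make-bcache -C /dev/nvme0n1p2 -B /dev/sdc
ceph-volume lvm prepare --bluestore --data /dev/bcache0
```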
11:47:41 <oneswig> janders: spot instances - agreed - we are after this too
11:48:02 <oneswig> The "missing link" for us is how to do it nicely at the platform level
11:48:37 <janders> oneswig: I think we need spot instances + clever chargeback to make sure resources are used efficiently
11:48:55 <oneswig> Totally, that's how we see it too.
11:49:07 <oneswig> Anyone going to ISC next week?
11:49:16 <janders> otherwise - people will either hog resources, or there'll be a lot of unused capacity that ain't good either
11:49:29 <janders> I was supposed to, but ended up not going - too much work
11:49:40 <daveholland> Pete is going for Sanger
11:50:20 <oneswig> John T will be there for us, I'm going to HPCKP (tomorrow) instead
11:50:47 <janders> it'll be interesting to hear more about the new Summit machine
11:50:58 <janders> that's a few more Volta GPUs than we have... :)
11:51:08 <b1airo> indeed!
11:51:31 <b1airo> i just got some quotes with the new 32GB V100 variants
11:52:01 <janders> how will you run 'em? baremetal? vGPU?
11:53:18 <b1airo> passthrough mostly i think
11:54:07 <b1airo> probably make these ones the new compute hosts and use some of the existing 16GB fleet for vGPU (desktop nodes in the cluster service)
11:54:51 <janders> makes sense!
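A sketch of Queens-era Nova PCI passthrough for V100s; the product_id 1db4 (V100-PCIE-16GB) and the flavor name are assumptions, check with lspci -nn:

```
cat >> /etc/nova/nova.conf <<'EOF'
[pci]
passthrough_whitelist = {"vendor_id": "10de", "product_id": "1db4"}
alias = {"vendor_id": "10de", "product_id": "1db4", "device_type": "type-PCI", "name": "v100"}
EOF
# then request a GPU via flavor extra specs
openstack flavor set gpu.large --property "pci_passthrough:alias"="v100:1"
```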
11:55:15 <b1airo> on the GPU topic, i've been trying to do some multi-gpu profiling and getting gpudirect running within KVM, have overcome some minor issues but no major result comparisons to share yet
11:55:50 <oneswig> Ah, too bad.  Would be great to know the state of the art while you're still able to
11:55:56 <b1airo> if i get it together i'll post it back to the lists as a follow up to jon from csail@mit
11:56:07 <oneswig> Sounds good.
11:57:24 <b1airo> i need to grab one of the hosts we have that supports gpu p2p and try the experimental qemu gpu clique support. the host i've reserved at the moment only has two P100s on different root complexes. rubbish ;-)
11:58:12 <janders> P100s... can't work with that... ;)
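The experimental knob b1airo mentions is vfio-pci's x-nv-gpudirect-clique (QEMU >= 2.11); a fragment to append to a VM's QEMU arguments, with illustrative PCI addresses, both GPUs sharing clique id 0 so the guest driver advertises P2P:

```
# append to an existing QEMU command line (or libvirt <qemu:commandline>)
-device vfio-pci,host=0000:06:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=0000:07:00.0,x-nv-gpudirect-clique=0
```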
11:58:57 <oneswig> OK - anything more to add?
11:59:15 <janders> I'm good - thank you guys!
11:59:39 <janders> oneswig: I will keep you posted on virt+ironic mix,
11:59:42 <oneswig> Same here - thanks everyone
11:59:48 <oneswig> janders: please do, sounds interesting
11:59:54 <janders> should have some data in the coming weeks
12:00:00 <b1airo> https://ardc.edu.au/
12:00:45 <oneswig> b1airo: NeCTAR-NG?
12:01:05 <b1airo> yep. 5 yr NCRIS roadmap looks to be funded reasonably well - $60m odd for capex that isn't peak hpc
12:01:28 <janders> cool!
12:01:35 <oneswig> Wow, sounds good
12:01:44 <oneswig> OK, gotta close the meeting
12:01:51 <oneswig> bon voyage b1airo!
12:01:56 <b1airo> night gents
12:02:02 <janders> night
12:02:03 <oneswig> #endmeeting