21:00:26 <oneswig> #startmeeting scientific-sig
21:00:28 <openstack> Meeting started Tue Oct  2 21:00:26 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:31 <openstack> The meeting name has been set to 'scientific_sig'
21:00:42 <oneswig> I made it.  Greetings all.
21:00:51 <janders> g'day oneswig :)
21:00:59 <oneswig> Hey janders!
21:01:14 <janders> do you have any experience with NVMe-oF in the OpenStack context?
21:01:27 <trandles> hello all
21:01:32 <janders> (cinder/nova/swift/... backend)
21:01:33 <oneswig> ooh, straight down to it :-)
21:01:38 <oneswig> hey trandles!
21:01:40 <janders> yes! :)
21:01:58 <oneswig> Not direct experience of NVMEoF
21:02:11 <oneswig> keep looking for a good opportunity.
21:02:38 <oneswig> janders: what's come up?
21:02:46 <oneswig> Is this in bare metal or virt?
21:03:26 <janders> I was chatting to Moshe and the Intel guys about it in Vancouver and they said that 1) the cinder driver is ready and will be supported in Pike but 2) they haven't really gone beyond the scale of a single NVMe device with their testing (e.g. no NVMe-oF arrays etc)
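For context, the in-tree way to expose NVMe-oF from Cinder is the LVM driver with the kernel nvmet target. A minimal cinder.conf sketch, assuming a Queens-or-later release (older releases spelled the target_* options differently) and an RDMA-capable storage network; addresses and backend name are illustrative:

    [nvme-lvm]
    volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
    volume_group = cinder-volumes
    volume_backend_name = nvme-lvm
    # kernel NVMe-oF target, exported over RDMA (RoCE or IB)
    target_helper = nvmet
    target_protocol = nvmet_rdma
    target_port = 4420
    target_ip_address = 10.10.10.1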
21:03:56 <janders> now I was chatting to my storage boss colleague about spec'ing out a backend for the cyber system and he pulled out some array-type options that do support NVMe-oF
21:04:02 <oneswig> I recently saw PR from Excelero - a German cloud (teuto.net?) have built a substantial data platform upon it.
21:04:21 <janders> found it very interesting - I didn't know anyone offers NVMe-of appliances just yet
21:04:32 <janders> he quoted E8
21:04:47 <oneswig> Ah yes, I've heard of them too.
21:05:05 <oneswig> Not sure on the unique selling points to distinguish them (I am sure there are some)
21:05:41 <janders> you asked what this is for - our cyber system (which will do both bare metal and VMs)
21:06:02 <oneswig> new project?
21:06:05 <janders> no 1) use case would be persistent volumes, but the more OpenStack services can use the backend the better
21:06:53 <janders> long story. Short version is that what I was talking about in Vancouver has been split up into a few parts and this is in the context of one of them
21:07:00 <oneswig> I've previously used iSER in a similar context.  It worked really well for large (>64K) transfers, was pretty lame below that.  This was serving Cinder volumes via RDMA to VMs.
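An iSER-backed Cinder setup along those lines typically maps onto the same LVM driver, just with the LIO target speaking the iSCSI Extensions for RDMA; a sketch of the relevant cinder.conf lines (option names per recent releases; older ones used the iscsi_* spellings):

    [iser-lvm]
    volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
    volume_group = cinder-volumes
    target_helper = lioadm
    target_protocol = iser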
21:07:20 <janders> was that IB or RoCE?
21:07:27 <jmlowe> Hi everybody
21:07:28 <oneswig> RoCE in this case.
21:07:39 <oneswig> Hey Mike - good to see you
21:07:58 <janders> ok! that is encouraging - and important given that quite a few NVMe-oF players do go RoCE and not IB
21:08:30 <janders> and on a related but slightly different front - if you don't mind me asking
21:08:33 <oneswig> Are you thinking of buying it in or building yourself?
21:08:44 <janders> for your ironic ethernet+IB system, what ethernet NICs do you use?
21:08:52 <janders> are those Mellanox or non-Mellanox?
21:09:16 <oneswig> Mellanox ConnectX4 VPI works well - dual 100G, 1 Ethernet, 1 IB
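Splitting a VPI card one port IB / one port Ethernet is done with mlxconfig from the Mellanox firmware tools; a sketch, where the MST device path is whatever `mst status` reports for the card (mt4115_pciconf0 here is illustrative):

    mst start
    # query current port personalities
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
    # 1 = InfiniBand, 2 = Ethernet; takes effect after a reboot
    mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1 LINK_TYPE_P2=2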
21:09:30 <janders> ok!
21:09:48 <oneswig> Although in this specific case the Ethernet was down-clocked to 25G
21:09:49 <janders> did you need to do anything to make sure that the Ethernet port doesn't steal bandwidth away from IB?
21:09:56 <janders> reading my mind :)
21:09:58 <oneswig> No, nothing major
21:10:15 <janders> was the 25G cap due to PCIe3x16 bandwidth limitations?
21:10:35 <oneswig> If you want to use Secure Host for the multi-tenant IB, I believe you're stuck with ConnectX3-pro - unless you know otherwise?
21:10:58 <janders> all my POC nodes are either 2xFDR or 4xFDR so haven't had the PCIe3x16 problem just yet
21:11:00 <oneswig> janders: no, switch budget, and I think it was considered more representative of the use case requirements
21:11:24 <janders> you are right regarding Secure Host - I hope to work this out with Erez and his team
21:11:45 <janders> do you know if mlnx have any good reason NOT to implement SecureHost for CX4 and above?
21:12:00 <janders> or is it that no one kept asking persistently enough? :)
21:12:00 <oneswig> I don't know.
21:12:19 <janders> I will have a go at this soon :) hopefully I will get something out and it will be useful to you guys too
21:12:21 <oneswig> I think your theory's good.
21:12:43 <janders> regarding the appliance vs roll your own for NVME-oF - we'll consider both approaches
21:13:26 <janders> the idea of hyperconverged storage in a bare-metal cloud context is quite interesting actually, but I think this is a potentially endless topic
21:13:57 <janders> first we discouraged it as not feasible but it came back and it might actually work
21:14:38 <oneswig> Our thinking has been more towards disaggregated storage recently.
21:15:07 <oneswig> Possibly a single block device hyperconverged - to keep the resource footprint low on the compute nodes.
21:15:22 <oneswig> Might be interesting if also nailed down with cgroups
21:16:20 <janders> we're not there yet - but in a fully containerised context, where the whole deployment is bare-metal OpenStack running k8s, it could be just as simple as it used to be in the hypervisor-centric world
21:16:54 <oneswig> We did a little testing on resource footprint with hyperconverged (non-containerised) on bare metal.  I think the containers could work.
21:18:08 <oneswig> On the extreme bandwidth topic, one of our team has been getting really great numbers from BeeGFS at scale recently.  Servers with NVMe and dual OPA exceeding 21 GB/s for reads.  Yeee-ha
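For a quick sanity check of large sequential reads from a single client against a BeeGFS mount, an fio run along these lines is a common starting point (mount path, sizes and job count are illustrative):

    fio --name=seqread --directory=/mnt/beegfs/bench \
        --rw=read --bs=1m --size=16g --numjobs=8 \
        --direct=1 --group_reporting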
21:18:26 <janders> wow!
21:18:56 <oneswig> And there are lots of them in this setup too...
21:19:25 <janders> our BeeGFS is being built (after some shipping delays). It's 24x NVMe and 2xEDR per box so hoping for some good numbers, too
21:19:33 <oneswig> There was talk of offering this resource in a NVMEoF mode, but it's secondary to scratch BeeGFS filesystems
21:19:53 <janders> again our thinking is quite closely aligned :)
21:19:54 <oneswig> Sounds pretty similar, in which case it should fly
21:20:16 <janders> our guys are thinking 1) build up the BeeGFS on half of the nodes and see how hard the users will push it
21:20:52 <janders> if it's not being pushed too hard, we can use some of the NVMes for NVMe-oF, BeeGFS-OND etc
21:20:53 <jmlowe> Is BeeGFS taking root in Europe?  I'm not aware of any deployments in the US
21:21:38 <oneswig> jmlowe: It seems to be gaining ground over here.  Some people got shaken by recent events around Lustre, I guess
21:21:59 <janders> good BeeGFS-OpenStack integration would be an amazing thing to have
21:22:00 <oneswig> The performance we've seen from it is as good or better than Lustre, on the same hardware.
21:22:17 <oneswig> janders: Manila! Lets do this!
21:22:26 <janders> I don't really have much experience with non-IB BeeGFS deployments, but I do know it's quite flexible so it could be quite useful outside HPC too
21:22:32 <janders> +1! :)
21:22:54 <janders> the tricky bit is no Kerberos support (or prospects for it), last time I checked
21:23:11 <jmlowe> I'm really happy with nfs-ganesha/cephfs/manila so far, going to try to start rolling it out
21:23:24 <janders> this can be worked around, but it does make life much harder
21:23:25 <jmlowe> so all you'd really need to get bootstrapped is a BeeGFS FSAL for ganesha
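For reference, the FSAL is just a stanza inside a ganesha.conf export block; the CephFS one that Manila drives today looks roughly like this (export ID, paths and cephx user are illustrative), and a BeeGFS FSAL would slot into the same place:

    EXPORT {
        Export_Id = 100;
        Path = "/volumes/_nogroup/share1";
        Pseudo = "/share1";
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;
            User_Id = "manila";
        }
    }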
21:24:01 <oneswig> jmlowe: I'd be interested to hear how that flies when it's under heavy load, really interested.
21:24:36 <jmlowe> benchmarking from a single node it's faster than native cephfs for smaller files
21:24:53 <oneswig> This week we've been experimenting with a kubernetes CSI driver for BeeGFS but it's really early days for that.
21:25:24 <oneswig> jmlowe: an extra layer of caching at play there?  Any theories?
21:25:47 <jmlowe> that was my go to theory
21:27:00 <oneswig> That's pretty interesting.  Does Ganesha run in the hypervisor or separate provisioned nodes?
21:27:32 <jmlowe> separate
21:28:33 <oneswig> so what's new with Jetstream jmlowe?
21:31:06 <oneswig> I've had a frustrating day debugging bizarre stubbornness with getting Heat to do more things
21:31:31 <janders> what specifically?
21:31:53 <janders> heat is always "fun" (even though tight integration with all the other APIs is pretty cool)
21:31:57 <trandles> +1 frustrating day...but for non-fun non-technical reasons
21:32:09 <oneswig> Creating resource groups of nodes with parameterised numbers of network interfaces.
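A minimal HOT sketch of the pattern being attempted, assuming the per-node network list arrives as a comma_delimited_list (all names illustrative); the repeat function expands it into each server's networks property:

    heat_template_version: 2017-02-24
    parameters:
      node_count:
        type: number
      node_image:
        type: string
      node_flavor:
        type: string
      node_networks:
        type: comma_delimited_list
    resources:
      nodes:
        type: OS::Heat::ResourceGroup
        properties:
          count: { get_param: node_count }
          resource_def:
            type: OS::Nova::Server
            properties:
              image: { get_param: node_image }
              flavor: { get_param: node_flavor }
              networks:
                repeat:
                  for_each:
                    <%net%>: { get_param: node_networks }
                  template:
                    network: <%net%>

Flat lists like this tend to behave; the pain tends to start when the parameter is a nested json structure.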
21:32:28 <oneswig> trandles: feel your pain.
21:32:40 <janders> ouch
21:32:42 <oneswig> Did you want to mention those open positions?
21:32:59 <janders> if I remember correctly, that was non-trivial even in ansible..
21:33:01 <oneswig> (somewhat select audience here, mostly of non-US nationals...)
21:33:10 <trandles> ah...
21:33:11 <trandles> https://jobszp1.lanl.gov/OA_HTML/OA.jsp?OAFunc=IRC_VIS_VAC_DISPLAY&OAMC=R&p_svid=68995&p_spid=3171345&p_lang_code=US
21:33:20 <trandles> I need to get it up on the openstack jobs board
21:33:21 <jmlowe> We are getting some new IU-only hardware it looks like, 20 V100s and 16 of Dell's latest blades
21:33:39 <oneswig> janders: they appear to share an issue with handling parameters that are yaml-structured compound data objects
21:34:03 <oneswig> trandles: the jobs board seems to work
21:34:06 <jmlowe> I need to figure out a good way to partition so that NSF users only get the hardware the NSF purchased and IU users get only the hardware IU purchased
21:34:28 <trandles> we've done zero advertising and are pulling in resumes
21:34:36 <trandles> so fingers crossed we get the right candidate
21:35:07 <oneswig> jmlowe: there are ways and means for that.
21:35:23 <jmlowe> I was thinking via private flavors and host aggregates
21:35:25 <trandles> jmlowe: I hate that hard partitioning problem.  We almost always try to argue "you'll get the equivalent amount of time for the dollars you provided" but then end up in arguments about normalization
21:35:36 <oneswig> I forget the current best practice, but linking project flavors and custom resources seems to be in fashion
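A sketch of the aggregate-plus-private-flavor approach with the standard CLI (aggregate, host, flavor and project names are all illustrative; it also needs the AggregateInstanceExtraSpecsFilter enabled in the Nova scheduler):

    # tag the institutionally-purchased hosts
    openstack aggregate create --property hw_owner=iu iu-hosts
    openstack aggregate add host iu-hosts compute-iu-01
    # private flavor pinned to that aggregate, shared only with the IU project
    openstack flavor create --vcpus 8 --ram 65536 --disk 40 --private iu.gpu.v100
    openstack flavor set iu.gpu.v100 \
        --property aggregate_instance_extra_specs:hw_owner=iu
    openstack flavor set iu.gpu.v100 --project iu-project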
21:36:54 <jmlowe> It's a generation-newer CPU, so they really are different: Skylake vs Haswell, I think
21:37:45 <janders> jmlowe: do you need to worry about idle resources in this scenario, or is the demand greater than the supply across the board?
21:38:20 <oneswig> jmlowe: what will you do for network fabric between those V100s?
21:38:29 <jmlowe> not terribly concerned about idle resources
21:38:36 <jmlowe> 2x10GigE
21:39:52 <janders> briefly jumping back to the SDN discussion - do you guys know if it's possible to have a single port virtual function on a dual port CX4/5/6?
21:39:55 <oneswig> Had a question on this - does anyone have experience on mixing SR-IOV with bonded NICs?  Is there a way to make that work?
21:39:56 <jmlowe> looking at NVIDIA GRID and slicing 'n' dicing them, let's say to run a workshop on the OpenMP-like automagical GP-GPU compiler thingy whose name escapes me right now
21:40:06 <oneswig> janders: snap :-)
21:40:49 <jmlowe> I would think you would need to do the bonding inside the vm in order for SR-IOV to work with bonding
21:40:58 <janders> oneswig: I think Mellanox do have some clever trick to do in-hardware LACP in CX5 / Ethernet. But that's me repeating their words, I haven't tested that
21:41:15 <trandles> SR-IOV and bonded NICs sounds like evil voodoo
21:41:39 <oneswig> trandles: it does indeed.
21:41:50 <jmlowe> if you consider that SR-IOV is just multiplexed PCI passthrough then it doesn't sound so scary
21:42:15 <trandles> but across multiple physical devices?
21:43:08 <oneswig> trandles: there's two scenarios - 2 nics or 1 dual-port nic - perhaps the second might work, the first, I am not confident
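For the dual-port-NIC case, the guest-side bonding jmlowe mentions is plain Linux bonding over the two VF netdevs; a sketch with iproute2 (interface names and addressing illustrative, active-backup mode chosen because LACP across VFs generally needs NIC/switch support like the Mellanox feature mentioned above):

    # inside the guest: enslave both VF interfaces to an active-backup bond
    ip link add bond0 type bond mode active-backup miimon 100
    ip link set ens4 down
    ip link set ens5 down
    ip link set ens4 master bond0
    ip link set ens5 master bond0
    ip link set ens4 up
    ip link set ens5 up
    ip link set bond0 up
    ip addr add 192.0.2.10/24 dev bond0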
21:43:31 <janders> http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/pdf/Monday09April/Cohen_NetworkingAdvantage_Mon090418.pdf
21:43:32 <janders> slide 10
21:43:40 <janders> I think this is what Erez was talking about
21:44:07 <oneswig> No way - I was there :-)
21:44:40 <trandles> dual-port NICs sounds like something manageable...multiple NICs sounds like driver hell
21:44:53 <janders> I wasn't but got a good debrief in Vancouver
21:45:39 <janders> hmm now I wonder if I should ask for 1xHDR and 1xdual port EDR  for my new blades :)
21:45:39 <jmlowe> trandles: never had fun in the bad old days altering a NIC's MAC address to defeat MAC address filtering?
21:46:19 <janders> ..or even 2xEDR and 2xHDR all active/passive for maximum availability
21:46:29 <oneswig> janders: might push out the delivery date, beyond an event horizon
21:46:46 <janders> oneswig: this is a very valid point
21:47:10 <janders> I do have some kit that can be used in the interim, but perhaps retrofitting HDR later isn't a bad idea
21:47:50 <jmlowe> Did live migration with Mellanox SR-IOV ever come to fruition?
21:47:59 <janders> not that I know of
21:48:06 <oneswig> DK Panda presented something on this
21:48:11 <janders> however cold migration is now supposed to work (historically wasn't the case)
21:48:23 <oneswig> But it's never clear if his work will be generally available (or when)
21:48:30 <jmlowe> damn, that would have been killer, DK Panda hinted at it during the 2017 HPC Advisory Council
21:48:46 <jmlowe> well, that's something at least
21:48:55 <oneswig> He came back with numbers, just no github URL (as far as I remember)
21:50:07 <oneswig> jmlowe: if you look for his video from Lugano 2018, he covered some results.  It's possible he was presenting on using RDMA to migrate the VM, I don't recall exactly.
21:51:45 <oneswig> trandles: we were talking Charliecloud over here the other day.  Any news from the project?
21:51:51 <jmlowe> That's an entirely different beast, not to say it's not useful for something like postcopy live migration, but it's not what I was promised with my flying car
21:52:22 <oneswig> jmlowe: transporting your physical person, I wouldn't trust RDMA for that :-)
21:52:44 <janders_> c'mon... isn't IB lossless ;)?
21:52:56 <trandles> Charliecloud: Reid has been kicking ass on implementing some shared object voodoo to get MPI support automagically inside a containerized application on both OpenMPI installations and Cray MPICH
21:53:11 <jmlowe> see star trek quote
21:54:09 <oneswig> trandles: this is for the kind of problems that Shifter tackles by grafting in system MPI libraries?
21:54:28 <trandles> we have also had quite a bit of success putting some large multi-physics Lagrangian applications into containers and running automated workflows of parameter studies
21:55:17 <trandles> Reid had a few chats with the Shifter guys (Shane and Doug) about the gotchas they found and then implemented his own method of doing something similar
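The run side of that is deliberately thin: the host's Slurm/MPI launcher starts the ranks and the container only supplies the userspace. A sketch (image name, path and task geometry illustrative; the image build/flatten commands have been renamed across Charliecloud releases):

    # unpack a flattened image, then launch under Slurm
    ch-tar2dir mpi_app.tar.gz /var/tmp
    srun -N 4 --ntasks-per-node=32 ch-run /var/tmp/mpi_app -- /app/mpi_hello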
21:55:39 <oneswig> trandles: do you know of people successfully using MPI + Kubernetes?  It keeps coming up here and I'm suspicious it's missing some key PMI glue
21:56:06 <trandles> MPI + Kubernetes: A Story of Oil and Water
21:56:16 <janders_> LOL
21:56:20 <oneswig> trandles: so nearly vinaigrette
21:56:30 <trandles> not nearly as delicious
21:56:38 <janders_> no..
21:56:53 <jmlowe> just needs some egg
21:56:59 <trandles> I know more than I ever thought I'd know (or wanted to know) about how MPI starts up jobs
21:57:18 <jmlowe> mmm, sausage
21:57:26 <jmlowe> it's dinner time here
21:57:26 <janders_> I don't think there's an easy way of plumbing RDMA into k8s managed docker containers at this point in time - please correct me if I am wrong
21:57:32 <trandles> long story short, ORTE is dead (not a joke...), PRTE is specialized and will be unsupported, PMIx is the future (maybe...)
21:58:38 <trandles> what that all means is in the near future (once OpenMPI 2.x dies off) there won't be an mpirun to start jobs
21:59:20 <trandles> what we do is compile slurm with PMIx support and srun wires it all up
22:00:03 <trandles> k8s would need that PMIx stuff to properly start a job IMO
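The Slurm-plus-PMIx wiring trandles describes looks roughly like this (install paths, versions and job geometry illustrative):

    # build Slurm against an external PMIx, then launch without mpirun
    ./configure --with-pmix=/opt/pmix && make && make install
    srun --mpi=list                      # should now list a pmix plugin
    srun --mpi=pmix -N 4 --ntasks-per-node=32 ./mpi_app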
22:00:03 <oneswig> It's a missing piece for AI at scale, I suspect it'll happen one way or another and the deep learning crowd will drive it.
22:00:28 <oneswig> trandles: any evidence of k8s getting extended with that support?
22:00:40 <trandles> I don't follow k8s enough right now to know
22:01:00 <oneswig> hmmm.  OK the quest continues but that's more doors closed.
22:01:05 <trandles> for MPI jobs, k8s runtime model feels like an anti-pattern
22:01:21 <trandles> ie. when a rank dies, MPI dies
22:01:23 <oneswig> All that careful just-so placement...
22:01:28 <trandles> yeah
22:01:48 <oneswig> Alas we are out of time.
22:01:55 <oneswig> final comments?
22:02:03 <trandles> for things like DASK, multi-node TensorFlow, other ML/AI frameworks, k8s makes more sense
22:02:06 <janders_> thanks guys, great chat!
22:02:14 <trandles> janders_: +1
22:02:18 <trandles> later everyone
22:02:20 <jmlowe> I'll miss most of you in Dallas
22:02:33 <oneswig> jmlowe: you're not going?
22:02:37 <trandles> jmlowe: you're in Dallas or not in Dallas?
22:02:50 <jmlowe> I'm going, was under the impression most were going to the Berlin Summit instead
22:03:04 <oneswig> Ich bin ein Berliner
22:03:09 <trandles> ah, ok, I'll be in Dallas Monday night, all day Tuesday...leaving early Wednesday morning
22:03:11 <jmlowe> and I need to buy plane tickets RSN
22:03:56 <oneswig> OK y'all, until next time
22:03:59 <oneswig> #endmeeting