21:00:26 #startmeeting scientific-sig
21:00:28 Meeting started Tue Oct 2 21:00:26 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:29 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:31 The meeting name has been set to 'scientific_sig'
21:00:42 I made it. Greetings all.
21:00:51 g'day oneswig :)
21:00:59 Hey janders!
21:01:14 do you have any experience with NVME-oF in the OpenStack context?
21:01:27 hello all
21:01:32 (cinder/nova/swift/... backend)
21:01:33 ooh, straight down to it :-)
21:01:38 hey trandles!
21:01:40 yes! :)
21:01:58 Not direct experience of NVMEoF
21:02:11 keep looking for a good opportunity.
21:02:38 janders: what's come up?
21:02:46 Is this in bare metal or virt?
21:03:26 I was chatting to Moshe and the Intel guys about it in Vancouver and they said that 1) the cinder driver is ready and will be supported in Pike but 2) they haven't really gone beyond a scale of a single NVMe device with their testing (eg no NVMe-oF arrays etc)
21:03:56 now I was chatting to my storage boss colleague about specing out a backend for the cyber system and he pulled out some array type options that do do NVMe-oF
21:04:02 I recently saw PR from Excelero - a German cloud (teuto.net?) have built a substantial data platform upon it.
21:04:21 found it very interesting - I didn't know anyone offers NVMe-oF appliances just yet
21:04:32 he quoted E8
21:04:47 Ah yes, I've heard of them too.
21:05:05 Not sure on the unique selling points to distinguish them (I am sure there are some)
21:05:41 you asked what is this for - our cyber system (which will do both baremetal and VM)
21:06:02 new project?
21:06:05 no 1) use case would be persistent volumes, but the more OpenStack services can use the backend the better
21:06:53 long story. Short version is what I was talking about in Vancouver has been split up in few parts and this is in the context of one of them
21:07:00 I've previously used iSER in a similar context. It worked really well for large (>64K) transfers, was pretty lame below that. This was serving Cinder volumes via RDMA to VMs.
21:07:20 was that IB or RoCE?
21:07:27 Hi everybody
21:07:28 RoCE in this case.
21:07:39 Hey Mike - good to see you
21:07:58 ok! that is encouraging - and important given that quite a few NVME-oF players do go RoCE and not IB
21:08:30 and on a related but slightly different front - if you don't mind me asking
21:08:33 Are you thinking of buying it in or building yourself?
21:08:44 for your ironic ethernet+IB system, what ethernet NICs do you use?
21:08:52 are those Mellanox or non-Mellanox?
21:09:16 Mellanox ConnectX4 VPI works well - dual 100G, 1 Ethernet, 1 IB
21:09:30 ok!
21:09:48 Although in this specific case the Ethernet was down-clocked to 25G
21:09:49 did you need to do anything to make sure that the Ethernet port doesn't steal bandwidth away from IB?
21:09:56 reading my mind :)
21:09:58 No, nothing major
21:10:15 was the 25G cap due to PCIe3x16 bandwidth limitations?
21:10:35 If you want to use Secure Host for the multi-tenant IB, I believe you're stuck with ConnectX3-pro - unless you know otherwise?
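
A note on the NVMe-oF discussion above: whether the backend is an appliance or a roll-your-own target, the host-side plumbing that os-brick drives is the standard nvme-cli workflow over an RDMA transport (RoCE or IB). A minimal sketch only; the target address, port and subsystem NQN below are placeholders:

    # Load the NVMe RDMA fabrics module on the initiator (compute) host
    modprobe nvme-rdma

    # Discover subsystems exported by the target (address/port are placeholders)
    nvme discover -t rdma -a 10.0.0.1 -s 4420

    # Connect to a discovered subsystem by its NQN (placeholder NQN shown)
    nvme connect -t rdma -n nqn.2018-10.org.example:vol-0001 -a 10.0.0.1 -s 4420

    # The remote namespace then shows up as a local block device, e.g. /dev/nvme1n1
    nvme list
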
21:10:58 all my POC nodes are either 2xFDR or 4xFDR so haven't had the PCIe3x16 problem just yet
21:11:00 janders: no, switch budget, and I think it was considered more representative of the use case requirements
21:11:24 you are right regarding Secure Host - I hope to work this out with Erez and his team
21:11:45 do you know if mlnx have any good reason NOT to implement SecureHost for CX4 and above?
21:12:00 or is it that no one kept asking persistently enough? :)
21:12:00 I don't know.
21:12:19 I will have a go at this soon :) hopefully I will get something out and it will be useful to you guys too
21:12:21 I think your theory's good.
21:12:43 regarding the appliance vs roll your own for NVME-oF - we'll consider both approaches
21:13:26 the idea of hyperconverged storage in a bare-metal cloud context is quite interesting actually, but I think this is a potentially endless topic
21:13:57 first we discouraged it as not feasible but it came back and it might actually work
21:14:38 Our thinking has been more towards disaggregated storage recently.
21:15:07 Possibly a single block device hyperconverged - to keep the resource footprint low on the compute nodes.
21:15:22 Might be interesting if also nailed down with cgroups
21:16:20 we're not there yet - but in a fully containerised context where the whole deployment is baremetal openstack running k8s, it could be just as simple as it used to be in the hypervisor centric reality
21:16:54 We did a little testing on resource footprint with hyperconverged (non-containerised) on bare metal. I think the containers could work.
21:18:08 On the extreme bandwidth topic, one of our team has been getting really great numbers from BeeGFS at scale recently. Servers with NVME and dual OPA exceeding 21GBytes/s for reads. Yeee-ha
21:18:26 wow!
21:18:56 And there are lots of them in this setup too...
21:19:25 our BeeGFS is being built (after some shipping delays). It's 24x NVMe and 2xEDR per box so hoping for some good numbers, too
21:19:33 There was talk of offering this resource in a NVMEoF mode, but it's secondary to scratch BeeGFS filesystems
21:19:53 again our thinking is quite closely aligned :)
21:19:54 Sounds pretty similar, in which case it should fly
21:20:16 our guys are thinking 1) build up the BeeGFS on half of the nodes and see how hard the users will push it
21:20:52 if it's not being pushed too hard, we can use some of the NVMes for NVMe-oF, BeeGFS-OND etc
21:20:53 Is BeeGFS taking root in Europe? I'm not aware of any deployments in the US
21:21:38 jmlowe: It seems to be gaining ground over here. Some people got shaken with recent events in Lustre I guess
21:21:59 good BeeGFS-OpenStack integration would be an amazing thing to have
21:22:00 The performance we've seen from it is as good or better than Lustre, on the same hardware.
21:22:17 janders: Manila! Let's do this!
21:22:26 I don't really have much experience with non-IB BeeGFS deployments, but I do know it's quite flexible so it could be quite useful outside HPC too
21:22:32 +1! :)
21:22:54 the tricky bit is no kerberos support (or prospects for it) last time I checked
21:23:11 I'm really happy with nfs-ganesha/cephfs/manila so far, going to try to start rolling it out
21:23:24 this can be worked around, but it does make life much harder
21:23:25 so all you'd really need to get bootstrapped is a BeeGFS FSAL for ganesha
21:24:01 jmlowe: I'd be interested to hear how that flies when it's under heavy load, really interested.
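
For context on the nfs-ganesha/cephfs/manila setup mentioned above, the tenant-facing workflow is just the standard Manila one; a BeeGFS FSAL for ganesha would ideally slot in behind the same API. A rough sketch with the python-manilaclient CLI, assuming a hypothetical CephFS-over-NFS share type named "cephfsnfs" (share name, size and CIDR are placeholders):

    # Create a 100 GB NFS share on the ganesha-backed share type
    manila create NFS 100 --name scratch01 --share-type cephfsnfs

    # Allow a tenant network to mount it
    manila access-allow scratch01 ip 10.0.0.0/24

    # Get the export path to mount from the instances
    manila share-export-location-list scratch01
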
21:24:36 benchmarking from a single node it's faster than native cephfs for smaller files
21:24:53 This week we've been experimenting with a kubernetes CSI driver for BeeGFS but it's really early days for that.
21:25:24 jmlowe: an extra layer of caching at play there? Any theories?
21:25:47 that was my go to theory
21:27:00 That's pretty interesting. Does Ganesha run in the hypervisor or separate provisioned nodes?
21:27:32 separate
21:28:33 so what's new with Jetstream jmlowe?
21:31:06 I've had a frustrating day debugging bizarre stubbornness with getting Heat to do more things
21:31:31 what specifically?
21:31:53 heat is always "fun" (even though tight integration with all the other APIs is pretty cool)
21:31:57 +1 frustrating day...but for non-fun non-technical reasons
21:32:09 Creating resource groups of nodes with parameterised numbers of network interfaces.
21:32:28 trandles: feel your pain.
21:32:40 ouch
21:32:42 Did you want to mention those open positions?
21:32:59 if I remember correctly, that was non-trivial even in ansible..
21:33:01 (somewhat select audience here, mostly of non-US nationals...)
21:33:10 ah...
21:33:11 https://jobszp1.lanl.gov/OA_HTML/OA.jsp?OAFunc=IRC_VIS_VAC_DISPLAY&OAMC=R&p_svid=68995&p_spid=3171345&p_lang_code=US
21:33:20 I need to get it up on the openstack jobs board
21:33:21 We are getting some new IU only hardware it looks like, 20 V100s and 16 of Dell's latest blades
21:33:39 janders: they appear to share an issue with handling parameters that are yaml-structured compound data objects
21:34:03 trandles: the jobs board seems to work
21:34:06 I need to figure out a good way to partition so that NSF users only get the hardware the NSF purchased and IU users get only the hardware IU purchased
21:34:28 we've done zero advertising and are pulling in resumes
21:34:36 so fingers crossed we get the right candidate
21:35:07 jmlowe: there are ways and means for that.
21:35:23 I was thinking via private flavors and host aggregates
21:35:25 jmlowe: I hate that hard partitioning problem. We almost always try to argue "you'll get the equivalent amount of time for the dollars you provided" but then end up in arguments about normalization
21:35:36 I forget the current best practice, but linking project flavors and custom resources seems to be in fashion
21:36:54 It's a generation newer cpu, so they really are different, Skylake vs Haswell I think
21:37:45 jmlowe: do you need to worry about idle resources in this scenario, or is the demand greater than the supply across the board?
21:38:20 jmlowe: what will you do for network fabric between those V100s?
21:38:29 not terribly concerned about idle resources
21:38:36 2x10GigE
21:39:52 briefly jumping back to the SDN discussion - do you guys know if it's possible to have a single port virtual function on a dual port CX4/5/6?
21:39:55 Had a question on this - does anyone have experience on mixing SR-IOV with bonded NICs? Is there a way to make that work?
21:39:56 looking at nvidia grid and slicing n dicing them, let's say to run a workshop on the openmp like automagical gp-gpu compiler thingy whose name escapes me right now
21:40:06 janders: snap :-)
21:40:49 I would think you would need to do the bonding inside the vm in order for SR-IOV to work with bonding
21:40:58 oneswig: I think Mellanox do have some clever trick to do in-hardware LACP in CX5 / Ethernet. But that's me repeating their words, I haven't tested that
21:41:15 SR-IOV and bonded NICs sounds like evil voodoo
21:41:39 trandles: it does indeed.
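
On the partitioning question above, the private-flavor-plus-host-aggregate approach jmlowe mentions would look roughly like the following. This is a sketch only, with placeholder aggregate, flavor, host and project names, and it assumes AggregateInstanceExtraSpecsFilter is enabled in the nova scheduler:

    # Group the IU-funded hypervisors into an aggregate and tag it
    openstack aggregate create iu-only
    openstack aggregate set --property funding=iu iu-only
    openstack aggregate add host iu-only compute-iu-01

    # Create a private flavor that can only land on that aggregate,
    # then grant access to the IU project
    openstack flavor create --private --vcpus 24 --ram 98304 --disk 100 iu.skylake
    openstack flavor set --property aggregate_instance_extra_specs:funding=iu iu.skylake
    openstack flavor set --project iu-project iu.skylake

The same pattern repeated with an nsf-only aggregate and flavor set keeps the two pools of hardware separate without hard-partitioning the cloud itself.
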
21:41:50 if you consider SR-IOV is just multiplexed pci passthrough then it doesn't sound so scary
21:42:15 but across multiple physical devices?
21:43:08 trandles: there's two scenarios - 2 nics or 1 dual-port nic - perhaps the second might work, the first, I am not confident
21:43:31 http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/pdf/Monday09April/Cohen_NetworkingAdvantage_Mon090418.pdf
21:43:32 slide 10
21:43:40 I think this is what Erez was talking about
21:44:07 No way - I was there :-)
21:44:40 dual-port NICs sounds like something manageable...multiple NICs sounds like driver hell
21:44:53 I wasn't but got a good debrief in Vancouver
21:45:39 hmm now I wonder if I should ask for 1xHDR and 1xdual port EDR for my new blades :)
21:45:39 trandles never had fun in the bad old days altering a nic's mac address to defeat mac address filtering
21:46:19 ..or even 2xEDR and 2xHDR all active/passive for maximum availability
21:46:29 janders: might push out the delivery date, beyond an event horizon
21:46:46 oneswig: this is a very valid point
21:47:10 I do have some kit that can be used in the interim, but perhaps retrofitting HDR later isn't a bad idea
21:47:50 Did live migration with Mellanox SR-IOV ever come to fruition?
21:47:59 not that I know of
21:48:06 DK Panda presented something on this
21:48:11 however cold migration is now supposed to work (historically wasn't the case)
21:48:23 But it's never clear if his work will be generally available (or when)
21:48:30 damn, that would have been killer, DK Panda hinted at it during the 2017 HPC Advisory Council
21:48:46 well, that's something at least
21:48:55 He came back with numbers, just no github URL (as far as I remember)
21:50:07 jmlowe: if you look for his video from Lugano 2018, he covered some results. It's possible he was presenting on using RDMA to migrate the VM, I don't recall exactly.
21:51:45 trandles: we were talking Charliecloud over here the other day. Any news from the project?
21:51:51 That's an entirely different beast, not to say it's not useful for something like postcopy live migration, but it's not what I was promised with my flying car
21:52:22 jmlowe: transporting your physical person, I wouldn't trust RDMA for that :-)
21:52:44 c'mon... isn't IB lossless ;)?
21:52:56 Charliecloud: Reid has been kicking ass on implementing some shared object voodoo to get MPI support automagically inside a containerized application on both OpenMPI installations and Cray MPICH
21:53:11 see star trek quote
21:54:09 trandles: this is for the kind of problems that Shifter tackles by grafting in system MPI libraries?
21:54:28 we have also had quite a bit of success putting some large multi-physics lagrangian applications into containers and running automated workflows of parameter studies
21:55:17 Reid had a few chats with the Shifter guys (Shane and Doug) about the gotchas they found and then implemented his own method of doing something similar
21:55:39 trandles: do you know of people successfully using MPI + Kubernetes? It keeps coming up here and I'm suspicious it's missing some key PMI glue
21:56:06 MPI + Kubernetes: A Story of Oil and Water
21:56:16 LOL
21:56:20 trandles: so nearly vinaigrette
21:56:30 not nearly as delicious
21:56:38 no..
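
For reference on the Charliecloud + MPI point above: the usual pattern with the 2018-era releases is to launch from the host MPI so the job rides the host interconnect, with each rank started inside the unprivileged container by ch-run. A rough sketch under those assumptions; the image tarball, unpack path and application binary are placeholders:

    # Unpack a previously built container image tarball into a directory
    ch-tar2dir ./mpi_app.tar.gz /var/tmp

    # Launch with the host MPI; each rank runs inside the container image
    mpirun -np 8 ch-run /var/tmp/mpi_app -- /app/bin/mpi_app
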
21:56:53 just needs some egg
21:56:59 I know more than I ever thought I'd know (or wanted to know) about how MPI starts up jobs
21:57:18 mmm, sausage
21:57:26 it's dinner time here
21:57:26 I don't think there's an easy way of plumbing RDMA into k8s managed docker containers at this point in time - please correct me if I am wrong
21:57:32 long story short, ORTE is dead (not a joke...), PRTE is specialized and will be unsupported, PMIx is the future (maybe...)
21:58:38 what that all means is in the near future (once OpenMPI 2.x dies off) there won't be an mpirun to start jobs
21:59:20 what we do is compile slurm with PMIx support and srun wires it all up
22:00:03 k8s would need that PMIx stuff to properly start a job IMO
22:00:03 It's a missing piece for AI at scale, I suspect it'll happen one way or another and the deep learning crowd will drive it.
22:00:28 trandles: any evidence of k8s getting extended with that support?
22:00:40 I don't follow k8s enough right now to know
22:01:00 hmmm. OK the quest continues but that's more doors closed.
22:01:05 for MPI jobs, k8s runtime model feels like an anti-pattern
22:01:21 i.e. when a rank dies, MPI dies
22:01:23 All that careful just-so placement...
22:01:28 yeah
22:01:48 Alas we are out of time.
22:01:55 final comments?
22:02:03 for things like DASK, multi-node TensorFlow, other ML/AI frameworks, k8s makes more sense
22:02:06 thanks guys, great chat!
22:02:14 janders_: +1
22:02:18 later everyone
22:02:20 I'll miss most of you in Dallas
22:02:33 jmlowe: you're not going?
22:02:37 jmlowe: you're in Dallas or not in Dallas?
22:02:50 I'm going, was under the impression most were going to the Berlin Summit instead
22:03:04 Ich bin ein Berliner
22:03:09 ah, ok, I'll be in Dallas Monday night, all day Tuesday...leaving early Wednesday morning
22:03:11 and I need to buy plane tickets RSN
22:03:56 OK y'all, until next time
22:03:59 #endmeeting
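
A closing reference on the Slurm-plus-PMIx launch model described near the end of the meeting: with Slurm built against PMIx, srun does the wire-up itself and no mpirun is needed. A minimal sketch; the job geometry and binary name are placeholders:

    # List the PMI plugins this Slurm build supports (pmix should appear)
    srun --mpi=list

    # Launch an MPI job directly under srun, letting PMIx handle rank wire-up
    srun --mpi=pmix -N 2 --ntasks-per-node=4 ./mpi_app
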