21:00:45 <b1airo> #startmeeting scientific-sig
21:00:46 <openstack> Meeting started Tue Mar 19 21:00:45 2019 UTC and is due to finish in 60 minutes. The chair is b1airo. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:49 <openstack> The meeting name has been set to 'scientific_sig'
21:00:49 <jmlowe> I expect rbudden to make an appearance today
21:00:52 <rbudden> hello
21:00:55 <rbudden> :)
21:00:59 <b1airo> o/
21:01:05 <b1airo> #chair oneswig
21:01:06 <openstack> Current chairs: b1airo oneswig
21:01:08 <oneswig> jmlowe: so quickly proven correct
21:01:10 <b1airo> #chair martial
21:01:11 <openstack> Current chairs: b1airo martial oneswig
21:01:27 <oneswig> greetings all
21:01:43 <b1airo> hopefully joe manages to get on soon
21:01:48 <jmlowe> I may have cheated and compared calendars with him within the past 45 min
21:01:59 <rbudden> heh
21:02:37 <rbudden> been far too long
21:03:07 <janders> g'day
21:03:35 <b1airo> morning janders
21:03:50 <martial> I invited Maxime and Alex to join this session to see how we run these sessions; they are proposing to discuss their work with us next week :)
21:04:09 <oneswig> cool
21:04:32 <oneswig> Although I am going to Texas next week to see our friends at Dell
21:05:03 <martial> I will help start the meeting
21:05:05 <oneswig> I expect I'll be jet-lagged so still in about the right time zone!
21:05:21 <martial> was just about to email you about it
21:05:55 <janders> texan bbq! :)
21:06:24 <jtopjian> I'm here! sorry
21:06:40 <oneswig> janders: damn right
21:06:46 <oneswig> hi jtopjian, welcome
21:07:08 <jtopjian> Hello
21:07:24 <janders> have you guys heard the news about NVIDIA's Mellanox acquisition?
21:07:36 <rbudden> yep
21:07:45 <rbudden> should be interesting
21:07:59 <oneswig> Mixes things up a little!
21:08:11 <janders> I sense this should push the HPC side of things more
21:08:30 <janders> (unlike with some other potential buyers who were on the table)
21:08:40 <martial> yep, outbid Intel by a lot too
21:08:45 <oneswig> There's a good deal of talk about the high performance data centre
21:08:55 <oneswig> which sounds good to me
21:09:13 <janders> any thoughts on how this would affect Mellanox's OpenStack work?
21:09:24 <janders> s/would/will
21:10:06 <oneswig> neutral-to-positive, I'd guess
21:10:13 <b1airo> i doubt it will have any immediate impact janders
21:10:20 <janders> same feelings here
21:10:32 <jmlowe> I'm kind of banking on Mellanox continuing to do good things for both ceph and openstack
21:10:54 <janders> I came across this yesterday: https://www.nextplatform.com/2019/03/18/intel-to-take-on-openpower-for-exascale-dominance-with-aurora/
21:10:54 <oneswig> b1airo: increased interest in virtualised gpu direct, perhaps?
21:10:55 <b1airo> though a general HPC-datacentre trend might make NVIDIA more interested in e.g. Cyborg
21:11:17 <b1airo> possibly oneswig
21:11:31 <b1airo> or some new version thereof (more likely)
21:11:32 <jmlowe> Which I find astounding: "We've delivered nothing, but double our money and we'll deliver something even better, we promise"
21:11:55 <janders> yeah.. and the interconnect bit is "interesting"
21:12:26 <janders> as much as I am a fan of a competitive landscape, no one except Mellanox seems to have the right answers to the right questions about advanced fabric features
21:12:54 <oneswig> That's more than a ripple in the pond, janders, if it's true it's a huge splash
21:13:14 <b1airo> they are also guilty of claiming to have the answers 2 years before they have a product containing any of them, but i guess everyone knows that now
21:13:18 <jmlowe> Who exactly signed off on 1/2 billion to a company that just completely botched your 1/4 billion dollar acquisition?
21:14:10 <b1airo> NVIDIA's acquisition history is not great mind you, would have been interesting to have a front row seat to Mellanox boardroom talks when that topic came up
21:14:19 <jmlowe> I'm still waiting on my sriov live migration from Mellanox
21:14:34 <b1airo> lols, good luck
21:14:53 <b1airo> there is some upstream kvm work that looks promising - vfio migrations
21:15:31 <b1airo> ok, we should move along and pass the baton to jtopjian
21:15:48 <jmlowe> I might get live migration with vgpus from nvidia though
21:15:50 <jmlowe> yes
21:15:53 <janders> jmlowe: are you thinking eth or ib for live migration?
21:16:02 <jtopjian> sure. I shouldn't take long, but happy to answer questions, too.
21:16:13 <janders> I think SRIOV is tricky, but the RDMA/QP part is even trickier
21:16:14 <b1airo> #topic Nomad for GPU container workload management
21:16:16 <jmlowe> janders: We wound up pitching mellanox eth
21:16:30 <b1airo> stow it for the wrap-up lads :-)
21:16:36 <janders> ok!
21:16:49 <martial> jtopjian: any link for slides or external material?
21:16:57 <jtopjian> none :)
21:16:59 <b1airo> give us the pitch jtopjian
21:17:05 <jtopjian> There's a group we're working with who have a stack of nine servers, each with four GPUs. Currently, users log into each server directly and do whatever GPU processing they need there.
21:17:40 <b1airo> was i correct that they are doing exploratory data-science and ML stuff jtopjian ?
21:17:46 <jtopjian> That's correct
21:18:19 <jtopjian> The group has a wide range of knowledge and skill as well as a wide range of what they're working on.
21:18:40 <jtopjian> They asked us if we could do anything to help automate job submission. We figured we'd try a bit of an experiment and do something totally different.
21:18:50 <oneswig> I guess these are all single node workloads, right?
21:19:17 <jtopjian> Possibly. I don't think there was ever enough parallel work for a node to be running more than one job :)
21:20:06 <jtopjian> Unfortunately we were not able to get direct access to the hardware, due to a myriad of non-technical reasons, so we set up a PoC cluster in our OpenStack cloud: 5 instances, one control node and 4 worker nodes with 1 GPU each
21:20:54 <martial> for a second I thought it was 4 GPUs per node
21:21:10 <jtopjian> The existing cluster is. We could not get physical access to it.
21:21:23 <jtopjian> So we built a virtual PoC instead.
21:21:37 <jtopjian> We had a Nomad master on the control node and a worker on each of the others. The only Nomad driver we used was "raw_exec", which basically just runs a shell command on the root filesystem. The reason we went this route was because the group had a second request about Singularity support.
21:22:41 <jtopjian> Nomad 0.9-dev (master branch) has support for GPU polling. It'll poll the amount of memory the GPU cards are using and schedule jobs that way. So I had to compile Nomad from source, but that's not terribly hard at all.
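For reference, a rough sketch of what a batch job along these lines might look like in Nomad's HCL. The job name, command, and paths are illustrative assumptions rather than anything shown in the meeting, and the device stanza is the generic Nomad 0.9 device-plugin syntax, not necessarily exactly what this PoC used:

    job "gpu-batch-example" {
      datacenters = ["dc1"]
      type        = "batch"          # run to completion, not a long-lived service

      group "work" {
        task "train" {
          # raw_exec runs the command directly on the worker's root filesystem,
          # so whatever is installed on the OS is available to the job
          driver = "raw_exec"

          config {
            command = "/usr/bin/python3"
            args    = ["/opt/jobs/train.py"]   # hypothetical script path
          }

          resources {
            # Nomad 0.9 device plugin: ask the scheduler to place this task on
            # a node where the nvidia plugin has fingerprinted a free GPU
            device "nvidia/gpu" {
              count = 1
            }
          }
        }
      }
    }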
21:23:02 <jtopjian> With all of that in place, I set up some sample jobs and shared it out
21:23:09 <jtopjian> And then waited for feedback
21:23:20 <jtopjian> out of six people who used the cluster, I received feedback from 2.
21:23:27 <b1airo> #link https://www.hashicorp.com/products/nomad
21:24:09 <jtopjian> Right, some quick background: Nomad is a very basic scheduler. It supports various execution styles (batch, service) and drivers (docker, exec, qemu).
21:24:24 <jtopjian> Jobs are declared using an HCL-type syntax. If you've worked with Terraform, Packer, etc - it's the same.
21:25:02 <b1airo> #link https://github.com/hashicorp/hcl
21:25:06 <jtopjian> One person "got it" and one person did not.
21:26:15 <jtopjian> I'm quite new to working with data analysis, ML etc and learned that there are a number of different ways users work on data. I assumed we'd just give them a bunch of GPUs with a scheduler and they'd have a field day
21:26:51 <b1airo> heh, users are the hardest part of this world
21:27:04 <jtopjian> The person with positive feedback was at this phase of their work: they had some code they wanted to run repeatedly with different arguments, and Nomad did this quite well (you can create "parameterized jobs" which repeat jobs with new input)
21:27:31 <jtopjian> The person with negative feedback had no code written and wanted to be as close to the GPU as possible.
21:28:00 <jtopjian> So they wondered why they couldn't get shell access to the GPU and thought it was a bother to have to commit or copy their code somewhere each time they ran a job
21:28:09 <jtopjian> So it was a learning experience.
21:28:38 <oneswig> jtopjian: what is the execution environment nomad creates for a job that might get in the way for this user?
21:29:08 <martial> any into into nvidia-docker (v2) to try to give people access to the bare metal resources through an abstraction layer (seems to be very "core" OpenStack so far or am I missing something?)
21:29:16 <jtopjian> We used "raw_exec" which just ran a command on the root filesystem. So whatever you have installed on the OS is what you have access to.
21:29:29 <jtopjian> But Nomad also allows for chroot execution, docker execution, etc
21:29:58 <jtopjian> I would have used Docker, but a few users wanted to use Singularity instead.
21:30:19 <jtopjian> So "raw_exec" basically just exec'd the singularity command with an image the user made available
21:31:00 <martial> (the HPC ... we need root squash issue ... re: Docker)
21:31:55 <b1airo> dunno about jtopjian , but i can't parse that question (was it a question?) martial ...
21:31:57 <jtopjian> So that's the summary of the PoC. As for next steps, I've learned that having job submission access is just a small part of what these users want, so I'm going to be looking at something like Kubeflow to see if that can provide better features for them.
21:32:03 <oneswig> Does it assume a network filesystem - or how is data and code copied if not?
21:32:29 <jtopjian> Nomad can download "artifacts". Unfortunately it only supports a few methods such as http and git at the moment.
21:33:12 <jtopjian> In your job spec, you declare different artifacts you want the worker to download and it'll pull them. The artifacts can also be interpolated, so you can have an input variable of a version to download a certain version of something
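Tying the pieces above together (parameterized jobs, raw_exec running Singularity, and artifact downloads with interpolation), a hypothetical job spec might look roughly like this; the image URL, paths, and the "version" meta key are made up for illustration:

    job "singularity-run" {
      datacenters = ["dc1"]
      type        = "batch"

      # A parameterized job is a template: each "nomad job dispatch" creates a
      # new run with its own metadata, so the same code is re-run with new input
      parameterized {
        meta_required = ["version"]
      }

      group "run" {
        task "singularity" {
          driver = "raw_exec"

          # Artifacts are fetched by the worker before the task starts (http/git);
          # the source can be interpolated, e.g. from the dispatch metadata
          artifact {
            source      = "https://example.org/images/model-${NOMAD_META_version}.sif"
            destination = "local/"
          }

          # raw_exec simply execs the singularity command against the downloaded image
          config {
            command = "singularity"
            args    = ["exec", "--nv", "local/model-${NOMAD_META_version}.sif", "python3", "/opt/analysis.py"]
          }
        }
      }
    }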
21:33:21 <b1airo> jtopjian: do they need to turn these things into services with jobs that will be spun up in the cluster?
21:33:23 <oneswig> And I guess outputs need something too?
21:33:38 <jtopjian> @b1
21:33:52 <jtopjian> @b1airo we used the "batch" scheduler which ran jobs once
21:34:23 <jtopjian> @oneswig Nomad will save the stdout for review. For saving data somewhere else, that's something I didn't explore.
21:34:52 <oneswig> I guess there's many ways that can be approached.
21:34:55 <b1airo> it sort of sounds like the primary use-case is interactive exploration/design-iteration style work
21:35:16 <jtopjian> However, one thing I was not a fan of was that Nomad does not keep logs around long. It expects you to send logs to a more robust logging system. While I understand the intention of the design, it felt lame
21:35:45 <b1airo> if that's true i wonder whether container orchestration is really buying them anything?
21:36:01 <jtopjian> @b1airo actually, no. This was more for something along the lines of "I have a large dataset that I want to run a known good piece of code on for a few hours"
21:36:39 <jtopjian> And so one piece of learning on my part was that most users are actually in the exploratory phase, which made this cluster harder to work with.
21:36:52 <jtopjian> They really just need easy access to, say, Jupyter with GPU
21:37:11 <oneswig> The world needs that :-)
21:37:22 <jtopjian> right?
21:38:44 <b1airo> how had some of them come to Singularity already jtopjian ?
21:40:00 <jtopjian> That's a good question. There were 3-4 users who were very keen on it, but others in the group who had not heard of it.
21:40:15 <jtopjian> I'm not sure how the one group came across it
21:40:23 <b1airo> had those 3-4 already been exposed to HPC in some way?
21:40:53 <jtopjian> Yes and I believe they wanted Singularity for this PoC because they were unable to use it in their institution's existing clusters
21:41:35 <martial> (b1airo not a question)
21:42:09 <oneswig> What was Singularity gaining for them, specifically?
21:42:15 <b1airo> which leads me to the next question, why not Slurm/PBS (plus Singularity or whatever) etc instead of these higher level programming interfaces ?
21:43:01 <jtopjian> @oneswig: Again, good question and I'm not sure. Personally, I didn't mind using Singularity for this (I had to learn it to implement it) but I wasn't able to figure out what the big difference between it and Docker was.
21:43:49 <oneswig> Were they wanting to use images from Singularity Hub?
21:44:24 <b1airo> i suspect you'd have to sit down with the users for a few hours to see how they were actually using it. i guess they maybe just learned how to create Singularity images and didn't know Docker images are very similar and compatible ?
21:44:44 <jtopjian> @b1airo Slurm was mentioned in the first meeting and we chose not to implement it for two reasons: 1) it's been done and we wanted to take a chance and 2) part of our interaction with users is to... how do I want to say... work with them outside of an academic context. Being a little facetious: I don't see too many Medium articles on Slurm :)
21:45:02 <jtopjian> @oneswig They had their own images to use
21:45:22 <b1airo> Singularity Hub... what are people's thoughts? i went looking for an Intel optimised tensorflow container the other day and it felt like a mess
21:45:43 <oneswig> Not tried using it b1airo
21:46:55 <jtopjian> And "sit down" - yes, this is something we plan to do. We feel there's a big gap between students/researchers/users and infrastructure teams like my group. We want to bridge that
21:48:46 <oneswig> Sometimes people call that bridge you're making "ResOps" - is that familiar?
21:49:04 <jtopjian> It's not! That's new to me :)
21:49:28 <martial> what were the advantages of Nomad for this setup?
21:49:39 <oneswig> kind of derivative but fits the bill
21:50:16 <b1airo> jtopjian: sure, it's great you're getting to do some exploration. though i'd challenge you on the "academic" context - HPC is a much bigger industry than just universities. anyway, i think yours is the first real-world example i've seen of attempting to push the status quo of workload management in this space. convergence of container orchestration and hpc workload management was raised in a couple different forums
21:50:16 <b1airo> at SC last year, so you're on trend :-)
21:51:03 <jtopjian> Exactly: from the specs which were discussed, it checked off all the boxes. Nomad, being a single binary, made deployment really quick and simple (ignoring that I had to compile it, but I'm familiar with Go so that was fine). In practice, it helped us uncover a lot of missing knowledge we had about data processing.
21:52:12 <jtopjian> To be clear: I was not (and am not) trying to challenge the status quo or indoctrinate people. Part of my job is to do experiments like this.
21:53:08 <jtopjian> If it was requested, I'd have no problems deploying something like Slurm for this group. But if I have the opportunity to explore different approaches, I'll take it.
21:53:39 <martial> still it sounds pretty fun, and a good setup
21:53:47 <martial> Thanks for sharing
21:53:48 <b1airo> some other interesting examples floating around like what CERN is doing on top of Magnum, but that seems to be more about using a dynamic k8s cluster on a per-workload basis
21:54:08 <oneswig> In a similar vein, I'd be very interested to hear about anyone's experiences with Univa NavOps - https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/
21:54:19 <jtopjian> Indeed. It was a great learning experience. Nomad itself works great as a very foundational scheduler. But it's truly foundational. You'll need to add things on top of it to make it more user-friendly.
21:54:59 <jtopjian> In another similar vein (I can't remember if I mentioned Kubeflow earlier in this meeting), Kubeflow is on my list to look at.
21:56:21 <b1airo> yeah that sounds interesting too, suspect it might have a broader potential audience
21:56:31 <oneswig> Thanks jtopjian - very interesting to hear about your work
21:56:49 <b1airo> you got Magnum in any of your clouds jtopjian ?
21:57:00 <jtopjian> @oneswig very welcome!
21:57:24 <jtopjian> @b1airo Yes, we support both Swarm and Kubernetes
21:57:25 <b1airo> yes +1, thanks jtopjian , interesting stuff
21:57:47 <rbudden> indeed, thanks for sharing
21:57:54 <b1airo> would make a good lightning talk if you're in Denver :-)
21:57:57 <janders> +1 thank you
21:58:16 <janders> great idea!
21:58:18 <jtopjian> Unfortunately I won't make it to Denver :/
21:58:19 <oneswig> going back to the NextPlatform article janders shared - see the reference to "Intel OneAPI" at the top of this - http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2019/03/argonne-aurora-tech.jpg - I wondered if that was an IaaS API but it seems not...
21:59:06 <janders> regarding jmlowe's live migration concerns, I found this:
21:59:08 <janders> https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8401692
22:00:12 <janders> copying hardware state for bm migration sounds like a nightmare. But.. if an identical image were to be spun up on identical hardware, maybe copying the delta wouldn't actually be that bad..
22:00:17 <janders> interesting idea :)
22:00:32 <oneswig> Live migration of bare metal? "1) power off server, 2) move server to new rack, 3) power on server again?"
22:00:37 <jmlowe> I used to use openvz and was able to live migrate with that
22:01:07 <jmlowe> effectively live migration of containers
22:01:09 <janders> a friend used to leverage dual PSUs and LACP to move servers between racks without powering off.. we used to make jokes that's the bm live migration
22:01:39 <janders> true.. container could be that shim layer they are referring to
22:01:41 <b1airo> the vfio migration work that is being discussed upstream at the moment sounds like it is allowing for vendor-specific state capture and transfer
22:02:10 <b1airo> ooh, we're overtime!
22:02:23 <janders> thanks guys! great chat
22:02:25 <b1airo> thanks all, good turnout!
22:02:27 <oneswig> Thanks all
22:02:27 <janders> see you next week
22:02:33 <rbudden> catch ya later
22:02:34 <b1airo> sounds good!
22:02:39 <b1airo> #endmeeting