09:01:16 <oneswig> #startmeeting scientific_wg
09:01:16 <openstack> Meeting started Wed Jul 20 09:01:16 2016 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:19 <openstack> The meeting name has been set to 'scientific_wg'
09:01:41 <oneswig> Hello
09:01:56 <apdibbo> Hi
09:01:56 <oneswig> We have an agenda (such as it is)
09:02:01 <dariov> Hello!
09:02:06 <oneswig> #link https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_July_20th_2016
09:02:32 <oneswig> Hi dariov apdibbo
09:03:06 <oneswig> Topics today from the agenda:
09:03:13 <priteau> Hello
09:03:18 <oneswig> 1) Update on user stories activities
09:03:35 <oneswig> 2) anything from other activity areas (esp. parallel filesystems this week)
09:03:47 <oneswig> 3) hypervisor tuning guide solicitation
09:04:02 <oneswig> 4) speaker track and WG activities for Barcelona
09:04:28 <oneswig> BTW I've heard from b1airo that he's going to join shortly
09:04:35 <oneswig> Hi priteau
09:05:21 <oneswig> #topic Update on user stories activities
09:06:27 <oneswig> I've been working away on this white paper for OpenStack and HPC. Not much to report on that right now, other than that progress is being made
09:06:46 <oneswig> I forget what was discussed on this two weeks ago...
09:07:29 <oneswig> The next phase is solicitation of subject matter experts to provide input, review and comment
09:07:52 <oneswig> To recap, there are five principal subjects
09:08:03 <oneswig> 1) HPC and virtualisation
09:08:13 <oneswig> 2) HPC parallel filesystems
09:08:20 <oneswig> 3) HPC workload managers
09:08:29 <oneswig> 4) HPC infrastructure management
09:08:42 <oneswig> 5) HPC network fabrics
09:09:00 <oneswig> virtualisation is actually half about bare metal
09:09:14 <oneswig> Anyone interested in contributing in these areas?
09:09:47 <oneswig> I have people in mind for many items but not all
09:11:08 <oneswig> OK, don't all shout at once :-)
09:11:13 <dariov> lol
09:11:33 <oneswig> no problem, I think we have plenty to go on as it is
09:12:26 <oneswig> More generally, is there activity with any of you that might convert well into a user story?
09:14:05 <dariov> 3 and 4 do interest us
09:14:09 <oneswig> At Cambridge our bioinformatics deployment is in early trials, but its HA is less reliable than a non-HA deployment right now...
09:14:24 <dariov> we’re trying to move some of the HPC on-prem to the cloud
09:14:37 <dariov> (possibly openstack, possibly AWS, but that’s another story)
09:14:45 <apdibbo> we are in early trials with openstack for HTC
09:14:49 <dariov> but no direct knowledge so far
09:14:54 <oneswig> dariov: interesting to know, and why constrain to one at this point
09:15:01 <dariov> or trials whatsoever
09:15:34 <dariov> on ostack we’re limited by our own resources now
09:15:57 <dariov> and eventually people landing in openstack land for the first time tend to develop cloud-friendly apps from the beginning
09:16:11 <dariov> so no real need for HPC solutions there
09:16:30 <dariov> as they usually deploy their own orchestration layer
09:17:00 <oneswig> dariov: I've seen use of ManageIQ for abstracting away private OpenStack vs AWS
09:17:44 <dariov> oneswig, I think we tried it back in October to “rule all the clouds”, we also have VMware on site
09:17:52 <priteau> oneswig: One of our partners on Chameleon, Ohio State University, is using Chameleon to work on high performance networking in virtualized environments: http://mvapich.cse.ohio-state.edu/overview/#mv2v
09:18:01 <oneswig> dariov: if you document your journey I'd be interested to read it
09:18:22 <dariov> oneswig, where are you based, exactly?
09:18:52 <oneswig> priteau: That's pretty cool, I hadn't made the connection, but I know the work at OSU, it's good stuff
09:19:03 <oneswig> dariov: I'm based in Bristol but working in Cambridge
09:19:09 <dariov> ah ah!
09:19:28 <dariov> Hello from the Genome Campus in Hinxton, Cambridgeshire, UK, then :-)
09:19:35 <priteau> oneswig: I can get you in touch if you want
09:19:39 <oneswig> Small world :-)
09:20:30 <oneswig> priteau: I would appreciate that. I've previously mailed with DK Panda but only once or twice. A connected introduction for this work would be a great help
09:20:46 <dariov> oneswig, yes. We can also have a cup of coffee together one day, it would be cool to see how other people are making this journey :-)
09:21:13 <oneswig> #action Cambridge-centric OpenStack meetup needed!
09:21:26 <oneswig> I'm in town next week midweek. You around?
09:21:34 <oneswig> Let's take this offline :-)
09:21:58 <oneswig> Anything else to add on user stories?
09:23:01 <oneswig> #topic Bare metal
09:23:36 <oneswig> I don't actually have anything here, but Pierre, I recall you were more active than me w.r.t. Ironic serial consoles. Any news there?
09:24:54 <priteau> We are working on backporting the patch to our installation (we're still on Liberty) and will soon test it
09:25:39 <priteau> I believe there was some progress upstream, as the multiple implementations seem to have merged into one, and I saw at least one patch being merged in master
09:25:40 <oneswig> that's great. Is this a patch to Nova as well?
09:26:11 <priteau> this one: https://github.com/openstack/ironic/commit/cb2da13d15bf72755880e7a8e6881e5180e2e29f
09:26:34 <b1airo> evening
09:26:49 <oneswig> Hi b1airo
09:26:54 <oneswig> #chair b1airo
09:26:55 <openstack> Current chairs: b1airo oneswig
09:27:15 <oneswig> Just on bare metal, then hpfs next
09:27:27 <priteau> as far as I know it's all in Ironic, no change to Nova
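As background to the serial console discussion above, a minimal sketch of how a bare metal console is typically enabled and retrieved with the Liberty-era ironic CLI once such a patch is in place; the node UUID and terminal port below are placeholders, and the exact driver_info field depends on which console implementation is being backported.

    # Hedged sketch, not from the meeting: enable and fetch a node's serial console.
    ironic node-update $NODE_UUID add driver_info/ipmi_terminal_port=8023
    ironic node-set-console-mode $NODE_UUID true
    ironic node-get-console $NODE_UUID    # prints the URL used to attach to the console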
09:28:12 <priteau> Another interesting thing that happened recently with bare metal: support for network multi-tenancy is being integrated in Ironic
09:28:32 <b1airo> yeah we talked a bit about that last week actually
09:28:32 <priteau> e.g. https://github.com/openstack/ironic/commit/090ba640d9187ec6bee157ce8f1cf12ce6a868ca
09:28:44 <b1airo> or at least touched on it
09:29:01 <b1airo> i was asking about requirements for the new OOB network we will be doing soon
09:29:06 <priteau> I haven't seen proper documentation yet so I don't know how it works and what kind of network support it requires
09:29:33 <b1airo> i was surprised to find out that Ironic-Neutron integration seems relatively new
09:29:53 <oneswig> It seems to be sufficiently far upstream that there's limited support and limited docs, but hopefully it'll grow!
09:30:33 <oneswig> b1airo: I think one major obstacle is that Ironic requires physical network knowledge, something that OpenStack has been in wilful denial over
09:31:20 <oneswig> If you're moving activities normally done in a hypervisor onto a network port, you need to know which port (and possibly the paths to it)
09:31:47 <oneswig> Enter the Neo :-)
09:32:06 <priteau> for now the only doc I see is for devstack, where the bare-metal servers are actually VMs
09:32:56 <oneswig> priteau: I'd previously seen an Arista driver that was doing bare metal multi-tenancy and required LLDP to make the connection with a physical switch+port
09:33:37 <oneswig> But that work is now very much out of date, I expect
09:35:18 <oneswig> Anyway, it's great to see this documentation merged, thanks priteau for sharing that
09:35:23 <priteau> it looks like in this new implementation they extended the API to add the physical port info, I suppose managed by the admin
09:35:31 <priteau> e.g. https://github.com/openstack/ironic/commit/76726c6a3feda8419724d09c777e7bb578d82ec0#diff-2526370ba29c4ac923f13218952a32c3R167
09:35:53 <oneswig> ooh, interesting.
09:35:55 <priteau> I will keep an eye on what they are doing and will share what I learn
09:36:04 <oneswig> Maintaining the physical mapping - an exercise for the reader?
09:36:32 <oneswig> Any more for bare metal?
09:37:10 <priteau> not from me
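To illustrate the physical port information priteau mentions, a hedged sketch of how an operator might record a node's switch connection for the multi-tenant networking work; the MAC addresses, switch port name and node UUID are invented placeholders, and the exact flags depend on the release in use.

    # Hedged sketch: record which physical switch port the node is cabled to,
    # so the Ironic-Neutron integration can bind tenant networks to it.
    openstack baremetal node set $NODE_UUID --network-interface neutron
    openstack baremetal port create 52:54:00:aa:bb:cc --node $NODE_UUID \
        --local-link-connection switch_id=00:1c:73:12:34:56 \
        --local-link-connection port_id=Ethernet1/7 \
        --local-link-connection switch_info=rack2-tor1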
09:37:59 <oneswig> #topic Parallel Filesystems
09:38:15 <b1airo> is it just the three of us at the moment?
09:38:29 <oneswig> I think 5-6?
09:39:10 <b1airo> ah ok, a few lurkers then :-)
09:39:14 <priteau> apdibbo and dariov were talking earlier
09:39:27 <apdibbo> yeah, i'm still here
09:39:45 <dariov> I’m here too!
09:39:53 <apdibbo> just don't have much to contribute on bare metal, interesting developments though :)
09:40:01 <dariov> same here
09:40:23 <oneswig> There was quite a bit of interest in Blair's mail on the lists about discussing parallel filesystems for cloud HPC
09:41:21 <oneswig> Seems like an idea with momentum behind it
09:41:24 <b1airo> yeah was nice to have such quick feedback
09:41:47 <b1airo> pity i didn't think of it earlier, such is the life of a procrastinator
09:42:08 <oneswig> Any parallel filesystem users or interested potential users here today?
09:43:48 <oneswig> b1airo: for our benefit perhaps we should talk over the WG panel session
09:44:13 <b1airo> only news here is that we got LNET with o2ib over RoCE via PCI pass-through virtual function NICs in guests going last week, for our new deployment
09:44:28 <oneswig> Nice!
09:44:31 <b1airo> say that sentence ten times fast
09:44:46 <oneswig> I bet you had it pre-typed out :-)
09:45:23 <b1airo> got it bound to a hot key ;-)
09:45:31 <oneswig> What issues did you hit on the way?
09:46:07 <b1airo> this deployment is full of ConnectX-4 and ConnectX-4 Lx
09:46:33 <oneswig> b1airo: Is that considered an issue?
09:46:56 <b1airo> the firmware in MOFED 3.2-1.0.1 (i think that's right) seems to have an issue. we've had a handful of nodes just "lose" their NICs
09:47:18 <b1airo> i.e. all traffic stops passing on them and we can't do anything to fix it, even resetting the pci device
09:47:29 <b1airo> on reboot the card is completely absent
09:47:40 <oneswig> Ah, we moved to 3.3 - there's a security advisory against 3.2. Plus we had loads of issues getting link up
09:47:45 <b1airo> i.e., not visible in lspci
09:48:06 <oneswig> b1airo: interestingly your issues are completely disjoint from ours, yet we both appear to have many
09:48:07 <b1airo> we have upgraded a few to 3.3.xxx and that seems to have sorted it
09:48:18 <b1airo> orly!?
09:48:28 <b1airo> where is that advisory published?
09:48:39 <b1airo> i have a support case open about it and they haven't mentioned that
09:49:04 <oneswig> OFED 3.3 release notes is where I saw it - was wondering why it's no longer on the website
09:49:05 <dariov> guys, I need to run
09:49:13 <oneswig> later dariov, thanks
09:49:16 <dariov> bye!
09:49:32 <b1airo> anyway, that issue was just a distraction. the main issue was needing to get client Lustre modules built by Intel for the particular kernel + MOFED combination in the guest compute nodes
09:49:46 <b1airo> bye!
09:50:25 <b1airo> after that, just simple LNET config
09:50:40 <oneswig> Is your process blogged anywhere, perchance...?
09:51:42 <b1airo> the Intel Manager for Lustre (IML) doesn't understand RoCE though, so if you're configuring things server-side you need to manually edit one of the network config files to make it use o2ib, else it assumes tcp (because ethernet)
09:52:38 <b1airo> no, but that's probably a good idea - it is documented internally reasonably well now so should be easy enough
09:52:55 <oneswig> Share and enjoy, b1airo!
09:53:40 <oneswig> OK, we should move on
09:53:57 <b1airo> remaining issue is that we reconfigured the filesystem into two separate filesystems (projects and scratch), and now we seem to be peaking at ~80MB/s for single client/thread writes... which is a little low!
09:54:27 <oneswig> b1airo: for that effort, I would say so. Is there a gigabit ethernet somewhere in the data path???
09:55:11 <b1airo> it's a mix of 25 and 50 GbE (compute/client side) and 100 GbE (server)
09:55:38 <b1airo> same issue with bare-metal clients so it isn't the passthrough or anything like that
09:55:54 <b1airo> initial acceptance tests were just fine
09:56:21 <b1airo> haven't looked into it yet myself, distracted with other issues as per recent os-ops posts
09:56:32 <oneswig> SMARTD errors on your disks perhaps?
09:56:51 <oneswig> Right, let's move on
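A hedged sketch of the client-side LNET configuration b1airo describes, i.e. pointing the o2ib LND at the RoCE-capable Ethernet interface so it does not fall back to tcp; the interface name, server NID and filesystem name are placeholders, not details from the meeting.

    # Hedged sketch: LNET over RoCE on a Lustre client; all names are placeholders.
    # In /etc/modprobe.d/lustre.conf, select the o2ib LND on the RoCE-capable
    # Ethernet interface (otherwise LNET assumes tcp):
    options lnet networks="o2ib0(ens1f0)"

    # Then load the modules, check the NID, and mount:
    modprobe lustre
    lctl list_nids                                   # expect <client-ip>@o2ib0
    mount -t lustre 10.10.0.10@o2ib0:/projects /lustre/projects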
09:56:54 <oneswig> #topic Any Other Business
09:57:07 <oneswig> What news?
09:57:08 <b1airo> they're all Dell MDs so unlikely
09:57:26 <b1airo> Trump got the Republican nomination?
09:57:44 <oneswig> You raise that with 3 MINUTES to discuss?
09:58:12 <b1airo> haha
09:58:35 <priteau> oneswig: Is this the Hypervisor Tuning Guide? https://wiki.openstack.org/wiki/Documentation/HypervisorTuningGuide
09:58:42 <b1airo> yep
09:58:52 <priteau> I had never seen it before
09:59:09 <b1airo> yeah it's not well promoted or linked anywhere
09:59:34 <b1airo> i'm hoping we can figure out something sensible to do with it
09:59:51 <oneswig> Are you officially its custodian, b1airo?
09:59:51 <b1airo> maybe look at converting it to doc format next cycle
10:00:22 <b1airo> something like that, but all the work so far has been done by Joe Topjian
10:00:57 <b1airo> i think the initial idea was to turn it into a standalone doc
10:01:15 <priteau> b1airo: maybe merge it into http://docs.openstack.org/ops-guide/arch_compute_nodes.html
10:01:35 <b1airo> but it seems like it's hard to find info on other hypervisors that would bulk out the content enough to justify that
10:01:44 <b1airo> priteau, good suggestion
10:01:52 <b1airo> i was also thinking probably the ops guide
10:02:38 <oneswig> It's more operational than architectural, I'd guess
10:03:27 <b1airo> yeah, though some of the things covered are probably things you want to be thinking about before you buy your hardware
10:03:52 <oneswig> I found CERN's study that Hyper-V outperforms KVM relevant - a pity there isn't more diversity in the guide as-is, if KVM's not the fastest thing out there
10:04:09 <priteau> or in the Operations part of the Ops Guide, it could be a new page about troubleshooting performance problems
10:05:19 <oneswig> ah, we are over time
10:05:26 <oneswig> Any final comments?
10:05:36 <b1airo> NUMA matters!
10:05:52 <oneswig> One from me - it appears we have 8 sessions allocated for the Barcelona summit
10:06:03 <oneswig> hooray!
10:06:24 <oneswig> Got to wrap up now, thanks everyone
10:06:28 <b1airo> yeah, time for me to look at the proposals
10:06:35 <b1airo> bfn!
10:06:42 <apdibbo> bye
10:06:45 <oneswig> Until next time
10:06:53 <oneswig> #endmeeting