11:01:03 #startmeeting scientific_wg
11:01:04 Meeting started Wed Oct 11 11:01:03 2017 UTC and is due to finish in 60 minutes. The chair is b1airo. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:01:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:01:07 The meeting name has been set to 'scientific_wg'
11:01:17 #chair martial
11:01:18 Current chairs: b1airo martial
11:02:08 #topic Roll call - wave if you are "here"
11:03:01 hi
11:03:07 hi enolfc
11:03:09 *wave*
11:03:43 khappone, i take it that is Kalle from CSC?
11:03:55 b1airo: correct
11:04:23 don't think i have seen you for a while, hope you're doing well!
11:05:01 #topic Agenda
11:05:01 Yes! It's been a while! You know how it is, busy :)
11:05:08 indeed!
11:05:19 And trying to keep up with 4 other forums where scientific openstack is discussed :)
11:05:40 I think today may be a quick one, but we'll see... on the agenda thus far I have:
11:05:46 Sydney Summit session prep
11:05:46 Monash OpenStack HPC expansion plans
11:05:46 Mellanox SR-IOV / RoCE tuning shenanigans
11:06:07 and I am personally interested to recap on the Image Naming Conventions discussion from last week that I missed
11:06:12 I.. https://github.com/CSCfi/CloudImageRecommendations
11:06:28 Yes, I was going to ask if that could be discussed in this timezone
11:06:29 wavy thingie :)
11:06:40 hi martial :-)
11:06:53 Hello, sorry for joining late.
11:06:56 hi Blair, as mentioned, getting one kid out the door :)
11:07:06 alright, so topic #1...
11:07:26 no probs martial - you better go martial your kid... ;-)
11:07:38 #topic Sydney Summit
11:08:01 the summit is coming up fast, who's coming?
11:08:10 I know Stig and martial are
11:08:27 I can't sadly make it, nor my colleagues
11:08:32 zioproto is, I'm not.
11:08:44 (He's busy on another line right now.)
11:08:55 hi simon-AS559, that's a shame, but good that zioproto will be there at least
11:09:15 I am here
11:09:20 b1airo: instilling discipline ... check :)
11:09:58 it's about this time we normally start compiling a list of sessions of interest to sciencey inclined folks
11:10:36 plus we have our own sessions to plan and promote - there is a Scientific BOF session and a Scientific Lightning Talk session
11:10:39 I'm still not sure if I will be able to make it
11:11:00 and we have two forum sessions that we ought to hear back on this week
11:11:18 enolfc, if you need an official invite I can have a chat with Malcolm...
11:11:45 martial, yes that's right
11:12:08 martial, do you recall if we already started an etherpad for lightning talk possibilities?
11:12:10 thanks b1airo, official invite in principle not needed
11:12:23 i thought we might have but am not having much luck finding a link...
11:13:14 b1airo: I believe so, but I might be wrong
11:14:04 I did not see any link
11:14:15 ok, well i have minted a new one and if we find an old one (i haven't done a decent search of the previous IRC logs yet) then we can copy that content in
11:14:27 #link https://etherpad.openstack.org/p/sydney-scientific-sig-lightning-talks
11:14:42 Hello!
11:14:46 There is also https://etherpad.openstack.org/p/Scientific-Sydney17-Forum
11:15:11 with some content already, not sure if that's the lightning talks?
11:15:12 I checked the ones we listed on the September 27th meeting but they are not needed anymore
11:15:13 hi priteau!
11:15:37 thanks, yes that was for some quick forum session brainstorming (something we sort of failed to coordinate very well i'm sorry)
11:16:06 we got two that are not "refused" yet :)
11:16:31 so, if you are going to be in Sydney and have a story to tell, please add your name, topic and a two-sentence outline to that new etherpad
11:16:44 here it is again: https://etherpad.openstack.org/p/sydney-scientific-sig-lightning-talks
11:18:00 b1airo: we ought to post on the ML as well at this point
11:18:25 yes martial, i will follow up on that later
11:18:30 I know it is hard to get people to both meetings
11:18:39 #action b1airo to post to ML re. lightning talks
11:19:09 depending on the number of proposals, we might need to have a vote
11:19:36 ok, that is probably enough on that. i'll also mint another etherpad to use as a scientific summit guide, i.e., a place to note down sessions of interest to start your personal agenda off
11:19:59 martial, yeah possibly - let's cross that bridge if we get to it
11:20:18 ok, next thing was from me...
11:20:31 agreed, was looking for the time we had per session
11:20:36 #topic Monash OpenStack HPC expansion plans
11:21:18 I kind of disappeared down a rabbit hole the last month - just as I should have been helping with book chapter reviews (sorry!)
11:21:58 the reason for that is we are embarking on a major uplift of the system (M3) we built a little over a year ago
11:22:50 not sure how interested people are in hearing about new kit etc so i'll keep this brief and you can ask me questions if you have any more interest
11:23:48 a few months ago it became apparent that our present compute capability was not going to be sufficient to service all the expected new demand over the next 3-12 months
11:25:04 there are a few new and upgraded instruments around Monash University that are coming online and, once fully operational, will produce a lot of data, all of which needs heavy computational workflows of one kind or other to produce science results
11:25:18 what was your "expected use" vs "okay we really need"?
11:25:49 it's mostly the latter
11:27:41 anyway, a strategic proposal (read: plea) was put into the Research Portfolio and after a lot of scrambling to find unspent money and a few raids we will be adding around 2PB of Lustre, 2800 cores, 60 V100s
11:28:11 wow
11:28:14 pretty nice
11:28:29 and some cups and string for them all to talk
11:28:38 All of them under OpenStack?
11:28:59 khappone, yes, in one way or another...
11:29:08 Lustre?
11:29:31 the cups and strings will work ... plus they can have coffee in the unused cups :)
11:29:47 we are also just kicking off some work to get started with Ironic, planning to use it mainly as an internal management/deployment tool at first and then transition into a user-facing service
11:30:25 the actual bare-metal versus KVM split for the main HPC workloads is not yet determined, but probably a large amount bare-metal now
11:31:03 Is the workload expected to be true HPC or more HTC?
11:31:14 the Lustre capacity is expansion only, no new servers, just several hundred new disks
11:31:28 since you mention KVM, have you tested LXD?
11:32:14 khappone, it is small scale HPC - several of the primary use-cases commonly use ~4-8 node MPI jobs
11:32:33 but it is mostly data-centric processing
11:33:09 Great. Just curious about the direction others are going in, and their architectural choices
11:33:13 some of the MPI jobs are just using MPI for coordination, not any shared-memory or messaging
11:33:45 enolfc, no we haven't tried LXD yet, but I am LXD-curious - have you any experience?
11:34:56 it would probably be a worthwhile investment of time though
11:35:53 we have really pushed KVM in terms of tuning and it can provide almost bare-metal performance, but there is also an overhead in hitting issues etc that we are starting to have difficulty keeping up with along with the amount of infra we're now operating
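The meeting does not say which knobs Monash actually uses for this KVM tuning; as a minimal, illustrative sketch only, the snippet below shows the kind of Nova flavor extra specs commonly used to get pinned, hugepage-backed, NUMA-aware guests close to bare-metal performance. The flavor name and the particular property values are assumptions, not taken from the transcript.

# Illustrative sketch: common Nova flavor extra specs for HPC-style KVM guests.
# The flavor name and chosen values are hypothetical, not Monash's actual config.
hpc_extra_specs = {
    "hw:cpu_policy": "dedicated",       # pin vCPUs to host cores
    "hw:cpu_thread_policy": "prefer",   # thread-sibling placement policy (prefer/isolate/require)
    "hw:mem_page_size": "large",        # back guest RAM with hugepages
    "hw:numa_nodes": "1",               # keep the guest within one NUMA node
}

flavor = "hpc.m3.large"  # hypothetical flavor name

# Emit the equivalent OpenStack CLI invocation for review before applying it.
props = " ".join(f"--property {k}={v}" for k, v in hpc_extra_specs.items())
print(f"openstack flavor set {props} {flavor}")

CPU pinning and hugepages only pay off when paired with matching host-side configuration (isolated/reserved host cores and preallocated hugepages), which is outside the scope of this sketch.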
11:37:00 ok, no more questions on that. we can move onto something semi-related...
11:37:13 #topic Mellanox SR-IOV / RoCE tuning shenanigans
11:37:51 one thing we do for our virtualised HPC is use SR-IOV to get our data/mpi networks into guests
11:38:56 when we bought in some new P100 nodes (Intel Broadwell) earlier in the year we had this weird issue where we could not run 28 core MPI jobs, i.e., all cores in a node
11:39:37 IIRC it worked up to 27 cores and then died with an MXM error at 28 cores
11:40:21 Sounds like a fun bug
11:40:29 anyway, that turns out to be courtesy of Mellanox's HCOLL functionality
11:40:56 which is part of their Fabric Collectives Accelerator
11:41:04 Yeah! I'm sitting on the edge of my chair waiting for blair to type faster
11:41:14 haha
11:41:48 oddly, FCA is on by default, but so far as I can tell it only really helps properly big parallel workloads
11:42:16 (though we are yet to benchmark on-versus-off to confirm that)
11:42:24 FCA took one core for itself or something similar?
11:42:59 anyway, disabling hcoll allows bigger numprocs jobs - the problem is with memory allocation
11:43:45 the device driver appears to need more memory when using these offloads
11:44:25 Is this solution documented anywhere where google will find it if we hit it later?
11:45:07 Mellanox recommended tuning the PF_LOG_BAR_SIZE and VF_LOG_BAR_SIZE firmware config parameters to fix this properly, rather than disabling HCOLL
11:45:52 so we diligently did this and then our server would not boot!
11:46:10 seems this topic #2 is now into topic #3 :)
11:46:55 sadly some BIOSes prevent mapping the BAR over addresses which surpass 4GB
11:47:13 and that can then cause the device to hang
11:47:44 and I guess for these Dell servers they do not continue bootup with devices in such a state
11:48:39 khappone, so far this is all in a support case with Mellanox - but they have reproduced it and I believe they will be addressing it with Dell too
11:48:52 Ok, good to know
11:49:09 once we have determined the BAR tuning fixes the problem I will post about it on the Mellanox community forums
11:49:51 there is one other brief thing to mention on Mellanox tuning if you are using Ethernet
11:50:13 for RoCE workloads they recommend tuning the device ring-buffers, i.e., increasing them from default
11:51:22 recently we hit some OOMs and as a result noticed that increasing the RX & TX ring-buffers on a single port to 8192 each ate another ~3.5GB of memory
11:52:09 this is not intuitive (or documented) until you know that the ring-buffer is a per-core resource
11:53:00 Yes. I had to dig into linux networking and ring buffer internals a while back.
11:53:21 and each ring-buffer "slot" takes the amount of memory required to fit a frame into, e.g. for 9000 MTU that will be 3 pages per frame (12K)
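A quick back-of-the-envelope sketch of the point above, assuming one RX and one TX ring per channel and slots rounded up to whole pages. The channel count used in the example is an assumption (it is not stated in the meeting), chosen only to show how the observed ~3.5GB figure can arise at a ring size of 8192 with 9000 MTU.

import math

PAGE_SIZE = 4096  # bytes

def ring_buffer_memory_bytes(ring_entries, mtu, channels, rings_per_channel=2):
    """Rough upper-bound estimate of host memory used by NIC ring buffers.

    Each slot is sized to hold a full frame, rounded up to whole pages
    (9000 MTU -> 3 x 4 KiB pages = 12K), and rings are allocated per channel
    (typically one channel per core). Whether TX rings really pre-allocate
    full frame buffers is driver dependent, so treat this as a ceiling.
    """
    pages_per_frame = math.ceil(mtu / PAGE_SIZE)   # 9000 -> 3 pages
    slot_bytes = pages_per_frame * PAGE_SIZE       # 12 KiB per slot
    return ring_entries * slot_bytes * rings_per_channel * channels

# channels=18 is an assumed value picked to illustrate the ~3.5GB observation.
est = ring_buffer_memory_bytes(ring_entries=8192, mtu=9000, channels=18)
print(f"~{est / 2**30:.1f} GiB")  # prints ~3.4 GiB

The same arithmetic also suggests the obvious levers if the PF is squeezing guests out of memory: smaller rings, fewer channels, or a smaller MTU on ports that do not need it.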
11:53:23 This tuning is done on the VM side if you use SR-IOV?
11:53:48 no it's host-side on the PF, so the OOM we hit was QEMU being killed :-(
11:54:18 alright, let's briefly talk about the image guidelines in the last few mins!
11:54:23 Ok
11:54:24 Yes
11:54:51 #topic Cloud Image Recommendations / Standardisation
11:55:13 Background: We spend too much time with our image creation and management. We make some "important" local site-specific tweaks
11:55:19 khappone, i am interested in your work because we also did something similar for Nectar a while back in a project i led
11:55:44 Pretty much every NREN/scientific site I've talked to has had the same issues
11:55:57 Is that somehow related to what Stig shared in the past https://galaxy.ansible.com/stackhpc/ ?
11:56:16 no.
11:56:19 https://github.com/CSCfi/CloudImageRecommendations
11:57:02 After discussing, we realized that each site has "important local tweaks", but most of them don't overlap at all, so we started questioning the need to do any local tweaks
11:57:13 khappone, i agree - images are a bit of a drain
11:58:18 based on that I wrote a document with recommendations on how to manage images for scientific sites. Basically it lists images, their sources, and suggested naming. It says don't touch them, just upload them and keep them up to date
11:58:56 looked at the page, very nice. I like the fact that images are updated regularly
11:59:02 ok, i see this work is more about high-level standardisation across clouds - that's good, doesn't really overlap too much with what we were working on, which is about a process for users contributing images into the cloud: https://support.ehelp.edu.au/support/solutions/articles/6000173077-contributed-images-submission-guide
11:59:38 The idea is that many sites would have identical images, and identical image management processes. The benefits would be that tools could be shared, and that sites could trust that any appliances built on these don't have unnecessary site-specific tweaks
12:00:15 i'm curious about the tweaks you came up with...
12:00:22 (the list of, i mean)
12:00:37 E.g. we could build CUDA images, and it would be the base image + CUDA, with none of our strange tweaks. This means anybody conforming to this could just take it, point their automation to the source of the image and basically have it with zero effort
12:01:20 oh dear - we are over time! khappone, do you want to join me in #scientific-wg channel to continue discussing for another 5 mins?
12:01:31 b1airo: sure
12:01:33 thank you all!
12:01:37 #endmeeting