11:01:03 #startmeeting scientific_wg
11:01:04 Meeting started Wed Oct 11 11:01:03 2017 UTC and is due to finish in 60 minutes. The chair is b1airo. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:01:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:01:07 The meeting name has been set to 'scientific_wg'
11:01:17 #chair martial
11:01:18 Current chairs: b1airo martial
11:02:08 #topic Roll call - wave if you are "here"
11:03:01 hi
11:03:07 hi enolfc
11:03:09 *wave*
11:03:43 khappone, i take it that is Kalle from CSC?
11:03:55 b1airo: correct
11:04:23 don't think i have seen you for a while, hope you're doing well!
11:05:01 #topic Agenda
11:05:01 Yes! It's been a while! You know how it is, busy :)
11:05:08 indeed!
11:05:19 And trying to keep up with 4 other forums where scientific openstack is discussed :)
11:05:40 I think today may be a quick one, but we'll see... on the agenda thus far I have:
11:05:46 Sydney Summit session prep
11:05:46 Monash OpenStack HPC expansion plans
11:05:46 Mellanox SR-IOV / RoCE tuning shenanigans
11:06:07 and I am personally interested to recap on the Image Naming Conventions discussion from last week that I missed
11:06:12 I.. https://github.com/CSCfi/CloudImageRecommendations
11:06:28 Yes, I was going to ask if that could be discussed in this timezone
11:06:29 wavy thingie :)
11:06:40 hi martial :-)
11:06:53 Hello, sorry for joining late.
11:06:56 hi Blair, as mentioned, getting one kid out the door :)
11:07:06 alright, so topic #1...
11:07:26 no probs martial - you better go martial your kid... ;-)
11:07:38 #topic Sydney Summit
11:08:01 the summit is coming up fast, who's coming?
11:08:10 I know Stig and martial are
11:08:27 I can't sadly make it, nor my colleagues
11:08:32 zioproto is, I'm not.
11:08:44 (He's busy on another line right now.)
11:08:55 hi simon-AS559, that's a shame, but good that zioproto will be there at least
11:09:15 I am here
11:09:20 b1airo: instilling discipline ... check :)
11:09:58 it's about this time we normally start compiling a list of sessions of interest to sciencey inclined folks
11:10:36 plus we have our own sessions to plan and promote - there is a Scientific BOF session and a Scientific Lightning Talk session
11:10:39 I'm still not sure if I will be able to make it
11:11:00 and we have two forum sessions that we ought to hear back on this week
11:11:18 enolfc, if you need an official invite I can have a chat with Malcolm...
11:11:45 martial, yes that's right
11:12:08 martial, do you recall if we already started an etherpad for lightning talk possibilities?
11:12:10 thanks b1airo, official invite in principle not needed
11:12:23 i thought we might have but am not having much luck finding a link...
11:13:14 b1airo: I believe so, but I might be wrong
11:14:04 I did not see any link
11:14:15 ok, well i have minted a new one and if we find an old one (i haven't done a decent search of the previous IRC logs yet) then we can copy that content in
11:14:27 #link https://etherpad.openstack.org/p/sydney-scientific-sig-lightning-talks
11:14:42 Hello!
11:14:46 There is also https://etherpad.openstack.org/p/Scientific-Sydney17-Forum
11:15:11 with some content already, not sure if that's the lightning talks?
11:15:12 I checked the ones we listed on the September 27th meeting but they are not needed anymore
11:15:13 hi priteau!
11:15:37 thanks, yes that was for some quick forum session brainstorming (something we sort of failed to coordinate very well i'm sorry)
11:16:06 we got two that are not "refused" yet :)
11:16:31 so, if you are going to be in Sydney and have a story to tell, please add your name, topic and a two-sentence outline to that new etherpad
11:16:44 here it is again: https://etherpad.openstack.org/p/sydney-scientific-sig-lightning-talks
11:18:00 b1airo: we ought to post on the ML as well at this point
11:18:25 yes martial, i will follow up on that later
11:18:30 I know it is hard to get people to both meetings
11:18:39 #action b1airo to post to ML re. lightning talks
11:19:09 depending on the number of proposals, we might need to have a vote
11:19:36 ok, that is probably enough on that. i'll also mint another etherpad to use as a scientific summit guide, i.e., a place to note down sessions of interest to start your personal agenda off
11:19:59 martial, yeah possibly - let's cross that bridge if we get to it
11:20:18 ok, next thing was from me...
11:20:31 agreed, was looking for the time we had per session
11:20:36 #topic Monash OpenStack HPC expansion plans
11:21:18 I kind of disappeared down a rabbit hole the last month - just as I should have been helping with book chapter reviews (sorry!)
11:21:58 the reason for that is we are embarking on a major uplift of the system (M3) we built a little over a year ago
11:22:50 not sure how interested people are in hearing about new kit etc so i'll keep this brief and you can ask me questions if you have any more interest
11:23:48 a few months ago it became apparent that our present compute capability was not going to be sufficient to service all the expected new demand over the next 3-12 months
11:25:04 there are a few new and upgraded instruments around Monash University that are coming online and, once fully operational, will produce a lot of data, all of which needs heavy computational workflows of one kind or other to produce science results
11:25:18 what was your "expected use" vs "okay we really need"?
11:25:49 it's mostly the latter
11:27:41 anyway, a strategic proposal (read: plea) was put into the Research Portfolio and after a lot of scrambling to find unspent money and a few raids we will be adding around 2PB of Lustre, 2800 cores, 60 V100s
11:28:11 wow
11:28:14 pretty nice
11:28:29 and some cups and string for them all to talk
11:28:38 All of them under OpenStack?
11:28:59 khappone, yes, in one way or another...
11:29:08 Lustre?
11:29:31 the cups and strings will work ... plus they can have coffee in the unused cups :)
11:29:47 we are also just kicking off some work to get started with Ironic, planning to use it mainly as an internal management/deployment tool at first and then transition into a user-facing service
11:30:25 the actual bare-metal versus KVM split for the main HPC workloads is not yet determined, but probably a large amount bare-metal now
11:31:03 Is the workload expected to be true HPC or more HTC?
11:31:14 the Lustre capacity is expansion only, no new servers, just several hundred new disks
11:31:28 since you mention KVM, have you tested LXD?
11:32:14 khappone, it is small scale HPC - several of the primary use-cases commonly use ~4-8 node MPI jobs
11:32:33 but it is mostly data-centric processing
11:33:09 Great. Just curious about the direction others are going in, and their architectural choices
11:33:13 some of the MPI jobs are just using MPI for coordination, not any shared-memory or messaging
11:33:45 enolfc, no we haven't tried LXD yet, but I am LXD-curious - have you any experience?
11:34:56 it would probably be a worthwhile investment of time though
11:35:53 we have really pushed KVM in terms of tuning and it can provide almost bare-metal performance, but there is also an overhead in hitting issues etc that we are starting to have difficulty keeping up with along with the amount of infra we're now operating
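The meeting does not say which knobs Monash actually uses for this KVM tuning; as a minimal, illustrative sketch only, the snippet below shows the kind of Nova flavor extra specs commonly used to get pinned, hugepage-backed, NUMA-aware guests close to bare-metal performance. The flavor name and the particular property values are assumptions, not taken from the transcript.

# Illustrative sketch: common Nova flavor extra specs for HPC-style KVM guests.
# The flavor name and chosen values are hypothetical, not Monash's actual config.
hpc_extra_specs = {
    "hw:cpu_policy": "dedicated",       # pin vCPUs to host cores
    "hw:cpu_thread_policy": "prefer",   # thread-sibling placement policy (prefer/isolate/require)
    "hw:mem_page_size": "large",        # back guest RAM with hugepages
    "hw:numa_nodes": "1",               # keep the guest within one NUMA node
}

flavor = "hpc.m3.large"  # hypothetical flavor name

# Emit the equivalent OpenStack CLI invocation for review before applying it.
props = " ".join(f"--property {k}={v}" for k, v in hpc_extra_specs.items())
print(f"openstack flavor set {props} {flavor}")

CPU pinning and hugepages only pay off when paired with matching host-side configuration (isolated/reserved host cores and preallocated hugepages), which is outside the scope of this sketch.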
11:37:00 ok, no more questions on that. we can move onto something semi-related...
11:37:13 #topic Mellanox SR-IOV / RoCE tuning shenanigans
11:37:51 one thing we do for our virtualised HPC is use SR-IOV to get our data/mpi networks into guests
11:38:56 when we bought in some new P100 nodes (Intel Broadwell) earlier in the year we had this weird issue where we could not run 28 core MPI jobs, i.e., all cores in a node
11:39:37 IIRC it worked up to 27 cores and then died with an MXM error at 28 cores
11:40:21 Sounds like a fun bug
11:40:29 anyway, that turns out to be courtesy of Mellanox's HCOLL functionality
11:40:56 which is part of their Fabric Collectives Accelerator
11:41:04 Yeah! I'm sitting on the edge of my chair waiting for blair to type faster
11:41:14 haha
11:41:48 oddly, FCA is on by default, but so far as I can tell it only really helps properly big parallel workloads
11:42:16 (though we are yet to benchmark on-versus-off to confirm that)
11:42:24 FCA took one core for itself or something similar?
11:42:59 anyway, disabling hcoll allows bigger numprocs jobs - the problem is with memory allocation
11:43:45 the device driver appears to need more memory when using these offloads
11:44:25 Is this solution documented anywhere where google will find it if we hit it later?
11:45:07 Mellanox recommended tuning the PF_LOG_BAR_SIZE and VF_LOG_BAR_SIZE firmware config parameters to fix this properly, rather than disabling HCOLL
11:45:52 so we diligently did this and then our server would not boot!
11:46:10 seems this topic #2 is now into topic #3 :)
11:46:55 sadly some BIOSes prevent mapping the BAR over addresses which surpass 4GB
11:47:13 and that can then cause the device to hang
11:47:44 and I guess for these Dell servers they do not continue bootup with devices in such a state
11:48:39 khappone, so far this is all in a support case with Mellanox - but they have reproduced it and I believe they will be addressing it with Dell too
11:48:52 Ok, good to know
11:49:09 once we have determined the BAR tuning fixes the problem I will post about it on the Mellanox community forums
11:49:51 there is one other brief thing to mention on Mellanox tuning if you are using Ethernet
11:50:13 for RoCE workloads they recommend tuning the device ring-buffers, i.e., increasing them from default
11:51:22 recently we hit some OOMs and as a result noticed that increasing the RX & TX ring-buffers on a single port to 8192 each ate another ~3.5GB of memory
11:52:09 this is not intuitive (or documented) until you know that the ring-buffer is a per-core resource
11:53:00 Yes. I had to dig into linux networking and ring buffer internals a while back.
11:53:21 and each ring-buffer "slot" takes the amount of memory required to fit a frame into, e.g. for 9000 MTU that will be 3 pages per frame (12K)
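A quick back-of-the-envelope sketch of the point above, assuming one RX and one TX ring per channel and slots rounded up to whole pages. The channel count used in the example is an assumption (it is not stated in the meeting), chosen only to show how the observed ~3.5GB figure can arise at a ring size of 8192 with 9000 MTU.

import math

PAGE_SIZE = 4096  # bytes

def ring_buffer_memory_bytes(ring_entries, mtu, channels, rings_per_channel=2):
    """Rough upper-bound estimate of host memory used by NIC ring buffers.

    Each slot is sized to hold a full frame, rounded up to whole pages
    (9000 MTU -> 3 x 4 KiB pages = 12K), and rings are allocated per channel
    (typically one channel per core). Whether TX rings really pre-allocate
    full frame buffers is driver dependent, so treat this as a ceiling.
    """
    pages_per_frame = math.ceil(mtu / PAGE_SIZE)   # 9000 -> 3 pages
    slot_bytes = pages_per_frame * PAGE_SIZE       # 12 KiB per slot
    return ring_entries * slot_bytes * rings_per_channel * channels

# channels=18 is an assumed value picked to illustrate the ~3.5GB observation.
est = ring_buffer_memory_bytes(ring_entries=8192, mtu=9000, channels=18)
print(f"~{est / 2**30:.1f} GiB")  # prints ~3.4 GiB

The same arithmetic also suggests the obvious levers if the PF is squeezing guests out of memory: smaller rings, fewer channels, or a smaller MTU on ports that do not need it.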
11:53:23 This tuning is done on the VM side if you use SR-IOV?
11:53:48 no it's host-side on the PF, so the OOM we hit was QEMU being killed :-(
11:54:18 alright, let's briefly talk about the image guidelines in the last few mins!
11:54:23 Ok
11:54:24 Yes
11:54:51 #topic Cloud Image Recommendations / Standardisation
11:55:13 Background: We spend too much time with our image creation and management. We make some "important" local site-specific tweaks
11:55:19 khappone, i am interested in your work because we also did something similar for Nectar a while back in a project i led
11:55:44 Pretty much every NREN/scientific site I've talked to has had the same issues
11:55:57 Is that somehow related to what Stig shared in the past https://galaxy.ansible.com/stackhpc/ ?
11:56:16 no.
11:56:19 https://github.com/CSCfi/CloudImageRecommendations
11:57:02 After discussing, we realized that each site has "important local tweaks", but most of them don't overlap at all, so we started questioning the need to do any local tweaks
11:57:13 khappone, i agree - images are a bit of a drain
11:58:18 based on that I wrote a document with recommendations on how to manage images for scientific sites. Basically it lists images, their sources, and suggested naming. It says don't touch them, just upload them and keep them up to date
11:58:56 looked at the page, very nice. I like the fact that images are updated regularly
11:59:02 ok, i see this work is more about high-level standardisation across clouds - that's good, doesn't really overlap too much with what we were working on, which is about a process for users contributing images into the cloud: https://support.ehelp.edu.au/support/solutions/articles/6000173077-contributed-images-submission-guide
11:59:38 The idea is that many sites would have identical images, and identical image management processes. The benefits would be that tools could be shared, and that sites could trust that any appliances built on these don't have unnecessary site-specific tweaks
12:00:15 i'm curious about the tweaks you came up with...
12:00:22 (the list of, i mean)
12:00:37 E.g. we could build CUDA images, and it would be the base image + CUDA, with none of our strange tweaks. This means anybody conforming to this could just take it, point their automation to the source of the image and basically have it with zero effort
12:01:20 oh dear - we are over time! khappone, do you want to join me in #scientific-wg channel to continue discussing for another 5 mins?
12:01:31 b1airo: sure
12:01:33 thank you all!
12:01:37 #endmeeting