21:02:22 #startmeeting scientific-wg
21:02:22 Meeting started Tue Feb 7 21:02:22 2017 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:26 The meeting name has been set to 'scientific_wg'
21:02:31 hi everyone
21:02:35 Hi zioproto - up late!
21:02:38 hi
21:02:38 * ildikov is lurking :)
21:02:39 Hi Stig
21:02:40 Hello
21:02:43 Hi all
21:02:47 #chair martial
21:02:48 Current chairs: martial oneswig
21:02:48 Hi all
21:02:48 * cdent waves
21:02:51 hello everyone
21:03:00 (sorry dual meeting right now -- in person + IRC)
21:03:24 we'll try not to cause you any sudden outbursts in your real meeting Martial :-)
21:03:53 #link Agenda for today https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_February_7th_2017
21:03:57 freshly minted
21:03:59 oneswig: thanks will be listening in both :) [working my multitasking skills :) ]
21:04:29 Perfect. Anyone seen Blair yet?
21:04:57 freaky
21:04:59 #chair b1airo
21:05:00 Current chairs: b1airo martial oneswig
21:05:07 I rubbed the lamp - that's the truth
21:05:09 morning, bit tardy sorry
21:05:22 what have i just walked in on?
21:05:26 Hi b1airo
21:05:35 only the minutes can answer that!
21:06:15 Ready to talk HPL and hypervisors?
21:06:22 rubbed the lamp... did you win something, or literally find a lamp?
21:06:30 sure
21:06:33 ... aladdin
21:06:35 #topic HPL and hypervisors
21:07:01 ok now that i say yes i need to paste some graphs
21:08:15 this one is SMP linpack:
21:08:18 #link http://pasteboard.co/vBwRsur2Q.png
21:09:04 Taking a while to reach me here...
21:09:08 is slow
21:09:12 ok I have loaded it
21:09:13 aha, got it
21:09:55 this one is HPL with OpenMPI (still single host though): http://pasteboard.co/vByv44w1d.png
21:09:59 #link http://pasteboard.co/vByv44w1d.png
21:10:28 acks to my colleague Phil Chan for generating these
21:10:58 These are hostnames along the x-axis?
21:12:19 the shortname yeah - they are not publicly accessible though so not too worried about security if that was your thinking?
21:12:42 Just wondering what they were - hostnames, parameter variations, ?
21:12:58 so you have guest name and hypervisor name along the x
21:13:23 so basically the story goes like this...
21:13:23 What are the five 120k outliers in the middle of the HPL chart due to?
21:13:31 indeed, what?
21:13:52 I came here tonight for answers!
21:14:33 we only put these nodes into production semi-recently as mellanox had been kind of using them as a RoCE MPI testbed for a while (there's a bunch of fixes and improvements in MOFED 4.0 thanks to this)
21:15:27 anyway, some of our users complained of jobs running much slower on these nodes than on the other 4 year old cluster, so we started digging
21:17:11 just thought i'd share what we have so far as it is a nice sample showing how tight BM-guest can be
21:18:07 VMs appear to have a performance edge for hpl - any theories on why?
21:18:12 the next thing is to figure out what is going on with those outliers - presumably something on the hypervisor is getting in the way at the highest memory utilisation levels, but not sure what yet as they are all the same
21:18:51 oneswig, at this stage i'm guessing it is just the difference in task scheduling between kernel versions
21:19:26 host is Ubuntu with 4.4, guest is CentOS7 with 3.10(?)
21:19:54 Is the NIC involved at all here, or is it memory-copy between VMs?
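
The charts above compare per-host Linpack/HPL results for bare metal against guests on the same hosts, with a handful of outlying nodes. As a minimal sketch of how one might flag such outliers, assuming the per-host results have already been collected into a hypothetical "hostname,gflops" CSV (the file name and format are illustrative, not part of HPL itself):

    # Sketch: flag hosts whose HPL result deviates noticeably from the
    # cluster median, given a hypothetical CSV of "hostname,gflops" lines.
    import csv
    import statistics

    def find_outliers(path="hpl_results.csv", threshold=0.10):
        # Read hostname -> GFLOPS pairs collected from the HPL runs.
        with open(path, newline="") as f:
            results = {row[0]: float(row[1]) for row in csv.reader(f) if row}
        median = statistics.median(results.values())
        # Flag anything more than `threshold` away from the median, either side.
        return {host: gflops for host, gflops in results.items()
                if abs(gflops - median) > threshold * median}

    if __name__ == "__main__":
        for host, gflops in sorted(find_outliers().items()):
            print(f"{host}: {gflops:.1f} GFLOPS deviates from the cluster median")
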
21:19:56 the guest is pinned etc so host task scheduling is somewhat taken out of the picture there
21:20:08 Or is it just one VM
21:20:27 yeah just single host runs, no IPC
21:21:39 i started looking at whether the guest memory was pinned like i thought it was, but seems the idea i had that pci passthrough forces all guest memory to be pinned was wrong
21:21:41 perf in the hypervisor any help?
21:22:29 at least the only place i've found to look for that is Mlocked memory in /proc/meminfo
21:22:47 You can apparently get perf traces of stacks from both hypervisor and guest, which is pretty cool (but I've never done it)
21:23:08 oneswig, yes that's one of the next things to look at now that we have a known set of misbehaving machines
21:23:40 i have never used it yet either, just started reading
21:24:02 Have you compared bios settings with racadm? Might not explain why only the large-block cases are problematic
21:24:36 think i heard about an ansible role or two that might be useful for that ;-)
21:24:50 Funny you should say that!
21:25:14 have not compared them with a fine-tooth comb just yet, but the big "performance" items have been checked
21:25:47 You'll report back if you figure it out I hope!
21:26:01 in particular, all these hosts now have highest pstate forced via kernel command line
21:26:41 so they idle at a higher frequency than when they are actually working (AVX is a bit disconcerting like that)
21:27:16 Because AVX heats the chip up?
21:27:18 yes, hopefully that'll be next week! i feel like an amateur detective at the moment, i'm just bouncing from one performance issue to another
21:27:51 Time for the lowdown on ECMP?
21:28:19 oneswig, yes all the modern Intel CPUs have lower base frequencies for AVX instructions (you won't find that anywhere prominent on the product pages though!)
21:28:39 yes ECMP
21:28:39 #topic RoCE over layer-3 ECMP fabrics
21:29:00 what does RoCE stand for?
21:29:02 So this is layer-2 RoCE over VXLAN over ECMP, right?
21:29:09 RDMA-over-Converged-Ethernet
21:29:21 memory over ethernet ?
21:29:24 also known as InfiniBand-on-Ethernet
21:29:27 correct
21:29:48 It's low latency networking
21:29:58 a couple of weeks ago we started our big datacentre migration, in the process we replaced most of the switching in our research compute infrastructure
21:30:48 we were previously mostly using Mellanox SX1036 40/56GbE in a layer-2 configuration
21:32:05 b1airo: what caused you to change?
21:33:05 sorry burnt toast emergency
21:33:07 back now
21:33:26 the size of the fabric we could build was limited to 2x spine switches
21:33:44 Due to MLAG?
21:33:46 using either MLAG or a multi-STP config
21:34:20 and the MLAG had a habit of being a bit fragile
21:34:39 so now you have an L3 leaf/spine architecture?
21:34:44 so all our new gear is Mellanox Spectrum (100/50/25GbE)
21:34:46 but with more than 2 spines ?
21:34:56 with switches all running Cumulus
21:35:43 zioproto, actually not yet - but yes we could have more spines or insert another aggregation layer
21:36:11 We're presently running with MLAG, all Cumulus, 2 spines and 10 leaf switches. So far so good.
21:36:26 the switches speak BGP to each other and VXLAN encap/decap traffic to/from the edge ports
21:36:41 (in hardware)
21:36:41 We are old school - 2 core, 9 edge, MLNX-OS + NEO
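
For readers unfamiliar with the encapsulation just mentioned, a minimal sketch of VXLAN framing (per RFC 7348) follows: an 8-byte header carrying a 24-bit VNI, wrapped in an outer UDP datagram to port 4789, with the outer UDP source port normally derived from a hash of the inner frame so that ECMP gets some per-flow entropy. The helper functions below are illustrative only, not taken from any particular library.

    # Sketch of VXLAN framing basics (RFC 7348); illustrative helpers only.
    import struct
    import zlib

    VXLAN_UDP_DST_PORT = 4789  # IANA-assigned destination port for VXLAN

    def vxlan_header(vni: int) -> bytes:
        """8-byte VXLAN header: flags byte with the I bit set, 24 reserved
        bits, the 24-bit VNI, then 8 reserved bits."""
        return struct.pack("!II", 0x08 << 24, vni << 8)

    def outer_udp_src_port(inner_frame: bytes) -> int:
        """Illustrative only: derive the outer UDP source port from a hash of
        the inner frame so different inner flows spread across ECMP paths."""
        return 49152 + (zlib.crc32(inner_frame) % 16384)

    print(vxlan_header(4242).hex())            # 0800000000109200
    print(outer_udp_src_port(b"inner frame"))  # some port in 49152-65535
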
21:36:54 Is this Cumulus 3.2.0?
21:37:03 b1airo, are you using puppet or ansible to configure the cumulus switches ?
21:37:03 jonmills_nasa, yep
21:37:10 zioproto, ansible
21:37:27 honestly i don't think puppet really belongs with switching
21:37:53 I sorta thought that hardware accelerated VTEP wasn't supported in Cumulus until 3.3 release or even 3.4....
21:38:13 b1airo: how are the VXLAN VNIs configured? Is there a physical-aware ML2 driver behind that?
21:38:26 anyway, it was a huge effort to get it all built and configured - we actually started moving hosts before the network configs were completed
21:39:32 oneswig, no at this stage we haven't added any features just done the migration - so the automation has port profiles set up which configure the VNIs appropriately for the required host-facing VLANs
21:40:12 Ah OK, so the host's vlans are mapped to ranges of vnis in the vtep?
21:40:29 jonmills_nasa, this is using Cumulus LNV to coordinate the VNIs across switches, switching definitely happening in hardware
21:40:38 oneswig, yep
21:41:02 once we had all the gear migrated to the new DC we started properly testing the new network
21:41:15 b1airo, this is great stuff, this is my long-term goal -- just didn't think I could get there on first pass, so we started with all L2
21:41:49 jonmills_nasa, cool - glad to have some company! Cumulus is very nice to work with I have to say
21:41:58 I am loving it
21:42:11 we used Lustre LNET self test
21:42:26 I generated 100% of my L2 edge port config using a very small shell script with NCLU commands
21:42:39 we run Lustre over the o2ib driver as usual but on RoCEv1 instead of IB
21:42:48 (so far very cool)
21:42:51 b1airo, we also have Cumulus at SWITCH
21:43:15 b1airo, but we are using puppet and OSPF unnumbered for the fabric
21:43:49 we first had to ensure the network could handle RoCEv1 at all, which means needing flow-control everywhere (RoCEv1 does not like lossy networks)
21:44:32 the first hurdle there was that we couldn't just turn on global pause like we had previously on the MLNX-OS L2 fabric
21:44:35 I hadn't realised that SR-IOV can be lossy - flow control ends at the phy on Mellanox NICs
21:45:23 oneswig, actually we should go into that more in a moment as they haven't told us anything particular about that (but of course I am interested to hear it!)
21:46:24 the Cumulus folks were leery about enabling pause on the inter-switch links because they were worried BGP peer exchange messages might get held up long enough that it would then impact the fabric topology
21:47:07 so we had to have priority flow-control configured inter-switch with switch-switch traffic at the highest CoS
21:48:04 then we started pushing large streams around and immediately noticed only a single link utilised up and down when crossing spines
21:49:22 turns out that the VXLAN encap happens before the packet is routed to the next hop, so the ECMP hash is done on the VXLAN'd packet, and for RoCEv1 at least that 5-tuple looks exactly the same for every packet on the same VNI
21:50:21 blair, that is a typical hashing problem
21:50:43 most hashing stuff expects UDP or TCP
21:51:04 when you have other protocols the hashing algorithm is dumb and sends all to the same link
21:51:16 hence, one out of four 100G paths used and crappy performance - we actually had a simple cluster test we used to simulate what some user workloads do, we call it "crazy cat lady" - it basically just cats a bunch of large files into /dev/null in a parallel job
21:51:52 love that name!
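
To illustrate the polarisation described above, here is a toy sketch of ECMP next-hop selection by 5-tuple hash. It is not how the Spectrum ASIC actually computes its hash, just a demonstration that flows whose outer headers never vary all land on one uplink; the addresses and ports are made up.

    # Toy ECMP hash demo: varying outer tuples spread, identical ones do not.
    import zlib

    UPLINKS = 4  # e.g. four 100G paths across the spines

    def ecmp_pick(src_ip, dst_ip, proto, src_port, dst_port):
        # Toy 5-tuple hash -> uplink index (not the real ASIC algorithm).
        key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        return zlib.crc32(key) % UPLINKS

    # Ordinary TCP flows vary the source port, so they spread across uplinks:
    tcp_links = {ecmp_pick("10.0.1.10", "10.0.2.10", "tcp", sport, 443)
                 for sport in range(40000, 40016)}
    print("TCP flows land on uplinks:", sorted(tcp_links))

    # VXLAN'd RoCEv1 between one pair of VTEPs: the outer 5-tuple is the same
    # for every packet on the VNI, so every packet picks the same uplink:
    roce_links = {ecmp_pick("10.0.1.1", "10.0.2.1", "udp", 49152, 4789)
                  for _ in range(16)}
    print("RoCEv1-over-VXLAN lands on uplinks:", sorted(roce_links))
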
21:51:54 why is the destination IP of the VXLAN frame not different for flows to different machines?
21:52:05 zioproto, it is particularly bad with flow-control, as the link gets highly congested with pause frames as well
21:53:22 oneswig, this is for flows from any host in rack A to any host in rack B on the storage VLAN - that's a single VNI for each ToR
21:53:36 ahhh, got it, thanks
21:53:48 what's the fix?
21:54:13 anyway, the Cumulus support guys spent a couple of days in video conference with us working through testing and workarounds
21:54:15 Can you hash on the decapsulated packet?
21:54:58 (how are we doing on time versus agenda?)
21:55:24 martial: not very well...
21:55:26 the fix was to run a little python script that they hacked up which altered some base port settings in the ASIC and introduced some randomness to the hash
21:56:03 b1airo: is this going to be your gift to Cumulus Linux?
21:56:35 their engineering wrote it, but i'll claim i provided the inspiration
21:56:56 i believe this is going into 3.2.1 permanently
21:56:57 Is there a long-term fix in the works, I assume?
21:57:03 OK
21:57:26 We should cover some other matters before the clock strikes...
21:57:30 looks like i monopolised somewhat today (sorry!)
21:57:43 blame the burnt toast ;)
21:57:46 #topic Boston Cloud Declaration
21:57:56 May 11 & 12
21:58:01 Hooray - the date is moved to the same week as the summit :-)
21:58:17 #topic Repo for WG docs
21:58:27 #link http://git.openstack.org/cgit/openstack/scientific-wg/
21:58:33 guys I have to go
21:58:34 oneswig: that's good news. I assume the wiki page will be updated? https://wiki.openstack.org/wiki/Boston-Cloud-Congress
21:58:34 we've got somewhere where we can write things now
21:58:48 bye :) sorry for leaving early
21:58:49 ciao
21:58:49 priteau: I think so - should have been already
21:58:53 thanks zioproto
21:59:05 #topic SC2017 workshops
21:59:20 I filed for two workshops at SC2017
21:59:27 one on infrastructure and one on platforms/apps
21:59:41 #link as per discussion at https://etherpad.openstack.org/p/SC17WorkshopWorksheet
21:59:55 oneswig, nice one thanks for getting that done. i saw the notification email come by
21:59:59 There's still a week (apparently) before the submissions ship, any more volunteers please?
22:00:13 Thanks b1airo, no problem
22:00:24 Some improvement on the wording would help!
22:00:32 oneswig, we haven't pinged the MLs about that have we?
22:00:41 If anyone else will be at SC and can contribute, please put your names in the etherpad
22:01:05 oneswig: i'll read it over and can likely volunteer to help
22:01:08 #action oneswig to put the workshops onto the ML
22:01:09 i'm always at SC
22:01:13 thanks rbudden
22:01:17 rbudden, that'd be great!
22:01:22 we are out of time, alas
22:01:33 I'm honestly hoping to avoid SC17. I'll be at Boston Summit tho
22:01:35 and as always we have the PSC booth for other SC content
22:01:36 Thanks all and thanks b1airo, I like my toast to be high roast
22:01:51 you should all prioritise the Sydney Summit!
22:02:02 b1airo: i'm trying to pull strings ;)
22:02:05 That's my plan b1airo!
22:02:13 Sydney's a go for me
22:02:13 time to close, alas
22:02:16 Gov would never pay for me to go there
22:02:20 the wife has always wanted to visit as well!
22:02:32 awesome, i should start looking for venues!
22:02:35 jonmills_nasa: not since they've fingered you as @RogueNASA?
22:02:45 lol
22:02:54 really gotta close now :-)
22:02:55 #endmeeting