21:02:22 <oneswig> #startmeeting scientific-wg
21:02:22 <openstack> Meeting started Tue Feb  7 21:02:22 2017 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:26 <openstack> The meeting name has been set to 'scientific_wg'
21:02:31 <trandles> hi everyone
21:02:35 <oneswig> Hi zioproto - up late!
21:02:38 <jonmills_nasa> hi
21:02:38 * ildikov is lurking :)
21:02:39 <martial> Hi Stig
21:02:40 <priteau> Hello
21:02:43 <oneswig> Hi all
21:02:47 <oneswig> #chair martial
21:02:48 <openstack> Current chairs: martial oneswig
21:02:48 <martial> Hi all
21:02:48 * cdent waves
21:02:51 <rbudden> hello everyone
21:03:00 <martial> (sorry dual meeting right now -- in person + IRC)
21:03:24 <oneswig> we'll try not to cause you any sudden outbursts in your real meeting Martial :-)
21:03:53 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_February_7th_2017
21:03:57 <oneswig> freshly minted
21:03:59 <martial> oneswig: thanks will be listening in both :) [working my multitasking skills :) ]
21:04:29 <oneswig> Perfect.  Anyone seen Blair yet?
21:04:57 <trandles> freaky
21:04:59 <oneswig> #chair b1airo
21:05:00 <openstack> Current chairs: b1airo martial oneswig
21:05:07 <oneswig> I rubbed the lamp - that's the truth
21:05:09 <b1airo> morning, bit tardy sorry
21:05:22 <b1airo> what have i just walked in on?
21:05:26 <oneswig> Hi b1airo
21:05:35 <oneswig> only the minutes can answer that!
21:06:15 <oneswig> Ready to talk HPL and hypervisors?
21:06:22 <b1airo> rubbed the lamp... did you win something, or literally find a lamp?
21:06:30 <b1airo> sure
21:06:33 <oneswig> ... aladdin
21:06:35 <oneswig> #topic HPL and hypervisors
21:07:01 <b1airo> ok now that i say yes i need to paste some graphs
21:08:15 <b1airo> this one is SMP linpack:
21:08:18 <b1airo> #link http://pasteboard.co/vBwRsur2Q.png
21:09:04 <oneswig> Taking a while to reach me here...
21:09:08 <zioproto> is slow
21:09:12 <zioproto> ok I have loaded it
21:09:13 <oneswig> aha, got it
21:09:55 <b1airo> this one is HPL with OpenMPI (still single host though): http://pasteboard.co/vByv44w1d.png
21:09:59 <b1airo> #link http://pasteboard.co/vByv44w1d.png
21:10:28 <b1airo> acks to my colleague Phil Chan for generating these
21:10:58 <oneswig> These are hostnames along the x-axis?
21:12:19 <b1airo> the shortname yeah - they are not publicly accessible though so not too worried about security if that was your thinking?
21:12:42 <oneswig> Just wondering what they were - hostnames, parameter variations, ?
21:12:58 <b1airo> so you have guest name and hypervisor name along the x
21:13:23 <b1airo> so basically the story goes like this...
21:13:23 <oneswig> What are the five 120k outliers in the middle of the HPL chart due to?
21:13:31 <b1airo> indeed, what?
21:13:52 <oneswig> I came here tonight for answers!
21:14:33 <b1airo> we only put these nodes into production semi-recently as mellanox had been kind of using them as a RoCE MPI testbed for a while (there's a bunch of fixes and improvements in MOFED 4.0 thanks to this)
21:15:27 <b1airo> anyway, some of our users complained of jobs running much slower on these nodes than on the other 4 year old cluster, so we started digging
21:17:11 <b1airo> just thought i'd share what we have so far as it is a nice sample showing how tight BM-guest can be
21:18:07 <oneswig> VMs appear to have a performance edge for hpl - any theories on why?
21:18:12 <b1airo> the next thing is to figure out what is going on with those outliers - presumably something on the hypervisor is getting in the way at the highest memory utilisation levels, but not sure what yet as they are all the same
21:18:51 <b1airo> oneswig, at this stage i'm guessing it is just the difference in task scheduling between kernel versions
21:19:26 <b1airo> host is Ubuntu with 4.4, guest is CentOS7 with 3.10(?)
21:19:54 <oneswig> Is the NIC involved at all here, or is it memory-copy between VMs?
21:19:56 <b1airo> the guest is pinned etc so host task scheduling is somewhat taken out of the picture there
21:20:08 <oneswig> Or is it just one VM
21:20:27 <b1airo> yeah just single host runs, no IPC
21:21:39 <b1airo> i started looking at whether the guest memory was pinned like i thought it was, but seems the idea i had that pci passthrough forces all guest memory to be pinned was wrong
21:21:41 <oneswig> perf in the hypervisor any help?
21:22:29 <b1airo> at least the only place i've found to look for that is Mlocked memory in /proc/meminfo
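For reference, the check b1airo describes boils down to comparing the system-wide Mlocked counter in /proc/meminfo against each QEMU process's locked memory. A minimal sketch, assuming a Linux hypervisor (matching processes on the name "qemu" is an assumption, and this is not the actual script used here):

    #!/usr/bin/env python3
    # Minimal sketch: report system-wide Mlocked and per-qemu-process VmLck,
    # to see how much guest RAM is actually pinned on the hypervisor.
    import glob

    def meminfo_kb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])   # values are reported in kB
        return 0

    print("Mlocked: %d kB" % meminfo_kb("Mlocked"))

    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            fields = dict(l.split(":", 1) for l in open(path) if ":" in l)
        except IOError:                           # process exited mid-read
            continue
        if "qemu" in fields.get("Name", ""):
            pid = path.split("/")[2]
            print("pid %s VmLck:%s" % (pid, fields.get("VmLck", " 0 kB").rstrip()))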
21:22:47 <oneswig> You can apparently get perf traces of stacks from both hypervisor and guest, which is pretty cool (but I've never done it)
21:23:08 <b1airo> oneswig, yes that's one of the next things to look at now that we have a known set of misbehaving machines
21:23:40 <b1airo> i have never used it yet either, just started reading
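The workflow oneswig mentions uses perf's kvm mode; a hedged sketch of a typical invocation follows (the file paths are placeholders, and the guest's /proc/kallsyms and /proc/modules need to be copied out to the host first):

    #!/usr/bin/env python3
    # Hedged sketch: record host and guest stacks together with perf kvm, then report.
    # Assumes the guest's kallsyms and modules were already copied to the placeholder
    # paths below; check the perf-kvm man page for your kernel version.
    import subprocess

    GUEST_KALLSYMS = "/tmp/guest-kallsyms"
    GUEST_MODULES = "/tmp/guest-modules"
    SYMS = ["--guestkallsyms=" + GUEST_KALLSYMS, "--guestmodules=" + GUEST_MODULES]

    # System-wide sampling while the benchmark runs; the sleep bounds the capture.
    subprocess.call(["perf", "kvm", "--host", "--guest"] + SYMS +
                    ["record", "-a", "-o", "perf.data.kvm", "sleep", "30"])

    subprocess.call(["perf", "kvm", "--host", "--guest"] + SYMS +
                    ["report", "-i", "perf.data.kvm"])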
21:24:02 <oneswig> Have you compared bios settings with racadm?  Might not explain why only the large-block cases are problematic
21:24:36 <b1airo> think i heard about an ansible role or two that might be useful for that ;-)
21:24:50 <oneswig> Funny you should say that!
21:25:14 <b1airo> have not compared them with a fine-tooth comb just yet, but the big "performance" items have been checked
21:25:47 <oneswig> You'll report back if you figure it out I hope!
21:26:01 <b1airo> in particular, all these hosts now have highest pstate forced via kernel command line
21:26:41 <b1airo> so they idle at a higher frequency than when they are actually working (AVX is a bit disconcerting like that)
21:27:16 <oneswig> Because AVX heats the chip up?
21:27:18 <b1airo> yes, hopefully that'll be next week! i feel like an amateur detective at the moment, i'm just bouncing from one performance issue to another
21:27:51 <oneswig> Time for the lowdown on ECMP?
21:28:19 <b1airo> oneswig, yes all the modern Intel CPUs have lower base frequencies for AVX instructions (you won't find that anywhere prominent on the product pages though!)
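A quick way to sanity-check the forced-pstate behaviour being described is to read the standard cpufreq sysfs attributes on each host; a minimal sketch (nothing here is site-specific):

    #!/usr/bin/env python3
    # Minimal sketch: show the boot command line plus current/max frequency and
    # governor for each core, via the standard cpufreq sysfs attributes.
    import glob

    print(open("/proc/cmdline").read().strip())

    for cur in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu = cur.split("/")[5]
        base = cur.rsplit("/", 1)[0]
        cur_ghz = int(open(cur).read()) / 1e6
        max_ghz = int(open(base + "/scaling_max_freq").read()) / 1e6
        gov = open(base + "/scaling_governor").read().strip()
        print("%s: %.2f GHz (max %.2f GHz, governor %s)" % (cpu, cur_ghz, max_ghz, gov))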
21:28:39 <b1airo> yes ECMP
21:28:39 <oneswig> #topic RoCE over layer-3 ECMP fabrics
21:29:00 <zioproto> what does RoCE stand for?
21:29:02 <oneswig> So this is layer-2 RoCE over VXLAN over ECMP, right?
21:29:09 <oneswig> RDMA-over-Converged-Ethernet
21:29:21 <zioproto> memory over ethernet ?
21:29:24 <oneswig> also known as InfiniBand-on-Ethernet
21:29:27 <oneswig> correct
21:29:48 <oneswig> It's low latency networking
21:29:58 <b1airo> a couple of weeks ago we started our big datacentre migration, in the process we replaced most of the switching in our research compute infrastructure
21:30:48 <b1airo> we were previously mostly using Mellanox SX1036 40/56GbE in a layer-2 configuration
21:32:05 <oneswig> b1airo: what caused you to change?
21:33:05 <b1airo> sorry burnt toast emergency
21:33:07 <b1airo> back now
21:33:26 <b1airo> the size of the fabric we could build was limited to 2x spine switches
21:33:44 <oneswig> Due to MLAG?
21:33:46 <b1airo> using either MLAG or a multi-STP config
21:34:20 <b1airo> and the MLAG had a habit of being a bit fragile
21:34:39 <zioproto> so now you have a L3 leaf/spine architecture ?
21:34:44 <b1airo> so all our new gear is Mellanox Spectrum (100/50/25GbE)
21:34:46 <zioproto> but with more than 2 spines ?
21:34:56 <b1airo> with switches all running Cumulus
21:35:43 <b1airo> zioproto, actually not yet - but yes we could have more spines or insert another aggregation layer
21:36:11 <jonmills_nasa> We're presently running with MLAG, all Cumulus, 2 spines and 10 leaf switches.  So far so good.
21:36:26 <b1airo> the switches speak BGP to each other and VXLAN encap/decap traffic to/from the edge ports
21:36:41 <b1airo> (in hardware)
21:36:41 <oneswig> We are old school - 2 core, 9 edge, MLNX-OS + NEO
21:36:54 <jonmills_nasa> Is this Cumulus 3.2.0?
21:37:03 <zioproto> b1airo, are you using puppet or ansible to configure the cumulus switches ?
21:37:03 <b1airo> jonmills_nasa, yep
21:37:10 <b1airo> zioproto, ansible
21:37:27 <b1airo> honestly i don't think puppet really belongs with switching
21:37:53 <jonmills_nasa> I sorta thought that hardware accelerated VTEP wasn't supported in Cumulus until 3.3 release or even 3.4....
21:38:13 <oneswig> b1airo: how are the VXLAN VNIs configured?  Is there a physical-aware ML2 driver behind that?
21:38:26 <b1airo> anyway, it was a huge effort to get it all built and configured - we actually started moving hosts before the network configs were completed
21:39:32 <b1airo> oneswig, no at this stage we haven't added any features just done the migration - so the automation has port profiles setup which configure the VNIs appropriately for the required host-facing VLANs
21:40:12 <oneswig> Ah OK, so the host's vlans are mapped to ranges of vnis in the vtep?
21:40:29 <b1airo> jonmills_nasa, this is using Cumulus LNV to coordinate the VNIs across switches, switching definitely happening in hardware
21:40:38 <b1airo> oneswig, yep
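To illustrate the sort of mapping b1airo describes (the offset and the interface stanza below are invented for illustration, not the actual scheme), the automation just needs a deterministic VLAN-to-VNI rule rendered into per-switch VXLAN interface definitions:

    #!/usr/bin/env python3
    # Hypothetical illustration: derive a VNI from each host-facing VLAN with a
    # fixed offset and emit an ifupdown2-style VXLAN stanza per VLAN.
    HOST_FACING_VLANS = [100, 200, 300]    # made-up VLANs, e.g. storage/provisioning/tenant
    VNI_OFFSET = 10000                     # made-up offset

    for vlan in HOST_FACING_VLANS:
        vni = VNI_OFFSET + vlan
        print("auto vni%d" % vni)
        print("iface vni%d" % vni)
        print("    vxlan-id %d" % vni)
        print("    bridge-access %d" % vlan)
        print()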
21:41:02 <b1airo> once we had all the gear migrated to the new DC we started properly testing the new network
21:41:15 <jonmills_nasa> b1airo, this is great stuff, this is my long-term goal -- just didn't think I could get there on first pass, so we started with all L2
21:41:49 <b1airo> jonmills_nasa, cool - glad to have some company! Cumulus is very nice to work with I have to say
21:41:58 <jonmills_nasa> I am loving it
21:42:11 <b1airo> we used Lustre LNET self test
21:42:26 <jonmills_nasa> I generated 100% of my L2 edge port config using a very small shell script with NCLU commands
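That kind of generation is straightforward; a hedged Python sketch of the same idea (the port range, VLAN and MTU are assumptions, not jonmills_nasa's actual script, and the NCLU argument forms should be checked against your Cumulus release):

    #!/usr/bin/env python3
    # Hypothetical sketch: emit NCLU commands for a block of L2 access ports.
    EDGE_PORTS = range(1, 33)     # swp1..swp32 - made-up range
    ACCESS_VLAN = 100             # made-up VLAN

    for p in EDGE_PORTS:
        print("net add interface swp%d bridge access %d" % (p, ACCESS_VLAN))
        print("net add interface swp%d mtu 9216" % p)
    print("net commit")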
21:42:39 <b1airo> we run Lustre over the o2ib driver as usual but on RoCEv1 instead of IB
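For anyone who has not driven LNET selftest before, it is orchestrated with the lst utility; a rough sketch of a bulk-write test between two groups of nodes (the NIDs, group names and sizes are placeholders, following the standard lst workflow rather than the exact test run here):

    #!/usr/bin/env python3
    # Rough sketch of an LNET selftest run via the lst utility; placeholder NIDs.
    import os, subprocess, time

    os.environ["LST_SESSION"] = str(os.getpid())

    def lst(*args):
        subprocess.check_call(("lst",) + args)

    lst("new_session", "roce_check")
    lst("add_group", "clients", "192.168.1.[1-4]@o2ib")
    lst("add_group", "servers", "192.168.1.[5-8]@o2ib")
    lst("add_batch", "bulk_rw")
    lst("add_test", "--batch", "bulk_rw", "--from", "clients", "--to", "servers",
        "brw", "write", "size=1M")
    lst("run", "bulk_rw")

    stats = subprocess.Popen(["lst", "stat", "servers"])   # stream counters
    time.sleep(30)
    stats.terminate()
    lst("end_session")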
21:42:48 <martial> (so far very cool)
21:42:51 <zioproto> b1airo, we also have Cumulus at SWITCH
21:43:15 <zioproto> b1airo, but we are using puppet and OSPF unnumbered for the fabric
21:43:49 <b1airo> we firstly had to ensure the network could handle RoCEv1 at all, which meant we needed flow-control everywhere (RoCEv1 does not like lossy networks)
21:44:32 <b1airo> the first hurdle there was that we couldn't just turn on global pause like we had previously on the MLNX-OS L2 fabric
21:44:35 <oneswig> I hadn't realised that SR-IOV can be lossy - flow control ends at the phy on Mellanox NICs
21:45:23 <b1airo> oneswig, actually we should go into that more in a moment as they haven't told us anything particular about that (but of course I am interested to hear it!)
21:46:24 <b1airo> the Cumulus folks were leery about enabling pause on the inter-switch links because they were worried BGP peer exchange messages might get held up long enough to impact the fabric topology
21:47:07 <b1airo> so we had to have priority flow-control configured inter-switch with switch-switch traffic at the highest CoS
21:48:04 <b1airo> then we start pushing large streams around and immediately noticed only a single link utilised up and down when crossing spines
21:49:22 <b1airo> turns out that the VXLAN encap happens before the packet is routed to the next hop, so the ECMP hash is done on the VXLAN'd packet, and for RoCEv1 at least that 5-tuple looks exactly the same for every packet on the same VNI
21:50:21 <zioproto> blair, that is a typical hashing problem
21:50:43 <zioproto> most hashing implementations expect UDP or TCP
21:51:04 <zioproto> when you have other protocols the hashing algorithm is dumb and sends everything down the same link
21:51:16 <b1airo> hence, one out of four 100G paths used and crappy performance - we actually had a simple cluster test we used to simulate what some user workloads do, we call it "crazy cat lady" - it basically just cats a bunch of large files into /dev/null in a parallel job
21:51:52 <priteau> love that name!
21:51:54 <oneswig> why is the destination IP of the VXLAN frame not different for flows to different machines?
21:52:05 <b1airo> zioproto, it is particularly bad with flow-control, as the link gets highly congested with pause frames as well
21:53:22 <b1airo> oneswig, this is for flows from any host in rack A to any host in rack B on the storage VLAN - that's a single VNI for each ToR
21:53:36 <oneswig> ahhh, got it, thanks
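The failure mode is easy to demonstrate with a toy model (purely illustrative, nothing to do with the Spectrum ASIC's real hash): any deterministic hash over the outer 5-tuple sends every packet between the same pair of VTEPs down the same uplink when that tuple carries no per-flow entropy.

    #!/usr/bin/env python3
    # Toy illustration of ECMP path selection on the outer (post-encap) 5-tuple.
    # With RoCEv1 inside VXLAN the outer tuple is identical for all traffic between
    # two VTEPs on a VNI, so every "flow" lands on the same uplink.
    import zlib

    UPLINKS = 4

    def pick_uplink(src_ip, dst_ip, proto, sport, dport):
        key = "%s|%s|%s|%d|%d" % (src_ip, dst_ip, proto, sport, dport)
        return zlib.crc32(key.encode()) % UPLINKS

    # Ten notional flows, all presenting the same outer tuple: same uplink every time.
    for flow in range(10):
        print(flow, pick_uplink("10.0.1.1", "10.0.2.1", "udp", 4789, 4789))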
21:53:48 <oneswig> what's the fix?
21:54:13 <b1airo> anyway, the Cumulus support guys spent a couple of days in video conference with us working through testing and workarounds
21:54:15 <oneswig> Can you hash on the decapsulated packet?
21:54:58 <martial> (how are we doing on time versus agenda?)
21:55:24 <oneswig> martial: not very well...
21:55:26 <b1airo> the fix was to run a little python script that they hacked up which altered some base port settings in the ASIC and introduced some randomness to the hash
21:56:03 <oneswig> b1airo: is this going to be your gift to Cumulus Linux?
21:56:35 <b1airo> their engineering wrote it, but i'll claim i provided the inspiration
21:56:56 <b1airo> i believe this is going into 3.2.1 permanently
21:56:57 <oneswig> Is there a long-term fix in the works, I assume?
21:57:03 <oneswig> OK
21:57:26 <oneswig> We should cover some other matters before the clock strikes...
21:57:30 <b1airo> looks like i monopolised somewhat today (sorry!)
21:57:43 <trandles> blame the burnt toast ;)
21:57:46 <oneswig> #topic Boston Cloud Declaration
21:57:56 <martial> May 11 & 12
21:58:01 <oneswig> Hooray - the date is moved to the same week as the summit :-)
21:58:17 <oneswig> #topic Repo for WG docs
21:58:27 <oneswig> #link http://git.openstack.org/cgit/openstack/scientific-wg/
21:58:33 <zioproto> guys I have to go
21:58:34 <priteau> oneswig: that's good news. I assume the wiki page will be updated? https://wiki.openstack.org/wiki/Boston-Cloud-Congress
21:58:34 <oneswig> we've got somewhere where we can write things now
21:58:48 <zioproto> bye :) sorry for leaving earlier
21:58:49 <zioproto> ciao
21:58:49 <oneswig> priteau: I think so - should have been already
21:58:53 <oneswig> thanks zioproto
21:59:05 <oneswig> #topic SC2017 workshops
21:59:20 <oneswig> I filed for two workshops at SC2017
21:59:27 <oneswig> one on infrastructure and one on platforms/apps
21:59:41 <oneswig> #link as per discussion at https://etherpad.openstack.org/p/SC17WorkshopWorksheet
21:59:55 <b1airo> oneswig, nice one thanks for getting that done. i saw the notification email come by
21:59:59 <oneswig> There's still a week (apparently) before the submissions ship, any more volunteers please?
22:00:13 <oneswig> Thanks b1airo, no problem
22:00:24 <oneswig> Some improvement on the wording would help!
22:00:32 <b1airo> oneswig, we haven't pinged the MLs about that have we?
22:00:41 <oneswig> If anyone else will be at SC and can contribute, please put your names in the etherpad
22:01:05 <rbudden> oneswig: i’ll read it over and can likely volunteer to help
22:01:08 <oneswig> #action oneswig to put the workshops onto the ML
22:01:09 <rbudden> i’m always at SC
22:01:13 <oneswig> thanks rbudden
22:01:17 <b1airo> rbudden, that'd be great!
22:01:22 <oneswig> we are out of time, alas
22:01:33 <jonmills_nasa> I'm honestly hoping to avoid SC17.  I'll be at Boston Summit tho
22:01:35 <rbudden> and as always we have the PSC booth for other SC content
22:01:36 <oneswig> Thanks all and thanks b1airo, I like my toast to be high roast
22:01:51 <b1airo> you should all prioritise the Sydney Summit!
22:02:02 <rbudden> b1airo: i’m trying to pull strings ;)
22:02:05 <oneswig> That's my plan b1airo!
22:02:13 <trandles> Sydney's a go for me
22:02:13 <oneswig> time to close, alas
22:02:16 <jonmills_nasa> Gov would never pay for me to go there
22:02:20 <rbudden> the wife has always wanted to visit as well!
22:02:32 <b1airo> awesome, i should start looking for venues!
22:02:35 <oneswig> jonmills_nasa: not since they've fingered you as @RogueNASA?
22:02:45 <b1airo> lol
22:02:47 <jonmills_nasa> <cough>
22:02:54 <oneswig> really gotta close now :-)
22:02:55 <oneswig> #endmeeting