21:00:27 <oneswig> #startmeeting scientific-sig
21:00:28 <openstack> Meeting started Tue Jan 22 21:00:27 2019 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:32 <openstack> The meeting name has been set to 'scientific_sig'
21:00:38 <oneswig> Let us get the show on the road!
21:00:58 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_January_22nd_2019
21:01:58 <oneswig> #chair martial
21:01:59 <openstack> Current chairs: martial oneswig
21:02:39 <oneswig> While we are gathering, I'll do the reminder - CFP ends tomorrow, midnight pacific time!
21:03:00 <oneswig> #link CFP link for Open Infra Denver https://www.openstack.org/summit/denver-2019/call-for-presentations/
21:03:42 <oneswig> janders: you ready?
21:03:48 <janders> I am
21:04:02 <oneswig> Excellent, let's start with that.
21:04:16 <oneswig> #topic Long-tail latency of SR-IOV Infiniband
21:04:24 <oneswig> So what's been going on?
21:04:31 <janders> Here are some slides I prepared - these cover IB microbenchmarks across bare-metal and SRIOV/IB
21:04:41 <janders> https://docs.google.com/presentation/d/1k-nNQpuPpMo6v__9OGKn1PIkDiF2fxihrLcJSh_7VBc/edit?usp=sharing
21:04:57 <oneswig> Excellent, thanks
21:05:17 <b1air> o/ hello! (slight issue with NickServ here...)
21:05:27 <janders> As per Slide 2, we were aiming to get a better picture of the low-level causes of the performance disparity we were seeing (or mostly hearing about)
21:05:32 <oneswig> Hi b1air, you made it
21:05:36 <oneswig> #chair b1air
21:05:36 <openstack> Current chairs: b1air martial oneswig
21:05:42 <b1air> tada!
21:05:54 <janders> We were also thinking this will help decide what's best run on bare-metal and what can run happily in an (HPC) VM
21:05:59 <martial> bad nick serve
21:06:02 <janders> hey Blair!
21:06:51 <martial> welcome Blair :)
21:07:09 <janders> Slide 3 has the details of the lab setup. We tried to stay as generic as possible so the numbers aren't super-optimised, however we've done all the usual reasonable things to do for benchmarking (performance mode, CPU passthru)
21:07:39 <janders> A similar "generic" approach was applied to the microbenchmarks - we ran with the default parameters, which are captured in Slide 4
21:08:32 <janders> On that - the results once again prove what many of us here have seen - virtualised RDMA (or IB in this case) can match the bandwidth of bare-metal RDMA
21:08:38 <martial> I asked Rion to join as well (since it is his scripts for Kubespray)
21:09:11 <janders> We're at around 99% efficiency bandwidth-wise
21:09:13 <janders> With latency, however, things are different
21:09:36 <oneswig> This is just excellent. :-)
21:09:51 <janders> Virtualisation seems to add a fair bit there, about .4 us off from bare-metal
21:10:09 <janders> (or ~24%)
21:10:11 <b1airo> ok, i've managed to log in twice it seems :-)
21:10:25 <martial> b1airo: cool :)
21:10:33 <janders> the more the merrier :)
21:10:35 <oneswig> Does your VM have NUMA passthrough?
21:11:02 <oneswig> #chair b1airo
21:11:03 <openstack> Current chairs: b1air b1airo martial oneswig
21:11:04 <janders> I haven't explicitly tweaked NUMA, however the CPUs are in passthru mode
21:11:06 <b1airo> thanks oneswig
21:11:20 <b1airo> hugepages...?
21:11:39 <janders> I'm happy to check later in the meeting and report back - just run "numactl --show"?
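
For reference, the kind of in-guest NUMA and hugepage checks being discussed here might look like the following. This is only an illustrative sketch (the exact commands used in the study are not in the slides), assuming a Linux guest with the numactl package installed:

    # show the NUMA policy and the NUMA nodes visible inside the guest
    numactl --show
    numactl --hardware

    # multiple sockets/NUMA nodes reported by lscpu also suggest the topology
    # is being passed through to the guest
    lscpu | grep -i -e 'socket' -e 'numa'

    # check whether hugepages are configured in the guest (or on the host)
    grep -i hugepages /proc/meminfo
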
21:12:11 <b1airo> i note the hardware is relatively old now, wonder how that impacts these results as compared to say Skylake with CX-4
21:12:19 <oneswig> That ought to do it, or if you have multiple sockets in lscpu, I think that also means it's on
21:12:36 <janders> I was considering applying NFV-style tuning, but thought this would make it less generic
21:12:48 <oneswig> janders: what kind of things?
21:12:53 <janders> good point on hugepages though - just enabling that shouldn't have any negative side effects
21:13:11 <janders> an NFV-style approach would involve CPU pinning
21:13:33 <janders> however my worry there is we'll likely improve IB microbenchmark numbers at the expense of Linpack
21:13:37 <b1airo> this could be a nice start to some work to come up with a standard set of microbenchmarks we could use...
21:13:54 <martial> oh here is Rion
21:14:13 <oneswig> Hello Rion, welcome
21:14:14 <janders> I hope to get some newer kit soon and I'm intending to retest on Skylake/CX5 and maybe newer
21:14:33 <b1airo> surely if you were going to be running MPI code you'd want pinning though janders ?
21:14:39 <deardooley> Hi all
21:14:54 * b1airo waves to Rion
21:15:46 <oneswig> Your page 6, plotting sigmas against one another. Are you able to plot histograms instead? I'd be interested to see the two bell curves superimposed.
21:16:17 <b1airo> it's an interesting consideration though - how much tuning is reasonable and what are the implications...
21:17:02 <janders> oneswig: thank you for the suggestion - I will include that when I work on this again soon (with more hardware)
21:17:15 <oneswig> One potential microbenchmark for capturing the effect of long-tail jitter on a parallel job would be to time multi-node barriers
21:17:36 <oneswig> I'm not sure there's an ib_* benchmark for that but there might be something in IMB.
21:17:59 <janders> The second interesting observation is the impact of virtualisation on the standard deviation of the results
21:18:29 <janders> interestingly this is seen across the board. In these slides I'm mostly focusing on ib_write_* but I put Linpack there for reference too
21:18:46 <janders> whether it's Linpack, bandwidth or latency, the bare-metal nodes are heaps more consistent
21:18:59 <oneswig> Is that a single-node Linpack configuration?
21:19:05 <janders> in absolute numbers the fluctuation isn't huge, but in relative numbers it's an order of magnitude
21:19:09 <janders> yes, it's single node
21:20:05 <janders> I considered multi-node but was getting mixed messages from different people about the potential impact of interconnect virtualisation on the results, so thought I'd better keep things simple
21:20:22 <janders> this way we have the overheads measured separately
21:20:59 <oneswig> I'd love to know more about the root causes (wouldn't we all)
21:21:03 <janders> I think the variability could likely be addressed with NFV-style tuning
21:21:14 <janders> at least to some degree
21:21:39 <janders> with the latency impact, I think the core of it might have to do with the way IB virtualisation is done, however I've never heard the official explanation
21:22:02 <janders> I feel it likely gets better for larger message sizes
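
As a reference point, the perftest microbenchmarks being discussed can sweep message sizes rather than using a single default size. An illustrative invocation (not the exact parameters from Slide 4, and the server hostname is made up) would be:

    # server side - run on the bare-metal node or the SR-IOV VM
    ib_write_lat -a

    # client side, pointing at the server's address
    ib_write_lat -a ib-server

    # the same -a sweep works for the bandwidth test
    ib_write_bw -a ib-server
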
21:22:59 <b1airo> seems to me that once you start doing cpu and/or numa pinning and/or static hugepage backing, then you really need to commit to isolating specific hardware for that job and you create just a few instance-types that fit together on your specific hardware nicely. then you probably also have some other nodes for general purpose. so perhaps there are at least three interesting scenarios to benchmark: 1) VM tuned for general purpose hosts; 2) VM tuned for high-performance dedicated hosts; 3) bare-metal
21:23:25 <janders_> sorry got kicked out :(
21:23:41 <janders_> 1 2 3 testing
21:23:50 <oneswig> Did you catch b1airo's message with 3 cases?
21:24:11 <janders_> unfortunately not, I lost everything past "08:22] <janders> given the bandwidth numbers are quite good"
21:24:49 <janders_> b1airo: can you re-post please?
21:24:56 <janders_> sorry about that
21:25:26 <martial> helping b1airo: 16:23:00 <b1airo> seems to me that once you start doing cpu and/or numa pinning and/or static hugepage backing, then you really need to commit to isolating specific hardware for that job and you create just a few instance-types that fit together on your specific hardware nicely. then you probably also have some other nodes for general purpose. so perhaps there are at least three interesting scenarios to benchmark: 1) VM tuned for general purpose hosts; 2) VM tuned for high-performance dedicated hosts; 3) bare-metal
21:25:55 <janders_> indeed
21:26:27 <janders_> have you guys had a chance to benchmark CPU-pinned configurations in similar ways?
21:26:33 <oneswig> This test might be interesting, across multiple nodes: IMB-MPI1 Barrier
21:26:44 <janders_> I wonder if pinning helps with consistency
21:26:57 <janders_> and what would be the impact of the pinned configuration on local Linpack?
21:27:05 <oneswig> janders_: yes, I did some stuff using OpenFOAM (with paravirtualised networking)
21:27:16 <b1airo> thanks martial - i was in the bathroom briefly
21:27:44 <janders_> (I suppose we would likely lose some cores for the host OS - and I wonder to what degree the performance improvement on the pinned cores would compensate for that)
21:28:32 <oneswig> You can see the impact of pinning in simple test cases like iperf - I found it didn't increase TCP performance much but it certainly did help with jitter
21:28:40 <b1airo> yes, i believe pinning does help with consistency
21:28:47 <janders_> I think the config proposed by Blair is a good way forward - my worry is whether the users will know if they need the max_cores configuration (20-core VM on a 20-core node) or the NFV configuration
21:29:29 <b1airo> i think the question of reserved host cores is another interesting one for exploration...
21:29:51 <janders_> I tried Linpack in 18-core and 20-core VMs in the past and 20-core was still faster
21:30:15 <janders_> despite the potential for scheduling issues between the host and the guest
21:30:18 <b1airo> i would contend that if most of your network traffic is happening via the SR-IOV interface then reserving host cores is unnecessary
21:30:56 <oneswig> b1airo: makes sense unless they are working for the libvirt storage too
21:31:20 <oneswig> OK, we should move on. Any more questions for janders?
21:31:58 <oneswig> janders_: one final thought - did you disable hyperthreading?
21:31:59 <b1airo> ah yes, good point oneswig - i guess i was thinking of storage via SR-IOV too, i.e., parallel filesystem
21:32:05 <janders_> yes I did
21:32:20 <janders_> no HT
21:32:43 <oneswig> We found that makes a significant difference.
21:32:57 <janders_> I typically work with node-local SSD/NVMe for scratch
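
For reference, the multi-node barrier timing oneswig suggested above could be run with the Intel MPI Benchmarks along these lines. This is a sketch only, assuming Open MPI and an IMB-MPI1 binary built on the nodes; the hostfile name is illustrative:

    # one rank per node across two nodes; IMB reports min/avg/max barrier time,
    # which gives a view of long-tail jitter on bare-metal vs SR-IOV VMs
    mpirun -np 2 --hostfile ./hosts ./IMB-MPI1 Barrier
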
21:33:04 <b1airo> to Linpack oneswig ?
21:33:08 <janders_> and a parallel fs indeed mounted via the SRIOV interface
21:33:23 <oneswig> b1airo: haven't tried that. Other benchmarks.
21:33:31 <janders_> on the local scratch it would be interesting to look at the impact of qcow2 vs lvm
21:34:02 <janders_> lvm helps with IOPS a lot, but in a scenario like the one we're discussing where there's little CPU for the host OS, this might be even more useful
21:34:52 <janders_> so - b1airo - do you think in the configuration you proposed, would we need two "performance VM" flavors?
21:35:02 <janders_> max_cpu and low_latency (pinned)?
21:35:11 <b1airo> janders_: we could talk more on this offline perhaps, i'd be keen to try and get some comparable benchmarks together as well
21:36:14 <oneswig> janders_: you need Ironic deploy templates... let's revisit that
21:36:19 <janders_> OK! being mindful of time, let's move on to the next topic. Thank you for your attention and great insights!
21:36:19 <oneswig> OK time is pressing
21:36:31 <oneswig> #topic Terraform and Kubespray
21:36:42 <oneswig> deardooley: martial: take it away!
21:36:51 <martial> so I invited Rion to this conversation
21:37:09 <martial> but the idea is simple, I needed to deploy a small K8s cluster for testing on top of OpenStack
21:37:21 <martial> internally we have used Kubespray to do so
21:37:44 <martial> to deploy a Kubernetes cluster (one master and two minion nodes) in a pre-configured OpenStack project titled nist-ace, using Kubespray and Terraform.
21:38:14 <martial> the default Kubespray install requires the creation of pre-configured VMs
21:38:17 <martial> #link https://github.com/kubernetes-sigs/kubespray/blob/master/docs/openstack.md
21:38:44 <martial> Terraform has the advantage of pre-configuring the OpenStack project for running the ansible playbook, given information about networking, users, and the OpenStack project itself. Then Terraform handles the VM configuration and creation.
21:38:51 <oneswig> Given Kubespray is Ansible, why the need for preconfiguration?
21:39:02 <martial> #link https://github.com/kubernetes-sigs/kubespray/blob/master/contrib/terraform/openstack/README.md
21:39:34 <martial> that was also my question, but the ansible script did not create the OpenStack instances
21:39:37 <martial> terraform will
21:40:00 <deardooley> @oneswig it's a pretty common pattern. terraform is much faster at pure provisioning than ansible, but config management is not its strong suit. ansible is a good complement once the infrastructure is in place.
21:40:15 <martial> you obviously need the OpenStack project's RC file
21:41:20 <martial> once you have this sourced you are able to create the terraform configuration file to include master/minion numbers and names
21:41:37 <martial> images, flavors, IP pools
21:42:36 <oneswig> How do you find Terraform compares to expressing the equivalent in Heat?
21:43:34 <martial> given that kubespray has its own ways of spawning on top of OpenStack, I did not try heat for this purpose
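
To make the workflow above concrete, the Terraform-then-Ansible sequence from the linked README looks roughly like the sketch below. File and variable names (the RC file, cluster.tfvars, the inventory path) follow that README and may differ between Kubespray releases:

    # authenticate against the OpenStack project (RC file name is illustrative)
    source nist-ace-openrc.sh

    # copy the sample inventory shipped with Kubespray, then set master/minion
    # counts and names, images, flavors and IP pools in the tfvars file
    cp -LRp contrib/terraform/openstack/sample-inventory inventory/nist-ace
    $EDITOR inventory/nist-ace/cluster.tfvars

    # Terraform creates the networks, security groups and VMs
    cd inventory/nist-ace
    terraform init ../../contrib/terraform/openstack
    terraform apply -var-file=cluster.tfvars ../../contrib/terraform/openstack

    # then Kubespray's Ansible playbook configures Kubernetes on the new VMs
    cd ../..
    ansible-playbook --become -i inventory/nist-ace/hosts cluster.yml
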
21:44:00 <janders_> in the context of a Private Cloud, would it make sense to disable port-security so that we don't need to worry about address pairs?
21:44:16 <janders_> or do you see value in having this extra layer of protection?
21:44:32 <martial> supposedly terraform creates the private network needed for the k8s communication
21:44:37 <martial> on top of the OpenStack project's own network
21:45:36 <martial> once the configuration is done, calling terraform init
21:45:46 <martial> followed by terraform apply
21:46:06 <oneswig> What do you configure for Kubernetes networking? Does it use Weave, Calico, ?
21:46:18 <deardooley> within the context of kubespray, the terraform provisioner will handle all security group creation and management for you as part of the process. You will need to implement any further security at the edge of your k8s apiservers on your own.
21:47:53 <martial> (with a little extra obviously) creates the OpenStack resources
21:48:28 <martial> I was checking in my configuration file and I do not see the k8s networking set
21:48:45 <deardooley> it's pluggable. defaults to calico. there are some tweaks you need to make in your specific inventory depending on the existence of different openstack components in your particular cloud.
21:48:52 <oneswig> Seems like a lot of interesting things are happening around blending the layers of abstraction, and interfacing with OpenStack to provide resources (storage, networking, etc) for Kubernetes - e.g. https://github.com/kubernetes/cloud-provider-openstack - does Kubespray support enabling that kind of stuff?
21:48:53 <martial> the default seems to be Calico
21:49:25 <oneswig> ... would be cool if it did
21:49:35 <deardooley> for example, to plug into external loadbalancers, dynamically configure upstream dns, use cinder as a persistent volume provisioner, etc.
21:49:59 <oneswig> deardooley: that kind of stuff
21:50:35 <deardooley> yeah. it's all pluggable with the usual caveats.
21:51:20 <martial> (I am kind of tempting Rion here to think about a presentation at the summit on the topic of Kubespray deployment on top of OpenStack)
21:51:39 <oneswig> Have you been using those cloud-provider integrations and is it working?
21:53:00 <martial> I have not
21:53:14 <deardooley> I use them on a couple different openstack installs.
21:54:13 <deardooley> they work, but there are ways to get the job done, and there are ways to get the job done and keep all your hair and staff intact
21:54:35 <oneswig> deardooley: sounds familiar :-)
21:55:43 <deardooley> it's likely anyone on this channel could pull it off in a day or two by pinging this list, but once you do, you'll appreciate the "Hey Google, create a 5 node Kubernetes cluster" demo in a whole new way.
21:56:26 <deardooley> that being said, once you get your config down, it really is push-button to spin up another dozen in the same environment.
21:56:44 <oneswig> deardooley: in your experience, what does Kubespray do wrong / badly? Does it have shortcomings?
21:56:46 <martial> after Terraform pre-configures everything it is simply a matter of running the ansible playbook
21:57:11 <b1airo> certainly sounds like it could be an interesting presentation topic
21:57:27 <deardooley> it can build a production-scale cluster for you. it can't do much to help you manage it.
21:57:45 <b1airo> i think i need a diagram of the openstack - kubespray - terraform interactions
21:57:53 <martial> see Rion "it could be an interesting presentation topic" :)
21:58:04 <martial> (wink)
21:58:38 <deardooley> as long as you treat nodes idempotently and get sufficient master quorum defined up front, it's not a huge issue. when something goes weird, just delete it, remove the node, and rerun the cluster.yml playbook with a new vm.
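
The delete-and-rerun pattern deardooley describes might look something like this with Kubespray's stock playbooks. This is a sketch only; the remove-node.yml playbook and its node extra-var follow the upstream docs and may vary by release, and the inventory path and node name are illustrative:

    # drop the misbehaving node from the Kubernetes cluster
    ansible-playbook --become -i inventory/nist-ace/hosts remove-node.yml -e node=k8s-node-2

    # replace the VM (e.g. terraform apply again), make sure it is back in the
    # inventory, then re-run the main playbook to converge the cluster
    ansible-playbook --become -i inventory/nist-ace/hosts cluster.yml
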
21:58:44 <oneswig> We are nearly at time - final thoughts?
21:58:52 <b1airo> one other motivation-type question? why not Magnum?
21:59:37 <martial> in my particular case, the person I was working with wanted K8s to test containers ... truth is, given how many containers they want, docker swarm might be enough
21:59:46 <deardooley> flexibility, portability across cloud providers, security across multiple layers of the infrastructure and application stack, logging, monitoring, etc...
21:59:58 <deardooley> +1
22:00:12 <oneswig> OK, we are at time
22:00:22 <martial> and I will second Rion's comment of "remove" and "rerun", that was very useful for testing things
22:00:29 <oneswig> Thanks deardooley martial - interesting to hear about your work.
22:00:38 <janders_> great work, thanks guys!
22:00:47 <oneswig> Until next time
22:00:50 <oneswig> #endmeeting