21:00:35 <oneswig> #startmeeting scientific-sig
21:00:36 <openstack> Meeting started Tue Jan  9 21:00:35 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:39 <openstack> The meeting name has been set to 'scientific_sig'
21:00:49 <oneswig> Hello hello hello
21:01:07 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_January_9th_2018
21:02:05 <oneswig> Andrey tells me he's stuck in a taxi somewhere in Delhi, hoping to join us shortly
21:03:22 <oneswig> I had a discussion with some colleagues from Cambridge University earlier.
21:03:39 <oneswig> They have been benchmarking the effect of the spectre/meltdown fixes
21:04:13 <oneswig> On a Lustre router, apparently there is a 40% hit!
21:04:59 <oneswig> Are there other people testing their platforms?
21:05:07 <trandles> filesystems/IO is bad post-patch
21:05:13 <trandles> too much context switching :(
21:05:23 <oneswig> Hi Tim, so it seems - the worst case
21:05:33 <bollig> we saw 30% IO loss on VMs, 2% cpu loss. one sec and I’ll paste in my query earlier today on #scientific-wg
21:05:44 <trandles> Hello Stig. David Daniel says hello.  Had a meeting with him this morning. :)
21:05:44 <oneswig> Please do
21:05:45 <rbudden> hello
21:06:05 <oneswig> DDD - fantastic!  You've made my day
21:06:19 <belmoreira> we started testing as well. Our compute workloads don't seem affected in the initial benchmarks. I will have more info next week
21:07:04 <rbudden> we’ve started baremetal testing, but I don’t have any results at the moment
21:07:20 <oneswig> There's been some discussion around whether there is an impact for RDMA, and in which modes of usage
21:08:14 <jmlowe> Hi everybody
21:08:23 <oneswig> #link Here's an intriguing coincidence from a couple of months ago https://www.fool.com/investing/2017/12/19/intels-ceo-just-sold-a-lot-of-stock.aspx
21:08:41 <martial_> Hi Mike, Bob, Stig
21:08:48 <oneswig> Hi jmlowe rbudden belmoreira et al
21:08:52 <oneswig> Hey martial_
21:08:55 <oneswig> #chair martial_
21:08:56 <openstack> Current chairs: martial_ oneswig
21:09:51 <oneswig> So, it does appear that the worst case scenarios can be readily hit.
21:09:57 <belmoreira> humm... not a coincidence :)
21:10:09 <bollig> We have Broadwell arch (compute: E5-2680v4, storage: E5-2630v4), Ceph Luminous, OpenStack Newton, QEMU+KVM virtualization. First, after patching our hypervisors we saw a 2% CPU perf loss in the HPL benchmark running inside an unpatched CentOS 6.5 VM, plus a 30% I/O perf loss in the FIO benchmark inside the same VM. No further loss from patching the VMs. Finally, we patched the storage servers and again saw no further degradation. Better than I expected.
21:10:10 <bollig> I’m curious if it plays out that way for others. We have NVMe in our storage, which might be amortizing the cost of I/O operations at the storage servers.
21:11:22 <bollig> The numbers were very close to redhat’s predictions
21:12:11 <oneswig> bollig: I'd guess, if it's a performance penalty incurred on every context switch, then it'll be more painful for NVMe than for other devices simply because they achieve more context switches, but the penalty is constant for each.  Perhaps
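For anyone wanting to run the same kind of before/after comparison bollig describes, a small wrapper around fio is enough. This is only a minimal sketch; the job parameters are illustrative, not the configuration anyone in the channel actually used:

    # Minimal sketch: run a short random-read fio job and report IOPS, so the
    # same job can be compared before and after the Spectre/Meltdown mitigations.
    # Assumes fio is installed in the guest; parameters are illustrative only.
    import json
    import subprocess

    def run_fio(runtime_s=30):
        cmd = [
            "fio", "--name=mitigation-check",
            "--ioengine=libaio", "--direct=1",
            "--rw=randread", "--bs=4k", "--iodepth=32",
            "--size=1G", "--time_based", f"--runtime={runtime_s}",
            "--output-format=json",
        ]
        out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
        job = json.loads(out)["jobs"][0]
        return job["read"]["iops"]

    if __name__ == "__main__":
        print(f"randread IOPS: {run_fio():.0f}")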
21:13:48 <oneswig> What was the extent to which an unpatched guest could read data from the hypervisor?
21:15:09 <oneswig> This is one of those rare circumstances where bare metal looks like the security-conscious option
21:15:41 <martial_> (too used to slack, I want to +1 Stig's last comment)
21:16:10 <oneswig> ha, slack is too easy!
21:17:40 <oneswig> OK, well interesting to hear people's experiences.  I'm sure it's just early days.
21:17:47 <jmlowe> I really want to know the definitive answer to that question as well: do I need to make sure my guests are patched, or is QEMU and hypervisor patching sufficient to ensure guests can't read more than their own memory?
21:18:45 <bollig> that I dont know. We’re rebuilding all of our base images, and looking for that same answer for existing VMs
21:18:47 <jmlowe> I had some LINPACK numbers from Jonathan Mills at NASA; the worst case was 50%, the best case was 5%, and it seemed to vary linearly with N
21:19:04 <oneswig> jmlowe: if you find out, will you let us know - sure it'll be widely applicable to just about everyone in OpenStack
21:19:09 <jmlowe> we are also patching all of our images as per usual
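One rough way to answer jmlowe's question layer by layer is to ask the kernel itself, both on the hypervisor and inside each guest. A minimal sketch, assuming a kernel new enough to expose the post-fix sysfs interface; on older kernels the directory is simply absent, which by itself says nothing about microcode:

    # Minimal sketch: report what the running kernel says about Meltdown/Spectre
    # mitigations. The files under /sys/devices/system/cpu/vulnerabilities/ only
    # exist on kernels that carry the fixes; run this on the hypervisor and in
    # guests to see which layers report themselves as mitigated.
    import os

    VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

    def mitigation_status():
        if not os.path.isdir(VULN_DIR):
            return {}  # pre-mitigation kernel: interface not available
        status = {}
        for name in sorted(os.listdir(VULN_DIR)):
            with open(os.path.join(VULN_DIR, name)) as f:
                status[name] = f.read().strip()
        return status

    if __name__ == "__main__":
        report = mitigation_status()
        if not report:
            print("kernel does not expose the vulnerabilities interface")
        for name, state in report.items():
            print(f"{name}: {state}")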
21:20:38 <oneswig> Andrey is ready, or thereabouts.  He's sent ahead a presentation to share
21:20:52 <oneswig> #link SGX and OpenStack https://drive.google.com/file/d/1wBXVrd9v8GjyreFLET5nW7IhROaOf7A6/view
21:22:18 <jmlowe> I have a more urgent need to patch, it seems that either my 2.1.26 i40e driver or the RHEL/CentOS 3.10.0-693.5 kernel is leaking about 20 GB/month, and it's starting to trigger the OOM killer on my instances
21:22:19 <aembrito> Hi everyone, I was offline on a plane and had much less time than I expected, so please, consider it a first discussion
21:23:18 <aembrito> I will then come back and give more details, including on how we are using it with OpenStack (Ironic, LXD, KVM, and Magnum+Kubernetes)
21:24:20 <oneswig> Hi Andrey, thanks for joining us today
21:24:41 <oneswig> #topic SGX on OpenStack
21:25:29 <oneswig> Is this specific to Skylake?  I've seen previous articles on it that appear to date from 2013
21:26:22 <abrito> yes, it is specific to Skylake
21:26:31 <abrito> previous discussion was based on simulations
21:27:25 <oneswig> How much of a limitation is it that the code in the enclave can't make system calls?
21:30:14 <abrito> there are tools to help circumvent this
21:30:43 <oneswig> abrito: Are there uses for this as protection for bare metal infrastructure from malicious users?
21:30:45 <abrito> for example, SCONE is a tool that places one thread inside the enclave and another outside the enclave, the one outside does the syscalls
21:31:12 <abrito> the one inside takes care that there is no leak (e.g., encrypts data going to the disk operation)
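A rough illustration of that split, not SCONE itself, just the pattern abrito describes: the "trusted" thread never issues syscalls, and only protected data crosses to the thread that does. The XOR below is a stand-in for real authenticated encryption inside the enclave:

    # Conceptual sketch of the two-thread split (not SCONE): one thread stands in
    # for code inside the enclave and never touches the filesystem; the other,
    # "untrusted" thread performs the actual write syscalls.
    import queue
    import threading

    work = queue.Queue()
    KEY = b"demo-key"  # in a real enclave the key never leaves protected memory

    def trusted_producer(chunks):
        for chunk in chunks:
            # Placeholder "encryption": only obfuscated bytes cross the boundary.
            protected = bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(chunk))
            work.put(protected)
        work.put(None)  # signal completion

    def untrusted_writer(path):
        with open(path, "wb") as f:  # syscalls happen only in this thread
            while True:
                item = work.get()
                if item is None:
                    break
                f.write(item)

    if __name__ == "__main__":
        t1 = threading.Thread(target=trusted_producer, args=([b"secret record\n"] * 3,))
        t2 = threading.Thread(target=untrusted_writer, args=("/tmp/enclave-out.bin",))
        t1.start(); t2.start()
        t1.join(); t2.join()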
21:31:36 <oneswig> It seems to be targeted as an application-level tool rather than a system-level tool.  Is that accurate?
21:31:51 <abrito> one application in baremetal would be to store certificates and other secrets in the enclave
21:32:02 <abrito> exactly
21:32:21 <abrito> Intel has just released a POC for doing that
21:32:38 <oneswig> Oooh, got a link?
21:32:45 <abrito> just a sec
21:33:19 <abrito> this one is from a project partner: https://github.com/lsds/TaLoS
21:34:36 <abrito> https://github.com/cloud-security-research/sgx-ra-tls
21:36:02 <oneswig> Is there a performance penalty for accessing memory within the enclave, or executing code within the enclave?
21:38:01 <abrito> in the graph in slide 8, you can see something about this
21:38:12 <abrito> if the memory footprint is small
21:38:30 <abrito> you see no penalty
21:38:56 <abrito> this would be the case if you are, for example, streaming the data through the protected application
21:39:26 <oneswig> Ah, the y axis is relative slowdown of running in an enclave?
21:39:27 <abrito> if you exceed the EPC size (e.g., the 128 MB) then it needs to decrypt and re-encrypt the data
21:39:39 <abrito> adding a huge overhead
21:40:10 <abrito> yes, the Y axis is the overhead compared to regular C code running outside an enclave
21:40:47 <oneswig> What is the difference in the code generated?
21:41:13 <oneswig> Have you found it easy to work with?
21:41:41 <abrito> it allocates a piece of the "enclave memory", and the secure functions and their data are allocated inside it
21:42:17 <oneswig> Just curious, what's a secure process for loading code into the enclave?
21:42:32 <abrito> there is some learning curve if you are using the Intel SDK directly, but if you do not need syscalls for the confidential algorithms/transformations, then it is mostly boilerplate code
21:43:00 <abrito> can you rephrase that last question?
21:43:28 <oneswig> Just wondering how we trust the code as it is transferred in.  I guess there is some code signing or similar?
21:43:38 <abrito> yes
21:43:50 <abrito> Once the code is executed you can do a remote attestation
21:44:22 <abrito> the remote attestation starts by an external participant asking the application to get a "quote" of itself
21:44:40 <abrito> the quote is produced by the processor where the application is running in
21:45:17 <abrito> then the application gives you the signed quote and if you have never trusted that processor before
21:45:49 <abrito> then you go to the Intel Attestation Service (IAS) for it to confirm that the quote was emitted by an SGX-supporting processor
21:45:54 <abrito> using the current firmware
21:46:16 <abrito> that has not been blacklisted and that is running in the correct mode (e.g., non-debug or simulated)
21:46:45 <abrito> if you already trusted the processor, you do not need to go to the intel service again
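A skeleton of the decision flow abrito just walked through; get_quote() and verify_with_ias() are hypothetical placeholders for the real SGX quoting and IAS round trips, and only the control flow is meant to be accurate:

    # Skeleton of the attestation flow described above. Both helpers below are
    # hypothetical placeholders returning canned data; in reality the quote comes
    # from the processor's quoting enclave and the report from the Intel
    # Attestation Service.
    trusted_platforms = set()  # processors we have already attested

    def get_quote(application):
        return {"platform_id": "cpu-123", "body": b"..."}  # placeholder

    def verify_with_ias(quote):
        return {"genuine": True, "blacklisted": False, "mode": "production"}  # placeholder

    def attest(application):
        quote = get_quote(application)
        platform_id = quote["platform_id"]
        if platform_id in trusted_platforms:
            return True  # already trusted: no need to go back to IAS
        report = verify_with_ias(quote)  # IAS: genuine SGX CPU, current firmware,
        ok = (report["genuine"]          # not blacklisted, non-debug mode
              and not report["blacklisted"]
              and report["mode"] == "production")
        if ok:
            trusted_platforms.add(platform_id)
        return ok

    if __name__ == "__main__":
        print(attest("my-app"))  # first call consults IAS
        print(attest("my-app"))  # second call trusts the cached platform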
21:47:10 <oneswig> interesting - so if you trust Intel then you can also trust the cpu
21:47:40 <abrito> :-)
21:48:09 <martial_> did I understand properly: it creates a hardware memory map in the enclave ?
21:48:15 <abrito> yes, for this version of SGX you have to trust intel to tell you that the code is actually running in the correct mode and processor
21:48:44 <abrito> martial_: during boot it separates a piece of memory to be used by the enclaves
21:49:11 <oneswig> I can see it being useful in apps where secrets are held.  Do you think it will succeed for future cpu generations?
21:49:35 <abrito> that piece of memory cannot be accessed by code other than the code from the enclave that allocated it on creation
21:49:52 <abrito> oneswig: yes, I am optimistic
21:50:20 <oneswig> good to hear it.
21:50:26 <abrito> one thing is that, recently, Azure and IBM have mentioned that they are making test services available that use SGX
21:50:27 <oneswig> What will you do next with it?
21:50:38 <abrito> e.g.: SGX capable VMs
21:51:05 <abrito> I also heard that enclave memory is likely to become larger in the short term
21:51:19 <abrito> my next step is to run Kubernetes jobs on it
21:51:41 <abrito> using code in Python running inside the enclaves
21:51:46 <oneswig> with the enclave holding something for the containerised app, or something for kubernetes itself?
21:52:15 <abrito> there is not much to be done with Kubernetes itself
21:52:25 <abrito> monitoring needs to be done differently
21:52:31 <abrito> so that you consider the EPC usage
21:52:51 <abrito> otherwise you can hit the 128 MB and suffer the performance hit
21:53:15 <abrito> but once you have the code running, it's mostly a matter of configuring the right tools
21:53:44 <abrito> you also want, for example, the tasks in the task queues to be encrypted
21:53:58 <abrito> and only workers that have been attested hold the keys
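A toy sketch of that idea, with encrypt()/decrypt() as placeholders rather than a real scheme: tasks sit encrypted in the queue, and the key is only handed to workers that have passed attestation:

    # Toy sketch of an encrypted task queue. encrypt()/decrypt() are placeholders,
    # and "attested" stands in for the remote-attestation check sketched earlier;
    # only attested workers ever receive the task key.
    import queue

    task_queue = queue.Queue()
    TASK_KEY = b"released-only-to-attested-workers"

    def encrypt(data, key):  # placeholder; use real authenticated encryption
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    decrypt = encrypt  # the XOR placeholder is symmetric

    def submit(task):
        task_queue.put(encrypt(task, TASK_KEY))

    def worker(attested):
        ciphertext = task_queue.get()
        if not attested:
            return None  # unattested workers only ever see ciphertext
        return decrypt(ciphertext, TASK_KEY)

    if __name__ == "__main__":
        submit(b"run job 42")
        print(worker(attested=True))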
21:54:44 <abrito> we (not only UFCG, but the securecloud consortium) are also working on monitoring and scheduling tools
21:55:22 <martial_> so how different is it from an HSM?
21:55:25 <oneswig> I'd be interested to know where you take it
21:55:49 <abrito> it is an HSM; the advantage is that you already have it on your table
21:56:11 <abrito> it is not a separate hardware piece
21:56:26 <abrito> the downside is that not many Xeons have it
21:56:36 <martial_> thank you, that helps
21:56:43 <oneswig> OK - anything more for Andrey - we are close to time
21:57:22 <abrito> there are people also looking at SGX for Barbican
21:57:35 <oneswig> I was wondering about that...
21:57:35 <abrito> exactly because of its easier availability
21:57:54 <oneswig> would be great to use it for holding secrets 'at rest'
21:58:04 <abrito> yes
21:58:27 <martial_> abrito: I was wondering about this, in 2016 the Barbican team did a hands-on during the Barcelona summit and they had an HSM setup
21:58:39 <martial_> (can not remember the hardware now)
21:58:42 <oneswig> OK, we must press on
21:58:43 <oneswig> thank you Andrey - really interesting to hear about your work
21:58:53 <abrito> So, I would like to thank you for the invitation
21:59:06 <abrito> and apologize for the terrible slides
21:59:20 <martial_> really cool indeed, thank you for explaining this to us
21:59:24 <abrito> I should have been more pessimistic about the time
21:59:41 <oneswig> #topic AOB
21:59:42 <oneswig> I had one item to raise - PTG
21:59:42 <oneswig> The Scientific SIG have been invited to have a slot at the PTG in Dublin, and I'm planning to go as there'll be at least 5-6 members present
21:59:56 <abrito> I will do a second round, and explain details
21:59:59 <jmlowe> The possibility of using a $200 NUC as a backing store for Barbican is really exciting
22:00:18 <trandles> oneswig: thanks for carrying the torch for configurable deployment steps in Ironic at the Dublin PTG
22:00:29 <trandles> if you need anything documenting our use case let me know offline
22:00:42 <martial_> Mike: yes sounds interesting indeed :)
22:01:39 <oneswig> Anything development-centric on people's wish lists, let's have it before then
22:01:39 <oneswig> Ironic deployment steps - check
22:01:39 <oneswig> We'll also aim to cover some of the CERN/SKA subject areas
22:01:39 <oneswig> but anything else to cover - have a think and do follow up before the PTG, which is late February
22:01:40 <oneswig> we are out of time - anything else to raise?
22:02:56 <martial_> good for me
22:02:59 <martial_> thanks Stig
22:03:25 <trandles> thx for the topic today, very interesting
22:04:00 <rbudden> thanks everyone!
22:04:43 <martial_> #endmeeting