10:59:57 <oneswig> #startmeeting scientific-sig
10:59:58 <openstack> Meeting started Wed Oct 24 10:59:57 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:59:59 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:01 <openstack> The meeting name has been set to 'scientific_sig'
11:00:10 <oneswig> greetings
11:00:17 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_24th_2018
11:00:18 <dh3> hi
11:00:19 <janders> good morning, good evening everyone
11:00:32 <oneswig> hello janders dh3
11:00:42 <priteau> Good afternoon
11:00:49 <oneswig> priteau: indeed it is
11:01:00 <janders> :)
11:01:07 * ildikov is lurking :)
11:01:09 <oneswig> martial__: that you?
11:01:25 <martial__> I keep getting longer and longer _ added to my nick :)
11:01:26 <oneswig> Greetings ildikov :-)
11:01:32 <oneswig> #chair martial__
11:01:33 <openstack> Current chairs: martial__ oneswig
11:01:49 <ildikov> oneswig: hi :)
11:02:11 <oneswig> #topic Mixed bare metal and virt
11:02:13 <martial__> hello Stig, Ildiko
11:02:29 <oneswig> OK, janders has very kindly put together a presentation for us on his work
11:02:32 <ildikov> morning martial__ :)
11:02:59 <oneswig> #link janders presentation https://docs.google.com/presentation/d/1fh6ZOq3DO-4V880Bn7Vxfm5I1kH5b0xKXHBLW2-sJXs/edit?usp=sharing
11:03:23 <oneswig> janders: what was your motivation for taking this approach?
11:03:31 <janders> oneswig: thank you
11:03:48 <janders> short answer is flexibility
11:04:20 <janders> a bit longer one is - as we started exploring bare-metal cloud we realised that not everything needs a full node
11:04:59 <oneswig> But you didn't want to have dedicated hypervisors?
11:04:59 <janders> also we wanted to make sure we can accommodate non-HPC OpenStack workloads
11:05:13 <janders> then the question is - how many do we need?
11:05:21 <janders> and how will the virt hypervisor change over time
11:05:45 <janders> * virt hypervisor demand
11:06:14 <janders> the main goal behind SuperCloud is providing bare metal
11:06:44 <janders> so we need as many hypervisors as we need and as few as we can get away with, so there is more baremetal capacity available
11:07:40 <oneswig> janders: makes sense. How do you know it's time to create a hypervisor?
11:07:53 <janders> my expectation is that we might see more virt uptake in the early days and then some of the virt users might start looking at baremetal or containers moving forward
11:08:24 <janders> right now it will be a "handraulic" approach - "running low? create more."
11:09:01 <janders> running low in vcpus/ram/vifs would be an indication to create more hypervisors
11:09:18 <oneswig> Is the InternalAPI network going via IPoIB?
11:09:33 <janders> yes, it's a tenant network as any other
11:09:42 <janders> currently we only have SDN capability on IB
11:10:20 <martial__> interesting word "hand-raulic"
11:10:32 <oneswig> cool stuff. Does that also mean the cluster membership protocols for your control plane services are transported via IPoIB?
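The "running low? create more" check described above lends itself to a simple script. This is a minimal sketch against the standard Nova hypervisor statistics output; the 80% threshold and the use of vCPUs as the trigger are illustrative assumptions, not figures from the session.

    # Sketch only: warn when aggregate vCPU capacity runs low, which in this
    # scheme is the cue to enrol another bare-metal node as a hypervisor.
    STATS=$(openstack hypervisor stats show -f json)
    USED=$(echo "$STATS" | python3 -c 'import json,sys; print(json.load(sys.stdin)["vcpus_used"])')
    TOTAL=$(echo "$STATS" | python3 -c 'import json,sys; print(json.load(sys.stdin)["vcpus"])')
    if [ $((USED * 100 / TOTAL)) -ge 80 ]; then
        echo "vCPU usage at or above 80% - time to create another hypervisor"
    fi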
11:11:03 <janders> oneswig: correct
11:11:04 <oneswig> I'm curious if they all work on something not-quite-ethernet
11:12:07 <janders> yes - and while this isn't directly related to ephemeral hypervisors, I have used WAN-capable IB to connect remote computes to central controllers
11:12:30 <janders> I tried with up to 300ms latency (that is pretty long range), no issues
11:12:51 <oneswig> That's awesome
11:13:47 <janders> in addition to the configuration and capacity flexibility, something where I see a lot of value is having a one-stop shop for baremetals, VMs and containers going forward
11:13:48 <oneswig> How is the overlay networking done?
11:14:19 <janders> we don't have all of these going just yet, but that's the direction
11:14:33 <janders> we run vlan tenant networks in Neutron
11:14:42 <janders> IB works through mapping pkeys to vlans
11:15:09 <janders> so from the physical perspective, comms between VMs and between computes and controller aren't any different
11:15:15 <janders> they just happen to live in a different pkey
11:15:39 <janders> as VMs get scheduled to nodes, SDN makes sure that the hypervisor has access to the appropriate pkeys
11:15:40 <oneswig> The hypervisor nodes are bound to the InternalAPI network, how are they bound to other networks too?
11:15:52 <janders> and the hypervisor works out vf/pkey mapping locally
11:15:52 <oneswig> ah I see
11:16:03 <janders> they are in internalapi on pkey0
11:16:07 <janders> sorry index0
11:16:14 <janders> then, they can have a ton of other pkeys on top
11:16:27 <janders> it's a bit like native and tagged vlans and trunks, but not quite the same
11:16:43 <janders> there is no trunking in the neutron trunk sense
11:17:16 <oneswig> The binding is not done on the baremetal hypervisor instance, but the effect is used by VMs running upon it
11:17:28 <janders> correct
11:17:37 <janders> the hypervisor's GUID is mapped to the pkey
11:17:54 <janders> however the hypervisor itself does not send any packets in this pkey
11:17:59 <janders> just maps the vfs accordingly
11:18:18 <janders> it's quite neat
11:18:37 <janders> might be IB generation specific though, we're working on FDR at the moment but this will change soon
11:18:46 <oneswig> All VM networking is via SR-IOV?
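As a rough illustration of the "VLAN tenant networks in Neutron, pkeys mapped to VLANs" model discussed above: the tenant network is created as an ordinary provider VLAN network and the IB SDN layer translates the segment into a pkey. The network name, physical network label and segment ID below are placeholders, not values from SuperCloud.

    # Hypothetical example: an admin-created VLAN network whose segmentation ID
    # the IB SDN layer would map to a pkey. 'ibfabric' and 1234 are placeholders.
    openstack network create demo-net \
        --provider-network-type vlan \
        --provider-physical-network ibfabric \
        --provider-segment 1234
    openstack subnet create demo-subnet --network demo-net --subnet-range 192.168.10.0/24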
11:18:53 <janders> haha :)
11:18:55 <janders> yes and no
11:19:00 <janders> I will cover more on that a bit later
11:19:15 <janders> SuperCloud proper is all baremetal/IB and SRIOV/IB VMs
11:19:37 <janders> the only ethernet is the br-ex uplink on the controller for L3
11:20:29 <janders> I usually try to demo as much as I can but it's a little trickier via IRC
11:20:34 <janders> however I have some demo-like slides
11:20:42 <janders> so just to illustrate what we're talking about
11:20:55 <janders> first of all - please jump to slide 10 (we covered most of the earlier slides just chatting)
11:21:24 <janders> on slide 10 you can see that the baremetal nodes running on the internalapi network are listed in "openstack compute service list"
11:21:33 <janders> so - they run on SuperCloud and are nodes of SuperCloud at the same time :)
11:21:53 <janders> slide 11 just documents setting up the lab I'm using in the next slides
11:22:23 <janders> on slide 12 you can see that the VM instances we created for the experiment are running on SuperCloud-powered compute nodes
11:22:34 <janders> there are also baremetal instances sigdemo-bm{1,2}
11:22:56 <janders> now - as both baremetals and VMs use IB they can connect (almost) seamlessly
11:23:19 <janders> almost - because VMs won't have access to admin functions (e.g. running ibhosts) unless these are manually force-enabled
11:23:38 <janders> a demonstrated example of this is slide 13 where we run an RDMA benchmark between a baremetal and a VM
11:24:12 <oneswig> janders: do you have a way of preventing bare metal instances from running admin functions like that to find the size of the network?
11:24:13 <janders> it's only FDR10 on this box, so the result is only 4.5GB/s but it does demonstrate end-to-end RDMA connectivity between baremetals and VMs
11:24:37 <janders> not yet - we still haven't deployed the SecureHost firmware
11:25:04 <janders> I hope that the current version will help to a degree and I am intending to work with MLNX to improve baremetal security further
11:25:35 <janders> it's a little tricky with SecureHost and OEM firmware. Some tough conversations to be had with the OEM :)
11:25:55 <oneswig> Do you have a solution for reducing the number of hypervisors?
11:26:26 <janders> this is a very good question. The answer is no - this is totally hand-raulic at this stage.
11:26:41 <janders> SRIOV will only make it trickier as evac-like functionality won't help much here
11:26:42 <oneswig> that nice word again :-)
11:26:48 <janders> :)
11:27:07 <janders> I expect we'll need to write custom tools for that
11:27:22 <oneswig> Anything from Mlnx on migration with SR-IOV?
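The slide 13 benchmark itself is not reproduced here, but an RDMA bandwidth test of that shape can be run with the perftest tools between a bare-metal instance and a VM. The device name and IPoIB address below are placeholders.

    # On the bare-metal instance (server side); mlx4_0 is a placeholder device name
    ib_write_bw -d mlx4_0 --report_gbytes
    # On the VM (client side), pointing at the bare-metal instance's IPoIB address
    ib_write_bw -d mlx4_0 --report_gbytes 10.0.0.11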
11:27:48 <janders> it will also be good to see how interested other ironic users are - perhaps we can start an upstream discussion about adding better support for this mode of operation
11:27:56 <janders> I think there's value in this
11:28:05 <janders> regarding migration - cold migration should now work
11:28:16 <janders> live migration - if it ever happens - won't be any time soon
11:28:27 <janders> that's what I keep hearing
11:28:53 <janders> now - I would like to go back to your question about vxlan and non-SRIOV comms
11:29:00 <janders> I actually do need this for some use cases
11:29:05 <verdurin> Afternoon
11:29:21 <janders> yet the SDN does not support vxlan when SDN/IB is used
11:29:25 <janders> Good afternoon :)
11:29:45 <janders> however, running baremetal/SDN we have a very large degree of flexibility
11:30:14 <janders> if you look at slide 18, you will see there are actually a few more nodes I didn't show before (because I filtered by internalapi network membership)
11:30:40 <janders> there are four more nodes (controller + 3 computes) which are running a separate sub-system that supports vxlan and fully virtualised networking
11:30:55 <martial__> (I think one of the tools definitely needs to be called "handrauler")
11:31:02 <janders> in this case, vxlan traffic is carried over IPoIB
11:31:26 <janders> martial__: :)
11:32:07 <janders> by adding this capability to the system, we can: 1) achieve very high VM density 2) run workloads that either don't care about IB or don't work nicely with it
11:32:44 <janders> please look at slides 20&21 and you will see we're running 200 VMs per hypervisor
11:33:22 <janders> we can probably go higher density but need more modern hardware (more RAM and NVMes)
11:34:18 <janders> going forward, I aim to remove this IB/non-IB distinction and have one integrated system that will do baremetal, SRIOV VMs and non-SRIOV VMs
11:34:31 <janders> however we'll need to enhance the SDN layer for that to be possible
11:35:09 <oneswig> janders: what patches are you carrying currently for this?
11:35:31 <janders> for the elastic hypervisors, literally none
11:35:52 <janders> there are some outstanding ones that I need for baremetal+SDN/IB
11:36:04 <janders> but there are fewer and fewer as RHAT and MLNX fix things up
11:36:39 <janders> for the elastic hypervisor it's all in the architecture/configuration
11:36:51 <janders> and that's mostly runtime
11:37:50 <oneswig> Is the same concept directly transferable to a baremetal ethernet system?
11:38:16 <janders> I believe so
11:38:28 <janders> do you know if the SDN in eth mode supports vxlan?
11:39:08 <janders> from the fact that it's picking up requests for vxlan ports in IB mode, I suppose so; however I haven't researched this in detail as I don't have SDN-capable eth kit yet
11:39:31 <oneswig> If you're thinking of the Mellanox SDN driver, I don't know. I believe the hardware does so I'd guess yes
11:40:36 <janders> if ethernet SDN doesn't support vxlan, presenting networks to VMs might be tricky (I'm utilising very specific IB features to do this)
11:41:27 <janders> overall I am very happy with the flexibility we get by running our hypervisors on a bare-metal SDN system
11:41:33 <oneswig> Good point - the Ethernet port would have to be a trunk, Ironic sets up access
11:41:42 <janders> and it's a great demonstrator for the API-driven datacentre idea which we're heading towards
11:41:42 <oneswig> Are there issues with exposing the Internal API network? I guess it's only exposed to a specific internal project?
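For the vxlan-over-IPoIB sub-system described above, a plausible arrangement is the usual ML2/OVS vxlan setup with the tunnel endpoint placed on the compute node's IPoIB address. This is a configuration sketch under that assumption (it also assumes crudini is available and stock Neutron file paths); the ib0 address and VNI range are invented, not the actual SuperCloud configuration.

    # Sketch only: enable vxlan tenant networks and terminate the tunnels on the
    # compute node's IPoIB address (10.0.0.21 on ib0 is a placeholder).
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 type_drivers flat,vlan,vxlan
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 tenant_network_types vxlan
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2_type_vxlan vni_ranges 10:1000
    crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs local_ip 10.0.0.21
    crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent tunnel_types vxlan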
11:41:59 <janders> no - it's a private network
11:42:59 <janders> [root@supercloud03 ~(keystone_admin)]# openstack network show internalapi | grep -i shared | shared | False | [root@supercloud03 ~(keystone_admin)]#
11:43:19 <martial__> Am interested in the idea of moving this to an ethernet solution as well
11:43:21 <janders> from the user's perspective it's invisible:
11:43:24 <janders> [root@supercloud03 ~(keystone_sigdemo)]# openstack network show internalapi | grep -i shared Error while executing command: No Network found for internalapi [root@supercloud03 ~(keystone_sigdemo)]#
11:43:43 <oneswig> How does that work? Your hypervisor is an instance, but not within a project, but is accessing pkeys for tenant networks private to that project
11:44:27 <janders> the hypervisor has the pkeys mapped to its GUID in a non-zero index
11:44:34 <janders> index zero is the internalapi pkey
11:44:53 <janders> it then maps the index of the private project pkey the VM is in to index zero of the VM
11:45:02 <janders> or, I should say, the index zero of the vf
11:45:13 <janders> pf and vfs have separate pkey tables
11:45:37 <janders> does this make sense?
11:45:50 <oneswig> yes thanks.
11:46:09 <janders> it's a bit like having an eth port with a dynamic set of tagged VLANs
11:46:16 <janders> and uplinking it to ovs
11:46:24 <oneswig> From a policy perspective I guess it's no different from an admin having visibility of project networks.
11:46:45 <janders> if VLAN x is configured on the upstream switch, an OVS port in VLAN x will be able to access it
11:46:50 <janders> sort of
11:47:01 <janders> the hypervisor just controls the middle layer
11:47:24 <janders> SDN makes sure that only the hypervisors running VMs in that pkey have access to the pkey
11:48:32 <janders> with ethernet/baremetal, if SDN supports vxlan and trunks I think it's possible to have full feature parity with VMs/vxlan from the networking perspective
11:49:19 <janders> personally, I really want to run this on the full-virt part of the system:
11:49:20 <janders> https://github.com/snowjet/os-inception
11:49:49 <janders> it's a set of playbooks I wrote some time back at RHAT that automate deployment of a fully featured HA OpenStack in OpenStack VMs
11:50:05 <janders> great for ensuring each team member has a realistic R&D env
11:50:18 <janders> with zero possibility of people stepping on each other's toes
11:50:57 <janders> it's heaps easier to run playbooks like this without worrying whether things will figure out that ib0=eth0 and no, ib0 can't be put on an ovs bridge
11:51:09 <martial__> leaving the slides up afterward? I am hoping a couple of people will be able to take a peek at them
11:51:30 <janders> martial__: sure, no worries
11:52:01 <janders> so - being mindful of time - this is pretty much everything I have
11:52:07 <oneswig> Very cool project janders
11:52:12 <oneswig> Thanks for sharing.
11:52:20 <janders> do you guys have any more questions?
11:52:29 <janders> oneswig: thank you and you're most welcome
11:52:33 <oneswig> It moves smoothly into our other topic...
11:52:46 <oneswig> ... will you also talk about it at the Berlin SIG BoF?
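The index-zero layout janders describes can be inspected through the standard InfiniBand sysfs pkey tables on the hypervisor. The device and port names below are placeholders, and the comments reflect the scheme described above rather than captured output.

    # Placeholder device/port; each sysfs file holds one pkey value in hex.
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0   # index 0: the internalapi pkey
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/1   # non-zero indexes: tenant pkeys mapped in by the SDN
    # Inside the VM, the SR-IOV VF appears as its own device with its own pkey
    # table, where the tenant pkey has been placed at index 0:
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0   # run inside the VM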
11:52:49 <oneswig> :-)
11:52:55 <janders> I would be very happy to
11:53:03 <verdurin> janders: yes, looks very interesting - thanks
11:53:18 <janders> thank you verdurin
11:53:20 <oneswig> #topic Berlin SIG BoF
11:53:33 <oneswig> #link Etherpad for BoF talks: https://etherpad.openstack.org/p/Berlin-Scientific-SIG-BoF
11:53:55 <oneswig> We don't usually finalise this until the day of the event, but people are very welcome to put in talks
11:54:27 <oneswig> I'll make sure there's a nice bottle of quality German wine for the winner :-)
11:55:18 <verdurin> My advice to the winner is to drink it quickly
11:55:39 <oneswig> ... I suspect there's a story behind that
11:56:01 <oneswig> ok, we are running close to time
11:56:04 <oneswig> #topic AOB
11:56:19 <oneswig> What's new?
11:56:26 <dh3> janders, I am curious about any tooling you have to move a machine between roles "bare metal host" to "hypervisor" - do you integrate with whatever your deployment system is? (TripleO, openstack-ansible...?)
11:57:26 <dh3> we (Sanger) are all-virtual, but some projects are using flavours which are basically an entire hypervisor, so placing those on bare metal seems sensible
11:58:01 <dh3> but scale-down/scale-up (removing/adding hypervisors) is a bit fraught so doing it on the fly seems... courageous :)
11:58:47 <janders_> my connection dropped :( back now
11:59:12 <johnthetubaguy> at the ptg folks were talking about using compose-able hardware to build a hypervisor when capacity gets tight, which sounds worse
11:59:12 <dh3> if you didn't see my question I can email (we are close to end time)
11:59:39 <janders_> dh3: I don't have any tripleo/os-ansible integration
11:59:54 <janders_> I do use tripleo capability filters to control scheduling though :) (--hint nodename)
12:00:07 <oneswig> We are alas on the hour
12:00:18 <janders_> thanks guys! till next time
12:00:42 <oneswig> dh3: janders_: could continue via #scientific-wg or via email given the hour in Canberra?
12:00:54 <oneswig> Thanks again janders_, great session
12:00:55 <janders_> email would be best
12:01:01 <oneswig> #endmeeting