10:59:57 <oneswig> #startmeeting scientific-sig
10:59:58 <openstack> Meeting started Wed Oct 24 10:59:57 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:59:59 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:01 <openstack> The meeting name has been set to 'scientific_sig'
11:00:10 <oneswig> greetings
11:00:17 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_24th_2018
11:00:18 <dh3> hi
11:00:19 <janders> good morning, good evening everyone
11:00:32 <oneswig> hello janders dh3
11:00:42 <priteau> Good afternoon
11:00:49 <oneswig> priteau: indeed it is
11:01:00 <janders> :)
11:01:07 * ildikov is lurking :)
11:01:09 <oneswig> martial__: that you?
11:01:25 <martial__> I keep getting longer and longer _ added to my nick :)
11:01:26 <oneswig> Greetings ildikov :-)
11:01:32 <oneswig> #chair martial__
11:01:33 <openstack> Current chairs: martial__ oneswig
11:01:49 <ildikov> oneswig: hi :)
11:02:11 <oneswig> #topic Mixed bare metal and virt
11:02:13 <martial__> hello Stig, Ildiko
11:02:29 <oneswig> OK, janders has very kindly put together a presentation for us on his work
11:02:32 <ildikov> morning martial__ :)
11:02:59 <oneswig> #link janders presentation https://docs.google.com/presentation/d/1fh6ZOq3DO-4V880Bn7Vxfm5I1kH5b0xKXHBLW2-sJXs/edit?usp=sharing
11:03:23 <oneswig> janders: what was your motivation for taking this approach?
11:03:31 <janders> oneswig: thank you
11:03:48 <janders> short answer is flexibility
11:04:20 <janders> a bit longer one is - as we started exploring bare-metal cloud we realised that not everything needs a full node
11:04:59 <oneswig> But you didn't want to have dedicated hypervisors?
11:04:59 <janders> also we wanted to make sure we can accommodate non-HPC OpenStack workloads
11:05:13 <janders> then the question is - how many do we need?
11:05:21 <janders> and how will the virt hypervisor change over time
11:05:45 <janders> * virt hypervisor demand
11:06:14 <janders> the main goal behind SuperCloud is providing bare metal
11:06:44 <janders> so we need as many hypervisors as we need and as few as we can get away with, so there is more baremetal capacity available
11:07:40 <oneswig> janders: makes sense. How do you know it's time to create a hypervisor?
11:07:53 <janders> my expectation is that we might see more virt uptake in the early days and then some of the virt users might start looking at baremetal or containers moving forward
11:08:24 <janders> right now it will be a "handraulic" approach - "running low? create more."
11:09:01 <janders> running low in vcpus/ram/vifs would be an indication to create more hypervisors
11:09:18 <oneswig> Is the InternalAPI network going via IPoIB?
11:09:33 <janders> yes, it's a tenant network as any other
11:09:42 <janders> currently we only have SDN capability on IB
11:10:20 <martial__> interesting word "hand-raulic"
11:10:32 <oneswig> cool stuff. Does that also mean the cluster membership protocols for your control plane services are transported via IPoIB?
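The "running low? create more" check described above lends itself to a simple script. This is a minimal sketch against the standard Nova hypervisor statistics output; the 80% threshold and the use of vCPUs as the trigger are illustrative assumptions, not figures from the session.

    # Sketch only: warn when aggregate vCPU capacity runs low, which in this
    # scheme is the cue to enrol another bare-metal node as a hypervisor.
    STATS=$(openstack hypervisor stats show -f json)
    USED=$(echo "$STATS" | python3 -c 'import json,sys; print(json.load(sys.stdin)["vcpus_used"])')
    TOTAL=$(echo "$STATS" | python3 -c 'import json,sys; print(json.load(sys.stdin)["vcpus"])')
    if [ $((USED * 100 / TOTAL)) -ge 80 ]; then
        echo "vCPU usage at or above 80% - time to create another hypervisor"
    fi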
11:11:03 <janders> oneswig: correct
11:11:04 <oneswig> I'm curious if they all work on something not-quite-ethernet
11:12:07 <janders> yes - and while this isn't directly related to ephemeral hypervisors, I have used WAN-capable IB to connect remote computes to central controllers
11:12:30 <janders> I tried with up to 300ms latency (that is pretty long range), no issues
11:12:51 <oneswig> That's awesome
11:13:47 <janders> in addition to the configuration and capacity flexibility, something where I see a lot of value is having a one-stop shop for baremetals, VMs and containers going forward
11:13:48 <oneswig> How is the overlay networking done?
11:14:19 <janders> we don't have all of these going just yet, but that's the direction
11:14:33 <janders> we run vlan tenant networks in Neutron
11:14:42 <janders> IB works through mapping pkeys to vlans
11:15:09 <janders> so from the physical perspective, comms between VMs and between computes and controller aren't any different
11:15:15 <janders> they just happen to live in a different pkey
11:15:39 <janders> as VMs get scheduled to nodes, SDN makes sure that the hypervisor has access to the appropriate pkeys
11:15:40 <oneswig> The hypervisor nodes are bound to the InternalAPI network, how are they bound to other networks too?
11:15:52 <janders> and the hypervisor works out vf/pkey mapping locally
11:15:52 <oneswig> ah I see
11:16:03 <janders> they are in internalapi on pkey0
11:16:07 <janders> sorry index0
11:16:14 <janders> then, they can have a ton of other pkeys on top
11:16:27 <janders> it's a bit like native and tagged vlans and trunks, but not quite the same
11:16:43 <janders> there is no trunking in the neutron trunk sense
11:17:16 <oneswig> The binding is not done on the baremetal hypervisor instance, but the effect is used by VMs running upon it
11:17:28 <janders> correct
11:17:37 <janders> the hypervisor's GUID is mapped to the pkey
11:17:54 <janders> however the hypervisor itself does not send any packets in this pkey
11:17:59 <janders> just maps the vfs accordingly
11:18:18 <janders> it's quite neat
11:18:37 <janders> might be IB generation specific though, we're working on FDR at the moment but this will change soon
11:18:46 <oneswig> All VM networking is via SR-IOV?
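As a rough illustration of the "VLAN tenant networks in Neutron, pkeys mapped to VLANs" model discussed above: the tenant network is created as an ordinary provider VLAN network and the IB SDN layer translates the segment into a pkey. The network name, physical network label and segment ID below are placeholders, not values from SuperCloud.

    # Hypothetical example: an admin-created VLAN network whose segmentation ID
    # the IB SDN layer would map to a pkey. 'ibfabric' and 1234 are placeholders.
    openstack network create demo-net \
        --provider-network-type vlan \
        --provider-physical-network ibfabric \
        --provider-segment 1234
    openstack subnet create demo-subnet --network demo-net --subnet-range 192.168.10.0/24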
11:18:53 <janders> haha :)
11:18:55 <janders> yes and no
11:19:00 <janders> I will cover more on that a bit later
11:19:15 <janders> SuperCloud proper is all baremetal/IB and SRIOV/IB VMs
11:19:37 <janders> the only ethernet is the br-ex uplink on the controller for L3
11:20:29 <janders> I usually try to demo as much as I can but it's a little trickier via IRC
11:20:34 <janders> however I have some demo-like slides
11:20:42 <janders> so just to illustrate what we're talking about
11:20:55 <janders> first of all - please jump to slide 10 (we covered most of the earlier slides just chatting)
11:21:24 <janders> on slide 10 you can see that the baremetal nodes running on the internalapi network are listed in "openstack compute service list"
11:21:33 <janders> so - they run on SuperCloud and are nodes of SuperCloud at the same time :)
11:21:53 <janders> slide 11 just documents setting up the lab I'm using in the next slides
11:22:23 <janders> on slide 12 you can see that the VM instances we created for the experiment are running on SuperCloud-powered compute nodes
11:22:34 <janders> there are also baremetal instances sigdemo-bm{1,2}
11:22:56 <janders> now - as both baremetals and VMs use IB they can connect (almost) seamlessly
11:23:19 <janders> almost - because VMs won't have access to admin functions (e.g. running ibhosts) unless these are manually force-enabled
11:23:38 <janders> a demonstrated example of this is slide 13 where we run an RDMA benchmark between a baremetal and a VM
11:24:12 <oneswig> janders: do you have a way of preventing bare metal instances from running admin functions like that to find the size of the network?
11:24:13 <janders> it's only FDR10 on this box, so the result is only 4.5GB/s but it does demonstrate end-to-end RDMA connectivity between baremetals and VMs
11:24:37 <janders> not yet - we still haven't deployed the SecureHost firmware
11:25:04 <janders> I hope that the current version will help to a degree and I am intending to work with MLNX to improve baremetal security further
11:25:35 <janders> it's a little tricky with SecureHost and OEM firmware. Some tough conversations to be had with the OEM :)
11:25:55 <oneswig> Do you have a solution for reducing the number of hypervisors?
11:26:26 <janders> this is a very good question. The answer is no - this is totally hand-raulic at this stage.
11:26:41 <janders> SRIOV will only make it trickier as evac-like functionality won't help much here
11:26:42 <oneswig> that nice word again :-)
11:26:48 <janders> :)
11:27:07 <janders> I expect we'll need to write custom tools for that
11:27:22 <oneswig> Anything from Mlnx on migration with SR-IOV?
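The slide 13 benchmark itself is not reproduced here, but an RDMA bandwidth test of that shape can be run with the perftest tools between a bare-metal instance and a VM. The device name and IPoIB address below are placeholders.

    # On the bare-metal instance (server side); mlx4_0 is a placeholder device name
    ib_write_bw -d mlx4_0 --report_gbytes
    # On the VM (client side), pointing at the bare-metal instance's IPoIB address
    ib_write_bw -d mlx4_0 --report_gbytes 10.0.0.11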
11:27:48 <janders> it will also be good to see how interested other ironic users are - perhaps we can start an upstream discussion about adding better support for this mode of operation
11:27:56 <janders> I think there's value in this
11:28:05 <janders> regarding migration - cold migration should now work
11:28:16 <janders> live migration - if it ever happens - won't be any time soon
11:28:27 <janders> that's what I keep hearing
11:28:53 <janders> now - I would like to go back to your question about vxlan and non-SRIOV comms
11:29:00 <janders> I actually do need this for some use cases
11:29:05 <verdurin> Afternoon
11:29:21 <janders> yet the SDN does not support vxlan when SDN/IB is used
11:29:25 <janders> Good afternoon :)
11:29:45 <janders> however, running baremetal/SDN we have a very large degree of flexibility
11:30:14 <janders> if you look at slide 18, you will see there are actually a few more nodes I didn't show before (because I filtered by internalapi network membership)
11:30:40 <janders> there are four more nodes (controller + 3 computes) which are running a separate sub-system that supports vxlan and fully virtualised networking
11:30:55 <martial__> (I think one of the tools definitely needs to be called "handrauler")
11:31:02 <janders> in this case, vxlan traffic is carried over IPoIB
11:31:26 <janders> martial__: :)
11:32:07 <janders> by adding this capability to the system, we can: 1) achieve very high VM density 2) run workloads that either don't care about IB or don't work nicely with it
11:32:44 <janders> please look at slides 20&21 and you will see we're running 200 VMs per hypervisor
11:33:22 <janders> we can probably go higher density but need more modern hardware (more RAM and NVMes)
11:34:18 <janders> going forward, I aim to remove this IB/non-IB distinction and have one integrated system that will do baremetal, SRIOV VMs and non-SRIOV VMs
11:34:31 <janders> however we'll need to enhance the SDN layer for that to be possible
11:35:09 <oneswig> janders: what patches are you carrying currently for this?
11:35:31 <janders> for the elastic hypervisors, literally none
11:35:52 <janders> there are some outstanding ones that I need for baremetal+SDN/IB
11:36:04 <janders> but there are fewer and fewer as RHAT and MLNX fix things up
11:36:39 <janders> for the elastic hypervisor it's all in the architecture/configuration
11:36:51 <janders> and that's mostly runtime
11:37:50 <oneswig> Is the same concept directly transferable to a baremetal ethernet system?
11:38:16 <janders> I believe so
11:38:28 <janders> do you know if the SDN in eth mode supports vxlan?
11:39:08 <janders> from the fact that it's picking up requests for vxlan ports in IB mode, I suppose so; however I haven't researched this in detail as I don't have SDN-capable eth kit yet
11:39:31 <oneswig> If you're thinking of the Mellanox SDN driver, I don't know. I believe the hardware does so I'd guess yes
11:40:36 <janders> if ethernet SDN doesn't support vxlan, presenting networks to VMs might be tricky (I'm utilising very specific IB features to do this)
11:41:27 <janders> overall I am very happy with the flexibility we get by running our hypervisors on a bare-metal SDN system
11:41:33 <oneswig> Good point - the Ethernet port would have to be a trunk, Ironic sets up access
11:41:42 <janders> and it's a great demonstrator for the API-driven datacentre idea which we're heading towards
11:41:42 <oneswig> Are there issues with exposing the Internal API network? I guess it's only exposed to a specific internal project?
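For the vxlan-over-IPoIB sub-system described above, a plausible arrangement is the usual ML2/OVS vxlan setup with the tunnel endpoint placed on the compute node's IPoIB address. This is a configuration sketch under that assumption (it also assumes crudini is available and stock Neutron file paths); the ib0 address and VNI range are invented, not the actual SuperCloud configuration.

    # Sketch only: enable vxlan tenant networks and terminate the tunnels on the
    # compute node's IPoIB address (10.0.0.21 on ib0 is a placeholder).
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 type_drivers flat,vlan,vxlan
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 tenant_network_types vxlan
    crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2_type_vxlan vni_ranges 10:1000
    crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs local_ip 10.0.0.21
    crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent tunnel_types vxlan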
11:41:59 <janders> no - it's a private network
11:42:59 <janders> [root@supercloud03 ~(keystone_admin)]# openstack network show internalapi | grep -i shared | shared | False | [root@supercloud03 ~(keystone_admin)]#
11:43:19 <martial__> Am interested in the idea of moving this to an ethernet solution as well
11:43:21 <janders> from the user's perspective it's invisible:
11:43:24 <janders> [root@supercloud03 ~(keystone_sigdemo)]# openstack network show internalapi | grep -i shared Error while executing command: No Network found for internalapi [root@supercloud03 ~(keystone_sigdemo)]#
11:43:43 <oneswig> How does that work? Your hypervisor is an instance, but not within a project, but is accessing pkeys for tenant networks private to that project
11:44:27 <janders> the hypervisor has the pkeys mapped to its GUID in a non-zero index
11:44:34 <janders> index zero is the internalapi pkey
11:44:53 <janders> it then maps the index of the private project pkey the VM is in to index zero of the VM
11:45:02 <janders> or, I should say, the index zero of the vf
11:45:13 <janders> pf and vfs have separate pkey tables
11:45:37 <janders> does this make sense?
11:45:50 <oneswig> yes thanks.
11:46:09 <janders> it's a bit like having an eth port with a dynamic set of tagged VLANs
11:46:16 <janders> and uplinking it to ovs
11:46:24 <oneswig> From a policy perspective I guess it's no different from an admin having visibility of project networks.
11:46:45 <janders> if VLAN x is configured on the upstream switch, an OVS port in VLAN x will be able to access it
11:46:50 <janders> sort of
11:47:01 <janders> the hypervisor just controls the middle layer
11:47:24 <janders> SDN makes sure that only the hypervisors running VMs in that pkey have access to the pkey
11:48:32 <janders> with ethernet/baremetal, if SDN supports vxlan and trunks I think it's possible to have full feature parity with VMs/vxlan from the networking perspective
11:49:19 <janders> personally, I really want to run this on the full-virt part of the system:
11:49:20 <janders> https://github.com/snowjet/os-inception
11:49:49 <janders> it's a set of playbooks I wrote some time back at RHAT that automate deployment of a fully featured HA OpenStack in OpenStack VMs
11:50:05 <janders> great for ensuring each team member has a realistic R&D env
11:50:18 <janders> with zero possibility of people stepping on each other's toes
11:50:57 <janders> it's heaps easier to run playbooks like this without worrying whether things will figure out that ib0=eth0 and no, ib0 can't be put on an ovs bridge
11:51:09 <martial__> leaving the slides up afterward? I am hoping a couple of people will be able to take a peek at them
11:51:30 <janders> martial__: sure, no worries
11:52:01 <janders> so - being mindful of time - this is pretty much everything I have
11:52:07 <oneswig> Very cool project janders
11:52:12 <oneswig> Thanks for sharing.
11:52:20 <janders> do you guys have any more questions?
11:52:29 <janders> oneswig: thank you and you're most welcome
11:52:33 <oneswig> It moves smoothly into our other topic...
11:52:46 <oneswig> ... will you also talk about it at the Berlin SIG BoF?
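The index-zero layout janders describes can be inspected through the standard InfiniBand sysfs pkey tables on the hypervisor. The device and port names below are placeholders, and the comments reflect the scheme described above rather than captured output.

    # Placeholder device/port; each sysfs file holds one pkey value in hex.
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0   # index 0: the internalapi pkey
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/1   # non-zero indexes: tenant pkeys mapped in by the SDN
    # Inside the VM, the SR-IOV VF appears as its own device with its own pkey
    # table, where the tenant pkey has been placed at index 0:
    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0   # run inside the VM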
11:52:49 <oneswig> :-)
11:52:55 <janders> I would be very happy to
11:53:03 <verdurin> janders: yes, looks very interesting - thanks
11:53:18 <janders> thank you verdurin
11:53:20 <oneswig> #topic Berlin SIG BoF
11:53:33 <oneswig> #link Etherpad for BoF talks: https://etherpad.openstack.org/p/Berlin-Scientific-SIG-BoF
11:53:55 <oneswig> We don't usually finalise this until the day of the event, but people are very welcome to put in talks
11:54:27 <oneswig> I'll make sure there's a nice bottle of quality German wine for the winner :-)
11:55:18 <verdurin> My advice to the winner is to drink it quickly
11:55:39 <oneswig> ... I suspect there's a story behind that
11:56:01 <oneswig> ok, we are running close to time
11:56:04 <oneswig> #topic AOB
11:56:19 <oneswig> What's new?
11:56:26 <dh3> janders, I am curious about any tooling you have to move a machine between roles "bare metal host" to "hypervisor" - do you integrate with whatever your deployment system is? (TripleO, openstack-ansible...?)
11:57:26 <dh3> we (Sanger) are all-virtual, but some projects are using flavours which are basically an entire hypervisor, so placing those on bare metal seems sensible
11:58:01 <dh3> but scale-down/scale-up (removing/adding hypervisors) is a bit fraught so doing it on the fly seems... courageous :)
11:58:47 <janders_> my connection dropped :( back now
11:59:12 <johnthetubaguy> at the ptg folks were talking about using compose-able hardware to build a hypervisor when capacity gets tight, which sounds worse
11:59:12 <dh3> if you didn't see my question I can email (we are close to end time)
11:59:39 <janders_> dh3: I don't have any tripleo/os-ansible integration
11:59:54 <janders_> I do use tripleo capability filters to control scheduling though :) (--hint nodename)
12:00:07 <oneswig> We are alas on the hour
12:00:18 <janders_> thanks guys! till next time
12:00:42 <oneswig> dh3: janders_: could continue via #scientific-wg or via email given the hour in Canberra?
12:00:54 <oneswig> Thanks again janders_, great session
12:00:55 <janders_> email would be best
12:01:01 <oneswig> #endmeeting