10:59:57 #startmeeting scientific-sig
10:59:58 Meeting started Wed Oct 24 10:59:57 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:59:59 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:01 The meeting name has been set to 'scientific_sig'
11:00:10 greetings
11:00:17 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_24th_2018
11:00:18 hi
11:00:19 good morning, good evening everyone
11:00:32 hello janders dh3
11:00:42 Good afternoon
11:00:49 priteau: indeed it is
11:01:00 :)
11:01:07 * ildikov is lurking :)
11:01:09 martial__: that you?
11:01:25 I keep getting longer and longer _ added to my nick :)
11:01:26 Greetings ildikov :-)
11:01:32 #chair martial__
11:01:33 Current chairs: martial__ oneswig
11:01:49 oneswig: hi :)
11:02:11 #topic Mixed bare metal and virt
11:02:13 hello Stig, Ildiko
11:02:29 OK, janders has very kindly put together a presentation for us on his work
11:02:32 morning martial__ :)
11:02:59 #link janders presentation https://docs.google.com/presentation/d/1fh6ZOq3DO-4V880Bn7Vxfm5I1kH5b0xKXHBLW2-sJXs/edit?usp=sharing
11:03:23 janders: what was your motivation for taking this approach?
11:03:31 oneswig: thank you
11:03:48 short answer is flexibility
11:04:20 a bit longer one is - as we started exploring bare-metal cloud we realised that not everything needs a full node
11:04:59 But you didn't want to have dedicated hypervisors?
11:04:59 also we wanted to make sure we can accommodate non-HPC OpenStack workloads
11:05:13 then the question is - how many do we need?
11:05:21 and how will the virt hypervisor change over time
11:05:45 * virt hypervisor demand
11:06:14 the main goal behind SuperCloud is providing bare metal
11:06:44 so we need as many hypervisors as we need and as few as we can get away with, so there is more baremetal capacity available
11:07:40 janders: makes sense. How do you know it's time to create a hypervisor?
11:07:53 my expectation is that we might see more virt uptake in the early days and then some of the virt users might start looking at baremetal or containers moving forward
11:08:24 right now it will be a "handraulic" approach - "running low? create more."
11:09:01 running low on vcpus/ram/vifs would be an indication to create more hypervisors
11:09:18 Is the InternalAPI network going via IPoIB?
11:09:33 yes, it's a tenant network like any other
11:09:42 currently we only have SDN capability on IB
11:10:20 interesting word "hand-raulic"
11:10:32 cool stuff. Does that also mean the cluster membership protocols for your control plane services are transported via IPoIB?
11:11:03 oneswig: correct
11:11:04 I'm curious if they all work on something not-quite-ethernet
11:12:07 yes - and while this isn't directly related to ephemeral hypervisors, I have used WAN-capable IB to connect remote computes to central controllers
11:12:30 I tried with up to 300ms latency (that is pretty long range), no issues
11:12:51 That's awesome
11:13:47 in addition to the configuration and capacity flexibility, something where I see a lot of value is having a one-stop shop for baremetals, VMs and containers going forward
11:13:48 How is the overlay networking done?
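The "running low? create more" check described above can be approximated with the standard OpenStack CLI. A minimal sketch, assuming admin credentials; the 80% threshold is an arbitrary illustration, not SuperCloud's actual policy:

    # Aggregate vCPU/RAM usage across all existing hypervisors
    openstack hypervisor stats show -f json

    # Per-hypervisor view of used vs. total capacity
    openstack hypervisor list --long

    # If utilisation sits above a chosen threshold (say 80%), deploy
    # another Ironic node as a hypervisor instead of handing it out
    # as a bare-metal instance.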
11:14:19 we don't have all of these going just yet, but that's the direction
11:14:33 we run vlan tenant networks in Neutron
11:14:42 IB works through mapping pkeys to vlans
11:15:09 so from the physical perspective, comms between VMs and between computes and controller aren't any different
11:15:15 they just happen to live in a different pkey
11:15:39 as VMs get scheduled to nodes, SDN makes sure that the hypervisor has access to the appropriate pkeys
11:15:40 The hypervisor nodes are bound to the InternalAPI network; how are they bound to other networks too?
11:15:52 and the hypervisor works out vf/pkey mapping locally
11:15:52 ah I see
11:16:03 they are in internalapi on pkey0
11:16:07 sorry index0
11:16:14 then, they can have a ton of other pkeys on top
11:16:27 it's a bit like native and tagged vlans and trunks, but not quite the same
11:16:43 there is no trunking in the neutron trunk sense
11:17:16 The binding is not done on the baremetal hypervisor instance, but the effect is used by VMs running upon it
11:17:28 correct
11:17:37 the hypervisor's GUID is mapped to the pkey
11:17:54 however the hypervisor itself does not send any packets in this pkey
11:17:59 just maps the vfs accordingly
11:18:18 it's quite neat
11:18:37 might be IB generation specific though, we're working on FDR at the moment but this will change soon
11:18:46 All VM networking is via SR-IOV?
11:18:53 haha :)
11:18:55 yes and no
11:19:00 I will cover more on that a bit later
11:19:15 SuperCloud proper is all baremetal/IB and SRIOV/IB VMs
11:19:37 the only ethernet is the br-ex uplink on the controller for L3
11:20:29 I usually try to demo as much as I can but it's a little trickier via IRC
11:20:34 however I have some demo-like slides
11:20:42 so just to illustrate what we're talking about
11:20:55 first of all - please jump to slide 10 (we covered most of the earlier slides just chatting)
11:21:24 on slide 10 you can see that the baremetal nodes running on the internalapi network are listed in "openstack compute service list"
11:21:33 so - they run on SuperCloud and are nodes of SuperCloud at the same time :)
11:21:53 slide 11 just documents setting up the lab I'm using in the next slides
11:22:23 on slide 12 you can see that the VM instances we created for the experiment are running on SuperCloud-powered compute nodes
11:22:34 there are also baremetal instances sigdemo-bm{1,2}
11:22:56 now - as both baremetals and VMs use IB they can connect (almost) seamlessly
11:23:19 almost - because VMs won't have access to admin functions (e.g. running ibhosts) unless these are manually force-enabled
11:23:38 a demonstrated example of this is slide 13 where we run an RDMA benchmark between a baremetal and a VM
11:24:12 janders: do you have a way of preventing bare metal instances from running admin functions like that to find the size of the network?
11:24:13 it's only FDR10 on this box, so the result is only 4.5 GB/s but it does demonstrate end-to-end RDMA connectivity between baremetals and VMs
11:24:37 not yet - we still haven't deployed the SecureHost firmware
11:25:04 I hope that the current version will help to a degree and I am intending to work with MLNX to improve baremetal security further
11:25:35 it's a little tricky with SecureHost and OEM firmware. Some tough conversations to be had with the OEM :)
11:25:55 Do you have a solution for reducing the number of hypervisors?
11:26:26 this is a very good question. The answer is no - this is totally hand-raulic at this stage.
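To make the pkey/VLAN mapping above concrete, here is a rough sketch of how a VLAN tenant network and an SR-IOV (direct) port are typically created with stock Neutron commands. The network name, physnet name and segment ID are illustrative only; the VLAN-to-pkey translation itself happens in the Mellanox SDN layer, not in these commands:

    # Admin creates a VLAN tenant network; the IB SDN maps the VLAN
    # segment to a pkey behind the scenes (names/IDs are examples)
    openstack network create demo-net \
        --provider-network-type vlan \
        --provider-physical-network physnet-ib \
        --provider-segment 1001
    openstack subnet create demo-subnet --network demo-net \
        --subnet-range 10.0.0.0/24

    # A VM that needs RDMA gets an SR-IOV virtual function via a
    # "direct" port on that network
    openstack port create demo-sriov-port --network demo-net --vnic-type direct
    openstack server create --flavor m1.large --image centos7 \
        --port demo-sriov-port demo-vm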
11:26:41 SRIOV will only make it trickier as evac-like functionality won't help much here
11:26:42 that nice word again :-)
11:26:48 :)
11:27:07 I expect we'll need to write custom tools for that
11:27:22 Anything from Mlnx on migration with SR-IOV?
11:27:48 it will also be good to see how interested other ironic users are - perhaps we can start an upstream discussion about adding better support for this mode of operation
11:27:56 I think there's value in this
11:28:05 regarding migration - cold migration should now work
11:28:16 live migration - if it ever happens - won't be any time soon
11:28:27 that's what I keep hearing
11:28:53 now - I would like to go back to your question about vxlan and non-SRIOV comms
11:29:00 I actually do need this for some use cases
11:29:05 Afternoon
11:29:21 yet the SDN does not support vxlan when SDN/IB is used
11:29:25 Good afternoon :)
11:29:45 however, running baremetal/SDN we have a very large degree of flexibility
11:30:14 if you look at slide 18, you will see there are actually a few more nodes I didn't show before (because I filtered by internalapi network membership)
11:30:40 there are four more nodes (controller + 3 computes) which are running a separate sub-system that supports vxlan and fully virtualised networking
11:30:55 (I think one of the tools definitely needs to be called "handrauler")
11:31:02 in this case, vxlan traffic is carried over IPoIB
11:31:26 martial__: :)
11:32:07 by adding this capability to the system, we can: 1) achieve very high VM density 2) run workloads that either don't care about IB or don't work nicely with it
11:32:44 please look at slides 20 & 21 and you will see we're running 200 VMs per hypervisor
11:33:22 we can probably go higher density but need more modern hardware (more RAM and NVMes)
11:34:18 going forward, I aim to remove this IB/non-IB distinction and have one integrated system that will do baremetal, SRIOV VMs and non-SRIOV VMs
11:34:31 however we'll need to enhance the SDN layer for that to be possible
11:35:09 janders: what patches are you carrying currently for this?
11:35:31 for the elastic hypervisors, literally none
11:35:52 there are some outstanding ones that I need for baremetal+SDN/IB
11:36:04 but there are fewer and fewer as RHAT and MLNX fix things up
11:36:39 for the elastic hypervisor it's all in the architecture/configuration
11:36:51 and that's mostly runtime
11:37:50 Is the same concept directly transferable to a baremetal ethernet system?
11:38:16 I believe so
11:38:28 do you know if the SDN in eth mode supports vxlan?
11:39:08 from the fact that it's picking up requests for vxlan ports in IB mode, I suppose so, however I haven't researched this in detail as I don't have SDN-capable eth kit yet
11:39:31 If you're thinking of the Mellanox SDN driver, I don't know. I believe the hardware does so I'd guess yes
11:40:36 if ethernet SDN doesn't support vxlan, presenting networks to VMs might be tricky (I'm utilising very specific IB features to do this)
11:41:27 overall I am very happy with the flexibility we get by running our hypervisors on a bare-metal SDN system
11:41:33 Good point - the Ethernet port would have to be trunk, Ironic sets up access
11:41:42 and it's a great demonstrator for the API-driven datacentre idea which we're heading towards
11:41:42 Are there issues with exposing the Internal API network? I guess it's only exposed to a specific internal project?
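For readers unfamiliar with the "vxlan over IPoIB" pattern mentioned above: it needs no special driver, the tunnel endpoint address is simply placed on the IPoIB interface. A rough sketch of where to look on a fully virtualised compute node, assuming a stock Neutron OVS agent; the interface name and file path are the usual defaults, not copied from SuperCloud:

    # The VXLAN endpoint address sits on the IPoIB interface, so
    # tunnelled tenant traffic rides over InfiniBand
    ip -4 addr show ib0

    # The OVS agent's local_ip (under [ovs]) would point at that address
    grep -A3 '^\[ovs\]' /etc/neutron/plugins/ml2/openvswitch_agent.ini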
11:41:59 no - it's a private network
11:42:59 [root@supercloud03 ~(keystone_admin)]# openstack network show internalapi | grep -i shared | shared | False | [root@supercloud03 ~(keystone_admin)]#
11:43:19 I'm interested in the idea of moving this to an ethernet solution as well
11:43:21 from the user's perspective it's invisible:
11:43:24 [root@supercloud03 ~(keystone_sigdemo)]# openstack network show internalapi | grep -i shared Error while executing command: No Network found for internalapi [root@supercloud03 ~(keystone_sigdemo)]#
11:43:43 How does that work? Your hypervisor is an instance, but not within a project, but is accessing pkeys for tenant networks private to that project
11:44:27 the hypervisor has the pkeys mapped to its GUID in a non-zero index
11:44:34 index zero is the internalapi pkey
11:44:53 it then maps the index of the private project pkey the VM is in to index zero of the VM
11:45:02 or, I should say, the index zero of the vf
11:45:13 pf and vfs have separate pkey tables
11:45:37 does this make sense?
11:45:50 yes thanks.
11:46:09 it's a bit like having an eth port with a dynamic set of tagged VLANs
11:46:16 and uplinking it to ovs
11:46:24 From a policy perspective I guess it's no different from an admin having visibility of project networks.
11:46:45 if VLAN x is configured on the upstream switch, an OVS port in VLAN x will be able to access it
11:46:50 sort of
11:47:01 the hypervisor just controls the middle layer
11:47:24 SDN makes sure that only the hypervisors running VMs in that pkey have access to the pkey
11:48:32 with ethernet/baremetal, if SDN supports vxlan and trunks I think it's possible to have full feature parity with VMs/vxlan from the networking perspective
11:49:19 personally, I really want to run this on the full-virt part of the system:
11:49:20 https://github.com/snowjet/os-inception
11:49:49 it's a set of playbooks I wrote some time back at RHAT that automate deployment of fully featured HA OpenStack in OpenStack VMs
11:50:05 great for ensuring each team member has a realistic R&D env
11:50:18 with zero possibility of people stepping on each other's toes
11:50:57 it's heaps easier to run playbooks like this without worrying if things will figure out that ib0=eth0 and no, ib0 can't be put on an ovs bridge
11:51:09 leaving the slides up afterward? I am hoping a couple of people will be able to take a peek at them
11:51:30 martial__: sure, no worries
11:52:01 so - being mindful of time - this is pretty much everything I have
11:52:07 Very cool project janders
11:52:12 Thanks for sharing.
11:52:20 do you guys have any more questions?
11:52:29 oneswig: thank you and you're most welcome
11:52:33 It moves smoothly into our other topic...
11:52:46 ... will you also talk about it at the Berlin SIG BoF?
11:52:49 :-)
11:52:55 I would be very happy to
11:53:03 janders: yes, looks very interesting - thanks
11:53:18 thank you verdurin
11:53:20 #topic Berlin SIG BoF
11:53:33 #link Etherpad for BoF talks: https://etherpad.openstack.org/p/Berlin-Scientific-SIG-BoF
11:53:55 We don't usually finalise this until the day of the event, but people are very welcome to put in talks
11:54:27 I'll make sure there's a nice bottle of quality German wine for the winner :-)
11:55:18 My advice to the winner is to drink it quickly
11:55:39 ... I suspect there's a story behind that
11:56:01 ok, we are running close to time
11:56:04 #topic AOB
11:56:19 What's new?
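On the PF/VF pkey-table point discussed above: the pkey tables can be inspected directly via sysfs on the hypervisor. A minimal sketch, assuming a standard Mellanox HCA exposed through the kernel's InfiniBand sysfs interface; the device and port names are illustrative:

    # Dump the PF pkey table: index 0 carries the internalapi pkey,
    # higher indexes carry tenant pkeys pushed in by the SDN
    # (adjust mlx5_0 / port 1 for your HCA)
    for i in /sys/class/infiniband/mlx5_0/ports/1/pkeys/*; do
        echo "index $(basename "$i"): $(cat "$i")"
    done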
11:56:26 janders, I am curious about any tooling you have to move a machine between roles "bare metal host" to "hypervisor" - do you integrate with whatever your deployment system is? (TripleO, openstack-ansible...?)
11:57:26 we (Sanger) are all-virtual, but some projects are using flavours which are basically an entire hypervisor, so placing those on bare metal seems sensible
11:58:01 but scale-down/scale-up (removing/adding hypervisors) is a bit fraught so doing it on the fly seems... courageous :)
11:58:47 my connection dropped :( back now
11:59:12 at the PTG folks were talking about using composable hardware to build a hypervisor when capacity gets tight, which sounds worse
11:59:12 if you didn't see my question I can email (we are close to end time)
11:59:39 dh3: I don't have any TripleO/openstack-ansible integration
11:59:54 I do use TripleO capability filters to control scheduling though :) (--hint nodename)
12:00:07 We are alas on the hour
12:00:18 thanks guys! till next time
12:00:42 dh3: janders_: could continue via #scientific-wg or via email given the hour in Canberra?
12:00:54 Thanks again janders_, great session
12:00:55 email would be best
12:01:01 #endmeeting
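For anyone wanting to reproduce the scheduling trick mentioned at 11:59:54: one common way to pin an Ironic instance to a named node is via node capabilities matched by flavor properties, which is what TripleO's predictable node placement builds on. A rough sketch; the node, flavor and image names are examples, and janders' exact invocation may differ:

    # Tag the Ironic node with a capability identifying it by name
    openstack baremetal node set compute-7 \
        --property capabilities='node:compute-7,boot_option:local'

    # Have the ComputeCapabilitiesFilter match it via the flavor
    openstack flavor set bm.hypervisor --property capabilities:node='compute-7'

    # Instances booted with that flavor land on that specific node
    openstack server create --flavor bm.hypervisor --image hypervisor-image \
        --network internalapi sigdemo-hv1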