21:00:35 <oneswig> #startmeeting scientific-sig
21:00:36 <openstack> Meeting started Tue Feb 16 21:00:35 2021 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:39 <openstack> The meeting name has been set to 'scientific_sig'
21:00:54 <oneswig> greetings
21:01:01 <rbudden> hello
21:01:17 <martial_> hello :)
21:01:45 <oneswig> #chair martial_
21:01:46 <openstack> Current chairs: martial_ oneswig
21:01:55 <oneswig> how's it going?
21:02:28 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_February_16th_2021
21:03:04 <oneswig> b1airo raised an excellent point on the Baremetal SIG's activities
21:03:29 <oneswig> #link Ironic deploy steps video from dtantsur https://youtu.be/uyN481mqdOs
21:03:33 <oneswig> brb
21:03:51 <trandles> hi all...splitting attention between two meetings :(
21:04:46 <oneswig> Hi trandles, glad you could make it
21:05:27 <rbudden> Time splicing here as well, but it’s been too long, so I wanted to at least lurk!
21:06:06 <martial_> not meetings but multiple terminals :)
21:06:20 <janders> g'day team!
21:06:22 <oneswig> The Baremetal SIG has taken the initiative with a nice video. Back in January there was a thought we could do something similar.
21:06:29 <oneswig> Hey janders, nice to see you
21:06:44 <martial_> exciting topics for today's agenda
21:08:18 <oneswig> Well I wonder what SIG members might be able to present on in a similar manner to help raise the SIG's profile.
21:08:39 <oneswig> User stories perhaps, or specific pieces of tech and method
21:08:45 * janders is quickly grabbing a coffee, just woke up
21:12:13 <oneswig> I was showing a work colleague the SIG's book yesterday - reminded me how cool that was as a group effort
21:13:08 <trandles> I should have more real world use case stories later this year, now that we're _finally_ getting a production deployment.
21:13:32 <oneswig> trandles: that's great to hear.
21:13:42 <trandles> we have multiple projects in flight that will take advantage of this infrastructure
21:13:47 <oneswig> How have things progressed in the last couple of weeks?
21:13:55 <rbudden> We should have our new cloud Explore up and running this year, so use cases, and possibly some migration war stories of combining two clouds
21:14:28 <oneswig> rbudden: migrating users and projects from one cloud to another?
21:14:31 <trandles> we just blew up a fairly fundamental road block on our deployment last week, so progress had been slow but will accelerate now
21:14:46 <rbudden> oneswig: yes, migration two OpenStack clouds into a single consolidate cloud
21:14:52 <rbudden> *consolidated
21:15:05 <oneswig> #link priteau came across this a while back and it looks handy (but haven't tried it myself) https://github.com/os-migrate/os-migrate
21:15:45 <trandles> I'm looking forward to learning from Mike's upcoming deployment at Indiana. Also, Chris Layton should be doing at least one new deployment in the coming weeks.
21:15:49 <rbudden> Interesting, I’ll have to take a look at this
21:15:59 <rbudden> Thanks oneswig !
21:16:06 <oneswig> +1
21:16:07 <trandles> Chris is at Oak Ridge...forgot to mention that
21:16:42 <oneswig> trandles: is ORNL doing an Ironic deploy like yours?
21:17:10 <trandles> I think ORNL will be RHOSP 16.1 like mine
21:17:45 <oneswig> Nice work.
21:17:49 <trandles> Mike is planning to do everything IPv6, so there might be lots to come from that work. I wish he was here...
21:18:21 <oneswig> We have a project with a /28 public IPv4 allocation - so that'll be mostly IPv6 as well
21:19:00 <oneswig> Not quite getting started on that part yet. But perhaps it won't be the last machine where I have about as many IPv4 addresses as I have fingers and toes
21:19:44 <rbudden> We’re currently doing some IPv6. It’s a bit gnarly, but functional. I’ll be curious to see Mike’s work as well.
21:20:56 <oneswig> I do wonder if IPv6 is hugely neglected in network security
21:21:29 <oneswig> (I'm sure not by you guys)
21:21:52 <trandles> oneswig: our network security plan actually says "disable IPv6"
21:22:14 <oneswig> that solves the issue
21:24:02 <oneswig> trandles: had some discussion recently on the ramdisk driver. Do you know of other sites using it?
21:24:14 <trandles> oneswig: I do not
21:24:53 <oneswig> Ah, too bad. Let's keep an eye out.
21:25:01 <trandles> There may be some news RE: kexec deploy in the next week or two.
21:27:12 <oneswig> That would be good. One of our team was raising earlier that the provisioning cycle times were ~35 minutes on a bare metal system. Anything to help with that is good - although does it only work in flat networking?
21:27:55 <janders> that's a lot of awesome projects going on! :)
21:28:22 <trandles> oneswig: I can't answer that question at this point. That's a Julia question probably.
21:29:04 <trandles> all of my ironic stuff with clusters is no-op networking
21:29:19 <janders> trandles - kexec - awesome!
21:29:45 <janders> oneswig with 35 minute deploys, how much of this is due to reboots between BIOS/RAID config etc?
21:30:48 <janders> fast track, kexec (and soon nvme cleaning) can help rapid node turn around, but hardware config might not be keeping up with that pace :/
21:30:49 <oneswig> I don't know at this stage, just heard it this afternoon. It might possibly be a dependency in Terraform causing nodes to be deployed in phases. I'm speculating
21:31:40 <janders> I'm just thinking aloud: for low-latency high-turnover provisioning, it would likely make sense to come up with uniform BIOS/RAID/... config and simplify deploy and cleaning steps as much as possible, to avoid reboots
21:31:46 <oneswig> This kit is old. NVMEs are a mirage on the horizon
21:32:19 <oneswig> janders: now you're caffeinated, shall we move on?
21:32:35 <janders> oneswig yes! :)
21:32:50 <oneswig> #topic superclouds - virt on bare metal
21:33:44 <oneswig> janders made the comment 2 weeks ago we should dedicate time for this subject, I think it's well deserved.
21:34:11 <oneswig> #link For context, janders' video on SuperCloud https://www.openstack.org/videos/summits/denver-2019/its-a-cloud-its-a-supercomputer-no-its-supercloud
21:35:12 <janders> oneswig this brings memories... hard to believe how much the World changed since (intl travel = foreign concept :/)
21:35:13 <oneswig> Over here we've been wrangling for ages with ways to improve the flexibility in systems mixing virt and bare metal
21:35:27 <oneswig> janders: sadly true
21:36:09 <oneswig> johnthetubaguy sends apologies, he was hoping to come along but was feeling rough this evening.
21:37:02 <oneswig> But I can tell you he's followed similar principles to SuperCloud in a couple of recent projects requiring mixed workloads of HPC and virt
21:37:39 <oneswig> #link CSD3 and UM6P systems https://www.openstack.org/videos/summits/virtual/Lessons-learnt-expanding-Cambridge-Universitys-CSD3-Supercomputer-with-OpenStack
21:38:04 <oneswig> Actually I don't think UM6P was mentioned in that video as it was still gunning for its TOP500 number
21:39:47 <oneswig> The major difference in approach I think relates to Ethernet vs IB networking. Working with Ethernet requires more consideration for switchport configs. IB IIRC had considerations of its own...
21:40:02 <janders> how is the current state of Nova handling compute nodes popping in and out?
21:40:26 <janders> at the time of the Denver demo I had some little hacky manual cleanup to handle that IIRC
21:41:14 <janders> I remember discussing this in the Nova PTG and there were some fixes on the way, but never got to deploying a solution fully based on Nova handling appearing/disappearing hypervisors without out-of-nova manual tweaks
21:41:34 <oneswig> janders: I'm not aware of issues there. The major piece of work is in bringing nodes in and out of the Ansible inventory for OpenStack service deployment.
21:42:35 <janders> Right. Nice! I was always a fan of keeping a significant proportion of complexity/logic in the deployment&management playbooks instead of tweaking OpenStack code too much, so all for that
21:42:56 <janders> oneswig what are the main challenges with ethernet multi-tenant networking?
21:43:50 <janders> I was hoping to have dual IB+eth SDN, but firmware bugs in CX6 prevented that and then I moved on to my current role, so I never got to work on it
21:45:38 <oneswig> There have been issues with changing port config efficiently for multi-node deployments. For the networking-generic-switch driver this led to development of better batching of transactions so that a rack could be provisioned without 52 individual port operations
21:46:49 <janders> oneswig what switching did you use if I may ask?
21:46:54 <oneswig> For hypervisors, typically they are members of several VLANs. That config is applied outside of Ironic.
21:47:19 <oneswig> Cumulus in a couple of cases and Arista in another
21:47:24 <janders> unfortunately I have to say these are all the same old problems :/
21:47:45 <janders> (and some of this seems inherent to BM SDN scalability in general - I've seen some of this on IB)
21:48:19 <janders> I ended up wiring up to four ports per node so that I can get a node into up to four different networks even though trunking wasn't working
21:48:32 <janders> I was lucky I had all that kit sitting around from past projects, so cost wasn't an issue
21:49:16 <oneswig> To be fair I think Ironic is developing significantly. A lot of progress has been made in the last few years. The first time we attempted this in production at scale, it was a lot harder!
21:49:18 <janders> and BM SDN getting out of sync has been a concern across many solutions
21:49:54 <oneswig> janders: That's certainly true.
21:50:07 <janders> are those improvements happening in Ironic itself, or in Neutron & drivers?
21:51:20 <oneswig> Progress on all fronts, but especially with Ironic's capabilities
21:52:00 <janders> would you be able to point me to specific changes? I'd be interested to read up (I might have missed some of these)
21:52:41 <janders> this is great news
21:53:18 <janders> regarding port config batching, are the switches able to do this on a port range basis?
21:53:40 <janders> I've seen cases where that was a bottleneck also
21:53:52 <oneswig> Here's a couple - https://review.opendev.org/c/openstack/networking-generic-switch/+/743269 - cumulus support for NGS
21:53:56 <janders> I wonder if the driver is inefficient because of catering for the lowest common denominator
21:54:20 <oneswig> And https://review.opendev.org/c/openstack/networking-generic-switch/+/743283 - batching transactions (think this one is yet to merge)
21:54:35 <janders> ah, different repo - this is why I missed it
21:54:54 <janders> is there an IRC channel for this project? I'd assume so?
21:55:05 <oneswig> janders: the networking drivers are dumb. For example, they save the config after every operation.
21:55:14 <janders> ouch!
21:55:31 <janders> s/supercomputing/supercongestion....
21:56:06 <janders> being mindful of time, I wanted to throw one more topic into the mix
21:56:16 <oneswig> Right... so without some awareness the Neutron port wiring suddenly becomes a lengthy serialised process
21:56:30 <oneswig> janders: go ahead, hope it's a small one :-)
21:56:31 <janders> how would container-based ephemeral hypervisors fit into your use cases?
21:56:43 <janders> it's big but we can just touch on it and continue another time :)
21:57:03 <janders> as you may know RHAT is pouring a lot of effort into KNI (Kubernetes Native Infrastructure)
21:57:06 <oneswig> like kube-virt?
21:57:12 <janders> Kubevirt is also related to that
21:57:17 <janders> spot on oneswig
21:57:22 <janders> similar end game, different implementation
21:58:04 <janders> if majority of workloads on the system are containerised I think it is a strong option
21:58:12 <janders> (if not, SuperCloud may be more flexible still)
21:58:18 <janders> but I wonder what your thoughts are?
21:59:26 <janders> we can add this topic to agenda for the meeting in two weeks time if you like :)
21:59:43 <janders> my bad for not raising this earlier, though I find SDN discussions fascinating!
22:00:01 <oneswig> It hasn't been a consideration so far. We certainly have users with a high-performance kubernetes focus, but I wouldn't plumb our control plane networks into that world without thinking about it...
22:00:24 <janders> this is a great answer, thank you
22:00:31 <oneswig> janders: same time 2 weeks hence?
22:00:34 <janders> yes!
22:00:44 <janders> thank you all
22:00:46 <oneswig> Sounds good to me
22:00:52 <oneswig> #endmeeting
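
The port-config batching point raised at 21:45:38 and 21:55:05 (52 individual port operations for a rack, with the config saved after every one) can be illustrated with a small sketch. This is not the networking-generic-switch implementation or its API. FakeSwitch, apply_unbatched, apply_batched, the switch name "rack1-tor", the port names and VLAN 100 are all hypothetical, chosen only to show why grouping changes per switch turns many sessions and saves into one.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeSwitch:
    """Hypothetical stand-in for a switch; it only counts sessions and saves."""
    name: str
    sessions: int = 0
    saves: int = 0
    vlans: dict = field(default_factory=dict)  # port name -> VLAN id

    def apply(self, changes):
        """Apply a list of (port, vlan) changes in one session, then save once."""
        self.sessions += 1
        for port, vlan in changes:
            self.vlans[port] = vlan
        self.saves += 1


def apply_unbatched(switch, changes):
    """Naive approach: one session and one config save per port change."""
    for change in changes:
        switch.apply([change])


def apply_batched(switches, requests):
    """Group (switch_name, port, vlan) requests per switch, one session each."""
    per_switch = defaultdict(list)
    for switch_name, port, vlan in requests:
        per_switch[switch_name].append((port, vlan))
    for switch_name, changes in per_switch.items():
        switches[switch_name].apply(changes)


# Hypothetical rack of 52 ports, all moved to VLAN 100.
ports = [f"swp{i}" for i in range(1, 53)]

naive = FakeSwitch("rack1-tor")
apply_unbatched(naive, [(port, 100) for port in ports])

batched = {"rack1-tor": FakeSwitch("rack1-tor")}
apply_batched(batched, [("rack1-tor", port, 100) for port in ports])

print(naive.saves, batched["rack1-tor"].saves)  # prints: 52 1
```

Both approaches leave the same port-to-VLAN mapping on the switch. Only the number of sessions and config saves differs, which is the serialisation cost described in the discussion.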