21:00:35 <oneswig> #startmeeting scientific-sig
21:00:36 <openstack> Meeting started Tue Feb 16 21:00:35 2021 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:39 <openstack> The meeting name has been set to 'scientific_sig'
21:00:54 <oneswig> greetings
21:01:01 <rbudden> hello
21:01:17 <martial_> hello :)
21:01:45 <oneswig> #chair martial_
21:01:46 <openstack> Current chairs: martial_ oneswig
21:01:55 <oneswig> how's it going?
21:02:28 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_February_16th_2021
21:03:04 <oneswig> b1airo raised an excellent point on the Baremetal SIG's activities
21:03:29 <oneswig> #link Ironic deploy steps video from dtantsur https://youtu.be/uyN481mqdOs
21:03:33 <oneswig> brb
21:03:51 <trandles> hi all...splitting attention between two meetings :(
21:04:46 <oneswig> Hi trandles, glad you could make it
21:05:27 <rbudden> Time slicing here as well, but it’s been too long, so I wanted to at least lurk!
21:06:06 <martial_> not meetings but multiple terminals :)
21:06:20 <janders> g'day team!
21:06:22 <oneswig> The Baremetal SIG has taken the initiative with a nice video.  Back in January there was a thought we could do something similar.
21:06:29 <oneswig> Hey janders, nice to see you
21:06:44 <martial_> exciting topics for today's agenda
21:08:18 <oneswig> Well I wonder what SIG members might be able to present on in a similar manner to help raise the SIG's profile.
21:08:39 <oneswig> User stories perhaps, or specific pieces of tech and method
21:08:45 * janders is quickly grabbing a coffee, just woke up
21:12:13 <oneswig> I was showing a work colleague the SIG's book yesterday - reminded me how cool that was as a group effort
21:13:08 <trandles> I should have more real world use case stories later this year, now that we're _finally_ getting a production deployment.
21:13:32 <oneswig> trandles: that's great to hear.
21:13:42 <trandles> we have multiple projects in flight that will take advantage of this infrastructure
21:13:47 <oneswig> How have things progressed in the last couple of weeks?
21:13:55 <rbudden> We should have our new cloud, Explore, up and running this year, so expect use cases, and possibly some migration war stories from combining two clouds
21:14:28 <oneswig> rbudden: migrating users and projects from one cloud to another?
21:14:31 <trandles> we just blew up a fairly fundamental road block on our deployment last week, so progress had been slow but will accelerate now
21:14:46 <rbudden> oneswig: yes, migrating two OpenStack clouds into a single consolidated cloud
21:15:05 <oneswig> #link priteau came across this a while back and it looks handy (but haven't tried it myself) https://github.com/os-migrate/os-migrate
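For reference, os-migrate ships as an Ansible collection; a rough sketch of the usual export/edit/import flow follows (the playbook names and variables here are illustrative assumptions, not checked against the project README):

    # install the collection, then export resources from the source cloud to YAML
    ansible-galaxy collection install os_migrate.os_migrate
    ansible-playbook -i inventory.yml \
        -e os_migrate_data_dir=/tmp/os-migrate \
        os_migrate.os_migrate.export_networks

    # review/edit the exported YAML, then replay it against the destination cloud
    ansible-playbook -i inventory.yml \
        -e os_migrate_data_dir=/tmp/os-migrate \
        os_migrate.os_migrate.import_networks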
21:15:45 <trandles> I'm looking forward to learning from Mike's upcoming deployment at Indiana. Also, Chris Layton should be doing at least one new deployment in the coming weeks.
21:15:49 <rbudden> Interesting, I’ll have to take a look at this
21:15:59 <rbudden> Thanks oneswig !
21:16:06 <oneswig> +1
21:16:07 <trandles> Chris is at Oak Ridge...forgot to mention that
21:16:42 <oneswig> trandles: is ORNL doing an Ironic deploy like yours?
21:17:10 <trandles> I think ORNL will be RHOSP 16.1 like mine
21:17:45 <oneswig> Nice work.
21:17:49 <trandles> Mike is planning to do everything IPv6, so there might be lots to come from that work. I wish he was here...
21:18:21 <oneswig> We have a project with a /28 public IPv4 allocation - so that'll be mostly IPv6 as well
21:19:00 <oneswig> Not quite getting started on that part yet.  But perhaps it won't be the last machine where I have about as many IPv4 addresses as I have fingers and toes
21:19:44 <rbudden> We’re currently doing some IPv6. It’s a bit gnarly, but functional. I’ll be curious to see Mike’s work as well.
21:20:56 <oneswig> I do wonder if IPv6 is hugely neglected in network security
21:21:29 <oneswig> (I'm sure not by you guys)
21:21:52 <trandles> oneswig: our network security plan actually says "disable IPv6"
21:22:14 <oneswig> that solves the issue
21:24:02 <oneswig> trandles: had some discussion recently on the ramdisk driver.  Do you know of other sites using it?
21:24:14 <trandles> oneswig: I do not
21:24:53 <oneswig> Ah, too bad.  Let's keep an eye out.
21:25:01 <trandles> There may be some news RE: kexec deploy in the next week or two.
21:27:12 <oneswig> That would be good.  One of our team mentioned earlier that provisioning cycle times were ~35 minutes on a bare metal system.  Anything to help with that is good - although does it only work with flat networking?
21:27:55 <janders> that's a lot of awesome projects going on! :)
21:28:22 <trandles> oneswig: I can't answer that question at this point. That's a Julia question probably.
21:29:04 <trandles> all of my ironic stuff with clusters is no-op networking
21:29:19 <janders> trandles - kexec -  awesome!
21:29:45 <janders> oneswig with 35-minute deploys, how much of this is due to reboots between BIOS/RAID config etc?
21:30:48 <janders> fast track, kexec (and soon NVMe cleaning) can help rapid node turnaround, but hardware config might not be keeping up with that pace :/
21:30:49 <oneswig> I don't know at this stage, just heard it this afternoon.  It might possibly be a dependency in Terraform causing nodes to be deployed in phases.  I'm speculating
21:31:40 <janders> I'm just thinking aloud: for low-latency high-turnover provisioning, it would likely make sense to come up with uniform BIOS/RAID/... config and simplify deploy and cleaning steps as much as possible, to avoid reboots
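For context on the fast track and cleaning knobs mentioned above, both are Ironic configuration; a minimal sketch, assuming recent Ironic releases (option names believed correct but worth verifying against the Ironic docs):

    # ironic.conf: keep nodes powered on in the agent ramdisk between operations,
    # skipping a reboot on deploy
    [deploy]
    fast_track = true

    # skip automated cleaning (where site policy allows), avoiding a full
    # clean-and-reboot cycle on every teardown
    [conductor]
    automated_clean = false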
21:31:46 <oneswig> This kit is old.  NVMe drives are a mirage on the horizon
21:32:19 <oneswig> janders: now you're caffeinated, shall we move on?
21:32:35 <janders> oneswig yes! :)
21:32:50 <oneswig> #topic superclouds - virt on bare metal
21:33:44 <oneswig> janders made the comment 2 weeks ago that we should dedicate time to this subject, I think it's well deserved.
21:34:11 <oneswig> #link For context, janders video on SuperCloud https://www.openstack.org/videos/summits/denver-2019/its-a-cloud-its-a-supercomputer-no-its-supercloud
21:35:12 <janders> oneswig this brings back memories... hard to believe how much the world has changed since (intl travel = foreign concept :/)
21:35:13 <oneswig> Over here we've been wrangling for ages with ways to improve the flexibility in systems mixing virt and bare metal
21:35:27 <oneswig> janders: sadly true
21:36:09 <oneswig> johnthetubaguy sends apologies, he was hoping to come along but was feeling rough this evening.
21:37:02 <oneswig> But I can tell you he's followed similar principles to SuperCloud in a couple of recent projects requiring mixed workloads of HPC and virt
21:37:39 <oneswig> #link CSD3 and UM6P systems https://www.openstack.org/videos/summits/virtual/Lessons-learnt-expanding-Cambridge-Universitys-CSD3-Supercomputer-with-OpenStack
21:38:04 <oneswig> Actually I don't think UM6P was mentioned in that video as it was still gunning for its TOP500 number
21:39:47 <oneswig> The major difference in approach I think relates to Ethernet vs IB networking.  Working with Ethernet requires more consideration for switchport configs.  IB IIRC had considerations of its own...
21:40:02 <janders> how is the current state of Nova handling compute nodes popping in and out?
21:40:26 <janders> at the time of the Denver demo I had some little hacky manual cleanup to handle that IIRC
21:41:14 <janders> I remember discussing this at the Nova PTG and there were some fixes on the way, but I never got to deploying a solution fully based on Nova handling appearing/disappearing hypervisors without out-of-Nova manual tweaks
21:41:34 <oneswig> janders: I'm not aware of issues there.  The major piece of work is in bringing nodes in and out of the Ansible inventory for OpenStack service deployment.
21:42:35 <janders> Right. Nice! I was always a fan of keeping a significant proportion of the complexity/logic in the deployment & management playbooks instead of tweaking OpenStack code too much, so I'm all for that
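For the record, the out-of-Nova cleanup when a hypervisor is withdrawn usually amounts to something like the following; this is a sketch using python-openstackclient plus the osc-placement plugin, with placeholder IDs, rather than a verified procedure:

    # after disabling and draining the host, remove its nova-compute service record
    openstack compute service list --service nova-compute
    openstack compute service delete <service-id>

    # drop the orphaned resource provider so the scheduler stops offering the node
    # (newer Nova releases may remove the provider automatically on service delete)
    openstack resource provider list --name <hostname>
    openstack resource provider delete <provider-uuid>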
21:42:56 <janders> oneswig what are the main challenges with ethernet multi-tenant networking?
21:43:50 <janders> I was hoping to have dual IB+eth SDN, but firmware bugs in CX6 prevented that and then I moved on to my current role, so I never got to work on it
21:45:38 <oneswig> There have been issues with changing port config efficiently for multi-node deployments.  For the networking-generic-switch driver this led to development of better batching of transactions so that a rack could be provisioned without 52 individual port operations
21:46:49 <janders> oneswig what switching did you use if I may ask?
21:46:54 <oneswig> For hypervisors, typically they are members of several VLANs.  That config is applied outside of Ironic.
21:47:19 <oneswig> Cumulus in a couple of cases and Arista in another
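For anyone following along, networking-generic-switch is configured per switch in the Neutron ML2 plugin config; an illustrative entry for an Arista ToR might look roughly like this (the addresses, credentials and the ngs_manage_vlans flag are example assumptions):

    [genericswitch:arista-tor-1]
    device_type = netmiko_arista_eos
    ip = 192.0.2.10
    username = neutron
    password = secret
    ngs_manage_vlans = true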
21:47:24 <janders> unfortunately I have to say these are all the same old problems :/
21:47:45 <janders> (and some of this seems inherent to BM SDN scalability in general - I've seen some of this on IB)
21:48:19 <janders> I ended up wiring up to four ports per node so that I could get a node into up to four different networks even though trunking wasn't working
21:48:32 <janders> I was lucky I had all that kit sitting around from past projects, so cost wasn't an issue
21:49:16 <oneswig> To be fair I think Ironic is developing significantly.  A lot of progress has been made in the last few years.  The first time we attempted this in production at scale, it was a lot harder!
21:49:18 <janders> and BM SDN getting out of sync has been a concern across many solutions
21:49:54 <oneswig> janders: That's certainly true.
21:50:07 <janders> are those improvements happening in Ironic itself, or in Neutron & drivers?
21:51:20 <oneswig> Progress on all fronts, but especially with Ironic's capabilities
21:52:00 <janders> would you be able to point me to specific changes? I'd be interested to read up (I might have missed some of these)
21:52:41 <janders> this is great news
21:53:18 <janders> regarding port config batching, are the switches able to do this on a port range basis?
21:53:40 <janders> I've seen cases where that was a bottleneck also
21:53:52 <oneswig> Here's a couple - https://review.opendev.org/c/openstack/networking-generic-switch/+/743269 - cumulus support for NGS
21:53:56 <janders> I wonder if the driver is inefficient because of catering for the lowest common denominator
21:54:20 <oneswig> And https://review.opendev.org/c/openstack/networking-generic-switch/+/743283 - batching transactions (think this one is yet to merge)
21:54:35 <janders> ah, different repo - this is why I missed it
21:54:54 <janders> is there an IRC channel for this project? I'd assume so?
21:55:05 <oneswig> janders: the networking drivers are dumb.  For example, they save the config after every operation.
21:55:14 <janders> ouch!
21:55:31 <janders> s/supercomputing/supercongestion....
21:56:06 <janders> being mindful of time, I wanted to throw one more topic into the mix
21:56:16 <oneswig> Right... so without some awareness the Neutron port wiring suddenly becomes a lengthy serialised process
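To illustrate the batching point: a naive driver pushes and saves one change per port, whereas the batched approach sends the whole rack's changes in one session with a single save. Roughly, in Arista-style switch CLI (commands illustrative):

    # naive: repeated once per port, 52 times, each with its own config save
    configure
    interface Ethernet1
      switchport access vlan 1234
    end
    write memory

    # batched: one session covering every port in the rack, one save at the end
    configure
    interface Ethernet1
      switchport access vlan 1234
    interface Ethernet2
      switchport access vlan 1234
    ...
    end
    write memory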
21:56:30 <oneswig> janders: go ahead, hope it's a small one :-)
21:56:31 <janders> how would container-based ephemeral hypervisors fit into your use cases?
21:56:43 <janders> it's big but we can just touch on it and continue another time :)
21:57:03 <janders> as you may know RHAT is pouring a lot of effort into KNI (Kubernetes Native Infrastructure)
21:57:06 <oneswig> like kube-virt?
21:57:12 <janders> Kubevirt is also related to that
21:57:17 <janders> spot on oneswig
21:57:22 <janders> similar end game, different implementation
21:58:04 <janders> if the majority of workloads on the system are containerised I think it is a strong option
21:58:12 <janders> (if not, SuperCloud may be more flexible still)
21:58:18 <janders> but I wonder what your thoughts are?
21:59:26 <janders> we can add this topic to agenda for the meeting in two weeks time if you like :)
21:59:43 <janders> my bad for not raising this earlier, though I find SDN discussions fascinating!
22:00:01 <oneswig> It hasn't been a consideration so far.  We certainly have users with a high-performance kubernetes focus, but I wouldn't plumb our control plane networks into that world without thinking about it...
22:00:24 <janders> this is a great answer, thank you
22:00:31 <oneswig> janders: same time 2 weeks hence?
22:00:34 <janders> yes!
22:00:44 <janders> thank you all
22:00:46 <oneswig> Sounds good to me
22:00:52 <oneswig> #endmeeting