21:00:35 #startmeeting scientific-sig
21:00:36 Meeting started Tue Feb 16 21:00:35 2021 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:39 The meeting name has been set to 'scientific_sig'
21:00:54 greetings
21:01:01 hello
21:01:17 hello :)
21:01:45 #chair martial_
21:01:46 Current chairs: martial_ oneswig
21:01:55 how's it going?
21:02:28 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_February_16th_2021
21:03:04 b1airo raised an excellent point on the Baremetal SIG's activities
21:03:29 #link Ironic deploy steps video from dtantsur https://youtu.be/uyN481mqdOs
21:03:33 brb
21:03:51 hi all... splitting attention between two meetings :(
21:04:46 Hi trandles, glad you could make it
21:05:27 Time splicing here as well, but it’s been too long, so I wanted to at least lurk!
21:06:06 not meetings but multiple terminals :)
21:06:20 g'day team!
21:06:22 The Baremetal SIG has taken the initiative with a nice video. Back in January there was a thought we could do something similar.
21:06:29 Hey janders, nice to see you
21:06:44 exciting topics for today's agenda
21:08:18 Well, I wonder what SIG members might be able to present on in a similar manner to help raise the SIG's profile.
21:08:39 User stories perhaps, or specific pieces of tech and method
21:08:45 * janders is quickly grabbing a coffee, just woke up
21:12:13 I was showing a work colleague the SIG's book yesterday - reminded me how cool that was as a group effort
21:13:08 I should have more real-world use case stories later this year, now that we're _finally_ getting a production deployment.
21:13:32 trandles: that's great to hear.
21:13:42 we have multiple projects in flight that will take advantage of this infrastructure
21:13:47 How have things progressed in the last couple of weeks?
21:13:55 We should have our new cloud Explore up and running this year, so use cases, and possibly some migration war stories of combining two clouds
21:14:28 rbudden: migrating users and projects from one cloud to another?
21:14:31 we just blew up a fairly fundamental roadblock on our deployment last week, so progress had been slow but will accelerate now
21:14:46 oneswig: yes, migrating two OpenStack clouds into a single consolidated cloud
21:15:05 #link priteau came across this a while back and it looks handy (but I haven't tried it myself) https://github.com/os-migrate/os-migrate
21:15:45 I'm looking forward to learning from Mike's upcoming deployment at Indiana. Also, Chris Layton should be doing at least one new deployment in the coming weeks.
21:15:49 Interesting, I’ll have to take a look at this
21:15:59 Thanks oneswig !
21:16:06 +1
21:16:07 Chris is at Oak Ridge... forgot to mention that
21:16:42 trandles: is ORNL doing an Ironic deploy like yours?
21:17:10 I think ORNL will be RHOSP 16.1 like mine
21:17:45 Nice work.
21:17:49 Mike is planning to do everything IPv6, so there might be lots to come from that work. I wish he was here...
21:18:21 We have a project with a /28 public IPv4 allocation - so that'll be mostly IPv6 as well
21:19:00 Not quite getting started on that part yet. But perhaps it won't be the last machine where I have about as many IPv4 addresses as I have fingers and toes
21:19:44 We’re currently doing some IPv6. It’s a bit gnarly, but functional. I’ll be curious to see Mike’s work as well.
21:20:56 I do wonder if IPv6 is hugely neglected in network security
21:21:29 (I'm sure not by you guys)
21:21:52 oneswig: our network security plan actually says "disable IPv6"
21:22:14 that solves the issue
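[Note: os-migrate, linked above, is an Ansible-based toolkit and is not shown here. As a minimal sketch of the kind of pre-migration stocktake that rbudden's cloud consolidation implies, the snippet below uses openstacksdk to count projects, networks, servers and images in the source cloud. The clouds.yaml entry name 'source-cloud' and the choice of resources are illustrative assumptions, not part of the discussion above.]

```python
import openstack

# Connect to the source cloud; 'source-cloud' is a hypothetical clouds.yaml entry.
src = openstack.connect(cloud='source-cloud')

# Count the resources that would have to move to the consolidated cloud.
projects = list(src.identity.projects())
networks = list(src.network.networks())
servers = list(src.compute.servers(details=True, all_projects=True))
images = list(src.image.images())

print(f"projects: {len(projects)}")
print(f"networks: {len(networks)}")
print(f"servers:  {len(servers)}")
print(f"images:   {len(images)}")

# Per-project server counts give a rough idea of migration batches.
by_project = {}
for server in servers:
    by_project[server.project_id] = by_project.get(server.project_id, 0) + 1
for project_id, count in sorted(by_project.items(), key=lambda kv: -kv[1]):
    print(f"{project_id}: {count} servers")
```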
21:24:02 trandles: had some discussion recently on the ramdisk driver. Do you know of other sites using it?
21:24:14 oneswig: I do not
21:24:53 Ah, too bad. Let's keep an eye out.
21:25:01 There may be some news RE: kexec deploy in the next week or two.
21:27:12 That would be good. One of our team was raising earlier that provisioning cycle times were ~35 minutes on a bare metal system. Anything to help with that is good - although does it only work in flat networking?
21:27:55 that's a lot of awesome projects going on! :)
21:28:22 oneswig: I can't answer that question at this point. That's a Julia question probably.
21:29:04 all of my ironic stuff with clusters is no-op networking
21:29:19 trandles - kexec - awesome!
21:29:45 oneswig with 35 minute deploys, how much of this is due to reboots between BIOS/RAID config etc?
21:30:48 fast track, kexec (and soon NVMe cleaning) can help rapid node turnaround, but hardware config might not be keeping up with that pace :/
21:30:49 I don't know at this stage, just heard it this afternoon. It might possibly be a dependency in Terraform causing nodes to be deployed in phases. I'm speculating
21:31:40 I'm just thinking aloud: for low-latency, high-turnover provisioning, it would likely make sense to come up with a uniform BIOS/RAID/... config and simplify deploy and cleaning steps as much as possible, to avoid reboots
21:31:46 This kit is old. NVMes are a mirage on the horizon
21:32:19 janders: now you're caffeinated, shall we move on?
21:32:35 oneswig yes! :)
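[Note: following the exchange above about ~35-minute deploys, one way to tell whether the time goes into individual nodes (reboots, BIOS/RAID jobs) or into nodes being provisioned in phases (e.g. a Terraform dependency) is to watch Ironic provision states from the API. The sketch below is illustrative only; the clouds.yaml entry 'ironic-cloud' and the polling approach are assumptions, not anything described in the meeting.]

```python
import time
import openstack

# 'ironic-cloud' is a hypothetical clouds.yaml entry for the cloud in question.
conn = openstack.connect(cloud='ironic-cloud')

started, finished = {}, {}

# Poll node provision states and report how long each node takes to reach
# 'active'. Clustered start times suggest phased (e.g. dependency-ordered)
# provisioning; long individual times point at reboots and firmware jobs.
while True:
    for node in conn.baremetal.nodes(details=True):
        if node.provision_state in ('deploying', 'wait call-back') and node.id not in started:
            started[node.id] = time.monotonic()
            print(f"{node.name or node.id}: deploy started")
        elif node.provision_state == 'active' and node.id in started and node.id not in finished:
            finished[node.id] = time.monotonic() - started[node.id]
            print(f"{node.name or node.id}: active after {finished[node.id]:.0f}s")
    if started and len(finished) == len(started):
        break
    time.sleep(30)
```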
21:32:50 #topic superclouds - virt on bare metal
21:33:44 janders made the comment 2 weeks ago that we should dedicate time to this subject, I think it's well deserved.
21:34:11 #link For context, janders' video on SuperCloud https://www.openstack.org/videos/summits/denver-2019/its-a-cloud-its-a-supercomputer-no-its-supercloud
21:35:12 oneswig this brings back memories... hard to believe how much the world has changed since (intl travel = foreign concept :/)
21:35:13 Over here we've been wrangling for ages with ways to improve the flexibility in systems mixing virt and bare metal
21:35:27 janders: sadly true
21:36:09 johnthetubaguy sends apologies, he was hoping to come along but was feeling rough this evening.
21:37:02 But I can tell you he's followed similar principles to SuperCloud in a couple of recent projects requiring mixed workloads of HPC and virt
21:37:39 #link CSD3 and UM6P systems https://www.openstack.org/videos/summits/virtual/Lessons-learnt-expanding-Cambridge-Universitys-CSD3-Supercomputer-with-OpenStack
21:38:04 Actually I don't think UM6P was mentioned in that video as it was still gunning for its TOP500 number
21:39:47 The major difference in approach I think relates to Ethernet vs IB networking. Working with Ethernet requires more consideration for switchport configs. IB IIRC had considerations of its own...
21:40:02 how is the current state of Nova handling compute nodes popping in and out?
21:40:26 at the time of the Denver demo I had some little hacky manual cleanup to handle that IIRC
21:41:14 I remember discussing this in the Nova PTG and there were some fixes on the way, but never got to deploying a solution fully based on Nova handling appearing/disappearing hypervisors without out-of-Nova manual tweaks
21:41:34 janders: I'm not aware of issues there. The major piece of work is in bringing nodes in and out of the Ansible inventory for OpenStack service deployment.
21:42:35 Right. Nice! I was always a fan of keeping a significant proportion of complexity/logic in the deployment & management playbooks instead of tweaking OpenStack code too much, so all for that
21:42:56 oneswig what are the main challenges with Ethernet multi-tenant networking?
21:43:50 I was hoping to have dual IB+Eth SDN, but firmware bugs in CX6 prevented that and then I moved on to my current role, so I never got to work on it
21:45:38 There have been issues with changing port config efficiently for multi-node deployments. For the networking-generic-switch driver this led to development of better batching of transactions so that a rack could be provisioned without 52 individual port operations
21:46:49 oneswig what switching did you use if I may ask?
21:46:54 For hypervisors, typically they are members of several VLANs. That config is applied outside of Ironic.
21:47:19 Cumulus in a couple of cases and Arista in another
21:47:24 unfortunately I have to say these are all the same old problems :/
21:47:45 (and some of this seems inherent to BM SDN scalability in general - I've seen some of this on IB)
21:48:19 I ended up wiring up to four ports per node so that I could get a node into up to four different networks even though trunking wasn't working
21:48:32 I was lucky I had all that kit sitting around from past projects, so cost wasn't an issue
21:49:16 To be fair I think Ironic is developing significantly. A lot of progress has been made in the last few years. The first time we attempted this in production at scale, it was a lot harder!
21:49:18 and BM SDN getting out of sync has been a concern across many solutions
21:49:54 janders: That's certainly true.
21:50:07 are those improvements happening in Ironic itself, or in Neutron & drivers?
21:51:20 Progress on all fronts, but especially with Ironic's capabilities
21:52:00 would you be able to point me to specific changes? I'd be interested to read up (I might have missed some of these)
21:52:41 this is great news
21:53:18 regarding port config batching, are the switches able to do this on a port-range basis?
21:53:40 I've seen cases where that was a bottleneck also
21:53:52 Here's a couple - https://review.opendev.org/c/openstack/networking-generic-switch/+/743269 - Cumulus support for NGS
21:53:56 I wonder if the driver is inefficient because of catering for the lowest common denominator
21:54:20 And https://review.opendev.org/c/openstack/networking-generic-switch/+/743283 - batching transactions (I think this one is yet to merge)
21:54:35 ah, different repo - this is why I missed it
21:54:54 is there an IRC channel for this project? I'd assume so?
21:55:05 janders: the networking drivers are dumb. For example, they save the config after every operation.
21:55:14 ouch!
21:55:31 s/supercomputing/supercongestion....
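[Note: the batching change linked above lives in networking-generic-switch itself and is not reproduced here. To illustrate why batching matters, the hedged sketch below uses netmiko to push a whole rack's access-VLAN changes in one configuration session with a single save, instead of a connect/configure/save cycle per port. The switch details, port names and VLAN IDs are hypothetical.]

```python
from netmiko import ConnectHandler  # assumption: netmiko is the switch CLI library in use


def set_access_vlans(switch, port_to_vlan):
    """Apply access-VLAN changes for many ports in one switch session."""
    commands = []
    for port, vlan in port_to_vlan.items():
        commands += [f"interface {port}", f"switchport access vlan {vlan}"]
    conn = ConnectHandler(**switch)
    try:
        conn.send_config_set(commands)  # one batched transaction for the whole rack
        conn.save_config()              # save once, not once per port
    finally:
        conn.disconnect()


# Hypothetical example values: 48 ports on one rack switch moved to VLAN 120.
set_access_vlans(
    {"device_type": "arista_eos", "host": "rack1-sw", "username": "admin", "password": "..."},
    {f"Ethernet{n}": 120 for n in range(1, 49)},
)
```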
21:56:06 being mindful of time, I wanted to throw one more topic into the mix
21:56:16 Right... so without some awareness the Neutron port wiring suddenly becomes a lengthy serialised process
21:56:30 janders: go ahead, hope it's a small one :-)
21:56:31 how would container-based ephemeral hypervisors fit into your use cases?
21:56:43 it's big but we can just touch on it and continue another time :)
21:57:03 as you may know RHAT is pouring a lot of effort into KNI (Kubernetes Native Infrastructure)
21:57:06 like kube-virt?
21:57:12 KubeVirt is also related to that
21:57:17 spot on oneswig
21:57:22 similar end game, different implementation
21:58:04 if the majority of workloads on the system are containerised I think it is a strong option
21:58:12 (if not, SuperCloud may be more flexible still)
21:58:18 but I wonder what your thoughts are?
21:59:26 we can add this topic to the agenda for the meeting in two weeks' time if you like :)
21:59:43 my bad for not raising this earlier, though I find SDN discussions fascinating!
22:00:01 It hasn't been a consideration so far. We certainly have users with a high-performance Kubernetes focus, but I wouldn't plumb our control plane networks into that world without thinking about it...
22:00:24 this is a great answer, thank you
22:00:31 janders: same time 2 weeks hence?
22:00:34 yes!
22:00:44 thank you all
22:00:46 Sounds good to me
22:00:52 #endmeeting
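[Note: for readers unfamiliar with the KubeVirt approach touched on above, the minimal sketch below shows a VM expressed as a Kubernetes custom resource and submitted with the kubernetes Python client. It assumes a cluster with KubeVirt already installed; the VM name, namespace and container-disk image are illustrative, not something discussed in the meeting.]

```python
from kubernetes import client, config

# Illustrative only: with KubeVirt, a VM is a VirtualMachine custom resource
# submitted like any other Kubernetes object.
config.load_kube_config()

vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "demo-vm"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "root", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "1Gi", "cpu": "1"}},
                },
                "volumes": [
                    {"name": "root",
                     "containerDisk": {"image": "quay.io/kubevirt/cirros-container-disk-demo"}},
                ],
            }
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubevirt.io", version="v1", namespace="default",
    plural="virtualmachines", body=vm,
)
```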