21:00:04 #startmeeting scientific-sig
21:00:05 Meeting started Tue Jul 21 21:00:04 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:08 The meeting name has been set to 'scientific_sig'
21:00:28 Greetings
21:01:02 Hello
21:01:04 I don't think there is a formal agenda set for today, any items to suggest?
21:01:11 Hi jmlowe, good afternoon.
21:01:16 hello
21:01:30 Bob!
21:01:31 Over here we've been watching 'Stranger Things' - lots of Indiana-themed action.
21:01:36 Hi rbudden
21:01:49 Due to tax credits, filmed in Georgia
21:02:05 ? :)
21:02:06 Doesn't have quite the same look
21:02:10 Hi all
21:02:17 hi trandles martial
21:02:20 I have an agenda item oneswig
21:02:22 #chair martial
21:02:23 Current chairs: martial oneswig
21:02:27 They did get lots of other things just like I remember them
21:02:28 trandles: go for it
21:02:32 It's a quick one, and amusing
21:03:07 If anyone remembers the Slurm plugins I wrote and spoke about in Austin and Barcelona, you'll know that I submitted paperwork to the lab to get them open sourced.
21:03:14 I haven't heard anything since October 2017
21:03:21 I remember!
21:03:23 Even when I asked for status updates, I got nothing back.
21:03:32 You accidentally ripped a hole with some classified high energy experiments and allowed the Gorgon in?
21:03:33 Well, today someone contacted me.
21:03:46 It looks like I'll be able to release that code after all
21:03:58 That's less likely than releasing the Gorgon
21:04:08 Man, that's awesome. I'm going to downgrade to 2017 OpenStack specially
21:04:13 I haven't looked at it since it seemed that it would never be approved to see the light of day.
21:04:36 So I'll get it working with Slurm 20 and the current OpenStack API and get it released ASAP
21:04:54 ... won't that require another approval? :-)
21:05:09 nope
21:05:17 I would imagine it's covered by the open source license
21:05:33 trandles: can you remind us what the plugins did, and how?
21:05:37 I wrote the request in a very open-ended fashion as "Open source contributions to Slurm"
21:05:47 must now release every change
21:06:26 back in a sec, kids are up
21:07:08 Two sets of functionality: 1) Slurm grabs images from Glance and deposits them on the nodes in a job allocation, and 2) dynamic VLAN manipulation via Neutron, controlled by Slurm
21:07:53 I'm not sure how interesting #1 is any more. I'm willing to get it working if anyone cares. #2 is definitely still interesting, at least to us at LANL.
21:08:50 With #2, was that for bare metal networking?
21:09:08 Yeah. Nothing installed on the compute hosts.
21:09:16 Done how?
21:09:39 The prototype I wrote made calls to the Neutron API; Neutron was using the Arista ML2 plugin
21:10:07 Very interesting.
21:10:28 Did you have issues with scaling the number of ports? And reliability?
21:10:30 So it was configuring VLANs on the switch, adding/removing/creating/deleting depending on the job's request.
21:10:55 I never did scale past my testbed size of 48 nodes.
21:11:28 Reliability using Arista's CloudVision was fine
21:11:57 Was it doing the port updates in a single batched operation?
21:12:29 I've heard of similar activities using networking-generic-switch and ansible-networking
21:13:04 I'll have to look at the code again, but as I recall it was one update to the switches per VLAN operation.
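
A minimal sketch of the per-job VLAN manipulation described above, written in Python with openstacksdk purely for illustration; the actual prototype was a Slurm plugin making equivalent Neutron API calls against the Arista ML2 driver, and the cloud name, physnet, job ID, and CIDR below are all hypothetical:

    import openstack

    JOB_ID = "4242"  # hypothetical Slurm job ID, supplied by the plugin

    # Assumes a clouds.yaml entry named "cluster" with suitable credentials.
    conn = openstack.connect(cloud="cluster")

    # Create an isolated VLAN network for this job's allocation; Neutron
    # allocates a segmentation ID and the ML2 driver programs the switches.
    net = conn.network.create_network(
        name=f"slurm-job-{JOB_ID}",
        provider_network_type="vlan",
        provider_physical_network="physnet1",  # hypothetical physnet name
    )
    conn.network.create_subnet(
        network_id=net.id,
        ip_version=4,
        cidr="10.42.0.0/24",  # hypothetical per-job address range
    )

    # ... the job's node ports would be placed on this network here; the
    # exact mechanics depend on the ML2 mechanism driver in use ...

    # At job teardown, the epilog-side hook reverses the operation.
    for port in conn.network.ports(network_id=net.id):
        conn.network.delete_port(port)
    conn.network.delete_network(net)

Note that each port create/delete here would map onto a separate switch update, consistent with the one-update-per-VLAN-operation behaviour recalled above.
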
21:13:07 Isn't there a neutron vxlan vtep thingy
21:13:25 One persistent problem appears to be making changes to large numbers of ports; if they can't be bulk-processed then it can lead to timeouts being exceeded.
21:14:04 jmlowe: if the hardware supports it, yes, but I am not sure which mechanism drivers support that for bare metal.
21:14:19 The two use cases were isolating the nodes in a job, for instance if you were going to start up an insecure listening service (like Dask at the time), and making network-attached resources available ONLY to specific nodes in an allocation.
21:15:47 So for the latter, imagine you have an RDBMS like Oracle that you don't want an entire cluster to be able to see, or a filesystem that should only be mounted by specific nodes in an allocation.
21:16:20 question: some people here are having a lot of issues with Octavia, is there some advice that
21:16:26 I can pass along?
21:16:31 I had thought about going further and doing things like manipulating flows at the switch hardware level
21:16:43 But never got that far with it
21:17:23 trandles: so how long would it take to program the switch with new VLAN config?
21:17:43 On a 2-switch testbed, it was fast
21:18:04 And how did you interface between slurmctld and Neutron?
21:18:05 I want to say it didn't delay job start by more than a few seconds
21:18:22 martial: What issues are you having with Octavia?
21:18:31 The OpenStack API calls were all made by a Slurm job start plugin
21:18:46 Slurmctld that is, not slurmd
21:19:15 It was a lot like some fabric management stuff that Slurm had long ago for the BlueGene systems from IBM, IIRC
21:19:21 Although that's been removed from Slurm I think
21:20:05 I've since wondered if using PMIx's fabric setup hooks would be a better generic solution for job-specific SDN
21:20:47 https://pmix.github.io/standard/fabric-manager-roles-and-expectations
21:22:04 For Ethernet and bare metal Slurm, it would be a nice additional method of job isolation.
21:23:31 I agree. I'm interested to know if Slingshot, since it's based on Ethernet, will have similar isolation mechanisms.
21:23:49 I just haven't had time to get near Slingshot yet. Perhaps some of you have.
21:24:19 I wouldn't know. They shipped 4 Shasta cabinets to Edinburgh Parallel Computing Centre a few weeks ago.
21:24:43 Kill them with fire and throw them in a loch :P
21:25:05 Deployment running a bit behind schedule, it seems.
21:25:10 (My opinion, I speak not for LANL/NNSA/DOE)
21:25:43 I'm sure you're just jealous of such a large body of fresh water :-)
21:25:54 trandles: Are you aware of the OpenDev session on bare metal networking tomorrow? My colleague Mark is chairing.
21:26:06 You should join in!
21:26:23 Unfortunately I've been too swamped to attend any of this week's sessions. :(
21:26:28 What time tomorrow?
21:27:00 13:15 UTC
21:27:10 #link baremetal networking etherpad https://etherpad.opendev.org/p/OpenDev_HardwareAutomation_Day3
21:27:20 johnsom: truthfully not 100% sure, I heard issues about port collisions
21:27:31 7:15 here, I'll be available and will attend
21:28:10 Excellent. It's always useful to get many points of view.
21:28:11 martial: Hm, ok, well let them know we are in the #openstack-lbaas channel and happy to help.
21:28:55 johnsom: will do
21:28:58 In other relevant news, I've been working quite a lot with Ironic and will be doing an internal demo for some systems staff. It's a candidate for our next bare metal cluster provisioning system.
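
To illustrate the second use case above (exposing a network-attached resource only to a job's nodes), a hedged sketch, again in Python with openstacksdk; the network name, hostnames, and the use of Ironic VIF attachment are assumptions for illustration, since the prototype called Neutron directly with the Arista ML2 driver:

    import openstack

    conn = openstack.connect(cloud="cluster")  # assumes a clouds.yaml entry

    # Hypothetical pre-existing network carrying a restricted resource,
    # e.g. an RDBMS or a filesystem that only some allocations may mount.
    storage_net = conn.network.find_network("oracle-vlan")

    # Grant access only to the bare metal nodes in this job's allocation;
    # the job epilog would detach and delete these ports again.
    for hostname in ("node01", "node02"):  # hypothetical allocation
        node = conn.baremetal.find_node(hostname)
        port = conn.network.create_port(
            network_id=storage_net.id,
            name=f"slurm-{hostname}-restricted",  # hypothetical naming
        )
        conn.baremetal.attach_vif_to_node(node, port.id)

Creating and attaching ports per job keeps the restricted network invisible to nodes outside the allocation, matching the Oracle/filesystem example above.
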
21:29:23 cool, re tomorrow, will see if my meeting is done and will try to attend :)
21:30:18 trandles: that would be very cool. Is that with the deploy-to-ramdisk stuff?
21:31:00 trandles: did you sign up for OpenDev this week? https://www.eventbrite.com/e/opendev-hardware-automation-registration-104569991660
21:31:09 Some of it is. I'm pitching it to replace all provisioning, not just the cluster but also lots of diskful infrastructure services too
21:32:39 We've been having fun recently with Ironic's new deploy-to-software-RAID capability
21:33:17 Registered. I thought I had, but that was for the containers in production event
21:33:18 #link Software RAID in Ironic blog https://www.stackhpc.com/software-raid-in-ironic.html
21:34:08 Did you also see the baremetal whitepaper came out this week?
21:34:11 I saw your section of the new Ironic whitepaper
21:34:22 Ah, great (it's the same stuff)
21:36:11 #link Baremetal Ironic whitepaper https://www.openstack.org/bare-metal/how-ironic-delivers-abstraction-and-automation-using-open-source-infrastructure
21:37:24 I have to drop off, alas - all this typing is waking my kids up (either that or the after-effects of Stranger Things)
21:37:54 martial: can you take over as chair?
21:38:22 Interesting discussion, hopefully catch you at OpenDev tomorrow
21:39:10 sure thing
21:39:35 I would bet Stranger Things has something to do with it :)
21:41:03 checking the agenda to see if anything else is to be covered today
21:42:50 okay, not much there :)
21:43:09 #topic AOB
21:43:26 opening the floor for additional conversations as needed
21:45:55 well, offering to save people 15 minutes then :)
21:47:46 Sorry, had a phone call
21:47:59 I'll be at OpenDev tomorrow, later everyone
21:48:36 bye then, bye all :)
21:48:40 #end-meeting
21:48:46 #endmeeting
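
For reference, Ironic's software RAID deploy discussed above is driven by the node's target RAID configuration; a minimal example in the spirit of the linked blog post, where the node name and disk sizes are hypothetical:

    openstack baremetal node set my-node --target-raid-config '{
      "logical_disks": [
        {"size_gb": 100, "raid_level": "1", "controller": "software"},
        {"size_gb": "MAX", "raid_level": "0", "controller": "software"}
      ]
    }'

The configuration is then realised on the node by running Ironic's RAID clean steps (delete_configuration followed by create_configuration) before deployment, as the blog post describes.
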