11:00:16 <oneswig> #startmeeting scientific-sig
11:00:17 <openstack> Meeting started Wed Mar 25 11:00:16 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:21 <openstack> The meeting name has been set to 'scientific_sig'
11:01:08 <janders> g'day! :)
11:01:13 <janders> hope everyone is keeping safe
11:01:16 <verdurin> Morning.
11:01:26 <oneswig> hi all
11:01:43 <janders> here packets are still allowed to leave NIC ports, we'll see for how long
11:02:17 <belmoreira> hello
11:02:28 <oneswig> all healthy here
11:03:02 <janders> that's great to hear :)
11:03:30 <oneswig> There isn't an agenda today - apologies - I've been helping with a training workshop this week
11:03:56 <janders> no worries
11:06:18 <oneswig> I've heard a few things related to coronavirus and providing compute resources to help research
11:06:30 <oneswig> Anyone else active in this?
11:07:01 <verdurin> Yes, we're active there - lots of frantic requests and cluster re-arrangements.
11:07:13 <janders> we're under a change moratorium to make sure systems are up and running for COVID research
11:07:40 <janders> personally I'm not involved in anything beyond this at this point in time :(
11:08:03 <oneswig> I registered for folding@home, but in 2 days (so far) I've had just 1 work unit... apparently the response has been so huge their queues have drained.
11:08:40 <oneswig> What form is the compute workload taking?
11:11:30 <oneswig> janders: I assume the fires are all done with, or is that still ongoing in the background?
11:11:51 <janders> luckily, yes
11:11:55 <janders> on to the next crisis
11:12:00 <verdurin> We have some people preparing to work on patient data, others running simulations.
11:12:22 <verdurin> Some working on drug targets/therapies.
11:12:45 <oneswig> verdurin: from the news I've seen, Oxford University does appear to be very active.
11:13:12 <verdurin> Yes. I wasn't deemed glamorous enough to appear on the news, though.
11:13:53 <oneswig> Ah
11:14:30 <oneswig> Elsewhere I've not heard how it's affecting IT supply chains, but inevitably there must be consequences.
11:15:17 <verdurin> It has. We had to pay over the odds for some parts to ensure timely delivery, by going with the only vendor who had stock already.
11:16:31 <oneswig> janders: is Canberra in lockdown? We started in earnest this week.
11:16:43 <janders> partially, yes
11:16:51 <janders> though comms aren't particularly clear
11:17:02 <janders> we've been working from home for weeks now so not much change
11:17:21 <janders> but the authorities are gradually cracking down on pretty much everything
11:20:52 <oneswig> Did you see that Ceph Octopus has been released?
11:21:37 <janders> not yet... are there any new features or improvements relevant to HPC?
11:21:45 <janders> RDMA support, improvements to CephFS?
11:21:56 <oneswig> From what I've heard, the Ceph-Ansible support is doubtful. A pity, considering it tends to work well nowadays
11:22:25 <janders> that's unfortunate
11:22:27 <oneswig> I think there may be less support for RDMA, but an opportunity to do it better second time round.
11:24:00 <oneswig> I'm not sure this represents the official plan of record but it's along the right lines: https://docs.ceph.com/docs/master/dev/seastore/
11:25:45 <verdurin> The big thing I noticed is this new 'cephadm' tool.
11:27:02 <oneswig> verdurin: is that part of the orchestrator?
11:34:45 <verdurin> It's a new installation tool that is, yes, somehow integrated with the orchestrator.
11:39:01 <verdurin> I just scanned the release notes and flipped through the new docs, though.
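For context on the cephadm tool discussed above: the Octopus documentation describes bootstrapping a one-node cluster and then growing it through the orchestrator CLI. Below is a minimal sketch of that documented flow, wrapped in Python purely so each step can be commented; the monitor IP and hostnames are placeholder assumptions, not details from the meeting.

    #!/usr/bin/env python3
    """Sketch of the documented cephadm bootstrap flow in Octopus.

    Placeholder values throughout: the monitor IP and hostnames are
    illustrative assumptions. Run as root on a host with cephadm
    installed.
    """
    import subprocess

    MON_IP = "10.0.0.1"                  # placeholder monitor IP
    EXTRA_HOSTS = ["ceph-2", "ceph-3"]   # placeholder hostnames

    def run(cmd):
        """Echo and execute one CLI step."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Bootstrap a minimal one-node cluster (first mon and mgr).
    run(["cephadm", "bootstrap", "--mon-ip", MON_IP])

    # 2. Register further hosts with the orchestrator.
    for host in EXTRA_HOSTS:
        run(["ceph", "orch", "host", "add", host])

    # 3. Let the orchestrator deploy OSDs on all eligible devices.
    run(["ceph", "orch", "apply", "osd", "--all-available-devices"])

If deployment does move this way, it may explain the doubt expressed above over Ceph-Ansible's future role.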
11:39:18 <janders> on another note, I had a chance to finally get back to VPI work on CX6 and I did get it to work with MOFED 5 and the latest FW
11:39:34 <janders> will probably use it more over the coming weeks
11:40:02 <janders> one interesting challenge I ran into is - how to tackle a dual-PCIe-slot setup for 100/200GbE
11:40:09 <verdurin> janders: was this with newer firmware than before, or did you need to do something different?
11:40:29 <janders> verdurin: newer firmware
11:40:46 <janders> I had issues with VPI in the past and ended up using onboard 10GbE for ethernet comms
11:40:54 <janders> now I'm trying to move back to the Mellanox cards
11:41:13 <janders> where it gets interesting is - for the dual-slot, dual-port card, how to wire this up to OVS?
11:41:27 <janders> it comes out as two ports
11:41:36 <janders> even though physically it's one card
11:41:49 <janders> with IB it's easy, as multipathing is more natural
11:42:11 <verdurin> Ah, you have the dual-slot card. We haven't used any of those yet.
11:42:15 <janders> with ethernet I'm not sure - do you guys have any experience with that?
11:42:19 <janders> yes
11:42:34 <janders> they work beautifully for storage, even if they are a little convoluted
11:43:13 <oneswig_> sorry, dropped off
11:45:05 <janders> I will likely chat to Mellanox about this over the coming days, happy to relay back what I learn if you're interested
11:45:34 <janders> it seems that for non-performance-sensitive workloads, just using one of the two interfaces will suffice
11:46:07 <janders> what I'm wondering about though is - if we're using leftover PCIe bandwidth for ethernet traffic, maybe it's better for IB if ethernet load-balances across PCIe slots as well
11:46:18 <janders> IB likes symmetry
11:47:13 <janders> if each card gets say 12.5GB/s but we try to snap up 5GB/s on one port for eth, I am not too sure if we still get 20GB/s on IB using what's left
11:47:29 <janders> so this may be motivation for figuring this out
11:47:41 <janders> to prevent IB slowdowns due to intermittent ethernet bandwidth peaks
11:48:06 <verdurin> Hmm. Sounds potentially troublesome, as you say.
11:48:12 <janders> yeah
11:48:16 <verdurin> Would be interested to hear.
11:48:23 <janders> dual-slot CX6s are fast but tricky to drive at times
11:48:32 <janders> luckily GPFS uses them very efficiently
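The 12.5GB/s and 5GB/s figures above can be turned into a quick budget. A back-of-the-envelope sketch, assuming (this is an assumption, not a measurement) that IB stripes evenly across the two slots, so its throughput is twice the leftover of the busier slot:

    #!/usr/bin/env python3
    """Back-of-the-envelope model of the dual-slot bandwidth question.

    Assumptions, not measurements: each PCIe slot delivers ~12.5 GB/s,
    ethernet needs 5 GB/s in total, and IB stripes evenly across the
    two slots. Link-level ceilings are ignored; this models only the
    PCIe side.
    """

    SLOT_BW = 12.5   # usable GB/s per PCIe slot (assumed)
    ETH_BW = 5.0     # total GB/s claimed by ethernet (assumed)

    def ib_throughput(eth_on_a, eth_on_b):
        """IB GB/s left for a symmetric stripe across both slots."""
        leftover_a = SLOT_BW - eth_on_a
        leftover_b = SLOT_BW - eth_on_b
        # A symmetric stripe runs at the pace of the slower slot.
        return 2 * min(leftover_a, leftover_b)

    # All ethernet pinned to one slot: IB drops to 2 x 7.5 GB/s.
    print(ib_throughput(ETH_BW, 0.0))             # 15.0
    # Ethernet balanced across both slots: 2 x 10 GB/s remain for IB.
    print(ib_throughput(ETH_BW / 2, ETH_BW / 2))  # 20.0

Under these assumptions the 20GB/s IB figure survives only when the ethernet load is split evenly across the slots, which supports the "IB likes symmetry" instinct.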
11:48:32 <oneswig_> Surprising that VPI has been an issue, you'd think a lot of people would want that.
11:49:09 <janders> I think it got fixed initially around October, but that had the limitation of having to use a specific port for IB, it wouldn't work the other way round
11:49:15 <janders> I think now it is fully functional
11:50:06 <janders> the new MOFED packaging convention is interesting though
11:50:27 <janders> the archive has multiple repos and accidentally using more than one leads to dependency hell
11:50:42 <janders> I think it's been like that since around 4.7 but I only really looked into this today on 5.0
11:50:44 <oneswig_> oh no...
11:50:57 <janders> makes sense when you really look through it but takes a while to get used to
11:52:14 <oneswig_> We've been using this - https://github.com/stackhpc/stackhpc-image-elements/tree/master/elements/mlnx-ofed - but sometimes the kernel weak-linkage doesn't work correctly, perhaps it's related to the synthetic DIB build environment.
11:52:54 <janders> yeah, I had some issues with weak linkage in the past
11:53:01 <janders> best to rebuild against a specific kernel revision
11:57:38 <oneswig_> That can be tricky in diskimage-builder, but I expect there's a way through.
11:58:24 <oneswig_> Nearly at time
11:58:39 <janders> stay safe everyone :)
11:58:46 <oneswig_> same to you!
11:58:49 <janders> hopefully there won't be a curfew on internet comms next week :)
11:58:56 <oneswig_> overload, more like.
11:59:03 <janders> I don't think COVID transmits through RDMA, though
11:59:19 <janders> so that may be the safest comms channel :)
12:02:07 <oneswig_> cheers all
12:02:10 <oneswig_> #endmeeting