11:00:16 #startmeeting scientific-sig
11:00:17 Meeting started Wed Mar 25 11:00:16 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:21 The meeting name has been set to 'scientific_sig'
11:01:08 g'day! :)
11:01:13 hope everyone is keeping safe
11:01:16 Morning.
11:01:26 hi all
11:01:43 here packets are still allowed to leave NIC ports, we'll see for how long
11:02:17 hello
11:02:28 all healthy here
11:03:02 that's great to hear :)
11:03:30 There isn't an agenda today - apologies - I've been helping with a training workshop this week
11:03:56 no worries
11:06:18 I've heard a few things related to coronavirus and providing compute resources to help research
11:06:30 Anyone else active in this?
11:07:01 Yes, we're active there - lots of frantic requests and cluster re-arrangements.
11:07:13 we're under a change moratorium to make sure systems are up and running for COVID research
11:07:40 personally I'm not involved in anything beyond this at this point in time :(
11:08:03 I registered for folding@home, but in 2 days (so far) I've had just 1 work unit... apparently the response has been so huge their queues have drained.
11:08:40 What form is the compute workload taking?
11:11:30 janders: I assume the fires are all done with, or is that still ongoing in the background?
11:11:51 luckily, yes
11:11:55 on to the next crisis
11:12:00 We have some people preparing to work on patient data, others running simulations.
11:12:22 Some working on drug targets/therapies.
11:12:45 verdurin: from the news I've seen, Oxford University does appear to be very active.
11:13:12 Yes. I wasn't deemed glamorous enough to appear on the news, though.
11:13:53 Ah
11:14:30 Elsewhere I've not heard how it's affecting IT supply chains, but inevitably there must be consequences.
11:15:17 It has. We had to pay over the odds for some parts to ensure timely delivery, by going with the only vendor who had stock already.
11:16:31 janders: is Canberra in lockdown? We started in earnest this week.
11:16:43 partially, yes
11:16:51 though comms aren't particularly clear
11:17:02 we've been working from home for weeks now, so not much change
11:17:21 but the authorities are gradually cracking down on pretty much anything
11:20:52 Did you see that Ceph Octopus has been released?
11:21:37 not yet... are there any new features or improvements relevant to HPC?
11:21:45 RDMA support, improvements to CephFS?
11:21:56 From what I've heard, the Ceph-Ansible support is doubtful. A pity, considering it tends to work well nowadays.
11:22:25 that's unfortunate
11:22:27 I think there may be less support for RDMA, but an opportunity to do it better the second time round.
11:24:00 I'm not sure this represents the official plan of record but it's along the right lines: https://docs.ceph.com/docs/master/dev/seastore/
11:25:45 The big thing I noticed is this new 'cephadm' tool.
11:27:02 verdurin: is that part of the orchestrator?
11:34:45 It's a new installation tool that is, yes, somehow integrated with the orchestrator.
11:39:01 I've only scanned the release notes and flipped through the new docs, though.
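[Editor's note: the 'cephadm' tool discussed above bootstraps a cluster and then hands host/OSD management to the new orchestrator. A minimal sketch following the Octopus documentation; the monitor IP (192.0.2.10) and second host name (host2) are placeholders:]

    # Fetch the standalone cephadm script for the Octopus release
    curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
    chmod +x cephadm
    # Bootstrap a one-node cluster; --mon-ip is the address the first monitor binds to
    sudo ./cephadm bootstrap --mon-ip 192.0.2.10
    # The orchestrator integration: add further hosts and deploy OSDs from the ceph CLI
    sudo ./cephadm shell -- ceph orch host add host2
    sudo ./cephadm shell -- ceph orch apply osd --all-available-devices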
11:39:18 on another note, I had a chance to finally get back to VPI work on CX6, and I did get it to work with MOFED 5 and the latest firmware
11:39:34 will probably use it more over the coming weeks
11:40:02 one interesting challenge I ran into is how to tackle a dual-PCIe-slot setup for 100/200GE
11:40:09 janders: was this with newer firmware than before, or did you need to do something different?
11:40:29 verdurin: newer firmware
11:40:46 I had issues with VPI in the past and ended up using onboard 10GE for ethernet comms
11:40:54 now I'm trying to move back to the Mellanox cards
11:41:13 where it gets interesting is - for the dual-slot, dual-port card, how to wire this up to OVS? [see note 1 below]
11:41:27 it comes out as two ports
11:41:36 even though physically it's one card
11:41:49 with IB it's easy, as multipathing is more natural
11:42:11 Ah, you have the dual-slot card. We haven't used any of those yet.
11:42:15 with ethernet I'm not sure - do you guys have any experience with that?
11:42:19 yes
11:42:34 they work beautifully for storage, even if they are a little convoluted
11:43:13 sorry, dropped off
11:45:05 I will likely chat to Mellanox about this over the coming days, happy to relay back what I learn if you're interested
11:45:34 it seems that for non-performance-sensitive workloads, just using one of the two interfaces will suffice
11:46:07 what I'm wondering about, though, is if we're using leftover PCIe bandwidth for ethernet traffic, maybe it's better for IB if ethernet load-balances across PCIe slots as well
11:46:18 IB likes symmetry
11:47:13 if each card gets say 12.5GB/s but we try to snap 5GB/s on one port for eth, I am not too sure if we still get 20GB/s on IB using what's left
11:47:29 so this may be motivation for figuring this out
11:47:41 to prevent IB slowdowns due to intermittent ethernet bandwidth peaks
11:48:06 Hmm. Sounds potentially troublesome, as you say.
11:48:12 yeah
11:48:16 Would be interested to hear.
11:48:23 dual-slot CX6s are fast but tricky to drive at times
11:48:32 luckily GPFS uses them very efficiently
11:48:32 Surprising that VPI has been an issue, you'd think a lot of people would want that.
11:49:09 I think it got fixed initially around October, but that had the limitation of having to use a specific port for IB; it wouldn't work the other way round
11:49:15 I think now it is fully functional
11:50:06 the new MOFED packaging convention is interesting, though
11:50:27 the archive has multiple repos, and accidentally using more than one leads to dependency hell [see note 2 below]
11:50:42 I think it's been like that since around 4.7, but I only really looked into this today on 5.0
11:50:44 oh no...
11:50:57 it makes sense when you really look through it, but takes a while to get used to
11:52:14 We've been using this - https://github.com/stackhpc/stackhpc-image-elements/tree/master/elements/mlnx-ofed - but sometimes the kernel weak-linkage doesn't work correctly; perhaps it's related to the synthetic DIB build environment.
11:52:54 yeah, I had some issues with weak linkage in the past
11:53:01 best to rebuild against a specific kernel revision [see note 3 below]
11:57:38 That can be tricky in diskimage-builder, but I expect there's a way through.
11:58:24 Nearly at time
11:58:39 stay safe everyone :)
11:58:46 same to you!
11:58:49 hopefully there won't be a curfew on internet comms next week :)
11:58:56 overload, more like.
11:59:03 I don't think COVID transmits through RDMA, though
11:59:19 so that may be the safest comms channel :)
12:02:07 cheers all
12:02:10 #endmeeting
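[Editor's notes]

Note 1 (re 11:41:13, presenting a dual-slot, dual-port card to OVS as one link): one hedged approach, assuming the top-of-rack switch can form a LAG across both ports, is an OVS bond in balance-tcp mode with LACP, which hashes flows across both members and so spreads the ethernet load over both PCIe slots. The interface names (enp65s0f0, enp129s0f0) are placeholders for the two PCIe functions:

    # Create a bridge and bond the two physical ports into one logical uplink.
    # balance-tcp spreads flows across both members; it requires LACP on the
    # switch side (lacp=active negotiates it from the host).
    ovs-vsctl add-br br-data
    ovs-vsctl add-bond br-data bond-cx6 enp65s0f0 enp129s0f0 \
        bond_mode=balance-tcp lacp=active
    # Verify member state and how flows are being distributed
    ovs-appctl bond/show bond-cx6

If LACP isn't an option, bond_mode=balance-slb balances on source MAC and VLAN without switch cooperation, at the cost of coarser load distribution.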
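Note 2 (re 11:50:27, the multi-repo MOFED archive): the dependency hell described typically comes from enabling more than one flavour of the archive's user-space library repos at once, since they ship conflicting builds of the same packages. A sketch of a yum repo definition pinned to a single flavour; the version string and directory layout are illustrative and vary by MOFED release, so check the top level of the archive you actually downloaded:

    # /etc/yum.repos.d/mlnx-ofed.repo
    # Point baseurl at exactly ONE repo within the archive; enabling more
    # than one flavour pulls in conflicting builds of the same libraries.
    [mlnx-ofed]
    name=MLNX_OFED 5.0 (single flavour)
    baseurl=file:///opt/MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.7-x86_64/RPMS
    enabled=1
    gpgcheck=0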
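Note 3 (re 11:53:01, rebuilding against a specific kernel rather than trusting weak-updates linkage): the MOFED installer can regenerate its kernel modules for an exact kernel revision, which is what an image-build pipeline such as the mlnx-ofed DIB element linked above needs when the build host's running kernel differs from the kernel inside the image. A sketch, with the kernel version as a placeholder and flags as commonly used on MOFED 4.x/5.x (confirm against ./mlnxofedinstall --help for your release):

    # Rebuild MOFED kernel modules against the kernel shipped in the image,
    # rather than relying on weak-updates symbol linkage. Requires the
    # matching kernel-devel package to be present in the build environment.
    ./mlnxofedinstall --add-kernel-support \
        --kernel 3.10.0-1062.el7.x86_64 \
        --kernel-sources /usr/src/kernels/3.10.0-1062.el7.x86_64 \
        --without-fw-update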