21:00:18 <oneswig> #startmeeting scientific-sig
21:00:19 <openstack> Meeting started Tue Aug  6 21:00:18 2019 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:23 <openstack> The meeting name has been set to 'scientific_sig'
21:00:26 <oneswig> away we go
21:00:43 <b1airo> Morning
21:00:46 <janders> g'day guys
21:00:49 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_August_6th_2019
21:01:07 <oneswig> Morning b1airo janders how's Wednesday going?
21:01:25 <janders> oneswig starting slowly :)  how are you?
21:01:27 <b1airo> I'm just back from holiday and organising late start with the kids etc, so will be a bit fleeting
21:01:30 <priteau> Hello everyone
21:01:42 <oneswig> Hi priteau!
21:02:12 <oneswig> janders: I'm well, thanks.  Been a fairly intense week working on some customer projects, but a good one.
21:02:21 <janders> that's good to hear
21:03:20 <oneswig> I was anticipating jmlowe and trandles coming along today
21:03:44 <oneswig> janders: what's the latest on supercloud?
21:04:21 <janders> not much of an update - we're at this time in the year where all the projects are getting reshuffled - so more paperwork than anything else really
21:04:47 <janders> some interesting challenges with dual port ConnectX6es - happy to touch on this in AOB if there is time
21:05:12 <oneswig> was talking with somebody recently with an interest in a Manila driver for BeeGFS
21:05:49 <oneswig> janders: always happy to have some grunty networking issues to round the hour out.
21:06:07 <janders> :)  sounds good
21:06:12 <b1airo> Yeah Manila BeeGFS sounds useful
21:06:17 <janders> +1!!!
21:06:23 <oneswig> #topic OpenStack user survey
21:06:37 <oneswig> #link yer tiz https://www.openstack.org/user-survey/survey-2019/landing
21:07:01 <oneswig> Get it filled in and make sure your scientific clouds are stood up and counted.
21:07:16 <oneswig> That's all about that.
21:07:32 <oneswig> #topic Monitoring for Chargeback and Accounting
21:07:53 <tbarron> we just need someone to write and maintain a Manila driver for BeeGFS, and we (the Manila maintainers) commit to helping it integrate/merge/etc.
21:08:20 * tbarron apologizes for jumping before current topic, is done
21:08:21 <oneswig> Hi tbarron - ears burning :-)
21:08:36 <oneswig> Thanks for joining in.
21:08:58 <tbarron> oneswig: :)
21:09:29 <trandles> o/ I'm late but oh well
21:09:58 <oneswig> I fear the maintaining bit is often overlooked.  But I think there's a real opportunity here to do something good.
21:10:15 <oneswig> hi trandles, you made it!
21:11:35 <oneswig> tbarron: we'll see if we can get an interested group together
21:12:47 <oneswig> #action oneswig to canvas for interest on Manila driver for BeeGFS
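(For anyone picking up the action above: the sketch below shows the rough shape a first-party Manila driver takes, assuming the standard manila.share.driver.ShareDriver interface. The BeeGFS-specific parts - quota/export handling, presumably via beegfs-ctl - are hypothetical placeholders, not anyone's actual design.)

    from manila.share import driver


    class BeeGFSShareDriver(driver.ShareDriver):
        """Hypothetical skeleton of a BeeGFS share driver (sketch only)."""

        def __init__(self, *args, **kwargs):
            # Exports would be managed directly on the filesystem, so the
            # driver does not handle share servers itself (DHSS=False).
            super(BeeGFSShareDriver, self).__init__(False, *args, **kwargs)
            self.backend_name = self.configuration.safe_get(
                'share_backend_name') or 'BeeGFS'

        def create_share(self, context, share, share_server=None):
            # Placeholder: create a directory/quota on BeeGFS (e.g. via
            # beegfs-ctl) and return its export location.
            raise NotImplementedError()

        def delete_share(self, context, share, share_server=None):
            # Placeholder: remove the directory/quota backing the share.
            raise NotImplementedError()

        def update_access(self, context, share, access_rules, add_rules,
                          delete_rules, share_server=None):
            # Placeholder: map Manila access rules onto BeeGFS client access.
            raise NotImplementedError()

        def _update_share_stats(self):
            # Report backend capabilities/capacity to the Manila scheduler.
            data = dict(
                share_backend_name=self.backend_name,
                vendor_name='BeeGFS',
                driver_version='0.1',
                storage_protocol='BEEGFS',
            )
            super(BeeGFSShareDriver, self)._update_share_stats(data)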
21:13:11 <oneswig> OK, let's return to the topic at hand.
21:13:57 <oneswig> We actually discussed this last week when priteau did some interesting work investigating CloudKitty drawing data from Monasca instead of the Ceilometer family.
21:14:11 <priteau> oneswig: Telemetry family :)
21:14:36 <oneswig> I stand corrected
21:14:50 <oneswig> #link CloudKitty and Monasca (episode 1) https://www.stackhpc.com/cloudkitty-and-monasca-1.html
21:15:27 <oneswig> This article sets the scene
21:15:45 <oneswig> priteau: I think you've been busy since then - I assume you've no further developments to report?
21:16:19 <priteau> I am afraid I've been otherwise engaged
21:16:50 <priteau> But hopefully the summer is not yet over for a sequel blog post
21:17:30 <oneswig> I hope not - you can't leave us hanging off this cliff-edge!
21:17:49 <oneswig> b1airo: were you using CloudKitty at Monash?
21:19:02 <b1airo> No, looked at it a few times, but never attempted putting it all together
21:20:05 <oneswig> Ah ok, I remember you mentioning it.
21:21:13 <oneswig> We have an interest in generating billing data but an aversion to pulling in additional telemetry to do it.
21:21:22 <priteau> CloudKitty itself isn't very complex to configure, it's more having the right data collected that can be tricky
21:24:04 <oneswig> I'll be interested to see how it works with data from the OpenStack exporter, or nova instance data
21:26:02 <priteau> Nova is actually the easiest service to charge for because there are various ways to collect usage metrics. Other services, like charging for image or volume storage, will be more challenging.
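(For reference, pointing CloudKitty at Monasca is mostly a matter of the collector setting plus a metrics definition file, roughly along these lines - a sketch only, since exact option names vary between CloudKitty releases; see the blog post linked above for the worked example.)

    [collect]
    # Pull usage data from Monasca rather than the Telemetry services.
    collector = monasca
    # Which metrics to rate (cpu, image.size, volume.size, ...) and how to
    # group them is defined in a separate YAML file.
    metrics_conf = /etc/cloudkitty/metrics.yml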
21:28:28 <oneswig> A pity jmlowe's not around to tell us how they use XDMoD (I assume) for this
21:31:30 <oneswig> OK, time for janders ConnectX6 issue?
21:31:38 <oneswig> #topic AOB
21:32:08 <janders> ok!
21:32:14 <oneswig> janders: what's been going on?
21:32:24 <janders> do any of you have any experience with dual-port CX6s?
21:32:54 <oneswig> only CX-5, alas
21:33:13 <clarkb> Thought I would point out http://lists.openstack.org/pipermail/openstack-discuss/2019-August/008304.html as some of you may be able to answer those questions
21:33:34 <janders> I think CX5s are a bit easier to work with :)
21:33:50 <janders> I wanted to use one CX6 port as 50GE and the other as HDR200
21:34:03 <oneswig> Hi clarkb - saw that earlier tonight, thanks for raising it.
21:34:14 <janders> it seems that with the current firmware it's hard to get the eth/ib ports to work concurrently
21:34:20 <janders> it's a bit of "one or the other"
21:34:32 <oneswig> What happened to VPI?
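(Context for the VPI question: on VPI-capable Mellanox HCAs the per-port protocol is normally selected in firmware with mlxconfig, e.g. as below - the device path is illustrative and the change needs a firmware reset or reboot to take effect. The open question here was whether current ConnectX-6 firmware honours a mixed eth/IB setting across the two ports.)

    # Illustrative only: device name comes from 'mst status' (MFT tools required).
    # LINK_TYPE values: 1 = InfiniBand, 2 = Ethernet.
    mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE
    mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=1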
21:35:17 <janders> another angle: do you guys use splitter cables (e.g. 100GE > 2x50GE)?
21:35:32 <b1airo> Yes
21:35:41 <oneswig> janders: yes, on a CX-4 system, had a good deal of trouble initially
21:35:53 <janders> interesting.. hitting the same
21:36:01 <janders> what sorts of issues did you have?
21:36:47 <janders> so far I've seen that support for splitters sometimes varies across firmware versions (say 1.1 supports it, then you upgrade to 1.2, stuff stops working, and support say that version doesn't support splitters)
21:37:21 <janders> also support guys seem very confused about splitters in general (once I was told these are meant to connect switches not nodes, which seems insane)
21:38:16 <oneswig> We had a 100G-4x25G splitter with 25G SFP28 ends, but the NICs had QSFP sockets.  Mellanox do a passive QSA 'slipper' adapter that takes the SFP cable in the QSFP socket - needed some firmware tweaks to get it going.
21:38:46 <janders> right!
21:39:07 <janders> did they make these tweaks mainstream in the end, or did these remain tweaks?
21:39:54 <oneswig> I think they were mainstream but I'll check the fw version now
21:40:03 <janders> IIRC our "splitters" are QSFP on both sides
21:41:07 <oneswig> The CX4 NICs are running 12.20.1010
21:41:21 <oneswig> Probably ~1.5 years old
21:41:32 <janders> ok!
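(For the record, the running NIC firmware quoted above can be read without the MFT tools, e.g. via ethtool - the interface name here is just an example.)

    # Reports driver, driver version and firmware-version for the interface.
    ethtool -i ens1f0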
21:41:56 <janders> so - I think between dual-port CX6es and the splitter cables we might have a bit more "fun" before everything works
21:42:18 <oneswig> What's preventing running the NIC dual-protocol?  Do you think it's the splitter?
21:42:44 <janders> I was told there are missing bits in the CX6 firmware but that came through a reseller not Mellanox directly
21:42:50 <janders> so not 100% sure if this is accurate
21:43:05 <oneswig> That would be a surprising piece to be missing.
21:43:13 <janders> also, given we're using the dual-PCI-slot form factor, I have a sneaking suspicion that VPI might be trickier than it used to be
21:43:20 <oneswig> What do you get with CX-6, apart from 200G?
21:43:22 <janders> thanks, PCIe3.0
21:43:27 <janders> nothing
21:43:50 <janders> I'm actually considering asking to have some cards replaced with CX5s if this can't be resolved promptly
21:43:59 <janders> HDR200 would be handy for my GPFS cluster though
21:44:26 <janders> it's NVMe based (similar design to our BeeGFS) so could definitely use that bandwidth
21:44:45 <janders> but... 2x CX5 could do that, too :)
21:45:09 <oneswig> You'd get to put one on each socket as well, and perhaps exploit some NUMA locality
21:45:22 <janders> indeed
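(A quick way to check the NUMA locality point, for anyone weighing up the one-HCA-per-socket layout - interface name again just an example.)

    # Prints the NUMA node the NIC's PCIe slot is attached to (-1 if unreported).
    cat /sys/class/net/ens1f0/device/numa_node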
21:45:53 <janders> I'm anticipating an update from the reseller & Mellanox this week - I will report back if I learn anything interesting
21:46:02 <janders> it's very useful to know you guys had issues with splitters, too
21:46:24 <janders> I will be less keen to put them in any larger systems now :(
21:46:42 <oneswig> We did but I think the QSA28 slipper was part of our problem
21:46:59 <janders> it's a shame cause these could be used to build some really funky IO-centric topologies
21:47:26 <janders> I haven't given up on them entirely just yet but it's definitely an amber light in terms of resilient systems design
21:47:36 <oneswig> I wouldn't give up on them just yet, it would rule out many options
21:48:14 <janders> one thing seems certain: taking up a new HCA generation + splitters at the same time is painful
21:49:26 <janders> that's pretty much all I have on this topic for now. Thank you for sharing thoughts, I will keep you posted! :)
21:50:03 <oneswig> Thanks janders, feel your pain but slightly envy your kit at the same time :-)
21:50:18 <oneswig> OK, shall we wrap up?  Any more to add today?
21:50:39 <b1airo> I think being an early adopter of anything new from MLNX has a certain degree of pain and perplexity involved
21:50:53 <janders> +1 :)
21:51:23 <oneswig> how about being an early adopter of OPA, in fairness?
21:51:46 <janders> LOL!
21:52:01 <janders> I suppose you can be among the first and the last ones, all at the same time
21:53:03 <b1airo> Hehe
21:53:48 <b1airo> Did you guys see ARDC has released a call for proposals for new Research Cloud infra?
21:54:18 <oneswig> What's going on there b1airo?
21:54:22 <b1airo> Only ~$5.5m worth though it seems
21:55:36 <b1airo> Probably just aiming to replace existing capacity. Was $20m 6-7 years ago, so $5m probably buys at least the same number of vcores now
21:56:02 <oneswig> ARDC is New Nectar?
21:57:10 <b1airo> Yeah, Nectar-ANDS-RDS conglomerate
21:58:13 <b1airo> Also, quickly on the topic of public vs private/hybrid
21:58:50 <b1airo> I was thinking of reaching out to some server vendors to see if they have some good baseline cost modelling - anyone know of any?
21:59:49 <oneswig> I think Cambridge Uni had some models for cost per core hour that might be relevant but I'm not sure how general (or public) they are
21:59:53 <oneswig> I'll check
22:00:11 <oneswig> OK, time to close
22:00:13 <b1airo> Sounds like I'll need to do some direct legwork
22:00:26 <oneswig> Thanks all
22:00:29 <oneswig> #endmeeting