20:59:19 <oneswig> #startmeeting scientific-sig
20:59:20 <openstack> Meeting started Tue Jul 10 20:59:19 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:59:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:59:23 <openstack> The meeting name has been set to 'scientific_sig'
20:59:36 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_July_10th_2018
20:59:46 <oneswig> greetings
21:00:14 <trandles> o/
21:00:23 <oneswig> Afternoon trandles
21:00:49 <oneswig> I was hoping to catch up with you on the latest status for Charliecloud
21:01:00 <trandles> bonsoir oneswig  ;)
21:01:14 <trandles> ah yeah, I can fill you in on Charliecloud developments
21:01:26 <janders> Hi Stig! :)
21:01:33 <oneswig> In fact I am in German-speaking territories right now, so guten Abend, I guess...
21:01:44 <oneswig> Hey janders, how's things?
21:01:51 <martial_> Hi Stig
21:02:02 <oneswig> martial_: bravo, you made it, good effort
21:02:05 <oneswig> #chair martial_
21:02:06 <openstack> Current chairs: martial_ oneswig
21:02:17 <janders> oneswig: good, thank you :)
21:02:26 <janders> how are you?
21:02:33 <oneswig> OK, I think the agenda might be a quick one today...
21:02:40 <janders> Berlin :)
21:02:51 <oneswig> janders: all good here, wanted to ask you about your news later
21:02:59 <oneswig> janders: oh go on then
21:03:08 <oneswig> #topic Berlin CFP
21:03:14 <oneswig> The CFP deadline is a week off!
21:04:07 <oneswig> I am hoping to put a talk (or two) in on some recent work - hope you'll all do the same
21:04:16 <janders> Which of these three topics would you be most interested in hearing about: 1) CSIRO baremetal cloud project report, 2) Cloudifying NVIDIA DGX-1, or 3) dual-fabric (Mellanox Ethernet + IB) OpenStack?
21:04:21 <oneswig> trandles: I assume you're going to Denver?
21:04:37 <oneswig> er... all of the above?
21:04:38 <trandles> Dallas, but I know what you meant ;)
21:04:41 <janders> I can do one, perhaps two out of the three
21:05:05 <oneswig> trandles: understandable, but dang all the same...
21:05:38 <oneswig> I'm not sure who to drink a G&T with now!
21:05:51 <trandles> oneswig: yeah I'd prefer a trip to Berlin but I have HQ duties at SC for at least the next two years :(  Have to get the Summit organizing committee to pick a week that doesn't conflict.
21:06:22 <oneswig> Is there a Summit system going to your place?
21:06:36 <oneswig> Ah.  Different summit.
21:06:36 <oneswig> doh
21:06:43 <martial_> and I have to get our SC18 BoF situated. I have a K8s person, 2x from Docker and our OpenStack crowd as well
21:07:03 <oneswig> martial_: that's even bigger than your panel at Vancouver!
21:07:04 <janders> oneswig: that would be great - it's a real PITA that OSS & SC clash pretty much every time
21:07:33 <trandles> martial_: if you think of it, put me in touch with Hoge to discuss HPC containers + OpenStack
21:07:56 <oneswig> ok - I consider the notice served on Berlin CFP.
21:08:02 <martial_> Yes Tim, sorry - I got your email, it's just in the queue and things keep getting added
21:08:11 <trandles> martial_: no worries :)
21:08:18 <oneswig> Can we fit a containers-on-openstack update in here?
21:08:33 <oneswig> Charliecloud news?
21:09:42 <trandles> Charliecloud news-in-brief: 1) new version coming soonish but I don't know of anything hugely impactful in the changelog   2) several times now I've been pinged about being able to launch baremetal Charliecloud containers using nova but I haven't followed up on it
21:10:38 <trandles> if this crowd would like a brief talk on Charliecloud I'd be happy to do one at this meeting sometime
21:10:53 <janders> trandles: that would be great! :)
21:11:02 <oneswig> On 2, what's that all about?
21:11:04 <trandles> discuss our HPC container runtime philosophy and why Charliecloud is the way it is
21:11:13 <martial_> Tim: yes please
21:11:24 <oneswig> trandles: any news on OpenHPC packaging?
21:12:12 <oneswig> trandles: seconded on the SIG talk, yes please
21:12:23 <trandles> well, for 2), I'm not sure what folks are thinking and no one has brought it up enough for me to make it a priority, so maybe consider this a very generic request for more info
21:13:15 <martial_> I can tell you we have a lot of users at TACC who are really unhappy with Singularity
21:13:22 <oneswig> Given Charliecloud takes docker containers as input, isn't this a bit odd?
21:13:23 <martial_> we are investigating gVisor
21:13:23 <trandles> oneswig: we (one of the exascale project focus areas) just got an OpenHPC presentation from the founders and their slides said Charliecloud is in the new release
21:13:35 <oneswig> martial_: care to elaborate?
21:13:53 <oneswig> I suspect gVisor isn't going far - it seems to be a fix at the wrong level, with the user-space interposer
21:13:59 <trandles> gVisor IIRC is more of a container hypervisory-type thing
21:15:07 <trandles> oneswig: I agree about the oddness of wanting to launch Charliecloud using nova, hence my lack of interest in seriously running down the idea
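[For readers unfamiliar with Charliecloud, the point about it taking Docker containers as input is easiest to see from its workflow: build with Docker, flatten to a tarball, unpack, run unprivileged. A minimal sketch follows, wrapped in Python subprocess calls; the image name "hello" and the /var/tmp paths are placeholders, and the command names reflect Charliecloud releases of roughly this era.]

```python
# Minimal sketch of the Charliecloud workflow, expressed as Python subprocess
# calls. The image name "hello" and the /var/tmp paths are placeholders;
# command names and flags reflect Charliecloud releases of roughly this era.
import subprocess

def sh(*cmd):
    """Run a command, raising CalledProcessError on failure."""
    subprocess.run(cmd, check=True)

# 1) Build a Docker image from the Dockerfile in ./hello (wraps `docker build`).
sh("ch-build", "-t", "hello", "./hello")
# 2) Flatten the Docker image into a tarball under /var/tmp.
sh("ch-docker2tar", "hello", "/var/tmp")
# 3) Unpack the tarball into an image directory (/var/tmp/hello).
sh("ch-tar2dir", "/var/tmp/hello.tar.gz", "/var/tmp")
# 4) Run a command inside the unpacked image, fully unprivileged.
sh("ch-run", "/var/tmp/hello", "--", "echo", "hello from Charliecloud")
```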
21:15:16 <martial_> oneswig: we have this issue where they are forced to use Singularity, but the analytics are built in a Docker container and shifting them to Singularity is limiting
21:16:02 <martial_> plus we are hoping to remove root access. I have been pitching Charliecloud as an alternative, but TACC runs Singularity so it's hard to force the issue
21:16:07 <oneswig> martial_: perhaps this applies to a number of the non-docker runtimes?
21:16:53 <oneswig> martial_: They'll need to upgrade their kernel - but on the plus side they'll also get the Mimic Ceph client!
21:17:09 <trandles> at a higher level, there is an ECP container working group discussing three general use cases: 1) containerizing HPC microservices, 2) runtime requirements for containerized HPC jobs, and 3) containerizing HPC applications
21:17:59 <martial_> Tim: if you have some time and are willing to discuss Charliecloud for root isolation (vs Singularity or gVisor) :)
21:18:12 <oneswig> "book 5 in the increasingly inaccurately named trilogy..."
21:18:28 <trandles> oneswig: +1
21:18:33 <trandles> martial_: let me know when
21:18:50 <janders> martial_: what's the main issue with users having root - is it a data access issue or risk for the container host itself? Or all of the above?
21:18:52 <trandles> FYI - I'm out on work travel July 14-25, so likely missing the next two of these meetings
21:18:59 <martial_> checking if our Chris is around
21:19:14 <oneswig> trandles: are we talking to the Charliecloud marketing dept?
21:19:48 <trandles> oneswig: no comment (might be the PR department)
21:20:03 <martial_> tim: data exfiltration protection
21:20:31 <trandles> that's a nice segue to the next topic on the agenda isn't it martial_ ?
21:20:46 <martial_> always here to help :)
21:20:48 <oneswig> Ah, indeed, have we come to a conclusion on hpc containers?
21:21:06 <oneswig> #topic survey on controlled-access data
21:21:15 <trandles> pencil me in for Charliecloud talk at the August 7 meeting
21:21:25 <oneswig> Just wanted to draw everyone's attention to new data on our etherpad
21:21:31 <martial_> cool, I will try to bring in my crowd too
21:21:51 <oneswig> trandles: someone else will need to pencil that one as I'll be on vacation, alas...
21:22:02 <oneswig> (not alas for the vacation however)
21:22:13 <martial_> vacation? Stig ... it is for science!
21:22:35 <oneswig> I'll pack my telescope...
21:22:46 <oneswig> #link Updates on the etherpad https://etherpad.openstack.org/p/Scientific-SIG-Controlled-Data-Research
21:22:53 <janders> oneswig: heading to Western Australia? South Africa? :)
21:23:17 <oneswig> janders: not this time...
21:23:33 <oneswig> Geneva to CERN next week however, really looking forward to that!
21:24:15 <oneswig> OK - on the etherpad, the search has turned up a few prominent projects
21:24:33 <janders> oneswig: excellent, enjoy.
21:24:41 <oneswig> I was hoping these might get people thinking of other items to add
21:25:11 <trandles> oneswig: pro-tip - check out Charly's in Saint Genis for a friendly CERN crowd
21:25:20 <martial_> I have a meeting 8/7 that ends at 5pm, so I might be penciled in on that one too
21:26:13 <martial_> I am sharing this with Khalil and Craig because we are in conversation with places for data access too (with ORCA), so they might be able to extend the list
21:26:25 <oneswig> Etherpad editing isn't usually a spectator sport so I wanted to leave that one with people
21:26:54 <oneswig> martial_: that sounds really good, I think we could do with more data from their domain
21:27:55 <oneswig> I would like to get some guest speakers in, if possible.  Martial - got any NIST connections for FedRAMP?
21:29:36 <janders> oneswig: Charly's is pretty cool indeed :)
21:29:59 <oneswig> trandles: janders: now I'll have to go!
21:31:09 <oneswig> OK, shall we move on?
21:31:15 <martial_> stig: Bob Bohn
21:31:34 <martial_> Stig: I was thinking on it, he can help you with that, you know Bob, right?
21:31:51 <oneswig> martial_: I do indeed.  I'll drop him a mail, thanks
21:32:12 <martial_> :)
21:32:48 <oneswig> #topic AOB
21:33:00 <oneswig> janders: how's the IB deploy going?
21:33:50 <janders> oneswig: making good progress. This week I'm training colleagues who are new on the project, so less R&D happening.
21:34:03 <janders> An interesting observation / hint:
21:34:34 <janders> Make sure that the GUIDs are always specified in lowercase
21:34:41 <janders> Otherwise Bad Things (TM) will happen
21:34:53 <martial_> sound ominous
21:35:10 <janders> Like - the whole SDN workflow accepting the uppercase GUIDs end to end... till the Subnet Manager is restarted
21:35:28 <janders> then all the GUIDs disappear and all the IB ports go down
21:35:32 <trandles> janders: I'm just now thinking how I want to incorporate IB into my OpenStack testbed.  OK if I bug you with thoughts/questions?
21:35:45 <janders> sure! :)
21:35:58 <janders> jacob.anders@csiro.au
21:36:04 <trandles> trandles@lanl.gov
21:36:10 <trandles> I'll fire something off tomorrow, thx
21:36:29 <janders> trandles: are you after virtualised/SRIOV implementation, baremetal or both?
21:37:05 <trandles> SRIOV primarily, but I'd like to hear about the SDN stuff especially
21:37:25 <trandles> plus I might be able to extend the use case to OPA with another testbed
21:37:47 <janders> just to wrap up the uppercase GUID challenge: it seems that input validation for API calls is different from that for reading config files on startup. This will be addressed in a future (hopefully the next) OpenSM release :)
21:38:06 <trandles> rumor is the fabric manager is much happier with the dynamic stuff than an IB subnet manager
21:38:16 <janders> It was interesting to debug in R&D - not so much if it somehow happened in prod - that would have been catastrophic
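[Since the validation gap janders describes sits on the OpenSM/config-file side, the practical client-side defence is to canonicalize GUIDs before they ever reach the SDN API. A hedged sketch of such a helper follows; the example GUID is made up, and real deployments may also need to handle colon-separated forms.]

```python
# Illustrative sketch: normalize InfiniBand GUIDs to lowercase before they
# reach any SDN/OpenSM-facing API, to avoid the case-sensitivity trap
# described above. The example GUID at the bottom is made up.
def normalize_guid(guid: str) -> str:
    """Return a canonical lowercase GUID: whitespace stripped, optional 0x
    prefix removed, and validated as exactly 16 hex digits."""
    g = guid.strip().lower()
    if g.startswith("0x"):
        g = g[2:]
    if len(g) != 16 or any(c not in "0123456789abcdef" for c in g):
        raise ValueError(f"not a valid GUID: {guid!r}")
    return g

print(normalize_guid("0x0002C9030045F1E2"))  # -> 0002c9030045f1e2
```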
21:38:48 <trandles> I have to run folks...until next time...
21:38:55 <oneswig> thanks trandles
21:39:06 <janders> see you trandles
21:39:17 <oneswig> janders: is it in production now?  At what scale?
21:39:55 <janders> oneswig: not yet. The requirements changed a fair bit
21:39:59 <martial_> (bye Tim)
21:40:08 <martial_> seems pretty cool janders
21:40:27 <janders> It turns out that the cybersecurity research system will be completely separate - so we'll build this first
21:40:31 <janders> it will be 32 nodes
21:40:37 <janders> (most likely)
21:40:51 <oneswig> janders: did you end up using Secure Host?
21:41:08 <janders> and then we'll be looking at a larger system, possibly an order of magnitude larger than that
21:41:27 <janders> oneswig: not yet, but it's on the roadmap - thanks for reminding me to remind Mellanox to send me the firmware
21:41:47 <janders> we had some concerns about SecureHost on OEM branded HCAs but I'm happy to try it, got a heap of spares
21:42:00 <janders> if a couple get bricked while testing not a big deal
21:42:12 <oneswig> janders: indeed - I assume it breaks warranty
21:42:29 <oneswig> Do you have firmware for CX5?
21:42:47 <janders> no, SH FW is CX3 only at this stage AFAIK
21:42:54 <janders> ancient, I know
21:43:14 <janders> this doesn't worry me as most of my current kit is CX3, however this will change soon so I hope to work this out with mlnx
21:43:34 <oneswig> keep up the good fight :-)
21:43:43 <janders> I will :)
21:44:09 <janders> another update that you might be interested in is on running nova-compute in ironic instances
21:44:14 <janders> it works pretty well
21:44:31 <oneswig> janders: what's that?
21:44:49 <janders> my current deployment got pretty badly affected by the GUID case mismatch issue, but other than that it *just works*
21:45:32 <janders> I figured that if I have ironic/SDN capability, there is no point in deploying "static", nominated nova-compute nodes
21:45:46 <janders> better make everything ironic controlled - and deploy as many as needed at a given point in time
21:46:33 <janders> if no longer needed - just delete and reuse the hardware, possibly for a "tenant" baremetal instance
21:46:39 <oneswig> So you're running nova hypervisors within your Ironic compute instances?
21:46:46 <janders> correct
21:46:50 <oneswig> That's awesome!
21:47:03 <janders> networking needs some careful thought, but when done well it's pretty cool
21:47:24 <janders> there will be a talk proposal submitted on this with a vendor, should be coming in very soon
21:47:25 <oneswig> How does that even work - you won't get multi-tenant network access?
21:48:01 <janders> with or without SRIOV?
21:48:09 <janders> both work, but the implementation is very different
21:48:43 <oneswig> If your Ironic node is attached to a tenant network, how does it become a generalised hypervisor?
21:49:19 <janders> are you familiar with Red Hat / tripleo "standard" OpenStack networking (provisioning, internalapi, ...)?
21:49:58 <oneswig> sure
21:50:13 <janders> I define internalapi network as a tenant network
21:50:26 <janders> same with externalapi (which is router-external)
21:50:41 <janders> then, I boot the ironic nodes meant to be nova-compute on internalapi network
21:50:48 <janders> this way they talk to controllers
21:50:57 <janders> the router allows them to hit external API endpoints, too
21:51:19 <janders> and - in the non-SRIOV case we can run VXLAN over internalapi too (though this is work in progress, I've got the SRIOV implementation going for now)
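[A rough sketch of the shape of this arrangement, using the openstacksdk client: the internalapi network is created as an ordinary tenant network, and an Ironic-backed instance destined to run nova-compute is booted onto it so it can reach the controllers. All names, flavors, images and CIDRs below are placeholders rather than janders's actual configuration.]

```python
# Rough sketch: treat the tripleo-style internalapi network as an ordinary
# tenant network and boot a baremetal (Ironic-backed) node onto it, so the
# node can reach the controllers and later run nova-compute itself.
# All names, flavors and CIDRs are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")  # cloud name from clouds.yaml

# Tenant network standing in for internalapi.
net = conn.network.create_network(name="internalapi")
conn.network.create_subnet(network_id=net.id, name="internalapi-subnet",
                           ip_version=4, cidr="172.16.2.0/24")

# Boot an Ironic-backed instance on internalapi; the baremetal flavor and
# image would map to the hardware intended to become a hypervisor.
server = conn.compute.create_server(
    name="nova-compute-0",
    image_id=conn.compute.find_image("overcloud-compute").id,
    flavor_id=conn.compute.find_flavor("baremetal").id,
    networks=[{"uuid": net.id}],
)
conn.compute.wait_for_server(server)
```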
21:51:36 <oneswig> I guess you're not using multi-tenant Ironic networking, otherwise you wouldn't be able to see other networks (genuine tenant networks) from your new hypervisor, right?
21:52:02 <oneswig> Or are all tenant networks VXLAN?
21:52:13 <janders> CX3 SRIOV implementation uses pkey mapping
21:52:29 <janders> there's the "real" pkey table and then each VF has its own virtual one
21:53:08 <janders> this seems to work well without need for pkey trunking
21:53:34 <janders> so - now that you asked me I will test this more thoroughly but I did have it working with VMs in multiple tenant networks
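[To make the PF-versus-VF pkey point concrete, a small hedged sketch that simply dumps the per-port partition key tables exposed under sysfs; on an SR-IOV setup each virtual function appears as its own device with its own (virtual) pkey table. Device names such as mlx4_0 are examples only.]

```python
# Small sketch: dump the partition key (pkey) table for each InfiniBand
# device/port via sysfs. With SR-IOV the physical function and each virtual
# function show up as separate devices, each with its own (virtual) pkey
# table. Device names like mlx4_0 / mlx4_1 are examples only.
import glob
import os

for port_dir in sorted(glob.glob("/sys/class/infiniband/*/ports/*")):
    dev = port_dir.split("/")[4]          # e.g. mlx4_0
    port = os.path.basename(port_dir)     # e.g. 1
    pkeys = []
    for idx_path in sorted(glob.glob(os.path.join(port_dir, "pkeys", "*")),
                           key=lambda p: int(os.path.basename(p))):
        with open(idx_path) as f:
            val = f.read().strip()
        if val not in ("0x0000", "0x0"):  # skip empty table slots
            pkeys.append(val)
    print(f"{dev} port {port}: {pkeys}")
```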
21:53:51 <oneswig> sounds promising, keep us updated!
21:54:07 <janders> there might be some extra work required for CX4 and above, but it is doable
21:54:52 <janders> I will :) I will test this again more thoroughly and happy to drop you an email with results if you like?
21:56:09 <oneswig> yes please janders, this is very interesting
21:56:16 <janders> will do!
21:56:24 <oneswig> OK we are nearly at the hour, anything else to add?
21:56:28 <janders> just to finish off I will quickly jump to my opening question
21:56:31 <janders> 1) CSIRO baremetal cloud project report, 2) Cloudifying NVIDIA DGX-1, or 3) dual-fabric (Mellanox Ethernet + IB) OpenStack?
21:56:44 <janders> which one do you think will get the most interest (and hence best chances of getting into the Summit)?
21:57:01 <janders> with 1) I might have even more interesting content for Denver timeframe
21:57:17 <janders> but it's doable for Berlin, too
21:58:37 <oneswig> Number 2 - I think there is strong competition for this.  It might be interesting to do the side-by-side bake-off hinted at by 3
21:58:49 <oneswig> (just my personal view, mind)
21:59:07 <oneswig> No harm in submitting all options!
21:59:14 <janders> thanks Stig! :)
21:59:14 <oneswig> OK, we are out of time
21:59:21 <janders> I already have one in, so limited to 2
21:59:30 <oneswig> thanks all
21:59:35 <oneswig> #endmeeting