21:00:15 <oneswig> #startmeeting scientific-sig
21:00:16 <openstack> Meeting started Tue Oct 29 21:00:15 2019 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:20 <openstack> The meeting name has been set to 'scientific_sig'
21:00:23 <oneswig> ahoy!
21:00:29 <trandles> hello
21:00:33 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_29th_2019
21:00:38 <oneswig> Hi trandles, afternoon
21:01:21 <oneswig> What's new?
21:01:41 <trandles> Nothing much, just super busy
21:01:52 <trandles> You preparing for Shanghai?
21:02:13 <oneswig> I have a visa in my passport, at least
21:02:43 <oneswig> Just looking through the schedule at the moment...
21:03:08 <trandles> I haven't had a chance to see the schedule. Mostly been focusing on SC
21:03:27 <oneswig> Some useful talks about operations at scale.  Definitely on my radar.
21:03:55 <oneswig> trandles: are you presenting material at SC?
21:04:23 <trandles> I'm a co-author on 3 posters (one for CANOPIE) but not presenting
21:04:39 <trandles> Well, the CANOPIE submission is a poster and a short paper
21:04:39 <janders> g'day all!
21:04:46 <trandles> Hi janders
21:04:48 <oneswig> hi janders
21:05:15 <b1airo> o/
21:05:22 <oneswig> Hi b1airo!
21:05:27 <oneswig> #chair b1airo
21:05:28 <openstack> Current chairs: b1airo oneswig
21:05:44 <oneswig> How's everyone enjoying the rugby? :-)
21:05:54 <b1airo> that's it, i'm out!
21:06:22 <b1airo> such a horrible game from ABs
21:06:34 <oneswig> got to make hay when the sun shines, particularly with the all-blacks...
21:06:41 <trandles> oneswig gets a +1 for that
21:07:05 <b1airo> indeed, they couldn't get their act together after realising England were there to play and play hard
21:07:59 <oneswig> With a 9am kick-off back home, many people lost a weekend after that...
21:08:03 <b1airo> not sure who to barrack for in the final
21:08:27 <oneswig> can't remember - who's in it b1airo?
21:08:32 <oneswig> :-)
21:08:34 <trandles> lol
21:08:40 <trandles> relentless
21:08:51 <oneswig> I'm sure we had an agenda somewhere
21:09:35 <b1airo> All Blacks are just helping prepare the nation for the cricket season...
21:10:01 <oneswig> ah, at least you have summer to look forwards to
21:10:22 <oneswig> Anyone going to Shanghai from NeSI?
21:11:19 <b1airo> no, not this time round i'm afraid. there was some interest from NIWA but couldn't get the travel sorted/approved
21:11:53 <oneswig> too bad.
21:12:03 <oneswig> I'm interested in the talk from Pawsey
21:13:01 <oneswig> ... which doesn't appear to be in the schedule any more...
21:14:19 <b1airo> oh, what was it about? their not-Nectar Nectar-cloud ... :-)
21:14:44 <oneswig> Using CephFS as a cluster filesystem for MPI workloads, IIRC.
21:15:07 <oneswig> My personal experience was mostly in the "not yet" camp
21:15:43 <oneswig> But we have some good operational experience of it serving home dirs for non-MPI workloads.
21:16:16 <janders> looks like it's been removed indeed :(
21:16:20 <b1airo> oh ok, i didn't know they had been attempting anything like that. I know UniMelb have been, with fairly middling results. i would agree with the "not yet" assessment
21:16:20 <janders> I'd love to watch it
21:16:48 <b1airo> in fact, i'm kind of skeptical about it ever being a good option for that
21:17:04 <janders> yeah that's why we went GPFS instead (which was running fine till the IB switch died; it's having a break while the RMA is being sorted)
21:17:46 <oneswig> Maybe they had second thoughts about their experiences?
21:17:47 <janders> it hasn't been a good month for storage for us
21:17:51 <b1airo> i can see it being useful for all persistent storage applications in HPC, all the way to "project/programme" space, but for true scratch it's just not a good design fit
21:18:19 <oneswig> It's not anything like as fast.  We've used it for hybrid cloud.
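[A minimal sketch of the home-directory use oneswig describes: a plain kernel CephFS mount driven from Python. Monitor hosts, export path, cephx user and secret file are all hypothetical.]

    # kernel CephFS mount for home directories -- assumes the ceph kernel
    # module is available and a cephx keyring is already in place
    import subprocess

    subprocess.run([
        "mount", "-t", "ceph",
        "mon1,mon2,mon3:/volumes/home",   # path exported by the MDS for home dirs
        "/home",
        "-o", "name=homedirs,secretfile=/etc/ceph/homedirs.secret",
    ], check=True)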
21:19:37 <janders> I was quite disappointed yesterday to hear 1) BeeGFS doesn't have any QoS and 2) it's not on the roadmap
21:19:57 <janders> I suppose BeeOND can cover that off in a way but still
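[A hedged sketch of the per-job BeeOND pattern janders refers to: an on-demand BeeGFS assembled across a job's nodes at prologue time, which partly covers the missing QoS by giving each job its own filesystem. The nodefile and paths are hypothetical.]

    # spin up an on-demand BeeGFS across the nodes allocated to one job
    import subprocess

    subprocess.run([
        "beeond", "start",
        "-n", "/var/spool/job123/nodefile",  # nodes belonging to this job
        "-d", "/local/nvme/beeond",          # per-node storage to aggregate
        "-c", "/mnt/beeond-job123",          # mountpoint the job sees
    ], check=True)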
21:20:21 <oneswig> janders: what's happened this month?
21:20:39 <janders> do you guys have any experience where BeeOND for job X is running on nodes where job Y is running?
21:20:46 <janders> oneswig: where do I start...
21:21:08 <janders> we had some metadata lockups on BeeGFS not handled by builtin HA which caused a lot of damage
21:21:08 <oneswig> No, sorry, sounds like a proper noisy neighbour issue
21:21:40 <janders> and last week HDR switch died in my OpenStack kit
21:21:42 <oneswig> janders: data loss?
21:22:00 <janders> not much of that but a lot of downtime at the least appropriate time
21:22:07 <oneswig> HDR switch, but that's practically brand new!
21:22:16 <janders> yeah hence RMAs come out of Israel
21:22:30 <b1airo> the joys of Mellanox support
21:22:35 <janders> no local stock
21:23:02 <oneswig> On a related note, these guys didn't get the news: https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/24298/hpc-and-openstack-via-omni-path-architecture
21:23:03 <janders> I designed the system with one IB switch as it's just under 70 nodes (and I would struggle to fit it in the budget otherwise)
21:23:22 <oneswig> "future work"...
21:23:28 <janders> haha
21:23:48 <oneswig> 70 x 100G or 70x 200G?
21:23:50 <janders> but I'll see if I can get more switching there - it would be nice to experiment with IB-multipath-GPFS
21:24:03 <janders> a combination
21:24:17 <janders> 80%-20% HDR100-HDR200
21:24:27 <janders> GPFS and some clients are 200
21:25:16 <b1airo> lol! seriously though, how did that talk get in..?
21:25:28 <janders> through side avenues? :)
21:25:40 <janders> or maybe keyword sponsored?
21:25:44 <b1airo> maybe it replaced the Pawsey one
21:26:26 <oneswig> I'm sure it has nothing to do with affiliation with a Summit Diamond Sponsor
21:27:07 <janders> as much as I don't think this is a relevant direction, it would be interesting to have a look at the mechanics of the integration mechanism
21:27:17 <oneswig> Ironic without PXE seems interesting https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/24274/ironically-reeling-in-large-scale-bare-metal-deployment-without-pxe
21:27:30 <janders> cause I asked Intel exactly that a year and a bit back and they were like "uh... oh... we don't think we can do that."
21:27:33 <b1airo> anyone here on the selection panel? my experience is the panel makes the final decision, have never seen it changed by foundation staff except for logistical reasons
21:27:40 <oneswig> janders: right, and there's a few OPA systems out there.
21:28:10 <janders> that PXEless talk is cool
21:28:27 <janders> Julia mentioned some of the underlying bits at the PTG in Denver
21:28:30 <oneswig> b1airo: think you're right.  I saw the panel members at one point.
21:28:32 <janders> looks like it's coming together
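[For context, a minimal openstacksdk sketch of the PXE-less direction being discussed, assuming it is the Redfish virtual-media boot work: the BMC attaches the deploy ISO itself, so no DHCP/TFTP/PXE infrastructure is needed. Cloud and node names are hypothetical.]

    import openstack

    conn = openstack.connect(cloud="mycloud")  # assumed clouds.yaml entry
    conn.baremetal.update_node(
        "node-01",
        driver="redfish",
        boot_interface="redfish-virtual-media",  # boot from a BMC-attached ISO, no PXE
    )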
21:29:38 <b1airo> hasn't the Dell Ironic driver always worked without PXE...?
21:30:07 <oneswig> The iDRAC driver and "always worked" in the same sentence, hmmm.
21:30:19 <janders> LOL and +1
21:30:41 <janders> I used Ironic with Dell kit a fair bit, but it was always generic driver
21:30:50 <oneswig> We've spent some time with it again recently, still hit issues with stuck jobs
21:30:54 <janders> mostly due to the fact no one I spoke to at Dell was able to explain how to use it
21:31:11 <janders> Arkady stepped in to help some time back but I moved to Intel kit since
21:31:35 <janders> I think at least the doco should be less bad now
21:32:42 <oneswig> I recall there were some useful-looking Ansible playbooks built upon it in the last couple of years
21:37:24 <janders> did it get a bit quiet, or did I drop out?
21:38:12 <b1airo> oneswig: pretty sure i saw those playbooks in a blog of yours...
21:40:41 <oneswig> Not sure about that exactly, there were others we never got into using
21:40:56 <oneswig> Don't recall the names, alas
21:41:18 <b1airo> for BM clouds it does kind of feel like we need a better way of managing server imaging... there must be a way to pre-deploy common OS images ready for boot selection. some sort of boot drive mechanism that lets you deploy images and select the desired image in management mode, then in runtime mode it COW overlays the selected OS image
21:43:57 <oneswig> It does seem quite clunky, agreed.
21:44:58 <oneswig> I recall there was a switch a few years back from a company called Pluribus Networks that was all about pushing compute and data services out to the ToR
21:45:59 <oneswig> fancy that ... looks like they are still at it.
21:47:14 <b1airo> yeah, i seem to recall trandles dreaming up a ToR/rack-oriented architecture like this too
21:47:32 <oneswig> far out
21:47:52 <trandles> ah yeah, scalability in provisioning
21:48:14 <trandles> Our work is still in progress
21:49:01 <oneswig> Has your new team member started trandles?
21:49:15 <trandles> b1airo, what you're describing in the "BM clouds" comment is kinda like what NERSC did with CHOS
21:49:27 <trandles> But CHOS was kinda clunky and used chroot
21:49:49 <trandles> oneswig, negative...but should be by the end of the year
21:51:10 <trandles> https://github.com/NERSC/chos
21:52:32 <oneswig> in a chroot, can you still mknod a device file for the raw root device and mount it?
21:53:42 <oneswig> asking for a friend :-)
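[The answer to oneswig's question is generally yes: in a plain chroot that retains CAP_MKNOD and isn't mounted nodev, the raw root device can be recreated and mounted. A sketch; the device major/minor numbers are hypothetical.]

    # classic chroot escape: recreate the host's root block device inside
    # the chroot and mount it -- chroot alone is not a security boundary
    import os, stat, subprocess

    dev = os.makedev(8, 2)                               # e.g. /dev/sda2 on the host
    os.mknod("/escape-root", stat.S_IFBLK | 0o600, dev)  # needs CAP_MKNOD
    os.mkdir("/escape-mnt", 0o700)
    subprocess.run(["mount", "/escape-root", "/escape-mnt"], check=True)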
21:54:01 <b1airo> ah yeah, i think i've seen / heard of that before. i guess to do this properly(tm) would require some sort of integration with UEFI boot etc to prevent a guest OS instance from fiddling with the pre-deployed images
21:54:40 <oneswig> ... or could the CH be CHarliecloud?
21:55:24 <trandles> I'm much more in favor of something like kexec + pivot_root for rapid re-provisioning
21:56:04 <trandles> kexec (if a new kernel is required) and then the pivot_root to the new image...Ironic has talked about supporting it
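[A hedged sketch of the kexec path trandles describes: load the new image's kernel and jump to it directly, skipping firmware and POST entirely. The staging path and kernel arguments are hypothetical; assumes kexec-tools is installed and the script runs as root.]

    import subprocess

    NEWROOT = "/mnt/newroot"  # hypothetical location of the staged image

    # stage the new kernel and initrd without touching the firmware
    subprocess.run([
        "kexec", "-l", f"{NEWROOT}/boot/vmlinuz",
        f"--initrd={NEWROOT}/boot/initramfs.img",
        "--append=root=LABEL=newroot ro",
    ], check=True)

    # hand over to the new kernel immediately -- no UEFI, no POST
    subprocess.run(["kexec", "-e"], check=True)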
21:56:40 <trandles> I had a phone call with some people a couple months ago who are interested
21:56:52 <oneswig> In an Ironic context?
21:57:24 <trandles> yeah
21:57:51 <oneswig> It does violate that abstraction of the guest OS, but perhaps we just want faster boots, man!
21:58:24 <b1airo> yeah that sounds fine for a trusted environment / set of tenants, but in a true multi-tenant cloud env you'd need a method to protect (or at least verify) the OS image/install is good
21:58:26 <oneswig> trandles: I still haven't tried the ramdisk deploy driver.  Been meaning to for ages.
21:58:48 <trandles> Either let us replace UEFI with a kernel and ramdisk, or give us ways to "reboot" without going back through the firmware please
21:59:29 <trandles> oneswig: I had half a day a month ago to play with it some, but managed to mess something up along the way and haven't gone back to it yet :(
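[For reference, a minimal openstacksdk sketch of the ramdisk deploy interface mentioned above: the node boots a kernel/ramdisk pair directly and nothing is written to disk. Node name and image URLs are hypothetical.]

    import openstack

    conn = openstack.connect(cloud="mycloud")
    node = conn.baremetal.update_node(
        "node-01",
        deploy_interface="ramdisk",   # diskless: boot straight into the ramdisk
        instance_info={
            "kernel": "http://images.example.com/vmlinuz",
            "ramdisk": "http://images.example.com/initrd.img",
        },
    )
    conn.baremetal.set_node_provision_state(node, "active")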
21:59:30 <oneswig> ... or give us a little game to play while the lifecycle controller does its stuff in the background?
21:59:48 <oneswig> ah, we are at time.
21:59:50 <oneswig> Any more?
22:00:17 <b1airo> oh, whoops - i meant to leave for a meeting 10 minutes ago!
22:00:27 <b1airo> cya!
22:00:35 <oneswig> OK, nice talking to you again guys.  I'll put some of these talking points on the flip chart for the SIG PTG.
22:00:36 <janders> "standby" mode for ironic was on the radar for the current Ironic cycle
22:00:39 <oneswig> See you in Denver!
22:00:45 <oneswig> #endmeeting