11:00:32 <oneswig> #startmeeting scientific-sig
11:00:33 <openstack> Meeting started Wed Oct 9 11:00:32 2019 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:36 <openstack> The meeting name has been set to 'scientific_sig'
11:00:55 <oneswig> greetings
11:01:01 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_9th_2019
11:01:05 <oneswig> (such as it is)
11:01:41 <dh3> hi
11:02:02 <oneswig> Hi dh3, how's things?
11:02:09 <dh3> busy :) no change there!
11:02:47 <oneswig> Saw an interesting development this morning
11:02:59 <oneswig> #link SUSE drops their OpenStack product https://www.suse.com/c/suse-doubles-down-on-application-delivery-to-meet-customer-needs/
11:03:09 <oneswig> Crikey
11:03:20 <verdurin> Hello.
11:03:52 <dh3> SUSE has always been a bit niche (IME) and OpenStack is niche, so maybe they're only dropping a tiny bit of business
11:03:53 <oneswig> Hi verdurin, afternoon
11:04:00 <janders> g'day
11:04:02 <janders> sorry for being late
11:04:07 <janders> SUSE dropping OpenStack?
11:04:25 <oneswig> dh3: does seem like it. I'm mostly worried about where it leaves their core contributors
11:04:47 <dh3> mmm, that's a point
11:04:52 <oneswig> janders: yes, it seems like it
11:05:34 <oneswig> I wonder what alternatives they'll be transitioning customers to - it doesn't say in the post...
11:06:27 <janders> good question, though I have to say I don't think I've ever met anyone running SUSE Cloud...
11:07:05 <oneswig> actually, not sure I have either.
11:07:58 <oneswig> They do appear to have taken a wrong turn in deployment tooling in recent years - not sure what they ended up with.
11:08:33 <janders> I did consider them for a couple of projects, to be fair, but the tooling was always like... wait... what?!?
11:08:47 <janders> active/passive databases, Chef config management, etc.
11:09:00 <oneswig> Well, interesting times.
11:09:19 <janders> it does look like OpenStack is past the top of the curve
11:09:34 <janders> those who are using it right are having a blast
11:09:39 <janders> the others realise maybe it's not the way
11:10:08 <oneswig> janders: there's certainly a lot less hype and more practical usage, it seems.
11:10:36 <janders> shame the likes of IBM seem to be slowing down development of things such as GPFS-OpenStack integration
11:10:51 <janders> good that they wrote the container-friendly version before that happened
11:10:56 <janders> it's a shame, because it's a killer
11:11:28 <janders> with this I've got faster storage in VMs than quite a few supercomputers
11:12:01 <janders> and that's without baremetal/SR-IOV
11:12:10 <oneswig> I wonder what will come of work like this for secure computing frameworks: https://blog.adamspiers.org/2019/09/13/improving-trust-in-the-cloud-with-openstack-and-amd-sev/
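For context on the AMD SEV work linked above: from Nova's Train release, a guest can request SEV memory encryption through a flavor extra spec. A minimal sketch, assuming SEV-capable AMD compute hosts and a Train-or-later cloud; the flavor name is illustrative:

    # illustrative only: a flavor that requests an SEV-encrypted guest
    # (requires SEV-capable AMD hosts and Nova Train or later)
    openstack flavor create m1.sev --vcpus 2 --ram 4096 --disk 20
    openstack flavor set m1.sev --property hw:mem_encryption=True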
11:12:30 <janders> yet people are still undecided whether to develop it further... :/
11:12:38 <dh3> we've had users stand up GPFS on OpenStack instances (without needing sysadmin help!) as part of a k8s layer, that must say something
11:12:58 <oneswig> dh3: that's impressive and surprising
11:13:23 <oneswig> might also be a statement on your users
11:13:56 <oneswig> Is your k8s an OpenShift deployment or do you roll your own?
11:14:01 <dh3> some of them do jump in with all 3 feet :)
11:14:05 <janders> yeah, that is the justification I hear from IBM when they say OpenStack integration will not be developed further. Diverting resources to k8s.
11:14:22 <dh3> our k8s is DIY at the moment but we are pushing towards Rancher to get the nice UI layer
11:14:57 <oneswig> dh3: kubespray or more fundamentally DIY?
11:15:21 <oneswig> hmm, seems we've lost dh3
11:15:24 <janders> GPFS is quite flexible - it can be relatively simple or quite complex
11:16:07 <janders> our first attempt at deploying GPFS-EC ended up destroying all the OSes it was supposed to run on
11:16:28 <janders> the magic script was filtering out the software RAID but not the member drives - and formatted them :D
11:16:40 <dh3> (dunno what happened there)
11:16:49 <dh3> AFAIK people are building on kubespray
11:16:56 <janders> probably because we were the first site to deploy with swraid
11:17:19 <oneswig> janders: oops...
11:17:56 <janders> got the hotfix overnight and the second deploy was all good
11:18:52 <janders> now it's doing 120/50 GB/s read/write on six servers and six clients
11:19:10 <janders> GPFS-backed cinder is getting close to 20 GB/s in VMs
11:19:13 <oneswig> janders: are you using SR-IOV or ASAP2 for storage I/O to VMs?
11:19:24 <oneswig> or is it all block?
11:19:31 <janders> that's the best part
11:19:33 <janders> no
11:19:53 <janders> hypervisors connect to GPFS over HDR100/HDR200 (depending which ones)
11:20:00 <dh3> janders: do you have any write-ups, blog posts, etc?
11:20:10 <janders> VM networking is stock standard
11:20:24 <janders> no - but happy to chat if you're interested
11:20:35 <janders> jacob.anders.au@gmail.com
11:21:06 <dh3> potentially yes. we haven't used GPFS (on the "systems supported" side) for years, but always happy to look around. I'll drop you an email, thanks
11:21:28 <janders> we could make it quite a bit faster, especially on the write side, but we traded that off for redundancy
11:22:28 <janders> losing a server doesn't hurt it; it could probably run without two, but that's not a requirement so I haven't tested that scenario
11:23:01 <janders> essentially it's Ceph architecture with HPC filesystem performance
11:23:19 <janders> and minimal changes to the OpenStack side
11:23:41 <oneswig> janders: have you seen roadmaps for that?
11:23:56 <janders> for what exactly?
11:24:23 <oneswig> Ongoing development for GPFS+OpenStack
11:24:47 <janders> I'm being told they pulled resources from this and only maintain it, but do not develop it
11:24:47 <oneswig> Are there constraints on the version you can run?
11:25:00 <janders> I've got it going with RH-OSP13
11:25:16 <janders> but I think it would run with the latest, too
11:26:15 <janders> I currently have cinder and glance integrated; I tested nova too and it worked fine
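For reference, the cinder half of an integration like this uses the in-tree IBM GPFS volume driver. A minimal cinder.conf sketch follows; the driver and option names come from that driver, but the paths, backend name and copy-on-write setting are illustrative assumptions, not janders's actual configuration:

    # cinder.conf - minimal sketch of a GPFS-backed volume backend
    [DEFAULT]
    enabled_backends = gpfs

    [gpfs]
    volume_backend_name = gpfs
    volume_driver = cinder.volume.drivers.ibm.gpfs.GPFSDriver
    # GPFS must already be mounted at this path on the cinder-volume hosts
    gpfs_mount_point_base = /gpfs/fs1/cinder/volumes
    # keeping glance images on the same filesystem enables copy-on-write clones
    gpfs_images_dir = /gpfs/fs1/glance/images
    gpfs_images_share_mode = copy_on_write
    gpfs_sparse_volumes = True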
11:27:26 <dh3> do you mount GPFS as (say) /var/lib/nova/instances on the hypervisor then let everything run as normal?
11:27:39 <janders> for nova, yes
11:27:47 <janders> for cinder, no
11:29:35 <janders> though with nova I only turned it on for testing - I have 1.5TB of NVMe in each compute node
11:29:54 <janders> GPFS can do 70k IOPS per client; that NVMe can do 10x that
11:30:19 <janders> so right now it's cinder for capacity/throughput and ephemeral for IOPS
11:30:41 <oneswig> A good balance of options
11:30:47 <verdurin> janders: I'll contact you about this too, if I may
11:30:49 <oneswig> (for the user that knows how to exploit that)
11:31:38 <janders> sure - no worries, happy to chat more
11:31:53 <dh3> similar to us, but the compute is SSD not NVMe
11:32:14 <janders> it is a very interesting direction, because there's a lot of push for performance, hence the move away from VMs to containers and baremetal
11:32:24 <oneswig> janders: what kind of workloads are you supporting with this cloud?
11:32:40 <janders> and with something like that, if it's storage performance they want, VMs suddenly become good enough again
11:33:01 <janders> it's for the cybersecurity research system. The workloads are still being identified/decided.
11:33:13 <janders> GPFS was designed to stand up to ML workloads that were killing our older HPC storage
11:33:40 <janders> it's essentially a smaller, more efficiently balanced version of our BeeGFS design
11:33:55 <janders> if I had PCIe4 these could do 40 GB/s read per node
11:34:02 <janders> but unfortunately I don't
11:35:47 <janders> do you guys have any experience tuning GPFS clients for good IOPS numbers?
11:36:43 <oneswig> sorry, not here jamespage
11:36:52 <oneswig> janders, I mean - lost again?
11:39:14 <oneswig> welcome back janders :-)
11:39:26 <janders> thanks! :) network hiccups
11:39:28 <verdurin> janders: our cluster nodes are all GPFS, but we haven't needed to do any advanced tuning with them
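For what it's worth, "advanced tuning" on the GPFS client side usually starts with a handful of daemon parameters. The parameter names below are real Spectrum Scale settings, but the values and the node class are placeholders - a sketch, not a recommendation:

    # illustrative client-side knobs often raised for IOPS-heavy workloads
    # (values are placeholders; benchmark before and after changing them)
    mmchconfig pagepool=8G,maxFilesToCache=1000000,workerThreads=512 -N client_nodes
    # pagepool changes need the daemon restarted (or -i where supported)
    mmshutdown -N client_nodes && mmstartup -N client_nodes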
11:39:46 <janders> do you remember what IOPS you are getting on clients?
11:40:17 <verdurin> Not offhand, but I can find out.
11:40:32 <janders> that would be quite interesting
11:40:44 <janders> what's the storage backend on your cluster?
11:40:53 <verdurin> This is mainly EDR to spinning disk.
11:41:00 <verdurin> Some FDR.
11:41:09 <janders> JBODs or array-based?
11:41:43 <verdurin> DDN arrays.
11:42:27 <verdurin> The latest iteration will have a small pool of SSD.
11:42:39 <janders> what's the capacity?
11:43:48 <verdurin> ~7PB usable.
11:44:38 <janders> nice!
11:44:47 <janders> ours is ~250TB
11:44:52 <verdurin> Hence we're not desperately keen on capacity-based licensing...
11:44:52 <janders> but all-NVMe
11:45:10 <janders> then EC is not a good idea - cheaper to buy more kit
11:45:57 <verdurin> oneswig: your earlier point about contributors is an important one. Lots of people stepping down from PTL-type roles of late, I see.
11:46:32 <oneswig> Yes, that has been a trend.
11:48:33 <oneswig> We've been looking recently at the condition of Tungsten Fabric integration with Kolla-Ansible. It's pretty good, in that it's less than a year behind, but it doesn't appear to be advancing beyond that point. I'm still investigating.
11:49:38 <oneswig> It appears Tungsten has some invasive requirements for installing widgets in the containers of other services.
11:50:36 <oneswig> janders: in a vague attempt to follow the agenda, there was a question on the SIG Slack about usage accounting. What do you use for this, if anything?
11:51:10 <janders> nothing at the moment
11:51:49 <janders> our User Services guys interview users to identify how much resource they really need (as opposed to what they think they need, or what they would like) and set the quotas accordingly
11:51:58 <janders> from there it's assumed it's a solved problem
11:52:11 <janders> not very accurate, but it kinda works for now
11:52:58 <janders> better than giving users whatever they ask for on a shared system, I suppose
11:53:06 <janders> hope to have a better answer a few months down the track :)
11:53:21 <oneswig> Thanks janders, good to know
11:54:10 <janders> given the User Services guys are nice to us, we are nice to them and give them a simple yaml interface for managing projects, memberships and quotas
11:54:24 <janders> from there ansible sets it all up; they don't need to know OpenStack commands
11:56:40 <dh3> "simple" + "yaml"... :/ our service desk get to set quotas using CloudForms (it was only marginally quicker to set up than writing our own interface)
11:57:12 <oneswig> Nearly at time - anything for AOB?
11:57:14 <verdurin> oneswig: billing (or rather costing) keeps coming up for us, and we've relied on rough heuristics for now.
11:58:40 <oneswig> verdurin: noted. At this end, I'm hoping for priteau to update the study he did on CloudKitty earlier this summer.
11:58:59 <janders25> ################################
sample-five:
  name: sample-five
  enabled: true
  description: asdf
  quota:
    instances: 3
    cores: 20
    ram: 25600
    volumes: 6
    gigabytes: 5
    snapshots: 3
    floating_ips: 8
  members:
    - user123
  networking:
    create_default: true
    create_router: true
################################
11:59:24 <janders25> I wanted to show that yaml can be simple to work with
11:59:43 <janders25> my networking is really bad today too
11:59:52 <dh3> I know that, but some people are CLI-avoidant!
12:00:05 <priteau> janders25: If they were not nice to you, the interface would be XML?
12:00:08 <dh3> (hard enough getting them to edit the LSF users file)
12:00:09 <janders25> true! :)
12:00:13 <oneswig> And for APEL users, at some point we hope to complete the loop with data submission from OpenStack infrastructure
12:00:31 <oneswig> hi priteau - ears burning :-) ?
12:00:38 <oneswig> Ah, we are out of time
12:00:38 <janders25> priteau: yes! XML... you nailed it, I was editing pacemaker configs today... argh!
12:00:52 <oneswig> xml gives a vintage feel nowadays
12:00:59 <oneswig> #endmeeting
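As a footnote to the YAML-driven quota workflow sketched above: the "ansible sets it all up" step could map the sample's quota block onto the openstack.cloud collection roughly as below. This is a sketch assuming that collection and a clouds.yaml entry named "mycloud"; the tooling janders25 described isn't public, so this illustrates the idea rather than their implementation.

    # apply-quotas.yml - illustrative only
    - hosts: localhost
      gather_facts: false
      tasks:
        - name: Apply the quota block from the sample-five project definition
          openstack.cloud.quota:
            cloud: mycloud        # clouds.yaml entry (assumed name)
            name: sample-five     # project to set quotas on
            instances: 3
            cores: 20
            ram: 25600
            volumes: 6
            gigabytes: 5
            snapshots: 3
            floatingip: 8         # the sample's "floating_ips"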