11:00:26 <oneswig> #startmeeting scientific-sig
11:00:27 <openstack> Meeting started Wed Jun  6 11:00:26 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:30 <openstack> The meeting name has been set to 'scientific_sig'
11:00:40 <verdurin> Afternoon.
11:00:50 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_June_6th_2018
11:00:58 <oneswig> Hi verdurin, good afternoon
11:01:05 <priteau> Hello
11:01:06 <daveholland> hi
11:01:21 <oneswig> hello all
11:01:41 <daveholland> (might be distracted, working on some instances whose IP addresses show in Horizon but not "openstack server show")
11:02:00 <oneswig> daveholland: intriguing...
11:02:25 <daveholland> AOB :)
11:02:39 <oneswig> keep us updated!
11:03:07 <oneswig> OK shall we get on with the show...
11:03:41 <oneswig> #topic summit roundup
11:04:10 <oneswig> We had a well-attended session.  I think there were probably 60 people in for the meeting and 80 for the lightning talks
11:04:13 <oneswig> a good showing
11:04:38 <b1airo> o/
11:05:01 <oneswig> Hey b1airo, evening
11:05:07 <oneswig> #chair b1airo
11:05:08 <openstack> Current chairs: b1airo oneswig
11:05:11 <verdurin> oneswig: was looking at Tuesday logs, have people thought of more talk recordings to recommend since then?
11:06:13 <oneswig> John's talk on preemptible instances went online. I think it got overlooked.
11:06:45 <oneswig> #link preemptible instances and bare metal containers https://www.openstack.org/videos/vancouver-2018/containers-on-baremetal-and-preemptible-vms-at-cern-and-ska
11:07:34 <oneswig> John also did a pretty comprehensive roundup on Scientific SIG interests from the forum
11:08:12 <oneswig> #link Scientific SIG and forum https://www.stackhpc.com/openstack-forum-vancouver-2018.html
11:08:17 <b1airo> yeah i saw that on twitter, haven't had a chance yet though
11:08:49 <daveholland> it's a good read
11:09:01 <oneswig> thanks daveholland, will pass that on
11:10:41 <oneswig> daveholland: any session picks from you?
11:11:39 <daveholland> there were a few things that stood out... the session on "enterprise problems" was familiar and reassuring (other people also find lost/orphaned instances, for example)
11:12:21 <daveholland> one tiny nugget from a CERN talk that raised eyebrows: 8 minutes average VM boot time? maybe there is pre-work in scheduling GPUs or other scarce resources, or ironic cleaning... but we think 1 minute is on the slow side
11:12:47 <verdurin> daveholland: might be CVMFS-related
11:13:00 <oneswig> That's interesting.  I missed most of the CERN talks, alas.
11:13:18 <daveholland> @verdurin interesting point I hadn't thought of, thanks
11:14:16 <oneswig> but I'd assume all the CernVMFS repositories would be local, on-site?
11:14:26 <b1airo> you'd think
11:14:34 <daveholland> I found some other interesting points behind operating-at-scale, e.g. Walmart (I think) with dozens of OpenStack deployments/clouds
11:15:22 <verdurin> They would be, yes.
11:15:34 <b1airo> daveholland: yeah i've noticed a few enterprise users have scale in sheer number of deployments, whereas on the scientific side it's more likely to be large individual deployments
11:16:26 <b1airo> this is true for Ceph too, going by the inaugural advanced user meeting at the Red Hat summit last month
11:16:31 <oneswig> There was a good demo of exactly this in the keynotes - Riccardo Rocha's federated Kubernetes - scale through federation at the platform level instead
11:16:33 <daveholland> are enough people running large enough (single) deployments that there's a feel for where the sensible boundaries are? Sanger has a single ~5000-core deployment and is looking to add ~4000, currently pondering whether lumping it all in one is sensible
11:17:18 <b1airo> depends what you mean by all in one i guess, lots of different ways to architect
11:17:27 <daveholland> likewise for large Ceph (PB+)
11:17:44 <b1airo> scale-wise the only practical consideration for Nova is the number of hosts in a cell
11:17:47 <daveholland> single cell, 3x HA controllers (Pike)
11:17:57 <b1airo> but that doesn't have to translate into anything user-facing
11:18:41 <b1airo> as far as i remember 300 hosts in a single cell was the max rule of thumb (possibly according to rackspace)
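A rough way to relate that rule of thumb to a running cloud is to count the registered hypervisors and divide. The sketch below does this with openstacksdk; it assumes admin credentials are available via clouds.yaml or OS_* environment variables, and treats ~300 hosts/cell purely as the informal guideline mentioned above, not a hard Nova limit.

    # Minimal sketch: count hypervisors and compare against the informal
    # ~300-hosts-per-cell rule of thumb. Assumes openstacksdk is installed
    # and admin credentials are set via clouds.yaml / OS_* env vars.
    import math
    import openstack

    conn = openstack.connect()

    hosts = list(conn.compute.hypervisors())
    hosts_per_cell = 300  # informal rule of thumb, not a hard Nova limit

    print(f"{len(hosts)} hypervisors registered")
    print("suggested minimum cells at ~300 hosts/cell:",
          math.ceil(len(hosts) / hosts_per_cell))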
11:19:12 <oneswig> The thorny issue in any discussion on scale is how much it depends on how the control plane is used.  What kind of churn is going on at the instance level and how many services are active...
11:19:22 <b1airo> indeed oneswig
11:19:49 <b1airo> and placement adds quite a bit of extra api load it seems
11:20:16 <oneswig> brb
11:20:32 <b1airo> daveholland: ceph-wise the main consideration that i haven't seen any good guidance on is what a sensible limit to number of OSDs in a pool is
11:21:09 <b1airo> or, rather than a pool, a single PG set
11:21:55 <daveholland> so, number of OSDs supporting a PG? we have 3 (for 3-way replication) but I can see using more for EC (which we don't yet do). 3060 OSDs in total which is starting to feel a bit unwieldy
11:23:04 <b1airo> i mean, depending on how your crushmap is implemented (but assuming the default sort of layout), the more OSDs in a pool the greater the chance of concurrent failures and possible data loss
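That scaling argument can be made concrete with a crude back-of-envelope model: keep ~100 PGs per OSD, treat each PG's replica set as an independent pick of OSDs, and assume each OSD fails independently during a recovery window. More OSDs behind a pool then means more replica sets, and a higher chance that some PG loses every copy. This deliberately ignores CRUSH failure domains, and the per-OSD failure probability below is an illustrative number only.

    # Back-of-envelope sketch (not Ceph code): probability that at least one
    # PG loses all replicas, assuming independent OSD failures and independent
    # replica sets. Ignores CRUSH failure domains entirely.
    def p_any_pg_lost(num_pgs, osd_fail_prob, replicas=3):
        p_one_pg = osd_fail_prob ** replicas   # all replicas of one PG fail
        return 1 - (1 - p_one_pg) ** num_pgs   # ...for at least one PG

    for osds in (300, 1000, 3000):
        pgs = osds * 100 // 3                  # ~100 PGs per OSD, 3-way replication
        print(f"{osds:5d} OSDs, {pgs:6d} PGs: "
              f"P(loss) ~ {p_any_pg_lost(pgs, 0.001):.1e}")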
11:23:38 <StefanPaetowJisc> @oneswig, @b1airo: by the way… the vapourware I’ve promised for the last year and a half is almost here (had what looks like almost the final build yesterday). I might *actually* be able to… demo… macOS Integration with this thing called Moonshot. :-/
11:24:07 <oneswig> StefanPaetowJisc: hi Stefan, bring it on!
11:24:57 <b1airo> there was a talk many summits ago, possibly in tokyo, from "something"Stack, maybe UnitedStack, talking about how they implemented limited failure domains within crush to increase reliability and improve backfill
11:24:58 <oneswig> Would be interested to see integration with Keystone, or what the options are for this.
11:25:08 <StefanPaetowJisc> Indeed.
11:25:18 <daveholland> b1airo: I may not be understanding, will drop you a line
11:25:26 <b1airo> oneswig: haven't even seen a demo and already making feature requests!
11:25:43 <oneswig> b1airo: give an inch, take a mile...
11:26:04 <oneswig> b1airo: I'm interested in this discussion on PG size because it seems counter-intuitive
11:26:45 <oneswig> But perhaps we can carry on in #scientific-wg afterwards - jmlowe might be interested too
11:27:08 <daveholland> (I have to dash promptly at 1pm sorry)
11:27:28 <b1airo> not the PG size (as in num reps) but the total population of OSDs
11:27:39 <b1airo> found the slides from that talk: https://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
11:28:33 <daveholland> b1airo: ta, will read/learn/inwardly digest (and then be in touch :) )
11:28:36 <oneswig> Thanks, I'll take a look
11:28:43 <oneswig> ditto...
11:29:00 <oneswig> OK, next item?
11:29:37 <oneswig> #topic gathering best practice on handling controlled-access data
11:30:02 <oneswig> There was some good discussion here at the session.
11:30:27 <oneswig> Somehow the group membership has swung from particle physics to life sciences!
11:31:21 <oneswig> My aim is that if we can gather good practice it might make new content for the HPC book
11:31:39 <verdurin> oneswig: spent nearly all last week hosting a site visit for a large project in this area, so very timely for us.
11:31:58 <oneswig> I started an etherpad to start gathering fellow travellers: https://etherpad.openstack.org/p/Scientific-SIG-Controlled-Data-Research
11:32:10 <oneswig> verdurin: interesting, would love to hear more.
11:32:37 <b1airo> first question, what do we mean by "Implementation Standard"?
11:33:15 <oneswig> daveholland: at the Sanger OpenStack event, I met someone looking to combine a trifecta of sensitive data - genomics, patient records and location.  Who would that have been?
11:33:51 <oneswig> b1airo: I was thinking of regulatory frameworks.  Probably a better term.
11:34:06 <daveholland> Franceso at PHE maybe?
11:34:57 <oneswig> daveholland: not this time, it was someone from Sanger IIRC, possibly working with a population in Africa?
11:35:40 <b1airo> most of the big studies i hear about have some element like this in the data-science aspects
11:36:07 <b1airo> which is where we think the safe haven environment comes into play
11:36:20 <daveholland> humm, not sure, was the angle human, or cancer, or pathogen? that'd nail down which team I ask at least
11:36:44 <oneswig> b1airo: that project @Monash, how can we get them involved?  Was it Jerico?
11:37:13 <oneswig> daveholland: human I'd guess but it wasn't a detailed discussion
11:37:23 <daveholland> I think it was likely Martin in human genetics, they have links to Africa
11:38:19 <verdurin> daveholland: could be one of the malaria-related groups? We have people here with a presence at your end.
11:38:31 <b1airo> oneswig: yeah Jerico is the main resource doing technical stuff on it at the moment, and now we are looking to focus on this i'll suggest he joins (will forward him today's logs and the etherpad)
11:39:05 <oneswig> excellent, thanks b1airo
11:39:48 <oneswig> daveholland: I think (somehow) location data was coming from phones / gps as well.  Could that use of technology track it down?
11:40:06 <daveholland> oneswig: I will have to ask around, not heard of that aspect
11:40:34 <oneswig> Thanks - I hope I'm not making all this up :-)
11:40:38 <b1airo> oneswig: i was thinking about how to structure things around this topic
11:40:51 <b1airo> i think we need to break out specific areas as it's very broad
11:41:24 <oneswig> makes sense.  Different standards / regulatory frameworks will have common requirements and common solutions
11:41:50 <b1airo> there are also multiple layers in any solution to consider
11:42:04 <verdurin> b1airo: yes, data ingest, anonymization, data analysis etc.
11:42:40 <oneswig> were you thinking of layers as human processes vs platform vs infrastructure?
11:42:49 <b1airo> e.g. on the infra side you might need to have your OpenStack cloud implemented in a certain way and need specific controls in the environment, plus also ensure your deployment conforms to reporting requirements etc
11:43:28 <b1airo> that'd be one good way of carving it, yes
11:43:34 <verdurin> b1airo: yes, that was exactly the sort of discussion we were having last week
11:44:02 <b1airo> verdurin: with your guests?
11:44:08 <verdurin> b1airo: yes
11:44:47 <oneswig> Seems like a good approach b1airo - go ahead and note it in the etherpad
11:45:14 <oneswig> verdurin: any specific noteworthy items from that discussion?
11:45:26 <b1airo> so then if you assume your underlying infra risks are well managed you need to move on to the tenant-level infra, guests, networks etc
11:45:50 <verdurin> oneswig: don't think I'm allowed to say much publicly yet - will do as soon as I can
11:46:15 <oneswig> controlled-access data in action :-)
11:47:42 <oneswig> We should leave some time for AOB - any other items to cover here for now?
11:47:46 <b1airo> then you start getting more towards the application level and thinking about whether controlled data access is needed, plus how to get data screened and out of a controlled environment
11:48:41 <daveholland> also proper deletion: wiping TB+ volumes makes performance sad; is deleting the encryption key equivalent to deleting an encrypted volume; etc
11:49:30 <b1airo> plus lots of very specific, but important, points like that
11:50:13 <oneswig> b1airo: it seems a lot of people delegate responsibility for data management to the users.  Having trusted users bound by a usage agreement has to play a role here, doesn't it?  Is it a non-starter if you don't have users you trust?
11:50:17 <b1airo> i don't have wide knowledge of the various standards / requirements, but i think encryption at rest usually suffices for that daveholland
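On the key-deletion question, the crypto-erase argument is that data which only ever hits disk encrypted becomes unrecoverable once its key is destroyed, so destroying the key can stand in for wiping terabytes. The toy sketch below uses Fernet from the Python 'cryptography' package purely to illustrate the idea; real encrypted volumes (e.g. LUKS with keys held in a key manager such as Barbican) are handled differently.

    # Toy illustration of crypto-erase: destroy the key instead of the data.
    from cryptography.fernet import Fernet, InvalidToken

    key = Fernet.generate_key()                   # lives in the key manager, not on disk
    ciphertext = Fernet(key).encrypt(b"sensitive record")  # what the volume stores

    del key                                       # "delete" the volume by discarding the key

    try:
        Fernet(Fernet.generate_key()).decrypt(ciphertext)  # any other key fails
    except InvalidToken:
        print("ciphertext is unrecoverable without the original key")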
11:51:23 <b1airo> oneswig: it does seem to be surprisingly common, but i don't think it flies when you start dealing with clinical or commercial data
11:52:04 <oneswig> Providing controlled data to users you can't trust is where things get interesting and I guess that's what underpins the Monash project.
11:52:11 <b1airo> you can't expect scientists to be able to implement best practice IT security
11:53:11 <StefanPaetowJisc> You can’t, no. But making it easier for them to do so helps :-)
11:53:19 <oneswig> I wonder if that's always true, when scientists know the data is sensitive.
11:53:34 <b1airo> oneswig: it's a little bit analogous to the problem the bloomberg terminal solves i suppose
11:53:54 <oneswig> what problem is that?
11:54:05 <b1airo> data is the product
11:54:40 <oneswig> ok, got it
11:54:43 <b1airo> you want to "sell" it (let people use it), but not let them take it away
11:54:56 <b1airo> en masse that is
11:55:09 <oneswig> makes sense.
11:55:29 <oneswig> We ought to move on - time and all that.
11:55:35 <oneswig> #topic AOB
11:56:02 <b1airo> time to turn in that is
11:56:16 <verdurin> I want to hear about daveholland and his mystery IPs
11:56:19 <oneswig> #link Berlin CFP closes end of June! https://www.openstack.org/summit/berlin-2018/call-for-presentations/
11:56:35 <oneswig> daveholland: any update?
11:56:37 <StefanPaetowJisc> Uh-oh.
11:57:07 <daveholland> verdurin: looks like a hypervisor bug, many instances (but not all) on that hypervisor in this state. hypervisor logs "unexpected VIF plugged" event when the instance reboots (and the instance then has no NIC)
11:57:19 <b1airo> is that a mistake? i feel like it can't be right. previous summits haven't closed 4 months before the actual conference have they?
11:57:41 <verdurin> oneswig: it does feel a bit previous
11:57:48 <oneswig> b1airo: seems pretty harsh, who to ask - Jimmy?
11:57:52 <daveholland> fortunately most of these instances are expendable (dev, or auto-scaled workers) so.... byebye
11:58:19 <oneswig> daveholland: what does the virsh dumpxml say about network vifs?
11:58:22 <verdurin> daveholland: good - grtbr
11:58:26 <daveholland> if I get any closer to finding the cause I'll pass it on. But I need to dash now
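The check oneswig suggests can be scripted against libvirt directly (the API equivalent of virsh dumpxml): dump the domain XML on the affected hypervisor and list any <interface> elements to confirm whether the guest still has a NIC defined after reboot. A sketch, assuming libvirt-python is installed on the hypervisor; the domain name is hypothetical.

    # List the network interfaces libvirt thinks a domain has.
    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-0000abcd")  # hypothetical Nova domain name

    root = ET.fromstring(dom.XMLDesc(0))
    interfaces = root.findall("./devices/interface")
    if not interfaces:
        print("no network interfaces defined for this domain")
    for iface in interfaces:
        mac = iface.find("mac")
        target = iface.find("target")
        print(iface.get("type"),
              mac.get("address") if mac is not None else "?",
              target.get("dev") if target is not None else "?")
    conn.close()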
11:58:47 <oneswig> don't leave us hanging!
11:58:59 <verdurin> oneswig: do let us know if you hear about the submission date
11:59:10 <b1airo> there's a stubborn sysadmin inside everyone huh
11:59:19 <oneswig> verdurin: will do - I'll open that mail now
11:59:49 <b1airo> looks like we're done
12:00:09 <b1airo> back to Cumulus upgrades...
12:00:16 <oneswig> One other activity from the summit - two new code branches for Ceph RDMA to explore
12:00:19 <oneswig> I'm on it...
12:00:31 <b1airo> hmm sounds promising
12:00:45 <oneswig> b1airo: thought you were turning in...
12:00:58 <oneswig> we are out of time - final comments
12:01:15 <b1airo> w00t - last comment
12:01:26 <b1airo> #endmeeting