09:00:25 #startmeeting scientific_wg
09:00:26 Meeting started Wed Nov 23 09:00:25 2016 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:29 The meeting name has been set to 'scientific_wg'
09:00:55 #link today's agenda https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_November_23rd_2016
09:01:37 Blair says he'll be in-and-out today - you here yet Blair?
09:02:16 Let's give people a minute or two to gather
09:03:00 #topic Wrap up from Supercomputing 2016
09:03:04 #chair b1airo
09:03:04 Current chairs: b1airo oneswig
09:03:18 Howdy
09:03:27 Evening b1airo
09:03:35 Just getting started on the SC wrap-up
09:03:48 Kewl
09:04:02 Well, it was a good show, my first time. The level of OpenStack interest was high, but I have no prior experience to gauge the upward trend
09:04:06 I only got home from SLC yesterday!
09:04:16 Go by bike? :-)
09:04:41 Ha! Just left via Bryce Canyon
09:05:01 It was a good conference for OpenStack, I think
09:05:14 The nice people at SuperUser invited us to write a trip report based on the roundup I wrote on the company blog
09:05:44 #link http://www.stackhpc.com/stackhpc-at-supercomputing-2016.html
09:05:54 Not massively detailed though...
09:06:29 Heh - are they paying our expenses? :-)
09:06:30 Great job on getting science on the openstack.org homepage!
09:06:36 priteau: Kate did a great job in the panel session, it was good to meet her again
09:06:45 priteau: I know - incredible, isn't it!
09:07:17 Good morning all
09:07:17 Will you get statistics on the number of book downloads?
09:07:18 So quite a few expectations to live up to there
09:07:22 morning StefanPaetowJisc
09:07:36 Morning.
09:07:39 priteau: I expect I could ask, that would be good to know
09:07:45 Hi verdurin
09:08:21 I don't think there was any more to add on SC - anyone else?
09:09:00 OK, let's go to the next item
09:09:04 #item Scientific datasets - update from Sofiane
09:09:09 #topic Scientific datasets - update from Sofiane
09:09:10 oops
09:09:11 hi all
09:09:19 I just shared a document with you
09:09:24 Hi Sofiane, how are you doing
09:09:34 https://goo.gl/Rqb9qg
09:09:53 There's a summary of what we are doing
09:10:20 We are building an academic cloud for Swiss universities
09:10:44 and have numerous services on top of the infrastructure
09:10:50 which runs on OpenStack
09:11:11 SWITCH is hosting and operating the infrastructure
09:11:34 among the applications, we have big data and data science in general
09:11:41 and we need to host large datasets
09:11:55 compute has to be done close to the data
09:12:13 What do you expect the interface to be for users seeking access to specific datasets?
09:12:39 direct access from any application using S3 or Swift
09:12:56 or specific access with higher-level APIs
09:13:39 Are you thinking of creating a directory service?
09:13:57 We will definitely build a catalog
09:14:05 with appropriate search functionality
09:14:35 Do you expect it to be integrated with Horizon, or a separate freestanding catalog?
09:14:35 but there are many things that need improvement in the OpenStack ecosystem
09:14:39 sofianes_: the higher-level APIs are for applications that don't understand objects?
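For the "direct access from any application using S3 or Swift" interface described above, here is a minimal sketch of what a client application might do over the Swift API. The auth URL, credentials, container and object names are hypothetical placeholders, not SWITCH's actual endpoints:

    # Minimal sketch of direct dataset access over the Swift API.
    # Endpoint, credentials, container and object names are placeholders.
    import swiftclient

    conn = swiftclient.Connection(
        authurl='https://keystone.example.org:5000/v3',
        user='demo', key='secret', auth_version='3',
        os_options={'project_name': 'datasets',
                    'user_domain_name': 'Default',
                    'project_domain_name': 'Default'})

    # List the objects in a (hypothetical) dataset container
    headers, objects = conn.get_container('genomics-reference')
    for obj in objects:
        print(obj['name'], obj['bytes'])

    # Stream one object to disk rather than loading it into memory
    headers, body = conn.get_object('genomics-reference', 'chr1.fa.gz',
                                    resp_chunk_size=1024 * 1024)
    with open('chr1.fa.gz', 'wb') as f:
        for chunk in body:
            f.write(chunk)

From the application's point of view, the S3 path (e.g. through an S3-compatible gateway such as radosgw) would look much the same, which is the attraction of the "direct access from any application" model.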
09:14:46 Sahara, bare metal
09:15:09 verdurin: yes
09:15:56 we are also thinking about the possibility to co-host datasets
09:16:03 with other interested institutions
09:16:23 saverio has already initiated that with other NRENs
09:16:30 Do you also expect to provide access control to the datasets, or can every user access every dataset all the time?
09:16:38 both
09:16:46 IMHO this sort of hosting may be better served by something like CVFS
09:16:49 public data will be open
09:16:59 b1airo: new to me, got a link?
09:17:01 sofianes_: do you have details on what's missing for your use case, that might be turned into blueprints?
09:17:01 private data will require ACLs
09:17:08 b1airo: do you mean CVMFS?
09:17:21 As in CERN VM FS?
09:17:22 It would offer a more familiar interface for legacy applications and is already designed to distribute data around the Internet
09:17:32 Yes, CVMFS sorry
09:17:39 I am not familiar with CVMFS
09:17:44 but will look at it
09:18:19 performance is very important
09:18:19 And I should clarify - I don't mean instead of Swift, but as a potential complement (sorry, on phone)
09:18:49 the direction in which we will go will be directed by the use cases
09:19:04 sofianes_: my understanding is that it is an HTTP-delivered filesystem overlay. The particle physics people use it with a minimal VM image to attach specific code/data - right?
09:19:20 we have already posted a bug report for Sahara
09:19:23 oneswig: yes, that's right
09:20:02 oneswig: it has wider uses than that, i.e. not just for Micro-CernVM
09:20:10 our typical use cases come from machine learning, text analysis, genomics
09:20:16 sofianes_: I notice billing in the SWITCH diagram - any plans to charge for access?
09:20:28 we have two directions
09:20:43 either have the datasets as an added value to attract users
09:21:00 or build a marketplace for data (not trivial)
09:21:35 oneswig: could you tell us more about the particle physics use case?
09:21:53 I don't know if there is a precedent for charging for datasets, might be problematic...
09:22:09 Microsoft is actually doing that on Azure
09:22:12 I agree - good to have them as an attractor to your platform (which might then be charged for in the usual ways)
09:22:33 sofianes_: interesting!
09:22:37 Possibly cross charging...
09:22:48 it makes sense for commercial applications where you consume data and sell services. You basically act as a broker in that case
09:22:58 sofianes_: I'm not sure I have a link to CVMFS - anyone?
09:23:00 *nod*
09:23:18 https://cernvm.cern.ch/portal/filesystem
09:23:34 https://wiki.gentoo.org/wiki/CVMFS
09:23:35 sofianes_: they have a minimal VM image that pulls in experiment-specific software via CVMFS
09:23:35 this?
09:24:40 So in WLCG it's primarily used for distributing software, but could be used for datasets too
09:25:05 there are also alternatives to expose S3 as a file system
09:25:13 WLCG?
09:25:20 Worldwide LHC Computing Grid
09:25:22 Especially since it's read-only... :-)
09:25:32 2-level acronym!
09:25:40 like this: https://github.com/s3fs-fuse/s3fs-fuse
09:25:40 De rigueur.
09:25:57 Damn that HEP community... :-D
09:26:11 sofianes_: I'm sure I came across a US group that was using it for data, but can't remember where right now
09:26:20 (i.e. a non-HEP group)
09:26:30 if you find out, please let me know
09:26:40 verdurin: possibly Jim Basney's group?
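One way the "public data will be open / private data will require ACLs" split could map onto Swift container ACLs, as a rough sketch only (container names and the project ID are hypothetical):

    # Rough sketch of per-dataset access control using Swift container ACLs.
    # Container names and the project ID are placeholders.
    import swiftclient

    conn = swiftclient.Connection(
        authurl='https://keystone.example.org:5000/v3',
        user='dataset-admin', key='secret', auth_version='3',
        os_options={'project_name': 'datasets',
                    'user_domain_name': 'Default',
                    'project_domain_name': 'Default'})

    # Open dataset: anyone may read and list the container
    conn.post_container('open-weather-feed',
                        headers={'X-Container-Read': '.r:*,.rlistings'})

    # Private dataset: readable only by users of one collaborating project
    conn.post_container('private-genomics',
                        headers={'X-Container-Read': 'a1b2c3d4e5f6:*'})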
09:26:45 we are open to collaboration/sharing experience
09:26:47 I can ask
09:26:58 We could ask Tim et al
09:27:09 What do Globus do in this area?
09:27:25 #link https://www.globus.org/
09:27:54 our interest is not only about making data available for download
09:28:05 but mainly to bring computation to where the data sits
09:28:17 that's at the heart of the Hadoop principles
09:28:22 The holy grail...
09:28:31 oneswig: they have data publication and discovery capabilities https://www.globus.org/data-publication
09:28:36 for very large data, it's cheaper to move computation
09:29:05 it seems to be a commercial hosted service
09:29:09 Some of the genomics data at Sanger is made available via Globus Connect
09:29:24 actually we have a mirror of that data
09:29:30 I want to add one thing
09:29:36 some datasets are so large
09:29:44 that you simply can't download them
09:29:53 sofianes_: do you think there is value in a user story here? I think there could be.
09:30:14 the human genome data which can be obtained through Globus is more than 100 TB
09:30:29 :-O
09:30:35 Common Crawl, which is a crawl of the web, is 500+ TB
09:30:52 that's for static data
09:31:07 but we also have users interested in real-time data
09:31:15 Hence moving code to the data
09:31:20 they need to subscribe to streams
09:31:31 these can be weather or traffic data
09:31:44 or anything that comes from IoT
09:31:51 we are also working on such use cases
09:32:25 we receive data in real time, and users can select specific feeds
09:32:42 data can also be persisted and becomes historical data
09:32:47 What are the gaps in OpenStack, or does everything discussed build upon today's OpenStack?
09:33:05 we have Sahara and bare metal
09:33:21 but solutions don't seem to be mature enough
09:34:03 I haven't heard of anyone using bare metal with Swift - does it work?
09:34:21 Ceph and Swift also have issues with large objects
09:34:53 oneswig: We use Swift from bare-metal instances. There's no reason for it not to work, it's all HTTP requests
09:35:05 support for Hadoop is still not very established
09:35:23 priteau: I guess so - wasn't sure which network they'd be transported over
09:35:58 sofianes_: there was a bug mentioned in Barcelona on Sahara with large objects - any news on that?
09:36:23 I have nothing new on that
09:36:31 I will ask saverio
09:36:45 Saverio not here? Saverio...?
09:37:21 Wasn't that an issue with the related HDFS plugin rather than Swift?
09:37:32 ah, think you're right
09:37:38 if any of you are interested in our topic, you're welcome to contact me
09:37:59 sofianes_: would you be interested in transferring the Google doc into a user story?
09:38:07 We are working on deploying Sahara in NeCTAR at the moment too
09:38:28 yes for the user story
09:38:36 Great, thanks sofianes_
09:38:58 #action sofianes_ to transfer the Google doc on data sharing into a user story
09:39:29 b1airo: any comment on how Sahara is working in NeCTAR?
09:40:24 Main issue we've found is that the current version defaults to putting Hadoop job binaries in the RDBMS
09:41:21 Should be in the object store for a large prod cloud, especially where users are not trusted/known
09:41:59 How responsive are the Sahara crowd?
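To illustrate the "too large to download" point above: browsing a multi-hundred-terabyte dataset in place over the S3 API instead of copying it. This sketch assumes the public Common Crawl bucket on AWS still allows anonymous listing; the same calls would target an S3 gateway in front of Swift or Ceph by adding endpoint_url:

    # Sketch: list a slice of a 500+ TB public dataset in place over S3,
    # instead of downloading it. Assumes anonymous access to the public
    # Common Crawl bucket; the crawl prefix is illustrative.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

    resp = s3.list_objects_v2(Bucket='commoncrawl',
                              Prefix='crawl-data/CC-MAIN-2016-44/',
                              MaxKeys=10)
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])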
09:42:18 using object storage as the back end for Hadoop seems to be the current trend, at least in public clouds (AWS, Azure, GCP)
09:42:47 Reasonably, problem is already being addressed
09:42:59 when you deploy a cluster, storage is embedded in HDFS within the cluster
09:43:17 using object storage is a good compromise to share data
09:43:38 I'd be interested to see performance comparisons, does seem to me to be breaking some assumptions about Hadoop (unless I'm misunderstanding it)
09:44:10 we have that on our roadmap
09:44:14 Anyway, we have other topics, any more on big data & Sahara?
09:44:19 performance comparison
09:44:22 S3 vs HDFS
09:44:32 sofianes_: would be interested to see that
09:44:41 and other dimensions as well
09:44:49 we will share our findings
09:45:01 OK thanks sofianes_
09:45:19 #topic Federation user story up for review
09:45:32 Saverio's not here but there is a WG user story in review
09:45:38 on identity federation
09:45:51 #link https://review.openstack.org/#/c/400738/
09:46:40 Please comment if it's an area of interest for you - I recall aloga had some questions about the original etherpad this user story came from
09:46:57 I'll have to review that story again...
09:46:59 Cool - I will seek input locally
09:47:24 Great! Want to make sure it's consistent with global experience
09:48:02 Don't think there's any more to cover on this... anyone?
09:48:47 #topic Cloud workload traces
09:49:05 Late addition to the agenda - b1airo want to cover this?
09:49:41 sorry guys, I arrived late
09:49:50 Hi aloga
09:50:02 Sure - briefly
09:50:15 aloga: Got backscroll? Feedback here: https://review.openstack.org/#/c/400738/
09:50:28 oneswig: yes, I was catching up
09:51:13 Kate Keahey's BoF at SC raised the topic of getting cloud workload traces for CompSci work (e.g. scheduling simulation etc.)
09:51:35 Is this infrastructure or higher-level?
09:51:44 oneswig: I will go through the review and comment there, I still have concerns about the 2nd use case that is described there (i.e. L156-160)
09:52:17 i.e. who will benefit from that use case
09:52:36 (same applies for user story 3)
09:52:39 She isn't sure what data they want yet, needs to do a review of existing archives from prior grid work, but we discussed the potential to have the work supported through this community so as to gain a wider audience
09:53:03 We have a project for OpenStack distributed profiling. For user applications, how about http://zipkin.io/
09:53:22 ... which I would also love to see ...
09:54:33 b1airo: I'd love to hear more, hopefully we can do a monitoring & telemetry session next week and she might attend?
09:54:56 Yes, we need to invite her or a team member
09:55:11 ... know anyone ... ? :-)
09:55:42 oneswig: I am not sure we are on the same page, the traces we have in mind are the equivalent of job traces in HPC, i.e. instance creation / update / termination events
09:56:28 priteau: yes, that's what I thought, at least for initial data
09:56:34 priteau: possibly more on the same page than you think! We've been looking at connecting SLURM with Monasca
09:56:35 So e.g. nova-scheduler logs
09:56:54 So this is infrastructure event tracing?
09:56:59 But the more data the better, I assume
09:57:28 Now I am even more interested.
09:57:58 oneswig: I was referring to your link to zipkin, which seems to be more about inter-service call timings?
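As a rough illustration of the trace format priteau describes ("the equivalent of job traces in HPC, i.e. instance creation / update / termination events"), one possible sketch pulls Nova's instance-action records rather than scraping nova-scheduler logs. Credentials are placeholders and this only shows the shape of such a trace, not a proposal:

    # Rough sketch: dump instance lifecycle events into an HPC-style
    # job trace (CSV). Auth details are placeholders.
    import csv
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client

    auth = v3.Password(auth_url='https://keystone.example.org:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    with open('workload_trace.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['instance_id', 'flavor', 'action', 'start_time'])
        # Listing all tenants' instances requires admin credentials
        for server in nova.servers.list(search_opts={'all_tenants': True}):
            for act in nova.instance_action.list(server.id):
                writer.writerow([server.id, server.flavor['id'],
                                 act.action, act.start_time])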
09:58:12 OK, I previously took an action to define our use case - will do that for next week
09:58:24 priteau: wasn't sure if you were looking at infrastructure or application profiling
09:59:23 we are looking at profiling "infrastructure-plus", i.e. a bit of the stuff above
10:00:13 #action oneswig to write up the profiling telemetry use case for next week
10:00:18 Have to go - bye.
10:00:23 out of time!
10:00:37 Thanks all, we'd better clear the channel
10:00:38 priteau: we are interested in that
10:00:52 #endmeeting