11:01:00 <martial_> #startmeeting scientific-sig
11:01:01 <openstack> Meeting started Wed Mar 27 11:01:00 2019 UTC and is due to finish in 60 minutes.  The chair is martial_. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:01:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:01:04 <openstack> The meeting name has been set to 'scientific_sig'
11:01:11 <martial_> #chair oneswig
11:01:12 <openstack> Current chairs: martial_ oneswig
11:01:16 <oneswig> up, up and away
11:01:24 <martial_> Good morning team
11:01:35 <oneswig> Greetings
11:01:59 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_March_27th_2019
11:02:09 <verdurin> Morning.
11:02:17 <oneswig> Morning verdurin
11:02:30 <martial_> Today we have an exciting agenda (as always :) )
11:02:38 <martial_> #topic Evaluation of hardware-specific machine learning systems orchestrated in an OS cluster at NIST
11:03:04 <martial_> We welcome Maxime and Alexandre to talk to us about the work they did on their project for NIST
11:03:28 <oneswig> Thank you for coming!
11:03:38 <martial_> For people that want to follow up at home the link to the outline for their presentation is at
11:03:41 <martial_> #link https://docs.google.com/document/d/190v0XtuVt1oH7yhPwjpuQ6mFLhL1dNcjNwTIF7Y9n9w/edit?usp=sharing
11:03:55 <martial_> and it has been accepted as a lightning talk for the Denver Summit
11:04:05 <oneswig> What is Multimodal information?
11:04:07 <martial_> #link https://www.openstack.org/summit/denver-2019/summit-schedule/events/23243/evaluation-of-hardware-specific-machine-learning-systems-orchestrated-in-an-os-cluster-at-nist
11:04:23 <martial_> Alex, Maxime: the floor is yours
11:04:32 <maxhub_> Thanks for welcoming us, that's a great opportunity to dry-run our presentation!
11:05:43 <alexandre_b> Thank you for this opportunity. Multimodal Information is a group of researchers that performs evaluations of machine learning systems
11:05:56 <maxhub_> So first of all: hi everyone, we (Alexandre Boyer and Maxime Hubert) work at NIST, and we are going to present how we automated our machine learning evaluation process using OpenStack.
11:06:33 <alexandre_b> The Multimodal Information Group, which is part of the National Institute of Standards and Technology, often performs advanced evaluations and benchmarking of computer programs.
11:06:34 <alexandre_b> The main actors are:
11:06:34 <alexandre_b> - The performers whose systems will be evaluated
11:06:34 <alexandre_b> - The data providers who collect and send the data to evaluate the performers’ systems
11:06:34 <alexandre_b> - The evaluators who design the evaluation steps and report the results
11:07:29 <maxhub_> This is how we usually proceed:
11:07:29 <maxhub_> - Collect and send the data sets to the performers: Data collection/delivery
11:07:29 <maxhub_> - Collect the system outputs from the performers
11:07:29 <maxhub_> - Score/Analysis of the Results
11:07:29 <maxhub_> However, this approach has some caveats:
11:07:30 <maxhub_> - The data collection task may be (very) costly, and has to be performed for every new evaluation
11:07:30 <maxhub_> - Some agencies may not want their data to be released.
11:07:31 <maxhub_> - Systems are being evaluated on “known” data (some ML systems are good at learning)
11:08:14 <alexandre_b> All these constraints led us to rethink the evaluation process:
11:08:14 <alexandre_b> New evaluation tasks involving in-house system runs:
11:08:14 <alexandre_b> - Collect the data sets: Data collection, deliver a Validation data set
11:08:14 <alexandre_b> - Collect and validate the computer programs (aka Systems delivery)
11:08:14 <alexandre_b> - Run the systems over the collected data, in our sequestered environment.
11:08:14 <alexandre_b> - Score/Analysis of the results
11:08:14 <alexandre_b> This process raises a new question: how do we deliver a system with a guarantee that it will run in-house?
11:08:15 <alexandre_b> - OS virtualization (containers) or full virtualization (VMs), to avoid all the configuration hassle
11:08:15 <alexandre_b> - The system can be run against the Validation data set in advance
11:08:16 <alexandre_b> Once the systems are delivered and validated, the evaluator still has to run them on the evaluation data sets. This could be done manually, but would require a lot of time if many systems have to be evaluated against many data sets.
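A minimal sketch of what such a delivery-validation step can look like with the openstacksdk (not the presenters' actual tooling; the cloud name, image file, flavor and network names are all invented for illustration):

```python
# Hypothetical sketch: register a performer's delivered image and boot it
# so the system can be exercised against the validation data set before
# the sequestered evaluation run.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")

# Register the image delivered by the performer.
image = conn.create_image(
    "team-a-system-v1", filename="team-a-system-v1.qcow2", wait=True)

# Boot a VM from it for a validation run.
server = conn.compute.create_server(
    name="team-a-validation-run",
    image_id=image.id,
    flavor_id=conn.compute.find_flavor("eval.gpu").id,
    networks=[{"uuid": conn.network.find_network("eval-net").id}],
)
server = conn.compute.wait_for_server(server)
print(f"validation instance reachable at {server.access_ipv4}")
```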
11:09:35 <oneswig> What kind of machine learning frameworks were you using and what were the use cases?
11:09:54 <martial_> To support what project?
11:11:04 <martial_> Also, explain the constraints that control the use of resources and the solutions to alleviate those, please
11:11:32 <maxhub_> this process was applied during a NIST evaluation consisting of six systems performing activity detection on CCTV footage: dozens of hours of video were processed by the six systems in different chunks
11:12:23 <verdurin> Any special considerations with the GPUs?
11:12:42 <maxhub_> The teams designing the systems had the freedom to pick whatever framework they wanted.
11:13:34 <maxhub_> The configuration consisted of the following:
11:13:35 <maxhub_> 16 cores (32 hyperthreaded)
11:13:35 <maxhub_> 128GB of memory
11:13:35 <maxhub_> 4 x Nvidia GTX1080Ti
11:13:35 <maxhub_> 40GB root disk space
11:13:35 <maxhub_> 256GB of ephemeral SSD disk space mounted to /mnt/
11:14:14 <martial_> per VM or system?
11:16:28 <maxhub_> So far one system has consisted of a single VM, but we expect to receive multi-VM systems in the next few months
11:16:59 <martial_> so 4x GPU per VM ... nice :)
11:17:51 <verdurin> And this was with GPU passthrough, not Ironic?
11:18:10 <maxhub_> Yes it was
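For readers wondering how such a configuration is expressed, here is a sketch (not NIST's actual configuration code) of a Nova flavor matching the numbers above, assuming a recent openstacksdk and a PCI alias named "gtx1080ti" already configured in nova.conf for passthrough; all names are hypothetical:

```python
# Sketch of a Nova flavor matching the configuration described above,
# with the four GTX 1080 Ti cards exposed through PCI passthrough.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")

flavor = conn.compute.create_flavor(
    name="eval.gpu",
    vcpus=32,         # 16 cores, hyperthreaded
    ram=128 * 1024,   # 128 GB, in MiB
    disk=40,          # 40 GB root disk
    ephemeral=256,    # 256 GB ephemeral SSD mounted on /mnt
)
# Request four passthrough GPUs per instance via the PCI alias
# (assumed to be defined as "gtx1080ti" in nova.conf).
conn.compute.create_flavor_extra_specs(
    flavor, {"pci_passthrough:alias": "gtx1080ti:4"})
```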
11:19:51 <oneswig> Did you deploy OpenStack specifically for managing this evaluation work?
11:20:21 <martial_> you are listing "4 x Nvidia GTX1080Ti" — is that in a server system or a desktop system (i.e. which chassis was used to support the non-server cards)?
11:20:54 <alexandre_b> Yes, we deployed OpenStack specifically for this evaluation work
11:21:27 <martial_> what prevents exfiltration of information?
11:23:06 <martial_> In case people have questions, would you be okay adding your contact information in the outline document?
11:23:18 <verdurin> Did you use specific node-cleaning in between runs?
11:24:52 <maxhub_> We would give one node to each team so they could integrate their system on it first through an OpenStack project, and collect the images and volumes once they were ready.
11:24:52 <maxhub_> The last step for us was to automate the runs of the systems: we abstracted the system runs so that all pairs of systems/datasets could be executed the same way. A job scheduler would instantiate the systems and trigger them. This only required a change to our systems delivery: we collected and validated the computer programs wrapped into a standard CLI/API.
11:25:49 <alexandre_b> martial_: what do you mean by server system or desktop system?
11:26:06 <maxhub_> Data exfiltration was prevented by cutting the internet access to the VMs before plugging in our datasets
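One way to implement such a network cut with Neutron security groups, sketched with the openstacksdk (the exact mechanism used at NIST may differ, and all names here are hypothetical):

```python
# Sketch: isolate a VM before evaluation data is attached by giving its
# ports a security group with no rules at all. Freshly created Neutron
# groups allow all egress by default, so those default rules are removed.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")

sg = conn.network.create_security_group(
    name="sequestered", description="no ingress, no egress")
for rule in conn.network.security_group_rules(security_group_id=sg.id):
    conn.network.delete_security_group_rule(rule)

# Apply the empty group to every port of the system's VM.
server = conn.compute.find_server("team-a-eval-run")
for port in conn.network.ports(device_id=server.id):
    conn.network.update_port(port, security_group_ids=[sg.id])
```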
11:27:05 <martial_> 1080s have a power connector that usually makes them hard to use in a server form factor
11:27:11 <verdurin> Detail on that last step you described would be interesting
11:27:23 <verdurin> martial_: we have a number of server systems like that
11:28:00 <martial_> alexandre_b + verdurin: investigating server box to use :)
11:28:00 <maxhub_> Instantiating the VMs over and over again would sometimes make our nodes run out of memory; we had this problem when we had to deal with several image versions at the beginning. So the node cleaning consisted of deleting the past images on the nodes
11:29:22 <verdurin> maxhub_: Sure, I was wondering whether you had other cleaning processes in addition, given that you imply the need for strong isolation between groups
11:32:10 <martial_> what are your lessons learned?
11:32:36 <maxhub_> No, we didn't; each system instantiation offered a clean environment, so we didn't feel the need for any other cleanup
11:33:06 <alexandre_b> How OpenStack made this possible for us:
11:33:06 <alexandre_b> Systems delivery:
11:33:06 <alexandre_b> - Opening an OpenStack project per team for a limited amount of time allowed performers to remotely integrate their systems on our hardware, which can be specific (a particular set of GPUs, SSDs). Performers have one VM per system.
11:33:06 <alexandre_b> - The ability to "snapshot" the state of a system made it easy to deliver, as well as easy to re-deploy. One system can consist of one or several images and volumes.
11:33:06 <alexandre_b> OpenStack already provided all the mechanisms to help us improve our systems delivery: it's easier for the evaluator to collect the systems, and easier for the performers, who can integrate their systems directly.
11:33:07 <alexandre_b> Run the systems:
11:33:07 <alexandre_b> - An NFS server instance using OS Cinder allowed us to distribute large amounts of data to different systems at the same time
11:33:08 <alexandre_b> - The use of VMs plus OS security group mechanisms helped us meet the requirements in terms of security and protection against data exfiltration
11:33:08 <alexandre_b> - VM snapshots guaranteed the replicability needed when performing experiments
11:33:09 <alexandre_b> - A job scheduler using the OS API helped us automate the instantiation and execution of our systems (a sketch of such a loop follows)
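The automation described in that last bullet might look roughly like the following loop. This is a sketch only: the scheduler itself is in-house and unpublished, and every name, plus the run_entrypoint helper, is hypothetical.

```python
# Sketch of the kind of loop the in-house scheduler performs: every
# (system, dataset) pair is executed the same way: boot from the
# delivered snapshot, attach the data volume, trigger the standard
# entrypoint, then tear down.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")

SYSTEMS = ["team-a-system-v1", "team-b-system-v2"]  # delivered images
DATASETS = ["eval-chunk-01", "eval-chunk-02"]       # Cinder data volumes

def run_entrypoint(server, dataset):
    """Hypothetical: ssh into the VM and invoke the agreed CLI entrypoint."""
    raise NotImplementedError

for system in SYSTEMS:
    for dataset in DATASETS:
        server = conn.compute.create_server(
            name=f"{system}--{dataset}",
            image_id=conn.compute.find_image(system).id,
            flavor_id=conn.compute.find_flavor("eval.gpu").id,
            networks=[{"uuid": conn.network.find_network("eval-net").id}],
        )
        server = conn.compute.wait_for_server(server)
        volume = conn.block_storage.find_volume(dataset)
        conn.compute.create_volume_attachment(server, volume_id=volume.id)
        run_entrypoint(server, dataset)  # run, then collect the output
        conn.compute.delete_server(server)
```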
11:34:17 <maxhub_> Lessons learned: OpenStack is a good tool to work with, but it is not designed to support instantiating many VMs at the same time
11:34:23 <oneswig> "A job scheduler using the OS API" - what was that?
11:34:57 <martial_> so basically using the openstack CLI to run the VM?
11:35:31 <maxhub_> Yes, after abstracting the systems into Heat templates
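For context, a trimmed, hypothetical example of what abstracting a system as a Heat template and launching it through the orchestration API can look like (assuming an openstacksdk release whose create_stack accepts an inline template; the template and all names are invented):

```python
# Hypothetical sketch: one Heat stack per run, with the system image as a
# parameter. The template is heavily trimmed; real ones would also carry
# volumes, networks and the entrypoint wiring.
import openstack

TEMPLATE = {
    "heat_template_version": "2018-08-31",
    "parameters": {"image": {"type": "string"}},
    "resources": {
        "system_vm": {
            "type": "OS::Nova::Server",
            "properties": {
                "image": {"get_param": "image"},
                "flavor": "eval.gpu",
                "networks": [{"network": "eval-net"}],
            },
        },
    },
}

conn = openstack.connect(cloud="evaluation-cloud")
stack = conn.orchestration.create_stack(
    name="team-a-run-01",
    template=TEMPLATE,
    parameters={"image": "team-a-system-v1"},
)
conn.orchestration.wait_for_status(stack, status="CREATE_COMPLETE")
```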
11:35:40 <martial_> any "wrapper" type technology to start/control the process within the VM?
11:36:12 <maxhub_> The system delivery would require teams to implement specific entrypoints
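To illustrate what such an entrypoint contract could look like (purely hypothetical; the flag names and format are invented, not NIST's actual specification):

```python
#!/usr/bin/env python3
# Hypothetical standard entrypoint each team could implement inside its
# VM, so that every system is triggered the same way regardless of the ML
# framework it uses internally.
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Standard evaluation entrypoint")
    parser.add_argument("--input-dir", required=True,
                        help="mounted evaluation data, e.g. /mnt/dataset")
    parser.add_argument("--output-dir", required=True,
                        help="where the system output must be written")
    args = parser.parse_args()

    # Team-specific code goes here: load the model, process every video
    # found in args.input_dir, write detections to args.output_dir in the
    # agreed format.
    raise NotImplementedError("team-specific system logic")

if __name__ == "__main__":
    main()
```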
11:36:39 <alexandre_b> We use a job scheduler developed in-house, similar to Slurm; it allows us to instantiate/delete OpenStack instances and sequester the network through the OpenStack API
11:37:06 <martial_> so firewall rules?
11:37:14 <maxhub_> Yes
11:38:39 <martial_> anything else that you want to share on your work that you have not shared yet?
11:39:02 <martial_> or is the rest secret for the lightning talk? :)
11:39:29 <maxhub_> Right now we are working on generalizing this evaluation process,
11:39:57 <maxhub_> This would imply using more resource management technologies (Terraform for example)
11:40:01 <martial_> Speaking of lightning talks, oneswig: do you think we have space available for them to talk to our audience?
11:40:29 <maxhub_> And using a more advanced Job scheduling technology (Airflow)
11:40:44 <martial_> cool!
11:40:59 <maxhub_> Thanks for listening to us!
11:41:13 <oneswig> martial_: I expect so.
11:41:50 <oneswig> maxhub_: alexandre_b: Thanks to both of you.
11:42:21 <oneswig> Are you done with the project now or will you continue to work on it?
11:42:24 <martial_> was looking for the link to the Etherpad to share with them, do you have it handy oneswig ?
11:42:59 <oneswig> I am not sure if there is one yet.  I think Blair might have set one up.
11:43:46 <verdurin> Yes, thanks for presenting.
11:43:47 <alexandre_b> We will be working on it for at least another year
11:44:09 <alexandre_b> Thanks for listening to us
11:44:41 <oneswig> I hope the next steps go well
11:46:27 <oneswig> OK, let's move on
11:46:37 <oneswig> #topic Open Infra Days London
11:46:58 <oneswig> This is 1st April (i.e. Monday), and there's a scientific track
11:47:06 <martial_> (was looking for Etherpad link, no luck)
11:47:13 <oneswig> including a nice chap from Oxford, I believe...
11:47:24 <martial_> (maxhub_ alexandre_b will share with you when I have it)
11:47:54 <oneswig> verdurin: there's always plenty of interest in the AAI system on your infrastructure
11:47:57 <martial_> cool :)
11:49:37 <oneswig> #topic Denver Forum Sessions
11:50:30 <martial_> No dates yet?
11:51:01 <oneswig> We had a mail from Rico Lin to highlight this session
11:51:11 <oneswig> #link Help most needed for SIGs and WGs https://www.openstack.org/summit/denver-2019/summit-schedule/events/23612/help-most-needed-for-sigs-and-wgs
11:51:52 <oneswig> martial_: for the PTG session?  I don't think I've seen anything confirmed
11:53:41 <oneswig> For the forum session, we'd need to re-think what the gaps may be for OpenStack support for research computing.
11:54:19 <oneswig> And that's probably a session in itself.
11:55:10 <martial_> I would agree
11:56:14 <martial_> any update on ISC?
11:56:59 <oneswig> #topic AOB
11:57:03 <oneswig> ISC
11:57:11 <oneswig> Not heard anything yet I don't think
11:58:04 <oneswig> Just checking when we should hear
11:58:30 <oneswig> April 10th
11:58:38 <martial_> for ISC, I am on the program committee for HPCW
11:58:43 <martial_> #link https://easychair.org/cfp/hpcw2019
11:59:03 <martial_> "5th High Performance Containers Workshop - In conjunction with ISC HIGH PERFORMANCE 2019"
11:59:32 <oneswig> Interesting, not seen that before
12:00:02 <martial_> Christian Kniep was organizing it
12:00:13 <martial_> We started this conversation at SC18
12:00:28 <oneswig> I was wondering where Christian was, assumed he'd be involved somewhere...
12:00:29 <martial_> so inviting submissions obviously :)
12:00:46 <oneswig> Let's put it on the agenda for next time
12:00:49 <oneswig> Time to close!
12:00:53 <martial_> Waiting to hear back from him, but it seems he's not at Docker anymore
12:00:53 <oneswig> Final comments?
12:01:06 <martial_> Thanks for a great session everybody :)
12:01:15 <oneswig> Indeed, thanks
12:01:19 <oneswig> #endmeeting