Software For Data Science Workshop: High Performance Computing in the Cloud – Numerical Simulation and Data Analysis

February 5 - 6, 2018, The Fields Institute

Location: Fields Institute, Room 230

Objectives

The goal of this course is to help scientists leverage the power of distributed computing to carry out numerical experiments more efficiently. The course will be based on the Amazon Web Services computational infrastructure and will give all participants hands-on experience.

The first day of the course will equip students with the knowledge needed to set up massive computations in the cloud. We will start with basic examples of multi-core computations on a single machine. Next, we will present tools for running massive computations on multiple computers: KissCluster and Amazon cfncluster.
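As a rough illustration of what a multi-core computation on a single machine can look like, the sketch below splits a Monte Carlo estimate of pi into independent tasks and runs them with Python's standard multiprocessing module. It is not taken from the workshop materials; the function names count_hits and estimate_pi, the choice of problem, and the task sizes are all assumptions made for this example.

```python
import random
from multiprocessing import Pool, cpu_count


def count_hits(args):
    """Count random points in the unit square that fall inside the quarter circle."""
    seed, n_points = args
    rng = random.Random(seed)  # separate, seeded generator for each task
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks, n_points, processes=None):
    """Split the simulation into independent tasks and run them on several cores."""
    tasks = [(seed, n_points) for seed in range(n_tasks)]
    with Pool(processes=processes) as pool:
        hits = pool.map(count_hits, tasks)  # one result per task, in task order
    return 4.0 * sum(hits) / (n_tasks * n_points)


if __name__ == "__main__":
    # Illustrative sizes: one task per available core, 100,000 points per task.
    print(estimate_pi(n_tasks=cpu_count(), n_points=100_000))
```

Because each task carries its own seed and shares no state with the others, the same decomposition carries over naturally from the cores of one machine to the nodes of a cluster.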

During the second day, practical data science case studies requiring massive computations will be presented using R, Python and Julia. For Julia, both thread-based and process-based parallelization will be discussed. In the last part, we will show how to parallelize computations on GPUs using R.

The course assumes intermediate prior knowledge of scientific programming in at least one language (e.g. Python, GNU R, Java, Node.js or Julia) and basic knowledge of the Linux or Mac command line. During the workshop, participants will create various types of distributed computational clusters. No prior knowledge of distributed computing or experience with large-scale simulation models is required.

Why High Performance Computing in the Cloud:

Cloud computing offers a new set of possibilities for computational scientists, especially in the areas of numerical computing, GPU computing and data analytics. The most significant advantage of HPC in the cloud is that cloud providers offer standardized hardware and software solutions, while proprietary HPC solutions require long, specialized training. Hence, cloud computing makes it possible to configure, build and provision a cluster consisting of several thousand cores within just a few minutes. Costs are also extremely low, starting from 7 USD for 1000 vCPU cores per hour or 4 USD for 100000 GPU cores per hour, and universities can easily cover them through one of several computational grant programs offered by cloud providers. Together, this makes the cloud an ideal solution for many computation-intensive scientific problems. For simplicity, Amazon Web Services will be used as the example computation service during the course; using the services of a different public cloud provider would be very similar.
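To put the starting prices quoted above in perspective, the following sketch converts them to a cost per core-hour and estimates what a job would cost. It assumes a flat rate with no storage, data transfer or other charges, which real cloud bills typically include; the function names, the example job size and the 50 USD budget are hypothetical.

```python
# Starting prices quoted in the text, converted to USD per core-hour.
VCPU_RATE = 7 / 1000        # 7 USD per 1000 vCPU cores per hour
GPU_CORE_RATE = 4 / 100000  # 4 USD per 100000 GPU cores per hour


def job_cost(cores, hours, rate_per_core_hour):
    """Estimated cost in USD of running `cores` cores for `hours` hours at a flat rate."""
    return cores * hours * rate_per_core_hour


def affordable_core_hours(budget_usd, rate_per_core_hour):
    """Number of core-hours that a given budget buys at a flat rate."""
    return budget_usd / rate_per_core_hour


if __name__ == "__main__":
    # Hypothetical job: 2000 vCPU cores running for 3 hours.
    print(job_cost(2000, 3, VCPU_RATE))
    # Core-hours covered by a hypothetical 50 USD budget at the vCPU rate.
    print(affordable_core_hours(50, VCPU_RATE))
```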

Participation Requirements:

The workshop will be hands-on, so participants should bring their laptops and have an SSH client installed. On Linux (including Ubuntu) SSH is built in, while on Windows the best option is to use the SSH client that is automatically installed with Git. Additionally, Windows users should install the ConEmu console, which is a much better option than the standard Command Prompt.

For convenient data transfer between your laptop and the cloud (including Linux instances and S3 data storage) we recommend Cyberduck. Users more experienced with Bash can use scp and the AWS command line interface (aws) instead.

Each participant should set up an AWS account. Workshop participants will be given free AWS credit coupons worth approximately 50 USD. This will be sufficient to perform all tasks during the workshop and for self-study outside of class. Please note that while the setup process takes only 3 minutes, it can take Amazon up to one day to provision the AWS account; hence, we strongly advise you to start at least two days before the workshop.

For your convenience, we recommend installing interpreters for your favorite scripting languages among the three covered during the course (Python, GNU R and Julia). We will be working with cloud installations of those environments, but you still might want to run some tests on your own machine.

If you encounter any problems with AWS account setup, please contact Przemysław Szufel for support or reach out to AWS support directly.

Instructors:

Dr. Przemysław Szufel

Przemysław Szufel is an Assistant Professor in the Decision Analysis and Support Unit at the Warsaw School of Economics.

His current research focuses on methods for executing large-scale simulations for numerical experiments and optimization. He is working on asynchronous algorithms for parallel execution of large-scale simulations in the cloud and in distributed computational environments. He is an author or co-author of several open source tools for high performance computing and numerical simulation (such as KissCluster, D-MASON, Isislab SOF, SilverDecisions, PyCX), and actively participates in their development. He is also a co-author of various algorithms for distributed simulation models (such as AKG, AOCBA).

Dr. Bogumił Kamiński

Bogumił Kamiński is the Head of the Decision Analysis and Support Unit at the Warsaw School of Economics. He is a member of the Management Committee of the European Social Simulation Association (ESSA) and Vice President of the Institute for Operations Research and the Management Sciences (INFORMS) Polish Chapter. His field of expertise is operations research, with a special focus on industrial applications of forecasting, optimization and simulation. He has 15 years of experience teaching data science topics in undergraduate, graduate, and MBA courses.

Registration Instructions:

This event has reached capacity and registration has closed.

Schedule

Monday, February 5th, 2018
08:45 to 09:00	Check in
09:00 to 10:00	Introduction to architecture and set-up of High Performance Computing environments in the cloud (AWS)
10:00 to 10:15	Coffee break
10:15 to 12:00	Warm-up: computations using multiple cores on a single machine
12:00 to 13:00	Lunch
13:00 to 14:45	Parallelizing computational jobs with KissCluster
14:45 to 15:00	Coffee break
15:00 to 16:45	Parallelizing computational jobs with cfncluster
16:45 to 17:00	Day 1 wrap-up

Tuesday, February 6th, 2018
09:00 to 10:30	Case study: parallelization of Python code through multiprocessing
10:30 to 10:45	Coffee break
10:45 to 12:00	Case study: parallelization of Julia code through multiprocessing
12:00 to 13:00	Lunch
13:00 to 14:45	Case study: parallelization of Julia code with threads
14:45 to 15:00	Coffee break
15:00 to 16:45	Case study: running computations on GPUs using R
16:45 to 17:00	Day 2 wrap-up

Registered Participants

Name	Affiliation
Ramzi Abdelmoula	GM Canada
Pegah Abed-Esfahani	University of Toronto
Yasneen Ashroff
Mahdis Azadbakhsh
Elnaz Bigdeli	KPMG
Santa Borel	IQVIA
Jesus Calderon	Gravito
Ba Chu	Carleton University
Shaun Cumby
Konrad Duch
Tianyu Du	University of Toronto
Yaser Eftekhari
Ali Fathi	Royal Bank of Canada
Selina Gabriele	University of Windsor
Sakshi Garg
Rishabh Gupta	Myant Inc.
Tuck-Voon How	University of Toronto
Abdulkadir Hussein	University of Windsor
David Islip	University of Toronto
Abdul Qadir Javaid	University of Toronto
Muhammed Jobe	Polymatiks
Mikayel Karapetyan	Western University
Alaa Khamis	GM Canada
Nicole Langballe	University of Toronto
Ki Beom Lee	University of Waterloo
Daniyal Liaqat	University of Toronto
Kai Liu	McMaster University
Mayan Murray
Amin Nabavi	Carleton University
Matt Olechnowicz	University of Toronto
Shraddha Pai	University of Toronto
Palermo Penano
Jonathan Poisson-Rioux	Desjardins Insurance Group
Shahab Poshtkouhi	Scotiabank
Arun Ramani	Hospital for Sick Kids
Oleksandr Romanko	University of Toronto
Pooyan Shirvani	TD Securities
Zahra Shirzadi	University of Toronto
Olena Skalianska	Wrocław University of Science and Technology
Matt Sourisseau	University of Toronto
Greg Stortz	SickKids Hospital
Patricia Thaine	University of Toronto
Andrei Turinsky	SickKids Hospital
Matthew Wongkee	Canada Pension Plan Investment Board
Kaiyin Zhu	Toronto Rehabilitation Institute
Victor Zurkowski	Polymatiks

Organizing Committee

Pawel Pralat - Toronto Metropolitan University

Tom Salisbury - The Fields Institute

Tyler Wilson - The Fields Institute

Fields Contact: Erika Pedersen-Lorenzen

The Fields Institute for Research in Mathematical Sciences