Software For Data Science Workshop: High Performance Computing in the Cloud – Numerical Simulation and Data Analysis
Objectives
The goal of this course is to help scientists leverage the power of distributed computing to carry out numerical experiments more efficiently. The course will be based on the Amazon Web Services computational infrastructure and will give all participants hands-on experience.
The first day of the course will equip students with the knowledge needed to set up massive computations in the cloud. We will start with basic examples of multi-core computations on a single machine. Next, we will present tools for running massive computations on multiple computers: KissCluster and Amazon cfncluster.
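As a rough illustration of what a multi-core computation on a single machine can look like, here is a minimal Python sketch that spreads a Monte Carlo estimate of pi across worker processes. It is not taken from the course materials; the function names `count_hits` and `estimate_pi`, the sample sizes and the seeding scheme are all assumptions made for this example.

```python
import multiprocessing as mp
import random


def count_hits(seed: int, n_samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # a separate, seeded generator per task
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(samples_per_worker: int, workers: int) -> float:
    """Run one independent task per worker process and combine the counts."""
    tasks = [(seed, samples_per_worker) for seed in range(workers)]
    with mp.Pool(processes=workers) as pool:
        hits = pool.starmap(count_hits, tasks)
    return 4.0 * sum(hits) / (samples_per_worker * workers)


if __name__ == "__main__":
    # Hypothetical sizes; one worker per available core.
    n_workers = mp.cpu_count()
    print(estimate_pi(samples_per_worker=200_000, workers=n_workers))
```

Each task is independent and needs no communication until the final sum, which is the kind of workload that also carries over naturally from one machine to a cluster.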
On the second day, practical data science case studies requiring massive computations will be presented using R, Python and Julia. For Julia, both thread-based and process-based parallelization will be discussed. In the last part, we will show how to parallelize computations on GPUs using R.
The course assumes intermediate prior knowledge of scientific programming in at least one language (e.g. Python, GNU R, Java, Node.js or Julia) and a basic knowledge of the Linux or Mac command line. During the workshop, participants will create various types of distributed computational clusters. No prior knowledge of distributed computing or experience with large-scale simulation models is required.
Why High Performance Computing in the Cloud:
Cloud computing offers a new set of possibilities for computational scientists, especially in numerical computing, GPU computing and data analytics. The most significant advantage of HPC in the cloud is that cloud providers offer standardized hardware and software solutions, whereas proprietary HPC solutions require long, specialized training. As a result, a cluster of several thousand cores can be configured, built and provisioned within just a few minutes. Costs are extremely low, starting from $7 per hour for 1000 vCPU cores or $4 per hour for 100000 GPU cores, and universities can easily cover them through one of several computational grant programs offered by cloud providers. Together, this makes the cloud an ideal solution for many computation-intensive scientific problems. For simplicity, Amazon Web Services will be used as the example computation service during the course; using the services of a different public cloud provider would be very similar.
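To make the pricing above concrete, the following Python sketch turns the quoted "starting from" rate of $7 per 1000 vCPU cores per hour into a back-of-the-envelope budget. The function name `estimate_cost` is hypothetical, and the default rate is only the lower bound mentioned in the text; actual prices depend on the instance type, region and purchasing model.

```python
def estimate_cost(cores: int, hours: float,
                  usd_per_1000_core_hours: float = 7.0) -> float:
    """Approximate cost in USD of running `cores` vCPUs for `hours`.

    The default rate is the "starting from" figure quoted in the text,
    used here as an assumption rather than a current price.
    """
    if cores < 0 or hours < 0:
        raise ValueError("cores and hours must be non-negative")
    return cores / 1000 * hours * usd_per_1000_core_hours


if __name__ == "__main__":
    cost = estimate_cost(cores=2000, hours=3)
    print(f"2000 vCPUs for 3 hours: about ${cost:.2f}")
```

At the quoted rate, a 2000-core cluster running for 3 hours would come to roughly $42, which fits comfortably within the workshop credit described below.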
Participation Requirements:
The workshop will be hands-on, so participants should bring their laptops with an SSH client installed. On Linux (including Ubuntu), SSH is built in, while on Windows the best option is the SSH client that is installed automatically with Git. Additionally, Windows users should install the ConEmu console, which is a much better option than the standard Command Prompt.
For convenient data transfer between your laptop and the cloud (including Linux instances and S3 data storage), we recommend CyberDuck. Users with more Bash experience can use scp and the aws command line interface instead.
Each participant should set up an AWS account. Workshop participants will be given free AWS credit coupons worth approximately 50 USD. This will be sufficient to perform all tasks during the workshop and for self-study outside of class. Please note that while the setup process takes only 3 minutes, it can take up to one day for Amazon to provision the AWS account. We therefore strongly advise you to start at least two days before the workshop.
For your convenience, we recommend having interpreters for your favorite scripting languages installed (out of the three considered during the course: Python, GNU R and Julia). We will be working with cloud installations of those environments, but you still might want to run some tests on your own machine.
If you encounter any problems with AWS account setup, please contact Przemysław Szufel for support or reach out to AWS support directly.
Instructors:
Dr. Przemysław Szufel
Przemysław Szufel is an Assistant Professor in the Decision Support and Analysis Unit at the Warsaw School of Economics.
His current research focuses on methods for executing large-scale simulations for numerical experiments and optimization. He is working on asynchronous algorithms for parallel execution of large-scale simulations in the cloud and in distributed computational environments. He is an author or co-author of several open-source tools for high performance and numerical simulation (such as KissCluster, D-MASON, Isislab SOF, SilverDecisions, PyCX), and actively participates in their development. He is also a co-author of various algorithms for distributed simulation models (such as AKG, AOCBA).
Dr. Bogumił Kamiński
Bogumił Kamiński is the Head of the Decision Analysis and Support Unit at the Warsaw School of Economics. He is a member of the Management Committee of the European Social Simulation Association (ESSA) and Vice President of the Polish Chapter of the Institute for Operations Research and the Management Sciences (INFORMS). His field of expertise is operations research, with a special focus on industrial applications of forecasting, optimization and simulation. He has 15 years of experience teaching data science topics in undergraduate, graduate, and MBA courses.
This event has reached capacity and registration has closed.
Schedule
Day 1
08:45 to 09:00 | Check in
09:00 to 10:00 |
10:00 to 10:15 | Coffee break
10:15 to 12:00 | Warm-up: computations using multiple cores on a single machine
12:00 to 13:00 | Lunch
13:00 to 14:45 |
14:45 to 15:00 | Coffee break
15:00 to 16:45 |
16:45 to 17:00 | Day 1 wrap-up

Day 2
09:00 to 10:30 |
10:30 to 10:45 | Coffee break
10:45 to 12:00 | Case study: parallelization of Julia code through multiprocessing
12:00 to 13:00 | Lunch
13:00 to 14:45 | Case study: parallelization of Julia code with threads
14:45 to 15:00 | Coffee break
15:00 to 16:45 |
16:45 to 17:00 | Day 2 wrap-up