Scalable Data Science - 1.x from Middle Earth

Scalable data science is a technical course in the area of Big Data, aimed at the needs of the
data industry. The course uses Apache Spark, a fast and general engine for large-scale data
processing, via Databricks to compute with datasets that won't fit on a single computer. It
introduces Spark’s core concepts via hands-on coding, including resilient distributed datasets and
map-reduce algorithms, DataFrames and Spark SQL on Catalyst, scalable machine-learning pipelines in
MLlib, and vertex programs using the distributed graph-processing framework GraphX. We will solve
instances of real-world big data decision problems from various scientific domains.

This content is being prepared by Raazesh Sainudiin and Sivanand Sivaram,
with assistance from Paul Brouwers, Dillon George and Ivan Sadikov.

All course projects by seven enrolled and four observing students for Semester 1 of 2016 at UC, Ilam are part of this content.

How to self-learn this content?

The 2016 instance of this scalable-data-science course finished on June 30, 2016.

To learn Apache Spark for free, try Databricks Community Edition, starting from https://databricks.com/try-databricks.

All course content can be loaded for self-paced learning by copying the URL of the 2016/Spark1_6_to_1_3/scalable-data-science.dbc archive
and importing it from that URL into your free Databricks Community Edition.

The GitBook version of this content is at https://www.gitbook.com/book/raazesh-sainudiin/scalable-data-science/details.

The browsable GitHub Pages version of the content is at http://lamastex.github.io/scalable-data-science/.

How to cite this work?

Scalable Data Science - 1.x from Middle Earth, Raazesh Sainudiin and Sivanand Sivaram, Published by GitBook https://www.gitbook.com/book/lamastex/scalable-data-science/details, 791 pages, 24th July 2017.

Supported By

Databricks Academic Partners Program and Amazon Web Services Educate.

Summary of Contents

Contribute

All course content is currently being pushed by Raazesh Sainudiin after it has been tested in
the Databricks cloud (mostly under Spark 1.6, with some parts involving Magellan under Spark 1.5.1).

The markdown version for the gitbook is generated from the Databricks .scala, .py and other source files.
The gitbook is not a substitute for the Databricks notebooks available in the Databricks cloud. The following issue needs to be resolved:

A stable solution is needed for showing the output of various Databricks cells in the gitbook, including output from display_HTML and frameIt with their in-place embeds of web content.

Please feel free to fork the GitHub repository:

https://github.com/raazesh-sainudiin/scalable-data-science.

Furthermore, in anticipation of Spark 2.0, this mostly Spark 1.6 version could be enhanced with a Spark 2.0-specific upgrade.

Please report any typos or send suggestions to raazesh.sainudiin@gmail.com.

Please read the note on babel to understand how the gitbook is generated from the .scala source of the Databricks notebooks.

Raazesh Sainudiin,
Laboratory for Mathematical Statistical Experiments, Christchurch Centre
and School of Mathematics and Statistics,
University of Canterbury,
Private Bag 4800,
Christchurch 8041,
Aotearoa New Zealand

Sun Jun 19 21:59:19 NZST 2016

Introduction