# SDS-2.2

**Scalable Data Science from Atlantis**, *A Big Data Course in Apache Spark 2.2*

Welcome! This is a 2017 Uppsala Big Data Meetup of a fast-paced PhD course sequel in data science.

Most lectures are streamed live via Hangouts On Air on YouTube at this channel and archived in this playlist. We are not set up for live interactions online.

There are two parts to the course sequel:

- Introduction to Data Science
- Fundamentals of Data Science

## SDS-2.2 Book

The course gitbook, edited by Dan Lilja, Raazesh Sainudiin and Tilo Wiklund, is under construction at:

## Uppsala University Students

*Introduction to Data Science* and *Fundamentals of Data Science* together form a *"big data"* sequel that introduces researchers from various scientific and engineering disciplines to the rapidly growing field of data science. It equips them with the latest industry-recommended open-source tools for extracting insights from large-scale datasets through the complete data science process: collecting, cleaning, extracting, transforming, loading, exploring, modelling, analysing, tuning, predicting, communicating and serving.

## Support

The course is supported by the Databricks Academic Partners Programme and Amazon Web Services Educate. (Databricks is the UC Berkeley startup founded by the creators of Apache Spark™, a fast and general engine for large-scale data processing.) The course aims to train data scientists to meet the current needs of Stockholm's data industry, guided by feedback from the AI and Data Analytics Centre of Excellence at Combient AB, an industrial joint venture between 21 large companies in Sweden and Finland.

## Overview of Data Science Courses

Data Science is the study of the generalizable extraction of knowledge from data in a practical and scalable manner. A data scientist is characterized by an integrated skill set spanning mathematics, statistics, machine learning, artificial intelligence, databases and optimization along with a deep understanding of the craft of problem formulation to engineer effective solutions (DOI: 10.1145/2500499, DOI: 10.1126/science.aaa8415). This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools.

In particular, in *Introduction to Data Science*, students will be introduced to the basic skills needed to collect, store, extract, transform, load, explore, model, evaluate, tune, test and predict using large structured and unstructured datasets from the real world.
The course will use Apache Spark, an open-source, fast and general engine for large-scale data processing.
Various common methods will be covered in depth, with a focus on the student’s ability to execute the data science process on their own through a course project (in *Fundamentals of Data Science*).

**Target group/s and recommended background**

Students are recommended to have basic knowledge of algorithms and some programming experience (equivalent to completing an introductory course in computer science), and some familiarity with linear algebra, calculus, probability and statistics.
These basic concepts will be introduced quickly, and one could take the course by putting extra effort into these basics as the course progresses.
Successful completion of the course on *Introduction to Data Science* or equivalent and an interest in doing a course project in a small team is a prerequisite for *Fundamentals of Data Science*.

## Introduction to Data Science

**Contents, study format and form of examination**

The course will cover the following contents:

- key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data scientist’s toolkit: Shell/Scala/SQL/Python/R, etc.
- practical understanding of the *data science process*:
  - ingest, extract, transform, load and explore structured and unstructured datasets
  - model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
  - communicate and serve the model’s predictions to the clients
- practical applications of predictive models for classification and regression, using case-studies of real datasets
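
As a rough illustration of the "model, train/fit, validate/select, tune, test and predict" steps listed above, here is a minimal sketch in plain Python. It is not course material: the synthetic data, the one-dimensional ridge regression, the penalty grid and the names `make_data`, `fit_ridge` and `run_process` are all assumptions made for this example. The course itself works with Apache Spark rather than plain Python.

```python
import random


def make_data(n, seed=0):
    """Synthetic (x, y) pairs from y = 2x + 1 plus Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in xs]


def fit_ridge(train, lam):
    """Fit y ~ a*x + b by least squares with an L2 penalty `lam` on the slope."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    a = sxy / (sxx + lam)      # penalised slope (closed form)
    b = mean_y - a * mean_x    # intercept is left unpenalised
    return a, b


def predict(model, x):
    a, b = model
    return a * x + b


def mse(model, data):
    """Mean squared prediction error of `model` on `data`."""
    return sum((predict(model, x) - y) ** 2 for x, y in data) / len(data)


def run_process(n=300, seed=0, grid=(0.0, 1.0, 10.0, 100.0)):
    data = make_data(n, seed)
    # Split once into train (60%), validation (20%) and test (20%).
    i, j = n * 6 // 10, n * 8 // 10
    train, valid, test = data[:i], data[i:j], data[j:]
    # Tune: pick the penalty with the lowest validation error.
    best_lam = min(grid, key=lambda lam: mse(fit_ridge(train, lam), valid))
    # Fit the selected model on the training set, then test on held-out data.
    model = fit_ridge(train, best_lam)
    return {"lam": best_lam, "model": model, "test_mse": mse(model, test)}


if __name__ == "__main__":
    print(run_process())
```

The test set is touched only once, after the penalty has been selected on the validation set, so the reported test error is not biased by the tuning step.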

There will be assignments involving computer programming and data analysis. The grade is based on attendance, course participation and successful completion of programming assignments.

## Fundamentals of Data Science

**Contents, study format and form of examination**

The course will cover the following contents:

- key concepts in distributed fault-tolerant filestores and in-memory computing
- understanding the *data science process* (the underlying mathematics, numerics and statistics as well as concerns around privacy and ethics at a deeper level)
- applications of current predictive models and methods in data science to make/take common decisions/actions, including classification, regression, anomaly detection and recommendation, using case-studies of real datasets
- apply the data science process to one’s own case study and work collaboratively in a group (course project).

There will be assignments involving computer programming and data analysis, and written and oral presentation of the course project. The grade will be based on attendance, course participation, successful completion of programming assignments and the final course project.

The *course project* could take one of the following forms:

- analyzing an interesting dataset using existing methods
- obtaining your own dataset and analyzing it using existing methods
- building your own data product
- focussing on the theoretical properties of an algorithm, etc.

Students are encouraged to work in teams of two or three for a project.

The project should be presented orally during the first week of January 2018 and made available as a written report with all source code and explanations.
Assignments, on the other hand, are to be completed individually.

## Outline of Topics

- Uploading Course Content into Databricks
- Introduction: What is Data Science and the Data Science Process?
- Apache Spark and Big Data
- Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
- Ingest, Extract, Transform, Load and Explore with noSQL
- Introduction to Machine Learning
- Unsupervised Learning - Clustering
  - k-means (1 million songs dataset)
  - Gaussian Mixture Models and EM Algorithm
  - K-Means 1MSongs Intro
  - 1MSongs - 1 ETL
  - 1MSongs - 2 Explore
  - 1MSongs - 3 Model
- Supervised Learning - Decision Trees
  - Linear Regression (power-plant data)
  - Decision Trees for Classification (hand-written digit recognition)
  - Decision Trees for Digits
- Linear Algebra for Distributed Machine Learning
- Supervised Learning - Regression
- Supervised Learning - Random Forests
- Mining Networks and Graphs with Spark's GraphX
  - Extraction, transformation and loading of network data
  - Discovery of communities in graphs (Wikipedia click streams)
  - Label and belief propagation
  - Querying sub-structures in graphs (US Airport network)
  - Graph Frames Intro
  - Ontime Flight Performance
- Spark Streaming with Discrete Resilient Distributed Datasets
- Social networks as distributed graphs (twitter data)
- Supervised Learning - Regression as a Complete Data Science Process
- Scalable Geospatial Analytics
- ETL of XML-structured Dataset
- Unsupervised Learning - Latent Dirichlet Allocation
- Collaborative Filtering for Recommendation Systems
  - Matrix completion via Alternating Least Squares
  - Movie Recommendation - Alternating Least Squares
- Spark Structured Streaming
- Sketching for Anomaly Detection in Streams
- Neural networks and Deep Learning
  - Linear and logistic regression as neural networks
  - Back propagation for gradient descent
  - Use of pre-trained neural networks from Google/Baidu/Facebook in your machine learning pipeline
  - Intro to Deep Learning
  - Outline for DL
  - Neural Networks
  - Deep Feed-Forward NNs with Keras
  - Hello TensorFlow
  - Batch TensorFlow with Matrices
  - Convolutional Neural Nets
  - MNIST: Multi-Layer Perceptron
  - MNIST: Convolutional Neural Net
  - CIFAR-10: CNNs
  - Recurrent Neural Nets and LSTMs
  - LSTM solution
  - LSTM spoke Zarathustra
  - Generative Networks
  - Reinforcement Learning
  - DL Operations
- Data Science and Ethical Issues
  - Discussions on ethics, privacy and security
  - Case studies from the field
- Advice from Industry
- Project Ideas
- Student Projects
  - Student Project 01 on Network Anomaly Detection
  - Student Project 02 on Twitter UK Election
  - Student Project 03 on Article Topics in Retweet Networks
  - Student Project 03 on Article Topics in Retweet Networks - scalable web scraper
  - Student Project 04 on Power Forecasting - Part 0
  - Student Project 04 on Power Forecasting - Part 1
  - Student Project 04 on Power Forecasting - Part 2
  - Student Project 04 on Power Forecasting - Part 3
  - Student Project 05 on Hail Scala for Population Genomics ETL

# Supplements

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

### Mathematical Statistical Foundations

- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.

### Data Science / Data Mining at Scale

- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
- Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
- Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
- Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

### Maths/Stats Refreshers

- Linear Algebra Refresher Course (with Python)
- Intro to Descriptive Statistics
- Intro to Inferential Statistics

### Apache Spark / shell / GitHub / Scala / Python / TensorFlow / R

- Learning Spark: Lightning-Fast Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale, O'Reilly, 2015.
- Command-line Basics
- Intro to Data Analysis: Using NumPy and Pandas
- Data Analysis with R by Facebook
- Machine Learning Crash Course with TensorFlow APIs by Google Developers
- Data Visualization and D3.js
- Scala Programming
- Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.