SDS-2.2

Scalable Data Science from Atlantis, A Big Data Course in Apache Spark 2.2

Welcome! This is the 2017 Uppsala Big Data Meetup offering of a fast-paced PhD course sequel in data science.

Most lectures are streamed live via Hangouts On Air on YouTube at this channel and archived in this playlist. We are not set up for live interactions online.

There are two parts to the course sequel:

Introduction to Data Science
Fundamentals of Data Science

SDS-2.2 Book

The course gitbook, edited by Dan Lilja, Raazesh Sainudiin and Tilo Wiklund, is under construction at:

https://lamastex.gitbooks.io/sds-2-2/.

Uppsala University Students

Introduction to Data Science and Fundamentals of Data Science together form a "big data" sequel that introduces researchers from various scientific and engineering disciplines to the rapidly growing field of data science. It equips them with the latest industry-recommended open-source tools for extracting insights from large-scale datasets through the complete data science process: collecting, cleaning, extracting, transforming, loading, exploring, modelling, analysing, tuning, predicting, communicating and serving.

Support

The course is supported by the Databricks Academic Partners Programme and Amazon Web Services Educate. Databricks is the UC Berkeley startup founded by the creators of Apache Spark™, a fast and general engine for large-scale data processing. The course aims to train data scientists to meet the current needs of Stockholm's data industry, guided by feedback from the AI and Data Analytics Centre of Excellence at Combient AB, an industrial joint venture between 21 large companies in Sweden and Finland.

Overview of Data Science Courses

Data Science is the study of the generalizable extraction of knowledge from data in a practical and scalable manner. A data scientist is characterized by an integrated skill set spanning mathematics, statistics, machine learning, artificial intelligence, databases and optimization along with a deep understanding of the craft of problem formulation to engineer effective solutions (DOI: 10.1145/2500499, DOI: 10.1126/science.aaa8415). This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools.

In particular, in Introduction to Data Science, students will be introduced to the basic skills needed to collect, store, extract, transform, load, explore, model, evaluate, tune, test and predict using large structured and unstructured datasets from the real world. The course will use Apache Spark, an open-source, fast and general engine for large-scale data processing. Various common methods will be covered in depth, with a focus on the student’s ability to execute the data science process on their own through a course project (in Fundamentals of Data Science).

Target group/s and recommended background

Students are recommended to have basic knowledge of algorithms and some programming experience (equivalent to completing an introductory course in computer science), and some familiarity with linear algebra, calculus, probability and statistics. These basic concepts will be introduced quickly, and one could take the course by putting extra effort into these basics as the course progresses. Successful completion of the course on Introduction to Data Science or equivalent, and an interest in doing a course project in a small team, are prerequisites for Fundamentals of Data Science.

Introduction to Data Science

Contents, study format and form of examination

The course will cover the following contents:

key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data scientist’s toolkit: Shell/Scala/SQL/Python/R, etc.
practical understanding of the data science process:
- ingest, extract, transform, load and explore structured and unstructured datasets
- model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
- communicate and serve the model’s predictions to the clients
practical applications of predictive models for classification and regression, using case-studies of real datasets
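
The model, train/fit, validate/select, tune, test and predict steps listed above can be sketched in a few lines. The following is only a minimal illustration in plain Python, not course material: the synthetic dataset, the one-dimensional ridge regression used as the estimator, the candidate penalty values and all function names (`make_data`, `fit_ridge`, `mse`, `run_process`) are assumptions made for this example. In the course itself these steps are carried out with Apache Spark.

```python
import random

def make_data(n, seed=0):
    """Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)) for x in xs]

def fit_ridge(data, lam):
    """Train/fit: least squares for y ~ a*x + b with an L2 penalty lam on the slope a."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / (sxx + lam)   # penalised slope
    b = my - a * mx         # unpenalised intercept
    return a, b

def mse(model, data):
    """Mean squared prediction error of a fitted (slope, intercept) pair."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

def run_process(n=300, lams=(0.0, 1.0, 10.0, 100.0, 1000.0)):
    data = make_data(n)
    # Hold out validation and test sets; the 60/20/20 split is an arbitrary choice.
    train, valid, test = data[:180], data[180:240], data[240:]
    # Tune: fit one model per candidate penalty on the training set only.
    fitted = {lam: fit_ridge(train, lam) for lam in lams}
    # Validate/select: keep the penalty with the lowest validation error.
    best_lam = min(lams, key=lambda lam: mse(fitted[lam], valid))
    best = fitted[best_lam]
    # Test: report the error on data never used for fitting or selection.
    return best_lam, best, mse(best, test)

if __name__ == "__main__":
    best_lam, (a, b), test_mse = run_process()
    print(f"selected lam={best_lam}, slope={a:.3f}, intercept={b:.3f}, test MSE={test_mse:.3f}")
    # Predict: apply the selected model to a new input.
    print("prediction at x=4.0:", a * 4.0 + b)
```

Keeping the test set out of both fitting and selection is the point of the three-way split: the reported test error then estimates how the chosen model behaves on unseen data.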

There will be assignments involving computer programming and data analysis. The grade is based on attendance, course participation and successful completion of programming assignments.

Fundamentals of Data Science

Contents, study format and form of examination

The course will cover the following contents:

key concepts in distributed fault-tolerant filestores and in-memory computing
understanding the data science process (the underlying mathematics, numerics and statistics as well as concerns around privacy and ethics at a deeper level)
applications of current predictive models and methods in data science to make/take common decisions/actions, including classification, regression, anomaly detection and recommendation, using case-studies of real datasets
application of the data science process to one’s own case study, working collaboratively in a group (course project)

There will be assignments involving computer programming and data analysis, and written and oral presentation of the course project. The grade will be based on attendance, course participation, successful completion of programming assignments and the final course project.

The course project could take one of the following forms:

analyzing an interesting dataset using existing methods
obtaining your own dataset and analyzing it using existing methods
building your own data product
focussing on the theoretical properties of an algorithm, etc.

Students are encouraged to work in teams of two or three for a project.
The project should be presented orally during the first week of January 2018 and made available as a written report with all source code and explanations. Assignments, on the other hand, are to be completed individually.

Outline of Topics

Uploading Course Content into Databricks
- 2017 dbc ARCHIVES
Introduction: What is Data Science and the Data Science Process?
- Introduction
Apache Spark and Big Data
- Why Spark?
- Login to databricks
- Scala Crash Course
Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
- RDDs
- RDDs HOMEWORK
- Word Count - SOU
- Russian Word Count
Ingest, Extract, Transform, Load and Explore with noSQL
- Spark SQL Basics
- SparkSQL HW-a ProgGuide
- SparkSQL HW-b ProgGuide
- SparkSQL HW-c ProgGuide
- SparkSQL HW-d ProgGuide
- SparkSQL HW-e ProgGuide
- SparkSQL HW-f ProgGuide
- ETL Diamonds Data
- ETL Power Plant
- Wiki Click streams
Introduction to Machine Learning
- Simulation Intro
- Machine Learning Intro
Unsupervised Learning - Clustering
- k-means (1 million songs dataset)
- Gaussian Mixture Models and EM Algorithm
- K-Means 1MSongs Intro
- 1MSongs - 1 ETL
- 1MSongs - 2 Explore
- 1MSongs - 3 Model
Supervised Learning - Decision Trees
- Linear Regression (power-plant data)
- Decision Trees for Classification (hand-written digit recognition)
- Decision Trees for Digits
Linear Algebra for Distributed Machine Learning
- Linear Algebra Intro
- Linear Regression Intro
- DLA (Distrib. Linear Algebra)
- DLA - Data Types Prog Guide
- DLA - Local Vector
- DLA - Labeled Point
- DLA - Local Matrix
- DLA - Distributed Matrix
- DLA - Row Matrix
- DLA - Indexed Row Matrix
- DLA - Coordinate Matrix
- DLA - Block Matrix
Supervised Learning - Regression
- Power Plant - Model Tune Evaluate
Supervised Learning - Random Forests
- Activity Detection - Random Forest
Mining Networks and Graphs with Spark's GraphX
- Extraction, transformation and loading of network data
- Discovery of communities in graphs (Wikipedia click streams)
- Label and belief propagation
- Querying sub-structures in graphs (US Airport network)
- Graph Frames Intro
- Ontime Flight Performance
Spark Streaming with Discrete Resilient Distributed Datasets
- Spark Streaming Intro
Social networks as distributed graphs (Twitter data)
- Extended Twitter Utils
- Tweet Transmission Trees
- REST Twitter API
- Tweet Collector
- Tweet Track, Follow
- Tweet Hashtag Counter
- Tweet Classifier
Supervised Learning - Regression as a Complete Data Science Process
- Power Plant - Model Tune Evaluate Deploy
Scalable Geospatial Analytics
- Geospatial Analytics in Magellan
- NY Taxi trips in Magellan
ETL of XML-structured Dataset
- Old Bailey Online - ETL of XML
Unsupervised Learning - Latent Dirichlet Allocation
- 20 Newsgroups - Latent Dirichlet Allocation
- Cornell Movie Dialogs - Latent Dirichlet Allocation
Collaborative Filtering for Recommendation Systems
- Matrix completion via Alternating Least Squares
- Movie Recommendation - Alternating Least Squares
Spark Structured Streaming
- Animal Names Streaming Files
- Normal Mixture Streaming Files
- Structured Streaming Prog Guide
- Graph Mixture Streaming Files
- Structured Streaming of JSONs
Sketching for Anomaly Detection in Streams
- T-Digest Normal Mixture Streaming Files
- Sketching with T-Digest
- Streaming with T-Digest
Neural networks and Deep Learning
- Linear and logistic regression as neural networks
- Back propagation for gradient descent
- Use of pre-trained neural networks from Google/Baidu/Facebook in your machine learning pipeline
- Intro to Deep Learning
- Outline for DL
- Neural Networks
- Deep Feed-Forward NNs with Keras
- Hello TensorFlow
- Batch TensorFlow with Matrices
- Convolutional Neural Nets
- MNIST: Multi-Layer-Perceptron
- MNIST: Convolutional Neural Net
- CIFAR-10: CNNs
- Recurrent Neural Nets and LSTMs
- LSTM solution
- LSTM spoke Zarathustra
- Generative Networks
- Reinforcement Learning
- DL Operations
Data Science and Ethical Issues
- Discussions on ethics, privacy and security
- Case studies from the field
Advice from Industry
- 2017 Advice from Data Industry
Project Ideas
- Potential Projects
Student Projects
- Student Project 01 on Network Anomaly Detection
- Student Project 02 on Twitter UK Election
- Student Project 03 on Article Topics in Retweet Networks
- Student Project 03 on Article Topics in Retweet Networks - scalable web scraper
- Student Project 04 on Power Forecasting - Part 0
- Student Project 04 on Power Forecasting - Part 1
- Student Project 04 on Power Forecasting - Part 2
- Student Project 04 on Power Forecasting - Part 3
- Student Project 05 on Hail Scala for Population Genomics ETL

Supplements

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

Mathematical Statistical Foundations

Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.

Data Science / Data Mining at Scale

Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
Cathy O’Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Maths/Stats Refreshers

Apache Spark / shell / GitHub / Scala / Python / TensorFlow / R

Learning Spark: lightning-fast data analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
Advanced analytics with Spark: patterns for learning from data at scale, O'Reilly, 2015.
Command-line Basics
- Linux Command-line Basics
- Windows Command-line Basics
How to use Git and GitHub: Version control for code
Intro to Data Analysis: Using NumPy and Pandas
Data Analysis with R by Facebook
Machine Learning Crash Course with TensorFlow APIs by Google Developers
Data Visualization and D3.js
Scala Programming
Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.