SDS-2.2

Scalable Data Science from Atlantis, A Big Data Course in Apache Spark 2.2

Welcome! This is the 2017 Uppsala Big Data Meetup offering of a fast-paced PhD course sequence in data science.

Most lectures are streamed live via Hangouts On Air on YouTube at this channel and archived in this playlist. We are not set up for live interactions online.

The course sequence has two parts:

  1. Introduction to Data Science
  2. Fundamentals of Data Science

SDS-2.2 Book

The course gitbook, edited by Dan Lilja, Raazesh Sainudiin and Tilo Wiklund, is under construction at:

Uppsala University Students

Together, Introduction to Data Science and Fundamentals of Data Science form a "big data" course sequence that introduces researchers from various scientific and engineering disciplines to the rapidly growing field of data science. The sequence equips them with the latest industry-recommended open-source tools for extracting insights from large-scale datasets through the complete data science process: collecting, cleaning, extracting, transforming, loading, exploring, modelling, analysing, tuning, predicting, communicating and serving.

Support

The course is supported by the Databricks academic partners programme (Databricks is the UC Berkeley startup founded by the creators of Apache Spark™, a fast and general engine for large-scale data processing) and by Amazon Web Services Educate. It aims to train data scientists to meet the current needs of Stockholm's data industry, guided by feedback from the AI and Data Analytics Centre of Excellence at Combient AB, an industrial joint venture between 21 large companies in Sweden and Finland.

Overview of Data Science Courses

Data Science is the study of the generalizable extraction of knowledge from data in a practical and scalable manner. A data scientist is characterized by an integrated skill set spanning mathematics, statistics, machine learning, artificial intelligence, databases and optimization along with a deep understanding of the craft of problem formulation to engineer effective solutions (DOI: 10.1145/2500499, DOI: 10.1126/science.aaa8415). This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools.

In particular, in Introduction to Data Science, students will be introduced to the basic skills needed to collect, store, extract, transform, load, explore, model, evaluate, tune, test and predict using large structured and unstructured datasets from the real world. The course will use Apache Spark, the latest open-source, fast and general engine for large-scale data processing. Various common methods will be covered in depth, with a focus on the student’s ability to execute the data science process on their own through a course project (in Fundamentals of Data Science).

Target group/s and recommended background

Students are recommended to have basic knowledge of algorithms and some programming experience (equivalent to completing an introductory course in computer science), and some familiarity with linear algebra, calculus, probability and statistics. These basic concepts will be introduced quickly, and one could take the course by putting extra effort into these basics as the course progresses. Successful completion of the course on Introduction to Data Science or equivalent, and an interest in doing a course project in a small team, are prerequisites for Fundamentals of Data Science.

Introduction to Data Science

Contents, study format and form of examination

The course will cover the following contents:

  • key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data scientist’s toolkit: Shell/Scala/SQL/Python/R, etc.
  • practical understanding of the data science process:
    • ingest, extract, transform, load and explore structured and unstructured datasets
    • model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
    • communicate and serve the model’s predictions to the clients
  • practical applications of predictive models for classification and regression, using case-studies of real datasets
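
The model, train/fit, validate/select, tune, test and predict steps listed above can be sketched in a few lines. The following is only a minimal illustration in plain Python on synthetic data, not the course's Spark material: the helper names (`fit_ridge`, `predict`, `mse`), the data-generating line y = 2x + 1 and the candidate penalty values are all assumptions made for this example.

```python
import random


def fit_ridge(xs, ys, lam):
    """Fit y ~ a*x + b by least squares with an L2 penalty `lam` on the slope a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / (sxx + lam)      # penalised slope
    b = mean_y - a * mean_x    # unpenalised intercept
    return a, b


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    a, b = model
    return a * x + b


def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic stand-in for an ingested dataset (assumed for illustration only).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Split into training, validation and test sets.
train = (xs[:200], ys[:200])
valid = (xs[200:250], ys[200:250])
test = (xs[250:], ys[250:])

# Tune: pick the penalty with the lowest validation error.
candidates = [0.0, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: mse(fit_ridge(*train, lam), *valid))

# Fit the selected model, estimate its error on held-out test data, then predict.
model = fit_ridge(*train, best_lam)
test_error = mse(model, *test)
prediction = predict(model, 4.0)
```

The test set is touched only once, after the penalty has been selected on the validation set, so `test_error` remains an honest estimate of how the chosen model behaves on unseen data.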

There will be assignments involving computer programming and data analysis. The grade is based on attendance, course participation and successful completion of programming assignments.

Fundamentals of Data Science

Contents, study format and form of examination

The course will cover the following contents:

  • key concepts in distributed fault-tolerant filestores and in-memory computing
  • understanding the data science process (the underlying mathematics, numerics and statistics as well as concerns around privacy and ethics at a deeper level)
  • applications of current predictive models and methods in data science to make/take common decisions/actions, including classification, regression, anomaly detection and recommendation, using case-studies of real datasets
  • application of the data science process to one’s own case study, working collaboratively in a group (course project)

There will be assignments involving computer programming and data analysis, and written and oral presentation of the course project. The grade will be based on attendance, course participation, successful completion of programming assignments and the final course project.

The course project could take one of the following forms:

  • analyzing an interesting dataset using existing methods
  • obtaining your own dataset and analyzing it using existing methods
  • building your own data product
  • focussing on the theoretical properties of an algorithm, etc.

Students are encouraged to work in teams of two or three for a project.
The project should be presented orally during the first week of January 2018 and made available as a written report with all source code and explanations. Assignments, on the other hand, are to be completed individually.

Outline of Topics

  1. Uploading Course Content into Databricks
  2. Introduction: What is Data Science and the Data Science Process?
  3. Apache Spark and Big Data
  4. Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
  5. Ingest, Extract, Transform, Load and Explore with noSQL
  6. Introduction to Machine Learning
  7. Unsupervised Learning - Clustering
  8. Supervised Learning - Decision Trees
    • Linear Regression (power-plant data)
    • Decision Trees for Classification (hand-written digit recognition)
    • Decision Trees for Digits
  9. Linear Algebra for Distributed Machine Learning
  10. Supervised Learning - Regression
  11. Supervised Learning - Random Forests
  12. Mining Networks and Graphs with Spark's GraphX
    • Extraction, transformation and loading of network data
    • Discovery of communities in graphs (Wikipedia click streams)
    • Label and belief propagation
    • Querying sub-structures in graphs (US Airport network)
    • GraphFrames Intro
    • On-time Flight Performance
  13. Spark Streaming with Discrete Resilient Distributed Datasets
  14. Social networks as distributed graphs (Twitter data)
  15. Supervised Learning - Regression as a Complete Data Science Process
  16. Scalable Geospatial Analytics
  17. ETL of XML-structured Dataset
  18. Unsupervised Learning - Latent Dirichlet Allocation
  19. Collaborative Filtering for Recommendation Systems
  20. Spark Structured Streaming
  21. Sketching for Anomaly Detection in Streams
  22. Neural networks and Deep Learning
  23. Data Science and Ethical Issues
    • Discussions on ethics, privacy and security
    • Case studies from the field
  24. Advice from Industry
  25. Project Ideas
  26. Student Projects

Supplements

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

Mathematical Statistical Foundations

  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.

Data Science / Data Mining at Scale

  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
  • Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
  • Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
  • Cathy O’Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Maths/Stats Refreshers

Apache Spark / shell / github / Scala / Python / Tensorflow / R

Computer Science Refreshers