The Basics and Essentials

Learn to Work with Your Local Laptop

  • Command-line Basics

    • Linux Command-line Basics
    • Windows Command-line Basics
  • How to use Git and GitHub: Version control for code

  • Scala

    • Scala basics in a hurry
    • Scala School! by Twitter
      • Scala School! build yourself from scratch
  • Apache Spark in Scala on your Laptop

Other Courses for free!

Beginner Courses (free)

  • Intro to Statistics
  • Intro to Computer Science (with Python)
  • Intro to Descriptive Statistics
  • Intro to Inferential Statistics
  • Intro to Python Programming

Intermediate to Advanced Courses (free)

  • Mining Massive Datasets: Stanford Online
  • Linear Algebra Refresher Course (with Python)
  • Deploying a Hadoop Cluster
  • Intro to Algorithms
  • Machine Learning
    • Machine Learning: Supervised Learning
    • Machine Learning: Unsupervised Learning
    • Machine Learning: Reinforcement Learning
  • Differential Equations in Action (Python)
  • A/B Testing
  • Intro to Relational Databases
  • Data Visualization and D3.js
  • Intro to Hadoop and MapReduce
  • Real-Time Analytics with Apache Storm
  • Intro to Data Analysis: Using NumPy and Pandas
  • Intro to Data Science
  • Intro to Artificial Intelligence
  • Data Analysis with R by Facebook
  • Intro to Parallel Programming by Nvidia
  • Model Building and Validation by AT&T
