
Week 3: Introduction to Spark SQL, ETL and EDA of Diamonds, Power Plant and Wiki Click Streams Data

  • Sections
    • Spark SQL Introduction
      • HOMEWORK: overview
      • HOMEWORK: getting started
      • HOMEWORK: data sources
      • HOMEWORK: performance tuning
      • HOMEWORK: distributed sql engine
    • ETL and EDA of Diamonds Data
    • ETL and EDA of Power Plant Data
    • ETL and EDA of Wiki Click Stream Data
