SDS-2.2, Scalable Data Science

Archived YouTube videos of this live unedited lab-lecture:

Archived YouTube video of this live unedited lab-lecture

Why Apache Spark?

  • Apache Spark: A Unified Engine for Big Data Processing, by Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker and Ion Stoica. Communications of the ACM, Vol. 59 No. 11, Pages 56-65, doi: 10.1145/2934664

Apache Spark ACM Video

Right-click the image-link above and open it in a new tab to watch the video (4 minutes), or read about it in the Communications of the ACM in the frame below or via the link above.

Some BDAS History behind Apache Spark

The Berkeley Data Analytics Stack is BDAS

Spark is a sub-stack of BDAS

Source:

BDAS State of The Union Talk by Ion Stoica, AMP Camp 6, Nov 2015

The following talk outlines the motivation and insights behind BDAS's research approach and how it addresses the cross-disciplinary nature of Big Data challenges and current work.

  • watch later (5 mins.):

Ion Stoica on State of Spark Union AmpCamp6

Key Points

  • started in 2011 with strong public-private funding
    • Defense Advanced Research Projects Agency
    • Lawrence Berkeley National Laboratory
    • National Science Foundation
    • Amazon Web Services
    • Google
    • SAP
  • The Berkeley AMPLab is creating a new approach to data analytics to seamlessly integrate the three main resources available for making sense of data at scale:
    • Algorithms (machine learning and statistical techniques),
    • Machines (in the form of scalable clusters and elastic cloud computing), and
    • People (both individually as analysts and in crowds).
  • The lab is realizing its ideas through the development of a freely-available Open Source software stack called BDAS: the Berkeley Data Analytics Stack.
  • Several components of BDAS have gained significant traction in industry and elsewhere, including:
    • the Mesos cluster resource manager,
    • the Spark in-memory computation framework, a sub-stack of the BDAS stack,
    • and more...

The Big Data Problem, Hardware, Distributing Work, Handling Failed and Slow Machines

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 1:48): The Big Data Problem
    • The Big Data Problem by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 1:43): Hardware for Big Data
    • Hardware for Big Data by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 1:17): How to distribute work across a cluster of commodity machines? (a minimal code sketch follows this list)
    • How to distribute work across a cluster of commodity machines? by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 0:36): How to deal with failures or slow machines?
    • How to deal with failures or slow machines? by Anthony Joseph in BerkeleyX/CS100.1x
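
To make the distribution idea concrete, the following is a minimal sketch of how Spark spreads work over a cluster, assuming a SparkContext `sc` is already available (as it is in a Databricks notebook or spark-shell): the driver splits a collection into partitions, workers process the partitions in parallel, and the scheduler re-runs failed or slow tasks on other machines.

```scala
// Minimal sketch: distribute a word-count-style job across a cluster.
// Assumes a SparkContext `sc` is in scope (spark-shell / Databricks notebook).
val lines = sc.parallelize(Seq(
  "the big data problem",
  "hardware for big data",
  "distributing work across commodity machines"
), numSlices = 3)            // 3 partitions, each handled by a (possibly different) worker

val counts = lines
  .flatMap(_.split(" "))     // runs in parallel on each partition
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // partial counts are combined across machines

counts.collect().foreach(println)  // bring the (small) result back to the driver
```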

MapReduce and Apache Spark

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 1:48): Map Reduce (is bounded by Disk I/O)
    • Map Reduce by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 2:49): Apache Spark (uses Memory instead of Disk)
    • Apache Spark by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 3:00): Spark Versus MapReduce
    • Spark Versus MapReduce by Anthony Joseph in BerkeleyX/CS100.1x
  • SUMMARY (see the sketch after this list)
    • uses memory in addition to disk alone and is thus faster than Hadoop MapReduce
    • resilience is abstracted via RDDs (resilient distributed datasets)
    • RDDs can be recovered upon failure from their lineage graphs, the recipes for rebuilding them from raw data
    • Spark supports much more than MapReduce, including streaming, interactive in-memory querying, etc.
    • Spark demonstrated an unprecedented sort of 1 petabyte (1,000 terabytes) of data in 234 minutes on 190 Amazon EC2 instances (in 2015)
    • Spark expertise corresponds to the highest median salary in the US (~ USD 150K)
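
The summary points about in-memory computation and lineage-based recovery can be made concrete in a few lines. Here is a minimal sketch, again assuming a SparkContext `sc` is in scope: transformations only extend an RDD's lineage graph, nothing executes until an action is called, cached partitions are served from memory, and a lost partition is recomputed from its lineage rather than from a replicated copy on disk.

```scala
// Minimal sketch of RDD lineage and in-memory caching.
// Assumes a SparkContext `sc` is in scope (spark-shell / Databricks notebook).
val raw      = sc.parallelize(1 to 1000000)        // base RDD built from raw data
val squared  = raw.map(x => x.toLong * x)          // transformation: recorded in the lineage, not yet run
val filtered = squared.filter(_ % 2 == 0).cache()  // keep this RDD in memory once computed

filtered.count()                 // action: executes the lineage and fills the cache
filtered.take(5)                 // served from memory; a lost partition would be
                                 // recomputed from the lineage above, not re-read wholesale
println(filtered.toDebugString)  // prints the lineage graph of this RDD
```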

Key Papers



55 minutes out of 90+10 minutes.

We have come to the end of this section.

Next, let us get everyone to log in to Databricks so we can get our hands dirty with some Spark code!

10-15 minutes. Then break for 5.



To Stay Connected to Changes in Spark

Subscribe to YouTube Channels:

EXTRA: For historical insight, see the excerpts below from an interview with Ion Stoica.

Beginnings of Apache Spark and Databricks (academia-industry roots)

Ion Stoica on Starting Spark and DataBricks

Advantages of Apache Spark: A Unified System for Batch, Stream, Interactive / Ad Hoc or Graph Processing

Ion Stoica on Starting Spark and DataBricks

Main Goal of Databricks Cloud: To Make Big Data Easy

Ion Stoica on Starting Spark and DataBricks