SDS-2.2, Scalable Data Science

Archived YouTube video of this live unedited lab-lecture:


Wiki Clickstream Analysis

**Dataset:** 3.2 billion requests collected during the month of February 2015, grouped by (src, dest)

**Source:** https://datahub.io/dataset/wikipedia-clickstream/

(Image: NY clickstream visualization)

This notebook requires Spark 1.6+.

This notebook was originally a data analysis workflow developed with Databricks Community Edition, a free version of Databricks designed for learning Apache Spark.

Here we elucidate the original Python notebook (also linked here) used in Michael Armbrust's talk at Spark Summit East, February 2016, shared from https://twitter.com/michaelarmbrust/status/699969850475737088 (watch later).

(Image: Michael Armbrust at Spark Summit East)

Data set

(Image: Wikipedia logo)

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following Databricks file: /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed
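We can check that the file is there, and its size, using Databricks' file-system utilities (a quick sketch; dbutils is available in every Databricks notebook):

// List the hosted clickstream directory and eyeball the file size (~1.2 GB)
display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/data-001/clickstream/"))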

Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link https://datahub.io/dataset/wikipedia-clickstream in the embedding below. Also click through to the blog by Ellery Wulczyn, Data Scientist at the Wikimedia Foundation, to better understand how the data was generated (remember to right-click and use -> and <- if navigating within the embedded HTML frame below).

Run the next two cells for some housekeeping.

if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")

Loading and Exploring the data

val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[1] at textFile at <console>:34
Looking at the first few lines of the data
data.take(5).foreach(println)
prev_id    curr_id    n    prev_title    curr_title    type
    3632887    121    other-google    !!    other
    3632887    93    other-wikipedia    !!    other
    3632887    46    other-empty    !!    other
    3632887    10    other-other    !!    other
data.take(2)
res3: Array[String] = Array(prev_id    curr_id    n    prev_title    curr_title    type, "    3632887    121    other-google    !!    other")
  • The first line looks like a header
  • The second line (separated from the first by ",") contains a data record organized according to the header: prev_id is empty, curr_id = 3632887, n = 121, and so on.

Actually, here is the meaning of each column:

  • prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on

  • curr_id: the MediaWiki unique page ID of the article the client requested

  • prev_title: the result of mapping the referer URL to the fixed set of values described below

  • curr_title: the title of the article the client requested

  • n: the number of occurrences of the (referer, resource) pair

  • type

    • "link" if the referer and request are both articles and the referer links to the request
    • "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
    • "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their refer

Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:

  • an article in the main namespace of English Wikipedia -> the article title
  • any Wikipedia page that is not in the main namespace of English Wikipedia -> other-wikipedia
  • an empty referer -> other-empty
  • a page from any other Wikimedia project -> other-internal
  • Google -> other-google
  • Yahoo -> other-yahoo
  • Bing -> other-bing
  • Facebook -> other-facebook
  • Twitter -> other-twitter
  • anything else -> other-other

In the second line of the file above, we can see there were 121 clicks from Google to the Wikipedia page on "!!" (double exclamation marks). People search for everything!

  • prev_id = (nothing)
  • curr_id = 3632887 --> (Wikipedia page ID)
  • n = 121 (People clicked from Google to this page 121 times in this month.)
  • prev_title = other-google (This data record is for referrals from Google.)
  • curr_title = !! (This Wikipedia page is about a double exclamation mark.)
  • type = other
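As a quick sanity check against the raw lines (a sketch, relying on the tab-delimited layout described above), we can pull out exactly the records whose curr_title, the fifth tab-separated field, is "!!":

// Sanity check: raw lines whose curr_title (5th tab-separated field) is "!!"
data.filter { line =>
  val fields = line.split("\t")
  fields.length > 4 && fields(4) == "!!"
}.take(5).foreach(println)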

Create a DataFrame from this CSV

  • From the next Spark release, 2.0, CSV will be a built-in data source in Spark's standard distribution. But we are using Spark 1.6, so we load CSV through the spark-csv package (a Spark 2.x equivalent is sketched after the load below).
// Load the raw dataset stored as a CSV file
val clickstream = sqlContext.
    read.
    format("com.databricks.spark.csv").
    options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true")).
    load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
clickstream: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
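For reference, from Spark 2.0 onwards CSV is a built-in data source, so no external package is needed (a sketch, assuming the SparkSession spark provided by newer notebooks):

// Spark 2.0+ equivalent of the load above, using the built-in CSV reader
val clickstream2 = spark.read.
    option("header", "true").
    option("delimiter", "\t").
    option("mode", "PERMISSIVE").
    option("inferSchema", "true").
    csv("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")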
clickstream.printSchema
root
 |-- prev_id: integer (nullable = true)
 |-- curr_id: integer (nullable = true)
 |-- n: integer (nullable = true)
 |-- prev_title: string (nullable = true)
 |-- curr_title: string (nullable = true)
 |-- type: string (nullable = true)

Display some sample data

display(clickstream)
prev_id curr_id n prev_title curr_title type
null 3632887 121 other-google !! other
null 3632887 93 other-wikipedia !! other
null 3632887 46 other-empty !! other
null 3632887 10 other-other !! other
64486 3632887 11 !_(disambiguation) !! other
2061699 2556962 19 Louden_Up_Now !!!_(album) link
null 2556962 25 other-empty !!!_(album) other
null 2556962 16 other-google !!!_(album) other
null 2556962 44 other-wikipedia !!!_(album) other
64486 2556962 15 !_(disambiguation) !!!_(album) link
600744 2556962 297 !!! !!!_(album) link
null 6893310 11 other-empty !Hero_(album) other
1921683 6893310 26 !Hero !Hero_(album) link
null 6893310 16 other-wikipedia !Hero_(album) other
null 6893310 23 other-google !Hero_(album) other
8127304 22602473 16 Jericho_Rosales !Oka_Tokat link
35978874 22602473 20 List_of_telenovelas_of_ABS-CBN !Oka_Tokat link
null 22602473 57 other-google !Oka_Tokat other
null 22602473 12 other-wikipedia !Oka_Tokat other
null 22602473 23 other-empty !Oka_Tokat other
7360687 22602473 10 Rica_Peralejo !Oka_Tokat link
37104582 22602473 11 Jeepney_TV !Oka_Tokat link
34376590 22602473 22 Oka_Tokat_(2012_TV_series) !Oka_Tokat link
null 6810768 20 other-wikipedia !T.O.O.H.! other
null 6810768 81 other-google !T.O.O.H.! other
31976181 6810768 51 List_of_death_metal_bands,_!–K !T.O.O.H.! link
null 6810768 35 other-empty !T.O.O.H.! other
null 3243047 21 other-empty !_(album) other
1337475 3243047 208 The_Dismemberment_Plan !_(album) link
3284285 3243047 78 The_Dismemberment_Plan_Is_Terrified !_(album) link

Truncated to 30 rows

display(...) is a utility provided by Databricks. If you are programming directly in Spark, use the show(numRows: Int) method of DataFrame instead

clickstream.show(5)
+-------+-------+---+------------------+----------+-----+
|prev_id|curr_id|  n|        prev_title|curr_title| type|
+-------+-------+---+------------------+----------+-----+
|   null|3632887|121|      other-google|        !!|other|
|   null|3632887| 93|   other-wikipedia|        !!|other|
|   null|3632887| 46|       other-empty|        !!|other|
|   null|3632887| 10|       other-other|        !!|other|
|  64486|3632887| 11|!_(disambiguation)|        !!|other|
+-------+-------+---+------------------+----------+-----+
only showing top 5 rows

Reading from disk vs memory

The 1.2 GB Clickstream file is currently on S3, which means each time you scan through it, your Spark cluster has to read the 1.2 GB of data remotely over the network.

Call the count() action to check how many rows are in the DataFrame and to see how long it takes to read the DataFrame from S3.

clickstream.cache().count()
res7: Long = 22509897
  • It took a few minutes to read the 1.2 GB file into your Spark cluster. The file has about 22.5 million rows/lines.
  • Although we have called cache, the DataFrame is actually materialized (cached) only when an action, such as count, is called

Now call count again to see how much faster it is to read from memory

clickstream.count()
res8: Long = 22509897
  • Orders of magnitude faster!
  • If you are going to be using the same data source multiple times, it is better to cache it in memory
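For finer control than cache(), you can pick the storage level explicitly and release the memory when you are done (a sketch; apply persist to a DataFrame that has not already been cached, since the storage level cannot be changed afterwards):

import org.apache.spark.storage.StorageLevel
clickstream.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit in memory to local disk
clickstream.count()     // the first action materializes the cache
// ... run your queries against the cached data ...
clickstream.unpersist() // free executor memory when finished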

What are the top 10 articles requested?

To do this we group by curr_title, sum the column n, and order by that sum in descending order.

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
curr_title sum(n)
Main_Page 127500620
87th_Academy_Awards 2559794
Fifty_Shades_of_Grey 2326175
Alive 2244781
Chris_Kyle 1709341
Fifty_Shades_of_Grey_(film) 1683892
Deaths_in_2015 1614577
Birdman_(film) 1545842
Islamic_State_of_Iraq_and_the_Levant 1406530
Stephen_Hawking 1384193
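An equivalent formulation (a sketch) names the aggregate explicitly with agg and an alias, so we don't depend on the auto-generated column name sum(n):

import org.apache.spark.sql.functions.{sum, desc}
display(clickstream
  .groupBy("curr_title")
  .agg(sum("n").as("total"))
  .orderBy(desc("total"))
  .limit(10))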

Who sent the most traffic to Wikipedia in Feb 2015?

In other words, who were the top referers to Wikipedia?

display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
prev_title sum(n)
other-google 1496209976
other-empty 347693595
other-wikipedia 129772279
other-other 77569671
other-bing 65962792
other-yahoo 48501171
Main_Page 29923502
other-twitter 19241298
other-facebook 2314026
87th_Academy_Awards 1680675

As expected, the top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). The third-largest sender of traffic to English Wikipedia is Wikipedia pages that are not in the main namespace (ns = 0) of English Wikipedia. Learn about the Wikipedia namespaces here: https://en.wikipedia.org/wiki/Wikipedia:Project_namespace

Also, note that Twitter sends 10x more requests to Wikipedia than Facebook. Which articles did Twitter send the most traffic to? The next query finds the top five.

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))
curr_title sum(n)
Johnny_Knoxville 198908
Peter_Woodcock 126259
2002_Tampa_plane_crash 119906
Sơn_Đoòng_Cave 116012
The_boy_Jones 114401

What percentage of page visits in Wikipedia are from other pages in Wikipedia itself?

val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referals = clickstream.
                filter(clickstream("prev_id").isNotNull).
                selectExpr("sum(n)").first.getLong(0)
(referals * 100.0) / allClicks
allClicks: Long = 3283067885
referals: Long = 1095462001
res12: Double = 33.36702253416853
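The two sums above launch two separate jobs. A conditional aggregate computes the same percentage in a single pass over the (cached) data (a sketch):

import org.apache.spark.sql.functions.{col, sum, when}
// sum(when(...)) ignores the nulls produced for rows that fail the condition
display(clickstream.agg(
  (sum(when(col("prev_id").isNotNull, col("n"))) * 100.0 / sum(col("n"))).as("referral_pct")))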

Register the DataFrame to perform more complex queries

clickstream.createOrReplaceTempView("clicks")
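Once registered, the view is addressable from SQL cells in any language, or from Scala via sql(...) as this notebook does below (a quick sketch):

// Quick check that the view is queryable
sql("SELECT COUNT(*) AS cnt FROM clicks").show()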

Which Wikipedia pages have the most referrals to the Donald Trump page?

SELECT *
FROM clicks
WHERE 
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20
prev_id curr_id n prev_title curr_title type
1861441 4848272 4658 Ivanka_Trump Donald_Trump link
4848272 4848272 2212 Donald_Trump Donald_Trump link
1209075 4848272 1855 Melania_Trump Donald_Trump link
1057887 4848272 1760 Ivana_Trump Donald_Trump link
5679119 4848272 1074 Donald_Trump_Jr. Donald_Trump link
21377251 4848272 918 United_States_presidential_election,_2016 Donald_Trump link
8095589 4848272 728 Eric_Trump Donald_Trump link
473806 4848272 652 Marla_Maples Donald_Trump link
2565136 4848272 651 The_Trump_Organization Donald_Trump link
9917693 4848272 599 The_Celebrity_Apprentice Donald_Trump link
9289480 4848272 597 The_Apprentice_(U.S._TV_series) Donald_Trump link
290327 4848272 596 German_American Donald_Trump link
12643497 4848272 585 Comedy_Central_Roast Donald_Trump link
37643999 4848272 549 Republican_Party_presidential_candidates,_2016 Donald_Trump link
417559 4848272 543 Alan_Sugar Donald_Trump link
1203316 4848272 489 Fred_Trump Donald_Trump link
303951 4848272 426 Vince_McMahon Donald_Trump link
6191053 4848272 413 Jared_Kushner Donald_Trump link
1295216 4848272 412 Trump_Tower_(New_York_City) Donald_Trump link
6509278 4848272 402 Trump Donald_Trump link

Top referrers to all presidential candidate pages

-- FIXME (broken query, will get back to it later)
SELECT *
FROM clicks
WHERE 
  prev_id IS NOT NULL
ORDER BY n DESC
LIMIT 20
prev_id curr_id n prev_title curr_title type
15580374 44789934 769616 Main_Page Deaths_in_2015 link
35166850 40218034 368694 Fifty_Shades_of_Grey Fifty_Shades_of_Grey_(film) link
40218034 7000810 284352 Fifty_Shades_of_Grey_(film) Dakota_Johnson link
35793706 37371793 253460 Arrow_(TV_series) List_of_Arrow_episodes link
35166850 43180929 249155 Fifty_Shades_of_Grey Fifty_Shades_Darker link
40218034 6138391 228742 Fifty_Shades_of_Grey_(film) Jamie_Dornan link
43180929 35910161 220788 Fifty_Shades_Darker Fifty_Shades_Freed link
27676616 40265175 192321 The_Walking_Dead_(TV_series) The_Walking_Dead_(season_5) link
6138391 1076962 185700 Jamie_Dornan Amelia_Warner link
19376148 44375105 185449 Stephen_Hawking Jane_Wilde_Hawking link
27676616 28074027 161407 The_Walking_Dead_(TV_series) List_of_The_Walking_Dead_episodes link
34149123 41844524 161081 List_of_The_Flash_episodes The_Flash_(2014_TV_series) other
11269605 13542396 156313 The_Big_Bang_Theory List_of_The_Big_Bang_Theory_episodes link
39462431 34271398 152892 American_Sniper_(film) Chris_Kyle link
15580374 1738148 148820 Main_Page Limpet other
15580374 45298077 140335 Main_Page TransAsia_Airways_Flight_235 other
7000810 484101 139682 Dakota_Johnson Melanie_Griffith link
45119310 42567340 138179 Take_Me_to_Church Take_Me_to_Church_(Hozier_song) link
38962787 41126542 136236 The_Blacklist_(TV_series) List_of_The_Blacklist_episodes link
32262767 45305174 135900 Better_Call_Saul Uno_(Better_Call_Saul) link
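A plausible repair of the broken query above (a sketch): restrict curr_title to the candidate pages and aggregate per (referrer, candidate) pair, mirroring the WHERE clause used for the d3 visualization below.

display(sql("""
  SELECT prev_title, curr_title, SUM(n) AS total
  FROM clicks
  WHERE
    curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
    prev_id IS NOT NULL AND prev_title != 'Main_Page'
  GROUP BY prev_title, curr_title
  ORDER BY total DESC
  LIMIT 20"""))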

Load a visualization library

This code was copied after doing a live Google search (by Michael Armbrust at Spark Summit East, February 2016, shared from https://twitter.com/michaelarmbrust/status/699969850475737088). The d3ivan package is an updated version of the original package used by Michael Armbrust; it needed some TLC to run on Spark 2.2 on the newer Databricks notebook. These changes were kindly made by Ivan Sadikov from Middle Earth.

Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
d3ivan.graphs.help()

Produces a force-directed graph given a collection of edges of the following form:
case class Edge(src: String, dest: String, count: Long)

Usage:
import d3._
graphs.force(
  height = 500,
  width = 500,
  clicks: Dataset[Edge])

d3ivan.graphs.force(
  height = 800,
  width = 1000,
  clicks = sql("""
    SELECT 
      prev_title AS src,
      curr_title AS dest,
      n AS count FROM clicks
    WHERE 
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])

Convert raw data to parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It is a more efficient way to store data frames.

// Convert the DataFrame to a more efficient format to speed up our analysis
clickstream.
  write.
  mode(SaveMode.Overwrite).
  parquet("/datasets/wiki-clickstream") // warnings are harmless

Load parquet file efficiently and quickly into a DataFrame

Now we can simply load from this parquet file next time instead of creating the RDD from the text file (much slower).

Also, using Parquet files to store DataFrames allows us to move between languages quickly in a scalable manner.

val clicks = sqlContext.read.parquet("/datasets/wiki-clickstream")
clicks: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
clicks.printSchema
root
 |-- prev_id: integer (nullable = true)
 |-- curr_id: integer (nullable = true)
 |-- n: integer (nullable = true)
 |-- prev_title: string (nullable = true)
 |-- curr_title: string (nullable = true)
 |-- type: string (nullable = true)
display(clicks)  // let's display this DataFrame
prev_id curr_id n prev_title curr_title type
7009881 164003 21 Mayall John_Mayall link
476786 164003 86 Mick_Taylor John_Mayall link
19735547 164003 10 Peter_Green_discography John_Mayall link
244136 164003 10 Macclesfield John_Mayall link
33105755 164003 13 The_Yardbirds John_Mayall link
8910430 164003 34 The_Turning_Point_(John_Mayall_album) John_Mayall link
329878 164003 10 Steve_Marriott John_Mayall link
null 164003 652 other-empty John_Mayall other
null 147396 134 other-bing John_Mayall_&_the_Bluesbreakers other
17865484 147396 13 Timeline_of_heavy_metal_and_hard_rock_music John_Mayall_&_the_Bluesbreakers other
15580374 147396 94 Main_Page John_Mayall_&_the_Bluesbreakers other
168254 147396 23 Paul_Butterfield John_Mayall_&_the_Bluesbreakers link
322138 147396 283 Peter_Green_(musician) John_Mayall_&_the_Bluesbreakers link
null 147396 79 other-other John_Mayall_&_the_Bluesbreakers other
12154926 147396 13 Marshall_Bluesbreaker John_Mayall_&_the_Bluesbreakers link
223910 147396 12 Robben_Ford John_Mayall_&_the_Bluesbreakers other
14433637 147396 10 Parchman_Farm_(song) John_Mayall_&_the_Bluesbreakers link
476786 147396 213 Mick_Taylor John_Mayall_&_the_Bluesbreakers link
18952282 147396 13 Ric_Grech John_Mayall_&_the_Bluesbreakers other
4113741 147396 50 Rolling_Stone's_500_Greatest_Albums_of_All_Time John_Mayall_&_the_Bluesbreakers link
36668 147396 64 Mick_Fleetwood John_Mayall_&_the_Bluesbreakers link
null 147396 328 other-empty John_Mayall_&_the_Bluesbreakers other
166705 147396 10 Thin_Lizzy John_Mayall_&_the_Bluesbreakers link
33105755 147396 115 The_Yardbirds John_Mayall_&_the_Bluesbreakers link
6071392 147396 45 Walter_Trout John_Mayall_&_the_Bluesbreakers other
null 147396 269 other-wikipedia John_Mayall_&_the_Bluesbreakers other
null 147396 21 other-twitter John_Mayall_&_the_Bluesbreakers other
null 147396 1632 other-google John_Mayall_&_the_Bluesbreakers other
null 147396 84 other-yahoo John_Mayall_&_the_Bluesbreakers other
2771975 147396 17 70th_Birthday_Concert John_Mayall_&_the_Bluesbreakers link

Truncated to 30 rows
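Because Parquet is columnar, a query that references only a few columns reads just those column chunks, and simple filters can be pushed down into the scan (a sketch):

// Only the referenced columns (curr_title, prev_title, n) are read from disk,
// and the equality filter can be pushed down to the Parquet scan
clicks.filter($"curr_title" === "John_Mayall").select("prev_title", "n").show(5)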

DataFrame in python
clicksPy = sqlContext.read.parquet("/datasets/wiki-clickstream")
# in Python you need to put the object on its own line like this to get the type information
clicksPy
Out[2]: DataFrame[prev_id: int, curr_id: int, n: int, prev_title: string, curr_title: string, type: string]
clicksPy.show()
+--------+-------+---+--------------------+--------------------+-----+
| prev_id|curr_id|  n|          prev_title|          curr_title| type|
+--------+-------+---+--------------------+--------------------+-----+
| 7009881| 164003| 21|              Mayall|         John_Mayall| link|
|  476786| 164003| 86|         Mick_Taylor|         John_Mayall| link|
|19735547| 164003| 10|Peter_Green_disco...|         John_Mayall| link|
|  244136| 164003| 10|        Macclesfield|         John_Mayall| link|
|33105755| 164003| 13|       The_Yardbirds|         John_Mayall| link|
| 8910430| 164003| 34|The_Turning_Point...|         John_Mayall| link|
|  329878| 164003| 10|      Steve_Marriott|         John_Mayall| link|
|    null| 164003|652|         other-empty|         John_Mayall|other|
|    null| 147396|134|          other-bing|John_Mayall_&_the...|other|
|17865484| 147396| 13|Timeline_of_heavy...|John_Mayall_&_the...|other|
|15580374| 147396| 94|           Main_Page|John_Mayall_&_the...|other|
|  168254| 147396| 23|    Paul_Butterfield|John_Mayall_&_the...| link|
|  322138| 147396|283|Peter_Green_(musi...|John_Mayall_&_the...| link|
|    null| 147396| 79|         other-other|John_Mayall_&_the...|other|
|12154926| 147396| 13|Marshall_Bluesbre...|John_Mayall_&_the...| link|
|  223910| 147396| 12|         Robben_Ford|John_Mayall_&_the...|other|
|14433637| 147396| 10|Parchman_Farm_(song)|John_Mayall_&_the...| link|
|  476786| 147396|213|         Mick_Taylor|John_Mayall_&_the...| link|
|18952282| 147396| 13|           Ric_Grech|John_Mayall_&_the...|other|
| 4113741| 147396| 50|Rolling_Stone's_5...|John_Mayall_&_the...| link|
+--------+-------+---+--------------------+--------------------+-----+
only showing top 20 rows

Now you can continue from the original python notebook tweeted by Michael.

Recall from the beginning of this notebook that this python databricks notebook was used in the talk by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088

(watch now, if you haven't already!)

(Image: Michael Armbrust at Spark Summit East)

You Try!

Try to load a DataFrame in R from the Parquet file just as we did for Python. Read the docs in the Databricks guide first:

And see the R example in the Programming Guide:

library(SparkR)

# just a quick test
df <- createDataFrame(faithful)
head(df)
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
clicksR <- read.df("/datasets/wiki-clickstream", source = "parquet")
clicksR # in R you need to put the object on its own line like this to get the type information
head(clicksR)
display(clicksR)
prev_id curr_id n prev_title curr_title type
7009881 164003 21 Mayall John_Mayall link
476786 164003 86 Mick_Taylor John_Mayall link
19735547 164003 10 Peter_Green_discography John_Mayall link
244136 164003 10 Macclesfield John_Mayall link
33105755 164003 13 The_Yardbirds John_Mayall link
8910430 164003 34 The_Turning_Point_(John_Mayall_album) John_Mayall link
329878 164003 10 Steve_Marriott John_Mayall link
null 164003 652 other-empty John_Mayall other
null 147396 134 other-bing John_Mayall_&_the_Bluesbreakers other
17865484 147396 13 Timeline_of_heavy_metal_and_hard_rock_music John_Mayall_&_the_Bluesbreakers other
15580374 147396 94 Main_Page John_Mayall_&_the_Bluesbreakers other
168254 147396 23 Paul_Butterfield John_Mayall_&_the_Bluesbreakers link
322138 147396 283 Peter_Green_(musician) John_Mayall_&_the_Bluesbreakers link
null 147396 79 other-other John_Mayall_&_the_Bluesbreakers other
12154926 147396 13 Marshall_Bluesbreaker John_Mayall_&_the_Bluesbreakers link
223910 147396 12 Robben_Ford John_Mayall_&_the_Bluesbreakers other
14433637 147396 10 Parchman_Farm_(song) John_Mayall_&_the_Bluesbreakers link
476786 147396 213 Mick_Taylor John_Mayall_&_the_Bluesbreakers link
18952282 147396 13 Ric_Grech John_Mayall_&_the_Bluesbreakers other
4113741 147396 50 Rolling_Stone's_500_Greatest_Albums_of_All_Time John_Mayall_&_the_Bluesbreakers link
36668 147396 64 Mick_Fleetwood John_Mayall_&_the_Bluesbreakers link
null 147396 328 other-empty John_Mayall_&_the_Bluesbreakers other
166705 147396 10 Thin_Lizzy John_Mayall_&_the_Bluesbreakers link
33105755 147396 115 The_Yardbirds John_Mayall_&_the_Bluesbreakers link
6071392 147396 45 Walter_Trout John_Mayall_&_the_Bluesbreakers other
null 147396 269 other-wikipedia John_Mayall_&_the_Bluesbreakers other
null 147396 21 other-twitter John_Mayall_&_the_Bluesbreakers other
null 147396 1632 other-google John_Mayall_&_the_Bluesbreakers other
null 147396 84 other-yahoo John_Mayall_&_the_Bluesbreakers other
2771975 147396 17 70th_Birthday_Concert John_Mayall_&_the_Bluesbreakers link

Truncated to 30 rows