SDS-2.2, Scalable Data Science

Million Song Dataset - Kaggle Challenge

Predict which songs a user will listen to.

SOURCE: This is just a Scala-rification of the Python notebook published in databricks community edition in 2016.

Archived YouTube video of this live unedited lab-lecture:

Stage 2: Exploring songs data

Explore

This is the second notebook in this tutorial. In this notebook we do what any data scientist does with their data right after parsing it: exploring and understanding different aspect of data. Make sure you understand how we get the songsTable by reading and running the ETL notebook. In the ETL notebook we created and cached a temporary table named songsTable

Let's Do all the main bits in Stage 1 now before doing Stage 2 in this Notebook.

// Let's quickly do everything to register the tempView of the table here

// fill in comment ... EXERCISE!
case class Song(artist_id: String, artist_latitude: Double, artist_longitude: Double, artist_location: String, artist_name: String, duration: Double, end_of_fade_in: Double, key: Int, key_confidence: Double, loudness: Double, release: String, song_hotness: Double, song_id: String, start_of_fade_out: Double, tempo: Double, time_signature: Double, time_signature_confidence: Double, title: String, year: Double, partial_sequence: Int)

def parseLine(line: String): Song = {
  // fill in comment ...

  def toDouble(value: String, defaultVal: Double): Double = {
    try {
       value.toDouble
    } catch {
      case e: Exception => defaultVal
    }
  }

  def toInt(value: String, defaultVal: Int): Int = {
    try {
       value.toInt
      } catch {
      case e: Exception => defaultVal
    }
  }
  // fill in comment ...
  val tokens = line.split("\t")
  Song(tokens(0), toDouble(tokens(1), 0.0), toDouble(tokens(2), 0.0), tokens(3), tokens(4), toDouble(tokens(5), 0.0), toDouble(tokens(6), 0.0), toInt(tokens(7), -1), toDouble(tokens(8), 0.0), toDouble(tokens(9), 0.0), tokens(10), toDouble(tokens(11), 0.0), tokens(12), toDouble(tokens(13), 0.0), toDouble(tokens(14), 0.0), toDouble(tokens(15), 0.0), toDouble(tokens(16), 0.0), tokens(17), toDouble(tokens(18), 0.0), toInt(tokens(19), -1))
}

// this is loads all the data - a subset of the 1M songs dataset
val dataRDD = sc.textFile("/databricks-datasets/songs/data-001/part-*") 

// .. fill in comment
val df = dataRDD.map(parseLine).toDF

// .. fill in comment
df.createOrReplaceTempView("songsTable")

defined class Song
parseLine: (line: String)Song
dataRDD: org.apache.spark.rdd.RDD[String] = /databricks-datasets/songs/data-001/part-* MapPartitionsRDD[14036] at textFile at <console>:63
df: org.apache.spark.sql.DataFrame = [artist_id: string, artist_latitude: double ... 18 more fields]

spark.catalog.listTables.show(false) // make sure the temp view of our table is there

+--------------------------+--------+-----------+---------+-----------+
|name                      |database|description|tableType|isTemporary|
+--------------------------+--------+-----------+---------+-----------+
|cities_csv                |default |null       |EXTERNAL |false      |
|cleaned_taxes             |default |null       |MANAGED  |false      |
|commdettrumpclintonretweet|default |null       |MANAGED  |false      |
|donaldtrumptweets         |default |null       |EXTERNAL |false      |
|linkage                   |default |null       |EXTERNAL |false      |
|nations                   |default |null       |EXTERNAL |false      |
|newmplist                 |default |null       |EXTERNAL |false      |
|ny_baby_names             |default |null       |MANAGED  |false      |
|nzmpsandparty             |default |null       |EXTERNAL |false      |
|pos_neg_category          |default |null       |EXTERNAL |false      |
|rna                       |default |null       |MANAGED  |false      |
|samh                      |default |null       |EXTERNAL |false      |
|table1                    |default |null       |EXTERNAL |false      |
|test_table                |default |null       |EXTERNAL |false      |
|uscites                   |default |null       |EXTERNAL |false      |
|songstable                |null    |null       |TEMPORARY|true       |
+--------------------------+--------+-----------+---------+-----------+

A first inspection

A first step to any data exploration is viewing sample data. For this purpose we can use a simple SQL query that returns first 10 rows.

select * from songsTable limit 10

artist_id	artist_latitude	artist_longitude	artist_location	artist_name	duration	end_of_fade_in	key	key_confidence	loudness	release	song_hotness
AR81V6H1187FB48872	0.0	0.0		Earl Sixteen	213.7073	0.0	11.0	0.419	-12.106	Soldier of Jah Army	0.0
ARVVZQP11E2835DBCB	0.0	0.0		Wavves	133.25016	0.0	0.0	0.282	0.596	Wavvves	0.471578247701
ARFG9M11187FB3BBCB	0.0	0.0	Nashua USA	C-Side	247.32689	0.0	9.0	0.612	-4.896	Santa Festival Compilation 2008 vol.1	0.0
ARK4Z2O1187FB45FF0	0.0	0.0		Harvest	337.05751	0.247	4.0	0.46	-9.092	Underground Community	0.0
AR4VQSG1187FB57E18	35.25082	-91.74015	Searcy, AR	Gossip	430.23628	0.0	2.0	3.4e-2	-6.846	Yr Mangled Heart	0.0
ARNBV1X1187B996249	0.0	0.0		Alex	186.80118	0.0	4.0	0.641	-16.108	Jolgaledin	0.0
ARXOEZX1187B9B82A1	0.0	0.0		Elie Attieh	361.89995	0.0	7.0	0.863	-4.919	ELITE	0.0
ARXPUIA1187B9A32F1	0.0	0.0	Rome, Italy	Simone Cristicchi	220.00281	2.119	4.0	0.486	-6.52	Dall'Altra Parte Del Cancello	0.484225272411
ARNPPTH1187B9AD429	51.4855	-0.37196	Heston, Middlesex, England	Jimmy Page	156.86485	0.334	7.0	0.493	-9.962	No Introduction Necessary [Deluxe Edition]	0.0
AROGWRA122988FEE45	0.0	0.0		Christos Dantis	256.67873	2.537	9.0	0.742	-13.404	Daktilika Apotipomata	0.0

Truncated to 12 cols

table("songsTable").printSchema()

root
 |-- artist_id: string (nullable = true)
 |-- artist_latitude: double (nullable = false)
 |-- artist_longitude: double (nullable = false)
 |-- artist_location: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- duration: double (nullable = false)
 |-- end_of_fade_in: double (nullable = false)
 |-- key: integer (nullable = false)
 |-- key_confidence: double (nullable = false)
 |-- loudness: double (nullable = false)
 |-- release: string (nullable = true)
 |-- song_hotness: double (nullable = false)
 |-- song_id: string (nullable = true)
 |-- start_of_fade_out: double (nullable = false)
 |-- tempo: double (nullable = false)
 |-- time_signature: double (nullable = false)
 |-- time_signature_confidence: double (nullable = false)
 |-- title: string (nullable = true)
 |-- year: double (nullable = false)
 |-- partial_sequence: integer (nullable = false)

select count(*) from songsTable

count(1)
31369.0

table("songsTable").count() // or equivalently with DataFrame API - recall table("songsTable") is a DataFrame

res4: Long = 31369

display(sqlContext.sql("SELECT duration, year FROM songsTable")) // Aggregation is set to 'Average' in 'Plot Options'

duration	year
213.7073	2003.0
133.25016	2009.0
247.32689	0.0
337.05751	0.0
430.23628	2006.0
186.80118	0.0
361.89995	0.0
220.00281	2007.0
156.86485	2004.0
256.67873	0.0
204.64281	0.0
112.48281	0.0
170.39628	0.0
215.95383	0.0
303.62077	0.0
266.60526	0.0
326.19057	1997.0
51.04281	2009.0
129.4624	0.0
253.33506	2003.0
237.76608	2004.0
132.96281	0.0
399.35955	2006.0
168.75057	1991.0
396.042	0.0
192.10404	1968.0
12.2771	2006.0
367.56853	0.0
189.93587	0.0
233.50812	0.0

Truncated to 30 rows

Exercises

Why do you think average song durations increase dramatically in 70's?
Add error bars with standard deviation around each average point in the plot.
How did average loudness change over time?
How did tempo change over time?
What other aspects of songs can you explore with this technique?

Sampling and visualizing

Another technique for visually exploring large data, which we are going to try, is sampling data.

First step is generating a sample.
With sampled data we can produce a scatter plot as follows.

# let's use ggplot from python
# note that this is second natural way to 'babble' between languages - using the right tool for the job!
#   recall: the first naive but solid way was to use parquet files to write and read from different languages 
#           with parquet files you can tackle the babbling problem when the table is too large to be 'Viewed'
from ggplot import *
sampled = sqlContext.sql("select year, duration from songsTable where year > 1930 and year < 2012")\
  .sample(withReplacement = False, fraction = 0.1).toPandas()

p = ggplot(sampled, aes(x = 'year', y = 'duration')) + ylim(0, 800) + \
  geom_smooth(size=3, span=0.3) + geom_point(aes(color = 'blue', size = 4))
display(p)

Exercises

Add jitter to year value in the plot above.
Plot sampled points for other parameters in the data.

Next step is clustering the data. Click on the next notebook (Model) to follow the tutorial.