SDS-2.2, Scalable Data Science

Archived YouTube video of this live unedited lab-lecture:

Archived YouTube video of this live unedited lab-lecture

Power Plant ML Pipeline Application - DataFrame Part

This is the Spark SQL parts of an end-to-end example of using a number of different machine learning algorithms to solve a supervised regression problem.

This is a break-down of Power Plant ML Pipeline Application from databricks.

This will be a recurring example in the sequel

Table of Contents
  • Step 1: Business Understanding
  • Step 2: Load Your Data
  • Step 3: Explore Your Data
  • Step 4: Visualize Your Data
  • Step 5: Data Preparation
  • Step 6: Data Modeling
  • Step 7: Tuning and Evaluation
  • Step 8: Deployment

We are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant. Power generation is a complex process, and understanding and predicting power output is an important element in managing a plant and its connection to the power grid.

  • Given this business problem, we need to translate it to a Machine Learning task (actually a Statistical Machine Learning task).
  • The ML task here is regression since the label (or target) we will be trying to predict takes a continuous numeric value
    • Note: if the labels took values from a finite discrete set, such as, Spam/Not-Spam or Good/Bad/Ugly, then the ML task would be classification.

Today, we will only cover Steps 1, 2, 3 and 4 above. You need introductions to linear algebra, stochastic gradient descent and decision trees before we can accomplish the applied ML task with some intuitive understanding. If you can't wait for ML then check out Spark MLLib Programming Guide for comming attractions!

The example data is provided by UCI at UCI Machine Learning Repository Combined Cycle Power Plant Data Set

You can read the background on the UCI page, but in summary:

  • we have collected a number of readings from sensors at a Gas Fired Power Plant (also called a Peaker Plant) and
  • want to use those sensor readings to predict how much power the plant will generate in a couple weeks from now.
  • Again, today we will just focus on Steps 1-4 above that pertain to DataFrames.

More information about Peaker or Peaking Power Plants can be found on Wikipedia https://en.wikipedia.org/wiki/Peakingpowerplant.

// a good habit to ensure the code is being run on the appropriate version of Spark - we are using Spark 2.2 actually if we use SparkSession object spark down the road...
require(sc.version.replace(".", "").toInt >= 140, "Spark 1.4.0+ is required to run this notebook. Please attach it to a Spark 1.4.0+ cluster.")

Step 1: Business Understanding

The first step in any machine learning task is to understand the business need.

As described in the overview we are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant.

The problem is a regression problem since the label (or target) we are trying to predict is numeric

Step 2: Load Your Data

Now that we understand what we are trying to do, we need to load our data and describe it, explore it and verify it.

Data was downloaded already as these five Tab-separated-variable or tsv files.

display(dbutils.fs.ls("/databricks-datasets/power-plant/data")) // Ctrl+Enter
path name size
dbfs:/databricks-datasets/power-plant/data/Sheet1.tsv Sheet1.tsv 308693.0
dbfs:/databricks-datasets/power-plant/data/Sheet2.tsv Sheet2.tsv 308693.0
dbfs:/databricks-datasets/power-plant/data/Sheet3.tsv Sheet3.tsv 308693.0
dbfs:/databricks-datasets/power-plant/data/Sheet4.tsv Sheet4.tsv 308693.0
dbfs:/databricks-datasets/power-plant/data/Sheet5.tsv Sheet5.tsv 308693.0

Now let us load the data from the Tab-separated-variable or tsv text file into an RDD[String] using the familiar textFile method.

val powerPlantRDD = sc.textFile("/databricks-datasets/power-plant/data/Sheet1.tsv") // Ctrl+Enter
powerPlantRDD: org.apache.spark.rdd.RDD[String] = /databricks-datasets/power-plant/data/Sheet1.tsv MapPartitionsRDD[363800] at textFile at <console>:34
powerPlantRDD.take(5).foreach(println) // Ctrl+Enter to print first 5 lines
AT    V    AP    RH    PE
14.96    41.76    1024.07    73.17    463.26
25.18    62.96    1020.04    59.08    444.37
5.11    39.4    1012.16    92.14    488.56
20.86    57.32    1010.24    76.64    446.48
// let us make sure we are using Spark version greater than 2.2 - we need a version closer to 2.0 if we want to use SparkSession and SQLContext 
require(sc.version.replace(".", "").toInt >= 220, "Spark 2.2.0+ is required to run this notebook. Please attach it to a Spark 2.2.0+ cluster.")
// this reads the tsv file and turns it into a dataframe
val powerPlantDF = spark.read // use 'sqlContext.read' instead if you want to use older Spark version > 1.3  see 008_ notebook
    .format("csv") // use spark.csv package
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\t") // Specify the delimiter as Tab or '\t'
    .load("/databricks-datasets/power-plant/data/Sheet1.tsv")
powerPlantDF: org.apache.spark.sql.DataFrame = [AT: double, V: double ... 3 more fields]
powerPlantDF.printSchema // print the schema of the DataFrame that was inferred
root
 |-- AT: double (nullable = true)
 |-- V: double (nullable = true)
 |-- AP: double (nullable = true)
 |-- RH: double (nullable = true)
 |-- PE: double (nullable = true)
powerPlantDF.count
res8: Long = 9568

2.1. Alternatively, load data via the upload GUI feature in databricks

USE THIS FOR OTHER SMALLish DataSets you want to import to your CE

Since the dataset is relatively small, we will use the upload feature in Databricks to upload the data as a table.

First download the Data Folder from UCI Machine Learning Repository Combined Cycle Power Plant Data Set

The file is a multi-tab Excel document so you will need to save each tab as a Text file export.

I prefer exporting as a Tab-Separated-Values (TSV) since it is more consistent than CSV.

Call each file Folds5x2_pp<Sheet 1..5>.tsv and save to your machine.

Go to the Databricks Menu > Tables > Create Table

Select Datasource as "File"

Upload ALL 5 files at once.

See screenshots below (but refer https://docs.databricks.com/user-guide/importing-data.html for latest methods to import data):

2.1.1. Create Table _________________

When you import your data, name your table power_plant, specify all of the columns with the datatype Double and make sure you check the First row is header box.

alt text

2.1.2. Review Schema __________________

Your table schema and preview should look like this after you click Create Table:

alt text

Now that your data is loaded let's explore it.

Step 3: Explore Your Data

Now that we understand what we are trying to do, we need to load our data and describe it, explore it and verify it.

Viewing the table as text

By uisng .show method we can see some of the contents of the table in plain text.

This works in pure Apache Spark, say in Spark-Shell without any notebook layer on top of Spark like databricks, zeppelin or jupyter.

It is a good idea to use this method when possible.

powerPlantDF.show(10) // try putting 1000 here instead of 10
+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
|26.27|59.44|1012.23|58.77|443.67|
|15.89|43.96|1014.02|75.24|467.35|
| 9.48|44.71|1019.12|66.43|478.42|
|14.64| 45.0|1021.78|41.25|475.98|
|11.74|43.56|1015.14|70.72| 477.5|
+-----+-----+-------+-----+------+
only showing top 10 rows

Viewing as DataFrame

This is almost necessary for a data scientist to gain visual insights into all pair-wise relationships between the several (3 to 6 or so) variables in question.

display(powerPlantDF)
AT V AP RH PE
14.96 41.76 1024.07 73.17 463.26
25.18 62.96 1020.04 59.08 444.37
5.11 39.4 1012.16 92.14 488.56
20.86 57.32 1010.24 76.64 446.48
10.82 37.5 1009.23 96.62 473.9
26.27 59.44 1012.23 58.77 443.67
15.89 43.96 1014.02 75.24 467.35
9.48 44.71 1019.12 66.43 478.42
14.64 45.0 1021.78 41.25 475.98
11.74 43.56 1015.14 70.72 477.5
17.99 43.72 1008.64 75.04 453.02
20.14 46.93 1014.66 64.22 453.99
24.34 73.5 1011.31 84.15 440.29
25.71 58.59 1012.77 61.83 451.28
26.19 69.34 1009.48 87.59 433.99
21.42 43.79 1015.76 43.08 462.19
18.21 45.0 1022.86 48.84 467.54
11.04 41.74 1022.6 77.51 477.2
14.45 52.75 1023.97 63.59 459.85
13.97 38.47 1015.15 55.28 464.3
17.76 42.42 1009.09 66.26 468.27
5.41 40.07 1019.16 64.77 495.24
7.76 42.28 1008.52 83.31 483.8
27.23 63.9 1014.3 47.19 443.61
27.36 48.6 1003.18 54.93 436.06
27.47 70.72 1009.97 74.62 443.25
14.6 39.31 1011.11 72.52 464.16
7.91 39.96 1023.57 88.44 475.52
5.81 35.79 1012.14 92.28 484.41
30.53 65.18 1012.69 41.85 437.89

Truncated to 30 rows

powerPlantDF.count() // count the number of rows in DF
res14: Long = 9568

Viewing as Table via SQL

Let us look at what tables are already available, as follows:

sqlContext.tables.show() // Ctrl+Enter to see available tables
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|          cities_csv|      false|
| default|       cleaned_taxes|      false|
| default|commdettrumpclint...|      false|
| default|   donaldtrumptweets|      false|
| default|             linkage|      false|
| default|             nations|      false|
| default|           newmplist|      false|
| default|       ny_baby_names|      false|
| default|       nzmpsandparty|      false|
| default|    pos_neg_category|      false|
| default|                 rna|      false|
| default|                samh|      false|
| default|              table1|      false|
| default|          test_table|      false|
| default|             uscites|      false|
+--------+--------------------+-----------+

We can also access the list of tables and databases using spark.catalog methods as explained here:

spark.catalog.listTables.show(false)
+--------------------------+--------+-----------+---------+-----------+
|name                      |database|description|tableType|isTemporary|
+--------------------------+--------+-----------+---------+-----------+
|cities_csv                |default |null       |EXTERNAL |false      |
|cleaned_taxes             |default |null       |MANAGED  |false      |
|commdettrumpclintonretweet|default |null       |MANAGED  |false      |
|donaldtrumptweets         |default |null       |EXTERNAL |false      |
|linkage                   |default |null       |EXTERNAL |false      |
|nations                   |default |null       |EXTERNAL |false      |
|newmplist                 |default |null       |EXTERNAL |false      |
|ny_baby_names             |default |null       |MANAGED  |false      |
|nzmpsandparty             |default |null       |EXTERNAL |false      |
|pos_neg_category          |default |null       |EXTERNAL |false      |
|rna                       |default |null       |MANAGED  |false      |
|samh                      |default |null       |EXTERNAL |false      |
|table1                    |default |null       |EXTERNAL |false      |
|test_table                |default |null       |EXTERNAL |false      |
|uscites                   |default |null       |EXTERNAL |false      |
+--------------------------+--------+-----------+---------+-----------+
spark.catalog.listDatabases.show(false)
+-------+---------------------+-------------------------+
|name   |description          |locationUri              |
+-------+---------------------+-------------------------+
|default|Default Hive database|dbfs:/user/hive/warehouse|
+-------+---------------------+-------------------------+

We need to create a temporary view of the DataFrame as a table before being able to access it via SQL.

powerPlantDF.createOrReplaceTempView("power_plant_table") // Shift+Enter
sqlContext.tables.show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|          cities_csv|      false|
| default|       cleaned_taxes|      false|
| default|commdettrumpclint...|      false|
| default|   donaldtrumptweets|      false|
| default|             linkage|      false|
| default|             nations|      false|
| default|           newmplist|      false|
| default|       ny_baby_names|      false|
| default|       nzmpsandparty|      false|
| default|    pos_neg_category|      false|
| default|                 rna|      false|
| default|                samh|      false|
| default|              table1|      false|
| default|          test_table|      false|
| default|             uscites|      false|
|        |   power_plant_table|       true|
+--------+--------------------+-----------+

Note that table names are in lower-case only!

You Try!

//sqlContext // uncomment and put . after sqlContext and hit Tab to see what methods are available
res23: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@67a5d97d
//sqlContext.dropTempTable("power_plant_table") // uncomment and Ctrl+Enter if you want to remove the table!

The following SQL statement simply selects all the columns (due to *) from powerPlantTable.

-- Ctrl+Enter to query the rows via SQL
SELECT * FROM power_plant_table
AT V AP RH PE
14.96 41.76 1024.07 73.17 463.26
25.18 62.96 1020.04 59.08 444.37
5.11 39.4 1012.16 92.14 488.56
20.86 57.32 1010.24 76.64 446.48
10.82 37.5 1009.23 96.62 473.9
26.27 59.44 1012.23 58.77 443.67
15.89 43.96 1014.02 75.24 467.35
9.48 44.71 1019.12 66.43 478.42
14.64 45.0 1021.78 41.25 475.98
11.74 43.56 1015.14 70.72 477.5
17.99 43.72 1008.64 75.04 453.02
20.14 46.93 1014.66 64.22 453.99
24.34 73.5 1011.31 84.15 440.29
25.71 58.59 1012.77 61.83 451.28
26.19 69.34 1009.48 87.59 433.99
21.42 43.79 1015.76 43.08 462.19
18.21 45.0 1022.86 48.84 467.54
11.04 41.74 1022.6 77.51 477.2
14.45 52.75 1023.97 63.59 459.85
13.97 38.47 1015.15 55.28 464.3
17.76 42.42 1009.09 66.26 468.27
5.41 40.07 1019.16 64.77 495.24
7.76 42.28 1008.52 83.31 483.8
27.23 63.9 1014.3 47.19 443.61
27.36 48.6 1003.18 54.93 436.06
27.47 70.72 1009.97 74.62 443.25
14.6 39.31 1011.11 72.52 464.16
7.91 39.96 1023.57 88.44 475.52
5.81 35.79 1012.14 92.28 484.41
30.53 65.18 1012.69 41.85 437.89

Truncated to 30 rows

Note that the output of the above command is the same as display(powerPlantDF) we did earlier.

We can use the SQL desc command to describe the schema. This is the SQL equivalent of powerPlantDF.printSchema we saw earlier.

desc power_plant_table
col_name data_type comment
AT double null
V double null
AP double null
RH double null
PE double null

Schema Definition

Our schema definition from UCI appears below:

  • AT = Atmospheric Temperature in C
  • V = Exhaust Vaccum Speed
  • AP = Atmospheric Pressure
  • RH = Relative Humidity
  • PE = Power Output

PE is our label or target. This is the value we are trying to predict given the measurements.

Reference UCI Machine Learning Repository Combined Cycle Power Plant Data Set

Let's do some basic statistical analysis of all the columns.

We can use the describe function with no parameters to get some basic stats for each column like count, mean, max, min and standard deviation. More information can be found in the Spark API docs

display(powerPlantDF.describe())
summary AT V AP RH PE
count 9568 9568 9568 9568 9568
mean 19.65123118729102 54.30580372073601 1013.2590781772603 73.30897784280926 454.3650094063554
stddev 7.4524732296110825 12.707892998326784 5.938783705811581 14.600268756728964 17.066994999803402
min 1.81 25.36 992.89 25.56 420.26
max 37.11 81.56 1033.3 100.16 495.76

Step 4: Visualize Your Data

To understand our data, we will look for correlations between features and the label. This can be important when choosing a model. E.g., if features and a label are linearly correlated, a linear model like Linear Regression can do well; if the relationship is very non-linear, more complex models such as Decision Trees or neural networks can be better. We use the Databricks built in visualization to view each of our predictors in relation to the label column as a scatter plot to see the correlation between the predictors and the label.

select AT as Temperature, PE as Power from power_plant_table
Temperature Power
14.96 463.26
25.18 444.37
5.11 488.56
20.86 446.48
10.82 473.9
26.27 443.67
15.89 467.35
9.48 478.42
14.64 475.98
11.74 477.5
17.99 453.02
20.14 453.99
24.34 440.29
25.71 451.28
26.19 433.99
21.42 462.19
18.21 467.54
11.04 477.2
14.45 459.85
13.97 464.3
17.76 468.27
5.41 495.24
7.76 483.8
27.23 443.61
27.36 436.06
27.47 443.25
14.6 464.16
7.91 475.52
5.81 484.41
30.53 437.89

Truncated to 30 rows

From the above plot, it looks like there is strong linear correlation between temperature and Power Output!

select V as ExhaustVaccum, PE as Power from power_plant_table;
ExhaustVaccum Power
41.76 463.26
62.96 444.37
39.4 488.56
57.32 446.48
37.5 473.9
59.44 443.67
43.96 467.35
44.71 478.42
45.0 475.98
43.56 477.5
43.72 453.02
46.93 453.99
73.5 440.29
58.59 451.28
69.34 433.99
43.79 462.19
45.0 467.54
41.74 477.2
52.75 459.85
38.47 464.3
42.42 468.27
40.07 495.24
42.28 483.8
63.9 443.61
48.6 436.06
70.72 443.25
39.31 464.16
39.96 475.52
35.79 484.41
65.18 437.89

Truncated to 30 rows

The linear correlation is not as strong between Exhaust Vacuum Speed and Power Output but there is some semblance of a pattern.

select AP as Pressure, PE as Power from power_plant_table;
Pressure Power
1024.07 463.26
1020.04 444.37
1012.16 488.56
1010.24 446.48
1009.23 473.9
1012.23 443.67
1014.02 467.35
1019.12 478.42
1021.78 475.98
1015.14 477.5
1008.64 453.02
1014.66 453.99
1011.31 440.29
1012.77 451.28
1009.48 433.99
1015.76 462.19
1022.86 467.54
1022.6 477.2
1023.97 459.85
1015.15 464.3
1009.09 468.27
1019.16 495.24
1008.52 483.8
1014.3 443.61
1003.18 436.06
1009.97 443.25
1011.11 464.16
1023.57 475.52
1012.14 484.41
1012.69 437.89

Truncated to 30 rows

select RH as Humidity, PE as Power from power_plant_table;
Humidity Power
73.17 463.26
59.08 444.37
92.14 488.56
76.64 446.48
96.62 473.9
58.77 443.67
75.24 467.35
66.43 478.42
41.25 475.98
70.72 477.5
75.04 453.02
64.22 453.99
84.15 440.29
61.83 451.28
87.59 433.99
43.08 462.19
48.84 467.54
77.51 477.2
63.59 459.85
55.28 464.3
66.26 468.27
64.77 495.24
83.31 483.8
47.19 443.61
54.93 436.06
74.62 443.25
72.52 464.16
88.44 475.52
92.28 484.41
41.85 437.89

Truncated to 30 rows

...and atmospheric pressure and relative humidity seem to have little to no linear correlation.

These pairwise plots can also be done directly using display on selected columns of the DataFrame powerPlantDF.

In general we will shy from SQL as much as possible to focus on ML pipelines written with DataFrames and DataSets with occassional in-and-out of RDDs.

The illustations in %sql above are to mainly reassure those with a RDBMS background and SQL that their SQL expressibility can be directly used in Apache Spark and in databricks notebooks.

display(powerPlantDF.select($"RH", $"PE"))
RH PE
73.17 463.26
59.08 444.37
92.14 488.56
76.64 446.48
96.62 473.9
58.77 443.67
75.24 467.35
66.43 478.42
41.25 475.98
70.72 477.5
75.04 453.02
64.22 453.99
84.15 440.29
61.83 451.28
87.59 433.99
43.08 462.19
48.84 467.54
77.51 477.2
63.59 459.85
55.28 464.3
66.26 468.27
64.77 495.24
83.31 483.8
47.19 443.61
54.93 436.06
74.62 443.25
72.52 464.16
88.44 475.52
92.28 484.41
41.85 437.89

Truncated to 30 rows

Furthermore, you can interactively start playing with display on the full DataFrame!

display(powerPlantDF) // just as we did for the diamonds dataset
AT V AP RH PE
14.96 41.76 1024.07 73.17 463.26
25.18 62.96 1020.04 59.08 444.37
5.11 39.4 1012.16 92.14 488.56
20.86 57.32 1010.24 76.64 446.48
10.82 37.5 1009.23 96.62 473.9
26.27 59.44 1012.23 58.77 443.67
15.89 43.96 1014.02 75.24 467.35
9.48 44.71 1019.12 66.43 478.42
14.64 45.0 1021.78 41.25 475.98
11.74 43.56 1015.14 70.72 477.5
17.99 43.72 1008.64 75.04 453.02
20.14 46.93 1014.66 64.22 453.99
24.34 73.5 1011.31 84.15 440.29
25.71 58.59 1012.77 61.83 451.28
26.19 69.34 1009.48 87.59 433.99
21.42 43.79 1015.76 43.08 462.19
18.21 45.0 1022.86 48.84 467.54
11.04 41.74 1022.6 77.51 477.2
14.45 52.75 1023.97 63.59 459.85
13.97 38.47 1015.15 55.28 464.3
17.76 42.42 1009.09 66.26 468.27
5.41 40.07 1019.16 64.77 495.24
7.76 42.28 1008.52 83.31 483.8
27.23 63.9 1014.3 47.19 443.61
27.36 48.6 1003.18 54.93 436.06
27.47 70.72 1009.97 74.62 443.25
14.6 39.31 1011.11 72.52 464.16
7.91 39.96 1023.57 88.44 475.52
5.81 35.79 1012.14 92.28 484.41
30.53 65.18 1012.69 41.85 437.89

Truncated to 30 rows

We will do the following steps in the sequel.

  • Step 5: Data Preparation
  • Step 6: Data Modeling
  • Step 7: Tuning and Evaluation
  • Step 8: Deployment

Datasource References:

  • Pinar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615, Web Link
  • Heysem Kaya, Pinar Tüfekci , Sadik Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai) Web Link