SDS-2.2, Scalable Data Science

Archived YouTube video of this live unedited lab-lecture:

Archived YouTube video of this live unedited lab-lecture

This is an elaboration of the Apache Spark mllib-progamming-guide on mllib-data-types.

Overview

Data Types - MLlib Programming Guide

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.

RowMatrix in Scala

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

A RowMatrix can be created from an RDD[Vector] instance. Then we can compute its column summary statistics and decompositions.

Refer to the RowMatrix Scala docs for details on the API.

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = sc.parallelize(Array(Vectors.dense(12.0, -51.0, 4.0), Vectors.dense(6.0, 167.0, -68.0), Vectors.dense(-4.0, 24.0, -41.0))) // an RDD of local vectors
rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[18] at parallelize at <console>:36
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
mat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@720029a1
mat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([12.0,-51.0,4.0], [6.0,167.0,-68.0], [-4.0,24.0,-41.0])
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
m: Long = 3
n: Long = 3
// QR decomposition
val qrResult = mat.tallSkinnyQR(true)
qrResult: org.apache.spark.mllib.linalg.QRDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] = 
QRDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@299d426,14.0  21.0                 -14.0                
0.0   -174.99999999999997  70.00000000000001    
0.0   0.0                  -35.000000000000014  )
qrResult.R
res1: org.apache.spark.mllib.linalg.Matrix = 
14.0  21.0                 -14.0                
0.0   -174.99999999999997  70.00000000000001    
0.0   0.0                  -35.000000000000014  


RowMatrix in Python

A RowMatrix can be created from an RDD of vectors.

Refer to the RowMatrix Python docs for more details on the API.

from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3
print m,'x',n

# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
4 x 3