SDS-2.2, Scalable Data Science
Archived YouTube video of this live unedited lab-lecture:
This is an elaboration of the Apache Spark mllib-programming-guide on mllib-data-types.
Overview
Data Types - MLlib Programming Guide
MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.
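To make the dense-versus-sparse distinction concrete, here is a pure-Python sketch (illustration only, not MLlib's actual `Vectors.dense`/`Vectors.sparse` classes) of the `(size, indices, values)` encoding MLlib uses for sparse vectors:

```python
# Illustrative sketch only: shows the (size, indices, values) idea behind
# MLlib's sparse vectors; the real classes live in pyspark.mllib.linalg.

def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The vector [1.0, 0.0, 3.0] in both representations:
dense = [1.0, 0.0, 3.0]
sparse = (3, [0, 2], [1.0, 3.0])  # size, non-zero indices, non-zero values

assert sparse_to_dense(*sparse) == dense
```

Only the non-zero positions are stored, which is why sparse formats pay off when most entries are zero.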
CoordinateMatrix in Scala
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix. Other computations for CoordinateMatrix are not currently supported.

Refer to the CoordinateMatrix Scala docs for details on the API.
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = sc.parallelize(Array(MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7))) // an RDD of matrix entries
entries: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = ParallelCollectionRDD[454] at parallelize at <console>:35
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
mat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@73dc93f3
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
m: Long = 7 n: Long = 2
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
indexedRowMatrix: org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix = org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix@4a8e753a
indexedRowMatrix.rows.collect()
res3: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,(2,[0],[1.2])), IndexedRow(6,(2,[1],[3.7])), IndexedRow(1,(2,[0],[2.1])))
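Each IndexedRow in the collected output above pairs a row index with a sparse vector in (size, indices, values) form. A small pure-Python sketch (no Spark) reassembling those three rows into the dense 7x2 matrix:

```python
# Pure-Python sketch: rebuild the dense matrix from the
# (rowIndex, (size, indices, values)) triples printed above.
rows = [
    (0, (2, [0], [1.2])),
    (6, (2, [1], [3.7])),
    (1, (2, [0], [2.1])),
]
num_rows, num_cols = 7, 2  # mat.numRows(), mat.numCols() from above

dense = [[0.0] * num_cols for _ in range(num_rows)]
for i, (size, indices, values) in rows:
    for j, v in zip(indices, values):
        dense[i][j] = v

# dense[0] == [1.2, 0.0], dense[1] == [2.1, 0.0], dense[6] == [0.0, 3.7];
# all other rows remain [0.0, 0.0].
```

Note that numRows() is 7 because the largest row index seen is 6, even though only three rows hold non-zero entries.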
CoordinateMatrix in Python
A CoordinateMatrix can be created from an RDD of MatrixEntry entries, where MatrixEntry is a wrapper over (long, long, float). A CoordinateMatrix can be converted to a RowMatrix by calling toRowMatrix, or to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix.

Refer to the CoordinateMatrix Python docs for more details on the API.
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
# Create an RDD of coordinate entries.
# - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)])
# - or using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])
# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)
# Get its size.
m = mat.numRows() # 3
n = mat.numCols() # 2
print (m,n)
# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries
# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()
# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()
# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()
(3L, 2L)
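Conceptually, the toIndexedRowMatrix conversion groups coordinate entries by their row index into sparse rows. A rough pure-Python sketch of that grouping (illustration only, not how Spark implements it):

```python
# Illustration only: group (i, j, value) coordinate entries by row index,
# producing (rowIndex, (numCols, colIndices, values)) sparse rows.
entries = [(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]
num_cols = 1 + max(j for _, j, _ in entries)  # here: 2

by_row = {}
for i, j, v in entries:
    by_row.setdefault(i, []).append((j, v))

indexed_rows = [
    (i, (num_cols,
         [j for j, _ in sorted(cols)],
         [v for _, v in sorted(cols)]))
    for i, cols in sorted(by_row.items())
]
# e.g. indexed_rows[0] == (0, (2, [0], [1.2]))
```

In Spark this grouping happens distributedly across the RDD's partitions; the sketch only shows the shape of the transformation.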