SDS-2.2, Scalable Data Science
This is used in a non-profit educational setting with kind permission of Adam Breindel.
This is not licensed by Adam for use in a for-profit setting. Please contact Adam directly at [email protected]
to request or report such use cases or abuses.
A few minor modifications and additional mathematical statistical pointers have been added by Raazesh Sainudiin when teaching PhD students in Uppsala University.
Archived YouTube video of this live unedited lab-lecture:
We can also implement the model with mini-batches -- this will let us see matrix ops in action:
(N.b., feeddict is intended for small data / experimentation. For more info on ingesting data at scale, see https://www.tensorflow.org/apiguides/python/reading_data)
# we know these params, but we're making TF learn them
REAL_SLOPE_X1 = 2 # slope along axis 1 (x-axis)
REAL_SLOPE_X2 = 3 # slope along axis 2 (y-axis)
REAL_INTERCEPT = 5 # intercept along axis 3 (z-axis), think of (x,y,z) axes in the usual way
import numpy as np
# GENERATE a batch of true data, with a little Gaussian noise added
def make_mini_batch(size=10):
X = np.random.rand(size, 2) #
Y = np.matmul(X, [REAL_SLOPE_X1, REAL_SLOPE_X2]) + REAL_INTERCEPT + 0.2 * np.random.randn(size)
return X.reshape(size,2), Y.reshape(size,1)
To digest what's going on inside the function above, let's take it step by step.
Xex = np.random.rand(10, 2) # Xex is simulating PRNGs from independent Uniform [0,1] RVs
Xex # visualize these as 10 orddered pairs of points in the x-y plane that makes up our x-axis and y-axis (or x1 and x2 axes)
Out[14]: array([[ 0.68729439, 0.58462379], [ 0.93698634, 0.42744602], [ 0.08603721, 0.67466094], [ 0.17160026, 0.71457899], [ 0.72199128, 0.72235838], [ 0.58633688, 0.14721157], [ 0.27425931, 0.60181525], [ 0.5962165 , 0.51132706], [ 0.6869326 , 0.28285958], [ 0.9634012 , 0.21305557]])
Yex = np.matmul(Xex, [REAL_SLOPE_X1, REAL_SLOPE_X2]) # + REAL_INTERCEPT + 0.2 * np.random.randn(size)
Yex
Out[15]: array([ 3.12846017, 3.15631072, 2.19605725, 2.4869375 , 3.61105769, 1.61430847, 2.35396437, 2.72641418, 2.22244393, 2.56596912])
The first entry in Yex is obtained as follows (change the numbers in the produc below if you reevaluated the cells above) and geometrically it is the location in z-axis of the plane with slopes given by REALSLOPEX1 in the x-axis and REALSLOPEX2 in the y-aixs with intercept 0 at the point in the x-y or x1-x2 plane given by (0.68729439, 0.58462379).
0.68729439*REAL_SLOPE_X1 + 0.58462379*REAL_SLOPE_X2
Out[16]: 3.12846015
The next steps are adding an intercept term to translate the plane in the z-axis and then a scaled (the multiplication by 0.2 here) gaussian noise from independetly drawn pseudo-random samples from the standard normal or Normal(0,1) random variable via np.random.randn(size)
.
Yex = np.matmul(Xex, [REAL_SLOPE_X1, REAL_SLOPE_X2]) + REAL_INTERCEPT # + 0.2 * np.random.randn(10)
Yex
Out[19]: array([ 8.12846017, 8.15631072, 7.19605725, 7.4869375 , 8.61105769, 6.61430847, 7.35396437, 7.72641418, 7.22244393, 7.56596912])
Yex = np.matmul(Xex, [REAL_SLOPE_X1, REAL_SLOPE_X2]) + REAL_INTERCEPT + 0.2 * np.random.randn(10)
Yex # note how each entry in Yex is jiggled independently a bit by 0.2 * np.random.randn()
Out[20]: array([ 8.54725608, 8.21366699, 7.55268465, 7.34539416, 8.36644074, 6.42752603, 7.55499568, 7.71093193, 7.32762635, 7.29992345])
Thus we can now fully appreciate what is going on in make_mini_batch
. This is meant to substitute for pulling random sub-samples of batches of the real data during stochastic gradient descent.
make_mini_batch() # our mini-batch of Xx and Ys
Out[21]: (array([[ 0.76270993, 0.21237438], [ 0.93630169, 0.63585941], [ 0.65131407, 0.62564024], [ 0.50704891, 0.80775216], [ 0.45825114, 0.9387936 ], [ 0.65822646, 0.28507217], [ 0.78834033, 0.43172133], [ 0.85462969, 0.14751868], [ 0.05519411, 0.81125067], [ 0.10222417, 0.09151233]]), array([[ 7.68758883], [ 8.75889251], [ 8.07336849], [ 8.23691867], [ 8.90827379], [ 6.93907587], [ 7.52369029], [ 7.00021765], [ 7.60460715], [ 5.65254595]]))
import tensorflow as tf
batch = 5 # size of batch
tf.reset_default_graph() # this is important to do before you do something new in TF
# we will work with single floating point precision and this is specified in the tf.float32 type argument to each tf object/method
x = tf.placeholder(tf.float32, shape=(batch, 2)) # placeholder node for the pairs of x variables (predictors) in batches of size batch
x_aug = tf.concat( (x, tf.ones((batch, 1))), 1 ) # x_aug is a concatenation of a vector of 1`s along the first dimension
y = tf.placeholder(tf.float32, shape=(batch, 1)) # placeholder node for the univariate response y with batch many rows and 1 column
model_params = tf.get_variable("model_params", [3,1]) # these are the x1 slope, x2 slope and the intercept (3 rows and 1 column)
y_model = tf.matmul(x_aug, model_params) # our two-factor regression model is defined by this matrix multiplication
# note that the noise is formally part of the model and what we are actually modeling is the mean response...
error = tf.reduce_sum(tf.square(y - y_model))/batch # this is mean square error where the sum is computed by a reduce call on addition
train_op = tf.train.GradientDescentOptimizer(0.02).minimize(error) # learning rate is set to 0.02
init = tf.global_variables_initializer() # our way into running the TF session
errors = [] # list to track errors over iterations
with tf.Session() as session:
session.run(init)
for i in range(500):
x_data, y_data = make_mini_batch(batch) # simulate the mini-batch of data x1,x2 and response y with noise
_, error_val = session.run([train_op, error], feed_dict={x: x_data, y: y_data})
errors.append(error_val)
out = session.run(model_params)
print(out)
[[ 2.37420893] [ 2.94586873] [ 4.8277688 ]]
REAL_SLOPE_X1, REAL_SLOPE_X2, REAL_INTERCEPT # compare with rue parameter values - it's not too far from the estimates
Out[23]: (2, 3, 5)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches((4,3))
plt.plot(errors)
display(fig)