SDS-2.2, Scalable Data Science
This is used in a non-profit educational setting with kind permission of Adam Breindel.
This is not licensed by Adam for use in a for-profit setting. Please contact Adam directly at [email protected]
to request or report such use cases or abuses.
A few minor modifications and additional mathematical and statistical pointers have been added by Raazesh Sainudiin when teaching PhD students at Uppsala University.
Archived YouTube video (no sound, sorry) of this live unedited lab-lecture:
Convolutional Neural Networks
aka CNN, ConvNet
As a baseline, let's start a lab running with what we already know.
We'll take our deep feed-forward multilayer perceptron network, with ReLU activations and reasonable initializations, and apply it to learning the MNIST digits.
The main part of the code looks like the following (full code you can run is in the next cell):
# imports, setup, load data sets
model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
categorical_labels = to_categorical(y_train, num_classes=10)
history = model.fit(X_train, categorical_labels, epochs=100, batch_size=100)
# print metrics, plot errors
Note the changes, which are largely about building a classifier instead of a regression model:
- Output layer has one neuron per category, with softmax activation
- Loss function is cross-entropy loss
- Accuracy metric is categorical accuracy
Let's keep some pointers into Wikipedia for these new concepts.
The following is from: https://www.quora.com/How-does-Keras-calculate-accuracy.
Categorical accuracy:
def categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.argmax(y_true, axis=-1),
                          K.argmax(y_pred, axis=-1)),
                  K.floatx())
K.argmax(y_true) takes the index of the highest value as the true label, K.argmax(y_pred) takes the index of the highest predicted value, and the two are compared for each example.
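To make that concrete, here is a tiny NumPy sketch (an illustration added for this write-up, not the Keras implementation) of the same argmax comparison:

import numpy as np

y_true = np.array([[0, 0, 1], [1, 0, 0]])               # one-hot labels
y_pred = np.array([[0.1, 0.2, 0.7], [0.3, 0.4, 0.3]])   # softmax outputs

# Compare the index of the largest entry in each row; the mean of the
# 0/1 matches is the categorical accuracy.
matches = np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)
print(matches.astype(float).mean())   # 0.5: first example correct, second not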
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import sklearn.datasets
import datetime
import matplotlib.pyplot as plt
import numpy as np
train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"
X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()
X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()
model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
categorical_labels = to_categorical(y_train, num_classes=10)
start = datetime.datetime.today()
history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100, validation_split=0.1, verbose=2)
scores = model.evaluate(X_test, to_categorical(y_test, num_classes=10))
print()
for i in range(len(model.metrics_names)):
print("%s: %f" % (model.metrics_names[i], scores[i]))
print ("Start: " + str(start))
end = datetime.datetime.today()
print ("End: " + str(end))
print ("Elapse: " + str(end-start))
Using TensorFlow backend.
Train on 54000 samples, validate on 6000 samples
Epoch 1/40 1s - loss: 0.4799 - categorical_accuracy: 0.8438 - val_loss: 0.1881 - val_categorical_accuracy: 0.9443
Epoch 2/40 1s - loss: 0.2151 - categorical_accuracy: 0.9355 - val_loss: 0.1650 - val_categorical_accuracy: 0.9503
Epoch 3/40 1s - loss: 0.1753 - categorical_accuracy: 0.9484 - val_loss: 0.1367 - val_categorical_accuracy: 0.9587
Epoch 4/40 1s - loss: 0.1533 - categorical_accuracy: 0.9543 - val_loss: 0.1321 - val_categorical_accuracy: 0.9627
Epoch 5/40 1s - loss: 0.1386 - categorical_accuracy: 0.9587 - val_loss: 0.1321 - val_categorical_accuracy: 0.9618
Epoch 6/40 1s - loss: 0.1287 - categorical_accuracy: 0.9612 - val_loss: 0.1332 - val_categorical_accuracy: 0.9613
Epoch 7/40 1s - loss: 0.1199 - categorical_accuracy: 0.9643 - val_loss: 0.1339 - val_categorical_accuracy: 0.9602
Epoch 8/40 1s - loss: 0.1158 - categorical_accuracy: 0.9642 - val_loss: 0.1220 - val_categorical_accuracy: 0.9658
Epoch 9/40 1s - loss: 0.1099 - categorical_accuracy: 0.9660 - val_loss: 0.1342 - val_categorical_accuracy: 0.9648
Epoch 10/40 1s - loss: 0.1056 - categorical_accuracy: 0.9675 - val_loss: 0.1344 - val_categorical_accuracy: 0.9622
...
Epoch 40/40 1s - loss: 0.0538 - categorical_accuracy: 0.9826 - val_loss: 0.1817 - val_categorical_accuracy: 0.9635
loss: 0.205790
categorical_accuracy: 0.959600
Start: 2017-12-07 09:30:17.311756
End: 2017-12-07 09:31:14.504506
Elapse: 0:00:57.192750
After about a minute we have:
...
Epoch 40/40
1s - loss: 0.0610 - categorical_accuracy: 0.9809 - val_loss: 0.1918 - val_categorical_accuracy: 0.9583
...
loss: 0.216120
categorical_accuracy: 0.955000
Start: 2017-12-06 07:35:33.948102
End: 2017-12-06 07:36:27.046130
Elapse: 0:00:53.098028
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
display(fig)
What are the big takeaways from this experiment?
- We get pretty impressive "apparent" (training-set) accuracy right from the start! A small network gets us to 97% training accuracy by epoch 20.
- The model appears to continue to learn if we let it run, although it does slow down and oscillate a bit.
- Our test accuracy is about 95% after 5 epochs and never gets better ... it gets worse!
- Therefore, we are overfitting very quickly... most of the "training" turns out to be a waste.
- For what it's worth, we get 95% accuracy without much work.
This is not terrible compared to other, non-neural-network approaches to the problem. Moreover, we could probably tweak this a bit and do even better.
But we talked about using deep learning to solve "95%" problems or "98%" problems ... where one error in 20, or 50 simply won't work. If we can get to "multiple nines" of accuracy, then we can do things like automate mail sorting and translation, create cars that react properly (all the time) to street signs, and control systems for robots or drones that function autonomously.
Try two more experiments (try them separately):
- Add a third, hidden layer.
- Increase the size of the hidden layers.
Adding another layer slows things down a little (why?) but doesn't seem to make a difference in accuracy.
Adding a lot more neurons into the first topology slows things down significantly -- 10x as many neurons, and only a marginal increase in accuracy. Notice also (in the plot) that the learning clearly degrades after epoch 50 or so.
... We need a new approach!
... let's think about this:
What is layer 2 learning from layer 1? Combinations of pixels
Combinations of pixels contain information but...
There are a lot of them (combinations) and they are "fragile"
In fact, in our last experiment, we basically built a model that memorizes a bunch of "magic" pixel combinations.
What might be a better way to build features?
- When humans perform this task, we look not at arbitrary pixel combinations, but at certain geometric patterns -- lines, curves, loops.
- These features are made up of combinations of pixels, but they are far from arbitrary.
- We identify these features regardless of translation, rotation, etc.
Is there a way to get the network to do the same thing?
I.e., in layer one, identify pixels. Then in layer 2+, identify abstractions over pixels that are translation-invariant 2-D shapes?
We could look at where a "filter" that represents one of these features (e.g., an edge) matches the image.
How would this work?
Convolution
Convolution in the general mathematical sense is defined as follows:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
The convolution we deal with in deep learning is a simplified case. We want to compare two signals. Here are two visualizations, courtesy of Wikipedia, that help communicate how convolution emphasizes features:
Here's an animation (where we change $t$):
In one sense, the convolution captures and quantifies the pattern matching over space.
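Here is a tiny 1-D sketch in NumPy (an illustration added for this write-up, not part of the original lab) of that "pattern matching over space" view:

import numpy as np

# A 1-D signal containing an upward step, and a simple difference kernel.
signal = np.array([0., 0., 0., 1., 1., 1.])
kernel = np.array([1., -1.])

# np.convolve slides the (flipped) kernel across the signal, multiplying
# element-wise and summing at each offset.
print(np.convolve(signal, kernel, mode='valid'))
# [0. 0. 1. 0. 0.] -- the response peaks exactly where the step occurs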
If we perform this in two dimensions, we can achieve effects like highlighting edges:
The matrix here, also called a convolution kernel, is one of the functions we are convolving. Other convolution kernels can blur, "sharpen," etc.
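For example, here is a hand-rolled 2-D version with SciPy (the 3x3 kernel is a classic vertical edge detector, chosen here purely for illustration; the lab itself does not use SciPy):

import numpy as np
from scipy.signal import convolve2d

# A small image with a vertical edge: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 2:] = 1.0

# A classic vertical edge-detection kernel.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# Output columns whose window straddles the edge respond strongly;
# the flat region gives 0.
print(convolve2d(image, kernel, mode='valid'))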
So we'll drop in a number of convolution kernels, and the network will learn where to use them? Nope. Better than that.
We'll program in the idea of discrete convolution, and the network will learn what kernels extract meaningful features!
The values in a (fixed-size) convolution kernel matrix will be variables in our deep learning model. Although intuitively it seems like it would be hard to learn useful parameters, in fact, since those variables are used repeatedly across the image data, it "focuses" the error on a smallish number of parameters with a lot of influence -- so it should be vastly less expensive to train than just a huge fully connected layer like we discussed above.
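A rough back-of-the-envelope sketch of that claim, using layer sizes from the ConvNet we build later in this lab (an illustrative calculation only):

# Fully connected layer from 784 pixel inputs to 128 units, versus a
# convolutional layer of 8 learned 4x4 kernels over a 1-channel image.
dense_params = 784 * 128 + 128      # weights + biases = 100,480
conv_params = 8 * (4 * 4 * 1) + 8   # 8 kernels x 16 shared weights + 8 biases = 136
print(dense_params, conv_params)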
This idea was developed in the late 1980s, and by 1989, Yann LeCun (at AT&T/Bell Labs) had built a practical high-accuracy system (used in the 1990s for processing handwritten checks and mail).
How do we hook this into our neural networks?
First, we can preserve the geometric properties of our data by "shaping" the vectors as 2D instead of 1D.
Then we'll create a layer whose value is not just an activation applied to a weighted sum of inputs, but instead the result of a dot product (element-wise multiply and sum) between the kernel and a patch of the input vector (image).
- This value will be our "pre-activation," which can optionally feed into an activation function (or "detector")
If we perform this operation at lots of positions over the image, we'll get lots of outputs, as many as one for every input pixel.
- So we'll add another layer that "picks" the highest convolution pattern match from nearby pixels, which
- makes our pattern match a little bit translation invariant (a fuzzy location match)
- reduces the number of outputs significantly
- This layer is commonly called a pooling layer, and if we pick the "maximum match" then it's a "max pooling" layer.
The end result is that the kernel or filter together with max pooling creates a value in a subsequent layer which represents the appearance of a pattern in a local area in a prior layer.
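A minimal NumPy sketch of 2x2 max pooling (illustration only; Keras's MaxPooling2D layer does this for us below):

import numpy as np

# 2x2 max pooling over a 4x4 "activation map": keep the strongest match
# in each non-overlapping 2x2 patch.
activations = np.array([[1, 3, 0, 1],
                        [2, 9, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])

pooled = activations.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[9 1]
#  [2 8]]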
Again, the network will be given a number of "slots" for these filters and will learn (by minimizing error) what filter values produce meaningful features. This is the key insight into how modern image-recognition networks are able to generalize -- i.e., learn to tell 6s from 7s or cats from dogs.
Ok, let's build our first ConvNet:
First, we want to explicitly shape our data into a 2-D configuration. We'll end up with a 4-D tensor where the first dimension indexes the training examples, each example is 28x28 pixels, and we'll say it's explicitly 1 layer deep. (Why? With color images, we typically process 3 or 4 channels in this last dimension.)
A step-by-step animation follows:
- http://cs231n.github.io/assets/conv-demo/index.html
train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"
X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()
X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()
X_train = X_train.reshape( (X_train.shape[0], 28, 28, 1) )
X_train = X_train.astype('float32')
X_train /= 255
y_train = to_categorical(y_train, num_classes=10)
X_test = X_test.reshape( (X_test.shape[0], 28, 28, 1) )
X_test = X_test.astype('float32')
X_test /= 255
y_test = to_categorical(y_test, num_classes=10)
Now the model:
from keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
model = Sequential()
model.add(Conv2D(8, # number of kernels
(4, 4), # kernel size
padding='valid', # no padding; output will be smaller than input
input_shape=(28, 28, 1)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu')) # alternative syntax for applying activation
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
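As a quick check on the weight-sharing discussion above, Keras can print each layer's output shape and parameter count (this call isn't in the original lab cell, but model.summary() is standard Keras):

model.summary()   # e.g., the Conv2D layer has 8*(4*4*1) + 8 = 136 parameters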
... and the training loop and output:
start = datetime.datetime.today()
history = model.fit(X_train, y_train, batch_size=128, epochs=8, verbose=2, validation_split=0.1)
scores = model.evaluate(X_test, y_test, verbose=1)
print()
for i in range(len(model.metrics_names)):
print("%s: %f" % (model.metrics_names[i], scores[i]))
Train on 54000 samples, validate on 6000 samples
Epoch 1/8 12s - loss: 0.3183 - acc: 0.9115 - val_loss: 0.1114 - val_acc: 0.9713
Epoch 2/8 13s - loss: 0.1021 - acc: 0.9700 - val_loss: 0.0714 - val_acc: 0.9803
Epoch 3/8 12s - loss: 0.0706 - acc: 0.9792 - val_loss: 0.0574 - val_acc: 0.9852
Epoch 4/8 15s - loss: 0.0518 - acc: 0.9851 - val_loss: 0.0538 - val_acc: 0.9857
Epoch 5/8 16s - loss: 0.0419 - acc: 0.9875 - val_loss: 0.0457 - val_acc: 0.9872
Epoch 6/8 16s - loss: 0.0348 - acc: 0.9896 - val_loss: 0.0473 - val_acc: 0.9867
Epoch 7/8 16s - loss: 0.0280 - acc: 0.9917 - val_loss: 0.0445 - val_acc: 0.9875
Epoch 8/8 14s - loss: 0.0225 - acc: 0.9934 - val_loss: 0.0485 - val_acc: 0.9868
loss: 0.054931
acc: 0.983000
fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
display(fig)
Our MNIST ConvNet
In our first convolutional MNIST experiment, we get to almost 99% validation accuracy in just a few epochs (a minute or two on CPU)!
The training accuracy is effectively 100%, though, so we've almost completely overfit (i.e., memorized the training data) by this point and need to do a little work if we want to keep learning.
Let's add another convolutional layer:
model = Sequential()
model.add(Conv2D(8, # number of kernels
(4, 4), # kernel size
padding='valid',
input_shape=(28, 28, 1)))
model.add(Activation('relu'))
model.add(Conv2D(8, (4, 4)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=2, validation_split=0.1)
scores = model.evaluate(X_test, y_test, verbose=1)
print()
for i in range(len(model.metrics_names)):
print("%s: %f" % (model.metrics_names[i], scores[i]))
Train on 54000 samples, validate on 6000 samples
Epoch 1/15 21s - loss: 0.2698 - acc: 0.9242 - val_loss: 0.0763 - val_acc: 0.9790
Epoch 2/15 22s - loss: 0.0733 - acc: 0.9772 - val_loss: 0.0593 - val_acc: 0.9840
Epoch 3/15 21s - loss: 0.0522 - acc: 0.9838 - val_loss: 0.0479 - val_acc: 0.9867
Epoch 4/15 21s - loss: 0.0394 - acc: 0.9876 - val_loss: 0.0537 - val_acc: 0.9828
Epoch 5/15 21s - loss: 0.0301 - acc: 0.9907 - val_loss: 0.0434 - val_acc: 0.9885
Epoch 6/15 21s - loss: 0.0255 - acc: 0.9916 - val_loss: 0.0444 - val_acc: 0.9867
Epoch 7/15 21s - loss: 0.0192 - acc: 0.9941 - val_loss: 0.0416 - val_acc: 0.9892
Epoch 8/15 21s - loss: 0.0155 - acc: 0.9952 - val_loss: 0.0405 - val_acc: 0.9900
Epoch 9/15 21s - loss: 0.0143 - acc: 0.9953 - val_loss: 0.0490 - val_acc: 0.9860
Epoch 10/15 22s - loss: 0.0122 - acc: 0.9963 - val_loss: 0.0478 - val_acc: 0.9877
Epoch 11/15 21s - loss: 0.0087 - acc: 0.9972 - val_loss: 0.0508 - val_acc: 0.9877
Epoch 12/15 21s - loss: 0.0098 - acc: 0.9967 - val_loss: 0.0401 - val_acc: 0.9902
Epoch 13/15 21s - loss: 0.0060 - acc: 0.9984 - val_loss: 0.0450 - val_acc: 0.9897
Epoch 14/15 21s - loss: 0.0061 - acc: 0.9981 - val_loss: 0.0523 - val_acc: 0.9883
Epoch 15/15 21s - loss: 0.0069 - acc: 0.9977 - val_loss: 0.0534 - val_acc: 0.9885
loss: 0.043144
acc: 0.989100
While that's running, let's look at a number of "famous" convolutional networks!
LeNet (Yann LeCun, 1998)
AlexNet (2012)
Back to our labs: Still Overfitting
We're making progress on our test accuracy -- about 99% -- but only a little, for all the additional time, because the network is overfitting the data.
There are a variety of techniques we can take to counter this -- forms of regularization.
Let's try a relatively simple solution that works surprisingly well: add a pair of Dropout layers, which randomly omit a fraction of neurons from each training batch (thus exposing each neuron to only part of the training data).
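A minimal NumPy sketch of what (inverted) dropout does during training (illustration only; Keras's Dropout layer handles this for us, and does nothing at inference time):

import numpy as np

# Each activation is kept with probability (1 - rate) and scaled by 1/(1 - rate),
# so the expected value of the layer's output is unchanged.
rate = 0.5
activations = np.ones(8)
mask = (np.random.rand(8) >= rate) / (1.0 - rate)
print(activations * mask)   # roughly half the entries zeroed, the rest scaled to 2.0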
We'll add more convolution kernels but shrink them to 3x3 as well.
model = Sequential()
model.add(Conv2D(32, # number of kernels
(3, 3), # kernel size
padding='valid',
input_shape=(28, 28, 1)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25)) # <- regularize
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5)) # <-regularize
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=2)
scores = model.evaluate(X_test, y_test, verbose=2)
print()
for i in range(len(model.metrics_names)):
print("%s: %f" % (model.metrics_names[i], scores[i]))
Epoch 1/15 122s - loss: 0.2673 - acc: 0.9186
Epoch 2/15 123s - loss: 0.0903 - acc: 0.9718
Epoch 3/15 124s - loss: 0.0696 - acc: 0.9787
Epoch 4/15 130s - loss: 0.0598 - acc: 0.9821
Epoch 5/15 140s - loss: 0.0507 - acc: 0.9842
Epoch 6/15 148s - loss: 0.0441 - acc: 0.9864
Epoch 7/15 152s - loss: 0.0390 - acc: 0.9878
Epoch 8/15 153s - loss: 0.0353 - acc: 0.9888
Epoch 9/15 161s - loss: 0.0329 - acc: 0.9894
Epoch 10/15 171s - loss: 0.0323 - acc: 0.9894
Epoch 11/15 172s - loss: 0.0288 - acc: 0.9906
Epoch 12/15 71s - loss: 0.0255 - acc: 0.9914
Epoch 13/15 86s - loss: 0.0245 - acc: 0.9916
Epoch 14/15 91s - loss: 0.0232 - acc: 0.9922
Epoch 15/15 86s - loss: 0.0218 - acc: 0.9931
loss: 0.029627
acc: 0.991400
While that's running, let's look at some more recent ConvNet architectures:
VGG16 (2014)
GoogLeNet (2014)
"Inception" layer: parallel convolutions at different resolutions
Residual Networks (2015-)
Skip layers to improve training (error propagation). Residual layers learn from details at multiple previous layers.
ASIDE: Atrous / Dilated Convolutions
An atrous or dilated convolution is a convolution filter with "holes" in it. Effectively, it is a way to enlarge the filter spatially while not adding as many parameters or attending to every element in the input.
Why? Covering a larger input volume allows recognizing coarser-grained patterns; restricting the number of parameters is a way of regularizing or constraining the capacity of the model, making training easier.
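In Keras this is exposed through the dilation_rate argument of Conv2D; a minimal sketch (not part of the original lab):

from keras.models import Sequential
from keras.layers import Conv2D

# A 3x3 kernel with dilation_rate=2 covers a 5x5 input area while still
# having only 3*3 weights per input channel.
dilated = Sequential()
dilated.add(Conv2D(8, (3, 3), dilation_rate=(2, 2), padding='valid',
                   input_shape=(28, 28, 1)))
dilated.summary()   # 28x28 input -> 24x24 output; 8*(3*3*1) + 8 = 80 parameters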
Lab Wrapup
From the last lab, you should have a test accuracy of over 99.1%
For one more activity, try changing the optimizer to old-school "sgd" -- just to see how far we've come with these modern gradient descent techniques in the last few years.
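For example, rebuild the same model as in the previous cell and change only the optimizer string when compiling (a sketch of the exercise, not additional lab code):

# ... define `model` with the same layers as above, then:
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=2)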
Accuracy will end up noticeably worse ... about 96-97% test accuracy. Two key takeaways:
- Without a good optimizer, even a very powerful network design may not achieve good results
- In fact, we could replace the word "optimizer" there with:
  - initialization
  - activation
  - regularization
  - (etc.)
- All of these elements we've been working with operate together in a complex way to determine final performance
Of course, this world evolves fast -- see the new kid on the CNN block: capsule networks.
Hinton: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”
Well worth the 8-minute read:
- https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b
To understand it more deeply:
- Original paper: https://arxiv.org/abs/1710.09829