SDS-2.2, Scalable Data Science

This is used in a non-profit educational setting with kind permission of Adam Breindel. This is not licensed by Adam for use in a for-profit setting. Please contact Adam directly at [email protected] to request or report such use cases or abuses. A few minor modifications and additional mathematical statistical pointers have been added by Raazesh Sainudiin when teaching PhD students in Uppsala University.

Operations in the Real World

Practical Options, Tools, Patterns, and Considerations for Deep Learning

There are various ways to use deep learning in an enterprise setting that may not require designing your own networks!

Ways to Use Deep Learning

(in order from least complex/expensive investment to most)

[1] Load and use a pretrained model

Many of the existing toolkit projects offer models pretrained on datasets, including

[2] Augmenting a pretrained model with new training data, or using it in a related context (see Transfer Learning)

[3] Use a known, established network type (topology) but train on your own data

[4] Modify established network models for your specific problem

[5] Research and experiment with new types of models

Just because Google DeepMind, Facebook, and Microsoft are getting press for doing a lot of new research doesn't mean you have to do it too.

Data science and machine learning is challenging in general for enterprises (though some industries, such as pharma, have been doing it for a long time). Deep learning takes that even further, since deep learning experiments may require new kinds of hardware ... in some ways, it's more like chemistry than the average IT project!

Tools and Processes for your Deep Learning Pipeline

Data Munging

Most of the deep learning toolkits are focused on model-building performance or flexibility, less on production data processing.

However, Google recently introduced tf.Transform, a data processing pipeline project: https://github.com/tensorflow/transform

and Dataset, an API for data processing: https://www.tensorflow.org/api\_docs/python/tf/contrib/data/Dataset

TensorFlow can read from HDFS and run on Hadoop although it does not scale out automatically on a Hadoop/Spark cluster: https://www.tensorflow.org/deploy/hadoop

Falling back to "regular" tools, we have Apache Spark for big data, and the Python family of pandas, sklearn, scipy, numpy.

Experimenting and Training

Once you want to scale beyond your laptop, there are few options...

Frameworks

We've focused on TensorFlow and Keras, because that's where the "center of mass" is at the moment.

But there are lots of others. Major ones include:

  • Caffe
  • PaddlePaddle
  • Theano
  • CNTK
  • MXNet
  • DeepLearning4J
  • BigDL
  • Torch/PyTorch
  • NVIDIA Digits

and there are at least a dozen more minor ones.

Taking Your Trained Model to Production

Most trained models can predict in production in near-zero time. (Recall the forward pass is just a bunch of multiplication and addition with a few other calculations thrown in.)

For a neat example, you can persist Keras models and load them to run live in a browser with Keras.js

See Keras.js for code and demos: https://github.com/transcranial/keras-js

TensorFlow has an Android example at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android

and Apple CoreML supports Keras models: https://developer.apple.com/documentation/coreml/converting*trained*models*to*core\_ml

(remember, the model is already trained, we're just predicting here)

And for your server-side model update-and-serve tasks, or bulk prediction at scale...

(imagine classifying huge batches of images, or analyzing millions of chat messages or emails)

Security and Robustness

A recent (3/2017) paper on general key failure modes is Failures of Deep Learning: https://arxiv.org/abs/1703.07950

Deep learning models are subject to a variety of unexpected perturbations and adversarial data -- even when they seem to "understand," they definitely don't understand in a way that is similar to us.

Ian Goodfellow has distilled and referenced some of the research here: https://openai.com/blog/adversarial-example-research/

It's all fun and games until a few tiny stickers that a human won't even notice ... turn a stop sign into a "go" sign for your self-driving car ... and that's exactly what this team of researchers has done in Robust Physical-World Attacks on Machine Learning Models: https://arxiv.org/pdf/1707.08945v1.pdf

Final Notes

The research and projects are coming so fast that this will probably be outdated by the time you see it ...

2017 is the last ILSVRC! http://image-net.org/challenges/beyond\_ilsvrc.php

Try visualizing principal components of high-dimensional data with TensorFlow Embedding Projector http://projector.tensorflow.org/

Or explore with Google / PAIR's Facets tool: https://pair-code.github.io/facets/

Visualize the behavior of Keras models with keras-vis: https://raghakot.github.io/keras-vis/

Want more out of Keras without coding it yourself? See if your needs are covered in the extension repo for keras, keras-contrib: https://github.com/farizrahman4u/keras-contrib

Interested in a slightly different approach to APIs, featuring interactive (imperative) execution? In the past year, a lot of people have started using PyTorch: http://pytorch.org/

XLA, an experimental compiler to make TensorFlow even faster: https://www.tensorflow.org/versions/master/experimental/xla/

...and in addition to refinements of what we've already talked about, there is bleeding-edge work in

  • Neural Turing Machines
  • Code-generating Networks
  • Network-designing Networks
  • Evolution Strategies (ES) as an alternative to DQL / PG: https://arxiv.org/abs/1703.03864

Books

To fill in gaps, refresh your memory, gain deeper intuition and understanding, and explore theoretical underpinnings of deep learning...

Easier intro books (less math)

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

Deep Learning with Python by Francois Chollet

Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies by John D. Kelleher, Brian Mac Namee, Aoife D'Arcy

More thorough books (more math)

Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville

Information Theory, Inference and Learning Algorithms 1st Edition by David J. C. MacKay