Databricks notebook source exported at Tue, 28 Jun 2016 09:35:44 UTC
Scalable Data Science
prepared by Paul Brouwers, Raazesh Sainudiin and Sivanand Sivaram
Students of the Scalable Data Science Course at UC, Ilam
- First check if a cluster named `classClusterTensorFlow` is running.
- If it is, then just skip this notebook and attach the next notebook to `classClusterTensorFlow`.
TensorFlow initialization scripts
This notebook explains how to install TensorFlow on a large cluster. It is not required for the Databricks Community Edition.
The TensorFlow library needs to be installed directly on all the nodes of the cluster. We show here how to install complex python packages that are not supported yet by the Databricks library manager. Such libraries are directly installed using cluster initialization scripts ("init scripts" for short). These scripts are Bash programs that run on a compute node when this node is being added to a cluster.
For more information, please refer to the section on init scripts in the Databricks guide.
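To make the mechanism concrete, here is a minimal sketch of what an init script amounts to (the cluster name `myCluster` and script name `hello-install.sh` are hypothetical; Step 2 below creates the real scripts):

# Hypothetical example: any Bash file placed under
# dbfs:/databricks/init/<cluster-name>/ runs on every node joining that cluster.
dbutils.fs.put("dbfs:/databricks/init/myCluster/hello-install.sh", """
#!/bin/bash
echo "init script ran on $(hostname)" > /tmp/hello-init.log
""", True)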
These scripts require the name of the cluster. If you use this notebook, you will need to change the name of the cluster in the cell below:
Step 1. Set cluster variable and check
# Change the value to the name of your cluster:
clusterName = "classClusterTensorFlow"
Check whether the init scripts are already in this cluster:
dbutils.fs.ls("dbfs:/databricks/init/%s/" % clusterName)
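Note that `dbutils.fs.ls` throws an exception if the directory does not exist yet (for example, on a cluster that has never had init scripts). A defensive variant, as a minimal sketch:

# Wrap the listing so a missing directory gives a readable message
# instead of a stack trace.
try:
    for f in dbutils.fs.ls("dbfs:/databricks/init/%s/" % clusterName):
        print(f.name)
except Exception:
    print("No init script directory for cluster %s yet." % clusterName)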
If `pillow-install.sh` and `tensorflow-install.sh` are already in this cluster, then skip Step 2 below.
Step 2. To (re)create init scripts
If the `.sh` files above are not there, then evaluate the cells below and restart the cluster.
Sub-step 2.1
The following commands create init scripts that install the TensorFlow library on your cluster whenever it is started or restarted. If you do not want TensorFlow installed on this cluster by default, you need to remove the scripts by running the following commands:
dbutils.fs.rm("dbfs:/databricks/init/%s/tensorflow-install.sh" % clusterName)
dbutils.fs.rm("dbfs:/databricks/init/%s/pillow-install.sh" % clusterName)
The next cell creates the init scripts. You need to restart your cluster after running the following command.
# Ensure the root directory for init scripts exists.
dbutils.fs.mkdirs("dbfs:/databricks/init/")
# Write the TensorFlow init script; the final True overwrites any existing file.
dbutils.fs.put("dbfs:/databricks/init/%s/tensorflow-install.sh" % clusterName, """
#!/bin/bash
/databricks/python/bin/pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
""", True)
# This is just to get nice image visualization.
# Pillow needs a few system imaging libraries; install those first,
# then the Python package itself, into the same Databricks Python.
dbutils.fs.put("dbfs:/databricks/init/%s/pillow-install.sh" % clusterName, """
#!/bin/bash
echo "------ packages --------"
sudo apt-get -y --force-yes install libtiff5-dev libjpeg8-dev zlib1g-dev
echo "------ python packages --------"
/databricks/python/bin/pip install pillow
""", True)
Sub-step 2.2
You now need to restart your cluster.
Step 3. How to check that the scripts ran correctly after a cluster (re)start
As explained in the Databricks guide, the output of init scripts is stored in DBFS. The following cell accesses the latest content of the logs after a cluster start:
# The output directories are named by timestamp; take the most recent one.
stamp = str(dbutils.fs.ls("/databricks/init/output/%s/" % clusterName)[-1].name)
print("Stamp is %s" % stamp)
files = dbutils.fs.ls("/databricks/init/output/%s/%s" % (clusterName, stamp))
tf_files = [str(fi.path) for fi in files if fi.name.startswith("%s-tensorflow-install" % clusterName)]
logs = [dbutils.fs.head(fname) for fname in tf_files]
for log in logs:
    print("************************")
    print(log)
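Once the cluster has restarted and the install logs look clean, a quick smoke test is to run a trivial TensorFlow graph and a Pillow import on the workers. This is a minimal sketch, assuming the TensorFlow 0.6.x graph/session API installed by the script above:

# Exercise the installation on the worker nodes (not just the driver):
# import both libraries and evaluate 1 + 41 in a TensorFlow session.
def check_worker(_):
    from PIL import Image  # verifies the Pillow install
    import tensorflow as tf
    with tf.Session() as sess:
        result = sess.run(tf.add(tf.constant(1), tf.constant(41)))
    return "TensorFlow OK: 1 + 41 = %d" % result

print(sc.parallelize(range(4), 4).map(check_worker).collect())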