Joaquín Padilla Montani
Deep Learning with TensorFlow WS18/19 @ IST Austria
January 7th, 2019
Current DNN architectures can be very deep
(e.g., for object detection in high-resolution images).
This presents several challenges:
Often the gradients w.r.t. the weights in the lower layers become very small or vanish entirely.
This can slow down or even stop training.
The opposite can also happen (e.g., RNNs), where gradients explode.
What to do about it:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "activations.png", width = 800)
Problems:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "lrelu-elu.png", width = 800)
#ICLR 2016
Image(filename = "elupaper.png", width = 600)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.leaky_relu)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu)
#NIPS 2017 (tf.nn.selu)
Image(filename = "selu.png", width = 600)
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "initializations.png", width = 600)
he_init = tf.variance_scaling_initializer(scale=2.0, mode="fan_avg")  # He init, averaged fan
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")
Problem: the distribution of a layer's input changes as the parameters of previous layers change.
Idea: before activation, zero-center and normalize the inputs (over the current mini-batch), then let the network learn two new parameters per layer for scaling and shifting.
The model learns, in each layer, the best scale and mean for its inputs.
During training, the layers learn $\gamma$ (the scale) and $\beta$ (the offset).
Each batch-normalized layer also keeps running (moving-average) estimates of the overall input mean $\mu$ and standard deviation $\sigma$, which are used at test time
(instead of the per-batch statistics).
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training)
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")
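A commonly used alternative (a sketch, not from the slides; it assumes the same loss and optimizer used to build training_op): attach the batch-norm update ops as control dependencies when building the graph, so they run automatically.
# Make the moving-average updates run whenever training_op runs,
# so extra_update_ops no longer needs to be fetched explicitly in sess.run.
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    training_op = optimizer.minimize(loss)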
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
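A common variant (a sketch, not from the slides): clip by the global norm of all gradients, which rescales them jointly and preserves their relative direction.
# Clip all gradients together so their global L2 norm does not exceed threshold.
gradients, variables = zip(*optimizer.compute_gradients(loss))
clipped_gradients, _ = tf.clip_by_global_norm(gradients, threshold)
training_op = optimizer.apply_gradients(list(zip(clipped_gradients, variables)))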
Regular Gradient Descent: walking down a hill
$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
Momentum Optimization: rolling down a hill
$\textbf{m} \leftarrow \beta \, \textbf{m} + \eta \nabla_\theta J(\theta)$
$\theta \leftarrow \theta - \textbf{m}$
The gradient is thus used as an acceleration, instead of as a speed.
The new hyperparameter $\beta$, called the $\textit{momentum}$, controls the "friction".
# Image from:
# www.towardsdatascience.com/
# how-to-train-neural-network-faster-with-optimizers-d297730b3713
Image(filename = "momentum.png", width = 550)
Advantages vs. GD: faster convergence, and the accumulated momentum helps roll past plateaus and small local optima.
Interactive applet for gaining intuition: https://distill.pub/2017/momentum/
with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9)
    training_op = optimizer.minimize(loss)
#Even better: Nesterov Accelerated Gradient
tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9, use_nesterov=True)
Initialize $\textbf{m}$ and $\textbf{s}$ to $0$.
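For reference, the standard Adam update rules (Kingma & Ba, 2015), with $\otimes$ and $\oslash$ denoting element-wise multiplication and division:
$\textbf{m} \leftarrow \beta_1 \, \textbf{m} + (1 - \beta_1) \, \nabla_\theta J(\theta)$
$\textbf{s} \leftarrow \beta_2 \, \textbf{s} + (1 - \beta_2) \, \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
$\hat{\textbf{m}} \leftarrow \textbf{m} \, / \, (1 - \beta_1^T)$
$\hat{\textbf{s}} \leftarrow \textbf{s} \, / \, (1 - \beta_2^T)$
$\theta \leftarrow \theta - \eta \, \hat{\textbf{m}} \oslash (\sqrt{\hat{\textbf{s}}} + \epsilon)$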
Where $T$ is the iteration number.
Typical hyperparameter values (defaults in TensorFlow): $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
# NIPS 2017
Image(filename = "adaptivepaper.png", width = 600)
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "learningrate.png", width = 600)
Idea: start with a high learning rate, then reduce it.
Possible implementations:
with tf.name_scope("train"):
    # eta(t) = eta_0 * 10^(-t/r)
    initial_learning_rate = 0.1  # eta_0
    decay_steps = 10000          # r
    decay_rate = 1/10            # decay by a factor of 10 every r steps
    global_step = tf.Variable(0, trainable=False, name="global_step")  # t
    learning_rate = tf.train.exponential_decay(initial_learning_rate,
                                               global_step,
                                               decay_steps,
                                               decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
# During execution, each batch increases global_step by 1
for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
Typical situation when training a large DNN: at some point the validation error stops improving and starts rising, even though the training error keeps decreasing (overfitting).
Idea: stop training when the validation performance starts dropping.
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "earlystopping.png", width = 600)
n_epochs, batch_size = 1000, 20
max_checks_without_progress = 20
checks_without_progress = 0
best_loss = np.infty

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train))
        for rnd_indices in np.array_split(rnd_idx, len(X_train) // batch_size):
            X_batch, y_batch = X_train[rnd_indices], y_train[rnd_indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        loss_val, acc_val = sess.run([loss, acc], feed_dict={X: X_val, y: y_val})
        if loss_val < best_loss:
            save_path = saver.save(sess, "./my_model.ckpt")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break
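A possible follow-up (a sketch, not from the slides): after stopping, roll back to the best checkpoint before evaluating further.
# Restore the best model found on the validation set (same saver, loss and acc ops as above).
with tf.Session() as sess:
    saver.restore(sess, "./my_model.ckpt")
    best_loss_val, best_acc_val = sess.run([loss, acc],
                                           feed_dict={X: X_val, y: y_val})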
Idea: at every training step, each neuron (except output units) has a
probability $p$ of being temporarily ignored (of being "dropped out").
Goals: prevent overfitting and make the network more robust (the trained network can be seen as averaging an ensemble of many subnetworks, see below).
# Image from:
# https://warwick.ac.uk/fac/cross_fac/complexity/
# people/students/dtc/students2013/eyre/statsreadinggroup/
Image(filename = "dropout.png", width = 550)
Units are only dropped while training.
Problem: when testing, a given neuron will have, on average, twice as many inputs as it had during training (assuming $p = 0.5$).
Solution: divide each neuron's output by the "keep probability" $(1-p)$ during training.
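A toy numpy sketch of this "inverted dropout" scaling (illustrative only, not the slides' TensorFlow code):
import numpy as np

p = 0.5                                            # drop probability
activations = np.random.randn(4, 3)                # toy layer outputs
mask = np.random.rand(*activations.shape) >= p     # True = keep, False = drop
train_out = activations * mask / (1 - p)           # kept units scaled up by 1/(1-p)
test_out = activations                             # used unchanged at test time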
# Image from: Ian Goodfellow et al. "Deep Learning." MIT Press (2016).
# http://www.deeplearningbook.org/
Image(filename = "dropoutemsemble.png", width = 500)
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')
p = 0.5 # == 1 - keep_prob
X_drop = tf.layers.dropout(X, rate=p, training=training)
with tf.name_scope("dnn"):
hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu)
hidden1_drop = tf.layers.dropout(hidden1, rate=p, training=training)
hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu)
hidden2_drop = tf.layers.dropout(hidden2, rate=p, training=training)
logits = tf.layers.dense(hidden2_drop, n_outputs)
'''
Dropout consists in randomly setting a fraction rate of input units to 0 at
each update during training time, which helps prevent overfitting.
The units that are kept are scaled by 1 / (1 - rate), so that their sum
is unchanged at training time and inference time.
https://www.tensorflow.org/api_docs/python/tf/layers/dropout
'''
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(trn_op, feed_dict={X: X_batch, y: y_batch, training: True})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")
Aurélien Géron. "Hands-On Machine Learning with Scikit-Learn & TensorFlow." O'Reilly Media (2017).
[Main reference. TensorFlow code also from here. See Chapter 11.]
Ian Goodfellow et al. "Deep Learning." MIT Press (2016).
Ashia C. Wilson et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NIPS (2017).
Djork-Arné Clevert et al. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." ICLR (2016).
Günter Klambauer et al. "Self-Normalizing Neural Networks." NIPS (2017).
Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS (2010).
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "transfer.png", width = 550)
#Code from: https://github.com/taehoonlee/tensornets
import tensorflow as tf
import tensornets as nets
inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])
model = nets.YOLOv2(inputs, nets.Darknet19)
img = nets.utils.load_img('cat.png')
with tf.Session() as sess:
    sess.run(model.pretrained())
    preds = sess.run(model, {inputs: model.preprocess(img)})
    boxes = model.get_boxes(preds, img.shape[1:3])
#Code from: https://github.com/taehoonlee/tensornets
%matplotlib inline
from tensornets.datasets import voc
print("%s: %s" % (voc.classnames[7], boxes[7][0])) # 7 is cat
import numpy as np
import matplotlib.pyplot as plt
box = boxes[7][0]
plt.imshow(img[0].astype(np.uint8))
plt.gca().add_patch(plt.Rectangle(
    (box[0], box[1]), box[2] - box[0], box[3] - box[1],
    fill=False, edgecolor='r', linewidth=2))
plt.show()
cat: [103. 15. 356. 267. 0.9605811]