Data Mining algorithms
2017-2018 spring
04.11-20.2018
1. NN and BN
28/03/2017
W1 February 7-9: Introduction, kNN, evaluation
W2 February 14-16: Evaluation, Decision Trees
W3 February 21-23: Linear separators, iPython, VC theorem
W4 February 28-March 2: Linear separators, iPython, maximal margin
W5 March 7-9: SVM, VC theorem and Bottou-Bousquet
W6 March 14-16: clustering (hierarchical, density based etc.), GMM, MRF, Apriori and association rules
W7 March 21-23: Recommender systems
W8 March 28: centrality, generative models
W9 April 4-6: holiday
W10 April 11-13: basics of neural networks, Bayes networks
W11 April 18-20: midterm, Sontag-Maas-Bartlett, BN
W12 April 25-27: CNN, MLP, Dropout, Batch normalization, RNN, LSTM, GRU
W13 May 2-4: attention, Image caption, Turing Machine
W14 May 9-11: RBM, DBN, VAE, GAN
W15 May 16-18: Boosting, Time series
W16 May 23-25: TS, Projects on Friday
Plan
Neural networks, briefly deeply
Hypothesis: deep, hierarchical models can be exponentially more efficient than shallow ones [Bengio et al. 2009, Le Roux and Bengio, 2010, Delalleau and Bengio, 2011 etc.]
[Delalleau and Bengio, 2011]: a deep sum-product network may require exponentially fewer units to represent the same function than a shallow sum-product network.
What is the depth of a Neural Network?
For feed-forward networks, it is the number of nonlinear layers between the input and the output layer.
As we will see, this definition does not apply to recurrent NNs.
So Q1: What is NN?
29/03/2017
Neural Networks
Key ingredients:
• Wiring: units and connections
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Fig.: two-layer threshold network for XOR with inputs x1, x2, hidden units z1, z2 and output y (edge weights 1 and -1, biases -0.5)
Fig.: Danny Bickson
Fig.: wikipedia
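The XOR wiring above can be sketched as a tiny threshold network. The weights and biases below are an assumption chosen to be consistent with the logical expression (the exact numbers of the figure are not fully recoverable), and the function names are illustrative only.

```python
import numpy as np

def step(a):
    """Threshold unit: 1 if the weighted input is positive, else 0."""
    return (np.asarray(a) > 0).astype(int)

def xor_net(x1, x2):
    # hidden units: z1 = x1 AND NOT x2, z2 = NOT x1 AND x2
    z1 = step(1.0 * x1 - 1.0 * x2 - 0.5)
    z2 = step(-1.0 * x1 + 1.0 * x2 - 0.5)
    # output unit: y = z1 OR z2
    return step(1.0 * z1 + 1.0 * z2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))
```

A single threshold unit cannot represent XOR, which is why the hidden units z1 and z2 are needed.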
Output of a unit
• linear/non-linear
• bounded/non-bounded
• usually monotonic, but not always
Why so rigid?
Traditional pattern recognition vs. CNN
Conv. Layer
Sub-sampling
….
Fully conn.
Receptive field
LeNet-5
LeNet-5 for handwriting recognition in [LeCun et al. 1998]
Key advantages:
• Fixed feature extraction vs. learning the kernel functions
• Spatial structure through sampling
• “Easier to train” due to far fewer connections than a fully connected network
Training: back propagation
By definition it is a feed-forward deep neural network.
Image classification with CNN
[Krizhevsky et al. 2012]
Advantages over LeNet:
• Local response normalization (normalize over the kernel maps at the same position) on top of ReLU (-1.2%..-1.4% in error rate)
• Overlapping pooling (-0.3..-0.4% in error rate)
• traditional image tricks: augmentation as horizontal flipping, subsampling, PCA over the RGB and noise (-1% in error rate)
• Dropout (we will discuss it later)
ImageNet: 150k test set and 1.2 million training images with 1000 labels.
Evaluation: top-1 and top-5 error rate
* - additional data
4096 dim. representation per image
Training: 5-6 days with 2 Nvidia GTX 580 3GB
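The top-1 and top-5 error rates used for evaluation can be sketched as follows. This is a minimal illustration on made-up scores, not the official ImageNet evaluation code; `top_k_error` is a hypothetical helper name.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # indices of the k largest scores per row
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# toy example: 3 samples, 3 classes (made-up numbers)
scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 0])
print(top_k_error(scores, labels, 1))
print(top_k_error(scores, labels, 2))
```

With 1000 labels, top-5 is more forgiving than top-1 because an image often contains several plausible objects.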
Recent results
[He et al. 2015]: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Parametric ReLU + zero mean Gaussian init + extremely deep (at the time…) network:
A: 19 layers, B: 22 layers, C: 22 layers with more filters
Training of model C: 8x Nvidia K40 GPUs, 3..4 weeks (!)
[He et al. 2015]: ResNet:“Is learning better networks as easy as stacking more layers?”
Several implementations
Framework | Restrictions | Wrapper | Architectures | Notes
Theano | Both feed forward and recurrent nets | Python | Multi core/CUDA | Multiple optimization
Chainer | Both feed forward and recurrent nets | Python | Multi core/CUDA | Multiple optimization
GraphLab | Feed Forward: CNN, DBN | Python/C++ | Multi core/CUDA/distributed | Compact, Hadoop
TensorFlow | Both feed forward and recurrent nets | Python/C++ | Multi core/CUDA/distributed | Graphical interface and multiple optimization
Caffe | Feed Forward: CNN, DBN | Python/Matlab | Multi core/CUDA |
Torch | Both feed forward and recurrent nets | Lua | Multi core/CUDA | Developed for vision
Ok, step back a little and … ☺
Probabilistic Graphical models (hmmm, RF?)
Set of random variables: X = {x_1, …, x_T}
Visualize connections with edges
Bayes Networks
Fig.: graph over the nodes x1, x2, x3, x4
Probabilistic Graphical models
Visualize connections with edges (with directed edges!!!) Conditional dependencies (A “causes” B) vs. MRF?
Bayes Networks
Fig.: directed graph over the nodes x1, x2, x3, x4
Probabilistic Graphical models
Set of random variables: X = {x_1, …, x_T}
Visualize connections with edges (with directed edges!!!)
Bayes Networks
Fig.: directed graph with edges x1 -> x2, x1 -> x3, x2 -> x4, x3 -> x4
Node distributions: P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)
P(x1) P(x2|x1) P(x3|x2,x1) P(x4|x3,x2,x1)
= P(x1) P(x2|x1) P(x3|x1) P(x4|x3,x2)
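The factorization above can be checked numerically. The sketch below assumes binary variables and uses made-up conditional probability tables (all numbers and names are hypothetical); it only illustrates that the product of the four local distributions is a valid joint distribution.

```python
import itertools

# Hypothetical parameters for binary variables (not from the lecture)
p_x1 = 0.3                                   # P(x1=1)
p_x2_given_x1 = {0: 0.2, 1: 0.9}             # P(x2=1 | x1)
p_x3_given_x1 = {0: 0.5, 1: 0.1}             # P(x3=1 | x1)
p_x4_given_x2x3 = {(0, 0): 0.05, (0, 1): 0.6,
                   (1, 0): 0.7, (1, 1): 0.95}  # P(x4=1 | x2, x3)

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4):
    """P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3)."""
    return (bern(p_x1, x1)
            * bern(p_x2_given_x1[x1], x2)
            * bern(p_x3_given_x1[x1], x3)
            * bern(p_x4_given_x2x3[(x2, x3)], x4))

total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=4))
print(total)
```

The full joint over four binary variables needs 15 free numbers, while the factorized form needs only 1 + 2 + 2 + 4 = 9; the saving grows quickly with the number of variables.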
Learning:
I) We know the structure (the dependencies)
Parameter estimation (prior, posterior)
Analytically or via optimization (EM, GD etc.)
II) We do not know the structure
optimize over the space of the possible trees…
and estimate parameters, optimization etc.
What kind of BN is a feed-forward network?
Bayes Networks
06/04/2017
Main neural network structures
1. Feed-forward Neural Networks:
• Bayes Networks:
• The nodes are either input, output or hidden
• Connections between the nodes: directed edges
• Presumption: finite set of nodes -> finite set of layers (are there any layers?)
• no directed cycles -> directed acyclic graphs!
• Posteriors are made out of:
• linear combination (edge weights) of inputs
• Activation function
where z_i^(l+1) is the linear combination (LC) for the i-th element of the (l+1)-th hidden layer, and f is the non-linear transformation
(common: f: R -> R! When is it not?)
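The formula belonging to this slide did not survive extraction. A common form consistent with the description, with the symbols W, b and a introduced here as assumptions, is z^(l+1) = W^(l) a^(l) + b^(l) followed by a^(l+1) = f(z^(l+1)). A minimal sketch under that assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a, W, b, f=sigmoid):
    """One feed-forward layer: linear combination z, then element-wise activation f."""
    z = W @ a + b
    return f(z)

rng = np.random.default_rng(0)
a0 = rng.normal(size=4)                          # input vector
W0, b0 = rng.normal(size=(3, 4)), np.zeros(3)    # first hidden layer (3 units)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)    # second layer (2 units)
a2 = layer_forward(layer_forward(a0, W0, b0), W1, b1)
print(a2.shape)
```

Here f acts element-wise (f: R -> R). Softmax is one example of an activation that is not element-wise, since each output depends on the whole vector z.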
Main neural network structures
1. Feed-forward Neural Networks:
Some common (not necessary!) restrictions:
• Disjoint set of nodes -> layers
• Exists an ordering of layers (so ordering of nodes!!)
• “Causality”: previous layer “causes” the next one
• each node is connected to
• Nodes in the previous layer (input nodes)
• Nodes in the next layer
• CDF activations
• Optimization via previously determined loss function (CDF?)
If it is fully connected:
• Each node in the previous layer is connected
• Each node in the next layer is connected
If so the Network is called Multi-Layer Perceptron (in short MLP)
Main neural network structures
1. Feed-forward Neural Networks: MLP vs. CNN
• Each node is applied to a subset of the input, but slides all over the image
It can be interpreted
• Either as a lot of fully connected nodes whose weights are mostly zero and whose non-zero weights are shared
• Or leave this complicated definition and simply define it as a convolution over the input:
(f*g)(x) = ∫ f(t) g(x-t)dt
Usually we think of it as a discrete convolution; in the case of images it is a 2D/3D or XD convolution.
What kind of input can we think of for 1D?
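For the 1D case, one natural input is a time series. The sketch below writes the discrete analogue of the convolution integral as a double loop; the signal and kernel values are made up for illustration.

```python
import numpy as np

def conv1d(f, g):
    """Discrete convolution (f*g)[n] = sum_t f[t] g[n-t], full output length."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for t in range(len(f)):
            if 0 <= n - t < len(g):
                out[n] += f[t] * g[n - t]
    return out

signal = [1.0, 2.0, 3.0, 4.0]   # e.g. a short time series
kernel = [0.5, 0.5]             # moving average over two neighbours
print(conv1d(signal, kernel))
```

The result agrees with `np.convolve(signal, kernel)`, which computes the same full convolution.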
Main neural network structures
1. Feed-forward Neural Networks: MLP vs. CNN
• Each node is applied to a subset of the input, but slides all over the image
Example:
1/16 1/8 1/16
1/8 1/4 1/8
1/16 1/8 1/16
What does it do? What are we changing during optimization?
The main advantage of the CNN over the MLP (in practice) is the highly reduced size of the parameter set: 32x32x128 vs. 3x3x128 (128 hidden nodes and a 32x32 input)
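A small experiment can help with the two questions above. The sketch assumes the example window is the usual 3x3 smoothing kernel with weights 1/16, 1/8 and 1/4 (reconstructed from the fractions on the slide) and applies it with a plain, unpadded 2D convolution; `conv2d_valid` is an illustrative helper, not a library function.

```python
import numpy as np

kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]]) / 16.0

def conv2d_valid(image, k):
    """2D convolution without padding (output shrinks by kernel size - 1)."""
    kh, kw = k.shape
    flipped = k[::-1, ::-1]          # convolution flips the kernel; this one is symmetric
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.zeros((5, 5))
image[2, 2] = 16.0                   # a single bright pixel
print(conv2d_valid(image, kernel))   # the pixel is smeared over its neighbourhood

# weights for a 32x32 input and 128 hidden nodes / filters (biases ignored)
mlp_params = 32 * 32 * 128
cnn_params = 3 * 3 * 128
print(mlp_params, cnn_params)
```

With a fixed kernel this is just blurring; in a CNN the nine kernel entries are the parameters that the optimization changes.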
We will talk about: Inception, ResNet, Maxout etc.
Main neural network structures
2. Recurrent Neural Networks:
• The nodes are either input, output or hidden
• Connections between the nodes: directed edges
• Presumption: finite set of nodes -> finite set of layers (are there any layers?)
• There are some directed cycles -> not a directed acyclic graph anymore … ☹
• Common: self loops only
• Posteriors are similar to FF
We will talk (two weeks from now) about:
classic BPTT model [Werbos et al., 1988]
LSTM [Hochreiter & Schmidhuber, 1997]
GRU [Cho et al., 2014]
Main neural network structures
3. Generative models
• The nodes are either input or hidden (no output!)
• Connections between the nodes: not necessarily directed edges
• Presumption: finite set of nodes -> finite set of layers (are there any layers?)
• There are some directed/undirected cycles -> not a directed acyclic graph ☹
• Posteriors are similar to FF, but we do not give at first any restriction to edges (full graph? ☹ ) , OK, we will ☺
We will talk about:
Boltzmann Machine [Hinton et al., 1983]
Restricted Boltzmann Machine, Harmonium [Smolensky et al., 1986]
Deep Belief Networks [Hinton et al., 2006]
Variational Autoencoders [Dayan et al., 1995]
Generative Adversarial Networks [Goodfellow et al., 2014]
Some unsolved problems
• Overfitting:
• DropOut [Hinton et al., 2012]
• DropConnect [Wan et al., 2013]
• Saturation, vanishing gradients and sparsity:
• pReLU [He et al., 2015]
• Maxout [Goodfellow et al., 2013]
• Network-in-Network [Lin et al., 2014]
• local response/batch normalization [Ioffe et al., 2015]
• BN Maxout NiN [Chang et al., 2015]
• Complexity:
• Convolution (in comp. to MLP)
• FastFood [Yang et al., 2015]
• Generalization [Zhang et al., 2017]
• Lower bounds [Lin & Tegmark, 2016]
• Architecture:
• spectral representation (pooling) [Rippel & Snoek, 2015]
• Identity map and residual block [He et al., 2015]
• Manifold tangent classifier, high-order contractive auto-encoder [Rifai et al., 2011]
Frameworks
Framework | Restrictions | Wrapper | Architectures | Notes
Theano | feed forward/recurrent NN | Python | Multi core/CUDA | Keras
Chainer | feed forward/recurrent NN | Python | Multi core(?)/CUDA |
GraphLab | Feed Forward: CNN, DBN | Python/C++ | Multi core/CUDA/distributed | Compact, Hadoop
TensorFlow | feed forward/recurrent NN | Python/C++ | Multi core/CUDA/distributed | Keras
Caffe | Feed Forward: CNN, DBN | Python/Matlab | Multi core/CUDA | Blob, for visual
Torch | feed forward/recurrent NN | Lua | Multi core/CUDA |
TensorFlow
Developed at Google
multiCore/GPU (CUDA only ☹)/Distributed
Python and C++
Lots of example code on GitHub: CNN, DBN, RNN, LSTM
Google’s example: MNIST
Install
Example: Ubuntu/Linux 64-bit, CPU only, Python 3.5
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc1-cp35-cp35m-linux_x86_64.whl
pip install --ignore-installed --upgrade pip setuptools
pip install --upgrade $TF_BINARY_URL
Import:
import tensorflow as tf
MNIST:
• 60000 images
• 28x28 resolution
• Only grayscale -> 1ch
• Labels: 0..9
https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/g3doc/tutorials/mnist/input_data.py
(if something changes they will change this tutorial)
MNIST handwriting recog.
# we will follow the tutorial
import tensorflow as tf
import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# set the basic variables
x = tf.placeholder("float", [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# the output is softmax
y = tf.nn.softmax(tf.matmul(x, W) + b)
# original labels
y_ = tf.placeholder("float", [None, 10])
# loss function
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
# gradient calculation based on cross entropy and the net (computational graph)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

Shallow model
# we start a session and run it
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
# 1000 times we update the model based on a small batch (sized 100)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
# evaluation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
MNIST shallow model
MNIST Convolutional NN
# a bit better suited for ipython
sess = tf.InteractiveSession()

# we can define variables (here the weights)
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# simple 2D conv.: step size 1, W is the window (conv. func.), x will be the input
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# simple pooling, subsampling with maximum, each 2x2 patch will be a single element
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# first and second Conv. Layers with pooling
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# fully connected layers on top of the convolutions -> flattening then MLP:
# 7x7 images with 64 channels (after the second pooling) and 1024 nodes in the hidden layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
# ReLU
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
# if dropout (later)
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
# Output layer, again with softmax
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# same loss function
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
# now with ADAM optimizer (Kingma et al. 2015), similar to Newton
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
# training via batches and evaluation
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())
for i in range(1000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
print('test accuracy %g' % accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
MNIST CNN
Same in Keras
# Install: pip install keras
# it is a wrapper over the Theano and TensorFlow backends (def. TF) and it utilizes the GPU if it can
from keras.datasets import cifar10, mnist
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D, normalization
from keras.utils import np_utils
import keras.backend as K

batch_size = 32
nb_classes = 10
nb_epoch = 10
data_augmentation = True

# input image dimensions
img_rows, img_cols = 32, 32
# The CIFAR10 images are RGB
img_channels = 3

# The data, shuffled and split between train and test sets:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

# two Conv + Pooling layers
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# FC layer with ReLU
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
# Output layer
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Train with ADAM
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
              validation_data=(X_test, Y_test), shuffle=True)
else:
    print('Using real-time data augmentation.')
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (0..180 deg)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images
    datagen.fit(X_train)
    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=nb_epoch,
                        validation_data=(X_test, Y_test))
Dropout (Hinton et al. 2012)
As we mentioned, deep networks have very high complexity.
Consequences:
• High generalization bound
• Overfitting
• Slow convergence
Idea: reduce overfitting by removing units during training, forming multiple “thinned” networks (similar to pruning in Decision Trees), so the neurons rely less on each other.
The result is an “unthinned” network (same size as the original) that is less overfitted and has downscaled weights (e.g. softmax -> w’ = pw, p=0.5).
The method works more like a regularization.
Hypothesis: dropout is model averaging (bagging)
Works well with large steps in parameter space (vs. manifold hypothesis?)
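The relation between the thinned training networks and the downscaled test-time weights w’ = pw can be sketched numerically. This is a toy illustration with made-up activations and weights; `dropout_train` and `dense_test` are hypothetical helper names, and p is taken as the probability of keeping a unit.

```python
import numpy as np

def dropout_train(a, p, rng):
    """Training: keep each unit with probability p, drop (zero) it otherwise."""
    mask = rng.random(a.shape) < p
    return a * mask

def dense_test(a, W, p):
    """Testing: use all units but downscale the outgoing weights, w' = p * w."""
    return (p * W) @ a

rng = np.random.default_rng(0)
p = 0.5
a = np.array([1.0, 2.0, 3.0, 4.0])   # activations of a hidden layer
W = rng.normal(size=(2, 4))          # weights to the next layer

# average the next layer's input over many randomly thinned networks
samples = np.stack([W @ dropout_train(a, p, rng) for _ in range(20000)])
print(samples.mean(axis=0))          # close to the test-time value below
print(dense_test(a, W, p))
```

For a linear layer the average over thinned networks matches the downscaled network exactly in expectation; with non-linearities in between it is only an approximation, which is where the model-averaging hypothesis comes in.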
Fig.: standard fully connected Neural Network vs. after Dropout during training (figures by Srivastava et al. 2014)
Fig.: standard feed forward unit vs. with Dropout
Some notes:
Why is it helping? -> Jump!
Global sampling?
Residual sampling?
And CDF ☺
Dropout over MNIST (figures by Srivastava et al. 2014)
Fig.: training phase vs. testing phase (figures, results by Srivastava et al. 2014)
Dropout: experiments MNIST (results by Srivastava et al. 2014)
Batch normalization (Ioffe & Szegedy, 2015)
o Instead of standard normalization of the input
o Estimation over the batch instead of over the training set
o Learn an affine transformation of the normalized input (a lot of new parameters)
o In practice: apply BN immediately before the non-linear transformation -> the new parameters are only 2x the size of the output
o For CNN aggregation needed (consistent normalization)
o CDF ☺
4.9% top-5 error on ImageNet with significantly fewer epochs
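A minimal sketch of the training-time forward pass, assuming the usual formulation (normalize each unit over the batch, then apply a learned scale gamma and shift beta). The running averages used at test time and the backward pass are left out.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the learned affine map."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(64, 5))   # pre-activations: batch of 64, 5 units
gamma, beta = np.ones(5), np.zeros(5)              # 2 new parameters per output unit
out = batch_norm(z, gamma, beta)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```

Because gamma and beta are per output unit, placing BN right before the non-linearity adds only twice the layer's output size in new parameters.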
Maxout (Goodfellow et al. 2013)
• MLP inside an activation function
• Piecewise linear approximation of convex functions -> universal approximator with “enough”
affine components
• For a single hidden unit with k affine feature maps:
h_i(x) = max_{j ∈ [1,k]} z_ij, where z_ij = x^T W_{·ij} + b_ij
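A small sketch of a maxout unit, assuming the usual definition h(x) = max_j (x^T W_j + b_j) over k affine feature maps. The tensor layout (d, m, k) is a choice made here for illustration. With two pieces, x and -x, a single unit reproduces the absolute value, a simple convex piecewise linear function.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout units: W has shape (d, m, k), b has shape (m, k); returns m outputs."""
    z = np.einsum('d,dmk->mk', x, W) + b   # k affine feature maps per unit
    return z.max(axis=1)

# one unit, two affine pieces (x and -x) -> absolute value
W_abs = np.array([[[1.0, -1.0]]])          # d=1, m=1, k=2
b_abs = np.zeros((1, 2))
print([float(maxout(np.array([v]), W_abs, b_abs)[0]) for v in (-2.0, 0.5, 3.0)])
```

Replacing the second piece by the constant 0 would give ReLU, so maxout can learn its own activation shape.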
Network In Network (Lin et al., 2014)
• Non-linear convolution
• Global average pooling (average of feature maps)
vs. fully connected
11.59% (FC) vs. 10.89% (FC+DO) vs. 10.41% (GAP) error on CIFAR-10 vs. maxout:
IL – Integer bits FL – fractional bits WL – word length
Boltzmann Machines (RBM)
Boltzmann machines were introduced by Hinton and Sejnowski in 1983 -> general model
Actually it is a recurrent NN… with a full graph at first … Hopfield network: order!
Visible units (V)
Hidden elements (H)
Markov Random Field:
Hammersley-Clifford theorem: the JPD factorizes over the maximal cliques ...
Intractable in practice (inference is compl.)
Inference is similar to Simulated Annealing
Simulated Annealing (Kirkpatrick et al. 1983)
The gradient-based optimization methods we learned about are in a way deterministic: with the same input they make the same updates (the stochastic nature comes from the sampling of the batches, not from the optimization itself).
In contrast, in SA:
• We define an energy function over the configurations (states): E(s)
• And a probability function which controls the transition from a state with energy e1 to a state with energy e2: P(e1,e2,T)
where T is the “global” temperature.
Some presumptions: e1 is the energy of the last (current) configuration and e2 is the energy of the candidate configuration
• If T goes to zero, P(e1,e2,T) goes to zero if e2>e1; at T=0 we accept only moves towards the local min. (GD!)
• If e2<e1 then P(e1,e2,T) should go to 1 (towards minimum)
• Lots of versions: what if e1>>e2? And if e1<<e2?
• P(e1,e2,T) varies smoothly with the difference between e1 and e2
• How to cool down the system?
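One common choice that satisfies the presumptions above is the Metropolis acceptance rule with a geometric cooling schedule. The sketch below is only one of the many versions mentioned; the energy function, the step size and the cooling rate are made-up illustrative values.

```python
import math
import random

def accept_prob(e1, e2, T):
    """Metropolis rule: always accept improvements, accept worse moves with exp(-(e2-e1)/T)."""
    if e2 <= e1:
        return 1.0
    if T <= 0:
        return 0.0
    return math.exp(-(e2 - e1) / T)

def anneal(energy, s0, steps=5000, T0=1.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    s, T = s0, T0
    best = s
    for _ in range(steps):
        cand = s + rng.uniform(-0.5, 0.5)          # candidate configuration near s
        if rng.random() < accept_prob(energy(s), energy(cand), T):
            s = cand
        if energy(s) < energy(best):
            best = s
        T *= cooling                                # geometric cooling schedule
    return best

def energy(s):
    """Toy energy with two wells; the left one is deeper."""
    return (s * s - 1.0) ** 2 + 0.3 * s

print(anneal(energy, s0=2.0))
```

At high temperature uphill moves are accepted often, which lets the search leave a local well; as T shrinks the rule approaches a greedy descent.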
Harmonium by Smolensky in 1986 and RBM in 2002 by Hinton with Contrastive Divergence as a fast learning solution.
Restricted (hence the name): Boltzmann/Gibbs factorizes over the cliques in the graph -> no connection between the units in the same layer, therefore we control the cliques (BM?).
Fig.: visible units (V) and hidden layer (H) connected by the weights W
Restricted Boltzmann Machines (RBM)
Fig.: visible units (V) and hidden layer (H) connected by the weights W
The energy function:
h is usually a sigmoid.
Gradient: estimation is difficult -> GS ☹
Restricted Boltzmann Machines (RBM)
Maximizing the log-likelihood of the data, or minimizing the KL divergence between the data distribution and the equilibrium distribution:
Contrastive divergence [Hinton 2002]:
Key idea: minimize the difference between the KL divergences over the data and the first reconstruction:
-> The problematic part cancels out, but there is a remaining element which according to Hinton can be ignored in practice
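A rough sketch of one CD-1 update for a binary RBM. It assumes the standard energy E(v,h) = -a^T v - b^T h - v^T W h, sigmoid conditionals and a single Gibbs step (the "first reconstruction"); the slide's own formulas did not survive extraction, so the notation a, b, W and the function name are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, lr, rng):
    """One contrastive divergence (CD-1) step for a binary RBM on a batch v0."""
    # positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one Gibbs step: reconstruct the visible units, then the hidden probabilities
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # data statistics minus first-reconstruction statistics
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a = a + lr * (v0 - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
data = (rng.random((100, 6)) < 0.5).astype(float)   # toy binary data, 6 visible units
W = 0.01 * rng.normal(size=(6, 3))                  # 3 hidden units
a, b = np.zeros(6), np.zeros(3)
for _ in range(10):
    W, a, b = cd1_update(data, W, a, b, lr=0.1, rng=rng)
print(W.shape, a.shape, b.shape)
```

Because there are no connections inside a layer, all hidden units can be sampled at once given the visible ones and vice versa, which is what makes the single Gibbs step cheap.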
Deep Belief Network (DBN)
Similar to the idea of boosting [Freund and Schapire 1995]
A straightforward application of CD to a deep RBM network is not suitable.
Idea: allow each model in the sequence to receive a different representation of the data.
With l hidden layers and x as input:
Greedy layer-wise learning [Hinton et al. 2006]:
learn layer by layer, using the output of the previous layer as input to the next
28/04/2017
Main neural network structures
1. Feed-forward Neural Networks:
• Bayes Networks
• Back propagation
• CNN, MLP etc.
2. Recurrent Neural Networks:
• Loops -> not a DAG…
• Usually: self loop
3. Generative models
Main neural network structures
3. Generative models
• The nodes are either input or hidden (no output!)
• Connections between the nodes: not necessarily directed edges
• Presumption: finite set of nodes -> finite set of layers (are there any layers?)
• There are some directed/undirected cycles -> not a directed acyclic graph ☹
• Posteriors are similar to FF, but at first we do not give any restrictions to edges (full graph? ☹ ) , OK, we will ☺
We will talk about:
Boltzmann Machine [Hinton et al., 1983]
Restricted Boltzmann Machine, Harmonium [Smolensky et al., 1986]
Deep Belief Networks [Hinton et al., 2006]
Variational Autoencoders [Dayan et al., 1995] (next week)
Generative Adversarial Networks [Goodfellow et al., 2014] (next week)
Standard fully connected Neural Network After Dropout during training (figures by Srivastava et al. 2014)
Dropout (Hinton et al. 2012) recap
Midterm
Midterm: Good work! Remember: with a good project + 3xPS + midterm -> grade without exam
1. Hierarchical clustering: the strategies only differ if the clusters are not individual clusters
2. If a node is homogeneous -> we do not split it, pruning?
3. a) What is the main difference between the discriminative and the generative models?
b) Why is teleportation usually necessary for PageRank in practice?
c) What is the difference between a greedy and a lazy algorithm?
d) What happens with a deep feed-forward network if we use only linear activation functions? Why?
e) What is the difference between Collaborative Filtering and Content Based Recommendation Systems?
f) When are we using ROC AUC instead of F-measure, Precision etc.? Why?
4. b) was non-separable, how to solve it?
5. Too much calculation but it was OK
Grades:
points: grade #
16+: 5 3
12-15.5: 4 3
10-11.5: 3 1
8-9.5: 2 0
8-: 1 1
Recurrent Neural Networks (RNN)
Simulates a discrete-time dynamical system [Rumelhart et al. 1986]
Three components:
x_t: input at time t
y_t: output at time t
h_t: hidden state at time t
The connections between the layers are straightforward:
In comparison to feed-forward networks, the main difference is the connection between the current and the last hidden state (a loop in the network) -> it can carry along information about the previous inputs! But for how long?
Recurrent Neural Networks (RNN)
Given a sequence of samples
Estimation of the parameters (Θ) of the RNN is based on minimizing the following additive cost function:
where
The d(y,f(h)) is some penalty function (divergence, distance etc.).
Feed forward representation of RNN
This unfolded representation is already “deepish” ☺ but with the same weights at each layer (time)
Fig.: the network unfolded over time=0..3, with the same weights w1, w2, w3, w4 at every time step (figures by Geoffrey Hinton)
Recurrent Neural Networks (RNN)
A particular example:
where W,U and V are the weight matrices and the Φ functions are some bounded non-linear functions, such as the sigmoid.
The parameters of this conventional RNN can be estimated by SGD over the cost function with back propagation through time [Rumelhart et al. 1986]. The trick is to unfold the network; after back propagation we average the weights through time so that the functions stay identical at every step (as we assumed initially).
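The formula of the particular example did not survive extraction. A common form that matches the description (weight matrices W, U, V and bounded non-linearities Φ) would be h_t = Φ_h(W h_{t-1} + U x_t) and y_t = Φ_o(V h_t); the sketch below assumes this form, with tanh and sigmoid as the two Φ functions.

```python
import numpy as np

def rnn_forward(xs, W, U, V, h0):
    """Run a conventional RNN over a sequence; the same W, U, V are used at every step."""
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(W @ h + U @ x)                  # new hidden state
        y = 1.0 / (1.0 + np.exp(-(V @ h)))          # output through a sigmoid
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5
W = 0.5 * rng.normal(size=(n_hid, n_hid))
U = 0.5 * rng.normal(size=(n_hid, n_in))
V = 0.5 * rng.normal(size=(n_out, n_hid))
xs = rng.normal(size=(T, n_in))
hs, ys = rnn_forward(xs, W, U, V, h0=np.zeros(n_hid))
print(hs.shape, ys.shape)
```

The loop is exactly the unfolded network: T copies of the same layer, one per time step, which is what back propagation through time differentiates.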
The question remains, how to “deepen” RNN?
Recurrent Neural Networks (RNN)
Some advances in recurrent Neural Networks
Stacked RNN (sRNN [Schmidhuber 1992, El Hihi and Bengio 1996]):
- stacking multiple recurrent hidden layers on top of each other
- modeling multiple time scales in the input sequence
[Pascanu et al. 2014]: three types of expansions:
- deep Input-to-Hidden function (temporal neighbours in NLP [Mikolov et al. 2013])
- deep Hidden-to-Hidden function (DT-RNN) with shortcuts to preserve the responsiveness of RNN
- deep Hidden-to-Output function (DO-RNN)
Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)
Fig.: RNN, DT-RNN, DT-RNN with shortcut, DOT-RNN, sRNN (Pascanu et al. 2014)
Evaluation: log-probability of the next sample
1. Experiment: polyphonic music prediction over Nottingham, JSB Chorales and MuseData
2. Experiment: character-level and word-level language modeling over Penn Treebank Corpus
Notes: SGD for training. Learning rates are manually tuned.
Evaluation: negative log-probability
1. Experiment: polyphonic music prediction over Nottingham, JSB Chorales and MuseData
• Gradients were computed over at most 200 time steps for the first two datasets and 50 for MuseData.
• If the song changes at the end of a subsequence, the hidden states are reset.
(*) maxout units in the deep output function instead of sigmoid and dropout (we will examine them later)
Evaluation: negative log-probability
2. Experiment: Character-level and word-level language modeling over Penn Treebank Corpus
ReLU instead of sigmoid for the word-level.
Additional methods:
• 1k long-short-term memory [Graves et al. 2013]
• shallow RNN with a larger hidden state [Mikolov et al. 2011]
Long-short term memory in general
(Greff et al., 2015)
Long short term memory (LSTM)
[Hochreiter & Schmidhuber 1997] actually 2005 ☺
Fig: Christopher Olah
Forget gate: carry or not carry (on)
Input gate: selection and new state
The new state
Output: combined with the new state
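The four ingredients above (forget gate, input gate with candidate, new state, output) can be sketched as one NumPy step. This is a plain LSTM without peepholes; the stacked weight layout and the name lstm_step are assumptions chosen for compactness.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step without peepholes.
    W has shape (4*n_h, n_h + n_in) and stacks the forget, input,
    candidate and output weights; b has shape (4*n_h,)."""
    n_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n_h])              # forget gate: carry the old state or not
    i = sigmoid(z[n_h:2 * n_h])        # input gate: select what to write
    g = np.tanh(z[2 * n_h:3 * n_h])    # candidate values for the new state
    o = sigmoid(z[3 * n_h:4 * n_h])    # output gate
    c = f * c_prev + i * g             # the new state
    h = o * np.tanh(c)                 # output: gate combined with the new state
    return h, c

n_in, n_h = 3, 4                       # assumed toy sizes
W = np.zeros((4 * n_h, n_h + n_in))
b = np.zeros(4 * n_h)
b[0:n_h] = 50.0                        # forget gate saturated at ~1 (carry on)
b[n_h:2 * n_h] = -50.0                 # input gate saturated at ~0 (write nothing)
c_prev = np.array([0.5, -1.0, 2.0, 0.0])
h, c = lstm_step(np.ones(n_in), np.zeros(n_h), c_prev, W, b)
```

With the gates saturated this way the cell state is carried over unchanged, which is the mechanism that lets the gradient survive over long time spans.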
Gated recurrent unit, GRU (Cho et al., 2014)
Fig: Christopher Olah
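For comparison with the LSTM, a minimal GRU step in NumPy. Biases are omitted, and the update convention h = (1 - z) * h_prev + z * h_tilde is an assumption (some papers swap the roles of z and 1 - z); the name gru_step is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each weight matrix has shape (n_h, n_h + n_in)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                      # update gate
    r = sigmoid(Wr @ hx)                                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                   # no separate cell state

n_in, n_h = 3, 4                                              # assumed toy sizes
h_prev = np.array([1.0, -2.0, 0.5, 0.0])
zeros = np.zeros((n_h, n_h + n_in))
h_new = gru_step(np.ones(n_in), h_prev, zeros, zeros, zeros)
```

There are three weight matrices instead of the LSTM's four and no separate cell state, which is the sense in which the GRU is the simpler unit.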
Notes on LSTM
Hyperparameters:
• Random search… [Anderson, 1953, Solis & Wets, 1981]
The best size of the hidden layer is (nearly) independent of the learning rate,
so the two can be tuned separately: first the learning rate on a small
network, then the size of the layer
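A sketch of this two-stage tuning with random search. The function validation_loss is a hypothetical stand-in for "train a network and return its validation loss"; its shape, the search ranges and the candidate sizes are all assumptions.

```python
import numpy as np

def validation_loss(lr, n_hidden):
    """Hypothetical stand-in for training a network and measuring validation loss.
    By construction the two hyperparameters do not interact here."""
    return (np.log10(lr) + 2.0) ** 2 + (np.log2(n_hidden) - 7.0) ** 2 / 10.0

def random_search_lr(n_trials, n_hidden, rng):
    """Sample learning rates log-uniformly and keep the best one."""
    lrs = 10.0 ** rng.uniform(-5.0, 0.0, size=n_trials)
    losses = [validation_loss(lr, n_hidden) for lr in lrs]
    return float(lrs[int(np.argmin(losses))])

rng = np.random.default_rng(0)
# Stage 1: tune the learning rate on a small (cheap) network.
best_lr = random_search_lr(n_trials=50, n_hidden=16, rng=rng)
# Stage 2: keep that learning rate fixed and choose the layer size.
sizes = [16, 32, 64, 128, 256, 512]
best_size = min(sizes, key=lambda n: validation_loss(best_lr, n))
```

Sampling the learning rate on a log scale is the usual choice because its useful values span several orders of magnitude.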
Notes on LSTM
Performance of various versions:
• They are actually very similar [Greff et al., 2015, Chung et al., 2014]
• GRU is similar in performance but simpler than regular LSTM
• Forget gate was introduced in 2000 [Gers et al., 2000]
• Recurrent connections between all gates -> overfit
• Bidirectional LSTMs are better
• Full gradient was introduced only in 2005
• Forget gate and output activation are crucial
• For text and image captioning we need attention (next week). What could it be?
Attention [Xu et al., 2015]
Trick: instead of hard-wiring the input selection -> distribution
• Differentiable ☺
• Distribution: another RNN’s softmax output -> we can train it!
• Overfitting…
Soft vs. hard attention:
• Soft: linear combination of location vectors (image parts)
• Hard: one-hot coded
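The difference between the two variants in a few lines of NumPy. The scores would come from the trained network; here they are random, and the sizes and function names are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def soft_attention(locations, scores):
    """Soft: the context is a linear combination of all location vectors."""
    alpha = softmax(scores)
    return alpha @ locations, alpha

def hard_attention(locations, scores, rng):
    """Hard: sample one location from the distribution, i.e. a one-hot selection."""
    alpha = softmax(scores)
    idx = rng.choice(len(alpha), p=alpha)
    one_hot = np.zeros_like(alpha)
    one_hot[idx] = 1.0
    return one_hot @ locations, one_hot

rng = np.random.default_rng(0)
locations = rng.normal(size=(6, 4))   # |L| = 6 image-part vectors of dimension 4
scores = rng.normal(size=6)           # stand-in for the network's attention scores
context_soft, alpha = soft_attention(locations, scores)
context_hard, one_hot = hard_attention(locations, scores, rng)
```

The soft variant is differentiable end to end; the sampling step in the hard variant is not, so it needs a different training procedure.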
What are the shortcomings?
Are the vectors additive? (images?)
Connected or non-connected components?
In image caption, if hard coded: |L| vs. T vs. |W|?
Distribution of importance of locations?
Object vs. concept detection…
(cat, bird vs. daylight, winter etc.)
BUT: do we need attention? ☺
CNN + WE + LSTM: image caption
Image caption (Vinyals et al, 2016)
Ensemble, BeamSearch and scheduled sampling
[Szegedy et al., 2014]:
Inception: replace convolutions with smaller but deeper mini networks -> dimension reduction
[Ioffe & Szegedy, 2015]: Batch normalization
Image model (GoogLeNet + BN)
Word embedding [Y. Bengio et al., 2006]
An actual language model:
Predict the next word from the context
Input representation:
One-hot encoding (dim. is the size of the dictionary)
This is the original model; recent models use smoothed input word representations
Interesting property:
King - Man + Woman is close to Queen in L2
-> [Rothe, Ebert & Schütze, 2016]: orthogonal word embedding, polarity
In image captioning (IC): 512-dimensional embedding
Regular WE
Architecture: Context (dictionary) -> Hidden layer -> Softmax (dictionary)
Language model:
P(w_t | N(w_t)) -> max log-likelihood
Input: context of the current word
"Cat sits on sofa" -> context:
1. Cat, on, sofa
2. Cat
etc.
Hidden layer: usually tanh
Output: distribution of terms -> p(w_cat | context)
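A forward pass of such a model on the toy sentence, as a sketch. The four-word vocabulary, the bag-of-words (multi-hot) encoding of the context and the random, untrained weights are all assumptions.

```python
import numpy as np

vocab = ["cat", "sits", "on", "sofa"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def encode_context(words):
    """Sum of one-hot vectors: dimension = size of the dictionary."""
    v = np.zeros(len(vocab))
    for w in words:
        v[word_to_id[w]] = 1.0
    return v

def predict(context_words, E, W_out):
    """Context (dictionary) -> tanh hidden layer -> softmax (dictionary)."""
    x = encode_context(context_words)
    h = np.tanh(E.T @ x)               # hidden layer; rows of E are the embeddings
    logits = W_out @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
emb_dim = 3                            # assumed toy embedding size
E = rng.normal(scale=0.1, size=(len(vocab), emb_dim))
W_out = rng.normal(scale=0.1, size=(len(vocab), emb_dim))

p = predict(["cat", "on", "sofa"], E, W_out)       # P(w_t | N(w_t))
log_likelihood = np.log(p[word_to_id["sits"]])     # the quantity training maximizes
```

Training adjusts E and W_out to maximize this log-likelihood over the corpus; the rows of E are then used as the word embeddings.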
Orthogonality and polarity [Rothe, Ebert, Schütze 2016]
Dimension reduction, but instead of reducing, ranking:
1. Sentiment and non-sentiment information
2. Concreteness information
3. Frequency information
Assumption:
They do not correlate!
Q: orthogonal matrix transformations or SO(2) Lie group
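Why an orthogonal transformation is a natural choice: it re-arranges the coordinates without changing any L2 distance between words. A small NumPy check; the random Q from a QR decomposition is an assumption standing in for the learned matrix, and the embeddings are random toy data.

```python
import numpy as np

def pairwise_l2(X):
    """Matrix of L2 distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
d = 5
emb = rng.normal(size=(4, d))                  # 4 toy word vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
rotated = emb @ Q                              # same geometry, new coordinates

dist_before = pairwise_l2(emb)
dist_after = pairwise_l2(rotated)

# After a suitable (learned) Q, one property such as sentiment would be read
# from a few leading coordinates; here we only take the first one.
subspace = rotated[:, :1]
```

Because nothing is lost by the rotation, the coordinates can be ranked by how much of a property they carry instead of being thrown away.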
LSTM in Vinyals et al. 2016
Not multidimensional RNN!
The output of the image model (BN) is translated into a 512-dimensional representation to match the WE
Simple LSTM, no peepholes
Putting everything together:
CNN (BN Inc.) + WE (d=512) + LSTM (#hidden=512) : image caption
Image caption (Vinyals et al, 2016)
Ensemble, BeamSearch and scheduled sampling -> Attention?
Image caption (Vinyals et al, 2016)
o Pre-trained image model: trained on ImageNet, fine-tuning (tricky) helped a bit
o Word embedding was not pre-trained
o Ensemble:
Multiple models with different initialization, learning parameters or even different networks
o BeamSearch:
Consider the k best sentences before generating the next word
Beam size matters, actually k=3 was the best on the MS COCO challenge
o Optimization: SGD with fixed learning rate and without momentum + Dropout (?)
o Transfer Learning: models trained on different datasets
o Scheduled sampling: curriculum learning strategy, flip a coin to use the predicted or the true previous word
Together 20+% in performance
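A minimal beam search sketch over a toy model. The bigram table MODEL is hypothetical and stands in for the LSTM's next-word distribution; the probabilities are invented for illustration.

```python
import math

# Hypothetical toy bigram model: P(next word | previous word).
MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(model, k, max_len=10):
    """Keep the k best partial sentences before generating the next word."""
    beams = [(0.0, ["<s>"])]           # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, sent in beams:
            for word, p in model[sent[-1]].items():
                candidates.append((logp + math.log(p), sent + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, sent in candidates[:k]:
            if sent[-1] == "</s>":
                finished.append((logp, sent))
            else:
                beams.append((logp, sent))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])

greedy = beam_search(MODEL, k=1)       # k = 1 is greedy decoding
beam2 = beam_search(MODEL, k=2)
```

In this toy model greedy decoding commits to "a" after the first step and ends with probability 0.3, while k = 2 keeps "the" alive and finds "the cat" with probability 0.36.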
Embeddings and LSTM with Keras tut.
Keras includes some datasets: IMDB movies reviews (25k), labels: pos. or neg.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras.backend as K
from keras.datasets import imdb  # IMDB dataset
from keras.models import Sequential, Model  # the LSTM is just a layer!!! -> CDFFN
from keras.layers import Dense, Input, Dropout, Activation, Flatten
from keras.layers import LSTM  # LSTM
from keras.layers import Embedding  # Embedding
from keras.preprocessing import sequence
%matplotlib inline
# we will only use the most frequent 1k terms (faster)
topk = 1000
# the argument is called num_words in Keras 2+ (nb_words in Keras 1)
(x_tr_raw, y_tr), (x_te_raw, y_te) = imdb.load_data(num_words=topk)
print('raw tr: %s te: %s' % (str(x_tr_raw.shape), str(x_te_raw.shape)))
# pad the sequences (they originally vary in length, as expected)
# keep only 100 terms per review (again because it is faster);
# by default pad_sequences truncates from the front, so the last 100 terms are kept
max_len = 100
x_tr = sequence.pad_sequences(x_tr_raw, maxlen=max_len)
x_te = sequence.pad_sequences(x_te_raw, maxlen=max_len)
print('after tr: %s te: %s' % (str(x_tr.shape), str(x_te.shape)))
emb_dim = 32  # each term is embedded into a small 32-dimensional vector
lstm_units = 10  # number of LSTM units in the LSTM layer
epochs = 10  # train it for 10 epochs
batch_size = 32  # mini-batch size
# create the model (or build the computational graph)
model = Sequential()  # still a sequential model!
# max_len controls the sequence length, but in our case it is always 100
model.add(Embedding(topk, emb_dim, input_length=max_len))
# finally we add a recurrent layer :)
model.add(LSTM(lstm_units))
model.add(Dense(1, activation='sigmoid'))  # logreg binary classifier on the top
# compilation: binary, not categorical, cross-entropy since there are only two classes
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # prints the summary itself and returns None
for i in range(epochs):
    # one epoch per call; the argument is epochs in Keras 2+ (nb_epoch in Keras 1)
    model.fit(x_tr, y_tr, validation_data=(x_te, y_te), epochs=1, batch_size=batch_size)
Object detection
Fig. Kaiming He
Traditional models:
o Haar wavelet [Poggio et al., 1998] and Haar-like features [Viola and Jones, 2001]
  Rigid features + SVM or AdaBoost (and also reduce the number of features)
o Deformable parts model [Felzenszwalb et al. 2010]
o 100 Hz: HOG + Boosted Trees etc. [Benenson et al. 2012]
And non-rigid feature extraction (CNN):
o R-CNN [Girshick et al., 2014]: Regions with CNN features
o SPP-net [He et al., 2014]: Spatial Pyramid Pooling in DNN
o Fast R-CNN [Girshick et al., 2015]: CNN feature maps
o RegionLet [Wang et al., 2015]: Integral image over CNN
o Faster R-CNN [Ren et al., 2015]: Region Proposal Network
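The integral image is what makes rectangular (Haar-like) features cheap: after one pass over the image, the sum over any rectangle costs four lookups. A NumPy sketch on a toy array; the function names and the two-rectangle feature layout are assumptions for illustration.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; a zero row and column are padded in front."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half (even width)."""
    mid = left + width // 2
    left_sum = box_sum(ii, top, left, top + height, mid)
    right_sum = box_sum(ii, top, mid, top + height, left + width)
    return left_sum - right_sum

img = np.arange(20, dtype=np.int64).reshape(4, 5)   # toy "image"
ii = integral_image(img)
region = box_sum(ii, 1, 1, 3, 4)
feature = haar_two_rect(ii, 0, 0, 4, 4)
```

The same trick applies to any per-pixel map, including CNN feature maps, which is the idea behind computing an integral image over CNN features.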
R-CNN vs. Fast R-CNN
Fig. Kaiming He
R-CNN:
o CNNs over the candidate regions
o Rigid region size
o High complexity
o In practice: separate SVM over the features, not CDF ☹
Fast R-CNN:
o One CNN per image
o Feature map per pixel (filters/pixel)
o Arbitrary sized regions of feature maps
o Much faster than R-CNN
o SVM over the candidate feature maps
o Still not CDF ☹
Faster R-CNN [Ren et al., 2015]
Instead of complex search:
o Region proposals:
different sized but rectangular regions on multiple scales (attention? ☺ )
o Still rigid
o End-to-end optimization: CDF ☺
o Very fast <200ms
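A sketch of how such rectangular candidate regions ("anchors") can be laid out: one set of boxes per feature-map position, in several sizes and aspect ratios. The stride of 16 and the function name are assumptions; three scales and three ratios give nine boxes per position.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return boxes (x1, y1, x2, y2), one per position, scale and ratio.
    A ratio r means height / width; the box area stays scale**2."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # centre in image pixels
            for s in scales:
                for r in ratios:
                    w = s / np.sqrt(r)
                    h = s * np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(feat_h=2, feat_w=3, stride=16,
                       scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
```

The region proposal network then scores each of these boxes and regresses offsets for it, which replaces the separate search step.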
Detect: drones and their trajectories
R-CNN methods are not suitable
Time constraint: at most 100 ms to act
Computational constraints: Nvidia Tegra X1, but it is actually very slow (16x 7x7 convolution ~10 ms)
Idea: high-recall / low-precision detection with traditional models
+ MLP/CNN + LSTM with Fisher and later with additional sensors
And now: project works ☺