(1)

Data Mining algorithms

2017-2018 spring

04.11-20.2018

1. NN and BN

28/03/2017

(2)

Plan

W1 February 7-9: Introduction, kNN, evaluation
W2 February 14-16: Evaluation, Decision Trees
W3 February 21-23: Linear separators, iPython, VC theorem
W4 February 28-March 2: Linear separators, iPython, maximal margin
W5 March 7-9: SVM, VC theorem and Bottou-Bousquet
W6 March 14-16: Clustering (hierarchical, density based etc.), GMM, MRF, Apriori and association rules
W7 March 21-23: Recommender systems
W8 March 28: Centrality, generative models
W9 April 4-6: Holiday
W10 April 11-13: Basics of neural networks, Bayes networks
W11 April 18-20: Midterm, Sontag-Maas-Bartlett, BN
W12 April 25-27: CNN, MLP, Dropout, Batch normalization, RNN, LSTM, GRU
W13 May 2-4: Attention, Image caption, Turing Machine
W14 May 9-11: RBM, DBN, VAE, GAN
W15 May 16-18: Boosting, Time series
W16 May 23-25: Time series, Projects on Friday

(3)

Neural networks, briefly deeply

Hypothesis: deep, hierarchical models can be exponentially more efficient than shallow ones [Bengio et al. 2009, Le Roux and Bengio, 2010, Delalleau and Bengio, 2011 etc.]

[Delalleau and Bengio, 2011]: a deep sum-product network may require exponentially fewer units to represent the same function than a shallow sum-product network.

What is the depth of a Neural Network?

For feed forward networks, it is the number of nonlinear layers between the input and the output layer.

As we will see, this definition does not apply to recurrent NNs.

So Q1: What is NN?

(4)

29/03/2017

Neural Networks

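Key ingredients:

• Wiring: units and connections

[Figure (Danny Bickson): XOR = (x₁ AND NOT x₂) OR (NOT x₁ AND x₂), built from threshold units. Hidden unit z₁ = x₁ AND NOT x₂: weights 1 (x₁), -1 (x₂), bias -0.5. Hidden unit z₂ = NOT x₁ AND x₂: weights -1 (x₁), 1 (x₂), bias -0.5. Output y = z₁ OR z₂: weights 1 (z₁), 1 (z₂), bias -0.5.]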
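The wiring in the XOR figure can be checked with a few lines of Python. This is a minimal sketch, assuming hard threshold units (output 1 when the weighted sum is positive) and the weights and biases read off the figure; the function names `step` and `xor_net` are made up for illustration.

```python
def step(a):
    """Threshold unit: fires (1) when the weighted input is positive."""
    return 1 if a > 0 else 0


def xor_net(x1, x2):
    z1 = step(1 * x1 - 1 * x2 - 0.5)     # x1 AND NOT x2
    z2 = step(-1 * x1 + 1 * x2 - 0.5)    # NOT x1 AND x2
    return step(1 * z1 + 1 * z2 - 0.5)   # z1 OR z2


# Truth table over all four binary inputs.
truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold unit can produce this table, which is why the hidden units z₁ and z₂ are needed.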

(5)

Output of a unit

• linear/non-linear
• bounded/non-bounded
• usually monotonic, but not always

Why so rigid?

Fig.: wikipedia

(6)

Traditional pattern recognition vs. CNN

Conv. Layer

Sub-sampling

….

Fully conn.

Receptive field

(7)

LeNet-5

LeNet-5 for handwriting recognition in [LeCun et al. 1998]

Key advantages:

• Fixed feature extraction vs. learning the kernel functions

• Spatial structure through sampling

• “Easier to train” because it has far fewer connections than a fully connected network

Training: back propagation

By definition it is a feed forward deep neural network.

(8)

Image classification with CNN

[Krizhevsky et al. 2012]

Advantages over LeNet:

• Local response normalization (normalize over the kernel maps at the same position) on top of ReLU (-1.2%..-1.4% in error rate)

• Overlapping pooling (-0.3..-0.4% in error rate)

• traditional image tricks: augmentation as horizontal flipping, subsampling, PCA over the RGB and noise (-1% in error rate)

• Dropout (we will discuss it later)

(9)

Image classification with CNN

[Krizhevsky et al. 2012]

ImageNet: 150k test set and 1.2 million training images with 1000 labels.

Evaluation: top-1 and top-5 error rate

* - additional data

4096 dim. representation per image

Training: 5-6 days with 2 Nvidia GTX 580 3GB GPUs

(10)

Recent results

[He et al. 2015]: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Parametric ReLU + zero-mean Gaussian init + extremely deep (at the time…) networks:

A: 19 layers, B: 22 layers, C: 22 layers with more filters

Training of model C: 8x Nvidia K40 GPU, 3..4 weeks (!)

(11)

[He et al. 2015]: ResNet: “Is learning better networks as easy as stacking more layers?”

(12)

Several implementations

Framework  | Restrictions                         | Wrapper       | Architectures               | Notes
Theano     | Both feed forward and recurrent nets | Python        | Multi core/CUDA             | Multiple optimization
Chainer    | Both feed forward and recurrent nets | Python        | Multi core/CUDA             | Multiple optimization
GraphLab   | Feed Forward: CNN, DBN               | Python/C++    | Multi core/CUDA/distributed | Compact, Hadoop
TensorFlow | Both feed forward and recurrent nets | Python/C++    | Multi core/CUDA/distributed | Graphical interface and multiple optimization
Caffe      | Feed Forward: CNN, DBN               | Python/Matlab | Multi core/CUDA             |
Torch      | Both feed forward and recurrent nets | Lua           | Multi core/CUDA             | Developed for vision

(13)

Ok, step back a little and … ☺

Probabilistic Graphical models (hmmm, RF?)

Set of random variables: X = {x₁, …, x_T}

Visualize connections with edges

Bayes Networks

[Figure: graph over the nodes x₁, x₂, x₃, x₄]

Probabilistic Graphical models

Visualize connections with edges (with directed edges!!!)

Conditional dependencies (A “causes” B) vs. MRF?

Bayes Networks

[Figure: directed graph over the nodes x₁, x₂, x₃, x₄]

(15)

Probabilistic Graphical models

Set of random variables: X = {x₁, …, x_T}

Visualize connections with edges (with directed edges!!!)

Bayes Networks

[Figure: directed graph with edges x₁ → x₂, x₁ → x₃, x₂ → x₄, x₃ → x₄; the nodes are annotated with P(x₁), P(x₂|x₁), P(x₃|x₁) and P(x₄|x₂,x₃)]

(16)

Probabilistic Graphical models

Set of random variables: X = {x₁, …, x_T}

Visualize connections with edges (with directed edges!!!)

Bayes Networks

[Figure: directed graph with edges x₁ → x₂, x₁ → x₃, x₂ → x₄, x₃ → x₄; the nodes are annotated with P(x₁), P(x₂|x₁), P(x₃|x₁) and P(x₄|x₂,x₃)]

P(x₁,x₂,x₃,x₄) = P(x₁) P(x₂|x₁) P(x₃|x₂,x₁) P(x₄|x₃,x₂,x₁)

= P(x₁) P(x₂|x₁) P(x₃|x₁) P(x₄|x₃,x₂)
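The factorization can be tried out numerically. The sketch below assumes binary variables and made-up conditional probability tables (none of the numbers come from the slides); it builds the joint distribution from the four factors, checks that it sums to one, and checks that x₃ does not depend on x₂ once x₁ is known.

```python
from itertools import product

# Hypothetical tables: each entry is the probability that the variable is 1.
p_x1 = 0.3
p_x2_given_x1 = {0: 0.1, 1: 0.8}
p_x3_given_x1 = {0: 0.2, 1: 0.5}
p_x4_given_x2_x3 = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}


def bern(p_one, value):
    """Probability of a binary value when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x1, x2, x3, x4):
    """P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3), as in the network above."""
    return (bern(p_x1, x1)
            * bern(p_x2_given_x1[x1], x2)
            * bern(p_x3_given_x1[x1], x3)
            * bern(p_x4_given_x2_x3[(x2, x3)], x4))


total = sum(joint(*xs) for xs in product((0, 1), repeat=4))

# P(x3=1 | x1=1, x2=1), computed from the joint by marginalization.
num = sum(joint(1, 1, 1, x4) for x4 in (0, 1))
den = sum(joint(1, 1, x3, x4) for x3 in (0, 1) for x4 in (0, 1))
p_x3_given_x1_x2 = num / den
```

The factorized form stores 1 + 2 + 2 + 4 = 9 numbers, against 15 free parameters for an unrestricted joint table over four binary variables.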

(17)

Bayes Networks

Learning:

I) We know the structure (the dependencies)
   Parameter estimation (prior, posterior)
   Analytically or via optimization (EM, GD etc.)

II) We do not know the structure
   Optimize over the space of the possible trees… and estimate parameters, optimization etc.

What kind of BN is a feed-forward network?

(18)

06/04/2017

Main neural network structures

1. Feed-forward Neural Networks:

• Bayes Networks:

• The nodes are either input, output or hidden

• Connections between the nodes: directed edges

• Presumption: finite set of nodes -> finite set of layers (are there any layers?)

• no directed cycles -> directed acyclic graphs!

• Posteriors are made out of:

• a linear combination (edge weights) of the inputs

• an activation function

where z_i^(l+1) is the linear combination (LC) of the i-th element in the (l+1)-th hidden layer, and f is the non-linear transformation

(common: f: R -> R ! When is it not?)
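A minimal sketch of one such layer, assuming the usual form z_i^(l+1) = Σ_j w_ij a_j^(l) + b_i followed by a_i^(l+1) = f(z_i^(l+1)). The weight and bias names, the presence of a bias term and the sigmoid choice for f are assumptions for illustration, not taken from the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(a_prev, weights, biases, f=sigmoid):
    """weights[i][j] is the edge weight from unit j of the previous layer
    to unit i of this layer. Returns the linear combinations and outputs."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(weights, biases)]
    return z, [f(z_i) for z_i in z]


# Two inputs feeding two hidden units (made-up numbers).
z, a = layer_forward([1.0, 2.0], [[0.5, -0.25], [0.0, 1.0]], [0.0, -2.0])
```

Here f acts on each z_i separately (f: R -> R). Softmax is one example where it does not: it maps the whole vector z to a vector.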

(19)

06/04/2017

Main neural network structures

1. Feed-forward Neural Networks:

Some common (not necessary!) restrictions:

• Disjoint set of nodes -> layers

• Exists an ordering of layers (so ordering of nodes!!)

• “Causality”: previous layer “causes” the next one

• each node is connected to

• Nodes in the previous layer (input nodes)

• Nodes in the next layer

• CDF activations

• Optimization via previously determined loss function (CDF?)

If it is fully connected:

• Each node is connected to every node in the previous layer

• Each node is connected to every node in the next layer

If so, the network is called a Multi-Layer Perceptron (MLP for short)

(20)

06/04/2017

Main neural network structures

1. Feed-forward Neural Networks: MLP vs. CNN

• Each node is applied to a subset of the input, but all over the image

It can be interpreted

• either as many fully connected nodes in which most weights are zero and the non-zero weights are shared,

• or, leaving this complicated definition aside, simply as a convolution over the input:

(f*g)(x) = ∫ f(t) g(x-t)dt

Usually we think of it as a discrete convolution; in the case of images it is a 2D/3D (in general N-D) convolution.

What kind of input can we think of for 1D?
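For the 1D case, a time series or an audio signal is a natural input. The sketch below is a plain-Python illustration of the discrete counterpart of the integral above, (f*g)[n] = Σ_k f[k] g[n-k]; the function name `conv1d` and the example numbers are made up.

```python
def conv1d(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, f_i in enumerate(f):
        for j, g_j in enumerate(g):
            out[i + j] += f_i * g_j   # f[i] * g[n - i] contributes to out[n], n = i + j
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]                   # average of two neighbouring samples
smoothed = conv1d(signal, kernel)
```

In a convolutional layer the entries of `kernel` are the shared weights that are learned.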

(21)

06/04/2017

Main neural network structures

1. Feed-forward Neural Networks: MLP vs. CNN

• Each node is applied to a subset of the input, but all over the image

Example:

1/16 1/8 1/16
1/8  1/4 1/8
1/16 1/8 1/16

What does it do? What are we changing during optimization?

The main advantage of the CNN over the MLP (in practice) is the highly reduced size of the parameter set: 32x32x128 vs. 3x3x128 (128 hidden nodes and a 32x32 input)
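To see what the 3x3 example kernel does, the sketch below slides it over a tiny image containing a single bright pixel, and then repeats the parameter count from the slide. It is a plain-Python illustration with no padding; the helper name `conv2d_valid` and the 5x5 test image are assumptions.

```python
from fractions import Fraction as F

KERNEL = [[F(1, 16), F(1, 8), F(1, 16)],
          [F(1, 8),  F(1, 4), F(1, 8)],
          [F(1, 16), F(1, 8), F(1, 16)]]


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding. The kernel is
    symmetric, so convolution and cross-correlation coincide here."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


# One bright pixel in the middle of a 5x5 image.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 16
blurred = conv2d_valid(image, KERNEL)

mlp_weights = 32 * 32 * 128   # every input pixel connected to every hidden node
cnn_weights = 3 * 3 * 128     # 128 shared 3x3 kernels
```

The kernel entries sum to one, so it is a smoothing (blur) filter: the bright pixel is spread over its neighbourhood. During optimization it is exactly these nine numbers per kernel that change.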

We will talk about: Inception, ResNet, Maxout etc.

(22)

06/04/2017

Main neural network structures

2. Recurrent Neural Networks:

• The nodes are either input, output or hidden

• Connections between the nodes: directed edges

• Presumption: finite set of nodes -> finite set of layers (are there any layers?)

• There are some directed cycles -> not a directed acyclic graph anymore … ☹

• Common: self loops only

• Posteriors are similar to FF

We will talk (two weeks from now) about:

classic BPTT model [Werbos, 1988]

LSTM [Hochreiter & Schmidhuber, 1997]

GRU [Cho et al., 2014]

(23)

06/04/2017

Main neural network structures

3. Generative models

• The nodes are either input or hidden (no output!)

• Connections between the nodes: not necessarily directed edges

• Presumption: finite set of nodes -> finite set of layers (are there any layers?)

• There are some directed/undirected cycles -> not a directed acyclic graph ☹

• Posteriors are similar to FF, but at first we do not put any restriction on the edges (full graph? ☹), OK, we will ☺

We will talk about:

Boltzmann Machine [Hinton et al., 1983]

Restricted Boltzmann Machine, Harmonium [Smolensky, 1986]

Deep Belief Networks [Hinton et al., 2006]

Variational Autoencoders [Dayan et al., 1995]

Generative Adversarial Networks [Goodfellow et al., 2014]

(24)

06/04/2017

Some unsolved problems

• Overfitting:
  • DropOut [Hinton et al., 2012]
  • DropConnect [Wan et al., 2013]
• Saturation, vanishing gradients and sparsity:
  • pReLU [He et al., 2015]
  • Maxout [Goodfellow et al., 2013]
  • Network-in-Network [Lin et al., 2014]
  • local response/batch normalization [Ioffe & Szegedy, 2015]
  • BN Maxout NiN [Chang et al., 2015]
• Complexity:
  • Convolution (in comp. to MLP)
  • FastFood [Yang et al., 2015]
  • Generalization [Zhang et al., 2017]
  • Lower bounds [Lin & Tegmark, 2016]
• Architecture:
  • spectral representation (pooling) [Rippel & Snoek, 2015]
  • Identity map and residual block [He et al., 2015]
  • Manifold tangent classifier, high-order contractive auto-encoder [Rifai et al., 2011]

(25)

Frameworks

Framework  | Restrictions              | Wrapper       | Architectures               | Notes
Theano     | feed forward/recurrent NN | Python        | Multi core/CUDA             | Keras
Chainer    | feed forward/recurrent NN | Python        | Multi core(?)/CUDA          |
GraphLab   | Feed Forward: CNN, DBN    | Python/C++    | Multi core/CUDA/distributed | Compact, Hadoop
TensorFlow | feed forward/recurrent NN | Python/C++    | Multi core/CUDA/distributed | Keras
Caffe      | Feed Forward: CNN, DBN    | Python/Matlab | Multi core/CUDA             | Blob, for visual
Torch      | feed forward/recurrent NN | Lua           | Multi core/CUDA             |

(26)

TensorFlow

Developed at Google

Multi core/GPU (CUDA only ☹)/distributed

Python and C++

Lots of example code on GitHub: CNN, DBN, RNN, LSTM

Google’s example: MNIST

(27)

Install

Example: Ubuntu/Linux 64-bit, CPU only, Python 3.5

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc1-cp35-cp35m-linux_x86_64.whl
pip install --ignore-installed --upgrade pip setuptools
pip install --upgrade $TF_BINARY_URL

Import:

import tensorflow as tf

(28)

MNIST handwriting recog.

MNIST:

• 60000 training images (plus 10000 test images)

• 28x28 resolution

• Only grayscale -> 1ch

• Labels: 0..9

https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/g3doc/tutorials/mnist/input_data.py

(if something changes they will change this tutorial)

(29)

Shallow model

# we will follow the tutorial
import tensorflow as tf
import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# set the basic variables: input placeholder, weights and biases
x = tf.placeholder("float", [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# the output is softmax
y = tf.nn.softmax(tf.matmul(x, W) + b)

# original labels
y_ = tf.placeholder("float", [None, 10])

# loss function
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))

# gradient calculation based on cross entropy and the net (computational graph)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
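What the softmax output and the cross-entropy loss of the shallow model compute can be written out without TensorFlow. This is a minimal plain-Python sketch for a single sample; the helper names and the three-class example are made up, and the max-subtraction is a common numerical-stability trick that is not part of the tutorial code.

```python
import math


def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(one_hot, probs):
    """-sum_k y_k log(p_k); only the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# With all-zero weights and biases (the initialization used above)
# every class gets the same probability.
probs = softmax([0.0, 0.0, 0.0])
loss = cross_entropy([0, 1, 0], probs)
```

With K classes and zero initial weights the loss per sample starts at log K, which is a handy sanity check for the first training step.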

(30)

MNIST shallow model

# we start a session and run it
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# 1000 times we update the model based on a small batch (sized 100)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# evaluation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

(31)

MNIST Convolutional NN

# a bit better suited for ipython
sess = tf.InteractiveSession()

# we can define variables (here the weights)
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# simple 2D conv.: step size 1, W is the window (conv. func.), x will be the input
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# simple pooling, subsampling with maximum, each 2x2 patch will be a single element
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

(32)

MNIST CNN

# first and second Conv. Layers with pooling (the model is on the bottom)
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

(33)

MNIST CNN

# fully connected layers on top of the convolutions -> flattening then MLP:
# 7x7 images with 64 channels (after the second pooling) and 1024 nodes in the hidden layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

# ReLU
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# if dropout (later)
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Output layer, again with softmax
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

(34)

MNIST CNN

# same loss function
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))

# now with the ADAM optimizer (Kingma & Ba, 2015): gradient descent with adaptive per-parameter step sizes
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# training via batches and evaluation
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())

for i in range(1000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        # dropout is switched off (keep_prob = 1.0) when measuring accuracy
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print('test accuracy %g' % accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

(35)

Same in Keras

# Install: pip install keras
# Keras is a wrapper over the Theano and TensorFlow backends (default: TF) and it uses the GPU if it can
from keras.datasets import cifar10, mnist
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D, normalization
from keras.utils import np_utils
import keras.backend as K

batch_size = 32
nb_classes = 10
nb_epoch = 10
data_augmentation = True

(36)

Same in Keras

# input image dimensions
img_rows, img_cols = 32, 32

# The CIFAR10 images are RGB
img_channels = 3

# The data, shuffled and split between train and test sets:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

(37)

Same in Keras

# two Conv + Pooling layers
model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

(38)

Same in Keras

# FC layer with ReLU
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# Output layer
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Train with ADAM
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

(39)

Same in Keras

# scale the pixel values to [0, 1]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
              validation_data=(X_test, Y_test), shuffle=True)

(40)

Same in Keras

else:
    print('Using real-time data augmentation.')
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (0..180 deg)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images horizontally
        vertical_flip=False)  # randomly flip images vertically

    datagen.fit(X_train)

    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=nb_epoch,
                        validation_data=(X_test, Y_test))

(41)

Dropout (Hinton et al. 2012)

As we mentioned, deep networks have very high complexity.

Consequences:

• High generalization bound

• Overfitting

• Slow convergence

Idea: reduce overfitting by randomly removing units during training, which forms multiple “thinned” networks (similar to some of the pruning methods for Decision Trees), so the neurons rely less on each other.

The result is an “unthinned” network (same size as the original) that is less overfitted and has downscaled weights (e.g. softmax -> w’ = pw, p=0.5).

The method acts more like a regularization.

Hypothesis: dropout is model averaging (bagging)

Works well with large steps in parameter space (vs. manifold hypothesis?)
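The thinning during training and the weight scaling at test time can be sketched in a few lines. This is an illustrative plain-Python version for one layer of activations; the function names, the fixed random seed and the example activations are assumptions.

```python
import random


def dropout_train(activations, p_keep, rng):
    """Training: keep each unit with probability p_keep, zero it otherwise."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]


def dropout_test(activations, p_keep):
    """Testing: use every unit, scaled by p_keep (same effect as w' = p * w)."""
    return [p_keep * a for a in activations]


rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]
p = 0.5

thinned = dropout_train(h, p, rng)     # output of one random "thinned" network
expected = dropout_test(h, p)          # deterministic test-time output

# Averaging over many thinned networks approaches the scaled output.
n = 20000
mean = [0.0] * len(h)
for _ in range(n):
    for i, v in enumerate(dropout_train(h, p, rng)):
        mean[i] += v / n
```

Note that `tf.nn.dropout` in the MNIST code uses the "inverted" variant: it scales the kept units by 1/keep_prob during training, so nothing has to be rescaled at test time.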

(42)

06/04/2017

Standard fully connected Neural Network After Dropout during training (figures by Srivastava et al. 2014)

Dropout (Hinton et al. 2012)

(43)

06/04/2017

Standard feed forward With Dropout

Dropout (Hinton et al. 2012)

Some notes:

Why is it helping? -> Jump!

Global sampling?

Residual sampling?

And CDF ☺

(44)

06/04/2017

Standard fully connected Neural Network After Dropout

(figures by Srivastava et al. 2014)

Dropout (Hinton et al. 2012)

(45)

06/04/2017

Dropout over MNIST

(figures by Srivastava et al. 2014)

(46)

06/04/2017

Training phase Testing phase

(figures, results by Srivastava et al. 2014)

(47)

06/04/2017

Dropout: experiments MNIST

(results by Srivastava et al. 2014)

(48)

Batch normalization (Ioffe & Szegedy, 2015)

o Instead of the standard normalization of the input

o Estimation over the batch instead of over the training set

o Learn an affine transformation of the normalized input (a lot of new parameters)

o In practice: apply BN immediately before the non-linear transformation -> the number of new parameters is only 2x the size of the output

o For CNNs aggregation is needed (consistent normalization)

o CDF ☺

4.9% top-5 error on ImageNet with significantly fewer epochs
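A minimal sketch of the training-time forward pass, assuming one scale (gamma) and one shift (beta) per feature, which is where the "2x size of the output" parameter count comes from. The function name, the eps constant and the toy batch are illustrative; the running averages used at test time are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """batch: list of samples, each a list of features.
    Normalizes every feature with the mean and variance of the mini-batch,
    then applies the learned affine map gamma * x_hat + beta."""
    n = len(batch)
    dims = len(batch[0])
    mean = [sum(s[d] for s in batch) / n for d in range(dims)]
    var = [sum((s[d] - mean[d]) ** 2 for s in batch) / n for d in range(dims)]
    return [[gamma[d] * (s[d] - mean[d]) / math.sqrt(var[d] + eps) + beta[d]
             for d in range(dims)]
            for s in batch]


# Two features on very different scales end up on the same scale.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma = 1 and beta = 0 every feature has (almost exactly) zero mean and unit variance over the batch; learning gamma and beta lets the network undo the normalization where that helps.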

(49)

Maxout (Goodfellow et al. 2013)

• MLP inside an activation function

• Piecewise linear approximation of convex functions -> universal approximator with “enough” affine components

• For a single hidden unit with k affine feature maps:

h_i(x) = max_{j ∈ [1,k]} z_ij, where z_ij = xᵀ W_{·ij} + b_ij
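A maxout unit takes the maximum over k affine maps of its input. The sketch below is a plain-Python illustration for one hidden unit; the function name and the weights are made up, and they are chosen so that the unit reproduces two familiar piecewise linear functions.

```python
def maxout_unit(x, weights, biases):
    """weights[j] and biases[j] define the j-th affine map z_j = w_j . x + b_j;
    the unit outputs the largest of the k values."""
    return max(sum(w * x_i for w, x_i in zip(w_j, x)) + b_j
               for w_j, b_j in zip(weights, biases))


inputs = (-2.0, 0.5)

# With the two maps z = x and z = -x a 1-D maxout unit computes |x| ...
abs_like = [maxout_unit([v], [[1.0], [-1.0]], [0.0, 0.0]) for v in inputs]

# ... and with z = x and z = 0 it computes ReLU.
relu_like = [maxout_unit([v], [[1.0], [0.0]], [0.0, 0.0]) for v in inputs]
```

Since the maximum of affine functions is convex and piecewise linear, more pieces (larger k) give a finer approximation of any convex activation.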

(50)

06/04/2017

Network In Network (Lin et al., 2014)

• Non-linear convolution

• Global average pooling (average of feature maps) vs. fully connected

11.59% (FC) vs. 10.89% (FC+DO) vs. 10.41% (GAP) error on CIFAR-10

vs. maxout:

(51)

06/04/2017

IL – Integer bits FL – fractional bits WL – word length

(52)

Boltzmann Machines (BM)

Boltzmann machines were introduced by Hinton and Sejnowski in 1983 -> general model

Actually it is a recurrent NN… with a full graph at first … Hopfield network: order!

Visible units (V)

Hidden elements (H)

Markov Random Field:

Hammersley-Clifford theorem: the joint probability distribution (JPD) factorizes over the maximal cliques ...

Intractable in practice (inference is complex)

Inference is similar to Simulated Annealing

(53)

Simulated Annealing (Kirkpatrick et al. 1983)

The gradient based optimization methods we learned about are in a way deterministic: with the same input they do the same updates (the stochastic nature comes from the sampling of the batches, not from the optimization part).

In contrast, in SA:

• We define an energy function over the configurations (states): E(s)

• And a probability function which controls the transition from the current state to a candidate state: P(e₁,e₂,T), where e₁ is the energy of the current (last) configuration, e₂ is the energy of the candidate configuration and T is the “global” temperature.

Some presumptions:

• As T goes to zero, P(e₁,e₂,T) goes to zero if e₂>e₁; at T=0 we accept only if we go towards the local min. (GD!)

• If e₂<e₁ then P(e₁,e₂,T) should go to 1 (towards the minimum)

• Lots of versions: what if e₁>>e₂? And if e₁<<e₂?

• P(e₁,e₂,T) varies smoothly with the difference between e₁ and e₂

• How to cool down the system?
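One common way to fill in these choices is the Metropolis acceptance rule with a geometric cooling schedule. The sketch below is only one of the many versions mentioned above; the toy energy function, the step size, the starting temperature and the cooling factor are all assumptions picked for illustration.

```python
import math
import random


def accept_prob(e1, e2, T):
    """Metropolis rule: always accept downhill moves, accept uphill moves
    with probability exp(-(e2 - e1) / T)."""
    if e2 <= e1:
        return 1.0
    if T <= 0.0:
        return 0.0
    return math.exp(-(e2 - e1) / T)


def anneal(energy, s0, steps=5000, T0=2.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    s, e = s0, energy(s0)
    best_s, best_e = s, e
    T = T0
    for _ in range(steps):
        cand = s + rng.uniform(-0.5, 0.5)      # random neighbour of the current state
        e_cand = energy(cand)
        if rng.random() < accept_prob(e, e_cand, T):
            s, e = cand, e_cand
            if e < best_e:
                best_s, best_e = s, e
        T *= cooling                            # geometric cooling schedule
    return best_s, best_e


def toy_energy(s):
    """Double well: a shallow minimum near s = 1 and a deeper one near s = -1."""
    return (s * s - 1.0) ** 2 + 0.3 * s


best_s, best_e = anneal(toy_energy, s0=2.0)
```

At high temperature uphill moves are accepted often, which lets the search leave the shallow well; as T shrinks the rule degenerates into accepting improvements only.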

(54)

Harmonium by Smolensky in 1986, and RBM in 2002 by Hinton with Contrastive Divergence as a fast learning solution.

Restricted: the Boltzmann/Gibbs distribution factorizes over the cliques in the graph -> there are no connections between the units in the same layer (hence the name), therefore we control the cliques (BM?).

[Figure: bipartite graph of visible units (V) and a hidden layer (H), connected by the weights W]

(55)

Restricted Boltzmann Machines (RBM)

[Figure: visible units (V) and hidden layer (H), connected by the weights W]

The energy function:

h is usually a sigmoid.

Gradient: estimation is difficult -> Gibbs sampling (GS) ☹
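For a binary RBM the energy is commonly written as E(v,h) = -aᵀv - bᵀh - vᵀWh. The sketch below assumes this standard form; the bias names a and b, the helper names and all numbers are assumptions rather than the notation of the slide. It also shows why the restriction helps: without hidden-to-hidden edges, every hidden unit can be sampled independently given the visible layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for binary vectors v and h."""
    vis = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hid = sum(b_j * h_j for b_j, h_j in zip(b, h))
    inter = sum(v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return -vis - hid - inter


def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij), independently for each j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def sample_h(v, W, b, rng):
    """Half a Gibbs sampling step: draw all hidden units at once."""
    return [1 if rng.random() < p else 0 for p in p_h_given_v(v, W, b)]


W = [[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]]   # 3 visible x 2 hidden, made-up numbers
a = [0.0, 0.0, 0.0]
b = [0.0, -1.0]
v = [1, 0, 1]

probs = p_h_given_v(v, W, b)
h = sample_h(v, W, b, random.Random(0))

# The sigmoid follows directly from the energies of the two states of h_0.
e_on = energy(v, [1, 0], W, a, b)
e_off = energy(v, [0, 0], W, a, b)
p_from_energy = math.exp(-e_on) / (math.exp(-e_on) + math.exp(-e_off))
```

Alternating `sample_h` with the analogous step for the visible layer gives the Gibbs chain used to estimate the gradient.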

(56)

Restricted Boltzmann Machines (RBM)

[Figure: visible units (V) and hidden layer (H), connected by the weights W]

Maximizing the log-likelihood of the data, or equivalently minimizing the KL divergence between the data distribution and the equilibrium distribution:

Contrastive divergence [Hinton 2002]:

Key idea: Minimize the difference between the KL divergences over the data and the first reconstruction:

-> The problematic part cancels out, but there is a remaining term which, according to Hinton, can be ignored in practice

(57)

Deep Belief Network (DBN)

Similar to the idea of boosting [Freund and Schapire 1995]

A straightforward application of CD to a deep RBM network is not suitable.

Idea: allow each model in the sequence to receive a different representation of the data.

With l hidden layers and x as input:

Greedy layer-wise learning [Hinton et al. 2006]:

learn layer by layer, using the output of the previous layer as input

(58)

28/04/2017

Main neural network structures

1. Feed-forward Neural Networks:

• Bayes Networks

• Back propagation

• CNN, MLP etc.

2. Recurrent Neural Networks:

• Loops -> not a DAG…

• Usually: self loop

3. Generative models

(59)

28/04/2017

Main neural network structures

3. Generative models

• The nodes are either input or hidden (no output!)

• Connections between the nodes: not necessarily directed edges

• Presumption: finite set of nodes -> finite set of layers (are there any layers?)

• There are some directed/undirected cycles -> not a directed acyclic graph ☹

• Posteriors are similar to FF, but at first we do not put any restriction on the edges (full graph? ☹), OK, we will ☺

We will talk about:

Boltzmann Machine [Hinton et al., 1983]

Restricted Boltzmann Machine, Harmonium [Smolensky, 1986]

Deep Belief Networks [Hinton et al., 2006]

Variational Autoencoders [Dayan et al., 1995] (next week)

Generative Adversarial Networks [Goodfellow et al., 2014] (next week)

(70)

28/04/2017

Midterm

Midterm: Good work! Remember: with a good project + 3xPS + midterm -> grade without exam

1. Hierarchical clustering: the strategies only differ if the clusters are not individual clusters

2. If a node is homogeneous -> we do not split it, pruning?

3. a) What is the main difference between the discriminative and the generative models?
   b) Why is teleportation usually necessary for PageRank in practice?
   c) What is the difference between a greedy and a lazy algorithm?
   d) What happens with a deep feed-forward network if we use only linear activation functions? Why?
   e) What is the difference between Collaborative Filtering and Content Based Recommendation Systems?
   f) When are we using ROC AUC instead of F-measure, Precision etc.? Why?

4. b) was non-separable, how to solve it?

5. Too much calculation but it was OK

Grades:

Points  | Grade | #
16+     | 5     | 3
12-15.5 | 4     | 3
10-11.5 | 3     | 1
8-9.5   | 2     | 0
8-      | 1     | 1

(71)

28/04/2017

Recurrent Neural Networks (RNN)

Simulates a discrete-time dynamical system [Rumelhart et al. 1986]

Three components:

x_t: input at time t
y_t: output at time t
h_t: hidden state at time t

The connections between the layers are straightforward:

Compared to feed forward networks, the main difference is the connection between the current and the last hidden state (a loop in the network) -> it can carry along information about the previous inputs! But for how long?

(72)

28/04/2017

Recurrent Neural Networks (RNN)

Let a sequence of samples be given

The estimation of the parameters (Θ) of the RNN is based on the minimization of the following additive cost function:

where

d(y,f(h)) is some penalty function (divergence, distance etc.).

(73)

28/04/2017

Recurrent Neural Networks (RNN)

Feed forward representation of RNN

This unfolded representation is already “deepish” ☺ but with the same weights at each layer (time)

[Figure (Geoffrey Hinton): the network unrolled over time=0, 1, 2, 3; the same weights w₁, w₂, w₃, w₄ are repeated between every pair of consecutive time steps]

(74)

28/04/2017

A particular example:

where W, U and V are the weight matrices and the Φ functions are bounded non-linear functions, such as the sigmoid.

The parameters of this conventional RNN can be estimated by SGD over the cost function with back propagation through time [Rumelhart et al. 1986]. The trick is to unfold the network, and after back propagation we average the weights through time to have identical functions (as we assumed initially).
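A minimal forward pass for such a conventional RNN, assuming the common form h_t = Φ(W h_{t-1} + U x_t) and y_t = Φ(V h_t). Which matrix plays which role, the sigmoid choice for Φ and all numbers are assumptions for illustration; the formula on the slide may assign the symbols differently.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def rnn_forward(xs, W, U, V, h0):
    """Runs the recurrence over a sequence; the same W, U, V are reused
    at every time step (the shared weights of the unfolded network)."""
    h, hs, ys = h0, [], []
    for x in xs:
        pre = [p + q for p, q in zip(matvec(W, h), matvec(U, x))]
        h = [sigmoid(p) for p in pre]
        hs.append(h)
        ys.append([sigmoid(o) for o in matvec(V, h)])
    return hs, ys


W = [[0.5, 0.0], [0.0, 0.5]]      # hidden-to-hidden (made-up numbers)
U = [[1.0], [-1.0]]               # input-to-hidden
V = [[1.0, 1.0]]                  # hidden-to-output

# The input is non-zero only at the first time step.
hs, ys = rnn_forward([[1.0], [0.0], [0.0]], W, U, V, h0=[0.0, 0.0])
```

The two hidden units still differ at later steps although the input is zero there: the loop carries information about the first input forward, and the difference shrinks with every step.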

The question remains, how to “deepen” RNN?

Recurrent Neural Networks (RNN)

(75)

28/04/2017

Some advances in recurrent Neural Networks

Stacked RNN (sRNN [Schmidhuber 1992, El Hihi and Bengio 1996]):

- stacking multiple recurrent hidden layers on top of each other
- modeling multiple time scales in the input sequence

[Pascanu et al. 2014]: three types of expansion:

- deep Input-to-Hidden function (temporal neighbours in NLP [Mikolov et al. 2013])
- deep Hidden-to-Hidden function (DT-RNN), with shortcuts to preserve the responsiveness of the RNN
- deep Hidden-to-Output function (DO-RNN)

(76)

28/04/2017

Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)

RNN DT-RNN DT-RNN with shortcut

Fig.: Pascanu et al. 2014

(77)

28/04/2017

Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)

DOT-RNN sRNN

Fig.: Pascanu et al. 2014

(78)

28/04/2017

Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)

Evaluation: log-probability of the next sample

1. Experiment: polyphonic music prediction over Nottingham, JSB Chorales and MuseData

2. Experiment: character-level and word-level language modeling over Penn Treebank Corpus

Notes: SGD for training. Learning rates are manually tuned.

(79)


Evaluation: negative log-probability

1. Experiment: polyphonic music prediction over Nottingham, JSB Chorales and MuseData

• Gradients were backpropagated through at most 200 time steps for the first two datasets and 50 for MuseData.

• If the song changes at the end of a subsequence, the hidden states are reset.

(*) maxout units instead of sigmoid in the deep output function, plus dropout (we will examine both later)
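The evaluation measure can be sketched as follows for the categorical case (one symbol per step, as in character-level language modeling). The function name and the toy numbers are assumptions. For polyphonic music, where several notes can sound at once, the per-step probability would be computed differently.

```python
import numpy as np

def mean_negative_log_prob(probs, targets):
    """Average of -log p(true next symbol) over the time steps.

    probs: (T, K) predicted distributions, targets: (T,) true symbol indices.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]  # probability given to the truth
    return float(-np.mean(np.log(picked)))

probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.25, 0.25, 0.50]])
targets = np.array([0, 1, 2])
nll = mean_negative_log_prob(probs, targets)

# a model that always predicts the uniform distribution scores log(K)
uniform_nll = mean_negative_log_prob(np.full((3, 3), 1.0 / 3.0), targets)
print(nll, uniform_nll)
```

Lower is better: a confident and correct model approaches 0, while the uniform baseline sits at log K.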

Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)

(80)


Evaluation: negative log-probability

2. Experiment: Character-level and word-level language modeling over Penn Treebank Corpus

ReLU instead of sigmoid for the word-level.

Additional methods:

• 1k long short-term memory [Graves et al. 2013]

• shallow RNN with a larger hidden state [Mikolov et al. 2011]

Deep Recurrent Neural Networks (dRNN, Pascanu et al. 2014)

(81)


Long-short term memory in general

(Greff et al., 2015)


(84)


Long short term memory (LSTM)

[Hochreiter & Schmidhuber 1997] actually 2005 ☺

Fig: Christopher Olah

(85)


Long short term memory (LSTM)

[Hochreiter & Schmidhuber 1997]

Forget gate: carry or not carry (on)

Input gate: selection and new state

(86)


Long short term memory (LSTM)

[Hochreiter & Schmidhuber 1997]

The new state

Output: combined with the new state
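The four steps named on these slides (forget gate, input gate, new state, output) can be put together as one LSTM step. The sketch below assumes the widely used formulation without peepholes; the packing of all gate weights into a single matrix `W` and the gate order are implementation choices made here, not something the slides prescribe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*n_h, n_h + n_in), b: (4*n_h,)."""
    n_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * n_h:1 * n_h])   # forget gate: carry the old state or not
    i = sigmoid(z[1 * n_h:2 * n_h])   # input gate: how much of the candidate to select
    g = np.tanh(z[2 * n_h:3 * n_h])   # candidate values for the new state
    o = sigmoid(z[3 * n_h:4 * n_h])   # output gate
    c = f * c_prev + i * g            # the new state
    h = o * np.tanh(c)                # output: gate combined with the new state
    return h, c

n_in, n_h = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=n_in)
h_prev = np.zeros(n_h)
c_prev = np.array([1.0, -2.0, 0.5, 0.0])

# random weights: an ordinary step
W = rng.normal(scale=0.1, size=(4 * n_h, n_h + n_in))
h, c = lstm_step(x, h_prev, c_prev, W, np.zeros(4 * n_h))

# forget gate wide open, input gate shut: the state is carried on unchanged
b_carry = np.concatenate([np.full(n_h, 10.0), np.full(n_h, -10.0),
                          np.zeros(n_h), np.zeros(n_h)])
h_carry, c_carry = lstm_step(x, h_prev, c_prev,
                             np.zeros((4 * n_h, n_h + n_in)), b_carry)
```

The second call illustrates "carry or not carry (on)": with the forget gate near 1 and the input gate near 0, the cell state passes through the step almost untouched.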

(87)


Gated recurrent unit or GRU (Cho et al., 2014)
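A GRU step can be sketched in the same style. The update convention used here (z close to 1 keeps the old state) is one of the two conventions found in the literature and is an assumption, as are the parameter names in the dictionary `p`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; p holds the weight matrices and biases."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev) + p["b"])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                 # blend old and candidate

n_in, n_h = 3, 4
rng = np.random.default_rng(0)
p = {
    "Wz": rng.normal(scale=0.1, size=(n_h, n_in)),
    "Uz": rng.normal(scale=0.1, size=(n_h, n_h)),
    "bz": np.zeros(n_h),
    "Wr": rng.normal(scale=0.1, size=(n_h, n_in)),
    "Ur": rng.normal(scale=0.1, size=(n_h, n_h)),
    "br": np.zeros(n_h),
    "W": rng.normal(scale=0.1, size=(n_h, n_in)),
    "U": rng.normal(scale=0.1, size=(n_h, n_h)),
    "b": np.zeros(n_h),
}
x = rng.normal(size=n_in)
h_prev = rng.uniform(-1.0, 1.0, size=n_h)

h_new = gru_step(x, h_prev, p)

# with the update gate saturated at 1 the old state is simply copied
p_keep = dict(p, bz=np.full(n_h, 50.0))
h_keep = gru_step(x, h_prev, p_keep)
```

Compared with the LSTM there is no separate cell state and no output gate, which is where the simplification comes from.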

(88)


Notes on LSTM

Hyperparameters:

• Random search… [Anderson, 1953, Solis & Wets, 1981]

The size of the hidden layer is independent of the learning rate. They can therefore be tuned one after the other: first the learning rate on a small network, then the size of the layer.
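A minimal sketch of random search over the two hyperparameters mentioned above. The function `toy_validation_loss` is a hypothetical stand-in for training a network and measuring its validation loss, and the sampling ranges are assumptions.

```python
import numpy as np

def random_search(score_fn, n_trials, rng):
    """Sample configurations at random and keep the one with the lowest score."""
    best = None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -1),          # log-uniform learning rate
            "n_hidden": int(rng.integers(16, 513)),   # hidden layer size in [16, 512]
        }
        score = score_fn(cfg)
        if best is None or score < best[0]:
            best = (score, cfg)
    return best

def toy_validation_loss(cfg):
    # hypothetical objective with its optimum at lr = 1e-3 and 128 hidden units
    return (np.log10(cfg["lr"]) + 3.0) ** 2 + (cfg["n_hidden"] - 128) ** 2 / 1e4

best_score, best_cfg = random_search(toy_validation_loss, 50, np.random.default_rng(0))
print(best_score, best_cfg)
```

Sampling the learning rate on a log scale is the usual choice because its useful values span several orders of magnitude.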

(89)


Notes on LSTM

Performance of various versions:

• They are actually very similar [Greff et al., 2015, Chung et al., 2014]

• GRU is similar in performance but simpler than regular LSTM

• Forget gate was introduced in 2000 [Gers et al., 2000]

• Recurrent connections between all gates -> overfit

• Bidirectional LSTMs are better

• Full gradient was introduced only in 2005

• Forget gate and output activation are crucial

• For text and image captioning we need attention (next week). What could it be?

(90)


Attention [Xu et al., 2015]

(91)


Attention [Xu et al., 2015]

Trick: instead of hard wiring of input selection -> distribution

• Differentiable ☺

• Distribution: another RNN’s softmax output -> we can train it!

• Overfitting…

Soft vs. hard attention:

• Soft: linear combination of location vectors (image parts)

• Hard: one-hot coded
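The two variants can be contrasted in a few lines. In this sketch `features` holds one vector per image location and `scores` are the unnormalized importances that, in the real model, another network would produce; both are made-up numbers here, and the function names are assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def soft_attention(features, scores):
    """Soft: the context is a linear combination of all location vectors."""
    alpha = softmax(scores)
    return alpha @ features, alpha

def hard_attention(features, scores, rng):
    """Hard: sample one location from the distribution, one-hot coded."""
    alpha = softmax(scores)
    idx = rng.choice(len(alpha), p=alpha)
    one_hot = np.zeros_like(alpha)
    one_hot[idx] = 1.0
    return one_hot @ features, one_hot

features = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0]])      # 4 locations, 3-dimensional vectors
scores = np.array([2.0, 0.5, -1.0, 0.0])

ctx_soft, alpha = soft_attention(features, scores)
ctx_hard, one_hot = hard_attention(features, scores, np.random.default_rng(0))
print(alpha, ctx_soft, ctx_hard)
```

The soft version is differentiable in the scores, so it can be trained with ordinary backpropagation. The hard version involves a sampling step and needs a different training procedure.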

What are the shortcomings?

(92)


Attention [Xu et al., 2015]

Question: what are the shortcomings?

Are the vectors additive? (images?)

Connected or non-connected components?

In image caption, if hard coded: |L| vs. T vs. |W|?

Distribution of importance of locations?

Object vs. concept detection…

(cat, bird vs. daylight, winter etc.)

BUT: do we need attention? ☺

(93)


CNN + WE + LSTM: image caption

Image caption (Vinyals et al, 2016)

Ensemble, BeamSearch and scheduled sampling

(94)


[Szegedy et al., 2014]:

Inception: replace convolutions with smaller but deeper mini networks -> dimension reduction

[Ioffe & Szegedy, 2015]: Batch normalization

Image model (GoogLeNet + BN)

(95)


Word embedding [Y. Bengio et al., 2006]

An actual language model:

Predict the next word from the context

Input representation:

One-hot encoding (dim. is the size of the dictionary)

This is the original model; recent models use smoothed input word representations.

Interesting property:

King - Man + Woman is close to Queen in L2 -> [Rothe, Ebert & Schütze, 2016]: orthogonal word embedding, polarity

In image captioning (IC): 512-dimensional embedding

(96)


Regular WE

[Figure labels: Context (dictionary) -> Hidden layer -> Softmax (dictionary)]

Language model: P(w_t | N(w_t)) -> maximize the log-likelihood

Input: context of the actual word. "Cat sits on sofa" -> context:

1. Cat, on, sofa
2. Cat

etc.

Hidden layer: usually tanh

Output: distribution over terms -> p(w_cat | context)
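The model on this slide can be sketched end to end on the example sentence. The sketch assumes the target word is "sits", so that the two contexts match the ones listed above. The helper names, the bag-of-words input and the untrained random weights are assumptions made for illustration.

```python
import numpy as np

sentence = ["cat", "sits", "on", "sofa"]
vocab = {w: i for i, w in enumerate(sentence)}   # tiny dictionary

def context_of(tokens, t, left, right):
    """Words around position t: up to `left` before and `right` after."""
    return tokens[max(0, t - left):t] + tokens[t + 1:t + 1 + right]

def one_hot_bag(words, vocab):
    """Sum of one-hot vectors; the dimension is the size of the dictionary."""
    v = np.zeros(len(vocab))
    for w in words:
        v[vocab[w]] += 1.0
    return v

def predict(context, vocab, W_in, W_out):
    """Context -> tanh hidden layer -> softmax over the dictionary."""
    h = np.tanh(W_in @ one_hot_bag(context, vocab))
    s = W_out @ h
    e = np.exp(s - s.max())
    return e / e.sum()

full_context = context_of(sentence, 1, left=3, right=3)   # the whole sentence
left_context = context_of(sentence, 1, left=1, right=0)   # only the previous word

rng = np.random.default_rng(0)
n_hidden = 8
W_in = rng.normal(scale=0.1, size=(n_hidden, len(vocab)))   # columns act as word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), n_hidden))
p = predict(full_context, vocab, W_in, W_out)
print(full_context, left_context, p)
```

Training would adjust `W_in` and `W_out` to maximize the log-probability of the true word; the columns of `W_in` then serve as the word embeddings.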

(97)


Orthogonality and polarity [Rothe, Ebert, Schütze 2016]

Dimension reduction, but instead of reducing, ranking:

1. Sentiment and non-sentiment information
2. Concreteness information
3. Frequency information

Assumption: they do not correlate!

Q: orthogonal matrix transformations or SO(2) Lie group

(98)


Orthogonality and polarity [Rothe, Ebert, Schütze 2016]

(99)


LSTM in Vinyals et al. 2016

Not multidimensional RNN!

BN (the image model output) is translated into a 512-dimensional representation to match WE

Simple LSTM, no peepholes

(100)


Putting everything together:

CNN (BN Inc.) + WE (d=512) + LSTM (#hidden=512) : image caption

Image caption (Vinyals et al, 2016)

Ensemble, BeamSearch and scheduled sampling -> Attention?

(101)


Image caption (Vinyals et al, 2016)

o Pre-trained image model: trained on ImageNet, fine-tuning (tricky) helped a bit

o Word embedding was not pre-trained

o Ensemble: multiple models with different initialization, learning parameters or even different networks

o BeamSearch: consider the k best sentences before generating the next word. Beam size matters; k=3 was actually the best on the MS COCO challenge

o Optimization: SGD with fixed learning rate and without momentum + Dropout (?)

o Transfer learning: models trained on different datasets

o Scheduled sampling: curriculum learning strategy, flip a coin to use the predicted or the true previous word

Together 20+% in performance
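The BeamSearch item above can be sketched on a toy model. The table of next-word probabilities is made up so that the greedy choice (k=1) misses the most probable sentence; in the captioning model these probabilities would come from the LSTM. All names are assumptions.

```python
import math

# hypothetical next-word probabilities, conditioned only on the previous word
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(next_log_probs, start, end, k, max_len):
    """Keep the k best partial sentences at every step."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:k]:
            # a sentence that produced the end token leaves the beam
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)            # keep unfinished ones if max_len was hit
    return max(finished, key=lambda c: c[0])

greedy_score, greedy_seq = beam_search(toy_next_log_probs, "<s>", "</s>", k=1, max_len=5)
beam_score, beam_seq = beam_search(toy_next_log_probs, "<s>", "</s>", k=2, max_len=5)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With k=1 the search commits to "a" and ends with probability 0.3; with k=2 it keeps "the" alive and finds "the cat" with probability 0.36.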

(102)


Image caption (Vinyals et al, 2016)

(103)


Embeddings and LSTM with Keras tut.

Keras includes some datasets: IMDB movie reviews (25k), labels: pos. or neg.

import numpy as np
import pandas as pd
import time
from keras.datasets import imdb  # imdb dataset
from keras.models import Sequential, Model  # the LSTM is just a layer!!! -> CDFFN
from keras.layers import Dense, Input, Dropout, Activation, Flatten, LSTM  # LSTM
from keras.layers import Convolution2D, MaxPooling2D, normalization
from keras.utils import np_utils
from keras.layers.embeddings import Embedding  # Embedding
from keras.preprocessing import sequence
import keras.backend as K
import matplotlib.pyplot as plt
%matplotlib inline

(104)


Embeddings and LSTM with Keras tut.

# we will only use the most frequent 1k terms (faster)
topk = 1000
# the keyword is num_words in Keras 2 (it was nb_words in Keras 1)
(x_tr_raw, y_tr), (x_te_raw, y_te) = imdb.load_data(num_words=topk)
print('raw tr: %s te: %s' % (str(x_tr_raw.shape), str(x_te_raw.shape)))

# pad the sequences (they originally vary in length, as expected)
# keep only 100 terms (but frequent!) per review, again because it is faster;
# by default pad_sequences truncates from the beginning, so these are the last 100
max_len = 100
x_tr = sequence.pad_sequences(x_tr_raw, maxlen=max_len)
x_te = sequence.pad_sequences(x_te_raw, maxlen=max_len)
print('after tr: %s te: %s' % (str(x_tr.shape), str(x_te.shape)))

(105)


Embeddings and LSTM with Keras tut.

emb_dim = 32  # embedding will be a small 32-dimensional vector per term
lstm_units = 10  # number of LSTM units in the LSTM layer
epochs = 10  # train it for 10 epochs
batch_size = 32  # mini-batch size

# create the model (or build the computational graph)
model = Sequential()  # still a sequential model!
# max_len controls the length, but in our case it is always 100
model.add(Embedding(topk, emb_dim, input_length=max_len))
# finally we add a recurrent layer :)
model.add(LSTM(lstm_units))
model.add(Dense(1, activation='sigmoid'))  # logreg binary classifier on the top

# compilation, binary not categorical since only two classes
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # prints the summary itself and returns None

# train one epoch per call (epochs replaces Keras 1's nb_epoch)
for i in range(epochs):
    model.fit(x_tr, y_tr, validation_data=(x_te, y_te), epochs=1, batch_size=batch_size)

(106)


Object detection

Fig. Kaiming He

Traditional models:

o Haar wavelet [Poggio et al., 1998] and Haar-like features [Viola and Jones, 2001]
  Rigid features + SVM or AdaBoost (and also reduce the number of features)

o Deformable parts model [Felzenszwalb et al. 2010]

o 100 Hz: HOG + Boosted Trees etc. [Benenson et al. 2012]

And non-rigid feature extraction (CNN):

o R-CNN [Girshick et al., 2014]: Regions with CNN features

o SPP-net [He et al., 2014]: Spatial Pyramid Pooling in DNN

o Fast R-CNN [Girshick et al., 2015]: CNN feature maps

o RegionLet [Wang et al., 2015]: Integral image over CNN

o Faster R-CNN [Ren et al., 2015]: Region Proposal Network

(107)


R-CNN vs. Fast R-CNN

Fig. Kaiming He

R-CNN:

o CNN-s over the candidate regions
o Rigid region size
o High complexity
o In practice: separate SVM over the feats., not CDF ☹

Fast R-CNN:

o One CNN per image
o Feature map per pixel (filters/pixel)
o Arbitrary sized regions of feature maps
o Much faster than R-CNN
o SVM over the candidate feature maps
o Still not CDF ☹

(108)


Faster R-CNN [Ren et al., 2015]

Instead of complex search:

o Region proposals: different sized but rectangular regions on multiple scales (attention? ☺ )

o Still rigid

o End-to-end optimization: CDF ☺

o Very fast <200ms
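The "different sized but rectangular regions on multiple scales" can be sketched as a fixed set of candidate boxes around each location, together with the overlap measure used to match them to objects. The default scales and aspect ratios below are commonly quoted values and should be read as assumptions; the function names are made up for this sketch.

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Candidate boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:              # r = height / width, area stays s * s
            w = s * np.sqrt(1.0 / r)
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

boxes = anchors_at(300.0, 300.0)
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(boxes.shape, overlap)
```

The box shapes are fixed in advance ("still rigid"); the network only scores and refines them, which is what keeps the proposal step cheap.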

(109)


Faster R-CNN [Ren et al., 2015]

(110)


Detect: drones and their trajectories

R-CNN methods are not suitable

Time constraint: at most 100 ms to act

Computational constraints: Nvidia Tegra X1, but it is actually very slow (16x 7x7 convolution ~10 ms)

Idea: high-recall/low-precision detection with traditional models + MLP/CNN + LSTM with Fisher, and later with additional sensors

And now: project works ☺


