Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial framework
Consortium leader
PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund.
Feedforward Neural Networks
Digital and Neural Based Signal Processing & Kiloprocessor Arrays
Contents
• Introduction – topology
• Representation capability
• Blum and Li construction
• Generalization capabilities
• Bias variance dilemma
• Learning
• Applications
www.itk.ppke.hu
• Multilayer neural network
– Input layer
– Intermediate (hidden) layers
– Output layer
– The outputs of each layer are the inputs of the following layer
• Multiple inputs, multiple outputs
• Each layer contains a number of nonlinear perceptrons
Introduction – FFNN
• Feedforward neural networks (FFNNs) are used for
• Classification
– Supervised learning for classification
– Given inputs and class labels
• Approximation
– Arbitrary function with arbitrary precision
• Prediction
– "What is the next element of a given time series?"
[Figure: multilayer FFNN. The input signal (stimulus) x enters the input layer, passes through the layers l1, l2, l3, ..., and produces the output signal (response) y; $w_{11}^{(2)}$ marks an example weight.]
Topology
• Each cell (figure: a single neuron of the network)
• Weights: $w_{ij}^{(l)}$
– $l$-th layer
– $i$-th neuron in the $l$-th layer
– From the $j$-th neuron of the $(l-1)$-th layer
• Nonlinear activation function (logistic function, biologically motivated)
$$\varphi(u) = \frac{1}{1+e^{-u}} \qquad \text{or} \qquad \varphi(u) = \frac{2}{1+e^{-\lambda u}} - 1$$
[Figure: plots of the two activation functions for u between -10 and 10; the first takes values in (0, 1), the second in (-1, 1).]
Activation functions
• The parameter $\lambda$ of the sigmoid function may differ, as can be seen in the figure
$$\varphi(u) = \frac{2}{1+e^{-\lambda u}} - 1$$
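The two activation functions can be tried out numerically. The following is a minimal sketch in Python; the function names `logistic` and `bipolar_sigmoid` and the sample values of the slope parameter are illustrative assumptions, not part of the slides.

```python
import math

def logistic(u: float) -> float:
    """Unipolar logistic function: phi(u) = 1 / (1 + exp(-u)), values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def bipolar_sigmoid(u: float, lam: float = 1.0) -> float:
    """Bipolar sigmoid: phi(u) = 2 / (1 + exp(-lam * u)) - 1, values in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-lam * u)) - 1.0

if __name__ == "__main__":
    # A larger lambda gives a steeper transition around u = 0.
    for lam in (0.5, 1.0, 5.0):
        row = [round(bipolar_sigmoid(u, lam), 3) for u in (-2.0, 0.0, 2.0)]
        print(f"lambda={lam}: {row}")
```

For large values of the slope parameter the bipolar sigmoid approaches the hard sign nonlinearity, which is the limit case used in the threshold perceptron.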
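• Output of the network
$$y = Net(\mathbf{x}, \mathbf{W}) = \varphi\left(\sum_{i} w_{i}^{(L)}\, \varphi\left(\sum_{j} w_{ij}^{(L-1)} \cdots \varphi\left(\sum_{m} w_{km}^{(1)} x_m\right)\right)\right)$$
• Where
$$\mathbf{W} = \left(w_{1,0}^{(1)}, w_{1,1}^{(1)}, w_{1,2}^{(1)}, \ldots, w_{1,0}^{(2)}, w_{1,1}^{(2)}, \ldots, w_{1,0}^{(L)}, \ldots\right)$$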
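The nested output formula is simply a layer-by-layer evaluation. A minimal Python sketch follows, under two assumptions that are not stated at this point of the slides: the logistic activation is used in every layer, and each neuron's bias weight $w_{i0}$ multiplies a constant input of -1 (as the "-1" inputs in the later figures suggest). The weight values in the usage lines are hypothetical.

```python
import math

def phi(u: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, layers):
    """Evaluate Net(x, W) layer by layer.

    layers is a list of weight matrices, one per layer; row i of a matrix is
    [w_i0, w_i1, ..., w_in], and neuron i outputs phi(sum_j w_ij * y_j - w_i0).
    """
    y = list(x)
    for W in layers:
        # The outputs of one layer become the inputs of the next layer.
        y = [phi(sum(w * v for w, v in zip(row[1:], y)) - row[0]) for row in W]
    return y

if __name__ == "__main__":
    hidden = [[0.5, -0.3], [0.5, 0.6]]   # hypothetical 2 hidden neurons, 1 input
    output = [[0.5, 0.5, 0.4]]           # hypothetical 1 output neuron
    print(forward([1.0], [hidden, output]))
```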
FFNN – weights
• The free parameters are called weights
• They can be changed in the course of the adaptation (learning) process in order to "tune" the network to perform a specific task
• This learning procedure will be discussed later
• When solving an engineering task with an FFNN we are faced with the following questions:
1. Representation
– How many different tasks can be represented by an FFNN?
2. Learning
– How to set the weights to solve a specific given task?
3. Generalization
– If only limited knowledge is available about the task to be solved, how is the FFNN going to generalize this knowledge?
FFNN in operation
• The neural network works as follows
• The network is created according to the specification
• The weights of the network are set so that the error of the network is minimal
• The weights are set using the training sequence
$$\tau^{(K)} = \left\{ (\mathbf{x}_k, d_k);\ k = 1, \ldots, K \right\}$$
• Learning is led by the error function, which determines the adaptation of the weights of the neural network on the error surface
• In most cases the error function is chosen to be the squared error
• The adaptation of the weights can be done by different methods
• Usually the gradient descent method is used
• In simple problems the error function is a quadratic function
• It has only one minimum, so the convergence to the global minimum can be guaranteed with a suitable step size
FFNN – Example of error function
• Possible quadratic error surface
• The learning task is to find the global minimum
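As an illustration of gradient descent on a quadratic error surface, here is a minimal Python sketch. The error function $E(\mathbf{w}) = (w_1 - 1)^2 + 2(w_2 + 0.5)^2$, the starting point, the step size and the number of steps are all hypothetical choices made for this example only.

```python
def error(w):
    """Hypothetical quadratic error surface with a single minimum at (1, -0.5)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def error_grad(w):
    """Gradient of the error surface above."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)]

def gradient_descent(grad, w0, eta, steps):
    """Repeat w <- w - eta * grad(w) for a fixed number of steps."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

if __name__ == "__main__":
    w_final = gradient_descent(error_grad, [3.0, 2.0], eta=0.1, steps=200)
    print(w_final, error(w_final))
```

With this step size each coordinate of the distance to the minimum shrinks by a constant factor per step, so the iteration approaches the global minimum from any starting point.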
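• In the following, the representation capability of the FFNN will be discussed.
• We seek the function space $\mathcal{F}$ in which the FFNN approximation is uniformly dense
$$\mathcal{NN} \subseteq_D \mathcal{F}, \qquad F(\mathbf{x}) \in \mathcal{F}, \qquad Net(\mathbf{x}, \mathbf{w}) \in \mathcal{NN}$$
• (The symbol $\subseteq_D$ denotes the fact that $\mathcal{NN}$ is uniformly dense in $\mathcal{F}$)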
Representation
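• In this function space every function can be approximated arbitrarily closely by an FFNN
$$\forall F(\mathbf{x}) \in \mathcal{F},\ \forall \varepsilon > 0 \ \rightarrow\ \exists \mathbf{w}:\ \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\| < \varepsilon$$
• The notation $\|\cdot\|$ denotes the norm used in the space $\mathcal{F}$
• For example, in $L_p$ the error is computed as follows
$$\int \cdots \int_X \left| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right|^p dx_1 \ldots dx_N < \varepsilon$$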
Theorem (Hornik, Stinchcombe, White 1989)
• FFNNs are uniformly dense in the $L_p$ space: $\mathcal{NN} \subseteq_D L_p$
• Recall:
$$L_p:\ \left\{ F:\ \int \cdots \int_X \left| F(\mathbf{x}) \right|^p dx_1 \ldots dx_N < \infty \right\}$$
$$L_2:\ \left\{ F:\ \int \cdots \int_X F(\mathbf{x})^2\, dx_1 \ldots dx_N < \infty \right\}$$
Representation – Theorem 1
Theorem (Hornik, Stinchcombe, White 1989)
• In other words, every function in $L_p$ can be approximated arbitrarily closely by a neural net
• More precisely, for each $F(\mathbf{x}) \in L_p$
$$\forall \varepsilon > 0,\ \exists \mathbf{w}:\ \int \cdots \int_X \left| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right|^p dx_1 \ldots dx_N < \varepsilon$$
Theorem (Hornik, Stinchcombe, White 1989)
• Since $L_p$ is a rather large space, the theorem implies that almost any engineering task can be solved by a neural network with one hidden layer
• The proof of the theorem draws heavily on functional analysis and is based on the Hahn-Banach theorem.
• Since it is outside the focus of the course, this proof will not be presented here.
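Representation – Blum and Li theorem
Theorem (Blum and Li)
• FFNNs are uniformly dense in the $L_2$ space: $\mathcal{NN} \subseteq_D L_2$
• In other words:
• For each $F(\mathbf{x}) \in L_2$
$$\forall \varepsilon > 0,\ \exists \mathbf{w}:\ \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N < \varepsilon$$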
• Theorem: $\mathcal{NN} \subseteq_D L_2$
• Proof:
• Using the step functions: $\mathcal{S}$
• From elementary integral theory it is clear that $\mathcal{S}$ is uniformly dense in $L_1$, namely every function in $L_1$ can be approximated by an appropriate step function (figure)
Representation – Blum and Li theorem
• This step function can have arbitrarily narrow steps
• For example, each step could be divided into two sub-steps
• Therefore
$$F(\mathbf{x}) \cong \sum_i F(\mathbf{x}_i)\, I_i(\mathbf{x}), \qquad I_i(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in X_i \\ 0 & \text{else} \end{cases}$$
• These steps partition the domain of the function
• One partition can be easily represented by a small neural network
• In two dimensions the following figure gives an example
• The borders of the partition are hyperplanes
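The step-function approximation $F(\mathbf{x}) \cong \sum_i F(\mathbf{x}_i) I_i(\mathbf{x})$ can be illustrated in one dimension. The sketch below is an assumption-laden example: it uses equal-width tiles, takes the tile midpoint as $x_i$, and measures the error on a grid of sample points; none of these choices comes from the slides.

```python
import math

def step_approx(F, a, b, n):
    """Step function with n equal-width tiles on [a, b]; each tile takes the
    value of F at its midpoint (the midpoint choice is an assumption)."""
    width = (b - a) / n
    values = [F(a + (i + 0.5) * width) for i in range(n)]

    def S(x):
        # Index of the tile containing x; the indicator I_i(x) is 1 only there.
        i = min(max(int((x - a) / width), 0), n - 1)
        return values[i]

    return S

def max_error(F, S, a, b, samples=1000):
    """Largest deviation |F(x) - S(x)| on an evenly spaced sample grid."""
    xs = [a + (b - a) * k / samples for k in range(samples)]
    return max(abs(F(x) - S(x)) for x in xs)

if __name__ == "__main__":
    for n in (10, 100):
        S = step_approx(math.sin, 0.0, math.pi, n)
        print(n, max_error(math.sin, S, 0.0, math.pi))
```

Narrowing the steps reduces the error, which is the property the construction relies on.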
Representation – Blum and Li theorem
• Now every partition can be represented by a corresponding network of the form
$$\operatorname{sgn}\left( \sum_j a_j \operatorname{sgn}\left( \sum_i b_{ij} x_i \right) \right)$$
• Therefore the whole function F(x) can be approximated by the FFNN
• In the following slides a constructive approximation method will be introduced
Blum and Li construction
• The Blum and Li construction is based on the "LEGO" principle
• The approximation of the function F is based on its step function
• Let us have a step function with n steps
• This step function partitions the domain of the original function F
• For each partition there is a neuron responsible for approximating the "step"
• If the input x of the FFNN falls into a given range, the appropriate approximator neuron has to be selected
• The output of the network should be this selected value
1. An arbitrary x value comes in
2. The appropriate interval is selected
3. The response of the network is the response of the selected neuron (approximator)
Blum and Li construction
• This construction …
• … has no dimensional limits
• … has no equidistance restrictions on tiles (partitions)
• … can be further refined, and the approximation can be arbitrarily precise
• 2-dimensional example
• The tiles are the tops of the columns for each approximation cell
• Construction for one particular region
• The output is $I_1$ if we are in this region
[Figure: the inputs $x_1, x_2, \ldots, x_M$ and a constant $-1$ input feed first-layer summing neurons with weights $w_{11}^{(1)}, w_{12}^{(1)}, \ldots, w_{1M}^{(1)}$ and $w_{10}^{(1)}$; their outputs feed an AND neuron, whose output is $I_1(\mathbf{x})$.]
Blum and Li construction
• Construction for one particular region
• The output is $I_2$ if we are in this region
[Figure: the same block for the second region, with output $I_2(\mathbf{x})$.]
Blum and Li construction
• Third layer
• This neuron has a linear activation function
• The weights of this neuron are the approximation values of the function F
• The output of each block marked with a different color is zero or one, depending on whether the input is in the specified region
• Thus the approximation over the whole domain of the original function F is done by the FFNN
[Figure: the block outputs $I_1(\mathbf{x}), \ldots, I_M(\mathbf{x})$ feed a linear output neuron with weights $F(\mathbf{x}_1), \ldots, F(\mathbf{x}_M)$. Each region is approximated by a block as specified above: linear separators, one for each side of the region, followed by an AND perceptron with 0 or 1 output.]
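The three layers of the construction can be sketched for a one-dimensional input. This is a simplified variant written for illustration, not the exact network of the slides: the tile borders `edges` are hypothetical, each border gets a single shared separator neuron, and the AND neuron of a tile uses weights (+1, -1) with threshold 0.5 so that it fires only when the input is past the left border and not past the right one.

```python
def step(u: float) -> int:
    """Hard threshold nonlinearity with 0 or 1 output."""
    return 1 if u >= 0 else 0

def blum_li_1d(F, edges):
    """Build a three-layer network that approximates F by a step function.

    edges is an increasing list of tile borders; tile i is [edges[i], edges[i+1]).
    """
    centers = [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]
    out_weights = [F(c) for c in centers]   # third layer: values of F on the tiles

    def net(x):
        # First layer: one separator per border, shared by neighbouring tiles.
        s = [step(x - e) for e in edges]
        # Second layer: AND-type neuron per tile, output I_i(x) in {0, 1}.
        ind = [step(s[i] - s[i + 1] - 0.5) for i in range(len(centers))]
        # Third layer: linear neuron weighting the indicators with F(x_i).
        return sum(w * i for w, i in zip(out_weights, ind))

    return net

if __name__ == "__main__":
    net = blum_li_1d(lambda x: x * x, [-2.0, -1.0, 0.0, 1.0, 2.0])
    print([net(x) for x in (-1.5, -0.2, 0.3, 1.9)])
```

Sharing one separator per border reflects the remark that a hyperplane does not have to be represented more than once.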
• Minimizing the number of neurons
• We do not have to represent a hyperplane more than once
• Size of the FFNN ~ max ||grad F||
• If F has an input region where F is very sensitive, meaning that F changes very fast (the derivative is large), then we have to define the
Blum and Li examples
• 2D example and 3D example
• Weights – separator neurons
1. [ -1.875 -1 ]
2. [ -1.875 +1 ], [ -0.625 -1 ]
3. [ -0.625 +1 ], [ 0.625 -1 ]
4. [ 0.625 +1 ], [ 1.875 -1 ]
5. [ 1.875 +1 ]
• AND neurons: [ 0.5 1 ] or [ 1.5 1 1 ]
• Linear neuron in output layer:
Blum and Li in general
• The partitioning of the domain may be arbitrary
• Let us consider the 2D plane as the domain of the function F
• The following partitioning can be used:
• The construction works as shown previously, but it has its limitations
• The size of the FFNN constructed via this method is quite big
• Consider the task in the picture, where 1000 by 1000 cells are used to approximate the function
• In the optimal case 3003 neurons are needed
• (non-optimal: ~4 million)
• Smoother approximation needs more
Learning
• The Blum and Li construction is not always applicable; therefore we seek a method that trains the neural network so that it approximates an arbitrary function
• The function F is only partially known
• The function F behaves as a black box
• The task is to find a w which minimizes the difference between F and the network:
$$\mathbf{w}_{opt}:\ \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N$$
• This minimization cannot be carried out directly
• Complete information about F(x) would be needed
• Instead of using F(x), weak learning is performed in an incomplete environment
• A training set is constructed from observations
$$\tau^{(K)} = \left\{ (\mathbf{x}_k, d_k);\ k = 1, \ldots, K \right\}$$
Learning
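• The error of the network (the square of the difference between the output and the desired output) is minimal
• The approximation is the best achievable
$$\mathbf{w}_{opt}:\ \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N$$
• We cannot do this due to the limited information on F; instead we seek:
$$\mathbf{w}_{opt}^{(K)}:\ \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$
[Figure: block diagram of learning. The input $\mathbf{x}_k$ is fed to both the unknown system F(.) and the FFNN; the FFNN output $y_k$ is subtracted from the desired output $d_k$, and the resulting error signal $\varepsilon_k$ drives the weights towards $\mathbf{w}_{opt}$.]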
Learning
• The questions are the following
• What is the relationship between these optimal weights?
$$\mathbf{w}_{opt}^{(K)} \overset{???}{\Leftrightarrow} \mathbf{w}_{opt}$$
• How can this new objective function be minimized as quickly as possible?
$$\mathbf{w}_{opt}^{(K)}:\ \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$
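• Empirical error
$$R_{emp}(\mathbf{w}) = \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$
• Theoretical error
$$\left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N$$
• Let $\mathbf{x}_k$ be random variables subject to uniform distribution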
Statistical learning theory
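• $\mathbf{x}_k$ is a random variable, where $d = F(\mathbf{x})$
$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}) \right)^2 = \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 p(\mathbf{x})\, dx_1 \ldots dx_N$$
$$= \frac{1}{|X|} \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N \sim \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N$$
• Because $p(\mathbf{x}) = 1/|X|$ is constant due to the uniformity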
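• Therefore
$$\lim_{K \to \infty} R_{emp}(\mathbf{w}) = R_{th}(\mathbf{w})$$
$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \ldots dx_N$$
$$\underset{K \to \infty}{\mathrm{l.i.m.}}\ \mathbf{w}_{opt}^{(K)} = \mathbf{w}_{opt}$$
• Where l.i.m. means limit in mean
• The question is how to set K so that $\mathbf{w}_{opt}^{(K)}$ is close to $\mathbf{w}_{opt}$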
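The convergence of the empirical error to the theoretical error can be observed on a toy case. In the sketch below everything specific is a hypothetical choice: the unknown function is $F(x) = x^2$ on $X = [0, 1]$, the "network" is the fixed function $Net(x) = x$, and the inputs are drawn uniformly, so that $|X| = 1$ and the theoretical error is $\int_0^1 (x^2 - x)^2 dx = 1/30$.

```python
import random

def F(x: float) -> float:
    """Hypothetical unknown function."""
    return x * x

def net(x: float) -> float:
    """Hypothetical fixed network output (the weights are not adapted here)."""
    return x

R_TH = 1.0 / 30.0   # integral of (x^2 - x)^2 over [0, 1]

def empirical_error(K: int, rng: random.Random) -> float:
    """R_emp = (1/K) * sum_k (d_k - Net(x_k))^2 with x_k uniform on [0, 1]."""
    total = 0.0
    for _ in range(K):
        x = rng.random()
        d = F(x)                    # desired output observed from the system
        total += (d - net(x)) ** 2
    return total / K

if __name__ == "__main__":
    rng = random.Random(0)
    for K in (10, 1000, 100000):
        print(K, empirical_error(K, rng), R_TH)
```

The deviation between the two errors shrinks as K grows, which is what the limit above expresses.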
Bias – variance dilemma
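• Size of the NN ↔ size of the training set, K
• The size of the neural network is the number of weights
• K is the size of the training set
• Let us investigate the difference:
$$E\left( d - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2$$
• Where $\mathbf{w}_{opt}^{(K)}$ is obtained by minimizing the empirical error $R_{emp}$
• One can then write (adding and subtracting the same term)
$$E\left( d - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) + Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2$$
• Therefore
$$E\left( d - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) \right)^2 + E\left( Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2$$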
Bias – variance dilemma
• Remarks
• The other (cross) terms in the expression above become zero
• The first term in the expression above is the approximation error between F(x) and Net(x,w)
• The second term is the error resulting from the finite training set
$$E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) \right)^2 + E\left( Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left(\mathbf{x}, \mathbf{w}_{opt}^{(K)}\right) \right)^2$$
• One can choose between the following options
• either minimize the first term (which is referred to as bias) with a relatively large network; in this case, with a training set of limited size, the weights cannot be trained correctly by learning, so the second term will be large
• or minimize the second term (called variance), which needs a small network or a large training set; however, a small network makes the first term large
• Conclusion
• There is a dilemma between bias and variance
• This gives rise to the question of how to set the size of the training set so that it strikes a good balance between bias and variance.
VC dimension
• Question: how to set the size of the training set so that it strikes a good balance between bias and variance?
• We know the theoretical and the empirical error. The question is: what is the probability that the difference of these errors is greater than a given constant?
$$P\left( R_{th}(\mathbf{w}_{opt}) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) \geq \varepsilon \right)$$
• Furthermore, this probability must be minimized
$$P\left( R_{th}(\mathbf{w}_{opt}) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) > \varepsilon \right) \leq \Psi(\varepsilon, K)$$
• We seek this function $\Psi$
• Replacing the optimal weight vectors:
$$P\left( \min_{\mathbf{w}} R_{th}(\mathbf{w}) - \min_{\mathbf{w}} R_{emp}(\mathbf{w}) > \varepsilon \right) \leq \Psi(\varepsilon, K)$$
• To have such a result, we have to introduce a stronger bound on the convergence, called uniform convergence
VC dimension
• Uniform convergence
$$\forall \varepsilon > 0,\ \forall \alpha > 0:\quad P\left( \sup_{\mathbf{w} \in W} \left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) \right) > \varepsilon \right) < \alpha$$
• Which enforces that for all $\mathbf{w} \in W$
$$P\left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) > \varepsilon \right) < \alpha$$
• If this uniform convergence holds, then the necessary size of the learning set can be estimated
• Vapnik and Chervonenkis pioneered the work of revealing such bounds, and the basic parameter of this bound is called the VC dimension in honor of their achievements
• The following slides discuss the VC dimension
VC dimension
• Let us assume that we are given a Net(x,w) which we use for binary classification
• The VC dimension is related to the classification "power" of Net(x,w).
• More precisely, given the set of dichotomies expressed by Net(x,w) as
$$F_W = \left\{ \left(X^{(1)}, X^{(0)}\right):\ Net(\mathbf{x}, \mathbf{w}) = \begin{cases} 1 & \text{if } \mathbf{x} \in X^{(1)} \\ 0 & \text{if } \mathbf{x} \in X^{(0)} \end{cases},\ \ X^{(1)} \cup X^{(0)} = X,\ \ X^{(1)} \cap X^{(0)} = \emptyset,\ \ \mathbf{w} \in W \right\}$$
• The VC dimension of Net(x,w) is defined as the largest number of points for which every possible dichotomy can be expressed by Net(x,w)
• For example, let us consider the following elementary network: $Net(\mathbf{x}, \mathbf{w}) = \operatorname{sgn}\{\mathbf{w}^T \mathbf{x} - b\}$
• Its VC dimension is N + 1
• If N = 2, only 2 + 1 = 3 points (in general position) can be separated in every possible way on a 2D plane.
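The claim for N = 2 can be checked by brute force. The sketch below is an illustration under stated assumptions: the point sets are hypothetical, and linear separability is tested with the perceptron learning rule run for a bounded number of epochs, so a negative answer only means that no separating line was found within that bound (for a genuinely non-separable labeling, such as XOR, no line exists at all).

```python
from itertools import product

def separable(points, labels, epochs=1000):
    """Try to find (w, b) with sgn(w^T x - b) matching labels in {-1, +1}."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(points, labels):
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
            if out != t:
                # Perceptron update on the augmented input (x, -1).
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b -= t
                errors += 1
        if errors == 0:
            return True
    return False

def shattered(points):
    """True if every dichotomy of the points was realized by the perceptron."""
    return all(separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

if __name__ == "__main__":
    print(shattered([(0, 0), (1, 0), (0, 1)]))           # three points
    print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # four points (includes XOR)
```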
VC dimension
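• VC dimension in general
• Consider the following theoretical and empirical errors, and the given relations
$$R_{th}\left(\mathbf{w}_{opt}^{(K)}\right) \geq R_{th}(\mathbf{w}_{opt}), \qquad R_{emp}(\mathbf{w}_{opt}) \geq R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right)$$
• We also know
$$R_{emp}(\mathbf{w}_{opt}) - R_{th}(\mathbf{w}_{opt}) < \varepsilon, \qquad R_{th}\left(\mathbf{w}_{opt}^{(K)}\right) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) < \varepsilon$$
• Therefore
$$R_{th}(\mathbf{w}_{opt}) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) \leq R_{th}\left(\mathbf{w}_{opt}^{(K)}\right) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) \leq \varepsilon$$
• Vapnik states the following
$$P\left( \sup_{\mathbf{w} \in W} \left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) \right) > \varepsilon \right) < \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$
• Combining
$$P\left( R_{th}\left(\mathbf{w}_{opt}^{(K)}\right) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) > \varepsilon \right) < \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$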
VC dimension
• VC dimension result
$$P\left( R_{th}\left(\mathbf{w}_{opt}^{(K)}\right) - R_{emp}\left(\mathbf{w}_{opt}^{(K)}\right) > \varepsilon \right) < \alpha$$
• To set the constant properly
$$\alpha = \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$
• Therefore the optimal size of the training set is driven by the VC dimension
• Value of the $V_c$ parameter
• If we apply hard nonlinearity in the neural network
$$V_c = O\left( W \log_2 W \right)$$
• If we apply soft nonlinearity
$$V_c = O\left( W^2 \right)$$
• Where W is the number of weights in the neural network
Learning – in practice
• Learning based on the training set:
$$\tau^{(K)} = \left\{ (\mathbf{x}_k, d_k);\ k = 1, \ldots, K \right\}$$
• Minimize the empirical error function ($R_{emp}$)
$$\mathbf{w}_{opt}^{(K)}:\ \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = \min_{\mathbf{w}} R_{emp}(\mathbf{w})$$
• Learning is a multivariate optimization task
• Gradient descent method
• In each step, using the learning set, we modify the weights of the neurons in the layers in order to minimize the error
• To do this, the empirical error of the actual neuron is computed and the gradient of this error is used to modify the weights
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta \cdot \operatorname{grad}_{\mathbf{w}} \left\{ R_{emp}(\mathbf{w}(k)) \right\}$$
Learning
• The Rosenblatt algorithm is inapplicable, because we do not know the error and the desired output in the hidden layers of the FFNN
• Somehow the error of the whole network has to be distributed to the internal neurons, in a feedback way
[Figure: forward propagation of function signals and back-propagation of error signals]
• Adapting the weights of the FFNN
$$w_{ij}^{(l)}(k+1) = w_{ij}^{(l)}(k) + \Delta w_{ij}^{(l)}(k), \qquad \Delta w_{ij}^{(l)}(k) = ?$$
• The weights are modified in the direction of the negative gradient of the error function:
$$\Delta w_{ij}^{(l)} = -\eta \frac{\partial R_{emp}}{\partial w_{ij}^{(l)}}$$
• The elements of the training set are presented to the FFNN sequentially
Sequential back propagation
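• Consider the following FFNN
[Figure: a network with inputs $x_1, x_2$, two hidden neurons with outputs $y_1^{(1)}, y_2^{(1)}$ (weights $w_{11}^{(1)}, w_{12}^{(1)}, w_{21}^{(1)}, w_{22}^{(1)}$, biases $w_{10}^{(1)}, w_{20}^{(1)}$), and one output neuron with output $y$ (weights $w_{11}^{(2)}, w_{12}^{(2)}$, bias $w_{10}^{(2)}$). Each neuron forms the sum $I_i^{(l)}$ and applies $\varphi$.]
• Error function
$$E = \left( d(\mathbf{x}) - y(\mathbf{x}) \right)^2$$
• Adapting the bias of the neuron in the output layer
$$\frac{\partial R_{emp}}{\partial w_{10}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}}$$
• Where, using the error function above as the empirical error,
$$\frac{\partial R_{emp}}{\partial y} = -2\left( d(\mathbf{x}) - y(\mathbf{x}) \right); \qquad \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}} = -1$$
$$\frac{\partial y}{\partial I_1^{(2)}} = \frac{\partial \varphi\left(I_1^{(2)}\right)}{\partial I_1^{(2)}} = \varphi'\left(I_1^{(2)}\right) = ?$$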
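• Activation function
$$\varphi(u) = \frac{1}{1 + e^{-u}}$$
[Figure: plot of $\varphi(u)$ against $u$]
• The derivative of this function
$$\varphi'(u) = \frac{\partial}{\partial u} \frac{1}{1 + e^{-u}} = \frac{e^{-u}}{\left(1 + e^{-u}\right)^2} = \frac{1}{1 + e^{-u}} \cdot \frac{e^{-u}}{1 + e^{-u}} = \varphi(u)\left(1 - \varphi(u)\right)$$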
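The identity $\varphi'(u) = \varphi(u)(1 - \varphi(u))$ is easy to confirm numerically. A minimal sketch, in which the central-difference helper `numeric_derivative` and its step size are illustrative assumptions:

```python
import math

def phi(u: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-u))

def phi_prime(u: float) -> float:
    """Closed-form derivative: phi(u) * (1 - phi(u))."""
    y = phi(u)
    return y * (1.0 - y)

def numeric_derivative(f, u: float, h: float = 1e-6) -> float:
    """Central finite difference approximation of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

if __name__ == "__main__":
    for u in (-3.0, 0.0, 2.0):
        print(u, phi_prime(u), numeric_derivative(phi, u))
```

Because the derivative is expressed through the neuron output itself, back-propagation needs no extra evaluations of the exponential.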
Sequential back propagation
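• Using the previous result for the derivative of the activation function
$$\frac{\partial y}{\partial I_1^{(2)}} = \frac{\partial \varphi\left(I_1^{(2)}\right)}{\partial I_1^{(2)}} = \varphi'\left(I_1^{(2)}\right) = y(1 - y)$$
$$\frac{\partial R_{emp}}{\partial w_{10}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}} = 2(d - y)\, y (1 - y)$$
• Modifying the weight
$$\Delta w_{10}^{(2)} = -\eta \frac{\partial R_{emp}}{\partial w_{10}^{(2)}}$$
$$w_{10}^{(2)}(k+1) = w_{10}^{(2)}(k) + \Delta w_{10}^{(2)}(k) = w_{10}^{(2)}(k) - \eta \cdot 2(d - y)\, y(1 - y)$$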
• Adapting the weights of the neuron in the output layer
$$\frac{\partial R_{emp}}{\partial w_{11}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{11}^{(2)}} = -2(d - y)\, y(1 - y)\, y_1^{(1)}$$
$$w_{11}^{(2)}(k+1) = w_{11}^{(2)}(k) + \Delta w_{11}^{(2)}(k) = w_{11}^{(2)}(k) + \eta \cdot 2(d - y)\, y(1 - y)\, y_1^{(1)}$$
$$\frac{\partial R_{emp}}{\partial w_{12}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{12}^{(2)}} = -2(d - y)\, y(1 - y)\, y_2^{(1)}$$
$$w_{12}^{(2)}(k+1) = w_{12}^{(2)}(k) + \Delta w_{12}^{(2)}(k) = w_{12}^{(2)}(k) + \eta \cdot 2(d - y)\, y(1 - y)\, y_2^{(1)}$$
Sequential back propagation
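• Adapting the weights of the first neuron in the hidden layer
$$\frac{\partial R_{emp}}{\partial w_{10}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{10}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{11}^{(2)}\, y_1^{(1)}\left(1 - y_1^{(1)}\right) \cdot (-1)$$
$$\frac{\partial R_{emp}}{\partial w_{11}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{11}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{11}^{(2)}\, y_1^{(1)}\left(1 - y_1^{(1)}\right) \cdot x_1$$
$$\frac{\partial R_{emp}}{\partial w_{12}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{12}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{11}^{(2)}\, y_1^{(1)}\left(1 - y_1^{(1)}\right) \cdot x_2$$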
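• Adapting the weights of the second neuron in the hidden layer
$$\frac{\partial R_{emp}}{\partial w_{20}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{20}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{12}^{(2)}\, y_2^{(1)}\left(1 - y_2^{(1)}\right) \cdot (-1)$$
$$\frac{\partial R_{emp}}{\partial w_{21}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{21}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{12}^{(2)}\, y_2^{(1)}\left(1 - y_2^{(1)}\right) \cdot x_1$$
$$\frac{\partial R_{emp}}{\partial w_{22}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{22}^{(1)}} = -2(d - y)\, y(1 - y)\, w_{12}^{(2)}\, y_2^{(1)}\left(1 - y_2^{(1)}\right) \cdot x_2$$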
Steps of learning
1. Initialization
• Setting up the initial weights w, usually as random numbers
2. Assembling the training set
• The training set consists of pairs of inputs and desired outputs
3. Propagating the signal
• Compute the outputs of all neurons in the network
4. Back-propagating the error and updating the weights
$$\Delta w_{ij}^{(l)} = -\eta \frac{\partial R_{emp}}{\partial w_{ij}^{(l)}}$$
5. Repeating steps 3 and 4 for a new sample
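The five steps can be put together for the small network with two inputs, two hidden neurons and one output neuron. The following Python sketch is an illustration under stated assumptions: logistic activations without a slope parameter, bias weights multiplying a constant -1 input, and a hypothetical OR-like training set, random seed, learning rate and epoch count.

```python
import math
import random

def phi(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, W2):
    """W1 rows are [w_i0, w_i1, w_i2]; W2 is [w_10, w_11, w_12]; bias input is -1."""
    h = [phi(sum(w * v for w, v in zip(row[1:], x)) - row[0]) for row in W1]
    y = phi(sum(w * v for w, v in zip(W2[1:], h)) - W2[0])
    return h, y

def train_step(x, d, W1, W2, eta):
    """Steps 3 and 4: propagate one sample, back-propagate, update in place."""
    h, y = forward(x, W1, W2)
    delta = -2.0 * (d - y) * y * (1.0 - y)             # dE/dI of the output neuron
    # Hidden deltas are computed with the output weights before their update.
    deltas_h = [delta * W2[i + 1] * h[i] * (1.0 - h[i]) for i in range(len(h))]
    W2[0] -= eta * delta * (-1.0)
    for i in range(len(h)):
        W2[i + 1] -= eta * delta * h[i]
    for i, row in enumerate(W1):
        row[0] -= eta * deltas_h[i] * (-1.0)
        for j in range(len(x)):
            row[j + 1] -= eta * deltas_h[i] * x[j]

def mean_error(data, W1, W2):
    return sum((d - forward(x, W1, W2)[1]) ** 2 for x, d in data) / len(data)

def train(data, epochs=2000, eta=0.5, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]   # step 1
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    before = mean_error(data, W1, W2)
    for _ in range(epochs):
        for x, d in data:                                                  # step 5
            train_step(x, d, W1, W2, eta)
    return before, mean_error(data, W1, W2)

if __name__ == "__main__":
    # Step 2: hypothetical training set (an OR-like mapping).
    data = [([0.0, 0.0], 0.1), ([0.0, 1.0], 0.9), ([1.0, 0.0], 0.9), ([1.0, 1.0], 0.9)]
    print(train(data))
```

The weights are updated after every sample, which corresponds to the sequential variant of back-propagation.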
Numerical example – step 1 & 2
• Consider the following problem, initial states:
$$w_{11}^{(1)} = -0.3, \quad w_{21}^{(1)} = 0.6, \quad w_{11}^{(2)} = 0.5, \quad w_{12}^{(2)} = 0.4$$
$$w_{10}^{(1)} = w_{20}^{(1)} = w_{10}^{(2)} = 0.5, \qquad \eta = 1$$
$$\varphi(u) = \frac{1}{1 + e^{-\alpha u}}, \qquad \alpha = 2$$
$$\tau^{(3)} = \left\{ (1, 0.1), (2, 0.5), (3, 0.9) \right\}$$
• Propagating the signal
$$k = 1: \quad x_1 = 1, \quad d_1 = 0.1$$
$$y_1^{(1)} = \frac{1}{1 + e^{1.6}} = 0.1680$$
$$y_2^{(1)} = \frac{1}{1 + e^{-0.2}} = 0.5498$$
$$y = \frac{1}{1 + e^{0.3922}} = 0.4032$$
Numerical example – step 4
• Back-propagating and updating
• Output layer
$$\Delta w_{10}^{(2)}(k) = -\eta \cdot 2(d - y)\, y(1 - y)\, \alpha = -\eta \cdot 2(0.1 - 0.4032) \cdot 0.4032\,(1 - 0.4032) \cdot 2 = 0.2918$$
$$\Delta w_{11}^{(2)}(k) = -\eta \cdot (-2)(d - y)\, y(1 - y) \cdot \alpha \cdot y_1^{(1)} = \eta \cdot 2(0.1 - 0.4032) \cdot 0.4032\,(1 - 0.4032) \cdot 2 \cdot 0.1680 = -0.0490$$
$$\Delta w_{12}^{(2)}(k) = -\eta \cdot (-2)(d - y)\, y(1 - y) \cdot \alpha \cdot y_2^{(1)} = \eta \cdot 2(0.1 - 0.4032) \cdot 0.4032\,(1 - 0.4032) \cdot 2 \cdot 0.5498 = -0.1604$$
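The numbers of this example can be reproduced with a few lines of Python. The sketch below takes the initial weights, $\eta = 1$, $\alpha = 2$ and the first training pair from the slides; the variable names are illustrative, and the bias weights are assumed to multiply a constant -1 input, which is the convention that reproduces the slide values.

```python
import math

ALPHA = 2.0   # slope parameter of the activation function
ETA = 1.0     # learning rate

def phi(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-ALPHA * u))

# First training pair and initial weights from the example.
x, d = 1.0, 0.1
w10_1, w11_1 = 0.5, -0.3      # hidden neuron 1: bias, input weight
w20_1, w21_1 = 0.5, 0.6       # hidden neuron 2: bias, input weight
w10_2, w11_2, w12_2 = 0.5, 0.5, 0.4   # output neuron: bias and two weights

# Forward propagation.
y1 = phi(w11_1 * x - w10_1)
y2 = phi(w21_1 * x - w20_1)
y = phi(w11_2 * y1 + w12_2 * y2 - w10_2)

# Output-layer weight changes; the factor ALPHA comes from phi'(u) = ALPHA * y * (1 - y).
common = ETA * 2.0 * (d - y) * y * (1.0 - y) * ALPHA
dw10_2 = -common          # the bias input is -1
dw11_2 = common * y1
dw12_2 = common * y2

if __name__ == "__main__":
    print(f"y1={y1:.4f} y2={y2:.4f} y={y:.4f}")
    print(f"dw10={dw10_2:.4f} dw11={dw11_2:.4f} dw12={dw12_2:.4f}")
```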
• Back-propagating and updating
• Output layer – updating
$$w_{10}^{(2)}(1) = w_{10}^{(2)}(0) + \Delta w_{10}^{(2)}(0) = 0.5 + 0.2918 = 0.7918$$
$$w_{11}^{(2)}(1) = w_{11}^{(2)}(0) + \Delta w_{11}^{(2)}(0) = 0.5 - 0.0490 = 0.4510$$
( )2