
Academic year: 2022


(1)

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework**

Consortium leader

PETER PAZMANY CATHOLIC UNIVERSITY

Consortium members

SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER

The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***

**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben

***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.

(2)

Feedforward Neural Networks

Előrecsatolt neurális hálózatok

Digitális- neurális-, és kiloprocesszoros architektúrákon alapuló jelfeldolgozás

Digital- and Neural Based Signal Processing &

Kiloprocessor Arrays

(3)

Contents

• Introduction – topology

• Representation capability

• Blum and Li construction

• Generalization capabilities

• Bias variance dilemma

• Learning

• Applications

www.itk.ppke.hu

(4)

• Multilayer neural network

– Input layer

– Intermediate (hidden) layers – Output layer

– The outputs are the inputs of the following layer

• Multiple inputs, multiple outputs

• Each layer contains a number of nonlinear

perceptrons

(5)


Introduction – FFNN

• Feed Forward Neural Networks are used for

• Classification

• Supervised learning for classification

• Given inputs and class labels

• Approximation

• Arbitrary function with arbitrary precision

• Prediction

• „What is the next element of a given time series?”

(6)

[Figure: multilayer feedforward network. The input signal (stimulus) x enters at the input, passes through the layers l1, l2, l3, ..., and leaves at the output as the output signal (response) y.]

(7)

Topology

• Each cell: the $i$-th neuron in the $l$-th layer

• Weights: $w_{ij}^{(l)}$ is the weight of the $i$-th neuron in the $l$-th layer on the connection from the $j$-th neuron of the $(l-1)$-th layer

• Nonlinear activation function (logistic function, biologically motivated)

$$\varphi(u) = \frac{1}{1 + e^{-u}}, \qquad \varphi(u) = \frac{2}{1 + e^{-\lambda u}} - 1$$

(8)

[Figure: plots of the two activation functions for u between −10 and 10. The first takes values between 0 and 1, the second between −1 and 1.]

(9)

Activation functions

• The parameter $\lambda$ of the sigmoid function may take different values, as can be seen in the figure

$$\varphi(u) = \frac{2}{1 + e^{-\lambda u}} - 1$$
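The effect of the sigmoid parameter can be tried out with a minimal Python sketch. The function names `logistic` and `bipolar_sigmoid` and the sampled values of λ are illustrative assumptions, not part of the slides.

```python
import math

def logistic(u, lam=1.0):
    """Unipolar sigmoid 1 / (1 + exp(-lam * u)); values lie in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-lam * u))

def bipolar_sigmoid(u, lam=1.0):
    """Bipolar sigmoid 2 / (1 + exp(-lam * u)) - 1; values lie in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-lam * u)) - 1.0

# A larger lambda gives a steeper transition around u = 0.
for lam in (0.5, 1.0, 5.0):
    row = [round(bipolar_sigmoid(u, lam), 3) for u in (-2.0, 0.0, 2.0)]
    print("lambda =", lam, row)
```

Note that the bipolar form equals tanh(λu/2), so a large λ pushes it toward a hard sign function.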

(10)

• Output of the network

$$y = Net(\mathbf{x}, \mathbf{W}) = \varphi\left( \sum_{i} w_{i}^{(L)} \, \varphi\left( \sum_{j} w_{ij}^{(L-1)} \cdots \varphi\left( \sum_{m} w_{km}^{(1)} x_m \right) \cdots \right) \right)$$

• Where

$$\mathbf{W} = \left( w_{1,0}^{(1)}, w_{1,1}^{(1)}, w_{1,2}^{(1)}, \ldots, w_{1,0}^{(2)}, w_{1,1}^{(2)}, \ldots, w_{1,0}^{(L)}, \ldots \right)$$
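The nested formula for the network output can be read as a loop over layers. The sketch below is one possible reading, under two assumptions that are not stated on this slide: each neuron stores its weights as `[w_i0, w_i1, ..., w_in]`, and the bias weight `w_i0` acts on a constant input of −1. The helper name `forward` is hypothetical.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, layers, phi=logistic):
    """Evaluate Net(x, W) layer by layer.

    layers[l] is a list of neurons; each neuron is [w_i0, w_i1, ..., w_in].
    Assumption: w_i0 is a bias weight acting on a constant input of -1.
    """
    y = list(x)
    for layer in layers:
        # The outputs of one layer are the inputs of the following layer.
        y = [phi(sum(w * v for w, v in zip(neuron[1:], y)) - neuron[0])
             for neuron in layer]
    return y

# Hypothetical 2-2-1 network, weights chosen only for illustration.
layers = [
    [[0.5, -0.3, 0.8], [0.5, 0.6, -0.1]],  # hidden layer, two neurons
    [[0.5, 0.5, 0.4]],                     # output layer, one neuron
]
print(forward([1.0, 2.0], layers))
```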

(11)

FFNN – weights

• The free parameters of the network are called weights

• They can be changed in the course of the adaptation (learning) process in order to „tune” the network to perform a specific task

• This learning procedure will be discussed later

• When solving an engineering task with an FFNN, we are faced with the following questions:

(12)

1. Representation

– How many different tasks can be represented by an FFNN

2. Learning

– How to set up the weights to solve a specific given task

3. Generalization

– If only limited knowledge is available about the task

which is to be solved, then how the FFNN is going to

generalize this knowledge

(13)

FFNN in operation

• The neural network works as follows

• the network is created according to the specification

• the weights of the network are set so that the error of the network is minimal

• The weights are set using the training sequence

$$\tau^{(K)} = \{ (\mathbf{x}_k, d_k) ;\ k = 1, \ldots, K \}$$

• Learning is driven by the error function, which determines how the weights of the neural network are adapted on the error surface

(14)

• In most cases the error function is chosen to be the squared error

• the adaptation of the weights can be done by different methods

• usually the gradient descent method is used

• In simple problems the error function is a quadratic function

• It has only one minimum, so the convergence to the

(15)

FFNN – Example of error function

• Possible quadratic error surface

• The learning task is to find the global minimum
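A quadratic error surface with a single minimum can be illustrated with a linear neuron trained by gradient descent. The sketch below is an assumption-laden toy: the data set, the step size `eta`, and the helper names are invented for illustration and do not come from the slides.

```python
def squared_error(w, samples):
    """Mean squared error of the linear model y = w1 * x + w0."""
    w0, w1 = w
    return sum((d - (w1 * x + w0)) ** 2 for x, d in samples) / len(samples)

def gradient(w, samples):
    """Gradient of the mean squared error with respect to (w0, w1)."""
    w0, w1 = w
    k = len(samples)
    g0 = sum(-2.0 * (d - (w1 * x + w0)) for x, d in samples) / k
    g1 = sum(-2.0 * (d - (w1 * x + w0)) * x for x, d in samples) / k
    return g0, g1

def gradient_descent(samples, eta=0.1, steps=2000, w=(0.0, 0.0)):
    """Repeatedly step against the gradient of the error surface."""
    for _ in range(steps):
        g0, g1 = gradient(w, samples)
        w = (w[0] - eta * g0, w[1] - eta * g1)
    return w

# Hypothetical training set generated by d = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w_min = gradient_descent(samples)
print(w_min, squared_error(w_min, samples))
```

Because the error is quadratic in the weights here, every starting point leads to the same minimum, provided the step size is small enough.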


(16)

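• In the following the representation capability of the FFNN will be discussed.

• We seek the function space $\mathcal{F}$ in which the FFNN approximation is uniformly dense

$$\mathcal{F} \subseteq_D \mathcal{NN}, \qquad F(\mathbf{x}) \in \mathcal{F}, \quad Net(\mathbf{x}, \mathbf{w}) \in \mathcal{NN}$$

• (the symbol $\subseteq_D$ denotes the fact that $\mathcal{NN}$ is dense in $\mathcal{F}$)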

(17)

Representation

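• In this function space every function can be approximated arbitrarily closely by an FFNN

$$\forall F(\mathbf{x}) \in \mathcal{F},\ \forall \varepsilon > 0 \ \rightarrow\ \exists \mathbf{w} :\ \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\| < \varepsilon$$

• The notation $\| \cdot \|$ denotes a norm used in the space $\mathcal{F}$

For example, the error is computed in $L_p$ as follows:

$$\int \cdots \int_X \left| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right|^p dx_1 \cdots dx_N < \varepsilon$$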

(18)

Theorem (Hornik, Stinchcombe, White 1989)

The FFNNs are uniformly dense in the $L_p$ space

$$L_p \subseteq_D \mathcal{NN}$$

• Recall:

$$L_p : \int \cdots \int_X \left| F(\mathbf{x}) \right|^p dx_1 \cdots dx_N < \infty$$

$$L_2 : \int \cdots \int_X F^2(\mathbf{x}) \, dx_1 \cdots dx_N < \infty$$

(19)

Representation – Theorem 1

Theorem (Hornik, Stinchcombe, White 1989)

In other words, every function in $L_p$ can be approximated arbitrarily closely by a neural net

• More precisely, for each $F(\mathbf{x}) \in L_p$

$$\forall \varepsilon > 0,\ \exists \mathbf{w} :\ \int \cdots \int_X \left| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right|^p dx_1 \cdots dx_N < \varepsilon$$

(20)

Theorem (Hornik, Stinchcombe, White 1989)

Since $L_p$ is a rather large space, the theorem implies that almost any engineering task can be solved by a neural network with a single hidden layer

• The proof of the theorem draws heavily on functional analysis and is based on the Hahn-Banach theorem.

• Since it is outside the focus of the course, this proof will not be presented here.

(21)

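Representation – Blum and Li theorem

Theorem (Blum and Li)

The FFNNs are uniformly dense in the $L_2$ space

$$L_2 \subseteq_D \mathcal{NN}$$

• In other words:

• For each $F(\mathbf{x}) \in L_2$

$$\forall \varepsilon > 0,\ \exists \mathbf{w} :\ \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N < \varepsilon$$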

(22)

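• Theorem:

$$\forall \varepsilon > 0,\ \exists \mathbf{w} :\ \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N < \varepsilon$$

• Proof:

• Using the step functions: $\mathcal{S}$

• From elementary integral theory it is clear that $\mathcal{S}$ is uniformly dense in $L_1$ ($L_1 \subseteq_D \mathcal{S}$), namely every function in $L_1$ can be approximated by an appropriate step function (figure)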

(23)

Representation – Blum and Li theorem

• This step function can have arbitrarily narrow steps

• For example, each step could be divided into two sub-steps

• Therefore

$$F(\mathbf{x}) \cong \sum_i F(\mathbf{x}_i) \, I_{X_i}(\mathbf{x}), \qquad I_X(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in X \\ 0 & \text{else} \end{cases}$$
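The step-function approximation, a sum of function values weighted by indicator functions, can be sketched in one dimension as follows. The equal-width cells, the choice of the cell midpoint as the representative point, and the test function are assumptions made for illustration.

```python
def indicator(lo, hi):
    """Indicator function of the interval [lo, hi)."""
    return lambda x: 1.0 if lo <= x < hi else 0.0

def step_approximation(F, a, b, n):
    """Approximate F on [a, b) by n steps: sum_i F(x_i) * I_i(x)."""
    width = (b - a) / n
    cells = [(a + i * width, a + (i + 1) * width) for i in range(n)]
    # Each term pairs the value of F at the cell midpoint with the cell indicator.
    terms = [(F(0.5 * (lo + hi)), indicator(lo, hi)) for lo, hi in cells]
    return lambda x: sum(value * ind(x) for value, ind in terms)

def max_error(F, G, a, b, samples=1000):
    """Largest deviation between F and G on a grid of sample points in [a, b)."""
    xs = [a + (b - a) * j / samples for j in range(samples)]
    return max(abs(F(x) - G(x)) for x in xs)

F = lambda x: x * x
for n in (10, 100):
    print(n, max_error(F, step_approximation(F, 0.0, 1.0, n), 0.0, 1.0))
```

Dividing every step into narrower sub-steps shrinks the error, which is the refinement argument used in the proof.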

(24)

• These steps partition the domain of the function

• One partition can be easily represented by a small neural network

• In two dimensions the following figure gives an example

• The borders of the partition are hyperplanes which could

(25)

Representation – Blum and Li theorem

• Now, since every partition can be represented by a corresponding network of the form

$$\mathrm{sgn}\left( \sum_j a_j \, \mathrm{sgn}\left( \sum_i b_{ij} x_i \right) \right)$$

the whole function $F(\mathbf{x})$ can be approximated by the FFNN

• In the following slides a constructive approximation method will be introduced

(26)

• The Blum and Li construction is based on the

„LEGO” principle

• The approximation of the F function is based on its step function

Let us have a step function with n number of steps

(27)

Blum and Li construction

• This step function partitions the domain of the original F function

• For each partition there is a neuron responsible for approximating the „step”

• If the input of the FFNN (x) falls into a given range, the appropriate approximator neuron has to be selected

• The output of the network should be this selected value

(28)

1. An arbitrary x value arrives

2. The appropriate interval is selected

3. The response of the network is the response of the selected neuron (approximator)

(29)

Blum and Li construction

• This construction …

• … has no dimensional limits

• … has no equidistance restrictions on tiles (partitions)

• … can be refined further, so the approximation can be made arbitrarily precise

• 2-dimensional example

• The tiles are the tops of the columns for each approximation cell

(30)

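• Construction for one particular region

• The output is $I_1$ if we are in this region

[Figure: block for region 1. First-layer neurons (Σ) with weights $w_{10}^{(1)}, w_{11}^{(1)}, w_{12}^{(1)}, \ldots, w_{1M}^{(1)}$ and bias input −1 receive the inputs $x_1, x_2, \ldots, x_M$; an AND neuron with weights $w_{10}^{(2)}, w_{11}^{(2)}, w_{12}^{(2)}, \ldots, w_{1M}^{(2)}$ and bias input −1 combines their outputs into $I_1(\mathbf{x})$.]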

(31)

Blum and Li construction

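• Construction for one particular region

• The output is $I_2$ if we are in this region

[Figure: block for region 2. First-layer neurons (Σ) with weights $w_{10}^{(1)}, w_{11}^{(1)}, w_{12}^{(1)}, \ldots, w_{1M}^{(1)}$ and bias input −1 receive the inputs $x_1, x_2, \ldots, x_M$; each performs the linear separation for one side of the region. An AND perceptron with 0 or 1 output, with weights $w_{10}^{(2)}, w_{11}^{(2)}, w_{12}^{(2)}, \ldots, w_{1M}^{(2)}$ and bias input −1, combines their outputs into $I_2(\mathbf{x})$.]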

(32)

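is being approximated by a block specified above

[Figure: output neuron (Σ) combining the block outputs $I_1(\mathbf{x}), \ldots$ with the weights $F(\mathbf{x}_1), \ldots, F(\mathbf{x}_M)$; every block receives the inputs $x_1, x_2, \ldots$]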

(33)

Blum and Li construction

• Third layer

• This neuron has a linear activation function

• The weights of this neuron are the approximation values of the function F

• The output of each block (marked with different colors) is one if the input is in the specified region and zero otherwise

• Thus the approximation over the whole domain of the original function F is done by the FFNN

[Figure: linear output neuron (Σ) combining the block outputs $I_1(\mathbf{x}), \ldots$ with the weights $F(\mathbf{x}_1), \ldots, F(\mathbf{x}_M)$; every block receives the inputs $x_1, x_2, \ldots, x_M$.]
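The three-layer construction (separator neurons, AND neurons, linear output neuron) can be sketched for a one-dimensional input. This is a simplified illustration under stated assumptions: the interval borders, the representative point chosen for each interval, and the helper name `blum_li_1d` are hypothetical, and a reversed separator is written as `1 - s` for brevity.

```python
def step(u):
    """Hard threshold neuron output: 1 if u >= 0, else 0."""
    return 1.0 if u >= 0.0 else 0.0

def blum_li_1d(F, borders):
    """Build a step-function approximator of F from threshold neurons.

    borders: increasing list t_1 < ... < t_m (m >= 2) defining m + 1 intervals.
    Assumption: each interval is represented by its midpoint; the two
    unbounded end intervals use a point half a cell beyond the outer border.
    """
    m = len(borders)
    reps = [borders[0] - 0.5 * (borders[1] - borders[0])]
    reps += [0.5 * (borders[i] + borders[i + 1]) for i in range(m - 1)]
    reps += [borders[-1] + 0.5 * (borders[-1] - borders[-2])]
    values = [F(r) for r in reps]          # weights of the linear output neuron

    def net(x):
        # Layer 1: separator neurons, s[j] = 1 iff x >= t_j.
        s = [step(x - t) for t in borders]
        # Layer 2: one AND neuron per interval (threshold 0.5 or 1.5).
        selected = []
        for i in range(m + 1):
            inputs = []
            if i > 0:
                inputs.append(s[i - 1])      # x is right of the left border
            if i < m:
                inputs.append(1.0 - s[i])    # x is left of the right border
            selected.append(step(sum(inputs) - (len(inputs) - 0.5)))
        # Layer 3: linear neuron, exactly one AND output is 1.
        return sum(v * sel for v, sel in zip(values, selected))

    return net

net = blum_li_1d(lambda x: x * x, [-1.875, -0.625, 0.625, 1.875])
print([net(x) for x in (-2.0, -1.0, 0.1, 1.0, 3.0)])
```

The AND thresholds follow from counting inputs: with one input the neuron fires at 0.5, with two inputs at 1.5, so it fires only when all its inputs are 1.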

(34)

• Minimizing the number of neurons

• We do not have to represent a hyperplane more than once

• size of FFNN ~ max||grad F||

• If F has an input where F is very sensitive, meaning that F changes very fast (the derivative is large), then we have to define the

(35)

Blum and Li examples

• 2D example and 3D example


(36)

• Weights – separator neurons

1. [ -1.875 -1 ]
2. [ -1.875 +1 ], [ -0.625 -1 ]
3. [ -0.625 +1 ], [ 0.625 -1 ]
4. [ 0.625 +1 ], [ 1.875 -1 ]
5. [ 1.875 +1 ]

• AND neurons: [ 0.5 1 ] or [ 1.5 1 1 ]

• Linear neuron in output layer:

(37)

Blum and Li in general

• The partitioning of the domain may be arbitrary

• Let us consider the 2D plane as the domain of the F function

• The following partitioning is possible to be used:


(38)

shown previously, but it has its limitations

• The size of the FFNN constructed via this method is quite large

• Consider the task in the picture, where 1000 by 1000 cells are used to approximate the function

• In the optimal case 3003 neurons are needed

• (non-optimal: ~4 million)

• Smoother approximation needs more

(39)

Learning

• The Blum and Li construction is not always applicable, therefore we seek a method that trains the neural network so that it approximates an arbitrary function

• The function F is only partially known

• The function F behaves as a black box

• The task is to find a w which minimizes the difference between F and the network:

$$\mathbf{w}_{opt} : \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

(40)

• This minimization task cannot be carried out directly: complete information about F(x) would be needed

$$\mathbf{w}_{opt} : \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

• Weak learning in an incomplete environment, instead of using F(x)

• A training set is constructed from observations

$$\tau^{(K)} = \{ (\mathbf{x}_k, d_k) ;\ k = 1, \ldots, K \}$$

(41)

Learning

• The error of the network (the square of the difference between the output and the desired output) is minimal

• The approximation is the best achievable

$$\mathbf{w}_{opt} : \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

• We cannot do this due to the limited information on F; instead we seek:

$$\mathbf{w}_{opt}^{(K)} : \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$

(42)

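[Figure: learning scheme. The input $\mathbf{x}_k$ is fed both to the unknown system F(.), which produces the desired output $d_k$, and to the FFNN, which produces the output $y_k$; their difference is the error signal $\varepsilon_k$ used to reach $\mathbf{w}_{opt}$.]

$$\mathbf{w}_{opt} : \min_{\mathbf{w}} \left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \min_{\mathbf{w}} \int \cdots \int \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$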

(43)

Learning

• The questions are the following

• What is the relationship between these optimal weights?

$$\mathbf{w}_{opt}^{(K)} \ \overset{???}{\longleftrightarrow}\ \mathbf{w}_{opt}$$

• How can this new objective function be minimized as quickly as possible?

$$\mathbf{w}_{opt}^{(K)} : \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$

(44)

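• Empirical error

$$R_{emp}(\mathbf{w}) = \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2$$

• Theoretical error

$$\left\| F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right\|^2 = \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

Let $\mathbf{x}_k$ be random variables subject to the uniform distribution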

(45)

Statistical learning theory

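$\mathbf{x}_k$ random variable, where $d = F(\mathbf{x})$

$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}) \right)^2 =$$

$$= \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 p(\mathbf{x}) \, dx_1 \cdots dx_N = \frac{1}{|X|} \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N \sim \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

The last step holds because $p(\mathbf{x}) = 1/|X|$ is constant due to the uniformity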
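The convergence of the empirical error to the theoretical error for uniformly distributed samples can be checked numerically. In the sketch below the target F(x) = x², the one-parameter model Net(x, w) = w·x, and the domain X = [0, 1) are hypothetical choices made so that the integral has a closed form; since |X| = 1 here, the density is p(x) = 1.

```python
import random

def F(x):
    return x * x              # hypothetical target function

def net(x, w):
    return w * x              # hypothetical one-parameter model

def empirical_error(w, K, rng):
    """(1/K) * sum_k (d_k - Net(x_k, w))^2 with x_k uniform on [0, 1)."""
    total = 0.0
    for _ in range(K):
        x = rng.random()
        d = F(x)              # desired output d = F(x)
        total += (d - net(x, w)) ** 2
    return total / K

def theoretical_error(w):
    """Integral of (x^2 - w x)^2 over [0, 1]: 1/5 - w/2 + w^2/3."""
    return 1.0 / 5.0 - w / 2.0 + w * w / 3.0

rng = random.Random(0)
for K in (10, 1000, 100000):
    print(K, empirical_error(1.0, K, rng), theoretical_error(1.0))
```

As K grows, the sample average settles near the value of the integral, which is the statement of the limit above.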

(46)

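• Therefore

$$\lim_{K \to \infty} R_{emp}(\mathbf{w}) = R_{th}(\mathbf{w})$$

$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = \int \cdots \int_X \left( F(\mathbf{x}) - Net(\mathbf{x}, \mathbf{w}) \right)^2 dx_1 \cdots dx_N$$

$$\underset{K \to \infty}{\mathrm{l.i.m.}} \ \mathbf{w}_{opt}^{(K)} = \mathbf{w}_{opt}$$

• Where l.i.m. means: limit in mean

The question is how to set K to have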

(47)

Bias – variance dilemma

• Size of the NN ↔ size of training set, K

• The size of the neural network is the number of weights

• K is the size of the training set

• Let us investigate the difference:

$$E\left( d - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2$$

• Where $\mathbf{w}_{opt}^{(K)}$ is obtained by minimizing the empirical error $R_{emp}$

(48)

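• One can then write (adding and subtracting the same term)

$$E\left( d - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) + Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2$$

• Therefore

$$E\left( d - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2 = E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) \right)^2 + E\left( Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2$$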

(49)

Bias – variance dilemma

• Remarks

• The other terms in the expression above become zero

• The first term in the expression above is the approximation error between F(x) and Net(x,w)

• The second term is the error resulting from the finite training set

• One can choose between the following options

• either minimizing the first term (which is referred to as bias) with a relatively large network; in this case, however, the weights cannot be trained correctly with a training set of limited size, so the second term will be large

(50)

• or minimizing the second term (called variance), which needs a small network relative to the size of the training set; a small network, however, makes the first term large

• Conclusion

• there is a dilemma between bias and variance

• This gives rise to the question of how to set the size of the training set so that it strikes a good balance between bias and variance.

$$E\left( d - Net(\mathbf{x}, \mathbf{w}_{opt}) \right)^2 + E\left( Net(\mathbf{x}, \mathbf{w}_{opt}) - Net\left( \mathbf{x}, \mathbf{w}_{opt}^{(K)} \right) \right)^2$$

VC dimension

Question: how to set the size of the training set so that it strikes a good balance between bias and variance?

• We know the theoretical and the empirical error

The question is: what is the probability that the difference of these errors is greater than a given constant?

$$P\left( R_{th}(\mathbf{w}_{opt}) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) \geq \varepsilon \right)$$

• Furthermore this probability must be minimized

$$P\left( R_{th}(\mathbf{w}_{opt}) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) > \varepsilon \right) \leq \Psi(\varepsilon, K)$$

(52)

• We seek this function $\Psi$

• Replacing the optimal weight vector:

$$P\left( \min_{\mathbf{w}} R_{th}(\mathbf{w}) - \min_{\mathbf{w}} R_{emp}(\mathbf{w}) > \varepsilon \right) \leq \Psi(\varepsilon, K)$$

• To obtain such a result, we have to introduce a stronger notion of convergence, called uniform convergence

(53)

VC dimension

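• Uniform convergence

$$\forall \varepsilon > 0,\ \forall \alpha > 0 :\ P\left( \sup_{\mathbf{w} \in W} \left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) \right) > \varepsilon \right) < \alpha$$

Which enforces that for every other $\mathbf{w}$

$$P\left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) > \varepsilon \right) < \alpha$$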

(54)

• If this uniform convergence holds then the necessary size of learning set can be estimated

• Vapnik and Chervonenkis pioneered the work in

revealing such bounds and the basic parameter of this bound is called VC dimension to honor their

achievements

• Following slides will discuss this VC dimension

(55)

VC dimension

Let us assume that we are given a Net(x,w) which we use for binary classification

• The VC dimension is related to the classification “power” of Net(x,w).

• More precisely, consider the set of dichotomies realized by Net(x,w):

$$\mathcal{F}_{Net} = \left\{ \left( X^{(1)}, X^{(0)} \right) :\ Net(\mathbf{x}, \mathbf{w}) = \begin{cases} 1 & \text{if } \mathbf{x} \in X^{(1)} \\ 0 & \text{if } \mathbf{x} \in X^{(0)} \end{cases},\ \ \mathbf{w} \in W,\ \ X^{(1)} \cup X^{(0)} = X,\ \ X^{(1)} \cap X^{(0)} = \emptyset \right\}$$

(56)

The VC dimension of Net(x,w) is defined as the largest number of points for which every possible dichotomy can be expressed by Net(x,w)

• For example, let us consider the following elementary network: $Net(\mathbf{x}, \mathbf{w}) = \mathrm{sgn}\{ \mathbf{w}^T \mathbf{x} - b \}$

Its VC dimension is N + 1

If N = 2, at most 2 + 1 = 3 points can be separated in every possible way on a 2D plane.
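The claim for N = 2 can be probed with a small experiment: a single threshold neuron realizes all 8 labelings of three points in general position, but fails on the XOR labeling of four points. The sketch below uses Rosenblatt's perceptron rule with an iteration cap as a separability test; the cap, the point sets, and the function names are assumptions, and a failure to converge within the cap is only evidence of non-separability (for XOR it is a known fact). Checking one four-point set does not by itself prove that no four points can be shattered.

```python
from itertools import product

def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find w, b with sgn(w.x - b) equal to the labels (+1 / -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(points, labels):
            u = sum(wi * xi for wi, xi in zip(w, x)) - b
            if t * u <= 0:                       # misclassified or on the border
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b -= t
                errors += 1
        if errors == 0:
            return True
    return False

def shatters(points):
    """True if every +1/-1 labeling of the points is linearly separable."""
    return all(perceptron_separates(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(shatters(three), shatters(four))
```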

(57)

VC dimension

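• VC dimension in general

• Consider the theoretical and empirical errors, and the following relations

$$R_{th}(\mathbf{w}_{opt}) \leq R_{th}\left( \mathbf{w}_{opt}^{(K)} \right), \qquad R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) \leq R_{emp}(\mathbf{w}_{opt})$$

• We also know

$$R_{emp}(\mathbf{w}_{opt}) - R_{th}(\mathbf{w}_{opt}) < \varepsilon$$

$$R_{th}\left( \mathbf{w}_{opt}^{(K)} \right) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) < \varepsilon$$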

(58)

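• Therefore

$$R_{th}\left( \mathbf{w}_{opt}^{(K)} \right) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) + R_{emp}(\mathbf{w}_{opt}) - R_{th}(\mathbf{w}_{opt}) \leq 2\varepsilon$$

• Vapnik states the following

$$P\left( \sup_{\mathbf{w} \in W} \left( R_{th}(\mathbf{w}) - R_{emp}(\mathbf{w}) \right) > \varepsilon \right) < \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$

• Combining

$$P\left( \sup_{\mathbf{w} \in W} \left( R_{th}\left( \mathbf{w}_{opt}^{(K)} \right) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) \right) > \varepsilon \right) < \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$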

(59)

VC dimension

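• VC dimension result

$$P\left( \sup_{\mathbf{w} \in W} \left( R_{th}\left( \mathbf{w}_{opt}^{(K)} \right) - R_{emp}\left( \mathbf{w}_{opt}^{(K)} \right) \right) > \varepsilon \right) < \alpha$$

• To set the constant properly

$$\alpha = \left( \frac{2eK}{V_c} \right)^{V_c} e^{-\varepsilon^2 K}$$

• Therefore the optimal size of the training set is driven by the VC dimension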
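Given a VC dimension, a tolerance ε and a confidence level α, the bound can be solved numerically for the training set size K. The sketch below assumes the bound has the form (2eK/V_c)^{V_c}·e^{−ε²K} and searches only among K ≥ V_c/ε², where this expression is decreasing in K; the function names and the sample values are illustrative assumptions.

```python
import math

def log_vc_bound(K, vc, eps):
    """Natural logarithm of (2 e K / vc)^vc * exp(-eps^2 * K)."""
    return vc * math.log(2.0 * math.e * K / vc) - eps * eps * K

def min_training_size(vc, eps, alpha):
    """Smallest K >= vc / eps^2 for which the bound is at most alpha."""
    target = math.log(alpha)
    lo = max(math.ceil(vc / (eps * eps)), 1)   # the bound decreases from here on
    if log_vc_bound(lo, vc, eps) <= target:
        return lo
    hi = 2 * lo
    while log_vc_bound(hi, vc, eps) > target:  # find an upper bracket by doubling
        lo, hi = hi, 2 * hi
    while hi - lo > 1:                         # bisection: bound(lo) > alpha >= bound(hi)
        mid = (lo + hi) // 2
        if log_vc_bound(mid, vc, eps) <= target:
            hi = mid
        else:
            lo = mid
    return hi

for vc in (10, 100):
    print(vc, min_training_size(vc, eps=0.1, alpha=0.05))
```

Working with the logarithm of the bound avoids overflow, since the polynomial factor becomes very large before the exponential factor takes over.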

(60)

Value of the Vc parameter

• If we apply hard nonlinearity in the neural network

$$V_c = O\left( W \log_2 W \right)$$

• If we apply soft nonlinearity

$$V_c = O\left( W^2 \right)$$

• Where $W$ is the number of weights in the neural network

(61)

Learning – in practice

• Learning based on the training set:

$$\tau^{(K)} = \{ (\mathbf{x}_k, d_k) ;\ k = 1, \ldots, K \}$$

Minimize the empirical error function ($R_{emp}$)

$$\mathbf{w}_{opt}^{(K)} : \min_{\mathbf{w}} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(\mathbf{x}_k, \mathbf{w}) \right)^2 = \min_{\mathbf{w}} R_{emp}(\mathbf{w})$$

• Learning is a multivariate optimization task

(62)

• Newton method

• In each step, using the learning set, we modify the weights of the neurons in the layers in order to minimize the error

• To do this, the empirical error of the actual neuron is computed and the gradient of this error is used to

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta \, \mathrm{grad}_{\mathbf{w}} \left\{ R_{emp}(\mathbf{w}(k)) \right\}$$

(63)

Learning

• The Rosenblatt algorithm is inapplicable, because we do not know the error and the desired output in the hidden layers of the FFNN

• Somehow the error of the whole network has to be distributed to the internal neurons, in a feedback way

[Figure: forward propagation of function signals and back-propagation of error signals]

(64)

• Adapting the weights of the FFNN

$$w_{ij}^{(l)}(k+1) = w_{ij}^{(l)}(k) + \Delta w_{ij}^{(l)}(k), \qquad \Delta w_{ij}^{(l)}(k) = \ ?$$

• The weights are modified in the direction of the negative gradient of the error function:

$$\Delta w_{ij}^{(l)} = -\eta \frac{\partial R_{emp}}{\partial w_{ij}^{(l)}}$$

• The elements of the training set are presented to the FFNN sequentially

(65)

Sequential back propagation

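• Consider the following FFNN

[Figure: network with inputs $x_1, x_2$, two hidden neurons with outputs $y_1^{(1)}, y_2^{(1)}$ (weights $w_{10}^{(1)}, w_{11}^{(1)}, w_{12}^{(1)}$ and $w_{20}^{(1)}, w_{21}^{(1)}, w_{22}^{(1)}$) and one output neuron with output $y$ (weights $w_{10}^{(2)}, w_{11}^{(2)}, w_{12}^{(2)}$). Each neuron forms the weighted sum $I_i^{(l)}$ (Σ) and applies $\varphi$.]

• Error function

$$E = \left( d(\mathbf{x}) - y(\mathbf{x}) \right)^2$$

• Adapting the bias of the neuron in the output layer

$$\frac{\partial R_{emp}}{\partial w_{10}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}}$$

• Where, for the empirical error above,

$$\frac{\partial R_{emp}}{\partial y} = -2 \left( d(\mathbf{x}) - y(\mathbf{x}) \right) ; \qquad \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}} = -1$$

$$\frac{\partial y}{\partial I_1^{(2)}} = \frac{\partial \varphi\left( I_1^{(2)} \right)}{\partial I_1^{(2)}} = \varphi'\left( I_1^{(2)} \right) = \ ?$$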

(66)

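• Activation function

$$\varphi(u) = \frac{1}{1 + e^{-u}}$$

• The derivative of this function

$$\varphi'(u) = \frac{\partial}{\partial u} \frac{1}{1 + e^{-u}} = \frac{e^{-u}}{\left( 1 + e^{-u} \right)^2} = \frac{1}{1 + e^{-u}} \cdot \frac{e^{-u}}{1 + e^{-u}} = \varphi(u) \left( 1 - \varphi(u) \right)$$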
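The closed form of the derivative, φ'(u) = φ(u)(1 − φ(u)), can be verified against a central finite difference. The sample points and the step size `h` in this sketch are arbitrary choices.

```python
import math

def phi(u):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-u))

def phi_prime(u):
    """Derivative expressed through the function value: phi * (1 - phi)."""
    y = phi(u)
    return y * (1.0 - y)

def numeric_derivative(f, u, h=1e-6):
    """Central finite-difference approximation of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

for u in (-3.0, -0.5, 0.0, 2.0):
    print(u, phi_prime(u), numeric_derivative(phi, u))
```

Expressing the derivative through the neuron output is what makes back propagation cheap: the forward pass already computed every φ(u).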

(67)

Sequential back propagation

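• Using the previous result for the derivative of the activation function

$$\frac{\partial y}{\partial I_1^{(2)}} = \frac{\partial \varphi\left( I_1^{(2)} \right)}{\partial I_1^{(2)}} = \varphi'\left( I_1^{(2)} \right) = y (1 - y)$$

$$\frac{\partial R_{emp}}{\partial w_{10}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{10}^{(2)}} = 2 (d - y) \, y (1 - y)$$

• Modifying the weight

$$\Delta w_{10}^{(2)} = -\eta \frac{\partial R_{emp}}{\partial w_{10}^{(2)}}$$

$$w_{10}^{(2)}(k+1) = w_{10}^{(2)}(k) + \Delta w_{10}^{(2)}(k) = w_{10}^{(2)}(k) - \eta \cdot 2 (d - y) \, y (1 - y)$$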

(68)

• Adapting the weights of the neuron in the output layer

$$\frac{\partial R_{emp}}{\partial w_{11}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{11}^{(2)}} = -2 (d - y) \, y (1 - y) \, y_1^{(1)}$$

$$w_{11}^{(2)}(k+1) = w_{11}^{(2)}(k) + \Delta w_{11}^{(2)}(k) = w_{11}^{(2)}(k) + \eta \cdot 2 (d - y) \, y (1 - y) \, y_1^{(1)}$$

$$\frac{\partial R_{emp}}{\partial w_{12}^{(2)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial w_{12}^{(2)}} = -2 (d - y) \, y (1 - y) \, y_2^{(1)}$$

$$w_{12}^{(2)}(k+1) = w_{12}^{(2)}(k) + \Delta w_{12}^{(2)}(k) = w_{12}^{(2)}(k) + \eta \cdot 2 (d - y) \, y (1 - y) \, y_2^{(1)}$$

(69)

Sequential back propagation

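• Adapting the weights of the first neuron in the hidden layer

$$\frac{\partial R_{emp}}{\partial w_{10}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{10}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{11}^{(2)} \, y_1^{(1)} \left( 1 - y_1^{(1)} \right) \cdot (-1)$$

$$\frac{\partial R_{emp}}{\partial w_{11}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{11}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{11}^{(2)} \, y_1^{(1)} \left( 1 - y_1^{(1)} \right) \cdot x_1$$

$$\frac{\partial R_{emp}}{\partial w_{12}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_1^{(1)}} \frac{\partial y_1^{(1)}}{\partial I_1^{(1)}} \frac{\partial I_1^{(1)}}{\partial w_{12}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{11}^{(2)} \, y_1^{(1)} \left( 1 - y_1^{(1)} \right) \cdot x_2$$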

(70)

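• Adapting the weights of the second neuron in the hidden layer

$$\frac{\partial R_{emp}}{\partial w_{20}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{20}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{12}^{(2)} \, y_2^{(1)} \left( 1 - y_2^{(1)} \right) \cdot (-1)$$

$$\frac{\partial R_{emp}}{\partial w_{21}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{21}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{12}^{(2)} \, y_2^{(1)} \left( 1 - y_2^{(1)} \right) \cdot x_1$$

$$\frac{\partial R_{emp}}{\partial w_{22}^{(1)}} = \frac{\partial R_{emp}}{\partial y} \frac{\partial y}{\partial I_1^{(2)}} \frac{\partial I_1^{(2)}}{\partial y_2^{(1)}} \frac{\partial y_2^{(1)}}{\partial I_2^{(1)}} \frac{\partial I_2^{(1)}}{\partial w_{22}^{(1)}} = -2 (d - y) \, y (1 - y) \, w_{12}^{(2)} \, y_2^{(1)} \left( 1 - y_2^{(1)} \right) \cdot x_2$$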

(71)

Steps of learning

1. Initialization

• Setting up the initial weights w, usually with random numbers

2. Assembling the training set

• The training set consists of pairs of inputs and desired outputs

3. Propagating the signal

• Compute the outputs of all neurons in the network

4. Back propagating the error and updating the weights

$$\Delta w_{ij}^{(l)} = -\eta \frac{\partial R_{emp}}{\partial w_{ij}^{(l)}}$$

5. Repeating steps 3 and 4 for a new sample
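The five steps can be put together for a network with two inputs, two hidden neurons and one output neuron. The sketch below is a minimal illustration under stated assumptions: logistic activations, squared error per sample, bias weights acting on a constant input of −1, and a made-up training set, learning rate and epoch count. The function names are hypothetical.

```python
import math
import random

def phi(u):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, W2):
    """W1[i] = [w_i0, w_i1, w_i2] (hidden neuron i), W2 = [w_10, w_11, w_12]."""
    h = [phi(w[1] * x[0] + w[2] * x[1] - w[0]) for w in W1]
    y = phi(W2[1] * h[0] + W2[2] * h[1] - W2[0])
    return h, y

def gradients(x, d, W1, W2):
    """Gradient of E = (d - y)^2 with respect to every weight (chain rule)."""
    h, y = forward(x, W1, W2)
    delta = -2.0 * (d - y) * y * (1.0 - y)        # dE/dI of the output neuron
    g2 = [-delta, delta * h[0], delta * h[1]]     # bias input is -1
    g1 = []
    for i in range(2):
        delta_h = delta * W2[i + 1] * h[i] * (1.0 - h[i])  # error sent back to hidden neuron i
        g1.append([-delta_h, delta_h * x[0], delta_h * x[1]])
    return g1, g2

def train(samples, eta=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial weights.
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(epochs):
        for x, d in samples:                      # Step 5: next sample
            g1, g2 = gradients(x, d, W1, W2)      # Steps 3 and 4
            W2 = [w - eta * g for w, g in zip(W2, g2)]
            W1 = [[w - eta * g for w, g in zip(row, grow)]
                  for row, grow in zip(W1, g1)]
    return W1, W2

# Step 2: a hypothetical training set of (input, desired output) pairs.
samples = [((0.0, 0.0), 0.1), ((0.0, 1.0), 0.9),
           ((1.0, 0.0), 0.9), ((1.0, 1.0), 0.9)]
W1, W2 = train(samples)
for x, d in samples:
    print(x, d, round(forward(x, W1, W2)[1], 3))
```

The gradients are computed from the weights as they were before the update of the current sample, so the hidden-layer update uses the old output-layer weights.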

(72)
(73)

Numerical example – step 1 & 2

• Consider the following problem, initial states:

$$w_{11}^{(1)} = -0.3, \quad w_{21}^{(1)} = 0.6, \quad w_{11}^{(2)} = 0.5, \quad w_{12}^{(2)} = 0.4, \quad w_{10}^{(1)} = w_{20}^{(1)} = w_{10}^{(2)} = 0.5$$

$$\eta = 1, \qquad \varphi(u) = \frac{1}{1 + e^{-\alpha u}}, \quad \alpha = 2$$

$$\tau^{(3)} = \{ (1, 0.1), (2, 0.5), (3, 0.9) \}$$

(74)

• Propagating the signal

$$k = 1 : \quad x_1 = 1, \quad d_1 = 0.1$$

$$y_1^{(1)} = \frac{1}{1 + e^{1.6}} = 0.1680$$

$$y_2^{(1)} = \frac{1}{1 + e^{-0.2}} = 0.5498$$

$$y = \frac{1}{1 + e^{0.3922}} = 0.4032$$

$$\tau^{(3)} = \{ (1, 0.1), (2, 0.5), (3, 0.9) \}$$

(75)

Numerical example – step 4

• Back propagating, and updating

• Output layer

$$\Delta w_{11}^{(2)}(k) = -\eta \cdot \left( -2 (d - y) \right) \cdot \alpha \, y (1 - y) \cdot y_1^{(1)} = 2 \cdot (0.1 - 0.4032) \cdot 0.4032 \cdot (1 - 0.4032) \cdot 2 \cdot 0.1680 = -0.0490$$

$$\Delta w_{10}^{(2)}(k) = -\eta \cdot 2 (d - y) \cdot \alpha \, y (1 - y) = -2 \cdot (0.1 - 0.4032) \cdot 0.4032 \cdot (1 - 0.4032) \cdot 2 = 0.2918$$

$$\Delta w_{12}^{(2)}(k) = -\eta \cdot \left( -2 (d - y) \right) \cdot \alpha \, y (1 - y) \cdot y_2^{(1)} = 2 \cdot (0.1 - 0.4032) \cdot 0.4032 \cdot (1 - 0.4032) \cdot 2 \cdot 0.5498 = -0.1604$$

(76)

• Back propagating, and updating

• Output layer - Updating

$$w_{10}^{(2)}(1) = w_{10}^{(2)}(0) + \Delta w_{10}^{(2)}(0) = 0.5 + 0.2918 = 0.7918$$

$$w_{11}^{(2)}(1) = w_{11}^{(2)}(0) + \Delta w_{11}^{(2)}(0) = 0.5 - 0.0490 = 0.4510$$

$$w_{12}^{(2)}(1) = w_{12}^{(2)}(0) + \Delta w_{12}^{(2)}(0) = 0.4 - 0.1604 = 0.2396$$
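The first step of the numerical example can be recomputed in a few lines. The sketch assumes the example network has one input, two hidden neurons and one output neuron, a sigmoid with slope parameter α = 2, and bias weights acting on a constant input of −1; the variable names are hypothetical. Small differences in the last digit against the slide values come from rounding of intermediate results.

```python
import math

ALPHA = 2.0   # slope parameter of the sigmoid in the example
ETA = 1.0     # learning rate

def phi(u):
    return 1.0 / (1.0 + math.exp(-ALPHA * u))

# Initial weights of the example; w?_?0 are bias weights (input -1).
w1_10, w1_11 = 0.5, -0.3        # hidden neuron 1
w1_20, w1_21 = 0.5, 0.6         # hidden neuron 2
w2_10, w2_11, w2_12 = 0.5, 0.5, 0.4
x, d = 1.0, 0.1                 # first element of the training set

# Forward propagation.
y1 = phi(w1_11 * x - w1_10)
y2 = phi(w1_21 * x - w1_20)
y = phi(w2_11 * y1 + w2_12 * y2 - w2_10)

# Output-layer weight changes: -eta * dE/dw with E = (d - y)^2.
common = 2.0 * (d - y) * ALPHA * y * (1.0 - y)
dw2_10 = -ETA * common          # bias input is -1, hence the sign flip
dw2_11 = ETA * common * y1
dw2_12 = ETA * common * y2

print(round(y1, 4), round(y2, 4), round(y, 4))
print(round(dw2_10, 4), round(dw2_11, 4), round(dw2_12, 4))
print(round(w2_10 + dw2_10, 4), round(w2_11 + dw2_11, 4), round(w2_12 + dw2_12, 4))
```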
