Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a Consortial Framework**
Consortium leader
PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***
**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben
***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.
Principal Component Analysis
Főkomponens analízis
András Oláh, Gergely Treplán, Dávid Tisza
Digitális- neurális-, és kiloprocesszoros architektúrákon alapuló jelfeldolgozás
Digital- and Neural Based Signal Processing &
Kiloprocessor Arrays
Contents
• Data compression
• Principal Component Analysis (PCA or KLT)
• Notation
• Summary of the algorithm
• Problems, extensions
• Hebbian PCA (The Oja algorithm)
• Experiments
• Problems
Examples of well-known transformations:
• DCT - Discrete Cosine Transformation (JPEG, MPEG)
• KLT - Karhunen-Loève Transformation
[Diagram: transformation → prioritized representation → information discarding → only the important segment remains]
Notation
• Scalars are denoted by normal letters.
• Vectors are denoted by boldface lower-case letters.
• Matrices are denoted by boldface upper-case letters.
• Random data vector (e.g. a video frame): $\mathbf{x}$
• Vectors are defined to be column vectors: $\dim(\mathbf{x}) = N \times 1$
• Correlation matrix: $\mathbf{R} = E\left(\mathbf{x}\,\mathbf{x}^T\right)$, with elements $R_{ij} = E\left(x_i x_j\right)$
• $i$th element of a vector (scalar): $x_i$
• $(i,j)$th element of a matrix: $R_{i,j}$
• $i$th row of a matrix: $\mathbf{R}_{i,:}$
• $j$th column of a matrix: $\mathbf{R}_{:,j}$
• Submatrix (e.g. first $K$ rows and all columns): $\mathbf{R}_{1:K,:}$
• $i$th element of a set (e.g. the set of eigenvectors): $\mathbf{s}^{(i)} \in S = \{\mathbf{x} : \mathbf{x}\ \text{is an eigenvector of}\ \mathbf{R}\}$
• Expectation operator, if $X$ is a discrete random variable: $E(X) = \sum_i x_i\, p(x_i)$
• Dot product: $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^T \cdot \mathbf{b} = \sum_{i=1}^{N} a_i b_i$
• Correlation matrix: $\mathbf{R} = E\left(\mathbf{x} \cdot \mathbf{x}^T\right)$, $\dim(\mathbf{R}) = N \times N$
• Diagonal extension operator: for $\mathbf{a} = [a_1, a_2, \ldots, a_N]^T$,
$$\mathbf{A} = \mathrm{diag}(\mathbf{a}) := \begin{pmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_N \end{pmatrix}$$
• Eigenvectors and eigenvalues of $\mathbf{R}$
• Set of eigen equations: $\mathbf{R} \cdot \mathbf{s}^{(i)} = \lambda_i \cdot \mathbf{s}^{(i)}, \quad i = 1, \ldots, N$
• Properties of the eigenvectors: $\langle \mathbf{s}^{(i)}, \mathbf{s}^{(j)} \rangle = \mathbf{s}^{(i)T} \cdot \mathbf{s}^{(j)} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$ (orthonormal basis), so $\mathbf{S}^T \cdot \mathbf{S} = \mathbf{I}$
• In matrix form: $\mathbf{R} \cdot \mathbf{S} = \mathbf{S} \cdot \mathbf{\Lambda}$, where $\mathbf{S} = \left[\mathbf{s}^{(1)}, \mathbf{s}^{(2)}, \ldots, \mathbf{s}^{(N)}\right]$ and $\mathbf{\Lambda} = \mathrm{diag}\left(\lambda_1, \lambda_2, \ldots, \lambda_N\right)$
Basis transformations to and from the eigenvector basis:
• Coder, transforming to the eigenvector basis:
$$\mathbf{y} = \mathbf{S}^T \cdot \mathbf{x}, \qquad y_i = \left(\mathbf{S}_{:,i}\right)^T \cdot \mathbf{x} = \mathbf{s}^{(i)T} \cdot \mathbf{x} = \sum_{n=1}^{N} s_n^{(i)}\, x_n$$
• Decoder, transforming from the eigenvector basis:
$$\mathbf{x} = \mathbf{S} \cdot \mathbf{y} = \left[\mathbf{s}^{(1)}, \mathbf{s}^{(2)}, \ldots, \mathbf{s}^{(N)}\right] \cdot \mathbf{y}, \qquad x_i = \sum_{n=1}^{N} s_i^{(n)}\, y_n$$
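As a quick sanity check of these relations, the following sketch (my own illustration, not part of the original slides; the random data and variable names are assumptions) builds a correlation matrix from synthetic data, eigendecomposes it, and verifies $\mathbf{S}^T\mathbf{S} = \mathbf{I}$, $\mathbf{R}\mathbf{S} = \mathbf{S}\mathbf{\Lambda}$, and the lossless coder/decoder round trip.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 1000

# Zero-mean data matrix X (columns are data vectors) and its correlation matrix R
X = rng.standard_normal((N, K))
R = (X @ X.T) / K

# Eigendecomposition of the symmetric matrix R; columns of S are eigenvectors s^(i)
eigvals, S = np.linalg.eigh(R)
idx = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue (importance)
eigvals, S = eigvals[idx], S[:, idx]
Lam = np.diag(eigvals)

assert np.allclose(S.T @ S, np.eye(N))   # orthonormal basis: S^T S = I
assert np.allclose(R @ S, S @ Lam)       # eigen equations: R S = S Lambda

# Coder / decoder: y = S^T x, x = S y (lossless when all N eigenvectors are kept)
x = X[:, 0]
y = S.T @ x
assert np.allclose(S @ y, x)
```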
PCA
• Suppose we have an element of a data set, which here is a grayscale image: the data frame at time instant k.
• Generate the data vector from it by reading it column-wise: x[k].
• The data can usually be modeled as a random vector variable x, because we assume the underlying process is quasi-stationary and ergodic. E.g. consecutive image frames of a video stream are usually "similar".
• Thus, within a time window we examine the correlation matrix of the process:
$$\mathbf{R} = E\left\{\mathbf{x}\,\mathbf{x}^T\right\} = \begin{pmatrix} E\{x_1 x_1\} & E\{x_1 x_2\} & \cdots & E\{x_1 x_N\} \\ E\{x_2 x_1\} & E\{x_2 x_2\} & \cdots & E\{x_2 x_N\} \\ \vdots & \vdots & \ddots & \vdots \\ E\{x_N x_1\} & E\{x_N x_2\} & \cdots & E\{x_N x_N\} \end{pmatrix}, \qquad \dim(\mathbf{R}) = N \times N$$
• The random data vector is assumed to have zero mean: $E(\mathbf{x}) = \mathbf{0}$
• Compute the eigenvectors and eigenvalues of the correlation matrix $\mathbf{R}$ and sort them in decreasing order (by importance):
$$\mathbf{R} \cdot \mathbf{s}^{(i)} = \lambda_i \cdot \mathbf{s}^{(i)}, \quad i = 1, \ldots, N$$
$$\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_M \geq \lambda_{M+1} \geq \ldots \geq \lambda_N$$
with the corresponding eigenvectors $\mathbf{s}^{(1)}, \mathbf{s}^{(2)}, \mathbf{s}^{(3)}, \ldots, \mathbf{s}^{(M)}, \mathbf{s}^{(M+1)}, \ldots, \mathbf{s}^{(N)}$.
• The eigenvectors point in orthogonal directions that capture the independent variance directions of the distribution.
• The eigenvalues are proportional to the magnitude of the variances along these directions.
[Figure: eigenvalue-scaled eigenvectors (blue arrows), iso-curves of the distribution (black dashed contours), sample points of the distribution (red dots)]
• Example: for $X \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 0.9\\ 0.9 & 1\end{pmatrix}\right)$ we get $\mathbf{S} = \begin{pmatrix}-0.7071 & -0.7071\\ -0.7071 & 0.7071\end{pmatrix}$ and $\mathbf{\Lambda} = \begin{pmatrix}1.9 & 0\\ 0 & 0.1\end{pmatrix}$.
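This 2-D example can be reproduced numerically; the short sketch below is my own illustration of the slide's numbers (sample size, seed, and variable names are assumptions), so the estimated $\mathbf{S}$ and $\mathbf{\Lambda}$ only approximate the quoted values, and eigenvector signs and ordering may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])                  # covariance of X ~ N(0, C)
X = rng.multivariate_normal(mean=[0, 0], cov=C, size=5000).T   # 2 x 5000 samples

R_hat = (X @ X.T) / X.shape[1]              # empirical correlation matrix
eigvals, S = np.linalg.eigh(R_hat)
idx = np.argsort(eigvals)[::-1]
print(np.diag(eigvals[idx]))                # approx. diag(1.9, 0.1)
print(S[:, idx])                            # columns approx. +/-[0.7071, 0.7071], +/-[0.7071, -0.7071]
```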
• The eigenvectors correspond to the principal directions of variation of the data, understood as a sample set of a multidimensional random process.
• The eigenvalues correspond to the magnitude of the variances along these directions.
• If we adopt the hypothesis that this linear quantity (correlation) captures real-world "importance", then we can sort the dimensions by "importance" and
• compress by discarding the "not really important" dimensions.
Given $\mathbf{R}$, get all $\left(\lambda_i, \mathbf{s}^{(i)}\right)$, $i = 1, \ldots, N$:
$$\lambda_1 \Rightarrow \mathbf{s}^{(1)} \Rightarrow y_1 = \mathbf{s}^{(1)T} \cdot \mathbf{x}$$
$$\lambda_2 \Rightarrow \mathbf{s}^{(2)} \Rightarrow y_2 = \mathbf{s}^{(2)T} \cdot \mathbf{x}$$
$$\vdots$$
$$\lambda_N \Rightarrow \mathbf{s}^{(N)} \Rightarrow y_N = \mathbf{s}^{(N)T} \cdot \mathbf{x}$$
Transformed data: $\mathbf{y} = \mathbf{S}^T \cdot \mathbf{x}$; original data: $\mathbf{x} = \mathbf{S} \cdot \mathbf{y}$.
• We can transform the data frame to the new basis system in a lossless fashion.
• We want to discard the "unimportant" components to achieve compression.
• Instead of using all eigenvectors, just use the first M to get an approximation of y:
$$\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_M \geq \lambda_{M+1} \geq \ldots \geq \lambda_N$$
with the corresponding eigenvectors $\mathbf{s}^{(1)}, \mathbf{s}^{(2)}, \mathbf{s}^{(3)}, \ldots, \mathbf{s}^{(M)}, \mathbf{s}^{(M+1)}, \ldots, \mathbf{s}^{(N)}$.
[Diagram: transformation → prioritized representation → information discarding → only the important segment remains]
• The approximation of the transformed data:
$$\hat{\mathbf{S}} = \mathbf{S}_{:,1:M} = \left[\mathbf{s}^{(1)}, \mathbf{s}^{(2)}, \ldots, \mathbf{s}^{(M)}\right] \quad \text{(the first } M \text{ eigenvectors, arranged in a matrix)}$$
$$\hat{y}_1 = \mathbf{s}^{(1)T} \cdot \mathbf{x}, \quad \hat{y}_2 = \mathbf{s}^{(2)T} \cdot \mathbf{x}, \quad \ldots, \quad \hat{y}_M = \mathbf{s}^{(M)T} \cdot \mathbf{x}$$
• Compressed, transformed data: $\hat{\mathbf{y}} = \hat{\mathbf{S}}^T \cdot \mathbf{x}$
• Decompressed, lossy data recovery: $\hat{\mathbf{x}} = \hat{\mathbf{S}} \cdot \hat{\mathbf{y}}$
PCA compression
• Compression algorithm: $\hat{\mathbf{y}} = \hat{\mathbf{S}}^T \cdot \mathbf{x}$
• Decompression algorithm: $\hat{\mathbf{x}} = \hat{\mathbf{S}} \cdot \hat{\mathbf{y}}$
[Diagram: $\mathbf{x}$ → compression algorithm → $\hat{\mathbf{y}}$ → access line → $\hat{\mathbf{y}}$ → decompression algorithm → $\hat{\mathbf{x}}$]
$$\dim(\mathbf{x}) = N, \quad \dim(\hat{\mathbf{y}}) = M, \quad \dim(\hat{\mathbf{x}}) = N, \qquad N \gg M$$
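A minimal sketch of this coder/decoder pair (illustrative NumPy, with synthetic stand-in frames and hypothetical names of my own): keep only the M leading eigenvectors, send the M-dimensional $\hat{\mathbf{y}}$ over the access line, and reconstruct $\hat{\mathbf{x}}$ on the other side.

```python
import numpy as np

def pca_codec(R, M):
    """Build the compression matrix S_hat (N x M) from the correlation matrix R."""
    eigvals, S = np.linalg.eigh(R)
    idx = np.argsort(eigvals)[::-1]        # decreasing eigenvalues = decreasing importance
    return S[:, idx[:M]], eigvals[idx]

rng = np.random.default_rng(2)
N, K, M = 64, 500, 8
X = rng.standard_normal((N, K))            # stand-in for zero-mean video frames
R_hat = (X @ X.T) / K

S_hat, eigvals = pca_codec(R_hat, M)

x = X[:, 0]
y_hat = S_hat.T @ x                        # coder:   dim(y_hat) = M
x_hat = S_hat @ y_hat                      # decoder: dim(x_hat) = N
print(np.linalg.norm(x - x_hat) ** 2)      # lossy reconstruction error
```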
The PCA is the optimal lossy representation in terms of Mean Square Error.

QoS criterion: $E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 < \varepsilon$

$$E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 = E\left\|\mathbf{x} - \hat{\mathbf{S}}\hat{\mathbf{y}}\right\|^2 = E\left[\left(\mathbf{x} - \hat{\mathbf{S}}\hat{\mathbf{y}}\right)^T\left(\mathbf{x} - \hat{\mathbf{S}}\hat{\mathbf{y}}\right)\right] = E\left[\mathbf{x}^T\mathbf{x} - 2\,\hat{\mathbf{y}}^T\hat{\mathbf{S}}^T\mathbf{x} + \hat{\mathbf{y}}^T\hat{\mathbf{S}}^T\hat{\mathbf{S}}\hat{\mathbf{y}}\right] =$$
$$= E\left[\mathbf{x}^T\mathbf{x}\right] - E\left[\hat{\mathbf{y}}^T\hat{\mathbf{y}}\right] = \sum_{i=1}^{N}\lambda_i - \sum_{i=1}^{M}\lambda_i = \sum_{i=M+1}^{N}\lambda_i$$

(using $\hat{\mathbf{S}}^T\hat{\mathbf{S}} = \mathbf{I}$ and $\hat{\mathbf{S}}^T\mathbf{x} = \hat{\mathbf{y}}$), which is the minimum achievable because the eigenvalues are ordered in a monotonically decreasing sequence.
• The components we discard during compression contribute to the error exactly as much as their corresponding eigenvalues.
• The mean square error is the sum of the eigenvalues corresponding to the discarded components. We drop the minimum possible contribution because of the monotonically decreasing order of the eigenvalues.
$$E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 = \sum_{i=1}^{N}\lambda_i - \sum_{i=1}^{M}\lambda_i = \sum_{i=M+1}^{N}\lambda_i$$
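The error formula can be checked numerically; the sketch below is my own verification (synthetic data and names assumed), averaging the reconstruction error over a sample window and comparing it with the tail sum of eigenvalues of the empirical correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M = 16, 2000, 4

X = rng.standard_normal((N, N)) @ rng.standard_normal((N, K))   # zero-mean sample set
R_hat = (X @ X.T) / K

eigvals, S = np.linalg.eigh(R_hat)
eigvals, S = eigvals[::-1], S[:, ::-1]      # decreasing order
S_hat = S[:, :M]

# Average reconstruction error over the window vs. the tail sum of eigenvalues
err = np.mean(np.sum((X - S_hat @ (S_hat.T @ X)) ** 2, axis=0))
print(err, eigvals[M:].sum())               # the two numbers coincide
```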
PCA architecture
[Diagram: coder network — each input $x_1, \ldots, x_N$ is weighted by the coefficients $s_n^{(i)}$ and summed to produce $\hat{y}_1, \ldots, \hat{y}_M$; the outputs travel over the access line; decoder network — $\hat{y}_1, \ldots, \hat{y}_M$ are weighted by the same coefficients and summed to produce $\hat{x}_1, \ldots, \hat{x}_N$]
PCA compression overview
We assume that R is given for the process we are dealing with and that R remains the same over time (stationarity).
1. Solve the eigen problem for R to get the first M largest eigenvalues and the corresponding eigenvectors, where M is set by the quality requirement.
2. Assemble the coding/decoding matrices.
3. Use the coder to encode the data (a matrix-vector multiplication) and send it through the access line.
4. Decode the data with the decoding matrix.
Difficulties of directly implementing PCA for video compression
• Nobody tells us the R of the underlying video stream.
• R does not remain "still" over time.
• Computing eigenvalues and eigenvectors is computationally expensive.
• In practice, normalizing the variances of the original data set along each dimension before the PCA transformation is useful to obtain meaningful relative importance values after the transformation.
Difficulties of directly implementing PCA for video compression
• Storage problems:
• N: dim(x) = 640 × 480 = 307200 per color channel
• N × N: dim(R) = dim(E(x x^T)) = 307200 × 307200
• If a float (32 bit) is used to hold each element of R,
• sizeof(R) ≈ 307200² × 4 bytes ≈ 350 GB per color channel in memory to hold R or an estimate of R
Estimation of the correlation matrix
• Using the assumed ergodicity of the process, we estimate R in a time window of length K.
• From time to time we have to re-estimate the empirical correlation matrix.
• Note that we still assume quasi-stationarity within a window of length K:
$$\mathbf{R} = \lim_{K\to\infty} \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}[k]\,\mathbf{x}^T[k] \qquad \Rightarrow \qquad \hat{\mathbf{R}} = \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}[k]\,\mathbf{x}^T[k]$$
Overcoming the storage problem - introducing the kernel-based PCA
• We use the fact that $\hat{\mathbf{R}}$ was constructed from $K$ outer products, so $\mathrm{rank}\left(\hat{\mathbf{R}}\right) = K$.
• Using the basis transform:
$$\mathbf{s}^{(i)} = \mathbf{X} \cdot \boldsymbol{\xi}^{(i)} = \left[\mathbf{x}[1], \ldots, \mathbf{x}[K]\right] \cdot \boldsymbol{\xi}^{(i)}$$
• We can rewrite the eigen problem:
$$\hat{\mathbf{R}}\,\mathbf{s}^{(i)} = \lambda_i\,\mathbf{s}^{(i)}, \quad \dim\left(\hat{\mathbf{R}}\right) = N \times N \qquad \longrightarrow \qquad \mathbf{G}\,\boldsymbol{\xi}^{(i)} = K\lambda_i\,\boldsymbol{\xi}^{(i)}, \quad \dim(\mathbf{G}) = K \times K \ll N \times N$$
Kernel based PCA
$$\hat{\mathbf{R}} = \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}[k]\,\mathbf{x}^T[k], \qquad \mathbf{s}^{(i)} = \mathbf{X}\,\boldsymbol{\xi}^{(i)} = \left[\mathbf{x}[1], \ldots, \mathbf{x}[K]\right]\boldsymbol{\xi}^{(i)}$$
Substituting into $\hat{\mathbf{R}}\,\mathbf{s}^{(i)} = \lambda_i\,\mathbf{s}^{(i)}$:
$$\frac{1}{K}\sum_{k=1}^{K}\mathbf{x}[k]\,\mathbf{x}^T[k]\sum_{l=1}^{K}\xi^{(i)}_l\,\mathbf{x}[l] = \lambda_i\sum_{m=1}^{K}\xi^{(i)}_m\,\mathbf{x}[m]$$
Multiplying both sides from the left by $\mathbf{x}^T[n]$, $n = 1, \ldots, K$:
$$\frac{1}{K}\sum_{l=1}^{K}\sum_{k=1}^{K}\mathbf{x}^T[n]\,\mathbf{x}[k]\;\mathbf{x}^T[k]\,\mathbf{x}[l]\;\xi^{(i)}_l = \lambda_i\sum_{m=1}^{K}\mathbf{x}^T[n]\,\mathbf{x}[m]\;\xi^{(i)}_m$$
With the Gram matrix $G_{n,k} := \mathbf{x}^T[n]\,\mathbf{x}[k]$, this reads
$$\frac{1}{K}\,\mathbf{G}\cdot\mathbf{G}\,\boldsymbol{\xi}^{(i)} = \lambda_i\,\mathbf{G}\,\boldsymbol{\xi}^{(i)} \qquad\Rightarrow\qquad \mathbf{G}\,\boldsymbol{\xi}^{(i)} = K\lambda_i\,\boldsymbol{\xi}^{(i)}$$
Kernel based PCA
• Storage problems:
• N: dim(x) = 640 × 480 = 307200 per color channel
• sizeof(R) ≈ 350 GB per color channel in memory to hold R
• If a window length of K = 1000 frames is used,
• sizeof(G) = 4 MB per color channel
$$\mathbf{X} = \left[\mathbf{x}[1], \ldots, \mathbf{x}[K]\right], \qquad \hat{\mathbf{R}} = \frac{1}{K}\,\mathbf{X}\,\mathbf{X}^T, \quad \dim\left(\hat{\mathbf{R}}\right) = N \times N, \qquad \mathbf{G} = \mathbf{X}^T\mathbf{X}, \quad \dim(\mathbf{G}) = K \times K$$
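A sketch of the Gram-matrix route (my own illustration with reduced sizes and assumed variable names): solve the K×K eigenproblem of $\mathbf{G} = \mathbf{X}^T\mathbf{X}$ and map the coefficient vectors $\boldsymbol{\xi}^{(i)}$ back to pixel-space eigenvectors $\mathbf{s}^{(i)} = \mathbf{X}\boldsymbol{\xi}^{(i)}$ (normalized to unit length).

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, M = 1000, 50, 5                      # reduced sizes so the sketch actually runs
X = rng.standard_normal((N, K))            # columns: the K frames of the current window

# K x K Gram matrix instead of the N x N correlation matrix
G = X.T @ X
eigvals, Xi = np.linalg.eigh(G / K)        # (1/K) G xi = lambda xi, i.e. G xi = K lambda xi
idx = np.argsort(eigvals)[::-1][:M]

# Map coefficient vectors back to pixel space: s^(i) = X xi^(i), then normalize
S_hat = X @ Xi[:, idx]
S_hat /= np.linalg.norm(S_hat, axis=0)

# Sanity check against the direct (large) eigenproblem of R_hat
R_hat = (X @ X.T) / K
lead = np.sort(np.linalg.eigvalsh(R_hat))[::-1][:M]
print(np.allclose(lead, eigvals[idx]))     # True: same leading eigenvalues
```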
The Oja algorithm
• The Oja algorithm is another solution to the storage problem and an alternative to the existing algorithms for computing the Eigenvalue Decomposition (EVD), which would otherwise have to be performed anew for every time window.
• If we know $\mathbf{s}^{(i)}[k]$, can we recursively compute $\mathbf{s}^{(i)}[k+1]$, assuming the statistics change slowly enough?
The Oja algorithm
• The Oja algorithm is a nonlinear stochastic difference equation (convergence can be proven with the Kushner-Clark theorem):
$$\mathbf{w}[k+1] = \mathbf{w}[k] + \eta\, y[k]\left(\mathbf{x}[k] - y[k]\,\mathbf{w}[k]\right), \qquad y[k] = \varphi\left(\mathbf{w}^T[k]\,\mathbf{x}[k]\right), \quad \varphi(u) = u$$
[Diagram: a single linear neuron — inputs $x_1, x_2, \ldots, x_N$, weights $w_1, w_2, \ldots, w_N$, summation $\Sigma$, output $y$]
$$\lim_{k\to\infty}\mathbf{w}[k] = \mathbf{s}^{(1)}$$
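A minimal sketch of the rule in action (my own illustration; the synthetic distribution, learning rate, and iteration count are assumptions): iterating the update on samples from a fixed distribution drives $\mathbf{w}$ toward $\pm\mathbf{s}^{(1)}$.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5
A = rng.standard_normal((N, N))
R = A @ A.T                                # a fixed "true" correlation matrix

s1 = np.linalg.eigh(R)[1][:, -1]           # principal eigenvector, for reference

w = rng.standard_normal(N)
w /= np.linalg.norm(w)
eta = 0.01
for k in range(20000):
    x = A @ rng.standard_normal(N)         # sample with E[x x^T] = R
    y = w @ x
    w += eta * y * (x - y * w)             # Oja update

print(abs(w @ s1))                         # should be close to 1: w converged to +/- s^(1)
```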
The Oja algorithm
• Lemma:
$$\mathbf{s}^{(1)} = \arg\max_{\mathbf{w}\in\mathbb{R}^N,\ \|\mathbf{w}\| = 1} \mathbf{w}^T\mathbf{R}\,\mathbf{w}, \qquad \max_{\|\mathbf{w}\|=1}\mathbf{w}^T\mathbf{R}\,\mathbf{w} = \lambda_1$$
• Proof outline: substitute $\mathbf{v} = \mathbf{S}^T\mathbf{w}$ (so $\mathbf{w} = \mathbf{S}\mathbf{v}$ and $\|\mathbf{v}\| = \|\mathbf{w}\| = 1$, since $\mathbf{S}^T\mathbf{S} = \mathbf{I}$):
$$\mathbf{w}^T\mathbf{R}\,\mathbf{w} = \mathbf{v}^T\mathbf{S}^T\mathbf{R}\,\mathbf{S}\,\mathbf{v} = \mathbf{v}^T\mathbf{\Lambda}\,\mathbf{v} = \sum_{i=1}^{N}\lambda_i\, v_i^2 \leq \lambda_1$$
with equality if we choose $\mathbf{v} = [1, 0, \ldots, 0]^T$, i.e. $\mathbf{w} = \mathbf{S}\mathbf{v} = \mathbf{s}^{(1)}$, giving $\mathbf{w}^T\mathbf{R}\,\mathbf{w} = \lambda_1$.
The Oja algorithm
• Using the properties of the Rayleigh quotient: if $\mathbf{R} = \mathbf{R}^T \in \mathbb{R}^{N \times N}$, then $\max_{\|\mathbf{w}\|=1} \mathbf{w}^T \mathbf{R}\, \mathbf{w} = \lambda_{\max}$, attained at $\mathbf{w} = \mathbf{s}^{(1)}$.
• Starting from the Rayleigh quotient and taking a gradient-ascent (hill-climbing) step toward the maximum:
$$\mathbf{w}[k+1] = \mathbf{w}[k] + \eta\,\mathrm{grad}_{\mathbf{w}}\left(\frac{\mathbf{w}^T \mathbf{R}\,\mathbf{w}}{\mathbf{w}^T\mathbf{w}}\right)\bigg|_{\mathbf{w}[k]} = \mathbf{w}[k] + \eta\left(\mathbf{R}\,\mathbf{w}[k] - \mathbf{w}[k]\,\mathbf{w}^T[k]\,\mathbf{R}\,\mathbf{w}[k]\right)$$
(evaluated at $\|\mathbf{w}[k]\| = 1$, with the constant factor 2 absorbed into $\eta$).
The Oja algorithm
• We still need the knowledge of R, so we substitute the instantaneous estimate $\mathbf{x}[k]\,\mathbf{x}^T[k]$ for $E\left[\mathbf{x}[k]\,\mathbf{x}^T[k]\right]$ (the same simplification discussed at the LMS and Robbins-Monro algorithms).
• Stability can be proven by the Kushner-Clark theorem.
$$\mathbf{w}[k+1] = \mathbf{w}[k] + \eta\left(E\left[\mathbf{x}[k]\,\mathbf{x}^T[k]\right]\mathbf{w}[k] - \mathbf{w}[k]\,\mathbf{w}^T[k]\,E\left[\mathbf{x}[k]\,\mathbf{x}^T[k]\right]\mathbf{w}[k]\right) \cong$$
$$\cong \mathbf{w}[k] + \eta\left(\mathbf{x}[k]\,\mathbf{x}^T[k]\,\mathbf{w}[k] - \mathbf{w}[k]\,\mathbf{w}^T[k]\,\mathbf{x}[k]\,\mathbf{x}^T[k]\,\mathbf{w}[k]\right) = \mathbf{w}[k] + \eta\, y[k]\left(\mathbf{x}[k] - y[k]\,\mathbf{w}[k]\right)$$
where $y[k] = \mathbf{x}^T[k]\,\mathbf{w}[k] = \mathbf{w}^T[k]\,\mathbf{x}[k]$.
The Oja algorithm
• Extending the algorithm with the Generalized Hebbian Algorithm (GHA), one can design a feed-forward neural network that extracts all the eigenvectors of the correlation matrix.
• Generalized Hebbian rule:
$$\Delta w_{ji}[k] = \eta\, y_j[k]\left(x_i[k] - \sum_{m=1}^{j} w_{mi}[k]\, y_m[k]\right)$$
The Oja algorithm with the Generalized Hebbian Algorithm (GHA)
• No knowledge of R is needed for the iteration
• Can be implemented in a feed-forward neural network structure
• For the jth neuron’s ith weight:
$$w_{ji}[k+1] = w_{ji}[k] + \eta\, y_j[k]\left(x_i'[k] - w_{ji}[k]\, y_j[k]\right)$$
$$x_i'[k] = x_i[k] - \sum_{m=1}^{j-1} w_{mi}[k]\, y_m[k]$$
(note that $x'$ is a function of $j$ as well)
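A sketch of the resulting multi-neuron update in matrix form (my own illustration; the weight matrix W, the synthetic correlation matrix, and the iteration budget are assumptions): row j of W plays the role of the j-th neuron's weight vector, and the lower-triangular term implements the deflating sum over m ≤ j.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, eta = 5, 3, 0.005

# A correlation matrix with well separated eigenvalues (for a clear illustration)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
lam = np.array([5.0, 3.0, 2.0, 1.0, 0.5])
R = Q @ np.diag(lam) @ Q.T
R_sqrt = Q @ np.diag(np.sqrt(lam)) @ Q.T   # used to draw samples with E[x x^T] = R

W = 0.1 * rng.standard_normal((M, N))      # row j holds the weight vector of neuron j
for k in range(100000):
    x = R_sqrt @ rng.standard_normal(N)
    y = W @ x                              # outputs of the M neurons
    # Sanger / GHA update: delta w_ji = eta * y_j * (x_i - sum_{m<=j} w_mi y_m)
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

print(np.abs(W @ Q[:, :M]))                # roughly the M x M identity (up to signs)
```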
Hebbian PCA
• Hebbian learning:
$$\tilde{\mathbf{w}}[k+1] = \mathbf{w}[k] + \eta\, y[k]\,\mathbf{x}[k]$$
• Normalization:
$$\mathbf{w}[k+1] = \frac{\tilde{\mathbf{w}}[k+1]}{\left\|\tilde{\mathbf{w}}[k+1]\right\|}$$
• Expanding by Taylor series (with $\|\mathbf{w}[k]\| = 1$):
$$\left\|\tilde{\mathbf{w}}[k+1]\right\|^2 = \left\|\mathbf{w}[k]\right\|^2 + 2\eta\, y[k]\,\mathbf{w}^T[k]\,\mathbf{x}[k] + O\left(\eta^2\right) = \left\|\mathbf{w}[k]\right\|^2 + 2\eta\, y^2[k] + O\left(\eta^2\right)$$
$$\left\|\tilde{\mathbf{w}}[k+1]\right\|^{-1} = 1 - \eta\, y^2[k] + O\left(\eta^2\right)$$
• Substituting back: $\mathbf{w}[k+1] = \left(\mathbf{w}[k] + \eta\, y[k]\,\mathbf{x}[k]\right)\left(1 - \eta\, y^2[k]\right) + O\left(\eta^2\right) = \mathbf{w}[k] + \eta\, y[k]\left(\mathbf{x}[k] - y[k]\,\mathbf{w}[k]\right) + O\left(\eta^2\right)$, i.e. normalized Hebbian learning reduces to the Oja rule up to $O(\eta^2)$ terms.
Experiments
• [Plot: eigenvalue distribution of a small video stream]
• We experimented on two small video streams:
• the first shows a calm aquarium scene,
• the second is an action scene with quick changes.
• We present these sequences at different compression ratios using the kernel-based and the Hebbian PCA.
Experiments
• The video snippet consisted of 57 frames; a window length of 10 was used.
Experiments
[Images: original frame; reconstruction with all eigenvectors; with 5 eigenvectors; with 2 eigenvectors]
Experiments
[Images: original frame; reconstruction with all eigenvectors; with 20 eigenvectors; with 15 eigenvectors]
Experiments
Error function: $\varepsilon = E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 = \sum_{i=M+1}^{N}\lambda_i$
[Plots: error vs. number of eigenvectors for the Aquarium and the Star Wars sequences]
Experiments
• Difference between the approximated and the true eigenvectors
[Plots: difference vs. number of eigenvectors, after 100 iterations and after 500 iterations]
Problems with solutions
1. A simple eigenvalue problem: find the eigenvalues and eigenvectors of
$$\mathbf{R} = \begin{pmatrix} 0.5 & 0.5 & 0.5 \\ 0.5 & 1 & 0 \\ 0.5 & 0 & 1 \end{pmatrix}, \qquad \mathbf{R}\,\mathbf{s}_i = \lambda_i\,\mathbf{s}_i, \quad i = 1, 2, 3$$
Solving $\det\left(\lambda\mathbf{I} - \mathbf{R}\right) = 0$ and then $\left(\lambda\mathbf{I} - \mathbf{R}\right)\mathbf{s} = \mathbf{0}$:
$$\lambda_1 = 1.5, \qquad \lambda_2 = 1, \qquad \lambda_3 = 0$$
$$\mathbf{s}_1 = \frac{1}{\sqrt{3}}\begin{pmatrix}1\\1\\1\end{pmatrix}, \qquad \mathbf{s}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}0\\1\\-1\end{pmatrix}, \qquad \mathbf{s}_3 = \frac{1}{\sqrt{6}}\begin{pmatrix}2\\-1\\-1\end{pmatrix}$$
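The result can be checked in a few lines of NumPy (my own verification of the values above):

```python
import numpy as np

R = np.array([[0.5, 0.5, 0.5],
              [0.5, 1.0, 0.0],
              [0.5, 0.0, 1.0]])
eigvals, S = np.linalg.eigh(R)
print(eigvals[::-1])      # approximately [1.5, 1.0, 0.0]
print(S[:, ::-1])         # columns proportional to [1,1,1]/sqrt(3), [0,1,-1]/sqrt(2), [2,-1,-1]/sqrt(6)
```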
Problems with solutions
2. An information flow, compressed from time window to time window (e.g. a video flow), has the following eigenvalues of the correlation matrix:
$$\lambda_m = m \cdot 2^{-m}, \qquad m = 1, \ldots, 1000$$
If we want the mean square error of the PCA compression to be less than or equal to 0.01, how many neurons should we implement in the coder?
Problems with solutions
• The mean square error is defined as:
$$E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 = \sum_{i=1}^{N}\lambda_i - \sum_{i=1}^{M}\lambda_i = \sum_{i=M+1}^{N}\lambda_i$$
• Here:
$$E\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2 = \sum_{i=M+1}^{1000} i \cdot 2^{-i} \leq 0.01 \quad\Rightarrow\quad M = 11, \qquad \sum_{i=12}^{1000} i \cdot 2^{-i} \approx 0.00634766$$
• So at least 11 neurons must be used to achieve the desired error level (with M = 10 the error would be ≈ 0.0117 > 0.01).
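A quick numerical check of this computation (my own verification, using the eigenvalue model stated above):

```python
import numpy as np

m = np.arange(1, 1001)
lam = m * 2.0 ** (-m)                   # lambda_m = m * 2^-m
tail = lam[::-1].cumsum()[::-1]         # tail[M] = sum of discarded eigenvalues when keeping M components
M = int(np.argmax(tail <= 0.01))        # smallest M with error <= 0.01
print(M, tail[M])                       # 11 0.00634765...
```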
Problems with solutions
3. Train the underlying neuron with the Oja algorithm using the following parameters:
$$\mathbf{x} = \begin{bmatrix}1 & -0.5 & -1\end{bmatrix}^T, \qquad \mathbf{w}[0] = \begin{bmatrix}0 & 0 & 1\end{bmatrix}^T, \qquad \eta = 0.1$$
a) Compute the weights after the 2nd iteration: $\mathbf{w}[2] = \;?$
b) Can the following state be stable? Explain your answer!
$$\mathbf{w}[\infty] = \begin{bmatrix}2/3 & -1/3 & -2/3\end{bmatrix}^T$$
Problems with solutions
a) Compute the weights after the 2nd iteration:
The Oja update rule: $\mathbf{w}[k+1] = \mathbf{w}[k] + \eta\, y[k]\left(\mathbf{x}[k] - y[k]\,\mathbf{w}[k]\right)$, with $y[k] = \mathbf{x}^T[k]\,\mathbf{w}[k]$.

Iteration 1:
$$y[0] = \mathbf{x}^T\mathbf{w}[0] = \begin{bmatrix}1 & -0.5 & -1\end{bmatrix}\begin{bmatrix}0\\0\\1\end{bmatrix} = -1$$
$$\mathbf{w}[1] = \mathbf{w}[0] + 0.1\cdot(-1)\cdot\left(\mathbf{x} - (-1)\,\mathbf{w}[0]\right) = \begin{bmatrix}0\\0\\1\end{bmatrix} - 0.1\begin{bmatrix}1\\-0.5\\0\end{bmatrix} = \begin{bmatrix}-0.1\\0.05\\1\end{bmatrix}$$
Problems with solutions
a) Compute the weights after the 2nd iteration:
Iteration 2:
$$y[1] = \mathbf{x}^T\mathbf{w}[1] = \begin{bmatrix}1 & -0.5 & -1\end{bmatrix}\begin{bmatrix}-0.1\\0.05\\1\end{bmatrix} = -1.125$$
$$\mathbf{w}[2] = \mathbf{w}[1] + 0.1\cdot(-1.125)\cdot\left(\mathbf{x} - (-1.125)\,\mathbf{w}[1]\right) = \begin{bmatrix}-0.1\\0.05\\1\end{bmatrix} - 0.1125\begin{bmatrix}0.8875\\-0.44375\\0.125\end{bmatrix} \approx \begin{bmatrix}-0.1998\\0.0999\\0.9859\end{bmatrix}$$
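The two iterations can be replayed in a few lines (my own check of the numbers above):

```python
import numpy as np

x = np.array([1.0, -0.5, -1.0])
w = np.array([0.0, 0.0, 1.0])
eta = 0.1
for k in range(2):
    y = x @ w
    w = w + eta * y * (x - y * w)
    print(k + 1, y, w)
# iteration 1: y = -1.0,   w = [-0.1, 0.05, 1.0]
# iteration 2: y = -1.125, w ~ [-0.1998, 0.0999, 0.9859]
```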
Problems with solutions
b) Can the following state be stable? Explain your answer!
$$\mathbf{w}[\infty] = \begin{bmatrix}2/3 & -1/3 & -2/3\end{bmatrix}^T$$
Let us check whether it can be an eigenvector: every eigenvector must have unit norm and form an orthonormal basis with the others. Since this vector has unit norm, it passes this check.
Let us solve the eigen problem to be sure:
$$\mathbf{R} = \mathbf{x}\cdot\mathbf{x}^T = \begin{pmatrix}1 & -0.5 & -1\\ -0.5 & 0.25 & 0.5\\ -1 & 0.5 & 1\end{pmatrix}, \qquad \det\left(\lambda\mathbf{I} - \mathbf{R}\right) = 0 \;\Rightarrow\; \lambda_1 = 2.25, \quad \mathbf{s}^{(1)} = \begin{bmatrix}\tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{2}{3}\end{bmatrix}^T$$
Since $\mathbf{w}[\infty]$ coincides with the principal eigenvector $\mathbf{s}^{(1)}$, to which the Oja algorithm converges, it can indeed be a stable state.