

3.3 State-Space Reconstruction

3.3.2 Clustering-Based Approach to State Space Reconstruction

Overview of the Proposed Method

The idea of the presented method is illustrated in Figure 3.6. Clustering is applied in the reconstructed space defined by the lagged measured variables y_k. First, the lag time τ is chosen by using the average mutual information (Step 1).

Figure 3.6: A scheme of the presented method.

The key step of the approach is the clustering of the data (Step 2). A model-based clustering algorithm has been developed for the identification of operating regimes and of the parameters of Gaussian mixture models using the Expectation Maximization (EM) method. The local models of the clusters are linear multi-input multi-output (MIMO) or multi-input single-output (MISO) models, and the corresponding operating regions are represented by multivariate Gaussian functions. The embedding dimension is inferred from the one-step-ahead prediction performance of the local models (Step 3).

The intrinsic dimension of the reconstructed space is estimated by analyzing the eigenvalues of the fuzzy cluster covariance matrices (Step 4).

Selection of the Lag Time

The mutual information between two measurements is used to measure the generally nonlinear dependence of two variables. The mutual information for two measurements y_n and y_{n+τ} from the same time series is expressed by

I_n(τ) = log_2 [ P(y_n, y_{n+τ}) / ( P(y_n) P(y_{n+τ}) ) ]    (3.36)

where the individual probability densities, P(y_n) and P(y_{n+τ}), are equal to the frequency with which the data points y_n and y_{n+τ} appear in the time series, respectively. The frequency can be obtained directly by tracing the data points through the entire time series. The joint probability density, P(y_n, y_{n+τ}), is obtained by counting the number of times the pair (y_n, y_{n+τ}) is observed in the series. The average mutual information is computed over all data points in the following manner:

I(τ) = Σ_{n=1}^{N−τ} P(y_n, y_{n+τ}) log_2 [ P(y_n, y_{n+τ}) / ( P(y_n) P(y_{n+τ}) ) ]    (3.37)

When P(y_n, y_{n+τ}) = P(y_n) P(y_{n+τ}), so that I(τ) approaches zero, the data points y_n and y_{n+τ} are completely independent. In practice the first minimum of the average mutual information function is chosen as the lag time [89].
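The lag-time selection of Step 1 can be sketched in code as follows. This is an illustrative implementation, not from the original text: the histogram-based density estimate, the bin count, and the function names are our assumptions.

```python
import numpy as np

def average_mutual_information(y, tau, bins=16):
    """Estimate I(tau) of eq. (3.37) from histograms of the series
    and its lagged copy (a simple plug-in density estimate)."""
    a, b = y[:-tau], y[tau:]
    # Joint probabilities from a 2-D histogram; marginals by summing it out.
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    nz = pxy > 0  # avoid log(0) on empty cells
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))

def first_minimum_lag(y, max_tau=50, bins=16):
    """Return the first local minimum of I(tau), the usual lag-time choice."""
    ami = [average_mutual_information(y, t, bins) for t in range(1, max_tau + 1)]
    for t in range(1, len(ami) - 1):
        if ami[t] < ami[t - 1] and ami[t] < ami[t + 1]:
            return t + 1  # lags are 1-based
    return int(np.argmin(ami)) + 1  # fall back to the global minimum
```

For a sinusoid the first minimum falls near a quarter of the period, which matches the classical heuristic.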

Estimation of the Embedding Dimension based on Local Modeling

If the embedding dimension is properly chosen, the behavior of the time series in the reconstruction space can be described by a smooth nonlinear function f_r:

y_{k+1} = f_r(y_k)    (3.38)

If this mapping is similar to the original dynamic system f, it is required that y_{k+1} = y_{p+1} whenever y_k = y_p. This means that the trajectories must not cross each other. These systems are deterministic, so the state space would not be uniquely determined if the trajectories crossed each other. In terms of the time series this condition amounts to y_{k+d_e} = y_{p+d_e} whenever y_k = y_p, ..., y_{k+d_e−1} = y_{p+d_e−1}, and is thus equivalent to the time series being predictable. The embedding dimension is determined by increasing the number of lagged outputs, d_e, and analyzing the one-step-ahead prediction error of the identified model.

The basic idea of the method can be described as follows: when the input vector of the model, the so-called regressor vector (3.35), contains enough information, i.e., it has the right dimension, the output of the model can be predicted with acceptably small error, i.e., the function f_r in (3.38) can be approximated accurately. The question is what the terms 'acceptably small error' and 'approximated accurately' mean. They mean that the model error is due to the noise and measurement error present in the dataset and to the imperfection of the modeling method, rather than to a lack of information. To satisfy this condition, the measurements should be accurate and an adequate modeling method should be chosen. If the dimension of the regressor vector is smaller than necessary, the output can only be predicted with a large error, but the model error decreases as the information content (dimension) of the regressor vector increases. When the reconstructed space with the proper embedding dimension is analyzed, the model error is reduced to an acceptably small value, and it does not decrease significantly even if the dimension of the regressor vector is increased further, since extra terms in (3.35) do not increase the information content of the regressor. Consequently, a model should be identified for each dimension, and the resulting accuracies should be stored and their changes analyzed. For chaotic systems, the accuracy of a model can be measured by the one-step-ahead prediction error, because predicting more than one step ahead is not possible due to the nature of such systems. The proper embedding dimension d_e is the dimension at which there is a 'knee' in the model error, i.e., a drastic decrease followed by no significant decrease at larger dimensions. As can be seen, model identification stands at the center of the method.
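The knee-detection procedure can be sketched as follows. This is an illustrative simplification, not the method of the text: a single global linear predictor stands in for the local models, and the drop threshold that defines "no significant decrease" is our assumption.

```python
import numpy as np

def one_step_error(y, de, tau=1):
    """Fit one linear predictor of y_{k+1} from [y_k, y_{k-tau}, ...,
    y_{k-tau(de-1)}] and return its RMS one-step-ahead prediction error
    (a simplified, global stand-in for the local models of the text)."""
    start = tau * (de - 1)
    X = np.column_stack(
        [y[start - j * tau : len(y) - 1 - j * tau] for j in range(de)]
    )
    t = y[start + 1 :]
    theta, *_ = np.linalg.lstsq(X, t, rcond=None)
    return np.sqrt(np.mean((X @ theta - t) ** 2))

def embedding_dimension(y, max_de=8, tau=1, drop=0.5):
    """Pick the 'knee': the first d_e after which the error no longer
    drops below `drop` times its previous value."""
    errs = [one_step_error(y, de, tau) for de in range(1, max_de + 1)]
    for de in range(1, len(errs)):
        if errs[de] > drop * errs[de - 1]:  # no substantial further decrease
            return de
    return max_de
```

For a second-order autoregressive series the error drops sharply at d_e = 2 and then flattens, so the knee lands at the true order.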

To identify a model that can be used for the prediction of a time series, global, local and semi-local methods can be used [114]. One of the main disadvantages of global methods is that a new sample pair may change the approximation function everywhere. Local interpolation overcomes this drawback by utilizing only a limited number of neighboring samples. There are two major classes of local methods: those applying the neighboring samples directly in the prediction, and those fitting a function locally to the neighbors. The latter class fits a set of surfaces to the measurement points. These local surfaces can be hyperplanes, but polynomials of higher degree may also be used. For chaotic time series, this local approach was first used in [59]. Other works applying local methods are, e.g., [40] and [91]. The modeling framework that combines a number of local models, where each local model has a predefined operating region in which it is valid, is called an operating regime based model [129]. This approach is advantageous in the modeling of complex nonlinear systems: while it may not be possible to find a model that is universally applicable to describe the unknown MIMO system f_r(.), with the divide-and-conquer paradigm it is possible to decompose the complex problem into a set of smaller identification problems where standard linear models give satisfactory performance.

This operating-regime based model is formulated as follows:

y_{k+1} = Σ_{i=1}^{c} β_i(y_k) θ_i^T y_k    (3.39)

where β_i(y_k) describes the operating regime (validity) of the i-th local linear model with parameter matrix θ_i. In this section it is assumed that β_i is known; in Section 3.3.2 this function will be computed through fuzzy clustering.
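The way such a model produces a prediction, blending the local linear one-step predictions with their validity weights, can be sketched as follows (illustrative code; the function name and the choice of purely linear local models are our assumptions):

```python
import numpy as np

def operating_regime_predict(yk, thetas, betas):
    """One-step prediction of an operating-regime based model:
    the validity-weighted sum of the local linear model outputs.
    yk:     current regressor vector
    thetas: list of local model parameter matrices theta_i
    betas:  validity weights beta_i(yk), assumed given and summing to 1"""
    return sum(b * (theta.T @ yk) for b, theta in zip(betas, thetas))
```

With the weight of one regime set to 1 and the others to 0, the output reduces to that single local model, which is the limiting (crisp) case of the soft blending.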

The main advantage of this framework is its transparency. The operating regions of the local models can be represented by fuzzy sets [19]. This representation is appealing, since many systems change their behavior smoothly as a function of the operating point, and the soft transition between the regimes introduced by the fuzzy set representation captures this feature in a natural way.

The output of this MIMO model is the vector y_{k+1} = [y_{k+1}, y_{k−τ+1}, ..., y_{k−τ(d_e−1)+1}]^T. As only the first element y_{k+1} is unknown, another possible method for the state space reconstruction problem is the multi-input single-output (MISO) modeling approach. In this case, only y_{k+1} is estimated.

Both methods are used in the examples in Section 3.3.3, because the MIMO approach gives not only a prediction error value but also takes into account the error covariance matrices of each local model, and therefore gives an insight into the reconstructed space.

The outputs of the MIMO and MISO models are linear in the parameters. Therefore, these parameters can be estimated from the data by linear least-squares techniques (see also Section 1.6). The N identification data pairs are arranged in the following regressor matrix φ and regressand matrix Y, which contain the same terms, y_k, just shifted by one sample time.

φ = [y_1, y_2, ..., y_N]^T    (3.40)

Y = [y_2, y_3, ..., y_{N+1}]^T    (3.41)

B_i = diag( β_i(y_1), β_i(y_2), ..., β_i(y_N) )    (3.42)

By using this notation, the weighted least-squares solution for θ_i is:

θ_i = [ φ^T B_i φ ]^{-1} φ^T B_i Y.    (3.43)
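Since B_i is diagonal, (3.43) can be evaluated without ever forming the N×N matrix explicitly. A minimal sketch (illustrative code, not from the original text; the function name is ours):

```python
import numpy as np

def local_wls(phi, Y, beta_i):
    """Weighted least squares of eq. (3.43):
    theta_i = (phi^T B_i phi)^{-1} phi^T B_i Y.
    beta_i holds the diagonal of B_i (one membership weight per sample)."""
    Bphi = phi * beta_i[:, None]  # equivalent to B_i @ phi, row-wise scaling
    BY = Y * beta_i[:, None]      # equivalent to B_i @ Y
    # Solve the normal equations instead of forming an explicit inverse.
    return np.linalg.solve(phi.T @ Bphi, phi.T @ BY)
```

On noise-free data generated by a known linear model, the estimate recovers the true parameters exactly, for any positive weighting.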

As this method forces the local linear models to fit the data locally, it does not give an optimal model in terms of a minimal global prediction error, but it ensures that the fuzzy model is interpretable as a Linear Parameter Varying (LPV) system [3].

Cluster Analysis in the Reconstructed Space

The bottleneck of the data-driven identification of operating regime based models is the identification of the functions that can represent the operating regimes of the models (membership functions in the terminology of fuzzy models), which requires nonlinear optimization. In this section Gaussian functions are used to represent the operating regimes of the linear models:

β_i(y_k) = exp( −(1/2) (y_k − v_i)^T F_i^{-1} (y_k − v_i) ) / Σ_{j=1}^{c} exp( −(1/2) (y_k − v_j)^T F_j^{-1} (y_k − v_j) )    (3.44)

where v_i is the center and F_i is the covariance matrix of the multivariate Gaussian function.
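The normalized Gaussian validity functions of (3.44) can be sketched as follows (illustrative code; the function name is ours):

```python
import numpy as np

def memberships(yk, centers, covs):
    """Normalized Gaussian operating-region weights of eq. (3.44).
    yk:      a point in the reconstructed space, shape (de,)
    centers: cluster centers v_i, shape (c, de)
    covs:    covariance matrices F_i, shape (c, de, de)"""
    g = np.empty(len(centers))
    for i, (v, F) in enumerate(zip(centers, covs)):
        d = yk - v
        # Unnormalized Gaussian; solve() avoids an explicit matrix inverse.
        g[i] = np.exp(-0.5 * d @ np.linalg.solve(F, d))
    return g / g.sum()  # normalization makes the weights sum to one
```

At a cluster center the corresponding weight dominates, and the weights always sum to one, which gives the smooth partition of the operating space described in the text.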

A new clustering-based technique for the identification of these parameters is presented. The objective is to partition the identification data Y = [y_1, ..., y_N]^T into c clusters to reveal the underlying structure of the data. The patterns belong to clusters with degrees that are in inverse proportion to their distance from the respective cluster prototypes. The basic idea of the presented algorithm is to define the cluster prototype such that it locally approximates the MIMO function y_{k+1} = f_r(y_k).

In this way, the algorithm simultaneously partitions the data (i.e., identifies the operating regimes of the local models) and determines the cluster prototypes (i.e., identifies the local model parameters).

The fuzzy partition is represented by the matrix U = [µ_{i,k}]_{c×N}, where the element µ_{i,k} represents the membership degree of y_k in cluster i.

The clustering is based on minimizing the following cost function:

J(Y, U, η) = Σ_{i=1}^{c} Σ_{k=1}^{N} (µ_{i,k})^m d²(y_{k+1}, y_k, η_i)    (3.45)

where the squared distance d²(y_{k+1}, y_k, η_i) is given by the probability that the data belong to a given cluster:

1 / d²(y_{k+1}, y_k, η_i) = p(η_i) p(y_k | η_i) p(y_{k+1} | y_k, η_i)    (3.46)

with

p(y_k | η_i) = 1 / ( (2π)^{d_e/2} √(det F_i) ) · exp( −(1/2) (y_k − v_i)^T F_i^{-1} (y_k − v_i) )

p(y_{k+1} | y_k, η_i) = 1 / ( (2π)^{d_e/2} √(det P_i) ) · exp( −(1/2) (y_{k+1} − ŷ_{i,k+1})^T P_i^{-1} (y_{k+1} − ŷ_{i,k+1}) )

where ŷ_{i,k+1} denotes the one-step-ahead prediction of the i-th local model.

The term p(η_i) represents the a priori probability of the i-th cluster defined by the set of parameters η_i (see equation (3.50) below). The Gaussian distribution p(y_k | η_i) defines the domain of influence of a cluster (see (3.44)). The third term is based on the performance of the local linear models, where P_i is the weighted covariance matrix of the modeling error of the i-th local model.

The above distance function can be seen as the geometric mean of the distance measures of the Gath–Geva algorithm [65] (the first two terms) and the fuzzy c-regression (FCR) algorithm [76] (the last term). The Gath–Geva algorithm takes the distribution of the data into account and is able to reveal clusters of arbitrary ellipsoidal shape and size, but it is not able to fit models to the data; that is why the third term is introduced. However, the FCR distance measure is not sufficient on its own either, because the same linear model (the same hyperplane in an n-dimensional space) may happen to fit different regions of the input domain, in which case the local linear model would not be 'compact' and its operating regime could not be described by a (Gaussian) function. To avoid this, the Gath–Geva distance measure should also be taken into account, as above.

The minimization of the cost function (3.45) represents a nonlinear optimization problem subject to the constraints (1.8), (1.9) and (1.10). It can be solved by a variety of methods, the most popular of which is alternating optimization (AO), formulated in Algorithm 3.3.1. The main advantage of the presented clustering-based approach is that the local dimension can be estimated by analyzing the shape of the clusters, as presented in the following section.

Algorithm 3.3.1 (Modified Gath–Geva Clustering for MIMO Modeling).

Initialization Given a set of data Y, specify the number of clusters, c, choose a weighting exponent 1 < m < ∞ (usually m = 2) and a termination tolerance ε > 0. Initialize the partition matrix U = [µ_{i,k}]_{c×N} randomly in such a way that the constraints hold, or use the partition matrix given by another clustering method such as the fuzzy c-means or Gustafson–Kessel algorithm.

Repeat for l = 1, 2, ...

Calculate the parameters of the clusters:

Centers of the membership functions:

v_i^{(l)} = Σ_{k=1}^{N} (µ_{i,k})^m y_k / Σ_{k=1}^{N} (µ_{i,k})^m    (3.47)

Fuzzy covariance matrices of the Gaussian membership functions:

F_i = Σ_{k=1}^{N} (µ_{i,k})^m (y_k − v_i^{(l)}) (y_k − v_i^{(l)})^T / Σ_{k=1}^{N} (µ_{i,k})^m    (3.48)

Parameters of the local models are computed by (3.43), where the elements of the B_i matrix are equal to the memberships: β_i(y_k) = µ_{i,k}.

Covariance matrices of the modeling errors of the local models:

P_i = Σ_{k=1}^{N} (µ_{i,k})^m (y_{k+1} − ŷ_{i,k+1}) (y_{k+1} − ŷ_{i,k+1})^T / Σ_{k=1}^{N} (µ_{i,k})^m    (3.49)

A priori probability of the cluster:

p(η_i) = (1/N) Σ_{k=1}^{N} µ_{i,k}    (3.50)

Update the partition matrix using the distance measure (3.46):

µ_{i,k}^{(l)} = 1 / Σ_{j=1}^{c} ( d(y_{k+1}, y_k, η_i) / d(y_{k+1}, y_k, η_j) )^{2/(m−1)}, 1 ≤ i ≤ c, 1 ≤ k ≤ N    (3.51)

until ||U^{(l)} − U^{(l−1)}|| < ε.
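The parameter-update step of the algorithm can be sketched as follows. This is an illustrative fragment, not the full algorithm: the local-model fit, the distance computation, and the partition-matrix update are omitted, and the function name is ours.

```python
import numpy as np

def update_cluster_parameters(Y, U, m=2):
    """One parameter-update pass of the alternating optimization:
    fuzzy means, fuzzy covariance matrices and a priori probabilities
    computed from the current partition matrix.
    Y: (N, de) data points; U: (c, N) membership degrees."""
    Um = U ** m
    centers, covs, priors = [], [], []
    for i in range(U.shape[0]):
        w = Um[i] / Um[i].sum()           # normalized fuzzy weights
        v = w @ Y                          # weighted mean (cluster center)
        d = Y - v
        F = (d * w[:, None]).T @ d         # weighted covariance matrix
        centers.append(v)
        covs.append(F)
        priors.append(U[i].mean())         # a priori cluster probability
    return np.array(centers), np.array(covs), np.array(priors)
```

With a crisp partition of two well-separated groups, the centers reduce to the group means and the priors to the group fractions, which is the degenerate (hard-clustering) limit of these formulas.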

Estimation of the Local Dimension

Local (or topological) dimension of the reconstructed state space is the basis dimension of the local linear approximation of the hypersurface on which the data resides, i.e., the tangent space [37]. The most popular algorithms to estimate the local dimension are Fukunaga-Olsen’s algorithm [64], Near-Neighbor algorithm [144], Topological Representing Networks [123], different projection techniques, and fractal-based methods. A method is presented that exploits the shape of the clusters, and our results will be compared with the so-called Correlation dimension [69].

In the literature on nonlinear dynamics, many definitions of fractal dimensions have been proposed.

Box-Counting (dBC) and Correlation dimensions (dC) are the most popular ones. The dBC of a set Y is defined as follows. If ν(r) is the number of boxes of size r needed to cover Y, then dBC is

dBC = lim_{r→0} ln(ν(r)) / ln(1/r)    (3.52)

The Box-Counting dimension can only be computed for low-dimensional spaces because the algorithmic complexity grows exponentially with the set dimensionality. A good alternative to the Box-Counting dimension is the Correlation dimension. Due to its computational simplicity, the Correlation dimension has been successfully used to estimate the dimension of attractors of dynamical systems.
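For finite data, the limit in (3.52) is replaced in practice by the slope of ln ν(r) against ln(1/r) over a range of box sizes. A minimal sketch (illustrative code, not from the original text; the size grid is our assumption):

```python
import numpy as np

def box_counting_dimension(Y, sizes):
    """Estimate d_BC by counting occupied boxes nu(r) at several box
    sizes r and fitting the slope of ln(nu) against ln(1/r)."""
    counts = []
    for r in sizes:
        # Assign each point to a grid cell and count distinct cells.
        boxes = np.floor(Y / r).astype(int)
        counts.append(len(np.unique(boxes, axis=0)))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```

Points lying on a line yield an estimate near 1, and points filling a square yield an estimate near 2, as expected for these sets.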

The Correlation dimension is defined as follows. Let Y = [y_1, y_2, ..., y_N]^T be a set of points of cardinality N. Define the correlation integral C_m(r) as

C_m(r) = lim_{N→∞} 2 / ( N(N−1) ) Σ_{i=1}^{N} Σ_{j=i+1}^{N} I( ||y_i − y_j|| ≤ r )    (3.53)

where I(·) is the indicator function (I(·) is 1 if condition (·) holds, 0 otherwise). The Correlation dimension dC is

dC = lim_{r→0} ln(C_m(r)) / ln(r)    (3.54)

It can be proven that both dimensions are special cases of the generalized Renyi dimension [37].
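In the finite-N case, the limits in (3.53) and (3.54) are again replaced by a slope fit over a range of radii. An illustrative sketch (not from the original text; the radius grid and function name are our assumptions):

```python
import numpy as np

def correlation_dimension(Y, radii):
    """Estimate d_C as the slope of ln C_m(r) against ln r, where C_m(r)
    is the fraction of distinct point pairs closer than r (the finite-N
    form of the correlation integral)."""
    n = len(Y)
    diff = Y[:, None, :] - Y[None, :, :]
    # Pairwise distances, upper triangle only (distinct pairs).
    dist = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(n, k=1)]
    C = np.array([np.mean(dist <= r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope
```

Points on a line embedded in 3-D give an estimate near 1, and points on a planar patch give an estimate near 2, which is the behavior exploited when comparing the cluster-based estimate with d_C.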

The method presented in this section is able to determine the embedding and the local dimension simultaneously, based on the result of the clustering. The local dimension can be estimated from the shape of the clusters because the collection of c clusters approximates the tangent space. Hence, the clusters can approximately be regarded as local linear subspaces described by the cluster ellipsoids (covariance matrices).

To detect linear subspaces, Principal Component Analysis (PCA), i.e., eigenanalysis of the fuzzy covariance matrices, has been used. The eigenvector associated with the largest eigenvalue has the same direction as the first principal component; the eigenvector associated with the second largest eigenvalue determines the direction of the second principal component. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Hence, the first few eigenvalues are much larger than the remaining ones and cover the greatest part of the variance of the data set.

In general, the smallest eigenvalues of the cluster covariance matrices, F_i, are orders of magnitude smaller than the remaining ones [10, 16].

The local dimension is estimated based on the sum of the eigenvalues weighted by the a priori probability of the clusters:

J_{d_e, j} = Σ_{i=1}^{c} p(η_i) λ_{i,j}    (3.55)

where λ_{i,j} is the j-th largest eigenvalue of the i-th covariance matrix. The local dimension is determined by computing this index for all values of j, j = 1, ..., d_e, for the proper embedding dimension chosen beforehand based on the one-step-ahead prediction errors. The local dimension is selected as the cutoff point at a sharp change in the slope of the graph, or in such a way that

( Σ_{j=1}^{k} J_{d_e, j} ) / ( Σ_{j=1}^{d_e} J_{d_e, j} ) > threshold    (3.56)

This method is often referred to as scree plot analysis, which is commonly used in the literature to estimate the proper dimension of Principal Component Analysis models, e.g., in [160]. Srinivasan et al. used 0.93 as the threshold, but it can be replaced by another value, based on domain knowledge and the expected noise levels.
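The eigenvalue-based estimate of (3.55) and the cumulative cutoff of (3.56) can be sketched together as follows (illustrative code, not from the original text; the function name is ours):

```python
import numpy as np

def local_dimension(covs, priors, threshold=0.93):
    """Estimate the local dimension from the prior-weighted eigenvalue
    sums J_j = sum_i p(eta_i) * lambda_{i,j} of eq. (3.55), using the
    cumulative-variance cutoff of eq. (3.56)."""
    # Eigenvalues of each cluster covariance, sorted in decreasing order.
    lams = np.array([np.sort(np.linalg.eigvalsh(F))[::-1] for F in covs])
    J = priors @ lams                     # J_{de,j}, j = 1..de
    cum = np.cumsum(J) / J.sum()          # left side of eq. (3.56)
    return int(np.argmax(cum > threshold)) + 1
```

For clusters whose covariance matrices have two dominant eigenvalues and one eigenvalue orders of magnitude smaller, the cutoff lands at 2, mirroring the scree-plot reading described above.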