
A novel similarity measure for multivariate time series with complex correlation structure was introduced in this chapter. The presented method combines covariance-driven segmentation, dynamic time warping and the PCA similarity factor. First, two PCA-driven homogeneity measures were utilized as the cost functions for segmentation. These homogeneity measures correspond to the two typical applications of PCA models: the Q reconstruction error can be used to segment the time series according to direct changes in the correlation among the variables, while Hotelling's T2 statistic can be utilized to segment the time series based on the drift of the center of the operating region. The dissimilarity between the segments was derived from the PCA similarity factor. Finally, dynamic time warping was applied to compensate for time shifts and make the presented dissimilarity measure more accurate.
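For reference, the PCA similarity factor underlying the segment dissimilarity can be written in a few lines. The following is a minimal sketch assuming numpy; the function names pca_loadings and spca are illustrative, not part of the presented implementation.

```python
import numpy as np

def pca_loadings(X, n_pc):
    """First n_pc PCA loading vectors (columns) of a segment."""
    Z = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:n_pc].T

def spca(X1, X2, n_pc):
    """Krzanowski's PCA similarity factor: the mean squared cosine of the
    angles between the first n_pc principal directions of two segments."""
    L, M = pca_loadings(X1, n_pc), pca_loadings(X2, n_pc)
    return float(np.trace(L.T @ M @ M.T @ L)) / n_pc

# 1 - spca(seg_a, seg_b, n_pc) is one way to turn the similarity factor into
# a dissimilarity; in CBDTW the local dissimilarity used inside DTW is
# derived from this factor.
```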

To demonstrate that CBDTW can be expected to outperform the PCA similarity factor in any environment, CBDTW was tested on two datasets that differ from each other regarding their correlation structure. The AUSLAN dataset has 22 variables with a complex correlation structure. It was selected to simulate "typical" industrial data, i.e. a large number of variables whose correlation structure cannot be revealed without the application of PCA. The algorithm was also tested on the SVC2004 dataset in which, in contrast to the AUSLAN dataset, the correlation between the variables is obvious and better results can be achieved if not all the variables are utilized.

The precision-recall graphs showed the superiority of CBDTW over the PCA similarity factor in precision, irrespective of the complexity of the hidden process. However, CBDTW is not only a direct replacement for the PCA similarity factor. As DTW allows the selection of its local dissimilarity measure, any other PCA-based similarity measure can also be replaced by CBDTW: all that has to be done is to change the local dissimilarity measure of DTW to a new one derived from the PCA-based similarity measure to be replaced. Moreover, CBDTW even outperforms Euclidean-distance-based dynamic time warping when a large number of variables with a complex correlation structure has to be handled, making CBDTW one of the best choices for such datasets.

Chapter 4

Process dynamics-aware multivariate time series segmentation method for process dynamics-based data mining

Principal component analysis (PCA) based time series analysis methods have become basic tools of every process engineer in the past few years, thanks to their efficiency and solid statistical basis. However, as they have progressively gained reputation in almost every scientific area, their limitations have become more obvious.

PCA assumes linear relationships between the variables and is not able to consider process dynamics, thus various solutions were proposed to address the limitations arising from linearity (kernel-function approach [94], generative topographic mapping [95], neural networks [96], mixture of principal component analyzers [40], etc.).

Another popular alternative to the non-linear methods is to split the data into locally linear segments and to use the PCA models of these segments. However, as Kivikunnas [87] reported, although in many real-life applications a lot of variables must be simultaneously tracked and monitored, most segmentation algorithms are used for the analysis of only one time-variant variable.

In Section 3.1, the PCA-based multivariate time series segmentation method of Feil et al. [89] was presented, which addressed the segmentation problem. The nonlinear processes were split into locally linear segments by using the T2 and Q statistics as cost functions. Although this solution gives the possibility to segment a multivariate time series according to the needs of PCA, it shall not be forgotten that PCA was created for analyzing steady-state processes, and thus it is not able to handle any process dynamics. However, this fact seems to be rarely considered before a PCA-based technique is applied. Consequently, changes in process dynamics are not discovered and, what is even worse, wrong conclusions can also be drawn.

Different solutions were proposed to address this problem. Ku et al. [97] suggested the application of PCA to an extended data matrix that contains past values of each variable, and named this method dynamic principal component analysis (DPCA).

Negiz and Çinar [98] proposed a state space model based on canonical variates, and Simoglou et al. [99] presented an approach for monitoring continuous dynamic processes that involved canonical variates and partial least squares. However, none of these methods can substitute any elastic dissimilarity measure, as they cannot handle fluctuations in process dynamics.

Although Negiz and Çinar [98] also pointed out that DPCA can be severely affected by process noise and requires more computational power than PCA, it is a powerful tool for environments with less noise (or where proper noise filtering can be done), and it can be easily integrated into existing PCA-based frameworks and toolsets.

Considering this fact, and motivated by the success of correlation-based dynamic time warping (CBDTW, see Section 3.3), it was desirable to extend CBDTW to dynamic processes by incorporating dynamic principal component analysis into correlation-based time warping. To be able to do this, however, the difficulties and problems behind process dynamics-based segmentation had to be solved as well.

4.1 Dynamic principal component analysis-based segmentation

The time-variant property of dynamic processes cannot be handled by standard PCA-based tools, since PCA assumes that no time dependency exists between the data points. This drawback motivated Ku et al. [97] to dynamize PCA for the needs of dynamic processes. Consider the following process:

Y_g^T(k+1) = A_1 Y_g^T(k) + ... + A_{t_a} Y_g^T(k - t_a) + B_1 U_h^T(k) + ... + B_{t_b} U_h^T(k - t_b) + C,    (4.1)

where A_1, ..., A_{t_a} and B_1, ..., B_{t_b} are g x g and g x h matrices, respectively, C is a column vector, t_a and t_b show the time dependence, U_h(k) is the k-th sample of the (multivariate) input and Y_g(k) is the (multivariate) output at the same time instant. In case of standard (static) PCA, the multivariate time series of such a process is formed by the inputs (flow rates, temperatures, etc.) and the outputs (properties of the final product):

X_n = [Y_g, U_h]    (4.2)

Ku et al. [97] pointed out that performing PCA on the above X_n data matrix preserves the time dependence of the original series, and this obviously reduces the performance of the applied algorithm. They suggested that the X_n data matrix should be formed by considering the process dynamics at each sample point. Generally speaking, each sample point is complemented with the sample points from the past on which it supposedly depends, i.e. the lagged values:

X_nl(k) = [Y_g(k), U_h(k), ..., Y_g(k - l), U_h(k - l)],    (4.3)

where l denotes the time lag. To select its optimal value, different types of methods have been suggested over the last decades. For simplicity, a correlation-based heuristic is used in this thesis, as also suggested by Ku et al. [97], but more advanced methods such as the Akaike information criterion [100, 101], auto- and cross-correlation-based metrics [99], and the autocorrelation of the autoregressive model [102] have also been utilized for the determination of the time lag l.
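How the dynamized data matrix of Eq. (4.3) can be assembled, and how a simple eigenvalue-counting rule can guide the choice of the lag, is sketched below. This is a minimal illustration assuming numpy; the names lagged_matrix and select_lag and the stopping rule are simplifications inspired by, not an exact reproduction of, the correlation-based heuristic of Ku et al. [97].

```python
import numpy as np

def lagged_matrix(X, l):
    """Build the dynamized data matrix X_nl of Eq. (4.3): every row k is
    extended with the l previous rows, [X(k), X(k-1), ..., X(k-l)]."""
    N, _ = X.shape
    return np.hstack([X[l - j : N - j] for j in range(l + 1)])

def select_lag(X, max_lag=5, tol=1e-4):
    """Illustrative lag selection: increase the lag until one more lag no
    longer reveals additional (near-)zero eigenvalues, i.e. no new linear
    relations appear."""
    prev_zeros = 0
    for l in range(1, max_lag + 1):
        Xl = lagged_matrix(X, l)
        eigvals = np.linalg.eigvalsh(np.corrcoef(Xl, rowvar=False))
        zeros = int(np.sum(eigvals < tol * eigvals.max()))
        if zeros <= prev_zeros:
            return l - 1
        prev_zeros = zeros
    return max_lag
```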

As mentioned, linear relations exist between the lagged inputs and outputs due to the process dynamics. These relations are preserved within the PCA in the form of auto- and cross-correlations of the scores. Performing PCA on the above modified data matrix moves these unwanted correlations to the noise subspace, because the possible combinations of the shifted input and output variables are present in the data matrix and only the most important combinations of these are selected by the PCA. Moreover, the first zero (or close to zero) eigenvalue indicates a linear relationship between the variables, which is revealed by the eigenvector belonging to this eigenvalue.
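The role of the near-zero eigenvalues can be made concrete on a toy first-order process. The following numpy sketch uses made-up coefficients and only illustrates the statement above; it is not part of the presented method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy first-order process y(k+1) = 0.8 y(k) + 0.5 u(k) + small noise
# (illustrative coefficients only).
N = 2000
u = rng.standard_normal(N)
y = np.zeros(N)
for k in range(N - 1):
    y[k + 1] = 0.8 * y[k] + 0.5 * u[k] + 0.01 * rng.standard_normal()

X = np.column_stack([y, u])        # static data matrix X_n = [Y, U]
Xl = np.hstack([X[1:], X[:-1]])    # lag-1 dynamized matrix X_nl

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Xl, rowvar=False))
print(eigvals[0])       # close to zero: one linear relation is present
print(eigvecs[:, 0])    # the corresponding eigenvector encodes the
                        # (standardized) relation among y(k), y(k-1), u(k-1)
```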

Based on these facts, and considering the PCA-based segmentation method presented in Section 3.1, a novel multivariate time series segmentation method is introduced in this section that considers process dynamics via the following steps (a sketch of the resulting pipeline is given after the list):

1. Create the dynamized data matrix X_nl of the inputs and outputs at every sample point.

2. Determine the optimal value of the time lag l and the number of retained principal components.

3. Select the suitable PCA statistic (T2 or Q) as the segmentation cost (see Section 3.1 for details).

4. Apply the PCA-based segmentation method introduced by Feil et al. [89].
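The sketch below shows how these steps fit together, assuming numpy. The bottom-up merging loop is only a generic placeholder for the segmentation method of Feil et al. [89] (which is not reproduced here), and all names (q_cost, segment_dynamic) are illustrative.

```python
import numpy as np

def q_cost(segment, n_pc):
    """Q (reconstruction error) of a segment under its own PCA model,
    used here as the segmentation cost (cf. Section 3.1)."""
    Z = segment - segment.mean(0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_pc].T                 # retained loading vectors
    resid = Z - Z @ P @ P.T         # part falling into the residual subspace
    return float(np.sum(resid ** 2))

def segment_dynamic(X, lag, n_pc, n_segments, init_len=20):
    """Steps 1-4 in a nutshell: dynamize the data matrix, then merge
    neighbouring segments bottom-up according to the Q cost."""
    # step 1: dynamized data matrix X_nl (lag and n_pc come from step 2)
    Xl = np.hstack([X[lag - j : len(X) - j] for j in range(lag + 1)])
    bounds = list(range(0, len(Xl), init_len)) + [len(Xl)]
    # step 4: generic bottom-up merging driven by the cost chosen in step 3
    while len(bounds) - 1 > n_segments:
        merge_costs = [q_cost(Xl[bounds[i]:bounds[i + 2]], n_pc)
                       for i in range(len(bounds) - 2)]
        del bounds[int(np.argmin(merge_costs)) + 1]
    return bounds  # segment boundaries in the dynamized series
```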