

1.2.3 Overview of recent advances on time series similarity

In process engineering, time series similarity has always been a popular topic, because it is not trivial to derive functions and/or numeric values from a highly subjective abstraction [103]. In fact, every method tries to approach the ability of the human mind [127]: if process engineers and process operators want to control a system, they compare the resulting trajectories with their pre-imagined ones and decide whether the control strategy was sufficiently good or bad. It is also a difficult task to reproduce the operators' qualification strategy that decides how similar two trajectories are, because their work experience automatically neglects unimportant features in a trend and handles shifts in time.

After the optimal pre-imagined trajectories are mapped into time series data as a function of time, and a specific similarity (or dissimilarity) measure is predefined, time series similarity algorithms can compare these data sets. A major difference between these algorithms and the human mind is that algorithms work in an objective manner: qualitative trend analysis (QTA) or time series analysis is based on objective functions and values instead of subjective fuzzy categories, like 'quite good' or 'moderately high', so its results can be more adequate and reliable.

Another advantage of QTA is that it helps users to decide in a multi-objective environment, where modern automated processes provide large amounts of data every minute. It is becoming more and more difficult for a human observer to monitor the trends and trajectories, while undiscovered correlations may remain hidden in the highly multidimensional data space [104].

The original trend analysis techniques were financial rather than engineering tools. They were applied to sales forecasting or to pattern recognition of seasonality based on historical data [128]. Recent algorithms have stronger process engineering relations and analyze process data for, e.g., fault diagnosis or monitoring applications. In the following, a brief review is given of (rather quantitative) trend analysis.

The simplest way of comparing two equi-sized and normalized data sequences ordered by time is to define a distance measure and a threshold value, where the sum of the calculated distances should stay below that threshold for the two trends to be considered similar. These distance measures can be Lp-norms, where L2 is the Euclidean distance norm in metric space. To handle time shifts as well, dynamic time warping (DTW) was developed, which results in an optimal alignment of two time series based on dynamic programming [129].
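As an illustration of the dynamic programming recursion behind DTW, the following sketch computes the classical cumulative-cost matrix for two univariate sequences; the function name and the squared-Euclidean local cost are illustrative choices, not the exact formulation used in [129].

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal DTW sketch: cumulative cost of aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2          # local (squared Euclidean) cost
            D[i, j] = cost + min(D[i - 1, j],          # step in x only
                                 D[i, j - 1],          # step in y only
                                 D[i - 1, j - 1])      # match both samples
    return D[n, m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))     # warped but similar pair -> low cost
```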

Its continuity constraint does not allow injecting gaps into the alignment, so it cannot align different-sized sequences. These disadvantages are solved by another technique called longest common subsequence (LCS) [130]. A lot of research has been done on DTW- and LCS-like measures of time series similarity [131, 132].
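A minimal sketch of the LCS idea for real-valued series is given below; treating two samples as matching when they differ by less than a tolerance eps is one common (here assumed) way to adapt string LCS to numeric trends, not necessarily the exact formulation of [130].

```python
def lcss_length(x, y, eps=0.5):
    """Length of the longest common subsequence; |x_i - y_j| < eps counts as a match."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:         # samples close enough -> extend the match
                L[i][j] = L[i - 1][j - 1] + 1
            else:                                      # otherwise skip a sample in x or in y (a "gap")
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

# A similarity value can be normalized by the shorter length, e.g. lcss_length(x, y) / min(len(x), len(y)).
print(lcss_length([1.0, 2.0, 3.0, 4.0], [1.1, 5.0, 2.1, 3.2]))
```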

Similarity is also a common concept in other scientific areas where qualitative analysis of symbolic sequences takes place; e.g., string comparison is typical and well described in the fields of text retrieval and bioinformatics. Optimal alignment of amino acid or nucleotide sequences is solved by a fast dynamic programming based algorithm developed by Needleman and Wunsch [133]. It is analogous to DTW, but it allows injecting gaps into a sequence or mutating a part of the sequence, so it has the advantages of LCS as well. Additionally, there are not only insertion and deletion but also mutation and substitution operators to express how far the evolved sequences are from each other. In bioinformatics, instead of metric distance measures, empirically computed similarity matrices are widely used (PAM and BLOSUM matrices). They usually do not fulfill all requirements of a metric (non-negativity, identity, symmetry, subadditivity) but are an efficient basis for amino acid sequence comparison. Applying user-defined similarity values instead of empirical transformation weights, this general sequence alignment technique is able to optimally align symbolic sequences of different length.
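To show how user-defined similarity values and gap penalties enter the Needleman-Wunsch recursion, a compact sketch follows; the toy scoring table and gap penalty are invented for illustration and do not correspond to PAM or BLOSUM values.

```python
def needleman_wunsch(a, b, sim, gap=-2):
    """Global alignment score of symbol sequences a, b with similarity table sim and linear gap penalty."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                                               # leading gaps in b
    for j in range(1, m + 1):
        F[0][j] = j * gap                                               # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sim[(a[i - 1], b[j - 1])],  # match / substitution
                          F[i - 1][j] + gap,                            # gap in b
                          F[i][j - 1] + gap)                            # gap in a
    return F[n][m]

# Hypothetical symmetric similarity table over a two-letter alphabet.
sim = {('A', 'A'): 2, ('B', 'B'): 2, ('A', 'B'): -1, ('B', 'A'): -1}
print(needleman_wunsch("AABBA", "ABBA", sim))
```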

As an input to pairwise sequence alignment, the data need to be preprocessed so that string sequences are obtained instead of variable values. Hence, a technique is needed that results in an adequate symbolic representation of a trend. A hierarchy of trend representation techniques with numerous references is presented in [68]. Most of these are preferred in the data mining community over symbolic representation, and they have already been combined with many comparison algorithms, like piecewise linear approximation (PLA) based DTW [134], but symbolic representation based trend analysis is still largely unnoticed. Actually, to the author's knowledge, Symbolic Aggregate approXimation (SAX) developed by Lonardi et al. [68] is the only work in this area; it uses piecewise aggregate approximation as a segmentation basis for string conversion. SAX itself is a very powerful trend compression and representation technique, but it is rather quantitative in the sense that it encodes different value levels into the symbolic sequence of a trend.
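A minimal SAX-style sketch is given below: the series is z-normalized, reduced by piecewise aggregate approximation (PAA), and each segment mean is mapped to a letter using breakpoints that divide the standard normal distribution into equiprobable regions; the segment count and alphabet size are illustrative parameters, not values prescribed by [68].

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments=4, alphabet="abcd"):
    """SAX-style symbolization: z-normalize, PAA, then quantize segment means into letters."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                        # z-normalization
    segments = np.array_split(x, n_segments)            # PAA segments of roughly equal length
    paa = np.array([seg.mean() for seg in segments])
    # Breakpoints yielding equiprobable regions under the standard normal distribution.
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    indices = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[i] for i in indices)

print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4, alphabet="abcd"))   # -> "abcd"
```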

To arrive at a qualitative analysis of trends by symbolic representation, this thesis proposes the application of the formal framework developed by Cheung and Stephanopoulos to represent process data as triangular episode sequences [135]. Triangular episodes use the first and second derivatives of a trend on a geometrical basis; hence, seven primitive episodes can be obtained as characters, which denote the shape of the time series over a time interval (see Section 3.1). Many researchers in the literature have found feature extraction by episodes useful for fault diagnosis, decision support services or system monitoring, but many of them have also modified the definition or the set of primitives [104, 136, 103, 101]. In [102], episodes are partitioned into fuzzy episodes by the magnitude and duration of change, so as to obtain a larger symbol set for representing trends.
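To indicate how such a symbolic encoding can be obtained from sampled data, the sketch below labels each sample by the signs of numerically estimated first and second derivatives; the letter assignment is a hypothetical placeholder and not the exact primitive definitions of [135], which are given in Section 3.1.

```python
import numpy as np

def episode_symbols(series, dt=1.0, tol=1e-6):
    """Label each sample by the signs of the estimated first and second derivatives."""
    x = np.asarray(series, dtype=float)
    d1 = np.gradient(x, dt)                             # first derivative estimate
    d2 = np.gradient(d1, dt)                            # second derivative estimate
    sign = lambda v: 0 if abs(v) < tol else (1 if v > 0 else -1)
    # Hypothetical letter per (sign of d1, sign of d2) pair; the thesis uses the
    # seven geometric primitives of Cheung and Stephanopoulos instead.
    table = {(1, 1): 'A', (1, 0): 'B', (1, -1): 'C',
             (0, 0): 'D',
             (-1, 1): 'E', (-1, 0): 'F', (-1, -1): 'G'}
    return "".join(table.get((sign(a), sign(b)), '?') for a, b in zip(d1, d2))

print(episode_symbols([0, 1, 4, 9, 16, 16, 9, 4, 1, 0]))   # rising then falling trend
```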

Concluding all the above cited results, there is still a need for effective algorithms that can convert quantitative time series data into a qualitative trend representation and are able to find similarities in those trends.

1.3 Integrated application of process data,