
3.2.1 Similarity measure

Our algorithm uses a scoring matrix: we maximize the score of an alignment instead of minimizing the sum of transformation weights. In the following, the similarity measure si,j denotes the level of similarity between ai and bj. These score values were set based on heuristics and simple logic, but a useful attribute of our algorithm is that these values can be redefined to fit one's aims better. The predefined scores for the episodes are given by the following rules:

- si,j ∈ [0,10]: each score of an alignment lies between 0 and 10;

- A perfect match gains a score of 10;

- The score is zero for an alignment of an increasing episode {A,D,E} with a decreasing episode {B,C,F};

- Aligning an accelerating increasing/decreasing episode to a decelerating increasing/decreasing episode (e.g. A to D) gets a score of 5; aligning an accelerating or decelerating episode to a linearly increasing/decreasing one (e.g. B to F) gets a score of 7;

- Gaps have a scoring value of -0.1 (injecting a gap is slightly penalized);

- Every episode is similar to the steady episode G with a score of 2.
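The primitive scoring rules above can be condensed into a small lookup function. This is a sketch only: the letter-to-shape mapping in `EPISODES` is an assumption for illustration (following the episode definitions earlier in the chapter), and the names `EPISODES`, `GAP_PENALTY` and `primitive_score` are hypothetical.

```python
# Assumed letter-to-shape mapping; the authoritative definitions are the
# episode primitives introduced earlier in the chapter.
EPISODES = {
    'A': ('inc', 'accel'), 'D': ('inc', 'decel'), 'E': ('inc', 'linear'),
    'B': ('dec', 'accel'), 'C': ('dec', 'decel'), 'F': ('dec', 'linear'),
    'G': ('steady', 'steady'),
}
GAP_PENALTY = -0.1   # injecting a gap is slightly penalized

def primitive_score(p, q):
    """Score of aligning two primitive episodes according to the rules above."""
    if p == q:
        return 10.0                  # perfect match
    dir_p, shape_p = EPISODES[p]
    dir_q, shape_q = EPISODES[q]
    if 'steady' in (dir_p, dir_q):
        return 2.0                   # any episode vs. the steady episode G
    if dir_p != dir_q:
        return 0.0                   # increasing vs. decreasing
    if 'linear' in (shape_p, shape_q):
        return 7.0                   # accel/decel vs. linear, same direction
    return 5.0                       # accelerating vs. decelerating, same direction
```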

To handle the total set of 57 episodes, this similarity scoring matrix is extended with multipliers applied to the primitive scores:

- One category step in one fuzzy attribute applies a multiplier of 0.8 to the score (e.g. ss to sm);

- Two category steps in one fuzzy attribute, or one step in both attributes, applies a multiplier of 0.6 (e.g. ss to ls or to mm);

- Two category steps in one fuzzy attribute and one step in the other attribute applies a multiplier of 0.4 (e.g. ss to lm);

- Two category steps in both attributes applies a multiplier of 0.2 (e.g. ss to ll);

- Steady episodes are multiplied by 0.8 for one category step (e.g. sG to mG) and by 0.6 for two category steps (sG to lG).
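These multiplier rules can be sketched as a small helper. The encoding is an assumption made here for illustration: a shape episode carries a two-character attribute string (magnitude, duration), e.g. 'sm', while a steady (G) episode carries a single attribute character.

```python
def fuzzy_multiplier(a, b):
    """Multiplier from the fuzzy {s, m, l} attribute labels of two episodes.

    a and b are attribute strings, e.g. 'sm' (magnitude, duration) for a shape
    primitive, or a single character for a steady (G) episode -- a hypothetical
    encoding chosen for this sketch.
    """
    order = {'s': 0, 'm': 1, 'l': 2}
    steps = [abs(order[x] - order[y]) for x, y in zip(a, b)]
    if len(steps) == 1:                      # steady episodes: one attribute only
        return {0: 1.0, 1: 0.8, 2: 0.6}[steps[0]]
    total = sum(steps)
    if total == 0:
        return 1.0                           # identical fuzzy labels
    if total == 1:
        return 0.8                           # one step in one attribute
    if total == 2:
        return 0.6                           # two steps in one, or one in both
    if total == 3:
        return 0.4                           # two steps in one and one in the other
    return 0.2                               # two steps in both attributes
```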

The resulting local score represents the similarity between two aligned episodes.

A global score is calculated as the sum of the individual local scores, but first the local scores are weighted according to the type of the episodes in the alignment: a multiplier of 1/3 is applied if the duration/change of magnitude is short/small, and 2/3 if it is medium.

This weighting procedure tries to avoid perfect matches of small and short episodes producing a large global score, in which case two sequences would be considered similar even though only unimportant small features match perfectly.
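The weighted contribution of one aligned pair can be sketched as below. The weight of 1 for large/long labels is inferred from the worked examples in Table 3.1, and the function name is hypothetical.

```python
# Weight per fuzzy label; large/long is assumed unweighted (multiplier 1),
# consistent with the worked examples in Table 3.1.
WEIGHT = {'s': 1 / 3, 'm': 2 / 3, 'l': 1.0}

def weighted_addition(local_score, labels_a, labels_b):
    """Contribution of one local score to the global score.

    labels_a / labels_b are the fuzzy attribute strings of the two aligned
    episodes, e.g. 'sm' for a small, medium-duration episode.
    """
    w = 1.0
    for lab in labels_a + labels_b:
        w *= WEIGHT[lab]
    return local_score * w

# smA-smA from Table 3.1: 10 * (1/3 * 2/3)^2 = 0.494
print(round(weighted_addition(10.0, 'sm', 'sm'), 3))
```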

Table 3.1 shows some example calculations for local scores and how much these alignments count in the global score. It can be seen that the original scores for the primitive episode alignments are much larger than their final additions to the global score. The two are equal only if the thresholds are set to a very low level, so that every episode gets the label "ll" (large-long).

Note that in this non-metric similarity calculation, perfectly matching but small episode alignments can end up with smaller scores than more different but larger/longer ones, if only their additions to the global score are compared. This phenomenon is normal, as "bigger" episodes cut a larger part from the trend.

When comparing two global scores in a set of time series, the global score value is normalized by the length of the shorter sequence; thus scores are recalculated on a one-episode-long base for a "fair" comparison.

Alignment   Prim. score   Fuzzif. score   Addition to global score
smA-smA     10            10              10 · (1/3)² · (2/3)² = 0.494
smA-slA     10            10 · 0.8        10 · 0.8 · (1/3)² · (2/3) = 0.593
lmA-lmD     5             5               5 · (2/3)² = 2.222
lmA-lsD     5             5 · 0.8         5 · 0.8 · (2/3) · (1/3) = 0.889
lmA-sG      2             2 · 0.8         2 · 0.8 · (2/3) · (1/3) = 0.356

Table 3.1: Local alignment score examples and their addition to the global score.

3.2.2 Structure of the algorithm

The algorithm has two data preprocessing steps: (i) a widely known dimensionality reduction technique, Principal Component Analysis (PCA) with its dynamic extension, and (ii) Gaussian filtering, a one-parameter (σ) convolution kernel that eliminates high-frequency noise from one-dimensional signals (see Section A.2 in the Appendix).

The Gaussian filter, which was also originally suggested by Cheung and Stephanopoulos, has definite advantages over other advanced filtering techniques:

- The locations of extremum values do not change during filtering.

- Only one parameter, σ, controls the level of filtering and thus the result of the segmentation: larger values result in signals with fewer high-frequency features, i.e. shorter episode sequences.

- Users/operators can incorporate their a priori knowledge into the filtering strategy by tuning σ, deciding which features are the most important ones in the signal and which ones should vanish.

This data preprocessing step is indispensable for noisy industrial data. As the output of the algorithm depends on σ, it needs to be set appropriately for the specific problem.
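A minimal pure-Python sketch of such a Gaussian filter follows. The thesis implementation runs in MATLAB; kernel truncation at 3σ and replicated borders are choices made here for illustration only.

```python
import math

def gaussian_filter_1d(signal, sigma, truncate=3.0):
    """Smooth a 1-D signal with a Gaussian convolution kernel.

    Larger sigma removes more high-frequency features, which yields a shorter
    episode sequence after segmentation.
    """
    radius = max(1, int(truncate * sigma))
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]      # normalize so weights sum to 1

    out = []
    n = len(signal)
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)   # replicate the borders
            acc += w * signal[j]
        out.append(acc)
    return out
```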

The main steps are listed as follows:

1. Problem-specic selection and parametrization of algorithm features:

- Appropriate thresholds for {s,m,l} subepisode bounds;

- Dimensionality reduction by PCA (if needed);

- Gaussian filter: increasing the filter parameter σ makes high-frequency features vanish from the trend (resulting in shorter sequences);

2. Data pre-processing: (optional dimensionality reduction and/or filtering, and) episode segmentation;

3. Optimal alignment of two episode sequences based on maximizing a score;

4. Storing global score and normalized global score of alignment.
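Step 3 can be sketched as a Needleman-Wunsch-style dynamic program that maximizes the global score. This formulation is an assumption for illustration; the actual alignment in the thesis is performed by the MATLAB Bioinformatics Toolbox, and `align` and its parameters are hypothetical names.

```python
def align(seq_x, seq_y, score, gap=-0.1):
    """Global alignment of two episode sequences by score maximization.

    score(a, b) returns the local similarity of two episodes; every gap
    adds the small penalty `gap` (cf. the -0.1 gap score of the method).
    Returns the global score; for comparing different pairs, normalize it
    by the length of the shorter sequence.
    """
    n, m = len(seq_x), len(seq_y)
    # dp[i][j]: best score aligning seq_x[:i] with seq_y[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_x[i - 1], seq_y[j - 1]),  # align pair
                dp[i - 1][j] + gap,                                    # gap in y
                dp[i][j - 1] + gap,                                    # gap in x
            )
    return dp[n][m]
```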

The sequence of the possible projection, filtering and threshold-definition steps can be arbitrary: e.g. the user can filter each dimension of a multi-dimensional time series first, then project the filtered data into 1-D by PCA and finally define thresholds for the filtered 1-D signal; or define thresholds for each dimension, project the time series and thresholds into 1-D by PCA and filter the projected signal last, etc.

Application and efficiency highly depend on the parametrization step. Parameter setting itself needs to be managed by one of two different approaches, or a mixture of them: supervised or unsupervised methods. The latter is suggested if no a priori knowledge about the system exists but several measurements and their evaluations (e.g. class labels) can be extracted to train σ and the threshold parameters (as in Section 3.3.1 for the UCR time series). In the case of a supervised parameter setting, a system operator or expert can give information on important/unimportant features in a trend in order to set σ, and on what can be considered a large/long or small/short change in that trend in order to set the thresholds for the subepisode bounds. Moreover, unsupervised parameter training can serve as proof and evaluation for parameters extracted from expert knowledge.

Regarding the above mentioned steps, Figure 3.5 shows the structure and flowchart of the algorithm. The sequence of the preprocessing steps and the choice of the right approach to set the parameters are entirely up to the user; however, these tasks have to be done only once for a specific problem, i.e. process data type.

The algorithm is implemented in the MATLAB environment, a powerful experimental language for technical computations. For the sequence alignment algorithm we applied the MATLAB Bioinformatics Toolbox [167].

[Figure 3.5 flowchart: Data Inputs X and Y → Gaussian filtering of every dimension → 1-D projection by (P)PCA → fuzzy episode segmentation of the 1-D filtered data using the {s,m,l} bounds for magnitude and duration, with problem-specific global information ('σ' filter parameter(s), the projection vector of PCA, the (projected) bounds) → optimal sequence alignment → storing score and alignment.]

Figure 3.5: Structure and flowchart of the algorithm.

3.2.3 Visualization and clustering

Once time series similarity is expressed as the numeric value of a global alignment score, one is able to define a distance from the similarity. The transformation in Eq. 3.9 can be applied for that purpose, where SA,B denotes the resulting global score of an alignment and dA,B is the calculated distance.

dA,B = (10 − SA,B) / 10.   (3.9)
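Eq. 3.9 as code, for a global score already normalized into [0, 10] (the function name is hypothetical):

```python
def score_to_distance(global_score):
    """Map a normalized global alignment score (0..10) to a distance, Eq. 3.9."""
    return (10.0 - global_score) / 10.0

# A perfect alignment (score 10) is at distance 0; a score of 0 maps to distance 1.
```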

Based on this distance measure, a set of time series can be mapped into a lower dimensional subspace where they can easily be visualized and clustered.

We applied two such techniques in our work:

- Single/complete/average linkage. A hierarchical clustering method visualized in a dendrogram [168].

- Multidimensional Scaling (MDS). A metric/non-metric mapping method into 2 or 3 dimensions for simple visualization [169, 118].

These techniques are well documented in the literature, so they are not discussed here. Through these types of visualization, the user is able to judge graphically how far a currently analyzed time series lies from previously analyzed ones or from an optimal one (from an economic, safety, etc. point of view). Moreover, time series grouping becomes an easy task, but it needs to be emphasized that when using the Stephanopoulos episodes, trends are characterized by their shape and not by their numeric values. In other words, the method works on a different basis from other distance-measure-based classification techniques (like PLA-based DTW).