
The proposed algorithm, fuzzy triangular episode based sequence alignment, builds on the trend representation of Stephanopoulos and Wong and applies the pairwise sequence alignment developed by Needleman and Wunsch in order to measure time series similarity, and hence to qualitatively analyze and qualify historical time series data. This combined technique uses the shape of a trend for comparison and is able to qualify trends of unequal length or to suppress unimportant features by inserting gaps. A similarity measure is defined for the total set of 57 episode types, and the dynamic alignment technique is modified to handle these episodes instead of amino acid sequences. The algorithm is extended with principal component analysis [115, 165] and a filtering technique to handle multidimensional and noisy data as well.

3.1.1 Segmentation-based Symbolic Trend Representation

The key feature of trend analysis, and also the cornerstone of the algorithm, is an adequate trend representation. Cheung and Stephanopoulos [135] created a precise formal framework for trend analysis, hence their nomenclature is used here to describe triangular episodes. To move from a quantitative to a qualitative representation, a real-valued function x(t) has to fulfill the following properties over a closed time interval [a, b] (note that time is represented as a sequence of open intervals separated by time points):

1. x(t) is continuous over [a, b], or it has a finite number of discontinuities in its value or derivative;

2. x(t) has a finite number of extrema over [a, b];

3. x'(t) and x''(t) are continuous in (a, b), and at a and b they have one-sided limits.

The functions satisfying the above requirements are called reasonable functions. Clearly, all the physical variables in a plant operation are reasonable. If the value and the derivatives of a reasonable function are known, the state of that function is completely known. The continuous state (CS) over [a, b] can be defined at a point as a triplet (if x(t) is continuous at t):

$$CS(x, t) = \langle x(t),\, x'(t),\, x''(t) \rangle \tag{3.1}$$

If x(t) is discontinuous in t, then the left- and right-hand limits of the derivatives are used.
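For sampled data the derivatives in Eq. 3.1 are not available directly; a common workaround, sketched below under the assumption of a uniformly sampled signal, is to estimate them with central finite differences (the function name estimate_cs is illustrative, not part of the original framework).

import numpy as np

def estimate_cs(x, dt):
    """Finite-difference estimate of CS = <x, x', x''> for a uniformly
    sampled signal x with sampling interval dt (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    dx = np.gradient(x, dt)     # first-derivative estimate
    ddx = np.gradient(dx, dt)   # second-derivative estimate
    return x, dx, ddx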


Figure 3.1: An example of an episode (a) and the seven primitive episodes proposed by Cheung and Stephanopoulos (b).

Consequently, a continuous trend can be defined as a sequence of states over [a, b]. For discrete functions, an underlying continuous function has to be known as an approximation, since the derivatives cannot be computed from single points in [a, b]. These definitions lead to the qualitative description of a state (QS) and of a trend; QS is defined if x is continuous at t, otherwise it is undefined:

$$QS(x, t) = \langle [x(t)],\, [\partial x(t)],\, [\partial\partial x(t)] \rangle, \tag{3.2}$$

where [x(t)], [∂x(t)] and [∂∂x(t)] can take the values {−, 0, +}, depending on whether the corresponding quantity is negative, zero or positive. The qualitative trend of a reasonable variable is then given by the continuous sequence of qualitative states over [a, b]. When QS(x, t) is constant over a maximal time interval (the aggregation of time intervals with the same QS), that interval is called an episode (see Fig. 3.1(a)). The final definition of a trend of a reasonable function is a sequence of these maximal episodes.
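A minimal sketch of the qualitative mapping in Eq. 3.2: each sampled value of x, ∂x and ∂∂x is reduced to its sign, with a small tolerance eps standing in for "zero" (the tolerance and the function name are assumptions made for illustration).

def qualitative_state(x, dx, ddx, eps=1e-6):
    """Return QS(x, t) = <[x], [dx], [ddx]> with each entry in {'-', '0', '+'}."""
    def sign(v):
        if v > eps:
            return '+'
        if v < -eps:
            return '-'
        return '0'
    return sign(x), sign(dx), sign(ddx)

# An episode is then a maximal run of consecutive samples sharing the same qualitative state.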

An ordered sequence of triangular episodes is the geometric language used to describe trends. It is composed of seven primitives, denoted {A, B, C, D, E, F, G} and illustrated in Fig. 3.1(b).

Wong et al. fuzzified these seven primitives into a fuzzy set of 57 episodes with a fuzzy membership function [102]. Every episode is assigned the label {small (s), medium (m), large (l)} by duration and by magnitude with the highest membership value. In effect, this means that the episodes are hard partitioned by the values of the intersections of the membership functions, i.e. by exact threshold values. Figure 3.2 shows how nine episodes are created from one primitive.


Figure 3.2: Episodes partitioned by magnitude and duration.

For example, in the 'fuzzified' episode notation {smB}, s denotes a small change in magnitude, m a medium duration, and B the main primitive episode (A to F), in this case an increasingly decreasing episode. Episode {G} is an exception: since there is no sense in partitioning this episode by magnitude, it is only partitioned by duration.
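The size of the fuzzified alphabet follows directly from this notation: 3 magnitudes × 3 durations × 6 primitives (A–F), plus 3 durations for G, gives 57 episode types. A short enumeration, assuming the concatenated labels used above (e.g. 'smB', 'sG'):

from itertools import product

sizes = ['s', 'm', 'l']                        # small, medium, large
primitives = ['A', 'B', 'C', 'D', 'E', 'F']    # G is handled separately

episode_types = [m + d + p for m, d, p in product(sizes, sizes, primitives)]
episode_types += [d + 'G' for d in sizes]      # G is partitioned by duration only

assert len(episode_types) == 57                # 3*3*6 + 3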

Obviously, partitioning the original sequence into fuzzy subsequences results in a more adequate trend representation. As a result of the segmentation, the data are reduced to a much lower dimension, encoded as a sequence of triangular episodes.

Note that a possible way to incorporate operators' knowledge into the algorithm is to predefine the boundary thresholds used for partitioning the episodes.
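A sketch of this hard partitioning by predefined thresholds, using the boundary vector format that appears in the example of Section 3.1.3 ([(s/m)_dur, (m/l)_dur, (s/m)_magn, (m/l)_magn]); the function name and default values are illustrative assumptions.

def label_episode(duration, magnitude, primitive, bounds=(20, 50, 0.2, 0.4)):
    """Label one primitive episode as e.g. 'smB' using crisp thresholds
    (the intersections of the fuzzy membership functions)."""
    sm_dur, ml_dur, sm_mag, ml_mag = bounds

    def size(value, sm, ml):
        if value < sm:
            return 's'
        if value < ml:
            return 'm'
        return 'l'

    if primitive == 'G':                       # G: partitioned by duration only
        return size(duration, sm_dur, ml_dur) + 'G'
    return (size(abs(magnitude), sm_mag, ml_mag)
            + size(duration, sm_dur, ml_dur)
            + primitive)

# e.g. label_episode(duration=30, magnitude=0.1, primitive='B') -> 'smB'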

3.1.2 Symbolic Sequence Comparison, Pairwise Sequence Alignment

To efficiently align two trends expressed as chains of triangular-episode characters, a possible technique is pairwise sequence alignment. It is a standard technique in bioinformatics, where amino acid or nucleotide sequences are compared in order to see how far the newly evolved sequences are from their ancestors, i.e. how old they are, and how many mutation steps were needed to produce the new sequence.

Applying the principle of minimal evolution, one tries to find the least number of mutation steps between the ancestor and the offspring sequence. Naive algorithms compare all possible alignments and select one with the minimal sum of transformation weights.

Fast algorithms proceed differently [166]: let $A_n$ be an $n$-element sequence and $B_m$ an $m$-element sequence, with $a_i$ and $b_j$ their $i$-th and $j$-th elements (taken from the previously introduced episode set: {s,m,l}{s,m,l}{A,B,C,D,E,F} and {s,m,l}{G}); $\alpha(A_n, B_m)$ denotes the set of optimal pairwise alignments of $A_n$ and $B_m$, and $w(\alpha(A_n, B_m))$ the sum of transformation weights for such an optimal alignment.

For instance, $A_n$ = {smA.mlB.sG.llB.} and $B_m$ = {smA.mmD.lmB}, where $n = 4$ and $m = 3$.
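In the dot-separated notation above, the sequences can be split into their episode symbols with a one-liner; this tokenization is an assumption about the string format, shown only to make the indices $a_i$, $b_j$ concrete.

A_n = "smA.mlB.sG.llB."
B_m = "smA.mmD.lmB"

a = [tok for tok in A_n.split('.') if tok]   # ['smA', 'mlB', 'sG', 'llB'], n = 4
b = [tok for tok in B_m.split('.') if tok]   # ['smA', 'mmD', 'lmB'], m = 3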

The basic idea of the fast algorithms is that if we know $w(\alpha(A_{n-1}, B_m))$, $w(\alpha(A_n, B_{m-1}))$ and $w(\alpha(A_{n-1}, B_{m-1}))$, then $w(\alpha(A_n, B_m))$ can be calculated in constant time. If we leave out the last aligned pair of an optimal alignment of $A_n$ and $B_m$, then we get an optimal alignment of $(A_{n-1}, B_m)$, $(A_n, B_{m-1})$ or $(A_{n-1}, B_{m-1})$, depending on whether that last mutation step was a deletion, an insertion or a substitution, respectively:

$$w(\alpha(A_n, B_m)) = \min \begin{cases} w(\alpha^*(A_{n-1}, B_m)) + w(a_n \to -) \\ w(\alpha^*(A_n, B_{m-1})) + w(- \to b_m) \\ w(\alpha^*(A_{n-1}, B_{m-1})) + w(a_n \to b_m) \end{cases} \tag{3.3}$$

The optimal alignment weights are collected in a dynamic programming matrix D of size $(n+1) \times (m+1)$, which is filled up using the distances of the corresponding sequence elements, $d_{i,j} = w(a_i \to b_j)$ (note that $d_{i,j}$ is a simplified notation for $d(a_i, b_j)$). The initial conditions for the 0th row and column are:

$$d_{0,0} = 0; \tag{3.4}$$

$$d_{i,0} = \sum_{l=1}^{i} w(a_l \to -); \tag{3.5}$$

$$d_{0,j} = \sum_{k=1}^{j} w(- \to b_k); \tag{3.6}$$

The traceback of an optimal alignment starts at $d_{n,m}$ and ends at $d_{0,0}$; in every step the minimal weight (maximal score) is chosen, where stepping left corresponds to an insertion, stepping upwards to a deletion, and stepping diagonally to a substitution or a perfect match, in the manner of Eq. 3.3.
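A compact sketch of the dynamic programming fill and the traceback just described (Eqs. 3.3–3.6). The substitution weight w_sub and the gap weight w_gap are placeholders; the actual similarity measure over the 57 episode types is the one defined in Section 3.2.

def align(a, b, w_sub, w_gap=1.0):
    """Global pairwise alignment by dynamic programming, minimizing the
    total transformation weight (sketch; w_sub and w_gap are placeholders)."""
    n, m = len(a), len(b)
    # DP matrix D of size (n+1) x (m+1), initialized as in Eqs. 3.4-3.6
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + w_gap              # a_i aligned to a gap (deletion)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + w_gap              # gap aligned to b_j (insertion)
    # fill-up with the recurrence of Eq. 3.3
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + w_gap,
                          D[i][j - 1] + w_gap,
                          D[i - 1][j - 1] + w_sub(a[i - 1], b[j - 1]))
    # traceback from d_{n,m} back to d_{0,0}
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w_sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + w_gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1      # deletion
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1      # insertion
    return D[n][m], out_a[::-1], out_b[::-1]

# e.g. score, al1, al2 = align(a, b, w_sub=lambda x, y: 0.0 if x == y else 1.0)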

This method was developed by Needleman and Wunsch [133] and is called global pairwise sequence alignment. It penalizes gap extension as a simple sum of weights. Later algorithms use gap functions to penalize longer gaps in a sequence alignment. In our study, the original gap penalty is applied, because the episodes are independent of each other.
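For reference, the difference between the original linear gap penalty used here and the affine gap functions of later algorithms can be written down as follows; the constants are illustrative, not values from this work.

def linear_gap_cost(k, g=1.0):
    # original scheme: a gap of length k is a simple sum of k equal weights
    return k * g

def affine_gap_cost(k, g_open=2.0, g_ext=0.5):
    # later refinements: opening a gap is penalized more than extending it
    return 0.0 if k == 0 else g_open + (k - 1) * g_ext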

3.1.3 An illustrative example

The following example shows how the algorithm works on a synthetic pair of one-dimensional data sets. The two time series are outputs of the function sin(−x), x ∈ [0, 2π], where the first trend is vertically shifted by −0.5 and the second trend has two disturbances: a peak at π/2 and a superposed uncertainty on [2, 2π]. Figure 3.3 shows the original and the filtered-segmented trends. Both time series are sampled at intervals of 0.01, thus their length is 628 points (from 0 to 2π). For lack of space, we do not draw the triangles in the figure; only the episode boundaries are marked as vertical lines and the primitive episode characters are written on the figure. The following parameters were applied: (i) σ = 5 to smoothen the hard breakpoints of the peak; (ii) the vector of boundaries [(s/m)_dur, (m/l)_dur, (s/m)_magn, (m/l)_magn] is set to [20, 50, 0.2, 0.4], respectively. Note that the bounds of duration are given in data samples: 20 and 50 samples mean 0.2π and 0.5π on the time axis, as the underlying function is sampled at a rate of 0.01.
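The synthetic pair of signals and the smoothing step can be reproduced along these lines; the exact disturbance shapes, the noise level and the use of scipy's Gaussian filter are assumptions made for this sketch, not the original setup.

import numpy as np
from scipy.ndimage import gaussian_filter1d

t = np.arange(0, 2 * np.pi, 0.01)                    # sampling interval 0.01, ~628 points
trend1 = np.sin(-t) - 0.5                            # first trend: vertically shifted sine

trend2 = np.sin(-t)
trend2[np.abs(t - np.pi / 2) < 0.05] += 0.5          # assumed peak shape around pi/2
mask = t >= 2
trend2[mask] += 0.05 * np.random.randn(mask.sum())   # assumed noise level on [2, 2*pi]

trend2_filtered = gaussian_filter1d(trend2, sigma=5)  # sigma = 5, as in the example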

The optimal alignment of these trends at the level of the main episodes can be seen in Eq. 3.7 ('|' denotes a match, ':' a mutation). Eq. 3.8 shows the alignment with the fuzzy attributes (Fig. 3.2) as well.

BCDABCDABCDABC
||   :|| :  ||
BC---GDA-G--BC

(3.7)

ssB.llC.ssD.mmA.mmB.ssC.llD.llA.mmB.mmC.ssD.ssA.smB.ssC.
|   |           :       |   |           :           |   |
ssB.llC.gap.gap..sG.gap.llD.llA.gap..sG.gap.gap.smB.ssC.

(3.8)

As can be seen from the optimal alignment in Eq. 3.7, the two trends do not seem similar: the global score is 63.4 for the primitive episodes, i.e. 45.3% of the maximum of 140.


Figure 3.3: Two synthetic time series (a) and their transformed signals after filtering and segmentation (b).


Figure 3.4: Optimal alignment path of the two episode sequences.

The alignment of the sequences considering the fuzzy attributes in Eq. 3.8, however, shows that the significant episodes are identical, and only small episodes differ or cause gaps: out of the maximum of 39.01, the global score is 31.01 (79.5%). This reflects the phenomenon that the maximal alignment value drops significantly when small/short or medium episodes are present in the sequence. Definitions of the score values and further examples can be found in Section 3.2. Consequently, the longer sequence has some small, unimportant features, as can be seen in Figure 3.3, which could be captured correctly only by applying fuzzified episode segments. Such spot analysis of trends can easily be performed on any personal computer, as the algorithm does not have a large computational demand.
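The relative scores quoted above are simple ratios of the achieved global score to the quoted maximum; as a quick check of the percentages:

primitive_score, primitive_max = 63.4, 140.0
fuzzy_score, fuzzy_max = 31.01, 39.01

print(round(100 * primitive_score / primitive_max, 1))  # 45.3 %
print(round(100 * fuzzy_score / fuzzy_max, 1))          # 79.5 %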