
The popularity of knowledge discovery and data mining tasks for discrete data has indicated the growing need for similarly efficient methods for time series databases.

Over the last decades, a wealth of time series-specialized methods has been introduced for the general data mining tasks listed below.

• Classification: Assigning data to known classes. For example, a character written on the touch screen of a mobile phone is compared to stored templates to decide which character it is.

• Clustering: Discovery of groups of data that are similar to each other. For example, clusters can be identified in sales history data which can lead to refined product placement.

• Outlier detection: Identification of uncommon data that can be interesting for further investigation. For example, unusual heartbeats in ECG data can be identified.

• Associative rule mining: Detecting internal relationships inside the dataset. For example, it can be determined that a specific behavior of a variable forecasts a faulty end product during production.

These tasks share a common requirement: a (dis)similarity measure has to be defined between the elements of a given database. Moreover, from simple clustering and classification to complex decision-making systems, the results of a data mining application are highly dependent on the applied (dis)similarity measure, as illustrated in Figure 1.1.
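As a minimal illustration of how the chosen dissimilarity measure drives a data mining task, the following sketch classifies a short series by its nearest labeled template under the Euclidean distance. The template data and labels are made up for illustration and are not taken from the thesis:

```python
import numpy as np

def euclidean(x, y):
    """Pointwise Euclidean dissimilarity between equal-length series."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def nearest_neighbor_label(query, templates):
    """1-NN classification: pick the label whose template is least dissimilar."""
    return min(templates, key=lambda label: euclidean(query, templates[label]))

# Hypothetical stored templates, e.g. pen strokes sampled at 5 points.
templates = {"up": [0, 1, 2, 3, 4], "flat": [2, 2, 2, 2, 2]}
print(nearest_neighbor_label([0, 1, 1, 3, 4], templates))  # -> up
```

Swapping `euclidean` for another measure (e.g. an elastic one) changes the classification result without touching the rest of the pipeline, which is exactly why the choice of measure matters.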

As can be seen, the selected similarity measure plays a key role in data mining.

With the flexibility provided by elastic similarity measures (see Section 2.4) and the solutions to their run-time issues [20, 18], it may seem that there is no more room for development in this area. Yet, this is far from reality.

[Figure 1.1. Common tasks in data mining and their relation to similarity: classification, clustering, outlier detection, and associative rule mining all build on similarity.]

The main advantage provided by the current elastic similarity measures used for time series data mining is that they are capable of relaxing the temporal axis.

Can this property always be enough? I showed in my M.Sc. thesis [21] that the answer is no. It was empirically proven that the current (dis)similarity measures, which were developed with static univariate time series in mind, are not adequate for other common sequences, such as multivariate time series with a complex correlation structure or time series of dynamic processes.

Facing these limitations, and building upon my experience with the most popular dissimilarity measures during my daily development engineering work, motivated me to reconsider the current elastic measures. This led to the introduction of two novel dissimilarity measures and a new way of avoiding pathological warpings in the case of dynamic time warping. Moreover, the result of this reconsideration is not only this thesis but also the improved performance of a state-of-the-art, market-leading ultrasonic-based blind spot monitoring system [22].

The work described by this thesis is presented in the following structure:

Chapter 2 – Background gives a short overview of the basics of time series data mining. Time series representations and segmentation problems are discussed first, and then the most popular elastic dissimilarity measure, dynamic time warping, is reviewed in detail.

Chapter 3 – Correlation-based dynamic time warping discusses the first novel dissimilarity measure I defined for multivariate time series, combining principal component analysis-based segmentation and dynamic time warping. I proved that the presented algorithm provides better results than the currently used dissimilarity measures in the case of multivariate time series with a complex correlation structure.

Chapter 4 – Process dynamics-aware segmentation introduces a new segmentation method, which enables the segmentation of multivariate time series according to changes in process dynamics. This way, the modified version of the algorithm presented in the previous chapter can compare and data mine multivariate time series based on process dynamics.

Chapter 5 – Automatic constraining of dynamic time warping shows how correlation-based dynamic time warping inspired another novel similarity measure for time series segmented using piecewise linear approximation. The presented similarity measure can be combined with the classical, mean-based similarity measure to achieve more accurate results than those of the existing methods. Moreover, using this combined similarity measure, I empirically proved that similarity measures considering multiple features shorten the warping path, thus reducing the possibility of pathological warpings and enabling the omission of global constraints.

Chapters 6 and 7 – Summary and Összegzés summarize the three theses presented throughout Chapters 3 to 5, in English and in Hungarian, respectively.

Chapter 8 – Publications related to the thesis lists all the publications of the author that are related to the thesis.

Chapter 2

Background

2.1 Basic definitions and notation

Let $X_n$ denote an $n$-variable, $m$-element time series, where $x_i$ is the $i$th variable and $x_i(j)$ denotes its $j$th element:

$$X_n = [x_1, x_2, x_3, \ldots, x_n], \quad x_i = [x_i(1), x_i(2), \ldots, x_i(j), \ldots, x_i(m)]^T \tag{2.1}$$

According to this notation, a multivariate time series can be represented by a matrix in which each column corresponds to a variable and each row represents a sample of the multivariate time series at a given time:

$$X_n = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \tag{2.2}$$
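The matrix representation above can be sketched in a few lines of Python; the variable values below are made up for illustration:

```python
import numpy as np

# A 2-variable, 4-element multivariate time series X_n:
# each column is one variable x_i, each row is one time sample.
x1 = [1.0, 2.0, 3.0, 4.0]        # first variable, m = 4 elements
x2 = [0.5, 0.4, 0.3, 0.2]        # second variable
Xn = np.column_stack([x1, x2])   # shape (m, n) = (4, 2)

print(Xn[2, 0])  # x_1(3): the 3rd element of the 1st variable -> 3.0
```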

The dissimilarity between $X_n$ and $Y_n$ represents how strongly $X_n$ and $Y_n$ resemble each other and assigns a non-negative real number to show this relation:

$$d(X_n, Y_n): T \times T \to \mathbb{R}^+_0, \tag{2.3}$$

where $T$ denotes the domain of the time series, with $d(X_n, Y_n) = d(Y_n, X_n)$, $0 \le d(X_n, Y_n)$, and $d(X_n, X_n) = 0$.
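The three properties of a dissimilarity measure (symmetry, non-negativity, and zero self-dissimilarity) can be checked directly for a simple concrete choice of $d$. The Frobenius-norm distance used here is my own illustrative example, not a measure proposed in the thesis:

```python
import numpy as np

def d(X, Y):
    """A simple dissimilarity: Frobenius distance between equal-shape series."""
    return float(np.linalg.norm(np.asarray(X) - np.asarray(Y)))

# Two hypothetical 2-variable, 3-element time series.
X = np.array([[1.0, 0.5], [2.0, 0.4], [3.0, 0.3]])
Y = np.array([[1.5, 0.5], [2.5, 0.6], [2.0, 0.1]])

assert d(X, Y) == d(Y, X)   # symmetry
assert d(X, Y) >= 0.0       # non-negativity
assert d(X, X) == 0.0       # d(X, X) = 0
```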

Similarity of time series can also be defined for an arbitrarily selected finite value of $k$:

$$s(X_n, Y_n): T \times T \to \mathbb{R}^k_0, \tag{2.4}$$

where $k \ne +\infty$, $s(X_n, Y_n) = s(Y_n, X_n)$, $0 \le s(X_n, Y_n) \le k$, and $s(X_n, X_n) = k$. In such cases, the dissimilarity of two multivariate time series can be derived from similarity:

$$d(X_n, Y_n) = k - s(X_n, Y_n) \tag{2.5}$$

A dissimilarity measure is considered a distance (metric) if it satisfies the triangular inequality, i.e.:

$$d(X_n, Y_n) + d(Y_n, Z_n) \ge d(X_n, Z_n) \tag{2.6}$$
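A minimal sketch of these definitions, assuming a bounded similarity built from the Euclidean distance (the particular construction $s = k\,e^{-d}$ is my own illustration, not a measure from the thesis): the derived dissimilarity vanishes on identical series, and the underlying Euclidean distance satisfies the triangular inequality.

```python
import numpy as np

K = 1.0  # an arbitrarily selected finite value of k

def d_euclid(X, Y):
    """Euclidean distance between equal-length series (a metric)."""
    return float(np.linalg.norm(np.asarray(X) - np.asarray(Y)))

def s(X, Y):
    """A bounded similarity in [0, K] with s(X, X) = K (illustrative choice)."""
    return K * float(np.exp(-d_euclid(X, Y)))

def d_from_s(X, Y):
    """Dissimilarity derived from similarity: d = k - s."""
    return K - s(X, Y)

X = np.array([0.0, 1.0])
Y = np.array([1.0, 1.0])
Z = np.array([3.0, 2.0])

assert abs(d_from_s(X, X)) < 1e-12                          # d(X, X) = 0
assert d_euclid(X, Y) + d_euclid(Y, Z) >= d_euclid(X, Z)    # triangular inequality
```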