

1.2.2 Information extraction from process data: Knowledge

Integration of heterogeneous data sources is strongly connected to knowledge discovery and data mining [75, 76]. One of its main purposes is to store data in a logically constructed way, such that deeper information and knowledge can be extracted through data analysis. Knowledge discovery in databases (KDD) is a well-known iterative process in the literature, involving several steps that interactively take the user along the path from data source to knowledge [77].

Figure 1.6 shows the KDD process and its connection to the process development scheme: KDD can be considered the analysis step of the technology improvement process. In the following, we go through the steps of KDD, highlighting the presence of "data mining in chemical engineering" (note that although data mining is a particular step of KDD, it is often treated as an independent technique).

1. Data selection. Developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process.

2. Data pre-processing. This step deals with data filtering and data reconciliation.

[Figure: on the left, the KDD chain Data → Selection → Preprocessing → Transformation → Data mining → Interpretation → Knowledge; on the right, the development loop connecting Data, Analysis, Knowledge, and Technology.]

Figure 1.6: Knowledge Discovery in Databases process (left) and the data-driven process development scheme (right).

3. Data transformation. Finding useful features to represent the data, depending on the goal of the task.

4. Data mining. The extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data.

5. Interpretation of the mined patterns, i.e. the discovered knowledge about the system or process. The interpretation depends on the chosen data mining representation.
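To make these steps concrete, the following minimal sketch assembles them into a single pipeline. The file name, tag names, smoothing window, and the choice of k-means as the mining method are illustrative assumptions, not part of the KDD formulation itself.

```python
# A minimal KDD pipeline sketch (file and column names are assumptions).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Data selection: load the raw historian export and keep the relevant tags.
raw = pd.read_csv("process_data.csv", parse_dates=["timestamp"])
selected = raw[["timestamp", "temperature", "pressure", "flow_rate"]]

# 2. Data pre-processing: drop incomplete records and smooth measurement noise.
clean = selected.dropna().set_index("timestamp")
clean = clean.rolling("5min").mean().dropna()

# 3. Data transformation: scale each variable to zero mean and unit variance.
X = StandardScaler().fit_transform(clean.values)

# 4. Data mining: group the observations into candidate operating regimes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 5. Interpretation: summarize each regime to support expert evaluation.
clean["regime"] = labels
print(clean.groupby("regime").mean())
```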

Data selection, pre-processing, and transformation activities are often referred to as the data preparation step. A wealth of approaches has been used to solve the feature selection problem, such as principal component analysis [78], Walsh analysis [79], neural networks [80], kernels [81], rough set theory [82, 83], neuro-fuzzy schemes [84], fuzzy clustering [85], self-organizing maps [86], hill climbing [87], branch-and-bound algorithms [88], and stochastic algorithms like simulated annealing and genetic algorithms (GAs) [89, 90].
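As one example from this list, principal component analysis can serve as a simple feature extraction step. The sketch below is a minimal illustration; the synthetic data matrix and the retained-variance threshold of 95% are assumptions.

```python
# PCA-based feature extraction sketch: project correlated process variables
# onto a few principal components (the 95% variance threshold is an assumption).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))        # two latent driving factors
X = base @ rng.normal(size=(2, 8))      # eight correlated measurements
X += 0.05 * rng.normal(size=X.shape)    # measurement noise

pca = PCA(n_components=0.95)            # keep 95% of the variance
features = pca.fit_transform(X)
print(features.shape, pca.explained_variance_ratio_)
```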

Process data have several undesirable attributes which need to be handled before any analysis can take place: they may be time-dependent, multi-scale, noisy, variant, and incomplete at the same time. All these problems need to be solved in the data preparation steps, which therefore take the largest part, approximately 60% of the effort, of the whole KDD process. For industrial data reconciliation, OSIsoft and Invensys have developed packages such as Sigmafine and DATACON [91, 92].
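The undesirable attributes listed above can each be addressed with standard preparation operations; the sketch below shows one hedged way to do so, where the tag values, the one-minute target rate, and the gap and window limits are illustrative assumptions.

```python
# Data preparation sketch for irregular, gappy, noisy process tags
# (the sample values and all rates/limits are illustrative assumptions).
import pandas as pd

def prepare(series: pd.Series) -> pd.Series:
    """Bring a raw, irregularly sampled process tag onto a common time base."""
    # Multi-scale / time-dependent: resample to one sample per minute.
    regular = series.resample("1min").mean()
    # Incomplete: bridge short gaps by interpolating up to five missing samples.
    filled = regular.interpolate(limit=5)
    # Noisy: suppress spikes with a rolling median filter.
    return filled.rolling(5, center=True, min_periods=1).median()

idx = pd.date_range("2024-01-01", periods=10, freq="20s")
raw = pd.Series([1.0, 1.1, None, 9.0, 1.2, 1.1, None, 1.0, 1.3, 1.2], index=idx)
print(prepare(raw))
```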

In the data mining step, the goals are achieved by various methods:

- Clustering. A cluster is a group of objects that are more similar to one another than to members of other clusters. The term "similarity" should be understood as mathematical similarity, measured in some well-defined sense. Clustering is widely used as a feature selection [86] and feature extraction method, applied in operating regime detection [93, 94], fault detection [95, 96], and system identification tasks such as model order selection [97, 98, 99] and state space reconstruction [100] (a minimal sketch of regime detection follows this list).

- Segmentation. Time series segmentation means finding time intervals where the trajectory of a state variable is homogeneous. In order to formalize this goal, a cost function based on the internal homogeneity of the individual segments is defined. The linear, steady-state, or transient segments can be indicative of normal, transient, or abnormal operation; hence segmentation-based feature extraction is a widely known technique for fault diagnosis, anomaly detection, and process monitoring or decision support [101, 102, 103, 104]. A more detailed description and illustration can be found in Section 1.2.3.

- Classification. Mapping the data into labelled subsets, i.e. classes, which are characterized by a specific attribute called the class attribute. The goal is to induce a model that can be used to discriminate new data into classes according to the class attributes. In chemical engineering, classification is used in fault detection and anomaly detection problems [84, 102, 104, 105, 106, 107].

- Regression. The purpose of regression is to give predictions for process (so-called dependent) variables based on existing data (independent variables). In other words, regression learns a function which maps a data item to a real-valued prediction variable, discovering functional relationships between variables [108, 109]. Uses of regression include curve fitting, prediction (forecasting), modelling of causal relationships, and testing scientific hypotheses about relationships between variables. It is applied mainly in system identification problems, see e.g. [110].
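To illustrate the clustering use mentioned in the list above (operating regime detection), the following sketch fits a Gaussian mixture to two synthetic process variables; the data and the choice of a mixture model are assumptions made for the example, not the specific methods of the cited works.

```python
# Clustering sketch for operating regime detection: a Gaussian mixture
# separates two synthetic regimes (data and model choice are assumptions).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
low_load = rng.normal([350.0, 2.0], [3.0, 0.1], size=(150, 2))   # (T, P) regime 1
high_load = rng.normal([380.0, 3.5], [3.0, 0.1], size=(150, 2))  # (T, P) regime 2
X = np.vstack([low_load, high_load])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
regimes = gm.predict(X)
print("regime means:\n", gm.means_)
```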

Representation of the patterns of interest, i.e. the output of data mining, can take various forms, such as regression models [94, 84, 102, 105], association rules [111], or decision trees [104, 106].
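As an illustration of the decision-tree representation, the sketch below induces a small tree on synthetic labelled data and prints it as readable rules; the two variables, the safe operating window, and the fault labels are invented for the example.

```python
# Decision-tree sketch: mined patterns represented as readable rules
# (the two features and the fault labels are invented for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.uniform([300.0, 1.0], [400.0, 4.0], size=(300, 2))  # temperature, pressure
# Label a record as faulty when it leaves an assumed safe operating window.
y = ((X[:, 0] > 380.0) | (X[:, 1] > 3.5)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["temperature", "pressure"]))
```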

As the last step of the knowledge discovery process, the mined patterns, i.e. the discovered knowledge about the system or process, need to be interpreted. The interpretation depends on the chosen data mining representation.

Exploratory Data Analysis (EDA) deals with the visualization of the mined patterns. Although it is often presented as an independent analysis technique, it can be considered a special application of the KDD process, where the knowledge is presented by the information embedded into several types of visualization tools. It focuses on a variety of mostly graphical techniques to maximize insight into a data set.

The seminal work in EDA was written by Tukey [112]. Over the years the field has benefited from other noteworthy publications, such as Data Analysis and Regression by Mosteller and Tukey [113] and the book of Velleman and Hoaglin [114]. The data preprocessing step in EDA relies on several projection methods in order to visualize high-dimensional data as well: principal component analysis (PCA) [115], Sammon mapping [116], projection to latent structures (PLS) [117], multidimensional scaling (MDS) [118], and the self-organizing map (SOM) [119] are applicable, of which PCA, MDS, and SOM are applied in the later chapters. Data mining methods also use these techniques, but in EDA the projection is used for visualization purposes.
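Of the projection techniques listed, the sketch below applies multidimensional scaling to map high-dimensional observations into two dimensions for plotting; the synthetic ten-dimensional data are an assumption.

```python
# MDS projection sketch: embed high-dimensional observations in 2-D for plotting
# (the synthetic 10-dimensional data are an assumption).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))                       # 100 points in 10 dimensions
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)                               # (100, 2), ready to scatter-plot
```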

First, to give a short introduction to the plots of this area, the definition of a quantile is needed: it means the fraction (or percent) of points below a given value. That is, the 0.25 (or 25%) quantile is the point below which 25% of the data fall and above which 75% fall. The 0.25 quantile is called the first quartile, the 0.5 quantile the second quartile (or median), and the 0.75 quantile the third quartile. A quantile plot serves as a good indication of the cumulative probability function of a given time series by visualizing the distribution of the data set.
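A short numerical check of these definitions (the sample values are arbitrary):

```python
# Quartile sketch: the three quartiles of an arbitrary sample.
import numpy as np

data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0, 15.0, 20.0])
q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])
print(q1, median, q3)   # 25%, 50%, and 75% of the data fall below these values
```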

A quantile-quantile plot (q-q plot) is a plot of the quantiles of a first data set against the quantiles of a second data set, so it can serve as visual evidence of a connection between two data sets. The basic idea is that if two variables have similar distributions, their tail behaviours are similar as well; thus q-q plots are applicable to identifying connections between them. Both axes are in the units of their respective data sets, i.e. the actual quantile level is not plotted: for a given point on the q-q plot, we know that the quantile level is the same for both data sets, but not what that level actually is. If the two sets come from populations with the same distribution, the points should fall approximately along the reference line. Based on q-q plots, connections between operating cost, energy consumption, and process variables can be detected.
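A q-q plot can be built directly from matched quantile levels; the sketch below compares two synthetic data sets (the data, and the interpretation of them as energy consumption and operating cost, are assumptions) against a simple end-to-end reference line.

```python
# Q-Q plot sketch: quantiles of one synthetic data set against another.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
a = rng.normal(10.0, 2.0, size=500)          # e.g. energy consumption
b = rng.normal(50.0, 5.0, size=400)          # e.g. operating cost

levels = np.linspace(0.01, 0.99, 99)         # common quantile levels (not plotted)
qa, qb = np.quantile(a, levels), np.quantile(b, levels)

plt.plot(qa, qb, "o", markersize=3)
# Reference line through the extreme plotted quantiles.
plt.plot(qa, qb[0] + (qa - qa[0]) * (qb[-1] - qb[0]) / (qa[-1] - qa[0]), "r--")
plt.xlabel("quantiles of data set A")
plt.ylabel("quantiles of data set B")
plt.show()
```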

Considering the three quartiles and the minimum and maximum of a given data set, Tukey's five-number summary can be collected and visualized in a box plot, giving a very compact and informative graphical summary of the given set (for example, of production).
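The five-number summary and its box-plot visualization can be produced directly; the production figures below are invented for the example.

```python
# Five-number summary and box-plot sketch (production figures are invented).
import numpy as np
import matplotlib.pyplot as plt

production = np.array([96.0, 98.5, 99.0, 100.2, 101.1, 101.8, 103.0, 104.5, 107.9])
summary = np.quantile(production, [0.0, 0.25, 0.5, 0.75, 1.0])
print("min, Q1, median, Q3, max:", summary)

plt.boxplot(production, vert=False)
plt.xlabel("production")
plt.show()
```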

Besides the above plots, EDA techniques have a wide spectrum, including plots of raw data (histograms, probability plots, block plots), basic statistics, and advanced multidimensional plots (scatterplot matrices, radar plots, bubble charts, coded maps, etc.).

The most common software for EDA is MS Excel with its numerous non-commercial add-ins, but there are several dedicated products on the market as well: IBM's DB2 Intelligent Miner (which is no longer supported), MathWorks' MATLAB Statistics Toolbox [120], and the open-source WEKA developed by the University of Waikato [121].

Note that most EDA techniques are only a guide that helps the expert understand the underlying structure of the data in a visual form. Hence their main application is process monitoring [122, 123], but these tools are also used for system identification [124], ensuring consistent production [125], and product design [126].

As can be seen from the numerous citations, solutions based on the KDD process have proven extremely useful in solving chemical engineering tasks as well, showing that instead of simple queries of data, potential profit can be realized using the knowledge gained by data analysis. The mined and discovered knowledge about the system or process is fed back to the beginning of the process to support continuous development (see Figure 1.6).

1.2.3 Overview of recent advances in time series segmentation