
Introduction to process data mining

In modern production systems, technological databases store huge amounts of production data. Such technological data is collected and generated by automatic logging and by the operating personnel. It contains a great deal of information about normal and abnormal operations and about significant changes in operational and control strategies.

Therefore, technological databases are potentially useful information sources for operators and engineers in process monitoring and performance analysis.

Traditional analyses of technological databases involve human experts who manually analyze the data, or specially coded applications that search for patterns of interest. Traditional database management systems (DBMS, e.g. MS ACCESS) and analyzer tools (EXCEL) can help in process monitoring and analysis, but most of them are inefficient when data come from a distributed, heterogeneous and complex system (such as a chemical production factory). Besides the structure and the data sources, the characteristics of process operational data must also be considered.

Typical characteristics are the following:

• Large volume of data

• High dimensionality

• Process uncertainty and noise

• Dynamics

• Difference in sampling time of variables

• Incomplete data

Figure 1.1: Facilities in the analysis of a production process (a uniform data structure built from laboratory and automatic measurements feeds front-end tools for process monitoring over a long time horizon of years, quality management, comparisons of products and parameters, discovering relationships, automatic documentation and reports, and data mining for technology development)

• Small and stale data

• Complex interactions between process variables

• Redundant measurements.
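For illustration, a minimal sketch of how two of these characteristics, differing sampling times and incomplete data, might be handled in practice. The sensor names and values are invented for the example, and the sketch assumes the pandas library is available:

```python
import pandas as pd

# Hypothetical sensor logs sampled at different rates: a fast temperature
# sensor (10 min) and a slowly sampled laboratory concentration (30 min).
fast = pd.Series([20.0, 20.5, 21.0, 21.4, 21.9, 22.3, 22.6],
                 index=pd.date_range("2024-01-01", periods=7, freq="10min"),
                 name="temperature")
slow = pd.Series([1.10, 1.25, 1.31],
                 index=pd.date_range("2024-01-01", periods=3, freq="30min"),
                 name="concentration")

# Align both variables on a uniform 10-minute grid; the gaps in the slowly
# sampled variable are then filled by linear interpolation.
uniform = pd.concat([fast, slow], axis=1).resample("10min").mean()
uniform["concentration"] = uniform["concentration"].interpolate(method="linear")

print(uniform.shape)                      # (7, 2)
print(int(uniform.isna().sum().sum()))    # 0: no missing values remain
```

Real process databases would of course need more careful treatment of gaps (e.g. sensor outages longer than the interpolation horizon), but the idea of resampling onto a uniform grid is the same.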

As the volume and complexity of process data continue to grow, general approaches are no longer adequate. New data integration techniques and analyzer tools are needed to satisfy all these demands.

In [105] a literature survey is presented on intelligent systems for monitoring, control, and diagnosis of process systems. Fig. 1.1 shows a new framework to discover and use information related to a technological process. The key of the method is a data pre-processing step in which a uniform data structure is proposed to accommodate heterogeneous data sources (laboratory data, automatic measurements, parameters, technological boundaries, etc.). Based on this structure, process monitoring, documentation and comparison of productions (over a long time horizon) can be the main functions of the analyzer front-end tools. Moreover, data mining algorithms can also be applied to discover relationships between variables, events and productions. This can support technology development and quality management. [18] demonstrated that data mining can provide a powerful tool for process

Figure 1.2: The Knowledge Discovery in Databases (KDD) process (selection, preprocessing, transformation, data mining, and interpretation take the user from data through target, preprocessed, and transformed data to patterns and, finally, knowledge)

Knowledge discovery in databases

Data mining, sometimes known as Knowledge Discovery in Databases (KDD), gives users tools to sift through vast data stores to learn and recognize patterns, make classifications, verify hypotheses, and detect anomalies. These findings can highlight previously undetected correlations, influence strategic decision-making, and identify new hypotheses that warrant further investigation. A typical KDD process involves several steps that take the user along the path from data source to knowledge (Fig. 1.2). KDD is an iterative process where each of the steps may be repeated several times:

• Data selection, preprocessing and transformation activities compose what is often referred to as the data preparation step.

• The next step, data mining, is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. It is an information-processing method that can answer questions we could not even consider or define precisely in advance.

Frequently, data mining is the most emphasized step of KDD, but for a successful discovery project all the steps are significant, because the selection and preparation of the data have a decisive influence on the set of discoverable and adequate knowledge. The effort required for the various steps is illustrated in Fig. 1.3.
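As a toy illustration of these steps, the sketch below walks hypothetical batch records through selection, preprocessing/transformation and a very simple mining step. All names and numbers are invented for the example; real KDD projects would use far richer data and methods:

```python
from statistics import mean, stdev

# Hypothetical batch records: (batch_id, temperature, yield_pct, valid_flag).
records = [
    ("B1", 352.0, 91.2, True), ("B2", 348.5, 88.7, True),
    ("B3", 990.0, 12.0, False),  # logging error, excluded during selection
    ("B4", 355.1, 92.8, True), ("B5", 350.3, 90.1, True),
]

# Selection: keep only the validated records (the target data).
target = [r for r in records if r[3]]

# Preprocessing/transformation: standardize each variable.
def zscore(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

temps = zscore([r[1] for r in target])
yields = zscore([r[2] for r in target])

# Data mining: estimate the linear correlation between the two variables,
# a very simple stand-in for pattern extraction.
n = len(temps)
corr = sum(t * y for t, y in zip(temps, yields)) / (n - 1)

# Interpretation: a human-readable statement of the discovered pattern.
print(f"temperature-yield correlation: {corr:.2f}")
```

Note how the selection step decides what can be discovered at all: had the invalid batch B3 been kept, the correlation estimate would be dominated by a logging error rather than by process behavior.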

The results of a successful KDD activity include not only the identification of structural patterns in data (recognition), but also descriptions of the patterns (learning) that can impart knowledge, not just yield predictions [109].

Figure 1.3: Required effort for the main steps of the KDD process (objectives, data preparation, data mining, interpretation; effort in %)

Figure 1.4: From volume to value (data, information, knowledge and decisions, combined with technological knowledge, as data volume is condensed into value)

The integration of the knowledge obtained during a KDD process with existing technological knowledge can help in decision situations (Fig. 1.4) to solve problems and improve the performance of the production. Different kinds of decision-making activities are represented in Fig. 1.5 [98]. Data mining models and methods can support decisions, primarily at the strategic, logistics-planning and supervisory control levels.

DM tools can be categorized in different ways. A possible categorization according to functions and application purposes is represented in Fig. 1.6. The most frequently used techniques are summarized as follows:

• Classification: learning a function that maps (classifies) a data item into one of several predefined classes.

Figure 1.5: Levels, time scales, and application scopes of decision making activities

Figure 1.6: Data mining models and methods (predictive modeling, sequential pattern discovery, similar time sequence discovery, deviation detection, visualization, statistics)

• Regression: learning a function that maps a data item to a real-valued prediction variable and the discovery of functional relationships between variables.

• Clustering: identifying a finite set of categories or clusters to describe the data. Closely related to clustering is probability density estimation, which consists of techniques for estimating the joint multivariate probability density function of all of the variables/fields in the database.

• Summarization: finding a compact description of a subset of data, e.g. the derivation of summary or association rules and the use of multivariate visualization techniques.

• Dependency Modeling: finding a model which describes significant dependencies between variables (e.g. learning of belief networks).

• Change and Deviation Detection: discovering the most significant changes in the data from previously measured or normative values.
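As a concrete illustration of the clustering task listed above, a minimal one-dimensional k-means sketch on hypothetical reactor temperatures (all data and the function name are invented for the example):

```python
# 1-D k-means: alternate between assigning each value to its nearest
# center and moving each center to the mean of its assigned values.
def kmeans_1d(values, centers, iterations=20):
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group
        # (empty groups keep their previous center).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Reactor temperatures drawn from two hypothetical operating regimes.
temps = [349.8, 350.2, 350.5, 349.9, 371.1, 370.6, 370.9, 371.4]
print([round(c, 1) for c in kmeans_1d(temps, centers=[340.0, 380.0])])
# → [350.1, 371.0]: the two operating regimes are recovered
```

The same idea extends to many variables at once by replacing the absolute difference with a multivariate distance.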
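The change and deviation detection task can likewise be sketched with a simple three-sigma control-chart rule on hypothetical pressure readings (the data and the choice of reference window are invented for the example):

```python
from statistics import mean, stdev

# Hypothetical reactor pressure readings; the last values drift away
# from the normative operating region.
pressure = [2.01, 1.98, 2.03, 2.00, 1.97, 2.02, 1.99, 2.01, 2.45, 2.52]

# Establish normative behavior from a reference window (first 8 samples),
# then flag readings outside a 3-sigma band, a classic control-chart rule.
ref = pressure[:8]
m, s = mean(ref), stdev(ref)
deviations = [i for i, x in enumerate(pressure) if abs(x - m) > 3 * s]

print(deviations)  # → [8, 9]: the indices of the drifting samples
```

In practice the reference statistics would be estimated from a validated period of normal operation rather than from a fixed prefix of the series.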

The KDD process can serve as an efficient technique to analyze and represent real-time and historical process data in order to gain deeper insight into the behavior of complex systems. One of the main requirements of a successful and useful knowledge integration step is that the discovered knowledge must be well interpretable for users. Soft computing systems help knowledge discovery processes produce interpretable results. In the following, the basics of soft computing systems are introduced according to [22].