Exploratory Data Analysis (EDA) - Rule-Search-Based Analysis of Process Data (Folyamatadatok szabálykeresésen alapuló elemzése)

Process monitoring based on multivariate statistical analysis of process data has recently been investigated by a number of researchers [67, 106, 20, 105]. The aim of these approaches is to reduce the dimensionality of the correlated process data by projecting them onto a lower-dimensional latent variable space where the operation can be easily visualized. These approaches use principal component analysis (PCA) or projection to latent structures (PLS). Besides process performance monitoring, these tools can be used for system identification [67], ensuring consistent production [69], and product design [59]. In these classical data analysis approaches, data collection is followed by the construction of a model, and the analysis, estimation, and testing focus on the parameters of that model.

Pearson suggested using Exploratory Data Analysis (EDA) tools for both of these reasons [82]. In EDA, data collection is not followed by model construction; instead, it is followed immediately by analysis, with the goal of inferring what model would be appropriate. EDA is an approach or philosophy of data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models, and determine optimal factor settings. The seminal work in EDA was written by Tukey [103].

Over the years it has benefited from other noteworthy publications such as Data Analysis and Regression by Mosteller and Tukey [75], and the book of Velleman and Hoaglin [104].

Most EDA techniques are graphical in nature, with a few quantitative techniques [74]. The heavy reliance on graphics follows from the main role of EDA, which is to explore the data open-mindedly: graphics, combined with the natural pattern-recognition capabilities that we all possess, give the analyst unparalleled power to entice the data to reveal its structural secrets and to gain new, often unsuspected, insight. The particular graphical techniques employed in EDA are often quite simple, consisting of:

1. Plotting the raw data (such as data traces and histograms).

2. Plotting simple statistics (such as mean plots, standard deviation plots and box plots).

3. Positioning such plots so as to maximize our natural pattern-recognition abilities (such as using multiple plots per page).
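The quantities behind a mean plot or a standard deviation plot (item 2) are simply per-group statistics. A minimal sketch, assuming the data have already been split into groups; the batch names and temperature values are hypothetical and serve only as an illustration:

```python
import statistics

# Hypothetical reactor temperatures (°C) for three batches (illustrative only).
batches = {
    "batch_1": [69.9, 70.0, 70.1],
    "batch_2": [70.0, 70.2, 70.4],
    "batch_3": [69.7, 69.8, 69.9],
}

def group_statistics(groups):
    """Return {name: (mean, sample standard deviation)} for each group."""
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in groups.items()
    }

stats = group_statistics(batches)
for name, (mean, std) in stats.items():
    # These pairs are the points a mean plot and a standard deviation plot show.
    print(f"{name}: mean={mean:.2f} std={std:.2f}")
```

Plotting the means (or standard deviations) against the group index then shows at a glance whether location or variation shifts between groups.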

The key step in any process data analysis method is to collect all available (heterogeneous) data and technology information (a priori knowledge) and integrate them into a process data warehouse. To analyze the enormous amount of data, trends, exploratory data analysis (box plots, quantile-quantile plots, etc.) and data mining algorithms (association rule mining, clustering, classification, etc.) can be applied to extract relevant information about the technology. A detailed application study is presented in Chapter 5; the basics of box plots are introduced here.

Suppose that X is a real-valued variable (e.g. the reactor temperature T) and that the analysis of process and product quality variables is considered. Hence, the variables are X ∈ {z_k, y_k}. An example of the behavior of a process variable (reactor temperature) is given in Fig. 2.1. The (cumulative) distribution function of X is the function F given by F(x) = P(X ≤ x).

Figure 2.1: Example of the change of a process variable (reactor temperature T [°C] over time [h], left) and its cumulative distribution function F(T) = P(T ≤ x), with the quantiles q_0.25, q_0.5 and q_0.75 marked (right)

For a discrete random variable, the cumulative distribution function is found by summing up the probabilities. For a continuous random variable, the cumulative distribution function is the integral of its probability density function (Fig. 2.1 on the right). Suppose that p ∈ [0,1]. A value x such that F(x⁻) = P(X < x) ≤ p and F(x) = P(X ≤ x) ≥ p is called a quantile of order p for the distribution.

Roughly speaking, a quantile of order p is a value where the cumulative distribution function crosses p. Hence, a quantile is the value below which a given fraction (or percentage) of the points falls. For example, the 0.25 (or 25 %) quantile is the point below which 25 % of the data fall and above which 75 % fall. Note that there is an inverse relation of sorts between the quantiles and the cumulative distribution values. A quantile of order 1/2 is called a median of the distribution. When there is only one median, it is frequently used as a measure of the center of the distribution.
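The definition of a quantile of order p can be checked directly on a finite sample by replacing the probabilities with empirical fractions. The following is a minimal sketch under that assumption; the function names `ecdf` and `is_quantile` and the sample values are hypothetical and only illustrate the two inequalities P(X < x) ≤ p and P(X ≤ x) ≥ p:

```python
from fractions import Fraction

def ecdf(sample, x, strict=False):
    """Empirical P(X < x) if strict, else P(X <= x), as an exact fraction."""
    if strict:
        count = sum(1 for v in sample if v < x)
    else:
        count = sum(1 for v in sample if v <= x)
    return Fraction(count, len(sample))

def is_quantile(sample, x, p):
    """True if x satisfies P(X < x) <= p and P(X <= x) >= p for the sample."""
    return ecdf(sample, x, strict=True) <= p and ecdf(sample, x) >= p

# Hypothetical reactor temperatures in °C (illustrative values only).
temps = [69.9, 70.0, 70.0, 70.1, 70.2]
half = Fraction(1, 2)

print(ecdf(temps, 70.0))               # fraction of readings <= 70.0
print(is_quantile(temps, 70.0, half))  # 70.0 is a median of this sample
print(is_quantile(temps, 70.1, half))  # 70.1 is not

# With an even number of distinct readings the median is not unique:
# every value between the two middle readings satisfies the definition.
even = [1, 2, 3, 4]
print([x for x in (2, 2.5, 3, 3.5) if is_quantile(even, x, half)])
```

The second sample shows why the text speaks of "a" median: for an even number of distinct readings, a whole interval of values satisfies the definition.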

A quantile of order 1/4 is called a first quartile and a quantile of order 3/4 is called a third quartile. A median is a second quartile. Assuming uniqueness, let q_0.25, q_0.5, and q_0.75 denote the first (lower), second, and third (upper) quartiles of X.

Note that the interval from q_0.25 to q_0.75 contains the middle half of the distribution. The interquartile range is therefore defined as IQR = q_0.75 - q_0.25 and is sometimes used as a measure of the spread of the distribution about the median. Let q₀ and q₁ denote the minimum and maximum values of X, respectively (assuming that these are finite). The five parameters q₀, q_0.25, q_0.5, q_0.75, q₁ are often referred to as the five-number summary. Together, these parameters give a great deal of information about the distribution in terms of its center, spread, and skewness.
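As an illustration, the five-number summary and the IQR of a sample can be computed with the Python standard library. This is a sketch, not the procedure used in the study: the sample values are hypothetical, and the "inclusive" interpolation rule of `statistics.quantiles` is only one of several common conventions for sample quartiles, so other tools may return slightly different q_0.25 and q_0.75.

```python
import statistics

def five_number_summary(sample):
    """Return q0 (min), q0.25, q0.5, q0.75 and q1 (max) of a sample."""
    data = sorted(sample)
    # Quartiles by the "inclusive" interpolation rule (an assumed convention).
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {"q0": data[0], "q0.25": q25, "q0.5": q50, "q0.75": q75, "q1": data[-1]}

# Hypothetical reactor temperatures in °C (illustrative values only).
temps = [69.85, 69.90, 69.95, 69.98, 70.00, 70.02, 70.05, 70.10, 70.20]

summary = five_number_summary(temps)
iqr = summary["q0.75"] - summary["q0.25"]  # interquartile range

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print(f"IQR: {iqr:.2f}")
```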

Tukey's five-number summary is often displayed as a box plot. Box plots are excellent tools for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data [74]. An example of a box plot is presented in Fig. 2.2. The box plot consists of a line extending from the minimum value q₀ to the maximum value q₁, with a rectangular box from q_0.25 to q_0.75 and a tick mark at the median q_0.5. Hence, the lower and upper edges of the "box" are the 25th and 75th percentiles of the sample, and the distance between them is the interquartile range. The line in the middle of the box is the sample median. If the median is not centered in the box, this is an indication of skewness. Thus the box represents the body (middle 50 %) of the data.

Figure 2.2: Single box plot of a variable (e.g. reactor temperature); the axis spans 69.8 to 70.2 °C
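The construction of a box plot from the five-number summary can be sketched as a one-line text rendering. The function names `ascii_box_plot` and `median_position` and the numbers used are hypothetical; the sketch only illustrates where the whisker line, the box, and the median tick are placed, and how an off-center median signals skewness.

```python
def ascii_box_plot(q0, q25, q50, q75, q1, width=41):
    """Render a horizontal box plot of a five-number summary as text."""
    if not q0 <= q25 <= q50 <= q75 <= q1 or q0 == q1:
        raise ValueError("need ordered quartiles and a non-zero range")

    def col(value):
        # Map a value to a column index in [0, width - 1].
        return round((value - q0) / (q1 - q0) * (width - 1))

    row = ["-"] * width                  # line from the minimum to the maximum
    for i in range(col(q25), col(q75) + 1):
        row[i] = "="                     # box covering the middle 50 %
    row[col(q25)] = "["
    row[col(q75)] = "]"
    row[col(q50)] = "|"                  # tick mark at the median
    return "".join(row)

def median_position(q25, q50, q75):
    """Relative position of the median inside the box (0.5 means centered)."""
    return (q50 - q25) / (q75 - q25)

symmetric = ascii_box_plot(0, 10, 20, 30, 40)
skewed = ascii_box_plot(0, 10, 12, 30, 40)
print(symmetric)
print(skewed)
print(median_position(10, 20, 30), median_position(10, 12, 30))
```

In the second rendering the median tick sits close to the left edge of the box, which is the visual cue for skewness described above.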

A single box plot can be drawn for one batch of data with no distinct groups. It is a simple but very useful descriptive statistical tool for process data analysis. Special uses of box plots (e.g. multiple plots) are demonstrated in the industrial application study in Chapter 5.
