
4.3 Visualization of Frequent Item Sets and Fuzzy Association Rules

4.3.3 Application Examples

For the interactive generation of galaxies it is absolutely necessary to have a graphical user interface (GUI) or working environment that supports the user in the interactive second-order data mining of the mined frequent item sets or association rules.

For this purpose the FISARM (Frequent Item Set and Association Rules Mining) toolbox has been developed in MATLAB, and it has been extended with a GUI called FISARVis (Frequent Item Set and Association Rules Visualization).

In this section two benchmark classification problems and one industrial example are given to demonstrate the applicability of the new visualization tool. The structure of the example data sets is summarized in Table 4.7.

Iris example

This classical benchmark data set has been analyzed many times to illustrate various methods. The original data set contains crisp values (continuous variables for the attributes and a discrete variable for the class label); therefore, to allow fuzzy association rule mining, this raw data must be transformed into a fuzzy transactional data set. In this example the Gustafson-Kessel (GK) clustering algorithm is used to partition all the input variables, and three fuzzy sets are defined on every feature. This results in a transactional database where every transaction t_k consists of 4 × 3 + 3 = 15 items related to the three fuzzy sets defined on each of the four features and the three class labels (Fig. 4.18).

Figure 4.18: Trapezoidal membership functions of the four input variables (sepal length and width, petal length and width) of the Iris problem.
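To make the transformation concrete, the following MATLAB sketch fuzzifies a single crisp Iris sample into such a 15-item fuzzy transaction. The [0, 1] scaling, the trapezoid breakpoints and the low/medium/high labels are illustrative assumptions; in the section above the partitions are obtained from GK clustering, not set by hand (trapmf requires the Fuzzy Logic Toolbox).

% Minimal sketch (not the thesis code): fuzzify one crisp Iris sample into a
% fuzzy transaction t_k of 4*3 + 3 = 15 items. All breakpoints are
% hypothetical; in the text they come from Gustafson-Kessel clustering.
x  = [5.1 3.5 1.4 0.2];          % one sample: sepal length/width, petal length/width
lo = [4.3 2.0 1.0 0.1];          % per-feature minima of the Iris data
hi = [7.9 4.4 6.9 2.5];          % per-feature maxima of the Iris data
z  = (x - lo) ./ (hi - lo);      % scale every feature to [0, 1]

sets = [-0.2 0.0 0.2 0.4;        % "low"    trapezoid [a b c d]
         0.2 0.4 0.6 0.8;        % "medium" trapezoid
         0.6 0.8 1.0 1.2];       % "high"   trapezoid
t = zeros(4, 3);                 % membership of each feature in each fuzzy set
for i = 1:4
    for j = 1:3
        t(i, j) = trapmf(z(i), sets(j, :));   % Fuzzy Logic Toolbox
    end
end
tk = [t(:)' 1 0 0];              % append the crisp class items (this sample is setosa)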

Figure 4.19: Sammon mapping of the frequent item sets of the Iris data (Full Galaxy).

Figure 4.20: Sammon mapping of the Iris data. The different markers represent the different classes of the Iris flower.

The frequent item sets and rules were generated by the fuzzy Apriori algorithm. The minimal fuzzy support and fuzzy confidence were set to σ = 10% and γ = 75%.

With these parameters 115 frequent item sets and 45 rules were obtained. It is interesting to note that even in the case of this small-scale problem, the number of item sets and rules is quite high compared to the small number of samples (150).
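A common way to define the fuzzy support and confidence that these thresholds act on is sketched below; the minimum t-norm, the random stand-in transaction matrix and the example item indices are all assumptions, and the actual fuzzy Apriori implementation in the FISARM toolbox may differ in these details.

% Sketch of the fuzzy support / confidence used as filtering thresholds.
% T stands in for the N-by-15 fuzzy transactional matrix of the Iris data;
% random values are used here only so that the sketch runs on its own.
T = rand(150, 15);

tnorm = @(M) min(M, [], 2);                  % minimum t-norm over the items
fsupp = @(items) sum(tnorm(T(:, items))) / size(T, 1);

A = [1 5];  C = 13;                          % hypothetical antecedent / consequent
suppAC = fsupp([A C]);                       % fuzzy support of the item set A u C
conf   = suppAC / fsupp(A);                  % fuzzy confidence of the rule A -> C
keep   = (suppAC >= 0.10) && (conf >= 0.75); % sigma = 10%, gamma = 75%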

The proposed Full Galaxy plot of the item sets (Fig. 4.19) shows that although the number of item sets is high, most of them form well-separated clusters in which the item sets are quite similar to each other. To give more insight into how these clusters are formed, the first elements of the item sets are also shown on this plot. As can be seen, the item sets related to association rules (item sets whose first element is one of the class labels 1, 2, 3) form separate clusters. This highlights that the generated association rules can be used for classification. The analysis of neighboring item sets can be used for "model pruning": the redundant rules can be removed from the rule base. For this purpose the developed interactive FISARVis tool can be used. The GUI can show association rules as well, separating the derived (fuzzy) sets of the antecedent and the consequent parts of the rules. There is also an opportunity to monitor the movements of the user: the visited item sets are logged so that the desired views can easily be reached again next time.
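As a sketch of this kind of redundancy pruning, the greedy pass below keeps one representative of every group of rules whose pairwise distance falls under a threshold; the rule descriptors, the distance measure and the threshold value are stand-ins, since the section does not fix these details.

% Greedy redundancy pruning over a rule-to-rule distance matrix D.
R = rand(45, 15);                 % stand-in descriptors of the 45 Iris rules
D = squareform(pdist(R));         % pairwise distances (Statistics Toolbox)
thr = 0.5;                        % assumed redundancy threshold

redundant = false(size(D, 1), 1);
for i = 1:size(D, 1)
    if ~redundant(i)
        near = find(D(i, :) < thr);   % rules almost identical to rule i
        near(near <= i) = [];         % keep rule i, mark the rest
        redundant(near) = true;
    end
end
pruned = find(~redundant);        % indices of the retained rules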

For comparison, Fig. 4.20 shows the Sammon mapping of the raw data. Compared to the plot of the frequent item sets, this plot reveals much less about the hidden structure of the multivariate data: the only information it conveys is that one of the clusters (classes) is well separated from the other two.

In contrast, the developed association rule based visualization tool gives information on how the classes can be described, which variables are important, and which fuzzy sets form useful rules.

Wisconsin example

The aim of the Wisconsin Breast Cancer classification problem is to distinguish between benign and malignant cancers based on the nine available features. This example illustrates how the new algorithm can be used to visualize a larger (nine-dimensional) classification problem, and how this visualization can be used to detect the number of useful rules.

The frequent item set and association rule mining parameters of the fuzzy Apriori algorithm were identical to those of the previous Iris example. Due to the larger number of data (966 samples) and input variables, many more item sets (2310) and rules (1134) were generated. The distribution of the distances of the frequent item sets gives useful information about the diversity of the obtained rule base (Fig. 4.21).

Figure 4.21: Distribution of the N × N pairwise distances d_ij of the item sets of the Wisconsin problem.
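The diagnostic behind Fig. 4.21 can be reproduced along the following lines; the item-set descriptors are random stand-ins, since only the shape of the distance distribution matters here.

% Empirical distribution of all pairwise item-set distances (cf. Fig. 4.21):
% a heavy mass near zero signals many nearly identical, redundant item sets.
X = rand(2310, 15);          % stand-in descriptors of the 2310 item sets
d = pdist(X);                % the N*(N-1)/2 pairwise distances
histogram(d, 50, 'Normalization', 'probability');
xlabel('d_{i,j}'); ylabel('relative frequency');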

Figure 4.22: Map of the item sets of the Wisconsin problem.

Figure 4.23: Map of the association rules of the Wisconsin problem.

The role of the Sammon mapping is to create a two-dimensional plot on which the distances among the mapped objects approximate the original distances among the item sets. As Fig. 4.22 shows, the generated plot exhibits a nicely structured (clustered) arrangement of the item sets.
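Formally, Sammon's mapping looks for low-dimensional coordinates that minimize the Sammon stress between the original distances d_ij and the mapped distances d*_ij:

$$
E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left(d_{ij} - d_{ij}^{*}\right)^{2}}{d_{ij}}
$$

A map of this kind can be produced with MATLAB's mdscale from the Statistics and Machine Learning Toolbox; the distance matrix below is a random stand-in for the actual item-set distances.

% Sammon projection of an item-set distance matrix to two dimensions.
D = squareform(pdist(rand(115, 15)));      % stand-in: 115 item sets, as for Iris
Y = mdscale(D, 2, 'Criterion', 'sammon');  % minimizes the Sammon stress above
scatter(Y(:, 1), Y(:, 2), '.');            % galaxy-style scatter of the item sets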

This figure becomes much more transparent when only the rules are visualized (Fig. 4.23). As the plot suggests, due to the high redundancy among the 1134 association rules, this large number of rules can be arranged into only 9 clusters; hence a compact rule base of about nine rules may be sufficient to describe the problem, which is confirmed by previous studies [2, 1].

4.3.4 Conclusion

A novel method has been presented in this section for the visualization of association rules and frequent item sets. The developed tool is based on the distance-preserving mapping of the item sets (rules) into a two-dimensional map, where the distances among the item sets (rules) represent their relative information content.

The results illustrate that this tool is an effective method for secondary data mining: useful information can be extracted from a large number of (fuzzy) rules, e.g., the important variables can be selected and the rules can be clustered, which is an effective approach to the reduction of complex rule-based models.

4.4 Discussion and conclusion

This chapter presented three new association rule discovery based methods for classification, model structure selection and visualization tasks. The developed algorithms and tools were applied to several benchmark and real-world problems from the data mining and chemical engineering literature. The results showed that these new methods can be used efficiently for several types of data mining problems. The industrial applicability of the developed methods is demonstrated by a detailed case study in the next chapter.

Chapter 5

Data Warehousing and Mining for the Analysis of Polypropylene

Production

The PolyPropylene-4 plant (PP4) of TVK Ltd. (www.tvk.hu) is based on the licensed Spheripol Process (see the scheme in Fig. 5.1), which produces spherical polypropylene polymer particles. Two polymer products are produced: homopolymer (propylene polymer) and copolymer (propylene-ethylene polymer), while two additional products could be produced: random copolymer (random propylene-ethylene polymer) and terpolymer (butene polymer). A three-component catalyst (a Ti catalyst on MgCl2 base - CAT, aluminium-triethyl - TEAL, and a silane compound - Donor) is used for polymerization. It is very important to maintain the Al/Ti and Donor/Ti ratios within the preset ranges because of the product quality and the productivity of the catalyst system. The catalyst components are mixed in a pre-contacting pot with grease and oil, and this paste is injected into a specific portion of the propylene stream in the pre-polymerization reactor, which is a small, tempered, circulated loop reactor with about 10-minute residence time and low temperature. After pre-polymerization, the blend is mixed into the circulated slurry of the first loop reactor.

The polymerization reaction for homopolymer production takes place in two loop reactors connected in series. Pipelines for random copolymer and terpolymer production are attached as well. All the loop reactors operate completely full; a combination of an expansion drum and an evaporator is used to keep them in this state and to prevent pressure fluctuations. The hydrogen feed is used to control the intrinsic viscosity of the polymer.

Figure 5.1: Scheme of the Spheripol polypropylene technology.

The large quantity of monomers (ca. 50 wt%) in the outlet of the second loop reactor must be recovered. After complete evaporation and flash separation, the condensed polymer is filtered and discharged to the gas-phase reactor if impact or special impact copolymer is to be produced from the homopolymer; otherwise the loop reactor outlet is transferred to the purification and extrusion section and to the storage silos. The recovered monomers are fed back to the propylene feed tank after having been purified from the powder entrainment and aluminium-alkyl. Heterophase high-impact copolymers are obtained by adding a gummy phase consisting of ethylene-propylene bipolymer to the homopolymer matrix coming from the bulk polymerization in the loop reactors. The gas-phase reactor operates as a fluidized-bed reactor; the bed is maintained by a centrifugal compressor. Polymer particles impinge on the wall, and a scraper prevents them from accumulating. The copolymer is discharged first into the non-fluidized degassing section of the reactor. Then it is filtered; the monomer gases are vented into a propylene scrubber and then discharged into an ethylene stripper:

hydrogen- and ethylene-rich top vapors are recycled to the reactor, while the propylene- and ethane-rich bottom stream joins the recycled monomer stream. The following actions are then taken on the particles: catalyst deactivation, steam treatment, and drying with nitrogen. The purified copolymer product is discharged to the extrusion and storage section of the plant.

5.1 Development of the process data warehouse

In the PP4 plant 15-20 types of polypropylene are produced, and the change of products is relatively frequent (every 1-2 days). A detailed process analysis can answer many technological questions and discover relevant information and relationships of the process variables at the important events of the process, such as product or catalyst changes, catalyst productivity, product quality, etc. The discovered knowledge can be used to optimize the production process, to improve productivity and product quality, and moreover to develop new products.

The process data definitely have the potential to provide information for product and process design, monitoring and control, but access to these data on the process control computers is limited in time: they are archived retrospectively for only 3-4 months. The other momentous problem is the heterogeneity of the sources, data, etc. (Section 2.1).

For effective analysis, the data must be stored in a homogeneous structure for a longer period (years), and easy-to-use software tools are also needed. To solve all these problems a process data warehouse based information system was designed. The implementation of the information system has the following main steps:

1. Identification of data sources
2. Data acquisition
3. Data warehouse building
4. Development of front-end tools
5. Validation
6. Installation in the plant

In the following, steps 1-5 are detailed.

Identification of data sources

Figure 5.2: PHD system and components, process data flow.

Figure 5.3: Data acquisition in the PP production unit.

Two main types of data sources are distinguished: electronic data (historical databases of the DCS) and paper based reports (manually logged data). Additional sources are the documents of laboratory measurements and the specification documents of the products (production sheets, a priori knowledge).

Data acquisition

In the PP4 plant the process data of the technology are collected through the PHD (Process History Database) module of the control system. The components of the PHD and the process data flow are presented in Fig. 5.2. There are two main operations:

Data collection: Data originates in the real-time system and is collected by a

real-time data interface (RDI). A tag contains all the important information about a process variable, such as its name, type, unit, etc. These tag parameters are stored for all variables in a reference database. The RDI sends data to the PHD server, which places the collected data for a tag in the raw data queue and applies data processing (smoothing, compression, and so on) to move raw data queue entries to the data queue of the tag. The data queue of the tag then holds processed data that is ready for insertion into the active logical archive files by the continuous store thread.

Data retrieval: An application program makes a call to the PHD application programming interface (API), indicating the desired tag and time range. The PHD system checks the data queues to see whether the data is still held in the queues; otherwise PHD accesses the data from the connected archive files.

To store data for historical analysis, the data acquisition scheme presented in Fig. 5.3 is proposed. First, the tag names of the relevant process variables are selected from all the available tags in the plant. The process data belonging to the selected tags are accessed in PHD by the Uniformance Desktop application program (by Honeywell). Since Uniformance runs as an MS Excel add-in, the results of the data queries are saved in Excel files.
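As an illustration of how such an Excel query result can be brought into a consistent form for loading, the sketch below reads one Uniformance output file and writes it out as CSV; the file name and the use of CSV as the intermediate format are assumptions (in the thesis the loading itself is done with the Navicat tool, as described in the next subsection).

% Sketch: convert a Uniformance Excel query result into a consistent CSV
% that a MySQL loader (e.g. LOAD DATA INFILE) can ingest. The file name
% 'pp4_tags.xlsx' is a hypothetical placeholder.
T = readtable('pp4_tags.xlsx');
T.Properties.VariableNames = ...
    matlab.lang.makeValidName(T.Properties.VariableNames);  % consistent column names
writetable(T, 'measured_data.csv');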

Data warehouse building

The main steps of the data warehouse building process are presented in Fig. 5.4. First of all, to get adequate reports about the production runs, the partially overlapping, non-consistent, mainly paper based reports are processed and stored in electronic form.

Since the values of the process variables are already stored in Excel files during data acquisition, the paper based data sources are also transformed into this form. The DW is implemented on a MySQL database server. The data is loaded from the consistent information sources into the DW by the Navicat MySQL administration tool.

Figure 5.4: The data warehouse building process.

The structure (data tables) of the DW is represented in Fig. 5.5. The table Measured data stores the values of all the relevant (about 220) process variables of the technology, while the table Extruder silo's changes records the changes of the storage drums. The product quality is determined by off-line laboratory analysis; the sampling time intervals are between one and four hours. The most important quality variable is the melting index (synonyms of melting index/MI are melt flow rate/MFR and melt flow index/MFI). It is a measure of the ease of flow of the melt of a thermoplastic polymer, and it is defined as the weight of polymer in grams flowing in 10 minutes through a capillary of specific diameter and length, under a pressure applied via prescribed alternative gravimetric weights at alternative prescribed temperatures (more details about MFI can be found in [91]). The MI values of the powder and of the granulate are stored in the data tables Melting index of polymer powder and Melting index of granulate.
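Since MFR is reported in g/10 min, a laboratory cut collected over a different time span is simply rescaled to the 10-minute basis; a tiny worked example with made-up numbers:

% Illustrative MFR computation from a lab measurement (hypothetical numbers).
m_cut = 1.2;                     % grams of extrudate collected
t_cut = 60;                      % collection time in seconds
MFR   = m_cut * (600 / t_cut);   % = 12 g/10 min (600 s = 10 min)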

Figure 5.5: Data tables of the process data warehouse for the PP4 plant (Measured data, Melting index of granulate, Extruder silo's changes, Melting index of polymer powder).

Figure 5.6: The structure of the information system (Graphical User Interface over the process data warehouse, with Trends, Productions, APC simulator and Melting index front-end modules).