Conclusions and Summary

Data streams pose a computational challenge for data mining because the large volume of data imposes additional algorithmic constraints. In addition, temporal locality leads to a number of unique mining challenges in the data stream case. This chapter provided an overview of the different mining algorithms which are covered in this book. We discussed the different problems and the challenges associated with each of them.

We also provided an overview of the material in each chapter of the book.


References

[1] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.

[2] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in Evolving Data Streams. IEEE ICDE Conference.

[3] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.

[4] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High Dimensional Projected Clustering of Data Streams. VLDB Conference.

[5] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Data Streams. ACM KDD Conference.

[6] Agrawal R., Imielinski T., Swami A. (1993). Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference.

[7] Chen Y., Dong G., Han J., Wah B. W., Wang J. (2002). Multi-dimensional regression analysis of time-series data streams. VLDB Conference.

[8] Cormode G., Garofalakis M. (2005). Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB Conference.

[9] Datar M., Gionis A., Indyk P., Motwani R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6): 1794-1813.

[10] Dong G., Han J., Lam J., Pei J., Wang K. (2001). Mining multi-dimensional constrained gradients in data cubes. VLDB Conference.

[11] Dasu T., Krishnan S., Venkatasubramanian S., Yi K. (2005). An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.

[12] Domingos P., Hulten G. (2000). Mining High-speed Data Streams. In Proceedings of the ACM KDD Conference.

[13] Garofalakis M., Gehrke J., Rastogi R. (2002). Querying and mining data streams: you only get one look (a tutorial). SIGMOD Conference.

[14] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data Streams. IEEE FOCS Conference.

[15] Giannella C., Han J., Pei J., Yan X., Yu P. (2002). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Proceedings of the NSF Workshop on Next Generation Data Mining.

[16] Hulten G., Spencer L., Domingos P. (2001). Mining Time Changing Data Streams. ACM KDD Conference.

[17] Jin R., Agrawal G. (2005). An algorithm for in-core frequent itemset mining on streaming data. ICDM Conference.

[18] Kifer D., Ben-David S., Gehrke J. (2004). Detecting Change in Data Streams. VLDB Conference.

[19] Kollios G., Byers J., Considine J., Hadjieleftheriou M., Li F. (2005). Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.

[20] Sakurai Y., Papadimitriou S., Faloutsos C. (2005). BRAID: Stream mining through group lag correlations. ACM SIGMOD Conference.

[21] Yi B.-K., Sidiropoulos N. D., Johnson T., Jagadish H. V., Faloutsos C., Biliris A. (2000). Online data mining for co-evolving time sequences. ICDE Conference.

Chapter 2 ON CLUSTERING MASSIVE DATA STREAMS: A SUMMARIZATION PARADIGM

Charu C. Aggarwal

IBM T. J. Watson Research Center Hawthorne, NY 10532

Jiawei Han

University of Illinois at Urbana-Champaign Urbana, IL

hanj@cs.uiuc.edu

Jianyong Wang

University of Illinois at Urbana-Champaign Urbana, IL

jianyong@tsinghua.edu.cn

Philip S. Yu

IBM T. J. Watson Research Center Hawthorne, NY 10532

Abstract

In recent years, data streams have become ubiquitous because of the large number of applications which generate huge volumes of data in an automated way. Many existing data mining methods cannot be applied directly to data streams because the data needs to be mined in one pass. Furthermore, data streams show a considerable amount of temporal locality, because of which a direct application of the existing methods may lead to misleading results. In this paper, we develop an efficient and effective approach for mining fast evolving data streams, which integrates the micro-clustering technique with the high-level data mining process, and discovers data evolution regularities as well. Our analysis and experiments demonstrate that two important data mining problems, namely stream clustering and stream classification, can be performed effectively using this approach, with high quality mining results. We discuss the use of micro-clustering as a general summarization technology to solve data mining problems on streams. Our discussion illustrates the importance of our approach for a variety of mining problems in the data stream domain.

1. Introduction

In recent years, advances in hardware technology have allowed us to automatically record transactions and other pieces of information of everyday life at a rapid rate. Such processes generate huge amounts of online data which grow at an unlimited rate. These kinds of online data are referred to as data streams. The management and analysis of data streams have been researched extensively in recent years because of their emerging, imminent, and broad applications [11, 14, 17, 23].

Many important problems such as clustering and classification have been widely studied in the data mining community. However, a majority of such methods may not work effectively on data streams. Data streams pose special challenges to a number of data mining algorithms, not only because of the huge volume of the online data streams, but also because the data in the streams may show temporal correlations. Such temporal correlations may help disclose important data evolution characteristics, and they can also be used to develop efficient and effective mining algorithms. Moreover, data streams require online mining, in which we wish to mine the data in a continuous fashion. Furthermore, the system needs to be able to perform an offline analysis as well, based on the user's interests. This is similar to an online analytical processing (OLAP) framework, which uses the paradigm of pre-processing once and querying many times.

Based on the above considerations, we propose a new stream mining framework, which adopts a tilted time window framework, takes micro-clustering as a preprocessing step, and integrates the preprocessing with the incremental, dynamic mining process. Micro-clustering preprocessing effectively compresses the data, preserves the general temporal locality of data, and facilitates both online and offline analysis, as well as the analysis of current data and data evolution regularities.
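To make the idea of compressing the stream into micro-clusters more concrete, the following is a minimal sketch. It assumes a generic additive summary (point count, per-dimension sum, and per-dimension sum of squares) and is not the exact structure defined later in this chapter. The class name `MicroCluster` and its methods are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Additive summary of a group of stream points (illustrative assumption)."""

    dim: int
    n: int = 0  # number of points absorbed so far
    linear_sum: List[float] = field(default_factory=list)   # per-dimension sum
    squared_sum: List[float] = field(default_factory=list)  # per-dimension sum of squares

    def __post_init__(self) -> None:
        # Start from all-zero statistics unless explicit ones were supplied.
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.squared_sum:
            self.squared_sum = [0.0] * self.dim

    def absorb(self, point: Sequence[float]) -> None:
        """Fold one point into the summary; the raw point need not be stored."""
        if len(point) != self.dim:
            raise ValueError("point has the wrong dimensionality")
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        """Combine two summaries by adding their statistics component-wise."""
        if other.dim != self.dim:
            raise ValueError("cannot merge summaries of different dimensionality")
        return MicroCluster(
            dim=self.dim,
            n=self.n + other.n,
            linear_sum=[a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            squared_sum=[a + b for a, b in zip(self.squared_sum, other.squared_sum)],
        )

    def centroid(self) -> List[float]:
        """Mean of the absorbed points, recovered from the summary alone."""
        if self.n == 0:
            raise ValueError("empty micro-cluster has no centroid")
        return [s / self.n for s in self.linear_sum]

    def rms_radius(self) -> float:
        """Root-mean-square distance of the absorbed points from the centroid."""
        c = self.centroid()
        var = sum(sq / self.n - m * m for sq, m in zip(self.squared_sum, c))
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors
```

Because every statistic is a plain sum, each arriving point is handled in one pass with time proportional to the dimensionality, and two summaries can be merged without revisiting the original data. This is the kind of property that lets a summary serve both online maintenance and later offline analysis.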

In this study, we primarily concentrate on the application of this technique to two problems: (1) stream clustering, and (2) stream classification. The heart of the approach is to use an online summarization approach which is efficient and also allows for effective processing of the data streams. We also discuss a number of research directions, in which we show how the approach can be adapted to a variety of other problems.

Figure 2.1. Micro-clustering Examples

Figure 2.2. Some Simple Time Windows

This paper is organized as follows. In the next section, we present our micro-clustering based stream mining framework. In Section 3, we discuss the stream clustering problem. The classification methods are developed in Section 4. In Section 5, we discuss a number of other problems which can be solved with the micro-clustering approach, and other possible research directions. In Section 6, we discuss some empirical results for the clustering and classification problems. In Section 7, we discuss the issues related to our proposed stream mining methodology and compare it with other related work. Section 8 concludes our study.

2. The Micro-clustering Based Stream Mining Framework
