WAVELET METHODS IN DATA MINING

Chapter 27

Tao Li, School of Computer Science, Florida International University, Miami, FL 33199, [email protected]

Sheng Ma, Machine Learning for Systems, IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, [email protected]

Mitsunori Ogihara, Computer Science Department, University of Rochester, Rochester, NY 14627-0226

Abstract

Recently there has been significant development in the use of wavelet methods in various Data Mining processes. This article presents a general overview of their applications in Data Mining. It first presents a high-level Data Mining framework in which the overall process is divided into smaller components. It then reviews applications of wavelets for each component. Finally, it discusses the impact of wavelets on Data Mining research and outlines potential future research directions and applications.

Keywords:

Wavelet Transform, Data Management, Short Time Fourier Transform, Heisenberg's Uncertainty Principle, Discrete Wavelet Transform, Multiresolution Analysis, Haar Wavelet Transform, Trend and Surprise Abstraction, Preprocessing, Denoising, Data Transformation, Dimensionality Reduction, Distributed Data Mining

DATA MINING AND KNOWLEDGE DISCOVERY HANDBOOK

1. Introduction

The wavelet transform is a synthesis of ideas that emerged over many years from different fields. Generally speaking, the wavelet transform is a tool that partitions data, functions, or operators into different frequency components and then studies each component with a resolution matched to its scale (Daubechies, 1992). It can therefore provide an economical and informative mathematical representation of many objects of interest (Abramovich et al., 2000). Nowadays many software packages contain fast and efficient programs that perform wavelet transforms. Owing to this easy accessibility, wavelets have quickly gained popularity among scientists and engineers, both in theoretical research and in applications.

Data Mining is the process of automatically extracting novel, useful, and understandable patterns from a large collection of data. Over the past decade this area has become significant both in academia and in industry. Wavelet theory can naturally play an important role in Data Mining because wavelets provide data representations that enable efficient and accurate mining, and they can also be incorporated into the kernel of many algorithms. Although standard wavelet applications mainly concern data with temporal/spatial localities (e.g., time series data, stream data, and image data), wavelets have also been successfully applied to various other Data Mining domains.

In this chapter we present a general overview of wavelet methods in Data Mining, together with the relevant mathematical foundations and a survey of research on wavelet applications. The interested reader is encouraged to consult other chapters for further reading (for references, see (Li, Li, Zhu, and Ogihara, 2003)). This chapter is organized as follows: Section 2 presents a high-level Data Mining framework, which divides the Data Mining process into four components. Section 3 introduces the necessary mathematical background. Sections 4, 5, and 6 review wavelet applications in each of the components. Finally, Section 7 concludes.

2. A Framework for Data Mining Process

Here we view Data Mining as an iterative process consisting of four components: data management, data preprocessing, core mining, and post-processing. In data management, the mechanisms and structures for accessing and storing data are specified. The subsequent data preprocessing is an important step, which ensures data quality and improves the efficiency and ease of the mining process. Real-world data tend to be incomplete, noisy, inconsistent, high-dimensional, and multi-sensory, and hence are not directly suitable for mining. Data preprocessing includes data cleaning to remove noise and outliers, data integration to merge data from multiple information sources, data reduction to reduce the dimensionality and complexity of the data, and data transformation to convert the data into forms suitable for mining. Core mining refers to the essential process in which various algorithms are applied to perform the Data Mining tasks. The discovered knowledge is then refined and evaluated in the post-processing stage.

The four-component framework above provides a simple systematic language for understanding the steps that make up the Data Mining process. Of the four components, post-processing mainly concerns non-technical work such as documentation and evaluation, so we will focus our attention on the first three.

3. Wavelet Background

3.1 Basics of Wavelets in L2(R)

So, first, what is a wavelet? Simply speaking, a mother wavelet is a function $\psi(x)$ such that $\{\psi(2^j x - k),\; j, k \in \mathbb{Z}\}$ is an orthonormal basis of $L^2(\mathbb{R})$. The basis functions are usually referred to as wavelets. The term wavelet means a small wave: the smallness refers to the condition that the function be of finite length or compactly supported, and the wave refers to the condition that the function be oscillatory. The term mother implies that the functions with different regions of support used in the transformation process are all derived from the mother wavelet by dilation and translation.

At first glance, wavelet transforms look much the same as Fourier transforms, except that they use different bases. So why bother with wavelets? What are the real differences between them? The simple answer is that the wavelet transform provides time and frequency localization simultaneously, while the Fourier transform provides only a frequency representation. Fourier transforms are designed for stationary signals: they expand a signal in sine and cosine waves, which extend in time forever, so if the representation has a certain frequency content at one time, it has the same content for all time. Hence the Fourier transform is not suitable for non-stationary signals, whose frequency content varies with time (Polikar, 2005).

Since the Fourier transform does not work for non-stationary signals, researchers developed a revised version, the Short Time Fourier Transform (STFT). In STFT, the signal is divided into small segments, on each of which the signal can be assumed to be stationary. Although STFT can provide a time-frequency representation of the signal, Heisenberg's Uncertainty Principle makes the choice of the segment length a serious problem for STFT. The principle states that one cannot know the exact time-frequency representation of a signal; one can only know the time intervals in which certain bands of frequencies exist. Thus, for STFT, longer segments give better frequency resolution but poorer time resolution, while shorter segments give better time resolution but poorer frequency resolution. Another serious problem with STFT is that there is no inverse, i.e., the original signal cannot be reconstructed from the time-frequency map or the spectrogram.
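The segment-length trade-off can be seen concretely. The sketch below (our own illustration, not from the chapter) computes a plain non-overlapping STFT of a toy two-tone signal in pure Python; the segment lengths 32 and 8 and the signal itself are arbitrary choices. With 32-sample segments we get 32 frequency bins but only 2 time frames; with 8-sample segments, 8 time frames but only 8 coarse frequency bins.

```python
import cmath
import math

def stft(signal, seg_len):
    """Non-overlapping STFT: DFT of each fixed-length segment.
    Longer segments -> finer frequency bins but coarser time frames."""
    frames = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = signal[start:start + seg_len]
        spectrum = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / seg_len)
                        for n in range(seg_len))
                    for k in range(seg_len)]
        frames.append(spectrum)
    return frames

# Non-stationary toy signal: a low tone in the first half, a high tone after.
N = 64
sig = [math.sin(2 * math.pi * 2 * t / N) if t < N // 2
       else math.sin(2 * math.pi * 16 * t / N)
       for t in range(N)]

wide = stft(sig, 32)    # good frequency resolution, poor time resolution
narrow = stft(sig, 8)   # good time resolution, poor frequency resolution
```

In `wide`, the low tone appears sharply in bin 1 of the first frame and the high tone in bin 8 of the second, but each frame blurs the signal over a long 32-sample interval; `narrow` localizes the frequency switch in time much better, at the price of only 8-bin spectra.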


Figure 27.1. Time-Frequency Structure of STFT. The graph shows that time and frequency localizations are independent. The cells are always square.

Figure 27.2. Time-Frequency Structure of WT. The graph shows that frequency resolution is good at low frequencies and time resolution is good at high frequencies.

Wavelets are designed to give good time resolution and poor frequency resolution at high frequencies, and good frequency resolution and poor time resolution at low frequencies (Polikar, 2005). This is useful for many practical signals, since they usually have high-frequency components of short duration (bursts) and low-frequency components of long duration (trends). The time-frequency cell structures of STFT and WT are shown in Figure 27.1 and Figure 27.2, respectively. In Data Mining practice, the key tool is the discrete wavelet transform (DWT), and our discussion will focus on it.

3.2 Dilation Equation

How do we find wavelets? The key idea is self-similarity: start with a function $\phi(x)$ that is made up of smaller versions of itself. This is the refinement (or 2-scale, dilation) equation $\phi(x) = \sum_{k=-\infty}^{\infty} a_k \phi(2x - k)$, where the $a_k$ are called filter coefficients or masks. The function $\phi(x)$ is called the scaling function (or father wavelet). Under certain conditions,
$$\psi(x) = \sum_{k} (-1)^k a_{1-k}\, \phi(2x - k)$$
gives a wavelet. Figure 27.3 shows the Haar wavelet and Figure 27.4 shows the Daubechies-2 (db2) wavelet, which is supported on the interval [0, 3]. In general, db$_n$ denotes the Daubechies wavelet of order $n$. Generally, it


can be shown that: (1) the support of db$_n$ is the interval $[0, 2n-1]$; (2) the wavelet db$_n$ has $n$ vanishing moments; and (3) the regularity increases with the order: db$_n$ has about $rn$ continuous derivatives, where $r$ is about 0.2.
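Both the dilation equation and the vanishing-moment property can be checked numerically. The sketch below is our own illustration, not from the chapter: it verifies the Haar refinement equation $\phi(x) = \phi(2x) + \phi(2x-1)$ (i.e., $a_0 = a_1 = 1$), then builds the db2 high-pass filter from the standard db2 low-pass coefficients via the alternating-flip rule and checks that it annihilates constants and linear trends, reflecting its $n = 2$ vanishing moments.

```python
import math

# --- Haar: the scaling function is the indicator of [0, 1) and satisfies
#     phi(x) = phi(2x) + phi(2x - 1), i.e. a_0 = a_1 = 1 in the dilation equation.
def phi(x):
    return 1.0 if 0.0 <= x < 1.0 else 0.0

xs = [k / 16.0 for k in range(-8, 24)]  # sample grid straddling [0, 1)
refinement_holds = all(phi(x) == phi(2 * x) + phi(2 * x - 1) for x in xs)

# --- db2 (Daubechies order n = 2): low-pass coefficients, normalized so that
#     sum(h) = sqrt(2) and sum(h_k^2) = 1.
s3, s2 = math.sqrt(3), math.sqrt(2)
h = [(1 + s3) / (4 * s2), (3 + s3) / (4 * s2),
     (3 - s3) / (4 * s2), (1 - s3) / (4 * s2)]

# High-pass filter by alternating flip: g_k = (-1)^k * h_{3-k}.
g = [(-1) ** k * h[3 - k] for k in range(4)]

# n = 2 vanishing moments: the high-pass filter kills constants and linears.
moment0 = sum(g)                                 # should be 0
moment1 = sum(k * gk for k, gk in enumerate(g))  # should be 0
energy = sum(hk * hk for hk in h)                # should be 1
```

Note also that the four filter taps (indices 0 through 3) match the support length $2n - 1 = 3$ stated in property (1).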

Figure 27.3. Haar Wavelet.

Figure 27.4. Daubechies-2 (db2) Wavelet.

3.3 Multiresolution Analysis (MRA) and Fast DWT Algorithm

How can wavelet transforms be computed efficiently? To answer this question we need to touch on Multiresolution Analysis (MRA). MRA was first introduced in (Mallat, 1989), and a fast family of algorithms is based on it. The motivation of MRA is to use a sequence of embedded subspaces to approximate $L^2(\mathbb{R})$, so that for a specific application a proper subspace can be chosen to balance accuracy against efficiency. Mathematically, MRA studies the properties of a sequence of closed subspaces $V_j$, $j \in \mathbb{Z}$, which approximate $L^2(\mathbb{R})$ and satisfy
$$\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots, \qquad \overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R}), \qquad \bigcap_{j \in \mathbb{Z}} V_j = \{0\},$$
i.e., the union of the $V_j$ is dense in $L^2(\mathbb{R})$ and their intersection is trivial. So what does multiresolution mean? The multiresolution is reflected by the additional requirement
$$f \in V_j \iff f(2x) \in V_{j+1}, \quad j \in \mathbb{Z}$$
(equivalently, $f(x) \in V_0 \iff f(2^j x) \in V_j$), i.e., all the spaces are scaled versions of the central space $V_0$.

So how does this relate to wavelets? The scaling function easily generates a sequence of subspaces that provides a simple multiresolution analysis. First, the translations of $\phi(x)$, i.e., $\phi(x-k)$, $k \in \mathbb{Z}$, span a subspace, say $V_0$ (in fact $\{\phi(x-k), k \in \mathbb{Z}\}$ constitutes an orthonormal basis of $V_0$). Similarly, $2^{1/2}\phi(2x-k)$, $k \in \mathbb{Z}$, span another subspace, say $V_1$. The dilation equation tells us that $\phi$ can be represented in a basis of $V_1$; hence $\phi$, and with it every translation $\phi(x-k)$, $k \in \mathbb{Z}$, lies in $V_1$, so $V_0$ is embedded in $V_1$. Repeating the argument at different dyadic scales, it is straightforward to obtain a whole sequence of embedded subspaces of $L^2(\mathbb{R})$ from this one function. It can be shown that the closure of the union of these subspaces is exactly $L^2(\mathbb{R})$ and that their intersection is trivial (Daubechies, 1992). Here $j$ controls the observation resolution while $k$ controls the observation location. A formal proof that the wavelets span the complement spaces can be found in (Daubechies, 1992).
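The relationship between the nested approximation spaces and the wavelet (complement) spaces can be summarized by the standard MRA identities, with $W_j$ denoting the orthogonal complement of $V_j$ in $V_{j+1}$:

```latex
V_{j+1} = V_j \oplus W_j,
\qquad
V_J = V_{J_0} \oplus \bigoplus_{j=J_0}^{J-1} W_j \quad (J_0 < J),
\qquad
L^2(\mathbb{R}) = \overline{\bigoplus_{j \in \mathbb{Z}} W_j}.
```

Iterating the first identity is exactly what the pyramid algorithm does in the discrete setting: peel off one detail space $W_j$ per step.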

Figure 27.5. Fast Discrete Wavelet Transform.

A direct application of multiresolution analysis is the fast discrete wavelet transform algorithm, called the pyramid algorithm (Mallat, 1989). The core idea is to progressively smooth the data using an iterative procedure and to keep the detail along the way, i.e., to analyze the projections of $f$ onto the spaces $W_j$. We use Haar wavelets to illustrate the idea through the following example. In Figure 27.5, the raw data is at resolution 3 (also called layer 3). After the first decomposition, the data are divided into two parts: one is the average information (the projection onto the scaling space $V_2$) and the other is the detail information (the projection onto the wavelet space $W_2$). We then repeat the same decomposition on the data in $V_2$ and obtain the projections onto $V_1$ and $W_1$, and so on. The fact that $L^2(\mathbb{R})$ decomposes into infinitely many wavelet subspaces is equivalent to the statement that the $\psi_{j,k}$, $j, k \in \mathbb{Z}$, form an orthonormal basis of $L^2(\mathbb{R})$. An arbitrary function $f \in L^2(\mathbb{R})$ can then be expressed as $f(x) = \sum_{j,k \in \mathbb{Z}} d_{j,k}\, \psi_{j,k}(x)$, where $d_{j,k} = \langle f, \psi_{j,k} \rangle$ are called the wavelet coefficients. Note that $j$ controls the observation resolution and $k$ controls the observation location. If the data in some location are relatively smooth (so that they can be represented by low-degree polynomials), then the corresponding wavelet coefficients will be fairly small, by the vanishing moment property of wavelets.
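The Haar pyramid decomposition takes only a few lines of code. The sketch below is our own minimal pure-Python illustration (the orthonormal $1/\sqrt{2}$ normalization and the example series are our choices): each step splits the current layer into pairwise averages (the projection onto the next scaling space) and pairwise differences (the projection onto the corresponding wavelet space).

```python
import math

def haar_step(data):
    """One level of the Haar pyramid: orthonormal pairwise averages and details."""
    s = math.sqrt(2)
    avg = [(data[2 * i] + data[2 * i + 1]) / s for i in range(len(data) // 2)]
    det = [(data[2 * i] - data[2 * i + 1]) / s for i in range(len(data) // 2)]
    return avg, det

def haar_dwt(data):
    """Full pyramid decomposition of a length-2^K series.

    Returns [finest detail layer, ..., coarsest detail layer, overall average]:
    smooth repeatedly, keeping the detail at every layer on the way down."""
    coeffs = []
    while len(data) > 1:
        data, det = haar_step(data)
        coeffs.append(det)
    coeffs.append(data)  # final single-value average (coarsest scaling space)
    return coeffs

# A piecewise-constant series: flat regions should give zero detail coefficients.
series = [2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0]
coeffs = haar_dwt(series)
```

For this piecewise-constant example every detail coefficient in the two finest layers is exactly zero, illustrating the vanishing-moment remark above: locally constant regions produce negligible wavelet coefficients, and all the signal's structure concentrates in the coarsest layer. Since the transform is orthonormal, the energy $\sum_t x(t)^2$ is preserved across the decomposition.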

3.4 Illustrations of the Haar Wavelet Transform

We demonstrate the Haar wavelet transform on a discrete time series $x(t)$, where $0 \le t \le 2^K$. In $L^2(\mathbb{R})$, discrete wavelets can be represented as $\phi_j^m(t) = 2^{-j/2}\,\phi(2^{-j} t - m)$, where $j$ and $m$ are positive integers: $j$ represents the dilation, which characterizes the function $\phi(t)$ at different time scales, and $m$ represents the translation in time. Because the $\phi_j^m(t)$ are obtained by dilating and