Appl Intell (2012) 37:80–99 DOI 10.1007/s10489-011-0314-z

An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization

Lam Hong Lee · Chin Heng Wan · Rajprasad Rajkumar · Dino Isa

Published online: 25 August 2011 © Springer Science+Business Media, LLC 2011

Abstract  This paper presents the implementation of a new text document classification framework that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase, coined as Euclidean-SVM. The SVM constructs a classifier by generating a decision surface, namely the optimal separating hyper-plane, to partition different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized for the non-linearly separable cases by introducing kernel functions, which map the data points from the input space into a high dimensional feature space so that they can be separated by a linear hyper-plane. This characteristic causes the choice of kernel function to have a high impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter C is another critical component in determining the performance of the SVM classifier. Hence, one of the critical problems of the conventional SVM classification framework is the necessity of determining the appropriate kernel function and the appropriate value of parameter C for different datasets of varying characteristics, in order to guarantee high accuracy of the classifier. In this paper, we introduce a distance measurement technique, using the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision making function in the SVM. In our approach, the support vectors for each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distances between the new data point and the support vectors from the different categories are measured using the Euclidean distance function. The classification decision is made based on the category of support vectors which has the lowest average distance to the new data point, which makes the classification decision independent of the efficacy of the hyper-plane formed by the particular kernel function and soft margin parameter. We tested our proposed framework using several text datasets. The experimental results show that the accuracy of the Euclidean-SVM text classifier depends only weakly on the choice of kernel function and soft margin parameter C.

Keywords  Text document classification · Support Vector Machine · Euclidean distance function · Kernel function · Soft margin parameter

L.H. Lee () · C.H. Wan
Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Bandar Barat, 31900 Kampar, Perak, Malaysia
e-mail: [email protected]
C.H. Wan, e-mail: [email protected]

R. Rajkumar · D. Isa
Intelligent Systems Research Group, Faculty of Engineering, The University of Nottingham, Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor, Malaysia
R. Rajkumar, e-mail: [email protected]
D. Isa, e-mail: [email protected]

1 Introduction

This paper presents a novel text document classification framework that uses the Support Vector Machine (SVM) approach in the training phase to identify the set of support vectors for each category, and uses the Euclidean distance function in the classification phase to compute the average distances between the testing data point and each of the sets of support vectors from the different categories. The classification decision is made based on the category which has the lowest average distance between its set of support vectors and the new data point, which makes the classification decision independent of the efficacy of the hyper-plane formed by the particular kernel function and soft margin parameter.

Text document classification denotes the task of automatically assigning collections of electronic text documents into their annotated categories, based on their contents and similarities. For some decades now, text document classification has been growing in importance due to the rapid growth of creating, editing, manipulating and storing text documents in digital form. In recent years, an increasing number of statistical and computational approaches have been developed for text classification, including k-nearest-neighbor classification [1, 2], Bayesian classification [3–12], support vector machines [2, 13–17], maximum entropy [18], decision tree induction [19], rule induction [20, 21], and artificial neural networks [22]. Besides the supervised classification approaches, unsupervised clustering techniques such as self-organizing maps [23, 24] have also been introduced for text document segmentation.

Over the past decade, the SVM has gained popularity in various types of classification applications and has been reported as one of the best performing classification approaches [2, 13–17, 25–34]. It can be used as a discriminative classifier and has been shown to be more accurate than most other classification models [15, 27, 28, 31, 32, 35, 36]. The good generalization characteristic of the SVM is due to the implementation of the Structural Risk Minimization (SRM) principle, which entails finding an optimal separating hyper-plane, thus guaranteeing a highly accurate classifier in most applications. Equation (1) represents a hyper-plane which can be used to partition data points in an SVM.

\mathbf{w} \cdot \mathbf{x} + b = 0 \qquad (1)

Figure 1 illustrates a linearly separable case, where data points of one category (represented by "◦") and data points of another category (represented by "•") are separated by the linear optimal separating hyper-plane (the solid straight line). There is actually an infinite number of hyper-planes which are able to partition the data points into two categories (the dashed lines in Fig. 1). According to the SVM methodology, however, there is just one optimal separating hyper-plane.


Fig. 1 Optimal separating hyper-plane

This optimal separating hyper-plane lies halfway within the maximal margin, where the margin is defined as the sum of the distances of the hyper-plane to the support vectors. In the case illustrated in Fig. 1, the margin is d_1 + d_2. The optimal separating hyper-plane is determined only by the closest data points of each category. These points are called Support Vectors (SVs). As only the SVs determine the optimal separating hyper-plane, there is a certain way to represent them for a given set of training points. It has been shown in [37] that the maximal margin can be found by minimizing \tfrac{1}{2}\|\mathbf{w}\|^{2}, as shown in (2).

\min \ \tfrac{1}{2}\|\mathbf{w}\|^{2} \qquad (2)

Therefore, the optimal separating hyper-plane can be configured by minimizing (2) under the constraint of (3), that the training data points are correctly separated.

y_i \cdot (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad \forall i \qquad (3)
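To make the quantities in (1)–(3) concrete, the following is a minimal, purely illustrative sketch showing how a linear SVM trained on toy data exposes the hyper-plane parameters and the support vectors. It assumes scikit-learn's SVC (itself a LIBSVM wrapper); the paper's own experiments instead use MATLAB with the LIBSVM toolbox, as described in Sect. 4.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two linearly separable clouds of points, one per category.
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]              # normal vector w of the hyper-plane in (1)
b = clf.intercept_[0]         # bias term b, so the surface is w . x + b = 0
svs = clf.support_vectors_    # the support vectors that determine the hyper-plane

print("w =", w, "b =", b)
print("support vectors per class:", clf.n_support_)
```

Only the points returned in `support_vectors_` determine the decision surface; the remaining training points could be discarded without changing it, which is exactly the property exploited later by the Euclidean-SVM framework.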

A more detailed discussion of the SVM has been presented in our previous work [14]. The concept of the optimal separating hyper-plane can be generalized to the non-linearly separable cases. One of the methods which can be used to partition non-linearly separable data points is the implementation of kernel functions. By implementing the kernel functions, the non-linearly separable data points are mapped from the original input space to a high dimensional feature space through a non-linear transformation \Phi, rather than fitting non-linear decision surfaces to the input space to separate the data points, as illustrated in Fig. 2. This ensures that a linear optimal separating hyper-plane can be generated in the new feature space to separate the data points. By introducing the kernel function as shown in (4), it is not necessary to know \Phi(\cdot) explicitly [38]. Hence, the optimization problem can be translated directly to the more general kernel version, as shown in (5).

K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle \qquad (4)



Fig. 2 Mapping data points into high dimensional feature space using kernel functions

Table 1 Common kernel functions for SVM with promising classification performance in most cases

  Kernel             Formula
  Linear             K(u, v) = u · v
  Sigmoid            K(u, v) = tanh(a u · v + b)
  Polynomial         K(u, v) = (1 + u · v)^d
  RBF                K(u, v) = exp(−a ||u − v||^2)
  Exponential RBF    K(u, v) = exp(−a ||u − v||)
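For reference, the kernels listed in Table 1 can be written down directly as functions of two vectors. The short sketch below is illustrative only; the parameters a, b and d follow the notation of the table.

```python
import numpy as np

# The kernels of Table 1 as plain functions of two vectors u and v.
def linear(u, v):
    return np.dot(u, v)

def sigmoid(u, v, a=1.0, b=0.0):
    return np.tanh(a * np.dot(u, v) + b)

def polynomial(u, v, d=3):
    return (1.0 + np.dot(u, v)) ** d

def rbf(u, v, a=1.0):
    return np.exp(-a * np.sum((u - v) ** 2))     # exp(-a ||u - v||^2)

def exponential_rbf(u, v, a=1.0):
    return np.exp(-a * np.linalg.norm(u - v))    # exp(-a ||u - v||)

u, v = np.array([1.0, 0.5]), np.array([0.2, -0.3])
for kernel in (linear, sigmoid, polynomial, rbf, exponential_rbf):
    print(kernel.__name__, kernel(u, v))
```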

W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j), \quad \text{subject to } C \ge \alpha_i \ge 0, \ \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (5)

The algorithm for non-linear classification is formally similar to that for linear classification, except that every dot product is replaced by a non-linear kernel function. This allows a linear optimal separating hyper-plane to be fitted in the high dimensional feature space, which may be non-linear in the original input space, as illustrated in Fig. 2. By mapping the data points into a high dimensional feature space using kernel functions, well-performing classification can be obtained, since this allows the SVM models to separate data points even with very complex boundaries. There could be an infinite number of kernel functions, but only certain kernel functions have been found to perform well in a wide variety of classification tasks. Table 1 shows some of the kernel functions which commonly perform well. Each of the kernel functions listed in Table 1 has its own properties and unique responses in handling different types of data. For example, the SVM model equipped with a sigmoid kernel function is equivalent to a two-layer perceptron neural network [38], while the SVM model using a Radial Basis Function (RBF) kernel is closely related to RBF neural networks and its feature space has infinite dimension. The selection of the kernel function for the SVM classification model is based on the requirements of the classification task and the pattern of the data point distribution. With an appropriate and optimal kernel function implemented in the SVM model, the classifier is able to handle high dimensional data relatively well, and the trade-off between classifier complexity and classification error can be controlled explicitly. Therefore, an appropriate implementation of the optimal kernel function is a necessity for the SVM classification framework in order to obtain optimal performance.

This characteristic causes the classification accuracy of the SVM to be highly dependent on the selection of the kernel function. This is due to the fact that the data point distribution may change in different high dimensional feature spaces, and the linear optimal separating hyper-plane can only be constructed after the data points have been mapped into the higher dimensional feature space using kernel functions. In other words, the decision surface of the SVM classifier for non-linearly separable cases is constructed based on the implementation of kernel functions. As a result, one of the critical problems of the conventional SVM classification approach is the selection of an appropriate kernel function, based on the varying characteristics of different datasets, in order to obtain high classification accuracy. There is no generally optimal kernel function which is able to guarantee good classification performance on all types of datasets of varying characteristics. In recent years, many research works have been carried out seeking solutions to this problem. However, there is no ultimate solution in the form of an all-round optimal kernel function which suits most SVM classification tasks on different datasets of varying characteristics.

Apart from the kernel functions, the performance of the SVM classifier is also heavily dependent on the soft margin parameter C. The parameter C controls the trade-off between the margin and the size of the slack variables [39]. It creates the soft margin that allows some classification errors, especially when non-separable points exist during the classification phase. If the value of C is small, the number of training errors will increase due to underfitting [17]. On the other hand, a large value of C will lead to overfitting, where a high penalty for non-separable points occurs [40], and the classifier will behave like a hard margin SVM [17].

As the SVM classification approach requires the selection of an appropriate combination of kernel function and parameters, the optimization of the kernel function and parameters has to be incorporated into the training phase of the SVM, in order to guarantee high accuracy of the classification task. Typically, convoluted computations are carried out to optimize the combination of kernel function and parameters for the SVM. This can be done by conducting an iterative cross-validation process to predict the best performing combination of kernel function and parameters for the trained SVM classifier, using a validation set. This method leads to a computationally intensive and time-consuming training process for the SVM, hence degrading the efficiency of the classifier. Furthermore, for certain cases in which the training samples are limited, there is a critical problem in preparing sufficient training and validation sets for the SVM to train the classifier and to conduct the kernel function and parameter optimization process.

In this paper, we propose an enhanced classification framework for text document classification, coined the Euclidean-SVM. This new classification framework is proposed by introducing the Euclidean distance measurement function to replace the optimal separating hyper-plane as the decision making function of the conventional SVM classification. In our proposed approach, the SVs for each of the categories are identified using the SVM training algorithm, and the SVs are then mapped into the original vector space to construct the trained Euclidean-SVM classifier. In the classification phase, an unlabeled new data point is mapped into the same original vector space and the average distances between the new data point and each set of SVs of the different categories are computed using the Euclidean distance function. The classification decision is made based on the category of SVs which has the lowest average distance to the new data point. With the classification decision making function using the Euclidean distance function instead of the optimal separating hyper-plane, the impact of the kernel function and soft margin parameter C on the accuracy of the SVM classifier can be minimized, hence contributing to a Euclidean-SVM text classification framework that is largely independent of the kernel function and soft margin parameter.

2 Related works

Although the SVM has been reported as one of the best performing machine learning approaches for classification, a critical problem of the SVM lies in determining the optimal combination of kernel function and parameters, in order to guarantee high efficiency and effectiveness of the classification tasks. Typically, the optimal combination of SVM kernel and parameters is determined by using computationally intensive grid search algorithms [41, 42]. This method varies different types of kernel functions and parameters through a wide range of values using geometric steps. The combination of kernel and parameters with the best cross-validation accuracy is selected.

In recent years, many research works have been carried out to solve the problem of automatically finding the most appropriate kernel function and parameters for the SVM in order to guarantee high accuracy of the classifier. Most of them perform the kernel and parameter optimization using evolutionary algorithms. These methods conduct iterative computation in order to configure the optimal set of kernel and parameters for the SVM, and they further complicate the already convoluted training process, leading to a high computational cost for the SVM during the training phase. As a result, the efficiency of the SVM classifier is severely degraded by having such methods determine the appropriate combination of kernel and parameters.

Quang et al. have presented an evolutionary algorithm, specifically a genetic algorithm, to optimize SVM parameters, including the kernel type, kernel parameters and upper bound C. This is an iterative process which repeats the crossover, mutation and selection procedures to produce the optimal set of parameters [43]. Friedrichs and Igel have also proposed an evolution strategy, specifically the Covariance Matrix Adaptation Evolution Strategy, for obtaining an optimal set of multiple hyper-parameters (kernel parameters and regularization parameter) for the SVM [44]. Briggs and Oates have introduced the idea of domain-specific composite kernels to SVM classification for better generalization ability compared to the base kernels [45]. An evolutionary algorithm was employed to search through a large number of composite kernels, and the hill climbing technique was chosen as the composite kernel search algorithm [45]. Dong et al. have also presented a genetic algorithm-based technique to select the optimal value of the cost parameter C and kernel parameters for the SVM, using cross-validation [46]. All the approaches mentioned above suffer from inefficiency due to the time-consuming iterative computation, as they employ evolutionary algorithms for the optimal kernel and parameter selection.

Avci has proposed a hybrid method of genetic algorithm and SVM, coined HGASVM, for automatic digital modulation classification. Avci has shown that the proposed HGASVM has better accuracy than SVM classifiers with randomly selected parameters, in the specified application of automatic digital modulation classification [47]. However, as the parameter optimization process is based on the genetic algorithm, the common problem of high time consumption is still a disadvantage of this approach. Zhang et al. have used a combination of simulated annealing and genetic algorithm to optimize the parameters for the SVM. This hybrid approach takes the advantages of both techniques to overcome the disadvantages of each. As a result, this hybrid technique has been shown to perform better than simulated annealing or a genetic algorithm alone in selecting the optimal kernel and parameters for the SVM [48]. Diosan et al. have proposed another hybridized framework of genetic programming and SVM to choose the most efficient expression of the kernel-of-kernels function and to select the optimal set of SVM hyper-parameters. This approach conducts an iterative process to optimize the SVM parameters and, due to the complexity of the kernel function, the computational complexity of the proposed algorithm is high, even higher than that of the evolutionary linear multiple kernel [49].
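To make the tuning burden described above concrete, the grid search of [41, 42] amounts to cross-validating every kernel/parameter combination and keeping the best one. A rough sketch follows; it uses scikit-learn and the 20 Newsgroups corpus purely as stand-ins, neither of which is used in the paper.

```python
# Illustrative grid search: every kernel/C combination is scored by
# cross-validation and the best-scoring combination is selected.
from sklearn.datasets import fetch_20newsgroups      # stand-in corpus only
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

pipe = make_pipeline(TfidfVectorizer(), SVC())
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svc__C": [1, 10, 100, 1000, 10000, 100000],
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(train.data, train.target)

print("best kernel/C combination:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```

The cost is evident from the structure: with four kernels, six values of C and five folds, 120 SVM models are trained before a single classifier is produced, which is the kind of inefficiency the Euclidean-SVM framework aims to avoid.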


Besides the evolutionary algorithms, there exist some approaches which determine the optimal kernel parameters using the distance between two classes in the feature space [50–52]. Sun et al. have proposed a method in which the training phase of the SVM and the iterative process of evaluating the performance of all the parameter combinations can be avoided [50, 51]. The optimal parameters can be determined with a sigmoid function. According to Sun et al., this method has good accuracy with the sigmoid function and drastically reduces the time for searching the optimal kernel parameters of the conventional SVM compared with other existing algorithms, since the iterative computation of selecting optimal parameters using evolutionary algorithms can be avoided [50, 51]. Wu and Wang have also proposed a similar kernel parameter selection approach which uses a data separation index (the inter-class distance in the feature space) to predict the optimal SVM kernel parameters [52]. However, neither of these methods optimizes parameter C; the training time of the SVM could be further reduced if the proposed methods incorporated parameter C into their optimization strategy.

As the existing kernel function and parameter optimization methods for the SVM involve convoluted and iterative computation, this problem has been considered and investigated by our group. It is therefore the goal of this paper to propose an enhanced SVM framework for text document classification whose performance depends only weakly on the implementation of different kernel functions and on parameter C.

3 The Euclidean-SVM text classification framework

We propose and implement a new text classification framework by introducing the Euclidean distance function to replace the optimal separating hyper-plane of the conventional SVM as the classification decision making function. We utilize the SVM training algorithm to reduce the training data points by identifying and retaining only the SVs and eliminating the rest of the training data points. In the classification phase, the Euclidean distance function is used to make the classification decision based on the average distance between the testing data point and each group of SVs from the different categories. We eliminate the use of the optimal separating hyper-plane as the decision surface of the conventional SVM approach, as the construction of the optimal separating hyper-plane is highly dependent on the kernel functions. In fact, the construction of the linear separating hyper-plane in the high dimensional feature space is based on the SVs and the kernel function, in which the kernel function is incorporated to map the data points into a high dimensional feature space so that the data points (specifically the SVs) become separable by a linear separating hyper-plane.

Table 2 Kernel functions and number of errors [56]

  Method                  Sample #   Error #
  ESTScan, closest ATG    2350       729
  Salzberg method         3312       1095
  SVM, Salzberg kernel    3312       530
  TISHunter               3312       13

Table 3 Kernel functions and number of support vectors [56]

  Kernel                      Average # of SVs
  Edit kernel I               2312
  Edit kernel II              2316
  Edit kernel III, SCM120     319
  Edit kernel III, SCM250     230
  Edit kernel III, ASCM120    507
  Edit kernel III, ASCM250    293
  Edit kernel III, PAM250     821

Table 4 Kernel functions and accuracy [53]

  Kernel        Best accuracy   Average accuracy
  Linear        56.04%          56.04%
  Polynomial    74.60%          66.81%
  RBF           70.34%          59.01%

This causes the kernel functions to have a high impact on the construction of the separating hyper-plane, and hence to affect the classification accuracy of the SVM. Previous research works have shown that the implementation of different kernel functions greatly influences the accuracy of the SVM classifier, as well as the number of SVs [53–59]. Tables 2, 3 and 4 show results from previous research works which investigated the impact of different kernel functions on the accuracy and the number of SVs of the SVM classification approach.

In this paper, we propose the utilization of the Euclidean distance function to replace the optimal separating hyper-plane as the decision making function of the SVM approach.

Prior to the training phase of our proposed Euclidean-SVM text classification framework, we apply a pre-processing approach to transform text documents into a representation format suitable for the SVM and the Euclidean-SVM, which is typically a numerical format. The text documents are pre-processed using the Bayesian vectorization technique [14, 60]. The Bayesian vectorization technique transforms each of the text documents in the dataset into a probability distribution in the vector space, by using the Bayesian formula. By applying the Bayesian vectorization technique to pre-process the text documents, the textual data are transformed into numerical format and the dimensionality of the data is greatly reduced from thousands of features (equal to the number of words in the document, as when using the Term Frequency Inverse Document Frequency (TFIDF) method to transform text into numerical form) to typically fewer than one hundred (the number of categories to which the document may be assigned). This transformation is a necessity for the Euclidean-SVM text classification framework, because the Euclidean-SVM approach is a classification framework based on the vector space model, which requires data in numerical format so that the data can be mapped into a vector space for both training and classification purposes. Besides this, as the Euclidean-SVM approach may suffer from high computational time in handling data with high dimensionality, due to its convoluted computation (requiring the computation of distances between the SVs and the input data points), the dimensionality reduction of the data (from the number of words to the number of available categories in the classification task) is also a crucial requirement for the Euclidean-SVM classification approach. The details of the Bayesian vectorization technique have been discussed in our previous works presented in [14, 60]. We have conducted an experiment to validate the performance of the Bayesian vectorization technique against TFIDF vectorization in the pre-processing of textual data for the SVM classifier and the Euclidean-SVM classifier. The results showed that the classifiers which use Bayesian vectorization as the textual data transformation technique outperformed the classifiers which use the TFIDF vectorization technique. The results of this experiment can be seen in Tables 16 and 17 in Sect. 4.6 of this paper.

During the training phase of the Euclidean-SVM classification framework, the conventional SVM training algorithm is used to map all the training data points into the vector space and identify the set of SVs for each of the categories. The construction of the optimal separating hyper-plane is still a necessity in order to identify the SVs, since the optimal separating hyper-plane lies halfway within the maximal margin, where the margin is defined as the sum of the distances of the hyper-plane to the SVs. Figure 3 illustrates the construction of the optimal separating hyper-plane in the vector space which separates the training data points of two different categories, after the conventional SVM training algorithm is applied. As illustrated in Fig. 3, there are two categories of training data points, represented by spheres and squares respectively. The black spheres represent the SVs of the category "Sphere" and the black squares represent the SVs of the category "Square". The optimal separating hyper-plane is constructed by maximizing the margin d_1 + d_2. However, the optimal separating hyper-plane is discarded in the classification phase, as it does not act as the decision surface in our proposed Euclidean-SVM classification framework.


Fig. 3 Vector space of the conventional SVM classifier with optimal separating hyper-plane

Fig. 4 Vector space of the Euclidean-SVM classifier with the Euclidean distance function as the classification decision making algorithm

Our proposal in this paper is to replace the optimal separating hyper-plane with the Euclidean distance function for making the classification decision. After the SVs for each of the categories have been identified, they are mapped into the original vector space and the rest of the training data points are eliminated. During the classification phase, a new unlabeled data point is mapped into the same vector space as the SVs, and the average distances between the new data point and each set of SVs of the different categories are computed using the Euclidean distance function. Figure 4 illustrates the vector space of the Euclidean-SVM classifier during the classification phase. The "Triangle" in Fig. 4 represents the new unlabeled data point to be classified. The distances between the new input data point and each of the SVs are computed. The Euclidean distance function is used to calculate the distance between two points, a new vector P and a support vector Q. Equation (6) gives the Euclidean distance formula, where p_i and q_i are the coordinates of P and Q in dimension i of an n-dimensional space.

D = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (6)

As illustrated in Fig. 4, D_1 and D_2 represent the Euclidean distances between the new data point and the SVs of category "Sphere", while D_3, D_4 and D_5 represent the Euclidean distances between the new data point and the SVs of category "Square". After obtaining the Euclidean distances between the new data point and each of the SVs from the different categories, the average distance of the new data point to the set of SVs of each of the categories is computed. This is done by adding up the Euclidean distances of the new data point to the SVs of the same category and dividing the sum by the total number of SVs of that particular category, as shown in (7). Based on the example illustrated in Fig. 4, the average distance of the new data point to the SVs of category "Sphere" is (D_1 + D_2)/2, and the average distance of the new data point to the SVs of category "Square" is (D_3 + D_4 + D_5)/3.

D_{\mathrm{avg}} = \frac{1}{N} \sum_{I=1}^{N} \left( \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \right)_{I} \qquad (7)

where N is the total number of SVs of the category under consideration. After computing the average distance of the new data point to the set of SVs of each of the categories, the classification decision is made based on the category which has the lowest average distance between its set of SVs and the new data point. In other words, the new input data point is labeled with the category whose SVs have the lowest average distance to the new data point.

The Euclidean-SVM classification approach and the k-Nearest Neighbor (k-NN) classification approach share some similarities, as both approaches map the training data points into a vector space and use a distance measurement technique to make the classification decision. Nevertheless, the Euclidean-SVM approach differs from the k-NN approach. The Euclidean-SVM approach makes the classification decision based on the category which has the shortest average Euclidean distance between its set of SVs and the new data point. The k-NN approach, on the other hand, assigns a testing data point to a particular category if it is the most frequent category among the k nearest training data points. Figure 5 illustrates the block diagram and Table 5 the algorithm of the Euclidean-SVM text classification approach.

Fig. 5 Block diagram of the Euclidean-SVM text classification framework

With the combination of the SVM training algorithm and the Euclidean distance function to make the classification decision, the impact of the kernel function and parameter C on the classification accuracy of the conventional SVM can be minimized. This is due to the fact that the optimal separating hyper-plane, whose construction is highly dependent on the kernel function, is replaced by the Euclidean distance function. Since the Euclidean distance function is able to perform its classification decision making task as long as both the training data points (support vectors) and the data points to be classified are mapped into the same vector space, the transformation of the existing vector space into a higher dimensional feature space by the kernel functions is not needed during the classification phase, and hence does not have a great impact on the classification performance.
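A small numerical sketch (illustrative only) of the decision rule in (6) and (7): given the support vectors of each category and a query point as arrays, the predicted label is simply the category with the smallest average Euclidean distance.

```python
import numpy as np

def average_distance(point, support_vectors):
    # Eq. (7): the mean of the Euclidean distances, Eq. (6), from point to each SV.
    diffs = support_vectors - point              # shape (N, n)
    dists = np.sqrt((diffs ** 2).sum(axis=1))    # Eq. (6) applied to every SV
    return dists.mean()

def euclidean_svm_decision(point, svs_by_category):
    # Label of the category whose SVs have the lowest average distance to point.
    return min(svs_by_category, key=lambda c: average_distance(point, svs_by_category[c]))

# Toy example mirroring Fig. 4: two SVs for "Sphere", three for "Square".
svs = {
    "Sphere": np.array([[0.2, 0.3], [0.4, 0.1]]),
    "Square": np.array([[0.9, 0.8], [1.0, 1.1], [0.8, 0.9]]),
}
print(euclidean_svm_decision(np.array([0.3, 0.2]), svs))   # -> Sphere
```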



Table 5 Algorithm of the Euclidean-SVM text classification framework in the pre-processing phase, training phase and classification phase

Pre-processing phase
1. Transform all the text documents (in both the training set and the testing set) into numerical format using the Bayesian vectorization technique.

Training phase
1. Map all the training data points into the vector space of an SVM.
2. Identify and obtain the set of support vectors for each of the categories using the SVM algorithm, and eliminate the rest of the training data points which are not identified as support vectors.
3. Map all the support vectors into the original vector space.

Classification phase
1. Map the new unlabeled data point into the same original vector space as the support vectors.
2. Use the Euclidean distance formula to calculate the average distances between the new data point and each of the sets of support vectors from the different categories.
3. Identify the category which has the lowest average distance between its set of support vectors and the new data point.
4. Generate the classification result for the new data point based on the identified category.
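Putting the phases of Table 5 together, the sketch below is an assumption-laden illustration rather than the authors' implementation: it uses scikit-learn's SVC (a LIBSVM wrapper) only to identify the per-category support vectors during training, and then applies the average-distance rule from the previous snippet during classification. The Bayesian vectorization step is assumed to have produced the numeric inputs already.

```python
import numpy as np
from sklearn.svm import SVC

class EuclideanSVM:
    """Sketch of Table 5: SVM training to find the SVs, Euclidean distances to classify."""

    def __init__(self, kernel="linear", C=1.0):
        self.kernel, self.C = kernel, C

    def fit(self, X, y):
        # Training phase: keep only the support vectors, grouped by category.
        svc = SVC(kernel=self.kernel, C=self.C).fit(X, y)
        sv_labels = y[svc.support_]          # category of each support vector
        self.svs_by_category_ = {
            c: svc.support_vectors_[sv_labels == c] for c in np.unique(y)
        }
        return self

    def predict(self, X):
        # Classification phase: the lowest average Euclidean distance wins.
        def avg_dist(x, svs):
            return np.sqrt(((svs - x) ** 2).sum(axis=1)).mean()
        categories = list(self.svs_by_category_)
        return np.array([
            min(categories, key=lambda c: avg_dist(x, self.svs_by_category_[c]))
            for x in X
        ])

# Hypothetical usage on already-vectorized documents (X_train, y_train, X_test assumed given):
# clf = EuclideanSVM(kernel="rbf", C=10).fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```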

The problem of selecting the right kernel function for the classifier therefore does not exist if the optimal separating hyper-plane is replaced by the Euclidean distance function. As a result, by integrating the SVM training algorithm and the Euclidean distance function into one classification framework, we obtain an enhanced Euclidean-SVM classifier whose accuracy is comparable to the conventional SVM, while being immune to the problem of determining the appropriate kernel function and parameter C.

4 Experimental results

Our proposed Euclidean-SVM text classification framework has been tested and evaluated using five text corpuses. Three of them were collected by our research group, namely the Vehicles dataset [8, 9, 14, 23], the Mathematics dataset [8, 9, 14], and the Automobiles dataset [8, 9, 14]. These three text datasets were constructed by collecting text articles from different sources, such as the Wikipedia website and the arxiv.org website.

The Vehicles dataset was built by acquiring vehicle-related articles from the Wikipedia website. This dataset consists of 4 categories of vehicles. All four categories are easily differentiated in terms of content, since each category has its own unique set of keywords. The list of categories of the Vehicles dataset is given in Table 6.

Table 6 List of categories of the vehicles dataset
  1. Aircrafts
  2. Boats
  3. Cars
  4. Train

A dataset containing articles about mathematical topics, namely the Mathematics dataset, was acquired by our research group from the arxiv.org website. This dataset consists of 8 mathematical sub-categories. The list of categories of the Mathematics dataset is shown in Table 7.

Table 7 List of categories of the mathematics dataset
  1. Algebraic Geometry
  2. Analysis of PDEs
  3. Combinatorics
  4. Differential Geometry
  5. Mathematical Physics
  6. Number Theory
  7. Probability
  8. Statistics

The Automobiles dataset was designed and organized by collecting articles about automobiles from the Wikipedia website. This dataset consists of nine categories of automobiles, differentiated in terms of geographical regions and classifications. Table 8 lists the categories of the Automobiles dataset.

Besides the three text corpuses that we constructed by acquiring documents from different sources and organizing them ourselves, we also acquired the WebKB dataset and the Reuters-21578 dataset for more generic evaluations of the performance of our proposed Euclidean-SVM text classification approach.



Table 8 List of categories of the automobiles dataset
  1. American Mini Vans
  2. American Sports Cars
  3. American SUVs
  4. Asian Mini Vans
  5. Asian Sports Cars
  6. Asian SUVs
  7. European Mini Vans
  8. European Sports Cars
  9. European SUVs

Table 9 List of categories of the WebKB dataset
  1. Course
  2. Faculty
  3. Project
  4. Student

Table 10 List of categories of the Reuters-21578 R8 dataset
  1. Acq
  2. Crude
  3. Earn
  4. Grain
  5. Interest
  6. Money-FX
  7. Ship
  8. Trade

The WebKB collection was originally constructed by the World Wide Knowledge Base (Web->Kb) Project of the CMU Text Learning Group, and this dataset has been widely used for experiments in text applications of machine learning techniques, such as text classification and text clustering [61]. Many research groups have used the WebKB dataset to evaluate the performance of their text classification approaches [62–66]. The original WebKB dataset consists of files collected from the computer science departments of various universities in 1997. These documents were manually classified into seven different categories: student, faculty, staff, department, course, project, and other. In the experiments here, the categories "Department" and "Staff" were discarded because there were only a few pages from each university. The category "Other" was also discarded because the documents in this category vary greatly from each other. The list of categories of the WebKB dataset used in the experiments carried out in this paper is given in Table 9.

The Reuters-21578 dataset was originally collected by Carnegie Group Inc. and Reuters Ltd. This text collection has been reported as one of the most common benchmarks for text classification approaches and has been widely used by text classification research groups in evaluating their classification models [1, 2, 6, 7, 10, 15, 17, 18, 35, 36]. This dataset consists of documents which appeared on the Reuters newswire in 1987 and were manually organized into categories by personnel from Reuters Ltd. There exist many versions of the Reuters-21578 text collection, due to the fact that different researchers have different evaluation criteria for their classification models. In our experiments, we adopted the Reuters-21578 R8 dataset, which is the set of the 8 categories with the highest number of positive training examples; this collection consists only of single-labeled text documents. The R8 version of the Reuters-21578 text collection consists of the categories listed in Table 10.

We conducted the experiments by implementing the conventional SVM classification approach and the Euclidean-SVM classification approach independently. Each of these approaches was evaluated with different kernel functions and different values of parameter C during the training phase. We implemented the conventional SVM classification approach using MATLAB version 7.6.0.324 (R2008a) with the LIBSVM toolbox version 2.91 [67]. For the Euclidean-SVM, we implemented the proposed classification approach using the same versions of MATLAB and the LIBSVM toolbox to identify the set of SVs of each of the categories, and we developed an additional module which computes the average Euclidean distances of the new data point to the set of SVs of each of the categories and makes the classification decision based on the category which has the lowest average distance between its SVs and the new data point.

In our experiments, we implemented the classification approaches with four common kernel functions for the SVM: the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel and the sigmoid kernel. For the parameter C, the values 1, 10^1, 10^2, 10^3, 10^4 and 10^5 were applied to both of the tested classifiers. By conducting the experiments on these two classification approaches separately with different kernel functions and different values of parameter C, we are able to evaluate the performance of each approach and to determine the improvement of the Euclidean-SVM approach (if any) over the conventional SVM model in terms of classification accuracy. Besides this, we are also able to evaluate the impact of the implementation of different kernel functions and parameter C on the conventional SVM approach, as well as on the Euclidean-SVM approach.

4.1 Experiment on the vehicles dataset

The Vehicles dataset consists of 4 categories with a total of 640 documents. Each category consists of 160 documents, of which 50 documents were used to build the training set and the remaining 110 documents were used for testing purposes.
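For orientation, the evaluation protocol described above (four kernels, six values of C, and the variance of accuracies across C for each kernel) can be sketched as a simple double loop. The snippet below is illustrative only; X_train, y_train, X_test and y_test stand for an already-vectorized dataset and are assumptions rather than the paper's data.

```python
import numpy as np
from sklearn.svm import SVC

KERNELS = ["linear", "poly", "rbf", "sigmoid"]
C_VALUES = [1, 10, 100, 1000, 10000, 100000]

def evaluate(make_classifier, X_train, y_train, X_test, y_test):
    """Accuracy for every kernel/C pair and the variance of accuracies across C."""
    for kernel in KERNELS:
        accuracies = []
        for C in C_VALUES:
            clf = make_classifier(kernel, C).fit(X_train, y_train)
            accuracies.append(100.0 * np.mean(clf.predict(X_test) == y_test))
        print(kernel, ["%.2f" % a for a in accuracies],
              "variance across C:", round(float(np.var(accuracies, ddof=1)), 4))

# Conventional SVM baseline; the Euclidean-SVM sketched in Sect. 3 could be passed
# in the same way, e.g. lambda k, C: EuclideanSVM(kernel=k, C=C).
# evaluate(lambda k, C: SVC(kernel=k, C=C), X_train, y_train, X_test, y_test)
```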



Table 11 Classification accuracies of the SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Vehicles dataset (training set: 200 documents; testing set: 440 documents)

  Classification approach       Classification accuracy (%) by value of soft margin parameter C         Variance of accuracies
  (kernel)                      C=1      C=10     C=100    C=1000   C=10000  C=100000                    across values of C
  SVM (Linear)                  93.75    93.33    92.92    92.92    92.92    92.92                       0.1201
  SVM (Polynomial)              92.50    92.50    92.50    92.50    92.50    92.50                       0
  SVM (RBF)                     25.00    25.00    25.00    25.00    25.00    25.00                       0
  SVM (Sigmoid)                 25.00    25.00    25.00    25.00    25.00    25.00                       0
  Euclidean-SVM (Linear)        93.75    93.33    93.75    94.17    94.17    94.17                       0.1176
  Euclidean-SVM (Polynomial)    94.17    94.17    94.17    94.17    94.17    94.17                       0
  Euclidean-SVM (RBF)           93.33    93.33    93.33    93.33    93.33    93.33                       0
  Euclidean-SVM (Sigmoid)       93.33    93.33    93.33    93.33    93.33    93.33                       0

In other words, the Vehicles dataset was split into a training set of 200 documents and a testing set of 440 documents. Table 11 shows the experimental results of the conventional SVM classifier and the Euclidean-SVM classifier, implemented with different kernels and different values of parameter C, on the Vehicles dataset.

As illustrated in Table 11, the performance of the conventional SVM classifier on the Vehicles dataset is highly dependent on the implementation of kernel functions. Both the linear kernel and the polynomial kernel give the SVM high classification accuracies, in the range of 92.50% to 93.75%. On the other hand, the SVM classifier with the RBF kernel and the SVM classifier with the sigmoid kernel performed poorly on the Vehicles dataset, with accuracies of 25.00%. This is because the implementation of an appropriate kernel function is a necessity for the SVM classifier to guarantee good generalization ability. The wrong choice of kernel function leads to seriously poor classification performance of the SVM. In other words, the implementation of kernel functions has a very high impact on the classification accuracy of the SVM classification approach.

As for the performance of the Euclidean-SVM classification approach on the Vehicles dataset, we obtained classification accuracies in the range of 93.33% to 94.17% with the implementation of different kernels and different values of parameter C. Hence, we can conclude that the Euclidean-SVM classification approach is nearly immune to the choice of kernel function and parameter C in obtaining good classification accuracy.

In this experiment, parameter C does not have a great impact on either the conventional SVM classifier or the Euclidean-SVM classifier. As illustrated in Table 11, the variances of classification accuracies across the tested values of parameter C for both of the tested classification approaches are only approximately 0.12. This is due to the fact that the data points of the different categories in this dataset are very dissimilar, as they are easily differentiated by their own unique features. Non-separable data points are hardly found during the classification phase. With a small number of non-separable points found in the classification phase, the effect of parameter C, which creates the soft margin that allows some classification errors when non-separable points occur, becomes minimal in the classification task.

4.2 Experiment on the mathematics dataset

The Mathematics dataset consists of 8 categories with a total of 320 documents. Each of the 8 categories consists of an equal number of 40 documents. 10 documents from each category were used to construct the training set, with a total of 80 documents, and the remaining 30 documents from each category were used to build the testing set of 240 documents. Table 12 shows the experimental results of the conventional SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Mathematics dataset.

Based on Table 12, we can observe that the implementation of different kernel functions affects the classification performance of the conventional SVM classifier on the Mathematics dataset. As illustrated in Table 12, both the linear kernel and the polynomial kernel give the SVM relatively high classification accuracies.



Table 12 Classification accuracies of the SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Mathematics dataset (training set: 80 documents; testing set: 240 documents)

  Classification approach       Classification accuracy (%) by value of soft margin parameter C         Variance of accuracies
  (kernel)                      C=1      C=10     C=100    C=1000   C=10000  C=100000                    across values of C
  SVM (Linear)                  61.25    74.17    74.17    74.17    74.17    74.17                       27.8211
  SVM (Polynomial)              70.00    70.00    70.00    70.42    70.00    70.00                       0.0290
  SVM (RBF)                     41.67    41.67    41.67    41.67    41.67    41.67                       0
  SVM (Sigmoid)                 12.50    12.50    12.50    12.50    12.50    12.50                       0
  Euclidean-SVM (Linear)        75.83    74.17    73.75    73.75    73.75    73.75                       0.6922
  Euclidean-SVM (Polynomial)    72.92    72.92    72.92    72.92    72.92    72.92                       0
  Euclidean-SVM (RBF)           75.83    75.83    75.83    75.83    75.83    75.83                       0
  Euclidean-SVM (Sigmoid)       75.83    75.83    75.83    75.83    75.83    75.83                       0

As the value of parameter C varies, the classification accuracies of the SVM with the linear kernel have a variance of 27.8211, while the classification accuracies of the SVM with the polynomial kernel have a variance of 0.029, which is more consistent than those of the SVM with the linear kernel. On the other hand, the SVM classifier with the RBF kernel and the SVM classifier with the sigmoid kernel performed poorly on the Mathematics dataset. As the value of parameter C varies, the classification performance of the SVM with the RBF kernel is consistent, with accuracies of 41.67%, while the SVM with the sigmoid kernel achieved poor performance with low but consistent accuracies of 12.50%. As is the nature of the conventional SVM, the inconsistency of the SVM classification performance in this experiment is due to the implementation of different kernel functions. In order to guarantee good generalization ability for the SVM, the determination of the right kernel function is mandatory in this experiment.

On the other hand, the Euclidean-SVM approach does not depend strongly on the implementation of kernel functions and parameter C. The Euclidean-SVM achieved accuracies in the range of 72.92% to 75.83% with the implementation of different kernels and different values of parameter C. In other words, the Euclidean-SVM classifier has better consistency in terms of accuracy, with the implementation of different kernel functions and different values of parameter C, as compared to the conventional SVM classifier.

4.3 Experiment on the automobiles dataset

The Automobiles dataset consists of 9 categories, each containing an equal number of 30 documents. In other words, this dataset consists of a total of 270 documents. 10 documents from each category were used to construct the training set of 90 documents, and the remaining 20 documents from each category were used to build the testing set of 180 documents. Table 13 shows the experimental results of the conventional SVM classifier and the Euclidean-SVM classifier, implemented with different kernels and different values of parameter C, on the Automobiles dataset.

Table 13 shows that the implementation of kernel functions and parameter C has a very high impact on the classification performance of the conventional SVM approach, while the Euclidean-SVM approach does not suffer from this problem. As illustrated in Table 13, the SVM classifier with the linear kernel achieved medium classification performance on the Automobiles dataset, with accuracies in the range of 56.11% to 68.89% (the variance of classification accuracies is 24.6998) as the value of parameter C varies from 1 to 10^5. The SVM classifier with the polynomial kernel achieved consistent but poor classification accuracies of 30.56% as the value of parameter C varies from 1 to 10^5. The SVM classifier with the RBF kernel and the SVM classifier with the sigmoid kernel achieved the lowest classification performance in this experiment, with consistent accuracies of 11.11% as the value of parameter C varies within the tested range. These results further confirm that the SVM is highly dependent on the implementation of kernel functions. The implementation of an inappropriate kernel function carries a high risk of obtaining low classification accuracy from the SVM classifier.

While the SVM suffers from the problem of being highly dependent on the implementation of the kernel function, the Euclidean-SVM achieved classification accuracies in the range of 59.44% to 67.78% (the variance of classification accuracies across the tested values of parameter C is 8.8515) with the implementation of the linear kernel and different values of parameter C.



Table 13 Classification accuracies of the SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Automobiles dataset (training set: 90 documents; testing set: 180 documents)

  Classification approach       Classification accuracy (%) by value of soft margin parameter C         Variance of accuracies
  (kernel)                      C=1      C=10     C=100    C=1000   C=10000  C=100000                    across values of C
  SVM (Linear)                  62.78    68.89    58.33    56.11    56.67    57.22                       24.6998
  SVM (Polynomial)              30.56    30.56    30.56    30.56    30.56    30.56                       0
  SVM (RBF)                     11.11    11.11    11.11    11.11    11.11    11.11                       0
  SVM (Sigmoid)                 11.11    11.11    11.11    11.11    11.11    11.11                       0
  Euclidean-SVM (Linear)        67.78    64.44    63.33    60.56    59.44    62.22                       8.8515
  Euclidean-SVM (Polynomial)    67.78    67.78    67.78    67.78    67.78    67.78                       0
  Euclidean-SVM (RBF)           62.78    62.22    62.22    62.22    62.22    62.22                       0
  Euclidean-SVM (Sigmoid)       67.78    67.78    67.78    67.78    67.78    67.78                       0

Table 14 Classification accuracies of the SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the WebKB dataset (training set: 2803 documents; testing set: 1396 documents)

  Classification approach       Classification accuracy (%) by value of soft margin parameter C         Variance of accuracies
  (kernel)                      C=1      C=10     C=100    C=1000   C=10000  C=100000                    across values of C
  SVM (Linear)                  48.60    72.68    73.11    74.12    74.33    73.90                       104.7953
  SVM (Polynomial)              38.99    38.99    38.99    38.99    73.62    73.33                       317.1325
  SVM (RBF)                     62.36    72.83    73.33    74.33    74.40    74.40                       22.4632
  SVM (Sigmoid)                 48.60    72.68    73.11    74.12    74.33    73.97                       104.9205
  Euclidean-SVM (Linear)        68.60    70.68    68.53    64.01    62.22    61.57                       14.5823
  Euclidean-SVM (Polynomial)    68.60    68.45    68.67    68.67    69.89    68.17                       0.3522
  Euclidean-SVM (RBF)           69.75    69.53    67.16    63.51    61.72    62.29                       13.0916
  Euclidean-SVM (Sigmoid)       68.60    70.68    68.53    64.01    62.22    61.57                       14.5823

Even though the gap between the highest accuracy and the lowest accuracy for the Euclidean-SVM classifier in this experiment is approximately 8%, the classification performance of the Euclidean-SVM approach is still considered to have a low dependence on the implementation of different kernel functions and different values of parameter C, as compared to the conventional SVM.

4.4 Experiment on the WebKB dataset

The WebKB dataset used in our experiments was acquired from Ana Cardoso-Cachopo's website [68]. It consists of 4 categories with a total of 4199 documents. The training set is constructed from 2803 documents, while the testing set consists of 1396 documents. Table 14 shows the experimental results of the conventional SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the WebKB dataset.

Table 14 again shows the inconsistency in terms of classification accuracy of the SVM with different kernel functions and different values of parameter C. Based on Table 14, we can observe that the SVM achieves high classification accuracies, of approximately 74%, for every kernel function, provided the value of parameter C is appropriate. However, an inappropriate value of parameter C severely degrades the accuracy of the SVM classifier to below 50% (48.60% for the SVM with the linear kernel, C = 1, and the SVM with the sigmoid kernel, C = 1) or even below 40% (38.99% for the SVM with the polynomial kernel, C = 1 to 1000).



On the other hand, the implementation of different kernel functions and different values of parameter C does not have a high impact on the Euclidean-SVM classification approach. The Euclidean-SVM achieved classification accuracies in the range of 61.57% to 70.68% with the implementation of different kernels and different values of parameter C. These results are considered consistent, even though the gap between the highest and the lowest classification accuracy in this experiment is approximately 9%, when compared to the conventional SVM, for which the gap between the highest and the lowest classification accuracy is approximately 35%. The results of this experiment further confirm that the Euclidean-SVM has a lower dependency on the implementation of kernel functions and the value of parameter C than the conventional SVM.

In this experiment, parameter C greatly affected the performance of the conventional SVM classifier with the different kernel functions. Based on the results in Table 14, high variances of classification accuracies across the tested values of parameter C were recorded for the conventional SVM, up to 317.13. This is due to the fact that the WebKB dataset consists of files collected from the computer science departments of various universities. Text documents in this dataset describe topics related to computer science at various universities. Therefore, text documents from the different categories are very similar to each other, and non-separable cases occur frequently during the classification phase. As a result, parameter C has a significant effect on the classification performance of the conventional SVM, since a soft margin is needed to allow classification errors caused by the non-separable cases. On the other hand, the Euclidean-SVM is not as sensitive as the conventional SVM to the implementation of parameter C, as low variances of classification accuracies across the tested values of parameter C (less than 15) were recorded in this experiment.

4.5 Experiment on the Reuters-21578 R8 dataset

The Reuters-21578 R8 dataset used in our experiments was acquired from Ana Cardoso-Cachopo's website [68], the same source from which the WebKB dataset was acquired. This collection consists of 7670 documents categorized into 8 categories. The documents in the collection were divided into a training set and a testing set, consisting of 5483 documents and 2187 documents respectively. Table 15 shows the experimental results of the conventional SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Reuters-21578 R8 dataset.

The results in Table 15 show that, compared to the conventional SVM approach, the Euclidean-SVM has better consistency in terms of classification accuracy over the implementation of different kernel functions and different values of parameter C. The conventional SVM classifier scored high accuracies in this experiment only when the appropriate combination of kernel function and parameter C was implemented. In this experiment, the best classification accuracy of 94.97% was achieved by the conventional SVM with the linear kernel function and parameter C set at 100000. However, the conventional SVM only achieved high classification accuracies (in the range of 87.75% to 94.97%) when the value of parameter C was high (C = 100000), and this does not apply to the polynomial kernel function, for which the SVM classifier scores its lowest accuracy of 49.52%.

Table 15 Classification accuracies of the SVM classifier and the Euclidean-SVM classifier with different kernels and different values of parameter C, on the Reuters-21578 R8 dataset (Training set: 5483 documents; Testing set: 2187 documents)

Classification approach        Classification accuracy (%) at soft margin parameter C      Variance of accuracies
(kernel)                       C=1      C=10     C=100    C=1000   C=10000  C=100000       across values of C
SVM (Linear)                   52.17    87.75    93.14    94.19    94.51    94.97          283.6759
SVM (Polynomial)               49.52    49.52    49.52    49.52    49.52    49.52          0
SVM (RBF)                      49.52    49.52    49.52    49.52    66.16    89.21          264.6682
SVM (Sigmoid)                  49.52    49.52    49.52    49.52    52.17    87.75          238.0053
Euclidean-SVM (Linear)         81.48    77.73    80.43    63.69    65.11    54.55          120.2538
Euclidean-SVM (Polynomial)     82.72    82.72    82.81    82.72    82.72    82.72          0.0014
Euclidean-SVM (RBF)            82.30    82.30    82.30    82.30    80.80    77.05          4.4438
Euclidean-SVM (Sigmoid)        84.73    84.73    82.58    81.48    81.48    77.73          6.7854
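The rightmost column of Table 15 appears consistent with the sample variance (n − 1 denominator) of the six accuracies in each row; for example, the value 283.6759 reported for the SVM with the linear kernel can be checked with the following small snippet (illustrative only, not part of the original experiments).

# Quick check of the "variance of accuracies across values of C" column of
# Table 15, using the sample variance of the six accuracies in one row.
from statistics import variance

svm_linear_accuracies = [52.17, 87.75, 93.14, 94.19, 94.51, 94.97]
print(round(variance(svm_linear_accuracies), 4))   # approximately 283.68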


As illustrated in Table 15, the SVM classifier with the linear kernel function obtained high accuracies (in the range of 87.75% to 94.97%) across the different values of parameter C, except when parameter C was set to 1. However, when an inappropriate kernel function is used, the classification performance of the SVM classifier is severely degraded and low classification accuracies are obtained. The SVM classifier with the polynomial kernel function achieved only the baseline accuracy of 49.52% in this experiment, even though the value of parameter C was varied from 1 to 100000. For the SVM classifiers with the RBF and sigmoid kernel functions, low classification accuracies were recorded unless the right value of parameter C was applied. The results of this experiment therefore confirm again that the performance of the conventional SVM classifier is highly dependent on the implementation of the kernel function and parameter C.

As for the Euclidean-SVM classification approach, even though its highest classification accuracy (84.73%, for the Euclidean-SVM with the sigmoid kernel and C = 1 to 10) is lower than the highest classification accuracy of the conventional SVM (94.97%, with the linear kernel function and parameter C = 100000), the overall consistency of the Euclidean-SVM's accuracy across different kernel functions and different values of parameter C is much better than that of the conventional SVM. As shown in Table 15, the highest variance of classification accuracy across the tested values of parameter C is 120.25 for the Euclidean-SVM, compared with 283.68 for the conventional SVM. For most combinations of kernel function and parameter C, the Euclidean-SVM achieved high classification accuracies (77.05% to 84.73%), except for the linear kernel function with parameter C in the range from 1000 to 100000. The conventional SVM approach, in contrast, achieved low classification accuracies for most combinations of kernel function and parameter C, except for the linear kernel function with parameter C in the range from 10 to 100000, the RBF kernel function with parameter C set to 100000, and the sigmoid kernel function with parameter C set to 100000.


These experimental results further justify that the Euclidean-SVM achieves better overall performance and a lower dependency on the implementation of kernel functions and the value of parameter C than the conventional SVM.

As illustrated in Table 15, the Euclidean-SVM achieved a low accuracy when the linear kernel function was applied and parameter C was set to 100000. This is because a high value of parameter C leads to overfitting [17] in the SVM training algorithm, so that fewer training data points are identified as SVs. In such a situation, the Euclidean-SVM lacks sufficient information for computing the average Euclidean distance between the input data points and each set of SVs from the different categories, and the classification performance is consequently degraded. This problem can be avoided by setting parameter C to a low value. Based on the results obtained in this experiment, values of parameter C in the range from 1 to 100 lead the Euclidean-SVM classification approach to good accuracies (approximately 80%), regardless of the kernel function implemented.

4.6 Comparison of the performance of the Bayesian vectorization technique and the TFIDF vectorization technique

In this paper, the Bayesian vectorization technique has been utilized to transform textual data into numerical format. The TFIDF (Term Frequency-Inverse Document Frequency) technique, on the other hand, has been reported as one of the most widely used pre-processing techniques for the same purpose by many text mining research groups. To verify that the enhancement of our proposed Euclidean-SVM classification framework is contributed by the implementation of the Euclidean distance function to replace the optimal separating hyper-plane of the conventional SVM classification approach, rather than by the pre-processing technique, we conducted an additional experiment using the Reuters-21578 R8 dataset to compare the performance of the Bayesian vectorization technique against the TFIDF vectorization technique as the pre-processing step for the SVM classifier.

Table 16 Comparison of the TFIDF-SVM classifiers and the Bayesian-SVM classifiers with different kernel functions (Dataset: Reuters-21578 R8; Training set: 5483 documents; Testing set: 2187 documents)

Vectorization technique-classifier    Accuracy of classifier using different kernels (%)
                                      Linear    Polynomial    RBF      Sigmoid
TFIDF-SVM                             90.29     80.14         90.58    90.29
Bayes-SVM                             94.97     92.87         94.97    94.92



The experimental results of this comparison are presented in Table 16. Based on the comparison conducted using the Reuters-21578 R8 dataset, the Bayesian-SVM classifiers always outperform the TFIDF-SVM classifiers, for all tested types of kernel function. The results presented in Table 16 indicate that the Bayesian vectorization technique provides a better transformation of textual data into numerical data for the SVM classifier than the TFIDF vectorization technique. In our previous work, presented in [14, 60], the SVM classifier with Bayesian vectorization has also been shown to perform better, in terms of classification accuracy and time consumption, than the SVM classifier with the TFIDF vectorization technique. These results show that the Bayesian vectorization technique contributes a more effective textual data transformation process for the SVM classifier than the TFIDF vectorization technique. While the Bayesian vectorization improves the classifier in the pre-processing stage, the Euclidean-SVM approach further improves the conventional SVM by replacing the optimal separating hyper-plane with the Euclidean distance function when making the classification decision.

We have also conducted an experiment on the Euclidean-SVM classification approach with the two different vectorization techniques, to further verify that the main contribution of the proposed Euclidean-SVM classification framework over the conventional SVM approach is delivered by the implementation of the Euclidean distance function replacing the optimal separating hyper-plane, rather than by the Bayesian vectorization pre-processing. Table 17 presents the comparison between the TFIDF-Euclidean-SVM classification approach and the Bayesian-Euclidean-SVM classification approach. As illustrated in Table 17, the TFIDF-Euclidean-SVM classifier performs poorly, with an accuracy of only 1.02%.

This is because the high dimensionality of the vectorized data produced by the TFIDF vectorization (approximately 18,000 dimensions, equal to the number of words in the text collection) leads to a high computational burden when the Euclidean-SVM approach makes its classification decision. Since the Euclidean distance function may suffer from the curse of dimensionality, high-dimensional data can severely degrade the effectiveness and efficiency of the classification process, and this results in the poor performance of the TFIDF-Euclidean-SVM classifier. On the other hand, the transformation of data by the Bayesian vectorization technique, which reduces the dimensionality of the data from thousands of dimensions to typically fewer than one hundred (the number of categories a document may be classified into), contributes to the better performance of the Euclidean-SVM classification approach, as the computational complexity of the Euclidean distance calculation is drastically reduced. The Bayesian-Euclidean-SVM classifier achieved a classification accuracy of 84.73%.

Besides this, the high dimensionality of the vectorized data produced by the TFIDF technique also results in higher training and testing time consumption for the Euclidean-SVM classifier. In this experiment, the TFIDF-Euclidean-SVM classifier recorded a training time of 29 seconds and a testing time of 27 hours 50 minutes and 32 seconds, while the Bayesian-Euclidean-SVM classifier recorded a training time of 6 seconds and a testing time of 2 hours 51 minutes and 29 seconds. The time consumption of the TFIDF-Euclidean-SVM classifier, especially the testing time, is much higher than that of the Bayesian-Euclidean-SVM classifier. This is again due to the high dimensionality of the vectorized data produced by the TFIDF vectorization, which leads to high computational complexity for the Euclidean-SVM classification approach. The system used to run both the TFIDF-Euclidean-SVM approach and the Bayesian-Euclidean-SVM approach was an Intel Core i3 CPU 550 at 3.2 GHz with 2 GB of RAM, running Windows 7 Home Basic 32-bit.

Based on the experimental results presented in this section, the Bayesian vectorization technique enhances the pre-processing stage of the conventional SVM approach, which typically implements the TFIDF vectorization technique. This has also been demonstrated in our previous works [14, 60]. In the classification phase, the Euclidean-SVM approach further improves the conventional SVM approach, obtaining better effectiveness and efficiency in performing classification tasks.
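To make the dimensionality argument concrete, the sketch below (illustrative only, and not the exact Bayesian vectorization pipeline of [14]; Naïve Bayes posterior probabilities are used here as a stand-in) contrasts a TFIDF representation, whose dimensionality equals the vocabulary size, with a category-posterior representation whose dimensionality equals the number of categories (8 for Reuters-21578 R8).

# Illustrative contrast between TFIDF vectors (one dimension per vocabulary
# term) and a Bayesian-style vectorization (one dimension per category,
# approximated here by Naive Bayes posterior probabilities).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["oil prices rise", "stocks fall sharply", "oil output cut", "stocks rally"]
train_labels = ["crude", "earn", "crude", "earn"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(train_docs)
print("TFIDF dimensionality:", X_tfidf.shape[1])            # vocabulary size (thousands on R8)

nb = MultinomialNB().fit(X_tfidf, train_labels)
X_bayes = nb.predict_proba(X_tfidf)                          # one column per category
print("Bayesian-style dimensionality:", X_bayes.shape[1])    # number of categories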

Table 17 Comparison of the TFIDF-Euclidean-SVM classifier and the Bayesian-Euclidean-SVM classifier (Dataset: Reuters-21578 R8; Training set: 5483 documents; Testing set: 2187 documents)

                                TFIDF-Euclidean-SVM    Bayesian-Euclidean-SVM
Classification accuracy (%)     1.02                   84.73
Training time (hh:mm:ss)        00:00:29               00:00:06
Testing time (hh:mm:ss)         27:50:32               02:51:29


In conclusion, the Bayesian vectorization technique and the Euclidean distance function provide a combination of enhancements to the conventional SVM approach which improves the performance of the baseline approach in terms of both classification accuracy and time consumption.

4.7 Discussion on the experimental results

Based on the results obtained from the series of experiments using different text datasets, we find that the performance of the Euclidean-SVM classification framework has a low dependency on the implementation of kernel functions and parameter C. For all five datasets used in our experiments, high classification accuracy can always be obtained by the Euclidean-SVM classifier with the linear kernel function and a small value of parameter C. In all of the classification tasks on the five datasets, when the linear kernel is used and parameter C is set to 1, the Euclidean-SVM classifier always outperforms the conventional SVM classifier. This shows that high accuracies can be obtained with the Euclidean-SVM approach without transforming the original vector space into a high dimensional feature space using kernel functions, because the Euclidean distance function can make effective classification decisions as long as all the training data points (the SVs) and the input data points are mapped into the same vector space.

Besides this, the selection of an optimal value of parameter C can also be avoided in the Euclidean-SVM classification approach. As shown by the experimental results, in most cases the variances of classification accuracy across different values of parameter C are much higher for the conventional SVM approach than for the Euclidean-SVM approach. This reiterates that the Euclidean-SVM approach is less dependent on the value of parameter C. Moreover, the Euclidean-SVM classification approach performs well with a small value of parameter C (based on our experiments, the optimal value for C is 1), whereas a small value of parameter C leads to underfitting for the conventional SVM classifier [17]. According to the SVM methodology, a small value of parameter C leads to more training data points being identified as SVs; this effect can be observed directly from a trained model, as sketched below. When more SVs are identified during the training phase, more information is available for the computation of the average Euclidean distance between the input data points and each set of SVs from the different categories, and therefore more accurate classification results are obtained by the Euclidean-SVM approach. In contrast to the conventional SVM classification approach, which requires iterative processes to determine the right kernel function and the appropriate value of parameter C, the convoluted and computationally intensive training process and the preparation of an additional validation set can be avoided by implementing the Euclidean-SVM classification approach.
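A quick way to observe the relationship between C and the number of SVs is sketched below; this assumes scikit-learn's SVC (which exposes the per-class support vector counts through its n_support_ attribute) and toy data, and is not part of the reported experiments.

# Sketch: observe how the number of support vectors changes with C
# (a smaller C gives a softer margin, so more training points become SVs).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=1)

for C in [1, 10, 100, 1000, 10000, 100000]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: total SVs = {clf.support_vectors_.shape[0]}, per class = {clf.n_support_}")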


In conclusion, our proposed Euclidean-SVM approach contributes to a more effective and efficient classification task, with the unique characteristic of being largely independent of the implementation of different kernel functions and different values of parameter C.

Besides this, in our experiments, when the SVs are identified using the SVM approach with certain kernel functions, such as the RBF kernel and the sigmoid kernel, the Euclidean-SVM approach drastically outperforms the conventional SVM approach, even though both approaches use the same SVs. This can be observed in the experiments on the Vehicles dataset, the Mathematics dataset and the Automobiles dataset. Based on our analysis, this is because in the conventional SVM classification phase, the testing data point is combined with the alpha (α_i) values of each support vector and summed in order to determine the category of the testing data point using (8) [37]:

F(x) = \operatorname{sign}\left( \sum_{i=1}^{l} y_i \alpha_i K(x, x_i) + b \right)    (8)

The α_i values play a significant role in the performance of the conventional SVM approach, and wrongly weighted SVs contribute strongly to misclassification. The hyper-plane resulting from a kernel function that is not optimized yields α_i values that are not accurate, and this leads to the low classification accuracy experienced by the conventional SVM classifier with non-optimal kernels. The Euclidean-SVM approach, on the other hand, computes the average distance between the testing data point and each set of support vectors from the different categories before making the classification decision. Hence, the distance and location of the testing data point relative to the SVs are given higher priority in the classification process, and there is no reliance on the α_i weight values. The average distance also does not change drastically when the SVs change due to kernel manipulation. By using the average Euclidean distance, the effect of the wrongly weighted SVs resulting from a non-optimal kernel function is diluted. This causes the accuracy of the Euclidean-SVM approach to be much higher and more consistent than that of the conventional SVM approach, even though the SVs used are the same for both approaches.

Another possible reason for this situation is the setting of a constant value for one of the kernel parameters of the SVM. Changing the value of this kernel parameter varies the construction of the optimal separating hyper-plane and the number of SVs obtained during the training phase, and hence results in different classification accuracies for the conventional SVM approach, as well as for the Euclidean-SVM approach.
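The averaged-distance decision rule described above can be summarized in a few lines. The sketch below is illustrative rather than the authors' implementation; the helper name euclidean_svm_predict and the grouping of SVs by category are assumptions made for clarity.

# A minimal sketch (not the authors' implementation) of the averaged
# Euclidean-distance decision rule: the test point is assigned to the
# category whose support vectors have the smallest average distance to it.
import numpy as np

def euclidean_svm_predict(x, svs_by_category):
    """x: 1-D feature vector; svs_by_category: dict {label: array of SVs, shape (n_sv, d)}."""
    avg_dist = {
        label: np.mean(np.linalg.norm(svs - x, axis=1))
        for label, svs in svs_by_category.items()
    }
    return min(avg_dist, key=avg_dist.get)

# The SVs themselves could be taken from a trained scikit-learn SVC, e.g.:
#   clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
#   svs_by_category = {
#       label: clf.support_vectors_[y_train[clf.support_] == label]
#       for label in np.unique(y_train)
#   }
# Per test document this rule costs on the order of n_SV * d distance
# operations, which is why a small C (many SVs) combined with
# high-dimensional TFIDF vectors makes classification slow.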


The key point here is that, in terms of classification accuracy, the Euclidean-SVM approach has better consistency than the conventional SVM approach across the range of values of parameter C and the types of kernel function.

As for the computational complexity of the Euclidean-SVM classification approach, since it contributes an enhanced SVM classification framework that is more independent of the implementation of kernel functions and parameters, the trade-off is that the classification time consumption of the Euclidean-SVM approach is higher than that of the conventional SVM. The classification complexity of the Euclidean-SVM approach depends on the dimensionality of the data points and on the number of SVs generated in the training phase. This is because the Euclidean-SVM approach inherits the characteristic of the nearest neighbor approach by calculating the distances between the new input data point and each set of SVs from the different categories using the Euclidean distance formula. Based on our experimental results, we found that the Euclidean-SVM classifier performs well when the value of parameter C is small, which leads to a high number of SVs being produced during the training phase. Due to this high number of SVs, a high computational time is consumed in the classification phase for calculating the average Euclidean distance between the input data points and each set of SVs from the different categories. The higher classification time consumption of our proposed Euclidean-SVM approach compared to the conventional SVM approach is reasonable, since the training time of the Euclidean-SVM model is much less than the training time of a conventional SVM classifier which implements evolutionary algorithms to determine the optimal combination of kernel functions and parameters. Besides this, the relatively consistent classification accuracy of the Euclidean-SVM model over all ranges of kernel functions and values of parameter C, compared to the conventional SVM, remains one of the outstanding characteristics of our proposed Euclidean-SVM classification framework.

5 Conclusion

A new text classification framework has been presented and described here. The Euclidean-SVM classification approach is shown to have a low dependency on the implementation of the kernel function and the soft margin parameter C. The classification accuracy of the Euclidean-SVM approach is relatively consistent across the implementation of different kernel functions and different values of parameter C, compared to the conventional SVM, whose classification accuracy is severely degraded by the implementation of an inappropriate kernel function and parameter C.


This is achieved through the implementation of the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision making function of the SVM. Unlike the optimal separating hyper-plane of the conventional SVM, whose construction is highly dependent on kernel functions, the Euclidean distance function can make effective classification decisions as long as all the training data points and the input data points lie in the same vector space. Hence, the issue of selecting the appropriate kernel function and parameter C can be avoided in the Euclidean-SVM classification framework. However, the classification phase of the Euclidean-SVM approach consumes more time than that of the conventional SVM. Besides this, for certain classification tasks where the similarity between categories is high, for example the WebKB dataset used in our experiments, the classification accuracy of the Euclidean-SVM approach is lower than that of the conventional SVM approach. This is because the Euclidean distance calculation, which inherits the characteristic of the nearest neighbor approach, may suffer from the curse of dimensionality, leading to less effective classification. As future work, we will investigate alternative distance and similarity measurement functions to replace the Euclidean distance function, which may reduce the time consumed by the distance or similarity calculation and provide a more accurate distance or similarity measurement between the SVs and the input data point, hence leading to a more effective and efficient SVM-based text classification framework.

References

1. Han EH, Karypis G, Kumar V (1999) Text categorization using weighted adjusted k-nearest neighbor classification. Technical Report, Department of Computer Science and Engineering, Army HPC Research Centre, University of Minnesota, Minneapolis, USA
2. He J, Tan AH, Tan CL (2003) On machine learning methods for Chinese document categorization. Appl Intell 18(3):311–322
3. Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD (2000) An experimental comparison of Naïve Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pp 160–167
4. Chen JN, Huang HK, Tian SF, Qu YL (2009) Feature selection for text classification with Naïve Bayes. Expert Syst Appl 36(3):5423–5435
5. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130
6. Eyheramendy S, Genkin A, Ju WH, Lewis D, Madigan D (2003) Sparse Bayesian classifiers for text categorization. Technical Report, Department of Statistics, Rutgers University, 2003. URL: http://www.stat.rutgers.edu/~madigan/PAPERS/jicrd-v13.pdf

7. Kim SB, Rim HC, Yook DS, Lim HS (2002) Effective methods for improving Naïve Bayes text classification. In: Proceedings of the 7th Pacific Rim international conference on artificial intelligence. Springer, Heidelberg, pp 414–423
8. Lee LH, Isa D, Choo WO, Chue WY (2010) Tournament structure ranking techniques for Bayesian text classification with highly similar categories. J Appl Sci—Asian Netw Sci Inf 10(13):1243–1254
9. Lee LH, Isa D (2010) Automatically computed document dependent weighting factor facility for Naïve Bayes classification. Expert Syst Appl 37(12):8471–8478
10. McCallum A, Nigam K (1998) A comparison of event models for Naïve Bayes text classification. In: AAAI-98 workshop on learning for text categorization, pp 41–48
11. O'Brien C, Vogel C (2003) Spam filters: Bayes vs. chi-squared. Letters vs. words. In: Proceedings of the 1st international symposium on information and communication technologies, pp 298–303
12. Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: AAAI-98 workshop on learning for text categorization, Madison, Wisconsin, pp 55–62
13. Diederich J, Kindermann J, Leopold E, Paass G (2003) Authorship attribution with support vector machines. Appl Intell 19(1–2):109–123
14. Isa D, Lee LH, Kallimani VP, Rajkumar R (2008) Text document pre-processing with the Bayes formula for classification using the support vector machine. IEEE Trans Knowl Data Eng 20(9):1264–1272
15. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML-98), pp 137–142
16. Joachims T (1999) Making large-scale SVM learning practical. In: Advances in kernel methods—support vector learning, pp 169–184
17. Joachims T (2002) Learning to classify text using Support Vector Machines. Kluwer Academic Publishers, Dordrecht
18. Nigam K, Lafferty J, McCallum A (1999) Using maximum entropy for text classification. In: Proceedings of the IJCAI-99 workshop on machine learning for information filtering, pp 61–67
19. Greiner R, Schaffer J (2001) AIxploratorium—decision trees. Department of Computing Science, University of Alberta, Edmonton, AB T6G 2H1, Canada. URL: http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees
20. Apte C, Damerau F, Weiss SM (1994) Automated learning of decision rules for text categorization. ACM Trans Inf Sys 12(3):233–251
21. Apte C, Damerau F, Weiss SM (1994) Towards language independent automated learning of text categorization models. In: Proceedings of the 17th annual international ACM-SIGIR conference on research and development in information retrieval, pp 23–30
22. Chen CM, Lee HM, Hwang CW (2005) A hierarchical neural network document classifier with linguistic feature selection. Appl Intell 23(3):5423–5435
23. Isa D, Kallimani VP, Lee LH (2009) Using self-organizing map for clustering of text document. Expert Syst Appl 36(5):9584–9591
24. Lee CH, Yang HC (2003) A multilingual text mining approach based on self-organizing maps. Appl Intell 18(3):295–310
25. Bosnic Z, Kononenko I (2008) Estimation of individual prediction reliability using the local sensitivity analysis. Appl Intell 29(3):187–203
26. Hao PY, Chiang JH, Lin YH (2009) A new maximal-margin spherical-structured multi-class support vector machine. Appl Intell 30(2):98–111
27. Kocsor A, Toth L (2004) Application of kernel-based feature space transformations and learning methods to phoneme classification. Appl Intell 21(2):129–142


28. Kyriacou E, Pattichis MS, Pattichis CS, Mavrommatis A, Christodoulou CI, Kakkos S, Nicolaides A (2009) Classification of atherosclerotic carotid plaques using morphological analysis on ultrasound images. Appl Intell 30(1):3–23
29. Li YM, Lai CY, Kao CP (2011) Building a qualitative recruitment system via SVM with MCDM approach. Appl Intell 35(1):75–88
30. Li C, Liu K, Wang H (2011) The incremental learning algorithm with support vector machine based on hyperplane-distance. Appl Intell 34(1):19–27
31. Maglogiannis I, Zafiropoulos E, Anagnostopoulos I (2009) An intelligent system for automated breast cancer diagnosis and prognosis using svm based classifiers. Appl Intell 30(1):24–36
32. Mahmoud SA, Al-Khatib WG (2010) Recognition of Arabic (Indian) bank check digits using log-Gabor filters. Appl Intell. doi:10.1007/s10489-010-0235-2
33. Maudes J, Rodriguez JJ, Garcia-Osorio C, Pardo C (2011) Random projections for linear SVM ensembles. Appl Intell 34(3):347–359
34. Yu B, Yang Z (2009) A dynamic holding strategy in public transit systems with real-time information. Appl Intell 31(1):69–80
35. Chakrabarti S, Roy S, Soundalgekar MV (2003) Fast and accurate text classification via multiple linear discriminant projection. VLDB J 12(2):170–185
36. Yang YM, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'99), pp 42–49
37. Haykin S (1999) Neural network, a comprehensive foundation, 2nd edn. Prentice Hall, New York
38. Burges CJC (1998) A tutorial on Support Vector Machines for pattern recognition. Bell Laboratories, Lucent Technologies. Data Mining and Knowledge Discovery. URL: http://research.microsoft.com/~cburges/papers/SVMTutorial.pdf
39. Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
40. Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge
41. Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425
42. Staelin C (2003) Parameter selection for Support Vector Machines. Technical Report HPL-2002-354R1, Hewlett Packard Laboratories
43. Quang AT, Zhang QL, Li X (2002) Evolving Support Vector Machine parameters. In: Proceedings of the 1st international conference on machine learning and cybernetics, pp 548–551
44. Friedrichs F, Igel C (2004) Evolutionary tuning of multiple SVM parameters. In: Proceedings of European symposium on artificial neural networks (ESANN'2004), pp 519–524
45. Briggs T, Oates T (2005) Discovering domain-specific composite kernels. In: Proceedings of the 20th national conference of artificial intelligence. AAAI Press, Menlo Park, pp 732–738
46. Dong Y, Xia Z, Tu M (2007) Selecting optimal parameters in Support Vector Machines. In: Proceedings of the IEEE 6th international conference on machine learning and applications (ICMLA07)
47. Avci E (2009) Selecting of the optimal feature subset and kernel parameters in digital modulation classification by using hybrid genetic algorithm-support vector machines: HGASVM. Expert Syst Appl 36(2):1391–1402
48. Zhang Q, Shan G, Duan X, Zhang Z (2009) Parameters optimization of Support Vector Machine based on simulated annealing and genetic algorithm. In: Proceedings of the IEEE international conference on robotics and biomimetics, pp 1302–1306
49. Diosan L, Rogozan A, Pecuchet JP (2010) Improving classification performance of Support Vector Machine by genetically optimising kernel shape and hyper-parameters. Appl Intell. doi:10.1007/s10489-010-0260-1

50. Sun J (2008) Fast tuning of SVM kernel parameter using distance between two classes. In: Proceedings of the 3rd international conference on intelligent system and knowledge engineering (ISKE2008), pp 108–113
51. Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance between two classes for tuning SVM hyperparameters. IEEE Trans Neural Netw 21(2):305–318
52. Wu KP, Wang SD (2009) Choosing the kernel parameters for Support Vector Machines by the inter-cluster distance in the feature space. Pattern Recognit 42(5):710–717
53. Buck TAE, Zhang B (2006) SVM kernel optimization: an example in yeast protein subcellular localization prediction. Project Report, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
54. Doniger S, Hofmann T, Yeh J (2002) Predicting CNS permeability of drugs molecules: comparison of neural network and Support Vector Machines algorithms. J Comput Biol 9(6):849–864
55. Kim H, Cha S (2005) Empirical evaluation of SVM-based masquerade detection using UNIX commands. Comput Secur 24(2):160–168
56. Li H, Jiang T (2004) A class of edit kernels for SVMs to predict translation initiation in eukaryotic mRNAs. In: Proceedings of the 8th annual international conference on research in computational molecular biology, pp 262–271
57. Lu M, P Chen L, Huo J, Wang X (2008) Optimization of combined kernel function for SVM based on large margin learning theory. In: Proceedings of the IEEE international conference on systems, man and cybernetics (SMC 2008), pp 353–358
58. Schölkopf B, Burges CJC, Smola AJ (1999) Advances in kernel methods: support vector learning. MIT Press, Cambridge
59. Yuan R, Li Z, Guan X, Xu L (2010) An SVM-based machine learning method for accurate Internet traffic classification. Inf Syst Front 12(2):149–156
60. Lee LH, Rajkumar R, Isa D (2010) Automatic folder allocation system using Bayesian-support Vector Machines hybrid classification approach. Appl Intell. doi:10.1007/s10489-010-0261-0
61. Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to construct knowledge bases from the World Wide Web. In: Proceedings of the 15th national conference for artificial intelligence, pp 509–516
62. Callut J, Franscoisse K, Saerens M, Dupont P (2008) Semi-supervised classification from discriminative random walks. In: Proceedings of the 2008 European conference on machine learning and knowledge discovery in databases—Part 1 (ECML PKDD '08), pp 162–177
63. Ko Y, Seo J (2009) Text classification from unlabeled documents with bootstrapping and feature projection techniques. Inf Process Manag 45(1):70–83
64. Li T, Zhu S, Ogihara M (2008) Text categorization via generalized discriminant analysis. Inf Process Manag 44(5):1684–1697
65. Xue XB, Zhou ZH (2009) Distributional features for text categorization. IEEE Trans Knowl Data Eng 21(3):428–442
66. Zhang D, Mao R (2008) A new kernel for classification of networked entities. In: Proceedings of the 6th international workshop on mining and learning with graphs, Helsinki, Finland
67. Chang C, Lin C (2001) LIBSVM: a library for support vector machines. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm
68. Cardoso-Cachopo A (2011) Datasets for single label text categorization. Artificial Intelligence Group, Department of Information Systems and Computer Science, Instituto Superior Tecnico, Portugal. URL: http://web.ist.utl.pt/~acardoso/datasets/

Lam Hong Lee received a Bachelor of Computer Science from Universiti Putra Malaysia in 2004, and a Ph.D. degree in Computer Science from the University of Nottingham in 2009. He joined Universiti Tunku Abdul Rahman, Malaysia, in March 2009 as an assistant professor. His current research interest lies in improving text categorization using AI techniques, specifically Support Vector Machines. Besides this, he is also investigating the implementation of data mining, pattern recognition and machine learning techniques in various kinds of intelligent systems.

Chin Heng Wan received his Bachelor of Information System (Hons) in Information System Engineering from Universiti Tunku Abdul Rahman, Malaysia, in 2005. He is currently pursuing a Master of Computer Science in Universiti Tunku Abdul Rahman, Malaysia. His research interests are in Artificial Intelligence, Machine Learning, Pattern Recognition, Text Mining, and Intelligent Systems.

Rajprasad Rajkumar received his Ph.D. degree and Master's degree from the University of Nottingham in 2011 and 2005 respectively, and a Bachelor of Engineering in Electrical and Electronic from Universiti Tenaga Nasional in 2003. He is currently working at the Department of Electrical and Electronic Engineering, University of Nottingham Malaysia Campus, as an Assistant Professor. His main research interests are in the use of support vector machines and signal processing techniques in various domains. He is currently working in the areas of non-destructive testing, remote sensing and text document classification, and on developing unsupervised learning techniques in real-time systems.

Dino Isa is a Professor in the Department of Electrical and Electronics Engineering of the University of Nottingham, Malaysia Campus. He obtained a BSEE (Hons) from the University of Tennessee, USA in 1986 and a Ph.D. from the University of Nottingham, UK in 1991. Following his Ph.D., he was employed as Engineering Section Head in Motorola's Power Products Division in Seremban, Malaysia. Subsequently he was recruited as Plant Manager and then promoted to Chief Technology Officer of Crystal Clear Technology (CCT), a subsidiary of the Malaysian government's investment arm, Khazanah Nasional Berhad. He spent three years in the Westlake Village facility in California as Director of Operations in the R&D phase of the project before resuming his duties in CCT Malaysia. He joined the University of Nottingham in 2001. To date Prof. Isa has won five research contracts worth RM 6,500,000 (£ 1,000,000) while at the University.


His research interest lies in the application of Machine Learning techniques for various kinds of problems. The main aim of his research is to formulate strategies which lead to the successful implementations of “Intelligent Systems” in various domains.