Effect of Feature Selection Methods on Machine Learning Classifiers for Detecting Email Spams

Shrawan Kumar Trivedi
Indian Institute of Management
Prabandh Shikhar, Rau
Indore – 453331, India
[email protected]

Shubhamoy Dey
Indian Institute of Management
Prabandh Shikhar, Rau
Indore – 453331, India
[email protected]

ABSTRACT

This research presents the effects of using features selected by two feature selection methods, Genetic Search and Greedy Stepwise Search, on popular Machine Learning Classifiers: Bayesian, Naive Bayes, Support Vector Machine and a Genetic Algorithm based classifier. Tests were performed on two publicly available spam email datasets: "Enron" and "SpamAssassin". Results show that Greedy Stepwise Search is a good feature selection method for spam email detection. Among the Machine Learning Classifiers, Support Vector Machine was found to be the best, both in terms of accuracy and False Positive rate.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology – classifier design and evaluation, feature evaluation and selection; I.2.7 [Natural Language Processing] – Text analysis

General Terms

Algorithms, Performance, Experimentation.

Keywords

Email spam classification, Feature selection, Evolutionary algorithms, False Positive Rate, Classification Accuracy.

1. INTRODUCTION

In today's automated world, email is a necessary and useful tool enabling rapid and inexpensive communication. It has become a popular medium and can be seen as an essential part of life [1]. On the other hand, Spam (also known as unsolicited bulk email) has turned into a serious challenge because its volume is increasing day by day. A study estimates that 70% of business emails are Spam. This rapid growth causes serious problems, such as unnecessary filling of users' mailboxes, engulfing of important emails, and consumption of storage space and bandwidth, as well as of the time required to segregate them [2].

Spam classification is becoming increasingly challenging due to the complexity that spammers introduce into the features of Spam. Complexity can be defined as modifications made to spam words that make a feature difficult to recognize. Some attacks, such as Tokenisation (splitting or modifying features, e.g. 'free' written as 'f r 3 3') and Obfuscation (hiding a feature with HTML or other codes, e.g. 'free' coded as 'frexe' or 'FR3E'), change the information carried by a feature [3, 4].

Various Machine Learning Classifiers have been experimented with to tackle these problems, and some have demonstrated their strength in Spam classification. In particular, Support Vector Machines (SVM), Probabilistic Classifiers (Bayesian and Naive Bayes) and Evolutionary Classifiers (Genetic classifiers) have proven effective in this area of application. SVM [3, 5] uses the concept of "Statistical Learning Theory" proposed by Vapnik [6]. Probabilistic classifiers such as Naive Bayes [7, 8] and the Bayesian Classifier [9, 10, 11], based on Bayes' Theorem, are also popular. Evolutionary Classifiers, based on the principles of evolution, have been intensely researched.

In this study, the three types of classifiers mentioned above have been tested on two well-known, publicly available datasets, Enron [3, 11] and SpamAssassin [12, 10], to evaluate their efficacy when used in conjunction with two wrapper-style feature selection techniques, Genetic search and Greedy Stepwise search, which are used to obtain a small subset of the most informative features from each dataset. A comparative analysis of the performance (in terms of accuracy) of the various combinations of feature selection techniques and classifiers is presented.

The later sections of this paper are structured as follows: Section 2 summarizes related work, Section 3 describes the Methodology used in this research, Section 4 describes the Experimental setup and Evaluation, Section 5 presents the comparative analysis, and Section 6 concludes the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RACS’13, October 1–4, 2013, Montreal, QC, Canada. Copyright 2013 ACM 978-1-4503-2348-2/13/10 …$15.00.

2. RELATED WORK

Recently, various applications of Text Classification have generated substantial interest, and various classifiers and feature selection methods have been tested and reported in the literature. This study focuses on the Spam Email Classification application of Text Classification, for which a good amount of research literature is available.

The strength of Bayesian classifiers has been reported in the literature (e.g. by Lewis [13]). An extended Naive Bayes approach was experimented with in a series of studies by Androutsopoulos et al. in 2000 [7]. That research evaluated the performance of classifiers with varying numbers of features and data sizes, and its results strongly demonstrate the strengths of the proposed probabilistic classifier. A recent study by Trivedi & Dey, 2013 [11] used boosting algorithms to improve the performance of probabilistic classifiers, and showed that probabilistic classifiers work effectively with boosting even when the number of features used is low.

Support Vector Machines have been extensively researched and occupy a place of prominence among classifiers. A study by Drucker et al. [14] compares the performance of SVM with various machine learning classifiers. The results favoured SVM and boosted decision trees in terms of accuracy and speed; however, the training time of SVM was less than that of the boosted decision tree.

Evolutionary algorithm based classifiers such as Genetic Classifiers [15, 17] are continuously experimented with by researchers, as their operators, such as the Selection operator (searching for the fittest individuals) and the Reproduction operator (combining and altering individuals to obtain new individuals), have been shown to be good at complex classification tasks.

The table below summarizes the existing literature, where various classifiers have been tested on different datasets.

Table 1. Literature Survey

| Author(s) (Year) | Model | Data Source/Dataset | Accuracy (%) |
|---|---|---|---|
| Androutsopoulos et al. (2000) [7] | NB | Ling Spam corpus | 83 to 99 |
| Matthew Woitaszek et al. (2003) [5] | SVM | RIT's ITS help desk spam, with publicly available Ham at a 50% rate | 96.69 |
| Bo Liu et al. (2003) [15] | Genetic classifier with boosting | Wisconsin breast cancer dataset and Tic-tac-toe dataset | 77.5 and 69.9 |
| Metsis et al. (2006) [8] | Five types of NB compared | Enron dataset with different compositions | 90.5 to 96.6 |
| D. Sculley and Gabriel M. Wachman (2007) [12] | Relaxed Online SVMs | SpamAssassin | 93.1–94.9 |
| Chen (2008) [9] | Bayesian classification | PU1, PU2 corpus | 92.8–96.2 |
| Jiang Hua Li and Wang Ping (2009) [16] | Improved Genetic Classifier | 100 Spam on a dataset | 95 |
| W.A. Awad and S.M. Elseuofi (2011) [10] | Bayesian, SVM and various ML classifiers | SpamAssassin | 97.42–99.46 |
| Zhixiang (Eddie) Xu et al. (2012) [17] | Greedy Miser | Yahoo dataset and 15 image datasets | 69 and 83 approx. |
| Trivedi S and Dey S (2013) [3] | SVM with different Kernels | Enron Email dataset | 98.5 |
| Trivedi S and Dey S (2013) [11] | Probabilistic classifiers with Boosting | Enron Email dataset | 92.9 |

3. METHODOLOGY

3.1 Genetic Algorithm based Classifier

These algorithms use a learning approach based on the principles of natural selection introduced by Holland [18]. Initially, a Genetic Algorithm starts with a constant population of individuals to search a sample of the space. Each individual of the population is evaluated for its fitness. Thereafter, new individuals are produced by selecting the best performing individuals, who produce "offspring" [19]. The offspring retain the characteristics of their parents and generate a population with improved fitness.

The process of generating new individuals is carried out by two significant Genetic operators, "Crossover" and "Mutation". The Crossover operator randomly selects a point in two parent gene structures and develops two new individuals by exchanging the remaining parts of the parents; it thus formulates two new individuals with potentially improved fitness by combining two old individuals. The Mutation operator, on the other hand, creates a new individual by arbitrarily altering some component of an old individual. It acts like a population perturbation operator, introducing potentially new information into the population, and also helps to stave off any stagnation that can arise during the search process.
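To make the roles of these operators concrete, the following is a minimal, self-contained sketch of such an evolutionary loop. It is an illustration under our own assumptions (bit-string chromosomes, a caller-supplied fitness function, illustrative population size and rates), not the authors' implementation.

```python
import random

def evolve(fitness, n_bits, pop_size=50, generations=100,
           crossover_rate=0.8, mutation_rate=0.01):
    # Initial population of random chromosomes (bit strings).
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        # Selection: the fitter half survives and reproduces.
        parents = ranked[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            if random.random() < crossover_rate:
                # Crossover: swap tails at a random cut point.
                cut = random.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if random.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```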

3.2 Probabilistic Classifiers

This idea was proposed by Lewis in 1998 [13], who introduced the term $P(c_i \mid d_j)$, defined as the probability that a document $d_j$, represented by a vector $d_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$ of terms, falls within a certain category $c_i$. This probability is calculated by the Bayes theorem:

$$P(c_i \mid d_j) = \frac{P(c_i)\, P(d_j \mid c_i)}{P(d_j)} \qquad (1)$$

where $P(d_j)$ symbolizes the probability that an arbitrarily selected document is represented by the document vector $d_j$, and $P(c_i)$ is the probability that an arbitrarily selected document belongs to a particular class $c_i$. The classification method discussed here is usually known as "Bayesian Classification".

The Bayesian Classifier is a popular technique, but it has been shown to have limitations when the dimension of the document vector $d_j$ is high. This limitation is tackled by the assumption that any two arbitrarily selected components (tokens) of the document vector $d_j$ are independent of each other. This assumption is formalized by the equation

$$P(d_j \mid c_i) = \prod_{l=1}^{n} P(w_{lj} \mid c_i) \qquad (2)$$

This assumption is used in the classifier named "Naive Bayes", which is quite popular in the area of Text Mining.
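To make equations (1) and (2) concrete, here is a minimal Bernoulli-style Naive Bayes sketch over a binary term-document matrix, with add-one smoothing. It is an illustration under our own assumptions, not the authors' implementation.

```python
import numpy as np

def train_nb(X, y):
    """X: (n_docs, n_terms) binary numpy matrix; y: 0 = Ham, 1 = Spam."""
    priors, likelihoods = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                  # P(c_i)
        # P(w_l | c_i) with add-one smoothing to avoid zero factors.
        likelihoods[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return priors, likelihoods

def predict_nb(priors, likelihoods, x):
    """Score each class by log P(c_i) + sum over terms of log P(w_lj | c_i)."""
    scores = {}
    for c in (0, 1):
        p = likelihoods[c]
        # Bernoulli model: presence contributes p, absence (1 - p).
        log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        scores[c] = np.log(priors[c]) + log_lik
    return max(scores, key=scores.get)
```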

3.3 Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a popular category of Machine Learning Classifier. It takes its inspiration from Statistical Learning Theory and the Structural Risk Minimization principle [6]. Due to its strength in dealing with high dimensional data through the use of Kernel Functions, it is one of the most widely accepted classifiers in this area. The basic concept of SVM is to separate the classes (positive and negative) by the maximum margin produced by a hyper-plane.

Let us take a training sample $X = \{x_i, y_i\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$ is the class of the $i$th training sample. In this research, $+1$ denotes SPAM mails (unsolicited emails) and $-1$ denotes HAM (legitimate mails). The final output of the classifier is determined by the following equation:

$$y = w \cdot x - b \qquad (3)$$

where $y$ indicates the final output of the classifier, $w$ is the normal vector to the separating hyper-plane in the feature space of $x$, and $b$ is the bias parameter determined by the training procedure. The following optimization problem is used to maximize the separation between the classes:

$$\text{minimize} \quad \frac{1}{2}\lVert w \rVert^2 \qquad (4)$$

$$\text{subject to} \quad y_i (w \cdot x_i - b) \ge 1, \; \forall i \qquad (5)$$
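As a concrete illustration, the sketch below uses scikit-learn's LinearSVC, which solves the soft-margin variant of the optimization in equations (4)–(5); scikit-learn is our choice here (the paper used JAVA/MATLAB environments), and X_train, y_train, X_test are hypothetical placeholders for the term-document matrices and ±1 labels described above.

```python
from sklearn.svm import LinearSVC

# Linear SVM in the spirit of equations (3)-(5); C trades margin
# width against training errors (soft margin).
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)    # learns w (clf.coef_) and b (clf.intercept_)
pred = clf.predict(X_test)   # sign of the decision function picks the class
```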

3.4 Feature Selection Methods

A number of studies have been reported in the literature on searching the feature space for the best subset of features for use with machine learning algorithms [17, 11]. In this research, Genetic search and Greedy Stepwise search are considered.

3.4.1 Genetic Search

Genetic search is based on the Darwinian theory of survival of the fittest. This method simulates the evolutionary processes occurring in nature with the help of three fundamental genetic operators, i.e. Selection, Crossover and Mutation, applied to the chromosomes representing the features. The Selection operator works by selecting the fittest individuals of the current population for reproduction. Reproduction is done by applying the two operators, Crossover and Mutation, to the parent genes to generate novel solutions.

Basic steps of the GA feature search:

1. Produce an arbitrary population of n chromosomes (feature subsets).
2. Evaluate the fitness of each chromosome.
3. Iterate until the required N number of chromosomes is obtained:
   I. Selection: pick two chromosomes.
   II. Crossover: combine the properties of the parent chromosomes to generate offspring.
   III. Mutation: mutate the offspring with a predefined mutation probability.
   IV. Fitness: calculate the fitness of the mutated offspring.
   V. Update: place the mutated offspring in the population.
   VI. Evaluation: if the fitness is satisfactory:
      a. keep this offspring;
      b. produce a new population of chromosomes and calculate new offspring.
4. Return: the N chromosomes found (feature wrapper set).

3.4.2 Greedy Stepwise Search

This method works as an iterative process in which, at each step, the candidate features are evaluated and the single most informative feature is selected and added to the model. Evaluation is performed with the help of stepwise regression. Selection can be done by three different processes: Forward selection (adding valuable features), Backward selection (removing the worst features), and Mixed selection (forward and backward simultaneously). Criteria such as the P-value are used to indicate termination of the feature selection process, i.e. when all valuable features have been added to the model or none of the remaining features adds value.

Let us consider $f_s$ to be the feature set carried forward in the search process and $f_e$ a feature taken under evaluation with respect to its fitness. The best feature wrapper set $f_b^*$ is then:

$$f_b^* = \arg\max_{f_e \notin f_s} \mathrm{fit}(f_s \cup \{f_e\}) \qquad (6)$$
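A minimal sketch of forward greedy stepwise selection in the sense of equation (6) might look as follows; the stopping rule (stop when no candidate improves fitness) and all names are our own illustrative assumptions, with fitness standing for any wrapper criterion such as cross-validated classifier accuracy on the candidate subset.

```python
def greedy_stepwise(n_features, fitness, max_features):
    selected = []                       # f_s: current wrapper set
    best_fit = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Equation (6): pick the f_e maximizing fit(f_s U {f_e}).
        new_fit, f_best = max((fitness(selected + [f]), f)
                              for f in candidates)
        if new_fit <= best_fit:         # no candidate adds value: stop
            break
        selected.append(f_best)
        best_fit = new_fit
    return selected
```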

4. EXPERIMENTS AND EVALUATION

4.1 Data Sets

This study uses two different datasets, taken from two different sources. Our main analysis is done with the "Enron email" dataset, and thereafter the "SpamAssassin" dataset is employed to validate the results obtained from the first dataset. A brief description of the datasets is given below:


4.1.1 Enron email dataset

In this study, out of the six existing versions of the Enron Email dataset, Enron versions 5 and 6 were selected to create 3000 Legitimate (Ham) and 3000 unsolicited (Spam) files by random sampling. These versions were selected because they were found to contain more complexities in the Spam email files, making the task more challenging.

4.1.2 SpamAssassin

This dataset contains both older and recent unsolicited emails (Spam), collected from non-spam-trap sources. Out of the entire set of Spams, 2350 Spam email files were selected for this research. Along with the Spam files, this dataset has both easy (simple to identify) and difficult (with complexities) legitimate (Ham) files. To maintain balance, easy and difficult Ham emails were sampled equally to generate 2350 Ham email files.

4.2 Classification Process Description

4.2.1 Pre-Processing

An email file (document) can be represented by a collection of feature vectors $a_{ik}$, defined as the weight of word $i$ in document $k$ [20]. The email data files are taken through a feature extraction process to obtain a set of relevant 'terms' (usually words) for generating a Term-Document Matrix (TDM). This is a binary matrix in which 1 indicates the presence of a word in the corresponding document and 0 its absence. The matrix is expected to have high dimensionality and to be sparse in nature, because a large number of documents are present and most terms occur in only a few of them. However, this problem is well handled by the "Dimensionality reduction" process.
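As an illustration of building such a binary TDM, a sketch using scikit-learn (our choice; the paper used JAVA/MATLAB environments) might look like this, where emails is a hypothetical list of raw message texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records term presence/absence (the 0/1 TDM described
# above) rather than raw term counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)       # sparse (n_docs, n_terms) 0/1 matrix
terms = vectorizer.get_feature_names_out() # the extracted term dictionary
```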

4.2.2 Dimensionality Reduction

Dimensions can be reduced by "Feature selection" or "Feature extraction", together with the elimination of "Stop words" (terms that carry no information, such as pronouns, prepositions and conjunctions) [20] and "Lemmatisation" (grouping terms that come from the same root word).

4.2.3 Feature Extraction Process

In this step, the Spam and Ham files are used to extract and develop the associated feature dictionary. This is done by a String-to-Word-Vector conversion process, which also includes the Stop word removal and Lemmatisation steps. The resulting large, sparse matrix is further processed by the feature selection and search techniques to generate the minimum number of most informative features.
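A sketch of the stop word removal and lemmatisation steps named above, using NLTK (an assumption on our part, and one that requires the NLTK stopwords and WordNet corpora to be installed), might be:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop = set(stopwords.words("english"))
lemmatize = WordNetLemmatizer().lemmatize

def to_terms(text):
    # Tokenise, drop stop words, and group inflected forms by root.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [lemmatize(t) for t in tokens if t not in stop]
```

Such a tokeniser could, for instance, be plugged into the vectoriser of Section 4.2.1 via its analyzer parameter.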

4.2.4 Feature Selection

Feature selection is employed after stop word removal and lemmatisation. This technique helps find the most informative terms within the complete set of terms; the use of a few good features (i.e. terms) to represent documents has been shown to be effective for the evaluation of classifiers. In this study, we have used the Genetic feature search and Greedy Stepwise feature search techniques. Depending on the dimensionality of the original dataset, different numbers of good features were selected by these techniques from each of the two datasets, and thereafter used for testing the classifiers.

4.2.5 Classifiers

This study used the JAVA and MATLAB environments on the Windows 7 operating system for testing the classifiers. Four classifiers: a Genetic Algorithm based classifier, Bayesian, Naive Bayes, and Support Vector Machine (SVM), were tested on the most informative features selected by the feature selection methods from the two datasets mentioned above.

4.2.6 Spam Classification

For the purpose of evaluation, the data is split such that 66% is used for training and the remaining 34% is set aside for testing the classifiers. The set of Ham and Spam files is split in a random fashion, with 66% of the files selected for training the classifiers and the remainder left for testing.

4.2.7 Evaluation

This study employs a number of performance measures for evaluation and analysis. A simple measure for testing classifiers is the Classification Accuracy ($Accuracy$), defined as the percentage of accurately classified emails. The weakness of this measure is that it fails to distinguish between false positives and false negatives. For accurate measurement, the false positive rate ($FP_{rate}$) is calculated separately. The F-value ($F_{value}^{H,S}$), defined as the harmonic mean of $Precision$ (the fraction of retrieved classified emails that are relevant) and $Recall$ (the fraction of accurately classified emails that are retrieved), is another measure used for evaluation and analysis in this study.

Table 2. Performance Instruments

| Instrument | Related Formula |
|---|---|
| Accuracy | $Accuracy = \dfrac{N_{Ham \to c} + N_{Spam \to c}}{N_{Ham \to c} + N_{Ham \to m} + N_{Spam \to c} + N_{Spam \to m}}$ |
| F-Value | $F_{value}^{H,S} = \dfrac{2 \cdot Precision^{H,S} \cdot Recall^{H,S}}{Precision^{H,S} + Recall^{H,S}}$ |
| False Positive Rate | $FP_{rate} = \dfrac{N_{Ham \to m}}{N_{Ham \to m} + N_{Ham \to c}}$ |

In the table above, $N_{Ham \to c}$ denotes the total number of correctly classified Ham emails, $N_{Ham \to m}$ the number of misclassified Ham emails, $N_{Spam \to c}$ the number of correctly classified Spam emails, and $N_{Spam \to m}$ the total number of misclassified Spam emails.
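A sketch of the 66%/34% random split of Section 4.2.6 and the measures of Table 2 follows, under the assumption that Precision and Recall are computed for the Spam class (one reading of the H,S superscripts); X and y are hypothetical placeholders for a feature matrix and 0/1 (Ham/Spam) labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 66% of the files for training, the remaining 34% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, shuffle=True)

def table2_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_ham_c = np.sum((y_true == 0) & (y_pred == 0))   # correctly classified Ham
    n_ham_m = np.sum((y_true == 0) & (y_pred == 1))   # misclassified Ham
    n_spam_c = np.sum((y_true == 1) & (y_pred == 1))  # correctly classified Spam
    n_spam_m = np.sum((y_true == 1) & (y_pred == 0))  # misclassified Spam
    accuracy = (n_ham_c + n_spam_c) / len(y_true)
    fp_rate = n_ham_m / (n_ham_m + n_ham_c)
    precision = n_spam_c / (n_spam_c + n_ham_m)       # for the Spam class
    recall = n_spam_c / (n_spam_c + n_spam_m)
    f_value = 2 * precision * recall / (precision + recall)
    return accuracy, f_value, fp_rate
```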

5. COMPARATIVE ANALYSIS

This section presents the comparative analysis of the various Machine Learning Classifiers that were tested with varying numbers of most informative features. Percentage Accuracy, F-Value and False Positive rate were the measures used for the analysis. For clear understanding, the analysis is presented in three segments: the first segment deals with the analysis of the Machine Learning Classifiers, the second segment analyses the feature selection methods, and in the last segment the False Positive rates are used to evaluate the classification from a different perspective.

Table 3. Accuracy and F-value of classifiers tested on the Enron dataset

| Enron (in %) | | Genetic | Bayesian | NB | SVM |
|---|---|---|---|---|---|
| Genetic Search | Acc | 80.4 | 85.6 | 84.8 | 87.1 |
| | F-Value | 80.4 | 85.6 | 84.8 | 87.1 |
| Greedy search | Acc | 87.6 | 93.0 | 94.0 | 94.2 |
| | F-Value | 87.5 | 93.1 | 93.9 | 94.3 |

[Figure 1. Accuracy and F-value for the Enron dataset: bar chart of the Table 3 values per classifier, for Genetic Search and Greedy Stepwise (Accuracy and F-Value).]

Table 4. Accuracy and F-value of classifiers tested on the SpamAssassin dataset

| SpamAssassin (in %) | | Genetic | Bayesian | NB | SVM |
|---|---|---|---|---|---|
| Genetic Search | Acc | 95.2 | 91.9 | 91.2 | 96.2 |
| | F-Value | 95.2 | 91.9 | 91.2 | 96.2 |
| Greedy search | Acc | 96.4 | 97.1 | 96.6 | 97.8 |
| | F-Value | 96.4 | 97.1 | 96.7 | 97.8 |

[Figure 2. Accuracy and F-value for the SpamAssassin dataset: bar chart of the Table 4 values per classifier, for Genetic Search and Greedy Stepwise (Accuracy and F-Value).]

5.1 Analysis of Machine Learning Classifiers

The results of the classifiers tested on the Enron dataset, shown in Table 3 and Figure 1, demonstrate that the Support Vector Machine is the most accurate of the tested classifiers, with accuracy between 87.1% and 94.2%. The Genetic Classifier is the worst in terms of accuracy, which varies between 80.4% and 87.6%. The Bayesian and Naive Bayes classifiers came very close to the best one, with accuracy between 85.6% and 93.1% for Bayesian and between 84.8% and 93.9% for Naive Bayes.

Testing the same classifiers on the SpamAssassin dataset confirmed the results obtained from the Enron dataset. The results of the experiments on the SpamAssassin dataset are shown in Table 4 and Figure 2.

5.2 Analysis of Feature Selection Methods

As discussed in the preceding sections, the most informative feature subsets were selected using the Genetic and Greedy Stepwise feature search techniques: the 48 best features out of the 1500 initially created features for the Enron dataset, and 35 features out of 1414 features for the SpamAssassin dataset, were selected for testing the classifiers. The results presented in Tables 3 and 4 and Figures 1 and 2 demonstrate that the Greedy Stepwise search method identified the most informative features in both datasets, with accuracy between 87.6% and 95.2% for the Enron dataset and between 96.4% and 97.8% for the SpamAssassin dataset.

The features selected by Genetic search have shown poorer results, i.e. 80.4% to 87.1% for the Enron dataset and 91.2% to 96.2% for the SpamAssassin dataset.

5.3 Analysis with False Positive Rate

Although some machine learning classifiers show good overall classification accuracy, the possibility of misclassification of positive instances may still be high. Legitimate emails are considered important, and if they are misclassified as Spam, serious consequences may follow. This problem is well captured by the False Positive rate (FP Rate), which measures how many legitimate emails are misclassified.

Table 5. False Positive Rate of the Classifiers

| FP Rate (%) | | Genetic | Bayesian | NB | SVM |
|---|---|---|---|---|---|
| Genetic Search | Enron | 22.6 | 7.3 | 10.6 | 7.3 |
| | SpamAssassin | 3.5 | 0.1 | 0.4 | 2.1 |
| Greedy search | Enron | 22.5 | 1.8 | 4.4 | 2.6 |
| | SpamAssassin | 2.4 | 3.6 | 4.5 | 1.0 |

[Figure 3. False Positive Rate of the classifiers: bar chart of the Table 5 values per classifier, for each dataset and search method.]

From Table 5 and Figure 3 it is clear that SVM and the Bayesian Classifier perform better in terms of the FP Rate. For these classifiers the FP Rate is low on both datasets: 7.3% and 1.8% for the Bayesian classifier and 7.3% and 2.6% for SVM on the Enron dataset, and 0.1% and 3.6% for Bayesian and 2.1% and 1.0% for SVM on the SpamAssassin dataset (the paired values are for Genetic and Greedy Stepwise search respectively). These results indicate that the use of the Greedy Stepwise search method for feature selection generally leads to a lower FP Rate.

6. CONCLUSION

Achieving good classification accuracy with a minimum number of features has always been one of the major research objectives in text classification. This study presented a comparative analysis of two feature selection methods, Genetic search and Greedy Stepwise search, and their interactions with several Machine Learning Classifiers in the context of Spam email detection. The results lead to the following conclusions: first, among the Machine Learning Classifiers examined, SVM showed the best classification accuracy and also the lowest False Positive Rate; second, Greedy Stepwise Search was found to be the best feature subset selector.

7. REFERENCES

[1] Whittaker, S., Bellotti, V., & Moody, P. (2005). Introduction to this special issue on revisiting and reinventing e-mail. Human–Computer Interaction, 20(1-2), 1-9.

[2] Lai, C. C. (2007). An empirical study of three machine learning methods for spam filtering. Journal of Knowledge-Based Systems, 20(3), 249-254.

[3] Trivedi, S., & Dey, S. (2013). Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams. International Journal of Computer Applications, 66(21), 18-23. Foundation of Computer Science, New York, USA.

[4] Goodman, J., Cormack, G. V., & Heckerman, D. (2007). Spam and the ongoing battle for the inbox. Communications of the ACM, 50(2), 24-33.

[5] Woitaszek, M., Shaaban, M., & Czernikowski, R. (2003). Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine. Proceedings of the 2003 Symposium on Applications and the Internet, 166-169.

[6] Vapnik, V. N. (1999). An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks, 10(5), 988-998.

[7] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., & Spyropoulos, C. D. (2000). An Evaluation of Naive Bayesian Anti-Spam Filtering. Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, 9-17.

[8] Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006). Spam Filtering with Naive Bayes – Which Naive Bayes? Third Conference on Email and Anti-Spam (CEAS), 125-134.

[9] Chen, J., & Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika, 94, 759-771.

[10] Awad, W. A., & ELseuofi, S. M. (2011). Machine Learning Methods for Spam Classification. International Journal of Computer Science & Information Technology (IJCSIT), 3(1), 173-184.

[11] Trivedi, S., & Dey, S. (2013). Interplay between Probabilistic Classifiers and Boosting Algorithms for Detecting Complex Unsolicited Emails. Journal of Advances in Computer Networks, 1(2), 132-136.

[12] Sculley, D., & Wachman, G. M. (2007). Relaxed Online SVMs for Spam Filtering. SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 415-422.

[13] Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. Proceedings of ECML-98, 10th European Conference on Machine Learning, 4-15.

[14] Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1054.

[15] Liu, B., McKay, B., & Abbass, H. A. (2003). Improving genetic classifiers with a boosting algorithm. The 2003 Congress on Evolutionary Computation (CEC '03), 4, 2596-2602.

[16] Li, J. H., & Ping, W. (2009). The e-mail filtering system based on improved genetic algorithm. Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009).

[17] Xu, Z., Weinberger, K., & Chapelle, O. (2012). The greedy miser: Learning under test-time budgets. arXiv preprint arXiv:1206.6451.

[18] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.

[19] Vafaie, H., & Imam, I. F. (1994). Feature Selection Methods: Genetic Algorithms vs. Greedy-like Search. Proceedings of the 3rd International Fuzzy Systems and Intelligent Control Conference.

[20] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of ECML '98.