
Comparative Analysis of Algorithms for Classification of Text in the Bulgarian Language in Machine Learning

Neli An. Arabadzieva-Kalcheva
Dept. of Software and Internet Technologies
Technical University of Varna, Varna 9010, Bulgaria
e-mail: [email protected]

Abstract – The topic of this publication is the research and comparative analysis of algorithms for the classification of text in the Bulgarian language using machine learning methods. The algorithms examined are: the naive Bayes classifier, the multinomial Bayes classifier, C4.5, k-nearest neighbours, and support vector machines with optimization. The results are presented analytically and graphically, and show that with two classes and a low volume of data, support vector machines and C4.5 give the highest accuracy. If the number of classes is doubled, the naive Bayes classifier and the multinomial Bayes classifier give similar results and are ahead of the rest. Running the algorithms with 20 or more classes results in poor accuracy across the board; the best performers, at circa 55%, are the naive Bayes classifier and support vector machines with optimization. The lowest accuracy is obtained from k-nearest neighbours.

Keywords – naive Bayes classifier, multinomial Bayes classifier, k-nearest neighbours, support vector machines with optimization, C4.5 (J48), text classification, machine learning

I. INTRODUCTION

In recent years, thanks to the ease of internet access, a vast amount of data has accumulated in various knowledge repositories. These resources become useless if it is not possible to obtain up-to-date and useful information on a particular topic. Classification shortens the time required to search for the necessary information presented as electronic texts.

Formal definition of the classification task. Let $X \subseteq \mathbb{R}^n$ be a set of objects (input) and $Y \subseteq \mathbb{R}$ a set of results (output). We regard the pair (x, y) as a realization of an (n + 1)-dimensional random variable (X, Y) defined on a probability space. The distribution law $P_{XY}(x, y)$ is not known; all we have is a training set

$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}$   (1)

where $(x^{(i)}, y^{(i)}),\ i = 1, 2, \dots, N$ are independent. The goal is to find a function $f: X \to Y$ which, using the values of x, can predict y. We call the function f a decision function or a classifier. [2]

In other words, the formal definition of the text classification task can be stated as follows. Let there be a set of classes

$C = \{c_1, c_2, \dots, c_k\}$ and a set of documents $D = \{d_1, d_2, \dots, d_n\}$. The value of the target function $f: D \times C \to \{0, 1\}$ is unknown for every (document, class) pair. We need to find a classifier f′, i.e. a function as close as possible to the function f. [1]

A key step in the classification of text is representing each document by a set of features; traditionally, word-frequency counts are used for this purpose.
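To make the feature representation concrete, the following minimal sketch (not taken from the paper) builds word-frequency vectors with scikit-learn's CountVectorizer; the two short Bulgarian fragments are invented sample data.

```python
# Illustrative sketch (not from the paper): representing documents by word-frequency
# features, as described above, using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny hypothetical Bulgarian fragments used only as sample data.
documents = [
    "нощта е тиха и светла",
    "тиха нощ над морето",
]

vectorizer = CountVectorizer()            # tokenizes and counts word occurrences
X = vectorizer.fit_transform(documents)   # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out()) # the vocabulary extracted from the corpus
print(X.toarray())                        # word-frequency vector for each document
```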

II. EXPOSITION

A. Naive Bayes classifier

One of the classical algorithms in machine learning is the naive Bayes classifier, which is based on Bayes' theorem for determining the a posteriori probability of an event. Under the "naive" assumption of conditional independence between every pair of features, the naive Bayes classifier deals effectively with the problem of having too many features, the so-called "curse of dimensionality".

Bayes' theorem: [5]

$P(y = c \mid x) = \dfrac{P(x \mid y = c)\, P(y = c)}{P(x)}$   (2)

where:
$P(y = c \mid x)$ – the probability that an object x belongs to class c (a posteriori probability);
$P(x \mid y = c)$ – the class-conditional density;
$P(y = c)$ – the class prior;
$P(x)$ – the unconditional probability of x.

The purpose of the classification is to determine the class to which the object x belongs. It is therefore necessary to find the most probable class for the object x, i.e. to choose the class c that maximizes $P(y = c \mid x)$:

$c_{opt} = \arg\max_{c \in C} P(x \mid y = c)\, P(y = c)$   (3)
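As a purely numerical illustration of equations (2) and (3), the sketch below picks the class that maximizes P(x | y = c)P(y = c); the priors and likelihoods are made-up values, not results from the paper.

```python
# Minimal numerical sketch of equations (2)-(3): choosing the class that maximizes
# P(x | y = c) P(y = c). The priors and likelihoods below are illustrative values only.
import numpy as np

classes = ["c1", "c2"]
prior = np.array([0.5, 0.5])              # P(y = c)
likelihood = np.array([0.02, 0.09])       # P(x | y = c) for the observed object x

posterior_unnormalized = likelihood * prior
c_opt = classes[int(np.argmax(posterior_unnormalized))]   # equation (3)

# P(y = c | x) after dividing by P(x) = sum over c of P(x | y = c) P(y = c)
posterior = posterior_unnormalized / posterior_unnormalized.sum()
print(c_opt, posterior)
```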

B. Multinomial Bayes classifier

The multinomial Bayes classifier makes the assumption that the features are distributed multinomially. Let $x_i$, $i \in \{1, \dots, K\}$, be the observed counts of the K possible outcomes, with emission probabilities $\theta_1, \dots, \theta_K$ [7]. Then the probability of observing x for a given θ is:

$P(x \mid \theta) = \dfrac{n!}{x_1! \cdots x_K!} \prod_{i=1}^{K} \theta_i^{x_i}$   (4)

where

$n = \sum_{i=1}^{K} x_i$   (5)
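A small sketch of equation (4), with illustrative counts and emission probabilities (not data from the study):

```python
# Sketch of equation (4): the multinomial likelihood of a vector of counts x
# given emission probabilities theta. The counts and probabilities are illustrative.
from math import factorial, prod

x = [3, 1, 2]                  # counts of K = 3 outcomes (e.g. word frequencies)
theta = [0.5, 0.2, 0.3]        # emission probabilities, summing to 1

n = sum(x)                                                    # equation (5)
coefficient = factorial(n) / prod(factorial(xi) for xi in x)
likelihood = coefficient * prod(t ** xi for t, xi in zip(theta, x))
print(likelihood)              # P(x | theta)
```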

The multinomial Bayes classifier calculates the frequency of occurrence of each word in the documents. Again, a naive assumption is made that the likelihood of a word occurring in the text is independent of the context and of the position of the word in the document. [6]

C. K-nearest neighbours (KNN)

The k-nearest neighbours algorithm classifies objects by calculating the distance between each pair of objects from the training set, using an appropriate function to measure the distance between two points. The algorithm assigns a class to an object by a majority vote among its k nearest neighbours.

Function for measuring distance (Euclidean):

$\rho(x_i, x_j) = \sqrt{\sum_{k=1}^{m} w_k \left(x_i^{(k)} - x_j^{(k)}\right)^2}$   (6)

where:
$x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)})$ – the vector of m features of the i-th object;
$x_j = (x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(m)})$ – the vector of m features of the j-th object.

Other well-known functions for measuring the distance between two points are the Lp metric, the L∞ metric, the L1 metric and the Lance-Williams function.

An important question when using the k-nearest neighbours algorithm is the choice of the number k, the number of nearest neighbours. Heuristic techniques such as cross-validation can help in choosing an appropriate value of k. If k is high, the classifier is more precise and more new examples are classified correctly, but recognition takes a long time. With a low k the algorithm completes quickly but produces a large recognition error. The conclusion is that the choice of k depends on the specific problem, and its optimal value is determined experimentally. [8]
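The following sketch (not the authors' WEKA experiment) illustrates choosing k by cross-validation, as suggested above; the data set is a synthetic stand-in.

```python
# Illustrative sketch: selecting k for k-nearest neighbours by cross-validation.
# The data set here is a synthetic stand-in, not the poems used in the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                  # 60 objects with m = 5 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # two synthetic classes

for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores = cross_val_score(knn, X, y, cv=5) # 5-fold cross-validation accuracy
    print(k, scores.mean())
```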

D. Support Vector Machines (SVM)

The support vector machines (SVM) method represents the training examples as n-dimensional points. The examples are projected into space in such a way as to be linearly separable. When working with two classes, a line is drawn that separates the data into the two classes. The boundary that divides the data is called a maximum-margin hyperplane. This hyperplane must be chosen so that it lies as far as possible from the nearest examples of both classes. The linear classification function f(x) is as follows: [3]

$f(x) = w^{T}x + b$   (7)

where w is a weight vector and b is the displacement (bias). The goal is to find the values of $w^{T}$ and b that determine the classifier. To do this, it is necessary to find the points closest to the separating hyperplane; their distance to it (the margin) should be maximized.

For data that are not linearly separable, the basic idea is to achieve linear separation by mapping the data into another, higher-dimensional feature space through a function of the non-linear input data. This is accomplished by the so-called kernel function K, which is defined as follows:

$K(x_i, x_j) = f(x_i) \cdot f(x_j)$   (8)
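The sketch below illustrates equations (7) and (8) numerically. It uses an explicit feature mapping corresponding to a simple homogeneous polynomial kernel; this particular mapping is an illustrative assumption, not something specified in the paper.

```python
# Sketch of equations (7)-(8): a linear decision function f(x) = w^T x + b, and a kernel
# evaluated as a dot product of mapped points. All values are illustrative.
import numpy as np

w = np.array([0.7, -1.2])      # weight vector
b = 0.3                        # displacement (bias)
x = np.array([1.0, 0.5])

f_x = w @ x + b                # equation (7); the sign of f(x) gives the predicted class
print(f_x)

def feature_map(v):
    # explicit mapping whose dot product equals the degree-2 homogeneous polynomial kernel
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# equation (8): K(xi, xj) = phi(xi) . phi(xj), which here equals (xi . xj) ** 2
print(feature_map(xi) @ feature_map(xj), (xi @ xj) ** 2)
```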

Some of the most commonly used kernel functions are the polynomial kernel function, the Gaussian radial basis function, the exponential radial basis function, the multilayer perceptron kernel, etc.

A modification of the support vector machines algorithm is the so-called SMO (Sequential Minimal Optimization), which at each optimization step selects two Lagrange multipliers. This algorithm is faster and has better scaling properties than standard SVM training. [9]

E. C4.5 algorithm

The C4.5 algorithm constructs a decision tree from a training set. The classes must have a finite number of values, with each example referring to a particular class. C4.5 is an extension of the ID3 classification algorithm, which divides the data recursively into subtrees using an information-significance index, i.e. at each node the feature with the highest information utility is selected. The C4.5 algorithm calculates a "normalized information significance": when constructing the classification tree, the nodes with the most useful information are selected and, to avoid an excessively strong division into subsets, a normalization is applied in which the split information is used to compute a criterion called the gain ratio: [10]

$split\,info(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \log_2\!\left(\frac{|T_i|}{|T|}\right)$   (9)

where:
T – the set of cases to which the test X is applied;
$T_1, T_2, \dots, T_n$ – the subsets produced by the test;
n – the number of outcomes of the test.
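A short sketch of equation (9) and of the resulting gain ratio, using a small hypothetical partition (the counts are illustrative only):

```python
# Sketch of equation (9): split info for a partition of T into subsets T_i, and the
# C4.5-style gain ratio = information gain / split info. The counts are illustrative.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# A hypothetical test that splits 14 cases (9 positive, 5 negative) into three subsets.
parent = [9, 5]
subsets = [[2, 3], [4, 0], [3, 2]]

split_info = entropy([sum(s) for s in subsets])                        # equation (9)
gain = entropy(parent) - sum(sum(s) / sum(parent) * entropy(s) for s in subsets)
gain_ratio = gain / split_info
print(split_info, gain, gain_ratio)
```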

III. RESEARCH, RESULTS AND ANALYSIS

The study uses the WEKA software package, which is open-source software released under the GNU General Public License. The analyzed algorithms are: the naive Bayes classifier, the multinomial Bayes classifier, C4.5, the k-nearest neighbours method, and the support vector machines method using optimization (SMO) with a polynomial kernel function. The text classification results reported for k-nearest neighbours are the best obtained for the corresponding example, found experimentally at different values of K (the number of nearest neighbours). The tables use the name J48, which is the working name of the C4.5 algorithm rewritten in Java. [4]

The report introduces the following abbreviations for the algorithms used in the text:
• Naive Bayes classifier – NB (Naive Bayes)
• Multinomial Bayes classifier – MNB
• K-nearest neighbours – IBk
• Support Vector Machines using optimization – SMO (Sequential Minimal Optimization)
• C4.5 algorithm – J48

Initially, poems by two authors, Peyo Yavorov and Dimcho Debelyanov, were classified: 21 poems each and an approximately equal number of words, 2018 words for the first author and 2031 words for the second. The two poets lived and worked in approximately the same period, the late 19th and early 20th centuries. The results show that SMO and J48 classified the authors 100% correctly, with the worst result coming from k-nearest neighbours. The difference in accuracy between the first and the last of the analyzed algorithms is 21%, while the difference between the first and the second is 7%. (Table 1)

TABLE 1. Classification of 2 authors with 21 poems each and with ≈ 2000 words per author
(The NB, MNB, J48, IBk and SMO columns give the number of accurately classified poems; the last row gives the accuracy as a percentage of accurately classified poems. The same layout is used in Tables 2-8.)

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 21 | 2018 | 21 | 17 | 21 | 18 | 21
Dimcho Debelyanov | 21 | 2031 | 18 | 20 | 21 | 15 | 21
Accuracy | | | 92.85 % | 88.09 % | 100 % | 78.57 % | 100 %
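For orientation, a comparable comparison could be assembled outside WEKA roughly as follows. This is only a hedged sketch, not the authors' setup: the input file name is hypothetical, the paper does not state its exact evaluation protocol (10-fold cross-validation is assumed here), and the scikit-learn models are only rough analogues of the WEKA classifiers (DecisionTreeClassifier is not an exact C4.5/J48, and SVC is not an exact SMO).

```python
# Hedged sketch of a comparable experiment outside WEKA (NOT the authors' exact setup):
# word-count features and five scikit-learn classifiers scored by 10-fold cross-validation.
# The file name "poems.csv" is hypothetical; the models are only rough analogues.
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed input: one poem per row, text in column 0 and author label in column 1.
with open("poems.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f))
texts = [row[0] for row in rows]
authors = [row[1] for row in rows]

X = CountVectorizer().fit_transform(texts).toarray()   # dense word-frequency matrix
models = {
    "NB": GaussianNB(),
    "MNB": MultinomialNB(),
    "J48 (analogue)": DecisionTreeClassifier(),
    "IBk": KNeighborsClassifier(n_neighbors=1),
    "SMO (analogue)": SVC(kernel="poly", degree=1),
}
for name, model in models.items():
    accuracy = cross_val_score(model, X, authors, cv=10).mean()
    print(f"{name}: {accuracy:.2%}")
```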

With an almost twofold increase in the number of words, J48 is again at the top and IBk is last. Compared to the previous study, significant drops were observed in the J48, IBk and SMO scores: -5%, -9.8214% and -6.25% respectively, while the difference in the case of NB was only 0.3571%. (Table 2)

TABLE 2. Classification of 2 authors with an approximately equal number of poems and with ≈ 4000 words per author

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 38 | 4024 | 36 | 32 | 35 | 27 | 35
Dimcho Debelyanov | 42 | 4017 | 38 | 38 | 41 | 28 | 40
Accuracy | | | 92.5 % | 87.5 % | 95 % | 68.75 % | 93.75 %

With an almost twofold increase in the number of words, now 16 040 in total, MNB and NB show the best results among the algorithms tested. The difference between MNB and NB is only 0.1379%, in favor of MNB. (Table 3)

TABLE 3. Classification of 2 authors with an approximately equal number of poems and with ≈ 8000 words per author

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 73 | 8023 | 55 | 62 | 55 | 73 | 60
Dimcho Debelyanov | 68 | 8017 | 18 | 20 | 51 | 3 | 52
Accuracy | | | 84.25 % | 84.39 % | 75.17 % | 53.90 % | 79.43 %

With a doubling of the number of authors from two to four (Table 4), i.e. of the number of classes, each with an approximately equal number of words (≈ 2000), MNB emerges as the winner, followed closely by NB, with SMO in third position by percentage of properly classified poems. IBk is again last, and it is notable that its percentages are extremely low: only one of the authors is recognized, while two others are not recognized at all.

TABLE 4. Classification of 4 authors with an approximately equal number of poems and with ≈ 2000 words per author

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 21 | 2009 | 14 | 15 | 10 | 2 | 15
Dimcho Debelyanov | 21 | 2031 | 18 | 18 | 17 | 0 | 15
Hristo Fotev | 20 | 2022 | 18 | 19 | 9 | 20 | 18
Petko Slaveykov | 21 | 2000 | 16 | 15 | 13 | 0 | 11
Accuracy | | | 79.51 % | 80.72 % | 59.03 % | 26.50 % | 71.08 %

With an increase in the number of authors to eight, the highest percentage of properly classified poems belongs to MNB, which is 4.84% better than the next two, NB and SMO, which have identical results. (Table 5)

TABLE 5. Classification of 8 authors with an approximately equal number of poems and with ≈ 2000 words per author

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 21 | 2009 | 14 | 12 | 16 | 2 | 13
Dimcho Debelyanov | 21 | 2031 | 10 | 12 | 10 | 0 | 12
Hristo Fotev | 20 | 2022 | 15 | 16 | 9 | 18 | 17
Petko Slaveykov | 21 | 2000 | 14 | 15 | 10 | 0 | 11
Pencho Slaveikov | 20 | 2020 | 8 | 9 | 2 | 1 | 9
Geo Milev | 21 | 2000 | 7 | 9 | 7 | 6 | 8
Lyuben Karavelov | 19 | 2020 | 12 | 15 | 10 | 5 | 13
Nikolai Liliev | 21 | 2000 | 19 | 19 | 11 | 0 | 16
Accuracy | | | 60 % | 64.84 % | 45.45 % | 19.39 % | 60 %

In the classification of two other authors, Petko Slaveykov and Nikolay Liliev, each with 21 poems and 2000 words, who lived and worked in different time periods (the first in the middle and end of the 19th century, the second at the beginning of the 20th century), we obtain different results (Table 6). The naive Bayes classifier and J48 erred on only a single poem each, while SMO erred on three poems.

TABLE 6. Classification of 2 other authors with 21 poems each and with 2000 words per author

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Petko Slaveykov | 21 | 2000 | 20 | 21 | 21 | 9 | 20
Nikolai Liliev | 21 | 2000 | 21 | 18 | 20 | 21 | 18
Accuracy | | | 97.67 % | 92.85 % | 97.61 % | 71.42 % | 90.47 %

In the studies so far (Tables 1, 2, 3, 4, 5 and 6) the number of poems and the number of words are roughly equal for each author. When the number of poems of one of the authors is reduced significantly, J48 still classifies the authors 100% correctly (Table 7), while the accuracy of SMO is almost 20% lower than in the case where the numbers of poems and words of the two authors are equal (Table 1).

TABLE 7. Classification of 2 authors with an unequal number of poems and with an approximately equal number of words per author

(a) ≈ 2000 words per author
Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 5 | 2018 | 2 | 3 | 5 | 1 | 0
Dimcho Debelyanov | 21 | 2031 | 21 | 21 | 21 | 15 | 21
Accuracy | | | 88.46 % | 92.30 % | 100 % | 61.53 % | 80.76 %

(b) ≈ 4000 words per author
Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 15 | 4020 | 6 | 5 | 15 | 7 | 4
Dimcho Debelyanov | 42 | 4017 | 42 | 42 | 42 | 31 | 42
Accuracy | | | 84.21 % | 82.45 % | 100 % | 66.66 % | 80.70 %

In the case where the number of poems is equal but the numbers of words differ substantially, J48 again emerges with the highest accuracy. All classifiers except J48 improve their accuracy by 3% to 10% when given a larger number of words. (Table 8)

TABLE 8. Classification of 2 authors with an equal number of poems and with an unequal number of words

Authors | Number of poems | Number of words | NB | MNB | J48 | IBk | SMO
Peyo Yavorov | 20 | 5129 | 11 | 18 | 19 | 17 | 16
Dimcho Debelyanov | 20 | 1937 | 20 | 18 | 20 | 11 | 20
Accuracy | | | 77.5 % | 90 % | 97.5 % | 70 % | 90 %

By increasing the number of authors to 20 and the number of poems to 585 (Fig. 1), with 103 377 words in total, and removing the requirement for an equal number of poems and words for each author, none of the five classifiers produces good results (Fig. 2). The accuracy of the naive Bayes classifier and of the support vector machines method are the highest and approximately equal, about 55%. The accuracy of the J48 and MNB algorithms is almost three times lower than in the two-class cases, and IBk is nearly 10 times less accurate than NB and SMO (Fig. 2).

Figure 1. Number of poems of the 20 authors used in the classification (Peyo Yavorov, Dimcho Debelyanov, Hristo Botev, Hristo Fotev, Hristo Smirnenski, Geo Milev, Pencho Slaveykov, Petko Slaveykov, Nikolay Liliev, Lyuben Karavelov, Mara Belcheva, Stoyan Mihaylovski, Sirak Skitnik, Vasil Popovich, Konstantin Velichkov, Konstantin Miladinov, Georgi Rakovski, Dobri Chintalov, Rayko Genzifov, Stefan Stambolov).

Figure 2. Classification of 20 authors using the algorithms NB, MNB, J48, IBk and SMO (accuracy: NB 55.08 %, SMO 54.99 %, IBk 5 %; MNB and J48 between 27.75 % and 33.39 %).

IV. CONCLUSION

The conducted studies show that in the classification of two classes with a small amount of data, the algorithms with the highest accuracy are C4.5 (J48) and SMO; their accuracy decreases with a larger volume of data. With 4 and 8 classes the multinomial Bayes classifier is the most accurate. With 20 classes, the naive Bayes classifier and the support vector machines method with optimization, using two Lagrange multipliers and a polynomial kernel function, have the highest accuracy of about 55%. The k-nearest neighbours method (IBk) shows the lowest scores throughout the entire study.

REFERENCES

[1] Arabadzieva-Kalcheva N., Nikolov N., "Comparative analysis of the naive Bayes classifier and sequential minimal optimization for classifying text in Bulgarian in machine learning," Computer Science and Technologies Journal, TU - Varna, pp. 97-105, 2017.
[2] Arabadzieva-Kalcheva N., Mateva Z., "Bayesian theory in machine learning," Annual Journal of the Technical University of Varna, pp. 130-133, 2016.
[3] Harrington P., Machine Learning in Action, 2012, pp. 105-106, p. 144.
[4] http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html
[5] Uitkin L., Machine Learning, 2017, pp. 6-8.
[6] McCallum A., Nigam K., "A comparison of event models for naive Bayes text classification," Papers from the 1998 AAAI Workshop, 1998.
[7] Murphy K. P., Machine Learning: A Probabilistic Perspective, 2012, p. 34.
[8] Penev I., Karova M., Todorova M., "On the optimum choice of the K parameter in hand-written digit recognition by kNN in comparison to SVM," International Journal of Neural Networks and Advanced Applications, vol. 3, 2016.
[9] Platt J., "Fast Training of Support Vector Machines using Sequential Minimal Optimization," 1998, p. 44.
[10] Quinlan J., C4.5: Programs for Machine Learning, 1993.