Comparative Analysis of Algorithms for Classification of Text in the Bulgarian Language in Machine Learning

Neli An. Arabadzieva-Kalcheva
Dept. of Software and Internet Technologies
Technical University of Varna, Varna 9010, Bulgaria
e-mail: [email protected]

Abstract – The topic of this publication is the research and comparative analysis of algorithms for the classification of text in the Bulgarian language using machine learning methods. The algorithms examined are: the Naive Bayes classifier, the multinomial Bayes classifier, C4.5, k-nearest neighbours, and Support Vector Machines with optimization. The results, presented analytically and graphically, show that with two classes and a low volume of data, Support Vector Machines and C4.5 give the highest accuracy. When the number of classes is doubled, the Naive Bayes classifier and the multinomial Bayes classifier give similar results and are ahead of the rest. Running the algorithms with 20 or more classes results in poor accuracy across the board; the best performers, at circa 55%, are the Naive Bayes classifier and Support Vector Machines with optimization. The lowest accuracy is obtained from k-nearest neighbours.

Keywords - Naive Bayes classifier, multinomial Bayes classifier, k-nearest neighbours, Support Vector Machines with optimization, C4.5 (J48), text classification, machine learning
I. INTRODUCTION
In recent years, thanks to the ease of internet access, a vast amount of data has accumulated in various knowledge repositories. These resources become useless if it is not possible to obtain up-to-date and useful information on a particular topic. Classification shortens the time required to search for the necessary information presented as electronic texts.

Formal definition of the classification task. Let:
X ⊆ R^n be a set of objects (input),
Y ⊆ R be a set of results (output).
We regard the pair (x, y) as a realization of the (n+1)-dimensional random variable (X, Y) defined on a probability space. The distribution law P_XY(x, y) is not known; all we have is a training set:

{(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))}   (1)

where the pairs (x^(i), y^(i)), i = 1, 2, ..., N are independent. The goal is to find a function f: X → Y which, using the values of x, can predict y. We call the function f a decision function or a classifier. [2]

In other words, the task of classifying a text can be defined as follows. Let there be a set of classes C = {c1, c2, ..., ck} and a set of documents D = {d1, d2, ..., dn}. The target function f: D × C → {0, 1}, which assigns 1 to a pair (d, c) when document d belongs to class c, is unknown. We need to find a classifier f', i.e. a function as close as possible to f. [1]

A primary step in text classification is representing each text by a set of features; traditionally, word occurrence frequencies are used for this purpose.

II. EXPOSITION
A. Naive Bayes classifier

One of the classical algorithms in machine learning is the Naive Bayes classifier, which is based on Bayes' theorem for determining the a posteriori probability of an event. By making the "naive" assumption of conditional independence between each pair of features, the Naive Bayes classifier deals effectively with the problem of having too many features, i.e. the so-called "curse of dimensionality".

Bayes' theorem: [5]

P(y = c | x) = P(x | y = c) P(y = c) / P(x)   (2)

where:
P(y = c | x) is the probability that an object x belongs to class c (the a posteriori probability);
P(x | y = c) is the class-conditional density;
P(y = c) is the class prior;
P(x) is the unconditional probability of x.

The purpose of the classification is to determine the class to which the object x belongs, so it is necessary to choose the class that maximizes the probability P(y = c | x). Since P(x) does not depend on c, this is equivalent to:

c_opt = arg max_{c ∈ C} P(x | y = c) P(y = c)   (3)
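To make the decision rule (3) concrete, here is a minimal bag-of-words Naive Bayes sketch in Python. It is an illustration only, not the paper's WEKA setup; the toy documents, labels and vocabulary are invented for the example, and Laplace smoothing is added so unseen words do not zero out the product.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors P(y=c) and Laplace-smoothed word likelihoods P(w|y=c)."""
    class_words = defaultdict(list)
    for doc, label in zip(docs, labels):
        class_words[label].extend(doc.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    priors = {c: labels.count(c) / len(labels) for c in class_words}
    likelihoods = {}
    for c, words in class_words.items():
        counts = Counter(words)
        # Laplace smoothing keeps unseen words from zeroing the product
        likelihoods[c] = {w: (counts[w] + 1) / (len(words) + len(vocab)) for w in vocab}
    return priors, likelihoods

def classify_nb(doc, priors, likelihoods):
    """Pick the class maximizing log P(y=c) + sum of log P(w|y=c), i.e. eq. (3) in log space."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc.lower().split():
            if w in likelihoods[c]:      # ignore out-of-vocabulary words
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

docs = ["morning dew on the rose", "the sea and the storm",
        "rose petals in the dew", "storm waves on the sea"]
labels = ["A", "B", "A", "B"]
priors, likelihoods = train_nb(docs, labels)
print(classify_nb("dew on a rose", priors, likelihoods))   # → A
```

Working in log space avoids numeric underflow when documents contain many words, which is why the product in (3) is computed as a sum of logarithms.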
B. Multinomial Bayes classifier

The multinomial Bayes classifier assumes that the features are multinomially distributed. Let the feature vector x = (x_1, ..., x_K) contain the counts of K possible outcomes with emission probabilities θ_1, ..., θ_K. [7] Then the probability of observing x given θ is:

P(x | θ) = (n! / (x_1! · ... · x_K!)) · ∏_{i=1}^{K} θ_i^{x_i}   (4)

where:

n = ∑_{i=1}^{K} x_i   (5)
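The likelihood (4) is conveniently evaluated in log space, using the log-gamma function for the factorials. A small sketch follows; the count vector and the per-class emission probabilities below are invented for illustration:

```python
import math

def multinomial_log_likelihood(counts, theta):
    """log P(x | theta) for a count vector x, per eq. (4), computed in log space."""
    n = sum(counts)                                             # eq. (5)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coef + sum(x * math.log(t) for x, t in zip(counts, theta) if x > 0)

# Word counts of a toy document over a 3-word vocabulary, and two invented
# per-class emission probability vectors; the class with the higher
# likelihood would win under the multinomial Bayes decision rule.
x = [3, 1, 0]
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.2, 0.5, 0.3]
print(multinomial_log_likelihood(x, theta_a) > multinomial_log_likelihood(x, theta_b))  # → True
```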
The multinomial Bayes classifier uses the frequency of occurrence of each word in the documents. Again, a naive assumption is made that the likelihood of a word occurring in a text is independent of the context and of the position of the word in the document. [6]

C. K-nearest neighbours (KNN)

The k-nearest neighbours algorithm classifies an object by computing the distance between it and the objects of the training set, using an appropriate function to measure the distance between two points, and then taking a majority vote among the k nearest neighbours of the object. A common distance function is the (weighted) Euclidean distance:

ρ(x_i, x_j) = sqrt( ∑_{k=1}^{m} w_k (x_i^(k) − x_j^(k))² )   (6)

where:
x_i = (x_i^(1), x_i^(2), ..., x_i^(m)) is the vector of m features of the i-th object;
x_j = (x_j^(1), x_j^(2), ..., x_j^(m)) is the vector of m features of the j-th object;
w_k are feature weights (w_k = 1 gives the ordinary Euclidean distance).

Other known functions for measuring the distance between two points are the L_p metric, the L_∞ metric, the L_1 metric and the Lance-Williams function. An important question when using the k-nearest neighbours algorithm is the choice of the number K of nearest neighbours. Heuristic techniques such as cross-validation can help in choosing an appropriate value of K. If the K value is high, the classifier is more precise and more new examples are classified correctly, but recognition takes a long time; with a low K value the algorithm completes quickly but produces a larger recognition error. In conclusion, the choice of K depends on the specific problem, and its optimal value is determined experimentally. [8]
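A minimal sketch of the k-nearest-neighbours majority vote using the Euclidean distance (6); the 2-D training points and labels below are invented for the example:

```python
import math
from collections import Counter

def euclidean(a, b, w=None):
    """Weighted Euclidean distance, eq. (6); weights default to all ones."""
    w = w or [1.0] * len(a)
    return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))

def knn_classify(x, train, k=3):
    """Majority vote among the k nearest (point, label) pairs from the training set."""
    neighbours = sorted(train, key=lambda pl: euclidean(x, pl[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B")]
print(knn_classify((0.2, 0.3), train, k=3))   # → A
```

In text classification the points would be word-frequency vectors rather than 2-D coordinates, but the vote itself is identical.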
D. Support Vector Machines (SVM)

The Support Vector Machines (SVM) method represents the training examples as n-dimensional points. When working with two linearly separable classes, a hyperplane is sought that separates the data of the two classes; it is called the maximum-margin hyperplane. This hyperplane must be chosen so that its distance to the nearest examples of both classes is as large as possible. The function f(x) of the linear classifier is as follows: [3]

f(x) = w^T x + b   (7)

where w is a weight vector and b is the bias (displacement). The goal is to find the values of w and b that determine the classifier; to do this, the examples closest to the separating hyperplane (the support vectors) are found, and the margin they define is maximized.

For data that are not linearly separable, the basic idea is to achieve linear separation by mapping the input data into a higher-dimensional feature space through a non-linear function φ. This is accomplished by the so-called kernel function K, which is defined as follows:

K(x_i, x_j) = φ(x_i) · φ(x_j)   (8)

Some of the most commonly used kernel functions are the polynomial kernel, the Gaussian radial basis function, the exponential radial basis function, the multilayer perceptron kernel, etc.

A modification of the Support Vector Machines training procedure is the so-called SMO (Sequential Minimal Optimization) algorithm, which at each optimization step selects and jointly optimizes two Lagrange multipliers. This algorithm is faster and has better scaling properties than standard SVM training. [9]

E. C4.5 algorithm

The C4.5 algorithm constructs a decision tree from a training set. The classes must have a finite number of values, with each example belonging to a particular class. C4.5 is an extension of the ID3 classification algorithm, which recursively divides the data into subtrees using an information-gain criterion, i.e. at each node the feature with the highest information gain is selected. To avoid a bias towards splits into many small subsets, C4.5 normalizes the gain by the "split information", giving a criterion called the gain ratio: [10]

split info(X) = − ∑_{i=1}^{n} (|T_i| / |T|) × log2(|T_i| / |T|)   (9)

where:
T is the current set of examples being split;
T_1, T_2, ..., T_n are the subsets produced by the split on feature X;
n is the number of outcomes of the split.
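To illustrate the normalization in (9), here is a short sketch computing the split information and the resulting gain ratio for an invented split of 10 examples over two classes. Splitting into subsets of sizes 6 and 4 gives split info = −0.6·log2 0.6 − 0.4·log2 0.4 ≈ 0.971 bits.

```python
import math

def entropy(class_counts):
    """Shannon entropy of a class distribution, in bits."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def split_info(subset_sizes):
    """Eq. (9): the entropy of the partition itself, which penalizes many-way splits."""
    return entropy(subset_sizes)

def gain_ratio(parent_counts, subsets_counts):
    """C4.5 criterion: information gain divided by split information."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets_counts)
    gain = entropy(parent_counts) - remainder
    return gain / split_info([sum(s) for s in subsets_counts])

# 10 examples, 5 per class; the split yields subsets with class counts [4,1] and [1,4].
print(round(split_info([6, 4]), 3))                       # → 0.971
print(round(gain_ratio([5, 5], [[4, 1], [1, 4]]), 3))     # → 0.278
```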
III. RESEARCH, RESULTS AND ANALYSIS
The study uses the WEKA software package, which is open-source software released under the GNU General Public License. The analyzed algorithms are: the Naive Bayes classifier, the multinomial Bayes classifier, C4.5, the k-nearest neighbours method, and the method of Support Vector Machines using optimization (SMO) with a polynomial kernel function. The reported k-nearest-neighbours results are the best obtained experimentally for the corresponding example at different values of K (the number of nearest neighbours). The tables use the name J48, which is the working name of the C4.5 algorithm rewritten in Java. [4]

The report introduces the following abbreviations for the algorithms used in the text:
Naive Bayes classifier – NB (Naive Bayes)
Multinomial Bayes classifier – MNB
K-nearest neighbours – IBk
Support Vector Machines using optimization – SMO (Sequential Minimal Optimization)
C4.5 algorithm – J48

Initially, two authors, Peyo Yavorov and Dimcho Debelyanov, were classified, with 21 poems each and an approximately equal number of words: 2018 words for the first author and 2031 for the second. The two poets lived and worked in approximately the same period, the late 19th and early 20th centuries. The results show that SMO and J48 classified the authors 100% correctly, with the worst result coming from the k-nearest neighbours. The difference between the first and the last of the analyzed algorithms is 21%, while the difference between the first and the second place is 7% (Table 1).

TABLE 1. Classification of 2 authors with 21 poems each and about 2000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          21     2018     21       17       21       18       21
Dimcho Debelyanov     21     2031     18       20       21       15       21
Accuracy                              92.85%   88.09%   100%     78.57%   100%

(In this and the following tables, the NB-SMO columns give the number of accurately classified poems per author; the last row gives the percentage of accurately classified poems.)
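The accuracy rows in the tables are simply the number of correctly classified poems divided by the total number of poems. For the NB column of Table 1, for example (the paper appears to truncate rather than round the last digit, hence 92.85% for 39/42):

```python
def accuracy(correct_per_author, poems_per_author):
    """Percentage of correctly classified poems over all authors."""
    return 100 * sum(correct_per_author) / sum(poems_per_author)

# NB column of Table 1: 21 of Yavorov's and 18 of Debelyanov's 21 + 21 poems.
print(round(accuracy([21, 18], [21, 21]), 2))   # → 92.86
```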
With an almost twofold increase in the number of words, J48 is again at the top and IBk is last. Noticeable drops in the J48, IBk and SMO scores are observed compared to the previous study: -5%, -9.8214% and -6.25% respectively, while the difference in the case of NB is only 0.3571% (Table 2).

TABLE 2. Classification of two authors with approximately equal numbers of poems and about 4000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          38     4024     36       32       35       27       35
Dimcho Debelyanov     42     4017     38       38       41       28       40
Accuracy                              92.5%    87.5%    95%      68.75%   93.75%
With a further near-doubling of the number of words, now 16 040 in total, MNB and NB show the best results among the tested algorithms. The difference between MNB and NB is only 0.1379%, in favour of MNB (Table 3).

TABLE 3. Classification of two authors with about 8000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          73     8023     55       62       55       73       60
Dimcho Debelyanov     68     8017     18       20       51        3       52
Accuracy                              84.25%   84.39%   75.17%   53.90%   79.43%
With a doubling of the number of authors from two to four, i.e. of the number of classes, with approximately equal numbers of words (about 2000), MNB emerges as the winner, followed closely by NB, with SMO in third position by percentage of properly classified poems (Table 4). IBk is again last, and it is notable that its percentage is extremely low: only one of the authors is recognized, while two others are not recognized at all.

TABLE 4. Classification of 4 authors with about 2000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          21     2009     14       15       10        2       15
Dimcho Debelyanov     21     2031     18       18       17        0       15
Hristo Fotev          20     2022     18       19        9       20       18
Petko Slaveykov       21     2000     16       15       13        0       11
Accuracy                              79.51%   80.72%   59.03%   26.50%   71.08%
With an increase in the number of authors to eight, MNB again gives the highest percentage of properly classified poems, 4.84% better than the next two, NB and SMO, which have identical results (Table 5).

TABLE 5. Classification of 8 authors with about 2000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          21     2009     14       12       16        2       13
Dimcho Debelyanov     21     2031     10       12       10        0       12
Hristo Fotev          20     2022     15       16        9       18       17
Petko Slaveykov       21     2000     14       15       10        0       11
Pencho Slaveikov      20     2020      8        9        2        1        9
Geo Milev             21     2000      7        9        7        6        8
Lyuben Karavelov      19     2020     12       15       10        5       13
Nikolai Liliev        21     2000     19       19       11        0       16
Accuracy                              60%      64.84%   45.45%   19.39%   60%
In the classification of two other authors, Petko Slaveykov and Nikolay Liliev, with 21 poems each and about 2000 words each, who lived and worked in different periods (the first in the middle and at the end of the 19th century, the second at the beginning of the 20th century), we obtain different results (Table 6). The Naive Bayes classifier and J48 each erred on only a single poem, while SMO erred on four poems.

TABLE 6. Classification of another 2 authors with 21 poems each and about 2000 words each

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Petko Slaveykov       21     2000     20       21       21        9       20
Nikolai Liliev        21     2000     21       18       20       21       18
Accuracy                              97.61%   92.85%   97.61%   71.42%   90.47%
In the studies so far (Tables 1, 2, 3, 4, 5, 6) the number of poems and the number of words are roughly equal for the compared authors. When the number of poems and words of one of the authors is reduced significantly, J48 still classifies the authors 100% correctly (Table 7). The accuracy of SMO is almost 20% lower than in the case where the numbers of poems and words of the two authors are equal (Table 1).

TABLE 7. Classification of 2 authors with different numbers of poems

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov           5     2018      2        3        5        1        0
Dimcho Debelyanov     21     2031     21       21       21       15       21
Accuracy                              88.46%   92.30%   100%     61.53%   80.76%

Peyo Yavorov          15     4020      6        5       15        7        4
Dimcho Debelyanov     42     4017     42       42       42       31       42
Accuracy                              84.21%   82.45%   100%     66.66%   80.70%

In the case where the number of poems is equal but there is a substantial difference in the number of words (5129 vs. 1937), J48 again emerges with the highest accuracy. All classifiers, except J48, increase their precision by 3% to 10% when they are given a larger number of words (Table 8).

TABLE 8. Classification of 2 authors with equal numbers of poems and different numbers of words

Author              Poems   Words     NB       MNB      J48      IBk      SMO
Peyo Yavorov          20     5129     11       18       19       17       16
Dimcho Debelyanov     20     1937     20       18       20       11       20
Accuracy                              77.5%    90%      97.5%    70%      90%

By increasing the number of authors to 20, the number of poems to 585 (Fig. 1) and the number of words to 103 377, and eliminating the requirement for equal numbers of poems and words for each author, none of the five classifiers produces good results (Fig. 2).

[Figure 1. Number of poems of each of the 20 authors used in the classification (585 poems in total). The authors are: Peyo Yavorov, Dimcho Debelyanov, Hristo Fotev, Petko Slaveykov, Pencho Slaveykov, Geo Milev, Lyuben Karavelov, Nikolay Liliev, Hristo Botev, Hristo Smirnenski, Mara Belcheva, Stoyan Mihaylovski, Sirak Skitnik, Vasil Popovich, Konstantin Velichkov, Konstantin Miladinov, Georgi Rakovski, Dobri Chintalov, Rayko Genzifov and Stefan Stambolov.]

The accuracy of the Naive Bayes classifier and of the method of Support Vector Machines are the highest and approximately equal, about 55%. The accuracy of the J48 and MNB algorithms is almost three times lower than in the two-class cases. IBk is nearly 10 times less accurate than NB and SMO (Fig. 2).

[Figure 2. Accuracy of the classification of 20 authors using the algorithms NB, MNB, J48, IBk and SMO: NB 55.08%, SMO 54.99%, IBk 5%; MNB and J48 score roughly 33.39% and 27.75%.]

IV. CONCLUSION

The conducted studies show that in the classification of two classes and a small amount of data, the algorithms with the highest accuracy are C4.5 (J48) and SMO; their accuracy decreases with a larger volume of data. With 4 and 8 classes, the multinomial Bayes classifier is the most accurate. With 20 classes, the Naive Bayes classifier and the method of Support Vector Machines with optimization, using two Lagrange multipliers and a polynomial kernel function, have the highest accuracy, about 55%. The k-nearest neighbours (IBk) shows the lowest scores throughout the entire study.

REFERENCES
[1] Arabadzieva-Kalcheva N., Nikolov N., "Comparative analysis of the Naive Bayes classifier and sequential minimal optimization for classifying text in Bulgarian in machine learning", Computer Science and Technologies Journal, TU-Varna, 2017, pp. 97-105.
[2] Arabadzieva-Kalcheva N., Mateva Z., "Bayesian theory in machine learning", Annual Journal of the Technical University of Varna, 2016, pp. 130-133.
[3] Harrington P., Machine Learning in Action, 2012, pp. 105-106, p. 144.
[4] http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html
[5] Uitkin L., Machine Learning, 2017, pp. 6-8.
[6] McCallum A., Nigam K., "A comparison of event models for Naive Bayes text classification", Papers from the 1998 AAAI Workshop, 1998.
[7] Murphy K. P., Machine Learning: A Probabilistic Perspective, 2012, p. 34.
[8] Penev I., Karova M., Todorova M., "On the optimum choice of the K parameter in hand-written digit recognition by kNN in comparison to SVM", International Journal of Neural Networks and Advanced Applications, vol. 3, 2016.
[9] Platt J., "Fast Training of Support Vector Machines using Sequential Minimal Optimization", 1998, p. 44.
[10] Quinlan J. R., C4.5: Programs for Machine Learning, 1993.