Master ECD Report
AUTOMATICALLY ASSIGNING LABELS TO BOOKS
Student: Doan Mau Hien
Supervisor: Assoc. Prof. Dr. Do Thanh Nghi
Report content
1. Introduction
• Rationale of the study
• Related concepts
2. Related works
3. Methodology
• Support vector machines
• Latent local SVM
4. Experiments and results
5. Conclusion and future work
Introduction
Rationale of the study
• In Vietnam, there has been no research on automatic book classification using machine learning methods.
• Such classification has high practical value for many disciplines and interdisciplinary fields, such as information technology, information management, library science, and records and archive management.
Introduction
Related concepts: text classification (TC) and natural language processing (NLP)
• Text categorization is the assignment of natural language texts to one or more predefined categories based on their content.
• It is used to automatically assign labels to books and news, to classify and filter emails, etc.
Introduction
Text classification (cont.)
[Figure: process of text classification]
[Figure: a graphical view of text classification]
Related works
• Chen et al. (2009): Automatic book classification method combined with support vector machine and metadata.
Method: applies text categorization to automatic book classification. To distinguish books from other general documents, the data about books are divided into: description data (book title, introductions to the book and author); classification accuracy: 95%.
• Han, Hui, et al. "Automatic document metadata extraction using support vector machines." Digital Libraries, 2003. Proceedings. 2003 Joint Conference on. IEEE, 2003.
Method: SVM; text classification; dataset: metadata; classification accuracy: 92.9%.
• Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management, pages 148–155. ACM, 1998.
Method: SVM; text classification; dataset: Reuters; classification accuracy: 87%.
Related works (cont.)
• Thanh Nghi Do and François Poulet. Classifying very high-dimensional and large-scale multi-class image datasets with Latent-lSVM. In IEEE Cloud and Big Data Computing, Toulouse, France, 2016.
Method: proposes a new learning algorithm that uses support vector machines (SVM) to classify very-high-dimensional and large-scale multi-class datasets; this approach, however, is applied to images.
Methodology
• Preprocessing text and bag-of-words model
• Machine learning: support vector machine
• Latent local support vector machine
Methodology (cont.)
Preprocessing text and bag-of-words (BoW) model
Data preparation, text preprocessing and feature engineering:
• Analyzing vocabularies
• Separating words
• Extracting word features
• Representing the document data in table form so that algorithms can learn from it for classification
Ref. BoW (Salton et al., 1975); JVnTextPro (Nguyen et al., 2006)
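Methodology (cont.)
Preprocessing text and bag-of-words model
Example of dataset of documents

No. 1
Title: Cửa sổ âm nhạc (Music Window)
Abstract: Tìm hiểu về các nhạc sĩ và hoàn cảnh ra đời của những ca khúc đã đồng hành với nhiều thế hệ người yêu nhạc (Learning about the composers and the circumstances behind songs that have accompanied many generations of music lovers)
Keywords: Thư viện, Lịch sử âm nhạc Việt Nam (Library, History of Vietnamese music)
Subject: Âm nhạc (Music)

No. 2
Title: Bách khoa toàn thư về Thế giới (Encyclopedia of the World)
Abstract: Sách là quyển Bách khoa giới thiệu cho độc giả những điều chưa được biết trên Thế giới. (The book is an encyclopedia that introduces readers to little-known things about the world.)
Keywords: Bách khoa toàn thư (Encyclopedia)
Subject: Bách khoa toàn thư (Encyclopedia)

…

No. m
Title: The government of Japan
Abstract: This book describes the government of Japan, with emphasis on the period of readjustment since the peace treaty of 1952.
Keywords: Khoa học chính trị Nhật Bản (Political science, Japan)
Subject: Khoa học chính trị (Political science)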
Methodology (cont.)
Preprocessing text and bag-of-words model
Example of BoW model

No.   1 (nhạc)   2 (điều)   …   n (khoa)   Subjects
1     1          0          …   0          Âm nhạc
2     0          1          …   1          Bách khoa toàn thư
…     …          …          …   …          …
m     0          0          …   1          Khoa học chính trị

Ref. BoW (Salton et al., 1975)
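A minimal Python sketch of how a table like the one above could be built. It assumes the documents are already split into words (the slides use JVnTextPro for that step) and that each cell holds a word count; the slides do not say whether the values are counts or 0/1 flags. The function names and the toy documents are illustrative, not taken from the report.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Give each distinct word a 1-based column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def bow_row(tokens, vocab):
    """Turn one tokenized document into a dense row of word counts."""
    counts = Counter(tokens)
    row = [0] * len(vocab)
    for word, c in counts.items():
        if word in vocab:          # words outside the vocabulary are ignored
            row[vocab[word] - 1] = c
    return row

# Hypothetical, already-segmented toy documents.
docs = [["nhạc", "sĩ", "nhạc"], ["điều", "khoa"], ["khoa", "học"]]
vocab = build_vocabulary(docs)
matrix = [bow_row(d, vocab) for d in docs]
```

Each row of `matrix` corresponds to one book and each column to one vocabulary word; the subject label is kept separately as the class to predict.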
Methodology (cont.)
Preprocessing text and bag-of-words model
Bag-of-words model of books taken from the library, stored in sparse format with one book per line:
<label> <index1>:<value1> <index2>:<value2> ...
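The line layout above matches the sparse text format read by LibSVM and Liblinear: a class label followed by index:value pairs for the non-zero features. A small sketch of the conversion, assuming subjects have already been mapped to integer class ids (that mapping and the function name are illustrative):

```python
def to_sparse_line(label, row):
    """Format one BoW row as '<label> <index>:<value> ...', skipping zero entries.

    Feature indices are 1-based and emitted in increasing order.
    """
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0)
    return f"{label} {feats}".rstrip()

line = to_sparse_line(3, [2, 0, 1])
```

Skipping zeros matters here: with tens of thousands of vocabulary words, each book uses only a tiny fraction of the columns.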
Methodology (cont.)
Machine learning: support vector machine
Support vector machines for binary classification problems
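For reference, the standard soft-margin formulation of a binary SVM can be written as below. This is the textbook form, not copied from the slides; the symbols are assumptions: x_i is the BoW vector of book i, y_i ∈ {−1, +1} its label, m the number of training books, C the cost parameter, and ξ_i the slack variables.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\qquad i=1,\dots,m

\hat{y}(x) = \operatorname{sign}(w\cdot x + b)
```

A larger C penalizes margin violations more heavily; the second line is the resulting decision rule for a new book x.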
Methodology (cont.)
Machine learning: support vector machine
Support vector machines for multi-class problems
• Multi-class SVM (one-versus-all)
• Multi-class SVM (one-versus-one)
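The two multi-class strategies differ mainly in how many binary SVMs they train and how the outputs are combined. The sketch below counts the models for k classes and shows a simple majority vote for one-versus-one; the voting rule is the usual textbook choice and an assumption here, and all names are illustrative.

```python
from collections import Counter

def one_vs_all_models(k):
    """One-versus-all trains one binary classifier per class."""
    return k

def one_vs_one_models(k):
    """One-versus-one trains one binary classifier per unordered pair of classes."""
    return k * (k - 1) // 2

def one_vs_one_vote(pair_winners):
    """Pick the class that wins the most pairwise duels.

    pair_winners maps each (class_a, class_b) pair to the class its classifier chose.
    """
    votes = Counter(pair_winners.values())
    return votes.most_common(1)[0][0]

ova = one_vs_all_models(661)
ovo = one_vs_one_models(661)
winner = one_vs_one_vote({("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "C"})
```

With the 661 subject classes of the book dataset, one-versus-all needs 661 binary models while one-versus-one needs 218,130, which is one reason training cost matters at this scale.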
Latent local support vector machine
• We propose to use a new SVM learning algorithm, called Latent-lSVM, to classify very-high-dimensional input spaces and large-scale multi-class book datasets.
• Latent-lSVM partitions the full dataset into joint clusters; it is then easier to learn a non-linear SVM in each cluster to classify the data locally.
Global SVM model
Local SVM models
Training algorithm of latent local SVM models
Prediction of x with latent local SVM models
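The partition-then-train-locally structure can be sketched as follows. This is only an outline under stated assumptions, not the authors' implementation: `cluster_of` stands in for the clustering step (the report uses LDA topics), and `train_nearest_centroid` is a deliberately simple stand-in for the non-linear SVM trained in each cluster. All names are hypothetical.

```python
from collections import defaultdict

def train_local_models(X, y, cluster_of, train_fn):
    """Group training points by cluster, then fit one local model per cluster."""
    groups = defaultdict(lambda: ([], []))
    for xi, yi in zip(X, y):
        xs, ys = groups[cluster_of(xi)]
        xs.append(xi)
        ys.append(yi)
    return {c: train_fn(xs, ys) for c, (xs, ys) in groups.items()}

def predict_local(x, models, cluster_of):
    """Route x to its cluster and let that cluster's local model classify it."""
    return models[cluster_of(x)](x)

def train_nearest_centroid(xs, ys):
    """Stand-in local learner (NOT an SVM): predict the class with the closest mean."""
    sums, counts = {}, {}
    for xi, yi in zip(xs, ys):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))
    return predict

# Toy data: a hypothetical rule splits the points into two clusters on the first feature.
cluster_of = lambda x: 0 if x[0] < 5 else 1
X = [[0, 0], [1, 1], [0, 3], [1, 4], [9, 0], [10, 1], [9, 5], [10, 6]]
y = ["a", "a", "b", "b", "c", "c", "d", "d"]
models = train_local_models(X, y, cluster_of, train_nearest_centroid)
```

Each local model only sees the points and classes that fall in its own cluster, so every local problem is smaller than the global one. A point routed to a cluster with no training data would raise a `KeyError` in this sketch.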
Experiments
Software
• JVnTextPro (Java)
• LibSVM (C/C++)
• Liblinear (C/C++)
• Liblinear one-vs-one (C/C++)
• Latent-lSVM (C/C++)
All experiments were run on a PC with Linux Ubuntu 16, an Intel(R) Core i5-4590 at 3.3 GHz (4 CPUs) and 8 GB of main memory.
Experiments
Book dataset: English and Vietnamese books
• No. of books: 114,998
• Input attributes: 3
• Output classes: 661
Experiments
Word separation result of the book dataset (preprocessing)
• 64,073 words
• 661 classes
Experiments
Tuning parameters
• Latent-lSVM requires tuning the Dirichlet hyperparameters of LDA; good model quality has been reported for β = 0.01 and α = 50/T, where T is the number of topics (clusters).
• LibSVM: cost C = 4, kernel type t = 0 (linear), epsilon e = 0.1, n-fold cross-validation with V = 10.
• Liblinear: cost C = 4, epsilon e = 0.1, cross-validation mode with V = 10.
• Liblinear one-vs-one: cost C = 4, epsilon e = 0.1, V = 10.
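The settings above can be written down as a small configuration sketch. It assumes the stock command-line tools (`svm-train` for LibSVM, `train` for Liblinear) and their usual flags; the slides do not show the exact commands, the data file name is hypothetical, and no value of T is given in the report, so T is left as a parameter.

```python
def lda_hyperparameters(num_topics, beta=0.01):
    """Return the LDA priors quoted above: alpha = 50 / T and beta = 0.01."""
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    return {"alpha": 50.0 / num_topics, "beta": beta}

def libsvm_command(data_file):
    """Argument list for LibSVM's svm-train with the slide's settings.

    -c cost, -t kernel type (0 = linear), -e stopping tolerance, -v CV folds.
    """
    return ["svm-train", "-c", "4", "-t", "0", "-e", "0.1", "-v", "10", data_file]

def liblinear_command(data_file):
    """Argument list for Liblinear's train with the slide's settings."""
    return ["train", "-c", "4", "-e", "0.1", "-v", "10", data_file]

priors = lda_hyperparameters(50)          # T = 50 is only an example value
cmd = libsvm_command("books.train")       # hypothetical file name
```

The lists can be passed to `subprocess.run` once the tools are installed; with `-v 10` both tools report 10-fold cross-validation accuracy instead of writing a model file.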
Results
Classification results
Results
Training time of algorithms
Conclusion and future work
Conclusion
• Latent-lSVM achieves an accuracy of 70.14% in classifying the book dataset, which has 64,073 dimensions and 661 classes.
• Latent-lSVM performs better than the other algorithms compared.
Future work
• A distributed implementation needs to be developed for large-scale processing on an in-memory cluster-computing platform, namely Apache Spark.
• Apache Spark is a fast and general engine for large-scale data processing.
• Spark is used at a wide range of organizations to process large datasets.
Link: http://spark.apache.org/
Demo videos for research results:
• Liblinear: https://youtu.be/hkyl7j5faF0
• Liblinear 1-vs-1: https://youtu.be/bGlNzWNKvvE
• LibSVM: https://youtu.be/lnSHgYT7XEw
• Latent local SVM: https://youtu.be/diWVY7rdq6g
Thank you!
References
[1] Tom Betts, Maria Milosavljevic, and Jon Oberlander. The utility of information extraction in the classification of books. In European Conference on Information Retrieval, pages 295–306. Springer, 2007.
[2] S.Y. Chen, J.Y. Yeh, M.J. Hwang, X.J. Lin, H.R. Ke, and W.P. Yang. Automatic book classification method combined with support vector machine and metadata. International Journal of Advanced Information Technologies (IJAIT), 3(1):2–21, 2009.
[3] Thanh Nghi Do and François Poulet. Classifying very high-dimensional and large-scale multi-class image datasets with Latent-lSVM. In IEEE Cloud and Big Data Computing, Toulouse, France, July 2016.
[4] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management, pages 148–155. ACM, 1998.
[5] Sharon Givon and Maria Milosavljevic. Extracting useful information from the full text of fiction. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 633–638. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE, 2007.
[6] Hui Han et al. Automatic document metadata extraction using support vector machines. In Digital Libraries, 2003. Proceedings. 2003 Joint Conference on. IEEE, 2003.
[7] J. Huang. A study of book title feature extraction based on the automatic classification. Unpublished Master Thesis, Department of Library and Information Science of Fu-Jen Catholic University, Taipei.
[8] Hyunsoo Kim, Peg Howland, and Haesun Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6(Jan):37–53, 2005.
[9] Walid Magdy and Kareem Darwish. Book search: indexing the valuable parts. In Proceedings of the 2008 ACM workshop on Research advances in large digital book repositories, pages 53–56. ACM, 2008.
[10] Cam-Tu Nguyen, Xuan-Hieu Phan, and Thu-Trang Nguyen. JVnTextPro: A Java-based Vietnamese text processing tool. http://jvntextpro.sourceforge.net/, 2010.
[11] François Poulet and Thanh-Nghi Do. Mining very large datasets with support vector machine algorithms. In Enterprise Information Systems V, pages 177–184. Springer, 2004.
[12] Asim Roy. A classification algorithm for high-dimensional data. Procedia Computer Science, 53:345–355, 2015.
[13] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.