Malicious PDF Files Detection Using Structural and Javascript Based Features

Sonal Dabral1(✉), Amit Agarwal2, Manish Mahajan1, and Sachin Kumar3

1 Computer Science and Engineering, Graphic Era University, Dehradun, India
[email protected], [email protected]
2 Computer Science and Engineering, Indian Institute of Technology, Roorkee, India
[email protected]
3 Centre for Transportation Systems, Indian Institute of Technology, Roorkee, India
[email protected]

Abstract. Malicious PDF files have recently been considered one of the most dangerous threats to system security. The flexible, code-bearing nature of the PDF format enables an attacker to execute malicious code on a user's computer system. Many solutions have been developed by security agents to protect users' systems, but they remain inadequate. In this paper, we propose a method for malicious PDF file detection via a machine learning approach. The proposed method extracts features from the PDF file structure and from embedded JavaScript code, leveraging an advanced parsing mechanism. Instead of looking for a specific attack inside the content of the PDF, which is a quite complex procedure, we extract features that are often used in attacks. Moreover, we present experimental evidence for the choice of learning algorithm, which provides remarkably high accuracy compared to other existing methods.

Keywords: Machine learning · PDF · JavaScript · Malware

1 Introduction

Portable Document Format (PDF) is an electronic document format released in 1993 by Adobe Systems Inc. that allows the publishing and exchange of documents [1]. Today, PDF is very popular because it is the preferred means of exchanging documents between organizations and people such as students and professionals. Owing to its popularity, flexible structure and versatile functionality, it has also become a popular malware distribution vector for user exploitation, in attacks ranging from the server side to the client side. The interest of miscreants has shifted from server-side to client-side attacks, because these give the attacker a good opportunity to exploit client applications (e.g. PDF readers) that are not up to date. The goal is to take advantage of users' lack of security knowledge by fooling them into opening a malicious PDF document with applications found on most users' computers [2].

© Springer Nature Singapore Pte Ltd. 2017 S. Kaushik et al. (Eds.): ICICCT 2017, CCIS 750, pp. 137–147, 2017. https://doi.org/10.1007/978-981-10-6544-6_14


One of the most popular client applications for reading and exchanging documents is Adobe Reader. Attackers may exploit specific vulnerabilities of the reader application. In addition to exploiting the PDF reader's vulnerabilities, attackers also take advantage of advanced PDF features such as ‘/Launch’, which can automatically run an embedded script to handle OS-specific events, or ‘/GoTo’ and ‘/URI’, which can automatically open remote resources on the internet [3]. Attackers often use JavaScript code to divert the usual execution flow to malicious code; this can be done through buffer overflows, heap spraying and Return Oriented Programming (ROP) [4]. To bypass detection, attackers mainly use advanced encryption techniques so that they can easily hide malicious code or embedded files in a PDF [1].

Recent academic work on malicious PDF file detection falls into two categories: dynamic and static. One line of work detects malicious JavaScript code within PDF files using both dynamic and static methods [5–7]. Another line of work uses structure-based static analysis for malicious PDF detection [4, 9]. The advantage of the structural approach over JavaScript analysis is that it can detect non-JavaScript attacks and is not affected by code obfuscation, because it does not focus on analyzing the content itself. However, further research showed that structural detectors can be evaded through deliberate attacks [10]. Therefore, work has focused again on malicious JavaScript code detection [11].

This paper proposes a method based on machine learning for malicious PDF file detection in which we combine a PDF structure feature vector with a JavaScript feature vector, extracted from the PDF file structure and from the JavaScript embedded in the PDF file, respectively.
The set of PDF structure features includes general characteristics of the PDF structure as well as dynamic characteristics expressed by keywords such as ‘/JavaScript’, ‘/OpenAction’ and ‘/URI’; the JavaScript features are obtained from the JavaScript code in the PDF file. Because recent research shows that the vast majority of PDF-related vulnerabilities rely on JavaScript, we also analyze the JavaScript code inside the PDF file. Instead of looking for a specific attack inside the JavaScript code, however, we extract features of the code that are commonly used to conduct attacks. The extraction is carried out with the PDF analysis tool Origami, which overcomes the parsing-related weaknesses of prior work and provides significant features to the classifier for effective and enhanced detection of malicious PDF files. We evaluated different ensemble machine learning techniques to choose the classifier for our experiments. A good choice of ensemble classifier gives a significant improvement in malicious PDF file detection.

1.1 The PDF File Structure

A PDF file is a hierarchical structure of objects that are logically connected to each other. The structure of a PDF file determines how objects are stored in the file and how they are accessed and updated [1]. The PDF file structure consists of four parts, shown in Fig. 1.

• Header: gives the version number of the PDF specification used by the file.
• Body: the largest part of the file; it holds all the PDF objects and contains the data shown to the user.
• Cross reference table (CRT): gives the position of every indirect object; each object is represented by one entry in the table.
• Trailer: gives the location of the CRT and information about the root object.

Fig. 1. An example of PDF file structure
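The four parts listed above can be located in a raw file with a few pattern matches. The following is a minimal sketch, not the parser used in this paper: the function name describe_layout and the tiny hand-built sample are assumptions made for illustration, and a real PDF (with incremental updates, compressed object streams or a damaged cross reference table) needs a full parser.

```python
import re

def describe_layout(data: bytes) -> dict:
    """Locate the header, body objects, trailer and CRT offset in raw PDF bytes."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)           # Header: version number
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)  # Body: indirect objects
    startxref = re.search(rb"startxref\s+(\d+)", data)     # Trailer: pointer to the CRT
    return {
        "version": header.group(1).decode() if header else None,
        "indirect_objects": len(objects),
        "has_trailer": b"trailer" in data,
        "xref_offset": int(startxref.group(1)) if startxref else None,
    }

# Hypothetical one-object file, built by hand so that the offsets are consistent.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(len(body)).encode()
    + b"\n%%EOF\n"
)
layout = describe_layout(sample)
print(layout)
```

The offset stored after startxref points at the first byte of the cross reference table, which is how a reader finds every indirect object without scanning the whole body.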

2 Related Work

The increased prevalence of malicious documents has generated interest over the years in techniques for malware analysis of such documents. Previous research focused on two methods for malicious PDF detection: static and dynamic. Li et al. and Shafiq et al. [12, 13] presented methods for detecting embedded malcode in Word documents through static analysis using n-grams and introduced a novel dynamic run-time test, but the approach remains limited by the size of the malcode. This work was not designed for PDF files; it focused on other file formats such as doc and exe. Detection can be evaded by modern obfuscation methods such as AES encryption [1], and vulnerabilities can be exploited through techniques such as heap spraying and Return Oriented Programming (ROP) [4]. These exploitation methods are carried out using JavaScript code embedded in the PDF file. Therefore, researchers have mainly targeted the JavaScript code in PDF files.

Laskov and Šrndić [6] developed PJScan, a static analysis tool that detects malicious PDF documents through lexical analysis of JavaScript code. They used a machine learning approach, a One-Class Support Vector Machine, to automatically generate models from the available data for classifying test data. However, this approach showed a lower detection rate and could not analyze obfuscated code that behaves maliciously at execution time. To overcome this limitation, Snow et al. [14] proposed ShellOS, which uses dynamic analysis to detect code injection attacks at runtime. It uses hardware virtualization, which provides faster and more precise analysis of code and also enables detection of obfuscated code. Moreover, Tzermias et al. [7] demonstrated that antivirus systems are less effective at detecting malicious PDF documents. To build a more reliable detection system, they combined static and dynamic analysis and introduced MDScan, a standalone malicious PDF file scanner that focuses specifically on vulnerabilities. A similar approach was adopted by Schmitt et al. [15], who presented PDF Scrutinizer, a tool for detecting current malicious PDF files that showed a low false-positive rate. It mainly focuses on JavaScript-based attacks.

Dynamic analysis of JavaScript code can be computationally expensive and complex. To reduce cost and increase speed, research focused again on static analysis. Maiorca et al. [4] introduced PDF Malware Slayer (PDFMS), a static tool that analyzes the structure of PDF files through keywords and their occurrences. They tested Naive Bayes, SVM, J48 and Random Forests classifiers, and Random Forests provided the highest accuracy.
However, it has some structural weaknesses. Analyzing the structure of the PDF instead of looking for specific content provided a higher detection rate, but subsequent work by Maiorca et al. [10] showed that such detectors may be bypassed because of the complexity of the parsing mechanism. Because of these structural weaknesses, work focused again on the analysis of malicious JavaScript code. Corona et al. [11] presented Lux0R (“Lux 0n discriminant References”), a new approach for malicious JavaScript code detection that characterizes JavaScript code by its API references. Liu et al. [16] introduced a context-aware approach for detecting malicious JavaScript in PDF files based on static document instrumentation and runtime behavior monitoring.

3 Materials and Methods

In this section, we explain a method based on machine learning and static analysis in which we combine a PDF structure feature vector with a JavaScript feature vector, extracted from the PDF file structure and from the JavaScript code embedded in the PDF file, respectively. Our system architecture is shown in Fig. 2.


[Figure: a PDF is passed to the parser (Origami); the feature extractor performs Structure Analysis and EJSA*; the resulting features go to the classifier (ensemble methods: Bagging, AdaBoost M1, Stacking), which separates Malicious PDF Files from Benign PDF Files. * EJSA: Embedded JavaScript]

Fig. 2. Architecture of our system

3.1 Dataset Used

We collected a dataset of both malicious and benign PDF files from real, up-to-date samples: 4807 malicious files and 3745 benign files. The malicious samples were collected from Contagiodump [9], a popular repository that contains information about trending vulnerabilities and attacks in PDF files. The benign samples were collected through the Yahoo search engine API. Collecting data from source websites gives no assurance that all of the files are benign. The presence of malicious files in the benign dataset would produce undesirable results in the designed experiments. To diminish this risk, the whole benign dataset was scanned using antivirus software.

3.2 Features Extraction

To extract features, we developed a parser that builds on the Origami tool. This tool performs a deep scan of PDF files to extract features that attackers commonly use to hide malicious properties. We adopted this tool because it provides more reliable feature extraction than alternatives such as PdfID [17], which analyzes the PDF file without considering its logical properties and may therefore give attackers a good opportunity for easy manipulation. To extract features, we analyze each PDF file in the following two ways:

(1) Structure Analysis

In this phase, the parser analyzes the structure of the PDF file and searches for features that are significant for labeling a PDF file as malicious. This yields a set of features and their occurrence counts. Based on previous research [18, 19], the following features can be suspicious and are frequently used by attackers.

• JavaScript: JavaScript code can be embedded directly into an object within the PDF. Most malicious PDFs use JavaScript to exploit vulnerabilities in the reader or to create heap sprays. The ‘/JS’ and ‘/JavaScript’ keywords indicate the use of JavaScript in the PDF.
• Actions: A number of features, such as ‘/GoTo’, ‘/GoToR’ and ‘/GoToE’, can specify an action to be performed, for example activating a hypertext link.
• Triggers: Attackers can use a number of different triggers to execute harmful content within documents. An action is a common triggering mechanism; it is set through the ‘OpenAction’ key in the root object of the PDF file. The object that ‘OpenAction’ points to may be part of the attack.
• Launch: A document can open or print a file by launching an application, to handle OS-specific events. An attacker may misuse this feature to steal an organization's confidential data whenever the suspicious PDF file is opened.
• Form Action: PDF readers allow the ‘/SubmitForm’ action to send data from client to server. To take advantage of weaknesses in the victim's browser, this action can issue a request to a malicious site, which is then automatically displayed in the victim's browser and can cause it to malfunction.

(2) JavaScript Code Analysis

Our parser extracts the objects that contain JavaScript from the body of the PDF file. It then extracts the embedded JavaScript code and searches it for features that are often involved in carrying out an attack. Based on previous studies [6, 19, 20], we use the following set of features in our system.

• eval_length: the length of the longest string passed to an eval() function call. Malicious scripts use eval() to dynamically interpret code.


• max_string: the length of the longest string. The strings that malware writers use for shellcode are very long compared to the strings used in legitimate JavaScript.
• stringcount: the number of strings defined in the scripts. To obfuscate a script, malware writers break strings into many small strings.
• replace: the number of uses of the JavaScript replace() function, which is often used to obfuscate JavaScript code in malicious scripts.
• substring: the number of uses of the JavaScript substring() function, which is mostly used to obfuscate JavaScript code.
• eval: the number of uses of the JavaScript eval() function, which malicious scripts call to dynamically interpret JavaScript code.
• fromCharCode: converts Unicode values to characters; it is mostly used to obfuscate code.
• setTimeout(): can be used in place of eval() to run arbitrary JavaScript code after a given timeout.
• document.write and document.createElement: indicate the use of dynamic code execution.

3.3 Classification

To classify PDF files, the extracted features are passed to a classifier, which can be created by any learning algorithm. Previous researchers have combined the predictions of multiple learners to produce better results than any individual learning algorithm could produce [8]. We therefore tested ensemble methods such as Adaptive Boosting (AdaBoostM1), Bagging and Stacking [8]. These algorithms combine weak classification tree models, each with a particular weight, to create a stronger and more precise classifier. As the weak model we use a simple decision tree (J48, a supervised learning approach; Quinlan, 1996), because an ensemble of trees is more robust than a single tree. In addition, we provide exhaustive experimental evidence to determine which ensemble method best improves accuracy on our dataset.
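To make the ensemble idea concrete, the sketch below implements bagging with majority voting in plain Python. It is an illustration under stated assumptions, not the configuration evaluated in this paper: one-level decision stumps stand in for J48 trees, the three-column feature rows (JavaScript keyword count, OpenAction count, longest string) and their values are invented toy data, and the function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Return the one-feature threshold rule with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for above in (0, 1):  # label assigned when row[f] > t
                errors = sum((above if row[f] > t else 1 - above) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda row: above if row[f] > t else 1 - above

def train_bagging(X, y, n_models=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict(models, row):
    """Majority vote over the ensemble; 1 = malicious, 0 = benign."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Toy rows (assumed values): [JavaScript keyword count, OpenAction count, longest string]
X = [[0, 0, 12], [0, 1, 30], [0, 0, 8], [0, 0, 25],
     [2, 1, 900], [1, 1, 1500], [3, 1, 700], [1, 0, 1200]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = train_bagging(X, y)
predictions = [predict(models, row) for row in X]
print(predictions)
```

Boosting differs in that each new weak model is trained with higher weight on the examples the previous ones misclassified, and stacking replaces the vote with a meta classifier trained on the base models' outputs.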

4 Results and Discussion

In this section, we present two experiments. The first demonstrates the features extraction process. The second provides experimental evidence as to which classification method best improves detection accuracy. To do this, we first ran only the PDF structure features through different classifiers. Then we examined how the accuracy improved when JavaScript features were combined with the structure features. Furthermore, we compare the performance of the proposed method with previously developed tools for malicious PDF detection.

4.1 Experiment 1: Features Extraction

The goal of this experiment is to extract the feature vector from each PDF file. The Origami tool performs a deep scan of PDF files to extract features that are often used by miscreants. After scanning each PDF file in the malicious and benign datasets one by one, we obtained the results shown in Fig. 3.
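The structure-based part of the feature vector is, in essence, a table of keyword occurrence counts. The sketch below shows that idea on raw bytes. It does not use Origami and is deliberately simpler than the parser described in Section 3.2: the keyword list, the function name count_keywords and the sample bytes are illustrative assumptions. Plain byte matching of this kind misses names written with #xx hex escapes and keywords hidden inside compressed streams, which is one reason a logical parser is preferred.

```python
import re

# Keywords named in Section 3.2; an illustrative subset, not the full feature set.
KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/Launch",
            "/GoTo", "/GoToR", "/GoToE", "/SubmitForm", "/URI"]

def count_keywords(data: bytes) -> dict:
    """Count occurrences of each PDF name keyword in raw bytes."""
    counts = {}
    for kw in KEYWORDS:
        # The lookahead stops "/GoTo" from also matching "/GoToR" or "/GoToE".
        pattern = re.compile(re.escape(kw.encode()) + rb"(?![A-Za-z0-9])")
        counts[kw] = len(pattern.findall(data))
    return counts

# Hypothetical fragment of a PDF body.
sample = (b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj "
          b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj "
          b"3 0 obj << /S /GoToR /F (x.pdf) >> endobj")
structure_features = count_keywords(sample)
print(structure_features)
```

Each file then contributes one row of counts, in a fixed keyword order, to the dataset handed to the classifier.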

Fig. 3. Structure based features extraction result

After completing the structure feature vector extraction, we observed that a large share of the malicious PDF files used JavaScript to perform malicious actions. In our dataset, around 92.3% of the malicious samples contained JavaScript. We therefore carried out JavaScript features extraction with the Origami tool. The results are shown in Fig. 4.

4.2 Experiment 2: Detection Accuracy

Our tests were conducted with AdaBoost M1 (as a boosting ensemble), Bagging (as a bagging ensemble) and Stacking with two learning algorithms (J48 and IBk, with Logistic Regression as the meta classifier), using 10-fold cross validation repeated 10 times. We report our results as confusion matrices (the number of benign and malicious files classified correctly and incorrectly).
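The evaluation protocol, 10-fold cross validation repeated 10 times, can be sketched as an index generator. This is a sketch under assumptions: the splits are stratified by class (the text does not say whether stratification was used), the function name is hypothetical, and the 30/20 label list is a small stand-in for the real dataset.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for r in range(repeats):
        folds = [[] for _ in range(k)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)            # new shuffle on every repetition
            for pos, i in enumerate(shuffled):
                folds[pos % k].append(i)     # deal each class evenly across folds
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield r, f, train, test

# Stand-in labels; the real dataset has 4807 malicious and 3745 benign files.
labels = ["malicious"] * 30 + ["benign"] * 20
splits = list(repeated_stratified_kfold(labels))
print(len(splits))
```

Every file is used for testing exactly once per repetition, so the confusion-matrix counts of one repetition sum to the dataset size.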


Fig. 4. JavaScript based features extraction result
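The JavaScript-based features of Section 3.2 are mostly counts and lengths, so they can be approximated with regular expressions once the embedded code has been extracted. The sketch below is a simplification and not the Origami-based extractor: the function name js_features and the sample script are assumptions, and eval_length is computed only for string literals passed directly to eval(), whereas real scripts often pass a variable.

```python
import re

DQ = r'"(?:\\.|[^"\\])*"'   # double-quoted string literal
SQ = r"'(?:\\.|[^'\\])*'"   # single-quoted string literal
STRING = re.compile("(?:" + DQ + "|" + SQ + ")")
EVAL_LITERAL = re.compile(r"\beval\s*\(\s*(" + DQ + "|" + SQ + ")")

def js_features(code: str) -> dict:
    """Approximate the Section 3.2 JavaScript features with regular expressions."""
    strings = [m.group(0)[1:-1] for m in STRING.finditer(code)]

    def calls(name):
        # Number of call sites such as name( or obj.name(
        return len(re.findall(r"\b" + re.escape(name) + r"\s*\(", code))

    eval_args = [m.group(1)[1:-1] for m in EVAL_LITERAL.finditer(code)]
    return {
        "eval_length": max((len(a) for a in eval_args), default=0),
        "max_string": max((len(s) for s in strings), default=0),
        "stringcount": len(strings),
        "replace": calls("replace"),
        "substring": calls("substring"),
        "eval": calls("eval"),
        "fromCharCode": calls("fromCharCode"),
        "setTimeout": calls("setTimeout"),
    }

# Hypothetical embedded script.
code = ('var s = "AB" + "CD"; var p = unescape("%u9090%u9090"); '
        'eval("x=1"); s = s.replace("A", "B"); var c = String.fromCharCode(65);')
script_features = js_features(code)
print(script_features)
```

Concatenating these values with the structural keyword counts gives the complete feature vector evaluated in Table 2.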

First, the structure feature vector dataset was run through the different classifiers. This gives the following results (Table 1).

Table 1. Result of structure features

                     AdaBoostM1   Bagging    Stacking
True positives       4498         4471       4493
False positives      309          336        314
True negatives       2990         2936       2976
False negatives      755          809        767
TP rate              0.876        0.866      0.873
FP rate              0.141        0.152      0.144
ROC area             0.945        0.934      0.940
Detection accuracy   87.5584%     86.6113%   87.3363%

Further, we tested how well the complete feature vector dataset (structure features and JavaScript features) performed at the classification task. The results are shown in Table 2.

Table 2. Result of complete features (structure features and JavaScript based features)

                     AdaBoostM1   Bagging    Stacking
True positives       4753         4742       4744
False positives      54           65         63
True negatives       3666         3603       3670
False negatives      79           142        75
TP rate              0.984        0.976      0.984
FP rate              0.017        0.0287     0.017
ROC area             0.998        0.993      0.995
Detection accuracy   98.4448%     97.5795%   98.3863%
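As a worked check, the rates in Tables 1 and 2 can be recomputed from the counts. For AdaBoostM1 the first two rows sum to 4807 (the number of malicious files) and the next two to 3745 (the number of benign files), which suggests, as an assumption rather than something the text states, that each pair holds the correctly and incorrectly classified files of one class and that the TP and FP rates are class-weighted averages. Under that reading the reported AdaBoostM1 figures are reproduced:

```python
def weighted_rates(mal_correct, mal_wrong, ben_correct, ben_wrong):
    """Accuracy and class-weighted TP/FP rates from per-class counts."""
    n_mal = mal_correct + mal_wrong
    n_ben = ben_correct + ben_wrong
    total = n_mal + n_ben
    accuracy = (mal_correct + ben_correct) / total
    # Per-class true positive rates (recall of each class).
    tpr_mal, tpr_ben = mal_correct / n_mal, ben_correct / n_ben
    # Per-class false positive rates: share of the other class assigned to this one.
    fpr_mal, fpr_ben = ben_wrong / n_ben, mal_wrong / n_mal
    weighted_tpr = (n_mal * tpr_mal + n_ben * tpr_ben) / total
    weighted_fpr = (n_mal * fpr_mal + n_ben * fpr_ben) / total
    return accuracy, weighted_tpr, weighted_fpr

# AdaBoostM1 columns of Table 1 (structure features) and Table 2 (complete features).
acc1, tpr1, fpr1 = weighted_rates(4498, 309, 2990, 755)
acc2, tpr2, fpr2 = weighted_rates(4753, 54, 3666, 79)
print(round(100 * acc1, 2), round(tpr1, 3), round(fpr1, 3))
print(round(100 * acc2, 2), round(tpr2, 3), round(fpr2, 3))
```

With these counts the class-weighted TP rate equals the overall accuracy, which is why the two rows agree to three decimals in both tables.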


As can be seen, combining the structure feature vector with the JavaScript feature vector gives better detection accuracy than using the structure features alone. To evaluate the proposed method, we compare it with previously developed tools for malicious PDF detection: Wepawet, PDFMS, PJScan, MDScan and PDF Scrutinizer. The result is shown in Table 3. For each method, we show the true positive rate (TPR) and the false positive rate (FPR). Our system achieves a higher TPR than Wepawet, PJScan, MDScan and PDF Scrutinizer.

Table 3. Comparison of the proposed method with previous tools.

System            TPR      FPR
Proposed method   0.984    0.017
Wepawet           0.8892   0.032
PJScan            0.7194   0.011
MDScan            0.8934   0
PDFMS             0.9955   0.0251
PDF Scrutinizer   0.9      0

PJScan, MDScan and PDF Scrutinizer show the smallest FPRs, but their detection rates are lower than those of PDFMS and the proposed method. PDFMS shows the highest TPR but a higher FPR than the proposed method. The proposed method also performs better than Wepawet in terms of both TPR and FPR. Overall, this indicates that the proposed method offers a better balance of TPR and FPR than these tools.

5 Conclusions

In the past few years, malicious PDF files have become one of the most serious threats, providing a very effective attack vector for malware writers. In this paper, we have proposed a method that uses machine learning techniques for malicious PDF file detection. Instead of relying only on structural properties of the PDF file, we also used JavaScript-based features to improve detection accuracy. In addition, we presented experimental evidence as to which learning algorithm best improves detection accuracy. Finally, we compared our method with other academic tools. The high detection accuracy of our method shows that it compares favorably with these tools.

References

1. Adobe: PDF reference, Adobe portable document format version 1.7 (2006)
2. Symantec: Malware security report: protecting your business, customers, and the bottom line. Symantec (2010)
3. Filiol, E., Blonce, A., Frayssignes, L.: Portable document format (PDF) security analysis and malware threats. J. Comput. Virol. 3, 75–86 (2007)
4. Maiorca, D., Giacinto, G., Corona, I.: A pattern recognition system for malicious PDF files detection. In: International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 510–524 (2012)
5. Esparza, J.M.: Obfuscation and (non-)detection of malicious PDF files. In: S21Sec e-crime (2011)
6. Laskov, P., Šrndić, N.: Static detection of malicious JavaScript-bearing PDF documents. In: Proceedings of the 27th Annual Computer Security Applications Conference, pp. 373–382, December 2011
7. Tzermias, Z., Sykiotakis, G., Polychronakis, M., Markatos, E.P.: Combining static and dynamic analysis for the detection of malicious documents. In: Proceedings of the Fourth European Workshop on System Security, p. 4 (2011)
8. Tiwari, A., Prakash, A.: Improving classification of J48 algorithm using bagging, boosting and blending ensemble methods on SONAR dataset using WEKA. Int. J. Eng. Tech. Res. 2, 207–209 (2014)
9. Mila: Contagio Malware Dump. http://contagiodump.blogspot.in/2010/08/Maliciousdocuments-archive-for.html. Accessed 10 Oct 2014
10. Maiorca, D., Corona, I., Giacinto, G.: Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. In: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pp. 119–130 (2013)
11. Corona, I., Maiorca, D., Ariu, D., Giacinto, G.: Lux0R: detection of malicious PDF-embedded JavaScript code through discriminant analysis of API references. In: Proceedings of the 2014 Workshop on Artificial Intelligence and Security, pp. 47–57. ACM, November 2014
12. Li, W.-J., Stolfo, S., Stavrou, A., Androulaki, E., Keromytis, A.D.: A study of malcode-bearing documents. In: Proceedings of the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (2007)
13. Shafiq, M.Z., Khayam, S.A., Farooq, M.: Embedded malware detection using Markov n-grams. In: Zamboni, D. (ed.) DIMVA 2008. LNCS, vol. 5137, pp. 88–107. Springer, Heidelberg (2008). doi:10.1007/978-3-540-70542-0_5
14. Snow, K.Z., Krishnan, S., Monrose, F., Provos, N.: SHELLOS: enabling fast detection and forensic analysis of code injection attacks. In: USENIX Security Symposium, pp. 183–200, August 2011
15. Schmitt, F., Gassen, J., Gerhards-Padilla, E.: PDF SCRUTINIZER: detecting JavaScript-based attacks in PDF documents. In: 10th Annual International Conference on Privacy, Security and Trust (PST), pp. 104–111. IEEE, July 2012
16. Liu, D., Wang, H., Stavrou, A.: Detecting malicious JavaScript in PDF through document instrumentation. In: 44th IFIP International Conference on Dependable Systems and Networks (DSN), pp. 100–111. IEEE (2014)
17. Stevens, D.: PDF Tool. http://blog.didierstevens.com/programs/pdf-tools/
18. Stevens, D.: Malicious PDF analysis ebook, September 2010. http://didierstevens.com/files/data/malicious-pdf-analysis-ebook.zip. Accessed 22 Sep 2015
19. Kittilsen, J.: Detecting malicious PDF documents. Master thesis, Gjovik, Norway, pp. 1–112, December 2011
20. Cova, M., Kruegel, C., Vigna, G.: Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of International Conference on World Wide Web, pp. 281–290, July 2010