Using Semantic Templates to Study Vulnerabilities Recorded in Large Software Repositories

Yan Wu

Robin A. Gandhi

Harvey Siy

College of Information Science & Technology Peter Kiewit Institute University of Nebraska at Omaha USA. 68182-0500

Nebraska University Center for Information Assurance (NUCIA) Peter Kiewit Institute University of Nebraska at Omaha USA. 68182-0500

Department of Computer Science Peter Kiewit Institute University of Nebraska at Omaha USA. 68182-0500


ABSTRACT
Software repositories are rich sources of information about vulnerabilities that occur during a product’s lifecycle. Although available, such information is scattered across numerous databases. Furthermore, in large software repositories, a single vulnerability may span multiple components and have multidimensional interactions with other vulnerabilities. Thus, identifying the patterns of vulnerability occurrence in the larger context of software development continues to be an open problem. Here we present findings from our study of vulnerable software components using an ontology-guided analysis of vulnerabilities recorded in a software project's code repository. In this approach, a semantic template for each type of vulnerability is created from information in the Common Weakness Enumeration dictionary. Next, known vulnerabilities and related concepts in the repository are tagged with concepts from the template. Based on the characteristics of the resources affected by these vulnerabilities, other similar resources in the software can be identified for closer inspection and verification. We present results from our study of vulnerabilities in the Apache web server.

Categories and Subject Descriptors
D.2.9 [Software Engineering]: Management - life cycle; H.3.4 [Information Systems]: Information Storage and Retrieval - systems and software

General Terms
Reliability, Security.

Keywords
CWE, CVE, software repository, buffer overflow, semantic template, vulnerability, fix patterns, ontology.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SESS’10, May 2, 2010, Cape Town, South Africa. Copyright © 2010 ACM 978-1-60558-965-7/10/05…$10.00

1. INTRODUCTION
Vulnerabilities in software products present opportunities for malicious attacks that compromise the integrity of the software and security of its data. Most projects dedicate some effort to document, track and study reported vulnerabilities. This information is recorded in existing project repositories such as change logs in version control systems, entries in bug tracking systems and communication threads in mailing lists. As these repositories were created for different purposes, it is not straightforward to extract useful vulnerability-related information. In large projects, these repositories store vast amounts of data, and the relevant information is often buried in a mass of other data. Natural language descriptions and discussions offer no mechanism to aggregate vulnerability artifacts from multiple sources or to pinpoint the actual software fault and the affected software elements. While a significant body of knowledge exists for classifying and categorizing software weaknesses, it is hardly applied in the context of a software project. Little or no effort has been made to improve the mental model of the software developer to sense the possibility of vulnerability in the face of growing software complexity.

We are faced with two problems: information overload in software repositories and, paradoxically, lack of information or security know-how among project stakeholders. The large volume of data in software repositories and other project information sources makes it difficult to locate the artifacts needed to identify, track and study previously recorded vulnerabilities. This condition is compounded by the fact that the complete record of information is scattered over several separate systems with different information schemas and natural language descriptions. Even if the information is found, a significant amount of work is needed to reconstruct the trail of artifacts that help understand the actual vulnerability. Thus, the information within software project repositories is not in a representation that can be easily extracted and analyzed for vulnerability-related questions.

We propose organizing the information in project repositories around semantic templates. We define semantic templates to be generalized patterns of relationship between software elements and faults, and their association with known higher level phenomena in the security domain. Semantic templates enhance the existing software project repositories by keeping track of relevant details and relating the information back to public vulnerability databases. In this approach, semantic templates for major vulnerability types are abstracted from information in the Common Weakness Enumeration dictionary [3]. Next, known vulnerabilities, related concepts in the project repository and descriptions of the corresponding fixes are tagged with concepts from the template. Based on the characteristics of the resources affected by these vulnerabilities, other similar resources in the software are identified for closer inspection and verification. We illustrate this approach by showing some results from our study of the buffer overflow vulnerabilities in the Apache web server.

2. BACKGROUND AND RELATED WORK
Software project repositories provide rich insights into the development history of a software product. Examples include version control systems, Integrated Development Environments (IDEs), issue and bug tracking systems, and mailing lists. Version control systems store the history of modifications done to the code base. In addition to storing the actual code changes, these typically also record who made the change, when it was made, what files were changed, and why the change was made. Version control data provide information about how a vulnerability was fixed. In most cases, the vulnerable code can be identified as it is likely to be in the vicinity of the code change, if not the code that was replaced by the new code. Issue and bug tracking systems record requests for changes or fixes from people inside and outside the development organization. This can be a request to fix a bug or a request to enhance some functionality. The information typically includes who submitted the bug or change request, when it was submitted and resolved, and the resolution selected. Vulnerability issues may appear as bug fix requests in this database.

To make sense of the large quantity of data stored within such systems, many efforts for organizing software repositories have been suggested. The semantics of repository data is determined by the diverse and complex interrelationships between various software entities, e.g., classes, functions, files, bugs, versions, etc. Hence, most proposals provide different ways of combining these pieces of data to identify related bug data or related change data, connect bug data to change data, connect email data to change data, and so on. The augmented information is then used to provide useful inferences such as bug prediction, change impact assessment, identification of co-changed entities, identification of developer expertise, change effort estimation, etc. Given the usefulness of augmenting the raw repository data with these semantic relationships, it makes sense to store this augmented information in some form of enhanced repository. Examples of such systems are Kenyon [1] and TA-RE [8]. These systems ease the fact extraction process by providing a repository that combines multiple sources of information, particularly from version control data and information extracted from parsing individual source code versions. Most recently, bug fix patterns in multiple Java projects have been examined using source code [9].

Other research efforts have focused on modeling software repository data using ontologies. An ontology is a formal explicit specification of a shared conceptualization [6]. Ontologies provide a way of organizing and encoding the collected knowledge for a given domain. Ontology languages such as the Web Ontology Language (OWL) [13] enable the description of relationships that are constrained via description logic axioms, making it possible to formally qualify when two items are related. Most ontologies also provide built-in inference tools for ease of querying for related information. One such example is EvoOnt [7], an OWL-based ontology that includes representations for software entities (e.g., classes, functions, variables, etc.), version control data and bug information. Each of these systems provides an infrastructure that makes it possible to ask higher level queries by reducing the incidental and effort-intensive task of data extraction. In concept, the proposed semantic templates take these approaches one step further by narrowing the possible queries under consideration down to one domain (security) and augmenting the information with higher level, domain-specific data. In our current work, we focused on two sources for domain-specific information: CWE (Common Weakness Enumeration) and CVE (Common Vulnerabilities and Exposures). CWE is a community-driven and continuously evolving taxonomy of vulnerability types, called weaknesses. The CWE vision is to enable a more effective discussion, description, selection, and use of software security tools and services that can find weaknesses in source code and operational systems as well as better understanding and management of software weaknesses related to architecture and design [3]. Closely related to the CWE, the CVE is a growing compilation of known information security vulnerabilities and exposures as reported by software development organizations, coordination centers, developers and individuals at large. It provides a common standard identifier for each discovered vulnerability to enable data exchange between security products and provide a baseline for evaluating coverage of tools and services [4].

3. METHODOLOGY
Software products with ongoing distributed development processes and disciplined release cycles rely heavily on software repositories to manage the activities of stakeholders involved in the software development and maintenance lifecycle around its code base. Following the discovery of a vulnerability, the software repository also provides a medium to record the communications among developers and testers, comments in the code, and fixes performed in response to the vulnerability. Thus, the software repository over the software product lifecycle becomes a rich collection of artifacts that can be examined to study vulnerability patterns and corresponding code fixes. However, the associated information overload along with the usage of natural language to describe vulnerable conditions does not facilitate systematic analysis.

In response to each discovered software bug, common literature [11] suggests asking the following questions: 1) Is this mistake somewhere else also? 2) What next bug is hidden behind this one? 3) What should I do to prevent bugs like this? These questions are even more relevant to the study of security vulnerabilities. In addition, from a secure software assurance perspective, it becomes necessary to raise additional questions: 4) What measured quality of the software system does the bug relate to? 5) What is the revised confidence in the software system after the planned fixes?

Rigorous study and analysis of vulnerability artifacts and corresponding fixes recorded in software repositories can guide the selection of software security tools, use of secure coding practices and training scenarios that can address these weaknesses. Such analysis can benefit our understanding of source code evolution as well as better management of software weaknesses related to architecture and design. The lessons learned from the study of past vulnerabilities of one project can also generate intuitive hypotheses about projects in general. In this section we describe our methodology to produce semantic templates to study vulnerabilities in large software repositories. A semantic template is a human and machine understandable representation of the following: 1) software faults that lead to a weakness; 2) resources that a weakness affects; 3) weakness characteristics; and 4) consequences/failures resulting from the weakness. The purpose of creating a semantic template for each vulnerability type is to enable capabilities for fully or semi-automated annotation of evidence related to a confirmed vulnerability in a software product whose code base is maintained using software repositories. The semantic template is the result of a top-down as well as bottom-up approach to describe a vulnerability. In a top-down fashion, we rely on the Common Weakness Enumeration (CWE) [3] to provide a unified, measurable and inter-dependent set of software weaknesses. In a bottom-up fashion, we rely on the recorded cases of vulnerabilities available through the Common Vulnerabilities and Exposures (CVE) [4], a dictionary of publicly known information security vulnerabilities and exposures. Figure 1 provides an overview of our template production process as an SADT diagram.
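The four-part definition of a semantic template given above can be made concrete with a small data structure. The following is a minimal sketch in Python, not the representation used in this work: the class names, field names and the helper `concepts_for_cwe` are hypothetical, and the placement of the four sample concepts and the single CAN-PRECEDE link are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """One node of a template, traceable to one or more CWE entries."""
    name: str
    cwe_ids: tuple  # CWE identifiers, e.g. (787, 788, 124)


@dataclass
class SemanticTemplate:
    """The four dimensions named in the text, plus typed links between concepts."""
    vulnerability_type: str
    software_faults: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)
    consequences: list = field(default_factory=list)
    # Each relation is (source concept name, relationship label, target concept name).
    relations: list = field(default_factory=list)

    def concepts_for_cwe(self, cwe_id):
        """Return the names of all concepts that trace back to a CWE entry."""
        every = (self.software_faults + self.resources
                 + self.weaknesses + self.consequences)
        return [c.name for c in every if cwe_id in c.cwe_ids]


# Illustrative instance; the grouping below is an assumption, not a transcription.
template = SemanticTemplate(
    vulnerability_type="buffer overflow",
    software_faults=[Concept("improper input validation", (20,))],
    resources=[Concept("heap-based memory buffer", (122,))],
    weaknesses=[Concept("out-of-bounds write", (787, 788, 124))],
    consequences=[Concept("write-what-where condition", (123,))],
    relations=[("improper input validation", "CAN-PRECEDE", "out-of-bounds write")],
)

print(template.concepts_for_cwe(20))  # ['improper input validation']
```

Keeping the CWE identifiers on every concept is what makes the structure both human readable and machine queryable, which is the property the definition asks for.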

[Figure 1: The Template Production Activity Model. SADT diagram, not reproduced. The activity "Template Production" yields the output "Semantic Template"; the other labels in the diagram are "Domain Knowledge", "CWE" and "CVE" on the input and control arrows, and "Ontological Engineering" as the supporting mechanism.]

The template production process is conducted as follows:

STEP 1: The first step in the template production process involves the extraction of the chosen weakness, its interdependent weaknesses and related categorizations from the CWE. Different views [3] of the CWE (e.g. research view and development view) are also separated and analyzed.

STEP 2: In step two, the extracted CWE concepts are analyzed to identify the software faults that lead to the weakness, resources that the weakness affects, characteristics of the weakness, and consequences/failures resulting from the weakness. These concepts are identified based on the relationships of the weakness under consideration with other concepts in the CWE.

STEP 3: Step three involves identifying multiple CVE entries about an actual vulnerability related to the weakness for which the semantic template is being produced. The organization of the semantic template is then fine tuned based on its ability to sufficiently describe each reported vulnerability. In our current work, we focus only on the CVE vulnerabilities reported for the Apache web server from the CVE website as well as vulnerability descriptions in the Apache software repositories.

The availability of a semantic template enables us to systematically assimilate the information pieces related to the discovery of a vulnerability and related artifacts. Currently we examine the change history repository, CVE descriptions, bug report discussions, and differences of source code files before and after fixes are introduced for the vulnerability. This collection of information allows us to examine fundamental questions regarding: How do software flaws lead to a vulnerability? What are the consequences of exploiting the vulnerability? How were they exploited? What resources were involved? How were they fixed? Are the applied fixes sufficient? What project specific measures can be produced for the CWE weakness categories that the vulnerability is related to? How do the discovered vulnerability and its fix revise our confidence in the software system? What other weaknesses might still remain? What steps should be taken to prevent the vulnerabilities in general? Can tools be developed using the discovered patterns? While our current work is preliminary in nature, to illustrate the feasibility of our approach, in the following section we describe the production of a buffer overflow [12] semantic template. We then apply the template to aggregate artifacts for a buffer overflow vulnerability in the Apache web server.

4. BUFFER OVERFLOW SEMANTIC TEMPLATE
Buffer overflow has been the single most exploited weakness of the past 25 years. It occurs upon failure to constrain program operations within the bounds of a memory buffer. CWE entry #119 further describes that this weakness occurs because certain languages allow direct addressing of memory locations and do not automatically ensure that these locations are valid for the memory buffer that is being referenced [3]. This condition results in the ability for attackers to perform read and write operations on sensitive memory areas used for controlling program behavior. The fundamental reason why a buffer overflow weakness can be exploited is that data and code are both numerically encoded and stored in the same memory in the von Neumann computer architecture [10].

4.1 STEP 1: CONCEPT EXTRACTION
To produce a buffer overflow semantic template, within the CWE we first locate the weaknesses that pertain to failure to constrain program operation within the bounds of a memory buffer. This corresponds to CWE# 118 and 119 as shown in Figure 2.

Figure 2: CWE Research View with Weaknesses related to Buffer Overflow Highlighted

CWE organizes weaknesses using multiple views to accommodate differing perspectives. For example, the Development view organizes weaknesses around concepts that are frequently used or encountered in software development, such as Location and Motivation [3]. On the other hand, the Research view is intended to facilitate research into weaknesses, including their interdependencies and their role in vulnerabilities [3]. Its classification of weaknesses focuses on abstractions of software behaviors with a deep hierarchical organization. The Research view in CWE version 1.6 [3] is shown in Figure 2. Other than the higher level abstractions, several specific weaknesses are shared between the Research and Development views in CWE.
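Because each CWE relationship belongs to a particular view, collecting the weaknesses related to a starting entry amounts to a graph walk restricted to one view. The sketch below illustrates this in Python under stated assumptions: the edge list is a small hand-written sample chosen for illustration, not a transcription of CWE 1.6, and the function name `related_weaknesses` is hypothetical. In this work the navigation was carried out manually.

```python
from collections import deque

# (source CWE, relationship, target CWE, view). Sample edges for illustration only.
EDGES = [
    (119, "ChildOf", 118, "research"),
    (120, "ChildOf", 119, "research"),
    (125, "ChildOf", 119, "research"),
    (190, "CanPrecede", 119, "research"),
    (120, "ChildOf", 119, "development"),
]


def related_weaknesses(start, view):
    """Breadth-first walk over every edge of one view, in either direction."""
    neighbours = {}
    for src, _relationship, dst, edge_view in EDGES:
        if edge_view == view:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {start})


print(related_weaknesses(119, "research"))     # [118, 120, 125, 190]
print(related_weaknesses(119, "development"))  # [120]
```

The two calls return different sets for the same starting entry, which is the practical effect of keeping the views separate during extraction.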

With the primary weakness categories relevant to buffer overflow identified in the different CWE views, all related weaknesses are manually extracted by carefully navigating the hierarchical and non-taxonomical interdependencies and by keyword search in the CWE documentation. Because we follow explicit dependencies modeled in the CWE, subjectivity in the selection of weaknesses is avoided. CWE annotation of weaknesses as class (abstract), base (has details about detection and prevention) and variant (limited to a specific language or technology) further helps the effort. This step results in an extensive collection of concepts (abstract and specific) that help to understand buffer overflows. Figure 3 depicts the concepts relevant to buffer overflows discovered by initiating our navigation from CWE# 119. Figure 3 speaks volumes about the complexity of the “mental model” that developers need to be aware of to understand the consequences of their coding and design decisions. Although hyperlinked, navigating the CWE documentation and various graphical representations is tedious and non-intuitive. While different CWE views help to accommodate multiple perspectives, they add an additional layer of complexity. The CWE is comprehensive; however, the current nested structure and tangled contents are confusing.

[Figure 3: CWE Development and Research view for Buffer Overflow. Diagram not reproduced; the labels recovered from it are listed below.]
Legend: CHILD OF (RESEARCH VIEW); CHILD OF (DEVELOPMENT VIEW); CAN PRECEDE (RESEARCH VIEW); CAN PRECEDE (DEVELOPMENT VIEW); PEER OF (RESEARCH VIEW); CATEGORY (RESEARCH VIEW); CATEGORY (DEVELOPMENT VIEW).
Nodes: CWE-19 DATA HANDLING; CWE-20 IMPROPER INPUT VALIDATION; CWE-118 IMPROPER ACCESS OF INDEXABLE RESOURCE ('RANGE ERROR'); CWE-119 FAILURE TO CONSTRAIN OPERATIONS WITHIN THE BOUNDS OF A MEMORY BUFFER; CWE-120 BUFFER COPY WITHOUT CHECKING SIZE OF INPUT ('CLASSIC BUFFER OVERFLOW'); CWE-121 STACK-BASED BUFFER OVERFLOW; CWE-122 HEAP-BASED BUFFER OVERFLOW; CWE-123 WRITE-WHAT-WHERE CONDITION; CWE-124 BUFFER UNDERWRITE ('BUFFER UNDERFLOW'); CWE-125 OUT-OF-BOUNDS READ; CWE-126 BUFFER OVER-READ; CWE-127 BUFFER UNDER-READ; CWE-128 WRAPAROUND ERROR; CWE-129 IMPROPER VALIDATION OF ARRAY INDEX; CWE-130 IMPROPER HANDLING OF LENGTH PARAMETER INCONSISTENCY; CWE-131 INCORRECT CALCULATION OF BUFFER SIZE; CWE-134 UNCONTROLLED FORMAT STRING; CWE-170 IMPROPER NULL TERMINATION; CWE-190 INTEGER OVERFLOW OR WRAPAROUND; CWE-191 INTEGER UNDERFLOW (WRAP OR WRAPAROUND); CWE-192 INTEGER COERCION ERROR; CWE-193 OFF-BY-ONE ERROR; CWE-194 UNEXPECTED SIGN EXTENSION; CWE-195 SIGNED TO UNSIGNED CONVERSION ERROR; CWE-196 UNSIGNED TO SIGNED CONVERSION ERROR; CWE-199 INFORMATION MGMT. ERRORS; CWE-221 INFORMATION LOSS OR OMISSION; CWE-227 API ABUSE; CWE-231 IMPROPER HANDLING OF EXTRA VALUES; CWE-242 USE OF DANGEROUS FUNCTIONS; CWE-251 STRING MGMT. MISUSE; CWE-415 DOUBLE FREE; CWE-416 USE AFTER FREE; CWE-456 MISSING INITIALIZATION; CWE-466 RETURN OF POINTER VALUE OUTSIDE OF EXPECTED RANGE; CWE-467 USE OF SIZEOF() ON A POINTER TYPE; CWE-468 INCORRECT POINTER SCALING; CWE-680 INTEGER OVERFLOW TO BUFFER OVERFLOW; CWE-682 INCORRECT CALCULATION; CWE-785 USE OF PATH MANIPULATION FUNCTION WITHOUT MAX-SIZE BUFFER; CWE-786 ACCESS OF MEMORY LOCATION BEFORE START OF BUFFER; CWE-787 OUT-OF-BOUNDS WRITE; CWE-788 ACCESS OF MEMORY LOCATION AFTER END OF BUFFER; CWE-789 UNCONTROLLED MEMORY ALLOCATION.

4.2 STEP 2: TEMPLATE STRUCTURING
To organize the buffer overflow related concepts identified in step one along the four dimensions of a semantic template we use the following heuristics: 1) Concepts that pertain to a human error during the development process are collected under the software fault dimension; 2) Any computing resources related to the weaknesses are identified and collected under the resources dimension; 3) Concepts that describe the characteristics of the weakness in question are organized under the weakness dimension; and 4) Concepts that describe the conditions that may result following the weakness are collected in the consequences dimension. The four dimensions of the semantic template itself are related using non-taxonomical relationships. In addition to applying these heuristics, we consolidate redundancy in some CWE categories in Figure 3. For example, CWE 786 (Access of memory location before start of buffer) has overlapping semantics with CWE 125 (Out-of-bounds read). From this effort a highly structured collection of interdependent concepts emerges as shown in Figure 4. Each concept in the semantic template of Figure 4 is traceable to corresponding CWE categories in Figure 3. The template shown in Figure 4 also includes the refinements introduced after examining its applicability to understand buffer overflow vulnerabilities reported in the CVE and related artifacts in the Apache web server software repository. In the following section we examine this step in the context of the buffer overflow vulnerability CAN-2004-0492 [2].
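The four structuring heuristics of Section 4.2 can be sketched as a lookup from an analyst's judgement about a concept to a template dimension. The Python fragment below is only an illustration under stated assumptions: the judgement strings paraphrase the heuristics, the four sample assignments are hypothetical (only the placement of CWE-20 under software faults is stated elsewhere in the text), and the names `Dimension`, `HEURISTICS` and `structure` are not part of the methodology.

```python
from enum import Enum


class Dimension(Enum):
    SOFTWARE_FAULT = "software fault"
    RESOURCE = "resource"
    WEAKNESS = "weakness"
    CONSEQUENCE = "consequence"


# Heuristics 1) to 4): what the concept describes -> where it is collected.
HEURISTICS = {
    "human error during development": Dimension.SOFTWARE_FAULT,
    "computing resource": Dimension.RESOURCE,
    "characteristic of the weakness": Dimension.WEAKNESS,
    "condition that may follow": Dimension.CONSEQUENCE,
}

# (concept, CWE id, analyst judgement). The judgements are illustrative assumptions.
EXTRACTED = [
    ("improper input validation", 20, "human error during development"),
    ("heap-based buffer", 122, "computing resource"),
    ("out-of-bounds read", 125, "characteristic of the weakness"),
    ("write-what-where condition", 123, "condition that may follow"),
]


def structure(concepts):
    """Group extracted concepts under the four dimensions of a template."""
    grouped = {dimension: [] for dimension in Dimension}
    for name, cwe_id, judgement in concepts:
        grouped[HEURISTICS[judgement]].append((name, cwe_id))
    return grouped


template = structure(EXTRACTED)
print(template[Dimension.SOFTWARE_FAULT])  # [('improper input validation', 20)]
```

The judgement itself remains a manual step here; the code only records it consistently so that every concept lands in exactly one dimension.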

[Figure 4: Buffer Overflow Semantic Template. Diagram not reproduced; the labels recovered from it are listed below.]
Dimensions: SOFTWARE-FAULT; WEAKNESS; RESOURCE/LOCATION; CONSEQUENCES.
Relationships: IS-A; PART-OF; OCCURS-IN; CAN-PRECEDE.
Concepts (with CWE identifiers): OFF-BY-ONE #193; SIGN ERRORS #194 #195 #196; INTEGER OVERFLOW #190 #680; INTEGER COERCION ERROR #192; INTEGER UNDERFLOW #191; WRAPAROUND ERROR #128; IMPROPER INPUT VALIDATION #20; MISSING INITIALIZATION #456; STRING MANAGEMENT API ABUSE #785 #134 #251; RETURN OF POINTER VALUE OUTSIDE OF EXPECTED RANGE #466; INCORRECT BUFFER-SIZE CALCULATION #131; IMPROPER HANDLING OF EXTRA VALUES #231; API ABUSE #227; USE OF DANGEROUS FUNCTIONS #242; IMPROPER USE OF FREED MEMORY #415 #416; IMPROPER NULL TERMINATION #170; POINTER ERRORS #467 #468; INCORRECT CALCULATION #682; IMPROPER HANDLING OF LENGTH PARAMETER INCONSISTENCY #130; BUFFER COPY WITHOUT CHECKING SIZE OF INPUT ('CLASSIC BUFFER OVERFLOW') #120; IMPROPER VALIDATION OF ARRAY INDEX #129 #789; ACCESS AND OUT-OF-BOUNDS READ #125 #126 #127 #786; ACCESS AND OUT-OF-BOUNDS WRITE #787 #788 #124; STACK-BASED #121; HEAP-BASED #122; INDEX (POINTER #466, INTEGER #129); MEMORY BUFFER #119; STATIC #129; FAILURE TO CONSTRAIN OPERATIONS WITHIN THE BOUNDS OF A MEMORY BUFFER #119; BUFFER #119; IMPROPER ACCESS OF INDEXABLE RESOURCE #118; INDEXABLE RESOURCE #118; WRITE-WHAT-WHERE CONDITION #123; UNCONTROLLED MEMORY ALLOCATION #789; INFORMATION LOSS OR OMISSION #199 #221.
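Traceability from template concepts back to CWE categories, and the consolidation of overlapping entries such as CWE-786 and CWE-125 into one concept, can be pictured as a reverse index. The following Python sketch uses a subset of the concept labels and identifiers that appear in Figure 4; the dictionary layout and the helper name `build_cwe_index` are assumptions made for illustration.

```python
# Concept labels with their CWE identifiers, as labelled in Figure 4 (subset).
TEMPLATE_CONCEPTS = {
    "SIGN ERRORS": [194, 195, 196],
    "INTEGER OVERFLOW": [190, 680],
    "POINTER ERRORS": [467, 468],
    "ACCESS AND OUT-OF-BOUNDS READ": [125, 126, 127, 786],
    "ACCESS AND OUT-OF-BOUNDS WRITE": [787, 788, 124],
}


def build_cwe_index(concepts):
    """Map every CWE identifier to the template concepts that cover it."""
    index = {}
    for concept, cwe_ids in concepts.items():
        for cwe_id in cwe_ids:
            index.setdefault(cwe_id, []).append(concept)
    return index


CWE_INDEX = build_cwe_index(TEMPLATE_CONCEPTS)

# CWE-786 and CWE-125 resolve to the same consolidated concept.
print(CWE_INDEX[786])  # ['ACCESS AND OUT-OF-BOUNDS READ']
print(CWE_INDEX[125])  # ['ACCESS AND OUT-OF-BOUNDS READ']
```

With such an index, a report that cites any one of the consolidated CWE entries can be attached to the same template concept.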

4.3 STEP 3: AGGREGATION OF VULNERABILITY ARTIFACTS
For the reported buffer overflow vulnerability CAN-2004-0492 [2] in the Apache web server, we have aggregated several vulnerability artifacts scattered across multiple sources. These sources include the CVE database, the Apache website, the Apache change history in the software repository, and source code versions, as shown in Table 1.

Table 1: Vulnerability Artifacts for CAN-2004-0492

Source: CVE
Vulnerability Artifact:
  CVE ID: CAN-2004-0492
  Release Date: 08/06/2004
  CVE Source: US-CERT/NIST
  CVE Description: Heap-based buffer overflow in proxy_util.c for mod_proxy in Apache 1.3.25 to 1.3.31 allows remote attackers to cause a denial of service (process crash) and possibly execute arbitrary code via a negative Content-Length HTTP header field, which causes a large amount of data to be copied.

Source: Apache website
Vulnerability Artifact:
  Project Specific Description: A buffer overflow was found in the Apache proxy module, mod_proxy, which can be triggered by receiving an invalid Content-Length header. In order to exploit this issue an attacker would need to get an Apache installation that was configured as a proxy to connect to a malicious site. This would cause the Apache child processing the request to crash, although this does not represent a significant Denial of Service attack as requests will continue to be handled by other Apache child processes. This issue may lead to remote arbitrary code execution on some BSD platforms.

Source: Apache Change History
Vulnerability Artifact:
  Description in Apache Project Change History: Receiving a negative content length from a remote server can cause a buffer overflow in later code; reject connection if we receive an invalid header. CAN-2004-0492 PR: Obtained from: Submitted by: Mark Cox; Reviewed by: Joe Orton, Bill Stoddard, Jim Jagielski
  Files Fixed in Project: apache-1.3/src/modules/proxy/proxy_http.c
  Fix Author: mjc for file proxy_http.c
  Fix Time: 2004.06.11.07.54.38

Source: Source Code
Vulnerability Artifact:
  Difference in before and after versions of proxy_http.c, from line 485. Lines 485 to 487 are shown for both revision 103191 (Mon Mar 29 17:47:15 2004 UTC) and revision 103896 (Fri Jun 11 07:54:38 2004 UTC); lines 488 to 494 are shown only for revision 103896.

  Line #
  485   content_length = ap_table_get(resp_hdrs, "Content-Length");
  486   if (content_length != NULL) {
  487       c->len = ap_strtol(content_length, NULL, 10);
  488
  489       if (c->len < 0) {
  490           ap_kill_timeout(r);
  491           return ap_proxyerror(r, HTTP_BAD_GATEWAY, ap_pstrcat(r->pool,
  492               "Invalid Content-Length from remote server",
  493               NULL));
  494       }

The semantic template allows us to parameterize the natural language vulnerability description, change history, source code changes and other artifacts. In Table 1 we highlight the information pieces that confirm the existence of one or more concepts in the buffer overflow semantic template. For example, “a negative Content-length HTTP header field” activates the software fault of “improper input validation (CWE #20)” in the semantic template. In addition, the semantic template allows extrapolating the missing information, if any, from the vulnerability artifacts.

The semantic template provides intuitive visualization capabilities for the vulnerability artifacts. In Figure 5, the highlighted elements in Table 1 provide context to “trigger” and “navigate” the concepts in the buffer overflow semantic template. Such a structured approach allows developers to reason about the vulnerability and discuss the necessary fixes. Over a collection of vulnerability reports, their mappings to specific weakness categories reveal patterns and provide project specific measures for identifying the most prominent CWE weaknesses.

5. CONTRIBUTIONS AND DISCUSSIONS
In this paper, we have proposed to organize the software repository information related to detected vulnerabilities using semantic templates. Semantic templates record the software elements and faults leading to the vulnerability and associate them with the relevant weakness types. We have illustrated an example from a vulnerability occurrence in the Apache web server. We have used the presented semantic template to encode the knowledge about buffer overflows in the Apache web server as reported in the CVE. We are using the methodology to encode other vulnerability types such as injection, denial of service and illegal information access.

[Figure 5: Aggregation of Vulnerability Artifacts for CAN-2004-0492. Diagram not reproduced. It overlays the buffer overflow semantic template of Figure 4 with numbered markers (1 to 4) and the following excerpts from the artifacts in Table 1.]
CVE: … possibly execute arbitrary code via a negative Content-Length HTTP header field …
Apache Website: buffer overflow … can be triggered by receiving an invalid Content-Length header
Source Code: Added input validation criteria to avoid negative … if (c->len < 0) {
CVE: … causes a large amount of data to be copied …
Apache Website: This issue may lead to remote arbitrary code execution on some BSD platforms …
CVE: Heap-based buffer overflow in proxy_util.c for mod_proxy in Apache 1.3.25 to 1.3.31 …
Apache Website: A buffer overflow was found in the Apache proxy module, mod_proxy …
Apache Change History: … buffer overflow in later code …
CVE: allows remote attackers to cause a denial of service (process crash) and possibly execute arbitrary code
Apache Website: cause the Apache child processing the request to crash … This issue may lead to remote arbitrary code execution on some BSD platforms
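The way artifact text "triggers" template concepts, as illustrated for CAN-2004-0492, can be sketched with simple phrase matching. The Python fragment below is a hypothetical illustration rather than the annotation mechanism of this work: the trigger patterns are assumptions (only the link between a negative Content-Length and improper input validation, CWE #20, is stated in the text), and the artifact strings are copied from Table 1. Counting the tags then gives a crude project specific measure of the most frequently activated CWE entries.

```python
import re
from collections import Counter

# (pattern, dimension, concept, CWE id). Hypothetical trigger phrases.
TRIGGERS = [
    (r"negative content[- ]length", "SOFTWARE-FAULT", "improper input validation", 20),
    (r"heap-based", "RESOURCE/LOCATION", "heap-based", 122),
]

# Artifact text taken from Table 1.
ARTIFACTS = {
    "CVE": (
        "Heap-based buffer overflow in proxy_util.c for mod_proxy in Apache 1.3.25 "
        "to 1.3.31 allows remote attackers to cause a denial of service (process "
        "crash) and possibly execute arbitrary code via a negative Content-Length "
        "HTTP header field, which causes a large amount of data to be copied."
    ),
    "Apache Change History": (
        "Receiving a negative content length from a remote server can cause a "
        "buffer overflow in later code; reject connection if we receive an "
        "invalid header."
    ),
}


def tag(artifacts, triggers):
    """Return (source, dimension, concept, CWE id) for every trigger that fires."""
    tags = []
    for source, text in artifacts.items():
        for pattern, dimension, concept, cwe_id in triggers:
            if re.search(pattern, text, flags=re.IGNORECASE):
                tags.append((source, dimension, concept, cwe_id))
    return tags


tags = tag(ARTIFACTS, TRIGGERS)
cwe_counts = Counter(cwe_id for _source, _dimension, _concept, cwe_id in tags)
print(cwe_counts.most_common(1))  # [(20, 2)]
```

Two independent sources activating the same software fault is the kind of corroboration that the aggregation in Figure 5 makes visible.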

Application in existing development processes. We foresee several development activities that can benefit from semantic templates. They can be used to organize and retrieve historical knowledge of vulnerabilities in a change analysis tool for automatically inspecting code patch submissions. This is helpful in open source projects like Apache where many patch contributors are not core developers and may not be aware that a particular usage of some resource could expose a vulnerability. History-based tools can also complement existing static code analysis tools like FindBugs by correlating identified faults with information on past vulnerabilities caused by similar faults in the same resource.
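As a rough illustration of such a history-based check, the sketch below looks up the files touched by a submitted patch in a record of past vulnerabilities and returns review hints. It is written in Python under stated assumptions: the `HISTORY` structure, the function `review_hints` and the second file path are hypothetical, while the first path, the identifier CAN-2004-0492 and the CWE #20 association come from Table 1 and the accompanying text.

```python
# Resource (here: a file path) -> past vulnerability records. Hypothetical layout.
HISTORY = {
    "apache-1.3/src/modules/proxy/proxy_http.c": [
        {"id": "CAN-2004-0492", "fault": "improper input validation", "cwe": 20},
    ],
}


def review_hints(changed_files, history):
    """List past vulnerabilities recorded for the files a patch touches."""
    hints = []
    for path in changed_files:
        for record in history.get(path, []):
            hints.append(
                f"{path}: previously affected by {record['id']} "
                f"({record['fault']}, CWE-{record['cwe']})"
            )
    return hints


patch_files = [
    "apache-1.3/src/modules/proxy/proxy_http.c",
    "some/other/file.c",  # hypothetical path with no recorded history
]
hints = review_hints(patch_files, HISTORY)
for hint in hints:
    print(hint)
```

A real tool would key the history on finer-grained resources than whole files, but the lookup pattern would stay the same.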

Ontology implementation. We have implemented semantic templates in an OWL ontology using Protégé [5]. Figure 6 shows the conceptual abstractions of the semantic template in the context of vulnerability artifacts related to CAN-2004-0492, captured using a semantic web representation format. The description-logic-based reasoning capability offered by Protégé facilitates further analysis of related vulnerability data. This makes it possible to ask, for example, whether a resource was involved in multiple vulnerability occurrences. Furthermore, combining ontological information with data from project repositories can enable the discovery of previously undetected or unreported vulnerabilities, e.g., in related components that share a vulnerable resource. We can also generate and give preliminary answers to hypotheses concerning the causes leading to vulnerabilities, the relationships between faults and vulnerabilities, or the relationships between vulnerability types.

Figure 6: Modeling and Visualization of CAN-2004-0492 using Ontological Engineering Tool

Experience with CWE. Semantic templates were simplified from the CWE entries. Dealing with the full CWE proved unwieldy, as some vulnerabilities appear to belong in multiple categories. Furthermore, CWE appears to provide multiple categorizations of vulnerabilities; in other words, multiple CWE entries can describe different aspects of a vulnerability. For example, some CWE entries focus on the resource (e.g., stack-based buffer overflow) while others focus on the fault (e.g., incorrect calculation). The main elements of our semantic template can be seen as orthogonal dimensions documenting the various types of focus areas found in the CWE.

Automated support. Currently, the process of encoding the known vulnerabilities is performed manually. While manual population of templates is useful for recording new vulnerabilities as they are detected, relating past vulnerabilities to the templates requires automation. We are investigating a machine learning approach that uses manually created entries as a training set for associating version repository entries with applicable semantic templates. As a side effect, previously unreported vulnerabilities may also be identified. An empirical study with the Apache repository information will be conducted to assess the accuracy of this automated process.

Addition of fix information. Semantic templates can be extended to address patterns for fixing vulnerabilities. Known resolutions for known vulnerabilities can be studied and generalized into categories and approaches for fixing vulnerabilities. A starting point is to utilize the bug fix patterns defined by Pan et al. [9] and investigate the relationship between bug fix patterns and vulnerability types. To validate the effectiveness of the identified fix patterns, a student experiment will be conducted to determine whether the semantic information can help the subjects find better ways to fix a set of simulated vulnerabilities and check the completeness of proposed fixes.

6. ACKNOWLEDGMENTS
This research is funded in part by the Department of Defense (DoD)/Air Force Office of Scientific Research (AFOSR), NSF Award Number FA9550-07-1-0499, under the title “High Assurance Software”.

7. REFERENCES
[1] Bevan, J., Whitehead, E.J., Kim, S., Godfrey, M., “Facilitating Software Evolution Research with Kenyon.” Proc. of 13th Foundations of Software Engg., Sept. 2005.
[2] CAN-2004-0492, Common Vulnerability Enumeration Description, cve.mitre.org
[3] Christey, S.M., Harris, C.O., Kenderdine, J.E., Miles, B., “Common Weaknesses Enumeration v 1.6,” cwe.mitre.org.
[4] Common Vulnerability Enumeration, cve.mitre.org
[5] Gennari, J., Musen, M., et al., “The evolution of Protégé2000: An environment for knowledge-based systems development.” Human-Computer Studies, 58(1), 2003.
[6] Gruber, T., “A Translation Approach to Portable Ontologies,” Knowledge Acquisition 5,2, 199-299, 1993.
[7] Kiefer, C., Bernstein, A., Tappolet, J., “Mining software repositories using iSPARQL and a software evolution ontology.” 4th Int’l Workshop on Mining Soft. Repo. 2007.
[8] Kim, S., et al., “TA-RE: an Exchange Language for Mining Software Repositories.” 3rd Int’l Workshop on Mining Soft. Repo. 2006.
[9] Pan, K., Kim, S., Whitehead, E.J., “Toward an understanding of bug fix patterns.” Empirical Software Engineering 14:286-315, 2009.
[10] Riley, H.N., “The von Neumann Architecture of Computer Systems,” Computer Science Department, California State Polytechnic University, September, 1987.
[11] van Vleck, T., “Three questions about each bug you find.” SIGSOFT Softw. Eng. Notes 14, 5 (Jul. 1989), 62-63.
[12] Viega, J., McGraw, G., “Building Secure Software: How to Avoid Security Problems the Right Way”, Addison Wesley, 2002.
[13] W3C. OWL Web Ontology Language reference. W3C Recommendation, Feb. 2004.