Proceedings of the 5th WSEAS Int. Conf. on APPLIED INFORMATICS and COMMUNICATIONS, Malta, September 15-17, 2005 (pp397-403)

Investigative Data Mining and its Application in Counterterrorism

Nasrullah Memon1, Abdul Rasool Qureshi2

1 Software Intelligence Security Research Center, Aalborg University, Niels Bohrs Vej 8, 6700 Esbjerg, Denmark, [email protected]
2 Software Developer, iMiner Development Team, [email protected]

Abstract - It is well recognized that advanced filtering and mining in information streams and intelligence bases are of key importance in investigative analysis for countering terrorism and organized crime. As opposed to traditional data mining, which aims at extracting knowledge from data, mining for investigative analysis, called Investigative Data Mining (IDM), aims at discovering hidden instances of patterns of interest, such as patterns indicating an organized crime activity. An important problem targeted by IDM is the identification of terror/crime networks, based on available intelligence and other information. We present an approach to an IDM solution of this problem, using semantic link analysis and visualization of findings. The approach is demonstrated in an application by a prototype system. The system finds associations among terrorists and is capable of determining links between terrorism plots that occurred in the past, their affiliation with terrorist camps, travel records, funds transfers, etc. The findings are represented by a network in the form of an attributed relational graph. Paths from a node to any other node in the network indicate the relationships between individuals and organizations. The system also provides assistance to law enforcement agencies by indicating when the capture of a specific terrorist is likely to destabilize the terrorist network.

Key-Words: - Terrorism, Counterterrorism, Investigative Data Mining, Link Analysis, Visualization

1 Introduction

When intelligence analysts are required to understand a complex, uncertain situation, one of the techniques they often use is simply to draw a diagram of the situation. The diagrams are attributed relational graphs, an extension of the abstract directed graph [1]. In these graphs, nodes represent people, organizations, objects, or events. Edges represent relationships such as interaction, ownership, or trust. Attributes store the details of each node and edge, such as a person's name or an interaction's time of occurrence. The graphs function as external memory aids, which are crucial tools for arriving at an unbiased conclusion in the face of uncertain information [2].

For example, if we have information that Osama-bin-Laden is a friend of Ayman-al-Zawahiri, who is the father of Khalid-al-Zawahiri, who frequents the Al-Qaeda House Afghanistan, as do Abu-Bakar-Bashir, Hud-bin-AbdulHaq, and Muhammad Atta, who works at Hay Computing Service Hamburg, where he has a colleague, Ramzi-bin-al-Sibh, who provided funds to Marwan-al-Shehhi, who is a friend of Zakarya-Essabar, who is the brother of Mounir-al-Motassadek, then this information may be presented as in Figure 1.

Visualizing the connections in this way is important in investigating terrorist or criminal networks. It has been found to aid considerably in forming a good understanding of what is known so far, which is necessary to guide and direct further lines of inquiry in the most timely and productive way. In an investigation of a suicide terrorist attack with an unknown perpetrator, say, we may well begin by investigating the social network of the victims, their friends, the places they visit, their work colleagues, and so on, looking for possible motives and suspects, and then build a link chart to understand clearly what we know so far as we proceed.
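The attributed relational graph described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not the representation used by the prototype: the class names (`Node`, `Edge`, `AttributedGraph`), the node type labels ("person", "place"), and the relation strings are assumptions chosen to mirror the example sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # "person" or "place"; these type labels are assumed for illustration
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    attributes: dict = field(default_factory=dict)


class AttributedGraph:
    """Nodes and edges that each carry a free-form attribute dictionary."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name, kind, **attributes):
        self.nodes[name] = Node(name, kind, attributes)

    def add_edge(self, source, relation, target, **attributes):
        for name in (source, target):
            if name not in self.nodes:
                raise KeyError(f"unknown node: {name}")
        self.edges.append(Edge(source, relation, target, attributes))

    def relations_of(self, name):
        """Return (source, relation, target) for every edge touching `name`."""
        return [(e.source, e.relation, e.target)
                for e in self.edges if name in (e.source, e.target)]


graph = AttributedGraph()
for person in ["Osama-bin-Laden", "Ayman-al-Zawahiri", "Khalid-al-Zawahiri",
               "Abu-Bakar-Bashir", "Hud-bin-AbdulHaq", "Muhammad Atta",
               "Ramzi-bin-al-Sibh", "Marwan-al-Shehhi", "Zakarya-Essabar",
               "Mounir-al-Motassadek"]:
    graph.add_node(person, "person")
for place in ["Al-Qaeda House Afghanistan", "Hay Computing Service Hamburg"]:
    graph.add_node(place, "place")

# One edge per relationship stated in the example sentence.
for source, relation, target in [
    ("Osama-bin-Laden", "friend of", "Ayman-al-Zawahiri"),
    ("Ayman-al-Zawahiri", "father of", "Khalid-al-Zawahiri"),
    ("Khalid-al-Zawahiri", "frequents", "Al-Qaeda House Afghanistan"),
    ("Abu-Bakar-Bashir", "frequents", "Al-Qaeda House Afghanistan"),
    ("Hud-bin-AbdulHaq", "frequents", "Al-Qaeda House Afghanistan"),
    ("Muhammad Atta", "frequents", "Al-Qaeda House Afghanistan"),
    ("Muhammad Atta", "works at", "Hay Computing Service Hamburg"),
    ("Muhammad Atta", "colleague of", "Ramzi-bin-al-Sibh"),
    ("Ramzi-bin-al-Sibh", "provided funds to", "Marwan-al-Shehhi"),
    ("Marwan-al-Shehhi", "friend of", "Zakarya-Essabar"),
    ("Zakarya-Essabar", "brother of", "Mounir-al-Motassadek"),
]:
    graph.add_edge(source, relation, target)
```

Calling `graph.relations_of("Muhammad Atta")` lists every relationship in which that node takes part, which is the kind of lookup an analyst performs by eye on a link chart.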

Fig 1: The connections between a set of people displayed as a network

In countering terrorism for homeland security, understanding the data, information, and intelligence gathered so far in a terrorist threat investigation is of great importance for guiding the future course of the investigation. Such information is typically best comprehended when viewed as a network of the connections between the objects of interest in the problem domain, such as people, financial transactions, meetings, and travel records. Analysis of the 9/11 terrorist network suggests that one of Al Qaeda's criteria for selecting the hijackers almost certainly was that they were relatively clean [3]. For example, Muhammad Atta and some of the other 9/11 hijackers were never even considered candidates for the watch lists, as intelligence reporting had not previously associated them with known terrorists [3].

Terrorist networks are highly decentralized in comparison to social networks; however, even a decentralized terrorist threat has some linkage. The challenge for the intelligence community is to exploit these links. Within networks of extremists, almost everyone can be linked at least indirectly, for example through common past experiences. Intelligence services cannot monitor all of these contacts or compile the life history of every extremist who has the potential to become a terrorist. Detecting the perpetrators of the next terrorist attack will therefore have to go beyond link analysis and increasingly rely on other techniques for identifying terrorists for closer monitoring. Mining of financial, travel, and other data on personal actions and circumstances is one such technique. The potential of such data mining goes well beyond current usage. Yet data mining for counterterrorism purposes (which we call Investigative Data Mining) will always require a major investment in obtaining and manipulating the data in return for only a modest narrowing of the search for terrorists.

The remainder of the paper is organized as follows: Section 2 introduces the investigative data mining approach and explains how it differs from traditional data mining; Section 3 reviews link analysis techniques; Section 4 describes the investigative data mining view of link analysis; Section 5 reviews some social network analysis (SNA) techniques that could be used in finding terrorist threats; Section 6 introduces the system architecture of the investigative data mining software; Section 7 introduces a new strategy for neutralizing terrorist networks; and Section 8 concludes the paper.

2 Investigative Data Mining

The rapid growth of available data in all areas of society requires new computational methods. Besides traditional statistical techniques [4] and standard database approaches, current research known as Investigative Data Mining (IDM) uses modern methods that originate from research in algorithms and artificial intelligence. The main goal is the quest for interesting and understandable patterns. This search has always been, and will always be, a critical task in law enforcement, especially for criminal investigation and, more specifically, for the fight against terrorism. Examples are the discovery of interesting links between people (social networks; see, e.g., [5]) and other entities (means of transport, modus operandi, locations, communication channels such as phone numbers, accounts, financial transactions, and so on).

IDM is defined as: “The technique which models data to predict behaviour, assess risk, determine associations and help in neutralizing the terrorist network” [6, 7].

IDM differs from traditional data mining applications in significant ways. Traditional data mining is generally applied against large transaction databases in order to classify people according to transaction characteristics and extract patterns of widespread applicability. The problem in IDM is to focus on a smaller number of subjects within a large background population and identify links and relationships from a far wider variety of activities. Table 1 states how investigative data mining is distinguished from traditional data mining.

Table 1. Traditional Data Mining vs. Investigative Data Mining

| Traditional Data Mining | Investigative Data Mining |
|---|---|
| Classifying propositional data from large data sets (transactions) | Extracting relational data from large heterogeneous data sets (people, places, things, events, transactions, etc.) |
| To extract patterns of general applicability | To find rare but significant relational links |
| Identify patterns among unrelated subjects based on their transaction patterns in order to make predictions about other unrelated subjects doing the same things | Identify patterns that evidence organizations and activities among related subjects in order to expose additional related (or like) subjects or activities |
| | Assess the risk of the threat |
| | Neutralize the terrorists' network |

In the area of law enforcement, the need for IDM is apparent in view of the enormous load of information that is (and can be made) available nowadays. We list some of the problems. First, incomplete, incorrect, or inconsistent data can create problems. Moreover, these characteristics of terrorist networks cause difficulties:

- Incompleteness. Terrorist networks are covert networks that operate in secrecy and stealth [8]. Terrorists may minimize interactions to avoid attracting law enforcement attention, and their interactions are hidden behind various illicit activities. Thus, data about terrorists and their interactions and associations is inevitably incomplete, causing missing nodes and links in the networks [9].


- Incorrectness. Incorrect data regarding terrorists' identities, physical characteristics, and addresses may result either from unintentional data entry errors or from intentional deception by terrorists. Many terrorists lie about their identity information when caught and investigated.
- Inconsistency. Information about a terrorist who has multiple police contacts may be entered into law enforcement databases multiple times. These records are not necessarily consistent. Multiple data records could make a single terrorist appear to be different individuals. When seemingly different individuals are included in a network under study, misleading information may result.

Problems specific to terrorist network analysis lie in network dynamics, fuzzy boundaries, and data transformation:

- Network dynamics. Terrorist networks are not static, but are subject to changes over time. New data and even new methods of data collection may be required to capture the dynamics of terrorist networks [9].
- Fuzzy boundaries. Boundaries of terrorist networks are likely to be ambiguous. It can be quite difficult for the analyst to decide whom to include in and whom to exclude from a network under study [9].
- Data transformation. Network analysis requires that data be presented in a specific format, in which network members are represented by nodes and their associations are represented by links. However, extracting information about terrorist associations from raw data and transforming it to the required format can be laborious and time-consuming.
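The data transformation step, together with the simplest form of the inconsistency problem, can be illustrated with a minimal sketch. It assumes raw association records arrive as dictionaries with hypothetical `from`, `to`, and `relation` fields, and it merges only spellings that differ in case or whitespace; real entity resolution (aliases, deliberate deception) is much harder and is not attempted here.

```python
def normalize(name):
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def to_network(records):
    """Turn raw association records into a node set and an undirected link map.

    Each link key is a sorted pair of normalized names; its value is the set
    of relation labels reported for that pair.
    """
    nodes = set()
    links = {}
    for record in records:
        a = normalize(record["from"])
        b = normalize(record["to"])
        if a == b:
            continue  # a record linking an entity to itself carries no association
        nodes.update((a, b))
        key = (min(a, b), max(a, b))
        links.setdefault(key, set()).add(record["relation"])
    return nodes, links


# Hypothetical records: the first two describe the same pair with
# inconsistent spelling.
raw_records = [
    {"from": "Person A", "to": "person b", "relation": "phone call"},
    {"from": "PERSON  A", "to": "Person B", "relation": "meeting"},
    {"from": "Person B", "to": "Person C", "relation": "funds transfer"},
]
nodes, links = to_network(raw_records)
```

Without the normalization step, the first two records would produce two separate nodes for the same individual, which is exactly the misleading effect described under Inconsistency.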

3 Link Analysis (Terrorist Network Analysis)

Data mining technologies such as link analysis (LA) can be employed by law enforcement investigators and intelligence analysts to help them examine anomalies and inconsistencies graphically and connect the networks of relationships and contacts hidden in the data. LA is the first level by which networks of people, places, organizations, vehicles, bank accounts, telephone calls, email contacts, and other tangible entities can be discovered, linked, assembled, examined, detected, and analyzed. Effectively combining multiple sources of data can lead law enforcement investigators and government analysts to discover patterns that help them be more proactive in their investigations. LA is a good start in mapping terrorist activity and criminal intelligence by visualizing associations between entities and events. LA often involves seeing, via a chart or map, the associations between suspects and locations, whether physical or on a network or the Internet. The technology is often used to answer such questions as who knows whom, and when and where they have been in contact.

LA is the process of building up networks of interconnected objects through relationships in order to expose patterns and trends. LA uses item-to-item associations to generate networks of interactions and connections from defined datasets. LA diagrams have a variety of names, ranging from entity-relationship diagrams and connected networks to nodes-and-links and directed graphs. LA methods add dimensions to an analysis that other forms of visualization do not support. By explicitly representing relationships among objects, an investigator can gain an entirely different perspective on how the data can be analyzed and the types of patterns that can be discovered. LA provides a powerful means of performing visual data mining, particularly if an investigator knows how to take advantage of layout options, filter assessments, and presentation formats. Used properly, LA systems allow an investigator to identify patterns and emerging groups and to generate connections quickly.

LA can be used to expose the underlying patterns and behaviours pertaining to national security and homeland defense in such areas as terrorism and drug trafficking. The intelligence community can use LA to sift through vast amounts of data looking for connections, relationships, and critical links among their suspected targets. LA reveals the structure and content of a body of information by representing it as a set of interconnected linked objects or entities. Often LA allows an investigator to identify association patterns, new emerging groups, and connections between suspects. Through the visualization of these entities and links, an investigator can gain an understanding of the strength of relationships and frequency of contacts and discover new hidden associations [10, 11].

Criminal network analysis, of which terrorist network analysis is a part, can be categorized into three generations [12].

First generation: Manual approach. Representative of the first generation is the Anacapa Chart [13]. With this approach, an analyst must first construct an association matrix by identifying criminal associations from raw data. A link chart for visualization purposes can then be drawn based on the association matrix. For example, to map the terrorist network containing the 19 hijackers in the September 11 attacks, Krebs [8] gathered data about the relationships among the hijackers from publicly released information reported in several major newspapers. He then manually constructed an association matrix to integrate these relations [8] and drew a network representation to analyze the structural properties of the network (Figure 2).
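The two manual steps of the first-generation approach, building an association matrix and then reading a link chart off it, can be sketched as follows. The member labels and associations are hypothetical placeholders, and the function names are introduced here purely for illustration.

```python
def association_matrix(members, associations):
    """Build a symmetric 0/1 matrix; entry [i][j] is 1 if members i and j are associated."""
    index = {member: i for i, member in enumerate(members)}
    size = len(members)
    matrix = [[0] * size for _ in range(size)]
    for a, b in associations:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


def link_chart(members, matrix):
    """List each associated pair once by reading the upper triangle of the matrix."""
    size = len(members)
    return [(members[i], members[j])
            for i in range(size)
            for j in range(i + 1, size)
            if matrix[i][j]]


# Hypothetical members and associations identified from raw data.
members = ["A", "B", "C", "D"]
associations = [("A", "B"), ("B", "C"), ("A", "C")]

matrix = association_matrix(members, associations)
links = link_chart(members, matrix)
```

Here member "D" has an all-zero row, so it appears in the matrix but contributes no link to the chart.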


Fig. 2. 9-11 Terrorist Network

Although such a manual approach to criminal network analysis is helpful in crime investigation, it becomes extremely ineffective and inefficient when data sets are very large.

Second generation: Graphic-based approach. These tools can automatically produce graphical representations of criminal networks. Most existing network analysis tools belong to this generation. Among them, Analyst's Notebook [12], Netmap [14], and XANALYS Link Explorer (previously called Watson) [15] are the most popular. For example, Analyst's Notebook can automatically generate a link chart based on relational data from a spreadsheet or text file. Second-generation network analysis approaches have also been developed by the COPLINK research at the University of Arizona. The first approach employs a hyperbolic tree metaphor to visualize crime relationships [16]. It is especially helpful for visualizing a large amount of relationship data because it simultaneously handles both focus and context. Although second-generation tools are capable of using various methods to visualize criminal networks, their sophistication level remains modest because they produce only graphical representations of criminal networks without much analytical functionality. They still rely on analysts to study the graphs carefully to find the structural properties of the network.

Third generation: SNA. This approach provides more advanced analytical functionality to assist crime investigation. Sophisticated structural analysis tools are needed to go from merely drawing networks to mining large volumes of data to discover useful knowledge about the structure and organization of criminal networks.

4 IDM Perspective of Terrorist Network Analysis

Intelligence and law enforcement agencies are often interested in finding structural properties of terrorist networks. This study aims to answer the following questions:

1. What is the overall structure of the network? What are the different roles within the network? Who is central or peripheral in the network?
2. What cells (subgroups) exist in the network? How do these cells interact with each other?
3. What are the important communications and methods of communication? Which individuals might be more likely to give information to the intelligence agencies? Which individual should be removed to disrupt the network?

The prototype iMiner could help investigative personnel to target critical network members for removal or surveillance, and to locate network vulnerabilities where disruptive action can be effective.

5 SNA Techniques

SNA techniques are designed to discover patterns of interaction between social actors in social networks [11], and they are especially appropriate for studying terrorist networks [18, 9]. Specifically, SNA is capable of detecting terrorist cells, discovering their patterns of interaction, identifying central and peripheral individuals, finding leaders, followers, and gatekeepers, and uncovering network organization and structure [17].

Centrality deals with the roles of individuals in a network. Several centrality measures, such as degree, betweenness, and closeness, can suggest the importance of a node in a network. The degree of a particular node is its number of links; its betweenness is the number of geodesics (shortest paths between any two nodes) passing through it; and its closeness is the sum of the lengths of the geodesics between the particular node and every other node in the network. A high degree, for instance, may imply leadership, whereas an individual with high betweenness may be a gatekeeper in the network. Krebs [8] found that in the network consisting of the 19 hijackers, Mohamed Atta scored the highest on degree and closeness, but not on betweenness. Many other measures, such as blockmodeling, can also be used on terrorist networks.
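The three centrality measures defined above can be computed on an unweighted, undirected network with breadth-first search. The sketch below is illustrative only and assumes a connected network given as an adjacency dictionary; the node labels are hypothetical. Betweenness is accumulated with Brandes' dependency method, and where several geodesics join the same pair of nodes, each contributes a fractional share, which is the usual refinement of "number of geodesics passing through". Closeness follows the definition in the text (a sum of geodesic lengths), so a smaller value means a more central node.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic length from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree(adj):
    return {v: len(adj[v]) for v in adj}


def geodesic_sum(adj):
    """Closeness as defined in the text: total geodesic length to all other nodes."""
    return {v: sum(bfs_distances(adj, v).values()) for v in adj}


def betweenness(adj):
    """Geodesics passing through each node (Brandes' accumulation)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of geodesics from s to v
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors of v on geodesics from s
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: score[v] / 2 for v in adj}


# Hypothetical five-member network: A - B - C, with D and E attached to C.
network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"C"},
    "E": {"C"},
}
```

In this small network, "C" has the highest degree, the smallest geodesic sum, and the highest betweenness, while "B" has a modest degree but sits on every geodesic that reaches "A", the kind of position the text associates with a gatekeeper.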

6 iMiner

iMiner is an experimental system that provides answers to the questions described in Section 4. The system architecture of our prototype is presented in Figure 3. The first stage of our network analysis development was intended to automatically identify the strongest association paths, or geodesics, between two or more network members using shortest path algorithms. In practice, such a task often requires intelligence officials to explore links manually and try to find association paths that might be useful for generating investigative leads.
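One way to find a strongest association path with a shortest path algorithm is sketched below. The paper does not say how link strength is measured or which algorithm iMiner uses, so everything specific here is an assumption: each link carries a hypothetical strength in (0, 1], a path's strength is taken to be the product of its link strengths, and each link is given the cost -log(strength) so that Dijkstra's algorithm, which minimizes a sum, maximizes that product.

```python
import heapq
import math


def strongest_path(links, source, target):
    """Return the path from source to target with the largest product of link strengths.

    `links` maps an undirected pair (a, b) to a strength in (0, 1].
    Returns None if the two members are not connected.
    """
    adj = {}
    for (a, b), strength in links.items():
        cost = -math.log(strength)  # strong links become cheap edges
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    previous = {}
    done = set()
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            break
        for neighbour, step in adj.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))

    if target not in done:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical strengths: the direct A-C link is weaker than the route via B.
example_links = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.5}
```

With these values the two-step route through "B" has strength 0.81, which beats the direct link's 0.5, so the longer path is returned. If all strengths are equal, the result reduces to an ordinary geodesic.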


The network graph presented to the analyst is developed by processing the links that the Graph Maker component retrieves from the database. iMiner is a window-based implementation of the IDM framework, with new algorithms for visualization, filtering, and destabilization of terrorist networks.

Fig 3. System Architecture

6.1 Database Perspective

We store data in the database in the form of triples (subject, relationship, object), where the subject and object are entities of interest and the relationship is the link between the two entities. The entities are saved in the database in the Entity_Profile table, which has the fields (Entity_key, Entity_name, Entity_type). The relationships are saved in the sequence of their involvement in a certain terrorism plot. For example, all the relationships involved in the 9-11 terrorist attack are saved together, thus forming an event. Each event has a table of relations (in which relations are stored in the format given above), and each event is registered in the events master table. These events are analogous to case files in local police stations. The different attributes of each entity are saved in separate tables, which are joined with the entities using entity_key. As shown in Figure 3, “AT” tables are attribute tables, while “EP” and “DP” represent the entities master table and the events master table. The tables marked “R” are relations tables; each such table represents a distinct event.
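The storage layout described above might look roughly like the following SQLite sketch. Only the Entity_Profile table and its three fields come from the text; the names `Events_Master`, `R_event_1`, their columns, and the sample rows are hypothetical, and the attribute ("AT") tables are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity_Profile (
    Entity_key  INTEGER PRIMARY KEY,
    Entity_name TEXT NOT NULL,
    Entity_type TEXT NOT NULL
);
-- Hypothetical events master table: one row per registered event.
CREATE TABLE Events_Master (
    Event_key       INTEGER PRIMARY KEY,
    Event_name      TEXT NOT NULL,
    Relations_table TEXT NOT NULL
);
-- Hypothetical relations table for one event; Seq keeps the order of involvement.
CREATE TABLE R_event_1 (
    Seq          INTEGER PRIMARY KEY,
    Subject_key  INTEGER NOT NULL REFERENCES Entity_Profile(Entity_key),
    Relationship TEXT NOT NULL,
    Object_key   INTEGER NOT NULL REFERENCES Entity_Profile(Entity_key)
);
""")

# Placeholder rows, not real data.
conn.executemany(
    "INSERT INTO Entity_Profile VALUES (?, ?, ?)",
    [(1, "Person A", "person"),
     (2, "Person B", "person"),
     (3, "Organization X", "organization")],
)
conn.execute("INSERT INTO Events_Master VALUES (1, 'Example event', 'R_event_1')")
conn.executemany(
    "INSERT INTO R_event_1 VALUES (?, ?, ?, ?)",
    [(1, 1, "member of", 3),
     (2, 1, "transferred funds to", 2)],
)


def event_triples(conn, relations_table):
    """Return the (subject, relationship, object) triples of one event, in sequence."""
    # A table name cannot be bound as a query parameter, so it is checked
    # against the events master table before being placed in the SQL text.
    known = {row[0] for row in conn.execute("SELECT Relations_table FROM Events_Master")}
    if relations_table not in known:
        raise ValueError(f"unregistered relations table: {relations_table}")
    query = f"""
        SELECT s.Entity_name, r.Relationship, o.Entity_name
        FROM {relations_table} AS r
        JOIN Entity_Profile AS s ON s.Entity_key = r.Subject_key
        JOIN Entity_Profile AS o ON o.Entity_key = r.Object_key
        ORDER BY r.Seq
    """
    return conn.execute(query).fetchall()


triples = event_triples(conn, "R_event_1")
```

Keeping one relations table per event mirrors the case-file analogy in the text; a single relations table with an event key column would be the more conventional relational design, at the cost of departing from the layout the paper describes.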

6.2 Application Perspective

Database wrappers consist of classes that wrap the tables used in our database architecture and control the flow of information to and from them. The Database Settings Manager holds the values of the different attributes that are used in searches. It also controls whether a search is run on a single event or on the whole database. The search queries are generated automatically; the user only has to set the attribute values. The Importer imports data from files of different formats, such as CSV and XML files. This data can be saved in the database; for this purpose the Importer transforms the data using the database wrappers and saves it in the database. The data that conform to the current database search settings are then used by Graph Maker, which analyzes the data and transforms it into a graph. This graph is abstract and cannot be drawn directly on screen for visualization. For visualization we have used the open source library “prefuse” [19]. To convert the graph to a format that prefuse recognizes, special graph converters are designed.

7 Neutralization of Terrorist Networks

For neutralization of a terrorist network, we have used centrality measures from the SNA literature, namely degree centrality and eigenvector centrality. Algorithm 1 converts the undirected graph into a directed graph, and Algorithm 2 constructs a hierarchy from the directed graph. The hierarchy shows who is in the power of whom, i.e. who is central or peripheral in the network.

Algorithm 1: To convert undirected graphs into directed graphs

Algorithm 2: To construct a hierarchy from a directed graph

1. Identify the nodes from which one or more edges originate, and repeat steps 2 to 4 for each such node.
2. Take the node with the minimum number of originating edges and traverse each of its edges.
3. Each node adjacent to the current edge is placed under its predecessor if no other edge points towards it.
4. If another edge points towards it, it is placed under the node that has more links directed to its neighbourhood.
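Eigenvector centrality, one of the two measures named at the start of this section, scores a node highly when its neighbours are themselves highly scored. The paper does not state how the measure is computed in iMiner, so the following is only a minimal sketch using power iteration on an undirected adjacency dictionary with hypothetical node labels. It multiplies by (A + I) rather than A; this keeps the same eigenvectors but avoids the oscillation that plain power iteration shows on bipartite networks such as stars and chains.

```python
import math


def eigenvector_centrality(adj, iterations=200, tolerance=1e-10):
    """Approximate the leading eigenvector of the adjacency matrix, scaled to unit length.

    Assumes a connected, undirected network given as {node: set_of_neighbours}.
    """
    nodes = sorted(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # One step of (A + I) x: own score plus the scores of all neighbours.
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in nodes}
        norm = math.sqrt(sum(value * value for value in nxt.values()))
        nxt = {v: value / norm for v, value in nxt.items()}
        if max(abs(nxt[v] - x[v]) for v in nodes) < tolerance:
            return nxt
        x = nxt
    return x


# Hypothetical star-shaped cell: H is linked to three otherwise unconnected members.
star = {
    "H": {"a", "b", "c"},
    "a": {"H"},
    "b": {"H"},
    "c": {"H"},
}
scores = eigenvector_centrality(star)
```

For the star, the hub's score converges to 1/sqrt(2) and each leaf's to 1/sqrt(6), so the hub stands out even though the comparison uses only the link structure. Degree centrality for the same network is simply the number of neighbours of each node.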


Results: We applied the algorithms to the network of alleged 9-11 hijackers and found that power originates from Nawaf Alhazmi, Hamza Alghamdi, and Marvan Alshehhi. Following the arrows, a dotted line in the hierarchy indicates a possible (“may be”) relationship. Fig. 5 shows four groups.

In the first group, Nawaf Alhazmi leads Hani Hunjor, Majed Moqed, Khalid Mindhar, and Saleem Alghadmi, which matches reality. This group hijacked Flight AA #77, which crashed into the Pentagon. The dotted line from Nawaf Alhazmi to Saeed Alghamdi also suggests that Nawaf Alhazmi may have influenced Saeed Alghamdi and his group.

The second group extracted by the above algorithm is led by Saeed Alghamdi and consists of Ahmed Alnami, Zaid Jarah, and Ahmed Alhaznawi. In reality, this group hijacked Flight UA #93, which crashed in Pennsylvania.

The third group is led by Hamza Alghamdi; its other members are Fayyaz Ahmed, Mohand Alshehri, and Ahmed Alghamdi. A dotted (“may be”) link shows that Fayyaz Ahmed follows Marvan Alshehhi. If this “may be” link is taken as true, the group matches reality; otherwise Marvan Alshehhi is missing from the group.

The last group, which consists of Marvan Alshehhi, Mohamed Atta, Abdel Aziz Al-omairi, Wail Al-shehri, Waleed al shehri, and Satam Suqami, also matches reality, with one deviation: Marvan Alshehhi was not a member of that group. Even so, the results identify him as an important member.

Law enforcement agencies may use the same strategy for neutralizing terrorist networks. A hierarchy of this type can help law enforcement to focus its efforts on capturing a key node, so that as much of the network as possible is disrupted.

Table 2. Degree and eigenvector centrality

Fig. 4: directed Graph

Fig. 5 Hierarchy of 9-11 terrorists

8 Conclusion and Future Work

The motivation behind the work described in this paper is to provide better database interface support for an established method of visually analyzing data when the data is stored in a database. Several visual data analysis tools provide support for the construction and analysis of link charts, such as Automated Tactical Analysis of Crime (ATAC), Analyst's Notebook, Crime Link, Crime Workbench, Daisy, NetMap, ORION, and VisualLink [10]. These tools support only the investigation of links. The software under development, iMiner, will not only find the links available in the database but will also help law enforcement in destabilizing the network.

References

[1] Coffman Thayne, Greenblatt Seth, and Marcus Sherry. Graph-Based Technologies for Intelligence Analysis. Communications of the ACM, 2003, vol 47, No. 03, pp. 45-47.
[2] Heuer, R.J. Psychology of Intelligence Analysis. Center for the Study of Intelligence, Central Intelligence Agency, 2001.
[3] Paul R. Pillar. Counterterrorism after Al Qaeda. The Washington Quarterly. 27 (3) pp. 101–113.
[4] T. Hastie et al. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer 2001.
[5] V.E. Krebs. Unlocking Terrorist Networks. First Monday, 7, issue 4, April 2002.
[6] Memon Nasrullah et al. Investigative Data Mining: Issues and Challenges. In the proceedings of the First International Computer Engineering Conference (ICENCO’ 2004) Cairo, EGYPT, pages 515-524, 2004.
[7] Memon Nasrullah et al. Investigative Data Mining: A General Framework. In the proceedings of International Conference on Computational Intelligence (ICCI2004), ISBN 975-98458-1-4, pages 384-387, 2004.
[8] Krebs, V. E. Mapping networks of terrorist cells. Connections 24, 3 (2001), 43–52.
[9] Sparrow, M.K. The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks 13 (1991), 251–274.


[10] Jesus Mena. Investigative Data Mining for Security and Criminal Detection. Butterworth-Heinemann, 2003.
[11] Harper R. Walter, and Harris H. Douglas. The Application of Link Analysis to Police Intelligence. Human Factors, 1975, 17(2), 157-164.
[12] Klerks, P. The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands. Connections 24, 3 (2001), 53–65.
[13] Harper, W.R., and Harris, D.H. The application of link analysis to police intelligence. Human Factors 17, 2 (1975), 157–164.
[14] Goldberg, H.G., and Senator, T.E. Restructuring databases for knowledge discovery by consolidation and link formation. In Proceedings of 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis. AAAI Press (1998).
[15] Anderson, T., Arbetter, L., Benawides, A., and Longmore-Etheridge, A. Security works. Security Management 38, 17, (1994), 17–20.
[16] Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and Schroeder, J. COPLINK: Managing law enforcement data and knowledge. Commun. ACM 46, 1 (Jan. 2003), 28–34.
[17] Wasserman, S., and Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, MA, 1994.
[18] McAndrew, D. The structural analysis of criminal networks. The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III. D. Canter and L. Alison (Eds.). Aldershot, Dartmouth (1999).
[19] Heer Jeffrey, Card K. Stuart, and Landay A. James. Prefuse: A toolkit for interactive information visualization. In the Proceedings of CHI2005, Portland, Oregon, USA.
[20] Carley M. Kathleen, Lee S.J., Krackhardt David. Destabilizing Networks. Connections 4.