Third Party Privacy and the Investigation of Cybercrime

Third Party Privacy and the Investigation of Cybercrime∗

Wynand JC van Staden
School of Computing, University of South Africa
[email protected]

Abstract

This article considers the privacy of third parties during the investigation of a cybercrime. In particular, it focuses on the “post-mortem” data analysis phase of an investigation. It is argued that privacy protection for such third parties can be accomplished without reducing the effectiveness of the investigation. The typical steps followed during an investigation require that the equipment used during the commission of the crime be seized and analysed in a manner that complies with accepted investigative policy and practice, ensuring that the evidence collected is admissible during possible legal action. The analysis of the data on the seized equipment provides the investigator (who may not necessarily be associated with a law enforcement agency) with the opportunity to gain access to Personally Identifiable Information (PII) of persons or entities who may not be linked to the crime; this is especially true in multi-user systems. The article considers the key aspects surrounding the privacy protection of third parties during a cybercrime investigation, and proposes a framework and process for systems that may be employed to protect privacy during the investigative process. The design includes a profiling component, which profiles the result corpus, and a filtering component, which calculates the diversity in the search results and, depending on a sensitivity level, either filters out the search result or presents it to the investigator.

Keywords: Privacy, Privacy Enhancing Technology, Digital Forensics, Third Party Privacy Protection, Digital Forensic Investigation, Cybercrime

∗ This is a post-print of the paper that appears in Advances in Digital Forensics IX, ISBN: 978-3-642-41148-9, https://link.springer.com/chapter/10.1007/978-3-642-41148-9_2

1 Introduction

Crimes committed using computer equipment are investigated using techniques, processes, and frameworks that have been refined specifically to deal with the difficulties surrounding the digital nature of the crime, and that also fit the requirements for investigative procedure as may be imposed by law. These techniques, processes, and frameworks are collectively referred to as Digital Forensics (DF). Within DF there are typically two types of processes: live investigation (investigation of a crime that is in progress) and post-mortem investigation (investigation of events and facts after the crime has been committed and discovered). This article is concerned with the latter, post-mortem investigation, with a specific interest in the context of multi-user systems. In this reference framework, the typical steps during the investigative process are defined to be preservation, collection, examination, and analysis [1, 2].

The collection phase typically means that all digital information that may be relevant to the crime is collected; seizure of a computing device means the collection of storage media data. In a multi-user environment (a file server or email server), seized data will not only concern the activities of the suspect, but also persons (known as third parties) who may not be relevant to the investigation. When investigators start examining collected data, they may stumble upon data of a sensitive nature on third parties. Even if filtering techniques are used, when the job functions of the suspect and a third party overlap, the examination phase may result in the third party's data being scrutinized as well. If the seized system proclaims to honour privacy and enforces privacy protection, third parties may feel their rights are infringed upon, even in light of the investigation of criminal activity. By analogy, one would not like one's property searched just because one's neighbour is being investigated for vehicle theft and one also owns a vehicle.

This scenario leads to the Third Party Privacy Breach (TPPB) problem: the act of investigating a cybercrime may inadvertently result in a loss of privacy for people unrelated to the crime or investigation. This paper presents a first step in providing a privacy enhancing framework for use during a DF investigation. The framework is aimed at preventing (or minimizing the risk of) the TPPB problem; however, the framework should not prohibit justifiable access to private information. The approach presented here lines up with the basic principle of privacy: information self-determination. A Privacy Enhancing Technology (PET) proposes to protect access to the PII of users on a system. In the light of a criminal investigation it is reasonable to expect that the right to privacy of those who are not suspects should not be discarded without due consideration [3]. It is therefore argued that digital investigation techniques should support this right to privacy where practical, and that only with due regard should this right be suspended.


1.1 Contribution

The paper provides a framework for a system that can be used to protect the privacy of third parties during a post-mortem DF investigation on a multi-user system. It also presents the process to be followed during the data analysis phase using the proposed framework, and provides the principal techniques that underlie the framework. In particular, the focus is currently on existing files (documents) on the seized system. The paper argues that such a framework used during a DF investigation need not add significant overhead to investigative activities, and that it is commensurate with a reasonable expectation of privacy from users of a multi-user system. Additionally, the system approaches the protection of third parties in a similar fashion to current filtering techniques for weeding out information that may not be relevant to an investigation.

1.2 Structure of the paper

The rest of the paper is structured in the following manner: Section 2 provides background information and current related work on digital forensics, privacy, and privacy enhancing technologies. Section 3 provides information on the key attributes of a PET for use during an investigation. Section 4 provides the design for such a technology. Section 5 proposes a way to restrict search results that may otherwise result in privacy breaches. Section 6 provides a brief discussion of the audit-logging component. Section 7 comments on usability and on preserving the chain of custody, and finally Section 8 provides concluding remarks. The following section provides background on current digital investigative methods, privacy, and privacy enhancing technologies.

2 Background

A DF investigation that incorporates a privacy aspect involves three areas of interest: firstly, digital forensics itself; secondly, privacy; and finally, PETs. Background on each of these is presented in turn.

2.1 Digital Forensics

DF is a branch of forensic science [1, 2] that is concerned with the “...preservation, collection, validation, identification, analysis, interpretation, documentation and presentation” [1] of evidence in a digital context. Through the use of scientifically sound methods, digital evidence is ultimately used to reconstruct events that are considered criminal [1].

During the collection stage, all digital information that may be relevant to the criminal investigation is collected [2, 4]. In practice, this means creating bit-for-bit copies of the storage media of equipment used during the commission of a crime. The crucial reason for doing this is preservation of the content of the storage media without altering it; the safest way of accomplishing this is to use a write-blocking device while copying the content from the disk. Collection of information in this way does not discriminate on the context of the information (the investigating team cannot discard any data that could potentially be relevant). Since all data collected from the equipment can be considered potential evidence, the investigator may access and examine all of it. In some jurisdictions (the Netherlands and South Africa) data collected for investigative purposes is referred to as police data [3, 5], that is, data that belongs to the police for investigative purposes. In both these jurisdictions the police are allowed to examine all information classified as police data without prejudice. Thus, the privacy rights of all entities about whom there is data on the storage media are suspended in view of the investigation taking place.

The examination phase involves the systematic search of collected information for data that is relevant to the criminal investigation. Examination may involve searching through documents and other related files on the disk, searching so-called slack space, and searching for steganographically hidden information in cover files [2, 4]. Since there is so much information to be examined, several tools are used to aid the analysis phase. These tools [6] provide search filters which use features of the data on the storage medium to filter out data that may be considered noise (not relevant to the investigation). Typical features are strings of text inside documents (or file meta-data), file types, and time-stamps on files [7]. Other techniques for filtering out irrelevant data have been proposed and used, such as text-mining techniques [8] and machine learning techniques such as Self-Organising Maps (SOMs) [6, 7]. Text mining [9] seeks to return only information that may be important based on some relevance scale, and SOMs [10] are used to cluster information using the mentioned data features, aiding the investigator in forming opinions about the relevance of data. Thematic clustering of search results has also been proposed to help filter out noise [11]. The practice of filtering out potential red herrings during the investigative process is thus well established.

Filtering often produces false positives: information is returned that is not relevant to the investigation, which still takes investigative effort to classify as such before it is discarded. False negatives are also a problem from the investigation's point of view, since information that could be relevant may be hidden from the investigator's view by the filtering tool.

2.2 Privacy

The notion of protecting the individual's right to privacy has been around for more than a hundred years [12]. Modern interpretations of the right to privacy are enshrined in the European Union (EU) Data Protection Directive [13] and in South Africa's Protection of Personal Information (POPI) Bill [5], which, when enacted, will be on par with the EU directive. Both of these regard privacy as important enough to prohibit the cross-border flow of personal information to countries that do not have equivalent privacy laws (to reduce the trade impact of these laws, countries with strict privacy laws sign safe harbour treaties with non-compliant countries).

These principles for protecting private information were first listed in the Organisation for Economic Co-operation and Development (OECD) report on the protection of private information [14]. The principles in that document (and present in the legislation named above) are: collection limitation, data quality, purpose specification, use limitation, security safeguards, openness, individual participation, and accountability. When implemented in technological safeguards for the protection of privacy, these are commonly translated as the right to information self-determination [15, 16]: the right to determine who has data about you, what data they have about you, what they may do with it, and to whom they may give it [16].

Legislation typically provides a reasonable expectation of privacy to individuals who entrust their data to enterprises or service providers that make a positive statement about their intent to protect the user's privacy. It is the exemption from these laws (during DF investigations) that may take individuals by surprise, in particular when an individual's data is perused by an investigator.

2.3 Technological Protection of Privacy

Systems that promise to protect the privacy of individuals are classified as Privacy Enhancing Technologies (PETs). These systems operate on two principles: purpose specification and use limitation [16, 15]. They technically address only two of the eight listed principles of privacy protection; however, the other principles are normally taken care of as a matter of operating procedure. The systems make sure that personal data is not doled out to those who are not supposed to receive it, and certainly not to those who want to use it for the wrong purpose [16]. Because PET frameworks have been well studied and presented, it is quite possible to consider a PET that forms part of the investigative process. Such a PET should have several attributes, and the following section considers the attributes of a privacy-enabled system that would allow a reasonable expectation of privacy to be honoured by the investigating authority.

3 Preventing a Third Party Privacy Breach During an Investigation

Section 2 discussed the techniques used to filter out information that could be considered irrelevant to an investigation. It is envisaged that a privacy aware system used during an investigation will be an extension of these filtering techniques, the important difference being that the system will restrict results to those that have limited potential for breaching the privacy of third parties.

The principal idea for privacy aware DF comes from Croft and Olivier [17], who proposed the sequenced release of information to protect the privacy of the individual under investigation. This proposal pivots on the testing of a hypothesis before further information is released: the system only releases information once it knows that the person querying the system already has some prior, relevant piece of information. The approach presented here allows the privacy of third parties to be taken into account. The system evaluates the query results and either releases the results to the investigator, or states that the query may result in a privacy breach. In the latter case, the investigator is requested to provide a more focussed search.

Based on the definition of a PET's operating principles provided earlier, it could be argued that the framework presented here is not a PET, since no explicit purpose specification or use limitation is present. However, it is safe to assume that the purpose of the data access will be the investigation of the crime, and that information is stored and accessed for the purposes of investigating the case at hand. The PET forces the investigator to pose focussed questions about the evidence that is sought; once it is determined that the query is focussed, the search results are presented.

Obviously, the work presented here should not encourage the sequestering of information that may help an investigator (the system reporting false negatives); this concern is returned to in Section 5.3. In the case of false positives the guiding principle is accountability: in the event that personal information not related to the investigation was accessed, the PET should record the access, thereby allowing the individual to know that there was a privacy breach.

To support the use of PETs during the investigative process, the following characteristics of a PET used during a DF investigation are presented:

1. It should not encumber the investigator. The investigator should not be required to perform a significant number of additional tasks during the investigative process.

2. It should, to a reasonable extent, sequester information that can safely be determined to be irrelevant to the investigation, specifically limiting information on parties that may not be under investigation.

3. It should allow the investigator enough freedom to access information that may be considered false negatives.

4. It should support accountability and openness, recording accidental (or intentional) access to information that bears no relevance to the investigation but that may be of a private nature.

5. It should not add significant time delays to already established techniques used during the examination phase. The point of using text-mining techniques, search strings, and other filters is to reduce investigation time; the system should not increase this time without regard for the effectiveness of the investigation.

The following section provides the design for a system that supports these principles.

4 Design of a PET for Use During a Digital Forensic Investigation

To support privacy aware investigation, the following design for a PET that fits into the forensic investigation is presented. The PET includes the following components. Firstly, a profiling component, which creates a profile of the data and stores the result of the profiling in a meta-database; the profile is created Just In Time (JIT), when a new document is queried, and re-used once created. Secondly, a query front-end, which allows the investigator to perform searches (partial string matches, file types, and so on). Thirdly, a results filter, which determines whether the query results could potentially lead to a privacy breach. Finally, an audit-logging component, which can be used for auditing compliance with privacy principles. Additionally, the PET must provide a tamper-proof meta-database for storing profiling results and caching ad-hoc profiles, and an audit log which stores search queries and changes to the configuration of the PET. A structural sketch of these components is given below; the following section presents the foundation of the PET used during searches.
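To make the component structure concrete, the following is a minimal structural sketch in Python. All class names, method names, and the dictionary-based meta-database are illustrative assumptions; the paper prescribes the components, not an implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AuditLog:
    """Records queries, rejections and configuration changes (see Section 6)."""
    entries: list = field(default_factory=list)

    def record(self, event: str) -> None:
        # In a real tool this would be tamper-evident storage.
        self.entries.append(event)

@dataclass
class Profiler:
    """Builds document profiles just in time and caches them in a meta-database."""
    meta_db: dict = field(default_factory=dict)  # document id -> cached profile

    def profile(self, doc_id: str, text: str):
        if doc_id not in self.meta_db:
            # Placeholder profile; the actual component uses n-gram rank
            # profiles as described in Section 5.1.
            self.meta_db[doc_id] = text.lower().split()
        return self.meta_db[doc_id]

@dataclass
class ResultsFilter:
    """Withholds result sets whose diversity exceeds the sensitivity level."""
    sensitivity: float = 1.0

    def admit(self, diversity: float) -> bool:
        return diversity <= self.sensitivity
```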

5 Responsibly Disclosing Search Results

As has already been stated, the investigation of a cybercrime may result in the investigator gaining access to sensitive information on third parties. To mitigate this problem, the search technique presented in this paper proposes the use of a filter that forces the investigator to focus their search queries specifically on the case at hand. To achieve this, the following are introduced: the profiling component (already named), a metric for determining the diversity of the search results, and a sensitivity measure that determines how lenient to be with queries that return a diverse search result. Each of these is discussed in more detail in the following sections.

5.1 Profiling

Section 2 presented background on the automatic classification of sensitive information, as well as the thematic clustering of text and the clustering of files based on a defined feature set.

[Figure 1: Strawman PET for investigations. Components: Investigator, Query Front-end, Query Filter, Audit Log, Storage Media, Categorisation Tool, Meta-database.]

The profiling component presented in this paper makes use of a well-known profiling technique which has delivered good results: n-grams [18]. The n-gram technique relies on Zipf's law, which states that “the nth most common word in human text occurs with frequency inversely proportional to n” [18]; the implication is that texts that are related should have similar frequency distributions. Using this principle, text is tokenized, and each token is deconstructed into n-grams with 2 ≤ n ≤ 5 [18]. The technique maintains a frequency table for each n-gram in the text. Once scanning of the document is complete, the frequency table is used to construct a rank list in which the most frequently occurring n-gram is ranked first. The number of entries in the rank table can vary; however, according to Cavnar et al., domain specific terminology only starts appearing at roughly the three-hundredth rank.

The n-gram technique is fast and provides a usable profile for a text. Once a profile has been generated, it can be compared to other profiles, thereby determining the distance between profiles (or texts). Texts that are similar will have a smaller distance than texts that are not. The following section provides more detail on the way in which the distance between results is calculated.
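The following sketch shows a Cavnar–Trenkle style rank profile and the rank-difference ("out-of-place") comparison it enables. The tokenization regex, the boundary padding, and the fixed penalty for n-grams absent from the other profile are implementation assumptions, not details fixed by the paper.

```python
from collections import Counter
import re

def ngram_profile(text: str, n_min: int = 2, n_max: int = 5, top: int = 300) -> list:
    """Rank profile: the `top` most frequent character n-grams (2 <= n <= 5)
    in the text, most frequent first."""
    counts = Counter()
    for token in re.findall(r"[a-z']+", text.lower()):
        padded = f"_{token}_"  # pad so n-grams capture word boundaries
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]

def rank_distance(profile_j: list, profile_k: list) -> int:
    """Out-of-place measure R(j, k): sum of rank differences, with a maximum
    penalty (the profile length) for n-grams absent from the other profile."""
    ranks_k = {gram: rank for rank, gram in enumerate(profile_k)}
    penalty = len(profile_k)
    return sum(abs(rank - ranks_k.get(gram, penalty))
               for rank, gram in enumerate(profile_j))
```

For two related texts the returned distance will be small relative to that of unrelated texts, mirroring the Zipf-law intuition described above.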

5.2 Distance Calculation

The distance measure used by Cavnar et al. simply uses the rank-difference of n-grams: $\sum_{i=1}^{|P_k|} |r_k(i) - r_j(i)|$, where $P_k$ is the rank profile for text $k$, and $r_k(i)$ is the rank of n-gram $i$ for text $k$. This distance measure takes content into account; however, it ignores several important aspects.

Firstly, it is reasonable to assume that documents that are similar but have different owners may be the result of two persons working on the same project. A difference in document owners should therefore be factored in when considering the similarity between documents. However, another user may be an accomplice of the suspect, so different owners should not skew the results too much. Secondly, the logical location of documents on the storage medium is an important attribute when one considers that two or more people may be on the same project, but only a certain individual has corrupt relations during the performance of duties for that project. The goal here is to protect those individuals by carefully presenting their owned documents as distant from those of the suspect.

The attributes used in the distance measure rely on features used for anomaly detection by Fei et al. [7]. In their approach, documents that are clustered closely together using those feature sets are worthy of further investigation. In order to determine the distance between documents returned during a search for evidence, the following features are thus considered for inclusion in the calculation: the document text, document creation date and time, document owner, and the absolute logical location on disk. The feature set proposed here may depend on the type of investigation; we do not propose tying it down absolutely, but rather that its inclusion should be considered important when implementing a tool based on the framework. Using these features, the distance between documents is calculated as follows:

Definition 1 (Document distance) The distance between two documents in a search result corpus is calculated as

$D(j, k) = R(j, k) + |\delta_j - \delta_k| + P(j, k) + O(j, k)$,

where $R(j, k)$ is the rank-difference distance between $j$ and $k$, $\delta_i$ is the creation date and time of $i$, $P(j, k)$ is the difference in logical location of the files, and $O(j, k)$ is the ownership distance. $P(j, k)$ is calculated as the number of directory changes needed to navigate from one location to the other. Ownership distance relies on little more than using a maximum value to create additional distance between the suspect and other users of the system, and is calculated as

$O(j, k) = \begin{cases} 0 & \text{if } j_o = k_o \\ \tau & \text{if } j_o \neq k_o \end{cases}$

where $\tau$ is a maximum value assigned to $O(j, k)$.

Once the investigator issues a query, the profiles of the documents are constructed (or retrieved), and the distances between all the documents are calculated. This information is used during the diversity calculation, discussed next.
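A sketch of Definition 1 follows, reusing rank_distance from the previous sketch. The time difference is taken in whole days, τ is set to an arbitrary illustrative value, and the dict representation of a document is an assumption of this sketch; how the four features are scaled against one another is left open by the paper.

```python
from pathlib import PurePosixPath

TAU = 1000.0  # illustrative maximum ownership penalty (tau in Definition 1)

def path_distance(path_j: str, path_k: str) -> int:
    """P(j, k): number of directory changes needed to move between locations."""
    a = PurePosixPath(path_j).parent.parts
    b = PurePosixPath(path_k).parent.parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Navigate up out of one subtree, then down into the other.
    return (len(a) - common) + (len(b) - common)

def document_distance(doc_j: dict, doc_k: dict) -> float:
    """D(j, k) = R(j, k) + |delta_j - delta_k| + P(j, k) + O(j, k).

    Documents are dicts with 'profile', 'created' (a datetime), 'path'
    and 'owner' keys.
    """
    r = rank_distance(doc_j["profile"], doc_k["profile"])
    delta = abs((doc_j["created"] - doc_k["created"]).days)
    p = path_distance(doc_j["path"], doc_k["path"])
    o = 0.0 if doc_j["owner"] == doc_k["owner"] else TAU
    return r + delta + p + o
```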

5.3 Diversity Classification

Once a query is executed, the documents in the search results are clustered to determine the diversity of the results, using a hierarchical clustering scheme; in particular, single-linkage clustering is used. However, the algorithm is only applied once. This yields an intuitive clustering result which provides a good indication of the similarity between the documents. Moreover, hierarchical clustering schemes have been well studied and provide a fast way to cluster documents [19]. The modified clustering scheme is presented in Definition 2.

Definition 2 (Clustering) For a document corpus $S$ consisting of all the documents that are returned in the result: for each $x \in S$ with $x \notin C_i$ for any $i$, let $y = \operatorname{argmin}_{z \in S, z \neq x} D(x, z)$; if $y \in C_i$ for some $i$, then $C_i = C_i \cup \{x\}$, else $C_j = \{x, y\}$ with $j = |C| + 1$. Here $D(x, y)$ represents some distance measure between $x$ and $y$.

Informally: for each document in the result, find the cluster that most closely matches it and add the document to that cluster, or find a document (not yet in any cluster) that most closely matches it, and create a new cluster with those two documents.

Once clustering is complete, the PET calculates the diversity of the corpus using Shannon's entropy index [20]: $D_S = -\sum_{i=1}^{|C|} p_i \log_2 p_i$. Here $C$ is the set of clusters, and $p_i$ represents the probability that a document drawn from the corpus belongs to cluster $C_i$. The higher $D_S$, the more diverse the corpus.

Ideally, the search would have returned a result with one cluster containing all the documents returned during the search (all the documents closely related). This ideal search result is referred to as the uniformity norm, calculated as $N_S = -\sum_i p_i \log_2 p_i$ over the ideal clustering, which is the lower bound of the diversity of the corpus. Ideally, a search result should be on par with ideal uniformity: the query was focussed enough to return only documents that are closely related (with respect to the features named above). If this is not the case, the query should be rejected.

Since the PET proposed here should not interfere with a legitimate investigative effort, and should support the investigator, a sensitivity indicator is introduced which provides a threshold used to determine whether the search result is too diverse. The sensitivity threshold is provided in Definition 3, following the sketch below.
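The following is a sketch of the single-pass clustering of Definition 2 together with the diversity index $D_S$. Treating $p_i$ as the proportion of result documents falling in cluster $C_i$ is one reading of the definition, and the corpus is assumed to contain at least two documents.

```python
import math

def cluster_once(docs: list, dist) -> list:
    """One application of single-linkage clustering (Definition 2): each
    unclustered document joins its nearest neighbour's cluster, or forms a
    new cluster with that neighbour. Assumes len(docs) >= 2."""
    clusters = []    # list of sets of document indices
    member_of = {}   # document index -> index into clusters
    for i in range(len(docs)):
        if i in member_of:
            continue
        # Nearest other document in the result corpus.
        j = min((k for k in range(len(docs)) if k != i),
                key=lambda k: dist(docs[i], docs[k]))
        if j in member_of:
            clusters[member_of[j]].add(i)
            member_of[i] = member_of[j]
        else:
            clusters.append({i, j})
            member_of[i] = member_of[j] = len(clusters) - 1
    return clusters

def diversity(clusters: list, corpus_size: int) -> float:
    """D_S = -sum_i p_i log2 p_i with p_i = |C_i| / corpus_size."""
    return -sum((len(c) / corpus_size) * math.log2(len(c) / corpus_size)
                for c in clusters)
```

Note that a single cluster containing the whole corpus yields $D_S = 0$, consistent with the uniformity norm being the lower bound of corpus diversity.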

Definition 3 (Sensitivity Filter) The PET is configured using a sensitivity level indicator $l$, range limited by $N_S \leq l$. Using the diversity indicator $D_S$: if $D_S > l$, the query is defined as being too wide and the results are dropped. Setting $l = N_S$ requires an ideal query result.

A query that returns results considered too diverse is referred to as a wide query; conversely, queries whose results are close to the uniformity norm are referred to as focussed queries. Wide queries are reported, and no results are returned.

A wide-ranging implication of the adjustable threshold is that it may be used without due regard for privacy, and eventually disregarded altogether (the threshold set to a value high enough that all results are returned, defeating the attempt at preventing TPPBs). Cross-checks can avoid this situation by requiring authorisation from a higher level for each threshold adjustment. Additionally, the audit-logging component, discussed in the following section, should be used to record any threshold adjustments.
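With the diversity index in hand, the filter of Definition 3 reduces to a single comparison. The tuple-free return convention and the audit-log call (mirroring the AuditLog sketch of Section 4) are illustrative of how a tool might surface a wide-query rejection.

```python
def filter_results(clusters: list, corpus_size: int, level: float, log) -> list | None:
    """Release results only if the diversity D_S does not exceed the
    sensitivity level l (Definition 3); wide queries are logged and dropped."""
    d_s = diversity(clusters, corpus_size)
    if d_s > level:
        log.record(f"wide query rejected: D_S={d_s:.3f} > l={level:.3f}")
        return None  # no results returned to the investigator
    return clusters
```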

6 The Audit-logging Component

The audit-logging component is the final step in the privacy protection of third parties. It provides a platform for the recording of wide queries, to ensure that only questions related to the investigation are asked. It is emphasised that the logging of wide queries is not meant to provide a “smoking gun” proving that the investigator issued unfocussed queries, but rather to provide documentary proof when the investigative process is questioned in the event of a privacy breach. The component also records changes in the sensitivity of the results filter, making it possible to trace privacy breaches later on. Since each query is recorded, and the data and meta-database should be kept intact, privacy breaches can be traced by examining the results obtained from the queries logged in the system, similar to the auditing compliance work done by Agrawal et al. on Hippocratic databases [21]. Additionally, the audit-logging component would also be able to provide transparent logging for recording the process followed during the investigation.
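The paper does not prescribe how the log itself is made trustworthy. One common construction, shown below as an assumption-laden sketch, is a hash chain in which each entry commits to its predecessor, so that after-the-fact tampering with recorded queries or threshold changes is detectable.

```python
import hashlib
import json
import time

class HashChainedLog:
    """Append-only audit log; each entry commits to the previous entry's
    digest, making later modification or deletion detectable."""

    def __init__(self):
        self._entries = []
        self._last_digest = "0" * 64  # genesis value

    def record(self, event: dict) -> None:
        payload = json.dumps({"ts": time.time(), "event": event,
                              "prev": self._last_digest}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self._entries.append((payload, digest))
        self._last_digest = digest

    def verify(self) -> bool:
        """Re-walk the chain, checking both the links and the digests."""
        prev = "0" * 64
        for payload, digest in self._entries:
            if json.loads(payload)["prev"] != prev:
                return False
            if hashlib.sha256(payload.encode()).hexdigest() != digest:
                return False
            prev = digest
        return True
```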

7 Usability Concerns

The safe use of a digital forensic tool is ultimately determined by its ability to fit in with the accepted standards for maintaining the chain of custody. Any tool that casts doubt on the integrity of the evidence collected cannot be considered appropriate for use during an investigation. The framework proposed in this paper is purely intended as an extension of the way in which the investigator searches for relevant information. It merely augments the analysis phase, attempting to prevent the exposure of sensitive information about third parties who are not involved in the crime but may be close to it purely by virtue of sharing a job function or a contact with the person under investigation.


The framework therefore does not change any data on the media being inspected, and does not get in the investigator's way: if the investigator uncovers information pertaining to third parties, the sensitivity level can be adjusted to reveal more information. The filter can also be switched off completely, meaning the investigator gets access to all the information on the medium.

8 Conclusion

This paper considered the prevention of privacy breaches of so-called third parties during a DF investigation. It was argued that third parties (or innocent bystanders) could suffer a loss of privacy during an investigation in a multi-user environment as a result of the investigator getting access to all data on the storage media of the seized equipment. To prevent TPPBs, the paper presented a first step toward privacy aware investigation in the form of a framework that outlined the characteristics and components of a PET that would be able to protect the privacy of individuals who are not under investigation. The PET adds little additional burden on the investigator during the investigative process, and operates on the same principles as the tools currently used to filter out information that may not be relevant to an investigation. The important difference is that the proposed PET takes privacy concerns into account when filtering the query results. One of the primary features of the PET is that investigators are required to present focussed queries relating to the investigation: if a query is considered too wide, the PET will prevent the investigator from getting information. Should the investigator have concerns about missing relevant information, the PET can be reconfigured to be less sensitive about the query width, and return results even if the query is considered unfocussed. An audit log is kept so that queries are available for compliance auditing should a privacy breach be discovered after the fact. Future research in this area is to test the usability of such a PET during an investigation, and to further investigate its use during searches of slack space and file-carving results.

References

[1] G. Palmer, A Road Map for Digital Forensic Research, Technical Report, DFRWS, Utica, NY, USA, 2001.

[2] P. Gladyshev, Formalising Event Reconstruction in Digital Investigations, PhD Thesis, University College Dublin, Ireland, 2004.

[3] R. van den Hoven van Genderen, Cybercrime Investigation and the Protection of Personal Data and Privacy, Technical Report, Council of Europe, Economic Crime Division, 2008.

[4] TWGEDE Group, Forensic Examination of Digital Evidence: A Guide for Law Enforcement, Technical Report, National Institute of Justice, 2004.

[5] Minister of Justice and Constitutional Development, Protection of Personal Information Bill (PoPI), electronically published (http://www.justice.gov.za/legislation/bills/B92009_ProtectionOfPersonalInformation.pdf), Republic of South Africa, 2009.

[6] B.K.L. Fei, J.H. Eloff, H.S. Venter and M.S. Olivier, Exploring Forensic Data with Self-Organizing Maps, Advances in Digital Forensics, M. Pollitt and S. Shenoi (Eds.), Springer, Boston, MA, pp. 113–123, 2005.

[7] B.K.L. Fei, J.H. Eloff, M.S. Olivier, H.M. Tillwick and H.S. Venter, Using Self-Organising Maps for Anomalous Behaviour Detection in a Computer Forensic Investigation, Proceedings of the Fifth Annual Information Security South Africa Conference (ISSA2005), L. Labuschagne, H.S. Venter, J.H. Eloff and M.M. Eloff (Eds.), 2005.

[8] N.L. Beebe and J.G. Clark, Dealing with Terabyte Data Sets in Digital Investigations, Advances in Digital Forensics, M. Pollitt and S. Shenoi (Eds.), Springer, Boston, MA, pp. 3–16, 2005.

[9] I.H. Witten, Practical Handbook of Internet Computing, Chapman & Hall/CRC Press, pp. 1–22, 2005.

[10] T. Kohonen, M.R. Schroeder and T.S. Huang (Eds.), Self-Organizing Maps, Springer-Verlag New York, Secaucus, NJ, USA, 2001.

[11] N.L. Beebe and J.G. Clark, Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results, Digital Investigation, Elsevier Science Publishers B.V., Amsterdam, The Netherlands, pp. 49–54, 2007.

[12] S.D. Warren and L.D. Brandeis, The Right to Privacy, Harvard Law Review, pp. 193–220, 1890.

[13] The European Parliament and the Council of the European Union, EU Data Protection Directive 95/46/EC, 1995.

[14] Organisation for Economic Co-operation and Development, OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, Technical Report (http://www.oecd.org/document/18/0,2340,en_2649_34255_1815186_1_1_1_1,00.html), 1980.

[15] S. Fischer-Hübner, IT Security and Privacy: Design and Use of Privacy Enhancing Security Mechanisms, Springer-Verlag, 2001.

[16] W.J.C. van Staden and M.S. Olivier, On Compound Purposes and Compound Reasons for Enabling Privacy, Journal of Universal Computer Science, 17(3), pp. 426–450, 2011.

[17] N.J. Croft and M.S. Olivier, Sequenced Release of Privacy-Accurate Information in a Forensic Investigation, Digital Investigation, 7(1–2), Elsevier Science Publishers B.V., Amsterdam, The Netherlands, pp. 95–101, 2010.

[18] W.B. Cavnar and J.M. Trenkle, N-Gram-Based Text Categorization, Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175, 1994.

[19] S.C. Johnson, Hierarchical Clustering Schemes, Psychometrika, 32(3), Springer, 1967.

[20] C.E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, 27, pp. 379–423, 1948.

[21] R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau and R. Srikant, Auditing Compliance with a Hippocratic Database, Proceedings of the Thirtieth VLDB Conference, M.A. Nascimento, M.T. Özsu, D. Kossmann, R.J. Miller, J.A. Blakeley and K.B. Schiefer (Eds.), Morgan Kaufmann, pp. 516–527, 2004.