Privacy Preserving Data Mining

First, Solve the Homework

- 100 doors, 100 people
- Initial state: all doors closed (0)
- Final state: some open, some closed
- Problem: how many doors are open? Which ones?
- Operations: person 1 toggles all doors, person 2 toggles the even-numbered doors, person 3 toggles doors 3, 6, 9, ..., and so on (each toggle flips a door between 0 and 1)
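A quick simulation of the puzzle (function name is mine): person p toggles every p-th door, so a door ends up open exactly when its number has an odd number of divisors, i.e. when it is a perfect square.

```python
def open_doors(n=100):
    doors = [False] * (n + 1)          # index 0 unused; False = closed
    for person in range(1, n + 1):
        for door in range(person, n + 1, person):
            doors[door] = not doors[door]   # toggle every multiple of `person`
    return [d for d in range(1, n + 1) if doors[d]]

open_doors()  # -> [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```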

Graph Similarity

- Given two graphs (I'll draw them now), decide: are they similar? (Y/N)

Overview

- Why privacy?
- PPDM (Privacy Preserving Data Mining)
- How is PPDM possible? Techniques
- Building privacy preserving algorithms for 3 categories of data mining techniques (related work):
  - Classification
  - Association Rules
  - Clustering
- References
- Challenging problems

Data Mining - Overview

- Develop models about aggregated data
- Extract knowledge from the data and come up with patterns in very large databases
- Discover information that is not obvious from large databases

Data Mining - Example

- Center for Disease Control (CDC)
  - Identify trends and patterns in disease outbreaks
  - Understand and predict the progression of a flu outbreak
  - Might want some data from insurance companies
  - No access to the data (the companies might not want to reveal it due to privacy concerns)
- Public use of private data
  - Data mining is used in research studies of huge populations
  - What if the population does not want to release the data?

Can we develop accurate models without access to the original data?

Solution to the problem

- Insurance companies
  - Do not give access to the original data
  - Provide statistics on the data such that the original data cannot be retrieved from those statistics
  - Such statistics can still be used to identify trends and patterns

Privacy Preserving Data Mining (PPDM)

- Goal: protect data privacy during data mining
- Different techniques in PPDM:
  - Query restriction
  - Data perturbation

Techniques in PPDM

- Query restriction
  - Partitioning
  - Cell suppression
  - Query size control
- Data perturbation
  - Noise addition: introduce noise either into the data or into the results of queries

PPDM Classification - Decision Trees

Two approaches:
- Randomization approach
  - Hide the original data by randomly modifying the data values with additive noise, while still preserving the patterns of the original data (its underlying probabilistic properties)
  - Reconstruct the distribution of the original data values from the perturbed data; the original values themselves cannot be reconstructed
  - A decision tree classifier is built from the perturbed data using this reconstructed distribution
  - Risk: privacy breaches

PPDM Classification – Decision Trees, Approaches (contd.)

- Cryptographic approach
  - Party X owns database D1; party Y owns database D2
  - Build a decision tree on D1 and D2 without revealing information about D1 to party Y or about D2 to party X, except what might be revealed by the decision tree itself
  - Horizontally partitioned data: records (entities) split across parties
  - Vertically partitioned data: attributes split across parties

Related Work: Classification – Decision Trees

- Perturbation (randomization) approach
  - "Privacy Preserving Data Mining" – Rakesh Agrawal, Ramakrishnan Srikant
  - "On the Design and Quantification of Privacy Preserving Data Mining Algorithms" – Dakshi Agrawal, Charu C. Aggarwal (EM algorithm)
- SMC (Secure Multiparty Computation) approach
  - "Privacy Preserving Data Mining" – Lindell, Pinkas
  - "Tools for Privacy Preserving Distributed Data Mining" – Kantarcioglu, Clifton, Vaidya, Xiaodong Lin, Michael Y. Zhu

Randomization Approach

"Privacy Preserving Data Mining" – Rakesh Agrawal, Ramakrishnan Srikant

- Randomize the data
  - Value-class membership: the attribute values are discretized into intervals, and the interval in which a value lies is returned instead of the original value
  - Value distortion: add a random value r to each value of an attribute
    - Uniform: r lies in [−α, α], mean 0
    - Gaussian: mean 0, standard deviation σ
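A minimal sketch of value distortion (the parameter values are illustrative): each value gets independent additive noise, either uniform on [−α, α] or Gaussian with mean 0.

```python
import random

def distort_uniform(values, alpha):
    # add uniform noise in [-alpha, alpha] to each value
    return [v + random.uniform(-alpha, alpha) for v in values]

def distort_gaussian(values, sigma):
    # add Gaussian noise with mean 0 and standard deviation sigma
    return [v + random.gauss(0.0, sigma) for v in values]

ages = [30, 50, 65, 25]
perturbed = distort_uniform(ages, alpha=20)   # e.g. 30 may become anything in [10, 50]
```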

Randomization Approach – Overview

[Diagram] Original records (e.g., Age 30 / Salary 70K, Age 50 / Salary 40K) pass through a randomizer, producing randomized records (e.g., 65 / 20K, 25 / 60K). The distributions of Age and of Salary are then reconstructed from the randomized data and fed to the data mining algorithms, which output the model.

Reconstructing the Original Data Distribution

- Problem:
  - Let x1, x2, ..., xn be the original values (drawn from probability distribution X)
  - Let y1, y2, ..., yn be the random values used to distort the data (drawn from probability distribution Y)
  - Given the perturbed data x1+y1, x2+y2, ..., xn+yn and the probability distribution Y (the noise), estimate the probability distribution X of the original data

Reconstructing the Original Data Distribution - Solution

- Using Bayes' theorem, given the noise density f_Y and the randomized values w_i = x_i + y_i, the estimated density function is:

  f_X'(a) = (1/n) ∑_{i=1}^{n} [ f_Y(w_i − a) f_X(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X(z) dz ]

- Given a large number of samples, this estimate approaches the real density function; but f_X itself is unknown, so it is approximated iteratively:

  f_X^{j+1}(a) = (1/n) ∑_{i=1}^{n} [ f_Y(w_i − a) f_X^{j}(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X^{j}(z) dz ]

- Initially f_X^{0} is the uniform distribution; iterate until the stopping criterion is met
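The iteration above can be sketched on a discretized grid (the interval grid, noise shape, and sample values below are my own toy choices): `f_y` is the known noise density and `w` holds the perturbed values w_i = x_i + y_i.

```python
def reconstruct(w, bins, f_y, iters=30):
    fx = [1.0 / len(bins)] * len(bins)        # f_X^0: the uniform distribution
    for _ in range(iters):
        new = [0.0] * len(bins)
        for wi in w:
            # denominator: the integral of f_Y(w_i - z) f_X^j(z) dz, discretized
            denom = sum(f_y(wi - z) * fx[j] for j, z in enumerate(bins))
            if denom == 0.0:
                continue
            for j, a in enumerate(bins):
                new[j] += f_y(wi - a) * fx[j] / denom
        total = sum(new)
        fx = [v / total for v in new]          # normalize to get f_X^{j+1}
    return fx

def f_y(d):
    # uniform noise on [-10, 10]: density 1/20 inside, 0 outside
    return 0.05 if -10 <= d <= 10 else 0.0

bins = [float(b) for b in range(0, 101, 5)]
w = [45, 55, 50, 48, 52]                       # perturbed samples, true values near 50
fx = reconstruct(w, bins, f_y)
```

The estimated mass concentrates on bins consistent with all the observed perturbed values, which is what lets the decision tree be built from the reconstructed distribution.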

Seems to work well!

[Chart] Number of people vs. Age, comparing the Original, Randomized, and Reconstructed distributions: the reconstructed distribution closely tracks the original.

"On the Design and Quantification of Privacy Preserving Data Mining Algorithms" – Dakshi Agrawal, Charu C. Aggarwal

- The previous distribution reconstruction process leads to some information loss
- Proposes an EM (Expectation Maximization) algorithm for reconstructing the original distribution
- Provides robust estimates of the original distribution given a large amount of data
- Less information loss

Cryptographic Approach (SMC Approach)

"Tools for Privacy Preserving Distributed Data Mining"

- A toolkit of privacy preserving distributed computation techniques that can be applied to real problems
- SMC (Secure Multiparty Computation): no party learns anything except its own input and the result
  - Two approaches: use a trusted third party, or a direct communication mechanism among the parties
  - Induce nondeterminism in the exchanged values (encryption)
- Techniques discussed:
  - Secure Sum
  - Secure Set Union
  - Secure Size of Set Intersection
  - Scalar Product

Secure Sum

- Compute the sum of values, one from each site; the sum is known to lie in the range [0, n)

  v = ∑_{l=1}^{s} v_l

- Site 1 generates a random number R, adds its local value mod n, and sends the result to the next site
- Each site l = 2, ..., s adds its local value mod n to the running total:

  R + ∑_{j=1}^{l} v_j  (mod n)

- Site s sends the result back to site 1, which subtracts R to obtain the sum
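The protocol can be simulated in a few lines (a single-round toy version with honest parties; in a real run each site would only ever see the masked running total):

```python
import random

def secure_sum(values, n):
    # each local value, and the true sum, must lie in [0, n)
    R = random.randrange(n)            # site 1's random mask
    running = (R + values[0]) % n      # site 1 adds its local value
    for v in values[1:]:               # sites 2..s each add their value mod n
        running = (running + v) % n
    return (running - R) % n           # site 1 subtracts R to recover the sum

secure_sum([10, 20, 30], n=1000)  # -> 60
```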

Secure Set Union

- Uses a commutative encryption mechanism
- Each party encrypts its own items and adds them to the global set
- Each party then encrypts the items of the remaining parties
- Duplicates are removed (duplicates among the original items appear as duplicates among the encrypted items too)
- Basic idea: all the items are now permuted and unrecognizable
- The global set is passed around, each site decrypting its layer of encryption
- The union of the items is obtained

Secure Size of Set Intersection

- Uses commutative encryption
- Every party encrypts its items with its own key and passes them to the other parties
- When a party receives a set, it encrypts each item, permutes the order, and sends the set on; repeat until every item has been encrypted by every party
- Two encrypted values are equal only if their respective original values are equal
- Since only the count is needed, no decryption is required
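A toy stand-in for commutative encryption is modular exponentiation with a fixed prime modulus (Pohlig-Hellman style; the prime and keys below are far too small to be secure and are not the cited protocol's actual cipher): encrypting in either key order gives the same value, so double-encrypted items match exactly when the originals match.

```python
p = 2_147_483_647                      # a Mersenne prime; exponents commute mod p-1

def enc(x, key):
    return pow(x, key, p)              # E_key(x) = x^key mod p

def intersection_size(set_a, set_b, key_a=65537, key_b=65539):
    # each party encrypts its own items, then the other party's items;
    # since (x^a)^b = (x^b)^a mod p, matches survive double encryption
    a_double = {enc(enc(x, key_a), key_b) for x in set_a}
    b_double = {enc(enc(x, key_b), key_a) for x in set_b}
    return len(a_double & b_double)    # only the count, no decryption needed

intersection_size({3, 5, 7, 11}, {5, 11, 13})  # -> 2
```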

Applications

- Association rule mining in horizontally partitioned data – Kantarcioglu, Chris Clifton
- Association rule mining in vertically partitioned data – Jaideep Vaidya, Chris Clifton
- Privacy preserving distributed data mining – Chris Clifton
- EM clustering

"Privacy Preserving Data Mining" – Lindell, Pinkas

- Based on Yao's two-party protocol ("How to Generate and Exchange Secrets")
  - Two parties P1 and P2 with inputs x and y respectively
  - The functionality f is represented as a combinatorial circuit
  - The parties run a secure protocol on each gate
- Not suitable for huge databases
- Extends the ID3 classification algorithm
- The training set is distributed between the two parties
- Uses cryptographic tools to build the decision tree

Related Work: Classification – Decision Trees (contd.)

- "Random Data Perturbation Techniques and Privacy Preserving Data Mining" – Hillol Kargupta, Souptik Datta, Qi Wang, Krishnamoorthy Sivakumar
  - Randomization preserves very little privacy
  - Random noise can be represented as random matrices, and random matrices have properties from which the original data can be estimated
  - Random objects have predictable structures in the spectral domain
  - Spectral filtering techniques can be used to estimate the original data

Related Work - Privacy Preserving Association Rule Mining

- "Privacy Preserving Association Rule Mining in Vertically Partitioned Data" – Jaideep Vaidya, Chris Clifton
- "Maintaining Data Privacy in Association Rule Mining" – Shariq Rizvi, Jayant Haritsa
- "Privacy Preserving Mining of Association Rules" – Evfimievski, Srikant, Agrawal, Johannes Gehrke
- "Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data" – Kantarcioglu, Clifton, Vaidya
- "Privacy Preserving Distributed Data Mining" – Chris Clifton
- "An Architecture for Privacy Preserving Mining of Client Information" – Murat Kantarcioglu, Jaideep Vaidya

"Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data" – Kantarcioglu, Chris Clifton

- The ability to share non-sensitive data enables highly effective solutions
- Not all of the data needs to be hidden from all of the parties: some of the data can be known to some of the parties, but no party can see all the data

Contd.

- Three phases:
  1. Identify all the candidate itemsets (Secure Set Union)
  2. Verify that each itemset satisfies the support threshold (Secure Sum)
     - If X is an itemset, each site i finds its local support count X.sup_i, then the global support count ∑_{i=1}^{s} X.sup_i is computed
     - X is globally supported if its global support count is at least the support threshold fraction times the total number of transactions
  3. Securely find the confidence of a rule X => Y (Secure Sum): check whether

     ∑_{i=1}^{n} {X ∪ Y}.sup_i / ∑_{i=1}^{n} X.sup_i ≥ c

     where each site i knows {X ∪ Y}.sup_i and X.sup_i
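Phases 2 and 3 can be sketched with the Secure Sum masking idea from earlier (the site counts, thresholds, and function name below are made up for illustration):

```python
import random

def secure_total(local_counts, n_mod=10**6):
    # masked circulation of a running total, as in Secure Sum
    R = random.randrange(n_mod)           # site 1's mask
    running = R
    for c in local_counts:                # each site adds its count mod n
        running = (running + c) % n_mod
    return (running - R) % n_mod          # site 1 removes the mask

sup_x  = [120, 95, 60]                    # X.sup_i at three sites
sup_xy = [80, 60, 40]                     # {X u Y}.sup_i at the same sites
total_txns, s, c = 1000, 0.25, 0.6        # transactions, support %, confidence

globally_supported = secure_total(sup_x) >= s * total_txns      # 275 >= 250
confident = secure_total(sup_xy) / secure_total(sup_x) >= c     # 180/275 >= 0.6
```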

"Privacy Preserving Association Rule Mining on Vertically Partitioned Data" – Chris Clifton, Jaideep Vaidya

- Two-party computation
- No central authority over the data: the data is split vertically across the two parties
- Example: market basket data, with grocery purchases held by one party and clothing purchases by the other
- One naive approach to privacy preserving: each party runs the association rule mining algorithm on its own data, and the results from the two parties are combined
  - Disadvantages: duplication, and cross-party correlations are missed

- Mining boolean association rules
  - Absence of an attribute: 0; presence of an attribute: 1
  - Determining the frequent itemsets amounts to determining how many rows have the value 1 for all attributes in the itemset
- X and Y represent attributes in the database; x_i is the value of attribute X for row i
- Scalar product (n = total number of transactions, k = support threshold):

  X · Y = ∑_{i=1}^{n} x_i * y_i

- An itemset is frequent if X · Y ≥ k
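This is what the scalar product computes on the vertically partitioned boolean columns (the real protocol computes it securely; this sketch and its sample bit-columns are mine, with party A holding attribute X and party B holding Y):

```python
def support_via_scalar_product(x, y):
    # counts the transactions in which both attributes are 1
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1, 0, 1, 1, 0, 1]    # A's column: attribute present in rows 0, 2, 3, 5
y = [1, 1, 1, 0, 0, 1]    # B's column: attribute present in rows 0, 1, 2, 5
k = 2                     # support threshold, as a count
frequent = support_via_scalar_product(x, y) >= k   # rows 0, 2, 5 match
```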

"Privacy Preserving Mining of Association Rules" – Evfimievski, Srikant, Agrawal, Johannes Gehrke

- Categorical data items
- Horizontally partitioned data
- Principle of uniform randomization: with probability p, replace each item with another item that is not present in the transaction
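A minimal sketch of uniform randomization (the item universe and function name are my own illustration):

```python
import random

def uniform_randomize(transaction, universe, p):
    # with probability p, replace each item by an item not in the transaction
    outside = [it for it in universe if it not in transaction]
    out = []
    for item in transaction:
        if outside and random.random() < p:
            out.append(random.choice(outside))
        else:
            out.append(item)
    return out

universe = ["bread", "milk", "eggs", "beer", "diapers"]
uniform_randomize(["bread", "milk"], universe, p=0.5)
```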

Privacy Preserving Clustering

- "Privacy Preserving K-Means Clustering over Vertically Partitioned Data" – Jaideep Vaidya, Chris Clifton
- "Privacy Preserving Clustering by Data Transformation" – Stanley R. M. Oliveira, Osmar R. Zaiane

"Privacy Preserving K-Means Clustering over Vertically Partitioned Data" – Jaideep Vaidya, Chris Clifton

- K-means clustering: divide the data into k clusters
- Vertically partitioned data: each site has information on all the entities, but only for some of the attributes
- Goal: cluster the entities without revealing the attribute values

Problem:
- r parties, n entities, k clusters
- Cluster the data using the k-means algorithm
- In the final result, each party learns the final value of its means and the cluster to which each point is assigned, and nothing else

- The privacy preserving k-means clustering algorithm proceeds as follows:
  - Each party i calculates its own distance vector with respect to its own attributes: X_i = [x_{1i}, x_{2i}, ..., x_{ki}]^T, where x_{ji} is party i's component of the distance from the point to cluster j
  - The row (cluster) where the sum X_1 + X_2 + ... + X_r is minimum is the closest cluster

- Finding the closest cluster
  - Requires cooperation between the parties
  - Secure computation of the closest cluster involves:
    - The permutation algorithm ("Privacy Preserving Cooperative Statistical Analysis" by Wenliang Du and Mikhail J. Atallah)
    - Secure add-and-compare (using combinatorial circuits)
- Security is based on 3 key ideas:
  - Disguise the distance components with random vectors that sum to 0
  - Only the result of the comparison of distances should be learned
  - Permute the order of the clusters so that the meaning of the intermediate comparisons stays hidden

- Permutation algorithm:
  - Let the parties be P1, P2, ..., Pr; P1, P2, and Pr are non-colluding
  - P1 generates random vectors V_i for each party i = 1 to r such that V_1 + V_2 + ... + V_r = 0
  - P1 also generates a permutation π over (1..k)
  - Each party i = 2, ..., r generates a key pair (E_k, D_k), computes E_k(X_i), and sends it together with the encryption function E_k to P1
  - P1 now holds the encrypted distance vectors and the encryption functions of all the other parties
  - For i = 2 to r, P1 calculates (using the homomorphic property) E_k(X_i) * E_k(V_i) = E_k(X_i + V_i)
  - Using its permutation π, P1 calculates T'_p = π[E_k(X_i + V_i)] and sends T'_p to each party i = 2, ..., r
  - Each party i = 2, ..., r decrypts: T_i = D_k[T'_p] = D_k(π[E_k(X_i + V_i)]) = π[X_i + V_i]
  - Each party i = 1, 3, ..., r sends T_i to Pr, and Pr calculates T_1 + ∑_{i=3}^{r} T_i
  - P2 and Pr now need to find the closest cluster: Pr has every component of the sum needed to find the minimum row of the distance matrix except T_2, so P2 and Pr engage in secure additions/comparisons to find the closest cluster
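The masking-and-permuting core of the algorithm can be sketched in plain values (encryption omitted; all names and the sample distances are mine): noise vectors that sum to the zero vector cancel when the shares are recombined, and one shared permutation hides which cluster position is which, while the minimum is preserved.

```python
import random

def make_zero_sum_vectors(r, k):
    # r random k-vectors V_1..V_r with V_1 + ... + V_r = 0
    vs = [[random.uniform(-10, 10) for _ in range(k)] for _ in range(r - 1)]
    vs.append([-sum(col) for col in zip(*vs)])      # last vector cancels the rest
    return vs

def permuted_noisy_shares(distance_vectors, k):
    vs = make_zero_sum_vectors(len(distance_vectors), k)
    pi = list(range(k))
    random.shuffle(pi)                               # the permutation over (1..k)
    # each party's share is its noisy distance vector, permuted the same way
    return [[x[j] + v[j] for j in pi] for x, v in zip(distance_vectors, vs)]

# three parties, three clusters; true per-cluster total distances: 9, 3, 12
X = [[3, 1, 4], [2, 1, 6], [4, 1, 2]]
shares = permuted_noisy_shares(X, k=3)
totals = [sum(col) for col in zip(*shares)]          # the noise cancels in the sum
closest = min(range(3), key=lambda j: totals[j])     # index of the (permuted) minimum
```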

- The matrix needed to find the closest cluster is the sum of two parts: a k-row vector whose j-th entry is x_{j1} + x_{j3} + x_{j4} + ... + x_{jr} (known only to Pr), and a k-row vector whose j-th entry is x_{j2} (known only to P2)
- This secure addition/comparison is done using a combinatorial circuit

Generic K-Means Privacy Preserving Algorithm

1. To start with, each party initializes its own means for each cluster.
2. Securely find the closest cluster for each data point across all parties, using the permutation algorithm [21]:
   - P1 generates random vectors V_i for each party i = 1 to r such that ∑_{i=1}^{r} V_i = 0
   - P1 also generates a permutation π over (1..k)
   - Each party i = 2, ..., r generates a key pair (E_k, D_k), computes E_k(X_i), and sends it together with the encryption function E_k to P1
   - P1 now holds the encrypted distance vectors and encryption functions of all the other parties
   - For i = 2 to r, P1 calculates E_k(X_i) * E_k(V_i) = E_k(X_i + V_i)
   - Using its permutation π, P1 calculates T'_p = π[E_k(X_i + V_i)] and sends T'_p to each party i = 2, ..., r
   - Each party i = 2, ..., r decrypts: T_i = D_k[T'_p] = D_k(π[E_k(X_i + V_i)]) = π[X_i + V_i]
   - Each party i = 1, 3, ..., r sends T_i to Pr, and Pr calculates T_1 + ∑_{i=3}^{r} T_i
   - P2 and Pr now need to find the closest cluster: Pr has every component of the sum except T_2 (the per-cluster sums over parties 1, 3, ..., r are known only to Pr, and party 2's components only to P2), so P2 and Pr engage in secure additions/comparisons to find the closest cluster
3. Calculate the new means for each cluster for each party (secure sum and compare, using combinatorial circuits).
4. Iterate this loop until there is no difference between the old and new means, or until the difference is negligible (check against a threshold).

Problem with the Generic Algorithm

- How do the efficiency and accuracy of the algorithm hold up in the presence of different scales, variability, correlations, and outliers?
- The variable with the largest scale dominates the distance
- Example: clustering on Age and Salary, Salary becomes the dominant attribute. Irrespective of the values of Age, the first cluster will have the records with ids 1, 5, and 6, and the second cluster will have the records with ids 2, 3, and 4:

  Id | Age | Salary
  ---|-----|--------
  1  | 23  | $55,000
  2  | 33  | $67,000
  3  | 24  | $66,000
  4  | 56  | $67,000
  5  | 34  | $53,000
  6  | 39  | $52,000

Possible Solutions [24]

- Normalization: normalizing the data ensures that the distance measure gives equal weight to all the variables
  - Drawback: a single outlier can squeeze all the other values into a small range
- Standardization transformation: subtract the mean of each attribute, then divide by its standard deviation
  - Drawback: does not take correlations between attributes into account
- Statistical estimates
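The standardization transformation above, applied to the Age/Salary example (sample standard deviation is my choice of estimator):

```python
import statistics

def standardize(column):
    # subtract the attribute's mean, divide by its standard deviation
    mu = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mu) / sd for v in column]

ages = [23, 33, 24, 56, 34, 39]
salaries = [55000, 67000, 66000, 67000, 53000, 52000]
z_age, z_salary = standardize(ages), standardize(salaries)
# both attributes now have mean 0 and unit variance,
# so Salary no longer dominates the Euclidean distance
```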

References

- "Privacy Preserving Data Mining" (tutorial) – Chris Clifton
- "Privacy Preserving Data Mining: Challenges and Opportunities" – Ramakrishnan Srikant
- "Privacy Preserving Cooperative Statistical Analysis" – Wenliang Du, Mikhail J. Atallah
- "Defining Privacy for Data Mining" – Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya
- "Data Mining: Concepts and Techniques" – Jiawei Han, Micheline Kamber