Building Privacy Preserving Algorithms for 3 categories of data mining techniques – Related Work
- Classification
- Association Rules
- Clustering
First Solve the Homework
100 doors, 100 people. Initial state: all doors closed (0). Final state: some open, some closed.
Problems: How many doors are open? Which ones?
Operations: Person 1 toggles all doors; Person 2 toggles the even-numbered doors (0/1); Person 3 toggles doors 3, 6, 9, ... (0/1); and so on.
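The homework puzzle can be checked directly by simulation. A minimal Python sketch (the function name is mine): person k toggles every k-th door, and a door ends up open exactly when it was toggled an odd number of times, i.e. when its number has an odd number of divisors, i.e. when it is a perfect square.

```python
def hundred_doors(n=100):
    """Simulate the puzzle: person k toggles every k-th door."""
    doors = [False] * (n + 1)          # index 0 unused; False = closed
    for person in range(1, n + 1):
        for door in range(person, n + 1, person):
            doors[door] = not doors[door]
    return [d for d in range(1, n + 1) if doors[d]]

open_doors = hundred_doors()
print(len(open_doors))   # 10 doors remain open
print(open_doors)        # exactly the perfect squares: 1, 4, 9, ..., 100
```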
Graph similarity
Given two graphs (to be drawn on the board), decide whether they are similar: Y/N.
Overview
Why privacy?
PPDM (Privacy Preserving Data Mining)
How is PPDM possible? – Techniques
Building Privacy Preserving Algorithms for 3 categories of data mining techniques – Related Work
- Classification
- Association Rules
- Clustering
References
Challenging Problems
Overview
Develop models of aggregated data. Extract knowledge from the data and discover patterns in very large databases. Discover information that is not obvious from large databases.
Data Mining - Example
Center for Disease Control: identify trends and patterns in disease outbreaks, e.g. understanding and predicting the progression of a flu outbreak. The CDC might want data from insurance companies, but may get no access to it (the companies might not want to reveal the data due to privacy concerns). Public use of private data: data mining is used in research studies of huge populations. What if the population does not want to release the data?
Can we develop accurate models without access to the original data?
Solution to the problem
Insurance companies do not give access to the original data. Instead, they provide statistics on the data from which the original data cannot be retrieved. Such statistics can still be used to identify trends and patterns.
Privacy Preserving Data Mining (PPDM)
To protect data privacy in data mining. Techniques in PPDM:
- Query Restriction
- Data Perturbation
- Noise Addition
Query Restriction
- Partitioning
- Cell Suppression
- Query size control
Data perturbation Introducing noise either to data or to the results of queries
PPDM Classification – Decision Trees. Two approaches:
Randomization approach
- Hide the original data by randomly modifying the values with additive noise, while still preserving the patterns (the underlying probabilistic properties) of the original data.
- Reconstruct the distribution of the original values from the perturbed data; the original values themselves cannot be reconstructed.
- A decision tree classifier is built from the perturbed data using this reconstructed distribution.
- Risk: privacy breaches.
PPDM Classification – Decision Trees Approaches (contd … )
Cryptographic approach
- Party X owns database D1; Party Y owns database D2.
- Build a decision tree on D1 and D2 without revealing information about D1 to party Y or about D2 to party X, except what might be revealed by the decision tree itself.
- Horizontally partitioned data: records (entities) are split across parties.
- Vertically partitioned data: attributes are split across parties.
Related Work Classification – Decision Trees
Perturbation (Randomization) Approach
- Privacy Preserving Data Mining – Rakesh Agrawal, Ramakrishna Srikant
- On the Design and Quantification of Privacy Preserving Data Mining Algorithms – Dakshi Agrawal, Charu C. Aggarwal (EM algorithm)
SMC (Secure Multiparty Computation) Approach
- Privacy Preserving Data Mining – Lindell, Pinkas
- Tools for Privacy Preserving Distributed Data Mining – Kantarcioglu, Clifton, Vaidya, Xiaodong Lin, Michael Y. Zhu
Randomization Approach
Privacy Preserving Data Mining - Rakesh Agrawal, Ramakrishna Srikant
Randomize the data.
Value Class Membership: the values of the attribute are discretized into intervals; the interval in which a value lies is returned instead of the original value.
Value Distortion: add a random value r to each value of an attribute.
- Uniform: r lies in [−α, +α], mean 0.
- Gaussian: mean 0, standard deviation σ.
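Value distortion is easy to sketch. A minimal, illustrative Python version (function names are mine, not from the paper): add uniform noise from [−α, +α] or Gaussian noise with mean 0 and standard deviation σ to each attribute value.

```python
import random

def distort_uniform(values, alpha):
    """Value distortion: add r ~ Uniform[-alpha, +alpha] (mean 0) to each value."""
    return [v + random.uniform(-alpha, alpha) for v in values]

def distort_gaussian(values, sigma):
    """Value distortion: add r ~ N(0, sigma) to each value."""
    return [v + random.gauss(0.0, sigma) for v in values]

ages = [23, 33, 24, 56, 34, 39]
print(distort_uniform(ages, alpha=10))   # each age shifted by at most 10
print(distort_gaussian(ages, sigma=5))
```

Because the noise has mean 0, aggregate statistics of a large sample are approximately preserved even though individual values are hidden.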
Randomization Approach –Overview
[Figure: original records such as "30 | 70K | ..." and "50 | 40K | ..." pass through a Randomizer, yielding randomized records such as "65 | 20K | ..." and "25 | 60K | ...". The distributions of Age and Salary are reconstructed from the randomized records and fed to the Data Mining Algorithms to build the Model.]
Reconstructing the Original Data Distribution
Problem: Let x1, x2, ..., xn be the original values (drawn from probability distribution X). Let y1, y2, ..., yn be the random values used to distort the data (drawn from probability distribution Y). Given the perturbed data x1 + y1, x2 + y2, ..., xn + yn and the probability distribution Y of the noise, estimate the probability distribution X of the original data.
Reconstructing the Original Data Distribution - Solution
Using Bayes' theorem, given the noise density f_Y and the randomized values w_i = x_i + y_i, the estimated density function is

f_X'(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
Given a large number of samples, this estimate would equal the real density function. But f_X itself is unknown, so iterate:

f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}

Initially f_X^{0} is the uniform distribution. Repeat iteratively until the stopping criterion is met.
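The iteration above can be sketched on a discrete grid, replacing the integral by a sum over grid points. This is a minimal illustrative version, not the paper's implementation; the function and variable names are mine, and the stopping criterion is simply a fixed number of iterations.

```python
import random

def uniform_pdf(y):
    """Known noise density f_Y: Uniform on [-1, 1]."""
    return 0.5 if -1.0 <= y <= 1.0 else 0.0

def reconstruct(w, noise_pdf, grid, iters=20):
    """Iteratively estimate the original distribution f_X on a discrete grid,
    given perturbed samples w_i = x_i + y_i and the known noise pdf f_Y."""
    n, m = len(w), len(grid)
    f = [1.0 / m] * m                        # start from the uniform distribution
    for _ in range(iters):
        new = [0.0] * m
        for wi in w:
            denom = sum(noise_pdf(wi - z) * fz for z, fz in zip(grid, f))
            if denom == 0.0:
                continue
            for a in range(m):
                new[a] += noise_pdf(wi - grid[a]) * f[a] / denom
        f = [v / n for v in new]             # Bayes update, averaged over samples
    return f                                 # probability mass per grid point

# Toy demo: original values cluster near 3 and 7; noise is Uniform[-1, 1].
random.seed(1)
xs = [random.choice([3.0, 7.0]) + random.uniform(-0.3, 0.3) for _ in range(400)]
ws = [x + random.uniform(-1.0, 1.0) for x in xs]
grid = [i * 0.5 for i in range(1, 20)]       # grid over [0.5, 9.5]
est = reconstruct(ws, uniform_pdf, grid)
```

After a few iterations the estimated mass concentrates near 3 and 7, recovering the shape of the original distribution without recovering any individual x_i.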
Seems to work well!
[Figure: Number of People (0–1200) vs. Age (20–60), comparing the Original, Randomized, and Reconstructed distributions.]
On the design and Quantification of Privacy Preserving Data Mining Algorithms – Dakshi Agrawal, Charu C Aggarwal
The previous distribution reconstruction process leads to some information loss. This paper uses the EM (Expectation Maximization) algorithm to reconstruct the original distribution. It provides robust estimates of the original distribution, especially when a large amount of data is available, with less information loss.
Cryptographic Approach (SMC Approach)
Tools for privacy preserving distributed data mining
Toolkit of privacy preserving distributed computation techniques that can be applied to real-world problems.
SMC – Secure Multiparty Computation: no party learns anything except its own input and the result.
Two approaches: a trusted third party, or a communication mechanism that induces non-determinism in the values (encryption).
Techniques discussed:
- Secure Sum
- Secure Set Union
- Secure Size of Set Intersection
- Scalar Product
Secure Sum
Compute the sum of values from each site, v = \sum_{l=1}^{s} v_l, where the value to be computed is known to lie in the range [0..n].
Site 1 generates a random number R, adds its local value mod n, and sends the result to the next site. Each of the sites 2..s adds its local value mod n to the running total, which after site l equals

R + \sum_{j=1}^{l} v_j \mod n

Site s sends this to site 1, which subtracts R from the result to get the sum.
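Secure Sum can be simulated in a few lines. This is an illustrative single-process sketch (function name mine): in the real protocol each addition happens at a different site, and the modulus must exceed the largest possible sum.

```python
import random

def secure_sum(values, modulus):
    """Secure Sum: site 1 masks its value with a random R; each site adds its
    local value mod `modulus`; site 1 finally removes the mask."""
    R = random.randrange(modulus)
    running = (R + values[0]) % modulus      # site 1 starts the chain
    for v in values[1:]:                     # sites 2..s each add their value
        running = (running + v) % modulus    # running total reveals nothing alone
    return (running - R) % modulus           # site 1 subtracts R -> true sum

print(secure_sum([10, 20, 5], modulus=1000))   # → 35
```

Each intermediate value is uniformly distributed because of the mask R, so no single site learns another site's input.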
Secure Set Union
Commutative encryption mechanism. Each party encrypts its own items and adds them to the global set; each party then encrypts the items of the remaining parties. Duplicates are removed (duplicates among the original items appear as duplicates among the encrypted items too). The basic idea is that all the items are now permuted and encrypted. The global set is then passed around, each site decrypting its items, and the union of the items is obtained.
Secure Size of Set Intersection
Commutative encryption. Every party encrypts its items with its own key and passes them to the other parties. When a party receives a set, it encrypts each item, permutes the order, and sends the set on to the next party. Repeat until every item is encrypted by every party. Two encrypted values are equal in the two sets only if their original values are equal. Since we only need the count, no decryption is needed.
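The equality test on doubly-encrypted items works because the encryption is commutative: E_a(E_b(x)) = E_b(E_a(x)). One common illustrative construction (my choice here, not necessarily the one used in the papers) is Pohlig–Hellman-style exponentiation, E_k(x) = x^k mod p with gcd(k, p−1) = 1. The sketch below simulates the two-party intersection size in a single process.

```python
import math
import random

P = 2**127 - 1          # a Mersenne prime; real deployments choose safe primes

def keygen():
    """Pick a secret exponent invertible mod P-1, so E_k is a bijection."""
    while True:
        k = random.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def enc(k, x):
    """Commutative encryption: E_k(x) = x^k mod P."""
    return pow(x, k, P)

def secure_intersection_size(set_a, set_b):
    """Each item ends up encrypted by both keys; equal ciphertexts
    correspond exactly to equal plaintexts, so counting matches suffices."""
    ka, kb = keygen(), keygen()
    a_double = {enc(kb, enc(ka, x)) for x in set_a}   # A encrypts, then B
    b_double = {enc(ka, enc(kb, x)) for x in set_b}   # B encrypts, then A
    return len(a_double & b_double)

print(secure_intersection_size({2, 3, 5, 7}, {5, 7, 11}))   # → 2
```

Since exponents commute (x^{ab} = x^{ba} mod P), the order of encryption does not matter, and no party ever sees the other party's plaintext items.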
Applications
Association rule mining in horizontally partitioned data – Kantarcioglu, Chris Clifton
Association rule mining in vertically partitioned data – Jaideep Vaidya, Chris Clifton
Privacy preserving distributed data mining – Chris Clifton
EM clustering
Privacy Preserving Data Mining – Lindell, Pinkas
Yao's 2-party protocol ("How to Generate and Exchange Secrets"): 2 parties P1 and P2 with inputs x and y respectively. The functionality f is represented as a combinatorial circuit, and the parties run a separate protocol for each gate, so the approach is not suitable for huge databases. The paper extends the ID3 classification algorithm: the training set is distributed between 2 parties, and cryptographic tools are used to build the decision tree.
Related Work Classification – Decision Trees (contd…)
"Random Data Perturbation Techniques and Privacy Preserving Data Mining" – Hillol Kargupta, Souptik Datta, Qi Wang, Krishnamoorthy Sivakumar. Randomization preserves very little privacy: the additive noise can be represented as a random matrix, and random matrices have predictable structure in the spectral domain, so spectral filtering techniques can be used to estimate the original data from the perturbed data.
Related Work - Privacy Preserving Association Rule Mining
Privacy preserving Association Rule Mining in vertically partitioned data – Jaideep Vaidya, Chris Clifton
Maintaining Data Privacy in Association Rule Mining – Shariq Rizvi, Jayant Haritsa
Privacy Preserving Mining of Association Rules – Evfimievski, Srikant, Agrawal, Johannes Gehrke
Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data – Kantarcioglu, Clifton
Privacy preserving Distributed Data Mining – Chris Clifton
An Architecture for Privacy Preserving Mining of Client Information – Murat Kantarcioglu, Jaideep Vaidya
Privacy Preserving distributed mining of Association rules on Horizontally partitioned data - Kantarcioglu, Chris Clifton
The ability to share non-sensitive data enables highly effective solutions. We need not hide all the data from all the parties: some of the data can be known to some of the parties, but no party can see all the data.
Contd……
3 phases:
1. Identify all the candidate itemsets (Secure Set Union).
2. Verify that each itemset satisfies the support threshold (Secure Sum). If X is an itemset, find the local support count X.sup_i at each site, then the global support count \sum_i X.sup_i. X is globally supported if \sum_i X.sup_i ≥ s × (total number of transactions), where s is the support threshold.
3. Securely find the confidence of a rule X ⇒ Y: check whether

\frac{\sum_{i=1}^{n} \{X \cup Y\}.sup_i}{\sum_{i=1}^{n} X.sup_i} \geq c

Each site i knows its own \{X \cup Y\}.sup_i and X.sup_i (Secure Sum).
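The support check combines naturally with Secure Sum: instead of revealing the summed counts, the sites can sum the local "excess" support X.sup_i − s·|DB_i| and test whether the total is non-negative. The sketch below is an illustrative single-process simulation (names mine); in the real protocol the final sign test is done with a secure comparison rather than by inspecting the residue as done here.

```python
import random

def securely_supported(local_counts, local_db_sizes, s, modulus=10**9):
    """Test global support of an itemset without revealing local counts:
    securely sum the local excess X.sup_i - s*|DB_i| and check >= 0."""
    excess = [c - round(s * d) for c, d in zip(local_counts, local_db_sizes)]
    R = random.randrange(modulus)
    running = R                           # site 1's random mask starts the chain
    for e in excess:                      # each site adds its (possibly negative) excess
        running = (running + e) % modulus
    total = (running - R) % modulus       # remove the mask
    if total > modulus // 2:              # interpret large residues as negative sums
        total -= modulus
    return total >= 0

# 3 sites: counts 20+15+30 = 65 of 100+80+120 = 300 transactions, s = 20%
print(securely_supported([20, 15, 30], [100, 80, 120], 0.20))   # → True (65 >= 60)
```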
Privacy Preserving Association Rule Mining on Vertically Partitioned Data – Chris Clifton, Jaideep Vaidya
2-party computation. There is no central authority over the data – the data is split vertically across the 2 parties. Example: market basket data, with grocery purchases held by one party and clothing purchases by the other.
One naive approach to privacy preservation: each party runs the association rule mining algorithm on its own data, and the results from the 2 parties are combined. Disadvantages: duplication, and correlations across the two parties' attributes are missed.
Mining boolean association rules: absence of an attribute is 0, presence is 1. Determining the frequent itemsets means determining how many rows have the value 1 for all attributes in the itemset. Let X and Y be attributes in the database, with x_i the value of attribute X in row i. Use the Scalar Product.
With n the total number of transactions and k the support threshold:

X \cdot Y = \sum_{i=1}^{n} x_i y_i

The itemset is frequent if X · Y > k.
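The scalar product view is worth making concrete: over boolean columns, X · Y counts exactly the rows where both attributes are 1, i.e. the support of the itemset {X, Y}. The sketch below computes it in the clear (the secure scalar product protocol of the paper would hide the columns from each other; the names here are mine).

```python
def itemset_support(x_col, y_col):
    """Support of itemset {X, Y} over boolean columns held by two parties:
    the scalar product counts rows where both attributes are 1."""
    return sum(xi * yi for xi, yi in zip(x_col, y_col))

x = [1, 0, 1, 1, 0, 1]     # party A's column for attribute X
y = [1, 1, 1, 0, 0, 1]     # party B's column for attribute Y
k = 2                      # support threshold
support = itemset_support(x, y)
print(support, support > k)   # → 3 True
```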
Privacy Preserving Mining of Association Rules - Evfimievski, Srikant, Agrawal, Johannes Gehrke
Categorical data items, horizontally partitioned data. Principle of uniform randomization: replace each item, with probability p, by another item that is not present in the transaction.
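A rough sketch of that replacement rule (my simplification of the paper's scheme, which also has more refined "select-a-size" variants): each item in the transaction is independently swapped, with probability p, for a random item from outside the original transaction.

```python
import random

def uniform_randomize(transaction, universe, p):
    """Uniform randomization: each item is, with probability p, replaced by a
    random item not present in the original transaction. (Collisions between
    replacements can shrink the set slightly; fine for a sketch.)"""
    out = set(transaction)
    outside = [i for i in universe if i not in transaction]
    for item in list(transaction):
        if random.random() < p:
            out.discard(item)
            out.add(random.choice(outside))
    return out

universe = list(range(10))
print(uniform_randomize({1, 4, 7}, universe, p=0.5))
```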
Privacy Preserving Clustering
Privacy Preserving k-means Clustering over Vertically Partitioned Data – Jaideep Vaidya, Chris Clifton
Privacy Preserving Clustering by Data Transformation – Stanley R. M. Oliveira, Osmar R. Zaiane
Privacy preserving k means clustering over vertically partitioned data – Jaideep Vaidya, Chris Clifton
k-means clustering: divide the data into k clusters. In vertically partitioned data, each site has information on all the entities, but only for some of the attributes. Goal: cluster the entities without revealing the attribute values.

Problem: r parties, n entities, k clusters. Cluster the data using the k-means algorithm. Each party learns only the final value of its share of each cluster mean and the cluster to which each point is assigned – nothing else.
In the privacy preserving k-means algorithm, each party calculates its own distance vector with respect to its own attributes: party i holds X_i = [x_{1i}, x_{2i}, x_{3i}, ..., x_{ki}]^T, where x_{ji} is party i's component of the distance from the point to cluster j, for i = 1, ..., r. The row where the sum X_1 + X_2 + ... + X_r is minimum is the closest cluster.
Finding the closest cluster requires cooperation between the parties. Secure computation of the closest cluster uses a permutation algorithm ("Privacy Preserving Cooperative Statistical Analysis" by Wenliang Du and Mikhail J. Atallah) and secure add-and-compare (using combinatorial circuits). Security is based on 3 key ideas: mask the distance components with random vectors that sum to 0; permute the cluster order; and reveal only the result of the comparison of distances.
Permutation Algorithm: let the parties be P1, P2, ..., Pr, where P1, P2 and Pr are non-colluding.
- P1 generates random vectors V_i for each party i = 1..r such that V_1 + V_2 + ... + V_r = 0, and a permutation π of (1..k).
- Every party i = 2..r generates a homomorphic key pair (E_i, D_i), computes E_i(X_i), and sends it together with the encryption function E_i to P1.
- P1 now has the encrypted distance vectors and the encryption functions of all other parties. For i = 2..r, P1 computes E_i(X_i) * E_i(V_i) = E_i(X_i + V_i), using the homomorphic property.
- Using its permutation π, P1 computes T'_i = π[E_i(X_i + V_i)] and sends T'_i to party i, for i = 2..r.
- Each party i = 2..r decrypts: T_i = D_i[T'_i] = D_i(π[E_i(X_i + V_i)]) = π[X_i + V_i].
- Each party i = 1, 3, ..., r−1 sends T_i to Pr; Pr calculates T_1 + \sum_{i=3}^{r} T_i.
- P2 and Pr now need to find the closest cluster. Pr has all the components of the sum needed to find the minimum row of the distance matrix except T_2, so P2 and Pr engage in secure addition/comparison to find the closest cluster.
To find the closest cluster we need the row-wise minimum of the sum of two vectors: [x_{11}+x_{13}+...+x_{1r}, x_{21}+x_{23}+...+x_{2r}, ..., x_{k1}+x_{k3}+...+x_{kr}]^T, known only to Pr, plus [x_{12}, x_{22}, ..., x_{k2}]^T, known only to P2. This secure addition/comparison is done using a combinatorial circuit.
Generic Privacy Preserving k-means Algorithm
1. Each party initializes its own means corresponding to each cluster.
2. Securely find the closest cluster for each data point across all the parties, using the permutation algorithm [21]:
   - P1 generates random vectors V_i for i = 1..r such that \sum_{i=1}^{r} V_i = 0, and a permutation π of (1..k).
   - Every party i = 2..r generates (E_i, D_i), computes E_i(X_i), and sends it and the encryption function E_i to P1.
   - P1 now has the encrypted distance vectors and encryption functions of all other parties; for i = 2..r it computes E_i(X_i) * E_i(V_i) = E_i(X_i + V_i).
   - Using its permutation π, P1 computes T'_i = π[E_i(X_i + V_i)] and sends T'_i to party i, for i = 2..r.
   - Each party i = 2..r decrypts: T_i = D_i[T'_i] = π[X_i + V_i].
   - Each party i = 1, 3, ..., r−1 sends T_i to Pr; Pr calculates T_1 + \sum_{i=3}^{r} T_i.
   - P2 and Pr now find the closest cluster: Pr holds all the components of the sum except T_2, so P2 and Pr engage in secure addition/comparison (combinatorial circuits) over the vector [x_{11}+x_{13}+...+x_{1r}, ..., x_{k1}+x_{k3}+...+x_{kr}]^T, known only to Pr, plus [x_{12}, ..., x_{k2}]^T, known only to P2.
3. Calculate the new means for each cluster at each party (secure sum and compare – combinatorial circuits).
4. Iterate this loop until there is no difference between the old and new means, or the difference is negligible (check against a threshold).
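The closest-cluster step can be simulated in a single process to see why the masking and permutation reveal nothing but the answer. This is a toy sketch in the clear (names mine): the homomorphic encryption between the parties is omitted, but the random vectors summing to zero cancel in the total, and the permutation hides which cluster index the minimum corresponds to until P1 maps it back.

```python
import random

def closest_cluster(distance_vectors, k):
    """Permutation-step sketch: mask each party's k-component distance vector
    with random shares summing to zero, permute the rows, sum, take argmin."""
    r = len(distance_vectors)
    # P1 generates random vectors V_1..V_r with V_1 + ... + V_r = 0
    V = [[random.uniform(-100, 100) for _ in range(k)] for _ in range(r - 1)]
    V.append([-sum(col) for col in zip(*V)])
    perm = list(range(k))
    random.shuffle(perm)                     # P1's secret permutation pi
    masked = [[x[j] + v[j] for j in range(k)]
              for x, v in zip(distance_vectors, V)]
    permuted = [[row[j] for j in perm] for row in masked]
    totals = [sum(col) for col in zip(*permuted)]   # masks cancel in the sum
    best = min(range(k), key=lambda j: totals[j])
    return perm[best]                        # map back through the permutation

# 3 parties, 2 clusters: summed distances are [9, 4], so cluster 1 is closest
X = [[3.0, 1.0], [2.0, 1.0], [4.0, 2.0]]
print(closest_cluster(X, k=2))   # → 1
```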
Problem with the generic algorithm
How about the efficiency and accuracy of the algorithm in the presence of different scales, variability, correlations and outliers? The variable with the largest scale dominates. Example:
Clustering on Age and Salary: Salary becomes the dominant attribute. Irrespective of the values of Age, cluster 1 will have the records with Id 1, 5 and 6, and cluster 2 the records with Id 2, 3 and 4.

Id | Age | Salary
1  | 23  | $55,000
2  | 33  | $67,000
3  | 24  | $66,000
4  | 56  | $67,000
5  | 34  | $53,000
6  | 39  | $52,000
Possible Solutions [24]
Normalization - Normalizing the data is important to ensure that the distance measure gives equal weight to all the variables
Standardization Transformation – Subtracting mean from each attribute and then dividing by its standard deviation
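The standardization transformation is a one-liner per attribute. A minimal sketch (function name mine), applied to the Age/Salary example: after standardizing, both attributes have mean 0 and standard deviation 1, so neither dominates the distance measure.

```python
def standardize(column):
    """Standardization transformation: subtract the attribute's mean,
    then divide by its (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

ages     = [23, 33, 24, 56, 34, 39]
salaries = [55000, 67000, 66000, 67000, 53000, 52000]
# after standardization both attributes are on a comparable scale
print([round(z, 2) for z in standardize(ages)])
print([round(z, 2) for z in standardize(salaries)])
```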
Weaknesses: a single outlier can cause all the other values to fall into a small range, and standardization does not take correlations between attributes into account.
Statistical Estimates
References
Privacy Preserving Data Mining – Tutorial – Chris Clifton
Privacy Preserving Data Mining: Challenges and Opportunities – Ramakrishna Srikant
Privacy Preserving Cooperative Statistical Analysis – Wenliang Du, Mikhail J. Atallah
Defining Privacy for Data Mining – Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya
Data Mining: Concepts and Techniques – Jiawei Han, Micheline Kamber