Dissemination of Anonymized Streaming Data

Yongluan Zhou (University of Southern Denmark) [email protected]
Lidan Shou (Zhejiang University) [email protected]
Xuan Shang (Zhejiang University) [email protected]
Ke Chen (Zhejiang University) [email protected]

ABSTRACT

With the vision of emerging streaming data marketplaces, we study the problem of how to use a scalable dissemination infrastructure, composed of a number of brokers, to disseminate anonymized streaming data to a large number of clients. To satisfy the clients, who are trusted at different anonymity levels and have their own urgencies in requiring the data, we propose to deeply integrate the anonymization process into the dissemination infrastructure. More specifically, we extend existing anonymization algorithms to derive anonymized data from other anonymized data with different privacy constraints, rather than only from the original microdata, using a technique that we call version derivation. With this flexibility, anonymized data can be generated as needed, according to the available bandwidth, on the way from the data source to the end clients. Exploiting these new opportunities, we formulate the problem of dissemination planning, which aims at minimizing the information loss of the disseminated data. Furthermore, we design two dissemination plan optimization strategies to solve the problem. An experimental study using both synthetic and real datasets verifies the effectiveness of our approach.

Categories and Subject Descriptors H.3.5 [Online Information Services]: Data sharing; H.3.4 [Systems and Software]: Distributed systems

Keywords Data Dissemination, Data Privacy, Data Anonymization, Data Streams

1. INTRODUCTION

The popularity of the Internet has enabled a wide spectrum of services characterized by the need to disseminate streaming data to a large number of connected terminals. Applications such as real-time traffic monitoring and online transaction analysis all require the dissemination of streaming data. In these applications, data in the form of arbitrary records are typically disseminated from a centralized data source to numerous clients (also known as terminals), which subscribe to the dissemination service, via a tree-structured network of servers called brokers [31, 21, 22], as depicted in Figure 1. While previous studies have focused on the efficiency of data dissemination services, little has been done on the privacy issues of streaming data dissemination. Some previous works have studied secure transmission for data dissemination [23, 27]. However, no study has been reported that addresses the problem of partial information hiding for streaming data.

Consider a data-as-a-service scenario, where real-time online user clickstreams and shopping transaction records are offered to a large number of subscribers. The subscribers could be of different natures and include, for example, monitors, analysts, and researchers from various companies, organizations, and institutes. The monitors have to detect frauds and respond quickly to contain the misbehavior of suspicious customers/store owners; the analysts may analyze the sales information for a certain portfolio; whereas the researchers may need to conduct a study on the fluctuation in the sales of multiple products. These users require different levels of detail in the data being disseminated. Meanwhile, they differ in how urgently they require the data to be delivered. Moreover, the original clickstreams or transaction records typically contain a wealth of private information, such as the names, addresses, cellphone numbers, and credit card numbers of customers, and the exact products purchased or URLs clicked by the customers. Inadvertently disseminating such private information to everyone might cause severe safety problems for individuals. On the other hand, subscribers could have different trustworthiness, which depends on various factors, including, for example, whether they have gone through data-privacy certifications (e.g., the IAPP certification, https://www.privacyassociation.org/certification/), whether they are government agencies, private companies, or private persons, and whether they have signed privacy-preserving agreements and are able to bear the legal liability of privacy preservation. All in all, it is desirable to distinguish different subscribers by their trustworthiness as well as their data requirements in terms of level of detail and transmission latency, and to provide them with data at different privacy-preserving levels.

We address privacy-preserving stream dissemination by partial information hiding. Suppose we are disseminating a stream of data records from one source (s0) to a number of clients (c1, ..., cn) via a network of trusted brokers (s1, ..., sm), which are server nodes connected in a tree topology. Our dissemination model takes into consideration the following requirements/constraints:

(1) Privacy Requirement. Each client connected to the network is assigned a credit that indicates the client's trustworthiness and can be calculated based on multiple factors such as the aforementioned ones.

[Figure 1: A dissemination tree, where s0 is the data source, s1-s5 are the brokers, and c1-c7 are the clients.]

Typically, data in its original form is considered to contain an abundance of private information. A client with a high credit is entitled to data with high fidelity (i.e., data similar to its original form), while a low-credited client is likely to receive data with poor fidelity (which is deliberately distorted to resist privacy attacks). The provider can predefine a number of privacy-preserving levels and map the clients' credits to the appropriate levels. Privacy preservation is fulfilled by imposing the k-anonymity model [18] on the data being disseminated. With k-anonymization techniques applied on the brokers, the data can be converted to k-anonymity versions. A k-anonymity version of the data guarantees that each record cannot be differentiated from at least k−1 other records in the same anonymity version. Therefore, each client can be credited with an appropriate k value. By specifying different ki (i = 1, ..., n) values (and thus different ki-anonymity versions) for the clients, it is possible to constrain the maximum fidelity of data that each client receives.

(2) Tolerance on Delay. Each client also specifies a tolerance on the communication delay of its received data. For example, a client performing real-time fraud detection may require each transaction record to be delivered within one second, while another client aggregating the data for periodical analysis may be satisfied by a delay of minutes. In the ideal case, the dissemination infrastructure has sufficient network resources to disseminate all ki-anonymity versions of a record to the respective clients within the specified tolerances. Unfortunately, the real communication channels in the tree have limited bandwidth. Thus, some of the versions may become overdue when they reach their target clients. To satisfy the tolerance requirements, we may need to deliver over-anonymized versions to the clients (a version with ki′ > ki is said to be over-anonymized with regard to ki), as they tend to incur lower transmission delay while always satisfying the privacy requirement. However, over-anonymization also compromises the data fidelity.

As the anonymization operations are performed on the brokers, the output of each k-anonymization operation is sent downstream via a network link to either a directly connected client or a child broker which is responsible for relaying the data in the dissemination tree. The child broker may also perform further anonymization on the k-anonymity version that it receives, to generate a more deeply anonymized version (i.e., a k*-anonymity version where k* > k). This technique is called version derivation. Compared to performing k*-anonymization on the original data (which is also called the microdata), version derivation significantly reduces the network transmission cost along the relaying path, as an intermediate k version is typically much smaller than the original data in terms of storage space and communication cost. Once the anonymization operations on a broker are decided, the anonymity version that its attached clients receive is also determined.

The process of determining the anonymization operations on the brokers is called dissemination planning. Dissemination planning is guided by a metric called information loss, which measures the total loss in data fidelity over all clients. The objective of our optimization is to minimize the information loss while respecting the constraints mentioned above. We propose two strategies for plan optimization, namely the incremental and local adjustment strategies. The former incrementally populates the dissemination network with clients, making compromises to the fidelity of the data delivered downstream if the delay cannot be tolerated by the clients. The latter performs a global client assignment to the brokers, and then makes local changes to the plan in a process similar to simulated annealing.

In fact, the dissemination planning problem proposed here can be varied by specifying additional constraints that are often observed in the real world:

• resource constraints, such as the link bandwidth, number of brokers, computational ability of brokers, number of communication channels, or an arbitrary combination of them;
• network structure, i.e., fixed, half-fixed, or variable tree structures;
• client requirements, such as the tolerance on communication delay or the data fidelity requirements.

Different settings of the above aspects may lead to very different problems and solutions. In this paper, we focus on a practical scenario where: (1) both brokers and clients are involved in the dissemination planning; (2) the brokers have a fixed topology while each client may freely choose the broker to connect to; (3) each broker, as well as the data source, has a limited number of downstream links and each link has limited bandwidth; (4) each client specifies its tolerance on the communication delay.

The contributions of this paper are summarized as follows:

• We formulate the problem of disseminating streaming data to a large number of clients, which are constrained by different privacy requirements and delay tolerances.
• We perform an in-depth analysis of the version derivation problem and develop two kinds of version derivation methods, the Mondrian derivation and the top-down greedy derivation. Both schemes can reduce the volume of the derived data without compromising the data fidelity.
• We study the dissemination plan optimization problem. We prove that the plan optimization problem is NP-hard, and then propose two effective plan optimization strategies, both relying on the version derivation technique.
• We conduct a thorough experimental study of the algorithms for version derivation and dissemination plan optimization. The results verify the effectiveness of our approach.

2. RELATED WORK

We first review the related work on data privacy protection techniques and their existing applications in stream dissemination. We also give a brief introduction to the area of streaming data dissemination.

K-anonymity. K-anonymity is a popular approach to partial information hiding. The conventional k-anonymity [18] and one of its extensions, l-diversity [19], have been thoroughly studied. The beauty of the k-anonymity model is that the anonymized attributes in its output, which is effectively given in compressed form, can perfectly match the need for lower communication cost in data dissemination. There do exist some works which apply k-anonymity to streaming data [2, 16, 29, 28]. In [2] and [16], the authors adopt a user-specified strict delay tolerance for each tuple, and the data is generalized based on some given hierarchies. The algorithms in [28] and [16] allow the records in the same anonymity group to be published at different times. This feature may compromise the privacy protection if the adversary can exploit the publication time of each record. [29] proposes integrating the delay time into the information loss measure to guide the anonymization process. None of these methods can be used to address our problem.

Perturbation. Perturbation is another class of approaches to privacy protection [15, 24, 3]. However, perturbation does not defend against linkage attacks. Besides, there is no trade-off between data volume and privacy level for perturbation. Recent years have seen growing research interest in differential privacy [5]. The potential of differential privacy for privacy-preserving stream dissemination is yet to be explored.

Streaming data dissemination. In [22, 21, 30, 31], the authors introduced the problem of maximizing coherency in the process of disseminating streaming data. The infrastructure employed in these works is similar to ours. There is also existing work [17, 20] on content-based dissemination dedicated to obtaining a good dissemination plan under limited resources. [17] and [20] tried to find the minimal common subset of the required data of a group of nodes to reduce the communication cost, which is not applicable in our work. [6] and [12] exploit the computational ability of each peer and employ a set of filters to minimize the network traffic. [32] studied how to efficiently deliver streaming query results to the end users. These works do not consider the privacy issue, and the version derivation in our work is more complicated than just filtering the useful data for each peer. [8, 33] introduced the problem of disseminating models instead of raw data. The data traffic was minimized by model sharing and redundant model elimination. Although anonymized data can also be considered a type of model built over the microdata, the method in [33] cannot be applied to our problem because their assumptions on the models do not hold for anonymization.

Secure dissemination. There are some studies working on "privacy-preserving network dissemination" [23, 9, 27]. However, their actual motive is to design a secure dissemination protocol based on a given network structure, which is significantly different from ours. [4] aims at developing a load-balanced dissemination tree that preserves the publisher's privacy. Its research problem also differs substantially from ours.

3. PROBLEM FORMULATION

In this section, we first describe the dissemination model, which relies on the well-known k-anonymity. Then we formulate the problem of constrained privacy-preserving data dissemination. The problem can easily be extended to adopt l-diversity, a more advanced variant of the k-anonymity model [19].

3.1 Dissemination Model

3.1.1 The Data Model

First, let us look at the form of the disseminated data, which is given as different anonymity versions. Following the conventional k-anonymity, each original data record r has three parts: (1) an identifier attribute id, (2) a set of quasi-identifier (QI) attributes, denoted as QI = {A1, A2, ..., A|QI|}, and (3) a set of sensitive attributes,

ID      Age  Zipcode  Price  Purchased Product
Paul    65   31027    2440   Platinol
Aron    22   31012    72     Panadol
Cathy   29   31008    300    Robitussin
Mary    58   31024    2450   Mitozytrex
Bob     46   31052    996    Insulin
Susan   32   31020    681    Excedrin
Emma    6    31016    13     Viagra
Kelly   18   31058    712    Sudafed
David   35   31030    638    Lotrimin
Taylor  40   31025    15     Aspirin
Jacob   38   31040    36     Gycerin
Olivia  50   31042    152    Benadryl

Table 1: Microdata with 1 identifier (ID), 3 QI attributes (Age, Zipcode, Price), and 1 sensitive attribute (Purchased Product).

Age      Zipcode          Price        Purchased Product
[6,40]   [31016,31040]    [13,36]      {Viagra, Aspirin, Gycerin}
[22,50]  [31008,31042]    [72,300]     {Panadol, Benadryl, Robitussin}
[18,35]  [31020,31058]    [638,712]    {Lotrimin, Excedrin, Sudafed}
[46,65]  [31024,31052]    [996,2450]   {Insulin, Platinol, Mitozytrex}

Table 2: The 3-anonymity version of Table 1.

denoted as AS. After anonymization, a record is transformed into two parts: (1) the transformed QI attributes, denoted as QIt, and (2) the original sensitive attributes. The QI attributes are transformed to ensure that each record in the published data set shares the same QI attributes with at least k−1 other records. The QI transformation may take the following forms, depending on the original data type:

1. each QI attribute value At is generalized into a value range Rt = [rt−, rt+];
2. each QI attribute value At is generalized into a value set, e.g., {a, d} for an attribute whose domain is {a, b, c, d};
3. some QI attribute values are replaced by "*", in a process called suppression.

Notably, all records with the same QIt comprise an anonymity group. To save transmission cost, we deliver only one copy of QIt for each anonymity group. So the disseminated form of each anonymity group consists of (1) the QIt of the group and (2) the sequence of sensitive attribute values of all records in the group. For example, Table 1 shows the microdata of a sample dataset of online purchase transactions, and Table 2 shows a 3-anonymity version generated from it.
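To make the data model concrete, the sketch below shows one possible in-memory representation of micro records and disseminated anonymity groups. It is our illustration rather than code from the paper, and all class and field names are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MicroRecord:
    # One original record: identifier, QI values, sensitive values.
    rid: str                      # identifier attribute (never disseminated)
    qi: Tuple[float, ...]         # quasi-identifier attributes, e.g. (age, zipcode, price)
    sensitive: Tuple[str, ...]    # sensitive attributes, e.g. (product,)

@dataclass
class AnonymityGroup:
    # Disseminated form of one anonymity group: a single transformed QI
    # envelope shared by all members, plus every member's sensitive values.
    envelope: Tuple[Tuple[float, float], ...]  # one [lo, hi] range per QI attribute
    sensitive: List[Tuple[str, ...]]           # sensitive values of the k or more members

# A k-anonymity version is simply the list of its groups.
AnonymityVersion = List[AnonymityGroup]

Under this representation, the first group of Table 2 becomes AnonymityGroup(envelope=((6, 40), (31016, 31040), (13, 36)), sensitive=[("Viagra",), ("Aspirin",), ("Gycerin",)]); the envelope is transmitted once for all three members, which is exactly why an anonymity version is smaller than the microdata.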

3.1.2 The System Model

The system model is built on the scenario that the data source periodically collects and disseminates a set of private records. This scenario is observed in many practical applications where data records are generated so rapidly that a large batch of records can be collected within a very short time. For example, millions of deals can be made on an online shopping website every second. In such applications, the data source is often too busy to perform real-time push dissemination. Thus, the dissemination has to be performed periodically. The other reason supporting periodical dissemination is that the raw data may sometimes contain ill-formatted or duplicate records and noise. The data source may thus need to perform data cleaning and consolidation, which are conducted on a collection of records before disseminating them.

Due to the above reasons, we assume that the data source s0 periodically collects a dataset D = {r1, r2, ..., r|D|} containing |D| micro records, which will be anonymized and disseminated to the clients. For clarity, we also assume that the data source only starts to prepare the next dataset after it finishes the dissemination of the current one.

One practical assumption of our work is that, before the system starts working, the connections from the data source down to all brokers are pre-determined and fixed, as is usually the case in a real-world system. Nevertheless, the clients are given the freedom to choose which broker to connect to upon system initialization. To disseminate the collected dataset D, the data source s0 employs a set of brokers S = {s1, s2, ..., s|S|} and organizes them into a dissemination tree in which s0 is the root, as shown in Figure 1. We shall regard s0 as one of the brokers because s0 has the same dissemination functionality as the brokers. Meanwhile, a large set of clients, denoted by C = {c1, c2, ..., c|C|}, subscribe to the disseminated data by establishing links to the brokers. Each client is assigned to exactly one broker for receiving the data.

The system runs in two phases, namely dissemination planning and run-time dissemination. The first phase determines how the clients are assigned to brokers and how anonymity versions are generated and routed to all clients, in a process guided by an optimization target. The second phase performs the data dissemination following the plan generated in the first phase. Since we assume that the privacy requirement and delay tolerance of each client do not vary, and that no brokers or clients leave or join the network during the run-time phase, the second phase is quite trivial once the dissemination plan is generated. Therefore, our main contributions focus on the first phase.

3.1.3 Brokers and Clients

In this section, we describe the model and some notions defined for brokers and clients. Each broker has the following functionalities:

(1) Direct Dissemination. A broker is responsible for disseminating the data, at an appropriate anonymity version, to its directly-linked clients within their specified delay tolerances. Considering the multiple downstream links of a broker sj, we assume that these links can transmit data in parallel to a number of child nodes (possibly including brokers, clients, or both). Since the connections among brokers are pre-determined, we are only concerned with the number of communication channels left for clients, which is denoted by ncj for broker sj. To ensure that all clients can be served, we must have Σ_{j=0}^{|S|} ncj ≥ |C|, where C is the set of clients.

(2) Relaying. If there are clients indirectly connected in its subtree, the broker also needs to relay, depending on the dissemination plan, the necessary data version (either anonymized data or microdata) to its child brokers. As we can see, relaying multiple versions of the same dataset causes extra communication delay, thus making it harder to satisfy the client tolerances. Therefore, we make a simple assumption that each broker receives only ONE anonymity version of the dataset D, to avoid unnecessary data relaying. This version is decided by the dissemination plan depending on the privacy requirements and tolerances of the clients involved. Given a broker sj, the actual anonymity version that it receives is denoted by V(kjs), where kjs stands for the k value of the respective anonymity version.

(3) Anonymization Computation. The brokers in the network also perform the anonymization computation (and version derivation when applicable) in a distributed way. Specifically, the data requirement of each node (broker sj or client ci) is handled by its parent node in the network, which is denoted by p(sj) or p(ci). This design aims at better scalability and full exploitation of the computing power of the brokers.

Each client ci is given its privacy credit as a ki value, which specifies the finest anonymity version (denoted by V(ki)) that it is entitled to. Meanwhile, ci is allowed to specify a tolerance δi as the maximum allowed communication delay of the requested data. In order to satisfy the tolerance, the network may need to deliver an over-anonymized version V(ki′) (where ki′ > ki) to ci. The reason will be shown in Section 3.2.
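As a small illustration of the broker model, the following sketch (ours; the Broker class and its fields are hypothetical) encodes the fixed tree and the admission condition Σ_{j=0}^{|S|} ncj ≥ |C|.

from typing import List

class Broker:
    # A broker in the fixed dissemination tree (s0 is modeled as a broker too).
    def __init__(self, name: str, nc: int):
        self.name = name                         # e.g. "s3"
        self.nc = nc                             # channels left for directly-attached clients
        self.child_brokers: List["Broker"] = []  # fixed downstream brokers
        self.clients: List[str] = []             # clients assigned by the plan

    def has_free_channel(self) -> bool:
        return len(self.clients) < self.nc

def can_serve_all_clients(brokers: List[Broker], num_clients: int) -> bool:
    # Admission condition from Section 3.1.3: the client channels of all
    # brokers together must cover the whole client set C.
    return sum(b.nc for b in brokers) >= num_clients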

3.2 Dissemination Planning

The dissemination plan determines how clients are assigned to brokers and how anonymity versions are generated and routed to all clients towards our optimization target, under the resource constraints and client requirements. In the following text, we first formulate the constraints and the optimization target, and then formulate the dissemination plan optimization problem.

As mentioned previously, each client ci specifies a tolerance δi on the total communication delay of a version (from the data source to itself). The total communication delay of ci, denoted by d(ci), refers to the time from the moment that D is ready at s0 until its actual version V(ki′) arrives in full at ci. It is calculated by adding up the communication delays on each link from s0 to ci. Likewise, we can define the communication delay d(sj) of a broker. Computing d(ci) and d(sj) requires the definition of the communication delay of an anonymity version V(ki) on each link. Denoted by ∆ki, the communication delay on a link consists of three components, as proposed in [1]:

1. The processing delay, which is the time that a node spends processing a packet (the packet is V(ki) in our model), including error checking, reading the packet header, and looking up the link to the next node. As the processing delay is usually negligible compared to the other two terms, we simply ignore it in the remainder of the paper.
2. The propagation delay, which is the time between the moment the last bit is transmitted at the source and the moment it is received at the destination. The propagation delay depends on the physical distance and the propagation speed. For simplicity, we model it as a constant ρ.
3. The transmission delay, which is the time required to put an entire packet into the communication channel. The transmission delay equals |V(ki)|/bw, where bw is the bandwidth of the communication channel and |V(ki)| represents the data volume of V(ki).

Formally, we have the following definition:

DEFINITION 1. (Communication Delay) The communication delay of V(ki) on a link, denoted by ∆ki, is given by

    ∆ki = ρ + |V(ki)| / bw.

One thing to note is that no queueing delay is taken into account in our model, as the data transmissions from a broker to its child nodes are performed in parallel. We now define the data volume of anonymity versions, which is required in Definition 1. Suppose the total sizes of the QI attributes and the sensitive attributes of a micro record are denoted by size(QI) and size(AS), respectively. For the microdata of |D| records, the data volume without ID is given by (size(QI) + size(AS)) * |D|. An anonymity version, however, contains the transformed

Age      Zipcode          Price        Purchased Product
[6,50]   [31008,31042]    [13,300]     {Viagra, Aspirin, Gycerin, Panadol, Benadryl, Robitussin}
[18,65]  [31020,31058]    [638,2450]   {Lotrimin, Excedrin, Sudafed, Insulin, Mitozytrex, Platinol}

Table 3: The 6-anonymity version obtained by Mondrian derivation from the 3-anonymity version (Table 2). The table is the SAME as the 6-anonymity version generated from the microdata.

QI attributes QIt of each anonymity group and the sensitive attributes of each record. Note that different forms of QIt lead to different volumes, as mentioned in Section 3.1, and the volume of QIt for different anonymity groups may not be the same if QI attributes are generalized into value sets. Thus we have:

DEFINITION 2. (Data Volume) If V(ki) (ki > 1) has g(ki) anonymity groups and the volume of QIt for the q-th group is size(QIt)q, the data volume of V(ki) is given by

    |V(ki)| = Σ_{q=1}^{g(ki)} size(QIt)q + size(AS) * |D|,

where g(ki) is upper-bounded by ⌊|D|/ki⌋.

Whatever form QIt takes, |V(ki)| is no greater than the volume of the microdata. In the following discussions, we focus on the case where QI attributes are generalized into value ranges, because it is the most commonly used form in existing work. In this case, size(QIt) of each anonymity group is 2 * size(QI), as each generalized range Rt = [rt−, rt+] contains two values. Thus we obtain |V(ki)| = size(QI) * 2 * g(ki) + size(AS) * |D|. |V(ki)| is monotonically decreasing in ki, as the number of anonymity groups usually decreases when ki becomes larger. Using the running example in Table 2, the 3-anonymity version has a volume of size(QI) * 2 * g(3) + size(AS) * 12 = sizeof(number) * 24 + size(AS) * 12. In contrast, Table 3 shows a 6-anonymity version, with a volume of sizeof(number) * 12 + size(AS) * 12, derived from the 3-anonymity version in Table 2. If a broker employs version derivation on the 3-anonymity version instead of generating the 6-anonymity version from the microdata, it saves incoming data transmission cost of |V(1)| − |V(3)| = 12 * sizeof(number).

Our optimization target is to maximize the overall data fidelity of the anonymity versions received by all clients. As information loss is a popular metric for measuring data fidelity [29, 26, 11], our optimization target is modeled as the total information loss of all clients. We use the NCP measure [26], a popular metric for k-anonymity, to measure the information loss of each record r after anonymization.

DEFINITION 3. (Information Loss) Assume the domain size of QI attribute At is ‖At‖ and the generalized range of At for r is Rt = [rt−, rt+]. Then IL(r) can be calculated as:

    IL(r) = Σ_{t=1}^{|QI|} (rt+ − rt−) / ‖At‖.

The information loss on a client receiving an anonymity version V(ki′) is defined as the total information loss of all records in V(ki′), i.e., IL(V(ki′)) = Σ_{r ∈ V(ki′)} IL(r). The total information loss over all clients is given by aggregating the clients' individual information losses:

    IL(C) = Σ_{i=1}^{|C|} IL(V(ki′)).
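The three definitions above can be computed directly, as the following sketch illustrates for the range-generalization form of QIt; the function names are ours, not the paper's.

def version_volume(num_groups: int, size_qi: int, size_sens: int, num_records: int) -> int:
    # Definition 2 under range generalization: every group ships one
    # [lo, hi] pair per QI attribute (2 * size_qi in total), while the
    # sensitive values are shipped for all |D| records.
    return 2 * size_qi * num_groups + size_sens * num_records

def link_delay(volume: float, bw: float, rho: float = 0.0) -> float:
    # Definition 1: propagation delay plus transmission delay
    # (the processing delay is ignored, as in the paper).
    return rho + volume / bw

def record_ncp(ranges, domains) -> float:
    # Definition 3: NCP of one anonymized record, i.e. the sum of
    # normalized generalized-range widths over the QI attributes.
    return sum((hi - lo) / dom for (lo, hi), dom in zip(ranges, domains))

For the running example, version_volume(4, 3 * SIZEOF_NUMBER, SIZE_AS, 12) with g(3) = 4 groups reproduces the sizeof(number) * 24 + size(AS) * 12 volume of Table 2 computed above (SIZEOF_NUMBER and SIZE_AS standing for whatever per-value sizes the deployment uses).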

Similarly, we can define IL(sj) as the information loss on broker sj when receiving V(kjs).

We now introduce the need for version derivation. Based on the above formulation, it can be seen that the microdata has the largest data volume among all possible anonymity versions. Naturally, it incurs the longest communication delay on a link. If the brokers could only utilize microdata to perform anonymization, we would have to deliver the microdata to all brokers, possibly leaving the delay tolerances of some clients unsatisfied. For example, in Figure 2, if δ2 < 2·∆1 for client c2, where ∆1 is the communication delay of the microdata on one link, the disseminated data definitely cannot reach c2 within δ2 time, because d(s3) equals 2·∆1, which already exceeds δ2. To address this issue, we propose a technique called anonymity version derivation. Version derivation on a broker allows an anonymity version V(ki) to be derived from another version V(kj) (that the broker receives) rather than from the microdata. Thus, the mandatory need to deliver the microdata to all brokers is avoided. As a result, the dissemination planning will find it much easier to satisfy the client tolerances. Meanwhile, the communication delays can in many cases be significantly reduced, making the plan more optimized in terms of overall data fidelity. Details of the version derivation technique will be presented in Section 4. In the example in Figure 2, with version derivation we are always able to disseminate an anonymity version to c2 as long as δ2 ≥ 3·∆|D| (where ∆|D| ≪ ∆1). The detailed proof will be provided in Section 5.

A dissemination plan P is defined as a set of (1) all client assignments b(c1), b(c2), ..., b(cn) and (2) the version derivation schemes on all brokers, VD(s0), VD(s1), ..., VD(sm). Given the flexibility of client assignments and version derivation on brokers, our optimization target is to decide a dissemination plan so that the fidelity of the data received by the clients is maximized, while respecting the resource and client constraints. Specifically, our dissemination plan optimization problem can be formally stated as:

PROBLEM 1. Given the following conditions:
1. the fixed topology of s0 and the brokers S, and a set of clients C to be assigned to brokers, and
2. the value of ncj for each broker sj (0 ≤ j ≤ |S|), and the bandwidth of each communication channel, and
3. the value of ki and δi for each client ci (1 ≤ i ≤ |C|),
construct a dissemination plan Popt so that:
1. IL(C) is minimized; and
2. the delay tolerance δi of each client ci is not violated.

LEMMA 1. Problem 1 is an NP-hard problem.

PROOF. (Sketch) Problem 1 consists of two subproblems: (1) assign the clients to brokers; (2) perform version derivation on each broker without violating the delay tolerances. If the client assignment problem has been solved, a simplified version of Problem 1 can be modeled by Figure 2. A precedence-constrained MMKP problem can be reduced to the problem in Figure 2 by seeing δi as a resource dimension and all possible anonymity versions received by each node as a class. Each class has multiple items because kjs or ki′ has multiple choices based on the dissemination plan. Since MMKP is NP-hard [10], Problem 1 is also NP-hard.

[Figure 2: Simplified version of Problem 1, where ∆(ci) (∆(sj)) equals ∆ki′ (∆kjs). The objective is min Σ_{i=1}^{7} IL(ci), subject to the per-client delay constraints: ∆(s1) + ∆(c1) ≤ δ1; ∆(s1) + ∆(s3) + ∆(c2) ≤ δ2; ∆(s1) + ∆(s3) + ∆(c3) ≤ δ3; ∆(s1) + ∆(s4) + ∆(c4) ≤ δ4; ∆(s2) + ∆(s5) + ∆(c5) ≤ δ5; ∆(s2) + ∆(s5) + ∆(c6) ≤ δ6; ∆(s2) + ∆(c7) ≤ δ7.]

Extensions to l-diversity. The problem formulation based on l-diversity uses different l values to formulate the different privacy requirements. That is, each client ci is associated with a particular li value. If an li-diversity version is denoted by V(li), then the above measures and the hardness proof for k-anonymity are also applicable to the l-diversity model.

Delay caused by data collection. In a streaming environment, the dataset to be disseminated may require a certain period of time to be collected. Such delay actually depends on the input data rate as well as the k value, and is unavoidable if we want to maintain the anonymity level and do not consider dropping any data. It is orthogonal to the optimization problem that we solve in this paper, which only optimizes the delay caused by network transmission. Furthermore, our work considers situations where the data stream has a high arrival rate, such as clickstreams and transaction records. So the collection delay should not be too high and is assumed to be acceptable to the subscribers.

4. VERSION DERIVATION

Anonymity version derivation, or version derivation for short, is an operation which obtains a ki-anonymity version of the data from a kj-anonymity version of the same data. This operation is denoted by V(kj) ▷ V(ki) (i ≠ j). In fact, anonymization from the microdata can also be considered version derivation if we regard the microdata as the case kj = 1. Since version derivation may incur extra information loss, the development of a version derivation algorithm can be seen as an optimization problem: given a source version V(kj), a target ki value, and an anonymization algorithm, produce a V(ki) such that IL(V(ki)) is minimized. Although there exist many studies [14, 26] on the implementation of anonymization from microdata, none of them addresses the version derivation issue. We propose two version derivation schemes in this section, namely Mondrian derivation and top-down greedy derivation. The former is designed to work with the Mondrian anonymization algorithm [14], and can be extended to other anonymization algorithms which rely on a pre-defined generalization hierarchy, such as hierarchy-based global recoding approaches [13, 11]; the latter is for local-recoding algorithms such as [26, 7]. Both derivation methods need to be discussed in three cases: (Case 1) kj = 1, in which case we simply apply the anonymization algorithm on the microdata; (Case 2) kj ≥ ki, in which case V(kj) is the most accurate version we could obtain for ki, so we directly use V(kj) as V(ki); (Case 3) kj < ki and kj ≠ 1. Case 2 is trivial. In the following subsections we focus on the implementation of Cases 1 and 3.
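Since both schemes share this three-case structure, the case analysis can be captured by a single dispatcher. The sketch below is our own; anonymize and derive_further stand for the algorithm-specific routines (Mondrian or top-down greedy) and are hypothetical parameters.

def derive(source, k_source, k_target, anonymize, derive_further):
    # Case analysis for V(k_source) > V(k_target), as in Section 4.
    if k_source == 1:
        # Case 1: the source is the microdata; run the ordinary algorithm.
        return anonymize(source, k_target)
    if k_source >= k_target:
        # Case 2: the source is already at least as coarse as required and is
        # the most accurate version obtainable, so reuse it directly.
        return source
    # Case 3: 1 < k_source < k_target; derive from the anonymized version.
    return derive_further(source, k_target)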

[Figure 3: The Mondrian hierarchy of a dataset containing 15 records, denoted by {1, 2, ..., 15}. The number near each partition is the label of that partition. Partition 1 (size 15, all records) is split into Partition 2 (size 8: 1,3,5,7,9,11,13,15) and Partition 3 (size 7: 2,4,6,8,10,12,14). Partition 2 is split into Partition 4 (size 4: 1,3,5,7) and Partition 5 (size 4: 9,11,13,15); Partition 3 is split into Partition 6 (size 4: 2,6,8,12) and Partition 13 (size 3: 4,10,14). The remaining leaves are Partition 7 (1,3), Partition 8 (5,7), Partition 9 (9,13), Partition 10 (11,15), Partition 11 (2,12), and Partition 12 (6,8).]

4.1 Mondrian Derivation

The Mondrian derivation relies on a binary partition hierarchy called the Mondrian hierarchy, as shown in Figure 3. The whole dataset is treated as the root of the Mondrian hierarchy, which is then partitioned iteratively.

Case 1. (kj = 1) In this case, we apply the original Mondrian method [14] to generate V(ki). Starting from the root, the dataset is partitioned iteratively. In each iteration, a partition P is split into two sub-partitions, which become the child partitions of P. The dimension of the anonymization envelope (i.e., the range of attribute values) that has the widest normalized range is chosen for partitioning. The median value on this dimension is selected as the split value, and P is divided into P1 and P2 such that |P1| = |P2| or |P1| = |P2| + 1. The process stops when all partitions have sizes smaller than 2·ki. Each partition in the hierarchy is a possible anonymity group in the final ki-anonymity version and is associated with an anonymization envelope generalized from the QI attribute values of the records belonging to it. A ki-anonymity version consists of all partitions in the Mondrian hierarchy satisfying the following two conditions: (i) the partition size is in the range [ki, 2·ki − 1]; and (ii) the size of its sibling partition is no less than ki. For example, in Figure 3, the 4-anonymity version contains Partitions 3, 4, and 5.

Case 3. (kj ≠ 1 and kj < ki) The derivation process gradually merges the partitions in V(kj) based on the Mondrian hierarchy to obtain the partitions in V(ki) (lines 7-15 in Algorithm 1). The procedure terminates once minP ≥ ki. The details of the procedure FindSibling are omitted to save space. For Mondrian derivation, the following lemma can be proved:

LEMMA 2. If kj ≤ ki, the version V(ki) derived from V(kj) by Mondrian derivation is the same as V^opt(ki), the version anonymized from the microdata. In other words, the extra information loss incurred by Mondrian derivation is 0.
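To illustrate conditions (i) and (ii), the following sketch extracts the groups of a ki-anonymity version from a Mondrian hierarchy. It is our own sketch, assuming a hypothetical Partition class in which every internal node has exactly two children, as Mondrian's median splits guarantee.

class Partition:
    # One node of the Mondrian hierarchy (Figure 3). Internal nodes always
    # have exactly two children, as produced by Mondrian's median splits.
    def __init__(self, members, left=None, right=None):
        self.members = members
        self.left, self.right = left, right

    def size(self):
        return len(self.members)

def extract_version(root, k):
    # Collect the partitions forming V(k): size within [k, 2k-1]
    # (condition i) and sibling size at least k (condition ii).
    groups = []

    def walk(node):
        for child, sibling in ((node.left, node.right), (node.right, node.left)):
            if child is None:
                continue
            if k <= child.size() <= 2 * k - 1 and sibling.size() >= k:
                groups.append(child)
            else:
                walk(child)

    if root.size() <= 2 * k - 1:
        return [root]   # the root itself is the only group
    walk(root)
    return groups

On the hierarchy of Figure 3, extract_version with k = 4 returns Partitions 3, 4, and 5, matching the example above.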

Algorithm 1: Mondrian Derivation
Data: V(kj), ki, temp_V
Result: V(ki)
1  begin
2    if kj = 1 then
3      Perform V(1) ▷ V(ki) by the original Mondrian approach and return V(ki);
4    else if kj ≥ ki then
5      return V(kj);
6    else
7      temp_V ← V(kj);
8      Get the minimal partition size minP in temp_V;
9      if minP ≥ ki then
10       return temp_V;
11     else
12       for each partition Pmin with |Pmin| = minP do
13         Psib ← FindSibling(Pmin);
14         Merge the sibling partition Psib with Pmin;
15       goto line 8;

4.2 Top-down Greedy Derivation

This derivation scheme can work with the top-down greedy search technique [26], which is in essence a clustering-based anonymization algorithm.

Algorithm 2: Top-Down Greedy Derivation
Data: V(kj), ki, temp_V
Result: V(ki)
1  begin
2    if kj = 1 then
3      Perform V(1) ▷ V(ki) by the original top-down greedy search approach and return V(ki);
4    else if kj ≥ ki then
5      return V(kj);
6    else
7      Get a generalized record set T from V(kj);
8      temp_V ← ∅;
9      Partitioning(T);
10     Adjust groups in temp_V so that each group has at least ki records and return temp_V;
11 Partitioning(Dataset P)
12 begin
13   if |P| > ki then
14     Partition P into two exclusive subsets P1 and P2;
15     if |P1| ≥ ki or |P2| ≥ ki then
16       Partitioning(P1);
17       Partitioning(P2);
18     else
19       Add P to temp_V;
20   else
21     Add P to temp_V;

Case 1. (kj = 1) In this case, we simply generate V(ki) from the microdata, utilizing the original top-down greedy search algorithm in [26]. The algorithm runs in two phases, both guided by the information loss metric:

• PARTITIONING: The dataset is partitioned iteratively. At each iteration, a dataset P is partitioned into two subsets P1 and P2 if and only if: 1) |P| > ki; and 2) either |P1| ≥ ki or |P2| ≥ ki. P is partitioned into P1 and P2 so that the sum of IL(P1) and IL(P2) is smaller than IL(P).
• POSTPROCESSING: After partitioning, if there exist groups smaller than ki, postprocessing is performed repeatedly until no such groups remain. These groups are merged into neighboring groups, or receive records from other groups, to honor the ki-anonymity requirement. At each step, the option that results in the smallest information loss is chosen.

Case 3. (kj ≠ 1 and kj < ki) The basic steps of the derivation are similar to [26]. However, our method differs from [26] in two aspects:

• The QI attributes of each record are represented as an anonymization envelope (value range) instead of a single data point;
• We must consider the inherent information loss of each input record in the derivation process.

Given the dataset, the derivation algorithm obtains a generalized record set T = {e1, e2, ..., ew}, where w is the dataset size, from V(kj) (line 7 of Algorithm 2). Note that each ei has its own information loss, denoted by IL(ei), which has to be considered in the derivation process. To perform line 14 in Algorithm 2, we need to find two seeds u and v in P such that d(u, v) = IL(u ∪ v) − IL(u) − IL(v) is maximized. To reduce the time complexity, we adopt the heuristic method proposed in [26] to find u and v. We randomly pick a generalized record u1 and then find the v1 that maximizes d(u1, v1). Then, we scan all generalized records again and find the u2 that maximizes d(u2, v1). The iteration goes on for a few rounds until d(u, v) does not increase substantially. Once u and v are found, two respective anonymity groups P1 and P2 are created. Then, we assign the other generalized records to P1 and P2 one by one in a random order. For each generalized record ei, the assignment depends on IL(P1 ∪ {ei}) − IL(P1) and IL(P2 ∪ {ei}) − IL(P2); ei is assigned to the group with the smaller increment of information loss. If either P1 or P2 has at least ki generalized records, the partitioning is done.

The top-down method is applied recursively to those groups having more than ki generalized records. We also have a postprocessing step for groups P with fewer than ki generalized records; the details are omitted due to space limits.

The top-down greedy scheme can be extended to work with other local-recoding anonymization algorithms. We need only focus on Case 3, i.e., kj < ki and kj ≠ 1. We first obtain a generalized record set from V(kj) and then perform the derivation by revising the original anonymization process to handle the generalized record set rather than the microdata. More specifically, the information loss calculation during derivation should take into account the inherent information loss of the generalized records. Furthermore, procedures such as sorting and transformation should be extended to generalized ranges of QI attributes.

Extensions to l-diversity. Version derivation based on l-diversity is defined as obtaining an li-diversity version from an lj-diversity version, i.e., V(lj) ▷ V(li) (i ≠ j), so that IL(V(li)) is minimized. The Mondrian derivation for l-diversity algorithms [19] is the same as that for k-anonymity algorithms, except that we check whether each derived version satisfies li-diversity instead of ki-anonymity. The top-down greedy derivation for l-diversity algorithms is also similar to that for k-anonymity algorithms. The following l-diversity algorithms can all be adopted in our derivation framework: for the TP algorithm in [25], we can treat V(lj) as a microdata table with "*" as an additional attribute value; for the 1-D l-diversity algorithm [7], we develop rules for the 1-D transformation and sorting of the generalized ranges.
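The seed-selection heuristic and the greedy assignment described above translate almost directly into code. The sketch below is ours; dist, il_merge, and il are hypothetical callbacks that must account for the inherent information loss IL(ei) of generalized records.

import random

def find_seeds(records, dist, rounds=3):
    # Alternating farthest-point heuristic from [26]: start from a random
    # generalized record and repeatedly re-center on the record maximizing
    # dist(u, v) = IL(u UNION v) - IL(u) - IL(v).
    u = random.choice(records)
    v = u
    for _ in range(rounds):
        v, u = u, max((e for e in records if e is not u), key=lambda e: dist(u, e))
    return u, v

def assign_records(records, u, v, il_merge, il):
    # Greedy assignment: every remaining generalized record joins the group
    # whose information loss grows the least when the record is added.
    p1, p2 = [u], [v]
    rest = [e for e in records if e is not u and e is not v]
    random.shuffle(rest)   # records are considered in a random order
    for e in rest:
        grow1 = il_merge(p1, e) - il(p1)
        grow2 = il_merge(p2, e) - il(p2)
        (p1 if grow1 <= grow2 else p2).append(e)
    return p1, p2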

5. PLAN OPTIMIZATION

Given the NP-hardness of Problem 1, we propose two heuristic-based plan optimization strategies in this section, namely Incremental and Local Adjustment. Both strategies can employ either of the two derivation schemes described in Section 4. We will present the strategies based on k-anonymity; their counterparts on l-diversity are the same except for substituting V(ki) with V(li). In the rest of this section we use the following notation: given an anonymization algorithm, we denote by V^opt(ki) the ki-anonymity version generated from the microdata. Its respective communication cost on a link is denoted by ∆opt(ki).

5.1 Incremental Strategy

Starting from the broker network, the incremental strategy performs plan optimization repeatedly, each time adding one new client to the dissemination tree. This is a commonly used approach in existing work on streaming data dissemination [31, 21]. For each newly added client, we find its optimal parent broker and compute the optimal dissemination plan in a greedy manner. Clients are added to the dissemination tree one by one in ascending order of a gap defined as δi − ∆opt(ki), where δi is the specified delay tolerance. The gap represents the extra time that client ci can tolerate for intermediate broker relaying: the smaller the gap, the less tolerant ci is. In other words, our greedy strategy handles the most easily expired clients first. The pseudocode of the incremental strategy is shown in Algorithm 3. To find the optimal parent broker and dissemination plan for the current client ci, the algorithm performs the following steps:

(Step 1) Find a candidate parent broker set CP for ci, to reduce the search space for the optimal parent broker (Line 4). CP is selected from the brokers which satisfy: (1) sj has available communication channels, i.e., ncj is greater than the number of clients already connected to sj; and (2) sj has child clients, or sj has no child clients but p(sj) has child clients. The second condition excludes the brokers which have neither child clients nor sibling clients, because they cannot be a better p(ci) than their parents.

(Step 2) Traverse all brokers sj in CP, pretend that ci is connected to sj, and calculate the optimal IL(ci) respecting the delay tolerance δi (Lines 5-6). We choose the broker that achieves the minimal IL(ci) as the true parent broker p(ci) of ci (Line 7).

(Step 3) If no parent broker satisfying the δi constraint can be found in Step 2, we (1) choose the broker sj in CP with the minimum d(sj) to be p(ci), and (2) make compromises on broker p(ci) and all its ancestor nodes in the dissemination tree so that the δi constraint is satisfied (Lines 8-10). Compromising a broker sj means increasing kjs to reduce the transmission delay until δi is satisfied or kjs cannot be increased further. Note that the compromises on p(ci) and its ancestors are performed in a bottom-up manner, in order to minimize the influence on other nodes, because brokers closer to s0 influence more nodes in the tree. Finally, all the child nodes of the compromised brokers are updated (Line 11). For each compromised broker sj we perform the following operations: (1) for each child client cjq, we find the kjq that corresponds to the minimal IL(cjq) under δjq; (2) for each child broker sjq, we update kjq^s to the value that achieves the minimal total information loss of all its child nodes. The update is performed iteratively until every client in the subtree of sj has been traversed.

Notably, the implementation of Line 6 differs for the following two cases: (1) sj already has child clients. In this case, V(kjs) has been determined, and we only need to decide the value of ki′ (ki ≤ ki′ ≤ |D|) and perform V(kjs) ▷ V(ki′). (2) sj has no child clients yet, which means sj does not yet require any data from its parent. So if ci connects to sj, we need to decide V(kjs) as well as V(ki′), i.e., we should find the optimal combination of kjs and ki′ that achieves the minimal IL(ci) on the premise that δi is fully exploited but not violated.

Algorithm 3: Incremental Strategy
Data: data source s0, broker set S, client set C
1  begin
2    Sort C in ascending order of δi − ∆opt(ki);
3    for i = 0; i < |C|; i++ do
4      Find CP for ci;
5      for each broker sj in CP do
6        Find the optimal IL(ci) when connected to sj;
7      Choose the broker sj with minimal IL(ci) as p(ci);
8      if p(ci) has not been found then
9        Choose the sj with minimal d(sj) as p(ci);
10       Compromise sj and its ancestors one by one in a bottom-up manner until δi is satisfied;
11       Update all descendants of the compromised brokers;

The worst-case time complexity of this algorithm is O(|S| ∗ |C|), where |S| is the number of brokers and |C| is the number of clients.
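The control flow of Algorithm 3 can be summarized compactly as follows. This is our sketch, not the authors' implementation: the candidate-set, information-loss, and compromise routines are hypothetical callbacks, and broker/client objects are assumed to expose the fields used here.

def incremental_plan(clients, brokers, delta_opt, find_candidates, best_il, compromise):
    # Clients are inserted in ascending order of the gap delta_i - Delta_opt(k_i),
    # so the least tolerant clients are planned first (Algorithm 3, line 2).
    for c in sorted(clients, key=lambda c: c.delta - delta_opt(c.k)):
        candidates = find_candidates(brokers, c)           # Step 1: candidate set CP (assumed non-empty)
        scored = [(best_il(s, c), s) for s in candidates]  # Step 2: optimal IL(c) per broker
        feasible = [(il, s) for il, s in scored if il is not None]
        if feasible:
            _, parent = min(feasible, key=lambda t: t[0])
        else:
            # Step 3: no broker satisfies delta_i; pick the broker with the
            # smallest delay and coarsen versions bottom-up along its path.
            parent = min(candidates, key=lambda s: s.delay)
            compromise(parent, c)
        parent.clients.append(c)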

5.2 Local Adjustment Strategy

The local adjustment strategy first performs a global client assignment. Then it constructs the dissemination plan using the delay tolerance on each path from s0 to the clients. This strategy is also suitable for situations where the client assignment is constrained for some reason, e.g., by fixed links between clients and brokers. If the client assignment is not pre-determined, a global client assignment is performed in ascending order of the gap value δi − ∆opt(ki). For each client ci, we assign it to the first available broker, where an "available broker" means a broker having available communication channels, in a breadth-first order of the tree. The reasons for doing this are two-fold: (1) The gap value represents the extra time that ci can tolerate for intermediate relaying, and the strategy must ensure that clients with smaller gaps are assigned to brokers closer to s0. (2) In this way, clients with similar gaps are clustered together under the same parent broker. This is reasonable because clients under the same parent should have equal gaps in the ideal case. After that, the dissemination plan optimization is performed in a way similar to simulated annealing: we first construct an initial dissemination plan, and then gradually optimize it by local adjustment based on a cost metric.

Initialization. In this phase, we construct an initial dissemination plan by designating an anonymity version to each broker and client, while satisfying the delay tolerances. Since brokers closer to s0 influence more clients, they should be given more accurate versions (smaller information loss), thus allowing for further derivations. Based on this observation, a simple way to construct the initial plan is to set the information loss on the brokers proportional to their levels in the tree, where the level of each broker is defined as the number of hops from s0. Starting from the root, we designate the anonymity version of each broker in a top-down manner, in ascending order of tree level. For each broker sj, we choose the value of kjs that achieves the minimal IL(sj) on the premise that all clients in its subtree can at least receive the anonymity version with the worst fidelity, i.e., V(|D|), while meeting their delay tolerances. After all brokers have been processed, the version derivation for each client ci is performed by p(ci), constrained by its delay tolerance δi.

Adjustment. After the initial plan construction, we perform local adjustment to gradually optimize the dissemination plan. The adjustment is also performed in a top-down fashion.

[Figure 4: The adjustment of broker s1, which has one child client c1 and two child brokers s2 and s3, with subtrees Sub(s2) and Sub(s3).]

Algorithm 4: Local Adjustment Strategy
Data: data source s0, broker set S, client set C
1  begin
2    Sort C in ascending order of δi − ∆opt(ki);
     // client assignment
3    for i = 0; i < |C|; i++ do
4      Assign ci to the first available broker in the breadth-first order of the tree;
5    Sort S in the breadth-first order of the tree;
     // initialization
6    for j = 0; j < |S|; j++ do
7      Find kjs that achieves minimal IL(sj) under the delay tolerances;
8      Find kjq that achieves minimal IL(cjq) under δjq and perform V(kjs) ▷ V(kjq) for each child client cjq;
     // adjustment
9    for j = 0; j < |S|; j++ do
10     for each possible value of kjs do
11       Update the child nodes of sj;
12       Calculate the current Σ_{q=1}^{m} IL(sjq) · (1 + |Sub(sjq)|/|Sub(sj)|) + Σ_{q=1}^{n} IL(cjq);
13     Set the final kjs as the value with minimal Σ_{q=1}^{m} IL(sjq) · (1 + |Sub(sjq)|/|Sub(sj)|) + Σ_{q=1}^{n} IL(cjq);
14     Set the anonymity version received by each child node of sj based on kjs;

For each broker sj, we denote its subtree as Sub(sj), the number of clients in Sub(sj) as |Sub(sj)|, and the total information loss of all clients in Sub(sj) as IL(Sub(sj)). As the adjustment of sj influences all clients in Sub(sj), we use IL(Sub(sj)) as the cost metric, and we choose the value of kjs which produces the minimal IL(Sub(sj)) as the final kjs. Figure 4 exemplifies the adjustment of a broker. When we adjust s1 by changing the value of k1s, the optimal IL(c1) under each new k1s can easily be obtained. However, the optimal IL(·) of the clients in Sub(s2) and Sub(s3) cannot be decided immediately, as it also depends on the resource allocation within Sub(s2) and Sub(s3). Thus, when adjusting s1, we update its child brokers and use IL(sq) (q = 2, 3) to predict IL(Sub(sq)), based on the observation that a smaller IL(sq) indicates a larger optimization space and thus smaller IL(·) values for the clients in Sub(sq). Meanwhile, as IL(sq) influences the fraction |Sub(sq)|/|Sub(s1)| of the clients in Sub(s1), we use |Sub(sq)|/|Sub(s1)| as a weight factor for IL(Sub(sq)). In the figure, IL(s2) is more influential than IL(s3), as |Sub(s2)| = 4 and |Sub(s3)| = 3. Given a broker sj with child clients cj1, ..., cjn and child brokers sj1, ..., sjm, we use

    Σ_{q=1}^{m} IL(sjq) · (1 + |Sub(sjq)|/|Sub(sj)|) + Σ_{q=1}^{n} IL(cjq)

to approximate IL(Sub(sj)) for each possible kjs. Thus, the optimal kjs can easily be found. After all brokers are adjusted, the dissemination plan optimization is completed and we can obtain the final Σ_{i=1}^{|C|} IL(ci). The strength of this strategy is that we consider the influence of each adjustment step on all clients in the dissemination tree.

The pseudocode of the local adjustment strategy is shown in Algorithm 4. To perform the adjustment of sj in lines 10-14, we gradually increase the value of kjs to find the optimal kjs that achieves the minimal IL(Sub(sj)). For each possible value of kjs, all child nodes of sj are updated accordingly. For each child client cjq, we find the value of kjq that produces the minimal IL(cjq) under δjq, while for each child broker sjq, we update kjq^s to the value that achieves the minimal IL(sjq), ensuring that all child clients in its subtree can receive an anonymity version within their delay tolerances. This algorithm has the same worst-case time complexity as the previous one, i.e., O(|S| · |C|), where |S| is the number of brokers and |C| is the number of clients.

Before concluding this section, one important issue has to be mentioned. To guarantee a feasible solution for both of the above

strategies, we require δi ≥ maxhop × ∆|D| for any client ci, where maxhop is the height of the dissemination tree. This requirement is easy to satisfy in practice, because δi should be no less than ∆opt(ki) (in plain words, every client must have enough patience for the data to be transmitted over one hop), and ∆opt(ki) is usually greater than maxhop × ∆|D| when the propagation delay ρ is negligible.
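The weighted cost used in lines 12-13 of Algorithm 4 is straightforward to compute once subtree client counts are known. The sketch below is our illustration, assuming hypothetical broker objects that expose their attached clients (each with a current information loss il), their child brokers, and their own il value.

def subtree_client_count(broker) -> int:
    # |Sub(s_j)|: clients attached to s_j plus those in all descendant subtrees.
    return len(broker.clients) + sum(subtree_client_count(b) for b in broker.child_brokers)

def adjustment_cost(broker) -> float:
    # Approximation of IL(Sub(s_j)) from Section 5.2: child clients contribute
    # their information loss directly, while child brokers are weighted by
    # 1 + |Sub(s_jq)| / |Sub(s_j)| so that larger subtrees matter more.
    total = max(1, subtree_client_count(broker))   # guard against an empty subtree
    cost = sum(c.il for c in broker.clients)
    for child in broker.child_brokers:
        cost += child.il * (1.0 + subtree_client_count(child) / total)
    return cost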

6. EXPERIMENTAL EVALUATION

In this section, we report the performance evaluation of the proposed version derivation and dissemination plan optimization techniques, and study the impact of various parameters on the dissemination framework. Since all the results from the l-diversity algorithms exhibit trends similar to those of their k-anonymity counterparts, we only report results on k-anonymity to save space.

6.1 Experiment Settings

Metrics. The fidelity of the disseminated data is reported as the "IL-Ratio", which is calculated by

    IL-Ratio = (IL(C) − Σ_{i=1}^{|C|} IL(Vopt(ki))) / Σ_{i=1}^{|C|} IL(Vopt(ki)),

where Σ_{i=1}^{|C|} IL(Vopt(ki)) is the data fidelity of the dissemination tree in the ideal case: the network bandwidth is infinite, so each client ci simply receives Vopt(ki) within its specified tolerance, with no need for version derivation at the brokers. For simplicity, we assume the propagation delay ρ = 0 without loss of generality.

Datasets. We use a real dataset and a synthetic one:

• The Adult dataset, referred to as adl, is widely used as a benchmark for k-anonymization (it can be downloaded from the UCI machine learning repository). It contains 30162 tuples.

[Figure 5: Information loss on one broker (|D| = 2000); y-axis: information loss, x-axis: ki; panels (a) kj = 2 and (b) kj = 5; series: derived/ideal on syn and adl.]

[Figure 6: Communication cost on one broker (|D| = 2000); y-axis: data volume (KB), x-axis: ki; panels (a) kj = 2 and (b) kj = 5; series: derived/ideal on syn and adl.]

[Figure 7: Effect of |D| (|S| = 100, |C| = 200, ncj ∈ [2, 4], δi/∆opt(ki) ∈ [2, 8], uniform distribution); y-axis: IL-Ratio, x-axis: |D| (×10³); panels (a) Mondrian and (b) Top-down; series: INC/LA on syn and adl.]

For ease of presentation, we randomly extract an equal number of tuples from both the adl and syn datasets to construct a collected dataset D. Its cardinality |D| varies from 2000 to 10000. Whenever we refer to D, the experiment is conducted on BOTH datasets.

Brokers and Clients The brokers (including s0) are synthesized by specifying that each broker has 3 child brokers and 2-4 channels available for clients. The number of brokers |S| varies from 100 to 300. The topology is generated by randomly assigning each broker to an existing tree node that has fewer than 3 child brokers. Without loss of generality, we assume all communication channels have the same bandwidth of 1 Mbps. Note that the bandwidth here is simply set in proportion to the data size, which does not affect the conclusions of our experiments. The ki values of the clients are generated under two distributions, namely the uniform distribution and the Gaussian distribution, over the range [2, 500]. δi is randomly selected from [2 · ∆opt(ki), 8 · ∆opt(ki)]. To calculate ∆opt(ki), the number of anonymity groups gopt(ki) is approximated by ⌊|D|/ki⌋. As mentioned in Section 5, we use maxhop · ∆|D| as the lower bound of δi. |C| varies from 200 to 600. To smooth out randomness, for each |C| we generate 10 client sets and report the average results.

In the experiments, we refer to the Incremental strategy as "INC" and the Local Adjustment strategy as "LA". We also denote the case of microdata anonymization as "ideal" (V(ki) derived directly from the microdata V(1)) and the case of version derivation as "derived" (V(ki) derived from V(kj)).
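For illustration, the topology generation and client sampling described above could be sketched as follows (a minimal sketch with hypothetical parameter names; ∆opt(ki) is approximated through gopt(ki) = ⌊|D|/ki⌋ as stated above, with a stand-in per-group delay unit):

import random

def build_topology(num_brokers, max_child_brokers=3):
    """Attach each new broker to a random existing node that still has
    fewer than `max_child_brokers` child brokers, yielding a tree rooted
    at s0 (broker id 0)."""
    children = {0: []}  # broker id -> list of child broker ids
    for b in range(1, num_brokers):
        candidates = [n for n, ch in children.items()
                      if len(ch) < max_child_brokers]
        parent = random.choice(candidates)
        children[parent].append(b)
        children[b] = []
    return children

def sample_client(d_size, unit_delay=1.0):
    """Draw one client's (k_i, delta_i): k_i uniform over [2, 500];
    delta_i uniform over [2, 8] times Delta_opt(k_i), where Delta_opt(k_i)
    is taken proportional to g_opt(k_i) = floor(|D| / k_i) anonymity groups
    (`unit_delay` is a hypothetical per-group transmission delay)."""
    k = random.randint(2, 500)
    delta_opt = (d_size // k) * unit_delay
    delta = random.uniform(2.0, 8.0) * delta_opt
    return k, delta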

6.2 Experiment Results

We first look at the information loss and the reduction in communication cost obtained by applying version derivation on a single broker. Figures 5 and 6 show, respectively, the information loss and the communication cost of applying top-down greedy version derivation (kj = 2 or 5) on a broker, in comparison with the "ideal" microdata anonymization. The derived version is denoted by ki. The communication cost is calculated by adding the data volumes of (1) the incoming kj version and (2) the outgoing derived ki version. For the "ideal" case, the cost accounts for (1) one copy of the incoming microdata and (2) its outgoing ideal ki version. It can be seen that the fidelity of the data version produced by version derivation is slightly worse than that obtained from microdata anonymization (the ideal case). However, version derivation considerably reduces the communication cost on each broker, as shown in Figure 6, which makes it more suitable for resource-constrained applications. In particular, the saving in communication cost is much more evident when kj is slightly larger (kj = 5). Note that these results are for one broker only; the aggregated reduction in a running network is much more significant. We do not perform the same experiment on Mondrian derivation, because Mondrian relies on the partition hierarchy, so the fidelity of the derived V(ki) is always as good as that of Vopt(ki).
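As a concrete reading of this cost accounting, consider the following sketch with toy volumes (not the measured numbers from Figure 6):

def broker_comm_cost(incoming_kb, outgoing_kb):
    """Per-broker communication cost as accounted above: the data volume of
    the incoming version plus that of the outgoing version."""
    return incoming_kb + outgoing_kb

# The derived case receives the compact V(k_j) instead of the raw microdata,
# so its incoming volume, and hence its total cost, is smaller.
derived_cost = broker_comm_cost(incoming_kb=60.0, outgoing_kb=48.0)   # V(k_j) in, derived V(k_i) out
ideal_cost = broker_comm_cost(incoming_kb=120.0, outgoing_kb=48.0)    # microdata in, ideal V(k_i) out
assert derived_cost < ideal_cost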

[Figure 8: Effect of |D| (|S| = 100, |C| = 200, ncj ∈ [2, 4], δi/∆opt(ki) ∈ [2, 8], Gaussian distribution); y-axis: IL-Ratio, x-axis: |D| (×10³); panels (a) Mondrian and (b) Top-down.]

[Figure 9: Effect of δi (|S| = 100, |C| = 200, ncj ∈ [2, 4], |D| = 2000, uniform distribution); y-axis: IL-Ratio, x-axis: δi/∆opt(ki); panels (a) Mondrian and (b) Top-down.]

[Figure 10: Effect of δi (|S| = 100, |C| = 200, ncj ∈ [2, 4], |D| = 2000, Gaussian distribution); y-axis: IL-Ratio, x-axis: δi/∆opt(ki); panels (a) Mondrian and (b) Top-down.]

[Figure 11: Effect of ncj (|S| = 100, |C| = 200, δi/∆opt(ki) ∈ [2, 8], |D| = 2000, uniform distribution); y-axis: IL-Ratio, x-axis: ncj; panels (a) Mondrian and (b) Top-down.]

[Figure 12: Effect of ncj (|S| = 100, |C| = 200, δi/∆opt(ki) ∈ [2, 8], |D| = 2000, Gaussian distribution); y-axis: IL-Ratio, x-axis: ncj; panels (a) Mondrian and (b) Top-down.]

[Figure 13: Effect of |S| (|C| = 2 · |S|, ncj ∈ [2, 4], δi/∆opt(ki) ∈ [2, 8], |D| = 2000, uniform distribution); y-axis: IL-Ratio, x-axis: |S|; panels (a) Mondrian and (b) Top-down.]

Figures 7 and 8 show the results of varying the dataset size |D| under the two ki distributions. It can be seen that, generally, the fidelity of Local Adjustment is better than that of Incremental for both datasets and both ki distributions. Although the overall trend of the IL-Ratio is not clear-cut, the results under the uniform distribution are less "choppy" than those under the Gaussian distribution when |D| varies. This is probably due to the more balanced information loss generated under the former. Another observation is that the results of Local Adjustment are smoother than those of Incremental across database sizes.

Figures 9 and 10 show the IL-Ratio when varying the client tolerance, given by δi/∆opt(ki) (1 ≤ i ≤ |C|). Generally, the IL-Ratio of all schemes decreases as the tolerance increases. This is expected, because higher tolerances allow more accurate versions to be produced or derived at the brokers; consequently, the information loss observed at the clients decreases. In all these experiments, Local Adjustment still outperforms Incremental.

Figures 11 and 12 show the IL-Ratio when varying ncj (0 ≤ j ≤ |S|) for all brokers. We can see that the IL-Ratio of all schemes decreases as ncj increases. This is because the average number of hops from s0 to each client shrinks when ncj increases, resulting in a "bushy" network, which helps to reduce the information loss. We also notice that the reduction is more evident for the Incremental strategy than for Local Adjustment, because a larger ncj provides more flexibility for client assignment in the Incremental strategy.

Figures 13 and 14 illustrate that the IL-Ratio increases with |S|. This can be interpreted as follows: the increase in |S| causes the average number of hops from s0 to the clients to grow. As a result, the delay tolerances become harder to satisfy, resulting in more information loss. However, Local Adjustment is more resilient to such changes in network scale than Incremental.

[Figure 14: Effect of |S| (|C| = 2 · |S|, ncj ∈ [2, 4], δi/∆opt(ki) ∈ [2, 8], |D| = 2000, Gaussian distribution); y-axis: IL-Ratio, x-axis: |S|; panels (a) Mondrian and (b) Top-down.]

In all of the above figures, Mondrian derivation outperforms top-down greedy derivation, because the result of Mondrian derivation is always identical to the version obtained from the microdata, a property that the latter cannot achieve. Surprisingly, although the IL-Ratio of Mondrian derivation is always smaller under the same parameter settings, the total information loss IL(C) is not necessarily smaller. The evidence is clearly shown in Figure 15: under both distributions, the top-down scheme outperforms Mondrian on the Adult dataset in terms of total information loss. The main reason is the high information loss introduced by the Mondrian anonymization algorithm itself. The results on the synthetic dataset are very similar and therefore omitted.

[Figure 15: IL(C) comparison between the two derivation algorithms (|S| = 100, |C| = 200, ncj ∈ [2, 4], δi/∆opt(ki) ∈ [2, 8], Adult dataset); y-axis: IL(C) (×10⁶), x-axis: |D| (×10³); panels (a) uniform distribution and (b) Gaussian distribution; series: Mondrian/Top-down with INC/LA.]

In summary, the Local Adjustment strategy always outperforms the Incremental strategy under the various parameter settings. Moreover, the IL-Ratio of Mondrian derivation is always smaller than that of top-down greedy derivation, whereas top-down greedy derivation can achieve smaller total information loss.

Efficiency Generally, the efficiency of the Local Adjustment strategy is comparable to that of the Incremental strategy. Both strategies have execution times of around 10 msec when based on Mondrian derivation. However, strategies based on top-down greedy derivation are much less efficient, with execution times in the range of 10 to 50 seconds.

7. CONCLUSION

In this paper, we studied the problem of privacy-preserving streaming data dissemination. We formulated the problem as a constrained optimization problem, relying on a dissemination model using both k-anonymity and its advanced counterpart, l-diversity. We proved the NP-hardness of the optimization problem and proposed two heuristic-based strategies, namely Incremental and Local Adjustment, to solve it. Local Adjustment is shown to outperform the Incremental strategy in terms of data fidelity under various parameter settings. In future work, we will extend the solution to the dissemination of multiple streams, where each stream might have a different set of subscribers. This can be solved by making one dissemination plan for each stream. We will also consider optimizing a global information loss function while respecting all the constraints, similar to the approach in [31].

8. ACKNOWLEDGMENTS

This work is supported by the National High Technology Research and Development Program of China (Grant No. 2013AA040601).

9. REFERENCES

[1] Delay models in data networks. http://web.mit.edu/dimitrib/www/Queueing_Data_Nets.pdf.
[2] J. Cao, B. Carminati, E. Ferrari, and K.-L. Tan. Castle: Continuously anonymizing data streams. IEEE Trans. Dependable Sec. Comput., 8(3):337--352, 2011.
[3] C.-M. Chao. Privacy-preserving classification of data streams. In SEKE, pages 603--606, 2008.
[4] E. Curtmola, A. Deutsch, K. K. Ramakrishnan, and D. Srivastava. Load-balanced query dissemination in privacy-aware online communities. In SIGMOD Conference, pages 471--482, 2010.
[5] C. Dwork. Differential privacy. In ICALP (2), pages 1--12, 2006.
[6] B. Gedik and L. Liu. Quality-aware distributed data delivery for continuous query services. In SIGMOD Conference, pages 419--430, 2006.
[7] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis. Fast data anonymization with low information loss. In VLDB, pages 758--769, 2007.
[8] Q. Guo, Y. Zhou, and L. Su. Multi-scale dissemination of time series data. In SSDBM, page 14, 2013.
[9] M. Haridasan and R. van Renesse. SecureStream: An intrusion-tolerant protocol for live-streaming dissemination. Computer Communications, 31(3):563--575, 2008.
[10] M. Hifi et al. Heuristic algorithms for the multiple-choice multidimensional knapsack problem. Journal of the Operational Research Society, 55:1323--1332, 2004.
[11] V. S. Iyengar. Transforming data to satisfy privacy constraints. In KDD, pages 279--288, 2002.
[12] R. Kuntschke, B. Stegmaier, A. Kemper, and A. Reiser. StreamGlobe: Processing and sharing data streams in grid-based P2P infrastructures. In VLDB, pages 1259--1262, 2005.
[13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In SIGMOD Conference, pages 49--60, 2005.
[14] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE, page 25, 2006.
[15] F. Li, J. Sun, S. Papadimitriou, G. A. Mihaila, and I. Stanoi. Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking. In ICDE, pages 686--695, 2007.
[16] J. Li, B. C. Ooi, and W. Wang. Anonymizing streaming data for privacy protection. In ICDE, pages 1367--1369, 2008.
[17] M. Li and D. Kotz. Event dissemination via group-aware stream filtering. In DEBS, pages 59--70, 2008.
[18] L. Sweeney. k-anonymity: Privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002.
[19] A. Machanavajjhala, J. Gehrke, et al. l-diversity: Privacy beyond k-anonymity. In ICDE, page 24, 2006.
[20] O. Papaemmanouil and U. Çetintemel. Semantic multicast for content-based stream dissemination. In WebDB, pages 37--42, 2004.
[21] S. Shah, S. Dharmarajan, and K. Ramamritham. An efficient and resilient approach to filtering and disseminating streaming data. In VLDB, pages 57--68, 2003.
[22] S. Shah, K. Ramamritham, and P. J. Shenoy. Maintaining coherency of dynamic data in cooperating repositories. In VLDB, pages 526--537, 2002.
[23] N. Shang, M. Nabeel, F. Paci, and E. Bertino. A privacy-preserving approach to policy-based content dissemination. In ICDE, pages 944--955, 2010.
[24] T. Wang and L. Liu. Butterfly: Protecting output privacy in stream mining. In ICDE, pages 1170--1179, 2008.
[25] X. Xiao, K. Yi, and Y. Tao. The hardness and approximation algorithms for l-diversity. In EDBT, pages 135--146, 2010.
[26] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization for privacy preservation with less information loss. SIGKDD Explorations, 8(2):21--30, 2006.
[27] X. Ye, Z. Li, B. Li, and F. Xie. Trust and privacy in dissemination control. In ICEBE, pages 173--180, 2009.
[28] H. Zakerzadeh and S. L. Osborn. FAANST: Fast anonymizing algorithm for numerical streaming data. In DPM/SETOP, pages 36--50, 2010.
[29] B. Zhou et al. Continuous privacy preserving publishing of data streams. In EDBT, pages 648--659, 2009.
[30] Y. Zhou, B. C. Ooi, K.-L. Tan, and F. Yu. Adaptive reorganization of coherency-preserving dissemination tree for streaming data. In ICDE, page 55, 2006.
[31] Y. Zhou, B. C. Ooi, and K.-L. Tan. Disseminating streaming data in a dynamic environment: An adaptive and cost-based approach. VLDB J., 17(6):1465--1483, 2008.
[32] Y. Zhou, A. Salehi, and K. Aberer. Scalable delivery of stream query result. PVLDB, 2(1):49--60, 2009.
[33] Y. Zhou, Z. Vagena, and J. Haustad. Dissemination of models over time-varying data. PVLDB, 4(11):864--875, 2011.

[11] V. S. Iyengar. Transforming data to satisfy privacy constraints. In KDD, pages 279--288, 2002. [12] R. Kuntschke, B. Stegmaier, A. Kemper, and A. Reiser. Streamglobe: Processing and sharing data streams in grid-based p2p infrastructures. In VLDB, pages 1259--1262, 2005. [13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In SIGMOD Conference, pages 49--60, 2005. [14] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE, page 25, 2006. [15] F. Li, J. Sun, S. Papadimitriou, G. A. Mihaila, and I. Stanoi. Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking. In ICDE, pages 686--695, 2007. [16] J. Li, B. C. Ooi, and W. Wang. Anonymizing streaming data for privacy protection. In ICDE, pages 1367--1369, 2008. [17] M. Li and D. Kotz. Event dissemination via group-aware stream filtering. In DEBS, pages 59--70, 2008. [18] L.Sweeney. k-anonymity: privacy protection using generalization and suppression. International Journal on Uncertainty Fuzziness and Knowledge-based Systems, 10(5), 2002. [19] A. Machanavajjhala, J. Gehrke, and etc. l-diversity: Privacy beyond k-anonymity. In ICDE, page 24, 2006. [20] O. Papaemmanouil and U. Çetintemel. Semantic multicast for content-based stream dissemination. In WebDB, pages 37--42, 2004. [21] S. Shah, S. Dharmarajan, and K. Ramamritham. An efficient and resilient approach to filtering and disseminating streaming data. In VLDB, pages 57--68, 2003. [22] S. Shah, K. Ramamritham, and P. J. Shenoy. Maintaining coherency of dynamic data in cooperating repositories. In VLDB, pages 526--537, 2002. [23] N. Shang, M. Nabeel, F. Paci, and E. Bertino. A privacy-preserving approach to policy-based content dissemination. In ICDE, pages 944--955, 2010. [24] T. Wang and L. Liu. Butterfly: Protecting output privacy in stream mining. In ICDE, pages 1170--1179, 2008. [25] X. Xiao, K. Yi, and Y. Tao. The hardness and approximation algorithms for l-diversity. In EDBT, pages 135--146, 2010. [26] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization for privacy preservation with less information loss. SIGKDD Explorations, 8(2):21--30, 2006. [27] X. Ye, Z. Li, B. Li, and F. Xie. Trust and privacy in dissemination control. In ICEBE, pages 173--180, 2009. [28] H. Zakerzadeh and S. L. Osborn. Faanst: Fast anonymizing algorithm for numerical streaming data. In DPM/SETOP, pages 36--50, 2010. [29] B. Zhou et al. Continuous privacy preserving publishing of data streams. In EDBT, pages 648--659, 2009. [30] Y. Zhou, B. C. Ooi, K. Tan, and F. Yu. Adaptive reorganization of coherency-preserving dissemination tree for streaming data. In ICDE, page 55, 2006. [31] Y. Zhou, B. C. Ooi, and K.-L. Tan. Disseminating streaming data in a dynamic environment: an adaptive and cost-based approach. VLDB J., 17(6):1465--1483, 2008. [32] Y. Zhou, A. Salehi, and K. Aberer. Scalable delivery of stream query result. PVLDB, 2(1):49--60, 2009. [33] Y. Zhou, Z. Vagena, and J. Haustad. Dissemination of models over time-varying data. PVLDB, 4(11):864--875, 2011.