Universität Ulm
Fakultät für Ingenieurwissenschaften und Informatik
Institut für Medieninformatik

Location Privacy in Vehicular Communication Systems: a Measurement Approach

Dissertation zur Erlangung des Doktorgrades Dr. rer. nat. der Fakultät für Ingenieurwissenschaften und Informatik der Universität Ulm

Zhendong Ma aus Shanghai, China

2011

Amtierender Dekan: Prof. Dr.-Ing. Klaus Dietmayer

Gutachter: Prof. Dr. Michael Weber
Gutachter: Prof. Dr. Manfred Reichert
Gutachter: Prof. Dr. Levente Buttyán

Tag der Promotion: 21.1.2011

Abstract

As an enabling technology for boosting cooperation on the road, emerging vehicular communication systems promise to greatly improve road safety, traffic efficiency, and driver convenience through vehicle-to-vehicle and vehicle-to-infrastructure communications. However, since many envisioned applications and services require a user to constantly reveal his locations, the user is in danger of losing his location privacy. Hence, in the context of vehicular communication systems, a meaningful location privacy metric is indispensable for the assessment of potential location privacy risks, the development of privacy-enhancing technologies, and the evaluation and benchmarking of any given location privacy-protection mechanism.

Measuring a user's location privacy is a non-trivial task. The location privacy metric must consider technical aspects specific to vehicular communication systems as well as the social and legal aspects of privacy in this context. Existing privacy metrics have various limitations which make them incapable of fully capturing a user's location privacy and giving meaningful and accurate privacy measurements. We cannot manage what we cannot measure. In this dissertation, we aim at developing a location privacy metric for the users of vehicular communication systems. Our measurement approach provides solutions for various issues in the concept, theory, and application of the metric. Taking legal and social aspects and vehicle mobility into consideration, we measure location privacy as the relationship between a user and his vehicle trips from an attacker's perspective. Based on a capture-model-measure paradigm, the metric captures related information in snapshots, processes the information, and gives quantitative measurements of a user's level of location privacy in the system. To truthfully reflect the underlying privacy values, we extend the metric and provide solutions for measuring a user's location privacy in multiple dimensions, such as privacy in timely-ordered snapshots and privacy in snapshots with interrelated users. Our approach is evaluated by various scenarios and simulations. We also demonstrate the practicability of our approach by a proof-of-concept implementation based on a realistic dataset, and use the developed metric to assess the effectiveness of privacy-protection mechanisms such as mix zones and changing pseudonyms.

Altogether, in this dissertation, we give a comprehensive answer to how to measure a user's location privacy in vehicular communication systems while taking into account domain-specific aspects and privacy in multiple dimensions. The location privacy metric fills an important gap in current research and facilitates the estimation of the privacy level in emerging vehicular communication systems and the benchmarking of different privacy-protection mechanisms on a common basis. Furthermore, our measurement approach provides insights into the causes of the location privacy problem, on which cost-effective privacy-protection mechanisms can be developed to benefit the users of vehicular communication systems.


Zusammenfassung

Kooperatives Verhalten zwischen Fahrzeugen auf der Straße wird zunehmend durch Technologien wie Fahrzeug-Fahrzeug-Kommunikationssysteme unterstützt und trägt entscheidend zu einer Erhöhung der Verkehrssicherheit und -effizienz sowie des Fahrkomforts bei. Kooperationen finden dabei durch Kommunikation zwischen Fahrzeugen untereinander und zwischen Fahrzeugen und Infrastruktur statt. Damit eine Fahrzeugkommunikation überhaupt möglich wird, müssen die Nutzer einer solchen Technologie ständig ihre Lokationsdaten preisgeben. Dies stellt natürlich ein mögliches Risiko für die Privatheit der Lokationsdaten (Location Privacy) der Nutzer dar. Um dieses Risiko genau abschätzen zu können, bedarf es eines adäquaten "Messinstruments", genauer gesagt einer Metrik zur Messung des Grades an Location Privacy der Nutzer in einem Fahrzeug-Fahrzeug-Kommunikationssystem. Neben der Risikoeinschätzung ermöglicht eine solche Metrik auch die Evaluierung bestehender Mechanismen zum Schutz von Location Privacy sowie deren Neu- und Weiterentwicklung. Allerdings stellt die Entwicklung einer geeigneten Metrik keine einfache Aufgabe dar, da diese sowohl technischen Aspekten von Fahrzeug-Fahrzeug-Kommunikationssystemen als auch sozialen und rechtlichen Anforderungen genügen muss. Existierende Metriken weisen eine Reihe von Einschränkungen auf, die eine sinnvolle und akkurate Bestimmung des Grades an Location Privacy erschweren oder sogar unmöglich machen. Kurz gesagt: Man kann nicht kontrollieren, was man nicht messen kann! Gemäß diesem Grundsatz ist das Ziel dieser Dissertation, eine Metrik zur Messung von Location Privacy der Nutzer von Fahrzeug-Fahrzeug-Kommunikationssystemen zu entwickeln, und zwar ohne die Beschränkungen existierender Metriken. Unser Ansatz vereint hierbei Theorie, Konzept und Anwendung der Metrik. Auf der Grundlage einer realistischen Betrachtung des Fahrverhaltens wird Location Privacy als die Fähigkeit eines Angreifers gemessen, einen Nutzer den Wegstrecken, die er mit seinem Fahrzeug zurückgelegt hat, zuzuordnen. Um dies in numerischen Kenngrößen abbilden zu können, werden Schnappschüsse des Fahrzeug-Fahrzeug-Kommunikationssystems verwendet, um die Verbindung zwischen Wegstrecken und Nutzern zu modellieren. In einem ersten Schritt werden Schnappschüsse getrennt voneinander betrachtet, dann jedoch sukzessive um den zeitlichen Verlauf zwischen den Schnappschüssen und die Verbindung zwischen mehreren Nutzern des Fahrzeug-Fahrzeug-Kommunikationssystems im Modell erweitert. Diese zusätzlichen Dimensionen liefern weitere Informationen, die in der Metrik zur realistischen Bewertung von Location Privacy verwendet werden können. Die Messung von Location Privacy in unserem Modell erfolgt in einem informationstheoretischen Ansatz. Alle Ergebnisse werden mittels unterschiedlicher Simulationen evaluiert. Außerdem erfolgt eine prototypische Umsetzung des Gesamtansatzes durch Analyse eines realistischen Datensatzes.

Zusammenfassend bietet die Dissertation einen umfassenden Ansatz zur Messung von Location Privacy der Nutzer in einem Fahrzeug-Fahrzeug-Kommunikationssystem. Damit füllt die Arbeit eine wichtige Lücke in existierenden Ansätzen, da durch die Metrik zum ersten Mal eine umfassende Bewertung und ein Vergleich von Mechanismen zum Schutz von Location Privacy möglich werden. Darüber hinaus bietet die Arbeit fundamentale Einsichten in die Gründe für mögliche Privacy-Verletzungen, womit eine Entwicklung effektiverer Schutzmechanismen ermöglicht wird.


Acknowledgements

It is my pleasure to thank those who made this dissertation possible. First of all, I am deeply grateful to Prof. Dr. Michael Weber, whose belief, supervision, and support got this work started and accomplished. I also owe my gratitude to Prof. Dr. Manfred Reichert, whose corrections and suggestions have brought the dissertation to a higher standard. I am also thankful to Prof. Dr. Frank Kargl, who has given valuable guidance at various stages of this work. No man is an island. I am grateful for the inspiring discussions with my colleagues in the once "VANET" group. Especially, I would like to thank Elmar Schoch, Florian Schaub, and Björn Widersheim for their fruitful cooperation. The same gratitude also goes to the partners in the SEVECOM and PRECIOSA projects. This dissertation would not have been possible without the encouragement, support, and love of Stefanie, Jennie, and Johnny. I am also thankful to my parents, Baodi Yang and Mingfu Ma, who taught me to pursue excellence at a very early age.


Contents

1. Introduction
   1.1. Location privacy
   1.2. Measurement approach
   1.3. Contribution
   1.4. Organization

2. Background
   2.1. System and threat model
      2.1.1. System model
      2.1.2. Threat model
      2.1.3. Attacks on location privacy
      2.1.4. Privacy model
   2.2. Existing privacy-protection mechanisms
      2.2.1. Information flow control
      2.2.2. Anonymization
      2.2.3. Degradation
      2.2.4. Dummy traffic
      2.2.5. Real-world implementations
   2.3. Existing privacy metrics
      2.3.1. Privacy-related concepts and notions
      2.3.2. Anonymity set-based metrics
      2.3.3. Mix zone-based metrics
      2.3.4. Tracking-based metrics
      2.3.5. Distance-based metrics
   2.4. Discussion
      2.4.1. Requirements on privacy metrics
      2.4.2. Analysis of requirement fulfillment
      2.4.3. Summary and outlook

3. Location privacy in snapshot view
   3.1. Location privacy revisited
      3.1.1. Measuring vehicle location privacy
   3.2. Methodology of measuring location privacy
   3.3. Capture information
   3.4. Model information
      3.4.1. Observation
      3.4.2. Formalization
   3.5. Calculate information
      3.5.1. Entropy
      3.5.2. Extract information
      3.5.3. Quantify information
   3.6. Analysis
      3.6.1. Use Case I
      3.6.2. Use Case II
   3.7. Discussion
   3.8. Summary

4. Location privacy in time series
   4.1. Accumulated information
   4.2. Measurements based on multiple snapshots
      4.2.1. Frequency based approach
      4.2.2. Bayesian approach
   4.3. Evaluation
      4.3.1. Evaluation criteria
      4.3.2. Evaluation setup
      4.3.3. Simulation
   4.4. Heuristic algorithm for dynamic trip constellations
      4.4.1. Finding an adequate measurement of similarity
      4.4.2. Constellation fitting
      4.4.3. Heuristic algorithm
   4.5. Evaluation of heuristic algorithm
      4.5.1. Evaluation with respect to constellation dynamics
      4.5.2. Evaluation with respect to p-value
      4.5.3. Evaluation with respect to cluster of re-appearing trips
   4.6. Summary

5. Location privacy in global view
   5.1. Local vs. global view
      5.1.1. Location privacy in local view
      5.1.2. Location privacy in global view
   5.2. Prior art
   5.3. Measure location privacy in global view
      5.3.1. Our approach
   5.4. Modeling in Bayesian networks
      5.4.1. Bayesian networks
      5.4.2. Modeling
      5.4.3. Parameterize BN model
   5.5. Design probabilistic queries
   5.6. Calculate posterior probability distributions
      5.6.1. Calculate entropy
   5.7. Evaluation
      5.7.1. Evaluation process
      5.7.2. Evaluation setup
      5.7.3. Evaluation of simple scenario
      5.7.4. Statistical evaluation
   5.8. Summary

6. Implementation
   6.1. Implementation overview
   6.2. Vehicle trip dataset
      6.2.1. Travel tracker survey data
      6.2.2. Vehicle routes
   6.3. Implement measurement approach
      6.3.1. Capture data with snapshots
      6.3.2. Model data in tripartite graph
      6.3.3. Model mis-linking
      6.3.4. Implement privacy-protection mechanisms
      6.3.5. Extract and measure
   6.4. Evaluate effectiveness
   6.5. Discussion on practicability
   6.6. Summary

7. Conclusion

A. Multi-hypothesis tracking and Kalman filtering

B. The basics of measurement theory

C. Related conditional probability formulas

D. Acronyms

List of Figures

1.1. Examples of VCS scenarios
1.2. Three-dimensional measurement approach to location privacy in VCS
2.1. Privacy model
2.2. Structured overview of existing privacy-protection mechanisms
2.3. Example of mix zone
3.1. Location privacy of one individual over a period of time
3.2. Number of publications related to location privacy on DBLP
3.3. Conceptual understanding of location privacy
3.4. Three inseparable elements of location privacy
3.5. Vehicle trajectory from multiple location samples
3.6. Example of individual's typical urban day trips
3.7. Example of trips
3.8. Illustration of information processing in location privacy metric
3.9. Example of taking snapshots in continuous time
3.10. Example of taking snapshots in continuous space
3.11. Simple scenarios of individuals, origin/destination (O/D) pairs, and their relations
3.12. The measurement model as a weighted tripartite graph
3.13. Visualization of probability distribution related to i extracted from G
3.14. Example 3.4
3.15. Simple example of three individuals and three trips
3.16. Example of probability distributions in IO, OD, and DI
3.17. Comparison of average entropy from Use Case I and II
4.1. Location privacy of one individual in multiple snapshots
4.2. Multiple snapshots of i in timely-ordered sequence
4.3. Example of Algorithm 1
4.4. Entropy of irregular trips
4.5. Entropy of regular trips
4.6. Entropy of re-occurring trips
4.7. Change of uncertainty
4.8. Change of beliefs with different p-values
4.9. Various relations of Sj and Si
4.10. Example of Algorithm 2
4.11. Snapshots with 10% constellation dynamics
4.12. Changes of beliefs on T1 with different degrees of constellation dynamics
4.13. Changes of beliefs with different p-values
4.14. Beliefs on trip clusters at 60th snapshot
5.1. Location privacy of all users captured in the same snapshot
5.2. Extracting probability distributions of individuals from tripartite graph
5.3. Example of attacker's mistakes
5.4. Example of two individuals in local and global views
5.5. Perfect matching bipartite graph
5.6. Approach to calculate location privacy in global view
5.7. Example of Bayesian networks
5.8. Simple example of the BN model
5.9. Example of modeling global view in Bayesian network
5.10. The noisy OR-Model
5.11. A Bayesian network with three nodes
5.12. Prior and posterior probability distributions of i1 and i2
5.13. Evaluation process
5.14. Example of evaluation setup
5.15. KL divergence of local and global view with respect to ground truth
5.16. Entropy in local and global view
6.1. Modules in implementation
6.2. Trip data table from travel tracker data
6.3. Using Google Maps to find vehicle routes
6.4. Route from Figure 6.3 in Google Earth
6.5. Snapshot (day 1, 10am-11am, zone (2,2) in a 3 × 3 grid) in Google Earth
6.6. Example of snapshot modeling
6.7. Origins and destinations from example snapshot and adjacent ones from trip dataset
6.8. Spatial relations
6.9. Distribution of overlapped distances in ascending order
6.10. Intersections of trips in example snapshot
6.11. Illustration of effectiveness of α and β on trip pair
6.12. Influence of mix-factor α on location privacy measurements
6.13. Influence of pseudonym-factor β on location privacy measurements
6.14. Influence of mix- and pseudonym-factor on location privacy measurements

List of Tables

2.1. Example of location privacy attackers
2.2. Existing privacy metrics with respect to requirements
3.1. Summary of trips in Figure 3.7
3.2. Result of example in Figure 3.15
4.1. A simple example with six consecutive snapshots of i
4.2. Overview of use cases
4.3. 3rd use case setup
5.1. Notations used in BN model
5.2. Entropy of i1 and i2 in local vs. global view
5.3. Number of matches of individuals in 100 simulation runs
B.1. Classification of scale of measurement

1. Introduction

Vehicular Communication Systems (VCS) are emerging communication networks in which vehicles can communicate with each other and with entities in backend systems. Leveraging wireless technologies such as Dedicated Short Range Communications (DSRC) [2], which is standardized in the IEEE 802.11p draft standard and commonly known as Wireless Access in Vehicular Environments (WAVE) [1], as well as cellular networks, vehicles are able to communicate seamlessly. Consequently, a new form of cooperation among the participants and stakeholders of transportation systems (e.g., drivers, traffic operators, and service providers) can be achieved on the road. As one of the key technologies for Intelligent Transportation Systems (ITS), VCS promise to greatly improve road safety, traffic efficiency, and driver convenience in the near future. Consequently, VCS attract a lot of attention from various parties such as governments, road operators, vehicle manufacturers, equipment suppliers, and research institutes. A large number of research projects have been carried out worldwide in recent years, e.g., CVIS [3], GST [4], and VII [5].

The functionalities of VCS are realized by various vehicular communication (VC) applications and services. For example, in collision warning, a vehicle frequently (e.g., every 100 ms) broadcasts its current position, speed, and heading in so-called "beacon" messages to notify and warn the vehicles in the vicinity about its existence and driving intentions. In floating car data (FCD), a number of vehicles send their positions, speeds, and headings to the Transportation Management Center (TMC) in the infrastructure network via Roadside Units (RSU). The TMC calculates, controls, and optimizes traffic flows based on the real-time traffic information provided by the FCD vehicles. A driver can also gain great convenience from various location-based services (LBS) provided by third-party service providers, such as comparing and finding a restaurant, a gas station, or an electric vehicle (EV) charging station on the route.

Figure 1.1 illustrates some example scenarios in VCS. In the first scenario, vehicles frequently send out short beacon messages to warn the others and avoid collisions. In the second scenario, a TMC collects FCD data from the vehicles and calculates the real-time traffic situation on the road. Such information can be used to optimize traffic. In the third scenario, a vehicle sends a location-based service request to a third-party service provider over the Internet. In the fourth scenario, the vehicles involved in an accident send warning messages to the approaching vehicles, such that the approaching vehicles can promptly react before they arrive at the accident site.

[Figure 1.1 (Examples of VCS scenarios) is not reproduced here; it depicts (1) collision warning via beacon messages carrying position, speed, and heading, (2) floating car data reported to the Traffic Management Center for real-time traffic reports, (3) a location-based service request ("gas station within 5 km of position (x,y)?") sent to an LBS provider over the Internet, and (4) an accident warning ("accident at position (x,y)") sent to approaching vehicles.]

In general, VCS will serve as a communication platform on which numerous safety and commercial applications and services can be developed and deployed. These applications and services will in turn bring safety, productivity, profit, and many other benefits to road users, road operators, and service providers, as well as to society as a whole.
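To make the beaconing scenario more tangible, the following minimal sketch models the kind of state a vehicle might broadcast every 100 ms. The field names and types are illustrative assumptions, not the standardized DSRC/WAVE message format.

```python
from dataclasses import dataclass
import time

@dataclass
class BeaconMessage:
    """Illustrative single-hop safety beacon (fields are assumed, not the standardized format)."""
    pseudonym_id: str    # short-term identifier attached to the message
    latitude: float      # current position
    longitude: float
    speed_kmh: float     # current speed
    heading_deg: float   # heading in degrees, clockwise from north
    timestamp: float     # transmission time in seconds since the epoch

def make_beacon(pseudonym_id: str, lat: float, lon: float,
                speed_kmh: float, heading_deg: float) -> BeaconMessage:
    """Assemble a beacon from the vehicle's current state."""
    return BeaconMessage(pseudonym_id, lat, lon, speed_kmh, heading_deg, time.time())

# A vehicle heading east at 60 km/h would broadcast a message like this roughly every 100 ms.
beacon = make_beacon("p-4711", 48.4011, 9.9876, 60.0, 90.0)
```

Each such message ties a precise position to an identifier, which is exactly what makes the location privacy discussion in the following sections necessary.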

1.1. Location privacy

Many functionalities of VCS, however, require a vehicle to reveal its precise location information in a continuous way. Vehicles are usually personal devices and are owned for a long period of time. When put into context, the whereabouts of a vehicle can be used to derive plenty of other information, such as the identity, social activities, and personal preferences of the user. If not handled properly, the transmission and dissemination of one's location information is a clear threat to one's location privacy, which is "the ability of an individual to move in public space with the expectation that under normal circumstances their location will not be systematically and secretly recorded for later use" [6]. As a result, preserving and protecting the location privacy of VCS users is essential and mandatory for the successful deployment and public acceptance of VCS.


The issue of location privacy in VCS has been identified in recent years [7, 8, 9], and a multitude of proposals have been made to preserve the privacy of VCS users. Many of these privacy-protection mechanisms follow the principle of dissociating people from places, i.e., separating the information on a user's identity from the information on his vehicle movements. In addition, possible solutions to the privacy problems in VCS are constrained by the unique characteristics of vehicular communications and by a set of security requirements [10], e.g., the preference for conditional pseudonymity instead of total anonymity [11]. As a result, many of the proposed privacy-protection mechanisms have converged toward a pseudonym-based approach, in which identifiers that cannot be directly linked to a user's real identity are used in communications. Moreover, various architectures, protocols, and techniques that support and enhance the pseudonym approach have been proposed [12, 13, 14, 15, 16, 17, 18, 19], covering the whole pseudonym life cycle, including pseudonym generation, change, refill, and revocation.

Location privacy is a special type of privacy. In the context of VCS, the main causes of the privacy concern are that 1) VCS users need to reveal their personal location information in exchange for functionalities of the system such as safety and convenience, and 2) most of the location information will be sent out automatically by the vehicles without user control. Thus it is very important that VCS preserve and protect the users' location privacy when communicating, disseminating, utilizing, and storing the users' location information.

1.2. Measurement approach

To assess whether VCS are able to preserve and protect the users' location privacy, and to evaluate and compare the effectiveness of any privacy-protection mechanism, a method to quantify and measure the level of users' location privacy is crucial and indispensable. However, in current privacy research, most efforts focus on developing privacy-protection mechanisms. By contrast, methods that assess the trustworthiness of the system, gauge the privacy level of the users, and evaluate the effectiveness of a given protection mechanism are underdeveloped. As the ability to evaluate and test a proposed design is an indispensable part of any scientific process, the current trend in location privacy research in VCS creates a gap, which is analogous to leaping to point B (protection mechanisms) before starting at point A (evaluation methods).

Furthermore, privacy does not come for free. Privacy-protection mechanisms usually have side-effects in terms of communication and computation overhead [20] as well as deployment cost. Put into perspective, privacy requirements are only a subset of the identified requirements for VCS, such as the requirements related to communication, security, and privacy [11]. This means that the overall system design of VCS will have to consider and harmonize a conglomeration of different and even conflicting requirements. Hence, measurements of privacy values can greatly contribute to the design process, in which various requirements can be balanced and optimized to find the best privacy protection available.

Although several privacy metrics have been proposed in recent years, such as anonymity set-based metrics for measuring the degree of anonymity in anonymous communication systems [21, 22, 23] and mix zone-based metrics for measuring location privacy in ubiquitous computing or vehicle networks [24, 25, 26], none of the existing metrics can capture the essence of vehicle movements and the corresponding privacy implications. For example, anonymity set-based metrics base their measurements on a set of subjects indistinguishable from the subject under consideration, which is applicable for measuring sender-receiver relations in communication systems but is not appropriate for capturing dynamic vehicle-vehicle relations in VCS; mix zone-based metrics can capture a vehicle's movement, but are not sufficient to reflect the vehicle's location privacy because mix zones only partially capture the vehicle's spatial-temporal movement and its location privacy-sensitive points. Therefore, they are either inappropriate or insufficient to accurately capture and reflect privacy values in VCS. The existing privacy metrics and their limitations will be further discussed in Section 2.3.

Just as security metrics are essential to developing secure computer systems because they increase accountability, demonstrate compliance, and help to make informed decisions on efforts to make systems secure [27], privacy metrics are essential to location privacy because progress in protecting location privacy depends on the ability to quantify it [28]. Therefore, to understand the causes of and address the issues in location privacy, to evaluate the effectiveness of any proposed protection mechanism, to contribute to the design of privacy-preserving vehicular communications, and to fill an important gap in current privacy research, in this dissertation we aim at developing a location privacy metric that rigorously measures the location privacy of VCS users.

1.3. Contribution

The main objective of this dissertation is to develop a location privacy metric which captures and processes privacy-related information and quantitatively reflects the level of the users' location privacy in VCS. Our work covers the concept, theory, and application of the measurement approach to VCS. As we will show in the remainder of the dissertation, our measurement approach gives rise to a number of non-trivial issues and challenges which have not been identified and addressed before. Step by step, we present a measurement approach that systematically addresses these challenges. A summary of our contributions is listed below:

• Concept of location privacy. We identify fundamental elements of location privacy and their relations beyond the conventional line of thought. The refined concept of location privacy enables us to take more accurate measurements.

• Capturing location privacy. We apply the mechanism of snapshots to capture privacy-related information in discrete form from a continuous system. We model and quantify the captured information to yield numeric values for a user's location privacy.

• Measuring long-term location privacy. We identify the issue of privacy over time, design methods to process, propagate, and utilize the accumulated information, and reflect its impact on location privacy in the measurement.

• Measuring location privacy in global view. Since VCS consist of multiple users, these users may exhibit certain interrelations in the system. To accurately reflect location privacy at the system level, we investigate the interrelations among the users and develop methods to relate the users in the system and to measure a user's location privacy at the system level.

• Evaluation method. We design and develop methods to implement and evaluate the correctness and feasibility of our approach.

The outcome of our measurement approach is a comprehensive location privacy metric for VCS, which measures a user's location privacy along three dimensions, as illustrated in Figure 1.2. In the first dimension, location privacy is measured for an individual in a specific time period, i.e., location privacy in snapshot view. In the second dimension, location privacy is measured for an individual over a long period of time, i.e., location privacy in time-series view. In the third dimension, location privacy is measured for a system consisting of interrelated individuals, i.e., location privacy in global view.

1.4. Organization

Chapter 2 gives the necessary background information. We first describe the system and threat model, followed by a review of related work on privacy-protection mechanisms.


[Figure 1.2 (Three-dimensional measurement approach to location privacy in VCS) is not reproduced here; its panels show (a) the first dimension, snapshot view, (b) the second dimension, time series, and (c) the third dimension, global view.]

Afterwards, we review existing privacy metrics and discuss why they are inappropriate and insufficient to precisely capture and reflect the values of location privacy in VCS.

Chapter 3 first revisits the concept of location privacy and its special meaning in the context of VCS. Next, we introduce the concept of snapshots, which capture a user's privacy-related information in a given space and time period. More specifically, the snapshots capture the information on each individual in the system and their vehicle trips. Assuming that an attacker tries to link the vehicle trips to the individuals, we express the attacker's information in terms of probabilities. Subsequently, the location privacy of an individual is measured as the uncertainty of this information and quantified as entropy. Moreover, the location privacy of a specific user can be determined by the ratio of its current entropy to the maximum possible entropy within the system. Finally, we evaluate the feasibility of our approach with different use case studies.

Chapter 4 extends the metric by investigating the assumption that an attacker can gather and store the users' location information over a long period of time and exploit the accumulated information. We develop approaches and algorithms to model, process, propagate, and reflect the impact of accumulated information in privacy measurements. As a user's short-term location privacy is captured in a single snapshot, the user's long-term location privacy is captured in multiple snapshots in time series. We develop two algorithms that apply the Bayesian method to process and propagate accumulated information among multiple snapshots along the timeline. The first algorithm propagates information among snapshots with exactly matching trip constellations. The second algorithm is a heuristic extension of the first one, which is robust enough to function on snapshots with highly dynamic trip constellations. We design methods to evaluate the feasibility and correctness of these approaches and algorithms with various use case studies and extensive simulations. We show that accumulated information can have a significant impact on the level of location privacy.

Chapter 5 addresses the issue of measuring location privacy in global view, in which we assume that an attacker can correlate and process the information on all individuals as a whole. To establish and process the information on the individuals interrelated by their possible trips, we use Bayesian networks to model the information on the individual-trip linkabilities. We design probabilistic queries to extract the relevant conditional probabilities from the Bayesian network and obtain the probability distributions that are updated by the information on the others. Our findings show that the level of location privacy in VCS will decrease when the information on the individuals and their trips can be processed as a whole. The feasibility and correctness of our approach are evaluated by various use case studies and simulations.

Chapter 6 presents our proof-of-concept implementation. The implementation demonstrates how to apply the measurement approach to the real world and measures the effectiveness of two location privacy-protection mechanisms based on realistic scenarios and a realistic dataset. In our implementation, we address the challenges of obtaining realistic vehicle trip data, engineering the dataset for privacy measurements, and designing and implementing the measurement approach. Our implementation shows that our measurement approach is practical and that it is possible to measure location privacy in VCS.

Chapter 7 concludes the dissertation with an outlook on directions for future research.
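As a preview of the entropy-based measurement developed in Chapter 3, the following minimal sketch computes the uncertainty of an attacker's probability distribution over candidate individuals for a trip and normalizes it by the maximum possible entropy. The example distributions are made up for illustration and are not taken from the dissertation's evaluations.

```python
import math

def entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def location_privacy_level(probabilities):
    """Ratio of the current entropy to the maximum entropy log2(N):
    1.0 means the attacker has learned nothing, 0.0 means certainty."""
    n = len(probabilities)
    if n <= 1:
        return 0.0
    return entropy(probabilities) / math.log2(n)

# Example: the attacker considers four individuals as the possible driver of one trip.
uniform = [0.25, 0.25, 0.25, 0.25]       # no information gained
skewed = [0.70, 0.20, 0.05, 0.05]        # attacker strongly suspects one individual
print(location_privacy_level(uniform))   # 1.0
print(location_privacy_level(skewed))    # approximately 0.63
```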


2. Background

2.1. System and threat model

2.1.1. System model

In VCS, the On-board Unit (OBU) of a vehicle communicates with the OBUs of other vehicles in vehicle-to-vehicle (V2V) communications, as well as with the nodes (e.g., a traffic operator or a service provider) in the infrastructure network. The latter is achieved via Roadside Units (RSU) in vehicle-to-infrastructure (V2I) communications. The RSU acts as a gateway to the infrastructure network. In addition, a vehicle is equipped with receivers for Global Positioning System (GPS) signals or sensors for precise geographic position information. The actual vehicular communications are carried out by sending and receiving messages.

2.1.2. Threat model

An attacker is a person or an organization who breaches a user's location privacy in vehicular communication systems. With minor adaptations from [29], we categorize an attacker on VCS along the following dimensions:

• Insider vs. Outsider. The insider is an authenticated member of the network that can access information about other members. An outsider does not have legitimate access to the information about other members and is considered by other members as an intruder.

• Malicious vs. Rational. A malicious attacker seeks no personal benefit from the attacks and aims to harm the members or the functionality of the network. A rational attacker seeks personal profit and hence is more predictable in terms of attack means and targets.

• Active vs. Passive. An active attacker can inject messages into the network. A passive attacker eavesdrops on the wireless channel or on the communications in the infrastructure network.


• Local vs. Extended. A local attacker can be in control of several entities (vehicles or RSUs), but is limited in scope. An extended attacker controls more entities that are scattered across the network, thus extending his scope. The strongest attacker has global coverage of the network.

An attacker can be any combination of the descriptions along the four dimensions. Table 2.1 gives an example of various possible location privacy attackers in VCS.

Insider: an employee at the Transportation Management Center (TMC) with access to floating car data (FCD).
Outsider: someone outside the TMC without legitimate access to FCD data.
Malicious: a teenager obtains the whereabouts of a renowned person and posts them on the Internet.
Rational: a white-collar criminal seeks particular location information and sells it to a bidder.
Active: a hacker poses as an authority and queries a vehicle about its position.
Passive: an eavesdropper deploys receivers along the road to collect beacon messages.
Local: an attacker with limited coverage of a few blocks in the city.
Extended: an attacker with global coverage of the whole network in a region.

Table 2.1.: Example of location privacy attackers
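To make the four attacker dimensions concrete, the sketch below encodes an attacker profile as one value per dimension. The class and enum names are chosen for illustration and are not part of the cited categorization [29].

```python
from dataclasses import dataclass
from enum import Enum

class Membership(Enum):
    INSIDER = "insider"
    OUTSIDER = "outsider"

class Motivation(Enum):
    MALICIOUS = "malicious"
    RATIONAL = "rational"

class Activity(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"

class Scope(Enum):
    LOCAL = "local"
    EXTENDED = "extended"

@dataclass(frozen=True)
class AttackerProfile:
    """One attacker is described by a value along each of the four dimensions."""
    membership: Membership
    motivation: Motivation
    activity: Activity
    scope: Scope

# One possible combination: a roadside eavesdropper acting for profit, with regional coverage,
# would be a passive, rational, extended outsider.
eavesdropper = AttackerProfile(Membership.OUTSIDER, Motivation.RATIONAL,
                               Activity.PASSIVE, Scope.EXTENDED)
```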

2.1.3. Attacks on location privacy

In the following, we list several location privacy attacks identified in the literature in recent years.

Inferring home and identity

Krumm [30] made an empirical study of inference attacks. In the attack, a subject's home location is first identified from a set of anonymized GPS data, and a programmable Web search engine is then used to find his identity. The GPS data were generated by a set of volunteer drivers over a two-week period. The data are anonymized such that they only include time-stamped geo-coordinates. To find the home locations in the GPS traces, the GPS data are segmented into discrete trips by a set of criteria, e.g., a trip must include at least 10 measured points and must be at least one kilometer long. The home locations are identified by one of the following heuristic algorithms:


• Last destination, assuming that the last destination of the day is often a subject's home.

• Weighted median, assuming that a subject spends more time at home than at any other location.

• Largest cluster, assuming that most of a subject's coordinates will be at home.

• Best time, by learning the relative probability of a subject being at home vs. the time of the day.

(A sketch of the last-destination heuristic is given at the end of this section.) In the next step, the likely coordinates of a subject's home are put into a Web-based white-page lookup. The attack is able to correctly identify about 5% of the 172 subjects.

Cluster-based home identification

Similar to the aforementioned inference attack, Hoh et al. [31] developed a cluster-based home identification algorithm. The algorithm uses k-means clustering on anonymous location samples (i.e., time-stamped geo-coordinates) to identify frequently visited places. To refine the resulting clusters, several heuristics are used. The heuristics are based on the observations that a vehicle is likely to have low to zero speed near the driver's home and that vehicles are often parked overnight at home. The results show that the home identification algorithm can correctly locate about 85% of the home locations from the anonymous location samples. Ashbrook and Starner [32] also used clustering in place identification, which groups GPS data into meaningful locations. Though not intended as a location privacy attack, the developed techniques can be used to extract potentially sensitive information about a driver's habits and interests from the driver's location samples.

Identity identification from home/work location pair

Golle and Partridge [33] show that the approximate locations of an individual's home and workplace can be used to infer the individual's real identity with high certainty. In their study, they use the "Origin/Destination" dataset from the Longitudinal Employer-Household Dynamics (LEHD) program run by the U.S. Census Bureau, which reports where workers live and work at the granularity of census blocks (census blocks typically coincide with city blocks in urban areas and cover several square kilometers in rural areas). The results show that, based on the current release of the Origin/Destination dataset from the U.S. Census Bureau, it is possible to uniquely identify an individual by just knowing the locations of his home and workplace at the granularity of a census block. (Note that, for the sake of brevity, we use the male possessive adjective "his" instead of "his or her" throughout this dissertation.)

Target tracking

The above attacks rely on knowing complete location traces to find the sensitive locations, usually the end points of the traces. Sometimes the complete traces are not readily available. As part of a location privacy attack, an attacker has to track a vehicle's movement in space and time long enough to obtain the trace. Hoh [34] argues that for a complete privacy breach, the tracked trace should contain a privacy-sensitive event such as a sensitive destination, and the driver generating this trace should be identified. As a result, the longer an attacker can track a vehicle, the better the chance that he can achieve his goal. In [35], Gruteser and Hoh use Reid's multiple hypothesis tracking algorithm (MHT) [36] to track anonymous GPS data from a group of students in and around a university campus. Anonymous location samples from three different tracks are used as input to the MHT algorithm. The tracking results show that, despite several temporary incorrect assignments, most anonymous samples can be associated with the correct tracks due to the spatial-temporal correlations in the location samples. In a recent study [37], we applied the MHT approach (see Appendix A) to track vehicles in vehicular networks. We take a similar approach, i.e., applying the MHT algorithm to anonymous vehicle location samples, but with more complex and larger-scale settings that have different characteristics. Our study shows that, due to the high precision and frequency of the location information given out by the vehicles in vehicular networks, an attacker eavesdropping on all communications in an area can reconstruct long vehicle traces. Our analysis provides supporting evidence for the existence of this privacy threat in VCS.
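As referenced above, the following is a minimal sketch of the last-destination heuristic, assuming trips are given as day-labelled lists of time-stamped coordinates; the exact criteria and data handling of the original study [30] are not reproduced here.

```python
from collections import defaultdict

def last_destination_heuristic(trips):
    """Guess a subject's home as the most frequent end point of the last trip of each day.

    `trips` is a list of (day, points) pairs, one per trip, where points is a chronologically
    ordered list of (timestamp, lat, lon) tuples.  Coordinates are rounded to roughly 100 m
    cells so that slightly different GPS fixes at the same place are counted together.
    """
    last_trip_per_day = {}
    for day, points in trips:
        if not points:
            continue
        end_time = points[-1][0]
        if day not in last_trip_per_day or end_time > last_trip_per_day[day][-1][0]:
            last_trip_per_day[day] = points

    counts = defaultdict(int)
    for points in last_trip_per_day.values():
        _, lat, lon = points[-1]
        cell = (round(lat, 3), round(lon, 3))   # ~100 m grid cell
        counts[cell] += 1
    return max(counts, key=counts.get) if counts else None
```

On anonymized traces, the cell returned by such a heuristic would then be fed into a reverse lookup (e.g., a white-page search) as described above.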

2.1.4. Privacy model

The system and threat model can be generalized into a concise privacy model, shown in Figure 2.1. The asset in the privacy model is one's location information, which is generated by vehicle movements and disseminated by vehicular communications. The privacy-protection mechanism in the model is in place to alter the location information (e.g., to remove the identity information from the data or to distort the exact locations) to a certain degree in order to protect one's location privacy. Subsequently, the observation captures a subset of the actual location information, which is incomplete and contains errors. Consequently, the attacker's knowledge contains a reconstructed version of the location information. It depends not only on the quantity and quality of the observation, but also on the attacker's ability to make sense of the observed information.

[Figure 2.1 (Privacy model) is not reproduced here; it shows the chain: location information → privacy-protection mechanism → observation → attacker's knowledge.]

As a result, an individual's level of location privacy depends on the accuracy and completeness of the attacker's knowledge with respect to his location information.
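Read as a processing chain, the privacy model can be sketched as three stages between the location information and the attacker's knowledge. The concrete blurring, sampling probability, and naive reconstruction below are illustrative assumptions, not mechanisms proposed in this dissertation or in the cited literature.

```python
import random

def protect(trace, blur_m=100.0):
    """Privacy-protection mechanism: coarsen the positions of an (already de-identified) trace."""
    deg = blur_m / 111_000.0   # rough metres-to-degrees conversion near the equator
    return [(t, lat + random.uniform(-deg, deg), lon + random.uniform(-deg, deg))
            for (t, lat, lon) in trace]

def observe(protected_trace, capture_prob=0.8):
    """Observation: the attacker only overhears a subset of the altered messages."""
    return [p for p in protected_trace if random.random() < capture_prob]

def reconstruct(observation):
    """Attacker's knowledge: an incomplete, error-prone reconstruction of the trace."""
    return sorted(observation)   # here simply the observed points in time order

# location information -> protection mechanism -> observation -> attacker's knowledge
trace = [(t, 48.40 + 0.001 * t, 9.98) for t in range(10)]
knowledge = reconstruct(observe(protect(trace)))
```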

2.2. Existing privacy-protection mechanisms

A principle followed by most location privacy-protection mechanisms is to dissociate identity information from location data. Hence, a privacy-protection mechanism aims to achieve either identity privacy or (location) data privacy. A multitude of privacy-protection mechanisms have been proposed in recent years. In this section, we give a structured review of existing privacy-protection mechanisms, which we categorize into coarse-grained groups according to their specific methods and techniques. Figure 2.2 shows an overview of this coarse-grained classification of existing privacy-protection mechanisms. The details are given in the following sections. Since our focus is on VCS, we have chosen those mechanisms most relevant to VCS. Interested readers can also find surveys of other protection mechanisms in the context of pervasive computing in [38, 28].

2.2.1. Information flow control

To protect a person's privacy is to define, control, and monitor how the information about an individual is used and disseminated in a system. Information flow control therefore aims to control the flow of information from the information generator to the information sink in the system, where the information is processed and stored, or permanently removed.


[Figure 2.2 (Structured overview of existing privacy-protection mechanisms) is not reproduced here; it groups the mechanisms into information flow control (encryption and access control, privacy policy, privacy proxy), anonymization (k-anonymity, pseudonym, mix zone), degradation (obfuscation, perturbation), and dummy traffic.]

Encryption and access control

From a security point of view, information flow control can be regarded as a means to fulfill one of the fundamental security requirements, i.e., confidentiality. Therefore, many techniques to achieve confidentiality can be applied to enforce information flow control. Encryption and access control are two common security mechanisms that are applicable to privacy, for example, applying cryptographic schemes to encrypt location data [39] or leveraging access policies to control access to location data [40]. Encryption ensures data confidentiality in transmission and storage. Access control such as the Bell-LaPadula model [41] divides a system into multiple security levels and explicitly defines a user's access rights to detailed data. The basic idea is to prevent a user at a lower security level from accessing information at a higher level and to prevent information from flowing from a higher security level to a lower one. Encryption and access control are integrated into many of the existing privacy-protection mechanisms.

Privacy policy

A privacy policy achieves information flow control by defining and enforcing a set of privacy rules on how personal data is handled in the system. To enforce privacy in the system, such a policy must be implemented at all entities that generate, transmit, process, and store personal data. For example, Geopriv [42] is a standard designed by the IETF Geopriv working group for the transfer of location information and privacy policies over the Internet in a confidential and integrity-preserving manner. Geopriv specifies the format of the location data and the protocol to transfer such data. The representation of the location data (the location object) includes specifications on how to handle the location data and on the precision of this location data. The user (location generator) specifies the privacy policies in the location object. The location server, which is responsible for forwarding location information to a location recipient, implements the privacy policies and acts as a proxy for location generators. The location server can reduce the resolution of the location information and convert the location object to another format. The external location services (location recipients) access the location data under the privacy policies specified in the location object.

The privacy-aware system [43, 44] for ubiquitous computing environments requires data collectors to announce their privacy policies to the user and allows users to keep track of their personal data in the system. The design philosophy is that, by social or legal force, a system (especially one that collects and stores users' personal data) can implement its privacy-protection policies in accordance with a set of rules (i.e., the rules declared to the user in advance). The system is based on P3P [45], a web privacy framework that enables websites to encode rules on the collection and use of user data into machine-readable descriptions (e.g., XML code). Therefore, an end-user can be informed of the practices of the websites and make decisions according to his preferences. The design of the privacy-aware system includes 1) a set of machine-readable privacy policies, 2) policy announcement mechanisms to help users locate such policies, 3) privacy proxies which handle the privacy-relevant communications between users and data collectors, and 4) policy-based access control to the user data.

Alternatively, privacy policy and access control can be combined in such a way that access control is enforced by a set of policies. Snekkenes [46] proposes to let a user formulate the privacy policy as "who should have access to what location information under which circumstances" in location-based services. Access decisions about the user's location data are then based on the role of the location data requestor and the purpose of the request, the location, the identity of the object, and the time and speed of the object. Specifically, the purpose of a request is modeled in a lattice structure. Policy-based approaches assume that all entities in the system are able to implement the privacy policy. To ensure the enforcement of privacy policies, Kargl et al. [47] propose a Privacy-enforceable Runtime Architecture (PeRA) to enforce privacy policies in VCS by using the trusted computing platform to create a virtual trust domain, in which the components in remote systems can be ensured to comply with user-defined privacy policies.

Privacy proxy

A privacy proxy is an intermediate entity, strategically placed between a user and an untrusted third party in the system architecture. A proxy acts on behalf of a user to filter and forward the communications between the user and the third party. A privacy proxy is part of the aforementioned Geopriv and privacy-aware systems, handling the communications between users and data collectors. Various mechanisms can be implemented at the privacy proxy. For example, in [48], a trusted server acts as a privacy proxy in a typical location-based service scenario that includes mobile nodes and an untrusted third-party service provider. The trusted server stores the location information of the mobile nodes (either by monitoring the movement of the mobile nodes or by collecting location information sent from the mobile nodes). Whenever a location-based service needs to compute some result based on the location information of the mobile nodes, it sends a function (in the form of mobile code) to the trusted server. The code is executed on the trusted server, which accesses the relevant location information on the server and sends the result back to the location-based service. Before executing the function, the trusted server must first check whether the function from the location-based service satisfies the non-inference rule, i.e., that the result of the function cannot be used to infer other information.
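A policy of the kind Snekkenes describes, i.e., who may access what location information under which circumstances, can be pictured as a rule table evaluated per request. The rule fields and the decision logic below are invented for illustration and do not reproduce Geopriv, P3P, or any of the cited frameworks.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    requestor_role: str      # e.g. "traffic_operator", "lbs_provider"
    purpose: str             # e.g. "traffic_management", "poi_search"
    min_granularity_m: float # finest location resolution the rule allows to be released
    hours: range             # hours of the day during which the rule applies

def decide(rules, role, purpose, hour, requested_granularity_m):
    """Grant the request only if some rule covers the role, purpose, time, and granularity."""
    for r in rules:
        if (r.requestor_role == role and r.purpose == purpose
                and hour in r.hours and requested_granularity_m >= r.min_granularity_m):
            return "grant"
    return "deny"

policy = [Rule("traffic_operator", "traffic_management", 500.0, range(0, 24)),
          Rule("lbs_provider", "poi_search", 100.0, range(7, 22))]
print(decide(policy, "lbs_provider", "advertising", 10, 100.0))             # deny
print(decide(policy, "traffic_operator", "traffic_management", 3, 1000.0))  # grant
```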

2.2.2. Anonymization

Anonymization removes identity information from the user data such that the data cannot be used to link to a particular individual. In situations where identifiers or accurate location information must be provided in the communication, anonymization is a useful technique to protect a user's ID privacy.

k-anonymity

Originating from the database community but often applied in anonymous communication systems, the k-anonymity model [21] aims to provide useful data from a database while still preserving the privacy of the data subjects (a very useful glossary of privacy terminologies, composed mainly by the author, can be found at http://www.preciosa-project.org/index.php/privacy-terminology). The model consists of a set of rules to ensure that the information for each person contained in the released data cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the released data.

In [49], a k-anonymity model is used for anonymous usage of location information in a traffic-monitoring, Floating Car Data (FCD)-like system. The anonymization is achieved by an adaptive-interval cloaking algorithm. The desired degree of anonymity in the algorithm is specified by the minimum acceptable size of the anonymity set, k_min. The algorithm adjusts the spatial-temporal resolution of the reported data so that it covers at least k_min users.

Pseudonym

A pseudonym is an identifier of a subject other than one of the subject's real names [50]. First introduced by Chaum [51], pseudonyms are widely used in communications where identifiers are needed. Pseudonyms do not contain identifiable information about the user, hence the messages in the communications cannot be linked to the user based on these identifiers. Furthermore, a user needs to change pseudonyms from time to time to avoid being tracked via the same pseudonym.

In the context of VCS, pseudonyms usually refer to pseudonymous public key certificates, which do not contain any identifiable information and cannot be used to link to a particular user. Due to the safety-critical nature of vehicular systems, VCS has a set of stringent security requirements [10]. Pseudonyms provide a good solution to balance the requirements on security and privacy in VCS. In a typical scenario, vehicles are equipped with pseudonyms and their corresponding secret keys. When sending a message, a vehicle signs it with its secret key and attaches the signature and the pseudonym certificate to the message so that receivers can verify the signature. Vehicles also have to change pseudonyms often to make it hard for an attacker to link different messages from the same sender. To use pseudonyms in a communication system, a pseudonym management system is needed to manage the issuance, usage, resolution, and revocation of pseudonyms. In a bigger picture, pseudonym management can be regarded as a special case of Identity Management (IDM) systems [52]. A large number of pseudonym-based approaches have been proposed in recent years, which often involve quite complex cryptographic schemes. In [11], we review and compare most of these approaches. Some of them are listed below.

In the SeVeCom project (http://www.sevecom.org), we employ a hierarchical Certificate Authority (CA) structure [16].


The CAs manage and issue long-term identities to vehicles. Pseudonyms are issued by pseudonym providers (PPs) and are only valid for a short period of time. When issuing pseudonyms, a PP authenticates a vehicle by its long-term identity and keeps the pseudonym-to-identity mapping in case of a liability investigation. Provided with a pseudonym, pseudonym resolution authorities can resolve an identity by accessing the pseudonym-to-identity mappings at a PP. Pseudonyms are intentionally set to have a short lifetime to minimize the need for pseudonym revocation. To exclude a misbehaving or compromised vehicle from participating in the communications, CAs distribute certificate revocation lists (CRLs) that include the vehicle's long-term identity to PPs to prevent it from acquiring new pseudonyms from a PP.

Due to the short lifetime of pseudonyms, a vehicle needs to regularly contact the pseudonym provider to obtain new sets of pseudonyms. This process is called pseudonym refill. In [19], we propose a refill strategy which minimizes the need for pseudonym revocation and enhances security and privacy in vehicular communications. To implement this refill strategy we propose a privacy-enhancing pseudonym-on-demand (POD) scheme.

For liability investigations, a pseudonym provider needs to store the pseudonym-to-identity mapping in order to be able to resolve a pseudonym. In [53], we propose a Vtoken scheme, in which resolution information is directly embedded in pseudonyms and can only be accessed when multiple authorities cooperate for identity resolution. Using a blind signature scheme, our privacy-preserving pseudonym issuance protocol ensures that pseudonyms contain valid resolution information but prevents issuing authorities from creating pseudonym-identity mappings.

As an alternative, the PKI+ approach from Armknecht et al. [54] is based on bilinear mappings on elliptic curves. A user obtains a master key and certificate from a CA after it proves its identity and knowledge of a user secret x to the CA. The user can then self-generate pseudonyms by computing a public key from the master certificate, the secret x, and a random value. A certificate is computed as a signature-of-knowledge proof over the public key and the master public key. The certificate also includes the version number Ver of the CA's public key for revocation purposes. The user signs a message m by computing a signature-of-knowledge proof on m. A receiver of m can verify the message with the public key in the pseudonym. When revoking a user, the CA publishes new version information Ver + 1, which has to be used by all users to update their keys.

The blind signature approach from Fischer et al. [55] uses blind signatures and secret sharing in the pseudonym issuance protocol to enforce distributed pseudonym resolution.


In the pseudonym issuance process, a user blinds the public key to be signed and presents shares of it to a number of CAs. Each CA holds a partial secret of the secret key shared by all CAs in a secret sharing scheme. Each CA signs the presented blinded key share with its partial secret key, returns it to the user, and stores a corresponding partial resolution tag in its database. The user can unblind and combine the received results, yielding a certificate which can be verified with a public key common to all CAs. To resolve a pseudonym, a number of CAs have to cooperate in a second secret sharing scheme to compute a joint resolution tag for the presented pseudonym and compare it to all tags in the database.

Other approaches exploit the features of group signatures. A group signature is a signature scheme that provides conditional anonymity to the members of a group. Each group member can create signatures verifiable with a common group public key. However, only the group manager can assign individual secret keys or membership certificates to the group members and use the unique secret key or certificate to determine the identity of a signer. The hybrid approach from Calandriello et al. [17] uses group signatures to reduce the overhead of key and pseudonym management. Vehicles are members of a group and possess individual secret keys. Each vehicle generates random public/secret key pairs to be used for pseudonymous communications. The public keys are signed with the group secret key, yielding a pseudonym certificate that can be verified with the group public key. A receiver of a message can verify with the group public key that the pseudonym was created by a legitimate group member. The group manager, however, is able to open group signatures and retrieve the signer's identity, if necessary.

The Group Signature and Identity-based Signature (GSIS) protocol from Lin et al. [56] is based on short group signatures. In their approach, a CA acts as the group manager. The CA computes a group public key and group secret keys for each vehicle in the group from their unique identifiers. With the identifier and a part of the secret key, the CA is able to determine the identity of a group member. Thus accountability can be achieved while at the same time impersonation attacks are prevented. A vehicle signs messages with its own secret key, and receivers can verify them with the group public key. Revocation is achieved by distributing revocation lists.

In a different cryptographic scheme, identity-based cryptography (IBC) can be used to derive public keys from the identity of a user. Presented with a signature, a verifier can check its validity merely by knowing the sender's identity. An example is the Efficient Conditional Privacy Preservation (ECPP) protocol from Lu et al. [57], which utilizes both IBC and group signatures.
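Setting the cryptographic details of the individual schemes aside, the pseudonym lifecycle described above — issuance against a long-term identity, short validity, periodic refill, and revocation via the long-term identity — can be illustrated with the following minimal sketch. The class and method names are assumptions for illustration only; a real pseudonym provider would issue signed certificates rather than opaque random tokens.

```python
import secrets
from datetime import datetime, timedelta

class PseudonymProvider:
    """Toy pseudonym provider: authenticates a vehicle by its long-term identity,
    hands out short-lived pseudonyms, and keeps the pseudonym-to-identity mapping
    for later resolution (as in the hierarchical CA/PP setting described above)."""

    def __init__(self, lifetime: timedelta = timedelta(minutes=10)):
        self.lifetime = lifetime
        self.revoked_identities = set()      # long-term identities on the CRL
        self.mapping = {}                    # pseudonym -> long-term identity

    def refill(self, long_term_id: str, count: int):
        if long_term_id in self.revoked_identities:
            raise PermissionError("identity revoked, refill refused")
        now = datetime.utcnow()
        batch = []
        for _ in range(count):
            pseudonym = secrets.token_hex(8)             # stand-in for a certificate
            self.mapping[pseudonym] = long_term_id       # kept for liability resolution
            batch.append((pseudonym, now + self.lifetime))
        return batch

    def resolve(self, pseudonym: str) -> str:
        """Pseudonym resolution by the responsible authority."""
        return self.mapping[pseudonym]

    def revoke(self, long_term_id: str):
        """CRL entry: the vehicle can no longer obtain new pseudonyms."""
        self.revoked_identities.add(long_term_id)

pp = PseudonymProvider()
pseudonyms = pp.refill("vehicle-4711", count=3)
pp.revoke("vehicle-4711")          # further refills are now rejected
print(pp.resolve(pseudonyms[0][0]))
```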


Mix zone

Beresford and Stajano [24] found that, due to the spatial-temporal correlation in a user's movement, a user can still be tracked despite frequent changes of pseudonyms. Consequently, they propose to use mix zones to enhance the effectiveness of pseudonym changes. In general, a mix zone is a geographic region in which users' movements cannot be observed by an attacker, and hence they can switch to new pseudonyms unlinkable to the old ones. To understand the basics of a mix zone, consider the simple example illustrated in Figure 2.3. The users enter the mix zone from an application zone, in which their activities can be observed by an untrusted application. The border between the application zone and the mix zone is the boundary line of the mix zone. When a user enters a mix zone, it generates an ingress event. By contrast, an egress event happens when a user exits the mix zone. In the example, three users with identifiers a, b, and c entering the mix zone generate three ingress events i1, i2, and i3. At some later points in time, three users with identifiers d, e, and f re-emerge in the application zone and cause egress events e1, e2, and e3. Consequently, the mix zone creates uncertainty about the mapping of egresses to ingresses. Thus an attacker, in this case the untrusted application, will have difficulty following a user's movement through the mix zone.

Figure 2.3.: Example of mix zone

Various techniques to create mix zones have been proposed. We divide them into two groups: system-centric mix zones and user-centric mix zones. A system-centric mix zone is an area pre-defined by the system. Users have to go through the area to "mix" their pseudonyms. On the contrary, in user-centric mix zones, users do not need to go to a pre-defined area. Instead, a group of users in the vicinity take advantage of their spatial and temporal similarities to communicate and coordinate their pseudonym changes, thus creating virtual mix zones on the move.
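To make the ambiguity of the example in Figure 2.3 concrete, the short sketch below enumerates the ingress-to-egress assignments an observer has to consider. Assuming the observer has no further knowledge, all 3! assignments are equally likely, corresponding to log2(6) ≈ 2.58 bits of uncertainty; the variable names are illustrative.

```python
from itertools import permutations
from math import log2

ingress = ["a", "b", "c"]          # pseudonyms seen entering the mix zone
egress = ["d", "e", "f"]           # pseudonyms seen re-emerging afterwards

# Every bijection between egress and ingress events is a candidate mapping.
candidates = [dict(zip(egress, perm)) for perm in permutations(ingress)]
print(len(candidates), "candidate mappings")        # 3! = 6
print("uncertainty:", log2(len(candidates)), "bits")
```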


Buttyán et al. [25] propose to use the natural layout of a road network to create mix zones in vehicular networks. The rationale is that, due to the scale of the road network, an attacker is not likely to deploy enough receivers to intercept all communications in the network. As a result, the road network can be divided into two distinct regions: the observed zones and the unobserved ones. Mix zones are areas where an attacker cannot observe the activities of vehicles. Therefore, an attacker has to track a vehicle by linking a vehicle leaving a mix zone to a vehicle previously entering it. In other words, the attacker tries to link a pseudonym that appeared at the egress of a mix zone to a pseudonym that previously disappeared at the ingress of the mix zone. The attacker builds knowledge of a mix zone by observing the vehicles entering and exiting the mix zone and constructing a matrix of probability distributions that model the entry and exit events. The attacker can then calculate the probability of each permutation of the exiting vehicles given a specific entering vehicle, and choose the one with the maximum probability. Thus the level of location privacy achieved by changing pseudonyms depends on several factors, such as the attacker's knowledge of the road network and the vehicle movement patterns in the mix zone.

Freudiger et al. [14] propose a cryptographic scheme to create mix zones in a vehicular network. Their basic idea is to encrypt communications at an intersection to thwart an eavesdropping attacker. The RSUs at the intersections are responsible for distributing the encryption keys. When a vehicle approaches an intersection, it sends a request for a symmetric key. The vehicle authenticates itself to the RSU by signing the request. The RSU replies with an encryption key signed by the RSU and encrypted with the vehicle's public key from the request. Afterwards, the vehicle can use the symmetric key to encrypt all communications in the intersection. As all vehicles approaching the same intersection will get the same symmetric key from the RSU, vehicles within the intersection can communicate with each other, but the attacker cannot read the content of the encrypted messages. Although an attacker is prevented from listening to communications in the mix zones, he can still monitor the ingress and egress events of the vehicles at the mix zones and try to map the events by their spatial and temporal correlations. Therefore, the effectiveness of a cryptographic mix zone depends on the road topology and the vehicle density at the intersection.

The silent period approach from Huang et al. [58, 59] can be regarded as one way to create user-centric mix zones. A silent period is the transition time interval between an old and a new pseudonym, during which a node keeps quiet. As a result, time and space ambiguity is introduced into the relation between the old and new pseudonyms.


However, an attacker might still be able to track a node's movement by correlating the positions and times of the two pseudonyms. To be effective, the silent periods of the nodes need to be synchronized. The synchronization is achieved by dividing the silent period into two parts: a constant period and a variable one. A node in silent mode first keeps a constant silent period, after which it senses the medium for other nodes' transmissions during the variable silent period. As soon as the node detects other nodes' transmissions, it ends the variable silent period and emerges with a new pseudonym. The same authors further propose the silent cascade [60], in which a user switches between a silent state and an active state periodically for an extended period of time to create a cascade of mix zones.

As an extension to the silent period, Li et al. propose the Swing & Swap approach [61]. Basically, their approach is a scheme to coordinate and enhance the effectiveness of silent periods. In the Swing mode, a node broadcasts messages to inform and synchronize its neighboring nodes when it changes its pseudonym after a silent period. In the Swap mode, two nodes exchange and use each other's old pseudonyms after the silent period. In addition, to make it more difficult for an attacker to track the nodes, the nodes only change their pseudonyms when their movements, such as speed and direction, change.

The CARAVAN approach from Sampigethaya et al. [62, 15] is another scheme to create user-centric mix zones by using cluster-based communications. Due to vehicle mobility, vehicles tend to form clusters while driving, i.e., several vehicles travel at the same speed and keep the same distance to each other, especially on highways. The CARAVAN approach exploits this property by grouping vehicles into clusters and letting one of the vehicles in the group act as a proxy for the group members for anonymous communications with entities outside the group. Hence each group forms a virtual moving mix zone.

2.2.3. Degradation

Whereas the main goal of anonymization is to achieve ID privacy, the main goal of degradation is to achieve data privacy by deliberately degrading the accuracy of location data. Generally, there are two ways to degrade data: data obfuscation and data perturbation.

Data obfuscation

Data obfuscation degrades the quality and resolution of data in a controlled way. Duckham and Kulik [63, 64] propose a graph-based location obfuscation model and negotiation algorithm to protect a user's location privacy from an untrusted location-based service (LBS) server. In their approach, whenever a user requests a location-based service, he sends a request with a location obfuscation set, which includes his exact location and other locations in its proximity.


The LBS server finds the possible points of interest that satisfy all locations in the user's request. The user's obfuscation set is partitioned, i.e., the locations in the set are divided into one or more subsets according to their distances to the points of interest. If there is more than one subset within the user's obfuscation set, the LBS server negotiates with the user to find the most relevant point of interest: either the user chooses to further reveal his location by indicating which partition he is in, or the server arbitrarily chooses a partition and returns the nearest points of interest from that partition to the user. The spatial and temporal cloaking algorithm in [49], mentioned earlier in the context of k-anonymity, is also an algorithm for data obfuscation. The algorithm either degrades the location information spatially, by adjusting the size of the area around the exact location of the user until k_min or more users are included in the area, or degrades it temporally, by delaying the forwarding of a request until k_min users have visited the area the actual user is in.

Data perturbation

Different from data obfuscation, which degrades data quality and resolution, data perturbation deliberately introduces errors into location data in a controlled way such that the data still meets a given quality-of-service (QoS) requirement. Hoh and Gruteser [65] introduce a path perturbation algorithm to prevent an attacker from tracking an individual's movement path. The path perturbation algorithm first defines the mean location error for a set of users' paths that is tolerable to an application. In a sense, the mean location error is the application's QoS requirement. The path perturbation algorithm then perturbs location information every time two user paths are in close proximity, to confuse the attacker into following the wrong user. Formally, the perturbation is formulated as a constrained nonlinear optimization problem and solved by sequential quadratic programming. Finally, the perturbed positions are used to modify the original set of location samples within the perturbation radius.

2.2.4. Dummy traffic

Inserting dummy traffic into the communication is another technique to confuse an attacker and to hide the real identity and location of a user. A mix in a mix network [51] can forward dummy messages along with real messages to thwart an attacker from performing traffic analysis, i.e., finding the sender and recipient of a message without knowing the content of the message.


Kamat et al. [66] propose a phantom routing protocol for source node location privacy in Wireless Sensor Networks (WSNs). The idea is to create a false source location different from the real one and thus lead an attacker away from the actual position of the source node. To create a falsified source location, the routing of a message is divided into two steps. The first step is a random walk, in which a message is directed to a random node or a selected position away from the source node by single-path routing. In the second step, the message is routed to the sink by either flooding or single-path routing starting from the falsified source. One disadvantage of dummy traffic is that it increases communication overhead and computation cost. Kido et al. [67] propose to let a user send dummy requests to a service provider in a location-based service, i.e., send false positions and the true position at the same time. To reduce the communication cost, they propose techniques which construct false positions in the request messages in a more compact way and use keywords in the reply messages to reduce the transmitted data. To confuse an attacker, the dummy traffic should be as realistic as possible. Krumm [68] develops probabilistic models of driving behavior to create realistic driving trips, which can be used as false location reports with high realism.
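As a minimal illustration of the dummy-request idea of Kido et al., the following sketch bundles a user's true position with a few randomly placed false positions before a request is sent to a service provider. The function name and the uniform placement of the dummies are assumptions for illustration; realistic dummies would have to move consistently over time, which is what Krumm's driving-behavior models aim to provide.

```python
import random

def build_request(true_pos, n_dummies=4, spread=0.01):
    """Return a shuffled list of positions containing the true one and n_dummies
    false ones; the provider cannot tell which entry is real."""
    lat, lon = true_pos
    dummies = [(lat + random.uniform(-spread, spread),
                lon + random.uniform(-spread, spread)) for _ in range(n_dummies)]
    request = dummies + [true_pos]
    random.shuffle(request)
    return request

print(build_request((48.40, 9.99)))
```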

2.2.5. Real-world implementations

Although most of the aforementioned privacy-protection mechanisms originated from and have circulated in the research community, several real-world implementations exist to date which apply the same principles to a certain degree. For example, The Onion Router (TOR) [69] is a non-profit implementation of the onion routing technique [70] (in onion routing, messages are encrypted multiple times and sent through network nodes known as onion routers; each onion router decrypts one layer of the encryption to uncover routing instructions and sends the message to the next router). It follows the same principle as a mix network and provides anonymous communications for Internet traffic. In cellular networks, a subscriber of the Global System for Mobile Communications (GSM) network has an International Mobile Subscriber Identity (IMSI) stored in the subscriber identity module (SIM) card. To prevent identification and tracking of the subscriber's mobile phone, short-term pseudonyms, called Temporary Mobile Subscriber Identities (TMSIs), are used for daily communications instead of the long-term IMSI. As soon as a subscriber enters a new geographic area, it is randomly assigned a TMSI by the local network's Visitor Location Register (VLR) and uses it afterwards.


2.3. Existing privacy metrics

A metric is a system or standard of measurement. A privacy metric for a communication system is a set of measurements which map the level of privacy to numeric values. The numeric values are usually from a partially-ordered group, such that the magnitude of the numbers reflects the underlying privacy values in the system. In this section, we review existing privacy metrics.

2.3.1. Privacy-related concepts and notions

Since most privacy metrics try to capture a certain aspect of privacy and map the privacy values to numbers, we will first look at the common privacy-related concepts and notions. Buttyán and Hubaux [71] summarize the most important privacy-related notions as anonymity, untraceability, unlinkability, unobservability, and pseudonymity. Anonymity is related to hiding who performed a given action. Untraceability aims at making it difficult for the attacker to identify that a given set of actions was performed by the same subject. Unlinkability is the generalization of anonymity and untraceability, meaning hiding information about the relationships between two related items. Unlike unlinkability, which hides the relationships between items, unobservability aims at hiding the items themselves. Pseudonymity means using pseudonyms instead of real identities. In practice, anonymity and unlinkability are two common ways to achieve user privacy in communication systems. Consequently, privacy can be measured in terms of the degree of anonymity or the degree of unlinkability.

In a more detailed definition, Pfitzmann and Hansen [50] define anonymity as the state of not being identifiable within a set of subjects, the anonymity set (a subject is defined, by the same authors, as a possibly acting entity such as a human being, i.e., a natural person, a legal person, or a computer). In their definition, unlinkability of two or more items of interest (IOIs, e.g., subjects, messages, or actions) means that within the system (comprising these and possibly other items), from the attacker's perspective, these items of interest are no more and no less related after his observation than they are related concerning his a-priori knowledge. They further point out that the concepts of anonymity and unlinkability are convertible. For example, if we consider the action of sending a message and the action of receiving the message as two items of interest, anonymity can be defined as the unlinkability of any subject to the IOIs, i.e., to the actions of sending and receiving messages. In their work to refine the definition of anonymity and to formalize the notion of unlinkability, Steinbrecher and Köpsell [72] conclude that anonymity is a notion usually restricted to users with respect to a specific action, whereas unlinkability is applicable to any arbitrary item of interest, e.g., individuals, pieces of information, or actions.

2.3.2. Anonymity set-based metrics

Since the users in anonymous communication systems form anonymity sets, a straightforward way to measure anonymity is to calculate the size of the anonymity set, i.e., the total number of users indistinguishable from each other with respect to a piece of information or a specific action in the system. For example, the number of all users that could potentially send or receive a specific message constitutes the size of the anonymity set with respect to that message.

Arguably the most popular metric based on the anonymity set is k-anonymity [21]. Originally, k-anonymity has been a privacy model for data released from a database. The structure of the data release is a table with rows and columns. Each row corresponds to a tuple and each column to an attribute, which denotes a category of information with a set of possible values. The attributes include identifiers (e.g., name and address), which can uniquely identify a person, as well as attributes that, when combined, can also identify a person (e.g., birth date and gender). Such attributes are called quasi-identifiers. A data release satisfies k-anonymity if and only if each tuple of the table is indistinguishable from at least k − 1 other tuples with respect to the quasi-identifiers.

Gruteser and Grunwald [49] apply k-anonymity to the design of a cloaking algorithm for anonymous usage of location information. In their system, a user is k-anonymous with respect to location information if and only if the location information presented is indistinguishable from the location information of at least k − 1 other users. The location information is a tuple ([x1, y1], [x2, y2], [t1, t2]), in which ([x1, y1], [x2, y2]) specifies a two-dimensional area where the user is located and [t1, t2] specifies a time period during which the user is in this area. A user is k-anonymous if the location information describes not only this user, but also k − 1 other users who are in the same area during the same time period. Hence, all users within the area defined by [x1, y1], [x2, y2] in the time period [t1, t2] form a k-anonymity set.

Measuring anonymity as the size of the anonymity set assumes that all members in the set are equally probable with respect to a piece of information or action. However, such an assumption does not hold under a more sophisticated attacker, who might possess more information and hence can assign different probabilities to the members of the set. Therefore, some members will become more "probable" than others in the same anonymity set.


Serjantov and Danezis [22] and Díaz et al. [23] identified this issue at the same time and point out that the size of the anonymity set does not reflect the different probabilities of the members in the set. Taking an information-theoretic approach, they propose to use entropy, which quantifies the probability distribution over the anonymity set, as a measurement for the degree of anonymity. Specifically, let X be a discrete random variable with probability mass function p_i = Pr(X = i) associated with an anonymity set of N members; an attacker can assign a probability p_i to each member of the anonymity set with respect to an item of interest, e.g., the sender of an email. Then the entropy of the anonymity set is

H(X) = -\sum_{i=1}^{N} p_i \log_2(p_i)    (2.1)

As entropy is a measure of information uncertainty in information theory [73], the degree of anonymity is directly related to the attacker's uncertainty about the anonymity set. The higher the uncertainty, the higher the degree of anonymity of the users in the anonymity set. Entropy reaches its maximum if all users have the same probability. On the other hand, if the attacker knows that one or more members are more probable than the others, the entropy will decrease, which leads to a decrease in anonymity. In contrast, the size of the anonymity set is no longer an accurate measure of anonymity, because it represents only the best-case scenario, in which all users are assumed to have the same probability.

It should be noticed that there is a slight difference between the two approaches. In [22], H(X) is directly used as the degree of anonymity, whereas in [23], Díaz et al. further compute the maximum entropy of the system H_M, in which all N members have equally distributed probabilities, as

H_M = \log_2(N)    (2.2)

and use the ratio of H(X) and H_M as the measure of the degree of anonymity, that is

d = \frac{H(X)}{H_M}    (2.3)
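The entropy-based degree of anonymity of equations (2.1)–(2.3) is straightforward to compute. The following sketch does so for a four-member anonymity set, once with uniform probabilities and once with a skewed distribution; the probability values are invented purely for illustration.

```python
from math import log2

def degree_of_anonymity(probabilities):
    """Return (H(X), H_M, d) for an anonymity set with the given attacker-assigned
    probabilities, following equations (2.1)-(2.3)."""
    h = -sum(p * log2(p) for p in probabilities if p > 0)
    h_max = log2(len(probabilities))
    return h, h_max, h / h_max

print(degree_of_anonymity([0.25, 0.25, 0.25, 0.25]))   # uniform: d = 1.0
print(degree_of_anonymity([0.70, 0.10, 0.10, 0.10]))   # skewed: d < 1.0
```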

The anonymity set is originally intended for anonymous communication systems with stationary users. However, a main feature of wireless communication systems is user mobility, which is not explicitly captured and expressed by the anonymity set. For this reason, Huang et al. [59] propose the geographical anonymity set (GAS), which is a set of users who form an anonymity set geographically due to their intersecting or interwoven trajectories (i.e., segments of users' movements). The GAS of a user's identifier i with trajectory T_i is defined as the subset of all identifiers ID which satisfies

GAS(i) = \{ j \mid j \in ID, \exists T_i, T_j \in \mathcal{T}, p_{i,j} \neq 0 \}    (2.4)

where \mathcal{T} is the set of all trajectories, and p_{i,j} is the probability that the two observed trajectories T_i, T_j belong to the same user. Thus p_{i,j} \neq 0 means that T_i and T_j might be two consecutive segments of a user's movement in space and time. In other words, a user's geographical anonymity set contains all trajectories that might be the continuation of the user's current movement. Subsequently, a user's level of location privacy is either measured as the size of the GAS

S_i = |GAS(i)|    (2.5)

or measured as the entropy of the GAS

H_i = -\sum_{j \in ID} p_{i,j} \log_2(p_{i,j})    (2.6)

2.3.3. Mix zone-based metrics

Several privacy metrics propose to use the location privacy provided by mix zones (cf. Section 2.2.2) to measure a user's level of location privacy. Since any unobserved area in a system can be modeled as a mix zone, a mix zone provides a quantitative measure of the level of location privacy of a traversing user. When traversing a mix zone, a user enters the mix zone at the ingress location and exits it at the egress location. One way to measure the location privacy provided at a mix zone is to calculate the size of the anonymity set, which consists of all users visiting the mix zone in the same time period. However, Beresford and Stajano [24, 74] point out that the anonymity set does not capture user entry and exit motions. Due to the spatial and temporal correlation of a user entering and exiting the mix zone, an attacker observing the events at the ingress and egress of the mix zone can assign different probabilities to these events. Hence entropy based on the probability distribution of user egress-ingress events gives more accurate measurements of location privacy. The basic calculation is presented in the following.

Consider a user who is traveling through a mix zone y at time t. Assume that an attacker can observe the user at the preceding zone x at time t − 1, and at the subsequent zone z at time t + 1. The probability of the pair, p(x, z), is the probability that a user enters mix zone y from x and continues to z (i.e., x → y → z). If historical data on how users move through the mix zone y is available, we can calculate the frequencies of each pair (x, z) and store them in a movement matrix M. Based on the matrix, we can generate the normalized joint probabilities as follows:

p(x, z) = \frac{M(x, z)}{\sum_{i,j} M(i, j)}    (2.7)

where i, j are the rows and columns of the matrix, respectively. The conditional probability of a user continuing to zone z, having been to zone x, is calculated as follows:

p(z \mid x) = \frac{M(x, z)}{\sum_{j} M(x, j)}    (2.8)

The information content associated with a set of possible outcomes with probabilities p_i can be calculated as

h = -\sum_{i} p_i \log_2(p_i)    (2.9)

to yield the entropy.
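The calculation of equations (2.7)–(2.9) can be sketched as follows. The movement matrix below contains assumed example counts of observed (preceding zone, subsequent zone) pairs; from it, the sketch derives the conditional exit probabilities of equation (2.8) and the entropy of equation (2.9) for a user known to have entered from zone x = 0.

```python
from math import log2

# Assumed historical counts M[x][z]: how often users went x -> mix zone -> z.
M = [
    [ 2, 30, 10],   # entered from zone 0
    [25,  1, 14],   # entered from zone 1
    [ 8, 12,  3],   # entered from zone 2
]

def conditional_probs(M, x):
    """p(z | x) from equation (2.8): normalise row x of the movement matrix."""
    row_sum = sum(M[x])
    return [M[x][z] / row_sum for z in range(len(M[x]))]

def entropy(probs):
    """Equation (2.9): the attacker's uncertainty about the exit zone."""
    return -sum(p * log2(p) for p in probs if p > 0)

p_given_x0 = conditional_probs(M, x=0)
print(p_given_x0)                  # e.g. [0.048, 0.714, 0.238]
print(entropy(p_given_x0))         # bits of uncertainty for this ingress
```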

Originally, the mix zone was designed for users of location-based services in wireless communication systems such as cellular networks. Applying the same principle, Buttyán et al. [25] use the entropy provided by mix zones to evaluate the location privacy achieved by the vehicles in vehicular networks. In their approach, a continuous part of a road network is modeled as a mix zone, where an attacker cannot hear the communications of the vehicles. In their mix model, a mix zone has a set of ports corresponding to the layout of the road network, at which the vehicles enter and exit. In a similar way to the approach of Beresford and Stajano, the attacker's knowledge is summarized in a movement matrix Q = [q_{ij}] with M rows and M columns, where M is the number of ports of the mix zone and q_{ij} is the conditional probability that a vehicle chooses to exit at port j given that it enters at port i. In addition, the attacker's knowledge is also modeled by M^2 discrete probability density functions f_{ij}(t) (1 ≤ i, j ≤ M), based on the matrix Q, which describe the probability distributions of the delay experienced by a vehicle traversing the mix zone between port i and port j. The attacker's goal is to link the exiting vehicles to the entering ones. If a vehicle v enters the mix zone at port i at time 0 and afterwards a vehicle v' exits the mix zone at port j at time t, the attacker computes the probability p_{jt} = q_{ij} f_{ij}(t) to determine whether or not v' is actually v. If there is more than one exiting vehicle, the attacker will compute the probabilities for each of them and choose the one with the maximum p_{jt}. To quantify the level of location privacy provided by a mix zone, Buttyán et al. propose to use the success ratio, i.e., the ratio of correctly linked vehicles traversing the mix zone to all vehicles traversing the mix zone in the same time period.


In vehicular networks, if a vehicle travels through a chain of mix zones, it will "collect" the location privacy provided by each of the mix zones on the chain. Assuming that the mix zones are independent, Freudiger et al. [14] approximate the location privacy of a vehicle v traveling through a chain of L mix zones as the sum of all entropies collected at each of the mix zones, that is

H_{total}(v, L) = \sum_{i=1}^{L} H_i(v).    (2.10)

The mix zone can be regarded as a best-effort approach to location privacy, because the effectiveness of a mix zone depends on the actual traffic density within the mix zone. Therefore, the location privacy provided by a mix zone cannot be precisely determined, and hence measured, at design time. Freudiger et al. [26] propose a flow-based metric that theoretically measures the effectiveness of a mix zone prior to its operation. The outcome of the measurement can help with the optimal placement of mix zones in mobile networks. In the flow-based metric, a mix zone is traversed by flows of mobile nodes from entrances to exits. The entering and exiting events of the mobile nodes are statistically modeled as flows following a Poisson process with arrival rate λ. The authors claim that the advantage of such an approach is that the Poisson process can statistically generalize the traffic flows to a mix zone, hence the mix zone can be evaluated independently of the actual traffic. The effectiveness of the mix zone is measured as the probability that the attacker assigns an exit event (i.e., a mobile node leaving the mix zone at an exit l at time t) to the wrong flow. The Jensen-Shannon (JS) divergence is used to calculate the upper and lower bounds of the error probability. For a mix zone traversed by m flows, the JS divergence is

JS_\pi(p_1, \ldots, p_m) = H\left(\sum_{i=1}^{m} \pi_i p_i(x)\right) - \sum_{i=1}^{m} \pi_i H(p_i(x)),    (2.11)

where π_i is the probability that an observed exit event belongs to flow f_i, and p_i(x) is the probability that the flow f_i generates an exit event x at time t_x. Furthermore, the overall effectiveness of a mix zone with L exits is measured as the average error probability, that is

\bar{p}_e^i = \frac{\sum_{l \in L} p_{e,l}^i}{|L|},    (2.12)

where p_{e,l}^i is the error probability of the attacker at exit l.
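Equation (2.11) can be evaluated directly once the exit-time distributions of the flows have been discretised. The sketch below does this for two assumed flows with made-up exit-time distributions and equal flow probabilities; a large divergence indicates that the flows are easy to tell apart, i.e., a small attacker error probability.

```python
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def js_divergence(weights, distributions):
    """Equation (2.11): H(sum_i pi_i p_i) - sum_i pi_i H(p_i) for discretised
    exit-time distributions p_i and flow probabilities pi_i."""
    mixture = [sum(w * p[k] for w, p in zip(weights, distributions))
               for k in range(len(distributions[0]))]
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, distributions))

# Two flows, exit times discretised into four slots (values are illustrative).
p1 = [0.70, 0.20, 0.05, 0.05]
p2 = [0.05, 0.15, 0.40, 0.40]
print(js_divergence([0.5, 0.5], [p1, p2]))   # high value: flows are easy to tell apart
```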

2.3.4. Tracking-based metrics

Tracking means following an object's movement in space and time by linking a sequence of location measurements of the object. If a tracking system keeps the historical locations of an object being tracked, then it produces a path, also called a track or trace, i.e., a timely-ordered sequence of locations that reveals a user's movement in space over time. As location privacy is about a user's location information, tracking is another common location privacy metric. In the context of wireless communication systems, an attacker can track a user's movement either by linking a series of messages from the same user according to the identifiers (e.g., user ID, IP address, or MAC address) in the messages sent, or by the spatial-temporal correlations within the messages. Thus, from the privacy perspective, location privacy can be regarded as unlinkability between the messages sent. Multiple target tracking in tracking systems [75] aims at finding the tracks of multiple moving targets from noisy observations. From a theoretical point of view, tracking solves the data association problem, which deals with finding a partition of the observations such that each element of the partition is a collection of observations generated by a single user [76]. Various metrics have been developed in tracking systems to evaluate the effectiveness of a given tracking algorithm. Most of them share a common criterion, that is, how long a target can be correctly tracked. As we will see in the following, this criterion is used in tracking-based privacy metrics as well.

Gruteser and Hoh [35] point out that in wireless communication systems, a user has to frequently reveal his location points, and these points can be linked to reveal his trajectory. Therefore, point anonymity, commonly measured by the anonymity set, is insufficient to capture the level of location privacy, and trace anonymity is a better measure. They propose to use a tracking system to measure the level of location privacy and to derive privacy mechanisms for the users of wireless communication systems. In their work, they choose Reid's multiple hypothesis tracking (MHT) algorithm [36]. Given a set of anonymous GPS location samples from multiple users, the MHT algorithm tracks the users' movements. The users' levels of location privacy are expressed as the result of the tracking algorithm, i.e., how long and how correctly the tracking algorithm can link the location samples of each of the users.

Sampigethaya et al. [62] combine tracking and the anonymity set to evaluate location privacy in vehicular networks. Assuming vehicles use pseudonyms in communications, the anonymity set of a vehicle is defined as the set of pseudonyms that are indistinguishable from other vehicles' pseudonyms to an attacker. A vehicle's location privacy is measured as the maximum tracking time of the anonymity set. Let ρ be the density of the vehicles on the streets and A_r be the reachable area of a vehicle from its last transmission; the expected size of the vehicle's anonymity set |S_A| can be statistically derived as

E\{|S_A|\} = \frac{\rho A_r}{1 - e^{-\rho A_r}}.    (2.13)

The anonymity set includes all vehicles that appear in A_r with a new pseudonym, and its size has a Poisson distribution. Subsequently, the maximum tracking time of a vehicle is the maximum cumulative time during which there is only one vehicle in the anonymity set, in other words, during which the size of the anonymity set equals one. Let p_track denote the probability that a vehicle can be uniquely identified from an anonymity set; the expected maximum tracking time is

E\{T_{track}\} = \frac{E\{s_{period}\}}{1 - p_{track}},    (2.14)

in which s_period denotes the interval between two consecutive transmissions.
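As a numeric illustration of equations (2.13) and (2.14), the following sketch evaluates the expected anonymity-set size and the expected maximum tracking time; the vehicle density, reachable area, beacon interval, and identification probability used here are assumed values, not measurements.

```python
from math import exp

def expected_anonymity_set(rho, area):
    """Equation (2.13): E{|S_A|} for vehicle density rho (vehicles/m^2)
    and reachable area A_r (m^2)."""
    return rho * area / (1 - exp(-rho * area))

def expected_max_tracking_time(s_period, p_track):
    """Equation (2.14): expected maximum tracking time for beacon interval
    s_period and identification probability p_track."""
    return s_period / (1 - p_track)

rho = 0.001          # assumed: one vehicle per 1000 m^2
area = 2500.0        # assumed reachable area between two transmissions, in m^2
print(expected_anonymity_set(rho, area))             # ~2.7 vehicles
print(expected_max_tracking_time(1.0, p_track=0.6))  # 2.5 s for 1 s beacons
```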

In [37], we use a simulation method to study the effectiveness of pseudonym changes in vehicular networks. We apply the MHT algorithm to vehicle movement traces from the STRAW vehicular mobility model [77] coupled with the JiST/SWANS ad-hoc network simulator [78]. Using the mean tracking duration as a metric, our study shows that target tracking systems can quite effectively track vehicle movements despite the fact that the vehicles frequently change their pseudonyms.

As a variant of tracking time, Hoh et al. [13] propose the metric of mean time to confusion, which combines entropy with tracking time. Based on the observation that the degree of privacy strongly depends on how long an attacker can follow a vehicle, the mean time to confusion metric measures the degree of privacy as the tracking time during which an attacker can correctly follow a vehicle's trace, until the point in time at which the attacker can no longer determine the trace with sufficient certainty. The tracking confusion is measured in entropy. The tracking uncertainty at any point on the trace of a vehicle is defined as

H = -\sum_{i} p_i \log_2(p_i),    (2.15)

where p_i denotes the probability that location sample i belongs to the vehicle. Lower entropy indicates that the attacker has more certainty and hence the vehicle has lower privacy. Since the tracking uncertainty H varies over time, the mean time to confusion is calculated as the mean tracking time during which the uncertainty stays below an arbitrarily defined confusion threshold.
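The time-to-confusion idea can be sketched as follows: given a time-ordered sequence of per-sample tracking entropies, tracking counts as successful while the uncertainty stays below the confusion threshold, and the mean of the resulting run lengths gives the mean time to confusion. The entropy values in the example are invented for illustration.

```python
def times_to_confusion(entropies, threshold, sample_interval=1.0):
    """Lengths (in time) of the runs during which the attacker's uncertainty
    stays below the confusion threshold."""
    runs, current = [], 0
    for h in entropies:
        if h < threshold:
            current += 1                       # still tracking with confidence
        elif current:
            runs.append(current * sample_interval)
            current = 0
    if current:
        runs.append(current * sample_interval)
    return runs

trace_entropy = [0.1, 0.2, 0.1, 1.4, 0.3, 0.2, 0.2, 1.6, 0.1]   # bits, illustrative
runs = times_to_confusion(trace_entropy, threshold=1.0)
print(runs, "mean:", sum(runs) / len(runs))
```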


2.3.5. Distance-based metrics

Since an attacker against location privacy is assumed to learn a user's current and past locations [24], the ultimate goal of the attacker is to reconstruct the user's locations from his observations, which are often incomplete and error-ridden. Distance-based metrics use the distances between a user's actual locations and the ones reconstructed at the attacker side to reflect the user's level of location privacy.

Hoh and Gruteser [65] point out that location privacy is related to uncertainty and distance. Although entropy is a suitable measure of uncertainty, it does not reflect the location differences. For example, for two users u1 and u2 with corresponding location samples l1 and l2, respectively, entropy yields the same result regardless of whether l1 or l2 is assigned to u1 or u2. Therefore, the authors propose to use the expectation of distance error as an alternative metric, and claim that it captures how accurately an attacker can estimate a user's location. The expectation of distance error d can be calculated as

E[d] = \frac{1}{NK} \sum_{k=1}^{K} \sum_{i=1}^{I} p_i(k) d_i(k),    (2.16)

where p_i denotes the probability of the attacker's i-th hypothesis, in which he assigns a user's identity to an observed location (note that p_i carries different meanings in different contexts), d_i is the total distance error between the correct assignment hypothesis and hypothesis i, N is the number of users, and K is the total observation time.

In a similar approach, Shokri et al. [79] use the distortion in an attacker's reconstructed trajectory as a metric for a user's location privacy in mobile networks. In their extensive definitions, an event is defined as a 3-tuple that consists of a user's identity, a time instance, and a location. A trace Υ is a set of events. A function tail(Υ) returns the last event of the trace (i.e., the tail of the trace). For a user u at time t, the function whereis(u, t) returns the actual location of the user. The expected distortion of user u at time t is defined as

ED(u, t) = \sum_{\Upsilon} D\big(whereis(u, t), loc(tail(\Upsilon))\big) \cdot \pi^x(\Upsilon),    (2.17)

where D(·) is a normalized distance function between two locations, loc(·) is a function that gives the identity, time, and location of an event, and π^x(Υ) is the probability assigned to a trace Υ. Then the distortion-based location privacy LP_d of a user u at time t is defined as

LP_d^u(t) = 1 - lts(u, whereis(u, t), t) \cdot (1 - ED(u, t)),    (2.18)

where lts(·) is a normalized location and time sensitivity function with values in the range [0, 1].

Similarly, Fischer et al. [80] propose an expected distance unlinkability measure to quantify the error made by an attacker when relating and grouping messages from the same sender. In their approach, sender-message relations are modeled as set partitions of the observed messages. Each partition is assigned a probability by the attacker. Let M be the set of all observed messages, π one of the set partitions of M, and Π_M the set of all possible partitions; the attacker assigns a probability to each partition π ∈ Π_M, so Π_M is weighted by a probability mass assignment P. Then the expected distance ed_M(P) on the message set M with a probability mass assignment P is defined as

ed_M(P) := \sum_{\pi \in \Pi_M} P(\pi) \cdot \delta(\pi, \tau),    (2.19)

where τ is the reference partition corresponding to the true sender-message relations, and δ(π, τ) is the distance between two partitions (i.e., between the attacker's partition π and the reference partition τ), calculated as the number of element pairs grouped differently in the two partitions. As an example, for a set of five messages M = ⟨m1, . . . , m5⟩, the two different partitions π1 = {{m1, m3, m4}, {m2, m5}} and π2 = {{m1, m4}, {m2, m5}, {m3}} of the same messages have a distance of δ = 2, due to the different grouping of m3 in the two partitions.
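The partition distance δ used in equation (2.19) counts the message pairs that are grouped together in one partition but not in the other. The sketch below reproduces the five-message example above, where π1 and π2 differ only in the placement of m3 and the distance is 2, and then computes the expected distance ed_M(P) for an assumed probability mass assignment.

```python
from itertools import combinations

def same_block_pairs(partition):
    """All unordered message pairs that share a block in the given partition."""
    pairs = set()
    for block in partition:
        pairs.update(frozenset(p) for p in combinations(sorted(block), 2))
    return pairs

def partition_distance(p, q):
    """Number of pairs grouped together in exactly one of the two partitions."""
    return len(same_block_pairs(p) ^ same_block_pairs(q))

pi1 = [{"m1", "m3", "m4"}, {"m2", "m5"}]
pi2 = [{"m1", "m4"}, {"m2", "m5"}, {"m3"}]
print(partition_distance(pi1, pi2))   # 2

# Expected distance ed_M(P) from equation (2.19), for an attacker who assigns
# probability 0.7 to pi1 and 0.3 to pi2, with pi1 as the true partition tau.
tau = pi1
P = {0: 0.7, 1: 0.3}
candidates = [pi1, pi2]
print(sum(P[i] * partition_distance(candidates[i], tau) for i in P))   # 0.6
```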

2.4. Discussion

In a recent survey on location privacy mechanisms, Krumm [28] raises the issue of the importance of quantifying location privacy. As he points out, progress in protecting location privacy depends on the ability to quantify it. However, there is neither a standard nor a consensus on how location privacy should be quantified across different research projects. Therefore, it is worth examining whether the privacy metrics presented in Section 2.3 can be applied to VCS to provide satisfactory privacy measurements. To evaluate these metrics, we identify a set of requirements on a privacy metric for VCS and then discuss the metrics along these requirements.


2.4.1. Requirements on privacy metrics

Our main goal is to find ways to quantify the level of location privacy in VCS. In accordance with this goal, we identify a set of requirements on a metric for location privacy in VCS. These requirements are derived from our analysis of the objectives of existing privacy metrics as well as the unique characteristics of VCS.

• Applicability. The metric should be applicable to VCS. More precisely, the metric should be able to capture the unique characteristics of VCS, such as vehicle movements and vehicular communications, as well as user behavior and the resulting location privacy in such a unique setting.

• Thoroughness. The metric should be able to provide location privacy measurements from a perspective in which all relevant aspects (e.g., privacy values in the short term vs. the long term, and privacy values at the user level vs. the system level) and their relations are taken into account.

• Accuracy. The metric should correctly capture and reflect privacy values in VCS. As privacy values are directly related to the ability of an attacker, the measurements should be based on a set of reasonable assumptions and sound estimations of the information and knowledge at the attacker side.

• Generality. As VCS are at a very early stage and still evolving, most features of the system (e.g., application scenarios, communication protocols, message formats, and privacy-protection mechanisms) have not been defined de facto or de jure. In parallel, new privacy threats emerge over time. Thus the metric should be independent of current developments and future changes in VCS. On the other hand, such generality should not compromise accuracy or overlook the specifics.

• Practicability. The metric should be practical. That is, given a description and a set of parameters of a VCS, the metric should provide measurements of the level of location privacy not in an abstract form, but in concrete numbers.

• Scalability. The metric should be able to calculate privacy values in large and complex systems like VCS.

2.4.2. Analysis of requirement fulfillment

By relating the existing privacy metrics to the identified requirements, we obtain an overview of the applicability of the existing metrics with respect to measuring location privacy in VCS. The result is summarized in Table 2.2. In the table, we use ○ to denote that we consider a requirement not fulfilled, ◐ to denote that a requirement is partially fulfilled, and ● to denote that a requirement is fully fulfilled.

Table 2.2.: Existing privacy metrics with respect to requirements

Privacy metric                   Applicability  Thoroughness  Accuracy  Generality  Practicability  Scalability
k-anonymity                            ◐              ○           ○          ◐            ●              ○
Entropy of anonymity set               ◐              ○           ◐          ◐            ●              ○
Geographical anonymity set             ◐              ○           ◐          ◐            ●              ○
Entropy of mix zone                    ●              ◐           ◐          ◐            ●              ●
Entropy of mix network                 ●              ◐           ◐          ◐            ●              ●
Error probability of mix zone          ●              ◐           ◐          ◐            ●              ●
Performance of target tracking         ●              ◐           ◐          ◐            ●              ◐
Maximum tracking time                  ●              ◐           ◐          ◐            ●              ◐
Mean time to confusion                 ●              ◐           ◐          ◐            ●              ◐
Distance error                         ●              ◐           ◐          ◐            ◐              ○
Distance distortion                    ●              ◐           ◐          ◐            ◐              ○
Distance of set partition              ●              ◐           ◐          ◐            ◐              ○
not fulfilled (○), partially fulfilled (◐), fulfilled (●)

Anonymity set-based metrics

The first metric in the table is k-anonymity. Since the anonymity set does not capture a user's movement, k-anonymity only partially fulfills applicability. k-anonymity only considers the degree of anonymity of a user within an anonymity set; thus it does not fulfill thoroughness. The size of the anonymity set assumes that all users in the set are equally probable, hence k-anonymity is not very accurate. On the other hand, k-anonymity can be used to reflect a user's degree of anonymity with respect to an event or an action. For example, in the context of VCS, each time a vehicle sends a message, all vehicles within the transmission range can be considered to form an anonymity set. Since any message in vehicular communications or any location at any instant can have a corresponding anonymity set, k-anonymity can be used as a measure of location privacy. However, since k-anonymity uses the size of the anonymity set to reflect the degree of anonymity,


the values cannot be normalized to enable us to compare the same user at any two different instants or any two users in different system settings. Thus k-anonymity only partially fulfills generality. The outcomes from k-anonymity are always concrete numbers, hence it is practical. However, if we measure a user's location privacy as the k-anonymity of each of the messages the user sent, or each of the locations the user visited, we will be overwhelmed by measurements and lose focus. Therefore, k-anonymity does not fulfill scalability.

Entropy of the anonymity set is very similar to k-anonymity, except that each user in the anonymity set is assigned a specific probability reflecting an attacker's fine-grained knowledge of the anonymity set. Therefore, it is somewhat more accurate in reflecting an attacker's knowledge of the anonymity set. The geographical anonymity set is directly derived from the anonymity set. The measurements are based either on the size of the anonymity set or on its entropy. Therefore, it has the same fulfillment status as the other two metrics in the same category.

Mix zone-based metrics

The two metrics, entropy of mix zone and entropy of mix network, are very similar, except that the entropy of the mix network sums up the entropies from a cascade of mix zones on a vehicle's route. Since any area unobservable to an attacker can be modeled as a mix zone, mix zones are applicable to capture location privacy in VCS. However, mix zones only partially cover the whole VCS, thus they do not capture all information related to user location privacy in VCS, e.g., attacks outside mix zones. Accordingly, mix zone-based metrics only partially fulfill thoroughness. In the literature, the movement and delay of a vehicle inside the mix zone is characterized by statistical traffic data. One of the drawbacks of such an approach is that statistical traffic data do not provide much granularity and hence accuracy. For example, a common setting of a mix zone in the literature is as follows: a mix zone is established at an intersection with 4 segments, and a vehicle entering the intersection has a 25% chance to turn left, a 50% chance to go straight on, and a 25% chance to turn right. These data are then used to derive the probability distribution of the egress-ingress events. Furthermore, it is often assumed that an attacker has only information on the ingress and egress of the vehicles at the borders of the mix zone. Such an assumption leads to an underestimation of the attacker, and hence to a higher than actual degree of location privacy. Imagine that a vehicle sends a routing request including its intended destination, or a beacon message broadcasting that it is heading to the right-turn lane, outside the mix zone; then the attacker will have more information to successfully follow the vehicle after it traverses the mix zone.


This means mix zone-based metrics only partially fulfill accuracy. Since VCS can be modeled by mix zones, these metrics partially fulfill generality. At the same time, because it is possible to model location privacy in VCS by mix zones and calculate their entropies, they fulfill practicability and scalability. The difference of the error probability of mix zone metric to the other two metrics is that the probability of an attacker making a wrong linking of egress-ingress events is used to express the effectiveness of the mix zone, instead of its entropy. The other difference is that a user's mobility within the mix zone is modeled as Poisson-distributed flows traversing the mix zone instead of statistical traffic counts. However, since there is no comparison in the literature, we cannot determine whether the flow-based metric is more accurate. Furthermore, as we already mentioned, mix zone-based metrics capture only information about mix zones, thus they are neither thorough nor accurate enough to reflect the actual privacy values in VCS.

Tracking-based metrics

Tracking-based metrics fulfill applicability, because they can characterize a vehicle's location privacy by tracking the messages sent from the vehicle. However, they only partially capture all aspects of location privacy. Consider a simple example, in which a user with a pseudonym can be continuously tracked. In the tracking-based metric, the user will be considered to have no location privacy. However, what if an attacker can never link the pseudonym to the user's real identity? Then the user's location privacy has not been breached. As a result, we regard tracking-based metrics as only partially fulfilling thoroughness. As a vehicle's movement is restricted by the layout of the road network and the surrounding traffic, it is very reasonable to assume that such information can be integrated into the tracking algorithm to improve tracking performance. However, current tracking algorithms discussed in the literature only consider a user's movement and its projected trajectory, hence the privacy values derived from the tracking results are not very accurate. Because a user's movement in VCS can be generalized as a sequence of messages and fed to a tracking algorithm to decide their linkability, tracking-based metrics can cope with different application scenarios and message formats. However, since tracking-based metrics overlook the relation between a user's identity and its movement, they can only partially fulfill generality. A tracking algorithm yields concrete numeric results, thus it fulfills practicability. On the other hand, the tracking systems that appear in the privacy research literature are usually evaluated on very small datasets with simplified parameters, e.g., tens of objects in a one-kilometer area, and no statements can be found as to whether these systems can be scaled up and applied to actual vehicular networks, which are characterized by a large number of highly dynamic vehicles.


Thus we consider these metrics as only partially fulfilling scalability.

Distance-based metrics

Distance-based metrics measure a user's location privacy by comparing the user's actual locations with an attacker's reconstructed ones. Since a user's locations in VCS can be captured and modeled as message-derived location samples, distance-based metrics are applicable to VCS. However, for the same reason as with tracking-based metrics, they only capture a user's spatial-temporal movement without capturing the relation between the possible trajectories and the user's real identity. Therefore, distance-based metrics only partially fulfill thoroughness. Distance-based metrics assume the existence of a "known" tracking algorithm which generates the reconstructed locations for the comparison. Thus they only partially fulfill accuracy and generality, as tracking-based metrics do and for the same reason. Distance-based metrics express the level of location privacy in numeric values. The numbers are calculated as the spatial distances between location pairs, i.e., the actual location points against the attacker's reconstructed location points. Because all location points on a user's trajectory are included in the measurement, the calculation can be tedious and unfocused. In the literature, the authors who proposed the distance-based metrics either explicitly or implicitly acknowledge the complexity of their approaches and admit that it is not feasible to apply them in real-world systems with a large number of users. Therefore, until there is progress in improving the feasibility of the distance-based metrics, they only partially fulfill practicability. Due to their complexity and feasibility issues, distance-based metrics do not fulfill scalability.

2.4.3. Summary and outlook
From the above analysis, we can see that despite the attempts, none of the existing privacy metrics is suitable for measuring location privacy in vehicular communications. In other words, the existing privacy metrics are either inappropriate or insufficient to capture and reflect the actual privacy values in VCS. As the saying goes, "we cannot manage what we cannot measure"; a suitable privacy metric is of crucial importance in the development of privacy-preserving VCS. The issue can only be addressed by the development of a privacy metric that is suitable for VCS and satisfies the aforementioned requirements.

Our analysis shows that the most challenging requirements are thoroughness, accuracy, and generality. Therefore, our focus and main challenge in this dissertation is to develop a location privacy metric that includes aspects that are important but overlooked in the state of the art, and that accurately reflects the corresponding privacy values. At the same time, the location privacy metric should be general enough that it can be applied despite the variations in the development of VCS in the near future. Starting from the next chapter, we will present our work on how to measure the users' location privacy in vehicular communication systems in detail. We will show that in order to subject location privacy to rigorous measurement, several new issues and technical challenges need to be addressed. We will show how we design and develop methods to address these issues. As one of our goals is to have a system that works, each new method is evaluated by various use case studies and extensive simulations. Furthermore, the feasibility of applying our measurement approach to VCS will be demonstrated by a proof-of-concept implementation.

3. Location privacy in snapshot view
This chapter presents the first step in our approach to measuring location privacy in Vehicular Communication Systems (VCS). We start by asking ourselves fundamental questions such as "what is location privacy?" and "what is the best way to measure location privacy in VCS?" We revisit the concept of location privacy in the literature and refine the concept of location privacy as well as how to measure it. Based on the conceptual findings, we identify that the end points of any vehicle trip are the most privacy-sensitive locations and propose to use the relationship between a user and his vehicle trips to reflect the user's level of location privacy in VCS. We then use the mechanism of snapshots to capture privacy-related information from VCS (see the example in Figure 3.1). The information contained in the snapshot is then modeled and quantified to yield numeric values that reflect the user's level of location privacy¹.

Figure 3.1.: Location privacy of one individual over a period of time

3.1. Location privacy revisited
In Chapter 1, we have already introduced a colloquial definition of location privacy. In this section, we will further investigate the concept of location privacy and give a more rigorous definition of it. What exactly is location privacy? A precise understanding at the conceptual level and an explicit definition are a prerequisite before we can proceed further into this topic. Despite its frequent appearance, finding an appropriate definition of "location privacy" is not a straightforward task.

¹ Note that this chapter is based on the results published in [81].

A search on the bibliography website Digital Bibliography & Library Project (DBLP) [82] with the keyword "location privacy" returns more than 230 hits². Figure 3.2 shows the distribution of publications in recent years. From the figure, we can see that research on location privacy only gained momentum after 2003, which roughly correlates with the advances in localization (e.g., GPS) and wireless communication (e.g., UMTS, WLAN) technologies, as well as the introduction of location-aware applications (e.g., location-based services).

Figure 3.2.: Number of publications related to location privacy on DBLP

Before we give our own definition of location privacy, we will first review the existing ones. From the literature, we found the following "definitions" of location privacy.

a. The privacy of location information. [83]
b. A particular type of information privacy defined as the ability to prevent other parties from learning one's current or past location. [24]
c. A particular type of information privacy defined as the ability to prevent other parties from learning one's movement. [59]
d. Who should have access to what location information under which circumstances? [46]
e. Location privacy can be defined according to two different dimensional parameters: information related to the identification of the user and entities which are able to have access to these pieces of information. [84]
f. Users' identity and location information should not be disclosed to unauthorized entities. [85]

² Retrieved on April 8, 2010.

g. A location privacy threat describes the risk that an untrusted party can locate a transmitting device and identify the subject using the device. [86]

The above definitions are representative of the common views on location privacy in the research community. From them, we can make the following observations:

1. The centerpiece of location privacy is location information.
2. Ideally, a user should have control over the disclosure of and access to his location information. In general, two kinds of entities will access location information: trusted/authorized parties and untrusted/unauthorized parties.
3. Location information can have different levels of granularity, ranging from a single location at a point in time to multiple locations over space and time.
4. Identity information, i.e., the information to identify an individual, is an integral part of location information.

These observations help us to consolidate different definitions on location privacy and have a thorough understanding of the concept. Figure 3.3 summarizes our conceptual understanding of location privacy based on the above four observations. As a result, we are able to give an explicit definition of location privacy.

Figure 3.3.: Conceptual understanding of location privacy

Definition 3.1. Location privacy
Location privacy is a user's ability to control the disclosure of and access to his location information, which consists of single or multiple locations as well as the user's identity information.

Based on Definition 3.1, we can derive three elements directly related to the property of location privacy, which are:

• Adversary³
• Individual
• Location information

The first element is the assumed existence of an adversary. Studies on location privacy always assume the existence of an adversary. Generally, an adversary is assumed to access and obtain a user's location information without the user's consent. Without the existence of an adversary, talking about location privacy would be meaningless. The second element is the individual, i.e., a natural person. Privacy focuses on the control over information about individuals. This point of view is endorsed by various strands of philosophical thought and by legislation. For example, Aristotle distinguished between the public sphere of politics and political activity, the polis, and the private sphere of the family, the oikos [87]. Warren and Brandeis stated in their seminal essay that an individual has "the right to be let alone" [88]. The privacy law governing the EU, Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data [89], states clearly that personal data is any information relating to an identified or identifiable natural person. Therefore, we use the word "individual" to refer to an identifiable person. In this sense, a user of VCS is an individual. The last element of location privacy is location information. As explicitly stated in our definition, location information consists of the information on single locations and multiple locations that reveal an individual's movement in space and time, as well as an individual's identity information. Identity information is an individual's abstract representation in the location information.

Strictly speaking, the above three elements are heterogeneous in nature. Usually, an adversary is an imaginary entity assumed to be capable of discovering various privacy vulnerabilities of the system and launching attacks to breach a user's privacy. The capability of the adversary directly influences the privacy level of a system under consideration. An individual is a natural person. Location privacy is an individual's basic human right and hence protected by law. In this sense, an individual is more of a legal concept. Location information is the most technical of the three. By location information, we mean anything such as data, meaning, or knowledge which can be presented, processed, communicated, and stored by computer systems. Despite their different natures, as we illustrate in Figure 3.4, our analysis shows that the three elements are fundamental and inseparable as a whole to denote the concept of location privacy. Hence, they should all be included in the measurement approach.

³ In this dissertation, we do not distinguish between the terms "attacker" and "adversary" and use them interchangeably.

Figure 3.4.: Three inseparable elements of location privacy

3.1.1. Measuring vehicle location privacy
Applying measurement theory⁴, measuring location privacy means mapping the level of location privacy to numeric values that reflect specific properties of location privacy. Therefore, in our measurement approach, we choose the connection between an individual and his location information as the most distinguishable property to reflect an individual's level of location privacy.

This is also in accordance with most of the current privacy legislation, which bases the verdict of a privacy infringement on whether the information about an individual can actually be linked to this person. For example, several shopping malls in the United Kingdom install mobile phone tracking devices⁵ to track shoppers' movements in order to collect detailed shopping behavior⁶. Both the retailers and the company making such devices claim that privacy is not breached because they only track the Temporary Mobile Subscriber Identity (TMSI) of the shoppers' mobile phones, which is not the shopper's real identity. Hence, as long as a TMSI cannot be used to identify a specific individual, such practices and the location information collected are legitimate within the current legal framework.

In our approach, we use (un)linkability, a commonly-used term in the privacy research community [50], to refer to the connection or relationship of an individual to his location information. Unlinkability is two-fold. From a user's perspective, the goal is to achieve unlinkability between himself and his location information. From an adversary's perspective, the goal is to achieve linkability between the users and their location information. In other words, the uncertainty of a potential adversary about such linkability and the users' level of location privacy are indeed two sides of the same coin. This observation leads to our measurement approach: taking an adversary's perspective, the level of a user's location privacy is measured as the linkability of the location information to the user who generates it.

A user in VCS gives out his location information in the outgoing messages. Assuming that vehicular communications are the main source of location information for an adversary, the messages reveal the whereabouts and movements of the user at the time of communication. Note that we only consider the privacy issue caused by vehicular communications; other privacy-invasive technologies, such as the large-scale deployment of cameras with automatic number plate recognition (ANPR) technology⁷ or GPS vehicle tracking devices, are not considered in our work. Furthermore, we use the term "location samples" to refer to location information at the adversary side. Location samples are a distorted subset of all messages from a vehicle, which contain information on the identifier, location, and time of the user. The information on the identifier, although not necessarily a person's identity, can help to identify an individual and the message relations. For example, a vehicle identification number (VIN) is a serial number uniquely identifying a vehicle. If all messages from a vehicle contain the VIN of the vehicle, it will be trivial to link all messages from the same vehicle and easy to find the true identity of the user. Besides, location and time are either explicitly given in a message or can be implicitly derived from the place and time at which the messages are recorded.

⁴ See Appendix B for a brief introduction of measurement theory.
⁵ Path Intelligence Ltd., http://www.pathintelligence.com/
⁶ http://www.theregister.co.uk/2008/05/20/tracking_phones/
⁷ http://www.independent.co.uk/news/uk/home-news/britain-will-be-first-country-to-monitor-everycar-journey-520398.html

Each time a user sends a message, he leaves a "digital footprint" in the system. An adversary targeting location privacy tries to follow the vehicle movements based on location samples. The adversary can exploit such information to identify and profile a user, or even to infer the activities or any other sensitive information about the user. Depending on whether the adversary is outside or inside the system, the methods of obtaining location samples range from eavesdropping on communications to directly accessing the data stored in the system. Since adversaries vary in capabilities, the obtained location information varies in quality. With respect to granularity, location samples can be categorized into three groups:

• a single location, the position of a vehicle at a point in time;
• a track, a sequence of locations from the same vehicle, which reveals a vehicle's movements in space and time;
• and a trip, a complete journey of a vehicle from the origin to the destination. A trip is an ensemble of tracks.

However, given a single location, it is very difficult to link it to a specific user unless identity information is included in the message. For example, a location sample with identifier "D-12345" merely shows that a vehicle is at the said location. Unless the identifier "D-12345" can be linked to a real person, e.g., "John Smith from Springfield", the privacy of John Smith is not breached. Due to the spatial-temporal correlation in the trajectories of vehicle movements, multiple location samples from the same vehicle can be linked together to form location traces. For example, as illustrated in Figure 3.5, the messages from a vehicle generate multiple location samples, based on which the vehicle's movement can be revealed. Tracks and trips are location traces revealing a vehicle's movement. An adversary can obtain tracks by linking a sequence of location samples with the same identifiers, or by one of the target tracking methods [75], which exploit the spatial-temporal correlations of moving objects [35, 37]. A track only provides partial information on a vehicle's movement. However, if an adversary is able to link all location samples of a vehicle from the beginning to the end of its journey, it has the trip information of the vehicle. A trip contains information about the time, the origin, and the destination of a vehicle journey. Researchers in the field of transportation systems suggest that typical urban day trips associated with one individual are centered around the individual's home [90]. Figure 3.6 shows such an example. With the information on the origins and destinations, an adversary can further infer the driver's and the passengers' identities and activities. For example, Hoh et al. show that given the trip information, an adversary can use clustering techniques to automatically identify a vehicle's home location [31]. The empirical study by Krumm shows that given the trip information and open information on the Web, it is possible to heuristically infer the home address and the identity of the driver of a vehicle [30]. A recent study by Golle and Partridge shows, quite remarkably, that knowing the approximate locations of an individual's home and workplace at the granularity of a block level can uniquely identify this individual [33] (cf. Section 2.1.3).

Figure 3.5.: Vehicle trajectory from multiple location samples
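To make the three levels of granularity concrete, the following sketch shows one possible way to represent location samples and to group them into tracks and trips. The class names and fields are illustrative assumptions of ours, not part of any VCS message specification.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class LocationSample:
    """One message as seen by the adversary: identifier, location, and time."""
    identifier: str   # pseudonym or other identifier contained in the message
    x: float          # position of the vehicle
    y: float
    t: float          # time the message was sent or recorded

@dataclass
class Track:
    """A sequence of samples linked to the same vehicle (partial movement)."""
    samples: List[LocationSample]

@dataclass
class Trip:
    """A complete journey; only the end points matter for the trip-based metric."""
    tracks: List[Track]

    @property
    def origin(self) -> LocationSample:
        return self.tracks[0].samples[0]

    @property
    def destination(self) -> LocationSample:
        return self.tracks[-1].samples[-1]

def link_by_identifier(samples: List[LocationSample]) -> List[Track]:
    """The trivial linking case: samples carrying the same identifier form a track.
    Without stable identifiers, a target tracking algorithm would be needed instead."""
    by_id: Dict[str, List[LocationSample]] = {}
    for s in sorted(samples, key=lambda s: s.t):
        by_id.setdefault(s.identifier, []).append(s)
    return [Track(group) for group in by_id.values()]
```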

Figure 3.6.: Example of an individual's typical urban day trips

In future vehicular communications, most likely, vehicles will not include the drivers' real identities in the outgoing messages. Therefore, to learn the identity and activities of the driver, an adversary has to rely on the trip information with the locations of origins and destinations. From an adversary's perspective, a trip is an ensemble of location samples of single locations and tracks, but it contains much more useful information for further inference. This leads to our next definition.

Definition 3.2. Trip-based location privacy metric
A trip-based location privacy metric measures a user's location privacy in vehicular communication systems as the linkability of the user and his vehicle trips.

By user, we mean the driver of a vehicle. We consider the driver to be the only individual associated with the vehicle. Other individuals involved in a trip (e.g., the passengers or a second driver for part of the trip) are omitted. This is in accordance with [91], in which a vehicle trip is defined as a trip by a single privately operated vehicle (POV) regardless of the number of persons in the vehicle. Furthermore, a trip starts from an origin O when a driver starts the vehicle and ends at a destination D when the vehicle reaches the location where the driver fulfills his need for a certain activity.

Example 3.1. We consider that there are three trips in Figure 3.7 and show the summary in Table 3.1. In each of the trips, the driver moves from the origin to the destination to fulfill the need for a certain activity (the purpose).

Figure 3.7.: Example of trips

Table 3.1.: Summary of trips in Figure 3.7

Trips    Origin        Destination    Purpose
trip 1   Home          School         Drop the kid at school
trip 2   School        Gas station    Fill the tank
trip 3   Gas station   Work           Go to work

In a special case, an individual might start and end a trip at the same place, e.g., in the so-called "cruising" activity. We regard this as an exception and thus exclude it from our general consideration.

Up to this point, we have elaborated on various aspects of measuring location privacy at the conceptual level. Specifically, we looked at the definitions of location privacy in the existing literature. This helped us to give an explicit definition of location privacy of our own. This, in turn, brought us to a more in-depth analysis of the fundamentals of location privacy, which leads to the conclusion that we can take an adversary's perspective and measure the level of location privacy as the linkability of location information to the individuals who generate it. Furthermore, we anatomized location information in order to establish our measurement approach, i.e., a trip-based location privacy metric. Starting from the next section, we present the technical details of our measurement approach.

3.2. Methodology of measuring location privacy
In VCS, a user's level of location privacy depends on how much information is accessible to an adversary. Since a user's privacy level and the information available to a potential adversary are indeed two sides of the same coin, we can use the measure of uncertainty of the adversary to express the level of the user's location privacy in the system. Taking an information-theoretic approach, we treat VCS as an information source. The information source produces sequences of messages stemming from the communicating vehicles. A part of the information from the information source will be obtained by an adversary. To quantify the information obtained by the adversary, we use Shannon's entropy [73], which is a quantitative measure of information content and uncertainty. Figure 3.8 illustrates the process of our trip-based location privacy metric. In the first step, we capture information related to location privacy from VCS. In the second step, we model the captured information in a measurement model. The measurement model encodes the information in an abstract and mathematical way. In the last step, we extract the information from the measurement model, calculate it, and yield quantitative measurements which reflect the users' level of location privacy.

Figure 3.8.: Illustration of information processing in location privacy metric

Each step in the process gives rise to one or more challenges. We show how we address these challenges in the following sections.

3.3. Capture information
It can be envisioned that future VCS are dynamic systems that are continuous in space and time. The basic question is: how can we capture the dynamic and continuous information generated by VCS? Our approach to making sensible measurements of a dynamic and continuous system is to take discrete samples from the system and base our measurements on these relatively static and confined versions of the system. To capture information for measurements, we make the following three assumptions:

1. The information considered in the metric is assumed to be within an arbitrarily defined time period.
2. The information considered in the metric is assumed to be within an arbitrarily defined area.
3. We further assume that an adversary is only interested in reconstructing complete vehicle trips and finding the origins and destinations of these trips.

With the first two assumptions, we virtually take a snapshot of the system.

Definition 3.3. Snapshot
A snapshot captures the vehicle movements and their relations to the drivers in a given area in a given period of time.

Consequently, the measurement is based on the information captured in a snapshot. Note that in our definition, a snapshot covers a "period of time" instead of a "point in time". The reason is that the snapshot needs a period of time to capture vehicle trips. According to the third assumption, an adversary tries to identify the origin and destination of a vehicle trip. The reason is that in our study, the origin and destination of a trip represent the most location privacy-sensitive points on a vehicle's trajectory. In practice, an adversary can use various attacks to obtain or infer the end points of a vehicle trip, e.g., taking the positions of the first and the last location samples from the data set belonging to the same vehicle, or combining the location samples with additional spatial and temporal information [30]. Since each trip captured in a snapshot is assumed to start from the origin and end at the destination, we derive that the number of origins equals the number of destinations in a snapshot. Another design decision in this process is that we assume all trips to be separate; trips with the same origin or destination will be treated separately. For example, if two trips both originate from the same parking lot but end at two different destinations, we will treat them as two trips with two distinct origins.

As aforementioned, a snapshot takes a discrete sample from a continuous system. But until now, we have not explained how the discretization influences the continuous nature of the captured information. We will elaborate on this using the following two examples.

Example 3.2. Consider the scenario related to the movements of five vehicles between 8:00 and 9:30 as shown in Figure 3.9. The straight lines indicate a vehicle's movement over time. A marker at the beginning and the end of a solid line indicates an origin or a destination of a trip, respectively. By taking snapshots, we arbitrarily define a period of time, e.g., every half an hour. Hence we will have three snapshots, each covering half an hour. Except for vehicle V1, vehicles V2, V3, V4, and V5 have trips spanning more than one snapshot. According to our assumptions, the information to be considered should be within an arbitrarily defined time period. Thus we need to "fit" the information into a snapshot. As a result, we arbitrarily define that a trip is only considered (or captured) in the snapshot where it ends. Therefore, Snapshot 1 captures the trips made by V1 and V5, Snapshot 2 captures the trips made by V1, V2, and V4, and Snapshot 3 captures the trips made by V1, V2, V3, and V4. It is possible that a snapshot captures more than one trip of the same individual, e.g., V4 in Snapshot 2. We treat them as two separate trips. We will address this further in the measurement model in Section 3.4.

Figure 3.9.: Example of taking snapshots in continuous time

The above example shows how we take snapshots in continuous time. The next example shows how we take snapshots in continuous space.

Example 3.3. Consider the scenario shown in Figure 3.10, which relates to the idea of using snapshots to delimit the area the metric will consider. The solid lines indicate vehicle trips in the area. Markers indicate the origin and the destination of a trip, respectively. The dashed lines delimit an area in the continuous space, within which the trip information will be captured and considered in the metric. We call it Snapshot Ulm to indicate the region in which the snapshot is taken. The example comprises four vehicles, V1, V2, V3, and V4. The trip of V1 starts and ends in the snapshot. The trip of V2, in turn, starts outside the snapshot but ends within it. By contrast, the trip of V3 starts within the snapshot, but ends somewhere outside it. In this case, we arbitrarily define that only trips that end within the boundaries of the snapshot will be considered, e.g., the trip of V2 is captured in the snapshot but the trip of V3 is not considered. By carefully defining the area, a snapshot can keep the number of such cross-boundary trips to a minimum and hence capture the most "typical" vehicle trips in that area. The fourth vehicle V4 has two trips in the snapshot. We regard them as two separate trips of the same individual. This will also be addressed in the measurement model in Section 3.4.

Figure 3.10.: Example of taking snapshots in continuous space

To have a global measurement of VCS, the snapshots need to be “stitched” according to their temporal and spatial orders. However, as a first step in this chapter, we only consider a single snapshot, so we can focus on the groundwork of our measurement approach in a relatively simple setting. The extensions towards a global and hence more complicated measurement will be introduced in the next chapters.
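The capture rule used in the two examples can be summarized in a small sketch: a trip is assigned to a snapshot if and only if it ends inside the snapshot's time window and area. The trip records, field names, and concrete times below are hypothetical and only serve to illustrate the rule.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TripRecord:
    """Illustrative trip record; only the destination side decides the snapshot."""
    vehicle: str
    end_time: float                 # end of the trip, in hours
    end_xy: Tuple[float, float]     # destination coordinates

def in_snapshot(trip: TripRecord, t_start: float, t_end: float,
                area: Tuple[float, float, float, float]) -> bool:
    """True iff the trip ends inside [t_start, t_end) and inside the
    rectangular area (xmin, ymin, xmax, ymax) of the snapshot."""
    xmin, ymin, xmax, ymax = area
    x, y = trip.end_xy
    return t_start <= trip.end_time < t_end and xmin <= x <= xmax and ymin <= y <= ymax

# Hypothetical data in the spirit of Example 3.2: half-hour windows starting at 8:00.
trips = [TripRecord("V1", 8.25, (1.0, 1.0)),
         TripRecord("V5", 8.40, (2.0, 2.0)),
         TripRecord("V2", 8.75, (3.0, 1.0))]
area = (0.0, 0.0, 10.0, 10.0)   # a "Snapshot Ulm"-style bounding box
print([t.vehicle for t in trips if in_snapshot(t, 8.0, 8.5, area)])   # ['V1', 'V5']
```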


3.4. Model information
The objective of modeling the information captured in the snapshot into a measurement model is to represent the information in an abstract and mathematical form to facilitate later calculations. The measurement model encodes the information related to location privacy. The challenge is to abstract the information such that we can represent it mathematically, while ensuring that important information is not omitted during abstraction.

3.4.1. Observation
We observe that the information in the snapshot contains the information about individuals, origin/destination (O/D) pairs, and their relations. We also observe that for an individual to "make a trip", he must start the trip at the origin and end the trip at the destination. This implies that the individual at the origin and the destination must be the same person. To launch a location privacy attack, an adversary has to deal with two problems: the tracking problem and the identification problem. The tracking problem is the process of following a vehicle's movement from the origin to the destination. Tracking results directly influence an individual's location privacy level. However, tracking alone does not capture the whole picture of location privacy. Location information only becomes privacy-relevant after it is associated with specific individuals. In other words, individuals must be linked to location information, i.e., trips, to account for such information. This constitutes the identification problem. Usually, we can assume that identity information about the driver (e.g., the driver's name or his driver's license number) will not be given out in vehicular communications. Consequently, an adversary must use other information to support its identification process. For example, given the locations of origin and destination, an adversary can try to locate an individual's home and use the information in white pages to further find the individual's real identity [30]. We also assume that with the fast development of Geographic Information Systems (GIS) technology, location-based identification will become increasingly easier⁸. Furthermore, other data sources, such as social network sites, provide a large amount of information which can also aid in finding the connections between individuals and geographic locations⁹.

Figure 3.11 illustrates two simple scenarios of individuals and their trips. Imagine there is only one trip in a snapshot, as shown in Figure 3.11(a). In the figure, the trip is visualized as going through a tunnel. Thus the same person entering the tunnel will eventually exit it at the other end. Except for the two ends, the tunnel is an enclosed space, i.e., once having entered the tunnel, an individual has no way to exit it except at the other end. The two ends are the origin and destination of the trip, denoted $o_1$ and $d_1$, respectively. Therefore, if an individual $i_1$ starts a trip at $o_1$, he will also be at $d_1$ at a later point in time. By making a trip, $i_1$ has changed his position in time and space. Mathematically, we can use probabilities to describe this as $p(i_1, o_1) = 1$, $p(o_1, d_1) = 1$, and $p(d_1, i_1) = 1$. All probabilities equal 1 because the relations are absolutely certain. In plain English, the information encoded can be expressed as "$i_1$ has made a trip from location $o_1$ at time $t_1$ to location $d_1$ at time $t_2$¹⁰." For an adversary, knowing the origin and destination of the person can facilitate further inference of other information such as the person's activities and social preferences.

⁸ The privacy issue in GIS was identified quite early [92]. However, what we have seen in recent years is a proliferation of easily accessible GIS data. Nowadays even an average user can access detailed geographic data on the Web (e.g., on GoogleMap and GoogleEarth). Since privacy in GIS is out of the scope of our work, interested readers are referred to a collection of articles on privacy issues in GIS at http://gislounge.com/privacy-in-gis-issues/.
⁹ Privacy in online social networks is another hot research topic. A number of privacy issues have already been identified, e.g., the leakage of personally identifiable information [93]. Although an interesting research topic, privacy in social networks is out of the scope of this work. Interested readers are referred to a list of publications at http://www.cl.cam.ac.uk/~jcb82/sns_bib/main.html.
¹⁰ We assume that time is always implicitly given for locations.

(a) One individual and one trip    (b) Two individuals and two trips
Figure 3.11.: Simple scenarios of individuals, origin/destination (O/D) pairs, and their relations

In the second scenario in Figure 3.11(b), two individuals are linked to two trips, and the adversary can link neither the two trips nor the individuals to the trips with absolute certainty. Consequently, the adversary assigns probabilities to describe the uncertain information. Assuming that 1) $i_1$ and $i_2$ are both probable to start from either $o_1$ or $o_2$, and 2) the two trips are equally probable to end at either $d_1$ or $d_2$, we can express this in terms of a set of probabilities as

$p(i_1, o_1) = 0.5$, $p(i_1, o_2) = 0.5$, $p(i_2, o_1) = 0.5$, $p(i_2, o_2) = 0.5$
$p(o_1, d_1) = 0.5$, $p(o_1, d_2) = 0.5$, $p(o_2, d_1) = 0.5$, $p(o_2, d_2) = 0.5$
$p(d_1, i_1) = 0.5$, $p(d_1, i_2) = 0.5$, $p(d_2, i_1) = 0.5$, $p(d_2, i_2) = 0.5$

Probabilities¹¹ are summaries of information transferred to a high level of abstraction [95]. They only make sense if they are related to the underlying information. In order to "make" sense out of the probabilities, we need to put them in context to relate them to the information they represent. Imagine we want to answer the question "Has $i_1$ made a trip?"; we can construct a probability distribution which includes all possible outcomes of this question. Thus we obtain

$p(i_1, o_1)\,p(o_1, d_1)\,p(d_1, i_1) = 0.125$ : $i_1$'s probability of making the trip from $o_1$ to $d_1$
$p(i_1, o_1)\,p(o_1, d_2)\,p(d_2, i_1) = 0.125$ : $i_1$'s probability of making the trip from $o_1$ to $d_2$
$p(i_1, o_2)\,p(o_2, d_1)\,p(d_1, i_1) = 0.125$ : $i_1$'s probability of making the trip from $o_2$ to $d_1$
$p(i_1, o_2)\,p(o_2, d_2)\,p(d_2, i_1) = 0.125$ : $i_1$'s probability of making the trip from $o_2$ to $d_2$

Conventionally, a probability distribution should sum up to 1. Therefore, we normalize the four elements in the distribution, such that we have a distribution with elements $(0.25, 0.25, 0.25, 0.25)$. In plain English, the answer is something like "$i_1$ has probability 0.25 of making each of the four trips." In the same way, we can construct the probability distribution for $i_2$. As we can see from the above examples, we can use probabilities to express information mathematically. However, each time a new scenario appears, we need to repeat the process and construct a new set of probabilities. The representation becomes messy, and constructing the distribution becomes a tedious job. Thus we need a more efficient way to model the information that is generic enough to account for all possible scenarios. This will be covered in the next subsection.
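The arithmetic of this small scenario can be checked in a few lines; this is only an illustrative sanity check of the normalization step, not part of the metric itself.

```python
# Scenario (b): every edge carries probability 0.5, so each of i1's four
# candidate trips has cycle product 0.5 * 0.5 * 0.5 = 0.125.
products = [0.5 ** 3 for _ in range(4)]
total = sum(products)                            # 0.5
distribution = [p / total for p in products]     # normalize so the distribution sums to 1
print(products)        # [0.125, 0.125, 0.125, 0.125]
print(distribution)    # [0.25, 0.25, 0.25, 0.25]
```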

¹¹ Historically, the meaning of probability has divided people into two camps: the subjectivists, who regard "the probability of an event is the degree to which someone believes it, as indicated by their willingness to bet or take other actions," and the frequentists, who regard "the probability of an event is the frequency with which it occurs" [94]. In our work, we use probability for a non-deterministic description of a complex system due to partial and uncertain knowledge. Therefore, our usage of probability is in accordance with the subjectivists.

3.4.2. Formalization
Based on the observations, we model the information in a snapshot as a weighted directed graph $G = (V, E, p)$. The vertices $V$ in the graph represent the individuals and locations. The edges $E$ in the graph represent their relations. And the weights $p$ on the edges correspond to the probabilities of such relations. There are three disjoint sets of vertices in the digraph, i.e., $I \subseteq V$, $O \subseteq V$, and $D \subseteq V$ with $I \cup O \cup D = V$. $I$ is the set of all individuals, $|I| = n$. $O$ is the set of all origins, and $D$ is the set of all destinations of the trips. Since the number of origins equals the number of destinations in the graph, we obtain $|O| = |D| = m$. The edge set $E$ is defined as $E := E_1 \cup E_2 \cup E_3$ with $E_1 := \{e_{io} \mid i \in I, o \in O\}$, $E_2 := \{e_{od} \mid o \in O, d \in D\}$, and $E_3 := \{e_{di} \mid d \in D, i \in I\}$. As $E_1$, $E_2$, and $E_3$ are disjoint, $G$ is a tripartite graph. Furthermore, each edge $e_{jk} \in E$ is weighted by a probability function $p: E \mapsto [0, 1]$. Figure 3.12 gives an illustration of the graph.


Figure 3.12.: The measurement model as a weighted tripartite graph

Notice that $G$ has several noteworthy properties. First, $G$ contains all the measurement-related information of a snapshot in abstract form, i.e., the information on individuals and their vehicle trips. Besides, we assume that there are many possible ways to obtain information on vehicle trips, e.g., eavesdropping on vehicle communications or accessing information in the backend system, and to establish the relation of an O/D pair of a trip accordingly. Specifically, in the case of finding the origin and destination of a trip by end-to-end tracking of location samples, we assume that there is a publicly known tracking algorithm. Consequently, we treat vehicle tracking as a black box and assume that $p(o_j, d_k)$ is known. In the worst case, if an adversary cannot establish any relation between an origin and a destination, the adversary can assign $p(o_j, d_k) = 1/|D|$ to every outward edge from $o_j$, meaning that $o_j$ can be linked to all destinations with equal probability.

Second, the vertices in $G$ are connected by directed edges. The directions of the edges are in accordance with the order of the space-time progression. Therefore, if we follow the directed edges from a vertex $i_s$ and take a walk in the graph $G$, the path will pass the vertices $\{i_s, o_j, d_k, i_s\}$ and form a cycle [96]. The semantics of the cycle is an individual's possibility of having made a trip from $o_j$ to $d_k$.

Third, the probability assignments weighting the edges correspond to the adversary's knowledge of the users and their movements in the system. Additionally, by defining a set of rules on probability assignment, we are able to keep the measurement model generic and, at the same time, flexible (to account for various possible scenarios). Specifically, we define the sum of the probabilities on outgoing edges from a vertex $o \in O$ or $d \in D$ to be 1: $\sum_{k=1}^{m} p(o_j, d_k) = 1$ and $\sum_{k=1}^{n} p(d_j, i_k) = 1$. We further define the sum of probabilities from a vertex $i \in I$ to be equal to or smaller than 1, i.e., $\sum_{k=1}^{m} p(i_j, o_k) \leq 1$. By the latter definition, we can model that an individual does not make any trips, or is "staying at home". For example, we can express that an individual $i_1$ has probability 0.9 of making trips and probability 0.1 of "staying at home" by letting $\sum_{k=1}^{m} p(i_1, o_k) = 0.9$. Besides, an edge between two vertices without any linkability is kept but weighted with a probability of 0. In this way, the structure of $G$ as illustrated in Figure 3.12 can be kept unchanged under different scenarios.

In Section 3.3, we identified a special case in which a snapshot might capture more than one trip of the same individual. To keep the model generic, we treat the trips as separate, but keep a note on the associated vertices that they represent the same individual. For example, if two trips, from $o_1$ to $d_1$ and from $o_2$ to $d_2$, by the same individual are captured in the same snapshot, we can use two $i$-vertices, e.g., $i_1$ and $i_2$, to denote the same individual, while having a side note specifying that $i_1$ and $i_2$ are the same person. The rationale is that in our approach, users' location privacy is measured based on their trips. So each trip is treated equally, i.e., every trip counts, regardless of whether an individual has one or more trips. For ease of calculation, we also represent $G$ by three adjacency matrices, denoted IO, OD, and DI. These matrices are specified as

$$IO = \begin{pmatrix} p(i_1, o_1) & p(i_1, o_2) & \cdots & p(i_1, o_m) \\ p(i_2, o_1) & p(i_2, o_2) & \cdots & p(i_2, o_m) \\ \vdots & \vdots & \ddots & \vdots \\ p(i_n, o_1) & p(i_n, o_2) & \cdots & p(i_n, o_m) \end{pmatrix}$$

$$OD = \begin{pmatrix} p(o_1, d_1) & p(o_1, d_2) & \cdots & p(o_1, d_m) \\ p(o_2, d_1) & p(o_2, d_2) & \cdots & p(o_2, d_m) \\ \vdots & \vdots & \ddots & \vdots \\ p(o_m, d_1) & p(o_m, d_2) & \cdots & p(o_m, d_m) \end{pmatrix}$$

$$DI = \begin{pmatrix} p(d_1, i_1) & p(d_1, i_2) & \cdots & p(d_1, i_n) \\ p(d_2, i_1) & p(d_2, i_2) & \cdots & p(d_2, i_n) \\ \vdots & \vdots & \ddots & \vdots \\ p(d_m, i_1) & p(d_m, i_2) & \cdots & p(d_m, i_n) \end{pmatrix}$$

where each entry $a_{jk}$ in the matrices indicates that there is an edge from vertex $v_j$ to vertex $v_k$. The value of the entry corresponds to the weight of the edge, $a_{jk} = p(v_j, v_k)$. Each row in the matrices is a vector of the probabilities on all outgoing edges from the same vertex. The sum of each row in IO is equal to or smaller than 1. The sum of each row in OD and DI equals 1. The probabilities in the measurement model can be regarded as "raw data" because they encode the collective information of all individuals and their trips in a snapshot. Remember that our goal is to measure a user's location privacy. In order to quantitatively measure a user's location privacy, we need to extract the probability distributions with respect to each of the individuals from the measurement model, which is the focus of the next section.
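As a concrete illustration, the measurement model of a small snapshot can be held in three probability matrices and checked against the row-sum rules above. The following sketch uses NumPy and invented numbers; it is an illustrative aid, not a reference implementation.

```python
import numpy as np

n, m = 3, 2   # n individuals, m origin/destination pairs

IO = np.array([[0.9, 0.0],       # row i_s: p(i_s, o_j); i1 "stays at home" with probability 0.1
               [0.4, 0.6],
               [0.5, 0.5]])
OD = np.array([[0.7, 0.3],       # row o_j: p(o_j, d_k)
               [0.2, 0.8]])
DI = np.array([[0.6, 0.2, 0.2],  # row d_k: p(d_k, i_s)
               [0.1, 0.5, 0.4]])

# Rules from Section 3.4.2: rows of OD and DI sum to 1, rows of IO sum to at most 1.
assert np.allclose(OD.sum(axis=1), 1.0)
assert np.allclose(DI.sum(axis=1), 1.0)
assert np.all(IO.sum(axis=1) <= 1.0 + 1e-9)
```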

3.5. Calculate information
Based on the measurement model and the notions introduced in the last section, we now show how to extract the probability distribution and how to quantify the location privacy-related information to yield privacy measurements.

3.5.1. Entropy
To extract the probability distributions and quantify the information in the measurement model, we use Shannon's information entropy [73]. Entropy is a quantitative measure of information content and uncertainty over a probability distribution. Entropy has been widely accepted as an appropriate measure in the privacy research community [22, 23, 24]. However, our main challenge here is to apply entropy to the measurement model. By definition, for a probability distribution with probabilities $p_1, p_2, \ldots, p_n$, the entropy is

$$H = -\sum_i p_i \log_2(p_i), \quad i = 1, 2, \ldots, n \qquad (3.1)$$

where pi is the ith element of the probability distribution. H is then a measure of information content and uncertainty associated with the probability distribution. The logarithm in the formula is usually taken to base 2 to have a unit of bit, indicating how many bits are needed to represent a piece of information. A higher entropy means more uncertainty and hence a higher level of privacy. The maximum uncertainty, or the maximum entropy is reached if all probabilities in the distribution are equal [97]. Entropy in information theory is a quantitative measure of the information produced by a discrete information source. When applying entropy to our calculation, the source is the information captured by snapshots and encoded in the measurement models. We also assume that such information is accessible to an adversary.
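Equation (3.1) translates directly into a small helper; the following is an illustrative sketch (zero probabilities are skipped, since by convention 0·log 0 = 0).

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a probability distribution, per equation (3.1)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit
print(entropy([0.25] * 4))    # 2.0 bits: maximum for four equiprobable outcomes
print(entropy([1.0]))         # 0.0 bits: no uncertainty
```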

3.5.2. Extract information
Since our objective is to measure the level of location privacy of the users of vehicular communication systems, our foremost interest is the entropy related to each individual. To calculate the entropy related to an individual, we need to extract the probability distribution of the individual with respect to the O/D pairs, i.e., the possible trips. Because there are $m$ origins and $m$ destinations in the measurement model, an individual can be related to at most $m^2$ possible trips. To extract the probability distribution of an individual $i_s$ from the graph $G$, we take all the cycles related to $i_s$ and "unfold" them. The result is a "flower-like" structure as shown in Figure 3.13(a). The stigma, i.e., the center of the flower, represents a specific individual $i_s$. The petals run clockwise around the stigma, denoting $i_s$ making one of the $m^2$ possible trips, with the last petal representing the case that $i_s$ does not make a trip. We denote this complementary probability $p_c$. Since an adversary is very likely to carry out separate attacks with different means in a location privacy attack (e.g., using publicly available data to link a person to a specific location such as home and working place, and using a tracking system to reconstruct a vehicle's trajectory from the location samples), we assume that the measurements reflect separate observations, that is, the assignment of a probability on a segment of a cycle is not influenced by the probabilities on other segments of the same cycle. Therefore, we assume that the probabilities on the petals describe independent events. As a result, the probability of an individual making a specific trip can be calculated as the product of the probabilities on all edges of the petal representing that trip (cf. Section 3.4.1). We can further simplify the flower-like structure to a "hub-and-spoke" structure as shown in Figure 3.13(b). The hub in the center represents the same individual $i_s$. Each of the radiating spokes from the hub represents the probability of $i_s$ making a specific trip, which is indicated by the subscript of the probability on the spoke. For example, $p_{11}$ is the probability of the trip from $o_1$ to $d_1$. Notice that $p_c$ always corresponds to the probability of $i_s$ staying at home.

(a) Flower-like structure    (b) Hub-and-spoke structure

Figure 3.13.: Visualization of probability distribution related to is extracted from G

3.5.3. Quantify information
Since $p_i = 0$ means there is no uncertainty and $p_i \log p_i = 0$ has no effect on the entropy, we only take the nonzero probabilities from the extracted probability distribution. Furthermore, we normalize the probability distribution such that the sum of the probabilities in the distribution equals 1. The entropy is then calculated on the normalized distribution. Based on equation (3.1) and using the notation specified in the measurement model, we calculate the entropy for a specific individual as

$$H(i_s) = -\left( \sum_{j=1}^{m} \sum_{k=1}^{m} \hat{p}_{jk} \log(\hat{p}_{jk}) + p_c \log(p_c) \right) \qquad (3.2)$$

where $\hat{p}_{jk}$ is the normalized probability of an individual $i_s$ making a trip from $o_j$ to $d_k$ (explained shortly in (3.4)), and $p_c$ is the probability of $i_s$ not making any trips, which is calculated as the complementary probability to the sum of all probabilities from $i_s \in I$ to $o_j \in O$, $j = 1, 2, \ldots, m$, in the measurement model:

$$p_c = 1 - \sum_{j=1}^{m} p(i_s, o_j) \qquad (3.3)$$

The values of $\hat{p}_{jk}$ are calculated as

$$\hat{p}_{jk} = \frac{p(i_s, o_j)\, p(o_j, d_k)\, p(d_k, i_s)}{\sum_{j=1}^{m} \sum_{k=1}^{m} p(i_s, o_j)\, p(o_j, d_k)\, p(d_k, i_s)} \, (1 - p_c) \qquad (3.4)$$

where $p(i_s, o_j)\, p(o_j, d_k)\, p(d_k, i_s)$ is the product of the three probabilities on the cycle with vertices $i_s$, $o_j$, and $d_k$. The rest of the equation normalizes the probability distribution to 1.

Example 3.4. Imagine that based on a snapshot, we can extract $i_1$'s probability distribution from the corresponding measurement model. $i_1$ is seen to be associated with three trips, i.e., $(o_1, d_1)$, $(o_1, d_2)$, and $(o_2, d_3)$, as shown in the flower-like structure in Figure 3.14. The products of the probabilities weighting the petals, except $p_c$, are

$p_{11} = 0.1 \times 0.2 \times 0.3 = 0.006$
$p_{12} = 0.25 \times 0.3 \times 0.7 = 0.0525$
$p_{23} = 0.3 \times 0.4 \times 0.2 = 0.024$

Since $p_c = 1 - (0.1 + 0.25 + 0.3) = 0.35$ and the sum of the $p_{jk}$ is $0.006 + 0.0525 + 0.024 = 0.0825$, we can use (3.4) to normalize the $p_{jk}$ and obtain the probability distribution as

$\hat{p}_{11} = 0.006/0.0825 \times (1 - 0.35) \approx 0.05$
$\hat{p}_{12} = 0.0525/0.0825 \times (1 - 0.35) \approx 0.41$
$\hat{p}_{23} = 0.024/0.0825 \times (1 - 0.35) \approx 0.19$
$p_c = 0.35$

Then we use (3.2) to calculate the entropy as

$-(0.05 \log(0.05) + 0.41 \log(0.41) + 0.19 \log(0.19) + 0.35 \log(0.35)) \approx 1.72$
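Equations (3.2)–(3.4) and Example 3.4 can be reproduced with a short script. The helper below is an illustrative sketch: it takes, for one individual, the edge probabilities of each candidate trip together with p_c and returns H(i_s).

```python
import math

def individual_entropy(trip_edge_probs, p_c):
    """H(i_s) per equation (3.2). trip_edge_probs is a list of triples
    (p(i_s,o_j), p(o_j,d_k), p(d_k,i_s)); p_c is the probability of no trip, per (3.3)."""
    products = [a * b * c for (a, b, c) in trip_edge_probs]    # cycle products
    total = sum(products)
    p_hat = [(p / total) * (1.0 - p_c) for p in products]      # normalization per (3.4)
    dist = [p for p in p_hat if p > 0] + ([p_c] if p_c > 0 else [])
    return -sum(p * math.log2(p) for p in dist)

# Example 3.4: three candidate trips of i1 and p_c = 1 - (0.1 + 0.25 + 0.3) = 0.35
trips = [(0.1, 0.2, 0.3),      # o1 -> d1
         (0.25, 0.3, 0.7),     # o1 -> d2
         (0.3, 0.4, 0.2)]      # o2 -> d3
print(round(individual_entropy(trips, 0.35), 2))   # 1.72
```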

To evaluate the location privacy of an individual, it is also useful to find the maximum entropy possible for an individual in the system, i.e., the upper bound. The maximum entropy for an individual is reached if all participants in the system are equiprobable to make any trips and all trips are also equiprobable. In a system with measurements of $m$ O/D pairs, there will be $m^2$ possible linkings of the origins and destinations; then the maximum entropy of each individual is

$$H_{max} = \log(m^2 + 1) \qquad (3.5)$$


Figure 3.14.: Example 3.4

where 1 accounts for the individual not making any trips. Interestingly, the maximum entropy for an individual depends only on the number of possible trips, not on the number of participants in the system. Given the entropy upper bound, the level of location privacy of an individual can also be expressed as the ratio of the current entropy to the maximum. Therefore, we have

$$H_{\%} = \frac{H(i_s)}{H_{max}} \qquad (3.6)$$

which calculates the ratio of an individual's privacy level to the maximum possible level as a percentage. In other words, it gives a hint as to how far an individual is from the theoretical privacy upper bound. Notice that by definition, our $H_{\%}$ is different from the similar formula $d = H(X)/H_M$ used in [23], which measures the degree of anonymity given an anonymity set.

Example 3.5. Consider the following scenario: three individuals, Alice, Bob, and Charlie, live on the same street, and their houses are close to each other. They park their cars in front of their houses, at parking slots $P_a$, $P_b$, and $P_c$. A snapshot of VCS captures their trips, starting around the same time, to three different destinations: university $U$, hospital $H$, and cafe $C$. To measure Alice, Bob, and Charlie's level of location privacy is to measure how much information a potential adversary has on their vehicle movements. Figure 3.15 gives an example of the probability assignments of the adversary. The adversary assigns probabilities based on the information obtained from VCS. For example, the adversary is certain that Alice starts from $P_a$, but thinks Bob and Charlie are both probable to start from either $P_b$ or $P_c$. The adversary also knows that both Alice and Bob work at the university, and that Alice has visited the hospital quite often in the past. The adversary knows that Bob and Charlie often go to the cafe. Since Bob is supposed to be at work at that time, the adversary assigns a much higher probability to



Charlie than to Bob with respect to the cafe. Besides, the adversary makes the probability assignments independently. For example, consider the possibility of Bob making a trip from $P_b$ to $H$: although the probability of Bob starting from $P_b$ and reaching $H$ is $0.5 \times 0.5 = 0.25$, it has no influence when the adversary assigns 0.05 as the probability of linking $H$ back to Bob. With such assignments, we can model the situation in which later information influences the certainty of the whole trip. Another noteworthy assignment is the 1 on $P_c$ to $C$, which models that an adversary can track a complete trip regardless of whether it can link the trip to a particular individual. Using equations (3.2)–(3.6), we calculate the entropies and list the corresponding results in Table 3.2. The result shows that Alice has the lowest entropy, hence the lowest privacy level. A close look at the example reveals that among all the possible trips, Alice can be linked to the trip from $P_a$ to $H$ with high certainty. As the uncertainty is low, Alice's entropy becomes low. On the other hand, Bob has the highest entropy because the uncertainty is high when linking Bob to the trips from $P_b$ and $P_c$, as well as to the destinations $H$ and $C$.

Figure 3.15.: Simple example of three individuals and three trips

Although very simple, the example demonstrates that the metric is an effective tool to process various information and reflect the underlying privacy values. In the next section, we will further analyze the metric.


Table 3.2.: Result of example in Figure 3.15

i_s        H(i_s)   H_max   H_%
Alice      0.32     3.32    9.6%
Bob        1.38     3.32    42%
Charlie    1.03     3.32    31%
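The last two columns of Table 3.2 follow directly from equations (3.5) and (3.6). A minimal check (the H(i_s) values are taken from the table rather than recomputed from Figure 3.15):

```python
import math

m = 3                            # three O/D pairs in Example 3.5
h_max = math.log2(m ** 2 + 1)    # equation (3.5): log2(10) = 3.32 bits

for name, h in [("Alice", 0.32), ("Bob", 1.38), ("Charlie", 1.03)]:
    print(f"{name}: H% = {100 * h / h_max:.1f}%")   # equation (3.6)
# Alice ~9.6%, Bob ~41.5%, Charlie ~31.0%, matching Table 3.2 up to rounding.
```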

3.6. Analysis Vehicular communication systems are emerging technologies, i.e., they do not yet exist and all specifications are still under development. Although it provides us with the opportunity to put privacy in place from the very beginning of the system design process, it also poses challenges on the analysis and evaluation of many design issues, in our case, the trip-based location privacy metric. Specifically, we do not know the exact content and format of information that will be communicated in the system. Consequently, we do not have the training data to test the metric. Our solution is to employ a use-case based approach for the analysis and evaluation of our design. We define use cases which specify possible scenarios in the system and study the results from the metric in order to analyze the correctness and performance of the metric. Our use-case based approach is also in accordance with the state-of-the-art projects in the same field, such as CVIS [3], SEVECOM [98], and PRECIOSA [99] in Europe.

3.6.1. Use Case I In the first use case, we analyze the role of tracking on location privacy. In this scenario, an adversary can track vehicles with high certainty, but has difficulties to link the vehicle movements to specific individuals. As a result, the adversary assigns higher probabilities to the individuals in the vicinity of the origins or destinations of the trips, and gradually decreases the probabilities as the individual’s distances to the origins or destinations increase. To simulate this scenario, we generate probabilities from a normal distribution for each row in matrices IO and DI, and probabilities from an exponential distribution for each row in matrix OD. We simulate the scenario in MATLAB with 50 individuals and 20 O/D pairs. The probabilities are randomly generated from the probability distributions. Figure 3.16 shows three example probability distributions from the three matrices. The


first distribution is the probability of one of the individual, i1 , starting at one of the 20 origins. Since the probabilities are taken from the normal distribution, they are quite evenly distributed around the average value, i.e., 1{20  0.05. The second distribution shows the probabilities of a trip from o1 ending at one of the 20 destinations. The probabilities are exponentially distributed, so several destinations have much higher values than the rest. The third distribution is also taken from the normal distribution. It shows the relations of one of the destination d1 to the 50 individuals. Probability distribution of p(i ,o ) Probability

1

j

0.1

0.05

0

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

16

17

18

19

20

21

Probability

Origin o(j) Probability distribution of p(o1,dk) 0.1

0.05

0

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Probability

Destination d(k) Probability distribution of p(d1,is) 0.03 0.02 0.01 0

0

5

10

15

20

25

30

35

40

45

50

Individuals i(s)

Figure 3.16.: Example of probability distributions in IO, OD, and DI Arbitrarily, we define an exponential distribution with µ  1, and a normal distribution with µ  0.5 and σ  0.1. The probabilities in each row of the matrices are randomly generated according to their distributions and normalized to 1. Then the three matrices are fed to the metric calculation. We repeat this process for 100 times, each time with three new randomly generated probability matrices. After 100 simulation runs, we obtain an average entropy over the 50 individuals of 8.02 bits. As the maximum entropy for a system with 20 O/D pairs is 8.65 bits, we have a ratio H%  92.7%. These results will be analyzed and compared with Use Case II in the next section.

66

3.6.2. Use Case II
In the second use case, we look at the opposite scenario: high certainty on the identification of individuals at the origins and destinations, and low certainty on the tracking of vehicle movements. To be able to compare with the results from Use Case I, we use the same setting of 50 individuals and 20 O/D pairs. We exchange the probability distributions. Specifically, we let the matrices IO and DI have probabilities from an exponential distribution, and OD have probabilities from a normal distribution. IO and DI simulate the situation that the adversary has more information on the individuals, such as where they live and what their daily schedules are, resulting in high probabilities on linking an individual to a few origins and linking a destination to a few individuals. However, due to poor tracking performance, the adversary has problems linking the origins to the destinations. This is simulated by probabilities from a normal distribution in OD. Using the same parameters for the exponential and normal distributions and the same process described in Use Case I, we obtain an average entropy over the 50 individuals over 100 simulations of 7.48 bits, and $H_{\%} = 86.5\%$.

Figure 3.17 compares the average entropy of the 50 individuals after each simulation run from both case studies. For all the simulation runs, the entropies from Use Case I are higher than the ones from Use Case II, meaning that users in Use Case I have a higher level of location privacy than those in Use Case II. The entropy values fluctuate slightly, because the probabilities are re-generated at each simulation run. However, in the long run, they are quite stable around certain values. The result shows that the linkability of location information to particular individuals has more influence on the overall location privacy level than vehicle tracking. Interestingly, this means it will be more efficient to devise mechanisms that increase the unlinkability between location information and individuals.
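The two use cases were originally simulated in MATLAB. The sketch below re-creates the same setup in Python/NumPy under the assumptions stated above (50 individuals, 20 O/D pairs, rows of IO/DI and OD drawn from normal or exponential distributions and normalized, 100 runs, averaged entropy); since the random draws differ from the original implementation, the resulting numbers will only approximate the reported 8.02 and 7.48 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, runs = 50, 20, 100

def random_rows(shape, dist):
    """Rows drawn from the given distribution, then normalized to sum to 1."""
    raw = np.abs(rng.normal(0.5, 0.1, size=shape)) if dist == "normal" \
          else rng.exponential(1.0, size=shape)
    return raw / raw.sum(axis=1, keepdims=True)

def mean_entropy(io_dist, od_dist, di_dist):
    total = 0.0
    for _ in range(runs):
        IO = random_rows((n, m), io_dist)          # p(i_s, o_j)
        OD = random_rows((m, m), od_dist)          # p(o_j, d_k)
        DI = random_rows((m, n), di_dist)          # p(d_k, i_s)
        H = np.empty(n)
        for s in range(n):
            # cycle products for all m*m candidate trips of individual s
            p = (IO[s, :, None] * OD) * DI[:, s][None, :]
            p = (p / p.sum()).ravel()              # IO rows sum to 1 here, so p_c = 0
            H[s] = -(p * np.log2(p)).sum()
        total += H.mean()
    return total / runs

print("Use Case I :", round(mean_entropy("normal", "exponential", "normal"), 2))
print("Use Case II:", round(mean_entropy("exponential", "normal", "exponential"), 2))
print("H_max      :", round(np.log2(m ** 2 + 1), 2))   # 8.65 bits
```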

3.7. Discussion

Although our goal is to develop a location privacy metric for VCS, we do not regard our metric as the only way to reflect the privacy values in such systems. In an attempt at generalization, our approach stresses the importance of vehicle trips in the overall level of location privacy. Therefore, for specific situations in which other aspects of location privacy are of much greater concern, such as particular locations at particular points in time, our approach will not provide the most appropriate measurements. However, as our systematic analysis in Section 3.1 has clearly shown, the most common location privacy concerns for the users of VCS will be the end points of their vehicle trips. Thus the rationale behind the trip-based location privacy metric can be justified.

[Plot of the average entropy (y-axis) over 100 simulation runs (x-axis) for Use Case I and Use Case II, with Hmax = 8.65 marked.]

Figure 3.17.: Comparison of average entropy from Use Case I and II.

In this chapter, we have introduced some of the basics of our metric. As we proceed deeper into this topic in the following chapters, we will show that there are many interesting and challenging issues associated with location privacy measurement. To the best of our knowledge, such issues have not been exposed and addressed so far. Thus our metric does not only serve as a measurement tool to reflect privacy values, but also provides valuable insights into location privacy, which contributes to the design and development of efficient privacy-protection mechanisms and privacy-preserving systems. One might ask how to assign probabilities such that they reflect the true amount of information an adversary has on the system. Although compiling a complete list of all possible attacks on location privacy and identifying all possible information leaks in VCS is impossible for such a complex system, we can employ two general approaches. In the first approach, we can derive the probabilities based on a) a set of already identified attacks, e.g., home identification and target tracking [31]; b) the information to be included in the communications of potential applications; and c) publicly available data like land-use data and telephone directories. This can be a useful way to measure location privacy on a microscopic scale. For example, we can evaluate the conformance of a given application to the privacy requirements by some pre-defined settings. In the second approach, we can use probability mass functions to approximate the statistical data on population


distributions and traffic statistics to measure location privacy in VCS on a macroscopic scale. Although at this point we do not know what kind of information will actually be communicated in VCS, and what kind of personal information will be exposed in the open in the years to come, we can use various probability mass functions to simulate possible and non-deterministic scenarios. With the help of a concrete measurement framework, we can estimate and foresee the privacy impacts and develop and verify privacy-preserving mechanisms in VCS before such systems are actually in place.

3.8. Summary

In this chapter, we have introduced our first approach for quantitatively measuring the location privacy of individual users of the emerging VCS. The basic consideration behind it is that the location privacy of a user is not only determined by vehicle tracking, but also by linking vehicle trips to the individual who generated them. We introduced the concept of using snapshots to capture the information on location privacy in terms of individuals in the system and their vehicle trips, which are defined by the origins and destinations of the trips. Assuming that an adversary has uncertain and incomplete information on the linking between individuals and their vehicle trips, expressed by probabilities, the location privacy of an individual is measured by the uncertainty of such information and quantified in entropy. Then the location privacy of a specific user can be determined by the ratio of its current entropy and the maximum possible entropy within the given system. The feasibility of the approach is supported by means of different use-case based analyses. The metric developed in this chapter has the following properties:

• it reflects the most important relation between a user and his privacy-related locations;

• it is generic, as well as flexible enough to capture and account for various possible scenarios in future VCS;

• it yields quantitative measurements, which are intuitive and well understood for expressing the level of location privacy of the users in such systems.

This chapter only considered one individual in a single snapshot. To have a more comprehensive view on location privacy in VCS, we will extend our approach into different directions. In Chapter 4 we will consider time in our metric by measuring an individual's location privacy in a sequence of timely-ordered snapshots. In Chapter 5 we


will consider all individuals in a snapshot and investigate the interrelations among them to determine location privacy in a global view.


4. Location privacy in time series

In Chapter 3, we introduced the first step of the trip-based location privacy metric that measures the level of location privacy of individual users of VCS. However, so far our metric only considers one snapshot. To precisely reflect the level of location privacy, we can assume that information available to an adversary is not limited to only a short period of time. Instead, we can assume that a determined adversary will do its best to obtain as much information as possible to decrease the uncertainty of the obtained information. Thus the adversary will take advantage of accumulated information, i.e., privacy-related information captured over a long period of time (e.g., weeks or months). To reflect such an assumption in our metric, we need to take time into account and measure location privacy from a long-term perspective. Hence, instead of one single snapshot, the metric should be able to base its measurements on multiple snapshots, i.e., a sequence of snapshots taken at successive times with equal intervals between them. As we mentioned in Chapter 1, this is our measurement approach in the second dimension (see Figure 4.1). Measurements based on multiple snapshots should reflect the impact of the accumulated information on the level of location privacy in VCS. Intuitively, the more information an adversary obtains, the easier he can draw conclusions with less uncertainty.


Figure 4.1.: Location privacy of one individual in multiple snapshots

The impact of accumulated information on location privacy has not been explicitly addressed in most of the existing approaches so far. Mostly, it is assumed that an adversary's


knowledge about a system already reflects its long-term observations at the time of attack. For example, in most of the mix zone approaches, an adversary is assumed to have the statistical data on user mobility in the mix zone. Empirical studies such as [30] use two weeks of recorded pseudonymous location tracks to infer home addresses and identities of the drivers with partial success. Outside the communication domain, the authors of [100] find that snapshot-based, time-invariant approaches cannot cope with the emergence of time series data mining, and propose to add the time dimension to the current research on privacy-preserving data mining. For measuring long-term location privacy, several issues need to be addressed, such as the challenges of modeling, processing, and reflecting the accumulated information in the privacy measurements. The relation and the impact of the accumulated information on location privacy have not been investigated so far. In this chapter, we identify and address this issue by extending the current location privacy measurement from snapshot views to time series, which take accumulated information into account. Therefore, the metric becomes more precise in reflecting the location privacy of the users of VCS. The main objectives of this chapter are:

• to develop methods to model accumulated information,

• to design approaches and algorithms to process, propagate, and reflect the accumulated information in location privacy measurements,

• to devise approaches to evaluate the feasibility and correctness of our approaches by various case studies and extensive simulations.

This chapter is organized as follows: first, we introduce and formulate the problem of accumulated information in Section 4.1; then we propose two approaches, a frequency based approach and a Bayesian approach, to process, propagate, and reflect the accumulated information in location privacy measurement in Section 4.2; the effectiveness and feasibility of the two approaches are evaluated and compared in Section 4.3; afterwards, we introduce a heuristic algorithm to apply the Bayesian approach to accumulated information with high dynamics in Section 4.4 and evaluate our approach in Section 4.5. (This chapter is based on the results published in [101] and [102].)

4.1. Accumulated information

Using snapshots enables us to capture privacy-relevant information from VCS, which are continuous and dynamic in nature. However, privacy measurements based on a single



snapshot only reflect the privacy values in a short period of time. It is reasonable to assume that a determined adversary will gather as much information as possible over a long period of time to benefit from the collected information. Intuitively, information accumulated over time should help to reveal more facts about the individuals and their vehicle movements. To reflect this more realistic assumption on the adversary, instead of one snapshot, we extend the metric to include consecutive snapshots. Thus the metric yields measurements on “multiple snapshots”. In a single snapshot, the information needed for measuring each individual can be represented by a hub-and-spoke structure. When more snapshots are added to the metric, we can imagine that the information related to an individual i becomes a sequence of hub-and-spoke structures ordered in time as shown in Figure 4.2. Notice that only one individual is shown in Figure 4.2. However, we can imagine that for each of the individuals captured in the snapshots, we can extract the information and build a similar sequence of hub-and-spoke structures. For simplicity in formulations, we will only consider one individual i for the remainder of the chapter. The same formulas and procedures are applicable to any of the other individuals captured in the snapshots.

[Figure: the hub-and-spoke structure of individual i in the 1st, 2nd, and 3rd snapshot, ordered along the time axis.]

Figure 4.2.: Multiple snapshots of i in timely-ordered sequence

There are several observable characteristics of the consecutive hub-and-spoke structure (in Figure 4.2) and the accumulated information contained within this structure. First, i can be linked to different trips from snapshot to snapshot. The differences concern the number as well as the origins and destinations of the trips. We denote the assortment of trips related to i in a snapshot as a trip constellation. Second, the accumulated


information has two dimensions, i.e., the first one extends into the diversity of trip constellations, and the second one extends along the timeline. Third, given the fact that many individuals use vehicles to fulfill demands on activities on a daily basis [103], accumulated information is likely to contain an individual's trip patterns, i.e., regularly occurring trips with the same origins and destinations. By the same trip we mean that two or more trips have the same origin and destination, e.g., the same garage, parking lot, or street parking space. To model accumulated information in multiple snapshots, we represent the hub-and-spoke structure in a more compact way. Let S be the set of all snapshots and let T be the set of all trips considered for an individual i. Then snapshot S_t reflects the relation of i to a set of trips at time period t. We define S_t to be

$$S_t := \Big\{ (T_k, p_k) \,\Big|\, T_k \in T,\ p_k \in (0,1],\ \sum_k p_k = 1,\ k = 1, \ldots, n_t \Big\} \qquad (4.1)$$

where (T_k, p_k) is a tuple with T_k denoting a specific trip (i.e., the k-th trip) and p_k being the corresponding probability of that trip. Only trips with probabilities larger than 0 are assigned to i. As trip constellations can vary in snapshots, we denote the number of possible trips at t by a variable n_t. For the t-th snapshot, each T_k represents a spoke and each p_k represents the corresponding probability of that spoke. For the sake of simplicity, the last spoke denoting the probability of an individual "staying at home" is also represented as one of the trips. As the metric uses entropy to quantify the uncertainty in the information, the calculation of the entropy of i at time t can be simplified as

$$H_t = -\sum_k p_k \log(p_k) \qquad (4.2)$$

where p_k corresponds to the probability of the k-th trip in S_t. Consider the simple example from Table 4.1. We have five consecutive snapshots of an individual i, t = 1, ..., 5. In the 1st snapshot, i's trip constellation consists of four trips, i.e., {T1, T2, T3, T4}, with corresponding probabilities given in the table. In the 2nd snapshot, i is observed to make a new trip T5. Besides, trip T3 appeared in the 1st, 2nd, and 3rd snapshot, but has not been observed in the 4th and 5th snapshot. In the table, the same trips are aligned along the same column. Therefore, we use a blank as a placeholder in the table to represent a non-existing trip in the t-th snapshot. The probabilities show the adversary's information on the linkability of the vehicle trips to a particular individual over time. However, only one trip at each time (i.e., each row in the table) has actually happened.

Table 4.1.: A simple example with six consecutive snapshots of i

   t      T1     T2     T3     T4     T5
  t=1    0.2    0.2    0.3    0.3
  t=2    0.2    0.2    0.3    0.2    0.1
  t=3    0.2    0.1    0.3    0.2    0.2
  t=4    0.2    0.3           0.2    0.3
  t=5    0.2    0.2           0.3    0.3
  t=6    0.2    0.2    0.2    0.2    0.2

Now imagine that the 6th snapshot is captured. Without considering snapshots accumulated in the past, the information contained in S6 represents the highest uncertainty because all trips are equally probable. However, if we also take into account the five already existing snapshots, our intuition tells us that the historical data might provide us with some useful information. Based on the observed characteristics, we are aware that in order to include accumulated information in the metric, we need approaches to process the information contained in the snapshots, propagate such information along the timeline to the following snapshots, and reflect the information in the measurement results.
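As a quick check of equation (4.2), the following Python sketch (our own illustration, not part of the dissertation's implementation) computes the entropy of the 6th snapshot from Table 4.1, in which all five trips are equally probable:

```python
import math

# Snapshot S6 from Table 4.1: all five trips are equally probable.
s6 = {"T1": 0.2, "T2": 0.2, "T3": 0.2, "T4": 0.2, "T5": 0.2}

def entropy(snapshot):
    """Equation (4.2): H_t = -sum_k p_k * log2(p_k)."""
    return -sum(p * math.log2(p) for p in snapshot.values() if p > 0)

print(round(entropy(s6), 2))  # 2.32 bits, the maximum for five possible trips
```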

4.2. Measurements based on multiple snapshots

In this section, we propose two approaches to measure location privacy in multiple snapshots. Specifically, the existing trip-based location privacy metric is extended from a single snapshot to multiple timely-ordered snapshots. The extension to multiple snapshots takes into account the impact of accumulated information on location privacy.

4.2.1. Frequency based approach

One way to "learn from the past" is to check whether the same trip has already been observed. Normally vehicle trips follow some patterns. For example, we might drive from home to work on a daily basis. Hence the information on the frequency of a particular trip in the past gives hints on how probable it is that the same trip will be repeated at a later point in time. For this we define an auxiliary variable f_k^t that counts how often trip T_k has been linked to a specific individual over all snapshots up to time t, i.e., f_k^t = |{S_i | S_i ∈ S, i = 1, 2, ..., t, ∃(T_k, p_k) ∈ S_i}|. For example, in Table 4.1, at time t = 6, T1 has occurred 6 times, so f_1^6 = 6, whereas f_3^6 = 4 holds. Then the frequency-adjusted

snapshot Ŝ_t^f of snapshot S_t = {(T_k, p_k) | ...} can be calculated as

$$\hat{S}_t^f = \{ (T_k, \alpha\, p_k f_k^t),\ k = 1, 2, \ldots, n_t \} \qquad (4.3)$$

where α = 1 / Σ_k p_k f_k^t is a normalization constant calculated by requiring that all probabilities in Ŝ_t^f sum up to 1. Consequently, the frequency-adjusted S_6 is

$$\hat{S}_6^f = \{ (T_1, 0.22), (T_2, 0.22), (T_3, 0.15), (T_4, 0.22), (T_5, 0.19) \}$$

Comparing Ŝ_6^f with S_6, the probability distribution changes from equal to unequal. The corresponding entropy calculated by (4.2) also decreases from 2.32 for S_6 to 2.31 for Ŝ_6^f, i.e., the accumulated information helps to slightly reduce the uncertainty of the current information. However, using only the frequency of a particular trip does not consider the actual probability of that trip in each snapshot. Therefore, we lose information if we use only frequencies to adjust a snapshot. For example, in Table 4.1, though T1 and T4 have the same value of f_k^t, T4 has a higher average probability than T1. To also include actual probability values in the frequency-adjustment, we rewrite (4.3) as

$$\hat{S}_t^w = \{ (T_k, \alpha\, p_k w_k^t),\ k = 1, 2, \ldots, n_t \} \qquad (4.4)$$

in which we replace f_k^t by the average probability of the same trip, i.e., w_k^t = (Σ_i p_k^i) / f_k^t for i = 1, 2, ..., t. The normalization constant is changed to α = 1 / Σ_k p_k w_k^t, accordingly. The probability of a non-existing trip (e.g., T5 at t = 1) is treated as 0, so the equation can be kept in a generic form. Using (4.4), Ŝ_6^w turns out to be

$$\hat{S}_6^w = \{ (T_1, 0.18), (T_2, 0.18), (T_3, 0.24), (T_4, 0.21), (T_5, 0.19) \}$$

with an entropy value of 2.31. The result again shows that accumulated information, in terms of average probabilities of specific trips, can change the current probability distribution and thus modify the level of uncertainty. Furthermore, the result reflects the value of the probabilities of the trips in the past. For example, T3 has the highest probability because it has been associated with high probabilities in the past (i.e., 0.3 at t = 1, 2, 3). On the other hand, even though T1 and T2 appear in all snapshots, the relatively low probabilities in the past cause these two trips to have the lowest value in the probability distribution of Ŝ_6^w (i.e., both are 0.18). A more extensive evaluation of this approach will be given in Section 4.3.
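Both frequency-based adjustments can be reproduced with a few lines of code. The following Python sketch (our own illustration) applies equations (4.3) and (4.4) to the data of Table 4.1 and recovers the values of Ŝ_6^f and Ŝ_6^w given above:

```python
import math

# Table 4.1: probabilities per snapshot; missing trips are simply absent.
snapshots = [
    {"T1": 0.2, "T2": 0.2, "T3": 0.3, "T4": 0.3},
    {"T1": 0.2, "T2": 0.2, "T3": 0.3, "T4": 0.2, "T5": 0.1},
    {"T1": 0.2, "T2": 0.1, "T3": 0.3, "T4": 0.2, "T5": 0.2},
    {"T1": 0.2, "T2": 0.3, "T4": 0.2, "T5": 0.3},
    {"T1": 0.2, "T2": 0.2, "T4": 0.3, "T5": 0.3},
    {"T1": 0.2, "T2": 0.2, "T3": 0.2, "T4": 0.2, "T5": 0.2},
]

def adjusted(t, use_average=False):
    """Equation (4.3) if use_average is False, equation (4.4) otherwise."""
    s_t = snapshots[t - 1]
    adj = {}
    for trip, p in s_t.items():
        occurrences = [s[trip] for s in snapshots[:t] if trip in s]
        f = len(occurrences)                   # f_k^t, frequency of the trip so far
        w = sum(occurrences) / f               # w_k^t, average probability of the trip
        adj[trip] = p * (w if use_average else f)
    total = sum(adj.values())                  # normalization (1/alpha)
    return {trip: v / total for trip, v in adj.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values())

s6_f = adjusted(6)                    # ~ {T1: 0.22, T2: 0.22, T3: 0.15, T4: 0.22, T5: 0.19}
s6_w = adjusted(6, use_average=True)  # ~ {T1: 0.18, T2: 0.18, T3: 0.24, T4: 0.21, T5: 0.19}
print(round(entropy(s6_f), 2), round(entropy(s6_w), 2))  # both ~ 2.31 bits
```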

4.2.2. Bayesian approach

Our second approach to process, propagate, and reflect the accumulated information is to use the Bayesian method to infer information from the historical data.


Bayesian method

In principle, the Bayesian method uses evidence to update a set of hypotheses expressed numerically in probabilities. The core of the Bayesian method is Bayes' theorem expressed in conditional probability [104]. Let h_k be the k-th hypothesis of a complete set of hypotheses H. Then Bayes' theorem takes the following form

$$P(h_k \mid E) = \frac{P(E \mid h_k)\, P(h_k)}{\sum_k P(E \mid h_k)\, P(h_k)} \qquad (4.5)$$

in which

• E is the evidence;

• P(h_k | E) is the posterior probability of h_k because it is the conditional probability of h_k given evidence E;

• P(E | h_k) is the conditional probability of observing evidence E if hypothesis h_k is true;

• P(h_k) is the prior probability of h_k because it is the probability of h_k before it is updated by E;

• Σ_k P(E | h_k) P(h_k) is the denominator denoting the sum of probabilities of observing evidence E under all possible hypotheses.

The above description accounts for updating the hypotheses once. When applying Bayes' theorem to situations in which hypotheses are continuously updated by new evidence, the following steps are usually involved:

• Initially define an exhaustive and mutually exclusive set of hypotheses H^0.

• Before receiving new evidence E, generate prior hypotheses H^-. H^- is the same as H^0 before the first update.

• After receiving evidence E, calculate the posterior hypotheses H^+ using Formula (4.5). H^+ will be used as the prior hypotheses H^- for the next update.
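A minimal sketch of this update cycle (our own illustration of Formula (4.5), not the dissertation's implementation) for a discrete set of hypotheses could look as follows:

```python
def bayes_update(prior, likelihood):
    """One application of Formula (4.5).

    prior:      dict hypothesis -> P(h_k), the current (prior) hypotheses H^-
    likelihood: dict hypothesis -> P(E | h_k), the evidence for each hypothesis
    returns:    dict hypothesis -> P(h_k | E), the posterior hypotheses H^+
    """
    unnormalized = {h: likelihood[h] * p for h, p in prior.items()}
    total = sum(unnormalized.values())          # denominator of Formula (4.5)
    return {h: v / total for h, v in unnormalized.items()}

# Example: three mutually exclusive trips, initially equally probable.
prior = {"T1": 1/3, "T2": 1/3, "T3": 1/3}
posterior = bayes_update(prior, {"T1": 0.2, "T2": 0.3, "T3": 0.5})
print(posterior)  # approximately {'T1': 0.2, 'T2': 0.3, 'T3': 0.5}; the prior for the next update
```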


See Appendix C for a concise introduction of conditional probability. Notice that the notation H is conventionally used for both entropy and hypotheses. We keep the convention and assume that the meaning should be clear from the context.


In the Bayesian method, the initial hypotheses can be subjective, i.e., we can assign probabilities to the hypotheses according to some preliminary knowledge. If there is enough evidence, the hypotheses will eventually be updated towards the objective truth. The characteristics of the modeled accumulated information make it appropriate to apply the Bayesian method. Specifically, S_t contains a set of possible trips and the corresponding probabilities. Each of the trips can be regarded as a hypothesis of an individual making that trip. S_t includes all the possible trips and only one of them can be true. Therefore, the hypotheses are complete and mutually exclusive. The corresponding probabilities in the snapshots are the evidence of those trips from observations. At each time period, S_t contains a new set of evidence, which can be used to update the hypotheses. However, there is still an issue to be solved before we can apply the Bayesian method. It is very likely that S_t contains dynamic trip constellations, e.g., {T1, T2, T3, T4} in S_1 and {T1, T2, T3, T4, T5} in S_2 (see Table 4.1). The implication of such dynamics is that the set of hypotheses H will be different from snapshot to snapshot. As the Bayesian method works on a fixed set of hypotheses, i.e., it does not consider adding or removing one or more hypotheses during the evidence updating process, we need a "smart" solution to apply the Bayesian method to this problem.

Exact algorithm

The solution is Algorithm 1 shown below. In general, for a given snapshot at time t, the algorithm calculates the modified probability distribution for this snapshot using the Bayesian method. Specifically, for each existing snapshot S_j, j = 1, 2, ..., t, the algorithm generates the prior hypotheses H_j^- and uses the probabilities in S_j to calculate the posterior hypotheses H_j^+. The algorithm stores each H_j^+ in a belief table B. Entries in B are called beliefs because they are posterior hypotheses updated by evidence that express the level of confidence of the algorithm in their "correctness". The algorithm also keeps track of the latest posterior hypotheses with the same trip constellation. For example, S_6 has the same trip constellation as S_3 in Table 4.1, so H_3^+ will be the latest posterior hypotheses with exactly the same trip constellation as S_6. Informally, we introduce a relation lph, in which H_j^+ lph S_i, j < i, denotes that H_j^+ is the latest posterior hypotheses of S_j in B with a trip constellation that exactly matches the one in snapshot S_i. To calculate Ŝ_t, the algorithm takes all existing snapshots up to time t. Before processing a new snapshot S_i, the algorithm first consults B for the latest posterior hypotheses with the same trip constellation as S_i. If found, the posterior hypotheses H_j^+ will be used as the prior hypotheses H_i^- for the current snapshot S_i. If not found, the algorithm assigns H_i^- equally distributed probabilities.


Algorithm 1 Calculate Ŝ_t (cf. equation (4.1)) using the Bayesian method
Input: snapshots until time t, S_1, ..., S_t
Output: snapshot at time t with modified probability distribution, Ŝ_t
 1: for i = 1 to t do
 2:   if found H_j^+ lph S_i then
 3:     use H_j^+ as H_i^-
 4:   else
 5:     assign equal probabilities to H_i^-
 6:   end if
 7:   update H_i^- with the probabilities in S_i, the result is H_i^+
 8:   add H_i^+ to B
 9: end for
10: replace the probability distribution in S_t with H_t^+ to obtain Ŝ_t, return Ŝ_t

The rationale is that we assign probabilities without any prejudice to the initial hypotheses, believing that the evidence will eventually update the hypotheses towards the objective truth. Then H_i^- is updated by S_i to generate H_i^+. Afterwards, H_i^+ is added to B. Notice that for efficiency, B only needs to keep the latest H^+ with a unique trip constellation. Finally, H_t^+ replaces the probability distribution in S_t to yield Ŝ_t. Ŝ_t reflects the current beliefs, expressed in probabilities that have been continuously updated by new evidence, on each of the trips in the trip constellation in S_t. In line 7 of the algorithm, when using the probabilities in S_i to update the prior hypotheses, the notions in Formula (4.5) can be substituted and rewritten as

$$p_k^{H_i^+} = \frac{p_k^{S_i}\, p_k^{B}}{\sum_k p_k^{S_i}\, p_k^{B}} \qquad (4.6)$$

Here $p_k^{H_i^+}$ and $p_k^{S_i}$ are the probabilities of the k-th trip in H_i^+ and S_i, respectively. $p_k^{B}$ is defined as

$$p_k^{B} = \begin{cases} p_k^{H_j^+} & \text{if } H_j^+ \text{ lph } S_i \text{ is found} \\ 1/n_i & \text{if } H_j^+ \text{ lph } S_i \text{ is not found} \end{cases} \qquad (4.7)$$

In this formula, $p_k^{H_j^+}$ is the probability of the k-th trip of the latest posterior hypothesis in B with the same trip constellation as S_i, and n_i is the number of trips in S_i.


We demonstrate how the algorithm works by calculating the same example from Table 4.1. The results at each time period are shown in Figure 4.3. We also include H^- at each time period to show how they are assigned and how they are updated by S to generate H^+. For example, at t = 2, since the trip constellation of S_2 appears for the first time, H^- is assigned an equal probability distribution. Looking further down, at t = 6, the latest snapshot with the same trip constellation can be found at t = 3. So the posterior probabilities H^+ at t = 3 are copied to the prior probabilities H^- at t = 6. Ŝ_6 has the same value as H^+ at t = 6:

$$\hat{S}_6 = \{ (T_1, 0.19), (T_2, 0.1), (T_3, 0.42), (T_4, 0.19), (T_5, 0.09) \}$$

with an entropy of 2.08 bits. Compared with the results from the frequency based approaches in Section 4.2.1, we witness a more dramatic change in the probability distribution, as well as a sharp decrease in entropy. The results show that the Bayesian approach is more effective in reflecting the impact of accumulated information than the frequency based approaches. We will further compare and evaluate these approaches in the next section.

[Figure: for each time period t = 1, ..., 6, the evidence S_t from Table 4.1 together with the belief table B, i.e., the prior hypotheses H^- and the posterior hypotheses H^+ over T1–T5.]

Figure 4.3.: Example of Algorithm 1
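The following Python sketch (our own illustration of Algorithm 1, using the update rule of equations (4.6) and (4.7)) reproduces the example above on the Table 4.1 data:

```python
import math

snapshots = [
    {"T1": 0.2, "T2": 0.2, "T3": 0.3, "T4": 0.3},
    {"T1": 0.2, "T2": 0.2, "T3": 0.3, "T4": 0.2, "T5": 0.1},
    {"T1": 0.2, "T2": 0.1, "T3": 0.3, "T4": 0.2, "T5": 0.2},
    {"T1": 0.2, "T2": 0.3, "T4": 0.2, "T5": 0.3},
    {"T1": 0.2, "T2": 0.2, "T4": 0.3, "T5": 0.3},
    {"T1": 0.2, "T2": 0.2, "T3": 0.2, "T4": 0.2, "T5": 0.2},
]

def algorithm1(snapshots, t):
    belief = {}                                  # belief table B: constellation -> latest H^+
    for i in range(t):
        s_i = snapshots[i]
        constellation = frozenset(s_i)
        # Prior H^-: latest H^+ with the same constellation, else equal probabilities (eq. 4.7).
        h_minus = belief.get(constellation, {k: 1 / len(s_i) for k in s_i})
        # Posterior H^+ by equation (4.6).
        unnorm = {k: s_i[k] * h_minus[k] for k in s_i}
        total = sum(unnorm.values())
        belief[constellation] = {k: v / total for k, v in unnorm.items()}
    return belief[frozenset(snapshots[t - 1])]   # distribution replacing the one in S_t

s6_hat = algorithm1(snapshots, 6)
print({k: round(p, 2) for k, p in s6_hat.items()})                # matches the values of S^_6 above
print(round(-sum(p * math.log2(p) for p in s6_hat.values()), 2))  # ~2.08 bits
```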

4.3. Evaluation

4.3.1. Evaluation criteria

Our goal is to evaluate whether the privacy metric, now with the extension for accumulated information, can really reflect the underlying value of user location privacy in VCS. For


this purpose, we define two use-case-based evaluation criteria. The use cases specify scenarios likely to happen in VCS. The criteria are the expected impacts of the scenarios on user location privacy. We simulate the use cases. The simulation results will then be compared with the criteria. The results give us clues as to how well the metric can be used to measure long-term location privacy in VCS. We define the evaluation criteria as

1. if an individual has irregular trips with quite different origins and destinations at each time, accumulated information should provide little or even no additional information;

2. if an individual has regular trip patterns, accumulated information should provide additional information. With this additional information, it should be possible to detect an individual's trip patterns.

In our metric, the uncertainty of information is quantified in entropy. A decrease in entropy indicates that additional information leads to a decrease in uncertainty, i.e., a decrease in user location privacy.

4.3.2. Evaluation setup

We identify three parameters that have the main influence on the outcome of our location privacy metric. Among them are the trip constellations in each snapshot, their corresponding probability distributions, and the number of snapshots. First, the trip constellation specifies the number of trips and their appearances observed in a snapshot. Second, the probability distribution of the corresponding trips specifies the information captured by a snapshot. Third, the number of snapshots specifies the duration of the measurement. Implicitly, it specifies the amount of accumulated information available to the metric. By specifying these parameters, we can create use cases to check whether the metric meets the evaluation criteria. We have identified three representative use cases, mock-ups of real-world scenarios, that will be used to evaluate the metric. An overview of the use cases is given in Table 4.2. The first two use cases represent two opposite extremes. In the 1st use case, each of the snapshots has a different trip constellation. A series of such snapshots also contains irregular trips. We imagine that such a scenario will happen if either an individual makes different trips each time or the observation of an adversary is of very poor quality, such that there is high confusion or uncertainty associated with the obtained information. For each snapshot, the simulation first generates a random trip index in the range of 1 to 100, then it generates the corresponding probabilities.

Table 4.2.: Overview of use cases

  Use case        Description
  1st use case    Snapshots with irregular trip constellations
  2nd use case    Snapshots with the same trip constellation
  3rd use case    Snapshots with re-occurring trips

To avoid any subjectiveness in the probability assignment, the probabilities are randomly generated from the uniform distribution. The process is repeated to generate 60 snapshots with dynamic trip constellations. In the 2nd use case, all snapshots have the same trip constellation. However, only one trip in the constellation actually happens. Hence the snapshots contain a regular trip hidden among other observed trips. This scenario happens if an adversary has correctly observed the regular trip, such as driving from home to work, but somehow cannot distinguish it from other trips observed at the same time. To simulate such a scenario, we generate 60 snapshots with a trip index from 1 to 100. We set trip T1 in the constellation as the one that actually happened and assign a fixed probability, called the p-value, to it. The remaining 99 trips are assigned probabilities from the uniform distribution. We set the p-value to be the average, i.e., p = 0.01, and normalize the probabilities of the remaining 99 trips to (1 - p1) = 0.99. The choice and impact of the p-value will be further elaborated in Section 4.3.3. The 3rd use case lies on the spectrum between the two extreme cases described before, and contains several re-occurring trips. It is a mock-up of a more realistic and common scenario, as specified in Table 4.3. Imagine there is a series of snapshots capturing an individual's vehicle trips for several weeks. All snapshots cover a time period sometime in the morning, so all the trips are from home to somewhere. We simulate this by four trip constellations. The first trip constellation for snapshots (Mon. – Wed.) contains trips {T1, T4, ..., T100}. We set T1 as the trip that actually happened and assign a p-value of 0.012. The corresponding probabilities of {T4, T5, ..., T100} are assigned from the uniform distribution and normalized to (1 - p1) = 0.988. The second trip constellation for snapshots (Thur. – Fri.) contains trips {T2, T4, ..., T100}. We set T2 as actually happened and also assign a p-value of 0.012, and the normalized probabilities to {T4, T5, ..., T100}. The third trip constellation for snapshots (Sat.) contains trips {T3, T4, ..., T100}. We assign a p-value of 0.012 to T3 and the normalized probabilities to {T4, T5, ..., T100}. The last trip constellation for snapshots (Sun.)


has trips {T4, T5, ..., T100}. To simulate random destinations on Sundays, we assign all the trips probabilities from the uniform distribution. We repeat the process and generate 56 snapshots to simulate 8 weeks of snapshots with re-occurring trips.

Table 4.3.: 3rd use case setup

  Scenario (vehicle trips)              Trip constellation      Prob. assignment
  Home to office A (Mon. – Wed.)        {T1, T4, ..., T100}     p1 = 0.012, Σ_{i=4}^{100} p_i = 1 − p1
  Home to office B (Thur. – Fri.)       {T2, T4, ..., T100}     p2 = 0.012, Σ_{i=4}^{100} p_i = 1 − p2
  Home to shopping mall C (Sat.)        {T3, T4, ..., T100}     p3 = 0.012, Σ_{i=4}^{100} p_i = 1 − p3
  Home to a random destination (Sun.)   {T4, T5, ..., T100}     random, Σ_{i=4}^{100} p_i = 1
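As an illustration of how such a dataset can be generated (our own sketch; function and parameter names are ours), the following Python code produces the 60 snapshots of the 2nd use case, with the real trip T1 fixed at the p-value and the other 99 trips drawn from the uniform distribution and normalized to 1 − p:

```python
import numpy as np

rng = np.random.default_rng()

def make_regular_trip_snapshots(n_snapshots=60, n_trips=100, p_value=0.01):
    """2nd use case: identical constellation, T1 is the trip that actually happened."""
    snapshots = []
    for _ in range(n_snapshots):
        rest = rng.uniform(size=n_trips - 1)
        rest = rest / rest.sum() * (1.0 - p_value)     # remaining trips sum to 1 - p
        probs = {"T1": p_value}
        probs.update({f"T{k + 2}": float(x) for k, x in enumerate(rest)})
        snapshots.append(probs)
    return snapshots

snapshots = make_regular_trip_snapshots()
print(len(snapshots), round(sum(snapshots[0].values()), 6))  # 60 snapshots, each sums to 1.0
```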

During the simulation, we generate snapshots corresponding to the use cases and feed them to the location privacy metric. The outcome of the metric is analyzed along the evaluation criteria. For our analysis, we choose the following entropy values: 1) Hmax, the theoretical maximum entropy based on each single snapshot; 2) H, the entropy based only on a single snapshot; 3) Hf, the entropy based on the snapshots modified by frequencies of occurrence; 4) Hw, the entropy based on the snapshots modified by average probabilities; 5) HB, the entropy based on the snapshots modified by the Bayesian approach. To analyze the impact of accumulated information on the actual level of uncertainty, we further define Hd as a measurement of the decrease in uncertainty

$$H_d = \frac{H_B - H}{H} \cdot 100\% \qquad (4.8)$$

which bases the calculation on the difference between the entropy using the Bayesian approach and the entropy based on a single snapshot without any additional information.

4.3.3. Simulation

Figure 4.4 shows the simulation result from the 1st use case, in which each snapshot contains a randomly generated trip constellation. We can see from the figure that the


entropies of H, Hf, Hw, and HB are so close that they overlap each other most of the time. This means that neither the frequency based approaches nor the Bayesian approach is able to benefit from the accumulated information. Besides, these entropies are very close to the upper bound Hmax, due to the fact that the probabilities in each snapshot are from uniform distributions. For illustrative reasons, the lower part of the figure includes a bar chart showing the number of trips in each of the snapshots. Notice that the actual trip constellations are not shown in the bar chart.

[Plot of Hmax, H, Hf, Hw, and HB (entropy in bits, upper panel) and the number of trips per snapshot (bar chart, lower panel) over 60 snapshots.]

Figure 4.4.: Entropy of irregular trips

Figure 4.5 shows the result from the 2nd use case, which simulates the scenario that a regular trip is blurred by other false observations in each snapshot. The result shows that the frequency based approaches can barely reflect the accumulated information. As a result, Hf and Hw mostly overlap H, with the exception that Hw has slightly lower entropies at the first few snapshots. On the other hand, the Bayesian approach has significantly decreased the entropy level from 6.3 bits to as low as 0.79 bits at the 33rd snapshot. Obviously, at 0.79 bits, the uncertainty is very low, i.e., the privacy level is very low. The shape of the curve of HB suggests that the Bayesian approach is able to process and benefit from the accumulated information. Figure 4.6 shows the simulation result from the 3rd use case. The 3rd use case simulates weekly re-occurring trips. Hf and Hw have similar outcomes as those in Figure 4.5, i.e., the frequency based approaches cannot really benefit from accumulated information in the long run. Again, the Bayesian approach has significantly decreased the entropy value. Interestingly, this time the curve of HB has a cascading and downward shape.

[Plot of Hmax, H, Hf, Hw, and HB (entropy in bits) over 60 snapshots.]

Figure 4.5.: Entropy of regular trips

The reason is that we have simulated four types of re-occurring trips in this use case. The first three trips are regularly occurring trips and the fourth one (i.e., the Sunday trip) is chosen to be random. Therefore, while the overall curve of HB demonstrates a downward trend, the entropies corresponding to the first three trips decrease much faster than the entropy of the Sunday trip. Notice that the entropy of the Sunday trip also exhibits a downward trend. The reason is that even though the probability distributions of the Sunday trip are from the uniform distribution, their values are slightly different from each other. As a result, the probabilities are modified by the Bayesian approach towards a non-uniform distribution. In other words, given consecutive snapshots, Algorithm 1 regards some of the trips as "more likely to have happened" than others. The result again demonstrates that the Bayesian approach can take advantage of the accumulated information caused by regularly occurring trips. As the next step, we use Hd to analyze the decrease in uncertainty in each of the use cases. Since a new set of random values is generated each time a use case is simulated, we run each use case 100 times and calculate the mean values to take into account the effects of the variations of the random variables. The results are plotted in Figure 4.7. For irregular trips, taking more snapshots into the metric does not decrease information uncertainty. In some cases, it even increases the level of uncertainty. This means that, based on the metric, accumulated information does not provide any additional information due to the randomness in the captured information. For regular trips, we can see that there is a constant decrease in uncertainty as more and more snapshots are added in the sequence.

[Plot of Hmax, H, Hf, Hw, and HB (entropy in bits) over 56 snapshots.]

Figure 4.6.: Entropy of re-occurring trips

The decrease reaches -84.6% at the 60th snapshot. The outcome of the metric shows that with regular trips, accumulated information can significantly reduce the uncertainty in the information related to user location privacy. For re-occurring trips, despite the spikes on each Sunday due to the randomness of the trips on that day, there is also a constant decrease in uncertainty as time elapses. Because there are several regular trip patterns involved in this use case, the speed of the decrease in uncertainty is slower than in the use case with regular trips. The result demonstrates again that the accumulated information can cause considerable decreases in the level of uncertainty, i.e., in the users' location privacy. Notice that the shapes of the curves in Figure 4.7 correspond to those in Figures 4.4, 4.5, and 4.6, i.e., the observations we made before on a single simulation result also hold in the general case. We know that the main reason behind the significant decrease in uncertainty is the application of the Bayesian method in Algorithm 1. Algorithm 1 processes, propagates, and reflects the accumulated information by continuously updating the probabilities in each hypothesis after a new set of evidence contained in a snapshot is received. The updated hypotheses are kept in the belief table B. As a result, the probability distributions in the belief table converge toward the trips that "really happened". The changing probability distributions lead to lower entropy values and hence a decrease in uncertainty. However, so far we have not shown whether the algorithm is able to update probability distributions in a correct way. We test the correctness of Algorithm 1 by tracing the change of beliefs in the belief table. In this sense, the second and the third use cases are quite similar.

[Three plots of Hd (%) over the number of snapshots, one each for irregular trips, regular trips, and re-occurring trips.]

Figure 4.7.: Change of uncertainty

Therefore, we only show the study on the 2nd use case here. Same as before, we assign the first trip as the one that actually happened. Furthermore, we assign different probabilities to study the effect of the p-values on the performance of the algorithm. The p-values are {0.009, 0.01, 0.011}, which correspond to 10% lower than the average, the average, and 10% higher than the average of the probability of the 100 trips in the trip constellation. Again, we run the simulation 100 times to account for the variations in the random dataset and calculate the means of the first trip over the 100 simulation runs. Figure 4.8 shows the result. At 10% below the average, Algorithm 1 almost fails to detect the trip. However, as soon as the p-value equals the average, there is a steady rise of the probability. If we assume that 0.5 is the threshold to select a trip as the one that really happened, the first trip will be selected at the 59th snapshot. When the p-value is increased to only 10% above the average, the probability of the first trip exhibits a sharp rise and passes the 0.5 threshold at the 32nd snapshot. From the simulation results, we conclude that our location privacy metric and the related approach meet both evaluation criteria defined in Section 4.3.1.

4.4. Heuristic algorithm for dynamic trip constellations

Algorithm 1 in Section 4.2.2 relies on finding posterior hypotheses (i.e., H^+) of the previous snapshots with exactly the same trip constellations to propagate the beliefs.

[Plot of the belief (probability) on the first trip over 60 snapshots for p-values 10% lower than, equal to, and 10% higher than the average.]

Figure 4.8.: Change of beliefs with different p-values

Therefore, it functions well on snapshots containing regular trip patterns, in which snapshots with the same trip constellations appear frequently. Imagine that an individual can be linked to different sets of trips in each of the snapshots; Algorithm 1 will then likely wait for a very long period of time until it encounters the same trip constellation again. In the worst case, a specific trip constellation might even never happen more than once. The simulation results in Figure 4.4 and Figure 4.7 have already shown the negative effect of snapshots with dynamic trip constellations. To have a more robust way to process and reflect accumulated information in the privacy measurements, in this section we develop a heuristic algorithm as an important extension to Algorithm 1 and evaluate its feasibility to work with dynamic trip constellations in Section 4.5.

4.4.1. Finding an adequate measurement of similarity

A trip constellation is a set of trips associated with a specific individual in a snapshot. The biggest difference in the heuristic algorithm is that, instead of searching for a snapshot with an identical trip constellation, the heuristic algorithm now searches for a snapshot with the most similar trip constellation. Then the beliefs (i.e., the posterior hypotheses) from the previous snapshot are used as an input to construct the prior hypotheses of the later snapshot. Recall that, originally, the Bayesian method is intended to work on a fixed set of exhaustive and mutually exclusive hypotheses during the evidence update process (cf. Section 4.2.2); our solution to tackle the trip dynamics is a heuristic approach.


However, our rationale is that, if the beliefs are propagated between the two most similar snapshots, the distortions during the belief propagation will be kept at a minimum. In fact, because two identical trip constellations are the most "similar" ones, a search for the most similar will return the identical trip constellation, if it exists. The question arises as "how to find an adequate notion of similarity?" Intuitively, two snapshots are more similar the more trips they have in common. To quantitatively express the concept of "similarity", we can count the number of trips present in both snapshots, as well as those that appear in only one of them. An elegant way to count the occurrence of trips in a snapshot is to convert the set-based snapshot representation in (4.1) to binary strings. Let n be the number of all unique trips that appear in all snapshots up to S_t, formally: n = |⋃ S_i'|, i = 1, 2, ..., t, with S_i' = {T_k | ∃(T_k, p_k) ∈ S_i}. Then the trip constellation of S_i expressed by a binary string c_i is

$$c_i = [T_1, T_2, \ldots, T_n] \quad \text{with} \quad T_k = \begin{cases} 1 & \text{if } \exists (T_k, p_k) \in S_i \\ 0 & \text{otherwise} \end{cases} \qquad (4.9)$$

in which we use 1 for an existing trip and 0 for a non-existing trip within snapshot S_i. Notice that n is a constant, so all binary strings will have the same length of n bits. This also means that to convert the trip constellation in a snapshot to a binary string, we might need to pad all snapshots retrospectively to have the same length for all c_i, i = 1, 2, ..., t. For example, in Table 4.1, at t = 1, c_1 will be [1, 1, 1, 1], while at t = 2, by retrospective padding, c_1 becomes [1, 1, 1, 1, 0] and c_2 will be [1, 1, 1, 1, 1]. For two binary strings of equal length, the hamming distance [105] is a measure of the number of positions at which the bits differ. For example, the hamming distance of [1, 1, 1, 1, 0] and [1, 1, 1, 1, 1] is 1. Therefore, we can use the hamming distance to measure the similarity of two snapshots. Hence, the hamming distance between two snapshots (or more precisely, the trip constellations in the two snapshots) expresses explicitly the difference in their trip constellations. The more trips in common, the smaller the hamming distance, hence the more similar the two snapshots are. Therefore, for S_t, we can calculate the hamming distances from S_t to each of the previous snapshots S_1, S_2, ..., S_{t-1}. We regard the snapshot with the smallest hamming distance as the most similar snapshot to S_t. In case more than one snapshot has the same hamming distance, we choose the latest one. This is also in accordance with Algorithm 1, which looks for the latest posterior hypotheses with the same trip constellation.
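A small Python sketch (our own illustration) of the binary-string conversion and the hamming-distance search over previous snapshots could look like this:

```python
def to_binary_string(constellation, all_trips):
    """Equation (4.9): 1 for an existing trip, 0 for a non-existing one."""
    return [1 if trip in constellation else 0 for trip in all_trips]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def most_similar(snapshots, t):
    """Return the index j < t of the latest snapshot with minimum hamming distance to S_t."""
    all_trips = sorted({trip for s in snapshots[:t] for trip in s})   # retrospective padding
    strings = [to_binary_string(s, all_trips) for s in snapshots[:t]]
    distances = [hamming(strings[j], strings[t - 1]) for j in range(t - 1)]
    best = min(distances)
    return max(j for j, d in enumerate(distances) if d == best)       # latest one wins ties

# Example from Table 4.1: S_1 = {T1..T4}, S_2 = S_3 = {T1..T5}; hamming(c_1, c_2) = 1.
s = [{"T1", "T2", "T3", "T4"}, {"T1", "T2", "T3", "T4", "T5"}, {"T1", "T2", "T3", "T4", "T5"}]
print(most_similar(s, 3))  # 1, i.e., S_2 is the latest most similar snapshot to S_3
```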


4.4.2. Constellation fitting

After finding the most similar snapshot, the next question is "how to propagate the beliefs between two snapshots so there will be minimum distortions?" In the case of an exact match as in Algorithm 1, this is done by taking the whole posterior hypotheses H^+ of the previous snapshot from the belief table B, and using them as the prior hypotheses H^- of the current snapshot. Knowing that the trip constellations in the two snapshots will most likely match only partially, we need to find a solution to align the hypotheses so we can propagate the probabilities from H^+ to H^-. We call this the "constellation fitting" problem, i.e., to shape and fit the current trip constellation into the previous one such that the current snapshot can heuristically inherit the associated hypotheses of the previous one with minimum distortions. To propagate beliefs between two sets of similar but not exactly matching hypotheses with minimum distortions, we made two decisions in our heuristic algorithm. The feasibility will be evaluated by simulations in Section 4.5. The two decisions are:

1. if a posterior hypothesis of a trip exists, it will be used as the prior hypothesis for the same trip in the current snapshot;

2. otherwise, the prior hypothesis of the trip in the current snapshot will be given an equally distributed probability.

As the probabilities in a set of hypotheses should sum up to 1, we also normalize the probability distribution in the process when necessary. Although two snapshots might be similar with respect to their trip constellations, there are various forms that such similarity can take. Because the various relations between two trip constellations directly influence the probability assignment for the prior hypotheses in the heuristic algorithm, we will first elaborate on the possible relations and their corresponding probability assignments, and present the detailed description of the algorithm afterwards. Let S_i be the current snapshot and S_j be the most similar snapshot in the past, j < i; we can derive five kinds of relations between S_i and S_j. The first one is the exact match, i.e., the trip constellations in S_i and S_j are identical, which is the case considered in Algorithm 1. Besides the exact match, the other four relations are illustrated in Figure 4.9. For simplicity, in the following description, we treat a snapshot as a set containing only trips, and omit the corresponding probabilities (cf. (4.1)), e.g., S_i = {T_1, T_2, ..., T_{n_i}}. Moreover, we use H_i^- to denote the prior hypotheses of S_i and


H_j^+ for the posterior hypotheses of S_j stored in the belief table B. Hence we have the following four relations:

1. The disjoint relation might happen when S_j is most similar to S_i, even though S_j and S_i have completely different sets of trips. For example, if S_1 = {T1, T2} and S_2 = {T3, T4}, S_1 will be the "choice" for S_2 because the binary string representation of S_1, i.e., c_1 = [1, 1, 0, 0] (cf. (4.9)), has the smallest hamming distance to S_2's binary string representation, i.e., c_2 = [0, 0, 1, 1]. In this case, S_i has a completely new trip constellation, and H_i^- will not inherit any beliefs from S_j. Hence H_i^- is assigned equal probabilities, which is similar to line 5 in Algorithm 1.

2. The intersected relation might be the most frequently occurring relation for two similar-but-not-identical snapshots. In this relation, S_i and S_j share some trips in common, but at the same time have sets of trips of their own. For example, S_1 = {T1, T2, T3, T4} and S_2 = {T1, T3, T5, T6} have an intersection of {T1, T3}. The trips unique to S_2 are S_2 \ S_1 = {T5, T6}. To assign probabilities to H_i^-, we let the trips in the intersection inherit the probabilities of the same trips in H_j^+, and the rest of the trips in H_i^- are equally assigned the remaining probability.

3. In the subset relation, S_i is a subset of S_j, i.e., all trips in S_i are also in S_j. The trips in H_i^- will inherit all corresponding probabilities of the same trips in H_j^+. Since we have only a subset of H_j^+, we need to normalize the probabilities in H_i^- to 1.

4. In the superset relation, S_i is a superset of S_j, i.e., S_i includes all trips in S_j plus some other trips. To assign probabilities, we first let all trips in S_i but not in S_j (i.e., S_i \ S_j) have equal probabilities, so these trips can have unbiased initial hypotheses. Then we let all trips also in S_j inherit the corresponding probabilities from H_j^+. We further normalize the inherited probabilities to the remaining probability in H_i^-. For example, for S_1 = {T1, T2, T3} and S_2 = {T1, T2, T3, T4, T5}, T4 and T5 in H_2^- will each have a probability of 1/5, and the probabilities of T1, T2, T3 will be taken from H_1^+ and normalized to 3/5.

Notice that at any time, S_i will contain only two possible sets of trips: the trips that are also in S_j and the trips not in S_j. The design of the probability assignment for H_i^- reflects our idea to use the existing beliefs while avoiding prejudicing the hypotheses of "newly-appeared" trips.

[Figure with four Venn-style panels: (a) S_j and S_i are disjoint, (b) S_j and S_i intersect, (c) S_i is a subset of S_j, (d) S_i is a superset of S_j.]

Figure 4.9.: Various relations of Sj and Si
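The prior assignment for these four relations can be written down compactly. The following Python sketch is our own illustration of the constellation-fitting rules described above (function and variable names are ours, not from the dissertation):

```python
def fit_prior(s_i, s_j, h_j_plus):
    """Construct the prior hypotheses H_i^- of snapshot S_i from H_j^+ of the most
    similar snapshot S_j, following the disjoint/intersect/subset/superset rules."""
    common = s_i & s_j
    new_trips = s_i - s_j
    if not common:                                     # disjoint: no beliefs to inherit
        return {t: 1 / len(s_i) for t in s_i}
    if not new_trips:                                  # subset: inherit and re-normalize to 1
        total = sum(h_j_plus[t] for t in s_i)
        return {t: h_j_plus[t] / total for t in s_i}
    if s_i >= s_j:                                     # superset: new trips get 1/|S_i| each
        prior = {t: 1 / len(s_i) for t in new_trips}
        remaining = 1 - len(new_trips) / len(s_i)
        total = sum(h_j_plus[t] for t in s_j)
        prior.update({t: remaining * h_j_plus[t] / total for t in s_j})
        return prior
    # intersected: inherit common trips, split the remaining probability equally
    prior = {t: h_j_plus[t] for t in common}
    remaining = 1 - sum(prior.values())
    prior.update({t: remaining / len(new_trips) for t in new_trips})
    return prior

# Superset example from the text: T4 and T5 get 1/5 each, T1-T3 are normalized to 3/5.
h1_plus = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
print(fit_prior({"T1", "T2", "T3", "T4", "T5"}, {"T1", "T2", "T3"}, h1_plus))
```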

4.4.3. Heuristic algorithm

The heuristic algorithm has a similar structure to Algorithm 1, except for the search for similar snapshots and the probability assignment for the prior hypotheses. The details of the heuristic algorithm are given in Algorithm 2. Notice that line 5 in Algorithm 2 searches for the latest snapshot with the trip constellation of minimum hamming distance to S_i. Lines 6 to 16 are the probability assignment for the prior hypotheses H_i^-. Also notice that lines 8 to 15 correspond to the four relations outlined in Figure 4.9. Furthermore, because two snapshots with an identical trip constellation have a hamming distance of 0, and Algorithm 2 always searches for the latest snapshot with the smallest hamming distance, the heuristic algorithm will function exactly like the "exact" algorithm (i.e., Algorithm 1) when there are snapshots with the same trip constellations in the series. In other words, Algorithm 2 is fully compatible with Algorithm 1. To demonstrate how Algorithm 2 works, we show a simple example in Figure 4.10. Similar to the example in Figure 4.3, the figure shows the snapshots and their corresponding prior and posterior hypotheses. Besides, there is an extra column to show the latest most similar snapshot (LMSS) of each snapshot. The example includes six snapshots with very dynamic trip constellations. The snapshots include all five relations we outlined in Section 4.4.2. For example, S_2 and S_1 have a disjoint relation, S_3 and S_2 have an intersected relation, S_4 and S_2 have a subset relation, S_5 and S_3 have a superset relation, and S_6 and S_1 match exactly. The prior hypotheses H^- at each time period demonstrate how prior probabilities are assigned according to Algorithm 2. Notice that the calculation of the posterior hypotheses H^+ is the same in both Algorithm 1 and Algorithm 2.


Algorithm 2 Heuristic algorithm to calculate Ŝ_t
Input: snapshots until time t, S_1, ..., S_t
Output: snapshot at time t with modified probability distribution, Ŝ_t
 1: for i = 1 to t do
 2:   for l = 1 to i do
 3:     convert trip index in S_l to binary string c_l and pad to equal length
 4:   end for
 5:   find c_j with minimum hamming distance to c_i, j < i, i − j is minimum
 6:   if hamming distance = 0 then
 7:     H_i^- ← H_j^+
 8:   else if S_i ∩ S_j = ∅ then
 9:     assign trips with probability of 1/|S_i|
10:   else if S_i ∩ S_j ≠ ∅ then
11:     assign trips in S_i ∩ S_j with probabilities from H_j^+, and trips in S_i \ S_j with probability of (1 − Σ p_k) / |S_i \ S_j|
12:   else if S_i ⊆ S_j then
13:     assign trips with probabilities of p'_k / Σ_k p'_k, where the p'_k are probabilities from H_j^+
14:   else if S_i ⊇ S_j then
15:     assign trips in S_i \ S_j with probability of 1/|S_i|, and trips in S_j with probabilities of (1 − p_c) · p'_k, where the p'_k are probabilities from H_j^+
16:   end if
17:   update H_i^- with the probabilities in S_i, the result is H_i^+
18:   add H_i^+ to B
19: end for
20: replace the probability distribution in S_t with H_t^+ to obtain Ŝ_t, return Ŝ_t

[Figure: for t = 1, ..., 6, the evidence S_t over T1–T5, the latest most similar snapshot (LMSS), and the belief table B with prior hypotheses H^- and posterior hypotheses H^+.]

Figure 4.10.: Example of Algorithm 2

4.5. Evaluation of heuristic algorithm

Compared to Algorithm 1, the heuristic algorithm involves more variables that are of interest in the evaluation, such as trip constellations and the dynamics of the constellations. Since our focus is on the feasibility of the heuristic algorithm, we choose the most important aspects related to feasibility and use simulations to evaluate them. In the following, we will evaluate the heuristic algorithm with respect to the constellation dynamics, the probability of the "real" trip, and clusters of re-appearing trips, respectively.

4.5.1. Evaluation with respect to constellation dynamics

Snapshots with dynamic trip constellations model the scenario in which an adversary is able to "correctly" link an individual to a specific trip. However, due to uncertainties, the real trip is mixed with a set of false trips in each of the snapshots, such that from the adversary's perspective, the correct information is submerged and concealed by incorrect information. To make things worse, in each snapshot, the real trip is presented with a different set of false trips that form a different trip constellation. The consequence is a sequence of snapshots with dynamic trip constellations. The heuristic algorithm is developed to cope with constellation dynamics. Hence, we expect that Algorithm 2 can propagate beliefs under dynamic trip constellations. Furthermore, we are also interested in the performance of the algorithm under different degrees of constellation dynamics. Following the same approach as in Section 4.4.1, we express the degree of constellation dynamics between two snapshots by their hamming distance.


The bigger the hamming distance, the more dynamic the trip constellations are along the timeline. In order to simulate such a scenario, we generate a dataset of 60 snapshots with 100 trips each. We specify the first trip T1 as the "real" trip. We assume that the real trip will have a slightly higher than average probability if it really occurs. Since the average probability in a snapshot with 100 trips will be 1/100 = 0.01, we slightly increase the probability of the real trip T1 by 10% and assign 0.01 + 0.01 · 10% = 0.011 to T1. Other trips (i.e., {T2, T3, ..., T100}) are given random probabilities from the uniform distribution. The next step is to find a way to distribute the 100 trips, so we can have a sequence of snapshots with dynamic trip constellations. One possibility is to distribute the trips randomly. However, in this case it is difficult to have a clear picture of the relation between the trip dynamics and the results from the heuristic algorithm. Therefore, we control the degree of constellation dynamics so we can evaluate the heuristic algorithm in a controlled manner. We achieve this by shifting all trips after T1 to the right each time a new snapshot is generated. For example, if we want the 2nd snapshot to have 10% constellation dynamics with respect to the 1st snapshot with trips {T1, T2, ..., T100}, we shift the trip block of {T2, T3, ..., T100} of the 2nd snapshot 10 trips to the right, so the trip index becomes {T1, T12, T13, ..., T110}. Consequently, 10% of the trips in the 2nd snapshot (i.e., {T101, T102, ..., T110}) are different from the 1st snapshot. The idea is illustrated in Figure 4.11. For simplicity, we show an example of only 6 snapshots with 10 trips each. In the figure, a black square indicates an existing trip. All snapshots in the figure have a 10% constellation dynamics with respect to the one before, i.e., each later snapshot has one trip different from the former snapshot. In other words, each two neighboring snapshots have 10 · 90% = 9 trips in common.

Figure 4.11.: Snapshots with 10% constellation dynamics (trip index vs. number of snapshots)
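As an illustration of this construction (a minimal Python sketch with helper names of our own choosing, not the dissertation's simulation code), snapshots with controlled constellation dynamics can be generated as follows:

    import random

    def snapshot_probabilities(num_trips=100):
        """One snapshot: the real trip T1 gets 10% more than the average probability
        (0.011 for 100 trips); the false trips receive uniform random probabilities.
        We normalize the false trips so the snapshot sums to 1 (an assumption on our part)."""
        p_real = (1.0 / num_trips) * 1.10
        weights = [random.random() for _ in range(num_trips - 1)]
        scale = (1.0 - p_real) / sum(weights)
        return [p_real] + [w * scale for w in weights]

    def generate_snapshots(num_snapshots=60, num_trips=100, dynamics=0.10):
        """Controlled constellation dynamics: for each new snapshot, the block of false
        trips is shifted dynamics*num_trips indices to the right, so that fraction of the
        constellation changes while T1 appears in every snapshot."""
        shift = int(dynamics * num_trips)
        snapshots = []
        for t in range(num_snapshots):
            indices = [1] + [i + t * shift for i in range(2, num_trips + 1)]
            snapshots.append(dict(zip(indices, snapshot_probabilities(num_trips))))
        return snapshots

    def constellation_distance(s1, s2):
        """One possible distance between constellations: the size of the symmetric
        difference of the two trip-index sets."""
        return len(set(s1) ^ set(s2))

With dynamics = 0.10 and 100 trips per snapshot, two consecutive snapshots share 90 trip indices, matching the 10% constellation dynamics illustrated in Figure 4.11.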

We construct snapshots with different constellation dynamics and observe the change of the beliefs on the real trip T1. The goal is to evaluate the performance of the heuristic algorithm under various constellation dynamics. Figure 4.12 shows selected results with constellation dynamics of 1%, 10%, and 50%. The results are averaged over 100 simulation runs for each value of constellation dynamics to account for the variations in the random dataset. For 1% constellation dynamics, Algorithm 2 achieves a similarly good result as Algorithm 1 (cf. Figure 4.8). This means that the heuristic algorithm is able to propagate beliefs among snapshots with dynamic trip constellations, resulting in an increase in the belief on the real trip. Notice that because the Hamming distances between any two consecutive snapshots are fixed, the heuristic algorithm will always find the directly preceding snapshot as the most similar one and use H from that snapshot as the basis for the construction of the new hypotheses H*. Therefore, the hypotheses are continuously updated and the two algorithms yield similar results. However, the beliefs on T1 go down when the constellation dynamics increase. This matches our intuition that if there are more dynamics in the trip constellations (i.e., the real trip is associated with a different set of false trips in each snapshot), there is more uncertainty and thus less chance to detect a trip that really happened.

Figure 4.12.: Changes of beliefs on T1 with different degrees of constellation dynamics (1%, 10%, and 50%)

4.5.2. Evaluation with respect to p-value

The p-value (cf. Figure 4.8) is the probability assigned to the real trip in each of the snapshots in the dataset. By specifying different probabilities for the real trip, we model


an adversary’s ability to link an individual to his vehicle movements. In Section 4.5.1, we have shown that the heuristic algorithm performs well with a p-value 10% higher than the average under fixed constellation dynamics. However, fixed constellation dynamics will be rare in most realistic scenarios. Therefore, we use snapshots with totally random trip constellations to evaluate Algorithm 2 with respect to different p-values. Total randomness also means that the constellation dynamics is at its maximum. For the dataset, we randomly generate between 2 and 100 trips for each of the snapshots. Hence each snapshot has a random number of trips, and the trip indices are random as well. As before, we specify T1 as the real trip so that it appears in all snapshots. Furthermore, we assign the p-value to T1 and probabilities from the uniform distribution to the rest of the trips. The probabilities of the rest of the trips are then normalized to (1 − p1). We choose two kinds of p-values: absolute values and variable values. Since each snapshot now contains a varying number of trips, the absolute p-value is a constant probability throughout all snapshots, whereas the variable p-value is the average probability in each snapshot (i.e., p1 = 1/|St| at time t) multiplied by a scaling factor. The p-values are 0.01, 0.02, and 0.03 for the absolute case, and 10%, 30%, 50%, and 70% higher than the average for the variable case. For each of the p-values, we run the simulation 100 times and take the averages of the beliefs on T1 from the belief table B. The results are shown in Figure 4.13. From the simulation results, we made several interesting observations. First, the curves with high p-values have ripples in the short term and exhibit an upward trend in the long term. The ripples are due to fluctuations in the hypotheses, because the heuristic algorithm searches for the most similar snapshot in the past. For example, for the 10th snapshot, the algorithm might find that the 2nd snapshot has the most similar trip constellation and construct H*10 based on H2. As a result, the updated beliefs on T1 between the 3rd and 9th snapshots are not involved in the construction of H*10. However, in the long term, the heuristic algorithm is able to benefit from the accumulated information. Thus the long-term beliefs on T1 increase. Second, if the probability of the real trip is below a certain threshold, the heuristic algorithm is unable to detect the trip. This is demonstrated by the curves representing the p-values of absolute 0.01 and variable 10% higher in the figure. Notice that in the previous evaluations, a 10% higher p-value gives a very quick rise to the beliefs on T1. The reason for the slow rise here is that the hypothesis of T1 in the previous settings is continuously updated, while in our current setting, for the same reason that causes the ripples, the hypothesis of T1 is updated based on the posterior hypothesis from a randomly found snapshot with the most similar trip constellation.

Figure 4.13.: Changes of beliefs with different p-values

However, looking closely, we can see that the 10% higher curve does in fact increase. A measurement on the 10% higher curve confirms an 88% increase at the 60th snapshot compared to the value at the 1st snapshot. Third, the relation of low p-values to low beliefs corresponds to our intuition: if an adversary fails to capture correct information on a real trip and to give it an “outstanding treatment” in the probability assignment, the trip remains concealed among the others, and the adversary cannot derive any useful information from it. In this sense, our findings provide two interesting privacy thresholds for the design of privacy-protection mechanisms. If, each time an individual makes a trip, the trip can be confused by an adversary with no more than 99 other trips, a privacy-protection mechanism should be able to conceal the real trip among the others by keeping the probability of the real trip no higher than 0.01, or no more than 10% above the average probability of the trips at the same time.
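For reference, the construction of one such randomly constellated snapshot can be sketched as follows (our illustration in Python; the function name, argument names, and the trip-index range are hypothetical, not taken from the dissertation):

    import random

    def random_constellation_snapshot(p_abs=None, rel_above_avg=None, max_index=1000):
        """One snapshot with a totally random trip constellation. T1 always occurs and
        receives the p-value: either an absolute probability (p_abs, e.g. 0.01) or the
        snapshot's average probability scaled up (rel_above_avg, e.g. 0.10 for 10% higher).
        The false trips get uniform random probabilities normalized to (1 - p1)."""
        num_trips = random.randint(2, 100)
        p1 = p_abs if p_abs is not None else (1.0 / num_trips) * (1.0 + rel_above_avg)
        weights = [random.random() for _ in range(num_trips - 1)]
        scale = (1.0 - p1) / sum(weights)
        false_indices = random.sample(range(2, max_index), num_trips - 1)
        snapshot = {1: p1}
        snapshot.update({idx: w * scale for idx, w in zip(false_indices, weights)})
        return snapshot

    example = random_constellation_snapshot(rel_above_avg=0.10)   # variable p-value, 10% above average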

4.5.3. Evaluation with respect to cluster of re-appearing trips

The evaluation in Section 4.5.2 specifies T1 as the real trip throughout all the snapshots. All other trips are generated randomly and hence might not appear in every snapshot. Thus a question arises: does the high occurrence of T1 bias the heuristic algorithm?


To answer this question, we use clusters of re-appearing trips to evaluate the fairness of the heuristic algorithm. Specifically, when generating the dataset, instead of placing only T1 in each of the snapshots, we specify a whole set of trips that appear in all snapshots. Thus T1 and the other trips in this set form a trip cluster among the otherwise randomly generated trips in each of the snapshots. Consequently, the hypotheses of all trips in the cluster are updated by the heuristic algorithm at the same time. We can then check whether the hypothesis of T1 is treated the same as those of the other trips in the cluster. For the simulations, we generate 60 snapshots with a maximum of 100 trips each. We specify three cluster sizes, i.e., 10 trips from T1 to T10, 20 trips from T1 to T20, and 50 trips from T1 to T50. Each snapshot includes a trip cluster together with other randomly generated trips. We assign a probability 10% higher than the average to T1. The rest of the trips are assigned probabilities from the uniform distribution. For each cluster size, we run the simulation 100 times. We then take the averaged beliefs of the trips in the cluster at the 60th snapshot from the belief table B. The results are shown in Figure 4.14. As clearly demonstrated by the figure, T1 has the highest belief at the 60th snapshot for all three cluster sizes. Thus we conclude that the mere occurrence of a trip does not bias the heuristic algorithm, and the algorithm performs correctly.
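This fairness check can be sketched as follows (a minimal illustration with hypothetical helper names; run_simulation is assumed to return the beliefs of all trips at the 60th snapshot, i.e., the last row of the belief table B):

    import statistics

    def fairness_check(run_simulation, cluster_size, runs=100):
        """Average the beliefs of the clustered trips T1..T<cluster_size> over all runs
        and report which trip ends up with the highest averaged belief."""
        results = [run_simulation(cluster_size) for _ in range(runs)]   # one belief dict per run
        averaged = {trip: statistics.mean(beliefs[trip] for beliefs in results)
                    for trip in range(1, cluster_size + 1)}
        return max(averaged, key=averaged.get)   # unbiased behaviour: returns 1, i.e. trip T1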


Figure 4.14.: Beliefs on the trip clusters (cluster sizes 10, 20, and 50) at the 60th snapshot

Based on the simulation results, we conclude that the heuristic algorithm is robust and powerful enough to process, propagate, and reflect the accumulated information in the location privacy metric under dynamic conditions.


4.6. Summary

In this chapter, we presented our approach to measuring the long-term location privacy of the users of VCS. To precisely reflect the underlying privacy values in VCS, we took accumulated information into consideration. We developed approaches and algorithms to model, process, propagate, and reflect the impact of accumulated information in privacy measurements. In particular, we presented two algorithms that apply the Bayesian method to process and propagate accumulated information among multiple snapshots along the timeline. The first algorithm propagates information among snapshots with exactly matching trip constellations. The second algorithm is a heuristic extension of the first one, which is robust enough to function on snapshots with highly dynamic trip constellations. We also designed methods to evaluate the feasibility and correctness of the approaches and algorithms through various case studies and extensive simulations. We showed in this chapter that accumulated information can have a significant impact on the level of location privacy. Interestingly, our simulation results are in accordance with the theory behind the Bayesian method, which states that one can build initial hypotheses based on preliminary knowledge, and if there is enough evidence in the long run, the hypotheses will be updated by the evidence towards the objective truth. Due to the repeated trip patterns intrinsic to vehicle usage, the Bayesian method proves effective for location privacy attacks on VCS, in which the actual locations of the users are more likely to be exposed and detected through their vehicular communications in the long run. The results and findings in this chapter provide valuable insights into location privacy, which contribute to the design and development of future-proof, privacy-preserving VCS. Until now, the metric only measures the privacy of individual users. In Chapter 5, we will investigate the possible interrelations among individuals and their impact on the level of location privacy in order to determine location privacy in a global view. Furthermore, the actual application of the metric to measure the level of location privacy in VCS and to evaluate a given Privacy-enhanced Technology (PET) design will be demonstrated in Chapter 6.


5. Location privacy in global view

In the previous two chapters we presented our approaches for measuring location privacy along two dimensions, i.e., a user’s location privacy as captured in a single snapshot and in a sequence of snapshots. In this chapter, we address location privacy in the third dimension, in which a user’s location privacy is measured while considering all other users captured in the same snapshot (see Figure 5.1). To distinguish the measurement approach presented in this chapter from those in Chapter 3 and Chapter 4, we denote the privacy values measured in the third dimension as “location privacy in global view”, since this time the attacker is assumed to be capable of processing and reasoning about the location information globally. Consequently, an individual’s location privacy is computed in the context of the others with respect to the observed vehicle trips in the same snapshot. In contrast, we denote the location privacy measured in the first and second dimensions as “location privacy in local view”, because each individual is considered separately. A more precise description of the local vs. global view is given shortly.

Figure 5.1.: Location privacy of all users captured in the same snapshot

In the following, we first introduce the problem of privacy in global view, in which an attacker is able to process the privacy-related information as a whole and to reason about it. We then develop approaches to model the information in global view and to measure a user’s location privacy in such a setting. Following this, we design methods to evaluate the feasibility of our approach and its impact on the level of location privacy.


5.1. Local vs. global view

This section introduces the issue of location privacy in local view and global view.

5.1.1. Location privacy in local view

We begin with a quick review of the basics of how to capture an individual’s location privacy. As discussed in Chapter 3, the location privacy of a user in VCS is closely related to the information about the relation of the user to its vehicle trips. Thus, taking an attacker’s perspective, we can measure a user’s level of location privacy as the attacker’s knowledge of such relations. The relation can be expressed by the privacy term “linkability” of items of interest [50]. The items of interest here are an individual and its vehicle trips. An attacker on location privacy in VCS is assumed to utilize all sorts of means to obtain information about the individual-trip linkability. For example, an attacker can eavesdrop on vehicular communications to collect location samples and track a vehicle’s movements. An attacker can further exploit the information on the origin and destination of a vehicle trip to learn the identity of the driver. However, it is also assumed that an attacker has various limitations. For example, he might not be able to intercept all vehicular communications due to limited network coverage, or he might not have all the information needed to identify a driver. Furthermore, he might be thwarted to a certain degree by a privacy-protection mechanism such as changing pseudonyms. As a result, uncertainty is prevalent in the information about the linkability at the attacker side. Consequently, by measuring the uncertainty in the information possessed by the attacker, we obtain a measure for the level of location privacy of a user engaged in vehicular communications. Mathematically, the uncertain information is underpinned by probability distributions. Taking an information-theoretic approach, the information uncertainty is quantified as entropy to give an intuitive and elegant measure of a user’s level of privacy in a communication system [23, 22]. Entropy is directly related to a user’s level of privacy: high entropies indicate high information uncertainty and hence high levels of privacy, and vice versa. In Chapter 3, we used a weighted tripartite graph to model the information on the individual-trip linkabilities as captured in a snapshot. To calculate an individual’s location privacy, all cycles related to the individual are extracted from the graph to form a hub-and-spoke structure. The hub-and-spoke contains the information about the individual-trip linkability, modeled as a hub-and-spoke structure weighted by a probability distribution. The hub in the middle represents the individual, e.g., is. Each of the spokes


represents a possible trip Tjk, defined by origin oj and destination dk. In addition, we use the last spoke to represent the possibility that an individual cannot be linked to any of the trips. The probabilities weighting the spokes form the probability distribution of the individual-trip linkability relations. For each individual in the tripartite graph, we can extract a corresponding hub-and-spoke structure. Figure 5.2 illustrates this process.

Figure 5.2.: Extracting probability distributions of individuals from tripartite graph (edge weights p(is, oj), p(oj, dk), p(dk, is) ∈ [0, 1])
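As a minimal illustration of the entropy-based measure described above (our sketch with made-up spoke probabilities, not the dissertation's implementation), the privacy level corresponding to one extracted hub-and-spoke distribution could be computed as follows:

    import math

    def entropy(probabilities):
        """Shannon entropy in bits of a discrete distribution (zero terms ignored)."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # Hypothetical hub-and-spoke distribution for one individual: probabilities of being
    # linked to each candidate trip, plus the "no-trip" probability pc on the last spoke.
    spokes = [0.5, 0.25, 0.15, 0.10]          # p11, p12, p13, pc (illustrative values)
    print(round(entropy(spokes), 3))           # ~1.743 bits; higher entropy = more privacy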

Possible sources of uncertainty

The hub-and-spoke structure captures the individual-trip relation in a snapshot. Obviously, the number of spokes around the hub and the probabilities weighting them are directly related to the uncertainty at the attacker side. The uncertainty reflects “an attacker’s mistakes”. Based on the characteristics of vehicular communications and the previously identified attacks on location privacy (cf. Section 2.1.3), we assume that an attacker can make two kinds of mistakes in the context of a location privacy attack:


1. The attacker might be misled into believing that an individual makes a trip which is actually caused by another individual. We call this the mis-linking mistake.

2. When tracking a vehicle’s movement, an attacker might mistakenly believe that the vehicle makes a trip which it does not. We call this the mis-tracking mistake.

Figure 5.3 gives an example of these two mistakes. In Figure 5.3(a), the attacker mistakenly links i2 to i1’s trip from o1 to d1. In Figure 5.3(b), the attacker confuses two intersecting vehicle trajectories and tracks a vehicle from o1 to d1, although its trip actually ends at d2. In the same figure, the attacker tracks a vehicle from o3 to d'3 and mistakenly believes that the vehicle has made a trip from o3 to d'3, whereas the actual vehicle trip ends at d3.


Figure 5.3.: Example of the attacker’s mistakes: (a) mis-linking, (b) mis-tracking

Where privacy is concerned, any uncertainty at the attacker side works to the users’ advantage. From the attacker’s point of view, however, this uncertainty is undesirable. Assuming that the means to gather raw data (e.g., by eavesdropping on vehicular communications) are limited, one option for the attacker is to exploit information processing techniques to reason about and reduce the uncertainty in the obtained information. In Chapter 4, we showed that an attacker can exploit the accumulated information in a sequence of snapshots to reduce uncertainty. In this chapter, we focus on another possibility, in which an attacker exploits the interrelations among the individuals when the individual-trip information is reasoned about in a system-wide way. Intuitively, two individuals establish a certain relation if they can both be linked to the same trip, but only one of them actually causes the trip. Consequently, the other individual should be ruled out from consideration. In Chapters 3 and 4, the individuals are considered separately, such that the possible interrelations among them and the consequences on location privacy have not been


investigated. In this chapter, we extend the location privacy measurement from the local view, which considers only one individual at a time, to the global view, which considers all individuals in a snapshot at the same time.

5.1.2. Location privacy in global view

To understand the problem of location privacy in global view, let us consider the simple example illustrated in Figure 5.4. Imagine two individuals, i1 and i2, that are both captured in the same snapshot. In this snapshot, i1 can be linked to trips T1 and T2, and i2 can be linked to trips T2 and T3.¹ We further use pc to denote the probability of “no trip”, i.e., the probability that an individual cannot be linked to any of the trips. Figure 5.4(a) shows the local view, in which i1 and i2 are considered separately, although they can both be linked to the same trip T2. If we correlate the two individuals on T2, we obtain the new structure shown in Figure 5.4(b).


Figure 5.4.: Example of two individuals in local and global views: (a) considering i1 and i2 separately; (b) considering i1 and i2 together

Interestingly, although the individual-trip relationships and the probability distributions have not changed, the new structure in Figure 5.4(b) gives us a new perspective, which makes it possible for the attacker to exploit the interrelations. To reduce the uncertainty, the attacker can relate the individuals to the same trips and reason about the information as a whole. Consider the example in Figure 5.4(b): since i1 and i2 are mutually exclusive as the cause of T2, if the attacker knows that i1 is more likely to have caused T2, then i2 is less likely to have caused the same trip. Consequently, the attacker can decrease the probability of i2 on T2 and increase i2’s probabilities on the other possibilities (i.e., T3 and pc) accordingly. In the same way, the attacker can reason about the information and update the elements in i1’s probability distribution. In general, the attacker can apply this technique to an arbitrary number of individuals in a snapshot.

¹ For the sake of simplicity in the formalization, we use Tk instead of Tjk to denote a trip. We assume that a lookup table exists to map trip Tk to Tjk and to explicitly specify origin oj and destination dk.


As the probability distributions become related in global view, a change to an element in one of the probability distributions will have a cascading effect on the elements of the other probability distributions. Intuitively, global reasoning can reduce uncertainty. This is also in accordance with the theorem that “conditioning reduces entropy” [97], known from information theory, which states that if X and Y are two random variables, knowing Y can reduce the uncertainty in X, formulated as

H(X|Y) ≤ H(X)    (5.1)

with equality if and only if X and Y are unrelated (independent). The problem of location privacy in global view leads to an interesting research question: “How to quantitatively process and exploit interrelated information?”
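To make the effect of Equation (5.1) concrete, consider a small numerical sketch in the spirit of the example in Figure 5.4 (the probabilities and the assumed certainty about i1 are ours, purely for illustration):

    import math

    def entropy(probs):
        """Shannon entropy in bits; zero-probability terms are ignored."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical prior of individual i2 over {T2, T3, no-trip}.
    prior_i2 = [0.3, 0.4, 0.3]

    # Illustration only: suppose the attacker becomes certain that i1, not i2, caused T2.
    # Because i1 and i2 are mutually exclusive causes of T2, i2's probability on T2 drops
    # to zero and the remaining mass is renormalized over {T3, no-trip}.
    posterior_i2 = [0.0, 0.4 / 0.7, 0.3 / 0.7]

    print(round(entropy(prior_i2), 3))      # ~1.571 bits of uncertainty before global reasoning
    print(round(entropy(posterior_i2), 3))  # ~0.985 bits afterwards

The drop in entropy corresponds to the loss of location privacy that such global reasoning can inflict on i2.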

5.2. Prior art

Although taking global information into account when determining individual privacy is a sound approach in privacy research, little work has been done so far. Troncoso et al. [106] have approached this issue in a different setting. In their work, they proposed to consider all users in a mix round when developing a strong attacking method to de-anonymize mix communication networks. Mix networks, first introduced by Chaum [51], aim to achieve anonymous communication between a message sender and its recipient. Mix networks can be applied for anonymous email or e-voting systems. A mix is an intermediate node on the message forwarding route. It helps the sender and recipient to achieve anonymity by collecting a certain number of messages in a round and then forwarding them in a random order to the next mix. The anonymity increases if a message traverses the network through a chain of such mixes. An attacker on mix networks tries to de-anonymize the communication by traffic analysis, i.e., by intercepting messages going in and out of a mix to deduce patterns in the communication and to uncover sender-receiver relationships. Since the sender-receiver relationships are “mixed”, an attacker might not have all the information. The attacker’s uncertain knowledge is expressed in probabilities. Troncoso et al. propose a perfect matching disclosure attack. Their approach is to map the de-anonymization to the problem of perfect matching in a weighted bipartite graph. The sender-receiver relation is modeled in a bipartite graph G = (S ∪ R, E) whose nodes can be divided into two disjoint sets, S for the senders and R for the receivers. Each edge e ∈ E connects a node from S with a


node from R. Based on his knowledge of the system, an attacker assigns probabilities to the bipartite graph as the weights w of the edges. The weights express the sender-receiver relations from the attacker’s perspective. As illustrated in Figure 5.5, a perfect matching means that every node in S is connected by an edge to exactly one node in R.


Figure 5.5.: Perfect matching bipartite graph Our global reasoning approach has some similarities to the perfect matching disclosure attack. For example, we could have also modeled the individual-trip linkabilities as two disjoint sets of vertices in a bipartite graph. However, the approach to find the perfect match of a bipartite graph is not appropriate here. First, in mix networks, the number of messages that enter and leave a mix are assumed to be equal. Consequently, a perfect match of a bipartite graph can be employed to find the most likely senderreceiver relation mathematically. In our case, despite the fact that in reality, individualtrip relations are mutually exclusive, we assume that multiple individuals might be linkable to multiple trips to reflect the uncertainty in an attacker’s observation of the wireless vehicular communication systems. Therefore, we cannot map our problem to the perfect matching of a bipartite graph. Second, despite using probabilities to capture an attacker’s knowledge on the sender-receiver relations, the perfect matching disclosure attack seeks a deterministic solution. The result of the perfect matching disclosure attack is indeed an approximation of the one-to-one relations of two sets of items of interest that ignores uncertainties involved. Our goal is to use global reasoning to refine the privacy measurements, which involves measuring uncertainties. Hence we cannot transform the probabilistic problem to a deterministic solution. Edman et al. [107] propose a “system-wide” metric to quantify the degree of anonymity in a mix system. Unlike most of the work on anonymity-based metrics that focus on the level of anonymity from a single user’s perspective, they claim to consider interdependence between anonymity sets of different users. In their approach, the send-recipient relationships are modeled as a bipartite graph G. Then the number of possible perfect matchings in G is counted, which is equivalent to calculate the permanent of the graph’s

107

adjacency matrix. However, their approach produces a generalized privacy measurement of a system without really exploiting the interdependence among the inputs and outputs of the system.

5.3. Measure location privacy in global view The inequality relation in Equation (5.1) expresses a quite intuitive concept: the uncertainty (measured in entropy) on an information source will decrease if one also knows other information sources. However, to quantitatively calculate the decrease in uncertainty and to apply it to our specific problem, we need to consider the following three requirements: 1. The calculation of entropy is based on the underlying probability distributions. Although we can establish some “interrelations” among the individuals with mutual trips, we need to find a more rigorous way to calculate the probabilities influenced by such interrelations. 2. Due to the size of VCS, a large number of vehicles will be presented in privacy measurements. The method of global reasoning should be scalable. 3. As discussed in Section 5.1.2, intuitively, the probabilities in global view will exhibit some sorts of cascading effect, in which a change of a probability at one place will have influences on other probability distributions. Thus the global reasoning should be able to propagate probability changes in the whole system. All these issues make location privacy in global view intriguing and challenging. In the following, we will show how we address these issues.

5.3.1. Our approach In Sections 5.4 to Section 5.6, we will show how our approach measures location privacy in global view and how it fulfills the aforementioned three requirements. In general, our approach breaks down into four important steps: • Step 1. To relate the individuals to mutual trips, we use a special type of graphic model – Bayesian networks (BN) to define the individual-trip relations. We also use BN to capture the probabilities on such relations. The BN model encodes the information contained in a snapshot in an abstract form. This, in turn, facilitates our later calculations.

108

• Step 2. The BN model contains fully specified conditional probability distributions (CPD) of the individual-trip relations. To measure location privacy of an individual, we need a subset of the conditional probabilities from the BN. To extract the probabilities related to an individual’s location privacy, we design targeted probabilistic queries to obtain relevant conditional probabilities from the BN. • Step 3. To answer these probabilistic queries, we derive an inverse conditional probability formula for calculating the corresponding conditional probabilities and answering the queries. Since the individuals and their possible trips are interrelated in the BN, the answers to the queries take into account the collective relations of other individuals. The results update the previous probability distributions. We call the updated version “posterior probability distributions” and the original ones “prior probability distributions”. • Step 4. Based on the updated probability distributions, we use the same procedure as described in Chapter 3 to quantify the information on the individual-trip linkabilities into entropy. The entropies reflect each individual’s location privacy in global view. Figure 5.6 summarizes the above steps. The details will be elaborated in the following sections.

Hub-­‐and-­‐spokes  

Step 1   Modeling   Bayesian  networks   (Graphic  model)  

Step 2  

Probabilis