L1 Top-k Nearest Neighbor Searching with Uncertain Queries

Haitao Wang¹ and Wuzhou Zhang²

arXiv:1211.5084v4 [cs.CG] 18 Aug 2013

¹ Department of Computer Science, Utah State University, Logan, UT 84322, USA. [email protected]
² Department of Computer Science, Duke University, Durham, NC 27708, USA. [email protected]

Abstract. In this paper, we present algorithms for top-k nearest neighbor searching where the input points are exact and the query point is uncertain under the L1 metric in the plane. The uncertain query point is represented by a discrete probability distribution function, and the goal is to efficiently return the top-k expected nearest neighbors, which have the smallest expected distances to the query point. Given a set of n exact points in the plane, we build an O(n log n log log n)-size data structure in O(n log n log log n) time, such that for any uncertain query point with m possible locations and any integer k with 1 ≤ k ≤ n, the top-k expected nearest neighbors can be found in O(m log m + (k + m) log² n) time. Even for the special case where k = 1, our result is better than the previously best method (in PODS 2012), which requires O(n log² n) preprocessing time, O(n log² n) space, and O(m² log³ n) query time. In addition, for the one-dimensional version of this problem, our approach can build an O(n)-size data structure in O(n log n) time that supports queries in O(min{k, log m} · m + k + log n) time, and the query time can be reduced to O(k + m + log n) if the locations of Q are given sorted. In fact, the problem is equivalent to the aggregate or group nearest neighbor searching with the weighted Sum as the aggregate distance operator.

1 Introduction

Top-k nearest neighbor searching is a fundamental and well-studied problem, due to its wide range of applications in databases, computer vision, image processing, information retrieval, pattern recognition, etc. [5,11]. In general, for a set P of points in the d-dimensional space R^d, the problem asks for a data structure that can quickly report the top-k nearest neighbors in P for any query point. In many applications, e.g., face recognition and sensor networks, data is inherently imprecise for various reasons, such as noise or multiple observations. Numerous classic problems, including clustering [13], skylines [1,25], range queries [3], and nearest neighbor searching [4,29], have been cast and studied under uncertainty in the past few years.

In this paper, we consider top-k nearest neighbor searching where the query data is uncertain. Further, we focus on distances measured by the L1 metric, which is appropriate for applications like VLSI design automation and urban transportation modeling (the "Manhattan metric"). This problem has been studied by Agarwal et al. [4], and we propose a better solution in this paper. The same problems with the Euclidean distance measure and the squared Euclidean distance measure were also studied in [4]. The converse problem model, where the input data are uncertain and the query data are certain, was also considered in [4]. Refer to [4] for motivations of these problems.

1.1 Problem Statement, Previous Work, and Our Results

An uncertain point Q in the d-dimensional space R^d (for d ≥ 1) is represented as a discrete probability distribution function fQ : Q → [0, 1]. Instead of having one exact location, Q has a set of m possible locations: Q = {q_1, . . . , q_m}, where q_i has probability w_i = fQ(q_i) ≥ 0 of being the true location of Q, and ∑_{i=1}^{m} w_i = 1. Throughout the paper, we use m to denote the number of locations of any uncertain point Q; m is also known as the description complexity of Q [4]. For any two exact points p and q in R^d, denote by d(p, q) the distance between p and q. For any exact point p and any uncertain point Q, their expected distance, denoted by Ed(p, Q), is defined to be

    Ed(p, Q) = ∑_{i=1}^{m} w_i · d(p, q_i).
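For concreteness, the following short Python sketch (our illustration; the tuple representation of points is an assumption, not notation from the paper) evaluates Ed(p, Q) under the L1 metric in the plane:

def expected_l1_distance(p, locations, weights):
    # p: an exact point (x, y); locations: the m possible locations of Q;
    # weights: their probabilities w_1, ..., w_m, summing to 1.
    # Returns Ed(p, Q) = sum over i of w_i * d(p, q_i), with d the L1 distance.
    return sum(w * (abs(p[0] - q[0]) + abs(p[1] - q[1]))
               for q, w in zip(locations, weights))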

Let P be a set of n exact points in R^d. For any uncertain query point Q and any integer k with 1 ≤ k ≤ n, the top-k expected nearest neighbors (top-k ENNs) of Q in P are the k points of P whose expected distances to Q are the smallest among all points in P; we denote by S_k(P, Q) the set of the top-k ENNs (in particular, when k = 1, S_1(P, Q) is the ENN of Q in P). Given a set P of n exact points in R^d, the problem is to design a data structure to quickly report the set S_k(P, Q) for any uncertain query point Q and any integer k with 1 ≤ k ≤ n.

In this paper, we consider the L1 distance metric in the plane. Specifically, for any two exact points p = (x(p), y(p)) and q = (x(q), y(q)), d(p, q) = |x(p) − x(q)| + |y(p) − y(q)|. We build an O(n log n log log n)-size data structure in O(n log n log log n) time that can support each query in O(m log m + (k + m) log² n) time. Note that we also return the expected distance of each point in S_k(P, Q) to Q (the points of S_k(P, Q) are actually reported in sorted order by their expected distances to Q). Previously, only approximation and heuristic results were given for this problem [21]. For the special case where k = 1, Agarwal et al. [4] built an O(n log² n)-size data structure in O(n log² n) time that can answer each (top-1) ENN query in O(m² log³ n) time. Hence, even for the special case where k = 1, our result is better than that in [4] in all three aspects: preprocessing time, space, and query time.

For the one-dimensional version of this problem, our approach can build an O(n)-size data structure in O(n log n) time with O(min{k, log m} · m + k + log n) query time, and the query time can be reduced to O(k + m + log n) if the locations of Q are given in sorted order. Note that in the 1-D space, the L1 metric is the same as the L2 metric. For the L2 metric, only approximation results have been given in R^d for d ≥ 2, e.g., [4,20].

We remark that although ∑_{i=1}^{m} w_i = 1 in our definition, our results are applicable to the general case where ∑_{i=1}^{m} w_i ≠ 1. Hence, the problem is equivalent to the aggregate or group nearest neighbor searching where the aggregate distance function uses the weighted Sum as the operator [18,19,20,23,24].

1.2 Related Work

Different formulations have been proposed for nearest neighbor searching when each uncertain point is represented by a probability distribution function. In the formulation of probabilistic nearest neighbor (PNN), one considers the probability of each input point being the nearest neighbor of the query point. The main drawback of PNN is that it is computationally expensive: the probability of each input point being the nearest neighbor depends not only on the query point but also on all the other input points. The formulation has been widely studied [6,8,9,10,17,21,26,29]. All of these methods were R-tree-based heuristics and did not provide any guarantee on the query time in the worst case. For instance, Cheng et al. [8] studied the PNN query that returns those uncertain points whose probabilities of being the nearest neighbor are higher than some threshold, allowing some given errors in the answers. Recently, Agarwal et al. [2] presented nontrivial results on nearest neighbor searching in a probabilistic framework.

In the formulation of superseding nearest neighbor (SNN) [29], one considers the superseding relationship of each pair of input points: one supersedes the other if and only if it has probability more than 0.5 of being the nearest neighbor of the query point, where the probability computation is restricted to this pair of points. One returns the point, if one exists, that supersedes all the others; otherwise, one returns the minimal set S of data points such that any data point in S supersedes any data point not in S.

In the formulation of expected nearest neighbor (ENN), one considers the expected distance from each data point to the query point. Since the expected distance of any input point depends only on the query point, efficient data structures are possible. Recently, Agarwal et al. [4] gave the first nontrivial methods for answering exact or approximate expected nearest neighbor queries under the L1, L2, and squared Euclidean distances, with provable performance guarantees. Efficient data structures are also provided in [4] for the case where the input data is uncertain and the query data is exact.

When the input points are exact and the query point is uncertain, the ENN is the same as the weighted version of the Sum aggregate nearest neighbor (ANN), which is a generalization of the Sum ANN. Only heuristics are known for answering Sum ANN queries [19,20,22,23,24,27,28]. The best known heuristic method for exact (weighted) Sum ANN queries is based on R-trees [24], and Li et al. [20] gave a data structure with 3-approximation query performance for the Sum ANN. Agarwal et al. [4] gave a data structure with a polynomial-time approximation scheme for ENN queries under the Euclidean distance metric, which also works for Sum ANN queries.

In the following, Section 2 gives our results in the 1-D space, which are generalized to the 2-D space in Section 3; one may view Section 2 as a "warm-up" for Section 3. Section 4 concludes the paper.

For simplicity of discussion, we make a general position assumption that no two points in P ∪ Q have the same x- or y-coordinate for any query Q; we also assume that no two points of P have the same expected distance to Q. Our techniques can be easily extended to the general case. Throughout the paper, we use Q to denote the uncertain query point and assume k < n. To simplify the notation, we will write Ed(p) for Ed(p, Q), and S_k(P) for S_k(P, Q). For any subset P′ ⊆ P, denote by S_k(P′) the set of the top-k ENNs of Q in P′. For any point q ∈ Q, let w(q) denote the probability of Q being located at q. Let W = ∑_{q∈Q} w(q).

2 Top-k ENN Searching in the 1-D Space

In 1-D, all the points in P lie on a real line L; we assume L is the x-axis. For any point p on L, denote by x(p) the coordinate of p on L. Consider any uncertain point Q = {q_1, . . . , q_m} on L. For any point p on L, the expected distance from p to Q is Ed(p) = ∑_{q∈Q} w(q) · d(p, q), where d(p, q) = |x(p) − x(q)|. Given any Q and any k, our goal is to compute S_k(P), i.e., the set of the top-k ENNs of Q in P.

For a fixed uncertain point Q, a point p on L is called a global minimum point if it minimizes the expected distance Ed(p) among all points on L. Such a global minimum point on L may not be unique. The global minimum point is also known as the weighted Fermat-Weber point [15], and as shown below, it is very easy to compute in our problem setting.

To find S_k(P), we use the following strategy. First, we find a global minimum point q* on L. Second, the point q* partitions P into two subsets P_l and P_r, for which we compute S_k(P_l) and S_k(P_r). Finally, S_k(P) is obtained by taking the first k points after merging S_k(P_l) and S_k(P_r), as sketched below.
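A sketch of this strategy (our illustration; the per-side top-k computation shown as a naive sort is a placeholder for the efficient data structure developed in this section):

def top_k_via_split(P, q_star, k, ed):
    # Split the coordinates in P at a global minimum point q_star, find the
    # top-k on each side, then merge: Sk(P) consists of the k points with
    # smallest expected distance among Sk(Pl) and Sk(Pr) combined.
    Pl = [p for p in P if p <= q_star]
    Pr = [p for p in P if p > q_star]
    top_k = lambda S: sorted(S, key=ed)[:k]  # placeholder for the paper's structure
    return sorted(top_k(Pl) + top_k(Pr), key=ed)[:k]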

Note that the points in Q may not be given sorted on L. Recall that W = ∑_{q∈Q} w(q). Let q* be the point in Q such that

    ∑_{x(q) < x(q*)} w(q) < W/2   and   w(q*) + ∑_{x(q) < x(q*)} w(q) ≥ W/2.
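This rule makes q* a weighted median of the locations of Q, so it can be found with one sort; a minimal sketch (our illustration, assuming distinct coordinates per the general position assumption):

def global_minimum_point(locations, weights):
    # Returns the weighted median q*: the leftmost location whose prefix
    # weight, including its own, reaches W/2. Runs in O(m log m) time.
    W = sum(weights)
    prefix = 0.0
    for x, w in sorted(zip(locations, weights)):
        if prefix + w >= W / 2:
            return x
        prefix += w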