Spatial Hash-Joins*
Ming-Ling Lo
Chinya
Department University
of EECS
Ann
mingling@eecs.
University
Arbor
Arbor,
MI 48109
joins,
and
Our
spatial
define
partition
set of bucket may
map
a new
framework
functions
extents
paradigm
for
have
two
item
into
multiple
buckets.
functions
for
the
input
two
hash-joins.
components:
an assignment
function,
a
which
Furthermore,
datasets
may
be
and tested
on this
framework.
The
inner
dataset
is initialized
by
evolves outer
as data dataset
from
are inserted.
mirrors
dataset
relational
a wide
range
Our
spatial
a wide when
margin. the
both
when the input
when
the datasets
1 Relational
compared well-studied difficulties spatial part,
a spatial
hash-join,
facilitating
systems.
Second,
method
Our
method
applicable
our
In this the
to
method
outperforms
on tree
matching
based
its
performance have
hash-join
datasets
is superior
pre-computed
method
highly
are dynamically
have pre-computed
by even
indices.
hash
joins
to
buffer in the
peculiar
directly join with
for
sizes.
The
relational
applicable seeded trees
joins,
to the spatial have
been
that
and
large
paradigm
method domain.
tree-based
[8, 9] being
excellent are
However,
this
and
with
for
a particularly
is
due
to
joins
the
efficient
function
in
our
a spatial
hash-
counterpart, our
method
a variety
of data
method
partitions
our
partition
phase
join
results
relational
joins,
framework
comprises
and may
in
sets,
a data
functions
its
and
then
the
join
a partition
two
an assignment map
and the partition
joins
proceed
step.
the
We view
issue;
components: function.
item
into
for the two
with
in two stages:
[10].
on
tree
makes
the input
our
The
multiple datasets
show
have
step
step
and
on the filter
as an important
in this
that
current
matching
our
step
the filter focuses
but
should
also
method.
methods
datasets
paper
refinement
outperforms
the tree-based
This
any innovation
experiments
method
they
SIGMOD '96 6/96 Montreal, Canada © 1996 ACM 0-89791-794-4/96/0006...$3.50
propose methods.
We evaluate
produce
extents
step
This
hash-join
the
unlike
the refinement
based
and
differ.
Our
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
to
function
be usable
for
stream
of applying
spatial
with
during
buckets
orthogonal by the Consortium Networking.
difficulties joins,
hash-joins,
buckets
However,
Spatial
most
in input
data.
phase.
buckets,
or other
to estimate.
spatial
advantages.
contrast,
characteristics
to its relational
experiments
relational into
In
we implement
similar
real-life
input
may
Existing the
similar
assignment
has not
designing
very
a set of bucket
approach. This work was supported in part International Earth Science Information
framework,
on this
Like
hash-join
domain.
to spatial
methods
relations
Based
algorithm
the
to
may
facilitating
matching on
harder
we discuss
for
join
on tree
costs
be easily
paradigm
costs,
or ordering
paradigm
a framework
including
indices.
[1, 2, 3, 4, 5, 6, 7] yield
particularly
paper,
hash-join
this in
mostly
thus
existing
can
optimization.
depend
and
and
following
distribution data,
by conducting
competitive generated
input
with hash-join
sizes
based
existing
to implement
integration
and
joins
advantages
First,
relational
predictability
can
as spatial
of the
join
other
modified
data
planning
indices
yield
efficiency.
be easily
input
to greater
query
such
item
The
on
A spatial
lead
spatial
for the
a data
mainly
may
the costs of spatial
and
may
database
spatial
the
Introduction
performance,
been
aspects.
methods
makes the spatial
for
function
It is therefore
that
Further,
This
method
dataset,
replicate
in other
algorithms
tree-based
partition may
MI 48109
besides
software
estimated.
joins. show
join
the
buckets,
indices.
experiments
current
sampling
multiple
hash-joins
of spatial
function
but
into
needs no pre-computed
hash-join
partition
The
is immutable,
the outer
a spatial
domain
hash-join
also
We have designed
Arbor,
paradigm
spatial
depend
different. based
hash-join
in the
to spatial
spatial
and
a data
the partition
the hash-join
Ann
Arbor
ravi@eecs.umich.edu
The how to apply
of EECS
of Michigan-Ann
1301 Beal Avenue,
umich.edu
Abstract
We examine
Ravishankar
Department
of Michigan–Ann
1301 Beal Avenue,
V.
method
by
wide
are given highly
are dynamically
pre-computed
our spatial
indices.
spatial join
margins,
hash-join algorithms even
pre-computed competitive generated
when
indices. both and
when when
This
paper
discusses in
is
applying
Our in
organized
related
work.
the
hash-join
framework Section
spatial
for
implementation. in
and
Section
7.
results
our
of
a
6 presents
Bucket 1
its
experiments
8 discusses
the
joins.
presented
design
Section
of
Section
9 concludes
is
our
2
difficulties
spatial
joins
presents and
the to
hash
5
The
Section
paradigm
algorithm,
Section
follows.
3 studies
spatial
Section
4.
hash-join
given
as
Section
related
issues,
paper.
Figure
1:
belong
to
with
Related
2
forms
method
does
require
tree
and
tree
method
[8,
spatial
costs.
The
require for
18,
when
and
the
the
are
commonly
used
as
can
used
to
and exist.
Brinkhoff
two
method
[8, 9] may
delivers
better
input
no into
method indices,
trees
of and
manageable
plane-sweeping. Sections
and
and
disk
also
equality.
be used
appears
and
Patel
and
operates
identifies overlap. the
a
Suppose
when
on
A
there
first
[22]
also
dividing
are
them
method
is discussed
using
further
in
8.
with
characteristics
direct
application
joins.
First,
Spatial
preserving total
do objects
away
in
application
[13,
sets
predicates
a natural
Although
do not
intersect,
X
A and X
into
will
identifying
in
space This
objects, spatial can
is
that
for handling
are often
much
the
of the join
based
works
defining
such
total be
prevents
of the
far
direct
databases.
relational
hash-join
well fully
classes.
so
to spatial
but
that
buckets.
attribute
another,
B we
X).
Unless
we will
always
how
are
we divide
more
of
crucial split
term
this
over
have
the
with
join
predicates
equi-join
buckets.
Thus,
to match
objects
a coherent
buckets
buckets
equivalence performance
is that across
values method
create these
to the
method not
equivalence
hash-join
functions
or
we do not
to
the
relational
most
We
values
A and
If we hash
predicate
partition
hash-join
predicate,
a join
constructs
The
property
given
of matching
(B,
bucket,
equality
one
classes
across
the
because
equivalence
of
spatial
by
include The
buckets. a spatial
of them.
no matter
into
buckets.
attribute.
relational
pairs
pair
one single
effectively
induced
two for
B and Y into
pairs,
between
equi-join
order
closeness,
equi-joins, more
for
sometimes
difficulty
techniques
difficulty
total
objects
classes
into
intersecting
the
X
be certain
Although
the intersecting
all objects
some
and
and
three
dataset,
an equi-join
same
both
and
one
A
across
intersects
one bucket,
to
tuples
Y).
we
B, we still
another,
(B,
B.j,
intersect.
belong
1 shows
#
intersect X
the
and
For an
if we know
performing
do
and
A.j
B
pairs
attribute
hand,
In
into
cannot X)
and
other
hash
B
Figure (B,
An
to spatial
join.
miss
could
matching
spatial
join join,
A, B and X,
B and
and
tuple
(A, X),
prevent
techniques
spatial preserve
of relational is optimized
algorithms
10, 14, 23] exist
of
uniformly
ordering.
second
method
closeness.
adjacent
the
lack
joins
A
the
A by A.j.
A does not
whether
we
whose
A.j
and
intersection
objects
=
spatial
suffice.
predicate
the
identify
identical
j of object
On
hash
the same
complicated
not
spatial
three
another.
objects
miss
Joins
spatial
join
data
curves over
not
to
of relational
spatial
orders
orders
intrinsic
spatial
on space-filling
join
no
the the
but
we
and
intersection
the
the
The
X,
j,
bucket,
with does
if X.j
to
attribute
one
objects
both
B.j.
more
bucket
value
about
X
other
of objects
objects
object
must
at
#
nothing
and
hand,
pairs
A intersects
know
on the
Consider
X.j
identical
into
Consider
predicate,
with functions
relationships
attribute
that
of tuples partition values
same
arise
which
that
in Section
DeWitt
by
and joining
Difficulties
Two
two
the
levels.
we hash 3
B
Boxes
Problem
pairs
Placing
into
values
know
I/O
performance.
partitions
This
4.3.1
A and
another.
join-attribute
with
Unfortunately, PBSM
objects
Y to
buckets.
Their
joins,
objects
equi-join
R-trees,
algorithm CPU
than
denote
et al.
existing
matching reduce
seeded
be
Spatial
Difficulties
R-
find
identical
of
operator
sets.
on
exist,
of seeded
data
with
pairs
indices,
based
seeded-tree and
r-epresen-
spatial
20]
they
of an R-tree
discussion
The
hash
values.
attributes
19,
to join to
a divide-
6.1.
requires
represent
hash-joins
bucket.
based
on tree-like
large
those
R-trees
joins
based
tuples
in-
those
on seperational
databases,
of techniques
indices
Data
X and
Coherent-Assignment
join-attribute
tree
algorithms 12],
The
Relational
or
seeded
[16] proposed
sorting
a method
consists
brief
9].
spatial
proposed
lines
join.
and
as-
datasets
the
[11,
not
[17,
input
those
include
variants
in
collection
no
external
methods
its
facilitate
does
generally
These
indices
based
method
Tree-based
indices
the
and Schilling
algorithm
This
have with
10, 14], and
Guting
and-conquer
which
on join
[13,
[15].
tation.
for
an exception.
based
on z-ordering indices
joins
of pre-computation,
those
[21]
spatial indices
[8, 9] being
clude
but
for
pre-computed
other
spatial
dataset,
Work
methods
sumed
A simple one
dashed
3.1 Previous
Bucket 2
are
respect
assignment to
a join
predicate.
complicated.
Unfortunately,
spatial
do not
always
define
such
equivalence
to guarantee buckets
a coherent
with
respect
we cannot
divide
n groups,
and
in any We
this
difficulty
the
dimensions,
spatial
hashing
in
various
through this
and the
produced
join,
spatial
section.
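The coherent-assignment difficulty discussed in this section can be illustrated with a minimal sketch. It assumes axis-aligned rectangles written as (xmin, ymin, xmax, ymax) and a hypothetical single-valued "hash" that places each object in the left or right half of the map by its center; the function names, the split position, and the coordinates are all illustrative and are not taken from the paper. The sketch shows that joining only objects that hash to the same bucket, as a relational hash-join would, can miss an intersecting pair.

```python
# Rectangles are (xmin, ymin, xmax, ymax). All names and values are illustrative.

def intersects(r, s):
    """True if two axis-aligned rectangles overlap (touching counts)."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def cell_of_center(r, split_x):
    """Hypothetical single-valued hash: cell 0 left of split_x, cell 1 otherwise."""
    return 0 if (r[0] + r[2]) / 2 < split_x else 1

def same_bucket_join(inner, outer, split_x):
    """Relational-style join: compare only objects hashed to the same cell."""
    return {
        (i, o)
        for i, r in inner.items()
        for o, s in outer.items()
        if cell_of_center(r, split_x) == cell_of_center(s, split_x)
        and intersects(r, s)
    }

def brute_force_join(inner, outer):
    """Reference answer: test every inner/outer pair."""
    return {
        (i, o)
        for i, r in inner.items()
        for o, s in outer.items()
        if intersects(r, s)
    }

inner = {"A": (1, 1, 6, 3), "B": (7, 5, 9, 6)}
outer = {"X": (5.5, 2, 8, 4)}

missed = brute_force_join(inner, outer) - same_bucket_join(inner, outer, split_x=5)
print(missed)
```

Here A and X intersect but their centers fall on opposite sides of the split, so the same-bucket join never compares them. Avoiding such misses requires assigning an object to more than one bucket, matching a bucket against more than one bucket, or both.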
Hash-join
predicates
borrow
the
phases:
of
(a)
the
design from
(outer)
phase:
Join
into
place
buckets
phase:
join
ets to obtain
inner
using
dataset
or
maps
and
outer
dataset
partition
corresponding
inner
after
ob-
and
outer
For
the
hash-join one relating
buck-
object
other
to the join
Single
algorithms
input
conformed
partition
X
and
the
data
item
partition
function
to exactly
one
still
only
bucket.
For
two
(12, 22), (11, 21) and Single
matching:
joined The
with
partition
into
The will a join
of
bucket
pairs,
each
states
in exactly
of buckets
relation
pair
having bucket.
that
the second must
each pair.
in the join
buffer
Using assign
inner
X
During
We
pairs
phase
The
pair. Design
main
Multiple map Multiple join
assignment: an input
problem
is
object
matching:
us to
principle.
relax
context
The
index
into
partition multiple
A bucket
may
function
may
buckets. appear
built.
pairs.
Hybrids
be appropriate
and
endure
schedule
the
multiple-assignment we need With
may
22 (see Figure only
this
2(b)).
match
bucket
approach,
buffer
However,
object
X
is
twice. is particularly
joins.
[20].
It
The
complicates
never
to disk either
carefully
21 and
phase,
duplication
since
joins, is
example, buckets
spatial
for
or
hold order
12 is needed
flushed must
can
in the
pairs.
a problem.
this
It in multiple
trees
We
not
However,
be able
buffer pairs
bucket
again,
(21, 22).
indices
When been
and read
of
spatial
are: The
and
B in
buffer-management
buckets
problem
the matching (11, 21)
We believe
requires
or single-matching
alternatives
bucket
Alternatives
assignment
the single-assignment two
both
thrashing
Primary coherent
same
Since
A and
we must
the
have
buckets
into
and
we join
disk
of join the
phase,
(12, 21).
from
Also,
2(a).
objects
suppose
it would
thrashing
processing
written 4.1
time,
be read
the
relations
outer
one bucket
to be joined
is
relation.
and outer
“corresponding”
principle
appears
inner
of the outer
the inner
bucket
matching”
a pair
of the
one bucket
and one
bucket
call
bucket
divides
number bucket
“single
or outer
exactly phase
some
one inner
Each
I/O.
trivial.
pairs.
scheduling
and
disk
are (11, 21), (12, 22) and
matching
example,
objects
than join-
multiple-matching
both
pairs
bucket
remain
problems.
assigns
the
these
more
as in Figure
2 overlaps bucket
During
to identify
1, the
objects
I/O.
schedule
is no longer
Figure
the
1, the join
There The
in
hash
in dataset
(12, 21).
to two
phase,
phase:
assignment:
each
have
to the
example may
be
multiple-matching.
so we must
pairs
of
sizes may
a bucket
so as to minimize
buckets
function instead
in more
read
phase,
carefully join
data
with
to
is
difference
a partition
resulting
need
join
approach
conceptual
is that
occurs
may the
reads
dataset
principles,
by hash
approach.
to a set of buckets
drawback
inflation
during
approach
only is that
object
Its
we
identifying
functions.
The
partitioning,
size
once
results.
Relational
produced
A duplicative
multiple-assignment
hash-joins
inflated
However,
the join
the
an input
bucket.
No
hash-
and
of
a single
Our
relational
phase
spatial
table
(b)
multiple-assignment
simplicity.
relational
now
phase
jects
hash
approach.
by the
benefit
from
phase. Partition
A non-duplicative
produced
The
Buckets
buckets.
of the
(a)
algorithmic
of a spatial
inner
partition
realized
relational-join
(outer)
2:
multiple-matching
table
parameters
datasets
that
the
hash-
be
datasets.
inner like
the
algorithms
can
outer
the
the
framework,
has two
I
Figure
designing
Bucket Bucket 22
12
make
for
and
called
11
hash table for dataset 2
This
Framework
operand
dataset 1
Bucket Bucket
is,
are
Hash-Join
We
hash table
dataset 2
group.
problem. complicated,
the
hash table
dataset 1
appearing
more
for
hash table
into
same
thus
partitioning
are
algorithmic
call
inner
by
relation
objects in the
attributes
choices
framework.
hash-join
two
That
datasets
spatial
join
appropriate
to hash
problem.
this
spatial
terminology,
two
belong
framework
algorithms
for
of
our
the
impossible
that
and
Spatial
present
join
fact
a difficult
Our
We
that always
is
predicates. the
coherent-assignment
the
higher
in
it
of objects
join
objects
ensure
and
4
to spatial
pair
and
assignment
the
matched
call
classes,
is not
insertion a drawback are
necessary
to
of these some
also
suitable been
use of duplication
they
for
has
and deletion with
constructed update
two
main
situations.
them
data and once
approaches
in the
used in
with
spatial
of objects. structures used
once.
they
are
may
also
Spatial
4.2 Our
spatial
The
assignment
of buckets. design
function
maps
parameter.
Quite
the to
partition
differ.
the space is used
with
different
degrees
may
be hashed
based
representations,
spaces
spaces
Our
objects
on their
space.
Since
coordinate
are
extents
used
by
buckets
the
they
are used
pairs.
As
corresponds
objects
logically
suggestive for
There
the
The to
an object or to all The
the
stored our
main
with
to
(1)
are
An
The
on
is mapped
to,
(2)
of
to identify
join
space,
bucket
extent
a
spatial
(1)
relationships
An
bucket’s
In
individual
should
bucket to
bucket
be decided
bucket
by
map
partitioning
on object
extents
buckets?
distribution?
updated
as data
If so, how?
hash-join set
(3)
these
Some
(2)
for
be
be
to
the
contained
assigned
consists
a set
bucket
of
by to
two
it.
related
can
spatial
relationship
bucket
extents
y:
the
which
is based.
number
be assigned
be-
upon
of buckets
an
to.
pairs.
can
realize
to:
of (1)
(2)
to
all
buckets
(3)
to
the
bucket
assignment
assignment all
whose whose
the
of
single
bucket
used,
assignment
when rules.
so far
comprise
is
several Figure
for
buckets
the
partition
it,
it,
the
rules,
When
there
must to
satisfy
or
object.
several
rules
3 summarizes
assign
contain
multiplicity.
principle
to
overlap
is nearest
may
tie-breaking
are
extents fully
center
assignment
single-assignment series
whose
extents
criterion
on
criterion
buckets
the be
choose
the
primary
the design
space
functions.
Extents divide
the
buckets
extents
the
the
or
to buckets
multiplicity
assignment
must
and
locations,
extents,
with
not
the and
of objects
object
depending
or the multiple-matching
equal-sized the
overlap still
design
objects
examples
The
the
bucket
parameters
may
criterion: data
an object
consists
extents, join
object
relationship
but
function
Function
an input
at-
assign
it the most,
algorithm
the
input
it,
of bucket
its
object
assignment
object
object
overlaps
Assignment maps
on
extent,
tween
efficiency.
function
addition
the
determined
the
based
Assignment
extents
it may
overlap
Bucket
approximately
tion.
the
aspects:
bucket.
spatial
between
extent
extents
the
a
For example,
whose
for
affect
the
the
function
Assignment
a bounding
to bucket
assigns
and
made
choosing
cover
assigned
associated
counterpart
buckets
extents.
is generally
necessarily its
bucket
where
it is not to
and
a bucket
of the objects
extents.
Choosing partition
of all extents
or based
into
assignment
discussed
into
functions.
partition
overlap?
union
Choosing
4.2.2
they the
in the space
the multiple-assignment
4.2.1
the
statically
are inserted
determine
original
A
whose
and
for
are extents
space
Assignment
function,
approach,
does
Mutability:
approach
functions:
coverage
of a spatial
choices
either
spatial
Partitioning:
function
to a bucket
assignment
m
–d
area?
original
space,
space
do extents
Coverage:
in these
on hashing
generally
3: Design
Intersection:
in the
assigned
bucket
design
in higher
to hashing
phase
no obvious
buckets
they
though
the
reside.
of determining: The
Figure
space,
to a region
based
and
i
Y
3
I
intersection
hashing.
buckets
r
I
,
original
object
in
assignment
tributes
in
binary
on coordinates
function
spatial
objects
in relational
1
properties
points
are
two
However,
is also
assignment multiplicity
bucket extents
and
of their
original
assignment
of the spatial
to the bucket.
assignment criterion
transformations.
hash
extent
box
objects
by the join
we
transformed space
i
extent
example,
into
focuses
serve
an input
bucket
For
in the
in their
no additional
Bucket
original space
outer
contiguous)
patterns
hashed
limited
spatial
is a
addressing
various
efficacy.
locations
values
requires
of
on
on the bit
is not
and
in
i
to buckets.
implementation
framework
set
is to be computed,
be hashed
and
assignment function
to a set
inner
A
or be transformed
dimensional
our
the
c
hash-joins,
necessarily join
objects can
object
is useful
(not
spatial
in assigning objects
for
spatial hash function design
two
bucket.
of this
problem.
the
each
relational
feature
by
a set of bucket
with
a spatial
functions
to a region
Spatial
and
unlike
This
where
defined
cardinality
coherent-assignment
corresponds
are
associated
maximum
dataset
Design issues Design choices
function
extent
The
we allow
the
functions
an assignment
one bucket
U
Function
partition
components: extents,
Partition
the
input
sizes
set of bucket
properly
assignment and
following
4.2.3
datasets
by
In
func-
shapes
every
of
parameters
extents:
Join
an
bucket
outer
bucket
the
inner
this
matching
on the
Identifying
principle,
inner that
bucket’s
design
space
may
objects.
Bucket
contain We
significantly
of the inner
must
Pairs be
matched
objects
may
be able
in practice
and outer
partition
with
overlapping to
reduce
depending functions.
a a
When and
the
the
multiple-assignment
single-matching
bucket
need
outer
only
bucket.
is applied bucket and
and
pairs,
each
bucket
extent
inner
buckets for
bucket
can
need
identifying
only
to
its
pairs
that
every
assigned
to it,
with
the
extent.
thus
join
depend
considered
the
same
are
assigned
on
the
of
of
the
inner
outer
function:
choosing
nearest
center
As
bucket
may
be different.
object enlarged
the
bucket
of gravity
a
have objects
is enlarged
final
is
grid
being
datasets
extent
Each
extent
cell
extents.
its
the
regular
each
outer
bucket
datasets
whose
area,
and
a bucket,
Thus,
assignment,
n equal-sized, map
initial
to
Assignment bucket
components
set
and
with
whole
The
them.
inner
outer
the
extent.
enclose
Methods
Start
covering
bucket
join
2
extents:
cells
extents
identify
know
be joined
and are
function,
we
many
bucket
all objects
overlap
join-bucket
in
the
Example
Bucket
principle
appear
if
inner
corresponding
be used
contains
extents
its
of
example,
fully
whose
partition
may
properties
For
pairs
with
4.3.2
chosen
each
multiple-matching
functions
bucket’s
is
preserved,
joined the
known
buckets each
be
When
assignment
principle
property
extents
is assigned the
if there
the
to the
least
whose
to
for
after
extent
has
are ties.
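A partition function of the kind described here, a set of bucket extents paired with an assignment function, can be sketched as follows. This is only an illustration under stated assumptions: rectangles are (xmin, ymin, xmax, ymax), an object goes to the single bucket whose extent center is nearest to the object's center, ties are broken by least area enlargement (an assumed tie-break), and the chosen extent is then enlarged to enclose the object. The names `assign_single` and `partition_mutable` are hypothetical and do not come from the paper.

```python
# Rectangles are (xmin, ymin, xmax, ymax). All names are illustrative.

def center(r):
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def enlarge(extent, r):
    """Smallest rectangle enclosing both the extent and the object."""
    return (min(extent[0], r[0]), min(extent[1], r[1]),
            max(extent[2], r[2]), max(extent[3], r[3]))

def assign_single(obj, extents):
    """Index of the bucket with the nearest extent center.
    Ties are broken by least area enlargement (assumed tie-break)."""
    cx, cy = center(obj)

    def key(i):
        ex, ey = center(extents[i])
        dist2 = (cx - ex) ** 2 + (cy - ey) ** 2
        growth = area(enlarge(extents[i], obj)) - area(extents[i])
        return (dist2, growth)

    return min(range(len(extents)), key=key)

def partition_mutable(objects, initial_extents):
    """Single assignment with mutable extents: each object lands in one
    bucket, and that bucket's extent grows to enclose it."""
    extents = list(initial_extents)
    buckets = [[] for _ in extents]
    for obj in objects:
        i = assign_single(obj, extents)
        buckets[i].append(obj)
        extents[i] = enlarge(extents[i], obj)
    return buckets, extents

# Initial extents given as points (degenerate rectangles).
seeds = [(2, 2, 2, 2), (8, 8, 8, 8)]
objects = [(1, 1, 3, 3), (7, 7, 9, 10), (2, 2, 4, 5)]
buckets, final_extents = partition_mutable(objects, seeds)
print(final_extents)
```

Because every object is stored exactly once and each final extent encloses all objects in its bucket, the final extents can later serve as a filter for a second dataset.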
its design. Join 4.3 We
now
present
two
straightforward
join
adopts
multiple-assignment
the
the
based
on
multiple-matching
4.3.1
first
the
The
second
regions, Each
The
the
map
area
for example,
of these
inner
and
set of bucket
cells
the
with
thus
extent
bucket
with
method
The
first
may
be
extents
very
must
The
PBSM
buckets
may
be
objects
two bucket effort
when
extents,
may
they to
of
1.
Example
of grid
cell as a bucket several
equivalent
that
some
ways,
the
partition
divides
the
plane
PBSM cells,
In
but
extent,
instead
it avoids
possibly of a bucket
of using imbalance
non-contiguous
the
join
to be joined.
join
bucket
a bucket
during
the join
design
a given
optimal
A
pairs. may
If
need
phase.
since
feasibility
convincing
our
and
for
and will
We
goal
of our
may
are
join.
performance
method
predicates
parameters
spatial
method, the
hash-join
design
method
approach
It
differs
here
approach
gains.
The
be different
input
dataset
be the subject
We
data
for with
of future
relational hash
extents
whole
of of
by
to derive
extent.
that
im-
multiplesimplic-
only to
in that multiple
amount
of effort
implementations
to realize
are allowed
the
this
objects
in
method. reside,
thus
representations.
to overlap, space,
do not
and
based
Our cover
on
the
object
in space.
the
bucket
extents,
we discuss
only
open
methods.
efficiency
are
the
dataset
inner
a small
hash-join
where
when
the
object
the
result-
conceptual
a data
only
space
leaves
our
efficient
hash-join
of transforming
This Here
grid
maps
for
The
adopt
of its
relational
object-distribution
distribution
a large
costs
and We
because
databases
in the
the
design
to modify
method
framework.
6).
the
We expect
avoiding
data
in
function
be required
hash-join
our
Section
from
partition
existing
using
is simple
assignment
buckets.
remove
using
(see
its
on
a spatial
join
plemented
fall
buckets cells
above
characteristics,
now
will
a single
the
yields
join
function into
it
different
ity.
have
its
for
an
spatial
empty.
inflating
is necessary
resembles
outer
Choices
demonstrate
optimal
ing
spatial
nearly
seek to
We
sizes
[22] can also be interpreted
framework.
scheme
the
some
times
during
pairs
enough,
implementing
intersection
same
spurious
the
not
bucket
method join
for
pairs
the
bucket
on
and
research.
This
object
overlap
is that
is that
bucket
large
Design
choices
different
outer
results.
spatial
grouping
others
additional
partitioning number
and
eliminate
method
of inner
in multiple
in several
Several
and
may
cell).
matching
must
objects,
duplicate
between
Finally,
redundant
(grid
Depending
while
the boundary size.
We
of this
of input
we
to
object
of inner
objects
imbalance.
objects,
Second,
is assigned An
the
is not
feasible
is only
the join.
drawback
distribution many
pair
spurious
extents.
after
it.
extent
of matching
bucket
size
pair
method
track
participate
Our
5
of have
buckets.
each
same
produce
a pair
results
Join the
may
when
object
overlap
to multiple
pairs:
buckets
data
extents
be assigned
two
A
whose
Join-bucket
may
do
function:
all buckets
bucket
each overlap.
equal,
datasets
The
of this
we must
buffer
Join extents
a regular
is the
outer
extents.
with
are immutable. Assignment
drawback
to be read
non-overlapping of n cells
whose
phase,
the
grid
same
for
The
approach,
Tessellate
one bucket.
examples
framework.
1
extents:
the
our
approach.
Example
Bucket
pairs:
buckets
intersection
our
bucket
Examples
the
initial and
choices Design
discussed are:
in
values the that choices Section
and
the mutability
assignment affect
the
affecting 6.
Our
functions. correctness only choices
the for
Bucket
extents:
bucket
Initial
extents
value:
are
updated
see Section
6. Mutability:
intersection
to
all
and location
enclose
assigned
criterion
objects. Assignment tion
fund
6.
actly
joins.
ion:
Assignment
Multiplicity:
Each
criterion:
object
for
efficiency,
see Sec-
is assigned
This
one bucket.
box
these
of the
objects
contained that
choices
extent
assigned
in the
bucket.
a bucket
extent
The
to
it.
However,
of a bucket
design
choices
may for
not
such
an object
dataset
belong
rion
the
outer
to
for
to
are: extents:
=
final
Initial
inner
have
the
bucket
same
associate
one
Mutability:
ject it.
Identifying
each
inner
buckets
extents
and
inner
object
criterion:
whose may
criterion dataset
since
An
extents
ob-
overlap
be assigned
need not join
outer
now
bucket
the
A1, A2,
the design
results
join
the
with
phase
same
begins,
and
the
the
Theorem
design
given
these pairs
bucket
correct
the
corresponding
Let
the
When buckets
inner
buckets
corresponding
outer
For
each
objects
inner
that
dataset
overlap
object
in Ai,
it can be found
extents
Proof:
We
overlap
can
object
that any
p does
not
Since
Bi
extent. not
show
overlap
Ai’s
contained
any
dataset
Theorem
produces
Proof
We
copy
of any
to only with can
meet
considered Any design
have
being
When
the
to
outer
for join. choices
hash will
each and
Bi’s
objects
els tree
as
an inner object
one
method the
following correct
dataset
for
the
object
requirements nodes
to
over-
is minimized.
to
are very
of spatial
index
functions
would
constructing
spa-
when
The
the answer
above
slot
initial
our
present
Initially,
As
determine
and also largely into
subtrees.
of slots
extent
extents objects how
determine
can
be
and
a set
of
and the
are assigned the
to the objects.
objects
are assigned effectively the number
steps:
slots
a slot
the how three
are
in the
are points,
to enclose
Determining
involves
a seeded
technique.
terminology,
are updated
a lev-
Readers
is contained
slot
seed)
builds
of this
[9] to tailors
(or
dataset.
a bucket
is empty. extents
extents
its initial
underlying details
implemen-
technique Seeding
information
objects.
are grouped
for
full
represent
extents
in our
seeding
Bootstrap-seeding its
to
their
Extents
extents
by setting
4).
Using
Bucket
bootstrap
a join
from
set of objects
slots,
produce
set
extent
is minimized,
configuration.
key seed-level
9].
slots,
❑ join
requireare
dataset
partition
bucket
the
to [9] for
spatial
object
once
for
Figure
considered
is joined
dataset
at most
latter
extents
extents
Inner
initial
directly
All
is assigned
bucket
an inner
most
use
tree
(see
[8,
object
of the
bucket
developed
rectangular
their
seeded
overlap
identified at
techniques
over-
area
trees.
and
identify
p does
dataset
contains
two
an outer
our
extents
of an outer
for
that
the follow-
total
buckets
These
Determining
referred
inner
bucket,
dataset
overlap
extent,
pairs
bucket the
probability
requirements
suggests
from
We choose
of the join.
a bucket
Since outer
bucket
the
This
in-
approximately
(2)
When
of bucket
the
initial criterion
with
(3) The
assigning
perforFor
in Bi,
❑
answer
bucket,
one
spatial
all
area
the
contains
to multiple
is maximized.
bucket
tation,
not
the
assigned
the
assignment
bucket
extents.
to
buckets
outer
any
index
inner bucket and
the
of not
outside
p does
of the
methods.
inner
is minimized.
since
total
dataset
all inner
extent,
and object
aspect
to choose
objects,
probability
object
not
of
are minimized,
an outer
same
in Ai.
exact that
the
Since
Joining the
object.
any
/li
by Ai’s
one inner
If
and
know
exactly
A,.
to B2, it does
objects
2:
above
dataset
in
belong extent.
in Ai are inner
no outer
object
an
each
bucket
object
6.1
B,
objects
the object,
is crucial
it is best
as possible,
inner
tial
all
(1)
arise
trees.
buckets
to assign
other
hash-join
and
number
similar
be
the crite-
of overlapping
Every
im-
object
contains
contain
buckets of
choices,
as little
laps
and
joins.
equal
final
extent.
produces
to
assignment
function
phase.
to a set of final
an
6).
simply
same
lead
properties:
bucket
bucket
easily
same.
extents
ing
ments
is trivial
join the
affect issues
to find
object the
instead
stability
bucket
benefit
1:
dataset
above
any
B2, ..., Bk.
outer
and
lap
be processed
overlapping
intersection
extents.
Ak,
...,
for
reduce
dataset
extents
equal-sized
mance
ner
to mul-
at all (see Section
inner
our
to
to
not
pairs
Each
that
non-redundant
objects
be assigned
bucket
show
serves
an object
the stays
very
If we want
assignment
may These
6.
we modify
bucket
shape
Implementation
that
of parameters.
the
during
6
extent.
inner
containment
Producing
Assignment
An
the
choices
B1,
with
to all buckets
phase,
extent
have
the
outer
for
algorithm
are immutable.
of outer
join
bucket
the
extents
two
relabel
extent
assignment
number
in the
We
If
bucket
buckets. outer
total
extents.
function:
Multiplicity:
Our
with
outer
is assigned
tiple
Outer
extent,
extents
Assignment
value:
the
whose
function
be modified
queries.
object,
the
look
pairs Bucket
that
to buckets
dataset
also
initial
the assignment
correctness.
in Section
can
for
and
assignment
containment
pairs
choices
extents
algorithm
further
outer
is also a bounding
actual
bucket
inner
not
design
plement Under
the
but
are discussed
to ex-
The
of inner
to data and
After seed
levels
grown
levels
the
and
outer
sizes
of
grown subtree
growny
Figure
seed node
■
grown
4:
1. Determining the
the number
upper
of
and
average
number
of a seeded
number
of slots
we use the
node
Example
the
describing
as the
❑
tree.
slots
S.
lower
bounds
was derived
in [9].
of the
upper
A
formula
for
and
lower
can
be joined
the
filter
optimal
method
The
algorithm
some
the
multiple
3. Placing from
the
slot
data
of the
the S slots
sample to
input
center
number
and
sample
size
we “brute
choose
among
their
centers several
heuristic
efficiently.
in this
We
to
join”
into
a quadratic-split-cost objects
in overflow
does
Inner
spatial
and
determined
by
bootstrap-seeding
number initial
the
and the initial bucket
member
PI
hash-join
Obtain
initial
using
nearest-
objects.
Our
Bucket
extent
similar
to insertion
reduce and
inner
the
total
are likely
process
to become
assignment
whose
updating
of the
area
seeded
and
to do the
overlap
same
Outer
Dataset
least.
Set
outer
final
extents
partition
of their
the outer
dataset
object
If an object it
since
it
technique seed-level The
corresponding
to every overlaps
bucket
is irrelevant
to the
bucket-extent
filtering.
filtering
technique of
Join
now
algorithm!
pre-defined
threshold
grow
whole
to fill
to disk
mostly
to the
performance
in called
we write to disk
buffer.
in sequential of our
To
extent
overlaps
we can
inner batch
when
the
technique
I/O, method.
bucket
bucket
for
the
use
LRU
against
inner
each
bucket
as
dataset
inner
object
extents
bucket
extents a copy
whose
and
the
extents
after
to
final
inner
the
of each outer
extent
We
and writes.
call
the
it.
study
overlaps
writes
and contributes
pairs
object
to
the object.
to produce
for
result.
conducted join
join
a
join.
by
prior
The
first
copy-seeding
three
variants
tree
advantage,
we also
considered
is stored
so that
data at
all
some
noted
as
during
tree
RJ(I),
and
random
read
greatly
is given
two
performs
tree
during
buckets
tree
is
spatial
on
three
tree join
(SJ),
costs
the
construction
matching.
tree
is con[8].
We
experiments. and
then
join case
per-
a clear when
is no buffer (though
the
thrash-
there
may
method
is
de-
by disregarding indices,
is always
all
RJ(M)
construction.
R-tree
RJ(M)
using
tree
R-tree
This
is implemented
pre-computed
first in our
an ideal there
match-
second
dynamically offer
two
tree
the
the joins
matching), during
constructs
constructed
and
from
of R-tree
input
tree
( SJ)
technique,
To
ing
hash-join
other
to performing
R-trees
be
seeded
method
just
two
outer
spatial with
experiments
(HJ),
matching.
forms
contents
the
RJ constructs
this
our
join.
indices
structed
of
its performance
hash
seeded-tree
seeded-tree ing
behavior
We
spatial
R-tree
to the
than
the
and compare
study
As in the
larger
bucket
bucket
the bootstrap-seeding
discard
is analogous
all buckets
This
buckets.
[8, 9].
the
we
of
R-tree
be summarized
Assign
Update
Assign
methods.
and
set to the
of an outer
result. It
outer
outer
method, join
a copy
extents, join
both
uses a technique tree
inner
whose
no bucket
the
Experiments
We
nodes,
the
and
we assign
two)
are probed
can
on the
corresponding
7
heuristics
and
are immutable,
dataset,
partitioning
datasets seeded
extents
based
extents.
J 1 Join
The
bucket
If
buffer,
extents
criteria.
the
bucket
Phase The
of chances
are
extents.
Partitioning
This
(instead
objects
seeding.
bucket
methods: 6.3
then
it.
against the
inner
and
assigns
criterion
These
bucket
the
the
assignment.
every
of its
the
of seeded-tree
for
is
“indexed
are assigned
enlarges
trees.
P2
the The
the MBR
and the assignment
into
as
criterion
extent
slots
buckets.
As objects
enlarges
to a bucket,
extents
buckets
[17],
phase.
method
bucket
bootstrap
to one
of the inner
are points.
its extent
an object
initial
extents
extents
to a bucket,
the
of
be the
each
number
an
construct
reduces
overflow
for
[8] or the
bucket
as outer
work
look
a pair
join
Since
our
follows:
Partitioning
We use the
the
in
buckets
tree.
implementation.
Dataset
not
bucket
and
during
management
and
R-tree
one
buffer.
two
join.
first
outer
are
overhead.
do
in
We
only
the
assignment 6.2
[22].
of the
requires
exist
Our
heuristics
use the
join”
loop
buffer
the
to join
bucket
information
S clusters
use force
nested
in
I/O
bucket-bucket
to the
the
area using
In [9], we examined clusters
being
of slots.
We identify
objects,
identify
the
in the map
sample.
locations.
set,
the
Thus,
we
The
scheme
intensive
cost,
similar
the
size.
I/O
I/O
for
so constructed 2. Sampling
buffer
of inner
are joined.
our
additional
reducing
pairs
extent
using
is usually
on
buffer
of buckets.
the
without
focuses
to
bounds
than
step
approach
work,
is partitioned,
the same
partitioned
smaller
probe
choosing
In this
dataset with
buckets
general
subtree
outer
buckets
and the
simply most
Method
Description
HJ
Spatial
SJ
Bootstrap-seeding
Hash
R-tree
E
Join
join,
(both
indices
buffer
thrashing
R-tree
join,
Table
method
no pre-existing built
indices
before
while
join)
building
indices
two pre-existing
1: Competing
Table
indices
2: Experimental
methods were
chosen
randomly
O and a predefine efficient
among
the
phasize
that
depends
cannot
be
matching
it
applied step
all the
Table
1 summarizes
and
bytes.
bytes,
entries
pairs.
the
Also,
on I/O
costs
one
disk
When
filter
of disk
step.
blocks,
we used
a buffer to
of a 16-byte
in our
bounding
we focus
to consist
the
filter
step
box
on the
8-byte
ratio
range
7.1.1
of the
cost
be 5, unless
Choice
We
conducted
the
first
in our
study
is shown
established
the
conditions a series the
as well
as data algorithm.
We studied of
spatial
input
by
set
at
area.
of y data
rectangles
We denote
more The
the total
area area)
of the
clustering
by CCQ.
The
clustered length
the and
data the
generated
z
clustering
rectangles
To
cluster in
cluster
was
rectangles
of data
region
objects.
under both
and
Clustering
study
X and
Y
basic in
of
an
the
was
set
R-tree
the
0.04.
from
side
The
length
resulting
in the
centers
were
outer
of 4 levels,
on the
to
In
the
dataset
rectangles
that
dataset
of
inner
bound
clustering
outer
experiments.
cardinality
outer
of all the data
restricted
to
20%
of
the
effect the
at 40,000
degree
the
upper
of spatial
cardinality and
100,000,
of clustering
of the
bound
on
side
so that
the
CCQ
on side
length
dataset
was
and
sets.
length
of
of the
outer
We the
The
of
the
on
outer varied
adjusted clustering
dataset upper
rectangles
as that
data
and
respectively, data
clustering
same
of
inner
1.0, respectively.
of the the
clustering of the
was bound
of the
outer
inner
dataset
in
experiment.
rectangle. set
rectangles. rectangles
divided
clustering
the
we fixed
each
of the
upper
of the
study
the
by
of CCQ,
7.2
Basic
Figure
5 shows
an example
the
the worst
the
because at
of each
The
0.2, 0.4, 0.6, 0.8 and
a
set.
width
ex-
area.
rectangles
was
of the data
the value
of
the
1 along
Size
resulting
rectangles
in
map
joins,
the centers
of the clustering
map
O to
cardinality
quotient
datasets
distributed
of clustering
smaller
of
generating
distributed
each
the
data,
clustering
area of the clustering
quotient
the
to 80,000.
objects
sizes and degrees of
randomly
randomly within
100,000,
varied
dataset was 0.2, meaning
performance
When
we first were
the degree
the cover
degree
scheme.
centers
We then
control
by controlling (total
of varying
The
a simple
whose
We could
map
datasets
fixed
tests and
real-life
degraded
In
per
number
the
series
we
CCQ
and robustness
included
to induce
of z x y objects,
rectangles, the map
designed
tests
total
area.
boundary
objects
Data
used under
encountered,
the stability
stability
clustering.
controlled data
confirmed
of basic
HJ method
of the
to be frequently
The
in the
parameters
2, A series
performance
of tests
method.
and other
in Table
expected
two
of clustering
of data
the
clipped.
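The clustered-data generation described in this subsection can be illustrated with a minimal sketch. It assumes a unit map (0 to 1 along both axes), square clustering rectangles of equal size whose total area divided by the map area equals CCQ, and data-rectangle side lengths drawn uniformly below an upper bound. The function names, the square shape, and the uniform side-length distribution are assumptions made for illustration, not details confirmed by the text.

```python
import random

def clip(v):
    """Clip a coordinate to the unit map [0, 1]."""
    return min(1.0, max(0.0, v))

def make_rect(cx, cy, w, h):
    """Rectangle (xmin, ymin, xmax, ymax) centered at (cx, cy), clipped at the map boundary."""
    return (clip(cx - w / 2), clip(cy - h / 2), clip(cx + w / 2), clip(cy + h / 2))

def generate_clustered(x, y, ccq, max_side, seed=0):
    """Generate x * y data rectangles in x clustering rectangles (hypothetical helper).

    ccq is taken as (total clustering area) / (map area); with a unit map and
    x equal squares, each square has side sqrt(ccq / x) before clipping (assumption).
    """
    rng = random.Random(seed)
    side = (ccq / x) ** 0.5
    data = []
    for _ in range(x):
        # Center of each clustering rectangle is placed randomly on the map.
        cluster = make_rect(rng.random(), rng.random(), side, side)
        for _ in range(y):
            # Center of each data rectangle is placed randomly within its cluster.
            cx = rng.uniform(cluster[0], cluster[2])
            cy = rng.uniform(cluster[1], cluster[3])
            w = rng.uniform(0.0, max_side)
            h = rng.uniform(0.0, max_side)
            data.append(make_rect(cx, cy, w, h))
    return data
```

A smaller CCQ packs the same number of objects into a smaller total area, which is how the degree of clustering is varied in this sketch.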
of clustering
from
of
series,
dataset
Data
and nature
over not
of
the map
of data
to the
to range
boundary
number
loss of generality,
20,000
The
the
rect an-
the
identifier
and
Experimental
number
clustering
axes.
step,
specified.
7.1
it was
The chosen
to fit into
extended
con-
similarly
over
rectangle
and
assumed
When
clipped
bound
rectangles.
were
extended
rectangle, the
upper
clustering
bound.
were
set according
was
of accessing
to
they
a data
Without
a 4-
filter
area,
to lie between
This
rectangles
upper
clustering
was
of
contain
of the
of data
set to be 200,
we focus
to that
is assumed
be 8K
and
object
The
randomly
sequentially
to
is I/O-bound,
measurements. block
all
area
independently
bound.
rectangles
periments,
memory
assumed
are
its
For
files
Since
nodes
data
the map
methods.
on the
total
shape
or
and
upper
a smaller
gles
in [21].
specified,
output
block
described
otherwise data
using
tree
imple-
The
one disk
otherwise
The
RJ(M),
the
size and
and
seeded-tree
since
of accessing
practice.
competing sizes
indices,
and
trolled
we em-
and
identifier.
we assume
in
focused
the
but
pre-computed
techniques
these
consisting object
variations,
cases
have
R-tree
Unless
512K
all
we assume
pages,
on
optimization
experiments
simplicity,
byte
to
join
SJ, RJ, RJ(I)
in
ments
Our
R-tree
parameters
rectangle
the
join. method
R-trees
once,
constructed.
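Because the experiments measure I/O and weight random block accesses against sequential ones by the ratio r (5 unless otherwise specified), the cost accounting can be sketched as below. Expressing the cost in units of one sequential block access is an assumption, as are the function names.

```python
def weighted_io_cost(sequential_blocks, random_blocks, r=5.0):
    """I/O cost in units of one sequential block access (assumed unit).

    r is the ratio of the cost of a random block access to that of a
    sequential block access; the text uses r = 5 by default.
    """
    return sequential_blocks + r * random_blocks

def buffer_pages(buffer_bytes=512 * 1024, block_bytes=8 * 1024):
    """Number of whole disk blocks that fit in the buffer."""
    return buffer_bytes // block_bytes
```

With the default 512K-byte buffer and 8K-byte blocks, `buffer_pages()` gives 64 pages. A method that does mostly sequential I/O is penalized far less as r grows than one dominated by random accesses.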
Experiments
and
results
As
by far. are not
cause This
of various
clearly This
held
joins in the
behavior
designed
severe
trend
shown
buffer
to
methods figure,
occurs
on
RJ is
primarily
be constructed thrashing
in all our
when
experiments,
all so so
Figure 5: Join method costs. Dataset sizes are 40K objects (800 K-bytes) and 100K objects (2 M-bytes), respectively. CCQ = 0.2.
Figure 7: Effect of data clustering on join methods (plot title: Effect of Spatial Clustering of Input Data; x-axis: CCQ of inner dataset).
The
RJ from
we drop
further
consideration
as a competitor
HJ.
for
second
series
of spatial
Again,
HJ demonstrates
over
all other
clustering
higher
when
because
1200
step
of tree
hash
400
nodes
confirms
method
as the
are
during
the
methods,
degree
that
the
easier
to
costs
tree
costs
incurred clustering
of
our
estimate
spatial
than
query
is the
matching
hand,
of spatial
facilitating
were
This
influences
HJ, on the other
This
of competing 200
7). gains
spatially.
clustering
accessed
methods. costs
join
performance
less clustered
of spatial
constant
varied.
600
the
(see Figure
HJ, the processing
were
degree
of these
almost
800
studied
costs
substantial
except
data
the
number 1000
experiments
on join
methods.
For all methods 1400
of basic
effects
those
planning
and
optimization.
7.3 Stability Tests and Tests with Real-Life Data
Figure 6: Join costs under different input dataset sizes (x-axis: inner dataset size, K-byte).
Our
next
series
of various Figure
6 shows
experiments,
RJ(I)
incurred
assumed tree
not
nodes
of the input
highest
It
SJ and RJ(I)
tree
to disk
of
enable
it ran
still
using that
to
with
random
RJ(I), run
I/O,
its
faster
faster
dataset
SJ did
than
than
produced
only
partitioning
while
was a street
main
of rivers
RJ(M) reason incurs mainly less
includes
We
tree-matching
costs
them,
only.
The
line
RJ(M) random I/O, while the join phase of HJ incurs RJ(M) accesses sequential I/O. Even though is
data
of random the
and joining
that
(only I/O
the
tree
the make
method
of choice
pre-computed
R-trees.
matching
pre-computed it
more
even
process
indices), expensive.
if the
input
of
the
HJ is
datasets
have
also
ran
tested
tests
with
map
railway
the
first
dataset
worked
the
tracks
as least than
were dataset
as the
other 8 shows
way
used. as the
outer
1. The around.
clustered both
pairs,
having “EXCL”
instead
of the
experiments. data. [24].
objects, with
=
way
These the
TIGER
The
first
dataset
the second
128,971 Experiment inner
dataset,
exper-
from
The
“REAL12”
dataset while
a map
objects.
and
the
“REAL21”
around.
the
2.5 times
RJ(I).
Bureau
CCQ
correlation,
extracted
131,461
of the objects
with
faster
with
outer
spatially
real-life
datasets
was
the
datasets,
object
in other
40,000.
and
other
two
of negative
pairs
0.2,
the
spatial
matched
inner
for which
worked
70K
two
and
Figures
dataset
40K
ran
joined
second
ran
CCQ
files of the US Census
MBRs
effects
Thus,
=
Because
approximately
iments
datasets
with
correlated
0.2.
SJ as it ignores tree-construction costs altogether. lower than even those costs of HJ are much although HJ includes the costs of both of RJ(M), input
and
dataset
“EXCL”
=
stability
performed
the
Experiment
CCQ
the were
of 100,000
was a uniform
tree-
The
cardinalities
dataset
negatively
compared
experiments
“CLU-UNI”,
“UNI-CLU”
and
the
with
experiment but
RJ(I)
These
experiment
a clustered
caused
while
datasets
In
construction,
algorithm
RJ(M)
cases.
Although
during
assumptions
techniques
in many
of basic
sizes.
is noteworthy
idealized
construction
series
construction
to be written
the
first
dataset
costs.
thrashing
costs.
make
RJ(I)
the
of its tree
increasing
results
varied
no buffer
the nature its
the
which
of experiments
methods,
results
faster Though
HJ experiments. SJ, and at least 3 times comparisons with RJ(M) of these
than
Figure 8: Stability tests and tests on real-life data (x-axis: input datasets UNI-CLU, CLU-UNI, EXCL, REAL1, REAL2).
Figure 9: Buffer size effects (real-life data) (x-axis: buffer size, number of pages).
Figure 10: Effects of ratio r (real-life data) (plot title: Effect of ratio of random to sequential access costs).
Figure 11: (x-axis: ratio of raw-data file size to MBR file sizes)
since
it
Experiments all
used
of
data
the
different
data
same
for
these
the
three
results
HJ
and
and
sequential
data
SJ exploit
improved
different
It is noteworthy
faster
that
methods
under
r ≈ 2.08 for the synthetic
datasets, costs for
HJ
were
the real-life
data.
This
lower
for
method
among
all
than
very
small
datasets.
Effects
join
methods Bytes)
shows
the
datasets.
256
results
of
In
comes
worst.
As expected, buffer
and
of buffer
had
buffer
the
from
24
bytes).
methods
widen
pages
the
The
HJ performs the best, and SJ and RJ(I) being the
the join stable
costs
rose
The
costs
through
out
for for the
all
On
stayed
whole
range
r = join
the
the
block 5 in
effects
access all
costs
of r,
costs,
our
earlier
for
r
in
the
ratio
on the join experiments.
the
range
1–30.
Here
is size
large
a 512K produced
was
we
The
size
5.2%.
was
stronger
Figure
256
the
buffer,
when
bucket
but
value
not
of
drastically.
r to outperform
.7~o
The the
inflated
outer
the
original
inflation
bucket-filtering outer
dataset.
from
we ran
number the of the
outer during
additional
Bucket-filtering
objects
net
of
the
objects that
number
multiple function.
deflates
irrelevant
the
of
since
partition
filtering
on average. average
be
14 experiments
17.5% of
of 12
dataset.
the
discarding
objects,
an average
We
may
in
hand, In
byte
size
used by
partitioning.
of random costs.
dataset
other
dataset
methods
HJ
outer
dataset
tested
HJ are considerably r as small as 5. As HJ and gaps between
Discussion
assignment
show
with
dropped.
at
even for
continuously very
happened
at r ≈ 1.05 for
methods.
8
9
real-life
datasets
and
the
Figure
with
synthetic
random
performance
This
of r.
to 30, the performance
methods
other on
sizes.
also
set
r
size
sizes
(2M
with
ratio
buffer
experiments
general,
relatively
to sequential show
size
and
of
pages
in second,
as the
We
varying
to
trends.
Sizes
effects
Experiments
RJ(M)
low
Buffer the
by
(192K
similar
of
studied
values
The
all other
r increases
between so their
shows
at
trends.
RJ(I) and RJ(M) as r increased. HJ began to outperform all other
than
HJ does not require also
costs,
data.
similar
differences
access
real-life
very
varied
other
7.4
raw data,
with
show
the
block
“EXCL”
tested.
We
given
of experiments
8 again
costs
stable
performance
synthetic
and
sets of input
HJ is the most
that
the
with
with
methods
characteristics, for
but
Figure
rival
HJ
indices,
RJ(M).
as
size,
clustering. costs
constant
confirms
as fast “CLU-UNI”
of the
spatial
while
almost
pre-computed
twice
“UNI-CLU”,
input
degrees that
assumes
approximately
Projected
data). 10 shows
are
20
15
file size to MBR file sizes
250
50
Tests still
10
5
1
500
using copies
of
outer
eliminated original outer
effect
tends
is much
larger
outer dataset to
be
than
the
inner
dataset,
Filtering 30%
in some
insertion
can
initial
settings
is required An data,
MBR,
tend
be
need
to
process all
case it
on
cost
methods.
raw
data
sizes
RJ(I)
are affected figure
only
shows,
20 times
We the
even
substantially
when
than faster
When
substantially,
iments.
When
can simply nodes,
and
the
treat
performance
data
files
exist,
files
of all
Although
costs
ing
As
HJ still
the
but
not
as
due
runs
as its
HJ over
in
all
cases,
RJ(M)
SJ and
even
though
is manifestly
computed
indices
results
clearly
method
our
unfair
MBR
The
may
file. will
a large
pre-computed
factor
HJ
and pre-
efficient
join
exist.
of the
and filter
method
on
report
both
been in
on
the
PBSM
spatial can
extents
refinement
steps
the
Paradise
database
I/O
and
filter
step.
equivalent cover
on
static
and
CPU The to
the
the
extents
the
whole
space
partitioning.
assignment
functions
can
reduce
also
serve
to
it in some results,
their
removal.
joins
the
join
of
and
applying
defines
spatial
a new
partition
a set of bucket
function. object
ideas
predicates.
Our
components: An into
extents
assignment
multiple
functions
the
hash-join
difficulties
spatial
partition
for implementto improve
relational
hash-joins.
a data
designed
based
tion
the
for
and
on this
inner
function
buckets.
for
and evolves
function
for the outer
objects
into
the
relational
hash-joins our
that
two
spatial
join
spatial
The
Fur-
datasets
dataset
hash-join
partition
The
is immutable,
buckets.
The
aspects.
hash-join
based
the
partition
and
can as-
method
mirrors
Our
experiments
far
outperforms
method
algorithms
func-
by sampling
are inserted.
in other
spatial
a
is initialized
as data
multiple
show
tested
framework.
dataset
dataset,
join
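The framework summarized in the conclusions (a partition function made of a set of bucket extents plus an assignment function, with different functions for the inner and outer datasets) can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation. The initial seed extents are taken as given rather than obtained by sampling, least area enlargement is used as one assumed assignment criterion for the inner dataset, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Bucket:
    extent: Rect
    inner: List[Rect] = field(default_factory=list)
    outer: List[Rect] = field(default_factory=list)

def partition_inner(buckets: List[Bucket], objects: List[Rect]) -> None:
    """Assign each inner object to exactly one bucket; extents evolve on insertion."""
    for obj in objects:
        # Assumed criterion: the bucket whose extent needs the least area enlargement.
        best = min(buckets, key=lambda b: area(union(b.extent, obj)) - area(b.extent))
        best.inner.append(obj)
        best.extent = union(best.extent, obj)

def partition_outer(buckets: List[Bucket], objects: List[Rect]) -> int:
    """Assign each outer object to every bucket whose (now immutable) extent it overlaps.

    Objects overlapping no bucket extent are filtered out; returns how many were dropped.
    """
    dropped = 0
    for obj in objects:
        hits = [b for b in buckets if overlaps(b.extent, obj)]
        if not hits:
            dropped += 1
        for b in hits:
            b.outer.append(obj)
    return dropped

def join_buckets(buckets: List[Bucket]) -> List[Tuple[Rect, Rect]]:
    """Join each inner bucket with its corresponding outer bucket (nested loop)."""
    pairs = []
    for b in buckets:
        for i in b.inner:
            for o in b.outer:
                if overlaps(i, o):
                    # Each inner object lives in one bucket, so no pair is repeated.
                    pairs.append((i, o))
    return pairs
```

In this sketch an outer object may be copied into several buckets, but since every inner object is stored in exactly one bucket, the result needs no duplicate elimination.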
Acknowledgment:
system
their
[25],
and
The
Our
focus
has
P. Kriegel
PBSM
partitioning
partition
and
one
on tree
matching.
implementation
phase
phase
Section
framework
of PBSM of bucket
for
data (see
of our
the
times.
framework
context
strategies
immutable,
describe
and
hash-join
respective
based
[22]
corresponds
use the
PBSM
DeWitt
two
the
have
earlier Patel
using
to
assignment
method
sign
experimental
indices
for
Second,
be different.
We
still
assumes
of
RJ(M)
that
or not
assign
thermore,
comparison
since
datasets
used
for
difficult
of spatial
spatial
have
an
may
files,
for
and
HJ
by
assigned.
no duplicate
are efficient
overcomes
paradigm
functions
exper-
joins
complexity
paper
tree
HJ does not. Our HJ is a very
but
show
whether
RJ(M)
of
benefits.
outweigh
effort
it has been
of spatial
framework
hold.
HJ outperforms
be which
produces
methods
joins,
to the
all internal
RJ(M)
method
hash-join
This
comes
our
MBR
discard
our
are outer
effects
even
We that
by sampling
inflated.
Its
and
area.
several
only
no post-processing
relational
efficiency
HJ outperforms by
offers
is not
also
extents
distribution
and
filtering,
inflation,
after is
extents
objects
inner
need
bucket
sub-
Conclusions
join
RJ(M)
exist,
nodes of
Third,
hash-join
file,
leaf
advantages
data
map
spatial
size significantly.
requires
9
in
is as much
size,
also exist,
R-tree
the
only
input.
data
data
allows
whole
as input
dataset
inner
results
bucket
asymmetry
both
be inflated
are initialized
for the
using
uses bucket
the the
by
gain
exist
as demonstrated
R-trees
read
raw
R-trees
HJ. If MBR
to
RJ(M)
and
RJ(I).
than
pre-computed
closest
cases.
outperforms
so that
evolve
inner
dataset
counteract
R-trees.
on
assignment
The
design
method
extents
This
both
choosing
based
to
duplicate
cover
by
functions
multiple
HJ
of reading
MBR
different.
all methods,
raw
file,
the
the
this
the performance
that
costs
in
do not sizes
and
are
outer
may of
our
bucket
data, partition
our
for
assignment
so they
and
Our
dataset.
this
costs
run
datasets
MBR
by the
larger
which
assume
expect
read
and
bucket
First,
would
non-contiguous
multiple
In contrast,
Our
for
same
input
same
may input
we
cannot
11 projects
input
files
pre-computed
by
when
format.
generate
on
index
buckets
many
Elimination
adaptive
cases,
balances
comprise
datasets,
overlap
objects.
assume
We
are
raw
and
exists,
fly.
is the
ratio
Figure
HJ over
outer
that
research
data
to increase
reliance
the
raw
which
increase
the
of and
in some
MBR
data
the
RJ(M),
of its
applies
and
was
containing
experiments
and
the
It
necessary.
and
but
raw
MBRs
diminishes
of
Our only
bound,
except
regions,
number More
The
in size,
If
may
partitioning.
files
pairs,
compute
because
other
pointer indices.
to be I/O
Although
inflation
that
bucket-filtering
extents.
comprise
larger.
files.
methods
multiple
PBSM
datasets.
extents
balance
may
spatial
times
net
on the
bucket
and object
MBR
from
and
input
data.
size by over
issues.
to be similar
be many to
inner
dataset
pre-computed
largest
inflation
significantly
for
clustered
dataset
inflation
the
that
on these
input
largest
62.6%, and
depend
spatially
net outer
The
We observed
effects
with
reduced
cases.
was
61.6%.
files
and
actually
to of our
in
datasets
4.3.1).
We the
area,
[1]
The
and
Identical are
used
for
like to thank
Dr. T. Brinkhoff,
B. Seeger for generously
used in our experiment
with
real-life
Prof.
H.-
providing
the
data.
References M.
Kitsuregawa,
H. Tanaka,
plication of hash to data tecture, ” New Generation
are non-overlapping, map
would and Prof.
our
compare design.
authors
66-74,
are
bucket
[2] D.
both
M.
T.
Moto-Oka,
“Ap-
1983.
J. DeWitt, R.
and
base machine and its archiComputing, vol. 1, no. 1, pp.
R. H.
Stonebraker,
Katz, and
F. Olken, D.
Wood,
L. D.
Shapiro,
“Implementation
techniques
for main
memory
database
ceedings of A CM SIGMOD Management [3] D.
of Data,
J. DeWitt
based join
and
151-164,
R.
partitioned
Gerber,
method
International
“Multiprocessor
hash-
of VLDB
85, pp.
[18]
and M. Takagi,
using dynamic
of the l~th
Data, [19]
pp. 322-332,
C. Faloutsos,
May
T. Sellis,
International pp. 427–439,
T. Sellis,
N. Roussopoulos,
Conference
on
“Join
Shapiro, large
main
Large
Data
Bases,
[20]
tree:
processing
memories,
Systems,
Very
1989.
vol.
in database
”
ACM
in
systems
Transactions
11, no. 3, pp. 239-264,
and M. H. Eich,
“Join
databases,”
ACM
Computing
pp. 64-113,
March
1992.
seeded trees,” national
Conference
Lo and from
on [21]
Spatial
sets,”
in
SSD
1,
joins
ing techniques
SIGMOD of Data,
The
Fourth
[24]
(Advances
in
Maine,
August
query
process-
for native
Management
of Data,
Rotem,
and parameter
500-509,
Kobe,
Japan
W. Lu and J. Han, range
Conference J. A.
join
on Data
Orenstein,
in
in spatial
of Data, algorithm
Portland, for
Zurich,
for
1992.
databases,”
International
Confer-
OR,
1989.
computing
lay of k-dimensional spaces,” in Databases (SSD ‘91), O. Gunther 381–400,
of pp.
indices
Advances and H.-J.
Switzerland,
the
over-
in Spatial Schek, ed-
August
28-30
1991, Springer-Verlag. O. Gunther,
“Efficient
Proceedings neering, R. H.
pp. 50–59, Guting
and-conquer problem,” 112, July
computation
of International
and
on Data
joins,” Engi-
1993. W.
algorithm Information
of spatial
Conference
Schilling, for
the
Sciences,
“A
practical
rectangle
and
using
DeWitt,
Y.
fractals,
divide-
intersection
vol. 42, no. 2, pp. 95-
1987.
R+-
objects,”
Bases,
pp.
3–11,
“Partition
Rong,
based
3-6 June “Dot:
on Man-
1993.
of the 1996A
Engineering,
DC,
“Efficient Proceedings
Conference
report,
spatial-
CM- SIGMOD 1996.
A
spatial
access
of International
pp. 152-159,
precensus
Technical
N.
B, Seeger,
” in Proceedings
“Tiger/lines
“Client-server VLDB
on Man-
“The
R-trees,”
May
Canada,
on Data
J. DeWitt,
1994.
of International
pp. 284-292,
“Redundancy
“An
join
D.
.%Nz
on
Proceedings
Engineering,
Proceedings
D.
Montreal,
B. of Census,
J. Yu,
1991.
Engineering,
ence on Management
pp.
in
Data
of ACM SIGMOD
J. Orenstein,
Conference
“Distance-associated
search,”
in Proceedings
on
International pp. 237–246,
sus, Washington,
1990.
indices,”
Conference
SIGMOD
caJ documentation,”
spaces, ” in Pro-
International
pp. 343-352,
“Spatial
International
of spatial
using
of Data,
Faloutsos
Data
and
joins
of ACM
Conference
seeded
[25]
comparison
Kriegel,
agement
C.
Conference 1987.
Large
conference, [23]
“Analysis
1987.
H.-P.
and
of
Proceedings
and C. Faloutsos,
Inter-
International
Databases
Very
of spatial
J. M. Patel
of ACM
on Management
for multi-dimensional
merge join,” in Proceedings
pp. 209–
“Generating
of
Brinkhoff,
method
‘95), Portland,
ceedings of A CM SIGMOD
spatial
T.
index
England,
and access
access methods,”
using
1995, Springer-Verlag. “A
itors,
24, no.
1994.
Spatial
J. Orenstein,
D.
of ACM
C. V. Ravishankar,
Databases:
vol.
[22]
May
on Large
in relational
“Spatial
on Management MN,
data
Symposium 26-29
Surveys,
in Proceedings
220, Minneapolis,
trees
processing
Lo and C. V. Ravishankar,
[9] M.-L.
Proceedings
Brighton,
Septem-
spatial
A dynamic
processing
[7] P. Mishra
[8] M.-L.
oriented
Schneider, and robust
and N. Roussopoulos,
SIGMOD
Amsterdam,
pp.
1990.
of Data,
M. Takagi,
of Data,
Proceedings
Conference
of ACM
and
ber 1986.
[16]
and rectangles,”
International
agement
Database
[15]
for points
R.
An efficient
effect of bucket size tuning in the dynamic hybrid GRACE hash join method,” in Proceedings of the Fifteenth
with
[14]
Kriegel,
R*-tree:
of object
[6] L. D.
[13]
H.-P.
“The
SIGMOD
Conference, pp.
VLDB
M. Nakayama,
International
[12]
Beckmann,
method
‘(Hash-
destagingstrat-
on Management
structure SIGMOD
1984.
“The
pp. 257-266,
[11]
N.
Conference
Aug.
B. Seeger,
1988.
[5] M. Kitsuregawa,
[10]
“R-trees: A dynamic index Guttman, spatial searching,” Proceedings of ACM
A. for
47-57,
1985.
in Proceedings
468-478,
on
1984.
M. Kitsuregawa,
join
[17]
in Pro-
Conference
in Proceedings
Stockholm,
[4] M. Nakayama, egy,”
pp. 1-8,
algorithms,”
systems,”
International
files:
1991.
1990 techni-
Bureau
of Cen-
1989.
Kabra,
J. Luo,
paradise,”
Conference,
in
Santiago,
J. M.
Patel,
Proceedings Chile,
and of the
September