Spatial Hash-Joins - DElab

Spatial Ming-Ling

Lo

Department University

Hash-Joins* Chinya

of EECS

Ann

mingling@eecs.

University

Arbor

Arbor,

MI 48109

joins,

and

Our

spatial

define

partition

set of bucket may

map

a new

framework

functions

extents

paradigm

for

have

two

item

into

multiple

buckets.

functions

for

the

input

two

hash-joins.

components:

an assignment

function,

a

which

Furthermore,

datasets

may

be

and tested

on this

framework.

The

inner

dataset

is initialized

by

evolves outer

as data dataset

from

are inserted.

mirrors

dataset

relational

a wide

range

Our

spatial

a wide when

margin. the

both

when the input

when

the datasets

1 Relational

compared well-studied difficulties spatial part,

a spatial

hash-join,

facilitating

systems.

Second,

method

Our

method

applicable

our

In this the

to

method

outperforms

on tree

matching

based

its

performance have

hash-join

datasets

is superior

pre-computed

method

highly

are dynamically

have pre-computed

by even

indices.

hash

joins

to

buffer in the

peculiar

directly join with

for

sizes.

The

relational

applicable seeded trees

joins,

to the spatial have

been

that

and

large

paradigm

method domain.

tree-based

[8, 9] being

excellent are

However,

this

and

with

for

a particularly

is

due

to

joins

the

efficient

function

in

our


a spatial

hash-

counterpart, our

method

a variety

of data

method

partitions

our

partition

phase

join

results

relational

joins,

framework

comprises

and may

in

sets,

a data

functions

its

and

then

the

join

a partition

two

an assignment map

and the partition

joins

proceed

step.

the

We view

issue;

components: function.

item

into

for the two

with

in two stages:

[10].

on

tree

makes

the input

our

The

multiple datasets

show

have

step

step

and

on the filter

as an important

in this

that

current

matching

our

step

the filter focuses

but

should

also

method.

methods

datasets

paper

refinement

outperforms

the tree-based

This

any innovation

experiments

method

they

SIGMOD '96 6/96 Montreal, Canada © 1996 ACM 0-89791-794-4/96/0006...$3.50

propose methods.

We evaluate

produce

extents

step

This

hash-join

the

unlike

the refinement

based

and

differ.

Our

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

to

function

be usable

for

stream

of applying

spatial

with

during

buckets

orthogonal by the Consortium Network.

difficulties joins,

hash-joins,

buckets

However,

Spatial

most

in input

data.

phase.

buckets,

or other

to estimate.

spatial

advantages.

contrast,

characteristics

to its relational

experiments

relational into

In

we implement

similar

real-life

input

may

Existing the

similar

assignment

has not

designing

very

a set of bucket

approach. This work was supported in part International Earth Science Information

framework,

on this

Like

hash-join

domain.

to spatial

methods

relations

Based

algorithm

the

to

may

facilitating

matching on

harder

we discuss

for

join

on tree

costs

be easily

paradigm

costs,

or ordering

paradigm

a framework

including

indices.

[1, 2, 3, 4, 5, 6, 7] yield

particularly

paper,

hash-join

this in

mostly

thus

existing

can

optimization.

depend

and

and

following

distribution data,

by conducting

competitive generated

input

with hash-join

sizes

based

existing

to implement

integration

and

joins

advantages

First,

relational

predictability

can

as spatial

of the

join

other

modified

data

planning

indices

yield

efficiency.

be easily

input

to greater

query

such

item

The

on

A spatial

lead

spatial

for the

a data

mainly

may

the costs of spatial

and

may

database

spatial

the

Introduction

performance,

been

aspects.

methods

makes the spatial

for

function

It is therefore

that

Further,

This

method

dataset,

replicate

in other

algorithms

tree-based

partition may

MI 48109

besides

software

estimated.

joins. show

join

the

buckets,

indices.

experiments

current

sampling

multiple

hash-joins

of spatial

function

but

into

needs no pre-computed

hash-join

partition

The

is immutable,

the outer

a spatial

domain

hash-join

also

We have designed

Arbor,

paradigm

spatial

depend

different. based

hash-join

in the

to spatial

spatial

and

a data

the partition

the hash-join

Ann

Arbor

ravi@eecs.umich.edu

The how to apply

of EECS

of Michigan-Ann

1301 Beal Avenue,

umich.edu

Abstract We examine

Ravishankar

Department

of Michigan–Ann

1301 Beal Avenue,

V.

method

by

wide

are given highly

are dynamically

pre-computed

our spatial

indices.

spatial join

margins,

hash-join algorithms even

pre-computed competitive generated

when

indices. both and

when when

This

paper

discusses in

is

applying

Our in

organized

related

work.

the

hash-join

framework Section

spatial

for

implementation. in

and

Section

7.

results

our

of

a

6 presents


Bucket 1

its

experiments

8 discusses

the

joins.

presented

design

Section

of

Section

9 concludes

is

our

2

difficulties

spatial

joins

presents and

the to

hash

5

The

Section

paradigm

algorithm,

Section

follows.

3 studies

spatial

Section

4.

hash-join

given

as

Section

related

issues,

paper.

Figure

1:

belong

to

with

Related

2

forms

method

does

require

tree

and

tree

method

[8,

spatial

costs.

The

require for

18,

when

and

the

the

are

commonly

used

as

can

used

to

and exist.

Brinkhoff

two

method

[8, 9] may

delivers

better

input

no into

method indices,

trees

of and

manageable

plane-sweeping. Sections

and

and

disk

also

equality.

be used

appears

and

Patel

and

operates

identifies overlap. the

a

Suppose

when

on

A

there

first

[22]

also

dividing

are

them

method

is discussed

using

further

in

8.

with

characteristics

direct

application

joins.

First,

Spatial

preserving total

do objects

away

in

application

[13,

sets

predicates

a natural

Although

do not

intersect,

X

A and X

into

will

identifying

in

space This

objects, spatial can

is

that

for handling

are often

much

the

of the join

based

works

defining

such

total be

prevents

of the

far

direct

databases.

relational

hash-join

well fully

classes.

so

to spatial

but

that

buckets.

attribute

another,

B we

X).

Unless

we will

always

how

are

we divide

more

of

crucial split

term

this

over

have

the

with

join

predicates

equi-join

buckets.

Thus,

to match

objects

a coherent

buckets

buckets

equivalence performance

is that across

values method

create these

to the

method not

equivalence

hash-join

functions

or

we do not

to

the

relational

most

We

values

A and

If we hash

predicate

partition

hash-join

predicate,

a join

constructs

The

property

given

of matching

(B,

bucket,

equality

one

classes

across

the

because

equivalence

of

spatial

by

include The

buckets. a spatial

of them.

no matter

into

buckets.

attribute.

relational

pairs

pair

one single

effectively

induced

two for

B and Y into

pairs,

between

equi-join

order

closeness,

equi-joins, more

for

sometimes

difficulty

techniques

difficulty

total

objects

classes

into

intersecting

the

X

be certain

Although

the intersecting

all objects

some

and

and

three

dataset,

an equi-join

same

both

and

one

A

across

intersects

one bucket,

to

tuples

Y).

we

B, we still

another,

(B,

B.j,

intersect.

belong

1 shows

#

intersect X

the

and

For an

if we know

performing

do

and

A.j

B

pairs

attribute

hand,

In

into

cannot X)

and

other

hash

B

Figure (B,

An

to spatial

join.

miss

could

matching

spatial

join join,

A, B and X,

B and

and

tuple

(A, X),

prevent

techniques

spatial preserve

of relational is optimized

algorithms

10, 14, 23] exist

of

uniformly

ordering.

second

method

closeness.

adjacent

the

lack

joins

A

the

A by A.j.

A does not

whether

we

whose

A.j

and

intersection

objects

=

spatial

suffice.

predicate

the

identify

identical

j of object

On

hash

the same

complicated

not

spatial

three

another.
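To make the difficulty concrete, here is a minimal Python sketch. It is not from the paper; the rectangle representation and the names `Rect` and `intersects` are assumptions made for illustration. It shows three rectangles A, X and B where A intersects X and X intersects B, yet A and B do not intersect one another. Intersection is therefore not transitive, so it cannot induce the equivalence classes that hashing for relational equi-joins relies on.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box (hypothetical representation)."""
    xlo: float
    ylo: float
    xhi: float
    yhi: float


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two closed rectangles share at least one point."""
    return (a.xlo <= b.xhi and b.xlo <= a.xhi and
            a.ylo <= b.yhi and b.ylo <= a.yhi)


# X straddles the gap between A and B.
A = Rect(0, 0, 2, 2)
X = Rect(1, 0, 4, 2)
B = Rect(3, 0, 5, 2)

# Intersection holds for (A, X) and (X, B) but not for (A, B),
# so no assignment of A, B and X to disjoint single buckets can
# keep every intersecting pair together.
pairs = {
    "A-X": intersects(A, X),
    "X-B": intersects(X, B),
    "A-B": intersects(A, B),
}
```

If A and B are hashed into different buckets, X has to be visible from both of them, which is what motivates relaxing either the assignment or the matching principle.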

objects

miss

Joins

spatial

join

data

curves over

not

to

of relational

spatial

orders

orders

intrinsic

spatial

on space-filling

join

no

the the

but

we

and

intersection

the

the

The

X,

j,

bucket,

with does

if X.j

to

attribute

one

objects

both

B.j.

more

bucket

value

about

X

other

of objects

objects

object

must

at

#

nothing

and

hand,

pairs

A intersects

know

on the

Consider

X.j

identical

into

Consider

predicate,

with functions

relationships

attribute

that

of tuples partition values

same

arise

which

that

in Section

DeWitt

by

and joining

Difficulties

Two

two

the

levels.

we hash 3

B

Boxes

Problem

pairs

Placing

into

values

know

I/O

performance.

partitions

This

4.3.1

A and

another.

join-attribute

with

Unfortunately, PBSM

objects

Y to

buckets.

Their

joins,

objects

equi-join

R-trees,

algorithm CPU

than

denote

et al.

existing

matching reduce

seeded

be

Spatial

Difficulties

R-

find

identical

of

operator

sets.

on

exist,

of seeded

data

with

pairs

indices,

based

seeded-tree and

r-epresen-

spatial

20]

they

of an R-tree

discussion

The

hash

values.

attributes

19,

to join to

a divide-

6.1.

requires

represent

hash-joins

bucket.

based

on tree-like

large

those

R-trees

joins

based

tuples

in-

those

on seperational

databases,

of techniques

indices

Data

X and

Coherent-Assignment

join-attribute

tree

algorithms 12],

The

Relational

or

seeded

[16] proposed

sorting

a method

consists

brief

9].

spatial

proposed

lines

join.

and

as-

datasets

the

[11,

not

[17,

input

those

include

variants

in

collection

no

external

methods

its

facilitate

does

generally

These

indices

based

method

Tree-based

indices

the

and Schilling

algorithm

This

have with

10, 14], and

Guting

and-conquer

which

on join

[13,

[15].

tation.

for

an exception.

based

on z-ordering indices

joins

of pre-computation,

those

[21]

spatial indices

[8, 9] being

clude

but

for

pre-computed

other

spatial

dataset,

Work

methods

sumed

A simple one

dashed

3.1 Previous

Bucket 2

are

respect

assignment to

a join

predicate.

complicated.

Unfortunately,


spatial

do not

always

define

such

equivalence

to guarantee buckets

a coherent

with

respect

we cannot

divide

n groups,

and

in any We

this

difficulty

the

dimensions,

spatial

hashing

in

various

through this

and the

produced

join,

spatial

section.

Hash-join

predicates

borrow

the

phases:

of

(a)

the

design from

(outer)

phase:

Join

into

place

buckets

phase:

join

ets to obtain

inner

using

dataset

or

maps

and

outer

dataset

partition

corresponding

inner

after

ob-

and

outer

For

the

hash-join one relating

buck-

object

other

to the join

Single

algorithms

input

conformed

partition

X

and

the

data

item

partition

function

to exactly

one

still

only

bucket.

For

two

(12, 22), (11, 21) and Single

matching:

joined The

with

partition

into

The will a join

of

bucket

pairs,

each

states

in exactly

of buckets

relation

pair

having bucket.

that

the second must

each pair.

in the join

buffer

Using assign

inner

X

During

We

pairs

phase

The

pair. Design

main

Multiple map Multiple join

assignment: an input

problem

is

object

matching:

us to

principle.

relax

context

The

index

into

partition multiple

A bucket

may

function

may

buckets. appear

built.

pairs.

Hybrids

be appropriate


and

endure

schedule

the

multiple-assignment we need With

may

22 (see Figure only

this

2(b)).

match

bucket

approach,

buffer

However,

object

X

is

twice. is particularly

joins.

[20].

It

The

complicates

never

to disk either

carefully

21 and

phase,

duplication

since

joins, is

example, buckets

spatial

for

or

hold order

12 is needed

flushed must

can

in the

pairs.

a problem.

this

It in multiple

trees

We

not

However,

be able

buffer pairs

bucket

again,

(21, 22).

indices

When been

and read

of

spatial

are: The

and

B in

buffer-management

buckets

problem

the matching (11, 21)

We believe

requires

or single-matching

alternatives

bucket

Alternatives

assignment

the single-assignment two

both

thrashing

Primary coherent

same

Since

A and

we must

the

have

buckets

into

and

we join

disk

of join the

phase,

(12, 21).

from

Also,

2(a).

objects

suppose

it would

thrashing

processing

written 4.1

time,

be read

the

relations

outer

one bucket

to be joined

is

relation.

and outer

“corresponding”

principle

appears

inner

of the outer

the inner

bucket

matching”

a pair

of the

one bucket

and one

bucket

call

bucket

divides

number bucket

“single

or outer

exactly phase

some

one inner

Each

I/O.

trivial.

pairs.

scheduling

and

disk

are (11, 21), (12, 22) and

matching

example,

objects

than join-

multiple-matching

both

pairs

bucket

remain

problems.

assigns

the

these

more

as in Figure

2 overlaps bucket

During

to identify

1, the

objects

I/O.

schedule

is no longer

Figure

the

1, the join

There The

in

hash

in dataset

(12, 21).

to two

phase,

phase:

assignment:

each

have

to the

example may

be

multiple-matching.

so we must

pairs

of

sizes may

a bucket

so as to minimize

buckets

function instead

in more

read

phase,

carefully join

data

with

to

is

difference

a partition

resulting

need

join

approach

conceptual

is that

occurs

may the

reads

dataset

principles,

by hash

approach.

to a set of buckets

drawback

inflation

during

approach

only is that

object

Its

we

identifying

functions.

The

partitioning,

size

once

results.

Relational

produced

A duplicative

multiple-assignment

hash-joins

inflated

However,

the join

the

an input

bucket.

No

hash-

and

of

a single

Our

relational

phase

spatial

table

(b)

multiple-assignment

simplicity.

relational

now

phase

jects

hash

approach.

by the

benefit

from

phase. Partition

A non-duplicative

produced

The

Buckets

buckets.

of the

(a)

algorithmic

of a spatial

inner

partition

realized

relational-join

(outer)

2:

multiple-matching

table

parameters

datasets

that

the

hash-

be

datasets.

inner like

the

Bucket Bucket Bucket Bucket

algorithms

can

outer

the

the

framework,

has two

I

Figure

designing

Bucket Bucket 22

12

make

for

and

called

11

hash table for dataset 2

This

Framework

operand

dataset 1

Bucket Bucket

is,

are

Hash-Join

We

hash table

dataset 2

group.

problem. complicated,

the

hash table

dataset 1

appearing

more

for

hash table

into

same

thus

partitioning

are

algorithmic

call

inner

by

relation

objects in the

attributes

choices

framework.

hash-join

two

That

datasets

spatial

join

appropriate

to hash

problem.

this

spatial

terminology,

two

belong

framework

algorithms

for

of

our

the

impossible

that

and

Spatial

present

join

fact

a difficult

Our

We

that always

is

predicates. the

coherent-assignment

the

higher

in

it

of objects

join

objects

ensure

and

4

to spatial

pair

and

assignment

the

matched

call

classes,

is not

insertion a drawback are

necessary

to

of these some

also

suitable been

use of duplication

they

for

has

and deletion with

constructed update

two

main

situations.

them

data and once

approaches

in the

used in

with

spatial

of objects. structures used

once.

they

are

may

also

Spatial

4.2 Our

spatial

The

assignment

of buckets. design

function

maps

parameter.

Quite

the to

partition

differ.

the space is used

with

different

degrees

may

be hashed

based

representations,

spaces

spaces

Our

objects

on their

space.

Since

coordinate

are

extents

used

by

buckets

the

they

are used

pairs.

As

corresponds

objects

logically

suggestive for

There

the

The to

an object or to all The

the

stored our

main

with

to

(1)

are

An

The

on

is mapped

to,

(2)

of

to identify

join

space,

bucket

extent

a

spatial

(1)

relationships

An

bucket’s

In

individual

should

bucket to

bucket

be decided

bucket

by

map

partitioning

on object

extents

buckets?

distribution?

updated

as data

If so, how?

hash-join set

(3)

these

Some

(2)

for

be

be

to

the

contained

assigned

consists

a set

bucket

of

by to

two

it.

related

can

spatial

relationship

bucket

extents

y:

the

which

is based.

number

be assigned

be-

upon

of buckets

an

to.

pairs.

can

realize

to:

of (1)

(2)

to

all

buckets

(3)

to

the

bucket

assignment

assignment all

whose whose

the

of

single

bucket

used,

assignment

when rules.

so far

comprise

is

several Figure

for

buckets

the

partition

it,

it,

the

rules,

When

there

must to

satisfy

or

object.

several

rules

3 summarizes

assign

contain

multiplicity.

principle

to

overlap

is nearest

may

tie-breaking

are

extents fully

center

assignment

single-assignment series

whose

extents

criterion

on

criterion

buckets

the be

choose

the

primary

the design

space

functions.

Extents divide

the

buckets

extents

the

the

or

to buckets

multiplicity

assignment

must

and

locations,

extents,

with

not

the and

of objects

object

depending

or the multiple-matching

equal-sized the

overlap still

design

objects

examples

The

the

bucket

parameters

may

criterion: data

an object

consists

extents, join

object

relationship

but

function

Function

an input

at-

assign

it the most,

algorithm

the

input

it,

of bucket

its

object

assignment

object

object

overlaps

Assignment maps

on

extent,

tween

efficiency.

function

addition

the

determined

the

based

Assignment

extents

it may

overlap

Bucket

approximately

tion.

the

aspects:

bucket.

spatial

between

extent

extents

the

a

For example,

whose

for

affect

the

the

function

Assignment

a bounding

to bucket

assigns

and

made

choosing

cover

assigned

associated

counterpart

buckets

extents.

is generally

necessarily its

bucket

where

it is not to

and

a bucket

of the objects

extents.

Choosing partition

of all extents

or based

into

assignment

discussed

into

functions.

partition

overlap?

union

Choosing

4.2.2

they the

in the space

the multiple-assignment

4.2.1

the

statically

are inserted

determine

original

A

whose

and

for

are extents

space

Assignment

function,

approach,

does

Mutability:

approach

functions:

coverage

of a spatial

choices

either

spatial

Partitioning:

function

to a bucket

assignment

m

–d

area?

original

space,

space

do extents

Coverage:

in these

on hashing

generally

3: Design

Intersection:

in the

assigned

bucket

design


in higher

to hashing

phase

no obvious

buckets


they

though

the

reside.

of determining: The

Figure

space,

to a region

based

and

i

Y

3

I

intersection

hashing.

buckets

r

I

,

original

object

in

assignment

tributes

in

binary

on coordinates

function

spatial

objects

in relational

1

properties

points

are

two

However,

is also

assignment multiplicity

bucket extents

and

of their

original

assignment

of the spatial

to the bucket.

assignment criterion

.

transformations.

hash

extent

box

objects

by the join

we

transformed space

i

extent

example,

into

focuses

serve

an input

bucket

For

in the

in their

no additional

Bucket

original space

outer

contiguous)

patterns

hashed

limited

spatial


is a

addressing

various

efficacy.

locations

values

requires

of

on

on the bit

is not

and

in

i

to buckets.

implementation

framework

set

is to be computed,

be hashed

and

assignment function

to a set

inner

A

or be transformed

dimensional

our

the

c

hash-joins,

necessarily join

objects can

object

is useful

(not

spatial

in assigning objects

for

spatial hash function design

two

bucket.

of this

problem.

the

each

relational

feature

by

a set of bucket

with

a spatial

functions

to a region

Spatial

and

unlike

This

where

defined

cardinality

coherent-assignment

corresponds

are

associated

maximum

dataset

Design issues Design choices

function

extent

The

we allow

the

functions

assignment

one bucket

U


Function

partition

components: extents,

Partition

the

input

sizes

set of bucket

properly

assignment and

following

4.2.3

datasets

by

In

func-

shapes

every

of

parameters

extents:

Join

an

bucket

outer

bucket

the

inner

this

matching

on the


Identifying

principle,

inner that

bucket’s

design

space

may

objects.

Bucket

contain We

significantly

of the inner

must

Pairs be

matched

objects

may

be able

in practice

and outer

partition

with

overlapping to

reduce

depending functions.

a a

When and

the

the

multiple-assignment

single-matching

bucket

need

outer

only

bucket.

is applied bucket and

and

pairs,

each

bucket

extent

inner

buckets for

bucket

can

need

identifying

only

to

its

pairs

that

every

assigned

to it,

with

the

extent.

thus

join

depend

considered

the

same

are

assigned

on

the

of

of

the

inner

outer

function:

choosing

nearest

center

As

bucket

may

be different.

object enlarged

the

bucket

of gravity

a

have objects

is enlarged

final

is

grid

being

datasets

extent

Each

extent

cell

extents.

its

the

regular

each

outer

bucket

datasets

whose

area,

and

a bucket,

Thus,

assignment,

n equal-sized, map

initial

to

Assignment bucket

components

set

and

with

whole

The

them.

inner

outer

the

extent.

enclose

Methods

Start

covering

bucket

join

2

extents:

cells

extents

identify

know

be joined

and are

function,

we

many

bucket

all objects

overlap

join-bucket

in

the

Example

Bucket

principle

appear

if

inner

corresponding

be used

contains

extents

its

of

example,

fully

whose

partition

may

properties

For

pairs

with

4.3.2

chosen

each

multiple-matching

functions

bucket’s

is

preserved,

joined the

known

buckets each

be

When

assignment

principle

property

extents

is assigned the

if there

the

to the

least

whose

to

for

after

extent

has

are ties.

its design. Join 4.3 We

now

present

two

straightforward

join

adopts

multiple-assignment

the

the

based

on

multiple-matching

4.3.1

first

the

The

second

regions, Each

The

the

map

area

for example,

of these

inner

and

set of bucket

cells

the

with

thus

extent

bucket

with

method

The

first

may

be

extents

very

must

The

PBSM

buckets

may

be

objects

two bucket effort

when

extents,

may

they to

of

1.

Example

of grid

cell as a bucket several

equivalent

that

some

ways,

the

partition

divides

the

plane

PBSM cells,

In

but

extent,

instead

it avoids

possibly of a bucket

of using imbalance

non-contiguous

the

join

to be joined.
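A grid-based multiple-assignment function of the kind this example describes can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not PBSM's code or the authors': rectangles and the data space are `(xlo, ylo, xhi, yhi)` tuples, each grid cell serves as one bucket extent, and the name `overlapping_cells` is hypothetical.

```python
def overlapping_cells(rect, space, nx, ny):
    """Return the ids of all grid cells (buckets) whose extent overlaps rect.

    The space is tessellated into nx * ny equal-sized cells, numbered
    row by row. An object overlapping several cells is assigned to
    every one of them (multiple assignment).
    """
    xlo, ylo, xhi, yhi = rect
    sx0, sy0, sx1, sy1 = space
    cell_w = (sx1 - sx0) / nx
    cell_h = (sy1 - sy0) / ny

    def clamp(v, hi):
        # Keep indices inside the grid, including for objects that
        # touch or exceed the boundary of the data space.
        return max(0, min(hi, v))

    i0 = clamp(int((xlo - sx0) // cell_w), nx - 1)
    i1 = clamp(int((xhi - sx0) // cell_w), nx - 1)
    j0 = clamp(int((ylo - sy0) // cell_h), ny - 1)
    j1 = clamp(int((yhi - sy0) // cell_h), ny - 1)

    return [j * nx + i
            for j in range(j0, j1 + 1)
            for i in range(i0, i1 + 1)]


# A 2 x 2 grid over the square [0, 4] x [0, 4].
space = (0, 0, 4, 4)
cells_wide = overlapping_cells((1, 1, 3, 1.5), space, 2, 2)   # spans two cells
cells_small = overlapping_cells((0.5, 0.5, 1, 1), space, 2, 2)  # one cell
```

Because an object that spans several cells is replicated into each of them, the same result pair can be found in more than one bucket pair, so a scheme like this needs a duplicate-elimination step after the join.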

join

bucket

a bucket

during

the join

design

a given

optimal

A

pairs. may

If

need

phase.

since

feasibility

convincing

our

and

for

and will

We

goal

of our

may

are

join.

performance

method

predicates

parameters

spatial

method, the

hash-join

design

method

approach

It

differs

here

approach

gains.

The

be different

input

dataset

be the subject

We

data

for with

of future

relational hash

extents

whole

of of

by

to derive

extent.


that

im-

multiplesimplic-

only to

in that multiple

amount

of effort

implementations

to realize

are allowed

the

this

objects

in

method. reside,

thus

representations.

to overlap, space,

do not

and

based

Our cover

on

the

object

in space.

the

bucket

extents,

we discuss

only

open

methods.

efficiency

are

the

dataset

inner

a small

hash-join

where

when

the

object

the

result-

conceptual

a data

only

space

leaves

our

efficient

hash-join

of transforming

This Here

grid

maps

for

The

adopt

of its

relational

object-distribution

distribution

a large

costs

and We

because

databases

in the

the

design

to modify

method

framework.

6).

the

We expect

avoiding

data

in

function

be required

hash-join

our

Section

from

partition

existing

using

is simple

assignment

buckets.

remove

using

(see

its

on

a spatial

join

plemented

fall

buckets cells

above

characteristics,

now

will

a single

the

yields

join

function into

it

different

ity.

have

its

for

an

spatial

empty.

inflating

is necessary

resembles

outer

Choices

demonstrate

optimal

ing

spatial

nearly

seek to

We

sizes

[22] can also be interpreted

framework.

scheme

the

some

times

during

pairs

enough,

implementing

intersection

same

spurious

the

not

bucket

method join

for

pairs

the

bucket

on

and

research.

This

object

overlap

is that

is that

bucket

large

Design

choices

different

outer

results.

spatial

grouping

others

additional

partitioning number

and

eliminate

method

of inner

in multiple

in several

Several

and

may

cell).

matching

must

objects,

duplicate

between

Finally,

redundant

(grid

Depending

while

the boundary size.

We

of this

of input

we

to

object

of inner

objects

imbalance.

objects,

Second,

is assigned An

the

is not

feasible

is only

the join.

drawback

distribution many

pair

spurious

extents.

after

it.

extent

of matching

bucket

size

pair

method

track

participate

Our

5

of have

buckets.

each

same

produce

a pair

results

Join the

may

when

object

overlap

to multiple

pairs:

buckets

data

extents

be assigned

two

A

whose

Join-bucket

may

do

function:

all buckets

bucket

each overlap.

equal,

datasets

The

of this

we must

buffer

Join extents

a regular

is the

outer

extents.

with

are immutable. Assignment

drawback

to be read

non-overlapping of n cells

whose

phase,

the

grid

same

for

The

approach,

Tessellate

one bucket.

examples

framework.

1

extents:

the

our

approach.

Example

Bucket

pairs:

buckets

intersection

our

bucket

Examples

the

initial and

choices Design

discussed are:

in

values the that choices Section

and

the mutability

assignment affect

the

affecting 6.

Our

functions. correctness only choices

the for

Bucket

extents:

bucket

Initial

extents

value:

are

updated

see Section

6. Mutability:

intersection

to

all

and location

enclose

assigned

criterion

objects. Assignment tion

fund

6.

actly

joins.

ion:

Assignment

Multiplicity:

Each

criterion:

object

for

efficiency,

see Sec-

is assigned

This

one bucket.

box

these

of the

objects

contained that

choices

extent

assigned

in the

bucket.

a bucket

extent

The

to

it.

However,

of a bucket

design

choices

may for

not

such

an object

dataset

belong

rion

the

outer

to

for

to

are: extents:

=

final

Initial

inner

have

the

bucket

same

associate

one

Mutability:

ject it.

Identifying

each

inner

buckets

extents

and

inner

object

criterion:

whose may

criterion dataset

since

An

extents

ob-

overlap

be assigned

need not join

outer

now

bucket

the

A1, A2,

the design

results

join

the

with

phase

same

begins,

and

the

the

Theorem

design

given

these pairs

bucket

correct

the

corresponding

Let

the

When buckets

inner

buckets

corresponding

outer

For

each

objects

inner

that

dataset

overlap

object

in Ai,

it can be found

extents

Proof:

We

overlap

can

object

that any

p does

not

Since

Bi

extent. not

show

overlap

Ai’s

contained

any

dataset

Theorem

produces

Proof

We

copy

of any

to only with can

meet

considered Any design

have

being

When

the

to

outer

for join. choices

hash will

each and

Bi’s

objects

els tree

as

an inner object

one

method the

following correct

dataset

for

the

object

requirements nodes

to

over-

is minimized.

to

are very

of spatial

index

functions

would

constructing

spa-

when

The

the answer

above

slot

initial

our


present

Initially,

As

determine

and also largely into

subtrees.

of slots

extent

extents objects how

determine

can

be

and

a set

of

and the

are assigned the

to the objects.

objects

are assigned effectively the number

steps:

slots

a slot

the how three

are

in the

are points,

to enclose

Determining

involves

a seeded

technique.

terminology,

are updated

a lev-

Readers

is contained

slot

seed)

builds

of this

[9] to tailors

(or

dataset.

a bucket

is empty. extents

extents

its initial

underlying details

implemen-

technique Seeding

information

objects.

are grouped

for

full

represent

extents

in our

seeding

Bootstrap-seeding its

to

their

Extents

extents

by setting

4).

Using

Bucket

bootstrap

a join

from

set of objects

slots,

produce

set

extent

is minimized,

configuration.

key seed-level

9].

slots,

❑ join
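The design choices discussed in this section can be pulled together in a small in-memory Python sketch. It is only an illustration under assumptions, not the authors' implementation: rectangles are `(xlo, ylo, xhi, yhi)` tuples, initial bucket extents are given as seed points, an inner object goes to the single bucket whose extent center is nearest (one possible assignment criterion), and all function names are hypothetical. Buffering, disk I/O and the bootstrap-seeding step are left out.

```python
def intersects(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def center(r):
    return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)


def spatial_hash_join(inner, outer, seeds):
    k = len(seeds)
    extents = [None] * k        # mutable inner bucket extents
    centers = list(seeds)       # seed points stand in for empty extents
    inner_buckets = [[] for _ in range(k)]

    # Partition the inner dataset: single assignment, extents grow
    # so that each one bounds every object assigned to its bucket.
    for r in inner:
        cx, cy = center(r)
        i = min(range(k),
                key=lambda j: (centers[j][0] - cx) ** 2 +
                              (centers[j][1] - cy) ** 2)
        extents[i] = r if extents[i] is None else union(extents[i], r)
        centers[i] = center(extents[i])
        inner_buckets[i].append(r)

    # Partition the outer dataset: extents are now immutable, and an
    # object is copied to every bucket whose extent it overlaps.
    # Objects overlapping no extent are filtered out here.
    outer_buckets = [[] for _ in range(k)]
    for s in outer:
        for i in range(k):
            if extents[i] is not None and intersects(extents[i], s):
                outer_buckets[i].append(s)

    # Join phase: only bucket pairs with the same index are matched.
    result = []
    for i in range(k):
        for r in inner_buckets[i]:
            for s in outer_buckets[i]:
                if intersects(r, s):
                    result.append((r, s))
    return result
```

Since every inner object sits in exactly one bucket, each result pair can be produced at most once, and since an inner bucket's extent bounds all of its objects, an outer object that misses the extent cannot join with anything in that bucket. A nested loop stands in for whatever in-memory join a real system would use on each bucket pair.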

requireare

dataset

partition

bucket

the

to [9] for

spatial

object

once

for

Figure

considered

is joined

dataset

at most

latter

extents

extents

Inner

initial

directly

All

is assigned

bucket

an inner

most

use

tree

(see

[8,

object

of the

bucket

developed

rectangular

their

seeded

overlap

identified at

techniques

over-

area

trees.

and

identify

p does

dataset

contains

two

an outer

our

extents

of an outer

for

that

the follow-

total

buckets

These

Determining

referred

inner

bucket,

dataset

overlap

extent,

pairs

bucket the

probability

requirements

suggests

from

We choose

of the join.

a bucket

Since outer

bucket

the

This

in-

approximately

(2)

When

of bucket

the

initial criterion

with

(3) The

assigning

perforFor

in Bi,



answer

bucket,

one

spatial

all

area

the

contains

to multiple

is maximized.

bucket

tation,

not

the

assigned

the

assignment

bucket

extents.

to

buckets

outer

any

index

inner bucket and

the

of not

outside

p does

of the

methods.

inner

is minimized.

since

total

dataset

all inner

extent,

and object

aspect

to choose

objects,

probability

object

not

of

are minimized,

an outer

same

in Ai.

exact that

the

Since

Joining the

object.

any

/li

by Ai’s

one inner

If

and

know

exactly

A,.

to B2, it does

objects

2:

above

dataset

in

belong extent.

in Ai are inner

no outer

object

an

each

bucket

object

6.1

B,

objects

the object,

is crucial

it is best

as possible,

inner

tial

all

(1)

arise

trees.

buckets

to assign

other

hash-join

and

number

similar

be

the crite-

of overlapping

Every

im-

object

contains

contain

buckets of

choices,

as little

laps

and

joins.

equal

final

extent.

produces

to

assignment

function

phase.

to a set of final

an

6).

simply

same

lead

properties:

bucket

bucket

easily

same.

extents

ing

ments

is trivial

join the

affect issues

to find

object the

instead

stability

bucket

benefit

1:

dataset

above

any

B2, ..., Bk.

outer

and

lap

be processed

overlapping

intersection

extents.

Ak,

...,

for

reduce

dataset

extents

equal-sized

mance

ner

to mul-

at all (see Section

inner

our

to

to

not

pairs

Each

that

non-redundant

objects

be assigned

bucket

show

serves

an object

the stays

very

If we want

assignment

may These

6.

we modify

bucket

shape

Implementation

that

of-parameters.

the

during

6

extent.

inner

containment

Producing

Assignment

An

the

choices

B1,

with

to all buckets

phase,

extent

have

the

outer

for

algorithm

are immutable.

of outer

join

bucket

the

extents

two

relabel

extent

assignment

number

in the

We

If

bucket

buckets. outer

total

extents.

function:

Multiplicity:

Our

with

outer

is assigned

tiple

Outer

extent,

extents

Assignment

value:

the

whose

function

be modified

queries.

object,

the

look

pairs Bucket

that

to buckets

dataset

also

initial

the assignment

correctness.

in Section

can

for

and

assignment

containment

pairs

choices

extents

algorithm

further

outer

is also a bounding

actual

bucket

inner

not

design

plement Under

the

but

are discussed

to ex-

The

of inner

to data and

After seed

levels

grown

levels

the

and

outer

sizes

of

grown subtree

growny

Figure

seed node



grown

4:

1. Determining the

the number

upper

of

and

average

number

of a seeded

number

of slots

we use the

node

Example

the

describing

as the



tree.

slots

S.

lower

bounds

was derived

in [9].

of the

upper

A

formula

for

and

lower

can

be joined

the

filter

optimal

method

The

algorithm

some

the

multiple

3. Placing from

the

slot

data

of the

the S slots

sample to

input

center

number

and

sample

size

we “brute

choose

among

their

centers several

heuristic

efficiently.

in this

We

to

join”

into

a quadratic-split-cost objects

in overflow

does

Inner

spatial

and

determined

by

bootstrap-seeding

number initial

the

and the initial bucket

member

PI

hash-join

Obtain

initial

using

nearest-

objects.

Our

Bucket

extent

similar

to insertion

reduce and

inner

the

total

are likely

process

to become

assignment

whose

updating

of the

area

seeded

and

to do the

overlap

same

Outer

Dataset

least.

Set

outer

final

extents

partition

of their

the outer

dataset

object

If an object it

since

it

technique seed-level The

corresponding

to every overlaps

bucket

is irrelevant

to the

bucket-extent

filtering.

filtering

technique of

Join

now

algorithm!

pre-defined

threshold

grow

whole

to fill

to disk

mostly

to the

performance

in called

we write to disk

buffer.

in sequential of our

To

extent

overlaps

we can

inner batch

when

the

technique

I/O, method.

bucket

bucket

for

the

use

LRU

against

inner

each

bucket

as

dataset

inner

object

extents

bucket

extents a copy

whose

and

the

extents

after

to

final

inner

the

of each outer

extent

We

and writes.

call

the

it.

study

overlaps

writes

and contributes

pairs

object

to

the object.

to produce

for

result.

conducted join

join

a

join.

by

prior

The

first

copy-seeding

three

variants

tree

advantage,

we also

considered

is stored

so that

data at

all

some

noted

as

during

tree

RJ(I),

and

random

read

greatly

is given

two

performs

tree

during

buckets

tree

is

spatial

on

three

tree join

(SJ),

costs

the

construction

matching.

tree

is con[8].

We

experiments. and

then

join case

per-

a clear when

is no buffer (though

the

thrash-

there

may

method

is

de-

by disregarding indices,

is always

all

RJ(M)

construction.

R-tree

RJ(M)

using

tree

R-tree

This

is implemented

pre-computed

first in our

an ideal there

match-

second

dynamically offer

two

tree

the

the joins

matching), during

constructs

constructed

and

from

of R-tree

input

tree

(SJ)

technique,

To

ing

hash-join

other

to performing

R-trees

be

seeded

method

just

two

outer

spatial with

experiments

(HJ),

matching.

forms

contents

the

RJ constructs

this

our

join.

indices

structed

of

its performance

hash

seeded-tree

seeded-tree ing

behavior

We

spatial

R-tree

to the

than

the

and compare

study

As in the

larger

bucket

bucket

the bootstrap-seeding

discard

is analogous

all buckets

This

buckets.

[8, 9].

the

we

of

R-tree

be summarized

Assign

Update

Assign

methods.

and

set to the

of an outer

result. It

outer

outer

method, join

a copy

extents, join

both

uses a technique tree

inner

whose

no bucket

the

Experiments

We

nodes,

the

and

we assign

two)

are probed

can

on the

corresponding

7

heuristics

and

are immutable,

dataset,

partitioning

datasets seeded

extents

based

extents.

J 1 Join

The

bucket

If

buffer,

extents

criteria.

the

bucket

Phase The

of chances

are

extents.

Partitioning

This

(instead

objects

seeding.

bucket

methods: 6.3

then

it.

against the

inner

and

assigns

criterion

These

bucket

the

the

assignment.

every

of its

the

of seeded-tree

for

is

“indexed

are assigned

enlarges

trees.

P2

the The

the MBR

and the assignment

into

as

criterion

extent

slots

buckets.

As objects

enlarges

to a bucket,

extents

buckets

[17],

phase.

method

bucket

bootstrap

to one

of the inner

are points.

its extent

an object

initial

extents

extents

to a bucket,

the

of

be the

each

number

an

construct

reduces

overflow

for

[8] or the

bucket

as outer

work

look

a pair

join

Since

our

follows:

Partitioning

We use the

the

in

buckets

tree.

implementation.

Dataset

not

bucket

and

during

management

and

R-tree

one

buffer.

two

join.

first

outer

are

overhead.

do

in

We

only

the

assignment 6.2

[22].

of the

requires

exist

Our

heuristics

use the

join”

loop

buffer

the

to join

bucket

information

S clusters

use force

nested

in

I/O

bucket-bucket

to the

the

area using

In [9], we examined clusters

being

of slots.

We identify

objects,

identify

the

in the map

sample.

locations.

set,

the

Thus,

we

The

scheme

intensive

cost,

similar

the

size.

I/O

I/O

for

so constructed 2. Sampling

buffer

of inner

are joined.

our

additional

reducing

pairs

extent

using

is usually

on

buffer

of buckets.

the

without

focuses

to

bounds

than

step

approach

work,

is partitioned,

the same

partitioned

smaller

probe

choosing

In this

dataset with

buckets

general

subtree

outer

buckets

and the

simply most

Method

Description

HJ

Spatial

SJ

Bootstrap-seeding

Hash

R-tree

E

Join

join,

(both

indices

buffer

thrashing

R-tree

join,

Table

method

no pre-existing built

indices

before

while

join)

building

indices

two pre-existing

1: Competing

Table

indices

2: Experimental

methods were

chosen

randomly

O and a predefine efficient

among

the

phasize

that

depends

cannot

be

matching

it

applied step

all the

Table

1 summarizes

and

bytes.

bytes,

entries

pairs.

the

Also,

on I/O

costs

one

disk

When

filter

of disk

step.

blocks,

we used

a buffer to

of a 16-byte

in our

bounding

we focus

to consist

the

filter

step

box

on the

8-byte

ratio

range

7.1.1

of the

cost

be 5, unless

Choice

We

conducted

the

first

in our

study

is shown

established

the

conditions a series the

as well

as data algorithm.

We studied of

spatial

input

by

set

at

area.

of y data

rectangles

We denote

more The

the total

area area)

of the

clustering

by CCQ.
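The data-generation scheme described in this section controls clustering through the cover quotient of the clustering rectangles (their total area divided by the map area), denoted CCQ. The following is a minimal sketch of that definition, not the generator used in the experiments. It assumes square clustering rectangles of equal size placed fully inside a square map; the function names `clustering_rectangles` and `cover_quotient` and the parameters `map_side` and `seed` are hypothetical.

```python
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def clustering_rectangles(x: int, ccq: float, map_side: float = 1.0,
                          seed: int = 0) -> List[Rect]:
    """Place x equal squares whose total area is ccq times the map area.

    Assumption: squares of equal size, centers drawn uniformly so that each
    square lies inside the map. Overlap between squares is not subtracted.
    """
    side = (ccq * map_side * map_side / x) ** 0.5
    rng = random.Random(seed)
    rects = []
    for _ in range(x):
        cx = rng.uniform(side / 2, map_side - side / 2)
        cy = rng.uniform(side / 2, map_side - side / 2)
        rects.append((cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2))
    return rects


def cover_quotient(rects: List[Rect], map_side: float = 1.0) -> float:
    """Total area of the rectangles divided by the map area."""
    total = sum((r[2] - r[0]) * (r[3] - r[1]) for r in rects)
    return total / (map_side * map_side)
```

Under these assumptions, a smaller CCQ packs the same number of data rectangles into less of the map, which corresponds to a higher degree of clustering.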

The

clustered length

the and

data the

generated

z

clustering

rectangles

To

cluster in

cluster

was

rectangles

of data

region

objects.

under both

and

Clustering

study

X and

Y

basic in

of

an

the

was

set

R-tree

the

0.04.

from

side

The

length

resulting

in the

centers

were

outer

of 4 levels,

on the

to

In

the

dataset

rectangles

that

dataset

of

inner

bound

clustering

outer

experiments.

cardinality

outer

of all the data

restricted

to

20%

of

the

effect the

at 40,000

degree

the

upper

of spatial

cardinality and

100,000,

of clustering

of the

bound

on

side

so that

the

CCQ

on side

length

dataset

was

and

sets.

length

of

of the

outer

We the

The

of

the

on

outer varied

adjusted clustering

dataset upper

rectangles

as that

data

and

respectively, data

clustering

same

of

inner

1.0, respectively.

of the the

clustering of the

was bound

of the

outer

inner

dataset

in

experiment.

rectangle. set

rectangles. rectangles

divided

clustering

the

we fixed

each

of the

upper

of the

study

the

by

of CCQ,

7.2

Basic

Figure

5 shows

an example

the

the worst

the

because at

of each

The

0.2, 0.4, 0.6, 0.8 and

a

set.

width

ex-

area.

rectangles

was

of the data

the value

of

the

1 along

Size

resulting

rectangles

in

map

joins,

the centers

of the clustering

map

O to

cardinality

quotient

datasets

distributed

of clustering

smaller

of

generating

distributed

each

the

data,

clustering

area of the clustering

quotient

the

to 80,000.

objects

sizes and degrees of

randomly

randomly within

100,000,

varied

dataset was 0.2, meaning

performance

When

we first were

the degree

the cover

degree

scheme.

centers

We then

control

by controlling (total

of varying

The

a simple

whose

We could

map

datasets

fixed

tests and

real-life

degraded

In

per

number

the

series

we

CCQ

and robustness

included

to induce

of z x y objects,

rectangles, the map

designed

tests

total

area.

boundary

objects

Data

used under

encountered,

the stability

stability

clustering.

controlled data

confirmed

of basic

HJ method

of the

to be frequently

The

in the

parameters

2, A series

performance

of tests

method.

and other

in Table

expected

two

of clustering

of data

the

clipped.

of clustering

from

of

series,

dataset

Data

and nature

over not

of

the map

of data

to the

to range

boundary

number

loss of generality,

20,000

The

the

rect an-

the

identifier

and

Experimental

number

clustering

axes.

step,

specified.

7.1

it was

The chosen

to fit into

extended

con-

similarly

over

rectangle

and

assumed

When

clipped

bound

rectangles.

were

extended

rectangle, the

upper

clustering

bound.

were

set according

was

of accessing

to

they

a data

Without

a 4-

filter

area,

to lie between

This

rectangles

upper

clustering

was

of

contain

of the

of data

set to be 200,

we focus

to that

is assumed

be 8K

and

object

The

randomly

sequentially

to

is I/O-bound,

measurements. block

all

area

independently

bound.

rectangles

periments,

memory

assumed

are

its

For

files

Since

nodes

data

the map

methods.

on the

total

shape

or

and

upper

a smaller

gles

in [21].

specified,

output

block

described

otherwise data

using

tree

imple-

The

one disk

otherwise

The

RJ(M),

the

size and

and

seeded-tree

since

of accessing

practice.

competing sizes

indices,

and

trolled

we em-

and

identifier.

we assume

in

focused

the

but

pre-computed

techniques

these

consisting object

variations,

cases

have

R-tree

Unless

512K

all

we assume

pages,

on

optimization

experiments

simplicity,

byte

to

join

SJ, RJ, RJ(I)

in

ments

Our

R-tree

parameters

rectangle

the

join. method

R-trees

once,

constructed.


Experiments

and

results

As

by far. are not

cause This

of various

clearly This

held

joins in the

behavior

designed

severe

trend

shown

buffer

to

methods figure,

occurs

on

RJ is

primarily

be constructed thrashing

in all our

when

experiments,

all so so

[Figure 5: Join method costs. Dataset sizes are 40K objects (800 K-bytes) and 100K objects (2 M-bytes), respectively. x-axis: join methods.]

[Figure 7: Effect of data clustering on join methods. Plot title: "Effect of Spatial Clustering of Input Data"; x-axis: CCQ of inner dataset.]

CCQ = 0.2. The

RJ from

we drop

further

consideration

as a competitor

HJ.

for

second

series

of spatial

Again,

HJ demonstrates

over

all other

clustering

higher

when

because

1200

step

of tree

hash

400

nodes

confirms

method

as the

are

during

the

methods,

degree

that

the

easier

to

costs

tree

costs

incurred clustering

of

our

estimate

spatial

than

query

is the

matching

hand,

of spatial

facilitating

were

This

influences

HJ, on the other

This

of competing 200

7). gains

spatially.

clustering

accessed

methods. costs

join

performance

less clustered

of spatial

constant

varied.

600

the

(see Figure

HJ, the processing

were

degree

of these

almost

800

studied

costs

substantial

except

data

the

number 1000

experiments

on join

methods.

For all methods 1400

of basic

effects

those

planning

and

optimization.

[Figure 6: Join costs under different input dataset sizes. x-axis: inner dataset size (K-byte).]

7.3 Stability Tests and Tests with Real-Life Data

Our

next

series

of various Figure

6 shows

experiments,

RJ(I)

incurred

assumed tree

not

nodes

of the input

highest

It

SJ and RJ(I)

tree

to disk

of

enable

it ran

still

using that

to

with

random

RJ(I), run

I/O,

its

faster

faster

dataset

SJ did

than

than

produced

only

partitioning

while

was a street

main

of rivers

RJ(M) reason incurs mainly less

includes

We

tree-matching

costs

them,

only.

The

line

RJ(M) random I/O, while the join phase of HJ incurs RJ(M) accesses sequential I/O. Even though is

data

of random the

and joining

that

(only I/O

the

tree

the make

method

of choice

pre-computed

R-trees.

matching

pre-computed it

more

even

process

indices), expensive.

if the

input

of

the

HJ is

datasets

have

also

ran

tested

tests

with

map

railway

the

first

dataset

worked

the

tracks

as least than

were dataset

as the

other 8 shows

way

used. as the

outer

1. The around.

clustered both

pairs,

having “EXCL”

instead

of the

experiments. data. [24].

objects, with

=

way

These the

TIGER

The

first

dataset

the second

128,971 Experiment inner

dataset,

exper-

from

The

“REAL12”

dataset while

a map

objects.

and

the

“REAL21”

around.

the

2.5 times

RJ(I).

Bureau

CCQ

correlation,

extracted

131,461

of the objects

with

faster

with

outer

spatially

real-life

datasets

was

the

datasets,

object

in other

40,000.

and

other

two

of negative

pairs

0.2,

the

spatial

matched

inner

for which

worked

70K

two

and

Figures


dataset

40K

ran

joined

second

ran

CCQ

files of the US Census

MBRs

effects

Thus,

=

Because

approximately

iments

datasets

with

correlated

0.2.

SJ as it ignores tree-construction costs altogether. lower than even those costs of HJ are much although HJ includes the costs of both of RJ(M), input

and

dataset

“EXCL”

=

stability

performed

the

Experiment

CCQ

the were

of 100,000

was a uniform

tree-

The

cardinalities

dataset

negatively

compared

experiments

“CLU-UNI”,

“UNI-CLU”

and

the

with

experiment but

RJ(I)

These

experiment

a clustered

caused

while

datasets

In

construction,

algorithm

RJ(M)

cases.

Although

during

assumptions

techniques

in many

of basic

sizes.

is noteworthy

idealized

construction

series

construction

to be written

the

first

dataset

costs.

thrashing

costs.

make

RJ(I)

the

of its tree

increasing

results

varied

no buffer

the nature its

the

which

of experiments

methods,

results

faster Though

HJ experiments. SJ, and at least 3 times comparisons with RJ(M) of these

than

[Figure 8: Stability tests and tests on real-life data. x-axis: input datasets (UNI-CLU, CLU-UNI, EXCL, REAL1, REAL2).]

[Figure 9: Buffer size effects (real-life data). x-axis: buffer size (number of pages).]

[Figure 10: Effects of ratio r (real-life data). Plot title: "Effect of ratio of random to sequential access costs"; x-axis: r.]

[Figure 11: x-axis: ratio of raw-data file size to MBR file sizes.]

unfair runs

since

it

Experiments all

used

of

data

the

different

data

same

for

these

the

three

results

HJ

and

and

sequential

data

SJ exploit

improved

different

It is noteworthy

faster

that

methods

under

r ≈ 2.08 for the synthetic

datasets, costs for

HJ

were

the real-life

data.

This

lower

for

method

among

all

than

very

small

datasets.

Effects

join

methods Bytes)

shows

the

datasets.

256

results

of

In

comes

worst.

As expected, buffer

and

of buffer

had

buffer

the

from

24

bytes).

methods

widen

pages

the

The

HJ performs the best, and SJ and RJ(I) being the

the join stable

costs

rose

The

costs

through

out

for for the

all

On

stayed

whole

range

r = join

the

the

block 5 in

effects

access all

costs

of r,

costs,

our

earlier

for

r

in

the

ratio

on the join experiments.

the

range

1–30.
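The role of the ratio r in these experiments can be illustrated with a small cost model. This is a sketch under stated assumptions, not the cost formula of the paper: it charges one unit per sequentially accessed block and r units per randomly accessed block, and the function names and the block counts in the usage note are hypothetical.

```python
from typing import Optional


def weighted_io_cost(sequential_blocks: float, random_blocks: float,
                     r: float = 5.0) -> float:
    """I/O cost in units of one sequential block access.

    Assumption: a random block access costs r times a sequential one.
    """
    return sequential_blocks + r * random_blocks


def break_even_ratio(seq_a: float, rand_a: float,
                     seq_b: float, rand_b: float) -> Optional[float]:
    """Value of r at which methods A and B have equal weighted cost.

    Solves seq_a + r * rand_a = seq_b + r * rand_b for r. Returns None when
    both methods perform the same number of random accesses, in which case
    r does not change their ranking.
    """
    if rand_a == rand_b:
        return None
    return (seq_a - seq_b) / (rand_b - rand_a)
```

For example, a method with 300 sequential and 10 random block accesses ties with one that performs 100 sequential and 110 random accesses at r = 2; for larger r, the mostly sequential method is cheaper. This is the kind of crossover that varying r is meant to expose.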

Here

is size

large

a 512K produced

was

we

The

size

5.2%.

was

stronger

Figure

256

the

buffer,

when

bucket

but

value

not

of

drastically.

r to outperform

.7~o

The the

inflated

outer

the

original

inflation

bucket-filtering outer

dataset.

from

we ran

number the of the

outer during

additional

Bucket-filtering

objects

net

of

the

objects that

number

multiple function.

deflates

irrelevant

the

of

since

partition

filtering

on average. average

be

14 experiments

17.5% of

of 12

dataset.

the

discarding

objects,

an average

We

may

in

hand, In

byte

size

used by

partitioning.

of random costs.

dataset

other

dataset

methods

HJ

outer

dataset

tested

HJ are considerably r as small as 5. As HJ and gaps between

Discussion

assignment

show

with

dropped.

at

even for

continuously very

happened

at r ≈ 1.05 for

methods.

8

9

real-life

datasets

and

the

Figure

with

synthetic

random

performance

This

of r.

to 30, the performance

methods

other on

sizes.

also

set

r

size

sizes

(2M

with

ratio

buffer

experiments

general,

relatively

to sequential show

size

and

of

pages

in second,

as the

We

varying

to

trends.

Sizes

effects

Experiments

RJ(M)

low

Buffer the

by

(192K

similar

of

studied

values

The

all other

r increases

between so their

shows

at

trends.

RJ(I) and RJ(M) as r increased. HJ began to outperform all other

than

HJ does not require also

costs,

data.

similar

differences

access

real-life

very

varied

other

7.4

raw data,

with

show

the

block

“EXCL”

tested.

We

given

of experiments

8 again

costs

stable

performance

synthetic

and

sets of input

HJ is the most

that

the

with

with

methods

characteristics, for

but

Figure

rival

HJ

indices,

RJ(M).

as

size,

clustering. costs

constant

confirms

as fast “CLU-UNI”

of the

spatial

while

almost

pre-computed

twice

“UNI-CLU”,

input

degrees that

assumes

approximately

Projected

data). 10 shows

are

20

15

file size to MBR file sizes

250

50

Tests still

10

5

1

500

using copies

of

outer

eliminated original outer

effect

tends

is much

larger

outer dataset to

be

than

the

inner

dataset,

Filtering 30%

in some

insertion

can

initial

settings

is required An data,

MBR,

tend

be

need

to

process all

case it

on

cost

methods.

raw

data

sizes

RJ(I)

are affected figure

only

shows,

20 times

We the

even

substantially

when

than faster

When

substantially,

iments.

When

can simply nodes,

and

the

treat

performance

data

files

exist,

files

of all

Although

costs

ing

As

HJ still

the

but

not

as

due

runs

as its

HJ over

in

all

cases,

RJ(M)

SJ and

even

though

is manifestly

computed

indices

results

clearly

method

our

unfair

MBR

The

may

file. will

a large

pre-computed

factor

HJ

and pre-

efficient

join

exist.

of the

and filter

method

on

report

both

been in

on

the

PBSM

spatial can

extents

refinement

steps

the

Paradise

database

I/O

and

filter

step.

equivalent cover

on

static

and

CPU The to

the

the

extents

the

whole

space

partitioning.

assignment

functions

can

reduce

also

serve

to

it in some results,

their

removal.

joins

the

join

of

and

applying

defines

spatial

a new

partition

a set of bucket

function. object

ideas

predicates.

Our

components: An into

extents

assignment

multiple

functions

the

hash-join

difficulties

spatial

partition

for implementto improve

relational

hash-joins.

a data

designed

based

tion

the

for

and

on this

inner

function

buckets.

for

and evolves

function

for the outer

objects

into

the

relational

hash-joins our

that

two

spatial

join

spatial

The

Fur-

datasets

dataset

hash-join

partition

The

is immutable,

buckets.

The

aspects.

hash-join

based

the

partition

and

can as-

method

mirrors

Our

experiments

far

outperforms

method

algorithms

func-

by sampling

are inserted.

in other

spatial

a

is initialized

as data

multiple

show

tested

framework.
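The framework summarized here (bucket extents plus an assignment function, with single assignment for the inner dataset and multiple assignment for the outer dataset) can be sketched for the filter step as follows. This is a minimal in-memory illustration, not the implementation evaluated in the experiments. It assumes initial bucket extents given as seed points, a least-area-enlargement rule for assigning inner objects, and a nested-loop join within each bucket; the names `spatial_hash_join`, `Bucket`, and `seeds` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Obj = Tuple[int, Rect]                    # (object id, MBR)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


@dataclass
class Bucket:
    extent: Rect
    inner: List[Obj] = field(default_factory=list)
    outer: List[Obj] = field(default_factory=list)


def spatial_hash_join(inner: List[Obj], outer: List[Obj],
                      seeds: List[Tuple[float, float]]):
    """Return (overlapping MBR id pairs, number of filtered outer objects)."""
    # Initial bucket extents are points (assumed here to come from sampling).
    buckets = [Bucket(extent=(x, y, x, y)) for x, y in seeds]

    # Inner partition: each object goes to exactly one bucket, and that
    # bucket's extent grows to the MBR of everything assigned to it.
    for obj in inner:
        mbr = obj[1]
        best = min(buckets,
                   key=lambda b: area(union(b.extent, mbr)) - area(b.extent))
        best.extent = union(best.extent, mbr)
        best.inner.append(obj)

    # Outer partition: extents are now fixed. An object is copied to every
    # bucket whose extent it overlaps and dropped if it overlaps none.
    discarded = 0
    for obj in outer:
        hits = [b for b in buckets if overlaps(b.extent, obj[1])]
        for b in hits:
            b.outer.append(obj)
        if not hits:
            discarded += 1

    # Join phase: only buckets with the same index are joined.
    pairs = []
    for b in buckets:
        for i_id, i_mbr in b.inner:
            for o_id, o_mbr in b.outer:
                if overlaps(i_mbr, o_mbr):
                    pairs.append((i_id, o_id))
    return pairs, discarded
```

Because every inner object resides in exactly one bucket, each result pair can be produced in only one bucket, so the sketch needs no duplicate-removal pass. Outer objects that overlap no bucket extent cannot join with any inner object and are discarded during partitioning.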

dataset

dataset,

join

Acknowledgment:

system

their

[25],

and

The

Our

focus

has

P. Kriegel

PBSM

partitioning

partition

and

one

on tree

matching.

implementation

phase

phase

Section

framework

of PBSM of bucket

for

data (see

of our

the

times.

framework

context

strategies

immutable,

describe

and

hash-join

respective

based

[22]

corresponds

use the

PBSM

DeWitt

two

the

have

earlier Patel

using

to

assignment

method

sign

experimental

indices

for

Second,

be different.

We

still

assumes

of

RJ(M)

that

or not

assign

thermore,

comparison

since

datasets

used

for

difficult

of spatial

spatial

have

an

may

files,

for

and

HJ

by

assigned.

no duplicate

are efficient

overcomes

paradigm

functions

exper-

joins

complexity

paper

tree

HJ does not. Our HJ is a very

but

show

whether

RJ(M)

of

benefits.

outweigh

effort

it has been

of spatial

framework

hold.

HJ outperforms

be which

produces

methods

joins,

to the

all internal

RJ(M)

method

hash-join

This

comes

our

MBR

discard

our

are outer

effects

even

We that

by sampling

inflated.

Its

and

area.

several

only

no post-processing

relational

efficiency

HJ outperforms by

offers

is not

also

extents

distribution

and

filtering,

inflation,

after is

extents

objects

inner

need

bucket

sub-

Conclusions

join

RJ(M)

exist,

nodes of

Third,

hash-join

file,

leaf

advantages

data

map

spatial

size significantly.

requires

9

in

is as much

size,

also exist,

R-tree

the

only

input.

data

data

allows

whole

as input

dataset

inner

results

bucket

asymmetry

both

be inflated

are initialized

for the

using

uses bucket

the the

by

gain

exist

as demonstrated

R-trees

read

raw

R-trees

HJ. If MBR

to

RJ(M)

and

RJ(I).

than

pre-computed

closest

cases.

outperforms

so that

evolve

inner

dataset

counteract

R-trees.

on

assignment

The

design

method

extents

This

both

choosing

based

to

duplicate

cover

by

functions

multiple

HJ

of reading

MBR

different.

all methods,

raw

file,

the

the

this

the performance

that

costs

in

do not sizes

and

are

outer

may of

our

bucket

data, partition

our

for

assignment

so they

and

Our

dataset.

this

costs

run

datasets

MBR

by the

larger

which

assume

expect

read

and

bucket

First,

would

non-contiguous

multiple

In contrast,

Our

for

same

input

same

may input

we

cannot

11 projects

input

files

pre-computed

by

when

format.

generate

on

index

buckets

many

Elimination

adaptive

cases,

balances

comprise

datasets,

overlap

objects.

assume

We

are

raw

and

exists,

fly.

is the

ratio

Figure

HJ over

outer

that

research

data

to increase

reliance

the

raw

which

increase

the

of and

in some

MBR

data

the

RJ(M),

of its

applies

and

was

containing

experiments

and

the

It

necessary.

and

but

raw

MBRs

diminishes

of

Our only

bound,

except

regions,

number More

The

in size,

If

may

partitioning.

files

pairs,

compute

because

other

pointer indices.

to be I/O

Although

inflation

that

bucket-filtering

extents.

comprise

larger.

files.

methods

multiple

PBSM

datasets.

extents

balance

may

spatial

times

net

on the

bucket

and object

MBR

from

and

input

data.

size by over

issues.

to be similar

be many to

inner

dataset

pre-computed

largest

inflation

significantly

for

clustered

dataset

inflation

the

that

on these

input

largest

62.6%, and

depend

spatially

net outer

The

We observed

effects

with

reduced

cases.

was

61.6%.

files

and

actually

to of our

in

datasets

4.3.1).

We the

area,

[1]

The

and

Identical are

used

for

like to thank

Dr. T. Brinkhoff,

B. Seeger for generously

used in our experiment

with

real-life

Prof.

H.-

providing

the

data.

References M.

Kitsuregawa,

H. Tanaka,

plication of hash to data tecture,” New Generation

are non-overlapping, map

would and Prof.

our

compare design.

authors

66-74,

are

bucket

[2] D.

both

M.


T.

Moto-Oka,

“Ap-

1983.

J. DeWitt, R.

and

base machine and its archiComputing, vol. 1, no. 1, pp.

R. H.

Stonebraker,

Katz, and

F. Olken, D.

Wood,

L. D.

Shapiro,

“Implementation

techniques

for main

memory

database

ceedings of ACM SIGMOD Management [3] D.

of Data,

J. DeWitt

based join

and

151-164,

R.

partitioned

Gerber,

method

International

“Multiprocessor

hash-

of VLDB

85, pp.

[18]

and M. Takagi,

using dynamic

of the l~th

Data, [19]

pp. 322-332,

C. Faloutsos,

May

T. Sellis,

International pp. 427–439,

T. Sellis,

N. Roussopoulos,

Conference

on

“Join

Shapiro, large

main

Large

Data

Bases,

[20]

tree:

processing

memories,

Systems,

Very

1989.

vol.

in database



ACM

in

systems

Transactions

11, no. 3, pp. 239-264,

and M. H. Eich,

“Join

databases,”

ACM

Computing

pp. 64-113,

March

1992.

seeded trees,” national

Conference

Lo and from

on [21]

Spatial

sets,”

in

SSD

1,

joins

ing techniques

SIGMOD of Data,

The

Fourth

[24]

(Advances

in

Maine,

August

query

process-

for native

Management

of Data,

Rotem,

and parameter

500-509,

Kobe,

Japan

W. Lu and J. Han, range

Conference J. A.

join

on Data

Orenstein,

in

in spatial

of Data, algorithm

Portland, for

Zurich,

for

1992.

databases,”

International

Confer-

OR,

1989.

computing

lay of k-dimensional spaces,” in Databases (SSD ‘91), O. Gunther 381–400,

of pp.

indices

Advances and H.-J.

Switzerland,

the

over-

in Spatial Schek, ed-

August

28-30

1991, Springer-Verlag. O. Gunther,

“Efficient

Proceedings neering, R. H.

pp. 50–59, Guting

and-conquer problem,” 112, July

computation

of International

and

on Data

joins,” Engi-

1993. W.

algorithm Information

of spatial

Conference

Schilling, for

the

Sciences,

“A

practical

rectangle

and

using

DeWitt,

Y.

fractals,

divide-

intersection

vol. 42, no. 2, pp. 95-

1987.


R+-

objects,”

Bases,

pp.

3–11,

“Partition

Rong,

based

3-6 June “Dot:

on Man-

1993.

of the 1996A

Engineering,

DC,

“Efficient Proceedings

Conference

report,

spatial-

CM- SIGMOD 1996.

A

spatial

access

of International

pp. 152-159,

precensus

Technical

N.

B. Seeger,

“ in Proceedings

“Tiger/lines

“Client-server VLDB

on Man-

“The

R-trees,”

May

Canada,

on Data

J. DeWitt,

1994.

of International

pp. 284-292,

“Redundancy

“An

join

D.

.%Nz

on

Proceedings

Engineering,

Proceedings

D.

Montreal,

B. of Census,

J. Yu,

1991.

Engineering,

ence on Management

pp.

in

Data

of ACM SIGMOD

J. Orenstein,

Conference

“Distance-associated

search,”

in Proceedings

on

International pp. 237–246,

sus, Washington,

1990.

indices,”

Conference

SIGMOD

caJ documentation,”

spaces, ” in Pro-

International

pp. 343-352,

“Spatial

International

of spatial

using

of Data,

Faloutsos

Data

and

joins

of ACM

Conference

seeded

[25]

comparison

Kriegel,

agement

C.

Conference 1987.

Large

conference, [23]

“Analysis

1987.

H.-P.

and

of

Proceedings

and C. Faloutsos,

Inter-

International

Databases

Very

of spatial

J. M. Patel

of ACM

on Management

for multi-dimensional

merge join,” in Proceedings

pp. 209–

“Generating

of

Brinkhoff,

method

‘95), Portland,

ceedings of ACM SIGMOD

spatial

T.

index

England,

and access

access methods,”

using

1995, Springer-Verlag. “A

itors,

24, no.

1994.

Spatial

J. Orenstein,

D.

of ACM

C. V. Ravishankar,

Databases:

vol.

[22]

May

on Large

in relational

“Spatial

on Management MN,

data

Symposium 26-29

Surveys,

in Proceedings

220, Minneapolis,

trees

processing

Lo and C. V. Ravishankar,

[9] M.-L.

Proceedings

Brighton,

Septem-

spatial

A dynamic

processing

[7] P. Mishra

[8] M.-L.

oriented

Schneider, and robust

and N. Roussopoulos,

SIGMOD

Amsterdam,

pp.

1990.

of Data,

M. Takagi,

of Data,

Proceedings

Conference

of ACM

and

ber 1986.

[16]

and rectangles,”

International

agement

Database

[15]

for points

R.

An efficient

effect of bucket size tuning in the dynamic hybrid grace hash join method,” in Proceedings of the Fifteenth

with

[14]

Kriegel,

R*-tree:

of object

[6] L. D.

[13]

H.-P.

“The

SIGMOD

Conference, pp.

VLDB

M. Nakayama,

International

[12]

Beckmann,

method

‘(Hash-

destagingstrat-

on Management

structure SIGMOD

1984.

“The

pp. 257-266,

[11]

N.

Conference

Aug.

B. Seeger,

1988.

[5] M. Kitsuregawa,

[10]

“R-trees: A dynamic index Guttman, spatial searching,” Proceedings of ACM

A. for

47-57,

1985.

in Proceedings

468-478,

on

1984.

M. Kitsuregawa,

join

[17]

in Pro-

Conference

in Proceedings

Stockholm,

[4] M. Nakayama, egy,”

pp. 1-8,

algorithms,”

systems,”

International

files:

1991.

1990 techni-

Bureau

of Cen-

1989.

Kabra,

J. Luo,

paradise,”

Conference,

in

Santiago,

J. M.

Patel,

Proceedings Chile,

and of the

September