Metric Trees for Efficient Similarity Search in Large Process Model Repositories

Matthias Kunze, Mathias Weske

Hasso Plattner Institute at the University of Potsdam
Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam
{matthias.kunze,mathias.weske}@hpi.uni-potsdam.de

Summary. Due to the increasing adoption of business process management and the key role of process models, companies are setting up and maintaining large process model repositories. Repositories containing hundreds or thousands of process models are not uncommon, whereas only simplistic search functionality, such as text based search or folder navigation, is provided, today. On the other hand, advanced methods have recently been proposed in the literature to ascertain the similarity of process models. However, due to performance reasons, an exhaustive similarity search by pairwise comparison is not feasible in large process model repositories. This paper presents an indexing approach based on metric trees, a hierarchical search structure that saves comparison operations during search with nothing but a distance function at hand. A detailed investigation of this approach is provided along with a quantitative evaluation thereof, showing its suitability and scalability in large process model repositories.

1 Introduction

Nowadays, one can find large process model collections in almost every company, since the value of business processes has been recognized and acknowledged as an essential asset to drive an organization [26, 28]. Effective reuse, extraction of knowledge, and generation of insights among these model collections require capabilities to efficiently search and navigate within them. Process models generally expose a high degree of heterogeneity [1] and thus, exact-match searching is neither feasible nor meaningful. Instead, the concept of similarity has been found to be a valuable means to compare and search process models, and many similarity measures have been proposed [1, 7, 14, 16, 17, 18, 24]. These include semantic approaches to identifying corresponding nodes, as well as structural and behavioral aspects of process model similarity. Several of these authors note that their algorithms are particularly expensive with regard to time complexity. Yan et al. [29] comprehensively studied process model repositories currently available for industry and academia. Their survey discloses that efficient implementation of search for similar process models has been widely neglected.

However, applications for fast and efficient similarity search in large process model repositories are manifold, due to the increasing adoption of business process modeling. Aligning large process model collections to changing market opportunities can be achieved through continuous process model refactoring, which, however, bears the risk of introducing inconsistencies [26]. Searching for similar model artifacts yields the set of relevant models and provides a means to track and propagate adoption of process models [23]. Efficient search in process model repositories can also help to discover redundancies when models are added to a collection [17], e.g., when process models of two organizations that engage in a merger need to be consolidated. Furthermore, similarity search among process models offers the chance to obtain reference models, or fragments thereof, as means to semi-automatic modeling assistance [13], or to find normative process models, which are most similar to process models that have been discovered through process mining [1]. In this paper, we do not address optimization of the similarity algorithms’ efficiency, but rather try to reduce the number of comparisons required to find models in a repository. We achieve this goal by indexing, i.e., building efficient data structures to guide searching. Metric trees [20, 21] offer a means to increase the efficiency of similarity search by a notion of distance, or dissimilarity. A metric tree is an index that partitions the metric space, i.e., the collection of process models. During search, certain partitions can be safely pruned under some circumstances, reducing the number of comparison operations. We applied similarity search in metric spaces to the field of business process management to evaluate its usefulness and scalability to search within large process model repositories. The remainder of this work is structured as follows. 
In Section 2 we give an overview of similarity search that puts metric trees into the context of similarity search and discuss previous work on process model similarity. Subsequently, we introduce our approach to build and search within an index of process models with the help of the M-Tree [6] index structure and an implementation of the graph edit distance for process models, in Section 3. Since we aim at reducing the number of comparison operations when searching for similar process models, in Section 4, we evaluate the implementation of our approach toward this aspect with varying parameters. Finally, we conclude our work and give an outlook on research directions and objectives that we will address in the future, in Section 5.

2 Preliminaries

This section discusses related work with regard to process model similarity and similarity search, which lays the groundwork for our approach, described in Section 3. Similarity search has received increasing interest in a variety of domains, including multimedia, computational biology, data mining, and pattern recognition, cf. [4, 31]. The underlying idea is based on a definition of proximity that can be defined by characteristic features extracted from the actual objects. In the context of process models, such characteristic features can be structure, i.e., the model graph, or behavior, derived by means of the state space [1, 27].

2.1 Process Model Similarity

Many existing approaches to compare process models, such as various notions of bisimulation [25], trace equivalence [11], and workflow inheritance [22], only tell whether two process models are the same in a certain context, but not how similar they are. More meaningful measures that quantify the similarity of process models have since been demanded and proposed [1, 7, 17, 24]. The approaches, which can be generally divided into behavioral and structural ones, build on the same concept: (1) identify corresponding process model elements, e.g., activities, gateways, control flow edges, and (2) compare the relations between them. The behavior of a process is characterized by the set of possible execution sequences of the process model elements. The behavioral similarity of two process models addresses relationships regarding causality, concurrence, and exclusion of corresponding elements [17, 27] and is indicated by a (usually weighted) fraction of common behavior, whereas particular fragments of a process model may be considered more relevant than others [1]. In structural approaches, the relation between process model elements essentially refers to the graph structure of the process model, i.e., which nodes are predecessors and successors of other nodes, connected through edges with each other. Similarity is often based on the fraction of common structural components, such as control flow arcs [1] or activity nodes [13]. Another approach to (dis-)similarity is to calculate the cost of converting one process model into another one with the graph edit distance [8, 14, 16, 18].
The graph edit distance is defined by the least expensive sequence of operations to transform one graph into another, where each operation type, including inserting, removing, and substituting elements of the graph, has a cost-weight assigned [9]. Key to identifying corresponding process model elements is labeling. Although correspondences can be found manually through reviews from process model experts [8], researchers addressed this issue applying structural models, e.g., the Levenshtein distance [15], and linguistic models, such as synonyms taken from WordNet (cf. http://wordnet.princeton.edu/) [13, 24].

2.2 Indexing in Metric Spaces

In exact-match search, objects of the result set are exact equals among a set of structured data that can be ordered by simple means. In contrast, similarity search does not assume an intrinsic ordering of objects: The result set comprises objects that are within certain proximity of the query object. In coordinate spaces, objects are treated as vectors in a multidimensional space by mapping each feature to a value of a particular dimension [12]. The concept of vectors in a multidimensional space offers means to calculate distances of two objects by computing the distance of the corresponding feature vectors, e.g., using the Minkowski distances [31]. Search structures for vector spaces, so-called spatial access methods (SAM), effectively exploit the ordering of feature values of a dimension to find similar objects. The R-Tree [10] is a well-known approach to search within coordinate spaces. However, the notion of distance in a multidimensional space may be of little meaning, due to correlation of features, so-called cross-talk [31], which does not map into vector-based similarity. Generally, one cannot assume feature vectors to be available in certain domains. For example, a representation of graph structure or process behavior in a coordinate space is not feasible or has only limited meaning, and therefore, SAMs cannot be applied.

A more general approach that addresses the similarity searching problem has been proposed, called metric spaces [20, 21], of which coordinate spaces are a special type. In metric spaces, nothing but the distance of two objects can be computed. Yet, the abstract notion of distance offers the opportunity to construct particular means to express proximity of objects, with regard to the desired use case and accounting for cross-talk between the objects' features. Informally, searching in a metric space means to obtain a ranked set of objects that are in proximity to a query object, or whose feature values fall within a given range from those of a query object. There is a large body of research in similarity searching, cf. [4, 12, 31], and many index structures for metric spaces, which will be referred to as metric trees hereafter, have been proposed, including, but not limited to, the GNAT [3], VP-Tree [5], and M-Tree [6]. The common goal of these approaches is to avoid exhaustive examination of the search space and to reduce the number of comparisons required for query processing.
This is because cost for query execution is not only constrained by I/O-operations but is also CPU-bound, due to expensive computation of the distance between two objects. This is achieved by preprocessing the data, i.e., “building an equivalence relation, so that at search time some classes are discarded and the others are exhaustively searched” [4]. The equivalence relation in metric spaces is defined in terms of the distance between two elements. Equivalence classes are generally built using reference objects, pivots, that partition the search space, such that all objects within a certain distance to a pivot are within its equivalence class. Recursive partitioning of a class leads to a hierarchical structure of equivalence classes: the index tree. Some indexing approaches use only one pivot per node in the index tree, e.g., the VP-Tree [5], and partition the space by means of spheres of different radii centered at the pivot [30]. Other indexing approaches take two or more pivots into account, i.e., objects are partitioned among several pivots based on their distance to them and fair distribution between them [3, 6].
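The two partitioning styles described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited index structures: the function names `ball_partition` and `nearest_pivot_partition` are hypothetical, the median is an assumed choice of radius, and integers with absolute difference stand in for process models and their distance. The fair-distribution criterion mentioned for [3, 6] is deliberately left out.

```python
from statistics import median
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ball_partition(
    objects: Sequence[T], pivot: T, dist: Callable[[T, T], float]
) -> Tuple[float, List[T], List[T]]:
    """Single-pivot split: a sphere around the pivot separates two classes.

    The median distance is used as radius (an assumption of this sketch).
    """
    distances = [dist(pivot, o) for o in objects]
    radius = median(distances)
    inside = [o for o, d in zip(objects, distances) if d <= radius]
    outside = [o for o, d in zip(objects, distances) if d > radius]
    return radius, inside, outside


def nearest_pivot_partition(
    objects: Sequence[T], pivots: Sequence[T], dist: Callable[[T, T], float]
) -> Dict[int, List[T]]:
    """Multi-pivot split: every object joins the class of its closest pivot.

    Ties go to the pivot listed first.
    """
    classes: Dict[int, List[T]] = {i: [] for i in range(len(pivots))}
    for o in objects:
        closest = min(range(len(pivots)), key=lambda i: dist(pivots[i], o))
        classes[closest].append(o)
    return classes


def abs_diff(a: int, b: int) -> int:
    """Toy metric on integers, standing in for a process model distance."""
    return abs(a - b)


toy_objects = [1, 2, 8, 9, 15]
radius, inside, outside = ball_partition(toy_objects, 2, abs_diff)
classes = nearest_pivot_partition(toy_objects, [2, 14], abs_diff)
```

Applying either function recursively to the resulting classes would yield a hierarchy of equivalence classes in the sense described above.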

3 Efficient Search with Metric Trees

This section elucidates how a metric tree can reduce the number of comparison operations to obtain a set of process models similar to a given query q. Search in metric spaces can be coarsely classified into range searches, i.e., given a query model q, find all models that fall within a certain distance range r(q), and into nearest neighbor searches, i.e., find the k models with least distance to q, or a combination of both. Nearest neighbor queries are quite similar to range queries, in that the range iteratively decreases while k candidates are being collected until no models more similar to q can be found. Metric trees partition metric spaces by means of relative distances between objects instead of their absolute position in a vector space. The only requirement of such indexing is that the notion of distance is a metric [31].

Definition 1 (Distance Metric). A metric space is a pair $S = (D, d)$ where $D$ is the domain of objects and $d : D \times D \to \mathbb{R}$ is a metric, i.e., a distance function with the following properties:
– symmetry: $\forall o_i, o_j \in D : d(o_i, o_j) = d(o_j, o_i)$
– nonnegativity: $\forall o_i, o_j \in D,\ o_i \neq o_j : d(o_i, o_j) > 0 \;\wedge\; \forall o_i \in D : d(o_i, o_i) = 0$
– triangle inequality: $\forall o_i, o_j, o_k \in D : d(o_i, o_k) \leq d(o_i, o_j) + d(o_j, o_k)$

3.1 A Sample Similarity Measure for Process Models

The concept of metric trees is agnostic of the actual distance notion as long as it is a metric. Most of the similarity notions mentioned earlier satisfy the first two requirements, i.e., symmetry and nonnegativity, while the triangle inequality is the essential property that allows search to prune subtrees in metric trees. To calculate the dissimilarity of process models, we chose the graph edit distance as an example, which satisfies the triangle inequality. Since we apply the graph edit distance to process models, they will be introduced first.

Definition 2 (Process Model). Let $L$ be a set of labels and $T$ a set of node types. A process model $P$ is a connected graph $(N, E, \lambda, \tau)$ where
– $N$ is a finite set of nodes,
– $E \subseteq N \times N$ is a set of edges,
– $\tau : N \to T$ assigns a type to each node, and
– $\lambda : N \to L$ assigns a label to each node.

Definition 3 (Graph Edit Distance). A process model $P_1 = (N_1, E_1, \lambda, \tau)$ can be transformed into another process model $P_2 = (N_2, E_2, \lambda, \tau)$ through a finite sequence $O = \sigma_1, \sigma_2, \ldots, \sigma_n$ of edit operations $\sigma_i$, such that $P_2 = \sigma_n(\ldots(\sigma_2(\sigma_1(P_1)))\ldots)$. Let $M : N_1^* \to N_2^*$ with $N_1^* \subseteq N_1$, $N_2^* \subseteq N_2$ be a bijective mapping that indicates that some nodes in $P_1$ have corresponding nodes in $P_2$, i.e., $\forall n \in N_1^* : M(n) \in N_2^*$, which are substitutable. An edge established by two succeeding nodes in $P_1$, i.e., $e = (n_i, n_j) \in E_1$, can be substituted iff the corresponding nodes induced by the mapping $M$ are connected by an edge $e' = (M(n_i), M(n_j)) \in E_2$ in $P_2$. To quantify the dissimilarity of $P_1$ and $P_2$, specific costs are assigned to the edit operations: $w : \sigma_i \to \mathbb{R}$. The edit operations and their specific costs $w$ are classified as follows (we used the given costs for our evaluation, cf. Section 4; they can be varied to shift the similarity notion, but the costs for insert and remove must be equal to preserve the symmetry property of the metric):

– $\sigma_{in}$ inserts a node to $N_1$ that has a counterpart in $N_2$; $w(\sigma_{in}) = 1$
– $\sigma_{rn}$ removes a node from $N_1$ that has no counterpart in $N_2$; $w(\sigma_{rn}) = 1$
– $\sigma_{ie}$ inserts an edge to $E_1$ that has a counterpart in $E_2$; $w(\sigma_{ie}) = 1$
– $\sigma_{re}$ removes an edge from $E_1$ that has no counterpart in $E_2$; $w(\sigma_{re}) = 1$
– $\sigma_{sn}$ substitutes a node $n \in N_1$ with its counterpart $M(n) \in N_2$ induced by the mapping $M$; $w(\sigma_{sn}) = 2 \cdot \mathrm{ldn}(\lambda(n), \lambda(M(n)))$. To calculate the cost of substituted nodes we apply the Levenshtein distance [15] $\mathrm{ld}$ of their labels $s_1 = \lambda(n)$, $s_2 = \lambda(M(n))$, normalized by the longer label, i.e.,
  $$\mathrm{ldn}(s_1, s_2) = \frac{\mathrm{ld}(s_1, s_2)}{\max(|s_1|, |s_2|)}, \qquad \mathrm{ldn}(\bot, \bot) = 0$$
– $\sigma_{se}$ substitutes an edge $e \in E_1$ with its counterpart $e' \in E_2$ induced by the mapping $M$, as explained above; $w(\sigma_{se}) = 0$

The graph edit distance is denoted by the least cost of $n$ possible sequences of edit operations $O_1, \ldots, O_n$ that transform $P_1$ into $P_2$:

$$\mathrm{ged}(P_1, P_2) = \min_{l = 1 \ldots n} \sum_{\sigma_i \in O_l} w(\sigma_i)$$

We adapted the greedy algorithm proposed by Dijkman et al. [8] to find a mapping M and to calculate the graph edit distance iteratively. The Levenshtein distance [15] is used to obtain a set of candidate pairs of process model nodes for the mapping. In our implementation for Event-driven Process Chains (EPC), cf. Section 4, the algorithm further only considers pairs of nodes of the same type τ ∈ T.

In contrast to other applications that use the graph edit distance, e.g., obtaining a ranked set of similar models in a repository, we do not normalize the graph edit distance by the size of either process model, because, in order to find models similar to a given query, we are interested in the number of operations to transform the query into a model from the repository.
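The node substitution cost above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and treating the empty string as the undefined label ⊥ is an assumption made here only to cover the case ldn(⊥, ⊥) = 0.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming Levenshtein distance ld(s1, s2)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete c1
                    current[j - 1] + 1,            # insert c2
                    previous[j - 1] + (c1 != c2),  # substitute c1 by c2
                )
            )
        previous = current
    return previous[-1]


def ldn(s1: str, s2: str) -> float:
    """Levenshtein distance normalized by the length of the longer label."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        # Assumption: two empty labels play the role of ldn(⊥, ⊥) = 0.
        return 0.0
    return levenshtein(s1, s2) / longest


def substitution_cost(label1: str, label2: str) -> float:
    """Cost w(σsn) = 2 · ldn(λ(n), λ(M(n))) of substituting a node."""
    return 2 * ldn(label1, label2)
```

With these costs, substituting a node by one with an entirely different label costs 2, the same as removing one node and inserting another (1 + 1), while substituting nodes with equal labels is free.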

Fig. 1. Sample Petri Net Process Models: (a) p, (b) o1, (c) q, (d) o2.

To illustrate our approach we use the four process models of Figure 1. For simplicity, these models have been labeled with single characters, resulting in a normalized Levenshtein distance of their labels of either 0 (for equal labels) or 1. Given, for example, models p and o2 from Figure 1, the difference between both models is the path containing transition y, which is a concurrent path in p while being a conflict path in o2. The least-cost sequence of edit operations to transform p to o2 consists of removing both places before and after y in p and the connected edges, and reconnecting y with the places connected to a and b as well as b and c, respectively. Remaining places and transitions (a, b, and c) as well as the corresponding edges can be substituted at zero cost, which results in a graph edit distance ged(p, o2) = 8. Likewise, the graph edit distance of p and o1 is 18. In the next section, we explain how this distance metric is used to build and search efficiently within a process model repository. We will refer to the generic metric d when explaining the concepts of the index; for concrete examples we employ the graph edit distance ged as a representative metric.

3.2 Metric Trees for Process Model Repositories

From the many alternative approaches on metric trees, cf. [4, 12, 31], we chose the M-Tree [6], which, in contrast to other approaches, has been designed to combine a balanced and dynamic index. The latter means that the index does not need to be rebuilt after adding process models to the repository, which is certainly a desirable feature in large, actively used process model repositories. Nodes of an M-Tree comprise a set of pivots, each of which points to a subtree, i.e., a node on the next lower level. This is illustrated in Figure 2: The pivot p in the root node n_root references leaf node n3, which contains three pivots, o0, o1, o2, each of which references a process model. In M-Trees, all indexed objects are referenced in leaf nodes, and feature values of pivots are chosen from the set of indexed objects, rather than being constructed artificially. Thus, the feature value of an object may be referenced several times, once in a leaf node, and in one or more nonleaf nodes. This is why, in Figure 2, p and o0 have the same feature value, i.e., ged(o0, p) = 0.

Fig. 2. Example Metric Tree including Process Models of Figure 1. Pivots store their radius and their distance to parent pivot, i.e., r(p) = 18, since ged(p, o0 ) = 0, ged(p, o1 ) = 18, and ged(p, o2 ) = 8

The subtree rooted in pivot p is called a covering tree, denoted T(p), and each child element o ∈ T(p) stores its distance d(p, o) to the pivot, unless o is a pivot in the root node n_root. Each pivot maintains its covering radius r(p), such that ∀o ∈ T(p) : d(p, o) ≤ r(p), cf. Figure 3(a).

Building Metric Trees. These values, i.e., r(p) and d(o, p), o ∈ T(p), are calculated in a bottom-up approach when a new process model o_n, or a feature value thereof, is added to the index, which renders the M-Tree both balanced and dynamic. First, o_n is treated like a query to identify the most similar leaf node to add the object to and thus moves down the tree. Each time o_n passes a pivot vertically, this pivot's covering radius r(p) will be updated by the distance d(p, o_n) if it exceeds r(p). This strategy keeps the covering radii of pivots compact. Finally, a new pivot o'_n, i.e., the feature value of o_n along with its distance to its parent pivot, will be added to a leaf node. If adding pivot o'_n to a node T(p') exceeds the maximum size of this node, it will be split. New pivots p'_1, p'_2 will be chosen from T(p') and its pivots p_i ∈ T(p') will be distributed among two newly created nodes T(p'_1), T(p'_2). The new pivots p'_1, p'_2 will be added to the parent node of T(p'), whereas the old pivot p' will be removed from it. This procedure may require splitting the parent node, if it grows too large, and so on, which may move up to the root node of the tree recursively. Since pivots are distributed evenly among nodes in case of a split, the tree remains balanced, which equalizes search costs.

Search for Similar Process Models and Pruning of Subtrees. Search in metric spaces takes a query process model q and a query radius r(q), which describes the acceptable distance of a matched process model compared with q. The triangle inequality of the metric d is the particular property that makes similarity searching in metric spaces efficient. It allows pruning subtrees of the metric tree without calculating the distance between q and the pivots of these subtrees. Search starts in the root node, and for each pivot p ∈ n_root the distance d(p, q) is calculated.
If the covering radius of a pivot and the query radius intersect, i.e., d(p, q) ≤ r(p) + r(q), there may exist referenced models in T(p) or in its subtrees that satisfy q with regard to r(q), as depicted in Figure 3(b). Let r(q) = 6 and take the example metric tree from above; then ged(p, q) = 7 and r(p) + r(q) = 18 + 6 = 24. This requires comparing each pivot in T(p) with q. If the distance were greater than the sum of both radii, i.e., d(p, q) > r(p) + r(q), then T(p) could not contain any objects that are close enough to q due to the triangle inequality, as illustrated in Figure 3(c). Thus, T(p) could be safely pruned from the search. Since the covering radius of p and its distance to all objects o ∈ T(p), where T(p) is a leaf node, are known a priori, it is also possible to prune them without calculating d(o, q). Recall that if d(o, q) > r(q) + r(o), then o can be safely excluded from the search. Since o references only one object, its "covering radius" is 0 and can therefore be ignored. The triangle inequality yields d(o, p) ≤ d(o, q) + d(p, q) and d(p, q) ≤ d(o, p) + d(o, q), from which it follows that d(o, q) ≥ |d(o, p) − d(p, q)|; thus, if |d(o, p) − d(p, q)| > r(q), then o can be pruned. With the values from the examples above, ged(o1, p) = 18, ged(p, q) = 7, and r(q) = 6, we obtain ged(o1, q) ≥ 11 > 6. It can be concluded that o1 is too far away from q with regard to r(q) and can be dismissed from the search result without calculating ged(o1, q). Searching with q and r(q) = 6 among the given models would only return o2, since its distance ged(o2, q) = 3 is within the acceptable range of the query r(q). The distance ged(o0, q) = 7 had already been calculated while visiting pivot p. This particular setting from the examples is depicted in Figure 3(b).

N.B.: Given that the distance notion complies with the definition of a metric, cf. Definition 1, pruning subtrees and objects during search does not influence the quality of the result, i.e., it provides the same results as exhaustive search.
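The two pruning rules can be replayed on the running example with a small Python sketch. It is a simplified illustration rather than the M-Tree algorithm of [6]: the `Entry` class, the `range_search` function, and the lookup table are hypothetical, and the table only contains the two distances to q that the example actually needs. Since ged(o1, q) is not stated in the text, it is deliberately absent, so a lookup for it would fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Entry:
    obj: str                # identifier of the pivot's feature value
    dist_to_parent: float   # d(o, p), stored when the entry was inserted
    radius: float = 0.0     # covering radius; 0 for entries in leaf nodes
    children: List["Entry"] = field(default_factory=list)


def range_search(
    entries: List[Entry],
    parent_to_query: Optional[float],
    query: str,
    r_q: float,
    dist: Callable[[str, str], float],
    hits: List[Tuple[str, float]],
) -> None:
    """Collect all leaf entries within distance r_q of the query."""
    for e in entries:
        # Rule 1: |d(o, p) - d(p, q)| > r(q) + r(o) excludes o and its
        # subtree without computing d(o, q).
        if parent_to_query is not None and (
            abs(e.dist_to_parent - parent_to_query) > r_q + e.radius
        ):
            continue
        d = dist(e.obj, query)
        # Rule 2: d(o, q) > r(q) + r(o) means the regions do not intersect.
        if d > r_q + e.radius:
            continue
        if e.children:
            range_search(e.children, d, query, r_q, dist, hits)
        else:
            hits.append((e.obj, d))


# Distances taken from the running example; others are intentionally missing.
KNOWN = {("p", "q"): 7, ("o2", "q"): 3}
calls: List[Tuple[str, str]] = []


def ged_lookup(a: str, b: str) -> float:
    calls.append((a, b))
    return KNOWN[(a, b)]


leaf = [Entry("o0", 0), Entry("o1", 18), Entry("o2", 8)]
root = [Entry("p", 0, radius=18, children=leaf)]
hits: List[Tuple[str, float]] = []
range_search(root, None, "q", 6, ged_lookup, hits)
```

In this run only ged(p, q) and ged(o2, q) are evaluated: o1 is excluded because |18 − 7| = 11 > 6, and o0 because |0 − 7| = 7 > 6, which matches the result of the worked example.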

Fig. 3. Space partition with a single pivot (a). Include covering tree T(p) in further (exhaustive) search, if radii r(p) and r(q) intersect (b). Exclude T(p) from search if it is too distant from q (c).

4 Evaluation

We evaluated the approach described in the previous section with regard to its scalability to process model repositories and present the results within the following paragraphs. As metric trees' efficiency relies on the effective pruning of partitions of the search space, and thus on reducing the number of comparison operations, we chose this measure for our evaluation. To test our approach of indexing business process models using the M-Tree as index structure and the graph edit distance as (dis-)similarity metric, we took the SAP reference model, a set of approximately 600 EPC process models. These models consist of 21 nodes on average, with up to 130 nodes per model. Less than 8 percent of the models comprise 50 or more nodes. Figure 4(a) illustrates the distribution of graph edit distances among the pairs within our dataset. We chose models from the indexed dataset as queries and calculated the average number of comparison operations to find the 10 most similar process models for the given query, i.e., performed a 10-nearest-neighbor search. The baseline for our comparison is sequential search (SEQ), i.e., exhaustive, pair-wise comparison of the query with each stored model.

In a first phase, we compared different methods to choose a pivot when splitting a node, cf. Section 3.2, namely the RANDOM method that randomly chooses two pivots from all available ones, and the MAX_DIST method that chooses pivots with maximal distance to partition the space most effectively [6]. Figure 4(b) shows that RANDOM already saves about 42% of comparison operations compared to SEQ. While MAX_DIST requires more comparison operations to build the index, that is, for finding the most distant pivots, it provides further search improvements over RANDOM, saving up to 85.8% of comparison operations compared to SEQ. Increasing the sample size improves this ratio and indicates that search within the index has logarithmic costs for comparison operations. Due to the search algorithm, pruning of subtrees is based on the chosen pivots, and thus search performance varies, observable by the fluctuation of the curves, and can only be measured empirically.

Fig. 4. Distribution of graph edit distance among samples (a). Scalability by method to choose pivot (b) and maximum node size (c). Panel (a) plots the number of pairs over the graph edit distance; panels (b) and (c) plot the average number of comparisons over the number of process models, with curves for SEQ, RANDOM, and MAX_DIST in (b) and for SEQ and maximum node sizes of 50, 10, 5, and 2 in (c).

In a second evaluation phase, we compared the impact of different maximum node sizes, using MAX_DIST. As Figure 4(c) indicates, this has no significant influence on the efficiency of searching at large dataset sizes if the node size is small. However, large node sizes perform less efficiently, indicated by the dashed curve, representing a node size of 50 pivots. This suggests that moderate node sizes, e.g., 10 pivots per node, are most suitable and scalable.
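The evaluation measure, i.e., the number of comparison operations per query, could be collected by wrapping the distance function in a counter, as in the following Python sketch. This is an assumption about how such a measurement might be set up, not a description of the authors' test harness: `CountingDistance`, `sequential_knn`, and `savings` are hypothetical names, and integers with absolute difference replace the SAP reference model and the graph edit distance.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class CountingDistance:
    """Wraps a distance function and counts how often it is evaluated."""

    def __init__(self, dist: Callable[[T, T], float]) -> None:
        self.dist = dist
        self.calls = 0

    def __call__(self, a: T, b: T) -> float:
        self.calls += 1
        return self.dist(a, b)


def sequential_knn(
    models: Sequence[T], query: T, k: int, dist: Callable[[T, T], float]
) -> List[T]:
    """Baseline (SEQ): compare the query with every stored model."""
    return sorted(models, key=lambda m: dist(m, query))[:k]


def savings(index_calls: int, seq_calls: int) -> float:
    """Fraction of comparison operations saved relative to SEQ."""
    return 1.0 - index_calls / seq_calls


# Toy data: 20 "models" on a line, compared by absolute difference.
toy_models = list(range(0, 100, 5))
counter = CountingDistance(lambda a, b: abs(a - b))
nearest = sequential_knn(toy_models, 42, 3, counter)
```

Passing the same kind of wrapper to an index-based search and comparing `counter.calls` against the sequential baseline would give a savings ratio of the kind reported above; for instance, a hypothetical index run needing 5 comparisons against the 20 of SEQ corresponds to `savings(5, 20)`, i.e., 75%.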

5 Conclusion and Future Work

Searching in process model repositories needs to be both meaningful and efficient to be useful under practical conditions. Much effort has been spent investigating meaningful similarity notions to compare process models considering process behavior and graph structures. However, they lack efficiency in terms of computation time and resources, mainly due to the nature of process models, which, in consequence, requires a detailed comparison to identify corresponding elements within process models and map their semantics. Thus, costs with regard to the number of comparison operations need to be minimized to make search more efficient.

In this paper, we applied an approach known from the database community to business process model repositories, namely metric trees. This index structure leverages nothing but the distance between characteristic feature values of objects to partition the search space and saves comparison operations by safely excluding partitions from exhaustive search. We used the graph edit distance for business process models as distance metric and a greedy algorithm to identify corresponding process model elements. The scalability of this approach has been demonstrated in experiments that show significant savings of comparison operations, using the SAP reference model.

This paper focuses primarily on the suitability and scalability of the metric tree approach to process model repositories, thus using the rather simple graph edit distance as a metric. However, the approach is not limited to a particular process model similarity measure, such as the graph edit distance. Any method that can be applied to compare a given query model with a model from the repository is applicable if it yields a metric, cf. Definition 1. Other aspects of process model search, cf. Section 2.1, shall be examined in future work, as well as extending this approach to more expressive queries, e.g., BPMN-Q [2]. Using more abstract process model definitions, as, e.g., proposed by La Rosa et al. [19], would make the index applicable to heterogeneous process model repositories regarding different process modeling languages.

References
1. W. M. P. van der Aalst, A. K. Alves de Medeiros, and A. J. M. M. Weijters. Process equivalence: Comparing two process models based on observed behavior. In International Conference on Business Process Management (BPM 2006), volume 4102 of Lecture Notes in Computer Science, pages 129–144. Springer, 2006.
2. Ahmed Awad. BPMN-Q: A language to query business processes. In EMISA, pages 115–128, 2007.
3. Sergey Brin. Near neighbor search in large metric spaces. In VLDB ’95: Proceedings of the 21th International Conference on Very Large Data Bases, pages 574–584, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
4. Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquín. Searching in Metric Spaces. ACM Comput. Surv., 33(3):273–321, 2001.
5. Tzi-cker Chiueh. Content-based image indexing. In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 582–593, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
6. Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB ’97: Proceedings of the 23rd International Conference on Very Large Data Bases, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
7. Remco Dijkman. Diagnosing differences between business process models. In Business Process Management, volume 5240 of Lecture Notes in Computer Science, pages 261–277. Springer Berlin / Heidelberg, 2008.
8. Remco M. Dijkman, Marlon Dumas, and Luciano García-Bañuelos. Graph matching algorithms for business process model similarity search. In BPM, pages 48–63, 2009.
9. Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A Survey of Graph Edit Distance. Pattern Analysis & Applications, 13(1):113–129, February 2010.
10. Antonin Guttman. R-trees: a dynamic index structure for spatial searching. SIGMOD Rec., 14(2):47–57, 1984.
11. Jan Hidders, Marlon Dumas, Wil M. P. van der Aalst, Arthur H. M. ter Hofstede, and Jan Verelst. When are two workflows the same? In CATS ’05: Proceedings of the 2005 Australasian symposium on Theory of computing, pages 3–11, Darlinghurst, Australia, Australia, 2005. Australian Computer Society, Inc.
12. Gisli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric spaces (survey article). ACM Trans. Database Syst., 28(4):517–580, 2003.
13. Agnes Koschmider. Ähnlichkeitsbasierte Modellierungsunterstützung für Geschäftsprozesse. PhD thesis, Universität Karlsruhe (TH), Fakultät für Wirtschaftswissenschaften, 2007.
14. Jochen M. Küster, Christian Gerth, Alexander Förster, and Gregor Engels. Detecting and resolving process model differences in the absence of a change log. In Business Process Management, volume 5240 of Lecture Notes in Computer Science, pages 244–260. Springer Berlin / Heidelberg, 2008.
15. V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, 1966.
16. Chen Li, Manfred Reichert, and Andreas Wombacher. On measuring process model similarity based on high-level change operations. In ER, pages 248–264, 2008.
17. Jan Mendling, Boudewijn F. van Dongen, and Wil M. P. van der Aalst. On the Degree of Behavioral Similarity between Business Process Models. In Markus Nüttgens, Frank J. Rump, and Andreas Gadatsch, editors, EPK, volume 303 of CEUR Workshop Proceedings, pages 39–58. CEUR-WS.org, 2007.
18. Mirjam Minor, Alexander Tartakovski, and Ralph Bergmann. Representation and structure-based similarity assessment for agile workflows. In ICCBR ’07: Proceedings of the 7th international conference on Case-Based Reasoning, pages 224–238, Berlin, Heidelberg, 2007. Springer-Verlag.
19. Marcello La Rosa, Hajo A. Reijers, Wil M. P. van der Aalst, Remco M. Dijkman, Jan Mendling, Marlon Dumas, and Luciano García-Bañuelos. Apromore: An advanced process model repository. http://eprints.qut.edu.au/27448/, 2009.
20. Jeffrey K. Uhlmann. Metric Trees. Applied Mathematics Letters, 4:61–62, 1991.
21. Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 1991.
22. W. M. P. van der Aalst and T. Basten. Inheritance of Workflows: An Approach to Tackling Problems Related to Change. Theor. Comput. Sci., 270(1-2):125–203, 2002.
23. W. M. P. van der Aalst. Inheritance of business processes: A journey visiting four notorious problems. In Petri Net Technology for Communication-Based Systems, volume 2472 of Lecture Notes in Computer Science, pages 383–408. Springer Berlin / Heidelberg, 2003.
24. Boudewijn van Dongen, Remco Dijkman, and Jan Mendling. Measuring Similarity between Business Process Models. In Advanced Information Systems Engineering, volume 5074 of Lecture Notes in Computer Science, pages 450–464. Springer Berlin / Heidelberg, 2008.
25. Rob J. van Glabbeek and W. Peter Weijland. Branching Time and Abstraction in Bisimulation Semantics. J. ACM, 43(3):555–600, 1996.
26. Barbara Weber and Manfred Reichert. Refactoring process models in large process repositories. In Advanced Information Systems Engineering, volume 5074 of Lecture Notes in Computer Science, pages 124–139. Springer Berlin / Heidelberg, 2008.
27. Matthias Weidlich and Mathias Weske. Structural and Behavioural Commonalities of Process Variants. In Christian Gierds and Jan Sürmeli, editors, Proceedings of the 2nd Central-European Workshop on Services and their Composition, ZEUS 2010, Berlin, Germany, February 25–26, 2010, volume 563 of CEUR Workshop Proceedings, pages 41–48. CEUR-WS.org, 2010.
28. Mathias Weske. Business Process Management – Concepts, Languages, Architectures. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
29. Zhiqiang Yan, Remco Dijkman, and Paul Grefen. Business process model repositories - framework and survey. cms.ieis.tue.nl/Beta/Files/WorkingPapers/Beta_wp292.pdf, 2009.
30. Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA ’93: Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pages 311–321, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathematics.
31. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.