An Approximate Algorithm for Median Graph Computation ... - CiteSeerX

An Approximate Algorithm for Median Graph Computation using Graph Embedding

Miquel Ferrer and Ernest Valveny
Computer Vision Center, Dep. Ciències de la Computació, Univ. Autònoma de Barcelona, Spain
{mferrer,ernest}@cvc.uab.cat

Francesc Serratosa
Dep. Informàtica i Matemàtiques, Universitat Rovira i Virgili, Spain
[email protected]

Kaspar Riesen and Horst Bunke
Dep. of Computer Science and Applied Mathematics, University of Bern, Switzerland
{riesen,bunke}@iam.unibe.ch

Abstract Graphs are powerful data structures that have many attractive properties for object representation. However, some basic operations are difficult to define and implement, for instance, how to obtain a representative of a set of graphs. The median graph has been defined for that purpose, but existing algorithms are computationally complex and have very limited applicability. In this paper we propose a new approach for the computation of the median graph based on graph embedding in vector spaces. Experiments on a real database containing large graphs show that we succeed in computing good approximations of the median graph. We have also applied the median graph to some basic classification tasks, achieving reasonably good results.

1. Introduction Graphs are a powerful tool to represent structured objects compared with other alternatives such as feature vectors. For instance, a recent study comparing the representational power of feature vectors and graphs in the context of web content mining is presented in [9]. Its experimental results show better accuracies for the graph-based approaches than for the comparable vector-based methods. Nevertheless, some basic operations, such as computing the sum or the mean, become very difficult or even impossible in the graph domain. The mean of a set of graphs has been defined using the concept of the median graph. Given a set of graphs,

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

the median graph [6] is defined as the graph that has the minimum sum of distances (SOD) to all graphs in the set. It can be seen as the representative of the set. Thus it has a large number of potential applications, including many classical algorithms for learning, clustering and classification usually used in the vector domain. However, its computation is exponential both in the number of input graphs and in their size [3]. A number of algorithms have been reported in the past [7, 6, 5, 4], but, in general, they either suffer from high complexity or are restricted to very limited applications. In this paper we propose a new approximate method based on graph embedding in vector spaces. Graph embedding has recently been used as a way to map graphs into a vector space [8] using the graph edit distance [1]. In this way we can combine advantages from both domains: we keep the representational power of graphs while being able to operate in a vector space. The median of the set of vectors obtained with this mapping can be easily computed in the vector space. Then, using the weighted mean of a pair of graphs [2], we have designed a triangulation procedure that allows us to go from the vector domain back to the graph domain in order to finally obtain an approximation of the median graph. With this new approach we are able to bring the median graph to the world of real applications in pattern recognition and machine learning. In our experiments we have applied the median graph to real classification problems in the context of web content mining. The underlying graphs have no constraints regarding the number of nodes and edges. Our procedure potentially allows us to transfer any machine learning algorithm that

uses a median from the vector domain to the graph domain. The rest of this paper is organized as follows. In the next section we introduce in detail the concept of the median graph. In Section 3 the proposed method for the median computation is described. Section 4 reports a number of experiments and presents the results achieved with our method. Finally, in Section 5 we draw some conclusions.

2

Generalized Median Graph

Let $U$ be the set of graphs that can be constructed using labels from a set of labels $L$. Given $S = \{g_1, g_2, \ldots, g_n\} \subseteq U$, the generalized median graph $\bar{g}$ of $S$ is defined as:

$$\bar{g} = \arg\min_{g \in U} \sum_{g_i \in S} d(g, g_i) \qquad (1)$$

That is, the generalized median graph $\bar{g}$ of $S$ is a graph $g \in U$ that minimizes the sum of distances (SOD) to all the graphs in $S$. Notice that $\bar{g}$ is usually not a member of $S$, and in general more than one generalized median graph may exist for a given set $S$. As shown in equation (1), some distance measure $d(g, g_i)$ between the candidate median $g$ and every graph $g_i \in S$ must be computed. However, since computing the graph edit distance is itself known to be NP-hard, the exact computation of the generalized median graph takes time exponential both in the number of graphs in $S$ and in their size [3]. As a consequence, in real applications we are forced to use suboptimal methods [6, 5, 4] in order to obtain approximate solutions in reasonable time. Another alternative is the set median graph. While the search space for the generalized median graph is $U$, that is, the whole universe of graphs, the search space for the set median graph is simply $S$, that is, the given set of graphs. The set median graph is usually not the best representative of a set of graphs, but it is often a good starting point towards the generalized median graph.

3

New Approximate Algorithm Based on Graph Embedding in Vector Spaces

The computation of the generalized median graph is a rather complex task. In this section we present a novel approach for the computation of the median graph that is faster and more accurate than previous approximate algorithms. It is based on graph embedding in a vector space and it consists of three main steps: the graph embedding in vector spaces, the median vector computation and the conversion of this median vector to a graph. This resulting graph is taken as the median graph of $S$. These three steps are depicted in Figure 1, and they will be further explained in the next sections.

Figure 1. Procedure overview. Step 1: graph embedding in vector space. Step 2: median vector computation. Step 3: from median vector to median graph.

3.1

Graph Embedding in Vector Spaces

The embedding procedure we use in this paper follows the procedure proposed in [8]. That is, we compute the graph edit distance between every pair of graphs in the set $S$. These distances are arranged in a distance matrix. Each row/column of the matrix can be seen as an n-dimensional vector. Since each row/column of the distance matrix is assigned to one graph, such an n-dimensional vector is the vectorial representation of the corresponding graph. Figure 2 illustrates this procedure.

$$DM = \begin{pmatrix} 0 & d_{1,2} & d_{1,3} & \cdots & d_{1,n} \\ d_{2,1} & 0 & d_{2,3} & \cdots & d_{2,n} \\ d_{3,1} & d_{3,2} & 0 & \cdots & d_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{n,1} & d_{n,2} & d_{n,3} & \cdots & 0 \end{pmatrix}$$

Figure 2. Step 1. Graph Embedding. The set of graphs S (graph domain) is mapped to an n-dimensional vector space (vector domain) through the distance matrix DM.
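As an illustration only, the embedding step can be sketched in Python as follows. The names `embed` and `toy_distance` are hypothetical, and `toy_distance` is merely a stand-in that counts unshared edges; it is not the graph edit distance used in [8], which would be plugged in as the `distance` argument.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]
Graph = FrozenSet[Edge]  # toy graph model: a set of labelled edges (assumption)


def toy_distance(g1: Graph, g2: Graph) -> float:
    """Stand-in for the graph edit distance: number of edges not shared."""
    return float(len(g1 ^ g2))


def embed(graphs: Sequence[Graph],
          distance: Callable[[Graph, Graph], float]) -> List[List[float]]:
    """Build the n x n distance matrix; row i is the vector assigned to graphs[i]."""
    return [[distance(gi, gj) for gj in graphs] for gi in graphs]


g1 = frozenset({("a", "b"), ("b", "c")})
g2 = frozenset({("a", "b"), ("c", "d")})
g3 = frozenset({("a", "b"), ("b", "c"), ("c", "d")})

DM = embed([g1, g2, g3], toy_distance)
for row in DM:
    print(row)
```

Each printed row is the n-dimensional vector of one graph, so a set of n graphs yields n points in an n-dimensional space.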

3.2

Computation of the Median Vector

Once all the graphs have been embedded in the vector space, the median vector is computed. To this end we use the concept of Euclidean Median.

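Given a set $X = \{x_1, x_2, \ldots, x_m\}$ of $m$ points with $x_i \in \mathbb{R}^n$, the Euclidean median is defined as

$$\text{Euclidean median} = \arg\min_{y \in \mathbb{R}^n} \sum_{i=1}^{m} \lVert x_i - y \rVert$$

where $\lVert x_i - y \rVert$ denotes the Euclidean distance between the points $x_i, y \in \mathbb{R}^n$. The Euclidean median is a point $y \in \mathbb{R}^n$ that minimizes the sum of the Euclidean distances to all the points in $X$. To obtain the Euclidean median we have used Weiszfeld's algorithm [10].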
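A minimal sketch of a Weiszfeld-style iteration is given below for illustration. The function name, the starting point (the centroid), the iteration limit and the tolerance are assumptions, and the handling of an iterate that coincides with a data point (that point is simply skipped) is a simplification rather than the treatment used in our implementation.

```python
import math
from typing import List, Sequence


def weiszfeld(points: Sequence[Sequence[float]],
              iterations: int = 200, eps: float = 1e-12) -> List[float]:
    """Approximate the Euclidean median of `points` by iterative re-weighting."""
    dim = len(points[0])
    # Start from the centroid of the points.
    y = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iterations):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            d = math.dist(p, y)
            if d < eps:
                continue  # iterate sits on a data point: skip it (simplification)
            w = 1.0 / d  # each point is weighted by its inverse distance to y
            den += w
            for k in range(dim):
                num[k] += w * p[k]
        if den == 0.0:
            break  # all points coincide with y
        y_new = [c / den for c in num]
        moved = math.dist(y_new, y)
        y = y_new
        if moved < eps:
            break  # converged
    return y


square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(weiszfeld(square))
```

For the four corners of a square the iteration stays at the centre, which is the Euclidean median of that configuration.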

3.3

Back to Graph Domain

To transform the median vector into a graph we will use a triangulation procedure based on the weighted mean of a pair of graphs [2]. With the weighted mean of a pair of graphs, an intermediate graph along the edit path between two graphs can be obtained. The resulting graph after this triangulation procedure will be considered as the approximated median graph.
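The geometric part of this triangulation can be sketched as follows. The sketch only computes the two interpolation fractions that would be handed to the weighted mean procedure of [2], which is not implemented here. It assumes that the auxiliary point on the line through v1 and v2 is the point where the line through v3 and the median of the three points meets that line, which is one reading of the projection step; the function name and the barycentric formulation are assumptions.

```python
from typing import List, Sequence, Tuple


def _sub(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def triangulation_ratios(v1: Sequence[float], v2: Sequence[float],
                         v3: Sequence[float], vm: Sequence[float]
                         ) -> Tuple[float, float]:
    """Return (t, s) with v_i = v1 + t*(v2 - v1) and vm = v_i + s*(v3 - v_i).

    vm is written as a*v1 + b*v2 + c*v3 with a + b + c = 1; the weights come
    from a 2x2 linear system in the plane spanned by the three points.
    """
    e1, e2, w = _sub(v2, v1), _sub(v3, v1), _sub(vm, v1)
    g11, g12, g22 = _dot(e1, e1), _dot(e1, e2), _dot(e2, e2)
    r1, r2 = _dot(w, e1), _dot(w, e2)
    det = g11 * g22 - g12 * g12
    if det == 0.0:
        raise ValueError("v1, v2 and v3 are collinear")
    b = (r1 * g22 - r2 * g12) / det  # weight of v2
    c = (g11 * r2 - g12 * r1) / det  # weight of v3
    a = 1.0 - b - c                  # weight of v1
    t = b / (a + b) if (a + b) != 0.0 else 0.0  # position of v_i between v1 and v2
    return t, c                      # c is the position of vm between v_i and v3


t, s = triangulation_ratios((0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0))
print(t, s)
```

Under these assumptions, t would be the fraction used for the weighted mean of g1 and g2, and s the fraction used for the weighted mean of the resulting graph and g3.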

Figure 3. Triangulation procedure (panels (a) to (f)).

This triangulation procedure, illustrated in Figure 3, works as follows. Given the n-dimensional points representing every graph in $S$ (the white dots in Figure 3(a)), and the Euclidean median vector $v_m$ (the grey dot in Figure 3(a)), we first select the three points closest to the Euclidean median ($v_1$ to $v_3$ in Figure 3(a)). Notice that we know the corresponding graph of each of these points (in Figure 3(a) we have indicated this fact by labelling them with the pair $v_j, g_j$ with $j = 1 \ldots 3$). Then, we compute the median vector $v'_m$ of these three points (represented as a black dot in Figure 3(a)). Notice that $v'_m$ is in the plane formed by $v_1$, $v_2$ and $v_3$. With $v_1$ to $v_3$ and $v'_m$ at hand (Figure 3(b)), we arbitrarily choose two out of these three points (without loss of generality we can assume that we select $v_1$ and $v_2$) and we project the remaining point ($v_3$) onto the line joining $v_1$ and $v_2$. In this way, we obtain a point $v_i$ in between $v_1$ and $v_2$ (Figure 3(c)). With this point at hand, we can compute the percentage of the distance between $v_1$ and $v_2$ at which $v_i$ is located (Figure 3(d)). As we know the graphs corresponding to the points $v_1$ and $v_2$, we can obtain the graph $g_i$ corresponding to $v_i$ by applying the weighted mean procedure (Figure 3(e)). Once $g_i$ is known, we can obtain the percentage of the distance between $v_i$ and $v_3$ at which $v'_m$ is located, and obtain $g'_m$ by applying the weighted mean procedure again (Figure 3(f)). Finally, $g'_m$ is chosen as the approximation for the generalized median of the set $S$.

4

Experimental Setup

In this section we provide the results of an experimental evaluation of the proposed algorithm. To this end we have used a database containing 2,340 graph-based representations of web pages belonging to six different classes [9]. The database was split into two disjoint sets: the training set with 180 elements (30 elements per class) and the test set (the remaining 2,160 elements). In a first experiment we computed several medians in order to evaluate whether we obtained good approximations. In a second experiment, we used these medians to perform classification tasks.

4.1

Assessment of the Median Quality

To evaluate the quality of the obtained median graphs, we compare their SOD with the SOD of the set median graph. Since the set median graph is the graph belonging to the training set with minimum SOD, it is a good reference to evaluate the quality of the generalized median graph. We do not use other approximations of the median graph as a reference because the existing algorithms are not able to compute the median graph on graphs and datasets as large as the ones we are dealing with. For this experiment we have randomly chosen an increasing number of graphs from the training set, and we have computed the median graph of each of these sets. Results for the mean value of the SOD over all the classes are shown in Figure 4. The results show that in all cases the SOD of the obtained medians is lower than the set median SOD. With these results we can conclude that our method finds good approximations of the median graph.
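For illustration, the set median and its SOD can be read directly off a distance matrix, as in the sketch below. The function names and the small example matrix are assumptions made for this example; they are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def set_median(dm: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, SOD) of the graph in the set with minimum sum of distances."""
    sods = [sum(row) for row in dm]  # SOD of graph i is the sum of row i
    best = min(range(len(dm)), key=sods.__getitem__)
    return best, sods[best]


def improves_on_set_median(candidate_distances: Sequence[float],
                           dm: Sequence[Sequence[float]]) -> bool:
    """True if a candidate median has a lower SOD than the set median.

    candidate_distances[i] is the distance from the candidate to graph i.
    """
    return sum(candidate_distances) < set_median(dm)[1]


# Hypothetical pairwise distances between three graphs.
example_dm: List[List[float]] = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
print(set_median(example_dm))
print(improves_on_set_median([0.5, 0.5, 0.5], example_dm))
```

A computed median is judged a good approximation in this sense whenever the second check holds.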

4.2

Classification

In this section we use the previously computed medians to perform classification experiments. We classified each element in the test set according to three different schemes. The first one is a 1NN classifier using the whole training set. The other two use the generalized median and the set median graphs, respectively: each element in the test set is assigned to the class of the most similar median. Figure 5 shows the mean classification rates over the 6 classes. The classification rate achieved by our median approach is very close to that of the 1NN classifier. The minimum difference between the 1NN classifier and the median is only about 2%. The results also show that with our medians we obtain better results than with the set median. This means that the computed medians are better representatives than the set median graph. Although the 1NN classifier performs better, it is important to notice the difference between the number of comparisons needed by the two classifiers. While the 1NN classifier needs 388,800 comparisons (2,160 test graphs against 180 training graphs), the use of the median graph reduces this quantity to 12,960 comparisons (2,160 test graphs against 6 medians).

Figure 4. SOD evolution. Horizontal axis: number of graphs in S (5 to 30). Vertical axis: SOD. Series: set median SOD and generalized median SOD.

Figure 5. Classification rate. Horizontal axis: number of graphs in S (5 to 30). Vertical axis: rate [%]. Series: kNN, median and set median.

These results show not only that the median can be used in real classification tasks, but also that it is a good representative of a given set and can therefore potentially be used in any application where a representative of a set is needed.
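The median-based classification rule and the comparison counts reported above can be reproduced with a short sketch. The function names are hypothetical, and the numeric example uses an absolute difference as a stand-in for the graph edit distance.

```python
from typing import Callable, Dict, TypeVar

G = TypeVar("G")


def classify_by_median(x: G, medians: Dict[str, G],
                       distance: Callable[[G, G], float]) -> str:
    """Assign x to the class whose median is closest under `distance`."""
    return min(medians, key=lambda label: distance(x, medians[label]))


def comparisons(n_test: int, n_prototypes: int) -> int:
    """Distance computations needed to classify every test element."""
    return n_test * n_prototypes


one_nn_cost = comparisons(2160, 180)  # every test graph against the whole training set
median_cost = comparisons(2160, 6)    # every test graph against one median per class
print(one_nn_cost, median_cost)

# Toy usage with numbers in place of graphs (stand-in distance).
label = classify_by_median(3.0, {"A": 0.0, "B": 10.0}, lambda a, b: abs(a - b))
print(label)
```

With one median per class, the cost depends on the number of classes rather than on the size of the training set, which is where the thirtyfold reduction comes from.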

5

Discussion and Conclusions

In the present paper we have proposed a novel technique to obtain approximate solutions for the median graph. This new approach is based on the embedding of graphs into vector spaces. First, the graphs are turned into points of n-dimensional vector spaces using the graph edit distance paradigm. Then, the crucial step of obtaining the median of the set is carried out in the vector space, not in the graph domain, which dramatically simplifies this operation. Finally, using the graph edit distance again, we can transform the obtained median vector into a graph by means of the weighted mean of a pair of graphs and a triangulation procedure. This embedding approach allows us to get the main advantages of both the vector and graph representations, computing the more complex parts in real vector spaces while keeping the representational power of graphs. With this new procedure we have applied the median graph computation to a real database containing a large number of large graphs. Then we have extended its applicability to real classification problems. Under these demanding conditions, the generalized median could not be computed before, due to the large computational resources needed by the existing methods. Moreover, these results show that with this new procedure the median graph can potentially be applied to any application where a representative of a set is needed.

Acknowledgements This work has been partially supported by the CICYT project TIN2006-15694-C02-02, the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018) and the Swiss National Science Foundation Project 200021-113198/1.

References
[1] H. Bunke and G. Allerman. Inexact graph matching for structural pattern recognition. Pattern Recognition Letters, 1(4):245–253, 1983.
[2] H. Bunke and S. Günter. Weighted mean of a pair of graphs. Computing, 67(3):209–224, 2001.
[3] H. Bunke, A. Münger, and X. Jiang. Combinatorial search versus genetic algorithms: A case study based on the generalized median graph problem. Pattern Recognition Letters, 20(11-13):1271–1277, 1999.
[4] M. Ferrer, F. Serratosa, and A. Sanfeliu. Synthesis of median spectral graph. IbPRIA 2005, volume 3523 of LNCS, pages 139–146. Springer, 2005.
[5] A. Hlaoui and S. Wang. Median graph computation for graph clustering. Soft Comput., 10(1):47–53, 2006.
[6] X. Jiang, A. Münger, and H. Bunke. On median graphs: Properties, algorithms, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 23(10):1144–1151, 2001.
[7] A. Münger. Synthesis of prototype graphs from sample graphs. Diploma Thesis, University of Bern (in German), 1998.
[8] K. Riesen, M. Neuhaus, and H. Bunke. Graph embedding in vector spaces by means of prototype selection. GbRPR 2007 Proceedings, volume 4538 of LNCS, pages 383–393. Springer, 2007.
[9] A. Schenker, H. Bunke, M. Last, and A. Kandel. Graph-Theoretic Techniques for Web Content Mining. World Scientific Publishing, 2005.
[10] E. Weiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Math. Journal, 43:355–386, 1937.