Semantic Graph Kernels for Automated Reasoning

Evgeni Tsivtsivadze, Josef Urban, Herman Geuvers, Tom Heskes

Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands

Abstract

Learning reasoning techniques from previous knowledge is a largely underdeveloped area of automated reasoning. As large bodies of formal knowledge are becoming available to automated reasoners, state-of-the-art machine learning methods can provide powerful heuristics for problem-specific detection of relevant knowledge contained in the libraries. In this paper we develop a semantic graph kernel suitable for learning in structured mathematical domains. Our kernel incorporates contextual information about the features and, unlike “random walk”-based graph kernels, it is also applicable to sparse graphs. We evaluate the proposed semantic graph kernel on a subset of the large formal Mizar mathematical library. Our empirical evaluation demonstrates that graph kernels in general are particularly suitable for the automated reasoning domain and that in many cases our semantic graph kernel leads to an improvement in performance compared to linear, Gaussian, latent semantic, and geometric graph kernels.

1 Background and Motivation: Automated Reasoning and Machine Learning

In the last fifteen years, the body of formally expressed mathematics has grown substantially. Interactive Theorem Provers (ITPs) like Coq, Isabelle, Mizar, and HOL [27] have been used for advanced formal theory developments and verification of non-trivial theorems, like the Four Color Theorem and the Jordan Curve Theorem, and also for advanced verification of software and hardware models. The large formal Mizar mathematical library (MML, http://www.mizar.org) today contains nearly 1100 formal mathematical articles, covering a substantial part of standard undergraduate mathematical knowledge. The library has about 50000 theorems, proved with about 2.5 million lines of mathematical proofs. Such proofs often contain nontrivial mathematical


ideas, sometimes refined over decades and centuries of development of mathematics and abstract formal thinking. Having this kind of “knowledge base of abstract human thinking” in a completely machine-processable and machine-understandable form presents very interesting opportunities for the application and development of novel artificial intelligence methods that make use of the knowledge in various ways. A concrete and pressing task, for which novel machine learning techniques are needed, and on which we focus in this paper, is the selection of relevant knowledge from large formal knowledge bases when one is presented with a new conjecture that needs to be proved. Providing a good solution to this problem is important both for mathematicians and for the existing tools for automated theorem proving (ATP), which typically cannot be used successfully with tens or hundreds of thousands of axioms. It has recently been experimentally demonstrated with large theory benchmarks like the MPTP Challenge (http://www.tptp.org/MPTPChallenge) and the LTB (Large Theory Batch) division of the CASC competition (http://www.tptp.org/CASC) that smart selection of relevant knowledge can significantly boost the performance of existing ATP techniques in large domains [24]. There are a number of different techniques and approaches used for learning (extracting, generalizing) knowledge from large bodies of previous examples. In this paper we focus on the widely applied kernel-based and latent semantics approaches, their suitable combination, and their evaluation on the formal mathematical domain. The available information in formal mathematical domains is frequently represented in structured form, such as a formula graph or a proof graph. (Generally, a particular formula can have a number of mathematical structures (models) associated with it, which are typically (hyper)graphs; committing the treatment of formulas to methods relying, for example, on a tree representation could therefore turn out to be limiting.) Kernel-based approaches [19] for learning in structured domains appear to be ideally suited for solving machine learning problems in the domain of computer-assisted reasoning. They sidestep the need for hand-crafted features and can directly deal with the structures, in particular formula and proof graphs, encountered in this domain.


The implicit, non-vectorial representation induced by kernel approaches is potentially much richer and can be further enhanced by designing appropriate kernels that incorporate prior knowledge about the domain (e.g. particular features of logic formulas). Compared to other application domains for structured learning, such as natural language processing and bioinformatics, formal mathematics has the advantage of being indisputable: there is a precise notion of semantics and truth, based on a precise notion of mathematical proof. Informally, the task we aim to solve in the domain of computer-assisted reasoning is that of selecting, from a large knowledge base Z of thousands of theorems, those that are most relevant for proving a new formula x. We can turn this into a learning problem as follows. For each theorem t ∈ Z we construct a dataset D_t consisting of m formulas x_1, . . . , x_m as inputs and labels y_1, . . . , y_m ∈ Y = {0, 1} as outputs. The label y_i corresponding to a formula x_i is 1 if the theorem t is used (possibly also recursively) to prove formula x_i and zero otherwise. Based on the dataset D_t, an algorithm can learn a classifier C_t(·) which, given a formula x_κ as input, can predict whether the theorem t is relevant for proving x_κ. Typically, classifiers give a graded output. Having learned classifiers for all theorems t, the classifier predictions C_t(x_κ) can then be ranked: the theorems that are predicted to be most relevant will have the highest output C_t(x_κ).
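To make this setup concrete, the following sketch (Python with numpy and scikit-learn, neither of which is mentioned in the paper; the toy feature vectors and proof-dependency data are purely hypothetical) builds one binary dataset D_t per theorem t and ranks theorems for a new conjecture by the graded classifier outputs:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature vectors of four formulas x_1, ..., x_4 (rows).
X = np.array([[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]], dtype=float)

# For each theorem t, y_i = 1 iff t is used to prove formula x_i (made-up data).
usage = {"t1": np.array([1, 0, 1, 0]),
         "t2": np.array([0, 1, 1, 1])}

# Learn one classifier C_t per theorem t from its dataset D_t.
classifiers = {t: LogisticRegression().fit(X, y) for t, y in usage.items()}

# For a new conjecture x_kappa, rank theorems by the graded output C_t(x_kappa).
x_kappa = np.array([[1, 1, 0]], dtype=float)
scores = {t: clf.predict_proba(x_kappa)[0, 1] for t, clf in classifiers.items()}
print(sorted(scores, key=scores.get, reverse=True))  # most relevant theorems first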

1.1 Notations The set of all n × m matrices with real coefficients is denoted by M_{n×m}(R). Given a matrix M ∈ M_{n×m}(R), we denote the element in the i-th row and j-th column by [M]_{i,j}. As a shorthand notation for the set {1, . . . , n} we use [n]. For two sets R = {i_1, . . . , i_r} ⊆ [n] and S = {j_1, . . . , j_s} ⊆ [m] of indices, we use M_{R,S} to denote the matrix that contains only the rows and columns of M that are indexed by R and S, respectively. Finally, we use y_i to denote the i-th coordinate of a vector y ∈ R^n.

2 Methods and Technical Solutions: Graph Kernels for the Mathematical Domain

Recently, kernels for structured domains have received significant attention in machine learning (see e.g. [9]). Many successful applications have been reported, for example, predicting the toxicity of chemical molecules [26], protein function prediction [3], etc. The domain of formal mathematics provides its own challenges, such as the abundance of structured and symbolic information, the existence of many related tasks, and the need to deal with abstractions. All of these should be taken into account when designing and/or using an appropriate kernel function.

2.1 Geometric Graph Kernels Graph kernels are usually applied in situations where a graph-based structure/annotation is naturally present. For a thorough overview of graph kernels and their efficient computation we refer to [26]. Here we are concerned with geometric graph kernels, the kernel functions most frequently applied for computing similarity between graphs [10]. In the following subsection we formulate a semantic graph kernel for the automated reasoning domain.

Let us define L = {l_r}, r ∈ N, to be the index set of all possible labels that could occur in the graph. Also, let G = (V, E, h) be a graph consisting of the ordered set of vertices V = {v_1, v_2, . . . , v_n}, the set of directed edges E ⊆ V × V, and a function h : V → L that assigns a label to each vertex of the graph. A vertex v_i is a neighbour of another vertex v_j if they are connected by an edge, namely (v_i, v_j) ∈ E. A walk of length n on the graph G is a sequence of indices i_0, i_1, . . . , i_n such that (v_{i_{r−1}}, v_{i_r}) ∈ E for all 1 ≤ r ≤ n. We suppose that the function h is represented as a label allocation matrix L ∈ M_{|L|×|V|}(R) such that [L]_{i,j} = 1 if the label of v_j is l_i and 0 otherwise. The adjacency matrix A ∈ M_{|V|×|V|}(R), having its rows and columns indexed by V, where

$[A]_{i,j} = \begin{cases} 1, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise,} \end{cases}$

corresponds to the edge set of G. In our definitions we do not allow self loops, that is, the diagonal entries of A are always zero. We will consider two different kernel functions to compute the similarity between graphs G and G′. The first one, called the direct product kernel [10], is constructed as follows. The vertex set of the product graph G_×, which takes into account common walks between the vertices of G and G′, is V_× ⊆ V × V′: the graph G_× has a vertex for each pair of vertices of G and G′ that carry the same label. Furthermore, there is an edge between two vertices in G_× iff there are edges between the corresponding vertices in both graphs G and G′. Let us denote the adjacency matrix of the graph G_× by A_×. The direct product kernel [10] then reads

(2.1)   $k_\times(G, G') = \sum_{i,j=1}^{|V_\times|} \Big[ \sum_{n=0}^{\infty} w_n A_\times^n \Big]_{i,j},$

where w_n ∈ R, w_n > 0, is a typically monotonically decreasing sequence of weights. This kernel can be computed efficiently using exponential or geometric series, if the limit in (2.1) exists.
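A minimal sketch of the direct product construction, assuming each graph is given as an adjacency matrix plus a list of vertex labels and using geometric weights w_n = w^n (Python/numpy; the toy graphs are hypothetical and the code is not from the paper):

import numpy as np

def direct_product_kernel(A1, labels1, A2, labels2, w=0.1):
    # Vertices of G_x are pairs of equally labelled vertices of G and G'.
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    Ax = np.zeros((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            Ax[a, b] = A1[i, k] * A2[j, l]      # edge present in both graphs
    # With w_n = w^n the series sum_n (w * Ax)^n equals (I - w*Ax)^{-1},
    # provided w is smaller than the reciprocal of the largest eigenvalue of Ax.
    S = np.linalg.inv(np.eye(len(pairs)) - w * Ax)
    return S.sum()                               # sum over all entries, eq. (2.1)

# Hypothetical labelled graphs: path f -> g -> f and edge f -> g.
A1 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
A2 = np.array([[0, 1], [0, 0]], dtype=float)
print(direct_product_kernel(A1, ["f", "g", "f"], A2, ["f", "g"]))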


As a second example we consider a kernel which measures similarity based on the start and end vertices of the random walk. We use the fact that [A^n]_{i,j} is the number of walks of length n from vertex v_i to vertex v_j, where A^n denotes the n-th power of the adjacency matrix of the graph G. Moreover, if the labels of the graph vertices are taken into account, [L A^n L^T]_{s,t} corresponds to the number of walks of length n between vertices labelled l_s and l_t. We denote by ⟨M, M′⟩_F the Frobenius product of matrices M and M′, that is, $\langle M, M' \rangle_F = \sum_{i,j} [M]_{i,j} [M']_{i,j}$. Further, let γ ∈ M_{n×n}(R) be a positive semidefinite matrix containing coefficients penalizing long walks. The kernel k_n between the graphs G and G′ can be defined as follows:

(2.2)   $k_n(G, G') = \sum_{i,j=0}^{n} [\gamma]_{i,j} \big\langle L A^i L^T, L' A'^j L'^T \big\rangle_F = \sum_{s,t=1}^{|\mathcal{L}|} \sum_{i,j=0}^{n} [\gamma]_{i,j} [L A^i L^T]_{s,t} [L' A'^j L'^T]_{s,t}.$
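The kernel (2.2) can be computed directly from the label-pair walk-count matrices L A^i L^T. The sketch below (Python/numpy, not from the paper; it additionally assumes θ_i = θ^i for the weight matrix, which is one natural reading of the specialization discussed next) illustrates this:

import numpy as np

def label_matrix(labels, all_labels):
    # Label allocation matrix L: [L]_{i,j} = 1 iff the label of vertex v_j is l_i.
    L = np.zeros((len(all_labels), len(labels)))
    for j, lab in enumerate(labels):
        L[all_labels.index(lab), j] = 1.0
    return L

def walk_kernel(A1, labels1, A2, labels2, all_labels, gamma):
    # Eq. (2.2): Frobenius products of L A^i L^T and L' A'^j L'^T, weighted by gamma.
    n = gamma.shape[0] - 1
    L1, L2 = label_matrix(labels1, all_labels), label_matrix(labels2, all_labels)
    F1 = [L1 @ np.linalg.matrix_power(A1, i) @ L1.T for i in range(n + 1)]
    F2 = [L2 @ np.linalg.matrix_power(A2, j) @ L2.T for j in range(n + 1)]
    return sum(gamma[i, j] * np.sum(F1[i] * F2[j])
               for i in range(n + 1) for j in range(n + 1))

# Hypothetical example with [gamma]_{i,j} = theta^i * theta^j (cf. eq. (2.3) below).
theta, n = 0.5, 3
gamma = np.array([[theta**i * theta**j for j in range(n + 1)] for i in range(n + 1)])
A1 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
A2 = np.array([[0, 1], [0, 0]], dtype=float)
print(walk_kernel(A1, ["f", "g", "f"], A2, ["f", "g"], ["f", "g"], gamma))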

Several specializations of this kernel function lead to different feature spaces and consequently have very different interpretations. If, for example, we set [γ]_{i,j} = θ_i θ_j, where θ ∈ R_+ is a parameter, we obtain the kernel

(2.3)   $\hat{k}_n(G, G') = \Big\langle L \Big( \sum_{i=0}^{n} \theta_i A^i \Big) L^T, \; L' \Big( \sum_{j=0}^{n} \theta_j A'^j \Big) L'^T \Big\rangle_F,$

which corresponds to the inner product between the feature vectors of G and G′, where the features φ_{s,t}(G) and φ_{s,t}(G′) are the weighted counts of walks of length up to n from the vertices labelled l_s to the vertices labelled l_t in each of the graphs. On the other hand, if we set in (2.2) [γ]_{i,j} = θ_i when i = j and zero otherwise, we obtain the kernel corresponding to the inner product with features φ_{i,s,t}(G) and φ_{i,s,t}(G′) that can be interpreted as θ_i times the count of walks of length i from the vertices labelled l_s to the vertices labelled l_t in each of the graphs.

2.2 Semantic Graph Kernels With the semantic graph kernels defined below, we aim to take into account the co-occurrence information of the features (random walks) contained in the whole set of graphs used for training the algorithm. This is in contrast to the previously described geometric graph kernels, which measure similarity between pairs of graphs without taking any additional (contextual) information into account. Furthermore, the feature space constructed using a semantic graph kernel can contain walks between sparsely connected parts of the graph if such connections are present in the other training examples. This can make semantic graph kernels applicable to situations where geometric graph kernels might not lead to satisfactory results.

Our approach is related to latent semantic indexing (LSI) [6] formulated within the framework of kernel methods [5]. Let us consider the graph-label matrix H ∈ M_{|L|²×m}(R) having, for example, columns of the form $H_{\cdot,j} = \mathrm{vec}\big(L \big( \sum_{i=0}^{n} \theta_i A^i \big) L^T\big) = \phi(G_j)$, where G_j is the j-th training graph. By performing a singular value decomposition (SVD) of the matrix H we can project graphs onto the subspace spanned by the first p singular vectors to create a new, dimensionally reduced feature space. One can also vary the dimensionality of the feature space by making a particular choice of p. We have $H = U \Sigma V^T$ and assume that the columns of U are the singular vectors of the feature space in order of decreasing singular value. Then the projection operator onto the first p principal components is I_p U^T, where I_p is the identity matrix with only the first p nonzero diagonal elements. Now we can use the dimensionality reduced feature space to calculate the similarity between two graphs:

(2.4)   $k_{sg}(G, G') = (I_p U^T \phi(G))^T (I_p U^T \phi(G')) = \phi(G)^T U I_p U^T \phi(G').$
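A sketch of (2.4) in its explicit form, assuming the feature map φ(G) = vec(L(Σ_i θ_i A^i) L^T) with θ_i = θ^i and a small collection of training graphs (Python/numpy; the graphs and all numerical values are hypothetical):

import numpy as np

def label_matrix(labels, all_labels):
    L = np.zeros((len(all_labels), len(labels)))
    for j, lab in enumerate(labels):
        L[all_labels.index(lab), j] = 1.0
    return L

def phi(A, labels, all_labels, theta=0.5, n=3):
    # Feature vector phi(G) = vec(L (sum_i theta^i A^i) L^T).
    L = label_matrix(labels, all_labels)
    S = sum(theta**i * np.linalg.matrix_power(A, i) for i in range(n + 1))
    return (L @ S @ L.T).ravel()

# Hypothetical training graphs (adjacency matrix, vertex labels).
graphs = [(np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], float), ["f", "g", "f"]),
          (np.array([[0, 1], [0, 0]], float), ["f", "g"]),
          (np.array([[0, 0], [1, 0]], float), ["g", "f"])]
all_labels = ["f", "g"]

H = np.column_stack([phi(A, labs, all_labels) for A, labs in graphs])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
p = 2
Up = U[:, :p]                     # projecting with Up.T realizes the operator I_p U^T

# Semantic kernel between the first two training graphs, eq. (2.4).
g1, g2 = phi(*graphs[0], all_labels), phi(*graphs[1], all_labels)
print(float((Up.T @ g1) @ (Up.T @ g2)))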

This kernel function uses a particular mapping to identify highly correlated features in two graphs. Unlike the previously proposed graph kernels, which take into account only the features contained in the graphs G and G′, here information about the features (random walks) that are most important in the complete training dataset is implicitly present. For example, random walks that very often co-occur in the same graphs of the training set create a new, single dimension of the new feature space. Consider a kernel matrix of the form K = H^T H that can be computed using the kernel described in equation (2.3). The kernel matrix constructed with the semantic graph kernel (2.4) is $K_{sg} = V \Lambda_p V^T$, where V contains the eigenvectors of K and Λ_p is a diagonal matrix with only the first p nonzero eigenvalues. Thus, to compute the kernel matrix K_sg we do not need to construct H explicitly. The same approach can be used to construct a semantic kernel matrix corresponding to the graph kernel (2.1). To make a prediction we consider a column vector k = H^T φ(G′) that can be computed using the appropriate graph kernel function, for example (2.3), and calculate

$f(G') = \sum_{i=1}^{m} a_i \, k_{sg}(G_i, G') = a^T V I_p V^T k.$

When constructing the kernel matrix or making a prediction we avoid dealing with the feature vectors of the graphs explicitly.
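The same computation can be carried out implicitly, as just described: from a kernel matrix K = H^T H one obtains K_sg = V Λ_p V^T by eigendecomposition, and predictions a^T V I_p V^T k need only kernel evaluations. A minimal sketch (Python/numpy; the kernel matrix, kernel evaluations, and dual coefficients are hypothetical):

import numpy as np

def semantic_kernel_matrix(K, p):
    # K_sg = V Lambda_p V^T: keep only the p largest eigenvalues of K = H^T H.
    eigvals, eigvecs = np.linalg.eigh(K)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    V, lam = eigvecs[:, order], eigvals[order]
    lam[p:] = 0.0
    return V @ np.diag(lam) @ V.T, V              # returns K_sg and the sorted V

def predict(a, V, p, k):
    # f(G') = a^T V I_p V^T k, with V sorted by decreasing eigenvalue,
    # so V I_p V^T equals V[:, :p] @ V[:, :p].T.
    Vp = V[:, :p]
    return float(a @ (Vp @ (Vp.T @ k)))

# Hypothetical geometric kernel matrix of three training graphs and a new graph G'.
K = np.array([[4.0, 2.0, 0.5], [2.0, 3.0, 1.0], [0.5, 1.0, 2.0]])
k = np.array([1.2, 0.7, 0.3])                     # k_i = k(G_i, G')
a = np.array([0.3, -0.1, 0.4])                    # learned dual coefficients
Ksg, V = semantic_kernel_matrix(K, p=2)
print(predict(a, V, p=2, k=k))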


However, in some circumstances, examining the new features created by our kernel can provide additional insights about the problem domain. For this purpose we employ a simple stochastic search algorithm that aims to reconstruct the input space representation of the projected feature vector of the graph. We note that this “feature discovery” task is related to the graph pre-image finding problem described in [2]. However, our task is simpler because the set of vertices is fixed and we only need to discover the edges of the graph corresponding to the new features created by our kernel.

2.3 Kernel-Based Learning Algorithm We use the kernel functions proposed in Section 2 together with a regularized learning algorithm that selects a hypothesis from the reproducing kernel Hilbert space H determined by the input space X and the positive definite kernel function k : X × X → R (for details see [18]). We aim to minimize the following objective function

(2.5)   $\min_{f \in \mathcal{H}} J(f) = c(f, D) + \lambda \|f\|_{\mathcal{H}}^2,$

where c(·, ·) is the loss measuring the error of the prediction function f on the training set D, ‖·‖_H denotes the norm in H, and λ ∈ R_+ is a regularization parameter controlling the tradeoff between the error on the training set and the complexity of the hypothesis. Note that by specializing the loss in the above formulation we can obtain support vector machines [25] (by choosing $c(f, D) = \sum_{i=1}^{m} \max(1 - y_i f(x_i), 0)$) or regularized least-squares (RLS) [16] (by choosing $c(f, D) = \sum_{i=1}^{m} (y_i - f(x_i))^2$). The RLS algorithm with slight modifications (e.g. including a bias term) is also known as least-squares support vector machines [22], proximal vector machines [8], and kernel ridge regression [17], and is closely related to many other methods. It has been shown that the RLS algorithm has classification performance similar to regular SVMs (see e.g. [11, 28]). Because of its simple closed form solution, yet competitive performance, the RLS algorithm is our choice for conducting experiments.
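For the RLS choice of loss, the minimizer of (2.5) has a simple closed form in the dual variables: f(x) = Σ_i a_i k(x_i, x) with a = (K + λI)^{-1} y, where K is the kernel matrix on the training set. A minimal sketch (Python/numpy; the kernel matrix and labels are hypothetical, and the exact scaling of λ is a convention assumed here):

import numpy as np

def rls_fit(K, y, lam):
    # Dual solution of regularized least-squares: a = (K + lambda * I)^{-1} y.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def rls_predict(a, k_new):
    # k_new[i] = k(x_i, x') for the training examples x_i and a new input x'.
    return float(a @ k_new)

# Hypothetical kernel matrix (e.g. the semantic graph kernel of Section 2.2) and labels.
K = np.array([[4.0, 2.0, 0.5], [2.0, 3.0, 1.0], [0.5, 1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])             # theorem t used (1) / not used (0)
a = rls_fit(K, y, lam=0.1)
print(rls_predict(a, np.array([1.2, 0.7, 0.3])))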

3 Application to Automated Reasoning

In the previous sections we have described and proposed kernels for structured representations that could be useful when learning from mathematical data. In this section, we give a concrete example of an application of the presented kernels to the task of automated reasoning and demonstrate a notable improvement in performance when taking into account the graph-based structure of the formulas.

Datasets We test the performance of our method on two subsets of the formal Mizar mathematical library. The first subset contains ten datasets, each consisting of 530 examples. The second subset contains ten datasets, each consisting of 347 examples. A single dataset corresponds to a binary classification task where it is necessary to predict whether a particular formula is useful in proving some theorem (see Section 1). We note that several such datasets were recently used to evaluate the performance of ATP systems in the domain of automated reasoning, and it has been demonstrated that heuristics for selecting relevant knowledge can significantly boost the performance of existing ATP techniques in large domains [24]. Below we briefly discuss existing feature representations that have been previously used in automated reasoning systems and their relation to several kernel functions.

Existing Feature Representations Depending on the structure of the particular mathematical domain, there are several possible kinds of features that can be more or less suitable for characterizing the formulas.

Symbols (functors, predicates, etc.). This is the most obvious and probably most commonly used characterization of mathematical formulas. The formula is simply characterized as a set or list or multiset of the mathematical symbols that it uses. Mathematical punctuation (brackets), logical connectives and quantifiers, and also variables are typically omitted from these characteristics. Thus, for example, the formula: forall n:Nat (n < n+1) will be characterized by predicate symbols