arXiv:1705.08843v1 [cs.CL] 24 May 2017

Parsing with CYK over Distributed Representations: "Classical" Syntactic Parsing in the Novel Era of Neural Networks

Fabio Massimo Zanzotto
University of Rome Tor Vergata
Viale del Politecnico, 1, 00133 Roma
[email protected]

Giordano Cristini
University of Rome Tor Vergata
Viale del Politecnico, 1, 00133 Roma
[email protected]

Abstract

Syntactic parsing is a key task in natural language processing which has been dominated by symbolic, grammar-based syntactic parsers. Neural networks, with their distributed representations, are challenging these methods. In this paper, we want to show that existing parsing algorithms can cross the border and be defined over distributed representations. We then define D-CYK: a version of the traditional CYK algorithm defined over distributed representations. Our D-CYK operates as the original CYK but uses matrix multiplications. These operations are compatible with traditional neural networks. Experiments show that D-CYK approximates the original CYK. By showing that CYK can be performed on distributed representations, our D-CYK opens the possibility of defining recurrent layers of CYK-informed neural networks.

1 Introduction

Symbolic, grammar-based syntactic parsers have dominated syntactic interpretations of natural language for decades. The Cocke-Younger-Kasami algorithm (CYK) [2, 12, 5], the Earley algorithm [3] and the shift-reduce parsing algorithm [6] have fostered many constituency-based and dependency-based parsers. Probabilistic constituency-based parsers have generally flourished on the CYK algorithm and on the Earley algorithm. These parsers exploit estimated probabilities of context-free rules to select the best interpretations for sentences. Fast and accurate dependency-based parsers have been designed on top of the shift-reduce algorithm, where decisions are taken with discriminative models such as Support Vector Machines or more complex neural networks [7, 1]. For these dependency parsers, grammars are hidden in discriminative decision functions but their control is completely symbolic.

Recently, distributed representations and neural networks have been challenging symbolic, grammar-based approaches to parsing. There are at least two lines of research in this area. In the first line of research [11], parsing is seen as a translation task: the source language is a natural language and the target language is its syntactic interpretation. Hence, parsers, learned with long short-term memories (LSTMs), are translators from sentences to syntactic trees. Training is done over millions of sentences which have been annotated with existing parsers. Hence, what the network learns is obscure. It seems that the LSTM learns two things: (1) the associations between fragments of sentences and fragments of trees, and (2) a way of recombining these fragments into the final interpretation. The grammar is lost in these associations hidden in the weights of the LSTM. In the second line of research [13, 10], both sentences and trees are represented as distributed vectors, and neural networks learn a way to map sentence vectors to tree vectors.

Grammar rules R:
  S → DE
  S → DS
  D → a
  E → b

Table P for the sequence aab, as triples (i, j, symbol):
  (0,1) a   (0,2) a   (0,3) b
  (1,1) D   (1,2) D   (1,3) E
  (2,2) S
  (3,1) S

[The figure also depicts the "distributed" Tetris-like representation of P.]

Figure 1: Running example: a simple grammar, the P matrix for the sequence aab and its "distributed" Tetris-like representation

However, the overall model does not have the ability to replicate distributed vectors for trees from distributed vectors of sentences: the resulting vectors are not accurate enough to have an impact on final tasks [13]. Moreover, in this case as well, the grammar is represented in an unspecified way in the weights of the multi-layer perceptron or the LSTM.

In this paper, we show that existing parsing algorithms can cross the border of distributed representations. We propose D-CYK, a version of the original CYK algorithm on distributed representations. To obtain this result, we leverage the strict relation between symbolic and distributed representations. Hence, we define a way to represent the table underlying the CYK algorithm in a distributed representation: this table, which is the basis of the CYK algorithm, is transformed into a matrix, and the operations that update the table are obtained with matrix multiplications. Experiments show that our D-CYK can successfully approximate the CYK algorithm on distributed representations.

2 Related Work and Notation

2.1 What should be represented in a "distributed" way in a classical CYK algorithm?

Although the CYK algorithm is a "classical", simple parsing algorithm for context-free grammars, we give here a short description to share a common notation and to state our objectives clearly. The CYK algorithm is a parsing algorithm based on dynamic programming which parses sentences by applying grammar rules R in Chomsky Normal Form (CNF) and stores partial computations in a 2-dimensional table P. CNF is a particular form for context-free grammars where rules are constrained to binary rules A → BC and unit rules A → α, where A, B and C are non-terminal symbols and α represents terminal symbols. The 2-dimensional table P represents partial computations when recognizing a sequence s = s_1 ... s_n. Cells P[0,i] in row 0 contain the elements s_i of the input sequence. A generic cell P[i,j] contains the non-terminal symbols A that can interpret the subsequence of length i starting at position j, that is, such that there is a derivation in R from A to s_j ... s_{j+i−1}. The key property of the CYK algorithm is that CNF rules allow the table P to be filled efficiently (see Algorithm 1). In fact, using unit rules, cells P[1,j] are filled in a single cycle and, using binary rules, cells P[i,j] are filled by looking at pairs of cells P[k,j] and P[i−k,j+k] for k ∈ {1, ..., i−1}. A sequence belongs to the language generated by a grammar R if the start symbol S is in the top cell P[n,1].

Algorithm 1 CYK(sequence s, grammar R) return table P
  all cells P[i,j] are sets of symbols
  for i ← 1 to n do
    for each unit production A → s_i do
      add A to P[1,i]
  for i ← 2 to n do
    for j ← 1 to n − i + 1 do
      for k ← 1 to i − 1 do
        for each production A → BC do
          if B ∈ P[k,j] and C ∈ P[i−k,j+k] then
            add A to P[i,j]
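To make Algorithm 1 concrete, here is a minimal Python sketch of the procedure (our own illustrative rendering, not code from this paper); the encoding of the grammar as two lists of tuples is an assumption of the sketch.

```python
# Minimal sketch of Algorithm 1: classical CYK over a CNF grammar given as
# unit rules (A -> a) and binary rules (A -> B C). Illustrative only.

def cyk(sequence, unit_rules, binary_rules):
    n = len(sequence)
    # P[i][j]: non-terminals covering the span of length i starting at position j
    P = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                 # fill row 1 with unit rules
        for A, a in unit_rules:
            if a == sequence[i - 1]:
                P[1][i].add(A)
    for i in range(2, n + 1):                 # span length
        for j in range(1, n - i + 2):         # span start
            for k in range(1, i):             # split point
                for A, B, C in binary_rules:
                    if B in P[k][j] and C in P[i - k][j + k]:
                        P[i][j].add(A)
    return P

# Running example of Figure 1: S -> DE | DS, D -> a, E -> b, input aab
P = cyk("aab", [("D", "a"), ("E", "b")], [("S", "D", "E"), ("S", "D", "S")])
print("S" in P[3][1])   # True: aab is in the language
```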


To illustrate our idea in the following sections, we introduce here a running example for the CYK algorithm in Figure 1: a simple grammar with a set R of grammar rules and the table P for the partial interpretation of the simple sequence aab. Row 0 of P contains the sequence aab. The cell P[1,1] contains D, as the cell P[0,1] contains a and the unary rule D → a is in the grammar. Moreover, for example, the cell P[3,1] contains S, as P[1,1] contains D, P[2,2] contains S and S → DS is in the grammar rules. The sequence aab is recognized by the grammar with rules R, as P[3,1] contains S.

Our study aims to demonstrate that it is possible to realize a version of the CYK algorithm using distributed representations, that is, matrices or tensors. Hence, we show: first, that it is possible to represent both the rules R and the 2-dimensional tables P as real-valued matrices R and P; and, second and more importantly, that applying the rules R to a table P can be done by multiplying the two matrices R and P.

2.2 Distributed Representations with Holographic Reduced Representations

Holographic reduced representations (HRR) [8] are distributed representations well-suited to our aim of representing the 2-dimensional tables P of the CYK algorithm and the operation of selecting the content of its cells P[i,j]. In fact, HRR can encode symbolic representations and decode them back. Moreover, by using holographic reduced representations [8] along with vector shuffling, it is possible to represent multiple symbolic structures in distributed representations. Hence, symbols as well as structures in tables P can be encoded and decoded. In the following, we introduce the operations we use and an iconic way to represent their properties; the iconic metaphor is based on Tetris-like pieces.

The starting point of a distributed representation is how to encode symbols in vectors: a symbol a can be encoded using a random vector a ∈ R^d drawn from a multivariate normal distribution a ∼ N(0, (1/√d) I). These vectors are used as basis vectors for the Johnson-Lindenstrauss transform [4] as well as for random indexing [9]. The major property of these random vectors is the following:

  aᵀb ≈ 1 if a = b, and aᵀb ≈ 0 if a ≠ b

Given the above representation of symbols, we can define a basic operation [a]⊕ and its approximate inverse [a]⊖ as the basis for encoding and decoding symbols. The first operation is defined as follows:

  [a]⊕ = A∘Φ

where A∘ is the circulant matrix of the vector a and Φ is a permutation matrix. This operation has a nice approximate inverse in [a]⊖ = Φᵀ A∘ᵀ. In fact, the following property holds:

  [a]⊖[b]⊕ ≈ I if a = b, and [a]⊖[b]⊕ ≈ 0 if a ≠ b

This property holds because ΦΦᵀ = I, as Φ is a permutation matrix, and

  A∘ᵀB∘ ≈ I if a = b, and A∘ᵀB∘ ≈ 0 if a ≠ b

due to the fact that A∘ and B∘ are circulant matrices based on random vectors a, b ∼ N(0, (1/√d) I).

The two operations [a]⊕ and [a]⊖ are strictly linked to (1) the circular convolution and its inverse, the circular correlation, used to encode and decode flat structures [8], and (2) the shuffled circular convolution [?] used to encode syntactic trees. In fact, the shuffled circular convolution ⊛ is:

  a ⊛ b = a ∗ Φb = [a]⊕[b]⊕ e₁

where ∗ is the circular convolution and e₁ is the first basis vector of R^d. Shuffling has been introduced in combination with circular convolution to make it possible to encode larger structures [?] by eliminating the commutative property of circular convolution. A similar technique has been used for word sequences [9].

These two operations are what is needed to encode and decode symbols and structures in matrices. If [a]⊕[b]⊕ encodes the two symbols a and b, the operations allow us to determine: (1) that [a]⊕[b]⊕ starts with a but not with b or c, as [a]⊖[a]⊕[b]⊕ ≈ [b]⊕, which is different from 0, whereas [b]⊖[a]⊕[b]⊕ ≈ 0 and [c]⊖[a]⊕[b]⊕ ≈ 0; and (2) that [a]⊕[b]⊕ starts with a and continues with b, as [a]⊖[a]⊕[b]⊕ ≈ [b]⊕.

[Figure 2 depicts the Tetris-like pieces of the running example: the numbers 0, 1, 2 ([0]⊕, [1]⊕, [2]⊕), the symbols a, b, D, E, S ([a]⊕, [b]⊕, [D]⊕, [E]⊕, [S]⊕) and the separator sep ([Sep]⊕).]

Figure 2: Tetris-like pieces for symbols in the running example

To visualize the encoding and decoding ability of the operations, we use a metaphor based on Tetris (Figure 2). Symbols under the operations are represented as Tetris pieces: [a]⊕ is a piece and [a]⊖ is the piece with the complementary shape, and similarly for [b]⊕ and [b]⊖. Sequences of symbols are sequences of pieces, for example [a]⊕[b]⊕[S]⊕. Sums of more than one sequence are represented in boxes, for example:

  L = { [a]⊕[b]⊕[S]⊕ , [D]⊕[S]⊕[a]⊕ }

which represents [a]⊕[b]⊕[S]⊕ + [D]⊕[S]⊕[a]⊕. Then, as in Tetris, elements with complementary shapes are canceled and removed from a sequence: for example, if [a]⊖ is applied to the left of [a]⊕[b]⊕[S]⊕, the pair [a]⊖[a]⊕ disappears and the result is [b]⊕[S]⊕. In addition to the Tetris rules, an element with a given shape selects from a list only the elements with an emerging complementary shape: for example, if [a]⊖ is applied to the list L, the result is the list { [b]⊕[S]⊕ }, as [a]⊖ selects [a]⊕[b]⊕[S]⊕ and not [D]⊕[S]⊕[a]⊕.

With these operations and with this Tetris metaphor, we can describe our model to encode the tables P in matrices and to transform rule applications into matrix multiplications.
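To make these operations concrete, the following numpy sketch (our own illustration; the dimension d, the random seed and the trace-based checks are arbitrary choices not fixed by the paper) builds [a]⊕ and [a]⊖ from circulant and permutation matrices and verifies the decoding properties:

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
d = 1024  # dimension of the distributed representation (our choice)

def sym():
    # random symbol vector ~ N(0, (1/sqrt(d)) I), so that a.a ~ 1 and a.b ~ 0
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

Phi = np.eye(d)[rng.permutation(d)]       # fixed random permutation matrix

def enc(v):
    return circulant(v) @ Phi             # [v]+ = C(v) Phi

def dec(v):
    return Phi.T @ circulant(v).T         # [v]- = Phi^T C(v)^T

def proj(X, Y):
    # Frobenius projection of X on Y: ~1 if X ~ Y, ~0 if unrelated
    return np.sum(X * Y) / np.sum(Y * Y)

a, b = sym(), sym()
# [a]- [a]+ ~ I and [b]- [a]+ ~ 0, checked via the normalized trace
print(np.trace(dec(a) @ enc(a)) / d)      # ~ 1.0
print(np.trace(dec(b) @ enc(a)) / d)      # ~ 0.0

# decoding the two-symbol sequence [a]+ [b]+
pair = enc(a) @ enc(b)
print(proj(dec(a) @ pair, enc(b)))        # ~ 1.0: after a comes b
print(proj(dec(b) @ pair, enc(a)))        # ~ 0.0: the sequence does not start with b
```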

3 The CYK algorithm on Distributed Representations

The distributed CYK algorithm (D-CYK) is our version of the CYK algorithm over distributed representations. Like the traditional CYK, this algorithm recognizes whether or not a sequence s is in a language defined by a Chomsky Normal Form grammar with a set of rules R. Yet, unlike the traditional CYK algorithm, the table P and the rules R are represented with matrices in a distributed representation, and the application of the rules is obtained with matrix algebra. In the following we describe how our D-CYK encodes: (1) the table P in a matrix P; (2) pre-terminal rules in a matrix R; and (3) binary rules in matrices R_A, one for each symbol A in the grammar. Such an encoding enables a CYK algorithm based on matrix algebra.

3.1 Encoding the Table P

The table P of the CYK algorithm can be seen as a collection of triples (i, j, e). If more than one symbol e_k is in a cell P[i,j], the collection of triples will contain more elements with the same i and j and with a different symbol e_k, that is, triples like (i, j, e_k). Given the revised Holographic Reduced Representation we are using (see Section 2.2), the table P can be represented as a matrix P containing the sum of the representations of each symbol in a cell.

A symbol e in a cell P[i,j], including terminal symbols, is represented as:

  P[i, j, e] = [i]⊕[j]⊕[e]⊕

and P is the sum of the elements P[i, j, e]. To visualize the idea, the table P of the running example is represented in the Tetris-like notation in Fig. 1, drawn according to the pieces in Fig. 2. The distributed representation P of the table P allows selecting the symbols in specific positions. According to the previous properties, the symbols in position (i, j) can be selected with [j]⊖[i]⊖P. For example, the symbol in position (1, 2) for the running example in Fig. 1 can be selected using the matrix [2]⊖[1]⊖.
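As a concrete illustration, the following continuation of the numpy sketch from Section 2.2 (reusing enc, dec, sym and proj defined there; the symbol inventory is just the running example's, and the helper names are our own) encodes the table P of Fig. 1 and queries one of its cells:

```python
# Continues the sketch from Section 2.2 (enc, dec, sym, proj defined there).
V = {s: sym() for s in ["0", "1", "2", "3", "a", "b", "D", "E", "S"]}
Ep = {s: enc(v) for s, v in V.items()}   # [s]+ for every symbol of the example
Em = {s: dec(v) for s, v in V.items()}   # [s]- for every symbol of the example

def triple(i, j, e):
    return Ep[i] @ Ep[j] @ Ep[e]         # P[i, j, e] = [i]+ [j]+ [e]+

# distributed encoding of the full table P of Figure 1
P = (triple("0", "1", "a") + triple("0", "2", "a") + triple("0", "3", "b")
     + triple("1", "1", "D") + triple("1", "2", "D") + triple("1", "3", "E")
     + triple("2", "2", "S") + triple("3", "1", "S"))

# select position (1, 2) with [2]- [1]- P: the cell should contain D, not S
cell = Em["2"] @ Em["1"] @ P
print(proj(cell, Ep["D"]))   # ~ 1.0
print(proj(cell, Ep["S"]))   # ~ 0.0
```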

3.2 Encoding and Using Unary Rules

The CYK algorithm uses unary rules to fill the cells P[1,j] of the first row by using the input in the cells P[0,j] of row 0 (see Algorithm 1). Hence, the distributed representation of these unary rules should take the representation of row 0 in the distributed representation P, that is, Σ_j [0]⊕[j]⊕[s_j]⊕, and add the distributed representation of the first row, Σ_j [1]⊕[j]⊕[A_j]⊕. In the running example (Fig. 1), this distributed representation of unary rules should build the matrix representing the first row:

  P′ = [1]⊕[1]⊕[D]⊕ + [1]⊕[2]⊕[D]⊕ + [1]⊕[3]⊕[E]⊕

by using the matrix that encodes the initial sequence:

  P = [0]⊕[1]⊕[a]⊕ + [0]⊕[2]⊕[a]⊕ + [0]⊕[3]⊕[b]⊕

Our D-CYK algorithm multiplies the distributed unary rules R[A] with the matrix P to detect the applicability of unary rules r = A → α to P (see Algorithm 2). The distributed unary rules R[A] are defined as follows:

  R[A] = Σ_{A→α} [α]⊖

The operation between R[A] and P is σ(R[A][i]⊖[0]⊖P), where σ(x) is the sigmoid function σ(x) = 1/(1 + e^(−(x−0.5)β)). Hence, this operation detects whether the rules associated with a symbol A are applicable in the cell (0, i) of the table P. In fact, [i]⊖[0]⊖P ≈ [w_i]⊕ extracts the distributed representation of the terminal symbol w_i. Then,

  R[A][w_i]⊕ ≈ 0 if α ≠ w_i, and R[A][w_i]⊕ ≈ I if α = w_i     (1)

and this behavior is reinforced by the subsequent use of the sigmoid function: if the rules associated with a symbol A are applicable, the resulting matrix is I, otherwise it is 0. Then, the operation

  P := P + [1]⊕[i]⊕[A]⊕ σ(R[A][i]⊖[0]⊖P)

adds a non-zero element to the matrix P only if rules for A are matched in P.

Algorithm 2 CYK_basic(sequence s, distributed unary rules R) return P
  P = Σ_i [0]⊕[i]⊕[w_i]⊕
  for i ∈ [1..n] do
    for A ∈ preterminals do
      P := P + [1]⊕[i]⊕[A]⊕ σ(R[A][i]⊖[0]⊖P)

To describe how this part of the algorithm works, we apply CYK_basic to the running example in Fig. 1 using the Tetris-like representation. The two preterminal rules are represented as R[D] = [a]⊖ and R[E] = [b]⊖

in the Tetris-like form. Given the input sequence aab, the matrix P is initialized as:

  P = [0]⊕[1]⊕[a]⊕ + [0]⊕[2]⊕[a]⊕ + [0]⊕[3]⊕[b]⊕

Hence, the application of the rules R[D] to the (0, 1) position of the table P represented in the matrix P is the following:

  P := P + [1]⊕[1]⊕[D]⊕ σ([a]⊖[1]⊖[0]⊖P)

The second part of the assignment unfolds in this way:

  [1]⊕[1]⊕[D]⊕ σ([a]⊖[1]⊖[0]⊖([0]⊕[1]⊕[a]⊕ + [0]⊕[2]⊕[a]⊕ + [0]⊕[3]⊕[b]⊕)) = [1]⊕[1]⊕[D]⊕ σ([a]⊖[a]⊕) = [1]⊕[1]⊕[D]⊕

Then, at the end of the application of the rules in position (0, 1), the final piece [1]⊕[1]⊕[D]⊕ is added to the initial matrix P. After applying the rules to each (0, i) position, the matrix P is completed in the following way:

  P = [0]⊕[1]⊕[a]⊕ + [0]⊕[2]⊕[a]⊕ + [0]⊕[3]⊕[b]⊕ + [1]⊕[1]⊕[D]⊕ + [1]⊕[2]⊕[D]⊕ + [1]⊕[3]⊕[E]⊕     (2)
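The unary step can be rendered in the same numpy sketch as before (again illustrative: the sigmoid sharpness β = 50 is our own choice, as the paper does not fix it, and Ep, Em and triple come from the previous sketch):

```python
# Continues the previous sketches (Ep, Em, triple, proj defined there).
beta = 50.0                                   # sigmoid sharpness (our choice)
def sigma(X):
    return 1.0 / (1.0 + np.exp(-(X - 0.5) * beta))

# distributed unary rules: R[A] = sum of [alpha]- over the rules A -> alpha
R = {"D": Em["a"], "E": Em["b"]}

seq = "aab"
P = sum(triple("0", str(i), w) for i, w in enumerate(seq, start=1))  # row 0
for i in range(1, len(seq) + 1):
    for A in ("D", "E"):
        gate = sigma(R[A] @ Em[str(i)] @ Em["0"] @ P)  # ~ I iff A -> s_i applies
        P = P + Ep["1"] @ Ep[str(i)] @ Ep[A] @ gate    # add [1]+ [i]+ [A]+ if fired

# cell (1, 1) now contains D and not E, as in equation (2)
cell = Em["1"] @ Em["1"] @ P
print(proj(cell, Ep["D"]))   # ~ 1.0
print(proj(cell, Ep["E"]))   # ~ 0.0
```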

3.2.1 Encoding Binary Rules

To fully define our D-CYK, we describe here how to encode binary rules so that these rules can fire over the representation of the table P with matrix algebra. We then define the second part of the algorithm, that is, CYK_binary. The driving idea is to use a representation of rules that yields a nearly identity matrix I when applied to the matrix P if specific rules fire. This enables the insertion of new symbols in P.

Algorithm 3 CYK_binary(P, rules R_A for each A) return P
  for i ∈ [2..n] do
    for j ∈ [1..(n − i + 1)] do
      for A ∈ NonTerminals do
        for k ∈ [1..(i − 1)] do
          P_A := P_A + σ([j]⊖[k]⊖ P R_A [(j + k)]⊖[(i − k)]⊖ P) ⊗ I
        P := P + [i]⊕[j]⊕[A]⊕ P_A

In P_A, after the application of the sigmoid, an element-wise multiplication with the identity matrix (⊗ I in Algorithm 3) is required; this operation is helpful to remove noise. To obtain the above effect, we encode the right-hand side of a binary rule A → BC as:

  r_{A→BC} = [B]⊖[C]⊖

Then, all the right-hand sides of a given symbol A are collected in the matrix R_A = Σ_{A→BC} [B]⊖[C]⊖. Algorithm 3 (CYK_binary) uses these rules to determine whether a symbol A can fire in a position (i, j) with a specific k. The key part of CYK_binary is this:

  σ([j]⊖[k]⊖ P R_A [(j + k)]⊖[(i − k)]⊖ P)

where [j]⊖[k]⊖P and [(j + k)]⊖[(i − k)]⊖P select the symbols in the positions (k, j) and (i − k, j + k), respectively. Then, R_A is used to determine whether those two positions contain symbols that activate a rule headed by A. If this is the case, the result will be nearly I. Then, to reinforce this matrix, we use the sigmoid function σ(x) = 1/(1 + e^(−(x−0.5)β)), which forces the matrix to be more similar either to I or to the matrix 0 with all zeros.

To visualize the behaviour of the algorithm CYK_binary, we use again the running example of Fig. 1. Only the symbol S has binary rules, so the matrix R_S is the following:

  R_S = [D]⊖[E]⊖ + [D]⊖[S]⊖

Let us focus on the position (2, 2) with k = 1. The operation is the following:

  P_S := P_S + σ([2]⊖[1]⊖ P R_S [3]⊖[1]⊖ P)

where P is the one defined in equation (2). Hence:

  [2]⊖[1]⊖ P R_S [3]⊖[1]⊖ P ≈ [2]⊖[1]⊖([1]⊕[2]⊕[D]⊕) ([D]⊖[E]⊖ + [D]⊖[S]⊖) [3]⊖[1]⊖([1]⊕[3]⊕[E]⊕) ≈ [D]⊕([D]⊖[E]⊖ + [D]⊖[S]⊖)[E]⊕ ≈ I

since [D]⊕[D]⊖[E]⊖[E]⊕ ≈ I while [D]⊕[D]⊖[S]⊖[E]⊕ ≈ 0; the element-wise multiplication with I (⊗ I) then removes the residual off-diagonal noise.
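The binary step for this example can be sketched in the same way, continuing from the matrix P produced by the unary-step sketch above (illustrative; the select helper is our own shorthand for the [j]⊖[i]⊖P selection):

```python
# Continues the previous sketches (Ep, Em, sigma, proj, d and P defined there).
# R_S collects the right-hand sides of S -> DE and S -> DS.
R_S = Em["D"] @ Em["E"] + Em["D"] @ Em["S"]

def select(i, j, Pm):
    return Em[str(j)] @ Em[str(i)] @ Pm   # [j]- [i]- P: content of cell (i, j)

# position (2, 2) with k = 1: cells (1, 2) and (1, 3) contain D and E
P_S = sigma(select(1, 2, P) @ R_S @ select(1, 3, P)) * np.eye(d)  # sigma(...) x I
P = P + Ep["2"] @ Ep["2"] @ Ep["S"] @ P_S

print(proj(select(2, 2, P), Ep["S"]))     # ~ 1.0: S has been added to cell (2, 2)
```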

4 Experiments

The aim of these experiments is to show that the distributed CYK algorithm behaves as the original CYK algorithm. For this purpose, we do not need huge datasets but small, well-defined sets of sentences derived from fixed grammars, as described in the following sections.

4.1 Experimental Set-up

We experimented with three different grammars containing an increasing number of rules: (1) the grammar M1, with 10 non-terminals, 4 binary rules and an average of 10 unary rules for each non-terminal; (2) the grammar M2, that is, the grammar M1 with an average of 100 unary rules for each non-terminal; and, finally, (3) the grammar ML, again built on M1, with more than 300 unary rules for each non-terminal.

Given the above grammars, the test sets of sentences were prepared randomly: we used 3 sets of 50 random sentences produced from the above grammars. As we want to understand whether D-CYK is able to reproduce the computation of the original CYK, we used cell symbol precision (Prec), cell symbol recall (Rec) and cell symbol f1-measure (f1), which estimate whether the distributed P is similar to the original P. In fact, these versions of precision, recall and f1-measure aim to compute how many symbols of the original P are correctly recovered in the distributed representation generated by the algorithm. The computation of these measures exploits the fact that the dot product between two distributed vectors approximately counts the number of symbols in common. Hence, to compute these measures, for each sentence s we run both the original CYK and our D-CYK, obtaining P^CYK and P^D-CYK. We then encode P^CYK in its distributed representation version P^CYK, which is the oracle version of our target matrix. Finally, cell symbol precision (Prec) and cell symbol recall (Rec) are defined as follows:

  Prec = ⟨P^CYK_i, P^D-CYK_i⟩ / ⟨P^D-CYK_i, P^D-CYK_i⟩

  Rec = ⟨P^CYK_i, P^D-CYK_i⟩ / ⟨P^CYK_i, P^CYK_i⟩

where P^CYK_i and P^D-CYK_i are the i-th columns of the corresponding matrices and ⟨y, x⟩ is the dot product between x and y. The f1 measure is obtained with the classical combination of precision and recall. Presented results are the averages and the standard deviations of these measures over the columns of the matrices and over the different sentences. We experimented with four different dimensions of the matrices P: 500, 1000, 1500 and 2000.
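As an illustration (our own rendering of the formulas above; the matrix names are hypothetical), these column-wise measures can be computed as follows:

```python
import numpy as np

def cell_symbol_scores(P_cyk, P_dcyk):
    """Column-wise Prec, Rec and f1 between the oracle encoding P_cyk
    (distributed version of the symbolic table) and P_dcyk from D-CYK."""
    num = np.sum(P_cyk * P_dcyk, axis=0)          # <P_cyk_i, P_dcyk_i> per column
    prec = num / np.sum(P_dcyk * P_dcyk, axis=0)  # normalize by D-CYK columns
    rec = num / np.sum(P_cyk * P_cyk, axis=0)     # normalize by oracle columns
    f1 = 2 * prec * rec / (prec + rec)
    return prec.mean(), rec.mean(), f1.mean()
```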

4.2 Results

Figure 3: Experiment with grammar M1.

Results are really encouraging, showing that, as the dimensions of the matrices increase, D-CYK can approximate with its operations what is done by the traditional CYK. The f1-measure in fact increases with the dimension of the matrix (see Figure 3). This is mainly due to an improvement of the cell symbol precision, as the cell symbol recall is substantially stable. Hence, as the dimension increases, D-CYK becomes more precise in replicating the original matrix. The size of the grammar is instead a major problem: the precision of the algorithm is affected by the number of rules, whereas the recall is substantially similar across the three different grammars. These results confirm that it is possible to transfer a traditional algorithm to a version defined over distributed representations.

5 Conclusions and Future Work

The predominance of symbolic, grammar-based syntactic parsers for natural language is nowadays being successfully challenged by neural networks, which are based on distributed representations; years of results and understanding risk being lost. We proposed D-CYK, a distributed version of CYK, a classical parsing algorithm. Experiments show that D-CYK can perform the task of the original CYK in this new setting. Neural networks are a tremendous opportunity to develop novel solutions for known tasks. Our solution opens an avenue to an innovative set of possibilities: revitalizing symbolic methods in neural networks. In fact, our model is the first step towards the definition of a "complete distributed CYK algorithm" that builds trees in distributed representations during the computation. Moreover, it can foster the definition of recurrent layers of CYK-informed neural networks.


References

[1] Bharat Ram Ambati, Tejaswini Deoskar, and Mark Steedman. Shift-Reduce CCG Parsing using Neural Network Models. pages 447–453, 2016.
[2] John Cocke. Programming Languages and Their Compilers: Preliminary Notes. Courant Institute of Mathematical Sciences, New York University, 1969.
[3] Jay Earley. An Efficient Context-free Parsing Algorithm. Commun. ACM, 13(2):94–102, 1970.
[4] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[5] Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, Air Force Cambridge Research Lab, Bedford, MA, 1965.
[6] Donald E. Knuth. On the translation of languages from left to right. Information and Control, 8(6):607–639, 1965.
[7] Marshall R. Mayberry and Risto Miikkulainen. A Neural Network Shift-Reduce Parser. 1997.
[8] Tony A. Plate. Holographic Reduced Representations. IEEE Transactions on Neural Networks, 6(3):623–641, 1995.
[9] Magnus Sahlgren. An Introduction to Random Indexing. In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE 2005), pages 1–9, 2005.
[10] Gregory Senay, Fabio Massimo Zanzotto, Lorenzo Ferrone, and Luca Rigazio. Predicting Embedded Syntactic Structures from Natural Language Sentences with Neural Network Approaches. In Proceedings of the 2015 International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches (COCO'15), volume 1583, pages 129–137, Aachen, Germany, 2015. CEUR-WS.org.
[11] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a Foreign Language. arXiv, pages 1–10, 2014.
[12] Daniel H. Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, 1967.
[13] Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete. Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"? In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 40–49, Sofia, Bulgaria, 2013. Association for Computational Linguistics.
