Neural Generative Question Answering

arXiv:1512.01337v1 [cs.CL] 4 Dec 2015

Jun Yin¹∗  Xin Jiang²  Zhengdong Lu²  Lifeng Shang²  Hang Li²  Xiaoming Li¹
¹ School of Electronic Engineering and Computer Science, Peking University  {jun.yin, lxm}@pku.edu.cn
² Noah's Ark Lab, Huawei Technologies  {Jiang.Xin, Lu.Zhengdong, Shang.Lifeng, HangLi.HL}@huawei.com

Abstract

This paper presents an end-to-end neural network model, named Neural Generative Question Answering (genQA), that can generate answers to simple factoid questions, with both the question and the answer in natural language. More specifically, the model is built on the encoder-decoder framework for sequence-to-sequence learning, while being equipped with the ability to access an embedded knowledge-base through an attention-like mechanism. The model is trained on a corpus of question-answer pairs, with their associated triples in the given knowledge-base. Empirical study shows that the proposed model can effectively deal with variations in question phrasing and generate the right answer by referring to the facts in the knowledge-base. The experiments on question answering demonstrate that the proposed model outperforms a retrieval-based model as well as a neural dialogue model trained on the same data.

1 Introduction

Question answering (QA) can be viewed as a special case of single-turn dialogue: QA aims at providing correct answers to questions asked in natural language, while dialogue models often emphasize generating relevant and fluent responses in natural language within a conversation [5, 6]. Recent progress in neural dialogue systems has raised the intriguing possibility of a generation-based model for QA. That is, the answer is generated by a neural network (e.g., an RNN) based on a proper representation of the question, and therefore enjoys the flexibility of natural language in the answer. More importantly, since such a model can be trained in an end-to-end fashion, there is no need for the extra effort of building a semantic parser for questions.

There is however one serious limitation of this generation-based proposal for QA: it is nearly impossible to store all the knowledge in the weights of a neural network with the precision and coverage desired for real-world QA. This is a fundamental difficulty, rooted deeply in the way knowledge of different forms and levels of abstraction is acquired, represented, and stored. On the other hand, the recent success of memory-based neural network models has greatly extended the current scheme of representing text information in both short-term memory (e.g., in [1]) and long-term memory (e.g., in [7]), offering much richer ways to save the information of one or more sentences (i.e., going beyond a fixed-length vector) and to access that information (beyond matrix-vector multiplication) (e.g., in [8]).

∗ This work was done while the first author was an intern at Noah's Ark Lab, Huawei Technologies.


Table 1: Examples of training instances for conversational question answering, with the associated triple (subject, predicate, object). The KB-word in each answer is underlined in the original paper (marked here with underscores).

Q: How tall is Yao Ming?
A: He is _2.29m_ tall and is visible from space.
Triple: (Yao Ming, height, 2.29m)

Q: Which country was Beethoven from?
A: He was born in what is now _Germany_.
Triple: (Ludwig van Beethoven, place of birth, Germany)

Q: Which club does Messi play for?
A: Lionel Messi currently plays for _FC Barcelona_ in the Spanish Primera Liga.
Triple: (Lionel Messi, team, FC Barcelona)

It is hence a natural choice to connect the neural model for QA to an external memory in which long-term knowledge resides, which, interestingly, approaches the more traditional line of research on template-based QA equipped with a knowledge-base (KB) from a different angle. In this paper, we report our exploration in this direction, proposing a model called Neural Generative Question Answering (genQA).

Learning Task: We formalize neural generative QA (genQA) as a supervised learning task, or more specifically, a sequence-to-sequence learning task. The generated sentence may contain two overlapping types of "words": one is the normal word belonging to the conversation (referred to as a normal word), and the other is an item from the triple (usually the object) corresponding to the extracted answer (referred to as a KB-word). Each training instance consists of a question-answer pair, where the KB-word in the answer can be specified beforehand when composing the dataset. In this paper, we only consider the case of simple factoid questions, each with a single relevant fact (i.e., one triple) in the KB (see Table 1 for some examples). In Section 3 we explain how we obtain such training examples. The difficulty of learning in genQA lies in jointly training, in an end-to-end fashion, the representation of the question, the generation of the answer as a natural language sentence, and a mechanism for finding the relevant piece of information in the KB.
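To make the setup concrete, the following is a minimal Python sketch of how one such training instance could be represented; the field names are our illustrative assumptions, not the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrainingInstance:
    """One genQA training instance: a QA pair plus its supporting KB triple.

    All field names here are illustrative, not from the paper.
    """
    question: List[str]           # tokenized question
    answer: List[str]             # tokenized answer; contains one KB-word
    kb_word_position: int         # index of the KB-word within the answer
    triple: Tuple[str, str, str]  # (subject, predicate, object)

example = TrainingInstance(
    question=["How", "tall", "is", "Yao", "Ming", "?"],
    answer=["He", "is", "2.29m", "tall", "and", "is", "visible",
            "from", "space", "."],
    kb_word_position=2,           # "2.29m" is the KB-word (the triple's object)
    triple=("Yao Ming", "height", "2.29m"),
)
```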

2 The Neural Model

Let Q = (x_1, ..., x_{T_Q}) and Y = (y_1, ..., y_{T_Y}) denote the natural language question and answer, respectively. The knowledge-base is organized as a set of triples (subject, predicate, object), each denoted as τ = (τ_s, τ_p, τ_o). Inspired by work on the RNN encoder-decoder framework [4, 1, 5] for short-text conversation and on knowledge-base embedding for open-domain QA [2, 3], we propose an end-to-end neural network model for genQA (illustrated in Figure 1), which consists of Interpreter, Enquirer, Answerer, and an external knowledge-base. Basically, Interpreter transforms the natural language question Q into the representation H_Q, which is saved in the Short-term Memory. H_Q then serves as input to Enquirer, which interacts with the Long-term Memory (the knowledge-base) and returns a vector r_Q summarizing the retrieved results from the knowledge-base. Answerer draws on both the question representation H_Q (through the Attention Model) and the fixed-length vector r_Q when generating the answer with its Generator.


Figure 1: The diagram for genQA.

Interpreter: We adopt the bi-directional RNN of [1] to create an array (of the same length as Q) of vectors {h_1, ..., h_{T_Q}}, which are then concatenated with the original word embeddings to form the representation of Q, i.e., \tilde{h}_t = [h_t; x_t], t = 1, ..., T_Q, with x_t being the embedding of the t-th word in Q. This array of vectors H_Q = {\tilde{h}_1, ..., \tilde{h}_{T_Q}} is saved in the short-term memory, allowing further summarization by Answerer and Enquirer for different purposes.

Enquirer: Enquirer extracts the relevant knowledge from the knowledge-base based on a representation it obtains from H_Q (as illustrated by Figure 2). This reduces to evaluating the relevance of each triple in the embedded KB [3, 2], which is actually a variant of the attention mechanism [1]. In our current implementation, Enquirer takes an average pooling of the vectors in H_Q (with the result denoted as \bar{h}_Q) and uses this as the representation when communicating with the KB. More specifically, for a triple τ with vector representation u_τ¹, we define the matching score

S(Q, τ) = \bar{h}_Q^⊤ M u_τ,

where M is the matrix parameterizing the similarity between the question and the triple. Suppose that for each question Q there are K_Q candidate triples obtained by some heuristics², denoted as {τ_k}_{k=1}^{K_Q}. The matching scores between the question and the candidate triples are encoded in a pair of K_Q-dimensional vectors (r_Q, ID_Q), with the k-th element of r_Q defined as

r_Q(k) = e^{S(Q, τ_k)} / Σ_{k'} e^{S(Q, τ_{k'})},

and the k-th element of ID_Q giving the corresponding word index in the KB-vocabulary.

¹ Here we simply use the sum of the embeddings of the subject and the predicate for u_τ.
² We use string-based matching to find a much smaller candidate subset of, say, 50 triples.


Figure 2: The Enquirer of genQA.
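As a concrete illustration of Enquirer's computation, the following NumPy sketch derives \bar{h}_Q by average pooling, scores the candidate triples with the bilinear form S(Q, τ) = \bar{h}_Q^⊤ M u_τ, and normalizes with a softmax to obtain r_Q. All names are ours, candidate retrieval (footnote 2) is assumed to have already happened, and for simplicity the question representation and the triple embeddings share one dimension d here, so M is square; in general M can be rectangular.

```python
import numpy as np

def enquirer_scores(H_Q: np.ndarray, U: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Compute r_Q for one question.

    H_Q : (T_Q, d) array of Interpreter states \tilde{h}_t.
    U   : (K_Q, d) array of candidate triple embeddings u_tau
          (per footnote 1, the sum of subject and predicate embeddings).
    M   : (d, d) similarity matrix, learned end-to-end.
    Returns the K_Q-dimensional normalized score vector r_Q.
    """
    h_bar = H_Q.mean(axis=0)      # average pooling over question positions
    scores = U @ (M @ h_bar)      # S(Q, tau_k) = h_bar^T M u_{tau_k}
    scores -= scores.max()        # stabilize the softmax numerically
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# toy usage: 6-word question, hidden size 8, 5 candidate triples
rng = np.random.default_rng(0)
r_Q = enquirer_scores(rng.normal(size=(6, 8)),
                      rng.normal(size=(5, 8)),
                      rng.normal(size=(8, 8)))
```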

Answerer: We use another RNN (Generator) to generate the answer sentence based on the information of the question saved in the short-term memory (represented by H_Q) and the relevant knowledge retrieved from the long-term memory (indexed by r_Q). The probability of an answer sentence Y = (y_1, y_2, ..., y_{T_Y}) is given by

p(y_1, ..., y_{T_Y} | H_Q, r_Q; θ) = p(y_1 | H_Q, r_Q; θ) ∏_{t=2}^{T_Y} p(y_t | y_1, ..., y_{t-1}, H_Q, r_Q; θ),

where the conditional probability in the generating RNN (with states s_1, ..., s_{T_Y}) is given by p(y_t | y_1, ..., y_{t-1}, H_Q, r_Q; θ) = p(y_t | s_t; θ). In generating the t-th word of the answer sentence, the probability of y_t is given by the following mixture model:

p(y_t | s_t; θ) = p(z_t = 0 | s_t; θ) p(y_t | s_t, 0; θ) + p(z_t = 1 | s_t; θ) p(y_t | s_t, 1; θ),

which sums the contributions from the "language model" part, p(y_t | s_t, 0; θ), and the knowledge from the external source, p(y_t | s_t, 1; θ), with p(z_t | s_t; θ) realized by a logistic regression model taking s_t as input. The latent variable z_t indicates whether the word is generated from the normal vocabulary (z_t = 0) or the KB vocabulary (z_t = 1). For any word y that appears only in the KB vocabulary, e.g., "2.29m", p(y | s_t, 0; θ) = 0, while for any y that does not appear in the KB, e.g., "and", p(y | s_t, 1; θ) = 0. Some words (e.g., "Shanghai") appear in both the regular vocabulary and the KB vocabulary, and for these the probability contains nontrivial contributions from both generating components. In generating normal words, i.e., the part described by p(y_t | s_t, 0; θ), Generator acts in the same way as the decoder RNN in [1], with the relevant segment of H_Q selected by the Attention Model. The probability of generating a KB-word via p(y_t | s_t, 1; θ) is simply p(y_t = ID_Q(j) | s_t, 1; θ) = r_Q(j), where a(j) denotes the j-th element of a vector a.
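One decoding step of this mixture can be sketched as follows, again with NumPy; the RNN state update, the attention, and the logistic gate are abstracted into inputs, and all names are ours.

```python
import numpy as np

def mixture_step(p_lm: np.ndarray, r_Q: np.ndarray,
                 ID_Q: np.ndarray, z1: float, vocab_size: int) -> np.ndarray:
    """One step of Answerer's mixture model, p(y_t | s_t).

    p_lm : (V,) language-model distribution p(y_t | s_t, z_t = 0).
    r_Q  : (K_Q,) Enquirer scores over the candidate triples.
    ID_Q : (K_Q,) word indices of the candidate objects in the vocabulary.
    z1   : scalar p(z_t = 1 | s_t) from the logistic gate.
    """
    p_kb = np.zeros(vocab_size)
    np.add.at(p_kb, ID_Q, r_Q)    # p(y_t = ID_Q(j) | s_t, 1) = r_Q(j)
    return (1.0 - z1) * p_lm + z1 * p_kb
```

Note that for a word appearing in both vocabularies (e.g., "Shanghai"), both terms contribute, exactly as described above.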

2.1 End-to-End Learning

The parameters to be learned include the weights in the RNNs of Interpreter and Answerer, the matrix M in Enquirer, and the word embeddings, which are shared by the Interpreter RNN and the knowledge-base. Quite nicely, genQA, although essentially containing a matching/retrieval operation, can be trained in an end-to-end fashion by maximizing the likelihood of the observed data, since the mixture form of the probability in Answerer provides a unified way to generate words from the regular vocabulary and the knowledge-base. In practice we use stochastic gradient descent with mini-batches for optimization.
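For instance, the per-instance objective could be sketched as the negative log-likelihood below, where each step distribution is produced, e.g., by the mixture step sketched earlier; this illustrates the objective only, with the forward passes of Interpreter, Enquirer, and Answerer assumed given.

```python
import numpy as np

def answer_nll(step_dists, target_ids) -> float:
    """Negative log-likelihood of one answer Y = (y_1, ..., y_{T_Y}).

    step_dists : list of (V,) arrays, p(y_t | s_t) for t = 1..T_Y.
    target_ids : list of the gold word indices y_1..y_{T_Y}.
    """
    eps = 1e-12                   # guard against log(0)
    return -sum(float(np.log(p[y] + eps))
                for p, y in zip(step_dists, target_ids))
```

Summing this loss over a mini-batch and back-propagating through all three components realizes the end-to-end training described above.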


Table 2: Statistics of the dataset.

  Knowledge-base:  5,539,736 triples
  Training data:   696,306 QA pairs, 58,019 unique triples
  Test data:       23,364 QA pairs, 1,974 unique triples

Table 3: Training and test accuracies.

  Model             Training   Test
  Retrieval-based   40%        36%
  NRM [5]           15%        19%
  genQA             46%        47%

3 Experiments

3.1 Dataset

A knowledge-base is constructed by extracting triples from three Chinese encyclopedia web sites³, and question-answer pairs are collected from two Chinese community QA sites⁴. Training and test data are constructed by going through every question-answer pair and keeping only those matched to triples in the KB according to a set of rules⁵. As a result, 720K instances are obtained, with an estimated 80% of the pairs having a correct corresponding triple in the KB. In order to test the generalization ability of the model, the data are randomly partitioned into training and test sets using the triple as the key, so that all questions in the test data are about facts unseen in the training data. Table 2 shows some statistics of the datasets. From the table we can see that a large portion of the facts are not present in the community QA pairs, which underscores the necessity of a model that generalizes to unseen facts.
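As an illustration, the basic relevance rule (footnote 5) amounts to a simple string-containment check like the sketch below; the additional filtering and normalization steps mentioned in the footnote are omitted.

```python
def is_relevant(question: str, answer: str, triple: tuple) -> bool:
    """Footnote 5's basic rule: the question must contain the triple's
    subject and the answer must contain its object (string matching only;
    the paper's further filtering/normalization is not reproduced here).
    """
    subject, _predicate, obj = triple
    return subject in question and obj in answer

# e.g. is_relevant("How tall is Yao Ming?",
#                  "He is 2.29m tall and is visible from space.",
#                  ("Yao Ming", "height", "2.29m"))  ->  True
```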

3.2 Comparison Models

To the best of our knowledge there is no previous work on generative QA, so we choose two baseline methods, a neural dialogue model and retrieval-based QA, corresponding respectively to the generative aspect and the KB-retrieval aspect of genQA.

Retrieval-based QA: The KB is indexed by an information retrieval system (we use Apache Solr), with each triple deemed a document. At test time, the question is used as the query and the top retrieved triple is returned as the answer. Note that in general this method cannot generate natural language answers.

Neural Responding Machine (NRM): NRM [5] is a neural generative model, inspired by the attention-based model for machine translation [1], that has shown promising results on short-text conversation. We train the NRM model on the question-answer pairs in the training data, with the same vocabulary as the normal vocabulary of genQA.
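A plausible implementation of the retrieval baseline using the pysolr client is sketched below; the Solr core name and document fields are our assumptions, since the paper gives no implementation details beyond the use of Apache Solr.

```python
import pysolr

# each KB triple is indexed beforehand as one Solr document
# with (hypothetical) fields: subject, predicate, object
solr = pysolr.Solr("http://localhost:8983/solr/kb_triples", timeout=10)

def retrieve_answer(question: str) -> str:
    """Query with the raw question; return the top triple's object."""
    results = solr.search(question, rows=1)
    for doc in results:
        return doc["object"]      # hypothetical field name
    return ""                     # no triple retrieved
```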


3.3 Results and Analysis

We evaluate the performance of the models in terms of 1) answering accuracy, i.e., the ratio of correctly answered questions, and 2) the fluency of answers. In order to ensure an accurate evaluation, we randomly select 300 questions from the training and test data respectively, manually remove the nearly duplicate cases, and filter out the mistaken cases. Table 3 shows the accuracies of the models on the training and test sets respectively. NRM has the lowest accuracy on both training and test data, showing a lack of ability to remember answers accurately and to generalize to facts unseen in the training data. For example, to the question "Which country does Xavi play for as a midfielder?" (translated from Chinese), NRM gives the answer "He plays for France" (translated from Chinese). The retrieval-based method achieves moderate accuracy, but like most string-matching methods it suffers from word mismatch between the question and the triples in the KB. genQA gets almost half of the questions right, achieving the best accuracy among the three models and demonstrating its ability to represent the question and find the relevant triple in the KB even for unseen facts. For example, to the question "Which country does Xavi play for as a midfielder?", genQA gives the correct answer "He plays for Spain". Examining answers generated by genQA on the test set, it is clear that the model can blend words from the KB into sentences consisting mostly of normal words, thanks to the unified neural model. We made empirical comparisons and found no significant difference between NRM and genQA in terms of the fluency of the answers.

³ Baidu Baike: http://baike.baidu.com; Hudong Baike: http://www.baike.com; Douban: http://www.douban.com.
⁴ Baidu Zhidao: http://zhidao.baidu.com; Sougou Wenwen: http://wenwen.sogou.com.
⁵ The basic requirement for relevance is that the question contains the subject of the triple and the answer contains its object. In addition, we use further filtering and normalization techniques to improve the quality of the data.

References

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[2] A. Bordes, J. Weston, and S. Chopra. Question answering with subgraph embeddings. In EMNLP, 2014.
[3] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML PKDD, pages 165–180, 2014.
[4] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[5] L. Shang, Z. Lu, and H. Li. Neural responding machine for short-text conversation. In ACL, pages 1577–1586, 2015.
[6] O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[7] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[8] P. Yin, Z. Lu, H. Li, and B. Kao. Neural Enquirer: learning to query tables. arXiv preprint arXiv:1512.00965, 2015.
