Generating a Large Dataset for Neural Question Answering over the DBpedia Knowledge Base

Ann-Kathrin Hartmann¹, Tommaso Soru², and Edgard Marx¹

¹ Leipzig University of Applied Sciences (HTWK), Germany, https://portal.imn.htwk-leipzig.de/
² AKSW, University of Leipzig, Germany, http://aksw.org/

Abstract. The role of Question Answering is central to the fulfillment of the Semantic Web. Recently, several approaches relying on artificial neural networks have been proposed to tackle the problem of question answering over knowledge graphs. Such techniques are however known to be data-hungry, and the creation of training sets requires a substantial manual effort. We thus introduce Dbnqa, a comprehensive dataset of 894,499 pairs of questions and SPARQL queries based on templates specifically designed for the DBpedia knowledge base. We show how the method used to generate our dataset can be easily reused for other purposes. We report the successful adoption of Dbnqa in an experimental phase and present how it compares with existing question-answering corpora.

1 Introduction

The size of the Linked Open Data cloud has been increasing at a high pace in recent years. To date, billions of triples are featured in thousands of interlinked datasets belonging to a plethora of domains. However, the search and retrieval of this information is a hard task for lay users. To this end, the role of Question Answering over Linked Data (QALD) is central to the fulfillment of the Semantic Web. Recently, several approaches relying on artificial neural networks have been proposed to tackle the problem of question answering over knowledge graphs [8,14,11,12]. While most techniques perform entity linking and recognition as a separate task [8,14,7], some of them put the burden completely on the neural network, which carries out these tasks along with the query construction [11,12]. An example of the latter approaches is Neural SPARQL Machines, end-to-end architectures that translate natural-language utterances into SPARQL queries [12]. In general, however, deep-learning algorithms are known to be data-hungry, and the creation of training sets requires a substantial manual effort. In this paper, we introduce the DBpedia Neural Question Answering (Dbnqa) dataset for neural question answering over the DBpedia [9] knowledge base. We show how the method used to generate our dataset can be easily reused for other purposes. We report the successful adoption of Dbnqa in an experimental phase and present how it compares with existing question-answering benchmarks. This work brings the following contributions:


1. a reusable and efficient method to generate pairs of natural-language questions and SPARQL queries for any knowledge base;
2. a collection of templates with placeholders, based on two existing datasets for QALD and specifically designed for the DBpedia knowledge base;
3. a large and comprehensive dataset of 894,499 pairs of questions and SPARQL queries, materialized using the resources above.

2 Related Work

A large number of QA datasets have been developed for knowledge bases such as DBpedia [9] and Freebase [2]. DBpedia is one of the oldest and largest open multilingual RDF datasets in activity today. It is based on structured information extracted from Wikipedia and describes 6.2 million things, ranging from places to diseases. Freebase is another example of a large, open RDF dataset that is, however, no longer maintained. Its data originated from different sources, including individual as well as user-submitted contributions.

Free917 [5] is a QA dataset containing 917 questions (on average, 6.3 words per question), each paired with a meaning representation (answer) written in a variant of lambda calculus. It comprises 635 distinct Freebase relations; the most common domains are film and business, representing more than 6% of the overall dataset. The dataset was manually generated by two native English speakers with the only restriction of producing questions for multiple domains. From these annotated questions, an alignment dataset was then created in which each logical form was paired with a manually selected word from the question.

The first collection of QA pairs that we are aware of is WebQuestions [1]. WebQuestions was built upon Freebase and contains 5,810 question-answer pairs, with questions extracted from the Google Suggest API and answers collected via Amazon Mechanical Turk. The dataset consists of 3,778 question-answer instances for training and 2,032 for testing; 1,000 questions from the training set are held out for validation. The WebQuestions entities are annotated using entity names in Freebase. Later, a larger dataset focusing on the task of simple QA was introduced: the SimpleQuestions dataset [4], consisting of 108,442 questions written in natural language by human English-speaking annotators. In this dataset, each question and answer is related to a corresponding fact (triple) in FB2M, a subset of Freebase consisting of 2 million facts [3].

Another large collection of QA datasets can be found in the Question Answering over Linked Data (QALD) challenges [6], a series of evaluation campaigns on question answering over Linked Data which have been organized since 2011. For each challenge, the QALD committee provides a training as well as a test dataset created by experts, leading to a vast collection of question-answer pairs. In the QALD datasets, the questions are available in natural-language and keyword form; the answer is presented as a SPARQL query as well as the final result.

So far, LC-QuAD [13] is the largest QA dataset for DBpedia, containing 5,000 question-answer pairs. Of all questions in the dataset, only 18% can be answered with SPARQL queries that require a simple triple pattern. However, differently from QALD, the LC-QuAD dataset does not comprise queries containing OPTIONAL, UNION, or conditional aggregates. The questions are less complex and were paraphrased by the authors. In total, the dataset covers 5,042 entities and 615 predicates.

Table 1 gives an overview of the discussed QA datasets, listing their respective size in terms of question-query pairs, the number of entities and predicates, the targeted knowledge base, and the language used to describe or formalize the result.

Table 1. Comparison between different existing QA datasets.

Dataset              Size     Entities  Predicates  Knowledge Base  Language
WebQuestions [1]     5,810    -         -           Freebase        Text
SimpleQuestions [4]  108,442  -         -           Freebase        Triple
Free917 [5]          917      733       852         Freebase        λ-Calculus
QALD-7-train³        217      216       171         DBpedia         SPARQL
LC-QuAD [13]         5,000    5,042     615         DBpedia         SPARQL

³ https://qald.sebastianwalter.org/

3 The DBNQA Dataset

The dataset described in this work contains 894,499 pairs, each composed of a question in English and its respective SPARQL query targeting the DBpedia knowledge base. It is so far the largest dataset of this kind for training and testing machine-learning models (see Table 1). The Dbnqa dataset is more than eight times bigger than the currently largest collection, the SimpleQuestions dataset, and more than 170 times larger than the largest collection of question-answer pairs for the DBpedia knowledge base. The dataset originated from the instantiation of Neural SPARQL Machine (NSpM) templates extracted from the queries in the QALD-7 training set (QALD-7-train) in conjunction with the LC-QuAD dataset. Overall, the conversion of both datasets resulted in 5,217 NSpM templates that were used to compose the dataset: 5,000 templates originated from LC-QuAD and 217 from the QALD-7-train dataset. Table 2 shows statistics of the dataset as well as the distribution of instances, entities, and predicates of each of the datasets used to build it. In the following, we describe how we created the dataset by (1) extracting and (2) instantiating the templates using these QA datasets.

3.1 Template Extraction

In our previous work [12], we used 38 handmade generic query templates that can be instantiated multiple times to generate a sufficient set of question-query pairs.

Table 2. Comparison of the datasets involved.

Training set   Instances  Entities  Predicates  Knowledge Base  Formal Language
QALD-7-Train   217        216       171         DBpedia         SPARQL
LC-QuAD        5,000      5,042     615         DBpedia         SPARQL
Dbnqa          894,499    238,865   666         DBpedia         SPARQL

For Dbnqa, the extensive manual work of creating generic templates is circumvented by extracting those templates from examples. This is done by replacing the concrete entities with placeholders: we replace the resource label in the original question and the respective resource URI or literal in the target SPARQL query with a placeholder. The entities of the 215 pairs from QALD-7-Train were manually replaced by three SPARQL experts. Due to its semi-automatic creation procedure, the 5,000 LC-QuAD question-query pairs could be processed with a script, because the labels of the resources in the questions are marked; we could therefore use the labels to identify the resource in the query and replace both with a placeholder (a sketch of this replacement step is shown after Listing 1.2). The use of colloquial names complicated this procedure and sometimes caused a mismatch, so we then applied a manual review performed by three SPARQL experts. Table 3 reports the number of duplicated templates produced by converting the original queries into templates. Listing 1.2 shows the SPARQL template for the template question “What are some famous artists who rocked a <A>?”, generated from the original SPARQL query in Listing 1.1 for the LC-QuAD question “What are some famous artists who rocked a Les Paul?”.

SELECT DISTINCT ?uri WHERE {
    ?uri dbp:notableInstruments dbr:Gibson_Les_Paul .
    ?uri rdf:type dbo:MusicalArtist .
}

Listing 1.1. LC-QuAD SPARQL query for the question: “What are some famous artists who rocked a Les Paul?”.

SELECT DISTINCT ?uri WHERE {
    ?uri dbp:notableInstruments <A> .
    ?uri rdf:type dbo:MusicalArtist .
}

Listing 1.2. NSpM query template for the question: “What are some famous artists who rocked a <A>?”.
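The following Python sketch illustrates the scripted replacement step for LC-QuAD described above, under our assumptions: the function name and signature are ours and this is not the actual NSpM extraction code (which lives in the NSpM repository); it only shows the core idea of turning a concrete question-query pair into a template.

import re

def extract_template(question, query, label, uri, placeholder="<A>"):
    """Replace a marked entity label in the question and the corresponding
    resource URI (or literal) in the SPARQL query with a placeholder.
    Illustrative sketch only, not the actual NSpM extraction script."""
    # Replace the surface label in the natural-language question.
    question_tpl = re.sub(re.escape(label), placeholder, question, flags=re.IGNORECASE)
    # Replace the resource identifier in the SPARQL query.
    query_tpl = query.replace(uri, placeholder)
    return question_tpl, query_tpl

# The pair of Listings 1.1 and 1.2 would be produced roughly as follows.
# Note the colloquial name ("Les Paul" vs. dbr:Gibson_Les_Paul): such cases
# caused the mismatches that required the manual review described above.
question_tpl, query_tpl = extract_template(
    "What are some famous artists who rocked a Les Paul?",
    "SELECT DISTINCT ?uri WHERE { ?uri dbp:notableInstruments dbr:Gibson_Les_Paul . "
    "?uri rdf:type dbo:MusicalArtist . }",
    label="Les Paul",
    uri="dbr:Gibson_Les_Paul",
)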

Table 3 shows the template distribution per topic (Art, Business, Informatics, Location, Person, Science, Society, Sport, Time, and Others) in the QALD-7 training, LC-QuAD, and NSpM datasets. The biggest template subset in the NSpM dataset is Art, while Time is the smallest, comprising 24.16% and 0.12% of the total dataset, respectively.


Table 3. NSpM templates topic distribution per subset.

             QALD-7-Train      LC-QuAD           NSpM dataset
Topic        #       %         #       %         #       %
Art          49      24.5      1,199   24.1      1,248   24.2
Economy      13      6.5       463     9.3       476     9.2
IT           7       3.5       104     2.1       111     2.1
Geography    47      23.5      766     15.5      813     15.7
Person       37      18.5      207     4.2       244     4.7
Science      9       4.5       150     3.0       159     3.1
Society      15      7.5       506     10.2      521     10.1
Sport        12      6.0       627     12.6      639     12.4
Time         6       3.0       0       0         6       0.1
Others       5       2.5       944     19.0      948     18.4
Duplicate    15      8.3       34      0.7       52      1
Total        215     100       5,000   100       5,217   100

3.2 Dataset Generation

After the template extraction, we use the DBpedia knowledge base to instantiate the templates and generate the dataset, closely following our previous work [12]. Here, the generator component of the NSpM, which is responsible for the template instantiation, was fed with the templates and with access to the dataset’s SPARQL endpoint. We use the original question-query template to produce all possible resources that could be used to fill the template placeholders. For instance, for the template question “What are some famous artists who rocked a <A>?” and its corresponding SPARQL query (see Listing 1.2), we retrieve all candidate fillers for the placeholder with the rearranged query in Listing 1.3. In this case, the notable instrument dbr:Fender_Stratocaster is part of the query solution, leading to the question-query instance “What are some famous artists who rocked a Fender Stratocaster?” and its corresponding SPARQL query in Listing 1.4. For each of the 5,217 templates, we limited the number of instances to 300 examples satisfying the corresponding SPARQL graph patterns.

SELECT DISTINCT ?a WHERE {
    ?uri dbp:notableInstruments ?a .
    ?uri rdf:type dbo:MusicalArtist .
}

Listing 1.3. Querying all notable instruments of famous artists.

SELECT DISTINCT ?uri WHERE {
    ?uri dbp:notableInstruments dbr:Fender_Stratocaster .
    ?uri rdf:type dbo:MusicalArtist .
}

Listing 1.4. SPARQL query for the generated question: “What are some famous artists who rocked a Fender Stratocaster?”.
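The Python sketch below illustrates this instantiation step under our reading of the procedure: a rearranged query such as Listing 1.3 is run against the public DBpedia endpoint and the returned resources fill the placeholder in both the question and the query template. It uses the SPARQLWrapper library; the function name, the crude label derivation from the URI, and other details are ours, not the actual NSpM generator code.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia endpoint, as used in the paper

def instantiate_template(question_tpl, query_tpl, filler_query,
                         placeholder="<A>", limit=300):
    """Fill one NSpM template with up to `limit` resources that satisfy its
    graph pattern (illustrative sketch, not the NSpM generator itself)."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(filler_query)        # e.g. the rearranged query of Listing 1.3
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]

    pairs = []
    for b in bindings[:limit]:
        uri = b["a"]["value"]            # ?a holds the candidate filler, e.g. .../Fender_Stratocaster
        label = uri.rsplit("/", 1)[-1].replace("_", " ")  # crude label, for illustration only
        pairs.append((question_tpl.replace(placeholder, label),
                      query_tpl.replace(placeholder, "<" + uri + ">")))
    return pairs

Applied to the template of Listing 1.2 with the filler query of Listing 1.3, such a procedure yields pairs like the one shown in Listing 1.4.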


The runtime of 42 minutes to generate the dataset from a knowledge base of the size of DBpedia on a 4-core macOS laptop with 8 GB of RAM suggests that our method is efficient. It is worth pointing out that this runtime was influenced by network overhead, since the public DBpedia endpoint was used.

3.3 Use Case

In this use case, the Dbnqa dataset is employed as the training set in the NSpM learning phase. Table 4 shows the training specifications compared with the dataset about monuments presented in our previous work [12], where only domain-specific QA was performed. We measure the accuracy of each model using BLEU [10], a widely used measure for evaluating machine translation (MT) outputs which uses a modified precision metric to compare the MT output with the reference translation. Four models were trained with four randomized sets of templates, whose size was gradually increased. Model 1 contains only templates extracted from the QALD-7-Train dataset. Model 2 and Model 3 contain QALD-7-Train templates as well as subsets of LC-QuAD. Model 4 contains all extracted templates from QALD-7-Train and LC-QuAD. All experiments were carried out on a machine with a training setting similar to the previous work (a 64-core CPU-only Linux machine with 256 GB of RAM). We split the working dataset into three parts: training, validation, and test, fixing the size of the validation and test sets to 10% each. In all models, the training was set to 12,000 iterations. We notice that the size of the vocabulary impacts the training time: for example, Model 4 has a vocabulary that is roughly seven times bigger than Model 1's, and its training time is also nearly seven times longer. Another observation is that the size of the vocabulary has an inverse relation with the accuracy; that is, as the size of the vocabulary grows, the accuracy decreases, although not in the same proportion. In our evaluation, while the vocabulary grows linearly, the accuracy seems to decrease on a logarithmic scale. The BLEU accuracies of the four models are very close to one another. Table 4 shows the BLEU accuracy of the previous domain-specific model [12] and the accuracy of the four open-domain Dbnqa models.

Table 4. Accuracy of different excerpts of the Dbnqa dataset in comparison to Monuments [12].

                               Monuments [12]  Model 1  Model 2  Model 3   Model 4
Number of templates            38              200      407      2,233     5,165
Number of generated questions  8,544           51,292   99,274   390,444   894,499
Size of English vocabulary     2,063           28,869   40,599   84,963    134,788
Size of SPARQL vocabulary      1,769           33,730   51,399   144,720   249,395
Training time (hh:mm)          00:18           03:03    04:11    10:33     21:29
Best BLEU with test data       0.753           0.634    0.620    0.619     0.601
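For reference, BLEU scores of this kind can be computed by treating both the generated and the reference SPARQL queries as token sequences; the snippet below shows one way to do this with NLTK. It is our illustration of the metric, not the evaluation script used in the experiments.

from nltk.translate.bleu_score import corpus_bleu

# Reference and generated SPARQL queries, tokenized by whitespace.
reference = "select distinct ?uri where { ?uri dbp:notableInstruments dbr:Fender_Stratocaster . ?uri rdf:type dbo:MusicalArtist . }".split()
generated = "select distinct ?uri where { ?uri dbp:notableInstruments dbr:Fender_Stratocaster . ?uri rdf:type dbo:MusicalArtist . }".split()

# corpus_bleu expects one list of references per hypothesis.
score = corpus_bleu([[reference]], [generated])
print(score)  # 1.0 for a perfect match; lower as the generated query diverges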

4 Availability & Sustainability

The Dbnqa dataset is available on FigShare at https://figshare.com/projects/NSpM_Dataset/30821.⁴ The dataset is licensed under Creative Commons Attribution 4.0 (CC BY 4.0)⁵ and can be used to train other machine-learning models as well as to evaluate existing QA systems. The NSpM project is publicly available in the GitHub repository https://github.com/AKSW/NSpM and can also be used for the autonomous generation of new datasets. The community is therefore encouraged to engage in evaluating and extending the dataset and the approach. The NSpM research project also has a mailing list⁶ where users can participate and engage in the discussion. The NSpM project is being carried out by the AKSW research group⁷ and, to the best of our knowledge, it is the first research project of its kind within the Semantic Web community. The project itself is vital for many knowledge bases because it investigates new approaches to enable lay users to access the information underlying the RDF data. Furthermore, we are also working with large communities such as the DBpedia Association. To that extent, the project has also been included as one of DBpedia’s Google Summer of Code (GSoC) 2018 projects.

5 Conclusion & Future Work

In this work, we present an open-domain dataset for question answering over the DBpedia knowledge base. It is the largest question-to-SPARQL dataset, with more than eight hundred thousand entries, built from 5,165 question-query templates extracted from the LC-QuAD and QALD-7-Train datasets. It is over 170 times larger than the second-largest dataset, LC-QuAD, and contains a more comprehensive set of questions dealing with complex SPARQL operations such as solution sequences and modifiers.⁸ Additionally, we provide an efficient approach that allows users to generate customized training datasets from these templates, and we report the successful adoption of Dbnqa in an experimental phase by providing results in terms of BLEU accuracy on four randomly generated subsets of different sizes. As future work, we plan to continually evolve the dataset, taking a closer look at its quality and coverage. We plan to train Neural SPARQL Machine models with different setups on well-known benchmarks so as to evaluate the quality of our templates as well as of the neural translation. In particular, we want to investigate what affects the BLEU accuracy over big vocabularies. We see this work as the first step towards a highly accurate and multi-domain collection of Neural SPARQL Machine templates for question answering over the DBpedia knowledge base.

⁴ We will add a permanent URL in the final version of the paper.
⁵ https://creativecommons.org/licenses/by/4.0/
⁶ https://groups.google.com/forum/#!forum/neural-sparql-machines
⁷ http://aksw.org
⁸ https://www.w3.org/TR/rdf-sparql-query/#solutionModifiers

6 Acknowledgments

This work was partly supported by a grant from the German Research Foundation (DFG) for the project Professorial Career Patterns of the Early Modern History: Development of a scientific method for research on online available and distributed research databases of academic history, under grant agreement No. GL 225/9-1, and by CNPq under the Ciências Sem Fronteiras programme, process 200527/2012-6.

References

1. J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, 2013.
2. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
3. A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676, 2014.
4. A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
5. Q. Cai and A. Yates. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 423–433, 2013.
6. M. Dragoni, M. Solanki, and E. Blomqvist. Semantic Web Challenges: 4th SemWebEval Challenge at ESWC 2017, Portoroz, Slovenia, May 28-June 1, 2017, Revised Selected Papers, volume 769. Springer, 2017.
7. M. Dubey, D. Banerjee, D. Chaudhuri, and J. Lehmann. EARL: Joint entity and relation linking for question answering over knowledge graphs. arXiv preprint arXiv:1801.03825, 2018.
8. D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer. Neural network-based question answering over knowledge graphs on word and character level. In WWW, pages 1211–1220. International World Wide Web Conferences Steering Committee, 2017.
9. P. N. Mendes, M. Jakob, and C. Bizer. DBpedia: A multilingual cross-domain knowledge base. In LREC, pages 1813–1817. Citeseer, 2012.
10. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
11. D. Sorokin and I. Gurevych. End-to-end representation learning for question answering with weak supervision. In Semantic Web Evaluation Challenge, pages 70–83. Springer, 2017.
12. T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves, and C. B. Neto. SPARQL as a foreign language. SEMANTiCS 2017 Posters and Demos, abs/1708.07624, 2017.
13. P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann. LC-QuAD: A corpus for complex question answering over knowledge graphs. In International Semantic Web Conference, pages 210–218. Springer, 2017.
14. H. Zafar, G. Napolitano, and J. Lehmann. Formal query generation for question answering over knowledge bases. ESWC, 2018.