Using machine learning to disentangle homonyms in large text corpora

Uri Roll1,2, Ricardo A. Correia2,3,4, and Oded Berger-Tal1

1 Mitrani Department of Desert Ecology, The Jacob Blaustein Institutes for Desert Research, Ben-Gurion University of the Negev, Midreshet Ben-Gurion 8499000, Israel.

2 School of Geography and the Environment, University of Oxford, Oxford, OX1 3QY, UK.

3 Institute of Biological Sciences and Health, Federal University of Alagoas, Campus A. C. Simões, Av. Lourival Melo Mota, s/n, Tabuleiro dos Martins, Maceió, AL, Brazil.

4 DBIO & CESAM – Centre for Environmental and Marine Studies, University of Aveiro, Aveiro, Portugal.

Running head: Disentangling homonyms

Keywords: Automated content analysis, Big Data, Homographs, Neural Networks, Reintroductions, Systematic Reviews, Text mining

Abstract

Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and identifying specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews. One such challenge is the presence of homonyms – terms that share spelling but differ in meaning. Homonyms add considerable noise to search results, but they cannot be easily identified
and removed. In this work, we developed a semi-automated approach that can aid in the classification of homonyms between narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them into distinct topics. As an example, we explored the use of the word 'reintroduction' in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms into their former native habitat; however, a Web of Science search using this word returned thousands of publications that use the term with other meanings and in other contexts. Using our method, we were able to automatically classify a sample of 3000 of these publications with more than 99% accuracy when compared with a manual classification. Our approach can easily be applied to other homonym terms and can greatly facilitate systematic reviews, or any similar case in which homonyms hinder the harnessing of large text corpora. Beyond homonyms, we see great promise in the combination of automated content analysis and machine learning methods for handling and screening big data for relevant information in conservation science.

Introduction

Recent years have seen a sharp increase in the quantity of texts, both scientific and non-scientific, available in digital archives (Hey & Trefethen 2003; Kennan et al. 2012). This is a result of several trends, including the shift to provide digital access to most 'classical' scientific journals, the emergence of many new information outlets – some of which appear only in digital format – and the digital scanning of older articles, books, and other scientific outputs (Connaway 2003; Raschke 2003; Kasemsap 2016). While recognized for half a century now (Margolis 1967; London 1968), the growing influx of indexable, sub-settable, and easily accessible text corpora from many sources and eras brings about much promise for improved and novel scientific endeavors (Philip Chen & Zhang 2014).


The need to sort through the ever-increasing volumes of scientific data, distill useful information, and make it available to conservation scientists and policy makers has made systematic reviews an important tool for environmental decision-making. In systematic reviews, the findings of individual studies pertaining to a particular topic or question are identified, evaluated, and summarized (University of York Centre for Reviews Dissemination 2009). These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a topic and identifying specifically where and under which conditions an effect is present. Furthermore, systematic reviews and meta-analyses are gaining greater importance in influencing policy makers and clinical opinion worldwide (Tacconelli 2010). Over the past four decades systematic reviews have become paramount in medical research and decision making (Gough et al. 2017). Recent years have also seen the adoption of this approach in the environmental sciences (Bilotta et al. 2014). Systematic reviews, together with meta-analyses, have become the 'gold standard' for providing unbiased summarized evidence to support conservation action, and are quickly becoming the 'go-to tool' for evidence-based conservation (Dicks et al. 2014; Cook et al. 2017).

While holding much promise for improving science, the scientific publication data deluge is not without its problems (Siebert et al. 2015). Keeping up to date with new outlets, publications, and constantly expanding fields is becoming an increasing challenge (Lawrence & Giles 1999; Tejeda-Lorente et al. 2014). As academic corpora grow rapidly in size and breadth (Ferreira et al. 2016), our ability to access them methodically becomes ever more dependent on various automated algorithms and software (Hersh 2009). Furthermore, new forms of publicizing science, such as blogs and personal data repositories (e.g., Figshare, Dryad, or Zenodo), make this an even more arduous task. This influx also brings about the need to combine results from several sources to improve search performance and the inclusiveness of results (Lawrence & Giles 1998). The data deluge problem has been raised specifically with respect to systematic reviews (Bastian et al. 2010; Lefebvre et al. 2013). Beyond the constant influx of new publications, there are other technical issues with
accessing, sorting, and analyzing the ever-expanding data. In a recent publication in Conservation Biology, Westgate and Lindenmayer (2017) pinpoint several important technical challenges in conducting systematic reviews, specifically within the conservation realm. One of the problems that poses a hindrance to collecting and assessing data from the scientific literature is distinguishing between homonyms. Homonyms are defined as "The same name or word used to denote different things" (Oxford English Dictionary 2017a) and are a common feature of most languages (Kulkarni et al. 2008). For example, the word 'orange' refers to both a color and a fruit. With respect to automated searches of terms, we are specifically interested in homographs – i.e., words that share their spelling but are of different origin and meaning (Oxford English Dictionary 2017a) – irrespective of their pronunciation. Westgate and Lindenmayer (2017) highlight that "homonyms have the effect of adding irrelevant hits to search results because redundant meanings are provided by the search engine (leading to low specificity)", and go on to state that "Homonyms cannot be readily excluded from keyword-based searches". In general, homonyms present a unique data-mining challenge of separating signal from noise, as context cannot be derived from the words alone (Rahm & Bernstein 2001; Tzanis 2014). Problems of disentangling homonyms and similar issues have been raised in other attempts to analyze and mine large text corpora in various fields, such as disambiguation in the medical literature (Krauthammer & Nenadic 2004; Schuemie et al. 2005), patent retrieval (Raffo & Lhuillery 2009), classification of deep web sources (Xu et al. 2007), and conservation (Ladle et al. 2016; Roll et al. 2016; Correia et al. 2017).

Here, we aim to provide a semi-automated method to accurately distinguish between results pertaining to distinct scientific fields returned by a single homonym search term. Our approach classifies the outputs of a scientific database using automated text analysis combined with supervised machine learning algorithms. As an example, we explore the use of the word 'reintroduction' in academic texts extracted from Thomson's (now Clarivate Analytics) Web of Science searches.


Methods

Reintroduction is used within the conservation context to indicate the intentional movement and release of organisms to their former native habitat (IUCN/SSC 2013). However, it is also commonly used in the more traditional sense of the word: bringing a material or a concept into existence or effect again. Thus, scientists may reintroduce a chemical, a disease vector, or even an idea into a system, in a manner that is fundamentally different from, and unrelated to, the above conservation narrative. We conducted a 'Topic' search using the terms 'reintroduc*' or 're-introduc*' on the 31st of May 2016 across the following databases within Web of Science: Core Collection, Current Contents, Derwent Innovations, KCI – Korean Journal Database, Medline, Russian Citation Index, SciELO, Data Citation Index, BIOSIS, Inspec, and Zoological Record.

We tested our method by exploring a sample of the 3000 most recent references out of all the results obtained. One of us (OBT) manually classified the references as pertaining to or unrelated to the conservation narrative using their titles, and abstracts where necessary. This procedure took about 3 hours, with approximately two-thirds of the papers being unrelated to conservation. The results of this manual classification were then used to train our automated classification and, via a cross-validation procedure, to obtain naïve classifications from the algorithm for the test cases.

In our procedure, we initially removed stop words (such as 'the', 'if', 'is', etc.), punctuation symbols, and numbers from all of our text corpora. We then extracted the title and keyword texts from each paper, pooled these together, and removed sparse terms (i.e., infrequent terms). This left us with the 22 most common terms in the dataset, for which we tallied the frequency in each paper (see the supplementary material for the list of terms, as well as the sparsity parameters used to retain the most common terms, and the underlying code).
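The following is a minimal sketch of this pre-processing step using the 'tm' R package (Feinerer et al. 2008); it is not the authors' supplementary code. The data frame `papers` and its `title` and `keywords` columns are hypothetical placeholders, and the sparsity value of 0.97 is the one reported for titles + keywords in the sensitivity analysis, not necessarily the value used in the main analysis.

```r
# Minimal sketch of the pre-processing described above (not the authors' supplementary
# code). 'papers' is a hypothetical data frame with one row per Web of Science record
# and columns 'title' and 'keywords'.
library(tm)

corp <- VCorpus(VectorSource(paste(papers$title, papers$keywords)))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))  # drop 'the', 'if', 'is', ...
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)

dtm <- DocumentTermMatrix(corp)        # per-paper term counts
dtm <- removeSparseTerms(dtm, 0.97)    # keep only the most frequent terms (placeholder sparsity)

title_kw_freqs <- as.data.frame(as.matrix(dtm))   # one column per retained term
```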

We repeated this procedure for words found in the abstracts of each paper, extracting the frequency of the 12 most popular words found in these (see Table S1 for the list of terms). To these, we added the 19 most popular terms found in the names of the journals where each publication appeared (Table S1). Put together, this gave us the occurrence frequencies of relevant terms from three sources – titles and keywords, abstracts, and journal names – readily available from each reference. We further added, as a categorical predictor, the 'Main Category' of the journal as designated by Thomson's InCites Essential Science Indicators Master Journal List (available at http://ipsciencehelp.thomsonreuters.com/incitesLiveESI/ESIGroup/overviewESI/esiJournalsList.html). All text mining was conducted using the 'tm' R package (Feinerer et al. 2008; Feinerer & Hornik 2015). We provide the code for all of our analyses in the supplement.

We distinguished between conservation-related and unrelated 'reintroduction' papers using an artificial neural network classification procedure. Artificial neural networks are a flexible and powerful tool that has been growing in popularity across various fields (Nielsen 2015). Among their advantages are the ability to model non-linear relationships, no need for assumptions regarding the distributions of the variables, and accommodation of variable interactions without prior specification (Olden et al. 2008). For our analysis we used feed-forward neural networks as implemented in the 'nnet' function of the 'nnet' R package (Venables & Ripley 2002), with ten units in one hidden layer and a maximum of 1000 iterations. We ran our analysis within a 10-fold cross-validation procedure, at each iteration constructing the model on 90% of the data as a training set and using it to predict the remaining 10% as test cases. For each iteration we recorded the error rate, calculated as the percentage, out of all cases, of automated classifications – either related or unrelated – that were in disagreement with our manual classification. We then summed these error rates across all 10 iterations to obtain the overall mean absolute prediction error. Fig. 1 provides a workflow chart of the stages in our semi-automated approach to classifying these records, as detailed above.
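As a rough illustration of this classification step (a sketch under stated assumptions, not the authors' supplementary code), the snippet below uses the 'nnet' formula interface within a 10-fold cross-validation loop. `model_data` is a hypothetical data frame combining the term frequencies, the 'Main Category' factor, and a factor `conservation` holding the manual labels; the per-fold error rates are averaged here, which for equal-sized folds matches the summing of per-fold percentages described above.

```r
# Sketch of the neural-network classification with 10-fold cross-validation.
# 'model_data' is a hypothetical data frame of predictors plus a factor 'conservation'
# holding the manual labels ("related" / "unrelated").
library(nnet)

set.seed(42)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(model_data)))
fold_error <- numeric(k)

for (i in 1:k) {
  train <- model_data[folds != i, ]
  test  <- model_data[folds == i, ]

  fit <- nnet(conservation ~ ., data = train,
              size = 10,      # ten units in one hidden layer
              maxit = 1000,   # maximum of 1000 iterations
              trace = FALSE)

  pred <- predict(fit, newdata = test, type = "class")
  fold_error[i] <- mean(pred != test$conservation)   # misclassification rate in this fold
}

mean(fold_error)   # overall cross-validated prediction error
```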


We conducted a sensitivity analysis to test the effect of the different modelling parameters on the error rate. Initially we tested the effect of decreasing the number of cases on the error rate, when compared to the manual classification. As several sources and outlets are not indexed in ISI for 'Main Category' assignments, we also repeated this procedure without this information to test its effect on classification. We further conducted these procedures twice with different sparsity parameters for our three general sources of information (titles + keywords, abstracts, journal names), to test the effect of the overall number of modelling terms on the error rate. In the first run we set the sparsity values at 0.97, 0.85, and 0.97, giving us 20, 30, and 11 different terms for the titles + keywords, abstracts, and journal names, respectively. In the second run we set the sparsity values at 0.95, 0.75, and 0.95, giving us 6, 9, and 4 different terms, respectively. For all tests we recorded the error rate from a 10-fold cross-validation procedure. For each set of parameters, with or without 'Main Category' and with a different number of cases, we sub-sampled our 3000-case dataset five different times and calculated the mean and standard deviation of the error rates from these sub-samples.
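A compact sketch of this sensitivity analysis is given below. It assumes a hypothetical helper `cv_error()` that wraps the cross-validation loop from the previous sketch and returns the mean error rate for a given data frame; the intermediate sample sizes are illustrative, as the text only reports results between 250 and 3000 cases.

```r
# Sketch of the sensitivity analysis: sub-sample the manually classified records five
# times at each sample size and record the mean and SD of the cross-validated error.
# 'cv_error()' is a hypothetical wrapper around the 10-fold CV loop sketched above;
# 'model_data' is the full data set of 3000 manually classified records.
case_counts <- c(250, 500, 1000, 2000, 3000)   # illustrative; endpoints follow the text
n_repeats   <- 5

runs <- expand.grid(n = case_counts, rep = seq_len(n_repeats))
runs$error <- NA_real_

for (j in seq_len(nrow(runs))) {
  d <- model_data[sample(nrow(model_data), runs$n[j]), ]
  runs$error[j] <- cv_error(d)   # repeat with the 'Main Category' column dropped
                                 # to mimic the "without Main Category" runs
}

aggregate(error ~ n, data = runs,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```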

Results

Our Web of Science searches yielded a total of 26,259 unique candidate references containing the term 'Reintroduc*' or 'Re-introduc*'. From a brief inspection of the titles of some of these results, it was clear that while many referred to this term within its conservation narrative, a large number did not. In Fig. 2a we show a commonality word cloud (of the titles and keywords) of the 3000 most recent articles from the above search of 'reintroduction' papers. Fig. 2b depicts a comparison word cloud (of the same subset) after these results have been manually separated between their conservation
and non-conservation-related narratives. These images clearly show that there is a division in the narratives of the papers returned by a Web of Science 'Topic' search for 'reintroduction'.

Our first implementation of the classification algorithm returned a classification error rate of 1.6%. However, specifically exploring the cases where there were mismatches between the manual and machine learning classifications highlighted 30 papers that had actually been misclassified in the initial manual classification. We proceeded to reanalyze the corrected data using the same method, which brought our overall error rate down to under 1%. Once such a highly accurate model was obtained, it would have been trivial and very quick to use it to classify the approximately 22,000 additional articles that were initially obtained, providing a reliable, objective, and quick solution to the homonym problem.

Our sensitivity analysis showed that both the number of cases and the number of modelling terms have an effect on the error rate (Fig. 3). As fewer cases are sampled, the error rate increases from less than 1% for 3000 cases to about 12% for 250 cases when many modelling terms are used, or from about 4% to 16% when fewer modelling terms are incorporated. Running the classification without the 'Main Category' information further increases the error rate by 1–2% when many terms are used and by about 4% when fewer terms are used (Fig. 3).
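Word clouds like those in Fig. 2 can be generated in R; the brief sketch below (an assumption on our part, not the authors' code) uses the 'wordcloud' package on the cleaned corpus from the pre-processing sketch, with `labels` a hypothetical vector of the manual classifications.

```r
# Sketch of Fig. 2-style word clouds: pool term counts within each manually labelled
# group and pass the two-column matrix to the 'wordcloud' functions.
# 'corp' is the cleaned tm corpus from the earlier sketch; 'labels' is a hypothetical
# character vector with one entry per paper ("conservation" or "other").
library(tm)
library(wordcloud)

tdm <- as.matrix(TermDocumentMatrix(corp))

grouped <- cbind(conservation = rowSums(tdm[, labels == "conservation"]),
                 other        = rowSums(tdm[, labels == "other"]))

commonality.cloud(grouped, max.words = 100)   # terms shared by both narratives (Fig. 2a)
comparison.cloud(grouped, max.words = 100)    # terms characteristic of each narrative (Fig. 2b)
```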

Discussion

In this age of big data, obtaining large text corpora is becoming rapid and convenient (Bollier & Firestone 2010; Tzanis 2014). Nevertheless, this flurry of data comes at a price: the need to sift through it and distinguish relevant from irrelevant information (Hersh 2009). Homonyms exemplify a particular conundrum within this broad topic, as a priori there is no inherent information that can
enable distinction between the different meanings of a term. Nevertheless, we show here that a combination of manual classification of a sample of the data together with text mining and a commonplace machine learning algorithm can provide an accurate and quick homonym classification procedure for a large dataset of references addressing a particular theme, topic, or question. The approach we present can be further used for other, similar classification problems involving large text corpora.

However, while holding much promise, our approach is not without its issues. First, it relies on structured data for each article, as can be obtained from Web of Science but is not necessarily easily obtainable from other article repositories (e.g., Google Scholar). Structured data in this context refers to data that have been organized into a formatted repository, typically a database, so that their elements can be seamlessly and readily searched by simple, straightforward search-engine algorithms or other search operations. When using unstructured data, either more pre-processing of the data is needed or, alternatively, more elaborate text mining terms should be employed. Furthermore, both our automated content analysis and machine learning models required some fine-tuning to achieve the best results. For example, we needed to determine the values of the sparsity parameters that govern how many terms will be used from each source, as these have an effect on classification success (Fig. 3). Moreover, inspecting the actual terms obtained from the text mining approach and using only those that are potentially good at distinguishing between the homonyms, rather than general ones, can aid and speed up the classification process. Our approach also depends on manually classifying a large and representative sample of the articles in question. When obtaining and manually classifying such a sample is problematic, other automated classification approaches may be preferable (see also below). Generally, the manual classification sample needs to contain a good representation of relevant and irrelevant items to enable accurate automatic classification.


We also wanted to explore our classification approach's efficacy in classifying different types of texts. However, the sample of 3000 references we analyzed comprised predominantly journal articles, with a few conference proceedings that did not show any difference in classification success. We therefore explored the model's ability to correctly classify the full dataset of records, which included different types of texts such as books, non-indexed journals, and scientific repository items. We indeed found that our error rate increased for these types of records, predominantly when they lacked some of the sections we used in the text-mining procedure – i.e., they had no abstract or keywords, or had no 'Main Category' assignment through ISI. Once we, for example, added such 'Main Category' information manually to the records, their classification error rate decreased greatly. This was also evident from our sensitivity analysis conducted on the 3000 manually classified records (Fig. 3). This result highlights the great benefit that structured data have over unstructured data for automated classification. It also suggests that, to achieve the best results when replicating our approach, it is advisable to first gain some familiarity with the nature of the data and its metadata, and with the effect of the modelling parameters on the results (Fig. 3).

The use of automated text or content analysis is increasing across an array of scientific fields to illuminate different questions (Cohen & Hersh 2005; Feldman & Sanger 2007). For example, it has been used in several works aiming to aid decision making for policy in fields such as emergency management and financial regulation (Ngai & Lee 2016), to understand social media trends and usage (He et al. 2013; Mostafa 2013), and in many commercial applications (He et al. 2013; Mostafa 2013; Khadjeh Nassirtoussi et al. 2014). Silva et al. (2016) use such methods, together with graph theory analyses, to enable mapping and visualization of scientific fields. These approaches could also be of much use for conservation, for example in identifying knowledge and research gaps (Fisher et al. 2010; dos Santos et al. 2015; Westgate et al. 2015). The use of text mining as a tool to aid systematic reviews is becoming more common (Thomas et al. 2011; Tsafnat et al. 2014;
O'Mara-Eves et al. 2015), and will be of even greater importance in coming years as databases become more structured (Lefebvre et al. 2013). In their review of text mining tools in systematic reviews, O'Mara-Eves et al. (2015) state that "The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven". We hope that our contribution lends some support to the feasibility of semi-automated approaches (incorporating text mining and machine learning) for sifting through large corpora and removing unrelated items. Similar approaches pertaining to other facets of systematic reviews could potentially also benefit from automated stages of data collection, curation, and validation (Nunez-Mir et al. 2016). We suggest that some of these methods could, in the future, even be incorporated into database search interfaces to make them much more effective. Dictionaries of homonym terms, and their respective narratives, could be compiled, and when a homonym term is searched, the search engine could ask which particular narrative is intended and return only results within it.

While we tested our approach on a single term – 'reintroduction' – that has both conservation-relevant and irrelevant narratives, homonyms are omnipresent in our languages and they inflate automated search results with noise. Some examples from the conservation narrative include, first and foremost, the term 'conservation' itself and its derivatives, which are also used heavily in genetic and evolutionary research (as in 'conserved elements in the genome') and in other narratives without any relevance to conservation biology. 'Nature' is another opaque term with many different meanings in different contexts (Oxford English Dictionary 2017b). Beyond these broad terms relating to conservation biology, there are many other, more specific phrases or terms that are homonyms: for example, terms pertaining to invasion biology such as 'invasion' itself, but also 'alien', 'exotic', and 'non-indigenous'. 'Restoration' is also used with respect to art and artifacts, 'threat' and 'disturbance' have many psychological and other connotations, and so on. Within the framework of systematic reviews, we suggest that an initial exploration of search terms for the existence of
homonyms should be pursued. If these occur – i.e., if some of the returned results are unrelated outputs – then our approach can be applied to separate the wheat from the chaff.

Our approach could be further expanded to automatically explore whether homonyms occur in different text corpora. Our observation that the automated approach outperformed the manual classification in some instances (see above) gave us confidence in our method and provided support that fully automated approaches could be pursued. For fully automated classification we could potentially again use the most common terms in titles, keywords, abstracts, and journal names, as well as journal categories, as inputs for automated clustering algorithms. These could in turn show us whether our data cluster into one group (no homonyms) or two or more clusters (homonyms possible). Methodologically, this entails changing our modelling scheme from the supervised machine learning approach we employed here to unsupervised methods. The classification method we employed – artificial neural networks – can also be used in unsupervised learning, for example through self-organizing maps. This extension of our methods is therefore relatively straightforward and could be easily applied, for example in R (e.g., Wehrens & Buydens 2007; Yan 2016; see the sketch below). Such an automated approach could be particularly beneficial when exploring many potential homonym terms in unison, when it is difficult to obtain representative test cases for manual classification, or for checking whether terms cluster together when combining datasets from different sources or even across languages.

The analysis of big data presents novel challenges, which in turn promote the development of new tools and methods to overcome them. Westgate and Lindenmayer (2017) specifically summarize difficulties in our ability to conduct systematic reviews for conservation biology, but the issues they raise are shared by many other fields. We hope that our proposed framework (Fig. 1) will prove useful for separating related from unrelated content for an array of questions and topics.
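As referenced above, the sketch below illustrates one possible form of this unsupervised extension using the kohonen package (Wehrens & Buydens 2007); it is an untested outline rather than a worked analysis, and `term_freqs` stands in for a numeric matrix of per-document term frequencies such as the one built in the Methods sketches.

```r
# Untested sketch of the unsupervised extension: fit a small self-organising map to the
# term-frequency matrix and inspect how documents spread over the map; a crude two-group
# cut of a hierarchical clustering gives a first hint of whether two narratives exist.
# 'term_freqs' is a hypothetical numeric matrix (documents in rows, terms in columns).
library(kohonen)

set.seed(1)
X <- scale(as.matrix(term_freqs))

som_fit <- som(X, grid = somgrid(4, 4, "hexagonal"))   # 16 map units
table(som_fit$unit.classif)    # documents per unit; two narratives tend to occupy
                               # separate regions of the map

grp <- cutree(hclust(dist(X)), k = 2)
table(grp)                     # a strongly unbalanced split hints at a single narrative
```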


Acknowledgments: UR is supported by the Kreitman Post-doctoral Fellowship at the Ben-Gurion University of the Negev and the Shamir fellowship of the Israeli Ministry of Science. RAC is supported by a post-doctoral grant from Fundação para a Ciência e Tecnologia (SFRH/BPD/118635/2016). This is publication number XXX of the Mitrani Department of Desert Ecology.

Literature Cited

Bastian, H., P. Glasziou, and I. Chalmers. 2010. Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up? PLoS Medicine 7:e1000326.
Bilotta, G. S., A. M. Milner, and I. Boyd. 2014. On the use of systematic reviews to inform environmental policies. Environmental Science & Policy 42:67-77.
Bollier, D., and C. M. Firestone. 2010. The Promise and Peril of Big Data. Aspen Institute, Communications and Society Program, Washington, DC.
Cohen, A. M., and W. R. Hersh. 2005. A survey of current work in biomedical text mining. Briefings in Bioinformatics 6:57-71.
Connaway, L. S. 2003. Electronic books (e-Books): Current trends and future directions. DESIDOC Journal of Library & Information Technology 23:13-18.
Cook, C. N., A. S. Pullin, W. J. Sutherland, G. B. Stewart, and L. R. Carrasco. 2017. Considering cost alongside the effectiveness of management in evidence-based conservation: A systematic reporting protocol. Biological Conservation 209:508-516.
Correia, R. A., P. Jepson, A. C. M. Malhado, and R. J. Ladle. 2017. Internet scientific name frequency as an indicator of cultural salience of biodiversity. Ecological Indicators 78:549-555.
Dicks, L. V., J. C. Walsh, and W. J. Sutherland. 2014. Organising evidence for environmental management decisions: a '4S' hierarchy. Trends in Ecology & Evolution 29:607-613.


dos Santos, J. G., A. C. Malhado, R. J. Ladle, R. A. Correia, and M. H. Costa. 2015. Geographic trends and information deficits in Amazonian conservation research. Biodiversity and Conservation 24:2853-2863.
Feinerer, I., and K. Hornik. 2015. tm: Text Mining Package. R package version 0.6-2.
Feinerer, I., K. Hornik, and D. Meyer. 2008. Text Mining Infrastructure in R. Journal of Statistical Software 25:1-54.
Feldman, R., and J. Sanger. 2007. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge.
Ferreira, C., G. Bastille-Rousseau, A. M. Bennett, E. H. Ellington, C. Terwissen, C. Austin, A. Borlestean, M. R. Boudreau, K. Chan, and A. Forsythe. 2016. The evolution of peer review as a basis for scientific publication: directional selection towards a robust discipline? Biological Reviews 91:597-610.
Fisher, R., B. T. Radford, N. Knowlton, R. E. Brainard, F. B. Michaelis, and M. J. Caley. 2010. Global mismatch between research effort and conservation needs of tropical coral reefs. Conservation Letters 4:64-72.
Gough, D., S. Oliver, and J. Thomas. 2017. An introduction to systematic reviews. Sage Publications Ltd.
He, W., S. Zha, and L. Li. 2013. Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management 33:464-472.
Hersh, W. 2009. Information Retrieval: A Health and Biomedical Perspective, Third Edition. Springer Science, New York, USA.
Hey, A. J. G., and A. E. Trefethen. 2003. The Data Deluge: An e-Science Perspective. Pages 809-824 in F. Berman, G. C. Fox, and A. J. G. Hey, editors. Grid Computing - Making the Global Infrastructure a Reality. Wiley and Sons.


IUCN/SSC. 2013. Guidelines for Reintroductions and Other Conservation Translocations. Version 1.0. IUCN Species Survival Commission, Gland, Switzerland.
Kasemsap, K. 2016. Mastering Digital Libraries in the Digital Age. Pages 275-305 in E-Discovery Tools and Applications in Modern Libraries. IGI Global.
Kennan, M. A., K. Williamson, and G. Johanson. 2012. Wild Data: Collaborative E-Research and University Libraries. Australian Academic & Research Libraries 43:56-79.
Khadjeh Nassirtoussi, A., S. Aghabozorgi, T. Ying Wah, and D. C. L. Ngo. 2014. Text mining for market prediction: A systematic review. Expert Systems with Applications 41:7653-7670.
Krauthammer, M., and G. Nenadic. 2004. Term identification in the biomedical literature. Journal of Biomedical Informatics 37:512-526.
Kulkarni, A., M. Heilman, M. Eskenazi, and J. Callan. 2008. Word Sense Disambiguation for Vocabulary Learning. Pages 500-509 in B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors. Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg.
Ladle, R. J., R. A. Correia, Y. Do, G.-J. Joo, A. Malhado, R. Proulx, J.-M. Roberge, and P. Jepson. 2016. Conservation culturomics. Frontiers in Ecology and the Environment 14:269-275.
Lawrence, S., and C. L. Giles. 1998. Searching the World Wide Web. Science 280:98-100.
Lawrence, S., and C. L. Giles. 1999. Searching the web: General and scientific information access. Pages 18-31 in Proceedings of the First IEEE/Popov Workshop on Internet Technologies and Services. IEEE.
Lefebvre, C., J. Glanville, L. S. Wieland, B. Coles, and A. L. Weightman. 2013. Methodological developments in searching for studies for systematic reviews: past, present and future? Systematic Reviews 2:78.
London, G. 1968. The publication inflation. American Documentation 19:137-141.
Margolis, J. 1967. Citation Indexing and Evaluation of Scientific Papers. Science 155:1213-1219.


Mostafa, M. M. 2013. More than words: Social networks' text mining for consumer brand sentiments. Expert Systems with Applications 40:4241-4251.
Ngai, E., and P. Lee. 2016. A Review of the literature on Applications of Text Mining in Policy Making. PACIS 2016 Proceedings 343.
Nielsen, M. A. 2015. Neural networks and deep learning. Determination Press.
Nunez-Mir, G. C., B. V. Iannone, B. C. Pijanowski, N. Kong, and S. Fei. 2016. Automated content analysis: addressing the big literature challenge in ecology and evolution. Methods in Ecology and Evolution 7:1262-1272.
O'Mara-Eves, A., J. Thomas, J. McNaught, M. Miwa, and S. Ananiadou. 2015. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews 4:5.
Olden, J. D., J. J. Lawler, and N. L. Poff. 2008. Machine learning methods without tears: a primer for ecologists. The Quarterly Review of Biology 83:171-193.
Oxford English Dictionary. 2017a. "homonym, n.". Oxford University Press, Oxford.
Oxford English Dictionary. 2017b. "nature, n.". Oxford University Press, Oxford.
Philip Chen, C. L., and C.-Y. Zhang. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275:314-347.
Raffo, J., and S. Lhuillery. 2009. How to play the 'Names Game': Patent retrieval comparing different heuristics. Research Policy 38:1617-1627.
Rahm, E., and P. A. Bernstein. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10:334-350.
Raschke, C. A. 2003. The digital revolution and the coming of the postmodern university. Routledge.
Roll, U., J. C. Mittermeier, G. I. Diaz, M. Novosolov, A. Feldman, Y. Itescu, S. Meiri, and R. Grenyer. 2016. Using Wikipedia page views to explore the cultural importance of global reptiles. Biological Conservation 204A:42-50.


Schuemie, M. J., J. A. Kors, and B. Mons. 2005. Word Sense Disambiguation in the Biomedical Domain: An Overview. Journal of Computational Biology 12:554-565.
Siebert, S., L. M. Machesky, and R. H. Insall. 2015. Overflow in science and its implications for trust. eLife 4:e10825.
Silva, F. N., D. R. Amancio, M. Bardosova, L. d. F. Costa, and O. N. Oliveira Jr. 2016. Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics 10:487-502.
Tacconelli, E. 2010. Book: Systematic reviews: CRD's guidance for undertaking reviews in health care. The Lancet Infectious Diseases 10:226.
Tejeda-Lorente, Á., C. Porcel, E. Peis, R. Sanz, and E. Herrera-Viedma. 2014. A quality based recommender system to disseminate information in a university digital library. Information Sciences 261:52-69.
Tzanis, G. 2014. Biological and Medical Big Data Mining. International Journal of Knowledge Discovery in Bioinformatics (IJKDB) 4:42-56.
University of York Centre for Reviews Dissemination. 2009. Systematic reviews: CRD's guidance for undertaking reviews in health care. University of York, Centre for Reviews & Dissemination.
Venables, W. N., and B. D. Ripley. 2002. Modern applied statistics with S, 4th edition. Springer, New York.
Wehrens, R., and L. M. C. Buydens. 2007. Self- and Super-organising Maps in R: the kohonen package. Journal of Statistical Software 21.
Westgate, M. J., P. S. Barton, J. C. Pierson, and D. B. Lindenmayer. 2015. Text analysis tools for identification of emerging topics and research gaps in conservation science. Conservation Biology 29:1606-1614.
Westgate, M. J., and D. B. Lindenmayer. 2017. The difficulties of systematic reviews. Conservation Biology, published online.


Xu, H., C. Zhang, X. Hao, and Y. Hu. 2007. A Machine Learning Approach Classification of Deep Web Sources. Pages 561-565 in Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007).
Yan, J. 2016. som: Self-Organizing Map. R package version 0.3-5.1.

Figure captions

Figure 1: Workflow chart of the stages of a semi-automated approach to classify homonyms between narratives. It highlights both the text mining procedures and the machine learning modelling stages, and depicts the different components needed to construct and test a model (based on a subset of the data) that will eventually be used to classify the entire dataset.


Figure 2: Word clouds of the titles and keywords of a sample of 3000 papers dealing with reintroductions. a- Commonality cloud highlighting the terms shared between conservation-related and unrelated references. b- Comparison cloud highlighting the main terms for either related or unrelated references.


Figure 3: Sensitivity analysis of the error rate under different classification modelling parameters. a- Sparsity values set at 0.97, 0.85, and 0.97, giving 20, 30, and 11 different terms for the titles + keywords, abstracts, and journal names, respectively. b- Sparsity values set at 0.95, 0.75, and 0.95, giving 6, 9, and 4 different terms for the titles + keywords, abstracts, and journal names, respectively. Continuous lines depict the mean error rate when the 'Main Category' information is used for classification and dashed lines the mean error rate without it (see text for full details). Lines depict mean error rates across different sub-samples for each parameter combination. Grey regions depict the standard deviations of the error rates across the different sub-sample runs (light grey with 'Main Category' information and dark grey without it).
