Savvy searching Citation searching

Péter Jacsó

The author Péter Jacsó is a Professor at the University of Hawaii, Manoa, Hawaii, USA.

Keywords Text retrieval, Search engines, Electronic journals, Databases

Abstract Citation searching has been available for decades, although in a limited form. This article discusses the advantages and limitations of searching by cited references, and also some alternatives in searching for cited references, before presenting a case study involving citation searching in full-text indexes.

Electronic access The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm

Online Information Review Volume 28 · Number 6 · 2004 · pp. 454-460 © Emerald Group Publishing Limited · ISSN 1468-4527 DOI 10.1108/14684520410570580

These days, savvy searching increasingly means citation searching. This search technique has been available for decades but was limited to legal databases and the citation databases of the Institute for Scientific Information (ISI). The phenomenal success of Google, which ranks search results by combining the number of links a site or page has received with the PageRank of the citing sources from which the links originate, brought the concept of citation searching to the forefront. Linking on the web is essentially the high-tech digital version of the intellectual act of citing, except for the purely financially motivated and sociopathic links. Some online information services offer sophisticated features for searching by cited author, cited journal name, cited article title, cited year and their combinations. Others offer just a single index of cited references, pouring all the components of the cited references into one large bucket of keywords. Yet others have an ill-implemented search engine for citation searching, or do not make cited references searchable at all even though they appear as distinct parts in the publisher’s archive, not realizing the power of citation searching to complement the traditional methods of searching by controlled vocabulary terms and/or free text.

The case for searching by cited references

In an earlier column (Jacsó, 2003) I discussed the pros and cons of thesaurus-based searching, and the often less than user-friendly and sometimes careless and/or incompetent implementation of thesauri in the host software. Free-text searching in natural language style has been made popular by Google, and to a lesser extent by other web-wide search engines. These, however, are no panacea for searching digital archives of full-text journal articles and conference papers. They do not yet make any use of the tagged structure of the documents, which would allow limiting the result to those documents in which the query terms appear in the abstract, among the author-assigned keywords, or in the cited references, instead of merely appearing somewhere in the full text in an AND relationship. The potential of tracing references cited in scholarly articles was argued for convincingly 50 years ago (Garfield, 1955), but it took a long time to get it into the mainstream of information retrieval. Tests comparing the results of searches using thesaurus terms with those of tracing the cited and citing references of seminal articles concluded that there was no significant overlap in the result of
descriptor-based and citation-based searches (McCain, 1989). Clearly, there is reason to complement word-based searches with citation searches for comprehensive results. Still, until recently only the ISI citation products provided the tool for this kind of information retrieval, first in awkward print format (Tenopir, 2001), then in increasingly sophisticated databases. The latest release of the Web of Science (WoS) implementation of the ISI citation databases is an example of how simple and effective the citation search process can be.

For all of their power and benefits, the ISI databases have a limitation: they do not display the usually content-rich, indicative-informative title of the cited articles (which makes it difficult to judge whether an article is pertinent for the user), let alone allow searching the cited title. Being informed about and led to cited articles through abbreviated citations is a great way to find pertinent information, but including the titles of the cited articles and making them searchable gives an extra dimension to citation searching.

In the past few years, several projects have been initiated either to add manually the list of cited references of the processed articles, conference papers, reports and books to a subset of records in traditional indexing/abstracting databases, or to extract cited references algorithmically from the digital manuscripts in open access archives. All these projects include the title of the articles in the cited references and make them searchable. PsycINFO and the soon-to-debut Scopus database are examples of the most important projects of the former kind, and Citebase and CiteSeer (also known as Research Index) are examples of the latter approach. These relatively new systems offer the best compromise among controlled vocabulary, full-text and cited reference searching.

Alternatives in searching for cited references

The search options are primarily determined by:
• how the cited references were entered in the master record;
• how the entries in the cited reference index are generated (word, phrase or both); and
• which fields or subfields they are extracted from.

I have tested the options by searching for cited references to the articles of Mike Thelwall, a member of the Editorial Advisory Board of Online Information Review. He has been the most productive author in information science for 2001-2004, having published nearly 50 scholarly journal articles. These were published in journals by Kluwer, Emerald, Elsevier, Wiley, and Sage (to name only the largest). All of these journals are also covered by the citation databases of ISI, and most of them also by Scopus, the soon to be released citation-enhanced mega abstracting-indexing database of Elsevier. (I tested the beta version of Scopus in August, when neither the content nor the software was complete, but the fine options for cited reference searching more than justified its inclusion in the tests.) When in doubt, the cited article or paper was verified in Thelwall’s résumé available on the web to make sure that it was his paper. This was needed only for two articles. For testing the citation searching abilities of search engines for known items, his article about extracting macroscopic information from web links was used.

It is to be noted that publisher archives have a limited domain defined by their own journals. In Elsevier’s ScienceDirect archive, one cannot find citations from Online Information Review to Thelwall’s articles published in Information Processing & Management. But one can find articles from journals in the Elsevier stable which cite his Online Information Review articles. Similarly, in the Emerald archives one can find references only from Emerald journals, but the cited journals, books, conference papers, etc., can be from any publisher. Sometimes the timespan is limited, as the journal may have been acquired from another publisher with just a few years of retrospective digitization rights. The databases of facilitators like Ingenta, MetaPress or Highwire Press may offer cited reference searching in all the journals they help to digitize. The range of the aggregators’ options for cited reference searching may vary, depending on their agreement with the third party content providers.

WoS and Scopus have the broadest domain for cited reference searching, as they have citation-enhanced indexing and abstracting records with cited references coming from and going to publications by hundreds of publishers.

Some publishers, facilitators and aggregators swiftly take care of the difficulties associated with generating indexes for citation searching by simply ignoring the issue. Kluwer has a digital archive and licenses Verity, one of the most powerful search tools. Still, it creates searchable indexes only from the author, title, abstract and keyword fields, as many online services did in the 1960s (the early 1960s, that is) (see Figure 1). There is no full-text search option, let alone a cited reference search.

Figure 1 The advanced search options of Kluwer are not really advanced

Figure 2 Excerpt from the result list of full-text citation search

The beta version of Ingenta Connect, which hosts journals from many prominent publishers, has some additional options (such as searching by journal name or ISSN), but it does not offer a full-text or cited reference search possibility.
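The three determinants named in this section (how the cited reference is entered, whether index entries are generated as words, phrases or both, and which subfields they are drawn from) can be illustrated with a minimal Python sketch. It is not the method of any service discussed here. The record layout, the subfield names and the function names are assumptions made for illustration, and the sample title is paraphrased from the description of the test article.

```python
import re

# Hypothetical master-record entry for one cited reference, stored with
# subfields (assumption: real services may store a single unparsed string).
CITED_REFERENCE = {
    "author": "Thelwall, M.",
    "year": "2001",
    "title": "Extracting macroscopic information from web links",
    "source": "Journal of the American Society for Information Science and Technology",
}


def word_entries(value):
    """Word indexing: one lowercase entry per alphanumeric token."""
    return re.findall(r"[a-z0-9]+", value.lower())


def phrase_entry(value):
    """Phrase indexing: the whole subfield value as one normalized entry."""
    return " ".join(word_entries(value))


def index_entries(reference, subfields, mode):
    """Generate (subfield, entry) pairs for the chosen subfields.

    mode is "word", "phrase" or "both", mirroring the second determinant.
    """
    entries = []
    for name in subfields:
        value = reference[name]
        if mode in ("word", "both"):
            entries += [(name, word) for word in word_entries(value)]
        if mode in ("phrase", "both"):
            entries.append((name, phrase_entry(value)))
    return entries


if __name__ == "__main__":
    for entry in index_entries(CITED_REFERENCE, ["author", "title"], "both"):
        print(entry)
```

With word entries alone, a searcher can match any title word but cannot ask for an exact cited title. With phrase entries alone, the whole subfield value must be typed or truncated. Which subfields are passed in decides whether a cited year or cited source can be searched at all.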

Citation searching in full-text indexes

Figure 3 Excerpt from the cited reference list of a citing article which matched the query

If there is no separate field-specific index for cited references, and the extracted citations are instead poured into a single index bucket (an extreme form of lump indexing) together with words extracted from the title, author, journal name, abstract, descriptors and the full text of the articles, it is impossible to tell an author name apart from a cited author name, an article title word from a cited article’s title word, a word from the source journal’s name from a word in a cited journal’s name, or any of the above from a word in the full text. If no index is available for words extracted specifically from the cited references, we cannot talk about citation searching, only about citation rummaging at best. This is the typical option in the online services of digital facilitators who help publishers to mount their archives on the web, as Catchword (now Ingenta Select) has done for Emerald. Ingenta Select uses a customized interface for Emerald journals, but cited reference searching is not an explicit option. However, the full text is searchable, as opposed to the Ingenta Connect version of the Emerald journals. The search can be restricted to Thelwall as author, but not to Thelwall as cited author. Nevertheless, if you combine the author name and a multiple-word fragment from the cited article’s title in a full-text search, it will produce a set of 13 articles (published in Emerald journals), and indeed all of them cite the article specified for this test through the combination of his name and a phrase from the title: the one about extracting macroscopic information on the Web, published in the Journal of the American Society for Information Science and Technology (JASIST) (see Figures 2 and 3).
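The difference between a single index bucket and field-specific indexes can be made concrete with a small sketch. The records, field names and functions below are hypothetical and serve only to illustrate the point. No actual service is assumed to work this way.

```python
import re
from collections import defaultdict

# Hypothetical source records: R1 is written by Thelwall, R2 only cites him.
RECORDS = [
    {"id": "R1", "author": "Thelwall, M.", "title": "Paper one",
     "cited": [{"author": "Smith, A.", "title": "Earlier study"}]},
    {"id": "R2", "author": "Jones, B.", "title": "Paper two",
     "cited": [{"author": "Thelwall, M.", "title": "Paper one"}]},
]


def tokens(text):
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_lump_index(records):
    """Pour every word of every field into one bucket: word -> record ids."""
    index = defaultdict(set)
    for rec in records:
        texts = [rec["author"], rec["title"]]
        for ref in rec["cited"]:
            texts += [ref["author"], ref["title"]]
        for text in texts:
            for tok in tokens(text):
                index[tok].add(rec["id"])
    return index


def build_field_index(records):
    """Keep the origin of each word: (field, word) -> record ids."""
    index = defaultdict(set)
    for rec in records:
        for tok in tokens(rec["author"]):
            index[("author", tok)].add(rec["id"])
        for tok in tokens(rec["title"]):
            index[("title", tok)].add(rec["id"])
        for ref in rec["cited"]:
            for tok in tokens(ref["author"]):
                index[("cited_author", tok)].add(rec["id"])
            for tok in tokens(ref["title"]):
                index[("cited_title", tok)].add(rec["id"])
    return index


if __name__ == "__main__":
    lump = build_lump_index(RECORDS)
    fields = build_field_index(RECORDS)
    print(sorted(lump["thelwall"]))                      # author and cited author mixed
    print(sorted(fields[("cited_author", "thelwall")]))  # citing records only
```

In the lump index the word thelwall retrieves both records, and nothing in the result says which one was written by him and which one cites him. The field-specific index answers the two questions separately.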

You cannot hope for such good results with fewer words or with combinations of less distinctive title words. For example, the query “Thelwall AND links” will retrieve several articles which cite Thelwall, but not necessarily ones that have the word “link” in the cited title of one of Thelwall’s articles. The word may appear in the body of the text or in another cited reference. In the case of authors with a very common last name, false hits are more likely to emerge. Finding an article by Alastair G. Smith as an author is a walk in the park because of the rather distinctive first name. Finding articles which cite A.G. Smith’s work is much more difficult, because in the query you should typically use only the last name and first initial (such as Smith, A), as you cannot count on the middle initial appearing in the cited references. Combining it with some rarely occurring words may alleviate the problem, but this is not the best solution for citation searching.
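The trade-off in cited author queries can be sketched as follows. The parsing rules and function names are assumptions made for this illustration, and real cited references are far less regular than the few name forms handled here.

```python
import re


def parse_name(name):
    """Split 'Last, First M.' or 'Last FM' into (last name, initials).

    Simplification: without a comma, the first token is taken as the last name.
    """
    name = name.strip()
    if "," in name:
        last, rest = name.split(",", 1)
    else:
        parts = name.split(None, 1)
        last = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
    initials = ""
    for token in re.findall(r"[A-Za-z]+", rest):
        if token.isupper() and len(token) <= 3:
            initials += token      # a run of capitals such as 'AG'
        else:
            initials += token[0]   # a spelled-out given name
    return last.strip().lower(), initials.lower()


def matches(query, cited_author):
    """True if the last names agree and the cited initials start with the query's."""
    q_last, q_initials = parse_name(query)
    c_last, c_initials = parse_name(cited_author)
    return q_last == c_last and c_initials.startswith(q_initials)


if __name__ == "__main__":
    for cited in ["Smith, A.G.", "Smith AG", "Smith, A.", "Smith, Anna"]:
        print(cited, matches("Smith, A", cited), matches("Smith, A G", cited))
```

Querying with the first initial alone catches every form of A.G. Smith but also lets in other Smiths whose first name starts with A. Adding the middle initial removes those false hits at the price of missing references that omit it.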


Citation searching using a cited reference index

Most publishers’ archives use this approach. But just because a publisher (or other online service) offers a cited reference index for searching, there is no guarantee that it will work well. It is quite disappointing that the Wiley Interscience service, which hosts JASIST (as well as its predecessor), finds Mike Thelwall’s six articles published in JASIST (in which he obviously cites his earlier works), but cannot find a single article in the Wiley journals which cites any of Mike Thelwall’s articles, let alone the specific test article about extracting macroscopic information from web links. Searching a variety of combinations of his first name and last name (or initial) in the references index, with and without truncation, yielded nil results. Searching for the last name alone in the references index yielded eight results. Unfortunately, all of them cited articles by P.E. Thelwall, who published about nuclear magnetic resonance spectroscopy (see Figure 4). This is inexplicable, because there are ten articles in JASIST alone between 2002 and 2004 which cite one or more of Thelwall’s articles: seven of them cite the specific article about macroscopic extraction of information from web links, according to the correct information provided by WoS (see Figure 5).

Elsevier delivers on its promise to find cited references. For the search on Thelwall in the References index (i.e. as a cited author), the ScienceDirect software found nine articles from various Elsevier journals (see Figure 6). There were six Elsevier journal articles citing the specific JASIST article, and the software highlighted the matching terms in red in the cited reference lists (see Figure 7).

Figure 5 A comparison search shows that WoS finds ten papers which meet the query specification

Figure 4 Wiley’s service finds only another Thelwall as cited author, Thelwall, P (truncated)

Figure 6 Records matching the query

Figure 7 Excerpts from the cited reference list of a citing article

Citation searching with field-specific indexes

WoS offers cited reference searching through a Cited Search template with field-specific indexes for cited author, cited work and cited year. Authors and cited authors in ISI databases must be searched without a comma between the last and first name. The search found 100 documents citing Mike Thelwall’s articles, published in a variety of journals by a dozen different publishers (see Figures 8 and 9). The well-designed templates and interface of WoS show the company’s unique experience in supporting high-precision cited reference searching, with one exception: in ISI databases, the cited references do not include the title of the cited article. Limiting the search to the specific article published in JASIST in 2001, WoS found 45 articles citing it, in more than a dozen journals processed by ISI (see Figure 10). The article title is not displayed, only the journal abbreviation, but for us the year, volume and page number identify sufficiently that it is the specific article. The same functionality is available in Dialog’s implementation, but only in the command mode, not in the guided search mode, as you need to use a special proximity operator to make sure that the cited author, work and year appear in the same instance (occurrence) of the repeatable cited reference field. WoS automatically takes care of this in the cited reference search process.

Figure 8 Excerpt from 100 matching records

Figure 9 The search template allows (sub)field-specific searching of cited references

Figure 10 The specific article is cited by 45 other articles or conference papers, two of which are shown here

The newcomer, Scopus, offers the most comprehensive options for cited reference searching. The first rough search for Thelwall as a
cited author in the field-specific References Index of Scopus for the period 2001-2004 yielded 119 hits. At first glance it is clear that the result set includes quite a number of source documents which cite the NMR specialist Thelwall, P.E. Using ‘Thelwall M’ as an exact term reduces the set to 66 hits. Using a comma between the last name and first name yields the same set (see Figure 11). Although Scopus found only 36 documents that cited the specific JASIST article, its citation search prowess and outstanding result display features may offset the less comprehensive coverage of Scopus in some disciplines. Beyond the generic REF index, Scopus has separate indexes for the cited author, cited source, cited title, cited year and cited pages. These can be accessed only in the advanced mode, using rather cryptic field tags, like REFSRCTITLE for the cited reference source, which need to be looked up in the help file and typed in. Still, it is a good solution, allowing known item searches by combining cited author and start page, such as REFAUTH (thelwall, m) AND REFPAGE (1157), or topic and author-oriented cited searches using author and title word combinations, such as REFAUTH (thelwall, m) AND REFTITLE (extracting macroscop*). Using such alternatives will yield slightly different results, which is clear evidence of the inaccuracies and inconsistencies in cited references: many of us as authors do not verify the accuracy of our cited references at the micro level, such as starting page number and issue number, and sometimes not even at the macro level.

Figure 11 Outstanding result display features following a cited author search

The most attractive feature of Scopus is that you can sort the result (up to 1,000 items) of any traditional or citation search. The sort criteria include the publication year, first author, journal name, and most importantly from our perspective, citedness score. This score indicates, for each item in the result list, how many times that article, conference paper or report was cited by documents covered in Scopus. This is a powerful and transparent feature, useful for many other scientometric and bibliometric purposes. For our test case it brought up as Thelwall’s most cited article the one about extracting macroscopic information from web links. So before you churn out a complex query to be as precise as a sharpshooter, you may just do a simple cited author search, rank the results by citedness, and scan the re-ranked result list. The matrix format is highly conducive to effective scanning of the result list (see Figure 12).

Figure 12 The re-ranked result matrix brings up to the top of the list the most cited article of Mike Thelwall, which happens to be the one used for the item-specific test

The only fly in the delicious soup of citation searching, even in the best citation-based search systems, is the incredible level of inconsistency and inaccuracy in the cited references. There is one way to reduce the crippling implications of this: looking up (browsing) the indexes created from the cited references. I will discuss this in the next instalment of Savvy Searching.
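Two ideas from the field-specific searches above lend themselves to a short sketch: requiring that the cited author and cited year match within the same occurrence of the repeatable cited reference field, and ranking a cited author’s works by citedness. The code below is an illustration only, not a description of how WoS, Dialog or Scopus are implemented. The citing records are invented, the 2003 reference and the Smith references are hypothetical, and only the year 2001 and start page 1157 of the test article come from the searches reported here.

```python
from collections import Counter

# Hypothetical citing records; "refs" is the repeatable cited reference field.
CITING_RECORDS = [
    {"id": "C1", "refs": [{"author": "thelwall m", "year": 2001, "page": 1157},
                          {"author": "smith a", "year": 1999, "page": 12}]},
    {"id": "C2", "refs": [{"author": "thelwall m", "year": 2003, "page": 205},
                          {"author": "smith a", "year": 2001, "page": 77}]},
    {"id": "C3", "refs": [{"author": "thelwall m", "year": 2001, "page": 1157},
                          {"author": "thelwall m", "year": 2003, "page": 205}]},
    {"id": "C4", "refs": [{"author": "thelwall m", "year": 2001, "page": 1157}]},
]


def match_any_occurrence(record, author, year):
    """Author and year may come from different cited references (false hits)."""
    refs = record["refs"]
    return (any(r["author"] == author for r in refs)
            and any(r["year"] == year for r in refs))


def match_same_occurrence(record, author, year):
    """Author and year must appear in the same cited reference."""
    return any(r["author"] == author and r["year"] == year
               for r in record["refs"])


def citedness(records):
    """Count, for each cited item, the number of records that cite it."""
    counts = Counter()
    for rec in records:
        # A set ensures a cited item is counted once per citing record.
        for key in {(r["author"], r["year"], r["page"]) for r in rec["refs"]}:
            counts[key] += 1
    return counts


def rank_cited_author(records, author):
    """Works of one cited author, most cited first."""
    items = [(key, n) for key, n in citedness(records).items() if key[0] == author]
    return sorted(items, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print([r["id"] for r in CITING_RECORDS if match_any_occurrence(r, "thelwall m", 2001)])
    print([r["id"] for r in CITING_RECORDS if match_same_occurrence(r, "thelwall m", 2001)])
    print(rank_cited_author(CITING_RECORDS, "thelwall m"))
```

Record C2 shows why the same-occurrence condition matters: it cites Thelwall and it cites something from 2001, but not the same item. The exact-match keys also hint at the inconsistency problem, since a reference with a mistyped start page would be counted as a different cited item.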

References

Garfield, E. (1955), “Citation indexes for science: a new dimension in documentation through association of ideas”, Science, Vol. 122 No. 3159, pp. 108-11.

Jacsó, P. (2003), “Using controlled vocabulary (Part II – Software issues)”, Online Information Review, Vol. 27 No. 5, pp. 359-63.

McCain, K.W. (1989), “Descriptor and citation retrieval in the medicine behavioral sciences literature: retrieval overlaps and novelty”, Journal of the American Society for Information Science, Vol. 40 No. 2, pp. 110-14.

Tenopir, C. (2001), “The power of citation searching”, Library Journal, Vol. 126, November, pp. 39-40.
