CHANGING THE TUNE – MATCHING LANGUAGE WITH USER NEEDS TO MAXIMISE DISCOVERABILITY

Anna Gifford

Australian Drug Foundation, Melbourne VIC Australia

ABSTRACT

Subject headings have long been a key component in information discovery, but changing literacies mean that, more than ever, the language or “voice” needs to be meaningful to the user group. This paper examines voice in the context of controlled vocabularies, using case studies of thesaurus development within online and specialist contexts.

Introduction

Controlled vocabularies are a familiar tool within the library and information sciences. They serve to provide subject-based descriptors for the items represented in catalogues, databases and beyond, and their benefits range from disambiguation to search enhancement. Within the broader online context, user behaviour is changing when it comes to searching and information retrieval. New literacies are developing and the way in which people interact with information is also changing. Traditional approaches such as controlled vocabularies within catalogues are at risk of becoming out of step within this new technological and behavioural landscape.

This paper explores some of the ways in which these shifts in information retrieval behaviour influence the way in which user-oriented language can and should be leveraged to maximise discoverability through resource description. Drawing from two case studies, it aims to demonstrate some of the key issues and some possible solutions.

1. Controlled vocabularies

A controlled vocabulary is a finite set of authorised terms (also known as preferred terms or descriptors) which are used in resource description. These terms often have associations to synonymous terms (also known as non-preferred terms) which direct the interrogator to a matching preferred term.

Examples of controlled vocabularies include:
• thesauri (synonymous, hierarchical and relational structures which are used in description, labelling and discovery)
• glossaries (word lists and their definitions)
• ontologies (structured vocabularies used to organise concepts)
• taxonomies (structured and hierarchical sets of terms typically used for navigation or classification)

Some newer and less-controlled vocabularies include:
• folksonomies (user-generated terms or tags which can be used in developing grassroots classification systems)
• topic maps (visualised mappings of relationships between concepts, sometimes discussed as a hybrid between a thesaurus and a folksonomy)

Controlled vocabularies have the following advantages:
• consistency in description
• disambiguation (e.g. the use of qualifiers to distinguish ‘Ice (Water)’ from ‘Ice (Crystal methamphetamine)’)
• rigour in their development and maintenance
• interoperability opportunities

but can also be:
• quickly obsolete (not reflective of changing terminology)
• specific to an audience and too narrow/too broad for other applications
• inflexible in terms of customisation

1.1. Controlled vocabularies in librarianship

Libraries use controlled vocabularies in bibliographic subject description. Library of Congress (LC) subject headings are commonly used in public and academic institutions, while special libraries typically employ specialist vocabularies, such as the Australian Thesaurus of Education Descriptors which is used in the Australian Council for Educational Research’s Cunningham Library and online contexts such as education.au, or the AOD Thesaurus which is used in the Australian Drug Foundation Resource Centre. These vocabularies have the clear benefits of consistency, disambiguation, and rigour behind their development and structures.

However, these vocabularies are typically developed with the intermediary audience, i.e. the information professional, rather than the resource target, i.e. the end user, in mind. In the case of the LC subject headings, their key audience is cataloguers and librarians. In the case of the Australian Thesaurus of Education Descriptors, the key audiences are again cataloguers (and indexers) and educational researchers. This can create a significant language gap between descriptors and user language. This gap is typically minimised through keywords (also known as identifiers) and full-text searching but there is an opportunity to review new findings from the broader online profession.

1.2. Controlled vocabularies in the online space

Library and information science is sustained by an academic, theoretical and structured model of information management, where standards, thesauri and rules all serve to manage information, particularly in historical contexts when the information professional had a vital role in organising and retrieving information for users. The online space, on the other hand, comes from a more commercial, technical and indeed sometimes quite chaotic environment where much pivots on the behaviours and needs of the users. The aim of many websites is to sell, to entice, and to minimise the need for human interaction. Information discovery is a key commercial imperative for such sites, and the information architecture discipline sprang from a need to develop better discovery mechanisms (search and browse) to maximise navigation success. These days information architects (many of whom are ex-librarians) use many tools familiar to the library world, such as metadata, controlled vocabularies, naming conventions and national and international standards, but draw on the enhanced technological capabilities of online tools to leverage them in ways with which most library management systems are unable to compete.

2. Changing literacies and user language

The Internet has been increasingly available and accessible since the mid-1990s. From the structured forms of directory-based browsing such as Yahoo has emerged a world wide web which is used in a multiplicity of professional and personal settings. The ‘Google generation’ has arrived – those who have grown up with a lifelong familiarity with communications and media technologies, whose first reference for an information search tends to be a web-based search engine rather than a library catalogue.

This context is shown to be affecting the way in which people seek information. In a report to the Library of Congress Bicentennial Conference on Bibliographic Control for the New Millennium, Bates (2003) identified three emerging user behaviours:
i) the principle of least effort – people will input the least number of search terms possible and are prone to give up if they are not immediately successful, rather than trying other strategies such as synonyms
ii) a lack of self-awareness in information seeking behaviours – people not in the information professions are often not able to describe or categorise the processes they use to find things
iii) the importance placed on influential figures or models in information seeking – for example employees modelling the behaviours of their employers, children learning from parents or teachers, and so on

Tied to this is the increasing personalisation of the information space with practices such as tagging. In 2006 a comparative analysis was undertaken of user tags, published keywords and descriptors within the CiteULike social bookmarking service (Kipp, 2006). This study examined the convergence between the three language sets and amongst the findings identified the following traits:
1. while users often used terminology similar to descriptors, the terms used tended not to be exactly matching or consistently applied
2. many terms were acronyms, abbreviations, spelling variations and spelling errors which were not covered by the more formal vocabularies

Several other studies have also discussed user tags as a potential source for subject description.
The consensus seems to be, as supported by Kipp’s findings, that while user language can have strong ties with controlled vocabularies, its uncontrolled nature creates problems of duplication, ambiguity and time dependency (if a concept acquires a new label, taggers do not tend to go back and re-tag previous related content).

Today’s searchers tend to use shorter search phrases: a commonly reported observation is that online users favour the shortest possible search strings, one to two keywords only, over more complex Boolean search strings. Their experience with sophisticated search engines such as Google also gives them higher expectations of search functionality within databases. User language therefore becomes increasingly valuable as a discovery tool in the online space, where search is user-driven as well as content-driven. Already in the library and information management space, organisations are experimenting with ways of bringing user or fresh language into thesauri (Mitchell, 2007). Doing this, however, requires a considered analytic approach to understanding user needs as well as their language, before developing approaches for building user language into the library context.

2.1. Understanding user needs

The best way to understand user needs is through an examination of:
• the user demographic
• the user behaviours
• the subject matter being interrogated

The user demographic is determined using statistical analysis of user or member data. This requires access to data gathered in applications such as web analytics tools, member data profiling and surveys. User behaviours can be harder to determine but some characteristic behaviours can be identified through search analysis. For websites and catalogues alike, important insights can be gained from examining search term language and structure, popular results sets or resources, and where the searcher has given up or retrieved no results. These can then be mapped against broader understandings of user behaviour as detailed in this section. The third key to understanding the language needs of users is to look at the material itself that they are trying to retrieve. The topics, themes and level of the collection can be used to reference a collection’s ontology – that is, a conceptual framework of the collection’s scope.
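The search analysis described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a hypothetical search log exported as (query, result count) pairs, and the function name, log format and sample queries are invented for the example rather than drawn from any particular analytics tool.

```python
from collections import Counter


def analyse_search_log(log, top_n=3):
    """Summarise a search log given as (query, result_count) pairs.

    Returns the most common normalised queries, the queries that
    retrieved no results, and the average number of words per query.
    """
    if not log:
        return [], [], 0.0
    # Normalise case and whitespace so trivially different inputs are counted together.
    normalised = [(" ".join(q.lower().split()), hits) for q, hits in log]
    counts = Counter(q for q, _ in normalised)
    zero_results = sorted({q for q, hits in normalised if hits == 0})
    avg_words = sum(len(q.split()) for q, _ in normalised) / len(normalised)
    return counts.most_common(top_n), zero_results, avg_words


# Hypothetical sample log: (search string, number of results returned)
sample_log = [
    ("dog registration", 12),
    ("Dog  Registration", 12),
    ("cat rego", 0),
    ("register my dog", 3),
    ("cat rego", 0),
    ("pet registration", 9),
]

top, failed, avg_len = analyse_search_log(sample_log)
```

Here `top` lists the most frequent queries, `failed` lists the queries where the searcher retrieved nothing, and `avg_len` gives a rough indication of how short typical search strings are. The zero-result list is usually the most useful starting point for finding language that does not match the collection.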

2.2. Matching user needs and language with controlled vocabularies

As discussed earlier, controlled vocabularies such as thesauri provide a structured approach to subject-based information retrieval. In a traditional LIS subject search, a non-matching subject search term will either retrieve a coordinating reference (e.g. ‘Foreign students SEE International students’) or, if the term is not listed in the controlled vocabulary, potentially no match. In online contexts, metadata cataloguing or search engine configuration is manipulated to maximise discoverability. The model of thesauri put forward by Morville and Rosenfeld (2006) proposes that for a thesaurus in the online environment, a preferred term can become the centre of its own semantic network where non-preferred terms become variant terms and can even be retained within resource description. This allows accommodation of search term idiosyncrasies and variation.

Some very common searches can be input in a number of ways, for example ‘animal registration’, ‘dog registration’, ‘pet registration’, ‘register my dog’, ‘cat rego’, etc. A controlled vocabulary can be used to capture equivalent and variant terms and improve search results. This can be built into the metadata or alternatively referenced within the search engine. One example is the ‘Did You Mean’ function within Google’s search engine, which returns search results based on the input string but also offers a dynamic search for a suggested spelling variation.
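The idea of a preferred term sitting at the centre of a network of variant terms can be illustrated with a short Python sketch. It uses the registration searches mentioned above. The choice of ‘Animal registration’ as the preferred term, the dictionary layout and the function names are all assumptions made for the example, not a description of any actual thesaurus.

```python
# Hypothetical thesaurus entry: one preferred term at the centre of a
# small network of variant (non-preferred) terms.
THESAURUS = {
    "Animal registration": {
        "variants": [
            "dog registration",
            "pet registration",
            "register my dog",
            "cat rego",
        ],
    },
}


def build_lookup(thesaurus):
    """Map every preferred and variant term (lower-cased) to its preferred term."""
    lookup = {}
    for preferred, entry in thesaurus.items():
        lookup[preferred.lower()] = preferred
        for variant in entry["variants"]:
            lookup[variant.lower()] = preferred
    return lookup


def resolve(query, lookup):
    """Return the preferred term for a query, or None if it is not in the vocabulary."""
    return lookup.get(" ".join(query.lower().split()))


LOOKUP = build_lookup(THESAURUS)
```

With this structure, `resolve("cat rego", LOOKUP)` and `resolve("Register my dog", LOOKUP)` both lead to the same preferred term, while a query outside the vocabulary returns `None` and can be logged for later review. The same lookup could sit either in the metadata or in the search engine configuration.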

3. Leveraging user language to improve discoverability

With the convergence of disciplines and techniques, the library world can benefit from the lessons learned in information architecture and other online disciplines to improve discoverability in their own contexts. Where systems and standard frameworks allow, user language can enhance the search experience within library networks to accommodate these changing literacies.

The following case studies aim to illustrate two applications of user language for discoverability. The first discusses the development of a controlled vocabulary to improve search and resource description within the context of a large and highly used website, Victoria Online. The second case study outlines a proposal for reviewing, extending and leveraging an alcohol and other drugs thesaurus for the library and the website of the Australian Drug Foundation’s DrugInfo Clearinghouse.

4. Case Study 1: Victoria Online Thesaurus

Victoria Online (http://www.vic.gov.au) is a web portal to government information and services for Victorians. As part of the site’s ongoing innovation and enhancement strategy, an opportunity for improving search results was identified through the development of a controlled vocabulary (see footnote 1). The Victoria Online Thesaurus was developed in 2005 and is now used as part of the site’s metadata application profile.

4.1.1. Identifying user needs

Broad understandings about changing user needs must always be reviewed in light of a service’s own and sometimes quite specific users and their behaviours. Websites can have large and diverse user groups, and web analytics packages such as Google Analytics are required to profile actual users. The users of Victoria Online were shown to be both government and the general public, with the largest proportion being from Melbourne or Victoria but with sizeable access from interstate and overseas visitors. This broad-ranging user group was shown to be unfamiliar with the structures and functions of government and required information to be accessible using clear and common language.

Victoria Online built its user needs into its information architecture principles including:
o citizen-oriented language rather than government jargon
o knowledge-driven, not content-driven – aligning the site to citizen needs, not government structures
o multiple discovery methods and pathways – not assuming that all users share the same mental models or vocabularies

4.1.2. Identifying user language

Footnote 1: A full report on the VO Thesaurus’ development is available on the eGovernment Resource Centre website: http://www.egov.vic.gov.au/pdfs/VO-ThesaurusDevelopmentProjectReport1.2.pdf

In most information management systems, whether a website or a web-delivered catalogue, user search terms can be captured and analysed. This is a key resource as it:
o shows the most common search queries and the language used
o identifies the most common variations and synonyms
o highlights the terms which are most clearly mismatching against the content/collection

This analysis is not necessarily limited to one’s own search terms. Depending on the scope and user base, other sources may also be relevant to understand user language. In the development of the Victoria Online Thesaurus, the top 1000 search terms were collected from not only Victoria Online but also the main Victorian government departmental websites. This provided a broader pool of terms and perspectives to assist in the development of the thesaurus.

4.1.3. Confirming the ontology

The user search language was then mapped against:
o existing metadata keywords
o the 3-tiered topic taxonomy

This cross-mapping allowed a rough ontology to be developed which covered the themes and included the language relevant for the site and its users.

4.1.4. Building the VO Thesaurus

The Thesaurus was built using the MultiTes thesaurus management software, and a review and consultation process was used to guide the Thesaurus’ development. It referenced the ANSI/NISO Standard Z39.19-2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri but departed from the standard in its treatment of non-preferred terms. The online environment validated the retention of synonymous terms within cataloguing as alternative terms describing the resource alongside the preferred terms.

The Thesaurus was developed soon after the topic taxonomy was refreshed, and so a program of reclassification and recataloguing was undertaken across all 3000 links within Victoria Online. The Victoria Online Thesaurus was first released in September 2005 and is fully updated every six months.

4.1.5. VO Thesaurus implementation and maintenance

The seventh edition of the VO Thesaurus was released in July 2008. It contains 5,952 terms, with 2,467 preferred terms grouped across 31 subject categories. 5 new preferred terms have been added. An HTML version (see footnote 2) is also available. As a dynamic portal (as opposed to a collection which is built upon), the links within Victoria Online change over time, and as a result language changes do not necessarily affect cataloguing. However the Thesaurus is actively maintained through ongoing and regular search term analysis, and changes are immediately reflected in all records.

Footnote 2: http://www.egov.vic.gov.au/victoriaonlinethesaurus/index.htm

The Thesaurus has also been integrated into the search functionality. When a thesaurus preferred or equivalent term is searched for, results are presented and a highlight box offers narrower and related dynamic searches. In December 2005, Victoria Online won the Sir Rupert Hamer Records Management Award 2005 - Inner Budget Agency - for delivery of the Victoria Online Thesaurus (see footnote 3).
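The search integration described in this section, where a matched thesaurus term triggers a box of narrower and related searches, can be sketched as follows. This is an illustration under stated assumptions and not the Victoria Online implementation: the `Term` structure, the `suggestions_for` function and the sample terms are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A thesaurus entry with its equivalent, narrower and related terms."""
    preferred: str
    equivalents: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


# Hypothetical entry; the terms are illustrative and not taken from the VO Thesaurus.
TERMS = [
    Term(
        preferred="Animal registration",
        equivalents=["pet registration", "cat rego"],
        narrower=["Dog registration", "Cat registration"],
        related=["Animal welfare"],
    ),
]


def suggestions_for(query, terms):
    """Return narrower and related terms to offer as follow-up searches,
    or None when the query matches no preferred or equivalent term."""
    key = query.strip().lower()
    for term in terms:
        names = [term.preferred] + term.equivalents
        if key in (name.lower() for name in names):
            return {
                "preferred": term.preferred,
                "narrower": term.narrower,
                "related": term.related,
            }
    return None
```

A search front end could call `suggestions_for("cat rego", TERMS)` alongside the normal results query and render the returned lists as links that launch new searches. When the function returns `None`, no highlight box is shown.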

5. Case Study 2: Australian Drug Foundation thesaurus review – a proposal

The Australian Drug Foundation manages the DrugInfo Clearinghouse Resource Centre, one component of which is a specialist library service available free of charge to those working or studying in the alcohol and other drugs (AOD) sector. The library collection focuses on the psychosocial aspects of AOD and includes statistics, reports, monographs and serials.

The library originally used Library of Congress subject headings in cataloguing and at some point moved over to using the AOD Thesaurus, an American reference which offered greater relevance within the subject area. Unfortunately there was limited resourcing at the time to manage the migration, and so the current catalogue is a mix of LC headings, AOD Thesaurus terms and unregulated keywords. Initial analysis of search has demonstrated inconsistent search results because of this lack of bibliographic control, and it has become clear that work is needed to clean up the library catalogue.

Tying in with this, a design and information architecture refresh of the DrugInfo website (http://druginfo.adf.org.au) was commissioned to be rolled out over 2009-2010. This refresh includes a review of all content and an upgrade of the search engine. An opportunity was identified to broaden the scope of the thesaurus beyond the immediate library needs and consider needs and applications across the Australian Drug Foundation as a whole.

A proposal under consideration aims to address these issues by:
• modifying the AOD Thesaurus to bring it closer to local contexts and issues
• utilising user language and building it into the synonymous structures of the thesaurus
• retaining capability for linkage with Library of Congress terms to preserve interoperability and data sharing
• implementing this thesaurus into the DrugInfo library catalogue
• rolling out the thesaurus across parts of the DrugInfo website to aid in discovery through metadata enrichment
• developing a maintenance and updating process to ensure continued currency and relevance to user needs

Footnote 3: http://www.egov.vic.gov.au/index.php?env=-innews/detail:m2110-1-1-8-s-0:n-909-1-0--

Some of the elements to be considered during this process are detailed below.

5.1.1. Identifying user needs

The 2000 members of the DrugInfo Resource Centre Library are people studying and working within the AOD sector. This includes TAFE students doing their Certificate IV qualification, youth workers, counsellors, clinicians and researchers. Some key differences in user needs arise from such a small but diverse user group, and this highlights the need for close evaluation of search success across these different member types.

The broader user group of the DrugInfo websites is about to be analysed as part of a strategic evaluation across all of the services under the DrugInfo umbrella. The user profile, to be identified through web analytics, is yet to be fully evaluated, but it is expected that the website users will include members of the general public and interstate/international visitors as well as the core Victorian AOD user base.

5.1.2. Identifying user language

User language will be captured and analysed using the following data sources:
• library catalogue search terms
• DrugInfo website search terms

It must be noted that AOD is an area that is covered by very diverse language, from scientific terminology (e.g. ‘Methamphetamine’, ‘Cannabis’, ‘Gamma hydroxybutyrate’) to street slang (e.g. ‘Ice’, ‘Ganja’, ‘Grievous bodily harm’). In both the catalogue and website contexts, the varying terminology and user contexts must be taken into consideration.

5.1.3. Confirming the ontology

The user search language will then be mapped against:
o existing catalogue subject headings (LC, AOD Thesaurus, keywords)
o existing metadata keywords within the DrugInfo website
o comparable glossaries of AOD terminology and street slang

5.1.4. Building the thesaurus

The thesaurus will be built using the MultiTes thesaurus management software, following a review and consultation process similar to that used in the Victoria Online project. The difference from the Victoria Online Thesaurus project is that this thesaurus is intended for use in both library and online contexts. Modelling has begun to interrogate how this might occur. Some key points for consideration include:
• resources required for developing the thesaurus
• resources required for migrating and updating the catalogue subject headings
• retention of mapping to LC subject headings within the thesaurus
• resources required for the implementation of the thesaurus across the DrugInfo website
• maintenance of currency and relevance given the dynamic ‘slang’ component of the language base

Other issues not yet investigated relate to the capabilities of the software environments in which this thesaurus is to be implemented. The current library system has a thesaurus component which can be leveraged in cataloguing and which may also offer opportunities in search that can be explored. Similarly, the DrugInfo website is going to be migrated onto a new content management system and a new search engine is being acquired for the site, so further investigation is required to know how the benefits of the thesaurus can be maximised for the DrugInfo website.

6. Conclusion

The technological capabilities of information management software are changing to incorporate the sophisticated search functionality already existing in web search engines, but there is a long way to go yet, especially for the smaller systems in use by smaller libraries. The information profession can harness its considerable skills in information management and retrieval and use tools such as controlled vocabularies and user language to maximise discoverability. In an environment of changing literacies, these skills and tools provide the greatest opportunity for leverage in achieving better harmonisation between the users, the creators, and the online environment.


Bibliography

Bates MJ 2003 Task force recommendation 2.3 research and design review: improving user access to library catalog and portal information: final report (version 3). Los Angeles: Department of Information Studies, University of California

Dalmau M, Floyd R, Jiao D, Riley J 2005 ‘Integrating thesaurus relationships into search and browse in an online photograph collection’, Library Hi Tech 23:3, pp. 425-452

Gardner SA 2008 ‘The changing landscape of contemporary cataloging’, Cataloging & Classification Quarterly 45:4, pp. 81-99

Garshol LM 2004 ‘Metadata? Thesauri? Taxonomies? Topic maps! Making sense of it all’, Journal of Information Science 20:4, pp. 378-391

Greenberg J 2004 ‘User comprehension and searching with information retrieval thesauri’, Cataloging & Classification Quarterly 37:3/4, pp. 103-120

Kipp MEI 2006 Complementary or discrete contexts in online indexing: a comparison of user, creator and intermediary keywords http://eprints.rclis.org/archive/00008379/01/mkipp-caispaper.pdf [accessed 15/11/2008]

Mitchell P 2007 Learning architecture: issues in indexing Australian education in a Web 2.0 world, paper presented to the Australian and New Zealand Society of Indexers Conference, Melbourne, March 16, 2007 http://www.educationau.edu.au/jahia/webdav/site/myjahiasite/shared/papers/07LearnArch.pdf [accessed 15/11/2008]

National Information Standards Organization 2003 ANSI/NISO Z39.19-2003 Guidelines for the construction, format, and management of monolingual thesauri. Bethesda MD: NISO Press

National Institute on Alcohol Abuse and Alcoholism 2000 The alcohol and other drug (AOD) thesaurus: a guide to concepts and terminology in substance abuse and addiction. Third Edition http://etoh.niaaa.nih.gov/AODVol1/AODthome.htm [accessed 15/11/2008]

Qin J, Paling S 2001 ‘Converting a controlled vocabulary into an ontology: the case of GEM’, Information Research 6.2 http://informationr.net/ir/6-2/paper94.html [accessed 15/11/2008]

Rosenfeld L, Morville P 2006 Information architecture for the world wide web. 3rd ed. Sebastopol CA: O’Reilly & Associates

Schwartz C 2008 ‘Thesauri and facets and tags, oh my! A look at three decades in subject analysis’, Library Trends 56:4, pp. 830-842

Shiri AA, Revie C 2000 ‘Thesauri on the web: current developments and trends’, Online Information Review 24:4, pp. 273-279

Speller E 2007 ‘Collaborative tagging, folksonomies, distributed classification and ethnoclassification: a literature review’, Library Student Journal http://www.librarystudentjournal.org/index.php/lsj/article/viewArticle/45/58 [accessed 15/11/2008]

Victoria Online 2005 Victoria Online Thesaurus development project report, v1.2. Melbourne Vic: Multimedia Victoria http://www.egov.vic.gov.au/pdfs/VO-ThesaurusDevelopmentProjectReport1.2.pdf [accessed 15/11/2008]
