CONTENT ASSESSMENT OF THE PRIMARY BIODIVERSITY DATA ...

Apr 2, 2012
Be careful of what you wish for wishes might come alive (Sonata Arctica – The Boy Who Wanted to Be a Real Puppet)

Acknowledgements

My first thanks go to the members of the Departamento de Zoología y Ecología of the Universidad de Navarra, those who are there now and those who were there before, with whom I have shared these four and a half years of work, but also of fun, joy, Belenes, COCIDOs and Labianadas. Thank you for all your support, your affection, and for helping me to better myself every day.

I would also like to thank the people at the Global Biodiversity Information Facility (GBIF) International Secretariat in Copenhagen, for being such wonderful hosts during the time I spent there. Special thanks to Samy, for sharing his time, wisdom and love and for always having a smile for me.

My eternal gratitude and affection go to my boss, Arturo, a teacher who has guided my personal and professional growth, making me develop my independence without ever abandoning me. Whatever is good in this work, and in much of who I am, is his doing.

To my friends: the lifelong ones, the not-so-lifelong ones, and the new ones. For those who speak Spanish as well as for those who don’t, thank you. Without your support and friendship this thesis would not have been possible.

To my parents, José Ángel and Aurora, my sister Amaia and my aitona Patxi, for your love, for supporting me from the very beginning and for encouraging me to keep going. To my amonas Aurora and Angelita, who still care for me and guide me and are watching me from up above. To my friend, girlfriend, wife, Ana, who has endured and shared the bad moments with me and has given me many more unbeatable ones. And to everyone else I may have forgotten at this moment, but who have been part of my life.

Muchas gracias

Thank you very much

Eskerrik asko

Javier

Funding: This doctoral thesis was made possible by a grant from the Asociación de Amigos de la Universidad de Navarra.

Index

Summary
Resumen
Presentation
Presentación
Introduction

First Part: On the global assessment of the content of the GBIF index
    Chapter One: Content assessment of the primary biodiversity data published through GBIF network: status, challenges and potentials

Second Part: On the detailed assessments of the different aspects of the content of the index
    Chapter Two: GBIF position paper on future directions and recommendations for enhancing fitness-for-use across the GBIF network
    Chapter Three: On the dates of the GBIF mobilised primary biodiversity data records
    Chapter Four: Assessing the primary data hosted by the Spanish node of the Global Biodiversity Information Facility (GBIF)

Third Part: On the development of a data analysis web application
    Chapter Five: BIDDSAT: visualizing the content of biodiversity data publishers of the GBIF network

Fourth Part: On the uses of the data available through GBIF
    Chapter Six: Primary biodiversity data records in the Pyrenees
    Chapter Seven: Protected areas in the Spanish Pyrenees: a meaningful way to preserve biodiversity?

General Discussion
General Conclusions
General Literature

Annex I: Source code of BIDDSAT application
Annex II: Meta-assessment of the GBIF.ES mediated biodiversity data
Annex III: Abstract of the ‘Sampling Biodiversity Sampling’ poster presented at the TDWG 2008 Annual Meeting in Perth, Australia
Annex IV: Abstract of the ‘Noise in Biodiversity Data’ poster presented at the eBiosphere 2009 conference in London, United Kingdom
Annex V: Abstract of the ‘Have Standards Enhanced Biodiversity Data?’ oral presentation at the TDWG 2009 Annual Meeting in Montpellier, France

SUMMARY RESUMEN

Summary

Even though biodiversity has always been under threat owing to its dynamic nature, in recent decades the increasing human impact on the environment has largely led to a scenario known as the ‘biodiversity crisis’, a crisis whose treatment strongly depends on a better knowledge of biodiversity. The Global Biodiversity Information Facility (GBIF) was created as a core tool for improving biodiversity knowledge, with the aim of making “the world's primary data on biodiversity freely and universally available via the Internet”. GBIF is nowadays the largest network of institutions publishing primary biodiversity data (PBD). Increasingly, biodiversity stakeholders have argued for the need to assess the quality of the records served through GBIF, so that data publishers, the ultimate and only owners of the Intellectual Property Rights, can be made aware of the status of the records they share.

This doctoral dissertation is framed in that context and has the general aim of contributing to the development of Biodiversity Informatics. This encompasses the following specific aims: (a) to assess the quality and fitness-for-use of the PBD available through the GBIF network of data publishers; (b) to perform more in-depth analyses of the different detailed aspects of the available PBD sets in order to tailor recommendations for improvement; (c) to allow the community to easily access some basic data quality assessments through the development of a web tool for visualizing PBD collections; and (d) to apply the acquired knowledge to perform exemplar studies on the uses of the data available through the GBIF network.

Results show that, even though the overall completeness of the records is good, the distribution of their components is less so: the patterns extracted from these components reveal the ‘low-hanging fruit’ effect (overrepresentation of easy-to-get records). The large volume of available data does not reduce the need to make more information available, either by “unveiling” existing collections to the Internet or by enhancing the sampling of high biodiversity areas, in order to improve the fitness-for-use of the information.

Resumen (Summary)

Although biodiversity has always faced various threats owing to its dynamic nature, in recent decades the growing human impact on the environment has been the main driver of a scenario known as the ‘biodiversity crisis’, a crisis whose treatment is said to depend largely on broadening our knowledge of biodiversity. As a fundamental part of the response to this demand for knowledge, the Global Biodiversity Information Facility (GBIF) was created with the aim of “making biodiversity data freely and universally available via the Internet”. Today, GBIF is the largest network of institutions publishing primary biodiversity data (PBD). However, those involved in the study of biodiversity have increasingly argued for assessing the quality of the PBD served through GBIF, so that providers, who own the intellectual property rights to the data, are aware of the status of the records they share.

This doctoral thesis is framed within that context, and its general aim has been to contribute to the development of biodiversity informatics. This translates into the following specific aims: (a) to assess the quality and usability of the PBD accessible through GBIF's network of data publishing institutions; (b) to analyse in greater depth, and from different perspectives, more detailed aspects of the available PBD in order to generate recommendations for improvement; (c) to make simple data quality analyses easier for the community to access by developing a web tool for visualizing PBD collections; and (d) to apply the knowledge acquired to carry out exemplar studies on the use of the data available through the GBIF network.

The results show that, although the overall completeness of the records is good, the distribution of their components is less so; the patterns extracted from these components reveal the ‘low-hanging fruit’ effect (the overrepresentation of easy-to-obtain records relative to harder ones). The large volume of available data does not lessen the need to increase the information available, either by ‘unveiling’ existing collections to the network or by continuing to sample areas of high biodiversity, in order to increase the usability of the information.

PRESENTATION PRESENTACIÓN

Presentation

Motivation and General Aim

We are facing a critical moment for biodiversity and for every topic related to it. Recent advances in information technology are strikingly enhancing the capabilities of scientific work, allowing scientists to perform new types of research, in particular massively data-dependent research (Guralnick et al., 2007), which until now were very hard or even impossible to carry out (Bisby, 2000). A new branch of bioinformatics has recently emerged: Biodiversity Informatics (BI), which aims to apply these advances to the field of biodiversity research (Johnson, 2007).

One of the most important BI projects is to establish a single, free and open common access gateway to all the globally available biodiversity information in digital form (Bisby, 2000). At present, the Global Biodiversity Information Facility (GBIF) is the largest initiative providing such access (Yesson et al., 2007; Boakes et al., 2010), and it does so by establishing a global network of linked biodiversity data institutions (data publishers) which share their collections of biodiversity records using standard, open data exchange schemas.

GBIF [1] is an inter-governmental organization, founded in 2001, that currently comprises 53 governments and 43 international organizations. It is a network of national nodes coordinated by an International Secretariat based in Copenhagen, Denmark. The main goal of GBIF is to provide free and open online access to global biodiversity data in support of scientific research, conservation and sustainable development [2]. To do so, GBIF establishes links with the record collections of several biodiversity data publisher institutions and prepares a central index with the most relevant information. This index enables access to these data through a common gateway [3].

The main type of data GBIF provides access to is called Primary Biodiversity Data (PBD). These are the most basic, interpretation-free unit for biodiversity studies: what organism has been seen or sampled, and where (Guralnick and Hill, 2009). A third element, of great value for ecological research, is usually added: when the observation or sampling took place.

The availability of data through GBIF keeps increasing (GBIF, 2012) and, meanwhile, the focus of the scientific community is shifting towards ensuring a sufficiently high level of quality for those data (see for example Murgía and Villaseñor, 2000; Peterson et al., 2003; Peterson and Navarro-Sigüenza, 2004; Chapman, 2005a; Hortal and Jiménez-Valverde, 2008). Assessments of the quality and fitness-for-use of the available primary biodiversity data are growing in importance. One of the aims of every data assessment is to design recommendations for data managers. Without an efficient channel for communicating results, however, data managers cannot easily access the outputs of such assessments, which slows down record correction and information improvement (Hill et al., 2010; Jetz et al., 2011).

The general aim of this doctoral research is to contribute to the advancement of BI by developing new quality and usability assessments that can be applied to large sets of biodiversity records, mainly the collections of the data publishers of the GBIF network, and by developing applications that improve feedback to data managers.

[1] http://www.gbif.org
[2] http://www.gbif.org/index.php?id=269
[3] http://data.gbif.org
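The what/where/when structure of a primary biodiversity data record described in this section can be illustrated with a minimal sketch. The class, its field names and the sample records below are hypothetical, invented purely for illustration; they are not taken from the GBIF index or from any tool described in this work, and the completeness score is only one simple way such a check might be expressed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PrimaryBiodiversityRecord:
    """Hypothetical minimal PBD record: what, where and (optionally) when."""
    scientific_name: Optional[str] = None  # what organism was seen or sampled
    latitude: Optional[float] = None       # where, in decimal degrees
    longitude: Optional[float] = None
    event_date: Optional[date] = None      # when (the optional third element)

    def has_what(self) -> bool:
        # A blank or missing name does not count as taxonomic information.
        return bool(self.scientific_name and self.scientific_name.strip())

    def has_where(self) -> bool:
        # Both coordinates must be present and inside the valid ranges.
        if self.latitude is None or self.longitude is None:
            return False
        return -90.0 <= self.latitude <= 90.0 and -180.0 <= self.longitude <= 180.0

    def has_when(self) -> bool:
        return self.event_date is not None

    def completeness(self) -> float:
        """Fraction of the three basic components that are present."""
        present = [self.has_what(), self.has_where(), self.has_when()]
        return sum(present) / len(present)


# Made-up sample records, for illustration only.
records = [
    PrimaryBiodiversityRecord("Rupicapra pyrenaica", 42.68, 0.03, date(2005, 7, 14)),
    PrimaryBiodiversityRecord("Gypaetus barbatus", 42.75, -0.52),  # no date
    PrimaryBiodiversityRecord(None, 95.0, 10.0),  # no name, invalid latitude
]
scores = [r.completeness() for r in records]
print(scores)
```

A per-record score of this kind only captures whether each component is present; it says nothing about how the values are distributed across the whole set of records.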

Structure and Specific Aims

We present this research work in four parts. Each part focuses on specific aspects of the general work and has its own particular aims.

Introduction

The introduction situates the reader in the context of biodiversity, primary biodiversity data, the biodiversity informatics discipline and GBIF, and presents a detailed review of the available literature on these topics.

First part: On the global assessment of the content of the GBIF index

The first part describes an exhaustive assessment of the content of the GBIF index. Using well-established and newly developed tools, the volume of the index is analyzed in order to determine its level of quality and usability (fitness-for-use). This part has one chapter.

First chapter: Content assessment of the primary biodiversity data published through GBIF network: status, challenges and potentials

In this paper we analyze the most basic quality issues in the content of the GBIF index and perform usability tests for certain specific, common biodiversity studies. This paper has been submitted to the Biodiversity Informatics journal and is currently under revision.

Second part: On the detailed assessments of the different aspects of the content of the index

The second part describes more detailed assessments of the content of the GBIF index. We follow two different approaches: we assess the individual key aspects of the PBD, and we assess the content of a specific data publisher. These in-depth assessments help to detect strange patterns or biases in the data and to determine the source of the errors, allowing the publishers to correct them at the source. This part has three chapters.

Second chapter: GBIF position paper on future directions and recommendations for enhancing fitness-for-use across the GBIF network

This chapter presents a thorough assessment of the geospatial aspect of the data indexed by GBIF. The quality and fitness-for-use of the data are examined in order to arrive at a series of recommendations for the different strata of the biodiversity community, from data users and managers to the GBIF network itself. GBIF commissioned this study in 2010 from Andrew Hill (first author), who developed a Position Paper in collaboration with Javier Otegui, Arturo H. Ariño and Robert Guralnick. The paper is available online at http://www.gbif.org/orc/?doc_id=2777

Third chapter: On the dates of the GBIF mobilised primary biodiversity data records

For this chapter, we took all the temporal information available through GBIF and performed in-depth quality and usability assessments. We extracted many different patterns that revealed a number of biases and mistakes at different points of the information sharing workflow, from data sampling to indexing by the mechanisms of GBIF. The paper describing this study has been accepted for publication in a special issue of the Biodiversity Informatics journal.

Fourth chapter: Assessing the primary data hosted by the Spanish node of the Global Biodiversity Information Facility (GBIF)

In this chapter, we followed a different approach. Instead of assessing all the data available for a certain feature, we took a representative biodiversity institution and assessed the data it made available. The selected institution was the Spanish node of GBIF, an initiative that coordinates and hosts biodiversity records from other institutions such as museums and herbaria. We defined some recommendations based on the biases we found in the data. The paper describing this study has been submitted to the PLoS ONE journal and is currently under revision.

Third part: On the development of a data analysis web application

The third part presents the publication related to the deployment of a web application. This application allows the creation of visualizations and basic quality assessments of the data publishers’ collections of records. This part has one chapter.

Fifth chapter: BIDDSAT: visualizing the content of biodiversity data publishers of the GBIF network

This paper presents BIDDSAT, the ‘BIoDiversity DataSets Assessment Tool’, and describes the need for such an online application and the possibilities it enables. The tool is intended to be an online visualization environment for the PBD collections of the data publishers of the GBIF network. Its aim is to allow the general public, and especially the collection managers of the data publishers, to perform a series of data visualizations that detect the most common issues, and thus to contribute to the general processes of improving quality and fitness-for-use. This paper has been accepted for publication in the Bioinformatics journal.

Fourth part: On the uses of the data available through GBIF

The fourth part presents two examples of the use of the GBIF index as a source of primary biodiversity data. The high data availability that GBIF represents allows the development of data-rich research that would otherwise have been hard or even impossible to perform. This part has two chapters.

Sixth chapter: Primary biodiversity data records in the Pyrenees

This chapter describes a study of the state of knowledge on the biodiversity of the Pyrenean mountains. The area is studied in order to determine the volume and quality of the primary biodiversity records accessible through the GBIF index, especially regarding the three basic aspects of the data: geospatial, temporal and taxonomic. The paper describing this study has been accepted for publication in the Environmental Engineering and Management Journal.

Seventh chapter: Protected areas in the Spanish Pyrenees: a meaningful way to preserve biodiversity?

This chapter presents a simple environmental management study using the primary biodiversity data made available through GBIF. Here we try to determine whether the delimitation of a protected area enhances its diversity along the Pyrenean mountains, using one of the simplest measurements of biodiversity, species richness, as a surrogate. The paper describing this study has been accepted for publication in the Environmental Engineering and Management Journal.

Annex I: Source code of the BIDDSAT application.
Annex II: Report for GBIF.ES on which chapter four is based.
Annex III: Abstract of the ‘Sampling Biodiversity Sampling’ poster, presented at the Taxonomic Databases Working Group (TDWG) 2008 annual meeting in Perth, Australia.
Annex IV: Abstract of the ‘Noise in Biodiversity Data’ poster, presented at the eBiosphere 2009 meeting in London, UK.
Annex V: Abstract of the ‘Have Standards Enhanced Biodiversity Data?’ oral presentation at the Taxonomic Databases Working Group (TDWG) 2009 annual meeting in Montpellier, France.

Presentación (Presentation)

Motivation and General Aim

We are at a critical moment for everything related to biodiversity. Recent advances in computing techniques are spectacularly increasing the capacity to carry out scientific work, especially work based on massive data analysis (Guralnick et al., 2007), making possible new types of studies that until now were very difficult or even impossible to carry out (Bisby, 2000). A new branch of bioinformatics has recently emerged, called Biodiversity Informatics (BI), which seeks to apply these advances to the field of biodiversity studies (Johnson, 2007).

One of the major BI projects is to establish a single, free and open access portal to all the biodiversity information digitally available on the planet (Bisby, 2000). Today, the Global Biodiversity Information Facility (GBIF) is the largest initiative working towards this (Yesson et al., 2007; Boakes et al., 2010), and it does so by establishing a network of biodiversity data publishing institutions that open up their record collections using open data exchange standards.

Founded in 2001, GBIF [4] is an intergovernmental organization that currently comprises 53 countries and 43 international organizations. It is structured as a network of national nodes coordinated by an International Secretariat in Copenhagen, Denmark. GBIF's goal is to provide free and open access over the Internet to biodiversity data from all over the world, in order to support scientific research, promote biological conservation and foster sustainable development [5]. To this end, GBIF establishes links with the record collections of biodiversity data publishing institutions and prepares a central index with the most relevant information. This index enables access to those data through a single portal [6].

The main data that GBIF links to are called primary biodiversity data (PBD), and they are the most basic, interpretation-free unit of biodiversity studies: what organism has been observed or sampled, and where (Guralnick and Hill, 2009). A third field of great ecological value is usually added, indicating when the observation or capture took place.

The volume of data made available through GBIF keeps growing (GBIF, 2012) and, meanwhile, the focus of the scientific community is turning towards ensuring an adequate level of quality in those data (see for example Murgía and Villaseñor, 2000; Peterson et al., 2003; Peterson and Navarro-Sigüenza, 2004; Chapman, 2005a; Hortal and Jiménez-Valverde, 2008). Studies that assess the general quality of the data and their fitness for use in specific studies are on the rise. One of the aims of every data analysis is to design recommendations for improving data quality. However, without an efficient channel for communicating results, data managers cannot easily access these recommendations, and the improvement of data collections slows down (Hill et al., 2010; Jetz et al., 2011).

The general aim of this thesis is to contribute to the advancement of BI by developing data quality and usability analysis techniques applicable to large bodies of PBD, mainly those of the biodiversity data publishing institutions that make up the GBIF network, and by developing applications that improve feedback to data managers.

[4] http://www.gbif.org
[5] http://www.gbif.org/index.php?id=269
[6] http://data.gbif.org

Estructura y Objetivos Específicos Presentamos este trabajo de investigación en cuatro partes, cada una sobre un aspecto del mismo y con unos objetivos concretos. Introducción En la introducción del presente trabajo se sitúa al lector en el contexto de la biodiversidad, los datos primarios de biodiversidad, la informática aplicada a su estudio y GBIF y se efectúa una detallada revisión bibliográfica sobre estos temas. Primera parte: Sobre el análisis global del contenido del índice de GBIF En la primera parte se describe un análisis exhaustivo del contenido del índice central de GBIF. En este análisis se han utilizado tanto técnicas bien establecidas como de nuevo desarrollo para analizar el conjunto de datos y determinar su nivel general de calidad y usabilidad. Esta parte tiene un capítulo. Primer capítulo: Evaluación del contenido de los datos primarios de Biodiversidad publicados a través de la red de GBIF: estado, desafíos y potencial En este capítulo se analizan en profundidad los aspectos más básicos del contenido del índice

20

Presentation de GBIF en cuanto a calidad y adecuación a los estudios más comunes de biodiversidad. El artículo en el que se describe este análisis ha sido enviado a la revista Biodiversity Informatics y se encuentra en revisión. Segunda parte: Sobre los análisis detallados de los distintos aspectos del contenido del índice de GBIF En la segunda parte se describen análisis más detallados del contenido del índice de GBIF. Para ello, se han seguido dos aproximaciones distintas: evaluar los aspectos principales de los PBDs individualmente y el contenido completo de un editor de la red de GBIF. Estos análisis en profundidad son de ayuda para detectar pautas extrañas o desvíos en los datos y para llegar a la fuente del error, permitiendo así que los editores los corrijan en el origen. Esta parte tiene tres capítulos. Segundo capítulo: Position paper de GBIF sobre futuras direcciones y recomendaciones para mejorar la adecuación al uso en la red de GBIF Este capítulo muestra un análisis concienzudo del apartado geoespacial de los datos indexados por GBIF. Se abordaron los temas de la calidad y adecuación al uso de los datos para construir una serie de recomendaciones para los distintos estratos de la comunidad científica, desde los usuarios y gestores de datos a la propia red de GBIF. Este trabajo fue encomendado en 2010 por parte de GBIF a Andrew Hill – primer autor – quien desarrolló un Position Paper con la colaboración de Javier Otegui, Arturo H. Ariño y Robert Guralnick. El Position Paper es público y se puede encontrar en la siguiente dirección: http://www.gbif.org/orc/?doc_id=2777. Tercer capítulo: Sobre las fechas de los datos primarios de biodiversidad movilizados por GBIF El desarrollo de este capítulo se basó en toda la información del aspecto temporal disponible a través de GBIF, para realizar análisis de calidad y adecuación al uso en profundidad. 
Se extrajeron multitud de pautas que revelaron diversos desvíos y errores en distintos puntos del flujo que va de la toma de datos al indexado por parte de los mecanismos de GBIF. El artículo que describe este análisis ha sido aceptado para publicación en la revista Biodiversity Informatics. Cuarto capítulo: Evaluación de los datos primarios hospedados por el nodo español de la Infraestructura Global de Información de Biodiversidad (GBIF) En este capítulo se siguió una aproximación diferente a la del resto de los capítulos de esta parte. En vez de evaluar un aspecto de todo el volumen de datos disponible, se tomó una

21

Presentación institución de datos de biodiversidad representativa de la red de GBIF y se analizó todo su contenido. La institución seleccionada fue la Unidad de Coordinación de GBIF en España (o “nodo español de GBIF”), una iniciativa que ofrece alojamiento de registros de biodiversidad de otras instituciones nacionales tales como museos o herbarios. Se ha definido una serie de recomendaciones basadas en las pautas que se detectaron en los registros. El artículo que describe este análisis ha sido enviado a la revista PLoS ONE y se encuentra en revisión. Tercera parte: Sobre el desarrollo de la aplicación web de análisis de datos En la tercera parte se muestra la publicación asociada a la creación de una aplicación web que permite elaborar visualizaciones y análisis básicos de calidad sobre el cuerpo de datos de publicadores o colecciones individuales. Quinto capítulo: BIDDSAT: visualizando el contenido de los editores de datos de la red de GBIF En esta publicación se justifica la necesidad y se enumeran las posibilidades que ofrece la aplicación web BIDDSAT. Ésta pretende ser un entorno de visualización de los datos contenidos en las colecciones de los editores de datos de la red de GBIF. El objetivo de la aplicación es poner en manos del público general, y en particular de los gestores de datos de las instituciones publicadoras, una serie de visualizaciones de datos para que se puedan detectar los fallos más comunes y así contribuir con los procesos generales de mejora de calidad y adecuación al uso de los datos. Este artículo ha sido aceptado para publicación en la revista Bioinformatics. Cuarta parte: Sobre los usos de los datos accesibles a través de GBIF En la cuarta parte se muestran dos ejemplos del uso del índice de GBIF como fuente de PBDs. La elevada disponibilidad de datos que el índice representa permite llevar a cabo investigaciones que de otra forma serían muy difíciles o incluso imposibles. Esta parte tiene dos capítulos. 
Chapter six: Primary biodiversity records in the Pyrenees

This chapter describes a study of the volume of PBD known for the Pyrenees mountain range. The area was studied to determine the quantity and quality of the three basic aspects of PBD (geospatial, temporal and taxonomic) that are available through the central GBIF index. The article describing it has been accepted for publication in the Environmental Engineering and Management Journal.


Chapter seven: Protected areas in the Spanish Pyrenees: effective at preserving biodiversity?

This chapter presents a simple environmental management study made possible by the increase in data accessibility that GBIF represents. It tries to determine whether designating a protected area, with the legislation that this entails, improves the diversity value of that area in the Pyrenees, using one of the simplest measures of biodiversity, species richness, as an indicator. The article describing it has been accepted for publication in the Environmental Engineering and Management Journal.

Annex I: Source code of the BIDDSAT tool.
Annex II: Report for GBIF.ES on which chapter four is based.
Annex III: Abstract of the poster ‘Sampling Biodiversity Sampling’, presented at the 2008 annual conference of the Taxonomic Databases Working Group (TDWG) in Perth, Australia.
Annex IV: Abstract of the poster ‘Noise in Biodiversity Data’, presented at the eBiosphere 2009 conference in London, United Kingdom.
Annex V: Abstract of the oral presentation ‘Have Standards Enhanced Biodiversity Data?’, given at the 2009 annual conference of the TDWG in Montpellier, France.


INTRODUCTION

Introduction

Importance of studying biodiversity

The importance of biodiversity

‘Biodiversity’, a term that emerged some twenty years ago, encompasses all living things and, according to some views, the relationships among them and with their environment and their ecosystems (Lane, 2003; Scholes et al., 2008). As the Convention on Biological Diversity says, biodiversity “means the variability among living organisms from all sources including, inter alia, terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems” (Convention on Biological Diversity, Art. 2, par. 1). Biodiversity has come to refer to the number, variety and variability of organisms on Earth and, in a broad sense, has become a synonym for Life on Earth (WCMC, 1992).

When we discuss ‘global biodiversity’, we naturally realize its direct relationship with Earth’s resources, as these include living beings that we often resort to (and, indeed, depend on). It is then easy to perceive its intrinsic importance. Biodiversity also has value from a strictly anthropocentric point of view: it provides economic benefits, has aesthetic value and is an insurance for the future (Erlich and Erlich, 1992). The Convention on Biological Diversity, an international program created for developing legal instruments for biodiversity preservation, states in its preamble that biodiversity is important for evolution, for maintaining life and for sustaining the biosphere, and is, in the end, a concern of humankind⁷.

This importance is now further heightened by the realization that much of biodiversity is in jeopardy. Numerous studies have pointed to changes in biodiversity (see for example Perfecto et al., 1997; Worm et al., 2006; Norris et al., 2010), at both the local and the global scale, generally prompted by human development and its environmental consequences. There is a societal perception, backed by hard data in some instances (see for example Zwick, 1992; Pivello et al., 1999; Bax et al., 2003; Gozlan et al., 2005; Hölker et al., 2010), that biodiversity is under threat, and this is perceived as undesirable from the anthropocentric point of view. Research has identified several threats to biodiversity, some of which are human-induced.

7 http://www.cbd.int/convention/articles/?a=cbd-00


Threats to biodiversity

According to Margalef (1998), threats to biodiversity have always existed. Species population dynamics are subject to some sort of internal regulation that maintains overall diversity by keeping imbalances in check. Populations involved in predator-prey dynamics, for example, tend to threaten diversity in their ecosystem if one or the other grows too fast: typically, an imbalance between the two populations leads either to overkill by predators, which then go extinct themselves through the subsequent famine, or to resource exhaustion when predators fail to check the prey’s growth. Nevertheless, events sometimes occur, especially climatic changes, that threaten this dynamic equilibrium. An example is the Pleistocene large mammal extinction, caused primarily by changes in environmental conditions to which the megafauna was unable to react in time.

However, the increasing pre-eminence of the human species started to change things. Nowadays, our species’ growing numbers, spread and per-capita use of resources leave a heavy footprint on the entire biosphere. We, for instance, take biomass from the environment at a dramatically higher rate than any other species, which helps opportunistic species to spread and thus unbalances entire ecosystems. Besides, because Nature and mankind are widely and mistakenly understood as different, perhaps uncoupled systems, the latter believes itself entitled to exploit the former beyond its recovery rate, thus causing its regression. Threats posed by mankind are not new, but they did not previously seem to occur at the current rate. Among the most important are habitat loss and fragmentation, species extinction, retraction of distribution areas and ecological invasions.

Data-supported decision making

Any solution to the current biodiversity crisis, if feasible at all, relies on adequate environmental policies that lead to the right environmental management decisions. These decisions need to be well founded, and knowledge, judgment, past experience, models and, most of all, data are so far the only suitable foundations. All things being equal, with good data, decisions can be wrong by error or for lack of an adequate theory or model; but with no data or bad data, decisions can only be right by chance. As time goes by, more environmental management programs based on solid and sufficient biodiversity data are needed (Mace, 2005), because these data make conclusions more reliable and thus improve the outcome of any management program. Current analyses help in designing policies for all the above-mentioned threats, and examples of studies that use biodiversity data are plentiful (e.g. Peterson et al., 2002; Soberón and Peterson, 2004; Soberón et al., 2007; Beck and Kitching, 2007).

Data sources and available data

Perhaps the main source of existing, potentially available biodiversity data is the set of specimen collections deposited in museums and research institutions. Museums document and study life on Earth, focusing on living organisms’ characteristics, history, patterns and processes (Krishtalka and Humphrey, 2000). Museums hold organisms collected directly from the environment (‘specimen vouchers’), and these specimens are maintained in the long term so that they can be accessed whenever necessary for scientific purposes. Thus, museums turn out to be less biased information-gathering systems than field sampling (Scoble, 2000). Currently, an estimated 2 to 3 billion specimens are curated in museums world-wide (Edwards et al., 2000; Guralnick and Neufeld, 2005; Ariño, 2010). Each of those vouchers ideally carries information on the time and place the specimen was collected, and on its taxonomic identification. These basic data (‘what, where, when’) are called primary biodiversity data (PBD). With such data, scientists are potentially able to analyze species’ patterns in time and space, and to draw distribution maps, migration paths and other elements of discovery (Brooke, 2000).

However, not all of these gathered data are readily available. For data to be available world-wide, they have to be turned into a digital format (by means of a digitization process). Nowadays, very little of the existing data is digitized (Ariño, 2008), and many museum collections and field survey data, especially from older collections (which are therefore even more valuable for trend analysis), still exist only on paper. This limited availability may be a problem when assessing the consistency of some data analyses, since not all existing data are represented or taken into account.

The need for more and better data

The comparatively small amount of currently available data, together with the demand for new information, leaves current decision-making poorly founded and mostly incomplete (Krishtalka and Humphrey, 2000). New technological developments may help in obtaining more reliable biodiversity data and, in turn, better environmental management policies. The underlying issue, a lack of sufficient data, has two aspects: (a) existence of data, which can be gathered through studies yet to be done, and (b) availability of data, which may have been collected but may not be readily accessible because they are locked in unavailable formats, places, encodings or systems, or because too little is known about the data themselves (metadata availability).

Biodiversity data types

Until now, we have dealt with biodiversity data as a whole. These data can, however, be divided into many groups. For instance, based on the immediacy of data gathering, we may distinguish primary and secondary biodiversity data.

Primary Biodiversity Data

Information gathered directly in the field is called primary biodiversity data (PBD). It is the main source of information, from which the rest can be derived (Lira-Noriega et al., 2007). A primary datum is a piece of information identifying ‘what’ has been collected or seen ‘where’ (Johnson, 2007). Apart from these two mandatory pieces of information, a PBD might also add the temporal dimension by recording the moment of sampling or observation.

Occurrences

Primary data can in turn be divided into two groups: observations and specimen collections. Observations are easier for the researcher to obtain, but they are less reliable than specimen collections, at least regarding taxonomic information: often, an observation cannot be confirmed by thoroughly examining the specimen, but must rely on first-instant recognition. Specimen collections mean more work for the scientist, but information obtained this way is, or at least is seen as, more authoritative (Scoble, 2000): the specimens can be examined thoroughly and, if need be, subjected to additional tests such as DNA barcoding.

Names: vernacular and scientific

Although the ‘when’ and ‘where’ pieces of information are, to a certain extent, well defined, giving a name to ‘what’ has been caught or observed is a more complicated task. Organisms have names, given by humans. Before Linnaeus (Linnaeus, 1758), living things only received the names people gave them, and these names varied depending on where they were given. Multiple vernacular names could be given to a single species. The Swedish botanist established a binomial system to identify species, giving them a scientific name in addition to the vernacular ones they had. Being able to translate between scientific and vernacular names solves the problem and allows either of them to be used as a primary biodiversity datum.

Taxonomic hierarchies and taxonomic concepts

Linnaeus also provided a way to classify living organisms according to the features they have in common, which we call taxonomy. Taxonomic hierarchies, i.e. the classification and grouping of organisms, are subjective; they depend on the expertise and point of view of the taxonomist. As an example, and in the simplest terms (the issue can be highly complex), different taxonomies are known to exist, and a single name can be used to describe different ‘species’ (homonymy), or multiple names can be attached to the same ‘species’ (synonymy), according to different taxonomies or authorities. Thus, it is important to distinguish between two notions: taxonomic names and taxonomic concepts. A taxonomic name is a concatenation of characters in a unique string that is applied to a certain taxonomic rank (kingdom, phylum, class, etc.). Although names are the link between different sources of information, they are not enough to uniquely identify an organism (Page, 2006). A taxonomic concept represents a taxonomic name associated with a specific taxonomic hierarchy. The same name can be tied to different taxonomies, giving different taxonomic concepts, and different names have different taxonomic concepts. Concepts thus avoid synonymy and homonymy issues and ease the unique identification of organisms.
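The difference between a taxonomic name and a taxonomic concept can be illustrated with a minimal Python sketch. The class `TaxonConcept`, its field `according_to`, the placeholder name ‘Aus bus’ and the two taxonomies are all hypothetical and chosen only for illustration; they do not correspond to any particular data model or real taxon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A taxonomic name tied to the taxonomy that defines it (hypothetical structure)."""
    name: str          # the taxonomic name: a plain character string
    according_to: str  # the taxonomy or authority in which the name is used


# Hypothetical case of homonymy: the same name used in two different taxonomies.
concept_a = TaxonConcept("Aus bus", "Taxonomy A")
concept_b = TaxonConcept("Aus bus", "Taxonomy B")


def same_name(x, y):
    """Compare only the name strings."""
    return x.name == y.name


def same_concept(x, y):
    """Compare name and taxonomy together."""
    return x == y
```

Under these assumptions, `same_name(concept_a, concept_b)` is true while `same_concept(concept_a, concept_b)` is false: the name alone links the two records, but only the name together with its taxonomy tells them apart.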

Secondary (derived) Biodiversity Data

Primary data have intrinsic value, for example for preparing lists of the species known in a given place, but these kinds of analyses may be losing value against more complex ecological modelling (Woodruff, 2001), or may perhaps serve as starting points for it. As ecosystems are complex systems, owing to the large number of interactions on which they depend, interactions among data types have to be measured and calculated to yield correct and valuable information. Mixing taxonomic (‘what’), temporal (‘when’) and spatial (‘where’) data, or different sets of them, allows for new kinds of analyses, such as temporal or spatial distributions of species, ecological niche modelling, or modelling of niche variation caused by, or related to, climate change (among others), which are widely used in ecological studies (Peterson et al., 1998; Schipper, 2008; Baselga et al., 2006; Fagan and Stephens, 2006).


Data Sources

Biodiversity data can be obtained from four main sources. Some of them are more type-specific than others, but overall, those sources are: specific databases, collection specimens, literature, and unstructured shared digital knowledge (the World Wide Web).

Databases and datasets

Databases store data in an organized way. They are composed of tables, which have fields, and each field holds a piece of semantic information (such as date, latitude or longitude). A database is commonly accessed through a query engine, which may or may not be part of the database itself, resulting in a system that lets a researcher extract specific types of data and perform operations on them. Most databases today are digital and rely on computer systems for storage, management and querying. While databases cover general topics, several initiatives gather information about one topic (which could be a place, a time span or anything else) and structure it. In this way they produce focused datasets, often related to main, general datasets. Whether we consider general-purpose databases or series of joint, related datasets, the types of data they hold are countless. Within the realm of biodiversity, however, we could group them into three main types (Shanmughavel, 2007): occurrence databases, taxonomic databases and nomenclators.

Occurrence databases

Occurrence databases store primary data and, often, auxiliary data in case the initiative needs to serve complex ecological analyses. The occurrences can be either field observations or museum specimens. Occurrence databases are the main way of quickly obtaining large amounts of data for ecological niche modelling.

Taxonomic databases

Taxonomic databases provide a hierarchical structure that allows an item to be related to dependent or parent items. They contain information about the taxonomy of a species, including synonymy, homonymy, different hierarchical classifications, etc. A taxonomic backbone is mandatory for the correct identification of specimens (Lobo, 2008). Thus, this kind of database is the keystone on which a complete biodiversity data network can be built.
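The table-and-field organization described above can be sketched with a tiny in-memory occurrence database built with Python's standard `sqlite3` module. The table name, the field names, the placeholder taxon names and the three records are invented for illustration only; they are assumptions and do not reproduce the schema of GBIF or of any real occurrence database.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table whose fields hold the 'what', 'where' and 'when' of each record.
conn.execute(
    "CREATE TABLE occurrence ("
    " scientific_name TEXT,"  # 'what'
    " latitude REAL,"         # 'where'
    " longitude REAL,"
    " event_date TEXT)"       # 'when'
)

# Invented example records with placeholder names.
records = [
    ("Aus bus", 42.7, -0.5, "2005-06-14"),
    ("Aus bus", 42.9, 0.1, "2006-07-02"),
    ("Cus dus", 42.6, 1.2, "2005"),
]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?)", records)


def records_per_name(connection):
    """Query the table: number of occurrence records for each name."""
    rows = connection.execute(
        "SELECT scientific_name, COUNT(*) FROM occurrence "
        "GROUP BY scientific_name ORDER BY scientific_name"
    )
    return rows.fetchall()


counts = records_per_name(conn)
```

Here the SQL statement plays the role of the query engine: the researcher states which fields and which aggregation are wanted, and the system returns the matching data.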


Nomenclators

Nomenclators are structured databases of names. These can, in turn, focus on different aspects of biodiversity. Taxon nomenclators, for instance, hold taxonomic names together with their correction history, misspellings and alternative forms. Another common type of nomenclator is the gazetteer: a database of locality names, with associated geographical data such as coordinates and elevations, which can be related to the PBD location or used for lookups. These databases help in correcting name misapplications and make it possible to obtain more comprehensive sets of data related to a name.

Museums

Although databases offer a fast and convenient way of obtaining biodiversity data, they are a relatively new way of doing so. As stated earlier, the main sources of these data have traditionally been museums (Ponder et al., 2001; Elzen et al., 2005). Museums hold specimen collections, properly identified and with associated data that allow almost all the calculations and analyses that databases do, and they are regarded as highly reliable sources of data. Since a specimen can be re-examined and verified by expert taxonomists, this piece of information has a higher value than an observation made by a scientist, whose correctness can hardly be assessed after the fact.

Each museum specimen, if properly curated, has a label with data concerning what it is, when and where it was seen, and often by whom. It also carries the history of corrections to its taxonomic identification, and data about the biotope in which the organism was caught (Morin and Gomon, 1993). Even if the specimen itself no longer exists at the museum, researchers may have a sort of access to it by means of museum accessions. Registers and ledgers may take account of specimens held in other museums of the world, which turns them into sources of specimen information. Often, museum accessions hold all the data the researcher needs, without any need to physically access the specimen voucher itself. Museums, therefore, curate not only the specimen vouchers, but also the information gathered from the vouchers or about the vouchers’ circumstances.

Literature

When new taxa are described, information concerning their morphology, distribution, taxonomic hierarchy and special characteristics is generally published in scientific journals, according to well-established publication and treatment codes, such as the ICZN⁸ or the ICBN⁹. These data are also considered primary data, which makes taxonomic papers a primary data source. Moreover, scientific papers in which primary specimen or observation data are used and offered are also taken as data sources, e.g. Wilson’s (Wilson, 1988). Some papers such as May’s (May, 1992) also represent a biodiversity data source, even if they are considered gray literature¹⁰.

Websphere

The World Wide Web (WWW) has turned out to be an excellent tool for data sharing. Initiatives that make their data accessible contribute to increasing data availability, thus permitting a wider use of biodiversity data and better policy making. In fact, the possibilities the Internet offers for interoperability have made it a good source of primary biodiversity data. Nevertheless, techniques for reliability assurance and authoritative systems need to be developed to tell good data from garbage.

8 International Code of Zoological Nomenclature: http://www.nhm.ac.uk/hosted-sites/iczn/code/
9 International Code of Botanical Nomenclature: http://ibot.sav.sk/icbn/main.htm
10 Gray literature: a group of publication types that do not follow the conventional channels of scientific dissemination through established scientific publishers, although this does not mean they are less reliable by default (for instance, some gray literature such as ‘white papers’ may indeed be peer-reviewed, and highly so). Examples of gray literature are technical reports, working papers or preprints.

Some new trends in Biodiversity studies

Data-Driven Science

According to Kelling et al. (2009), the workflow of ecological research, and of scientific research in general, has traditionally begun with a theoretical hypothesis, mainly established from expert opinion, while data gathering was performed with the aim of testing that hypothesis. For example, if we wanted to define a species’ distribution, previous knowledge of the biology of that species was mandatory in order to determine the best places, moments and methodology for sampling campaigns that could lead to the confirmation or rejection of the initial hypothesis. Although well established and widely used, this approach has its downsides, both theoretical (the species’ biology might be unknown) and practical (data sampling can be hard or even impossible to perform; Bisby, 2000), and many of the greatest findings in the history of science followed a different path. See for example Kepler’s laws or Darwin’s theory of evolution. Kepler managed to deduce mathematical relationships between the orbital periods and axes of celestial bodies after a series of observations. Similarly, when Darwin and Wallace sailed, their aim was to catalogue all known species, and the seeds of the theory of evolution came only after gathering and managing the right information. None of them started with an initial hypothesis and then took the data.

As a counterpoint to this so-called ‘hypothesis-driven’ science, there is a paradigm built upon the use of a massive body of data from which theories are formulated and new features are discovered. Continuing with the previous example, using the presence and absence records for a specific taxon together with large sets of environmental data, both the current and the potential distribution area of that taxon can be estimated. Nowadays, this ‘data-driven’ science (also called ‘data-intensive’ science) is on the increase owing to recent advances in computer science, mainly falling prices and growing storage and processing capacity, and to the huge availability of data. These features allow the information to come out of the data. For this approach to be effective, data should be well documented and adequately preserved through a new workflow that enables pattern discovery and/or hypothesis confirmation in complex systems (Lynch, 2008).

This topic has aroused much discussion in the scientific community, especially focused on trying to determine which approach is best (see for example Allen, 2001a, 2001b, 2001c; Kelley and Scott, 2001; Smalheiser, 2002; Wilkins, 2001; Gillies, 2001; Niemeijer, 2002). Some authors see both paradigms as complementary rather than opposed, especially in data-rich but hypothesis-poor disciplines (Kell and Oliver, 2003). Despite all the advantages it yields for a field like biodiversity research, this scientific paradigm has its negative aspects.
Some of the features that must be taken into account are inherent to the data-driven science concept: the informatics infrastructure (data storage and processing equipment must fulfill some advanced requirements); the persistence of the information (since data are not use-specific, they should ideally be preserved indefinitely, so that any researcher can use them at any moment for any new analysis); and the massive volume of data (databases must be properly mined in order to retrieve what is actually being looked for; Narayanan et al., 2003). Besides, in this kind of science it is fairly easy to fall into the “cum hoc ergo propter hoc” fallacy, and we should remember that correlation does not imply causation.

One of the main aspects that could slow data-driven biodiversity research down is the low (or even absent) recognition given to the key role of primary data gathering and publishing in the general workflow. Without a large enough body of primary data, this path is not profitable. Progress is being made on this topic, promoted by large biodiversity data initiatives, with the aim of establishing a framework for giving academic recognition to institutions that provide useful and used primary biodiversity data. A remarkable case is the proposal by Chavan and Ingwersen (2009, 2011), which discusses the concept of ‘data papers’, a type of scientific publication derived from PBD databases that has already begun to appear (Narwade et al., 2011).

New disciplines

The use of new developments in informatics to solve some of the current needs regarding biodiversity data access has given rise to a new sub-discipline of informatics (Sarkar, 2007). The field stems from the original bioinformatics as defined by Biodiversity Information Standards (formerly TDWG) in the early 1980s, before the term shifted to the newer field of molecular bioinformatics. The fast rate at which data volume is growing and access is improving, the reliability, variety and resolution of explicit electronic data, and new initiatives established the basis and rationale for the development of the current biodiversity informatics field (Soberón and Peterson, 2004), thus solidly establishing one of the two branches of bioinformatics. Biodiversity informatics is defined as the application of information technologies to the management, algorithmic exploration, analysis and interpretation of primary data regarding life, particularly at the species level of organization (Soberón and Peterson, 2004; Johnson, 2007). Its aim is to enhance biodiversity studies by applying information management tools to the management and analysis of species occurrence, taxonomic character and image data (Schnase et al., 2003; Paton, 2009).

The Global Biodiversity Information Facility (GBIF)

Background, history and structure

With all these advances in informatics, many initiatives are up and running with the aim of improving the quality of service in biodiversity infrastructure. Efforts are more or less centralized in a few big frameworks, each with its own aim and rationale. The Global Biodiversity Information Facility (GBIF) is the largest and best-known initiative in this field. It was established in 2001 as an outcome of the Organisation for Economic Co-operation and Development (OECD) ‘Mega Science Forum Working Group’, with the aim of “making the world's primary data on biodiversity freely and universally available via the Internet”. Its purpose is to promote, coordinate, design and implement the compilation, linking, standardization, digitization and global dissemination of biodiversity data world-wide (GBIF, 2006). It is organized as a decentralized network of biodiversity information facilities, established and maintained by national and regional participant nodes, which are coordinated by an International Secretariat in Copenhagen, Denmark, and governed by an international, elected board. As of May 2012, the GBIF network allows open access to more than 367 million biodiversity records and is built upon 60 Participant Nodes, which coordinate 406 data publishers that share 8,872 data sources.

The mission of GBIF

Participant Nodes are the administrative units “in charge of deploying informatics infrastructure, building capacity, promoting policies on open access to biodiversity data, supporting data holders in the process of mobilising and publishing data, and coordinating the development of information products and services for target audiences”¹¹. Data publishers, on the other hand, can be research centers, universities, natural history museums or biodiversity information networks, among others (Telenius, 2011). These institutions publicly share their collections of PBD (also called data resources) and establish an information stream from their databases to GBIF's internal mechanisms, where the records are indexed. In the end, GBIF offers a data discovery and access tool (the data portal) that returns the indexed information with links to the original data.

There has been some misconception within the biodiversity community that GBIF is a biodiversity data repository that holds ownership of the data. In fact, GBIF’s role in the scientific community is that of an aggregator of biodiversity data sources. GBIF works as a proxy for biodiversity information and, even though it builds and caches an index of the main PBD aspects in order to speed up information query and retrieval, it does not claim any ownership of, or rights over, the data at any point in this process (GBIF, 2006). The data publishers are the ultimate owners and sources of the records and hold their intellectual rights, which makes them responsible for the quality assurance of their content (Chapman, 2005).

11 http://www.gbif.org/participation/participant-nodes/who-we-are/


PART I: ON THE GLOBAL ASSESSMENT OF THE CONTENT OF THE GBIF INDEX

First Part: Global Assessment of GBIF

Data quality and fitness-for-use

As argued in the introduction, the need for a large enough body of data for biodiversity research has been firmly established. Good and reliable conclusions can only be reached with sufficient data, which enable pattern detection and satisfy statistical requirements. Alas, even though this feature is necessary, it is not sufficient. Data quality issues should also be taken into account. As in computer science, the output of biodiversity studies that make use of garbage data will be mostly garbage (the GIGO principle: Krebs, 1999).

In addition to data quality, another feature has to be taken into account: usability. Data usability can be defined as the suitability of a set of data for the intended purpose. Usability can also be referred to as ‘fitness-for-use’ (Hill et al., 2010). Data quality and fitness-for-use should not be confused. Data quality refers to an intrinsic characteristic of data at an abstract level, while fitness-for-use deals with the aim for which the data are collected (Juran et al., 1974; Morrison, 1995; Aalders and Morrison, 1998; Aalders, 2002; Dassonville et al., 2002; Devillers and Jeansoulin, 2006). The completeness of a PBD record, the integrity of a digitization or the trustworthiness of the information are data quality issues. On the other hand, the representativeness of the data or the granularity of the information for a specific study are fitness-for-use issues.

A set of dates that is complete only to year level is a low-quality set of data compared with data having fully specified dates, since it lacks the information needed to detect seasonal patterns (see ‘Introduction’ in Chapter three). Theoretically, such a set will not be suitable for many studies. However, when seasonality is not relevant, the same dataset can be suitable for research in which only year values matter, as in examining range shifts for migratory species (see for example Thomas and Lennon, 1999). The presence of month or day values would not improve the conclusions in this case, and the data set would have a high fitness-for-use.

While the fitness-for-use assessment of a data set is associated, by definition, with the aim of the data set, quality assessment can be seen as a generic tool for assessing the reliability of information. Even though we may not know the purpose for which the data were collected, a good-quality data set is more likely to show a high fitness-for-use as well. Thus, a first step before using any data set is to assess its quality and, if possible, its fitness-for-use.
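The date example can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that dates are stored as well-formed, non-empty strings of the form ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’; the function names and the sample values are hypothetical and are not taken from any tool described in this work.

```python
def date_granularity(date_string):
    """Return 'year', 'month' or 'day' for 'YYYY[-MM[-DD]]' strings (assumed format)."""
    parts = date_string.split("-")
    return {1: "year", 2: "month", 3: "day"}[len(parts)]


def fit_for_use(dates, required):
    """True if every date is at least as fine-grained as the study requires."""
    rank = {"year": 0, "month": 1, "day": 2}
    return all(rank[date_granularity(d)] >= rank[required] for d in dates)


# Invented data set whose dates are complete only to year level.
dates = ["1998", "2001", "2004"]

suits_yearly_study = fit_for_use(dates, "year")     # only year values matter
suits_seasonal_study = fit_for_use(dates, "month")  # seasonal patterns need months
```

The same set of dates is fit for a study that only needs year values and unfit for one that looks for seasonal patterns, although its intrinsic quality has not changed between the two calls.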


Data quality assessment

The scientific community seems divided over whether it is better to publish fewer but thoroughly checked data, or to publish more data even if they are known to be incomplete or not completely reliable (Edwards, 2004; Suarez and Tsutsui, 2004; Wheeler et al., 2004). Each option has its advantages and disadvantages, and both points of view have promoters and detractors. However, Peterson and Navarro-Sigüenza (2003) suggest that large sets of unchecked (or at least not fully checked) data can be used as a tool for error detection. With adequate techniques and methodologies, errors can stand out from the main body of correct data. A simple example is shown in figure 1. We begin with a set of coordinates that should, ideally, fall on the United States of America. When the records are plotted on a map, we observe that most of them do fall on the US, but we also see a mirrored image to the south-east. With this simple representation of the records, we have detected both the wrong records and the cause of the error, which in this case is an incorrect storage of coordinates, with the latitude value stored in the longitude field and vice versa.

Figure 1: Spatial representation of the records of a single North-American data publisher. A mirrored and rotated image of the U.S. east coast located in Antarctica reveals a swap in the coordinate fields: latitude value in longitude field and vice versa.
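The coordinate swap in this example can also be flagged programmatically. The following is a minimal sketch, not the procedure used in this work: the bounding box values and the names `BoundingBox`, `US_BOX` and `classify` are illustrative assumptions. A record is labelled as swapped when it falls outside the expected region but falls inside it once latitude and longitude are exchanged.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough box around the conterminous United States (illustrative values only).
US_BOX = BoundingBox(24.0, 50.0, -125.0, -66.0)


def classify(lat: float, lon: float, box: BoundingBox = US_BOX) -> str:
    """Label a record as 'ok', 'swapped' or 'other' relative to the expected box."""
    if box.contains(lat, lon):
        return "ok"
    # A swapped record becomes valid once the two fields are exchanged.
    if -90.0 <= lon <= 90.0 and box.contains(lon, lat):
        return "swapped"
    return "other"


records = [(40.7, -74.0), (-74.0, 40.7), (48.9, 2.3)]
labels = [classify(lat, lon) for lat, lon in records]
```

A rectangular box is a deliberately crude stand-in for a country outline; it is enough to separate the mirrored cluster from the correctly placed records in a case like the one in Figure 1.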

Finding the proper methods to assess or plot the data is essential for detecting the issues we are looking for. In many cases these assessments need not be complex; a simple, direct visualization of the data can unveil many patterns and reveal trends in the information (Geng et al., 2011).


First Part: Global Assessment of GBIF

Assessment methodologies

A first step when assessing a set of data should be to represent them through a series of basic visualizations, that is, graphic representations of specific features of the data over one or more defined axes (Guralnick et al., 2007). Apart from giving a picture of the current status of the data set, this methodology makes it easy to detect the most common patterns. The main advantage of visualizations over more complex analyses is precisely their simplicity, which does not make them any less valid for detecting the most common issues. When dealing with biodiversity data, it is highly advisable to build different visualization types, since each of them helps in detecting real and artifactual trends (Viégas et al., 2007).

Throughout the development of our research, quality and fitness-for-use assessments have been carried out over large sets of primary biodiversity data, mainly extracted from the central indexes of GBIF. To accomplish this, well-established visualizations as well as newly developed ones have been used. To avoid duplicating information, we will not list here all the visualizations used or explain the patterns they allow us to detect; that information is given in the ‘Introduction’ and ‘Methods’ sections of the following chapters. Nonetheless, three heavily used visualization types are explained here: maps, chronhorograms and hebdoplots.

Among the existing geospatial visualizations, record density maps have been especially useful and are widespread; institutions that use maps for visualizing their records include GBIF [12], the Ocean Biogeographic Information System (OBIS) [13] and the Avian Knowledge Network (AKN) [14]. These maps are spatial representations of the PBR volume for each existing latitude-longitude combination.
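The counting step behind a record density map can be sketched as follows. This is an assumed, minimal illustration rather than the code used for the maps in this work: the function name `density_grid` and the one-degree default cell size are hypothetical choices. Each record is assigned to the grid cell that contains it, and the number of records per cell is what a colored scale would then display.

```python
import math
from collections import Counter


def density_grid(coords, cell_size=1.0):
    """Count records per grid cell; cells are keyed by their south-west corner."""
    counts = Counter()
    for lat, lon in coords:
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue  # skip coordinates outside the legal range
        cell = (math.floor(lat / cell_size) * cell_size,
                math.floor(lon / cell_size) * cell_size)
        counts[cell] += 1
    return counts


# Two records share a one-degree cell; the last one has an illegal latitude.
sample = [(42.8, -1.6), (42.1, -1.2), (55.7, 12.6), (95.0, 10.0)]
grid = density_grid(sample)
```

Because the cells are defined directly in decimal degrees, the resulting counts can be drawn on an equirectangular map without any further coordinate transformation.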
GBIF’s data schema allows for fast and easy geographical exploration, since coordinates are stored in decimal degrees, which can be plotted directly over an equirectangular projection of the Earth’s surface.

12 http://data.gbif.org/tutorial/maps
13 http://iobis.org/maps
14 http://www.avianknowledge.net/content/features/archive/data-visualization

We also found useful two visualizations conceived by the candidate’s advisor to represent the temporal aspect of the PBD. The ‘chronhorogram’ (Ariño and Otegui, 2008) is a polar representation of dates in which the radius represents the year (the year at the center of the plot is arbitrary; we have set it to 1750) and the angle represents the day of the year, from January 1st (0º) to December 31st (359º). As in maps, record density is represented through a colored scale. The ‘hebdoplot’ (Ariño and Otegui, 2008, 2009) represents the temporal aspect of the PBD according to the day of the week across months; with this plot, some aspects of the sampling methodology can be spotted.

Next, chapter one is presented. The research was performed in parallel by the team led by Dr. Samy Gaiji, Senior Programme Officer for Science and Scientific Liaison at the GBIF Secretariat, and the team at the University of Navarra led by the candidate’s advisor. The candidate has contributed significantly to this research, and the resulting paper has been submitted to the journal Biodiversity Informatics and is currently under revision.
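As a minimal sketch of the two temporal visualizations described in this section, the functions below compute the values that would be plotted, not the plots themselves. The function names are hypothetical, and the linear scaling of the day of the year onto 360 degrees is an assumption; only the center year of 1750 comes from the text.

```python
import calendar
from collections import Counter
from datetime import date

CENTER_YEAR = 1750  # center of the plot, as set in the text


def chronhorogram_point(d: date):
    """Return (radius, angle in degrees) for a date in a chronhorogram."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_index = d.timetuple().tm_yday - 1  # 0 for January 1st
    angle = 360.0 * day_index / days_in_year
    radius = d.year - CENTER_YEAR
    return radius, angle


def hebdoplot_counts(dates):
    """Count records per (month, weekday) cell; weekday 0 is Monday."""
    return Counter((d.month, d.weekday()) for d in dates)
```

In both cases the density of records falling on the same point or cell is what the colored scale would encode; a strong excess of records on particular weekdays in the second function's output is the kind of sampling artifact the hebdoplot is meant to reveal.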


CHAPTER ONE

CONTENT ASSESSMENT OF THE PRIMARY BIODIVERSITY DATA PUBLISHED THROUGH GBIF NETWORK: STATUS, CHALLENGES AND POTENTIALS

Gaiji S, Chavan V, Ariño AH, Otegui J, Hobern D, Sood R, Robles E

Manuscript submitted to Biodiversity Informatics

Content Assessment of the PBD Published through GBIF

Abstract

With the establishment of the Global Biodiversity Information Facility (GBIF) in 2001 as an inter-governmental coordinating body, concerted efforts have been made during the past decade to establish a global research infrastructure to facilitate the publishing, discovery, and access to primary biodiversity data. The participants in GBIF have enabled access to over 323 million records of such data as of February 2012. This is a remarkable achievement involving efforts at national, regional and global levels in multiple areas such as data digitization, standardization and exchange protocols. However, concerns about the quality and ‘fitness for use’ of the data mobilized, in particular for the scientific communities, have grown over the years and must now be carefully considered in future developments. This paper is the first comprehensive assessment of the content mobilised so far through GBIF, as well as a reflection on possible strategies to improve its ‘fitness for use’. The methodology builds on complementary approaches adopted by the GBIF Secretariat and the University of Navarra for the development of comprehensive content assessment methodologies. The outcomes of this collaborative research demonstrate the immense value of the GBIF-mobilized data and its potential for the scientific communities. Recommendations are provided to the GBIF community on the adoption of common indicators to assess data quality as well as priorities for future data mobilization.

Keywords: Primary Biodiversity Data, Content Assessment, Gap Analysis.


Summary

With the establishment of the Global Biodiversity Information Facility (GBIF) in 2001 as an intergovernmental coordinating body, great efforts have been made to establish a global research infrastructure that facilitates the sharing of and access to primary biodiversity data. As of February 2012, the GBIF participants have enabled access to more than 323 million such records. This is a remarkable achievement, obtained thanks to efforts in multiple areas such as data digitization, standardization and exchange protocols at national, regional and global levels. However, doubts about the quality and fitness for use of the mobilized data, in particular for the scientific communities, have grown over the years, and they must be scrupulously taken into account in future developments. This article is the first comprehensive assessment of the content mobilized so far through GBIF, as well as a reflection on possible strategies to improve its fitness for use. The methodology builds on two complementary approaches taken by the GBIF Secretariat and the University of Navarra for the development of comprehensive content assessment methodologies. The results of this collaborative research demonstrate the immense value of the data mobilized by GBIF and their potential for the scientific communities. A series of recommendations is proposed to the GBIF community on adopting common indicators that assess data quality and that indicate priorities for future data mobilization.

Keywords: Primary Biodiversity Data, Content Assessment, Gap Analysis


Introduction

Free and open access to primary biodiversity data is essential both to enable effective decision-making and to empower those concerned with the conservation of biodiversity and the natural world (Bisby, 2000; Gaikwad and Chavan, 2005; GBIF, 2008). However, the publishing of primary biodiversity data has a very recent history. With the establishment of the Global Biodiversity Information Facility (GBIF) in 2001, concerted efforts to publish primary biodiversity data using community-driven and agreed standards and tools gained momentum. GBIF was created to facilitate free and open access to biodiversity data worldwide, via the Internet, to underpin scientific research, conservation and sustainable development. The GBIF network, through its data portal (http://data.gbif.org), already facilitates access to over 323 million records from more than 300 data publishers. The progress achieved in GBIF’s first decade indicates that the development of a global informatics infrastructure, facilitating free and open access to biodiversity data, is indeed a realistic aspiration. One of the key future challenges for GBIF is now to ensure that such a volume of knowledge about biodiversity on Earth is indeed of high relevance for the scientific communities.

Why assess the content of GBIF-mobilised data?

Despite GBIF’s achievements, questions are frequently raised about whether it can yet be considered a global facility (Yesson et al., 2007), and about the usefulness of the data mobilised. GBIF has been criticised for the taxonomic, thematic, geospatial and temporal biases in the data mobilised by its network of data publishers (Johnson, 2007). There have been isolated studies to assess gaps, quality and fitness for use of GBIF-mobilised data (e.g. Guralnick et al., 2007; Collen et al., 2008; GBIF, 2010a). In 2010, an initial overview of the data published through the GBIF network (GBIF, 2010b) provided a first set of indicators on the content mobilised so far, as well as on major biases such as taxonomic and temporal ones. Recognising this, the GBIF-constituted Content Needs Assessment Task Group (CNATG) recommended that assessment of GBIF-mobilised content at various levels (global, regional, national and thematic) is crucial for determining the demand-driven approach for data mobilisation (Faith et al., 2011, this volume). In 2011, in response to these recommendations, a series of improvements to the GBIF infrastructure were made, such as the rework of the GBIF ‘backbone taxonomy’ with up-to-date checklists and taxonomic catalogues such as the Catalogue of Life 2011. In addition, the automated interpretation of the coordinates, country location and scientific names used in published records has been improved to screen out inaccuracies, for example by ensuring that records identified as coming from a particular country are shown as occurring within the borders and territorial waters of that country.

The current study attempts to assess the gaps and fitness for use of the GBIF-mobilised data. It aims to provide a comprehensive overview of the ‘state of the network’ for data published through the GBIF network. The assessment is aimed at demonstrating the value of the content mobilised and how it can contribute to an improved understanding of biodiversity, in particular by the scientific community. To achieve this objective, and taking into account the large volume of information to be analysed, the authors of this study adopted two complementary methodologies. One approach, led by the GBIF Secretariat (GBIFS), focused on two complete studies at different points in time (December 2010 and February 2012), while the Department of Zoology and Ecology at the University of Navarra (UNZYEC) focused on processing random samples of the full content. The research outputs of these two studies were compared and complemented each other. The outcomes of these two complementary exercises are presented in three categories: (a) data quality assessment, (b) trends/patterns assessment, and (c) fitness-for-use assessment.

Data flow of the GBIF network

Currently, the GBIF network comprises 345 data publishers in the biodiversity sciences from 44 countries and 15 international organisations. Together they publish 10,780 occurrence-based data resources (or datasets) through GBIF. Figure 1 depicts the typical flow of the data publishing processes through the GBIF network. Data publishers can use a variety of tools and protocols (e.g. DiGIR, BioCASE, Tapir, GBIF Integrated Provider Toolkit) and data standards (e.g. DwC and ABCD) in order to publish primary occurrence records to GBIF. After successful registration of their resources through the central registry, GBIF centrally indexes a limited but essential number of core data elements detailing the ‘what’ (species), ‘when’ (date/time), ‘where’ (location), ‘with what evidence’ (basis of record) and ‘by whom’ (collector/observer) of the primary biodiversity data published by the GBIF network (also called GBIF-mediated data). The list of core data elements (Table 1) follows a common data standard: the Darwin Core standard. This data standard has been used for the discovery of the vast majority of specimen occurrence and observational records published through the GBIF network. The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatio-temporal occurrence, and their supporting evidence housed in collections (physical or digital). These elements are compiled into a central database (also called the GBIF Index) and their discovery and access is enabled through the GBIF data portal (http://data.gbif.org) as well as through web services (http://data.gbif.org/tutorial/services). Such a global discovery system is aimed at promoting access to the original information sources owned by each publisher participating in the GBIF network, where more information can be found (e.g. media, richer data, etc.).

Figure 1. Typical flow of data discovered and published through the GBIF network.

While all data publishers are expected to follow common standards (e.g. DwC), their data resources discoverable through the GBIF infrastructure have varying precision and quality. This could be explained by incomplete information at the publisher level, errors during the publishing processes (e.g. formatting of date information) as well as errors during the central harvesting and indexing procedures. In order to assess the content mobilised through the GBIF network, this study will focus on using the content of the GBIF Index as a proxy to the information published by the contributing publishers.


Table 1. Essential core data elements (in the GBIF-Index occurrence table).

| Title | Description |
|---|---|
| Publisher | Publisher of the resource/dataset |
| Dataset | Resource/Dataset |
| Institution | The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. |
| Collection | The name, acronym, code, or initials identifying the collection or data set from which the record was derived. |
| Catalogue number | An identifier (preferably unique) for the record within the data set or collection. |
| Scientific name | The full scientific name, with authorship and date information if known. When forming part of an identification, this should be the name in the lowest level taxonomic rank that can be determined. |
| Taxon author | The authorship information for the Scientific name. |
| Taxon rank | The taxonomic rank of the most specific name in the Scientific name. Recommended best practice is to use a controlled vocabulary. |
| Kingdom | The full scientific name of the kingdom in which the taxon is classified. |
| Phylum | The full scientific name of the phylum or division in which the taxon is classified. |
| Class | The full scientific name of the class in which the taxon is classified. |
| Order | The full scientific name of the order in which the taxon is classified. |
| Family | The full scientific name of the family in which the taxon is classified. |
| Genus | The full scientific name of the genus in which the taxon is classified. |
| Species epithet | The name of the first or species epithet of the Scientific name. |
| Infraspecific epithet | The name of the lowest or terminal infraspecific epithet of the Scientific name, excluding any rank designation. |
| Latitude | The geographic latitude (in decimal degrees) of the geographic center of a Location. Positive values are north of the Equator; negative values are south of it. Legal values lie between -90 and 90, inclusive. |
| Longitude | The geographic longitude (in decimal degrees) of the geographic center of a Location. Positive values are east of the Greenwich Meridian; negative values are west of it. Legal values lie between -180 and 180, inclusive. |
| Coordinate precision | A decimal representation of the precision of the coordinates given in the Latitude and Longitude. |
| Maximum altitude | The upper limit of the range of elevation (altitude, usually above sea level), in meters. |
| Minimum altitude | The lower limit of the range of elevation (altitude, usually above sea level), in meters. |
| Altitude precision | A decimal representation of the precision of the altitude. |
| Minimum depth | The lesser depth of a range of depth below the local surface, in meters. |
| Maximum depth | The greater depth of a range of depth below the local surface, in meters. |
| Depth precision | A decimal representation of the precision of the depth. |
| Continent or ocean | The name of the continent in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names or the ISO 3166 Continent code. |
| Country | The name of the country or major administrative unit in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. |
| State or province | The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the Location occurs. |
| County | The full, unabbreviated name of the next smaller administrative region than State or Province (county, shire, department, etc.) in which the Location occurs. |
| Name of collector/observer | A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original occurrence. |
| Locality | The specific description of the place. Less specific geographic information can be provided in other geographic terms. This term may contain information modified from the original to correct perceived errors or standardize the description. |
| Year of collection | The four-digit year in which the collection or observation event occurred, according to the Common Era Calendar. |
| Month of collection | The ordinal month in which the collection or observation event occurred. |
| Day of collection | The integer day of the month on which the collection or observation event occurred. |
| Basis of record | The specific nature of the data record. Recommended best practice is to use a controlled vocabulary such as the Darwin Core Type Vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm). |
| Name of identifier | A list (concatenated and separated) of names of people, groups, or organizations that assigned the taxon to the subject. |
| Identification date | The date on which the subject was identified as representing the taxon. Recommended best practice is to use an encoding scheme, such as ISO 8601:2004(E). |
| Date of creation | Timestamp of creation of this raw occurrence record in the index. |
| Date of modification | Timestamp of last update of this raw occurrence record in the index. |
| Date of deletion | Timestamp of deletion of this raw occurrence record in the index (obsolete). |
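The legal value ranges stated in Table 1 lend themselves to a simple record-level check. The sketch below is illustrative only: the `Occurrence` class, its field names and the `issues` function are hypothetical and do not reproduce the GBIF Index schema; the latitude and longitude limits come from Table 1, while the month and day limits are plain calendar assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occurrence:
    """A reduced, hypothetical view of the core elements of one record."""
    scientific_name: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    basis_of_record: Optional[str] = None


def issues(rec: Occurrence) -> List[str]:
    """List the range violations found in a record; missing values are not flagged."""
    found = []
    if rec.latitude is not None and not -90 <= rec.latitude <= 90:
        found.append("latitude out of range")
    if rec.longitude is not None and not -180 <= rec.longitude <= 180:
        found.append("longitude out of range")
    if rec.month is not None and not 1 <= rec.month <= 12:
        found.append("month out of range")
    if rec.day is not None and not 1 <= rec.day <= 31:
        found.append("day out of range")
    return found
```

Missing values are deliberately left unflagged here, since completeness and validity are separate questions in the assessment.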

Content assessment of GBIF-mobilised data

Methodology

In the last two decades, the informatics field has evolved to a stage where the handling of very large volumes of data is becoming the central component of data discovery. The capacity to store, manage and analyse large volumes of data is becoming a fundamental requirement in the field of Biodiversity Informatics, and in particular for infrastructures like GBIF. Today, technologies like Hadoop and Hive offer the ability to process such huge volumes of information on certain kinds of distributable problems using a large number of computers. The assessment carried out by GBIFS used this technology to process and analyse the full GBIF Index, as depicted in Figure 2. The full GBIF Index was extracted in the form of Hive tables in December 2010 and February 2012. All outputs of the data-mining processes were stored in MySQL tables for easy processing and visualisation. The results of these analyses were kept so that similar experiments can be repeated in the future and compared over time. The Hadoop/Hive technology allowed the processing and analysis of the full GBIF Index in a reasonable amount of time compared to conventional technologies such as relational database management systems like MySQL (Figure 2). However, such a methodology requires a dedicated infrastructure with sufficient IT expertise and understanding of the processes involved in manipulating such a large volume of information at once.
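The kind of distributable problem mentioned above can be illustrated with a small map-and-reduce sketch. It is not the Hive workflow used by GBIFS, whose queries are not given in the text; the function names and the in-memory partitions are assumptions that only show the principle: each partition is summarised independently, and the partial results are then merged.

```python
from collections import Counter
from functools import reduce


def map_partition(rows):
    """Map step: count records per kingdom within one partition."""
    return Counter(row.get("kingdom") or "unknown" for row in rows)


def reduce_counts(a, b):
    """Reduce step: merge two partial counts."""
    return a + b


# Two small partitions standing in for chunks of a much larger index.
partitions = [
    [{"kingdom": "Animalia"}, {"kingdom": "Plantae"}],
    [{"kingdom": "Animalia"}, {"kingdom": None}],
]
totals = reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because the map step never looks outside its own partition, the partitions could in principle be processed on different machines, which is what makes this class of problem suitable for a cluster.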


Figure 2. Data mining methodology employed during content assessment exercise carried out by the GBIF Secretariat.

UNZYEC used two separate approaches in their assessment (Figure 3). In the first, a random sample of the GBIF Index was obtained by issuing an automated set of queries through the portal’s web services. This approach mimics ecological sampling, where a vast amount of data is represented by a subset, thus greatly reducing the data processing requirements. In the second, mirrors of both the GBIF Index and the raw data harvested from the participants were queried using standard SQL statements and scripts. Although much more taxing in terms of resources, this approach enabled the authors to finely track the flow of information (not just data) from the publishers to the index. In this way, gaps caused by the data processing flow can be detected. The UNZYEC team made queries and samplings during a three-year period, over ten versions of the GBIF Index. However, for the purpose of this assessment, analyses were made mostly on the mirror released in November 2010, in order to provide an independent comparison with the results obtained by GBIFS.
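The sampling approach can be sketched as the estimation of a proportion from a random subset. The example below is an assumption-laden illustration, not the UNZYEC procedure: the function name, the synthetic index and the normal-approximation interval are all choices made here for the sake of the example.

```python
import math
import random


def estimate_proportion(population, predicate, sample_size, seed=0):
    """Estimate the share of records satisfying `predicate` from a random sample.

    Returns the estimate and an approximate 95% interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    p = sum(1 for rec in sample if predicate(rec)) / sample_size
    half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Synthetic index in which three out of four records carry a latitude.
index = [{"lat": 10.0} if i % 4 else {"lat": None} for i in range(10_000)]
p, low, high = estimate_proportion(index, lambda r: r["lat"] is not None, 500)
```

With a sample of a few hundred records the interval is already narrow enough for a descriptive assessment, which is the reason sampling reduces the processing requirements so sharply.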


Figure 3. Data mining methodologies employed during content assessment exercise carried out by the University of Navarra.

Limitations of the methodologies

The methodology used in this article enables fast data mining of the GBIF data index but does not address issues such as:

- The level of accuracy of the data (e.g. precision in geospatial coordinates).
- The risk of misidentification of taxa.
- Duplicate records, which can arise from:
  - datasets being unwittingly published repeatedly,
  - duplicate records within a single dataset,
  - multiple digital records derived from the same physical specimen, such as a specimen being physically split and stored in multiple museums.
- Computing errors (e.g. software bugs) in the data interpretation routines.

For example, depending on the data schema used (Darwin Core or ABCD) and its version, an occurrence date may be represented as a date-time stamp, an ISO-formatted date, a simple text string in varying formats, or a set of individual fields (day, month, year). The mapping of the data by the publisher may therefore introduce additional error or ambiguity if, for example, month and day are swapped. To overcome this difficulty, we assumed that the error rate of the year within a malformed date-time stamp is low enough for the year to be a good proxy for assessing the temporal dimension.

With regard to the conversion and validation of taxonomical information (e.g. genus, species, scientific names), the challenges are more complex. During the harvesting and indexing procedures, the taxonomical information is checked against the most up-to-date GBIF taxonomical backbone. Until the end of 2011, GBIF used the Catalogue of Life (CoL) 2007 as its core taxonomical backbone, and when unmatched names were identified during the harvesting/indexing procedures they were simply added to the backbone. In November 2011, GBIF entirely refreshed its taxonomical backbone, which now primarily uses the latest version of the Catalogue of Life in addition to other resources (Table 2). Today, unmatched names are not added to the core backbone and, whenever possible, expert taxonomists are consulted. Therefore, the taxonomical comparison between 2010 and 2012 should be interpreted taking into account the bias introduced by the improvement of the GBIF taxonomical backbone and resolution services.

Table 2. Top 10 resources currently available through GBIF ‘ChecklistBank’ used to build the GBIF taxonomical backbone.

| Title | Version | Families | Genera | Species |
|---|---|---|---|---|
| The Catalogue of Life | 2012-01-14 | 8,149 | 129,461 | 1,379,178 |
| Interim Register of Marine and Nonmarine Genera (IRMNG) | 2012-01-13 | 34,119 | 790,025 | 1,017,851 |
| International Plant Names Index | 2011-07-13 | 791 | 59,766 | 1,317,317 |
| NCBI Taxonomy | 2012-01-13 | 7,223 | 59,404 | 668,915 |
| The Integrated Taxonomic Information System (ITIS) | 2012-01-14 | 6,972 | 45,531 | 306,358 |
| World Register of Marine Species | 2012-05-02 | 6,370 | 41,293 | 233,811 |
| Index Fungorum | 2011-07-13 | 2,926 | 10,569 | 267,553 |
| Fauna Europaea | 2011-07-13 | - | 37,214 | 131,671 |
| Wikipedia Species Pages - English | 2011-09-04 | - | - | - |
| GRIN Taxonomy for Plants | 2012-01-14 | 492 | 12,909 | 58,773 |

A full up-to-date list can be accessed at: http://ecat-dev.gbif.org/

For the purpose of this study, elements covering three dimensions (what, where and when) were extracted from the GBIF Index by GBIFS and UNZYEC in December 2010 and, for some specific analyses, also from the raw data as supplied by the providers (by UNZYEC). Further analyses using the February 2012 version of the GBIF Index were undertaken by GBIFS. The elements covered in these analyses are:

- Source of the data: The assessment has taken into account the identifiers of the data publisher and data resources. However, due to the incompleteness and lack of accuracy of the entries in the institution ID, collection ID and catalogue fields in the GBIF Index, we decided to exclude these fields from the analysis.
- Taxonomic data: Taxonomic ranks such as Kingdom, Phylum, Class, Family, Genus and Species are included. The assessments have also taken into account the synonyms as recorded in the GBIF Index, in order to provide the most accurate estimate of the number of species. Data from multiple synonyms are merged during the harvesting and indexing routines.
- Geospatial data: Latitude and longitude information was used when available. However, because data publishers provide little information on precision, it was not possible to take precision into account. This is a serious limitation that will need to be addressed in future analyses.
- Temporal data: Limited to the year of observation/collection field. The assessments ignored the day and month recorded in the date field, except for analysing possible causes of year misassignment.
- Other data: The basis of record, a descriptive term indicating whether the record represents an object or an observation, was included in the analysis. The basis of record contains useful information such as the level of evidence and other categories that may be considered enhanced subclasses of information.
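Since the temporal element above is limited to the year, and dates may arrive as time stamps, ISO dates or free text, a year can be pulled out of a raw date string with a simple pattern. This is a hedged sketch and not the interpretation routine used by GBIF: the function name and the accepted range of years (1600 to 2099) are assumptions made for this example.

```python
import re
from typing import Optional

# A four-digit year between 1600 and 2099 that is not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[6-9]\d{2}|20\d{2})(?!\d)")


def extract_year(raw: Optional[str]) -> Optional[int]:
    """Return the first plausible year found in a raw date string, or None."""
    if not raw:
        return None
    match = YEAR_PATTERN.search(raw)
    return int(match.group(1)) if match else None
```

The function ignores day and month entirely, so a swap between those two fields does not affect its result, which is the property the year-as-proxy assumption relies on.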

Results of the content assessment of the GBIF-mobilised data

We present the salient outcomes of these two independent exercises in three categories, namely: (a) data quality assessment, (b) trends/patterns assessment, and (c) fitness-for-use assessment. In most cases, both exercises reached similar conclusions and therefore validate each other. In some instances, significant differences arose and were assessed.

A. Data Quality Assessment: Taxonomy

Until November 2011, taxonomical names were processed against taxonomical references such as the Catalogue of Life 2007 checklist [15] or the International Plant Names Index [16]. Names that did not match the accumulated GBIF taxonomical backbone were automatically added to it. Therefore, the 2010 GBIF taxonomical backbone contained accepted names (e.g. from CoL 2007) and new names discovered during the indexing process. This also means that in our December 2010 assessment we had limited capacity to distinguish between authoritative names (e.g. those from the Catalogue of Life 2007) and added names, which had not been validated against any taxonomical reference. In November 2011, the GBIF taxonomical backbone was rebuilt using primarily the latest version of the Catalogue of Life as well as many new authoritative taxonomical references (Table 2). Therefore, the February 2012 assessment of taxonomical names can be considered much more accurate.

Matching against the Catalogue of Life

Using the less advanced interpretation techniques developed in 2006 by GBIFS, the backbone taxonomy that covers the occurrence records has 1,946,429 concepts at species or lower ranks, of which 458,716 (24%) are provided by the Catalogue of Life 2007 Annual Checklist. A more recent study made in December 2010 showed that 52% of the distinct canonical names found in the GBIF Index matched a name in the CoL 2010 using straight, case-insensitive matching. This increases slightly, to 54%, if ‘fuzzy’ matching with a maximum difference of 10% in characters is used. In February 2012, a similar study (Table 3) showed that 53.47% of the canonical names matched the Catalogue of Life 2011 Annual Checklist using straight, case-insensitive matching.

Table 3. Taxonomical rank matching with Catalogue of Life 2011 (February 2012)

[Table 3: each row of the original table marks which taxonomical ranks (Species, Genus, Family, Order, Class, Phylum, Kingdom) matched the Catalogue of Life 2011. The per-row marks are not recoverable, so only the percentages are listed, in their original order.]

| Percentage of the GBIF Index (324,247,283 occurrences) | Percentage of the total number of species (995,974 species in total) |
|---|---|
| 0.05% | 0.27% |
| 1.38% | 0.29% |
| 0.53% | 0.89% |
| 0.77% | 1.35% |
| 0.76% | 2.40% |
| 4.54% | 13.36% |
| 9.13% | 27.98% |
| 82.83% | 53.47% |

15 http://www.catalogueoflife.org/annual-checklist/2007
16 http://www.ipni.org
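The two matching strategies mentioned under ‘Matching against the Catalogue of Life’ can be sketched as follows. This is an illustrative reading, not the GBIF implementation: the text does not say how the 10% difference in characters was measured, so the use of Levenshtein edit distance relative to the longer name is an assumption, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_name(name, checklist, max_diff=0.10):
    """Try a straight case-insensitive match first, then a fuzzy one."""
    key = name.casefold()
    folded = {c.casefold(): c for c in checklist}
    if key in folded:
        return folded[key], "exact"
    for cand_key, cand in folded.items():
        limit = max_diff * max(len(key), len(cand_key))
        if edit_distance(key, cand_key) <= limit:
            return cand, "fuzzy"
    return None, "unmatched"


checklist = ["Puma concolor", "Quercus robur"]
```

For example, `match_name("Puma concolour", checklist)` falls back to the fuzzy branch, since a single extra character is within 10% of the name length. A linear scan over the checklist is only workable for small lists; matching against a full checklist would need an index.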

Completeness of the taxonomical classification

In order to study the completeness of the taxonomical classification in the GBIF Index, we assessed for each rank (kingdom, phylum, class, order, family, genus and species) the valid references generated after the harvesting and indexing routines. The level of completeness is therefore based on valid taxonomical references within the GBIF taxonomical backbone. In cases where, for example, a family name was not mapped correctly, a ‘null’ value is assigned to this field in the published occurrence record. For each rank, we evaluated the number of occurrences and species (or lower taxa) having incomplete or unknown taxonomical status, that is, ‘null’ values (e.g. counting all occurrences having an ‘unknown’ status for the kingdom rank). Table 4 provides a summary of our findings in December 2010 and February 2012. In 2010, a total of 114,721 species or lower taxa, corresponding to 15 million occurrences and representing 5.6% of the GBIF Index, were not ‘mapped’ against the GBIF taxonomical backbone at the kingdom level. Similar trends are observed for the other taxonomical ranks, with some variation in the degree of incompleteness (e.g. 14.5% of species and lower taxa at the family level and 7.4% at the species level). This analysis confirmed similar results obtained in 2008 and 2010 (GBIF, 2010b; Ariño and Otegui, 2008). However, some of the names correctly matched against the GBIF taxonomy backbone may not be valid names according to authoritative references such as the Catalogue of Life. The reason is that some of these names, if not matched to the existing GBIF taxonomy backbone during the harvesting and indexing processes, were simply added as valid references. The mixing of valid taxonomical references with new unverified references, with limited capacity to track such changes over time, caused serious difficulties for our study. The assessment summarized in Table 4 therefore reflects the incompleteness of the taxonomical backbone rather than a real comparison with any authoritative taxonomical reference.

Table 4. Scientific names and occurrences summary for each ‘unknown’ taxonomic rank (as of December 2010 and February 2012).

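| Taxonomic rank | Scientific names with 'unknown' status (12/2010) | Scientific names with 'unknown' status (02/2012) | % of total species recorded in GBIF (12/2010) | % of total species recorded in GBIF (02/2012) | Occurrences with 'unknown' status (12/2010) | Occurrences with 'unknown' status (02/2012) | % of total occurrences recorded in GBIF Index (12/2010) | % of total occurrences recorded in GBIF Index (02/2012) |
|---|---|---|---|---|---|---|---|---|
| Kingdom | 114,721 | 5,153 | 7.0% | 0.35% | 15,030,014 | 167,208 | 5.6% | 0.05% |
| Phylum | 223,433 | 11,305 | 13.8% | 0.77% | 22,180,639 | 4,640,252 | 8.3% | 1.43% |
| Class | 235,857 | 26,266 | 14.5% | 1.81% | 23,071,180 | 3,963,750 | 8.6% | 1.22% |
| Order | 261,706 | 52,007 | 16.1% | 3.58% | 24,605,925 | 6,304,444 | 9.2% | 1.94% |
| Family | 235,089 | 41,932 | 14.5% | 2.82% | 21,508,688 | 6,015,636 | 8.1% | 1.86% |
| Genus | 76,416 | 31,565 | 4.7% | 2.17% | 8,665,178 | 8,959,016 | 3.2% | 2.76% |
| Species | 120,362 | 133,086 | 7.4% | 9.15% | 23,015,905 | 25,343,834 | 8.6% | 7.82% |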
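The per-rank figures in Table 4 come down to a tally of 'null' values at each rank. The following is a minimal sketch of such a tally in JavaScript, run on a handful of made-up records. The record layout, the field names and the helper name `unknownShareByRank` are assumptions for illustration only and do not reflect the actual structure of the GBIF Index.

```javascript
// Ranks assessed in the study, from highest to lowest.
const RANKS = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'];

// Hypothetical helper: for each rank, count the records whose value is
// missing (not mapped to the backbone) and express it as a percentage.
function unknownShareByRank(records) {
  const result = {};
  for (const rank of RANKS) {
    const unknown = records.filter(
      (r) => r[rank] === null || r[rank] === undefined
    ).length;
    result[rank] = {
      unknown: unknown,
      percent: records.length === 0 ? 0 : (100 * unknown) / records.length,
    };
  }
  return result;
}

// Made-up sample records (not real GBIF data).
const sample = [
  { kingdom: 'Animalia', phylum: 'Chordata', class: 'Aves', order: 'Passeriformes',
    family: 'Emberizidae', genus: 'Zonotrichia', species: 'Zonotrichia albicollis' },
  { kingdom: 'Animalia', phylum: 'Chordata', class: 'Aves', order: 'Passeriformes',
    family: null, genus: 'Zonotrichia', species: null },
  { kingdom: null, phylum: null, class: null, order: null,
    family: null, genus: null, species: null },
  { kingdom: 'Plantae', phylum: 'Magnoliophyta', class: 'Magnoliopsida', order: null,
    family: null, genus: null, species: null },
];

const shares = unknownShareByRank(sample);
console.log(shares.kingdom.percent); // 25
console.log(shares.family.percent);  // 75
```

Counting distinct names instead of occurrences would follow the same pattern after de-duplicating the records by scientific name.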


In December 2010, our preliminary findings suggested the need for an urgent review of the GBIF taxonomical backbone, in particular against the most critical taxonomical authorities, such as the Catalogue of Life 2010 annual checklist (http://www.catalogueoflife.org/), and other sources such as the Interim Register of Marine and Nonmarine Genera (IRMNG). The decision not to mix unverified names with existing authoritative names was critical. In November 2011, GBIFS successfully upgraded its taxonomical backbone against the latest version of the Catalogue of Life (2011) and other authoritative references. As a result, our February 2012 study gives a more accurate assessment of the taxonomical gaps within the GBIF Index. The results of this analysis are presented in Table 4.

The percentages of incompleteness observed in 2012 (0.35%, 1.81%, 2.82% and 2.17% at the kingdom, class, family and genus levels, respectively) were substantially lower than those observed in December 2010 (7.0%, 14.5%, 14.5% and 4.7%, respectively), with the exception of the species rank. Similar trends are observed when occurrences are taken into account. A large number of unmapped taxonomical ranks from kingdom to genus were therefore resolved by the upgraded GBIF taxonomical backbone, largely because more taxonomical references were used to construct it. At the species level, however, the percentage of unresolved names rose to 9.15% in 2012 from 7.4% in December 2010 (Table 4).

Taking these improvements in taxonomical name resolution into account, we tried to assess what additional data quality improvements could be undertaken. To do so, we looked at the ten possible misidentifications at the kingdom level that affect the largest number of occurrences (Table 5).
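As an illustration of the comparison between the two snapshots, the sketch below computes the change, in percentage points, in the share of names with 'unknown' status at each rank. The percentages are copied from Table 4; the variable and function names are hypothetical.

```javascript
// Share of scientific names with 'unknown' status at each rank (%),
// copied from Table 4 for December 2010 and February 2012.
const unknownNames = {
  Kingdom: { dec2010: 7.0, feb2012: 0.35 },
  Phylum: { dec2010: 13.8, feb2012: 0.77 },
  Class: { dec2010: 14.5, feb2012: 1.81 },
  Order: { dec2010: 16.1, feb2012: 3.58 },
  Family: { dec2010: 14.5, feb2012: 2.82 },
  Genus: { dec2010: 4.7, feb2012: 2.17 },
  Species: { dec2010: 7.4, feb2012: 9.15 },
};

// Change between snapshots in percentage points, rounded to two decimals.
function changeInPoints(row) {
  return Math.round((row.feb2012 - row.dec2010) * 100) / 100;
}

// Ranks at which the share of unknown names increased.
const worsened = Object.keys(unknownNames).filter(
  (rank) => changeInPoints(unknownNames[rank]) > 0
);

console.log(worsened); // [ 'Species' ]
```

Only the species rank shows an increase, which matches the exception noted in the text.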
The three species of the genus Zonotrichia listed under the kingdom Plantae are wrongly assigned. These species belong to the American sparrow group of the family Emberizidae [17]. The misidentification is due to the generic homonym Zonotrichia, which is present in both the Plantae and Animalia kingdoms. It is being resolved in the GBIF taxonomical backbone, and such obvious misidentifications are progressively being corrected [18]. For the other cases listed in Table 5, the discrepancy with the CoL 2011 version is resolved in the latest version of the CoL (February 2012) or by other taxonomical authorities (i.e. the Marine Species Identification Portal). Once these changes are implemented, we estimate that 1,808,488 occurrences would be correctly mapped and that the total number of occurrences with 'unknown' status at the species level would decrease from 25,343,834 to 23,535,346. This shows that, while the GBIF Index grew from 267 to 324 million occurrences (+21.3%) between December 2010 and February 2012, correcting the top 10 species misidentifications in February 2012 would have resolved a substantial volume of the GBIF Index: the number of occurrences with 'unknown' status at the species rank would have grown by only 2.3% (from 23,015,905 to 23,535,346).

[17] http://en.wikipedia.org/wiki/Zonotrichia
[18] http://dev.gbif.org/issues/browse/CLB-119

Table 5: Major discrepancies at the Kingdom rank and tentative resolution through CoL 2011 and a more recent version (February 2012)

| Kingdom | Species | Occurrences | CoL 2011 status and resolution |
|---|---|---|---|
| Plantae | Zonotrichia albicollis | 775,671 | Accepted name in CoL 2012 in Animalia kingdom |
| Plantae | Zonotrichia leucophrys | 362,767 | Accepted name in CoL 2012 in Animalia kingdom |
| Protozoa | Neogloboquadrina pachyderma | 141,720 | Not in CoL 2011. Accepted name in CoL 2012 |
| Plantae | Zonotrichia atricapilla | 106,804 | Accepted name in CoL 2012 in Animalia kingdom |
| Protozoa | Globigerinoides ruber | 86,563 | Not in CoL 2011. Accepted name in CoL 2012 |
| Protozoa | Globigerina bulloides | 82,643 | Not in CoL 2011. Accepted name in CoL 2012 |
| Protozoa | Globigerinita glutinata | 74,617 | Not in CoL 2011. Accepted name in CoL 2012 |
| Protozoa | Globorotalia truncatulinoides | 64,707 | Not in CoL 2011. Not in CoL 2012. Identified in Marine Species Identification Portal (as of Feb 2012) [19] |
| Protozoa | Globorotalia inflata | 57,706 | Not in CoL 2011. Not in CoL 2012. Identified in Marine Species Identification Portal (as of Feb 2012) [20] |
| Protozoa | Orbulina universa | 55,290 | Not in CoL 2011. Not in CoL 2012. Identified in Marine Species Identification Portal (as of Feb 2012) [21] |
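The estimate discussed with Table 5 can be reproduced with a few lines of arithmetic. The sketch below sums the occurrence counts listed in Table 5 and subtracts them from the species-rank totals given in Table 4; only the variable names are introduced here for illustration.

```javascript
// Occurrence counts of the ten species listed in Table 5.
const top10 = [775671, 362767, 141720, 106804, 86563,
               82643, 74617, 64707, 57706, 55290];

// Occurrences with 'unknown' status at the species rank, from Table 4.
const unknownDec2010 = 23015905;
const unknownFeb2012 = 25343834;

// Occurrences that would be mapped correctly once the ten cases are fixed.
const resolved = top10.reduce((sum, n) => sum + n, 0);

// Remaining 'unknown' occurrences in February 2012 after the correction.
const remaining = unknownFeb2012 - resolved;

// Growth of the 'unknown' set since December 2010, in percent.
const growthPercent = (100 * (remaining - unknownDec2010)) / unknownDec2010;

console.log(resolved);                 // 1808488
console.log(remaining);                // 23535346
console.log(growthPercent.toFixed(1)); // "2.3"
```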

It is therefore reasonable to expect that a large portion of the gaps identified in Table 4 will be resolved in the future by newer versions of the taxonomical authorities used to build the GBIF taxonomic backbone. The rate of resolved names should, in principle, be directly correlated with the growth in volume of the authoritative taxonomic references used by GBIF.

Table 6.a summarizes the taxonomical misidentifications at the kingdom level and the total number of associated occurrences affected. For example, correcting the wrong assignment of 90 species from the kingdom Plantae to Animalia would affect more than 1.3 million occurrences within the GBIF Index as of February 2012. On the other hand, correcting the 26 species wrongly assigned to Animalia instead of Plantae would affect only 1,536 occurrences. Similar breakdowns are provided for phylum (Table 6.b) and class (Table 6.c). Table 6.a shows that correcting misidentifications at a high taxonomical rank (e.g. kingdom) would affect a limited number of occurrences (1,312,258, representing less than 0.5% of the GBIF Index). The cost of verifying such misidentifications should be taken into consideration during data cleansing activities.

[19] http://species-identification.org/species.php?species_group=zsao&id=1387
[20] http://species-identification.org/species.php?species_group=zsao&id=1384
[21] http://species-identification.org/species.php?species_group=zsao&id=1397

Table 6.a: Estimation of the taxonomical misidentification at the Kingdom level (February 2012).

| Incorrect Kingdom assignment | Correct Kingdom in CoL 2011 | Occurrences | Species |
|---|---|---|---|
| Plantae | Animalia | 1,308,111 | 90 |
| Animalia | Plantae | 1,536 | 26 |
| Chromista | Animalia | 1,504 | 1 |
| Chromista | Plantae | 310 | 3 |
| Animalia | Fungi | 190 | 23 |
| Fungi | Animalia | 186 | 10 |
| Protozoa | Chromista | 100 | 8 |
| Plantae | Fungi | 98 | 11 |
| Plantae | Protozoa | 61 | 2 |
| Plantae | Chromista | 43 | 6 |
| Fungi | Plantae | 41 | 2 |
| Animalia | Chromista | 26 | 3 |
| Bacteria | Protozoa | 22 | 1 |
| Plantae | Bacteria | 13 | 5 |
| Protozoa | Plantae | 9 | 5 |
| Protozoa | Fungi | 6 | 1 |
| Fungi | Protozoa | 2 | 1 |
| Total | | 1,312,258 | 198 |
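To make the weight of each kingdom-level correction concrete, the sketch below re-adds the rows of Table 6.a and relates the total to the size of the GBIF Index. The row values are copied from the table. The index size of 324 million is the rounded figure quoted earlier in the text, so the resulting share is approximate, and the variable names are illustrative.

```javascript
// Rows of Table 6.a:
// [incorrect kingdom, correct kingdom in CoL 2011, occurrences, species].
const rows = [
  ['Plantae', 'Animalia', 1308111, 90],
  ['Animalia', 'Plantae', 1536, 26],
  ['Chromista', 'Animalia', 1504, 1],
  ['Chromista', 'Plantae', 310, 3],
  ['Animalia', 'Fungi', 190, 23],
  ['Fungi', 'Animalia', 186, 10],
  ['Protozoa', 'Chromista', 100, 8],
  ['Plantae', 'Fungi', 98, 11],
  ['Plantae', 'Protozoa', 61, 2],
  ['Plantae', 'Chromista', 43, 6],
  ['Fungi', 'Plantae', 41, 2],
  ['Animalia', 'Chromista', 26, 3],
  ['Bacteria', 'Protozoa', 22, 1],
  ['Plantae', 'Bacteria', 13, 5],
  ['Protozoa', 'Plantae', 9, 5],
  ['Protozoa', 'Fungi', 6, 1],
  ['Fungi', 'Protozoa', 2, 1],
];

const totalOccurrences = rows.reduce((sum, r) => sum + r[2], 0);
const totalSpecies = rows.reduce((sum, r) => sum + r[3], 0);

// Share of the affected occurrences due to the Plantae -> Animalia row alone.
const plantaeToAnimaliaShare = (100 * rows[0][2]) / totalOccurrences;

// Approximate size of the GBIF Index in February 2012 (rounded figure
// from the text), used to express the total as a share of the index.
const indexSize = 324e6;
const shareOfIndex = (100 * totalOccurrences) / indexSize;

console.log(totalOccurrences, totalSpecies);   // 1312258 198
console.log(plantaeToAnimaliaShare.toFixed(1)); // "99.7"
console.log(shareOfIndex.toFixed(2));           // "0.41"
```

Almost all of the affected occurrences come from a single row, which is why a small number of name corrections accounts for most of the kingdom-level effect.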

Table 6.b: Estimation of the taxonomical misidentification at the Phylum level (February 2012).

| Incorrect Phylum assignment | Correct Phylum in CoL 2011 | Occurrences | Species |
|---|---|---|---|
| Bryophyta | Magnoliophyta | 17,488 | 24 |
| Magnoliophyta | Arthropoda | 2,788 | 36 |
| Cnidaria | Chordata | 2,213 | 12 |
| Ochrophyta | Arthropoda | 1,504 | 1 |
| Chordata | Magnoliophyta | 833 | 5 |
| Cyanobacteria | Proteobacteria | 312 | 5 |
| Ochrophyta | Rhodophyta | 309 | 2 |
| Arthropoda | Magnoliophyta | 297 | 10 |
| Arthropoda | Chlorophyta | 244 | 1 |
| Magnoliophyta | Chordata | 201 | 5 |
| Magnoliophyta | Cnidaria | 176 | 2 |
| Labyrinthista | Sarcomastigophora | 116 | 4 |
| Ascomycota | Chordata | 115 | 1 |
| Marchantiophyta | Bryozoa | 114 | 1 |
| Annelida | Tardigrada | 111 | 1 |
| Arthropoda | Ascomycota | 93 | 6 |
| Arthropoda | Rhodophyta | 82 | 2 |
| Magnoliophyta | Ascomycota | 80 | 2 |
| Mollusca | Ascomycota | 48 | 10 |
| Bryozoa | Magnoliophyta | 47 | 4 |
| Sarcomastigophora | Ochrophyta | 46 | 4 |
| Ascomycota | Magnoliophyta | 40 | 1 |
| Ascomycota | Arthropoda | 39 | 3 |
| Chlorophyta | Magnoliophyta | 34 | 2 |
| Brachiopoda | Ascomycota | 32 | 2 |
| Mollusca | Arthropoda | 30 | 2 |
| Arthropoda | Nematoda | 26 | 1 |
| Arthropoda | Ochrophyta | 25 | 2 |
| Annelida | Magnoliophyta | 24 | 1 |
| Pinophyta | Arthropoda | 23 | 1 |
| Platyhelminthes | Arthropoda | 22 | 9 |
| Rhodophyta | Arthropoda | 21 | 1 |
| Basidiomycota | Arthropoda | 19 | 2 |
| Arthropoda | Bacillariophyta | 17 | 5 |
| Chlorophyta | Rhodophyta | 16 | 2 |
| Chlorophyta | Cyanobacteria | 13 | 5 |
| Echinodermata | Arthropoda | 12 | 1 |
| Bacteroidetes | Proteobacteria | 11 | 3 |
| Magnoliophyta | Ochrophyta | 8 | 5 |
| Bryophyta | Rhodophyta | 8 | 1 |
| Ascomycota | Bryozoa | 7 | 2 |
| Echinodermata | Cnidaria | 6 | 1 |
| Magnoliophyta | Rotifera | 6 | 1 |
| Ciliophora | Chlorophyta | 4 | 1 |
| Arthropoda | Mollusca | 4 | 1 |
| Arthropoda | Pinophyta | 4 | 1 |
| Ascomycota | Bacillariophyta | 3 | 2 |
| Euglenozoa | Rhodophyta | 3 | 2 |
| Annelida | Bacillariophyta | 3 | 1 |
| Ascomycota | Cnidaria | 3 | 1 |
| Ascomycota | Echinodermata | 3 | 1 |
| Chlorophyta | Arthropoda | 2 | 1 |
| Platyhelminthes | Bacillariophyta | 2 | 1 |
| Magnoliophyta | Bacillariophyta | 2 | 1 |
| Platyhelminthes | Acanthocephala | 1 | 1 |
| Pteridophyta | Arthropoda | 1 | 1 |
| Ascomycota | Chlorophyta | 1 | 1 |
| Cnidaria | Ochrophyta | 1 | 1 |
| Arthropoda | Platyhelminthes | 1 | 1 |
| Dinophyta | Rhodophyta | 1 | 1 |
| Total | | 27,695 | 210 |

Table 6.c: Estimation of the taxonomical misidentification at the Class level (February 2012).

| Incorrect Class assignment | Correct Class in CoL 2011 | Occurrences | Species |
|---|---|---|---|
| Bryopsida | Andreaeopsida | 21,966 | 82 |
| Bryopsida | Liliopsida | 17,488 | 24 |
| Magnoliopsida | Insecta | 2,429 | 12 |
| Hydrozoa | Arachnida | 2,213 | 12 |
| Phaeophyceae | Insecta | 1,504 | 1 |
| Actinopterygii | Magnoliopsida | 808 | 2 |
| Insecta | Malacostraca | 664 | 11 |
| Malacostraca | Insecta | 359 | 1 |
| Phaeophyceae | Florideophyceae | 309 | 2 |
| Insecta | Trebouxiophyceae | 244 | 1 |
| Liliopsida | Insecta | 239 | 15 |
| Lecanoromycetes | Dothideomycetes | 220 | 1 |
| Magnoliopsida | Hydrozoa | 176 | 2 |
| Liliopsida | Andreaeopsida | 175 | 1 |
| Insecta | Magnoliopsida | 167 | 3 |
| Insecta | Liliopsida | 116 | 6 |
| Labyrinthulea | Polycystina | 116 | 4 |
| Jungermanniopsida | Gymnolaemata | 114 | 1 |
| Polychaeta | Eutardigrada | 111 | 1 |
| Insecta | Florideophyceae | 82 | 2 |
| Magnoliopsida | Lecanoromycetes | 78 | 1 |
| Liliopsida | Maxillopoda | 59 | 1 |
| Insecta | Lecanoromycetes | 58 | 1 |
| Magnoliopsida | Arachnida | 57 | 7 |
| Lobosa | Coscinodiscophyceae | 54 | 4 |
| Stenolaemata | Magnoliopsida | 47 | 4 |
| Zoomastigophora | Craspedophyceae | 45 | 3 |
| Lecanoromycetes | Magnoliopsida | 40 | 1 |
| Lecanoromycetes | Insecta | 35 | 1 |
| Insecta | Leotiomycetes | 35 | 5 |
| Chlorophyceae | Liliopsida | 34 | 2 |
| Rhynchonellata | Lecanoromycetes | 32 | 2 |
| Ostracoda | Secernentea | 26 | 1 |
| Actinopterygii | Liliopsida | 24 | 2 |
| Polychaeta | Liliopsida | 24 | 1 |
| Pinopsida | Insecta | 23 | 1 |
| Turbellaria | Arachnida | 22 | 9 |
| Florideophyceae | Insecta | 21 | 1 |
| Agaricomycetes | Insecta | 19 | 2 |
| Magnoliopsida | Actinopterygii | 18 | 2 |
| Insecta | Agaricomycetes | 17 | 5 |
| Chlorophyceae | Florideophyceae | 16 | 2 |
| Maxillopoda | Liliopsida | 14 | 1 |
| Asteroidea | Entognatha | 12 | 1 |
| Sphingobacteria | Alphaproteobacteria | 8 | 2 |
| Bryopsida | Florideophyceae | 8 | 1 |
| Insecta | Phaeophyceae | 8 | 1 |
| Lecanoromycetes | Gymnolaemata | 7 | 2 |
| Magnoliopsida | Eurotatoria | 6 | 1 |
| Granuloreticulosea | Lecanoromycetes | 6 | 1 |
| Magnoliopsida | Reptilia | 6 | 1 |
| Magnoliopsida | Coscinodiscophyceae | 5 | 3 |
| Ciliatea | Chlorophyceae | 4 | 1 |
| Magnoliopsida | Entognatha | 4 | 1 |
| Ostracoda | Gastropoda | 4 | 1 |
| Dothideomycetes | Insecta | 4 | 2 |
| Magnoliopsida | Liliopsida | 4 | 2 |
| Insecta | Pinopsida | 4 | 1 |
| Rhabditophora | Turbellaria | 4 | 1 |
| Dothideomycetes | Asteroidea | 3 | 1 |
| Polychaeta | Bacillariophyceae | 3 | 1 |
| Dothideomycetes | Eurotiomycetes | 3 | 1 |
| Euglenida | Florideophyceae | 3 | 2 |
| Flavobacteria | Gammaproteobacteria | 3 | 1 |
| Leotiomycetes | Hydrozoa | 3 | 1 |
| Liliopsida | Actinopterygii | 2 | 1 |
| Magnoliopsida | Agaricomycetes | 2 | 1 |
| Turbellaria | Bacillariophyceae | 2 | 1 |
| Lecanoromycetes | Granuloreticulosea | 2 | 1 |
| Ulvophyceae | Insecta | 2 | 1 |
| Pezizomycetes | Leotiomycetes | 2 | 2 |
| Magnoliopsida | Phaeophyceae | 2 | 1 |
| Liliopsida | Coscinodiscophyceae | 1 | 1 |
| Eurotiomycetes | Dothideomycetes | 1 | 1 |
| Dinophyceae | Florideophyceae | 1 | 1 |
| Filicopsida | Insecta | 1 | 1 |
| Appendicularia | Liliopsida | 1 | 1 |
| Gastropoda | Orbiliomycetes | 1 | 1 |
| Neoophora | Palaeacanthocephala | 1 | 1 |
| Anthozoa | Phaeophyceae | 1 | 1 |
| Zoomastigophora | Synurophyceae | 1 | 1 |
| Arachnida | Turbellaria | 1 | 1 |
| Total | | 50,434 | 290 |

Indicators

To assess the effectiveness of its taxonomical backbone, GBIF should regularly estimate completeness at all taxonomical ranks, as done in Table 4. Such analyses should in particular quantify misidentifications (e.g. the species of the genus Zonotrichia). GBIF should also improve its reporting services to the original publishers, so that they are made aware of potential taxonomical misidentifications, and monitor over time the data quality improvements made in the GBIF Index. In addition, GBIF should provide means to assess the effectiveness of the taxonomical name resolution services used during the harvesting and indexing processes. All taxon misidentifications should be documented, and calls to expert groups (e.g. marine biologists, crop wild relative experts) should be considered in order to tap into taxonomic expertise and increase the experts' engagement in improving the quality of such a valuable global resource.

Geospatial

During the harvesting and indexing routines, geo-referenced occurrences are checked, in particular for wrong assignments (e.g. when the latitude and longitude do not correspond to the country where the occurrence was observed or collected). In the context of this study, we considered a geo-referenced occurrence to be a record in the GBIF Index with latitude and longitude within the earth bounding box (i.e. -90

Biodiversity Datasets Assessment Tool

DEPT. ZOOLOGY AND ECOLOGY

Welcome to the BIDDSAT

Aim

The Global Biodiversity Information Facility (GBIF) is now the most comprehensive common access to the data content of biodiversity institutions worldwide. Such institutions, also called data publishers, have agreed to share publicly their Primary Biodiversity Records (PBR) datasets through data exchange standards. The information in the data publishers is indexed by GBIF's mechanisms and, in order to ease data query and retrieval, a central index is built from certain standardized fields linking back to the original data.

Feedback to the data publishers is essential in any quality control workflow. Data managers can detect biases and errors in data by assessing the quality and fitness-for-use of their own records from the general data users' perspective. Certain errors, mostly interoperability issues, cannot be detected by record-centric assessments: they arise as patterns visible only when visualizing large sets of records (often the entire collection) under certain arrangement criteria.

This website is a basic online PBR visualization environment for the content of data publishers and/or collections within the GBIF network. The tools here allow exploring the content of such datasets as a whole and finding patterns or biases that may impact their quality or usability. The visualizations can be used to detect potential errors, allowing data publishers to fix the issues at the source. To enable historical comparisons and track error correction, data from several versions of the index are accessible.

Catalogue of Visualizations

Follow this link for a description of the available visualizations.

Source Code

The source code for this application is available at [].


Proceed to the application

Comments and feedback are really appreciated, just give a little whistle. Last Update: 2012-04-02 | v0.3 (changelog)



Annex I: BIDDSAT source code

Biddsat.php – Main interface

BIDDSAT – Biodiversity Datasets Assessment Tool

Start typing to filter the select box. Available publishers:


Start typing to filter the select box. Available collections:




Now, please click on the visualization type (hover to see longer description):


Info

Records per dataset

Records per type

Collections per type of record

Map

Records per country

Records per year (Filtered)

Records per year (All)

Record density, day of year

Record density, day of week

Average records among years

Chronhorogram

Records per kingdom

Tree Map of Taxonomy

Tree Map of Records


For any help you need, give a little whistle v0.3 (changelog)



Scripts.js – Files storing the Javascript code for rendering the page and error checking

var currentDB;

// Load the sample page of a visualization into the "sample_plot" element
function bring(source) {
    samplePage = "./files/visualizations/" + source + ".html";
    xmlhttp = new XMLHttpRequest();
    xmlhttp.onreadystatechange = function() {
        if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
            document.getElementById("sample_plot").innerHTML = xmlhttp.responseText;
        }
    }
    xmlhttp.open('GET', samplePage, true);
    xmlhttp.send();
}

// Validate the selected publisher and collection, then open the visualization
function checker(source) {
    var provValue = document.getElementById('provs').value;
    var datasetValue = document.getElementById('datasets').value;
    var dbValue = document.getElementById('db').value;
    // A publisher is mandatory (logical OR instead of the original bitwise "|")
    if (provValue == '' || provValue == '-') {
        alert('Please, provide a valid data publisher value');
    } else if (datasetValue != '' && datasetValue != '-' && (source == 'recsperres' || source == 'resperbasis' || source == 'resvsdow' || source == 'resvsdom')) {
        // These visualizations compare collections, so a single collection is not valid
        alert('This visualization is not suitable for a single collection assessment. Please, select another visualization or remove the value of the collection.');
    } else {
        if (datasetValue == '-') {
            datasetValue = 'all';
        }
        sitio = source + '.php?prov=' + provValue + '&dataset=' + datasetValue + '&db=' + dbValue;
        window.location = sitio;
    }
}

// Remember the currently selected version of the index
function storeDB() {
    currentDB = document.getElementById('db').value;
}

function checkVersionChange() {
    var selectedProvId;
    var selectedDatasetId;
    if (document.getElementById('provs').value.substr(0,1) == "-") {
        populatePublishers();
    } else {
        selectedProvId = document.getElementById('provs').options[document.getElementById('provs').selectedIndex].value;
        var i = 0;
        var pubMatch = 0;
        var provlist;
        provlist = prepareList('provs');
        for (i = 0; i

Biodiversity Datasets Assessment Tool

DEPT. ZOOLOGY AND ECOLOGY



Click one to see example and description

Info

Records per dataset

Records per type

Collections per type of record

Map

Records per country

Records per year (Filtered)

Records per year (All)

Record density, day of year

Record density, day of week

Average records among years

Chronhorogram

Records per kingdom

Tree Map of Taxonomy

Tree Map of Records




Header.php – common processing script


Info.php – General information and completeness levels

google.load('visualization', '1', {packages:['table','corechart']});
google.setOnLoadCallback(drawCharts);

function drawCharts() {
    var generalTableData = new google.visualization.DataTable();
    generalTableData.addColumn('number','Resource ID');
    generalTableData.addColumn('number','Records');
    generalTableData.addColumn('number','With Coordinates');
    generalTableData.addColumn('number','With Coordinates (%)');
    generalTableData.addColumn('number','With Country');
    generalTableData.addColumn('number','With Country(%)');
    generalTableData.addColumn('number','With Year');
    generalTableData.addColumn('number','With Year (%)');
    generalTableData.addColumn('number','With Date');
    generalTableData.addColumn('number','With Date (%)');
    generalTableData.addColumn('number','With Kingdom');
    generalTableData.addColumn('number','With Kingdom (%)');
    generalTableData.addColumn('number','With Taxonomy');
    generalTableData.addColumn('number','With Taxonomy (%)');