AUTHOR’S POSTPRINT Published version of this article at http://omicron.ch.tuiasi.ro/EEMJ/issues/vol11/vol11no6.htm

Environmental Engineering and Management Journal, June 2012, Vol.11, No. 6: 1059-1075.

Primary Biodiversity Data Records in the Pyrenees

Arturo H. Ariño*, Javier Otegui, Ana Villarroya, Anabel Pérez de Zabalza
Department of Zoology and Ecology, University of Navarra, E-31080 Pamplona, Spain

Abstract

We characterize the primary biodiversity data records that have been made public for retrieval for the Pyrenean region. Such data, spanning more than a hundred years, have been collected by many institutions and individual researchers and digitized in databases, some of which have been shared through the Global Biodiversity Information Facility platform by using a standard format, Darwin Core. The datasets are not homogeneous in extent, coverage, taxonomy, or accuracy. Differences arising from taxonomic depth or group, georeferencing precision, age of collection, and other features result in biases and gaps that may influence the fitness for use of such data. Knowledge of patterns found in the data may help researchers and managers operating in the Pyrenees to estimate the reliability of available information, and to assess what uses for the data are acceptable.

Keywords

Pyrenees, biodiversity, digital assets, data availability

Introduction

Sound environmental management calls for thorough knowledge of ecosystems and their components, as success stories attest (e.g. Boesch, 2006). Such knowledge requires field data, of which primary biodiversity data are one of the most basic types. Mountain areas, both sensitive to global change (Pauli et al., 2001) and distinct in terms of biodiversity components (Körner, 2004), harbor fragile ecosystems (Wohl, 2006). The Pyrenean Range, running along the French-Spanish border, has traditionally been subject to heavy impact from human activities that have shaped its ecosystems (Gutiérrez Elorza, 2007) at increasingly remote reaches. Such changes frequently result in land cover change, itself a primary cause of biodiversity change (Gonzalez et al., 2011).
Sustained work in research projects has produced sizable amounts of biodiversity data, and knowledge about the range’s ecosystems has built on these. Ariño (2010) estimated the volume of data stored in natural history collections alone at $2 \times 10^{9}$ primary records. However, a question remains whether this mass of data can be used in full or whether, as argued by Hill et al. (2010), its fitness for use depends on how such data were recorded, digitized, and shared. Further inferences on biodiversity patterns, such as species distribution, population evolution or, more importantly, range changes potentially tied to (or becoming indicators of) climate

* Author to whom all correspondence should be addressed: [email protected]


change, will be highly dependent on data being readily available and amenable to use in data-intensive analyses. Chapman (2005) and Graham et al. (2004) have compiled exemplary cases.

Databases collating biodiversity information play a paramount role as repositories of such information, but their joint exploitation is highly dependent on interoperability. The Global Biodiversity Information Facility (GBIF) (GBIF, n.d.), the largest and most comprehensive initiative to unlock these data, indexes what is available in digital form from numerous sources in terms of primary biodiversity records. It is currently the single most important handle to a vault of historical data on vouchered specimens and observations, and the key facilitator for these data (Scoble, 2010). Knowing what is available and, perhaps more importantly, what is not (our gaps in knowledge) may help us prioritize research (Ariño et al., 2011) and assess to what extent we may make ecological inferences for an area, in particular those related to distributional areas of sensitive species (Boakes et al., 2010; Peterson et al., 2011; Soberón and Peterson, 2009).

The Pyrenean area

The Pyrenean range runs along the border between France and Spain, forming a ridge separating the Iberian Peninsula from the rest of Europe and also encircling Andorra. It is cut on both sides by a number of valleys, mostly of glacial origin, that do not cross the massive limestone and granite crests forming its central axis, which reach up to some 3 km high. Its Paleozoic-Mesozoic strata were pushed up until the Eocene. The environment is essentially alpine, with a high west-east precipitation gradient as well as a marked altitudinal and longitudinal seriation from densely wooded, Atlantic ecosystems in the lower and western reaches, to steppe and altitudinal tundra in the upper sectors and Mediterranean vegetation in the eastern sectors. The rates of endemism in these habitats are higher than in neighboring lowlands (García and Gómez, 2007).
Human settlement of the lower reaches can be traced to prehistoric times and has been constant for millennia. However, the range remained relatively sparsely populated until recently. Traditional uses have been forestry and livestock, which have been documented for more than 4000 years of continuous exploitation (Comín and Martínez Rica, 2007), but the late 20th century has seen a very large increase in tourism, skiing, and other tertiary activities (Lasanta, 2010).

The primary biodiversity data

Arguably, the absolute minimum biodiversity datum is a presence: a pair specifying that a biological entity is known to exist, or have existed, at a specific place. A more complete qualification will include, but will not be limited to, a point in time for the recording of such presence (i.e. date), forming an occurrence (Johnson, 2007). All three elements of the occurrence (the “what, where, when”; Ariño and Otegui, 2008) may have been reported with variable accuracy and precision. For instance, the biological entity can be specified to species or lower taxonomic level, or be left as a genus, family or higher level according to the taxonomic proficiency of the observer or the observing conditions. Locations may have been recorded with a high degree of precision (e.g. geographical coordinates to the meter) or a low one, such as a generic literal reference for the area or a 10-km UTM grid, or may have been obscured to protect sensitive species from exploitation. Time may have been recorded to day or to year, the latter case preventing conclusions about the life cycle of the taxon.
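The “what, where, when” structure of an occurrence can be made concrete with a small record type. The following is only a minimal sketch: the class name, field names, and example values are assumptions chosen for illustration and are not taken from the dataset or from any GBIF software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    """A presence (what, where) plus an optional date (when)."""
    scientific_name: str            # "what": taxon name at any rank
    decimal_latitude: float         # "where": degrees north
    decimal_longitude: float        # "where": degrees east
    year: Optional[int] = None      # "when": may be only partially known
    month: Optional[int] = None
    day: Optional[int] = None

    def is_presence_only(self) -> bool:
        """True when no temporal information at all is attached."""
        return self.year is None and self.month is None and self.day is None


# Hypothetical records, for illustration only.
dated = Occurrence("Gentiana lutea", 42.70, 0.50, year=1985, month=7)
undated = Occurrence("Gentiana", 42.70, 0.50)  # identified to genus only
```

Keeping year, month and day as separate optional fields lets a record carry a date known only to the year without inventing a day or month.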


Occurrences may be further qualified by legitimators: one or more people who made, or are responsible for, the observation, capture of the specimen, identification, or digitization. A primary biodiversity data record (PBR) has at least the three basic fields for an occurrence.

PBRs can be obtained from a variety of sources. Specimens held in natural history collections, or vouchers, provide verifiable, potentially very accurate taxonomical information for the record. On the other hand, field observations may lack such accuracy for difficult groups, but are generally well documented in space and time. In addition, PBRs may or may not include abundance data: they may refer to one or more specimens collected or observed. Specimen-based PBRs, often coming from stored collections, generally (but not always) match accession records: all specimens from a single sampling event that have been identified as a single taxon can be grouped together into a single PBR (Ariño, 2010), but can also be registered as separate PBRs. Observation events may be treated likewise, recording the number of specimens observed within a single PBR; but metadata regarding quantitative measurements for the observation events are not necessarily available.

Both types of sources (specimens and observations) provide the vast majority of PBRs available for research. Other sources include data in literature, ledgers, reports, or remote sensing, although these sources may actually be secondary to the primary specimens or observations. However, PBRs can be derived from such secondary sources if the primary data are missing or otherwise lost, e.g. in a destroyed collection.

PBRs need to be stored somewhere. Digital assets have largely replaced older storage forms, affording much easier retrieval. However, PBRs need to have been digitized to be easily retrievable.
Many data exist in cabinets, files, or even in the physical “database” of the labels in specimen collections (Scoble, 2010) and may be lost if they are not digitized (Chavan et al., 2010). Digitizing the assets does not guarantee their availability, though. A myriad of databases were set up during the late 20th century at many institutions, but until recently these have largely not been interoperable or accessible by researchers other than those at the hosting institutions. Indeed, such lack of access or interoperability (in turn, dependent on standardization) has been regarded as a formidable barrier to the integration of biocollections data with information and tools across research domains (Krishtalka and Humphrey, 2000).

The databases: GBIF

The Megascience Forum of the Organisation for Economic Cooperation and Development established a working group on biological informatics, which prompted the inception in 2001 of the Global Biodiversity Information Facility (GBIF) with a mandate to facilitate free and open access to primary biodiversity data worldwide (OECD, 1999). GBIF maintains an index of PBRs published by various institutions across the world that have agreed to facilitate retrieval of such records through a common portal. Currently, the Darwin Core body of standards (Wieczorek et al., 2012) ensures that data from a myriad of sources, stored in various formats on many distinct hardware platforms, use common terms to facilitate such integration within the GBIF network. As of December 2011, 358 data publishers from 57 countries and 47 organizations were linking 317 million PBRs stored in approximately 11,500 databases. This vast trove of data is, however, a fraction of what is currently stored at these and many other institutions, and GBIF has sought ways to increase mobilization of these data (Berendsohn et al., 2010). At present, GBIF-mediated data represent the largest combined source of publicly accessible primary biodiversity data in the world.

Methods


Data extraction: Bounding

We extracted the PBRs for the Pyrenean area by querying the GBIF Index as of June 2011. Only georeferenced records with coordinates between 41.4167° and 41.650°N latitude, and between 2.017°W and 3.317°E longitude, were initially selected. Downloaded data included geographic and temporal information, taxonomy, type of observation, collection data at the record level and collection-level metadata, and data publisher metadata. Abundance data were not available in the dataset, and therefore all occurrences were treated as presence-only data. Within the bounding box, data were further selected by querying a digital elevation model (DEM) in an ESRI GIS layer (SRTM30) for each coordinate set in the data. We excluded data records below 500 m a.s.l.

Fitness checks and record cleaning and grouping

The resulting dataset was queried for inconsistencies. Records where a declared textual description of the locality did not match their coordinates (e.g. countries other than Spain, France or Andorra) were filtered out. Records with invalid dates (e.g. in the future or with a month >12) or with incomplete dates were flagged as unusable for time analysis, but remained available for time-independent geographical analysis. Record and species counts were grouped at three geographical precision levels: nearest degree, nearest tenth of a degree and nearest hundredth of a degree, both in latitude and longitude.

A common list of taxonomic names was extracted from the records and checked for apparent misspellings or name variants. Names were interpreted based on Catalogue of Life, CoL (Bisby et al., 2011), MycoBank (Crous et al., 2004; Robert et al., n.d.), EUNIS (EEA, 2012), and Diatoms of the US (Spaulding et al., n.d.). Taxon interpretations were added to the records whenever possible. Name variants were located with the help of a fuzzy logic algorithm (Arasu et al., 2008) using a Jaccard similarity coefficient of 0.91 or higher between species names, and checked manually.
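The name-variant search can be illustrated with a short sketch of Jaccard-based matching. This is not the algorithm of Arasu et al. (2008) as used in the study: the tokenization into character trigrams, the function names, and the brute-force pairwise comparison are all assumptions made for illustration. Only the coefficient itself and the 0.91 default threshold come from the text.

```python
def char_ngrams(name: str, n: int = 3) -> set:
    """Set of lowercase character n-grams of a name (assumed tokenization)."""
    text = name.lower().strip()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard coefficient |A & B| / |A | B| over character n-grams."""
    set_a, set_b = char_ngrams(a, n), char_ngrams(b, n)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def candidate_variants(names, threshold: float = 0.91):
    """Return pairs of distinct names at or above the similarity threshold.

    Pairs are only candidates; as in the text, they still need manual checking.
    """
    unique = sorted(set(names))
    return [
        (x, y, jaccard(x, y))
        for i, x in enumerate(unique)
        for y in unique[i + 1:]
        if jaccard(x, y) >= threshold
    ]
```

How strict a given threshold is depends heavily on the tokenization: with character trigrams a single substituted letter already removes several shared n-grams, so the threshold and the n-gram size would need to be tuned together.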
Records with no or incomplete taxonomy were placed within the classification used in the CoL if a matching taxon name could be found. Orphan records (i.e. not matched to databases after spelling correction) were manually searched in taxonomic literature for placement within the unified taxon tree down to the lowest possible taxonomic level, but were left with their original name. To avoid branch proliferation and potential conflicts due to competing taxonomies, additional, operational taxonomic groups were made based on the higher classification but not restricted to a specified level.

Data management

Initial data extraction was made by issuing SQL statements directly on a copy of the June 2011 version of the GBIF index. Subsequent filtering, querying, cross-referencing and management were done with an xBASE interpreter (Microsoft Visual FoxPro, version 6). GIS layers were managed with Esri ArcView (version 10). Statistics were compiled in Microsoft Excel (version 2010). Treemaps (Bederson et al., 2002) were produced with ManyEyes (http://www-958.ibm.com/software/data/cognos/manyeyes/) using the Bruls-Huizing-Jarke algorithm (Bruls et al., 2000).

Results and discussion

Source distribution


The GBIF index had pointers to 781,764 records in the defined bounding box, of which 417,301 records (53%) met the elevation filter and fitness criteria. Three-quarters of the records originated in Spain, 16.6% in Andorra, and 7.4% in France. However, more records may exist in the area (for example, pointing to localities within the Pyrenean range) that were not captured for lack of georeferencing. According to GBIF (data.gbif.org/countries/browse/), 67.6% of all records from Spain, 79.1% of all records from Andorra and 90.1% of all records from France were georeferenced. Therefore, one can estimate how many records from the area would belong to the selection by multiplying the captured records from each country in the area by the reciprocal of the country-specific georeferencing rate. This would result in approximately 590,000 records estimated for the entire area, whether georeferenced or not.

The bulk of data come from just seven data publishers (7%), who altogether account for 90% of the records (Fig. 1): SIVIM, SIBA, FB, INB, IDBD-GN, MZNA and SPN. The distribution of the data records (line in Fig. 2) among publishers follows a power law ($y = 4.1 \cdot 10^{6}\, x^{-2.94}$, $R^{2} = 0.95$). Ninety-five publishers contributed the remaining 10% in 132 datasets (Table 1). However, these smaller publishers concentrated most of the specimen-based data (by definition verifiable and therefore potentially of high quality), while the largest publishers provided observations, or data that could be determined as neither observations nor specimens and that could potentially be second-hand (e.g. compiled from literature or other sources already accounting for some primary data). This differential distribution exemplifies the benefits of facilitating access to the ‘long tail’ of science data (Chavan and Penev, 2011) that may hold qualitatively significant assets highly valued by researchers (Ariño et al., 2012).
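The georeferencing correction described in this section is simple arithmetic and can be reproduced in a few lines. The sketch below is illustrative only: the share for Spain is not given exactly in the text ("three-quarters"), so it is assumed here to be the remainder after Andorra and France (76%), and the variable names are hypothetical.

```python
captured_total = 417_301  # records passing the elevation and fitness filters

# Share of captured records per country. Andorra and France are quoted in
# the text; Spain's 0.76 is an ASSUMPTION (remainder of "three-quarters").
share = {"Spain": 0.76, "Andorra": 0.166, "France": 0.074}

# Fraction of each country's GBIF records that were georeferenced.
georef_rate = {"Spain": 0.676, "Andorra": 0.791, "France": 0.901}

# Captured records per country, scaled by the reciprocal of the rate.
estimated = {
    country: captured_total * share[country] / georef_rate[country]
    for country in share
}
estimated_total = sum(estimated.values())
not_georeferenced = estimated_total - captured_total

print(f"estimated total: {estimated_total:,.0f}")
print(f"estimated non-georeferenced: {not_georeferenced:,.0f}")
```

Under these assumptions the total comes out close to the approximately 590,000 records quoted above. The estimate also presumes that country-wide georeferencing rates apply to the Pyrenean subset, which the text does not verify.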
Data publishers also differ greatly in their taxonomic coverage and sampling intensity, which are not linked to each other: large publishers may or may not also be rich publishers in terms of taxa reported. The largest publisher by volume (SIVIM) concentrates on plants, listing just 292 genera within more than 140,000 primary records, while the next publisher, SIBA, publishes about half as many records from a small area (Andorra) but lists 1319 genera from all taxonomic groups (Table 1). Similar differences can be observed in most publishers. Those collectively accounting for 99% of the mass of data are shown in Fig. 2. Taxonomic coverage, represented by number of genera in the datasets, is shown for these publishers.

Most data publishing initiatives are thematic, concentrating on few taxonomic groups. However, except for some providers regularly covering the geographic area with taxonomically specialized data (e.g. SIVIM, INB, MZNA), most large providers contribute generalized surveys with little taxonomic limitation (e.g. SIBA, FB, SPN, RJB). Smaller publishers are taxonomically specialized, showing either relatively low diversity but high geographic or temporal coverage, or wide taxonomic coverage within their groups with large numbers of distinct taxa in comparatively few primary records.

Sources were found to be inappropriate for estimating taxon-specific densities, as the retrieved dataset did not contain abundance data for each record, even though such data may exist internally (e.g. SIVIM, MZNA). However, they may be fit for presence-only methods, as the relatively large number of overlapping sources allows for frequency analysis.

Geographic coverage

The distribution of records across the range shows four distinct features (Fig. 3). First, essentially all the areas yield records, with very few areas not sampled. This includes the full alpine range except for some conspicuous voids at the highest altitudes towards the center-east. Second, the density of records is much greater on the south (Spanish) slopes of the range. Third, some areas are densely recorded, in particular Andorra (clearly visible as a blob of data), the western reaches (Navarra), and the alpine levels (near the divide), especially in the center and west. Fourth, a regular pattern can be observed superimposed on the general distribution of records. This regularity places records at certain coordinates evenly spaced in a diamond pattern, most conspicuous in the center and east regions of the range.

The regular pattern may be indicative of a posteriori georeferencing, coordinate reduction to a regular grid (e.g. taking a single coordinate of a UTM cell as reference for records anywhere within the cell), or a gridded sampling. The terrain characteristics may help discard the latter hypothesis, as the recording sites would be impossible to establish with such widespread regularity. The diamond pattern, rather than a square pattern, may hint at competing coordinate reduction schemes, e.g. center of a cell vs. corner of a cell for attribution to any record within the cell. By selectively plotting individual publishers’ data, we could determine that the regular patterns were concentrated in a few large data publishers, in particular nation-wide surveys using either observational or second-hand (report or literature) digitized data. Publishers performing this coordinate reduction did so differently depending on their taxonomic focus: plant data publishers and bird data publishers chose different cell coordinate references, resulting in two square patterns offset in both latitude and longitude. Data records along the same latitude or longitude grid were separated by 10 km, which is consistent with coordinate reduction from 10 x 10 km UTM cells, with records placed either at the center or at one corner. The weighted average of the declared precision of the records in the area was 17.6 km.
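The effect of competing coordinate reduction schemes can be sketched with plain integer arithmetic on projected coordinates. This is a hypothetical illustration, not the publishers' actual procedure: the function names, the choice of the south-west corner, and the sample point are assumptions, and real UTM handling (zones, reprojection to latitude and longitude) is omitted.

```python
CELL = 10_000  # cell size in metres (a 10 x 10 km grid cell)


def snap_to_corner(easting: int, northing: int, cell: int = CELL):
    """Reduce a coordinate to the south-west corner of its grid cell."""
    return (easting // cell) * cell, (northing // cell) * cell


def snap_to_center(easting: int, northing: int, cell: int = CELL):
    """Reduce a coordinate to the centre of its grid cell."""
    e, n = snap_to_corner(easting, northing, cell)
    return e + cell // 2, n + cell // 2


# Hypothetical projected coordinates in metres, for illustration only.
point = (432_100, 4_712_900)
corner = snap_to_corner(*point)
center = snap_to_center(*point)
offset = (center[0] - corner[0], center[1] - corner[1])
```

Each convention on its own yields a square lattice with 10 km spacing. Two publishers using different conventions yield two such lattices shifted by half a cell along both axes, which is one way the combined plot could take on a diamond appearance.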
However, approximately 52% of the records explicitly declared a coordinate precision of 10 km (approximately 1/5 of a degree in longitude), consistent with the grid hypothesis. This sets the records into two additional groups: records with coordinates precise to less than 10 km (23.5%, including 8.5% precise to 1 km or less) and imprecise records coarser than 10 km (16.3%), which will be fit for different uses (see Guralnick and Hill, 2009).

Historical availability

Digital records published for the Pyrenean range span two centuries, with the earliest occurrence dated in 1796 (a wolf, inventoried by the Service du Patrimoine Naturel). However, occurrences appeared slowly until the end of the 1960s, when a marked change in the accumulation slope took place. From that decade and especially after 1975, the number of recorded occurrences grew steadily (but not exponentially) at an almost constant rate of 15,000 dated records per year (Fig. 4).

The growth in occurrences was largely tied to specimen-based data, i.e. data digitized from vouchered museum material (Fig. 4, blue). The increase in data acquired through observations has been steady but at a much lower pace (Fig. 4, orange). This may seem rather surprising in view of other analyses, where observation-based digital data have become available at a much greater rate than specimen-based data (Otegui et al., n.d.; Gaiji et al., 2012). Two factors may help explain this discrepancy. First, the potential to digitize and release data has existed only recently (Ariño et al., 2011), and there has been a backlog of existing but undigitized data that have recently emerged in digital form as digitization projects on museum material have taken hold (Berendsohn et al., 2010; Ariño, 2010). Digitizing efforts, being costly, have been prioritized (Chapman, 2005; Scoble, 2010) and museum collections, largely formed by vouchered specimens, have been processed earlier


than observation-based records. Second, as we shall see later, the low completeness level for some large publishers allows plotting only data for which the occurrence year is known, which leaves out a very large block of undated data (see Completeness analysis, below).

While the great growth in occurrence data is recent, the accrual of taxonomical data has predated the raw occurrence data in some groups (Fig. 5). Although the turning points appear at about the same moments in time, some taxon groups were largely known already in the first half of the 20th century. By 1970, about one third of all currently known genera of plants and arthropods for the area had already been recorded. All groups considered, one-fifth of all genera now known had already been recorded in 1960 within less than one-tenth of all records now existing.

For some groups, very few new taxa have been found in recent years. Until 1969, the number of recorded vertebrate genera was comparatively low (55 genera out of 211 currently known), and within the next decade the figure more than tripled (doubled in just two years), with 86% of all vertebrate genera (181) recorded by 1980 (Fig. 5, blues). Other groups that had been largely absent from the record started to appear and accounted for most of the growth from the 1970s on: fungi, mosses, and ferns. As these still-growing groups (along with arthropods and higher plants) accrue more taxa, their rate should also decline and newer occurrences will have less chance to contribute newer taxa, as is typical of cumulative sampling curves (Margalef, 1980). Signs of stabilization are already clear: vertebrates are very stable, and fungi, mosses, monocots and dicots are growing at a steadily lower rate. Ultimately, arthropods and other invertebrates are the groups where more taxa could be found in additional occurrences.

Availability of data is not homogeneous throughout the seasons (Fig. 6).
Winter dates are erratic, with most data during November through February concentrated on specific days. From March onwards, however, records are much more evenly distributed and increase consistently until June, when they maintain high densities (records per day) throughout July and August, decreasing to less than one-tenth through September. This pattern of midsummer maxima differs significantly from the pattern found in the Iberian Peninsula, where the collection dates concentrate in spring and early summer and the differences between high-production and low-production periods are less marked (Otegui et al., n.d.). It may be related to the accessibility of the range, which can change dramatically from summer to winter. The spike for Jan. 1st (out of range in the plot) is attributable to a fictitious date: when the publisher does not have, or does not reveal, the actual occurrence date but has only the year, it may choose to date the records as “1/1/year” to comply with date format requirements in databases (Otegui et al., 2012).

Taxonomic spread

The majority of taxa recorded in the Pyrenean range are plants, both as number of different taxa (richness) and as number of records. Magnoliopsida (dicots) represents more than half of the records (52%, 215,556 records) and almost half of the richness (44%, 5,460 taxa). However, the most represented order is a group of plants belonging to class Liliopsida (monocots): Poales (grasses) account for 76,356 records (18%) and 884 taxa (7%). Among animals, insects and birds are the most recorded classes, and Lepidoptera and Passeriformes the largest orders respectively (Fig. 7).


According to various authors (e.g. Erwin, 1982; May, 1992, 2010; Mora et al., 2011; Stork, 2007) the richest taxonomic group of eukaryotes in the biosphere is the Insecta (with Coleoptera accounting for roughly 50% of all Arthropods), with higher plants coming second to that order. The imbalance observed in the Pyrenean data, where more than half of the taxa (56%) and most records (76%) are higher plants (Table 2), doubtlessly stems from biased surveys or data digitization, which have advanced much more for plants (through herbaria and botanical surveys) than for animals or even fungi. Also, more than twice as many taxa of Lepidoptera as of Coleoptera have been recorded, while Vertebrates almost equal all Invertebrates. Historical trends may help explain the large abundance of data on Lepidoptera, which are often hunted for collections (either scientific or commercial) that eventually end up being held in museums and digitized. Within Vertebrates, which contain charismatic species to which particular attention is paid, e.g. through conservation plans (Pino-Del-Carpio et al., 2011), one particularly popular group is birds, producing vast amounts of observational data (Gilman et al., 2009; Avian Knowledge Network, 2009) because of their ease of observation and because of ringing campaigns. It should also be noted that large initiatives seeking to collect observational data digitally have been and are in place for ornithology (e.g. eBird), yielding an inordinate (as compared to other groups) number of records. In the entire GBIF, as of 2012 more than 60% of all PBRs are observational, and the vast majority of these are bird data (Gaiji et al., 2012; Otegui and Ariño, 2009; Otegui et al., 2009).

In all, the 102 publishers supply data on 320,893 plant records (7,411 taxa), 4,473 fungal records (1,581 taxa) and 91,936 animal records (3,498 taxa).
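The record and taxon counts just given can be turned into taxa-per-record ratios, which make the contrast between groups explicit. The short sketch below only reproduces that arithmetic from the figures in the text; the dictionary layout and names are illustrative.

```python
# Record and taxon counts as reported in the text.
counts = {
    "plants": {"records": 320_893, "taxa": 7_411},
    "fungi": {"records": 4_473, "taxa": 1_581},
    "animals": {"records": 91_936, "taxa": 3_498},
}

# Taxa found per 100 records, rounded to one decimal place.
ratios = {
    group: round(100 * c["taxa"] / c["records"], 1)
    for group, c in counts.items()
}

for group, pct in ratios.items():
    print(f"{group}: {pct}% taxa per record")
```

A low ratio means many records per taxon (repeated sampling of the same taxa), while a high ratio means that almost every new record tends to add a taxon.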
The large difference between records and taxa for plants and animals may point to a much more systematic botanical survey of the territory than the zoological survey. However, the zoological survey seems to be more targeted at uncovering the biodiversity than at representing the physiognomy of the territory. If the differences were due solely to the relative difference in sampling effort, the number of taxa found in the animal group would have been more commensurate with the size of the sample, i.e. the number of records. This is even more evident for Fungi, where 4,473 records have yielded 1,581 taxa (35% species-specimen ratio; compare with 2.3% for plants and 3.8% for animals).

Completeness analysis

The assessment of whether the available PBRs are complete (i.e. do not lack data, or data can be correctly interpreted and used) as regards their three main attributes (geographical location, time of occurrence, and taxonomical placement) shows significant gaps.

Location completeness cannot be properly assessed from the selected dataset, as by definition only georeferenced records have been taken and therefore these should be considered “complete” in this regard. However, as mentioned earlier, one should expect other records identified by textual locality falling within the region that could not be captured for lack of appropriate georeferencing. We have estimated that about 173,000 such records are available in digital form from localities in the area. If these records could be located and georeferenced, an additional 36% could therefore be added to the current dataset. It is uncertain how many new taxa this addition could represent, as the accrual of new taxa is not linear with the accrual of primary records (Figs. 4 and 5).

Examination of Fig. 3 reveals that the Pyrenean range can be divided into four broad areas as regards geographical coverage:


- Areas with a very intense concentration of data: Andorra and sections of the central axis;
- Areas where data points come randomly distributed: Western side (Navarra);
- Areas with a regular pattern of data points: Most of the central and eastern region;
- Areas with scant or no coverage: Eastern margin.

These patterns may indicate a geographical incompleteness in the georeferencing. Coordinates in areas densely surveyed with non-uniformly located points can generally be relied upon and matched to terrain, but data points following regular patterns are unlikely to represent exact terrain points: in a rugged area such as a mountain range, it would be unrealistic to expect to sample at exactly placed, regularly spaced locations without producing samples from the spaces in between. Therefore the “regular sample pattern” may be an artifact of a posteriori coordinate attribution, where large, regular sectors for which occurrences are recorded at the sector’s size precision have been transformed to coordinates. While not strictly incomplete data, these records may be regarded as limited in precision. By the same token, intervening areas cannot be properly considered “gaps”, as the records might have actually come from anywhere within the geographical gap between data points.

On the other hand, regions lacking data may be an effect of existing datasets not yet linked to the main repository. The list of data publishers (Table 1) lacks some known datasets that hold vast amounts of georeferenced data for the region, most notably the Banc de dades de biodiversitat de Catalunya (BDBC: Font et al., n.d.), which at the time of analysis (October 2011) was not yet indexed in the main GBIF database. The amount of available data depends dramatically on datasets being indexed, and these contribute in discrete steps that often fill gaps (Gaiji et al., 2012; Otegui and Ariño, 2009). Once such datasets become indexed, the completeness level of records may increase significantly.

Complete dates are lacking in many PBRs, which in turn concentrate in specific data publishers who have not made these time-related data available, or do not have such data. Thirty-eight percent of the datasets do not contain dates, and together these represent 61% of records (Fig. 8).
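Date completeness of a record can be checked mechanically. The following is a minimal sketch, assuming hypothetical separate year, month and day fields: it labels a date as fully specified (day, month and year), partially specified (e.g. year only), absent, or invalid, applying the validity rules stated in the Methods (dates in the future, months above 12). The function name, the labels, and the treatment of a month or day given without a year are assumptions, not the procedure used in the study.

```python
import datetime
from typing import Optional


def date_completeness(year: Optional[int], month: Optional[int],
                      day: Optional[int],
                      today: Optional[datetime.date] = None) -> str:
    """Classify a record's date as 'full', 'partial', 'none' or 'invalid'."""
    today = today or datetime.date.today()
    if year is None:
        # Assumption: a month or day without a year is treated as undated.
        return "none"
    if year > today.year:
        return "invalid"
    if month is None:
        return "partial"            # year only
    if not 1 <= month <= 12:
        return "invalid"
    if day is None:
        return "partial"            # year and month only
    try:
        date = datetime.date(year, month, day)
    except ValueError:              # e.g. 30 February
        return "invalid"
    return "invalid" if date > today else "full"
```

A check of this kind cannot detect placeholder dates such as “1/1/year”, which are formally complete; those would need a separate frequency test on the day and month fields.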
On the other hand, 44% of datasets, accounting for 27,000 records, are date-complete. A smaller share of datasets (20%, holding 25% of records) is partially complete, i.e. individual records may or may not have fully specified dates (a fully specified date contains day, month and year, whereas a partially specified date may contain e.g. only the year). In all, only 56,079 records (13.4%) contained complete dates, although up to 60% (251,804 records) had year information. The proportion of incompletely dated records is quite high, much more so than in the general GBIF set, where 46% of all records had full dates and 73% had information on the year of occurrence (Otegui et al., 2012). For the publishers hosted at the Spanish node (gbif.es), date completeness is 44% and year information exists in 61% of all records (Otegui et al., n.d.). Although completely dated records are much scarcer for the Pyrenees, year information at least is given at a rate similar to that of the rest of the Spanish providers.

It should be noted that the bulk of incomplete dates is concentrated in a few very large publishers: FB, IDBD-GN, INB, SIVIM and SIBA, although most of them do provide partial information such as year of collection; e.g. IDBD-GN provides the year in 31% of its records. All these repositories are inventories of observational data where the main concern is attributing chorology, i.e. presence of a species in a given area (often represented as presence in a cell within a pre-established grid). Such data might have been recorded or collated without regard for precise dates, as these might have been considered secondary to the main purpose of establishing presences, or might have been attributed an “inventory year” (e.g. INB and SIVIM report species known to exist in the territory in a specific year, without date of occurrence). On the other hand, most data based on vouchered specimens or surveys do have a date of collection (Table 1).

Close examination of Fig. 7 and Table 2 points to incomplete taxonomic coverage through the omission of groups. While plants are well represented, both taxonomically and numerically, other rich groups of organisms are represented by few taxa. For instance, many groups of small soil invertebrates, such as Nematoda or Collembola, are largely absent. On the other hand, certain “preferred” groups contribute many data even though their taxonomic spread may be close to saturation (e.g. birds). Misspellings were common: a review of all names used found more than 10% of names incorrectly digitized, often with several variants used within the same publisher, which limits the use of taxonomic names in retrieving and summarizing data. As with the geographical gaps, taxonomic “holes” may be related to the aggregated nature of data publishers, which also reflects the targets and expertise of collectors or contributors: other than general surveys, most publishers are thematic and contribute taxa in specific groups (Fig. 2). Incorporating new data publishers may help fill such gaps.

Conclusions

The characteristics of the digitally available data for the Pyrenean region determine what uses can be made of these data, according to their fitness for use (Guralnick and Hill, 2009).
Although there is a large amount of available data and more than 10,000 taxa have been recorded, most taxa are recorded only a few times in the territory, while a few others are recorded in a regular grid that may not represent accurate locations but, rather, a rounding of coordinates or an attribution from presence cells. The different spatial patterns point to a mix of high-density, high-precision sample data with general survey data, which might even be second-hand (derived from presence data in regular grids).

Completeness is a progressive exercise. Between the time of this analysis and the time of writing, one large dataset has been made available (albeit only through the Spanish node, not through the international portal), filling a gap in the Eastern sector of the region. For areas not yet well covered, datasets such as this new one may also exist that are yet to be aggregated to the general index.

Although there are historical data, these are limited, and most information was collected in the last 30 years. This may hamper the usability of these data to detect long-term biodiversity trends, precisely the ones arguably most linked to environmental change. However, the incorporation of new datasets may again help to extend significantly the span of available data, as well as to fill taxonomic gaps, a step towards ensuring complete data coverage (Yesson et al., 2007).

The large density of data points in some sectors may facilitate using such data for distribution model analysis, but this must be balanced against presence data reduced to a grid, as mentioned before. Distribution models are of great importance in biodiversity management (Peterson et al., 2011), and therefore the digitally available data may already be a critical source of information for an area under ecological pressure (the Pyrenean range), provided that the characteristics, patterns and gaps in such data are known and accounted for while assessing their fitness for use.

References

Arasu A., Chaudhuri S., Kaushik R., (2008), Transformation-based Framework for Record Matching, 2008 IEEE 24th International Conference on Data Engineering, (1), 40-49.
Ariño A. H., (2010), Approaches to estimating the universe of natural history collections data, Biodiversity Informatics, 7(2), 81-92.
Ariño A. H., Chavan V., Faith D. P., (2012), Assessment of user needs of primary biodiversity data: Analysis, Concerns, and Challenges, Biodiversity Informatics, in press.
Ariño A. H., Chavan V., King N., (2011), The Biodiversity Informatics Potential Index, BMC Bioinformatics, 12(Suppl 15), S4.
Ariño A. H., Otegui J., (2008), Sampling Biodiversity Sampling, In Proceedings of TDWG, Weitzman A. L., Belbin L., (Eds.), 107, Online at: http://www.tdwg.org/proceedings/article/view/413
Avian Knowledge Network, (2009), Avian Knowledge Network: An online database of bird distribution and abundance [web application], Online at: http://www.avianknowledge.net/
Bederson B. B., Shneiderman B., Wattenberg M., (2002), Ordered and Quantum Treemaps: Making Effective Use of 2D Space to Display Hierarchies, ACM Transactions on Graphics (TOG), 21(4), 833-854.
Berendsohn W. G., Chavan V., Macklin J. A., (2010), Recommendations of the GBIF Task Group on Global Strategy and Action Plan for the mobilization of natural history collections data, Biodiversity Informatics, 7, 67-71.
Bisby F. A., Roskov Y. R., Orrell T. M., Nicolson D., Paglinawan L. E., Bailly N., Kirk P. M., Bourgoin T., Baillargeon G., (2011), Species 2000 & ITIS Catalogue of Life: 2011 Annual Checklist, Online at: www.catalogueoflife.org/annual-checklist/2011/
Boakes E. H., McGowan P. J. K., Fuller R. A., Chang-Qing D., Clark N. E., O’Connor K., Mace G. M., (2010), Distorted views of biodiversity: spatial and temporal bias in species occurrence data, PLoS Biology, 8(6), e1000385.
Boesch D., (2006), Scientific requirements for ecosystem-based management in the restoration of Chesapeake Bay and Coastal Louisiana, Ecological Engineering, 26(1), 6-26.
Bruls M., Huizing K., van Wijk J., (2000), Squarified Treemaps, Proceedings of the Joint Eurographics and IEEE TCVG Symposium on Visualization, Online at: http://www.win.tue.nl/~vanwijk/stm.pdf
Chapman A., (2005), Uses of Primary Species-Occurrence Data, GBIF, Copenhagen.
Chavan V., Penev L., (2011), The data paper: a mechanism to incentivize data publishing in biodiversity science, BMC Bioinformatics, 12(Suppl 15), S2.
Chavan V., Sood R. K., Ariño A. H., (2010), GBIF Best Practice Guide For “Data Discovery and Publishing Strategy and Action Plans”. Version 1.0., GBIF, Copenhagen.
Comín F., Martínez Rica J. P., (2007), Los Pirineos en el contexto de las montañas del mundo: rasgos y peculiaridades, Pirineos, 162, 13-41.
Crous P. W., Gams W., Stalpers J. A., Robert V., Stegehuis G., (2004), MycoBank: an online initiative to launch mycology into the 21st century, Studies in Mycology, 50, 19-22.
Erwin T. L., (1982), Tropical forests: Their richness in Coleoptera and other Arthropod species, Coleopterists Bulletin, 36(1), 74-75.
European Environment Agency, (2012), EUNIS biodiversity database. Online at: http://eunis.eea.europa.eu/species-names.jsp

Font X., De Cáceres M., Quadrada R.-V., Navarro A., (n.d.), Banc de Dades de Biodiversitat de Catalunya. Generalitat de Catalunya and Universitat de Barcelona, Online at: http://biodiver.bio.ub.es/biocat/
GBIF, (n.d.), Global Biodiversity Information Facility, Online at: http://www.gbif.org/
Gaiji S., Chavan V., Ariño A. H., Otegui J., Robles E., King N., (2012), Content assessment of the primary biodiversity data published through GBIF network: Status, Challenges and Potentials, Biodiversity Informatics, in press.
García M., Gómez D., (2007), Flora del Pirineo aragonés. Patrones espaciales de biodiversidad y su relevancia para la conservación, Pirineos, 162, 71-88.
Gilman E., King N., Peterson A. T., Chavan V., Hahn A., (2009), Building the Biodiversity Data Commons -- the Global Biodiversity Information Facility, In ICT for Agriculture and Biodiversity Conservation, L. Maurer (Ed.), Graz, Austria, 79-99.
Gonzalez A., Rayfield B., Lindo Z., (2011), The disentangled bank: How loss of habitat fragments and disassembles ecological networks, American Journal of Botany, 98(3), 503-16.
Graham C. H., Ferrier S., Huettmann F., Moritz C., Peterson A. T., (2004), New developments in museum-based informatics and applications in biodiversity analysis, Trends in Ecology & Evolution, 19(9), 497-503.
Guralnick R., Hill A., (2009), Biodiversity informatics: automated approaches for documenting global biodiversity patterns and processes, Bioinformatics, 25(4), 421-428.
Gutiérrez Elorza M., (2007), El papel del hombre en la creación y destrucción del relieve, Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, 101, 211-226.
Hill A. W., Otegui J., Ariño A. H., Guralnick R. P., (2010), GBIF Position Paper on Future Directions and Recommendations for Enhancing Fitness-for-Use Across the GBIF Network, Global Biodiversity Information Facility, Copenhagen, Online at: http://www2.gbif.org/GPP-Final.pdf
Johnson N. F., (2007), Biodiversity informatics, Annual Review of Entomology, 52, 421-38.
Krishtalka L., Humphrey P. S., (2000), Can natural history museums capture the future? BioScience, 50(7), 611-617.
Körner C., (2004), Mountain biodiversity, its causes and function, Ambio, 13, 11-17.
Lasanta T., (2010), El turismo de nieve como estrategia de desarrollo en el Pirineo aragonés, Cuadernos de investigación geográfica, 36(36), 145-163.
Margalef R., (1980), Ecología, Omega, Barcelona.
May R. M., (1992), How Many Species Inhabit the Earth? Scientific American, 267, 42-48.
May R. M., (2010), Ecology. Tropical arthropod species, more or less? Science, 329(5987), 41-42.
Mora C., Tittensor D. P., Adl S., Simpson A. G. B., Worm B., (2011), How many species are there on Earth and in the ocean? PLoS Biology, 9(8), e1001127.
OECD, (1999), Final report of the OECD megascience forum: Working group on biological informatics, January 2009, Online at: http://www.oecd.org/dataoecd/24/32/2105199.pdf
Otegui J., Ariño A. H., (2009), Have Standards Enhanced Biodiversity Data? Global correction and acquisition patterns, In Proceedings of TDWG, Weitzman A. L. (Ed.), 92, Online at: http://www.tdwg.org/proceedings/article/view/494
Otegui J., Ariño A. H., Chavan V., Gaiji S., (2012), On the dates of the GBIF-mobilised primary biodiversity data records, Biodiversity Informatics, in press.
Otegui J., Ariño A. H., Pando F., Encinas M. A., (n.d.), Assessing the primary data hosted by the Spanish node of the Global Biodiversity Information Facility (GBIF), PLoS ONE, submitted.
Otegui J., Robles E., Ariño A. H., (2009), Noise in Biodiversity Data, e-Biosphere Conference on Biodiversity Informatics, London.
Pauli H., Gottfried M., Grabherr G., (2001), High summits of the Alps in a changing climate. In “Fingerprints” of Climate Change. Adapted behaviour and shifting species ranges, Walther G. R., Burga C. A., Edwards P. J. (Eds.), Kluwer Academic, 139-149.
Peterson A. T., Soberón J., Pearson R. G., Anderson R. P., Martínez-Meyer E., Nakamura M., Araújo M. B., (2011), Ecological Niches and Geographic Distributions, Princeton University Press.

Pino-Del-Carpio A., Villarroya A., Ariño A. H., Puig J., Miranda R., (2011), Communication gaps in knowledge of freshwater fish biodiversity: implications for the management and conservation of Mexican biosphere reserves, Journal of Fish Biology, 79(6), 1563-1591.
Robert V., Stegehuis G., Stalpers J. A., (n.d.), The Mycobank engine and related databases, Online at: http://www.mycobank.org
Scoble M. J., (2010), Natural history collections digitization: rationale and value, Biodiversity Informatics, 7(2), 77-80.
Soberón J., Peterson A. T., (2009), Monitoring biodiversity loss with primary species-occurrence data: toward national-level indicators for the 2010 target of the convention on biological diversity, Ambio, 38(1), 29-34.
Spaulding S. A., Lubinski D. J., Potapova M., (n.d.), Diatoms of the United States, Online at: http://westerndiatoms.colorado.edu
Stork N. E., (2007), Biodiversity: world of insects, Nature, 448(7154), 657-658.
Wieczorek J., Bloom D., Guralnick R., Blum S., Döring M., Giovanni R., Robertson T., Vieglais D., (2012), Darwin Core: An Evolving Community-Developed Biodiversity Data Standard, (I. N. Sarkar, Ed.) PLoS ONE, 7(1), e29715.
Wohl E., (2006), Human impacts to mountain streams, Geomorphology, 79(3-4), 217-248.
Yesson C., Brewer P. W., Sutton T., Caithness N., Pahwa J. S., Burgess M., Gray W. A., White R. J., Jones A. C., Bisby F. A., Culham A. L., (2007), How Global Is the Global Biodiversity Information Facility? PLoS ONE, 2(11), e1124.


TABLES

Table 1. Codes and abbreviations for data publishers contributing data for the Pyrenean region. Datasets: number of separate datasets from a publisher containing records for the area. PBR: primary biodiversity data records. Earliest and latest year: first and last years of occurrences recorded by the publisher in the region. ‘Percent dated’ refers to fully dated records. Vouchered records are backed by a specimen.

| Code | Data publisher (host) | Datasets | Genera | PBR | Earliest year | Latest year | Percent dated | Percent vouchered |
|---|---|---|---|---|---|---|---|---|
| ABH | Instituto de Investigación CIBIO, Universidad de Alicante: ABH (GBIF-Spain) | 1 | 320 | 1292 | 1949 | 2009 | 100% | 100% |
| ANSP | Academy of Natural Sciences OBIS Mollusc Database (Ocean Biogeographic Information System) | 1 | 9 | 9 | - | - | 0% | 100% |
| AraIb | Morano and Cardoso: AraIb (GBIF-Spain) | 1 | 13 | 93 | - | - | 0% | 0% |
| ARAN | Aranzadi Zientzi Elkartea (GBIF-Spain) | 1 | 355 | 5065 | 1953 | 2006 | 100% | 100% |
| AUDCLO | eBird Bird Observation Checklist Database (Avian Knowledge Network) | 1 | 84 | 929 | 1974 | 2010 | 0%* | 0% |
| BC | Institut Botànic de Barcelona, BC (GBIF-Spain) | 1 | 350 | 2508 | 1826 | 2010 | 85%* | 100% |
| BCN | CeDoc de Biodiversitat Vegetal: BCN (GBIF-Spain) | 1 | 278 | 769 | 1978 | 2008 | 96% | 100% |
| BIO | Herbario BIO de la Universidad del País Vasco/EHU, Bilbao (GBIF-Spain) | 1 | 76 | 218 | 1979 | 2003 | 100% | 100% |
| CANB | Australian National Herbarium (CANB) | 1 | 2 | 2 | 1999 | 2000 | 100% | 100% |
| CAS | California Academy of Sciences | 1 | 1 | 1 | 1981 | 1981 | 0%* | 100% |
| CEABCSIC | Centre d'Estudis Avançats de Blanes (GBIF-Spain) | 1 | 28 | 314 | 2005 | 2009 | 100% | 75% |
| CIBIO | Instituto de Investigación CIBIO, Universidad de Alicante: CEUA (GBIF-Spain) | 1 | 52 | 688 | 1883 | 2004 | 100% | 100% |
| CMN | Canadian Museum of Nature | 1 | 1 | 1 | 1950 | 1950 | 100% | 100% |
| COA | Jardín Botánico de Córdoba: Herbarium COA (GBIF-Spain) | 1 | 335 | 1220 | 1922 | 2006 | 90%* | 100% |
| COFC | Facultad de Ciencias, Universidad de Córdoba: Herbario COFC (GBIF-Spain) | 1 | 21 | 32 | 1978 | 1993 | 100% | 100% |
| COL003 | The System-wide Information Network for Genetic Resources (SINGER) (Bioversity International) | 1 | 1 | 1 | - | - | 0% | 0% |
| CZULE | Colecciones Zoológicas de la Universidad de León (GBIF-Spain) | 1 | 3 | 14 | 1986 | 2001 | 100% | 0% |
| DABUH | European Lepidoptera Observations by Donald Hobern (University of Helsinki, Department of Applied Biology) | 2 | 9 | 10 | 2003 | 1980 | 100% | 0% |
| DL | Therevid PEET Project (Discover Life) | 1 | 2 | 2 | 1991 | 1991 | 0%* | 100% |
| EEZA | Estación Experimental de Zonas Áridas (CSIC) EEZA (GBIF-Spain) | 1 | 1 | 2 | 1980 | 1980 | 100% | 100% |
| EGR | Egrell, Lleida - Hymenoptera (GBIF-Spain) | 1 | 52 | 175 | 2003 | 2008 | 100% | 100% |
| EMMA | Escuela Técnica Superior de Ingenieros de Montes, UPM: EMMA (GBIF-Spain) | 1 | 6 | 7 | 1999 | 2005 | 86%* | 100% |
| EURISCO | EURISCO, The European Genetic Resources Search Catalogue (Bioversity International) | 15 | 76 | 460 | - | - | 0% | 0% |
| FB | Fundación Biodiversidad, Real Jardín Botánico (CSIC): Anthos. Sistema de Información de las plantas de España (GBIF-Spain) | 1 | 1056 | 68409 | 1808 | 2011 | 5%* | 0% |
| FCO | Universidad de Oviedo. Departamento de Biología de Organismos y Sistemas: FCO (GBIF-Spain) | 1 | 345 | 1458 | 1955 | 2007 | 98%* | 100% |
| FMI | Flora Mycologica Iberica (GBIF-Spain) | 1 | 177 | 1663 | 1912 | 2002 | 86%* | 0% |
| FMNH | FMNH Mammals Collections (Field Museum) | 1 | 2 | 2 | 1949 | 1949 | 0%* | 100% |
| FR | Herbarium Senckenbergianum (FR) (Senckenberg) | 1 | 24 | 54 | 1988 | 1988 | 100% | 100% |
| FTG | Fairchild Tropical Botanic Garden Virtual Herbarium | 1 | 4 | 9 | 1955 | 1999 | 100% | 100% |
| HAVI | Fungal Specimens collected by HabitatVision (Jacob Heilmann-Clausen) (DanBIF) | 1 | 19 | 23 | 2005 | 2005 | 100% | 100% |
| HGI | Universitat de Girona: HGICormophyta (GBIF-Spain) | 1 | 119 | 827 | 1913 | 2004 | 97% | 0% |
| HSS | Dirección General de Investigación, Desarrollo Tecnológico e Innovación de la Junta de Extremadura (DGIDTI): HSS (GBIF-Spain) | 1 | 150 | 308 | 1990 | 2005 | 100% | 100% |
| HUAL | Herbario de la Universidad de Almería (GBIF-Spain) | 1 | 4 | 5 | 2005 | 2005 | 100% | 100% |
| IDBD-GN | Catálogo Florístico Histórico de Navarra. Gobierno de Navarra (GBIF-Spain) | 1 | 697 | 26532 | 1967 | 1989 | 0%* | 0% |
| INB | Inventario Nacional de Biodiversidad. Ministerio de Medio Ambiente, y Medio Rural y Marino. (GBIF-Spain) | 7 | 247 | 38099 | 2006 | 2007 | 0%* | 0% |
| IPK | IPK Genebank (Leibniz Institute of Plant Genetics and Crop Plant Research) | 1 | 1 | 1 | - | - | 0% | 0% |
| JBAG | Jardín Botánico Atlántico, Gijón (GBIF-Spain) | 1 | 49 | 75 | 1967 | 2007 | 100% | 100% |
| K | Royal Botanic Gardens, Kew | 1 | 5 | 5 | 1859 | 1999 | 100% | 100% |
| KU | University of Kansas Biodiversity Research Center | 1 | 4 | 4 | 1950 | 2000 | 50%* | 50% |
| L | Nationaal Herbarium Nederland (Netherlands Centre for Biodiversity Naturalis) | 1 | 139 | 218 | 1831 | 2004 | 85%* | 100% |
| LACM | Vertebrate specimens (Los Angeles County Museum of Natural History) | 1 | 3 | 5 | 1950 | 1982 | 100% | 100% |
| LD | Lund Botanical Museum (GBIF-Sweden) | 1 | 1 | 2 | 1999 | 1999 | 100% | 100% |
| LEB | Botánica, Universidad de León (GBIF-Spain) | 1 | 57 | 156 | 1947 | 1992 | 100% | 100% |
| LI | Biologiezentrum Linz Oberoesterreich | 1 | 24 | 49 | 1931 | 1999 | 94% | 98% |
| M | Botanische Staatssammlung München (Staatliche Naturwissenschaftliche Sammlungen Bayerns) | 1 | 7 | 10 | 1872 | 1999 | 40%* | 67% |
| MAF | Facultad de Farmacia, Universidad Complutense, Madrid (GBIF-Spain) | 1 | 117 | 232 | 1948 | 2000 | 100% | 100% |
| MCNB | Museu de Ciències Naturals de Barcelona (GBIF-Spain) | 2 | 78 | 531 | 1913 | 2005 | 98% | 100% |
| MCZ | Museum of Comparative Zoology, Harvard University | 1 | 2 | 4 | 1924 | 2000 | 25%* | 100% |
| MGC | Universidad de Málaga: MGC (University of Malaga) | 2 | 177 | 317 | 1955 | 2002 | 99%* | 100% |
| MHNG | Muséum d'histoire naturelle de la Ville de Genève (GBIF Swiss Node) | 1 | 2 | 13 | 1978 | 1993 | 0%* | 100% |
| MMEY | Planetary Biodiversity Inventory: Eumycetozoan Databank (University of Arkansas) | 2 | 20 | 114 | 1973 | 2006 | 94%* | 100% |
| MNCN | Museo Nacional de Ciencias Naturales, Madrid (GBIF-Spain) | 5 | 117 | 2633 | 1881 | 2007 | 71%* | 74% |
| MNHN | Museum national d'histoire naturelle et Reseau des Herbiers de France | 4 | 31 | 98 | 1828 | 2003 | 60%* | 100% |
| MNHNL | Musée national d'histoire naturelle Luxembourg | 1 | 89 | 250 | 1886 | 2000 | 90%* | 0% |
| MO | Missouri Botanical Garden | 1 | 85 | 113 | 1847 | 2003 | 98%* | 100% |
| MUB | Universidad de Murcia, Dpto. Biología Vegetal (Botánica), Murcia (GBIF-Spain) | 1 | 61 | 152 | 1965 | 2007 | 96%* | 100% |
| MVZ | Museum of Vertebrate Zoology (Arctos) | 1 | 16 | 141 | 1977 | 1977 | 87% | 100% |
| MZNA | Museum of Zoology, University of Navarra | 1 | 182 | 15507 | 1968 | 2009 | 77%* | 97% |
| NBM | New Brunswick Museum | 1 | 1 | 1 | 1864 | 1864 | 0%* | 100% |
| NMB | Naturhistorisches Museum Basel (GBIF Swiss Node) | 1 | 1 | 8 | 1978 | 1998 | 0%* | 100% |
| NMBE | Naturhistorisches Museum Bern (GBIF Swiss Node) | 1 | 1 | 1 | - | - | 0% | 100% |
| NMR | Natural History Museum Rotterdam (NLBIF) | 1 | 6 | 17 | 1936 | 1981 | 88% | 100% |
| NNM | Naturalis National Natural History Museum (NL) (NLBIF) | 1 | 14 | 216 | - | - | 0% | 100% |
| NRM | Mammals (NRM) (GBIF-Sweden) | 1 | 2 | 2 | 1845 | 1907 | 50%* | 100% |
| NSW | NSW herbarium collection (National Herbarium of New South Wales) | 1 | 1 | 2 | 1907 | 2008 | 100% | 100% |
| NY | Herbarium of The New York Botanical Garden | 1 | 5 | 5 | 2008 | 1987 | 100% | 100% |
| O | Lichen herbarium, Oslo (O) (Natural History Museum, University of Oslo) | 1 | 4 | 11 | 1987 | 1987 | 100% | 100% |
| OSAL | Ohio State University Acarology Collection | 1 | 26 | 107 | 1937 | 1983 | 95% | 91% |
| PBDB | Paleobiology Database (Marine Science Institute, UCSB) | 1 | 616 | 2709 | - | - | 0% | 0% |
| PMSL | World flea collection of Slovenian Museum of Natural History (excluding Slovenia) (Prirodoslovni muzej Slovenije) | 1 | 1 | 1 | 1974 | 2006 | 100% | 100% |
| RJB | Real Jardín Botánico, Madrid | 6 | 1020 | 5385 | 1906 | 2008 | 97%* | 100% |
| S | Phanerogamic Botanical Collections (S) (GBIF-Sweden) | 1 | 21 | 25 | 2006 | 2006 | 100% | 100% |
| SALA | Herbario de la Universidad de Salamanca (GBIF-Spain) | 1 | 477 | 2182 | 1947 | 2007 | 99%* | 100% |
| SANT | Herbario SANT, Universidade de Santiago de Compostela | 1 | 82 | 279 | 1965 | 2004 | 0%* | 100% |
| SEV | Herbario de la Universidad de Sevilla (GBIF-Spain) | 1 | 29 | 53 | 1894 | 1993 | 94%* | 100% |
| SIBA | Centre d'estudis de la neu i de la muntanya d'Andorra (CENMA), Institut d'Estudis Andorrans | 7 | 1319 | 74771 | - | - | 0% | 0% |
| SIVIM | Sistema de Información de la Vegetación Ibérica y Macaronésica (GBIF-Spain) | 1 | 292 | 141610 | 2000 | 2000 | 0%* | 0% |
| SMF | Senckenberg Museum of Zoology | 1 | 27 | 33 | 1918 | 2010 | 42%* | 100% |
| SPN | Inventaire national du Patrimoine naturel (INPN) (Service du Patrimoine naturel, Muséum national d'Histoire naturelle, Paris) | 1 | 729 | 12214 | 1796 | 2007 | 64%* | 0% |
| SysTax | SysTax | 5 | 136 | 339 | - | - | 0% | 0% |
| TAMU | Texas A&M University Insect Collection | 1 | 7 | 12 | 1908 | 2001 | 0%* | 100% |
| TLMF | Tiroler Landesmuseum Ferdinandeum | 1 | 17 | 33 | 1843 | 2000 | 94% | 100% |
| UAM | Universidad Autónoma de Madrid, Madrid (GBIF-Spain) | 1 | 53 | 1076 | 1962 | 2004 | 99%* | 100% |
| UBGEOVEG | Universidad de Barcelona. Grup d'Investigació Geobotànica i Cartografia de la Vegetació (GBIF-Spain) | 1 | 324 | 1940 | 1905 | 2008 | 100% | 5% |
| UGENT | Global Lacustrine Diatoms (BeBIF Provider) | 1 | 42 | 353 | 1992 | 1993 | 100% | 0% |
| UMMZ | University of Michigan Museum of Zoology | 1 | 2 | 2 | 1950 | 1957 | 100% | 100% |
| UNEX | Universidad de Extremadura (GBIF-Spain) | 1 | 191 | 439 | 1968 | 2004 | 99% | 0% |
| UPNA | Herbario de la Universidad Pública de Navarra, Pamplona (GBIF-Spain) | 1 | 86 | 502 | 1970 | 2006 | 99%* | 100% |
| UPS | Botany (UPS) (GBIF-Sweden) | 1 | 3 | 6 | 1964 | 2001 | 100% | 100% |
| UPV/EHU | Colección de Oligoquetos Acuáticos de la UPV/EHU (GBIF-Spain) | 1 | 19 | 120 | 1984 | 1987 | 97% | 100% |
| USNM | US National Museum of Natural History | 1 | 7 | 12 | 1881 | 1977 | 67%* | 100% |
| USNPGS | United States National Plant Germplasm System | 5 | 12 | 29 | - | - | 0% | 0% |
| UVEGENV | Laboratorio de Entomología y Control de Plagas del Instituto Cavanilles, Universidad de Valencia (GBIF-Spain) | 1 | 4 | 14 | 1965 | 1984 | 100% | 100% |
| VAL | Universitat de València (GBIF-Spain) | 1 | 211 | 652 | 1955 | 2005 | 98%* | 100% |
| W | Natural History Museum, Vienna - Herbarium W | 1 | 7 | 10 | 1876 | 1974 | 90%* | 0% |
| WU | Herbarium WU (University of Vienna, Institute for Botany) | 1 | 7 | 15 | 1996 | 2003 | 100% | 0% |
| ZAF-UMU | Dpto. de Zoologia y Antropologia Fisica, Universidad de Murcia (GBIF-Spain) | 1 | 1 | 3 | 1975 | 2004 | 100% | 100% |
| ZFMK | Zoologisches Forschungsinstitut und Museum Alexander Koenig | 6 | 35 | 74 | 1917 | 2007 | 74%* | 100% |
| ZMA | Zoological Museum Amsterdam, University of Amsterdam (NL) (NLBIF) | 1 | 6 | 49 | 1895 | 1973 | 90% | 67% |
| ZMB | ZMB (Senckenberg) | 1 | 1 | 1 | - | - | 0% | 100% |
| ZMH | ZIM Hamburg (Senckenberg) | 1 | 1 | 1 | 1978 | 1978 | 0%* | 100% |
| ZMUC | Zoologisk Museum Kobenhavn (DanBIF) | 1 | 2 | 34 | 1895 | 2001 | 88% | 67% |

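Table 2. Taxonomical composition of the Pyrenean dataset (“Taxa” refers to distinct names at the species level once duplicates, misspellings and alternate spellings have been removed, but still may contain true synonyms).

| Phylum | Class | Orders | Families | Genera | Taxa | PBR |
|---|---|---|---|---|---|---|
| ANIMALS | | | | | | |
| Annelida | Clitellata | 4 | 7 | 19 | 36 | 121 |
| Annelida | Polychaeta | 1 | 1 | 1 | 1 | 21 |
| Arthropoda | Arachnida | 10 | 43 | 86 | 140 | 4432 |
| Arthropoda | Branchiopoda | 1 | 1 | 1 | 1 | 3 |
| Arthropoda | Chilopoda | 3 | 6 | 11 | 38 | 560 |
| Arthropoda | Diplopoda | 2 | 2 | 2 | 2 | 2 |
| Arthropoda | Insecta | 20 | 176 | 977 | 1737 | 9139 |
| Arthropoda | Malacostraca | 2 | 9 | 10 | 11 | 12 |
| Arthropoda | Ostracoda | 2 | 1 | 5 | 8 | 12 |
| Arthropoda | Trilobita | 7 | 12 | 59 | 105 | 215 |
| Brachiopoda | Rhynchonellata | 6 | 19 | 38 | 48 | 93 |
| Brachiopoda | Strophomenata | 3 | 12 | 15 | 17 | 24 |
| Bryozoa | Gymnolaemata | 2 | 2 | 2 | 2 | 8 |
| Chordata | Actinopterygii | 15 | 31 | 50 | 79 | 1430 |
| Chordata | Amphibia | 2 | 6 | 14 | 28 | 2898 |
| Chordata | Aves | 16 | 58 | 148 | 281 | 43665 |
| Chordata | Cephalaspidomorphi | 1 | 1 | 1 | 1 | 1 |
| Chordata | Chondrichthyes | 6 | 10 | 11 | 12 | 14 |
| Chordata | Conodonta | 1 | 3 | 7 | 43 | 52 |
| Chordata | Mammalia | 12 | 44 | 144 | 283 | 23536 |
| Chordata | Reptilia | 6 | 14 | 30 | 55 | 3174 |
| Cnidaria | Anthozoa | 7 | 45 | 81 | 114 | 169 |
| Echinodermata | Asteroidea | 2 | 5 | 9 | 11 | 11 |
| Echinodermata | Crinoidea | 3 | 1 | 3 | 3 | 7 |
| Echinodermata | Echinoidea | 6 | 3 | 10 | 15 | 27 |
| Hemichordata | Graptolithina | 2 | 1 | 3 | 4 | 4 |
| Mollusca | Bivalvia | 18 | 46 | 100 | 213 | 674 |
| Mollusca | Cephalopoda | 4 | 25 | 31 | 37 | 39 |
| Mollusca | Gastropoda | 11 | 62 | 116 | 169 | 1479 |
| Mollusca | Scaphopoda | 1 | 1 | 1 | 1 | 15 |
| Mollusca | Tentaculita | 2 | 2 | 3 | 14 | 23 |
| Nematoda | Adenophorea | 1 | 1 | 1 | 1 | 1 |
| Porifera | Demospongea | 6 | 7 | 8 | 8 | 13 |
| Porifera | Hexactinellida | 4 | 9 | 11 | 11 | 31 |
| Porifera | Stromatoporoidea | 1 | 1 | 1 | 1 | 1 |
| undefined | | 8 | 7 | 17 | 18 | 25 |
| FUNGI | | | | | | |
| Ascomycota | Arthoniomycetes | 1 | 4 | 8 | 16 | 19 |
| Ascomycota | Ascomycetes | 2 | 2 | 3 | 3 | 8 |
| Ascomycota | Dothideomycetes | 6 | 11 | 14 | 19 | 36 |
| Ascomycota | Eurotiomycetes | 4 | 5 | 13 | 19 | 50 |
| Ascomycota | Laboulbeniomycetes | 1 | 2 | 39 | 101 | 642 |
| Ascomycota | Lecanoromycetes | 13 | 42 | 105 | 289 | 835 |
| Ascomycota | Leotiomycetes | 3 | 11 | 23 | 30 | 40 |
| Ascomycota | Lichinomycetes | 1 | 1 | 2 | 2 | 3 |
| Ascomycota | Orbiliomycetes | 1 | 1 | 1 | 1 | 1 |
| Ascomycota | Pezizomycetes | 1 | 9 | 19 | 35 | 51 |
| Ascomycota | Sordariomycetes | 6 | 10 | 19 | 24 | 37 |
| Basidiomycota | Agaricomycetes | 18 | 90 | 241 | 857 | 2011 |
| Basidiomycota | Dacrymycetes | 1 | 2 | 3 | 8 | 16 |
| Basidiomycota | Exobasidiomycetes | 1 | 1 | 1 | 1 | 1 |
| Basidiomycota | Pucciniomycetes | 1 | 2 | 4 | 6 | 6 |
| Basidiomycota | Tremellomycetes | 1 | 3 | 10 | 17 | 35 |
| Basidiomycota | Ustilaginomycetes | 2 | 4 | 5 | 27 | 149 |
| Myxomycota | Myxomycetes | 5 | 16 | 36 | 116 | 504 |
| Myxomycota | Protosteliomycetes | 1 | 1 | 1 | 1 | 15 |
| undefined | | 4 | 8 | 14 | 14 | 14 |
| PLANTS | | | | | | |
| Bacillariophyta | Bacillariophyceae | 9 | 22 | 34 | 34 | 293 |
| Bacillariophyta | Fragilariophyceae | 1 | 1 | 1 | 1 | 7 |
| Bryophyta | Andreaeopsida | 1 | 1 | 1 | 3 | 11 |
| Bryophyta | Bryopsida | 15 | 46 | 129 | 330 | 1153 |
| Bryophyta | Sphagnopsida | 1 | 1 | 1 | 24 | 92 |
| Chlorophyta | Charophyceae | 1 | 1 | 2 | 2 | 3 |
| Chlorophyta | Ulvophyceae | 2 | 2 | 2 | 3 | 5 |
| Equisetophyta | Equisetopsida | 1 | 2 | 3 | 22 | 1164 |
| Gnetophyta | Gnetopsida | 1 | 1 | 1 | 5 | 35 |
| Lycopodiophyta | Lycopodiopsida | 3 | 3 | 7 | 26 | 448 |
| Magnoliophyta | Liliopsida | 8 | 33 | 248 | 1509 | 89152 |
| Magnoliophyta | Magnoliopsida | 50 | 133 | 901 | 6491 | 215226 |
| Marchantiophyta | Jungermanniopsida | 5 | 21 | 26 | 42 | 104 |
| Marchantiophyta | Marchantiopsida | 2 | 3 | 5 | 9 | 17 |
| Pinophyta | Pinales | 1 | 1 | 1 | 1 | 3 |
| Pinophyta | Pinopsida | 1 | 4 | 15 | 53 | 2785 |
| Pteridophyta | Filicopsida | 3 | 18 | 42 | 177 | 10269 |
| Pteridophyta | Pteridopsida | 1 | 1 | 1 | 3 | 3 |
| Rhodophyta | Florideophyceae | 2 | 2 | 3 | 3 | 4 |
| Rhodophyta | Rhodophyceae | 1 | 1 | 1 | 1 | 1 |
| Ochrophyta | Coscinodiscophyceae | 4 | 4 | 8 | 8 | 54 |
| Rhizopoda | Granuloreticulosea | 1 | 4 | 7 | 11 | 11 |
| Rhizopoda | Rhizopodea | 1 | 1 | 4 | 4 | 4 |
| undefined | | 3 | 5 | 5 | 8 | 9 |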


FIGURES AND FIGURE LEGENDS

Figure 1. Proportion of records provided by publishers for the Pyrenean range. Orange: observation-based records; blue: specimen-based records; gray: unknown or not declared. See Table 1 for publisher codes.

[Pie chart. Labeled shares: SIVIM 33.922%; SIBA 17.911%; FB 16.387%; INB 9.127%; Other publishers (130 datasets) 7.153%; IDBD-GN 6.356%; MZNA 3.715%; SPN 2.926%; RJB 1.290%; ARAN 1.213%.]

Figure 2. Number of genera contributed by provider (bars) and total number of records (line) of publishers collectively accounting for 99% of PBR. Providers ordered by size of such contribution. Greens: plants; blues: animals; yellows: ferns, mosses and liverworts; gray: algae and protists; brown: fungi. See Table 1 for abbreviations of providers.

[Bar and line chart. Axes: number of genera (0 to 1500) and number of records (0 to 150,000). Bar legend: Dicots, Monocots, Gymnosperms, Algae, Ferns, Mosses, Fungi, Mammals, Birds, Herps, Fish, Arthropods, Invertebrates, Protists. Line: primary data records.]

Figure 3. Distribution of records in the range. Records have been grouped by coordinates to the nearest hundredth of a degree (approximately 1.8 km in latitude and 1 km in longitude). Color scale indicates log2 of the number of records sharing the same coordinates.

[Map. Axes span latitude 41.0 to 44.0 and longitude -2.5 to 3.5 (decimal degrees); color scale from 0 to 15.]
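The grouping described in the Figure 3 legend can be illustrated with a minimal sketch. It assumes records are available as (latitude, longitude) pairs in decimal degrees; the function names and the sample coordinates are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def bin_records(coords, decimals=2):
    """Count records per cell after rounding coordinates to `decimals` places.

    With decimals=2, each cell corresponds to the nearest hundredth of a degree.
    """
    return Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in coords
    )


def log2_density(cells):
    """Return log2 of the record count for every occupied cell."""
    return {cell: math.log2(n) for cell, n in cells.items()}


# Hypothetical sample: three records fall in one cell, one in another.
records = [
    (42.7012, -0.5231),
    (42.7049, -0.5188),
    (42.7000, -0.5200),
    (42.95, 1.25),
]
cells = bin_records(records)
density = log2_density(cells)
```

Cells holding a single record map to 0 on the log2 scale, so only occupied cells appear in the result; empty cells would have to be handled separately when drawing a map.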


Figure 4. Growth of primary records according to their recording date and broken down by basis of record: specimen-based (vouchers exist), observation-based (vouchers normally unavailable), and unknown basis. The recording date is that for the occurrence, not the digitization date or the publication date.

[Chart. Vertical axis: cumulative dated PBR (0 to 60,000); horizontal axis: date (1800 to 2010). Series: Unknown, Specimen, Observation.]

Figure 5. Cumulative number of genera in occurrence records having complete dates, broken down by taxonomical group: Higher plants (greens); mosses, liverworts and ferns (yellows); fungi (brown); vertebrate animals (blues); invertebrate animals (reds); protists and algae (gray). The recording date is that for the occurrence, not the digitization date or the publication date.

[Plot: vertical axis "Cumulative genera" (0 to 2500); horizontal axis "Date" (1800 to 2010). Legend: Dicots, Monocots, Gymnosperms, Algae, Ferns, Mosses, Fungi, Mammals, Birds, Herps, Fish, Arthropods, Invertebrates, Protists.]
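A cumulative genera curve of the kind shown in Figure 5 can be derived by counting each genus once, in the year of its earliest dated occurrence. The sketch below assumes that reading of the caption; the function name and the sample records are hypothetical.

```python
from itertools import accumulate

def cumulative_genera(records):
    """records: iterable of (year, genus). Returns [(year, cumulative distinct genera)]."""
    first_year = {}
    for year, genus in records:
        # Keep only the earliest year in which each genus was recorded.
        if genus not in first_year or year < first_year[genus]:
            first_year[genus] = year
    new_per_year = {}
    for year in first_year.values():
        new_per_year[year] = new_per_year.get(year, 0) + 1
    years = sorted(new_per_year)
    totals = accumulate(new_per_year[y] for y in years)
    return list(zip(years, totals))

# Hypothetical dated occurrence records.
records = [(1905, "Quercus"), (1905, "Fagus"), (1950, "Quercus"),
           (1982, "Rupicapra"), (1982, "Fagus"), (1990, "Gentiana")]
curve = cumulative_genera(records)
```

Repeated records of a genus already seen (Quercus in 1950, Fagus in 1982) do not raise the curve; only first appearances do.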


Figure 6: Seasonal occurrence cycle. Number of occurrences per day of year, arranged clockwise. Note binary log scale for number of records (radii). Year starts at the top. Records for “January 1” (4000+) exceed the range of the plot (>216). Radial lines represent the first day of each month.
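The quantity plotted in Figure 6 can be approximated by counting occurrences per day of year and taking the binary logarithm of each count as the radius. This is a rough sketch under that assumption; the helper names and sample dates are hypothetical, and leap years shift the day number of dates after February.

```python
import math
from collections import Counter
from datetime import date

def day_of_year_counts(dates):
    """Count occurrences by day of year (1 = January 1)."""
    return Counter(d.timetuple().tm_yday for d in dates)

def log2_radii(counts):
    """Radius for each day that has at least one record, on a binary log scale."""
    return {day: math.log2(n) for day, n in counts.items()}

# Hypothetical occurrence dates, all in non-leap years.
dates = [date(1998, 1, 1), date(2001, 1, 1), date(2003, 1, 1), date(2005, 1, 1),
         date(1999, 7, 14), date(2001, 7, 14)]
counts = day_of_year_counts(dates)
radii = log2_radii(counts)
```

A concentration of records on a single date, such as the one the caption notes for January 1, would appear as an unusually large count for day 1.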


Figure 7. Treemap of taxa in the current digital data from the Pyrenean range. The area of the tiles is proportional to the number of taxa within each group, and the color density is proportional to the logarithm of the number of primary records (occurrences): more saturated colors indicate more records for a given group (see color scale). Tiles represented down to Order.


Figure 8: Completeness in date recording for the PBRs in the Pyrenean range. Completeness level is represented by the percent of records that have data on dates, broken down into 10% intervals from 0% (no records have any time data) to 100% (all records have time data). Bars represent the number of publishers sharing a certain level of date-completed records (right, blues), or the number of records contributed by publishers placed in each completeness level (left, reds). Dark colors indicate completeness for full dates (100% = all records from a publisher are complete, having year, month and day of occurrence), while light colors refer to publishers not supplying full dates but declaring the year. Note that publishers offering only the year may be referring to an “inventory year” rather than the actual occurrence (see text).

[Bar chart: vertical axis "Percent of records dated", in classes 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%, 70%-80%, 80%-90%, 90%-100%; left axis "Number of PBR represented (x1000)" (0 to 350); right axis "Number of publishers" (0 to 100).]
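The two tallies shown in Figure 8 (publishers per completeness level, and records contributed by the publishers at each level) can be sketched as below. The class boundaries are assumptions: publishers with no dated records get their own "0%" class, and fully dated publishers are folded into the top "90%-100%" class. Publisher names and counts are hypothetical.

```python
from collections import Counter

def completeness_class(n_dated, n_total):
    """Return the 10% class label for a publisher's share of dated records."""
    pct = 100.0 * n_dated / n_total
    if pct == 0:
        return "0%"
    low = min(int(pct // 10) * 10, 90)  # assumption: 100% joins the top class
    return f"{low}%-{low + 10}%"

def summarize(publishers):
    """publishers: {name: (n_dated, n_total)}. Returns two Counters keyed by class."""
    n_publishers, n_records = Counter(), Counter()
    for n_dated, n_total in publishers.values():
        label = completeness_class(n_dated, n_total)
        n_publishers[label] += 1       # publishers sharing this level
        n_records[label] += n_total    # records those publishers contribute
    return n_publishers, n_records

# Hypothetical publishers: (records with dates, total records).
publishers = {"A": (50, 50), "B": (45, 50), "C": (0, 20), "D": (3, 4)}
n_publishers, n_records = summarize(publishers)
```

Running the same tally twice, once counting records with full dates and once counting records with at least a year, would give the dark and light series respectively.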
