Words, words

Words, words. They’re all we have to go on: Image finding without the pictures.

Abstract

This paper describes a Recommendation System to help photographic historians make connections between widely dispersed and previously unrelated records of photographs held in different heritage institutions, and demonstrates how it has been used to rediscover images of exhibits from the Royal Photographic Society annual exhibitions of over 120 years ago. While the surviving exhibition catalogues are a rich information source, they are largely devoid of illustrations of the exhibits. The FuzzyPhoto project has developed techniques for analyzing a corpus of more than 1.4 million historical photographic records across different galleries, libraries, archives and museums in order to identify similarities between them, and has used the results to offer visitors to those sites links to potentially related items at other sites, thus creating a web of interconnections between them. The paper describes techniques used for data acquisition, cleaning, integration, semantic-based data mining, approximate reasoning and fuzzy algorithm-based similarity metrics. It compares the approach described here with manual searches and with more sophisticated computational methods such as linked data, and concludes that FuzzyPhoto is an effective method for dealing with the realities of messy collection records, one that could be extended to other types of archival objects such as paintings, maps and craft objects where image matching is required.

Page 1 of 28

Context

In the last decade there has been tremendous effort on the part of galleries, libraries, archives and museums (GLAMs) to digitize their records and make them available to users online (IMLS, 2006; Wilson, 2011; Szekely, 2013). The increased availability of online collections and searchable metadata creates enormous potential for using information about objects from one data set to enrich records in others, thus enhancing their value and revealing relationships not previously apparent. In the case of photographic history, different institutions sometimes hold similar or even the same objects because photographs were easily reproduced and copied photographically. Figure 1 shows three examples of what appear to be the same image. The left hand column shows a profile portrait of Edwardian writer George Meredith, as exhibited by Frederick Hollyer at the 1909 exhibition of the Royal Photographic Society. The middle column shows a strikingly similar view held by the Library of Congress that turns out to be an etching made by Frederick's brother Samuel around 1900, apparently based on the photograph of Meredith but intriguingly pre-dating it. In the third column an earlier photograph from the Victoria and Albert Museum collection suggests that the other two are details derived from this earlier, more complete, study.


Fig. 1 Comparison of portraits of George Meredith.

Searching for connections across diverse institutional collections like this is possible, as this example shows. However, drawing together such information is an increasingly time-consuming and complex task for researchers as the volume of online information increases (Batjargal et al. 2013). One reason for this is that much of the data in GLAMs records lies beyond the reach of Web search engines because it resides in databases and is presented dynamically to the Web only in response to a particular query. Another is that collection metadata schemas give only limited access to the information that resides in collections, and comparisons between collections are difficult because of incompatibilities between different institutional record management systems and idiosyncrasies in the ways in which they are implemented. In particular, GLAMs data tends to be messy, incomplete, imprecise and even incorrect (Dearnley, 2010; Eklund et al., 2010; Henry and Brown, 2012; Kamura et al., 2011). In future, cross-collection searching is likely to be facilitated by APIs that can interrogate the target collections and by linked data techniques to identify similar items. However, the messiness of much GLAMs data prohibits such computationally efficient methods in many cases, and the skills and time required to bring records up to scratch are likely to remain beyond the resources of many institutions faced with ever more objects to curate and ever tighter budgets.

Notwithstanding these difficulties, there is growing consensus among museum professionals and users about the importance of data integration between different collections to allow cross searching and data clustering that extends beyond the limited powers of basic keyword searching (Batjargal et al., 2013; de Polo, 2011; Henry and Brown, 2012; Kamura, 2011; Terras, 2009). This paper argues that rather than wait for a time when GLAMs records are readily machine readable, we need to develop techniques for coping with messy data.

Of course, in the case of historical photographs and other visual material, it may be possible to identify connections between records by matching images. Google Images1 and TinEye2 are both popular online visual search tools. A TinEye reverse image search for the left hand George Meredith portrait in figure 1 produced two hits, both of which were exact matches. However, historical records, even of photographs, do not always include images. The catalogues of the exhibitions of the Royal Photographic Society (RPS) are a case in point. These exhibition records are a highly significant research resource, offering a unique insight into the evolution of aesthetic trends and photographic technologies, the response of a burgeoning group of photographic manufacturers to business opportunities, and the activities and fortunes of individuals concerned with the technical, artistic and commercial development of photography during its early formative years. The RPS's exhibitions in particular attracted a wide constituency of photographers from Britain, Europe and America, and many individuals launched their photographic career through them. These historical exhibition records have enormous potential to enrich and contextualise records of photographs in museum collections, if the latter could be linked to the exhibition records.
Equally, matching surviving pictures in GLAMs to the exhibition catalogue records would greatly enhance the value of the latter to researchers, particularly since, although the exhibition catalogues contain information about many different aspects of the exhibits, they contain very few reproductions of the photographic exhibits themselves. The RPS exhibitions between 1870 and 1915 contained 34,197 records of exhibited photographs, but in the catalogues for this period there are only 1,040 illustrations. At the time of the exhibitions, mechanical reproduction of photographic images was technically difficult, photographic reproduction was prohibitively expensive for such an ephemeral object as a catalogue, and in any case there was no need to reproduce the photographs, because they were already on view in the exhibitions themselves. In fact, even when the exhibits were illustrated in the catalogues, many of the illustrations were artists’ sketches of the originals, lacking in detail, as shown in figure 2. Neither these sketches nor the exclusively textual exhibit records can be used to conduct reverse image searches. “Words, words. They’re all we have to go on”.3

Fig. 2 Example of an illustration of a photograph from the 1895 RPS exhibition catalogue.

The remainder of this paper describes the successful approach taken by the FuzzyPhoto project to locating “missing pictures” from the RPS exhibition catalogues by identifying matches between the exhibit records and records of photographs in major international collections using the record metadata alone. The FuzzyPhoto project is a multi-disciplinary collaboration between the Photographic History Research Centre (PHRC) and the Centre for Computational Intelligence (CCI) at De Montfort University, Leicester, UK. The aim of this two-year project was to help researchers to match photographs held in different archives with the historical photographic exhibition records of the Royal Photographic Society. The latter had already been digitized and presented online as a searchable database (ERPS: Exhibitions of the Royal Photographic Society 1870-19154), but visitors to the site frequently commented that while the site was immensely useful, the lack of pictures of exhibits was frustrating.

The Data

Partner institutions5 known to hold collections overlapping with the content of ERPS were identified and these, combined with the ERPS records and those of an earlier De Montfort University photographic exhibition database, PEIB6, amounted to a corpus of circa 1.4 million items. While some of these data sets comprised well formed, machine readable records, overall there were a number of issues with the data:

Lack of structure – some contributor records combined together a variety of different data types, for example date, person name, title, description etc. in a single field or cell. While such information is easy enough for a person to read, computers find it easier to read if it is separated out into separate, labeled, categories.



Different data structures – because institutions use different records management systems that cluster and describe the data differently.



Different metadata schemas – different labels were used for essentially the same descriptor by different institutions, eg. “creator”, “artist/maker”, “photographer”, “exhibitor”, “auteurs”.



Different file types – xml, csv, sql, json



Incomplete records – empty fields.



Syntax independent formats – eg. name order such as Henry Thomas Malby; H.T. Malby; Henry T. Malby; Malby, Henry Thomas, and so on. This is made more difficult because formats are not used consistently within individual collections.



Junk data – eg. mixed data fields such as person name and date of birth/death or date and place of birth in the same field, or date information expressed in text form: “circa pre Great War”. Again, while such information is intelligible to human beings, computers find it challenging when numerical and text data are combined or different types of text such as person and place names are bracketed.



Short text fields – most of the record fields contain only a few words such as a person name, or an exhibit title. In the case of titles, the average length was only 8.1 words and after unhelpful words such as “to”, “at”, “for”, “the”, “in” were removed, this average was reduced to just 5.4.



Large volume of information – collectively the partners hold a total of 1.4 million records of photographs which, although not a lot by scientific “big data” standards, is a substantial amount for a humanities project and represents a challenge for conventional text mining techniques.

Ideally the data would have been analyzed by first constructing an API to interrogate the target collections and then employing linked data techniques to identify similar items. However, the messiness of the data prohibited such computationally efficient methods. APIs cannot be employed when the data lack structure or are formatted inconsistently. Instead it was decided to aggregate copies of all the records into a common data warehouse and to clean and standardize them to the point where they could be analyzed for similarities.

Once copies of the records had been obtained, they had to be mapped to a common schema. CIDOC-CRM was originally considered but, since the partner metadata was so diverse, it would have been necessary to edit all the records extensively to comply with CRM. It was decided instead to use the ICOM Lightweight Information Describing Objects schema (LIDO)7. LIDO is an XML harvesting schema developed specifically for exposing, connecting and aggregating information about museum objects on the Web, and is thus ideally suited to the task of standardizing the metadata provided by each of the contributors.

A range of approaches was employed to accommodate the diversity of the different data sets. An XML data store (BaseX) and XQuery queries were used to convert XML data to MySQL tables, while the import facilities of MySQL were sufficient to process CSV tabular data. After the data were imported, “cleaning up”, such as eliminating duplicates or spelling variants, was necessary. This was done chiefly with SQL queries. Some operations, however, were too complex to be handled in SQL, so Java and Python scripts were used externally. Table 1 shows the results of the data cleaning process.

Table 1 FuzzyPhoto record numbers before and after data cleaning.

Partner                         Records before cleaning   Records after cleaning
Birmingham City Library                           5,513                    5,455
British Library                                  28,974                   28,925
CultureGrid                                     172,148                  171,840
ERPS                                             34,197                   34,197
Metropolitan Museum                               9,526                    9,526
PEIB                                             20,453                   20,453
Musée d'Orsay                                    46,229                   46,228
National Media Museum                             8,380                    8,380
National Museums Scotland                        14,915                   14,883
St Andrews University Library                    18,620                   18,604
Library of Congress                             875,267                  875,267
Brooklyn Museum                                   2,352                    2,352
National Archives                                73,187                   71,958
Victoria and Albert Museum                      101,538                   98,598
Total                                                                  1,406,666

The endpoint of the data ingestion stage was a MySQL database of records in LIDO format. Our aim was to compare these records in order to identify similarities.
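As a rough illustration of the schema mapping and duplicate removal described above, the following Python sketch renames partner-specific labels to one common field and drops repeated records. It is not the project's code: the project mapped records to LIDO and worked mainly in SQL, whereas the target field names, the `FIELD_ALIASES` table and both function names here are assumptions made for the example. Only the source labels (“creator”, “artist/maker”, “photographer”, “exhibitor”, “auteurs”) come from the text.

```python
# Hypothetical mapping from partner-specific labels to a common field name.
# The labels on the left are quoted in the paper; the targets are assumptions.
FIELD_ALIASES = {
    "creator": "person",
    "artist/maker": "person",
    "photographer": "person",
    "exhibitor": "person",
    "auteurs": "person",
    "title": "title",
    "date": "date",
    "process": "process",
}


def normalise_record(raw):
    """Return a copy of raw with known labels renamed and empty values dropped."""
    clean = {}
    for label, value in raw.items():
        key = FIELD_ALIASES.get(label.strip().lower())
        if key is None or value is None:
            continue  # unknown label or missing value
        text = " ".join(str(value).split())  # collapse stray whitespace
        if text and key not in clean:
            clean[key] = text
    return clean


def drop_duplicates(records):
    """Keep the first occurrence of each distinct normalised record."""
    seen = set()
    unique = []
    for rec in records:
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique
```

Two records that differ only in their labels and spacing collapse to the same normalised form, so only one of them survives `drop_duplicates`.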

Data Mining

Although most of the data sets contained many fields, there was only a limited degree of overlap between the fields used to describe the RPS exhibits and objects in the partner collections. In addition to person name, date, photographic process and object title, the RPS exhibition records included fields peculiar to the nature of the exhibitions, such as whether the exhibits were medal winners, whether they were photographs or equipment, which section of the exhibition they were shown in, their sale price and whether the exhibit was loaned by some other person. To ensure direct comparability between RPS records and the other collections, only fields common to all the collections were selected for analysis, namely person name, date, photographic process and object title. Of course, additional fields could have been used to match non-RPS record pairs but, since the primary aim was to find the “missing” photographs from the RPS exhibitions, this line of enquiry was not pursued.

Person Name

Name comparison is a well-established problem in many application areas, and a large number of algorithms exist to compare names which can handle typographical errors, alternative spellings etc. Unfortunately for this project, name data was stored in syntax independent formats, that is to say, the format of name information was not recorded consistently even within some collections, let alone between different collections. Names were variously recorded as ‘firstname/lastname’, ‘lastname/firstname/initials’ or ‘lastname/initials’, and frequently contained birth/death dates and titles. Tests with existing algorithms able to handle names held in an unknown format (e.g. Named Entity Similarity, NESim8) produced disappointing results, and this field is therefore handled by a customised person name similarity algorithm designed to handle unknown ordering of name components in addition to typographical errors, alternative spellings, initials and junk data. Our method is derived from the Jaro-Winkler string comparison technique, which scores two strings according to the characters they share and the transpositions needed to align them, with extra weight given to a common prefix (Winkler, 1990). This is combined with a heuristic best fit approach to match up individual name elements across fields despite the absence of a standard name format.

While the precision rate of the algorithm is lower than desired, the recall rate is excellent and the overall algorithm has produced dramatically better results when compared to NESim, as shown in figure 3. The testing data set was a modified version of the existing Joint Research Centre (JRC)-Names set, which contains 268,521 named entities plus 573,141 variations. Pairs of variations were randomly selected to generate 1 x 10^7 test cases split evenly between positive and negative cases. Figure 3 shows the Receiver Operating Characteristic (ROC) curves produced by the two approaches, in which the person metric curve clearly outperforms NESim, almost completely maximizing the true positive match rate and minimizing the false positive match rate.

Fig. 3 Comparison of FuzzyPhoto name similarity metric with NESim.
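The combination of Jaro-Winkler scoring with a best-fit pairing of name elements can be sketched as follows. This is a minimal illustration under stated assumptions, not the FuzzyPhoto algorithm itself: the greedy pairing strategy, the `NOISE` word list, the fixed score of 0.9 for an initial that matches the first letter of a full name, and all function names are choices made for the example.

```python
import re

NOISE = {"mr", "mrs", "miss", "dr", "sir", "photographer"}  # assumed junk words


def jaro(s, t):
    """Standard Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_chars = [c for c, hit in zip(s, s_hit) if hit]
    t_chars = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def name_tokens(raw):
    """Lower-case alphabetic tokens; digits, punctuation and noise words are dropped."""
    return [w for w in re.findall(r"[^\W\d_]+", raw.lower()) if w not in NOISE]


def token_score(a, b):
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:  # an initial against a full name
        return 0.9 if a[0] == b[0] else 0.0  # 0.9 is an assumed value
    return jaro_winkler(a, b)


def name_similarity(name_a, name_b):
    """Greedy best-fit pairing of name tokens, ignoring their order."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    if not a or not b:
        return 0.0
    scored = sorted(((token_score(x, y), i, j)
                     for i, x in enumerate(a) for j, y in enumerate(b)),
                    reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for score, i, j in scored:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += score
    return total / max(len(a), len(b))
```

Under these assumptions `name_similarity("Thompson, Stephen", "Stephen Thompson")` is 1.0 regardless of ordering, and “Rejlander, Oscar Gustav (1813-1875)” still scores highly against “O. G. Rejlander” because the dates are discarded and the initials pair with the forenames.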

Date

When comparing dates, the key factor is the amount of elapsed time between them. The greater the amount of time between them, the less similar they are. Unfortunately date information, like person names, is not recorded consistently across the collections. Some dates are ‘day/month/year’, others are ‘year/month/day’, for example9. Consequently the particular format used for each individual record is unknown. Moreover, many of the dates are imprecise, describing a span of time rather than a specific year (eg. "1890s", "the 19th century"). However, the same principle applies: greater differences between time spans indicate less similarity between fields. For example "19th century" and "1900" are less similar than "1900s" and "1900" because the latter represent a span of only ten years while the former could include dates up to 100 years apart.

Syntax independence is handled by a rule based system built around the Python dateutil library to identify well formatted date information (eg. 28/08/1871) and custom regexes to identify less clearly structured information (eg. 19th century), plus a rule based system to handle ‘circa’ modifiers. The task is significantly simplified by extracting only year information from the date fields and discarding day and month information. The date span problem is addressed using a custom developed date similarity algorithm that calculates the differences between different date spans and which treats, for example, ‘1900 and 1910’ as being more similar than ‘19th century and 19th century’, even though the latter two are identical.
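One way to reproduce the behaviour described above is to reduce every date field to a span of years and to score two spans by the largest gap that could separate a year in one from a year in the other. The sketch below is an assumption-laden illustration rather than the project's algorithm: it uses only the standard `re` module instead of dateutil, treats the 19th century as 1800-1899, widens ‘circa’ dates by five years either side, and scales gaps against a 100-year horizon.

```python
import re


def year_span(text):
    """Return (first_year, last_year) for a free-text date, or None if no year is found."""
    s = text.lower()
    century = re.search(r"\b(\d{1,2})(?:st|nd|rd|th)\s+century\b", s)
    if century:
        start = (int(century.group(1)) - 1) * 100  # assumption: 19th century = 1800-1899
        return (start, start + 99)
    decade = re.search(r"\b(\d{3})0s\b", s)
    if decade:
        start = int(decade.group(1)) * 10  # "1890s" -> 1890-1899
        return (start, start + 9)
    year = re.search(r"\b(\d{4})\b", s)
    if year:
        # Assumed rule: a 'circa' modifier widens the year by five years each way.
        pad = 5 if re.search(r"\b(?:circa|ca|c)\b", s) else 0
        return (int(year.group(1)) - pad, int(year.group(1)) + pad)
    return None


def date_similarity(a, b, horizon=100):
    """1.0 for two identical single years, falling to 0.0 at a worst-case gap of `horizon` years."""
    span_a, span_b = year_span(a), year_span(b)
    if span_a is None or span_b is None:
        return 0.0
    # Largest possible distance between a year in one span and a year in the other.
    worst_gap = max(span_a[1] - span_b[0], span_b[1] - span_a[0])
    return max(0.0, 1.0 - worst_gap / horizon)
```

With this measure ‘1900’ and ‘1910’ score 0.9, whereas ‘19th century’ compared with itself scores only 0.01, because two photographs both dated to the 19th century could still have been made 99 years apart.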

Process

Photographic process similarity is difficult to estimate for several reasons. Misidentification of photographic processes is common due to the enormous proliferation of different methods and minor variants in the early years of photography10, plus the difficulty of distinguishing between many processes by the naked eye. Very often a definitive attribution can be reached only with the aid of scanning electron microscope (SEM) or XRF equipment11. Therefore the process similarity metric measures similarity not just according to the processes actually named in the records but also according to how easily those processes could have been mistaken for each other. We matched the stated process with a list of preset keywords describing various known processes, using Jaro-Winkler (Winkler, 1990) to deal with typographical errors and minor spelling variations.

The photographic processes were organized into a hierarchy in which processes sharing specific traits appear on the same branch. The process metric hierarchy is unusual in comparison to other photographic process hierarchies already in existence in that, instead of organizing the processes according to date or technological progression, it is organized according to the likelihood of misidentification between the processes. That is to say, processes which are most easily confused with one another (eg. daguerreotype and tintype) are positioned close together in the hierarchy, while processes that would be difficult to confuse with one another (eg. daguerreotype and mezzotint) are positioned distantly from each other. Once a field has been matched to a specific locus it can be compared to other fields using a graph traversal algorithm to find the shortest path between the processes. The shorter the distance between the processes, the more similar they are considered to be. A tree traversal approach ensures that some similarity is acknowledged even in cases of process misidentification.
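The shortest-path idea can be sketched with a small parent map. The hierarchy below is a toy invented for illustration: only the relative placement of daguerreotype, tintype and mezzotint follows the text, while the intermediate group names, the extra print processes and the normalisation by the longest possible path are assumptions. Python's standard `difflib` stands in for the Jaro-Winkler keyword matching used by the project.

```python
import difflib

# Toy hierarchy (child -> parent). Group names and most entries are hypothetical.
PARENT = {
    "camera images": "all processes",
    "printmaking": "all processes",
    "plate images": "camera images",
    "paper prints": "camera images",
    "daguerreotype": "plate images",
    "tintype": "plate images",
    "albumen print": "paper prints",
    "platinum print": "paper prints",
    "mezzotint": "printmaking",
}
LEAVES = [n for n in PARENT if n not in set(PARENT.values())]


def locate(stated):
    """Match a stated process to a known keyword, tolerating small misspellings."""
    hits = difflib.get_close_matches(stated.lower().strip(), LEAVES, n=1, cutoff=0.8)
    return hits[0] if hits else None


def path_to_root(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges on the shortest path between two nodes of the tree."""
    depth_b = {node: d for d, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(path_to_root(a)):
        if node in depth_b:  # first common ancestor
            return steps_a + depth_b[node]
    raise ValueError("nodes are not in the same tree")


def process_similarity(a, b):
    """Scale distance by the longest leaf-to-leaf path so the result lies in [0, 1]."""
    longest = 2 * max(len(path_to_root(n)) - 1 for n in PARENT)
    return 1.0 - tree_distance(a, b) / longest
```

In this toy tree a daguerreotype is two edges from a tintype and five from a mezzotint, so a record misattributed to the neighbouring process still retains most of its similarity.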

Object Title

Ontology based comparisons were precluded because the range of subject matter depicted in the photographs is so broad that the required ontology would have had to encompass people, events, scientific experiments, medicine, politics, engineering and allegorical subjects as well as landscapes, portraits and still life. Document classification or corpus linguistics based approaches, which rely on large text corpora, were likewise rejected because, although there were large numbers of titles (>1.4 million), each was typically very short (5.4 words long on average). Instead, title field similarities were calculated using a specially devised approach comprising a custom bag-of-words method based on the cosine similarity metric but using weighted term vectors to incorporate semantic term similarity. This Lightweight Semantic Similarity (LSS) method is similar to more established short text analysis techniques such as Latent Semantic Analysis (LSA) or STASIS (O’Shea et al., 2008) in that it models each title field as a term vector, and the n-dimensional angle between the two vectors (where n is the number of unique terms) represents the similarity between them. However, LSS uses term similarity values extracted from WordNet12 to calculate semantic similarity, modifying the values within the term vectors according to their semantic relations to other elements in the vectors.

Comparing these approaches using the same methods as O’Shea et al. (2008), LSS significantly reduces processing time while resulting in only a minor performance decrease. Working with uncached results, LSS takes approximately 70% of the time required by LSA to process 1900 results, and with cached results this reduces to around 1%, becoming even faster as more records are compared, while LSS is on average 9.8% faster than STASIS. The accompanying reduction in quality is only 3.1% compared to the best performing metric, LSA, and 0.9% compared to STASIS (Croft et al, 2013). Although generating these term similarities was very time consuming (in part due to the processing required but mainly due to the sheer number of terms and the number of comparisons required13), it was possible to reduce this significantly by identifying variations on a single term (eg. “word”, “words”).
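The weighted-vector idea can be illustrated with a small cosine similarity in which each vector entry is the strongest semantic link between a vocabulary term and any word in the title. This is a simplified sketch, not the published LSS method: the `TERM_SIM` table is a hypothetical stand-in for WordNet-derived values, the 0.8 score for daffodils and narcissus is invented, and the stop-word list contains only the examples quoted earlier in the paper.

```python
import math
import re

STOP_WORDS = {"to", "at", "for", "the", "in"}  # examples listed in the paper

# Hypothetical term-to-term similarities standing in for WordNet-derived values.
TERM_SIM = {frozenset(("daffodils", "narcissus")): 0.8}


def term_similarity(a, b):
    if a == b:
        return 1.0
    return TERM_SIM.get(frozenset((a, b)), 0.0)


def title_terms(title):
    """Lower-case words with stop words removed."""
    return [w for w in re.findall(r"[^\W\d_]+", title.lower()) if w not in STOP_WORDS]


def weighted_vector(terms, vocabulary):
    """Each entry is the strongest semantic link between a vocabulary term and the title."""
    return [max(term_similarity(v, t) for t in terms) for v in vocabulary]


def title_similarity(title_a, title_b):
    a, b = title_terms(title_a), title_terms(title_b)
    if not a or not b:
        return 0.0
    vocabulary = sorted(set(a) | set(b))
    va, vb = weighted_vector(a, vocabulary), weighted_vector(b, vocabulary)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0
```

A plain bag-of-words cosine would give “Fair Daffodils” and “Pheasant-eye Narcissus” a similarity of zero because they share no words; with the assumed semantic link the two titles receive a moderate, non-zero score.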

Combined Similarity Metric

In order to combine the individual fields into an overall record similarity metric and to group the co-referent records together, a fuzzy rule-based approach was adopted. Crisp rule based approaches often function poorly when faced with imprecise and uncertain information, whereas fuzzy logic systems have the ability to model partial set membership, giving them greater resilience to imprecise and noisy inputs. Fuzzy logic has been applied previously to resource discovery challenges (Feng, 2012; Lai et al., 2011; Li et al., 2009). However, all these approaches were based on analysis of large volumes of text and were thus not applicable in our context. The full set of fuzzy rules we use is:

If bad_title AND bad_person THEN terrible_overall match
If bad_title OR bad_person THEN bad_overall match
If good_title OR good_person THEN good_overall match
If good_process AND good_date THEN good_overall match
If good_title AND good_person THEN excellent_overall match
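A rule base of this kind can be evaluated in a few lines of Python. The sketch below is only an illustration of how the five rules above might combine: the membership functions (good(x) = x, bad(x) = 1 - x), the use of min and max for AND and OR, the numeric score attached to each output level and the weighted-average defuzzification are all assumptions, and the project's own defuzzification method differs.

```python
def good(x):
    return x  # assumed membership: similarity itself


def bad(x):
    return 1.0 - x  # assumed membership: complement of similarity


# Assumed representative score for each output level.
LEVEL_SCORE = {"terrible": 0.0, "bad": 0.25, "good": 0.75, "excellent": 1.0}


def overall_match(title, person, process, date):
    """Combine four field similarities in [0, 1] into one overall score."""
    fired = [
        ("terrible", min(bad(title), bad(person))),     # bad_title AND bad_person
        ("bad", max(bad(title), bad(person))),          # bad_title OR bad_person
        ("good", max(good(title), good(person))),       # good_title OR good_person
        ("good", min(good(process), good(date))),       # good_process AND good_date
        ("excellent", min(good(title), good(person))),  # good_title AND good_person
    ]
    # Aggregate rules that share an output level by taking the strongest one.
    strength = {level: 0.0 for level in LEVEL_SCORE}
    for level, value in fired:
        strength[level] = max(strength[level], value)
    # The two OR rules guarantee that the total firing strength is at least 1.
    total = sum(strength.values())
    return sum(LEVEL_SCORE[level] * s for level, s in strength.items()) / total
```

Under these assumptions a pair with perfect title and person similarity but unrelated process and date scores 0.875, while a pair matching only on process and date scores about 0.33.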

These rules, which place greater significance on the title and person similarity values than on the process and date similarities, were arrived at through a combination of trial and error and a survey of members of the GLAM community regarding the relative importance of the different variables.

Starting with the seed record at the root node, FuzzyPhoto identifies the record with the highest similarity to that seed record and adds it as a child node. The record with the highest similarity to either of those two nodes is then added, creating a branching tree-like structure, and so on until all the records have been added to the tree and the process ends. The end result is that those records with the greatest similarity to the seed record appear in the highest layers of the tree. In order to reduce the time taken by this final stage of the comparison process, a novel defuzzification algorithm was developed (Coupland et al, 2014). Although a similar effect could be achieved by simply selecting the records with the highest similarity to the seed record, this approach also groups similar records together within the hierarchy. For example, records with the same or similar person fields will be grouped together, which allows for easy exploration of all the records by a single photographer.
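The tree-growing procedure resembles Prim's algorithm for a maximum spanning tree, and can be sketched as below. This is a naive illustration, not the project's implementation: the function name and the `similarity` callback are assumptions, records are assumed to be hashable identifiers, and the exhaustive search at each step would be far too slow for a corpus of 1.4 million records.

```python
def build_similarity_tree(seed, records, similarity):
    """Grow a tree from seed, always attaching the unplaced record that is
    most similar to any record already in the tree.

    Returns a dict mapping each record to its parent (None for the seed).
    """
    parent = {seed: None}
    remaining = [r for r in records if r != seed]
    while remaining:
        _, child, attach_to = max(
            ((similarity(node, cand), cand, node)
             for node in parent for cand in remaining),
            key=lambda triple: triple[0],
        )
        parent[child] = attach_to
        remaining.remove(child)
    return parent
```

For example, with integer record identifiers and `similarity = lambda a, b: 1 / (1 + abs(a - b))`, the records 0, 1, 2, 10 and 11 form the chain 0, 1, 2 near the seed, with 10 attached to 2 and 11 attached to 10, so closely related records end up grouped on the same branch.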

Embedding the Results

Results are stored in a second ‘links’ server, which can be interrogated by a ‘widget’ embedded in partners’ Web sites. When visitors to any one of these sites view an individual image record for which similar items have been identified, the widget offers them a selection of hyperlinks to potentially related records held in other partners’ collections, as illustrated in figure 4, which shows results returned from the National Media Museum Web site in an early prototype. The widget uses JavaScript to insert suggested hyperlinks. When a visitor opens a FuzzyPhoto link from a partner web page containing an object record, the embedded code queries the FuzzyPhoto links server and retrieves any matches it finds there, displaying them as shown in figure 4.


Fig. 4 Example of a prototype FuzzyPhoto widget embedded in the National Media Museum Website.

Results

Figure 5 shows the estimated distribution of overall matches discovered between ERPS records and those in partner collections, across the three categories of person, title and combined (balanced) similarity, based on a sample of 50 records. Within the sample, the proportion of person name matches found is high (around 75%) and circa 70% are excellent matches. This is not surprising, since names are usually unambiguous and have a limited number of possible representations. These results will be of use to researchers seeking more photographs by a given person, irrespective of subject matter. There are fewer matches based solely on title (40%) and, since titles contain more uncertainty, the proportion of excellent matches is lower at around 10%. Nevertheless these results will be of use to researchers more interested in a particular topic than a specific photographer. Inevitably there are fewer ‘balanced’ matches based on the combination of all four fields, simply because increasing the number of fields included in the comparison increases the number of opportunities for records to differ from each other. Nevertheless, this estimate suggests that good overall matches may be found for around 20% of the ERPS records1, of which approximately 5% are excellent matches. These results were of most interest to us since it was within this subset that the “missing” photographs were most likely to be found.

Fig. 5 Proportion of ERPS records matched to partners’ records14.

‘Excellent’ does not mean ‘identical’, and these results still need to be inspected by a human agent to judge whether the suggested match is in fact a close enough fit, but in a number of instances the match is so close that we can be reasonably confident that the missing image has been rediscovered. Table 2 shows some examples of excellent matches between records in the RPS exhibition catalogues and records discovered in partner collections. Since the RPS records are not illustrated, we cannot be absolutely certain that these records co-reference the same photographs; nevertheless the high degree of correspondence between them suggests that the photographs in figures 6-10 are indeed the same as those exhibited between 1870 and 1893 at the RPS and listed in the catalogues.

Table 2 Examples of excellent matches between records.

ERPS seed record: Hop picking in Kent; Exhibitor: Stephen Thompson; [Not listed]; 1870; ERPS 1870 Exhibit ID 2
FuzzyPhoto recommended match: Hop picking in Kent; Photographer: Thompson, Stephen; Photographic print; 1875; Source: British Library

ERPS seed record: The Solar Club; Exhibitor: Rejlander & Hughes/O. G. Rejlander; [Not listed]; 1870; ERPS 1870 Exhibit ID 339
FuzzyPhoto recommended match: The Solar Club; Creator: Rejlander, Oscar Gustav (1813-1875); [Not listed]; 1869; Source: National Media Museum

ERPS seed record: Ginx’s Baby; Exhibitor: O. G. Rejlander; [Not listed]; 1871; ERPS 1871 Exhibit ID 188
FuzzyPhoto recommended match: Ginx’s Baby; Creator: Rejlander, Oscar Gustav (1813-1875); A polychrome drawing; 1871; Source: National Media Museum

ERPS seed record: Le Ministere des Finances, after the Fire; Exhibitor: A. Liebert; [Not listed]; 1871; ERPS 1871 Exhibit ID 545
FuzzyPhoto recommended match: Finance Ministry, Burned. Exterior View; Photographer: Alphonse J. Liébert; Albumen silver print from glass negative; 1871; Source: Metropolitan Museum of Art

ERPS seed record: At Dusk; Exhibitor: Miss Emma Justine Farnsworth; Silver (print); 1893; ERPS 1893 Exhibit ID 310
FuzzyPhoto recommended match: At Dusk; Farnsworth, Emma Justine, photographer; 1 photographic print: platinum; c 1893; Source: US Library of Congress

Fig. 6 Hop-picking in Kent, Thompson, Stephen, 1875. Source: British Library.


Fig. 7 The Solar Club. Rejlander, Oscar Gustav (1813-1875), 1869. Source: National Media Museum.

Fig. 8 Ginx’s Baby, Rejlander, Oscar Gustav (1813-1875), 1869. Source: National Media Museum.


Fig. 9 Finance Ministry Burned, Exterior View /Le Ministère des Finances Incendie, Alphonse J. Liébert, 1871. Source: Metropolitan Museum of Art.

Fig. 10 At Dusk. Farnsworth, Emma Justine, 1893. Source: US Library of Congress.

In addition, among the record pairs FuzzyPhoto has identified as excellent matches are examples that are clearly not the same and yet are strikingly similar. Figure 11 shows one such pair. Notice how none of the fields match exactly, and yet daffodils and narcissi are the same genus, Henry Thomas Malby and H.T. Malby are undoubtedly the same person despite differences in the way the names are presented, and 1895 is very close to 1896. While similar but inexact matches do not meet our original aim of finding the missing pictures from the exhibitions of the RPS, they are nevertheless useful starting points for researchers looking for material related to an item of interest.

Fair Daffodils Henry Thomas Malby Bromide (Print) 1895 ERPS 1895 Exhibit ID 410

Pheasant-eye Narcissus H. T. Malby Platinum (Print) 1896 ERPS 1896 Exhibit ID 142

Fig. 11 Similar but clearly not identical matches.

We tested a sample of the outcomes on a panel of subject experts, to see how their speed and accuracy of co-reference identification compared with FuzzyPhoto. The results indicate that FuzzyPhoto finds matches that are at least as good as those discovered by experts, and that it finds more matches than are found manually. However, these trials also revealed that some researchers interpreted the notion of something similar to a given seed record as meaning some other picture by the same photographer, while others understood it to mean something with a similar title or subject matter, or made by a similar photographic process. To accommodate these different conceptions, the interface was modified as shown in figure 12 to allow users to choose between similarity by person name, by title, or by overall similarity based on the combined metric. However, better ways need to be found to allow users to nuance the results of the matching algorithms to reflect differing expectations and needs more accurately.

Fig. 12 Screen shot of the final version of the FuzzyPhoto widget embedded in the ERPS Web site, opened to show links categorised by person name, object title, or overall similarity (all fields).
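The three link categories offered by the widget can be thought of as three different scoring functions applied to the same candidate records. The Python sketch below illustrates that idea only. The record layout, the `rank_links` function and the use of a simple average for the "overall" mode are assumptions; the project's combined metric draws on all fields and is not reproduced here, and the records are invented placeholders.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_links(seed: dict, candidates: list, mode: str = "overall", top_n: int = 3) -> list:
    """Order candidate records by the similarity mode the user selected."""

    def score(record: dict) -> float:
        person = text_similarity(seed["person"], record["person"])
        title = text_similarity(seed["title"], record["title"])
        if mode == "person":
            return person
        if mode == "title":
            return title
        if mode == "overall":
            # Assumption: a plain average stands in for the combined metric.
            return (person + title) / 2
        raise ValueError(f"unknown similarity mode: {mode}")

    return sorted(candidates, key=score, reverse=True)[:top_n]


# Invented placeholder records, not taken from any partner collection.
seed = {"person": "Jane Example", "title": "Harbour at Dawn"}
candidates = [
    {"id": "A", "person": "Jane Example", "title": "Orchard in Snow"},
    {"id": "B", "person": "John Placeholder", "title": "Harbour at Dawn"},
]
print([r["id"] for r in rank_links(seed, candidates, mode="person")])
print([r["id"] for r in rank_links(seed, candidates, mode="title")])
```

With these placeholder records, the person mode ranks record A first and the title mode ranks record B first, which mirrors the different expectations the expert panel brought to the notion of similarity.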

Discussion

The FuzzyPhoto project has highlighted a number of issues, some of which are already well known but are sufficiently important to require restating, and others which are new. A key issue has been the quality of the data obtained from the partner institutions. Most online GLAM collection records employ different data schemas that are applied inconsistently, and the records are often fragmentary, imprecise and not machine readable. These are major barriers to attempts to create computationally based finding aids for searches across records held in different collections. Notwithstanding this difficulty, attempts have been made. The concept of Linked Data (note 15) was developed to create a web of interconnected machine-readable links so that a person or machine could more easily explore the web of data (Berners-Lee, 2006). There are already millions of Linked Data pages on the Web, and Linked Data search engines such as Falcons and SWSE can query data inside Web documents (Bizer et al., 2009). However, for Linked Data to work it has to be expressed correctly. Enabling existing fragmentary, messy museum records to meet the rigorous requirements of Linked Data seems an unlikely prospect for the immediate future because the costs and technical skills required cannot be supported by most GLAMs.

Data quality has a direct bearing on the next issue, which is scalability. Although not as demanding as Linked Data, the FuzzyPhoto approach requires relatively well formed records, and in this project we had to invest significant effort (70 person days) in data preparation to remove duplicates, incomplete records and junk data. Most of this work had to be done manually, although batch loaders have since been developed that can automatically ingest records provided they meet some basic data standards, in particular syntax dependence. In future, other collections could be added to the FuzzyPhoto database provided they meet these requirements (note 16). A further scalability issue is processing capacity. At present more than 8 gigabytes of memory and around six weeks are required to process all 1.4 million records. These figures will increase as more records are added. Fortunately, record matching is performed offline and the results are cached, ready to be retrieved virtually instantly by end users. It does mean, however, that continuous updating of the records is impractical. Instead, a regime of six-monthly updates is planned. Further research on the efficiency of the matching algorithms is needed to significantly expand the scale of FuzzyPhoto (note 17).

Intellectual property rights (IPR) issues were more significant during FuzzyPhoto than anticipated. Our original expectation was that there would be few if any IPR issues because (a) the ERPS seed records we were working with were all out of copyright due to their age and (b) we did not intend to publish any of the partner data, either collection records or photographic images themselves. All that the project intended to publish from the start were hyperlinks back to records on the partner websites where similar objects could be found. However, at the start of the project, for the avoidance of any doubt and to protect all the participants, it was decided to put in place a lightweight memorandum of agreement stating that the lead organization, De Montfort University, would not commercially exploit partners’ data in any way and that the partners would not seek to withdraw use of their data once they had provided it. Several institutions required their own customised wording of the boilerplate text provided, adding delays at the start of the project before partner data could be obtained. One partner went further, requiring a full legal contract to be prepared and signed by the respective parties. This was further complicated by the need to prepare the contract in two languages and for it to be subject to the law of a country other than the United Kingdom. Development and agreement of such a contract entailed obtaining the services of a bilingual international contracts lawyer and scrutiny of several drafts in both languages by various vested interests before agreement could be reached. This delayed access to the data from that particular partner by as much as a year.

A second institutional issue was the ability of the partners to engage fully at different points of the project. For example, one partner began the project anticipating that by the end it would have its photographic collections publicly accessible online, but this was delayed. Three other partners felt that, due to changes in priorities, the implementation of the public interface widget would have to be delayed or, in one case, cancelled entirely. Institutional readiness was affected to some degree by staff availability. Three of the partner institutions experienced loss of key personnel working on the FuzzyPhoto project, in one case due to illness, and in the others because the individuals concerned left the organizations. This catalogue of woes may give the impression that the FuzzyPhoto project was unlucky, but these sorts of issues are not uncommon in large projects of significant duration (Brown, 2014).

Conclusions

In the last decade there has been tremendous effort on the part of heritage institutions to digitize their records and make them available to users online. The increased availability of online collections and searchable metadata creates enormous potential for using information about objects from one data set to enrich records in others, thus revealing relationships not previously apparent and enhancing their value. There is also growing recognition among cultural heritage institutions that there are significant benefits to be gained from sharing and connecting data, and that users are increasingly likely to expect to be able to navigate seamlessly across separate collections. However, drawing together information from diverse collections is time-consuming and becomes more complex as the volume of online information increases. Comparisons between collections are difficult because of incompatibilities between different institutional record management systems, idiosyncrasies in the ways in which they are implemented, and the sheer messiness of legacy data in terms of gaps, duplications and typographical errors.

The FuzzyPhoto project developed an approach that has successfully identified matches for around 20% of the records in the catalogues of the Royal Photographic Society annual exhibitions (note 18), allowing images of some of the exhibits to be seen for the first time in around 120 years. In addition, FuzzyPhoto has identified matches between partner records, even where these do not match exhibits in the RPS exhibitions. So, for example, photographs in the Musée d’Orsay have been matched with similar items in the British Library, and items from the National Media Museum collection have been linked to photographs in the Library of Congress. This shows that FuzzyPhoto is a powerful tool not only for rediscovering the lost images from the RPS exhibitions but more generally for identifying potential connections between records in different institutions, enriching our understanding through cross-referral. Further expansion of the FuzzyPhoto data warehouse has the potential to create, from textual metadata alone, a rich web of interconnections between collections worldwide that will be invaluable for photographic historians, curators, researchers, teachers, students and dealers. The matching algorithms are modular and hence could be modified to be applied to other kinds of records such as maps, paintings, ceramics and even people.

Notwithstanding these encouraging results, the FuzzyPhoto approach is not without limitations. Scalability is an issue, because of the messiness of the data and because of the processing demands created by 1.4 million records. While FuzzyPhoto does not require records as rigorously formed as Linked Data, it cannot handle entirely unstructured and inconsistently formatted data. Processing limitations may in due course be overcome by developing more efficient matching algorithms. However, concerns about sharing resources and loss of control over intellectual property may prove to be the main limiting factor in the immediate future. Finally, FuzzyPhoto has highlighted a number of institutional issues around IPR, staffing and institutional readiness that future projects of this kind may usefully take into account.

Funding

This research was supported by the UK Arts and Humanities Research Council [Research Grant AH/J004367/1]. Thanks are also due to Birmingham City Library; the British Library; Musée du Louvre; the Metropolitan Museum of Art; Musée d’Orsay; the National Archives; the National Media Museum; National Museums Scotland; St Andrews University; and the V&A for their generous support.

References

Batjargal, B., Kuyama, T., Kimura, F. and Maeda, A. (2013). Linked data driven multilingual access to diverse Japanese Ukiyo-e databases by generating links dynamically. Literary and Linguistic Computing, 28(4): 522-530.
Brown, S. (2014). You can’t always get what you want: Change management in Higher Education. Campus-Wide Information Systems, 31(4): 208-216.
Coupland, S., Croft, D. and Brown, S. (2014). A Fast Geometric Defuzzification Algorithm for Large Scale Information Retrieval. Proceedings of FUZZ-IEEE 2014 International Conference on Fuzzy Systems, Beijing 6-11 July 2014. IEEE Conference Publications 2014: 1143-1149. DOI: 10.1109/FUZZ-IEEE.2014.6891581 ISBN 978-1-4799-2073-0.
Croft, D., Coupland, S., Shell, J. and Brown, S. (2013). A Fast and Efficient Semantic Short Text Similarity Metric. Proceedings of the 13th UK Workshop on Computational Intelligence (UKCI), 2013: 221-227. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6651309&isnumber=6651272 (accessed 20 November 2014).
Dearnley, L. (2011). Reprogramming the Museum. In J. Trant and D. Bearman (eds). Museums and the Web 2011: Proceedings. Toronto: Archives and Museums Informatics. Published March 31, 2011. http://www.museumsandtheweb.com/mw2011/papers/reprogramming_the_museum (accessed 20 November 2014).
de Polo, A. (2011). Digital Environment for Cultural Interfaces: Promoting Heritage Education and Research. In J. Trant and D. Bearman (eds). Museums and the Web 2011: Proceedings. Toronto: Archives and Museums Informatics. Published March 31, 2011. http://conference.archimuse.com/papers/a_digital_environment_for_cultural_interfaces (accessed 20 November 2014).
Eklund, P., Goodall, P., Lawson, A. and Wray, T. (2011). CollectionWeb Digital Ecosystems: A Semantic Web and Web 2.0 Framework for generating Museum Web sites. In J. Trant and D. Bearman (eds). Museums and the Web 2011: Proceedings. Toronto: Archives and Museums Informatics. Published March 31, 2011. http://www.archimuse.com/mw2010/papers/eklund/eklund.html (accessed 20 November 2014).
Feng, J. (2012). Efficient fuzzy type-ahead search in XML data. IEEE Transactions on Knowledge and Data Engineering, 24(5): 882-895.
Henry, D. and Brown, E. (2012). Using an RDF Data Pipeline to Implement Cross-Collection Search. In J. Trant and D. Bearman (eds). Museums and the Web 2012: Proceedings. Toronto: Archives and Museums Informatics. Published March 31, 2012. http://www.museumsandtheweb.com/mw2012/papers/using_an_rdf_data_pipeline_to_implement_cross_.html (accessed 5 September 2013).
IMLS (2006). Institute of Museum and Library Services. Status of technology and digitization in the nation’s museums and libraries. Technical report, Institute of Museum and Library Services, Washington, DC, 2006. http://web.archive.org/web/20060926090433/http://www.imls.gov/resources/TechDig05/Technology%2BDigitization.pdf (accessed 5 September 2013).
Kamura, T., Ohmukai, I. and Kato, F. (2011). Building Linked Data for Cultural Information Resources in Japan. In J. Trant and D. Bearman (eds). Museums and the Web 2011: Proceedings. Toronto: Archives and Museums Informatics. Published March 31, 2011. http://www.museumsandtheweb.com/mw2011/papers/building_linked_data_for_cultural_information_.html (accessed 5 September 2013).
Lai, L. F., Wu, C. C., Lin, P. Y. and Huang, L. T. (2011). Developing a fuzzy search engine based on fuzzy ontology and semantic search. Proceedings of IEEE International Conference on Fuzzy Systems. Taipei, IEEE: 2684-2689.
Li, F. Z., Luo, D. Y. and Xie, D. (2009). Fuzzy search on non-numeric attributes of keyboard query over relational databases. Proceedings of ICCSE ’09. 4th International Conference on Computer Science and Education. Nanjing, China, IEEE: 811-814.
Manning, C. D. and Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
O’Shea, J., Bandar, Z., Crockett, K. and McLean, D. (2008). A comparative study of two short text semantic similarity measures. Agent and multi-agent systems: Technologies and Applications. Lecture Notes in Computer Science, Springer-Verlag, 4953: 172-181.
Szekely, P., Knoblock, C. A., Yang, F., Zhu, X., Fink, E. E., Allen, R. and Goodlander, G. (2013). Connecting the Smithsonian American Art Museum to the Linked Data Cloud. In P. Cimiano et al. (eds). The Semantic Web: Semantics and Big Data, Lecture Notes in Computer Science 7882: 593-607.
Terras, M. (2009). The Potential and Problems in using High Performance Computing in the Arts and Humanities: the Researching e-Science Analysis of Census Holdings (ReACH) Project. Digital Humanities Quarterly, 3(4). http://www.digitalhumanities.org/dhq/vol/3/4/000070/000070.html (accessed 20 November 2014).
Wilson, R. J. (2011). Digital Heritage: Behind the scenes of the museum website. Museum Management and Curatorship, 26(4): 373-389.
Winkler, W. E. (1990). String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research: 354-359.

Figure and table legends

Fig. 1 Comparison of portraits of George Meredith.
Fig. 2 Example of an illustration of a photograph from the 1895 RPS exhibition catalogue.
Fig. 3 Comparison of FuzzyPhoto name similarity metric with NESim.
Fig. 4 Example of the FuzzyPhoto widget embedded in the National Media Museum Website.
Fig. 5 Proportion of ERPS records matched to partners’ records.
Fig. 6 At Dusk. Farnsworth, Emma Justine, 1893. Source: US Library of Congress.
Fig. 7 The Solar Club. Rejlander, Oscar Gustav (1813-1875), 1869. Source: National Media Museum.
Fig. 8 Ginx’s Baby, Rejlander, Oscar Gustav (1813-1875), 1869. Source: National Media Museum.
Fig. 9 Finance Ministry Burned, Exterior View /Le Ministère des Finances Incendie, Alphonse J. Liébert, 1871. Source: Metropolitan Museum of Art.
Fig. 10 At Dusk. Farnsworth, Emma Justine, 1893. Source: US Library of Congress.

Fair Daffodils, Henry Thomas Malby, Bromide (Print), 1895. ERPS 1895, Exhibit ID 410.

Pheasant-eye Narcissus, H. T. Malby, Platinum (Print), 1896. ERPS 1896, Exhibit ID 142.

Fig. 11a and b Similar but clearly not identical matches.
Fig. 12 Screen shot of the final version of the FuzzyPhoto widget embedded in the ERPS Web site, opened to show links categorised by person name, object title, or overall similarity (all fields).

Table 1 FuzzyPhoto record numbers before and after data cleaning.
Table 2 Examples of excellent matches between records.

Notes

1. https://images.google.com/imghp?hl=en&gws_rd=ssl
2. https://www.tineye.com/
3. ROSENCRANTZ: What are you playing at? GUILDENSTERN: Words, words. They’re all we have to go on. Stoppard, T. 1966. Rosencrantz and Guildenstern are dead, Act 1.
4. http://erps.dmu.ac.uk
5. Contributing partners were Birmingham City Library, the British Library, The Metropolitan Museum of Art, the Musée d'Orsay, the National Media Museum, National Museums Scotland, St Andrews University Library, the National Archives, the Victoria and Albert Museum. Additional collection records were obtained from the US Library of Congress, Brooklyn Museum and CultureGrid.
6. Photographic Exhibitions in Britain 1839-1865 http://peib.dmu.ac.uk
7. http://network.icom.museum/cidoc/working-groups/data-harvesting-and-interchange/what-is-lido/
8. http://cogcomp.cs.illinois.edu/page/software_view/NESim
9. Within one single collection 20 different date formats were used.
10. See for example http://www.vam.ac.uk/content/articles/p/photographic-processes/
11. http://www.getty.edu/conservation/publications_resources/pdf_publications/atlas.html
12. http://wordnet.princeton.edu/
13. The records included in the region of 43 thousand recognised terms.
14. Based on a sample of 50 records.
15. http://www.w3.org/standards/semanticweb/data
16. At the time of writing circa 400,000 additional records are being added to FuzzyPhoto from the EuropeanaPhotography project http://www.europeana-photography.eu/
17. This is the focus of a new research proposal currently with the AHRC.
18. Based on analysis of a random sample of the full data set.
