International Journal of Internet Science 2016, 11 (1), 8–32

ISSN 1662-5544

IJIS.NET

Digitizing a Large Corpus of Handwritten Documents Using Crowdsourcing and Cultural Consensus Theory

Prutha S. Deshpande, Sean Tauber, Stephanie M. Chang, Sergio Gago, Kimberly A. Jameson
University of California, Irvine, USA

Abstract: We investigated using internet-based procedures to convert information from a large handwritten archive of ethnographic survey data into a computer addressable database. Rather than manually transcribing the archive's estimated 23,000 pages of handwritten data, we sought to develop novel crowdsourcing task designs and to use an innovative variation of Cultural Consensus Analysis (CCT) to objectively aggregate crowdsourced responses based on a formal process model of shared knowledge. Experiment 1 used simulated internet-based tasks conducted with human subject pool participants in a university laboratory. Experiment 2 used a similar design with the exception that it was implemented on an internet-based research platform (i.e., Amazon Mechanical Turk). Results from these investigations shed light on several uncertainties concerning the utility of CCT analyses with crowdsourced transcription data. For example, they clarify (1) whether crowdsourced tasks are practical as a method for automating the transcription of the archive's handwritten material, (2) whether responses from perceptually-based tasks inherent to transcribing handwritten documents can be analyzed using CCT, and (3) whether, if CCT is appropriate as a model of the transcription challenge, the results produce accurate answer-key estimates that could serve as correct transcriptions of the archive's data. Our results address these issues and convey how CCT modeling can be modified and made appropriate for aggregating such data. Implications of these analyses and uses of CCT in large-scale crowdsourced data collection platforms are discussed.

Keywords: Crowdsourcing, cultural consensus theory, shared knowledge, handwriting transcription, individual differences

Address correspondence to Prutha S. Deshpande, Department of Cognitive Sciences, University of California, Irvine, 2132 Social Science Plaza, Irvine, California, USA, [email protected], or Sean Tauber, [email protected], or Kimberly A. Jameson, [email protected]

Introduction

Important historical documents often exist only as printed or handwritten archives, or corpora, and there is a general need to convert such documents into searchable digital copies for sharing and research purposes. Transcribing handwritten documents can involve time-consuming, tedious, and potentially error-prone processes. In recent years, internet-based crowdsourcing has been viewed as an alternative to the transcription of handwritten documents by individual human effort (Causer & Wallace, 2012; Bevan et al., 2014). While crowdsourcing approaches are well suited to producing quick transcription results from large numbers of respondents, and to repeated assessment of data points, the assessment of accuracy and the aggregation of multiple responses from crowdsourced tasks remain a challenge. In this article, we investigate the utility of novel methods for transcribing large corpora from crowdsourced responses. We employ an extension of Cultural Consensus Theory (CCT) – a formal process model used to derive information from a shared knowledge domain when the "correct answers" are uncertain (Batchelder & Romney, 1988; Romney et al., 1987; Weller, 1984, 2007).

Using a novel Bayesian CCT model for analyzing multiple-choice or free-response data that accounts for the perceptual confusability of items inherent in handwriting transcription tasks, we develop a general method that provides for principled analyses of crowdsourced data. We apply this model to the analysis of crowdsourced responses collected with the aim of transcribing a large handwritten cross-cultural survey dataset (MacLaury, 1997).

Digitization of the MacLaury archive

The Robert E. MacLaury Color Categorization Archive (ColCat) is a large corpus of heretofore unpublished cognitive anthropology survey data (MacLaury, 1997) consisting of irreproducible observations of color categorization behaviors from over 900 monolingual informants responding in 116 indigenous Mesoamerican languages (MacLaury's Mesoamerican Color Survey, or MCS, collected between the years 1978–1981), and over 500 informants of approximately 70 other languages of the world (collected until 2002). The data has been hand-recorded either by MacLaury or by his 142 research associates, each having employed a unique handwritten system of coding abbreviations and conventions that vary both within and across languages. The bulk of the archive consists of data on three independent color naming and categorization tasks completed by the informants surveyed (for details of the tasks, refer to Appendix A). These task responses have often been recorded on data sheet templates as in Figure 1, although variation exists.

Figure 1. Sample data sheet illustrating the tabular grid format of raw color appearance naming data collected (see Appendix A for other data formats).

One aim of the present research is to build a publicly accessible database that includes all the transcribed raw data from the archive (see Jameson et al., 2015 for a description of our database development goals). While MacLaury published analyses of the archive data (MacLaury, 1987, 1997; MacLaury et al., 1992, 1997; Burgess, Kempton, & MacLaury, 1983), the raw data has remained inaccessible, preventing further investigation by the research community. The ColCat archive has the potential to be as informative and influential as the highly popular World Color Survey (abbreviated WCS; Kay et al., 2009). The WCS required an enormous 20-year effort to manually transcribe and digitize, but since its 2003 public release it has contributed to immensely valuable research in the areas of cognition and linguistics. The diverse ethnographic data contained in the archive is important for addressing questions related to the formation of individual and shared color concepts, including their evolution and dispersion across ethnolinguistic groups. The present project develops new methods that combine the power of internet-based research and novel quantitative analysis techniques to make accessible the wealth of data contained in the ColCat archive. Another aim of the project is to automate transcription of the ColCat archive's estimated 23,000 pages of data in far less than the 20 years that were required to transcribe the WCS using the manual efforts of expert transcribers. Our research group is exploring multiple transcription approaches, including the use of crowdsourcing and optical character recognition (OCR) methods. The present article describes only the results and methods developed for transcribing the ColCat archive using crowdsourcing. Here we explore the potential for rapid transcription of the ColCat archive using crowdsourcing tasks that harness human perceptual processing capabilities.

The methods are tested using a small subset of data for which the "true" transcriptions are known, permitting the development of crowdsourcing procedures and data analysis methods that can be scaled in order to transcribe the entire archive.

Crowdsourcing approach

As first defined by Howe, "crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call" (Howe, 2006). Early examples of crowdsourcing involved for-profit organizations such as Threadless, iStockphoto, and InnoCentive engaging crowd workers in tasks of creative content generation or problem solving (Brabham, 2008). Since then, applications of crowdsourcing have been widespread, with many possibilities with regard to who forms the crowd, what the crowd has to do, what the crowd gets in return, who the crowdsourcer is, and so on (Estellés-Arolas, 2012). Crowdsourcing the participation of humans through internet-based platforms is now also widespread in academic research. For example, crowdsourcing methods were employed by SETI@home for the detection of intelligent life outside Earth (Anderson et al., 2002); by Foldit, where online games were used to discover protein folding strategies (Khatib et al., 2011); by EteRNA in the design of RNA sequences that fold into a given configuration (Lee et al., 2014); and by Click-Workers (a NASA project) for the tagging of space exploration images (Coleman et al., 2014). Crowdsourcing has also been used in projects similar to the current one for the transcription of historical records, diaries, and literary manuscripts (Moyle et al., 2011; Keegan et al., 2015), but these projects relied on collaborative editing and correction procedures in which transcription versions were modified iteratively over time. The final product is reached by a method of "consensus" (also used to describe the process of user contributions to Wikipedia articles; Malone et al., 2009), achieved when transcriptions have remained unchanged for a period of time, indicating that contributors are satisfied with the current version. The disadvantage of iterative crowdsourcing procedures is that they result in transcription solutions that are difficult to assess and validate through formal analyses. The wisdom-of-the-crowd phenomenon suggests that the aggregated responses of a group of people often provide a closer estimate of the true answer than the responses of most individuals. For example, on the game show Who Wants to be a Millionaire?, the modal audience response was found to be the correct answer 91% of the time, compared to a 65% accuracy rate for the "expert" individuals called on by contestants for assistance (Surowiecki, 2005). Formal approaches to group estimation and aggregation can be found in the literature on the cognitive modeling of collective intelligence (see Steyvers & Miller, 2015 for a review). As Steyvers and Miller note, although there are advantages to aggregating a group's responses, the individuals composing the group are affected by many factors that may bias their responses.
In this investigation we develop several crowdsourcing tasks to collect multiple participant transcriptions for each hand-recorded data point, to which formal CCT models that factor in different types of biases can be applied to aggregate responses and arrive at a "true" transcription for each datum, as well as to obtain useful analytic information derived by the models about individual items and respondents. It is worth noting that the information processing tasks required to crowd-transcribe the ColCat corpus differ from the kinds to which CCT analyses are typically applied. The crowdsourced tasks needed to digitize the archive are variations on reCAPTCHA-type tasks (von Ahn et al., 2008) in which human respondents disambiguate strings of characters often produced in idiosyncratic styles, which can be problematic for standard machine recognition procedures. Such reCAPTCHA tasks involve perceptual processing at the object and feature recognition levels, making the processing enlisted qualitatively different from the kinds of general knowledge and survey question formats that are commonly analyzed using CCT procedures. Some results suggest, however, that CCT analyses may be appropriate for the aggregation of perceptual judgment data, and may be used to identify expertise and variations in the strategies that arise in response to transcription task questions (Jameson & Romney, 1990). These issues and their relevance to the modeling of bias and expertise are detailed in later sections. In addition, the present investigations set out to evaluate a debate found in the internet-based research literature concerning the quality of data obtained from internet-based data collection platforms like Amazon's Mechanical Turk (MTurk) compared to laboratory-based formats (Hauser & Schwarz, 2015). As the environment in which MTurk workers participate is unsupervised and highly variable (with many possible distractions), concerns have been raised about the quality of the data obtained from MTurk studies. In the present study, empirical assessment was obtained using two data collection formats. Below, "Experiment 1" describes laboratory-based "crowdsourcing" task investigations, whereas "Experiment 2" reports on internet-based crowdsourcing tasks implemented on MTurk. Questions explored in these two experiments include (1) whether the proposed data aggregation technique of Cultural Consensus analysis is applicable to data obtained from both platforms; (2) whether the participants' responses are qualitatively similar – that is, whether the response patterns across the groups from the two platforms are the same, or whether one platform yields more uniform responses across the group while the other is more heterogeneous; (3) if the responses are not the same, whether identifying "experts" in the data would assist in the derivation of the correct answers; and (4) whether a large participant group is required to obtain an accurate transcription solution, or whether similar results can be achieved with a smaller participant group.

Our findings on these questions are presented below.

Experiment 1: Laboratory crowdsourcing tasks

Experiment 1 consisted of a series of transcription tasks designed to assess the potential to accurately convert the raw archive data using responses from undergraduate participants supervised by an experimenter in a university laboratory setting.

Participants

Participants (N=30) were recruited from the University of California, Irvine, School of Social Sciences Human Subjects Pool to collect human transcription responses for a small subset of data from the archive. The participants received 1 point of extra course credit for 1 hour of participation in the study, in which they completed a total of 7 character judgment/transcription tasks.

Procedure

As web-based experiments deny researchers the opportunity to make observations on the respondents' experience while participating in the tasks, Experiment 1 was conducted under laboratory supervision to gather direct observations and reliable feedback on the procedures implemented. A research assistant was present during the experimental session to observe participants interacting with the task designs and to assist with questions. A simulated web application (available at http://colcat.calit2.uci.edu/tiki-index.php?page=Resources) was developed for the presentation of the transcription tasks on a lab computer, and the application was developed in a flexible and scalable manner to allow for later integration with a publicly available crowdsourcing platform such as MTurk. Due to the variety of data types and formats in the ColCat archive, as well as uncertainty regarding how best to partition large blocks of archive content into smaller piecewise transcription tasks, we developed seven response format variations to collect three forms of transcription information for (1) OCR assessment data (Tasks 1, 2, & 3), (2) image scans of ColCat's "naming" data (consisting of elicited names for 330 randomly presented color samples; Tasks 4 & 5), and (3) image scans of ColCat's "focus selection" data (consisting of color samples indicated as best exemplars for elicited color categories; Tasks 6 & 7). For completeness, all seven tasks are described below; however, the results reported are restricted to the latter two data types arising from the "naming" and "focus selection" transcription task formats. Stimuli used were from a survey of Korean (Tyson, 1997, 1998) for which the true transcription of data points could be determined beforehand with certainty, as the raw data was relatively clean and native Korean language speakers were available for verification.

Task design

OCR Assessment Tasks. Tasks 1, 2, and 3 consisted of images of handwritten characters generated as the output of an OCR algorithm. In Task 1, participants were asked to judge whether the output of the OCR algorithm was recognizable as containing English alpha-characters. In Task 2, participants were provided with a potential transcription of the output of the OCR algorithm and were asked to judge whether the transcription was correct. In Task 3, participants were asked to transcribe the characters in the OCR output. The purpose of these tasks was to generate training data for the OCR transcription algorithm and to assess the performance of OCR transcriptions performed for a separate study.
OCR tasks are mentioned here for completeness, although their analyses and results are not presented (Deshpande & Chang, 2015 provide preliminary analyses).

Transcription Tasks for "Naming" and "Focus Selection" Data. Tasks 4 through 7 consisted of multiple choice and free response tasks, in which participants provided direct transcriptions of images containing portions of data from the ColCat archive. Tasks 4 and 5 involved transcribing color naming data, whereas Tasks 6 and 7 involved focus selection data. The tasks explored different transcription format variants with the goal of identifying the optimal formats for transcribing large tables of data using small piecewise tasks.

Task 4. Participants were asked to transcribe characters from a portion of a scanned image of tabular handwritten data (referred to as a "grid" below). Participants also provided confidence ratings of how sure they were about the transcription provided for each set of characters. The confidence ratings were given on a scale of 1 to 5, where '1' indicated least confidence and '5' indicated most confidence (see Figure 2).


Task 5. Participants were asked to transcribe characters from the same portion of image as in Task 4, given a key of possible transcriptions of the abbreviated data. The key of values contained a "none" option for the transcription of unabbreviated data in the grid. Participants also provided confidence ratings for this task.

Figure 2. (A) Portion of data grid to be transcribed in Task 4 and Task 5, corresponding to "naming" data. (B) The transcription response template for Task 4 and Task 5. In addition, Task 5 contained a key of all possible transcription values to the left of the data grid.

Task 6. Participants were asked to transcribe characters from a portion of a scanned image of tabular handwritten data, to provide the "row" and "column" values for the locations of the characters in the grid, and to provide confidence ratings for the task (see Figure 3).

Task 7. Participants were asked to transcribe characters from the same portion of image as in Task 6, and to place the transcription in the correct location of a table replicating the format of the data sheet.

Figure 3. (A) Portion of data grid to be transcribed in Task 6 and Task 7, corresponding to "focus selection" data. (B) Transcription response template for Task 6. (C) Transcription response template for Task 7.

Participants completed the seven tasks in a randomized order, with the exception of Task 5, which was always presented after Task 4, as Task 5 contained a key of the only possible transcription solutions to the characters in both tasks. Participants also completed a debriefing questionnaire at the end of the transcription tasks, providing their impressions of the tasks, input on improving the tasks, and their willingness to participate in a future study.

Empirical results

Initial Experiment 1 analyses sought to evaluate (1) whether participants could provide transcription responses via our designs, and (2) what descriptive analyses suggested regarding (a) informants' response variability and (b) the use of simple modal/majority-choice aggregation procedures in the derivation of transcription answers.

Free Response and N-Alternative Task Design Considerations. "Naming" Data. In the Task 5 variation for transcribing "naming" data, the list of possible response alternatives for transcription answers was based on a dictionary of color name abbreviations compiled by the investigator responsible for originally recording the data. Unfortunately, the available dictionaries do not always contain all of the unique character strings in the data. For example, cell H26 and cell H27 of the data in Figure 2A consist of characters transcribed as 'cheng-lok-' and 'cheng' respectively. While the available dictionary (see Figure 4) contains an entry for 'chengloksayk', the distinctions made between these data points with similar content are important to retain. The method of Task 5 could thus only provide an incomplete transcription solution for the data, but was pursued in an attempt to test whether at least a portion of the data could be reliably transcribed in such a way.


Figure 4. Excerpt from the dictionary of the Korean language survey of the ColCat archive.

"Focus Selection" Data. In the Task 6 variation for transcribing "focus selection" data, participants were given information on the maximum number of sets of characters to be transcribed from the grid. The instructions for this format were interpreted with great variation by participants, which is reflected in the transcriptions obtained from this task design. In particular, participants expressed confusion regarding the transcription of cases where more than one set of characters was present in a cell, as in cell H29 of Figure 3, which contains the two distinct character sets 'Ph' and 'Cg' that were to be transcribed in separate response boxes per the task design. Because participants also varied in the order in which they transcribed characters in Task 6, the raw data obtained required significant post-processing to organize into a format suitable for analysis. Task 7 was better suited to obtaining accurate transcription responses from participants, but inaccuracies sometimes arose in locating the characters in the provided stimulus sheet, as the whole entry table could not be viewed at once and required the participant to navigate back and forth in the response table.

Distributions of Participant Responses. Figure 5 shows that there were a few items in Tasks 4 and 5 on which individuals were divided between two or three alternatives. For example, in Task 4, the transcription responses for several items were split between the correct answer "hl", the incorrect answer "hi", and other uncommon responses such as "h1". Figure 6 shows similar variation in participants' responses for items such as "cl" in Tasks 6 and 7 (columns in Figures 5 and 6 with lighter green cell shading).

Discussion

In general, participants did not benefit from attempts to simplify the task by specifying response options (as in the "naming" Task 5 design) or by limiting the number of required responses (as in the "focus selection" Task 6 design). Based on these results, it appears that the use of free response tasks is preferable. We initially thought that multiple choice tasks would be required, as current dichotomous choice implementations of CCT (Oravecz et al., 2014b) can be easily extended to assume that data are multiple choice decisions. However, as shown later, free response data can be treated as if they were multiple choice data for the purposes of analysis, thereby eliminating the need to constrain the response options in the tasks to a fixed set of alternatives. Our observations in Experiment 1 led to improved task designs for Experiment 2 – including a shift towards exclusively using free response tasks. In addition, the observed distribution of participant transcription responses demonstrates a pattern of variation that is unsurprising given the vast literature on the perceptual confusability of alphanumeric characters (Townsend, 1971a, 1971b, 1984). As much of the archival data has been coded using alphanumeric abbreviations, it is expected that the transcription responses of participants may often be split between two or more perceptually similar alternatives for confusable items. In such a case, a modal aggregation method may not be well suited to producing accurate data transcriptions (examples of this situation are observed in our data and are discussed later, along with a potential method of managing such possibilities).
This perceptual confusability of the data is taken into consideration in developing our novel model for aggregating crowdsourced transcription data, as an alternative to majority-choice aggregation.

Experiment 2: MTurk crowdsourcing tasks

Through Experiment 1 we obtained results on the task designs best suited to obtaining reliable transcription responses for our specific "naming" and "focus selection" data. Incorporating observations from the laboratory study, Experiment 2 sought to improve on the procedures and task design of transcription response collection and to assess the descriptive quality of participant responses collected through the online platform of MTurk.

More generally, the aim of Experiment 2 was to determine the suitability of the MTurk platform for the large-scale transcription of the ColCat archive data.

Figure 5. Human responses for “naming” (A) Task 4, (B) Task 5. The abscissa shows the exhaustive set of responses observed for each of the 128 items shown on the ordinate. Cell shading indicates the proportion of each response option (columns) for each item (rows). Proportions for each row on each task sum to one.

Figure 6. Human responses for "focus selection" (A) Task 6, (B) Task 7. The abscissa shows the exhaustive set of responses observed for each of the 10 items shown on the ordinate. Cell shading indicates the proportion of each response option (columns) for each item (rows). Proportions for each row on each task sum to one.

Participants

Participants (N=22) were recruited from Amazon Mechanical Turk to collect human transcription responses for three data sheets with "focus selection" data. The participants were limited to US citizens above the age of 18 who were native English speakers, and were compensated at the rate of $1.50 per hour of participation in the study.

Procedure

Turkers first had the opportunity to preview the task design and view a research disclaimer before deciding whether to accept the HIT (Human Intelligence Task). They were then presented with an information sheet detailing what the task would consist of, and were asked to certify that they met the demographic requirements of the study. They were then given a prescreen task in which they were asked to transcribe two simple images of handwritten characters, similar to the task stimuli. Upon correctly completing the prescreen, participants were taken to a practice transcription task where they were asked to transcribe the handwritten characters in particular cells of the tabular data (Figure 7A). Upon submitting a response, participants were given feedback on whether their transcription was correct or incorrect, and in both cases were given reminders/suggestions on details of the transcription conventions to be adopted in the actual task. For example, when asked to transcribe cell H3 of the table in Figure 7A, if a participant responded with the incorrect response "Pt", they were reminded to also transcribe the superscript numbers that appear after letters (Figure 7B). Participants moved on to the main transcription task after correctly completing the set of practice transcription questions.

Figure 7. (A) The design of the practice task. (B) An example of feedback given to participants upon responding with an incorrect transcription.

Task design

In the MTurk study, participants were only asked to transcribe "focus selection" data and not "naming" data, as the task designs developed for the two data types were largely identical. The adapted task design (see Figure 8) most closely corresponded to Task 7 of the laboratory study, where participants were asked to transcribe characters from a table of data, recreating it in a response table. While in Task 7 participants were shown only one-third of the tabular data, the whole table of data was presented in the MTurk task. In addition, the format of the transcription response template varied between the laboratory and MTurk designs. While in Task 7 the response template included all the columns of the portion of the tabular data, in the MTurk task participants were randomly presented with one column of the table at a time in order to minimize order effects in the data. MTurk HITs were made available for three different "focus selection" data sheets (in the Results section below, these are referred to as Grid 1, Grid 2, and Grid 3), with 10 sets of transcriptions collected for each grid. A total of 22 workers participated in the tasks, with some contributing transcription data to more than one HIT.

Empirical results

As in Experiment 1, initial analyses of Experiment 2 data sought to assess the efficiency of the task design and the quality of transcription responses. The distribution of participant responses for all items in Grid 1 is shown in Figure 9. As in Experiment 1, there was relatively little variation in participants' responses, except for a few items on which individuals were divided between two or three alternatives. For example, the transcription responses for item "namwu-" were split between the correct answer "namwu", the incorrect answer "narnwu-", and other uncommon responses such as "namwy-". For items with response variations (columns in Figure 9 with lighter green cell shading), the answer options again exhibit the characteristic of being perceptually confusable. The distributions of responses for items in Grid 2 and Grid 3 were qualitatively similar to that of Grid 1 and are not shown.


Figure 8. MTurk task design showing tabular "focus selection" data of Grid 1, and the transcription response template. The same task design was used for Grids 2 and 3.

Figure 9. Human responses for "focus selection" data of Grid 1. The abscissa shows the exhaustive set of responses observed for each of the 29 items shown on the ordinate. Cell shading indicates the proportion of each response option (columns) for each item (rows). Proportions for each row on each task sum to one.

Discussion

Consistent with the results of other studies comparing laboratory and MTurk samples (Crump et al., 2013; Clifford & Jerit, 2014; Hauser & Schwarz, 2015), we found that the data obtained from MTurk participants was qualitatively similar to the data obtained from undergraduate students in the laboratory study. Moreover, we observed that MTurk participants tended to be generally more capable of meeting online task demands in comparison to the undergraduates. For example, in the laboratory study, students often did not read the instructions and needed explanation of details of the transcription conventions. On the other hand, MTurk participants appeared to have a better intuitive understanding of the task requirements. In an experimental oversight, we failed to provide an example in the preliminary training of how to transcribe characters as in Figure 10B, where the number appearing below the alpha-characters indicates a footnote on the raw data sheet.

There was, however, an example of transcribing the occurrence of numbers as in Figure 10A, where the superscript is to be transcribed as "Pt2". From this, most MTurk participants were able to correctly infer that the cells in Figure 10B should be transcribed as "Nm 1" and "Kw 2" respectively. These results are encouraging with respect to our goal of automating the crowdsourced transcription of portions of the archival data, although further experimentation is necessary for the more complex data types present in the archive (see Figures 16 and 17 of Appendix A for examples).

Figure 10. Example of the subtleties in the handwritten data, requiring explicit directions for transcription.

Data aggregation: Inferring the correct transcription

Initial descriptive analyses of Experiment 1 and Experiment 2 data suggested that a more principled form of data aggregation was needed to combine crowdsourced responses in order to infer the "true" transcription answer. As suggested earlier, the simplest approach is to take the modal response as the inferred truth. However, this may not yield the best results because it ignores important latent information that may be present in the data, such as the perceptual confusability of alphanumeric characters or biases resulting from the different languages known to the human participants. We explore the idea that because crowdsourced data are fundamentally behavioral data generated by humans (see Jameson et al., 2016 for a discussion of using individual differences in the development of standardized models), it is sensible to use data aggregation models that account for the psychological properties of the data. One such family of models, Cultural Consensus Theory (CCT), uses methods from psychometrics and signal-detection theory and was initially devised for use in ethnographic research. Traditionally, investigators used CCT to infer the cultural knowledge or "consensus" answer for a group of people by aggregating their responses to a set of questions (Romney et al., 1987). This family of models assumes that there is latent structure in the data, related to the difficulty of items and the ability of individual respondents, that can be used to arrive at a more accurate inference about the true answer of each item. Through CCT analyses of survey data, researchers have been able to assess the degree of shared cultural beliefs on a wide variety of topics, such as the classification of illnesses among informants from a Mexican village (Weller, 1984), the ecological management of fisheries in Hawaii (Miller et al., 2004), the perception of industry hiring standards among employees in China and the US (Liu et al., 2014), and appropriate interventions for the improvement of the functioning of clinics (Smith et al., 2004), among others. Since its original application to the handling of dichotomous response data, CCT has been extended to the modeling of a wide variety of processes and situations. There are a number of advantages to using CCT over modal methods for crowdsourced data aggregation. While both CCT and modal methods will adequately provide item estimates when the sample size is large, CCT was developed to provide accurate results with relatively small sample sizes (Batchelder & Romney, 1988; Batchelder & Anders, 2012). As is discussed further in the section on Inferred Answer Keys for Transcription Data, this general feature of CCT analyses provides the potential to significantly lower the costs associated with engaging and compensating a large number of participants in a crowdsourced study. The approach also provides several quantitative indices that allow for a rigorous empirical and analytic method of (1) estimating the relative individual competencies of transcription task participants, (2) deriving "correct" transcriptions from participants' response-pattern intercorrelations, (3) assessing the appropriateness of the model for the data and the degree of group agreement or consensus, (4) identifying potentially unreliable transcription items, and (5) measuring the accuracy of crowdsourced transcriptions.
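For concreteness, the majority-choice baseline referred to above can be sketched in a few lines of Python. This is a minimal illustration only; the dictionary layout and item labels are hypothetical and not the format of our data files.

from collections import Counter

def modal_answer_key(responses):
    """Majority-choice baseline: for each item, take the most frequent
    transcription across respondents as the inferred answer.
    `responses` maps an item label to the list of transcriptions given
    by the respondents (a hypothetical layout for illustration)."""
    key = {}
    for item, answers in responses.items():
        key[item] = Counter(answers).most_common(1)[0][0]  # ties broken arbitrarily
    return key

# An item split between perceptually confusable options, and an uncontested one
example = {"H26": ["hl", "hl", "hi", "hl", "hi"], "H27": ["cheng", "cheng", "cheng"]}
print(modal_answer_key(example))  # {'H26': 'hl', 'H27': 'cheng'}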
Beyond the answer key itself, these model-based indices can also quantitatively reveal features of the online participant pool that may otherwise have been unknown. Recent studies find that workers on crowdsourcing platforms form networks and have ties of communication rather than being independent participants (Gray et al., 2016; Yin et al., 2016). For example, CCT competence measures on a character transcription task may indicate that the "crowd" comprises more than one language group, and accounting for such sources of bias in the population will yield more accurate solutions. Recent extensions of CCT include a model allowing for the possibility of more than one consensus pattern in informant responses (Anders & Batchelder, 2012), a modeling approach that accounts for the uncertainty often involved in informants' decision making (Oravecz et al., 2014a), provisions for continuous-type responses such as probability judgments (Anders et al., 2014), and provisions for ordinal-type response data (Anders & Batchelder, 2015). These developments have been promoted by the implementation of Bayesian inference frameworks in the models, an approach which provides several advantages when compared to traditional CCT models (Oravecz et al., 2014b).

There have been some attempts at using CCT to model multiple choice (Romney et al., 1987; Sayim et al., 2005; Borgatti & Halgin, 2011) and free response type data (Weller, 2007) – which are both common formats used in crowdsourcing. However, improvements to CCT that incorporate Bayesian statistical inference, which are the state of the art, have not yet been explored for these data types. In this paper we develop a novel Bayesian CCT (BCCT) model that can be applied to multiple choice or free response data. The aim is to provide a novel analysis procedure for aggregating crowdsourced data with principled, theory-based procedures.

BCCT for dichotomous data

The Bayesian implementation of the General Condorcet Model (GCM) for dichotomous ('true' and 'false') data described by Oravecz et al. (2014b) specifies that each response Yi,k is coded as

$$Y_{i,k} = \begin{cases} 1, & \text{if respondent } i \text{ responds 'true' to item } k \\ 0, & \text{if respondent } i \text{ responds 'false' to item } k, \end{cases}$$

where respondents are indexed by i and items are indexed by k. The culturally correct response Zk for each item is estimated by the model as

$$Z_k = \begin{cases} 1, & \text{if item } k \text{ is true} \\ 0, & \text{if item } k \text{ is false.} \end{cases}$$

Each respondent has an ability, θi, and each item a difficulty, δk, which are combined using a Rasch measurement model to define the competence, Di,k, of each respondent for each item:

$$D_{i,k} = \frac{\theta_i (1 - \delta_k)}{\theta_i (1 - \delta_k) + \delta_k (1 - \theta_i)}.$$
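As a concrete illustration of this Rasch combination (a sketch under our own naming conventions, not code from the paper's Appendix B), the competence Di,k can be computed as follows:

def competence(theta_i, delta_k):
    """Probability D_{i,k} that respondent i knows the culturally correct
    answer to item k, computed from ability theta_i and item difficulty
    delta_k, both assumed to lie in (0, 1)."""
    num = theta_i * (1.0 - delta_k)
    return num / (num + delta_k * (1.0 - theta_i))

# A fairly able respondent on an easy item versus a hard item
print(competence(0.8, 0.2))  # approximately 0.94
print(competence(0.8, 0.7))  # approximately 0.63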

Each Di,k parameter can be interpreted as the probability that respondent i knows the culturally correct answer to item k. The model assumes that each respondent has a bias, gi, towards guessing 'true' when the culturally correct answer is not known. All of the individual and item parameters determine the probability that respondent i will answer 'true' on item k:

$$p(Y_{i,k} = 1) = Z_k D_{i,k} + (1 - D_{i,k})\, g_i.$$

For more details on the Bayesian implementation of the GCM we refer the reader to the original formulation in Oravecz et al. (2014b).

Extending BCCT for multiple choice data

Our CCT model generalizes the Bayesian GCM of Oravecz et al. (2014b) in order to allow for multiple choice response data. Our Bayesian multiple choice CCT model handles data in which each response Yi,k ∈ {1, ..., L} and each correct answer inferred by the model Zk ∈ {1, ..., L}, where L is the number of possible response options. The model generalizes the individual bias parameters gi from a single probability of responding 'true' when the correct answer is unknown to a probability distribution over the L response options. Based on these changes, the multiple choice model specifies the probability that respondent i will respond with option j ∈ {1, ..., L} on item k as:

$$p(Y_{i,k} = j) = \begin{cases} D_{i,k} + (1 - D_{i,k})\, g_{i,j}, & \text{if } Z_k = j \\ (1 - D_{i,k})\, g_{i,j}, & \text{if } Z_k \neq j. \end{cases}$$

Modeling Task Appropriate Bias. We hypothesized that the standard approach CCT uses for modeling individual response bias was not ideal for modeling the judgments required in our transcription task data. We reasoned that, to the extent that individuals did not know the correct answer for transcribing a particular item, their guessing bias would primarily be influenced by response alternatives that were perceptually confusable with the correct response for that item (see Figure 11 for some examples of perceptually confusable stimuli). Therefore, incorrect responses would be correlated across individuals for each item instead of correlated across items for each individual – as is the case with our original formulation of the multiple choice CCT model. To capture this task-specific perceptual bias, we modified the model to account for perceptual confusability by substituting the individual guessing bias parameters gi with guessing bias parameters gk for each item. This new model assumes that each item has its own bias that is shared by all individuals, and the probability that respondent i will respond with option j ∈ {1, ..., L} on item k is

$$p(Y_{i,k} = j) = \begin{cases} D_{i,k} + (1 - D_{i,k})\, g_{k,j}, & \text{if } Z_k = j \\ (1 - D_{i,k})\, g_{k,j}, & \text{if } Z_k \neq j. \end{cases}$$

In the remainder of the paper we compare these two multiple choice BCCT models with different guessing bias configurations and refer to them as the subject-wise BCCT model and the item-wise BCCT model.
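A minimal sketch of the response probabilities under these two configurations may help fix ideas; the array layout and example numbers below are ours, chosen for illustration rather than taken from the archive data.

import numpy as np

def response_probabilities(D_ik, Z_k, g):
    """P(Y_{i,k} = j) over the L options. Under the subject-wise model the
    guessing distribution g is g_i (one per respondent); under the item-wise
    model it is g_k (one per item, shared by all respondents)."""
    p = (1.0 - D_ik) * np.asarray(g, dtype=float)  # guessing mass spread over options
    p[Z_k] += D_ik                                 # knowledge mass placed on the correct option
    return p

# An item whose correct option (index 0, e.g. "hl") is confusable with option 1 ("hi"):
# an item-wise bias concentrated on the confusable pair
print(response_probabilities(D_ik=0.7, Z_k=0, g=[0.45, 0.45, 0.10]))
# -> approximately [0.835, 0.135, 0.03]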

Figure 11. (A) The correct transcriptions for the characters are "Hl" and "Cl". The characters are, however, easily confusable as "Hi" and "Ci" respectively. This bias towards a particular incorrect transcription exists for all participants who do not know the correct answer. (B) The correct transcriptions of the characters "Ph" and "Pn" are confusable with each other.

Analysis of crowdsourced data using the developed BCCT model

One aim of this study was to evaluate whether CCT could be practically used as an efficient, objective, and accurate procedure to aggregate crowdsourced transcription data. To achieve this aim we applied two forms of BCCT to both the laboratory data collected in Experiment 1 and the MTurk data collected in Experiment 2, to derive (1) estimated "correct" transcription answers for the data, (2) estimates of participant "competence" or "expertise" based on respondents' transcription response patterns, and (3) indications of the model "fit", or appropriateness, for the present transcription problem. The value of adopting a CCT approach is judged by comparison to what is provided by simple majority-choice alternative analyses. The item- and subject-wise CCT models assume that participants choose their responses from a fixed set of L response options – where L is finite and predetermined. Although some tasks were free response rather than multiple choice, we were able to apply the models to free response data by computing the set of all unique responses across all participants in the task and providing these options to the model as if it were a finite set from which participants were choosing (a minimal sketch of this recoding step is given below). For Tasks 3 to 7, we implemented a Bayesian inference framework for the subject-wise and item-wise CCT models using JAGS (Plummer, 2003) – a generic Bayesian statistical inference package – in order to estimate model parameters such as the answer key (aggregate transcription values) and the response bias for each individual or item. The code for implementing our models in JAGS is provided in Appendix B. A detailed discussion of implementing CCT in a Bayesian framework can be found in Oravecz et al. (2014b), whose methods were very similar to our own. For the purposes of assessing CCT's utility with crowdsourced responses, below we present analyses of data from Experiments 1 and 2 using the models described above.

Inferred Answer Keys for Transcription Data. As mentioned, several useful indices are obtained as a result of CCT analyses. Figure 12A provides average group "consensus", or mean participant competence values, and Figure 12B provides average item difficulty values, as estimated by the subject-wise and item-wise BCCT models for both the laboratory and MTurk study. The high group consensus values and low item difficulty values shown indicate that BCCT is appropriate as a model of the data. Individual participant competence values allow us to determine whether individuals in the population sampled are uniform with respect to shared knowledge for this task, and if not, the variation in competence estimates arising from the analysis can be used as an index of individual variation in the sample. Figure 12C, with individual participant values ranging from a high of 0.8268 for Participant 6 to a low of 0.4706 for Participant 13, exemplifies how the quantitative estimates of individual competence vary and track the association of individuals' response patterns with the model's estimated answer key for the task.
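The recoding of free responses into a finite option set, referred to above, can be sketched as follows; the input layout (respondent ids mapped to item transcriptions) is hypothetical and chosen only for illustration.

def encode_free_responses(raw):
    """Recode free-response transcriptions as multiple choice data: the pooled
    set of unique strings observed in the task becomes the finite option set
    over which the BCCT model operates."""
    options = sorted({ans for per_item in raw.values() for ans in per_item.values()})
    index = {ans: j for j, ans in enumerate(options)}
    coded = {resp: {item: index[ans] for item, ans in per_item.items()}
             for resp, per_item in raw.items()}
    return options, coded

raw = {"p1": {"H26": "hl", "H27": "cheng"},
       "p2": {"H26": "hi", "H27": "cheng"}}
options, coded = encode_free_responses(raw)
print(options)  # ['cheng', 'hi', 'hl']
print(coded)    # {'p1': {'H26': 2, 'H27': 0}, 'p2': {'H26': 1, 'H27': 0}}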
Similarly to the competence estimates, the observed variation in the quantitative measures of estimated item difficulty can be examined as a check of whether the model is tracking features of the data. For example, in Figure 12D, we see that Item 2, "Hl", was estimated to have a much higher difficulty of 0.6555 in comparison to the other items. This is consistent with our prediction that the perceptual confusability of certain characters poses a challenge in transcription. The simplest test of the model's success in data aggregation, however, is the accuracy of the resulting answer key; the results below concentrate on this measure, which is further analyzed in the section below on Assessing Model Performance.

Inferred Answer Keys for Experiment 1 (Laboratory Data). The subject-wise BCCT model, the item-wise BCCT model, and the modal majority-choice response for each item all resulted in nearly identical answer keys for Tasks 5, 6, and 7 (figure not shown). On Task 4, the answer keys obtained by the three methods differed on the several items for which variation was observed in the participant response distributions in Figure 5.

Taking the mode of the human judgments on these items provided inconsistent predictions: the mode was "hl" on some items and "hi" on others (see Figure 13). Both CCT models were able to correctly infer that all of these items should have the same value and should not fluctuate between the two confusable options (either "hl" or "hi"), but they disagreed on which transcription was correct. The subject-wise model inferred incorrectly that the true transcription of the items was "hi", whereas the item-wise model inferred correctly that the true transcription of the items was "hl". Although the subject-wise model did not derive the correct answer for all items, the pattern of the model's transcription predictions from the posterior predictive analysis better captures the general task characteristic of identifying a single, most suitable, correct transcription answer for a handwritten string in the context of other perceptually confusable transcription options. This additional predictive capability of the CCT model to correctly differentiate among highly similar response options further demonstrates the suitability of CCT over modal aggregation methods for transcription data.

Figure 12. Quantitative CCT indices estimated by the subject-wise and item-wise BCCT models for both the laboratory and MTurk study. (A) Estimated mean competence or group “consensus” values. (B) Estimated mean item difficulty values. (C) Competence values of the first 16 participants in Task 7 of the laboratory study, estimated by the item-wise bias model. (D) Item difficulty values of the 9 items in Task 7. Inferred Answer Keys for Experiment 2 (MTurk Data). The subject-wise BCCT model, item-wise BCCT model and the modal response for each item all resulted in identical answer keys for Grids 1, 2 and 3; with the exception that the modal answer differed from the BCCT models answer on a single item in Grid 1, for which variation was observed in the participant response distributions in Figure 9. The modal aggregation method inferred correctly that the true transcription of the item was “namwu-”, whereas both the subject-wise and item-wise model inferred incorrectly that the true transcription of the item was “narnwu-”. The answer key for Grid 1 is shown in Figure 14. Although in this case the CCT aggregation models did not derive the correct answer for all items, we believe the method to be more reliable in general as a result of being able to account for patterns in the data discussed above in relation to the inferred answer keys for Experiment 1, and point out another demonstration of this advantage below. Assessing Model Performance. To assess the performance of the subject-wise and item-wise BCCT models, we compared the accuracy of the estimated answer keys produced by each model on each task against the true answers obtained from an expert transcription. We also sought to assess the potential for these procedures and results to be applicable beyond this particular investigation. Comparison with Expert Transcriptions. For the purpose of assessing the accuracy of the estimated answers, we compared the transcriptions objectively derived through CCT to the known answers. Independent of the data collected through crowdsourcing, a human expert manually transcribed all of the items in each of our tasks. These expert transcriptions provided a standard against which to compare the solutions inferred using the CCT models (see Table 1). There was general agreement between solutions inferred by both models and those manually transcribed by the expert. The only significant exception occurred in the data from Experiment 1 in which several items were estimated as ‘hi’ by the subject-wise model on tasks 4 and 5 and the item-wise model on task 5; the expert transcribed these items as ‘hl’. It is possible for these models to obtain the correct answer despite being a less than perfect model of the task – especially given the small amount of data on which we tested them and the lack of variance in this data. Although the item-wise model has a slight edge in terms of proportion of items correctly transcribed (in Tasks 4 and 6), more evidence is required in order to assess how well the models can be expected to generalize to the rest of the dataset. Our goal in the next section is to assess how well each of the models will generalize to a much larger dataset for which we do not have expert transcriptions. 20


Figure 13. Inferred answer key for Task 4 using three methods: (1) the modal response (indicated by X), (2) the subject-wise CCT model (indicated by O), and (3) the item-wise CCT model (indicated by Δ). The abscissa shows the exhaustive set of responses observed for each of the 128 items shown on the ordinate.

Descriptive Adequacy of the Models. To establish a practical transcription method that will generalize to large amounts of data for which the ground truth is unavailable, we looked beyond the transcription solutions inferred by the models – which provided only marginal evidence in favor of the item-wise model – and examined their ability to account for patterns in the raw data, which would suggest that they were accurately modeling the transcription task and could reasonably be expected to generalize to the rest of the handwritten data of the archive. Figure 15 contains model predictions for a number of representative items from Laboratory Tasks 4 and 7 that clearly show that the item-wise BCCT model provides a better account of the pattern of human responses than does the subject-wise BCCT model. One of the advantages of Bayesian analysis is that model goodness of fit can be assessed using posterior predictive analysis, which provides a richer understanding of how a model does or does not fit the data than does a single point statistic. We emphasize the posterior predictive analysis approach in this paper rather than relying on a single, arguably less informative, goodness-of-fit statistic as is typical with frequentist approaches. The model predictions were obtained by inferring posterior distributions for model parameters from the human data and then running the model in the opposite direction – with model parameters set to point estimates of the inferred posteriors – in order to generate a distribution of predicted responses (a minimal sketch of this generation step follows this paragraph). Although Figure 15's analysis may seem "qualitative", in the present case the graphical posterior predictive comparison of the item- versus subject-wise models provides explicit, task-relevant information concerning how each approach serves as a model of the data. For the present task, this kind of result is important as it provides an informative confirmation of model appropriateness, and it is a novel form of support beyond other available indices of CCT model appropriateness (see the discussion of Figure 12) that have been detailed in the literature (see Batchelder & Romney, 1988; Oravecz et al., 2014b). Both models accurately predict that a large number of respondents know the true answer and respond accordingly. However, the models diverge in their predictions of how respondents who do not know the correct answer will respond. The item-wise model correctly predicts that whenever an individual does not know the answer to an item, they guess that the answer is an option that is perceptually similar to the correct answer on the particular item, and this will vary between items. For example, in the graph for Item 16, "kwun cheng", in Row 2 of Figure 15, the item-wise model appropriately predicts that individuals' guessing bias will be on the options "kwan cheng" and "kwun chen", which are easily confusable with the correct answer. The subject-wise model predicts that when individuals do not know the correct answer they will guess in a consistent way that is independent of the properties of the particular item. For example, in all three graphs in Row 1 of Figure 15, we see that the subject-wise model predicts that individuals' guessing bias will be on the option "hl". Although this is to be expected for Item 4, where the correct answer is perceptually similar, the pattern of bias is not appropriate for the other, dissimilar items. Figure 15 also shows other examples of the predictions of the item-wise and subject-wise models for Task 7, and the results are similar for the other tasks. We do not provide predictions based on the modal response because it is straightforward to see that it will predict that all individuals will respond with the mode – which clearly does not model the observed pattern of human responses.
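The generation step described above can be sketched as follows for the item-wise model. The posterior point estimates are assumed to come from a sampler such as the JAGS implementations in Appendix B; here they are represented simply as arrays, and the function name and layout are ours rather than part of the published analysis code.

import numpy as np

def posterior_predictive(theta_hat, delta_hat, Z_hat, g_hat, n_reps=1000, seed=0):
    """Simulate replicated responses from the item-wise BCCT model with
    parameters fixed at posterior point estimates, returning the predicted
    proportion of each of the L options for every item (to be compared
    against the observed response proportions, as in Figures 15 and 16).
    theta_hat: (n_resp,) respondent abilities; delta_hat: (n_items,) item
    difficulties; Z_hat: (n_items,) answer indices; g_hat: (n_items, L)
    per-item guessing distributions (each row sums to one)."""
    rng = np.random.default_rng(seed)
    n_resp, (n_items, L) = len(theta_hat), g_hat.shape
    counts = np.zeros((n_items, L))
    for _ in range(n_reps):
        for i in range(n_resp):
            # Rasch competence of respondent i on every item
            D = theta_hat[i] * (1 - delta_hat) / (
                theta_hat[i] * (1 - delta_hat) + delta_hat * (1 - theta_hat[i]))
            for k in range(n_items):
                p = (1 - D[k]) * g_hat[k]   # guessing mass for item k
                p[Z_hat[k]] += D[k]         # knowledge mass on the inferred answer
                counts[k, rng.choice(L, p=p)] += 1
    return counts / (n_reps * n_resp)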

Figure 14. Inferred answer key for Grid 1 using three methods: (1) The modal response (indicated by X), (2) the subject-wise CCT model (indicated by O), and (3) the item-wise CCT model (indicated by Δ). The abscissa shows the exhaustive set of responses observed for each of the 29 items shown on the ordinate. The model predictions for the MTurk study also suggest that the item-wise BCCT model provides a better account of the pattern of human responses than does the subject-wise BCCT model (see Figure 16 for representative item from Grid 1 of the MTurk study). Both the item-wise and subject-wise models provide similar predictive response patterns in the MTurk study as they did in the Laboratory study. Discussion For tasks with very little variance on all items, all of the aggregation models investigated – modal response, subject-wise BCCT or item-wise BCCT – led to virtually the same transcription solutions from the data. For tasks in which there was disagreement between participants on one or more items, the CCT models resulted in better transcriptions compared to the modal response model. The comparison of model predictions to human data (see Figures 15 and 16) supports our conjecture that guessing bias in the CCT model is influenced by features relating to the perceptual confusability of handwritten items – a property of the task that is directly accommodated by the item-wise CCT model developed. Because our overall goal is the development of procedures that will scale up to transcribe the entire ColCat corpus – most of which cannot be verified by expert transcriptions – it appears that of the three methods we explored, the item-wise CCT model will provide the best general solution for aggregating crowdsourced transcriptions. Our reasoning is that even though the corpus consists of entries based on a wide 22

P. S. Deshpande et al. / International Journal of Internet Science 11 (1), 8–32 variety of different languages, all of the data is fundamentally the same with respect to the features that are relevant to the model: each item is a short, handwritten string of characters from a finite set of possible strings for the given task, and some of the character strings are perceptually confusable with one another. Table 1 Proportion of Generated CCT Transcription Answers Agreeing With True Transcription Answers

a The rows specifying a Task “with Expert” refer to the analyses described in the Section on Utility of Expert Transcriptions, where the data of an expert was added to the group crowdsourcing data.
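Table 1's entries are proportions of generated transcription answers that agree with the verified true answers. A minimal sketch of that agreement measure follows; the helper name and the item labels are ours and purely illustrative.

    def key_agreement(estimated_key, true_key):
        # Proportion of items for which an inferred answer key matches the
        # verified transcription -- the quantity summarized in Table 1.
        assert len(estimated_key) == len(true_key)
        matches = sum(e == t for e, t in zip(estimated_key, true_key))
        return matches / len(true_key)

    # Hypothetical illustration: a 29-item grid with one item estimated incorrectly.
    estimated = ["option_a"] * 28 + ["option_b"]
    truth     = ["option_a"] * 29
    print(key_agreement(estimated, truth))   # 0.9655...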

Although the subject-wise CCT model will likely provide a solution that is equivalent to the item-wise model for a significant portion of the data, the subject-wise model may result in errors when the data involve perceptually confusable items. Furthermore, we do not believe that there are any subsets of the corpus in which the subject-wise model will provide a benefit over the item-wise model, but we do expect the opposite – as seen in our analysis of Task 4.

The only advantages of using the modal responses over CCT are processing time and computing resources. Beyond that, the best case when using the mode is no better than either of the CCT models, and the worst case is a completely inconsistent transcription, as we saw in Task 4. Taking the mode might be an efficient method in situations where we know ahead of time that there are no ambiguous stimuli for a subset of data. This information could potentially be derived from the distribution of human responses: if there is very little disagreement on all items for a given task, then it is probably safe to say that the mode will provide a good solution. Determining a threshold for variance, beyond which CCT is applied, is outside the scope of this paper but should be considered in the design of a fully automated system for transcribing the corpus via crowdsourcing.

Alternative CCT Models. Most applications of CCT incorporating response bias place the bias on individual subjects rather than on items. We show that for our tasks, including a single response bias for each item, shared by all individuals, made sense theoretically and was supported by the fact that the item-wise model provided a better account of people's responses. We are not arguing that all CCT models would be better off using item-wise instead of subject-wise bias – the preferred configuration depends upon the nature of the task and data – but rather demonstrating that extending the use of CCT outside of its traditional domain may require modifications to the model. There may still be room for improvement of our model by incorporating both item- and subject-wise bias. However, this would require a more sophisticated model than one that includes both biases but assumes they are independent of one another: we tested that model and it performed no better than the item-wise model, while having the disadvantage of greater model complexity.

Utility of Expert Transcriptions. In another set of analyses we explored the possibility of including the transcription responses of a known expert with the crowdsourced group data. While the transcription of handwritten characters is fundamentally subjective, and an “expert” transcription is not necessarily equal to the ground truth, we reasoned that for the kind of perceptually confusable response options in the present study, there may be value in including the responses of a known expert in a surveyed group's CCT analysis. If our CCT model can objectively identify the most informed participants within a sample, then CCT analyses should also be able to confirm the known expert as a highly competent respondent, which would generally lend confidence to the answer keys that are derived, regardless of whether an expert's responses were available or not.
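In practice, including the expert amounts to treating their transcription as one more respondent in the response matrix that the CCT analysis consumes, and then checking whether the model assigns them high competence. A minimal sketch follows; the function name and the integer response coding are ours, for illustration only.

    import numpy as np

    def add_known_expert(crowd_matrix, expert_responses):
        # Append a known expert's responses (one response index per item) as an
        # additional row of the N x M crowd response matrix before fitting the
        # CCT model; return the augmented matrix and the expert's row index so
        # that their inferred competence can be inspected after fitting.
        crowd = np.asarray(crowd_matrix)
        expert = np.asarray(expert_responses).reshape(1, -1)
        augmented = np.vstack([crowd, expert])
        expert_row = augmented.shape[0] - 1
        return augmented, expert_row

    # Example: 3 crowd respondents, 4 items, responses coded as option indices.
    crowd = [[0, 1, 2, 0],
             [0, 1, 1, 0],
             [0, 2, 2, 0]]
    augmented, expert_row = add_known_expert(crowd, [0, 1, 2, 0])
    print(augmented.shape, expert_row)   # (4, 4) 3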



Figure 15. Human data and model predictions based on the subject-wise CCT model (rows 1 and 3) and the item-wise CCT model (rows 2 and 4).

Note that because CCT uses the latent information reflected in the correlations between participants' responses, the inclusion of a known expert's responses might yield alternative outcomes. For example, if most of the high-competence participants within a group agree on a particular question's response, but that response differs from the one given by the expert, then the model might produce an estimated answer that disagrees with the expert's. On the surface such disagreement might seem to suggest that the CCT model is wrong; this, however, is not necessarily the case. CCT computes estimated answers for each question based on group-dependent, test-dependent response patterns and the degree to which they are shared across participants, even in cases where the most popular choice may not coincide with the actual truth. For example, if a “Flat-Earth” belief were widely shared by a particular surveyed group, then any questions pertaining to the planet's shape would most likely yield large numbers of responses characterizing the planet as flat or rectangular rather than spheroidal. In this example, regardless of the input of one expert who might know the Earth is actually round, enough group consensus around the falsely held Flat-Earth belief could result in the model providing an estimated “correct” answer that reproduces the belief system's shared misconception, despite the fact that this CCT-estimated “correct” answer would not reflect reality. In this case the correct response of a known expert could not outweigh the overwhelming consensus of a misinformed majority, and in such cases assessing groups with greater numbers of participants might be needed.

Figure 16. Human data and model predictions for Item 21 of Grid 1, based on (A) the subject-wise CCT model, and (B) the item-wise CCT model.

A second possible outcome that might arise from including an expert's responses is that the known expert's response patterns may improve estimation of correct answers for items disagreed on within the group. This is especially likely if, as in the present case, the “expert” was known to have additional information not available to crowdsourcing participants. Using again the Flat-Earth example, if nearly half of a surveyed group's participants were Flat-Earthers and the other half believed the Earth was truly round, then CCT estimation of answers for such items would necessarily be helped by the overall inter-subject correlations with a known expert's response patterns for those same items. In the present case, because the known expert possesses specific knowledge about the characters or words used in a particular language, we can have confidence in the expert transcriptions and employ the expert responses to help further evaluate a group's shared consensus response and refine estimation of a “correct” answer key. Table 1's values present quantitative measures of transcription performance that incorporate a known expert's responses. The results show that in most cases, there was greater agreement of the CCT-estimated transcription answers with the known true transcription answers when the expert's responses were included in the analyses. This is seen for the subject-wise model in Task 4, Task 6, Grid 1, and Grid 3, and for the item-wise model in Grid 1 and Grid 3. The potential advantage of using the item-wise CCT model along with an expert's responses is best understood with the example of Grid 1. As described in the Empirical Results section of Experiment 2, participants' responses for the incorrectly estimated item “namwu-” were split between two alternatives, and the incorporation of the expert's transcription responses resulted in a correct estimation. A modal estimation method would yield unreliable answers in such cases.

CCT and the Advantage of Smaller Samples. The quality of the data collected in Experiments 1 and 2 was very similar despite a substantial difference in sample size: the analyses in Experiment 1 consisted of responses from 30 participants, whereas the analyses in Experiment 2 consisted of responses from as few as 9 participants. To investigate further whether CCT would perform satisfactorily when aggregating data from smaller numbers of participants, we ran CCT analyses on subsets of the 30 individuals providing responses in Task 4 of Experiment 1, for which participant responses were divided between two confusable options for some items. The subsets were constructed by randomly drawing samples of size 8, each comprising 4 individuals randomly drawn from the pool of participants who responded with one of the options and 4 individuals who responded with the other option. Five such subset samples were drawn, one after another, with replacement. The CCT results of the subset analyses were comparable to the results obtained with the complete participant group on measures of estimated answer key accuracy, mean participant competence, and mean item difficulty.
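The balanced subsampling just described can be sketched in a few lines; this is only an illustration, the function name and participant identifiers are ours, the pool sizes are hypothetical, and we read “with replacement” as allowing participants to recur across, but not within, subsets.

    import random

    def draw_balanced_subsets(group_a, group_b, n_subsets=5, per_group=4, seed=0):
        # Build balanced subsets for re-running the CCT analysis on less data:
        # each subset contains `per_group` participants sampled from each of the
        # two response groups (here, the two confusable options). Subsets are
        # drawn independently, so a participant may appear in several subsets.
        rng = random.Random(seed)
        return [rng.sample(group_a, per_group) + rng.sample(group_b, per_group)
                for _ in range(n_subsets)]

    # Example with hypothetical participant identifiers:
    option_a_pool = ["P%02d" % i for i in range(1, 16)]    # responded with option A
    option_b_pool = ["P%02d" % i for i in range(16, 31)]   # responded with option B
    for subset in draw_balanced_subsets(option_a_pool, option_b_pool):
        print(subset)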
Our results suggest that while crowdsourcing methods generally allow for the collection of large quantities of data, using an intelligent model of data aggregation – such as CCT – provides reliable results from less data, which reduces both costs and data collection time without sacrificing the quality of the crowdsourced product. Moreover, CCT provides principled, theory-based procedures for modeling features of tasks and participant groups that otherwise remain inaccessible through standard majority-choice aggregation approaches. For these reasons we believe the CCT approach presented here is an important advance for crowdsourced data handling, as it provides robust answer-key derivation using smarter data processing on smaller crowdsourced datasets.

Conclusions

We developed internet-based crowdsourcing procedures and novel methods for aggregating crowdsourced data using Cultural Consensus Theory (CCT), and tested these procedures on a small subset of the ColCat archive – a large corpus of handwritten cognitive anthropology survey data. Our findings lead us to a practical method for converting the estimated 23,000 pages of handwritten data in the ColCat archive to a computer addressable format. Experiment 1, conducted in our laboratory, led to insights about the task design that allowed us to successfully implement our crowdsourcing procedures in Experiment 2 on Amazon Mechanical Turk (MTurk). This was a crucial step towards a practical method that can “scale up” to the transcription of the ColCat archive at large. Although the present methods were developed specifically for transcribing a particular dataset, they have clear potential to apply more generally.

The success of our CCT models in aggregating the crowdsourced data leads to a number of interesting conclusions. First, we demonstrate that CCT can be extended to model free response data within a Bayesian statistical framework. The models developed here can be used with a wide variety of multiple-choice or free response data, and the application of CCT models that were previously developed for other types of data, such as ordinal or continuous data (Anders & Batchelder, 2012, 2015; Anders et al., 2014; Oravecz et al., 2014a), can also be explored in the future for the aggregation of crowdsourced data.

Second, we show how CCT can use item-wise response bias – as opposed to subject-wise – in order to model variance in the data that arises from perceptual features of items. Perceptual-type tasks are common in citizen science crowdsourcing projects such as Agent Exoplanet (http://lcogt.net/agentexoplanet/), where participants classify and characterize telescope images, and Age Guess (http://www.ageguess.org/), where participants guess the age of a person based on a photo to help researchers investigate potential aging biomarkers. Our novel modeling extension allows for the future exploration of CCT applications beyond the traditional general knowledge domains to perceptual judgment tasks.

Third, we find that because of the way that CCT leverages the implicit information contained in the correlation of responses across items and individuals, it may require much less crowdsourcing data to arrive at reliable aggregate transcriptions – thus reducing the time and expense required for collecting crowdsourced data. Our analyses on this finding have been preliminary, but are worth exploring further given the advantages involved.

Finally, it is worth mentioning again that crowdsourced data is fundamentally behavioral data generated by humans. Therefore, when we wish to aggregate crowdsourced data, it is important whenever possible to use models that can account for psychological or perceptual properties of the processes – including features of human decision making – that are inherent in participants' response data.

Acknowledgements

The authors would like to thank ColCat team members Nathan Benjamin, Yang Jiao, Ian Harris, and Dr. G.P.
Li, Director of the California Institute for Telecommunications and Information Technology (Calit2), UCI. Support for the archive project was provided by: The University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI) and The National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). This research was conducted under approved UCI IRB protocols: HS#2013-9921, HS#2015-1976, and HS#2015-9606.

References

Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002). SETI@Home: An experiment in public-resource computing. Communications of the ACM, 45(11), 56–61. doi: 10.1145/581571.581573
Anders, R., & Batchelder, W. H. (2012). Cultural consensus theory for multiple consensus truths. Journal of Mathematical Psychology, 56(6), 452–469. doi: 10.1016/j.jmp.2013.01.004
Anders, R., & Batchelder, W. H. (2015). Cultural consensus theory for the ordinal data case. Psychometrika, 80(1), 151–181. doi: 10.1007/s11336-013-9382-9
Anders, R., Oravecz, Z., & Batchelder, W. H. (2014). Cultural consensus theory for continuous responses: A latent appraisal model for information pooling. Journal of Mathematical Psychology, 61, 1–13. doi: 10.1016/j.jmp.2014.06.001

Batchelder, W. H., & Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53(1), 71–92. doi: 10.1007/BF02294195
Batchelder, W. H., & Anders, R. (2012). Cultural consensus theory: Comparing different concepts of cultural truth. Journal of Mathematical Psychology, 56(5), 316–332.
Bevan, A., Daniel, P., Chiara, B., Keinan-Schoonbaert, A., González, D. L., Sparks, R., … Wilkin, N. (2014). Citizen archaeologists. Online collaborative research about the human past. Human Computation, 1(2). doi: 10.15346/hc.v1i2.9
Borgatti, S. P., & Halgin, D. S. (2011). Consensus analysis. A Companion to Cognitive Anthropology, 171–190.
Brabham, D. C. (2008). Crowdsourcing as a model for problem solving: An introduction and cases. Convergence: The International Journal of Research into New Media Technologies, 14(1), 75–90. doi: 10.1177/1354856507084420
Burgess, D., Kempton, W., & MacLaury, R. E. (1983). Tarahumara color modifiers: Category structure presaging evolutionary change. American Ethnologist, 133–149.
Causer, T., & Wallace, V. (2012). Building a volunteer community: Results and findings from Transcribe Bentham. Digital Humanities Quarterly, 6.
Clifford, S., & Jerit, J. (2014). Is there a cost to convenience? An experimental comparison of data quality in laboratory and online studies. Journal of Experimental Political Science, 1(2), 120–131.
Coleman, E. A., Ishikawa, S. T., & Gulick, V. C. (2014). Clickworkers Interactive: Progress on a JPEG2000-streaming annotation interface. In Lunar and Planetary Science Conference (Vol. 45, p. 2593).
Crump, M. J., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410.
Deshpande, P. S., & Chang, S. M. (2015). A cultural consensus theory analysis of crowdsourced transcription data. Poster presentation at the Undergraduate Research Opportunities Program Symposium, Irvine, CA.
Estellés-Arolas, E., & González-Ladrón-de-Guevara, F. (2012). Towards an integrated crowdsourcing definition. Journal of Information Science, 38(2), 189–200. doi: 10.1177/0165551512437638
Gray, M. L., Suri, S., Ali, S. S., & Kulkarni, D. (2016, February). The crowd is a collaborative network. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (pp. 134–147). ACM.
Hauser, D. J., & Schwarz, N. (2015). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 1–8. doi: 10.3758/s13428-015-0578-z
Howe, J. (2006). Crowdsourcing: A definition. Crowdsourcing: Tracking the rise of the amateur.
Jameson, K. A., & Romney, A. K. (1990). Consensus on semiotic models of alphabetic systems. Journal of Quantitative Anthropology, 2(4), 289–304.
Jameson, K. A., Benjamin, N. A., Chang, S. M., Deshpande, P. S., Gago, S., Harris, I. G., Jiao, Y., & Tauber, S. (2015). Mesoamerican Color Survey Digital Archive. In Encyclopedia of Color Science and Technology (Ronnier Luo, Ed.). Springer Berlin Heidelberg. ISBN: 978-3-642-27851-8 (Online). doi: 10.1007/978-3-642-27851-8
Jameson, K. A., Deshpande, P. S., Tauber, S., Chang, S. M., & Gago, S. (2016). Using individual differences to better determine normative responses from crowdsourced transcription tasks: An application to the R. E. MacLaury Color Categorization Archive. Electronic Imaging, 2016(16), 1–9.
Kay, P., Berlin, B., Maffi, L., Merrifield, W. R., & Cook, R. (2009). The world color survey. Stanford, CA: CSLI Publications.


Keegan, T., Gilchrist, M., & Soderdahl, P. (2015). DIY History: Building digital connections between collections and courses. 2015 Digital Initiatives Symposium. Retrieved from http://ir.uiowa.edu/lib_pubs/171
Khatib, F., Cooper, S., Tyka, M. D., Xu, K., Makedon, I., Popović, Z., … Players, F. (2011). Algorithm discovery by protein folding game players. Proceedings of the National Academy of Sciences, 108(47), 18949–18953. doi: 10.1073/pnas.1115898108
Lee, J., Kladwang, W., Lee, M., Cantu, D., Azizyan, M., Kim, H., … Participants, E. (2014). RNA design rules from a massive open laboratory. Proceedings of the National Academy of Sciences, 111(6), 2122–2127. doi: 10.1073/pnas.1313039111
Liu, X., Keller, J., & Hong, Y. (2014). Hiring of personal ties: A cultural consensus analysis of China and the United States. Management and Organization Review. doi: 10.1111/more.12055
MacLaury, R. E., Almási, J., & Kövecses, Z. (1997). Hungarian piros and vörös: Color from points of view. Semiotica, 114(1–2), 67–82.
MacLaury, R. E., Hewes, G. W., Kinnear, P. R., Deregowski, J. B., Merrifield, W. R., Saunders, B. A. C., … & Wescott, R. W. (1992). From brightness to hue: An explanatory model of color-category evolution [and comments and reply]. Current Anthropology, 33(2), 137–186.
MacLaury, R. E. (1997). Color and cognition in Mesoamerica: Constructing categories as vantages. University of Texas Press.
Malone, T. W., Laubacher, R., & Dellarocas, C. (2009). Harnessing crowds: Mapping the genome of collective intelligence (SSRN Scholarly Paper No. ID 1381502). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=1381502
Miller, M. L., Kaneko, J., Bartram, P., Marks, J., & Brewer, D. D. (2004). Cultural consensus analysis and environmental anthropology: Yellowfin tuna fishery management in Hawaii. Cross-Cultural Research, 38(3), 289–314. doi: 10.1177/1069397104264278
Moyle, M., Tonra, J., & Wallace, V. (2011). Manuscript transcription by crowdsourcing: Transcribe Bentham. LIBER Quarterly, 20(3/4).
Oravecz, Z., Faust, K., & Batchelder, W. H. (2014a). An extended cultural consensus theory model to account for cognitive processes in decision making in social surveys. Sociological Methodology. doi: 10.1177/0081175014529767
Oravecz, Z., Vandekerckhove, J., & Batchelder, W. H. (2014b). Bayesian cultural consensus theory. Field Methods. doi: 10.1177/1525822X13520280
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (Vol. 124, p. 125).
Romney, A. K., Batchelder, W. H., & Weller, S. C. (1987). Recent applications of cultural consensus theory. American Behavioral Scientist, 31(2), 163–177. doi: 10.1177/000276487031002003
Sayim, B., Jameson, K. A., Alvarado, N., & Szeszel, M. K. (2005). Semantic and perceptual representations of color: Evidence of a shared color-naming function. Journal of Cognition and Culture, 5(3), 427–486.
Smith, C. S., Morris, M., Hill, W., Francovich, C., McMullin, J., Chavez, L., & Rhoads, C. (2004). Cultural consensus analysis as a tool for clinic improvements. Journal of General Internal Medicine, 19(5p2), 514–518.
Steyvers, M., & Miller, B. (2015). Cognition and collective intelligence. Handbook of Collective Intelligence, 119.
Surowiecki, J. (2005). The wisdom of crowds. Anchor.

Townsend, J. T. (1971a). Alphabetic confusion: A test of models for individuals. Perception & Psychophysics, 9(6), 449–454.
Townsend, J. T. (1971b). Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9(1), 40–50.
Townsend, J. T., Hu, G. G., & Evans, R. J. (1984). Modeling feature perception in brief displays with evidence for positive interdependencies. Perception & Psychophysics, 36(1), 35–49.
Tyson, R. E. (1997). Evidence for semantic influence of English loanwords on Korean color naming. Harvard Studies in Korean Linguistics, 7, 519–527.
Tyson, R. E. (1998). Color naming and color categorization in Korean. Japanese/Korean Linguistics, 7, 177–196.
Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895), 1465–1468.
Weller, S. C. (1984). Consistency and consensus among informants: Disease concepts in a rural Mexican village. American Anthropologist, 86(4), 966–975.
Weller, S. C. (2007). Cultural consensus theory: Applications and frequently asked questions. Field Methods, 19(4), 339–368. doi: 10.1177/1525822X07303502
Yin, M., Gray, M. L., Suri, S., & Vaughan, J. W. (2016, April). The communication network within the crowd. In Proceedings of the 25th International Conference on World Wide Web (pp. 1293–1303). International World Wide Web Conferences Steering Committee.


Appendix A

The Robert E. MacLaury Color Categorization Survey consists of data on the following three tasks, completed by every informant (see Jameson et al., 2015, for more details).

Naming Task. Informants were asked to name 330 loose color chips. Figures 17A and 17B show two common data sheet formats used for recording naming data.

Figure 17. Naming task data sheet formats shown as image scans of the raw data. (A) Portion of the naming data of a Cree informant. (B) Portion of the naming data of a Mixtec informant.

Focus Selection Task. Informants were asked to select the best example, or “focus”, of each different basic color name elicited in the naming task, on a fixed array of the same chips. Figure 17B shows foci data marked with an encircled “X”, and Figure 18A shows foci data marked with an “X”, in certain cells of the grids.

Category Mapping Task. For the same basic color terms as in the focus selection task, informants were asked to place a grain of rice on every color of the fixed array that could be named with the basic color term. Figure 18A shows the boundaries of five color categories, each recorded with a different color pencil or line shape, and Figure 18B shows the boundaries of five color categories listed with coordinates corresponding to cells of the common data template grid (as in Figure 18A).

Figure 18. Focus Selection and mapping task data sheet formats shown as image scans of the raw data. (A) Portion of the focus selection and mapping data of a Hupa informant. (B) Portion of the mapping data of a Mixtec informant.
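For readers thinking ahead to the digitized product, the three tasks above suggest a simple per-informant record structure. The sketch below is only one possibility; neither the archive nor this article prescribes a schema, and every field name and the row-letter/column coordinate convention shown here are our own illustrative assumptions.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    # A grid cell of the fixed color array, e.g., ("F", 17): row letter, column number
    # (an assumed convention, shown only for illustration).
    GridCell = Tuple[str, int]

    @dataclass
    class InformantRecord:
        # One possible target structure for a digitized informant record.
        informant_id: str
        language: str
        naming: Dict[int, str] = field(default_factory=dict)              # chip number -> elicited color term
        foci: Dict[str, List[GridCell]] = field(default_factory=dict)     # basic term -> focus cell(s)
        mapping: Dict[str, List[GridCell]] = field(default_factory=dict)  # basic term -> mapped cells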


Appendix B

JAGS Code for the Subject-wise CCT Model

# N subjects, M items, L response options.
var alpha[L], beta[L], b[N,L]
model{
  for (i in 1:N){
    for (k in 1:M){
      for (j in 1:L){
        # Probability of individual i responding with option j on item k.
        pY[i,k,j]
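        # (The listing is truncated at this point in the present excerpt. Under the
        #  standard single-truth CCT response rule -- see Batchelder & Romney, 1988;
        #  Oravecz et al., 2014b -- this line would typically continue as
        #      pY[i,k,j] <- D[i]*equals(Z[k], j) + (1 - D[i])*b[i,j]
        #  with observed responses Y[i,k] ~ dcat(pY[i,k,1:L]), a latent answer key
        #  Z[k], respondent competences D[i], and subject-wise guessing bias
        #  b[i,1:L] ~ ddirch(alpha[1:L]). This is offered only as a sketch of the
        #  model family, not as the authors' verbatim code.)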