Disambiguation Solution for Persons' Accounts in Research ...

Indian Journal of Science and Technology, Vol 9(43), DOI :10.17485/ijst/2016/v9i43/101683, November 2016

ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645

Disambiguation Solution for Persons’ Accounts in Research Information Management Systems

Maxim S. Sedelnikov, Roman N. Gordeev, Anastasia V. Kuzmicheva and Artem G. Odulov

Integrirovannye Sistemy, LLC, 19b1, Str. Presnensky val, Moscow, 123557, Russian Federation

Abstract

Objectives: Personal identification in information systems, a trivial task in software engineering when a passport number or email address is available, becomes a challenge when users’ data cannot be stored or an email address has been changed or shared. In particular, it is a challenge in research management systems that aggregate recent academic articles, patents, data sets, and other findings and link them to their authors. Methods/Analysis: To achieve the project objectives, the authors reviewed existing solutions, classified author disambiguation methods, and compared them. In addition, they calculated the sensitivity of the person identification process to the estimated amount of manual disambiguation work. Findings: This paper proposes an approach and algorithms that together form a solution for the personal identification process and ensure duplicate elimination. Novelty of the Study: The approach is suitable for systems with shared emails or user accounts.

Keywords: Author Disambiguation, Author Identifier, Current Research Information Systems (CRIS), Duplicate Detection, Identity Management, Master Data Management (MDM), Person Disambiguation, Research Information Management, User Accounting

1.  Introduction

In recent years, information systems designed for managing research results have become widespread. Such systems include: 1. Databases of scholarly articles, authors, and citation data, such as Scopus and MEDLINE1,2; 2. Grant and award management systems such as eRA Commons3; 3. College repositories and Current Research Information Systems (CRIS)4,5. What all these systems have in common is the need to aggregate duplicated publication data and bind them to distinct author records without using direct personal identification data. A single record per person is essential in such systems because institutions assess researchers’ performance with metrics based on repository data. Grant and scientific


project information systems also have users who are given access to grant application management and project reporting. Access and confidentiality policies require unambiguous user identification, which, in turn, depends on a user’s ID or equivalent data. In a research information management system, accounting for persons directly by their IDs is not always acceptable, because users are unwilling to provide their IDs and because legal requirements apply to the storage and processing of such personal data (a task for special-purpose IdM systems). This motivates a search for alternative approaches to user identification that do not use direct IDs or personal data. In short, any research information system has to solve the following tasks: 1) Keep a single (master) personal record to support proper historical analysis and assessment of a researcher’s performance; 2) Provide


access to a scientific project to an authorized person who has not provided an ID. This paper proposes an approach and algorithms that together form a solution for the personal identification process in an information system and prevent duplicate person records. The solution has been piloted on the FCNTP system for the federal science development programs of the Russian Federation6 and has shown 90% sensitivity with zero false positives. The approach is suitable for systems where users may share emails or user accounts, and it is also readily applicable to similar data management systems (MDM solutions) or to any other research information management system with legacy duplicates.

2.  Materials and Methods

The problem of person disambiguation and identification has mostly been studied for scientific publications kept in repositories and libraries, such as MEDLINE2. The goal there is to deduplicate publication records. Other approaches to the problem relate to methods of user identification implemented in various online systems and projects.

2.1  Author Disambiguation Methods

Automated approaches to author disambiguation are conceptually divided into two broad categories: deterministic and probabilistic. Deterministic algorithms employ a set of rules based on an exact or approximate match between corresponding fields in potential author record pairs. Such algorithms are designed to find an exact match of reliable keys, for example, an ID number or an email address7,8. If that match is unsuccessful, another, composite key is used, such as a full name paired with a birth date. In addition to exact matching, fields can be compared approximately9, for example, with phonetic encoding of a person’s name. Probabilistic algorithms (fuzzy matching) use statistical methods to find matching records10,11. The frequencies of identifier agreement and disagreement are derived from a set of records: potentially linked and non-linked record pairs in data sets. From these, likelihood scores are calculated for each potential record pair. The likelihood scores for all potential record pairs are usually divided into three groups: low scores represent non-links; high


Vol 9 (43) | November 2016 | www.indjst.org

scores represent probable links, and intermediate scores represent indeterminate links. The size of the last group and the quality of these algorithms depend on the training set of records, i.e. the subset of all pairs for which disambiguation has already been done, usually with a deterministic algorithm or manually. Some probabilistic algorithms, called supervised12, require the training set directly. This partial set is needed to calibrate the algorithm (to calculate a set of weights for the similarity function) before it is run on the whole data set of pairs. Other algorithms do not directly require a training set13; they are called unsupervised or semi-supervised10. Most of the research community’s effort goes into probabilistic algorithms, while deterministic ones serve as a tool to generate training sets.

We also propose another categorization of existing algorithms, by the type of human intervention required to get a result. Both deterministic and probabilistic algorithms may leave a subset of author pairs without a match decision because the given set of attributes can neither prove nor disprove the match. For example, a decision might be impossible with the full name/birth date pair because of misprints or variations in the first name. Manual intervention is then required to analyse additional attributes of an author and make a decision. We distinguish three categories of algorithms: a) No intervention is required8; b) Partial intervention is required: a subset of authors requires manual disambiguation7,9,11,14,15; c) Final intervention is required: every author match must be approved manually, or any match can be manually repealed10,12,13. Because of the inherent uncertainty of probabilistic algorithms, most of them belong to categories b) and c). Category b) is also referred to as ‘the hybrid approach’ in16.

Another notable classification divides algorithms into two groups. Algorithms in the first group aim at incremental disambiguation of the author array, while algorithms in the second group aim at disambiguating the entire author set at once. The incremental approach is mentioned in12 as more efficient than complete re-clustering of the author set. In addition, some online citation databases, such as Web of Science, provide tools that allow authors to disambiguate papers and profiles manually. A partial review of these tools is given in17,18. Summarizing previous work on disambiguation methods for authors of scientific papers, we see the following:


• Different articles authored by the same person share similarities in one or more aspects8. The same can be assumed for a person’s other documents, such as grant applications or reports.
• Full Names (FN), co-authors, affiliations, birth date, and work category can give the person disambiguation algorithm 100% specificity9.
• Achieving 100% sensitivity (recall) without author IDs always requires some manual intervention9.
• The following findings help to approach 100% sensitivity:
− Separating records by default is more effective than relying on eventual error reports from an employee14;
− The iterative approach is more effective than complete re-clustering12;
− User feedback has a rather positive impact on disambiguation15;
− Manual disambiguation might be cost-effective for rare author FNs7;
− Context analysis can increase the effect of manual disambiguation19,20.
• Many modern CRISes and author profile services feature person (author) disambiguation13,21,22.
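The two families of algorithms reviewed in this section can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the record fields, the key order, and the `low`/`high` thresholds are assumptions chosen only to show a deterministic key cascade and the three-way split of likelihood scores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorRecord:
    # Field names are illustrative, not taken from any cited system.
    full_name: str
    birth_date: Optional[str] = None
    email: Optional[str] = None
    author_id: Optional[str] = None


def deterministic_match(a: AuthorRecord, b: AuthorRecord) -> Optional[bool]:
    """Return True/False when a key decides the pair, None when undecided."""
    # Reliable single keys first: an ID, then an email address.
    if a.author_id and b.author_id:
        return a.author_id == b.author_id
    if a.email and b.email and a.email.casefold() == b.email.casefold():
        return True
    # Fall back to the composite key "full name + birth date".
    if a.birth_date and b.birth_date:
        same_name = a.full_name.casefold() == b.full_name.casefold()
        if same_name and a.birth_date == b.birth_date:
            return True
    return None  # left for probabilistic or manual disambiguation


def classify_score(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Split a likelihood score into three groups (thresholds are assumed)."""
    if score < low:
        return "non-link"
    if score > high:
        return "link"
    return "indeterminate"
```

Pairs for which `deterministic_match` returns `None`, and pairs scored as `"indeterminate"`, correspond to the subset that requires partial manual intervention in category b).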

2.1.1  User Disambiguation Approaches

Other approaches to the person identification problem include various methods of user identification in online information systems. There are two tasks to be solved: identifying users at registration and merging existing user accounts recognized as duplicates. While the need for the first task is obvious, the second arises when accounts must be merged in systems that contain legacy duplicates and have only later introduced user identification.

2.1.1.1 User Identification:  One common approach to user identification is to use an email address as a login or as an ID. The first option is used in social networks; the second is used, for example, in the Zendesk global support ticket system23, where every user’s login is associated with a pool of email addresses. Despite its popularity, the approach does not work for many systems, including research information systems, mainly because an email address can be changed by a person’s affiliated organization or shared (as a mailing list) by participants in a research project. Consequently, a more complex email-based ID is used, for example, by online author ID providers, which identify authors of scholarly publications. ResearcherID (by Thomson Reuters) and the non-commercial


ORCID use alerts about duplicate full names during registration in addition to an email uniqueness requirement24. The Russian Science Citation Index25 identifier (SPIN code) uses, along with a pool of email addresses, a composite key of full name plus birthday and a pre-moderation procedure for each newly identified author. Shared email addresses are also allowed in the frequent-flyer program accounting systems of large airlines such as Delta, AA, and Lufthansa. Instead of emails, these systems use their own IDs, such as a SkyMiles number or a login name, for account identification. An alternative approach to user identification is to use an external ID service. For example, some publisher information systems, such as that of the Royal Society (UK), use ORCID for person identification at the time of paper submission [15. ORCID.R].

2.1.1.2 Merged User Accounts:  When two duplicate user accounts are found in an information system, they are merged. Systems differ in how they handle the accounts’ data and permissions and in the degree of human supervision. Social networks do not offer a way to merge user accounts26; some services provide a migration function on request to technical support. Deprecated accounts are removed. The Zendesk global support ticket system allows merging verified users, except administrators and users with SSO. The merge is done via a user search and a comparison of the users’ email addresses and Twitter accounts. All emails and tickets of the deprecated account are rebound to the approved one; the other data of the deprecated account are lost. A similar approach is used by the Salesforce and Zoho CRM systems27. The ORCID author ID provider relies on persons to report duplicates and provides a merge function on request to technical support22. Third-party service access permissions of the deprecated account are not transferred. The deprecated account is left in the author ID database as a pointer to the retained one. The SkyMiles frequent-flyer program system of Delta Airlines offers account merging to end users. Each merge request requires the accounts’ first and last names to be identical and is pre-moderated, so that a user’s choice of the approved account may be overridden by technical support. This allows retaining important data such as an account’s program qualification status or a user’s credit


card. All data of the deprecated account are lost except for email addresses, phone numbers, logins, and passwords. The login of the deprecated account is kept active and allows the user to access the approved account. Other well-known approaches are found in the system of the US Internal Revenue Service, where accounts are not merged but closed (by technical support), and in the Intuit QuickBooks system, where employees’ accounts are consolidated through a procedure triggered by changing an employee’s full name (so that it matches the duplicate)28. As a summary of previous work on user identification and merge methods, we can say the following:

1. Using email addresses alone for user identification is not enough; accordingly, there are multiple solutions that combine emails with other user attributes.
2. Keeping a pool of a user’s former email addresses and labelling them as unused looks promising for identification24.
3. There are two polar identification trends: a) using an external ID and enforcing its use everywhere (as publishers do29,30); and b) introducing and promoting a new, service-specific user ID or login (as airlines or author ID providers do).
4. No user mergers are done automatically on a regular basis; each merger is initiated by a user’s request (or a one-off script).
5. An exact match of the user’s full name is a mandatory requirement for merging accounts, to prevent mistakes and identify a person’s true name.
6. No data are transferred in an account merger except for identification data (email addresses, login names, etc.) and links to objects that are independent of the accounts (tickets in a support system, scholarly publications in a CRIS).

Points 3 and 6 faced some criticism.

2.1.1.3 External ID Usage:  The problem with this approach is that many researchers, in particular elderly people, still do not have an author ID and do not want to obtain one because they do not see its value. In the specific grant application accounting system discussed in this paper (see ch. 4), the SPIN author ID is used for 25% of all persons. Forcing users to provide such an ID might be possible for user accounting, but not for person accounting, where IDs of participants (not users) could be hard to collect or are not available at all.

2.1.1.4 Account’s Data Transfer:  The most obvious problem is the loss of user data that, if kept, could be used to match new user registrations against the existing user data set. Another is the loss of the permissions and roles of a deprecated user account. On the other hand, preserving a user’s deprecated login increases the complexity of the system and of the account data architecture, which is of no value to end users.

2.2 Person Disambiguation Approach

To formalize the tasks to be solved, we consider a model M = ⟨P, U, A⟩ of an abstract information system with three types of objects: persons (P), user accounts (U), and artefacts (A). A person represents a human identity, a user account provides an individual with access to the system, and an artefact is an object created by an individual and valuable beyond the system (article, grant application, project, etc.). The system has the following processes:

• Creating artefacts and persons by binding an artefact to a person,
• Providing a person with access to an artefact by creating user accounts,
• Notifying people about changes to an accessible artefact.

Each human identity is described with several attributes, but not with a unique ID. As the ID is not available, duplicate person and user account records might be created. In contrast to people and user accounts, each artefact is described with its ID and has a list of authorized people with or without user accounts: identities and credentials delivery methods (such as email addresses). It is prohibited to grant access to the artefact to a person who is not on the list. It is also prohibited to deliver notifications about the artefact to people who cannot access it. Given this model, the goal is to propose an approach and algorithms that determine the following data without increasing the complexity of the model M (and of the information system where the model is implemented):

− Matrix R1 of the artefacts and people bound to each other,
− Matrix R2 of the artefacts and people with granted access to the artefact,
− Total number of people R3,
− Total number of people with granted access R4.
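To make the four target quantities concrete, the following Python sketch computes them for a toy data set. It is only an illustration under stated assumptions: the `Artefact` fields, the `master` mapping from a duplicate person record to its consolidated record, the sparse set-of-pairs representation of the matrices, and the choice to count in R3 every person who appears in R1 or R2 are all hypothetical and not part of the paper's model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pairs = Set[Tuple[str, str]]


@dataclass
class Artefact:
    artefact_id: str
    bound: Set[str] = field(default_factory=set)       # person records bound to the artefact
    authorized: Set[str] = field(default_factory=set)  # person records on the access list


def report(artefacts: List[Artefact], master: Dict[str, str]):
    """Compute R1..R4 after mapping each person record to its master record."""
    def resolve(person_id: str) -> str:
        return master.get(person_id, person_id)

    # R1 and R2 are stored sparsely as sets of (artefact, person) pairs.
    r1: Pairs = {(a.artefact_id, resolve(p)) for a in artefacts for p in a.bound}
    r2: Pairs = {(a.artefact_id, resolve(p)) for a in artefacts for p in a.authorized}
    r3 = len({p for _, p in r1 | r2})  # distinct people seen in R1 or R2 (assumption)
    r4 = len({p for _, p in r2})       # distinct people with granted access
    return r1, r2, r3, r4
```

With an empty `master` mapping, every duplicate record is counted as a separate person, so R3 and R4 are inflated; supplying the mapping produced by disambiguation yields the consolidated counts.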

2.2.1 Criteria, Assumptions, Approach, and Concept

Because duplicates are possible, a consolidation approach to persons and users is required (at the reporting level or overall) to solve the tasks mentioned above. We consider approaches that satisfy the following criteria.
1. Criterion C1: 100% specificity. As misdelivery of a notification about an artefact change is prohibited, the R2 matrix must not contain a ‘consolidated’ person who is granted access to the artefacts of other persons. As a result, the specificity of the consolidation algorithm for pairs of user accounts (granted access lists) must be 100%: two user accounts of two different people must never be consolidated. In a real information system, incorrect consolidation of granted access lists means delivering junk messages that cannot be unsubscribed from without losing access to the artefact. Additionally, if the user accounts of different persons are actually merged, these persons start to share the account and come into conflict, for example, when a password is changed.
2. Criterion C2: Awareness of mistypes and software bugs. There are additional sources of person duplicates: mistypes, software bugs, and new entry systems that do not implement the disambiguation algorithm. The proposed approach must identify and consolidate all duplicates from these sources as well.
3. Criterion C3: Time-limited. The disambiguation approach must be implementable in finite time, independently of undefined factors such as the availability of user feedback or author corrections.
4. Assumptions: We also assume the following about person (and user) data: 1) A person’s Full Name (FN) contains the first, middle, and last names for Russian names and the first and last names for all others; 2) All affiliations are labelled with the organization’s ID (a taxpayer identification number, for example); 3) The birth year is available; 4) An e-mail address (email) is available and used as the credentials delivery method for a user account (other methods include SMS, phone calls, and postal delivery); 5) 99% of the data are not fraudulent.


5. The approach: Similarly to the Master Data Management (MDM) implementation styles31, we see two polar approaches to person disambiguation in information systems: data consolidation (the transaction style, or maintaining a master (golden) record) and data federation (a virtual view of records, or the registry style).
1. Consolidation: Person records are combined by a disambiguation algorithm. Disambiguation is performed for all new records, which are linked continuously as they appear.
2. Virtual view: The view combines dispersed duplicate records about every person, but the actual records are neither merged nor changed. The view is built using data integration techniques.
With persons and user accounts as the objects of disambiguation, consolidation or the virtual view may be applied to persons, to user accounts, or to both simultaneously. We explain below that only the last option does not make the model M more complicated. A person is authorized with both a human identity and a credentials delivery method, which means that a user account must be bound to a person and an email address. As a person record might have an email address too, a person bound to a user account shares the same email with it (otherwise the email addresses must be distinguished with an additional attribute, which makes the model M more complicated). Suppose first that the person records of one individual are merged, but the user accounts are not. If the individual has multiple user accounts (with different emails), then there are multiple person records (with different emails too) bound to these accounts. But the person records of one individual are merged, which is a contradiction. Now suppose that the user accounts of each individual are merged, but the person records are not. If a user account can authorize only one person (given that the user account cannot be shared), then the user account can be bound to only a single identity (a person), whereas the merged account would have to be bound to several unmerged person records.
Now consider whether user account sharing can be allowed. If so, an account must be bound to multiple people and delivery methods (emails), and any of these people is allowed to access the artefacts of another unless the system distinguishes two types of user accounts: single and shared. This last requirement makes the model M more complicated.


Finally, to reach the goal, the disambiguation approach must be applied so that either both person records and user accounts are merged or neither of them is. Otherwise, the system implementing the model M acquires variations in user accounts or credentials delivery methods and becomes more complicated. Compared with each other, the consolidation and virtual view approaches have the following advantages31:
1. Consolidation:
− Simpler selection of persons and user accounts and simpler data entry,
− Master data available for analytics and business intelligence,
− Ability to enforce data quality standards.
2. Virtual view:
− Fewer changes to the existing disparate systems and data,
− Quick solution,
− No copied data, which may be critical for regulatory requirements.
Analytics and business intelligence capabilities are critical for research information management systems and are poorly served by a set of isolated systems, so we use the consolidation approach to disambiguate both person and user account records.
6. The concept: As an existing information system might include person and user account duplicates, a migration procedure for legacy duplicates is required in addition to the consolidation approach implemented in the system. Thus, the disambiguation framework includes a regular disambiguation process and a one-time migration procedure executed after the process has been implemented. The disambiguation process contains the following steps. Firstly, automatic matching: most new duplicates (80% or more) are eliminated by matching them to existing person and user accounts with a 100% specificity algorithm. Secondly, augmentation: person and user account data are changed (augmented) on every match. Thirdly, manual matching: the remaining new and changed duplicates are revealed by a 100% sensitivity algorithm and then manually evaluated and consolidated.
The one-time migration procedure is a single run of the 100% specificity match algorithm, which consolidates (migrates) most legacy duplicates.


For example: FN and email matching eliminates most new person duplicates, while the remaining records are entered as new persons; every matched person is updated with a link to the new artefact and email; FN matching with the Levenshtein distance over all new and updated persons reveals the remaining duplicates, which are then consolidated manually. We also want to emphasize some other features of this framework:
• Data-driven revision: Any update related to a person (or a user account) triggers the disambiguation algorithm for this person or user account. This makes it possible to find duplicates among existing persons when new data about some of them arrive.
• Local supervision: There is no extra manual disambiguation work, such as revealing possible duplicates among all existing persons. Manual operations for new persons are limited and can be further reduced by improving the algorithm’s ability to reveal duplicates.
• Predictability: The concept does not rely or depend on any community or on individual people. All manual work can be done by several data administrators.
• Split-free: Splitting person or user accounts is not part of this disambiguation concept because the automatic algorithm has 100% specificity. The corresponding functionality should also be absent from the information system.
The framework provides flexibility in balancing between: 1. the complexity of the algorithm implementation and the required precision, and 2. the amount of regular manual work. It allows a good-enough disambiguation solution to be implemented initially and improved later, up to 100% precision, without significant changes to the architecture.
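The three-step example above (automatic FN and email matching, augmentation, and Levenshtein-based candidates for manual matching) can be sketched in Python as follows. This is an illustrative sketch, not the piloted implementation: the dictionary layout of a person, the name normalization, and the `max_distance` threshold of 2 are assumptions.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(full_name: str) -> str:
    return " ".join(full_name.casefold().split())


def register(new: Dict, persons: List[Dict], max_distance: int = 2) -> Tuple[str, List[Dict]]:
    """Match a new person against existing ones; emails are stored lowercased."""
    fn = normalize(new["fn"])
    email = new["email"].casefold()
    # Step 1, automatic matching: exact FN plus a known email.
    for person in persons:
        if normalize(person["fn"]) == fn and email in person["emails"]:
            # Step 2, augmentation: link the new artefact to the matched person.
            person["artefacts"].add(new["artefact"])
            return "matched", [person]
    # Step 3, manual matching: near-identical names become review candidates.
    candidates = [p for p in persons
                  if levenshtein(normalize(p["fn"]), fn) <= max_distance]
    persons.append({"fn": new["fn"], "emails": {email}, "artefacts": {new["artefact"]}})
    return "new", candidates
```

A record returned as `"new"` with a non-empty candidate list is the case a data administrator would evaluate manually; the automatic step never merges on a fuzzy name alone, which keeps it on the safe side of the specificity requirement.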

2.2.2 Algorithms

2.2.2.1  New Person/user Identification (Match) Algorithm:  The matching algorithm for a new person and user (Match) is shown in the flowchart in Figure 1. It includes two sub-procedures, access restoration (Restore Algorithm) and consolidation (Consolidate Algorithm), described in detail below. The algorithm implements the disambiguation framework described above. The key implementation points are the following:


1. For automatic matching, a high sensitivity level is achieved while keeping 100% specificity by aligning both person and user account identities in one set. For the sake of specificity, a new, separate user account record is created for access restoration, which is more effective than handling error reports when access to a primary email has been lost14. The user account is manually validated against the artefacts’ lists of authorized people.
2. For augmentation, a person’s email addresses are kept in a flat list with a single primary email, which makes it easy to add an email and search for it24.
3. For manual matching, the process relies on a full name match7, which is cost-effective in certain cases32.

For institutions with strict security policies (government agencies, banks, etc.), the algorithm can be improved with additional protection. To get a hint about a primary email, a user must provide at least one of the person’s emails in unmasked form (rather than being able to get the hint from the person’s affiliation and birth year alone).

2.2.2.2  Access Recovery Algorithm:  The user account access recovery algorithm (Restore) is shown in the flowchart in Figure 2. The algorithm is a typical self-service password reset procedure with the following differences:
1. To restore the password, a login is required. The login is restored only via the person’s primary email address.
2. When the password reset link is sent, the user must be given a hint about the primary email to which the link is sent.
3. The user can be given an option to create a user account if a person record for the user already exists.
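The hint about the primary email mentioned in point 2 could, for instance, be a masked form of the address. The Python sketch below shows one possible masking rule; the function name `mask_email`, the number of visible characters, and the decision to leave the domain visible are assumptions, since the paper does not specify the mask format.

```python
def mask_email(email: str, visible: int = 1) -> str:
    """Return a hint such as 'i**********@example.org' for a primary email."""
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address")
    shown = local[:visible]
    # Always emit at least one '*' so short local parts are not fully revealed.
    hidden = "*" * max(len(local) - visible, 1)
    return shown + hidden + "@" + domain
```

A stricter deployment could also mask part of the domain or, as suggested for institutions with strict security policies, show the hint only after the user has supplied one of the person's emails in unmasked form.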

Figure 1.  Person or user matching

We want to emphasize that the algorithm does not assume global email uniqueness and allows different persons to share email addresses (in contrast, for example, to ORCID24). The second point is that a maiden last name match (when available) is used only at the manual matching step of the framework. This simplifies the implementation without a real loss of sensitivity, as the majority of new person data comes with the current last name, not the maiden one. The data-driven revision feature of the framework is implemented by updating a person’s change date with the timestamp of every change to the person’s or user account’s data, including user logins. This raises sensitivity and makes it possible to fix human mistakes if duplicates were missed at a previous manual matching step. The last point is that a person’s affiliations are matched only by ID, such as a taxpayer number.


Figure 2.  User access recovery

Institutions with strict security policies may also send another notification to a user upon a change to an account login or password. It allows detecting any misused user account and improves its security.

2.2.2.3 Persons/users Consolidation Algorithm and Migration Procedure:  The algorithm for consolidating person and user accounts (‘Consolidate’) is shown in the flowchart in Figure 3. The basic concept of the algorithm is that any deprecated person must be freed from links to artefacts and other person-independent objects. This makes it possible to delete the person safely or to revert the consolidation at any moment.


For user account consolidation, the goal is to prevent misuse of the approved user account by a person other than the account’s owner who might have used one of the user accounts being consolidated. In other words, the approved account must be used by the real user. The goal is achieved with the following basic rules:
1. Block consolidated user accounts and ask users to restore access (or to obtain a new account);
2. Ensure that real users are notified about changes to their accounts and that they receive information (or a hint) about the impact: consolidated privileges and artefacts;
3. Rely on the superior privileges of a user account when choosing the approved one.

Figure 3.  Person or user consolidation

We want to emphasize that the algorithm does not change or augment any authentication factor of the approved user account (such as a login or a password); the account is only locked. Additionally, the deprecated user account’s login is prefixed (when it is kept) to make the login string available for new user registrations. To make it technically possible to revert the consolidation of person or user accounts, the following data should be kept in the event log of the approved person: a dedicated event type for the consolidation operation (to enable event filtering); an event for each change to a person’s attribute in the form of ‘a value before change’ and ‘a value after change’; an event for every new artefact link in the form of an artefact type and a deprecated person ID (such as the system’s UUID).
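The event log described above can be sketched in Python to show why these three kinds of events are enough to undo a consolidation. The sketch rests on several assumptions that are not in the paper: the class and field names are hypothetical, an artefact link event also stores an artefact ID, only empty attributes of the approved person are filled, and `revert` assumes that the consolidation is the most recent operation in the log.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class Event:
    kind: str  # "consolidation", "attribute_change" or "artefact_link"
    payload: Dict[str, Any]


@dataclass
class Person:
    person_id: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    artefacts: Set[Tuple[str, str]] = field(default_factory=set)  # (type, id) pairs
    log: List[Event] = field(default_factory=list)


def consolidate(approved: Person, deprecated: Person) -> None:
    """Move data from the deprecated person and log every change."""
    dep_id = deprecated.person_id
    approved.log.append(Event("consolidation", {"deprecated_id": dep_id}))
    for name, value in deprecated.attributes.items():
        if approved.attributes.get(name) is None:  # fill only empty attributes
            approved.log.append(Event("attribute_change",
                                      {"name": name, "before": None, "after": value}))
            approved.attributes[name] = value
    for art_type, art_id in sorted(deprecated.artefacts):
        if (art_type, art_id) not in approved.artefacts:
            approved.artefacts.add((art_type, art_id))
            approved.log.append(Event("artefact_link",
                                      {"artefact_type": art_type, "artefact_id": art_id,
                                       "deprecated_id": dep_id}))
    deprecated.artefacts.clear()  # the deprecated person is freed from links


def revert(approved: Person, deprecated: Person) -> None:
    """Undo the latest consolidation by replaying the log backwards."""
    while approved.log:
        event = approved.log.pop()
        p = event.payload
        if event.kind == "attribute_change":
            approved.attributes[p["name"]] = p["before"]
        elif event.kind == "artefact_link":
            link = (p["artefact_type"], p["artefact_id"])
            approved.artefacts.discard(link)
            deprecated.artefacts.add(link)
        elif event.kind == "consolidation":
            break
```

The dedicated `"consolidation"` event acts as a boundary marker, which is what makes filtering and a bounded rollback possible.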


If a corporate policy requires a deprecated person to be kept after consolidation, the person’s log should contain a consolidation event saying that the person is deprecated (the ID of the approved person can be found with his or her FN and the event timestamp). Such person or user accounts can be hidden from the common list and regularly deleted depending on the corporate retention time.

There are several improvements of the algorithm, which may be required by institutions with strict security policies: When being copied, a deprecated person’s non-primary emails can be marked as obsolete to avoid sending redundant notifications to them; Artefacts (and privileges) in the notification email can be masked in the same way as is done for emails to prevent disclosure of confidential information about the importance of an approved user account; If a deprecated person has passport data or an author ID (the latter is quite common in research information systems), it only replaces the value of the approved person if the latter is empty. This enables detection and correction of human mistakes when an ID is added to an ‘incorrect’ person.

To consolidate (migrate) existing duplicates of person and user accounts, the ‘Consolidate’ algorithm is executed three times: 1. For person pairs/clusters, where none of the persons has a user account, 2. For person pairs/clusters, where only one person has a user account, 3. For person pairs/clusters, where all persons have their user accounts. This order ensures that the true user is notified about account changes at the largest number of email addresses while keeping the number of notification emails to a minimum.
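The three migration passes amount to ordering duplicate clusters by how many of their persons have a user account. A minimal sketch follows, assuming each person is represented by a dictionary with a hypothetical `has_account` flag. Clusters in which several but not all persons have accounts are not covered by the description; sending them to the last pass is an assumption of this sketch.

```python
def migration_pass(cluster):
    """Return the pass number (1, 2 or 3) in which a cluster is consolidated."""
    with_account = sum(1 for person in cluster if person["has_account"])
    if with_account == 0:
        return 1          # pass 1: no person has a user account
    if with_account == 1:
        return 2          # pass 2: only one person has a user account
    return 3              # pass 3: all persons have accounts (assumption: also mixed clusters)


def order_for_migration(clusters):
    """Sort clusters by pass number; sorted() is stable, so ties keep their order."""
    return sorted(clusters, key=migration_pass)


clusters = [
    [{"name": "A", "has_account": True}, {"name": "A'", "has_account": True}],
    [{"name": "B", "has_account": False}, {"name": "B'", "has_account": False}],
    [{"name": "C", "has_account": True}, {"name": "C'", "has_account": False}],
]
ordered = order_for_migration(clusters)
```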

2.2.3  Fraud Handling

An assumption we have made is that 99% of incoming personal data are truthful. Table 1 lists improvements to the proposed algorithms that ensure correct handling of fraudulent or incomplete full names, beyond trivial string validations (only letters and spaces, letter case, etc.). We use these improvements to increase sensitivity at the manual matching step within the disambiguation framework. The mechanical matching does not include them for the sake of simplicity.


Table 1.  Handling different fraud types for full names of persons

| No. | Fraud type | Example | Handling |
|---|---|---|---|
| 1 | Using similar characters from a different language, like o (English) instead of о (Russian) | ‘Петрoв Владимир Иванoвич’ | If a FN contains both Russian and non-Russian characters, all non-Russian characters are converted into corresponding Russian ones. The additional comparison of normalized FNs with existing FNs is done. |
| 2 | Intentional transliteration of a Russian name or usage of Russian characters for non-Russian names | ‘Petrov Vladimir Ivanovich’ or ‘Блэк Джон’ | The additional comparison of transliterated FNs with transliterated existing FNs is done. |
| 3 | Omitted middle name | ‘Петров Владимир’ | The additional comparison of FNs with existing FNs without middle names is done. |
| 4 | Copy-paste mistakes: the same value is pasted in two name parts, for example, in the last name and the first name | ‘Петров Петров Иванович’ | If one of the following conditions is true for the FN, it is labeled for manual disambiguation: a) last name = first name, b) last name = middle name, c) first name = middle name. |
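Rows 1, 3 and 4 of Table 1 can be sketched in a few lines. The homoglyph map below is a small illustrative subset rather than the table used in the system, the function names are hypothetical, and the ‘last name, first name, middle name’ order is assumed from the examples. Row 2 (transliteration) needs a transliteration table and is not covered.

```python
import re

# Illustrative subset of Latin letters that look like Cyrillic ones (assumption:
# the real system may use a different or longer map). Escapes avoid ambiguity.
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
})

CYRILLIC = re.compile("[\u0400-\u04ff]")
LATIN = re.compile("[A-Za-z]")


def normalize_mixed_script(full_name):
    """Row 1: convert look-alike Latin letters only when the scripts are mixed."""
    if CYRILLIC.search(full_name) and LATIN.search(full_name):
        return full_name.translate(LATIN_TO_CYRILLIC)
    return full_name


def without_middle_name(full_name):
    """Row 3: keep the first two name parts for the additional comparison."""
    return " ".join(full_name.split()[:2])


def needs_manual_disambiguation(full_name):
    """Row 4: flag a FN in which any two name parts are equal."""
    parts = [part.casefold() for part in full_name.split()]
    return len(parts) != len(set(parts))
```

The normalized and shortened names are meant as extra comparison keys next to the original FN, not as replacements for the stored value.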

3. Results

The algorithm and approach to person disambiguation were tested on the FCNTP grant application management information system of the Russian Federation federal programs for science development [6]. The test was based on previous work to record data of research project participants. The solution for project participant accounting also covered participants from grant applications and end-users of several categories: experts, managers, and contributors to a project subject. The last category models open registration for any person who would like to contribute to federal programs with a proposal of a new subject for research.

When the disambiguation framework was introduced, the system had been in use for 8 years and contained ~20k grant applications and ~80k persons. Approximately 20-25% of persons were duplicates, and the growth rate was 200-500 duplicates a month. At the end of the implementation, the system had ~120k persons with 1.5% duplicates and a zero duplicate growth rate. All duplicates among persons with user accounts were consolidated. At the time of writing, 8 months had passed since the end of the implementation.

The disambiguation approach was applied to several parts of the system where the growth rate of person duplicates had been the highest: participants in grant applications (the procedure to add a participant), participants in grant contracts, and user accounts.

To estimate the efficiency of the approach, we calculated the sensitivity of the person identification process, which indicates the amount of manual work required for disambiguation. Manual work was required when the matching algorithm was unable to identify a person and a notification about possible duplicates and the required manual work was sent (see Section 2.2.2). Higher sensitivity values mean better quality of the approach and the algorithm. To calculate sensitivity, we used one system component: disambiguation of participants in grant applications. Other system components were not involved in the calculation because they use a different set of matching rules.

We examined the disambiguation process for 1,321 persons, participants of 95 grant projects. In the system, all participants with applications were identified in chunks when applications were imported to the system from the grant application components.
These chunks were imported on nine different days: for example, on one day, 119 persons with 16 applications were imported and identified, on another day, 84 persons with 10 applications, and so on. On each day, the sensitivity value was calculated with the following formula, and the estimated values are given in Table 2:

$$\text{Sensitivity} = \frac{N_{\text{total}} - N_{\text{notif}}}{N_{\text{total}}},$$

where $N_{\text{total}}$ is the total number of persons in the applications imported on that day and $N_{\text{notif}}$ is the number of those persons within notifications.

To check how well the disambiguation cases from grant applications cover the full set of disambiguation cases in the system, we also estimated the share of grant application persons in the full set of persons within notifications (the full set also included persons from grant contracts and user accounts). For each day in Table 2, we divided the number of new persons from grant applications by the total number of persons in the system.

Table 2.  Estimated algorithm sensitivity

| Day No. | Total number of persons in applications | Number of persons within notifications | Sensitivity (estimated) | Coverage* |
|---|---|---|---|---|
| 1 | 389 | 37 | 0.905 | 87% |
| 2 | 259 | 26 | 0.900 | 99% |
| 3 | 178 | 13 | 0.927 | 92% |
| 4 | 151 | 14 | 0.907 | 88% |
| 5 | 119 | 11 | 0.908 | 55% |
| 6 | 84 | 5 | 0.940 | 67% |
| 7 | 75 | 3 | 0.960 | 80% |
| 8 | 47 | 3 | 0.936 | 64% |
| 9 | 19 | 0 | 1.000 | 56% |

The test result means that person disambiguation was automatic in at least 90% of cases. Persons from grant applications made up 56-99% of potential duplicates in the FCNTP system, which was sufficient rationale to use grant application participants as the only source of person duplicates for the estimates.
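The sensitivity column of Table 2 can be reproduced from the two count columns as one minus the share of persons within notifications. The sketch below assumes that reading and checks it against the published values; the variable and function names are illustrative.

```python
# (total persons in applications, persons within notifications) per day, from Table 2
TABLE_2 = [(389, 37), (259, 26), (178, 13), (151, 14), (119, 11),
           (84, 5), (75, 3), (47, 3), (19, 0)]
PUBLISHED = [0.905, 0.900, 0.927, 0.907, 0.908, 0.940, 0.960, 0.936, 1.000]


def sensitivity(total, notified):
    """Share of persons identified without a notification for manual work."""
    return (total - notified) / total


for day, ((total, notified), published) in enumerate(zip(TABLE_2, PUBLISHED), start=1):
    print(f"day {day}: computed {sensitivity(total, notified):.3f}, published {published:.3f}")

# Pooled value over all nine days, weighting each day by its number of persons.
total_persons = sum(total for total, _ in TABLE_2)
total_notified = sum(notified for _, notified in TABLE_2)
pooled = sensitivity(total_persons, total_notified)
```

The daily totals add up to the 1,321 persons examined, so the pooled value describes the whole test set rather than a single import day.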

4.  Findings and Discussion

The experiment with the specific information system and the person disambiguation approach points to the feasibility of 100%-precise person identification without IDs, using only such basic data as emails, affiliations, and birth years. For a group of researchers, at least 90% of persons and researchers can be semi-mechanically identified in this way. Similar results (92%) were reported in [11] for article author disambiguation with a much more complicated heuristic approach based on author collaboration analysis [33]. Aligning person and user accounts, as done in the proposed disambiguation approach, reaches the same result in a simpler and deterministic way.

For the FCNTP system, higher automation (a reduced number of notifications requiring manual work) can be achieved with the developed algorithm in three directions: 1. Using additional matching rules for persons: FN + co-participants; FN + birth year + grant application research area; 2. Eliminating redundant notifications about different FNs which are not typos by using better analogues of the Levenshtein distance, such as the Damerau–Levenshtein distance; 3. Handling human mistakes in the manual disambiguation process, such as a notification email ignored by a data steward.

The next finding is that users can share emails even within a system with a single account per user (or record per person) requirement without providing their IDs, which is not possible in other systems, such as ORCID [24]. Common additional data such as a birth year and an affiliation enable proper user identification in this case. Keeping a list of multiple emails per user solves the notification problem, and the single mandatory email requirement ensures password recovery.

The last conclusion is that an automatic and secure merge of user account duplicates is possible even when the users share accounts or use accounts that do not belong to them. The key point is that the merge of accounts is followed by their blocking and an email notification to all the participating users. The set always contains the real owner, who is allowed to unblock the account. The proposed approach does not rely on the community to prove or disprove consolidation as, for example, in [21, 22]. The drawback is that a manual operation might be required to select the proper email for an account unlock and to eliminate the possibility of a new support ticket from the account owner. Thus, there is potential for improvement: selecting the account owner based on the data of the consolidated accounts; increasing the account owner’s awareness of the merge by using cell phones for notification.

As possible pitfalls for the disambiguation approach, we see an improper change of a person’s FN to the name of another person (like John Black to Mike Chorn) and erroneous linking of deprecated persons to artefacts or other objects in case such persons are kept (the case of software bugs). The first problem might be solved by revoking the permission to change the FN from all users except data stewards. The second one might be handled by regularly removing deprecated persons (according to a corporate retention policy) and sending an alert when a person due for removal still has links.
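The Damerau–Levenshtein distance mentioned among the directions for higher automation treats a swap of two adjacent characters as a single edit, whereas the plain Levenshtein distance counts it as two. The sketch below implements the restricted variant (optimal string alignment), which is a common simplification; it is shown for illustration and is not the distance function used in the FCNTP system.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Allowed edits: insertion, deletion, substitution, and transposition of
    two adjacent characters. No substring is edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters of a
    for j in range(cols):
        d[0][j] = j                      # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "tr" <-> "rt"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this function, ‘Petrov’ and ‘Pertov’ are at distance 1, so a threshold tuned for single typos treats a swapped pair of letters as one mistake rather than two.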


5. Conclusion

The person and user account disambiguation solution proposed in this paper can be applied for human identification in systems where the collection of personal IDs is impossible but other personal data, such as emails or affiliations, are available. Examples of such systems are given below: 1) Databases of scientific results: grant and award management systems, CRIS-systems, patent offices’ databases and author identifiers; 2) MDM for customer data: corporate CRM, dossier databases of bank security offices; 3) Personal and marketing services: advertising agencies’ databases, social networks. The approach remains applicable if users change their affiliations, emails or full names. Information systems with the single account per user requirement can use the approach to identify duplicated accounts without enforced email uniqueness. The proposed approach is suitable for existing systems with person duplicates, and it provides an incremental solution for the disambiguation of legacy data.

6. Acknowledgements

The research was supported by the Ministry of Education and Science of the Russian Federation, Research Project No. 14.579.21.0117 (ID RFMEFI57915X0117).

7. References

1. Scopus Abstract and Citation Database of Peer-reviewed Literature. Date accessed: 02/07/2016. Available from: https://www.elsevier.com/solutions/scopus
2. MEDLINE. Journal Citations and Abstracts Database for Biomedical Literature. Date accessed: 10/07/2016. Available from: https://www.nlm.nih.gov/bsd/pmresources.html
3. eRA Commons. A Program of National Institutes of Health. Date accessed: 02/07/2016. Available from: https://public.era.nih.gov/commons
4. Joint N. Current Research Information Systems, Open Access Repositories and Libraries: ANTAEUS. Library Review. 2008; 57(8):570−75.
5. Euro CRIS. Why does one need a CRIS? Date accessed: 15/09/2016. Available from: http://www.eurocris.org/why-does-one-need-cris
6. FCNTP SSTP System. Short Guidelines on Registration Procedure and Work in System. Date accessed: 15/09/2016. Available from: http://www.fcntp.ru/page.aspx?page=263
7. Xiaoxin Y, Jiawei H, Yu PS. Object Distinction: Distinguishing Objects with Identical Names. Proceedings of International Conference on Data Engineering, 2007, p. 1241−46.
8. Torvik VI, Weeber M, Swanson DR, Smalheiser NR. A Probabilistic Similarity Metric for MEDLINE Records: A Model for Author Name Disambiguation, 2003.
9. J. Grannis, Overhage JD, McDonald CJ. Analysis of identifier performance using a deterministic linkage algorithm. Proc. AMIA Symp., 2002, p. 305−09.
10. Weiler H. Authormagic A Concept for Author Disambiguation in Large-Scale Digital Libraries, 2012.
11. Vetel I. Torvik. Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data. 2009; 3(3).
12. Soler JM. Separating the Articles of Authors with the Same Name. Scientometrics. 2007; 72(2):281−90.
13. Afonin SA, et al. ISTINA (Intelligent System for Subject Research of Scientific and Technical Information). Moscow University Press, 2014.
14. De Carvalho AP, Ferreira AA, Laender AHF, Gonçalves MA. Incremental Unsupervised Name Disambiguation in Cleaned Digital Libraries. Journal of Information and Data Management. 2011; 2(573871):289.
15. Li Y, Wen A, Lin Q, Li R, Lu Z, Wang H, Qian T. Incorporating User Feedback into Name Disambiguation of Scientific Cooperation Network. Web-Age Information Management. Lecture Notes in Computer Science. 2011; 6897:454−66.
16. Elliott S. Survey of Author Name Disambiguation: 2004 to 2010. Library Philosophy and Practice. Date accessed: 15/09/2016. Available from: http://digitalcommons.unl.edu/libphilprac/473
17. Mazov N, Gureev V. Problems of Identification of Metadata in Scientometric Databases WoK, Scopus and Russian SCI as Exemplified by Authors’ Profiles. Libraries and Information Resources in the Modern World of Science, Culture, Education, and Business. Proceedings 19th Anniversary International Conference ‘Crimea’, 2012, p. 1−4.
18. Chadegani AA, Salehi H, Yunus Md. MM, Farhadi H, Fooladi M, Farhadi M, Ale Ebrahim N. A Comparison between two Main Academic Literature Collections: Web of Science and Scopus Databases. Asian Social Science. 2013; 9(5):18−26.
19. Cucerzan S. Large-scale Named Entity Disambiguation Based on Wikipedia Data. Proceedings of 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007 Jun, p. 708−16.
20. Delgado AD, Martínez R, Fresno V, Montalvo S. An Unsupervised Algorithm for Person Name Disambiguation in the Web. Procesamiento de Lenguaje Natural, 2014; 53:51−8.
21. Scopus Feedback. Use the Scopus Author Feedback Wizard to Collect All Your Scopus Records in One Unique Author Profile. Date accessed: 10/07/2016. Available from: http://www.scopusfeedback.com
22. Feedback and Support for ORCID. What if I Have Two ORCID IDs? Date accessed: 15/09/2016. Available from: http://support.orcid.org/knowledgebase/articles/580410
23. Zendesk. Help Desk Software and Ticket Management System. Date accessed: 15/09/2016. Available from: https://www.zendesk.com/help-desk-software
24. Haak L. ORCID. Managing Duplicate ORCID iDs. Date accessed: 15/09/2016. Available from: https://orcid.org/blog/2014/01/09/managing-duplicate-iDs
25. eLIBRARY. Russian Science Citation Index. Date accessed: 01/07/2016. Available from: http://elibrary.ru/rsci_about.asp
26. Facebook. I Have Two Accounts. Can I Merge Them? Date accessed: 15/09/2016. Available from: https://www.facebook.com/help/203498356357867
27. Salesforce. Guidelines and Considerations for Merging Duplicate Accounts. Date accessed: 15/09/2016. Available from: https://help.salesforce.com/HTViewHelpDoc?id=account_merge_considerations.htm
28. Intuit. Merging Duplicate Employees in QuickBooks. Date accessed: 15/09/2016. Available from: http://payroll.intuit.com/support/kb/1002101.html
29. Hurst Ph. The Royal Society. From January You’ll Need an ORCID. 2015 Dec 7. Date accessed: 15/09/2016. Available from: https://blogs.royalsociety.org/publishing/from-january-youll-need-an-orcid
30. Taylor & Francis Group. Trialling ORCID: What, Why and How. Date accessed: 15/09/2016. Available from: http://editorresources.taylorandfrancisgroup.com/trialling-orcid-what-why-and-how
31. Sogani D. Understanding Master Data Management (MDM). Date accessed: 10/07/2016. Available from: http://underthehood.ironworks.com/2011/08/understanding-master-data-management-mdm.html
32. Verma A, Kaur I, Arora N. Comparative Analysis of Information Extraction Techniques for Data Mining. Indian Journal of Science and Technology. 2016; 9(11):1−18.
33. Jeong YS. Parallel Processing Scheme for Minimizing Computational and Communication Cost of Bioinformatics Data. Indian Journal of Science and Technology. 2015; 8(15):1−8.
