Scientometrics DOI 10.1007/s11192-013-1151-0

Robust hybrid name disambiguation framework for large databases Jia Zhu • Yi Yang • Qing Xie • Liwei Wang • Saeed-Ul Hassan

Received: 9 September 2013 © Akadémiai Kiadó, Budapest, Hungary 2013

Abstract  In many databases, such as science bibliography databases, the name attribute is the most commonly chosen identifier for entities. However, names are often ambiguous and not always unique, which causes problems in many fields. Name disambiguation is a non-trivial data management task that aims to properly distinguish different entities sharing the same name, particularly in large databases such as digital libraries, where only limited information is available to identify authors. In digital libraries, ambiguous author names occur because multiple authors share the same name or because the same person appears under different name variations. Most previous work on this issue employs hierarchical clustering approaches based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we propose a robust hybrid name disambiguation framework that is not only applicable to digital libraries but can also be easily extended to other applications based on different data sources. We propose a web page genre identification component to identify the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that further improves the performance of name disambiguation. We evaluated our approach on known corpora, and the favorable experimental results indicate that the proposed framework is feasible.

J. Zhu (corresponding author)
School of Computer Science, South China Normal University, Guangzhou, China
e-mail: [email protected]

Y. Yang
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Q. Xie
Division of CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

L. Wang
Wuhan University, Wuhan, China

S.-U. Hassan
COMSATS Institute of Information Technology, Lahore, Pakistan


Keywords  Name disambiguation · Multidimensional scaling · Genre identification · Clustering

Introduction to name disambiguation

Name ambiguity is a common issue in many real-world settings. A "name" here can be a person's name or a company name. A real-world entity can be expressed by different aliases for multiple reasons: abbreviations, different naming conventions (e.g. John Smith and Smith, J.), misspellings, or naming variations over time (e.g. Leningrad and Saint Petersburg). Conversely, different real-world entities may have the same name or share aliases. For example, in a digital library like DBLP [1], it is very common for several authors to share the same name. As shown in Fig. 1, there are at least 60 different authors called "Wei Wang" in DBLP, with more than 700 entries under this name. From the users' point of view, it is extremely difficult to identify which entries belong to which authors. With the rapid growth of digital libraries, it is important to improve data quality and minimize improper attribution of publications to authors. Web search is another typical example. When we use a search engine like Google to search for a person's name, e.g. "Jim Smith", the resulting web pages concern at least three different individuals: an American football player, an English cricketer, and a writer of directorial critical biographies and television guides. As the Web grows, more and more names appear on it, each representing dozens of different persons, which makes it inefficient for users to find information. Clusty [2] is an example of a new generation of search engines that groups results according to the content similarity of pages (see Fig. 2), with a re-clustering function that eliminates elements irrelevant to the user's search in order to obtain better clustering results. Yet Clusty can only cluster web pages against predefined categories and cannot distinguish entities sharing the same reference, e.g. people who share the same name.
Name disambiguation (Canas et al. 2003), simply speaking, is the task of resolving name ambiguity using different learning methods based on selected features. Apart from applications in digital libraries and web search, name disambiguation also extends to the problem of word sense disambiguation. For example, when users in the US search for the word "Football", the results differ from those for the same search in the UK, because the term refers to two different sports in these two countries. Powerset [3], a search engine company recently acquired by Microsoft, enables computers to understand human language. By exploring Wikipedia [4], the system generates results that can reasonably answer the search query (see Fig. 3). However, the effect of Powerset is quite limited, because the knowledge behind the system is based on Wikipedia only and it can currently only handle simple questions about famous people, such as "Who is Bill Gates?". For a common name like "Wei Wang", there is still plenty of room for improvement.

[1] http://www.informatik.uni-trier.de/~ley/db/
[2] http://clusty.com/
[3] http://www.powerset.com
[4] http://en.wikipedia.org/wiki/Wiki


Fig. 1 DBLP records of author Wei Wang

Fig. 2 Search results in Clusty

Conflict of interest detection is another area related to name disambiguation: the problem of family or friendship relationships giving access to confidential information. For example, reviewers in a scientific peer review process may be in conflict situations if there are social relations between the reviewers and the authors (Aleman-Meza et al. 2006). Attributes like co-authorship must be used with corresponding name disambiguation methods in order to detect such conflicts. Name disambiguation is also exceedingly important for organizations like banks and taxation offices to determine whether two persons are actually the same person. In the general process of opening a bank account in Australia, people usually need to pass a 100-point identification check, which means they need to provide certified documents with valid point values, such as a passport and birth certificate, to establish an account. However, people can select different documents from the eligible list to pass the identification check. For example, a person can use a passport to open one account and later use a birth certificate to open another. This makes it difficult for banks to track the credit of such people, especially when they open accounts in different banks.


Fig. 3 Search results in Powerset

Research goal and challenges

Name disambiguation has been studied in several areas with particular emphases. Our work focuses on disambiguation in large databases, e.g. DBLP, because the data there is simple and clean but its information is often not sufficient. Figure 4 is a sample of citation records in DBLP, in which only minimal information such as author names, paper titles and publication venues is provided. Therefore the framework we propose needs to be able to handle this kind of situation. In the case of DBLP, the research goal can be defined as follows:

Definition 1  Given a list of citations C, where each citation c_i ∈ C is a bibliographic reference containing at least a list of author names A_i = {a_1, a_2, ..., a_n}, a work title t_i, and a publication year y_i, the goal of our approach is to find the set U = {u_1, u_2, ..., u_k} of unique authors and attribute each citation record c_i to the corresponding author u_j ∈ U.

Apparently, we need sufficient information to properly attribute each record to the corresponding person. As discussed above, since the information in DBLP is very limited, we first need to retrieve more information, and the method for gaining it must be domain independent, i.e. applicable not only to DBLP but also to any other similar domain. Two methods are currently widely investigated by researchers. One is to retrieve information from other resources, e.g. web pages. The challenge is to identify which pages are useful and which are noise. For example, personal homepages are definitely useful for name disambiguation, as many researchers list their publications on their homepages. Therefore, a model to identify the genre of a web page is needed. The other is to explore the linking information in the data, as relative information among objects in a graph format can improve the quality of disambiguation.
The challenge in using this method is how to compute the linking similarity among citations. The details of how we use linking information and combine it with web information in our name disambiguation framework to achieve the research goal are given in section "Hybrid name disambiguation framework".

Our contributions

We use DBLP as a case study in this paper because its data is clean but provides little information. We highlight the following contributions of this paper, which aims to propose a robust hybrid name disambiguation framework for large databases:
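For concreteness, the input and output of the research goal in Definition 1 can be sketched as plain data structures. This is only an illustrative sketch: the citation records and author labels below are invented, not real DBLP entries.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    authors: list[str]  # A_i = {a_1, ..., a_n}
    title: str          # t_i
    year: int           # y_i

# Input: citation records sharing the ambiguous name "Wei Wang".
citations = [
    Citation(["Wei Wang", "J. Li"], "Mining frequent subgraphs", 2004),
    Citation(["Wei Wang", "J. Li"], "Graph indexing at scale", 2006),
    Citation(["Wei Wang", "K. Chen"], "Protein sequence alignment", 2005),
]

# Desired output: a mapping that attributes each record index to a
# unique author u_j in U.
assignment = {0: "u1", 1: "u1", 2: "u2"}
```

The disambiguation task is then to recover such an assignment from the records alone, given that all three records carry the same ambiguous name.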


Fig. 4 Citation sample in DBLP

(1) We propose a web page genre identification component to identify the genre of a web page, e.g. whether the page is a personal homepage. The component contains three models, a neural network model, a traffic model and a Genre-SVM model, which can be used together or separately. We implemented a multiple models combination (MMC) method to efficiently combine the three models.
(2) We propose a re-clustering model as the last step of the framework, which contains unsupervised and supervised methods and combines the strengths of both types, namely the scalability and robustness of the former and the accuracy of the latter. Based on multidimensional scaling (MDS), the re-clustering model can group those records which cannot be merged directly through co-authorship and web information.
(3) Extensive experiments on real corpora are performed to demonstrate the efficiency and effectiveness of our web page genre identification component and the whole name disambiguation framework.

Organization of the paper

The paper is organized as follows. Section "Related work" presents related work on the disambiguation problem, in particular supervised and unsupervised learning methods for author name disambiguation in digital libraries. Section "Hybrid name disambiguation framework" presents a hybrid and scalable name disambiguation framework that integrates data linking and web information with a re-clustering mechanism. We then evaluate the key components of the framework on different corpora; the details are given in section "Experiments". Conclusions and future research directions are outlined in section "Conclusions and future research directions".

Related work

This section discusses recent works related to our research and the methods used in the proposed framework. Name disambiguation is relevant to many applications, as described before. It is essential not only for digital libraries, but also for applications referring to people or, more generally, for named entity resolution. The following review of previous works is presented in chronological order.


Han et al. (2004) proposed a supervised learning framework (using SVM and a Naive Bayes model) to solve the name disambiguation problem. Although the accuracy of this method is high, it relies heavily on the quality of the training data, which is difficult to obtain. In addition, this method assembles multiple attributes into a training network without any logical order, which may deteriorate the overall performance of the framework. They also proposed an unsupervised learning approach using K-way spectral clustering (Han et al. 2005). Although this method is fast, it depends heavily on the initial partition and the order in which each data point is processed. Furthermore, this clustering method may not work well when there are many ambiguous authors in the dataset. Dongwen et al. (2005) proposed a 3-tuple vector space model. In their model, the similarity between a citation c and all citations in a cluster a is estimated as the similarity between a 3-tuple representation of c and that of a:

  sim(c, a) = α·sim(c_c, a_c) + β·sim(c_t, a_t) + γ·sim(c_v, a_v),  where α + β + γ = 1

Here α, β and γ are weighting factors; c_c, c_t and c_v are token vectors of the co-authors, paper title and venue, respectively, of the citation c, and a_c, a_t and a_v are token vectors of the co-authors, paper titles and venues from all citations of the cluster a. Each similarity between two token vectors can in turn be estimated using cosine similarity. This model is very fast and easy to implement, but it does not work well when two citation records share few common terms, given the limited information in citation records. Huang et al. (2006) proposed a fast name disambiguation approach based on DBSCAN that can solve the transitivity problem in most cases.
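As a rough sketch of the 3-tuple model described above (not the original authors' implementation), the weighted combination of cosine similarities might look as follows; the weights 0.4/0.3/0.3 and the token lists are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def tuple_sim(c, a, alpha=0.4, beta=0.3, gamma=0.3):
    """sim(c, a) = alpha*sim(c_c, a_c) + beta*sim(c_t, a_t) + gamma*sim(c_v, a_v).
    c and a are (coauthor, title, venue) token lists; the weights sum to 1."""
    return sum(w * cosine(Counter(cf), Counter(af))
               for w, cf, af in zip((alpha, beta, gamma), c, a))

# Invented example: one citation compared against a cluster's pooled tokens.
citation = (["j", "smith"], ["graph", "mining"], ["kdd"])
cluster = (["j", "smith", "b", "lee"], ["graph", "clustering"], ["kdd"])
score = tuple_sim(citation, cluster)
```

Because each component is a cosine in [0, 1] and the weights sum to 1, the combined score also lies in [0, 1], which makes thresholding straightforward.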
The transitivity problem appears often in record linking: in a triad (o, p, q), point o is coreferent with p, and p with q, while o is not coreferent with q. This is an inconsistent condition, since co-reference should be transitive, and is due to errors in metadata extraction, imperfect similarity metrics and mis-classification. DBSCAN can model clusters of arbitrary shape and delimit clusters in a way that is more intuitive to human interpretation. However, their approach still relies on supervised learning methods like SVM to determine the distance between two records, which makes its results heavily dependent on the quality of the training data. Tan et al. (2006) presented an approach that analyzes the results of automatically crafted web searches and proposed a model called inverse host frequency (IHF). They used the URLs returned from a search engine as features to disambiguate author names. They did not exploit the content of the documents returned by the queries, which may mislead the results, since many returned URLs do not refer to pages containing publications, or the pages are not those of a single author. Yin et al. (2007) proposed a semi-supervised learning method which uses SVM to train different linkage weights via a graph in order to distinguish objects with identical names. In general, this approach may obtain sound results. Yet it is quite difficult to generate appropriate linkage weights if the linkages are shared by many different authors. Song et al. (2007) used a topic-based model to solve the name disambiguation problem. Two hierarchical Bayesian models, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), are used. In general, it achieves better experimental results than the other approaches. Unfortunately, this approach is difficult to implement because it requires much labor-intensive pre-processing and relies on some special assumptions. Their work has been further extended with web information in Yang et al. (2008). In addition, an approach was proposed in Kalashnikov and Mehrotra (2006) that analyzes not only features but also inter-relationships among objects in a graph format to improve the disambiguation quality.


Based on these ideas, the authors in Zhu et al. (2009) proposed a taxonomy-based clustering model that can enrich the information among citation records. Besides basic metadata, some approaches use additional information obtained from the web in order to improve the accuracy of clustering results. Kang et al. (2009) explored the net effects of co-authorship on author name disambiguation and used a web-assisted technique of acquiring implicit co-authors of the target author to be disambiguated. Pairs of names are submitted as queries to search engines to retrieve documents that contain both author names, and these documents are scanned to extract new author names as co-authors of the original pair. However, their work cannot identify which pages are personal pages. To solve this issue, Pereira et al. (2009) proposed a model that exploits web content to identify whether a web page is a personal page. Their approach uses only the text in the <title> tag or the first 20 lines of the page, which is not sufficient, because some authors put their emails at the bottom of their pages. Wu and Ding (2013) proposed a recursive reinforced method for name disambiguation. Although their method achieves impressive results, they used the Thomson Reuters Web of Science, which contains much more information than DBLP, e.g. affiliation information. In other words, their method is not guaranteed to work in other databases like DBLP.

Hybrid name disambiguation framework

In this section, we propose a framework that not only uses the information employed by traditional name disambiguation approaches, e.g. co-authorship, but also applies web pages as a source of additional information. The framework leverages a traditional hierarchical clustering method with a re-clustering mechanism to further process those citation records which cannot be found in any web pages, grouping them together if they refer to one person. Our web page genre identification component involves a neural network, traffic rank analysis and our GenreSim-SVM classifier to reduce the cost of web information extraction while minimizing the risk of records being merged incorrectly. The framework is not only suitable for name disambiguation in digital libraries but can also be applied to other name ambiguity related issues.

Proposed framework

Given a list of citations C, where each citation c_i ∈ C is a bibliographic reference containing at least a list of author names A_i = {a_1, a_2, ..., a_n}, a work title t_i, and a publication year y_i, the goal of our approach is to find the set U = {u_1, u_2, ..., u_k} of unique authors and attribute each citation record c_i to the corresponding author u_j ∈ U. Initially, the dataset is a set of citation records that share the same author name, and each citation record is put into its own cluster. The steps to achieve the goal are summarized below:

(1) First, we pre-cluster the records by co-authorship using the hierarchical agglomerative clustering (HAC) method (Sibson 1973), because co-authorship is strong evidence for disambiguating records, as used in many previous approaches. Our clustering criterion is that any citation record in a group must have at least two co-authors in common with at least one other record in the group. Records that cannot be grouped are each left in their own cluster.
(2) We query the records to a search engine, e.g. Google, ordered by publication year within the dataset generated in step (1), because later publications are more likely to be listed on an author's homepage or captured by the search engine.
(3) For the list of pages returned from the search engine, we design a component that identifies whether each page is a personal page. If a record can be found in a personal page, we should also be able to find some other records in the same page, and they should belong to this person. There are three models in this component.
(4) After step (3), there might be some records not belonging to any group because they cannot be found in any web pages. We implement a re-clustering model to handle these records and group them with each other or with other groups if possible. In the output dataset, each cluster is expected to represent a distinct author.
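The pre-clustering in step (1) can be sketched with a simplified greedy grouping pass; this is a stand-in for the HAC procedure the framework actually uses, with invented record data, where each record is represented only by its author-name list.

```python
def shared_coauthors(r1, r2, target):
    """Co-authors (excluding the ambiguous target name) common to two records."""
    return (set(r1) - {target}) & (set(r2) - {target})

def pre_cluster(records, target, min_shared=2):
    """Greedy agglomerative grouping: a record joins a cluster if it shares at
    least `min_shared` co-authors with some record already in that cluster;
    otherwise it starts its own singleton cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if any(len(shared_coauthors(rec, other, target)) >= min_shared
                   for other in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

# Invented records for the ambiguous name "Wei Wang": the first two share
# co-authors A and B, so they merge; the third stays alone.
records = [
    ["Wei Wang", "A", "B"],
    ["Wei Wang", "A", "B", "C"],
    ["Wei Wang", "D"],
]
clusters = pre_cluster(records, "Wei Wang")
```

A single greedy pass like this depends on record order; the HAC method cited in the text avoids that by merging the closest pair of clusters iteratively.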

From the above steps, the key components of our approach are the web page genre identification in step (3) and the re-clustering model in step (4). The details of these two components are discussed in the following sections. Figure 5 shows the diagram of our hybrid name disambiguation framework.

Web pages genre identification component

According to our observation, two types of web pages are returned from the search engine when we use the author name plus the publication title as query keywords: personal homepages and non-personal pages like DBLP. Obviously, we are only interested in the former, because finding multiple entries in a personal page is sufficient evidence to identify an author. The challenge is therefore to identify which pages are of the personal genre. We implement a web page identification component for our framework which includes three different models, Neural Network (NN), Traffic Rank (TR) and GenreSim-SVM, based on the idea in Kennedy and Shepherd (2005). A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use (Haykin 1999). Since we only have two target categories, personal pages and non-personal pages, we decided to develop a single classifier with multiple target categories rather than a separate classifier for each target category, because it requires less training. We use the open source package Neuroph [5] to implement the neural network as a multilayer perceptron, since our problem is not linearly separable, and use back propagation to adjust the weights of the edges coming from the input layer. Since we use Weka [6] to train the network, we apply the default settings in Weka: the learning rate is 0.3, the momentum rate is 0.2, the number of epochs is 500, and the number of nodes in each layer is (attributes + classes)/2.
The technical details, including feature selection and parameter optimization, can be found in Kennedy and Shepherd (2005) and Koehler et al. (2010). The NN model requires reading the page contents, which the classifier then evaluates. It incurs a very high cost when there are millions of entries to identify, as each page associated with an entry might contain information of megabyte

[5] http://neuroph.sourceforge.net/
[6] http://www.cs.waikato.ac.nz/ml/weka/


Fig. 5 Hybrid name disambiguation framework diagram

size. Therefore, we propose a traffic rank (TR) model to perform web page identification without going inside each page, as designed in our earlier work (Zhu et al. 2010). In addition, we also proposed a neighboring pages selection model called GenreSim, and we use the information generated by GenreSim as features for a classifier to identify web pages. In our approach, we use a support vector machine (SVM) (Pedro and Pazzani 1997) as the classifier, and we call this model GenreSim-SVM. The details about GenreSim can be found in our previous work (Zhu et al. 2011). All three models can be used separately or together to identify whether a page is a personal page. When mixing the three models, we propose an MMC method to combine them, because the intuition is that a combination of homogeneous models using heterogeneous features can improve the final result (Orrite et al. 2008). Assuming each model produces a unique decision regarding the identity of each page p in the test dataset, we compare the results among all three models, and the final output depends on the reliability of the decision confidences delivered by the participating classifiers. We apply the concept of a Decision Template (DT) to avoid the case where models make independent errors (Kuncheva et al. 2001) and to calculate the confidence score. Assume each of these models produces the output O_i(p) = [d_i1(p), ..., d_i|G|(p)], where d_ij(p) is the membership degree given by model M_i that a web page p belongs to genre j, j ∈ |G|. The outputs of all models can be represented by a decision matrix DP as follows:

          | d_11(p)  ...  d_1|G|(p) |
          | d_21(p)  ...  d_2|G|(p) |
  DP(p) = | d_31(p)  ...  d_3|G|(p) |
          | ...      ...  ...       |
          | d_N1(p)  ...  d_N|G|(p) |


Using the training set T = {T_1, ..., T_X} for each genre, with X being the number of genres, the membership degree d_ij(p) is calculated as follows:

  d_ij(p) = Ind(T_j, i) / |G|        (1)

where Ind(T_j, i) is an indicator function with value 1 if the output of model M_i based on training set T_j is genre j, and 0 otherwise. At this stage, we have the membership degree of each page p belonging to a genre j, stored in the matrix DP(p). We then calculate the confidence score Confidencescore_j(p) of page p using various rules over DP(p) for each genre j and pick the genres with the highest confidence scores. With N being the number of models, we apply the product rule below to the matrix to account for the diversity among multiple models, and pick the genre with the highest score as the final result:

  Confidencescore_j(p) = ∏_{i=1}^{N} d_ij(p)        (2)
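A minimal sketch of the product rule in Eq. (2), assuming a decision matrix with three models and two genres (personal vs. non-personal page) filled with invented membership degrees:

```python
# Decision matrix DP(p): one row per model (NN, TR, GenreSim-SVM), one
# column per genre (personal page, non-personal page). The membership
# degrees below are illustrative, not measured values.
DP = [
    [0.8, 0.2],  # model M1
    [0.6, 0.4],  # model M2
    [0.7, 0.3],  # model M3
]

def product_rule(dp):
    """Eq. (2): the combined confidence of genre j is the product of the
    membership degrees d_ij over all N models."""
    scores = []
    for j in range(len(dp[0])):
        score = 1.0
        for row in dp:
            score *= row[j]
        scores.append(score)
    return scores

confidence = product_rule(DP)                   # genre-wise products
best_genre = confidence.index(max(confidence))  # index of the winning genre
```

With these sample degrees the product rule yields 0.8 · 0.6 · 0.7 = 0.336 for the personal-page genre versus 0.024 for the other, so the page would be labeled personal; a single dissenting model with a degree near zero would veto the genre, which is the characteristic behavior of the product rule.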

We note that one author might have more than one citation record in one personal page and other citation records in another personal page, being represented as two distinct authors, which our approach cannot fully handle. However, this situation is very rare and does not exist in our test dataset, so we do not consider this scenario in this paper.

Re-clustering model

As described in section "Proposed framework", some records cannot be found in any web pages, although they might in fact be groupable with other records. For example, some authors list only those publications that appeared in famous conferences or journals on their homepages, and some authors do not have a homepage at all. Another reason is that we use only the first 3 web pages returned from the web page genre identification component, to make sure only personal pages are used; more details are given in section "Experiments". After being pre-processed and clustered according to co-authorship and the information in the personal pages identified by the web page identification models, the dataset is split into two parts. The first part, which we call D1, consists of the clustered citation records. The second part, which we call D2, consists of the records that cannot be found together with any other citation records in web pages and do not share at least two co-authors with other records. The records in D1 have at least two co-authors in common with another record in their group, or they are found in the same page, which means they are successfully grouped and each group represents a distinct author. In other words, the records in D1 are solidly grouped. Therefore, we use D1 as a reference to group the citation records in D2 where possible, since relative information among objects in a graph format can improve the quality of disambiguation, as discussed in section "Related work", for instance in Yin and Han (2007) and Song et al. (2007).
We propose a graph-based re-clustering model by extending the SA-Cluster model introduced in Zhou et al. (2009) with MDS (Borg and Groenen 2005) to measure the similarity among objects. There are several cases for the citation records in D2: (1) the record has no co-authors; (2) the record has one or more co-authors but does not


have the same co-authors with any other records; (3) the record has one or more co-authors and shares a co-author with at least one other record. For cases (1) and (2), since there is no co-authorship relation that can be used, we assign a topic T to each citation record according to the taxonomy we built and the method we used in Zhu et al. (2005) to establish links among these records. We formalize the model as follows. We build a graph G using all citation records. Each vertex v represents a citation record, and each edge links vertices that have the same co-authors or are of the same topic. If many paths connect two vertices v_i and v_j, then they are close to each other, which means it is likely that the two records belong to the same person. Therefore, we transform the graph into a similarity matrix using the MDS algorithm. The goal of MDS is to detect meaningful underlying dimensions that explain observed similarities between objects. In our case, we set up two dimensions for the matrix: one is co-authorship, the other is topic relationship. We use the reciprocal of the Euclidean metric to calculate the distance d(v_i, v_j), which represents the similarity between vertices v_i and v_j, as shown in Eq. (3):

  d(v_i, v_j) = 1 / sqrt( (N_c / (K_c + 1))^2 + (N_t / (K_t + 1))^2 )        (3)

where N_c and N_t are the numbers of co-author and topic links on the paths between v_i and v_j, and K_c and K_t are the numbers of vertices on the paths between v_i and v_j via co-authorship and topics, respectively. For example, if there is only one vertex v between v_i and v_j, with three co-authorship links from v_i to v and two topic links from v to v_j, then d(v_i, v_j) = 1 / sqrt(1.5^2 + 1^2) ≈ 0.55. Notice that even if there is no vertex on the paths (i.e., the records are directly connected), either N_c or N_t is greater than or equal to 1, so the denominator of the equation is never zero. Table 1 shows a sample similarity matrix based on Euclidean distance. Obviously, the smaller the value, the greater the correspondence between two records. From the table we can see that records A and B are closest and should be grouped together. As mentioned before, each group in D1 represents a distinct author, so each citation record in D2 must reach a high threshold before it can be merged into any group in D1, to preserve clustering quality. The closeness of a record v_i in D2 to a group g_j in D1 can be estimated by:

  d(v_i, g_j) = Σ_{v_k ∈ g_j} d(v_i, v_k)        (4)
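The distance of Eq. (3) and the worked example above can be checked with a few lines of code; the function below is a direct transcription of the formula, not the authors' implementation.

```python
from math import sqrt

def mds_distance(n_c, n_t, k_c, k_t):
    """Eq. (3): reciprocal Euclidean distance over the co-authorship and
    topic dimensions between two vertices.
    n_c, n_t: numbers of co-author and topic links on the connecting paths.
    k_c, k_t: numbers of intermediate vertices on those paths."""
    return 1.0 / sqrt((n_c / (k_c + 1)) ** 2 + (n_t / (k_t + 1)) ** 2)

# Worked example from the text: one intermediate vertex (K_c = K_t = 1),
# three co-authorship links (N_c = 3) and two topic links (N_t = 2):
d = mds_distance(3, 2, 1, 1)  # 1 / sqrt(1.5^2 + 1^2), roughly 0.55
```

Since at least one of N_c and N_t is ≥ 1 whenever two vertices are connected, the argument of the square root is strictly positive and the function never divides by zero.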

Assuming there are k records in a group g_j of D1, the mean distance value d_m(v_i, g_j) is:

  d_m(v_i, g_j) = d(v_i, g_j) / k        (5)

We then define the criterion for merging record v_i into g_j: the number of records satisfying d(v_i, v_k) < d_m(v_i, g_j) must be greater than k/X, where X is a parameter controlling the threshold. For example, X = 2 means that more than half of the records in g_j must be closer to v_i than the mean distance value. We evaluate various values of X in section "Experiments", and the implementation of the re-clustering model is included in Algorithm 1, the clustering algorithm for the framework.
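The merging criterion built from Eqs. (4) and (5) can be sketched as follows; `dist` is a hypothetical callable returning the pairwise d(v_i, v_k) values, and the criterion is implemented literally as stated in the text.

```python
def should_merge(v, group, dist, X=2):
    """Merge record v into group g when the number of records v_k in g with
    d(v, v_k) < d_m(v, g) exceeds k/X, where d_m is the mean pairwise
    distance (Eqs. 4-5) and k is the group size."""
    dists = [dist(v, vk) for vk in group]
    k = len(group)
    d_mean = sum(dists) / k                      # d_m(v, g), Eq. (5)
    closer = sum(1 for d in dists if d < d_mean)  # records below the mean
    return closer > k / X

# Invented pairwise distances from candidate record 0 to group members 1-3:
pair = {1: 0.1, 2: 0.2, 3: 0.9}
merge = should_merge(0, [1, 2, 3], lambda v, vk: pair[vk])
```

With X = 2 and distances (0.1, 0.2, 0.9), the mean is 0.4 and two of three members lie below it, so 2 > 3/2 and the record is merged.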


Table 1 Sample of similarity matrix based on MDS

Record   A      B      C      D      E
A        0      0.1    0.2    0.25   0.33
B        0.1    0      0.15   0.4    0.55
C        0.2    0.15   0      0.68   0.46
D        0.25   0.4    0.68   0      0.73
E        0.33   0.55   0.46   0.73   0

Initially, each citation is put into its own cluster and pushed into a dataset D. The PreClustering function clusters the input dataset by co-authorship, since co-authorship is strong evidence for disambiguating records: we group together citation records that have at least two co-authors in common, and sort the dataset in descending order by publication year. Then, for each citation record, we query a search engine, e.g. Google, using the author name plus the paper title as the search string. If any page returned in the search results is a personal page, the CheckAndMerge function checks, using regular expressions, whether any other citation records from other clusters can be found in this page, groups those records into one cluster, and returns them as dataset D'. If D' is not null, then D' is split from



D, add D′ to the result set, and rerun the process for the remaining citation records. Ideally, the records already clustered should also be included in the next iteration. However, according to our observation, a distinct author usually has only one homepage, so for performance reasons we do not include them when rerunning the process. The process repeats until no more clusters can be grouped. The key function of this algorithm is isPersonalPage. By calling this function we use the models discussed above to identify pages. These models can be used in combination or separately. Since this function is critical, we use only the first 3 web pages to make sure the identification is correct, as described earlier. We then pass the ResultSet to the ReClustering function, which implements the re-clustering model introduced in this section. The ReClustering function includes graph construction, distance calculation, graph re-construction if any records are grouped, etc.

Experiments

We have introduced the framework for name disambiguation in the previous sections. In this section, we discuss the evaluation results for the key components of this framework, i.e., the corpus used for evaluation, the evaluation metrics, and the evaluation results.

Corpus and evaluation metrics

To evaluate our approach, we compared it against methods based on supervised learning and unsupervised learning. We performed evaluations on a dataset extracted from DBLP and used by Han et al. (2005). The dataset is composed of 8,442 citation records with 480 distinct authors, divided into 14 ambiguous groups, as shown in Table 2. We submitted queries using the Google Search API [7] and collected the first 8 web pages from each query due to API restrictions. We evaluated our approach based on the standard pairwise F1 metric.
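As a minimal sketch (our own illustration, not the evaluation code used in the paper), pairwise F1 between a gold clustering and a predicted clustering can be computed by enumerating co-clustered record pairs:

```python
def pairwise_f1(true_clusters, predicted_clusters):
    """Pairwise precision, recall, and F1 over co-clustered record pairs."""
    def pairs(clusters):
        # Every unordered pair of records that share a cluster.
        return {
            (a, b)
            for cluster in clusters
            for a in cluster for b in cluster
            if a < b
        }

    truth, pred = pairs(true_clusters), pairs(predicted_clusters)
    tp = len(truth & pred)   # pairs correctly grouped
    fp = len(pred - truth)   # pairs grouped but belonging to different authors
    fn = len(truth - pred)   # pairs that should be grouped but were not

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```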
F1 is defined as the harmonic mean of pairwise precision and pairwise recall, where pairwise precision is the number of true positives divided by the total number of elements labeled as belonging to the positive class, and pairwise recall is the number of true positives divided by the total number of elements that actually belong to the positive class:

\[ Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \]

In our case, TP (True Positive) is the number of citation record pairs correctly grouped, FP (False Positive) is the number of citation record pairs grouped incorrectly, and FN (False Negative) is the number of citation record pairs that should have been grouped together but were not.

Evaluation results of web pages genre identification

We first evaluated the performance of the web pages genre component. As shown in Figs. 6, 7 and 8, we tested the top K pages returned from each model for each query to check whether the page is a personal home page, as discussed in section ''Hybrid name disambiguation framework''. In the NN model, we set the threshold for the output value to 0.6, which means that if the final weight of a page produced by the neural network is greater than 0.6, it is considered a personal page. In the TR model, we set the β value to 10 times the average

[7] http://code.google.com/apis/ajaxsearch/.


Table 2 Evaluation dataset

Name          Num. authors    Num. citations
A. Gupta      26              577
A. Kumar      14              244
C. Chen       61              800
D. Johnson    15              368
J. Lee        100             1,417
J. Martin     16              112
J. Robinson   12              171
J. Smith      31              927
K. Tanaka     10              280
M. Brown      13              153
M. Jones      13              259
M. Miller     12              412
S. Lee        86              1,458
Y. Chen       71              1,264

traffic rank of the top 10 digital libraries. For those pages whose traffic rank is lower than β, and which might therefore be personal pages, we set the θ value to 0.001, as discussed in Zhu et al. (2010). We notice that only the MMC method maintains 100 % accuracy when K = 3, and that accuracy decreases as the K value increases in all cases. Therefore, we only use the top 3 pages to identify whether they are authors' personal home pages. In addition, we notice that the TR model performs worse than the other models, though it is faster. We will discuss this further in a later section.

Evaluation results of name disambiguation

As the results in Fig. 9 show for the 14 groups of ambiguous authors, we evaluated four models in our approach: the Neural Network (NN) model, the Traffic Rank (TR) model, the GenreSim-SVM model, and the MMC model. For all citations in the dataset, the query keywords are the quoted author name plus the paper title. The results show that the NN model achieves a better F1 value than the TR model, because each page is passed to the classifier we implemented so that the information of each page is validated, while the TR model only uses the traffic ranking and the keywords of the page to determine whether it is a personal page. Therefore, some personal pages might be missed, which impacts the final results. Additionally, GenreSim-SVM performs better than the NN model, as it is designed specifically for web pages genre identification, as discussed in Zhu et al. (2011). We also compared the results between MMC and the individual models. From the diagram, we find that MMC improves F1 by around 6 % on average, particularly over the TR model, though not in all cases. This means MMC can improve the performance of web pages genre identification, and thereby the quality of the whole name disambiguation framework.

Comparison with baseline methods

Our methods were compared with three baseline methods: the WAD method, which uses rules to identify web pages (Pereira et al.
2009), the k-way spectral clustering based method only



Fig. 6 Accuracy of web pages genre identification component, K = 3

Fig. 7 Accuracy of web pages genre identification component, K = 5

using only citation metadata information in DBLP (KWAY) (Han et al. 2005), and the web-based method using hierarchical agglomerative clustering (URL), which only uses URL information (Tan et al. 2006). We reimplemented them based on the latest information retrieved from the search engine. Figure 10 shows the comparison of F1 values between the MMC model and the WAD model. In some cases in the dataset, both models have the same F1 value, meaning there is no difference between our web pages identification model and the rules in the WAD model for these authors. However, there are large improvements in some cases, ''C. Chen'' for example. The reason is that our method not only analyzes the URL information and the first 20 lines of the article, but also goes through the information in the page, so that some pages missed by WAD are identified by our model.



Fig. 8 Accuracy of web pages genre identification component, K = 8

Fig. 9 F1 value of name disambiguation based on different models

We also compared the F1 values with the other models proposed in Pereira et al. (2009). Figure 11 lists the average F1 value for all models. As we can see, the KWAY method is the lowest, since it only uses basic metadata information, which is not sufficient to disambiguate authors. The URL method is slightly better because information in the URL is involved and analyzed. The WAD method is the best among the existing works and performs better than our TR model.

Comparison of various X values

As described in section ''Hybrid name disambiguation framework'', we set the parameter X = 2 in the re-clustering for all models in the evaluations above. However, we have also evaluated different X values, and found that the F1 value decreases as the X value increases, which means the threshold for determining whether two records should



Fig. 10 Comparison of F1 values with WAD

Fig. 11 Comparison of average F1 values

be grouped is too low. Figure 12 shows the outcomes for various X values; performance drops in many cases.

Performance comparison

Since we consider this a real-time application, as described earlier, the disambiguation process should be operational in a real environment. As shown in Fig. 13, we evaluated the disambiguation speed of all four models for different numbers of records, randomly picked from the test dataset. The TR model is the fastest: it takes only 12 s to disambiguate 100 records and less than 10 min for 8,000 records. The MMC model is the ideal one: it takes a reasonable time to perform the process, but its results are much better than those of the other models according to the evaluations above. Though it is the slowest, it is only slightly slower than the NN model, which takes nearly 30 min to run through 8,000 records.



Fig. 12 Comparison of average F1 values when X = 2, 3, 4

Fig. 13 Comparison of disambiguation speed (seconds)

The testing was performed on a laptop with a 1.83 GHz CPU and 2 GB of memory, using a Mobile 3G Internet connection with a theoretical speed of 54 Mbps. Though the experiments show that the disambiguation speed is not very fast, considering that most authors have fewer than 500 records and that more powerful machines and faster connections are available in practice, our solution can be implemented as a real-world application, with page indexing if possible.

Conclusions and future research directions

There had been little work on personal name disambiguation using information from identified web pages before our study. This paper focuses on proposing a robust name disambiguation framework that is not only usable for digital libraries but can also be easily



extended to other applications involving the name ambiguity issue. As the main target of this research, we also included a pilot study of robust clustering for name disambiguation in digital libraries. The main contribution of this paper is a hybrid name disambiguation framework that integrates different pieces of information based on web context and data linking. The framework works consistently well for personal names with varying ambiguity across different datasets. We also evaluated our name disambiguation approaches on known corpora, and the results indicated that our proposed framework is feasible. Although this work was developed mainly for digital libraries, the methodologies are also applicable to other scenarios. For instance, there is a great deal of interest in building structured databases and community networks from unstructured information for various purposes. The techniques developed in this work, based on data mining and machine learning, can be applied to these requirements and can integrate information easily. Regarding our future research directions, we are going to focus on investigating the task of name disambiguation in popular social networks, e.g. Facebook and Twitter, which also involve the name ambiguity issue.

References

Aleman-Meza, B., Nagarajan, M., & Ramakrishnan, C. (2006). Semantic analytics on social networks: Experiences in addressing the problem of conflict of interest detection. World Wide Web Conference (pp. 407–416).
Borg, I., & Groenen, P. (2005). Modern multidimensional scaling: Theory and applications (pp. 207–212). New York: Springer.
Canas, A. J., Valerio, A., Lalinde-Pulido, J., Carvalho, M. M., & Arguedas, M. (2003). Using WordNet for word sense disambiguation to support concept map construction. International Symposium on String Processing and Information Retrieval (pp. 350–359).
Lee, D., On, B.-W., Kang, J., & Park, S. (2005). Effective and scalable solutions for mixed and split citation problems in digital libraries. Proceedings of the 2nd International Workshop on Information Quality in Information Systems (pp. 69–76).
Han, H., Giles, C. L., & Hong, Y. Z. (2004). Two supervised learning approaches for name disambiguation in author citations. Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 296–305).
Han, H., Zhang, H., & Giles, C. L. (2005). Name disambiguation in author citations using a k-way spectral clustering method. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 334–343).
Haykin, S. (1999). Neural networks: A comprehensive foundation.
Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large scale databases. Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 536–544).
Kalashnikov, D. V., & Mehrotra, S. (2006). Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2), 716–767.
Kang, I. S., Na, S. H., Lee, S., Jung, H., Kim, P., Sung, W. K., & Lee, J. H. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97.
Kennedy, A., & Shepherd, M. (2005). Automatic identification of home pages on the web. Proceedings of the 38th Annual Hawaii International Conference on System Sciences (pp. 99–108).
Koehler, H., Zhou, X., Sadiq, S., Shu, Y., & Taylor, K. (2010). Sampling dirty data for matching attributes. SIGMOD (pp. 63–74).
Kuncheva, L. I., Bezdek, J. C., & Duin, R. P. (2001). Decision templates for multiple classifier fusion. Pattern Recognition, 34(2), 299–314.
Orrite, C., Rodriguez, M., Martinez, F., & Fairhurst, M. (2008). Classifier ensemble generation for the majority vote rule. Progress in Pattern Recognition, Image Analysis and Applications (pp. 340–347).


Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), 103–137.
Pereira, D. A., Ribeiro-Neto, B., Ziviani, N., Laender, A. H. F., Gonçalves, M. A., & Ferreira, A. A. (2009). Using web information for author name disambiguation. Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 49–58).
Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30–34.
Song, Y., Huang, J., Councill, I. G., Li, J., & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 342–352).
Tan, Y. F., Kan, M. Y., & Lee, D. W. (2006). Search engine driven author disambiguation. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 314–315).
Wu, J., & Ding, X. (2013). Author name disambiguation in scientific collaboration and mobility cases. Scientometrics, 683–697.
Yang, K. H., Peng, H. T., Jiang, J. Y., Lee, H. M., & Ho, J. H. (2008). Author name disambiguation for citations using topic and web correlation. Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries (pp. 185–196).
Yin, X. X., & Han, J. W. (2007). Object distinction: Distinguishing objects with identical names. IEEE 23rd International Conference on Data Engineering (pp. 1242–1246).
Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment (pp. 718–729).
Zhu, J., Fung, G., & Zhou, X. (2010). Efficient web pages identification for entity resolution. Proceedings of the 19th International World Wide Web Conference (pp. 1223–1224).
Zhu, J., Fung, G. P. C., & Zhou, X. F. (2009). A term-based driven clustering approach for name disambiguation. Proceedings of the Joint Conference on APWeb/WAIM (pp. 320–331).
Zhu, J., Zhou, X. F., & Fung, G. (2011). Enhance web pages genre identification using neighboring pages. WISE (pp. 282–289).
