Fuzzy Classification Scheme Mapping for Decision Making Research-in-Progress

Wei Du City University of Hong Kong Tat Chee Avenue, Kowloon, Hong Kong SAR [email protected]

Wei Xu Renmin University of China Haidian District, Beijing, China [email protected]

Hongbing Jiang University of Science and Technology of China Hefei, China [email protected]

Jian Ma City University of Hong Kong Tat Chee Avenue, Kowloon, Hong Kong SAR [email protected]

Abstract Classification schemes organize objects into a hierarchical structure of knowledge by grouping them according to common characteristics. Managers often rely on existing classification schemes for decision making because the schemes spare them the complexity and sheer volume of detailed, bottom-level information. However, a single classification scheme is not enough for certain purposes, and more and more applications call for integrating information from heterogeneous sources. To facilitate interoperability and integration among various sources of information, we propose a fuzzy approach to realize the mapping across their backbones, the classification schemes. The proposed approach defines the fuzzy matching degree by matching information at the descriptor, feature and neighborhood levels. Preliminary results show that managers can easily analyze the relations across heterogeneous classification schemes by visualizing the mapping pattern in the research area. An illustration of expert profiling is also given to validate the role of fuzzy classification scheme mapping in decision making. Keywords: Classification scheme, fuzzy mapping, decision making, expert profiling

Introduction A classification scheme, usually represented as a hierarchical structure, organizes objects by dividing them into classes according to their common characteristics in certain aspects. As relatively stable structures, classification schemes are widely applied in various areas (Barki et al. 1993; Coviello et al. 1997; Loehrlein et al. 2014; Wang et al. 2014); examples include the International Patent Classification (IPC), the Standard Industrial Classification, and the Dewey Decimal Classification (DDC) for libraries. Classification schemes are useful and practical decision-making tools for managers. By conveying the semantics and knowledge of a class to its individuals, a classification scheme provides an efficient way to identify a collection of objects based on their corresponding classes. In practice, patent managers can quickly find documents similar to a patent by resorting to a patent classification scheme, while grant managers refer to their specific grant discipline tree to identify researchers as experts under a discipline. Different or even similar institutions often use different classification schemes to organize information because of their specific responsibilities and focus. Besides, a single information source is often not enough for

Thirty Fifth International Conference on Information Systems, Auckland 2014


Decision Analytics, Big Data, and Visualization

certain purposes, such as seeking external experts or evaluating performance with multi-dimensional heterogeneous information, which calls for integrating information held in heterogeneous classification schemes. However, the complexity and ambiguity of information in heterogeneous classification schemes make it difficult for managers to make decisions by referring to several inconsistent classification schemes. Therefore, to facilitate interoperability and integration among various sources of information, the critical step is to realize the mapping among their backbones, the classification schemes. A similar situation occurs in the research area. Research agents (e.g. funding agencies, literature libraries, universities) build various classification schemes to improve information retrieval efficiency and to facilitate research management (Rafols and Leydesdorff 2009; Wang et al. 2014) in different institutions. We call classification schemes heterogeneous when they classify the research areas of different types of research objects, such as research patents, journals and projects. Relations among these classification schemes exist but have not been quantified. For example, there is a notion that researchers with a specific major prefer to submit papers to journals under certain subject categories, but how to measure this relationship with scientometrics and statistical analysis remains unsolved. Further, practical research management activities often require information integration across heterogeneous classification schemes. For example, expert profiling requires a researcher’s information from multiple dimensions, such as research publications, majors and research projects. To make expert profiles comparable with an existing research area classification, mapping multi-dimensional information into a single dimension is necessary. Last, a uniform basis for comparing classifications can help improve the efficiency of research management. 
Current algorithms in research management typically demand detailed textual analysis of individual information, which leads to low efficiency when the methods are applied in the real world. For certain practical research management activities, high accuracy matters less than speed. By referring to the relations among classes as the knowledge umbrella of individuals, applications based on classification scheme mapping can obtain the expected results. Based on the above, this research addresses two problems: 1) How can heterogeneous classification schemes be mapped? 2) How does classification scheme mapping facilitate decision making? In this research, we propose a fuzzy approach to realize temporal classification scheme mapping over a time period. Fuzzy relations and the composition of fuzzy relations (Cross 2004; Todorov et al. 2011; Zadeh 1965) are introduced to define the “degree of matching” between two classes from different classification schemes, since the relation is not a crisp “match” or “not match”. To define the fuzzy affinity function, we adapt the idea in (Rodríguez and Egenhofer 2003) of integrating three aspects of a class (i.e. descriptor, feature and semantic neighborhoods) to measure the temporal matching degree. At the descriptor level, two different words may be related other than through synonymy, e.g. math and algebra, so simple WordNet-based matching of synonym sets between two word sequences is not enough to capture the semantic similarity between two classes. To solve this problem, a WordNet-based similarity measure, the adapted Lesk algorithm (Banerjee and Pedersen 2002), is introduced to obtain a more accurate similarity between two concepts. Further, the Hungarian Algorithm (Bandaru and Bhavani 2011; Thiagarajan et al. 2008) is introduced to measure the similarity of two terms. 
Unlike the asymmetric similarity measurement in (Rodríguez and Egenhofer 2003), the proposed method is symmetric because it adopts the Symmetric Tversky’s Ratio Model (Jimenez et al. 2013). To validate the proposed approach, we apply it to realize the fuzzy mapping across three selected classification schemes in the research area. Part of the mapping result is compared with experts’ manual mapping, and precision and recall (Powers 2011) are used to evaluate the proposed approach. We also select two other approaches for comparison, and the proposed mapping method is shown to obtain relatively high precision and recall. To illustrate the potential application of fuzzy classification scheme mapping to decision making in research management, researchers are profiled using the mapping results for the purpose of seeking external experts for the National Natural Science Foundation of China (NSFC). The rest of the paper is organized as follows. Related work is reviewed in Section 2. The proposed fuzzy approach is described in detail in Section 3. Part of the fuzzy mapping results across three selected classification schemes in one time period is given and evaluated in Section 4. Section 5 presents a simple


illustration of expert profiling to validate the role of classification scheme mapping in decision-making. The paper concludes with a discussion of the research and future improvements.

Related Work Previous researchers have studied classification scheme mapping for two main purposes: 1) to improve information retrieval efficiency, which is the main function of a classification scheme (Berry et al. 1995; Soergel 1999), and 2) to facilitate information revision (Jones et al. 1993; Svanberg and Heiner-Freiling 2008; Zins and Santos 2011). On the one hand, mapping between classification schemes improves information retrieval efficiency by integrating complementary classification schemes, especially when no universal classification scheme exists. To reduce heterogeneity and improve sharing between segmented German library classifications, Pfeffer (2014) proposes a simple method to cluster entries from several library classification schemes. With the resulting clusters, large numbers of previously unindexed entries can be enriched by sharing indexing and classification information, which considerably improves information retrieval efficiency. Leydesdorff et al. (2012) analyze overlay maps of US Patent (USPTO) data based on the International Patent Classification (IPC) developed by the World Intellectual Property Organization. By using the IPC as the base, the corresponding classes of USPTO can be detected more efficiently. On the other hand, mapping between different classification schemes is useful for evaluating and revising existing classification schemes by reference to relatively better ones. To evaluate how well existing library classification schemes cover human knowledge, Zins and Santos (2011) examine three main library classification schemes, with human knowledge organized in pillars as the basis of evaluation. Their results show that the existing library classification schemes fail to cover human knowledge systematically and adequately. 
Note that the classification scheme mappings mentioned above involve homogeneous classification schemes, which organize the same kind of objects. Mapping among heterogeneous classification schemes, however, remains unstudied. To realize the mapping among different classification schemes or hierarchical structures, three kinds of mapping approaches are reviewed: element-based mapping, structure-based mapping and hybrid mapping. The element-based approach defines the matching degree by comparing name equality, overlapping instances and common knowledge domains through statistical and semantic analysis (Breitman et al. 2008; Choi et al. 2006; Thor et al. 2007; Zins and Santos 2011). The structure-based approach matches two concepts by considering the matching degree of their neighbors, or by combining path distance with the depth of the concepts (Avesani et al. 2005; Kalfoglou and Schorlemmer 2003). By integrating information at the element level and the structure level, the hybrid approach provides a systematic way to match entities from different hierarchical structures (Jiang and Conrath 1997; Rodríguez and Egenhofer 2003). In contrast to crisp mapping (match or not match), fuzzy mapping is better suited to capturing the uncertainty in class definition and relatedness measurement. The application of fuzzy set theory in ontology mapping falls into two streams: one is the fuzzy representation of classes, or domain concepts; the other is the fuzzy relatedness between two classes from different ontologies. For a class, its properties, individuals and other semantic relations can be encoded into a fuzzy membership function (Abulaish and Dey 2006; Carlsson et al. 2012). Fuzzy inference and fuzzy relations are adopted in fuzzy ontology mapping. 
Cross and Yu (2010) propose an Information Content (IC) based fuzzy set framework to measure the fuzzy similarity between two ontological classes, though it applies only to classes within the same ontology. Using a fuzzy set formulation and a generic reference ontology, Todorov et al. (2011) propose a framework to realize fuzzy matching between multiple heterogeneous domain ontologies, measuring the fuzzy relatedness through an instance-based similarity measure. Fong et al. (2014) use a machine learning model, Fuzzy Unordered Rule Induction, and the predicted class to infer the similarity between two medical datasets.

A Fuzzy Approach for Classification Scheme Mapping In this section, we propose a fuzzy method to interpret the fuzzy relations between two classification schemes. Let C1 and C2 represent two sets of classes from different classification schemes, and there


is $c_1 \in C_1$ and $c_2 \in C_2$. A fuzzy relation, which defines non-crisp relations through a membership function (Al Boni et al. 2014; Zadeh 1965), is introduced to describe the ambiguity of the matching association between two classes. The temporal fuzzy relation $R^T(C_1 \times C_2)$ defines the association between two class sets from different classification schemes during time period $T$:

$$R^T : C_1 \times C_2 \to [0,1], \qquad (c_1, c_2)_T \mapsto R^T(c_1, c_2) \quad (\text{where class } c_1 \in C_1,\ c_2 \in C_2)$$

Fuzzy max-min composition is adopted to indirectly measure the fuzzy relation between $C_1$ and $C_3$. For $r_1^T(c_1,c_2) \in R_1^T(C_1,C_2)$ and $r_2^T(c_2,c_3) \in R_2^T(C_2,C_3)$:

$$R_3^T(C_1,C_3) = R_1^T(C_1,C_2) \circ R_2^T(C_2,C_3)$$

$$r_3^T(c_1,c_3) = \max_{c_2}\{\min(r_1^T(c_1,c_2),\ r_2^T(c_2,c_3))\} = \bigvee_{c_2}\{r_1^T(c_1,c_2) \wedge r_2^T(c_2,c_3)\}$$
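The max-min composition can be illustrated with a minimal Python sketch. The nested-dictionary representation, the function name `max_min_compose`, and the class labels and membership degrees in the example are assumptions made for illustration; they are not taken from the paper's data.

```python
def max_min_compose(r1, r2):
    """Compose two fuzzy relations stored as nested dicts.

    r1[c1][c2] and r2[c2][c3] are membership degrees in [0, 1].
    The result satisfies r3[c1][c3] = max over c2 of min(r1[c1][c2], r2[c2][c3]).
    """
    r3 = {}
    for c1, row in r1.items():
        r3[c1] = {}
        for c2, d12 in row.items():
            for c3, d23 in r2.get(c2, {}).items():
                candidate = min(d12, d23)  # strength of the path c1 -> c2 -> c3
                r3[c1][c3] = max(r3[c1].get(c3, 0.0), candidate)
    return r3


# Hypothetical degrees: one class of scheme 1, two of scheme 2, two of scheme 3.
r1 = {"Mathematics": {"A01": 0.8, "F02": 0.3}}
r2 = {"A01": {"0701": 0.9, "0812": 0.2},
      "F02": {"0701": 0.1, "0812": 0.7}}
r3 = max_min_compose(r1, r2)
```

Each composed degree is limited by the weakest link on the strongest path through the intermediate scheme, so it never exceeds any degree in the two input relations.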

To define the affinity function between two classes, the method in (Rodríguez and Egenhofer 2003) is adapted. Semantic descriptor similarity, feature similarity and neighborhoods similarity are integrated to measure the matching degree, which integrates information from both element level and structure level.

Semantic Descriptor similarity Semantic descriptor similarity measures the similarity between two classes at the level of class descriptors. The descriptor of a class comprises a sequence of several words. The adapted Lesk algorithm is preferred for word-pair similarity measurement for two reasons: first, it achieves high accuracy in word sense disambiguation (WSD) in comparison with other measures (Patwardhan et al. 2003); second, it can measure the relatedness of two concepts (or senses) across part-of-speech (POS) boundaries and is not limited to is-a relations (Pedersen et al. 2004). By viewing the words in the descriptor of a class as vertices and weighting the edge of each word pair as a function of their similarity, the semantic similarity measurement of the descriptors of two classes can be translated into a bipartite graph matching problem (Bellur and Kulkarni 2007; Melnik et al. 2002). Assume that the words in class descriptors are assigned POS tags beforehand: n (nouns), v (verbs), a (adjectives) or r (adverbs). To simplify the process, all words and relations are given equal weight. Suppose there are $m$ distinct words in class $c_1$ and $n$ distinct words in class $c_2$ after filtering out stop words. Semantic descriptor similarity can be obtained through the following steps.

Step 1 Word-pair Similarity The original Lesk algorithm measures the relatedness between two concepts by the overlap between their dictionary definitions (Lesk 1986). Lesk was later adapted to measure the relatedness of two concepts defined in WordNet by scoring the overlap of their glosses (definitions in WordNet) (Banerjee and Pedersen 2002): the more words two glosses share, the more related the two concepts are. Each word is given a POS tag. Because of polysemy, a word may correspond to more than one concept. The most appropriate concept for each word can be found by maximizing the overlap among the glosses of the words within a class descriptor, which is called word sense disambiguation (Tsatsaronis et al. 2007). Thus, for each word $word_i^1$ in class $c_1$, the most appropriate sense can be given as $sense_i^1$.

Here, extended glosses, which include the glosses of the concept of a word as well as those of its synsets, are used in the adapted Lesk relatedness computation. By measuring the overlap of the extended glosses in WordNet between two words $word_i$ and $word_j$, the relatedness $rel_{ij}$ can be obtained (Pedersen et al. 2004). For example, $rel(mathematics, algebra) = 149$ while $rel(mathematics, computer) = 5$, which means that mathematics is far more closely related to algebra than to computer. The word-pair similarity can be normalized as follows: if $rel_{ij} = 0$, then $sim_{ij} = 0$; if $word_i = word_j$, then $sim_{ij} = 1$; otherwise

$$sim_{ij} = \log_{\gamma \times rel_{max}} rel_{ij} \quad (\gamma > 1)$$

where $rel_{max}$ is the maximum relatedness value and the parameter $\gamma$ is used to scale the similarity.
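The normalization in Step 1 can be sketched in a few lines of Python. The relatedness scores 149 and 5 come from the example in the text; the value `rel_max = 400` and the function name `normalize_relatedness` are assumptions for illustration only, since the maximum relatedness value is not reported here.

```python
import math


def normalize_relatedness(rel, rel_max, gamma=1.5, same_word=False):
    """Map a raw adapted-Lesk relatedness score to a similarity value.

    Identical words get 1, unrelated words get 0, and every other pair gets
    the logarithm of rel to the base gamma * rel_max (with gamma > 1).
    """
    if same_word:
        return 1.0
    if rel == 0:
        return 0.0
    return math.log(rel, gamma * rel_max)


REL_MAX = 400  # hypothetical maximum relatedness, not taken from the paper

sim_algebra = normalize_relatedness(149, REL_MAX)   # mathematics vs algebra
sim_computer = normalize_relatedness(5, REL_MAX)    # mathematics vs computer
sim_same = normalize_relatedness(149, REL_MAX, same_word=True)
sim_none = normalize_relatedness(0, REL_MAX)
```

Because the base of the logarithm is larger than any observed relatedness score, the similarity of two distinct words stays strictly below 1, which is reserved for identical words.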


Step 2 Descriptor-pair Similarity Semantic descriptor similarity can be translated into bipartite graph matching, and the optimal bipartite matching can be obtained through the Hungarian Algorithm (Melnik et al. 2002). Words are viewed as vertices, and each edge $e_{ij}$ has weight $len_{ij} = 1 - sim_{ij}$. The optimal matching $E' \subseteq E$ from the original matching $E$ is the one that attains the minimum value of $\sum_{e_{ij} \in E'} len_{ij}$. Semantic descriptor similarity can be measured through the optimal matching:

$$R(c_1,c_2)_{des} = \frac{\sum_{e_{ij} \in E'} sim_{ij}}{\max(m,n)}$$
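The matching step can be illustrated with a small Python sketch. For brevity it finds the optimal matching by exhaustive search over assignments rather than by the Hungarian Algorithm; both return the same optimum, but exhaustive search is only practical for the short word lists typical of class descriptors. For longer inputs, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be a natural substitute. The similarity values in the example are hypothetical.

```python
from itertools import permutations


def descriptor_similarity(sim):
    """Descriptor similarity from a word-pair similarity matrix.

    sim[i][j] is the normalized similarity between word i of class c1 and
    word j of class c2. Every word of the shorter descriptor is matched to a
    distinct word of the longer one so that the total edge length
    (1 - sim) is minimal; the matched similarities are then summed and
    divided by max(m, n).
    """
    if not sim or not sim[0]:
        return 0.0
    m, n = len(sim), len(sim[0])
    if m > n:
        # Transpose so that rows always index the shorter descriptor.
        sim = [list(col) for col in zip(*sim)]
        m, n = n, m
    best_cost, best_total = None, 0.0
    for cols in permutations(range(n), m):
        cost = sum(1.0 - sim[i][j] for i, j in enumerate(cols))
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_total = sum(sim[i][j] for i, j in enumerate(cols))
    return best_total / max(m, n)


# Hypothetical similarities: a two-word descriptor against a three-word one.
sim_matrix = [[1.0, 0.2, 0.1],
              [0.3, 0.8, 0.4]]
r_des = descriptor_similarity(sim_matrix)
```

Dividing by max(m, n) rather than by the number of matched pairs penalizes descriptors of unequal length, since the unmatched words of the longer descriptor contribute nothing to the sum.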

Feature Similarity In this paper, the individuals of each class are viewed as its distinguishing features. The overlapping individuals of two classes from two different classification schemes represent their similarity, and the overlap can change as the time window changes. The temporal feature similarity during time window $T$ can be measured by Dice’s Index (Bai et al. 2009):

$$R^T(c_1,c_2)_{fea} = \frac{2\,|S^T(\langle I_{c_1}, I_{c_2}\rangle)|}{|S^T(I_{c_1})| + |S^T(I_{c_2})|}$$

where $S^T(\langle I_{c_1}, I_{c_2}\rangle)$ denotes the set of matched individual pairs between two classes $c_1 \in C_1$ and $c_2 \in C_2$ during time window $T$, $S^T(I_{c_1})$ denotes the set of individuals of class $c_1 \in C_1$ during $T$, and $|S(i)|$ denotes the cardinality of set $S(i)$.
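A minimal Python sketch of the Dice-based feature similarity follows. The identifiers and the example individuals are hypothetical, and the sketch assumes that each individual takes part in at most one matched pair, so that the result stays within [0, 1].

```python
def dice_feature_similarity(matched_pairs, individuals_c1, individuals_c2):
    """Dice's index over the individuals of two classes in one time window.

    matched_pairs holds (individual of c1, individual of c2) tuples that were
    matched during the window; the other two arguments are the full sets of
    individuals of each class during the same window.
    """
    denominator = len(individuals_c1) + len(individuals_c2)
    if denominator == 0:
        return 0.0  # no individuals in the window, so no evidence of overlap
    return 2 * len(matched_pairs) / denominator


# Hypothetical window: four publications in one class, two projects in the other.
publications = {"p1", "p2", "p3", "p4"}
projects = {"g1", "g2"}
matched = {("p1", "g1"), ("p2", "g2")}
r_fea = dice_feature_similarity(matched, publications, projects)
```

Recomputing the value for successive time windows gives the temporal behavior described in the text, since both the matched pairs and the class sizes change from window to window.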

Semantic Neighborhood Similarity Two classes can be related through their neighboring classes. Classes are regarded as semantic neighbors of $c_1$ if their distance to $c_1$ in the classification scheme is within radius $r$. Setting $r = 1$ means that only the super-class and sub-classes of a class are regarded as its neighbors. The semantic neighborhood similarity depends on semantic descriptor similarity and temporal feature similarity. To define a symmetric similarity, the Symmetric Tversky’s Ratio Model (STRM) (Jimenez et al. 2013) is used to adapt the asymmetric neighborhood similarity of (Rodríguez and Egenhofer 2003). By setting $\beta = 1$ in STRM, the temporal semantic neighborhood similarity with radius $r$ can be defined as

$$R^T(c_1,c_2,r)_{neighbor} = \frac{c_1 \cap_n^T c_2}{c_1 \cap_n^T c_2 + \alpha(c_1,c_2) \cdot a + (1 - \alpha(c_1,c_2)) \cdot b}$$

$$a = \min(\delta(c_1, c_1 \cap_n^T c_2, r),\ \delta(c_2, c_1 \cap_n^T c_2, r))$$

$$b = \max(\delta(c_1, c_1 \cap_n^T c_2, r),\ \delta(c_2, c_1 \cap_n^T c_2, r))$$

$$\delta(c_1, c_1 \cap_n^T c_2, r) = \begin{cases} N(c_1,r) - c_1 \cap_n^T c_2 & \text{if } N(c_1,r) > c_1 \cap_n^T c_2 \\ 0 & \text{otherwise} \end{cases}$$

where $c_1 \cap_n^T c_2$ represents the approximate cardinality of the intersection set between the two neighborhood sets during $T$, and $N(c_1,r)$ represents the cardinality of the neighbor set of class $c_1$ within radius $r$. The parameter $\alpha(c_1,c_2)$ can be defined as a function of the depth of the classes in the two classification schemes:

$$\alpha(c_1,c_2) = \frac{\min(depth(c_1), depth(c_2))}{depth(c_1) + depth(c_2)}$$

By setting a common node (e.g. Science) as the root node of the different classification schemes with depth = 0, the depth of each class can be defined according to its position in its classification scheme. For example, in the NSFC discipline tree, depth(G01) = 1, depth(G0101) = 2 and depth(G010101) = 3. The approximate cardinality of the intersection set between two neighborhood sets can be defined as


$$c_1 \cap_n^T c_2 = \Big[\sum_{i \le n} \max_{j \le m} R_0^T(c_1^i, c_2^j)\Big] - \varphi\, R_0^T(c_1,c_2)$$

where

$$\varphi = \begin{cases} 1 & \text{if } R_0^T(c_1,c_2) = \max_{j \le m} R_0^T(c_1^i, c_2^j) \\ 0 & \text{otherwise} \end{cases}$$

and

$$R_0^T(c_1,c_2) = w'_d R_{des}(c_1,c_2) + w'_f R_{fea}^T(c_1,c_2),$$

and $c_1^i$ $(i = 1,\dots,n)$ and $c_2^j$ $(j = 1,\dots,m)$ are the neighbors of class $c_1$ and class $c_2$, respectively.

The Aggregation of Similarity The fuzzy affinity function can be represented as the weighted aggregation of the three dimensions above, which is defined as follows:

$$R^T(c_1,c_2) = w_d R_{des}(c_1,c_2) + w_f R_{fea}^T(c_1,c_2) + w_n R_{neighbor}^T(c_1,c_2,r)$$

where the similarity values are normalized into [0, 1] for commensurability before the aggregation. $w_d$, $w_f$ and $w_n$ are the weights for similarity at the semantic descriptor level, the feature level and the semantic neighborhood level, and they can be decided by managers experienced in research management.
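The neighborhood similarity and the final aggregation can be put together in a short Python sketch. It takes the approximate intersection cardinality as an input rather than computing it from the neighbor pairs, and all function names and example numbers are hypothetical, apart from the three weights, which are the values reported for the experiments in this paper. The handling of two root classes (total depth 0) is an assumption, since that case is not covered in the text.

```python
def depth_weight(depth1, depth2):
    """alpha(c1, c2) = min(depth) / (depth(c1) + depth(c2))."""
    total = depth1 + depth2
    if total == 0:
        return 0.5  # assumption: weight two root classes evenly
    return min(depth1, depth2) / total


def delta(neighbor_count, intersection):
    """Neighbors of one class that the approximate intersection leaves out."""
    return neighbor_count - intersection if neighbor_count > intersection else 0.0


def neighborhood_similarity(intersection, n1, n2, depth1, depth2):
    """Symmetric ratio model with beta = 1, as in the neighborhood formula."""
    d1, d2 = delta(n1, intersection), delta(n2, intersection)
    a, b = min(d1, d2), max(d1, d2)
    alpha = depth_weight(depth1, depth2)
    denominator = intersection + alpha * a + (1 - alpha) * b
    return intersection / denominator if denominator > 0 else 0.0


def aggregate(r_des, r_fea, r_neighbor, w_d, w_f, w_n):
    """Weighted sum of descriptor, feature and neighborhood similarity."""
    return w_d * r_des + w_f * r_fea + w_n * r_neighbor


# Hypothetical pair: 4 and 6 neighbors, approximate overlap 2.0, depths 2 and 3.
r_nb = neighborhood_similarity(intersection=2.0, n1=4, n2=6, depth1=2, depth2=3)
score = aggregate(0.6, 0.5, r_nb, w_d=0.16, w_f=0.5, w_n=0.34)
```

Taking the minimum and maximum of the two differences before weighting is what makes the measure symmetric: swapping the two classes leaves `a`, `b` and `alpha` unchanged.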

Fuzzy Mapping of Three Classification Schemes To evaluate the fuzzy scheme mapping method, three existing science classification schemes are selected. The fuzzy mapping results are presented first in this section. To evaluate the proposed approach, its mapping results are then compared with those of the classic individual-based mapping and of the original method of (Rodríguez and Egenhofer 2003). Precision and recall are used as the evaluation metrics.
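Precision and recall against an expert reference can be computed as in the following sketch. It assumes that both the automatic mapping and the experts' manual mapping have been reduced to sets of crisp class pairs; how fuzzy degrees are turned into crisp pairs is not specified here, and the class labels in the example are hypothetical.

```python
def precision_recall(predicted_pairs, reference_pairs):
    """Precision and recall of predicted class pairs against a reference set."""
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical mappings: 4 predicted pairs, 5 expert pairs, 3 in common.
predicted = {("Math", "A01"), ("Math", "A02"), ("Physics", "A04"), ("Optics", "F05")}
expert = {("Math", "A01"), ("Math", "A02"), ("Physics", "A04"),
          ("Physics", "A05"), ("Optics", "A04")}
precision, recall = precision_recall(predicted, expert)
```

Precision reports the share of predicted pairs that the experts also chose, while recall reports the share of expert pairs that the method recovered.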

Fuzzy Mapping Results The three selected classification schemes organize research journals, research projects and educational majors, respectively. SCIE (Science Citation Index Expanded) provides bibliographic and citation information on research publications. It covers over 8500 of the world’s leading scientific and technical journals across 150 disciplines, with 176 subject categories ranging from Acoustics to Zoology. NSFC, which is responsible for administering the National Natural Science Fund from the Central Government, is the biggest and most important funding agency in China. To facilitate grant management and research project retrieval, research projects in NSFC are classified into 8 departments and organized in a three-tier discipline tree. The Chinese Ministry of Education builds a major system for more efficient recruitment and cultivation of research students by organizing the major taxonomy into three levels: 2-digit, 4-digit and 6-digit codes. Universities also organize faculty members into departments or majors according to the three-tier Major system of the Chinese Ministry of Education (MMOE). The individuals of these three classification schemes (research publications, projects and researchers) are closely related: research projects in NSFC are undertaken by researchers and have research publications as output. For a pair consisting of an SCIE class and an NSFC class, a publication and the project that produced it count as one matched individual. For a pair consisting of an NSFC class and an MMOE class, researchers are used to denote the overlap. We chose research projects within a five-year window (2008-2012) from the eight departments as individuals of the NSFC classification scheme. Five years is the average time needed to start and finish a research project. The selected 13165 general research projects are all finished, and the corresponding outputs (i.e. 240427 research publications) are collected. Researchers are tagged with schools and majors. 
Fuzzy relations between SCIE and NSFC and between NSFC and MMOE are computed first, and the fuzzy relations between SCIE and MMOE are then obtained through the max-min composition method. The parameters and weight values in the proposed method are given by experts experienced in research management: $\gamma = 1.5$, $r = 1$, $w_{des} = 0.16$, $w_{ind} = 0.5$ and $w_{neighbor} = 0.34$. The fuzzy affinity values for each class pair can be computed and ranked from most relevant to least relevant. Part of the temporal mapping results of the three selected classification schemes within the five-year window (2008-2012) is shown in Figure 1. To visualize the mapping results more conveniently, we filter out matching degree values