2009 IEEE International Conference on Services Computing

Clustering Web Services for Automatic Categorization*

Qianhui Liang School of Information Systems, Singapore Management University, Singapore e-mail: [email protected]

Peipei Li** School of Computer Science and Information Engineering, Hefei University of Technology, China e-mail: [email protected]

Patrick C. K. Hung Faculty of Business and IT, University of Ontario Institute of Technology, Canada e-mail: [email protected]

Xindong Wu School of Computer Science & Info. Engg., Hefei University of Technology, China; Dept. of Comp. Science, University of Vermont, USA e-mail: [email protected]

* This research is supported by the 973 Program of China under award 2009CB326203 and the National Natural Science Foundation of China (NSFC) under grant 60828005.
** Peipei Li was a research assistant in the School of Information Systems at Singapore Management University when the paper was developed.

978-0-7695-3811-2/09 $26.00 © 2009 IEEE DOI 10.1109/SCC.2009.39

Abstract - Analyzing the functionality of Web services is the basis of using Web services effectively and efficiently. The first step in such an analysis is to categorize different services, which may be offered by different service providers, based on their functionalities. In this paper, we present a clustering-based approach to Web service categorization in order to form a hierarchy of service taxonomy. Our novel clustering scheme takes into consideration not only individual factors such as the input or output of service operations, but also the latent inter-relationships among the individual factors. Given a set of services that may or may not have been categorized, we adopt individual methods to handle them and mark out their classification labels in terms of a common (given) taxonomy, such as UNSPSC. When a new service description is published, the unclassified service is compared with the classified ones, and measures of the likelihood that the new service description belongs to each cluster are calculated. Based on this calculation, the service is assigned to a suitable category.

Keywords - Inter-relationship of service attributes, service taxonomy, Web service categorization, Web service classification, Web service clustering

I. INTRODUCTION

Categorizing Web services allows them to be organized according to a service taxonomy. Such an organization is helpful for a number of tasks in using and managing Web services. For example, during service publishing, a service provider is usually advised to register its services under a certain service category in a given service taxonomy so that potential service users can locate them easily. The service provider has to find an appropriate service category by browsing through the entire taxonomy, which can be a tedious and time-consuming procedure. A tool that automatically assigns a service to a service category is therefore highly desirable. As another example, automatic service discovery and service composition usually involve searching a large number of available Web services for a service with certain functions. Heuristics can be used to reduce the search space and improve the time complexity of service discovery and composition, and domain knowledge such as the categorization information of services can serve as one such heuristic for arriving at a good result more quickly.

With an increasing number of Web services available, enabling automatic categorization is challenging. It is especially difficult given the massive amount of information embedded in Web service descriptions. To address this issue, the research community has devoted considerable effort to the classification of Web services and has explored different strategies and methods. Existing work can generally be divided into two major groups: text classification based on keywords (e.g., [4]) and classification with semantic judgment (e.g., [3][6]). However, little attention has been paid to the composite relations among the elements in the structure of a Web service description, or to the potential semantic information carried by their different combinations, which may lead to a loss of information. Existing methods tend to treat each element in isolation, even though an operation consists of three related elements: input, function description and output.

In this paper, we propose a new method of clustering Web services that not only considers the features of individual keywords, but also uses the underlying semantic relations among the elements of a structure marked with metadata and their possible combinations. Our proposal approaches the classification problem as follows. Given a set of services that may or may not have been categorized yet, we adopt individual methods to handle them and mark out their class labels in terms of a common (given) taxonomy, such as UNSPSC [12]. When a new service description is published, the unclassified service is compared with the classified ones, and measures of the likelihood that the new service description belongs to each cluster are calculated. Based on this calculation, the service is assigned to a suitable category. The approach has two main characteristics. First, we design a new tree structure with different composite elements, in which sub-structures extracted from Web services serve as the basic comparison objects and which covers all combinations of the elements in a metadata structure. Second, two new methods of similarity calculation are used as the respective measures of cluster division.

The rest of the paper is organized as follows. Section II reviews related work on the classification of Web services. Section III describes our approach to automatic categorization of Web services, which is based on clustering and similarity estimation. Section IV provides the experimental study, and Section V summarizes our results and future work.

II. RELATED WORK

Most prior work in this area focuses on the automatic classification of Web services. In general, the proposed approaches can be divided into two categories: the first relies on text classification and the second on semantic awareness. A feature common to both is that most existing methods rely on machine learning mechanisms for information handling. Based on the semantic similarity proposed in [1], [5] presents a framework to automate the semantic annotation of Web services. In that work, a matching algorithm between Web service data types and ontology concepts is defined (based on matching element schemas) in order to obtain a degree of similarity between services and a domain ontology. A service is classified by finding the ontology that yields the highest similarity value when compared with the service. However, the approach needs semantic metadata such as DAML-S, and such annotation is often lacking.

In 2006, Corella et al. proposed a heuristic approach [3] for the semi-automatic classification of Web services, based on a three-level matching procedure between services and classification categories, assuming that a corpus of previously classified services is available. Unfortunately, this vision requires that services describe themselves with large amounts of semantic metadata "glue".

In 2008, Saha et al. introduced a tensor space model for data representation and a rough-set-based approach for the classification of Web services [2]. The tensor space model captures information from the internal structure of WSDL documents along with the corresponding text content, and rough sets are used to combine the information of the individual tensor components to provide classification results. The authors state that they use the tensor space model to achieve better classification results than existing approaches, and that they further adopt a rough-set-based ensemble classifier to improve classification accuracy.

Liang et al. propose a term categorization technique based on an approach to service matching [6]. They use Web Service Description Language (WSDL) documents to represent the similarity between two terms. Service operations are categorized based on the similarity of the terms describing them, and a service operation that is categorized together with another service operation is considered to match that operation. Compared with categorization-based service matching, our research focuses on categorizing Web services described by WSDL documents as a whole rather than categorizing service operations.

III. TECHNICAL FRAMEWORK

Our methodology of service categorization involves 1) preprocessing, which converts each WSDL document into a service training tree; 2) rough clustering, which labels each Web service with a class tag according to the similarities of the textual descriptions provided with services and service operations; and 3) fine clustering, which further clusters Web services according to a number of selected constructs in the semi-structured service description documents. As the name indicates, rough clustering produces only imprecise clustering results, because the information used in this step is limited to the textual descriptions. Rough clustering does not require that services have already been categorized; it labels each service with a category from UNSPSC. For rough clustering to work, we assume that text descriptions are available for Web services and service operations. Otherwise, the system skips rough clustering and applies fine clustering directly. Fine clustering relies on more targeted information about service operations. In addition to individual constructs such as the input and output of service operations, it also relies on the relationships among them. To our knowledge, all existing work on service clustering or classification by information retrieval technologies relies on counting the occurrences of words in individual constructs. We believe that the relationships among words, in terms of their appearance in corresponding related constructs, carry important information that can be used to distinguish the functionality provided by services. Our approach therefore also considers a bag of combinations of individual constructs, so that their inter-relationships are taken into account. Based on its results, fine clustering may either confirm the result of rough clustering or revise the cluster label of a particular Web service. We describe these three components in detail in the remainder of this section.

A. Preprocessing

Figure 1. Processing of WSDL documents

Preprocessing of WSDL documents is illustrated in Figure 1. The first step is to parse the WSDL document with the wsdl4j library in a Java development environment. The library is designed for WSDL 1.1; thus, to accommodate the newer WSDL 2.0 standard, the metadata in WSDL 1.1 must be transformed into the corresponding definitions in WSDL 2.0. After the WSDL documents are parsed, we apply our own implementation of Porter's stemming algorithm [11] and of stop-word removal. A certain number of terms are thereby extracted from each metadata element of a WSDL document. Finally, the tree structure that corresponds to the original WSDL document, and whose nodes hold the extracted terms of each metadata element rather than the raw text description, is called a service training tree.

B. Clustering

Figure 2. Process flow of rough clustering

Referring to the process of rough clustering in Figure 2, we collect the information about services, such as the names and brief descriptions in WSDL and UDDI documents. After the corresponding UDDI documents are preprocessed with the Porter stemming algorithm [11] and stop-word removal, all terms in the service descriptions form the comparison objects. Rough clustering then takes three steps. First, each pair of keywords is compared; the matching rates are calculated with the method given in this section and ranked in descending order. Second, the pairs whose matching rates are higher than a pre-specified threshold are grouped into small clusters of documents. Lastly, to avoid blindly discarding the pairs of keywords whose similarity rates are lower than the threshold, we make an additional judgment: the UNSPSC tool is used to mark a possible label (commodity or class) in UNSPSC, and the keywords with the same label are grouped together. We select UNSPSC as the taxonomy because it is one of the most widely supported service categorization standards for Web services.

We adopt an incremental K-means algorithm [7] to implement rough clustering, based on the similarity definition given in this section. Its steps are as follows.

Incremental K-means Algorithm
1. Select K points as the initial centroids.
2. Assign a point to the closest centroid and update the centroid, until all points are allocated to centroids.
3. Repeat Step 2 until the centroids do not change.
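The three steps of the incremental K-means listing can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes each service has already been turned into a numeric vector (for example a term-frequency vector), uses squared Euclidean distance as the closeness measure, and recomputes a centroid as the mean of the points it has received so far in the current pass. The function names, the `max_passes` cap and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def incremental_kmeans(points, k, max_passes=100, seed=0):
    """Cluster numeric vectors following the three listed steps.

    points: list of equal-length numeric tuples (an assumed representation).
    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    dim = len(points[0])
    for _ in range(max_passes):
        previous = [c[:] for c in centroids]
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        # Step 2: assign each point to the closest centroid and update that
        # centroid right away as the mean of the points it has received so
        # far in this pass.
        for idx, point in enumerate(points):
            nearest = min(range(k),
                          key=lambda j: squared_distance(point, centroids[j]))
            labels[idx] = nearest
            counts[nearest] += 1
            sums[nearest] = [s + x for s, x in zip(sums[nearest], point)]
            centroids[nearest] = [s / counts[nearest] for s in sums[nearest]]
        # Step 3: stop once a full pass leaves the centroids unchanged.
        if centroids == previous:
            break
    return labels, centroids
```

A distance on vectors is used here only to keep the sketch short; the paper's rough clustering works with a keyword similarity, which could be substituted by choosing the centroid with the highest similarity instead of the smallest distance.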

C. Exact Clustering

Based on the rough clusters obtained above, we further take into account the structural features of WSDL and calculate the similarities among the component elements of one or more sub-structures, for example the structure of an operation with its input and output. With this additional information for estimating similarity, the rough clusters are broken down into smaller ones or regrouped more exactly. The overall process flow of exact clustering is shown in Figure 3. It is divided into four sub-components that produce the final, exact clusters for detailed service categories: constructing interface trees, generating mapping relations among interface trees, collecting frequency and co-occurrence statistics for each term or composite term, and estimating the similarity between a pair of interfaces in different WSDL documents. We describe each sub-component in turn.

Figure 3. Process flow of exact clustering

C.1 Interface Tree and Structure Definition

Using the rough clusters obtained from the string matching ranking, we divide the WSDL service documents and extract the tree structure of the interfaces from the original WSDL documents. Based on the preprocessing results, each term is the result of the parsing and stemming processes. A hierarchical tree structure is shown in Figure 4. Each WSDL document corresponds to one tree, which covers all interfaces from its service training tree.

Figure 4. Partial profile of WSDL document

Figure 4 shows a partial profile of the basic framework of WSDL documents. Interfaces are the most representative features of a service's WSDL document, since they present the functions of the service. Hence, we take all interface sub-structures in a WSDL document as the objects analyzed in exact clustering. In our design of exact clustering, the basic matching unit is an operation. In other words, for a pre-processed WSDL document, all operations from the different interfaces form a matching object set. According to the structure in Figure 4, a document is defined as $W_1 = \{Inf_1, Inf_2, \ldots, Inf_m\}$ with $Inf_i = \{Opt_1, Opt_2, \ldots, Opt_n\}$, where $Inf_i$ specifies an interface in a service and $Opt_i$ refers to a concrete operation, which consists of Input, Output, Inputfault, Outputfault, Function, etc. If the elements of an operation are considered independently, the potential relations among them are likely to be lost, for example the fact that Output results depend on Inputs and Functions. In this paper, both the individual elements and the composite elements are used in the estimation of similarity. An operation is depicted in Figure 5 and formalized as follows:

$Opt_k = \{In_k, Out_k, Fd_k, Inout_k, Infd_k, Outfd_k, Inoutfd_k\}$ ($1 \le k \le n$)
$In_k = \{it_{k1}, it_{k2}, \ldots, it_{kc_i}\}$ ($it_{kp}$ indicates a parameter of input, $1 \le p \le c_i$)
$Out_k = \{out_{k1}, out_{k2}, \ldots, out_{kc_o}\}$ ($out_{kp}$ indicates a parameter of output, $1 \le p \le c_o$)
$Fd_k = \{fd_{k1}, fd_{k2}, \ldots, fd_{kc_f}\}$ ($fd_{kp}$ indicates an attribute of an operation; since only the name attribute is involved here, $c_f = 1$; $1 \le p \le c_f$)
$Inout_k = \{iot_{k1}, iot_{k2}, \ldots, iot_{kc_{io}}\}$ ($iot_{kp}$ is a composite value of In+Out, $1 \le p \le c_{io}$)
$Infd_k = \{ift_{k1}, ift_{k2}, \ldots, ift_{kc_{if}}\}$ ($ift_{kp}$ is a composite value of In+Fd, $1 \le p \le c_{if}$)
$Outfd_k = \{oft_{k1}, oft_{k2}, \ldots, oft_{kc_{of}}\}$ ($oft_{kp}$ is a composite value of Out+Fd, $1 \le p \le c_{of}$)
$Inoutfd_k = \{ioft_{k1}, ioft_{k2}, \ldots, ioft_{kc_{iof}}\}$ ($ioft_{kp}$ is a composite value of In+Out+Fd, $1 \le p \le c_{iof}$)

where In is short for Input, Out for Output, and Fd for Function description (the declaration of an operation; it currently refers to the name of the operation and, if sufficient information is available, also contains the terms extracted from the brief function description). In+Out, In+Fd, Out+Fd and In+Out+Fd denote the compositions of the different elements.

Figure 5. A composition in an operation

To facilitate the subsequent matching and similarity estimation, an operation set is an ordered triple of three components, i.e., Opts = (Input, Fd, Output). The matching scores for a pair of operations are defined as follows.

Definition on Matching Score:
Score 1: Input Matching
Score 2: Fd Matching
Score 3: Input+Fd Matching
Score 4: Output Matching
Score 5: Input+Output Matching
Score 6: Fd+Output Matching
Score 7: Input+Fd+Output Matching

C.2 Similarity Calculation

Suppose the WSDL documents, denoted as $WSDL = (W_1, W_2, \ldots, W_m)$, and their corresponding UDDI documents, denoted as $UDDI = \{U_1, U_2, \ldots, U_m\}$, form an object set $OS = \{WU_1, WU_2, \ldots, WU_m\}$. Each element of this object set is composed of a set of keywords, i.e., $WU_i = \{t_{i1}, t_{i2}, \ldots, t_{in(i)}\}$ (where $n(i)$ is the number of terms in $WU_i$), and the frequency of each keyword is given by the set $\{m_{i1}, m_{i2}, \ldots, m_{in(i)}\}$. Hence, for an arbitrary pair of document elements $(WU_i, WU_j)$, the similarity is calculated as

$$Sim(WU_i, WU_j) = \sum_{p=1}^{n(i)} \max_{k=1}^{n(j)} Sim(t_{ip}, t_{jk}), \qquad Sim(t_{ip}, t_{jk}) = \frac{m_{ip}}{M_i} \cdot \left(\frac{m_{jp}}{M_j}\right) \cdot sim(t_{ip}, t_{jk}) \tag{1}$$

where $M_i = \sum_{k=1}^{n(i)} m_{ik}$ and $M_j = \sum_{k=1}^{n(j)} m_{jk}$ are the total frequency counts of keywords, and the similarity of two keywords, $sim(t_{ip}, t_{jk})$, is defined below. $Len(\cdot)$ is a length function that gives the number of letters in a term. $NG(t_{ip}, t_{jk})$ denotes the Q-Grams algorithm [8] for determining element-level linguistic similarity; q-grams are the short character substrings of length q of the database strings. The intuition behind using q-grams as a foundation for approximate string matching is that when two strings $\sigma_1$ and $\sigma_2$ are similar, they share a large number of q-grams. Given a string $\sigma$, its q-grams are obtained by "sliding" a window of length q over the characters of $\sigma$. Since q-grams at the beginning and the end of the string can have fewer than q characters from $\sigma$, the strings are conceptually extended by "padding" the beginning and the end of the string with q - 1 occurrences of a special padding character that is not in the original alphabet. The technique has been widely used in text recognition and spelling correction [9].

$$sim(t_{ip}, t_{jk}) = NG(t_{ip}, t_{jk}) \tag{2}$$

The proof of $Sim(WU_i, WU_j) \le 1$ is omitted here due to limited space. Formula (2) uses the Q-Grams string matching method to estimate the similarity. Other string matching methods, such as Jaro, Jaro-Winkler, and the TF/IDF-based similarity measure referred to in [9], can also be adopted here, and the conclusion $Sim(WU_i, WU_j) \le 1$ holds for them as well. Due to the space limit, we omit their details; their comparison is given in the experiments.

C.3 Creating Mapping Relation

We collect the term set corresponding to each metadata element, including operations, in an interface tree and match the terms against the keywords in UNSPSC, which is an open, global, multi-sector standard for efficient, accurate classification of products and services. The matching method is based on the string comparison described above. According to the matching results, we discard the non-matching terms and construct simple mapping relations for the remaining terms; namely, matching terms that belong to the same metadata field in different interface trees are grouped under the same flag. This handling is carried out on the condition that a service classification provided in UNSPSC contains the individual terms. Furthermore, to deal with compound words, abbreviations and acronyms, such as GetTicket/ObtainTicket and Automatic Transaction Matching (ATM) service, we provide a user-defined word set that reduces the non-matches caused by them. Assuming such cases are rare, this word set can be created manually.
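The string comparison behind Formulas (1) and (2) can be illustrated with a short Python sketch. It rests on several assumptions that the paper does not state: the q-gram overlap is normalized with a Dice coefficient over padded q-gram multisets, `q` defaults to 3, `#` is the padding character, and the second frequency factor in Formula (1) is read as the relative frequency of the matched term $t_{jk}$. The function names are hypothetical.

```python
from collections import Counter


def qgrams(term, q=3, pad="#"):
    """Multiset of q-grams of a term padded with q - 1 characters per side."""
    padded = pad * (q - 1) + term.lower() + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=3):
    """Dice-normalized q-gram overlap in [0, 1] (normalization is an assumption)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2.0 * shared / total if total else 0.0


def document_similarity(doc_i, doc_j, q=3):
    """Frequency-weighted best-match similarity in the spirit of Formula (1).

    doc_i, doc_j: dicts mapping keyword -> frequency.
    """
    total_i, total_j = sum(doc_i.values()), sum(doc_j.values())
    if not total_i or not total_j:
        return 0.0
    score = 0.0
    for term_i, freq_i in doc_i.items():
        # For each keyword of document i, keep its best weighted match in j.
        score += max(
            (freq_i / total_i) * (freq_j / total_j) * qgram_similarity(term_i, term_j, q)
            for term_j, freq_j in doc_j.items()
        )
    return score
```

For example, `qgram_similarity("getticket", "obtainticket")` is strictly between 0 and 1 because the two terms share the q-grams of "ticket" but differ in their prefixes, which is the kind of partial match that the user-defined word set is meant to complement.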

C.4 Statistics

The statistics rely on the tags marked in the operation structure. For each group of terms corresponding to an individual metadata element, we count not only the frequency of occurrence of each term but also the co-occurrences of multiple terms (i.e., the composite terms), such as the co-occurrence count for a term set of Input+Output over all operations, i.e., $f_{i7}$. Each feature therefore has its own set of statistics, which is prepared for the subsequent similarity calculation. As a consequence, the frequency values marked along one path decrease gradually, e.g., $f_{i5} \le f_{i3} \le f_{i1}$.

C.5 Clustering in the algorithm of bisecting K-means

To implement exact clustering, we introduce the bisecting K-means algorithm [9], which is developed from traditional K-means. In that paper, the authors point out that agglomerative hierarchical clustering (a representative algorithm is UPGMA) and K-means are two clustering techniques commonly used for document clustering. Furthermore, on the basis of extensive experiments and a rough theoretical analysis, they state that the bisecting K-means technique they propose is better than the standard K-means approach and at least as good as the hierarchical approaches. Its details are as follows:

Bisecting K-means Algorithm
1. Pick a cluster to split.
2. Find two sub-clusters using the basic K-means algorithm.
3. Repeat Step 2 for ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat Steps 1~3 until the desired number of clusters is reached.

In the overall clustering process, all service trees are initially taken as the objects to be split, and we use the same method as [9] of selecting the largest remaining cluster to split. As in K-means, k initial centroids are selected arbitrarily and the remaining points are clustered around these centroids according to a chosen similarity measure. We pick a cluster to split and try ITER times to find an optimal split, which results in ITER candidate splits. Of these, the split with the largest total similarity is selected (i.e., the bisecting step). The total similarity of a split is the sum of the cluster similarities of all the clusters created by the split, where the cluster similarity is the average of the similarity measures over all pairs of service documents in a cluster. The iteration is then repeated: the largest cluster within the selected split is picked to be split further, until the expected number of clusters is reached. An important difference from [9], however, is that instead of its method of similarity estimation we have designed a new similarity measure suited to WSDL documents, which is explained in detail below.

C.6 Method of similarity calculation

Similarity estimation is used in matching two operation trees (denoted as $T_i$ and $T_j$) from two arbitrary WSDL documents. If a path can be traversed completely, it is considered a match with a certain matching score; otherwise, it is a non-match. In the process of tree matching, the similarity between trees is first estimated from the structure matching; the content similarity is then computed. The details are defined as follows.

$$Sim(WU_i, WU_j) = \sum_{k=1}^{C(opt_i)} Sim(opt_{ik}, WU_j) = \frac{\sum_{k=1}^{C(opt_i)} \max_{p=1}^{C(opt_j)} Score(opt_{ik}, opt_{jp})}{\sum_{q=1}^{C(opt_i)} Score(MaxPath(q))} \tag{3}$$

subject to $\min(Score(MaxPath(q))) = 1$, where

$$Score(opt_{ik}, opt_{jp}) = [Score(In_{ik}, In_{jp}) \cdot Sim(In_{ik}, In_{jp})] + [Score(Fd_{ik}, Fd_{jp}) \cdot Sim(Fd_{ik}, Fd_{jp})] + [Score(Out_{ik}, Out_{jp}) \cdot Sim(Out_{ik}, Out_{jp})] \tag{4}$$
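Given any pairwise similarity on service documents, such as the tree similarity of Formula (3), the bisecting procedure of Section C.5 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: because only pairwise similarities are available, each trial split picks two random seed items and attaches every other item to the more similar seed instead of running full K-means with centroid updates; a singleton cluster is given a cluster similarity of 0 so that trivial splits are not rewarded; and the largest remaining cluster is always the one split next. All function and parameter names are hypothetical.

```python
import random
from itertools import combinations


def cluster_similarity(cluster, sim):
    """Average pairwise similarity inside one cluster (0.0 for singletons)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def split_once(cluster, sim, rng):
    """One trial 2-way split: two random seeds, others join the closer seed."""
    s1, s2 = rng.sample(range(len(cluster)), 2)
    left, right = [cluster[s1]], [cluster[s2]]
    for idx, item in enumerate(cluster):
        if idx in (s1, s2):
            continue
        if sim(item, left[0]) >= sim(item, right[0]):
            left.append(item)
        else:
            right.append(item)
    return left, right


def bisecting_clustering(items, sim, n_clusters, iters=5, seed=0):
    """Split the largest cluster repeatedly until n_clusters is reached."""
    rng = random.Random(seed)
    clusters = [list(items)]
    while len(clusters) < n_clusters:
        # Step 1: pick the largest remaining cluster to split.
        target = max(clusters, key=len)
        if len(target) < 2:
            break  # nothing left that can be split
        clusters.remove(target)
        # Steps 2-3: try ITER splits and keep the one with the highest
        # total similarity (sum of the cluster similarities of both halves).
        best = max(
            (split_once(target, sim, rng) for _ in range(iters)),
            key=lambda halves: sum(cluster_similarity(h, sim) for h in halves),
        )
        clusters.extend(best)
    return clusters
```

Passing the similarity as a function keeps the loop independent of how documents are compared, so a keyword-level or a tree-level measure can be plugged in without changing the splitting logic.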

In the formulas above, $MaxPath(k)$ refers to a path in a tree, and $Score(path(k))$ is the total score of the matching path between two operations $opt_{ik}$ and $opt_{jp}$. The similarities among In, Fd and Out are estimated as follows.

$$Sim(In_{ik}, In_{jp}) = 2 \cdot \frac{|\{in_{ik}\} \cap \{in_{jp}\}|}{|\{in_{ik}\} \cup \{in_{jp}\}|} \tag{5}$$

where $|\{in_{ik}\} \cap \{in_{jp}\}|$ is the number of input parameters that are the same, while $|\{in_{ik}\} \cup \{in_{jp}\}|$ is the count of distinct parameters in $in_{ik}$ and $in_{jp}$.

To distinguish between complete matches and partial matches, we plan to handle the similarity fuzzily. Hence, we add constraints to Formula (5), as shown in Formula (6).

$$Sim(In_{ik}, In_{jp}) = \begin{cases} 0, & sim(In_{ik}, In_{jp}) < \tau_1 \\ 1, & sim(In_{ik}, In_{jp}) \ge \tau_2 \end{cases} \tag{6}$$

where $\tau_1$ and $\tau_2$ are similarity thresholds. $Sim(Fd_{ik}, Fd_{jp})$ and $Sim(Out_{ik}, Out_{jp})$ are defined in the same way as $Sim(In_{ik}, In_{jp})$. Similarly, the similarity between a cluster $C_i$ and a document $WU_j$ is estimated as follows.

$$Sim(C_i, WU_j) = \frac{\sum_{i=1}^{|C_i|} Sim(WU_i, WU_j)}{|C_i|} \tag{7}$$

where $|C_i|$ is the number of documents in the cluster $C_i$. One further case needs to be addressed. If, for example, the inputs of two operations $opt_{ik}$ and $opt_{jp}$ match completely according to Formula (6), and so do the function names, Formula (4) is converted to Formula (8).

$$Score(opt_{ik}, opt_{jp}) = [Score(Out_{ik}, Out_{jp}) \cdot Sim(Out_{ik}, Out_{jp})] + [Score(In_{ik} + Fd_{ik}, In_{jp} + Fd_{jp}) \cdot Sim(Fd_{ik}, Fd_{jp}) \cdot Sim(In_{ik}, In_{jp})] \tag{8}$$

Theorem: $0 \le Sim(WU_i, WU_j) \le 1$.

Proof: According to the definitions in Eqs. (3)-(8) and the definition of the matching score in the service tree,

$$Score(In_{ik} + Fd_{ik} + Out_{ik}, In_{jp} + Fd_{jp} + Out_{jp}) \ge Score(MaxPath(q)) \ge \max_{p=1}^{C(opt_j)} Score(opt_{ik}, opt_{jp}),$$

and equality is met if $Sim(In_{ik}, In_{jp}) = 1$, $Sim(Fd_{ik}, Fd_{jp}) = 1$ and $Sim(Out_{ik}, Out_{jp}) = 1$. Thus, the matching rate between the trees of $WU_i$ and $WU_j$ is bounded by 1. Moreover, if $Sim(In_{ik}, In_{jp}) = 0$, $Sim(Fd_{ik}, Fd_{jp}) = 0$ and $Sim(Out_{ik}, Out_{jp}) = 0$, then $\max_{p=1}^{C(opt_j)} Score(opt_{ik}, opt_{jp}) = 0$, and $Sim(WU_i, WU_j) = 0$ is apparent. □

Similarly, $0 \le Sim(C_i, WU_j) \le 1$.

IV. EXPERIMENTS

To validate the effectiveness of our classification method of Web services, which is based on two-tier clustering mechanisms, and of the similarity measures used in our method, extensive experiments are conducted in each clustering process to evaluate precision and recall. These are the standard measures of performance in text classification and are popularly used in the classification of Web services, such as in [1] and [10]. Distinct methods of similarity measure and the optimal values of parameters are also estimated over a large number of experiments. The details are given below. We have collected 352 Web services for this purpose.

Our test set is a collection of 352 services from the following web sites:
• www.xmethods.com
• www.bindingpoint.com
• www.webservicelist.com
• www.servicesweb.org/rubrique.en.php3?id_rubrique=14
• http://www.xignite.com/

Figure 6. Fine clustering result on Web services

The performance of clustering Web services using the information of service operations is studied by examining the precision and recall rates, which are plotted in Figure 6. The experiments are conducted on two sets of Web services, consisting of 352 services and 952 services respectively. The first set contains all original descriptions of the 352 services we collected. The second set contains additional descriptions that are artificially created from the original descriptions using synonyms. We observe that fine clustering divides the clusters with a higher correct rate than rough clustering. Besides, the precision rate of clustering is also improved by fine clustering over rough clustering.

V. CONCLUSION

The majority of previous efforts on the classification of Web services presume that the class labels of Web services are known in advance. This is not always true for real Web services. In contrast, we assume that a set of services has not yet been classified or that only part of the services are labeled, and our approach categorizes services using clustering technology and a common standard of service categories. In this paper, we introduced a new clustering method for Web services. First, the method performs rough clustering by means of a simple string matching mechanism. Second, we consider the underlying semantic relations among metadata structures and adopt a bisecting K-means algorithm for clustering. With the help of the UNSPSC tool, the method marks out the class label of each Web service. Our experimental study demonstrates that our method can improve the classification accuracy for Web services. It therefore provides a valid classification method for real applications. Although hidden relations among different elements in a metadata structure are considered in our approach, more metadata structures and more semantic matching, for example using ontologies, should be considered further. Furthermore, we do not handle the issues of composite processes and the co-occurrence of different functions, which are left as future work.

VI. REFERENCES

[1]. X. Dong, A. Halevy, J. Madhavan, E. Nemes, J. Zhang, "Similarity Search for Web Services," In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, 30: pp. 372-383, 2004.
[2]. S. Saha, C.A. Murthy, S. K. Pal, "Classification of Web Services Using Tensor Space Model and Rough Ensemble Classifier," In Proceedings of the 17th International Symposium on Methodologies for Intelligent Systems (ISMIS'08), 2008.
[3]. M. A. Corella and P. Castells, "A Heuristic Approach to Semantic Web Services Classification," In Proceedings of the 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES 2006), Bournemouth, UK, 2006.
[4]. L. S. Larkey and W. Croft, "Combining classifiers in text categorization," In Proceedings of ACM SIGIR, 1998.
[5]. N. Oldham, C. Thomas, A. Sheth, K. Verma, "METEOR-S Web Service Annotation Framework with Machine Learning Classification," In Proceedings of the 1st International Workshop on Semantic Web Services and Web Process Composition (SWSWPC'04), July 2004.
[6]. Q. Liang, H. Lam, "Web Service Matching by Ontology Instance Categorization," In Proceedings of the 2008 IEEE International Conference on Services Computing, pp. 202-209, 2008.
[7]. D.T. Pham, S.S. Dimov, C.D. Nguyen, "An Incremental K-means Algorithm," Proceedings of the I MECH E Part C Journal of Mechanical Engineering Science, 218(7), pp. 783-795, 2004.
[8]. E. Ukkonen, "Approximate String Matching with q-Grams and Maximal Matches," Theoretical Computer Science, 92(1): pp. 191-211, 1992.
[9]. M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," University of Minnesota, Technical Report #00-034 (2000). http://www.cs.umn.edu/tech_reports.
[10]. Q. Liang, H. Lam, L. Narupiyakul, and P.C. Hung, "A Rule-Based Approach for Availability of Web Service," In Proceedings of the 2008 IEEE International Conference on Web Services, pp. 153-160, 2008.
[11]. Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/
[12]. United Nations Standard Products and Service Code (UNSPSC): http://www.unspsc.org/