Applied Mechanics and Materials Vol. 551 (2014), pp. 603-611. © 2014 Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.551.603

Online: 2014-05-09

Scientific Research Paper Ranking Algorithm PTRA: A Tradeoff between Time and Citation Network

Mushtaq A. Hasson (1,a*), Songfeng Lu (2,b) and Basheer A. Hassoon (1)

(1) College of Education of Pure Science, Basra University, Basra, Iraq
(2) Huazhong University of Science and Technology, Wuhan 430074, China

(a) [email protected], (b) [email protected]

Keywords: Paper Time Ranking Algorithm (PTRA), ranking algorithm, scientific research papers (SRP), citation index, age of publication.

Abstract. Most scientific search engines rank scientific research papers (SRP) in the same way, depending heavily on the citation network. As a result, the results these engines retrieve for any keyword are dominated by old papers with the highest citation counts. In this paper, we propose a new, easy-to-implement ranking algorithm for scientific research papers called the Paper Time Ranking Algorithm (PTRA). PTRA ranks its results using three factors: paper age, citation index, and publication venue. To build and validate the algorithm, we created a web crawler that crawls several scientific paper databases to collect the information PTRA needs. Some of this information, such as journal impact factors, was missing, so we created a second crawler that searches the Internet for those impact factors. To validate our ranking results, we compared PTRA with the Google Scholar ranking algorithm; we chose Google Scholar because it covers more than 50 million papers and its system gathers papers quickly. Our comparison shows that the citation index has the highest impact on Google Scholar's ranking, whereas PTRA gives the highest impact to paper age. PTRA still uses the citation index and publication venue to rank results, but with less impact than paper age.

Introduction

The Internet is, without doubt, the wonder of the 20th century. It started as a point-to-point connection between two machines and then grew to cover the globe. Its content is increasing so rapidly that finding what a user needs can be hard, if not impossible. Search engines were born to organize this content: they take a user's query and search the WWW for the results closest to that query. Even with such engines, the number of returned results can be too large for a user to check in full, and this was the reason behind the birth of result ranking algorithms. A ranking algorithm is the procedure a search engine uses to prioritize the returned results.

Scientific research papers (SRP) are a popular kind of Internet content. Their number is increasing rapidly, competing with movies, websites, music, and other content. In 2010, Jinha [1] estimated that more than 50 million research papers had been published. This huge number of papers requires search engines to index them in order to facilitate the search process for researchers, and many ranking algorithms have been created to rank SRP.

Citation Network

There are two types of citation networks: web page citation networks and paper citation networks. Our interest is the paper citation network, which looks like a tree graph. A citation index can be seen as a kind of bibliographic dataset consisting of two types of nodes: cited papers and citing papers. Citations run in one direction, from newer papers to older ones, according to the time interval between them (months or years), and this direction cannot be reversed, since an old paper cannot cite a newer one.
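To make this structure concrete, the following is a minimal sketch (assumed Python, with made-up paper IDs and years, not data from the paper) of a paper citation network as a directed acyclic graph whose edges always point from a newer citing paper to an older cited paper:

```python
from collections import defaultdict

# Hypothetical papers: id -> year of publication.
papers = {"p1": 2005, "p2": 2009, "p3": 2012}

# Edges point from the citing (newer) paper to the cited (older) paper.
citations = defaultdict(list)

def add_citation(citing, cited):
    # A paper can only cite work published no later than itself,
    # so the graph can never contain a cycle.
    assert papers[citing] >= papers[cited], "citations run from new to old"
    citations[citing].append(cited)

add_citation("p3", "p2")
add_citation("p3", "p1")
add_citation("p2", "p1")

# The citation index of a paper is its number of incoming edges.
index = {p: sum(p in cited for cited in citations.values()) for p in papers}
print(index)  # {'p1': 2, 'p2': 1, 'p3': 0}
```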


Search Engines

Search engines are a special kind of software that crawls and indexes the content of the Internet. They were created to overcome the difficulty of accessing content on the Internet. Search engines started by indexing a small portion of the web; nowadays they index billions of documents [2]. They have become famous: people see search engines as the door through which they access and find what they need on the Internet, and Google alone is claimed to handle 20 million user queries every second [3]. These numbers and others prove the increasing need for search engines and the popularity of this software. Any search engine consists of three main parts: a crawler engine, a ranking engine, and a graphical user interface (GUI). Together, these parts crawl, store, and update the search engine's databases.

The rest of this paper is organized as follows. Section two reviews the work of other researchers in this field. Section three explains our algorithm (PTRA). Section four describes our experimental work in two parts: data source and keywords. Section five compares the PTRA results with the Google Scholar results. Finally, the paper closes with a summary of our conclusions.

Related Work

There are many studies in the field of ranking scientific research. The researchers divide into two schools: the first reuses algorithms from other problems, modifying them for the new problem, while the second builds new algorithms from scratch to fit the new requirements. Ranking algorithms for SRP accordingly fall into the same two categories.

In the first category, ranking algorithms for web search engines have been studied extensively. The PageRank algorithm, the core of the Google search engine, was introduced in 1998 and has been a great success ever since. PageRank depends on the relations between pages; it was built on the web graph and on the idea that the Internet looks like a bowtie with a highly connected core. PageRank was tuned in [4] to rank SRP, and in [5] to rank the authors of SRP in co-citation networks. However, we do not recommend PageRank for SRP ranking, for several reasons. First, a new paper has no papers citing it yet, which would push it to the end of the result list, whereas we want such papers at the top, since researchers care about the newest research in a field. Second, the link structure of SRP is not like the web: a link between two papers runs in one direction, starting at the newer paper (by release date) and ending at the older one, so the structure looks like a tree with new papers at the top and old papers at the bottom. Third, SRP can cite websites and books, not only papers, and it is hard to build a rank over such heterogeneous references. Fourth, the requirements of SRP are not the same as those of web pages: SRP depend on the publication venue, the year, the authors, and the citation index, while web pages depend only on the web graph and the links between pages.
Finally, if this algorithm gave the best results for ranking SRP, why does Google Scholar not use it [6]?

In the second category, to construct new algorithms that do not depend on any old method, researchers and developers start from the user requirements and the information available. The information we know about a research paper can be used to rank it as follows (a data-model sketch appears after this list):

• The venue (the location of submission and publication) is a critical parameter that indicates the quality of the work.
• The year of publication: researchers seek new papers more than old ones, since they are looking for hot topics.
• The citation count, or citation index, indicates the popularity of a paper.
• Papers can also be ranked by the authors of the work and by where it was done.

All of this information can be obtained from the Internet.
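As an illustration of the metadata such rankers work from, here is a minimal sketch (assumed Python; the record type and the example values are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    """Metadata a ranking algorithm can draw on."""
    title: str
    year: int                  # year of publication
    citations: int             # citation count (citation index)
    venue: str                 # journal, conference, or workshop name
    venue_type: str            # "journal" or "conference"
    impact_factor: float = 0.0 # meaningful for journals only
    authors: list = field(default_factory=list)

# Hypothetical example record:
p = Paper(title="P2P Live Streaming Survey", year=2010, citations=120,
          venue="Some Journal", venue_type="journal", impact_factor=1.5,
          authors=["A. Author"])
```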


To rank research papers using these parameters, search engines are divided into two groups. The first group gives users the freedom to construct a query with their own requirements: for example, ScienceDirect [7] allows users to rank by date, while IEEE Xplore [8] is more flexible and gives users more ranking options. This way of ranking is limited, since it cannot mix parameters to build a more complex ranking method. The second group ranks SRP using a complex combination of the parameters without any user interference; Google Scholar is an example of this group. The Google Scholar ranking algorithm is not public, but it treats the citation count as the parameter with the highest impact on the results [6][9][10]. One drawback of this approach is that depending mainly on the citation count makes old publications appear at the top of the results and pushes new publications to lower ranks in the returned list.

Other ranking algorithms rank papers by a single value instead of multiple values. The h-index [11] is an example of this type: it ranks a paper depending on how many papers its author has published. The h-index gives no weight to criteria such as venue, date of publication, and citation index. Moreover, it never gives new papers a chance to rank at the top if their authors are new, even when the venue is a highly ranked conference or a journal with a high impact factor.

In this paper we propose a new algorithm, built from scratch, that depends on three parameters: location of publication, year of publication, and citation index. We also give a priority to each criterion we consider.

Paper Time Ranking Algorithm (PTRA)

Our Paper Time Ranking Algorithm is easy to implement and depends on three parameters: location of publication, year of publication, and citation index. We do not consider the author a critical parameter, since readers judge an author by the content of the author's papers and then decide whether to read that author again. In our algorithm we give a priority to each of these parameters: the date of publication is weighted higher than the citation index, since we want new papers to come out at the top of the result list. The weight of a paper is calculated as in Eq. 1:

Paperweight = A + C + T                                        (1)

where A is the venue score (based on the conference age or the journal impact factor), C is the citation score of the current paper, and T is the age score of the paper. We calculate these parameters as follows.

Place of Publication

One of our algorithm's metrics is the kind of publication venue, which divides into three types: journal, conference, and workshop. In general, papers published in journals are more important than those in the other venue types, and conferences rank below journals but above workshops [12]. Journals and conferences also differ in how their quality is measured: the quality of a journal can be measured by its impact factor and its age, while the quality of a conference can be measured by the age of the conference. Our algorithm therefore evaluates journals and conferences separately.

For journals, as mentioned above, quality is known from the impact factor. We utilize this metric and give it a high effect when ranking papers, since we want papers from high-quality journals to appear at the top of the ranked results. We use Eq. 2 to compute the venue score A for a journal:

A = M · d1                                                     (2)

where d1 is a coefficient and M is the impact factor of the current journal. For conference publications, as mentioned above, the quality of a conference can be judged by its history. We rely on the edition number of the current conference, since a conference usually takes place once a year. We use Eq. 3 to compute the venue score A for a conference:


A = Q · d1                                                     (3)

where Q is the edition number of the conference (its age in years) and d1 is a coefficient.

Date of Publication Value

One of the important metrics in our ranking algorithm is the date of publication, to which we want to give a high effect; we need this metric to know whether the current paper is old or new. To compute it, we first subtract the year of the current paper from the current year (in our algorithm the current year is 2012), as shown in Eq. 4; then the output is multiplied by a coefficient, as in Eq. 5:

T = 2012 − Year                                                (4)

T = T · d2                                                     (5)

where T is the time score, Year is the year of publication of the current paper, and d2 is a coefficient.

Citation Index Value

The importance of a paper can be judged by its citation count [13], and many ranking algorithms use the citation index to rank papers, so we did not neglect this metric. However, we do not give it a high effect, since we want the highest effect to go to the date of publication. To evaluate the citation score, we use Eq. 6:

C = N · d3                                                     (6)

where N is the citation count of the current paper and d3 is a coefficient.

Steps of our Algorithm

The main steps of PTRA are illustrated as pseudocode in Fig. 1 and can be summarized as follows. Steps 1 to 11 loop over the papers retrieved from the dataset one by one until the dataset is exhausted. Step 2 initializes all the factors PTRA needs (year of publication, citation index, conference age, impact factor, and paper weight) to 0.0, to ensure that no parameter carries a stale value. Step 3 reads the values of these factors for the current paper. Step 4 computes the age of the current paper (Eq. 4). Step 5 scales the age factor by its coefficient (Eq. 5). Step 6 multiplies the citation index of the current paper by the coefficient that controls its effect (Eq. 6). Step 7 checks whether the publication venue is a conference or a journal. Steps 8 to 11 handle the case where the condition is true, i.e., the paper was published in a conference, and compute the venue score from the conference's age (Eq. 3). Steps 12 to 13 handle the case where the condition in step 7 is false, i.e., the paper was published in a journal, and compute the venue score from the journal's impact factor (Eq. 2). Step 14 computes the weight of the current paper from the three factors evaluated in the previous steps (Eq. 1).
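The pseudocode itself (Fig. 1, below) did not survive extraction as text, so the following is a minimal Python reconstruction of the steps just described; the function names, the dictionary fields, and the default coefficient values are our own (the tuned values are reported in the Results section):

```python
def ptra_weight(paper, current_year=2012,
                d1_journal=2.0, d1_conf=0.1, d2=0.01, d3=0.001):
    """PTRA paper weight, Eq. 1: weight = A + C + T."""
    # Steps 4-5: time score T (Eqs. 4 and 5).
    t = (current_year - paper["year"]) * d2
    # Step 6: citation score C (Eq. 6).
    c = paper["citations"] * d3
    # Steps 7-13: venue score A (Eq. 3 for a conference, Eq. 2 for a journal).
    if paper["venue_type"] == "conference":
        a = paper["conference_age"] * d1_conf
    else:
        a = paper["impact_factor"] * d1_journal
    # Step 14: combine the three factors.
    return a + c + t

# The paper does not state the sort direction explicitly; sorting by
# descending weight is our assumption.
def ptra_rank(papers):
    return sorted(papers, key=ptra_weight, reverse=True)
```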


Fig. 1. Pseudocode for PTRA.

Experiment Work

Data Source

The PTRA system consists of two main parts: a crawler and the PTRA ranking algorithm. The crawler itself has two main parts: an automator and an extractor. The main function of the automator is to retrieve search results from three well-known scientific paper search engines: Google Scholar, CiteSeerX, and IEEE Xplore. We used three scientific paper search engines for two reasons. First, to examine which one is more general (returns more results), since each engine returns different results, from the sorting (ranking), through the number of returned results, to the amount of paper information it exposes. Second, to retrieve a complete information list for each scientific paper so that our algorithm can run.

The second part of the crawler is the extractor, which extracts the useful information from the pages returned by the automator. This information can be summarized as: the name of the paper, the year of publication, the citation index, the publication venue, and the volume of the publication (if any). Unfortunately, we faced an obstacle: some information, such as the publication venue, the journal impact factor, and the volume, did not exist. To solve this, first, the PTRA automator used the three search engines mentioned above to reduce the data gaps as much as possible. Second, we used Google advanced search to retrieve the rest; we did not use Google search from the beginning because it is a general-purpose search engine and extracting what we need from it is more complex. Third, we used the DBLP dataset to fill some gaps. The Digital Bibliography and Library Project (DBLP) [14] is a database of papers published in different computer science fields, created in 1993. DBLP records are XML files containing a huge number of articles of different types published in different journals and conferences, especially ACM, VLDB, and IEEE [15]. The DBLP database contains more than one million articles, and each XML record contains the following fields: authors, title, type (journal or conference), name of the journal or conference (venue), volume, year, and note.

Keywords

In the PTRA crawler experiment, we used five keywords to create a seeding file that is fed to the PTRA crawler. Keywords can be simple or complex. Simple keywords are general names without any specification, while complex ones consist of more than one word, such as "Peer-to-Peer locality in ISPs". Simple words are used to get general results, and complex words are used to get more specific ones. In the PTRA experiments, both single words and complex words were fed to the PTRA automator: for example, the automator first searches for "Peer-to-Peer", then takes a complex keyword such as "Peer-to-Peer File Sharing", and then moves on to another keyword such as "Algorithm".
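As a rough illustration of the two crawler roles and the keyword seeding described above, here is a minimal sketch (assumed Python; the URL, the markup patterns, and the seeding keywords are placeholders, not the ones used in the paper):

```python
import re
import urllib.parse
import urllib.request

SEED_KEYWORDS = ["Peer-to-Peer", "Peer-to-Peer File Sharing", "Algorithm"]

def automate(keyword):
    """Automator: fetch one result page for a keyword (placeholder URL)."""
    url = "https://example.org/search?q=" + urllib.parse.quote(keyword)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract(html):
    """Extractor: pull title/year/citations out of a result page.

    The regexes below are illustrative only; a real extractor must match
    the actual markup of each search engine it targets.
    """
    records = []
    for block in re.findall(r'<div class="result">(.*?)</div>', html, re.S):
        title = re.search(r"<h3>(.*?)</h3>", block)
        year = re.search(r"\b(19|20)\d{2}\b", block)
        cites = re.search(r"Cited by (\d+)", block)
        records.append({
            "title": title.group(1) if title else None,
            "year": int(year.group(0)) if year else None,
            "citations": int(cites.group(1)) if cites else 0,
        })
    return records
```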


In addition, we used different keywords because some computer science knowledge is very old (e.g., Database, Algorithm) while some is very modern (e.g., Peer-to-Peer).

Results and Discussion

Paper Time Ranking Algorithm Configuration

The PTRA algorithm has three variables, or coefficients, that need to be tuned so that the results balance the effects of the citation index and the publication venue while increasing the effect of the date of publication. These variables, as mentioned, are d1, d2, and d3; they are important for computing the weights that arrange the papers the way a search engine needs. After several rounds of tuning, we reached the best coefficients as follows (a worked example appears after this list):

1) Date of publication: to enlarge the effect of the date of publication, we subtract the year of publication of the current paper from the current year (we used 2012) and use 1/100 as the coefficient d2.
2) Citation index: to tune the effect of the citation index, we use 1/1000 as d3, reducing its effect and increasing the effect of time.
3) Publication venue: our dataset has two types of publication venues, journals and conferences. Since journals are more important, we give them a higher weight without neglecting conferences, using the two equations above. The effect of this metric is tuned by the coefficient d1, whose value depends on the venue type: 1/10 for a conference and 2 for a journal.
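To make the tuned coefficients concrete, here is a small worked example (with made-up paper data, not a result from the paper):

```python
# Hypothetical 2010 journal paper with impact factor 1.5 and 200 citations,
# scored with the tuned coefficients d1 = 2 (journal), d2 = 1/100, d3 = 1/1000.
a = 1.5 * 2                    # venue score A    (Eq. 2) -> 3.0
c = 200 * (1 / 1000)           # citation score C (Eq. 6) -> 0.2
t = (2012 - 2010) * (1 / 100)  # time score T   (Eqs. 4-5) -> 0.02
print(a + c + t)               # Paperweight      (Eq. 1) -> 3.22
```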

In the next part, we show the results obtained with these coefficients and the comparison with the Google Scholar results, in order to evaluate the PTRA ranking.

Google Scholar Comparison Results

Fig. 2. Distribution of P2P live streaming papers according to the age of paper (year) in our PTRA.

Fig. 3. P2P live streaming result distribution according to the age of paper (Google Scholar).

In this part, we show the results of the comparison with the Google Scholar ranking algorithm. These results show how the papers are distributed when ranked by year and when ranked by citation index; this distribution is our performance metric. We discuss two of the results we obtained, one for old knowledge and one for new knowledge, since the other keywords gave broadly similar results. The results are as follows. Fig. 2 shows the distribution of the P2P live streaming papers according to the year of publication: the distribution is roughly exponential in the date of publication, showing that our algorithm treats the date of publication as the strongest factor in the paper weight equation. Fig. 3 shows the distribution of the Google Scholar results by year of publication, which is normal; the Google Scholar results therefore do not depend on the year-of-publication metric, meaning that the effect of time in the Google Scholar ranking algorithm is very low.


Fig. 4. P2P live streaming distribution according to citation index (our algorithm).

Fig. 5. The distribution of P2P live streaming papers according to citation index (Google Scholar).

Figs. 2 and 3 compare our ranking algorithm with the Google Scholar ranking algorithm by year of publication: our algorithm yields an exponential distribution over time, whereas Google Scholar yields a normal one. Fig. 4 shows the distribution of the P2P live streaming papers according to the citation index for our ranking algorithm; this result is also affected by the other coefficients, d1 and d2. The figure shows that even though we give a high effect to the date of publication, papers with a high citation index still have a chance to appear at the top of our results. Fig. 5 shows the distribution of the P2P live streaming papers according to the citation index for the Google Scholar results: the papers are arranged from the largest citation count to the smallest, a linear ordering by citation index. Figs. 4 and 5 thus compare the Google Scholar results with the PTRA results by citation index. The conclusion from these two figures is that our ranking algorithm uses the citation index but does not give it a high effect, whereas Google Scholar depends on the citation index almost entirely.

Fig. 6. The distribution of database papers according to the age of paper (our algorithm).

Fig. 7. The distribution of database papers according to the age of paper (Google Scholar).

Fig. 6 shows the distribution of database papers according to the age of publication for our algorithm. The papers are sorted ascending by age (year of publication), and the other factors (citation index and publication venue) also affect this result. This result differs from the P2P results because it covers old knowledge (databases), yet Fig. 6 still shows an exponential distribution over the age of publication, demonstrating that even for old knowledge our algorithm depends strongly on the age of publication. Fig. 7 shows the distribution of database papers according to the date of publication for the Google Scholar results, which follows a normal distribution over the year of publication. Fig. 8 shows the distribution of database papers according to the citation index in our algorithm: even when arranged by this factor, the papers are not sorted from the largest to the smallest citation count, since we reduce the effect of this factor (C); so a ranking by this factor still differs from Google Scholar's, which depends on the citation index. Fig. 9 shows the distribution of database papers according to the citation index for Google Scholar: the papers are sorted from the highest to the lowest citation value, which again shows that Google Scholar depends on the citation index more than on any other metric when ranking papers.


Fig. 8. Database papers distribution according to the citation index (our algorithm).

Fig. 9. Distribution of database papers depending on the citation index (Google Scholar).

From the above results, we conclude that our algorithm depends strongly on the time of publication to rank papers, while not neglecting the other two metrics (C and A). We also showed that Google Scholar depends heavily on the citation index to rank scientific research papers.

Conclusion

In this work, we proposed a new ranking algorithm that works contrary to the previous ranking algorithms in this field. Our algorithm was built from scratch as a new ranking method that addresses the time ranking issue without neglecting the importance of the other metrics (citation index and publication venue). We introduced the Paper Time Ranking Algorithm (PTRA), a flexible and simple ranking algorithm that computes a paper weight from three metrics: year of publication, citation index, and publication venue. PTRA gives a high effect to the year-of-publication metric and reduces the effect of the citation index and the publication venue, so that the newest papers receive high weights and appear at the top of the ranking results. To make PTRA work correctly, we tuned its factors until we found the best coefficients for ranking by time without neglecting the other parameters. We succeeded in achieving our goal: the PTRA algorithm ranks correctly by time, as shown by applying it to different computer science knowledge (old and new) from different periods.

References

[1] A. Jinha, "Article 50 million: an estimate of the number of scholarly articles in existence," Learned Publishing, pp. 258-263, 2010.
[2] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Proceedings of the Seventh International World Wide Web Conference, pp. 107-117, 1998.
[3] http://www.searchenginewatch.com/
[4] A. P. Singh, K. Shubhankar, and V. Pudi, "An efficient algorithm for ranking research papers based on citation network," in IEEE Data Mining and Optimization (DMO), 2011.
[5] Y. Ding, E. Yan, A. Frazho, and J. Caverlee, "PageRank for ranking authors in co-citation networks," Journal of the American Society for Information Science and Technology, 2009.
[6] J. Beel and B. Gipp, "Google Scholar's ranking algorithm: an introductory overview," in Proceedings of ISSI, 2009.
[7] http://www.sciencedirect.com/
[8] http://ieeexplore.ieee.org/xplore/guesthome.jsp


[9] J. Beel and B. Gipp, "Google Scholar's ranking algorithm: the impact of articles' age (an empirical study)," in IEEE ITNG, 2009.
[10] J. Beel and B. Gipp, "Google Scholar's ranking algorithm: the impact of citation counts (an empirical study)," in IEEE RCIS, 2009.
[11] J. E. Hirsch, "An index to quantify an individual's scientific research output," Proceedings of the National Academy of Sciences, 2005.
[12] X. Liu and C. Huang, "Studies on utilizing the three famous international index systems to evaluate scientific research level of higher learning institutions," International Journal of Strategic Information Technology and Applications (IJSITA), 2011.
[13] B. Gupta, "Citation indexes and other products of ISI," Annals of Library and Information Studies, 2004.
[14] "The DBLP computer science bibliography," http://dblp.uni-trier.de/
[15] M. Ley, "The DBLP computer science bibliography: evolution, research issues, perspectives," in String Processing and Information Retrieval, Lecture Notes in Computer Science, Springer-Verlag, pp. 481-486, 2000.
