Learning to Rank using Evolutionary Computation: Immune Programming or Genetic Programming?

Shuaiqiang Wang
School of Computer Science and Technology, Shandong University, Jinan, China
[email protected]

Jun Ma
School of Computer Science and Technology, Shandong University, Jinan, China
[email protected]

Jiming Liu
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
[email protected]

ABSTRACT

Ranking function discovery approaches based on Evolutionary Computation (EC), especially Genetic Programming (GP), have become an important branch of the Learning to Rank for Information Retrieval (LR4IR) field. Inspired by the GP based learning to rank approaches, we provide a series of generalized definitions and a common framework for applying EC to learning to rank research. Following this framework, we propose RankIP, a ranking function discovery approach using Immune Programming (IP). Experimental results demonstrate that RankIP clearly outperforms the baselines. In addition, we study the differences between IP and GP both theoretically and experimentally. The results show that IP is more suitable for LR4IR due to its high diversity.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval models
General Terms: Algorithms, Experimentation, Theory.
Keywords: Information retrieval, Page ranking, Learning to rank, Immune programming, Evolutionary computation.

1. INTRODUCTION

The performance of a search engine is mainly evaluated by the accuracy of its ranking results; indeed, ranking is one of the most important problems in the Information Retrieval (IR) field. Machine learning techniques for this problem, collectively called "learning to rank", are becoming widely used. Recently, GP based ranking function discovery approaches have become an important branch of learning to rank research, e.g., [3] and [9]. Meanwhile, other EC algorithms, such as Immune Programming (IP) [7], have attracted growing interest. Since these EC algorithms share substantial similarities, a common framework for ranking function discovery using EC would greatly benefit the use of EC in the IR field, especially for the learning to rank problem. We therefore provide a series of generic definitions, together with such a common framework, for EC based learning to rank. Furthermore, we propose RankIP, an IP based approach for discovering effective page ranking functions. Experiments indicate that RankIP produces ranking functions that significantly outperform the baselines, including Ranking SVM [5], RankBoost [4] and BM25 [8]. We also study the dissimilarities between IP and GP theoretically, and validate these arguments experimentally: under the same conditions, IP holds the advantage over GP.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'09, November 2-6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$10.00.

2. EC BASED LEARNING TO RANK TECHNOLOGIES

2.1 Formal Definitions

2.1.1 Problem Description

For the ranking problem, a document d can be expressed by a series of feature values, formally d = <f_1, f_2, ..., f_L>. The problem of information retrieval can then be formalized as follows: for a query q and a document collection D, an optimal retrieval system should return a ranking that orders the documents in D according to their relevance to q. Let Q be a query set, D a document set, and N_d a subset of the natural numbers N. Given a query, the relevance judgment for a document is defined as a function rel(Q): D → N_d. In the training phase, a ranking function is generated that associates a real number with a query and a document as their degree of relevance; formally, rf(Q): D → R. By evaluating the ranking function values, the documents can be ordered. The order function is defined as o(rf(Q)): D → N_o, where N_o = {1, 2, ..., |D|}. Note that o(rf(Q)) is a one-to-one function; let o(rf(Q))^{-1}: N_o → D be its inverse, which is obviously one-to-one as well. The following formula should be satisfied:

∀q ∈ Q, m, n ∈ (1..|D|): n < m ⇒ rf(q)(o(rf(q))^{-1}(n)) ≥ rf(q)(o(rf(q))^{-1}(m))    (1)

The optimum ranking function rf* can be used to obtain


the optimum document order, which satisfies Formula 2:

∀q ∈ Q, m, n ∈ (1..|D|): n < m ⇒ rel(q)(o(rf*(q))^{-1}(n)) ≥ rel(q)(o(rf*(q))^{-1}(m))    (2)

Let T be the training data set and RF the ranking function set. The task of learning to rank is to train a ranking function rf that approximates rf* and thus yields an effective document order o. A learning to rank algorithm can therefore be regarded as a ranking function discovery process f: T → RF.

2.1.2 EC Based Technologies

In EC based ranking function discovery algorithms, an individual expresses a ranking function, which is in fact a composite function built from basic functions and terminals. Let P be the individual set. Given a training data set T, the fitness of an individual can be expressed as a real number, so the fitness function for the individuals is defined as F(T): P → R. Let P_g denote the population at generation g; obviously P_g ⊆ P. Let PS_g be a subpopulation of P_g. From PS_g, a selection operator s: P^{|PS_g|} → P^μ selects the μ ≤ |PS_g| individuals that show better fitness than the others for variation. An evolutionary operator v: P^μ → P^λ creates λ offspring out of the μ selected individuals, and these offspring become members of population P_{g+1}. In this way, P_{g+1} is fully generated. All evolutionary operators must guarantee that no syntactically incorrect programs are generated during the evolution process.

2.2 Advantages of EC Based Technologies for the LR4IR Problem

First, according to the formal definitions in Section 2.1, learning to rank is essentially an optimization problem: given a query set and the corresponding document sets, we seek an optimum ranking function that ranks the documents well. Here the loss function plays a critical role. Currently, most learning to rank approaches are designed to optimize loss functions only loosely related to the IR performance measures. By contrast, EC based approaches can optimize solutions by evaluating these IR performance measures directly in order to obtain a good ranking function. Second, according to Darwin's theory of evolution, no central organ controls the evolutionary process in nature. In this sense, EC algorithms are easy to implement as distributed algorithms that run on multiple computers; for example, Distributed Genetic Programming [1] has been proposed and applied in various fields. This is an attractive property, since it allows the training time to be reduced by adding distributed computers.

2.3 EC Based Learning to Rank Framework

In general, three types of data collections are adopted: training, validation and test. The training data set T is used to generate a series of candidate solutions B with an EC algorithm, the validation set V helps in choosing good solutions that are not over-specialized to the training queries, and the test set E is used to estimate the ranking functions generated by the algorithm. The EC based framework is given as Algorithm 1.

Algorithm 1 EC Based Learning to Rank Framework
Input: the training, validation and test sets T, V and E;
       the best individual set B selected in the training phase;
       the fitness function F
Output: the best individual bestInd
Begin
  P_1 ← randInitPopulation()
  B ← ∅
  for each generation g ∈ (1..GenerationMax) do
    F(T)(P_g) ← evaluateByTrainingData(P_g, T)
    B ← B ∪ getCandidates(F(T)(P_g))
    P_{g+1} ← generateNextGeneration(P_g)
  end do
  F(V)(B) ← evaluateByValidationData(B, V)
  bestInd ← applySelectMethod(F(T)(B), F(V)(B))
  measures ← estimateIndividual(bestInd, E)
End

Note that the function getCandidates can be implemented in many ways. For example, Fan et al. [3] take the best individual of each generation as a candidate; it is equally rational to regard the individuals in the last generation as the candidate solutions. In our experiments we follow Fan's strategy. Thus, once the function generateNextGeneration is implemented according to a particular evolutionary principle, a complete algorithm is obtained.

In Algorithm 1, the features abstracted in LETOR [6] are mainly adopted as the terminals, together with the constant set {0.1, 0.2, ..., 0.9, 1, 2, ..., 10}. We adopt 10 functions: +, −, ×, ÷, min, max, sin, cos, log and sqrt. Since some functions require protected parameters, we design two protection mechanisms for a protected parameter x: (1) if x < 0, then x := |x|; (2) if x = 0, then x := ε, where ε is a real number close to zero (in our experiments ε = 0.000001).

Table 1: Control parameters for the OHSUMED and TREC data collections in RankIP
Parameters        | OHSUMED     | TREC 2003  | TREC 2004
Population Size   | 500         | 100        | 100
Max Generation    | 100         | 100        | 100
Deme Size         | 9           | 9          | 9
P_r               | 0.5         | 0.5        | 0.5
P_c               | 0.2         | 0.2        | 0.2
P_m               | 0.1         | 0.1        | 0.1
Antibody Depth    | 6           | 8          | 8
x in Eq. (6)      | NDCG@5      | MAP        | MAP
Selecting Formula | ω-SUM_{σ²}  | ω-SUM_σ    | ω-SUM_σ
Random Seed       | 1234567890  | 1234567890 | 1234567890

2.4 Fitness Functions for Individuals

EC based learning to rank approaches usually use the evaluation measures, or variants of them, directly as the fitness functions. In our experiments we adopt three measures: precision at position n (P@n), Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain at position n (NDCG@n). Note that P@n and MAP can only handle binary judgments ("relevant" or "irrelevant"), while NDCG@n can deal with multiple levels of relevance judgments.

2.5 Ranking Function Selection

As mentioned before, a validation set is used to help choose good solutions that are not over-specialized to the training queries. In [2], the AVG_σ and SUM_σ formulae are introduced. Since the training data collection is usually far larger than the validation data set, we do not think it appropriate to give the training and validation results equal weights, as AVG_σ and SUM_σ do. Based on our experiments, we propose two similar methods, ω-SUM_σ and ω-SUM_{σ²}, expressed as Equations (3) and (4) respectively:

argmax_i ((α × F(T)(I_i) + β × F(V)(I_i)) − γ × σ_i)    (3)

argmax_i ((α × F(T)(I_i) + β × F(V)(I_i)) − γ × σ_i²)    (4)

The values of α and β depend on the sizes of the training data set T and the validation data set V respectively:

α = size(T) / (size(T) + k × size(V)),  β = k × size(V) / (size(T) + k × size(V))    (5)

where both k and γ are constants. The only difference between Equations (3) and (4) is that the former adopts the standard deviation σ_i while the latter employs the variance σ_i². In our experiments we let k = 2 and γ = 0.5.
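As an illustration, the ω-SUM_σ selection rule of Equations (3)-(5) can be sketched as below. This is a minimal sketch, not the paper's implementation: the candidate list, fitness values and deviation estimates are hypothetical inputs, and k and γ take the values used in our experiments.

```python
def select_best(train_fit, valid_fit, sigma, size_t, size_v, k=2.0, gamma=0.5):
    """Return the index i maximizing
    alpha * F(T)(I_i) + beta * F(V)(I_i) - gamma * sigma_i   (Eq. 3),
    where alpha and beta weight the training and validation fitness
    by the respective data-set sizes (Eq. 5)."""
    alpha = size_t / (size_t + k * size_v)
    beta = k * size_v / (size_t + k * size_v)
    scores = [alpha * t + beta * v - gamma * s
              for t, v, s in zip(train_fit, valid_fit, sigma)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Using the variance σ_i² instead of σ_i in the penalty term turns this into the ω-SUM_{σ²} rule of Equation (4); everything else is unchanged.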

3. RANKIP: IP BASED LR4IR ALGORITHM

IP is an extension of immune algorithms, particularly the clonal selection algorithm, inspired by biological immune systems and their principles and mechanisms. Owing to the structural similarity between IP and traditional EC algorithms, it is nowadays regarded as a branch of EC.

3.1 Immune Programming

Algorithm 2 Immune Programming Operations
Input: the replacement probability P_r;
       the cloning probability P_c;
       the mutation probability P_m;
       the current population C with affinity set F
Output: the population at the next generation N
Begin
  N ← ∅
  while size(N) < PopulationSize do
    r ← Replacement(P_r)
    if r ≠ null then
      N ← N ∪ {r}
    else do
      i ← getGoodAntibody(C)
      c ← Cloning(P_c, i, F(i))
      if c ≠ null then
        N ← N ∪ {c}
      else do
        m ← Hypermutation(P_m, i, F(i))
        N ← N ∪ {m}
      end do
    end do
  end do
End

The IP algorithm includes three immune operators: replacement, cloning and hypermutation. As Algorithm 2 expresses, first the replacement operator executes with probability P_r. If a new antibody is not generated due to P_r, a high-affinity antibody i is selected and considered for cloning with probability P_c. If i is not cloned due to P_c, it is submitted to the process of hypermutation. Thus, RankIP is constructed directly by implementing the function generateNextGeneration of Algorithm 1 as Algorithm 2.

3.2 IP vs. GP

Since both IP and GP are important varieties of EC, they inevitably share similar characteristics. However, there are some distinct differences:

Operator Selection. GP based systems choose which genetic operator to execute only according to fixed probabilities. In IP, by contrast, this decision depends not only on probabilities but also on the affinities of the antibodies to be operated on. The affinity function values should therefore be well distributed over the range (0..1), which is a rather rigorous condition for some problems. If it does not hold, e.g., if the affinities are all far less than 1.0, IP essentially degenerates to generating antibodies randomly.

Diversity. In IP, the mutation rates are much higher. As shown in Algorithm 2, if the selected antibody is a low-quality ranking function, extensive mutation takes place. This mechanism ensures that worse antibodies can be replaced rapidly by potentially better ones, while diversity is well maintained. Besides, the replacement operation in IP also significantly increases its diversity.

3.3 Affinity Functions in RankIP

Since the affinity function values should be well distributed over the range (0..1), we design a mapping function that transforms the evaluation measures, such as MAP, P@n and NDCG@n, into (0..1); the transformed value serves as the affinity of an antibody. In our scheme, f(x) is a logarithmic function, expressed formally as:

f(x) = log_10(1 + 9x / iv), where iv is a constant    (6)

4. EXPERIMENTS

4.1 Performance of RankIP

The LETOR [6] data sets are adopted in our experiments. We compare the ranking accuracy of RankIP with that of three baseline methods: Ranking SVM, RankBoost and BM25; the ranking performances of Ranking SVM and RankBoost are evaluated and reported in [6]. Table 1 shows the control parameters of RankIP. We present the ranking performances of RankIP, Ranking SVM, RankBoost and BM25 in Table 2 and Figure 1. For the OHSUMED data set we evaluate MAP and NDCG@1-10 for comparison, while for the TREC data sets only MAP is employed. RankIP outperforms the three baseline methods in terms of almost all measures.
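For reference, the NDCG@n measure used in these comparisons can be sketched as follows. This is a minimal sketch of the common 2^rel gain with log2 discount formulation; it may differ in small details from the official LETOR evaluation scripts.

```python
import math

def dcg(rels, n):
    """Discounted cumulative gain over the top-n graded relevance labels:
    sum_{i=1..n} (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(rels[:n]))

def ndcg(rels, n):
    """NDCG@n: DCG of the given ranking normalized by the ideal DCG,
    i.e., the DCG of the labels sorted in decreasing relevance."""
    ideal = dcg(sorted(rels, reverse=True), n)
    return dcg(rels, n) / ideal if ideal > 0 else 0.0
```

Because the gain is exponential in the relevance grade, NDCG@n rewards placing highly relevant documents at the top, which is why it handles multiple relevance levels where P@n and MAP cannot.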


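As a small illustration, the affinity mapping of Equation (6) in Section 3.3 can be written directly. The choice iv = 1.0 below is an assumption for illustration (the maximum attainable value of measures such as MAP or NDCG); the paper instead sets x per collection as listed in Table 1.

```python
import math

def affinity(x, iv=1.0):
    """Map a raw evaluation measure x in [0, iv] into (0..1)
    via f(x) = log10(1 + 9x/iv)  (Equation (6))."""
    return math.log10(1.0 + 9.0 * x / iv)
```

The mapping is concave, so it stretches small measure values upward and keeps affinities spread over (0..1), which matters because IP's operator choices depend on the affinity magnitudes.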
Table 2: Ranking accuracies in terms of MAP
Algorithm   | OHSUMED | TREC 2003 | TREC 2004
RankIP      | 0.4584  | 0.2404    | 0.4009
Ranking SVM | 0.4469  | 0.2564    | 0.3505
RankBoost   | 0.4403  | 0.2125    | 0.3835
BM25        | 0.4361  | 0.1254    | 0.3351

[Figure 1: NDCG at position n for OHSUMED, comparing RankIP, Ranking SVM, RankBoost and BM25 at positions @1 through @10.]

Especially for NDCG@1, RankIP achieves a relative improvement of more than 12%.

4.2 Which is better for LR4IR, IP or GP?

Musilek et al. [7] demonstrate that, in their experiments, the convergence of IP is clearly superior to that of GP. However, their problem differs considerably from learning to rank, so is IP still preferable for our problem? We design experiments to validate this assertion, using the OHSUMED and TREC 2003 collections in LETOR 2.0. In GP, the probabilities of crossover, cloning and mutation are 0.5, 0.2 and 0.3 respectively. The results are shown in Table 3. On the OHSUMED data set, both IP and GP are effective enough; in fact, all of the learning to rank approaches, including RankBoost, improve performance only slightly there. We speculate that OHSUMED is much easier to rank, so even traditional methods such as BM25 can provide a satisfactory result. For TREC 2003, however, IP obtains a good result while GP is only slightly better than BM25. We believe this is due to the low diversity of GP, even though its mutation probability of 0.3 is fairly high.

Table 3: Ranking accuracies in terms of MAP using IP and GP
EC Algorithm | OHSUMED | TREC 2003
IP           | 0.4584  | 0.2404
GP           | 0.4506  | 0.1342
BM25         | 0.4361  | 0.1254

[Figure 2: MAP distribution of the best individual per generation for TREC 2003, (a) Fold1 and (b) Fold3.]

From Figure 2 we can conclude that in the training phase GP converges prematurely to one local optimum point (or several such points). For example, on Fold1 and Fold3 of TREC 2003, two primary individuals account for about 70% of the 100 best individuals per generation; in fact, only 19 and 30 distinct candidate individuals respectively are available for selection on these folds. For comparison, these figures reach 82 and 92 respectively in RankIP.

5. CONCLUSIONS

To promote the use of EC in the IR area, especially for the learning to rank problem, we introduce a series of generalized definitions, as well as a common framework for ranking function discovery based on EC. Following this framework, we present an IP based approach called RankIP and validate it on the LETOR 2.0 data collections. Experiments show that RankIP clearly improves over the baselines, including Ranking SVM, RankBoost and BM25. Finally, experiments show that IP is superior to GP for the learning to rank problem under nearly identical conditions, owing to IP's high diversity.

6. REFERENCES
[1] L. Bai, M. Eyiyurekli, and D. E. Breen. Automated shape composition based on cell biology and distributed genetic programming. In Proceedings of GECCO'08, pages 1179-1186, 2008.
[2] H. M. de Almeida, M. A. Gonçalves, M. Cristo, and P. Calado. A combined component approach for finding collection-adapted ranking functions based on genetic programming. In Proceedings of SIGIR'07, pages 399-406, 2007.
[3] W. Fan, M. D. Gordon, and P. Pathak. Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4): 523-527, 2004.
[4] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4: 933-969, 2003.
[5] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of KDD'02, pages 133-142. ACM, 2002.
[6] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In SIGIR Workshop on Learning to Rank for IR (LR4IR), July 2007.
[7] P. Musilek, A. Lau, M. Reformat, and L. Wyard-Scott. Immune programming. Information Sciences, 176(8): 972-1002, 2006.
[8] S. E. Robertson. Overview of the Okapi projects. Journal of Documentation, 53(1): 3-7, 1997.
[9] A. Trotman. Learning to rank. Information Retrieval, 8(3): 359-381, 2005.