Parameter Optimization of an Evolutionary Algorithm ...

2 downloads 61 Views 407KB Size Report
Gary B. Fogel Dana G. Weekes. Natural Selection, Inc. 3333 N. ..... The authors would like to thank V. William Porto and. David B. Fogel for assistance with initial ...
Parameter Optimization of an Evolutionary Algorithm for RNA Structure Discovery Rangarajan Sampath David J. Ecker Ibis Therapeutics 1891 Rutherford Road Carlsbad, CA 92008 USA [email protected] [email protected]

Gary B. Fogel Dana G. Weekes Natural Selection, Inc. 3333 N. Torrey Pines Court, Suite #200 La Jolla, CA 92037 USA [email protected] [email protected] Abstract- This paper focuses on the optimization of population and selection parameters of an evolutionary algorithm for similar RNA structure discovery. The effects of population settings such as the number of parents and number of offspring per parent, and method of selection (tournament vs. elitist) on the rate of convergence were investigated relative to a problem with a known RNA structure solution. The results indicate that proper setting of the number of parents and offspring can be used to increase the efficiency of the evolutionary process.

I. INTRODUCTION RNA secondary structure formation in mRNAs has been correlated with translational regulation and mRNA stability. For example, the function of the iron-responsive element (IRE) in the 5’ untranslated region (UTR) of eukaryotic ferritin mRNA has been well characterized [14]. RNA secondary structures form as the result of base pairing interactions. These interactions form helices and looped domains that lower the free energy associated with the nucleic acid primary structure and result in a stable structure that can serve a regulatory function. In all known RNA structures, secondary structure is conserved throughout evolution despite sequence variation. This similar variation over a set of orthologous sequences makes identification of common RNA structure difficult when using methods of that rely solely on sequence alignment. Computational tools that can be used to explore nucleotide sequence space for conserved (but previously unidentified) RNA structures are useful in gaining insight into functional and regulatory relationships amongst related RNA sequences. In turn, these putative structures might serve as possible drug targets for translational control. One approach to RNA structure identification is to define a space of sequences that conform to a structure hypothesis. The RNAMotif algorithm [5] affords the user the ability to encode a structure hypothesis as a “descriptor,” providing as an output file information on sequence, pairing, and length of any regions in a target

0-7803-8515-2/04/$20.00 ©2004 IEEE

nucleotide database that conform to the descriptor. The target nucleotide database can consist of a set of orthologous mRNAs or full-length genomes. Depending on the specificity of the descriptor and the number of nucleotides in the sequence database, this operation can result in very few hits for “tight” descriptors and small target sequences or a very large number of hits (≥105) that conform to a “loose” descriptor over large target sequences such as full length genomes. When the number of hits is small, methods of exhaustive search can be used to find sets of most common RNA structures over all organisms represented in the target database in an efficient manner. However, when the number of hits is large, exhaustive search becomes intractable. Evolutionary algorithms (EAs) can be used to search for RNAMotif hits that share common structure over these large spaces. This can be done without prior sequence alignment or sequence constraints in the descriptor [6-8]. Here we identify parameter settings that affect the performance of the EA and tune these parameters using the well-studied ferritin mRNA IRE as a control example. This tuning decreases the time for the EA to achieve the known correct solution, resulting in an algorithm that is well-suited for discovery in situations where known structures remain unavailable. A. Evolutionary Algorithm for RNA Structure Similarity A brief overview of the EA for RNA structure similarity is required (additional details can be found in [6]). The output of RNAMotif exists as a file containing information on RNA structure “hits” in a target sequence database including sequence, structure, location, length, and strand information. This file serves as the input to the EA. We consider here the case where there are hits from each of 7 full length ferritin mRNA sequences from a variety of eukaryotes with large phylogenetic distance. Under the assumption that there is one and only one regulatory RNA structure in common to these sequences, “bins” are generated through a random choice of one structure from

607

C. Fitness

Figure 1. Flow chart of the RNAMotif/EA process for similar RNA structure discovery across three organisms.

each list of hits per organism. Each bin is considered as an “individual” in an evolving population of solutions with the goal of maximizing similarity within the contents of a bin. The number of structures in a bin and the number of bins in a population were fixed during evolution. These parameters are typically chosen with a particular hypothesis in mind. The overall strategy for discovery with this method is presented in Figure 1.

Fitness was measured via a number of RNA structure similarity scoring metrics including: nucleotide sequence similarity within a structural component, structure component length similarity, and thermodynamic stability similarity. The terms, weights, and fitness calculations presented in this paper were identical to [6]. It should be noted that the 0.5 to 1.0 scaling of fitness on the figures within this paper does not represent the condition were the “best” (true biological) result is = 1.0. Rather 1.0 equates to a fictitious upper bound on fitness given the length of the longest sequence in the RNAMotif output file under the assumption that there are perfect matches for that sequence in all other organisms. The known truth in this example has a fitness of 0.796388. Thus slight changes in fitness, even within the range [0.770000, 0.796388] may still be of significant biological interest. D. Selection

A set of variation operators was used to increase diversity in the population when generating offspring bins from parent bins. A random variable was drawn from a Gaussian distribution with 0 mean and variance of 1 to determine which of the possible variation operators were applied. A second random variable was drawn from the same probability distribution to determine the number of times a variation operator was to be applied. The possible variation operators were:

Selection was based on a tournament approach wherein each bin from the current population was “competed” with a set of R randomly chosen bins from the same population, where R was user-defined to be 10 for the purposes of this paper unless otherwise indicated. Each time the first bin’s fitness score was higher than (or tied) the opponent’s score, a “win” was assigned to the first bin. The number of wins was recorded for all competitions and this process is iterated over all members of the population. All bins were then ranked with respect to the number of wins. Selection was used to remove a user-specified number of bins equal to the number of “offspring” solutions to be generated in the next generation.

1)

E. Ferritin mRNA

B. Variation operators

2)

3)

Structure replacement from a specific organism: Within a bin, a randomly chosen structure was replaced by a new structure chosen at random in the RNAMotif file from the list of structure hits for that specific organism. This operator was akin to a standard mutation operator changing one structure at a time within each bin for a specific organism. Single-point bin recombination: Two bins in the population were chosen at random for recombination. A random variable was used to select a structure to serve as a position of single-point recombination between bins. New offspring solutions were generated by using components of both parents. Multi-point bin recombination: This operator was similar to the single-point recombination above except that multiple structures in the bins served as positions of recombination.

IREs have been described in the 5' and 3' UTRs of mRNAs [1]. IREs bind iron-regulatory proteins and regulate iron homeostasis in eukaryotes. Two IRE RNA structures have been proposed for ferritin mRNA and differ in the size and shape of their internal loop [2]. We used one descriptor (Figure 2) to represent the smaller of these two correct solutions and interrogated seven fulllength ferritin mRNA sequences covering humans to fruit flies (GenBank accession numbers: Homo sapiens, gi|191071; Sus scrofa, gi|286151; Cricetulus griseus, gi|213691; Gallus gallus, gi|2369860; Rana catesbeiana, gi|213691; Xenopus laevis, gi|214135; Drosophila melanogaster, gi|3559829) obtained from GenBank. This resulted in a total of 733 hits (same as Experiment 2; [6]).

608

different random starting seed and no communication between clients. The population size (S) of an EA can be calculated as:

3-10

S=P×O+P 3-10 0-3 3-10

Figure 2. RNAMotif descriptors for the ferritin IRE with a left hand bulge of 0 to 3 nucleotides used in this study to generate the input file for the EA. The numbers associated with each component of the structure represent the number of nucleotides that were allowed when generating “hits” over the ferritin mRNA sequences.

Under the condition that there is one and only one structure chosen from each of these organisms to generate a bin, an output space of 9.7 × 1013 possible bins is defined (Figure 3). EC can then be used to search this space for the known correct solution. We wished to determine how various evolutionary parameters (number of parents and offspring in the population and method of selection) affected the convergence rate for this particular example. A better understanding of these parameters in light of a known RNA structure would help in the proper setting of these parameters for future hypothesis testing.

      

    

                ! "         "#$% &'     " !        (       "#)"'* ) +   

         

Figure 3. An example output from the EA for the known ferritin IRE using the conditions of tournament selection with tournament size = 10, 40 parents, and 20 offspring per parent (adopted from [6]). The correct solution was discovered in a minimum of 33 generations. The solution is provided as GenBank (gi) accession number, positional index in the UTR, length of the discovered structure, and nucleotide sequence with spaces to indicate different structural regions/base pairing.

II. METHODS AND RESULTS A. Evolutionary computation Evolution was performed using a server/client architecture with four, dual processor Intel Pentium III, 450 MHz, 256 MB RAM computers running Linux O/S. Each experimental run used one “master” server and three “client” proceses with each client initalized with a

(1)

where P is the number of parents and O is the number of offspring. Our prior investigation [6] utilized 40 parents and 20 offspring per parent (S = 840 individuals) when working with the structure in Figure 2 on the same machines. It is known that the number of offspring per parent can substantially affect the rate of convergence and overall success of EAs for some problems [7]. We generated a set of experiments with a constant population size S = 840 individuals and varied the ratio of offspring and parents while monitoring performance in terms of CPU time and number of correct trials over 500 generations. The results can be found in Table 1 and Figure 4a-f. All parameter settings examined were able to converge on the correct location for the IRE, yet some settings found the correct solution more often or faster than others over 30 trials. For experiments with a low percentage of parents (%P) in the population, the variance in the mean fitness of the 30 clients was higher than those experiments with a high percentage of parents in the population. However, the correct solution appeared to be discovered faster by one of the clients when using populations with a small number of parents (see “best” convergence Figure 4a-c) rather than those with large percentages of parents (Figure 4d-f). The time to complete 500 generations decreased with increasing number of parents (Table 1). We then conducted a series of additional experiments with a constant population size of 840 individuals with 40 parents and 20 offspring per parent and varied the method of selection by using either elitist selection or tournament sizes of 5, 10, and 20. Of these, the tournament size of 5 provided the best balance of minimized variance in the final generation and convergence speed on the known best solution (Figure 5a-d). While the time required to calculate 500 generations was essentially the same for these selection methods, the resulting number of clients that did arrive at the correct solution was roughly similar for elitist and tournament size 5 (11 of 30 clients) and tournament size 10 and 20 (13 of 30 clients) (Table 2). All three methods were able to discover the best solution in roughly equivalent time (a minimum of 17 to 21 generations), but tournament size 20 appeared to take longer (31 generations). When “best” performance is expressed in terms of mean fitness, we determined that tournament size 10 with either 84 parents and 756 offspring or 168 parents and 672 offspring performed as well as tournament size 5 with 40 parents and 800 offspring (mean fitness = .7912 to .7913; Table 1 and 2). Given these three parameter settings, the

609

(a)

(b)

(c)

(d)

(e)

(f)

Figure 4a-f. Convergence plots of (a) a population of 2.3% parents, (b) 3.5% parents, (c) 4.7% parents, (d) 10% parents, (e) 20% parents, (f) 50% parents showing the mean and variance (solid line) and best (dotted line) fitness over generations for 30 independent clients. The results indicate that populations with few parents result in greater variance in the final convergence, but allows some of the trials to reach the known correct solution more rapidly than populations with high percentages of parent solutions.

610

Table 1. Experiments using different parents (P) and offspring (O) using a constant population size S=840 with tournament size 10 over 500 generations of evolution on 30 independent clients. The solution of known biological truth shown in Figure 3 has a fitness of 0.796388. Performance is provided in terms of the resulting mean fitness, best fitness, number of clients that found the correct solution, minimum number of generations required to reach the correct solution, mean number of generations for those clients that did find the correct solution and mean clock time required to complete 500 generations. Although the differences between the mean and best fitnesses appear small, the difference in resulting RNA structure may be large.

Mean Minimum # Correct Mean Time to Clients of 30 Number of Number of Complete 500 Generations Generations Generations Clients to Correct to Correct (min.) Solution Solution

P

O

S

O/P

%P

Mean Fitness of 30 Clients

Best Fitness of 30 Clients

20

820

840

41

2.3

0.7806

0.796388

15 (50%)

15

103

18.7

30

810

840

27

3.5

0.7903

0.796388

15 (50%)

15

104

18.5

40

800

840

20

4.7

0.7842

0.796388

13 (43%)

21

150

19.5

84

756

840

9

10

0.7912

0.796388

8 (26%)

25

83

17.2

168

672

840

4

20

0.7910

0.796388

7 (23%)

53

232

15.5

420

420

840

1

50

0.7905

0.796388

5 (16%)

49

111

15.1

choice of 168 parents and 672 offspring can complete 500 generations in 15% less time than when using 40 parents and 800 offspring due to fewer variation calculations per generation. When using “best” fitness as a metric of performance, all parameter settings were able to find the correct solution, however there was significant variance in the percentage of trials that were able to find this solution for each parameter setting. Two settings (20 parents, 820 offspring and 30 parents, 810 offspring) gave the best percentage of trials that converged on the known correct solution (50%). This result is counterintuitive relative to mean fitness as it suggests smaller percentages of parents in the population may lead to convergence on the correct solution whereas larger percentages of parents in the population may lead to better mean convergence over N trials with few of those trials actually finding the correct solution. When using time as a metric of performance, strategies that minimized the number of variation operators per generation and maximized the number of parents that were retained at each generation were faster. Thus the fastest setting of 420 parents and 420 offspring was 20% faster than the slowest setting of 20 parents and 820 offspring. Comparison of tournament and elitist selection strategies suggests that these strategies can perform roughly equivalently, however specific tuning of the tournament size can lead to performance improvement. Greater performance variance was observed with different

numbers of parents and offspring in the population rather than with differing selection method. III. DISCUSSION EC has been offered as a means to discovery similar RNA structures over a set of orthologous sequences [6,8]. A key approach to this method is the coupling of EC with RNAMotif, an algorithm used to specify RNA structure in the form of a descriptor that contains both sequence and structure context with varying complexity. In the condition where no prior information exists for a hypothetical RNA regulatory feature, a loose RNAMotif descriptor is required to give greater flexibility and detection of more distantly related structure elements. With this approach, the number of hits increases exponentially and exhaustive calculation is infeasible, requiring an alternate approach such as the use of EC for discovery of similar structures. During initial investigation into this approach, only minor tuning of the parameters associated with the evolutionary search was completed. Here we present the results of experiments that focused on altered percentage of parents in a population and method of selection to determine which parameter settings (if any) increased the rate of convergence on a known correct solution. The proper setting of population parameters is important for the success of any EA. Upon finding a lack of convergence to a particular problem the typical EC researcher simply

611

Table 2. Experiments using different methods of selection (elitist or tournament) using a constant population size S=840 with 40 parents and 20 offspring per parent over 500 generations of evolution on 30 independent clients. Performance is provided in terms of the resulting mean fitness, best fitness, number of clients that found the correct solution, minimum number of generations required to reach the correct solution, mean number of generations for those clients that did find the correct solution and mean clock time required to complete 500 generations.

Selection Method/ Tournament Size Elitist

Mean Fitness of 30 Clients

Best Fitness of 30 Clients

0.7899

0.796388

11 (37%)

17

176

18.5

5

0.7913

0.796388

11 (37%)

21

219

18.3

10

0.7842

0.796388

13 (43%)

21

150

19.5

20

0.7876

0.796388

13 (43%)

31

158

18.5

Mean Time to # Correct Clients Minimum Number Mean Number of of 30 Clients of Generations to Generations to Complete 500 Correct Solution Correct Solution Generations (min.)

(a)

(b)

(c)

(d)

Figure 5a-d. Convergence plots with a population of 40 parents and 800 offspring operating with (a) tournament size 20, (b) tournament size 10, (c) tournament size 5, (d) elitist selection. The results indicate that increased tournament size increases the variance in the mean convergence, but tournament size 5 appears to have less mean variance than elitist selection. All population strategies can rapidly converge on the known correct solution.

612

increases the population size until a solution of sufficient worth is discovered. That strategy works when sufficient computational power is available or when there is unlimited time. But this is rarely the case, especially in real-world settings where rapid high-throughput screening of large numbers of drug targets is required. Indeed, it is difficult to find any papers in the literature that focus on optimal settings for the numbers of parent and offspring as a means to avoid having to increase the population size, number of machines, or time of calculation. In the limit, a population that is composed of one parent and many offspring from that parent at each generation maximizes the exploration of the response surface at the expense of refined local search. Conversely, a population strategy that maximizes the number of parents and minimizes the number of offspring per parent at each generation to one, significantly reduces the ability of the algorithm to explore the response surface yet provides a useful means for local search. Some balance of this is required and is likely to be problem dependent. These trends are evidenced in Figures 3a-f where the variance in the mean fitness over 30 trials is larger for population strategies with low parent percentages. This particular test example is known to have a two solutions of nearly equal top fitness that differ only in the placement of a bulge or internal loop in the lower half of the structure. Over all parent percentages, both of these structures were returned as “best” solutions, indicating that all population settings with a population size of 840 individuals were effective on this response surface. However, populations with high parent percentages maintained low variance during the search and resulted in the entire population stagnating at either of these two best solutions. Population strategies with low parent percentages maintained a higher variance at each generation and resulted in capturing the true “best” solution more rapidly than populations with high parent percentages. These data suggest that the problem of RNA structural similarity discovery presents a rugged multi-modal landscape which in the case of rapid discovery of best solutions requires maintainance of high variation in the population. This high variation can be achieved with a low percentage of parent solutions at each generation. It is possible that the number of parents and offspring could be made to be self-adaptive with the process of evolutionary optimization. This would require a set of multiple parallel clients each operating with independent population parameters set for a specific number of generations. At some user-defined interval, fitnesses could be compared across clients and the parent and offspring values biased over all clients towards the client with best performance. This concept could be used in conjunction with standard methods of self-adaptation that typically evolve the amount of variation that is applied over time.

ACKNOWLEDGEMENTS The authors would like to thank V. William Porto and David B. Fogel for assistance with initial design and coding of the EA and the anonymous reviewers for their valuable input. REFERENCES [1] E.C. Theil, “The iron responsive element (IRE) family of mRNA regulators. Regulation of iron transport and uptake compared in animals, plants and microorganisms,” Met. Ions. Biol. Syst., vol. 35, pp. 403-434, 1998. [2] Z. Gdaniec, H. Sierzputowska-Gracz, and E.C. Theil, “Iron regulatory element and internal loop/bulge structure for ferritin mRNA studied by cobalt(III) hexamine binding, molecular modeling and NMR spectroscopy,” Biochemistry vol. 37, pp. 1505-1512, 1998. [3] A.M. Thomson, J.T. Rogers, and P.J. Leedman, “Ironregulatory proteins, iron-responsive elements and ferritin mRNA translation,” Int. J. Biochem. Cell. Biol. vol. 31. pp. 1139-1152, 1999. [4] Y. Ke, H. Sierzputowska-Gracz, Z. Gdaniec, and E.C. Theil, “Internal loop/bulge and hairpin loop of the iron-responsive element of ferritin mRNA contribute to maximal iron regulatory protein 2 binding and translational regulation in the iso-iron-responsive element/iso-iron regulatory protein family,” Biochemistry vol. 39, pp. 6235-6242, 2000. [5] T.J. Macke, D.J. Ecker, R.R. Gutell, D. Gautheret, D.A. Case, and R. Sampath, “RNAMotif, an RNA secondary structure definition and search algorithm,” Nuc. Acids Res. vol. 29, pp. 4724-4735, 2001. [6] G.B. Fogel, V.W. Porto, D.G. Weekes, D.B. Fogel, R.H. Griffey, J.A. McNeil, E. Lesnik, D.J. Ecker, and R . Sampath, “Discovery of RNA structural elements using evolutionary computation,” Nuc. Acids Res. vol. 30(23), pp. 5310-5317, 2002. [7] D. Landavazo and G.B. Fogel, “Evolved neural networks for quantitative structure-activity relationships of anti-HIV compounds,” Proc. of the 2002 Congress on Evolutionary Computation. pp. 199-204. [8] E.A. Lesnik, G.B. Fogel, D. Weekes, T.J. Henderson, H.B. Levene, R. Sampath, D. Ecker, “Identification of Conserved Regulatory RNA Structures in Prokaryotic Metabolic Pathway Genes,” in review Bioinformatics.

613