
Int. J. Data Mining and Bioinformatics, Vol. 7, No. 2, 2013

RNA secondary structure prediction using conditional random fields model

Sitthichoke Subpaiboonkit
Faculty of Science, Department of Computer Science and Bioinformatics Research Laboratory, Chiang Mai University 50200, Thailand
E-mail: sitthichoke.s@cmu.ac.th

Chinae Thammarongtham
Biochemical Engineering and Pilot Plant Research and Development Unit, National Center for Genetic Engineering and Biotechnology, Bangkok 10150, Thailand
E-mail: [email protected]

Robert W. Cutler
Independent Research Scientist, 15 Prasing Post, Ampur Muang, Chiang Mai 50200, Thailand
E-mail: [email protected]

Jeerayut Chaijaruwanich*
Faculty of Science, Department of Computer Science and Bioinformatics Research Laboratory, Chiang Mai University 50200, Thailand
E-mail: [email protected]
*Corresponding author

Abstract: Non-coding RNAs (ncRNAs) have important biological functions in living cells that depend on their conserved secondary structures. Here, we focus on computational RNA secondary structure prediction by exploring primary sequences and complementary base pair interactions using the Conditional Random Fields (CRFs) model, which treats RNA prediction as a sequence labelling problem. To extract suitable features from known RNA secondary structures, we developed a feature extraction method based on the natural loop and stem characteristics of RNA. Our CRFs models can predict the secondary structures of the test RNAs with optimal F-scores between 56.61% and 98.20% for different RNA families.

Keywords: RNA secondary structure prediction; ncRNA; Non-Coding RNA; CRFs; conditional random fields; bioinformatics; machine learning; data mining.

Copyright © 2013 Inderscience Enterprises Ltd.

RNA secondary structure prediction using CRFs model


Reference to this paper should be made as follows: Subpaiboonkit, S., Thammarongtham, C., Cutler, R.W. and Chaijaruwanich, J. (2013) ‘RNA secondary structure prediction using conditional random fields model’, Int. J. Data Mining and Bioinformatics, Vol. 7, No. 2, pp.118–134.

Biographical notes: Sitthichoke Subpaiboonkit graduated with a Bachelor’s Degree in Computer Sciences and a Master’s Degree in Bioinformatics from the Faculty of Science at Chiang Mai University, Thailand, in 2008 and 2011, respectively. He is currently a Lecturer at the Department of Computer Science and a member of the Bioinformatics Research Laboratory, Faculty of Science, Chiang Mai University, Thailand. His interests are data mining, graphical models, structured models and machine learning.

Chinae Thammarongtham received his Doctoral Degree in Biotechnology from King Mongkut’s University of Technology Thonburi (Thailand). He is currently a Researcher at the Biochemical Engineering and Pilot Plant Research and Development Unit, National Centre for Genetic Engineering and Biotechnology, Bangkok, Thailand. His current research focuses on biological sequence analysis including Non-coding RNA (ncRNA) prediction.

Robert W. Cutler received his PhD in Computational BioPhysics from Vanderbilt University (USA). His current research focuses on understanding the workings of various methods of correcting mutations in genomes, characterisation of phenotypic traits in tropical plants and development of computational models to better understand genomic function.

Jeerayut Chaijaruwanich received his PhD in Informatique from Universite d’Evry (France). He is currently an Associate Professor in the Department of Computer Science and Director of the Bioinformatics Research Laboratory, Faculty of Science, Chiang Mai University, Thailand. His current research focuses on microarray and protein sequence analysis using machine-learning techniques.

1 Introduction

Recent developments show that functional Non-coding RNAs (ncRNAs) are extensively involved in many gene regulatory mechanisms (Yoon and Vaidyanathan, 2008). Hence, finding these regions in the genome of a living organism is an important task. However, this kind of prediction is more difficult than standard gene-finding because of a lack of statistical signal (Sato and Sakakibara, 2005). In this paper, we focus on RNA secondary structure prediction as one factor in determining ncRNA regions. RNA sequences have conserved secondary structures formed by complementary base pair interactions, such as canonical base pairs (Watson-Crick complementary base pairs, together with the occasional G-U pair). Stem-loops arranged in a nested manner (Figure 1(a)), in which no base pairs cross each other, are one of the typical substructures found in the secondary structures of almost all RNAs (Durbin et al., 2002). Since determining RNA secondary structure through laboratory methods, such as Nuclear Magnetic Resonance (NMR) and X-ray crystallography, is complicated and expensive, RNA secondary structure prediction using computational methods provides a useful alternative (Wiese et al., 2008).

Many computational methods have been developed to predict RNA secondary structure based on structure data. Probabilistic approaches have been widely proposed for this task, including the Covariance Model (CM) (Eddy and Durbin, 1994), Stochastic Context Free Grammar (SCFG) (Sakakibara, 1992; Sakakibara et al., 1994; Knudsen and Hein, 1999), Pair Hidden Markov Models on Tree Structure (PHMMTS) (Sakakibara, 2003), CONTRAfold (Do et al., 2006), classification of ncRNA using graph representations (Karklin et al., 2004) and RNA secondary structural alignment with CRFs (Sato and Sakakibara, 2005).

Figure 1  Example of RNA secondary structure in: (a) nested manner and (b) pseudo-knot

In this paper, we define RNA secondary structure prediction as a sequence labelling problem in which sequence residues are labelled by dots and parentheses. One of the standard methods for such sequence labelling tasks is the CRFs method. This is a supervised machine-learning technique that uses an undirected graph to build probabilistic models and solves the sequence labelling task with suitable feature extraction based on the conditional approach (Wallach, 2004). CRFs use a finite-state model with un-normalised transition probabilities. Unlike some other weighted finite-state approaches such as the convolutional neural network (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labels, trained by Maximum Likelihood Estimation (MLE) or Maximum a Posteriori (MAP) estimation. Since the loss function is convex, convergence to the global optimum is guaranteed (Lafferty et al., 2001). In this paper, we propose a novel feature extraction derived from a simple idea: using the natural characteristics of nested RNA stem-loops to build score patterns. This concept may not extend easily to the case of pseudo-knots. Examples of a nested structure and a pseudo-knot are shown in Figure 1(a) and (b), respectively. Our counting score algorithm extracts the hidden information in the stem’s base pairing for each expected loop length of nested loops.
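For orientation, the conditional distribution of a CRF in its commonly used linear-chain form can be written as below. The symbols $f_k$ (feature functions), $\lambda_k$ (feature weights), $Z(x)$ (normaliser) and $r$ (sequence length) are generic notation assumed here for illustration only; the features actually used in this paper are those described in Section 2.1.

```latex
\[
p(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{r} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'}
  \exp\!\Big( \sum_{t=1}^{r} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
\]
```

The per-position scores inside the exponential are un-normalised; only the global factor $Z(x)$, which sums over all candidate label sequences $y'$, turns them into a probability distribution. The log-likelihood of this model is concave in the weights $\lambda_k$, which is the property behind the convergence guarantee mentioned above.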

2 Materials and methods

2.1 RNA features extraction for secondary structure prediction based on Conditional Random Fields

Let X be the random variable for the RNA sequence and its features to be labelled, and Y be the random variable for the labels of the corresponding RNA secondary structure. In the discriminative framework, a conditional probability $p(Y = y \mid X = x)$ of the RNA secondary structure label sequence $y = y_1, \ldots, y_r$ is estimated from the RNA sequence and its features $x = x_1, \ldots, x_r$. Here, the RNA secondary structure label sequence y consists of:

•  Stem complementary base pairs, denoted by ‘(’ and ‘)’,

•  Loop nucleotides, denoted by ‘*’,

•  Non-loop unpaired nucleotides, denoted by ‘.’.

Examples of RNA sequences and their secondary structure labelling are shown in Figure 1.
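The labelling scheme can be made concrete with a small sketch that recovers the base pairs encoded by a nested label string. This is an illustration only, not the authors' code: the function name `pairs_from_labels` and the example hairpin are hypothetical, and positions are reported as 0-based indices.

```python
def pairs_from_labels(labels: str) -> list[tuple[int, int]]:
    """Recover base pairs (0-based index pairs) from a nested label string.

    Labels follow the scheme in the text: '(' and ')' for stem base pairs,
    '*' for loop nucleotides and '.' for other unpaired nucleotides.
    """
    open_positions: list[int] = []  # stack of '(' positions not yet matched
    pairs: list[tuple[int, int]] = []
    for i, label in enumerate(labels):
        if label == "(":
            open_positions.append(i)
        elif label == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {i}")
            # In a nested structure, ')' always closes the most recent '('.
            pairs.append((open_positions.pop(), i))
        elif label not in ("*", "."):
            raise ValueError(f"unknown label {label!r} at position {i}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Hypothetical hairpin: a 5-pair stem closing a 3-nucleotide loop.
sequence = "GCGCGAAACGCGC"
labels = "(((((***)))))"
for i, j in pairs_from_labels(labels):
    print(f"{sequence[i]}{i + 1} pairs with {sequence[j]}{j + 1}")
```

A stack suffices here because nested structures never contain crossing pairs; a pseudo-knot cannot be written unambiguously with a single kind of bracket, so it falls outside this representation.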

Basically, the RNA secondary structure can be inferred by maximising the number of complementary base pairs in stems (Durbin et al., 2002). To deploy this idea in the CRF framework, we train CRF models using the RNA sequence and its secondary structure together with the number of complementary base pairs, hereafter called the base pairs score. All possible loop lengths and stem sizes are varied to count every possible number of complementary base pairings. This extraction of local base pair information describes the overall RNA secondary structure pattern and is suitable for training the CRF model. The additional feature helps the CRF learn the characteristics of stems and loops from RNA sequences in a higher dimension. We propose two alternative methods for counting the base pairs score, based on:

•  the middle of the loop

•  the maximal base pairs score within the loop.

2.1.1 Counting base pairs score based on the middle of the loop

Suppose that the nucleotide of interest t is at the middle of a loop of length L. The number of base pairs can then be counted by equation (1), as illustrated in Figure 2. An example of the base pairs score is shown in Figure 3 and Table 1.

\[
\mathrm{score}[t] = \sum_{a = \lceil L/2 \rceil}^{\lceil L/2 \rceil + w - 1} C[X_{t-a}, X_{t+a+b}] \tag{1}
\]

where b equals 1 if L is an even number and 0 if L is an odd number, w is the size of the stem, and $C[X_{t-a}, X_{t+a+b}]$ denotes the complementary base pairs score:

\[
C[X_{t-a}, X_{t+a+b}] =
\begin{cases}
1 & \text{if } (X_{t-a} = \mathrm{G},\ X_{t+a+b} = \mathrm{C}) \text{ or } (X_{t-a} = \mathrm{C},\ X_{t+a+b} = \mathrm{G}) \text{ or} \\
  & \phantom{\text{if }} (X_{t-a} = \mathrm{A},\ X_{t+a+b} = \mathrm{U}) \text{ or } (X_{t-a} = \mathrm{U},\ X_{t+a+b} = \mathrm{A}) \text{ or} \\
  & \phantom{\text{if }} (X_{t-a} = \mathrm{G},\ X_{t+a+b} = \mathrm{U}) \text{ or } (X_{t-a} = \mathrm{U},\ X_{t+a+b} = \mathrm{G}) \\
0 & \text{otherwise.}
\end{cases} \tag{2}
\]

For a particular nucleotide position t in the middle of the loop, we count the possible number of complementary base pairings in the stem of size w. This extraction of local base pair information describes the overall RNA secondary structure pattern and is suitable for training the CRF model. The example in Table 1 shows that the nucleotide with the highest score tends to be the true middle position of the loop, since this aligns the complementary bases correctly.

Figure 2  Example of counting base pairs score based on the middle of the loop with size of stem w = 5

Figure 3  Example of base pairs score based on the middle of the loop with size of stem w = 5

Table 1  Example of base pairs score based on the middle of the loop for each RNA nucleotide base and its label with size of stem w = 5

              Expected loop length
Base    2   3   4   5   6   7   8   9   Label
C       0   0   0   0   0   0   0   0   )
G       1   0   0   0   0   0   0   0   )
C       1   0   0   0   0   0   0   0   )
G       0   0   0   0   0   0   0   0   )
C       0   1   0   1   0   1   0   0   )
A       0   3   0   3   0   2   0   1   *
A       0   5   0   4   0   3   0   2   *
A       0   3   0   3   0   2   0   1   *
G       0   1   0   1   0   1   0   0   (
C       1   0   0   0   0   0   0   0   (
G       1   0   0   0   0   0   0   0   (
C       0   0   0   0   0   0   0   0   (
G       0   0   0   0   0   0   0   0   (
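The middle-of-loop score can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on three assumptions: the summation in equation (1) starts at a = ⌈L/2⌉ (the reading that agrees with the values in Table 1), positions that fall outside the sequence contribute 0, and t is a 1-based position. The names `complement_score`, `middle_score` and `score_table` are hypothetical.

```python
# Complementary pairs from equation (2): Watson-Crick pairs plus G-U wobble.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def complement_score(x: str, y: str) -> int:
    """C[x, y] from equation (2): 1 for a complementary pair, else 0."""
    return 1 if (x, y) in PAIRS else 0


def middle_score(seq: str, t: int, loop_len: int, w: int) -> int:
    """Base pairs score for 1-based position t, assumed to be the loop middle."""
    b = 1 if loop_len % 2 == 0 else 0   # even loops are centred between t and t+1
    first = (loop_len + 1) // 2         # ceil(L / 2), assumed lower limit of the sum
    total = 0
    for a in range(first, first + w):   # w candidate stem pairs
        left, right = t - a, t + a + b
        if left < 1 or right > len(seq):
            break                       # assumption: out-of-range pairs count as 0
        total += complement_score(seq[left - 1], seq[right - 1])
    return total


def score_table(seq: str, loop_lengths, w: int) -> list[list[int]]:
    """One row per nucleotide, one column per expected loop length."""
    lengths = list(loop_lengths)
    return [[middle_score(seq, t, L, w) for L in lengths]
            for t in range(1, len(seq) + 1)]


# Bases of Table 1, read top to bottom, with loop lengths 2..9 and w = 5.
seq = "CGCGCAAAGCGCG"
for base, row in zip(seq, score_table(seq, range(2, 10), w=5)):
    print(base, *row)
```

Under these assumptions the rows printed for this sequence match the scores listed in Table 1; for instance, the central A receives 5, 4, 3 and 2 for loop lengths 3, 5, 7 and 9. Each row can then serve as a per-nucleotide feature vector alongside the base itself.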


2.1.2 Counting the maximal base pairs score within the loop

The previous counting of the base pairs score is based on the hypothesis that the considered nucleotide is located at the middle of the loop. Table 1 shows that the nucleotide that has the highest score tends to be the true middle position of the loop. It is interesting to consider the loop as a block whose nucleotides have the same highest score. This can be computed by equation (3). Only the maximal score within the loop is represented; see Table 2 for an example.

t • = argmax score[t ] 1< t