January 28, 2011 15:39 WSPC/185-JBCB

S0219720011005288

Journal of Bioinformatics and Computational Biology, Vol. 9, No. 1 (2011) 149–178 © Imperial College Press DOI: 10.1142/S0219720011005288

BIOINFORMATIC APPROACHES FOR PREDICTING SUBSTRATES OF PROTEASES

JIANGNING SONG∗,†,‡,§,‡‡, HAO TAN†, SARAH E. BOYD¶, HONGBIN SHEN, KHALID MAHMOOD†,∗∗, GEOFFREY I. WEBB††, TATSUYA AKUTSU‡, JAMES C. WHISSTOCK∗,†,∗∗,§§ and ROBERT N. PIKE∗,†,¶¶

†Department of Biochemistry and Molecular Biology, Monash University, Victoria 3800, Australia

‡Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan

§Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China

¶AgriBio, La Trobe University, Victoria 3086, Australia

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200240, China

∗∗ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Victoria 3800, Australia

††Faculty of Information Technology, Monash University, Victoria 3800, Australia

‡‡[email protected]
§§[email protected]
¶¶[email protected]

Received 15 July 2010
Revised 8 October 2010
Accepted 9 October 2010

Proteases have central roles in “life and death” processes due to their important ability to catalytically hydrolyze protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the “substrate identification” problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.

Keywords: Proteases; substrate specificity; substrate cleavage site; bioinformatics; sequence analysis; machine learning; support vector machine; feature selection; structural information.

∗Corresponding authors.

1. Introduction

Proteases, also known as proteolytic enzymes, peptidases or proteinases, are enzymes that catalyze the breakdown of proteins by the hydrolysis of peptide bonds in their substrates.1–5 This process is referred to as proteolysis or substrate cleavage and is used as a biological switch to activate/deactivate protein function in numerous biological processes.4 Thus the ability to catalytically hydrolyze protein substrates, along with the removal of damaged or undesirable products generated after protein synthesis, is fundamental to all forms of life.2–5 Indeed, controlled proteolysis is one of the major pathways by which the estimated 1–1.5 million peptides and proteins are produced from the ~26,000 human genes, in order to fulfil the complexity of human life.1 Proteases thus have central roles in “life and death” processes, such as neural, endocrine and cardiovascular signalling, digestion, degradation of misfolded or unwanted proteins, immunity, cell division and apoptosis.

The importance of protease-controlled protein regulation is apparent in numerous human pathological conditions related to alterations in protease activity levels, including cancer, arthritis, inflammation, neurodegenerative and cardiovascular diseases, development, infection and immunity.2–4 Their fundamental roles in multiple biological processes and their implication in numerous pathological processes resulting from common biochemical functions have led to the concept of the protease degradome,5 which is defined as the complete set of proteases in an organism. The key to understanding the physiological role of a protease is to identify the repertoire of its natural substrate(s), known as the substrate degradome.5 Proteases act as processing enzymes that carry out either highly or moderately selective cleavage of the scissile bond within the cleavage site of their substrates.
The specificity and thereby role of the enzymes vary, primarily depending on their active sites, which display selectivity ranging from preferences for a number of specific amino acids at defined positions (e.g. thrombin and the caspases) to more generic sites with limited discrimination at one position (e.g. chymotrypsin).6,7 In addition to the primary amino acid sequence of the substrate, specificity is also influenced by the three-dimensional conformation of the substrate (secondary and tertiary structures). In particular, proteases preferentially cleave substrates within extended loop regions,8–11 while residues that are buried within the interior of the protein substrate are clearly inaccessible to the protease active site. Finally, cleavage is regulated by the temporal and physical co-location of the protease and substrate.


In particular, some proteases are sequestered within specific compartments, with limited access to proteins, while others are able to cleave multiple substrates in different physiological compartments. Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict target protein substrates; however, this information can at present only be derived from experimental approaches. In recent years, many specificity-profiling experimental techniques have been developed to globally profile protease–substrate relationships and characterize the substrate repertoire of proteases. These include phage display,12,13 peptide libraries,14,15 PROTOMAP (SDS-PAGE integrated protein topography and Migration Analysis platform),16 and, more recently, proteomic technologies that use chemical labelling strategies to distinguish N-termini of peptides that are newly formed from proteolytic events such as iTRAQ,17,18 or iTRAQ-TAILS (iTRAQ with terminal amine isotopic labelling of substrates),19–21 as well as other modified N-terminal labelling or positional proteomics approaches.22–26 Despite the fact that these global profiling techniques have greatly increased the coverage of uncharacterized protease substrates, they are complex, costly and time-consuming. However, in the absence of such data, the targets of protease function cannot be deduced a priori from the structure or sequence of the protease. 
In-depth understanding of the substrate specificity of proteases is essential for obtaining deep insight into their function, and is a prerequisite for the development of specific inhibitors.11,27–29 Thus, solving the “substrate identification” problem is fundamental both to understanding protease biology and to the development of therapeutics that target specific protease-regulated pathways.2 In this context, bioinformatic prediction of the substrates of proteases can provide important and experimentally testable information with regard to the discovery of novel substrates and the putative cleavage sites for a protease. In the past few decades, a number of computational approaches have been developed to predict proteases and their substrates based on different algorithms, including prediction of protease family (serine, cysteine, aspartic, glutamic, metallo or threonine),30,31 development of general predictors to model substrate specificity,32–36 and prediction of the substrate cleavage sites for various proteases, for example, caspases,37–43 granzyme-B,44,45 thrombin,46–50 Factor Xa,51,52 trypsin,53–55 HIV-protease,56–61 SARS-CoV62 and C virus proteases.63,64 These methods generally take known substrate sequences as input, and the resulting models can predict substrate cleavage sites with accuracy ranging from 70% to over 90%. This indicates that these sequence-based substrate specificity tools can be helpful in identifying novel physiological substrates and their cleavage sites in vivo. In this review, we summarize recent progress in the development of bioinformatic approaches to predict protease substrate specificity including benchmark datasets, feature extraction, prediction algorithms, webservers and software implementations. We focus on the methods that are publicly available and highlight some of the major results and challenges in developing more accurate approaches for predicting substrate specificity. 
We will discuss the shortcomings and drawbacks of current methods and raise issues that remain to be addressed in future studies. We will also illustrate the utility of the methods by presenting several case studies for the identification of putative substrate cleavage sites and discuss possible future directions for the current methods and their utility for identifying novel putative substrates.

2. Biochemistry and Nomenclature of the Substrate Specificity of Protease

Proteases specifically cleave protein substrates near the N- or C-terminus (termed exopeptidases), or in the middle of the substrate (termed endopeptidases), through the binding of the protease active site to the substrate residues flanking the cleavage site (Fig. 1). As defined by Schechter and Berger,65 the active site residues in the protease are composed of contiguous pockets termed subsites. Each subsite pocket binds to a corresponding residue in the substrate sequence, referred to here as the sequence position. According to this definition, amino acid residues in the substrate sequence are consecutively numbered outward from the cleavage site as ···-P4-P3-P2-P1-P1′-P2′-P3′-P4′-··· (the scissile bond is located between the P1 and P1′ positions), while the subsites in the active site are correspondingly labelled as ···-S4-S3-S2-S1-S1′-S2′-S3′-S4′-···. Proteases bind to their substrates through interactions between the subsites in the active site cleft and the amino acids within the cleavage site. However, the subsites exhibit varying binding affinities for the amino acids in the substrate, ranging from a restriction to one or a few specific amino acids, to generic binding with little or no discrimination between different amino acids.6 For instance, thrombin is a serine protease with a classical role in the blood coagulation system, acting as a pro-coagulant enzyme by converting fibrinogen into insoluble strands of fibrin in the final stages of blood coagulation, as well as catalyzing many other coagulation and complement-related reactions.66,67 Thrombin possesses a trypsin-like specificity, as it preferentially cleaves after arginine (R) residues and, under some circumstances, after lysine (K) residues, although with a lower efficiency.68 However, in addition to the requirement of P1 R or K, other amino acids from the P8-P8′ positions confer important additional determinants of specificity that can affect the substrate cleavage efficiency, particularly across the P4-P2′ positions (Fig. 2).

Fig. 1. The nomenclature of protease substrate specificity. Amino acid residues in the substrate are numbered outward from the cleavage site as ···-P4-P3-P2-P1-P1′-P2′-P3′-P4′-···, and the cleavage site is highlighted by the black arrow, between P1 and P1′. The subsite pockets in the active site are correspondingly numbered as ···-S4-S3-S2-S1-S1′-S2′-S3′-S4′-···. The P1-P4 non-prime-side residues are colored blue, while the P1′-P4′ prime-side residues are colored orange. The black arrow indicates the substrate cleavage site after the P1 position. For clarity, only positions and subsites extending to P4-P4′ and S4-S4′ are shown.

In order for a substrate to be cleaved, the subsites usually need to bind to the substrate residues in a “cooperative” or “concerted” manner, which means that the binding process of each amino acid residue in the substrate to its corresponding subsite is not entirely independent, a phenomenon referred to as subsite cooperativity.6,7 This has been observed in several proteases, such as thrombin, Factor Xa, trypsin, HIV-1 and other viral proteases, where researchers have identified a number of cooperative subsites whose interactions can have either positive or negative influences on cleavage efficiency.6 For example, in double mutants of thrombin, the concerted interactions between subsites were abolished, resulting in a great loss of specificity and the inability to cleave substrates.46 This synergistic effect illustrates the important role of cooperativity between two or more subsites, a subject that has recently been reviewed.6

Fig. 2. Thrombin substrate cleavage sites for P8-P8′ positions. The results are calculated from the experimentally determined substrate cleavage sites of thrombin published in the MEROPS database.1 The amino acid occurrences from P8-P8′ are calculated and displayed in the form of a two-dimensional heat map. The scissile bond between P1 and P1′ is indicated by a vertical white line.

Protease specificity is influenced not only by substrate sequence but also by substrate conformation.69,70 This includes the compatibility between the shape of the active site cleft and the structure of the amino acid side chains within the cleavage site,35 and the presentation of the cleavage site at the surface of the three-dimensional structure of the substrate.42,45 Furthermore, the specificity of many proteases can be dramatically affected by the existence and occupancy of one or more exosites on the protease, i.e. binding sites that are remote from the active site, but which regulate the function of the protease.50,71 The presence of these multiple influences on substrate cleavage, and the complexity of enzyme-substrate recognition, combined with the fundamental importance of characterizing the substrate degradome, make it highly desirable to develop powerful bioinformatic approaches to accurately predict the substrates of proteases. Publicly available approaches to this problem are reviewed here.
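The Schechter and Berger numbering described in this section can be illustrated with a short sketch. The function name, its parameters and the example peptide are assumptions introduced here for illustration; they are not part of any published tool.

```python
def schechter_berger_labels(sequence, p1_index, n_nonprime=4, n_prime=4):
    """Label residues around a scissile bond with P4..P1 and P1'..P4' names.

    `p1_index` is the 0-based index of the P1 residue, so the scissile bond
    lies between `p1_index` and `p1_index + 1`. Positions that fall outside
    the sequence are skipped. (Hypothetical helper, for illustration only.)
    """
    labels = []
    # Non-prime side: count outward from the bond toward the N-terminus.
    for k in range(n_nonprime, 0, -1):
        i = p1_index - (k - 1)
        if 0 <= i < len(sequence):
            labels.append((f"P{k}", sequence[i]))
    # Prime side: count outward from the bond toward the C-terminus.
    for k in range(1, n_prime + 1):
        i = p1_index + k
        if 0 <= i < len(sequence):
            labels.append((f"P{k}'", sequence[i]))
    return labels


# Illustrative peptide; cleavage is assumed to occur after the arginine (index 5).
for name, residue in schechter_berger_labels("GSLVPRGSHM", p1_index=5):
    print(name, residue)
```

Indexing the site by its P1 residue keeps the convention "cleavage occurs after P1" explicit, which is also how the sliding windows in Sec. 3.2 are anchored.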

3. Methodology Development

3.1. Benchmark dataset construction

An essential component of substrate prediction is the construction of high-quality benchmark datasets, which ideally contain experimentally verified protease substrates. These benchmark sets are used for the model optimization and performance evaluation of the developed approaches. The most comprehensive information resource is the MEROPS database,1,72 which contains all identified proteases, together with a curated collection of their known substrates and inhibitors. In addition to MEROPS, there are other curated databases providing detailed annotation of proteases and their substrates, such as CutDB,73 a database of proteolytic events for physiologically relevant proteins in vivo or in vitro; PMAP,74 an integrated database for analyzing proteolytic events and pathways; and the Degradome database,75 which assembles all of the relevant mammalian proteases and the relationship between protease alterations and disease. Specific collections for individual proteases or protease families, such as the CASBAH database for caspases, also exist.76 Details of these databases are provided in Table 1.

Using resources such as these, the first step is to collect high-quality data and build a benchmark dataset containing experimentally verified substrate sequences with detailed annotation of the cleavage sites. This dataset is used to build the predictive models and evaluate performance. However, as the starting dataset might contain homologous substrate sequences that share high sequence similarity with one another, programs such as CD-HIT77 need to be run to reduce the overall sequence redundancy in the dataset. The remaining substrate sequences are then partitioned into training and testing datasets (see Sec. 3.4, “Performance evaluation”).
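The two preparation steps described above, redundancy reduction followed by partitioning, can be sketched as follows. This is a minimal stand-in and not CD-HIT: the similarity measure (Python's `difflib.SequenceMatcher` ratio), the 0.7 threshold, the 20% test fraction and all function names are assumptions chosen only to make the idea concrete.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; CD-HIT uses its own identity definition.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences, threshold=0.7):
    """Greedy filter: visit sequences longest first and keep one only if it is
    less similar than `threshold` to every sequence already kept."""
    kept = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(similarity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy sequences invented for illustration; the first two differ by one residue.
substrates = ["ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWA", "WWWWWWWWWW"]
non_redundant = reduce_redundancy(substrates)
training, testing = split_train_test(non_redundant, test_fraction=0.5)
print(non_redundant, training, testing)
```

Filtering before splitting matters: if near-identical substrates end up on both sides of the partition, the measured test performance overstates how well a model generalizes.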


Table 1. A list of publicly available databases of proteases and their substrates/inhibitors, in alphabetical order. For each database, the area of specialization, current data statistics, and the URL are provided. The statistics were collected as of July 2010.

Database | Specialization | Statistics | URL
CASBAH76 | Caspase | 776 substrates | http://bioinf.gen.tcd.ie/casbah/
CutDB73 | Proteolytic events | 470 proteases and 3,070 proteolytic events | http://cutdb.burnham.org/
Degradome75 | Mammalian proteases and proteolytic diseases | 570 human, 568 chimpanzee, 651 mouse and 641 rat proteases | http://degradome.uniovi.es/
MEROPS1,72 | Proteases, substrates, inhibitors | 160,176 sequences, 2,995 identifiers | http://merops.sanger.ac.uk/
PMAP74 | Proteolytic events and pathways | >45,000 proteases, >5,000 proteolytic events | http://www.proteolysis.org/

3.2. Sequence feature extraction and representation

Cleaved and uncleaved peptide sequences are collected as positive and negative training datasets, respectively. A sliding window strategy is commonly employed to extract local sequence features from both the positive and negative data, in which the P1 cleavage site is either symmetrically or non-symmetrically flanked by the upstream and downstream residues (Fig. 3). At the substrate sequence level, sequence information is encoded using a binary encoding scheme. In addition to the local amino acid sequences surrounding the cleavage sites, predicted structural information in the form of secondary structure, solvent accessibility, and natively unstructured regions can be further incorporated to take into consideration the local structural determinants (Fig. 3). These derived features are summarized in turn below.

Fig. 3. A sliding window approach to extract the local sequence features from cleaved and uncleaved sequences. A caspase substrate, the apoptotic protease Mch-2 (Uniprot ID: P55212), is used as an example here. Two typical local window sizes are presented: P8-P8′ (a symmetrical sliding window), and P4-P2′ (a non-symmetrical sliding window). The arrow indicates the substrate cleavage site after the P1 position, while the predicted structural features that are taken into account are highlighted in different colors (C, coil; B, buried; E, exposed; *, disordered; -, ordered). Examples shown here include the predicted secondary structure, solvent accessibility and native disorder.
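The sliding window strategy can be sketched as below. The function name, the use of `X` to pad windows that run past a terminus, and the example peptide are assumptions made for illustration; individual tools may instead discard incomplete windows.

```python
def extract_windows(sequence, cleaved_p1_indices, n_nonprime=8, n_prime=8, pad="X"):
    """Return one (window, label) pair per candidate P1 position.

    Each window covers P{n_nonprime}..P1 followed by P1'..P{n_prime}'.
    The label is 1 when the P1 index is an annotated cleavage site, else 0.
    Windows extending past either terminus are padded with `pad` (assumption).
    """
    known = set(cleaved_p1_indices)
    padded = pad * n_nonprime + sequence + pad * n_prime
    samples = []
    # The last residue cannot be P1 because no peptide bond follows it.
    for p1 in range(len(sequence) - 1):
        start = p1 + 1  # padded index of the P{n_nonprime} residue
        window = padded[start:start + n_nonprime + n_prime]
        samples.append((window, 1 if p1 in known else 0))
    return samples


peptide = "GSLVPRGSHM"  # illustrative peptide, cleavage assumed after index 5
symmetric = extract_windows(peptide, [5])                             # P8-P8'
asymmetric = extract_windows(peptide, [5], n_nonprime=4, n_prime=2)   # P4-P2'
print(symmetric[5], asymmetric[5])
```

Every non-annotated position becomes a negative sample here, so the resulting dataset is heavily imbalanced; how negatives are sampled is a separate design decision.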


3.2.1. Binary encoding scheme of amino acid sequence

A binary encoding scheme is widely adopted in many applications, where each amino acid residue is represented by a 20-dimensional binary vector composed of 0 and 1 elements, in which the position of the single 1 follows the alphabetical order of the one-letter amino acid codes.39,43 For example, Ala is represented as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], Cys is represented as [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], Asp is represented as [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], . . . , and Tyr is represented as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]. Therefore, a peptide sequence with L amino acid residues is represented by an (L × 20)-dimensional vector.

3.2.2. PEST sequences

PEST sequences were originally proposed by Rogers et al. to function as proteolytic signals.78,79 These sequences contain regions enriched in proline (P), glutamate (E), serine (S) and threonine (T). PEST sequences are hydrophilic stretches of at least 12 amino acids in length, with the entire region flanked by lysine (K), arginine (R) or histidine (H), but not interrupted by positively charged residues.78,79 Approximately 10% of mammalian proteins contain PEST sequences. The presence of PEST sequences in the vicinity of protease cleavage sites might be responsible for target recognition of protein substrates for rapid proteolytic degradation by proteases such as the ubiquitin-26S proteasome and calpain.79 Because PEST sequences have been implicated as targets in potential substrates, the PEST-like index has been used in combination with other descriptors to improve the prediction accuracy of cleavage sites by the CaSPredictor,38 PoPS34 and SitePrediction36 programs.

3.2.3. Physicochemical properties

Physicochemical properties of amino acid residues provide useful features for discriminating cleaved sequences from uncleaved sequences. Each amino acid type can be encoded by physicochemical properties extracted from the AAindex database,80 including hydrophobicity, hydrophilicity, volumes of side chains, polarity, isoelectric point and accessible surface area.51,81 The peptide sequence within the sliding window can then be translated into a vector, with each of the amino acids numerically represented by one of the set of descriptors.

3.2.4. Secondary structure

The secondary structure of protein substrates can be predicted using PSIPRED,82 which is a neural network-based predictor for estimating the probabilities of three secondary structure classes: helix, strand and coil. PSIPRED is considered one of the most accurate sequence-based predictors of protein secondary structure, with an overall accuracy of up to 80%. In our previous work, predicted secondary structure information has been shown to significantly improve the accuracy of prediction of various properties of proteins, such as cis/trans isomerization,83 residue-wise contact order,84 disulfide connectivity pattern,85 half-sphere exposure86 and residue depth.87 Recent studies revealed that cleavage sites of some proteases show clear preferences towards specific secondary structure motifs. For example, all-helix and all-loop motifs are commonly cleaved by caspases.25,42 Predicted secondary structure has been taken into account by specificity prediction programs, such as PoPS,34 Cascleave42 and SitePrediction.36

3.2.5. Solvent accessibility

Appropriate presentation of cleavage sites in a solvent-exposed region can be important for efficient proteolysis.36,69 Two-state (exposed or buried) solvent accessibility is an additional descriptor that can be predicted using the SSpro program implemented in the SCRATCH suite.88 This program outputs the estimated probability of a residue being solvent-exposed or buried within the substrate structure, which in turn can be used as input to machine learning predictors. Thus, predicted structural information can be exploited to identify substrate sites that are relatively exposed, allowing protease binding and cleavage. However, the fact that some proteolytic cleavages occur at solvent-inaccessible regions implies that being predicted as solvent-buried does not always rule out the possibility of a site being cleaved.25 Several methods including PoPS,34 Cascleave42 and SitePrediction36 have exploited predicted solvent accessibility information, either as a filter or as direct input to the developed predictors. In cases where 3D structural information is available, PoPS can also use solvent accessibility derived from known structural information in PDB.

3.2.6. Native disorder

The prevalence and significance of natively disordered or intrinsically unstructured proteins has been recognized, as a consequence of large-scale genome sequencing projects and the development of experimental techniques to analyze the structural properties of proteins in solution.89–92 Natively disordered proteins have been found to have important roles in many biological processes including transcription, translation, signal transduction, protein recognition, targeted degradation and cell cycle regulation. The natively disordered regions of these proteins have been shown to be essential for their functions. Native disorder of a substrate sequence can be accurately predicted from amino acid sequences using machine learning predictors such as DISOPRED293 and PrDOS.94 Several studies have shown that the predicted native disorder information can be used to improve the predictive performance.42,87,90

3.3. Computational methods for predicting the substrate specificity

Predicting protease substrate cleavage sites can be viewed as a binary classification problem. The underlying assumption is that the sequence features encoded in the local sequence surrounding the cleavage sites should contain important information for determining the substrate specificity. In particular, the combination of multiple sequence and structural features should improve performance in terms of predicting substrate cleavage sites. Here, the discussion will mostly focus on methods for predicting protease substrate cleavage that have been implemented as publicly available webservers or stand-alone software.

Current methods for predicting protease substrates are generally categorized into two main classes: machine learning approaches and empirical scoring functions. The former is mainly based on the selection and subsequent representation of useful sequence features and further mapping of input features to the property of being cleaved (positive) or non-cleaved (negative), e.g. BBFNN,37 CASVM,39,40 Multi-factor CASVM,41 Cascleave,42 Pripper,43 fXaWeb51 and ProtIdent.31 The latter relies on learning the underlying rules using the distribution of positive and negative sequences, and building empirical scoring functions to discriminate between the two classes, e.g. PeptideCutter,33 CasPredictor,38 GraBCas,44 PoPS,34 HIVcleave61 and SitePrediction.36 Table 2 summarizes the features of these methods, and a brief review of their predictive performances is provided in the Results and Discussion section.
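To make the input side of this binary classification problem concrete, the sketch below turns a local window into a numeric feature vector by combining the binary encoding of Sec. 3.2.1 with one physicochemical descriptor from Sec. 3.2.3. The function names are hypothetical, the Kyte-Doolittle hydropathy scale is used only as one example of a hydrophobicity index, and encoding unknown or padding residues as zeros is an assumption.

```python
# Residues ordered alphabetically by one-letter code, as in Sec. 3.2.1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as an example descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def binary_encode(window):
    """Concatenate one 20-dimensional one-hot vector per residue (L x 20 values)."""
    vector = []
    for residue in window:
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown or padding residues stay all-zero (assumption)
            one_hot[index] = 1
        vector.extend(one_hot)
    return vector


def property_encode(window, scale=KYTE_DOOLITTLE, default=0.0):
    """Represent each residue by a single physicochemical value (L values)."""
    return [scale.get(residue, default) for residue in window]


def encode(window):
    """Feature vector for one window: binary part followed by property part."""
    return binary_encode(window) + property_encode(window)


features = encode("LVPRGS")  # illustrative P4-P2' window
print(len(features))
```

A vector of this kind, paired with a cleaved or non-cleaved label, is the form of input that either class of method discussed below would consume.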

3.3.1. Empirical scoring approaches

Empirical scoring approaches rely on learning the underlying rules based on the observed distribution of samples and building empirical scoring functions to discriminate between different classes. In terms of predicting substrate cleavage, most of these prediction methods use experimentally verified substrate data to establish the scoring rules that will determine the prediction quality. Generally, the scoring methods assign a score to each amino acid residue in the cleavage site, and the final score is summed or averaged within a given local window.

PeptideCutter is an empirical scoring function-based tool used to identify the cleavage sites of protease substrates.33 However, many of the cleavage sites used by PeptideCutter were derived from a relatively limited experimental dataset. In addition to proteases, PeptideCutter also provides predictions for specific chemicals that are used to break down proteins. The success of this tool partly lies in the fact that the proteases and chemicals it models have well-defined actions in the breakdown of proteins. Another scoring tool, CasPredictor, predicts caspase cleavage sites by taking into account the amino acid substitution index and the presence of PEST-like sequences in the vicinity of cleavage sites.38 This program achieved a sensitivity of 81% (111/137) when predicting cleavage sites for a dataset of 137 experimentally determined cleavages. Another prediction algorithm, GraBCas, provides a position-specific scoring prediction of potential substrate cleavage sites for caspases and granzyme B.44 This algorithm was trained on the relative frequencies of amino acids measured in the study of Thornberry et al., who used positional scanning synthetic combinatorial libraries to determine the cleavage specificities of caspases and granzyme B.95
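The general idea of a position-specific score averaged over a local window can be sketched as follows. This is a generic log-odds scoring matrix, not the scoring scheme of PeptideCutter, CasPredictor or GraBCas; the pseudocount, the uniform background, the function names and the toy training windows are all assumptions introduced for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(cleaved_windows, pseudocount=1.0):
    """Log-odds matrix from aligned cleaved windows (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(cleaved_windows) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for position in range(len(cleaved_windows[0])):
        counts = Counter(window[position] for window in cleaved_windows)
        pssm.append({
            aa: math.log(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Average per-position score; residues outside the alphabet contribute 0."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, window)) / len(pssm)


def rank_sites(pssm, sequence, n_nonprime):
    """Score every full-length window and sort by decreasing score.

    Each hit is (score, 0-based P1 index, window)."""
    length = len(pssm)
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        hits.append((score_window(pssm, window), start + n_nonprime - 1, window))
    return sorted(hits, reverse=True)


# Toy P4-P1' windows invented for illustration (cleavage after the fourth residue).
training = ["DEVDG", "DEVDS", "DQTDG", "DEHDA"]
pssm = build_pssm(training)
for score, p1, window in rank_sites(pssm, "MKADEVDGLLS", n_nonprime=4)[:3]:
    print(round(score, 3), p1, window)
```

Scoring each position independently is what makes this approach simple, and it is also why such functions cannot by themselves capture the subsite cooperativity discussed in Sec. 2.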

Table 2. Publicly available webservers and tools for predicting the substrate specificity of proteases.

Tool | Algorithm | Dataset | Feature | URL
PeptideCutter33 | Empirical scoring | Not available | Specificity probability table surrounding cleavage sites | http://au.expasy.org/tools/peptidecutter/
CasPredictor38 | Empirical scoring | 223 substrates, 283 cleavage sites | PEST-like index and position-dependent amino acid matrices | http://icb.usp.br/~farmaco/Jose/CaS predictorfiles
GraBCas44 | Empirical scoring | 47 substrates, 59 cleavage sites | Position-specific scoring matrices | http://wwwalt.med-rz.unikliniksaarland.de/med fak/humangenetik/software/index.html
PoPS34 | Empirical scoring | From the MEROPS database or user-derived | Position-specific scoring matrices, solvent accessibility, secondary structure | http://pops.csse.monash.edu.au/
BBFNNa,37 | Bayesian bio-basis functional neural network | 13 substrates | Local sub-sequences using a sliding window with a fixed size | Not available
CASVM39,40 | Support vector machine | 219 cleavage sites | Binary encoding amino acid sequence | http://www.casbase.org/casvm/
Multi-factor CASVM41 | Support vector machine | 74 cleavage sites | Binary encoding amino acid sequence, disordered and solvent exposed propensity | http://www.casbase.org/casvm/
SitePrediction36 | Empirical scoring | From the MEROPS database | PEST sequences, solvent accessibility and secondary structure | http://www.dmbr.ugent.be/prx/bioit2-public/SitePrediction/
Cascleave42 | Support vector regression | 370 substrates, 562 cleavage sites | Predicted secondary structure, solvent accessibility and native disorder based on bi-profile Bayesian feature extraction | http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/Cascleave/
Pripper43 | Supervised machine learning | 358 substrates, 443 cleavage sites | Binary encoding amino acid sequence | http://users.utu.fi/mijopi/Pripper/
Peptide Specifier45 | Support vector machine | 359 granzyme B and 602 caspase substrates | Secondary structure, solvent accessibility and native disorder | http://modbase.compbio.ucsf.edu/peptide/
HIVcleave61 | Discriminant function algorithm | 62 HIV-1 and 22 HIV-2 protease substrates | Octapeptide sequence | http://www.csbio.sjtu.edu.cn/bioinf/HIV/
ProtIdent31 | Two-layer ensemble classifier | 3,051 protease sequences | Functional domain and position-specific scoring matrix | http://www.csbio.sjtu.edu.cn/bioinf/Protease/
fXaWebb,51 | Bootstrap aggregation algorithm | 132 substrate sequences | Six-residue sequence fragments | http://asqa.iis.sinica.edu.tw/fXaWeb/

Notes: a BBFNN: Bayesian Bio-basis Functional Neural Network. b fXaWeb: Multi-level bootstrap aggregation predictor for predicting the substrate specificity of factor Xa protease.

PoPS is a comprehensive bioinformatic tool that allows users to build computational models of protease substrate specificity, which can be used to predict and rank potential cleavage sites for any protease.34 Users can also augment PoPS models with dependency rules to account for subsite cooperativity effects that have been experimentally observed to be significant for some proteases, such as thrombin, trypsin and HIV protease. Another advantage of PoPS is that it allows expert users to build new and accurate specificity models from knowledge gleaned from experimental work.34 Complementary information about solvent accessibility, secondary structure and the presence or absence of PEST sequences can be used within the PoPS prediction engine to further screen out false positive sites.

SitePrediction is a state-of-the-art empirical scoring tool for predicting the substrate specificity of any protease.36 It combines a frequency score, based on the occurrence of each amino acid type at each position in a substrate site, with an amino acid substitution matrix score that indicates the similarity of the potential cleavage sites to the known cleavage sites. The final score is calculated as the product of these two scores. Similar to PoPS, SitePrediction also provides extra analyses such as a sequence logo representation of the predicted site, identification of PEST regions, and details on solvent accessibility and secondary structure,36 giving users a better idea of the likelihood of cleavage at a site of interest. Finally, both PoPS and SitePrediction can be employed on a large scale to search entire proteomes for putative protease substrates and their potential cleavage sites.
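The idea of multiplying a frequency score by a similarity score can be sketched as below. This is not the SitePrediction implementation, whose exact formulas are not given here: the averaging over positions, the use of the best-matching known site, the toy residue groups standing in for a real substitution matrix, and all names and data are assumptions.

```python
# Toy residue groups standing in for a real amino acid substitution matrix.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def residue_similarity(a, b):
    """1.0 for identical residues, 0.5 within a toy group, otherwise 0.0."""
    if a == b:
        return 1.0
    if a in GROUP_OF and GROUP_OF.get(a) == GROUP_OF.get(b):
        return 0.5
    return 0.0


def frequency_score(window, known_sites):
    """Mean, over positions, of how often the window's residue occurs there."""
    n = len(known_sites)
    per_position = [
        sum(site[i] == residue for site in known_sites) / n
        for i, residue in enumerate(window)
    ]
    return sum(per_position) / len(window)


def similarity_score(window, known_sites):
    """Similarity to the closest known site, averaged over positions."""
    def mean_similarity(site):
        return sum(residue_similarity(a, b) for a, b in zip(window, site)) / len(window)
    return max(mean_similarity(site) for site in known_sites)


def combined_score(window, known_sites):
    """Product of the two component scores."""
    return frequency_score(window, known_sites) * similarity_score(window, known_sites)


known = ["DEVDG", "DEVDS", "DQTDG", "DEHDA"]  # toy sites invented for illustration
for candidate in ["DEVDG", "DEIDG", "KKKKK"]:
    print(candidate, round(combined_score(candidate, known), 3))
```

Because the two scores are multiplied, a candidate must do reasonably well on both to rank highly; a window that resembles one known site but uses residues rarely seen at those positions is penalized.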

3.3.2. Machine learning approaches

Machine learning approaches can automatically learn and recognize complex patterns and use these to construct predictive models that discriminate between different classes of samples, especially in cases where it is difficult to extract explicit rules. Appropriate representation of sequence or structural features, in combination with feature selection, is required to reduce data dimensionality and improve predictive power. An issue for machine learning approaches, however, is how to efficiently avoid or overcome the over-fitting problem. Representative machine learning methods include artificial neural networks, naïve Bayes, decision trees, nearest neighbor, random forest, support vector classification and support vector regression.

The first machine learning algorithm applied to predict caspase substrate cleavage sites was the BBFNN (Bayesian bio-basis function neural network), originally proposed by Yang.37 Based on 13 substrate sequences with experimentally determined caspase cleavage sites, sub-sequences were obtained by scanning each protein sequence with a fixed-size sliding window and were input into the BBFNN algorithm to build classifiers, which achieved a highest accuracy of 97.15 ± 1.13% when evaluated on this small dataset. The SVM approach to machine learning has been designed to avoid the overfitting problem and can be trained efficiently to maximize classification accuracy


with higher generalization ability.96,97 With the increasing availability of experimental data for caspase substrate specificity, Wee et al. have employed the SVM algorithm with the radial basis function (RBF) kernel for training classifiers.39,40 Their SVM-based models (termed CASVM) extracted local amino acid sequence profiles using a sliding window approach and transformed them into an n-dimensional vector using an orthonormal encoding scheme. This approach achieved accuracies ranging from 81.25% to 97.92% when evaluated on a dataset containing 390 caspase substrate sequences.39,40 Wee et al. further developed the multi-factor CASVM, in which a first prediction step based on the output of CASVM is followed by a second filtering step based on structural factors, such as native disorder and solvent accessibility, that filters out false positives to improve predictive performance.41

Song et al. developed a new approach to predicting the cleavage sites of caspases from substrate primary sequences.42 This approach, named Cascleave, uses support vector regression (SVR) to build prediction models based on multiple sequence and structural features derived from: (i) the local amino acid sequence profile; (ii) the predicted secondary structure; (iii) the predicted solvent accessibility; and (iv) the predicted native disorder.42 These features were extracted with a recently developed bi-profile Bayesian feature extraction method98 and used as input to train the models. The idea of this method is that the intrinsic difference between two classes of samples can be better reflected and learned in a bi-profile manner, by calculating the posterior probabilities of each amino acid at each position from P8 to P8′ in the training dataset. Thus, integrating the bi-profile Bayesian features might be more informative than the general encoding scheme.
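The sliding-window, orthonormal and bi-profile encodings mentioned above can be sketched in a few lines of Python. This is a simplified illustration and not the CASVM or Cascleave code: the function names are hypothetical, the window is fixed at P4-P4′ around each Asp, sequences are assumed to contain only the 20 standard residues, and the bi-profile probabilities are estimated as smoothed per-position frequencies in the positive and negative training windows (the add-one smoothing is an assumption).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def windows_around_asp(sequence, n_before=4, n_after=4):
    """Yield (1-based P1 position, window) for each Asp with complete flanks."""
    for i, residue in enumerate(sequence):
        start, end = i - n_before + 1, i + 1 + n_after
        if residue == "D" and start >= 0 and end <= len(sequence):
            yield i + 1, sequence[start:end]


def orthonormal_encode(window):
    """Concatenate one 20-dimensional one-hot vector per residue."""
    vector = []
    for residue in window:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def positional_frequencies(windows):
    """Per-position residue frequencies with add-one smoothing."""
    table = []
    for pos in range(len(windows[0])):
        counts = {aa: 1.0 for aa in AMINO_ACIDS}
        for window in windows:
            counts[window[pos]] += 1.0
        total = sum(counts.values())
        table.append({aa: count / total for aa, count in counts.items()})
    return table


def bi_profile_encode(window, pos_table, neg_table):
    """Probabilities under the positive profile, then under the negative one."""
    positive = [pos_table[i][aa] for i, aa in enumerate(window)]
    negative = [neg_table[i][aa] for i, aa in enumerate(window)]
    return positive + negative


# Toy usage: the first Asp lacks a full P4 flank and is skipped.
print(list(windows_around_asp("MKDEVDGSLAA")))  # [(6, 'DEVDGSLA')]
```

The orthonormal vector grows by 20 dimensions per residue and treats all substitutions as equally different, whereas the bi-profile vector has only two values per position and carries information from both the cleaved and uncleaved classes.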
Song et al. investigated the effects of different window sizes on predictive performance and found that the Cascleave model based on the local window P4-P2′ achieved the best prediction accuracy. Experiments on a dataset of 562 cleavage sites showed that Cascleave correctly predicts 82.2% of the known caspase cleavage sites with a Matthews correlation coefficient (MCC) of 0.667,42 performing favorably compared with CASVM39 and multi-factor CASVM.41

Pripper is a recently developed tool for predicting caspase substrate cleavage sites.43 It uses three different machine learning-based classifiers (SVM, random forest and the J48 algorithm) to make predictions. Pripper can be used to identify caspase substrates from tandem mass spectrometry-based proteomics experiments.

More recently, Barkan et al. presented another SVM-based tool to computationally recognize the substrate peptides of granzyme B and caspases.45 They used sequence and structural properties including the frequencies of amino acid residue types occurring at each position from P4 to P4′, physicochemical properties, secondary structure, solvent accessibility and native disorder. This approach is able to predict caspase substrates at an 87% true positive rate and a 13% false positive rate, and granzyme B substrates at a 79% true positive rate and a 21% false positive rate.45 The tool has been further applied to scan the human proteome to generate a list of high-confidence candidate substrates


that were subjected to experimental validation. Two high-confidence predictions, Apoptosis Inducing Factor (AIF-1) and Survival Motor Neuron 1 (SMN1), were experimentally validated as substrates of granzyme B.45

In addition, other predictors based on machine learning techniques have been developed specifically for the substrate specificity of other proteases. Examples include the neural network methods for predicting HIV protease cleavage sites,55,57–59 the decision tree method for the SARS-CoV protease,62 and the SVM-based fXaWeb predictor for Factor Xa substrate sites.51 However, in the case of modeling the substrate specificity of the HIV-1 protease, Rögnvaldsson and You pointed out that it is inappropriate to apply complex non-linear algorithms, such as neural networks, to a linearly separable dataset, especially when there is no evidence that the problem is non-linear and no serious attempt has been made to rule out the use of simple linear classifiers.99 Together, the consensus of these studies is that the sequence profiles of protease substrates are very informative for determining cleavage sites, while structural features are also useful, whether derived from real 3D structures or from predicted or modeled structural information.42,45

3.4. Performance evaluation

In order to objectively evaluate the predictive performance of any software program, three rigorous tests are regularly used: n-fold cross-validation, leave-one-out cross-validation (or jack-knife test) and independent tests. In n-fold cross-validation, the substrate sequences in the dataset are randomly divided into n equally sized subsets. In each validation step, one subset is singled out in turn as the test data, while the remaining subsets are used as the training data. This procedure is repeated n times, so that each subset is used exactly once as the test data.
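The n-fold procedure just described can be sketched as follows. This is a generic illustration rather than code from any of the reviewed tools; the function name and the fixed random seed are assumptions, and when n does not divide the dataset size the folds differ in size by at most one item.

```python
import random


def n_fold_splits(items, n, seed=0):
    """Yield (training, test) lists for n-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random assignment to folds
    folds = [shuffled[i::n] for i in range(n)]
    for k in range(n):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test


# Toy usage: ten substrate identifiers split into five folds.
for train, test in n_fold_splits(range(10), n=5):
    print("train:", sorted(train), "test:", sorted(test))
```

Setting n equal to the number of sequences makes every fold a single sequence, which corresponds to the leave-one-out (jack-knife) test.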
In leave-one-out cross-validation, each substrate sequence is singled out in turn as the test case, and the remaining sequences are used as the training data. This is repeated until each substrate sequence has been used once as the test data. In independent testing, the set of cleavage sites used to derive the model is independent of the set of cleavage sites used to test the model, with no overlap between the two datasets.

The predictive performance of the various specificity predictors is evaluated using the following measures:

(1) Sensitivity (percentage of correctly predicted caspase cleavage sites):

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

(2) Specificity (percentage of correctly predicted non-cleavage sites):

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

(3) Accuracy (percentage of correct predictions for both cleavage and non-cleavage sites):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

(4) Matthews correlation coefficient (MCC), a measure of the quality of binary classifications.100 MCC = 1 signifies a perfect classification, while MCC = 0 indicates completely random classification. It is defined as:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

In each of these measures, TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the F-score, which is the harmonic mean of precision and recall,101 is given as:

$$F\text{-score} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
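As a worked illustration, the measures above can be computed directly from the four confusion-matrix counts. The sketch below is a minimal Python version with a hypothetical function name; it assumes that both classes are non-empty, returns MCC = 0 when the denominator vanishes (a common convention, adopted here as an assumption), and takes the F-score as the harmonic mean of precision and recall, which works out to 2TP / (2TP + FP + FN).

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return the five performance measures for one confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_score": f_score,
    }


# Toy counts: 50 true cleavage sites and 50 non-cleavage sites.
for name, value in evaluate(tp=40, tn=45, fp=5, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC and the F-score remain informative when non-cleavage sites greatly outnumber cleavage sites, which is the usual situation when whole sequences are scanned.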

4. Results and Discussion

4.1. Sequence and structural determinants of the substrate specificity

As previously discussed, the substrate specificity of proteases can vary considerably, ranging from restricted specificity for one or only a few specific amino acids, to generic specificity without any discrimination between amino acids. Here we focus on two representatives, caspase-3 and thrombin, which are respectively cysteine and serine proteases, to illustrate the sequence and structural determinants of substrate specificity.

The caspases are a family of intracellular cysteine proteases that play essential roles in the initiation and execution of programmed cell death and apoptosis, and are often referred to as “executioner” or “killer” proteases.102–104 A hallmark of the substrate specificity of caspases is that they specifically cleave after aspartate (D) residues, although additional requirements must also be met in order for a caspase to efficiently cleave its substrate.104 As one of the critical members of the caspase family, caspase-3 is characterized by its recognition of the canonical cleavage site motif “DXXD↓X”, where an aspartate residue (D) occurs in both the P4 and P1 positions, “X” denotes any of the 20 amino acids, and “↓” denotes the cleavage site. More specific known cleavage motifs of caspase-3 include “DEVD↓G”, “DGPD↓G”, “DEVD↓N”, “DMQD↓N”, “DEPD↓S”, “DEAD↓G”, “DETD↓S” and “DAVD↓T”. Analysis of caspase-3 substrate specificity data from the MEROPS database1 with the WebLogo program105 gives a sequence logo representation of the amino acids from P8 to P8′, which shows that the major determinants for the P1′ position are small amino acids such as glycine (G), serine (S) and alanine (A) (Fig. 4).
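The canonical “DXXD↓X” motif lends itself to a simple pattern scan. The following Python sketch uses a regular expression with a lookahead so that overlapping matches are reported; the function name and the toy sequences are illustrative assumptions, and sequences are assumed to be upper-case one-letter codes without whitespace. Such a scan only lists candidates and says nothing about whether a site is actually cleaved.

```python
import re

# D at P4 and P1, any residue at P3, P2 and P1'. The lookahead consumes no
# characters, so overlapping motifs are all reported.
DXXD_PATTERN = re.compile(r"(?=(D..D)(.))")


def scan_dxxd(sequence):
    """Return (1-based P1 position, 'P4-P1|P1'') for each DXXD↓X match."""
    hits = []
    for match in DXXD_PATTERN.finditer(sequence):
        p1_position = match.start() + 4  # 1-based index of the P1 aspartate
        hits.append((p1_position, f"{match.group(1)}|{match.group(2)}"))
    return hits


print(scan_dxxd("AADYYDDYYGAA"))  # [(6, 'DYYD|D')]
print(scan_dxxd("DAADAADG"))      # [(4, 'DAAD|A'), (7, 'DAAD|G')]
```

In practice, the bare motif is far too permissive to serve as a predictor on its own, which is why the tools reviewed here add position-specific scores or learned models on top of it.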


Fig. 4. Sequence logo representations of the amino acid frequencies in the P8-P8′ cleavage sequences of (a) caspase-3 and (b) thrombin.

Thrombin is a serine protease from the chymotrypsin family of proteases and plays a critical role in the blood coagulation system.66,67 Thrombin preferentially cleaves substrates that contain an arginine (R) residue at the P1 position, but can also cleave after a lysine (K) residue, although with far less efficiency. In addition to the P1 R/K preference, thrombin has a strong preference for proline (P) residues at the P2 position, while serine (S) and glycine (G) residues at the P1′ position are also suggested to improve substrate cleavage by thrombin (Fig. 4).106,107

Recent studies have found different preferences for the secondary structure types occurring at protease cleavage sites. As indicated by the distribution of secondary structure (SS) motifs for P4-P4′ caspase and thrombin cleavage sites (Fig. 5), SS motifs consisting of only loop structures are the most common and are frequently observed in both caspase and thrombin substrates (54% and 22%, respectively). Surprisingly, helical SS motifs are not uncommon compared to other SS motifs (8% and 2% for caspase and thrombin substrates, respectively), suggesting potential roles for structural dynamics or conformational switching in these regions upon substrate binding and catalytic cleavage by these two proteases.25,42 In fact, substrate recognition and catalytic proteolysis by proteases is a complicated multi-step process that sometimes requires dramatic conformational changes of the substrate cleavage site to form the correct conformation for cleavage. The insulin-degrading enzyme (IDE), a Zn2+ metalloprotease, serves as a suitable example.108 On the one hand, IDE needs to transit from a substrate-free open state to a closed state in order to entrap substrate peptides. On the other hand, IDE substrates must undergo substantial conformational changes on binding to IDE in order to fit into the enclosed catalytic cleft for cleavage. In particular, sequences derived from an N-terminal loop and an α-helix both switch to β-strand structures prior to degradation.108 This mechanistic insight might explain why certain regular SS motifs, such as the all-helix SS motif, can be effectively cleaved by proteases.

Fig. 5. Distribution of secondary structure (SS) motifs of substrate cleavage sites from P4-P4′ for (a) caspase-3 and (b) thrombin. Datasets for both caspase-3 and thrombin substrates were extracted from the MEROPS database. Secondary structure is represented as H: alpha-helix, E: beta-strand, C: coil.

At the level of solvent accessibility and native disorder, local sequences surrounding cleaved and uncleaved sites also exhibit differences. Enrichment analyses performed by different groups, based either on solved 3D structures and comparative models or on sequence-based structure prediction, led to the same conclusion: cleaved sequences are more likely to be exposed to the solvent, flexible and natively disordered than uncleaved sequences.42,45 Together, these distinct sequence and structural features that flank the cleaved and uncleaved sites may be useful for building more accurate models for predicting the location of cleavage sites.

4.2. Case studies

In this section, we illustrate how putative protease cleavage sites can be predicted by performing in silico scanning of caspase substrate sequences. The caspase family is a focus of extensive research, and as a result, a number of bioinformatic tools


have been developed for predicting caspase substrates.37–45 Prediction of caspase cleavage sites is facilitated by the requirement for Asp (D) at the P1 position, but is confounded by the varying degree of specificity at other cleavage site positions and across the proteases. As some webservers are not available, we compare predictions from CASVM,40 Cascleave,42 PoPS34 and SitePrediction36 for two known caspase substrates. The first two tools are machine learning-based, while the latter two are empirical rule-based. CASVM does not provide a ranking, so its full prediction results are given in Supplementary Table 1; for the other three servers, the top five ranked predictions are shown in Supplementary Tables 2, 3 and 4, respectively. The two substrates investigated here represent cases for which it is difficult to predict cleavage specificity, as both contain cleavage sites that are cleaved sequentially during apoptosis.

The first example is the heterogeneous nuclear ribonucleoprotein inhibitor (hnRNP) (Uniprot109 ID: O43390). This protein has four caspase-3 cleavage sites: DYYD|DYYG (P1 position located at D472 within the protein sequence), KESD|LSHV (D87), DYHD|YRGG (D481) and RAID|ALRE (D66). CASVM predicts seven cleavage sites without ranking, and all four physiological cleavage sites are included in the predictions (Supplementary Table 1). The other three methods, Cascleave, PoPS and SitePrediction, give high rankings to the two cleavage sites DYYD|DYYG and DYHD|YRGG, but fail to identify the other two cleavage sites, RAID|ALRE and KESD|LSHV. Cascleave ranks DYYD|DYYG and DYHD|YRGG 1 and 2, respectively (Supplementary Table 2), PoPS ranks them 5 and 2 (Supplementary Table 3), and SitePrediction ranks them 2 and 1 (Supplementary Table 4). Interestingly, these two correctly identified cleavage sites are also located in predicted solvent-accessible and natively disordered regions [Fig. 6(a)].
The second caspase substrate is the Ras GTPase-activating protein (RasGAP) (Uniprot109 ID: P20936). It also contains sequential caspase-3 cleavage sites: DTVD|GKEI (D455) and DEGD|SLDG (D157). The former is the primary cleavage site, while the latter is a secondary cleavage site that is cleaved only after cleavage of the primary site. For RasGAP, CASVM outputs 11 predicted sites, which include both cleavage sites (Supplementary Table 1). PoPS provided the best prediction results, ranking the two sites 1 and 2, respectively (Supplementary Table 3). Cascleave ranked them 1 and 3 (Supplementary Table 2), whereas SitePrediction ranked them 2 and 3 (Supplementary Table 4). The performance differences between these three methods likely reflect the different kinds of information used by the different algorithmic frameworks. Combining the predictions of all three methods might help to identify high-confidence caspase substrates that can be subjected to further experimental validation in the laboratory,45 while remaining mindful of the balance between improving the algorithms and overfitting the models.
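One simple way to combine the three ranked outputs is to average the ranks that each tool assigns to a site. The sketch below applies this to the RasGAP ranks quoted above; it is an illustrative aggregation scheme and not a method used in this review. The function name is hypothetical, the ranks are assigned to the two sites in the order in which the sites are listed, and a site missing from a tool's top five is given rank 6 (an assumption).

```python
# Ranks of the two known RasGAP cleavage sites, as quoted in the text.
RASGAP_RANKS = {
    "PoPS": {"DTVD|GKEI": 1, "DEGD|SLDG": 2},
    "Cascleave": {"DTVD|GKEI": 1, "DEGD|SLDG": 3},
    "SitePrediction": {"DTVD|GKEI": 2, "DEGD|SLDG": 3},
}


def consensus_ranking(ranks_by_tool, missing_rank=6):
    """Order sites by mean rank across tools (lower is better)."""
    sites = {site for ranks in ranks_by_tool.values() for site in ranks}
    mean_rank = {
        site: sum(ranks.get(site, missing_rank) for ranks in ranks_by_tool.values())
        / len(ranks_by_tool)
        for site in sites
    }
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


for site, rank in consensus_ranking(RASGAP_RANKS):
    print(f"{site}: mean rank {rank:.2f}")
```

With these inputs the primary site DTVD|GKEI comes first (mean rank 1.33) and the secondary site DEGD|SLDG second (mean rank 2.67). Mean rank treats all tools as equally reliable; a weighted scheme would need performance estimates for each tool.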


Fig. 6. The predicted cleavage probability for caspase cleavage sites using Cascleave for two caspase substrates: (a) hnRNP (Uniprot ID: O43390) and (b) RasGAP (Uniprot ID: P20936), which have four and two caspase cleavage sites, respectively. The predicted coiled, solvent-exposed and disordered regions at the top of each panel are highlighted in magenta, green and red, respectively. The threshold value of 0.5 for making a positive cleavage site prediction is denoted by a red dashed line. The predicted cleavage sites in the P4-P4′ positions are also labeled.

5. Discussion and Conclusions

Due to their important function of catalytically hydrolyzing protein substrates, proteases have central roles in “life and death” processes. Protease-controlled events have been implicated in numerous human diseases.2–5 Solving the “substrate identification” problem is fundamental for both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. Determining the substrate specificity of proteases is an important step towards deeper understanding of the structure and function of proteases and the underlying mechanisms that govern site-specific proteolysis. Due to the gap between functionally characterized proteins and the proteins for which there is no experimental


functional annotation, there is great demand for powerful bioinformatic methods that can accurately predict protease substrates based on the currently available specificity data. In this review, we summarize recent advances in bioinformatic approaches for predicting the substrate specificity of proteases and their application to the identification of putative substrates. Existing approaches can be generally classified into two groups: empirical scoring-based or machine learning-based. A recent trend in this field is to recruit various sources of sequence and structural information to further improve the predictive performance of the models, including position-specific scoring matrices, predicted secondary structure, solvent accessibility and native disorder.36,42,45 Studies have indicated that the use of these different but complementary features can lead to improved predictive performance.

Future improvements in the performance of predictive methods may be derived in several ways. The first is to integrate relevant sequence and structural features of cleavage sites based on feature selection. Integrating structural information from the active site cleft of a protease might also be very helpful for developing better predictive models. The second is to exploit the ensemble learning approach, which provides a better framework for integrating different data.110–116 This approach uses multiple diverse and independent base classifiers and combines their original predictions to make a final, more reliable consensus prediction.110,111 The third is to improve the representation of true “negatives” (sites that cannot be cleaved under any experimental conditions). This is important for resolving the open question of how to efficiently reduce false positives, which are often numerous when the dataset is highly unbalanced (sparse positive samples against many negative samples).
On the other hand, the presence of “false negatives” could further confound the analysis, as in some cases those false negatives can actually be cleaved when the physiological conditions change, meaning that they are actually unidentified or unannotated positives. Finally, with greater availability of specificity data generated by proteome-wide profiling techniques, it is becoming feasible to build high-quality datasets that will allow further development and evaluation of computational methods in order to improve the prediction performance of protease substrate specificity.
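The effect of an unbalanced dataset on false positives can be made concrete with a short calculation. The sketch below takes the 87% true positive rate and 13% false positive rate quoted earlier in this review for caspase substrates and computes the expected precision (the fraction of predicted sites that are real) at several positive-to-negative ratios. The ratios are illustrative assumptions, not values from the reviewed studies, and the function name is hypothetical.

```python
def expected_precision(tpr, fpr, n_pos, n_neg):
    """Expected fraction of positive predictions that are true positives."""
    true_positives = tpr * n_pos
    false_positives = fpr * n_neg
    return true_positives / (true_positives + false_positives)


# Same classifier, increasingly unbalanced candidate sets.
for n_neg in (1, 10, 100):
    precision = expected_precision(tpr=0.87, fpr=0.13, n_pos=1, n_neg=n_neg)
    print(f"1 positive : {n_neg} negatives -> precision {precision:.3f}")
```

Under these assumptions the precision falls from 0.870 at a 1:1 ratio to about 0.401 at 1:10 and about 0.063 at 1:100, even though the true and false positive rates are unchanged. This is why a better-defined negative set and additional filters matter for proteome-wide scans.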

Acknowledgments

JS would like to thank the National Health and Medical Research Council of Australia (NHMRC) and the Japan Society for the Promotion of Science (JSPS) for financially supporting this research via the NHMRC Peter Doherty and JSPS Postdoctoral Fellowships. JS was also supported by the Hundreds of Talents Fellowship Program of the Chinese Academy of Sciences (CAS). HS was supported by the National Natural Science Foundation of China (60704047), Science and Technology Commission of Shanghai Municipality (08ZR1410600, 08JC1410600), and sponsored by Shanghai Pujiang Program and Innovation Program of Shanghai


Municipal Education Commission (10ZZ17). GIW is supported by the Australian Research Council (ARC) (DP0772238). JCW is an ARC Federation Fellow and an honorary NHMRC Principal Research Fellow.

References

1. Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ, MEROPS: The peptidase database, Nucleic Acids Res 36:D320–D325, 2008.
2. Turk B, Targeting proteases: Successes, failures and future prospects, Nat Rev Drug Discov 5:785–799, 2006.
3. López-Otín C, Matrisian LM, Emerging roles of proteases in tumour suppression, Nat Rev Cancer 7:800–808, 2007.
4. Quesada V, Ordóñez GR, Sánchez LM, Puente XS, López-Otín C, The Degradome database: Mammalian proteases and diseases of proteolysis, Nucleic Acids Res 37:D239–D243, 2009.
5. López-Otín C, Overall CM, Protease degradomics: A new challenge for proteomics, Nat Rev Mol Cell Biol 3:509–519, 2002.
6. Ng NM, Pike RN, Boyd SE, Subsite cooperativity in protease specificity, Biol Chem 390:401–407, 2009.
7. Boyd SE, Kerr FK, Albrecht DW, de la Banda MG, Ng N, Pike RN, Cooperative effects in the substrate specificity of the complement protease C1s, Biol Chem 390:503–507, 2009.
8. Hubbard SJ, Campbell SF, Thornton JM, Molecular recognition. Conformational analysis of limited proteolytic sites and serine proteinase protein inhibitors, J Mol Biol 220:507–530, 1991.
9. Rote KV, Rechsteiner M, Degradation of proteins microinjected into HeLa cells. The role of substrate flexibility, J Biol Chem 261:15430–15436, 1986.
10. Coombs GS, Bergstrom RC, Madison EL, Corey DR, Directing sequence-specific proteolysis to new targets. The influence of loop size and target sequence on selective proteolysis by tissue-type plasminogen activator and urokinase-type plasminogen activator, J Biol Chem 273:4323–4328, 1998.
11. Fairlie DP, Tyndall JD, Reid RC, Wong AK, Abbenante G, Scanlon MJ, March DR, Bergman DA, Chai CL, Burkett BA, Conformational selection of inhibitors and substrates by proteolytic enzymes: Implications for drug design and polypeptide processing, J Med Chem 43:1271–1281, 2000.
12. Matthews DJ, Wells JA, Substrate phage: Selection of protease substrates by monovalent phage display, Science 260:1113–1117, 1993.
13. Atwell S, Wells JA, Selection for improved subtiligases by phage display, Proc Natl Acad Sci USA 96:9497–9502, 1999.
14. Ju W, Valencia CA, Pang H, Ke Y, Gao W, Dong B, Liu R, Proteome-wide identification of family member-specific natural substrate repertoire of caspases, Proc Natl Acad Sci USA 104:14294–14299, 2007.
15. Schilling O, Overall CM, Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites, Nat Biotechnol 26:685–694, 2008.
16. Dix MM, Simon GM, Cravatt BF, Global mapping of the topography and magnitude of proteolytic events in apoptosis, Cell 134:679–691, 2008.
17. Enoksson M, Li J, Ivancic MM, Timmer JC, Wildfang E, Eroshkin A, Salvesen GS, Tao WA, Identification of proteolytic cleavage sites by quantitative proteomics, J Proteome Res 6:2850–2858, 2007.


18. Dean RA, Overall CM, Proteomics discovery of metalloproteinase substrates in the cellular context by iTRAQ labeling reveals a diverse MMP-2 substrate degradome, Mol Cell Proteomics 6:611–623, 2007.
19. Kleifeld O, Doucet A, auf dem Keller U, Prudova A, Schilling O, Kainthan RK, Starr AE, Foster LJ, Kizhakkedathu JN, Overall CM, Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products, Nat Biotechnol 28:281–288, 2010.
20. Prudova A, auf dem Keller U, Butler GS, Overall CM, Multiplex N-terminome analysis of MMP-2 and MMP-9 substrate degradomes by iTRAQ-TAILS quantitative proteomics, Mol Cell Proteomics 9:894–911, 2010.
21. auf dem Keller U, Prudova A, Gioia M, Butler GS, Overall CM, A statistics-based platform for quantitative N-terminome analysis and identification of protease cleavage products, Mol Cell Proteomics 9:912–927, 2010.
22. Timmer JC, Enoksson M, Wildfang E, Zhu W, Igarashi Y, Denault JB, Ma Y, Dummitt B, Chang YH, Mast AE, Eroshkin A, Smith JW, Tao WA, Salvesen GS, Profiling constitutive proteolytic events in vivo, Biochem J 407:41–48, 2007.
23. Impens F, Colaert N, Helsens K, Plasman K, Van Damme P, Vandekerckhove J, Gevaert K, MS-driven protease substrate degradomics, Proteomics 10:1284–1296, 2010.
24. Van Damme P, Maurer-Stroh S, Plasman K, Van Durme J, Colaert N, Timmerman E, De Bock PJ, Goethals M, Rousseau F, Schymkowitz J, Vandekerckhove J, Gevaert K, Analysis of protein processing by N-terminal proteomics reveals novel species-specific substrate determinants of granzyme B orthologs, Mol Cell Proteomics 8:258–272, 2009.
25. Mahrus S, Trinidad JC, Barkan DT, Sali A, Burlingame AL, Wells JA, Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini, Cell 134:866–876, 2008.
26. Timmer JC, Zhu W, Pop C, Regan T, Snipas SJ, Eroshkin AM, Riedl SJ, Salvesen GS, Structural and kinetic determinants of protease substrates, Nat Struct Mol Biol 16:1101–1108, 2009.
27. Kerr FK, Thomas AR, Wijeyewickrema LC, Whisstock JC, Boyd SE, Kaiserman D, Matthews AY, Bird PI, Thielens NM, Rossi V, Pike RN, Elucidation of the substrate specificity of the MASP-2 protease of the lectin complement pathway and identification of the enzyme as a major physiological target of the serpin, C1-inhibitor, Mol Immunol 45:670–677, 2008.
28. Kerr FK, O’Brien G, Quinsey NS, Whisstock JC, Boyd S, de la Banda MG, Kaiserman D, Matthews AY, Bird PI, Pike RN, Elucidation of the substrate specificity of the C1s protease of the classical complement pathway, J Biol Chem 280:39510–39514, 2005.
29. Leissring MA, Malito E, Hedouin S, Reinstatler L, Sahara T, Abdul-Hay SO, Choudhry S, Maharvi GM, Fauq AH, Huzarska M, May PS, Choi S, Logan TP, Turk BE, Cantley LC, Manolopoulou M, Tang WJ, Stein RL, Cuny GD, Selkoe DJ, Designed inhibitors of insulin-degrading enzyme regulate the catabolism and activity of insulin, PLoS ONE 5:e10504, 2010.
30. Shen HB, Chou KC, Identification of proteases and their types, Anal Biochem 385:153–160, 2009.
31. Chou KC, Shen HB, ProtIdent: A web server for identifying proteases and their types by fusing functional domain and sequential evolution information, Biochem Biophys Res Commun 376:321–325, 2008.


32. Lohmüller T, Wenzler D, Hagemann S, Kiess W, Peters C, Dandekar T, Reinheckel T, Toward computer-based cleavage site prediction of cysteine endopeptidases, Biol Chem 384:899–909, 2003.
33. Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A, Protein identification and analysis tools on the ExPASy server, in The Proteomics Protocols Handbook, Walker JM (ed.), Humana Press, Totowa, New Jersey, pp. 571–607, 2005.
34. Boyd SE, Pike RN, Rudy GB, Whisstock JC, Garcia de la Banda M, PoPS: A computational tool for modeling and predicting protease specificity, J Bioinform Comput Biol 3:551–585, 2005.
35. Venkatraman P, Balakrishnan S, Rao S, Hooda Y, Pol S, A sequence and structure based method to predict putative substrates, functions and regulatory networks of endo proteases, PLoS ONE 4:e5700, 2009.
36. Verspurten J, Gevaert K, Declercq W, Vandenabeele P, SitePredicting the cleavage of proteinase substrates, Trends Biochem Sci 34:319–323, 2009.
37. Yang ZR, Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks, Bioinformatics 21:1831–1837, 2005.
38. Garay-Malpartida HM, Occhiucci JM, Alves J, Belizário JE, CaSPredictor: A new computer-based tool for caspase substrate prediction, Bioinformatics 21:i169–i176, 2005.
39. Wee LJ, Tan TW, Ranganathan S, SVM-based prediction of caspase substrate cleavage sites, BMC Bioinformatics 7(Suppl 5):S14–S15, 2006.
40. Wee LJ, Tan TW, Ranganathan S, CASVM: Web server for SVM-based prediction of caspase substrates cleavage sites, Bioinformatics 23:3241–3243, 2007.
41. Wee LJ, Tan TW, Ranganathan S, A multi-factor model for caspase degradome prediction, BMC Genomics 10(Suppl 3):S6, 2009.
42. Song J, Tan H, Shen H, Mahmood K, Boyd SE, Webb GI, Akutsu T, Whisstock JC, Cascleave: Towards more accurate prediction of caspase substrate cleavage sites, Bioinformatics 26:752–760, 2010.
43. Piippo M, Lietzen N, Nevalainen OS, Salmi J, Nyman TA, Pripper: Prediction of caspase cleavage sites from whole genomes, BMC Bioinformatics 11:320, 2010.
44. Backes C, Kuentzer J, Lenhof HP, Comtesse N, Meese E, GraBCas: A bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences, Nucleic Acids Res 33:W208–W213, 2005.
45. Barkan DT, Hostetter DR, Mahrus S, Pieper U, Wells JA, Craik CS, Sali A, Prediction of protease substrates using sequence and structure features, Bioinformatics 26:1714–1722, 2010.
46. Vindigni A, Dang QD, Cera ED, Site-specific dissection of substrate recognition by thrombin, Nat Biotechnol 15:891–895, 1997.
47. Krem MM, Rose T, Cera ED, The C-terminal sequence encodes function in serine proteases, J Biol Chem 274:28063–28066, 1999.
48. Page MJ, Macgillivray RTA, Cera ED, Determinants of specificity in coagulation proteases, J Thromb Haemost 3:2401–2408, 2005.
49. Cera ED, Cantwell AM, Determinants of thrombin specificity, Ann NY Acad Sci 936:133–146, 2006.
50. Ng NM, Quinsey NS, Matthews AY, Kaiserman D, Wijeyewickrema LC, Bird PI, Thompson PE, Pike RN, The effects of exosite occupancy on the substrate specificity of thrombin, Arch Biochem Biophys 489:48–54, 2009.

January 28, 2011 15:39 WSPC/185-JBCB

172

S0219720011005288

J. Song et al.

51. Chen CT, Yang EW, Hsu HJ, Sun YK, Hsu WL, Yang AS, Protease substrate site predictors derived from machine learning on multilevel substrate phage display data, Bioinformatics 24:2691–2697, 2008. 52. Hsu HJ, Tsai KC, Sun YK, Chang HJ, Huang YJ, Yu HM, Lin CH, Mao SS, Yang AS, Factor Xa active site substrate specificity with substrate phage display and computational molecular modeling, J Biol Chem 283:12343–12353, 2008. 53. Yang ZR, Thomson R, Hodgman TC, Dry J, Doyle AK, Narayanan A, Wu X, Searching for discrimination rules in proteolytic cleavage site activity using genetic programming with a min-max scoring function, Biosystems 72:159–176, 2003. 54. Thomson R, Hodgman TC, Yang ZR, Doyle AK, Characterizing proteolytic cleavage site activity using bio-basis function neural networks, Bioinformatics 19:1741–1747, 2003. 55. Yang ZR, Thomson R, Bio-basis function neural network for prediction of protease cleavage sites in proteins, IEEE Trans Neural Netw 16:263–274, 2005. 56. Chou KC, A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins, J Biol Chem 268:16938–16948, 1993. 57. Thompson TB, Chou KC, Zheng C, Neural network prediction of the HIV-1 protease cleavage sites, J Theor Biol 177:369–379, 1995. 58. Cai YD, Chou KC, Artificial neural network method for predicting HIV protease cleavage sites in protein, J Protein Chem 17:607–615, 1998. 59. Cai YD, Liu XJ, Xu XB, Chou KC, Support vector machines for predicting HIV protease cleavage sites in protein, J Comput Chem 23:267–274, 2002. 60. Yang ZR, Dalby AR, Qiu J, Mining HIV protease cleavage data using genetic programming with a sum-product function, Bioinformatics 20:3398–3405, 2004. 61. Shen HB, Chou KC, HIVcleave: A web-server for predicting human immunodeficiency virus protease cleavage sites in proteins, Anal Biochem 375:388–390, 2008. 62. 
Yang ZR, Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: A novel method for decisive template selection, Bioinformatics 21:2644–2650, 2005. 63. Yang ZR, A probabilistic peptide machine for predicting C virus protease cleavage sites, IEEE Trans Inf Technol Biomed 11:593–595, 2007. 64. Yang ZR, Predicting hepatitis C virus protease cleavage sites using generalized linear indicator regression models, IEEE Trans Biomed Eng 53:2119–2123, 2006. 65. Schechter I, Berger A, On the size of the active site in proteases. I. Papain, Biochem Biophys Res Commun 27:157–162, 1967. 66. De Cristofaro R, De Candia E, Thrombin domains: Structure, function and interaction with platelet receptors, J Thromb Thrombolysis 15:151–163, 2004. 67. Coughlin SR, Thrombin signaling and protease-activated receptors, Nature 407:258– 264, 2000. 68. Cole ER, Koppel JL, Olwin JH, Multiple specificity of thrombin for synthetic substrates, Nature 213:405–406, 1967. 69. Nicholson DW, Caspase structure, proteolytic substrates, and function during apoptotic cell death, Cell Death Differ 6:1028–1042, 1999. 70. Timmer JC, Salvesen GS, Caspase substrates, Cell Death Differ 14:66–72, 2007. 71. Dennis MS, Eigenbrot C, Skelton NJ, Ultsch MH, Santell L, Dwyer MA, O’Connell MP, Lazarus RA, Peptide exosite inhibitors of factor VIIa as anticoagulants, Nature 404:465–470, 2000. 72. Rawlings ND, A large and accurate collection of peptidase cleavages in the MEROPS database, Database 2009:bap015, 2009.

January 28, 2011 15:39 WSPC/185-JBCB

S0219720011005288

Bioinformatic Approaches for Predicting Substrates of Proteases

173

73. Igarashi Y, Eroshkin A, Gramatikova S, Gramatikoff K, Zhang Y, Smith JW, Osterman AL, Godzik A, CutDB: A proteolytic event database, Nucleic Acids Res 35:D546–D549, 2007. 74. Igarashi Y, Heureux E, Doctor KS, Talwar P, Gramatikova S, Gramatikoff K, Zhang Y, Blinov M, Ibragimova SS, Boyd S, Ratnikov B, Cieplak P, Godzik A, Smith JW, Osterman AL, Eroshkin AM, PMAP: Databases for analyzing proteolytic events and pathways, Nucleic Acids Res 37:D611–D618, 2009. 75. Quesada V, Ord´ on ˜ez GR, S´ anchez LM, Puente XS, L´ opez-Ot´ın C, The Degradome database: Mammalian proteases and diseases of proteolysis, Nucleic Acids Res 37:D239–D243, 2009. 76. L¨ uthi AU, Martin SJ, The CASBAH: A searchable database of caspase substrates, Cell Death Differ 14:641–650, 2007. 77. Li W, Godzik A, Cd-hit a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22:1658–1659, 2006. 78. Rogers S, Wells R, Rechsteiner M, Amino acid sequences common to rapidly degraded proteins: The PEST hypothesis, Science 234:364–368, 1986. 79. Rechsteiner M, Rogers SW, PEST sequences and regulation by proteolysis, Trends Biochem Sci 21:267–271, 1996. 80. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M, AAindex: Amino acid index database, progress report 2008, Nucleic Acids Res 36:D202–D205, 2008. 81. Guo Y, Yu L, Wen Z, Li M, Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences, Nucleic Acids Res 36:3025–3030, 2008. 82. Jones DT, Protein secondary structure prediction based on position-specific scoring matrices, J Mol Biol 292:195–202, 1999. 83. Song J, Burrage K, Yuan Z, Huber T, Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information, BMC Bioinformatics 7:124, 2006. 84. Song J, Burrage K, Predicting residue-wise contact orders in proteins by support vector regression, BMC Bioinformatics 7:425, 2006. 85. 
Song J, Yuan Z, Tan H, Huber T, Burrage K, Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure, Bioinformatics 23:3147–3154, 2007. 86. Song J, Tan H, Takemoto K, Akutsu T, HSEpred: Predict half-sphere exposure from protein sequences, Bioinformatics 24:1489–1497, 2008. 87. Song J, Tan H, Mahmood K, Law RH, Buckle AM, Webb GI, Akutsu T, Whisstock JC, Prodepth: Predict residue depth by support vector regression approach from protein sequences only, PLoS ONE 4:e7072, 2009. 88. Cheng J, Randall A, Sweredoski M, Baldi P, SCRATCH: A protein structure and structural feature prediction server, Nucleic Acids Res 33:W72–W76, 2005. 89. Dyson HJ, Wright PE, Intrinsically unstructured proteins and their functions, Nat Rev Mol Cell Biol 6:197–208, 2005. 90. Lobley A, Swindells MB, Orengo CA, Jones DT, Inferring function using patterns of native disorder in proteins, PLoS Comput Biol 3:e162, 2007. 91. Doszt´ anyi Z, M´esz´ aros B, Simon I, Bioinformatical approaches to characterize intrinsically disordered/unstructured proteins, Brief Bioinform 11:225–243, 2010. 92. Gsponer J, Futschik ME, Teichmann SA, Babu MM, Tight regulation of unstructured proteins: From transcript synthesis to protein degradation, Science 322:1365–1368, 2008.

January 28, 2011 15:39 WSPC/185-JBCB

174

S0219720011005288

J. Song et al.

93. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT, Prediction and functional analysis of native disorder in proteins from the three kingdoms of life, J Mol Biol 337:635–645, 2004. 94. Ishida T, Kinoshita K, PrDOS: Prediction of disordered protein regions from amino acid sequence, Nucleic Acids Res 35:W460–W464, 2007. 95. Thornberry NA, Rano TA, Peterson EP, Rasper DM, Timkey T, Garcia-Calvo M, Houtzager VM, Nordstrom PA, Roy S, Vaillancourt JP, Chapman KT, Nicholson DW, A combinatorial approach defines specificities of members of the caspase family and granzyme B, Functional relationships established for key mediators of apoptosis, J Biol Chem 272:17907–17911, 1997. 96. Vapnik V, Statistical Learning Theory, Wiley, New York, 1998. 97. Vapnik V, The Nature of Statistical Learning Theory, Springer, New York, 2000. 98. Shao J, Xu D, Tsai SN, Wang Y, Ngai SM, Computational identification of protein methylation sites through bi-profile Bayes feature extraction, PLoS ONE 4:e4920, 2009. 99. R¨ ognvaldsson T, You L, Why neural networks should not be used for HIV-1 protease cleavage site prediction, Bioinformatics 20:1702–1709. 100. Matthews BW, Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–451, 1975. 101. Kalita MK, Nandal UK, Pattnaik A, Sivalingam A, Ramasamy G, Kumar M, Raghava GP, Gupta D, CyclinPred: A SVM-based method for predicting cyclin protein sequences, PLoS ONE 3:e2605, 2008. 102. Talanian RV, Quinlan C, Trautz S, Hackett MC, Mankovich JA, Banach D, Ghayur T, Brady KD, Wong WW, Substrate specificities of caspase family proteases, J Biol Chem 272:9677–9682, 1997. 103. Fischer U, J¨ anicke RU, Schulze-Osthoff K, Many cuts to ruin: A comprehensive update of caspase substrates, Cell Death Differ 10:76–100, 2003. 104. Pop C, Salvesen GS, Human caspases: Activation, specificity and regulation, J Biol Chem 284:21777–21781, 2009 105. 
Schneider TD, Stephens RM, Sequence logos: A new way to display consensus sequences, Nucleic Acids Res 18:6097–6100, 1990. 106. Backes BJ, Harris JL, Leonetti FL, Craik CS, Ellman JA, Synthesis of positionalscanning libraries of fluorogenic peptide substrates to define the extended substrate specificity of plasmin and thrombin, Nat Biotechnol 18:187–193, 2000. 107. Gosalia DN, Salisbury CM, Maly DJ, Ellman JA, Diamond SL, Profiling serine protease substrate specificity with solution phase fluorogenic peptide microarrays, Proteomics 5:1292–1298, 2005. 108. Shen Y, Joachimiak A, Rosner MR, Tang WJ, Structures of human insulin-degrading enzyme reveal a new substrate recognition mechanism, Nature 443:870–874, 2006. 109. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O’Donovan C, Redaschi N, Yeh LS, The universal protein resource (UniProt), Nucleic Acids Res 33:D154–D159, 2005. 110. Shen HB, Chou KC, Ensemble classifier for protein fold pattern recognition, Bioinformatics 22:1717–1722, 2006. 111. Asur S, Ucar D, Parthasarathy S, An ensemble framework for clustering proteinprotein interaction networks, Bioinformatics 23:i29–40, 2007. 112. Ishida T, Kinoshita K, Prediction of disordered regions in proteins based on the meta approach, Bioinformatics 24:1344–1348, 2008.

January 28, 2011 15:39 WSPC/185-JBCB

S0219720011005288

Bioinformatic Approaches for Predicting Substrates of Proteases

175

113. Deng L, Guan J, Dong Q, Zhou S, Prediction of protein-protein interaction sites using an ensemble method, BMC Bioinformatics 10:426, 2009. 114. Shen HB, Song JN, Chou KC, Prediction of protein folding rates from primary sequence by fusing multiple sequential features, J Biomed Sci and Eng 2:136–143, 2009. 115. Chakrabarti S, Panchenko AR, Ensemble approach to predict specificity determinants: Benchmarking and validation, BMC Bioinformatics 10:207, 2009. 116. Yanover C, Singh M, Zaslavsky E, M are better than one: An ensemble-based motif finder and its application to regulatory element prediction, Bioinformatics 25:868– 874, 2009.

Jiangning Song received both his B.Eng. degree in Biotechnology and his Ph.D. degree in Bioinformatics from Jiangnan University, China in 2000 and 2005, respectively. From 2005 to 2007, he was a postdoctoral research fellow at the Advanced Computational Modelling Centre (ACMC) of the University of Queensland, Australia. From 2007 to 2009, he was a JSPS Research Fellow at the Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. He is currently an NHMRC Peter Doherty Fellow in the Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Australia. His research interests include structural bioinformatics, integrative systems biology, machine learning, data mining and complex networks. Dr. Song is a member of the International Society for Computational Biology (ISCB), the Japan Society of Bioinformatics (JSBi) and the International Proteolysis Society (IPS).

Hao Tan received his B.Eng. degree in Communication Engineering from Southeast University, China, and his M.Sc. degree in Applied Information Technology from Monash University, Australia, in 2004 and 2008, respectively. He was employed by China Unicom Co. Ltd. from 2004 to 2008 as a software engineer. He is currently a research assistant in the Faculty of Medicine, Monash University, Australia. His research interests are computer software design and bioinformatics applications.

Sarah E. Boyd received a B.Sc. (Hons) degree in Computer Science and Biochemistry (2000) and a Ph.D. degree in Computer Science (2005) from Monash University, Australia. Since 2000, Dr. Boyd has researched computational prediction of protease specificity. In 2006, she worked as a research fellow for the Centre On Proteolytic Pathways, an NIH program grant housed at the Burnham Institute, San Diego, USA. From 2007 to 2009, Dr. Boyd was awarded an Australian Postdoctoral
Fellowship to research cooperativity in protease specificity. In 2010, Dr. Boyd was appointed as a research fellow at La Trobe University, Australia, to research the systems biology of cell signalling. Dr. Boyd is interested in the acceleration of biomedical research and discovery using the systems biology paradigm, which combines experimental and computational research to integrate, model, visualize and exchange large, complex, multi-dimensional datasets.

Hongbin Shen received his Ph.D. degree from Shanghai Jiaotong University, China in 2007. He was a postdoctoral research fellow of Harvard Medical School from 2007 to 2008. Currently, he is a professor in the Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, China. His research interests include data mining and bioinformatics. He is particularly interested in research aimed at predicting protein structure and function, as well as intelligent modeling for complex biological networks. Dr. Shen has published more than 60 papers and constructed 20 bioinformatics servers in these areas and he serves on the editorial boards of several international journals.

Khalid Mahmood is a Ph.D. student at Monash University, Australia. He is supervised by Prof. James Whisstock and Prof. Geoff Webb. His research interests include computational biology with focus on comparative genomics and protein sequence/structure analysis. In particular, he is interested in developing methods for identifying gene orthologs and conserved gene segments in various genomes. Further, he is interested in developing data mining methods to study how protein families evolve.

Geoffrey I. Webb received his Ph.D. from La Trobe University, Australia, in 1987. He held appointments at Griffith University and then Deakin University, where he received a personal chair. He is a professor in the Faculty of Information Technology at Monash University, where he heads the Centre for Research in Intelligent Systems. His primary research areas are machine learning, data mining, and user modelling. He is known for the development of numerous methods, algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, incorporates many techniques from his association discovery research. Many of his learning algorithms are included in the widely-used Weka machine learning workbench. He is editor-in-chief of Data Mining and Knowledge
Discovery, co-editor of the Encyclopedia of Machine Learning (to be published by Springer), a member of the advisory board of Statistical Analysis and Data Mining and a member of the editorial boards of Machine Learning and ACM Transactions on Knowledge Discovery in Data.

Tatsuya Akutsu received both an M.Eng. degree in Aeronautics (1986) and a Dr. Eng. degree in Information Engineering (1989) from the University of Tokyo, Japan. From 1989 to 1994, he was in the Mechanical Engineering Laboratory, Japan. He was an associate professor at Gunma University from 1994 to 1996 and at the Human Genome Center, University of Tokyo, from 1996 to 2001. He joined the Bioinformatics Center of the Institute for Chemical Research, Kyoto University, Japan as a professor in October 2001. His research interests include bioinformatics and discrete algorithms. He is a member of ACM, IEICE, JSAI and JSBi.

James C. Whisstock received his M.Phil. and D.Phil. degrees in Biochemistry from the University of Cambridge, UK in 1992 and 1996, respectively. In 1997 he moved to Monash University to take up a two-year Monash University Faculty of Medicine Fellowship. In 1999 he was awarded an NHMRC Peter Doherty Fellowship, and in 2002 an NHMRC Senior Research Fellowship and Monash University Logan Fellowship. He is currently an ARC Federation Fellow and an honorary NHMRC Principal Research Fellow. Professor Whisstock’s research group is based at Monash University and investigates three broad themes: (1) the structural biology of proteases, serpins and lipid phosphatases, (2) the structural biology of membrane-attack perforin (MACPF) and perforin-like proteins, and (3) a wide range of bioinformatic projects aimed at understanding the structure, function and evolution of large protein families.

Robert N. Pike received a B.Sc. degree in Agriculture and a Ph.D. degree in Biochemistry from the University of Natal, Pietermaritzburg, South Africa in 1987 and 1991, respectively. From 1992 to 1994, he was a postdoctoral fellow in the Department of Biochemistry and Molecular Biology, University of Georgia, USA, following which he was a Foundation for Research and Development Research Fellow at the University of Natal, Pietermaritzburg, South Africa from 1994 to 1995.
From 1996 to 1997, he was a postdoctoral fellow in the Department of Haematology, University of Cambridge, UK. In 1997 he was appointed to a lectureship in the Department of Biochemistry and Molecular Biology at Monash University, Melbourne, Australia, where he is now a professor. Professor Pike’s research work has consistently involved investigations into the interplay between proteolytic enzymes and their inhibitors and receptors.