MIPDB: a relational database dedicated to MIP ... - Wiley Online Library

Research article

Biol. Cell (2005) 97, 535–543 (Printed in Great Britain)

MIPDB: a relational database dedicated to MIP family proteins

Khalid El Karkouri, Hervé Gueuné and Christian Delamarche1

UMR CNRS 6026 Interactions Cellulaires et Moléculaires, Université de Rennes 1, équipe SDM (Structure et Dynamique des Macromolécules), Campus de Beaulieu, Bât. 13, 35042 Rennes Cedex, France

Background information. The MIPs (major intrinsic proteins) constitute a large family of membrane proteins that facilitate the passive transport of water and small neutral solutes across cell membranes. Since water is the most abundant molecule in all living organisms, the discovery of selective water-transporting channels called AQPs (aquaporins) has led to new knowledge on both the physiological and molecular mechanisms of membrane permeability. The MIPs are identified in Archaea, Bacteria and Eukaryota, and the rapid accumulation of new sequences in the database provides an opportunity for large-scale analysis, to identify functional and/or structural signatures or to infer evolutionary relationships. To help perform such an analysis, we have developed MIPDB (database for MIP proteins), a relational database dedicated to members of the MIP family.

Results. MIPDB is a motif-oriented database that integrates data on 785 MIP proteins from more than 200 organisms and contains 230 distinct sequence motifs. MIPDB proposes the classification of MIP proteins into three functional subgroups: AQPs, glycerol-uptake facilitators and aquaglyceroporins. Plant MIPs are classified into three specific subgroups according to their subcellular distribution in the plasma membrane, tonoplast or the symbiosome membrane. Some motifs of the database are highly selective and can be used to predict the transport function or subcellular localization of unknown MIP proteins.

Conclusions. MIPDB offers a user-friendly and intuitive interface for rapid and easy access to MIP resources and to sequence analysis tools. MIPDB is a web application, publicly accessible at http://idefix.univ-rennes1.fr:8080/Prot/index.html.

1 To whom correspondence should be addressed (email [email protected]).

Key words: aquaglyceroporin, database, functional prediction, motif, sequence motif.

Abbreviations used: AQP, aquaporin; AQPe, AQP of eukaryota; AQPp, AQP of prokaryota; GLA, aquaglyceroporin; GLAe, GLA of eukaryota; GLAp, GLA of prokaryota; GLP, glycerol-uptake facilitator; GLPp, GLP of prokaryota; MIP, major intrinsic protein; NIP, nodulin-like intrinsic protein; PIP, plasma-membrane intrinsic protein; SRS, sequence retrieval system; SQL, structured query language; TIP, tonoplast-membrane intrinsic protein; TM, transmembrane helix.

www.biolcell.org | Volume 97 (7) | Pages 535–543 | © Portland Press 2005

Introduction

The proteins of the MIP (major intrinsic protein) family permit the bidirectional exchange of water and small neutral solutes across cell membranes. MIP proteins can be classified into three major functional subgroups: AQPs (aquaporins), which are water diffusion channels; GLPs (glycerol-uptake facilitators), which are permeable to glycerol or small uncharged molecules; and GLAs (aquaglyceroporins), which show a mixed permeability (Thomas et al., 2002). Considering that water is a major component of all living cells, the discovery of AQPs, reported by Preston et al. (1992), kindled considerable interest in the scientific community and raised the question of the implication of these proteins in human diseases (Agre et al., 2002).

To understand their unique permeability properties, MIPs have been subjected to many genetic, biochemical and physiological studies. All MIP homologues share a common topology consisting of six membrane-spanning helices [TM1 (transmembrane helix 1) to TM6] and five connecting loops (loops A–E), with both the N- and C-terminal ends located on the cytoplasmic side of the membrane (Figure 1A). The high-resolution X-ray structures of GLP, AQP1 and AQPz (Fu et al., 2000; Sui et al., 2001; Savage et al., 2003) revealed a homotetrameric organization. Each monomer is a functional pore formed by six membrane-spanning helices and by the ‘hourglass’ (Jung et al., 1994; Agre et al., 1998). This structure consists of two short helices from loops B and E, each containing a highly conserved NPA (Asn-Pro-Ala) motif, which plunge symmetrically into the membrane and overlap in the narrow part of the pore. It has been shown that the nature of the residues that form the internal constriction of each monomer is important for substrate selection (Fu et al., 2000; Sui et al., 2001; Thomas et al., 2002; Beitz et al., 2004). Other structural parameters, such as oligomeric state, allostery and heteromerization, could also be involved in the physiological properties of some MIPs, but this hypothesis needs further experimentation (Lagrée et al., 1999; Thomas et al., 2002; Fetter et al., 2004; Hill et al., 2004).

Figure 1 Graphic illustration of the MIPDB interface. (A) The MIPDB home page. (B, C) Examples of ‘Taxonomy’ and ‘Motif’ data available in the database.

In contrast with laborious and expensive bench experiments, computational approaches using sequence analyses are of particular relevance for the identification of characteristic residues in protein families. Key functional residues are conserved across the proteins sharing the same function and they are found in the same three-dimensional space although they can be dispersed along the sequence. Since MIPs are present in almost all living organisms, they have been widely used to study evolutionary and functional relationships. Major findings are that the MIP family arose by an intragenic duplication event followed by functional and structural diversification (Pao et al., 1991; Park and Saier, 1996; Zardoya and Villalba, 2001). Furthermore, plant GLPs [NIPs (nodulin-like intrinsic proteins)] may result from a horizontal gene transfer from bacteria (Zardoya et al., 2002). Conserved positions in multiple sequence alignments revealed residues of particular importance for the characterization of MIP subfamilies. Froger et al. (1998) identified five key positions (P1–P5) that could play a functional role in the MIP proteins, but the rule, established on a limited number of sequences including 26 orthodox AQPs plus 14 glycerol and mixed channels, has been questioned in some cases (Santoni et al., 2000; Chaumont et al., 2001; Schuurmans et al., 2003). For example, the functional importance of positions P4–P5, demonstrated for the insect AQP known as AQPcic, has not been observed for AqpZ and GLP of Escherichia coli (Borgnia and Agre, 2001). In 2000, Johansson et al. refined the rule for positions P4–P5 in plant GLAs, while Heymann and Engel pointed out combinations of residues distinguishing the AQP and GLP subgroups (Heymann and Engel, 2000; Johansson et al., 2000). However, we believe that a nomenclature of MIPs based on two main categories is too limited to represent the large array of functions and distributions of these proteins. For example, in the studies cited above, the two MIP proteins GLP and GlaLac were grouped under the same cluster even though GLP is much less permeable to water than GlaLac. It is clear that an in silico prediction of the exact roles of conserved motifs in the MIP proteins (structure, function, oligomerization, targeting, evolution etc.) requires the analysis of a great number of well-characterized data, available for the entire scientific community.

A large number of MIP sequences are available from the public UniProt (Universal Protein Resource) database, which collects all protein sequence data reported worldwide (Apweiler et al., 2004), and from the INTERPRO (integrated resource of protein domains and functional sites) database that amalgamates major signature databases such as PROSITE (database of protein families and domains) and Pfam (Falquet et al., 2002; Mulder et al., 2003). The entry IPR000425 of INTERPRO (release 7.2) matches 816 proteins for the MIP family, but contains ‘false-positives’. For example, the restriction endonuclease (O52712), the phosphofructokinase (Q9NKN9) and the NADH dehydrogenase (Q94TA5) are wrongly included in the MIP entry. This happens when some motifs collected in the databases have a weak specificity. Such erroneous annotations corrupt the information content of public databases and might lead to systematic errors in sequence analysis. As a result, the extraction of a set of valid sequences on the basis of particular user interests, for a sequence analysis purpose, requires multisource queries and validations that can be complex and time-consuming.

To facilitate extensive mining of the available data on MIP proteins, we have created MIPDB (database for MIP proteins), a multiuser and relational database dedicated to the analysis of sequence motifs. A user-friendly and interactive web interface allows access to MIP data and makes it possible to build sequence files, with direct access to the latest bioinformatic tools such as the JalView alignment editor (Clamp et al., 2004). An SQL (structured query language) query window allows one to extract information not directly available from the interface. This first version of MIPDB includes data regarding 785 members of the MIP family from 216 organisms. MIPs are classified into three functional groups (AQP, GLP and GLA), eight physiological subgroups and five taxonomic clusters. We validated 230 motifs by comparing their occurrence in all ‘true-positive’ and ‘false-positive’ sequences, making it possible to judge the selectivity and sensitivity of each pattern. Although the main aim of the present study is the presentation of MIPDB resources, we also propose some highly specific motifs that can help in the functional/physiological prediction of new MIP sequences. The methods used to create MIPDB can be used to develop specialized databases for other protein families.

Results and discussion

MIPDB is a new specialized database dedicated to the biological classification of MIP proteins and the analysis of their sequences. This first version contains a comprehensive classification of 785 MIP proteins from 216 different organisms. The MIPDB application is publicly accessible at http://idefix.univ-rennes1.fr:8080/Prot/index.html (Figure 1A). The interface provides tools for similarity searches, construction and editing of multiple alignments, phylogeny, motif discovery and prediction of TMs. The MIPDB interface displays four main items: ‘Statistics’, ‘Taxonomy’, ‘Motifs’ and ‘SQL query’.

Figure 2 Diagram followed for the annotation of MIP sequences and for motif extraction, annotation and validation For each group, the number of sequences is indicated in parentheses. Details of the selection steps are given in the Materials and methods section.

Statistics, selection and annotation

The data in MIPDB were collected from the public knowledgebase UniProt and were validated and annotated by us (Figure 2). The UniProt entries contain information on the amino acid sequences and functions of proteins. In UniProt, the identification (ID) and description (DE) lines reflect the functional properties of the protein if these have been experimentally verified, as for ‘AQP1_HUMAN’ or ‘GLP_ECOLI’. Such an acronym is useful for later database searches, but it applies to only a minority of MIP entries in UniProt (157/785). Indeed, in the majority of cases, the function is predicted by sequence similarity, so that the entries are identified with alphanumeric characters and described with laconic words such as ‘hypothetic’, ‘unknown’, ‘potential’, ‘probable’ or ‘putative’. In the present study, we used a combination of bioinformatic tools and methods (BLASTP, CLUSTAL W, MEGA3, Pratt and TMHMM) to infer new knowledge on MIP proteins and, in particular, to assign each protein to a biological group. We found 126 entries that are wrongly cross-indexed in InterPro (IPR000425) and Prosite (PS00221) as members of the MIP family. These entries are clearly labelled as ‘false-positive’ in MIPDB and can easily be extracted as a FASTA format file for further analysis. To infer biological classifications, a data set of 397 ‘true-positive’ MIPs was carefully selected and analysed with MatGAT (Campanella et al., 2003) and CLUSTAL W (Thompson et al., 1994). CLUSTAL W builds a guide tree that reflects the similarities between sequences using pairwise alignment distances. Observation of the guide tree (Figure 3) led us to predict that the most similar sequences, corresponding to each cluster in the tree, probably have related functions. Our classification can therefore be used to predict the function of MIP proteins but not to determine their evolutionary origin. Eight major groups, including 378 ‘true-positive’ sequences, were biologically annotated as follows (Figures 2 and 3): PIP (plasma-membrane intrinsic protein), AQPe (AQP of eukaryota), AQPp (AQP of prokaryota), TIP (tonoplast-membrane intrinsic protein), NIP, GLPp (GLP of prokaryota), GLAe (GLA of eukaryota) and GLAp (GLA of prokaryota). Biological annotation of each MIP protein is available in MIPDB through the menus ‘Statistics’ and ‘Taxonomy’.


Figure 3 A sequence comparison of 397 MIP proteins, showing the eight functional/biological groups distinguished in the present study (shown with colour backgrounds) Sequences corresponding to uncoloured branches were not included in the sequence analysis. For simplicity, accession numbers are not indicated. The tree was generated with the program MEGA3 from the ‘.dnd’ file of CLUSTAL W.

Generally, MIPs of prokaryotes and eukaryotes (except plants) are classified into two monophyletic groups (AQP and GLP) as inferred from phylogenetic studies. In such a classification, GLP includes all MIPs permeable to small uncharged molecules without considering the capacity to transport water. However, it is well known from experimental studies that some members of the MIP family, including AQP3, AQP7, AQP9 and GLA LAC, are highly permeable to water, in comparison with true GLPs. For this reason, in a previous study, we proposed a ‘functional classification’ of MIPs into three major groups, AQP, GLP and GLA (Thomas et al., 2002). Although this division is still quite simple, Figure 3 supports such a classification: the GLAp group, including GLAs from Lactococcus lactis and Bacillus subtilis (Froger et al., 2001), is clearly distinct from the GLPp group. Intriguingly, Figure 3 reveals that GLAe and GLPp have a common node, a distribution that would be interesting to interpret. According to phylogenetic comparisons (Zardoya et al., 2002), plant MIPs are separated into three groups (PIP, TIP and NIP), suggesting that it should be possible to highlight some motifs specific for subcellular localization.

In Figure 3, a few branches comprise one to three sequences. They represent a total of 19 divergent sequences that were not used to extract specific motifs, because the information in these sequences is too limited. Three of them, Q18352, Q18469 and Q20985, from Caenorhabditis elegans are positioned between the AQPe and AQPp groups. The separation of these MIPs from those of other eukaryotes might reflect specialization in their function or localization or a separate evolutionary origin. A total of 16 other divergent sequences are distributed between the TIP and NIP groups: AQP8_HUMAN, AQP8_RAT, AQP8_MOUSE, Q8TFE0 and Q873M6 from mammals; AQY1_YEAST, O74680, Q9C411, O93938 and Q8SRK2 from fungi; and Q9W1M4, Q8T0N8, Q9W1M1, Q9W1M2, Q8MLR2 and Q9GRH9 from insects. In agreement with recent phylogenetic and experimental analyses (Zardoya and Villalba, 2001; Ferri et al., 2003), Figure 3 suggests that AQP8 may have physiological properties that deserve particular interest and further investigation.

Biological annotation of the remaining 377 ‘true-positive’ MIPs was predicted by comparison (BLAST similarity) with the training set of the 397 sequences. In contrast with the training data set, incomplete sequences are included in this group. To label the sequences of this second data set unambiguously, a specific nomenclature was used: ‘PIP-like’, ‘AQPe-like’, ‘AQPp-like’, ‘TIP-like’, ‘NIP-like’, ‘GLPp-like’, ‘GLAe-like’ and ‘GLAp-like’. All the sequences (19 + 40) excluded from the two data sets used for biological predictions (397 + 377 ‘true-positives’) were annotated as ‘Others’ (Figure 2).

Taxonomy

The data in the ‘Taxonomy’ menu are organized into four submenus: ‘All’, ‘Archaea’, ‘Bacteria’ and ‘Eukaryota’. For each submenu, taxonomic subgroups are displayed on the left of the interface. Brief help text is also included on the interface. Taxonomy, organism, identifier (with a link to the UniProt flat file), accession, description, MIP groups, sequence length and gene name are presented in a comparative table (Figure 1B), from which researchers can perform four main operations. First, any information can be retrieved using the dialog box of the browser. Secondly, a checkbox allows one to select sequences, which are then displayed in a new interface in FASTA format. The sequences can be analysed using a list of popular bioinformatic tools. Thirdly, users can also obtain (in another window) flat files, features (F), comments (C), references (R) and motifs for each MIP, by clicking the corresponding HTML links. However, ‘Features and Comments’ are poorly documented in the present version of MIPDB. Finally, users can BLAST any MIP sequence against the UniProt database by clicking on its accession number.

Motifs

MIPDB currently proposes several motifs that can help in functional/structural predictions, in phylogeny and in the biology of organisms. In practice, we extracted hundreds of motifs and screened them against the two defined data sets (false and true), allowing a final selection of 230 representative motifs. Motifs were obtained from 13 groups (Figure 2) corresponding to 397 full-length sequences: eight functional groups (PIP, AQPe, AQPp, TIP, NIP, GLPp, GLAe and GLAp) and five taxonomic groups (Fc, Fg, Mz, Pb and Vp) corresponding to super kingdoms (firmicutes, fungi, metazoa, proteobacteria and viridiplantae). Each motif is named and numbered according to its group of origin. The results are displayed in the ‘Motif’ menu, including three submenus called ‘MIP groups’, ‘Taxonomy’ and ‘Search by’. For each submenu, specific help is included in the interface. The user can display the motifs corresponding to an MIP identifier, a motif group, a sequence segment or a combination of these items (see, e.g. Figure 1C).

We assumed that a motif is highly specific to MIP proteins if it does not match any sequence of the 126 ‘false-positives’. Within the 230 motifs extracted, only seven match the ‘false-positive’ MIPs: the worst motif (Pb 11) recognizes eight of the 126 false MIPs; Fc 3 recognizes four false MIPs; Fc 10 recognizes two false MIPs; and Pb 5, Vp 8, Mz 4, Pb 12 and GLAp 7 recognize one false MIP. We attempted to screen public databases with the 223 specific motifs to predict new MIPs. Motif quality can also be measured by the capacity to recognize a specific group among all other MIP groups. These motifs should highlight key residues and domains involved in structure, function, targeting etc. For example, PIP 8 {W-[IV]-[FY]-W-[LV]-G-P-[FILM]-x-G-x(0,1)-A-x-[AILV]-A-x(1,2)-Y-x(3)-[IV]-[IL]-R-[AGN]} is localized in segment C (see the Materials and methods section) after the second NPA box. It matches all 112 PIPs among the 397 sequences, but none of the sequences of the seven other biological MIP groups. Other examples that have 100% specificity for their group among the 397 tested sequences are: AQPe 3, AQPp 10, GLPp 1, GLAe 11, GLAp 1, TIP 1 and NIP 1. Interestingly, some motifs extracted from a single segment (A, B or C) can help to predict an MIP function from a partial sequence. For example, the motif AQPe 3, which is extracted from segment B, matches only eukaryotic AQP MIPs.

Moreover, some motifs described in MIPDB have remarkable properties. The motif Pb 3 {A-x(1,2)-G-x(3)-G-x(3)-[AGN]-x-A-[ILMV]-[NST]-x-[AGS]-x(3)-[AG]-[GP]-[KR]-x(2)-[APST]-x(2)-[AILV]-[AGILV]-x(4)-[AL]} matches 37 of the 44 proteobacteria sequences of the training data set. Interestingly, there is a correlation between the position of this motif within the sequences and the function: in AQPp sequences, the motif is positioned in the first NPA box (around position 50), and in GLPp sequences, it is in the second NPA box (around position 190). The motif Pb 5 {S-[GI]-[AGPT]-x(2)-N-[PT]-x-V-[NST]-[ILP]-A} has the same characteristic, but in the opposite order: it matches the first NPA box of GLPp MIPs and the second NPA box of AQPp MIPs. To our knowledge, this is the first time that such a correlation has been described in MIPs, but its significance remains to be investigated.
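As a rough illustration of how PROSITE-style motifs such as Pb 5 could be screened against sets of sequences, the sketch below converts a pattern into a Python regular expression and counts how many sequences it matches. The function names and the short toy sequences are hypothetical and are not taken from MIPDB. Only the subset of the PROSITE syntax used by the motifs quoted above (single residues, bracketed alternatives, x and repeat counts) is handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset of the syntax) into a regex string."""
    tokens = []
    for element in pattern.strip("{}").split("-"):
        # An element is x, one residue or a bracketed set, optionally followed
        # by a repeat count such as (3) or (0,1).
        match = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        token = "." if core == "x" else core
        if low is not None:
            token += f"{{{low},{high}}}" if high is not None else f"{{{low}}}"
        tokens.append(token)
    return "".join(tokens)


def count_matches(pattern, sequences):
    """Return how many sequences contain at least one match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return sum(1 for seq in sequences if regex.search(seq))


# Motif Pb 5 as quoted in the text.
PB_5 = "S-[GI]-[AGPT]-x(2)-N-[PT]-x-V-[NST]-[ILP]-A"

# Hypothetical toy sequences, invented for this example only.
toy_true = ["MKKSGALLNPAVTIAGGG", "MSIAWWNTAVNLAKK"]
toy_false = ["MKKLLLDDDEEEKKK"]

print(prosite_to_regex(PB_5))
print(count_matches(PB_5, toy_true), "of", len(toy_true), "toy true sequences matched")
print(count_matches(PB_5, toy_false), "of", len(toy_false), "toy false sequences matched")
```

Run against real ‘true-positive’ and ‘false-positive’ sets, the two counts would give the kind of sensitivity and specificity figures discussed above. Exclusion sets and anchors, which the full PROSITE syntax also allows, would need extra handling.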

SQL queries

The MIPDB interface allows users to consult the MIP data easily through preformulated SQL queries, using HTML links or search forms (see the ‘Statistics’, ‘Taxonomy’ and ‘Motifs’ links). These pages display the data that we considered to be of primary interest to any MIP researcher, without requiring the user to write a query. Since preformulated queries restrict what can be asked and which data can be reached, MIPDB also offers interactive access to data by typing simple or complex SQL queries on a search form accessible through the link ‘SQL query’. An example of a simple request is: which species have ‘true-positive’ MIP sequences longer than 600 amino acids? This translates into SQL as:

select ORGANISM, SEQ_LENGTH from PROTEIN_SEQUENCE where SEQ_LENGTH > 600 and FRAGMENT_FROM NOT LIKE 'FALSE POSITIVES'

This request gives eight hits, including three fungi, four insects and one plant. SQL help is also provided to assist both experienced users and beginners in writing SQL queries.

Concluding remarks

To our knowledge, MIPDB represents the first specialized database dedicated to MIP family proteins. It is a motif-oriented database that should help researchers to study MIPs at the biological, structural and functional levels (in vitro, in vivo and in silico). In comparison with large non-specialized biological databases, MIPDB not only facilitates access to curated MIP data but also avoids any mismatch with the data from other protein families. Highly specific motifs, available in MIPDB, are well-suited for searching entire proteomes and predicting a biological function for new MIP proteins, thus contrasting with PROSITE patterns that select ‘false positives’ and/or miss known ‘true positives’ (Via and Helmer-Citterich, 2004). MIPDB differs from other databases in that it facilitates the downloading of single or multiple sequences according to taxonomy and/or motif criteria. In addition, MIPDB provides sequence alignment files that can be used for diverse studies such as phylogenetic analysis. We suggest that researchers who are not familiar with the SQL language send their queries to the Data Base Administrator (see Home page) for implementation or optimization. MIPDB has been designed to improve MIP annotations by the integration of new data from experimental investigations, sequences of variants, bioinformatic analyses (hydrophobicity profiles, phylogenies and modelling) and from other public protein databases.

Materials and methods

MIPDB data source

MIPDB integrates public data originating from the Universal Protein Knowledgebase (UniProt release 1.11; 15 June 2004), which is a consortium between SwissProt, TrEMBL and PIR (Protein Information Resource; http://www.expasy.uniprot.org/) (Apweiler et al., 2004). A combination of two MIP family identifiers using the Boolean operator ‘OR’ ([uniprot-DBxref:PS00221*] | [uniprot-DBxref:IPR000425*]) at the SRS (sequence retrieval system) server (http://srs.ebi.ac.uk/) retrieved 911 entries. We downloaded these 911 UniProt entries as SRS XML (eXtensible Markup Language) DTD (document-type definition) files. The XML format, in contrast with flat files, became the standard since it facilitated the exchange of data available on distant biological database servers (Achard et al., 2001; Chen et al., 2002; Güler et al., 2003; Ren et al., 2004).

Sequence selection and annotation

The 911 sequences of UniProt entries were subjected to three selections as described in Figure 2. The first selection led us to characterize true and false MIPs. The true-positive sequences were retained when they met three criteria: (i) the annotation ‘MIP family’ or ‘MIP/AQP’ in the field ‘Description’ (DE line) of the UniProt entry; (ii) a significant similarity score (E < 0.001) with members of the data set of 911 sequences using BLASTP (Altschul et al., 1997); and (iii) a secondary structure profile compatible with the topology of MIP proteins using TMHMM (Krogh et al., 2001). All sequences that did not match these criteria were assigned to the false-positive group, which was used in a subsequent step to evaluate the specificity of motifs. The second selection rejected the sequences including undetermined amino acids (B, X and Z), because these letters cannot be used with motif extraction tools. In the third selection, the 774 ‘true-positive’ sequences were distributed into two data sets that differ in the ‘quality’ of the data. The sequences were thoroughly examined by pairwise alignments (MatGAT; Campanella et al., 2003) and secondary structure analysis (TMHMM). A high-quality training set of 397 full-length sequences, attributable to five taxonomic groups, was retained for motif extractions. These sequences had a methionine residue at the N-terminal end and were 215–350 amino acids in length. We discarded both highly and poorly similar sequences to avoid a bias that would affect the quality of the alignments. For example, SIPs (small basic intrinsic proteins) were rejected because of their high divergence compared with other MIPs (Johanson and Gustavsson, 2002). The Archaea MIPs were also excluded at this step. Sequence data were aligned using CLUSTAL W (http://www.ebi.ac.uk/clustalw/) and the resulting alignment was refined by eye. Eight major biological subfamilies were identified from the tree (Figure 3). We verified that proteins with known function were correctly distributed in the tree branches. For each of the 377 remaining sequences, we performed pairwise alignments (BLASTP with default parameters) against the database of 397 full-length sequences. A biological annotation (with the suffix ‘-like’) was attributed when the two best hits were associated with the same function within the training set and E < 10⁻¹⁰. The data set of 377 sequences was used for the validation of the extracted motifs.

Motif extraction and validation

The global training set of 397 sequences, the eight biological subfamilies and the five taxonomic groups were submitted to two different methods to identify sequence motifs. (i) Multiple sequence alignments were built using CLUSTAL W (default parameters, http://www.ebi.ac.uk/clustalw/), followed by a selection of conserved positions using the CGD program (Delamarche, 2000). Two alphabets were used to group the amino acids by similar properties as follows: (D, E), (H, R, K), (C, S, T), (N, Q), M, (A, I, L, V), P, G, (F, W, Y) and (D, E), (H, R, K), (C, M), (N, Q), (S, T), (A, G, I, L, V), P, (F, W, Y) (Li et al., 2003). The reduced alphabet used to express a motif gives the ability to detect distantly related proteins (Delamarche, 2000). (ii) Motif extraction was also performed on unaligned sequences using Pratt (Jonassen et al., 1995; http://www.ebi.ac.uk/pratt/). The C% parameter was adjusted to report patterns matching at least 90% of the input sequences.

The motifs are expressed in the PROSITE language (Bucher and Bairoch, 1994). Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets. The symbol ‘x’ is used for a position where any residue is accepted. Each motif was tested against the 774 ‘true-positive’ MIPs (TPs) and the 126 ‘false-positives’ (FPs) using the ‘PATTERN MATCHING’ web application (http://idefix.univ-rennes1.fr:8080/PatternDiscovery/). In the database, the motifs are identified with the name of their biological group and a number (e.g. GLP 2). A three-letter code was used to annotate some short motifs within the sequences. We considered that the NPA boxes divide an MIP sequence into three segments, A, B and C, from the N-terminus to the C-terminus. A motif annotated ‘A’, ‘B’ or ‘C’ is strictly localized in the corresponding segment without overlapping the NPA triplet. A motif annotated ‘D’ overlaps sequence regions AB, BC or ABC, including at least one NPA box.

Implementation of MIPDB

MIPDB is a relational database (Codd, 1972; Ren et al., 2004) running on the MySQL v3.23.52 (http://www.mysql.com) RDBMS (relational database management system). The MIPDB user interface was implemented with the scripting language PERL, the CGI (common gateway interface) and the APACHE server (http://www.apache.org). The database can be browsed with Internet Explorer v4 (and higher) on a monitor of 17 or 21 inches.
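The example request from the ‘SQL queries’ section can be tried on a small stand-in table. The sketch below uses Python's built-in sqlite3 module rather than MySQL. The three-column table layout, the underscore spelling of the column names and all rows are assumptions made for illustration; they are not the actual MIPDB schema or data.

```python
import sqlite3

# Hypothetical rows (organism, sequence length, data set label); not MIPDB data.
ROWS = [
    ("Organism A", 650, "TRUE POSITIVES"),
    ("Organism B", 700, "FALSE POSITIVES"),
    ("Organism C", 280, "TRUE POSITIVES"),
]


def long_true_positives(rows, min_length=600):
    """Return (organism, length) pairs for entries that are not labelled
    as false positives and are longer than min_length residues."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed table layout, modelled on the names used in the example request.
        conn.execute(
            "CREATE TABLE PROTEIN_SEQUENCE "
            "(ORGANISM TEXT, SEQ_LENGTH INTEGER, FRAGMENT_FROM TEXT)"
        )
        conn.executemany("INSERT INTO PROTEIN_SEQUENCE VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT ORGANISM, SEQ_LENGTH FROM PROTEIN_SEQUENCE "
            "WHERE SEQ_LENGTH > ? AND FRAGMENT_FROM NOT LIKE 'FALSE POSITIVES' "
            "ORDER BY ORGANISM",
            (min_length,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(long_true_positives(ROWS))
```

With these toy rows, only the 650-residue entry passes both conditions. Passing the length threshold as a bound parameter, instead of pasting it into the query string, is a general precaution for web-facing query forms and is not something the paper describes.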

Acknowledgements We gratefully acknowledge support from CRITT Sant´e Bretagne (Programme de recherche d’int´erˆet r´egional A3CA39) and CNRS (Programme prot´eomique et g´enie des prot´eines). References Achard, F., Vaysseix, G. and Barillot, E. (2001) XML, bioinformatics and data integration. Bioinformatics 17, 115–125 Agre, P., Bonhivers, M. and Borgnia, M.J. (1998) The aquaporins, blueprints for cellular plumbing systems. J. Biol. Chem. 273, 14659–14662 Agre, P., King, L.S., Yasui, M., Guggino, W.B., Ottersen, O.P., Fujiyoshi, Y., Engel, A. and Nielsen, S. (2002) Aquaporin water channels – from atomic structure to clinical medicine. J. Physiol. (Cambridge, U.K.) 542, 3–16 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, D115–D119

542

 C

Portland Press 2005 | www.biolcell.org

Beitz, E., Pavlovic-Djuranovic, S., Yasui, M., Agre, P. and Schultz, J.E. (2004) Molecular dissection of water and glycerol permeability of the aquaglyceroporin from Plasmodium falciparum by mutational analysis. Proc. Natl. Acad. Sci. U.S.A. 101, 1153–1158
Borgnia, M.J. and Agre, P. (2001) Reconstitution and functional comparison of purified GlpF and AqpZ, the glycerol and water channels from Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 98, 2888–2893
Bucher, P. and Bairoch, A. (1994) A generalized profile syntax for biomolecular sequences motifs and its function in automatic sequence interpretation. In ISMB-94, Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology (Altman, R., Brutlag, D., Karp, P., Lathrop, R. and Searls, D., eds), pp. 53–61, AAAI Press, Menlo Park, CA
Campanella, J.J., Bitincka, L. and Smalley, J. (2003) MatGAT: an application that generates similarity/identity matrices using protein or DNA sequences. BMC Bioinformatics 4, 29–33
Chaumont, F., Barrieu, F., Wojcik, E., Chrispeels, M.J. and Jung, R. (2001) Aquaporins constitute a large and highly divergent protein family in maize. Plant Physiol. 125, 1206–1215
Chen, X., Lin, Y., Liu, M. and Gilson, M.K. (2002) The binding database: data management and interface design. Bioinformatics 18, 130–139
Clamp, M., Cuff, J., Searle, S.M. and Barton, G.J. (2004) The Jalview Java Alignment editor. Bioinformatics 20, 426–427
Codd, E.F. (1972) Further normalisation of the data base relational model. In Data Base Systems (Rustin, R., ed.), pp. 33–64, Prentice-Hall, Englewood Cliffs, NJ
Delamarche, C. (2000) Color and graphic display (CGD): programs for multiple sequence alignment analysis in spreadsheet software. BioTechniques 29, 100–107
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C.J.A., Hofmann, K. and Bairoch, A. (2002) The Prosite database, its status in 2002. Nucleic Acids Res. 30, 235–238
Ferri, D., Mazzone, A., Liquori, G.E., Cassano, G., Svelto, M. and Calamita, G. (2003) Ontogeny, distribution, and possible functional implications of an unusual aquaporin, AQP8, in mouse liver. Hepatology 38, 947–957
Fetter, K., Van Wilder, V., Moshelion, M. and Chaumont, F. (2004) Interactions between plasma membrane aquaporins modulate their water channel activity. Plant Cell 16, 215–228
Froger, A., Thomas, D. and Delamarche, C. (1998) Prediction of functional residues in water channels and related proteins. Protein Sci. 7, 1458–1468
Froger, A., Rolland, J.-P., Bron, P., Lagrée, V., Le Cahérec, F., Deschamps, S., Hubert, J.-F., Pellerin, I., Thomas, D. and Delamarche, C. (2001) Functional characterisation of a microbial aquaglyceroporin. Microbiology 147, 1129–1135
Fu, D., Libson, A., Miercke, L.J.W., Weitzman, C., Nollert, P., Krucinski, J. and Stroud, R. (2000) Structure of a glycerol conducting channel and the basis of its selectivity. Science 290, 481–486
Güler, S., Eberhart, A. and Rojas, I. (2003) Web-based exchange of biochemical information. Bioinformatics 19, 1730–1731
Heymann, J.B. and Engel, A. (2000) Structural clues in the sequences of the aquaporins. J. Mol. Biol. 295, 1039–1053
Hill, A.E., Shachar-Hill, B. and Shachar-Hill, Y. (2004) What are aquaporins for? J. Membr. Biol. 197, 1–32
Johanson, U. and Gustavsson, S. (2002) A new subfamily of major intrinsic proteins in plants. Mol. Biol. Evol. 19, 456–461
Johansson, I., Karlsson, M., Johanson, U., Larsson, C. and Kjellbom, P. (2000) The role of aquaporins in cellular and whole plant water balance. Biochim. Biophys. Acta 1465, 324–342
Jonassen, I., Collins, J.F. and Higgins, D.G. (1995) Finding flexible patterns in unaligned protein sequences. Protein Sci. 4, 1587–1595


Jung, J.S., Preston, G.M., Smith, B.L., Guggino, W.B. and Agre, P. (1994) Molecular structure of the water channel through aquaporin CHIP. The hourglass model. J. Biol. Chem. 269, 14648–14654
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580
Lagrée, V., Froger, A., Deschamps, S., Hubert, J.F., Delamarche, C., Bonnec, G., Thomas, D., Gouranton, J. and Pellerin, I. (1999) Switch from an aquaporin to a glycerol channel by two amino acids substitution. J. Biol. Chem. 274, 6817–6819
Li, T., Fan, K., Wang, J. and Wang, W. (2003) Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31, 315–318
Pao, G.M., Wu, L.F., Johnson, K.D., Hofte, H., Chrispeels, M.J., Sweet, G., Sandal, N.N. and Saier, Jr, M.H. (1991) Evolution of the MIP family of integral membrane transport proteins. Mol. Microbiol. 5, 33–37
Park, J.H. and Saier, Jr, M.H. (1996) Phylogenetic characterization of the MIP family of transmembrane channel proteins. J. Membr. Biol. 153, 171–180
Preston, G.M., Carroll, T.P., Guggino, W.B. and Agre, P. (1992) Appearance of water channels in Xenopus oocytes expressing red cell CHIP28 protein. Science 256, 385–387
Ren, Q., Kang, K.H. and Paulsen, I.T. (2004) TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res. 32, D284–D288

Santoni, V., Gerbeau, P., Javot, H. and Maurel, C. (2000) The high diversity of aquaporins reveals novel facets of plant membrane functions. Curr. Opin. Plant Biol. 3, 476–481
Savage, D.F., Egea, P.F., Robles-Colmenares, Y., O'Connell, III, J.D. and Stroud, R.M. (2003) Architecture and selectivity in aquaporins: 2.5 Å X-ray structure of aquaporin Z. PLoS Biol. 1, E72
Schuurmans, J.A., van Dongen, J.T., Rutjens, B.P., Boonman, A., Pieterse, C.M. and Borstlap, A.C. (2003) Members of the aquaporin family in the developing pea seed coat include representatives of the PIP, TIP, and NIP subfamilies. Plant Mol. Biol. 53, 633–645
Sui, H., Han, B.G., Lee, J.K., Walian, P. and Jap, B.K. (2001) Structural basis of water-specific transport through the AQP1 water channel. Nature (London) 414, 872–878
Thomas, D., Bron, P., Ranchy, G., Duchesne, L., Cavalier, A., Rolland, J.-P., Raguénès-Nicol, C., Hubert, J.-F., Haase, W. and Delamarche, C. (2002) Aquaglyceroporins, one channel for two molecules. Biochim. Biophys. Acta 1555, 181–186
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680
Via, A. and Helmer-Citterich, M. (2004) A structural study for the optimisation of functional motifs encoded in protein sequences. BMC Bioinformatics 5, 50
Zardoya, R. and Villalba, S. (2001) A phylogenetic framework for the aquaporin family in eukaryotes. J. Mol. Evol. 52, 391–404
Zardoya, R., Ding, X., Kitagawa, Y. and Chrispeels, M.J. (2002) Origin of plant glycerol transporters by horizontal gene transfer and functional recruitment. Proc. Natl. Acad. Sci. U.S.A. 99, 14893–14896

Received 3 September 2004/23 October 2004; accepted 23 October 2004 Published as Immediate Publication 25 April 2005, DOI 10.1042/BC20040123

www.biolcell.org | Volume 97 (7) | Pages 535–543
