J. Appl. Genet. 43(3), 2002, pp. 269-278

Review article

Statistical methods for microarray assays

Paweł KRAJEWSKI, Jan BOCIANOWSKI¹

Institute of Plant Genetics, Polish Academy of Sciences, Poznań, Poland

Abstract: The paper briefly reviews statistical methods used in DNA microarray studies. All stages of the experiment are taken into account: planning, data collection, data preprocessing, analysis and validation. Among the methods of data analysis, algorithms for estimating differential expression, multivariate approaches, clustering methods, and classification and discrimination are reviewed. We stress the need for routine statistical data processing protocols and for a search for links between microarray data analysis and quantitative genetic models.

Key words: data analysis, data collection, DNA microarrays, planning experiments, statistical methods, validation.

Introduction

DNA microarrays are increasingly used to obtain data on gene expression in various organisms (KAMIŃSKI 2002). There is a chance that, after a period of basic methodological research, they will become an important diagnostic tool in biological research and in medicine. It is now understood that, to be useful for researchers, this ingenious technology must be supplemented with appropriate statistical, computational and data storage facilities. This paper provides a short review of existing statistical methods of microarray data processing. For other reviews of computational and statistical problems in microarrays see, e.g., QUACKENBUSH (2001) or SMYTH et al. (2002).

A typical microarray experiment consists of all the stages that characterise any good empirical investigation. It must be planned in such a way that the available resources are used optimally to provide the desired data. The data must be collected, stored, labelled and preprocessed, and then analysed, in a confirmatory or exploratory way, with respect to the existing a priori or data-generated hypotheses. At this stage, statistical methods seem to be the most appropriate tool. Then, the results of the analysis must be critically checked and validated. In our paper we proceed in this order, sketching the problems connected with the planning and analysis of microarray experiments.

Received: July 3, 2002. Accepted: July 8, 2002.
Correspondence: P. KRAJEWSKI, Institute of Plant Genetics, Polish Academy of Sciences, ul. Strzeszyńska 34, 60-479 Poznań, Poland, e-mail: [email protected]
¹ Awarded a scholarship of the Foundation for Polish Science for 2002.

Designing experiments

The modern design of experiments adopts the three principles introduced by R.A. Fisher: replication, randomization and local control. It is unlikely that microarray experiments can be properly designed without observing these rules. After a period in which conclusions were drawn on the basis of single slides, attention is now turning towards more complicated schemes. LEE et al. (2000) and NOVAK et al. (2002) presented results on the inherent variability in gene expression data and the role of replication. Their studies involved genes of known expression and checked the profiles on the same, replicated RNA samples; that is, they used the model of a dummy experiment. Both studies stressed the need for replicated experiments, which can decrease the chance of misleading results. KERR and CHURCHILL (2001b) and GLONEK and SOLOMON (2002) addressed the problem of selecting an experimental design suited to the situation at hand. Their aim was to replace the naive use of all possible comparisons or of reference sample comparisons with more sophisticated, possibly optimal, designs that would assure an efficient use of resources.
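The gain from a more carefully chosen design can be illustrated with a small calculation. The sketch below is only an illustration under simplifying assumptions that are ours, not those of the cited papers: each slide yields one log ratio of two samples, errors are independent with equal variance, and dye effects are ignored. The function name `contrast_variance` and both design matrices are hypothetical.

```python
import numpy as np

def contrast_variance(design, contrast):
    """Variance of an estimated contrast, in units of the error variance.

    Each row of `design` is one slide: +1 for the sample on one channel,
    -1 for the sample on the other, so the slide measures their log ratio.
    """
    x = np.asarray(design, dtype=float)
    c = np.asarray(contrast, dtype=float)
    # The pseudo-inverse handles the rank deficiency of two-channel designs.
    return float(c @ np.linalg.pinv(x.T @ x) @ c)

# Reference design: samples A, B, C, each hybridized against a reference R.
reference = [[1, 0, 0, -1],
             [0, 1, 0, -1],
             [0, 0, 1, -1]]

# Loop design: A vs B, B vs C, C vs A (same number of slides, no reference).
loop = [[1, -1, 0],
        [0, 1, -1],
        [-1, 0, 1]]

var_reference = contrast_variance(reference, [1, -1, 0, 0])  # contrast A - B
var_loop = contrast_variance(loop, [1, -1, 0])               # contrast A - B
print(var_reference, var_loop)
```

Under these assumptions, the comparison of A with B is estimated with a smaller variance in the loop design than in the reference design, although both use three slides.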

Data collection

In any microarray experimental setup the basic unit is a plate (slide), typically containing several thousand genes. After the chemical part of the experiment, the slide is subjected to image analysis: data are collected by the scanner, digitized and stored. Several important problems are connected with this stage.

The task of the image processing hardware and software is to locate spots on the slide and to segment them, that is, to separate the pixels bearing the signal from the pixels belonging to the background. Several existing methods do so. Commercial products utilize, for example, the histogram method, which estimates the distribution of pixel intensities within the spot and recognizes pixels in the upper tail and in the lower tail of the distribution as the signal and the background, respectively. The method provides, by definition, positive observations of the spot intensity; this feature is not possessed by several other methods (fixed circle, adaptive; for details see BOCIANOWSKI et al. 2002).

New spot finding and segmentation methods are being developed. BUHLER et al. (2000) considered a method of locating the spots which utilizes all available morphological information and is able to properly find even spots of low intensity and irregular shape. The same authors suggested that spot finding should provide the experimenter with a measure of slide quality (see also WANG et al. 2001, BROWN et al. 2001). Several image analysis methods were critically compared by YANG et al. (2002). New research in this area is represented by the paper by BOZINOV and RAHNENFÜHRER (2002), who used clustering techniques to find signal and background pixels.

Irrespective of the method of spot location and segmentation, the outcome of this stage is, for each spot, an observation of the signal intensity for the two colour channels and of the background intensity. The intensities can be obtained in different ways: as the mean, mode, median or total of the intensities of the pixels recognized as belonging to the spot or to the background.
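The idea behind the histogram method can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular commercial product: the function name `histogram_segment`, the percentile cut-offs and the simulated pixel values are all assumptions made for the example.

```python
import numpy as np

def histogram_segment(pixels, low_pct=20.0, high_pct=80.0):
    """Mean signal and mean background for the pixels of one spot region.

    low_pct and high_pct are hypothetical cut-offs chosen for illustration;
    real software exposes its own parameters.
    """
    pixels = np.asarray(pixels, dtype=float).ravel()
    low_cut, high_cut = np.percentile(pixels, [low_pct, high_pct])
    background = pixels[pixels <= low_cut]   # lower tail of the distribution
    signal = pixels[pixels >= high_cut]      # upper tail of the distribution
    return signal.mean(), background.mean()

# Simulated region: many dim background pixels and fewer bright spot pixels.
rng = np.random.default_rng(0)
region = np.concatenate([rng.normal(200, 10, 300),
                         rng.normal(1500, 100, 100)])

signal, background = histogram_segment(region)
corrected = signal - background
print(signal, background, corrected)
```

Because the signal is taken from the upper tail and the background from the lower tail of the same distribution, the background-corrected intensity cannot be negative, which is the property noted above.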

Data preprocessing

Raw data are transformed into measures of differential expression, usually by calculating the ratio of the (background corrected) intensity of one channel to that of the other and taking the logarithm of that ratio. Some other measures of differential expression were discussed by TSODIKOV et al. (2002). The data should also be normalized in order to eliminate effects and variation caused by using different slides, dyes or other experimental conditions, as well as within-slide effects. Some methods in this area were described by YANG et al. (2001). Complete systems have been devised to correct, filter and normalize raw microarray data (FIELDEN et al. 2002). Preprocessing may also involve imputation of missing values (TROYANSKAYA et al. 2001). SHMULEVICH and ZHANG (2002) argue that at this stage the data should be transformed to binary values.
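The basic transformation described above can be sketched as follows. The base-2 logarithm, the treatment of non-positive corrected intensities as missing, and the global median centring are assumptions made for this illustration; the median centring is a deliberately simple stand-in and not the normalization of YANG et al. (2001). The function names and the intensity values are hypothetical.

```python
import numpy as np

def log_ratios(red, red_bg, green, green_bg):
    """Background-corrected log2 ratios; NaN where a corrected value is not positive."""
    r = np.asarray(red, dtype=float) - np.asarray(red_bg, dtype=float)
    g = np.asarray(green, dtype=float) - np.asarray(green_bg, dtype=float)
    valid = (r > 0) & (g > 0)
    m = np.full(r.shape, np.nan)
    m[valid] = np.log2(r[valid] / g[valid])
    return m

def median_normalize(m):
    """Centre the log ratios of one slide at zero, ignoring missing values."""
    return m - np.nanmedian(m)

# Four hypothetical spots: signal and background intensities in two channels.
red = [1200.0, 800.0, 300.0, 150.0]
red_bg = [100.0, 100.0, 100.0, 160.0]
green = [600.0, 900.0, 500.0, 400.0]
green_bg = [50.0, 100.0, 100.0, 100.0]

m = log_ratios(red, red_bg, green, green_bg)
m_norm = median_normalize(m)
print(m, m_norm)
```

The last spot has a corrected red intensity below zero and is therefore left missing, which is one way in which the missing values mentioned above can arise.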

Data analysis

The form of the data obtained from microarrays depends on the structure of the experiment. In the case of a comparison of just two tissues, e.g. from a mutant and a wild type, possibly on replicated slides, the slides constitute a simple sample. If more cases (tissues, treatments, time points) are compared, the slides have a one- or multi-factorial structure. Usually each gene is seen as corresponding to one variable (trait). The data analysis amounts to the estimation of differential expression or to more complicated procedures such as clustering; some of the possibilities in this area are described below.

Estimating differential expression

For a simple sample, the analysis is usually aimed at finding the genes with different expression in the two compared tissues. Genes can easily be ranked according to the expression ratio, possibly averaged over replications, but deciding on the significance of the ratios is not so easy. The simplest approach would be to base the test for ratios on replicated plates and a t statistic. CHEN et al. (1997) found an approximation of the exact distribution of the ratios that can improve the test. Three widely used testing methods, the t-test, the regression method of THOMAS et al. (2001) and the method based on Normal mixture models of PAN et al. (2001), were compared by PAN (2002). It can also be argued that the simple ratio of two intensities is not the best measure of differential expression; along this line, NEWTON et al. (2001) used a Gamma distribution model to improve the accuracy of identification of interesting genes.

A more complicated situation occurs if the plates can be divided into two sub-samples, each used to measure the expression of genes for a subject in comparison to a common reference sample. For this situation, DUDOIT et al. (2002b) used the t statistic on a gene-by-gene basis, followed by simultaneous testing procedures to correctly control the family-wise (plate-wise) error rate. They stressed the usefulness of Holm's sequential testing procedure, which is less conservative than the simple Bonferroni correction. The p-values valid for simultaneous testing of many genes can also be obtained by permutation algorithms; this, however, may incur a substantial computational cost.

Finally, there is the general situation of comparing several treatments in one experiment. As an approximation, some version of the analysis of variance with the usual normality assumptions could probably be used for such data. KERR et al. (2000) argued that a better approach would be to use raw intensities (not their ratios) and a linear ANOVA model, with a bootstrap analysis of the residuals that yields confidence intervals for comparisons of treatments without the normality assumption.

Multivariate approaches
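Before turning to the multivariate view, the simultaneous testing step described in the preceding subsection on estimating differential expression can be made concrete. The sketch below computes Bonferroni and Holm adjusted p-values for a handful of genes; the p-values are invented for illustration, and the code is a generic implementation of the two corrections rather than the procedure of DUDOIT et al. (2002b).

```python
import numpy as np

def bonferroni(p):
    """Bonferroni adjusted p-values: each p-value times the number of tests."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm(p):
    """Holm step-down adjusted p-values."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0, 1, ...) is multiplied by m - k.
    stepwise = (m - np.arange(m)) * p[order]
    # A running maximum keeps the adjusted values monotone.
    adjusted_sorted = np.minimum(np.maximum.accumulate(stepwise), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical gene-wise p-values, e.g. from t statistics on replicated plates.
p_values = np.array([0.001, 0.012, 0.015, 0.04, 0.30])
p_bonf = bonferroni(p_values)
p_holm = holm(p_values)
print(p_bonf)
print(p_holm)
```

With these invented p-values and a family-wise level of 0.05, Holm's procedure declares three genes significant and the Bonferroni correction only one, which illustrates why the former is described as less conservative.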

Although the basic methods treat genes as traits, which is consistent with the general rules of experimental designs, several approaches have been developed by viewing, in the data set of expression ratios, the genes as cases and the plates as variables. The algebra of this approach was described by KURUVILLA et al. (2002). Most of the well-known methods based on the singular value decomposition have been used: principal components analysis (WALL et al. 2001, YEUNG and RUZZO 2001), correspondence analysis (FELLENBERG et al. 2001) and biplots (CHAPMAN et al. 2002). The concept of the minimum spanning tree has been utilized by XU et al. (2002). A Mahalanobis distance based method of detection of differentially expressed genes was described by CHILINGARYAN et al. (2002).
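The view of genes as cases and plates as variables, combined with the singular value decomposition, can be sketched as follows. The data are simulated and the variable names are hypothetical; the example shows an ordinary principal components analysis obtained from the SVD, not the specific procedures of the cited papers.

```python
import numpy as np

# Simulated log ratios: 50 genes (cases) in rows, 6 plates (variables) in columns.
rng = np.random.default_rng(1)
ratios = rng.normal(size=(50, 6))

# Centre each plate, then decompose the centred matrix.
centred = ratios - ratios.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

scores = u * s                       # coordinates of genes on the components
loadings = vt.T                      # contribution of each plate to each component
explained = s**2 / np.sum(s**2)      # share of variance carried by each component

print(explained)
```

Plotting the first two columns of `scores` gives the usual principal component display of genes; plotting scaled scores and loadings together would give a biplot.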

Clustering methods

Existing clustering algorithms have been applied to gene expression data in many ways. For example, EISEN et al. (1998) used hierarchical grouping of genes based on a Pearson-like similarity measure. GETZ et al. (2000) analysed different aspects of two-way clustering methods (that is, clustering applied to both genes and samples) under several specifications of the clustering algorithm and similarity measures. Model-based clustering methods, using the Normal mixture distribution, have been considered by YEUNG et al. (2001a), GHOSH and CHINNAIYAN (2002) and MCLACHLAN et al. (2002). Some interesting proposals concerning clustering have been made by HASTIE et al. (2000, 2001). They considered the situation in which observations of some additional traits are made in the experiment for each plate. This opens the possibility of linking the obtained clusters of samples to external data of a different type, quantitative or qualitative. It should also be noted that the method of self-organizing maps has been applied to microarray data by TAMAYO et al. (1999), who claimed its superiority over ordinary hierarchical grouping.

Classification and discrimination

The analysis of microarray data can also be directed towards classifying samples into two or more classes, based on the expression level of several genes. The area of classification seems to be very appealing in the context of microarray data for the following reason: in addition to classical methods, e.g. discriminant functions, several new approaches can be applied, which are based on additional available biological knowledge. In this way the so-called supervised techniques arise. Some classical and new methods were described and compared by DUDOIT et al. (2002a): the nearest neighbour methods, linear discrimination and classification trees. A knowledge-based method, with the ability to assign genes to multiple groups of similar expression, was described by MOLOSHOK et al. (2002). Promising methods seem to emerge from the application of support vector machines (SVMs) to the analysis of microarray data (BROWN et al. 2000, GAASTERLAND et al. 2000). An SVM is a classification algorithm that works iteratively, starting from groups of genes of common function. This feature in an obvious way supports the idea of accumulating and extending the knowledge about gene functions stored in several databases.
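Of the classical methods mentioned above, the nearest neighbour rule is the simplest to sketch. In the example below the function name `knn_predict`, the Euclidean distance, the value k = 3, the class labels and the expression values are all assumptions made for illustration; the cited comparison may have used other distances and tuning.

```python
import numpy as np

def knn_predict(train_x, train_y, new_x, k=3):
    """Assign new_x to the most frequent class among its k nearest samples."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    dist = np.linalg.norm(train_x - np.asarray(new_x, dtype=float), axis=1)
    nearest = train_y[np.argsort(dist)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training samples: log ratios of two genes on six plates.
train_x = [[2.1, -1.0], [1.9, -0.8], [2.3, -1.2],
           [-1.0, 1.1], [-0.8, 0.9], [-1.2, 1.0]]
train_y = ["tumour", "tumour", "tumour", "normal", "normal", "normal"]

predicted = knn_predict(train_x, train_y, [1.8, -0.9])
print(predicted)
```

In practice the classifier would be built on many more genes, and its error rate would be estimated on samples not used for training.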

Post-processing and validation

It is correctly recognized by many authors that the role of statistics in the analysis of microarray data is not just to give optimal solutions. As the amount of data increases, the statistical findings should be validated and compared with findings from other experiments. Thus, KERR and CHURCHILL (2001a) considered checking the reliability of gene clusters, obtained on the basis of ANOVA estimates of gene expression, by bootstrapping. Similar approaches were described by YEUNG et al. (2001b) and AZUAJE (2002). The estimation of the number of clusters was considered by TIBSHIRANI et al. (2001). Finally, a very practical aspect of validation, that is, comparing the results of a microarray assay to standard clinical predictors, was considered by TIBSHIRANI and EFRON (2002).

Discussion

Our survey shows that the area of microarray experiments has proved, and should continue to be, very attractive both for specialists in experimental design and for statisticians.

Firstly, it seems that the days are over in which simple dye-swapped experiments were thought to provide an antidote to all sources of error. It is expected that the role of good experimental practice with microarrays will not be diminished by progress in their manufacturing. More and more samples are now compared in one experiment. The samples represent experimental treatments, sometimes with a factorial structure. The theory of experimentation knows designs that are appropriate for such situations. For example, the field of weighing designs (BANERJEE 1975) considers a situation in which several samples are compared using a device that, like a microarray, can only measure two samples relative to each other at a time.

It is also obvious that the data collection stage plays a very important role in the whole experiment. Once measured, the slide pictures (scans) will probably be discarded, as the storage of numerical data is easier and cheaper. One of the microarray software systems, Quantarray®, offers the user twelve combinations of methods of spot segmentation and intensity calculation. Within each combination, there are several parameters, which can be left at their default values or changed by the user, with the effect of changing the observations. Although the manual contains remarks on the features of some methods, we doubt that experimenters would fully understand which aspects of their conclusions might be affected by decisions at this stage.

As to the statistical data analysis itself, it seems that the supervised techniques, which take into account additional biological knowledge about the experimental material, will probably compete successfully with the unsupervised ones.
After all, microarrays are a great hope for functional genomics; in this field, the link between existing, accumulated information on gene functions and newly designed experiments must be exceptionally strong.

The microarrays will probably remain a diagnostic tool in medicine. We do not know if the same will happen in other biological investigations, on plants or animals. Certainly, such diagnostic tools will not be complete without corresponding statistical protocols, into which currently developed methods must be transformed.

We were not able to give in our survey any account of the software tools that are necessary for performing the statistical analyses described. It must be stressed that, in principle, most of the classical methods, such as the analysis of variance, clustering or principal component analysis, can be programmed for microarray data using leading commercial statistical packages such as Genstat, SAS or S-Plus. Of course, specialized, compiled software may have the advantage of speed in dealing with large amounts of data; links to some products with that feature can be found at http://genome-www5.stanford.edu/MicroArray/SMD/restech.html. We were also not able to discuss the data storage aspect of the problem. Databases for microarrays were discussed by BRAZMA et al. (2000); an exemplary solution was described by SHERLOCK et al. (2001).

Finally, it should be noted that microarray applications may strengthen the links between functional genomics and classical (quantitative) genetics. A possibility in this direction was sketched by JANSEN and NAP (2001). To some extent, the situation may in the future resemble the one already met in the field of marker data analysis, where the investigation of segregating material led to new interesting models and applications.

REFERENCES

AZUAJE F. (2002). A cluster validity framework for genome expression data. Bioinformatics 18: 319-320.
BANERJEE K.S. (1975). Weighing Designs for Chemistry, Medicine, Economics, Operations Research, Statistics. Marcel Dekker Inc., New York.
BOCIANOWSKI J., GALLAVOTTI A., KRAJEWSKI P., PO E. (2002). On methods of collecting data from DNA microarrays. Colloq. Biometryczne 32 (in press).
BOZINOV D., RAHNENFÜHRER J. (2002). Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics 18: 747-756.
BRAZMA A., ROBINSON A., CAMERON G., ASHBURNER M. (2000). One-stop shop for microarray data. Nature 403: 699-700.
BROWN C.S., GOODWIN P.C., SORGER P.K. (2001). Image metrics in the statistical analysis of DNA microarray data. Proc. Natl. Acad. Sci. 98: 8944-8949.
BROWN M.P.S., GRUNDY W.N., LIN D., CRISTIANINI N., SUGNET C.W., FUREY T.S., ARES Jr M., HAUSSLER D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97: 262-267.
BUHLER J., IDEKER T., HAYNOR D. (2000). Dapple: improved techniques for finding spots on DNA microarrays. University of Washington CSE Technical Report UWTR 2000-08-05.
CHAPMAN S., SCHENK P., KAZAN K., MANNERS J. (2002). Using biplots to interpret gene expression patterns in plants. Bioinformatics 18: 202-204.

CHEN Y., DOUGHERTY E.R., BITTNER M.L. (1997). Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics 2: 364-374.
CHILINGARYAN A., GEVORGYAN N., VARDANYAN A., JONES D., SZABO A. (2002). Multivariate approach for selecting sets of differentially expressed genes. Math. Biosci. 176: 59-72.
DUDOIT S., FRIDLYAND J., SPEED T.P. (2002a). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97: 77-87.
DUDOIT S., YANG Y.H., SPEED T.P., CALLOW M.J. (2002b). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12: 111-140.
EISEN M.B., SPELLMAN P.T., BROWN P.O., BOTSTEIN D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95: 14863-14868.
FELLENBERG K., HAUSER N.C., BRORS B., NEUTZNER A., HOHEISEL J.D., VINGRON M. (2001). Correspondence analysis applied to microarray data. Proc. Natl. Acad. Sci. 98: 10781-10786.
FIELDEN M.R., HALGREN R.G., DERE E., ZACHAREWSKI T.R. (2002). GP3: GenePix post-processing program for automated analysis of raw microarray data. Bioinformatics 18: 771.
GAASTERLAND T., BEKIRANOV S. (2000). Making the most of microarray data. Nature Genetics 24: 204-206.
GETZ G., LEVINE E., DOMANY E. (2000). Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. 97: 12079-12084.
GHOSH D., CHINNAIYAN A.M. (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18: 275-286.
GLONEK G.F.V., SOLOMON P.J. (2002). Factorial designs for microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide, Australia.
HASTIE T., TIBSHIRANI R., BOTSTEIN D., BROWN P. (2001). Supervised harvesting of expression trees. Genome Biology 2(1): research0003.1-0003.12.
HASTIE T., TIBSHIRANI R., EISEN M.B., ALIZADEH A., LEVY R., STAUDT L., CHAN W.C., BOTSTEIN D., BROWN P. (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2): research0003.1-0003.21.
JANSEN R.C., NAP J-P. (2001). Genetical genomics: the added value from segregation. Trends in Genetics 17: 388-391.
KAMIŃSKI S. (2002). DNA microarrays – a methodological breakthrough in genetics. J. Appl. Genet. 43: 123-130.
KERR M.K., CHURCHILL G.A. (2001a). Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. 98: 8961-8965.
KERR M.K., CHURCHILL G.A. (2001b). Experimental design for gene expression microarrays. Biostatistics 2: 183-201.

KERR M.K., MARTIN M., CHURCHILL G.A. (2000). Analysis of variance for gene expression microarray data. J. Computat. Biol. 7: 819-837.
KURUVILLA F.G., PARK P.J., SCHREIBER S.L. (2002). Vector algebra in the analysis of genome-wide expression data. Genome Biology 3(3): research0011.1-0011.11.
LEE M.L.T., KUO F.C., WHITMORE G.A., SKLAR J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. 97: 9834-9839.
MCLACHLAN G.J., BEAN R.W., PEEL D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18: 413-422.
MOLOSHOK T.D., KLEVECZ R.R., GRANT J.D., MANION F.J., SPEIER W.F., OCHS M.F. (2002). Application of Bayesian decomposition for analysing microarray data. Bioinformatics 18: 566-575.
NEWTON M.A., KENDZIORSKI C.M., RICHMOND C.S., BLATTNER F.R., TSUI K.W. (2001). On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Computat. Biol. 8: 37-52.
NOVAK J.P., SLADEK R., HUDSON T.J. (2002). Characterization of variability in large-scale gene expression data: implications for study design. Genomics 79: 104-113.
PAN W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18: 546-554.
PAN W., LIN J., LE C.T. (2001). A mixture model approach to detecting differentially expressed genes with microarray data. Report 2001-011, Division of Biostatistics, University of Minnesota.
QUACKENBUSH J. (2001). Computational analysis of microarray data. Nature Reviews Genetics 2: 418-427.
SHERLOCK G., HERNANDEZ-BOUSSARD T., KASARSKIS A., BINKLEY G., MATESE J.C., DWIGHT S.S., KALOPER M., WENG S., JIN H., BALL C.A., EISEN M.B., SPELLMAN P.T., BROWN P.O., BOTSTEIN D., CHERRY J.M. (2001). The Stanford microarray database. Nucleic Acids Research 29: 152-155.
SHMULEVICH I., ZHANG W. (2002). Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18: 555.
SMYTH G.K., YANG Y.H., SPEED T. (2002). Statistical issues in cDNA microarray data analysis. University of California, Berkeley, report, available at http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html.
TAMAYO P., SLONIM D., MESIROV J., ZHU Q., KITAREEWAN S., DMITROVSKY E., LANDER E.S., GOLUB T.R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96: 2907-2912.
THOMAS J.G., OLSON J.M., TAPSCOTT S.J., ZHAO L.P. (2001). An efficient and robust statistical modelling approach to discover differentially expressed genes using genomic expression profiles. Genome Research 11: 1227-1236.
TIBSHIRANI R., EFRON B. (2002). Pre-validation and inference in microarrays. Technical report, available at http://www-stat.stanford.edu/~tibs/research.html.
TIBSHIRANI R., WALTHER G., BOTSTEIN D., BROWN P. (2001). Cluster validation by prediction strength. Technical report, available at http://www-stat.stanford.edu/~tibs/research.html.

TROYANSKAYA O., CANTOR M., SHERLOCK G., BROWN P., HASTIE T., TIBSHIRANI R., BOTSTEIN D., ALTMAN R.B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520-525.
TSODIKOV A., SZABO A., JONES D. (2002). Adjustments and measures of differential expression for microarray data. Bioinformatics 18: 251-260.
WALL M.E., DYCK P.A., BRETTIN T.S. (2001). SVDMAN – singular value decomposition analysis of microarray data. Bioinformatics 17: 566-568.
WANG X., GHOSH S., GUO S.-W. (2001). Quantitative quality control in microarray image processing and data acquisition. Nucleic Acids Research 29: e75.
XU Y., OLMAN V., XU D. (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 536-545.
YANG Y.H., BUCKLEY M.J., DUDOIT S., SPEED T.P. (2002). Comparison of methods for image analysis on cDNA microarray data. J. Computat. Graphical Statistics 11: 1-29.
YANG Y.H., DUDOIT S., LUU P., SPEED T.P. (2001). Normalization for cDNA microarray data. In: Microarrays: Optimal Technologies and Informatics (M.L. Bittner, Y. Chen, A.N. Dorsel, E.R. Dougherty, eds.). Volume 4266 of Proceedings of SPIE.
YEUNG K.Y., FRALEY C., MURUA A., RAFTERY A.E., RUZZO W.L. (2001a). Model-based clustering and data transformation for gene expression data. Bioinformatics 17: 977-987.
YEUNG K.Y., HAYNOR D.R., RUZZO W.L. (2001b). Validating clustering for gene expression data. Bioinformatics 17: 309-318.
YEUNG K.Y., RUZZO W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics 17: 763-774.