Blood Transcriptional Fingerprints to Assess the Immune Status of Human Subjects

Damien Chaussabel, Nicole Baldwin, Derek Blankenship, Charles Quinn, Esperanza Anguiano, Octavio Ramilo, Ganjana Lertmemongkolchai, Virginia Pascual, and Jacques Banchereau

Abstract  The blood transcriptome affords a comprehensive view of the status of the human immune system. Global changes in transcript abundance have been measured in the blood of patients with a wide range of diseases. This chapter presents an overview of the advances that have led to the identification of therapeutic targets and biomarker signatures in the field of autoimmunity and infectious disease. It also provides technology and data analysis primers as means of introducing blood transcriptome research to a broad readership. Specifically, we compare microarrays with some of the most recent digital gene expression profiling technologies available to date, including RNA sequencing. Furthermore, in addition to the basic steps involved in the analysis of microarray data we also present more advanced data mining approaches for blood transcriptional fingerprinting.

Blood Transcript Profiling

A wide range of molecular and cellular profiling assays are now available to study the human immune system (Fig. 1). Among the systems-wide molecular profiling technologies, genomic approaches are the most mature and scalable for high-throughput use. The human genome can be investigated from two different angles. The first is sequence variation, which can be detected using, for instance, single nucleotide polymorphism (SNP) chips, and which permits the identification of common polymorphisms or rare mutations associated with diseases. Hundreds of thousands of SNPs can be typed using these platforms, yielding a genome-wide, hypothesis-free scan of genetic associations for a given phenotype of interest. The second genome-wide profiling approach employed for the study of immune-mediated diseases consists of measuring transcript abundance. Generating transcriptional profiles on

D. Chaussabel (*) Baylor Institute for Immunology Research, Baylor Research Institute, 3434 Live Oak, Dallas, TX 75204, USA e-mail: [email protected] F.M. Marincola and E. Wang (eds.), Immunologic Signatures of Rejection, DOI 10.1007/978-1-4419-7219-4_8, © Springer Science+Business Media, LLC 2011


Fig.  1  The immune profiling armamentarium. The number of high throughput molecular and cellular profiling tools available for the assessment of the human immune system from blood is increasing rapidly

a genome-wide scale is both straightforward and cost-effective, affording the most comprehensive view of the status of the human immune system to date. Indeed, such studies can inform us about mechanisms of pathogenesis and thereby create opportunities for discovering potential therapeutic targets and novel clinically relevant biomarkers (Alizadeh et al. 2000; Baechler et al. 2003; Bennett et al. 2003; Sarwal et al. 2003; Wright et al. 2003; Achiron et al. 2004; Batliwalla et al. 2005a; Griffiths et al. 2005; Pascual et al. 2005). Transcriptional profiles have been obtained from many human tissues, including for instance the skin (Deonarine et al. 2007; Panelli et al. 2007; Greco et al. 2010; Cole et al. 2001), muscle (Berchtold et al. 2009), liver (Frueh et al. 2001; Flanagan et al. 2009), kidney (Flechner et al. 2004; Bunnag et al. 2009) and brain (Glatt et al. 2005). This review focuses specifically on the use of blood transcript profiling. Blood is the pipeline of the immune system, with immune cells exposed to factors released in the bloodstream or present in peripheral tissues from which they re-circulate. It is an accessible tissue for which sampling can easily be standardized, with robust blood collection and RNA stabilization systems becoming widely available in recent years (Debey et al. 2006; Asare et al. 2008). In contrast with


many other tissues, blood can also be sampled repeatedly over time, which is a key property for monitoring the immunological status of human subjects.

Profiling Human Subjects in Health and Disease

Profiling Autoimmune Diseases

The field of autoimmunity has proved a fertile ground for blood transcriptional studies. Work initially focused on diseases with clear systemic involvement such as systemic lupus erythematosus (SLE) (Baechler et al. 2003; Bennett et al. 2003). This contributed to the identification of a type I interferon signature in the blood of a majority of lupus patients, prompting the development of therapies targeting this pathway (Yao et al. 2009). Furthermore, the potential value of this and other blood transcriptional signatures for the assessment of disease activity has been examined (Sandrin-Garcia et al. 2009; Petri et al. 2009; Nikpour et al. 2008; Nakou et al. 2008; Bauer et al. 2009; Chaussabel et al. 2008). Systemic onset juvenile idiopathic arthritis (SoJIA) is another disease with systemic involvement whose study has greatly benefited from blood transcriptional profiles, with this proof-of-principle work leading to the development of both therapeutic and diagnostic modalities (Pascual et al. 2005, 2008; Allantaz et al. 2007a, b). Diseases with specific organ involvement have also been the subject of significant, yet not always extensive, blood profiling efforts. Thus blood signatures have been obtained from patients with multiple sclerosis (Achiron et al. 2004; Bomprezzi et al. 2003). Given the inaccessibility of the brain, blood constitutes a particularly attractive source of surrogate molecular markers for this disease. These efforts have yielded a systemic signature and identified potential predictive markers of clinical relapse and response to treatment (van Baarsen et al. 2008; Gurevich et al. 2009; Achiron et al. 2007a). Transcriptional signatures have also been generated in the context of dermatologic diseases. In this case, because the target organ is readily accessible, efforts have focused on profiling transcript abundance in skin tissues (Nomura et al. 2003; de Jongh et al. 2005).
However, systemic involvement has been recognized in recent years to be an important component of autoimmune skin diseases and unique blood transcriptional profiles have also been identified for example in patients with Psoriasis (Batliwalla et al. 2005a; Stoeckman et al. 2006; Koczan et al. 2005). Blood transcriptional profiles have been generated in the context of many other autoimmune diseases. Indeed, the range of autoimmune/autoinflammatory diseases that have been investigated encompasses: SLE (Baechler et al. 2003; Bennett et al. 2003; Crow and Wohlgemuth 2003; Han et al. 2003), juvenile idiopathic arthritis (Pascual et  al. 2005; Allantaz et  al. 2007a; Ogilvie et  al. 2007; Fall et  al. 2007; Barnes et  al. 2009), multiple sclerosis (Achiron et  al. 2007b; Singh et  al. 2007), rheumatoid arthritis (Edwards et al. 2007; van der Pouw Kraan et al. 2007; Lequerre et al. 2006; Batliwalla et al. 2005b), Sjogren’s syndrome (Emamian et al. 2009), diabetes (Kaizer et  al. 2007; Takamura et  al. 2007), inflammatory bowel disease


(Burczynski et al. 2006), psoriasis and psoriatic arthritis (Batliwalla et al. 2005a; Stoeckman et al. 2006), inflammatory myopathies (Greenberg et al. 2005; Baechler et  al. 2007), scleroderma (Tan et  al. 2006; York et  al. 2007), vasculitis (Alcorta et al. 2007) and antiphospholipid syndrome (Potti et al. 2006). The body of work produced that focuses on blood transcript profiling in the context of autoimmune diseases has been covered at length in a recent review (Pascual et al. 2010).

Profiling Infectious Diseases

Global changes in transcript abundance have also been measured in the blood of patients with infectious diseases. In this context, alterations of blood transcriptional profiles are a reflection of the immunological response mounted by the host against pathogens. This response is mediated by specialized receptors expressed at the surface of host cells recognizing pathogen-associated molecular patterns (Janeway and Medzhitov 2002). Different classes of pathogens signal through different combinations of receptors, eliciting in turn different types of immune responses (Aderem and Ulevitch 2000). This translates experimentally into distinct transcriptional programs being induced upon exposure of immune cells in vitro to distinct classes of infectious agents (Nau et al. 2002; Huang et al. 2001; Chaussabel et al. 2003). Similarly, patterns of transcript abundance measured in the blood of patients with infections caused by different etiological agents were found to be distinct (Ramilo et al. 2007). Predictably, dramatic changes were observed in the blood of patients with systemic infections (e.g., sepsis) (Pankla et al. 2009; Tang et al. 2009). However, profound alterations in patterns of transcript abundance were also found in patients with localized infections (e.g., upper respiratory tract infection, urinary tract infections, pulmonary tuberculosis, skin abscesses) (Allantaz et al. 2007a; Ramilo et al. 2007; Jacobsen et al. 2007). Measuring changes in host transcriptional profiles may therefore prove of diagnostic value even in situations where the causative pathogenic agent is not present in the test sample. Importantly, it may also help ascertain the severity of the infection and monitor its course.
Infections often present as acute clinical events, thus it is important to capture dynamic changes in transcript abundance that occur during the course of the infection from the time of initial exposure. Blood signatures have been described in the context of acute infections caused by a wide range of pathogenic parasites, viruses and bacteria, including: Plasmodium (Griffiths et  al. 2005; Franklin et  al. 2009), respiratory viruses (Influenza, Rhinovirus, Respiratory Syncytial Virus) (Ramilo et al. 2007; Reghunathan et al. 2005; Popper et al. 2009; Zaas et al. 2009; Thach et al. 2005), Dengue virus (Ubol et al. 2008; Nascimento et al. 2009), Adenovirus (Popper et al. 2009), as well as Salmonella (Thompson et al. 2009), Mycobacterium tuberculosis (Jacobsen et  al. 2007), Staphylococcus aureus (Ardura et  al. 2009),


Burkholderia pseudomallei (Pankla et al. 2009) and in the general context of bacterial sepsis (Tang et al. 2009; Wong et al. 2009; Payen et al. 2008; Johnson et al. 2007). Some of those pathogens will persist and establish chronic infections (e.g., Human Immunodeficiency Virus, Plasmodium) that may lead to a state of latency (e.g., Tuberculosis), and transcript profiling may in those situations be used as a surveillance tool for monitoring disease progression or reactivation. Blood profiling of infectious diseases also remains limited in scale. In particular, additional studies will be necessary to ascertain dynamic changes occurring over time.

Profiling other Diseases

Blood transcript profiling studies have been carried out in the cancer research field. While hematological malignancies have led the way (reviewed in (Staratschek-Jox et al. 2009)), blood profiles have also been obtained more recently from patients with solid tumors (Aaroe et al. 2010). Notably, these signatures can reflect not only the immunological or physiological changes effected by cancers but also the presence of rare tumor cells in the circulation (Findeisen et al. 2008; Hayes et al. 2006; Martin et al. 2001). Blood signatures have also been obtained from solid organ transplant recipients in the context of both tolerance (Martinez-Llordella et al. 2008; Kawasaki et al. 2007; Brouard et al. 2007) and graft rejection (Flechner et al. 2004; Lin et al. 2009; Alakulppi et al. 2008). While such signatures can also be detected in biopsy material (Sarwal et al. 2003; Mueller et al. 2007; Scherer et al. 2003), blood offers the distinct advantage of being accessible for safely monitoring molecular changes on a routine basis. Some work has also been done in the context of cardiovascular diseases, where inflammation is known to play an important role. Hence, profiles have been identified in a wide range of conditions, including stroke, chronic heart failure and acute coronary syndrome (Tang et al. 2001; Nakayama et al. 2008; Moore et al. 2005; Cappuzzello et al. 2009). Other efforts have yielded blood transcriptional signatures in patients with neurodegenerative diseases (Maes et al. 2007; Lovrecic et al. 2009; Borovecki et al. 2005), and in response to stress, environmental exposure (Peretz et al. 2007; McHale et al. 2009; Bushel et al. 2007), exercise (Kawai et al. 2007; Connolly et al. 2004) or even laughter (Hayashi et al. 2007). The body of published work is too large to be cited in full in this review, and it is likely to be only the tip of the iceberg, with much more unpublished data scattered throughout the public and private space.
However, the vast majority of these studies are underpowered, and some lack even the most rudimentary validation steps. All too often primary data are not available for reanalysis either, reflecting a lack of enforcement of editorial policies, or the absence thereof in some journals. Hence one of the main challenges for this field is to move beyond the proof of principle stage and consolidate the wealth of data being generated.


Collectively, studies published thus far demonstrate that alterations in transcript abundance can be detected on a genome-wide scale in the blood of patients with a wide range of diseases. We have also learned that: (1) multiple diseases can share components of the blood transcriptional profile, as is the case for instance for inflammation or interferon signatures; (2) while no single element of the profile may be specific to any given disease, it is the combination of those elements that makes a signature unique; (3) finally, the work accomplished to date highlights the importance of analyses that directly compare transcriptional profiles across diseases. Indeed, much can be learned about autoimmunity from studying responses to infection, for instance, and vice versa. Furthermore, such efforts may eventually lead us closer to a molecular classification of diseases.

Technology Primer (Fig. 1)

Microarray technologies are limited by several factors, such as hybridization noise (background signal, nonspecific binding) and a restricted dynamic range (lack of sensitivity for transcripts expressed at very low or very high levels). Additional limitations derive from the fact that they rely on existing sequence knowledge and lack the capacity to quantify alternative messages, such as splice variants of a given gene. When considering human studies with potential clinical applications, however, perhaps the main limitation is that direct comparability of data across batches and platforms is sometimes impossible. Real-time PCR technology is currently considered the gold standard for measuring transcript abundance. However, the number of transcripts that can be detected using this technique is limited. Products have been introduced recently that partially address this shortcoming. Alternative technology platforms have also recently become available, such as one developed by Nanostring, which can measure the abundance of up to 500 transcripts with high sensitivity (Geiss et al. 2008). The approach is “digital” since it consists of counting individual RNA molecules. A distinct advantage of this technology, which like microarrays is hybridization-based, is that sample preparation needs are reduced to a minimum; for instance, none of the steps involve enzymatic reactions. Also, given its high sensitivity, fast turnaround time, sample throughput and intermediate multiplexing ability, this approach seems particularly promising for bedside applications. Methods relying on high-throughput sequencing for the genome-wide measurement of RNA abundance are also becoming available (Wold and Myers 2008). RNA-seq (RNA sequencing) (Sultan et al. 2008) starts with a population of RNA (total or fractionated, such as poly(A)+) that is converted to a library of cDNA fragments.
High-throughput sequencing of such fragments yields short sequences, or reads, which are typically 30–400 bp in length, depending on the DNA-sequencing technology used. For a given sample, tens of millions of such sequences are then uniquely mapped against a reference genome. The higher the level of expression


Fig. 2 Cutting-edge RNA profiling technologies. Several technology platforms are available for measuring RNA abundance on large scales. Microarray and Nanostring technologies rely on oligonucleotide probes to capture complementary target sequences. Nanostring and RNA-seq technologies measure abundance at the single molecule level, with results expressed as molecule counts and sequence coverage, respectively. Microarray and RNA-seq technologies require extensive sample processing, which includes amplification steps

of a given gene the higher the number of reads that will be aligned against it (Fig. 2). Thus, this approach does not rely on probe design and provides information on not only transcript abundance but also transcriptome structure (splice variants), noncoding RNA species such as microRNAs (miRNA), and genetic polymorphisms. The use of RNA-seq has not reached the mainstream. Indeed, challenges ahead are multiple, including sample preparation, storage of massive amounts of data and sequence alignment (Wang et al. 2009). In time, RNA-seq is expected to become sufficiently cost-effective and practical to eventually supersede microarray technologies.
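The relationship between expression level and read count described above can be illustrated with a minimal sketch. It assumes that alignment has already been performed and that each read is annotated with the list of genes it aligned to; the function name `count_unique_reads`, the read identifiers and the gene labels are hypothetical and serve only as an illustration, not as a description of any particular RNA-seq pipeline.

```python
from collections import Counter

def count_unique_reads(alignments):
    """Count reads per gene, keeping only reads that align to exactly one gene.

    alignments: {read_id: [gene, ...]} listing every gene a read aligned to
    (hypothetical input format; real aligners report genomic coordinates).
    """
    counts = Counter()
    for genes in alignments.values():
        if len(genes) == 1:  # discard unmapped and multi-mapped reads
            counts[genes[0]] += 1
    return counts

# Toy example: GENE_X is more highly expressed than GENE_Y.
alignments = {
    "read_1": ["GENE_X"],
    "read_2": ["GENE_X"],
    "read_3": ["GENE_X"],
    "read_4": ["GENE_Y"],
    "read_5": ["GENE_X", "GENE_Y"],  # ambiguous, not counted
    "read_6": [],                    # unmapped, not counted
}
read_counts = count_unique_reads(alignments)
print(dict(read_counts))
```

In practice, raw counts are usually further adjusted for transcript length and sequencing depth before samples are compared; that step is omitted here.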

Microarray Data Analysis

For years the scale of blood transcriptional studies was constrained by the cost of the technology. With the price tag on a commercial whole genome microarray decidedly below the $100 USD mark, this is no longer the case. Instead, data analysis and exploitation, which have from the start been among the challenges of transcriptome research, have now clearly become the main rate-limiting step.


Analysis Primer

This section covers some of the basic steps and considerations involved in microarray data analysis.

Per-Chip Normalization: This step controls for array-wide variations in intensity across the multiple samples that form a given dataset. After background subtraction, a normalization algorithm is used to rescale the overall intensity of each array to a fixed level shared by all samples.

Data filtering: Typically more than half of the probes present on a microarray do not detect a signal for any of the samples in a given analysis. Thus, a detection filter is applied to remove such probes. This step avoids the introduction of unnecessary noise in downstream analyses.

Unsupervised analysis: The aim of this analysis is to group samples on the basis of their molecular profiles without a priori knowledge of their phenotypic classification. The first step consists of selecting transcripts that are expressed in the dataset (detection filter) and display some degree of variability (which will facilitate sample clustering). For instance, this filter could select transcripts with expression levels that deviate by at least twofold from the median intensity calculated across all samples. Importantly, this additional filter is applied independently of any knowledge of sample grouping or phenotype (which makes this type of analysis “unsupervised”). Next, pattern discovery algorithms are often applied to identify molecular phenotypes or trends in the data.

Clustering: Clustering is commonly used for the discovery of expression patterns in large datasets. Hierarchical clustering is an iterative agglomerative clustering method that can be used to produce gene trees and condition trees. Condition tree clustering groups samples based on the similarity of their expression profiles across a specified gene list. Other commonly employed clustering algorithms include k-means clustering and self-organizing maps.
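The normalization and filtering steps described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used by any specific software package: median scaling stands in for the unspecified normalization algorithm, a fixed intensity threshold stands in for the detection call, and the function names, the target value of 100 and the toy intensities are all hypothetical.

```python
from statistics import median

def normalize_per_chip(samples, target=100.0):
    """Rescale every array so that its median intensity equals `target`."""
    normalized = []
    for intensities in samples:
        scale = target / median(intensities)
        normalized.append([value * scale for value in intensities])
    return normalized

def detection_filter(samples, threshold):
    """Return indices of probes whose signal exceeds `threshold` in at least one sample."""
    n_probes = len(samples[0])
    return [p for p in range(n_probes)
            if any(sample[p] > threshold for sample in samples)]

def variability_filter(samples, probes, fold=2.0):
    """Keep probes deviating at least `fold` from their across-sample median in some sample."""
    kept = []
    for p in probes:
        values = [sample[p] for sample in samples]
        center = median(values)
        if any(v >= center * fold or v <= center / fold for v in values):
            kept.append(p)
    return kept

# Toy data: 3 arrays x 5 probes. Array 2 was scanned twice as bright as
# array 1 and array 3 half as bright, which normalization should undo.
raw = [
    [5.0, 100.0, 90.0, 100.0, 400.0],
    [12.0, 200.0, 190.0, 600.0, 800.0],
    [2.0, 50.0, 52.5, 15.0, 200.0],
]
normalized = normalize_per_chip(raw)
detected = detection_filter(normalized, threshold=10.0)
variable = variability_filter(normalized, detected, fold=2.0)
print(detected, variable)
```

Neither filter looks at sample labels, which is what keeps the downstream clustering unsupervised. In this toy dataset only the probe at index 3 survives both filters.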
Class Comparison: Class comparison analyses identify genes differentially expressed among groups and/or time points. The methods for analysis are chosen based on the study design. For studies with independent observations and two or more groups, t-tests, ANOVA, Mann-Whitney U tests, or Kruskal-Wallis tests are used. For more complex studies (e.g., longitudinal), appropriate linear mixed model analyses are chosen.

Multiple Testing Correction: Multiple testing correction (MTC) methods provide a means to mitigate the level of noise in sets of transcripts identified by class comparison (that is, to reduce the number of false positives). While it reduces noise, MTC also dampens the signal and therefore raises the false negative rate. The methods available are characterized by varying degrees of stringency, and therefore they produce gene lists with different levels of robustness.

• Bonferroni correction is the most stringent method used to control the familywise error rate (probability of making one or more type I errors) and can drastically reduce false positive rates. Conversely, it increases the probability of having false negatives.


• Benjamini and Hochberg false discovery rate (Benjamini and Hochberg 1995) is a less stringent MTC method that strikes a good balance between discovering statistically significant genes and limiting false positives. When this procedure is applied with a value of 0.01, about 1% of the transcripts called significant are expected to be false positives, that is, identified as significant by chance alone.

Class Prediction: Class prediction analyses assess the classification capability of gene expression data for a study subject or sample. K-nearest neighbors is a commonly used technique for this task. Using Euclidean or other measures of distance, this method identifies the user-defined number “k” of closest observations for an unclassified sample. Class prediction is then determined by the lowest p-value, which is calculated for each group. The p-values are based on the likelihood of obtaining the observed number of neighbors for a specific class given the overall class proportion in the data set. Other available class prediction procedures include, but are not limited to, Discriminant Analysis, General Linear Model Selection, Logistic Regression, Distance Scoring, Partial Least Squares, Partition Trees, and Radial Basis Machine.

Sample Size: The number of samples necessary for the identification of a robust signature is variable. Indeed, sample size requirements will depend on the amplitude of the difference between study groups and the variability within them. A number of approaches have been devised for the calculation of sample size for microarray experiments, but to date little consensus exists (Dobbin et al. 2008; Jorstad et al. 2008; Pawitan et al. 2005; Yang et al. 2003). Hence, best practices in the field consist of using independent sets of samples for the purpose of validating candidate signatures.
Thus, the robustness of the signature identified will rely on a statistically significant association between the predicted and true phenotypic class in the first and the second test sets.
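The difference in stringency between the two multiple testing correction methods described in this section can be made concrete with a short sketch. The two functions below follow the standard definitions of the Bonferroni threshold and of the Benjamini and Hochberg step-up procedure; the function names and the list of p-values are illustrative assumptions and do not come from any dataset discussed in this chapter.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests whose p-value is at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= k * q / m,
    then flag the k smallest p-values as significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant

# Hypothetical p-values from ten class comparison tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni(p_values, alpha=0.05)
bh = benjamini_hochberg(p_values, q=0.05)
print(sum(bonf), sum(bh))
```

With these ten p-values, Bonferroni retains a single test while the Benjamini and Hochberg procedure retains two, which illustrates the trade-off between false positives and false negatives noted above.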

Analysis of Significance Patterns

The diagnosis of SoJIA takes weeks to months, as it is based on clinical criteria which lack specificity (Cassidy and Ross 2001). Indeed, initial symptoms mimic infections or malignancies and it is only when arthritis appears that the disease can be recognized. We surmised that the blood transcriptome of these children could be a source of diagnostic biomarkers. The profiles that differentiate SoJIA patients from healthy controls, however, were highly similar to those of children with febrile infectious diseases of both bacterial and viral origin (Allantaz et al. 2007a). Additionally, because SoJIA can present at any age during childhood, matching the control groups for this disease and for infections that predominate at earlier or later times during childhood was a challenge. Thus, we devised a custom meta-analysis strategy for biomarker selection relying on the analysis of patterns of significance (Chaussabel et al. 2005). This approach can be used to compare diseases across multiple datasets, each being analyzed in relation to its own set of healthy controls. First, statistical comparisons were performed between each group of patients


(SoJIA, S. aureus, S. pneumoniae, E. coli, influenza A, and SLE) and their respective control groups composed of age- and gender-matched healthy donors. The p-values obtained from each comparison were then subjected to selection criteria. This permitted us to identify genes significantly changed in SoJIA patients vs. their control group, and not in any of the other diseases vs. their own control groups. The SoJIA-specific signature that we obtained using this algorithm was composed of 88 transcripts. Treatment with IL1 antagonists was able to extinguish this signature in the majority of patients. Using a more stringent analysis, 12/88 transcripts were used to correctly classify an independent group of patients during the systemic phase of the disease against healthy and febrile disease controls. These 12 genes were not dysregulated, however, in SoJIA patients who had resolved the systemic phase and were left with chronic arthritis, suggesting that they are specifically dysregulated in the initial phase of the disease, probably the time of greatest sensitivity to IL1 blockade. The specificity of 7/12 genes has recently been validated in an independent study of PBMC transcriptional profiles including different types of JIA patients (Barnes et al. 2009). The same type of analysis has allowed us to identify disease-specific blood transcriptional markers differentiating SLE patients from patients with diseases that also display a Type I IFN signature, such as Influenza infection (Chaussabel et al. 2005).

A Modular Analysis Framework

A myriad of approaches have been developed for the analysis of genome-wide transcriptional profiling data (Mootha et al. 2003; Segal et al. 2003a; Allison et al. 2006; Horvath and Dong 2008). The main challenges encountered while mining such data are severalfold: (a) dimensionality, or how to cope with the fact that the number of parameters measured exceeds by several orders of magnitude the number of conditions included in any given experiment; (b) noise, since a direct consequence of the first point is that results from microarray analyses are particularly permissive to noise (false discovery); (c) data visualization, which is critical as it helps promote insight and supports data interpretation. Dimension reduction techniques can help address some of those issues. Several groups have developed approaches that consist of grouping genes into distinct units or “modules”. Those genes may be grouped together based on similarities in transcriptional patterns or function. Multiple approaches have been used for the construction of such modules (Chaussabel et al. 2008; Horvath and Dong 2008; Ruan et al. 2010; Segal et al. 2003b; Suthram et al. 2010; Ulitsky and Shamir 2009). We have developed a modular data mining strategy for the specific purpose of analyzing and interpreting blood transcriptional profiles (Chaussabel et al. 2008). This approach consists of grouping, a priori, sets of genes with similar transcriptional patterns. This is repeated for different datasets; the cluster membership of all the genes is then compared across those datasets, and the genes with similar membership are grouped together to form a transcriptional module. Structuring the


data makes it possible to focus downstream statistical testing on these sets of transcripts that form coherent transcriptional modular units. This is in contrast with more traditional approaches applying statistical tests iteratively for thousands of individual transcripts that are treated as independent variables. The modular transcriptional framework that we have developed constitutes a dimension reduction technique and as such can: a) facilitate functional interpretation; b) enable comparative analyses across multiple datasets and diseases; c) minimize noise and improve robustness of biomarker signatures; and d) yield multivariate metrics that can be used at the bedside. Data visualization is also of critical importance for the interpretation of large-scale datasets. We have developed a straightforward approach for mapping global transcriptional changes for individual diseases on a modular basis (Fig. 3 and interactive web version: www.biir.net/modules). Briefly, differences in expression levels between study groups are displayed for each module on a grid. Each position on the grid is assigned to a given module; a red spot indicates an increase and a blue spot a decrease in transcript abundance. The spot intensity is determined by the proportion of transcripts reaching significance for a given module. A posteriori, biological interpretation has linked several modules to immune cells or pathways (see legend of Fig. 3).
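The two ideas described in this section, forming modules from cluster membership across datasets and summarizing each module as a colored spot, can be sketched in a deliberately simplified form. The published algorithm is more elaborate; here genes are grouped only when their cluster membership is identical in every dataset, and the spot color is taken from whichever direction has more significant transcripts. All names, cluster labels, p-values and fold changes below are hypothetical.

```python
def build_modules(memberships, min_size=2):
    """Group genes sharing the same cluster label in every dataset.

    memberships: {dataset: {gene: cluster_id}}, with the same genes in each dataset.
    Simplifying assumption: membership must be identical, not merely similar.
    """
    datasets = sorted(memberships)
    genes = sorted(memberships[datasets[0]])
    groups = {}
    for gene in genes:
        key = tuple(memberships[d][gene] for d in datasets)
        groups.setdefault(key, []).append(gene)
    kept = [g for g in groups.values() if len(g) >= min_size]
    return {f"M{i}": g for i, g in enumerate(kept, start=1)}

def module_spot(genes, p_values, log2_fc, alpha=0.05):
    """Return (color, intensity) for one module of the grid."""
    up = sum(1 for g in genes if p_values[g] < alpha and log2_fc[g] > 0)
    down = sum(1 for g in genes if p_values[g] < alpha and log2_fc[g] < 0)
    if up == 0 and down == 0:
        return ("blank", 0.0)
    if up >= down:
        return ("red", up / len(genes))
    return ("blue", down / len(genes))

# Hypothetical cluster labels obtained independently in two datasets.
memberships = {
    "dataset_1": {"G1": 0, "G2": 0, "G3": 0, "G4": 1, "G5": 1, "G6": 2},
    "dataset_2": {"G1": 1, "G2": 1, "G3": 1, "G4": 0, "G5": 0, "G6": 0},
}
modules = build_modules(memberships)

# Hypothetical patient-vs-control statistics for each gene.
p_values = {"G1": 0.001, "G2": 0.02, "G3": 0.30, "G4": 0.004, "G5": 0.01, "G6": 0.50}
log2_fc = {"G1": 1.8, "G2": 1.1, "G3": 0.2, "G4": -1.5, "G5": -0.9, "G6": 0.0}
spots = {name: module_spot(genes, p_values, log2_fc) for name, genes in modules.items()}
print(modules, spots)
```

Summarizing each module as a single proportion is what reduces thousands of per-transcript tests to a few dozen module-level values that can be laid out on a grid.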

Interpretation

Important technical and biological constraints must be taken into account when interpreting blood transcriptome data. For one, reproducibility issues remain a legitimate concern (Shi et al. 2006, 2008; Tan et al. 2003). It is important to avoid confounding the analysis with technical variables. For example, reuse of pre-existing data for direct group comparison should be avoided. Samples should be run if possible in one single batch. If this is not possible, case and control samples should be randomized across the different runs. Overall, the availability of cost-effective commercial platforms has reduced the number of formats used for analysis and has helped to enhance data quality and reproducibility when compared to early cDNA arrays. Meta-analysis of data obtained using different platforms and from different laboratories is possible, but one must proceed with caution. A common strategy consists of using a control group that is common to all datasets under study (e.g., non-stimulated or healthy controls) (Allantaz et al. 2007a; Chaussabel et al. 2005; Butte and Kohane 2006; Rhodes et al. 2004). Disease heterogeneity is also an important initial limiting factor. Thus, patient clinical characteristics and disease stages should be taken into account and carefully recorded at the time of sample collection. Furthermore, drug treatments and comorbidities may impact blood transcriptional signatures, and those variables cannot always be isolated, as patients cannot be taken off treatments. These factors also pose significant challenges in terms of study design and downstream data analysis. We have found that including samples from recently diagnosed and untreated patients is useful to select biomarkers related to disease pathogenesis. Selective inclusion criteria can be subsequently relaxed to broaden the scope and potential clinical impact of a study.
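The recommendation to randomize case and control samples across runs can be illustrated with a small sketch. It shuffles each study group separately and then deals the samples into batches in turn, so that every batch receives a similar mix of cases and controls. The function name, sample identifiers, batch count and random seed are hypothetical, and this is only one of several reasonable ways to randomize a study design.

```python
import random
from collections import Counter

def assign_batches(sample_groups, n_batches, seed=0):
    """Randomly assign samples to batches while balancing study groups.

    sample_groups: {sample_id: group_label}; returns {sample_id: batch_index}.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    by_group = {}
    for sample_id, group in sample_groups.items():
        by_group.setdefault(group, []).append(sample_id)
    assignment = {}
    offset = 0
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)  # random order within the group
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (offset + i) % n_batches
        offset += len(ids)  # continue dealing where the previous group stopped
    return assignment

# Hypothetical study: 6 cases and 6 controls processed in 3 runs.
sample_groups = {f"case_{i}": "case" for i in range(6)}
sample_groups.update({f"control_{i}": "control" for i in range(6)})
assignment = assign_batches(sample_groups, n_batches=3, seed=42)
per_batch = Counter((batch, sample_groups[s]) for s, batch in assignment.items())
print(sorted(per_batch.items()))
```

Because every run contains both cases and controls, a run-specific technical shift affects both groups alike and is less likely to be mistaken for a disease signature.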


Fig. 3  Modular analysis of blood leukocyte transcriptional profiles. (a) Gene expression levels from patients with acute S. pneumoniae infection and respective healthy volunteer PBMCs were compared (p