MEETING REPORT

Human Mutation OFFICIAL JOURNAL

Impact of Next Generation Sequencing: The 2009 Human Genome Variation Society Scientific Meeting

www.hgvs.org

William S. Oetting College of Pharmacy, and the Institute of Human Genetics, University of Minnesota, Minneapolis, Minnesota Received 3 December 2009; accepted revised manuscript 10 January 2010. Published online 2 February 2010 in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/humu.21210

ABSTRACT: The annual scientific meeting of the Human Genome Variation Society (HGVS) was held on the 20th of October, 2009, in Honolulu, Hawaii. The theme of this meeting was the ‘‘Impact of Next Generation Sequencing.’’ Presenters spoke on issues ranging from advances in the technology of large-scale genome sequencing to how this information can be analyzed to uncover genetic variants associated with disease. Many of the challenges resulting from the implementation of these new technologies were presented, but possible solutions, or at least paths to the solutions, were also given. With the combined efforts of investigators using next-generation sequencing to help understand the impact of genetic variants on disease, the use of the personal genome in medicine will soon become a reality. Hum Mutat 31:500–503, 2010. © 2010 Wiley-Liss, Inc. KEY WORDS: HGVS; meeting report; next generation sequencing; databases; mutations; variation

Introduction

The theme of the 2009 annual meeting of the Human Genome Variation Society (HGVS; http://www.hgvs.org) was the ‘‘Impact of Next Generation Sequencing.’’ In 1990, the U.S. Human Genome Project formally began sequencing the human genome. Investigators waited with great anticipation for the first complete human genome reference sequence, which was mostly finished in 2003, at a cost of just under $3 billion. Today, new sequencing technologies allow us to sequence a complete human genome in a matter of days, and investigators will soon be able to sequence hundreds of individuals for their own research projects at costs within the budget of a typical research grant. But with new technologies come new challenges. The publication of James Watson's and Craig Venter's genomic DNA sequences identified over 3 million single nucleotide polymorphisms (SNPs), creating an urgent need for new methods to interpret the variation within the personal genome. Applications are being created to analyze the vast amounts of information that these technologies produce. Of great importance are applications that can distinguish functional variants from ‘‘silent’’ variants. Tools already exist to analyze coding variants, but determining the functional consequence of intronic and extragenic variants will require more sophisticated analytical tools. At this meeting, advances in the technology and their use in answering research questions were presented.

Correspondence to: William S. Oetting, Institute of Human Genetics, University of Minnesota, MMC 485, 420 Delaware Street S.E., Minneapolis, MN 55455. E-mail: [email protected]

New Approaches in Genomic Research

The meeting opened with a talk by Matt Hurles, of the Wellcome Trust Sanger Institute, Cambridge, United Kingdom, who spoke on ‘‘Genomic Approaches to Elucidating the Genetics of Rare Disorders.’’ Identifying causal variants responsible for rare, highly penetrant Mendelian disorders used to take years of analysis. The steps required to identify mutations responsible for these disorders included the identification of multiple families with affected members, linkage analysis to identify loci, and the identification of candidate genes within the linked region, followed by sequencing of these genes for putative causative mutations. The creation of the human genome reference sequence, including the location of genes, gene structure, and the location of common variants, allowed for the creation of exon capture arrays and large-scale sequencing technologies that greatly reduced the amount of time and effort needed to identify disease-producing mutations. These technologies isolate and amplify almost every exon in the genome, which are then sequenced to identify putative mutations in affected individuals. A major obstacle is identifying the causal variant from among the many candidate variants detected with this method. An assumption is that causal variants will be rare, and therefore not represented in databases of common variation, including dbSNP. Using these databases, ‘‘silent’’ polymorphisms can be filtered out, leaving the very rare mutations as candidate causal mutations. It must be remembered that up to 150 de novo substitutions from both parents can be added to the genome in each generation, potentially masking the real causative mutation, although a de novo mutation can also contribute to disease. To avoid this problem, better classification schemes are needed to identify functional mutations.
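The filtering step described above, dropping any variant already catalogued as common variation and keeping the rare remainder, can be sketched as a simple set operation. The variant identifiers and catalogue contents below are invented for illustration; a real pipeline would draw on dbSNP or a comparable database.

```python
# Illustrative sketch of rare-variant filtering against a catalogue of
# common variation (e.g., dbSNP). All identifiers are hypothetical.

def filter_rare(candidates, common_catalogue):
    """Keep only variants absent from the common-variation catalogue."""
    return [v for v in candidates if v not in common_catalogue]

# Variants called in an affected individual (hypothetical identifiers)
candidates = ["chr1:12345A>G", "chr7:555C>T", "chr17:999G>A"]
# Variants already known to be common polymorphisms
common = {"chr1:12345A>G", "chr17:999G>A"}

rare = filter_rare(candidates, common)
print(rare)  # ['chr7:555C>T'] — the remaining candidate causal variant
```

As the text notes, this filter alone is not sufficient: de novo substitutions are also absent from common-variation databases, so the surviving list still mixes candidates with noise.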
Additionally, there is growing evidence that copy number variants (CNVs) play an important role in genetically based disease. In an effort to understand the importance of de novo mutations, including CNVs, DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources; https://decipher.sanger.ac.uk) was created. We are now moving beyond monogenic models for rare disease and analyzing more complex diseases with genetic components. Disease-causing mutations will be even harder to identify due to their lower penetrance, but this problem must be solved if we wish to integrate this technology into the healthcare system.

Peter Nagy, of the University of Iowa, Iowa City, provided another example of using these technologies to identify causative mutations of Mendelian disorders in his talk ‘‘Diagnosis of Rare Genetic Disorders with Combination of High Resolution Comparative Genomic Hybridization and Array Capture Assisted High Throughput Parallel Sequencing.’’ Approximately 50% of individuals with congenital muscular dystrophy (CMD) and limb girdle muscular dystrophy (LGMD) can be diagnosed using Sanger sequencing of candidate genes. However, in many cases only one of the necessary two mutations of these autosomal recessive diseases is detected. Many of these cryptic mutations are thought to be large CNVs, including insertion/deletions (in/dels), duplications, and inversions. Such mutations can be detected in approximately 10% of patients using comparative genomic hybridization (CGH). For the remaining individuals without a genetic diagnosis, the use of array capture-assisted high-throughput parallel sequencing (AHS), using NimbleGen array capture of amplified exons with long-read-length sequencing on the Roche 454 genome sequencer, can help identify the cause of their disease. The fact that most CMD and LGMD cases are thought to arise by a loss-of-function mechanism, as evidenced by the autosomal recessive inheritance pattern of these disorders, makes the analysis of the sequencing data easier: one can focus on predicted loss-of-function mutations present in the homozygous or compound heterozygous state as the potentially pathogenic changes. Based on their experience sequencing one-fifth of all coding exons in one patient, Nagy and colleagues estimate the number of such changes to be in the teens. Analysis of expression data available for the suspected genes, followed by immunofluorescence studies of the candidate protein products, is necessary to confirm these mutations as the cause of the disease. The combination of CGH with AHS in LGMD and CMD patients will help identify novel mutations and improve the diagnosis and management of these disorders.
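The recessive loss-of-function filter described above can be sketched roughly as follows. The gene names, variant records, and field layout are hypothetical; a real pipeline would work from annotated variant calls and would still need to confirm that candidate compound heterozygous variants lie in trans.

```python
# Sketch of a recessive loss-of-function (LoF) gene filter.
# Variant records and gene names are invented for illustration.
from collections import defaultdict

def recessive_lof_genes(variants):
    """Return genes consistent with autosomal recessive loss of function:
    one homozygous LoF variant, or >= 2 heterozygous LoF variants
    (candidate compound heterozygotes; phasing must still be confirmed)."""
    by_gene = defaultdict(list)
    for v in variants:
        if v["lof"]:
            by_gene[v["gene"]].append(v)
    hits = set()
    for gene, calls in by_gene.items():
        hom = any(c["genotype"] == "hom" for c in calls)
        het_count = sum(c["genotype"] == "het" for c in calls)
        if hom or het_count >= 2:
            hits.add(gene)
    return hits

variants = [
    {"gene": "GENE_A", "lof": True,  "genotype": "het"},
    {"gene": "GENE_A", "lof": True,  "genotype": "het"},
    {"gene": "GENE_B", "lof": True,  "genotype": "het"},
    {"gene": "GENE_C", "lof": False, "genotype": "hom"},
]
print(recessive_lof_genes(variants))  # {'GENE_A'}
```

A short list of genes surviving such a filter (the ‘‘in the teens’’ figure quoted above) is then small enough for follow-up by expression and immunofluorescence studies.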

Advances in Technology

The technological advances in next-generation sequencing (NGS) have been the driving force behind the ability to produce large amounts of sequencing data in short time frames and at low cost per base sequenced. Three lectures reported on the latest advances in NGS platforms.

Timothy Harkins, of Roche Applied Sciences, Indianapolis, Indiana, spoke on ‘‘From Whole Exon Sequencing to 1000 Base-Pair Sequencing Reads: Technology Advancements with the 454 Genome Sequencer FLX.’’ The ability to produce long sequence reads provides certain advantages for genetic analysis. The current maximum sequence read for NGS systems is approximately 700 bp, but this comes with a significant error rate of 3–5%. By optimizing the reaction steps, high-quality read lengths of 1,000 bp will soon be possible. This will allow larger in/dels and other structural changes to be identified. Meniere disease was used as an example of the power of NGS technology. Linkage analysis identified two regions in an extended family segregating for Meniere disease. These two regions were resequenced in two affected and one unaffected individual. After filtering out common polymorphisms with a high minor allele frequency, three candidate SNPs were identified, one being a 36-bp deletion in the solute carrier family 45, member 3 (SLC45A3) gene. This deletion was found in all affected individuals in the family, and SLC45A3 is proposed as a novel candidate gene for Meniere disease.

Jeremy Preston, from Illumina Inc., Hayward, California, spoke on ‘‘The Illumina Genome Analyzer IIx: Genome Sequencing Simplified.’’ DNA sequencing has made major advances in a short amount of time. Compared to the high cost and relatively slow throughput during the initial sequencing of the human genome, researchers can now sequence at 30X coverage for $20,000 (consumables cost) in a matter of days using NGS platforms. As an example, DNA samples from individuals with acute myelogenous leukemia (AML) were analyzed. Using NGS technology, all gene coding regions were sequenced to identify variants causative for AML. Eight new mutations were discovered in genes that previously would not have been considered candidates. The ability to sequence all the exons in the genome releases the investigator from guessing which gene is the best candidate to sequence. Improvements in this technology include a high-fidelity DNA polymerase enzyme, longer read lengths, simplification of the sequencing workflow, and improvements in base-calling algorithms, including fully automated cluster generation, all of which have increased the number of bases called per run. These improvements will help investigators identify new genes and mutations associated with disease.

Avak Kahvejian, of Helicos BioSciences, Cambridge, Massachusetts, spoke on ‘‘Helicos Single Molecule Sequencing: Delivering Accurate Quantitative Information for Genome Biology.’’ The Helicos platform is a high-throughput sequencing technology based on single-molecule sequencing that does not require an initial amplification step, giving investigators the ability to quantitatively sequence billions of molecules of DNA in parallel. Up to 50 different samples can be run at one time due to the flow cell design (and up to 250 samples using a DNA barcoding system to identify each of five individual samples pooled into one of the 50 channels). Because samples are not amplified, direct sequencing of mRNA is possible, allowing quantitation of the number of copies of each transcript, with the potential to produce a complete transcriptome (with RNA fragmentation). This can be done with very small amounts of tissue due to the small amounts of RNA required for analysis. This is also true for DNA, as shown by the ability to sequence nanogram to picogram amounts of DNA fragments, obtained from ancient DNA samples or isolated by chromatin immunoprecipitation (ChIP) methods.

Supporting NGS Technology in Research

Although NGS technology is very powerful, using these methods requires a substantial investment in high-end equipment and high-level staff with expertise in data handling and analysis. George Grills, Director of Operations of the Cornell University Life Sciences Core Laboratory Center in Ithaca, New York, spoke on ‘‘Implementation of Next Generation Sequencing Technologies as Shared Research Resources.’’ Using core facilities is the only way that many investigators can access cutting-edge methods and emerging technologies such as NGS. To meet this need, Cornell University has created a core laboratory center with eight core facilities (http://cores.lifesciences.cornell.edu), including NGS resources and services supported by the DNA sequencing and genotyping, microarray, epigenomics, informatics, and bio-IT cores in the center. Major challenges for core laboratories that offer NGS include handling large amounts of data from multiple projects, implementing data handling and analysis pipelines, keeping technologies and applications at the ‘‘cutting edge,’’ and evaluating and funding the purchase of new instrumentation. Helping investigators at the project design stage is often central to successful use of NGS technologies. Multidisciplinary core facility expertise, state-of-the-art laboratory information management systems (LIMS), and high-level bioinformatics support are essential in enabling researchers to use NGS technologies efficiently and effectively.

HUMAN MUTATION, Vol. 31, No. 4, 500–503, 2010


Once the vast amount of data is created, the sequence and the variation within it need to be analyzed. The goal is to identify the causal mutations, but which of the many variants identified are important? Christophe Béroud, of the Molecular Genetics Laboratory, Montpellier, France, spoke on ‘‘Distinguishing Neutral Variations from Pathogenic Mutations Using Bioinformatics Tools.’’ A number of applications now exist to help distinguish probable functional mutations from neutral mutations. The variants with the greatest impact on protein function are either amino acid (nonsynonymous) substitutions or mutations that affect mRNA splicing. Each of these applications has its own algorithm to determine which variants have potential functional consequences. In some cases, the results for a given variant are inconsistent between applications, making it difficult to determine which program is the most appropriate. To help answer this question, two new tools were developed to help in the classification of sequence variants. The UMD-predictor tool uses a combinatorial approach, analyzing the results from several prediction applications and associating various data (localization within the protein, conservation, biochemical properties, and the potential impact on mRNA) to predict the effect of nonsynonymous substitutions (www.umd.be) [Frederic et al., 2009]. Using known pathogenic mutations as positive controls, this tool was shown to be very accurate, with a 99.4% positive predictive value, 95.4% sensitivity, and 92.2% specificity for exonic mutations. The Human Splicing Finder (HSF; www.umd.be/HSF) was created to help determine the effect of variation in splice site sequences [Desmet et al., 2009]. Although highly penetrant mutations are important in rare diseases, in most common disorders only mutations of low penetrance and modest effect on gene function or expression have been identified.
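The accuracy figures quoted above follow the standard confusion-matrix definitions. The sketch below shows how such values are computed; the counts are invented purely to roughly reproduce the quoted percentages and are not from the UMD-predictor validation set.

```python
# Standard confusion-matrix accuracy metrics for a variant classifier.
# tp = pathogenic called pathogenic, fp = neutral called pathogenic,
# tn = neutral called neutral, fn = pathogenic called neutral.

def ppv(tp, fp):
    """Positive predictive value: fraction of 'pathogenic' calls that are right."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Fraction of true pathogenic variants that are detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of neutral variants correctly left alone."""
    return tn / (tn + fp)

# Invented counts chosen to roughly match the percentages quoted in the text
tp, fp, tn, fn = 954, 6, 71, 46
print(f"PPV={ppv(tp, fp):.3f}  "
      f"sensitivity={sensitivity(tp, fn):.3f}  "
      f"specificity={specificity(tn, fp):.3f}")
# PPV=0.994  sensitivity=0.954  specificity=0.922
```

Note that PPV depends on the mix of pathogenic and neutral variants in the test set, which is why all three numbers are needed to judge a predictor.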
Bruce Gottlieb, of the Lady Davis Institute for Medical Research, Montreal, Canada, spoke on ‘‘The Impact of Next Generation Sequencing on the Genetics of Multifactorial Disease and Genome Wide Association Studies (GWAS).’’ It has been hoped that the combination of GWAS with NGS technologies would result in the identification of highly significant gene variants associated with multifactorial common diseases. However, initial results have not identified any such variants, and there are a number of issues that could explain why this is the case. First, as there is no universally agreed-upon genomic reference sequence, the normal sequence will first need to be defined before a significant disease-associated mutation can be identified. Second, although constitutive germline mutations are important, somatic mutations are also likely to impact disease phenotypes and may even drive the disease process. If somatic mutations are indeed significant, then selection pressure on specific mutations within disease-susceptible tissues may play a central role in the disease process. Recent evidence has supported this hypothesis, suggesting that there are multiple ‘‘minority’’ disease-producing variants within such tissues, and that when these variants replace the ‘‘major’’ normal variants, due to selection by changing microenvironmental conditions, disease may result. To resolve these issues, ultradeep NGS (i.e., 10,000× reads) of diseased tissues will likely be required to fully understand the impact of specific gene variants on common disease.

The Use of NGS in Understanding Disease

As variant identification improves using NGS, the technology is slowly moving into clinical laboratories. Harry Cuppens, of the Center for Human Genetics, Leuven, Belgium, illuminated some of the emerging issues of NGS with patient samples in his talk ‘‘Next Generation Genetic Tests: from Artisan Genetic Testing to Uniform, Streamlined, Fully Quality-Assured and Automated Processing of Genetic Tests Using Next Generation Sequencing.’’ Moving NGS technology into clinical laboratories requires increased efficiency, reduced costs, and quality-control steps that keep the integrity of sample identification high. These needs can be met through robust equimolar multiplex amplifications, economical pooling of different patient samples, and automated protocols. An additional quality-control step being suggested is the spiking of samples with 25-bp ‘‘molecular barcodes’’ to follow samples through the multiple steps from DNA extraction to sequence analysis. By adding a specific tag at the initial step of blood collection, identification errors between a patient's sample and the sequencing results can be greatly reduced.

Graham Taylor, of the Leeds Institute of Molecular Medicine, St. James's University Hospital, Leeds, United Kingdom, continued on issues associated with NGS technologies in the clinical setting in his talk entitled ‘‘Sensitive High-Throughput Gene-Centric Analysis in Familial Breast Cancer and TP53 Using the Illumina GAII Clonal Sequencer.’’ NGS technologies can greatly aid the identification of mutations associated with disease. Using this technology in a clinical setting requires optimization of all steps to reduce errors, time, and costs. Pooling DNA samples helps reduce the number of sequencing runs. This can be done by ‘‘bar-coding’’ individual samples, ligating small sequence tags to the fragments of each sample and then combining the tagged samples for pooled sequencing reactions. This technique was used in the analysis of five cell lines and 10 Li-Fraumeni patient DNA samples, using targeted exon resequencing, to detect mutations in the tumor suppressor gene TP53. All mutations within the gene were identified.
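The tagging schemes described by Cuppens and Taylor both reduce to the same computational step: after a pooled run, each read is assigned back to its sample by its barcode. A minimal sketch follows; the barcode sequences, sample names, and reads are invented, and production demultiplexers additionally tolerate sequencing errors within the tag.

```python
# Sketch of demultiplexing pooled reads by a leading molecular barcode.
# Barcodes, sample names, and reads are invented for illustration.
BARCODES = {
    "ACGTACGT": "patient_001",
    "TTGGCCAA": "patient_002",
}

def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode; unmatched
    reads are collected rather than silently discarded."""
    assigned = {sample: [] for sample in barcodes.values()}
    unmatched = []
    for read in reads:
        for tag, sample in barcodes.items():
            if read.startswith(tag):
                assigned[sample].append(read[len(tag):])  # strip the tag
                break
        else:
            unmatched.append(read)
    return assigned, unmatched

reads = ["ACGTACGTGGGTTT", "TTGGCCAAACTG", "CCCCCCCCAAA"]
assigned, unmatched = demultiplex(reads, BARCODES)
print(assigned["patient_001"])  # ['GGGTTT']
print(len(unmatched))           # 1
```

Because the tag travels with the molecule from collection onward, a mismatch between the tag found in the reads and the tag recorded at blood draw flags a sample-tracking error before results are reported.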
For the BRCA1 and BRCA2 genes, the same tagged-pooling method was used to detect 671 different variants. Additionally, software was designed to help in analyzing large stretches of DNA sequence: the Illuminator data extractor (dna.leeds.ac.uk/illuminator) allows sequences to be compared for the identification of mutations in a clinical setting.

The use of pooled samples for greater sequencing efficiency was further explored by Sandro Rossetti, of the Mayo Clinic College of Medicine, Rochester, Minnesota, who spoke on ‘‘Comparison of Two Next-Generation Sequencing Platforms (Illumina GA and Roche 454) for the Deep Sequencing of the PKD1 and PKD2 Genes in Autosomal Dominant Polycystic Kidney Disease (ADPKD).’’ PKD1 and PKD2 are large genes with 46 and 15 exons, respectively. Using long-range polymerase chain reaction (PCR), the number of fragments used to sequence these genes was reduced to nine for PKD1 and six for PKD2, and the fragments were pooled in an equimolar ratio, for a total of 68 kb of sequence to be analyzed. The pools were sequenced on both the Illumina GA1 and the Roche 454 FLX systems. In a comparison test using three known samples, sequence coverage and success in finding variants were superior on the Illumina platform compared with the output of the Roche 454 FLX. The Illumina GA1 detected all the known variants, while the Roche 454 FLX missed some due to poor or absent coverage, particularly in high GC-content regions. The Illumina platform was further tested for multiplexing power using 20 additional control DNAs. This larger test indicated an overall sensitivity of 90–97%, depending on the level of multiplexing (range 2–16 samples per lane).

To successfully identify rare disease-producing variants, especially in the case of multigenic diseases, it is important to first determine the level of diversity in the study population. Vanessa Hayes, of the Children's Cancer Institute of Australia, Randwick, Australia, spoke on ‘‘Determining Population Diversity Without a Reference Genome: Next Generation Sequencing Enables Genome-Wide Diversity Studies.’’ Identifying normal diversity within the human population is important for identifying potential disease-producing variants. To achieve this, genomes from many populations will need to be sequenced. An important constraint is that NGS technologies produce short-read sequences that need to be compared to a reference sequence, and some differences between populations may not be detected if all populations are compared to a single reference sequence. To overcome this, a computational pipeline, coined DIAL (de novo identification of alleles), was generated in which minimal genome-wide sequence coverage (de novo sequence) from long-read NGS technology is used to determine genetic differences without the use of a reference genome. Minimal genome coverage (0.3× and 0.5×) of two geographically distinct Tasmanian devils (Sarcophilus harrisii) was used to identify candidate genetic differences. Utilizing high-throughput genotyping, DIAL-identified variants can then be used to determine population diversity, an approach coined DiversiTyping (DT). DT allowed an island-wide genomic map of genetic structure for the Tasmanian devil population to be established. This information can be directly used to facilitate captive breeding in an effort to save the species from extinction.

Additional technologies such as microarray-based CGH can help identify variants that are not detected by Sanger DNA sequencing. Etienne Rouleau spoke on ‘‘CDH1 Large Rearrangements in Breast Cancer Predisposition: Case Report, Prescreening Method and Zoom-In CGH-Array Screening.’’ Cadherin-1 (CDH1) is a cell-to-cell adhesion transmembrane glycoprotein whose gene is also a tumor suppressor associated with an autosomal dominant form of gastric cancer and breast cancer.
In a patient with both breast cancer and gastric cancer, and no detected mutations in the BRCA1/2 genes, analysis of the CDH1 gene was performed using a dedicated ‘‘zoom-in’’ CGH array focused specifically on the CDH1 gene region. In this individual, an exon 3 deletion of 6,974 bp was identified. To further investigate the role of CDH1 mutations in breast cancer, a cohort of 377 breast cancer patients with no detectable BRCA1 or BRCA2 mutations was tested for rearrangements of the CDH1 gene using the CGH array. Two mutations in intron 2 were detected: a deletion and a duplication. The role of CDH1 mutations in breast cancer without gastric cancer still needs to be investigated.

Bringing Next Generation Sequencing to the Bedside

The personal genome is already here. Steven Brenner, of the University of California, Berkeley, spoke to this challenge in his presentation ‘‘Prospects for Personal Genome Interpretation with Next-Generation Sequencing: Opportunities and Challenges.’’ He reminded us that Walter Gilbert, the father of DNA sequencing, predicted in 1990 that by 2030 every newborn would get their personal genome sequence on a CD. Current NGS technologies allow us to sequence an entire genome and identify millions of variants within days, as has been done for a number of individual genomes. The challenge now is to identify specific variants that will help us understand and treat individual medical conditions. Unfortunately, characterization of the large number of variants identified has not kept pace. The popular media have been pushing for the use of this technology, but we do not have the tools or infrastructure to use this information in the clinic in a meaningful way. For this to happen, five steps will need to be addressed:

1. Collect our knowledge in one place. At present, there are too many silos of data; these need to be summarized into a single useable database.
2. Scan multiple genomes. Many genomes need to be analyzed in a consistent way to catalogue variation in multiple human populations. This includes better methods to classify variations.
3. Assess and improve methods, including a critical assessment of genome interpretation.
4. Continue to discover new mutations and catalogue them in a central variation database.
5. Aim to cure. All of this information needs to be put together so that patients can be better treated.

Brenner explained that the Genome Commons Project (http://genomecommons.org) was created to assist with personal genome interpretation and as a clearinghouse for information on the effect a variant has on an individual's phenotype, covering not only disease-causing mutations but also predisposing variants.

Conclusion

We are at the beginning of an explosive time in human genomics. The ability to create an individual's personal genome and uncover the variation within it now exists. The ability to interpret that information, including the millions of variants that exist in each genome, is some (many?) years away. The prediction is that this information will allow us to detect and treat disease far better than is now possible. The rewards are very high, but much work needs to be done for this to happen. This is a multifaceted problem, but the speakers who presented at this meeting are taking the challenge head on.

Acknowledgments

This year's annual meeting was co-chaired by Graham Taylor, of St. James's University Hospital, Leeds, and William Oetting, of the University of Minnesota. The author would like to thank the speakers for their help in the preparation of this report.

References

Desmet FO, Hamroun D, Lalande M, Collod-Beroud G, Claustres M, Beroud C. 2009. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res 37:e67.

Frederic MY, Lalande M, Boileau C, Hamroun D, Claustres M, Beroud C, Collod-Beroud G. 2009. UMD-predictor, a new prediction tool for nucleotide substitution pathogenicity—application to four genes: FBN1, FBN2, TGFBR1, and TGFBR2. Hum Mutat 30:952–959.
