Practical Bioinformatics - Garland Science

Practical Bioinformatics

Dedication

This book is dedicated to my mother, Ruth Agostino, who tolerated my smelly biology and chemistry experiments in the basement of our house, and the endless number of muddy clothes and shoes from my frequent explorations of the woods near my home. I owe my love of exploration and discovery to you.

Practical Bioinformatics Michael Agostino

Vice President: Denise Schanck
Senior Editor: Gina Almond
Assistant Editor: David Borrowdale
Development Editor: Mary Purton
Production Editor: Ioana Moldovan
Typesetter and Senior Production Editor: Georgina Lucas
Copy Editor: Jo Clayton
Proofreader: Sally Huish
Illustrations: Oxford Designers & Illustrators
Cover Design: Andrew Magee
Indexer: Medical Indexing Ltd

© 2013 by Garland Science, Taylor & Francis Group, LLC

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the copyright holder.

ISBN 978-0-8153-4456-8

Front cover image: Chapter 8 of this book focuses on protein analysis. One example in this chapter is the superimposition of the black swan and Atlantic cod fish lysozyme structures (see Section 8.6). This allows the viewer to see the impact, or lack thereof, of the amino acid differences on the structures of these distantly related proteins. The book cover shows the amino acid sequence of this swan lysozyme (UniProt accession number P00717), repeated many times to fill the page, and is combined with the structure of the lysozyme protein (PDB identifier 1gbs).

About the author: Michael Agostino received his PhD in Molecular Biology from Roswell Park Memorial Institute, a division of SUNY at Buffalo, New York. His thesis characterized the unusual structure and evolution of sea urchin histone genes. Postdoctoral work included the development of a molecular assay for DNA strand scission agents used in chemotherapy. In 1984, he moved to the University of North Carolina at Chapel Hill where he co-developed a vector trap for gene enhancers. Other work included the creation of a synthetic gene and an E. coli blue-white reporter gene assay for HIV protease activity. In 1991 he formally switched careers to bioinformatics by joining GlaxoSmithKline. There, he provided sequence analysis, consulting, user-support, and training for the Glaxo scientists. In 1996 he moved to Genetics Institute, where he was appointed manager of a bioinformatics department. This group was responsible for the sequence analysis and database of a high-throughput effort to identify, express, and patent the human genes that encode secreted proteins. Presently, he provides bioinformatics analysis and end-user support for multiple sites of the Pfizer Research organization. He is also an adjunct professor in the Biology Department at Merrimack College, North Andover, Massachusetts (USA).

Library of Congress Cataloging-in-Publication Data
Agostino, Michael J.
Practical bioinformatics / Michael Agostino.
p. cm.
ISBN 978-0-8153-4456-8 (alk. paper)
1. Nucleotide sequence--Data processing. 2. Bioinformatics. I. Title.
QP625.N89A39 2013
572.8’6330285--dc23
2012017992

Published by Garland Science, Taylor & Francis Group, LLC, an informa business, 711 Third Avenue, 8th floor, New York, NY 10017, USA, and 3 Park Square, Milton Park, Abingdon, OX14 4RN, UK. Printed in the United States of America 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

Visit our Website at http://www.garlandscience.com

Preface

Although bioinformatics is a relatively new scientific discipline, it has become quite broad in definition. It is often described as including diverse topics such as the analysis of microarrays and the accompanying statistics, protein structure prediction, and pathway and protein interaction analysis. Of course, computer programming, database development, and even hardware design are included in the field.

Practical Bioinformatics is focused on the fundamental skills of bioinformatics: the analysis of DNA, RNA, and protein sequences. The chapters take the reader through a commonly asked question, “What can I learn about this sequence?” The only requirement is access to the Internet and a web browser; no other software is required.

This book is designed as an introduction to bioinformatics sequence analysis for biology and biochemistry majors. There are many published books that teach about detailed algorithms, sophisticated programs, and advanced interpretation of data. Although these are excellent sources of information, many biologists and biochemists are not prepared for, nor do they need, the depth and detail of these texts. Instead, they need the practical knowledge and skills to analyze sequences. They are asking questions such as “Which tools do I use?” “What settings should I use?” “What database should I search?” “What do these results mean?” “How do I export this information?”

Practical Bioinformatics addresses these questions, and many more, in 12 easily read chapters. Concepts will be introduced within each chapter and then demonstrated through the analysis of problems using selected gene/protein examples. Adequate background, details, illustrations, and references will be provided to ensure that readers understand the fundamentals and can do further reading if desired. Along the way, interesting genes, phenotypes, mutations, and biology will be introduced but not discussed extensively or analyzed. These topics are purposefully left open so they may easily be turned into literature searches, analysis problems, or senior projects for the ambitious student. Just thinking about these problems and how to analyze them will instill the habit of identifying topics needing exploration.

The best way to learn this material is by “doing.” Readers of this book will learn the concepts by performing many analysis problems. To get the most out of this book, readers should perform most, if not all, of the analysis steps and recreate the figures for themselves. By the time readers finish the book, they will have significant experience in sequence analysis problems, approaches, and solutions. They should then be ready to perform many analysis steps on their own, and tackle more advanced books on the subject.

A common error when approaching a sequence analysis problem is to use powerful analysis software with little understanding of how it works or how to interpret the output. Web forms and software can completely hide the details. This text will emphasize the proper use of established analysis software and the need to evaluate new tools. Hundreds of bioinformatics tools are available, and no book could possibly cover or instruct on all of them. However, the repeated experience of performing guided analysis problems will teach the reader to be critical of bioinformatics software and to use proper positive and negative controls when testing unfamiliar tools. When this book is finished, readers will have both the practical knowledge and experience to address their own problems, and take advantage of the mountains of genetic data being generated today.

I would like to thank the staff and associates of Garland Science for their tremendous support during the process of writing this book. Thanks to Gina Almond who believed in the project from the very beginning and was never short on enthusiasm, David Borrowdale for guiding the book through the many steps, and Mary Purton for her infinite patience during the editing process. My thanks go to Ioana Moldovan, Georgina Lucas, Jo Clayton, and Sally Huish for their tremendous attention to detail and style during the final editing. Special thanks go to Oxford Designers & Illustrators for numerous illustrations. Thanks to Josephine Modica-Napolitano who gave me my first job in teaching, and the students at Merrimack College; together they put me on the path of writing this book. My special thanks to Donald J. Mulcare, my undergraduate advisor, for advice, encouragement, and my first real taste of what it is like to be a scientist. My years in industry would not have been the same without knowing the members of the “Dream Team”: Yuchen Bai, Sreekumar Kodangattil, Ellen Murphy, Padma Reddy, and Wenyan Zhong. They are the best sequence analysts I know. Additional thanks go to Maryann Whitley and Steve Howes for providing a calm and steady leadership at Pfizer. Many thanks to my daughter Becky, who inspires me to be better every day. Finally, this book would not have been possible without the years of support, encouragement, and love from my wife, Nan. Dreams can come true.

Instructor Resources Website

Accessible from www.garlandscience.com, the Instructor Resource Site requires registration and access is available only to qualified instructors. To access the Instructor Resource Site, please contact your local sales representative or email [email protected].

The images in Practical Bioinformatics are available on the Instructor Resource Site in two convenient formats: PowerPoint® and JPEG, which have been optimized for display. The resources may be browsed by individual chapter or through a search engine. Figures are searchable by figure number, figure name, or by keywords used in the figure legend from the book. Answers to end-of-chapter questions/exercises are available on the Instructor Resource Site.

Resources available for other Garland Science titles can be accessed via the Garland Science Website.

PowerPoint is a registered trademark of Microsoft Corporation in the United States and/or other countries.


Acknowledgments

The author and publisher of Practical Bioinformatics gratefully acknowledge the contributions of the following reviewers in the development of this book:

Enrique Blanco (University of Barcelona, Spain)
Ron Croy (Durham University, UK)
John Ferguson (Bard College, USA)
Laurie Heyer (Davidson College, USA)
Torgeir Hvidsten (Umeå University, Sweden)
Ian Kerr (University of Nottingham, UK)
Daisuke Kihara (Purdue University, USA)
Peter Kos (Biological Research Centre of the Hungarian Academy of Sciences, Hungary)
Jean-Christophe Nebel (Kingston University London, UK)
Samuel Rebelsky (Grinnell College, USA)
Rebecca Roberts (Ursinus College, USA)
Hugh Shanahan (Royal Holloway, University of London, UK)
Shin-Han Shiu (Michigan State University, USA)
Shaneen Singh (Brooklyn College, USA)
Alan Ward (Newcastle University, UK)

Contents

Chapter 1 Introduction to Bioinformatics and Sequence Analysis
1.1 Introduction
1.2 The Growth of GenBank
1.3 Data, Data, Everywhere
    Further examples of human genome sequencing
    Personal genome sequencing
    Paleogenetics
    Focused medical genomic studies
1.4 The Size of a Genome
1.5 Annotation
1.6 Witnessing Evolution Through Bioinformatics
    Recent evolutionary changes to plants and animals
1.7 Large Sources of Human Sequence Variation
1.8 Recent Evolutionary Changes to Human Populations
1.9 DNA Sequence in Databases
    Genomic DNA assembly
    cDNA in databases—where does it come from?
1.10 Sequence Analysis and Data Display
1.11 Summary
Further Reading
    Internet resources

Chapter 2 Introduction to Internet Resources
2.1 Introduction
2.2 The NCBI Website and ENTREZ
2.3 PubMed
2.4 Gene Name Evolution
2.5 OMIM
2.6 Retrieving Nucleotide Sequences
2.7 Searching Patents
2.8 Public Grants Database: NIH RePORTER
2.9 Gene Ontology
2.10 The Gene Database
2.11 UniGene
2.12 The UniGene Library Browser
2.13 Summary
Exercises
    Williams syndrome and oxytocin: research with Internet tools
Further Reading

Chapter 3 Introduction to the BLAST Suite and BLASTN
3.1 Introduction
    Why search a database?
3.2 What is BLAST?
    How does BLAST work?
3.3 Your First BLAST Search
    Find the query sequence in GenBank
    Convert the file to another format
    Performing BLASTN searches
3.4 BLAST Results
    Graphic
    Interpretation of the graphic
    Results table
    Interpretation of the table
    The alignments
    Other BLASTN hits from this query
    Simultaneous review of the graphic, table, and alignments
3.5 BLASTN Across Species
    BLASTN of the reference sequence for human beta hemoglobin against nonhuman transcripts
    Paralogs, orthologs, and homologs
3.6 BLAST Output Format
3.7 Summary
Exercises
    Exercise 1: Biofilm analysis
    Exercise 2: RuBisCO
Further Reading
    Internet resources

Chapter 4 Protein BLAST: BLASTP
4.1 Introduction
4.2 Codons and the Genetic Code
    Memorizing the genetic code
4.3 Amino Acids
    Amino acid properties
4.4 BLASTP and the Scoring Matrix
    Building a matrix
4.5 An Example BLASTP Search
    Retrieving protein records
    Running BLASTP
    The results
    The alignments
    Distant homologies
4.6 Pairwise BLAST
4.7 Running BLASTP at the ExPASy Website
    Searching for pro-opiomelanocortin using a protein sequence fragment
    Searching for repeated domains in alpha-1 collagen
4.8 Summary
Exercises
    Exercise 1: Typing contest
    Exercise 2: How mammoths adapted to cold
    Exercise 3: Longevity genes?
Further Reading

Chapter 5 Cross-Molecular Searches: BLASTX and TBLASTN
5.1 Introduction
5.2 Messenger RNA Structure
5.3 cDNA Synthesis
    cDNA in databases
    ESTs
    Normalized cDNA libraries
    An EST record
5.4 BLASTX
    Reading frames in nucleic acids
    A simple BLASTX search
    A more complex BLASTX
    Using the annotation of sequence records
    BLASTX alignments with the reverse strand
5.5 TBLASTN
    A TBLASTN search
    Metagenomics and TBLASTN
5.6 Summary
Exercises
    Exercise 1: Analyzing an unknown sequence
    Exercise 2: Snake venom proteins
    Exercise 3: Metagenomics
Further Reading

Chapter 6 Advanced Topics in BLAST
6.1 Introduction
6.2 Reciprocal BLAST: Confirming Identities
    Demonstration of a reciprocal BLASTP
6.3 Adjusting BLAST Parameters
    Gap cost
    Compositional adjustments
6.4 Exon Detection
    Exon detection with BLASTN
    Look at the coordinates
    Exon detection with TBLASTN
    Orthologous exon searching with TBLASTN
6.5 Repetitive DNA
    Simple sequences
    Satellite DNA
    Mini-satellites
    LINEs and SINEs
    Tandemly arrayed genes
6.6 Interpreting Distant Relationships
    Name of the protein
    Percentage identity
    Alignment length and length similarity between query and hit
    E value
    Gaps
    Conserved amino acids
6.7 Summary
Exercises
    Exercise 1: Simple sequences
    Exercise 2: Reciprocal BLAST
    Exercise 3: Exon identification with TBLASTN
    Exercise 4: Identification of orthologous exons with TBLASTN
Further Reading

Chapter 7 Bioinformatics Tools for the Laboratory
7.1 Introduction
7.2 Restriction Mapping and Genetic Engineering
    Restriction enzymes
    Restriction enzyme mapping: the polylinker site
    NEBcutter
    Generating reverse strand sequences: Reverse Complement
    DNA translation: the ExPASy Translate tool
7.3 Finding Open Reading Frames
    The NCBI ORF Finder
7.4 PCR and Primer Design Tools
    Primer3
    Primer-BLAST
7.5 Measuring DNA and Protein Composition
    DNA Stats
    Composition/Molecular Weight Calculation Form
7.6 Asking Very Specific Questions: The Sequence Retrieval System (SRS)
7.7 DotPlot
    DotPlot of alternative transcripts
    DotPlots of orthologous genes
7.8 Summary
Exercises
    Spider silk: a workflow of analysis
Further Reading

Chapter 8 Protein Analysis
8.1 Introduction
8.2 Finding Functional Patterns
    A repeating pattern within a zinc finger
8.3 Annotating an Unknown Sequence
    A zinc protease pattern
    The ADAM_MEPRO profile
8.4 Looking at Three-dimensional Protein Structures
    Jmol: a protein structure viewer
    Exploring and understanding a structure
    Jmol scripting
8.5 ProPhylER
    The Interface view
    The CrystalPainter view
8.6 The Impact of Sequence on Structure
8.7 Building Blocks: A Multiple Domain Protein
8.8 Post-translational Modification
    Secretion signals
    Prediction of protein glycosylation sites
8.9 Transmembrane Domain Detection
8.10 Summary
Exercises
    Aquaporin-5
Further Reading
    Internet resources

Chapter 9 Explorations of Short Nucleotide Sequences
9.1 Introduction
9.2 Transcription Factor Binding Sites
    Transfac
    Identifying other binding sites for the estrogen receptor
    Predicting transcription factor binding sites
    An experiment with MATCH
    An experiment with PATCH
9.3 Translation Initiation: The Kozak Sequence
9.4 Viewing Whole Genes
9.5 Exon Splicing
    Renin: a striking example of a small exon
    Another striking splice: human ISG15 ubiquitin-like modifier
    Alternative splicing
    Human plectin: alternative splicing at the 5′ end
    Consensus splice junctions, translated
9.6 Polyadenylation Signals
9.7 Summary
Exercises
    Inhibitor of Kappa light polypeptide gene enhancer in B-cells (IKBKAP)
Further Reading

Chapter 10 MicroRNAs and Pathway Analysis
10.1 Introduction
10.2 miRNA Function
10.3 miRNA Nomenclature
10.4 miRNA Families and Conservation
10.5 Structure and Processing of miRNAs
10.6 miRBase: The Repository for miRNAs
10.7 Numbers and Locations
10.8 Linking miRNA Analysis to a Biochemical Pathway: Gastric Cancer
10.9 KEGG: Biological Networks at Your Fingertips
    miRNAs in the cell cycle pathway
10.10 TarBase: Experimentally Verified miRNA Inhibition
    Verified miRNA-driven translation repression
10.11 TargetScan: miRNA Target Site Prediction
    TargetScan predictions for cell cycle transcripts
10.12 Expanding miRNA Regulation of the Cell Cycle Using TarBase and TargetScan
10.13 Making Sense of miRNAs and Their Many Predicted Targets
10.14 miRNAs Associated With Diseases
10.15 Summary
Exercises
    GDF8
Further Reading

Chapter 11 Multiple Sequence Alignments
11.1 Introduction
11.2 Multiple Sequence Alignments Through NCBI BLAST
11.3 ClustalW from the ExPASy Website
11.4 ClustalW at the EMBL-EBI Server
    MARK1 kinase
    MAPK15 kinase
    DNA versus protein identities
11.5 Modifying ClustalW Parameters
    Gap-opening penalty
    The clustering method
11.6 Comparing ClustalW, MUSCLE, and COBALT
11.7 Isoform Alignment Problem: Internal Splicing
11.8 Aligning Paralog Domains
11.9 Manually Editing a Multiple Sequence Alignment
    Jalview
    Editing with a word processor
11.10 Summary
Exercises
    FOXP2
Further Reading

Chapter 12 Browsing the Genome
12.1 Introduction
12.2 Chromosomes
    Human chromosome statistics
    Chromosome details and comparisons
12.3 Synteny
    Synteny of the sex chromosomes
12.4 The UCSC Genome Browser
    OPN5: a sample gene to browse
    Simple view changes in the UCSC Genome Browser
    Configuring the UCSC Genome Browser window
    Searching genomes and adding tracks through BLAT
    Viewing the Multiz alignments
    Zooming out: seeing the big picture
    Very large genes: dystrophin and titin
    Gene density
    Interspecies comparison of genomes
    The beta globin locus
12.5 Summary
Exercises
    Olfactory genes
Further Reading

Appendix 1 Formatting Your Report
A1.1 Introduction
A1.2 Font Choice and Pasting Issues
A1.3 Find and Replace
    Changing file format
A1.4 Hypertext
    Creating hypertext
    Selecting a column of text
A1.5 Summary

Appendix 2 Running NCBI BLAST in “batch” Mode

Abbreviations

Glossary

Web Resources

Index