Introduction to Bioinformatics - ILRI Research Computing - cgiar

Joint BecA-ILRI Hub, SLU and UNESCO Advanced Genomics and Bioinformatics
Mark Wamalwa
BecA-ILRI Hub, Nairobi, Kenya
http://hub.africabiosciences.org/
http://www.Ilri.org/
[email protected]

7th - 17th October 2013

Plan for the Week
Day 1 Day 2 Day 3 Day 4 Day 5
•  Introduction to Linux Shell Programming
•  Introduction to Perl programming
•  Perl programming cont’d
•  Nucleotide and protein Sequence Manipulation
•  Regulatory sequence analysis
•  CLC Genomics
•  CLC Genomics cont’d
•  Cocktail

What is Bioinformatics/ Computational Biology?
•  Bioinformatics: Seeks to analyze large sets of biological data in order to solve biological questions, to formulate hypotheses and to build models of the underlying biological processes involved.
•  Bioinformatics: collection and storage of biological information
•  Bulk Data analysis
•  Bulk Data storage
•  Bulk Data mining
•  Computational biology: development of algorithms and statistical models to analyze biological data

Scope of bioinformatics
•  Storage and retrieval of biological data
•  Molecular structures: visualization and analysis, classification, prediction
•  Sequence analysis: sequence alignments, database searches, motif detection
•  Genomics: annotation, comparative genomics
•  Phylogeny
•  Functional genomics: transcriptome, proteome, interactome
•  Analysis of biochemical networks: metabolic networks, regulatory networks
•  Systems biology: modelling and simulation of dynamical systems
•  …

Multidisciplinarity
[Figure: bioinformatics shown together with related disciplines: molecular biology, genomics, genetics, biochemistry, biophysics, evolution, mathematics, statistics, numerical analysis, algorithmics, image analysis, data management]

Multidisciplinary
•  Scientists cannot be experts in all of these domains
•  Problems:
   –  Biologists (generally) hate statistics and computers
   –  Computer scientists (generally) ignore statistics and biology
   –  Statisticians and mathematicians (generally)
      •  Spend their time writing formulas everywhere
   –  Complexity of the biological domain
      •  Each time you try to formulate a rule, there is a possible counter-example
•  Solution: multidisciplinary teams/multi-lab projects

Applications
•  Research in biology
   –  Molecular organization of the cell/organism
   –  Development
   –  Mechanisms of evolution
•  Medicine
   –  Diagnosis of cancers
   –  Detecting genes involved in cancer
•  Pharmaceutical research
   –  Mechanisms of drug action
   –  Drug target identification
•  Biotechnology
   –  Gene therapy
   –  Bioengineering

From wet science to bioinformatics
•  Progress in biology stimulated the incorporation of new methods in bioinformatics
   –  Structure analysis (since the 50s)
      •  Structure comparison
      •  Structure prediction
   –  Sequencing (since the 70s)
      •  Sequence alignment
      •  Sequence search in databases
   –  Genomes (since the 90s)
      •  Genome annotation
      •  Comparative genomics
      •  Functional classifications (“ontologies”)
   –  Transcriptome (since 1997)
      •  Multivariate analysis
   –  Proteome (~ 2000)
      •  Graph analysis

High throughput technologies
•  Genome projects stimulated drastic improvement of sequencing technology
•  Post-genomic era
   –  Genome sequence is not sufficient to predict gene function
   –  This stimulated the development of new experimental methods
      •  transcriptomics (microarrays)
      •  proteomics (yeast two-hybrid, mass spectrometry, ...)
•  The "omics" trend:
   –  High throughput methods started a fashion for "omics".
   –  Some of the "omics" are not associated with any new or high throughput approach; they are just a new name for an earlier method or for an abstract concept

Large-scale analyses
•  The availability of massive amounts of data makes it possible to address questions that could not even be imagined a few years ago
   –  genome-scale measurement of transcriptional regulation
   –  comparative genomics
•  Downstream analyses require a good understanding of statistics
•  Warning: the global trends
   –  the capability to analyze large amounts of data carries a risk of remaining at a superficial level, or of being fooled by forgetting to check the pertinence of the results (with some in-depth examples)
   –  good news: this does not prevent the authors from publishing in highly cited journals

Bioinformatics is a science of inference
•  The risks of inference
•  Any analysis of massive data will unavoidably generate a certain rate of errors (false positives and false negatives).
•  Good research and development will include an evaluation of the error rates.
•  Good methods will minimize the error rate.
•  Trade-off between specificity and sensitivity.
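
The evaluation of error rates mentioned above can be illustrated with a minimal Python sketch. The scores, labels and function names below are hypothetical and chosen only for illustration; the formulas are the usual definitions of sensitivity, specificity and false discovery rate. Lowering the score threshold raises sensitivity at the expense of specificity, which is the trade-off named in the last bullet.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives for a given score threshold."""
    tp = fp = tn = fn = 0
    for score, is_true in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_true:
            tp += 1
        elif predicted and not is_true:
            fp += 1
        elif not predicted and is_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def error_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, false discovery rate)."""
    sensitivity = ratio(tp, tp + fn)   # fraction of real positives detected
    specificity = ratio(tn, tn + fp)   # fraction of real negatives rejected
    fdr = ratio(fp, tp + fp)           # fraction of predictions that are wrong
    return sensitivity, specificity, fdr


# Hypothetical prediction scores and their (known) true status.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.75, 0.35):
    sens, spec, fdr = error_rates(*confusion_counts(scores, labels, threshold))
    print(f"threshold={threshold}: sensitivity={sens:.2f} "
          f"specificity={spec:.2f} FDR={fdr:.2f}")
```

With this toy data, the stricter threshold gives sensitivity 0.50 and specificity 0.75, while the looser one gives sensitivity 1.00 and specificity 0.50.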

Why bioinformatics then?
•  In most cases, wet biology will be required afterwards to validate the predictions
•  Bioinformatics can
   –  reduce data to a small set of testable predictions
   –  assign a degree of confidence to each prediction
•  The biologist will often have to choose the appropriate degree of confidence, depending on the trade-off between
   –  the cost of validating predictions
   –  the benefit expected from the right predictions
•  Bioinformatics as in silico biology
   –  Allows the exploration of domains that cannot be addressed experimentally, e.g., the study of past evolutionary events
      •  Phylogenetic inference and comparative genomics give us insights into the mechanisms of evolution and into past evolutionary events
      •  The time scale of these events is, however, so large (billions of years) that one cannot conceive of reproducing the inferred events with experimental methods.

Goals of Bioinformatics

Molecular Biology as an Information Science. What is the Information?
•  Central Dogma of Molecular Biology
   –  DNA -> RNA -> Protein -> Phenotype -> DNA
•  Molecules
   –  Sequence, Structure, Function
•  Processes
   –  Mechanism, Specificity, Regulation
•  Central Paradigm for Bioinformatics
   –  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
•  Large Amounts of Information
   –  Standardized
   –  Statistical
•  Most cellular functions are performed or facilitated by proteins.
   –  Primary biocatalyst
   –  Cofactor transport/storage
   –  Mechanical motion/support
   –  Immune protection
   –  Control of growth/differentiation
•  Genetic material
•  Information transfer (mRNA)
•  Protein synthesis (tRNA/mRNA)
•  Some catalytic activity
(idea from D Brutlag, Stanford, graphics from S Strobel)
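
The DNA -> RNA -> Protein flow listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, uses the standard genetic code, and the function names are hypothetical. Unknown codons (for example those containing N) are rendered as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading frame 1, stopping at the first stop codon."""
    coding = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(coding) - 2, 3):      # complete codons only
        amino_acid = CODON_TABLE.get(coding[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTAA"            # hypothetical toy sequence
mrna = transcribe(dna)
print(mrna)                  # AUGGCCUAA
print(translate(mrna))       # MA
```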

Scope of Bioinformatics
•  Development of computational tools
   –  Writing software
   –  Creating databases
•  Application of these tools to generate biological knowledge
   –  Molecular sequence analysis
   –  Molecular structural analysis
   –  Molecular functional analysis

The Bioinformatics Platform
•  High-performance computing server:
   –  32 total processing cores
   –  128GB of memory (RAM)
   –  8TB of disk space
   –  25TB LTO4 tape backup library
•  Linux cluster
•  32 CPUs (AMD 64-bit)
•  128 Gigabyte RAM
•  >10 terabytes disk storage
•  Grid computing
•  Parallel applications:
   –  Genome assembly (Newbler, MIRA, Celera, velvet, CAP3, …)
   –  Genome annotation (glimmer, …)
   –  Phylogenetic analysis (Beast, Mr Bayes)
   –  Other sequence analysis tools (BLAST, clustalw, HMMER, R)

BecA-ILRI Genomics Platform
Opportunities for genomics and metagenomics research
•  Capillary sequencing: ABI 3130-xl, ABI 3730-xl, ABI 3500-xl
•  Next generation sequencing: 454 GS pyrosequencer
   –  1 sample = 1 library = 1 plate
   –  500 Mb/run
   –  1/2 cassava genome
   –  1/8 human genome
•  Genomics
•  Viral genomics
•  Functional Genomics
•  Metagenomics

Bioinformatics Core Activities
•  Statistical support
   –  Experimental design
•  Primary data analysis
   –  NGS QC, spatial defect removal
   –  454 GA pipeline
•  Secondary/downstream analysis
   –  Differential expression
   –  ChIP-seq peak calling
   –  Structural variation, genomic rearrangements
   –  SNP and CN analysis
   –  microRNA profiling
   –  GO enrichment
•  Training/Capacity Building
   –  motif finding
   –  functional/network analysis
   –  microarray analysis
•  Data management
   –  NGS data storage and manipulation
   –  Data warehouse facilities: databases
•  Software development
   –  Bioconductor packages: NGS annotation packages
   –  Automated NGS analysis packages
•  Bioinformatics tools
   –  Ensembl, Galaxy, Cytoscape

From Sequence (genomics/metagenomics) to impact
[Figure: flow diagram with the labels: (meta)genome sequencing; databases; compilation of complete genomes, metagenomes, annotation and curation of metadata; extraction of important biological information; phylogenetic analysis; geographical mapping; sequence variation analysis; protein modeling; discovery of new microorganisms and pathways; primer, microarray; diagnostics; vaccine development; drug development; improved drug selection; global disease surveillance; improved public health intervention; environmental sustainability]

Books n Zvelebil, M. & Baum, J.O. Understanding Bioinformatics. (2007) pp. 772 n Pevzner, J. (2003). Bioinformatics and Functional Genomics. Wiley. q All the slides available at: http://www.bioinfbook.org/

n W. Mount. Bioinformatics: Sequence and Genome Analysis. (2004) pp. 692. q http://www.bioinformaticsonline.org/

n Westhead, D.R., J.H. Parish, and R.M. Twyman. 2002. Bioinformatics. BIOS Scientific Publishers, Oxford.

n Branden et al. Introduction to Protein Structure. (1998) pp. 410

The BecA Hub team
8 countries, 17 females, 19 males
Australia, Benin, Cameroon, England, Ethiopia, Italy, Kenya, USA

Thank you!!!