Joint BecA-ILRI Hub, SLU and UNESCO Advanced Genomics and Bioinformatics
Mark Wamalwa
BecA-ILRI Hub, Nairobi, Kenya
http://hub.africabiosciences.org/ http://www.Ilri.org/
7th - 17th October 2013
Plan for the Week Day 1 Day 2 Day 3 Day 4 Day 5
Introduction to Linux Shell Programming
Perl programming cont’d
Introduction to Perl programming
Nucleotide and protein Sequence Manipulation
Regulatory sequence analysis
CLC Genomics
CLC Genomics cont’d
Cocktail
What is Bioinformatics/Computational Biology?
• Bioinformatics: seeks to analyze large sets of biological data in order to solve biological questions, to formulate hypotheses and to build models of the underlying biological processes.
• Bioinformatics: collection and storage of biological information
  – Bulk data analysis
  – Bulk data storage
  – Bulk data mining
• Computational biology: development of algorithms and statistical models to analyze biological data
Scope of bioinformatics
• Storage and retrieval of biological data
• Molecular structures: visualization and analysis, classification, prediction
• Sequence analysis: sequence alignments, database searches, motif detection
• Genomics: annotation, comparative genomics
• Phylogeny
• Functional genomics: transcriptome, proteome, interactome
• Analysis of biochemical networks: metabolic networks, regulatory networks
• Systems biology: modelling and simulation of dynamical systems
• …
Multidisciplinarity
[Diagram: bioinformatics surrounded by its contributing disciplines: molecular biology, genomics, genetics, biochemistry, biophysics, evolution, mathematics, statistics, numerical analysis, algorithmics, image analysis, data management]
Multidisciplinary
• Scientists cannot be experts in all of these domains
• Problems:
  – Biologists (generally) hate statistics and computers
  – Computer scientists (generally) ignore statistics and biology
  – Statisticians and mathematicians (generally)
    • Spend their time writing formulas everywhere
  – Complexity of the biological domain
    • Each time you try to formulate a rule, there is a possible counter-example
• Solution: multidisciplinary teams/multi-lab projects
Applications
• Research in biology
  – Molecular organization of the cell/organism
  – Development
  – Mechanisms of evolution
• Medicine
  – Diagnosis of cancers
  – Detecting genes involved in cancer
• Pharmaceutical research
  – Mechanisms of drug action
  – Drug target identification
• Biotechnology
  – Gene therapy
  – Bioengineering
From wet science to bioinformatics
• Progress in biology stimulated the incorporation of new methods in bioinformatics
  – Structure analysis (since the 50s)
    • Structure comparison
    • Structure prediction
  – Sequencing (since the 70s)
    • Sequence alignment
    • Sequence search in databases
  – Genomes (since the 90s)
    • Genome annotation
    • Comparative genomics
    • Functional classifications ("ontologies")
  – Transcriptome (since 1997)
    • Multivariate analysis
  – Proteome (~ 2000)
    • Graph analysis
High throughput technologies
• Genome projects stimulated drastic improvement of sequencing technology
• Post-genomic era
  – Genome sequence is not sufficient to predict gene function
  – This stimulated the development of new experimental methods
    • transcriptomics (microarrays)
    • proteomics (yeast two-hybrid, mass spectrometry, ...)
• The "omics" trend:
  – High throughput methods started a fashion for "omics".
  – Some of the "omics" are not associated with any new or high throughput approach; they are just a new name for an earlier method or for an abstract concept.
Large-scale analyses
• The availability of massive amounts of data makes it possible to address questions that could not even be imagined a few years ago
  – genome-scale measurement of transcriptional regulation
  – comparative genomics
• Downstream analyses require a good understanding of statistics
• Warning: the global trends
  – the capability to analyze large amounts of data carries a risk of remaining at a superficial level, or of being fooled by forgetting to check the pertinence of the results (with some in-depth examples)
  – good news: this does not prevent the authors from publishing in highly cited journals
Bioinformatics is a science of inference
• The risks of inference
• Any analysis of massive data will unavoidably generate a certain rate of errors (false positives and false negatives).
• Good research and development will include an evaluation of the error rates.
• Good methods will minimize the error rate.
• Trade-off between specificity and sensitivity.
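The trade-off between sensitivity and specificity can be made concrete with a small calculation. The sketch below is illustrative only: the scores, labels and thresholds are made-up values, and `confusion_counts` and `sensitivity_specificity` are hypothetical helper names, not part of any tool mentioned in these slides. It assumes a predictor that gives each item a score and calls it "positive" when the score reaches a threshold.

```python
def confusion_counts(scores, labels, threshold):
    """Count (TP, FP, TN, FN) when 'positive' means score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_true in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_true:
            tp += 1          # true positive
        elif predicted and not is_true:
            fp += 1          # false positive
        elif not predicted and is_true:
            fn += 1          # false negative
        else:
            tn += 1          # true negative
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes the labels contain at least one positive and one negative,
    otherwise a division by zero occurs.
    """
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up prediction scores and their (known) true labels.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data, lowering the threshold catches every true positive but admits more false positives, while raising it does the opposite.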
Why bioinformatics then?
• In most cases, wet biology will be required afterwards to validate the predictions
• Bioinformatics can
  – Reduce data to a small set of testable predictions
  – Assign a degree of confidence to each prediction
• The biologist will often have to choose the appropriate degree of confidence, depending on the trade-off between
  – cost of validating predictions
  – benefit expected from the right predictions
• Bioinformatics as in silico biology
  – Allows exploring domains that cannot be addressed experimentally, e.g., the study of past evolutionary events
    • Phylogenetic inference and comparative genomics give us insights into the mechanisms of evolution and into past evolutionary events
    • The time scale of these events is however so large (billions of years) that one cannot conceive of reproducing the inferred events with experimental methods.
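The cost/benefit trade-off mentioned above for choosing a degree of confidence can be sketched as a simple expected-value rule. This is only one possible reading, under stated assumptions: the confidence attached to a prediction is treated as the probability that it is right, cost and benefit are expressed in the same arbitrary units, and the gene names, numbers and the function `worth_validating` are all hypothetical.

```python
def worth_validating(predictions, cost, benefit):
    """Return the names of predictions whose expected benefit
    (confidence * benefit) exceeds the cost of one validation experiment."""
    return [name for name, confidence in predictions
            if confidence * benefit > cost]


# Made-up predictions: (name, confidence that the prediction is right).
predictions = [("geneA", 0.9), ("geneB", 0.5), ("geneC", 0.1)]

# Cheap validation: medium-confidence predictions are still worth testing.
print(worth_validating(predictions, cost=1.0, benefit=4.0))

# Expensive validation: only the most confident prediction remains.
print(worth_validating(predictions, cost=3.0, benefit=4.0))
```

The same list of predictions yields a different set of experiments depending on how expensive validation is, which is why the threshold is left to the biologist.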
Goals of Bioinformatics
Molecular Biology as an Information Science. What is the Information?
• Central Dogma of Molecular Biology
  DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules
  – Sequence, Structure, Function
• Processes
  – Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
  – Standardized
  – Statistical
• Protein: most cellular functions are performed or facilitated by proteins.
  – Primary biocatalyst
  – Cofactor transport/storage
  – Mechanical motion/support
  – Immune protection
  – Control of growth/differentiation
• DNA
  – Genetic material
• RNA
  – Information transfer (mRNA)
  – Protein synthesis (tRNA/mRNA)
  – Some catalytic activity
(idea from D Brutlag, Stanford, graphics from S Strobel)
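The DNA -> RNA -> Protein flow of the Central Dogma can be illustrated with a few lines of code. The sketch below is a minimal example, not the course material: it assumes the input is the coding (sense) strand read in frame from its first base, uses the standard genetic code, and the function names `transcribe` and `translate` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """DNA -> RNA: the mRNA has the coding strand's sequence with U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein: read codons in frame and stop at the first stop codon.

    Raises KeyError on characters other than A, C, G, U/T.
    """
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTAA"            # made-up toy sequence: Met, Ala, stop
mrna = transcribe(dna)
print(mrna)                  # AUGGCCUAA
print(translate(mrna))       # MA
```

Real sequences need more care (reverse strands, alternative genetic codes, ambiguous bases), which is what dedicated sequence-analysis libraries handle.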
Scope of Bioinformatics
• Development of computational tools
  – Writing software
  – Creating databases
• Application of these tools to generate biological knowledge
  – Molecular sequence analysis
  – Molecular structural analysis
  – Molecular functional analysis
The Bioinformatics Platform
• High-performance computing server:
  – 32 total processing cores
  – 128GB of memory (RAM)
  – 8TB of disk space
  – 25TB LTO4 tape backup library
• Linux cluster
• 32 CPUs (AMD 64-bit)
• 128 Gigabyte RAM
• >10 terabytes disk storage
• Grid computing
• Parallel applications:
  – Genome assembly (Newbler, MIRA, Celera, Velvet, CAP3, …)
  – Genome annotation (Glimmer, …)
  – Phylogenetic analysis (BEAST, MrBayes)
  – Other sequence analysis tools (BLAST, ClustalW, HMMER, R)
BecA-ILRI Genomics Platform
Opportunities for genomics and metagenomics research
• Capillary sequencing
  – ABI 3130-xl
  – ABI 3730-xl
  – ABI 3500-xl
• Next generation sequencing
  – 454 GS pyrosequencer
  – 1 sample = 1 library = 1 plate
  – 500 Mb/run: 1/2 cassava genome, 1/8 human genome
• Genomics
• Viral genomics
• Functional Genomics
• Metagenomics
Bioinformatics Core Activities
• Statistical support
  – Experimental design
• Primary data analysis
  – NGS QC, spatial defect removal
  – 454 GA pipeline
• Secondary/downstream analysis
  – Differential expression
  – ChIP-seq peak calling
  – Structural variation, genomic rearrangements
  – SNP and CN analysis
  – microRNA profiling
  – GO enrichment
• Training/Capacity Building
  – Motif finding
  – Functional/network analysis
  – Microarray analysis
• Data management
  – NGS data storage and manipulation
  – Data warehouse facilities: databases
• Software development
  – Bioconductor packages: NGS annotation packages
  – Automated NGS analysis packages
• Bioinformatics tools
  – Ensembl, Galaxy, Cytoscape
From Sequence (genomics/metagenomics) to impact
[Diagram labels: (meta)genome sequencing; databases (compilation of complete genomes, metagenomes, annotation and curation of metadata); extraction of important biological information; phylogenetic analysis; geographical mapping; protein modeling; sequence variation analysis; primer, microarray; discovery of new microorganisms and pathways; diagnostics; global disease surveillance; vaccine development; drug development; improved drug selection; environmental sustainability; improved public health intervention]
Books
• Zvelebil, M. & Baum, J.O. Understanding Bioinformatics. (2007) pp. 772
• Pevsner, J. (2003). Bioinformatics and Functional Genomics. Wiley.
  – All the slides available at: http://www.bioinfbook.org/
• Mount, D.W. Bioinformatics: Sequence and Genome Analysis. (2004) pp. 692.
  – http://www.bioinformaticsonline.org/
• Westhead, D.R., J.H. Parish, and R.M. Twyman. 2002. Bioinformatics. BIOS Scientific Publishers, Oxford.
• Branden et al. Introduction to Protein Structure. (1998) pp. 410
The BecA Hub team
8 countries, 17 females, 19 males
Australia, Benin, Cameroon, England, Ethiopia, Italy, Kenya, USA
Thank you!!!