Comparison of Genome-Wide Association Results With Varied p-Value Thresholds and Input Data Quantity

Connor Burbridge, Yan Yan, Anthony Kusalik

Department of Computer Science, University of Saskatchewan

[Figures 1 and 2: average associations detected (y-axis) plotted against percentage of original data used (x-axis, 0% to 100%), with one series per cutoff: p < 0.1, p < 0.05, p < 1E-05.]

Figure 1: Changes in the number of associations observed in PLINK GWAS trials using varied subsets of all phenotype samples at different p-value cutoffs.

Figure 2: Changes in the number of associations observed in PLINK GWAS trials using varied subsets of accessions at different p-value cutoffs.

[Figures 3 and 4: average associations detected (y-axis) plotted against percentage of original data used (x-axis, 0% to 100%), with one series each for FaST-LMM and TASSEL.]

Figure 3: Comparison of the number of associations reported with p-values < 0.05 between TASSEL and FaST-LMM using varied subsets of all phenotype samples.

Figure 4: Comparison of the number of associations reported with p-values < 0.05 between FaST-LMM and TASSEL using varied subsets of accessions.

Introduction

• Genome-wide association studies (GWAS) have emerged as a highly beneficial tool for plant breeding and genetics
• Linking genotypes to phenotypes provides valuable insight into how single-nucleotide polymorphisms (SNPs) may influence phenotypes
• The popularity of GWAS has led to the development of many genome-wide association programs, including PLINK, TASSEL and FaST-LMM
• Availability of genotypic and phenotypic data is a crucial factor in GWAS. These data are expensive and time-consuming to collect
• The goal of this project is to determine how the number of associations (SNPs) output from GWAS changes as the quantity of input genotype and phenotype data is varied and different p-value cutoffs are applied
• Results will help when deciding how many resources to invest in generating and collecting phenotype and genotype data

Materials and Methods

• Recreate results from a previous Arabidopsis thaliana study [1]
• Genotype data: 214 051 SNPs from 1307 accessions
• Phenotype data: 391 accessions with up to 3 biological replicates each, for a total of 1100 phenotype entries. Of the original 391 accessions, 331 have all 3 replicates
• Since some accessions do not have phenotype measurements for all 3 replicates, direct Monte-Carlo sampling of all 1100 phenotype entries may result in bias. To reduce the possibility of bias, we implement two sampling strategies on the phenotypic data:
  • Direct Monte-Carlo sampling of all 1100 phenotype entries
  • Monte-Carlo sampling of the 331 accessions represented by 3 biological replicates
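The two phenotype sampling strategies named in Materials and Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `(accession, replicate)` record layout, the function names, the toy data, and the choice of sampling without replacement are all hypothetical.

```python
import random
from collections import defaultdict


def sample_entries(entries, fraction, rng):
    """Strategy 1: draw a random fraction of all phenotype entries."""
    k = round(len(entries) * fraction)
    return rng.sample(entries, k)


def sample_accessions(entries, fraction, rng, replicates=3):
    """Strategy 2: draw a random fraction of the accessions that have all
    replicates, keeping every entry of each chosen accession."""
    by_accession = defaultdict(list)
    for entry in entries:
        by_accession[entry[0]].append(entry)
    # Only accessions with the full number of replicates are eligible.
    complete = sorted(
        acc for acc, rows in by_accession.items() if len(rows) == replicates
    )
    k = round(len(complete) * fraction)
    chosen = rng.sample(complete, k)
    return [entry for acc in chosen for entry in by_accession[acc]]


# Toy data (hypothetical): 10 accessions, two of which lack one replicate.
entries = [(f"acc{i}", r) for i in range(10) for r in range(1, 4)]
entries.remove(("acc0", 3))
entries.remove(("acc1", 2))

rng = random.Random(0)  # fixed seed so a trial can be repeated
subset_entries = sample_entries(entries, 0.4, rng)
subset_accessions = sample_accessions(entries, 0.5, rng)
```

In the second strategy every selected accession contributes all of its replicates, so accessions with missing measurements cannot be over- or under-represented in a subset.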

Results

[Figures 5 and 6: average associations detected (y-axis) plotted against percentage of original data used (x-axis, 0% to 100%), with one series each for FaST-LMM and TASSEL.]

Figure 5: Comparison of the number of associations reported with p-values < 1E-05 between TASSEL and FaST-LMM using varied subsets of all phenotype samples.

Figure 6: Comparison of the number of associations reported with p-values < 1E-05 between TASSEL and FaST-LMM using varied subsets of accessions.
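The quantity on the y-axis of the figures, the average number of associations detected at a given cutoff, could be computed along these lines. This is a sketch only: the function names and the p-values are hypothetical, and parsing the output files of PLINK, TASSEL or FaST-LMM is deliberately left out.

```python
def count_associations(p_values, cutoff):
    """Number of SNPs whose association p-value is strictly below the cutoff."""
    return sum(1 for p in p_values if p < cutoff)


def average_associations(trials, cutoffs):
    """Mean association count per cutoff across repeated sampling trials.

    Each trial is a list of per-SNP p-values from one GWAS run.
    """
    return {
        cutoff: sum(count_associations(trial, cutoff) for trial in trials)
        / len(trials)
        for cutoff in cutoffs
    }


# Hypothetical p-values from three trials (not taken from the poster).
trials = [
    [0.2, 0.07, 0.03, 2e-6],
    [0.5, 0.04, 0.01, 0.09],
    [0.8, 0.6, 3e-7, 0.02],
]
averages = average_associations(trials, cutoffs=(0.1, 0.05, 1e-5))
```

Averaging over trials is what smooths out the randomness of any single Monte-Carlo subset.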

Discussion and Conclusions

• The number of associations output from PLINK differs greatly from that of TASSEL and FaST-LMM under the same conditions. This is likely due to the different statistical models applied by the programs.
• PLINK shows a linear trend between data used and associations found under different p-value cutoffs (Figures 1 and 2)
  • More genomes input → more associations under given p-values
• The increasing trend in Figures 1 and 2 does not plateau. It would be worthwhile to investigate whether more associations would be output if more data were fed in.
• In general, both TASSEL and FaST-LMM show a trend of increasing numbers of reported associations with increasing amounts of input data (Figures 3-6). Again, it would be worthwhile to see if this trend continues.
• Performance of TASSEL in Figure 4 is quite unusual. The maximum number of associations output occurs when the least amount of data is used. The extent of this behaviour, and the reasons for it, need to be further investigated.
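One way to check a claim of a linear trend is an ordinary least-squares fit of average associations against the percentage of data used. The sketch below is an assumption about how such a check might look, not the analysis performed for the poster; the function name and the numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ≈ intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares residual variation with total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical values (not read from the poster's figures).
percent_used = [20, 40, 60, 80, 100]
avg_associations = [1100, 1900, 3000, 4100, 4900]
slope, intercept, r_squared = fit_line(percent_used, avg_associations)
```

An R² close to 1 together with a positive slope would be consistent with a linear, non-plateauing trend; a slope that shrinks when the fit is restricted to the larger subsets would instead hint at a plateau.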

References

[1] Branham, Sandra E. et al. (2016) Genome-Wide Association Study of Arabidopsis thaliana Identifies Determinants of Natural Variation in Seed Oil Composition. Journal of Heredity. 107 (3): 248-256.
[2] Storey, John D. and Tibshirani, Robert. (2003) Statistical significance for genomewide studies. PNAS. 100 (16): 9440-9445.


Acknowledgements

• We would like to thank the Plant Phenotyping and Imaging Research Centre at the U of S, the Canada First Research Excellence Fund, the College of Arts and Science at the U of S and the U of S Department of Computer Science for funding this research.