The Plant Genome: Published ahead of print 10 Sept. 2012; doi: 10.3835/plantgenome2012.05.0005
Genotyping-by-sequencing for plant breeding and genetics

Jesse A. Poland* and Trevor W. Rife

J.A. Poland, United States Department of Agriculture – Agricultural Research Service,
Hard Winter Wheat Genetics Research Unit and Department of Agronomy, Kansas State
University, 4008 Throckmorton Hall, Manhattan KS, 66506; T.W. Rife,
Interdepartmental Genetics, Kansas State University, 4024 Throckmorton Hall, Manhattan KS, 66506

* Corresponding author: [email protected]

Abbreviations: GBS, genotyping-by-sequencing; GS, genomic selection; NGS, next-generation
sequencing

Abstract

Rapid advances in “next-generation” DNA sequencing technology have brought the
$1000 human genome within reach while providing the raw sequencing output for
researchers to revolutionize the way populations are genotyped. To capitalize on these
advancements, genotyping-by-sequencing (GBS) has been developed as a rapid and
robust approach for reduced-representation sequencing of multiplexed samples that
combines genome-wide molecular marker discovery and genotyping. The flexibility and
low cost of GBS make this an excellent tool for many applications and research
questions in plant genetics and breeding. Here we address some of the new research
opportunities that are becoming more feasible with GBS. Further, we highlight areas
where GBS will become more powerful with the continued increase of sequencing
output, development of reference genomes, and improved bioinformatics. The ultimate
goal of plant biology scientists is to connect phenotype to genotype. In plant breeding the
genotype can then be used to predict phenotypes and select improved cultivars.
Furthering our understanding of the connection between heritable genetic factors and the
resulting phenotypes will enable genomics-assisted breeding to exist on the scale needed
to increase global food supplies in the face of decreasing arable land and climate change.

“Next-generation” genotyping

Driven by the quest for a $1000 human genome, rapid advances in
next-generation sequencing (NGS) output have provided technology with the ability to greatly
transform the way we think about plant genomics and breeding. With the introduction of
massively parallel sequencing, raw sequencing output is doubling roughly every 6
months (Figure 1). The availability of inexpensive sequencing technology has
transformed the way genomes are sequenced (Xu et al., 2011; Wang et al., 2011),
polymorphisms are discovered (Mardis, 2008; Futschik and Schlötterer, 2010; You et al.,
2011; Nielsen et al., 2011), gene expression is analyzed (Geraldes et al., 2011; Harper et
al., 2012), and populations are genotyped (Baird et al., 2008; Elshire et al., 2011; Davey
et al., 2011; Truong et al., 2012; Poland et al., 2012; Wang et al., 2012). Sequencing is
rapidly becoming so inexpensive that it will soon be reasonable to use it for every genetic
study. NGS applications have the potential to revolutionize the field of plant genomics
and the practice of applied plant breeding.

One of the primary objectives of functional genomics in agricultural species is to
connect phenotype to genotype and use this knowledge to make phenotypic predictions
and select improved plant types. To do this on a genome-wide scale requires large
populations with dense molecular markers across the genome. To put the power of NGS
to work for plant breeding and genomics, new approaches for sequence-based genotyping
have been developed. One promising approach is genotyping-by-sequencing (GBS),
which uses enzyme-based complexity reduction (using restriction endonucleases to target
only a small portion of the genome) coupled with DNA barcoded adapters to produce
multiplex libraries of samples ready for NGS sequencing. This approach has been
demonstrated to be robust across a range of species and capable of producing tens to
hundreds of thousands of molecular markers (Elshire et al., 2011; Poland et al., 2012).
The flexibility of GBS with regard to species as well as populations and research
objectives makes this an ideal tool for plant genetics studies. As the phenomenal increase
in NGS output continues, many research questions that were once out of reach will be
resolved through the application of these approaches.
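
To make the role of the barcoded adapters concrete, the following is a minimal sketch of how reads from a pooled library might be assigned back to samples. It assumes each read begins with an inline sample barcode followed by the restriction-site remnant. The barcodes, sample names, and the remnant sequences `CAGC`/`CTGC` are illustrative assumptions, not the published adapter set, and the function `demultiplex` is hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, remnants=("CAGC", "CTGC")):
    """Assign reads to samples by inline barcode.

    reads    : iterable of read sequences from a pooled (multiplexed) library
    barcodes : {barcode sequence: sample name}  (hypothetical values below)
    remnants : sequences expected directly after the barcode (assumed cut-site remnant)
    Returns {sample: [reads with the barcode trimmed off]}.
    """
    by_sample = defaultdict(list)
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes.items(), key=lambda item: -len(item[0]))
    for read in reads:
        for barcode, sample in ordered:
            rest = read[len(barcode):]
            # Keep the read only if the barcode is followed by a cut-site remnant.
            if read.startswith(barcode) and rest.startswith(remnants):
                by_sample[sample].append(rest)
                break
    return dict(by_sample)


barcodes = {"CTCC": "line_A", "TGCA": "line_B"}  # hypothetical barcodes
reads = [
    "CTCCCAGCTTGACCA",  # barcode CTCC + remnant CAGC
    "TGCACTGCGGATTCA",  # barcode TGCA + remnant CTGC
    "GGGGCAGCAAAA",     # unknown barcode, discarded
    "CTCCTTTTTTTT",     # barcode present but no cut-site remnant, discarded
]
result = demultiplex(reads, barcodes)
```

Requiring the cut-site remnant after the barcode is a simple guard against chimeric or mis-primed reads; real pipelines typically also tolerate sequencing errors in the barcode, which this sketch does not.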
All in one

The two key components for genotyping germplasm are finding DNA sequence
polymorphisms and assaying the markers across a full set of material. Classically, this
has been a two-step process involving marker discovery followed by assay design and
genotyping. An important strength of sequence-based genotyping approaches is that the
marker discovery and genotyping are completed at the same time. This facilitates
exploration of new germplasm sets or even new species without the upfront effort of
discovering and characterizing polymorphisms. Another key component of GBS datasets
is that the raw data is dynamic. The raw sequences obtained from GBS can be
re-analyzed, uncovering further information (e.g. new polymorphisms, annotated genes,
etc.) as bioinformatics techniques improve, reference genomes develop, and the
collection of sequence data increases. Each of these factors adds value to the
same raw dataset.
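
The one-step idea can be illustrated with a small sketch in which the same per-sample reads are used first to discover which sites are polymorphic and then to call genotypes at those sites. This is a deliberately simplified illustration under stated assumptions: the input format, the site labels, and the function `discover_and_genotype` are hypothetical, and no sequencing-error model or depth threshold is applied.

```python
from collections import Counter


def discover_and_genotype(observations):
    """Discover markers and genotype samples from the same reads.

    observations : {sample: {site: string of bases observed in that sample's reads}}
    Returns (sorted list of polymorphic sites, {sample: {site: genotype or None}}).
    """
    # Discovery: a site is a marker if at least two alleles are seen across samples.
    alleles = {}
    for calls in observations.values():
        for site, bases in calls.items():
            alleles.setdefault(site, Counter()).update(set(bases))
    markers = sorted(site for site, seen in alleles.items() if len(seen) >= 2)

    # Genotyping: reuse the same reads; a sample with no reads at a site is missing.
    genotypes = {
        sample: {
            site: "/".join(sorted(set(calls[site]))) if site in calls else None
            for site in markers
        }
        for sample, calls in observations.items()
    }
    return markers, genotypes


observations = {  # hypothetical tags and samples
    "P1": {"tag1:12": "AAA", "tag2:30": "GG"},
    "P2": {"tag1:12": "TT", "tag2:30": "G"},
    "F2_07": {"tag1:12": "AT"},
    "F2_09": {"tag2:30": "G"},
}
markers, genotypes = discover_and_genotype(observations)
```

Because the raw observations are retained, the same function can be rerun as more samples are added, which is one simple way to picture the "dynamic" raw data described above: a site that is monomorphic in the current panel may become a marker once new germplasm is sequenced.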

One of the first and broadly adopted applications for utilizing NGS was for single
nucleotide polymorphism (SNP) and presence/absence variation (PAV) discovery in
diverse populations with and without reference genomes (Baird et al., 2008; Wiedmann et
al., 2008; Gore et al., 2009a; b; Huang et al., 2009; Deschamps et al., 2010; Hyten et al.,
2010; You et al., 2011; Nelson et al., 2011; Hohenlohe et al., 2011; Byers et al., 2012).
These studies have focused on assaying a few key genotypes with a
reduced-representation approach (Baird et al., 2008) or with whole-genome re-sequencing (Huang
et al., 2009). While highly effective for SNP discovery, this approach is limited in the
number of lines assayed and does not simultaneously assay the markers across the full
population of interest.

The key objective of the GBS approach, therefore, is not merely to discover
polymorphisms and then transfer these to a fixed assay, but to simultaneously discover
polymorphisms and obtain genotypic information across the whole population of interest.
It is this combined one-step approach that makes GBS a truly rapid and flexible platform
for a range of species and germplasm sets and perfectly suited for genomic selection in
plant breeding programs. As sequencing output continues to increase, GBS will evolve
first to lower levels of complexity reduction (to capture more sequence variants) and then
to whole-genome re-sequencing (to capture all variants). Whole genome re-sequencing
has been applied in Arabidopsis, rice, and maize (Huang et al., 2009; Ashelford et al.,
2011; Gan et al., 2011; Chia et al., 2012; Jiao et al., 2012; Xu et al., 2012), though it
quickly becomes less manageable with larger, more complex genomes that lack a solid
reference genome (Morrell et al., 2011). The level of multiplexing has also been limited
in this approach, increasing per sample cost.

As GBS can be readily used for de novo discovery and application of new
molecular polymorphisms, it is particularly powerful for new sets of germplasm and
uncharacterized species. In many ways the greatest advantage of sequence-based
genotyping approaches is reducing ascertainment bias associated with marker discovery
in panels differing from the target population. This is an obvious advantage for
association studies where differing allele frequencies greatly influence the power and
precision of the study (Myles et al., 2009; Hamblin et al., 2010). For breeding
applications, informative polymorphisms can be discovered as novel germplasm is
introduced into the breeding pool. The use of an unrepresentative marker panel in
surveying molecular diversity is highly problematic for getting a true representation of
molecular diversity present in a target population. Most GBS approaches utilize
methylation-sensitive enzymes. If these enzymes target differentially methylated regions
of the genome, ascertainment bias could potentially be introduced in different sets of
germplasm, but evidence for this has yet to be seen. While markers discovered with GBS
should have little bias across sets of germplasm, it is also unknown how uniformly they
are spaced across the genome. Evidence from Poland et al. (2012), however, indicated
that GBS markers were uniformly spaced across the chromosomes of both wheat and
barley.

Many flavors

The use of reduced-representation sequencing for targeting small portions of the
genome was first demonstrated by Altshuler et al. (2000). This approach was then later
combined with NGS and DNA barcoded adapters to sequence multiplex libraries in
parallel. There are many variations of this approach and GBS is one specific method for
“genotyping using next-generation sequencing of multiplex DNA-barcoded
reduced-representation libraries” (Table 1). Further, the combination of enzymes that can be
employed for complexity reduction is almost endless. Davey et al. (2011) have thoroughly
reviewed several approaches of complexity reduction including complexity reduction of
polymorphic sequences (CRoPS™) (van Orsouw et al., 2007) and deep sequencing of
reduced representation libraries (van Tassell et al., 2008).

The use of restriction enzymes for targeted reduction of genome complexity
combined with NGS was first described by Baird et al. (2007) and termed
Restriction-site Associated DNA (RAD). RAD methods use a restriction enzyme to generate genomic
fragments which are then ligated to an adapter containing a forward primer for
amplification, sequencing platform primer sites, and a unique DNA barcode that enables
sample multiplexing (Baird et al., 2008; Craig et al., 2008; Cronn et al., 2008). The
samples are pooled, randomly sheared, and size-selected to create a uniform collection of
similarly-sized DNA fragments (Baird et al., 2008). The fragments are then ligated to a Y
adapter that ensures only fragments containing the first adapter will be amplified (Baird
et al., 2008). RAD markers provided a robust method to discover polymorphisms and
map variation in a population (Miller et al., 2007).

First generation RAD analysis had similar drawbacks to older restriction
enzyme-based marker technologies: the requirement of species-specific arrays, a hybridization for
every comparison, and a limitation to presence variation assays (Baird et al., 2008).
Combining the progressive features of RAD with NGS, however, resulted in the
discovery of new markers at a significantly decreased cost (Baird et al., 2008). The
simultaneous discovery of SNP markers during RAD sequencing facilitated robust
mapping of many polymorphisms and precise assignment of chromosomal regions to
mapping parents, allowing for detection of recombination locations. The RAD approach
has recently been modified to utilize restriction enzymes that cut upstream and
downstream of a target site (Wang et al., 2012). This new methodology produces uniform
length tags, allows nearly all of the restriction sites to be surveyed, and permits marker
intensity adjustment (Wang et al., 2012). The next flavor of sequence-based genotyping
was multiplexed shotgun genotyping (MSG), which required only one gel purification,
eliminated DNA shearing, required less starting DNA, and implemented a Hidden
Markov Model (HMM) to determine points of chromosomal recombination (Andolfatto
et al., 2011). MSG employed a single common cutting restriction enzyme and produced a
limited complexity reduction suitable for the smaller genome (~130 Mb) of Drosophila
simulans (Andolfatto et al., 2011). In the context of a good reference genome, the HMM
imputation approach was highly effective for tracing parental origin and defining
recombination break points (Andolfatto et al., 2011).

The original GBS protocol was developed to simplify and streamline the
construction of RAD libraries (Elshire et al., 2011). The strength of the GBS protocol is
its simplicity: utilizing inexpensive adapters, allowing pooled library construction, and
avoiding shearing and size selection (Figure 2). The GBS approach removed the need
for size selection by using a short PCR extension of the multiplexed library. Instead of
the Y-adapters used in the RAD protocol, the original GBS protocol utilized a single
restriction enzyme, a barcoded adapter, and a common adapter (Elshire et al., 2011).
Although all combinations of adapters can ligate to the DNA fragments, only those that
contained one of each adapter are able to be amplified and sequenced (Davey et al.,
2011).

The original GBS approach was recently extended to a two-enzyme version that
combines a rare- and a common-cutting restriction enzyme to generate uniform libraries
consisting of a forward (barcoded) adapter and a reverse (Y) adapter on alternate ends of
each fragment (Poland et al., 2012). The use of two enzymes in this GBS approach
enables the capture of most fragments associated with the rare-cutting enzyme. The
employment of a Y-adapter on the common restriction site avoids amplification of more
common fragments, a preferable situation for larger, more complex genomes. The GBS
approach has been successfully applied in several species including cotton (Gossypium
hirsutum), oats (Avena sativa), sorghum (Sorghum bicolor) and rice (Oryza sativa) with
little to no change in protocol (Jesse Poland, unpublished).

The options for tailoring GBS to any species or desired application are almost
endless. A range of enzymes including ApeKI, PstI and HindIII have been evaluated in
maize with success in varying the level of complexity reduction (Edward Buckler,
personal communication). With a varied level of complexity reduction, it is possible to
increase coverage of a target genome or increase the multiplexing level of a target
population. The interplay of these two factors will determine the optimal approach for the
species under investigation. For species with large genomes or no reference genome, the
use of rare-cutting restriction enzymes (i.e. 6 bp or greater target site) with methylation
sensitivity can assist in creating a higher level of complexity reduction by targeting fewer
sites. This will lead to higher sampling depth of the same genomic sites and reduce the
amount of missing data.

Hand-in-hand with the reference genome

Sequence-based genotyping greatly benefits from a well-characterized
(sequenced) reference genome. A reference genome makes ordering and imputing
low-coverage marker data generated through GBS and other sequence-based genotyping
approaches straightforward. This has been seen in many of the reported uses of
sequence-based genotyping. The MSG approach employed by Andolfatto et al. (2011) made use of
the D. simulans reference genome to first align tags to the reference and then call SNPs.
Using a physical map framework, the parent-of-origin was then imputed across all SNPs
segregating in the population. This approach is very robust for assigning parent-of-origin
in bi-parental populations. Likewise, Huang et al. (2009) used the reference genome of
rice to first align NGS tags and subsequently call SNPs. The physical ordering of these
markers greatly enabled and simplified the imputation and assignment of parent-of-origin
for segregating populations.

Though genotyping-by-sequencing approaches greatly benefit from a reference
genome, the rapid discovery and ordering (through genetic mapping) of sequence-based
molecular markers can greatly assist with the development and refinement of a reference
genome. High-density genetic maps developed through GBS can be used to anchor and
order physical maps and refine or correct unordered sequence contigs. In D. simulans,
Andolfatto et al. (2011) were able to assign 8 Mb to linkage groups, which comprised
30% of the unassembled D. simulans genome or about 6% of the total genome. This is a
substantial improvement of an already well-characterized genome. Likewise, in current
efforts in much larger, more complex genomes including barley (5.5 Gb) and wheat (16
Gb) (Arumuganathan and Earle, 1991), high-density GBS maps are being used to assist
with anchoring and ordering large numbers of assembled but unanchored and unordered
contigs (N. Stein et al., in press). This approach appears very promising, creating a
positive feedback loop where the development of the reference genome assisted by GBS
markers leads to better SNP calling and order-based imputation for GBS datasets.
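
To illustrate why a physical marker order simplifies parent-of-origin imputation in a bi-parental population, the sketch below fills missing calls that are flanked on both sides by the same parental call. This is a simple flanking-marker rule chosen for illustration; it is not the hidden Markov model of Andolfatto et al. (2011), and the function name `fill_parent_of_origin` and the example calls are hypothetical.

```python
def fill_parent_of_origin(calls):
    """Fill missing parent-of-origin calls between agreeing flanking markers.

    calls : list ordered by physical position; each entry is 'A', 'B',
            or None (no read covered that marker in this sample).
    Returns a new list; gaps between disagreeing flanks stay None because
    a recombination breakpoint lies somewhere inside them.
    """
    filled = list(calls)
    last_idx = None  # index of the most recent observed call
    for i, call in enumerate(calls):
        if call is None:
            continue
        # Fill the gap only when the two flanking observed calls agree.
        if last_idx is not None and calls[last_idx] == call:
            for j in range(last_idx + 1, i):
                filled[j] = call
        last_idx = i
    return filled


sparse = ["A", None, None, "A", None, "B", None, "B", None]
dense = fill_parent_of_origin(sparse)
```

In this toy example the one interior gap left unfilled brackets the A-to-B transition, which is how sparse markers placed on a physical map can narrow a recombination breakpoint to an interval. Without a marker order, neither the fill nor the breakpoint interval is defined.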

Maps made easy

The combination of GBS with a well-defined reference genome makes the
development of genetic maps for characterizing segregating populations exceptionally
straightforward. In the absence of a solid reference genome, a high-density reference
genetic map can serve the same purpose. For characterizing a new population, there will
no longer be any need to place markers on linkage groups, calculate recombination
frequencies, or order markers. With a reference genome, markers can be ordered along
the physical chromosome (Figure 3). This ordering can then be used to precisely place
recombination break points. The power of such approaches has been highlighted in recent
papers with model species including D. simulans (Andolfatto et al., 2011), rice (Huang et
al., 2010), and maize (Elshire et al., 2011). Even at low coverage, the placement of sparse
markers on the physical map can be used to narrow points of recombination to 100-200 kb
intervals (Huang et al., 2009; Xie et al., 2010). This approach can be extended to
populations with heterozygous chromosomal segments such as F2 or BC1 populations.
Andolfatto et al. (2011) demonstrated a hidden Markov Model that accurately inferred
heterozygous states from low-pass sequence-based genotyping. These same approaches
have successfully been applied in maize, as well (P. Bradbury, personal communication).

In the absence of a solid reference genome, the same ease of genetic mapping can
be accomplished through development of a reference genetic map for the species of
interest. GBS markers and other framework markers can be integrated to develop a
high-density genetic map (Poland et al., 2012). For new populations, GBS tags can be used to
make genotype calls based on the reference map without the need to construct a de novo
map. The extremely large number of markers produced with GBS allows sufficient
coverage for most populations even if only a fraction of the total markers are utilized.

These same approaches for developing genetic maps and graphical genotypes can
be broadly applied to the characterization of populations of interest for breeding and
germplasm improvement including elite breeding lines, segregating populations for
selection, near-isogenic lines, and alien-introgression lines. The use of a variety of
algorithms to correctly infer the heterozygous/homozygous state of chromosome regions
will add value to inferences and conclusions for molecular breeding and selection
(Andolfatto et al., 2011). Other algorithms can be used for phasing markers in
segregating and outcrossing populations. This will generally, however, require known
marker order of the GBS SNPs.

Mapping single genes

GBS and other sequence-based genotyping approaches can be very powerful for
mapping single genes. The de novo discovery of high-density markers in a population of
interest has the potential to circumvent the cumbersome process of marker discovery and
testing for fine-mapping of target genes and mutations. In the absence of a reference map,
RAD markers have been used in bulked segregant analysis to quickly identify linked
markers (Baird et al., 2008). For single genes of interest, this can be a valuable approach
to rapidly identify segregating polymorphisms. In lupin (Lupinus angustifolius), Yang et
al. (2012) were able to identify 30 markers linked to an anthracnose resistance gene.
One advantage of GBS for mapping single genes in F2 or similar populations
is that the per-sample cost will be low enough that individual samples can be used rather
than bulks. This will allow correction or removal of any individuals that were incorrectly
phenotyped while confirming segregation of linked markers. Depending on the
application, there will be a balance between finding markers linked to the gene of interest
using GBS and developing single marker assays from the resulting data. Considering
breeding approaches, it can still be optimal to pre-screen populations with markers for
known single genes (with large effects) for smaller investment in time and sample costs
prior to conducting whole genome profiling. Selected plants carrying desired genes can
then be genotyped using GBS for genomic selection.

An excess of markers

While pre-selection of breeding populations for single markers for important
genes is a viable breeding strategy, sequencing capacity is becoming so inexpensive and
readily available that it will soon be reasonable to generate whole-genome profiles on any
germplasm of interest. Previously, scientists spent a majority of their time developing and
working with a small number of markers. Many projects today still only require a small
number of markers to complete. GBS, however, can readily generate tens of thousands of
usable markers which can be selectively filtered into the few required for a target
experiment. While statistical geneticists will always prefer to have as many markers as
possible, genomic selection models have diminishing returns on additional markers once
the population has reached the point of “marker saturation” (Jannink et al., 2010; Heffner
et al., 2011). On the other hand, for association mapping studies, additional markers
increase the likelihood of finding and tagging causal polymorphisms (Cockram et al.,
2010). The current limitation for the generated data is computational. There are new
algorithms and developments in cluster computing to provide the computational
resources needed to make these quantitative genetics questions more manageable
(Stanzione, 2011). Quantitative geneticists and bioinformatics personnel will be needed
to manage breeding data and develop models. At the same time, bioinformatics training
will become a more central component to any plant breeding and genetics curriculum.

Filling in the blanks

The “catch” to GBS, and sequence-based genotyping in general, is that datasets
often have a significant amount of missing data due to low coverage sequencing (Davey
et al., 2011). Biologically, missing genotyping calls in GBS datasets can be the result of
presence-absence variation, polymorphism in restriction sites, and/or differential
methylation. On the other hand, the technical issue of missing data with GBS is a
combination of 1) library complexity (i.e., number of unique sequence tags) and 2)
sequence coverage of the library.

Library complexity is directly related to the species’ genome under investigation
and the choice of enzyme(s) used for complexity reduction. Enzymes with a shorter
recognition site will naturally produce more fragments than those with a longer
recognition site. Methylation-sensitive enzymes will greatly reduce the number of
fragments in species with large portions of repetitive DNA. In barley, PstI-MspI libraries
generate around 500,000 - 600,000 unique tags while in wheat around 1.5M tags are
generated (J. Poland, unpublished). The actual number of sequence tags present in a raw
dataset is substantially higher partly due to allelic variants, but largely due to sequencing
errors, many of which can be non-random. This can and will generate many versions of
“unique” tags.
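
The two technical drivers of missing data named above, library complexity and sequence coverage, can be put into rough numbers with the sketch below. It rests on strong simplifying assumptions that are not from the source: recognition sites are counted as if the genome were random sequence with equal base frequencies (ignoring methylation sensitivity and repeats), and reads are assumed to sample tags uniformly so that the chance of a tag receiving zero reads follows a Poisson distribution. The lane output of 150 million reads and both function names are hypothetical.

```python
import math


def expected_sites(genome_bp, site_len):
    """Expected number of recognition sites of length site_len in random sequence."""
    return genome_bp / 4 ** site_len


def expected_missing_fraction(reads_per_lane, plex, n_tags):
    """Poisson probability that a given tag gets zero reads in one sample."""
    mean_depth = reads_per_lane / plex / n_tags  # average reads per tag per sample
    return math.exp(-mean_depth)


# Library complexity: shorter recognition sites give more cut sites.
barley_bp = 5.5e9                           # barley genome size quoted in the text
four_cutter = expected_sites(barley_bp, 4)  # 4-bp recognition site
six_cutter = expected_sites(barley_bp, 6)   # 6-bp recognition site

# Coverage: the same library sequenced at two multiplexing levels.
reads_per_lane = 150e6   # assumed sequencer output, not a figure from the text
barley_tags = 550_000    # midpoint of the 500,000 - 600,000 tags reported for barley
missing_96 = expected_missing_fraction(reads_per_lane, 96, barley_tags)
missing_384 = expected_missing_fraction(reads_per_lane, 384, barley_tags)
```

With these assumed inputs the expected missing fraction rises from roughly 6% at 96-plex to roughly 49% at 384-plex, which illustrates the trade-off between per-sample cost and missing data. Real libraries will show more missing data than this idealized model predicts because tags are not sampled uniformly.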

The level of missing data is based on the sequencing coverage. The sequencing
coverage is a function of the library complexity, multiplexing level, and the output of the
sequencing platform (Andolfatto et al., 2011). The multiplexing level and the number of
independent sequences generated from the sequencing platform will determine the
average number of reads per sample. Higher multiplexing levels will reduce the data per
sample, while increased sequencing output (when using the same multiplexing level) will
understandably increase the data per sample. One key component of GBS on different
sequencing platforms is the number of independent reads. Post-Sanger sequencing
platforms generally rely on a large number of short sequence reads to produce gigabases
of sequence data (Metzker, 2009). The new platforms are continually increasing the
sequencing output, a function of more and longer reads. For GBS, however, generating
longer reads is less advantageous than generating more reads. More sequence reads
provides more data per sample. Alternatively, increasing read numbers allows higher
multiplexing levels with static amounts of data per sample. For GBS, 10 Gb of sequence
data generated from 100M reads of 100 bp would be preferable to 10M reads of 1,000 bp.
While increasing the number of reads is clearly advantageous for GBS, longer reads are
also beneficial, leading to the discovery of more polymorphisms (particularly in species
with limited diversity) and assisting GBS applications in polyploids where secondary,
genome-specific polymorphisms are needed to differentiate a segregating SNP from
homeologous sequences on other genomes.

Missing data can be dealt with by 1) sequencing to higher depth or 2) imputing.
The logical approach to removing missing data is to sequence to a higher depth by
reducing the multiplexing level or sequencing the library multiple times. This can be very
effective (Figure 4) but has the drawback of increasing per sample cost. For important
association mapping panels or parents of a breeding program, however, the additional
investment to generate higher coverage of the tags is likely worthwhile. For breeding
applications using GBS with targeted selection, other approaches to minimize the impact
of missing data are preferable. Since a majority of the population will be discarded,
minimizing genotyping cost will take preference over minimizing missing data.

The second approach is imputation of missing data. Depending on the genome,
the type of GBS libraries, and the overall size of the datasets, imputation can give very
accurate results. There are many imputation algorithms (Marchini et al., 2007; Purcell et
al., 2007; Browning and Browning, 2007), most of which are targeted toward haplotype
reconstruction on a reference genome. Other approaches such as a Random Forest model
(Breiman, 2001) can be used to impute unordered markers (as is the situation in wheat).
Sequencing diverse, key individuals in the population (parents or representatives of
kinship clusters) can greatly improve imputation accuracy by defining known haplotypes
for the population.

Finally, a matrix of realized relationships among individuals in a breeding
population can be constructed without imputation. For very high-density genotyped data
generated by GBS, the marker coverage is sufficient to saturate the genomic linkage
disequilibrium present in most breeding programs. From this perspective, it is only
necessary to determine a pair-wise identity between individuals for the markers that are
present in both individuals. With high marker density, there will still be tens of thousands
of pair-wise comparisons between two individuals, well beyond the saturation point for
most elite breeding material. Imputation with the simple marker mean can still produce
accurate genomic selection prediction models. From a genomic selection perspective,
kinship-based marker imputation can be used to optimize the realized relationship matrix
in the presence of a high level of missing data (Poland et al., The Plant Genome,
concurrent submission). This approach has been shown to improve the relationship
estimates and give more accurate genomic selection model predictions.

Association mapping

GBS has the potential to be an excellent tool for genotyping of diverse panels for
association mapping (AM). One key to applying GBS for AM is addressing the
missing data problem. As noted previously, higher coverage sequencing will reduce the
amount of missing data at the expense of increased per sample costs. For a high-value
AM panel that will be well characterized, extensively phenotyped, and serve as a
community resource population, the additional cost of sequencing several times to
achieve high coverage is likely worth the investment. This will produce a very
well-characterized genetic population. At a high coverage, imputation of missing data will
become a very precise exercise, particularly on populations with extensive linkage
The Plant Genome: Published ahead of print 10 Sept. 2012; doi: 10.3835/plantgenome2012.05.0005
382
disequilibrium. Depending on the species under interrogation, the GBS markers will need
384
In such populations, GBS markers also have the advantage of being able to survey
383
to be ordered via a physical reference map or through genetic mapping.
385
multiple haplotypes on a fine scale. When two or more SNPs are within the same tag,
387
uncover these alleles. Array-based methods, particularly those applied to polyploid
389
duplicated sequence will indicate an allele call (for the ancestral allele) even if the target
391
discrimination between duplicated sequences. At higher sequencing coverage of the GBS
393
pool of sequenced tags.
395
Genomic Selection
397
is to create a low-cost genotyping platform capable of generating high-density genotypes.
386
these SNP alleles are both evaluated concurrently. For PAVs, GBS also has the power to
388
species, are limited in the ability to accurately survey PAVs as hybridization to a
390
locus is absent. Due to the context sequence accompanying a SNP, GBS enables
392
library, PAV can then be inferred by the absence of a given tag for a given sample in the
394 396
398
In the field of plant breeding, an important objective in the development of GBS
For genomic selection in crop species, breeders need a fast, inexpensive, flexible method that will enable genotyping of large populations of selection candidates. A majority of the selection candidates are then discarded, creating a situation that benefits greatly from low-cost genotyping. GBS is quickly expanding to fill those requirements.

Genomic selection (GS) was proposed in 2001 by Meuwissen et al. as an approach to capture the full complement of small-effect loci in genomic prediction models. GS takes advantage of dense genome-wide molecular markers by simultaneously fitting effects to all markers and avoiding statistical testing. By utilizing these GS models, breeders are able to predict the performance of new experimental lines at early generations and generate suggested crosses and selections based on the model predictions (Jannink et al., 2010). Combined with a fast turn-around on generations, selection based on predicted breeding values determined by marker data provided by GBS could greatly increase gains in plant breeding programs (Meuwissen et al., 2001; Jannink et al., 2010).

The advantage of GBS for GS in breeding programs is the low per-sample cost needed for generating tens to hundreds of thousands of molecular markers. Poland et al. (2012, The Plant Genome, concurrent submission) have demonstrated the suitability of GBS markers for developing GS models in the complex wheat genome. They were able to demonstrate prediction accuracies for yield and other agronomic traits that are high enough to be suitable for breeding applications. The GBS markers also showed a significant improvement in the attained prediction accuracy over a previously used array of hybridization-based markers. The important finding of this work is its practical implications in breeding. The training population was genotyped without a priori knowledge of the population or SNPs, and the per-sample cost was below $20.
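The idea of fitting effects to all markers simultaneously, without testing markers one at a time, can be sketched with ridge regression, which is one common form of GS model. This is an illustrative sketch only and not the model used in the studies cited above. The marker coding (-1, 0, 1), the shrinkage parameter `lam`, the toy data, and the function names are all assumptions.

```python
from typing import List


def ridge_marker_effects(X: List[List[float]], y: List[float], lam: float) -> List[float]:
    """Solve (X'X + lam*I) b = X'y for all marker effects at once."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X + lam*I | X'y]; lam > 0 keeps it solvable even when p > n.
    A = [
        [sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0) for k in range(p)]
        + [sum(X[i][j] * y[i] for i in range(n))]
        for j in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (A[r][p] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return b


def predict(X_new: List[List[float]], effects: List[float]) -> List[float]:
    """Predicted genetic value = sum of marker genotype times estimated effect."""
    return [sum(x * e for x, e in zip(row, effects)) for row in X_new]


# Hypothetical training set: 3 lines, 4 markers (more markers than lines).
train_markers = [[1.0, -1.0, 0.0, 1.0], [-1.0, 1.0, 1.0, 0.0], [0.0, 1.0, -1.0, -1.0]]
train_phenotypes = [1.2, -0.4, -0.8]  # assumed to be centered
effects = ridge_marker_effects(train_markers, train_phenotypes, lam=1.0)
print(predict([[1.0, 0.0, 0.0, 1.0]], effects))  # prediction for an unphenotyped candidate
```

The shrinkage term is what allows more markers than phenotyped lines, which is the usual situation with tens of thousands of GBS markers.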
Putting GBS to work

Looking forward, high-density markers from NGS will soon be applied to almost every genomic problem. These marker datasets are low-cost and dynamic, with data and genotyping results getting more robust and economical each year. GBS has been shown to be a valid tool for genetic mapping (Baird et al., 2008; Elshire et al., 2011; Poland et al., 2012), breeding applications (Poland et al., concurrent submission), and diversity studies (Fu, 2012; Lu et al., 2012). The ability to quickly generate robust datasets without considerable prior effort for marker discovery is quickly dispelling issues that have plagued researchers working with obscure or foreign species: a lack of defined and specific genetic tools for genome analysis (Allendorf et al., 2010). GBS is an ideal platform for studies ranging from quickly identifying single gene markers to whole genome profiling of association panels.

Perhaps one of the most exciting applications of GBS will be in the field of plant breeding. Theoretical and preliminary studies on genomic selection show great promise for accelerating the rate of developing new improved varieties. GBS is providing a rapid and low-cost tool for genotyping these populations, allowing breeders to implement GS on a large scale in their breeding programs. Current developments in sequencing output will drive per-sample cost below $10. Further, there is no requirement for a priori knowledge of the species, as the GBS methods have been shown to be robust across a range of species, and SNP discovery and genotyping are completed together. This is a very important feature for moving genomics-assisted breeding into orphan crops with understudied genomes and commercial crops with large and complex genomes. Challenges remaining include data management as well as modeling genotype-by-environment interactions, though the future looks promising. Genomic selection via GBS stands to be a major supplement to traditional crop development. The potential for GBS data to improve breeding systems through GS is enormous.

Application of sequence-based genotyping for a whole range of diversity and genomic studies will have an important place well into the future. Driven by applications across the whole spectrum of human, microbial, plant, and animal genomics, developments in next-generation sequencing and genomics platforms must be put to use for plant breeding and genetics studies.

ACKNOWLEDGMENTS

USDA-ARS and the USDA-NIFA funded Triticeae Coordinated Agriculture Project (T-CAP) (2011-68002-30029) provided support for T. Rife. This manuscript was greatly improved by the helpful comments of two anonymous reviewers. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.

REFERENCES

Allendorf, F.W., P.A. Hohenlohe, and G. Luikart. 2010. Genomics and the future of conservation genetics. Nat. Rev. Genet. 11:697–709.
Altshuler, D., V.J. Pollara, C.R. Cowles, W.J. Van Etten, J. Baldwin, L. Linton, and E.S. Lander. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513–516.
Andolfatto, P., D. Davison, D. Erezyilmaz, T.T. Hu, J. Mast, T. Sunayama-Morita, and D.L. Stern. 2011. Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 21:610–617.
Arumuganathan, K., and E.D. Earle. 1991. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9:208–218.
Ashelford, K., M.E. Eriksson, C.M. Allen, R. D’Amore, M. Johansson, P. Gould, S. Kay, A.J. Millar, N. Hall, and A. Hall. 2011. Full genome re-sequencing reveals a novel circadian clock mutation in Arabidopsis. Genome Biol. 12:R28.
Baird, N.A., P.D. Etter, T.S. Atwood, M.C. Currey, A.L. Shiver, Z.A. Lewis, E.U. Selker, W.A. Cresko, and E.A. Johnson. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376.
Breiman, L. 2001. Random forests. Machine Learning 45:5–32.
Browning, S.R., and B.L. Browning. 2007. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81:1084–1097.
Byers, R.L., D.B. Harker, S.M. Yourstone, P.J. Maughan, and J.A. Udall. 2012. Development and mapping of SNP assays in allotetraploid cotton. Theor. Appl. Genet. 124:1201–1214.
Chia, J.-M., C. Song, P.J. Bradbury, D. Costich, N. de Leon, J. Doebley, R.J. Elshire, B. Gaut, L. Geller, J.C. Glaubitz, M. Gore, K.E. Guill, J. Holland, M.B. Hufford, J. Lai, M. Li, X. Liu, Y. Lu, R. McCombie, R. Nelson, J. Poland, B.M. Prasanna, T. Pyhäjärvi, T. Rong, R.S. Sekhon, Q. Sun, M.I. Tenaillon, F. Tian, J. Wang, X. Xu, Z. Zhang, S.M. Kaeppler, J. Ross-Ibarra, M.D. McMullen, E.S. Buckler, G. Zhang, Y. Xu, and D. Ware. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat. Genet. 44:803–807.
Cockram, J., J. White, D.L. Zuluaga, D. Smith, J. Comadran, M. Macaulay, Z. Luo, M.J. Kearsey, P. Werner, D. Harrap, C. Tapsell, H. Liu, P.E. Hedley, N. Stein, D. Schulte, B. Steuernagel, D.F. Marshall, W.T.B. Thomas, L. Ramsay, I. Mackay, D.J. Balding, R. Waugh, and D.M. O’Sullivan. 2010. Genome-wide association mapping to candidate polymorphism resolution in the unsequenced barley genome. Proc. Natl. Acad. Sci. U. S. A. 107:21611–21616.
Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J. Corneveaux, T.L. Pawlowski, T. Laub, G. Nunn, D.A. Stephan, N. Homer, and M.J. Huentelman. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887–893.
Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 36:e122.
Davey, J.W., P.A. Hohenlohe, P.D. Etter, J.Q. Boone, J.M. Catchen, and M.L. Blaxter. 2011. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12:499–510.
Deschamps, S., M. la Rota, J.P. Ratashak, P. Biddle, D. Thureen, A. Farmer, S. Luck, M. Beatty, N. Nagasawa, L. Michael, V. Llaca, H. Sakai, G. May, J. Lightner, and M.A. Campbell. 2010. Rapid Genome-wide Single Nucleotide Polymorphism Discovery in Soybean and Rice via Deep Resequencing of Reduced Representation Libraries with the Illumina Genome Analyzer. The Plant Genome 3:53–68.
Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379.
Fu, Y.-B. 2012. Genotyping-by-sequencing: a Case Study in Barley. In Plant and Animal Genome XX.
Futschik, A., and C. Schlötterer. 2010. The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186:207–218.
Gan, X., O. Stegle, J. Behr, J.G. Steffen, P. Drewe, K.L. Hildebrand, R. Lyngsoe, S.J. Schultheiss, E.J. Osborne, V.T. Sreedharan, A. Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E.J. Belfield, N.P. Harberd, E. Kemen, C. Toomajian, P.X. Kover, R.M. Clark, G. Rätsch, and R. Mott. 2011. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477:419–423.
Geraldes, A., J. Pang, N. Thiessen, T. Cezard, R. Moore, Y. Zhao, A. Tam, S. Wang, M. Friedmann, I. Birol, S.J.M. Jones, Q.C.B. Cronk, and C.J. Douglas. 2011. SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome resequencing. Mol. Ecol. Resour. 11:81–92.
Gore, M.A., J.M. Chia, R.J. Elshire, Q. Sun, E.S. Ersoz, B.L. Hurwitz, J.A. Peiffer, M.D. McMullen, G.S. Grills, and J. Ross-Ibarra. 2009a. A first-generation haplotype map of maize. Science 326:1115–1117.
Gore, M.A., M.H. Wright, E.S. Ersoz, P. Bouffard, E.S. Szekeres, T.P. Jarvie, B.L. Hurwitz, A. Narechania, T.T. Harkins, G.S. Grills, D.H. Ware, and E.S. Buckler. 2009b. Large-Scale Discovery of Gene-Enriched SNPs. The Plant Genome 2:121–133.
Hamblin, M.T., T.J. Close, P.R. Bhat, S. Chao, J.G. Kling, K.J. Abraham, T. Blake, W.S. Brooks, B. Cooper, C.A. Griffey, P.M. Hayes, D.J. Hole, R.D. Horsley, D.E. Obert, K.P. Smith, S.E. Ullrich, G.J. Muehlbauer, and J.-L. Jannink. 2010. Population Structure and Linkage Disequilibrium in U.S. Barley Germplasm: Implications for Association Mapping. Crop Sci. 50:556–566.
Harper, A.L., M. Trick, J. Higgins, F. Fraser, L. Clissold, R. Wells, C. Hattori, P. Werner, and I. Bancroft. 2012. Associative transcriptomics of traits in the polyploid crop species Brassica napus. Nat. Biotechnol. 30:798–802.
Heffner, E.L., J.-L. Jannink, and M.E. Sorrells. 2011. Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. The Plant Genome 4:65–75.
Hohenlohe, P.A., S.J. Amish, J.M. Catchen, F.W. Allendorf, and G. Luikart. 2011. Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Mol. Ecol. Resour. 11:117–122.
Huang, X., Q. Feng, Q. Qian, Q. Zhao, L. Wang, A. Wang, J. Guan, D. Fan, Q. Weng, T. Huang, G. Dong, T. Sang, and B. Han. 2009. High-throughput genotyping by whole-genome resequencing. Genome Res. 19:1068–1076.
Huang, X., X. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C. Li, C. Zhu, T. Lu, Z. Zhang, M. Li, D. Fan, Y. Guo, A. Wang, L. Wang, L. Deng, W. Li, Y. Lu, Q. Weng, K. Liu, T. Huang, T. Zhou, Y. Jing, W. Li, Z. Lin, E.S. Buckler, Q. Qian, Q.-F. Zhang, J. Li, and B. Han. 2010. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 42:961–967.
Hyten, D.L., Q. Song, E.W. Fickus, C.V. Quigley, J.-S. Lim, I.-Y. Choi, E.-Y. Hwang, M. Pastor-Corrales, and P.B. Cregan. 2010. High-throughput SNP discovery and assay development in common bean. BMC Genomics 11:475.
Jannink, J.-L., A.J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: from theory to practice. Briefings Funct. Genomics 9:166–177.
Jiao, Y., H. Zhao, L. Ren, W. Song, B. Zeng, J. Guo, B. Wang, Z. Liu, J. Chen, W. Li, M. Zhang, S. Xie, and J. Lai. 2012. Genome-wide genetic changes during modern breeding of maize. Nat. Genet. 44:812–815.
Lu, F., A.E. Lipka, R.J. Elshire, J. Glaubitz, J. Cherney, M. Casler, E.S. Buckler, and D. Costich. 2012. Characterization of the Genetic Diversity of Switchgrass Using Genotyping by Sequencing. In Plant and Animal Genome XX.
Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39:906–913.
Mardis, E.R. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24:133–141.
Metzker, M. 2009. Sequencing technologies - the next generation. Nat. Rev. Genet. 11:31–46.
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829.
Miller, M.R., J.P. Dunham, A. Amores, W.A. Cresko, and E.A. Johnson. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res. 17:240–248.
Morrell, P.L., E.S. Buckler, and J. Ross-Ibarra. 2011. Crop genomics: advances and applications. Nat. Rev. Genet. 13:85–96.
Myles, S., J. Peiffer, P.J. Brown, E.S. Ersoz, Z. Zhang, D.E. Costich, and E.S. Buckler. 2009. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell 21:2194–2202.
Nelson, J.C., S. Wang, Y. Wu, X. Li, G. Antony, F.F. White, and J. Yu. 2011. Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum. BMC Genomics 12:352.
Nielsen, R., J.S. Paul, A. Albrechtsen, and Y.S. Song. 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12:443–451.
van Orsouw, N.J., R.C.J. Hogers, A. Janssen, F. Yalcin, S. Snoeijers, E. Verstege, H. Schneiders, H. van der Poel, J. van Oeveren, H. Verstegen, and M.J.T. van Eijk. 2007. Complexity reduction of polymorphic sequences (CRoPS): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS One 2:e1172.
Poland, J.A., P.J. Brown, M.E. Sorrells, and J.-L. Jannink. 2012. Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. PLoS One 7:e32253.
Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81:559–575.
Stanzione, D. 2011. The iPlant Collaborative: Cyberinfrastructure to Feed the World. Computer 44:44–52.
van Tassell, C.P., T.P.L. Smith, L.K. Matukumalli, J.F. Taylor, R.D. Schnabel, C.T. Lawley, C.D. Haudenschild, S.S. Moore, W.C. Warren, and T.S. Sonstegard. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods 5:247–252.
Truong, H.T., A.M. Ramos, F. Yalcin, M. de Ruiter, H.J.A. van der Poel, K.H.J. Huvenaars, R.C.J. Hogers, L.J.G. van Enckevort, A. Janssen, N.J. van Orsouw, and M.J.T. van Eijk. 2012. Sequence-based genotyping for marker discovery and codominant scoring in germplasm and populations. PLoS One 7:e37565.
Wang, S., E. Meyer, J.K. McKay, and M.V. Matz. 2012. 2b-RAD: a simple and flexible method for genome-wide genotyping. Nat. Methods 9:808–810.
Wang, X., H. Wang, J. Wang, R. Sun, J. Wu, et al. 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43:1035–1039.
Wiedmann, R.T., T.P.L. Smith, and D.J. Nonneman. 2008. SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet. 9:81.
Xie, W., Q. Feng, H. Yu, X. Huang, Q. Zhao, Y. Xing, S. Yu, B. Han, and Q. Zhang. 2010. Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc. Natl. Acad. Sci. U. S. A. 107:10578–10583.
Xu, X., X. Liu, S. Ge, J.D. Jensen, F. Hu, X. Li, Y. Dong, R.N. Gutenkunst, L. Fang, L. Huang, J. Li, W. He, G. Zhang, X. Zheng, F. Zhang, Y. Li, C. Yu, K. Kristiansen, X. Zhang, J. Wang, M. Wright, S. McCouch, R. Nielsen, J. Wang, and W. Wang. 2012. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 30:105–111.
Xu, X., S. Pan, S. Cheng, B. Zhang, D. Mu, et al. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195.
Yang, H., Y. Tao, Z. Zheng, C. Li, M. Sweetingham, and J. Howieson. 2012. Application of next-generation sequencing for rapid marker development in molecular plant breeding: a case study on anthracnose disease resistance in Lupinus angustifolius L. BMC Genomics 13:318.
You, F.M., N. Huo, K.R. Deal, Y.Q. Gu, M.-C. Luo, P.E. McGuire, J. Dvorak, and O.D. Anderson. 2011. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59.

FIGURES

Figure 1 – A comparison of actual sequencing capacity (orange) to what would be expected if sequencing technology were following Moore’s Law (blue). The significant decrease in 2007 coincides roughly with the introduction of next-generation sequencing technology. Data from National Human Genome Research Institute (http://www.genome.gov/sequencingcosts/).

Figure 2 – Schematic overview of steps in GBS library construction, sequencing, and analysis. 1) Genomic DNA (gDNA) is quantified using a fluorescence-based method. 2) gDNA is normalized in a new plate. Normalization is needed to ensure equal representation of all samples and equal molarity of gDNA and adapters. 3) A master mix with restriction enzyme(s) and buffer is added to the plate and incubated. 4) DNA barcoded adapters are added along with ligase and ligation buffers. 5) Samples are pooled and cleaned. 6) The GBS library is PCR amplified. 7) The amplified library is cleaned and evaluated on a capillary sizing system. 8) Libraries are ready to sequence. Data analysis: Following a sequencing run, FASTQ files containing raw data from the run are used to parse sequencing reads to samples using the DNA barcode sequence. Once assigned to individual samples, the reads are aligned to a reference genome. In the case of species without a complete reference genomic sequence, reads are internally aligned (alignment of all sequence reads with all other reads from that library) and SNPs are identified from 1 or 2 bp sequence mismatches. Various filtering algorithms can then be used to distinguish true bi-allelic SNPs from sequencing errors.
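The first data analysis step in Figure 2, parsing reads to samples by barcode, can be sketched as follows. This is a minimal illustration, not the TASSEL pipeline: it assumes in-line barcodes at the start of each read, exact matching only, and a four-line FASTQ record layout. The barcode sequences, sample names, and function name are hypothetical.

```python
from typing import Dict, List, Tuple


def demultiplex(fastq_lines: List[str], barcodes: Dict[str, str]) -> Tuple[Dict[str, List[str]], int]:
    """Assign reads to samples by exact barcode match at the start of the read.

    barcodes maps barcode sequence -> sample name. Returns the barcode-trimmed
    reads per sample and the number of reads that matched no barcode.
    """
    reads_by_sample: Dict[str, List[str]] = {sample: [] for sample in barcodes.values()}
    unassigned = 0
    # Try longer barcodes first so a short barcode cannot mask a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    # FASTQ records span four lines: header, sequence, '+', qualities.
    for i in range(0, len(fastq_lines) - 3, 4):
        sequence = fastq_lines[i + 1].strip()
        for barcode in ordered:
            if sequence.startswith(barcode):
                reads_by_sample[barcodes[barcode]].append(sequence[len(barcode):])
                break
        else:
            unassigned += 1
    return reads_by_sample, unassigned


# Hypothetical barcodes and reads for illustration.
example_barcodes = {"ACGT": "line_A", "TTAGC": "line_B"}
example_fastq = [
    "@read1", "ACGTTGCAGG", "+", "IIIIIIIIII",
    "@read2", "TTAGCTGCAA", "+", "IIIIIIIIII",
    "@read3", "GGGGGGGGGG", "+", "IIIIIIIIII",
]
assigned, n_unassigned = demultiplex(example_fastq, example_barcodes)
print(assigned, n_unassigned)
```

A production pipeline would typically also tolerate sequencing errors in the barcode and check for the expected restriction site remnant after it; both are omitted here for brevity.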
Figure 3 – Integration of genotyping-by-sequencing in the context of plant breeding and genomics for a species without a completed reference genome.
Figure 4 – Removal of missing data in genotyping-by-sequencing by increasing coverage of the library via re-sequencing. In a set of international wheat breeding germplasm, several lines (samples) were replicated across two or more libraries. Replicating a sample two times increased the coverage of SNPs to 60%, while five replications increased the coverage to over 90%. While very effective as a means to remove missing data, replicated sequencing increases the per-sample cost. The average per-sample cost is $15. In this situation for wheat, the number of replications is roughly equivalent to the sequencing coverage of the library (i.e., 5 replications ~ 5X coverage). Data from Poland et al. (unpublished).
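The trade-off in Figure 4 between missing data and cost can be explored with a simple sampling model. The sketch below assumes that the number of reads covering a SNP in one replicate follows a Poisson distribution, so the chance of observing the SNP at least once in r replicates is 1 - exp(-rate * r). The Poisson assumption, the calibration of the rate to the 60% value reported for two replicates, and the treatment of $15 as the cost of each replicate are all assumptions made for illustration, not results from the study.

```python
import math


def rate_from_observation(replicates: int, observed_fraction: float) -> float:
    """Per-replicate Poisson rate implied by an observed SNP call rate."""
    return -math.log(1.0 - observed_fraction) / replicates


def expected_call_rate(replicates: int, rate: float) -> float:
    """Probability that a SNP is covered by at least one read across replicates."""
    return 1.0 - math.exp(-rate * replicates)


COST_PER_REPLICATE = 15.0  # assumed: the caption's average per-sample cost applies per replicate

# Calibrate to the caption: two replicates gave 60% SNP coverage.
rate = rate_from_observation(2, 0.60)
for r in (1, 2, 3, 5):
    print(r, round(expected_call_rate(r, rate), 3), r * COST_PER_REPLICATE)
```

Under these assumptions the model predicts a call rate of roughly 0.90 at five replicates, close to the value reported in the caption, while the cost grows linearly with the number of replicates.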

TABLES

Table 1 – A technical comparison of current genotyping methods utilizing next-generation sequencing of multiplex barcoded libraries. Adapted from Wang et al. (2012).

Flavors of genotyping using next-generation sequencing of multiplex DNA-barcoded reduced-representation libraries

Method | Random shearing | Size selection | Fragment size | Enzymes [1] | Multiplexing level [2] | Analysis tool(s)
Multiplex Shotgun Genotyping (MSG) | No | Yes | size selected | MseI | 96 (up to 384) | Burrows-Wheeler Alignment tool
Restriction site Associated DNA (RAD-seq) | Yes | Yes | size selected | SbfI, EcoRI | 96 | Custom Perl scripts
Double Digest RADseq (ddRADseq) | No | Yes | size selected | EcoRI-MspI | 48 [3] | MUSCLE
2b-RAD | No | No | 33-36 bp | BsaXI [4] | NA | Custom Perl scripts
Genotyping-by-Sequencing (GBS) | No | No | < 350 bp | ApeKI [5] | 48 (up to 384) | TASSEL
Genotyping-by-Sequencing (GBS) – two enzyme | No | No | < 350 bp | PstI-MspI | 48 (up to 384) | TASSEL
Sequence-Based Genotyping (SBG) | No | Yes | size selected | EcoRI-MseI, PstI-TaqI | 32 | Burrows-Wheeler Alignment tool; Unified Genotyper
Restriction Enzyme Sequence Comparative Analysis (RESCAN) | No | Yes | size selected | MseI, NlaIII | NA [6] | Burrows-Wheeler Alignment tool; Samtools

[1] All of these approaches can utilize different enzymes. Shown are the enzyme(s) used in the initial study.
[2] All of these methods have the possibility to increase the number of multiplexed samples using more unique barcodes. The multiplex level is the number of samples reported in the first paper. Given in parentheses are subsequent increases.
[3] Combinatorial barcoding is possible, placing a barcode on each end of the DNA fragment. Using a set of 48 adapter P1 barcodes and 12 PCR2 indices, it is possible to uniquely label 576 individuals [48 (adapter P1 barcodes) x 12 (PCR2 indices)]. This method would require paired-end sequencing.
[4] Uses type IIB restriction endonucleases.
[5] Has been successfully applied using PstI and HindIII (Buckler et al., personal communication).
[6] 96-plexing reported but unpublished.