The Plant Genome: Published ahead of print 10 Sept. 2012; doi: 10.3835/plantgenome2012.05.0005

Genotyping-by-sequencing for plant breeding and genetics

Jesse A. Poland* and Trevor W. Rife

J.A. Poland, United States Department of Agriculture – Agricultural Research Service, Hard Winter Wheat Genetics Research Unit and Department of Agronomy, Kansas State University, 4008 Throckmorton Hall, Manhattan KS, 66506; T.W. Rife, Interdepartmental Genetics, Kansas State University, 4024 Throckmorton Hall, Manhattan KS, 66506

* Corresponding author: [email protected]

Abbreviations: GBS, genotyping-by-sequencing; GS, genomic selection; NGS, next-generation sequencing


Abstract

Rapid advances in “next-generation” DNA sequencing technology have brought the $1000 human genome within reach while providing the raw sequencing output for researchers to revolutionize the way populations are genotyped. To capitalize on these advancements, genotyping-by-sequencing (GBS) has been developed as a rapid and robust approach for reduced-representation sequencing of multiplexed samples that combines genome-wide molecular marker discovery and genotyping. The flexibility and low cost of GBS make this an excellent tool for many applications and research questions in plant genetics and breeding. Here we address some of the new research opportunities that are becoming more feasible with GBS. Further, we highlight areas where GBS will become more powerful with the continued increase of sequencing output, development of reference genomes, and improved bioinformatics. The ultimate goal of plant biology scientists is to connect phenotype to genotype. In plant breeding the genotype can then be used to predict phenotypes and select improved cultivars. Furthering our understanding of the connection between heritable genetic factors and the resulting phenotypes will enable genomics-assisted breeding to exist on the scale needed to increase global food supplies in the face of decreasing arable land and climate change.


“Next-generation” genotyping

Driven by the quest for a $1000 human genome, rapid advances in next-generation sequencing (NGS) output have provided technology with the ability to greatly transform the way we think about plant genomics and breeding. With the introduction of massively parallel sequencing, raw sequencing output is doubling roughly every 6 months (Figure 1). The availability of inexpensive sequencing technology has transformed the way genomes are sequenced (Xu et al., 2011; Wang et al., 2011), polymorphisms are discovered (Mardis, 2008; Futschik and Schlötterer, 2010; You et al., 2011; Nielsen et al., 2011), gene expression is analyzed (Geraldes et al., 2011; Harper et al., 2012), and populations are genotyped (Baird et al., 2008; Elshire et al., 2011; Davey et al., 2011; Truong et al., 2012; Poland et al., 2012; Wang et al., 2012). Sequencing is rapidly becoming so inexpensive that it will soon be reasonable to use it for every genetic study. NGS applications have the potential to revolutionize the field of plant genomics and the practice of applied plant breeding.

One of the primary objectives of functional genomics in agricultural species is to connect phenotype to genotype and use this knowledge to make phenotypic predictions and select improved plant types. To do this on a genome-wide scale requires large populations with dense molecular markers across the genome. To put the power of NGS to work for plant breeding and genomics, new approaches for sequence-based genotyping have been developed. One promising approach is genotyping-by-sequencing (GBS), which uses enzyme-based complexity reduction (using restriction endonucleases to target only a small portion of the genome) coupled with DNA barcoded adapters to produce multiplex libraries of samples ready for NGS. This approach has been demonstrated to be robust across a range of species and capable of producing tens to hundreds of thousands of molecular markers (Elshire et al., 2011; Poland et al., 2012). The flexibility of GBS with regard to species as well as populations and research objectives makes this an ideal tool for plant genetics studies. As the phenomenal increase in NGS output continues, many research questions that were once out of reach will be resolved through the application of these approaches.
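How barcoded adapters allow pooled samples to be separated again after sequencing can be sketched as follows. This is a minimal illustration and not the published GBS pipeline: the barcode sequences and sample names are hypothetical, exact matching is assumed (production pipelines typically tolerate sequencing errors), and the cut-site remnant is the one expected after a PstI digest.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; real barcode sets vary in length.
BARCODES = {"ACGT": "sample_A", "TGCAA": "sample_B"}
# Assumed remnant of the restriction site that follows the barcode (PstI).
CUT_SITE_REMNANT = "TGCAG"


def demultiplex(reads, barcodes=BARCODES, remnant=CUT_SITE_REMNANT):
    """Assign each read to a sample by its leading barcode.

    A read is kept only if a known barcode is followed immediately by the
    cut-site remnant. The returned tags have both stripped off.
    """
    tags_by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            prefix = barcode + remnant
            if read.startswith(prefix):
                tags_by_sample[sample].append(read[len(prefix):])
                break  # a read belongs to at most one sample
    return dict(tags_by_sample)
```

Reads that do not begin with a known barcode plus the remnant are simply dropped in this sketch, which mirrors the idea that only properly ligated fragments contribute tags.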

All in one

The two key components for genotyping germplasm are finding DNA sequence polymorphisms and assaying the markers across a full set of material. Classically, this has been a two-step process involving marker discovery followed by assay design and genotyping. An important strength of sequence-based genotyping approaches is that the marker discovery and genotyping are completed at the same time. This facilitates exploration of new germplasm sets or even new species without the upfront effort of discovering and characterizing polymorphisms. Another key component of GBS datasets is that the raw data is dynamic. The raw sequences obtained from GBS can be re-analyzed, uncovering further information (e.g. new polymorphisms, annotated genes, etc.) as bioinformatics techniques improve, reference genomes develop, and the collection of sequence data increases. Each of these factors adds additional value to the same raw dataset.

One of the first and broadly adopted applications for utilizing NGS was for single nucleotide polymorphism (SNP) and presence/absence variation (PAV) discovery in diverse populations with and without reference genomes (Baird et al., 2008; Wiedmann et al., 2008; Gore et al., 2009a,b; Huang et al., 2009; Deschamps et al., 2010; Hyten et al., 2010; You et al., 2011; Nelson et al., 2011; Hohenlohe et al., 2011; Byers et al., 2012). These studies have focused on assaying a few key genotypes with a reduced-representation approach (Baird et al., 2008) or with whole-genome re-sequencing (Huang et al., 2009). While highly effective for SNP discovery, this approach is limited in the number of lines assayed and does not simultaneously assay the markers across the full population of interest.

The key objective of the GBS approach, therefore, is not merely to discover polymorphisms and then transfer these to a fixed assay, but to simultaneously discover polymorphisms and obtain genotypic information across the whole population of interest. It is this combined one-step approach that makes GBS a truly rapid and flexible platform for a range of species and germplasm sets and perfectly suited for genomic selection in plant breeding programs. As sequencing output continues to increase, GBS will evolve first to lower levels of complexity reduction (to capture more sequence variants) and then to whole-genome re-sequencing (to capture all variants). Whole genome re-sequencing has been applied in Arabidopsis, rice, and maize (Huang et al., 2009; Ashelford et al.,
2011; Gan et al., 2011; Chia et al., 2012; Jiao et al., 2012; Xu et al., 2012), though it quickly becomes less manageable with larger, more complex genomes that lack a solid reference genome (Morrell et al., 2011). The level of multiplexing has also been limited in this approach, increasing per sample cost.

As GBS can be readily used for de novo discovery and application of new molecular polymorphisms, it is particularly powerful for new sets of germplasm and uncharacterized species. In many ways the greatest advantage of sequence-based genotyping approaches is reducing ascertainment bias associated with marker discovery in panels differing from the target population. This is an obvious advantage for association studies where differing allele frequencies greatly influence the power and precision of the study (Myles et al., 2009; Hamblin et al., 2010). For breeding applications, informative polymorphisms can be discovered as novel germplasm is introduced into the breeding pool. The use of an unrepresentative marker panel in surveying molecular diversity is highly problematic for getting a true representation of molecular diversity present in a target population. Most GBS approaches utilize methylation-sensitive enzymes. If these enzymes target differentially methylated regions of the genome, ascertainment bias could potentially be introduced in different sets of germplasm, but evidence for this has yet to be seen. While markers discovered with GBS should have little bias across sets of germplasm, it is also unknown how uniformly they are spaced across the genome. Evidence from Poland et al. (2012), however, indicated that GBS markers were uniformly spaced across the chromosomes of both wheat and barley.

Many flavors

The use of reduced-representation sequencing for targeting small portions of the genome was first demonstrated by Altshuler et al. (2000). This approach was then later combined with NGS and DNA barcoded adapters to sequence multiplex libraries in parallel. There are many variations of this approach and GBS is one specific method for “genotyping using next-generation sequencing of multiplex DNA-barcoded reduced-representation libraries” (Table 1). Further, the combination of enzymes that can be employed for complexity reduction is almost endless. Davey et al. (2011) have thoroughly
reviewed several approaches of complexity reduction including complexity reduction of polymorphic sequences (CRoPS™) (van Orsouw et al., 2007) and deep sequencing of reduced representation libraries (van Tassell et al., 2008).

The use of restriction enzymes for targeted reduction of genome complexity combined with NGS was first described by Baird et al. (2008) and termed Restriction-site Associated DNA (RAD). RAD methods use a restriction enzyme to generate genomic fragments which are then ligated to an adapter containing a forward primer for amplification, sequencing platform primer sites, and a unique DNA barcode that enables sample multiplexing (Baird et al., 2008; Craig et al., 2008; Cronn et al., 2008). The samples are pooled, randomly sheared, and size-selected to create a uniform collection of similarly-sized DNA fragments (Baird et al., 2008). The fragments are then ligated to a Y adapter that ensures only fragments containing the first adapter will be amplified (Baird et al., 2008). RAD markers provided a robust method to discover polymorphisms and map variation in a population (Miller et al., 2007).

First generation RAD analysis had similar drawbacks to older restriction enzyme-based marker technologies: the requirement of species-specific arrays, a hybridization for every comparison, and a limitation to presence/absence variation assays (Baird et al., 2008). Combining the progressive features of RAD with NGS, however, resulted in the discovery of new markers at a significantly decreased cost (Baird et al., 2008). The simultaneous discovery of SNP markers during RAD sequencing facilitated robust mapping of many polymorphisms and precise assignment of chromosomal regions to mapping parents, allowing for detection of recombination locations. The RAD approach has recently been modified to utilize restriction enzymes that cut upstream and downstream of a target site (Wang et al., 2012). This new methodology produces uniform length tags, allows nearly all of the restriction sites to be surveyed, and permits marker intensity adjustment (Wang et al., 2012). The next flavor of sequence-based genotyping was multiplexed shotgun genotyping (MSG), which required only one gel purification, eliminated DNA shearing, required less starting DNA, and implemented a Hidden Markov Model (HMM) to determine points of chromosomal recombination (Andolfatto et al., 2011). MSG employed a single common cutting restriction enzyme and produced a limited complexity reduction suitable for the smaller genome (~130 Mb) of Drosophila
simulans (Andolfatto et al., 2011). In the context of a good reference genome, the HMM imputation approach was highly effective for tracing parental origin and defining recombination break points (Andolfatto et al., 2011).

The original GBS protocol was developed to simplify and streamline the construction of RAD libraries (Elshire et al., 2011). The strength of the GBS protocol is its simplicity: utilizing inexpensive adapters, allowing pooled library construction, and avoiding shearing and size selection (Figure 2). The GBS approach removed the need for size selection by using a short PCR extension of the multiplexed library. Instead of the Y-adapters used in the RAD protocol, the original GBS protocol utilized a single restriction enzyme, a barcoded adapter, and a common adapter (Elshire et al., 2011). Although all combinations of adapters can ligate to the DNA fragments, only those that carry one barcoded adapter and one common adapter can be amplified and sequenced (Davey et al., 2011).

The original GBS approach was recently extended to a two enzyme version that combines a rare- and a common-cutting restriction enzyme to generate uniform libraries consisting of a forward (barcoded) adapter and a reverse (Y) adapter on alternate ends of each fragment (Poland et al., 2012). The use of two enzymes in this GBS approach enables the capture of most fragments associated with the rare-cutting enzyme. The employment of a Y-adapter on the common restriction site avoids amplification of more common fragments, a preferable situation for larger, more complex genomes. The GBS approach has been successfully applied in several species including cotton (Gossypium hirsutum), oats (Avena sativa), sorghum (Sorghum bicolor) and rice (Oryza sativa) with little to no change in protocol (Jesse Poland, unpublished).

The options for tailoring GBS to any species or desired application are almost endless. A range of enzymes including ApeKI, PstI and HindIII have been evaluated in maize with success in varying the level of complexity reduction (Edward Buckler, personal communication). With a varied level of complexity reduction, it is possible to increase coverage of a target genome or increase the multiplexing level of a target population. The interplay of these two factors will determine the optimal approach for the species under investigation. For species with large genomes or no reference genome, the use of rare-cutting restriction enzymes (i.e. 6 bp or greater target site) with methylation
sensitivity can assist in creating a higher level of complexity reduction by targeting fewer sites. This will lead to higher sampling depth of the same genomic sites and reduce the amount of missing data.

Hand-in-hand with the reference genome

Sequence-based genotyping greatly benefits from a well-characterized (sequenced) reference genome. A reference genome makes ordering and imputing low coverage marker data generated through GBS and other sequence-based genotyping approaches straightforward. This has been seen in many of the reported uses of sequence-based genotyping. The MSG approach employed by Andolfatto et al. (2011) made use of the D. simulans reference genome to first align tags to the reference and then call SNPs. Using a physical map framework, the parent-of-origin was then imputed across all SNPs segregating in the population. This approach is very robust for assigning parent-of-origin in bi-parental populations. Likewise, Huang et al. (2009) used the reference genome of rice to first align NGS tags and subsequently call SNPs. The physical ordering of these markers greatly enabled and simplified the imputation and assignment of parent-of-origin for segregating populations.

Though genotyping-by-sequencing approaches greatly benefit from a reference genome, the rapid discovery and ordering (through genetic mapping) of sequence-based molecular markers can greatly assist with the development and refinement of a reference genome. High-density genetic maps developed through GBS can be used to anchor and order physical maps and refine or correct unordered sequence contigs. In D. simulans, Andolfatto et al. (2011) were able to assign 8 Mb to linkage groups, which comprised 30% of the unassembled D. simulans genome or about 6% of the total genome. This is a substantial improvement of an already well-characterized genome. Likewise, in current efforts in much larger, more complex genomes including barley (5.5 Gb) and wheat (16 Gb) (Arumuganathan and Earle, 1991), high-density GBS maps are being used to assist with anchoring and ordering large numbers of assembled but unanchored and unordered contigs (N. Stein et al., in press). This approach appears very promising, creating a positive feedback loop where the development of the reference genome assisted by GBS markers leads to better SNP calling and order-based imputation for GBS datasets.
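Why physical ordering simplifies parent-of-origin imputation can be illustrated with a deliberately simple rule. The sketch below is not the hidden Markov model of Andolfatto et al. (2011); it assumes a bi-parental inbred population, error-free calls coded "A" or "B", and markers already sorted by physical position. The function name and the encoding are hypothetical.

```python
def impute_parent_of_origin(calls):
    """Fill missing parent-of-origin calls between agreeing flanking markers.

    `calls` is a list ordered by physical position, holding "A", "B",
    or None (missing). A run of missing calls flanked by the same parent
    is filled with that parent. A run flanked by different parents is left
    missing, because a recombination break point lies somewhere inside it.
    Missing calls at either end of the chromosome are also left missing.
    """
    imputed = list(calls)
    last_idx = None  # index of the most recent observed call
    for i, call in enumerate(calls):
        if call is None:
            continue
        if last_idx is not None and calls[last_idx] == call:
            for j in range(last_idx + 1, i):
                imputed[j] = call
        last_idx = i
    return imputed
```

With sparse, low-coverage calls the unfilled runs mark the intervals that contain recombination break points; denser markers narrow those intervals.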


Maps made easy

The combination of GBS with a well-defined reference genome makes the development of genetic maps for characterizing segregating populations exceptionally straightforward. In the absence of a solid reference genome, a high-density reference genetic map can serve the same purpose. For characterizing a new population, there will no longer be any need to place markers on linkage groups, calculate recombination frequencies, or order markers. With a reference genome, markers can be ordered along the physical chromosome (Figure 3). This ordering can then be used to precisely place recombination break points. The power of such approaches has been highlighted in recent papers with model species including D. simulans (Andolfatto et al., 2011), rice (Huang et al., 2010), and maize (Elshire et al., 2011). Even at low coverage, the placement of sparse markers on the physical map can be used to narrow points of recombination to 100-200 kb intervals (Huang et al., 2009; Xie et al., 2010). This approach can be extended to populations with heterozygous chromosomal segments such as F2 or BC1 populations. Andolfatto et al. (2011) demonstrated a hidden Markov model that accurately inferred heterozygous states from low-pass sequence-based genotyping. These same approaches have successfully been applied in maize, as well (P. Bradbury, personal communication).

In the absence of a solid reference genome, the same ease of genetic mapping can be accomplished through development of a reference genetic map for the species of interest. GBS markers and other framework markers can be integrated to develop a high-density genetic map (Poland et al., 2012). For new populations, GBS tags can be used to make genotype calls based on the reference map without the need to construct a de novo map. The extremely large number of markers produced with GBS allows sufficient coverage for most populations even if only a fraction of the total markers are utilized.

These same approaches for developing genetic maps and graphical genotypes can be broadly applied to the characterization of populations of interest for breeding and germplasm improvement including elite breeding lines, segregating populations for selection, near-isogenic lines, and alien-introgression lines. The use of a variety of algorithms to correctly infer the heterozygous/homozygous state of chromosome regions will add value to inferences and conclusions for molecular breeding and selection
(Andolfatto et al., 2011). Other algorithms can be used for phasing markers in segregating and outcrossing populations. This will generally, however, require known marker order of the GBS SNPs.

Mapping single genes

GBS and other sequence-based genotyping approaches can be very powerful for mapping single genes. The de novo discovery of high-density markers in a population of interest has the potential to circumvent the cumbersome process of marker discovery and testing for fine-mapping of target genes and mutations. In the absence of a reference map, RAD markers have been used in bulked segregant analysis to quickly identify linked markers (Baird et al., 2008). For single genes of interest, this can be a valuable approach to rapidly identify segregating polymorphisms. In lupin (Lupinus angustifolius), Yang et al. were able to identify 30 markers linked to an anthracnose resistance gene (Yang et al., 2012). One advantage of GBS for mapping single genes in F2 or similar populations is that the per-sample cost will be low enough that individual samples can be used rather than bulks. This will allow correction or removal of any individuals that were incorrectly phenotyped while confirming segregation of linked markers. Depending on the application, there will be a balance between finding markers linked to the gene of interest using GBS and developing single marker assays from the resulting data. Considering breeding approaches, it can still be optimal to pre-screen populations with markers for known single genes (with large effects) for smaller investment in time and sample costs prior to conducting whole genome profiling. Selected plants carrying desired genes can then be genotyped using GBS for genomic selection.

An Excess of Markers

While pre-selection of breeding populations for single markers for important genes is a viable breeding strategy, sequencing capacity is becoming so inexpensive and readily available that it will soon be reasonable to generate whole-genome profiles on any germplasm of interest. Previously, scientists spent a majority of their time developing and working with a small number of markers. Many projects today still only require a small number of markers to complete. GBS, however, can readily generate tens of thousands of
usable markers which can be selectively filtered into the few required for a target experiment. While statistical geneticists will always prefer to have as many markers as possible, genomic selection models have diminishing returns on additional markers once the population has reached the point of “marker saturation” (Jannink et al., 2010; Heffner et al., 2011). On the other hand, for association mapping studies, additional markers increase the likelihood of finding and tagging causal polymorphisms (Cockram et al., 2010). The current limitation for the generated data is computational. There are new algorithms and developments in cluster computing to provide the computational resources needed to make these quantitative genetics questions more manageable (Stanzione, 2011). Quantitative geneticists and bioinformatics personnel will be needed to manage breeding data and develop models. At the same time, bioinformatics training will become a more central component to any plant breeding and genetics curriculum.

Filling in the blanks

The “catch” to GBS, and sequence-based genotyping in general, is that datasets often have a significant amount of missing data due to low coverage sequencing (Davey et al., 2011). Biologically, missing genotyping calls in GBS datasets can be the result of presence-absence variation, polymorphism in restriction sites, and/or differential methylation. On the other hand, the technical issue of missing data with GBS is a combination of 1) library complexity (i.e., number of unique sequence tags) and 2) sequence coverage of the library.

Library complexity is directly related to the species’ genome under investigation and the choice of enzyme(s) used for complexity reduction. Enzymes with a shorter recognition site will naturally produce more fragments than those with a longer recognition site. Methylation-sensitive enzymes will greatly reduce the number of fragments in species with large portions of repetitive DNA. In barley, PstI-MspI libraries generate around 500,000 to 600,000 unique tags while in wheat around 1.5M tags are generated (J. Poland, unpublished). The actual number of sequence tags present in a raw dataset is substantially higher partly due to allelic variants, but largely due to sequencing errors, many of which can be non-random. This can and will generate many versions of “unique” tags.
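The two technical factors named above can be combined into a rough expectation for missing data. The sketch below is illustrative only: it assumes reads are split evenly across samples and that every tag is sampled at the same rate (a Poisson approximation), which real libraries violate because tag abundance is uneven, so the result is better read as an optimistic bound than as a prediction. The function name and the run size of 150 million reads are hypothetical; the tag counts are the barley and wheat figures quoted above.

```python
import math


def expected_missing_fraction(total_reads, n_samples, n_tags):
    """Expected fraction of tags with zero reads in one sample.

    Assumes reads are split evenly across samples and that each tag is
    sampled independently at the same rate (a Poisson approximation).
    """
    mean_depth = total_reads / (n_samples * n_tags)
    return math.exp(-mean_depth)


# Hypothetical run of 150 million reads; tag counts taken from the text.
for species, n_tags in [("barley", 550_000), ("wheat", 1_500_000)]:
    for plex in (48, 96, 384):
        missing = expected_missing_fraction(150e6, plex, n_tags)
        print(f"{species:6s} {plex:3d}-plex: {missing:.0%} of tags expected missing")
```

Under these assumptions, a higher multiplexing level or a more complex library lowers the mean depth per tag and raises the expected missing fraction.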


The level of missing data is based on the sequencing coverage. The sequencing coverage is a function of the library complexity, multiplexing level, and the output of the sequencing platform (Andolfatto et al., 2011). The multiplexing level and the number of independent sequences generated from the sequencing platform will determine the average number of reads per sample. Higher multiplexing levels will reduce the data per sample, while increased sequencing output (when using the same multiplexing level) will understandably increase the data per sample. One key component of GBS on different sequencing platforms is the number of independent reads. Post-Sanger sequencing platforms generally rely on a large number of short sequence reads to produce gigabases of sequence data (Metzker, 2009). The new platforms are continually increasing the sequencing output, a function of more and longer reads. For GBS, however, generating longer reads is less advantageous than generating more reads. More sequence reads provide more data per sample. Alternatively, increasing read numbers allows higher multiplexing levels with static amounts of data per sample. For GBS, 10 Gb of sequence data generated from 100M reads of 100 bp would be preferable to 10M reads of 1,000 bp. While increasing the number of reads is clearly advantageous for GBS, longer reads are also beneficial, leading to the discovery of more polymorphisms (particularly in species with limited diversity) and assisting GBS applications in polyploids where secondary, genome-specific polymorphisms are needed to differentiate a segregating SNP from homeologous sequences on other genomes.

Missing data can be dealt with by 1) sequencing to higher depth or 2) imputing. The logical approach to removing missing data is to sequence to a higher depth by reducing the multiplexing level or sequencing the library multiple times. This can be very effective (Figure 4) but has the drawback of increasing per sample cost. For important association mapping panels or parents of a breeding program, however, the additional investment to generate higher coverage of the tags is likely worthwhile. For breeding applications using GBS with targeted selection, other approaches to minimize the impact of missing data are preferable. Since a majority of the population will be discarded, minimizing genotyping cost will take precedence over minimizing missing data.

The second approach is imputation of missing data. Depending on the genome, the type of GBS libraries, and the overall size of the datasets, imputation can give very
accurate results. There are many imputation algorithms (Marchini et al., 2007; Purcell et al., 2007; Browning and Browning, 2007), most of which are targeted toward haplotype reconstruction on a reference genome. Other approaches such as a Random Forest model (Breiman, 2001) can be used to impute unordered markers (as is the situation in wheat). Sequencing diverse, key individuals in the population (parents or representatives of kinship clusters) can greatly improve imputation accuracy by defining known haplotypes for the population.

Finally, a matrix of realized relationships among individuals in a breeding population can be constructed without imputation. For very high-density genotyped data generated by GBS, the marker coverage is sufficient to saturate the genomic linkage disequilibrium present in most breeding programs. From this perspective, it is only necessary to determine a pair-wise identity between individuals for the markers that are present in both individuals. With high marker density, there will still be tens of thousands of pair-wise comparisons between two individuals, well beyond the saturation point for most elite breeding material. Imputation with the simple marker mean can still produce accurate genomic selection prediction models. From a genomic selection perspective, kinship-based marker imputation can be used to optimize the realized relationship matrix in the presence of a high level of missing data (Poland et al., The Plant Genome, concurrent submission). This approach has been shown to improve the relationship estimates and give more accurate genomic selection model predictions.

Association mapping

GBS has the potential to be an excellent tool for genotyping of diverse panels for association mapping (AM). One key to applying GBS for AM is addressing the missing data problem. As noted previously, higher coverage sequencing will reduce the amount of missing data at the expense of increased per sample costs. For a high-value AM panel that will be well characterized, extensively phenotyped, and serve as a community resource population, the additional cost of sequencing several times to achieve high coverage is likely worth the investment. This will produce a very well-characterized genetic population. At a high coverage, imputation of missing data will become a very precise exercise, particularly on populations with extensive linkage
382

disequilibrium. Depending on the species under interrogation, the GBS markers will need

384

In such populations, GBS markers also have the advantage of being able to survey

383

to be ordered via a physical reference map or through genetic mapping.

385

multiple haplotypes on a fine scale. When two or more SNPs are within the same tag,

387

uncover these alleles. Array-based methods, particularly those applied to polyploid

389

duplicated sequence will indicate an allele call (for the ancestral allele) even if the target

391

discrimination between duplicated sequences. At higher sequencing coverage of the GBS

393

pool of sequenced tags.

395

Genomic Selection

397

is to create a low-cost genotyping platform capable of generating high-density genotypes.

386

these SNP alleles are both evaluated concurrently. For PAVs, GBS also has the power to

388

species, are limited in the ability to accurately survey PAVs as hybridization to a

390

locus is absent. Due to the context sequence accompanying a SNP, GBS enables

392

library, PAV can then be inferred by the absence of a given tag for a given sample in the

394 396

398

In the field of plant breeding, an important objective in the development of GBS

For genomic selection in crop species, breeders need a fast, inexpensive, flexible method

399

that will enable genotyping of large populations of selection candidates. A majority of the

401

low-cost genotyping. GBS is quickly expanding to fill those requirements.

403

approach to capture the full complement of small effect loci in genomic prediction

405

fitting effects to all markers and avoiding statistical testing. By utilizing these GS models,

400 402

selection candidates are then discarded, creating a situation that is greatly benefited from

Genomic selection (GS) was proposed in 2001 by Meuwissen, et al. as an

404

models. GS takes advantage of dense genome-wide molecular markers by simultaneously

406

breeders are able to predict the performance of new experimental lines at early

408

(Jannink et al., 2010). Combined with a fast turn-around on generations, selection based

410

increase gains in plant breeding programs (Meuwissen et al., 2001; Jannink et al., 2010).

412

needed for generating tens to hundreds of thousands of molecular markers. Poland et al,

407

generations and generate suggested crosses and selections based on the model predictions

409

on predicted breeding values determined by marker data provided by GBS could greatly

411

The advantage of GBS for GS in breeding programs is the low per sample cost

The Plant Genome: Published ahead of print 10 Sept. 2012; doi: 10.3835/plantgenome2012.05.0005

413

(2012, The Plant Genome, concurrent submission) have demonstrated the suitability for

415

demonstrate prediction accuracies for yield and other agronomic traits that are high

417

significant improvement in the attained prediction accuracy over a previously used array

419

implications in breeding. The training population was genotyped without a priori

414

GBS markers in developing GS models in the complex wheat genome. They were able to

416

enough to be suitable for breeding applications. The GBS markers also showed a

418

of hybridization-based markers. The important finding of this work is the practical

420

knowledge of the population or SNPs and per sample cost was below $20.
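The idea of fitting effects to all markers simultaneously, without testing individual markers, can be sketched with ridge regression, one common shrinkage model for genomic prediction. This is a minimal illustration only and not the specific model of the studies cited above; the penalty value, the toy data, and all function names are assumptions made for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def solve(A: Matrix, b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ridge(X: Matrix, y: List[float], lam: float) -> Tuple[float, List[float]]:
    """Fit all marker effects at once; lam > 0 shrinks effects toward zero.

    X holds marker dosages (rows = training lines, columns = markers) and
    y the phenotypes. The intercept is left unpenalised.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    mu = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - mu for v in y]
    # Normal equations of the penalised problem: (Xc'Xc + lam*I) b = Xc'yc
    A = [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    effects = solve(A, rhs)
    intercept = mu - sum(m * e for m, e in zip(col_means, effects))
    return intercept, effects


def predict(X: Matrix, intercept: float, effects: List[float]) -> List[float]:
    """Predicted values for genotyped selection candidates without phenotypes."""
    return [intercept + sum(x * e for x, e in zip(row, effects)) for row in X]
```

Because the penalty makes the system solvable even when markers outnumber phenotyped lines, the same call works for the usual GS situation of many more markers than training individuals. In practice the penalty would be chosen by cross-validation or estimated from variance components.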

Putting GBS to work

Looking forward, high-density markers from NGS will soon be applied to almost every genomic problem. These marker datasets are low-cost and dynamic, with data and genotyping results getting more robust and economical each year. GBS has been shown to be a valid tool for genetic mapping (Baird et al., 2008; Elshire et al., 2011; Poland et al., 2012), breeding applications (Poland et al., concurrent submission), and diversity studies (Fu, 2012; Lu et al., 2012). The ability to quickly generate robust datasets without considerable prior effort for marker discovery is quickly dispelling issues that have plagued researchers working with obscure or foreign species: a lack of defined and specific genetic tools for genome analysis (Allendorf et al., 2010). GBS is an ideal platform for studies ranging from quickly identifying single gene markers to whole genome profiling of association panels.

Perhaps one of the most exciting applications of GBS will be in the field of plant breeding. Theoretical and preliminary studies on genomic selection show great promise for accelerating the rate of developing new improved varieties. GBS is providing a rapid and low-cost tool for genotyping these populations, allowing breeders to implement GS on a large scale in their breeding programs. Current developments in sequencing output will drive per sample cost below $10. Further, there is no requirement for a priori knowledge of the species as the GBS methods have been shown to be robust across a range of species and SNP discovery and genotyping are completed together. This is a very important feature for moving genomics-assisted breeding into orphan crops with understudied genomes and commercial crops with large and complex genomes. Challenges remaining include data management as well as modeling genotype-by-environment interactions, though the future looks promising. Genomic selection via GBS stands to be a major supplement to traditional crop development. The potential for GBS data to improve breeding systems through GS is enormous.

Application of sequence-based genotyping for a whole range of diversity and genomic studies will have an important place well into the future. Driven by applications across the whole spectrum of human, microbial, plant, and animal genomics, developments in next-generation sequencing and genomics platforms must be put to use for plant breeding and genetics studies.

ACKNOWLEDGMENTS

USDA-ARS and the USDA-NIFA funded Triticeae Coordinated Agriculture Project (T-CAP) (2011-68002-30029) provided support for T. Rife. This manuscript was greatly improved by the helpful comments of two anonymous reviewers. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.

REFERENCES

Allendorf, F.W., P.A. Hohenlohe, and G. Luikart. 2010. Genomics and the future of conservation genetics. Nat. Rev. Genet. 11:697–709.

Altshuler, D., V.J. Pollara, C.R. Cowles, W.J. Van Etten, J. Baldwin, L. Linton, and E.S. Lander. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513–516.

Andolfatto, P., D. Davison, D. Erezyilmaz, T.T. Hu, J. Mast, T. Sunayama-Morita, and D.L. Stern. 2011. Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 21:610–617.

Arumuganathan, K., and E.D. Earle. 1991. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9:415–415.

Ashelford, K., M.E. Eriksson, C.M. Allen, R. D’Amore, M. Johansson, P. Gould, S. Kay, A.J. Millar, N. Hall, and A. Hall. 2011. Full genome re-sequencing reveals a novel circadian clock mutation in Arabidopsis. Genome Biol. 12:R28.

Baird, N.A., P.D. Etter, T.S. Atwood, M.C. Currey, A.L. Shiver, Z.A. Lewis, E.U. Selker, W.A. Cresko, and E.A. Johnson. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376.

Breiman, L. 2001. Random forests. Machine Learning 45:5–32.

Browning, S.R., and B.L. Browning. 2007. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81:1084–1097.

Byers, R.L., D.B. Harker, S.M. Yourstone, P.J. Maughan, and J.A. Udall. 2012. Development and mapping of SNP assays in allotetraploid cotton. Theor. Appl. Genet. 124:1201–1214.

Chia, J.-M., C. Song, P.J. Bradbury, D. Costich, N. de Leon, J. Doebley, R.J. Elshire, B. Gaut, L. Geller, J.C. Glaubitz, M. Gore, K.E. Guill, J. Holland, M.B. Hufford, J. Lai, M. Li, X. Liu, Y. Lu, R. McCombie, R. Nelson, J. Poland, B.M. Prasanna, T. Pyhäjärvi, T. Rong, R.S. Sekhon, Q. Sun, M.I. Tenaillon, F. Tian, J. Wang, X. Xu, Z. Zhang, S.M. Kaeppler, J. Ross-Ibarra, M.D. McMullen, E.S. Buckler, G. Zhang, Y. Xu, and D. Ware. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat. Genet. 44:803–807.

Cockram, J., J. White, D.L. Zuluaga, D. Smith, J. Comadran, M. Macaulay, Z. Luo, M.J. Kearsey, P. Werner, D. Harrap, C. Tapsell, H. Liu, P.E. Hedley, N. Stein, D. Schulte, B. Steuernagel, D.F. Marshall, W.T.B. Thomas, L. Ramsay, I. Mackay, D.J. Balding, R. Waugh, and D.M. O’Sullivan. 2010. Genome-wide association mapping to candidate polymorphism resolution in the unsequenced barley genome. Proc. Natl. Acad. Sci. U. S. A. 107:21611–21616.

Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J. Corneveaux, T.L. Pawlowski, T. Laub, G. Nunn, D.A. Stephan, N. Homer, and M.J. Huentelman. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887–893.

Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 36:e122.

Davey, J.W., P.A. Hohenlohe, P.D. Etter, J.Q. Boone, J.M. Catchen, and M.L. Blaxter. 2011. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12:499–510.

Deschamps, S., M. la Rota, J.P. Ratashak, P. Biddle, D. Thureen, A. Farmer, S. Luck, M. Beatty, N. Nagasawa, L. Michael, V. Llaca, H. Sakai, G. May, J. Lightner, and M.A. Campbell. 2010. Rapid Genome-wide Single Nucleotide Polymorphism Discovery in Soybean and Rice via Deep Resequencing of Reduced Representation Libraries with the Illumina Genome Analyzer. The Plant Genome 3:53–68.

Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379.

Fu, Y.-B. 2012. Genotyping-by-sequencing: a Case Study in Barley. In Plant and Animal Genome XX.

Futschik, A., and C. Schlötterer. 2010. The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186:207–218.

Gan, X., O. Stegle, J. Behr, J.G. Steffen, P. Drewe, K.L. Hildebrand, R. Lyngsoe, S.J. Schultheiss, E.J. Osborne, V.T. Sreedharan, A. Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E.J. Belfield, N.P. Harberd, E. Kemen, C. Toomajian, P.X. Kover, R.M. Clark, G. Rätsch, and R. Mott. 2011. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477:419–423.

Geraldes, A., J. Pang, N. Thiessen, T. Cezard, R. Moore, Y. Zhao, A. Tam, S. Wang, M. Friedmann, I. Birol, S.J.M. Jones, Q.C.B. Cronk, and C.J. Douglas. 2011. SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome resequencing. Mol. Ecol. Resour. 11:81–92.

Gore, M.A., J.M. Chia, R.J. Elshire, Q. Sun, E.S. Ersoz, B.L. Hurwitz, J.A. Peiffer, M.D. McMullen, G.S. Grills, and J. Ross-Ibarra. 2009a. A first-generation haplotype map of maize. Science 326:1115–1117.

Gore, M.A., M.H. Wright, E.S. Ersoz, P. Bouffard, E.S. Szekeres, T.P. Jarvie, B.L. Hurwitz, A. Narechania, T.T. Harkins, G.S. Grills, D.H. Ware, and E.S. Buckler. 2009b. Large-Scale Discovery of Gene-Enriched SNPs. The Plant Genome 2:121–133.

Hamblin, M.T., T.J. Close, P.R. Bhat, S. Chao, J.G. Kling, K.J. Abraham, T. Blake, W.S. Brooks, B. Cooper, C.A. Griffey, P.M. Hayes, D.J. Hole, R.D. Horsley, D.E. Obert, K.P. Smith, S.E. Ullrich, G.J. Muehlbauer, and J.-L. Jannink. 2010. Population Structure and Linkage Disequilibrium in U.S. Barley Germplasm: Implications for Association Mapping. Crop Sci. 50:556–566.

Harper, A.L., M. Trick, J. Higgins, F. Fraser, L. Clissold, R. Wells, C. Hattori, P. Werner, and I. Bancroft. 2012. Associative transcriptomics of traits in the polyploid crop species Brassica napus. Nat. Biotechnol. 30:798–802.

Heffner, E.L., J.-L. Jannink, and M.E. Sorrells. 2011. Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. The Plant Genome 4:65–75.

Hohenlohe, P.A., S.J. Amish, J.M. Catchen, F.W. Allendorf, and G. Luikart. 2011. Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Mol. Ecol. Resour. 11:117–122.

Huang, X., Q. Feng, Q. Qian, Q. Zhao, L. Wang, A. Wang, J. Guan, D. Fan, Q. Weng, T. Huang, G. Dong, T. Sang, and B. Han. 2009. High-throughput genotyping by whole-genome resequencing. Genome Res. 19:1068–1076.

Huang, X., X. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C. Li, C. Zhu, T. Lu, Z. Zhang, M. Li, D. Fan, Y. Guo, A. Wang, L. Wang, L. Deng, W. Li, Y. Lu, Q. Weng, K. Liu, T. Huang, T. Zhou, Y. Jing, W. Li, Z. Lin, E.S. Buckler, Q. Qian, Q.-F. Zhang, J. Li, and B. Han. 2010. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 42:961–967.

Hyten, D.L., Q. Song, E.W. Fickus, C.V. Quigley, J.-S. Lim, I.-Y. Choi, E.-Y. Hwang, M. Pastor-Corrales, and P.B. Cregan. 2010. High-throughput SNP discovery and assay development in common bean. BMC Genomics 11:475.

Jannink, J.-L., A.J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: from theory to practice. Briefings Funct. Genomics 9:166–177.

Jiao, Y., H. Zhao, L. Ren, W. Song, B. Zeng, J. Guo, B. Wang, Z. Liu, J. Chen, W. Li, M. Zhang, S. Xie, and J. Lai. 2012. Genome-wide genetic changes during modern breeding of maize. Nat. Genet. 44:812–815.

Lu, F., A.E. Lipka, R.J. Elshire, J. Glaubitz, J. Cherney, M. Casler, E.S. Buckler, and D. Costich. 2012. Characterization of the Genetic Diversity of Switchgrass Using Genotyping by Sequencing. In Plant and Animal Genome XX.

Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39:906–913.

Mardis, E.R. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24:133–141.

Metzker, M. 2009. Sequencing technologies - the next generation. Nat. Rev. Genet. 11:31–46.

Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829.

Miller, M.R., J.P. Dunham, A. Amores, W.A. Cresko, and E.A. Johnson. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res. 17:240–248.

Morrell, P.L., E.S. Buckler, and J. Ross-Ibarra. 2011. Crop genomics: advances and applications. Nat. Rev. Genet. 13:85–96.

Myles, S., J. Peiffer, P.J. Brown, E.S. Ersoz, Z. Zhang, D.E. Costich, and E.S. Buckler. 2009. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell 21:2194–2202.

Nelson, J.C., S. Wang, Y. Wu, X. Li, G. Antony, F.F. White, and J. Yu. 2011. Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum. BMC Genomics 12:352.

Nielsen, R., J.S. Paul, A. Albrechtsen, and Y.S. Song. 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12:443–451.

van Orsouw, N.J., R.C.J. Hogers, A. Janssen, F. Yalcin, S. Snoeijers, E. Verstege, H. Schneiders, H. van der Poel, J. van Oeveren, H. Verstegen, and M.J.T. van Eijk. 2007. Complexity reduction of polymorphic sequences (CRoPS): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS One 2:e1172.

Poland, J.A., P.J. Brown, M.E. Sorrells, and J.-L. Jannink. 2012. Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. PLoS One 7:e32253.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81:559–575.

Stanzione, D. 2011. The iPlant Collaborative: Cyberinfrastructure to Feed the World. Computer 44:44–52.

van Tassell, C.P., T.P.L. Smith, L.K. Matukumalli, J.F. Taylor, R.D. Schnabel, C.T. Lawley, C.D. Haudenschild, S.S. Moore, W.C. Warren, and T.S. Sonstegard. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods 5:247–252.

Truong, H.T., A.M. Ramos, F. Yalcin, M. de Ruiter, H.J.A. van der Poel, K.H.J. Huvenaars, R.C.J. Hogers, L.J.G. van Enckevort, A. Janssen, N.J. van Orsouw, and M.J.T. van Eijk. 2012. Sequence-based genotyping for marker discovery and codominant scoring in germplasm and populations. PLoS One 7:e37565.

Wang, S., E. Meyer, J.K. McKay, and M.V. Matz. 2012. 2b-RAD: a simple and flexible method for genome-wide genotyping. Nat. Methods 9:808–810.

Wang, X., H. Wang, J. Wang, R. Sun, J. Wu, et al. 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43:1035–1039.

Wiedmann, R.T., T.P.L. Smith, and D.J. Nonneman. 2008. SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet. 9:81.

Xie, W., Q. Feng, H. Yu, X. Huang, Q. Zhao, Y. Xing, S. Yu, B. Han, and Q. Zhang. 2010. Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc. Natl. Acad. Sci. U. S. A. 107:10578–10583.

Xu, X., X. Liu, S. Ge, J.D. Jensen, F. Hu, X. Li, Y. Dong, R.N. Gutenkunst, L. Fang, L. Huang, J. Li, W. He, G. Zhang, X. Zheng, F. Zhang, Y. Li, C. Yu, K. Kristiansen, X. Zhang, J. Wang, M. Wright, S. McCouch, R. Nielsen, J. Wang, and W. Wang. 2012. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 30:105–111.

Xu, X., S. Pan, S. Cheng, B. Zhang, D. Mu, et al. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195.

Yang, H., Y. Tao, Z. Zheng, C. Li, M. Sweetingham, and J. Howieson. 2012. Application of next-generation sequencing for rapid marker development in molecular plant breeding: a case study on anthracnose disease resistance in Lupinus angustifolius L. BMC Genomics 13:318.

You, F.M., N. Huo, K.R. Deal, Y.Q. Gu, M.-C. Luo, P.E. McGuire, J. Dvorak, and O.D. Anderson. 2011. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59.

FIGURES

Figure 1 – A comparison of actual sequencing capacity (orange) to what would be expected if sequencing technology was following Moore’s Law (blue). The significant decrease in 2007 coincides roughly with the introduction of next-generation sequencing technology. Data from National Human Genome Research Institute (http://www.genome.gov/sequencingcosts/).

Figure 2 – Schematic overview of steps in GBS library construction, sequencing and analysis. 1) Genomic DNA is quantified using a fluorescence-based method. 2) gDNA is normalized in a new plate. Normalization is needed to ensure equal representation of all samples and equal molarity of gDNA and adapters. 3) A master mix with restriction enzyme(s) and buffer is added to the plate and incubated. 4) DNA barcoded adapters are added along with ligase and ligation buffers. 5) Samples are pooled and cleaned. 6) The GBS library is PCR amplified. 7) The amplified library is cleaned and evaluated on a capillary sizing system. 8) Libraries are ready to sequence. Data analysis: Following a sequencing run, Fastq files containing raw data from the run are used to parse sequencing reads to samples using the DNA barcode sequence. Once assigned to individual samples, the reads are aligned to a reference genome. In the case of species without a complete reference genomic sequence, reads are internally aligned (alignment of all sequence reads with all other reads from that library) and SNPs identified from 1 or 2 bp sequence mismatch. Various filtering algorithms can then be used to distinguish true bi-allelic SNPs from sequencing errors.
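The data analysis steps described for Figure 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used by the authors: barcodes are assumed to sit at the start of each read and to match exactly, the barcode-to-sample map and function names are hypothetical, and candidate SNPs are taken to be pairs of equal-length tags differing at 1 or 2 positions. Production tools perform additional filtering.

```python
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality string, unused in this sketch
        yield header.strip()[1:], seq


def demultiplex(lines: Iterable[str], barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign reads to samples by their leading barcode and trim the barcode.

    `barcodes` maps barcode sequence -> sample name. Reads whose start
    matches no barcode are dropped.
    """
    by_sample: Dict[str, List[str]] = {sample: [] for sample in barcodes.values()}
    # Try longer barcodes first in case one barcode is a prefix of another.
    ordered = sorted(barcodes, key=len, reverse=True)
    for _read_id, seq in parse_fastq(lines):
        for bc in ordered:
            if seq.startswith(bc):
                by_sample[barcodes[bc]].append(seq[len(bc):])
                break
    return by_sample


def candidate_snp_pairs(tags: Iterable[str]) -> List[Tuple[str, str]]:
    """Pairs of distinct, equal-length tags that differ at 1 or 2 positions."""
    unique = sorted(set(tags))
    return [(a, b) for a, b in combinations(unique, 2)
            if len(a) == len(b) and 1 <= sum(x != y for x, y in zip(a, b)) <= 2]
```

The all-against-all comparison in `candidate_snp_pairs` grows quadratically with the number of tags, so it only serves to show the principle of internal alignment on a small set of tags.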

Figure 3 – Integration of genotyping-by-sequencing in the context of plant breeding and genomics for a species without a completed reference genome.

Figure 4 – Removal of missing data in genotyping-by-sequencing by increasing coverage of the library via re-sequencing. In a set of international wheat breeding germplasm, several lines (samples) were replicated across two or more libraries. Replicating a sample two times increased the coverage of SNPs to 60%, while five replications increase the coverage to over 90%. While very effective as a means to remove missing data, replicated sequencing increases the per sample cost. The average per sample cost is $15. In this situation for wheat, the number of replications is roughly equivalent to the sequencing coverage of the library (i.e. 5 replications ~ 5X coverage). Data from Poland et al. (unpublished).
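The relationship between replication and SNP coverage reported for Figure 4 can be explored with a simple back-of-the-envelope model. The sketch below assumes, purely for illustration, that each replicate recovers a given SNP independently and with the same probability; this assumption and the function names are not taken from the underlying data.

```python
def call_rate(p_single: float, replicates: int) -> float:
    """Fraction of SNPs observed at least once across independent replicates."""
    return 1.0 - (1.0 - p_single) ** replicates


def single_replicate_rate(observed_rate: float, replicates: int) -> float:
    """Per-replicate recovery probability implied by an observed call rate."""
    return 1.0 - (1.0 - observed_rate) ** (1.0 / replicates)


# 60% coverage at two replicates implies roughly 37% per replicate ...
p = single_replicate_rate(0.60, 2)
# ... which under the independence assumption predicts about 90% at five.
predicted_five = call_rate(p, 5)
```

Under this assumption the 60% figure at two replicates predicts close to 90% at five, in line with the values given in the caption, and it also shows the diminishing return of each additional replicate.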

TABLES

Table 1 – A technical comparison of current genotyping methods utilizing next-generation sequencing of multiplex barcoded libraries. Adapted from Wang et al. (2012).

Flavors of genotyping using next generation sequencing of multiplex DNA-barcoded reduced-representation libraries

Method | Random shearing | Size selection | Fragment size | Enzymes [1] | Multiplexing level [2] | Analysis tool(s)
Multiplex Shotgun Genotyping (MSG) | No | Yes | size selected | MseI | 96 (up to 384) | Burrows-Wheeler Alignment tool
Restriction site Associated DNA (RAD-seq) | Yes | Yes | size selected | SbfI, EcoRI | 96 | Custom Perl scripts
Double Digest RADseq (ddRADseq) | No | Yes | size selected | EcoRI-MspI | 48 [3] | MUSCLE
2b-RAD | No | No | 33-36 bp | BsaXI [4] | NA | Custom Perl scripts
Genotyping-by-Sequencing (GBS) | No | No | < 350 bp | ApeKI [5] | 48 (up to 384) | TASSEL
Genotyping-by-Sequencing (GBS) – two enzyme | No | No | < 350 bp | PstI-MspI | 48 (up to 384) | TASSEL
Sequence-Based Genotyping (SBG) | No | Yes | size selected | EcoRI-MseI, PstI-TaqI | 32 | Burrows-Wheeler Alignment tool; Unif Genotyper
Restriction Enzyme Sequence Comparative Analysis (RESCAN) | No | Yes | size selected | MseI, NlaIII | NA [6] | Burrows-Wheeler Alignment tool; Samtools

[1] All of these approaches can utilize different enzymes. Shown are the enzyme(s) used in the initial study.
[2] All of these methods have the possibility to increase the number of multiplexed samples using more unique barcodes. The multiplex level is the number of samples reported in the first paper. Given in parentheses are subsequent increases.
[3] Combinatorial barcoding is possible, placing a barcode on each end of the DNA fragment. Using a set of 48 adapter P1 barcodes and 12 PCR2 indices it is possible to uniquely label 576 individuals [48 (adapter P1 barcodes) x 12 (PCR2 indices)]. This method would require paired-end sequencing.
[4] Uses type II restriction endonucleases
[5] Has been successfully applied to using PstI and HindIII (Buckler et al, personal communication)
[6] 96-plexing reported but unpublished
