Clinical Chemistry 55:7 1361–1371 (2009)

Molecular Diagnostics and Genetics

Identification and Prevention of Genotyping Errors Caused by G-Quadruplex– and i-Motif–Like Sequences

Jürgen J. Wenzel,1† Heidi Rossmann,1*† Christian Fottner,2 Stefan Neuwirth,3 Carolin Neukirch,1 Peter Lohse,4 Julia K. Bickmann,1 Timo Minnemann,2 Thomas J. Musholt,5 Brigitte Schneider-Rätzke,6 Matthias M. Weber,2 and Karl J. Lackner1

BACKGROUND: Reliable PCR amplification of DNA fragments is the prerequisite for most genetic assays. We investigated the impact of G-quadruplex– or i-motif–like sequences on the reliability of PCR-based genetic analyses.

METHODS: We found the sequence context of a common intronic polymorphism in the MEN1 gene (multiple endocrine neoplasia I) to be the cause of systematic genotyping errors by inducing preferential amplification of one allelic variant [allele dropout (ADO)]. Bioinformatic analyses and pyrosequencing-based allele quantification enabled the identification of the underlying DNA structures.

RESULTS: We showed that G-quadruplex– or i-motif–like sequences can reproducibly cause ADO. In these cases, amplification efficiency strongly depends on the PCR enzyme and buffer conditions, the magnesium concentration in particular. In a randomly chosen subset of candidate single-nucleotide polymorphisms (SNPs) defined by properties deduced from 2 originally identified ADO cases, we confirmed preferential PCR amplification in up to 50% of the SNPs. We subsequently identified G-quadruplex and i-motifs harboring a SNP that alters the typical motif as the cause of this phenomenon, and a genomewide search based on the respective motifs predicted 0.5% of all SNPs listed by dbSNP and Online Mendelian Inheritance in Man to be potentially affected.

CONCLUSIONS: Undetected, the described phenomenon produces systematic errors in genetic analyses that may lead to misdiagnoses in clinical settings. PCR products should be checked for G-quadruplex and i-motifs to avoid the formation of ADO-causing secondary structures. Truly affected assays can then be identified by a simple experimental procedure, which simultaneously provides the solution to the problem. © 2009 American Association for Clinical Chemistry

1 Departments of Clinical Chemistry and Laboratory Medicine; and 2 Medicine I, Johannes Gutenberg University Mainz, Mainz, Germany; 3 Bundeskriminalamt, Wiesbaden, Germany; 4 Institute of Clinical Chemistry – Großhadern, Ludwig-Maximilians University Munich, Munich, Germany; 5 Endocrine Surgery; and 6 Institute for Human Genetics, Johannes Gutenberg University Mainz, Mainz, Germany.
* Address correspondence to this author at: Johannes Gutenberg University Mainz, Department of Clinical Chemistry and Laboratory Medicine, Langenbeckstr. 1, 55131 Mainz, Germany. Fax +49-6131-17-6627; e-mail [email protected].
† J.J. Wenzel and H. Rossmann contributed equally to this work.

Amplification by the PCR is the basis for the majority of DNA sequence analyses, including analyses of single-nucleotide polymorphisms (SNPs),7 that are used for research or diagnostic purposes. Although it is well accepted that sequence motifs may affect PCR efficiency and may therefore cause preferential amplification of specific alleles [allele dropout (ADO)], the data in the literature documenting such phenomena are scarce. In most cases, amplification problems during the PCR have been attributed to apparently obvious causes, e.g., the repeat structure of microsatellites, SNPs within the binding sites of PCR primers, a low DNA concentration, or poor DNA quality (1–4). We studied systematic genotyping errors caused by reproducible preferential allelic amplification due to DNA motifs favoring G-quadruplex formation. G-quadruplexes and i-motifs are known to be highly prevalent in eukaryotic genomes (5), particularly so in biologically (6, 7) and diagnostically relevant regions such as promoter regions of proto-oncogenes [HIF1A,8 hypoxia inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor); KRAS, v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog; RB1, retinoblastoma 1; and RET, ret proto-oncogene] (8, 9). Accordingly, ADO during the PCR is a common phenomenon in sequences with G-quadruplex– and i-motif–like characteristics.

Received October 1, 2008; accepted March 24, 2009. Previously published online at DOI: 10.1373/clinchem.2008.118661
7 Nonstandard abbreviations: SNP, single-nucleotide polymorphism; ADO, allele dropout; MEN1, multiple endocrine neoplasia type 1; NCBI, National Center for Biotechnology Information; RefSeq, NCBI Reference Sequence; OMIM, Online Mendelian Inheritance in Man.
8 Human genes: HIF1A, hypoxia inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor); KRAS, v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog; RB1, retinoblastoma 1; and RET, ret proto-oncogene; MEN1, multiple endocrine neoplasia I; SLC6A4, solute carrier family 6 (neurotransmitter transporter, serotonin), member 4.

Materials and Methods

DNA AND RNA EXTRACTION AND cDNA SYNTHESIS

We used the QIAamp DNA Mini Kit (Qiagen) to isolate genomic DNA from 96 surplus, anonymized whole-blood samples and from 24 whole-blood samples from patients with symptoms suggestive of multiple endocrine neoplasia type 1 (MEN1) and from family members of the MEN1 index patients. Total RNA was extracted from whole blood of patient 6 with the RNeasy Midi Kit (Qiagen). cDNA was synthesized with SuperScript III reverse transcriptase (Invitrogen). The detected MEN1 mutation, clinical information, and genotype result for the rs509606 SNP of each patient are provided in Table 1 in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol55/issue7.

PCR

Gene-specific primers for sequencing and pyrosequencing assays were chosen from the published sequence information [GenBank and dbSNP, build 127, National Center for Biotechnology Information (NCBI)] by either the Pyrosequencing Assay Design Software 1.0 (Biotage/Qiagen) or by the primer-design programs of HUSAR 5.0 (DKFZ). For pyrosequencing assays, one PCR primer was 5′-biotinylated (Invitrogen) to facilitate preparation of single-stranded DNA. Table 2 in the online Data Supplement summarizes the oligonucleotide primers and amplification conditions, and Table 3 in the online Data Supplement lists the compositions of PCR buffers B1–B5, which were prepared in house. The sequences of exons 2–10 and the respective flanking intronic regions of the MEN1 gene (multiple endocrine neoplasia I) were analyzed diagnostically with a previously published set of PCR primers (10, 11).

SEQUENCING

Amplification products were column-purified (CentriSep Spin Columns, Princeton Separations) and sequenced bidirectionally (Dye Terminator Cycle Sequencing Quick Start Kit, CEQ 8000 Genetic Analysis System; Beckman Coulter). Control sequencing of one PCR product was performed by a local sequencing-service provider with a model 3730 DNA Analyzer (Applied Biosystems). Preparation of single-stranded DNA, annealing of the sequencing primer, and pyrosequencing were carried out according to the instrument manufacturer’s instructions (PSQMA; Biotage/Qiagen) with Pyro Gold Reagents (Biotage/Qiagen). Allele quantification was carried out as described previously (12). In brief, the area under the curve of a pyrogram peak, which represents the incorporation of one nucleotide into the nascent DNA strand, is proportional to the amount of incorporated nucleotides. Therefore, the ratio of the peaks that are critical for genotype calling quantitatively represents the allele frequency in a DNA sample (Figs. 1 and 2).

BIOINFORMATICS

Databases. The HUMAN_ENSEMBL database was searched with the commands blastn, fasta, and fuzznuc of the HUSAR sequence analysis program package (HUSAR version 5.0; Biocomputing Service at DKFZ). Databases dbSNP and RefSeq were downloaded as compressed text files from the NCBI FTP server: NCBI Reference Sequence (RefSeq) Release 26 (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/) and dbSNP Build 127 (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/). Furthermore, refGene.txt was obtained from the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/). Diagnostically relevant SNPs were extracted from Online Mendelian Inheritance in Man (OMIM) (2007, http://www.ncbi.nlm.nih.gov/omim/) and downloaded as a text file in FASTA format. Text files were analyzed by custom Perl (version 5.8.8) scripts on a local server (Gentoo Linux, Kernel 2.6.23). G-quadruplex– and i-motif–forming sequences were identified with QGRS Mapper (http://bioinformatics.ramapo.edu/QGRS/index.php) (13), Quadfinder (http://miracle.igib.res.in/quadfinder/) (14), and custom Perl scripts.

Sequence-homology search. In a preliminary approach to elucidate the ADO observed for rs509606, we used the FASTA34 software to screen the human genome (build 36.1, 2006, hg18) for sequences similar to the twenty 5′- and 3′-flanking bases of rs509606. We focused on hits near annotated coding regions that also contained at least one G/C SNP. The highest-scoring hits were manually sorted for the best match to the sequence composition around rs509606. rs11219825, rs2529438, rs3805938, and rs34668715 were chosen for the design of sequencing assays.

Sequence-pattern search. In a second approach, we searched the human genome for nucleic acid patterns generated on the basis of sequence characteristics around rs509606. In this search, we used the program fuzznuc (EMBOSS package, http://emboss.sourceforge.net/) with the patterns GC3GAC3TCSCTC2YSC and AC3TCSCTC2YSC (where S stands for G or C and Y stands for C or T).
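The pattern search can be illustrated with a minimal Python sketch. It is not the fuzznuc implementation: it assumes that a digit in the published shorthand (e.g., C3) means a run of that length, it translates only the IUPAC codes that occur in the two patterns, and it scans the forward strand only. The function names are hypothetical; the test sequence is the rs509606 flanking sequence from Table 1 with the C allele inserted.

```python
import re

# IUPAC codes that occur in the two published patterns (S = G or C, Y = C or T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "[GC]", "Y": "[CT]"}


def expand_shorthand(pattern):
    """Expand run-length shorthand, e.g. 'C3' -> 'CCC' (assumed meaning)."""
    return re.sub(r"([A-Z])(\d+)", lambda m: m.group(1) * int(m.group(2)), pattern)


def to_regex(pattern):
    """Translate an expanded IUPAC pattern into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in expand_shorthand(pattern)))


def find_hits(pattern, sequence):
    """Return (start, matched_text) for non-overlapping forward-strand matches."""
    return [(m.start(), m.group()) for m in to_regex(pattern).finditer(sequence.upper())]


# rs509606 flanks from Table 1, C allele at the SNP position.
RS509606_C = (
    "CCCGGCCGAACCTGCCCGACCCTCCCTCCC"
    + "C"
    + "CGGCTTGCCTTGCAGGCCGCCGCCCACCGC"
)

if __name__ == "__main__":
    for pat in ("GC3GAC3TCSCTC2YSC", "AC3TCSCTC2YSC"):
        print(pat, find_hits(pat, RS509606_C))
```

Both patterns end two bases past the SNP position, so a hit necessarily spans the polymorphic base; a genomewide scan would additionally need the reverse strand and a SNP annotation track, which this sketch omits.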
As described above, we focused on hits near annotated coding regions that also contained at least one G/C SNP. The highest-scoring results were sorted manually. We carried out sequencing assays for 11 SNPs (rs28563020, rs628979, rs12242873, rs7206090, rs4803243, rs4431038, rs7255993, rs7225964, rs3218266, rs136224, rs12975292), but none of them showed preferential allelic amplification. After further specification of the ADO phenomenon, we concluded that this strategy was unsuitable for its detection, and the SNPs that had been excluded as a possible cause of ADO then served as negative controls to evaluate the following database search strategies.

Database research strategies to further define ADO sequences and the causal DNA-sequence motifs. From the common properties of rs509606 (intron 1, MEN1), rs2529438 (identified by one of the preliminary search strategies, as described above), and rs4795541 [a diagnostically relevant ADO case reported by Yonan et al. (15)], we deduced computational search strategies 1 and 2, which are outlined in Fig. 3A. Whereas strategy 1 was designed merely to describe the sequence features of our originally detected ADO sequences, strategy 2 was designed to elucidate the general biochemical background of the phenomenon. The resulting data were checked with respect to their validation status and their presence in RefSeq and OMIM SNP lists.

Fig. 1. Example of a complete genotype change of a sample heterozygous for SNP rs2529438 caused by varying MgCl2 concentrations, as detected by pyrosequencing. The pyrosequencing data peaks for the polymorphic nucleotides (G/C) are indicated by arrows, and the peak ratios (G to C) represent the relative allelic ratios in the actually heterozygous samples. PCR was carried out in the presence of Pfu polymerase and buffer B3 containing the indicated MgCl2 concentrations. The genotype results reported by the pyrosequencing software are positioned above the respective pyrograms. Pyrogram peaks influenced by the G/C polymorphism are shown with a gray background.

Fig. 2. Quantitative genotyping data obtained under different conditions. Quantitative genotyping data for 2 heterozygous samples for rs2529438 in the presence of different Mg2+ concentrations, different enzymes (Taq, Pfu) (A), and different monovalent cations (NH4+, buffer B1; K+, buffer B3) (B). (C), Quantitative genotyping results obtained with different buffer systems and Mg2+ concentrations. Turquoise rectangles/yellow circles, buffer B3; red crosses/green crosses, buffer B1; blue crosses/pink rectangles, buffer B2; gray triangles/orange inverted triangles, buffer B5; black circles/orange triangles, buffer B4. (D), Comparison of calculated allele frequencies for rs509606 (MEN1, exon 2) in 2 heterozygous (G/C) samples, one a wild-type sample (blue) and the other a mutant-type sample (red; case 10, c.−23+2C>A). Pfu (+ and □); Taq (× and *). Peak ratios (G to C) were calculated from the peaks (e.g., Fig. 1, arrows) and represent relative ratios of alleles in the samples, which are actually heterozygous (A–D). Genotype calling by the pyrosequencing software is based on the peak ratios, and the genotyping results are indicated by different background colors (A, B, and D).

Fig. 3. Genomewide search strategies. (A), Genomewide search strategies for candidate ADO SNPs. (B), Graphical representation of the data obtained by the genomewide search strategies. Identification numbers of experimentally confirmed ADO SNPs are positioned inside the circles. Identification numbers for polymorphisms that served as negative controls are indicated. Dark-gray circle indicates results obtained from search strategy 1; light-gray circle represents results from search strategy 2. Val./val. status, NCBI dbSNP validation status.

Results

We initially detected the phenomenon described here by conventional sequencing in a patient with MEN1, who was apparently homozygous for a novel single-base deletion in exon 2 (c.400delT, case 6; see Table 1 in the online Data Supplement) of the MEN1 gene (Fig. 4C). This result was confirmed on both DNA strands. Because homozygous MEN1 defects have not been described in humans and are lethal in mice (16), the patient’s DNA was reamplified with a proofreading polymerase (Pfu, Fig. 4E). These PCR conditions revealed one wild-type and one mutant allele, indicating dropout of the wild-type allele during amplification with Taq polymerase. Initially, the patient also appeared to be homozygous for the G allele of a common polymorphism in the first intron (rs509606) of the MEN1 gene; however, amplification with Pfu revealed the patient to be heterozygous at this position as well.

Polymorphisms within the binding sites of the primers were ruled out as possible reasons for ADO, because changing the primers without omitting rs509606 did not solve the problem. Use of different primer pairs for the PCR showed that ADO occurred only when the intronic G allele of rs509606 was part of the amplified fragment (Fig. 4). Haplotype analysis by allele-specific PCR colocalized the G allele of rs509606 and the T deletion on the same strand (see Fig. 1 in the online Data Supplement). As expected, sequence analysis of MEN1 mRNA by reverse-transcription PCR showed no preferential amplification but always produced a heterozygous genotype. This is because the mRNA does not contain the intronic SNP rs509606 (Fig. 4G).

Fig. 4. Schematic representation of the MEN1 gene and diagnostic sequence analysis of PCR products. (A), Schematic representation of the MEN1 gene with positions of the forward and reverse primers (F1–F3 and R1–R2, respectively; arrows), the intronic rs509606 SNP and the c.400delT mutation (open circles). Exons and introns are represented by horizontal bars and lines, respectively (gray, untranslated; blue, translated). (B, C), Diagnostic sequence analysis of PCR products from a healthy control individual (B, wild type) and a MEN1 patient (case 6), who appeared (falsely) to be homozygous (C, mutant) for the intronic SNP rs509606 (C/G) and for the sporadic MEN1 mutation c.400delT (p.Phe134SerfsX51). (D), PCR products amplified by forward primer F3 lacking rs509606 showed the true genotype. (E), When Pfu was used instead of Taq polymerase, the true genotype was apparent. All electropherograms without further specification were recorded by a Beckman Coulter CEQ 8000 instrument. (F), Sequence analysis of Taq-amplified DNA by means of an Applied Biosystems (ABI) chemistry and an ABI 3730 system again demonstrated apparent homozygosity. (G), Representation of the MEN1 mRNA with positions of the forward primer F4, reverse primer R4 and the c.400delT mutation. The electropherogram lane shows the sequence of the mRNA generated with the F4 primer. The location of the deletion is marked by the dotted line. NA, not applicable.

Analysis of DNA samples from another 23 potential MEN1 patients revealed a dramatic deviation from Hardy–Weinberg equilibrium (i.e., almost complete heterozygosity deficiency) for rs509606 when the PCR was performed conventionally (Table 1; see Table 1 in the online Data Supplement). The upper part of Table 1 shows a comparison of expected and actual heterozygote frequencies for rs509606 in all analyzed MEN1 samples. A large deviation from the Hardy–Weinberg equilibrium was produced when Taq polymerase and the original Qiagen PCR buffer were used without additional MgCl2. We then contacted 2 other diagnostic laboratories experienced in MEN1 testing. They, too, reported heterozygosity deficiency for rs509606 but had not been able to solve the problem.

Table 1. Comparison of expected and actual heterozygote frequencies for rs509606 in all analyzed MEN1 samples (upper part). Flanking sequences of the SNPs with experimentally confirmed ADO (lower part).

rs509606: Allele frequencies and deficiency of heterozygotes (n = 23)
Column headings: Taq polymerase (Qiagen buffer; in-house buffer); true genotype. Mg2+ concentrations: 1.5 mmol/L; 1.6 mmol/L; 6 mmol/L.
Minor-allele (G) frequency: 0.222; 0.422; 0.222; 0.444
Heterozygotes, actual/expected, n: 10/8; 1/11; 10/8; 0/12

Flanking sequences of all SNPs with experimentally confirmed ADO (a)
i-motif–like sequences:
rs509606: CCCGGCCGAACCTGCCCGACCCTCCCTCCC G/C CGGCTTGCCTTGCAGGCCGCCGCCCACCGC
rs2529438: GCCCCGGCTCTTAGCCCGACCCTCGCTCCT G/C CTCCGCCGGTCCCTCAGCGCGGCCTCCTGC
rs3742558: CCGCCGCCGTCCCGCCTGCCCAACCCCCGC C/T CCTCCCTCCGCTTCTCTGCCTCGGGCCAGG
rs1230263: GGCCCTACCCTCGGCCCCCGACCGCCCACA T/C CCGCCGGTTACCCTCGAGGCTCCCCGGCCG
G-quadruplex–like sequences:
rs4898786: CAGGGGGCGGCGCCGGTCGGGTAGGGTCGG G/A CTGGCGGGAGCCCGGGGCGGGGCTTGGGCA
rs4898787: GGGCTTGGGCATGGCTGGCTGCAGGTCCCC A/G GCCCTGCCACACGGGGAGGCGGCTGAGGCC
a Homopolymers that putatively cause secondary or tertiary structures are underlined.

From these results, we hypothesized that the allelic state of rs509606 and its flanking sequence somehow caused the observed phenomenon. In a first, preliminary approach, we searched the dbSNP (NCBI) database for sequences similar to the twenty 5′- and 3′-flanking bases of rs509606 and harboring a SNP. One of 4 candidate SNPs, rs2529438, also showed preferential amplification of one allele with Taq polymerase. Surprisingly, use of Pfu for amplification did not solve the problem but rather led to preferential amplification of the other allele. Further analysis by a quantitative pyrosequencing approach demonstrated that the magnesium concentration in the PCR reaction profoundly changed the amplification ratio for the 2 alleles of rs2529438 produced by Taq as well as by Pfu; however, the 2 enzymes differed substantially in their overall effects (Figs. 1 and 2). This result strongly indicated that the formation of secondary and tertiary structures might be responsible for the preferential amplification.

We therefore analyzed the flanking sequences of the now-identified ADO SNPs rs509606 and rs2529438 in more detail. The lower part of Table 1 shows that the sequences flanking both SNPs have an unusually high GC content and represent inverted G-quadruplex motifs, known as i-motifs (C2–7 N1–7 C2–7 N1–7 C2–7 N1–7 C2–7, where N is any base different from the 2 adjacent bases) (8, 9), that harbor a SNP with one allelic variant being a C. The C homopolymers (2–7 bases) that putatively cause secondary or tertiary structures are underlined. Accordingly, the complementary strands represented a G-quadruplex motif, facilitating the formation of a G-quadruplex, a DNA structure composed of tetrads of hydrogen-bonded guanine residues. From these observations and the fact that G-quadruplex motifs are highly prevalent in the human genome (5), we concluded that numerous SNPs and sequences within the genome might contain sequence features predisposing the respective regions to preferential amplification of one particular allele during the PCR.

To identify such sequences, we searched the NCBI database dbSNP by means of 2 strategies. A more stringent strategy (strategy 1) was based on the characteristics of the nucleotide composition of the sequences flanking rs509606, rs2529438, and rs4795541 (15), and an extended strategy (strategy 2) was based on potential G-quadruplex and i-motif structures (Fig. 3A). Fig. 3B graphically represents the data obtained by the genomewide search strategies. The data set for strategy 2 contains approximately 30 000 potential G-quadruplex– and i-motif–forming SNP-flanking sequences, which are ADO candidates. This data set represents 0.5% of all known SNPs (the rate may be even higher among unknown SNPs, because detection of these SNPs may often fail because of the described PCR artifacts).
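The motif definition quoted above (C2–7 N1–7 C2–7 N1–7 C2–7 N1–7 C2–7) can be approximated with a regular expression. The following Python sketch is one possible reading and not the authors' Perl code: it assumes that "N is any base different from the 2 adjacent bases" means each loop must begin and end with a base other than the run base, and it derives the G-quadruplex pattern by swapping C for G. All names are hypothetical.

```python
import re


def build_motif(run_base):
    """Four runs of 2-7 identical bases separated by three loops of 1-7 bases.

    Assumption: a loop starts and ends with a base other than the run base,
    so that it cannot be absorbed into the neighbouring runs.
    """
    other = "ACGT".replace(run_base, "")
    run = f"{run_base}{{2,7}}"
    loop = f"(?:[{other}][ACGT]{{0,5}}[{other}]|[{other}])"
    return re.compile(f"(?:{run}{loop}){{3}}{run}")


I_MOTIF = build_motif("C")
G_QUADRUPLEX = build_motif("G")


def classify(sequence):
    """Report whether either motif occurs anywhere on the given strand."""
    seq = sequence.upper()
    return {
        "i_motif": I_MOTIF.search(seq) is not None,
        "g_quadruplex": G_QUADRUPLEX.search(seq) is not None,
    }


if __name__ == "__main__":
    # rs509606 flanks from Table 1 with the C allele inserted.
    rs509606_c = (
        "CCCGGCCGAACCTGCCCGACCCTCCCTCCC" + "C" + "CGGCTTGCCTTGCAGGCCGCCGCCCACCGC"
    )
    print(classify(rs509606_c))
```

A screen of this kind only flags candidates on the strand it is given; whether a flagged SNP actually shows ADO still has to be tested experimentally.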
To estimate the rate of diagnostically relevant SNPs, we compared the strategy 2 data set with OMIM-listed dbSNP entries and found that 0.5% of the total OMIM-listed SNPs are predicted to be affected. Furthermore, about one third of this data set was located within NCBI reference sequences (RefSeqs), suggesting potential diagnostic relevance. The strategy 1 data set is almost exclusively a subset of data set 2, suggesting that certain DNA secondary structures, such as potential G-quadruplex– and i-motif–forming sequences, are in fact one possible cause of ADO. More than 90% of the SNPs identified by strategy 1 apparently contained G-quadruplexes or i-motifs (Table 1, Fig. 3B).

We randomly selected a subset of 15 SNPs detected by strategy 1 and designed pyrosequencing assays to quantitatively evaluate allele-specific preferential amplification (12) by stepwise addition of MgCl2 to the PCR reaction buffer (e.g., for rs2529438 and rs509606 in Fig. 2, A, B, and D). Nine of these assays detected SNPs in different allelic states in our test collection from 96 healthy individuals and performed technically well without any need for extensive optimization. The results are summarized in Table 2. Dropout of one allele was observed in 3 (rs12302637, rs3742558, rs4898786) of the 9 assays. The PCR product of one of these ADO assays contained 2 SNPs in close proximity, and preferential amplification was dependent on the haplotype status (rs4898786 and rs4898787). Two additional SNPs (rs2072579 and rs7008933) displayed preferential amplification of one allele but no complete genotype change. Thus, the overall frequency of preferential amplification of one allele was approximately 50% (5 of 9 assays).

Divalent cations such as Mg2+ and Ca2+ are known to mediate the transition from parallel to antiparallel G-quadruplex structures (17). Figs. 1 and 2 show genotyping results for 2 samples from heterozygous rs2529438 individuals and their dependence on the magnesium concentration. Use of Pfu (Fig. 2A) and increasing the MgCl2 concentration allowed the apparent genotype to be “titrated” from GG to CC, indicating clearly that the phenomenon of ADO is by no means restricted to Taq polymerase (Figs. 2A and 4). We also tested a high-fidelity enzyme mixture (Elongase; data not shown), which yielded similar results. We then studied the effect of the monovalent cations K+ and NH4+, which are known to act as central ions in G tetrads, thus enabling the formation of G-quadruplexes (18). The Mg2+-dependent genotype change was more evident when K+ was added to the buffer system (Fig. 2B); however, the influence of K+ or NH4+ on the amplification results was much smaller than that of Mg2+.
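The logic of the Mg2+ titration can be sketched in a few lines of Python. This is a hedged illustration only: the peak areas are invented, the calling thresholds (0.2 and 0.8) are assumptions rather than the rules of the pyrosequencing software, and the function names are hypothetical. The sketch converts G and C peak areas into a G-allele fraction, calls an apparent genotype at each Mg2+ concentration, and flags a known heterozygous control whose call leaves the heterozygous range.

```python
def allele_fraction(peak_g, peak_c):
    """G-allele fraction from the areas of the two allele-specific peaks."""
    total = peak_g + peak_c
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return peak_g / total


def call_genotype(frac_g, low=0.2, high=0.8):
    """Apparent genotype; the thresholds are illustrative assumptions."""
    if frac_g >= high:
        return "GG"
    if frac_g <= low:
        return "CC"
    return "GC"


def titration_calls(series):
    """series: iterable of (mg_mmol_per_l, peak_g, peak_c) for one sample."""
    return [
        (mg, call_genotype(allele_fraction(g, c)))
        for mg, g, c in sorted(series)
    ]


def shows_dropout(series):
    """True if a heterozygous control is called homozygous at any Mg2+ level."""
    return any(call != "GC" for _, call in titration_calls(series))


# Invented peak areas for one heterozygous control sample.
EXAMPLE = [(1.5, 90.0, 10.0), (3.0, 55.0, 45.0), (6.0, 8.0, 92.0)]

if __name__ == "__main__":
    print(titration_calls(EXAMPLE))
    print("dropout observed:", shows_dropout(EXAMPLE))
```

Running the same series with different enzymes or buffers, as in Fig. 2, would show which conditions keep the heterozygous control within the heterozygous range.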
To rule out the possibility that the change in buffer osmolality, and not the Mg2+ concentration, was responsible for the observed genotype shift, we depicted the quantitative genotyping results obtained with different buffer systems and Mg2+ concentrations as a function of the osmolality, as measured by the method of freezing-point depression with an osmometer (Fig. 2C). Compared with the Mg2+ concentration, the osmolality had nearly no influence on the determined genotype.

An exchange of only a single base, however, within the flanking region of an ADO SNP can cause major changes in the properties of the alleles during amplification. The only patient (case 10, see Table 1 in the online Data Supplement) who was genotyped correctly (G/C) for rs509606 by our original genotyping assay showed an additional base exchange (c.−23+2C>A, a probable splice site defect) 18 bp 3′ of rs509606. This result supports the finding that sequence motifs are the basis of the described phenomenon (Fig. 2D).

Table 2. Results of the PCR assays designed for a randomly chosen subset of ADO-candidate SNPs detected by search strategy 1.

SNP identifier (dbSNP) | GC content, % (a) | Longest G-quadruplex– or i-motif–like sequence flanking the SNP | Potential gene function | ADO | Allele frequencies (n = 96)
rs3748784 | 78.7 | 15 C stretches (C ≥ 2), i-motif | EST (b) | No | C = 0.554; T = 0.446; HZ = 0.50
rs3903289 | 78.7 | 15 C stretches (C ≥ 2), i-motif | Potential gene | No | T = 0.854; C = 0.146; HZ = 0.25
rs2279935 | 78.7 | 13 C stretches (C ≥ 2), i-motif | Potential transcript | No | C = 0.990; G = 0.010; HZ = 0.02
rs11692546 | 75.4 | 13 C stretches (C ≥ 2), i-motif | EST | No | G = 0.719; C = 0.281; HZ = 0.42
rs2072579 | 65.6 | 14 C stretches (C ≥ 2), i-motif | SART3, cds | Buffer-dependent AQ change | G = 0.813; C = 0.187; HZ = 0.29
rs7008933 | 60.7 | 4 G stretches (G ≥ 2), G-quadruplex | EST | Buffer-dependent AQ change | G = 0.990; T = 0.010; HZ = 0.02
rs12302637 | 78.7 | 7 C stretches (C ≥ 2), i-motif; preferential amplification: C | EP400 N-terminal–like protein, intronic | Yes | T = 0.824; C = 0.176; HZ = 0.27
rs3742558 | 78.7 | 11 C stretches (C ≥ 2), i-motif; preferential amplification: T | Hypothetical mRNA | Yes | T = 0.708; C = 0.292; HZ = 0.40
rs4898786/rs4898787 | 80.9 | 19 G stretches (G ≥ 2), G-quadruplex; preferential amplification: haplotype dependent | EST | Yes | Genotype not determined for each case

a GC content of 60 SNP-flanking bases.
b EST, expressed sequence tag; HZ, heterozygote; SART3, squamous cell carcinoma antigen recognized by T cells 3; cds, coding sequence; AQ, Pyrosequencing-based allele quantification.
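The two sequence descriptors reported in Table 2 (GC content and the number of C or G stretches of at least 2 bases) are simple to compute. The Python sketch below is illustrative; the function names are hypothetical, and the exact window the authors used for counting stretches is not stated, so the stretch count here applies to whatever sequence is passed in.

```python
import re


def gc_content(sequence):
    """Percentage of G and C bases in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def count_stretches(sequence, base, min_len=2):
    """Number of maximal runs of `base` that are at least `min_len` long."""
    return len(re.findall(f"{base}{{{min_len},}}", sequence.upper()))


# rs3742558 flanking sequences as listed in Table 1.
FLANK_5 = "CCGCCGCCGTCCCGCCTGCCCAACCCCCGC"
FLANK_3 = "CCTCCCTCCGCTTCTCTGCCTCGGGCCAGG"

if __name__ == "__main__":
    with_t_allele = FLANK_5 + "T" + FLANK_3
    print(round(gc_content(with_t_allele), 1))
    print(count_stretches(with_t_allele, "C"))
```

For the rs3742558 flanks with the T allele inserted, the GC content evaluates to 78.7%, the value listed in Table 2; this agreement suggests, but does not establish, that the polymorphic base was included in the published calculation.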

It is common knowledge that GC-rich sequences are difficult PCR templates and that amplification of such sequences often requires intensive optimization and/or the use of PCR additives, such as DMSO, formamide, or betaine. Our standard PCR-optimization procedure consists of amplification in the presence and absence of 1× solution Q (Qiagen; key component, betaine). All assays described in this study produced very good amplification yields under standard PCR conditions, either with or without the addition of solution Q. Whereas Mg2+ ions could be titrated over a wide range without any obvious change with respect to amplification yield, the concomitant genotype changes were maximal. Even small changes in the concentration of solution Q, however, altered amplification efficiency dramatically, but varying the concentration of solution Q did not change the resulting genotype (in our cases, the ADO phenomenon persisted) as long as the product yield was sufficient to enable genotyping. Similar data were obtained when DMSO or formamide was substituted for solution Q.

Discussion

Taken together, our data show that certain G-quadruplex motifs can cause reproducible, but at present unpredictable, effects on PCR efficiency. We hypothesize that the stability of a G-quadruplex or i-motif is changed if a base exchange occurs within its characteristic sequence. These PCR artifacts can cause preferential amplification of one allele in heterozygous samples and thereby produce systematic genotyping errors that can lead to misdiagnosis of genetic diseases, as was observed in our case (Fig. 4; Fig. 2 in the online Data Supplement). False genotyping results may affect clinical diagnoses, including prenatal diagnoses, as well as epidemiologic studies. Genotyping errors may lead to false results in association studies and also may cause false allele-frequency entries in the databases.

Although we have analyzed the effect of these sequence motifs only with respect to genetic tests, it is obvious that they may also affect reverse-transcription PCR: ADO SNPs are found within expressed sequences and untranscribed regions to the same extent (Table 2). In gene expression analyses (e.g., with a quantitative or semiquantitative PCR approach) and transcription-profiling studies (e.g., with microarrays), allele-specific preferential amplification may mimic differential expression in samples of different genotypes. These results may even be reproducible when several different primer sets and several methods of PCR product analysis are used, if the ADO SNP and its flanking sequence remain part of the amplified regions.

We developed a search strategy (strategy 1) for these sequence motifs that is capable of detecting motifs with a high propensity to cause PCR problems. Although this search strategy yields a high proportion of true-positive results (in our case, one third given that the criterion is ADO and greater than one half given that the criterion is the preferential amplification of one allele), we have no data regarding the sensitivity of the strategy with respect to ADO SNP detection but suspect that the stringent search algorithm (strategy 1) still underestimates the total number of affected SNPs. Yonan et al., for example, reported a similar ADO in SLC6A4 [solute carrier family 6 (neurotransmitter transporter, serotonin), member 4] (15).
Again, the affected promoter polymorphism (rs4795541) is situated within an i-motif that would be detected only by our extended search algorithm (strategy 2). Boán et al. observed preferential allelic amplification due to G-quadruplex formation of the (TGGGGC)4 motif in the human minisatellite MsH43 (1). Neither group, however, attempted to assess whether their observations have more-general implications for genotyping. On the other hand, it is well known that there are genomic regions that are systematically over- or underamplified by whole-genome amplification procedures. In a recent analysis of the impact of the bias introduced by whole-genome amplification on the detection of copy number variants by SNP array platforms, Pugh et al. (19) found that whole-genome amplification induced hundreds of reproducible copy number variant artifacts. The error-prone genomic regions were characterized by a high GC content, enrichment for repetitive sequences, and frequent localization in proximity to chromosome ends. Although the authors did not comment on whether DNA secondary and tertiary structures were possible causes for their observations, the properties in common with G-quadruplex and i-motif structures are obvious.

At present, there is no simple way to identify critical sequence motifs with accuracy and specificity. Therefore, researchers and diagnostic laboratories must be aware of the potential for G-quadruplex and i-motifs to cause PCR artifacts. Careful assay design, including analysis of the target sequence with respect to local GC enrichment and the accumulation of GGN or CCN triplets (where N is a base different from the 2 adjacent bases), and assessment of the possible presence of G-quadruplex or i-motifs [strategy 2, QGRS Mapper (13); Quadfinder (14)] are crucial to the detection of assays potentially affected by ADO. The overall GC content of the PCR product, however, plays a minor role in this context. To exclude possible genotyping errors caused by the described sequences, we recommend omitting all G-quadruplex– or i-motif–like sequences from a PCR product whenever such sequences are not the primary target of detection, because they may harbor SNPs that change the stability of a G-quadruplex or i-motif structure and thereby induce preferential amplification of one of the alleles.

In cases of the unavoidable presence of suspicious sequences within the region of interest or in cases of genotyping results not in accordance with the Hardy–Weinberg equilibrium (especially in cases of heterozygosity deficiency), the PCR should be carried out with heterozygous control DNA in a K+-containing buffer and with varying concentrations of MgCl2, which we identified as the key buffer component affecting allele-dependent PCR efficiency. Unfortunately, most manufacturers do not provide detailed information about the composition of their PCR buffers beyond the MgCl2 concentration.
Where the composition of a commercial buffer is not disclosed, it may be necessary to perform PCRs with an in-house buffer system (see Table 3 in the online Data Supplement) that permits systematic changes in ion concentrations. PCR products are subsequently analyzed by a method allowing allele quantification (e.g., pyrosequencing) to visualize any Mg2+-dependent genotype change in heterozygous samples (Figs. 1 and 2) and to finally establish PCR conditions that consistently produce similar amplification efficiencies for both alleles.

Most of the assays that were affected by preferential amplification needed a defined concentration of denaturing agents to facilitate amplification. Only small changes in the concentrations of denaturing agents were required to alter the amplification efficiency of the PCR product dramatically. We hypothesize that secondary structures (including the G-quadruplex and the i-motif) of the PCR product are influenced so profoundly by denaturing agents that concentration changes produce not a genotype change but a failure of amplification.

In summary, G-quadruplex– and i-motif–like sequences cause systematic errors in genetic analyses that may lead to misdiagnoses in clinical settings. PCR products should be checked for error-prone sequences. Furthermore, we have provided a simple experimental procedure that is suitable for detecting and solving the problem.
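As an illustration of how the readout of the MgCl2 titration described above could be evaluated, the following sketch takes the quantified fraction of one allele in a heterozygous control at each MgCl2 concentration and reports which conditions amplify both alleles about equally. The function names, the tolerance of 0.10 around the expected fraction of 0.5, the minimum shift of 0.20, and all numbers in the example are assumptions made for this sketch; they are not taken from the study and say nothing about the direction of the effect in a real assay.

```python
def balanced_conditions(allele_fraction_by_mg, expected=0.5, tolerance=0.10):
    """MgCl2 concentrations (mmol/L) at which the heterozygous control
    shows an allele fraction within `tolerance` of `expected`.

    Both thresholds are illustrative assumptions.
    """
    return sorted(
        mg
        for mg, fraction in allele_fraction_by_mg.items()
        if abs(fraction - expected) <= tolerance
    )


def shows_mg_dependence(allele_fraction_by_mg, min_shift=0.20):
    """True if the allele fraction changes by at least `min_shift`
    across the tested MgCl2 concentrations (hypothetical cutoff)."""
    fractions = list(allele_fraction_by_mg.values())
    return max(fractions) - min(fractions) >= min_shift


if __name__ == "__main__":
    # Invented example: fraction of allele 1 measured by an
    # allele-quantifying method in a known heterozygote.
    titration = {1.0: 0.52, 1.5: 0.47, 2.0: 0.31, 3.0: 0.08}
    print("Mg-dependent shift:", shows_mg_dependence(titration))
    print("Balanced at (mmol/L):", balanced_conditions(titration))
```

A marked shift across the titration would point to allele-dependent amplification, and the balanced concentrations would be candidates for the final PCR conditions, to be confirmed on additional heterozygous samples.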

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors’ Disclosures of Potential Conflicts of Interest: Upon manuscript submission, all authors completed the Disclosures of Potential Conflict of Interest form. Potential conflicts of interest: Employment or Leadership: None declared. Consultant or Advisory Role: None declared. Stock Ownership: None declared. Honoraria: C. Fottner, travel grants from Novartis, and compensation for scientific talks for Novartis, AstraZeneca, and Sanofi-Aventis. Research Funding: None declared. Expert Testimony: None declared.

Role of Sponsor: The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.

References

1. Boán F, Blanco MG, Barros P, González AI, Gómez-Márquez J. Inhibition of DNA synthesis by K+-stabilised G-quadruplex promotes allelic preferential amplification. FEBS Lett 2004;571:112–8.
2. Platzer M, Hiller M, Szafranski K, Jahn N, Hampe J, Schreiber S, et al. Sequencing errors or SNPs at splice-acceptor guanines in dbSNP? Nat Biotechnol 2006;24:1068–70.
3. Quinlan AR, Marth GT. Primer-site SNPs mask mutations. Nat Methods 2007;4:192.
4. Walsh PS, Erlich HA, Higuchi R. Preferential PCR amplification of alleles: mechanisms and solutions. PCR Methods Appl 1992;1:241–50.
5. Todd AK, Johnston M, Neidle S. Highly prevalent putative quadruplex sequence motifs in human DNA. Nucleic Acids Res 2005;33:2901–7.
6. Du Z, Zhao Y, Li N. Genome-wide analysis reveals regulatory role of G4 DNA in gene transcription. Genome Res 2008;18:233–41.
7. Rawal P, Kummarasetti VB, Ravindran J, Kumar N, Halder K, Sharma R, et al. Genome-wide prediction of G4 DNA as regulatory motifs: role in Escherichia coli global regulation. Genome Res 2006;16:644–55.
8. Burge S, Parkinson GN, Hazel P, Todd AK, Neidle S. Quadruplex DNA: sequence, topology and structure. Nucleic Acids Res 2006;34:5402–15.
9. Guo K, Pourpak A, Beetz-Rogers K, Gokhale V, Sun D, Hurley LH. Formation of pseudosymmetrical G-quadruplex and i-motif structures in the proximal promoter region of the RET oncogene. J Am Chem Soc 2007;129:10220–8.
10. Giraud S, Zhang CX, Serova-Sinilnikova O, Wautot V, Salandre J, Buisson N, et al. Germ-line mutation analysis in patients with multiple endocrine neoplasia type 1 and related disorders. Am J Hum Genet 1998;63:455–67.
11. Lemmens I, Van de Ven WJ, Kas K, Zhang CX, Giraud S, Wautot V, et al. Identification of the multiple endocrine neoplasia type 1 (MEN1) gene. The European Consortium on MEN1. Hum Mol Genet 1997;6:1177–83.
12. Rossmann H, Büchler E, Wenzel JJ, Neukirch C, du Prel JB, Lackner KJ. Evaluation of a new pooling strategy based on leukocyte count for rapid quantification of allele frequencies. Clin Chem 2007;53:980–2.
13. Kikin O, D’Antonio L, Bagga PS. QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences. Nucleic Acids Res 2006;34:W676–82.
14. Scaria V, Hariharan M, Arora A, Maiti S. Quadfinder: server for identification and analysis of quadruplex-forming motifs in nucleotide sequences. Nucleic Acids Res 2006;34:W683–5.
15. Yonan AL, Palmer AA, Gilliam TC. Hardy-Weinberg disequilibrium identified genotyping error of the serotonin transporter (SLC6A4) promoter polymorphism. Psychiatr Genet 2006;16:31–4.
16. Crabtree JS, Scacheri PC, Ward JM, Garrett-Beal L, Emmert-Buck MR, Edgemon KA, et al. A mouse model of multiple endocrine neoplasia, type 1, develops multiple endocrine tumors. Proc Natl Acad Sci U S A 2001;98:1118–23.
17. Miyoshi D, Nakao A, Sugimoto N. Structural transition of d(G4T4G4) from antiparallel to parallel G-quartet induced by divalent cations. Nucleic Acids Res Suppl 2001;259–60.
18. Simonsson T. G-quadruplex DNA structures—variations on a theme. Biol Chem 2001;382:621–8.
19. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, et al. Impact of whole genome amplification on analysis of copy number variants. Nucleic Acids Res 2008;36:e80.
