ARTICLE

Development of a Genotyping Microarray for Studying the Role of Gene-Environment Interactions in Risk for Lung Cancer

Don A. Baldwin,1,2 Christopher P. Sarnowski,3 Sabrina A. Reddy,3 Ian A. Blair,2,4,5 Margie Clapper,6 Philip Lazarus,7 Mingyao Li,8 Joshua E. Muscat,9 Trevor M. Penning,2,4 Anil Vachani,2,10 and Alexander S. Whitehead2,4

1Pathonomics LLC, Philadelphia, Pennsylvania 19104, USA; 2Center of Excellence in Environmental Toxicology, 3Penn Molecular Profiling Facility, Departments of 4Pharmacology and 10Medicine, and Centers for 5Cancer Pharmacology and 8Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; 6Cancer Prevention and Control Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA; 7Department of Pharmaceutical Sciences, Washington State University, Spokane, Washington 99210, USA; and 9Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania 17033, USA

A microarray (LungCaGxE), based on Illumina BeadChip technology, was developed for high-resolution genotyping of genes that are candidates for involvement in environmentally driven aspects of lung cancer oncogenesis and/or tumor growth. The iterative array design process illustrates techniques for managing large panels of candidate genes and optimizing marker selection, aided by a new bioinformatics pipeline component, Tagger Batch Assistant. The LungCaGxE platform targets 298 genes and the proximal genetic regions in which they are located, using ~13,000 DNA single nucleotide polymorphisms (SNPs), which include haplotype linkage markers with a minimum allele frequency of 1% and additional specifically targeted SNPs, for which published reports have indicated functional consequences or associations with lung cancer or other smoking-related diseases. The overall assay conversion rate was 98.9%; 99.0% of markers with a minimum Illumina design score of 0.6 successfully generated allele calls using genomic DNA from a study population of 1873 lung-cancer patients and controls.

KEY WORDS: genetic association, environmental exposures, Tagger Batch Assistant, LungCaGxE

INTRODUCTION

Lung cancer is the leading cause of cancer death for men and women in the United States. The American Cancer Society estimates that in 2013, there will be 228,190 new cases (118,080 in men; 110,110 in women) and 159,480 deaths.1 Many patients present with disease that is too advanced to treat successfully with surgery and the current portfolio of drugs. Identification of those at highest risk of disease would facilitate earlier diagnosis and therapeutic intervention, with consequent reduced mortality and longer survival time. Risk identification techniques would also support preventative screening and targeted interventions, such as smoking-cessation programs leading to reduced incidence. Given the huge number of new lung cancer cases that occur each year, the impact of such interventions would be significant even if applicable only to an etiologically distinct subset of all cases.

As the majority (up to 90%) of lung cancers occurs in smokers, but only a minority (~10%) of smokers get the disease,2 it is likely that significant gene/phenotype/environment interactions exist.3 Although tobacco smoke is the main etiologic agent,4 the long latency between exposure and disease, the multistep nature of neoplastic transformation,5 and the low, 10-year lung-cancer risk of elderly, life-long heavy smokers (15%)6 suggest that factors other than tobacco-associated carcinogens modify risk. These likely include environmental variables,7 functional genetic polymorphisms,8,9 and differential expression of genes that interact with such variables.10

Strategies to identify associations between genetic variants and diseases, such as lung cancer, include genotyping sequence polymorphisms that are distributed throughout the genome or that occur in specifically targeted genes of interest. Compared with genome-wide approaches, genotyping a focused set of single nucleotide polymorphisms (SNPs) for high-resolution haplotype mapping boosts analysis power for identifying single gene and gene family effects with statistical significance. Targeted, redundant genotyping of candidate genes further enables the analysis of additional variables, such as environmental factors, without a requirement to sample extremely large populations. However, designing a genotyping assay that adequately covers each candidate gene with a sufficiently large number of markers poses a challenge for this approach, especially when interrogating many genes in parallel. Standard genome-wide platforms, such as Affymetrix (Affymetrix, Santa Clara, CA, USA) or Illumina microarrays (Illumina, San Diego, CA, USA), provide predesigned collections of genotyping assays but rarely include enough markers to approach saturation of any given target gene. Microarray vendors therefore offer custom manufacturing options to allow researchers to create comprehensive panels of assays that satisfy the requirements of high-resolution genotyping.

We describe a process that connects publicly available SNP catalogs with commercial assay design interfaces, using a new bioinformatics tool that assists with the management of large collections of genes and their haplotype-tagging (HapTag) SNPs. This process was used to demonstrate the rapid and iterative design of a custom genotyping microarray for studying lung cancer.

ADDRESS CORRESPONDENCE TO: Alexander S. Whitehead, Perelman School of Medicine, University of Pennsylvania, Room 1311, BRB II/III, 421 Curie Blvd., Philadelphia, PA 19104-6160, USA (Phone: 215-898-2332; E-mail: [email protected]). doi: 10.7171/jbt.13-2404-004

Journal of Biomolecular Techniques 24:198–217 © 2013 ABRF

BALDWIN ET AL. / CUSTOM GENOTYPING FOR LUNG CANCER

MATERIALS AND METHODS

Target Selection

Investigators in our consortium contributed prioritized lists of genes potentially relevant to environmentally mediated biological processes leading to lung cancer. Candidate genes included modulators of and checkpoints within pathways hypothesized to respond to tobacco toxins and environmental factors that may promote oncogenesis, as well as those that may act in concert with environmental factors to support tumor survival, progression, and growth. These genes fell into broad categories, including tobacco-specific nitrosamine [particularly nitrosaminoketone (NNK)] activation and detoxification, polycyclic aromatic hydrocarbon (PAH) activation and detoxification, repair of NNK- and PAH-attributable DNA damage, oxidative stress, inflammatory signaling and processes of immune regulation, steroid hormone metabolism and signaling, nicotine addiction and smoking behavior, and folate transport and metabolism. For each individual gene, HapTag SNPs and genetic polymorphisms known to affect function or shown previously to be associated with risk for lung cancer were sought and if found, incorporated into the final microarray design. Target sources included extensive literature searches, Ingenuity Pathway Analysis (http://www.ingenuity.com), Database for Annotation, Visualization, and Integrated Discovery (DAVID) Bioinformatics Resources,11,12 and ongoing research in investigators’ laboratories.

JOURNAL OF BIOMOLECULAR TECHNIQUES, VOLUME 24, ISSUE 4, DECEMBER 2013

SNP Selection

All targeted genes/chromosomal regions were uploaded to the Assay Design Tool (http://support.illumina.com/tools.ilmn; Illumina) for retrieval of all iSelect Infinium database SNPs within each targeted region, as well as from 15 kb sequences flanking the gene-boundary coordinates. Known polymorphisms from the target-selection phase were also queried by reference SNP (rs) number from database of SNPs (dbSNP; http://www.ncbi.nlm.nih.gov/snp/), or uploaded as custom sequences if polymorphisms were unrecognized by iSelect or not annotated in dbSNP. Independently, the targeted genes and regions were analyzed using Tagger (http://www.broadinstitute.org/mpg/tagger/server.html, and International HapMap Project haplotype mapping databases therein)13 with the following parameters in all combinations: HapMap panels of Utah (U.S.A.) residents of northern and western European ancestry (CEU) and residents of Ibadan, Nigeria of Yoruban ancestry (YRI); SNP minimum allele frequencies 5% and 1%; Tagger mode pairwise and aggressive; SNP r² threshold 0.8; and default settings for all other parameters. The Tagger online interface does not support batch queries using gene symbols, so we created the Tagger Batch Assistant (http://www.bioinformatics.upenn.edu/tagtool/batch.html) as a tool for automated processing of large query lists and management and formatting of the output data. The retrieved iSelect SNPs were filtered to retain markers with an Infinium design score ≥0.6 (a 60% probability of conversion, i.e., successful genotyping assays for that SNP), and the subset corresponding to selected HapTag SNPs from Tagger was identified. No Infinium design score limits were imposed on functional SNPs from the target selection phase. A panel of 357 ancestry informative markers was included (http://support.illumina.com/array/array_kits/dna_test_panel.ilmn, Illumina catalog GT-17-222).

Genotyping
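As an illustration of the SNP selection procedure described above, the following Python sketch enumerates the Tagger parameter combinations (two HapMap panels, two allele-frequency cutoffs, two tagging modes) and applies the design-score rule. It is a minimal sketch under stated assumptions, not the pipeline used for LungCaGxE: the function names, the dictionary layout, and the rs identifiers in the usage note are hypothetical, and only the parameter values are taken from the text.

```python
from itertools import product

# Parameter values quoted in the SNP selection text.
PANELS = ("CEU", "YRI")
MIN_ALLELE_FREQS = (0.05, 0.01)
MODES = ("pairwise", "aggressive")
R2_THRESHOLD = 0.8
MIN_DESIGN_SCORE = 0.6


def tagger_runs():
    """Enumerate every Tagger configuration (panel x allele frequency x mode)."""
    return [
        {"panel": panel, "min_maf": maf, "mode": mode, "r2": R2_THRESHOLD}
        for panel, maf, mode in product(PANELS, MIN_ALLELE_FREQS, MODES)
    ]


def select_markers(design_scores, haptags, functional_snps):
    """Return the marker set for the array design.

    design_scores: dict of rs number -> Infinium design score (hypothetical layout)
    haptags: rs numbers proposed by Tagger
    functional_snps: rs numbers from target selection (no score limit applied)
    """
    passing = {rs for rs, score in design_scores.items() if score >= MIN_DESIGN_SCORE}
    return (passing & set(haptags)) | set(functional_snps)
```

With the toy input `select_markers({"rs_a": 0.9, "rs_b": 0.4}, {"rs_a", "rs_b"}, {"rs_f"})`, the HapTag `rs_b` is dropped for its low design score while the functional SNP `rs_f` is kept regardless of score, mirroring the two rules stated in the text.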

DNA was extracted from whole-blood samples or buffy-coat fractions using Chemagic DNA purification kits and a Chemagen Magnetic Separation Module I robot (Chemagen/PerkinElmer, Baesweiler, Germany). DNA quality-control checks included A260/280 and E-Gel electrophoresis (Invitrogen, Life Technologies, Grand Island, NY, USA), and DNA samples (n=1873) were normalized to 50 ng/µL and used for genotyping assays. Genotyping was conducted using the iScan system (Illumina), according to the manufacturer’s protocols.14 The Infinium assay amplified and fragmented 200 ng genomic DNA, which was then hybridized to our LungCaGxE iSelect HD Custom BeadChips containing 24 arrays/BeadChip and 13,308 assayed SNPs/array. Four negative control (no DNA) arrays were processed, and 43 samples were processed twice to check assay consistency. Data from scanned BeadChips were processed in Illumina GenomeStudio for signal quantitation, quality control, and genotype assignments.

The research described does not involve animals. Blood samples from human subjects were collected with their informed consent for research use, including genetic analyses. This study was approved by Institutional Review Boards at the University of Pennsylvania, Pennsylvania State University, Temple University, and Fox Chase Cancer Center.

RESULTS

Tagger Batch Assistant

The online Tagger Batch Assistant tool was designed with two components: one for rapid retrieval of genomic coordinates for large lists of genes and another for managing Tagger output files that result from a batch query using genomic coordinates. Starting with a list of official National Center for Biotechnology Information gene symbols, the tool supports queries of several human genome-build versions, concatenation or separation of overlapping genes, and rules for flanking regions that allow the addition of sequences adjacent to gene coordinates. Multiple choices are available for the amount of flanking sequences added, and rules can be stacked to vary the flanking regions by gene length. The output file can be reviewed in text or spreadsheet formats and is configured for uploading to the Tagger query interface. After receiving compressed Tagger results files, the tool supports automated merging of the user’s annotated gene query lists with the corresponding Tagger results.

Assembly of Target Gene Panel
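The flanking rules and the handling of overlapping genes described above for the Tagger Batch Assistant can be pictured with a short Python sketch. This is not the tool's source code: the function names, the rule format (a list of length thresholds checked in order), and the example coordinates are hypothetical, and the 15 kb default flank is borrowed from the SNP selection step in Materials and Methods.

```python
def flank_for(gene_length, rules, default=15_000):
    """Pick a flank size (bp) from stacked rules.

    rules: list of (max_gene_length_bp, flank_bp) pairs, checked in order;
    the first rule whose length limit covers the gene wins.
    """
    for max_length, flank in rules:
        if gene_length <= max_length:
            return flank
    return default


def add_flanks(genes, rules):
    """genes: list of (symbol, chrom, start, end) with 1-based inclusive coordinates."""
    flanked = []
    for symbol, chrom, start, end in genes:
        flank = flank_for(end - start + 1, rules)
        flanked.append((symbol, chrom, max(1, start - flank), end + flank))
    return flanked


def concatenate_overlaps(regions):
    """Merge regions on the same chromosome whose coordinates overlap."""
    merged = []
    for symbol, chrom, start, end in sorted(regions, key=lambda r: (r[1], r[2])):
        if merged and merged[-1][1] == chrom and start <= merged[-1][3]:
            prev_symbol, _, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_symbol + "+" + symbol, chrom, prev_start, max(prev_end, end))
        else:
            merged.append((symbol, chrom, start, end))
    return merged
```

For example, two hypothetical genes 20 kb apart on the same chromosome become one query region once 15 kb flanks are added to each, which is the situation the "Genetic locus overlaps" column of Table 1 records; skipping `concatenate_overlaps` corresponds to keeping overlapping genes separate.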

Project investigators identified 298 genes in pathways for which genetically mandated differential interactions with environmental factors leading to lung cancer were deemed to be biologically plausible. These pathways included those supporting or mediating carcinogen effects (i.e., nitrosamine and PAH activation and detoxification), oxidative stress, DNA damage repair, inflammation or immune-system monitoring, estrogen and other steroid hormone processes, nicotine addiction/smoking behavior, and folate metabolism. Target genes were chosen by examining previous literature, established molecular pathways, and gene interactions and sequence polymorphisms known to affect the functions of genes involved in lung tumor oncogenesis or responses to environmental factors that may impact lung cancer (Table 1). Confirmatory DAVID annotation analyses were performed on the final gene list to summarize the categories represented from Online Mendelian Inheritance in Man (OMIM) Disease, Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway, Gene Ontology (GO) Molecular Function, and GO Biological Process databases (Supplemental Table 1). As expected, the final target panel was confirmed as being enriched for genes associated with risk for lung cancer, folate-sensitive phenotypes, hormone synthesis and signaling, oxidative stress responses, DNA repair, detoxification and metabolism of complex molecules, and apoptosis. Cross-category annotation indicates that the panel is coincidentally enriched for genes involved in schizophrenia, trichothiodystrophy, myocardial infarction, reproductive development, and various neurological processes.

Comparison of Pairwise and Multimarker Tagger Analyses

With the use of dbSNPs for the CEU and YRI populations, Tagger analysis was performed initially to predict marker HapTag SNPs that cover polymorphisms with minimum minor allele frequency (MAF) of 5% and then repeated for MAF >1%. Two Tagger algorithms were compared: pairwise modeling, in which a HapTag marker reports its own genotype and predicts the genotype of one linked SNP, and “aggressive” multimarker modeling, in which the combined genotypes of one to three HapTags report the local haplotype and predict the genotype(s) of one or more linked SNPs.13,15,16 The resulting number of HapTags calculated for each gene is shown in Table 1. At MAF >1%, pairwise modeling produced a g/h ratio of 1.92 (g = measured + predicted genotypes; h = HapTag markers), and multimarker modeling resulted in 2.38 g/h for the same number of genotypes.

Genotyping Array Design and Assay Performance
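The g/h comparison above can be made concrete with a small Python calculation. This is only a sketch of the arithmetic implied by the definitions in the text (g = measured + predicted genotypes, h = HapTag markers); the function names and the toy counts are hypothetical, and the derived saving is not a figure reported by the authors.

```python
def genotypes_per_haptag(haptags, predicted):
    """g/h, where each HapTag contributes one measured genotype."""
    return (haptags + predicted) / haptags


def marker_saving(ratio_pairwise, ratio_multimarker):
    """Fraction of HapTags saved by multimarker tagging for a fixed number of genotypes g.

    For the same g, h = g / ratio, so h_multi / h_pair = ratio_pairwise / ratio_multimarker.
    """
    return 1 - ratio_pairwise / ratio_multimarker


# Toy counts: 100 HapTags predicting 92 further SNPs reproduce the pairwise ratio.
pairwise_ratio = genotypes_per_haptag(100, 92)
saving = marker_saving(1.92, 2.38)
```

Under these definitions, moving from 1.92 to 2.38 g/h corresponds to roughly a fifth fewer HapTag markers for the same number of captured genotypes, which is why the multimarker set was the natural choice when array capacity was limited.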

Tagger multimarker-predicted HapTags with MAF >1% were filtered for iSelect Infinium design scores ≥0.6. TLR5 had no multimarker HapTags, so pairwise HapTags were selected; CCR2, UGT2B15, and GSTT1 had no HapTags, so marker SNPs were manually identified. To avoid exceeding the marker capacity set by our microarray manufacturing budget, the low-priority genes, ALPL, TNS1, GAB1, HHIP, DBH, and PTGIS, were dropped, and HapTag coverage of GPR126 was reduced to 85%. With the addition of specifically targeted functional SNPs and published marker SNPs, 12,890 genomic SNPs were compiled for the final design of the LungCaGxE array with average and median intermarker distances of 5958 bp and 1093 bp, respectively. Sixty-one mitochondrial DNA SNPs were included to target MT-COI, as well as 357 ancestry informative markers for a total of 13,308 genotyping markers on

TABLE 1. Targeted Genes, Annotations, and Number of HapTag SNPs Identified by Pairwise or Multimarker (multi) Algorithms at the Indicated Minor Allele Frequencies (MAFs) and Infinium (Inf) Design Scores ≥0.6

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
A2BP1 |  | 1099 | 1046 | 914 | 710 | 599 | full | ataxin 2 binding protein 1 | tum
ABCB1 |  | 91 | 81 | 62 | 62 | 44 | full | ATP binding cassette, subfamily B [multidrug resistance (MDR)/transporter associated with antigen processing (TAP)], member 1 | tum
ABCC1 |  | 199 | 180 | 162 | 142 | 123 | full | ATP binding cassette, subfamily C [cystic fibrosis transmembrane conductance regulator (CFTR)/multidrug resistance-associated protein (MRP)], member 1 | mut/PAH
ABCC2 |  | 51 | 44 | 33 | 40 | 29 | full | ATP binding cassette, subfamily C (CFTR/MRP), member 2 | fol
ABCC4 |  | 355 | 324 | 274 | 244 | 199 | full | ATP binding cassette, subfamily C (CFTR/MRP), member 4 | mut/PAH
ACHE |  | 12 | 12 | 10 | 12 | 10 | full | ACETYLCHOLINESTERASE (YT BLOOD GROUP) | nic
ADAM19 |  | 110 | 101 | 78 | 81 | 61 | full | a disintegrin and metalloprotease domain (ADAM) metallopeptidase domain 19 (meltrin β) | adh
ADH1B |  | 28 | 26 | 23 | 23 | 20 | full | alcohol dehydrogenase 1B (class I), β polypeptide | tox
ADH7 |  | 49 | 47 | 40 | 40 | 33 | full | alcohol dehydrogenase 7 (class IV), μ or σ polypeptide | tox
ADK |  | 120 | 98 | 80 | 74 | 55 | full | adenosine kinase | inf
AGER | PPT2 | 23 | 22 | 20 | 22 | 17 | full | advanced glycosylation end product-specific receptor | inf/mut
AHCY |  | 17 | 12 | 12 | 8 | 8 | full | S-ADENOSYLHOMOCYSTEINE HYDROLASE | fol
AHR |  | 27 | 27 | 24 | 25 | 21 | full | ARYL-HYDROCARBON RECEPTOR | PAH
AHRR |  | 87 | 79 | 71 | 61 | 51 | full | ARYL-HYDROCARBON RECEPTOR REPRESSOR | PAH
AKAP9 |  | 23 | 21 | 17 | 19 | 15 | full | A kinase (PRKA) anchor protein (yotiao) 9 | tum
AKR1A1 |  | 17 | 13 | 10 | 12 | 7 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER A1 (ALDEHYDE REDUCTASE) | PAH
AKR1B10 |  | 30 | 25 | 24 | 22 | 21 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER B10 (ALDOSE REDUCTASE-LIKE) | onc

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
AKR1C1 | AKR1C2 | 18 | 15 | 15 | 22 | 21 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER C1 [DIHYDRODIOL DEHYDROGENASE 1; 20-α (3-α)-HYDROXYSTEROID DEHYDROGENASE] | nit/PAH/str
AKR1C2 | AKR1C1 | 44 | 38 | 36 | 24 | 23 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER C2 (DIHYDRODIOL DEHYDROGENASE 2; BILE ACID BINDING PROTEIN; 3-α HYDROXYSTEROID DEHYDROGENASE, TYPE III) | nit/PAH/str
AKR1C3 |  | 46 | 36 | 35 | 27 | 26 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER C3 (3-α HYDROXYSTEROID DEHYDROGENASE, TYPE II) | PAH/str
AKT1 |  | 16 | 13 | 12 | 13 | 12 | full | V-AKT MURINE THYMOMA VIRAL ONCOGENE HOMOLOG 1 | onc
AKT2 |  | 15 | 14 | 11 | 13 | 10 | full | v-akt murine thymoma viral oncogene homolog 2 | tum
AKT3 |  | 95 | 89 | 70 | 69 | 54 | full | v-akt murine thymoma viral oncogene homolog 3 (PKB, γ) | tum
ALDH1L1 |  | 115 | 98 | 89 | 67 | 58 | full | aldehyde dehydrogenase 1 family, member L1 | fol
ALOX5 |  | 65 | 57 | 48 | 46 | 36 | full | ARACHIDONATE 5-LIPOXYGENASE | inf/oxs
ALPL |  | 100 | 96 | 88 | 79 | 70 | dropped for capacity | alkaline phosphatase, liver/bone/kidney | 
ANKK1 | DRD2 | 44 | 41 | 33 | 26 | 19 | full | ANKYRIN REPEAT AND KINASE DOMAIN CONTAINING 1 | nic
APEX1 |  | 30 | 29 | 26 | 26 | 22 | full | APEX NUCLEASE (MULTIFUNCTIONAL DNA REPAIR ENZYME) 1 | DNA
AR |  | 15 | 5 | 5 | 12 | 12 | full | ANDROGEN RECEPTOR (DIHYDROTESTOSTERONE RECEPTOR; TESTICULAR FEMINIZATION; SPINAL AND BULBAR MUSCULAR ATROPHY; KENNEDY DISEASE) | str
AREG |  | 31 | 14 | 12 | 10 | 7 | full | AMPHIREGULIN (SCHWANNOMA-DERIVED GROWTH FACTOR) | onc
ARID1A |  | 10 | 10 | 7 | 10 | 7 | full | AT RICH-INTERACTIVE DOMAIN 1A (SWI-LIKE) | onc
ARNT |  | 30 | 29 | 17 | 27 | 15 | full | ARYL-HYDROCARBON RECEPTOR NUCLEAR TRANSLOCATOR | PAH

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
ARNTL |  | 103 | 98 | 88 | 84 | 75 | full | ARYL-HYDROCARBON RECEPTOR NUCLEAR TRANSLOCATOR-LIKE | PAH
ATIC |  | 38 | 34 | 31 | 28 | 25 | full | 5-AMINOIMIDAZOLE-4-CARBOXAMIDE RIBONUCLEOTIDE FORMYLTRANSFERASE/IMP CYCLOHYDROLASE | fol
BCL2 |  | 229 | 224 | 178 | 190 | 144 | full | B CELL chronic lymphocytic leukemia (CLL)/LYMPHOMA 2 | onc
BDNF |  | 33 | 33 | 26 | 30 | 23 | full | BRAIN-DERIVED NEUROTROPHIC FACTOR | nic
BHMT |  | 28 | 28 | 25 | 26 | 22 | full | BETAINE-HOMOCYSTEINE METHYLTRANSFERASE | fol
BIRC5 |  | 21 | 19 | 18 | 17 | 16 | full | BACULOVIRAL inhibitor of apoptosis (IAP) REPEAT-CONTAINING 5 (SURVIVIN) | onc
BMPR1B |  | 210 | 199 | 180 | 143 | 125 | full | BONE MORPHOGENETIC PROTEIN RECEPTOR, TYPE IB | onc
BRCA2 |  | 88 | 82 | 59 | 68 | 47 | full | breast cancer 2, early onset | tum
C3 |  | 63 | 59 | 49 | 55 | 45 | full | COMPLEMENT COMPONENT 3 | inf
CAMKK1 |  | 41 | 36 | 33 | 35 | 32 | full | calcium/calmodulin-dependent protein kinase kinase 1, α | tum
CBR1 |  | 19 | 17 | 15 | 13 | 11 | full | CARBONYL REDUCTASE 1 | nit/PAH
CBR3 |  | 16 | 13 | 11 | 11 | 9 | full | CARBONYL REDUCTASE 3 | nit/PAH
CBS |  | 60 | 57 | 53 | 47 | 43 | full | CYSTATHIONINE-β-SYNTHASE | fol
CCL2 |  | 23 | 20 | 17 | 16 | 13 | full | chemokine (C–C motif) ligand 2 | inf
CCL21 |  | 20 | 18 | 16 | 15 | 14 | full | chemokine (C–C motif) ligand 21 | inf
CCL5 |  | 10 | 9 | 6 | 8 | 5 | full | chemokine (C–C motif) ligand 5 | inf
CCNA2 |  | 22 | 17 | 12 | 13 | 9 | full | cyclin A2 | onc
CCND1 |  | 25 | 25 | 20 | 23 | 18 | full | CYCLIN D1 | onc
CCND3 |  | 83 | 21 | 17 | 18 | 14 | full | cyclin D3 | onc
CCR2 |  | 6 | 0 | 0 | 0 | 0 | nine non-HapTag SNPs | chemokine (C–C motif) receptor 2 | inf
CD47 |  | 46 | 43 | 35 | 36 | 28 | full | CD47 ANTIGEN (RH-RELATED ANTIGEN, INTEGRIN-ASSOCIATED SIGNAL TRANSDUCER) | adh
CDH1 |  | 66 | 56 | 53 | 50 | 47 | full | CADHERIN 1, TYPE 1, E-CADHERIN (EPITHELIAL) | adh

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
CDKN2A |  | 33 | 31 | 27 | 27 | 24 | full | cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4) | onc
CES3 |  | 10 | 8 | 5 | 7 | 4 | full | CARBOXYLESTERASE 3 | tox
CHAT | SLC18A3 | 85 | 98 | 81 | 112 | 93 | full | CHOLINE ACETYLTRANSFERASE | nic
CHRNA3 | CHRNA5 | 2 | 2 | 1 | 22 | 18 | full | CHOLINERGIC RECEPTOR, NICOTINIC, α 3 | nic
CHRNA4 |  | 28 | 25 | 24 | 21 | 20 | full | CHOLINERGIC RECEPTOR, NICOTINIC, α 4 | nic
CHRNA5 | CHRNB4 | 28 | 26 | 21 | 13 | 10 | full | CHOLINERGIC RECEPTOR, NICOTINIC, α 5 | nic
CHRNA7 |  | 89 | 85 | 71 | 76 | 62 | full | CHOLINERGIC RECEPTOR, NICOTINIC, α 7 | nic
CHRNB2 |  | 24 | 23 | 22 | 20 | 18 | full | CHOLINERGIC RECEPTOR, NICOTINIC, β 2 (NEURONAL) | nic
CHRNB3 |  | 21 | 20 | 17 | 16 | 13 | full | CHOLINERGIC RECEPTOR, NICOTINIC, β 3 | nic
CHRNB4 | CHRNA3 | 27 | 26 | 24 | 15 | 14 | full | CHOLINERGIC RECEPTOR, NICOTINIC, β 4 | nic
CHUK |  | 27 | 26 | 20 | 23 | 17 | full | CONSERVED HELIX-LOOP-HELIX UBIQUITOUS KINASE | onc
CLOCK |  | 35 | 29 | 27 | 20 | 18 | full | CLOCK HOMOLOG (MOUSE) | onc
COL3A1 |  | 55 | 54 | 46 | 46 | 37 | full | COLLAGEN, TYPE III, α 1 (EHLERS-DANLOS SYNDROME TYPE IV, AUTOSOMAL DOMINANT) | adh
COMT |  | 56 | 49 | 43 | 39 | 34 | full | CATECHOL-O-METHYLTRANSFERASE | PAH/str
CRP |  | 31 | 31 | 23 | 25 | 18 | full | C-REACTIVE PROTEIN, PENTRAXIN-RELATED | inf
CRY1 |  | 51 | 45 | 38 | 36 | 28 | full | CRYPTOCHROME 1 (PHOTOLYASE-LIKE) | onc
CSNK1D |  | 19 | 15 | 11 | 15 | 11 | full | CASEIN KINASE 1, δ | onc
CTH |  | 38 | 34 | 31 | 30 | 27 | full | CYSTATHIONASE (CYSTATHIONINE γ-LYASE) | fol
CTLA4 |  | 27 | 27 | 24 | 18 | 15 | full | CYTOTOXIC T-LYMPHOCYTE-ASSOCIATED PROTEIN 4 | inf
CTSD |  | 9 | 8 | 7 | 8 | 7 | full | CATHEPSIN D (LYSOSOMAL ASPARTYL PEPTIDASE) | tum
CYP17A1 |  | 24 | 19 | 15 | 16 | 12 | full | CYTOCHROME P450, FAMILY 17, SUBFAMILY A, POLYPEPTIDE 1 | str
CYP19A1 |  | 112 | 104 | 94 | 75 | 65 | full | CYTOCHROME P450, FAMILY 19, SUBFAMILY A, POLYPEPTIDE 1 | str
CYP1A1 | CYP1A2 | 13 | 11 | 11 | 11 | 10 | full | CYTOCHROME P450, FAMILY 1, SUBFAMILY A, POLYPEPTIDE 1 | PAH

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
CYP1A2 | CYP1A1 | 13 | 13 | 10 | 11 | 9 | full | CYTOCHROME P450, FAMILY 1, SUBFAMILY A, POLYPEPTIDE 2 | PAH
CYP1B1 |  | 41 | 38 | 34 | 33 | 28 | full | CYTOCHROME P450, FAMILY 1, SUBFAMILY B, POLYPEPTIDE 1 | PAH
CYP21A2 |  | 15 | 7 | 5 | 7 | 5 | full | CYTOCHROME P450, FAMILY 21, SUBFAMILY A, POLYPEPTIDE 2 | str
CYP2A13 |  | 11 | 7 | 6 | 7 | 6 | full | cytochrome P450, family 2, subfamily A, polypeptide 13 | nit
CYP2A6 |  | 18 | 13 | 11 | 13 | 11 | full | CYTOCHROME P450, FAMILY 2, SUBFAMILY A, POLYPEPTIDE 6 | nit
CYP2B6 |  | 43 | 34 | 31 | 27 | 23 | full | cytochrome P450, family 2, subfamily B, polypeptide 6 | nit/tum
CYP2C9 |  | 32 | 28 | 20 | 26 | 18 | full | CYTOCHROME P450, FAMILY 2, SUBFAMILY C, POLYPEPTIDE 9 | PAH
CYP2D6 |  | 11 | 5 | 5 | 5 | 5 | full | CYTOCHROME P450, FAMILY 2, SUBFAMILY D, POLYPEPTIDE 6 | nit
CYP2E1 |  | 39 | 36 | 35 | 30 | 29 | full | cytochrome P450, family 2, subfamily E, polypeptide 1 | nit
CYP3A4 |  | 30 | 24 | 18 | 23 | 17 | full | CYTOCHROME P450, SUBFAMILY IIIA (NIPHEDIPINE OXIDASE), POLYPEPTIDE 3 | str
DBH |  | 88 | 85 | 75 | 73 | 63 | dropped for capacity | DOPAMINE β-HYDROXYLASE (DOPAMINE β-MONOOXYGENASE) | 
DDX54 |  | 10 | 10 | 9 | 9 | 8 | full | DEAD (ASP-GLU-ALA-ASP) BOX POLYPEPTIDE 54 | onc
DHFR |  | 24 | 19 | 16 | 15 | 12 | full | DIHYDROFOLATE REDUCTASE | fol
DMGDH |  | 75 | 69 | 64 | 58 | 50 | full | dimethylglycine dehydrogenase | fol
DNMT1 |  | 18 | 16 | 12 | 14 | 10 | full | DNA (cytosine-5-)-methyltransferase 1 | fol
DNMT3A |  | 56 | 54 | 44 | 45 | 35 | full | DNA (cytosine-5-)-methyltransferase 3 α | fol
DNMT3B |  | 58 | 56 | 46 | 40 | 30 | full | DNA (cytosine-5-)-methyltransferase 3 β | fol
DRD2 | ANKK1 | 53 | 53 | 45 | 44 | 36 | full | DOPAMINE RECEPTOR D2 | nic
DRD4 |  | 11 | 11 | 10 | 10 | 9 | full | DOPAMINE RECEPTOR D4 | nic
EGF |  | 129 | 115 | 79 | 95 | 61 | full | epidermal growth factor | tum

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
EGFR |  | 212 | 196 | 167 | 162 | 135 | full | EPIDERMAL GROWTH FACTOR RECEPTOR (ERYTHROBLASTIC LEUKEMIA VIRAL (V-ERB-B) ONCOGENE HOMOLOG, AVIAN) | onc
EGLN2 |  | 22 | 20 | 18 | 17 | 15 | full | egl nine homolog 2 | oxs
EPHX1 |  | 36 | 26 | 23 | 26 | 22 | full | EPOXIDE HYDROLASE 1, MICROSOMAL (XENOBIOTIC) | PAH
ERCC1 |  | 20 | 18 | 14 | 16 | 12 | full | EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 1 (INCLUDES OVERLAPPING ANTISENSE SEQUENCE) | DNA
ERCC2 |  | 34 | 33 | 22 | 30 | 20 | full | EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 2 (XERODERMA PIGMENTOSUM D) | DNA
ERCC3 |  | 39 | 36 | 25 | 32 | 20 | full | excision repair cross-complementing rodent repair deficiency, complementation group 3 (xeroderma pigmentosum group B complementing) | DNA
ERCC4 |  | 61 | 55 | 39 | 41 | 26 | full | EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 4 | DNA
ERCC5 |  | 67 | 58 | 42 | 46 | 30 | full | EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 5 [XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP G (COCKAYNE SYNDROME)] | DNA
ERCC6 |  | 64 | 58 | 37 | 46 | 25 | full | excision repair cross-complementing rodent repair deficiency, complementation group 6 | DNA
ERCC8 |  | 38 | 33 | 21 | 28 | 17 | full | EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 8 | DNA
ESR1 |  | 341 | 237 | 173 | 182 | 126 | full | ESTROGEN RECEPTOR 1 | str
ESR2 |  | 68 | 61 | 52 | 43 | 34 | full | ESTROGEN RECEPTOR 2 (ER β) | str
EYA2 |  | 342 | 325 | 293 | 247 | 217 | full | eyes absent homolog 2 (Drosophila) | DNA

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
FAM13A |  | 165 | 156 | 138 | 113 | 93 | full | family with sequence similarity 13, member A | mut
FCGR1A |  | 8 | 1 | 1 | 1 | 1 | full | Fc fragment of IgG, high-affinity Ia, receptor (CD64) | inf
FKBP5 |  | 53 | 30 | 22 | 24 | 18 | full | FK506 BINDING PROTEIN 5 | inf/str
FMO3 |  | 46 | 42 | 34 | 34 | 26 | full | Flavin containing monooxygenase 3 | tox
FOLH1 |  | 22 | 14 | 10 | 13 | 9 | full | FOLATE HYDROLASE (PROSTATE-SPECIFIC MEMBRANE ANTIGEN) 1 | fol
FOLR1 | FOLR2 | 5 | 5 | 4 | 8 | 6 | full | FOLATE RECEPTOR 1 (ADULT) | fol
FOLR2 | FOLR1 | 9 | 7 | 6 | 3 | 3 | full | FOLATE RECEPTOR 2 (FETAL) | fol
FOLR3 |  | 23 | 13 | 12 | 9 | 9 | full | FOLATE RECEPTOR 3 (γ) | fol
FPGS |  | 23 | 20 | 17 | 19 | 16 | full | FOLYLPOLYGLUTAMATE SYNTHASE | fol
FTCD |  | 51 | 50 | 46 | 43 | 39 | full | formiminotransferase cyclodeaminase | fol
GAB1 |  | 72 | 68 | 58 | 56 | 46 | dropped for capacity | growth factor receptor-bound protein 2-associated binding protein 1 | 
GART |  | 22 | 19 | 14 | 19 | 14 | full | PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE, PHOSPHORIBOSYLGLYCINAMIDE SYNTHETASE, PHOSPHORIBOSYLAMINOIMIDAZOLE SYNTHETASE | fol
GATA3 |  | 63 | 62 | 50 | 58 | 45 | full | GATA BINDING PROTEIN 3 | tum
GCLC |  | 85 | 76 | 69 | 66 | 59 | full | GLUTAMATE-CYSTEINE LIGASE, CATALYTIC SUBUNIT | oxs
GCLM |  | 17 | 16 | 15 | 15 | 12 | full | GLUTAMATE-CYSTEINE LIGASE, MODIFIER SUBUNIT | oxs
GDF15 |  | 18 | 16 | 15 | 15 | 14 | full | growth differentiation factor 15 | onc
GGH |  | 27 | 21 | 13 | 16 | 9 | full | γ-glutamyl hydrolase (conjugase, folylpolygammaglutamyl hydrolase) | fol
GHR |  | 93 | 84 | 72 | 66 | 55 | full | growth hormone receptor | tum
GNMT |  | 17 | 16 | 13 | 15 | 12 | full | glycine N-methyltransferase | fol
GPC5 |  | 969 | 889 | 739 | 666 | 528 | full | glypican 5 | mut
GPER |  | 15 | 12 | 11 | 12 | 11 | full | GPCR 30 | str
GPR126 |  | 70 | 63 | 52 | 54 | 44 | 85% | GPCR 126 | adh

TABLE 1 (Continued)

Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
GPX1 |  | 14 | 10 | 8 | 10 | 8 | full | GLUTATHIONE PEROXIDASE 1 | oxs
GPX3 |  | 53 | 51 | 47 | 43 | 39 | full | GLUTATHIONE PEROXIDASE 3 (PLASMA) | oxs
GRPR |  | 13 | 13 | 13 | 22 | 22 | full | GASTRIN-RELEASING PEPTIDE RECEPTOR | onc
GSK3B |  | 71 | 58 | 44 | 48 | 35 | full | glycogen synthase kinase 3 β | tum
GSR |  | 49 | 42 | 32 | 37 | 27 | full | GLUTATHIONE REDUCTASE | oxs
GSS |  | 23 | 20 | 19 | 18 | 17 | full | GLUTATHIONE SYNTHETASE | fol
GSTA1 |  | 16 | 11 | 9 | 10 | 8 | full | GLUTATHIONE S-TRANSFERASE A1 | oxs/PAH
GSTA4 |  | 37 | 34 | 24 | 27 | 17 | full | GLUTATHIONE S-TRANSFERASE A4 | oxs
GSTCD |  | 39 | 37 | 27 | 29 | 18 | full | glutathione S-transferase, C-terminal domain containing | fol
GSTM1 | GSTM2 | 7 | 4 | 6 | 7 | 7 | full | GLUTATHIONE S-TRANSFERASE M1 | oxs/PAH
GSTM2 | GSTM1 | 15 | 12 | 10 | 10 | 8 | full | GLUTATHIONE S-TRANSFERASE M2 (MUSCLE) | oxs/PAH
GSTM5 | GSTM1 | 25 | 14 | 12 | 11 | 10 | full | GLUTATHIONE S-TRANSFERASE M5 | oxs
GSTO1 |  | 33 | 30 | 24 | 28 | 23 | full | GLUTATHIONE S-TRANSFERASE ω 1 | oxs
GSTP1 |  | 21 | 19 | 17 | 16 | 14 | full | GLUTATHIONE S-TRANSFERASE π 1 | oxs/PAH
GSTT1 |  | 1 | 0 | 0 | 0 | 0 | four non-HapTag SNPs | GLUTATHIONE S-TRANSFERASE θ 1 | oxs/PAH
HDC |  | 38 | 36 | 33 | 32 | 29 | full | HISTIDINE DECARBOXYLASE | inf
HELQ |  | 40 | 37 | 31 | 26 | 20 | full | HELQ helicase, POLQ-like | DNA
HFE |  | 24 | 23 | 20 | 19 | 17 | full | HEMOCHROMATOSIS | tox
HGF |  | 83 | 78 | 59 | 64 | 46 | full | HEPATOCYTE GROWTH FACTOR (HEPAPOIETIN A; SCATTER FACTOR) | onc
HHIP |  | 43 | 43 | 37 | 35 | 28 | dropped for capacity | Hedgehog-interacting protein | 
hsa-mir21 |  | 97 |  |  |  |  | full | HOMO SAPIENS MICRORNA 21 | onc
HSD11B1 |  | 39 | 37 | 35 | 29 | 27 | full | HYDROXYSTEROID (11-β) DEHYDROGENASE 1 | nit/str
HSD17B1 |  | 17 | 14 | 12 | 13 | 11 | full | HYDROXYSTEROID (17-β) DEHYDROGENASE 1 | str
HSD17B12 |  | 84 | 77 | 63 | 64 | 51 | full | HYDROXYSTEROID (17-β) DEHYDROGENASE 12 | str
HSD17B3 |  | 63 | 56 | 46 | 45 | 35 | full | HYDROXYSTEROID (17-β) DEHYDROGENASE 3 | str
HSD17B7 |  | 24 | 20 | 19 | 17 | 16 | full | HYDROXYSTEROID (17-β) DEHYDROGENASE 7 | str
HSD3B1 |  | 20 | 16 | 16 | 11 | 11 | full | HYDROXY-δ-5-STEROID DEHYDROGENASE, 3 β- AND STEROID δ-ISOMERASE 1 | str

JOURNAL OF BIOMOLECULAR TECHNIQUES, VOLUME 24, ISSUE 4, DECEMBER 2013

TABLE 1 (Continued)

| Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category (a) |
|---|---|---|---|---|---|---|---|---|---|
| HTR3E | | 24 | 22 | 18 | 20 | 16 | full | 5-hydroxytryptamine (serotonin) receptor 3, family member E | nic |
| HTR4 | | 126 | 115 | 105 | 89 | 81 | full | 5-hydroxytryptamine (serotonin) receptor 4 | nic |
| ICAM1 | | 26 | 25 | 21 | 25 | 21 | full | intercellular adhesion molecule 1 | inf |
| ID2 | | 13 | 13 | 11 | 12 | 11 | full | INHIBITOR OF DNA BINDING 2, DOMINANT NEGATIVE HELIX-LOOP-HELIX PROTEIN | onc |
| IDH1 | | 35 | 31 | 24 | 27 | 22 | full | ISOCITRATE DEHYDROGENASE 1 (NADP+), SOLUBLE | onc |
| IER3 | | 21 | 19 | 18 | 16 | 15 | full | immediate early response 3 | mut/inf |
| IFNG | | 40 | 38 | 30 | 31 | 23 | full | IFN-γ | inf |
| IGF1 | | 65 | 57 | 38 | 48 | 30 | full | insulin-like growth factor 1 (somatomedin C) | onc |
| IGF1R | | 318 | 306 | 272 | 239 | 206 | full | insulin-like growth factor 1 receptor | onc |
| IGF2 | TH | 16 | 14 | 16 | 17 | 16 | full | insulin-like growth factor 2 (somatomedin A) | onc |
| IGF2R | | 161 | 148 | 108 | 123 | 86 | full | insulin-like growth factor 2 receptor | onc |
| IGFBP3 | | 24 | 19 | 18 | 19 | 18 | full | INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 | onc |
| IKBKB | | 25 | 24 | 18 | 20 | 15 | full | inhibitor of κ light polypeptide gene enhancer in B-cells, kinase β | inf |
| IL10 | | 34 | 30 | 24 | 28 | 21 | full | INTERLEUKIN 10 | inf |
| IL1B | | 25 | 24 | 20 | 23 | 19 | full | interleukin 1, β | inf |
| IL1RN | | 66 | 66 | 58 | 54 | 46 | full | interleukin 1 receptor antagonist | inf |
| IL4 | | 49 | 46 | 40 | 39 | 33 | full | INTERLEUKIN 4 | inf |
| IL6 | | 44 | 39 | 32 | 34 | 26 | full | INTERLEUKIN 6 | inf |
| IL8 | | 30 | 28 | 24 | 21 | 17 | full | interleukin 8 | inf |
| IRS1 | | 46 | 43 | 33 | 37 | 28 | full | INSULIN RECEPTOR SUBSTRATE 1 | onc |
| JUN | | 19 | 18 | 16 | 17 | 15 | full | jun oncogene | onc |
| KEAP1 | | 11 | 9 | 9 | 9 | 9 | full | kelch-like ECH-associated protein 1 | oxs |
| KLRK1 | | 27 | 21 | 20 | 19 | 18 | full | KILLER CELL LECTIN-LIKE RECEPTOR SUBFAMILY C, MEMBER 4 | adh |
| KRT18 | | 15 | 10 | 7 | 10 | 7 | full | KERATIN 18 | adh |
| KRT19 | | 19 | 19 | 18 | 17 | 16 | full | KERATIN 19 | adh |
| LTA | TNF | 10 | 9 | 10 | 19 | 10 | full | lymphotoxin α (TNF superfamily, member 1) | inf |
| LTC4S | | 6 | 5 | 5 | 5 | 5 | full | LEUKOTRIENE C4 SYNTHASE | inf |
| MAF | | 48 | 47 | 43 | 45 | 41 | full | v-maf musculoaponeurotic fibrosarcoma oncogene homolog (avian) | onc |


57

79

54

17 34

40

MTHFS

MTR

MTRR

MUTYH MYBL2

MYC

36

36

32

99 64

MSR1 MTAP MT-COI MTHFD1

MTHFR

94 59

62 44

MGST3 MIF

38

17 32

49

64

53

32

56 42

238

19 31

Inf 0.6⫹, pairwise, MAF1%

249

All HapTags: pairwise, MAF1%

MGMT

Genetic locus overlaps

19 40

1

MAOA MDM2

Gene symbol

(Continued)

TABLE

27

14 26

36

49

47

29

29

81 52

53 39

212

19 24

Inf 0.6⫹, pairwise MAF5%

36

16 24

34

42

35

26

31

79 44

48 35

179

22 27

Inf 0.6⫹, multi, MAF1%

24

13 18

22

33

29

22

28

68 37

45 31

152

22 20

Inf 0.6⫹, multi, MAF5%

full

full full

full

full

full

full

full full 61 SNPs full

full full

full

full full

Array coverage: multi HapTags, MAF1%

MONOAMINE OXIDASE A MDM2, TRANSFORMED 3T3 CELL DOUBLE MINUTE 2, P53 BINDING PROTEIN (MOUSE) O-6-METHYLGUANINE-DNA METHYLTRANSFERASE MICROSOMAL GST 3 macrophage migration inhibitory factor (glycosylation-inhibiting factor) macrophage scavenger receptor 1 methylthioadenosine phosphorylase mitochondrially encoded cytochrome c oxidase I METHYLENETETRAHYDROFOLATE DEHYDROGENASE (NADP+ DEPENDENT) 1, METHENYLTETRAHYDROFOLATE CYCLOHYDROLASE, FORMYLTETRAHYDROFOLATE SYNTHETASE 5,10-METHYLENETETRAHYDROFOLATE REDUCTASE (NADPH) 5,10-METHENYLTETRAHYDROFOLATE SYNTHETASE (5-FORMYLTETRAHYDROFOLATE CYCLO-LIGASE) 5-METHYLTETRAHYDROFOLATE-HOMOCYSTEINE METHYLTRANSFERASE 5-METHYLTETRAHYDROFOLATE-HOMOCYSTEINE METHYLTRANSFERASE REDUCTASE MUTY HOMOLOG (Escherichia coli) v-myb myeloblastosis viral oncogene homolog (avian)-like 2 V-MYC MYELOCYTOMATOSIS VIRAL ONCOGENE HOMOLOG (AVIAN)

Gene name

Continued

onc

DNA/oxs tum

fol

fol

fol

fol

inf fol oxs fol

oxs inf

DNA/nit

nic DNA

Target categorya


TABLE 1 (Continued)

| Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category (a) |
|---|---|---|---|---|---|---|---|---|---|
| NAT1 | | 115 | 99 | 86 | 80 | 68 | full | N-ACETYLTRANSFERASE 1 (ARYLAMINE N-ACETYLTRANSFERASE) | fol |
| NAT2 | | 56 | 50 | 43 | 39 | 32 | full | N-ACETYLTRANSFERASE 2 (ARYLAMINE N-ACETYLTRANSFERASE) | PAH |
| NCOA6 | | 19 | 17 | 16 | 14 | 13 | full | NUCLEAR RECEPTOR COACTIVATOR 6 | onc |
| NFE2L2 | | 30 | 30 | 24 | 28 | 21 | full | NUCLEAR FACTOR (ERYTHROID-DERIVED 2)-LIKE 2 | oxs |
| NFKB1 | | 74 | 72 | 54 | 56 | 39 | full | NUCLEAR FACTOR OF κ LIGHT POLYPEPTIDE GENE ENHANCER IN B CELLS 1 (P105) | inf/onc |
| NFKBIA | | 27 | 26 | 25 | 23 | 21 | full | NUCLEAR FACTOR OF κ LIGHT POLYPEPTIDE GENE ENHANCER IN B CELLS INHIBITOR, α | inf/onc |
| NOS1 | | 191 | 172 | 141 | 137 | 108 | full | NITRIC OXIDE SYNTHASE 1 (NEURONAL) | inf |
| NOS2 | | 56 | 53 | 51 | 47 | 44 | full | NITRIC OXIDE SYNTHASE 2A (INDUCIBLE, HEPATOCYTES) | inf |
| NOS3 | | 30 | 29 | 23 | 26 | 20 | full | NITRIC OXIDE SYNTHASE 3 (ENDOTHELIAL CELL) | inf |
| NQO1 | | 20 | 19 | 19 | 15 | 15 | full | NAD(P)H DEHYDROGENASE, QUINONE 1 | oxs/PAH |
| NR1D2 | | 27 | 22 | 18 | 21 | 17 | full | NUCLEAR RECEPTOR SUBFAMILY 1, GROUP D, MEMBER 2 | onc |
| NR3C1 | | 58 | 56 | 43 | 48 | 36 | full | nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor) | onc |
| NRIP1 | | 85 | 33 | 30 | 30 | 26 | full | NUCLEAR RECEPTOR-INTERACTING PROTEIN 1 | onc |
| NSD1 | | 46 | 33 | 26 | 27 | 20 | full | NUCLEAR RECEPTOR BINDING SET DOMAIN PROTEIN 1 | onc |
| OAS1 | | 34 | 30 | 26 | 26 | 22 | full | 2′,5′-oligoadenylate synthetase 1, 40/46 kDa | inf |
| OAS2 | | 62 | 51 | 46 | 38 | 32 | full | 2′,5′-oligoadenylate synthetase 2, 69/71 kDa | inf |
| OGG1 | | 19 | 18 | 14 | 17 | 12 | full | 8-OXOGUANINE DNA GLYCOSYLASE | DNA/oxs |
| OPRM1 | | 209 | 174 | 149 | 139 | 115 | full | OPIOID RECEPTOR, μ 1 | nic |
| PCDH7 | | 179 | 174 | 143 | 135 | 109 | full | protocadherin 7 | adh |
| PER1 | | 14 | 14 | 11 | 14 | 11 | full | PERIOD HOMOLOG 1 (DROSOPHILA) | onc |
| PGR | | 91 | 71 | 56 | 53 | 39 | full | PROGESTERONE RECEPTOR | str |
| PHB2 | | 9 | 9 | 8 | 9 | 8 | full | PROHIBITIN 2 | adh/str |
| PID1 | | 234 | 221 | 200 | 173 | 153 | full | Phosphotyrosine-interaction domain containing 1 | inf |


150

PPARGC1B

73

29

13

23

27

49 40

PTGS1

PTGS2

RELA

RERGL

RNASEL

SELE SERPINA3

18 20 30 65

17 19 39 20 67 113

POLH POLI POLK POLL PON1 PPARG

AGER

134

PLEKHA6

PPT2 PTCH1 PTEN PTGIS

57

All HapTags: pairwise, MAF1%

PLA2G6

Genetic locus overlaps

47

1

PIK3CG

Gene symbol

(Continued)

TABLE

45 39

26

19

13

24

70

18 20 28 59

141

14 18 35 19 66 107

126

55

45

Inf 0.6⫹, pairwise, MAF1%

32 37

23

15

11

19

54

14 20 20 47

132

12 10 26 15 59 80

120

43

36

Inf 0.6⫹, pairwise MAF5%

38 37

25

17

12

19

61

15 18 27 57

112

14 18 34 18 56 83

90

42

31

Inf 0.6⫹, multi, MAF1%

25 34

22

13

10

14

47

13 18 18 44

102

11 10 25 13 48 60

85

30

22

Inf 0.6⫹, multi, MAF5%

full full

full

full

full

full

full full full dropped for capacity full

full

full full full full full full

full

full

full

Array coverage: multi HapTags, MAF1%

PG-endoperoxide synthase 1 (PG G/H synthase and cyclooxygenase) PG-ENDOPEROXIDE SYNTHASE 2 (PG G/H SYNTHASE AND COX) v-rel reticuloendotheliosis viral oncogene homolog A (avian) RAS-like, estrogen-regulated, growth inhibitor (RERG)/RAS-like ribonuclease L (2′,5′-oligoisoadenylate synthetase-dependent) selectin E SERPIN PEPTIDASE INHIBITOR, CLADE A (α-1 ANTIPROTEINASE, ANTITRYPSIN), MEMBER 3

phosphoinositide-3-kinase, catalytic, γ polypeptide phospholipase A2, group VI (cytosolic, calcium-independent) pleckstrin homology domain containing, family A member 6 POLYMERASE (DNA-DIRECTED), η POLYMERASE (DNA-DIRECTED) ι POLYMERASE (DNA-DIRECTED) κ POLYMERASE (DNA-DIRECTED), λ paraoxonase 1 PEROXISOME PROLIFERATIVE ACTIVATED RECEPTOR, γ PEROXISOME PROLIFERATIVE ACTIVATED RECEPTOR, γ, COACTIVATOR 1, β palmitoyl-protein thioesterase 2 patched homolog 1 phosphatase and tensin homolog PG I2 (prostacyclin) synthase

Gene name

Continued

inf adh/onc

inf

str

onc

infl/oxs

inf

tox tum onc

onc

DNA DNA DNA DNA tox onc

nic

tum

onc

Target categorya


31

46

55

58

17

17

25 30 14

54

28

41

32

SLC19A1

SLC5A7

SLC6A3

SLC7A5

SOD1

SOD2

SOD3 STC2 SULT1A1

SULT1E1

SULT2A1

TCN2

TEF

16

SLC18A3

CHAT

8

All HapTags: pairwise, MAF1%

SHMT2

Genetic locus overlaps

29

1

SHMT1

Gene symbol

(Continued)

TABLE


21

40

26

49

21 27 10

15

15

51

47

45

29

32

6

23

Inf 0.6⫹, pairwise, MAF1%

20

38

24

28

18 26 7

13

14

46

43

36

26

36

6

20

Inf 0.6⫹, pairwise MAF5%

18

33

21

43

20 24 10

13

15

44

41

38

25

6

6

18

Inf 0.6⫹, multi, MAF1%

17

29

18

22

16 23 7

11

14

40

37

29

22

10

6

14

Inf 0.6⫹, multi, MAF5%

full

full

full

full

full full full

full

full

full

full

full

full

full

full

full

Array coverage: multi HapTags, MAF1%

SERINE HYDROXYMETHYLTRANSFERASE 1 (SOLUBLE) SERINE HYDROXYMETHYLTRANSFERASE 2 (MITOCHONDRIAL) SOLUTE CARRIER FAMILY 18 (VESICULAR ACETYLCHOLINE), MEMBER 3 SOLUTE CARRIER FAMILY 19 (FOLATE TRANSPORTER), MEMBER 1 SOLUTE CARRIER FAMILY 5 (CHOLINE TRANSPORTER), MEMBER 7 SOLUTE CARRIER FAMILY 6 (NEUROTRANSMITTER TRANSPORTER, DOPAMINE), MEMBER 3 SOLUTE CARRIER FAMILY 7 (CATIONIC AMINO ACID TRANSPORTER, Y+ SYSTEM), MEMBER 5 SUPEROXIDE DISMUTASE 1, SOLUBLE [AMYOTROPHIC LATERAL SCLEROSIS 1 (ADULT)] SUPEROXIDE DISMUTASE 2, MITOCHONDRIAL SUPEROXIDE DISMUTASE 3, EXTRACELLULAR STANNIOCALCIN 2 SULFOTRANSFERASE FAMILY, CYTOSOLIC, 1A, PHENOL-PREFERRING, MEMBER 1 SULFOTRANSFERASE FAMILY 1E, ESTROGEN-PREFERRING, MEMBER 1 SULFOTRANSFERASE FAMILY, CYTOSOLIC, 2A, DEHYDROEPIANDROSTERONE (DHEA)-PREFERRING, MEMBER 1 TRANSCOBALAMIN II; MACROCYTIC ANEMIA THYROTROPHIC EMBRYONIC FACTOR

Gene name

Continued

onc

fol

str

str

inf onc PAH

oxs

inf

onc

nic

nic

fol

nic

fol

fol

Target categorya


UGT1A8

UGT1A1

UGT1A8

21

18

UGT2B10

UGT2B11

151

25 31 70

TP53BP1 TYMS UGT1A1

32 20 154

19

TLR1 LTA

TP53

TLR6 TNF TNS1

28 594 20 35 23 58 14

TH THSD4 TLR1 TLR10 TLR2 TLR4 TLR5

TLR6

37

TGFBR1

IGF2

46 136 19

All HapTags: pairwise, MAF1%

TFF3 TGFA TGFB1

Genetic locus overlaps

72

1

TFF1

Gene symbol

(Continued)

TABLE

10

10

113

23 28 58

17

12 17 151

26 558 20 30 20 56 12

34

44 129 19

64

Inf 0.6⫹, pairwise, MAF1%

9

9

78

18 26 46

17

9 6 133

20 498 17 24 20 43 12

24

39 107 17

57

Inf 0.6⫹, pairwise MAF5%

10

9

79

20 24 53

15

9 4 135

20 428 23 21 19 50 0

29

39 109 18

49

Inf 0.6⫹, multi, MAF1%

9

8

53

15 22 37

15

6 3 114

18 364 19 17 19 38 0

19

35 87 16

43

Inf 0.6⫹, multi, MAF5%

full

full

full

full full full

full full full full full full 13 pairwise HapTags full full dropped for capacity full

full

full full full

full

Array coverage: multi HapTags, MAF1%

TUMOR PROTEIN P53 (LI-FRAUMENI SYNDROME) tumor protein p53 binding protein 1 THYMIDYLATE SYNTHETASE UDP glucuronosyltransferase 1 family, polypeptide A cluster UDP GLUCURONOSYLTRANSFERASE 1 FAMILY, POLYPEPTIDE A8 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B10 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B11

Toll-like receptor 6 tumor necrosis factor tensin 1

TREFOIL FACTOR 1 (BREAST CANCER, ESTROGEN-INDUCIBLE SEQUENCE EXPRESSED IN) TREFOIL FACTOR 3 (INTESTINAL) TRANSFORMING GROWTH FACTOR, α TRANSFORMING GROWTH FACTOR, β 1 (CAMURATI-ENGELMANN DISEASE) TRANSFORMING GROWTH FACTOR, β RECEPTOR I (ACTIVIN A RECEPTOR TYPE II-LIKE KINASE, 53 KDA) TYROSINE HYDROXYLASE thrombospondin, type I, domain containing 4 Toll-like receptor 1 Toll-like receptor 10 Toll-like receptor 2 Toll-like receptor 4 Toll-like receptor 5

Gene name

Continued

nit/PAH

nit/PAH

PAH

onc fol PAH

onc

inf inf

nic adh/inf inf inf inf inf inf

onc

onc onc onc

onc

Target categorya


21 69 45 15 73 25 34 61 47 134

UGT2B7

VCAM1 VEGFA VEGFB VEGFC XIAP XPA

XPC

XRCC1

XRCC4 15,961

120

46

59

65 45 15 71 18 31

17

24

4

10

0

Inf 0.6⫹, pairwise, MAF1%

13,474

94

30

42

45 36 15 63 18 25

15

22

4

7

0

Inf 0.6⫹, pairwise MAF5%

12,926

94

42

49

63 43 12 54 18 26

13

18

4

10

0

Inf 0.6⫹, multi, MAF1%

10,511

68

27

34

42 33 12 45 18 20

11

16

4

7

0

Inf 0.6⫹, multi, MAF5%

Count:

full

full

full

full full full full full full

full

full

full

nine nonHapTag SNPs full

Array coverage: multi HapTags, MAF1%

UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B15 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B17 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B28 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B4 UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B7 vascular cell adhesion molecule 1 VASCULAR ENDOTHELIAL GROWTH FACTOR A VASCULAR ENDOTHELIAL GROWTH FACTOR B VASCULAR ENDOTHELIAL GROWTH FACTOR C BACULOVIRAL IAP REPEAT-CONTAINING 4 XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP A XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP C X-RAY REPAIR COMPLEMENTING DEFECTIVE REPAIR IN CHINESE HAMSTER CELLS 1 X-ray repair complementing defective repair in Chinese hamster cells 4 298

Gene name

DNA

DNA

DNA

adh/inf onc inf inf onc DNA

PAH

nit/PAH

nit/PAH

nit/PAH

nit/PAH

Target categorya

adh, Adhesion molecules; DNA, repair of DNA damage; fol, folate transport and metabolism; inf, inflammatory signaling and processes or immune regulation; mut, mutagenic processes; nic, nicotine addiction and smoking behavior; nit, tobacco-specific nitrosamine (in particular, NNK) activation and detoxification; onc, oncogenesis; oxs, oxidative stress; str, steroid hormone metabolism and signaling; tox, other toxin or toxicity; tum, risk for lung cancer or related tumors.

17,797

31

UGT2B4

Sum:

8

UGT2B28

a

14

All HapTags: pairwise, MAF1%

UGT2B17

Genetic locus overlaps 1

1

UGT2B15

Gene symbol

(Continued)

TABLE


FIGURE 1

Distribution of assay conversion rates for SNPs in various design categories. Assays were assigned to Infinium design score bins equal to or less than the indicated values. The percent of all assays in a bin that successfully generated genotypes (unambiguous SNP allele calls in at least 95% of DNA samples) is plotted for Infinium-eligible SNPs in the Illumina database (black bars), mitochondrial DNA SNPs (gray bars), and SNPs uploaded as custom sequences (dashed bars). The number of SNPs in each bin, as a percent of total SNPs in each category, is plotted with square line markers for Infinium database SNPs, circles for mitochondrial, and X for custom sequences.
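The calculation described in the Figure 1 legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis code: the function name `summarize_conversion`, the input layout (one tuple of category, design score, and per-assay call rate), and the bin edges are hypothetical. Only the 95% call-rate criterion and the rule that an assay belongs to the bin whose value is equal to or greater than its design score come from the legend.

```python
from bisect import bisect_left
from collections import defaultdict

CALL_RATE_THRESHOLD = 0.95                   # criterion stated in the Figure 1 legend
BIN_UPPER_EDGES = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical bin edges, for illustration


def summarize_conversion(assays, edges=BIN_UPPER_EDGES,
                         threshold=CALL_RATE_THRESHOLD):
    """Summarize assay conversion per (category, design-score bin).

    assays: iterable of (category, design_score, call_rate), where call_rate
    is the fraction of DNA samples with an unambiguous allele call.
    Returns {(category, bin_edge): {"pct_converted", "pct_of_category"}}.
    """
    n_assays = defaultdict(int)      # assays per (category, bin)
    n_converted = defaultdict(int)   # converted assays per (category, bin)
    n_category = defaultdict(int)    # assays per category

    for category, design_score, call_rate in assays:
        if not 0.0 <= design_score <= edges[-1]:
            raise ValueError(f"design score out of range: {design_score}")
        # Smallest bin edge that is >= the design score.
        edge = edges[bisect_left(edges, design_score)]
        key = (category, edge)
        n_assays[key] += 1
        n_category[category] += 1
        if call_rate >= threshold:
            n_converted[key] += 1

    return {
        key: {
            "pct_converted": 100.0 * n_converted[key] / n,
            "pct_of_category": 100.0 * n / n_category[key[0]],
        }
        for key, n in n_assays.items()
    }


# Made-up example records (not data from the study).
example = [
    ("database", 0.95, 0.999),
    ("database", 0.90, 0.970),
    ("database", 0.55, 0.900),
    ("custom", 0.70, 0.960),
    ("custom", 0.65, 0.500),
]
for key, stats in sorted(summarize_conversion(example).items()):
    print(key, stats)
```

In this sketch, `pct_converted` corresponds to the bars in Figure 1 and `pct_of_category` to the line markers. Keeping the three SNP categories as separate keys mirrors the separate series plotted for database, mitochondrial, and custom-sequence SNPs.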

the array. All markers and their sequences, coordinates, and targeted genes are provided in Supplemental Table 2. Genotyping assays were performed on 1873 DNA samples from lung-cancer patients and controls using LungCaGxE microarrays. Forty-seven samples had a SNP assay call rate <99.0%. If these samples are excluded, SNP assays with an Infinium design score of at least 0.6 produced unambiguous genotype calls in 99.03% of the attempted reactions (Fig. 1). Targeted functional SNPs with a design score <0.6 generated genotype calls in 84.96% of the attempted reactions; the average genotyping rate for SNPs with recognized rs numbers in the Illumina database was 99.09%, whereas the rate for SNPs submitted as custom sequences was 96.16% (design score >0.6 in both sets).

DISCUSSION

The advancement of array-based SNP genotyping technologies has led to genome-wide association studies (GWAS), in which genetic markers distributed evenly throughout the genome [17] (or covering predicted haplotypes throughout the genome [14]) are tested for statistically significant association with a phenotype. Arrays offer advantages for GWAS over current deep-sequencing methods, including lower cost, faster assay turnaround and sample throughput, and easier data processing. However, the success of proxy markers depends on linkage to causal but unmeasured genetic variants, and even the highest capacity arrays of over 5 million SNPs may not cover rare variants or diverse populations well. Whole-genome or exome sequencing directly detects causal variants and polymorphism types beyond bi-allelic single nucleotides and does not rely on linked markers for statistical analysis. Whether deployed on SNP arrays or deep sequencing platforms, the primary concern for whole-genome assays is statistical power. Rare variants, multiple causes for the same phenotype, intergenic and multigene effects, and genetically mandated differential interactions between genes and environmental variables can all combine with multiple testing correction requirements to drive study population sizes to thousands or tens of thousands of subjects to adequately power GWAS [18–21]. Projects of this scale are an expensive proposition for arrays and would be extremely costly with deep sequencing even at the as-yet unattained goal of $1000/genome.

Comprehensive genotyping of targeted genes, by arrays or sequencing, takes advantage of high multiplex assay capacities to saturate targets with genetic markers. Hence, array data are less reliant on capturing a single, important linked marker while retaining rapid sample throughputs, and sequencing costs and efficiency are improved by focusing on a subset of genes rather than the whole genome. Depending on the size of the target panel and degree of saturation desired, custom arrays or sequencing can ease multiple testing penalties and reduce study population sizes necessary to achieve statistical power. Of course, the critical issue for this strategy is choosing which genes to assay. For the LungCaGxE panel, we chose genes involved in pathways relevant to responses to environmental stressors and saturated the resulting target panel with genetic markers as well as previously demonstrated functional and disease-associated variants.

The Illumina design score, although generally predictive of positive assay performance, underestimated the LungCaGxE genotype success rate achieved for Infinium-eligible tagSNPs and custom SNPs from the nuclear genome. The design scores were somewhat less positively predictive (i.e., further underestimated) of genotyping rates achieved for mitochondrial genome SNPs, which performed well over a wide range of design scores. The relatively high success rates for assays with design scores <0.6 indicate that for future targeted genotyping projects, failure to meet this overly stringent standard cutoff should not necessarily disqualify an assay if the specific SNP in question is important for the study goals.

In summary, the investigator tasked with designing a custom-targeted genotyping assay must balance several considerations. Given that the platform’s multiplex capacity is often dictated by the project’s budget, the investigator must select the marker types, thresholds for number of genes targeted, and MAF cutoffs that will provide the most efficient use of available assay resources. Several iterations of empirical design are usually needed to assess the impact of these parameters, and this process is aided by a streamlined bioinformatics workflow. Tagger Batch Assistant helps automate the retrieval of genetic coordinates for requested genes, managing genome build versions and providing an output format that easily interfaces with Tagger for marker prediction. The resulting Tagger files are then automatically processed to connect markers with the user’s upstream gene annotations. We used this tool to optimize the LungCaGxE design through multiple versions, preserving sensitivity for marker MAFs as low as 1%, while reducing the number of SNPs required by using the Tagger multimarker haplotyping algorithm. This array enables rapid, cost-effective, and comprehensive genotyping of a panel of genes important for exploring genetic factors in lung cancer and the environmental influences that impact those factors.

ACKNOWLEDGMENTS

This work was funded by grant PA4100038714 from the Pennsylvania Department of Health and U.S. National Institutes of Health–National Institute of Environmental Health Sciences grant 5 P30 ES 013508-06 for the Center of Excellence in Environmental Toxicology. We thank David McGain and Kathakali Addya (Penn Molecular Profiling Facility) and Cecilia Kim (Children’s Hospital of Philadelphia Center for Applied Genomics) for technical assistance.

DISCLOSURE

The authors have no associations or sources of financial support that pose a conflict of interest for conducting or interpreting the work presented in this manuscript.

REFERENCES

1. American Cancer Society. Cancer Facts & Figures 2013. Atlanta, GA, USA: American Cancer Society, 2013 (http://www.cancer.org/acs/groups/content/@epidemiologysurveilance/documents/document/acspc-036845.pdf).
2. Cassidy A, Duffy SW, Myles JP, Liloglou T, Field JK. Lung cancer risk prediction: a tool for early detection. Int J Cancer 2007;120:1–6.
3. Ihsan R, Chauhan PS, Mishra AK, et al. Multiple analytical approaches reveal distinct gene-environment interactions in smokers and non-smokers in lung cancer. PLoS One 2011;6:e29431.
4. Thomas L, Doyle LA, Edelman MJ. Lung cancer in women: emerging differences in epidemiology, biology, and therapy. Chest 2005;128:370–381.
5. Braithwaite KL, Rabbitts PH. Multi-step evolution of lung cancer. Semin Cancer Biol 1999;9:255–265.
6. Bach PB, Kattan MW, Thornquist MD, et al. Variations in lung cancer risk among smokers. J Natl Cancer Inst 2003;95:470–478.
7. Bilello KS, Murin S, Matthay RA. Epidemiology, etiology, and prevention of lung cancer. Clin Chest Med 2002;23:1–25.
8. Liu G, Zhou W, Christiani DC. Molecular epidemiology of non-small cell lung cancer. Semin Respir Crit Care Med 2005;26:265–272.
9. Taioli E. Gene-environment interaction in tobacco-related cancers. Carcinogenesis 2008;29:1467–1474.
10. Gustafson AM, Soldi R, Anderlind C, et al. Airway PI3K pathway activation is an early and reversible event in lung cancer development. Sci Transl Med 2010;2:26ra25.
11. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44–57.
12. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37:1–13.
13. De Bakker PI, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet 2005;37:1217–1223.
14. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16:1136–1148.
15. Goode EL, Fridley BL, Sun Z, et al. Comparison of tagging single-nucleotide polymorphism methods in association analyses. BMC Proc 2007;1(Suppl 1):S6.
16. Nam MH, Won HH, Lee KA, Kim JW. Effectiveness of in silico tagSNP selection methods: virtual analysis of the genotypes of pharmacogenetic genes. Pharmacogenomics 2007;8:1347–1357.
17. Matsuzaki H, Dong S, Loi H, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004;1:109–111.
18. Becker T, Herold C, Meesters C, Mattheisen M, Baur MP. Significance levels in genome-wide interaction analysis (GWIA). Ann Hum Genet 2011;75:29–35.
19. Park JH, Wacholder S, Gail MH, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 2010;42:570–575.
20. Sale MM, Mychaleckyj JC, Chen WM. Planning and executing a genome wide association study (GWAS). Methods Mol Biol 2009;590:403–418.
21. Spencer CC, Su Z, Donnelly P, Marchini J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 2009;5:e1000477.
