ARTICLE
Development of a Genotyping Microarray for Studying the Role of Gene-Environment Interactions in Risk for Lung Cancer
Don A. Baldwin,1,2 Christopher P. Sarnowski,3 Sabrina A. Reddy,3 Ian A. Blair,2,4,5 Margie Clapper,6 Philip Lazarus,7 Mingyao Li,8 Joshua E. Muscat,9 Trevor M. Penning,2,4 Anil Vachani,2,10 and Alexander S. Whitehead2,4
1Pathonomics LLC, Philadelphia, Pennsylvania 19104, USA; 2Center of Excellence in Environmental Toxicology, 3Penn Molecular Profiling Facility, Departments of 4Pharmacology and 10Medicine, and Centers for 5Cancer Pharmacology and 8Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; 6Cancer Prevention and Control Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA; 7Department of Pharmaceutical Sciences, Washington State University, Spokane, Washington 99210, USA; and 9Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania 17033, USA
A microarray (LungCaGxE), based on Illumina BeadChip technology, was developed for high-resolution genotyping of genes that are candidates for involvement in environmentally driven aspects of lung cancer oncogenesis and/or tumor growth. The iterative array design process illustrates techniques for managing large panels of candidate genes and optimizing marker selection, aided by a new bioinformatics pipeline component, Tagger Batch Assistant. The LungCaGxE platform targets 298 genes and the proximal genetic regions in which they are located, using ~13,000 DNA single nucleotide polymorphisms (SNPs), which include haplotype linkage markers with a minimum allele frequency of 1% and additional specifically targeted SNPs, for which published reports have indicated functional consequences or associations with lung cancer or other smoking-related diseases. The overall assay conversion rate was 98.9%; 99.0% of markers with a minimum Illumina design score of 0.6 successfully generated allele calls using genomic DNA from a study population of 1873 lung-cancer patients and controls.
KEY WORDS: genetic association, environmental exposures, Tagger Batch Assistant, LungCaGxE
INTRODUCTION
Lung cancer is the leading cause of cancer death for men and women in the United States. The American Cancer Society estimates that in 2013, there will be 228,190 new cases (118,080 in men; 110,110 in women) and 159,480 deaths.1 Many patients present with disease that is too advanced to treat successfully with surgery and the current portfolio of drugs. Identification of those at highest risk of disease would facilitate earlier diagnosis and therapeutic intervention, with consequent reduced mortality and longer survival time. Risk identification techniques would also support preventative screening and targeted interventions, such as smoking-cessation programs, leading to reduced incidence. Given the huge number of new lung cancer cases that occur each year, the impact of such interventions would be significant even if applicable only to an etiologically distinct subset of all cases.
As the majority (up to 90%) of lung cancers occur in smokers, but only a minority (~10%) of smokers get the disease,2 it is likely that significant gene/phenotype/environment interactions exist.3 Although tobacco smoke is the main etiologic agent,4 the long latency between exposure and disease, the multistep nature of neoplastic transformation,5 and the low 10-year lung-cancer risk (15%) of elderly, lifelong heavy smokers6 suggest that factors other than tobacco-associated carcinogens modify risk. These likely include environmental variables,7 functional genetic polymorphisms,8,9 and differential expression of genes that interact with such variables.10
Strategies to identify associations between genetic variants and diseases, such as lung cancer, include genotyping sequence polymorphisms that are distributed throughout the genome or that occur in specifically targeted genes of interest. Compared with genome-wide approaches, genotyping a focused set of single nucleotide polymorphisms (SNPs) for high-resolution haplotype mapping boosts analysis power for identifying single gene and gene family effects with statistical significance. Targeted, redundant genotyping of candidate genes further enables the analysis of additional variables, such as environmental factors, without a requirement to sample extremely large populations. However, designing a genotyping assay that adequately covers each candidate gene with a sufficiently large number of markers poses a challenge for this approach, especially when interrogating many genes in parallel. Standard genome-wide platforms, such as Affymetrix (Affymetrix, Santa Clara, CA, USA) or Illumina microarrays (Illumina, San Diego, CA, USA), provide predesigned collections of genotyping assays but rarely include enough markers to approach saturation of any given target gene. Microarray vendors therefore offer custom manufacturing options to allow researchers to create comprehensive panels of assays that satisfy the requirements of high-resolution genotyping. We describe a process that connects publicly available SNP catalogs with commercial assay design interfaces, using a new bioinformatics tool that assists with the management of large collections of genes and their haplotype-tagging (HapTag) SNPs. This process was used to demonstrate the rapid and iterative design of a custom genotyping microarray for studying lung cancer.
ADDRESS CORRESPONDENCE TO: Alexander S. Whitehead, Perelman School of Medicine, University of Pennsylvania, Room 1311, BRB II/III, 421 Curie Blvd., Philadelphia, PA 19104-6160, USA (Phone: 215-898-2332; E-mail: [email protected]). doi: 10.7171/jbt.13-2404-004
Journal of Biomolecular Techniques 24:198–217 © 2013 ABRF
BALDWIN ET AL. / CUSTOM GENOTYPING FOR LUNG CANCER
MATERIALS AND METHODS
Target Selection
Investigators in our consortium contributed prioritized lists of genes potentially relevant to environmentally mediated biological processes leading to lung cancer. Candidate genes included modulators of and checkpoints within pathways hypothesized to respond to tobacco toxins and environmental factors that may promote oncogenesis, as well as those that may act in concert with environmental factors to support tumor survival, progression, and growth. These genes fell into broad categories, including tobacco-specific nitrosamine [particularly nitrosaminoketone (NNK)] activation and detoxification, polycyclic aromatic hydrocarbon (PAH) activation and detoxification, repair of NNK- and PAH-attributable DNA damage, oxidative stress, inflammatory signaling and processes of immune regulation, steroid hormone metabolism and signaling, nicotine addiction and smoking behavior, and folate transport and metabolism. For each individual gene, HapTag SNPs and genetic polymorphisms known to affect function or shown previously to be associated with risk for lung cancer were sought and, if found, incorporated into the final microarray design. Target sources included extensive literature searches, Ingenuity Pathway Analysis (http://www.ingenuity.com), Database for Annotation, Visualization, and Integrated Discovery (DAVID) Bioinformatics Resources,11,12 and ongoing research in investigators’ laboratories.
JOURNAL OF BIOMOLECULAR TECHNIQUES, VOLUME 24, ISSUE 4, DECEMBER 2013
SNP Selection
All targeted genes/chromosomal regions were uploaded to the Assay Design Tool (http://support.illumina.com/tools.ilmn; Illumina) for retrieval of all iSelect Infinium database SNPs within each targeted region, as well as from 15 kb sequences flanking the gene-boundary coordinates. Known polymorphisms from the target-selection phase were also queried by reference SNP (rs) number from database of SNPs (dbSNP; http://www.ncbi.nlm.nih.gov/snp/), or uploaded as custom sequences if polymorphisms were unrecognized by iSelect or not annotated in dbSNP. Independently, the targeted genes and regions were analyzed using Tagger (http://www.broadinstitute.org/mpg/tagger/server.html, and International HapMap Project haplotype mapping databases therein)13 with the following parameters in all combinations: HapMap panels of Utah (U.S.A.) residents of northern and western European ancestry (CEU) and residents of Ibadan, Nigeria of Yoruban ancestry (YRI); SNP minimum allele frequencies 5% and 1%; Tagger mode pairwise and aggressive; SNP r² threshold 0.8; and default settings for all other parameters. The Tagger online interface does not support batch queries using gene symbols, so we created the Tagger Batch Assistant (http://www.bioinformatics.upenn.edu/tagtool/batch.html) as a tool for automated processing of large query lists and management and formatting of the output data. The retrieved iSelect SNPs were filtered to retain markers with an Infinium design score ≥0.6 (a 60% probability of conversion, i.e., successful genotyping assays for that SNP), and the subset corresponding to selected HapTag SNPs from Tagger was identified. No Infinium design score limits were imposed on functional SNPs from the target selection phase. A panel of 357 ancestry informative markers was included (http://support.illumina.com/array/array_kits/dna_test_panel.ilmn, Illumina catalog GT-17-222).
Genotyping
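As a minimal sketch of the marker-selection logic described under SNP Selection (an illustration under stated assumptions, not the pipeline the authors ran), the Python below enumerates the Tagger parameter combinations named in the text and applies the design-score filter. The function names, input structures, and rs identifiers are hypothetical; only the parameter values (CEU/YRI, 5% and 1%, pairwise/aggressive, r² 0.8, design score 0.6) come from the text.

```python
from itertools import product

# Parameter values stated in the text; Tagger was run on all combinations.
PANELS = ("CEU", "YRI")
MIN_ALLELE_FREQS = (0.05, 0.01)
MODES = ("pairwise", "aggressive")
R2_THRESHOLD = 0.8
MIN_DESIGN_SCORE = 0.6


def tagger_runs():
    """Return one settings dict per panel/frequency/mode combination."""
    return [
        {"panel": panel, "min_maf": maf, "mode": mode, "r2": R2_THRESHOLD}
        for panel, maf, mode in product(PANELS, MIN_ALLELE_FREQS, MODES)
    ]


def select_markers(design_scores, haptags, functional_snps):
    """Choose array markers from hypothetical inputs.

    design_scores   -- dict mapping SNP ID to Infinium design score
    haptags         -- set of SNP IDs chosen by Tagger
    functional_snps -- set of SNP IDs from the target-selection phase
    """
    designable = {
        snp for snp, score in design_scores.items() if score >= MIN_DESIGN_SCORE
    }
    # HapTags must pass the score filter; functional SNPs are kept regardless,
    # mirroring the statement that no score limit was imposed on them.
    return (designable & haptags) | functional_snps


if __name__ == "__main__":
    print(len(tagger_runs()))  # 2 panels x 2 frequencies x 2 modes
    scores = {"rs_a": 0.9, "rs_b": 0.4, "rs_c": 0.7}  # made-up IDs and scores
    print(sorted(select_markers(scores, {"rs_a", "rs_b"}, {"rs_d"})))
```

Keeping the functional SNPs outside the score filter is the one design decision worth noting: it trades some risk of failed assays for guaranteed inclusion of markers with prior biological evidence.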
DNA was extracted from whole-blood samples or buffy-coat fractions using Chemagic DNA purification kits and a Chemagen Magnetic Separation Module I robot (Chemagen/PerkinElmer, Baesweiler, Germany). DNA quality-control checks included A260/280 and E-Gel electrophoresis (Invitrogen, Life Technologies, Grand Island, NY, USA), and DNA samples (n = 1873) were normalized to 50 ng/µL and used for genotyping assays. Genotyping was conducted using the iScan system (Illumina), according to the manufacturer’s protocols.14 The Infinium assay amplified and fragmented 200 ng genomic DNA, which was then hybridized to our LungCaGxE iSelect HD Custom BeadChips containing 24 arrays/BeadChip and 13,308 assayed SNPs/array. Four negative control (no DNA) arrays were processed, and 43 samples were processed twice to check assay consistency. Data from scanned BeadChips were processed in Illumina GenomeStudio for signal quantitation, quality control, and genotype assignments.
The research described does not involve animals. Blood samples from human subjects were collected with their informed consent for research use, including genetic analyses. This study was approved by Institutional Review Boards at the University of Pennsylvania, Pennsylvania State University, Temple University, and Fox Chase Cancer Center.
RESULTS
Tagger Batch Assistant
The online Tagger Batch Assistant tool was designed with two components: one for rapid retrieval of genomic coordinates for large lists of genes and another for managing Tagger output files that result from a batch query using genomic coordinates. Starting with a list of official National Center for Biotechnology Information gene symbols, the tool supports queries of several human genome-build versions, concatenation or separation of overlapping genes, and rules for flanking regions that allow the addition of sequences adjacent to gene coordinates. Multiple choices are available for the amount of flanking sequences added, and rules can be stacked to vary the flanking regions by gene length. The output file can be reviewed in text or spreadsheet formats and is configured for uploading to the Tagger query interface. After receiving compressed Tagger results files, the tool supports automated merging of the user’s annotated gene query lists with the corresponding Tagger results.
Assembly of Target Gene Panel
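The flanking-region and overlap handling attributed to the Tagger Batch Assistant in the preceding section can be illustrated with a short Python sketch. This is not the tool's code, which is not shown in the article; the `Region` type, the rule values in `FLANK_RULES`, the gene labels, and the coordinates are all hypothetical and chosen only to show how length-dependent flanking rules might be stacked and how overlapping query regions might be concatenated.

```python
from dataclasses import dataclass


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    genes: tuple


# Hypothetical stacked rules: (maximum gene length in bp, flank in bp).
# The first rule whose length limit covers the gene is applied.
FLANK_RULES = ((10_000, 20_000), (100_000, 15_000), (float("inf"), 10_000))


def flank_for(length, rules=FLANK_RULES):
    """Return the flank size for a gene of the given length."""
    for max_length, flank in rules:
        if length <= max_length:
            return flank
    raise ValueError("rules must end with an unbounded entry")


def add_flanks(region, rules=FLANK_RULES):
    """Extend a region on both sides, clamping the start at zero."""
    flank = flank_for(region.end - region.start, rules)
    return Region(region.chrom, max(0, region.start - flank),
                  region.end + flank, region.genes)


def merge_overlaps(regions):
    """Concatenate regions on the same chromosome that overlap."""
    merged = []
    for region in sorted(regions, key=lambda r: (r.chrom, r.start)):
        last = merged[-1] if merged else None
        if last and last.chrom == region.chrom and region.start <= last.end:
            merged[-1] = Region(last.chrom, last.start,
                                max(last.end, region.end),
                                last.genes + region.genes)
        else:
            merged.append(region)
    return merged


if __name__ == "__main__":
    queries = [
        Region("chr1", 100_000, 105_000, ("GENE_A",)),  # made-up coordinates
        Region("chr1", 130_000, 180_000, ("GENE_B",)),
    ]
    for region in merge_overlaps([add_flanks(q) for q in queries]):
        print(region)
```

Merging after flanking, rather than before, matters here: two genes that do not overlap on their own can still produce overlapping query regions once flanks are added, and a single concatenated region avoids tagging the shared interval twice.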
Project investigators identified 298 genes in pathways for which genetically mandated differential interactions with environmental factors leading to lung cancer were deemed to be biologically plausible. These pathways included those supporting or mediating carcinogen effects (i.e., nitrosamine and PAH activation and detoxification), oxidative stress, DNA damage repair, inflammation or immune-system monitoring, estrogen and other steroid hormone processes, nicotine addiction/smoking behavior, and folate metabolism. Target genes were chosen by examining previous literature, established molecular pathways, and gene interactions and sequence polymorphisms known to affect the functions of genes involved in lung tumor oncogenesis or responses to environmental factors that may impact lung cancer (Table 1). Confirmatory DAVID annotation analyses were performed on the final gene list to summarize the categories represented from Online Mendelian Inheritance in Man (OMIM) Disease, Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway, Gene Ontology (GO) Molecular Function, and GO Biological Process databases (Supplemental Table 1). As expected, the final target panel was confirmed as being enriched for genes associated with risk for lung cancer, folate-sensitive phenotypes, hormone synthesis and signaling, oxidative stress responses, DNA repair, detoxification and metabolism of complex molecules, and apoptosis. Cross-category annotation indicates that the panel is coincidentally enriched for genes involved in schizophrenia, trichothiodystrophy, myocardial infarction, reproductive development, and various neurological processes.
Comparison of Pairwise and Multimarker Tagger Analyses
Using dbSNPs for the CEU and YRI populations, Tagger analysis was performed initially to predict marker HapTag SNPs that cover polymorphisms with minimum minor allele frequency (MAF) of 5% and then repeated for MAF >1%. Two Tagger algorithms were compared: pairwise modeling, in which a HapTag marker reports its own genotype and predicts the genotype of one linked SNP, and “aggressive” multimarker modeling, in which the combined genotypes of one to three HapTags report the local haplotype and predict the genotype(s) of one or more linked SNPs.13,15,16 The resulting number of HapTags calculated for each gene is shown in Table 1. At MAF >1%, pairwise modeling produced a g/h ratio of 1.92 (g = measured + predicted genotypes; h = HapTag markers), and multimarker modeling resulted in 2.38 g/h for the same number of genotypes.
Genotyping Array Design and Assay Performance
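The g/h ratios reported for the two Tagger modes bear directly on array design, because for a fixed number of genotypes a higher ratio means fewer HapTags to place on the array. The sketch below is a worked illustration only: the function names and the example counts are hypothetical, and the single figure taken from the article is the pair of ratios, 1.92 and 2.38.

```python
def genotypes_per_haptag(measured, predicted, haptags):
    """g/h ratio: (measured + predicted genotypes) per HapTag marker."""
    if haptags <= 0:
        raise ValueError("haptags must be positive")
    return (measured + predicted) / haptags


def relative_haptag_saving(ratio_low, ratio_high):
    """Fraction of HapTags saved at ratio_high versus ratio_low for equal g.

    With g fixed, h = g / ratio, so h_high / h_low = ratio_low / ratio_high.
    """
    if ratio_low <= 0 or ratio_high <= 0:
        raise ValueError("ratios must be positive")
    return 1.0 - ratio_low / ratio_high


if __name__ == "__main__":
    # Made-up counts that happen to reproduce a ratio of 1.92.
    print(genotypes_per_haptag(measured=100, predicted=92, haptags=100))
    # Ratios reported in the text for pairwise and multimarker modeling.
    saving = relative_haptag_saving(1.92, 2.38)
    print(f"{saving:.1%} fewer HapTags for the same number of genotypes")
```

Under these assumptions the multimarker ratio corresponds to roughly one-fifth fewer HapTags than pairwise tagging for the same genotype coverage.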
Tagger multimarker-predicted HapTags with MAF >1% were filtered for iSelect Infinium design scores ≥0.6. TLR5 had no multimarker HapTags, so pairwise HapTags were selected; CCR2, UGT2B15, and GSTT1 had no HapTags, so marker SNPs were manually identified. To avoid exceeding the marker capacity set by our microarray manufacturing budget, the low-priority genes, ALPL, TNS1, GAB1, HHIP, DBH, and PTGIS, were dropped, and HapTag coverage of GPR126 was reduced to 85%. With the addition of specifically targeted functional SNPs and published marker SNPs, 12,890 genomic SNPs were compiled for the final design of the LungCaGxE array with average and median intermarker distances of 5958 bp and 1093 bp, respectively. Sixty-one mitochondrial DNA SNPs were included to target MT-COI, as well as 357 ancestry informative markers for a total of 13,308 genotyping markers on
TABLE 1
Targeted Genes, Annotations, and Number of HapTag SNPs Identified by Pairwise or Multimarker (multi) Algorithms at the Indicated Minor Allele Frequencies (MAFs) and Infinium (Inf) Design Scores ≥0.6
Gene symbol | Genetic locus overlaps | All HapTags: pairwise, MAF1% | Inf 0.6+, pairwise, MAF1% | Inf 0.6+, pairwise, MAF5% | Inf 0.6+, multi, MAF1% | Inf 0.6+, multi, MAF5% | Array coverage: multi HapTags, MAF1% | Gene name | Target category^a
A2BP1 |  | 1099 | 1046 | 914 | 710 | 599 | full | ataxin 2 binding protein 1 | tum
ABCB1 |  | 91 | 81 | 62 | 62 | 44 | full | ATP binding cassette, subfamily B [multidrug resistance (MDR)/transporter associated with antigen processing (TAP)], member 1 | tum
ABCC1 |  | 199 | 180 | 162 | 142 | 123 | full | ATP binding cassette, subfamily C [cystic fibrosis transmembrane conductance regulator (CFTR)/multidrug resistance-associated protein (MRP)], member 1 | mut/PAH
ABCC2 |  | 51 | 44 | 33 | 40 | 29 | full | ATP binding cassette, subfamily C (CFTR/MRP), member 2 | fol
ABCC4 |  | 355 | 324 | 274 | 244 | 199 | full | ATP binding cassette, subfamily C (CFTR/MRP), member 4 | mut/PAH
ACHE |  | 12 | 12 | 10 | 12 | 10 | full | ACETYLCHOLINESTERASE (YT BLOOD GROUP) | nic
ADAM19 |  | 110 | 101 | 78 | 81 | 61 | full | a disintegrin and metalloprotease domain (ADAM) metallopeptidase domain 19 (meltrin β) | adh
ADH1B |  | 28 | 26 | 23 | 23 | 20 | full | alcohol dehydrogenase 1B (class I), β polypeptide | tox
ADH7 |  | 49 | 47 | 40 | 40 | 33 | full | alcohol dehydrogenase 7 (class IV), μ or σ polypeptide | tox
ADK |  | 120 | 98 | 80 | 74 | 55 | full | adenosine kinase | inf
AGER | PPT2 | 23 | 22 | 20 | 22 | 17 | full | advanced glycosylation end product-specific receptor | inf/mut
AHCY |  | 17 | 12 | 12 | 8 | 8 | full | S-ADENOSYLHOMOCYSTEINE HYDROLASE | fol
AHR |  | 27 | 27 | 24 | 25 | 21 | full | ARYL-HYDROCARBON RECEPTOR | PAH
AHRR |  | 87 | 79 | 71 | 61 | 51 | full | ARYL-HYDROCARBON RECEPTOR REPRESSOR | PAH
AKAP9 |  | 23 | 21 | 17 | 19 | 15 | full | A kinase (PRKA) anchor protein (yotiao) 9 | tum
AKR1A1 |  | 17 | 13 | 10 | 12 | 7 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER A1 (ALDEHYDE REDUCTASE) | PAH
AKR1B10 |  | 30 | 25 | 24 | 22 | 21 | full | ALDO-KETO REDUCTASE FAMILY 1, MEMBER B10 (ALDOSE REDUCTASE-LIKE) | onc
Continued
15
31
10 30
AR
AREG
ARID1A ARNT
44
30
DRD2
APEX1
ANKK1
115 65 100
15 95
AKT2 AKT3
ALDH1L1 ALOX5 ALPL
16
46
AKT1
AKR1C2
18
44
AKR1C1
AKR1C1
All HapTags: pairwise, MAF1%
AKR1C3
AKR1C2
Gene symbol
1
Genetic locus overlaps
(Continued)
TABLE
10 29
14
5
29
41
98 57 96
14 89
13
36
38
15
Inf 0.6+, pairwise, MAF1%
7 17
12
5
26
33
89 48 88
11 70
12
35
36
15
Inf 0.6+, pairwise, MAF5%
10 27
10
12
26
26
67 46 79
13 69
13
27
24
22
Inf 0.6+, multi, MAF1%
7 15
7
12
22
19
58 36 70
10 54
12
26
23
21
Inf 0.6+, multi, MAF5%
full full
full
full
full
full full dropped for capacity full
full full
full
full
full
full
Array coverage: multi HapTags, MAF1%
ANKYRIN REPEAT AND KINASE DOMAIN CONTAINING 1 APEX NUCLEASE (MULTIFUNCTIONAL DNA REPAIR ENZYME) 1 ANDROGEN RECEPTOR (DIHYDROTESTOSTERONE RECEPTOR; TESTICULAR FEMINIZATION; SPINAL AND BULBAR MUSCULAR ATROPHY; KENNEDY DISEASE) AMPHIREGULIN (SCHWANNOMA-DERIVED GROWTH FACTOR) AT RICH-INTERACTIVE DOMAIN 1A (SWI-LIKE) ARYL-HYDROCARBON RECEPTOR NUCLEAR TRANSLOCATOR
ALDO-KETO REDUCTASE FAMILY 1, MEMBER C1 [DIHYDRODIOL DEHYDROGENASE 1; 20-α (3-α)-HYDROXYSTEROID DEHYDROGENASE] ALDO-KETO REDUCTASE FAMILY 1, MEMBER C2 (DIHYDRODIOL DEHYDROGENASE 2; BILE ACID BINDING PROTEIN; 3-α HYDROXYSTEROID DEHYDROGENASE, TYPE III) ALDO-KETO REDUCTASE FAMILY 1, MEMBER C3 (3-α HYDROXYSTEROID DEHYDROGENASE, TYPE II) V-AKT MURINE THYMOMA VIRAL ONCOGENE HOMOLOG 1 v-akt murine thymoma viral oncogene homolog 2 v-akt murine thymoma viral oncogene homolog 3 (PKB, γ) aldehyde dehydrogenase 1 family, member L1 ARACHIDONATE 5-LIPOXYGENASE alkaline phosphatase, liver/bone/kidney
Gene name
Continued
onc PAH
onc
str
DNA
nic
fol inf/oxs
tum tum
onc
PAH/str
nit/PAH/ str
nit/PAH/ str
Target categorya
88 63 41
19 16 60 23 20 10 22 25 83 6
46
66
CBR1 CBR3 CBS CCL2 CCL21 CCL5 CCNA2 CCND1 CCND3 CCR2
CD47
CDH1
21
BIRC5
BRCA2 C3 CAMKK1
33 28
BDNF BHMT
210
229
BCL2
BMPR1B
38
All HapTags: pairwise, MAF1%
ATIC
Genetic locus overlaps
103
1
ARNTL
Gene symbol
(Continued)
TABLE
56
43
17 13 57 20 18 9 17 25 21 0
82 59 36
199
19
33 28
224
34
98
Inf 0.6+, pairwise, MAF1%
53
35
15 11 53 17 16 6 12 20 17 0
59 49 33
180
18
26 25
178
31
88
Inf 0.6+, pairwise, MAF5%
50
36
13 11 47 16 15 8 13 23 18 0
68 55 35
143
17
30 26
190
28
84
Inf 0.6+, multi, MAF1%
47
28
11 9 43 13 14 5 9 18 14 0
47 45 32
125
16
23 22
144
25
75
Inf 0.6+, multi, MAF5%
full
full full full full full full full full full nine non-HapTag SNPs full
full full full
full
full
full full
full
full
full
Array coverage: multi HapTags, MAF1%
CD47 ANTIGEN (RH-RELATED ANTIGEN, INTEGRIN-ASSOCIATED SIGNAL TRANSDUCER) CADHERIN 1, TYPE 1, E-CADHERIN (EPITHELIAL)
ARYL-HYDROCARBON RECEPTOR NUCLEAR TRANSLOCATOR-LIKE 5-AMINOIMIDAZOLE-4-CARBOXAMIDE RIBONUCLEOTIDE FORMYLTRANSFERASE/ IMP CYCLOHYDROLASE B CELL chronic lymphocytic leukemia (CLL)/ LYMPHOMA 2 BRAIN-DERIVED NEUROTROPHIC FACTOR BETAINE-HOMOCYSTEINE METHYLTRANSFERASE BACULOVIRAL inhibitor of apoptosis (IAP) REPEAT-CONTAINING 5 (SURVIVIN) BONE MORPHOGENETIC PROTEIN RECEPTOR, TYPE IB breast cancer 2, early onset COMPLEMENT COMPONENT 3 calcium/calmodulin-dependent protein kinase kinase 1, ␣ CARBONYL REDUCTASE 1 CARBONYL REDUCTASE 3 CYSTATHIONINE--SYNTHASE chemokine (C–C motif) ligand 2 chemokine (C–C motif) ligand 21 chemokine (C–C motif) ligand cyclin A2 CYCLIN D1 cyclin D3 chemokine (C–C motif) receptor 2
Gene name
Continued
adh
adh
nit/PAH nit/PAH fol inf inf inf onc onc onc inf
tum inf tum
onc
onc
nic fol
onc
fol
PAH
Target categorya
24
112
13
CYP17A1
CYP19A1
CYP1A1
9
56 31 51 19 38 27
COMT CRP CRY1 CSNK1D CTH CTLA4
CTSD
35 55
CYP1A2
CHRNA3
21 27 27
CLOCK COL3A1
CHRNB3 CHRNB4 CHUK
CHRNB4
10 85 2 28 28 89 24
SLC18A3 CHRNA5
All HapTags: pairwise, MAF1%
CES3 CHAT CHRNA3 CHRNA4 CHRNA5 CHRNA7 CHRNB2
Genetic locus overlaps
33
1
CDKN2A
Gene symbol
(Continued)
TABLE
11
104
19
8
49 31 45 15 34 27
29 54
20 26 26
8 98 2 25 26 85 23
31
Inf 0.6+, pairwise, MAF1%
11
94
15
7
43 23 38 11 31 24
27 46
17 24 20
5 81 1 24 21 71 22
27
Inf 0.6+, pairwise, MAF5%
11
75
16
8
39 25 36 15 30 18
20 46
16 15 23
7 112 22 21 13 76 20
27
Inf 0.6+, multi, MAF1%
10
65
12
7
34 18 28 11 27 15
18 37
13 14 17
4 93 18 20 10 62 18
24
Inf 0.6+, multi, MAF5%
full
full
full
full
full full full full full full
full full
full full full
full full full full full full full
full
Array coverage: multi HapTags, MAF1% cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4) CARBOXYLESTERASE 3 CHOLINE ACETYLTRANSFERASE CHOLINERGIC RECEPTOR, NICOTINIC, ␣ 3 CHOLINERGIC RECEPTOR, NICOTINIC, ␣ 4 CHOLINERGIC RECEPTOR, NICOTINIC, ␣ 5 CHOLINERGIC RECEPTOR, NICOTINIC, ␣ 7 CHOLINERGIC RECEPTOR, NICOTINIC,  2 (NEURONAL) CHOLINERGIC RECEPTOR, NICOTINIC,  3 CHOLINERGIC RECEPTOR, NICOTINIC,  4 CONSERVED HELIX-LOOP-HELIX UBIQUITOUS KINASE CLOCK HOMOLOG (MOUSE) COLLAGEN, TYPE III, ␣ 1 (EHLERS-DANLOS SYNDROME TYPE IV, AUTOSOMAL DOMINANT) CATECHOL-O-METHYLTRANSFERASE C-REACTIVE PROTEIN, PENTRAXIN-RELATED CRYPTOCHROME 1 (PHOTOLYASE-LIKE) CASEIN KINASE 1, ␦ CYSTATHIONASE (CYSTATHIONINE ␥-LYASE) CYTOTOXIC T-LYMPHOCYTE-ASSOCIATED PROTEIN 4 CATHEPSIN D (LYSOSOMAL ASPARTYL PEPTIDASE) CYTOCHROME P450, FAMILY 17, SUBFAMILY A, POLYPEPTIDE 1 CYTOCHROME P450, FAMILY 19, SUBFAMILY A, POLYPEPTIDE 1 CYTOCHROME P450, FAMILY 1, SUBFAMILY A, POLYPEPTIDE 1
Gene name
Continued
PAH
str
str
tum
PAH/str inf onc onc fol inf
onc adh
nic nic onc
tox nic nic nic nic nic nic
onc
Target categorya
11
18
43
32
11
39
30
88
10
24 75 18 56 58 53 11 129
CYP2A13
CYP2A6
CYP2B6
CYP2C9
CYP2D6
CYP2E1
CYP3A4
DBH
DDX54
DHFR DMGDH DNMT1 DNMT3A DNMT3B DRD2 DRD4 EGF
ANKK1
15
CYP21A2
13
41
CYP1A2
All HapTags: pairwise, MAF1%
CYP1B1
CYP1A1
Gene symbol
1
Genetic locus overlaps
(Continued)
TABLE
19 69 16 54 56 53 11 115
10
85
24
36
5
28
34
13
7
7
38
13
Inf 0.6+, pairwise, MAF1%
16 64 12 44 46 45 10 79
9
75
18
35
5
20
31
11
6
5
34
10
Inf 0.6+, pairwise, MAF5%
15 58 14 45 40 44 10 95
9
73
23
30
5
26
27
13
7
7
33
11
Inf 0.6+, multi, MAF1%
12 50 10 35 30 36 9 61
8
63
17
29
5
18
23
11
6
5
28
9
Inf 0.6+, multi, MAF5%
full full full full full full full full
dropped for capacity full
full
full
full
full
full
full
full
full
full
full
Array coverage: multi HapTags, MAF1% CYTOCHROME P450, FAMILY 1, SUBFAMILY A, POLYPEPTIDE 2 CYTOCHROME P450, FAMILY 1, SUBFAMILY B, POLYPEPTIDE 1 CYTOCHROME P450, FAMILY 21, SUBFAMILY A, POLYPEPTIDE 2 cytochrome P450, family 2, subfamily A, polypeptide 13 CYTOCHROME P450, FAMILY 2, SUBFAMILY A, POLYPEPTIDE 6 cytochrome P450, family 2, subfamily B, polypeptide 6 CYTOCHROME P450, FAMILY 2, SUBFAMILY C, POLYPEPTIDE 9 CYTOCHROME P450, FAMILY 2, SUBFAMILY D, POLYPEPTIDE 6 cytochrome P450, family 2, subfamily E, polypeptide 1 CYTOCHROME P450, SUBFAMILY IIIA (NIPHEDIPINE OXIDASE), POLYPEPTIDE 3 DOPAMINE -HYDROXYLASE (DOPAMINE -MONOOXYGENASE) DEAD (ASP-GLU-ALA-ASP) BOX POLYPEPTIDE 54 DIHYDROFOLATE REDUCTASE dimethylglycine dehydrogenase DNA (cytosine-5-)-methyltransferase 1 DNA (cytosine-5-)-methyltransferase 3 ␣ DNA (cytosine-5-)-methyltransferase 3  DOPAMINE RECEPTOR D2 DOPAMINE RECEPTOR D4 epidermal growth factor
Gene name
Continued
fol fol fol fol fol nic nic tum
onc
str
nit
nit
PAH
nit/tum
nit
nit
str
PAH
PAH
Target categorya
22 36
20
34
39
61
67
64
38
341 68 342
ERCC1
ERCC2
ERCC3
ERCC4
ERCC5
ERCC6
ERCC8
ESR1 ESR2 EYA2
All HapTags: pairwise, MAF1%
EGLN2 EPHX1
Genetic locus overlaps
212
1
EGFR
Gene symbol
(Continued)
TABLE
237 61 325
33
58
58
55
36
33
18
20 26
196
Inf 0.6+, pairwise, MAF1%
173 52 293
21
37
42
39
25
22
14
18 23
167
Inf 0.6+, pairwise, MAF5%
182 43 247
28
46
46
41
32
30
16
17 26
162
Inf 0.6+, multi, MAF1%
126 34 217
17
25
30
26
20
20
12
15 22
135
Inf 0.6+, multi, MAF5%
full full full
full
full
full
full
full
full
full
full full
full
Array coverage: multi HapTags, MAF1% EPIDERMAL GROWTH FACTOR RECEPTOR (ERYTHROBLASTIC LEUKEMIA VIRAL (V-ERB-B) ONCOGENE HOMOLOG, AVIAN) egl nine homolog 2 EPOXIDE HYDROLASE 1, MICROSOMAL (XENOBIOTIC) EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 1 (INCLUDES OVERLAPPING ANTISENSE SEQUENCE) EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 2 (XERODERMA PIGMENTOSUM D) excision repair cross-complementing rodent repair deficiency, complementation group 3 (xeroderma pigmentosum group B complementing) EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 4 EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 5 [XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP G (COCKAYNE SYNDROME)] excision repair cross-complementing rodent repair deficiency, complementation group 6 EXCISION REPAIR CROSS-COMPLEMENTING RODENT REPAIR DEFICIENCY, COMPLEMENTATION GROUP 8 ESTROGEN RECEPTOR 1 ESTROGEN RECEPTOR 2 (ER ) eyes absent homolog 2 (Drosophila)
Gene name
Continued
str str DNA
DNA
DNA
DNA
DNA
DNA
DNA
DNA
oxs PAH
onc
Target categorya
22
63 85
17
18 27
93 17 969 15 70
GART
GATA3 GCLC
GCLM
GDF15 GGH
GHR GNMT GPC5 GPER GPR126
5 9 23 23 51 72
FOLR1 FOLR2 FOLR3 FPGS FTCD GAB1
FOLR2 FOLR1
53 46 22
All HapTags: pairwise, MAF1%
FKBP5 FMO3 FOLH1
Genetic locus overlaps
165 8
1
FAM13A FCGR1A
Gene symbol
(Continued)
TABLE
84 16 889 12 63
16 21
16
62 76
19
5 7 13 20 50 68
30 42 14
156 1
Inf 0.6+, pairwise, MAF1%
72 13 739 11 52
15 13
15
50 69
14
4 6 12 17 46 58
22 34 10
138 1
Inf 0.6+, pairwise, MAF5%
66 15 666 12 54
15 16
15
58 66
19
8 3 9 19 43 56
24 34 13
113 1
Inf 0.6+, multi, MAF1%
55 12 528 11 44
14 9
12
45 59
14
6 3 9 16 39 46
18 26 9
93 1
Inf 0.6+, multi, MAF5%
full full full full 85%
full full
full
full full
full full full full full dropped for capacity full
full full full
full full
Array coverage: multi HapTags, MAF1% family with sequence similarity 13, member A Fc fragment of IgG, high-affinity Ia, receptor (CD64) FK506 BINDING PROTEIN 5 Flavin containing monooxygenase 3 FOLATE HYDROLASE (PROSTATE-SPECIFIC MEMBRANE ANTIGEN) 1 FOLATE RECEPTOR 1 (ADULT) FOLATE RECEPTOR 2 (FETAL) FOLATE RECEPTOR 3 (␥) FOLYLPOLYGLUTAMATE SYNTHASE formiminotransferase cyclodeaminase growth factor receptor-bound protein 2-associated binding protein 1 PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE, PHOSPHORIBOSYLGLYCINAMIDE SYNTHETASE, PHOSPHORIBOSYLAMINOIMIDAZOLE SYNTHETASE GATA BINDING PROTEIN 3 GLUTAMATE-CYSTEINE LIGASE, CATALYTIC SUBUNIT GLUTAMATE-CYSTEINE LIGASE, MODIFIER SUBUNIT growth differentiation factor 15 ␥-glutamyl hydrolase (conjugase, folylpolygammaglutamyl hydrolase) growth hormone receptor glycine N-methyltransferase glypican 5 GPCR 30 GPCR 126
Gene name
Continued
tum fol mut str adh
onc fol
oxs
tum oxs
fol
fol fol fol fol fol
inf/str tox fol
mut inf
Target categorya
38 40 24 83
43
97 39 17 84 63 24 20
HDC HELQ HFE HGF
HHIP
hsa-mir21 HSD11B1 HSD17B1 HSD17B12 HSD17B3 HSD17B7 HSD3B1
7 15 25 33 21 1
GSTM2 GSTM1 GSTM1
All HapTags: pairwise, MAF1%
GSTM1 GSTM2 GSTM5 GSTO1 GSTP1 GSTT1
Genetic locus overlaps
14 53 13 71 49 23 16 37 39
1
GPX1 GPX3 GRPR GSK3B GSR GSS GSTA1 GSTA4 GSTCD
Gene symbol
(Continued)
TABLE
37 14 77 56 20 16
43
36 37 23 78
4 12 14 30 19 0
10 51 13 58 42 20 11 34 37
Inf 0.6+, pairwise, MAF1%
35 12 63 46 19 16
37
33 31 20 59
6 10 12 24 17 0
8 47 13 44 32 19 9 24 27
Inf 0.6+, pairwise, MAF5%
29 13 64 45 17 11
35
32 26 19 64
7 10 11 28 16 0
10 43 22 48 37 18 10 27 29
Inf 0.6+, multi, MAF1%
27 11 51 35 16 11
28
29 20 17 46
7 8 10 23 14 0
8 39 22 35 27 17 8 17 18
Inf 0.6+, multi, MAF5%
dropped for capacity full full full full full full full
full full full full full four nonHapTag SNPs full full full full
full full full full full full full full full
Array coverage: multi HapTags, MAF1%
HOMO SAPIENS MICRORNA 21 HYDROXYSTEROID (11-β) DEHYDROGENASE 1 HYDROXYSTEROID (17-β) DEHYDROGENASE 1 HYDROXYSTEROID (17-β) DEHYDROGENASE 12 HYDROXYSTEROID (17-β) DEHYDROGENASE 3 HYDROXYSTEROID (17-β) DEHYDROGENASE 7 HYDROXY-δ-5-STEROID DEHYDROGENASE, 3 β- AND STEROID δ-ISOMERASE 1
HISTIDINE DECARBOXYLASE HELQ helicase, POLQ-like HEMOCHROMATOSIS HEPATOCYTE GROWTH FACTOR (HEPAPOIETIN A; SCATTER FACTOR) Hedgehog-interacting protein
GLUTATHIONE PEROXIDASE 1 GLUTATHIONE PEROXIDASE 3 (PLASMA) GASTRIN-RELEASING PEPTIDE RECEPTOR glycogen synthase kinase 3 β GLUTATHIONE REDUCTASE GLUTATHIONE SYNTHETASE GLUTATHIONE S-TRANSFERASE A1 GLUTATHIONE S-TRANSFERASE A4 glutathione S-transferase, C-terminal domain containing GLUTATHIONE S-TRANSFERASE M1 GLUTATHIONE S-TRANSFERASE M2 (MUSCLE) GLUTATHIONE S-TRANSFERASE M5 GLUTATHIONE S-TRANSFERASE ω 1 GLUTATHIONE S-TRANSFERASE π 1 GLUTATHIONE S-TRANSFERASE θ 1
Gene name
Continued
onc nit/str str str str str str
inf DNA tox onc
oxs/PAH oxs/PAH oxs oxs oxs/PAH oxs/PAH
oxs oxs onc tum oxs fol oxs/PAH oxs fol
Target categorya
34 25 66 49 44 30 46 19 11 27
15 19 10 6 48
IL10 IL1B IL1RN IL4 IL6 IL8 IRS1 JUN KEAP1 KLRK1
KRT18 KRT19 LTA LTC4S MAF
TNF
25
21 40 65 318 16 161 24
IER3 IFNG IGF1 IGF1R IGF2 IGF2R IGFBP3
IKBKB
35
IDH1
TH
126 26 13
All HapTags: pairwise, MAF1%
HTR4 ICAM1 ID2
Genetic locus overlaps
24
1
HTR3E
Gene symbol
(Continued)
TABLE
10 19 9 5 47
30 24 66 46 39 28 43 18 9 21
24
19 38 57 306 14 148 19
31
115 25 13
22
Inf 0.6+, pairwise, MAF1%
7 18 10 5 43
24 20 58 40 32 24 33 16 9 20
18
18 30 38 272 16 108 18
24
105 21 11
18
Inf 0.6+, pairwise, MAF5%
10 17 19 5 45
28 23 54 39 34 21 37 17 9 19
20
16 31 48 239 17 123 19
27
89 25 12
20
Inf 0.6+, multi, MAF1%
7 16 10 5 41
21 19 46 33 26 17 28 15 9 18
15
15 23 30 206 16 86 18
22
81 21 11
16
Inf 0.6+, multi, MAF5%
full full full full full
full full full full full full full full full full
full
full full full full full full full
full
full full full
full
Array coverage: multi HapTags, MAF1% 5-hydroxytryptamine (serotonin) receptor 3, family member E 5-hydroxytryptamine (serotonin) receptor 4 intercellular adhesion molecule 1 INHIBITOR OF DNA BINDING 2, DOMINANT NEGATIVE HELIX-LOOP-HELIX PROTEIN ISOCITRATE DEHYDROGENASE 1 (NADP⫹), SOLUBLE immediate early response 3 IFN-␥ insulin-like growth factor 1 (somatomedin C) insulin-like growth factor 1 receptor insulin-like growth factor 2 (somatomedin A) insulin-like growth factor 2 receptor INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 inhibitor of light polypeptide gene enhancer in B-cells, kinase  INTERLEUKIN 10 interleukin 1,  interleukin 1 receptor antagonist INTERLEUKIN 4 INTERLEUKIN 6 interleukin 8 INSULIN RECEPTOR SUBSTRATE 1 jun oncogene kelch-like ECH-associated protein 1 KILLER CELL LECTIN-LIKE RECEPTOR SUBFAMILY C, MEMBER 4 KERATIN 18 KERATIN 19 lymphotoxin ␣ (TNF superfamily, member 1) LEUKOTRIENE C4 SYNTHASE v-maf musculoaponeurotic fibrosarcoma oncogene homolog (avian)
Gene name
Continued
adh adh inf inf onc
inf inf inf inf inf inf onc onc oxs adh
inf
mut/inf inf onc onc onc onc onc
onc
nic inf onc
nic
Target categorya
BALDWIN ET AL. / CUSTOM GENOTYPING FOR LUNG CANCER
57
79
54
17 34
40
MTHFS
MTR
MTRR
MUTYH MYBL2
MYC
36
36
32
99 64
MSR1 MTAP MT-COI MTHFD1
MTHFR
94 59
62 44
MGST3 MIF
38
17 32
49
64
53
32
56 42
238
19 31
Inf 0.6+, pairwise, MAF1%
249
All HapTags: pairwise, MAF1%
MGMT
Genetic locus overlaps
19 40
1
MAOA MDM2
Gene symbol
(Continued)
TABLE
27
14 26
36
49
47
29
29
81 52
53 39
212
19 24
Inf 0.6+, pairwise, MAF5%
36
16 24
34
42
35
26
31
79 44
48 35
179
22 27
Inf 0.6+, multi, MAF1%
24
13 18
22
33
29
22
28
68 37
45 31
152
22 20
Inf 0.6+, multi, MAF5%
full
full full
full
full
full
full
full full 61 SNPs full
full full
full
full full
Array coverage: multi HapTags, MAF1%
MONOAMINE OXIDASE A
MDM2, TRANSFORMED 3T3 CELL DOUBLE MINUTE 2, P53 BINDING PROTEIN (MOUSE)
O-6-METHYLGUANINE-DNA METHYLTRANSFERASE
MICROSOMAL GST 3
macrophage migration inhibitory factor (glycosylation-inhibiting factor)
macrophage scavenger receptor 1
methylthioadenosine phosphorylase
mitochondrially encoded cytochrome c oxidase I
METHYLENETETRAHYDROFOLATE DEHYDROGENASE (NADP+ DEPENDENT) 1, METHENYLTETRAHYDROFOLATE CYCLOHYDROLASE, FORMYLTETRAHYDROFOLATE SYNTHETASE
5,10-METHYLENETETRAHYDROFOLATE REDUCTASE (NADPH)
5,10-METHENYLTETRAHYDROFOLATE SYNTHETASE (5-FORMYLTETRAHYDROFOLATE CYCLO-LIGASE)
5-METHYLTETRAHYDROFOLATE-HOMOCYSTEINE METHYLTRANSFERASE
5-METHYLTETRAHYDROFOLATE-HOMOCYSTEINE METHYLTRANSFERASE REDUCTASE
MUTY HOMOLOG (Escherichia coli)
v-myb myeloblastosis viral oncogene homolog (avian)-like 2
V-MYC MYELOCYTOMATOSIS VIRAL ONCOGENE HOMOLOG (AVIAN)
Gene name
Continued
onc
DNA/oxs tum
fol
fol
fol
fol
inf fol oxs fol
oxs inf
DNA/nit
nic DNA
Target categorya
56
19 30
74
27
191 56
30
20 27
58
85 46
34 62 19 209 179 14 91 9 234
NCOA6 NFE2L2
NFKB1
NFKBIA
NOS1 NOS2
NOS3
NQO1 NR1D2
NR3C1
NRIP1 NSD1
OAS1 OAS2 OGG1 OPRM1 PCDH7 PER1 PGR PHB2 PID1
All HapTags: pairwise, MAF1%
NAT2
Genetic locus overlaps
115
1
NAT1
Gene symbol
(Continued)
TABLE
30 51 18 174 174 14 71 9 221
33 33
56
19 22
29
172 53
26
72
17 30
50
99
Inf 0.6+, pairwise, MAF1%
26 46 14 149 143 11 56 8 200
30 26
43
19 18
23
141 51
25
54
16 24
43
86
Inf 0.6+, pairwise, MAF5%
26 38 17 139 135 14 53 9 173
30 27
48
15 21
26
137 47
23
56
14 28
39
80
Inf 0.6+, multi, MAF1%
22 32 12 115 109 11 39 8 153
26 20
36
15 17
20
108 44
21
39
13 21
32
68
Inf 0.6+, multi, MAF5%
full full full full full full full full full
full full
full
full full
full
full full
full
full
full full
full
full
Array coverage: multi HapTags, MAF1%
N-ACETYLTRANSFERASE 1 (ARYLAMINE N-ACETYLTRANSFERASE)
N-ACETYLTRANSFERASE 2 (ARYLAMINE N-ACETYLTRANSFERASE)
NUCLEAR RECEPTOR COACTIVATOR 6
NUCLEAR FACTOR (ERYTHROID-DERIVED 2)-LIKE 2
NUCLEAR FACTOR OF κ LIGHT POLYPEPTIDE GENE ENHANCER IN B CELLS 1 (P105)
NUCLEAR FACTOR OF κ LIGHT POLYPEPTIDE GENE ENHANCER IN B CELLS INHIBITOR, α
NITRIC OXIDE SYNTHASE 1 (NEURONAL)
NITRIC OXIDE SYNTHASE 2A (INDUCIBLE, HEPATOCYTES)
NITRIC OXIDE SYNTHASE 3 (ENDOTHELIAL CELL)
NAD(P)H DEHYDROGENASE, QUINONE 1
NUCLEAR RECEPTOR SUBFAMILY 1, GROUP D, MEMBER 2
nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor)
NUCLEAR RECEPTOR-INTERACTING PROTEIN 1
NUCLEAR RECEPTOR BINDING SET DOMAIN PROTEIN 1
2',5'-oligoadenylate synthetase 1, 40/46 kDa
2',5'-oligoadenylate synthetase 2, 69/71 kDa
8-OXOGUANINE DNA GLYCOSYLASE
OPIOID RECEPTOR, μ 1
protocadherin 7
PERIOD HOMOLOG 1 (DROSOPHILA)
PROGESTERONE RECEPTOR
PROHIBITIN 2
Phosphotyrosine-interaction domain containing 1
Gene name
Continued
inf inf DNA/oxs nic adh onc str adh/str inf
onc onc
onc
oxs/PAH onc
inf
inf inf
inf/onc
inf/onc
onc oxs
PAH
fol
Target categorya
150
PPARGC1B
73
29
13
23
27
49 40
PTGS1
PTGS2
RELA
RERGL
RNASEL
SELE SERPINA3
18 20 30 65
17 19 39 20 67 113
POLH POLI POLK POLL PON1 PPARG
AGER
134
PLEKHA6
PPT2 PTCH1 PTEN PTGIS
57
All HapTags: pairwise, MAF1%
PLA2G6
Genetic locus overlaps
47
1
PIK3CG
Gene symbol
(Continued)
TABLE
45 39
26
19
13
24
70
18 20 28 59
141
14 18 35 19 66 107
126
55
45
Inf 0.6+, pairwise, MAF1%
32 37
23
15
11
19
54
14 20 20 47
132
12 10 26 15 59 80
120
43
36
Inf 0.6+, pairwise, MAF5%
38 37
25
17
12
19
61
15 18 27 57
112
14 18 34 18 56 83
90
42
31
Inf 0.6+, multi, MAF1%
25 34
22
13
10
14
47
13 18 18 44
102
11 10 25 13 48 60
85
30
22
Inf 0.6+, multi, MAF5%
full full
full
full
full
full
full full full dropped for capacity full
full
full full full full full full
full
full
full
Array coverage: multi HapTags, MAF1%
phosphoinositide-3-kinase, catalytic, γ polypeptide
phospholipase A2, group VI (cytosolic, calcium-independent)
pleckstrin homology domain containing, family A member 6
POLYMERASE (DNA-DIRECTED), η
POLYMERASE (DNA-DIRECTED) ι
POLYMERASE (DNA-DIRECTED) κ
POLYMERASE (DNA-DIRECTED), λ
paraoxonase 1
PEROXISOME PROLIFERATIVE ACTIVATED RECEPTOR, γ
PEROXISOME PROLIFERATIVE ACTIVATED RECEPTOR, γ, COACTIVATOR 1, β
palmitoyl-protein thioesterase 2
patched homolog 1
phosphatase and tensin homolog
PG I2 (prostacyclin) synthase
PG-endoperoxide synthase 1 (PG G/H synthase and cyclooxygenase)
PG-ENDOPEROXIDE SYNTHASE 2 (PG G/H SYNTHASE AND COX)
v-rel reticuloendotheliosis viral oncogene homolog A (avian)
RAS-like, estrogen-regulated, growth inhibitor (RERG)/RAS-like
ribonuclease L (2',5'-oligoisoadenylate synthetase-dependent)
selectin E
SERPIN PEPTIDASE INHIBITOR, CLADE A (α-1 ANTIPROTEINASE, ANTITRYPSIN), MEMBER 3
Gene name
Continued
inf adh/onc
inf
str
onc
inf/oxs
inf
tox tum onc
onc
DNA DNA DNA DNA tox onc
nic
tum
onc
Target categorya
31
46
55
58
17
17
25 30 14
54
28
41
32
SLC19A1
SLC5A7
SLC6A3
SLC7A5
SOD1
SOD2
SOD3 STC2 SULT1A1
SULT1E1
SULT2A1
TCN2
TEF
16
SLC18A3
CHAT
8
All HapTags: pairwise, MAF1%
SHMT2
Genetic locus overlaps
29
1
SHMT1
Gene symbol
(Continued)
TABLE
21
40
26
49
21 27 10
15
15
51
47
45
29
32
6
23
Inf 0.6+, pairwise, MAF1%
20
38
24
28
18 26 7
13
14
46
43
36
26
36
6
20
Inf 0.6+, pairwise, MAF5%
18
33
21
43
20 24 10
13
15
44
41
38
25
6
6
18
Inf 0.6+, multi, MAF1%
17
29
18
22
16 23 7
11
14
40
37
29
22
10
6
14
Inf 0.6+, multi, MAF5%
full
full
full
full
full full full
full
full
full
full
full
full
full
full
full
Array coverage: multi HapTags, MAF1%
SERINE HYDROXYMETHYLTRANSFERASE 1 (SOLUBLE)
SERINE HYDROXYMETHYLTRANSFERASE 2 (MITOCHONDRIAL)
SOLUTE CARRIER FAMILY 18 (VESICULAR ACETYLCHOLINE), MEMBER 3
SOLUTE CARRIER FAMILY 19 (FOLATE TRANSPORTER), MEMBER 1
SOLUTE CARRIER FAMILY 5 (CHOLINE TRANSPORTER), MEMBER 7
SOLUTE CARRIER FAMILY 6 (NEUROTRANSMITTER TRANSPORTER, DOPAMINE), MEMBER 3
SOLUTE CARRIER FAMILY 7 (CATIONIC AMINO ACID TRANSPORTER, Y+ SYSTEM), MEMBER 5
SUPEROXIDE DISMUTASE 1, SOLUBLE [AMYOTROPHIC LATERAL SCLEROSIS 1 (ADULT)]
SUPEROXIDE DISMUTASE 2, MITOCHONDRIAL
SUPEROXIDE DISMUTASE 3, EXTRACELLULAR
STANNIOCALCIN 2
SULFOTRANSFERASE FAMILY, CYTOSOLIC, 1A, PHENOL-PREFERRING, MEMBER 1
SULFOTRANSFERASE FAMILY 1E, ESTROGEN-PREFERRING, MEMBER 1
SULFOTRANSFERASE FAMILY, CYTOSOLIC, 2A, DEHYDROEPIANDROSTERONE (DHEA)-PREFERRING, MEMBER 1
TRANSCOBALAMIN II; MACROCYTIC ANEMIA
THYROTROPHIC EMBRYONIC FACTOR
Gene name
Continued
onc
fol
str
str
inf onc PAH
oxs
inf
onc
nic
nic
fol
nic
fol
fol
Target categorya
UGT1A8
UGT1A1
UGT1A8
21
18
UGT2B10
UGT2B11
151
25 31 70
TP53BP1 TYMS UGT1A1
32 20 154
19
TLR1 LTA
TP53
TLR6 TNF TNS1
28 594 20 35 23 58 14
TH THSD4 TLR1 TLR10 TLR2 TLR4 TLR5
TLR6
37
TGFBR1
IGF2
46 136 19
All HapTags: pairwise, MAF1%
TFF3 TGFA TGFB1
Genetic locus overlaps
72
1
TFF1
Gene symbol
(Continued)
TABLE
10
10
113
23 28 58
17
12 17 151
26 558 20 30 20 56 12
34
44 129 19
64
Inf 0.6+, pairwise, MAF1%
9
9
78
18 26 46
17
9 6 133
20 498 17 24 20 43 12
24
39 107 17
57
Inf 0.6+, pairwise, MAF5%
10
9
79
20 24 53
15
9 4 135
20 428 23 21 19 50 0
29
39 109 18
49
Inf 0.6+, multi, MAF1%
9
8
53
15 22 37
15
6 3 114
18 364 19 17 19 38 0
19
35 87 16
43
Inf 0.6+, multi, MAF5%
full
full
full
full full full
full full full full full full 13 pairwise HapTags full full dropped for capacity full
full
full full full
full
Array coverage: multi HapTags, MAF1%
TREFOIL FACTOR 1 (BREAST CANCER, ESTROGEN-INDUCIBLE SEQUENCE EXPRESSED IN)
TREFOIL FACTOR 3 (INTESTINAL)
TRANSFORMING GROWTH FACTOR, α
TRANSFORMING GROWTH FACTOR, β 1 (CAMURATI-ENGELMANN DISEASE)
TRANSFORMING GROWTH FACTOR, β RECEPTOR I (ACTIVIN A RECEPTOR TYPE II-LIKE KINASE, 53 KDA)
TYROSINE HYDROXYLASE
thrombospondin, type I, domain containing 4
Toll-like receptor 1
Toll-like receptor 10
Toll-like receptor 2
Toll-like receptor 4
Toll-like receptor 5
Toll-like receptor 6
tumor necrosis factor
tensin 1
TUMOR PROTEIN P53 (LI-FRAUMENI SYNDROME)
tumor protein p53 binding protein 1
THYMIDYLATE SYNTHETASE
UDP glucuronosyltransferase 1 family, polypeptide A cluster
UDP GLUCURONOSYLTRANSFERASE 1 FAMILY, POLYPEPTIDE A8
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B10
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B11
Gene name
Continued
nit/PAH
nit/PAH
PAH
onc fol PAH
onc
inf inf
nic adh/inf inf inf inf inf inf
onc
onc onc onc
onc
Target categorya
21 69 45 15 73 25 34 61 47 134
UGT2B7
VCAM1 VEGFA VEGFB VEGFC XIAP XPA
XPC
XRCC1
XRCC4 15,961
120
46
59
65 45 15 71 18 31
17
24
4
10
0
Inf 0.6+, pairwise, MAF1%
13,474
94
30
42
45 36 15 63 18 25
15
22
4
7
0
Inf 0.6+, pairwise, MAF5%
12,926
94
42
49
63 43 12 54 18 26
13
18
4
10
0
Inf 0.6+, multi, MAF1%
10,511
68
27
34
42 33 12 45 18 20
11
16
4
7
0
Inf 0.6+, multi, MAF5%
Count:
full
full
full
full full full full full full
full
full
full
nine nonHapTag SNPs full
Array coverage: multi HapTags, MAF1%
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B15
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B17
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B28
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B4
UDP GLUCURONOSYLTRANSFERASE 2 FAMILY, POLYPEPTIDE B7
vascular cell adhesion molecule 1
VASCULAR ENDOTHELIAL GROWTH FACTOR A
VASCULAR ENDOTHELIAL GROWTH FACTOR B
VASCULAR ENDOTHELIAL GROWTH FACTOR C
BACULOVIRAL IAP REPEAT-CONTAINING 4
XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP A
XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP C
X-RAY REPAIR COMPLEMENTING DEFECTIVE REPAIR IN CHINESE HAMSTER CELLS 1
X-ray repair complementing defective repair in Chinese hamster cells 4
298
Gene name
DNA
DNA
DNA
adh/inf onc inf inf onc DNA
PAH
nit/PAH
nit/PAH
nit/PAH
nit/PAH
Target categorya
adh, Adhesion molecules; DNA, repair of DNA damage; fol, folate transport and metabolism; inf, inflammatory signaling and processes or immune regulation; mut, mutagenic processes; nic, nicotine addiction and smoking behavior; nit, tobacco-specific nitrosamine (in particular, NNK) activation and detoxification; onc, oncogenesis; oxs, oxidative stress; str, steroid hormone metabolism and signaling; tox, other toxin or toxicity; tum, risk for lung cancer or related tumors.
17,797
31
UGT2B4
Sum:
8
UGT2B28
a
14
All HapTags: pairwise, MAF1%
UGT2B17
Genetic locus overlaps 1
1
UGT2B15
Gene symbol
(Continued)
TABLE
FIGURE 1
Distribution of assay conversion rates for SNPs in various design categories. Assays were assigned to Infinium design score bins equal to or less than the indicated values. The percent of all assays in a bin that successfully generated genotypes (unambiguous SNP allele calls in at least 95% of DNA samples) is plotted for Infinium-eligible SNPs in the Illumina database (black bars), mitochondrial DNA SNPs (gray bars), and SNPs uploaded as custom sequences (dashed bars). The number of SNPs in each bin, as a percent of total SNPs in each category, is plotted with square line markers for Infinium database SNPs, circles for mitochondrial, and X for custom sequences.
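The conversion-rate calculation described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function name `conversion_by_bin`, the input layout (one design score plus a list of per-sample calls, with `None` marking a no-call), the 0.1 bin width, and the toy data are all assumptions made for the example. Only the 95% call-rate threshold and the "equal to or less than" binning rule are taken from the caption.

```python
import math
from collections import defaultdict


def conversion_by_bin(assays, bin_width=0.1, min_call_rate=0.95):
    """Percent of assays per design-score bin that convert.

    assays: iterable of (design_score, calls) pairs; calls holds one
    genotype string per DNA sample, or None for a no-call (assumed layout).
    An assay "converts" when at least min_call_rate of samples have a call.
    """
    totals = defaultdict(int)
    converted = defaultdict(int)
    for score, calls in assays:
        # Upper bin edge: smallest multiple of bin_width that is >= score.
        # round() guards against float noise such as 0.6 / 0.1 == 5.999...
        edge = round(math.ceil(round(score / bin_width, 9)) * bin_width, 9)
        call_rate = sum(c is not None for c in calls) / len(calls)
        totals[edge] += 1
        if call_rate >= min_call_rate:
            converted[edge] += 1
    return {edge: 100.0 * converted[edge] / totals[edge]
            for edge in sorted(totals)}


# Hypothetical toy data: three assays genotyped on four samples each.
toy = [
    (0.95, ["AA", "AB", "BB", "AA"]),   # 4/4 calls -> converts
    (0.92, ["AA", None, None, "AB"]),   # 2/4 calls -> fails
    (0.55, ["AA", "AA", "AB", "BB"]),   # 4/4 calls -> converts
]
print(conversion_by_bin(toy))  # {0.6: 100.0, 1.0: 50.0}
```

Running the same function separately on each SNP category (database, mitochondrial, custom) would give one series of bars per category, which is how the figure is organized.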
the array. All markers and their sequences, coordinates, and targeted genes are provided in Supplemental Table 2. Genotyping assays were performed on 1873 DNA samples from lung-cancer patients and controls using LungCaGxE microarrays. Forty-seven samples had a SNP assay call rate <99.0%. If these samples are excluded, SNP assays with an Infinium design score of at least 0.6 produced unambiguous genotype calls in 99.03% of the attempted reactions (Fig. 1). Targeted functional SNPs with a design score <0.6 generated genotype calls in 84.96% of the attempted reactions; the average genotyping rate for SNPs with recognized rs numbers in the Illumina database was 99.09%, whereas the rate for SNPs submitted as custom sequences was 96.16% (design score >0.6 in both sets).
DISCUSSION
The advancement of array-based SNP genotyping technologies has led to genome-wide association studies (GWAS), in which genetic markers distributed evenly throughout the genome [17] (or covering predicted haplotypes throughout the genome [14]) are tested for statistically significant association with a phenotype. Arrays offer advantages for GWAS over current deep-sequencing methods, including lower cost, faster assay turnaround and sample throughput, and easier data processing. However, the success of proxy markers depends on linkage to causal but unmeasured genetic variants, and even the highest capacity arrays of over 5 million SNPs may not cover rare variants or diverse populations well. Whole-genome or exome sequencing directly detects causal variants and polymorphism types beyond bi-allelic single nucleotides and does not rely on linked markers for statistical analysis. Whether deployed on SNP arrays or deep sequencing platforms, the primary concern for whole-genome assays is statistical power. Rare variants,
multiple causes for the same phenotype, intergenic and multigene effects, and genetically mandated differential interactions between genes and environmental variables can all combine with multiple testing correction requirements to drive study population sizes to thousands or tens of thousands of subjects to adequately power GWAS [18–21]. Projects of this scale are an expensive proposition for arrays and would be extremely costly with deep sequencing even at the as-yet unattained goal of $1000/genome.
Comprehensive genotyping of targeted genes, by arrays or sequencing, takes advantage of high multiplex assay capacities to saturate targets with genetic markers. Hence, array data are less reliant on capturing a single, important linked marker while retaining rapid sample throughputs, and sequencing costs and efficiency are improved by focusing on a subset of genes rather than the whole genome. Depending on the size of the target panel and degree of saturation desired, custom arrays or sequencing can ease multiple testing penalties and reduce study population sizes necessary to achieve statistical power. Of course, the critical issue for this strategy is choosing which genes to assay. For the LungCaGxE panel, we chose genes involved in pathways relevant to responses to environmental stressors and saturated the resulting target panel with genetic markers as well as previously demonstrated functional and disease-associated variants.
The Illumina design score, although generally predictive of positive assay performance, underestimated the LungCaGxE genotype success rate achieved for Infinium-eligible tagSNPs and custom SNPs from the nuclear genome. The design scores were somewhat less predictive of (i.e., further underestimated) the genotyping rates achieved for mitochondrial genome SNPs, which performed well over a wide range of design scores. The relatively high success rates for assays with design scores <0.6 indicate that for future targeted genotyping projects, failure to meet this overly stringent standard cutoff should not necessarily disqualify an assay if the specific SNP in question is important for the study goals.
In summary, the investigator tasked with designing a custom-targeted genotyping assay must balance several considerations. Given that the platform’s multiplex capacity is often dictated by the project’s budget, the investigator must select the marker types, thresholds for number of genes targeted, and MAF cutoffs that will provide the most efficient use of available assay resources. Several iterations of empirical design are usually needed to assess the impact of these parameters, and this process is aided by a streamlined bioinformatics workflow. Tagger Batch Assistant helps automate the retrieval of genetic coordinates for requested genes, managing genome build versions and providing an output format that easily interfaces with Tagger for marker prediction. The resulting Tagger files are then automatically processed to connect markers with the user’s upstream gene annotations. We used this tool to optimize the LungCaGxE design through multiple versions, preserving sensitivity for marker MAFs as low as 1%, while reducing the number of SNPs required by using the Tagger multimarker haplotyping algorithm. This array enables rapid, cost-effective, and comprehensive genotyping of a panel of genes important for exploring genetic factors in lung cancer and the environmental influences that impact those factors.
ACKNOWLEDGMENTS
This work was funded by grant PA4100038714 from the Pennsylvania Department of Health and U.S. National Institutes of Health–National Institute of Environmental Health Sciences grant 5 P30 ES 013508-06 for the Center of Excellence in Environmental Toxicology.
We thank David McGain and Kathakali Addya (Penn Molecular Profiling Facility) and Cecilia Kim (Children’s Hospital of Philadelphia Center for Applied Genomics) for technical assistance.
DISCLOSURE
The authors have no associations or sources of financial support that pose a conflict of interest for conducting or interpreting the work presented in this manuscript.
REFERENCES
1. American Cancer Society. Cancer Facts & Figures 2013. Atlanta, GA, USA: American Cancer Society, 2013 (http://www.cancer.org/acs/groups/content/@epidemiologysurveilance/documents/document/acspc-036845.pdf).
2. Cassidy A, Duffy SW, Myles JP, Liloglou T, Field JK. Lung cancer risk prediction: a tool for early detection. Int J Cancer 2007;120:1–6.
3. Ihsan R, Chauhan PS, Mishra AK, et al. Multiple analytical approaches reveal distinct gene-environment interactions in smokers and non-smokers in lung cancer. PLoS One 2011;6:e29431.
4. Thomas L, Doyle LA, Edelman MJ. Lung cancer in women: emerging differences in epidemiology, biology, and therapy. Chest 2005;128:370–381.
5. Braithwaite KL, Rabbitts PH. Multi-step evolution of lung cancer. Semin Cancer Biol 1999;9:255–265.
6. Bach PB, Kattan MW, Thornquist MD, et al. Variations in lung cancer risk among smokers. J Natl Cancer Inst 2003;95:470–478.
7. Bilello KS, Murin S, Matthay RA. Epidemiology, etiology, and prevention of lung cancer. Clin Chest Med 2002;23:1–25.
8. Liu G, Zhou W, Christiani DC. Molecular epidemiology of non-small cell lung cancer. Semin Respir Crit Care Med 2005;26:265–272.
9. Taioli E. Gene-environment interaction in tobacco-related cancers. Carcinogenesis 2008;29:1467–1474.
10. Gustafson AM, Soldi R, Anderlind C, et al. Airway PI3K pathway activation is an early and reversible event in lung cancer development. Sci Transl Med 2010;2:26ra25.
11. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44–57.
12. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37:1–13.
13. De Bakker PI, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet 2005;37:1217–1223.
14. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16:1136–1148.
15. Goode EL, Fridley BL, Sun Z, et al. Comparison of tagging single-nucleotide polymorphism methods in association analyses. BMC Proc 2007;1(Suppl 1):S6.
16. Nam MH, Won HH, Lee KA, Kim JW. Effectiveness of in silico tagSNP selection methods: virtual analysis of the genotypes of pharmacogenetic genes. Pharmacogenomics 2007;8:1347–1357.
17. Matsuzaki H, Dong S, Loi H, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004;1:109–111.
18. Becker T, Herold C, Meesters C, Mattheisen M, Baur MP. Significance levels in genome-wide interaction analysis (GWIA). Ann Hum Genet 2011;75:29–35.
19. Park JH, Wacholder S, Gail MH, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 2010;42:570–575.
20. Sale MM, Mychaleckyj JC, Chen WM. Planning and executing a genome wide association study (GWAS). Methods Mol Biol 2009;590:403–418.
21. Spencer CC, Su Z, Donnelly P, Marchini J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 2009;5:e1000477.