Genomic data integration in chronic lymphocytic ... - Wiley Online Library

4 downloads 0 Views 1MB Size Report
Mathematics, University of Oviedo, Oviedo,. Asturias, Spain. 2 Artificial Intelligence Center, University of. Oviedo, Oviedo, Asturias, Spain. 3 Biomodels LLC ...
Received: 8 March 2016

Revised: 4 October 2016

Accepted: 26 November 2016

DOI 10.1002/jgm.2936

RESEARCH ARTICLE

Genomic data integration in chronic lymphocytic leukemia Juan Luis Fernández‐Martínez1

|

Enrique J. deAndrés‐Galiana1,2

|

Stephen T. Sonis3

1

Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, Oviedo, Asturias, Spain

2

Artificial Intelligence Center, University of Oviedo, Oviedo, Asturias, Spain

3

Biomodels LLC, Watertown, MA, USA

Correspondence J. L. Fernández‐Martínez, Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, Oviedo, Asturias, Spain and Jesús Arias de Velasco s/n, 33005 Oviedo, Spain. Email: [email protected]

Abstract Background

B‐cell chronic lymphocytic leukemia (CLL) is a heterogeneous disease and the

most common adult leukemia in western countries. IgVH mutational status distinguishes two major types of CLL, each associated with a different prognosis and survival. Sequencing identified NOTCH1 and SF3B1 as the two main recurrent mutations. We described a novel method to clarify how these mutations affect gene expression by finding small‐scale signatures that predict the IgVH, NOTCH1 and SF3B1 mutations. We subsequently defined the biological pathways and correlation networks involved in disease development, with the potential goal of identifying new drugable targets.

Methods

We modeled a microarray dataset consisting of 48807 probes derived from 163

samples. The use of Fisher's ratio and fold change combined with feature elimination allowed us to identify the minimum number of genes with the highest predictive mutation power and, subsequently, we applied network and pathway analyses of these genes to identify their biological roles.

Results

The mutational status of the patients was accurately predicted (94–99%) using small‐

scale gene signatures: 13 genes for IgVH, 60 for NOTCH1 and 22 for SF3B1. LPL plays an important role in the case of the IgVH mutation, whereas MSI2, LTK, TFEC and CNTAP2 are involved in the NOTCH1 mutation, and RPL32 and PLAGL1 are involved in the SF3B1 mutation. Four high discriminatory genes (IGHG1, MYBL1, NRIP1 and RGS1) are common to these three mutations. The IL‐4‐mediated signaling events pathway appears to be involved as a common mechanism and suggests an important role of the immune response mechanisms and antigen presentation.

Conclusions

This retrospective analysis served to provide a deeper understanding of the

effects of the different mutations in CLL disease progression, with the expectation that these findings will be clinically applied in the near future to the development of new drugs. KEY W ORDS

cancer, gene expression, hematologic, leukemia, mathematical modeling, oncology

1

|

I N T RO D U CT I O N

discerned by immunoglobulin variable‐region heavy chain (IgVH) gene mutation status.5 Because determination of IgVH mutation status is

B‐cell chronic lymphocytic leukemia (CLL) is a complex heterogeneous

very labor‐intensive and expensive, alternative markers have been

disease characterized by the accumulation of malignant B‐cells in the

investigated aiming to achieve a better prognosis with respect to dis-

blood and lymphoid organs.1 Clinical diagnosis of CLL is based on the

ease progression. ZAP‐70 became a very popular surrogate marker of

demonstration of an abnormal population of B lymphocytes in the

IgVH mutational status6–10 and cell membrane expression of CD38

blood, bone marrow or tissues displaying an unusual but characteristic

has been described as a reliable prognostic value for CLL.11 Both

pattern of molecules on the cell surface (CD5 and CD23 clusters of dif-

ZAP‐70 and CD38 have been established as good predictors of IgVH

ferentiation). The staging systems of Rai et al.2 or Binet et al.3 are cur-

mutational status.12,13

rently used to determine the extent of the disease and are primarily based on a low platelet or red cell counts.

Gene expression profiles were also used to understand the genesis and progression of CLL. As a result of the high dimensionality of the

DNA analysis distinguishes two major types of CLL with different

data, on first inspection, both subtypes of CLL showed quite homoge-

survival times.4 This distinction is based on lymphocyte maturity, as

neous expression profiles irrespective of their IgVH mutational

J Gene Med. 2017;19:e2936. https://doi.org/10.1002/jgm.2936

wileyonlinelibrary.com/journal/jgm

Copyright © 2016 John Wiley & Sons, Ltd.

1 of 23

2 of 23

FERNÁNDEZ‐MARTÍNEZ

status.4,14–16 This suggested that both CLL types derive from a com-

ET AL.

scale (log2) after the corresponding RMA preprocess. Of the original

mon pathogenic pathway. Nevertheless, different subsets of genes

cohort of 163 patients, 92 had mutated IgVH, which was associated

specifically expressed by CLL cells with potential pathogenesis and

with a favorable prognosis, whereas IgVH was not mutated in the

clinical relevance were identified.17–21

remainder (n = 71) and suggested an unfavorable outcome. The exome

Subsequently four major genomic aberrations have been identi-

sequencing data are described by Quesada et al.,25 who identified

22

1246 mutations resulting in protein coding changes. Six genes

fied in CLL cells that are strongly associated with disease behavior.

More recently, NOTCH1 and SF3B1 have been described as the most

appeared to be most frequently mutated (> 5%): NOTCH1, SF3B1,

frequently mutated genes predictive of CLL prognosis.23 Addition-

NOP16, CHD2, ATM and LRP1B. Amongst the 163 samples evaluated,

ally, disease progression has been associated with a number of

NOTCH1 and SF3B1 mutational status were determined for 117

genetic alterations, including cytogenetic abnormalities and specific

patients. Of these, 106 were unmutated for NOTCH1 and 107 were

gene mutations,23–29 and epigenetic alterations, including aberrant

unmutated for SF3B1. In the present study, this dataset is referred to

methylation CLL.30

as the reduced cohort and is used to explore the possible relationships

Given the low incidence of NOTCH1 (9%) and SF3B1 (8%)

between mutations.

mutations, it is unlikely that CLL progression could be solely ascribed

Table 1 shows the degree of mutational overlap between the

to the two. We therefore aimed to identify shared/synergistic mech-

patients in the reduced cohort (117 samples). It can be observed that

anisms among the three most common mutations (IgVH, NOTCH1

most of the patients with IgVH mutated (n = 66) do not show the

and SF3B1) that might better predict and explain disease progression

NOTCH1 (n = 63) and SF3B1 mutations (n = 61). This corresponds to

and behavior.

what is expected in terms of prognosis. Also, the NOTCH1 and SF3B1

Obviously, CLL is a multicause disease and, for that reason, it is important to integrate the modelling of different mutations. In the

mutations appear to be independent factors for CLL progression because only two patients exhibit both mutations at the same time.

present study, we describe the development of a novel a methodol-

These two classification problems (NOTCH1 and SF3B1 muta-

ogy aiming to clarify how these mutations affect gene expression

tional status) are naturally unbalanced, and the classifier has to take

by identifying small‐scale signatures to predict the IgVH, NOTCH1

this feature into account. In the case where the predictive accuracies

and SF3B1 mutations (genomic data integration). We subsequently

of the small‐scale signatures found are higher than those provided by

applied our methodology to define and understand the biological

the corresponding majority class classifiers, we could affirm that the

pathways and correlation networks that are involved in disease

classifier has learnt the set of discriminatory genes for these muta-

development, with the potential goal of identifying new druggable

tions. These signatures have been validated by cross‐validation. For

targets. We have identified four high discriminatory genes (IGHG1,

that purpose, the dataset was divided into two folds according to

MYBL1, NRIP1 and RGS1) that are common to these three mutations.

the different mutations: one fold was used for training and the other

Furthermore, the IL‐4‐mediated signaling events pathway appears to

for validation. This process is repeated until the whole dataset is

be a common mechanism to these genes. We confirm that the inte-

processed and it is termed leave‐one‐out‐cross‐validation (LOOCV).

gration of microarray data and the main mutational status of patients

Furthermore, we have confirmed the stability of these results using

achieved by CLL is a novel approach for understanding the main

different random holds‐outs validation experiments (75% for learning

causes of disease progression.

and 25% for validation).

2 2.1

MATERIALS AND METHODS

|

|

Dataset

We used a publically accessible dataset (European Bioinformatics Institute EGAS00001000374) in which microarray results consisting of 48807 probes were derived from 163 patients with a diagnosis of CLL.31 The expression data were originally presented in logarithmic TABLE 1 Degree of mutational overlap in the reduced cohort of patients (117 samples)

IgVH (+)

IgVH (−)

NOTCH1 (+)

NOTCH1 (−)

SF3B1 (+)

SF3B1 (−)

IgVH (+)

66

*

3

63

5

61

IgVH (−)

*

51

8

43

5

46

NOTCH1 (+)

3

8

11

*

2

9

NOTCH1 (−)

63

43

*

106

8

98

SF3B1 (+)

5

5

2

8

10

*

SF3B1 (−)

61

46

9

98

*

107

*No overlapping.

FIGURE 1

Flow diagram of the analytical procedure: (1) Data input. (2) Gene ranking according the discriminatory power. (3) Small‐scale signature predictive accuracy (LOOCV) and hold‐out analysis (stability of the signature). (4) Analysis of the biological pathways and correlation networks

FERNÁNDEZ‐MARTÍNEZ

2.2

|

3 of 23

ET AL.

Analysis

processes that are impacted by each mutation. The algorithm designed to perform the pathways sampling is outlined below.

Feature selection is one of the major challenges of any genotype‐ based phenotypic prediction problem because the number of genetic

3. The training (75%) and validation (25%) sets are randomly selected

probes far exceeds the number of samples. To directly address this

from the study cohort. We have performed 500 different

issue, we used a combination of two well‐known ranking algorithms:

simulations.

32

33

(for further details, see ).

4. For each simulation, the list of most discriminatory genes

We first ranked genes according to their discriminatory power (FR or

(smallest‐scale signature) is determined in the training setand

absolute FC value) and then applied a Nearest Neighbor (k‐NN) based

applied to the validation set to establish the predictive accuracy.34

algorithm to establish the accuracy of the different ranked sets of

If the validation accuracy is higher that a given minimum accuracy

genes via LOOCV modeling. The combination of this procedure with

(90% in this case), the current list is stored.

Fisher's ratio (FR)

and fold change (FC)

a feature elimination algorithm produced the shortest list of high dis-

5. After all the simulations are performed, a frequency analysis of

criminatory genes and served to validate the prognostic value of these

the most discriminatory genes found in the different simulations

gene signatures over the existing dataset by LOOCV.34 The stability of

is performed. The hold‐out experiment serves to analyze how

these signatures is also examined by random 75–25 hold‐out

the lack of information (samples) impact gene selection. Subse-

experiments.

quently, the most frequently sampled genes are used to infer

The aim of the hold‐out procedure is two‐fold.

the genetic pathways, which are the robust way of linking disease genes to phenotypes.

1. Checking the stability of the predictive accuracy of the small‐ scales signatures found via LOOCV when the number of training

Finally, the correlation networks between the set of highly dis-

samples is decreased. In this case, the minimum‐scale signature

criminatory genes determined via LOOCV using FR and FC as featured

is read in the training data set for training (75% of the whole set) and applied for blind validation in the validation set (25%).

TABLE 3

IgVH mutational status prediction using FC

The cumulative distribution function of the small‐scale predictive

Gene

μ1

σ1

fc (log)

Acc (%)

accuracies found in different hold‐outs is finally presented and

RGS13

261

568

22

25

3.6

57.7

serves to account for the variability in its predictive accuracy with

LPL

40

70

380

272

–3.2

86.5

partial information. A statistical analysis is performed providing

PXDNL

269

838

29

52

3.2

87.1

the minimum, maximum and median bounds that could be

SEPT10

22

50

177

232

–3.0

88.3

expected in an independent dataset.

SEPT10

21

46

172

232

–3.0

86.5

LOC100128252

29

43

224

194

–3.0

89.0

SEPT10

22

48

158

206

–2.9

85.3

LOC100128252

30

42

220

172

–2.9

86.5

KANK2

40

75

258

335

–2.7

87.1

NPTX1

28

32

168

363

–2.6

85.9

LMNA

30

54

178

446

–2.6

85.9

PTCH1

140

163

24

42

2.6

89.0

IGHG1

162

687

942

1893

–2.5

88.3

MYBL1

424

526

73

88

2.5

89.6

87.1

PPP1R9A

45

90

260

278

–2.5

90.2

62

125

352

298

–2.5

90.8

2. Sampling the uncertainty space in the corresponding phenotype 35

prediction problems to perform pathways analysis.

In this pro-

cedure, we look for the most predictive genes found in each random hold‐out. Based on these predictive high discriminatory signatures, we provide pathway analysis to unravel the biological

TABLE 2

IgVH mutational status prediction using FR

Gene LPL

μ1

σ1 40

70

μ2 380

σ2 272

FR (log) 4.6

Acc (%)

LPL

26

33

146

102

3.7

86.5

CRY1

CRY1

62

125

352

298

3.1

90.2

LPL

LOC100128252

29

43

224

194

3.0

90.2

TFEC

LOC100128252 SPG20 ZBTB20 NRIP1 SPG20

30 24

42 35

220 111

172 85

3.0 2.9

μ2

σ2

26

33

146

102

–2.5

92.0

412

580

77

230

2.4

92.6

89.6

PPP1R9A

33

63

171

175

–2.4

92.6

91.4

SPG20

30

53

148

126

–2.3

92.6

32

43

153

301

–2.3

92.6

1943

505

982

417

2.8

91.4

STK32B

275

183

63

81

2.7

91.4

ANKRD57

20

28

98

97

–2.3

91.4

91.4

SPG20

24

35

111

85

–2.2

92.0

205

203

45

97

2.2

92.0

30

53

148

126

2.6

ZAP70

103

151

273

140

2.4

92.6

ADAM29

LDOC1

20

19

50

27

2.3

92.6

RBMS3

19

27

84

122

–2.2

91.4

92.6

PPP1R9A

27

36

120

126

–2.2

91.4

93.3

NRIP1

275

183

63

81

2.1

92.6

PTCH1

115

123

27

33

2.1

93.3

COBLL1 NRIP1

186 85

107 60

85 24

100 24

2.3 2.1

List of the 13 most discriminatory genes list with the highest predictive accuracy (93.3%), ordered by decreasing FR. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1, (mutated IgVH), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy.

List of the 28 most discriminatory genes with the highest predictive accuracy (93.3%), ordered by decreasing absolute FC. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated IgVH), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.

4 of 23

FERNÁNDEZ‐MARTÍNEZ

ET AL.

FIGURE 2

Correlation networks of the most discriminatory genes (FR) for the IgVH mutational status prediction. A, Using the PC coefficient. B, Using the NMI

selection methods, are calculated via the Pearson correlation (PC)

3

RESULTS

|

coefficient36 and normalized mutual information (NMI).37 These correlation networks are two different measures of the mutual dependence

3.1

|

IgVH mutational status

between gene expressions (for further details, see ). They serve to analyze inter‐relationships between genes, which impact both expression

We determined the best set of genes that discriminates IgVH

and function. We then identified the pathways ontology using

mutational status based on microarray expression and the class infor-

GeneAnalytics

38

to cover the altered and disease pathways. Figure 1

shows the flow‐diagram of the procedure that has been adopted.

mation defined by the IgVH phenotype using 92 mutated and 71 unmutated samples (whole cohort).

FERNÁNDEZ‐MARTÍNEZ

ET AL.

5 of 23

FIGURE 3 Correlation networks of the most differentially expressed genes (FC) for the IgVH mutational status prediction. A, Using the PC coefficient. B, Using the NMI

3.1.1

|

Gene ranking

The shortest list with the highest predictive accuracy (93.3%) was com-

of the expressions. Table 3 shows the first 28 genes with the highest predictive accuracy (93.3%), ordered by decreasing absolute FC

posed of 13 first probes: LPL (two probes), CRY1, LOC100128252 (two

(under‐ or over‐expressed) in logarithmic scale, fc (log), the mean

probes), SPG20 (two probes), ZBTB20, NRIP1 (two probes), ZAP‐70,

(μ1, μ2) and the SD (σ1, σ2) for each IgVH group, and the LOOCV accu-

LDOC1 and COBLL1. Table 2 shows a list of these genes, their associ-

racy [Acc (%)]. In this case, over‐expression implies that the average

ated FR and the mean LOOCV accuracy. FR was applied to the log2

expression is higher in the mutated group. The gene with the highest

6 of 23

TABLE 4

FERNÁNDEZ‐MARTÍNEZ

ET AL.

NOTCH1 mutational status prediction using FR

Gene

μ1

σ1

μ2

σ2

FR (log)

Acc (%)

MSI2

157

74

43

26

4.6

93.2

MSI2

238

123

62

49

4.1

94.9

MSI2

73

25

31

16

3.0

91.5

MSI2

283

149

92

61

2.8

90.6

MSI2

58

19

32

15

2.7

92.3

C10orf137

193

86

392

135

2.4

90.6

LAG3

236

155

77

103

2.4

90.6

LPL

357

250

170

254

2.3

92.3

NCK2

838

219

1560

529

2.2

93.2

CNTNAP2

66

96

667

799

2.1

92.3

ST3GAL1

38

11

85

36

2.1

90.6

CCDC24

109

73

48

44

2.0

92.3

LTK

216

96

103

132

2.0

90.6

FLNB

59

30

33

17

1.9

94.0

ZNF333

38

5

57

16

1.9

92.3

PREPL

190

62

329

108

1.9

93.2

C19orf28

120

37

217

80

1.9

93.2

C1orf38

365

148

189

109

1.8

91.5

LTK

107

52

52

64

1.8

91.5

SPG20

182

150

71

106

1.8

92.3

SAP30L

74

38

111

32

1.8

94.0

MYST1

248

37

322

60

1.7

93.2

C10orf137

99

41

187

66

1.7

94.9

ATP6V0B

831

198

596

183

1.7

91.5

LPL

130

89

75

99

1.7

92.3

SLC4A7

47

39

150

120

1.7

90.6

161

126

112

156

1.7

89.7

HNRNPR

57

22

110

48

1.7

89.7

REEP5

41

18

80

39

1.6

90.6

SRSF1

110

60

175

52

1.6

94.0

37

8

64

24

1.6

94.0

SHPRH

270

64

383

83

1.6

94.0

CNTNAP2

101

140

804

1105

1.6

94.9

PHF2

119

44

175

60

1.6

92.3

FCRL1

234

180

525

308

1.6

93.2

WSB2

804

329

489

258

1.6

93.2

ATP6V0B

624

145

448

134

1.6

94.9

LOC100128252

GNPNAT1

LYL1

87

31

140

47

1.5

94.9

ACSL5

230

85

332

106

1.5

94.9

STX17

50

21

75

25

1.5

94.0

SPG20

125

98

55

74

1.5

94.0

NHEJ1

29

7

37

8

1.5

94.0

ZNF248

48

25

89

45

1.5

93.2

MPST

55

20

35

10

1.5

93.2

CDK13

69

42

132

75

1.5

93.2

TRMT1

58

17

86

30

1.5

92.3

PI4K2A

224

101

115

84

1.5

93.2

ELOVL5

254

97

504

188

1.5

93.2

FAM30A

588

900

1535

1495

1.5

93.2

PTDSS1

129

21

190

44

1.5

94.0

PLGLB1

74

47

152

103

1.5

94.0 (Continues)

FERNÁNDEZ‐MARTÍNEZ

TABLE 4

7 of 23

ET AL.

(Continued)

μ1

Gene

σ1

μ2

σ2

FR (log)

Acc (%)

C5orf53

51

22

125

74

1.5

94.0

PSMD7

608

175

414

141

1.5

94.9

NASP

117

26

176

52

1.5

94.0

ATP6V0B

768

170

566

172

1.5

94.9

WDR36

108

36

164

43

1.4

94.9

LTN1

511

52

645

99

1.4

94.9

22

2

19

2

1.4

94.9

GAL3ST3 PDE7A CAPRIN2

102

67

214

120

1.4

94.9

1098

345

1511

368

1.4

95.7

List of the 60 most discriminatory genes to predict the NOTCH1 mutation list with the highest predictive accuracy (95.7%), ordered by decreasing FR. Class 1 corresponds to samples with mutated NOTCH1 and class 2 corresponds to those with unmutated NOTCH1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated NOTCH1), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy.

under‐expression is LPL, and the gene with the highest over‐expres-

network (Figure 4B) demonstrates a main connection through NCK2.

sion is RSG13.

Figures 5A and 5B show these networks constructed with the list of the most differentially expressed genes defined by FC.

3.1.2

|

Correlation networks

Figure 2A shows the PC network of the most discriminatory genes

3.3

SF3B1 mutation status

|

(defined by FR) of IgVH mutational status. The NMI correlation network is shown in Figure 2B. Figures 3A and 3B show the same networks constructed with the list of most differentially expressed genes provided by FC.

SF3B1 gene (Splicing Factor 3b, Subunit 1) is located in chromosome 2. Its importance in CLL has been analyzed previously.26,41 As with NOTCH1, the SF3B1 classification problem was also highly unbalanced because 107 CLL samples (out of 117) in the reduced cohort did not

3.2

Modeling the NOTCH1 mutation status

|

It has been demonstrated that NOTCH1 mutation influences survival in CLL patients.

39,40

show this mutation.

3.3.1

|

Gene ranking

We recognized the challenge in analyzing those

The shortest list with the highest predictive accuracy (99.1%) was com-

genes for which the NOTCH1 mutation impacted expression given

posed of 22 probes with FR values of between 2.6 and 1.7. The most

the highly unbalanced sample mix (106 of 117 samples did not show

discriminatory gene was RPL32 (Table 6). The FC provided a predictive

the NOTCH1 mutation in the reduced cohort).

accuracy of 96.6% using the list of the first 118 genes ranked by absolute FC (Table 7). This accuracy was lower compared to the list of the

3.2.1

|

Gene ranking

22 most discriminatory genes provided by the FR. Seven different

The shortest list with the highest predictive accuracy (95.7%) was com-

probes of gene ANXA4 appeared in this list, which accounts for the

posed of 60 probes with FR values of between 4.6 and 1.4 (Table 4).

importance of this gene.

The first five probes of this list corresponded to MSI2. Also, using the two first probes of MSI2, the NOTCH1 mutation is predicted with 94.9% accuracy. All MSI2 probes had lower expression in NOTCH1‐ mutation negative patients. One probe of LPL appeared in eighth position within this list. Therefore, the incremental accuracy from probes 5 to 60 was minimal (0.8%). The genes from the sixth position to the 60th position serve to add high frequency details in the discrimination (helper genes).34 All of these genes show correlation networks that are analyzed as described further below. The FC provided a longer list (n = 126) with the same predictive accuracy as the FR list (Table 5). Different probes for TFEC and

3.3.2

|

Correlation networks

Figure 6A shows the PC network of the most discriminatory genes of the SF3B1 mutation. In general, correlations between discriminatory genes are low, implying that these genes are independent predictors for this mutation. Two main networks were noted to be associated with the most discriminatory gene RPL32 through YWHAB and KLF8. Conversely, the correlation network using the NMI (Figure 6B) demonstrated a single network associated with CNPY2‐STK38. Figures 7A and 7B show these networks constructed with the list of the most differentially expressed genes defined by FC.

CNTNAP2 were consistently expressed the most differentially. The high variability of these genes in samples with unmutated NOTCH1 precluded their inclusion among the leaders in the FR list.

3.4

|

Hold‐outs experiments and pathway analysis

Figures 8A, 8B and 8C show the cumulative probability distribution

3.2.2

|

Correlation networks

function (CDF) of the predictive accuracy for the small‐scale signatures

Figure 4A shows the PC network of the most discriminatory genes of

found for the three mutations, obtained via 500 different random

the NOTCH1 mutation in which three main networks associated with

75–25 hold‐out experiments. The CDF gives the probability that the

MSI2 through WSB2, ACSL5 and CNTNAP2 are apparent. The NMI

predictive accuracy of the minimum‐scale signature is smaller than a

8 of 23

TABLE 5

FERNÁNDEZ‐MARTÍNEZ

TABLE 5

NOTCH1 mutational status prediction using FC

Gene

μ1

σ1

μ2

σ2

fc (log)

Acc (%)

ET AL.

(Continued)

Gene

μ1

σ1

μ2

σ2

fc (log)

Acc (%)

TFEC

19

4

289

488

–3.9

85.5

IgVH5–78

56

46

166

232

–1.6

91.5

TFEC

15

2

180

331

–3.6

85.5

IGHG1

1369

2081

468

1313

1.5

91.5

CNTNAP2

66

96

667

799

–3.3

82.9

FCRL2

460

334

1334

872

–1.5

92.3

RASSF6

15

1

144

352

–3.2

84.6

FCRL5

226

185

656

590

–1.5

93.2

PXDNL

18

3

168

598

–3.2

82.1

FCGR3A

65

76

189

269

–1.5

93.2

CNTNAP2

101

140

804

1105

–3.0

79.5

FCRL2

420

303

1215

788

–1.5

91.5

DEFA1

272

816

1866

3452

–2.8

77.8

CD9

74

70

215

302

–1.5

92.3

ADAM29

24

20

156

199

–2.7

76.9

FCRL5

564

492

1629

1333

–1.5

93.2

NRIP1

31

10

200

187

–2.7

84.6

FCRL2

469

323

1351

877

–1.5

91.5

IGF2BP3

33

47

186

281

–2.5

88.0

NBPF3

27

19

78

88

–1.5

91.5

MYBL1

47

65

262

407

–2.5

86.3

FCRL2

220

261

629

541

–1.5

90.6

ZNF208

18

6

100

143

–2.5

89.7

CD9

37

27

106

152

–1.5

90.6

FGL2

67

84

359

422

–2.4

91.5

FCRL5

266

210

749

628

–1.5

91.5

CLC

29

17

141

184

–2.3

90.6

CD9

151

180

425

595

–1.5

91.5

ZNF208

18

4

81

108

–2.2

91.5

FCRL2

725

482

2017

1253

–1.5

92.3

APOD

32

34

142

204

–2.1

90.6

FGL2

17

7

46

50

–1.5

92.3

APOD

115

139

459

637

–2.0

89.7

C21orf7

66

100

178

225

–1.4

92.3

MSI2

238

123

62

49

1.9

90.6

FCRL3

1692

1627

4567

3202

–1.4

92.3

SORL1

58

72

221

415

–1.9

90.6

FCRL2

739

560

1992

1191

–1.4

91.5

NRIP1

17

2

64

60

–1.9

90.6

MAP4K4

33

16

90

65

–1.4

90.6

2509

5892

657

2843

1.9

88.9

TRIB2

51

55

135

214

–1.4

89.7

46

31

173

440

–1.9

88.0

EDNRB

326

368

123

280

1.4

91.5

MSI2

157

74

43

26

1.9

94.0

FCRL3

1653

1573

4376

3115

–1.4

91.5

IGJ

299

429

1079

1859

–1.9

92.3

TUBB6

409

540

156

210

1.4

92.3

68

39

239

337

–1.8

91.5

ATF5

186

192

71

68

1.4

91.5

420

808

121

382

1.8

92.3

LOC728175

69

67

26

26

1.4

92.3

TFEC

15

1

51

71

–1.8

92.3

FAM30A

588

900

1535

1495

–1.4

92.3

PTCH1

25

29

86

135

–1.8

93.2

ACSM3

483

397

185

208

1.4

93.2

FCER1A

27

15

93

145

–1.8

93.2

PYHIN1

262

264

683

447

–1.4

92.3

FCRL2

81

85

276

214

–1.8

92.3

C1orf38

593

309

229

195

1.4

88.9

CD9

94

118

316

466

–1.7

91.5

MEF2C

286

166

739

393

–1.4

88.0

LAIR1

31

27

105

114

–1.7

92.3

SPG20

182

150

71

106

1.4

90.6

IGJ

82

106

273

463

–1.7

91.5

FCRL5

128

98

325

300

–1.3

89.7

1561

4218

471

2074

1.7

90.6

LAIR1

29

16

75

81

–1.3

89.7

135

131

41

52

1.7

91.5

PLGLB2

78

52

198

187

–1.3

89.7

78

97

258

379

–1.7

92.3

TCF7

316

337

802

780

–1.3

89.7

PRKCA

119

132

36

54

1.7

92.3

H3F3C

47

24

119

118

–1.3

89.7

TSHZ2

28

39

92

143

–1.7

92.3

SLC15A2

53

33

133

107

–1.3

91.5

FCRL2

650

530

2102

1440

–1.7

93.2

PLGLB2

88

66

219

136

–1.3

92.3

RGS1

1049

1956

327

670

1.7

91.5

MEF2C

193

107

481

252

–1.3

91.5

SLC4A7

47

39

150

120

–1.7

91.5

CD24

1070

624

2656

1495

–1.3

93.2

CD9

89

86

281

433

–1.7

92.3

BMP6

26

19

65

60

–1.3

93.2

SORL1

49

34

153

273

–1.6

92.3

C5orf53

51

22

125

74

–1.3

92.3

THRB

78

150

25

30

1.6

91.5

C1orf38

1176

588

477

377

1.3

90.6

MSI2

283

149

92

61

1.6

91.5

SERPINB6

245

176

100

119

1.3

93.2

LAG3

236

155

77

103

1.6

91.5

HIST1H2BD

413

362

1014

1104

–1.3

90.6

FCRL3

133

125

400

362

–1.6

93.2

RORA

347

546

142

256

1.3

92.3

FGL2

64

57

191

193

–1.6

92.3

TUBB6

410

630

168

344

1.3

91.5

CNTNAP2

21

4

62

79

–1.6

92.3

C21orf7

207

281

504

587

–1.3

92.3

FCRL2

342

261

1018

724

–1.6

93.2

SESN3

29

7

70

61

–1.3

92.3

ATF5

146

152

49

45

1.6

92.3

IGKV3D‐11 TCTN1

HOMER3 RGS13

IGKC LOC728175 CD9

(Continues) (Continues)

FERNÁNDEZ‐MARTÍNEZ

TABLE 5

9 of 23

ET AL.

(Continued)

Gene

μ1

phenotypes using all of the genes available in the microarray and also σ1

μ2

σ2

fc (log)

Acc (%)

an unsupervised classification method that serves to project the

76

75

185

197

–1.3

92.3

ATF5

247

251

102

93

1.3

90.6

MSI2

73

25

31

16

1.3

92.3

FCRL2

2055

1421

4932

2543

–1.3

92.3

FCRL5

1062

723

2521

1444

–1.2

92.3

TCTN1

469

238

1103

1509

–1.2

93.2

PPAPDC1B

474

394

1113

832

–1.2

93.2

ACSM3

839

726

358

424

1.2

94.0

ZADH2

50

26

118

71

–1.2

93.2

MNDA

168

139

391

235

–1.2

94.0

PER1

110

102

48

39

1.2

93.2

SLAMF1

55

30

127

104

–1.2

92.3

SERPINB6

224

149

97

108

1.2

91.5

PYHIN1

161

150

370

223

–1.2

92.3

FCRL1

850

631

1955

1048

–1.2

92.3

KCTD7

62

37

143

96

–1.2

93.2

SLC30A1

65

60

150

266

–1.2

93.2

IL15

167

176

73

113

1.2

93.2

SPG20

125

98

55

74

1.2

92.3

HIST1H2AC

132

156

301

367

–1.2

91.5

DNASE1L3

63

98

28

60

1.2

92.3

CRIP3

SERPINB6

341

219

151

164

1.2

91.5

B4GALT2

135

95

60

59

1.2

94.0

29

12

64

46

–1.2

94.0

387

282

172

124

1.2

95.7

KLF7 ABCA9

the dataset reduced to their respective small‐scale signatures. PCA is

List of the most discriminatory genes (126) to predict the NOTCH1 mutation ordered by decreasing absolute FC with an accuracy of 95.7%. Class 1 corresponds to samples with mutated NOTCH1, and class 2 corresponds to those with unmutated NOTCH1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated NOTCH1), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.

dataset in a reduced dimensional space. In this case, we used a two‐ dimensional plot to show that the classification problem approaches a linearly separable behavior in the PCA space (it is possible to separate both populations by a plane) when the dataset is reduced to the corresponding small‐scale signatures, in contrast that observed when all the genetic probes are taken into account. This can be considered as graphical proof of the discriminatory power of these small‐scale signatures.35 Figures 9, 10 and 11 show the PCA plots for IgVH, NOTCH1 and SF3B1 mutational status.

3.6 | Gene intersections for IgVH, NOTCH and SF3B1 mutations We analyzed the intersection between the most discriminatory genes for IgVH, NOTCH1 and SF3B1 mutations as defined by FR and FC analyses. We consolidated both lists for each mutation, and then performed pairwise intersections to establish shared genes. This would serve to determine which effects might be amplified by these mutations. Figure 12 shows the results of these intersections. The intersection with the greater number of genes is NOTCH1–SF3B1 (19 genes), followed by IgVH–NOTCH1 (11 genes) and IgVH–SF3B1 with only five genes. Only four genes were common to all mutations: IGHG1, MYBL1, NRIP1 and RGS13.

4

|

DISCUSSION AND CONCLUSIONS

In the present study, we demonstrate genomic data integration in CLL patients by linking together microarray expression data and their IgVH, NOTCH1 and SF3B1 mutational status. The present study focuses on a novel methodological approach aiming to define hierarchical gene rela-

given cut‐off (x‐value). The lower and upper percentiles and the

tionships among CLL patients expressing these three different muta-

median accuracy can be read directly in this plot, which provides stabil-

tions and to establish the predictive accuracy of gene clusters

ity in the predictive accuracy as a function of the variability in the train-

relative to each mutation. Because of the high dimensionality of the

ing and validation sets. Table 8 shows the main statistics related to this

microarray data with respect to the number of available samples, these

stability analysis. The minimum accuracy that has been obtained was

types of phenotype prediction problems have a very high

80.5% in the case of the IgVH mutation. In all cases, the median accu-

underdetermined character. Therefore, simple and efficient methods

racy is close to the accuracy obtained via LOOCV. The SF3B1 muta-

are needed to rank genes according to their discriminatory power

tional status has been more accurately predicted, as was already the

and establish their predictive accuracy. We used two well‐known filter

case in the LOOCV experiment.

techniques: FR and FC. For each mutation and ranking method, we

Table 9 shows the frequency analysis of the most sampled dis-

determined the shortest list of high discriminatory genes with its corre-

criminatory genes found in the hold‐out analysis for all the mutations.

sponding LOOCV predictive accuracy. Using this methodology, we

Most of the genes also appeared in the lists of the most discriminatory

predicted IgVH mutational status and how the NOTCH1 and SF3B1

genes given by FR and FC via LOOCV. These lists were also used to

mutations affect the expression of different genes and their correlation

perform pathway analysis and for comparison with those that were

networks via the PC and NMI similarity coefficients. In this discussion,

obtained for the FC and FR lists.

we also provide the top ontological pathways that are involved in the disease progression, using GeneAnalytics. Correlation networks and

3.5 | Principal component analysis (PCA) analysis of the small‐scale signatures

canonical pathways provide effective methodologies for understanding the mechanisms that are involved in disease progression. This methodology served us to depict the gene clusters that are most

To show the discriminatory power of the small‐scale signatures for

strongly associated with the expression of each selective mutation

IgVH, NOTCH1 and SF3B1, we performed PCA of these three

(networks of synergistically working genes) and their relationship

10 of 23

FERNÁNDEZ‐MARTÍNEZ

ET AL.

FIGURE 4

Correlation networks of the most discriminatory genes (FR) for the NOTCH1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI

between mutation expression and a particular clinical outcome (survival).

polymerase chain reaction.43 Particularly, CRY1, LDOC1 and LPL were over‐expressed in IgVH‐unmutated compared with IgVH‐mutated

The main conclusions for each of these mutations are presented below.

cases. Conversely, COBLL1 and ZBTB20 were under‐expressed. The analysis of differential expression shows that LPL is also the second gene with the highest absolute FC (3.24) and is over‐expressed in the

4.1

|

group with unmutated IgVH of worst prognosis. RSG13 is the other

IgVH

gene that is highly under‐expressed in the same group. Other high dis-

The IgVH mutational status was predicted with very high accuracy

criminatory genes of the IgVH mutational status are CRY1, SPG20,

(94%) using a small‐scale signature composed of 13 genes. LPL (Lipo-

ZBTB20, NRIP1 and ZAP‐70, confirming findings reported previ-

protein Lipase) is the most discriminatory gene of the list, with two

ously.18–20

probes having a FR of 4.6 and 3.7. The predictive power of the first

The hold‐out analysis (Figure 8A) has shown a minimum predictive

LPL probe is also very high, providing a LOOCV predictive accuracy

accuracy of the small‐scale signature of 80.5%, a median accuracy of

of 87.1%. LPL has a lower expression in the patients with mutated

92.7% and a maximum accuracy of 100%. Also, the most frequently

20

in a study that also found

sampled genes in the hold‐out experiment coincide with expanded lists

high LPL mRNA expression to be associated with shorter treatment‐

of the most discriminatory genes found via FC and FR. This analysis

free survival. LPL is a very specific biomarker because its activity is

has also highlighted the importance of FCRL2 and FCRL3 genes related

IgVH. This was also reported previously

42

low or absent in other blood cells types. Kaderi et al

noted that LPL

was the strongest prognostic factor in comparative analysis of RNA‐

to immune receptor translocation proteins that may play a role in the regulation of the immune system.

based markers in early CLL. The results shown in the present study

The 2D PCA plots (Figures 9A and 9B) of both groups for IgVH

confirm that LPL has almost twice the discriminatory power of ZAP‐

mutational status indicate that the classification approaches a linearly

70. Several genes of this list have several probes involved and belong

separable behavior when the small‐scale signature is used because it

to the list of 24 genes differentially expressed in studies identifying

becomes very easy to predict the mutational status of a new incoming

and validating biomarkers of IgVH mutational status using the

sample by projecting it into the 2D PCA space shown in Figure 9B.

FERNÁNDEZ‐MARTÍNEZ

ET AL.

FIGURE 5 Correlation networks of the most differentially expressed genes (FC) for the NOTCH1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI

11 of 23

12 of 23

TABLE 6

Gene

FERNÁNDEZ‐MARTÍNEZ

SF3B1 mutational status prediction using FR μ1

σ1

μ2

σ2

Therefore, it can be concluded that the small‐scale signature gathers

FR (log)

Acc (%)

RPL32

859

228

513

115

2.6

94.0

KLF8

131

45

59

30

2.4

94.0

PDGFD

85

34

42

20

2.2

95.7

PLAGL1

171

87

336

118

2.2

94.0

KLF3

40

29

239

221

2.2

94.0

UQCC

27

7

41

7

2.1

94.9

HBA1

3650

2978

755

2218

2.1

96.6

CNPY2

206

73

317

70

2.1

97.4

TMC6

322

74

546

155

2.0

97.4

CSNK2B

71

37

141

38

2.0

97.4

PLAGL1

282

135

507

174

2.0

97.4

PIP5K1B

55

32

212

200

1.9

98.3

DGKG

44

16

115

70

1.9

97.4

12044

6627

2783

5082

1.9

98.3

PLAGL1

138

83

252

92

1.9

98.3

ZNF76

34

8

61

20

1.8

98.3

AMT

48

8

97

41

1.8

97.4

206

108

368

156

1.8

97.4

HBB

8359

5278

1777

3669

1.8

97.4

ACTR2

3113

266

3789

506

1.8

97.4

GLIPR1

115

107

359

261

1.7

97.4

MAST4

136

89

59

60

1.7

99.1

HBB

STK38

List of most discriminatory genes (22) to predict the SF3B1 mutation, ordered by decreasing FR with an accuracy of 99.1%. Class 1 corresponds to samples with mutated SF3B1, and class 2 corresponds to those with unmutated SF3B1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated SF3B1), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy. TABLE 7

ET AL.

important genetic features involved in the mutational status of the immoglobulin variable region heavy chain (IgVH). The PC network shows a main branch relating LPL and ZBTB20. ZBTB20 is a transcription factor that may be involved in hematopoiesis, oncogenesis and innate immune responses.44 ZBTB20 and LPL have been shown to predict survival in B‐cell CLL.45 Diseases associated with ZBTB20 include bone lymphoma. ZBTB20 expression is increased in hepatocellular carcinoma associated with poor prognosis.46 This network also shows connections between LPL and COBLL1, LDOC1, LOC100128252, KANK2, WSB2 and ANKRD57‐SEPT10. Some of these genes are known to have important roles in CLL and cancer. COBLL1 (Cordon‐Bleu Protein‐Like 1) is a gene related to actin binding. Actins participate in important processes such as muscle contraction, cell motility, cell division and cytokinesis, and cell signaling, etc. This gene is down‐regulated in CLL groups with a poor prognosis.47 LDOC1 (Leucine Zipper, Down‐Regulated in Cancer 1) has been proposed as a tumor suppressor gene whose protein product may have an important role in the development and/or progression of some cancers.48 It is considered to regulate the transcriptional responses by NF‐kappa B that play a key role in regulating the immune response to infection. It has been also shown that LDOC1 is differentially expressed in CLL and is a good predictor for overall survival in untreated patients.49 The NMI shows two main branches with TBC1D2B and SEPT10. SEPT10 has been associated with B‐cell CLL.50,51 GO annotations related to TBC1D2B include phospholipid binding. The Mutual information network appears to delineate the gene clusters better than the PC network. The analysis of the networks for the most differentially expressed genes provided by FC analysis (Figures 3A and 3B) shows the

SF3B1 mutational status prediction using FC

Gene ANXA4

μ1

σ1 42

μ2

σ2

19

430

524

fc (log)

Acc (%)

–3.3

87.2

IGHG1

63

37

599

1472

–3.3

88.0

ANXA4

41

17

333

408

–3.0

85.5

ANXA4

55

26

419

498

–2.9

87.2

ANXA4

47

17

335

408

–2.8

88.9

ANKRD36BP2

32

31

227

683

–2.8

90.6

TSPAN13

29

17

204

338

–2.8

86.3

ANXA4

52

24

364

460

–2.8

87.2

ANXA4

44

17

279

343

–2.7

88.9

MYBL1

41

36

261

406

–2.7

93.2

ANXA4

28

10

173

218

–2.6

91.5

KLF3

40

29

239

221

–2.6

92.3

NT5E

17

2

96

210

–2.5

92.3

GNB4

24

7

130

253

–2.4

90.6

TRPV3

198

525

38

67

2.4

92.3

ZNF608

16

2

82

166

–2.4

91.5

RBM20

26

25

132

212

–2.3

93.2

KLRK1 CNTNAP2 HBA1

53

63

266

610

–2.3

93.2

135

188

655

801

–2.3

91.5

3650

2978

755

2218

2.3

90.6 (Continues)

FERNÁNDEZ‐MARTÍNEZ

TABLE 7

Gene PPP1R9A HBB HTRA3 KLRK1 HBB TUBB6 CNTNAP2

13 of 23

ET AL.

(Continued)

μ1

σ1

μ2

σ2

fc (log)

Acc (%) 90.6

22

18

103

145

–2.2

8359

5278

1777

3669

2.2

87.2

202

409

45

55

2.2

88.9 88.9

51

48

227

512

–2.2

12044

6627

2783

5082

2.1

88.9

48

42

204

398

–2.1

88.9

189

266

789

1105

–2.1

88.0 88.0

TCTN1

42

32

172

438

–2.0

PTPRK

14

1

58

117

–2.0

88.9

HOMER3

59

31

239

336

–2.0

88.0

PIP5K1B

55

32

212

200

–2.0

89.7

39

43

148

227

–1.9

90.6

1224

2650

339

591

1.9

90.6

15100

6062

4201

5929

1.8

90.6

39

44

136

161

–1.8

88.9

166

259

48

56

1.8

88.9

21

9

71

99

–1.8

89.7

CD69

362

484

106

153

1.8

88.9

FCRL3

118

154

399

359

–1.8

89.7

PPP1R9A S100A9 HBB VASH1 CD69 PPP1R9A

43

39

145

212

–1.7

91.5

FOSB

659

1112

199

418

1.7

91.5

RASSF6

362

720

110

274

1.7

92.3

51

67

168

308

–1.7

93.2

MAP3K8

SDPR CYBB

91

112

295

433

–1.7

93.2

PSD3

235

211

74

175

1.7

93.2

ADM

153

206

48

84

1.7

93.2

CD72

148

94

471

297

–1.7

91.5

FOS

928

1556

292

513

1.7

91.5

25

8

80

123

–1.7

91.5

365

1057

1151

2084

–1.7

90.6

BCL7A

61

56

193

248

–1.7

92.3

GLIPR1

115

107

359

261

–1.6

93.2

96

160

296

533

–1.6

94.9

1105

1870

360

631

1.6

94.0

30

16

92

144

–1.6

93.2

RAB20 IGKC

SDPR FOS FCER1A CX3CR1

32

14

97

141

–1.6

92.3

GNG11

60

85

179

322

–1.6

94.0

GLIPR1

127

79

373

246

–1.6

94.0

FCRL3

1498

1626

4365

3100

–1.5

90.6

64

42

185

196

–1.5

90.6

FCRL3

1565

1755

4552

3186

–1.5

90.6

ADM

163

221

56

102

1.5

89.7

RPL31

803

501

279

140

1.5

95.7

TUBB1

176

297

505

870

–1.5

95.7

BCL7A

20

3

57

77

–1.5

94.9

IGKV1–5

367

1058

1047

2114

–1.5

94.0

HOMER3

17

2

49

80

–1.5

94.0

SLAMF1

PLAC8

559

386

1588

1047

–1.5

94.0

BTNL9

201

384

71

300

1.5

94.0

FCRL1

186

105

527

307

–1.5

94.0 (Continues)

14 of 23

TABLE 7

Gene CNR1

FERNÁNDEZ‐MARTÍNEZ

ET AL.

(Continued)

μ1

σ1

μ2

σ2

fc (log)

Acc (%)

301

362

106

236

1.5

94.0

CLEC4C

24

9

69

95

–1.5

94.9

FCGBP

51

26

145

209

–1.5

95.7

GLIPR1

189

117

532

361

–1.5

94.9

CYBB

28

19

80

108

–1.5

94.9

CLLU1

449

597

161

355

1.5

94.9

GEN1

77

31

215

374

–1.5

94.0

GAPT

64

38

177

139

–1.5

94.0

PPBP

624

1065

1730

2686

–1.5

94.9

RASSF6

124

258

45

100

1.5

94.9

NRIP1

23

17

63

60

–1.5

94.9 94.0

22

9

62

78

–1.5

355

777

130

399

1.5

93.2

SIGLEC6

37

47

100

164

–1.4

92.3

GLIPR1

207

120

554

354

–1.4

94.0

CNTNAP2 RGS13

SMAD3

51

33

136

198

–1.4

94.0

PRDM1

128

201

48

51

1.4

94.9

NRIP1

73

119

195

187

–1.4

94.0

RNGTT

981

825

370

627

1.4

94.0 93.2

PF4

85

114

226

319

–1.4

CYB5A

49

17

130

104

–1.4

93.2

FHDC1

52

43

137

148

–1.4

94.0

GLIPR1

168

135

444

292

–1.4

94.0

FCGR3A

71

82

187

268

–1.4

94.9

GLIPR1

346

186

907

595

–1.4

95.7

RNGTT

673

558

257

419

1.4

94.9

CYB5A

62

34

163

135

–1.4

94.9

TCF7

306

247

798

780

–1.4

95.7

GEN1

38

8

99

213

–1.4

95.7

RNGTT

972

755

374

622

1.4

95.7

ITGAX

196

217

510

356

–1.4

95.7

DGKG

44

16

115

70

–1.4

94.0

TCF7

100

54

259

267

–1.4

94.9

RASGRP1

177

137

455

315

–1.4

94.9

44

29

112

87

–1.4

94.0

PDK4

31

14

78

74

–1.3

92.3

KLF3

22

4

56

73

–1.3

92.3

SSBP2

KLF3

27

6

68

56

–1.3

92.3

PF4

107

163

272

399

–1.3

93.2 92.3

FGL2

138

152

349

424

–1.3

IPCEF1

101

100

40

34

1.3

95.7

FCRL1

775

460

1952

1051

–1.3

95.7

FCRL1

642

350

1610

883

–1.3

95.7 95.7

FRMD5

58

82

23

14

1.3

NSUN7

171

168

69

95

1.3

95.7

FCRL2

109

75

271

216

–1.3

95.7

HBM

41

31

103

708

–1.3

95.7

RGS1

871

1001

351

864

1.3

96.6

List of the most discriminatory genes (118) to predict the SF3B1 mutation ordered by decreasing absolute FC with an accuracy of 96.6%. Class 1 corresponds to samples with mutated SF3B1, and class 2 corresponds to those with unmutated SF3B1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated SF3B1), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.

FERNÁNDEZ‐MARTÍNEZ

15 of 23

ET AL.

FIGURE 6 Correlation network of the most discriminatory genes (FR) for the SF3B1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI

importance of the RGS13–SPG20 connection. RGS13 encodes a pro-

within the set of the 13 most discriminatory genes. The mutual infor-

tein of the RGS family and has been associated with mantle cell lym-

mation network also highlights the connection RGS13–LMNA–SPG20.

phoma. SPG20 encodes the protein called Spartin that appears to be

LMNA is a gene that in involved in many pathways, including apoptosis

related to endocytosis. This gene has been associated to ZAP70 and

and the caspase cascade.

LPL as a good prognostic biomarker in CLL survival.18 Hyper‐methyla-

GeneAnalytics software has shown that the most important path-

tion of SPG20 in early stage CLL has been positively associated with

ways involved are the inflammatory response and the PAK pathways.

progression‐free survival, supporting the idea that epigenetic changes

The main GO biological processes involved were related to aging and

have clinical impact in CLL.52 Two different probes of this gene are

vasoconstriction, and the main GO molecular functions were

16 of 23

FERNÁNDEZ‐MARTÍNEZ

ET AL.

FIGURE 7 Correlation networks of the most differentially expressed genes (FC) for the SF3B1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI

FERNÁNDEZ‐MARTÍNEZ

17 of 23

ET AL.

synthase binding. Finally, the main compounds that were identified target IGHG1 and IGKC. Finally, pathway analysis using the list of the first 300 most‐frequently sampled genes in the hold‐outs analysis demonstrated the importance of certain networks: B cell development pathways; the inflammatory response pathway; the gastric cancer network 2; the T cell co‐signaling pathway (ligand‐receptor interactions); and the Th17 differentiation pathway. Both analyses have provided common mechanisms.

4.2

|

NOTCH1

The NOTCH1 mutation was also predicted with very high accuracy (96%) using a set of 60 most discriminatory genes. MSI2 was the most important gene, with five probes affected by this mutation. This gene plays an important role in post‐transcriptional gene regulation and tumorigenesis. MSI1 and MSI2 are cooperatively involved in the proliferation and maintenance of central nervous system stem cell populations.53 Also, their expression levels in human myeloid leukemia directly correlate with decreased survival in patients, defining MSI2 as a new prognostic marker and a new target for therapy.54 The Musashi‐Numb pathway can control the differentiation of chronic myeloid leukemia cells. MSI2 expression is upregulated during human chronic myeloid leukemia progression and is also an early indicator of poorer prognosis.55 The analysis of most differentially expressed genes in this mutation has showed the importance of TFEC and CNTNAP2. TFEC (Transcription Factor EC) encodes a member of the microphthalmia (MiT) family. MiT transcription factors play an important role in multiple cellular processes, including survival, growth and differentiation. CNTNAP2 encodes a member of the neurexin family with functions in cell adhesion. This protein contains epidermal growth factor and laminin G domains. Annotations related to this gene include enzyme binding and receptor binding. This gene has been associated with genetic risk prediction for acute myeloid leukemia56 and is involved in the genomic abnormalities for this illness.57 The hold‐out analysis (Figure 8B) has shown a minimum predictive accuracy of the small‐scale signature of 82.8%, a median accuracy of 96.6% and a maximum accuracy of 100%. As in the previous case, the most frequently sampled genes in the hold‐out experiment coincide with expanded lists of most discriminatory genes found via FC and FR. This analysis has highlighted the importance of ST3GAL1, as FIGURE

8 Hold‐out stability analysis. Cumulative probability functions for the predictive accuracy of the small‐scale lists in the IgHV (A), NOTCH1 (B) and SF3B1 (C) mutations

involved in glycosphingolipids biosynthesis and associated with colorectal cancer and immune diseases (HIV‐1). The PC network of the most discriminatory genes (Figure 4A) shows three main connections: MSI2‐ACSL5, MSI2‐CNTNAP2 and

cholesterol binding, antigen binding, patched binding and nitric‐oxide

MSI2‐WSB2, whereas the NMI network (Figure 4B) shows the connections to NCK2, LPL, WSB2 and SPG20. Some of these genes are not

TABLE 8

Statistical analysis of the predictive accuracy of the minimum‐scale signature in these three mutations, obtained after 500 random 75–25 hold outs: minimum, maximum, mean, median, SD and interquartile range (IQR) Min

Max

Mean

Median

STD

IQR

IgVH

80.5

100

93.1

92.7

3.6

4.9

NOTCH1

82.8

100

94.6

96.6

3.6

3.5

SF3B1

89.7

100

98.3

2.2

3.5

known to play a role in CLL. The FC networks (Figures 5A and 5B) are simpler and, in both cases (PC and NMI), show one main connection between TFEC and NRIP1. NRIP1 encodes a nuclear protein that specifically interacts with the hormone‐dependent activation domain AF2 of nuclear receptors

100

and modulates transcriptional activity of the estrogen receptor. This gene has been shown to have predictive value in CLL together with LPL and ZAP‐70.18

18 of 23

TABLE 9

FERNÁNDEZ‐MARTÍNEZ

ET AL.

Hold‐out frequency analysis

IgVH Gene name

NOTCH1 Frequency

Gene name

SF3B1 Frequency

Gene name

Frequency

WSB2

2.79

ST3GAL1

0.18

HBB

0.33

LPL

2.79

LAG3

0.18

TMC6

0.33

LPL

2.79

MSI2

0.18

PIP5K1B

0.33

NRIP1

2.79

MSI2

0.18

UQCC

0.33

FCRL2

2.79

LTK

0.18

KLF3

0.33

PYHIN1

2.79

MSI2

0.18

MAST4

0.33

ZBTB20

2.79

CNTNAP2

0.18

FAM114A1

0.33

FCRL2

2.79

KIAA1715

0.18

KLF8

0.33

ZBTB20

2.79

MSI2

0.18

AMT

0.33

SH3PXD2A

2.51

MSI2

0.18

HBB

0.33

COBLL1

2.51

NCK2

0.18

HBA1

0.33

FCRL2

2.51

LPL

0.18

RPL32

0.33

LDOC1

2.23

ZNF333

0.18

CSNK2B

0.33

ZAP70

2.23

PI4K2A

0.18

ZNF76

0.33

PYHIN1

2.23

CNTNAP2

0.18

DGKG

0.33

USP6NL

2.23

DCLK2

0.18

ACTR2

0.33

COBLL1

2.23

LTK

0.17

DLG2

0.33

FCRL2

2.23

SRSF1

0.17

PLAGL1

0.33

FCRL2

2.23

C19orf28

0.17

YWHAB

0.33

FCRL2

2.23

HNRNPR

0.17

ANXA4

0.32

FCRL3

2.23

RMND5A

0.17

ZC3H12B

0.32

KLF12

1.96

C10orf137

0.17

CNTNAP2

0.31

FCRL2

1.96

AQR

0.17

PLAGL1

0.31

FCRL3

1.96

SRSF1

0.17

PDGFD

0.31

PYHIN1

1.96

PTDSS1

0.17

PDGFD

0.30

KLF12

1.68

PREPL

0.16

DGKG

0.30

DCLK2

1.68

GNPNAT1

0.16

PSD3

0.30

WSB2

1.68

EXD2

0.16

PLAGL1

0.29

FCRL2

1.68

C10orf137

0.16

VPS29

0.29

LOC100128252

1.68

FNIP1

0.16

LOC100287515

0.28

FCRL2

1.68

LTN1

0.16

HBB

0.28

SPG20

1.40

DTX1

0.16

FKBP3

0.28

MBOAT1

1.40

REEP5

0.16

HMGB2

0.28

LOC100128252

1.40

CCDC24

0.16

ATP2B1

0.28

TBC1D2B

1.12

ZNF544

0.16

FBXO38

0.28

ATOX1

1.12

FCRL2

0.16

RNF135

0.28

RPS4Y1

0.84

LDOC1

0.16

XPC

0.27

ADRBK2

0.84

C1orf38

0.16

JAZF1

0.27

DDX3Y

0.84

ICK

0.16

GLIPR1

0.27

List of the most sampled discriminatory genes for the IgVH, NOTCH1 and SF3B1 mutational status prediction, obtained after 500 random simulations.

Pathways analysis using GeneAnalytics has shown that the main

triglyceride biosynthetic processes and neural tube formation. The

discriminatory and differentially expressed genes are related to multi-

main GO molecular function is antigen binding and the compounds

ple cellular processes, including survival, growth and differentiation,

found also target the genes IGHG1 and IGKC. Other types of com-

apoptosis and host defense. The main super‐pathways were the crea-

pounds are retinoid (APOD, BMP6, NRIP1, PRKCA, RORA, THRB) and

tion of C4 and C2 activators, FCGR‐dependent phagocytosis and adi-

valine (FCGR3A, IGKC, LPL, PRKCA, THRB). Finally, pathway analysis

pogenesis. The main GO biological processes involve the regulation

using the list of the first 300 most‐frequently sampled genes in the

of the smoothened signaling pathway, the immune response, retina

hold‐out analysis has shown the importance of the certain networks:

homeostasis, positive regulation of cardiac muscle hypertrophy, the

the MRNA splicing – major pathway; gene expression; the notch sig-

Fc‐gamma receptor signaling pathway involved in phagocytosis,

naling pathway (KEGG); TCR signaling (REACTOME); and the MRNA

FERNÁNDEZ‐MARTÍNEZ

FIGURE 9

19 of 23

ET AL.

IgVH mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature

FIGURE 10

NOTCH1 mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature

FIGURE 11

SF3B1 mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature

surveillance pathway. The Notch signaling pathway occurs within

named ZAC1), and two different probes of HBB. PLAGL1 encodes a

these

is

C2H2 zinc finger protein with transactivation and DNA‐binding activ-

progressing in the right direction. Both analyses highlight the impor-

ities and has been shown to have anti‐proliferative properties as a

tance of different signaling pathways and the immune response.

tumor suppressor.58 HBB (hemoglobin beta) expression in CLL sam-

pathways,

confirming

that

bioinformatics

modeling

ples with SF3B1 mutation is almost six times greater than HBB

4.3

|

SF3B1

Finally, the SF3B1 mutation was predicted with almost 100% accu-

expression in the unmutated samples. The analysis of most differentially expressed genes in this mutation showed the importance of ANXA4, with seven different probes in this list. ANXA4 (Annexin IV)

racy using a list of the 22 most discriminatory genes, including

belongs to the annexin family of calcium‐dependent phospholipid

RPL32, KLF8 and PDGFD, three different probes of PLAGL1 (also

binding proteins.

20 of 23

FERNÁNDEZ‐MARTÍNEZ

ET AL.

FIGURE 12 Intersection among the most discriminatory genes of the IgVH, NOTCH1 and SF3B1 mutations. The three main mutations are represented with a rectangle and the most discriminatory genes are surrounded by ellipses. An edge represents that the gene appears as most discriminatory for a specific mutation. Genes with three edges (surrounded by a dot rectangle) are common to these three main mutations

The hold‐out analysis (Figure 8C) has shown a minimum predictive

lipoxins on super‐oxide production in neutrophils; the HIF‐1 alpha

accuracy of the small‐scale signature of 89.7%, a median accuracy of

transcription factor network; the Fc‐gamma receptor signaling path-

98.3% and a maximum accuracy of 100%. This analysis has highlighted

way; the PAK pathway; the immune response CCR signaling in eosin-

the importance of HBB that encodes the beta globin involved in oxy-

ophils; and the GPCR pathway. The main GO molecular function

gen transport.

involved is oxygen transporter activity and the main GO biological

The Pearson network of the most discriminatory genes (Figure 6A)

processes are oxygen transport, the innate immune response, the

shows two main networks of RPL32, with YWHAB and KLF8. YWHAB

Fc‐gamma receptor signaling pathway involved in phagocytosis, and

(Tyrosine 3‐Monooxygenase/Tryptophan 5‐Monooxygenase Activa-

signal transduction. Finally, the main compounds found were thymi-

tion Protein, Beta) encodes a protein that has been shown to interact

dine (ADM, CD69, CD72, FOS, HBB, NT5E, PF4, PPBP, RNGTT and

with RAF1 and CDC25 phosphatases, suggesting that it may play a role

SMAD3) and the drugs targeting the genes HBA1 and HBB

in linking mitogenic signaling and the cell cycle machinery. KLF8

concerning hemoglobin. Finally, pathway analysis using the list of

(Kruppel‐Like Factor 8) encodes a protein that is a member of the

the first 300 most‐frequently sampled genes in the hold‐out analysis

Sp/KLF family of transcription factors, and is considered to play an

has shown the importance of certain networks: WNT mediated acti-

59

important role in metastasis.

Diseases associated with KLF8 include

ovarian epithelial cancer and mental retardation.

vation of DVL; Epstein–Barr virus infection; actin nucleation by ARP‐ WASP complex; rheumatoid arthritis; and CD28 co‐stimulation.

The NMI correlation (Figure 6B) shows one main network relating

Most of the genes affected by the SF3B1 mutation are not known

RPL32 with CNPY2‐STK38. Related to this last gene are hemoglobin

to play a role in CLL. In these three classification problems, the FR

HBA1 and HBB. Also, the analysis of the networks for the most differ-

always provided the shortest list with the highest discriminatory

entially expressed genes (Figures 7A and 7B) shows the importance of

power, whereas FC gave longer lists of genes with lower predictive

the connections of ANXA4 with CYBB and FCRL3 (PC network) and one

accuracy in general terms. Therefore, we can conclude that the most

main connection with ADM that is involved in vasodilation, regulation

differentially expressed genes are not the most discriminatory as a

of hormone secretion, promotion of angiogenesis and antimicrobial

result of the high variability of these genes in the groups of patients

activity. The NMI network has a peculiar circular geometry where all

with unmutated NOTCH1 and SF3B1.

the genes relate to ADM. The most important pathways found using GeneAnalytics were

Regarding commonalities, the NOTCH1 and SF3B1 mutations share a longer list of high discriminatory genes than the IgVH mutation.

NFAT in the immune response; ERK signaling; the immune response

Furthermore, the relationship IgVH–NOTCH1 is stronger than the rela-

of DAP12; creation of C4 and C2 activators; inhibitory action of

tionship IgVH–SF3B1. Using an expanded list of high discriminatory

FERNÁNDEZ‐MARTÍNEZ

21 of 23

ET AL.

and differentially expressed genes, we have shown that these three mutations share only four genes (IGHG1, MYBL1, NRIP1 and RGS1) that are related to immune diseases, blood diseases, rare diseases and cancer. To better understand the functional and biological ramifications of the most discriminatory and common genes associated with the IgVH, NOTCH1 and SF3B1 mutations (IGHG1, MYBL1, NRIP1 and RGS13) (Figure 12), we evaluated the super‐pathway fit using GeneAnalytics software and identified the IL‐4‐mediated signaling events pathway as the one that was highly matched. This finding is particularly relevant because IL‐4 regulation of B‐cell receptor signaling has been identified as integral to CLL (in CLL, both B‐ and T‐cells highly express IL‐4).60–62 We then performed a secondary analysis to identify KEGG pathways using BinoX63 and noted thyroid hormone signaling, notch signaling, viral carcinogenesis and, interestingly, chronic myeloid leukemia. Clearly, the identification of notch signaling is consistent with the findings that we reported earlier in the present study, and viral carcinogenesis is not unexpected given the possible pathogenesis of CLL. We suspect that the other two pathways were noted because of common components, although definitive interpretation requires additional investigation. Finally, using GeneAnalytics, we have identified the main pathways, as well as GO molecular and biological functions, that are involved in each mutation, and also the compounds that are at disposal and the genes that could be targeted. These analyses suggest an important role of the immune response and antigen presentation in the three cases. This methodology could also be applied to analyze the effect of other mutations in CLL and to understand the genesis of other illnesses with a genetic background. The aim of this retrospective analysis was to provide a deeper understanding of the effects of the different mutations in CLL disease progression, with the aim of applying these findings clinically in the near future with respect to the development of new drugs. A future verification of these findings with other independent cohorts could lead to a better design of therapeutic targets.

The authors declare that there are no conflicts of interest. They have also read and accepted the journal's authorship agreement. We would like to thank different researchers from the Biochemistry Department (University of Oviedo, Spain) for having introduced us to this interesting problem. Enrique J. de Andrés‐Galiana salary has assisted

1. Rodríguez‐Vicente AE, Díaz MG, Hernández‐Rivas JM. Chronic lymphocytic leukemia: a clinical and molecular heterogeneous disease. Cancer Genet. 2013;206:49–62. 2. Rai KR, Sawitsky A, Cronkite EP, Chanana AD, Levy RN, Pasternack BS. Clinical staging of chronic lymphocytic leukemia. Blood. 1975;46:219– 234. 3. Binet JL, Auquier A, Dighiero G, et al. A new prognostic classification of chronic lymphocytic leukemia derived from a multivariate survival analysis. Cancer. 1981;48:198–206. 4. Rosenwald A, Alizadeh AA, Widhopf G, et al. Relation of gene expression phenotype to immunoglobulin mutation genotype in b cell chronic lymphocytic leukemia. J Exp Med. 2001;194:1639–1647. 5. Hamblin TJ, Davis Z, Gardiner A, Oscier DG, Stevenson FK. Unmutated ig v(h) genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood. 1999;94:1848–1854. 6. Crespo M, Bosch F, Villamor N, et al. Zap‐70 expression as a surrogate for immunoglobulin‐variable‐region mutations in chronic lymphocytic leukemia. N Engl J Med. 2003;348:1764–1775. 7. Wiestner A, Rosenwald A, Barry TS, et al. Zap‐70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood. 2003;101:4944–4951. 8. Orchard JA, Ibbotson RE, Davis Z, et al. Zap‐70 expression and prognosis in chronic lymphocytic leukemia. Lancet. 2004;363:105–111. 9. Kim SZ, Chow KU, Kukoc‐Zivojnov N, et al. Expression of zap‐70 protein correlates with disease stage in chronic lymphocytic leukemia and is associated with, but not generally restricted to, non‐mutated ig vh status. Leuk Lymphoma. 2004;45:2037–2045. 10. Smolej L, Saudkova L, Spacek M, Kozak T. Zap‐70 in b‐cell chronic lymphocytic leukemia: clinical significance and methods of detection. Vnitr Lek. 2006;52:1194–1199. 11. Mainou‐Fowler T, Dignum HM, Proctor SJ, Summerfield GP. The prognostic value of cd38 expression and its quantification in b cell chronic lymphocytic leukemia (b‐cell). Leuk Lymphoma. 2004;45:455–462. 12. Cruse JM, Lewis RE, Webb RN, Sanders CM, Suggs JL. Zap‐70 and cd38 as predictors of IgVH mutation in CLL. Exp Mol Pathol. 2007;83:459–461. 13. Shanafelt TD, Byrd JC, Call TG, Zent CS, Kay NE. Narrative review: initial management of newly diagnosed, early‐stage chronic lymphocytic leukemia. Ann Intern Med. 2006;145:435–447. 14. Klein U, Tu Y, Stolovitzky GA, et al. Gene expression profiling of b cell chronic lymphocytic leukemia reveals a homogeneous phenotype related to memory b cells. J Exp Med. 2001;194:1625–1638.

ACKNOWLEDGMENTS

been

RE FE RE NC ES

by

the

Spanish

Ministerio

de

Economía

y

Competitividad (grant TIN2011‐23558). No additional research funds were provided to perform this research. JLFM and EJAG prepared the data, designed the machine learning methodology, carried out the experiments, analyzed and interpreted the results and drafted the manuscript. SS revised the design of the methodology critically, analyzed and interpreted the results, and drafted the manuscript. All authors read and approved the final manuscript submitted for publication. There are no competing interests (political, personal, religious, ideological, academic, intellectual, commercial or any other) to declare in relation to this manuscript.

15. Ferrer A, Ollila J, Tobin G, et al. Different gene expression in immunoglobulin‐mutated and immunoglobulin‐ unmutated forms of chronic lymphocytic leukemia. Cancer Genet Cytogenet. 2004;153:69–72. 16. Haslinger C, Schweifer N, Stilgenbauer S, et al. Microarray gene expression profiling of b‐cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and vh mutation status. J Clin Oncol. 2004;22:3937–3949. 17. Vasconcelos Y, De Vos J, Vallat L, et al. Gene expression profiling of chronic lymphocytic leukemia can discriminate cases with stable disease and mutated ig genes from those with progressive disease and unmutated ig genes. Leukemia. 2005;19:2002–2005. 18. Van't Veer MB, Brooijmans AM, Langerak AW, et al. The predictive value of lipoprotein lipase for survival in chronic lymphocytic leukemia. Haematologica. 2006;91:56–63. 19. Nuckel H, Hüttmann A, Klein‐Hitpass L, et al. Lipoprotein lipase expression is a novel prognostic factor in b‐cell chronic lymphocytic leukemia. Leuk Lymphoma. 2006;47:1053–1061. 20. Hartman ML, Kilianska ZM. Lipoprotein lipase: a new prognostic factor in chronic lymphocytic leukaemia. Contemp Oncol (Pozn). 2012;16:474–479.

22 of 23

FERNÁNDEZ‐MARTÍNEZ

ET AL.

21. Oppezzo P, Vasconcelos Y, Settegrana C, et al. The LPL/ADAM29 expression ratio is a novel prognosis indicator in chronic lymphocytic leukemia. Blood. 2005;106:650–657.

42. Kaderi MA, Kanduri M, Buhl AM, et al. Lpl is the strongest prognostic factor in a comparative analysis of rna‐based markers in early chronic lymphocytic leukemia. Haematologica. 2011;96:1153–1160.

22. Döhner H, Stilgenbauer S, Benner A, et al. Genomic aberrations and survival in chronic lymphocytic leukemia. N Engl J Med. 2000;343:1910–1916.

43. Abruzzo LV, Barron LL, Anderson K, et al. Identification and validation of biomarkers of IGV(h) mutation status in chronic lymphocytic leukemia using microfluidics quantitative real‐time polymerase chain reaction technology. J Mol Diagn. 2007;9:546–555.

23. Puente XS, Pinyol M, Quesada V, et al. Whole‐genome sequencing identifies recurrent mutations in chronic lymphocytic leukemia. Nature. 2011;475:101–105. 24. Fabbri G, Rasi S, Rossi D, et al. Analysis of the chronic lymphocytic leukemia coding genome: role of notch1 mutational activation. J Exp Med. 2011;208:1389–1401. 25. Quesada V, Conde L, Villamor N, et al. Exome sequencing identifies recurrent mutations of the splicing factor sf3b1 gene in chronic lymphocytic leukemia. Nat Genet. 2012;44:47–52. 26. Wang L, Lawrence MS, Wan Y, et al. Sf3b1 and other novel cancer genes in chronic lymphocytic leukemia. N Engl J Med. 2011;365:2497–2506. 27. Oscier DG, Rose‐Zerilli MJ, Winkelmann N, et al. The clinical significance of notch1 and sf3b1 mutations in the uk lrf cll4 trial. Blood. 2013;121:468–475. 28. Ramsay AJ, Quesada V, Foronda M, et al. Pot1 mutations cause telomere dysfunction in chronic lymphocytic leukemia. Nat Genet. 2013;45:526–530. 29. Ghia P, Guida G, Stella S, et al. The pattern of cd38 expression defines a distinct subset of chronic lymphocytic leukemia (cll) patients at risk of disease progression. Blood. 2003;101:1262–1269. 30. Rush LJ, Raval A, Funchain P, et al. Epigenetic profiling in chronic lymphocytic leukemia reveals novel methylation targets. Cancer Res. 2004;64:2424–2433. 31. Ferreira PG, Jares P, Rico D, et al. Transcriptome characterization by RNA sequencing identifies a major molecular and clinical subdivision in chronic lymphocytic leukemia. Genome Res. 2014;24:212–226. 32. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7:179–188. 33. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116–5121.

44. Liu X, Zhang P, Bao Y, et al. Zinc finger protein ZBTB20 promotes Toll‐ like receptor‐triggered innate immune responses by repressing IκBα gene transcription. Proc Natl Acad Sci U S A. 2013;110:11097–11102. 45. Nikitin EA, Malakho SG, Biderman BV, et al. Expression level of lipoprotein lipase and dystrophin genes predict survival in B‐cell chronic lymphocytic leukemia. Leuk Lymphoma. 2007;48:912–922. 46. Wang Q, Tan YX, Ren YB, et al. Zinc finger protein ZBTB20 expression is increased in hepatocellular carcinoma and associated with poor prognosis. BMC Cancer. 2011;11:271 47. Ronchetti D, Mosca L, Cutrona G, et al. Small nucleolar RNAs as new biomarkers in chronic lymphocytic leukemia. BMC Med Genomics. 2013;6:27 48. Nagasaki K, Manabe T, Hanzawa H, Maass N, Tsukada T, Yamaguchi K. Identification of a novel gene, LDOC1, down‐regulated in cancer cell lines. Cancer Lett. 1999;140:227–234. 49. Duzkale H, Schweighofer CD, Coombes KR, et al. LDOC1 mRNA is differentially expressed in chronic lymphocytic leukemia and predicts overall survival in untreated patients. Blood. 2011;117:4076–4084. 50. Benedetti D, Bomben R, Dal‐Bo M, et al. Are surrogates of IgVH gene mutational status useful in b‐cell chronic lymphocytic leukemia? The example of septin‐10. Leukemia. 2008;22:224–226. 51. Travella A, Ripollés L, Aventin A, et al. Structural alterations in chronic lymphocytic leukaemia. Cytogenetic and fish analysis. Hematol Oncol. 2013;31:79–87. 52. Ronchetti D, Tuana G, Rinaldi A, et al. Distinct patterns of global promoter methylation in early stage chronic lymphocytic leukemia. Genes Chromosomes Cancer. 2014;53:264–273. 53. Sakakibara S, Nakamura Y, Yoshida T, et al. RNA‐binding protein Musashi family: roles for CNS stem cells and a subpopulation of ependymal cells revealed by targeted disruption and antisense ablation. Proc Natl Acad Sci U S A. 2002;99:15194–15199. 54. Kharas MG, Lengner CJ, Al‐Shahrour F, et al. Musashi‐2 regulates normal hematopoiesis and promotes aggressive myeloid leukemia. Nat Med. 2010;16:903–908.

34. Saligan L, Fernández‐Martínez JL, deAndrés‐Galiana EJ, Sonis S. Supervised classification by filter methods and recursive feature elimination predicts risk of radiotherapy‐related fatigue in patients with prostate cancer. Cancer Inform. 2014;13:141–152.

55. Ito T, Kwon HY, Zimdahl B, et al. Regulation of myeloid leukaemia by the cell‐fate determinant Musashi. Nature. 2010;466:765–768.

35. deAndrés‐Galiana EJ, Fernández‐Martínez JL, Sonis ST. Design of biomedical robots for phenotype prediction problems. J Comput Biol. 2016;23:678–692.

56. Heo SG, Hong EP, Park JW. Genetic risk prediction for normal‐karyotype acute myeloid leukemia using whole‐exome sequencing. Genomics Inform. 2013;11:46–51.

36. Witten IH, Eiben F, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Burlington, MA: Morgan Kaufmann; 2011.

57. De Weer A, Poppe B, Vergult S, et al. Identification of two critically deleted regions within chromosome segment 7q35‐q36 in EVI1 deregulated myeloid leukemia cell lines. PLoS One. 2010;5:e8676

37. Peng HC, Long F, Ding C. Feature selection based on mutual information: criteria of max‐dependency, max‐relevance, and min‐ redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27:1226–1238.

58. Varrault A, Ciani E, Apiou F, et al. hZac encodes a zinc finger protein with antiproliferative properties and maps to a chromosomal region frequently lost in cancer. Proc Natl Acad Sci U S A. 1998;95:8835–8840.

38. Stelzer G, Inger A, Olender T, et al. GeneDecks: paralog hunting and gene‐set distillation with GeneCards annotation. OMICS. 2009;13: 477–487.

59. Wan W, Zhu J, Sun X, Tang W. Small interfering RNA targeting Krüppel‐like factor 8 inhibits U251 glioblastoma cell growth by inducing apoptosis. Mol Med Rep. 2012;5:347–350.

39. Rossi D, Rasi S, Fabbri G, et al. Mutations of notch1 are an independent predictor of survival in chronic lymphocytic leukemia. Blood. 2012;119:521–529.

60. Douglas RS, Capocasale RJ, Lamb RJ, Nowell PC, Moore JS. Chronic leukemia B cells are resistant to the apoptotic effects of transforming growth factor‐beta. Blood. 1997;89:941–947.

40. Willander K, Dutta RK, Ungerbäck J, et al. Notch1 mutations influence survival in chronic lymphocytic leukemia patients. BMC Cancer. 2013;13:274

61. Rossman ED, Lewin N, Jeddi‐Tehrani M, Osterborg A, Mellstedt H. T‐cell cytokines in patients with B cell chronic lymphocytic leukemia. Eur J Hematol. 2002;68:299–306.

41. Wan Y, Wu CJ. SF3B1 mutations in chronic lymphocytic leukemia. Blood. 2013;121:4627–4634.

62. Guo B, Zhang L, Chiorazzi N, Rothstein TL. IL‐4 rescues surface IgM expression in chronic lymphocytic leukemia. Blood. 2016;128:553–562.

FERNÁNDEZ‐MARTÍNEZ

23 of 23

ET AL.

63. Ogris C, Guaia D, Helleday T, Sonnhammer EL. A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation. Nucleic Acids Res. 2016. doi: 10.1093/nargl/gkw849

SUPPORTING I NFORMATION Additional Supporting Information may be found online in the supporting information tab for this article.

How to cite this article: Fernández‐Martínez JL, deAndrés‐ Galiana EJ, Sonis ST. Genomic data integration in chronic lymphocytic leukemia. J Gene Med. 2017;19:e2936. https://doi. org/10.1002/jgm.2936