International Journal of Computers and Applications (IJCA), Vol. 19, No. 2, June 2012

Sparse Representation Based Clustering for Integrated Analysis of Gene Copy Number Variation and Gene Expression Data

Hongbao Cao1, Junbo Duan1, Dongdong Lin1 and Yu-Ping Wang1,2

1 Department of Biomedical Engineering, Tulane University, New Orleans, USA
2 Department of Biostatistics, Tulane University, New Orleans, USA

Abstract Integrated analysis of multiple types of genomic data has received increasing attention in recent years, due to the rapid development of new genetic techniques and the strong demand for improving their reliability. In this work, we propose a sparse representation based clustering (SRC) method for the joint analysis of gene expression and copy number data, with the purpose of selecting significant genes/variables for the identification of genes susceptible to a disease. Different from traditional gene selection methods, the proposed SRC model employs information from multiple features, clusters the data into multiple groups, and then selects significant genes/variables in a particular group. By using joint features extracted from both types of data, the proposed SRC method provides an efficient approach to integrating different types of genomic measurements for comprehensive analysis. Our method has been tested on both breast cancer cell lines and breast tumor data. In addition, simulated data sets were used to test the robustness of the method to several factors such as noise, data size, and data type. Experiments showed that our proposed method can effectively identify genes/variables with interesting characteristics, e.g., genes/variables with large variations across all genes, and genes/variables that are statistically significant in both measurements with strong correlations. The proposed method is applicable to a wide variety of biological problems where joint analysis of biological measurements is a common challenge.

Index Terms—Sparse representations, clustering, gene expression, gene copy number variation, gene selection

I. Introduction

In genomic data analysis, one of the crucial issues is to identify genes/variables of specific interest out of the vast number of variables generated from these measurements [1]-[12]. Some genes are related to the diagnosis task, but many are presumably irrelevant [1]. During the past few years, various clustering techniques have been developed to identify subsets of genes/variables with specific characteristics for diagnosis or classification [1]-[7]. These gene/variable selection methods employ several novel statistical techniques. For example, Yang et al. [1] used a forward sequential feature selection (FSFS) method, which employs the Mahalanobis distance as the measure for gene selection, to remove irrelevant single nucleotide polymorphism (SNP) data. Soneson et al. used Canonical Correlation Analysis (CCA) for jointly analyzing gene expression and copy number alterations [2], measuring the correlations between gene expression and copy number changes. In the combined analysis of gene expression and copy number data, Berger et al. developed a generalized singular value decomposition (GSVD) to locate genes with both high variation across genes and high correlation across samples between the two types of data [4], performing gene selection with variation analysis. However, using a single characteristic of the data may lead to biased conclusions. So far, there have been few reports on employing multiple characteristics for the selection of genes/variables in the analysis of genetic data.

In this work, we propose a sparse representation based clustering (SRC) method for feature selection. Fig. 1 gives the diagram of the proposed SRC model. The output of ‘Feature extraction’ in the model can be any features of the data: the output of a complex analysis as discussed above (e.g., FSFS, CCA, GSVD), simple statistical features such as the mean and variance, or the original data without any transformation. Y = [y_1, ..., y_N] consists of the feature vectors extracted from the ‘Data’, where y_i is the feature vector extracted for the i-th gene/variable, m is the number of features to be employed, and N is the number of variables. A is the characteristic matrix containing information of the k groups, where each group contains n samples. The ‘SRC clustering’ is to cluster each y_i according to the information contained in the characteristic matrix A, to reach the goal of gene/variable separation and selection.


Fig. 1 Diagram of the SRC model for data analysis using multiple features

The ‘SRC clustering’ algorithm is based on solving the L1-minimization problem (P1):

(P1)  min ||x||_1  subject to  A x = y.    (1)

In the linear system given by Eq. (1), A is an m-by-k matrix, x is a k-vector, and y is an m-vector. When y belongs to one of the groups, the optimum solution x of the L1-minimization problem (P1) is a sparse solution, with only a few non-zero entries. Based on the solution of this linear sparse system, Wright et al. proposed an SRC method for face recognition, which showed advantages in clustering accuracy and resistance to noise [13]. In our earlier work, we applied the SRC method to M-FISH image classification and showed improved classification accuracy [14]. The SRC algorithm developed by Wright et al. has been applied successfully to face recognition [13], where face recognition is formulated as a supervised problem with the characteristic matrix A trained using the training samples [13]. That approach cannot be applied to our problem of feature selection. In our work, we design the characteristic matrix A with column vectors a_j that designate the different classes the data belong to. The vector angle between two column vectors a_i and a_j is given by theta_ij = arccos(<a_i, a_j> / (||a_i||_2 ||a_j||_2)), with i, j = 1, ..., k. We develop an improved SRC clustering algorithm so that the clustering of features is based on the similarity to the class vector (i.e., the column vector of A) in terms of both direction and amplitude (i.e., the L2 norm of the vector).

The paper is organized as follows. After introducing the SRC algorithm, we examine its robustness to noise, to different numbers of variables, and to different data types using simulated data. Then we test our method on the breast cancer cell line data [15] and breast tumor data [16]. Results show that our proposed method can effectively locate genes with significant variance in both gene expression and copy number data.

II. Method

In this section, we first present the improved SRC algorithm (Section 2.1). Then we describe the design of the characteristic matrix A used in the SRC algorithm (Section 2.2). In particular, we give the Homotopy method used to solve the L1-norm minimization problem in the SRC algorithm (Section 2.3). Finally, we apply the SRC algorithm to gene/variable selection (Section 2.4).


2.1 Improved SRC algorithm

The original SRC algorithm has been successfully applied to face recognition [13], where it demonstrated advantages in noise resistance and clustering accuracy. We improve the original SRC algorithm so that the clustering employs both vector angles and vector amplitudes in the analysis. The improved SRC algorithm is given as follows:

Algorithm 1: Improved sparse representation based clustering (SRC) algorithm:
1. Inputs: characteristic matrix A = [a_1, ..., a_k] with vectors of the k different groups; and samples Y = [y_1, ..., y_N];
2. Normalize the rows of Y to the range [0, 1];
3. Add the vector length into each feature vector: y_i <- [y_i; ||y_i||_2];
4. Inverse normalize the last row of the columns of Y to the range [0, 1];
5. Normalize the columns of Y to have unit L2-norm;
6. Solve the L1-norm minimization problem (P1) defined by Equation (1), with A and y_i as input;
7. Calculate the vector angles theta_j between y_i and A delta_j(x), j = 1, ..., k;
8. Output: identity(y_i) = arg min_j theta_j.

In the algorithm described above, ‘Normalize’ is to linearly project the maximum to 1 and the minimum to 0, while ‘Inverse normalize’ is to linearly project the maximum to 0 and the minimum to 1; delta_j(x) is a function that maps x to a sparse vector with only a few non-zero entries, those corresponding to the vectors from the j-th group, and A delta_j(x) is the approximation of y with the sparse coefficients of the j-th group, j = 1, ..., k. The algorithm given by Algorithm 1 clusters feature vectors according to the distance to a vector in a particular class represented by a sparse vector. The main difference between our proposed algorithm and the one used for classification is the design of the characteristic matrix A, which will be described below.
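The classification steps of Algorithm 1 can be sketched in code. The following is a minimal illustration, not the authors' implementation: it covers steps 6–8 only (the normalization steps 2–5 are omitted), and it solves (P1) with a generic linear-programming reformulation, assuming SciPy is available.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """Solve (P1): min ||x||_1 subject to Ax = y, as a linear program.
    Variables z = [x; u] with -u <= x <= u; minimize sum(u)."""
    m, k = A.shape
    c = np.concatenate([np.zeros(k), np.ones(k)])
    I = np.eye(k)
    A_ub = np.block([[I, -I], [-I, -I]])   # x - u <= 0 and -x - u <= 0
    b_ub = np.zeros(2 * k)
    A_eq = np.hstack([A, np.zeros((m, k))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * k + [(0, None)] * k)
    return res.x[:k]

def src_classify(A, y):
    """Assign y to the group whose one-group approximation A*delta_j(x)
    has the smallest vector angle to y (steps 6-8 of Algorithm 1)."""
    x = l1_min(A, y)
    angles = []
    for j in range(A.shape[1]):
        r = x[j] * A[:, j]              # approximation using group j only
        denom = np.linalg.norm(y) * np.linalg.norm(r)
        if denom < 1e-12:
            angles.append(np.pi)        # group contributes nothing
        else:
            cos = np.clip(y @ r / denom, -1.0, 1.0)
            angles.append(np.arccos(cos))
    return int(np.argmin(angles))

# Toy characteristic matrix: three unit-norm class vectors in R^2.
A = np.array([[1.0, 0.0, 0.7071],
              [0.0, 1.0, 0.7071]])
print(src_classify(A, np.array([1.0, 0.0])))  # group 0
```

A sample close in direction and amplitude to one class column receives a sparse code concentrated on that column and is assigned to that group.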

2.2 Design of characteristic matrix A

If m features are used for clustering, there will be k possible groups. We label each group with a column vector a_j, j = 1, ..., k, with each entry of a_j being given a binary value, designating different combinations of 1 and 0. Then we design the characteristic matrix of the j-th group as A_j, and A = [A_1, ..., A_k]. The relation between a_j and the j-th group is given by Eq. (2), which guarantees that each column vector corresponds to its group only.

(2)

To guarantee that genes/variables from the j-th group can be represented by their characteristic matrix A_j, the condition of Eq. (3) is required to hold.

(3)
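As a concrete illustration of this design, the sketch below builds one candidate characteristic matrix in which each group is labeled by a distinct binary pattern over the m features, scaled to unit L2 norm as in step 5 of Algorithm 1. This construction is an assumption for illustration; the paper's Eq. (2)–(3) constraints are not reproduced here.

```python
import itertools
import numpy as np

def characteristic_matrix(m):
    """Build A whose columns are the non-zero binary patterns of length m,
    one column per group, scaled to unit L2 norm."""
    patterns = [p for p in itertools.product([0, 1], repeat=m) if any(p)]
    A = np.array(patterns, dtype=float).T          # shape (m, 2**m - 1)
    A /= np.linalg.norm(A, axis=0, keepdims=True)  # unit-norm columns
    return A

A = characteristic_matrix(3)
print(A.shape)  # (3, 7)
```

With m binary-labeled features this yields 2^m - 1 distinct group columns; any subset of them can serve as the groups of interest.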


In addition to the requirements mentioned above, a valid y for the sparse representation based classifier should have a sparse representation whose non-zero entries concentrate mostly on one group, while an invalid vector has sparse coefficients spread evenly over all groups. To quantify this observation, we use the Sparsity Concentration Index (SCI) proposed in [13] to measure how concentrated the feature vectors are on a single group in the data:

SCI(x) = (k * max_j ||delta_j(x)||_1 / ||x||_1 - 1) / (k - 1),    (4)

where k is the number of groups. For a solution x found by the SRC algorithm, if SCI(x) = 1, the feature vector y is represented using only vectors from a particular group; if SCI(x) = 0, the sparse coefficients are spread evenly over all groups. We choose a threshold tau and accept a feature vector as valid if SCI(x) >= tau; otherwise, reject it as invalid.
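The SCI of Eq. (4) is computed directly from the per-group coefficients of the solution. A small sketch, where each group is assumed to own a contiguous slice of the coefficient vector:

```python
import numpy as np

def sci(x, group_slices):
    """Sparsity Concentration Index of Eq. (4):
    SCI(x) = (k * max_j ||delta_j(x)||_1 / ||x||_1 - 1) / (k - 1)."""
    k = len(group_slices)
    total = np.abs(x).sum()
    max_frac = max(np.abs(x[s]).sum() for s in group_slices) / total
    return (k * max_frac - 1.0) / (k - 1.0)

groups = [slice(0, 2), slice(2, 4), slice(4, 6)]
print(sci(np.array([1.0, 0.5, 0, 0, 0, 0]), groups))  # concentrated on one group
print(sci(np.array([0.5] * 6), groups))               # evenly spread
```

The first call returns 1 (all mass on group 0) and the second returns 0 (mass spread evenly), matching the two extremes described above.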

2.3 Homotopy algorithm for solving (P1)

In our proposed SRC algorithm, it is essential to find the optimum solution of the sparse representation problem. In this work, we employed the Homotopy method [17] to solve the L1-minimization problem (P1) given by Eq. (1). The Homotopy method was originally proposed by Osborne et al. for solving the noisy over-determined L1-penalized least squares problem [18].


Donoho et al. [17] applied it to solve the noiseless underdetermined L1-minimization problem, and showed that Homotopy runs much more rapidly than general-purpose linear programming (LP) solvers when sufficient sparsity is present [17].

For the L1-minimization problem (P1) given by Eq. (1), it is convenient to consider the unconstrained optimization problem instead:

(P2)  min_x f_lambda(x) = (1/2) ||y - A x||_2^2 + lambda ||x||_1,    (5)

where lambda is a non-negative coefficient. The Homotopy method tries to find a solution pathway, which starts with a large lambda, and terminates when lambda = 0 and the solutions x_lambda of (P2) converge to the solution of (P1). Let f_lambda(x) denote the objective function of (P2). By convex analysis, a necessary condition for x_lambda to be a minimizer of f_lambda(x) is that 0 is in the subdifferential of f_lambda at x_lambda, i.e., the zero vector is an element of the set of subgradients. We calculate

d f_lambda(x) = -A^T (y - A x) + lambda * d||x||_1,    (6)

where d||x||_1 is the subgradient:

d||x||_1 = { u : u_i = sgn(x_i) if x_i != 0; u_i in [-1, 1] if x_i = 0 }.    (7)


Let I = {i : x_lambda(i) != 0} denote the support of x_lambda, and call c = A^T (y - A x_lambda) the vector of residual correlations. Then the condition on the gradient expressed in (6) being zero can be written equivalently as the following two conditions:

c(I) = lambda * sgn(x_lambda(I)),    (8)

and

|c(i)| <= lambda for all i not in I.    (9)

In other words, residual correlations on the support of x_lambda must all have magnitude equal to lambda, and signs that match the corresponding elements of x_lambda, while residual correlations off the support must have magnitude less than or equal to lambda. The Homotopy algorithm now follows from these two conditions, by tracing the optimal path x_lambda that maintains (8) and (9) for all lambda. The key to its successful implementation is that the path x_lambda is a piecewise linear path, with a discrete number of vertices.

Homotopy algorithm:
1) Initialize x_0 = 0, the active set I = empty, and c_0 = A^T y.
2) For the l-th stage (l = 1, 2, ...), compute an update direction d_l by solving

A_I^T A_I d_l(I) = sgn(c_{l-1}(I)),    (10)

with d_l set to zero in coordinates not in I, where c_{l-1} is the vector of residual correlations:

c_{l-1} = A^T (y - A x_{l-1}).    (11)

3) Calculate the step size at which a new element would enter the active set,

gamma_l^+ = min over i not in I of { (lambda - c_{l-1}(i)) / (1 - a_i^T A_I d_l(I)), (lambda + c_{l-1}(i)) / (1 + a_i^T A_I d_l(I)) },    (12)

where lambda = max_i |c_{l-1}(i)|, and the minimum is taken only over positive arguments. Record the corresponding index as i^+.
4) Calculate the step size at which an element of the active set would cross zero,

gamma_l^- = min over i in I of { -x_{l-1}(i) / d_l(i) },    (13)

Again the minimum is taken only over positive arguments. Record the corresponding index as i^-.
5) Calculate the step size of the l-th stage,

gamma_l = min(gamma_l^+, gamma_l^-),    (14)

6) Update the active set I by either appending i^+ (if gamma_l = gamma_l^+) or removing i^- from I (if gamma_l = gamma_l^-).
7) Update the solution:

x_l = x_{l-1} + gamma_l d_l.    (15)

8) If c_l = 0, terminate and x_l is the solution of (P1); otherwise, go back to step 2).
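Implementing the full homotopy path above takes some care. As a cross-check, (P2) for a fixed small lambda can also be solved by simple proximal-gradient iteration (ISTA), whose answer the homotopy path should agree with as lambda approaches 0. This is a hedged alternative sketch, not the paper's solver:

```python
import numpy as np

def ista(A, y, lam=1e-3, n_iter=5000):
    """Solve (P2): min 0.5*||y - Ax||_2^2 + lam*||x||_1 by iterative
    soft-thresholding; for small lam this approximates (P1)."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2       # step size 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x + t * A.T @ (y - A @ x)          # gradient step on the quadratic
        x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft threshold
    return x

A = np.array([[1.0, 0.0, 0.7071],
              [0.0, 1.0, 0.7071]])
x = ista(A, np.array([1.0, 0.0]))
print(np.round(x, 2))  # close to the sparse solution [1, 0, 0]
```

ISTA is slower than homotopy when the solution is very sparse, which is the regime where Donoho et al. [17] show homotopy outperforms generic solvers.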

2.4 SRC based gene/variable shaving

Once the characteristic matrix A for the SRC algorithm is set, we can use the proposed SRC algorithm given by Algorithm 1 to perform the clustering. Fig. 2 gives an illustration of the gene/variable shaving process using the SRC based gene shaving method. As shown in Fig. 2, all genes/variables are first grouped into different clusters (Fig. 2 (a)). Since each group is labeled with a column vector of a particular statistical significance, the genes/variables that fall into the group(s) of that significance can be selected for further analysis (Fig. 2 (b)), while the others are shaved off. The process continues until the remaining genes/variables meet the requirement.


Fig. 2 Diagram of gene shaving by SRC: (a) all genes are clustered into different clusters; and (b) only clusters of a particular significance are selected for analysis.
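The shaving loop can be phrased generically: cluster, keep the significant cluster(s), repeat. The sketch below is an illustration under simplified assumptions that are ours, not the paper's: genes are "clustered" by thresholding each feature into a binary pattern, a pattern is deemed significant by a user-supplied predicate, and each pass raises the threshold so shaving continues.

```python
import numpy as np

def shave(Y, is_significant, max_genes):
    """Iteratively cluster genes by their thresholded feature pattern and
    keep only significant clusters, until at most max_genes remain.
    Y: (n_genes, m) features in [0, 1]; is_significant(pattern) -> bool."""
    keep = np.arange(Y.shape[0])
    for thr in (0.5, 0.6, 0.7, 0.8, 0.9):
        if len(keep) <= max_genes:
            break
        patterns = (Y[keep] > thr).astype(int)   # one cluster per pattern
        hits = np.array([is_significant(tuple(p)) for p in patterns])
        if hits.any():
            keep = keep[hits]                    # shave off the rest
    return keep

# Toy features: two strong genes, four moderate, four uninteresting.
Y = np.array([[0.95] * 3] * 2 + [[0.65] * 3] * 4 + [[0.2] * 3] * 4)
is_sig = lambda p: all(p)                        # want every feature high
print(shave(Y, is_sig, 2))  # [0 1]
```

Each pass is more selective, so the surviving list shrinks toward the requested size, mirroring the iteration of Fig. 2.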

III. Results

The SRC based gene shaving method proposed in this work was tested with three data sets. Simulated data were first used to verify the robustness of the method to noise, to different data types, and to different data sizes. Then the breast cancer cell lines data [15] were tested to verify that the method is able to locate genes/variables with interesting characteristics; results from separate and joint analysis were compared to demonstrate the advantage of joint analysis. Finally, the breast cancer tumor data [16] were studied; in the analysis of this data set, known or candidate oncogenes from [4] were examined.

3.1 Test on simulation data


In order to evaluate the robustness of the method, the gene list percentage similarity (PS) was computed by counting the number of genes obtained from the noisy data intersecting with those obtained from the original data [4]:

PS = (number of intersecting genes) / N_g * 100%,    (16)

where N_g is the total number of genes in the list. Fig. 3 gives the PS of the SRC based gene shaving method on simulated data sets with different noise levels (NL), different data types, and different data sizes, respectively. In this work, we define the noise level as the ratio of the variance of the noise to the variance of the signal:

NL = Var(noise) / Var(signal),    (17)

where Var(*) is the variance and the noise is assumed to be Gaussian white noise.
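Both quantities are straightforward to compute. A sketch of Eq. (16)–(17), including one way to generate noise at a prescribed level (the generation scheme is our assumption):

```python
import numpy as np

def percentage_similarity(genes_noisy, genes_orig):
    """Eq. (16): percentage of the gene list shared by the two analyses."""
    n_g = len(genes_orig)
    return 100.0 * len(set(genes_noisy) & set(genes_orig)) / n_g

def add_noise(signal, nl, rng):
    """Add Gaussian white noise at noise level NL = Var(noise)/Var(signal)."""
    sigma = np.sqrt(nl * np.var(signal))
    return signal + rng.normal(0.0, sigma, size=signal.shape)

rng = np.random.default_rng(0)
signal = rng.normal(size=100000)
noisy = add_noise(signal, 0.5, rng)
print(percentage_similarity(['TP53', 'ERBB2'], ['ERBB2', 'BRCA1']))  # 50.0
print(round(np.var(noisy - signal) / np.var(signal), 2))             # ~0.5
```

Re-running the gene selection on `noisy` versus `signal` and comparing the two lists with `percentage_similarity` reproduces the robustness measurement of Fig. 3.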

Four sets of data were simulated by adding noise of different variances to the real breast cancer cell lines data [15] and breast tumor data [16], with 1000 and 2000 samples respectively. In Fig. 3 (a), the plots are the results of the joint analysis, while in Fig. 3 (b) they are the results of analyzing the gene expression data alone. As shown in Fig. 3, for both types of data, no matter how many genes/variables there are (solid lines vs. dashed lines), or which analysis method (joint/separate) is used, the PS values are similar for the same type of data under the same level of noise, which demonstrates the robustness of the model.

Fig. 3 PS of the SRC based gene shaving method on simulated data sets with different noise levels: (a) is the result of using joint analysis; and (b) is the result of analyzing gene expression data only.

3.2 Breast cancer cell lines data study

In this section, we analyzed breast cancer cell lines data [15], which contain 14 cell lines (BT-20, BT-474, HCC-1428, Hs578T, MCF7, MDA-361, MDA-436, MDA-453, MDA-468, SKBR-3, T47D, UACC-812, ZR-75-1, and ZR-75-30) with 11994 genes. The data include both copy number and gene expression measurements, and the expression and copy number ratios were log2-transformed prior to analysis. In addition, the data set was filtered: entries that fall outside of the mean +/- 3 standard deviations were considered outliers and replaced by the mean +/- 3 standard deviations.

The goal of this study is to locate ‘abnormal’ genes that may contribute to breast cancer. To reach this goal, we performed both separate and joint analysis on the data set, and compared the results of the two approaches. For the joint analysis, the target is to locate genes with the following characteristics: 1. large variance across genes in the copy number data; 2. large variance across genes in the gene expression data; 3. high correlation between the two types of data. Genes with a high percentage of mean-valued samples and high L1-norm values for both copy number and gene expression are the ones with large variance across genes, and may contribute significantly to breast cancer. On the other hand, genes that show high correlation between the two types of data may better reflect the patients’ status and be more helpful for diagnosis. While separate analysis can identify genes with significant variance across genes in either the gene expression data or the copy number data, it cannot find the correlation between the two types of data. The joint analysis considers all three characteristics in the gene shaving process, and its results appear to be more comprehensive and reliable.

The top highest-variance genes selected by the SRC method in the separate analysis are given in Fig. 4, while the genes selected using joint analysis are shown in Fig. 5. In the separate analysis, since the two types of data were analyzed separately, the outputs (selected genes) are not the same. As a matter of fact, there is only one common gene among the 50 top-variance genes identified from the two data sets, even though the two analyses should identify similar groups of genes, or at least gene lists with a large overlap. Separate analysis with one type of data tends to reach unilateral conclusions. This is avoided in the joint analysis, as shown in Fig. 5 (a) and (b): the genes were selected by analyzing both types of data, and one unique list of selected genes is given. In Fig. 5 (a) and (b), the genes are the same but in different order, since they were re-arranged according to the L1-norm of the two types of data respectively.
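The outlier filtering described above amounts to clipping each value to the mean +/- 3 standard deviations. A per-gene sketch (whether the paper applies the rule per gene or globally is not stated, so per-row here is an assumption):

```python
import numpy as np

def clip_outliers(data, n_std=3.0):
    """Replace entries outside mean +/- n_std*std by the nearest bound,
    row by row (one row per gene)."""
    mu = data.mean(axis=1, keepdims=True)
    sd = data.std(axis=1, keepdims=True)
    return np.clip(data, mu - n_std * sd, mu + n_std * sd)

# One gene with a wild entry: the extreme value is pulled to the bound.
x = np.array([[0.0] * 19 + [25.0]])
out = clip_outliers(x)
print(out.max() < 25.0)
```

This kind of winsorization keeps the extreme samples from dominating the L1-norm and variance features used later in the shaving.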

Fig. 4 Separate analysis results: the top highest-variance genes using (a) copy number and (b) gene expression analysis in 14 samples from breast cancer cell lines


Fig. 5 Joint analysis results: the top highest-variance genes using (a) copy number and (b) gene expression analysis in 14 samples from breast cancer cell lines

Fig. 6 shows the process of the SRC based gene shaving method to locate interesting genes in the joint analysis. In this study, we employed 9 features for the j-th gene from both data sets, j = 1, 2, ..., N, where N is the number of genes in the data set:

y_j = [s_{1,1}(j), s_{1,2}(j), s_{1,3}(j), s_{2,1}(j), s_{2,2}(j), s_{2,3}(j), m_1(j), m_2(j), r(j)],    (25)

where s_{i,l}(j) is the l-th biggest absolute value of the i-th type of data for the j-th gene, which reflects the largest changed samples for the j-th gene; m_i(j) is the mean value, measuring the overall changes of the samples of the j-th gene for data type i; r(j) is the Pearson correlation coefficient between the two types of data for the j-th gene, which reflects the correlated changes of the two types of data; l = 1, 2, 3; and i = 1 for gene expression data and i = 2 for copy number data. Data normalization was performed so that the entries of y_j fall within [0, 1].

From the feature selection process we can see that different feature vectors represent different types of genes. For example, if y_j is close to the vector [1, 1, 1, 1, 1, 1, 0, 0, 1], the gene has relatively highly variant samples with strongly correlated changes for both types of data, which is the kind of gene we are looking for. Since the transformed data (feature vectors) lie in the first quadrant of a 9-dimensional vector space, we train the sparse system to classify those feature vectors into 48 different clusters. In the gene-shaving step, we selected genes falling into clusters that reflect high sample changes and high correlations. Specifically, we require that the mean-value features are relatively small while the correlation feature is relatively big, and in addition that at least 4 out of the 6 largest-absolute-value features are relatively big, for i = 1, 2 and l = 1, 2, 3.

As we stated in the ‘Methods’ section, the gene shaving process is iterated until the number of selected genes is close to the desired number. In this work, we set the number of genes to 50.
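Under the feature definition of Eq. (25) (the symbol ordering is our reconstruction, and the across-gene [0, 1] normalization is left out), the 9-feature vector for one gene can be computed as:

```python
import numpy as np

def gene_features(expr, cn, n_top=3):
    """9 features per gene: the n_top biggest absolute sample values and the
    mean, for each data type, plus the Pearson correlation between the two."""
    feats = []
    for samples in (expr, cn):
        feats.extend(np.sort(np.abs(samples))[::-1][:n_top])  # largest changes
    feats.append(expr.mean())                   # overall change, data type 1
    feats.append(cn.mean())                     # overall change, data type 2
    feats.append(np.corrcoef(expr, cn)[0, 1])   # correlated changes
    return np.array(feats)

rng = np.random.default_rng(1)
expr = rng.normal(size=14)                      # 14 samples, as in the cell lines data
cn = expr + 0.1 * rng.normal(size=14)           # strongly correlated pair
f = gene_features(expr, cn)
print(f.shape)  # (9,)
```

Calling this with `n_top=6` yields the 15-feature variant used later for the 37-sample breast tumor data.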


Fig. 6 The gene shaving of breast cancer cell lines data using joint analysis: (a) is the clustering result for all genes; (b) gives the genes after one shaving step; (c) is the clustering result using the genes selected in (b); and (d) is the final set of selected genes once the pre-set number of genes is reached

Fig. 7 gives the comparison of the normalized gene expression data and copy number data of the selected genes and their Pearson correlation coefficients. It can easily be seen from Fig. 7 (b) that the selected genes give relatively high correlation coefficients, which means those genes have similar changes in both the gene expression data and the copy number data, and are more likely to be associated with the same disease.


Fig. 7 Correlation analysis of the genes selected from breast cancer cell lines data using joint analysis: (a) is the normalized copy number data (left) and gene expression data (right) in 14 samples; (b) is the cross matrix of correlation coefficients between the two data types for the selected genes

One-way ANOVA was performed on the results from the joint and the separate (using gene expression data alone) analysis, in terms of the L1-norm and the Pearson correlation coefficients respectively. Fig. 8 gives the box plots of the two statistical analyses. As shown in Fig. 8 (a), when using gene expression data alone, the proposed SRC model is effective in identifying genes with significant variance across genes for both gene expression and copy number data. However, using one data set cannot reveal the correlation between the two types of data across samples (Fig. 8 (b)). This unilateral analysis may fail to discover the genes susceptible to cancer. As a matter of fact, ERBB2, which is known to play a role in breast cancer [4], was not identified by separate analysis using the cDNA gene expression or copy number data alone. However, the combined analysis located ERBB2, which shows the advantage of joint analysis.

Fig. 8 Comparison of joint and separate analysis of breast cancer cell lines data: (a) comparison of the L1-norms of the selected gene expression data; (b) comparison of the Pearson correlation coefficients of the genes selected using gene expression data alone and using joint analysis.

3.3 Breast tumor data study

We also studied a set of 37 breast cancer tumors, which were profiled genome-wide for mRNA expression levels and DNA copy numbers [16]. Each sample had mRNA levels and DNA copy numbers measured for 6,095 genes before preprocessing. Each spot on the array covers between 200 and 1,500 bp in the human genome. This tumor data set has been analyzed by Berger et al. [4]; it was reported that roughly 3 percent of the data contains missing or errant values [4]. The expression and copy number ratios were log2-transformed prior to analysis. As with the breast cancer cell lines data, the breast tumor data set was also filtered: entries that fall outside of the mean +/- 3 standard deviations were considered outliers and replaced with the mean +/- 3 standard deviations.

In this study, we also performed both joint and separate analysis to find ‘abnormal’ genes that may contribute to disease formation. However, we mainly focused on analyzing the characteristics of the known or candidate oncogenes [4] identified by our proposed SRC based gene shaving method. The top highest-variance genes selected by the proposed SRC method in the separate analysis of the breast tumor data are given in Fig. 9, while the genes selected using joint analysis are given in Fig. 10.


Fig. 9 Separate analysis results: the top highest-variance genes using (a) copy number and (b) gene expression analysis in 37 samples from breast tumor data

Fig. 10 Joint analysis results: the top highest-variance genes using (a) copy number and (b) gene expression analysis in 37 samples from breast tumor data

Fig. 11 gives the process of the SRC based gene shaving method to locate interesting genes. In this study, we employed features similar to those of the cell lines study (Eq. (25)). However, we employed the first 6 biggest absolute values for both types of data sets, which gave 15 features for each gene, and the sparse system was trained to cluster the first quadrant of a 15-dimensional vector space into 96 different clusters. These changes were made considering that the number of samples in the breast tumor data is larger than in the breast cell lines data (37 vs. 14). In each gene-shaving step, we also selected genes falling into clusters that reflect high sample changes and high correlations. However, here we require that at least 8 out of the 12 largest-absolute-value features are relatively big, for i = 1, 2 and l = 1, ..., 6.

Fig. 11 The joint gene shaving of breast tumor data: (a) is the clustering result for all genes; (b) gives the genes after one shaving step; and (c) is the final set of selected genes once the pre-set number of genes (50) is reached

Fig. 12 gives the comparison of the normalized gene expression data and copy number data of the selected genes and their Pearson correlation coefficients. As shown in Fig. 12 (b), the selected genes show high correlation between the two types of data, which is consistent with the analysis results of the breast cell lines data.

Fig. 12 Correlation analysis of the genes selected from breast tumor data using joint analysis: (a) is the normalized copy number data (left) and gene expression data (right) in 37 samples; and (b) is the matrix of cross correlation coefficients between the two data types for the selected genes

In Fig. 13, the L1-norms and Pearson correlation coefficients of the located/un-located known or candidate oncogenes mentioned in [4] are compared. The purpose of this comparison is to show that, as long as genes/variables satisfy the required characteristics, our proposed SRC method is capable of identifying them. From Fig. 13, it can be seen that, with a limited number of genes selected (50 out of 11994), 4 of the 9 known or candidate oncogenes [4], {'ERBB2'}, {'TPD52'}, {'GRB7'}, {'FGFR1'}, were selected. The genes that were not located each miss at least one of the 3 requirements for the gene selection; those genes are {'MYC'}, {'ZNF217'}, {'RB1'}, {'TP53'}, {'BCAS1'}. To locate those genes, more genes should be selected or other gene shaving criteria should be applied. In addition, we also located the gene {'TOB1'}, which is a transducer of ERBB2. This further demonstrates the ability of our proposed SRC method to locate genes with specific characteristics. Moreover, we also studied three other known breast cancer genes included in the breast tumor data tested in this work, namely {'BRCA1'}, {'BRCA2'}, {'ATM'}. These three genes show low variance in both the copy number data and the gene expression data in our data set, and they give low correlation between the two types of data. This explains why they were not identified by the proposed SRC method.


Fig. 13 The L1-norms and correlation coefficients of the selected genes (solid lines) and the known/candidate oncogenes (stars). The red-underlined genes are those located in the joint analysis, which show both high variance across genes and high correlation between the two data sets, while those that were not selected miss at least one of the requirements

IV. Discussion and Conclusion

Identification of interesting genes from vast amounts of genomic data has been a challenge. In this work, the proposed SRC based gene/variable selection method, or so-called gene shaving method, successfully identified genes susceptible to breast cancer and tumors. Part of the results has been reported in our earlier work [19]. Using the proposed SRC method, we addressed one crucial issue in the joint analysis of gene expression and copy number data, i.e., identifying subsets of features significant for integrative analysis. We assume that genes/variables of interest should be distinguishable through specific characteristics, such as the mean value, variance, correlation coefficients, etc. Different from other combined analysis methods, our proposed SRC method performs a combined analysis of different characteristic parameters. In addition, by analyzing the same features from different data sets (e.g., copy number and gene expression), the proposed SRC method provides a way for joint analysis of data sets with different structures and data ranges.

Our approach has been tested on both simulated and real data. From the simulated data experiments (Fig. 3), it can be seen that for both types of data, the percentage similarity changes with the noise level in a similar pattern. This indicates that the proposed SRC method is stable for data with different sizes and different patterns/types. In the separate analyses of the breast cancer cell lines data set and the breast tumors data set, the SRC model is effective in identifying genes with significant variance across variables/genes, as can be seen from Fig. 4 and Fig. 9. This demonstrates the effectiveness of the SRC method in identifying genes with specific characteristics. The joint analysis results further show that our proposed SRC method is effective in identifying data with large variance as well as high correlations. Compared with the separate analyses, the joint analysis gave more comprehensive results and identified genes that
are more correlated to the disease. Specifically, in the study of the breast cancer cell lines data, the joint analysis successfully located the gene {'ERBB2'}, which has been shown to be informative in breast cancer. In the analysis of the breast tumor data, we located 4 of the 9 known or candidate oncogenes specified in [4]: {'ERBB2'}, {'TPD52'}, {'GRB7'}, {'FGFR1'}. In addition, we identified {'TOB1'}, which is the transducer of {'ERBB2'} [16]. However, some known or candidate oncogenes within the data sets were left unidentified, such as {'BRCA1'}, {'BRCA2'}, and {'ATM'}. This is because we aimed to locate genes with relatively large variance and correlations in both types of data sets, and those unidentified genes do not satisfy these requirements.

In this work, our proposed SRC method was used for both separate and joint analysis of gene expression and copy number data. The results showed that the output of the joint analysis (the gene list) differs substantially from that of the separate analyses. One reason is that in the joint analysis we employed the Pearson correlation coefficients between the two types of data as one of the features, which cannot be obtained in a separate analysis. This indicates that the choice of feature extraction methods strongly affects the results of SRC. In future work, more significant feature extraction methods should be studied, and more data sets and data types should be tested.
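The percentage-similarity measure used in the simulation experiments above is not defined in this section. One plausible reading, given purely for illustration (the function name and the choice of normalizing by the reference set are assumptions), is the fraction of a noise-free reference gene selection that is recovered from noisy data:

```python
def percentage_similarity(selected, reference):
    """Fraction of the reference gene set recovered by a (noisy) selection.

    Both arguments are collections of gene identifiers; the result lies in
    [0, 1], with 1 meaning the selection reproduced the reference exactly.
    """
    selected, reference = set(selected), set(reference)
    if not reference:
        return 1.0
    return len(selected & reference) / len(reference)

# toy example using gene symbols from the breast tumor discussion
clean_run = ["ERBB2", "TPD52", "GRB7", "FGFR1"]
noisy_run = ["ERBB2", "GRB7", "TOB1"]
print(percentage_similarity(noisy_run, clean_run))  # 0.5
```

Under this reading, plotting the measure against increasing noise levels for each data type yields curves of the kind compared in Fig. 3.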

Acknowledgement. This work has been supported by NIH and NSF.

V. References


[1] Honghui Yang, Jingyu Liu, Jing Sui, Godfrey Pearlson, and Vince D. Calhoun, "A hybrid machine learning method for fusing fMRI and genetic data: combining both improves classification of schizophrenia," Front. Hum. Neurosci., 4:192, 2010, doi: 10.3389/fnhum.2010.00192.

[2] Charlotte Soneson, Henrik Lilljebjörn, Thoas Fioretos, and Magnus Fontes, "Integrative analysis of gene expression and copy number alterations using canonical correlation analysis," BMC Bioinformatics, 2010, 11:191.

[3] K. A. Lê Cao, P. G. P. Martin, C. Robert-Granié, and P. Besse, "Sparse canonical methods for biological data integration: application to a cross-platform study," BMC Bioinformatics, 2009, 10:34.

[4] J. A. Berger, S. Hautaniemi, S. K. Mitra, and J. Astola, "Jointly analyzing gene expression and copy number data in breast cancer using data reduction models," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 1, pp. 2-16, 2006.

[5] Yong-Jun Liu, Hui Shen, Peng Xiao, Dong-Hai Xiong, Li-Hua Li, Robert R. Recker, and Hong-Wen Deng, "Molecular genetic studies of gene identification for osteoporosis: a 2004 update," Journal of Bone and Mineral Research, vol. 21, no. 10, 2005.

[6] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani, "A method for calling gains and losses in array CGH data," Biostatistics, vol. 6, pp. 45-58, Jan. 2005.

[7] S. Hautaniemi, M. Ringnér, P. Kauraniemi, R. Autio, H. Edgren, O. Yli-Harja, J. Astola, A. Kallioniemi, and O.-P. Kallioniemi, "A strategy for identifying putative causes of gene expression variation in human cancers," J. Franklin Inst., vol. 341, pp. 77-88, Mar. 2004.


[8] L. W. M. Loo, D. I. Grove, E. M. Williams, C. L. Neal, L. A. Cousens, E. L. Schubert, I. N. Holcomb, H. F. Massa, J. Glogovac, C. I. Li, K. E. Malone, J. R. Daling, J. J. Delrow, B. J. Trask, L. Hsu, and P. L. Porter, "Array comparative genomic hybridization analysis of genomic alterations in breast cancer subtypes," Cancer Research, vol. 64, pp. 8541-8549, Dec. 2004.

[9] J. R. Pollack, T. Sørlie, C. M. Perou, C. A. Rees, S. S. Jeffrey, P. E. Lonning, R. Tibshirani, D. Botstein, A.-L. Børresen-Dale, and P. O. Brown, "Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors," Proc. Nat'l Academy of Sciences USA, vol. 99, pp. 12963-12968, Oct. 2002.

[10] S. Hautaniemi, M. Ringnér, P. Kauraniemi, R. Autio, H. Edgren, O. Yli-Harja, J. Astola, A. Kallioniemi, and O.-P. Kallioniemi, "A strategy for identifying putative causes of gene expression variation in human cancers," J. Franklin Inst., vol. 341, pp. 77-88, Mar. 2004.

[11] O. Monni, M. Bärlund, S. Mousses, J. Kononen, G. Sauter, M. Heiskanen, P. Paavola, K. Avela, Y. Chen, M. L. Bittner, and A. Kallioniemi, "Comprehensive copy number and gene expression profiling of the 17q23 amplicon in human breast cancer," Proc. Nat'l Academy of Sciences USA, vol. 98, pp. 5711-5716, May 2001.

[12] E. Hyman, P. Kauraniemi, S. Hautaniemi, M. Wolf, S. Mousses, E. Rozenblum, M. Ringnér, G. Sauter, O. Monni, A. Elkahloun, O.-P. Kallioniemi, and A. Kallioniemi, "Impact of DNA amplification on gene expression patterns in breast cancer," Cancer Research, vol. 62, pp. 6240-6245, Nov. 2002.


[13] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma, "Robust face recognition via sparse representation," IEEE Trans. PAMI, vol. 31, no. 2, pp. 210-227, Feb. 2009.

[14] Hongbao Cao and Yu-Ping Wang, "M-FISH image analysis with improved adaptive fuzzy C-means clustering based segmentation and sparse representation classification," 3rd International Conference on Bioinformatics and Computational Biology (BICoB), March 23-25, 2011, New Orleans, Louisiana, USA. In press.

[15] E. Hyman, P. Kauraniemi, S. Hautaniemi, M. Wolf, S. Mousses, E. Rozenblum, M. Ringnér, G. Sauter, O. Monni, A. Elkahloun, O.-P. Kallioniemi, and A. Kallioniemi, "Impact of DNA amplification on gene expression patterns in breast cancer," Cancer Research, vol. 62, pp. 6240-6245, 2002.

[16] J. R. Pollack, T. Sørlie, C. M. Perou, C. A. Rees, S. S. Jeffrey, P. E. Lonning, R. Tibshirani, D. Botstein, A.-L. Børresen-Dale, and P. O. Brown, "Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors," Proc. Nat'l Academy of Sciences USA, vol. 99, pp. 12963-12968, 2002.

[17] D. Donoho and Y. Tsaig, "Fast solution of ℓ1-norm minimization problems when the solution may be sparse," preprint, http://www.stanford.edu/~tsaig/research.html, 2006.

[18] M. R. Osborne, B. Presnell, and B. A. Turlach, "A new approach to variable selection in least squares problems," IMA J. Numerical Analysis, 20:389-403, 2000.

[19] Hongbao Cao and Yu-Ping Wang, "Integrated analysis of gene expression and copy number data using sparse representation based clustering model," 3rd International Conference on Bioinformatics and Computational Biology (BICoB), March 23-25, 2011, New Orleans, Louisiana, USA. In press.