Please verify that (1) all pages are present, (2) all figures are acceptable, (3) all fonts and special characters are correct, and (4) all text and figures fit within the margin lines shown on this review document. Return to your MySPIE ToDo list and approve or disapprove this submission.

Support vector machine based classification of fast Fourier transform spectroscopy of proteins

Aleksandar Lazarevic, Dragoljub Pokrajac, Aristides Marcano, and Noureddine Melikechi

Center for Research and Education in Optical Sciences and Applications, Department of Physics and Pre-Engineering, Delaware State University, 1200 N Dupont Highway, Dover, DE 19901

ABSTRACT

Fast Fourier transform spectroscopy has proved to be a powerful method for studying the secondary structure of proteins, since peak positions and their relative amplitudes are affected by the number of hydrogen bridges that sustain this secondary structure. However, to the best of our knowledge, the method has not yet been used for identification of proteins within a complex matrix such as a blood sample. The principal reason is the apparent similarity of protein infrared spectra, with actual differences usually masked by the solvent contribution and other interactions. In this paper, we propose a novel machine learning based method that uses protein spectra for classification and identification of such proteins within a given sample. The proposed method uses principal component analysis (PCA) to identify the most important linear combinations of the original spectral components and then applies a support vector machine (SVM) classification model to these combinations to categorize proteins into one of the given groups. Our experiments were performed on a set of four different proteins: Bovine Serum Albumin, Leptin, Insulin-like Growth Factor 2 and Osteopontin. The proposed combination of principal component analysis and support vector machines exhibits excellent classification accuracy when identifying proteins from their infrared spectra.

Keywords: Fourier Transform Infrared Spectroscopy, infrared spectra of proteins, principal component analysis, support vector machine, classification.

1. INTRODUCTION

Timely identification and classification of complex samples such as viruses, bacteria, proteins, and other molecules of organic origin remains one of the most important challenges in modern spectroscopy. Molecules of these samples are composed of several thousand atoms, namely hydrogen, carbon, oxygen, nitrogen and trace quantities of other atoms (molecular weight larger than 104 kDa). The spectroscopic data of these materials mostly show evidence of their main components and mask the differences due to other trace components, making identification and classification of such spectra extremely difficult. Recently, principal component analysis (PCA) has been used as a statistical method for the analysis of spectroscopic data aimed at detection of several complex organic samples [1, 2]. In these methods, the spectroscopic data can be represented in a three-dimensional space of eigenvector projections of the matrices corresponding to a series of experimental data measured at different selected wavelengths [3]. Each point of this space represents a full set of spectroscopic measurements corresponding to one sample. Differences between the spectra can then be visualized graphically as different points in the space of eigenvectors. In this paper, we propose to use the results of PCA to train a support vector machine (SVM) and to perform automatic identification of complex molecules. SVMs have recently proven to be quite efficient in classification tasks in a wide variety of application domains. We use the proposed SVM method to analyze the Fourier transform infrared (FTIR) spectra of several proteins: bovine serum albumin (BSA), and solutions of Osteopontin (OPN), Leptin and insulin-like growth factor II (IGF2). The latter three proteins have been reported as possible biomarkers of ovarian cancer [4-6].
The FTIR method provides information about the vibrational spectra of molecules. Several authors have shown that FTIR is a powerful method for the study of the secondary structure of proteins [7-10]. However, the spectral lines of these large molecules are usually broadened by different molecular interactions, which makes identification of the structure difficult. We show that despite the presence of broadening mechanisms and evident similarities in the FTIR spectra of different proteins, the proposed SVM method provides an automatic and effective identification of the proteins with almost perfect accuracy. This statistical procedure can also be applied to other spectroscopic methods such as fluorescence, NIR-VIS absorbance spectroscopy and laser-induced breakdown spectroscopy.

7169-11 V. 1 (p.1 of 8) / Color: No / Format: Letter / Date: 12/17/2008 10:17:35 PM SPIE USE: ____ DB Check, ____ Prod Check, Notes:

2. METHODOLOGY

2.1 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) [11] is a powerful technique for dimensionality reduction in machine learning and data mining. The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables. Suppose that X is a matrix of N k-dimensional random vectors [x1; x2; …; xN], and that the variances of the k random variables and the structure of the covariances or correlations between the k variables are of interest. Assume that we would like to approximate the vector xi as a linear combination of m […]

(7a)

This makes it possible to use an implicit and infinite-dimensional transformation f. Popular choices of kernel function include:
• Linear kernel: $K(u, v) = u^T v$;
• Polynomial kernel (p is a pre-specified parameter): $K(u, v) = (1 + u^T v)^p$;
• Exponential kernel (σ is a pre-specified parameter): $K(u, v) = e^{-\frac{1}{2\sigma^2}\|u - v\|^2}$.
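The three kernel choices listed above can be written down in a few lines. The following is a minimal Python sketch for illustration only (the paper's own code is in MATLAB); the function names and the default values of p and sigma are assumptions, not taken from the paper.

```python
import math

def linear_kernel(u, v):
    """K(u, v) = u^T v."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, p=2):
    """K(u, v) = (1 + u^T v)^p; p is the pre-specified degree (default is illustrative)."""
    return (1.0 + linear_kernel(u, v)) ** p

def exponential_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 sigma^2)); sigma is the pre-specified width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Note that the exponential kernel equals 1 when u = v and decays towards 0 as the two vectors move apart, with sigma controlling how quickly.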

The original SVM technique is designed for a two-class problem. In our case, we need to perform multiclass classification with c = 4 classes. We use the DAG-SVM method [16], which for a c-class problem trains c(c-1)/2 two-class support vector machines; the class decision is made by successive elimination of classes, each elimination being the result of one two-class comparison. In comparison to one-to-rest classifiers, the application of DAG-SVM is more practical, since it does not result in imbalanced training sets [17, 18].
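The elimination scheme described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the pairwise deciders are hypothetical stand-ins for trained two-class SVMs, and the one-dimensional class centres are invented toy values used only to make the example runnable.

```python
from itertools import combinations

def dag_predict(x, classes, pairwise):
    """Classify x by successive elimination over pairwise (two-class) deciders.

    pairwise[(a, b)] is a callable that returns a or b for sample x
    (a hypothetical stand-in for a trained two-class SVM).
    """
    candidates = list(classes)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        key = (a, b) if (a, b) in pairwise else (b, a)
        winner = pairwise[key](x)
        # Drop the class that lost this two-class comparison.
        candidates.remove(b if winner == a else a)
    return candidates[0]

# Toy stand-ins: each decider picks the class whose (invented) 1-D centre is nearer to x.
centres = {"BSA": 0.0, "IGF2": 1.0, "Leptin": 2.0, "OPN": 3.0}

def make_decider(a, b):
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b

classes = list(centres)
# One decider per unordered pair of classes: c(c-1)/2 = 6 for c = 4.
pairwise = {(a, b): make_decider(a, b) for a, b in combinations(classes, 2)}
```

With c = 4 classes the dictionary holds six deciders, but a single prediction evaluates only c-1 = 3 of them, because each comparison removes exactly one candidate class.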

¹ Let us define a linear operator T on a function g such that $Tg \equiv \int_C K(u, v)\, g(u)\, du$, and a scalar product ∘ of functions h and g such that $h \circ g = \int_C h(u)\, g(u)\, du$. Then, the kernel K is non-negative definite iff $(\forall g \in L_2(C))\; g \circ Tg \ge 0$ [15].

3. EXPERIMENTAL RESULTS

3.1 Data

For measuring the Fourier transform infrared (FTIR) spectra we used an attenuated total reflection (ATR) FTIR spectrophotometer NICOLET 6700 (Thermo Industries, Inc). Drops of the samples were deposited over an aperture on the top of the device. This aperture is connected to the surface of a diamond where the total reflection occurs. The samples under study were distilled and deionized water, and high purity proteins: Bovine serum albumin (BSA), OPN, Leptin and IGF2 (Figure 1). The water contribution usually masked most of the contribution from the proteins. To eliminate the water peaks, the samples were dried through simple evaporation of the solvent before collecting data. A drop of 5 μL of the solution was deposited over the aperture of the spectrophotometer. The samples were then left to dry at room temperature for 30 minutes. The drying process was monitored by taking spectra every 5 minutes until the solvent contribution was depleted. This process extended to over 20 minutes in some cases. When the drying process was complete, the spectra did not show further changes for several hours. The dried protein sample formed a film over the aperture, several tens of micrometers thick, which was sufficient for ATR measurement. The spectra were collected with a resolution of 4 cm⁻¹. One hundred scans were averaged for each spectrum. The spectra showed high reproducibility and a signal-to-noise ratio usually larger than 100.

Figure 1. FTIR spectra of Bovine serum albumin (BSA) (a), Leptin (b), Osteopontin (c), and insulin-like growth factor 2 (IGF2) (d)

In our data set, each spectrum consisted of 1867 wavenumbers between 400 and 4000 cm⁻¹. The final number of wavenumbers was 1556. In this study, we used 36 samples of BSA, 36 samples of IGF2, 36 samples of OPN and 24 samples of Leptin.

3.2 Results and Discussion

Our classification experiments were performed using linear support vector machines with C=1 and 4-fold cross-validation. After the 132 available samples were randomly shuffled, they were split into four subsets of equal size: S1, S2, S3 and S4. The SVM model was then trained on subsets S1, S2 and S3 and tested on S4. This was repeated for all possible combinations of three subsets Si for training, with the remaining subset Sj, j ≠ i, used for testing. For each test, a confusion matrix was computed (the confusion matrix specifies how the elements of each class from the test set are classified by the SVM), and the average confusion matrix was then evaluated. The diagonal elements of the confusion matrix are partial accuracies: the percentage of each class in the test set that is correctly classified. The total accuracy is the ratio of the number of correctly classified samples to the total number of samples. For classification, we experimented with different numbers of principal components, sorted in descending order of the eigenvalues of the data correlation matrix: k=2, 3, 5, 10, 15, 20, 30, 40, 50, 60, 100. For each number of principal components, we report the average number of support vectors and the average partial accuracy.
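The evaluation protocol described above (shuffle, split into four equal subsets, confusion matrix, partial and total accuracy) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names and the fixed random seed are hypothetical, and no classifier is trained here; only the bookkeeping is shown.

```python
import random

def four_fold_indices(n_samples, n_folds=4, seed=0):
    """Shuffle sample indices and split them into folds of equal size.
    Assumes n_samples is divisible by n_folds (132 / 4 = 33 in the paper)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // n_folds
    return [idx[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

def train_test_split(folds, j):
    """Use fold j for testing and the remaining folds for training."""
    train = [i for k, fold in enumerate(folds) if k != j for i in fold]
    return train, folds[j]

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows: true class, columns: predicted class, entries: counts."""
    pos = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        cm[pos[t]][pos[p]] += 1
    return cm

def partial_accuracies(cm):
    """Per-class fraction of test samples that were classified correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(cm)]

def total_accuracy(cm):
    """Correctly classified samples divided by the total number of samples."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

classes = ["BSA", "IGF2", "OPN", "Leptin"]
labels = ["BSA"] * 36 + ["IGF2"] * 36 + ["OPN"] * 36 + ["Leptin"] * 24  # 132 samples
folds = four_fold_indices(len(labels))
```

In a full experiment, each of the four train/test splits would yield one confusion matrix from the classifier's predictions, and the four matrices would then be averaged.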


We used our PCA code for high-dimensional data (Figure 2) and the Oregon State University (OSU) toolbox OSU SVM 3.0 [20], which is based on recent developments in SVMs [21]. We varied the number of principal components k used for training and testing the classification model.

function [rpm, proj, m] = PCA_highdimensional(opm, odim)
% opm:  N x dim pattern matrix (one pattern per row)
% odim: desired number of principal components
[N, dim] = size(opm);
if (odim > N)
    error('Desired output dimensionality exceeds the number of patterns');
end
% center the pattern matrix by subtracting the mean vector from each pattern
m = mean(opm);
X = opm - repmat(m, N, 1);        % repmat: N x dim mtx from 1 x dim vector
C = 1/(N-1) * X * transpose(X);   % N x N matrix (small when N << dim)
[V, Lambda] = eig(C);             % Lambda is an N x N diagonal mtx
% total variance in this data = sum of eigenvalues
Lambda = transpose(diag(Lambda)); % eigenvalues as a row vector
U = transpose(X) * V;             % dim x N: eigenvectors mapped back to the data space
% Instead of normalizing by sqrt((N-1))*repmat(sqrt(Lambda),N,1) we'll
% normalize by sum(U.^2).^0.5 for numerical stability
U = U ./ repmat(sum(U.^2).^0.5, dim, 1);
% sort eigenvalues. index is the array of permutation indices
[sv, index] = sort(Lambda);
proj = fliplr(U(:, index(N-odim+1:N)));  % odim leading components, largest first
rpm = X * proj;
s2 = sum(Lambda);
s3 = sum(sv(N-odim+1:N));
fprintf(1, '%f%% of variance retained\n', 100.0*s3/s2);
return

Figure 2. Code for fast computation of principal components on high-dimensional data in Matlab.
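The reason the code in Figure 2 can work with the N x N matrix X X^T rather than the dim x dim covariance matrix can be seen from a short derivation. The notation below is ours and is offered as a sketch: X is the centered N x dim pattern matrix from the code, v is a unit-norm eigenvector of the small matrix, and λ is its eigenvalue.

```latex
\begin{align*}
\tfrac{1}{N-1}\, X X^{T} v &= \lambda v \\
\Longrightarrow\quad \tfrac{1}{N-1}\, X^{T} X \,\bigl(X^{T} v\bigr)
  &= X^{T}\Bigl(\tfrac{1}{N-1}\, X X^{T} v\Bigr) = \lambda\,\bigl(X^{T} v\bigr), \\
\bigl\| X^{T} v \bigr\|^{2} &= v^{T} X X^{T} v = (N-1)\,\lambda\, v^{T} v = (N-1)\,\lambda .
\end{align*}
```

So each eigenvector v of the small matrix gives an eigenvector X^T v of the full covariance matrix with the same eigenvalue, and its norm is sqrt((N-1)λ), which is the normalization factor mentioned in the code comment. When N is much smaller than dim, as with 132 spectra of more than a thousand wavenumbers each, the eigendecomposition is correspondingly cheaper.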

In Figure 1a we show the FTIR spectrum from a dried BSA water solution. Characteristic bands of the FTIR protein spectra are the amide I and amide II bands in the region 1400-1700 cm⁻¹ [7]. These arise from the amide bonds that link the amino acids. The amide I band, centered around 1740 cm⁻¹, corresponds to the stretching mode of the CO bond of the amide. The amide II band, centered around 1550 cm⁻¹, corresponds to the bending mode of the NH bond of the amide. The characteristics of these peaks provide information about the secondary structure of the proteins, since the hydrogen bonds that establish this structure are mostly associated with the CO and NH bonds. Stretching hydroxyl (OH) peaks are dominant in the region 2500-3500 cm⁻¹, with clear peaks at 2870, 2930, 3056, 3200 and 3290 cm⁻¹. The wide peak in the region 400-800 cm⁻¹ corresponds to librations, with contributions from other rotational and low-energy vibrational lines. In Figure 1 we also show the FTIR spectra of three other proteins, Leptin (b), OPN (c) and IGF2 (d), which repeat the same basic structure although the relative amplitudes of particular peaks are different. The data transformation performed by PCA made the data linearly separable (Figure 3). It can be observed from Figure 3 that even the first two principal components were sufficient to achieve perfect separation of the protein classes. Therefore, even a linear SVM was able to provide perfect accuracy on both the training and the test sets (100% accuracy). This result did not depend on the number k of principal components used. The average number of support vectors per class was relatively small (Figure 4) and practically did not depend on the number of principal components used, which indicates good generalization and stability of the proposed technique.


[Figure 3 plot. Title: "First 2 PCA of FTIR data (600-2000, 2400-4000 cm⁻¹)". Legend: BSA, IGF2, Leptin, OPN.]

Figure 3. First two principal components of the protein data

[Figure 4 plot. x-axis: "#PCA components"; y-axis: "Average # of Support vectors per class". Legend: OPN, IGF2, Leptin, BSA.]

Figure 4. Average number of support vectors per class

There are two possible reasons for such good results. First, FTIR is extremely good at distinguishing different protein classes. Second, the spectra were collected on a small number of specimens, so the variability of the data within each class is small: all records from the same class are similar to each other, which makes 100% classification accuracy easier to achieve. The results obtained in this paper suggest that automatic classification of protein samples is possible. This points to the viability of automatic systems for sampling and classification of proteins based on their FTIR spectra, with potential applications in medicine, homeland security, defense, etc.

4. CONCLUSIONS

In this paper, we demonstrated that it is possible to automatically determine the class of the FTIR sample of an unknown protein using a combination of principal component analysis and support vector machines. Further, we demonstrated that the classification accuracy and the model complexity (the number of support vectors) practically do not depend on the number of extracted principal components. The proposed technique is computationally fast and can in principle be applied in an on-line learning and classification framework. Work in progress includes testing the proposed technique on larger datasets to exclude the small variability of samples (the sample bias) as a potential reason for the extremely high classification accuracy.

REFERENCES

[1] J. D. Hybl, G. A. Lithgow, and S. G. Buckley, “Laser-induced breakdown spectroscopy detection and classification of biological aerosols”, Appl. Spectros. 57, 1207-1215 (2003).
[2] N. Melikechi, H. Ding, S. Rock, A. Marcano O. and D. Connolly, “Laser-induced breakdown spectroscopy of whole blood and other Liquid organic compounds”, in Optical Diagnostic and Sensing VIII, Editors G. Cote and A. V. Priezzhev, Proceeding of SPIE 6863, 68630O1-7, DOI:10.1117/12.761901 (2008).
[3] D. L. Massart, B. G. Vandenginste, S. N. Deming, Y. Michotte, and L. Kaufman, [Chemometrics: A Textbook], Elsevier, Amsterdam, (1988).
[4] Mor, G., Visintin, I., Lai, Y., Zhao, H., Schwartz, P., Rutherford, T., Yue, L., Bray-Ward, P. and Ward D. C., “Serum protein markers for early detection of ovarian cancer”, PNAS, 102 (21), 7677-7682 (2005).
[5] Schorge, J.O., R. D. Drake, H. Lee, S. J. Skates, R. Rajanbabu, D. S. Miller, J. H. Kim, D. W. Cramer, R. S. Berkowitz, S. C. Mok, “Osteopontin as an adjunct to CA125 in detecting recurring ovarian cancer”, Clin. Cancer Res. 10(10), 3474-3478 (2004).
[6] Sutphen, R., Y. Xu, G. D. Wilbanks, J. Fiorica, E. C. Grendys Jr., J. P. LaPolla, H. Arango, M. S. Hoffman, M. Martino, K. Wakeley, D. Griffin, R. W. Blanco, A. B. Cantor, Y. J. Xiao, J. P. Krischer, “Lysophospholipids are potential biomarkers of ovarian cancer”, Cancer Epidemiol. Biomarkers Prev. 13(7), 1185-1191 (2004).
[7] Byler, D. M. and Susi, H. “Examination of the secondary structure of proteins by deconvolved FTIR spectra”, Biopolymers 25, 469-487 (1989).
[8] Surewicz, W.K. and Mantsch, H. H., Biochim. Biophys. Acta 952, 115-130 (1988).
[9] Petibois, C., Gionnet, K., Goncalves, M., Perromat, A., Moenner, M. and Deleris, G., “Analytical performances of FT-IR spectrometry and imaging for concentration measurement within biological fluids, cells, and tissues”, Analysis, 131, 640-647 (2006).
[10] Kunihiro, K., Kim, P. and Baldwin, R. L. “Strategy for trapping intermediates in the folding of ribonuclease and for using 1H-NMR to determine their structures”, Biopolymers 22, 59-67 (1984).
[11] I.T. Jolliffe, [Principal Component Analysis], Springer (2002).
[12] Bishop, C., [Pattern recognition and machine learning], Springer (2007).
[13] W. Karush (1939). "Minima of Functions of Several Variables with Inequalities as Side Constraints". M.Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, Chicago, Illinois. Available from http://wwwlib.umi.com/dxweb/details?doc_no=7371591
[14] V. Vapnik, [The Nature of Statistical Learning Theory], Springer, (2000).
[15] Jörgens Konrad, [Linear integral operators], Pitman, Boston, 1982.
[16] J. Platt, N. Cristianini, J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification", in Advances in Neural Information Processing Systems 12, pp. 547-553, MIT Press, (2000).
[17] C.-W. Hsu, and C.-J. Lin, “A comparison of methods for multi-class support vector machines”, IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415-425, Mar. (2002).
[18] Z.-Q. Jiang, H.-G. Fu, and L.-J. Li. (2005). “Support vector machine for mechanical faults classification”, Journal of Zhejiang University SCIENCE. [Online]. Available: http://www.zju.edu.cn/jzus/2005/A0505/A050513.pdf
[19] R.-E. Fan, P.-H. Chen, and C.-J. Lin. “Working set selection using second order information for training SVM”, Journal of Machine Learning Research 6, 1889-1918, 2005.
[20] J. Ma, and Y. Zhao, “Oregon State University (OSU) Support Vector Machine (SVM) toolbox for the MATLAB numerical environment”, 2002, http://sourceforge.net/projects/svm/
[21] Chih-Chung Chang and Chih-Jen Lin, “LIBSVM : a library for support vector machines” (2001).
