Automatic colon cancer detection and classification


Saima Rathore PhD Thesis

Department of Computer and Information Sciences Pakistan Institute of Engineering & Applied Sciences Islamabad, Pakistan

Automatic colon cancer detection and classification

By Saima Rathore

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer and Information Sciences

The Department of Computer and Information Sciences Pakistan Institute of Engineering and Applied Sciences Islamabad, Pakistan 2014


This thesis was carried out under the supervision of

Dr. Mutawarra Hussain

Department of Computer and Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Islamabad, Pakistan

This work was financially supported by the Higher Education Commission of Pakistan under the Indigenous 5000 Ph.D. Fellowship Program, as per award number 117-7931-Eg7-037.


Declaration of Originality

I hereby declare that the work contained in this thesis and the intellectual content of this thesis are the product of my own work. This thesis has not been previously published in any form, nor does it contain any verbatim copy of published resources that could be treated as an infringement of international copyright law. I also declare that I understand the terms "copyright" and "plagiarism", and that in case of any copyright violation or plagiarism found in this work, I will be held fully responsible for the consequences of any such violation.

Signature: Name: Date: Place:


Saima Rathore

Certificate

This is to certify that the work contained in the thesis entitled Automatic Colon Cancer Detection and Classification was carried out by Mrs. Saima Rathore and, in my opinion, is fully adequate, in scope and quality, for the degree of PhD in Computer and Information Sciences.

Supervisor:

Dr. Mutawarra Hussain Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad.


Copyright Statement

The entire contents of this thesis, entitled "Automatic Colon Cancer Detection and Classification" and authored by Mrs. Saima Rathore, are the intellectual property of PIEAS. No portion of the thesis may be reproduced without obtaining explicit permission from PIEAS and the author.


Acknowledgments

First and foremost, I am very thankful to Allah Almighty for His blessings during my entire life and particularly for the duration of this research work. He blessed me with knowledge and purpose, and guided me whenever I faced any problems. Having sincere teachers and cooperative friends during my PhD journey was likewise a blessing of Allah Almighty.

Next, I would like to pay my deepest gratitude to my supervisor, Dr. Mutawarra Hussain, whose dedication, enthusiasm, and devotion to work always inspired me. He has been a constant source of motivation and inspiration throughout my PhD research. I always found him generous in sharing his knowledge and wisdom; working with him was a great learning experience, and his kind attitude really made the difference. I would also like to pay my gratitude to Dr. Asifullah Khan for his advice, guidance, and invaluable comments during my PhD, and I extend my gratitude to Dr. Abdul Jalil for his appreciation and encouragement to complete my PhD. I would also like to thank my friends, particularly Ayesha Siddiqa, Summuyya Munib, and Nabeela Kousar, for their cooperative and encouraging behavior during my stay at PIEAS.

I would certainly like to thank my parents, husband, children, and other family members, whose support throughout my studies has made all this possible. Without their encouragement and moral support, the completion of this research work would not have been possible.

I pay my gratitude to Mr. Imtiaz Ahmed Qureshi, Assistant Professor, Histopathology Department, Rawalpindi Medical College, Pakistan, for providing technical support in the preparation of the dataset and the relevant ground truth. I would also like to thank Dr. Akif Burak Tosun, research assistant, Carnegie Mellon University, for helping with the MATLAB implementation of some algorithms.

Finally, I would like to thank the Higher Education Commission of Pakistan for its financial support under the Indigenous 5000 PhD Scholarship Program, with reference to award letter number 117-7931-Eg7-037.

Saima Rathore

List of Publications

1. Saima Rathore, Mutawarra Hussain, Asifullah Khan, "GECC: Gene expression based ensemble classification of colon samples", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2014.
2. Saima Rathore, Mutawarra Hussain, Ahmad Ali, Asifullah Khan, "A recent survey on colon cancer detection techniques", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 3, pp. 545-563, 2013.
3. Saima Rathore, Mutawarra Hussain, Muhammad Aksam Iftikhar, Abdul Jalil, "Ensemble classification of colon biopsy images based on information rich hybrid features", Computers in Biology and Medicine, vol. 47, pp. 76-92, 2014.
4. Saima Rathore, Mutawarra Hussain, Asifullah Khan, "Hybridization of novel geometric and traditional features for classification of colon biopsy images", submitted for journal publication.
5. Saima Rathore, Mutawarra Hussain, Muhammad Aksam Iftikhar, Abdul Jalil, "Automatic detection and grading of colon cancer based on novel structural descriptors", submitted for journal publication.
6. Saima Rathore, Muhammad Aksam Iftikhar, Mutawarra Hussain, Abdul Jalil, "Classification of colon biopsy images based on novel structural features", in: Proceedings of the International Conference on Emerging Technologies, pp. 1-6, 2013.
7. Saima Rathore, Mutawarra Hussain, Asifullah Khan, "A novel approach for colon biopsy image segmentation", in: Proceedings of the International Conference on Complex Medical Engineering, pp. 134-139, 2013.
8. Saima Rathore, Muhammad Aksam Iftikhar, Mutawarra Hussain, Abdul Jalil, "A novel approach for ensemble clustering of colon biopsy images", in: Proceedings of Frontiers of Information Technology, pp. 25-30, 2013.


9. Saima Rathore, Muhammad Aksam Iftikhar, Mutawarra Hussain, "A novel approach for automatic gene selection and classification of gene based colon cancer datasets", in: Proceedings of the International Conference on Emerging Technologies, 2014.
10. Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Ahmad Ali, Mutawarra Hussain, "Brain MRI de-noising and segmentation based on improved adaptive nonlocal means", International Journal of Imaging Systems and Technology, vol. 23, no. 3, pp. 235-248, 2013.
11. Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Mutawarra Hussain, "Robust brain MRI de-noising and segmentation using enhanced non-local means algorithm", International Journal of Imaging Systems and Technology, vol. 24, no. 1, pp. 52-66, 2014.
12. Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Mutawarra Hussain, Ahmad Ali, "An extended non-local means algorithm for brain MRI de-noising", International Journal of Imaging Systems and Technology.


Contents

Declaration .................................................. iv
Certificate .................................................. v
Copyright Statement .......................................... vi
Acknowledgments .............................................. vii
List of Publications ......................................... viii
Abbreviations ................................................ xx
Abstract ..................................................... 1

1 Introduction ............................................... 2
  1.1 Motivation and objectives .............................. 7
  1.2 Research perspective ................................... 8
  1.3 Contributions .......................................... 8
  1.4 Thesis structure ....................................... 9

2 Related work in the field of automatic colon cancer detection .. 12
  2.1 Texture analysis based techniques ...................... 12
    2.1.1 General texture analysis based techniques .......... 13
    2.1.2 Object oriented texture analysis based techniques .. 14
  2.2 Hyperspectral analysis based classification ............ 18
  2.3 Gene analysis based classification ..................... 19
    2.3.1 Oligonucleotide microarrays based classification ... 20
    2.3.2 cDNA microarrays based classification .............. 21

3 Machine learning techniques ................................ 22
  3.1 Feature selection methodologies ........................ 22
    3.1.1 Minimum redundancy and maximum relevance (mRMR) .... 22
    3.1.2 F-Score ............................................ 24
    3.1.3 Principal component analysis (PCA) ................. 24
    3.1.4 Chi-square ......................................... 25
  3.2 Classification methodologies ........................... 26
    3.2.1 Support vector machines ............................ 26
    3.2.2 Probabilistic neural network ....................... 28
    3.2.3 K-Nearest neighbor ................................. 28
    3.2.4 Decision tree ...................................... 29
  3.3 Cross-validation methodology ........................... 29

4 Datasets, performance measures, and evaluation of some existing colon cancer detection techniques on these datasets .. 31
  4.1 Datasets ............................................... 31
    4.1.1 Dataset-A — Colon biopsy image based segmentation .. 32
    4.1.2 Dataset-B — Colon biopsy image based classification  32
    4.1.3 Dataset-C — Gene analysis based classification ..... 33
  4.2 Performance measures ................................... 35
    4.2.1 Classification techniques .......................... 35
    4.2.2 Segmentation techniques ............................ 39
  4.3 Evaluation of some existing colon cancer detection techniques .. 40
    4.3.1 Evaluation of colon biopsy image based segmentation techniques .. 40
    4.3.2 Evaluation of colon biopsy image based classification techniques .. 40
    4.3.3 Evaluation of gene based colon classification techniques .. 41
  4.4 Existing colon cancer datasets ......................... 41
  4.5 Chapter summary ........................................ 42

5 Proposed object oriented texture analysis based segmentation .. 43
  5.1 Proposed segmentation technique ........................ 44
    5.1.1 Object identification .............................. 45
    5.1.2 Feature extraction ................................. 50
    5.1.3 Region demarcation ................................. 50
    5.1.4 Genetic algorithm based parameter optimization ..... 52
  5.2 Results and discussions ................................ 53
  5.3 Chapter summary ........................................ 57

6 Proposed classification techniques ......................... 58
  6.1 Hybrid of traditional HOG and novel variants of statistical moments and Haralick features .. 58
    6.1.1 Feature extraction ................................. 59
    6.1.2 Feature selection .................................. 64
    6.1.3 Training/testing data formulation .................. 64
    6.1.4 Ensemble classification ............................ 65
    6.1.5 Results and discussion ............................. 65
  6.2 Hybrid of novel geometric features and some commonly used features .. 77
    6.2.1 Pre-processing ..................................... 78
    6.2.2 Feature extraction ................................. 79
    6.2.3 Feature concatenation .............................. 89
    6.2.4 Training/testing data formulation .................. 89
    6.2.5 Classification ..................................... 89
    6.2.6 Results and discussions ............................ 90
  6.3 Chapter summary ........................................ 95

7 Gene expression based ensemble classification of colon samples .. 97
  7.1 Proposed GECC technique ................................ 98
    7.1.1 Gene expression based feature vector formulation ... 98
    7.1.2 Gene expression profile reduction method ........... 99
    7.1.3 Training/testing data formulation .................. 100
    7.1.4 Majority voting based ensemble classification ...... 100
  7.2 Experimental results and discussions ................... 102
    7.2.1 Selection of discriminative gene expressions from datasets having high dimensionality .. 102
    7.2.2 Parameter selection for SVM classifiers pertaining to different gene selection strategies .. 104
    7.2.3 Selected gene expressions, and optimized SVM classifiers based ensemble classification .. 105
    7.2.4 Performance of GECC on standard colon cancer datasets .. 107
    7.2.5 Computational complexity of the GECC technique ..... 108
    7.2.6 Performance comparison of the GECC technique with some existing schemes and classifiers .. 109
    7.2.7 Performance analysis of the GECC technique on other complex gene expression based cancer datasets .. 111
  7.3 Chapter summary ........................................ 113

8 Proposed colon cancer grading techniques ................... 114
  8.1 White run-length features based grading of colon cancer .. 114
    8.1.1 White run-length features .......................... 115
    8.1.2 Results and discussion ............................. 117
  8.2 Lumen area features based grading of colon cancer ...... 118
    8.2.1 Lumen area based features .......................... 119
    8.2.2 Results and discussions ............................ 120
  8.3 Chapter summary ........................................ 125

9 Conclusion ................................................. 126
  9.1 Potential future directions ............................ 127
    9.1.1 Classification at multiple magnification factors ... 127
    9.1.2 Grading at multiple magnification factors .......... 127
    9.1.3 Testing on other biopsy types ...................... 128
    9.1.4 Combining output of multiple feature selection methods .. 128
    9.1.5 Classification of various cancer stages ............ 128
    9.1.6 Evaluation of existing colon cancer detection techniques .. 128
    9.1.7 Combining image based and gene based dataset ....... 128
    9.1.8 Deployment in histopathology labs .................. 129

List of Figures

1.1 Microscopic images of (a) normal, and (b) malignant colon biopsy samples, and (c) regular structure of normal colon tissue .. 3
1.2 Different stages of colon cancer: (a) stage 0, (b) stage I, (c) stage II, (d1, d2) stage III, and (e) stage IV .. 5
1.3 Microscopic images of (a1, a2) normal colon tissues, (b1, b2) well-, (c1, c2) moderate-, and (d1, d2) poorly-differentiable malignant colon tissues .. 6
3.1 Formulation of training/test data through 10-fold Jack-knife cross-validation .. 30
4.1 Distribution of age of cancerous patients .. 33
4.2 TIFF image generated by a microarray experiment showing gene expressions for various samples .. 34
5.1 Top level layout of the proposed MOOS technique .. 44
5.2 Top level layout of the segmentation module of the proposed MOOS technique .. 45
5.3 Image clustering: (a) a colon biopsy image, (b) white cluster, (c) pink cluster, and (d) purple cluster .. 46
5.4 Algorithm for detection of ellipses in connected components of different clusters .. 46
5.5 (a) Four generated ellipses, and (b) representation of a horizontal ellipse with SMJA = 4 and SMIA = 3 in terms of matrix entries .. 47
5.6 (a) Generated horizontal ellipse, (b) image pattern (full pattern match), and (c) image pattern (partial match 95.6%) .. 48
5.7 Object detection: (a) small portion of a colon biopsy image, (b) white cluster, (c) detected circular objects, and (d) detected elliptic objects .. 49
5.8 Region demarcation step of the segmentation module of MOOS technique .. 51
5.9 Segmentation results for homogeneous colon biopsy images: (a1-b1) OOSEG, (a2-b2) GRLM and (a3-b3) MOOS .. 55
5.10 Segmentation results for heterogeneous colon biopsy images: (a1-b1) OOSEG, (a2-b2) GRLM and (a3-b3) MOOS .. 56
6.1 Top-level layout of the proposed CBIC technique .. 59
6.2 Individual color components of RGB and HSV model for a colon biopsy image .. 61
6.3 Classification performance of HOG features as a function of several parameters: (a) H, (b) V, and (c) B .. 67
6.4 Classification performance of different variants of Haralick features as a function of size of GLCM matrix: (a) Haralick-GL, (b) Haralick-RGB and (c) Haralick-HSV .. 68
6.5 ROC curves for various hybrid feature vectors .. 71
6.6 Examples of normal, poorly-, moderately-, and well-differentiated malignant colon biopsy images that are correctly classified by the proposed CBIC system .. 72
6.7 The two well-differentiated malignant colon biopsy images that are incorrectly classified as normal colon biopsy images .. 72
6.8 Performance enhancement in terms of classification accuracy for individual and hybrid feature sets reduced through mRMR .. 74
6.9 Performance enhancement in terms of classification time for individual and hybrid feature sets reduced through mRMR .. 74
6.10 Top level layout of the proposed HFS-CC technique .. 78
6.11 Average gray level values for white, pink and purple clusters .. 79
6.12 Histopathological images of (a) normal and (b) malignant colon tissue .. 81
6.13 The process of computing geometric features: (a) OSU features, (b) OSDU features .. 83
6.14 Schematic diagram showing the steps involved in the extraction of lumen from colon biopsy images: (a1) normal colon biopsy image, (a2) malignant colon biopsy image, (b1-b2) corresponding white clusters, (c1-c2) eroded clusters, (d1-d2) connected components, (e1-e2) final extracted lumen after hole filling .. 84
6.15 Least square ellipse fitting: (a1) normal extracted lumen, (a2) malignant extracted lumen, (b1-b2) ellipses fitted to the extracted lumens, (c1-c2) deviation of extracted lumen from the fitted ellipse shown as vertical and horizontal lines .. 85
6.16 Process of extraction of EFD features .. 88
6.17 Features' concatenation into a single hybrid feature vector .. 89
6.18 Classification accuracy for different values of (a) area threshold T, and (b) harmonic level H for various values of number of selected objects E .. 91
6.19 Classification performance of the LGF features as a function of r .. 91
6.20 Examples of the normal, poorly-, moderately-, and well-differentiable malignant colon tissues, which are correctly classified by the proposed HFS-CC system .. 93
7.1 Top level layout of the proposed GECC technique .. 99
7.2 Plot of (a) eigenvalues, (b) number of eigenvalues required for a particular confidence interval, (c) classification accuracy of GECC on KentRidge dataset for different values of confidence interval .. 103
7.3 Classification accuracy of various classifiers for gene datasets selected by mRMR .. 104
7.4 (a) Classification accuracy and (b) corresponding CPU time on KentRidge dataset for different values of F, (c) classification accuracy and (d) corresponding CPU time on BioGPS dataset for different values of F .. 105
7.5 Predictions of individual SVM classifiers (collected through 10-fold cross-validation) and the proposed GECC for BioGPS dataset .. 106
7.6 ROC curves: (a) KentRidge dataset (using F-Score), and (b) BioGPS dataset .. 108
8.1 Top level layout of the CAD system used for the evaluation of WRL features .. 115
8.2 Schematic diagram showing the steps involved in the computation of WRL features from malignant colon biopsy images of various cancer grades, 1st row: malignant colon biopsy images, 2nd row: white clusters, 3rd row: eroded clusters .. 116
8.3 Top level layout of the CAD system used for the evaluation of lumen area based features .. 118
8.4 Schematic diagram showing the steps involved in the computation of lumen area based features from malignant colon biopsy images of various cancer grades, 1st row: malignant colon biopsy images, 2nd row: white clusters, 3rd row: eroded clusters, 4th row: connected components .. 119
8.5 (a) well-, (b) moderate-, and (c) poorly-differentiable images, which are correctly classified by the proposed features .. 123
8.6 Examples of a few malignant images, which are incorrectly classified by the proposed features .. 123
8.7 Performance comparison of the proposed system with existing colon cancer grading technique .. 124

List of Tables

4.1 Datasets used for the evaluation of proposed and existing colon cancer detection techniques .. 32
4.2 Statistics of dataset-A used for the evaluation of colon biopsy image based segmentation techniques .. 32
4.3 Statistics of dataset-B used for the evaluation of colon biopsy image based cancer detection and grading techniques .. 33
4.4 Confusion matrix for two classes (a and b) .. 36
4.5 Confusion matrix for three classes (a, b and c) .. 36
4.6 Some existing colon cancer datasets .. 42
5.1 Values of GA variables used while computing optimal values of system parameters .. 53
5.2 Optimal values of system parameters for different magnification factors calculated through GA .. 54
5.3 Performance comparison of OOSEG, GRLM and MOOS (M=Homogeneous images, T=Heterogeneous images) .. 57
6.1 Number of features selected by mRMR .. 66
6.2 Performance analysis of HOG features .. 69
6.3 Performance analysis of Haralick-GL, Haralick-RGB and Haralick-HSV features .. 69
6.4 Performance analysis of CCSM features .. 70
6.5 Performance analysis of hybrid feature sets .. 70
6.6 Computational time required for the classification of various feature sets .. 73
6.7 Optimal values of parameters for existing colon biopsy image based classification techniques determined for dataset-B .. 76
6.8 Performance comparison of the proposed CBIC with existing techniques .. 77
6.9 Performance analysis of individual feature extraction strategies by using RBF SVM .. 92
6.10 Performance analysis of hybrid feature sets .. 93
6.11 Computational time required for the classification of various feature sets .. 94
6.12 Comparison of HFS-CC with some existing colon cancer detection techniques .. 95
7.1 Number of gene expressions selected by various feature selection strategies for different datasets .. 104
7.2 Performance analysis for various combinations of classifiers and feature selection strategies .. 107
7.3 Performance comparison of classifiers in terms of AUC .. 109
7.4 Computational time requirements of GECC (sec) .. 110
7.5 Performance comparison of GECC with some existing schemes and classifiers in terms of classification accuracy .. 111
7.6 Binary class gene expression datasets .. 112
7.7 Number of genes selected by feature selection strategies for the gene expression datasets given in Table 7.6 .. 112
7.8 Performance of GECC on the gene expression datasets given in Table 7.6 .. 113
8.1 Classification performance of the proposed WRL features for grading of colon cancer on dataset-B .. 117
8.2 Confusion matrix of LCR .. 121
8.3 Confusion matrix of LIR .. 121
8.4 Confusion matrix of LCR+LIR .. 121
8.5 Classification accuracy of the LCR and LIR features .. 122
8.6 Classification results of the LCR and LIR features in terms of sensitivity, specificity, and F-Measure .. 122

Abbreviations

AUC       Area under the curve
CBIC      Colon biopsy image classification
CCSM      Color component based statistical moments
CLBP      Circular local binary patterns
CNS       Central nervous system
DLBCL     Diffuse large B-cell lymphoma
DNA       Deoxyribonucleic acid
ECGF      Epithelial cells based geometric features
EFDs      Elliptic Fourier descriptors
FIR-ELM   Finite impulse response - Extreme learning machine
GA        Genetic algorithm
GECC      Gene expression based ensemble colon classification
GLCM      Gray-level co-occurrence matrix
GRLM      Graph run-length matrix
H&E       Hematoxylin & Eosin
HFS-CC    Hybrid feature space based colon classification
HOG       Histogram of oriented gradients
HSV       Hue, saturation, and value
KNN       K-nearest neighbor
LCR       Lumen cluster ratio
LDA       Linear discriminant analysis
LGF       Lumen based geometric features
LIC       Lumen inner concavity
LIR       Lumen image ratio
LOC       Lumen outer convexity
LOO       Leave one out
LUC       Lumen circularity
MCC       Matthews correlation coefficient
MOOS      Modified object oriented segmentation
mRMR      Minimum redundancy and maximum relevance
OOSEG     Object oriented segmentation
OSDU      Object spatial distribution uniformity
OSU       Object size uniformity
PCA       Principal component analysis
PNN       Probabilistic neural network
RBF       Radial basis function
RGB       Red, green, and blue
RNA       Ribonucleic acid
ROC       Receiver operating characteristic
SLFN      Single hidden layer feed-forward neural network
SVM       Support vector machine
VAO       Variation in the area of objects
VSOO      Variation in the spatial orientation of objects
WRL       White run-length

Abstract

In the past two decades, automatic colon cancer detection has become an active research area. Traditionally, colon cancer is diagnosed through microscopic analysis of pathological tissue imagery. However, the process is subjective and leads to considerable inter- and intra-observer variation in diagnosis. Therefore, reliable computer-aided colon cancer diagnostic systems are in high demand. In this thesis, a computer-aided diagnostic (CAD) system for colon cancer is proposed that comprises three main phases. In the first phase, an unsupervised colon biopsy image segmentation technique is developed, based on a few novel extensions of the traditional object oriented texture analysis based segmentation technique. The second phase deals with the classification of colon image based and gene expression based datasets into normal and malignant classes. For the colon biopsy image based datasets, two classification techniques based on the hybridization of various features are proposed. These techniques use traditional features such as morphological and texture features, variants of traditional features, and some novel features especially designed to capture the variation between normal and malignant colon tissues. Similarly, for the gene expression based dataset, a novel technique is proposed that utilizes various feature selection strategies to address the challenging high dimensionality of gene based datasets, together with a weighted majority voting based ensemble of various SVM classifiers for performance improvement. In the third phase, the structural variation in the shape of the lumen among various colon cancer grades is quantified in terms of a few novel structural features, which are used for the classification of malignant colon biopsy images into various cancer grades.

The performance of the proposed diagnostic system has been validated on various datasets, and superior qualitative and quantitative performance has been observed compared to previously reported methods of colon cancer detection.


Chapter 1

Introduction

Medical imaging has gained importance in the last few decades, especially in analyzing different body parts to predict certain disorders and diseases. Microscopic imaging is one such technique, wherein images of biopsy slides are captured. Biopsy images have a well-defined organization of tissues and connected components, depending upon the body part from which they are taken [1]. The same is true for colon biopsy images, which are used in our studies for cancer detection. Biologically different constituents in a colon biopsy image can be identified by looking at their spatial organization.

The colon is a major part of the large intestine, and colon cancer has become a major cause of death in the modern, industrialized world. It is the second leading cause of cancer-related deaths in the US, and the worldwide death toll has risen to about 0.5 million per year [2]. There are several risk factors for colon cancer, largely associated with traits, habits, and diet [3]. Age is a major factor, as colon cancer usually starts to develop in people over 50 years of age. The risk is high for people who are inactive, overweight, or heavy smokers. An unbalanced diet is another major factor: low-fiber diets, and diets high in red meat, fat, and calories, increase the risk of colon cancer many times over. Heavy intake of alcohol also increases the risk, as does a family history of colon cancer [3].

The common and traditional method of colon cancer diagnosis is microscopic analysis of colon biopsy samples. In such an examination, histopathologists analyze the biopsy samples under a microscope and diagnose the tissue as normal or malignant based on the morphology


of tissues. Normal and malignant tissues contrast sharply in their morphology. Normal colon tissues have a well-defined structure. Figure 1.1 (a) presents a microscopic image of a normal colon biopsy sample, wherein all the tissues possess a regular structure. The detailed regular structure of a normal colon tissue is shown in Figure 1.1 (c): a normal colon tissue has three constituents, namely, epithelial cells, non-epithelial cells, and lumen. Epithelial cells usually surround the lumen and form glandular structures, whereas non-epithelial cells, called stroma, lie in between these structures. Cancer, however, heavily disturbs the structure of colon tissues, and makes it almost amorphous. The deformation introduced by cancer is clearly visible in the microscopic image of a malignant colon biopsy sample shown in Figure 1.1 (b). Normal and malignant colon tissues have similar colors, but the distribution of colors varies. Further, normal colon tissues have well-defined structures, such as elliptic-shaped epithelial cells and lumen, whereas malignant colon tissues do not have any regular structure. In a normal colon tissue, the different constituents are organized in a sequence; in a malignant colon tissue, on the other hand, all the constituents mix with each other, thereby diminishing the boundaries between them.

Figure 1.1: Microscopic images of (a) normal, and (b) malignant colon biopsy samples, and (c) the regular structure of a normal colon tissue (annotated constituents: stroma, lumen, gland boundary, and epithelial cells)


Colon biopsy images are either homogeneous or heterogeneous. Homogeneous colon biopsy images contain either normal or malignant tissues, along with connecting tissues. For example, the colon biopsy images in Figure 1.1 (a) and (b) are homogeneous. On the contrary, heterogeneous colon biopsy images contain both normal and malignant tissues, along with connecting tissues.

Histopathologists quantify the severity of colon cancer in terms of two quantitative measures, namely, grades and stages. The stage of colon cancer measures the extent to which the cancer has spread within the colon or to other parts of the body. It is determined by the penetration of the cancer through the lining of the colon, the involvement of the lymph nodes, and the spread of the cancer to other parts of the body. Stages of colon cancer are quantified based on Dukes' scale [4].

Stage 0 is the earliest possible stage of colon cancer, and is also called carcinoma in situ. As the name suggests, stage 0 cancer is confined to the place where it started to develop. Since it is still restricted to the innermost lining of the colon, stage 0 cancer can easily be removed, either by polypectomy through colonoscopy or by surgical treatment. In stage I, the cancer cells have grown further into the lining of the colon but have not yet spread beyond its muscular coat. The prescribed treatment for stage I patients is resection of the colon, in which the affected portion is removed. Stage II colon cancer has penetrated beyond the muscular coat of the colon, and has even spread to the nearby tissues. Stage III is an advanced stage, wherein the cancer has reached the lymph nodes. The suggested treatment for stage II and III patients is resection of the colon followed by some form of chemotherapy, in order to totally eliminate the cancerous cells from the nearby tissues and lymph nodes.
Stage IV can be considered the final stage of colon cancer, wherein the cancer has spread to distant organs such as the liver, lungs, and ovaries. When colon cancer has reached stage IV, surgical treatment is generally planned to relieve or prevent further complications rather than to cure the patient. The different stages of colon cancer are shown in Figure 1.2, which has been used with the permission of its owner (Terese Winslow).


Figure 1.2: Different stages of colon cancer: (a) stage 0, (b) stage I, (c) stage II, (d1, d2) stage III, and (e) stage IV

© Terese Winslow, TERESE WINSLOW LLC, Website: http://www.teresewinslow.com, Email: [email protected]

On the contrary, the grade of colon cancer is a measure of the differentiation level and growth rate of malignant cells. 'Well-differentiated' is the lowest possible grade of colon cancer, wherein malignant cells appear nearly identical to normal cells, and have a propensity to grow and spread very slowly. The next grade is 'moderately-differentiated', wherein cancer cells appear more abnormal and have a propensity to grow and spread slightly more quickly. The final grade is 'poorly-differentiated', wherein malignant cells appear highly abnormal, and tend to grow and spread to the colon and other body parts very quickly. Figure 1.3 presents microscopic images of malignant colon biopsy samples with poor-, moderate-, and well-differentiable cancer grades.

Figure 1.3: Microscopic images of (a1, a2) normal colon tissues, (b1, b2) well-, (c1, c2) moderate-, and (d1, d2) poorly-differentiable malignant colon tissues

The determination of the grades and stages of colon cancer is a manual process. In order to determine the cancer grade, histopathologists analyze the biopsy samples under a microscope and assign quantitative cancer grades depending upon the morphology of the malignant tissues. A cancer stage, on the other hand, is determined by microscopic analysis of separate biopsy samples taken from different layers of the colon and the lymph nodes. This manual process of colon cancer detection has a few limitations. For instance, it leads to inter- and intra-observer variability [5, 6]. Moreover, the process is subjective, and may lead to biased opinions due to the workload and experience level of histopathologists. Furthermore, it


consumes precious time of the histopathologists, as they have to analyze many samples per day. Therefore, an accurate computational system for automatic colon cancer detection is highly desirable.

The automated diagnosis of colon cancer has three major stages, namely, segmentation, classification, and grading. Segmentation is the first stage, in which a colon biopsy image is segregated into biologically different (normal and malignant) regions. Accurate segmentation techniques help pathologists localize the cancer precisely, which in turn helps in deciding surgical treatment plans. Classification refers to the extraction of valuable features from colon biopsy images, and the assignment of normal or malignant classes to the samples based on the feature values. Grading of colon biopsy images is the stage succeeding classification, wherein malignant colon biopsy images are assigned well-, moderate-, and poor-differentiable cancer grades based on the feature values extracted from these images.

1.1 Motivation and objectives

Colon cancer is usually diagnosed through visual examination of histopathology slides under the microscope by expert histopathologists. However, the process has a few limitations. First, the manual examination consumes significant time of the histopathologists, as they have to analyze many biopsy slides per day. Second, the manual examination of biopsy slides suffers from notable inter- and intra-observer variability in the diagnosis: the results for the same biopsy slide may vary from person to person, and even for the same person at different times. The research community therefore seriously lacks accurate as well as robust techniques for automated colon cancer diagnosis.

In the past, some automatic colon cancer detection techniques have been proposed, but these techniques suffer from a few drawbacks. For instance, some of the techniques are computationally expensive, and consume considerable CPU time in the feature extraction and classification stages. Also, most of the techniques are deficient in terms of discriminating features. Furthermore, previous techniques have utilized features of only one type, and have not utilized multiple feature types simultaneously to obtain a more robust and discerning feature set. Therefore, automatic colon biopsy image classification techniques that are computationally tractable and simultaneously rich in terms of discerning features are highly needed. The main objective of this research study is thus to provide an automatic system for the examination of biopsy slides, which could


provide a reliable second opinion to the histopathologists in the least possible time.

1.2 Research perspective

In this research work, an automated diagnostic system has been proposed for the segmentation, detection, and grading of colon cancer. The system is capable of recognizing and classifying colon biopsy images accurately and efficiently. The research work presented in this thesis focuses on the use of various novel feature types, which incorporate background information about the structure/geometry of normal colon tissue and of malignant colon tissues of various cancer grades. These features set a new line of research for the medical image analysis community. Furthermore, this research will help histopathologists by providing a reliable second opinion. The diagnostic system proposed in this dissertation will not only help make histopathology departments automated, reliable, and fast, but will also aid in improving the efficiency and outcomes of patient care.

1.3 Contributions

This thesis contributes to the field of automated diagnosis of colon cancer. In this context, an automatic colon cancer diagnostic system has been proposed. Some of the major contributions and key findings of this research work are as follows.

• An unsupervised modified object-oriented segmentation technique (MOOS) has been presented as an improved variant of the traditional object-oriented segmentation technique. The proposed MOOS technique yields better performance compared to its various counterparts.

• Colon biopsy images captured at higher magnifications are zoomed in, and therefore capture a smaller biopsy area, compared to images captured at lower magnifications. Due to these variations, a given segmentation/classification technique may not yield good results at all magnifications. The proposed MOOS technique has been tested on four different magnifications, and has shown good results at all the magnification factors.


• Various novel features have been introduced for capturing the variation between the morphology of normal and malignant colon tissues, and amongst tissues with different colon cancer grades. These features may not only be helpful for people working in the field of colon cancer, but may also be applied to the detection of other cancer types having a similar structure.

• The proposed system may provide a useful second opinion to the histopathologist for the segmentation, classification, and grading of colon biopsy images.

• The proposed system may relieve the burden of the histopathologists to a great extent. The automatic diagnostic system may reduce the amount of resources which are otherwise needed.

• The automated diagnosis eliminates the issues of inter- and intra-observer variation in the diagnosis of colon cancer.

• Several parameters involved in the proposed techniques have been optimized for better performance. Consequently, the proposed techniques produce good results on the image based and gene based colon cancer datasets, and can be utilized in practical medical applications.

• The results reveal that hybrid and rich feature spaces certainly improve classification performance compared to their individual counterparts. Further, ensembles constructed from the decisions of individual classifiers enhance the overall prediction performance.

• Various simulation results reveal that most of the feature selection methodologies select a smaller subset of features and enhance the performance of the classification system by boosting classification accuracy and cutting down the computational time.

1.4 Thesis structure

The rest of the thesis is divided into several chapters: some chapters address distinct units of the research that we have carried out in the field of colon cancer, and the others describe either the literature or preliminary machine learning techniques.

Chapter 2 of the thesis describes the traditional and contemporary methodologies of colon cancer diagnosis. These methodologies have been broadly divided into several categories, and multiple techniques within each category have been discussed.

Chapter 3 presents preliminary material on the machine learning methodologies that have been employed as part of the proposed cancer detection and grading techniques. This material includes details of the feature selection strategies, classification models, and the cross-validation methodology.

Chapter 4 of the thesis provides details of the datasets used for the evaluation of colon cancer detection techniques. These details include the data acquisition method, total number of samples, distribution of samples among different classes, age of patients, etc. In addition, Chapter 4 describes the performance measures used for the evaluation of the proposed 2-class (cancer detection) and multi-class (cancer grading) classification techniques. Furthermore, this chapter investigates the performance of some existing colon cancer detection techniques.

Chapter 5 presents the proposed unsupervised MOOS technique, which incorporates background knowledge about normal and malignant tissues into the segmentation process. It locates elliptic objects in four orientations, and divides the objects into different categories based on their sizes. It further calculates a few features based on the spatial distribution and size of these objects. Finally, regions are identified based on the values of these features.

Chapter 6 of the thesis presents two methodologies for the classification of colon biopsy images into normal and malignant classes. In these techniques, various novel features have been proposed, which capture the variation between normal and malignant colon tissues.
In these techniques, individual features have been combined to develop a rich hybrid feature space, and the performance of individual and hybrid feature spaces has been studied in detail. Furthermore, the original features have been reduced wherever necessary, and the performance of original and reduced features has been compared in terms of computational time requirements and the resultant classification rate.

Chapter 7 of the thesis presents a classification technique for gene expression based colon datasets. In this research work, discerning genes have been selected using various feature selection methodologies. The reduced gene sets have been classified into their respective classes by using an ensemble of SVM classifiers.

Chapter 8 of the thesis presents novel features for the grading of malignant colon biopsy images into well-, moderate-, and poorly-differentiable cancer grades. These features quantify the

deviation that epithelial cells and lumen undergo in malignant colon tissues, for the identification of various cancer grades. The results demonstrate that the proposed features have good grading capability.

The research work is concluded in Chapter 9. A few potential future directions, which may be helpful in carrying out further research in the field of colon cancer, have also been identified in this chapter.


Chapter 2

Related work in the field of automatic colon cancer detection

In the last two decades, several techniques have been proposed by researchers for the automatic detection of colon cancer. Though the techniques vary in terms of the underlying dataset and the adopted methodology, they all share the common goal of automatic cancer detection without the subjective intervention of the pathologist. These techniques can be broadly divided into three major categories depending upon the underlying dataset and adopted methodology. These categories are listed below.

1. Texture analysis based techniques

2. Hyperspectral analysis based techniques

3. Gene analysis based techniques

The following text describes these categories, and the various colon cancer detection techniques within each category.

2.1 Texture analysis based techniques

These techniques exploit the textural variation between normal and malignant colon tissues for their classification into respective classes. In particular, the texture analysis of colon biopsy images is characterized by the extraction of discerning features from the images. The

extracted features are then used as input to different classifiers for identifying normal and malignant images. Some techniques exploit traditional texture features, such as entropy and correlation, for the identification of texture; these are called general texture analysis based techniques. Other techniques exploit background knowledge about the distribution and size of tissue components in colon biopsy images, and are called object-oriented texture analysis based techniques. Both categories are described in the following text.

2.1.1 General texture analysis based techniques

Texture is a combination of repeated patterns with regular/irregular frequency [7]. There is a significant variation in the texture of normal and malignant colon tissues, and general texture analysis based techniques exploit this variation in terms of a few traditional texture features for classification purposes. Various computer-aided techniques of this kind have been proposed in the past two decades.

The first renowned texture analysis based technique was proposed by Esgiar et al. [8]. In this work, colon biopsy images of size 512x512 have been sub-divided into four sub-images of size 256x256, and sub-images having no clear texture information have been excluded from further experimentation. Gray-level co-occurrence matrices (GLCM) are calculated from each of the sub-images. The GLCMs are normalized, and the texture features of angular second moment, contrast, correlation, inverse difference moment, dissimilarity, and entropy are computed from the normalized matrices. K-nearest neighbor (KNN) and linear discriminant analysis (LDA) classifiers have been used, and the best accuracy of 90.20% has been achieved for a combination of entropy and correlation using the LDA classifier. Esgiar et al. proposed another texture analysis based technique for the automatic detection of colon cancer, wherein geometric features of shape and orientation, and texture features of energy, inertia, and uniformity, are extracted from colon biopsy images [9]. The reported classification accuracy is 80% and 90% using the geometric and texture features, respectively. In their next work [10], Esgiar et al. combined image fractal dimensions with the previously used features of entropy and correlation, and employed KNN and LDA classifiers for classification. The results showed that combining the texture features with image fractal dimensions enhanced the classification accuracy up to 94.10%.

Recently, Jiao et al. proposed a straightforward and computationally efficient technique for the automatic identification of normal and malignant colon biopsy images [11]. In this work, the texture features of contrast, correlation, and entropy are calculated from GLCMs, which are computed in four directions from the input gray-scale colon biopsy images. The first and second order statistical moments are also computed from the images. A polynomial SVM classifier with 3-fold cross-validation has been used, and a classification success rate of 96.67% has been achieved.
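To make the GLCM-based features used throughout these works concrete, the sketch below computes a co-occurrence matrix and the contrast, entropy, and correlation features in plain numpy for the four directions mentioned above. It is a minimal illustration on a random stand-in patch under our own naming, not code from the cited works.

```python
import numpy as np

def glcm(img, dx, dy, levels=8):
    """Normalized gray-level co-occurrence matrix for one displacement (dx, dy)."""
    h, w = img.shape
    m = np.zeros((levels, levels), dtype=float)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m / m.sum()  # joint probabilities of gray-level pairs

def texture_features(p):
    """Contrast, entropy, and correlation of a normalized GLCM p."""
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum(p * (i - mu_i) ** 2))
    sd_j = np.sqrt(np.sum(p * (j - mu_j) ** 2))
    correlation = np.sum(p * (i - mu_i) * (j - mu_j)) / (sd_i * sd_j)
    return contrast, entropy, correlation

# random stand-in for an 8-level quantized gray-scale biopsy patch
rng = np.random.default_rng(0)
patch = rng.integers(0, 8, size=(64, 64))
# the four displacements correspond to the 0, 45, 90, and 135 degree directions
feats = [texture_features(glcm(patch, dx, dy))
         for dx, dy in [(1, 0), (1, -1), (0, 1), (1, 1)]]
```

In practice the four directional feature vectors are either concatenated or averaged before being fed to a classifier such as KNN, LDA, or an SVM.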

2.1.2 Object oriented texture analysis based techniques

These techniques treat the components of a colon tissue as objects, and exploit background knowledge about the size and spatial distribution of these objects for the segmentation and classification of colon biopsy images; hence the name object-oriented texture analysis based techniques. They are further divided into segmentation and classification techniques.

Segmentation techniques: Segmentation of colon biopsy images is extremely hard due to the similar colors of normal and malignant colon biopsy images; therefore, researchers have exploited information about the tissue components (objects) for segmentation purposes. The first object oriented texture analysis based segmentation technique, called OOSEG, was proposed by Tosun et al. [1]. OOSEG has four major phases, namely, pre-processing, object definition, texture quantification, and segmentation. In the pre-processing phase, the K-Means [12] clustering algorithm is applied to the color intensities of pixels in order to obtain three clusters, corresponding to white-colored epithelial cells and lumen, pink-colored connecting tissues, and purple-colored stroma. In the object definition phase, circular objects are found separately in the three clusters by using the circle fitting algorithm proposed in this work. The objects of each cluster are divided into two object types based on an object area threshold, resulting in a total of six object types at the end of the object definition phase. In the texture quantification phase, two features, namely, object size uniformity and object spatial distribution uniformity, are computed for each pixel by considering the six object types lying in a circular window around the pixel. This procedure results in twelve features for each pixel. In the segmentation phase, pixels whose feature values are all below the corresponding thresholds (sum of mean and standard deviation) are declared as seeds.
The seeds are grown in successive iterations until they span the entire image. Regions (seeds) are merged in the

end based on the percentage area of each object type in a region, and the percentage of the combined areas of objects belonging to a particular cluster type in the region.

Tosun et al. proposed a few modifications to the OOSEG technique for better segmentation of colon biopsy images. In the modified OOSEG [13], the clustering and object definition phases are the same, but the next two phases are based on objects instead of pixels. In the texture quantification phase, features are computed for each object as is done for the pixels in OOSEG. In the segmentation phase, a Voronoi diagram is constructed on the centroids of the objects, and any two adjacent objects are grouped if the Euclidean distance between their features is smaller than a predefined similarity threshold. Groups having a number of objects larger than a threshold are then declared seeds. The seed regions are iteratively grown by adding more objects until they span the whole image. Since the final regions comprise objects, a Voronoi diagram of the objects is drawn in the end to convert object based regions into pixel based regions.

Demir et al. proposed a valuable colon biopsy image segmentation technique [14]. In the object detection phase of this work, circular objects are detected in the purple and white clusters in order to locate “nucleus” and “lumen” objects, respectively. In the graph generation phase, two graphs, namely, a lumen graph and a nucleus graph, are constructed. In the lumen graph, lumens are considered as nodes, and edges are assigned between each lumen object and a predefined number of its closest lumen and nucleus neighbors. In the nucleus graph, nucleus objects are considered as nodes, and edges are assigned between each node and a predefined number of its closest nodes. In the feature extraction phase, a circular window is placed on each lumen object, say L, of the lumen graph, and only those neighbors which lie inside the window are considered for calculating the features of L.
Features include the areas of the neighbors, the lengths of the edges, and the angles between L and its neighbors. These features are given as input to the K-Means algorithm for dividing lumen objects into 'gland' and 'non-gland' classes. Lumen objects belonging to the 'gland' class are treated as initial seeds, and adjacent lumen objects are iteratively added to the seeds until an edge of the nucleus graph is encountered. The approach uses edges of the nucleus graph to stop region growing because glands are usually encircled by nucleus objects, and encountering a nucleus object means that the gland boundary has been reached. In the end, false glands are eliminated based on a few criteria, and the final segmentation is achieved.

Efforts continued in this domain, and Tosun et al. also proposed a graph based segmentation technique for colon biopsy images [15]. Circular objects are detected by using the same process as discussed in the previous techniques, and an object graph is constructed on the objects detected in all the clusters. In the feature extraction phase, different types of edges are defined based upon their ending nodes. Since there are three types of nodes, corresponding to the white, pink, and purple clusters, there are a total of six edge types, i.e., white-white, white-pink, white-purple, purple-pink, purple-purple, and pink-pink. For each node in the graph, a graph run-length matrix (GRLM) is calculated, which records the graph-edge run lengths of each edge type lying within a window around the node. A graph-edge run is a path that starts from an initial node and covers all nodes reachable through a set of edges of the same type. Later, an accumulated GRLM is calculated for each node by combining the GRLMs of all the objects lying in the window around the node. Four texture features, namely, short path emphasis, long path emphasis, edge type non-uniformity, and path length non-uniformity, are computed from the accumulated GRLM of each node. In the seed identification phase, initially all the objects are considered joint. The pairs of adjacent objects whose feature values have a Euclidean distance greater than a distance threshold are then disjointed to obtain connected regions in the image. Image regions having fewer objects than a predefined threshold are removed to get the initial seeds. These seeds are grown by adding more adjacent objects if their Euclidean distance is smaller than a merge threshold, and the seed growing process continues until the regions span the entire image. Object based regions are transformed into pixel based regions in the end by constructing a Voronoi diagram on the objects.

Recently, Simsek et al.
proposed a segmentation technique for colon biopsy images [16], wherein the object detection and graph generation processes are the same as in the previous works [15]. A co-occurrence matrix that stores the number of times an object pair occurs at a certain distance 'd' is calculated for each object by considering the objects in a window around it. Since there are six object pairs corresponding to the three object types, the co-occurrence matrix has six rows, and a predefined number of columns depending upon 'd'. Four texture features are computed for each object pair, resulting in a total of twenty-four features for each object. In the segmentation phase, random objects are selected to generate graphs, and segmentation is achieved on these graphs. In the end, multiple segmentation results are combined to achieve the final segmentation.
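All of the object-oriented techniques above begin by clustering pixel colors into white, pink, and purple tissue clusters with K-Means. A minimal sketch of that pre-processing step is given below; the implementation and the toy color values loosely mimicking H&E staining are our own illustration, not the authors' code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means on N x 3 color vectors; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign every pixel to its nearest centroid, then re-estimate centroids
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# toy RGB pixels loosely mimicking the three tissue colors in H&E-stained images:
# whitish (epithelial cells/lumen), pinkish (connecting tissue), purplish (stroma)
rng = np.random.default_rng(1)
white = rng.normal([230, 230, 230], 8.0, (300, 3))
pink = rng.normal([230, 150, 170], 8.0, (300, 3))
purple = rng.normal([130, 80, 160], 8.0, (300, 3))
pixels = np.vstack([white, pink, purple])
labels, cents = kmeans(pixels, k=3)
```

The resulting pixel labels would then feed the circle/ellipse fitting stage that turns each cluster into discrete tissue objects.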


Classification techniques: Object oriented texture analysis has also been applied by researchers for the classification of colon biopsy images. For instance, Altunbay et al. proposed the first object oriented texture analysis based technique for the classification of colon biopsy images into normal and malignant (low and high grade) classes [17]. In this work, an object graph is generated on the circular objects detected in all the clusters, as already done by Tosun et al. [15]. In such an object graph, there are a total of three object types, i.e., white, pink, and purple, and six edge types, i.e., white-white, white-pink, white-purple, purple-pink, purple-purple, and pink-pink. In the feature extraction phase, the structural features of degree, diameter, and average clustering coefficient are extracted from the graph. The degree is the number of edges of a node; seven degree based features are computed for each node, where the first feature considers all edge types, and the other six consider specific edge types. The clustering coefficient measures the connectivity of a node in its neighborhood; four types of clustering coefficients are computed for each node, where the first considers all node types, and the other three consider nodes of one particular type. The diameter is the longest of the shortest paths between any pair of nodes; seven types of diameters are computed for each node in a similar fashion to the degree based features. The extracted features were combined for the classification of colon biopsy images into normal, low grade, and high grade cancer classes by using a linear SVM, and a classification success rate of 82.65% was observed.

Recently, Ozdemir et al. presented a resampling based Markovian model for the classification of colon biopsy images into normal tissues, and low grade and high grade cancerous tissues [18]. In this work, perturbed samples (images) are generated from the original image. A first-order discrete Markov model is employed to determine the posterior probabilities of all the classes for a given perturbed sample, and the class having the highest posterior probability is assigned to that sample. Finally, majority voting is employed to combine the classes of the individual perturbed samples, and to determine the class of the original test sample. The reported average classification accuracy is 90.66% on a colon biopsy image based dataset.

Ozdemir et al. presented another valuable method for the automatic identification of normal colon tissues, and low grade and high grade cancerous tissues [19]. This method is also based on the concept of circle fitting and object graph generation, similar to the work presented by Altunbay et al. [17]. In this work, object graphs of a few normal colon biopsy images

(training images) are generated, and are stored in a database for later referencing; these object graphs are called reference graphs. In the testing phase, query graphs are generated from the test images, and are searched for in the reference graphs by placing the central node of a query graph on each node of a reference graph. The three most similar graphs are found, and then, based on the level of similarity with the identified reference graphs, the normal or malignant class is assigned to the test sample. The reported classification rate is 92.21% on a colon biopsy image based dataset.
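The structural graph features described above (degree, clustering coefficient, and eccentricity/diameter) can be sketched on a plain adjacency-set representation. The five-node toy graph below is hypothetical, standing in for detected tissue objects; it is our own illustration, not data or code from the cited works.

```python
from collections import deque
from itertools import combinations

def degree(adj, v):
    """Number of edges incident on node v."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Fraction of v's neighbor pairs that are themselves connected."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

def eccentricity(adj, v):
    """Longest shortest path from v (via BFS); its maximum over all nodes is the diameter."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# hypothetical 5-object graph: nodes stand for detected tissue objects,
# edges link neighboring objects
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
feats = {v: (degree(adj, v), clustering_coefficient(adj, v)) for v in adj}
diameter = max(eccentricity(adj, v) for v in adj)
```

In the cited works, per-edge-type variants of these quantities are computed on the colored object graph and concatenated into the feature vector fed to the SVM.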

2.2 Hyperspectral analysis based classification

Hyperspectral analysis based techniques operate on selected spectral bands of colon biopsy images to identify normal and malignant tissues. Hyperspectral data of colon biopsy images is collected using a hyperspectral imaging setup that consists of a tuned light source. Rajpoot et al. proposed one of the earliest hyperspectral analysis based techniques for colon cancer detection [20]. In this work, hyperspectral image cubes of colon tissues of size 1024x1024x20 were acquired from a hematoxylin & eosin (H&E) stained microarray. The work comprises four main steps. In the first step, the dimensionality of the cube is reduced using FlexIA. In the second step, the extracted components are given as input to the K-Means algorithm to produce a 1024x1024 labeled image for each cube. In the third step, the morphological features of area, diameter, extent, orientation, solidity, eccentricity, Euler number, and major and minor axis are computed, initially for 16x16 image patches. The patch size is then increased gradually up to 256x256, and the same set of features is extracted from the patches, so as to capture both local and global level information. This way, 4096 features were extracted from each image cube, resulting in a set of 45,056 features for the 11 selected cubes. Separate sets of 30,000 and 15,056 features have been used for training and testing, respectively. In the fourth step, various kernel functions of the SVM, such as Gaussian, linear, and polynomial, have been used, and a classification rate of 87.5% has been achieved using the Gaussian kernel.

In 2006, Masood et al. used texture and morphological features for colon tissue classification [21]. The first two steps of this work are similar to those of Rajpoot et al. [20]. In the third step, morphological and texture features are extracted. The morphological features are the same as those extracted by Rajpoot et al. [20], whereas the texture features of energy,
contrast and homogeneity are calculated from the GLCMs of 64x64 image patches. The GLCMs are calculated for all possible combinations of distance d = 1, 2 and orientation θ = 0°, 45°, 90°, 135°. In the fourth step, PCA and LDA have been used to classify images by using the morphological features, and a maximum accuracy of 84% has been achieved. Similarly, a polynomial SVM has been used for the classification of images by using the texture features, and a classification accuracy of 90% has been achieved.

Circular local binary patterns (CLBP) have also been used for the classification of colon tissues into normal and malignant classes [22]. In this work, CLBP(r,n) features are computed for 33 different combinations of radius r = 2, 3, ..., 12 and number of neighbors n = 8, 12, 16. The core novelty of this work is the identification of discerning features without actually evaluating them using a classifier. Three measures, namely, the scatter index (C), Rand index (R), and silhouette index (S), have been proposed for this purpose. C is a measure of the compactness of the clusters formed by a set of features, S is a measure of the similarity of each point to the points of its own cluster and to the points of other clusters, and R measures how similar two clusterings are to each other. Finally, a composite index is computed by weighted averaging of the three measures. The composite index showed that CLBP(5,8) and CLBP(5,12) are the most discerning features. These features were then given as input to the LDA, PCA, and SVM classifiers, and 87.5% and 90.6% classification accuracy was achieved with SVM by using CLBP(5,8) and CLBP(5,12), respectively.

Moreover, Chaddad et al. performed classification of multi-spectral colon biopsy images [23]. In the first phase of this work, the colon biopsy image is segmented by employing a modified version of the classical Snake algorithm [24].
In the second phase, texture features of entropy, correlation, energy, homogeneity, and contrast are extracted from the segmented portion of the image. Finally, the images are classified into normal and malignant categories based on the extracted features.
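As a concrete illustration of the GLCM texture features used in these works, the following sketch computes energy, contrast, and homogeneity from a normalized co-occurrence matrix for one distance/orientation pair. This is an assumption-based reimplementation in Python/NumPy, not the authors' code; the function name `glcm_features` is hypothetical, and grey levels are assumed to be pre-quantized to `[0, levels)`.

```python
import numpy as np

def glcm_features(patch, d=1, angle=0, levels=8):
    """Energy, contrast and homogeneity from a grey-level co-occurrence
    matrix (GLCM).  `patch` is a 2-D array of integer grey levels in
    [0, levels); `angle` is one of 0, 45, 90, 135 degrees."""
    # pixel offset corresponding to the requested distance/orientation
    offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
    dr, dc = offsets[angle]
    glcm = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = patch.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                glcm[patch[r, c], patch[r2, c2]] += 1
    glcm /= glcm.sum()                      # normalise counts to probabilities
    i, j = np.indices(glcm.shape)
    energy = np.sum(glcm ** 2)
    contrast = np.sum(glcm * (i - j) ** 2)
    homogeneity = np.sum(glcm / (1.0 + np.abs(i - j)))
    return energy, contrast, homogeneity
```

For a perfectly uniform patch the GLCM collapses to a single cell, so energy and homogeneity reach their maximum of 1 and contrast is 0, which is a quick sanity check for an implementation like this.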

2.3

Gene analysis based classification

Gene expression profiling based colon cancer detection is an active research area. There are usually three types of alterations a gene could undergo, i.e., over-expression, suppression, and mutation. Such alterations have been exploited for the detection of colon cancer, and significant research studies have been dedicated to this field. Genes are usually analyzed by using different variants of microarrays, like Oligonucleotide and complementary deoxyribonucleic acid (cDNA) microarrays.

2.3.1

Oligonucleotide microarrays based classification

Oligonucleotide microarrays are created by synthesizing particular Oligonucleotides on a solid surface in an already defined spatial arrangement. The Oligonucleotide slides are scanned using a confocal laser scanner that analyzes the different probes. The following text summarizes some of the Oligonucleotide microarray based colon cancer detection techniques.

In 1999, Alon et al. employed a clustering algorithm on a dataset of 62 colon samples, wherein each sample comprises 6500 gene expressions [25]. A set of 2000 gene expressions having the highest minimal intensity across samples was identified in the dataset; these gene expressions were declared to be the most discerning compared to the others. Efforts continued in this domain: Grade et al. analyzed a dataset of 30 normal and 73 malignant gene expression profiles, and found 17 discerning gene expressions [26]. Similarly, Kim et al. analyzed the gene expressions of 5 malignant and 5 normal colon samples, and identified 124 discerning gene expressions for the effective discrimination of normal and malignant colon samples [27]. Later, Venkatesh et al. proposed a colon cancer detection method [28], wherein the chi-square measure is used for the selection of discerning gene expressions. The selected gene expressions are further used for the classification of samples into normal and malignant classes by using a recurrent neural network. The proposed technique has been validated on the KentRidge dataset of 2000 gene expressions, and 94.40% classification accuracy has been achieved. Further, Kulkarni et al. proposed an evolutionary algorithms based technique for the classification of colon samples [29]. In this work, mutual information and t-statistic measures are used for selecting discerning gene expressions, and the top 10 and 20 selected gene expressions are classified using decision trees and genetic programming. Several combinations of feature selection and classification methodologies were evaluated by the authors.
Results revealed that mutual information based feature selection together with genetic programming is the most effective solution, yielding an accuracy of 98.33% on the KentRidge dataset. Recently, Lee et al. proposed a colon cancer detection technique [30], wherein a neural network based finite impulse response extreme learning machine (FIR-ELM) [43] is employed for classification. The FIR-ELM algorithm performs classification based on a single hidden layer feed-forward neural network (SLFN). In the SLFN, well-known filtering methods, like finite-length low-pass, high-pass, and band-pass filtering, are employed to train the input weights of the hidden layer so as to extract features from the dataset. These features are then used to classify the given colon samples. The classification accuracy of the proposed technique is 76.85% on the KentRidge dataset. Moreover, Tong et al. proposed a technique based on an ensemble of SVM classifiers for the classification of colon samples [31]. In this work, 50 gene expressions are selected using the top scoring pair method, and multiple linear SVM classifiers are trained on the selected gene expressions. GA is used to select the optimal combination of SVM classifiers that yields the maximum possible performance. The reported classification accuracy is 90.30% on the KentRidge colon cancer dataset.

2.3.2

cDNA microarrays based classification

The cDNA microarray based data has been classified by many researchers into normal and malignant classes. For example, Bianchini et al. analyzed the gene expression profiles of 13 normal and 25 malignant colon samples, and identified 584 discerning genes capable of distinguishing the normal and malignant classes in an effective manner [32]. Li et al. used GA to identify meaningful and discerning gene expressions from a pool of available genes; the selected genes were used for the classification of samples into normal and malignant classes, and 94.1% classification accuracy was achieved by using the KNN classifier [33]. Moreover, Chen et al. used a multiple kernel SVM (MK-SVM) technique, where multiple kernels are described as a convex combination of single-feature basic kernels [34]. The proposed technique has been tested on gene expression datasets of colon cancer and leukemia, and more than 90% classification accuracy was achieved for both datasets. Efforts continued in this domain, and Shon et al. proposed working in the frequency domain: they used the wavelet transform to select discerning gene expressions, and obtained 92% accuracy on colon data with a probabilistic neural network (PNN) [35].


Chapter 3

Machine learning techniques

This chapter explains some preliminary material, which will be helpful for the reader to better understand the subsequent chapters of the thesis. In this context, the various feature selection, classification, and cross-validation techniques, which have been employed as part of the proposed cancer detection and grading techniques, are explained.

3.1

Feature selection methodologies

The high dimensionality of a training dataset and the presence of meaningless and redundant features generally hamper classification algorithms, both in accurately predicting samples and in keeping the computation tractable. Therefore, significant and non-redundant features must be selected prior to training the classifier. In this context, four different feature selection methodologies, namely, minimum redundancy and maximum relevance (mRMR), principal component analysis (PCA), F-Score, and chi-square, have been employed to select discerning gene expressions and features from the gene based and image based datasets, respectively. The following text describes these methodologies.

3.1.1

Minimum redundancy and maximum relevance (mRMR)

mRMR is an effective method of feature selection that is based on the concept of selecting features which have maximum possible relevance to the target labels and least possible redundancy amongst themselves [36]. The relevance and redundancy scores of features are determined by calculating the mutual information amongst different features and between features and the target classes, and discerning features are then selected based on these scores. Suppose a dataset X of size S×J, where S is the number of samples and J is the number of features; the redundancy R(X) within the dataset is defined as the average of mutual information between all possible feature pairs.

    R(X) = \frac{1}{J^2} \sum_{i,j=1}^{J} I(f_i, f_j); \qquad f_i, f_j \in X    (3.1)

where f_i and f_j represent the vectors corresponding to the i-th and j-th features in the dataset X. The term I(f_i, f_j) is the mutual information between the feature vectors f_i and f_j, and can be determined by using the following expression.

    I(f_i, f_j) = \sum_{x,y=1}^{S} p(f_{i,x}, f_{j,y}) \log \frac{p(f_{i,x}, f_{j,y})}{p(f_{i,x})\, p(f_{j,y})}; \qquad i, j = 1, 2, \ldots, J    (3.2)

where f_{i,x} and f_{j,y} are the values of the i-th and j-th features for the x-th and y-th training samples of the dataset, respectively. The factor p(f_{i,x}, f_{j,y}) is the joint probability density function of f_{i,x} and f_{j,y}, and the factors p(f_{i,x}) and p(f_{j,y}) are the marginal probability density functions of f_{i,x} and f_{j,y}, respectively. Likewise, for the label vector t, where each element of t has value +1 or -1, the relevance of the dataset X to t, represented by V(X, t), is defined as the average of the mutual information values between the individual feature vectors f_i and t, as follows.

    V(X, t) = \frac{1}{J} \sum_{i=1}^{J} I(f_i, t)    (3.3)

The term I(f_i, t) is the mutual information between f_i and t, and can be determined using the following expression.

    I(f_i, t) = \sum_{x=1}^{S} p(f_{i,x}, t_x) \log \frac{p(f_{i,x}, t_x)}{p(f_{i,x})\, p(t_x)}; \qquad i = 1, 2, \ldots, J    (3.4)

The terms p(f_{i,x}, t_x), p(f_{i,x}), and p(t_x) denote the joint probability density of f_{i,x} and t_x, the marginal probability density function of f_{i,x}, and the marginal probability mass function of t_x, respectively. The goal is to pick the set of features that yields maximum possible relevance V and minimum possible redundancy R. Since it is not possible to simultaneously achieve both
the objectives, therefore, equations (3.1) and (3.3) are combined to establish a tradeoff between the two objectives. The tradeoff is shown in equation (3.5).

    mRMR = \max_X \left[ V(X, t) - R(X) \right] = \max_X \left[ \frac{1}{J} \sum_{i=1}^{J} I(f_i, t) - \frac{1}{J^2} \sum_{i,j=1}^{J} I(f_i, f_j) \right]    (3.5)

Suppose s_i is an indicator function denoting the membership of a particular feature vector f_i, such that s_i = 1 means presence and s_i = 0 means absence of f_i in the globally optimal feature set; then equation (3.5) may be converted to an optimization problem as follows.

    mRMR = \max_{s \in \{0,1\}^J} \left[ \frac{\sum_{i=1}^{J} I(f_i, t)\, s_i}{\sum_{i=1}^{J} s_i} - \frac{\sum_{i,j=1}^{J} I(f_i, f_j)\, s_i s_j}{\left[ \sum_{i=1}^{J} s_i \right]^2} \right]    (3.6)

This mRMR criterion ensures that the selected features are not only relevant to the target label vector, but also have the least possible redundancy amongst themselves [36].
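Since the exact optimization of equation (3.6) is combinatorial, mRMR is usually approximated greedily: the most relevant feature is picked first, and each subsequent feature maximizes relevance minus the average redundancy with the features already chosen. The sketch below illustrates this under simplifying assumptions (discrete-valued features; the helper names `mutual_info` and `mrmr` are hypothetical, not from the thesis):

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information I(a; b) between two discrete vectors, in nats."""
    mi = 0.0
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                p_a, p_b = np.mean(a == av), np.mean(b == bv)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def mrmr(X, t, k):
    """Greedy mRMR: iteratively add the feature maximising relevance to t
    minus average redundancy with the already selected features
    (an incremental approximation of equation (3.5))."""
    J = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], t) for j in range(J)])
    selected = [int(np.argmax(relevance))]      # most relevant feature first
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(J):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

In practice continuous gene expression values would first be discretized before this mutual-information estimate is meaningful.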

3.1.2

F-Score

F-Score [37] is a simple feature selection technique that tries to reduce the intra-class proximities and increase the inter-class proximities at the same time. For a given dataset X, if the numbers of normal, malignant, and total samples are S^N, S^M, and S, respectively, the F-Score of the j-th feature may be determined using the following equation.

    F\text{-}Score_j = \frac{\left( \mu^N(j) - \mu^S(j) \right)^2 + \left( \mu^M(j) - \mu^S(j) \right)^2}{\frac{1}{S^N - 1} \sum_{n=1}^{S^N} \left( X_n^N(j) - \mu^N(j) \right)^2 + \frac{1}{S^M - 1} \sum_{m=1}^{S^M} \left( X_m^M(j) - \mu^M(j) \right)^2}    (3.7)

where j = 1, 2, ..., J indexes the features. The terms μ^S(j), μ^N(j), and μ^M(j) are the mean values of the j-th feature over the total, normal, and malignant samples, respectively. The expressions X_m^M(j) and X_n^N(j) are the values of the j-th feature for the m-th and n-th samples of the malignant and normal classes, respectively. The higher the F-Score, the more discerning the feature is supposed to be.
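Equation (3.7) translates directly into a few lines of NumPy. The sketch below is illustrative (the helper name `f_score` is hypothetical); labels are assumed to be +1 for malignant and -1 for normal, and the unbiased class variances (`ddof=1`) match the 1/(S^N - 1) and 1/(S^M - 1) factors in the denominator:

```python
import numpy as np

def f_score(X, y, j):
    """F-Score of feature j as in equation (3.7).
    y holds +1 (malignant) and -1 (normal) labels."""
    f = X[:, j]
    fN, fM = f[y == -1], f[y == +1]          # per-class feature values
    num = (fN.mean() - f.mean()) ** 2 + (fM.mean() - f.mean()) ** 2
    den = fN.var(ddof=1) + fM.var(ddof=1)    # unbiased class variances
    return num / den
```

A feature whose class means are far apart relative to its within-class spread receives a high score, which is exactly the "small intra-class, large inter-class" intuition stated above.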

3.1.3

Principal component analysis (PCA)

PCA was initially proposed by Karl Pearson [38], and was later fully developed by Harold Hotelling. It is a statistical method that uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated variables, termed principal components. The number of components may be equal to or less than the number of original variables in the dataset. The orthogonal transformation defined in this way ensures that the first principal component bears the utmost variance in the data, and every following component sequentially bears the highest possible variance while remaining orthogonal to the preceding components.
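The transformation just described can be sketched via an eigen-decomposition of the sample covariance matrix. This is a minimal illustration (the function name `pca` is hypothetical; production code would typically use an SVD-based routine for numerical stability):

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its first n_components
    principal components."""
    Xc = X - X.mean(axis=0)                    # centre the data
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenpairs, ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort by variance, descending
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                     # projected (uncorrelated) data
```

Because the eigenvectors are orthogonal, the projected variables are uncorrelated and their variances decrease from the first component onward, matching the properties stated above.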

3.1.4

Chi-square

Chi-square is a frequently used feature selection strategy, which assigns a chi-square score to each feature with reference to the classes in the dataset. For the dataset X, the chi-square value of the j-th feature can be calculated by using equation (3.8).

    Chi\text{-}Square_j = \sum_{t=1}^{T} \frac{\left( P_t^N(j) - E_t^N(j) \right)^2}{E_t^N(j)} + \sum_{t=1}^{T} \frac{\left( P_t^M(j) - E_t^M(j) \right)^2}{E_t^M(j)}    (3.8)

Chi-square discretizes the range of each feature into multiple partitions when used to select features from a numeric feature set. The partitions of a single feature j are indexed by t, where t = 1, 2, ..., T, in equation (3.8). The terms P_t^M(j) and P_t^N(j) in equation (3.8) are the numbers of samples of the malignant and normal classes lying in partition t, respectively. Similarly, E_t^M(j) and E_t^N(j) are the expected frequencies of P_t^M(j) and P_t^N(j), and are calculated by using equations (3.9) and (3.10), respectively.

    E_t^M(j) = \frac{S^M \cdot P_t^S(j)}{S}    (3.9)

    E_t^N(j) = \frac{S^N \cdot P_t^S(j)}{S}    (3.10)

where P_t^S(j) is the total number of samples lying in partition t. The terms S^N, S^M, and S are the numbers of normal, malignant, and total samples, respectively. Once the chi-square scores of all features have been calculated, the features are sorted in descending order of their scores, and the desired number of top-ranked features is selected, since the larger the chi-square score, the more discerning the feature is.
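Equations (3.8)-(3.10) combine into a short scoring routine. The sketch below is illustrative (the name `chi_square_score` is hypothetical, and the equal-width discretization into T partitions is an assumption, since the text does not fix a discretization scheme); labels are assumed to be +1 (malignant) and -1 (normal):

```python
import numpy as np

def chi_square_score(f, y, T=4):
    """Chi-square score of one numeric feature per equation (3.8).
    The feature range is discretised into T equal-width partitions."""
    edges = np.linspace(f.min(), f.max(), T + 1)
    part = np.clip(np.digitize(f, edges[1:-1]), 0, T - 1)
    S = len(y)
    S_N, S_M = np.sum(y == -1), np.sum(y == +1)
    score = 0.0
    for t in range(T):
        P_S = np.sum(part == t)              # all samples in partition t
        if P_S == 0:
            continue                         # empty partition contributes 0
        P_N = np.sum((part == t) & (y == -1))
        P_M = np.sum((part == t) & (y == +1))
        E_N = S_N * P_S / S                  # expected count, equation (3.10)
        E_M = S_M * P_S / S                  # expected count, equation (3.9)
        score += (P_N - E_N) ** 2 / E_N + (P_M - E_M) ** 2 / E_M
    return score
```

A feature whose partitions are dominated by one class deviates strongly from the expected class proportions and thus receives a high score.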


3.2

Classification methodologies

Classification methodologies play a significant role in the overall framework of a CAD system. These methodologies take as input training data comprising features and corresponding target labels, and develop a model based on these data. The classes of the samples in the test data are then determined based on the model developed on the training data. Multiple classification methodologies, such as SVM, PNN, decision tree, and KNN, have been used in the proposed techniques. The following text describes these classification methodologies.

3.2.1

Support vector machines

The SVM classifier, originally proposed by Vapnik [39], has been extensively used in the past for the classification of medical images into normal and diseased categories. The following text describes the functioning of SVM and its different kernel functions in detail. Suppose a training dataset Q comprises Z training samples q_1, q_2, ..., q_Z with class labels t = [t_1, t_2, ..., t_Z]^T, where t_z ∈ {-1, +1}, z = 1, 2, ..., Z. The dataset Q to be classified by the SVM may be linearly or non-linearly separable. For linearly separable data, the eventual purpose of classification is to design a linear decision surface that assigns correct class labels to the samples of the input classes. The decision surface f(q) for linearly separable data is given in equation (3.11).

    f(q) = w^T q + bias = 0    (3.11)

The decision surface f(q) is characterized by its direction vector w (also called the weight vector) and its position in the space (determined by the bias). Nonetheless, such a decision surface may not be unique, as many such surfaces may be drawn. Thus, the objective of classification is to select the direction w of the decision surface that maximizes the distance of the surface to the nearest points of the classes. These nearest points are called support vectors, and the distance of the support vectors from the decision surface is called the margin. The direction vector w of a decision surface is a linear combination of its support vectors. For 2-class classification problems, where input samples need to be assigned to the normal and malignant classes, candidate decision surfaces are normalized in such a way that the value of f(q) for support vectors equals +1 and -1 for the malignant and normal classes, respectively. Therefore, correctly classified samples
of the malignant and normal classes have values of f(q) greater than +1 and less than -1, respectively. The classification problem may be formulated in terms of an objective function as given below, and can be solved using optimization techniques for a non-linear objective function subject to linear inequalities [40].

    \min \|w\|^2 \quad \text{subject to} \quad t_z(w^T q_z + bias) \ge 1; \qquad z = 1, 2, \ldots, Z    (3.12)

For linearly non-separable data, the training samples may fall into one of three categories. First, a training sample may lie on the correct side of the decision surface, well behind the margin. Second, it may lie on the correct side of the decision surface but inside the margin. Third, it may lie on the wrong (opposite) side of the decision surface. Therefore, the aim of classification is to choose a decision surface that minimizes the number of samples in the second and third categories. A penalty term is added to the above objective function for this purpose. Suppose ξ = [ξ_1, ξ_2, ..., ξ_Z] is a vector of error terms associated with the Z training samples. The objective function for linearly non-separable data may then be formulated as:

    \min \|w\|^2 + c \sum_{z=1}^{Z} \xi_z \quad \text{subject to} \quad t_z(w^T q_z + bias) \ge 1 - \xi_z; \qquad z = 1, 2, \ldots, Z    (3.13)

where ξ_z = 0, 0 < ξ_z < 1, and ξ_z > 1 for samples belonging to the first, second, and third categories, respectively, and c is the penalty parameter associated with the penalty term \sum_{z=1}^{Z} \xi_z. When the data is not linearly separable, SVM applies a non-linear mapping function φ : R^J → R^{J*}. The function φ(q) maps the input data from the lower dimension J to a higher dimension J* where the data becomes easily separable. A non-linear decision surface between the classes can then be constructed in terms of kernel functions [40]:

    f(q) = \sum_{z=1}^{S^P} \alpha_z t_z K(q, r_z) + bias = \sum_{z=1}^{S^P} \alpha_z t_z\, \phi(q) \cdot \phi(r_z) + bias    (3.14)

where S^P is the number of support vectors, r_z denotes the z-th support vector, and α_z and t_z are the Lagrange multipliers and target labels associated with the support vectors, respectively. There are two types of SVM kernel functions, i.e., local kernels and global kernels. In the case of local kernels, only nearby samples influence the kernel value, whereas in the case of global kernels, far-away samples also influence the kernel value. Therefore, one local kernel (RBF) and three global kernels (sigmoid, linear, and polynomial) have been employed in the proposed techniques in order to introduce diversity. The linear, RBF, sigmoid, and polynomial kernels of SVM are mathematically defined by equations (3.15)-(3.18), respectively.

    K(q, r) = q^T r    (3.15)

    K(q, r) = \exp(-\gamma \|q - r\|^2)    (3.16)

    K(q, r) = \tanh(-\gamma\, q^T r + r)    (3.17)

    K(q, r) = \left[ \gamma\, q^T r + r \right]^g    (3.18)

The parameter c, the penalty factor associated with data points falling on the wrong side of the decision surface, is common to all kernel functions. The parameter g is the degree of the polynomial kernel, and the parameter r is the offset of the polynomial and sigmoid kernels. The parameter γ is specific to the sigmoid, polynomial, and RBF kernels; it controls the shape of the decision surface, and an increase in the value of γ generally results in an increase in the number of support vectors.
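The four kernels of equations (3.15)-(3.18) and the decision surface of equation (3.14) can be sketched directly. The code below mirrors the formulas as written in the text (including the -γ sign in the sigmoid kernel of (3.17)); the support vectors, multipliers, and bias are assumed to come from a separately trained SVM, and all function names are hypothetical:

```python
import numpy as np

def linear_kernel(q, r):
    return q @ r                                       # equation (3.15)

def rbf_kernel(q, r, gamma=0.5):
    return np.exp(-gamma * np.sum((q - r) ** 2))       # equation (3.16)

def sigmoid_kernel(q, r, gamma=0.5, offset=1.0):
    return np.tanh(-gamma * (q @ r) + offset)          # equation (3.17), sign as in the text

def polynomial_kernel(q, r, gamma=0.5, offset=1.0, g=3):
    return (gamma * (q @ r) + offset) ** g             # equation (3.18)

def decision(q, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Non-linear decision surface of equation (3.14):
    f(q) = sum_z alpha_z * t_z * K(q, r_z) + bias."""
    return sum(a * t * kernel(q, r)
               for a, t, r in zip(alphas, labels, support_vectors)) + bias
```

The sign of `decision(q, ...)` then determines the predicted class, with values beyond ±1 corresponding to samples lying outside the margin.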

3.2.2

Probabilistic neural network

PNN, introduced in the 1990s by D.F. Specht, is a feed-forward multi-layered neural network derived from kernel Fisher discriminant analysis and Bayesian networks [41]. The functioning of PNN is divided into four distinct layers, namely, the input, pattern, summation, and output layers. The neurons in the input layer correspond to the variables of the data to be classified, and the neurons in the output layer correspond to the target labels.
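A minimal working interpretation of this layered structure is a Parzen-window classifier: the pattern layer evaluates a Gaussian kernel between the test sample and every training sample, the summation layer averages the activations per class, and the output layer picks the class with the highest average density. The sketch below is illustrative only, not Specht's full formulation; `pnn_classify` and the smoothing parameter `sigma` are hypothetical:

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sigma=0.5):
    """Minimal PNN: Gaussian pattern-layer activations, per-class
    averaging in the summation layer, argmax in the output layer."""
    classes = np.unique(y_train)
    densities = []
    for c in classes:
        Xc = X_train[y_train == c]                  # pattern units of class c
        d2 = np.sum((Xc - x) ** 2, axis=1)          # squared distances to x
        densities.append(np.mean(np.exp(-d2 / (2 * sigma ** 2))))
    return classes[int(np.argmax(densities))]
```

The smoothing parameter sigma plays the same role here as the spread parameter of the pattern-layer kernels in a full PNN.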

3.2.3

K-Nearest neighbor

K-nearest neighbor, also called KNN, is one of the simplest classification methodologies. It is based on the concept of lazy (instance based) learning, whereby the function is approximated locally and all computation is deferred until classification time [41]. KNN determines the class of a test sample based on the class labels of the N training data points that lie closest to the sample in the feature space. In particular, the test sample is assigned the class that is in the majority among its N pre-defined neighbors. If the value of N is 1, the class label of the closest sample is assigned to the test sample; if N is greater than 1, the class of the majority of the N nearest samples is assigned.
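The voting rule just described can be sketched as follows (the helper name `knn_classify` is hypothetical, and Euclidean distance is an assumption, since the text does not fix a metric):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, n_neighbors=3):
    """Assign x the majority class among its n_neighbors closest
    training samples, using Euclidean distance."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:n_neighbors]       # indices of closest points
    votes = Counter(y_train[nearest])               # tally neighbour labels
    return votes.most_common(1)[0][0]
```

With `n_neighbors=1` this reduces to assigning the label of the single closest training sample, as described above.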

3.2.4

Decision tree

Decision tree is a simple rule based classifier that arranges a set of test questions and conditions in a tree-like structure [41]. The attribute values handled by a decision tree classifier can be discrete numeric, categorical, or continuous. In the training phase, a tree comprising three types of nodes is constructed, i.e., a root node, intermediate (non-leaf) nodes, and leaf nodes. The root and intermediate nodes contain features and test conditions that help to separate data samples having different feature values and belonging to different classes. The leaf nodes are assigned the class labels. In the testing phase, a test condition is applied to the feature value at the root node, and an appropriate branch of the tree is followed based on the outcome. The test condition either leads to a leaf node, or to an intermediate node where a test condition is applied once again. The process terminates when a leaf node is reached, and the class label associated with that leaf node is assigned to the test sample.

3.3

Cross-validation methodology

Another important phase of any classification system is the formulation of the training/test data. In practical applications, three cross-validation methods, namely, the independent dataset test, sub-sampling, and Jack-knife, are used to assess the effectiveness of classification algorithms. According to a recent survey, the Jack-knife test is considered the most precise and least arbitrary because it always yields a unique result for a particular dataset [42]. In medical diagnostic systems, it is highly desirable to obtain a unique result for a particular sample. Therefore, the Jack-knife test has been progressively adopted by researchers for determining the success rates of classification algorithms [43]. Consequently, Jack-knife 10-fold cross-validation has been used to optimize parameters and to validate the accuracy of the techniques proposed in this thesis. In the 10-fold Jack-knife test, the data are divided into 10 folds. In a given iteration, 9 folds participate in training,
and the classes of the samples in the remaining fold are predicted based on the training performed on the other 9 folds. The test samples in the held-out fold are entirely unseen by the trained model. This process is repeated 10 times so that the class of every sample is predicted. Finally, the predicted labels of the unseen test samples are used to determine the classification accuracy. The Jack-knife process is repeated for each combination of the system's parameters, and the classification performance is reported for the combination that leads to the maximum classification accuracy on the unseen test data. Figure 3.1 presents the Jack-knife 10-fold cross-validation process for the evaluation of the classification performance of a feature vector using linear SVM. Suppose two parameters are involved in this task: the parameter T of the feature extraction strategy, and the constraint violation cost (c) of the linear SVM. The parameters are varied over their potential ranges, and the Jack-knife process is repeated for each combination. The classification accuracy on unseen test data is measured for each combination of parameter values, and the best achieved classification accuracy is reported for the feature set.
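The fold construction described above can be sketched as follows. This is an illustration only: `jackknife_10fold_accuracy` is a hypothetical name, the shuffling of samples into folds is an assumption, and a toy nearest-mean classifier stands in for the linear SVM so that the example stays self-contained:

```python
import numpy as np

def jackknife_10fold_accuracy(X, y, train_and_predict):
    """Hold each of the 10 folds out once, train on the remaining 9,
    and pool the predictions on the unseen folds into one accuracy."""
    rng = np.random.default_rng(0)              # fixed shuffle for reproducibility
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 10)
    predictions = np.empty_like(y)
    for k in range(10):
        test = folds[k]
        train = np.hstack([folds[i] for i in range(10) if i != k])
        predictions[test] = train_and_predict(X[train], y[train], X[test])
    return np.mean(predictions == y)

def nearest_mean(X_tr, y_tr, X_te):
    """Toy stand-in for the SVM: assign each test sample the class
    whose training-set mean is closest."""
    classes = np.unique(y_tr)
    means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_te[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]
```

Parameter optimization as described in the text would simply call `jackknife_10fold_accuracy` once per (T, c) combination and keep the best result.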

[Figure 3.1: parameter optimization grid (feature vector parameter T ∈ {25, 50, 75, 100, 125}; linear SVM cost parameter c = 1, 2, 3, ..., 100) and the ten Jack-knife iterations over the 174-sample colon biopsy image dataset, in each of which nine folds (with labels) are used for training and the remaining unseen fold is predicted; classification accuracy is measured from the pooled predicted labels of the unseen test samples.]

Figure 3.1: Formulation of training/test data through 10-fold Jack-knife cross-validation


Chapter 4

Datasets, performance measures, and evaluation of some existing colon cancer detection techniques on these datasets

In this chapter, the various colon cancer datasets and the performance measures used to assess the performance of the proposed techniques are described. Furthermore, the performance of some existing colon cancer detection techniques is evaluated on these datasets.

4.1

Datasets

Various datasets have been used to evaluate the performance of colon cancer detection techniques; these are summarized in Table 4.1. Dataset-A and dataset-B have been used to evaluate colon biopsy image based segmentation techniques, and classification and grading techniques, respectively. The KentRidge, BioGPS, Notterman, and E-GEOD-40966 datasets have been grouped into one larger unit, i.e., dataset-C, which has been used to assess the performance of gene analysis based colon cancer detection techniques.


Dataset                   Type                  Dataset name
Segmentation dataset      Biopsy image based    Dataset-A
Classification dataset    Biopsy image based    Dataset-B
KentRidge                 Gene based            Grouped into dataset-C
BioGPS                    Gene based            Grouped into dataset-C
Notterman                 Gene based            Grouped into dataset-C
E-GEOD-40966              Gene based            Grouped into dataset-C

Table 4.1: Datasets used for the evaluation of proposed and existing colon cancer detection techniques

Image type       4x    5x    10x    40x    Total
Homogeneous      05    08    07     05     25
Heterogeneous    20    20    28     07     75
Total            25    28    35     12     100

Table 4.2: Statistics of dataset-A used for the evaluation of colon biopsy image based segmentation techniques

4.1.1

Dataset-A — Colon biopsy image based segmentation

This colon biopsy image based dataset has been acquired from 68 colon biopsy samples. The biopsy samples were collected from the pathology division of Rawalpindi Medical College, Pakistan, in the years 2010-12 without any racial, age, or gender based discrimination. A set of 100 RGB images was captured from these biopsy samples at four different magnification factors of the objective lens of the microscope, i.e., 4x, 5x, 10x, and 40x, at a spatial resolution of 600×800 pixels. The acquisition of the images and the preparation of the ground truth were carried out under the supervision of Mr. Imtiaz Qureshi (Assistant Professor of Pathology, Rawalpindi Medical College). The acquired images are either homogeneous or heterogeneous. Homogeneous colon biopsy images contain either normal or malignant tissue along with connecting tissue, whereas heterogeneous images comprise both normal and malignant portions along with connecting tissue. The detailed distribution of images with respect to homogeneity/heterogeneity and magnification factor is given in Table 4.2.

4.1.2

Dataset-B — Colon biopsy image based classification

This dataset has been acquired from the same set of 68 biopsy samples discussed in the previous section. A set of 174 RGB images was acquired from these samples at a magnification factor of 10x. These images comprise only normal or malignant tissues. The ground truth
labels have been assigned to the images under the supervision of Mr. Imtiaz Qureshi. The confidentiality of the patients has been maintained throughout this research work; the college provided details about the gender and age of the patients only. Information about the age and gender of the patients, and the categorization of the images into multiple classes, is summarized in Table 4.3.

Parameters                    Values
Number of images              174
Distribution of images        92 malignant, 82 normal
Grades of malignant images    23 poor-, 44 moderate-, and 25 well-differentiated
Age of patients               42-68, Mean = 57.11, Standard deviation = 6.35
Age of female patients        43-63
Age of male patients          42-68

Table 4.3: Statistics of dataset-B used for the evaluation of colon biopsy image based cancer detection and grading techniques

The histogram of the ages of the patients corresponding to the malignant images is given in Figure 4.1, which shows that most of the colon cancer patients in the given dataset are above 50 years of age.

Figure 4.1: Distribution of age of cancerous patients

4.1.3

Dataset-C — Gene analysis based classification

Gene expressions are generally analyzed by using different variants of microarrays. The final output of a microarray experiment is an image in which each location that corresponds to
a gene has an associated fluorescence value representing the relative expression level of that gene. Figure 4.2 presents a sample tiff image obtained after a microarray experiment. Each row represents one unique sample, wherein each column represents gene expressions corresponding to the sample. The gene expressions are measured from the image, and are stored in database against respective samples.

Figure 4.2: Tiff image generated by a microarray experiment showing gene expressions for various samples

The proposed gene analysis based classification technique deals with the classification of gene based colon samples into normal and malignant classes, and will be described in Chapter 7. The proposed and existing techniques have been tested using four standard gene based colon cancer datasets, namely, KentRidge [44], BioGPS [45], Notterman [46], and E-GEOD-40966 [47]. These datasets have been acquired from publicly available databases of gene based datasets. The dimensionality of these datasets, except the BioGPS dataset, is large; therefore, the datasets have been further reduced in the proposed technique in order to cut down the computational burden and the resources otherwise necessary. The following text describes these datasets in brief.

KentRidge dataset: The KentRidge dataset [44] comprises 2000 gene expressions, which were already selected from among 6500 gene expressions by using machine learning approaches in a clinical research work [25]. There are a total of 62 samples in the dataset, out of which 40 are malignant and 22 are normal.

BioGPS dataset: The BioGPS dataset [45] comprises 37 normal and 94 malignant colon samples. It has only three discerning gene expressions; therefore, it has not been further reduced using machine learning approaches.

Notterman dataset: The Notterman dataset [46] was prepared as part of a gene expression analysis project carried out at Princeton University, New Jersey, USA. The dataset has 18 normal and 18 malignant samples, and each sample comprises 7,457 gene expressions. Therefore, the overall dimensionality of the dataset is 36x7,457.

E-GEOD-40966 dataset: This is a multi-class dataset [47]. It comprises 463 colon samples belonging to different stages of colon cancer, such as stages A, B, C, and D. Therefore, to establish a binary-class problem, 208 samples of stage 2 and 142 samples of stage 3 cancer patients have been picked. In this way, a dataset of 350 samples has been developed, wherein each sample has 5,851 genes. This dataset has been acquired from the ArrayExpress database, which serves as a standard repository of gene based datasets.

4.2

Performance measures

The performance of the proposed and previously published segmentation and classification techniques has been investigated in terms of various well-known performance metrics. In general, a particular performance metric quantifies the performance of a system from one perspective only; therefore, multiple performance metrics have been calculated in order to determine the effectiveness of the techniques from various viewpoints. The following text describes these metrics separately for segmentation and classification techniques.

4.2.1

Classification techniques

Classification techniques addressed in this research work are either 2-class or 3-class techniques. The 2-class classification techniques classify images into normal and malignant classes, whereas 3-class classification techniques identify well, poor, and moderate grades of malignant images. Therefore, performance metrics have been elaborated separately for both types of classification techniques. Suppose a, b, and c are the three classes. The confusion matrices for 2-class (a and b) and 3-class (a, b, and c) classification techniques are given in Tables 4.4 and 4.5, respectively.

                        Predicted class
                        a          b
  Actual class    a     T_a        E_ab
                  b     E_ba       T_b

Table 4.4: Confusion matrix for two classes (a and b)

                        Predicted class
                        a          b          c
  Actual class    a     T_a        E_ab       E_ac
                  b     E_ba       T_b        E_bc
                  c     E_ca       E_cb       T_c

Table 4.5: Confusion matrix for three classes (a, b and c)

where
T_a, T_b and T_c = number of correctly classified samples of class a, b and c, respectively;
E_ab and E_ac = number of samples of class a which have been assigned to class b and class c, respectively;
E_ba and E_bc = number of samples of class b which have been assigned to class a and class c, respectively;
E_ca and E_cb = number of samples of class c which have been assigned to class a and class b, respectively.

Based on the confusion matrices, performance measures for 2-class and 3-class classification techniques have been defined as follows.

Accuracy: Classification accuracy of a technique is the percentage of correctly classified images [48]. It can be calculated by using equations (4.1) and (4.2), respectively, for 2-class and 3-class classification techniques. Its value ranges between 0 and 100, where 100 means a perfect recognition rate, and 0 means the worst.

Accuracy = \frac{T_a + T_b}{T_a + T_b + E_{ab} + E_{ba}} \times 100    (4.1)

Accuracy = \frac{T_a + T_b + T_c}{T_a + T_b + T_c + E_{ab} + E_{ac} + E_{ba} + E_{bc} + E_{ca} + E_{cb}} \times 100    (4.2)

Sensitivity:

It is the proportion of positive-class samples that are correctly classified [48]. Its value can be calculated by using equation (4.3) for 2-class classification techniques, and by using equations (4.4)-(4.6) individually for the three classes of a 3-class classification technique. The valid values of sensitivity lie between 0 and 1, where 0 and 1 correspond to the worst and best classification, respectively.

Sensitivity = \frac{T_a}{T_a + E_{ab}}    (4.3)

Sensitivity(a) = \frac{T_a}{T_a + E_{ab} + E_{ac}}    (4.4)

Sensitivity(b) = \frac{T_b}{T_b + E_{ba} + E_{bc}}    (4.5)

Sensitivity(c) = \frac{T_c}{T_c + E_{ca} + E_{cb}}    (4.6)

Specificity: It is the proportion of negative-class samples that are correctly classified [48]. Its value can be calculated by using equation (4.7) for 2-class classification techniques, and by using equations (4.8)-(4.10) individually for the three classes of a 3-class classification technique. The valid values of specificity lie between 0 and 1, where 0 and 1 correspond to the worst and best classification, respectively.

Specificity = \frac{T_b}{T_b + E_{ba}}    (4.7)

Specificity(a) = \frac{T_b + E_{bc} + E_{cb} + T_c}{T_b + E_{bc} + E_{cb} + T_c + E_{ba} + E_{ca}}    (4.8)

Specificity(b) = \frac{T_a + E_{ac} + E_{ca} + T_c}{T_a + E_{ac} + E_{ca} + T_c + E_{ab} + E_{cb}}    (4.9)

Specificity(c) = \frac{T_a + E_{ab} + E_{ba} + T_b}{T_a + E_{ab} + E_{ba} + T_b + E_{ac} + E_{bc}}    (4.10)


Matthews correlation coefficient (MCC): MCC is a measure of the quality of binary classifications [49]. It can be calculated for the 2-class problem using the following expression.

MCC = \frac{T_a T_b - E_{ab} E_{ba}}{\sqrt{(T_a + E_{ab})(T_a + E_{ba})(T_b + E_{ab})(T_b + E_{ba})}}    (4.11)

Its value ranges between -1 and +1, where -1, +1 and 0, respectively, correspond to the worst, best, and random prediction.

F-Measure: It is a measure of the accuracy of classification [48], defined as the harmonic mean of precision and recall. The values of precision, recall, and the resultant F-Measure for the 2-class classification problem can be calculated by using equations (4.12), (4.13), and (4.14), respectively.

Precision = \frac{T_a}{T_a + E_{ba}}    (4.12)

Recall = \frac{T_a}{T_a + E_{ab}}    (4.13)

F\text{-}Measure = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (4.14)

The values of precision and recall for the three classes of the 3-class classification problem can be calculated by using equations (4.15)-(4.20).

Precision(a) = \frac{T_a}{T_a + E_{ba} + E_{ca}}    (4.15)

Precision(b) = \frac{T_b}{T_b + E_{ab} + E_{cb}}    (4.16)

Precision(c) = \frac{T_c}{T_c + E_{ac} + E_{bc}}    (4.17)

Recall(a) = \frac{T_a}{T_a + E_{ab} + E_{ac}}    (4.18)

Recall(b) = \frac{T_b}{T_b + E_{ba} + E_{bc}}    (4.19)

Recall(c) = \frac{T_c}{T_c + E_{ca} + E_{cb}}    (4.20)

The precision and recall values of the classes are used to calculate the F-Measure of each class by using equations (4.21)-(4.23).

F\text{-}Measure(a) = 2 \times \frac{Precision(a) \times Recall(a)}{Precision(a) + Recall(a)}    (4.21)

F\text{-}Measure(b) = 2 \times \frac{Precision(b) \times Recall(b)}{Precision(b) + Recall(b)}    (4.22)

F\text{-}Measure(c) = 2 \times \frac{Precision(c) \times Recall(c)}{Precision(c) + Recall(c)}    (4.23)

The valid values of F-Measure lie between 0 and 1, where 0 and 1, respectively, mean the worst and best classification.

Receiver operating characteristics (ROC): ROC curves graphically represent the overall performance of a classification technique over its complete operating range [48], and are developed by plotting the true positive rate (TPR) against the false positive rate (FPR). TPR is the ratio of correctly classified positive samples to the total number of positive samples, whereas FPR is the ratio of negative samples predicted as positive to the total number of negative samples. TPR and FPR thus correspond to the sensitivity and (1 - specificity) of the classification technique. In order to draw an ROC curve, the decision values (predicted values) of a classifier are scaled into the range [0, 1], and the curve is drawn by applying a classification threshold T to the scaled decision values: if the decision value is greater than T, the input sample is assigned to the first class, otherwise to the second class. T is varied over the range [0, 1] with a suitable step size, and a TPR/FPR value pair is computed at each value of T. The ROC curve is then plotted from the computed TPR/FPR pairs. In practical applications, the ROC curve gives practitioners the freedom to select a suitable operating point depending upon the needs of the underlying application area.

Area under the curve (AUC): AUC is a measure of the overall effectiveness of a classifier [48]. It has been extensively used in the past for assessing the performance of classifiers. It is a quantitative figure, usually measured from the ROC curve of the classifier. The valid values of AUC lie between 0 and 1; the closer the value of AUC is to one, the closer the classifier is to optimal.
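The threshold-sweep procedure just described can be sketched as follows; this is an illustrative Python implementation (not the thesis code), assuming labels are coded 1 for the positive class and 0 for the negative class.

```python
import numpy as np

def roc_points(scores, labels, n_thresholds=101):
    """Sweep a threshold T over decision values scaled to [0, 1] and collect
    the (FPR, TPR) pair at each T."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())   # scale decision values to [0, 1]
    y = np.asarray(labels)                     # 1 = positive class, 0 = negative
    fpr, tpr = [], []
    for T in np.linspace(0.0, 1.0, n_thresholds):
        pred = s >= T
        tpr.append((pred & (y == 1)).sum() / (y == 1).sum())
        fpr.append((pred & (y == 0)).sum() / (y == 0).sum())
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration of the swept points."""
    order = np.lexsort((tpr, fpr))             # sort by FPR, breaking ties by TPR
    f = np.asarray(fpr)[order]
    t = np.asarray(tpr)[order]
    return float(np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2.0))
```

A perfectly separating classifier yields an AUC of 1, and one that ranks every negative above every positive yields 0, matching the interpretation given above.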

4.2.2

Segmentation techniques

The performance of the proposed and previously published segmentation techniques has also been quantified in terms of a few standard performance metrics, namely, segmentation accuracy, sensitivity and specificity. Ground truth binary images have been compared against segmented colon biopsy images, and confusion matrices have been populated based on the classes of pixels in the ground truth and segmented images. The performance measures have then been calculated in the same way as discussed for the classification techniques.
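The pixel-wise comparison described above can be sketched as follows (an illustrative Python version, not the thesis code), treating region pixels as the positive class and background pixels as the negative class.

```python
import numpy as np

def segmentation_metrics(ground_truth, segmented):
    """Pixel-wise confusion counts between a ground-truth binary mask and a
    segmented mask, and the resulting accuracy, sensitivity and specificity."""
    gt = np.asarray(ground_truth, dtype=bool)
    sg = np.asarray(segmented, dtype=bool)
    tp = np.count_nonzero(gt & sg)      # region pixels correctly labelled
    tn = np.count_nonzero(~gt & ~sg)    # background pixels correctly labelled
    fp = np.count_nonzero(~gt & sg)
    fn = np.count_nonzero(gt & ~sg)
    return {
        "accuracy": (tp + tn) / gt.size * 100,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```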

4.3

Evaluation of some existing colon cancer detection techniques

In this section, most of the techniques presented in Chapter 2 have been evaluated on the datasets described in Section 4.1. The techniques have been implemented in Matlab, and the results have been collected in terms of the various performance measures described in Section 4.2. The techniques in different categories work in different domains; therefore, it is not possible to test all of them on the same dataset. However, the techniques within each category have been evaluated on the same dataset in order to provide a uniform and fair performance comparison.

4.3.1

Evaluation of colon biopsy image based segmentation techniques

The colon biopsy image based segmentation techniques have been tested on dataset-A described in Section 4.1.1, and the results have been collected in terms of various performance metrics. In this context, three techniques, namely, OOSEG [1], GRLM [15], and multi-level segmentation [16] have been implemented in Matlab. The results for these techniques are given in Chapter 5, where the segmentation capabilities of these techniques are compared with those of the proposed MOOS segmentation technique.

4.3.2

Evaluation of colon biopsy image based classification techniques

Various colon biopsy image based classification techniques such as Esgiar et al. [8], Esgiar et al. [10], Masood et al. [21], Masood et al. [22], and Altunbay et al. [17] have been implemented in Matlab, and their results have been evaluated on dataset-B of colon biopsy images. The quantitative results for these techniques on dataset-B are given in Chapters 6 and 8, where the performance of the proposed classification techniques has been compared with the performance of these techniques.


4.3.3

Evaluation of gene based colon classification techniques

Five gene analysis based classification techniques, namely, Venkatesh et al. [28], Kulkarni et al. [29], Lee et al. [30], Tong et al. [31], and Li et al. [33], have been implemented in Matlab. These techniques have been tested on the KentRidge, BioGPS, Notterman, and E-GEOD-40966 datasets of gene expressions, which are part of dataset-C, and the results have been evaluated in terms of various performance measures. The detailed quantitative results of these techniques in terms of all the performance measures are given in Chapter 7, where the performance of the proposed GECC technique is compared with these techniques.

4.4

Existing colon cancer datasets

As already discussed in Chapter 2, various colon cancer detection techniques have been proposed in the past for the automatic detection of colon cancer. These are general texture analysis, object-oriented texture analysis, gene analysis, and hyper-spectral analysis based classification techniques. Different datasets have been used by the respective authors for the evaluation of these techniques. Except for the KentRidge dataset of gene expressions, these datasets have no specific names; therefore, the number of samples within each dataset is given in Table 4.6. Furthermore, only the KentRidge dataset is available online; its link has also been given in Table 4.6.


Dataset              Link                                                            Technique           Ref.

Object-oriented texture analysis based techniques
16 images            —                                                               Tosun et al.        [1]
                                                                                     Tosun et al.        [13]
72 images            —                                                               Demir et al.        [14]
150 images           —                                                               Tosun et al.        [15]
200 images           —                                                               Simsek et al.       [16]
213 images           —                                                               Altunbay et al.     [17]
3236 images          —                                                               Ozdemir et al.      [18]
                                                                                     Ozdemir et al.      [19]

General texture analysis based techniques
98 images            —                                                               Esgiar et al.       [8]
                                                                                     Esgiar et al.       [9]
                                                                                     Esgiar et al.       [10]
60 images            —                                                               Jiao et al.         [11]

Gene analysis based techniques
KentRidge dataset    http://datam.i2r.a-star.edu.sg/datasets/krbd/ColonTumor/ColonTumor.html
                                                                                     Alon et al.         [25]
                                                                                     Venkatesh et al.    [28]
                                                                                     Kulkarni et al.     [29]
                                                                                     Lee et al.          [30]
                                                                                     Tong et al.         [31]
103 samples          —                                                               Grade et al.        [26]
42 samples           —                                                               Yajima et al.       [50]
10 samples           —                                                               Kim et al.          [27]
38 samples           —                                                               Bianchini et al.    [32]

Hyper-spectral analysis based techniques
32 samples           —                                                               Rajpoot et al.      [20]
32 samples           —                                                               Masood et al.       [22]
45 samples           —                                                               Chaddad et al.      [23]

Table 4.6: Some existing colon cancer datasets

4.5

Chapter summary

Previously proposed colon cancer detection techniques were evaluated by their authors on different datasets; therefore, a direct comparison of the techniques was not possible. This chapter addresses that need by evaluating the techniques on the same datasets, thereby making their comparison possible. It also describes the datasets and performance measures used for the evaluation of these techniques.


Chapter 5

Proposed object oriented texture analysis based segmentation

Segmentation of colon biopsy images is extremely challenging due to the similar color distribution in the various biological regions of pathological images. The only viable solution for the segmentation of colon biopsy images is to incorporate background knowledge about the morphology of normal and malignant colon tissues into the segmentation process. In this context, an object oriented segmentation (OOSEG) technique [1] has been proposed in the past. In that work, a colon biopsy image is divided into white, pink and purple clusters, and circular objects within a suitable range of radii are detected in the clusters. The detected objects are divided into different categories, and the information about the sizes and spatial distribution of the detected objects is exploited to demarcate the final regions.

In this chapter, a modified object oriented segmentation (MOOS) technique is proposed that improves some aspects of OOSEG. In OOSEG, epithelial cells were treated as circles, and circular objects were fitted to the clusters in order to detect them. However, epithelial cells are elliptic in nature, and only appear slightly circular when viewed at very small magnification factors. Therefore, in this work, elliptic objects are detected in the four orientations of 0°, 45°, 90°, and 135° instead of circular objects. Furthermore, a membership function has also been proposed to detect the epithelial cells which are either disturbed due to blur, or are slightly tilted from the four standard orientations. In OOSEG, the optimal values of various system parameters are empirically selected for each image; in this work, a genetic algorithm (GA) has instead been employed to optimize several system parameters for different magnification factors.

5.1

Proposed segmentation technique

The proposed modified object oriented segmentation (MOOS) technique comprises two main modules, namely, colon biopsy image segmentation, and optimization of system parameters through GA. Figure 5.1 shows the top-level framework of the proposed technique. The main module of the overall system is the segmentation of colon biopsy images, in which a few parameters are involved. The optimal values of these parameters are found separately for images of different magnification factors by evolving a GA on an initial random population. These parameters are discussed in detail in later sections of this chapter. The optimal values obtained this way are used directly for the segmentation of colon biopsy images.

Figure 5.1: Top level layout of the proposed MOOS technique

The segmentation module of the proposed MOOS technique comprises three distinct phases, namely, identification of elliptic objects, extraction of geometric features, and demarcation of final region boundaries. The main architecture of the segmentation module is presented in Figure 5.2, and the following text describes the different phases of the segmentation process. As already discussed, the core novelty of the proposed MOOS technique lies in the object identification phase of the segmentation module, and in the GA based parameter optimization. The remaining phases, i.e. feature extraction and region demarcation, are the same as in OOSEG, and are explained here for a better understanding of the overall procedure.

Figure 5.2: Top level layout of the segmentation module of the proposed MOOS technique

5.1.1

Object identification

The main purpose of this phase is to locate elliptic objects in given colon biopsy images. In this connection, the colon biopsy image is first segregated into its constituent tissues in the clustering step. Later, elliptic objects are located, and are divided into different object categories based on a few criteria. The different steps of this phase are described in the following text. Clustering: Colon biopsy images comprise white colored epithelial cells and lumen, pink colored connecting tissues, and purple colored stroma. In this study, the K-Means algorithm has been applied to the color intensities of pixels, and image pixels have been segregated into white, pink and purple clusters. K-Means is an iterative statistical clustering method [12] that has been used extensively in computer vision for image segmentation [51, 52] and visual object tracking [53]. The clusters obtained after applying K-Means are transformed into binary clusters. Later, dilation and hole filling are performed to make the clusters suitable for finding ellipses. A colon biopsy image and its corresponding white, pink and purple clusters are shown in Figure 5.3.
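The clustering step can be sketched as follows. This is an illustrative Python version of plain Lloyd's K-Means on pixel colors, not the Matlab implementation used in this work; for determinism, it initializes the centers from the sorted unique colors rather than randomly, which is an assumption of this sketch.

```python
import numpy as np

def kmeans_pixels(image, k=3, iters=20):
    """Lloyd's K-Means on RGB pixel intensities; returns an h x w label map
    with k clusters (white, pink and purple tissue clusters when k = 3)."""
    h, w, c = image.shape
    X = image.reshape(-1, c).astype(float)
    # deterministic initialisation: k colours spread over the sorted unique colours
    uniq = np.unique(X, axis=0)
    centers = uniq[np.linspace(0, len(uniq) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign every pixel to its nearest center, then recompute the centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels.reshape(h, w)
```

Each cluster label can then be turned into a binary mask, followed by the dilation and hole-filling steps described above (e.g., with `scipy.ndimage.binary_dilation` and `binary_fill_holes`).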


Figure 5.3: Image clustering: (a) a colon biopsy image, (b) white cluster, (c) pink cluster, and (d) purple cluster

Object detection: Epithelial cells are elliptic in nature, as shown in Figure 1.1 (c) (page 3). In previous colon biopsy image based segmentation techniques [1, 15], epithelial cells have been modeled as circles, and best-fit circles have been fitted to the epithelial cells. The major drawback of this approach is that when a circular object is fitted to an elliptically shaped epithelial cell, many pixels that belong to the epithelial cell but lie outside the fitted circle remain unassigned to the circle. These pixels either stay unassigned until the end, or another circle may be fitted to them if they are large in number. This phenomenon results in degraded segmentation, especially at higher magnification factors. In order to alleviate such problems, a new object detection algorithm has been proposed in this research study. The proposed algorithm is shown in Figure 5.4.

Figure 5.4: Algorithm for detection of ellipses in connected components of different clusters


In the proposed object detection algorithm, connected components are identified in each processed cluster, and components having an area smaller than a component area threshold (CAT) are excluded from further experimentation. The CAT parameter influences the performance of the proposed technique; therefore, its optimal value for each magnification factor is found through the GA. Four ellipses at 0°, 45°, 90°, and 135°, as shown in Figure 5.5, are generated starting with the optimal upper bounds of the semi-major (SMJA) and semi-minor (SMIA) axes. The optimal upper and lower bounds for images of each magnification factor have been found through the GA. Consider a case where SMJA = 4 and SMIA = 3 units: the horizontal (0°), vertical (90°), diagonal (135°), and off-diagonal (45°) ellipses are matrices of size 7x9, 9x7, 9x9, and 9x9, respectively. The representation of a 0° ellipse in terms of matrix entries is shown in Figure 5.5 (b), wherein the pixels outside the ellipse have value '0', while the pixels inside have value '1'.

Figure 5.5: (a) Four generated ellipses, and (b) representation of a horizontal ellipse with SMJA = 4 and SMIA = 3 in terms of matrix entries
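The generation of the oriented binary ellipse masks can be sketched as follows. This analytic Python construction is an illustration under the stated axis bounds, not the thesis code; the exact discrete matrices of Figure 5.5 may differ slightly for the diagonal orientations.

```python
import numpy as np

def ellipse_mask(smja, smia, angle_deg):
    """Binary mask of an ellipse with semi-major axis `smja` and semi-minor
    axis `smia`, rotated by `angle_deg` (0, 45, 90 or 135 in MOOS)."""
    n = smja
    y, x = np.mgrid[-n:n + 1, -n:n + 1]
    t = np.deg2rad(angle_deg)
    u = x * np.cos(t) + y * np.sin(t)        # coordinates in the ellipse frame
    v = -x * np.sin(t) + y * np.cos(t)
    mask = (u / smja) ** 2 + (v / smia) ** 2 <= 1.0
    rows = mask.any(axis=1)                  # trim empty border rows/columns so
    cols = mask.any(axis=0)                  # the 0°/90° masks come out 7x9 / 9x7
    return mask[rows][:, cols].astype(np.uint8)
```

For SMJA = 4 and SMIA = 3 this reproduces the 7x9 horizontal and 9x7 vertical matrices described above, with '1' inside the ellipse and '0' outside.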

The four generated ellipses are searched for in each connected component of a cluster. For this purpose, the system traverses the pixels of the connected component one at a time, and extracts four regions of the same sizes as the horizontal, vertical, diagonal and off-diagonal ellipses around the pixel. The pixel values of the horizontal, vertical, diagonal, and off-diagonal windows are compared with the respective pixel values of the horizontal (0°), vertical (90°), diagonal (135°), and off-diagonal (45°) ellipses. If all the pixels having value '1' in an ellipse are also '1' in the extracted window, an ellipse of that particular orientation is deemed to exist at the pixel. Once the system has traversed all the pixels of a connected component, the object detection process is repeated with the remaining pixels of the component (those not yet assigned to any elliptic object) after decrementing the values of SMJA and SMIA by one unit. The process continues for the same connected component as long as some unassigned pixels remain and SMJA and SMIA have not reached their minimum bounds. Otherwise, the process is started for the next connected component of the cluster in the same fashion.

Blur or noise in colon biopsy images may disturb the functioning of K-Means, and some inner pixels of epithelial cells may be 0, leading to shapes in the clusters that are almost, but not exactly, elliptic. Furthermore, the ellipse searching process in four orientations covers most of the ellipses, but some ellipses may be slightly tilted from the four defined orientations. Therefore, the concept of a membership function has been introduced in the proposed MOOS technique in order to find nearly elliptic epithelial cells as well as epithelial cells tilted from the four standard orientations. The membership function defines the percentage of pixels inside the generated ellipse that have the same value in the extracted window. The optimal value of the membership function, found through experimentation, is 95%.

Figure 5.6: (a) Generated horizontal ellipse, (b) image pattern (full pattern match), and (c) image pattern (partial match 95.6%)

Figure 5.6 demonstrates the membership function. Figure 5.6 (a) shows a horizontal ellipse that is to be found in the image patches of Figure 5.6 (b) and (c). A full match is found in the pattern of Figure 5.6 (b); the circled entries represent image pixels that have value 1 but do not participate in the matching process because they lie outside the boundary of the generated ellipse. A partial match is found in Figure 5.6 (c): it shows a nearly elliptic shape in which two circled entries inside the ellipse are 0, and the membership function helps in detecting such ellipses.

The proposed ellipse detection algorithm helps in detecting most of the epithelial cells in colon biopsy images. Figure 5.7 (a) and (b), respectively, show a small portion of a colon biopsy image and its corresponding white cluster. Figure 5.7 (c) and (d) show the objects detected in the white cluster by OOSEG (a circle fitting based technique) and by the proposed ellipse detection algorithm, respectively. The figure shows that the proposed ellipse detection algorithm has detected the epithelial cells better than the circle fitting based techniques (OOSEG and GRLM).

Figure 5.7: Object detection: (a) small portion of a colon biopsy image, (b) white cluster, (c) detected circular objects, and (d) detected elliptic objects
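The membership-based matching of a generated ellipse against an extracted window can be sketched as follows (an illustrative Python version, not the thesis code): only the pixels inside the ellipse participate, as in Figure 5.6, and a match is declared when the agreement reaches the 95% threshold.

```python
import numpy as np

def membership_match(window, ellipse, threshold=0.95):
    """Fraction of the pixels with value 1 in the generated `ellipse` that are
    also 1 in the extracted image `window` (same shape); pixels outside the
    ellipse are ignored. Returns the membership value and whether it reaches
    the matching threshold."""
    inside = ellipse == 1
    agree = np.count_nonzero(window[inside] == 1)
    membership = agree / np.count_nonzero(inside)
    return membership, membership >= threshold
```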

The proposed algorithm maintains a record of the detected objects and their corresponding semi-major axes, semi-minor axes and directions for use in subsequent phases. Object categorization: The objects detected from each cluster are divided into two object types based on an object area threshold (OAT), whose optimal value is found through the GA. For a particular cluster, objects with an area greater than OAT belong to the first type, and objects with an area less than OAT belong to the second type. This categorization results in a total of six object types, two corresponding to each cluster. Here, area means the number of pixels occupied by an object in the image.


5.1.2

Feature extraction

In this phase, geometric features are calculated for each individual pixel of the colon biopsy image. A circular window is placed at each pixel, and two features are computed for each of the six object types lying in the window, resulting in a set of 12 features for the pixel. These features measure the uniformity in the sizes and in the spatial distribution of the detected objects, and are named object size uniformity (OSU) and object spatial distribution uniformity (OSDU), respectively. The following text describes these features. It is worth mentioning that an object is considered to lie within a circular window if its central pixel lies within the window. Object size uniformity (OSU): OSU, as the name suggests, measures the uniformity in the sizes of objects of a particular object type. It is the coefficient of variation of the areas of the objects of that type lying within a circular window around the given pixel. The coefficient is zero for a particular object type if all instances of that type lying in the window have the same area. Object spatial distribution uniformity (OSDU): OSDU measures the uniformity in the spatial distribution of objects of a particular object type. For this feature, a circular window is placed on each image pixel, and the position vectors of the objects lying within the window are calculated with reference to the center of the window. The OSDU measure for each object type is the magnitude of the sum of the position vectors of that type. The magnitude is zero for an object type if the objects of that type are uniformly distributed in space within the circular window. The features are calculated using both smaller and larger circular windows, with the aim of capturing local and global level information in the image. 
The radii for smaller and larger windows are represented by RS and RL in the subsequent text, respectively. The optimal values of RS and RL for different magnification factors are found through GA.
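The OSU and OSDU computations for one pixel and one object type can be sketched as follows. This is an illustrative Python version under the assumptions stated in the comments; the object list format `(cx, cy, area)` is hypothetical, introduced only for this sketch.

```python
import numpy as np

def osu_osdu(objects, center, radius):
    """OSU and OSDU for one object type within a circular window.
    `objects` is a list of (cx, cy, area) triples for detected objects of that
    type; an object lies in the window when its central pixel is within
    `radius` of `center`."""
    cx0, cy0 = center
    inside = [(cx, cy, a) for cx, cy, a in objects
              if (cx - cx0) ** 2 + (cy - cy0) ** 2 <= radius ** 2]
    if not inside:
        return 0.0, 0.0
    areas = np.array([a for _, _, a in inside], dtype=float)
    # OSU: coefficient of variation of the object areas (0 => all areas equal)
    osu = areas.std() / areas.mean() if areas.mean() > 0 else 0.0
    # OSDU: magnitude of the sum of position vectors taken from the window centre
    vecs = np.array([[cx - cx0, cy - cy0] for cx, cy, _ in inside], dtype=float)
    osdu = float(np.linalg.norm(vecs.sum(axis=0)))
    return float(osu), osdu
```

Running this once per object type for the smaller window (radius RS) and once for the larger window (radius RL) yields the 12-feature vector of a pixel described above.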

5.1.3

Region demarcation

In the final phase of the segmentation process, initial seeds are determined, which are grown in successive iterations to span the entire image. In the end, the regions (seeds) are merged to obtain the final segmentation. The region demarcation process, comprising seed identification, seed growing and region merging, is shown in Figure 5.8, and the following text describes its different steps.

Figure 5.8: The region demarcation process


7.2

Experimental results and discussions

The performance of the proposed GECC technique has been tested on various gene expressions based colon cancer datasets. The discerning gene expressions, chosen by different feature selection techniques, are given as input to the GECC for classification of samples into respective classes through weighted majority voting.
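The weighted majority voting rule can be sketched as follows. This is only an illustration of the voting step; the weights here are treated as given inputs, whereas Chapter 7 describes how the proposed GECC technique actually derives and uses its classifier weights.

```python
def weighted_majority_vote(predictions, weights):
    """Combine the labels predicted by the base classifiers through weighted
    majority voting: each classifier adds its weight to the class it predicts,
    and the class with the largest weight total wins."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

For instance, with predictions ['malignant', 'normal', 'malignant', 'normal'] and (hypothetical) weights [0.9, 0.8, 0.7, 0.6], the malignant class wins with a total of 1.6 against 1.4.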

7.2.1

Selection of discriminative gene expressions from datasets having high dimensionality

The dimensionality of the larger datasets, namely E-GEOD-40966, Notterman, and KentRidge, has been reduced prior to classification by using various feature selection techniques. The feature selection techniques share the common goal of selecting the most discerning gene expressions, although their gene selection procedures are entirely different. For this reason, the selected gene expressions vary in number and differ between feature selection techniques. The F-Score and chi-square based gene subsets have been selected using the Weka machine learning tool. The PCA and mRMR based gene subsets have been selected through an empirical process, which is explained in the subsequent text for the KentRidge dataset. In the case of PCA, the eigenvalues of the KentRidge dataset have been analyzed, and the principal components corresponding to the most discerning eigenvalues have been chosen. A plot of the first 50 eigenvalues for the dataset is shown in Figure 7.2 (a). The area under the curve in the figure is the confidence interval, which is generally normalized to 100%. Analysis of the figure reveals that, other than the first few discerning eigenvalues, the remaining values may be ignored in order to reduce the computational complexity. Therefore, the number of eigenvalues and the corresponding classification rate of GECC have been evaluated as a function of the confidence interval in order to determine the optimal number of eigenvalues that leads to the maximum classification accuracy. In this perspective, the confidence interval has been varied from 92-99%, and the number of eigenvalues corresponding to each value of the confidence interval has been determined from Figure 7.2 (a). The principal components corresponding to these eigenvalues have been used for classification. Figure 7.2 (b) and (c) show the number of eigenvalues and the classification accuracy corresponding to different values of the confidence interval.
Figure 7.2 (c) shows an increase in the classification rate up to a 96% confidence interval, but the classification accuracy deteriorates beyond this point. Hence, the 28 principal components corresponding to the 96% confidence interval have been used for the classification of the KentRidge dataset.

Figure 7.2: Plot of (a) eigenvalues, (b) number of eigenvalues required for a particular confidence interval, (c) classification accuracy of GECC on the KentRidge dataset for different values of the confidence interval

Similarly, the mRMR method sorts the individual gene expressions of a dataset by their discriminatory capability. Therefore, in order to select the most discerning gene set, i.e., the one that leads to the maximum classification accuracy, gene subsets of various sizes have been selected and the classification accuracy has been measured on each. Figure 7.3 shows the classification accuracy of the individual classifiers and the proposed GECC technique for the different gene subsets. The results show that, for the KentRidge dataset, the subset comprising 50 gene expressions yields the maximum classification accuracy; therefore, the results for the KentRidge dataset have been reported for this particular subset. The same process has been used to select discerning gene subsets from the E-GEOD-40966 and Notterman datasets. The numbers of gene expressions selected by mRMR, F-Score and chi-square for the different datasets are shown in Table 7.1, along with the numbers of principal components retained in the case of PCA.



Figure 7.3: Classification accuracy of various classifiers for gene datasets selected by mRMR

               Original   PCA   mRMR   F-Score   Chi-square
KentRidge         2,000    28     50        26          135
Notterman         7,457    33    120        95          185
E-GEOD-40966      5,851    25    140       130          165

Table 7.1: Number of gene expressions selected by various feature selection strategies for different datasets

7.2.2 Parameter selection for SVM classifiers pertaining to different gene selection strategies

A number of parameters need to be optimized in order to improve the performance of the SVM classifiers. To reduce the problem complexity, g has been fixed to 3 for the polynomial kernel, and the value of parameter r has been set to 1 for both the polynomial and sigmoid kernels. The optimal values of the remaining parameters have been found through the grid search method [82]. The optimal value of parameter c has been obtained by exploring the grid range c = [1, 2, ..., 100] with ∆c = 1 for all the classifiers. Similarly, the value of γ has been obtained by exploring the grid range γ = [0.001, ..., 0.099] with ∆γ = 0.002 for the polynomial, sigmoid, and RBF classifiers. The parameter F (number of folds) of the Jack-knife cross-validation may also influence the performance of the SVM classifiers. Therefore, the optimal value of F has also been determined by exploring the performance for F = 3, 5, 8, 10, 12, 15, 20, 25. The classification accuracy and the CPU time elapsed during the classification of the various datasets by the different classifiers have been calculated for each value of F. The results for the KentRidge and BioGPS datasets for different values of F are shown in Figure 7.4. The results indicate that the parameter F has no significant effect on classification accuracy, but has a considerable influence on the CPU time involved in classification, which increases with increasing F. Similar results were obtained for the other two datasets, but have not been shown here. Therefore, considering the slightly better classification accuracy at F = 10 in most of the cases, and the increase in computational complexity with increasing F, the value F = 10 has been chosen.
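A minimal sketch of the grid search over c and γ, under the assumption that `train_and_score` wraps SVM training plus cross-validated accuracy (a hypothetical stand-in, not the thesis implementation):

```python
import itertools

def grid_search(train_and_score, c_grid, gamma_grid):
    """Exhaustive grid search: score every (c, gamma) pair and keep the best."""
    best_params, best_score = None, float("-inf")
    for c, gamma in itertools.product(c_grid, gamma_grid):
        score = train_and_score(c, gamma)
        if score > best_score:
            best_params, best_score = (c, gamma), score
    return best_params, best_score

# Grids matching the settings quoted above: c = 1..100 step 1, gamma = 0.001..0.099 step 0.002
c_grid = range(1, 101)
gamma_grid = [0.001 + 0.002 * i for i in range(50)]
```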


Figure 7.4: (a) Classification accuracy and (b) corresponding CPU time on KentRidge dataset for different values of F, (c) Classification accuracy and (d) corresponding CPU time on BioGPS dataset for different values of F

7.2.3 Selected gene expressions and optimized SVM classifiers based ensemble classification

The gene sets selected by the various feature selection techniques for the different datasets have been given as input to the optimized classifiers for classification. The class labels assigned by the individual optimized SVM classifiers have been aggregated, and final class labels are assigned to the samples based on the weighted majority voting of the individual labels. The class labels of the SVM classifiers and those of the proposed GECC are shown in Figure 7.5 for the BioGPS dataset. The first 37 samples are normal and the remaining 94 samples are malignant. Normal and malignant samples are shown by -1 and +1 labels, respectively. The odd entries in the figure, i.e., the normal samples with a +1 label and the malignant samples with a -1 label, represent the misclassifications. For example, the 5th, 10th and 16th samples in Figure 7.5 are normal, but the RBF classifier has identified them as malignant.
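The label-fusion step can be sketched as below: a minimal reimplementation of weighted majority voting for ±1 labels, where `weights` is assumed to come from the GA-based weight optimization described in Section 7.2.5:

```python
def weighted_majority_vote(predictions, weights):
    """Combine per-classifier label vectors (+1 malignant / -1 normal) into
    final labels by a weighted vote."""
    n_samples = len(predictions[0])
    finals = []
    for s in range(n_samples):
        # weighted sum of the individual classifiers' votes for sample s
        score = sum(w * preds[s] for w, preds in zip(weights, predictions))
        finals.append(1 if score >= 0 else -1)
    return finals
```

A classifier with a small weight can thus be outvoted on the hard samples, which is how the ensemble recovers labels that individual SVMs get wrong.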


Figure 7.5: Predictions of individual SVM classifiers (collected through 10-fold crossvalidation) and the proposed GECC for BioGPS dataset

The figure demonstrates the superior classification results of the proposed GECC technique over the optimized SVM classifiers for the BioGPS dataset. The results show that most of the samples which were misclassified by the individual SVM classifiers have been assigned correct labels by the proposed GECC technique. The better performance of GECC is mainly due to the combination of the strengths of multiple SVM classifiers. Further, many hard samples, such as the 5th, 16th, 34th, and 43rd instances in Figure 7.5, have been assigned correct labels by the proposed GECC technique owing to the weighted majority voting.


7.2.4 Performance of GECC on standard colon cancer datasets

In this section, the performance of the proposed GECC technique has been evaluated on various gene based colon cancer datasets in terms of well-known measures, as shown in Table 7.2. The values of these measures for the individual SVM classifiers have also been given in the table to show the performance improvement achieved by GECC.

                     Linear   RBF     Sigm    Poly    GECC
                     Acc      Acc     Acc     Acc     Acc     Sen    Spe    MCC    FM
BioGPS dataset       94.63    94.66   96.18   93.89   98.67   0.97   0.98   0.96   0.98
KentRidge dataset
  mRMR               88.71    90.32   93.54   92.32   97.03   0.98   0.96   0.93   0.98
  F-Score            90.32    91.94   95.16   93.54   98.78   0.98   0.97   0.96   0.98
  Chi-sq             82.26    87.10   93.55   91.93   97.01   0.98   0.95   0.93   0.98
  PCA                82.26    87.10   85.48   85.48   91.94   0.90   0.95   0.83   0.94
Notterman dataset
  mRMR               86.11    86.11   91.67   88.89   94.44   0.94   0.94   0.89   0.94
  F-Score            91.67    91.67   94.44   94.44   97.22   0.94   1.00   0.95   0.97
  Chi-sq             80.56    83.33   88.89   83.33   91.67   0.89   0.94   0.83   0.91
  PCA                77.78    80.56   86.11   83.33   88.89   0.94   0.83   0.78   0.89
E-GEOD-40966 dataset
  mRMR               90.57    91.43   93.14   92.86   97.14   0.97   0.98   0.94   0.98
  F-Score            92.29    93.43   94.00   93.71   97.71   0.96   0.99   0.95   0.97
  Chi-sq             89.71    90.29   92.00   90.86   95.71   0.95   0.97   0.91   0.96
  PCA                86.57    87.71   90.29   88.86   94.29   0.93   0.96   0.88   0.95

Simple and italic bold face entries, respectively, correspond to the best performance of GECC and individual classifiers on a given dataset.

Table 7.2: Performance analysis for various combinations of classifiers and feature selection strategies

The performance of the proposed GECC technique is better than those of the individual classifiers for the BioGPS dataset as well as the different variants of the other datasets. The best individual performance on the BioGPS dataset has been shown by the sigmoid SVM, with an accuracy of 96.18%. This classification accuracy has been further enhanced up to 98.67% by the proposed GECC ensemble. Similarly, the sigmoid SVM in combination with F-Score based feature selection has proven to be effective for all the datasets, as validated by the 94.00%, 94.44%, and 95.16% classification accuracies for the E-GEOD-40966, Notterman, and KentRidge datasets, respectively. This classification rate is further enhanced by the proposed GECC ensemble scheme by utilizing the positive aspects of multiple classifiers, thereby yielding classification accuracies of 97.71%, 97.22% and 98.78%, respectively, for the F-Score based selected sets of the E-GEOD-40966, Notterman, and KentRidge datasets. The performance of the proposed GECC scheme and the individual classifiers has also been evaluated in terms of ROC curves, shown in Figure 7.6 (a) and (b), respectively, for the F-Score based selected gene set of the KentRidge dataset and the original BioGPS dataset. The figures show that the ROC curves of GECC are better than those of the individual classifiers. The ROC curves have a similar pattern for the E-GEOD-40966 and Notterman datasets (not included in Figure 7.6).


Figure 7.6: ROC curves: (a) KentRidge dataset (using F-Score), and (b) BioGPS dataset

Furthermore, AUC values have been calculated for the individual classifiers and the proposed GECC technique for the different variants of the datasets, and are shown in Table 7.3. The AUC values confirm that the proposed GECC technique has enhanced classification performance compared to the individual SVM classifiers.
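As a sketch, an AUC value of the kind reported in Table 7.3 can be computed directly from classifier scores via the Mann-Whitney formulation; this is an illustrative helper, not the thesis code:

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney formulation: the probability that a randomly
    chosen malignant (+1) sample scores above a randomly chosen normal (-1) one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    # each won pairwise comparison counts 1, each tie counts 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```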

                  Linear   RBF      Sigmoid   Polynomial   GECC
BioGPS dataset
  Original data   0.9356   0.9323   0.9516    0.8763       0.9759
KentRidge dataset
  mRMR            0.8765   0.8965   0.9385    0.9084       0.9797
  F-Score         0.9052   0.9064   0.9552    0.9450       0.9886
  Chi-square      0.8321   0.8865   0.9377    0.9036       0.9645
  PCA             0.8232   0.8865   0.8612    0.8615       0.9254
Notterman dataset
  mRMR            0.8693   0.8752   0.9210    0.8962       0.9452
  F-Score         0.9199   0.9215   0.9387    0.9398       0.9799
  Chi-square      0.8164   0.8497   0.8751    0.8439       0.9200
  PCA             0.7991   0.8135   0.8586    0.8465       0.9029
E-GEOD-40966 dataset
  mRMR            0.8987   0.9235   0.9463    0.9287       0.9705
  F-Score         0.9165   0.9487   0.9586    0.9535       0.9832
  Chi-square      0.8757   0.8865   0.9365    0.9174       0.9607
  PCA             0.8598   0.8654   0.9174    0.8978       0.9386

Simple and italic bold face entries correspond to the best performance of GECC and other classifiers for a given dataset, respectively.

Table 7.3: Performance comparison of classifiers in terms of AUC

7.2.5 Computational complexity of the GECC technique

This section investigates the CPU time consumed during the different phases of the proposed technique: selection of genes, parameter optimization of the SVM classifiers using 10-fold cross-validation, and optimization of the weights for majority voting. The CPU time for the different phases is given in Table 7.4. The first section in Table 7.4 shows the time elapsed during the selection of discerning genes from the different datasets. It may be noticed that the gene selection time is greater for larger datasets. The second section shows the time involved in the parameter optimization of the different SVM classifiers through 10-fold cross-validation. The optimization time of GECC is the sum of the optimization times of the individual classifiers. The labels predicted by the individual SVM classifiers are combined through weighted majority voting. The weights have been determined using GA, and the time elapsed during the computation of the weights is given in the third section of the table. The optimal parameter values and the optimized classifier weights have been directly used in the testing phase. The training and testing times at the optimal values are given in the last two sections of the table. Furthermore, the results show that the proposed GECC technique is computationally tractable: even the maximum training time, obtained for the largest dataset (E-GEOD-40966) when PCA has been used as the underlying gene selection strategy, is 35.25 + 555.37 + 0.39 = 591.01 sec.

                BioGPS   KentRidge   Notterman   E-GEOD-40966
Gene selection
  mRMR            3.56        6.77       13.58          15.89
  Chi-square      0.33        0.72        0.98           1.06
  F-Score         0.68        0.99        1.58           3.01
  PCA            0.048        7.76       16.02          32.25
Parameter optimization using 10-fold cross-validation
  Linear          1.25        2.22        2.62           3.30
  RBF            56.02      120.02      136.23         179.23
  Sigmoid        63.54      124.25      140.28         184.28
  Polynomial     65.87      129.98      142.56         188.56
  GECC          186.68      376.47      421.69         555.37
Weight optimization for majority vote
  GECC          0.3121      0.4080      0.3356         0.3897
Training on optimal parameter values
  Linear        0.0100      0.0212      0.0252         0.0330
  RBF           0.0116      0.0236      0.0272         0.0346
  Sigmoid       0.0123      0.0252      0.0280         0.0360
  Polynomial    0.0127      0.0260      0.0286         0.0374
  GECC          0.0466      0.0960      0.1090         0.1410
Testing on optimal parameter values
  Linear        0.0024      0.0031      0.0037         0.0040
  RBF           0.0026      0.0033      0.0040         0.0042
  Sigmoid       0.0027      0.0038      0.0041         0.0044
  Polynomial    0.0030      0.0037      0.0043         0.0045
  GECC          0.0117      0.0148      0.0182         0.0191

Table 7.4: Computational time requirements of GECC (sec)

7.2.6 Performance comparison of the GECC technique with some existing schemes and classifiers

This section provides a performance comparison of the proposed GECC technique with some earlier gene based colon cancer detection techniques and with frequently used classifiers. To this end, five classification techniques, namely, Li et al. [33], Venkatesh et al. [28], Kulkarni et al. [29], Lee et al. [30] and Tong et al. [31], have been used for comparison. Similarly, three frequently used classifiers, namely, PNN, KNN, and decision tree, have been chosen for comparison. Matlab has been used for the implementation of these techniques, and the optimal values of the parameters involved in these techniques and classifiers have been found on dataset-C prior to classification. Table 7.5 summarizes the comparison of GECC with these techniques and classifiers. The classification accuracy of GECC is superior to that of the other techniques and classifiers for all the datasets. The better performance of GECC is the result of selecting discerning gene expressions by employing various state-of-the-art feature selection techniques, and of boosting the classification accuracy by combining the results of the optimized individual classifiers. The reduced gene sets, selected by the various feature selection techniques, not only reduce the computational time, but are more discerning as well.

                      Ref.    Acc     Sensitivity   Specificity   MCC    F-Measure
BioGPS dataset
  KNN                         91.15   0.89          0.88          0.90   0.90
  PNN                         90.08   0.65          0.87          0.75   0.79
  Li et al.           [33]    91.23   0.89          0.90          0.88   0.88
  Venkatesh et al.    [28]    91.25   0.89          0.93          0.90   0.89
  Kulkarni et al.     [29]    94.45   0.96          0.96          0.97   0.97
  Lee et al.          [30]    80.23   0.79          0.81          0.77   0.81
  Tong et al.         [31]    93.55   0.86          0.98          0.86   0.90
  GECC                        98.67   0.97          0.98          0.96   0.98
KentRidge dataset
  KNN                         85.48   0.77          0.90          0.68   0.79
  PNN                         87.09   0.77          0.93          0.71   0.81
  Li et al.           [33]    89.01   0.88          0.90          0.87   0.88
  Venkatesh et al.    [28]    94.40   0.92          0.93          0.93   0.92
  Kulkarni et al.     [29]    98.33   0.99          0.98          0.97   0.97
  Lee et al.          [30]    76.85   0.77          0.79          0.70   0.74
  Tong et al.         [31]    90.32   0.82          0.95          0.79   0.86
  GECC                        98.78   0.98          0.97          0.96   0.98
Notterman dataset
  KNN                         75.00   0.72          0.78          0.50   0.74
  PNN                         75.00   0.67          0.83          0.51   0.73
  Li et al.           [33]    80.56   0.78          0.83          0.61   0.80
  Venkatesh et al.    [28]    86.11   0.83          0.89          0.72   0.86
  Kulkarni et al.     [29]    91.67   0.89          0.94          0.83   0.91
  Lee et al.          [30]    83.33   0.83          0.83          0.67   0.83
  Tong et al.         [31]    88.89   0.89          0.89          0.78   0.89
  GECC                        97.22   0.94          1.00          0.95   0.97
E-GEOD-40966 dataset
  KNN                         83.71   0.89          0.76          0.66   0.87
  PNN                         84.57   0.86          0.83          0.68   0.87
  Li et al.           [33]    88.57   0.89          0.88          0.77   0.90
  Venkatesh et al.    [28]    90.57   0.91          0.90          0.81   0.92
  Kulkarni et al.     [29]    92.29   0.94          0.90          0.84   0.94
  Lee et al.          [30]    88.86   0.89          0.89          0.77   0.90
  Tong et al.         [31]    91.14   0.93          0.88          0.82   0.93
  GECC                        97.71   0.96          0.99          0.95   0.97

Simple and italic bold face entries, respectively, correspond to the best performance of GECC and previous techniques on a given dataset

Table 7.5: Performance comparison of GECC with some existing schemes and classifiers in terms of classification accuracy

7.2.7 Performance analysis of the GECC technique on other complex gene expression based cancer datasets

Some binary-class gene expression based datasets of other cancer types (cancers related to other body parts) have also been classified into their respective classes by using the proposed GECC technique. This has been done in order to validate the usefulness of the proposed GECC as a general purpose cancer detection technique, capable of effectively classifying gene expression based cancer datasets of various body parts. A brief description of these datasets is given in Table 7.6.


Dataset        Ref.   No. of genes   Samples of C1   Samples of C2   Total samples
CNS            [86]          7,129              39              21              60
DLBCL          [87]          7,129              58              19              77
Leukemia       [88]          7,129              25              47              72
Lung-I         [89]         12,533             150              31             181
Lung-II        [90]          7,129              86              10              96
Prostate-I     [91]         12,600              52              50             102
Prostate-II    [92]         12,625              38              50              88
Prostate-III   [93]         12,626              24              09              33

Table 7.6: Binary class gene expression datasets

The datasets summarized in Table 7.6 are large in size; hence, discriminating genes have been selected by using mRMR, PCA, chi-square and F-Score before classification. The numbers of selected genes are shown in Table 7.7 for mRMR, chi-square and F-Score, whereas the numbers of selected principal components are shown for PCA. These selected genes, or the principal components lying within the chosen confidence interval in the case of PCA, have been used for classification.

               Original   mRMR   PCA   Chi-square   F-score
CNS               7,129    175   171          180       165
DLBCL             7,129    160   162          210       155
Leukemia          7,129    180   112          220       135
Lung-I           12,533    280   268          295       235
Lung-II           7,129    130   116          195       140
Prostate-I       12,600    235   261          410       220
Prostate-II      12,625    285   238          395       235
Prostate-III     12,626    285   201          425       280

Bold face entries correspond to the number of genes where maximum accuracy is achieved by GECC for a given dataset

Table 7.7: Number of genes selected by feature selection strategies for the gene expression datasets given in Table 7.6

The genes selected by the various feature selection techniques have been used for the classification of samples into their respective classes by using the base classifiers and the ensemble GECC technique. The gene expressions corresponding to the highlighted entries in Table 7.7 have been employed for classification, since they correspond to the maximum classification accuracy for a particular dataset. Table 7.8 shows the classification accuracies, which prove that GECC performs equally well for colon cancer and for cancers relating to other body parts.


               Linear   RBF     Sigmoid   Polynomial   GECC
CNS             91.67   93.33     95.00        96.67    98.33
DLBCL           89.61   92.21     94.81        93.51    98.70
Leukemia        91.67   93.06     97.22        95.83    98.61
Lung-I          92.27   93.92     97.79        96.13    99.45
Lung-II         81.25   82.29     85.42        83.33    88.54
Prostate-I      89.22   91.98     94.12        93.14    96.08
Prostate-II     88.64   89.77     92.05        90.91    94.32
Prostate-III    90.91   93.94     96.97        96.97   100.00

Bold face entries correspond to the individual best performance of a classifier for a given dataset

Table 7.8: Performance of GECC on the gene expression datasets given in Table 7.6

7.3 Chapter summary

The proposed GECC technique employs an ensemble of various SVM classifiers for classification. The experiments have been conducted on four standard colon cancer datasets. To reduce the large size of the datasets, four different feature selection strategies have been employed. Analysis reveals that the genes selected by F-Score are better able to classify the different datasets than the genes selected by the other techniques. Amongst the individual classifiers, the sigmoid SVM performs best for all the datasets. The SVM ensemble increases performance compared to the individual SVM classifiers, with only a slight increase in computational time. The performance of GECC has also been validated on several other complex gene expression datasets, and quite promising classification results have been achieved. Therefore, it can reasonably be concluded that the proposed GECC can help biologists not only in accurately predicting colon cancer, but also cancers relating to other parts of the body.


Chapter 8

Proposed colon cancer grading techniques

The grade of colon cancer quantifies the differentiability level of malignant tissues from the normal ones. Pathologists assign quantitative cancer grades to colon samples by the visual examination of pathological colon samples under a microscope. There are three grades of colon cancer, namely, well-, moderate- and poor-differentiable. Since physicians decide treatment plans depending upon the grade of colon cancer, it is very important to determine the true grade. Therefore, automatic colon cancer grading techniques are required, which could automatically determine the severity of colon cancer and assign cancer grades accordingly. In this research work, the structural variation amongst colon biopsy images of the various cancer grades has been exploited to determine the grade of colon cancer. For this purpose, various types of novel features have been extracted from colon biopsy images, and have been used to classify the images into the three colon cancer grades. These features are described in the following text.

8.1 White run-length features based grading of colon cancer

This section presents white run-length (WRL) features for the grading of colon biopsy images into well-, moderate-, and poor-differentiable cancer grades. The proposed run-length features are a novel variant of the traditional gray-level run-length features. These features exploit the variation in the structure of malignant colon tissues of the various colon cancer grades for classification, and are simple to extract. A typical CAD framework has been employed for the evaluation of the proposed WRL features. This framework is shown in Figure 8.1. In the pre-processing phase, the K-Means algorithm [12], as already discussed in Section 5.1.1, has been used to quantize image pixels into white, pink, and purple clusters. In the feature extraction phase, the proposed WRL features, which exploit the variation among the different colon cancer grades, are extracted from the colon biopsy images. Data is formulated using Jack-knife 10-fold cross-validation, and classification is performed using the RBF kernel of SVM. The RBF kernel has been employed for classification since it proved better than other classifiers for the colon biopsy image dataset in Chapter 6. Sections 8.1.1 and 8.1.2 discuss the proposed WRL features and their results on dataset-B.
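The pixel-quantization step can be sketched as a plain Lloyd's K-Means over RGB pixels; this is an illustrative reimplementation, not necessarily the exact K-Means variant of [12], and initial centers are taken at evenly spaced pixel indices purely for determinism:

```python
import numpy as np

def kmeans_quantize(pixels, k=3, iters=20):
    """Lloyd's K-Means over an (n, 3) array of RGB pixels, so that pixels
    can be grouped into white-, pink- and purple-like clusters."""
    pixels = np.asarray(pixels, dtype=float)
    centers = pixels[np.linspace(0, len(pixels) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # squared distance of every pixel to every center
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels, centers
```

The cluster whose center is closest to pure white would then be taken as the white cluster used by the WRL and lumen area features.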


Figure 8.1: Top level layout of the CAD system used for the evaluation of WRL features

8.1.1 White run-length features

The distribution of all the cytological tissue constituents of a colon biopsy image, such as epithelial cells, non-epithelial cells, and connecting tissues, changes due to malignancy. However, it is observed that the variation experienced by the lumen and epithelial cells is the most significant of all. This variation is also notable among the different grades of progressing colon cancer. In the well-differentiable grade, epithelial cells and lumen merge together, thereby resulting in larger white patches. With the increase in severity of colon cancer, the white patch splits into multiple smaller patches, which are dispersed among the other constituents of the colon tissue. Therefore, in this work, the variation that the lumen and epithelial cells undergo as the cancer progresses from well-, to moderate- and poor-differentiable grades is measured. This variation is calculated by measuring the maximum possible expansion which the epithelial cells and lumen undergo both horizontally and vertically. This expansion is measured by finding the largest horizontal and vertical run-lengths in the white cluster. The horizontal and vertical white run-lengths are expected to be large for the well-differentiable cancer grade, medium for the moderate-differentiable cancer grade, and small for the poor-differentiable cancer grade. Since the run-length features are extracted from the white cluster, they have been named white run-length (WRL) features. Figure 8.2 shows the process of computing WRL features from colon biopsy images of the various cancer grades.

Figure 8.2: Schematic diagram showing the steps involved in the computation of WRL features from malignant colon biopsy images of various cancer grades, 1st row: malignant colon biopsy images, 2nd row: white clusters, 3rd row: eroded clusters

The first and second rows in Figure 8.2 show colon biopsy images and their corresponding white clusters, respectively. Initially, the white clusters have been eroded with a disk-shaped structuring element of radius r in order to disjoin small patches associated with the largest white patch. The eroded clusters are shown in the third row of Figure 8.2. Later, WRL features have been computed both at the local and the global level. The global WRL features have been computed using the whole image, whereas the local WRL features are computed from equal-sized smaller blocks of the image. For a colon biopsy image comprising Bwrl small blocks, the WRL feature vector has Bwrl*2 + 2 features. The third row in Figure 8.2 also shows the global horizontal and vertical run-lengths in terms of arrows. The run-lengths are measured in number of pixels.
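A minimal sketch of the WRL computation on a binary white-cluster mask (assumed to be obtained after K-Means quantization and erosion); `blocks=3` gives the Bwrl = 9 grid used in the experiments, for a feature vector of length Bwrl*2 + 2 = 20:

```python
import numpy as np

def longest_run(binary_row):
    """Length of the longest run of 1s in a 1-D binary array."""
    padded = np.concatenate(([0], binary_row, [0]))
    changes = np.flatnonzero(np.diff(padded))   # run starts and ends, alternating
    if len(changes) == 0:
        return 0
    starts, ends = changes[0::2], changes[1::2]
    return int((ends - starts).max())

def wrl_features(white_mask, blocks=3):
    """Global + local white run-length features from a binary white-cluster mask.
    Global: largest horizontal and vertical runs over the whole image;
    local: the same pair for every block of a blocks x blocks grid."""
    horiz = max(longest_run(r) for r in white_mask)
    vert = max(longest_run(c) for c in white_mask.T)
    feats = [horiz, vert]
    h, w = white_mask.shape
    for bi in range(blocks):
        for bj in range(blocks):
            sub = white_mask[bi * h // blocks:(bi + 1) * h // blocks,
                             bj * w // blocks:(bj + 1) * w // blocks]
            feats += [max(longest_run(r) for r in sub),
                      max(longest_run(c) for c in sub.T)]
    return feats
```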

8.1.2 Results and discussion

This section presents the results of using the proposed WRL features for the classification of malignant colon biopsy images into the various cancer grades. The optimal values of Bwrl and r have been found to be 9 and 2, respectively, for dataset-B before classification. The feature set has been formulated using Jack-knife 10-fold cross-validation, and the RBF kernel of SVM has been employed for classification. Table 8.1 shows the performance of the WRL features for the detection of individual cancer grades in terms of various performance measures. It is observed from the results of Table 8.1 that the proposed WRL features yield a reasonable classification of malignant colon biopsy images into the various cancer grades. The classification capability of the proposed WRL features has been shown separately for the total malignant dataset and for the individual cancer grades in the table. The results show that the WRL features yield better classification accuracy for the well-differentiable cancer grade than for the moderate- and poor-differentiable cancer grades. The primary reason is that, as the cancer grade progresses, the larger white patch, which is easy to detect in well-differentiable images, may no longer be easily identifiable.

            Accuracy   Sensitivity   Specificity   F-Measure
well           91.30          0.91          0.91        0.84
moderate       75.00          0.75          0.92        0.81
poor           80.00          0.80          0.88        0.75
overall        80.43             -             -           -

Table 8.1: Classification performance of the proposed WRL features for grading of colon cancer on dataset-B

The performance of the proposed WRL features has also been analyzed in terms of the CPU time involved in feature extraction. The average extraction time to compute WRL features from a single image is 5.35 seconds.


8.2 Lumen area features based grading of colon cancer

The WRL features, explained in Section 8.1.1, capture the variation amongst colon biopsy images of the various cancer grades by measuring the horizontal and vertical expansion in the white cluster of colon biopsy images. However, these features are sensitive to cases where there is a small gap within the white patch. Furthermore, the extraction of WRL features from colon biopsy images consumes considerable CPU time. Therefore, to overcome the limitations of the WRL features, novel lumen area based features, which are computationally tractable as well as discriminative, have been proposed. Though the lumen area based features also exploit the expansion in the white cluster of colon biopsy images, as has been done in the case of the WRL features, they do it from the perspective of lumen area. Similar to the evaluation process of the WRL features, a typical CAD framework has been employed for the evaluation of the proposed features. This framework is shown in Figure 8.3. In the pre-processing phase, the K-Means algorithm [12], as already discussed in Section 5.1.1, has been used to quantize image pixels into white, pink, and purple clusters. In the feature extraction phase, the proposed lumen area based features, which exploit the variation among the different colon cancer grades, are extracted from the colon biopsy images. The feature set is formulated using Jack-knife 10-fold cross-validation, and classification is performed using the RBF kernel of SVM. Sections 8.2.1 and 8.2.2 discuss the proposed lumen area based features and their results on dataset-B.


Figure 8.3: Top level layout of the CAD system used for the evaluation of lumen area based features


8.2.1 Lumen area based features

As already discussed in Section 8.1.1, epithelial cells and lumen merge together to generate larger white regions in the well-differentiable cancer grade. With the increase in the severity of cancer, the white patch splits into smaller patches, which disperse among the other parts of the colon tissue. Therefore, in this work, the variation among the various colon cancer grades is measured by calculating the area spanned by the largest patch in the white cluster. The area of the largest white patch is expected to be large for the well-, medium for the moderate-, and small for the poor-differentiable cancer grade. The procedure of calculating these features is a two-step process. These steps are illustrated in Figure 8.4 for sample malignant colon biopsy images, and are described in the following text.

Figure 8.4: Schematic diagram showing the steps involved in the computation of lumen area based features from malignant colon biopsy images of various cancer grades, 1st row: malignant colon biopsy images, 2nd row: white clusters, 3rd row: eroded clusters, 4th row: connected components

Extraction of the largest connected component: In this step, the largest patch is extracted from the white cluster of the colon biopsy image. Initially, morphological erosion with a disk-shaped structuring element of radius r is performed on the white cluster in order to disjoin the largest connected component from the other structures in the image, as discussed in Section 8.1.1. The first row in Figure 8.4 shows colon biopsy images of different cancer grades. The second and third rows in Figure 8.4 show the corresponding white clusters and the clusters obtained after erosion, respectively. Finally, the largest connected component is obtained by converting the eroded cluster into connected components. The fourth row in the figure shows all the components and the corresponding largest connected component (enclosed by the boundary) for the sample images.

Quantification of lumen shape: In this step, the expansion in the lumen and epithelial cells is quantified in terms of two novel structural features, described below. Let I, W and C represent the sets of pixels in the image, the white cluster, and the extracted largest component, respectively. The structural features are defined as under.

1. Lumen cluster ratio (LCR): LCR is the ratio of the area occupied by the largest white connected component to the total area of the white cluster. The area is computed by counting the number of pixels in an object.

   LCR = |C| / |W|    (8.1)

   where |C| and |W| represent the cardinality of the sets C and W, respectively.

2. Lumen image ratio (LIR): LIR is the ratio of the area occupied by the largest white connected component to the total area of the image.

   LIR = |C| / |I|    (8.2)

   where |C| and |I| represent the cardinality of the sets C and I, respectively.
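Equations (8.1) and (8.2) can be sketched as below; `largest_component_size` is an illustrative BFS-based connected-component helper (4-connectivity is assumed here; the thesis does not specify the connectivity):

```python
import numpy as np
from collections import deque

def largest_component_size(mask):
    """Pixel count of the largest 4-connected component of 1s in a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    best = 0
    for i, j in zip(*np.nonzero(mask)):
        if seen[i, j]:
            continue
        size, queue = 0, deque([(i, j)])
        seen[i, j] = True
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            size += 1
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        best = max(best, size)
    return best

def lumen_features(white_mask):
    """LCR = |C|/|W| and LIR = |C|/|I| from Eqs. (8.1) and (8.2)."""
    m = np.asarray(white_mask, dtype=bool)
    c = largest_component_size(m)
    return c / int(m.sum()), c / m.size
```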

8.2.2 Results and discussion

This section presents the findings about the performance of the proposed LCR and LIR features for the classification of malignant images into well-, moderate- and poor-differentiable cancer grades. The optimal value of r has been found to be 2 on dataset-B prior to classification. The performance of the proposed features has been measured not only for the overall malignant dataset, but also for the individual cancer grades. Both the individual and the hybrid features (i.e., LCR+LIR) have been investigated for cancer grading. The confusion matrices for LCR, LIR, and LCR+LIR are given in Tables 8.2, 8.3 and 8.4, respectively. The results demonstrate that the proposed features have effectively captured the biological variation that the lumen undergoes with progressing cancer. The diagonal entries in the confusion matrices verify that the proposed features have correctly classified a large subset of the well-, moderate- and poor-differentiable images.

moderate

poor

well moderate

20 3

2 36

1 5

poor

1

2

22

Table 8.2: Confusion matrix of LCR

predicted \ true    well   moderate   poor
well                 21       2         1
moderate              1      38         3
poor                  1       4        21

Table 8.3: Confusion matrix of LIR

predicted \ true    well   moderate   poor
well                 22       2         0
moderate              1      39         2
poor                  0       3        23

Table 8.4: Confusion matrix of LCR+LIR
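The per-grade recognition rates reported below in Table 8.5 follow directly from these confusion matrices. A minimal sketch, assuming (as the per-grade figures imply) that rows index the predicted grade and columns the true grade; the variable names are illustrative:

```python
import numpy as np

# Table 8.4 (LCR+LIR): rows = predicted grade, columns = true grade,
# in the order well, moderate, poor.
cm_hybrid = np.array([[22,  2,  0],
                      [ 1, 39,  2],
                      [ 0,  3, 23]])

# Per-grade accuracy: correctly classified images of a grade divided by
# the number of images that truly belong to it (the column sums).
per_grade_acc = 100 * cm_hybrid.diagonal() / cm_hybrid.sum(axis=0)
# well 95.65, moderate 88.64, poor 92.00
```

These are the LCR+LIR values reported for the individual grades in Table 8.5.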

Table 8.5 reports the classification accuracy of the proposed features for the overall malignant dataset and for the individual cancer grades. The recognition rate of 93.47% for the overall dataset, and recognition rates of 95.65%, 88.64% and 92.00% for the well-, moderate-, and poor-differentiable cancer grades, respectively, show that the proposed features are quite effective at discriminating the various colon cancer grades. Further, the results demonstrate that the starting grade (well-differentiable) is easier to distinguish than the other grades of colon cancer, as shown by its classification success rate of 95.65%. Thus, the proposed features can be effective in detecting colon cancer at an early stage, thereby improving the disease prognosis.

           Well    Moderate   Poor    Overall
LCR        86.96    81.82     88.00    84.78
LIR        91.30    86.36     84.00    86.95
LCR+LIR    95.65    88.64     92.00    93.47

Table 8.5: Classification accuracy (%) of the LCR and LIR features

Table 8.6 presents the performance of the proposed features for detection of the individual cancer grades in terms of a few other performance measures. As concluded from the results of Table 8.5, the well-differentiable grade is identified most reliably by the proposed features. Further, the good values of sensitivity, specificity, and F-measure for the well-, moderate- and poor-differentiable cancer grades show the efficacy of the proposed features for discerning the various colon cancer grades.

                      Sensitivity   Specificity   F-Measure
LCR       well           0.87          0.94         0.85
          moderate       0.82          0.92         0.86
          poor           0.88          0.91         0.83
LIR       well           0.91          0.96         0.89
          moderate       0.86          0.92         0.88
          poor           0.84          0.93         0.82
LCR+LIR   well           0.96          0.97         0.94
          moderate       0.89          0.94         0.91
          poor           0.92          0.96         0.90

Table 8.6: Classification results of the LCR and LIR features in terms of sensitivity, specificity, and F-Measure
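These one-vs-rest measures can be derived mechanically from a confusion matrix. The sketch below is illustrative (the function name, and the convention that rows index the predicted grade and columns the true grade, are inferences from the tables, not stated in the text):

```python
import numpy as np

def grade_metrics(cm, k):
    """One-vs-rest sensitivity, specificity and F-measure for class k.

    cm : confusion matrix with rows = predicted class, columns = true class.
    """
    tp = cm[k, k]
    fn = cm[:, k].sum() - tp      # truly k, predicted otherwise
    fp = cm[k, :].sum() - tp      # predicted k, truly otherwise
    tn = cm.sum() - tp - fn - fp
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, f1

cm = np.array([[22,  2,  0],      # Table 8.4 (LCR+LIR)
               [ 1, 39,  2],
               [ 0,  3, 23]])
sens, spec, f1 = grade_metrics(cm, 0)   # well-differentiable
```

Applied to Table 8.4, this reproduces the LCR+LIR rows of Table 8.6 to two decimal places.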

It can be concluded from the results presented in Tables 8.2-8.6 that the proposed LCR and LIR features are good at capturing variations among different cancer grades, and that this distinguishing capability is further enhanced when the features are combined into a hybrid feature vector. Therefore, results for cancer grading are reported only for the hybrid feature (LCR+LIR) in the subsequent text. Figure 8.5 presents a few sample colon biopsy images which are correctly graded by the proposed features. It can be seen from the figure that the well-differentiable image contains a large white cluster (lumen), resulting in larger values of the proposed LCR and LIR features, whereas the lumen becomes smaller and dispersed as the cancer grade progresses to the moderate- and poor-differentiated states, as shown in Figure 8.5 (b) and (c), respectively. This characteristic is effectively exploited by the proposed features to successfully identify the grade of these images.


Figure 8.5: (a) well-, (b) moderate-, and (c) poorly-differentiable images, which are correctly classified by the proposed features

Figure 8.6 contains a few malignant images which are incorrectly classified by the proposed features. Each image in the figure carries two labels: the first is the true grade of the image, whereas the second (pointed to by the arrow) is the grade predicted by the proposed system. It can be seen that these images deviate from the normally observed behavior of progressing grades, i.e., the largest white region is small for the well-differentiable image, and vice versa for the moderate- and poor-differentiable images. Therefore, the proposed system incorrectly identifies the grades of these images.

poor → moderate        well → moderate        moderate → well

Figure 8.6: Examples of a few malignant images, which are incorrectly classified by the proposed features


Computational time requirements of the proposed features: In this section, the average CPU time consumed in extracting the proposed features is reported in seconds. The average CPU time to compute the LCR and LIR features from a single image is 0.0057 and 0.0055 seconds, respectively. The extraction time for the hybrid feature set (LCR+LIR), 0.0112 seconds, is simply the sum of the extraction times of the constituent features. This feature extraction time is negligible, which shows that the proposed features are computationally tractable. The extraction time of the lumen area based features is also very small compared to that of the WRL features, which is 5.35 seconds.

Performance comparison of the proposed CAD system with existing techniques: This section presents a performance comparison of the proposed CAD system (used for evaluation of the lumen area based features) with previously published approaches to colon cancer grading. Two techniques, namely Altunbay et al. [17] and Ozdemir et al. [19], have been selected from the contemporary literature for comparison. The comparison of these techniques with the proposed system is shown in Figure 8.7.

[Figure 8.7 comprises four panels plotting accuracy, sensitivity, specificity, and F-score against the well-, moderate- and poor-differentiable cancer grades for Altunbay et al., Ozdemir et al., and the proposed LCR+LIR features.]

Figure 8.7: Performance comparison of the proposed system with existing colon cancer grading techniques

Compared with the previous techniques, the proposed system shows better results in terms of all the performance measures, as demonstrated in Figure 8.7. The previous techniques build a graph of all the cytological tissue components and evaluate features on that graph, as already discussed in Chapter 2. The comparison in Figure 8.7 shows that features defined on the shape and other properties of the lumen are more effective and robust than those defined on a graph of all the cytological tissue components. The lumen based features are not only computationally tractable, but also improve classification performance compared to the features proposed in the previous techniques.

8.3 Chapter summary

This chapter presents WRL features for automatic grading of malignant colon biopsy images. Although the WRL features yield reasonable results, they suffer in cases where there is a gap of even one pixel in a long white run; furthermore, they are computationally expensive. Hence, the reliable and computationally tractable LCR and LIR features have been presented for automatic grading of colon cancer. The proposed LCR and LIR features have been used to classify 92 malignant images into well-, moderate-, and poor-differentiable cancer grades. The experimental results show that the LCR and LIR features have good grading capability, with a best accuracy of 93.47%. The results in terms of various other performance measures also verify the worth of the proposed features for grading of colon cancer. Compared with several contemporary techniques, the results show that the proposed lumen area based features are more effective in grading colon cancer.


Chapter 9

Conclusion

In this thesis, reliable and effective computer-aided techniques have been proposed for automatic segmentation, classification, and grading of colon samples. The performance of the proposed techniques has been measured in terms of various performance measures, and quite promising performance has been observed compared with various previously proposed techniques.

In this dissertation, a performance overview of most of the existing colon cancer detection techniques has been presented. These techniques were originally evaluated by their authors on their own datasets in the respective research studies; this thesis therefore fills a real need by evaluating the techniques on unified datasets. In this context, colon biopsy image based segmentation and classification techniques have been evaluated using dataset-A and dataset-B, respectively, and the gene analysis based classification techniques have been evaluated using dataset-C.

Several novel features have been proposed for automatic detection and grading of colon biopsy images. These features have been hybridized either with each other or with other feature types for classification. The results show that hybrid and rich feature spaces certainly improve performance compared to their individual counterparts. In a hybrid feature vector, the individual features reinforce each other, thereby enhancing the classification accuracy to a great extent in most cases.

The results of the CBIC and GECC classification techniques demonstrate that individual classifiers show good performance, but when the classifiers are combined in an ensemble paradigm, they strengthen each other and boost the recognition rate. Results also show


that there is no single classifier that performs well in all cases. The RBF and sigmoid SVM classifiers have shown better results for the colon biopsy image based and gene analysis based classification techniques, respectively.

Another important finding of the thesis is that the feature selection phase plays a vital role in the overall computer-aided diagnostic system. The original datasets, with their high dimensionality and irrelevant/redundant features, generally deteriorate the recognition rate; the performance usually increases when discerning and meaningful features are selected using a suitable feature selection technique. The selected features are usually small in number compared to the original feature sets and therefore reduce the computational time to a great extent.

The ECGF, LGF, CCSM and Haralick-HSV features proposed for classification have good discriminative power, as demonstrated by the classification accuracies of 92.53%, 95.14%, 96.68% and 96.55% achieved on dataset-B, respectively. Similarly, the lumen area based features proposed for colon cancer grading have shown a good grading capability of 93.47%. The results show that features which incorporate color information or background knowledge of normal and malignant tissue organization into the classification process have better discriminative power than traditional texture based features.

9.1 Potential future directions

There are several possible future directions for extending the work presented in this dissertation.

9.1.1 Classification at multiple magnification factors

The proposed classification techniques, such as CBIC and HFS-CC (Chapter 6), classify colon biopsy images captured at a magnification factor of 10x. In the future, these techniques may be tested or modified to classify colon biopsy images captured at multiple magnification factors.

9.1.2 Grading at multiple magnification factors

Researchers may test the proposed WRL and lumen area based features for grading of images captured at multiple magnification factors. Researchers may also propose other features for grading of colon biopsy images.


9.1.3 Testing on other biopsy types

Presently, the datasets employed for evaluation of the segmentation, classification and grading techniques have been captured from biopsy slides stained with Hematoxylin & Eosin. In the future, the proposed techniques may be evaluated on datasets captured from immunohistochemically stained biopsies.

9.1.4 Combining output of multiple feature selection methods

The performance of the proposed classification and grading techniques may be enhanced by combining multiple feature selection methods in the feature selection phase, i.e., selecting features using one technique and then applying another feature selection technique to the selected features. Furthermore, different feature selection techniques may assign weights to the features, and majority voting may then be applied to select the discerning features.
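One way the suggested voting scheme could look is sketched below, for illustration only; the ranking inputs, the top-k nomination and the majority threshold are assumptions, not part of the thesis:

```python
import numpy as np

def vote_select(rankings, k):
    """Majority voting over the top-k feature lists of several selectors.

    rankings : list of 1-D index arrays, each ordered from most to least
               relevant (one per feature selection method, e.g. mRMR,
               F-score, PCA loadings).
    k        : number of features each method nominates.
    Returns the indices nominated by more than half of the methods.
    """
    votes = {}
    for rank in rankings:
        for idx in map(int, np.asarray(rank)[:k]):
            votes[idx] = votes.get(idx, 0) + 1
    return sorted(i for i, v in votes.items() if v > len(rankings) / 2)
```

A feature survives only if most selectors agree on its relevance, which is one simple way of realizing the weighting-and-voting idea described above.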

9.1.5 Classification of various cancer stages

Stages of colon cancer can be determined by analyzing gene expression profiles. Researchers may propose techniques which automatically classify gene expression based colon samples into different cancer stages.

9.1.6 Evaluation of existing colon cancer detection techniques

In this thesis, most of the existing colon cancer detection techniques have been evaluated on unified datasets, and the performance of the proposed techniques has been compared with these techniques. In the future, researchers may evaluate other colon cancer detection techniques on these datasets in order to provide a more comprehensive performance comparison. Other performance measures, such as the Q-statistic, ROC curves, area under the curve, and statistical significance measures, may also be introduced into the comparison.
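For reference, the area under the ROC curve mentioned here can be computed directly from classifier scores via the Mann-Whitney rank identity; the sketch below is illustrative and not tied to any thesis experiment:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity:
    AUC = P(score of a random positive > score of a random negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Count positive > negative score pairs; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Such a measure summarizes classifier behavior across all decision thresholds rather than at a single operating point.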

9.1.7 Combining image based and gene based datasets

In this thesis, colon biopsy image based and gene based datasets have been used separately for the evaluation of the proposed techniques. In the future, researchers may combine gene based datasets with image based datasets collected from the same patients to develop a combined feature set, and may investigate its classification performance.


9.1.8 Deployment in histopathology labs

Because the dataset used for the evaluation of the proposed system is small and has been collected from a single laboratory, the system cannot be directly deployed in a real clinical scenario. However, it may be deployed in histopathology labs for experimental purposes, and may be deployed in the field once it passes rigorous testing on large-scale datasets collected from diverse sources.


References

[1] A. B. Tosun, M. Kandemir, C. Sokmensuer, and C. Gunduz-Demir, "Object-oriented texture analysis for the unsupervised segmentation of biopsy images for cancer detection," Pattern Recognition, vol. 42, no. 6, pp. 1104–1112, 2009.
[2] Cancer facts and figures. [Online]. Available: http://www.cancer.org/research/cancerfactsstatistics
[3] What are the risk factors for colorectal cancer? [Online]. Available: http://www.cancer.org/cancer/colonandrectumcancer/detailedguide/colorectalcancer-risk-factors
[4] Colon cancer stages: Basics of each colon cancer stage. [Online]. Available: http://coloncancer.about.com/od/stagesandsurvivalrate1/a/ColonCancerStag.htm
[5] G. Thomas, M. Dixon, N. Smeeton, and N. Williams, "Observer variation in the histological grading of rectal carcinoma," Journal of Clinical Pathology, vol. 36, no. 4, pp. 385–391, 1983.
[6] A. Andrion, C. Magnani, P. Betta, A. Donna, F. Mollo, M. Scelsi, P. Bernardi, M. Botta, and B. Terracini, "Malignant mesothelioma of the pleura: interobserver variability," Journal of Clinical Pathology, vol. 48, no. 9, pp. 856–860, 1995.
[7] S. Rathore, M. A. Iftikhar, M. Hussain, and A. Jalil, "Texture analysis for liver segmentation and classification: a survey," in Proceedings of 9th International Conference on Frontiers of Information Technology. IEEE, 2011, pp. 121–126.
[8] A. N. Esgiar, R. N. Naguib, B. S. Sharif, M. K. Bennett, and A. Murray, "Microscopic image analysis for quantitative measurement and feature identification of normal and cancerous colonic mucosa," IEEE Transactions on Information Technology in Biomedicine, vol. 2, no. 3, pp. 197–203, 1998.
[9] A. N. Esgiar, R. Naguib, M. K. Bennett, and A. Murray, "Automated feature extraction and identification of colon carcinoma," Analytical and Quantitative Cytology and Histology, vol. 20, no. 4, pp. 297–301, 1998.
[10] A. N. Esgiar, R. N. Naguib, B. S. Sharif, M. K. Bennett, and A. Murray, "Fractal analysis in the detection of colonic cancer images," IEEE Transactions on Information Technology in Biomedicine, vol. 6, no. 1, pp. 54–58, 2002.
[11] L. Jiao, Q. Chen, S. Li, and Y. Xu, "Colon cancer detection using whole slide histopathological images," in World Congress on Medical Physics and Biomedical Engineering. Springer, 2013, pp. 1283–1286.
[12] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Transactions on Information Theory, vol. 21, no. 1, pp. 32–40, 1975.
[13] A. B. Tosun, C. Sokmensuer, and C. Gunduz-Demir, "Unsupervised tissue image segmentation through object-oriented texture," in Proceedings of 20th International Conference on Pattern Recognition. IEEE, 2010, pp. 2516–2519.
[14] C. Gunduz-Demir, M. Kandemir, A. B. Tosun, and C. Sokmensuer, "Automatic segmentation of colon glands using object-graphs," Medical Image Analysis, vol. 14, no. 1, pp. 1–12, 2010.
[15] A. B. Tosun and C. Gunduz-Demir, "Graph run-length matrices for histopathological image segmentation," IEEE Transactions on Medical Imaging, vol. 30, no. 3, pp. 721–732, 2011.
[16] A. C. Simsek, A. B. Tosun, C. Aykanat, C. Sokmensuer, and C. Gunduz-Demir, "Multilevel segmentation of histopathological images using cooccurrence of tissue objects," IEEE Transactions on Biomedical Engineering, vol. 59, no. 6, pp. 1681–1690, 2012.
[17] D. Altunbay, C. Cigir, C. Sokmensuer, and C. Gunduz-Demir, "Color graphs for automated cancer diagnosis and grading," IEEE Transactions on Biomedical Engineering, vol. 57, no. 3, pp. 665–674, 2010.
[18] E. Ozdemir, C. Sokmensuer, and C. Gunduz-Demir, "A resampling-based Markovian model for automated colon cancer diagnosis," IEEE Transactions on Biomedical Engineering, vol. 59, no. 1, pp. 281–289, 2012.
[19] E. Ozdemir and C. Gunduz-Demir, "A hybrid classification model for digital pathology using structural and statistical pattern recognition," IEEE Transactions on Medical Imaging, vol. 32, no. 2, pp. 474–483, 2013.
[20] K. Rajpoot and N. Rajpoot, "SVM optimization for hyperspectral colon tissue cell classification," in Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2004, pp. 829–837.
[21] K. Masood, N. Rajpoot, H. Qureshi, and K. Rajpoot, "Co-occurrence and morphological analysis for colon tissue biopsy classification," in Proceedings of 4th International Conference on Frontiers of Information Technology. IEEE, 2006, pp. 211–216.
[22] K. Masood and N. Rajpoot, "Texture based classification of hyperspectral colon biopsy samples using CLBP," in IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 2009, pp. 1011–1014.
[23] A. Chaddad, C. Tanougast, A. Dandache, A. Al Houseini, and A. Bouridane, "Improving of colon cancer cells detection based on Haralick's features on segmented histopathological images," in Proceedings of IEEE International Conference on Computer Applications and Industrial Electronics. IEEE, 2011, pp. 87–90.
[24] T. F. Chan and L. A. Vese, "Active contours without edges," IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266–277, 2001.
[25] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences, vol. 96, no. 12, pp. 6745–6750, 1999.
[26] M. Grade, P. Hörmann, S. Becker, A. B. Hummon, D. Wangsa, S. Varma, R. Simon, T. Liersch, H. Becker, M. J. Difilippantonio et al., "Gene expression profiling reveals a massive, aneuploidy-dependent transcriptional deregulation and distinct differences between lymph node-negative and lymph node-positive colon carcinomas," Cancer Research, vol. 67, no. 1, pp. 41–56, 2007.
[27] K. Kim, U. Park, J. Wang, J. Lee, S. Park, S. Kim, D. Choi, C. Kim, and J. Park, "Gene profiling of colonic serrated adenomas by using oligonucleotide microarray," International Journal of Colorectal Disease, vol. 23, no. 6, pp. 569–580, 2008.
[28] E. T. Venkatesh and P. Thangaraj, "An improved neural approach for malignant and normal colon tissue classification from oligonucleotide arrays," European Journal of Scientific Research, vol. 54, no. 1, pp. 159–164, 2011.
[29] A. Kulkarni, B. N. Kumar, V. Ravi, and U. S. Murthy, "Colon cancer prediction with genetics profiles using evolutionary techniques," Expert Systems with Applications, vol. 38, no. 3, pp. 2752–2757, 2011.
[30] K. Lee, Z. Man, D. Wang, and Z. Cao, "Classification of bioinformatics dataset using finite impulse response extreme learning machine for cancer diagnosis," Neural Computing and Applications, vol. 22, no. 3-4, pp. 457–468, 2013.
[31] M. Tong, K.-H. Liu, C. Xu, and W. Ju, "An ensemble of SVM classifiers based on gene pairs," Computers in Biology and Medicine, vol. 43, no. 6, pp. 729–737, 2013.
[32] M. Bianchini, E. Levy, C. Zucchini, V. Pinski, C. Macagno, P. De Sanctis, L. Valvassori, P. Carinci, and J. Mordoh, "Comparative study of gene expression by cDNA microarray in human colorectal cancer tissues and normal mucosa," International Journal of Oncology, vol. 29, no. 1, p. 83, 2006.
[33] L. Li, C. R. Weinberg, T. A. Darden, and L. G. Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method," Bioinformatics, vol. 17, no. 12, pp. 1131–1142, 2001.
[34] Z. Chen and J. Li, "A multiple kernel support vector machine scheme for simultaneous feature selection and rule-based classification," in Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science. Springer, 2007, vol. 4426, pp. 441–448.
[35] H.-S. Shon, G. Sohn, K. S. Jung, S. Y. Kim, E. J. Cha, and K. H. Ryu, "Gene expression data classification using discrete wavelet transform," in Proceedings of International Conference on Bioinformatics and Computational Biology, 2009, pp. 204–208.
[36] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[37] K. Polat and S. Güneş, "A new feature selection method on classification of medical datasets: Kernel F-score feature selection," Expert Systems with Applications, vol. 36, no. 7, pp. 10367–10373, 2009.
[38] K. Pearson, "On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
[39] V. N. Vapnik, Statistical Learning Theory. Wiley, New York, 1998, vol. 2.
[40] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 2008.
[41] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[42] K.-C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition," Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.
[43] M. Hassan, A. Chaudhary, and A. Khan, "Carotid artery image segmentation using modified spatial fuzzy c-means and ensemble clustering," Computer Methods and Programs in Biomedicine, vol. 108, no. 3, pp. 1261–1276, 2013.
[44] Colon cancer dataset, Kent Ridge. [Online]. Available: http://datam.i2r.a-star.edu.sg/datasets/krbd/ColonTumor/ColonTumor.html
[45] Colon cancer dataset, BioGPS. [Online]. Available: http://biogps.org/dataset/1352/stage-ii-and-stage-iii-colorectal-cancer/
[46] D. A. Notterman, U. Alon, A. J. Sierk, and A. J. Levine, "Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays," Cancer Research, vol. 61, no. 7, pp. 3124–3130, 2001.
[47] L. Marisa, A. de Reyniès, A. Duval, J. Selves, M. P. Gaub, L. Vescovo, M.-C. Etienne-Grimaldi, R. Schiappa, D. Guenot, M. Ayadi et al., "Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value," PLoS Medicine, vol. 10, no. 5, p. e1001453, 2013.
[48] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[49] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[50] S. Yajima, M. Ishii, H. Matsushita, K. Aoyagi, K. Yoshimatsu, H. Kaneko, N. Yamamoto, T. Teramoto, T. Yoshida, Y. Matsumura et al., "Expression profiling of fecal colonocytes for RNA-based screening of colorectal cancer," International Journal of Oncology, vol. 31, no. 5, pp. 1029–1037, 2007.
[51] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790–799, 1995.
[52] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: color image segmentation," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1997, pp. 750–755.
[53] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2000, pp. 142–149.
[54] S. Paisitkriangkrai, C. Shen, and J. Zhang, "Face detection with effective feature extraction," in Computer Vision - ACCV. Springer, 2011, pp. 460–470.
[55] D. Unay and A. Ekin, "Dementia diagnosis using similar and dissimilar retrieval items," in IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 2011, pp. 1889–1892.
[56] L. Song, X. Liu, L. Ma, C. Zhou, X. Zhao, and Y. Zhao, "Using HOG-LBP features and MMP learning to recognize imaging signs of lung lesions," in 25th International Symposium on Computer-Based Medical Systems. IEEE, 2012, pp. 1–4.
[57] L. Meng, L. Li, S. Mei, and W. Wu, "Directional entropy feature for human detection," in Proceedings of 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–4.
[58] K. Lee, C. Y. Choo, H. Q. See, Z. J. Tan, and Y. Lee, "Human detection using histogram of oriented gradients and human body ratio estimation," in Proceedings of 3rd IEEE International Conference on Computer Science and Information Technology, vol. 4. IEEE, 2010, pp. 18–22.
[59] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2005, pp. 886–893.
[60] O. L. Junior, D. Delgado, V. Gonçalves, and U. Nunes, "Trainable classifier-fusion schemes: An application to pedestrian detection," in Proceedings of 12th International IEEE Conference on Intelligent Transportation Systems, 2009.
[61] H.-Y. Yang, X.-Y. Wang, X.-Y. Zhang, and J. Bu, "Color texture segmentation based on image pixel classification," Engineering Applications of Artificial Intelligence, vol. 25, no. 8, pp. 1656–1669, 2012.
[62] A. Sengur, "Color texture classification using wavelet transform and neural network ensembles," Arabian Journal for Science and Engineering, vol. 34, no. 2, pp. 491–493, 2009.
[63] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2002, pp. 462–463.
[64] R. M. Haralick, "Statistical and structural approaches to texture," Proceedings of the IEEE, vol. 67, no. 5, pp. 786–804, 1979.
[65] A. K. Mittra and R. Parekh, "Automated detection of skin diseases using texture features," International Journal of Engineering Science and Technology, vol. 3, no. 6, 2011.
[66] C.-C. Lee, S.-H. Chen, and Y.-C. Chiang, "Classification of liver disease from CT images using a support vector machine," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 11, no. 4, pp. 396–402, 2007.
[67] D. Mitrea, M. Socaciu, R. Badea, and A. Golea, "Texture based characterization and automatic diagnosis of the abdominal tumors from ultrasound images using third order GLCM features," in 4th International Congress on Image and Signal Processing, vol. 3. IEEE, 2011, pp. 1558–1562.
[68] K. Dhanalakshmi and V. Rajamani, "An intelligent mining system for diagnosing medical images using combined texture-histogram features," International Journal of Imaging Systems and Technology, vol. 23, no. 2, pp. 194–203, 2013.
[69] M. Kurzynski and M. Wozniak, "Combining classifiers under probabilistic models: experimental comparative analysis of methods," Expert Systems, vol. 29, no. 4, pp. 374–393, 2012.
[70] J. Kittler and F. Roli, Multiple Classifier Systems. Springer, 2010.
[71] M. Masseroli, A. Bollea, and G. Forloni, "Quantitative morphology and shape classification of neurons by computerized image analysis," Computer Methods and Programs in Biomedicine, vol. 41, no. 2, pp. 89–99, 1993.
[72] D. Welfer, J. Scharcanski, and D. R. Marinho, "Fovea center detection based on the retina anatomy and mathematical morphology," Computer Methods and Programs in Biomedicine, vol. 104, no. 3, pp. 397–409, 2011.
[73] V. Naranjo, R. Lloréns, M. Alcañiz, and F. López-Mir, "Metal artifact reduction in dental CT images using polar mathematical morphology," Computer Methods and Programs in Biomedicine, vol. 102, no. 1, pp. 64–74, 2011.
[74] Y.-M. Li and X.-P. Zeng, "A new strategy for urinary sediment segmentation based on wavelet, morphology and combination method," Computer Methods and Programs in Biomedicine, vol. 84, no. 2, pp. 162–173, 2006.
[75] D. Guru, Y. Sharath, and S. Manjunath, "Texture features and KNN in classification of flower images," International Journal of Computer Applications, pp. 21–29, 2010.
[76] S. G. Mougiakakou, I. Valavanis, K. Nikita, A. Nikita, and D. Kelekis, "Characterization of CT liver lesions based on texture features and a multiple neural network classification scheme," in Proceedings of 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 2. IEEE, 2003, pp. 1287–1290.
[77] M. E. Mavroforakis, H. V. Georgiou, D. Cavouras, N. Dimitropoulos, and S. Theodoridis, "Mammographic mass classification using textural features and descriptive diagnostic data," in Proceedings of 14th International Conference on Digital Signal Processing, vol. 1. IEEE, 2002, pp. 461–464.
[78] F. P. Kuhl and C. R. Giardina, "Elliptic Fourier features of a closed contour," Computer Graphics and Image Processing, vol. 18, no. 3, pp. 236–258, 1982.
[79] T. Taxt and K. W. Bjerde, "Classification of handwritten vector symbols using elliptic Fourier descriptors," in Proceedings of 12th International Conference on Pattern Recognition, vol. 2. IEEE, 1994, pp. 123–128.
[80] L. P. Nicoli and G. C. Anagnostopoulos, "Shape-based recognition of targets in synthetic aperture radar images using elliptical Fourier descriptors," in SPIE Defense and Security Symposium. International Society for Optics and Photonics, 2008, pp. 69670G–69670G.
[81] G. Diaz, D. Quacci, and C. Dell'Orbo, "Recognition of cell surface modulation by elliptic Fourier analysis," Computer Methods and Programs in Biomedicine, vol. 31, no. 1, pp. 57–62, 1990.
[82] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., "A practical guide to support vector classification," 2003.
[83] M. Re, M. Mesiti, and G. Valentini, "A fast ranking algorithm for predicting gene functions in biomolecular networks," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1812–1818, 2012.
[84] J. C. Rajapakse and P. A. Mundra, "Multiclass gene selection using Pareto-fronts," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 87–97, 2013.
[85] N. Littlestone and M. K. Warmuth, "The weighted majority algorithm," Information and Computation, vol. 108, no. 2, pp. 212–261, 1994.
[86] S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub, "Prediction of central nervous system embryonal tumour outcome based on gene expression," Nature, vol. 415, no. 6870, pp. 436–442, 2002.
[87] K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub, "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature Medicine, vol. 8, no. 1, pp. 68–74, 2002.
[88] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[89] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno, "Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma," Cancer Research, vol. 62, pp. 4963–4967, 2002.
[90] D. G. Beer, S. L. Kardia, C.-C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas et al., "Gene-expression profiles predict survival of patients with lung adenocarcinoma," Nature Medicine, vol. 8, no. 8, pp. 816–824, 2002.
[91] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie et al., "Gene expression correlates of clinical prostate cancer behavior," Cancer Cell, vol. 1, no. 2, pp. 203–209, 2002.
[92] R. O. Stuart, W. Wachsman, C. C. Berry, J. Wang-Rodriguez, L. Wasserman, I. Klacansky, D. Masys, K. Arden, S. Goodison, M. McClelland et al., "In silico dissection of cell-type-associated patterns of gene expression in prostate cancer," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 2, pp. 615–620, 2004.
[93] J. B. Welsh, L. M. Sapinoso, A. I. Su, S. G. Kern, J. Wang-Rodriguez, C. A. Moskaluk, H. F. Frierson, and G. M. Hampton, "Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer," Cancer Research, vol. 61, no. 16, pp. 5974–5978, 2001.