
A DECISION TREES-BASED CLASSIFICATION MODEL FOR THE SURVIVAL OF CHRONIC MYELOID LEUKAEMIA (CML) PATIENTS

Jeremiah Ademola Balogun1, Peter Adebayo Idowu2, Anthony Oyekunle3

1,2Department of Computer Science and Engineering, 3Department of Hematology and Immunology, Obafemi Awolowo University, Ile-Ife, Nigeria.

[email protected], [email protected], [email protected]

ABSTRACT: Chronic Myeloid Leukaemia (CML) is a cancer of the white blood cells of humans and is more common among men than women. The only curative treatment for CML is a bone marrow transplant, but today the best practice for treatment of CML uses Imatinib as a first line of action, which has increased the survival of patients up to 8 years. Hematologists in Nigeria rely on survival models which were proposed using datasets of non-African/non-Nigerian patients before the Imatinib era. These models have been deemed ineffective at predicting the survival of Nigerian CML patients undergoing Imatinib treatment. This study is aimed at developing a predictive model for the classification of CML patients' survival using decision tree algorithms, and at identifying the variables which are predictive of CML survival. A historical dataset containing information about variables monitored during the follow-up of Imatinib treatment was collected from Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), Ile-Ife, South-western Nigeria. The predictive model for CML was formulated using two decision tree algorithms, C4.5 and CART, and was simulated using the WEKA environment. The performance of the predictive models was evaluated on the dataset collected via 10-fold cross validation. The results showed that there are variations in the variables predictive of CML survival in Nigerians compared to those proposed by earlier models.

Keywords: Chronic Myeloid Leukaemia (CML), survival, classification, machine learning, decision trees algorithms.

1.0 Introduction

Chronic Myeloid Leukaemia (CML) is a type of cancer that targets the blood cells of living organisms [1]. The body is made up of trillions of living cells. Normal cells grow, divide to make new cells, and die in an orderly way. During the early years of a person's life, normal cells divide only to replace worn-out, damaged, or dying cells. Cancer begins when cells in a part of a body start to grow out of control [2].

Different types of cancer exist and they all start because of this out-of-control growth of abnormal cells; instead of dying like normal cells, cancer cells keep on growing and form new cancer cells [3]. Cancers like leukaemia rarely form tumors like other cancers; rather, they are found inside the blood and bone marrow. When the cancer cells get to the bloodstream or lymph vessels, they travel to other parts of the body [4]. There they begin to grow and form new blood cells; these cells are found in the soft, inner part of the bones called bone marrow.

Chronic Myeloid Leukaemia (CML), also known as Chronic Myelogenous Leukaemia, is a fairly slow-growing cancer that starts in the bone marrow and affects the myeloid cells, the cells that form blood cells. Most cases of CML occur in adults; it is very rarely found in children, and the treatment of children is the same as for adults.

1.1 Chronic myeloid leukaemia (CML)

Chronic Myeloid Leukaemia (CML) has an annual worldwide incidence of 1/100,000 with a male-female ratio of 1.5:1 [5], and the median age of disease incidence is about 60 years [6]. In Nigeria and other African countries with a similar demographic pattern, the median age of occurrence of CML is about 38 years [7], [8]. In the United States of America (USA), however, the incidence of CML in the age group under 70 years is higher among African-Americans than among any other racial/ethnic group [9]. It is probable that a combination of environmental and yet unknown biological factors may account for the differential age incidence pattern of CML between Blacks and other races in the USA.

Pediatric CML is rare, accounting for less than 10% of all cases of CML and less than 3% of all pediatric leukaemia [10], [11]. Incidence increases with age, being exceptionally rare in infancy; it is about 0.7 per million/year at ages 1 to 14 years, rising to 1.2 per million/year in adolescents. To date, allogeneic stem cell transplantation (SCT) remains curative, though its role has waned significantly in recent times due to the effectiveness of the Tyrosine Kinase Inhibitors (TKIs) [12], [13]. Although potentially curative, SCT is associated with significant morbidity and mortality [14]. Alpha Interferon-based regimens adequately control the chronic phase of CML but result in few long-term survivors [15]. Advances in targeted therapy resulted in the discovery of Imatinib Mesylate, a selective and competitive inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce both hematologic and cytogenetic remission in a significant proportion of

CML patients [4]. A number of prognostic risk scoring models have been developed for CML survival, but using patients who were administered treatments other than Imatinib; the most popular among them are the Sokal and Hasford (EUTOS) models [16]. Sokal's model used patients administered Hydroxyurea, a form of chemotherapy, while the Hasford model used patients administered Interferon alpha, a form of immunotherapy [17], [18].

1.2 CML survival

Survival analysis deals with the application of methods to estimate the likelihood of an event (death, survival, decay, child-birth etc.) occurring over a variable time period [19]; it is also concerned with studying the time between entry to a study and a subsequent event (such as death). The traditional statistical methods applied in the area of survival analysis include the Kaplan-Meier (KM) estimator curve [20] and the Cox proportional hazards (PH) model [21]. The KM estimator is a non-parametric method and the Cox PH model is semi-parametric, while other methods estimate survival parameters for the group of individuals under study parametrically. The KM estimator allows for an estimation of the proportion of the population who survive a given length of time, while the Cox PH model is a regression technique for exploring the relationship between the explanatory variables and CML survival. Before the advent of Imatinib as a treatment option for CML, the median survival time was 3 to 5 years from the time of diagnosis of the disease [22]. According to a follow-up of 832 patients using Imatinib, an overall survival rate of 95.2% was recorded after 8 years [23], while a 10-year follow-up of 527 patients in Nigeria showed an overall survival rate of 92% and 78% after 2 and 5 years respectively.

1.3 Predictive modeling in healthcare

In the past, the dependency of health service personnel on micro-scale information (tumor, patient's clinical data, population data, environmental data etc.) generally kept the number of variables small enough that standard statistical methods or even a physician's own intuition could be used to predict cancer risks or outcomes. However, with today's high-throughput diagnostic and imaging technologies, there are dozens or hundreds of molecular, cellular and clinical parameters [24]. In such situations, human intuition and standard statistics do not generally work. Rather, an increased reliance on non-traditional, intensively computational approaches such as machine learning is needed. The use of computers and machine learning in disease prediction

and prognosis is part of a growing trend towards personalization [25]. This movement towards predictive medicine is important, not only for patients (in terms of lifestyle and quality-of-life decisions) but also for physicians (in choosing treatment options) as well as health economists and policy planners in implementing large-scale cancer prevention or treatment policies. Predictive research aims at predicting future events or an outcome based on patterns within a set of variables and has become increasingly popular in medical research [26], [27]. Accurate predictive models can inform patients and physicians about the future course of an illness or the risk of developing illness and thereby help guide decisions on screening and/or treatment [26]. There are several important differences between traditional explanatory research and predictive research. Explanatory research typically applies statistical methods to test causal hypotheses using prior theoretical constructs. In contrast, predictive research applies statistical methods and/or machine learning techniques, without preconceived theoretical constructs, to predict future outcomes (e.g. predicting the risk of hospital readmission) [29]. Although predictive models may be used to provide insight into the causality of the pathophysiology of the outcome, causality is neither a primary aim nor a requirement for variable inclusion [30]. There are several other important issues relating to data management when developing a predictive model, such as dealing with missing data and variable transformation [31], [32]. For a prediction model to be valuable, it must not only have predictive ability in the derivation cohort but must also perform well in a validation cohort [33]. A model's performance may differ substantially between derivation and validation cohorts for several reasons, including over-fitting of the model, missing important predictor variables, and inter-observer variability of predictors leading to measurement error [34].

1.4 Machine learning

Machine learning is a branch of artificial intelligence that allows computers to learn from past examples of data records [35], [24]. Machine learning does not rely on prior hypotheses as explanatory statistical modeling does [36]. Machine learning has found great importance in the area of predictive modeling [19] in medical research, especially in the assessment of cancer risk, survival and recurrence [28]. Machine learning can be broadly classified into two (2): supervised and unsupervised learning.

The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features [37]. The resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class label is unknown [38]. There are two variations of supervised learning:

a. Regression (or Prediction/Forecasting): the class label is represented by a continuous variable (e.g. numeric data); and
b. Classification: the class label is represented by discrete values (e.g. categorical and/or nominal data).

Unsupervised machine learning algorithms perform learning tasks used for inferring a function to describe hidden structure from unlabeled data, that is, data without a target class [39]. The goal of unsupervised machine learning is to identify examples that belong to the same group or cluster based on underlying characteristics that are common among the attributes of members of the same cluster or group [40], [41], [42]. Examples of unsupervised machine learning algorithms include:

a. Clustering;
b. Maximum likelihood estimation;
c. Feature selection;
d. Association analysis etc.

In this study, supervised machine learning algorithms were used to develop the predictive model for the classification of the 2-year survival of CML patients in Nigeria. Decision trees are hierarchical structures built from the selected variables: starting at the root (parent) node, records pass through successive child nodes along edges (the values of the node's variable) until they reach a leaf node, where the target class is assigned, using a top-down approach [35], [29]. Decision tree development uses two criteria during tree construction, namely: a test condition to determine how the records should be split (by specifying the test condition and a measure for evaluating the goodness of the test condition) and a stopping condition to determine when the splitting procedure should stop [43], [44], [45]. Such decision tree algorithms include: ID3 (Iterative Dichotomiser 3), C4.5 (an extension of ID3) [35], CART (Classification and Regression Trees) [29], CHAID (Chi-squared Automatic Interaction Detector) [45], MARS etc.
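To make the idea of a measure for evaluating the goodness of a test condition more concrete, the following is a minimal sketch in Python of two commonly used impurity measures: entropy-based information gain (the criterion used by ID3, which C4.5 refines into a gain ratio) and the Gini index (commonly associated with CART). The function names and the toy class labels are illustrative assumptions and are not taken from the study or from WEKA, whose implementations may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_quality(parent, children, impurity):
    """Impurity reduction obtained by splitting `parent` into `children`.

    With `impurity=entropy` this is the information gain; with
    `impurity=gini` it is the reduction in Gini impurity.
    """
    total = len(parent)
    weighted = sum(len(child) / total * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical toy labels for illustration only (not data from the study).
parent = ["Survived"] * 6 + ["Not Survived"] * 4
left = ["Survived"] * 5 + ["Not Survived"] * 1
right = ["Survived"] * 1 + ["Not Survived"] * 3

info_gain = split_quality(parent, [left, right], entropy)
gini_gain = split_quality(parent, [left, right], gini)
```

A tree-building procedure would typically evaluate such a score for every candidate test condition at a node and keep the one with the largest impurity reduction, stopping when no candidate improves the score or when another stopping condition is met.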

Chronic Myeloid Leukaemia (CML) is a very serious disease affecting Nigerians, with just one government referral hospital in Nigeria administering Imatinib treatment and a limited number of experts compared to the number of cases attended to. In Nigeria, hematologists rely on scoring models proposed using datasets belonging to Caucasian (white race) and/or non-African CML patients undergoing treatment before the Imatinib era (e.g. Sokal used busulphan or hydroxyurea, Hasford used Interferon Alfa, and European Treatment and Outcome Study). These models have been deemed ineffective on Nigerian CML patients who are undergoing Imatinib treatment, and as such there is presently no existing predictive model in Nigeria specifically for the survival of CML patients undergoing Imatinib treatment. There is a need for a predictive model which will aid clinical decisions concerning continued treatment or alternative action affecting the survival of CML patients receiving Imatinib treatment. Two decision tree algorithms were selected for this study based on their numerous applications in medical data mining: C4.5 and the Classification and Regression Trees (CART) algorithms.

2.0 Related Work

There have been a number of related works in the area of the application of machine learning algorithms to the development of predictive models for the classification of medical data. A number of such works are presented in the following paragraphs.

Patel and Rana [46] performed a survey of decision tree algorithms used for classification, which included algorithms like ID3, C4.5, C5.0 and CART. The study compared the algorithms in terms of their speed, pruning capabilities, boosting and ability to handle missing data for categorical, continuous and nominal data, alongside their applicability in the areas of business, intrusion detection, E-commerce, medicine, remote sensing and web applications. The study also reviewed the application of decision tree algorithms on datasets that included: heart disease, image segmentation, breast cancer, and protein secondary structure. The study concluded that the performance of decision tree algorithms is dependent on entropy, information gain and the features of the datasets considered for classification.

Shajahaan et al. [47] applied decision tree algorithms to the development of a predictive model for the risk assessment of breast cancer. The dataset used for the study was the Wisconsin Breast Cancer dataset, University of Wisconsin Hospitals, Madison, and contained 169 instances with 10 attributes including the target class, which was described as either malignant or benign. Missing attribute values were replaced by linear interpolation during data preprocessing using the Statistical Package for the Social Sciences

(SPSS), while TANAGRA and WEKA software were used to model breast cancer risk from the dataset. C4.5, ID3, Random Tree and CART decision tree algorithms were used to formulate the model and were then compared to find the most effective. The Random Tree algorithm outperformed the other algorithms with an accuracy of 100%, followed by C4.5 and ID3 with accuracies of 95.57% and 92.99% respectively.

Ahmad et al. [48] performed a comparative analysis of three machine learning algorithms for the prediction of breast cancer recurrence using a dataset collected from the Iranian Center for Breast Cancer (ICBC) from 1997 to 2008. The dataset contained 1189 records, 22 predictor variables (input variables) and one outcome variable (target class). The predictive model was developed using three machine learning algorithms, namely: C4.5 decision trees, support vector machines (SVM) and artificial neural network (ANN). The results showed that the SVM algorithm outperformed the other algorithms with an accuracy of 95.7%, while decision trees and ANN had accuracies of 93.6% and 94.7% respectively. The study suggested further variables to be considered in analyzing the recurrence of breast cancer and a longer follow-up period in order to further increase the performance of the machine learning algorithms.

3.0 Materials and Methods

In this section, the methodological approach used in the development of the predictive model for the classification of CML patients' survival is discussed. The section covers the data identification and collection methods, the decision tree algorithms used for model formulation, and the simulation environment, alongside the performance metrics used in evaluating the decision tree algorithms proposed. Figure 1 shows the methodological approach applied in this study.

3.1 Data identification and collection

Following successive interviews with expert hematologists at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), the variables monitored for CML survival while patients received Imatinib treatment were identified and compared to those observed in literature. The variables identified include: sex, time from diagnosis of CML to start of Imatinib treatment, age of patient, packed cell volume (PCV), white blood cell (WBC) count, percentage of blasts, basophil and eosinophil, spleen and liver size, disease phase at diagnosis, vital status of the patient (dead or alive) and the survival time from the date of diagnosis.

Figure 1: Methodological approach for CML survival classification

Table 1 gives a description of the variables identified alongside their respective labels used in identifying them.

Table 1: Variables monitored during Imatinib treatment

| Type   | Name                                    | Unit of Measure | Labels                            |
|--------|-----------------------------------------|-----------------|-----------------------------------|
| INPUT  | Time to treatment from date of diagnosis | Months          | Numeric                           |
| INPUT  | Age of Patient                          | Months          | Numeric                           |
| INPUT  | Sex                                     | Nil             | Male, Female                      |
| INPUT  | Spleen Size (below costal margin)       | cm              | Numeric                           |
| INPUT  | Liver Size                              | cm              | Numeric                           |
| INPUT  | Packed Cell Volume (PCV)                | %               | Numeric                           |
| INPUT  | White Blood Cell (WBC) count            |                 | Numeric                           |
| INPUT  | Platelet count                          |                 | Numeric                           |
| INPUT  | Basophil                                | %               | Numeric                           |
| INPUT  | Eosinophil                              | %               | Numeric                           |
| INPUT  | Disease phase at diagnosis              | Nil             | CP, AP, BP                        |
| INPUT  | Survival Time                           | Days            | Numeric                           |
| INPUT  | Vital Status                            | Nil             | Alive, Dead                       |
| OUTPUT | CML Survival Class                      | Nil             | Survived, Not Survived, Censored  |
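As an illustration of how the output class in Table 1 could be derived from the Vital Status and Survival Time variables, a minimal Python sketch is given here. Only the 2-year (728-day) threshold and the rule that patients reaching it are labelled "Survived" come from the text; the record type `FollowUp`, the function name, and in particular the rules that separate "Not Survived" from "Censored" are assumptions made for illustration and may not match the rule used in the study.

```python
from dataclasses import dataclass

# 2-year threshold in days, as stated in the text.
THRESHOLD_DAYS = 728


@dataclass
class FollowUp:
    """Hypothetical follow-up record holding the two labelling variables."""
    survival_time_days: int
    vital_status: str  # "Alive" or "Dead", as listed in Table 1


def survival_class(record: FollowUp) -> str:
    """Assign a CML survival class label to one follow-up record."""
    # From the text: patients reaching the threshold are "Survived".
    if record.survival_time_days >= THRESHOLD_DAYS:
        return "Survived"
    # Assumption: a death before the threshold counts as "Not Survived".
    if record.vital_status == "Dead":
        return "Not Survived"
    # Assumption: alive but followed for less than the threshold is "Censored".
    return "Censored"
```

For example, under these assumptions `survival_class(FollowUp(100, "Alive"))` would be labelled "Censored", because the patient has not yet been followed for the full 728 days.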

For the classification of the survival of CML patients considered for this study, the vital status and the survival time of the patients were used to classify the dataset

i.e. they did not spend as much time as the threshold of 2 years (728 days) for CML survival in the dataset collected for this study.

//algorithm used in defining the CML
//survival class using vital status and
//survival time
if(survival time >= 728) then
    Survival Class = "Survived"
Else if((survival time