A DECISION TREES-BASED CLASSIFICATION MODEL FOR THE SURVIVAL OF CHRONIC MYELOID LEUKAEMIA (CML) PATIENTS
Jeremiah Ademola Balogun1, Peter Adebayo Idowu2, Anthony Oyekunle3
1,2Department of Computer Science and Engineering, 3Department of Hematology and Immunology, Obafemi Awolowo University, Ile-Ife, Nigeria.
ABSTRACT: Chronic Myeloid Leukaemia (CML) is a cancer of the white blood cells of humans and is more common among men than women. The only curative treatment for CML is a bone marrow transplant, but today the best practice for treatment of CML uses Imatinib as a first line of action, which has increased the survival of patients up to 8 years. Hematologists in Nigeria rely on survival models which were proposed using non-African/Nigerian patients' datasets before the Imatinib era. These models have been deemed ineffective at predicting the survival of Nigerian CML patients undergoing Imatinib treatment. This study is aimed at developing a predictive model for the classification of CML patients' survival using decision trees algorithms, and at identifying the variables which are predictive for CML survival. A historical dataset containing information about variables monitored during the follow-up of Imatinib treatment was collected from Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), Ile-Ife, South-western Nigeria. The predictive model for CML was formulated using two decision trees algorithms, C4.5 and CART, and was simulated using the WEKA environment. The performances of the predictive models were evaluated using the datasets collected via 10-fold cross validation. The results showed that there are variations in the variables predictive for CML survival in Nigerians compared to those proposed by earlier models.
Keywords: Chronic Myeloid Leukaemia (CML), survival, classification, machine learning, decision trees algorithms.
1.0 Introduction
Chronic Myeloid Leukaemia (CML) is a type of cancer that targets the blood cells of living organisms [1]. The body is made up of trillions of living cells. Normal cells grow, divide to make new cells, and die in an orderly way. During the early years of a person's life, normal cells divide only to replace worn-out, damaged, or dying cells. Cancer begins when cells in a part of the body start to grow out of control [2].
Different types of cancer exist and they all start because of this out-of-control growth of abnormal cells; instead of dying like normal cells, cancer cells keep on growing and form new cancer cells [3]. Cancers like leukaemia rarely form tumors as other cancers do; rather, the cancer is inside the blood and bone marrow. When the cancer cells get to the bloodstream or lymph vessels, they travel to other parts of the body [4]. There they begin to grow and form new blood cells; these cells are found in the soft, inner part of the bones called bone marrow.
Chronic Myeloid Leukaemia (CML), also known as Chronic Myelogenous Leukaemia, is a fairly slow growing cancer that starts in the bone marrow, affecting the myeloid cells – cells that form blood cells. Most cases of CML occur in adults; it is very rarely found in children, and the treatment of children is the same as for adults.
1.1 Chronic myeloid leukaemia (CML)
Chronic Myeloid Leukaemia (CML) has an annual worldwide incidence of 1/100,000 with a male-female ratio of 1.5:1 [5], and the median age of the disease incidence is about 60 years [6]. In Nigeria and other African countries with similar demographic pattern, the median age of the occurrence of CML is about 38 years [7], [8]. In the United States of America (USA), however, the incidence of CML in the age group under 70 years is higher among African-Americans than among any other racial/ethnic group [9]. It is probable that a combination of environmental and yet unknown biological factors may account for the differential age incidence pattern of CML between Blacks and other races in the USA.
Pediatric CML is rare, accounting for less than 10% of all cases of CML and less than 3% of all pediatric leukaemia [10], [11]. Incidence increases with age, being exceptionally rare in infancy; it is about 0.7 per million/year at ages 1 – 14 years, rising to 1.2 per million/year in adolescents. To date, allogeneic stem cell transplantation (SCT) remains curative, though its role has waned significantly in recent times due to the effectiveness of the Tyrosine Kinase Inhibitors (TKIs) [12], [13]. Although potentially curative, SCT is associated with significant morbidity and mortality [14]. Alpha Interferon-based regimens adequately control the chronic phase of CML but result in few long-term survivors [15]. Advances in targeted therapy resulted in the discovery of Imatinib Mesylate, a selective and competitive inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce both hematologic and cytogenetic remission in a significant proportion of CML patients [4]. A number of prognostic risk scoring models have been developed for CML survival, but using patients that were administered treatments other than Imatinib; the most popular among them are the Sokal and Hasford (EUTOS) models [16]. Sokal's model used patients administered Hydroxyurea, a form of chemotherapy, while the Hasford model used patients administered Interferon alpha, a form of immunotherapy [17], [18].
1.2 CML survival
Survival analysis deals with the application of methods to estimate the likelihood of an event (death, survival, decay, child-birth etc.) occurring over a variable time period [19]; it is also concerned with studying the time between entry to a study and a subsequent event (such as death). The traditional statistical methods applied in the area of survival analysis include the Kaplan-Meier (KM) estimator curve [20] and the Cox proportional hazards (PH) model [21]. Some survival methods estimate the survival parameters for a group of individuals under study parametrically, while others, such as the KM estimator, are non-parametric. The KM estimator allows for an estimation of the proportion of the population who survive a given length of time, while the Cox PH model is a statistical regression technique for exploring the relationship between the explanatory variables and CML survival. Before the advent of Imatinib as a treatment option for CML, the median survival time was 3 – 5 years from the time of diagnosis of the disease [22]. According to a follow-up of 832 patients using Imatinib, an overall survival rate of 95.2% was recorded after 8 years [23], while a 10-year follow-up of 527 patients in Nigeria showed an overall survival rate of 92% and 78% after 2 and 5 years respectively.
1.3 Predictive modeling in healthcare
In the past, the dependency of health service personnel on macro-scale information (tumor, patient's clinical data, population data, environmental data etc.) generally kept the numbers of variables small enough so that standard statistical methods or even a physician's own intuition could be used to predict cancer risks or outcomes. However, with today's high-throughput diagnostic and imaging technologies, there are dozens or hundreds of molecular, cellular and clinical parameters [24]. In such situations, human intuition and standard statistics do not generally work. Rather, an increased reliance on non-traditional, intensively computational approaches such as machine learning is needed. The use of computers and machine learning in disease prediction
and prognosis is part of a growing trend towards personalization [25]. This movement towards predictive medicine is important, not only for patients (in terms of lifestyle and quality-of-life decisions) but also for physicians (in choosing treatment options) as well as health economists and policy planners in implementing large scale cancer prevention or treatment policies.
Predictive research aims at predicting future events or an outcome based on patterns within a set of variables and has become increasingly popular in medical research [26], [27]. Accurate predictive models can inform patients and physicians about the future course of an illness or the risk of developing illness and thereby help guide decisions on screening and/or treatment [26]. There are several important differences between traditional explanatory research and predictive research. Explanatory research typically applies statistical methods to test causal hypotheses using prior theoretical constructs. In contrast, predictive research applies statistical methods and/or machine learning techniques, without preconceived theoretical constructs, to predict future outcomes (e.g. predicting the risk of hospital readmission) [29]. Although predictive models may be used to provide insight into causality or pathophysiology of the outcome, causality is neither a primary aim nor a requirement for variable inclusion [30]. There are several other important issues relating to data management when developing a predictive model, such as dealing with missing data and variable transformation [31], [32]. For a prediction model to be valuable, it must not only have predictive ability in the derivation cohort but must also perform well in a validation cohort [33]. A model's performance may differ substantially between derivation and validation cohorts for several reasons, including over-fitting of the model, missing important predictor variables, and inter-observer variability of predictors leading to measurement error [34].
1.4 Machine learning
Machine learning is a branch of artificial intelligence that allows computers to learn from past examples of data records [35], [24]. Machine learning does not rely on prior hypotheses as explanatory statistical modeling does [36]. Machine learning has found great importance in the area of predictive modeling [19] in medical research, especially in the area of cancer risk assessment, survival and recurrence [28]. Machine learning can be broadly classified into two (2): supervised and unsupervised learning.
The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features [37]. The resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class label is unknown [38]. There are two variations of supervised learning:
a. Regression (or Prediction/Forecasting) – the class label is represented by a continuous variable (e.g. numeric data); and
b. Classification – the class label is represented by discrete values (e.g. categorical and/or nominal data).
Unsupervised machine learning algorithms perform learning tasks used for inferring a function to describe hidden structure from unlabeled data, that is, data without a target class [39]. The goal of unsupervised machine learning is to identify different examples that belong to the same group/cluster based on underlying characteristics that are common among attributes of members of the same cluster or group [40], [41], [42]. Examples of unsupervised machine learning algorithms include:
a. Clustering;
b. Maximum likelihood estimation;
c. Feature selection;
d. Association analysis etc.
In this study, supervised machine learning algorithms were used to develop the predictive model for the classification of the 2-year survival of CML patients in Nigeria. Decision trees are structural and hierarchical with respect to the variables selected in forming the decision tree, from the root/parent node through the successive child nodes along edges (values of nodes) to the leaf node, where the target class is assigned, using a top-down approach [35], [29]. Decision tree development uses two criteria during tree construction, namely: a test condition to determine how the records should be split (by specifying the test condition and a measure for evaluating the goodness of the test condition) and a stopping condition to determine when the splitting procedure should stop [43], [44], [45]. Such decision trees algorithms include: ID3 (Iterative Dichotomiser 3), C4.5 (an extension of ID3) [35], CART (Classification and Regression Trees) [29], CHAID (Chi-squared Automatic Interaction Detector) [45], MARS etc.
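The measure for evaluating the goodness of a test condition is, in the usual textbook descriptions of these algorithms, an impurity measure: C4.5 is generally described as using entropy-based information gain (normalised as a gain ratio) and CART as using the Gini index. The Python sketch below computes entropy, Gini impurity and the information gain of one candidate split. The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the study's dataset or from WEKA.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy example (made-up labels, not study data): a node holding 4 "Survived" and
# 4 "Not Survived" records, split into two child nodes by some test condition.
parent = ["Survived"] * 4 + ["Not Survived"] * 4
left = ["Survived"] * 3 + ["Not Survived"] * 1
right = ["Survived"] * 1 + ["Not Survived"] * 3

print(entropy(parent))                          # entropy of an evenly mixed node
print(gini(parent))                             # Gini impurity of the same node
print(information_gain(parent, [left, right]))  # positive when the split reduces impurity
```

A tree-growing procedure would evaluate such a measure for every candidate test condition at a node and keep the split with the best score, stopping when the stopping condition is met.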
Chronic Myeloid Leukaemia (CML) is a very serious disease affecting Nigerians, with just one government referral hospital in Nigeria which administers Imatinib treatment, but with a limited number of experts compared to the number of cases attended to. In Nigeria, hematologists rely on scoring models proposed using datasets belonging to Caucasian (white race) and/or non-African CML patients undergoing treatment before the Imatinib era (e.g. Sokal used busulphan or hydroxyurea, Hasford used Interferon Alfa, and the European Treatment and Outcome Study). These models have been deemed ineffective on Nigerian CML patients who are undergoing Imatinib treatment and, as such, there is presently no existing predictive model in Nigeria specifically for the survival of CML patients undergoing Imatinib treatment. There is a need for a predictive model which will aid clinical decisions concerning continual treatment or alternative action affecting the survival of CML patients receiving Imatinib treatment. Two decision trees algorithms were selected for this study based on their numerous applications in medical data mining: C4.5 and the Classification and Regression Trees (CART) algorithms.
2.0 Related Work
There have been a number of related works in the literature on the application of machine learning algorithms to the development of predictive models for the classification of medical data. A number of such works are presented in the following paragraphs.
Patel and Rana [46] performed a survey of decision trees algorithms used for classification, which included algorithms like ID3, C4.5, C5.0 and CART. The study compared the algorithms in terms of their speed, pruning capabilities, boosting and ability to handle missing data for categorical, continuous and nominal data, alongside their applicability in the areas of business, intrusion detection, E-commerce, medicine, remote sensing and web applications. The study also revealed the application of decision trees algorithms on datasets that included heart disease, image segmentation, breast cancer and protein secondary structure. The study concluded that the performance of decision trees algorithms is dependent on entropy, information gain and the features of the datasets considered for classification.
Shajahaan et al. [47] applied decision trees algorithms to the development of a predictive model for the risk assessment of breast cancer. The dataset used for the study was the Wisconsin Breast Cancer dataset from the University of Wisconsin Hospitals, Madison, and contained 169 instances with 10 attributes including the target class, which was described as either malignant or benign. The missing attributes were replaced using linear interpolation for data preprocessing using the Statistical Package for the Social Sciences (SPSS), while TANAGRA and WEKA software were used to model breast cancer risk from the dataset. C4.5, ID3, Random Tree and CART decision trees algorithms were used to formulate the model and then compared for the most effective. The Random Trees algorithm outperformed the other algorithms with an accuracy of 100%, followed by C4.5 and ID3 with accuracies of 95.57% and 92.99% respectively.
Ahmad et al. [48] performed a comparative analysis of three machine learning algorithms for the prediction of breast cancer recurrence using a dataset collected from the Iranian Center for Breast Cancer (ICBC) from 1997 to 2008. The dataset used contained 1189 instances of records, 22 predictor variables (input variables) and one outcome variable (target class). The predictive model was developed using three machine learning algorithms, namely: C4.5 decision trees, support vector machines (SVM) and artificial neural network (ANN). The results showed that the SVM algorithm outperformed the other algorithms with an accuracy of 95.7%, while decision trees and ANN had accuracies of 93.6% and 94.7% respectively. The study suggests further variables to be considered in analyzing the recurrence of breast cancer and a longer follow-up period in order to further increase the performance of the machine learning algorithms.
3.0 Materials and Methods
In this section, the methodological approach used in the development of the predictive model for the classification of CML patients' survival is discussed. The section includes the data identification and collection methods, the decision trees used for model formulation and the simulation environment, alongside the performance metrics used in evaluating the performance of the decision trees algorithms proposed. Figure 1 shows the methodological approach applied in this study.
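The abstract states that the models were evaluated by 10-fold cross validation. As a rough illustration of that protocol (not of WEKA's own implementation), the Python sketch below deals record indices into k folds and averages a classifier's accuracy over the held-out folds. The helper names, the majority-class stand-in classifier and the toy labels are all assumptions introduced here.

```python
import random
from collections import Counter


def k_fold_indices(n_records, k=10, seed=0):
    """Shuffle record indices and deal them into k nearly equal folds (needs n_records >= k)."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, train_and_predict, k=10, seed=0):
    """Mean held-out accuracy over k folds.

    train_and_predict(train_idx, test_idx) must return one predicted label per
    test index; it stands in for fitting C4.5 or CART on the training part.
    """
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predicted = train_and_predict(train_idx, test_idx)
        correct = sum(p == labels[j] for p, j in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Toy labels (made up, not study data): 30 "Survived" and 10 "Not Survived" records.
labels = ["Survived"] * 30 + ["Not Survived"] * 10


def majority_class(train_idx, test_idx):
    """Placeholder classifier: always predicts the most common training label."""
    most_common = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
    return [most_common] * len(test_idx)


mean_accuracy = cross_validate(labels, majority_class, k=10)
print(round(mean_accuracy, 3))
```

Each record is used for testing exactly once and for training in the other k - 1 rounds, which is why the averaged figure is less optimistic than accuracy measured on the training data itself.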
3.1 Data identification and collection
Following successive expert interviews with Hematologists at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), the variables monitored for CML survival while patients received Imatinib treatment were identified and compared to those observed in literature. The variables identified include: sex, time from diagnosis of CML to start of Imatinib treatment, age of patient, packed cell volume (PCV), white blood cell (WBC) count, percentage of blasts, basophil and eosinophil, spleen and liver size, disease phase at diagnosis, vital status of the patient (dead or alive) and the survival time from the date of diagnosis.
Figure 1: Methodological approach for CML survival classification
Table 1 gives a description of the variables identified alongside their respective labels used in identifying them.
Table 1: Variables monitored during Imatinib treatment
Type | Name | Unit of Measure | Labels
INPUT | Time to treatment from date of diagnosis | Months | Numeric
INPUT | Age of Patient | Months | Numeric
INPUT | Sex | Nil | Male, Female
INPUT | Spleen Size (below costal margin) | cm | Numeric
INPUT | Liver Size | cm | Numeric
INPUT | Packed Cell Volume (PCV) | % | Numeric
INPUT | White Blood Cell (WBC) count | | Numeric
INPUT | Platelet count | | Numeric
INPUT | Basophil | % | Numeric
INPUT | Eosinophil | % | Numeric
INPUT | Disease phase at diagnosis | Nil | CP, AP, BP
INPUT | Survival Time | Days | Numeric
INPUT | Vital Status | Nil | Alive, Dead
OUTPUT | CML Survival Class | Nil | Survived, Not Survived, Censored
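Table 1 lists the output class (Survived, Not Survived, Censored) alongside the two recorded variables it is derived from, vital status and survival time. The Python sketch below is one hedged reading of that derivation, using the 728-day threshold the paper adopts for 2-year survival. Only the first branch (a survival time of at least 728 days gives "Survived") is taken from the paper's pseudocode as it appears here; the handling of shorter survival times, and the function and constant names, are assumptions made for illustration.

```python
THRESHOLD_DAYS = 728  # 2-year threshold as stated in the paper


def survival_class(survival_time_days, vital_status):
    """Assign a CML survival class from survival time (days) and vital status.

    Branch 1 follows the paper's pseudocode. Branches 2 and 3 are ASSUMED:
    a patient who died before the threshold is "Not Survived", and a patient
    still alive but followed for less than the threshold is "Censored".
    """
    if vital_status not in ("Alive", "Dead"):
        raise ValueError(f"unexpected vital status: {vital_status!r}")
    if survival_time_days >= THRESHOLD_DAYS:
        return "Survived"
    if vital_status == "Dead":
        return "Not Survived"  # assumption, not from the source
    return "Censored"  # assumption, not from the source


print(survival_class(900, "Alive"))  # Survived
print(survival_class(300, "Dead"))   # Not Survived (under the assumed rule)
print(survival_class(300, "Alive"))  # Censored (under the assumed rule)
```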
For the classification of the survival of CML patients considered for this study, the vital status and the survival time of the patients were used to classify the dataset using a threshold of 2 years (728 days) for CML survival in the dataset collected for this study.
i.e. they did not spend as much time as
//algorithm used in defining the CML survival class using vital status and survival time
if (survival time >= 728) then
    Survival Class = "Survived"
Else if ((survival time