
A Learning Model for the Automated Assessment of Hand-Drawn Images for Visuo-Spatial Neglect Rehabilitation

Yiqing Liang, Michael C. Fairhurst, Richard M. Guest, and Jonathan M. Potter

Abstract—Visuo-spatial neglect (often simply referred to as "neglect") is a complex poststroke medical syndrome which may be assessed by means of a series of drawing-based tests. Based on a novel analysis of a test battery formed from established pencil-and-paper tests, the aim of this study is to develop an automated assessment system which enables objectivity, repeatability, and diagnostic capability in the scoring process. Furthermore, the novel assessment system encapsulates temporal sequence and other "dynamic" information inherent in the drawing process. Several approaches are introduced in this paper and the results compared. The optimal model is shown to produce significant agreement with the score for drawing-related components of the Rivermead Behavioural Inattention Test, the widely accepted standardised clinical test for the diagnosis of neglect, and, more importantly, to encapsulate data to enable an enhanced test resolution with a reduction in battery size.

Index Terms—Computer-based hand-drawn image analysis, computer-aided diagnosis, stroke patients' rehabilitation, visuo-spatial neglect.

I. INTRODUCTION

VISUO-SPATIAL neglect (designated simply "neglect" subsequently in this paper) is a complex medical condition often following stroke, characterized by patients' failure to respond to, or orient toward, stimuli on one side of their visual field, usually contralateral to the lesion causing the condition [3]. Neglect is also regarded as an important predictor of poor outcome of rehabilitation after stroke [4]. Clinicians use a range of standard methods for the diagnosis and assessment of neglect, including a series of "pencil-and-paper" based tests (involving tasks such as target cancellation, line midpoint bisection, and geometric shape copying). A particular test battery, the Rivermead Behavioural Inattention Test (BIT) [4], [5], which consists of six of these pencil-and-paper tests (referred to as the BIT Conventional subset), alongside nine behavioral tests assessing everyday activities (referred to as the BIT Behavioral subset), is currently accepted as the clinical standard for neglect assessment.


Manuscript received August 17, 2009; revised January 13, 2010; accepted February 15, 2010. First published April 12, 2010; current version published October 08, 2010. This work was supported by the East Kent Hospitals NHS Trust Charitable Fund. Y. Liang, M. C. Fairhurst, and R. M. Guest are with the School of Engineering and Digital Arts, University of Kent, CT2 7NT Kent, U.K. (e-mail: [email protected]). J. M. Potter is with Kent and Canterbury Hospital, CT1 3NG Canterbury, U.K. Digital Object Identifier 10.1109/TNSRE.2010.2047605

The six pencil-and-paper tests forming the BIT Conventional subset are line crossing, letter cancellation, star cancellation (all involving the location and cancellation of a target, possibly amongst distracter targets), figure and shape copying (copying of simple geometric or representational shapes), line bisection (location and marking of a printed line midpoint), and representational drawing (specified freehand drawings made without a model). Within the figure and shape copying testing regime, the subject is instructed to draw six images, and in the representational drawing task three images. Hence, the BIT Conventional subset can be considered to comprise 13 individual drawing tasks. Formally, scoring of the test is carried out according to defined rules, resulting in a total score in the BIT Conventional subset ranging from 0 to 146, with a cutoff value of 130, below which the test outcome is taken to be indicative of the existence of neglect symptoms. For a patient who scores above 130, the scores of each individual task are examined. A score below an assigned cutoff value for an individual task also indicates the existence of neglect symptoms, in which case the patient will be asked to perform the Behavioral subset of the BIT.

Examples of two subtests within the BIT Conventional subset, the line-crossing and the line-bisection tests, as completed by a neglect patient with a right-hemisphere stroke, are given in Fig. 1(a) and (b), respectively. As can be seen in Fig. 1(a), the patient failed to cancel the targets on the left side of the test overlay, while in Fig. 1(b) the identified midpoint is biased to the right because the patient was unaware of part of the left side of his visual field.

A patient's performance in the BIT Conventional subset is traditionally assessed by direct observation by trained clinicians, using a series of performance criteria based on the "static" outcome of the drawn image produced as a response to the task. In addition to the possible subjectivity that has been found in such assessment [6], the time required for completion of multiple drawing tasks can induce considerable fatigue in patients and requires significant clinical resources. Furthermore, temporal/constructional (so-called "dynamic") information inherently embedded in the test response (which is not routinely analyzed in current assessment techniques, aside from a few studies using videoed test responses [5]) has been shown to be very useful for the assessment of neglect and other forms of neuropsychological disorder in terms of the accuracy and resolution of a diagnosis [6]–[8]. This dynamic assessment is the key motivation in the development of the methodology described in this paper. It is our aim to establish a computer-based assessment system for neglect patients which should not only


TABLE I COMPUTER-BASED TEST BATTERY

Fig. 1. Examples of neglect patients' responses to two pencil-and-paper tests. (a) The line-crossing subtest from the BIT. (b) The line-bisection subtest from the BIT.

demonstrate objectivity, efficiency, and accuracy, but also enable further research on neglect utilizing dynamic assessment techniques.

In addition to the BIT battery, there are a number of other standalone pencil-and-paper tests for the assessment of neglect which use similar methods to those tests forming the BIT Conventional subset: the Mesulam [7], Bells [8], and OX [9] tests, for example, are all cancellation tests with various forms of distracters. In assessments of the diagnostic power of these tests, those with complex distracters have been reported to be more sensitive in detecting mild neglect symptoms than those with simpler distracters. In the development of our computer-based test system, instead of increasing the complexity of the test overlay in order to achieve a higher sensitivity, we focus on extracting novel dynamic features for the assessment of test performance, and hence the chosen test overlays are all of simple forms within the relevant category. For example, the line cancellation and OX tests take a very simple form, and the figure copying tests use geometric figures such as a square, a cube, a cross, and a star, simpler than most figure copying implementations. Table I shows the 14 test overlays chosen to form the computer-based test battery.

It should be noted that three derivative tasks were also generated from the set of Figure Completion tasks by assessing the performance differences when the subject drew first to the left and then to the right. These comparison derivative tasks are also shown in Table I as FMDD for the "diamond" shape, FMMM for the "man" shape, and FMHH for the "house" shape.

Using the chosen test overlays, we have previously reported a prototype computer-based technique for the capture of test responses and the extraction of performance features encapsulating dynamic information [10], and we have developed a classification/diagnostic prediction system (based on a binary, pass/fail outcome) [11] with an optimal selection of features assessed using a leave-one-out methodology. An approach which utilizes a computer-based technique based on behavioral tasks for assessing neglect is also found in other studies [12], [13], wherein a set of visual target searching tests (a so-called "Starry Night Test") was displayed on a screen and the participants indicated the location of targets with a button click. The test response was assessed using two features measuring reaction time and success rate. In comparison, our work follows the pencil-and-paper test approach (importantly incorporating a pen-based response capture), providing a familiar test administration experience for clinicians and test subjects. More importantly, a range of dynamic measurements are extracted from the test response, which will not only be assessed for the diagnosis of neglect but will also provide opportunities for further clinical research on neglect.

Building upon our previous work, and for the purpose of monitoring rehabilitation progress, a first objective of this study is to provide a finer classification score. Moreover, although the leave-one-out methodology has been suggested for the validation of prediction models using a small dataset [14], it has also been noted [15] that, given a substantially sized dataset, a more convincing evaluation of a selected fitted model should be carried out using disjoint training and testing datasets. Therefore, the work described in this paper will firstly seek to develop a quantitative assessment system using the same set of computer-based pencil-and-paper tests as used in [11], and secondly will evaluate the performance of the system with disjoint training and testing datasets. The computer-based score will be assessed on the basis of its agreement with the same subject's BIT score (as assessed by human experts). Various resolutions of the interpretation of the BIT score will be considered, e.g., the four-point scale [16], [17] and continuous scores, especially those accepted in clinical practice; these are further explained in Section III-D with respect to the measurement of model performance.
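For readers who want to reproduce this style of evaluation, the following is a minimal sketch of fitting a binary logistic regression (BLR) model on one dataset and scoring a completely disjoint one, as the protocol above requires. The synthetic feature matrices, the labels, and the use of scikit-learn are illustrative assumptions and are not part of the original system.

```python
# Minimal sketch of the disjoint train/test protocol, assuming feature
# matrices extracted from the pen-capture files and binary labels
# (1 = neglect, 0 = stroke control). scikit-learn is used purely for
# illustration; the paper does not state which BLR implementation was used.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real datasets:
# 143 training subjects (33 neglect + 110 controls), 46 test subjects.
X_train, y_train = rng.normal(size=(143, 10)), rng.integers(0, 2, 143)
X_test, y_test = rng.normal(size=(46, 10)), rng.integers(0, 2, 46)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Continuous outcome in [0, 1]: probability of neglect symptoms.
p = model.predict_proba(X_test)[:, 1]

# Binary outcome at a chosen threshold, evaluated on unseen data only.
pred = (p >= 0.5).astype(int)
print("test accuracy:", (pred == y_test).mean())
```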

TABLE II DESCRIPTION OF COMPUTER EXTRACTED FEATURES

II. DATA COLLECTION AND FEATURE EXTRACTION

The development and assessment of our computer-based system required the use of datasets of test responses from actual test subjects, including both stroke patients with neglect and stroke patients without neglect (referred to here as "stroke control subjects"). All test subjects donating response data were poststroke patients. In addition to the BIT Conventional subtests, which were performed and assessed following the conventional methodology, subjects were also asked to perform the 14 computer-based tasks. The contents of the computer-based tasks were printed individually on a series of sheets of paper overlaid onto a standard graphics tablet (specifically a WACOM Intuos2 tablet) and completed using a cordless inking pen. The contents of the computer-based tasks are similar to the BIT Conventional subtests. However, unlike in the BIT subtests, where one subtest may consist of one to six figures on a single sheet of paper, there is only a single figure or shape on each of the computer-based overlays. Pen movements during each task execution were sampled at a frequency of 100 Hz and recorded separately in a data file on an attached computer.

The dataset adopted in our previous study [11] is used as the training dataset in this study. This dataset includes 33 neglect patients and 110 stroke control subjects, where group identity was determined by the total score of the BIT Conventional subset administered and scored by specialist clinical staff. For the purposes of estimating the model performance using completely unseen samples, a second separate dataset was collected (using the same protocol as the training dataset) comprising 19 neglect patients and 27 stroke control subjects. Therefore, the coefficients of the learning models were adjusted based on the first dataset, and the performance tested using the second dataset.
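To make the capture format concrete, the sketch below shows one plausible representation of the 100-Hz pen stream and the derivation of two of the low-level measurements named later (TOTAL-TIME and PEN-DIS). The record layout is an assumption; only the 100 Hz sampling rate and the tablet capture are stated in the text.

```python
# Sketch of per-sample pen data at 100 Hz and two low-level features.
# The record layout (x, y, pressure, pen_down) is an assumption for
# illustration; only the 100 Hz sampling rate is stated in the paper.
from dataclasses import dataclass
from typing import List

SAMPLE_PERIOD_S = 0.01  # 100 Hz sampling

@dataclass
class PenSample:
    x: float          # tablet x coordinate
    y: float          # tablet y coordinate
    pressure: float   # pen tip pressure
    pen_down: bool    # True while the pen touches the overlay

def total_time(samples: List[PenSample]) -> float:
    """TOTAL-TIME: first pen contact to final pen lift (seconds)."""
    down = [i for i, s in enumerate(samples) if s.pen_down]
    return (down[-1] - down[0]) * SAMPLE_PERIOD_S if down else 0.0

def pen_distance(samples: List[PenSample]) -> float:
    """PEN-DIS: path length of on-surface pen movement."""
    d = 0.0
    for a, b in zip(samples, samples[1:]):
        if a.pen_down and b.pen_down:
            d += ((b.x - a.x) ** 2 + (b.y - a.y) ** 2) ** 0.5
    return d
```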

A wide range of performance features was automatically extracted from each of the data files. Because some features extracted from the cancellation tasks cannot be obtained from the other tasks (for example, the number of cancelled targets), the 17 tasks (including the derivative tasks) are divided into two subsets, cancellation tasks and drawing tasks (see Table I), and a set of features was defined separately for each of the two subsets. A total of 57 features were extracted from each of the cancellation tasks and 35 from each of the drawing tasks [11]. Hence a maximum total of 639 features were extracted from the responses to the whole computer-based test battery. In our previous work, 223 features individually showed significant performance in distinguishing neglect patients from stroke control subjects, assessed by a significance test of each feature as a predictor in a Binary Logistic Regression (BLR) model for predicting the existence of neglect symptoms. Full details of these features and their performance can be found in [11]. Features utilized in this study are further described in Table II.

III. FEATURE SELECTION METHODOLOGY

The criteria for including a feature in a model may vary from one problem to another. Automated selection methods depending entirely on statistical measurements are a popular choice in engineering studies. Epidemiology studies have, however, suggested that clinically and intuitively relevant variables should all be included in the model, regardless of their "statistical significance" [18]. Both criteria have their own advantages and disadvantages. Automated selection methods, for example, have been reported [19] to show a drawback of producing unstable models, characterized by overfitting the


Fig. 2. Misclassified DMCU drawing.

model parameters according to the characteristics of the training dataset, resulting in much poorer performance when tested with new data, while the second criterion, referred to as theoretical rule-based selection, is associated with unconditional selection bias [20], referring to the average underestimation of the regression coefficients when statistically nonsignificant coefficients are included in forming the model. The approach to forming the BLR models reported in [11] began as an exploratory exercise in utilizing computer-extracted features to predict neglect, and hence an automated variable selection method was chosen. In order to investigate which variable selection method suits the nature of the data specifically collected for this study, both automated and theoretical rule-based methodologies were considered in the work described here. As the relationship between neglect symptoms and the computer-extracted features is yet to be established from a clinical perspective, for the purpose of this study a novel rule-based method for choosing computer-extracted features is developed by means of experimental observations.

A. Experimental Observations

The BLR models formed in our previous study using automated variable selection methods were all constituted from a small subset of computer-extracted features. An important question is: do the remaining features carry any additional information about different aspects of neglect symptoms beyond that which is available from the selected predictors? In order to explore this question further, the BLR models developed in our previous study are now specifically investigated for such a possibility. Taking the test response of DMCU (Table I) as an example, the two features which have shown the most significant performance in classifying the two groups, as ranked by chi-square score, are ON-OFF-R (see Table II for definition) and PEN-DIS. Evidently these features assess different aspects of the task execution. Although the feature ON-OFF-R was selected as a predictor at the first step of the training of the BLR model, PEN-DIS was excluded by the automated selection method. The evidence shown in this study again demonstrates that, without theoretical guidance, the automated selection method can exclude potentially useful features.

In assessing whether any of the remaining features carry additional information about different aspects of neglect symptoms, let us consider another neglect subject, with a BIT score of 15 (out of 146), who was misclassified as a non-neglect subject by seven of the BLR model classifiers. To investigate this case the test responses can be visually inspected. The response to the DMCU task from this subject is illustrated in Fig. 2, with the dimensions of the test overlay as captured by the graphics tablet shown on each axis. The drawing is clearly off-center on both the horizontal and vertical axes. Examining the BLR model for this task, the features selected as predictors of neglect failed to reveal any difference characterizing this drawing from those made by poststroke control subjects; however, the value of the feature X-CENTRE falls far outside the standard range, set as the 95% confidence interval of the responses from the stroke control subjects. The same phenomenon is found in four other patients' test responses to the same task. By using X-CENTRE, these five subjects are readily classified as neglect patients. An examination of the classification performance with respect to individual features, as described in our previous work [11], shows that the feature X-CENTRE extracted from the DMCU task was not significant in accurately classifying the two groups of subjects, although this feature has been reported in previous works as a useful predictor of neglect symptoms [8], [9]. In the dataset available for this study, the performance of this feature, which shows a significant difference between the two subject groups in 8 out of the 15 drawing tasks, confirms the findings in the reported works. This also indicates the possibility that the models previously constructed using the automated selection method are hampered by a failure to represent all sources of variability in the training dataset. Explorations of other misclassified subjects' test responses revealed the same phenomenon: one or more features which can be observed through visual inspection, and which have shown significant performance in discriminating between the neglect patients and the stroke control subjects in one or more tasks, were not included in forming the binary assessment model.

The above example indicates that non-selected features may still encapsulate important diagnostic information concerning aspects of neglect symptoms, and that the automated feature selection method may not result in optimal models being generated. Our current experimentation will therefore investigate a "semi-automatic" feature selection method based on visual observations, alongside the BLR models formed with these features designated as predictors.

B. Rule-Based Feature Selection Method

Ideally, the model predictors should reveal every aspect of neglect symptoms that is available from the entire set of computer-extracted features. Therefore, the method adopted to form the predictors is to first include all available features and then remove those that are considered to be redundant measurements. In revising the model predictors using all available features it is essential to consider the relationship between features, especially when a feature can be calculated directly from several others. Derived features of this kind will be referred to as high-level features, while those they are derived from are referred to as low-level features. The total task execution time (TOTAL-TIME), for example, is defined as the time from the first pen contact on the tablet surface as a drawing is initiated to when the pen is finally removed from the surface on completion of the drawing process. Though simple and straightforward, when compared to other time-related features, such as the execution time when the pen is not in contact with the tablet surface (MOVE-TIME), the execution time when the pen is moving on the surface of the tablet (DRAW-TIME), or the execution time when the pen is paused on the surface of the tablet (PAUSE-TIME), the total execution time, as the sum of these features, is considered a high-level feature. Another example of a high-level feature is the pen velocity, defined as the pen travel distance divided by the task execution time. In such cases, either the high-level feature or the set of low-level features from which it was derived remains as a predictor; the selection criteria are discussed below.

1) Predictor for the Drawing Tasks: As described above, some features represent either the scalar sum or the vector sum of other features. In such cases, in our revised model formation only these high-level features are included in forming the predictor. It is also observed that features related to velocity and acceleration are highly correlated, and therefore only one feature should be chosen to represent the overall velocity and acceleration profile. Comparing features related to velocity and to acceleration, the latter were considered to be representative of a larger number of aspects of the task process. As velocity is calculated directly from two other features, PEN-DIS and TOTAL-TIME, as discussed above, and as PEN-DIS and TOTAL-TIME are already included in forming the predictor, velocity is removed from the predictor. Among the range of measurements of pen acceleration, the feature assessing the mean value of pen acceleration across the whole task execution (MEAN-ACC) has shown significant performance in classifying the subject groups in six tasks, more frequently than the features assessing the standard deviation (two tasks) or the peak value (one task) of the pen acceleration. Hence, the feature MEAN-ACC is selected as the representative feature of the pen velocity and acceleration profile. For the same reason, the feature MEAN-PRES is selected as the representative feature of the pen pressure profile. The features are transformed to measure the difference between an individual feature value and the average feature value of the control subject group. Taking the entire set of computer-extracted features into account, with a number of features excluded because of the overlap in measurement described above, the general predictor of neglect symptoms across all drawing tasks takes the form of the unweighted sum of the transformed values of a set of features, as specified in (1). The predictor in (1) therefore measures the sum of the distances from a specific test subject's response to the average response of the control subjects. Also, because the feature transformation includes a normalization process, the possibility that the predictor calculated by (1) is dominated by one or more features with extremely large values is avoided.

$$P_{\text{draw}} = \sum_{i=1}^{n} \left| \frac{x_i - \bar{x}_i^{\,c}}{\sigma_i^{\,c}} \right| \qquad (1)$$

where $x_i$ is the value of the $i$th selected feature for the test subject, and $\bar{x}_i^{\,c}$ and $\sigma_i^{\,c}$ are the mean and standard deviation of that feature over the stroke control group.

2) Predictor for the Cancellation Tasks: The features extracted from the cancellation tasks measure three main characteristics of the task execution process: the number of targets cancelled, global temporal characteristics, and the ratio of the temporal characteristics extracted from the left-hand side of the test overlay to the same characteristics extracted from the right-hand side. The number of omissions (OMISSION) is initially selected to form the predictor because it is the conventional scoring criterion for cancellation tasks in general and has shown significant performance in identifying neglect patients [11]. The selection of features assessing temporal characteristics of the task execution process is problematical because there is a considerable amount of redundant information, mainly due to the quadrant-related features. More specifically, features such as TOTAL-TIME, MOVE-TIME, DRAW-TIME, and PAUSE-TIME are also calculated within each quadrant as well as on each side of the test overlay. Observations of the task execution time across all subjects indicate that, without neglect symptoms, stroke control subjects complete the task in a relatively short time. However, a subject who failed to complete the cancellation task also tends to have a small value of TOTAL-TIME and of the other three temporal features detailed above. In the ALB task, for instance, the stroke control subjects completed the task in times of between 22 and 129 s, while five out of 28 neglect patients abandoned the task within 20 s. It can be inferred that the total execution time does not necessarily reflect the subject's speed in the task execution process directly. This problem is resolved by the feature TIME-PER-CAN, calculated as the average execution time spent on each target that has actually been cancelled, which has shown encouraging performance in distinguishing between the two subject groups [11]. A problem revealed by the quadrant-related temporal features is a strong dependence between the feature performance and the physiological location of the stroke in the test subject. According to reports in the literature [3], [4], [21], as well as our own observations with the datasets built up in this study, neglect patients tend to fail to respond to the cancellation targets on the side of the test overlay contralateral to their stroke lesion. A further advantage of the feature TIME-PER-CAN over the quadrant-related temporal features is therefore its independence from the physiological stroke location, and hence it is chosen to form the predictor.

With respect to the cancellation task analysis, we have defined five ratio features which assess intratask performance variation. Of these features, ON-OFF-R, assessing the ratio of "pen-down" to "pen-up" time across the whole task execution, is among the features that have shown outstanding classification performance across different tasks [11], and hence can be regarded as a task-independent significant feature. It is therefore included in the formation of the predictor of neglect symptoms in the cancellation task models. The other four ratio features are calculated by assessing the difference between identical temporal features extracted from the left side and the right side of the test overlay. TOTAL-LR, for instance, is the ratio of the TOTAL-TIME within the left side of the test overlay to the same feature within the right side. These four ratio features are preferable to the individual side-related temporal features from which they are derived, because a ratio feature carries the assessment of neglect symptoms provided by two temporal features. Among the four temporal features defined, TOTAL-TIME and MOVE-TIME have shown advantages over the other two features in distinguishing neglect patients from the stroke population in most cases, including the overall, quadrant-related, and side-related measurements and the left-to-right ratios. Therefore, the ratio features TOTAL-LR and MOVE-LR, assessing TOTAL-TIME and MOVE-TIME respectively, are selected to represent the left-to-right ratio of the temporal characteristics of the task execution process in forming the predictor of neglect symptoms. The predictor formed from the features extracted from the cancellation tasks is specified in

$$P_{\text{canc}} = \sum_{i \in \mathcal{F}} \left| \frac{x_i - \bar{x}_i^{\,c}}{\sigma_i^{\,c}} \right|, \qquad \mathcal{F} = \{\text{OMISSION}, \text{TIME-PER-CAN}, \text{ON-OFF-R}, \text{TOTAL-LR}, \text{MOVE-LR}\} \qquad (2)$$

where the transformation of each feature is the same as in (1).

C. Modelling Approaches

Several modelling approaches are designed, adopting both an automated variable selection method based on statistical criteria and a rule-based method determined by experimental observations. The linear regression modelling method is chosen in the first approach as an exploratory exercise, and hence an automated variable selection method is adopted there. Using the rule-based variable selection method, three further modelling approaches to the continuous scoring of the computer-based tasks are considered. The BLR model was chosen to form the binary classification system in our previous study, with encouraging results. The BLR model, in fact, also produces a continuous outcome between 0 and 1, representing the probability of the occurrence of neglect symptoms. Encouraged by the high performance achieved in our previous study, as well as by the versatility of the BLR model, it is chosen again in the latter three modelling approaches to generate an interim continuous outcome, which is used in the further modelling processes of each separate approach. The four approaches are as follows.

• Linear Regression Model: formed with an automated feature selection method across all 17 computer-based tasks.

• Unweighted Voting Model: An individual BLR model is developed for each task using the features chosen by the rule-based criteria. A voting system is then formed using the outcomes of the 17 BLR models. The final score in this approach is calculated as the unweighted sum of the continuous outcomes of the combination of BLR models producing the optimal binary classification performance in forming the voting system.

• Weighted Voting Model: The BIT total score is the sum of the scores for the six subtests, within which the three cancellation subtests contribute up to 130 marks, while the three noncancellation subtests comprise the remaining 16 marks out of 146. This observation leads to a method of forming the computer-based score with respect to two different sets of tasks: cancellation tasks and drawing (noncancellation) tasks. The cancellation tasks from the computer-based test battery are used to approximate the BIT cancellation subtests and the drawing tasks to approximate the BIT drawing subtests. The BLR models chosen in forming the voting system of the Unweighted Voting Model are still used; however, different weights are assigned to tasks with respect to the task category and the weight of the equivalent category within the BIT.

• Weighted Linear Model: Also motivated by the unequal weights given to tasks belonging to different task categories within the BIT, the computer-based score in this approach is composed of two parts, referred to as the cancellation score and the drawing score respectively. The computer-based cancellation score is calculated as 130 times the average of the continuous outcomes of the BLR models for the ALB and OX tasks, both scaled to a value between 0 and 1. Linear Regression modelling is applied to form the model for calculating the drawing score. In training the regression model, the summed score of the BIT drawing subtests is taken as the dependent variable and the predictor for each individual drawing task as an independent variable.

The three voting methods considered as part of the modelling process in approaches 2 and 3 are the majority vote, the sum rule vote, and the product rule vote [11], [22]. An exhaustive combinatorial search, forming voting systems comprising between 2 and 17 individual task BLR models, was conducted. For each test subject, the outcome of each of the BLR models within the combination is counted as a vote, with the votes combined separately using each of the three voting methods.
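As an illustration of the three voting rules just listed, the sketch below combines the continuous outcomes of a set of per-task BLR models under the majority, sum-rule, and product-rule schemes. The function names, the example probabilities, and the fixed 0.5 threshold in the majority vote are illustrative assumptions; the paper selects per-task thresholds at the EER (Table IV).

```python
# Illustrative combination of per-task BLR outputs (probabilities in
# [0, 1]) under the three voting rules discussed in the text. The
# per-model decision threshold of 0.5 is an assumption; the paper
# uses per-task thresholds chosen at the EER (Table IV).
from math import prod
from typing import List

def majority_vote(p: List[float], thr: float = 0.5) -> bool:
    """Neglect is indicated if most models exceed their threshold."""
    votes = sum(pi >= thr for pi in p)
    return votes > len(p) / 2

def sum_rule(p: List[float]) -> float:
    """Combined evidence as the sum of the continuous outcomes."""
    return sum(p)

def product_rule(p: List[float]) -> float:
    """Combined evidence as the product of the continuous outcomes."""
    return prod(p)

# Example: seven hypothetical BLR outcomes for one subject.
p = [0.39, 0.61, 0.72, 0.55, 0.48, 0.66, 0.70]
print(majority_vote(p), sum_rule(p), product_rule(p))
```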

TABLE III COEFFICIENTS OF THE LINEAR REGRESSION MODEL

D. Measurement of Model Performance

There are several methods for assessing the correlation between two variables, with Pearson's, Spearman's, and Kendall's tau [23] amongst the most frequently adopted. Pearson's correlation assumes a linear relationship between the two variables, and Kendall's tau is a nonparametric correlation coefficient that can be used to assess and test correlations between noninterval-scaled ordinal variables. Pearson's correlation is also a method for assessing the performance of a Linear Regression model. However, the Linear Regression method is only used to form part of the computer-based score in the third approach, approximating the total score of the BIT drawing subtests. Hence, a linear relationship between the computer-based score and the BIT total score is not assumed in this approach, nor in the first two approaches. As the linear relationship between the computer and BIT scores is yet to be validated, and as the BIT score and the resultant computer-based score are regarded as interval (continuous) variables [24], neither Pearson's correlation nor Kendall's tau is suitable for assessing the agreement. Spearman's correlation coefficient, on the other hand, calculated from the ranks of the variables, measures any arbitrary monotonic function describing the relationship between two variables. Therefore, Spearman's method is chosen to assess the correlation between the continuous outcomes of the two scoring systems.

Although the BIT has been widely accepted as the gold-standard test for neglect, the interpretation of the score as regulated in the formal BIT manual focuses on the diagnosis of the existence of the symptoms, in which case a binary outcome is produced. A number of reported studies [16], [17] have attempted to interpret the continuous BIT score on a four-point-scale classification measurement. For ease of interpretation here we adopt this four-point scale, with the following groupings: non-neglect, mild neglect, moderate neglect, and severe neglect [16], [17]. The Kappa test [25] is a measurement of the agreement between two nominal or categorical variables and has been suggested as a suitable method for this purpose, specifically in medical applications [26]. The agreement between the computer-based score and the BIT score on the four-point scale defined above can therefore be assessed by the Kappa test. The Kappa test gives a continuous outcome ranging from 0 to 1, where a larger value indicates stronger agreement between the two variables.

Because the four modelling approaches all produce different outcomes that may disagree with each other in terms of the classification of patient group, it is essential to assess the agreement between the binary diagnostic outcomes produced by the BIT score and the continuous computer-based scores. Three common measurements, FAR (false acceptance rate), FRR (false rejection rate), and EER (equal error rate) [27], adopted in our previous study, are also utilised here for the purpose of assessing binary classification performance. EER is a useful criterion for choosing the optimal model for each individual task, while the performance of a voting system (consisting of more than one task model)

TABLE IV OBSERVATION-BASED BLR MODELS’ TESTING PERFORMANCE

is assessed using the average of FAR and FRR, denoted as AER (average error rate) in this paper.

IV. RESULTS

The model coefficients trained with the training dataset using the four approaches described in Section III-C are presented here. The model performance reported in this section is estimated using a testing dataset completely disjoint from the training dataset. In other words, the statistics report the success or failure rates expected when, in the future, the model is applied to new test subjects. Since the Unweighted and Weighted Voting Models both use the interim outcomes of BLR models and a voting system, they are presented together.

A. Model Coefficients

1) Linear Regression Model: As a result of the linear regression training with automated feature selection criteria, five features extracted from five different tasks are chosen to form this model, as shown in Table III.

2) Unweighted/Weighted Voting Model: The BLR models formed with the predictors defined in (1) (for drawing tasks) and (2) (for cancellation tasks) have shown significant training performance within each individual task, assessed by the chi-square test of goodness-of-fit [28]. Applying the testing dataset, the testing performance of the BLR models is shown in Table IV, ranked by EER. The threshold on the continuous outcome of each BLR model at which the EER was achieved is also presented in Table IV. The combination of tasks associated with the optimum voting performance consists of seven tasks (two of them derivative tasks): ALB, OX, DMSQ, FCST, FMDR, FMDD (derived from FMDL and FMDR), and FMMM (derived from FMML and FMMR), producing an average error rate (AER) of 14.5% using the product rule vote.

a) Unweighted Voting Model: In this approach, the final score is the sum of the continuous outcomes of the seven BLR models within the combination producing the optimal voting result, scaled between 0 and 146 to produce the same range as the BIT score

$$S_{\text{uv}} = \frac{146}{7} \left( p_{\text{ALB}} + p_{\text{OX}} + p_{\text{DMSQ}} + p_{\text{FCST}} + p_{\text{FMDR}} + p_{\text{FMDD}} + p_{\text{FMMM}} \right) \qquad (3)$$

where $S_{\text{uv}}$ is the computer-based score and the variables on the right side of the equation, each denoted with the task name as a subscript, represent the continuous outcome of the corresponding BLR model, each ranging between 0 and 1. The same variable naming system is used in describing the following modelling approaches. A cutoff point on this score is needed to decide the subject's group membership. This is obtained by replacing the variables on the right side of (3) with the threshold of each specific BLR model, as given in Table IV. For example, the threshold of the ALB model is 39%, and hence $p_{\text{ALB}}$ in (3) should be replaced by 39%. As a result, the cutoff point on the outcome of (3) is 32.12. The interpretation of the outcome of (3) is similar to the interpretation of the BLR continuous score: a score higher than the threshold indicates the predicted event's occurrence. It can be inferred that the dependent variable in (3) produces a diagnosis of the existence of neglect symptoms when ranging from 32.12 to 146, and is indicative of the absence of the symptoms when falling between 0 and 32.12. With an adjustment according to (4), the computer-based score can be adapted to the same format as the BIT score, i.e., below 130 for neglect patients and 130–146 for non-neglect subjects

(4)

where the result of (4) is the adjusted computer-based score.

b) Weighted Voting Model: Examining the structure of the BIT Conventional total score, the three cancellation subtests contribute up to 130 marks out of 146. Hence, the relative weighting of the cancellation tasks as a contribution to the total BIT score is greater than that of the drawing tasks. This observation leads to a second method of forming the computer-based score using the BLR models' continuous outcomes. Instead of an unweighted sum of the seven models' outcomes, a weight of 130 is given to the average of the two cancellation tasks, ALB and OX, while a weight of 16 is given to the average of the other five tasks

$$S_{\text{wv}} = 130 \cdot \frac{p_{\text{ALB}} + p_{\text{OX}}}{2} + 16 \cdot \frac{p_{\text{DMSQ}} + p_{\text{FCST}} + p_{\text{FMDR}} + p_{\text{FMDD}} + p_{\text{FMMM}}}{5} \qquad (5)$$

Here, $S_{\text{wv}}$ is the computer-based score and the naming of the variables on the right side of the equation is consistent with those in (3). By replacing the independent variables in (5) with the threshold of the corresponding BLR model, as shown in Table IV, the cutoff point on the dependent variable is 35.21. In the same way as was adopted for (3), the outcome of (5) is also adjusted to the format of the BIT score according to its own threshold

(6)

where the result of (6) is the formatted computer-based score.

3) Weighted Linear Model: This method is a variation of the Weighted Voting Model. In this approach, the first part of (5) remains the same, while the second part is formed using Linear Regression modelling. The BIT Conventional subtests and the computer-based tasks are divided into two categories, cancellation tasks and drawing tasks (any noncancellation task is regarded as a drawing task). The cancellation tasks from the computer-based test battery are used to approximate the BIT cancellation subtests and the drawing tasks to approximate the BIT drawing subtests. Therefore, the computer-based score is composed of two parts, referred to as the cancellation score and the drawing score. The formation of the computer-based cancellation score is the same as the first part of (5). Linear Regression [29] modelling is applied to form the model for calculating the drawing score. In training the Linear Regression model, the summed score of the BIT drawing subtests is taken as the dependent variable (DV) [29] and the predictor for each individual drawing task [calculated by (1)] as an independent variable (IV) [29]. The predictor selection methods for Linear Regression models are similar to those for BLR models, except for the use of a different cost function, least-squares estimation [29]. The predictor selection for the Linear Regression model also begins as an exploration of the relationships between extracted features and test performance, and hence an automated selection method is chosen. Three computer-based tasks (including a derivative task) producing the best approximation of the BIT drawing score were selected to form the Linear Regression model: the DMSQ, FMMR, and FMHH (derived from FMHL and FMHR) tasks. The mathematical expression for this approach is

$$S_{\text{wl}} = 130 \cdot \frac{p_{\text{ALB}} + p_{\text{OX}}}{2} + \left( \beta_0 + \beta_1 P_{\text{DMSQ}} + \beta_2 P_{\text{FMMR}} + \beta_3 P_{\text{FMHH}} \right) \qquad (7)$$

where the second term is the fitted Linear Regression estimate of the summed BIT drawing score, with trained coefficients $\beta_0, \ldots, \beta_3$ and with $P$ denoting the drawing-task predictor of (1) evaluated for the named task.

B. Model Performance

In order to provide a visual comparison of performance, Table V shows the cross-tables between the four-point-scale outcomes of the BIT and of the computer-based systems with the different models. Note that the total number of subjects included in the separate models can differ, because not all of the subjects completed the tasks required to form each assessment model, and the test configuration includes, separately for each approach, the maximum number of subjects who completed the required tasks. As shown in Table V, the numbers on the diagonal from the top-left to the bottom-right corner of each of the


TABLE V CROSS-TABLE OF THE OUTCOMES BETWEEN THE BIT SCORE AND THE COMPUTER-BASED SCORE ON THE FOUR-POINT-SCALE* (A) LINEAR REGRESSION MODEL, (B) UNWEIGHTED VOTING MODEL, (C) WEIGHTED VOTING MODEL, (D) WEIGHTED LINEAR MODEL

cross-tables represent the subjects on whom the two systems agree. The computer-based score generated from the first model agrees with the BIT score for 20 out of 42 subjects, while the second model agrees for 22 out of 46 subjects, the third model for 28 out of 46 subjects, and the fourth model for 32 out of 44 subjects. The agreement between the scores produced by the two systems is assessed firstly on the four-point scale by the Kappa test, secondly in continuous form by Spearman's correlation, and finally in binary format by classification rates. As the results presented in Table VI demonstrate, when assessed using the appropriate method, a significant agreement with the BIT score is shown for each of the four models of computer-based scoring.
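To illustrate the three forms of agreement assessment just described, the sketch below computes Cohen's kappa on four-point-scale bins, Spearman's correlation on the continuous scores, and FAR/FRR/AER on the binary outcomes. The example score vectors and the bin edges are invented for illustration; the paper does not publish per-subject scores or the exact four-point cutoffs.

```python
# Illustration of the three agreement measurements used in Table VI,
# computed on invented example scores. The four-point bin edges are
# an assumption; the paper cites [16], [17] for the actual groupings.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

bit = np.array([145, 120, 80, 40, 132, 60, 138, 100])  # BIT scores
cbs = np.array([140, 110, 85, 30, 135, 55, 131, 95])   # computer-based

# Continuous agreement: Spearman's rank correlation.
rho, p_value = spearmanr(bit, cbs)

# Four-point-scale agreement: bin both scores, then Cohen's kappa.
bins = [0, 50, 90, 130, 147]  # severe / moderate / mild / non-neglect
kappa = cohen_kappa_score(np.digitize(bit, bins), np.digitize(cbs, bins))

# Binary agreement: neglect if score < 130; FAR, FRR, and their mean AER.
truth, pred = bit < 130, cbs < 130
far = np.mean(pred[~truth]) if (~truth).any() else 0.0  # controls flagged
frr = np.mean(~pred[truth]) if truth.any() else 0.0     # neglect missed
print(rho, kappa, far, frr, (far + frr) / 2)
```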

TABLE VI AGREEMENT BETWEEN THE COMPUTER-BASED SCORES AND THE BIT CONVENTIONAL SCORE IN THREE DIFFERENT FORMS (A) KAPPA AGREEMENT OF FOUR-POINT SCALE OUTCOMES AND SPEARMAN’S CORRELATION OF CONTINUOUS OUTCOMES, (B) CLASSIFICATION RATE OF BINARY OUTCOMES

C. Discussion

During the training process, the Linear Regression Model showed the most promising performance. However, it produces the poorest result within each of the assessment methods when tested with the disjoint testing dataset. The dramatic change in model performance points to an overfitting issue caused by the adoption of the automated variable selection procedure. The overfitting issue has also been observed in the training process using leave-one-out verification, wherein the change in the chosen features between iterations is significant. Such a phenomenon has previously been commented on in the literature [19], [20], and was the initial motivation for the development of the following three models in this study.

The other three models, developed using a theoretical rule-based feature selection method, all produce better agreement with the benchmark system than the Linear Regression Model. The observations described in Section III-A reveal that the dataset used in our study is small in the sense that it does not contain all possible variation in the features as a means of assessing patients' test performance. The outcome of comparing the performance of models formed with different feature selection methods indicates that choosing variables based on experimental observations, as well as relevant clinical knowledge, is a suitable method for a small dataset such as the one utilised in our study.

Among the three models developed using a theoretical rule-based feature selection method, none produces an overall optimal performance. However, the Weighted Linear Model, as shown in Table VI, produces the best classification performance on both the binary and the four-point scale. Considering that, in clinical practice, the four-point scale noted above is the highest resolution interpretation of the BIT score, the Weighted Linear Model is rated as the optimal model. Most interestingly, the Weighted Linear Model outperformed the voting system described in Section IV-A2.

V. CONCLUSION


The work presented in this paper constitutes the first attempt to develop a quantitative assessment system for hand-drawn images as a test for neglect patients. Besides providing a binary diagnostic outcome, a score with a higher resolution is an important indicator of the progress of rehabilitation. In this study, the score produced by the computer-based system can be interpreted on a four-point scale or as a continuous outcome. Four approaches to forming the quantitative assessment model have been investigated. The model producing the best agreement with the BIT score is the Weighted Linear Model, consisting of six test overlays. As previously mentioned, the BIT Conventional subset is composed of 13 individual hand-drawn tests; in comparison, the time required for administering the computer-based test battery is significantly reduced. At the same time, the score generated by the Weighted Linear Model has achieved significant agreement with the BIT score, assessed in three different formats: 1) binary diagnostic outcome (3.8% FAR, 11% FRR, and 7.5% AER); 2) four-point-scale score; and 3) continuous form.

While a significant agreement with the BIT score is established, the computer-based testing system also has additional advantages over the BIT test battery, including the following.

1) The computer-based test system is capable of assessing the dynamic features, which reveal various aspects of neglect symptoms that are not assessed by the conventional scoring criteria.

2) The number of test overlays included in the different models for the computer-based test system ranges from five to seven. As discussed at the outset, the six BIT Conventional subtests can be taken as equivalent to 13 drawing tasks, which the participants typically required 10–20 min to complete. Considering the six test overlays required to form the Weighted Linear Model, the average completion time across all participants is less than 3 min and the maximum execution time is 10 min. The computer-based test battery therefore provides the benefit of reducing patient fatigue, as well as the feasibility of repeating the test battery on a more regular basis for patient assessment during a period of rehabilitation.

3) The computer-based test system assesses the features using automated and objective measurement, and hence does not produce results affected by human subjectivity, which is unavoidable in the traditional way of administering the BIT [30].

The results and discussion presented here indicate that it is possible to use a computer-based testing system for assessing neglect symptoms, by means of which an agreement of up to 92.5% with the diagnosis obtained by the traditional BIT total score is achievable. In addition, the computer-based test battery has the advantage that it can be administered with repeatability and objectivity, and it offers the potential for further investigation of neglect symptoms by means of both the traditional features and the novel features which are now made available. Moreover, the methods employed in this work can be applied to other studies in a similar field, for example the assessment of dyspraxia through hand-drawn images [31]. The methods described in this paper have been implemented in a prototype system for the purpose of demonstration. We plan, in future research and development, to implement the full system so as to facilitate


further study of neglect as well as other clinical studies using pencil-and-paper tests.

REFERENCES

[1] M. L. Albert, "A simple test of visual neglect," Neurology, vol. 23, pp. 658–664, 1973.
[2] R. M. Guest et al., "Analysing constructional aspects of figure completion for the diagnosis of visuospatial neglect," in Proc. Int. Conf. Pattern Recognit. (ICPR), 2000, vol. 4, pp. 4316–4319.
[3] L. R. Cherney and A. S. Halper, "Unilateral visual neglect in right-hemisphere stroke: A longitudinal study," Brain Injury, vol. 15, pp. 585–592, 2001.
[4] P. W. Halligan and I. H. Robertson, "The assessment of unilateral neglect," in A Handbook of Neuropsychological Assessment, J. R. Crawford, D. M. Parker, and W. W. McKinlay, Eds. London, U.K.: Psychology Press, 1992.
[5] B. Wilson et al., "Development of a behavioral test of visuospatial neglect," Arch. Phys. Med. Rehabil., vol. 68, pp. 98–102, 1987.
[6] S. Hannaford et al., "Assessing visual inattention: Study of inter-rater reliability," Br. J. Therapy Rehabil., vol. 10, pp. 72–75, 2003.
[7] S. Dawes and G. Senior, "Australian normative data and clinical utility of the Mesulam and Weintraub cancellation test," presented at the 21st Annu. Conf. Nat. Acad. Neuropsychol., San Francisco, CA, 2001.
[8] L. Gauthier, F. Dehaut, and Y. Joanette, "The Bells test: A quantitative and qualitative test for visual neglect," Int. J. Clin. Neuropsychol., vol. 11, pp. 49–54, 1989.
[9] N. Donnelly et al., "Developing algorithms to enhance the sensitivity of cancellation tests of visuospatial neglect," Behav. Res. Methods, Instrum. Comput., vol. 31, no. 4, pp. 668–673, 1999.
[10] R. M. Guest, "The diagnosis of visuo-spatial neglect through the computer-based analysis of hand-executed drawing tasks," Ph.D. dissertation, Dept. Electron., Univ. Kent, Canterbury, U.K., 1999.
[11] Y. Liang et al., "Feature-based assessment of visuo-spatial neglect patients using hand-drawing tasks," Pattern Anal. Appl., vol. 10, no. 4, pp. 361–374, 2007.
[12] A. Erez et al., "Visual spatial search task (VISSTA): A computerized assessment and training program," in Proc. 6th Int. Conf. Disability, Virtual Reality Assoc. Technol. (ICDVRAT), 2006, pp. 265–270.
[13] L. Y. Deouell, Y. Sacher, and N. Soroker, "Assessment of spatial attention after brain damage with a dynamic reaction time test," J. Int. Neuropsychol. Soc., vol. 11, pp. 697–707, 2005.
[14] H. Martens and P. Dardenne, "Validation and verification of regression in small data sets," Chemometrics Intell. Lab. Syst., vol. 44, pp. 99–121, 1998.
[15] R. R. Picard and R. D. Cook, "Cross-validation of regression models," J. Am. Stat. Assoc., vol. 79, no. 387, pp. 575–583, 1984.
[16] L. J. Buxbaum et al., "Amantadine treatment of hemispatial neglect: A double-blind, placebo-controlled study," Am. J. Phys. Med. Rehabil., vol. 86, pp. 527–537, 2007.
[17] C. Lafosse et al., "Graviceptive misperception of the postural vertical after right hemisphere damage," Neuroreport, vol. 15, pp. 887–891, 2004.
[18] D. W. Hosmer and S. Lemeshow, "Model-building strategies and methods for logistic regression," in Applied Logistic Regression, D. W. Hosmer and S. Lemeshow, Eds. New York: Wiley, 1989.
[19] P. C. Austin and J. V. Tu, "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality," J. Clin. Epidemiol., vol. 57, pp. 1138–1146, 2004.
[20] E. W. Steyerberg, M. J. C. Eijkemans, and J. D. F. Habbema, "Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis," J. Clin. Epidemiol., vol. 52, no. 10, pp. 935–942, 1999.
[21] A. Hartman-Maeir and N. Katz, "Validity of the Behavioural Inattention Test (BIT): Relationships with functional tasks," Am. J. Occup. Ther., vol. 49, pp. 507–516, 1995.
[22] M. van Erp, L. Vuurpijl, and L. Schomaker, "An overview and comparison of voting methods for pattern recognition," in Proc. IWFHR-8, 2002, pp. 195–200.
[23] S. D. Bolboaca and L. Jantschi, "Pearson versus Spearman, Kendall's tau correlation analysis on structure–activity relationships of biologic active compounds," Leonardo J. Sci., vol. 9, pp. 179–200, 2006.
[24] S. S. Stevens, "On the theory of scales of measurement," Science, vol. 103, no. 2684, pp. 677–680, 1946.


[25] J. Cohen, "A coefficient of agreement for nominal scales," Educ. Psychol. Meas., vol. 20, pp. 37–46, 1960.
[26] Y. H. Chan, "Biostatistics 104: Correlational analysis," Singapore Med. J., vol. 44, pp. 614–619, 2003.
[27] S. Bengio and J. Mariéthoz, "The expected performance curve: A new assessment measure for person authentication," Toledo, Spain, 2004, pp. 9–16.
[28] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression. New York: Wiley, 1989.
[29] S. Weisberg, Applied Linear Regression. New York: Wiley, 1985.
[30] S. Hannaford, G. Gower, J. M. Potter, R. M. Guest, and M. C. Fairhurst, "Assessing visual inattention: A study of the inter-rater reliability," Br. J. Rehabil. Therapy, vol. 10, no. 2, pp. 72–75, 2003.
[31] M. C. Fairhurst, S. Hoque, and T. J. Boyle, "Assessing behavioural characteristics of dyspraxia through on-line drawing analysis," in Proc. 12th Conf. Int. Graphonomics Soc. (IGS2005), Salerno, Italy, 2005, pp. 291–295.

Yiqing Liang received the B.S. degree from Nanjing University of Post and Telecommunication, China, in 2000, and the Ph.D. degree in electronic engineering from the University of Kent, Canterbury, U.K., in 2008. She is currently a Research Associate at the University of Kent. Her research interests are in pattern recognition, image analysis, and biometrics. She previously worked at the China Telecom Research and Development Centre.

Michael C. Fairhurst is with the School of Engineering and Digital Arts at the University of Kent, Canterbury, U.K. His research interests include high performance image analysis and classification, handwritten text reading and document processing, medical image analysis and, especially, security and biometrics for identification. Prof. Fairhurst is a Fellow of the IAPR.

Richard M. Guest received the Ph.D. degree in electronic engineering from the University of Kent, Canterbury, U.K., in 2000. He is a Senior Lecturer at the University of Kent, Canterbury, U.K. His research interests include hand-drawn data analysis for neuropsychological testing, biometric signature verification, and forensics. He serves on many publication, standards, and funding review committees.

Jonathan Potter is a Consultant Geriatrician and Lead Stroke Physician at the Kent and Canterbury Hospital, Canterbury, U.K. He is also Clinical Director of the Royal College of Physicians of London Clinical Effectiveness and Evaluation Unit. He has undertaken research work including the evaluation of rehabilitation methods and visual neglect in stroke.