
Time-Elastic Generative Model for Acceleration Time Series in Human Activity Recognition

Mario Munoz-Organero * and Ramona Ruiz-Blazquez

Telematics Engineering Department, Universidad Carlos III de Madrid, Av. Universidad 30, 28911 Leganes, Spain; [email protected]
* Correspondence: [email protected]; Tel.: +34-91-624-8801

Academic Editor: Kamiar Aminian
Received: 21 December 2016; Accepted: 6 February 2017; Published: 8 February 2017

Sensors 2017, 17, 319; doi:10.3390/s17020319

Abstract: Body-worn sensors in general and accelerometers in particular have been widely used in order to detect human movements and activities. The execution of each type of movement by each particular individual generates sequences of time series of sensed data from which specific movement-related patterns can be assessed. Several machine learning algorithms have been used over windowed segments of sensed data in order to detect such patterns in activity recognition based on intermediate features (either hand-crafted or automatically learned from data). The underlying assumption is that the computed features will capture statistical differences that can properly classify different movements and activities after a training phase based on sensed data. In order to achieve high accuracy and recall rates (and to guarantee the generalization of the system to new users), the training data have to contain enough information to characterize all possible ways of executing the activity or movement to be detected. This could imply large amounts of data and a complex and time-consuming training phase, which has been shown to be even more relevant when automatically learning the optimal features to be used. In this paper, we present a novel generative model that is able to generate sequences of time series for characterizing a particular movement based on the time elasticity properties of the sensed data. The model is used to train a stack of auto-encoders in order to learn the particular features able to detect human movements. The results of movement detection using a newly generated database with information on five users performing six different movements are presented. The generalization of results using an existing database is also presented in the paper. The results show that the proposed mechanism is able to obtain acceptable recognition rates (F = 0.77) even in the case of using different people executing a different sequence of movements and using different hardware.

Keywords: human activity recognition; accelerometer sensors; auto-encoders; generative models for training deep learning algorithms

1. Introduction

Human Activity Recognition (HAR) based on low-level sensor data is widely used in order to provide contextual information to applications in areas such as ambient-assisted living [1], health monitoring and management [2], sports training [3], security, and entertainment [4]. Raw data from sensors such as accelerometers, the Global Positioning System (GPS), physiological sensors, or environmental sensors are processed into movement-related features that are used to train machine learning algorithms able to classify different activities and movements [4]. By detecting the particular activity that a user is performing at each particular moment, personal recommender systems could be used in order to provide feedback and advice to increase the user's wellbeing, optimize the user's tasks, or adapt the user interface in order to optimally convey information to the user.


Different variants exist in previous research studies in order to recognize human activities from sensor data. When focusing on the type of sensor used, the recognition of human activities has been approached in two major ways, namely using external and wearable sensors [4]. When focusing on the way in which the sensed data are segmented, two major alternatives have been previously used, namely temporal windows of sensed data and activity composition by combining the detection of sporadic atomic movements [5]. Depending on the way in which the computed features are defined, two major approaches exist, either using hand-crafted features or automatically learning the optimal features from data [6].

Each approach has desirable and undesirable aspects. Environmental sensors such as video cameras do not require the user to wear the sensors and are therefore less intrusive, but they are restricted to particular areas and raise privacy issues. Windowed approaches are simple to implement but only adapt to activities that can be fully statistically described within the duration of the time window and that do not overlap or concatenate with other activities during that temporal window. Hand-crafted features can introduce a priori models for the sensed data of particular activities but require expert knowledge and tend not to generalize across different applications. Deep learning approaches are able to automatically detect the optimal features but require larger training datasets in order to avoid overfitting problems [7].

In this paper, we propose a novel mechanism that focuses on the desirable aspects of previous proposals for recognizing human activities while trying to minimize the undesirable aspects. In order to detect each particular activity, candidate temporal windows of sensed data are pre-selected based on basic statistical properties of adjacent points of interest (such as consecutive local maximum acceleration points or consecutive local maxima in the standard deviation of the acceleration time series). Time windows are selected and aligned according to the pre-selected points of interest. A stack of auto-encoders is then used in order to assess whether a candidate window of sensed data corresponds to a particular human activity or movement. The stack of auto-encoders is trained with similar temporal windows obtained from performing the activity at different speeds by different people. In order not to require an extensive training phase and data gathering process, a generative model has been defined that takes into account the elastic temporal variation of the sensed data when performing the same activity at different speeds. The model is able to generate data for intermediate speeds based on a limited number of training samples. Each human activity or movement to be detected is assessed by a separately trained stack of auto-encoders. The Pearson correlation coefficient is used in order to compute the similarity of each pre-selected data segment with each activity to be recognized.

The rest of the paper is organized as follows. Section 2 captures the previous related studies and motivates the research in this paper. Section 3 describes the proposed generative model, which estimates missing data samples in the data gathering process so as to provide enough statistical information to the detection algorithm to characterize a particular movement.
The model uses the elastic deformation of the time series when performing the same movement at different speeds. Section 4 details the architecture of the proposed detection algorithm and the way in which it is trained. The detection algorithm is based on a stack of auto-encoders. Section 5 presents the results for detecting two particular movements in a new database that we have created using five people executing six different movements. Section 6 is dedicated to assessing the generalization of results by applying the algorithm trained using our database to the data in a different database in which the participants executed one common movement. Finally, Section 7 captures the major conclusions of the paper.

2. Human Movement Detection from Accelerometer Sensors

Human Activity Recognition has been approached by different research studies in different ways since the first contributions in the late 1990s [8], either using environmental sensors such as cameras [9] and smart environments [10] or wearable sensors such as accelerometers and gyroscopes [11]. Body-worn sensors are preferred in order to preserve privacy and continuously monitor the user independently of his or her location [4].

Body-worn sensors provide one or multiple series of temporal data depending on the activity that the user is performing. In order to extract the particular activity from the sensed data, different machine learning algorithms have been used over features computed from selected segments of raw data. Different features could be defined in order to better capture the statistical differences among the activities to be detected, in either the time or the frequency domain [4]. Instead of using hand-crafted features, features could be learned from data: using an optimization process based on some pre-defined features in order to select the most important ones for detecting a particular activity [12], combining or projecting pre-defined features in order to reduce the number of features [13], or using deep learning techniques to automatically generate the best features for a particular dataset [14].

An example of applying machine learning techniques over hand-crafted features for Human Activity Recognition (HAR) can be found in [15]. The authors applied Decision Trees, Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN) machine learning algorithms to acceleration data to compare detection rates when attaching the accelerometer to different parts of the body (wrist, elbow, chest, ankle, knee, and belt). The authors used several hand-crafted features over windowed data and a movement detection algorithm based on an empirical threshold applied to the differences between consecutive values of the lengths of the acceleration vector. Hand-crafted features could be applied to windowed segments of sensor data or be based on particular relevant points of the time series of sensed data instead, such as in [5], in which the authors use statistical properties of adjacent maximum and minimum values in acceleration signals in order to recognize different movements, taking into account the energy constraints for in-sensor algorithms.

Deep learning architectures are able to automatically learn the best features to be used for a particular dataset. Plötz et al. [16] present a general approach to feature extraction based on deep learning architectures and investigate the suitability of feature learning for activity recognition tasks. Avoiding hand-crafted features lessens the burden of application-specific expert knowledge. Gjoreski et al. [17] compare machine learning algorithms based on hand-crafted features, such as Random Forests, Naive Bayes, k-Nearest Neighbors, and Support Vector Machines, with deep learning architectures based on a convolutional neural network (CNN), showing that the latter was able to outperform the recognition results by at least two percentage points. Yang et al. [6] performed a similar analysis by introducing a convolutional neural network (CNN) for Human Activity Recognition (HAR). The key advantages of using a deep learning architecture are: (i) feature extraction is performed in a task-dependent and non-hand-crafted manner; (ii) the extracted features have more discriminative power with respect to the classes of human activities; and (iii) feature extraction and classification are unified in one model, so their performances are mutually enhanced. The proposed CNN method outperformed other state-of-the-art methods such as k-Nearest Neighbors and Support Vector Machines in classification for the HAR problems. A review of unsupervised feature learning and deep learning for time-series modeling in general is presented in [14].
Restricted Boltzmann Machines, auto-encoders, Recurrent Neural Networks, and Hidden Markov Models are described, together with how these machine learning techniques can be applied to time series obtained from sensors. The authors in [7] use a deep recurrent neural network (DRNN) architecture in order to recognize six different human activities ("stay", "walk", "jog", "skip", "stair up", and "stair down") using a combination of a tri-axial accelerometer and a gyroscope. If the dataset is divided into segments of raw data containing a single activity, the optimal training parameters are able to detect the correct activity at a 95.42% rate. However, if the algorithm is applied to non-pre-segmented data, the recognition rate decreases to 83.43% for the best parameters, even when the training and validation are performed using the same dataset. The optimal number of layers in the DRNN was found to be 3. The more layers we add to a deep learning architecture, the more degrees of freedom we have and therefore the easier it is to achieve higher detection rates for a particular dataset. However, the generalization of results when validating with a different dataset may decrease due to overfitting problems.


The research study in [18] uses a deep convolutional neural network to perform Human Activity Recognition (HAR) using smartphone sensors by exploiting the inherent characteristics of activities and 1D time series signals, at the same time providing a way to automatically and data-adaptively extract robust features from raw data. The authors recognize six different activities ("walking", "stair up", "stair down", "sitting", "standing", and "laying") using a 2.56-s window (with 50% overlap) for the six-dimensional raw sensed data from a tri-axial accelerometer and a gyroscope. The experimental results show that a three-layer architecture is enough to obtain optimal results. The best results show an overall performance of 94.79% on the test set with raw sensor data, similar to the results obtained using Support Vector Machines (SVM). Wang et al. [19] make use of auto-encoders for the recognition of Activities of Daily Life (ADL) based on sensed information from environmental sensors. The results outperformed previous machine learning algorithms based on hand-crafted features such as Hidden Markov Models, Naive Bayes, 1-Nearest Neighbor, and Support Vector Machines.

We will use a similar approach based on stacked auto-encoders applied to one of the axes of a single accelerometer placed on the wrist of the dominant hand. In order to simplify the data gathering for the training phase, a novel generative model based on the time elasticity of human movements when performed at different speeds is added. A different stack of auto-encoders is trained for each activity to be detected. A simple pre-detection algorithm based on the statistical properties of some points of interest for each activity is added in order to pre-select candidate time windows, and a time-offset cancellation mechanism (sequence alignment) is used by centering the selected window on the detected candidate points of interest.

3. Elastic Generative Models for Movement Description

Human activities can be described as a sequence of movements that can be executed at different speeds. In order to characterize each movement, we will use a generative model to parameterize the elasticity of the sensed time series performed at different speeds. Let us define $s_i(t)$ as the time series capturing the sensed data for a particular execution of a particular movement performed at speed $v_i$. The relationship between two different executions of the same movement at different speeds can be modeled as described by Equation (1):

$$s_i(t - t_c) = \alpha_{ij}(t - t_c) \cdot s_j\big(\beta_{ij}(t - t_c)\big), \tag{1}$$

$$\beta_{ij} = \frac{v_i}{v_j}, \tag{2}$$
where $t_c$ is an alignment factor (we have aligned all the movements to the center of the windowed data), $\beta_{ij}$ is the ratio of execution speeds (as described in Equation (2)), and $\alpha_{ij}(t - t_c)$ captures the deformation coefficient for a particular value in the time series. If $\alpha_{ij}(t - t_c) = \alpha_{ij}$ for all values of $t$ (i.e., it does not depend on $t$), the execution of the same movement at different speeds generates scaled versions of the same data. $\alpha_{ij}(t - t_c)$ is modeled as a function of time (relative to the alignment point). The bigger the variance of $\alpha_{ij}(t - t_c)$, the larger the deformation of the time series when executing the movement at a different speed. A generative model is able to estimate $s_k(t)$ from other samples recorded at different speeds. Let us assume that we have two samples at speeds $v_i$ and $v_j$ and that $v_i < v_k < v_j$. The time series $s_k(t)$ can be written as described by Equations (3) and (4):

$$s_k(t - t_c) = \alpha_{ki}(t - t_c) \cdot s_i\big(\beta_{ki}(t - t_c)\big), \tag{3}$$

$$s_k(t - t_c) = \alpha_{kj}(t - t_c) \cdot s_j\big(\beta_{kj}(t - t_c)\big), \tag{4}$$

with $\alpha_{ki}(t - t_c) \to 1$ if $v_k \to v_i$ and $\alpha_{kj}(t - t_c) \to 1$ if $v_k \to v_j$.

We could estimate $s_k(t)$ by interpolating from $s_i(t)$ and $s_j(t)$, as captured in Equation (5), by combining Equations (3) and (4) and approximating $\alpha_{ki}(t - t_c)$ and $\alpha_{kj}(t - t_c)$ by $\frac{2(v_j - v_k)}{v_j - v_i}$ and $\frac{2(v_k - v_i)}{v_j - v_i}$. These values are chosen so that Equation (5) fulfills the limit cases of $\alpha_{ki}(t - t_c) \to 1$ if $v_k \to v_i$ and $\alpha_{kj}(t - t_c) \to 1$ if $v_k \to v_j$:

$$s_k(t - t_c) = \frac{v_j - v_k}{v_j - v_i} \cdot s_i\big(\beta_{ki}(t - t_c)\big) + \frac{v_k - v_i}{v_j - v_i} \cdot s_j\big(\beta_{kj}(t - t_c)\big). \tag{5}$$

The model will also need maximum and minimum values for the speed of execution. These values can be estimated based on the measured samples of sensed data.
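As an illustration of how Equation (5) can be turned into a data generator, the following Python sketch (our own illustrative code, not taken from the paper) produces a window for an intermediate speed from two recorded windows. The realization of the time scaling $\beta$ as linear resampling of the aligned window, as well as the function and parameter names, are assumptions made for the sketch.

```python
import numpy as np

def generate_intermediate(s_i, v_i, s_j, v_j, v_k):
    """Estimate s_k via Equation (5) for an intermediate speed v_i < v_k < v_j
    from two recorded windows s_i and s_j aligned at the window center."""
    n = len(s_i)
    t = np.arange(n) - n // 2          # time axis relative to the alignment point t_c

    def warp(s, beta):
        # s(beta * (t - t_c)) realized as linear resampling of the window;
        # values requested outside the window are clamped to its edge values.
        return np.interp(beta * t, t, s)

    beta_ki = v_k / v_i                # speed ratios, Equation (2)
    beta_kj = v_k / v_j
    w_i = (v_j - v_k) / (v_j - v_i)    # interpolation weights of Equation (5)
    w_j = (v_k - v_i) / (v_j - v_i)
    return w_i * warp(s_i, beta_ki) + w_j * warp(s_j, beta_kj)
```

Sweeping $v_k$ over an equally spaced grid between the minimum and maximum observed speeds yields synthetic training samples such as those used in Section 5.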
4. Using a Stack of Auto-Encoders Based on the Generative Model

Auto-encoders use the combination of an encoder and a decoder function in order to try to minimize the reconstruction errors for the samples in the training set. Auto-encoders can be stacked by connecting the reconstructed output of the "layer I" auto-encoder to a "layer I + 1" auto-encoder. Auto-encoders are designed to minimize the error between the input and the reconstructed output according to Equation (6):

$$\varepsilon(x, x_r) = \left\| x - f_2\big(W' f_1(Wx + b) + b'\big) \right\|^2, \tag{6}$$

where $x_r$ is the reconstructed signal, which is the concatenation of the encoder and decoder functions ($f_1$ and $f_2$ are activation functions such as the sigmoid function). A final detection function is required at the end of the stacked auto-encoders in order to assess the similarity of the input and the reconstructed output. We have used the Pearson correlation coefficient as a similarity function.

Figure 1 shows the training phase of the system. We select segments of samples from the sensor device containing the particular human movement to be trained. Segments of data are aligned at the center of the selected window (of a particular length in order to contain enough information to describe the movement). The samples are taken executing the movement at different speeds. The generative model described in Section 3 is then used to reconstruct training samples at different speeds in order to properly characterize the human movement when performed at those speeds missing in the captured data. The original data and the newly generated data are then used to train a stack of auto-encoders.

Figure 1. Training phase.
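As a concrete illustration of this training phase, the sketch below trains one auto-encoder layer by gradient descent on the reconstruction error of Equation (6). The learning rate, epoch count, and weight initialization are illustrative assumptions, and the input windows are assumed to be scaled to [0, 1] so that the sigmoid output can reproduce them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=500, seed=0):
    """Train one auto-encoder layer on X (n_samples x n_inputs) by minimizing
    the squared reconstruction error of Equation (6)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_hidden, n_in))    # encoder weights
    b = np.zeros(n_hidden)
    Wp = rng.normal(0.0, 0.01, (n_in, n_hidden))   # decoder weights (W')
    bp = np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(W @ x + b)                 # f1(Wx + b)
            xr = sigmoid(Wp @ h + bp)              # f2(W'h + b')
            # back-propagate the reconstruction error through both sigmoids
            d_out = (xr - x) * xr * (1.0 - xr)
            d_hid = (Wp.T @ d_out) * h * (1.0 - h)
            Wp -= lr * np.outer(d_out, h)
            bp -= lr * d_out
            W -= lr * np.outer(d_hid, x)
            b -= lr * d_hid
    return W, b, Wp, bp
```

A second layer can be trained in the same way on the reconstructed outputs of the first layer, following the stacking rule described above.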

Figure 2 shows the movement detection process. A new candidate segment of non-labeled data is captured from the sensors. The data are aligned using the same procedure as that used for the training data, based on the detection of particular points of interest in the raw signal fulfilling certain properties. The window of input data is then obtained and used as the input for the stack of auto-encoders. The detection phase computes the Pearson correlation coefficient, which is compared to the values obtained for the same coefficient when using the training data.


Figure 2. Movement detection.
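The detection step in Figure 2 can be sketched as follows (illustrative code under the same assumptions as above; `encoders` is assumed to be a list of the `(W, b, Wp, bp)` tuples returned by `train_autoencoder`):

```python
import numpy as np

def detect(window, encoders, r_threshold):
    """Classify an aligned candidate window as the target movement when the
    Pearson correlation between the window and its reconstruction exceeds
    the threshold selected on the training data."""
    x = window
    for W, b, Wp, bp in encoders:
        h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
        x = 1.0 / (1.0 + np.exp(-(Wp @ h + bp)))   # reconstruction fed to the next layer
    r = np.corrcoef(window, x)[0, 1]               # Pearson correlation coefficient
    return r >= r_threshold, r
```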

The proposed algorithm has some parameters that have to be selected, such as the number of auto-encoders to stack or the number of hidden units at each layer. Section 5 will show the results for the particular cases of detecting the "cutting with a knife" and "eating with a spoon" movements inside acceleration time series containing six different movements ("sitting", "eating with a spoon", "cutting with a knife", "washing the hands", "walking", and "free random movements"). A new database for five different people executing the six different movements has been generated. The parameters will be selected based on the ability to detect the movement when the movement is performed (recall) and not to detect false positives in segments of data that correspond to other movements (precision). In order to assess the generalization of the algorithm, we will use the algorithm trained with the data in Section 5 in order to detect "cutting" movements in the database in [20] in Section 6 (using different people and different hardware to implement the sensors).

5. Training and Validation Based on a Newly Created Database

A new human activity database combining the repetition of six different movements by five different people has been created in order to train and assess the generative model proposed in the previous section. Ethics approval was obtained through the process defined by the Carlos III of Madrid University Ethics Committee (approved by the agreement of the University's Government Council on 25 September 2014). An individual informed consent to participate in the experiment was obtained from each participant. The demographics of the five participants are captured in Table 1.

Table 1. Participant demographics.

Gender   Age   Height (m)   Weight (kg)
F        32    1.73         62
M        38    1.75         77
F        39    1.68         71
F        41    1.65         64
M        43    1.83         89

Each user was wearing a Nexus 6 Android device (which contains a tri-axial accelerometer) attached to the wrist of the dominant hand and was asked to repeat each of the following movements for 1 min:

1. Idle
2. Eating soup with a spoon
3. Cutting meat
4. Washing hands
5. Walking
6. Free arm movements

The data gathering device used is captured in Figure 3. Although the Nexus 6 Android device is equipped with a tri-axial accelerometer, we will use a "worst case" scenario by selecting only one component from the acceleration vector (represented by a black arrow in Figure 3). Using the data in all the acceleration components at the same time is expected to provide even better results, which we will study in future work.

Figure 3. Data gathering system.

The acceleration data were re-sampled at 100 Hz. Each 1-min period contained several repetitions of the same movement. In order to train the stack of auto-encoders, the generative process described in the previous section was used on selected segments of the "cutting with a knife" and the "eating with spoon" movements. Each selected segment contained a window of 2 s of sensed data aligned at the center of the window. The "cutting with a knife" movement constitutes a concatenation of several repetitions (a variable number of times) of "move forward" and "move backward" movements executed at similar frequencies. Selected time windows of sensor data can be aligned to each central local maximum value in the sequence. In the case of the "cutting with a knife" movement, the speed of execution of each segment $v_i$ was estimated as the inverse of the time span between two consecutive maxima in the acceleration data. In the case of the "eating with spoon" movement, each movement of moving the spoon from the plate to the mouth and back to the plate can be described as a concatenation of three gestures (moving up, eating, and moving down). The "up" and "down" gestures can be pre-filtered based on the maximum values of the standard deviation of the acceleration series in a short time window centered at each point. The alignment of each candidate movement will be done to the mean point between the maximum values representing the "up" and "down" movements. In this case, the speed of execution of each movement $v_i$ was estimated as the inverse of the time span between the estimated "up" and "down" movements.

The generative model was used on the selected segments of sensed data corresponding to the "cutting" movement in order to generate 51 samples at equally spaced speeds between the minimum and maximum values detected in the sensed samples. In a similar way, 142 samples were generated for the "eating with spoon" movement. The participants were asked to execute the movements at different speeds during the 1-min window. The 51 and 142 generated samples were used to train different configurations of parameters for the stack of auto-encoders in order to detect each movement. The trained auto-encoders were then validated using a leave-one-out architecture in which four users were used to train the system and the other one to assess the truly classified samples (the process was repeated leaving out each user and averaging the results). In order to compare the classification results provided by each configuration, the obtained precision, recall, and F score have been used. Defining tp as the number of "cutting" or "eating with spoon" movements that are correctly classified, Tp as the total number of "cutting" or "eating with spoon" movements present in the validation sequence, and fp as the total number of "non-cutting" or "non-eating with spoon" movements that are classified as positive samples, the precision, the recall, and the F score can be defined as:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{Tp}, \qquad F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{7}$$
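For the "cutting" movement, the pre-selection and speed estimation described above can be sketched as follows (illustrative code; the speed bounds `v_min` and `v_max` are placeholders for the values estimated from the recorded samples):

```python
import numpy as np
from scipy.signal import find_peaks

def candidate_windows(acc, fs=100, win_s=2.0, v_min=0.5, v_max=2.5):
    """Pre-select 2-s windows centered on local acceleration maxima whose
    execution speed, estimated as the inverse of the time span between
    consecutive maxima, lies within the allowed range."""
    half = int(win_s * fs / 2)
    peaks, _ = find_peaks(acc)                 # candidate points of interest
    windows = []
    for p0, p1 in zip(peaks[:-1], peaks[1:]):
        v = fs / (p1 - p0)                     # speed in repetitions per second
        if v_min <= v <= v_max and half <= p0 < len(acc) - half:
            windows.append((acc[p0 - half:p0 + half], v))
    return windows
```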


The design parameters for the stack of auto-encoders that are going to be assessed are the number of layers (one, two, and three layers, according to the previous related studies presented in Section 2) and the number of hidden units inside the encode and decode functions of each layer. The results of previous related studies based on the use of deep learning techniques applied to sensor data in order to detect human movements and activities are captured in Table 2. A final row has been added to capture the results from the analysis of a database that we are going to use to assess the generalization of our approach (described in the following section). The F score is going to be used in order to compare the results obtained by our algorithm with previous related results. In some of the previous studies, it has not been possible to compute the F score from the published information, and the recall or accuracy numbers are presented instead.

Table 2. Best results in similar previous studies.

Previous Related Studies   F/Recall/Acc.      Method             Database   Data Dimensionality
[21]                       F = 0.927          b-LSTM-S (1)       [22]       79
[21]                       F = 0.937          CNN (2)            [23]       52
[21]                       F = 0.760          LSTM-S (1)         [24]       9
[6]                        F = 0.851          CNN (2)            [22]       79
[25]                       F = 0.917          DeepConvLSTM (1)   [22]       79
[7]                        Recall = 0.950     DRNN (3)           [26]       3
[18]                       F = 0.948          CNN (2)            [18]       6
[19]                       Accuracy = 0.856   SDAE (5)           [19]       23
[20]                       F = 0.750          NCC (4)            [20]       9

(1) LSTM = Long Short-Term Memory. (2) CNN = Convolutional Neural Network. (3) DRNN = Deep Recurrent Neural Network. (4) NCC = Nearest Class Centroid. (5) SDAE = Stack of two De-noising Auto-encoders.

The achieved results for our generative model applied to the five-participant database are presented in four sub-sections. The first three sub-sections will study the number of layers of the stacked auto-encoders using the "cutting with a knife" movement. The last sub-section will use the parameters selected in the first sub-sections to validate the results when detecting the "eating with a spoon" movement. A comparison of the results achieved when training the stack of auto-encoders with and without our generative model is also added to the first sub-section.

5.1. Parameter Selection for a Single Layer Using the "Cutting" Movement

The first result set has been generated by training a single auto-encoder with the 51 generated samples for the "cutting" movement with different numbers of hidden units. A leave-one-out approach has been used to compute the results. Figures 4-7 show the obtained results for 100, 75, 50, and 20 hidden units in the encode and decode functions of the auto-encoder (the input window of sensed data contains 200 samples, or 2 s of data). The threshold value for the Pearson correlation coefficient used to classify a validation sample as a cutting or non-cutting movement is used as the independent parameter in each figure. For small values of the threshold, all cutting movements in the validation sequence are detected and the achieved recall value is optimal. However, some of the non-cutting movements are also classified as cutting movements and the precision of the algorithm worsens. The F score is computed for each value of the independent variable in order to compare each configuration. The best results are captured in Table 3. The results show that a single auto-encoder is able to achieve an F score of 1 when 100 hidden units are used. The performance slightly decreases if the number of hidden units is decreased. However, even for the case of using 20 hidden units, the F score achieved is 0.947, which is similar to the best value in Table 2 for the research study in [18]. The threshold correlation coefficient has to be decreased when the number of hidden units is small (under 50) in order to achieve optimal results. An explanation for this is that the similarity of the reconstructed sample to the sensed data segment decreases as the degrees of freedom (the number of hidden units) in the algorithm decrease.
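The threshold sweep behind Figures 4-7 and Table 3 below can be sketched as follows (illustrative code; `scores_pos` and `scores_neg` are assumed to be the Pearson correlations returned by `detect` for windows of the target movement and of other movements, respectively):

```python
import numpy as np

def f_score_curve(scores_pos, scores_neg, thresholds):
    """Compute precision, recall, and F (Equation (7)) for each r threshold."""
    scores_pos = np.asarray(scores_pos)
    scores_neg = np.asarray(scores_neg)
    curve = []
    Tp = len(scores_pos)                 # total number of target movements
    for r in thresholds:
        tp = np.sum(scores_pos >= r)     # correctly detected movements
        fp = np.sum(scores_neg >= r)     # false positives
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / Tp
        f = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        curve.append((r, precision, recall, f))
    return curve
```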

Table 3. Optimal values for the F score for each configuration.

Values for the Optimal F   100   75      50      20
r threshold                0.8   0.8     0.81    0.77
F                          1     0.988   0.974   0.973
Recall                     1     1       1       1
Precision                  1     0.976   0.949   0.947

Figure 4. Precision, recall, and F score for 1 auto-encoder and 100 hidden units.

Figure 5. Precision, recall, and F score for 1 auto-encoder and 75 hidden units.

Figure 6. Precision, recall, and F score for 1 auto-encoder and 50 hidden units.

Figure 7. Precision, recall, and F score for 1 auto-encoder and 20 hidden units.

Figure 8 captures the results achieved for the F score when moving the r threshold in the case of using and not using our generative model when training the auto-encoder with one hidden layer of 100 units. The use of the generative model provides a better description and characterization of the movement to be detected and provides better results compared to the use of the sensed samples without the generative model.
Figure 8. Results using the generative model compared to those achieved without the generative model for one layer of 100 hidden units in the auto-encoder.

5.2. Parameter Selection for Two Layers Using the "Cutting" Movement

Adding a second layer to the stack of auto-encoders increases the degrees of freedom, and therefore the required size of the training data sample is also increased in order to avoid overfitting problems. Our generative model is able to generate as many different training samples at different speeds of execution as needed. However, in order to compare results with those presented in the previous sub-section, the same 51 training samples have been used.

Optimal results were achieved for 100 hidden units in the case of a single auto-encoder. We have computed the results for 100-100, 100-50, 75-50, 50-50, and 50-20 hidden units (in the first and second layers, respectively). Figure 9 captures the results for the F score considering the threshold for the Pearson correlation coefficient used in order to classify each validation sample into a cutting or non-cutting movement. The results for the optimal values of the F score for each configuration are captured in Table 4. In this case, optimal values are achieved by the 100-50 and the 75-50 configurations for the number of hidden units. The results for the 100-100 configuration are slightly worse (probably due to overfitting in the model). The 50-50 configuration is able to provide a value of F = 0.975, which is higher than the previous studies in Table 2.
Figure 9. F scores for 2 stacked auto-encoders and 100-100, 100-50, 75-50, 50-50, and 50-20 hidden units.
Table 4. Optimal values for the F score for each configuration.

Values for the Optimal F   100-100   100-50   75-50   50-50   50-20
r threshold                0.77      0.79     0.79    0.74    0.76
F                          0.987     1        1       0.975   0.928
Recall                     1         1        1       1       0.889
Precision                  0.974     1        1       0.951   0.97

5.3. Parameter Selection for Three Layers Using the "Cutting" Movement

We have added a third layer of auto-encoders to our model in order to assess the changes in the obtained precision and recall. A third layer of auto-encoders has not always provided better results [19]. The increase in the number of degrees of freedom tends to require a bigger number of training samples to avoid overfitting. We have used the same generated data sample in order to compare results with those presented in previous sections. The achieved results for the F score for different configurations in terms of the number of hidden units used in the stack of auto-encoders are captured in Figure 10. The optimal values for the F score for each configuration are captured in Table 5. Optimal values are achieved for the 200-150-100 and 100-75-50 hidden units. However, the results are worse than in the two-layer architecture.
Figure 10. F scores for three stacked auto-encoders and 200-150-100, 100-75-50, 75-75-50, 100-50-50, and 100-50-20 hidden units.

Table 5. Optimal values for the F score for each configuration.

Values for the Optimal F   200-150-100   100-75-50   75-75-50   100-50-50
r threshold                0.76          0.76        0.75       0.75
F                          0.959         0.959       0.932      0.933
Recall                     0.972         0.972       0.944      0.972
Precision                  0.946         0.946       0.919      0.897

5.4. Detecting the "Eating with a Spoon" Movement

In this section, we are going to train a second stack of auto-encoders with the best parameters obtained in the previous sections in order to detect the "eating with a spoon" movement. Using a leave-one-out validation approach (using four users in order to train the stack of auto-encoders and one for validation), we first generate 142 equally spaced samples (with a constant increment in the speed of execution) using the generative model proposed in this paper. In order to pre-select and align

candidate segments of acceleration data containing the "eating with a spoon" movement, the standard deviation on a 100-ms window centered at each point is first calculated. Candidate segments should have an "up" followed by a "down" gesture within a certain time span (taking into account the maximum and minimum speeds for the execution of the movement). "Up" and "down" movements are detected by analyzing the maximum values of the standard deviation data. The maximum value for the speed of execution of the movement (in our recorded data) corresponds to a time span of 480 ms. The minimum value for the speed of execution of the movement corresponds to a time span of 1900 ms. Pre-selected segments are aligned to the mean value between the "up" and "down" movements.

Table 6 captures the recall, precision, and F score for the best configurations for one and two layers of auto-encoders, as analyzed in the previous sub-sections. The confusion matrix when using the best configuration is presented in Table 7. There are four gestures of eating out of 88 that are classified as "other". There are three segments of other movements (performed while "walking" (2) and "free arm movements" (1)) that are classified as "eating with spoon". The results outperform those in previous studies, as captured in Table 2, by at least 1.8 percentage points.

Table 6. Values for the optimal parameters selected from previous sections.

Values for the Optimal F   100     100-50
r threshold                0.82    0.79
F                          0.960   0.966
Recall                     0.954   0.966
Precision                  0.966   0.966

Table 7. Classification results for two layers of auto-encoders.

                       Classified as:
Ground Truth           Cutting with a Knife   Eating with Spoon   Other
Cutting with a knife   112                    0                   0
Eating with spoon      0                      84                  4
Other                  0                      3                   208

6. Generalization of Results

In order to assess the generalization of results to different people using different hardware when performing a different experiment in which the "cutting with a knife" movement is also included, we have applied the auto-encoders trained with the data generated in the previous section to the database in [20]. This human movement database recorded arm movements of two people performing a continuous sequence of eight gestures of daily living (including the "cutting with a knife" movement). The authors also recorded typical arm movements performed while playing tennis. In addition, they also included periods with no specific activity as the NULL class. The activities were performed in succession with a brief break between each activity. Each activity lasted between two and eight seconds and was repeated 26 times by each participant, resulting in a total dataset of about 70 min. Arm movements were tracked using three custom Inertial Measurement Units placed on top of each participant's right hand, as well as on the outer side of the right lower and upper arm. Each Inertial Measurement Unit comprised a tri-axial accelerometer and a two-axis gyroscope recording timestamped motion data at a joint sampling rate of 32 Hz [20].

We have only used the data from the accelerometer placed on top of the dominant hand since it is located in a similar position as the one we used to train the stack of auto-encoders in the previous section. The data were resampled at 100 Hz and windowed using 2-s segments of sensor data aligned to pre-selected points (local maximum values with an estimated execution speed of the captured movement in the expected range for the "cutting with a knife" movement). The best performing configurations for the number of hidden units for 1, 2, and 3 stacked auto-encoders have been used. Figure 11 shows the recall, precision, and F score for the case of a single auto-encoder with 100 hidden units. The best values obtained are F = 0.77 (recall = 0.731


Figure 11 shows the recall, precision, and F score for the case of a single auto-encoder with 100 hidden units. The best values obtained are F = 0.77 (recall = 0.731 and precision = 0.816). The best values in [20] are F = 0.75, and only when using the three sensor devices and for the person-dependent case (training and validating with the data of the same person). Using only the hand sensor in a person-dependent validation, the authors were only able to achieve F = 0.559. Validating the detection algorithm in a person-independent way (training with the data of one user and validating with the sensed data from the second user) provided F = 0.486 for the three sensors and F = 0.289 for the single hand sensor.

In order to compare the results for the three configurations using one, two, and three layers of auto-encoders, the Area under the Curve (AuC) for the F score has been computed. This value provides an estimate of how sensitive the results are when the optimal threshold for the Pearson correlation coefficient is not selected. In the case of a single auto-encoder, the AuC = 0.217 (a sketch of this computation is given below).

The authors in [20] present the confusion matrix for the person-independent evaluation and all the data for one of the participants. For the “cutting with a knife” gesture, the achieved recall was 0.615 and the precision 0.781. In order to compare results, the classification results for a similar recall of 0.615 when using a single auto-encoder trained with our generative model are shown in Table 8. The precision achieved using our approach is 0.865. The number of correctly classified cutting gestures is also captured.

Table 9 shows the classification results for the optimal F value. In this case, the optimal F score is 0.77, which is able to detect 73.1% of all the cutting gestures and provides a precision of 0.816. The majority of false positives are samples containing the stirring gesture, which is similar to the cutting gesture when the execution speeds are similar. Only 4.4% of false positives come from other gestures in the database.
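The AuC values quoted in this section can be obtained by integrating the F-score curve over the tested threshold range; a minimal computation follows (the F values below are placeholders, not the measured curve of Figure 11):

    import numpy as np

    # F scores at each tested Pearson-correlation threshold (placeholders).
    r_thresholds = np.array([0.5, 0.6, 0.7, 0.8, 0.9])
    f_scores = np.array([0.45, 0.55, 0.65, 0.77, 0.40])

    # Area under the F-score curve over the threshold range; a larger AuC
    # means the detector tolerates a sub-optimal threshold choice better.
    auc = np.trapz(f_scores, r_thresholds)  # trapezoidal integration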

Figure 11. Precision, recall, and F score for 1 auto-encoder and 100 hidden units (plotted against the Pearson correlation threshold r, 0.5–0.9).

Table 8. Classification results for one layer of auto-encoders and recall 0.615.

    Classified as    Cutting (Ground Truth)    Other (Ground Truth)    Precision
    Cutting          32                        5                       0.865
    Other            20                        572
    Recall           0.615

Table 9. Values for the optimal F (F = 0.77, recall = 0.731, precision = 0.816).

    All Positives          Cutting    Stirring    Other
    Detected as cutting    0.816      0.14        0.044


Figure 12 shows the recall, precision, and F score for the case of two stacked auto-encoders with 100-50 hidden units. The best values obtained are F = 0.77 (recall = 0.692 and precision = 0.867). The values are similar to those obtained for the case of a single auto-encoder. However, in the case of two stacked auto-encoders, the AuC improves to 0.237. By using two auto-encoders, the results become less dependent on selecting the optimal threshold, improving the robustness of the detection system and widening the range of near-optimal parameter values.

The classification results for a recall of 0.615 when using two stacked auto-encoders trained with our generative model are shown in Table 10. The precision achieved using our approach improves to 0.889, compared to 0.865 for the case of a single auto-encoder (which is congruent with the increase in the AuC value). The number of false positives decreases. The number of correctly classified cutting gestures is also captured.

Table 11 shows the classification results for the optimal F value. In this case, the optimal F score is 0.77, which is able to detect 69.2% of all the cutting gestures and provides a precision of 0.867. The majority of false positives are samples containing the stirring gesture, which is similar to the cutting gesture when the execution speeds are similar. Only 3.1% of false positives come from other gestures in the database.
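For concreteness, the following sketch shows how a 100-50 stack of auto-encoders of the kind evaluated here could be trained greedily, layer by layer. Keras is assumed, and the activations, optimizer, epoch count, and the correlation-based decision rule reflect our reading of the method rather than a confirmed implementation:

    import numpy as np
    from tensorflow.keras import layers, models

    def autoencoder(input_dim, hidden_units):
        # One auto-encoder: input -> hidden code -> reconstruction.
        inp = layers.Input(shape=(input_dim,))
        code = layers.Dense(hidden_units, activation='sigmoid')(inp)
        out = layers.Dense(input_dim)(code)
        ae = models.Model(inp, out)
        ae.compile(optimizer='adam', loss='mse')
        return ae, models.Model(inp, code)

    # Greedy layer-wise training of the 100-50 stack on 2-s windows
    # (200 samples at 100 Hz); X stands for the generated training windows.
    X = np.random.rand(1024, 200).astype('float32')  # placeholder data
    ae1, enc1 = autoencoder(200, 100)
    ae1.fit(X, X, epochs=50, batch_size=64, verbose=0)
    H1 = enc1.predict(X)                             # first-layer codes
    ae2, enc2 = autoencoder(100, 50)
    ae2.fit(H1, H1, epochs=50, batch_size=64, verbose=0)

    # Detection (our reading): threshold the Pearson correlation between
    # each window and its reconstruction; shown for the single-layer case
    # with the optimal r threshold of Table 6.
    recon = ae1.predict(X)
    r = np.array([np.corrcoef(w, w_hat)[0, 1] for w, w_hat in zip(X, recon)])
    detected = r > 0.82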

Figure 12. Precision, recall, and F score for two stacked auto-encoders and 100-50 hidden units (plotted against the Pearson correlation threshold r, 0.5–0.9).

Table 10. Classification results for two layers of auto-encoders and recall 0.615.

    Classified as    Cutting (Ground Truth)    Other (Ground Truth)    Precision
    Cutting          32                        4                       0.889
    Other            20                        572
    Recall           0.615

Table 11. Values for the optimal F (F = 0.77, recall = 0.692, precision = 0.867).

    All Positives          Cutting    Stirring    Other
    Detected as cutting    0.867      0.102       0.031

Figure 13 shows the recall, precision, and F score for the case of three stacked auto-encoders with 100-75-50 hidden units. The best values obtained are F = 0.77 (recall = 0.731 and precision = 0.82). The values are similar to those obtained for the case of a single or two stacked auto-encoders. However, in the case of three stacked auto-encoders, the AuC worsens to 0.22, compared to 0.237 when using two stacked auto-encoders.


The classification results for a recall of 0.615 when using three stacked auto-encoders trained with our generative model are shown in Table 12. The precision achieved using our approach is 0.889 (the same as when using two auto-encoders). The number of correctly classified cutting gestures is also captured.

Table 13 shows the classification results for the optimal F value. In this case, the optimal F score is 0.77, which is able to detect 73.1% of all cutting gestures and provides a precision of 0.82. The majority of false positives are samples containing the stirring gesture, which is similar to the cutting gesture when the execution speeds are similar. Only 2.2% of false positives come from other gestures in the database.

Figure 13. Precision, recall, and F score for three stacked auto-encoders and 100-75-50 hidden units (plotted against the Pearson correlation threshold r, 0.5–0.9).

Table 12. Classification results for three layers of auto-encoders and recall 0.615.

    Classified as    Cutting (Ground Truth)    Other (Ground Truth)    Precision
    Cutting          32                        4                       0.889
    Other            20                        572
    Recall           0.615

Table 13. Values for the optimal F (F = 0.77, recall = 0.731, precision = 0.82).

    All Positives          Cutting    Stirring    Other
    Detected as cutting    0.82       0.158       0.022

7. Conclusions

Deep learning techniques applied to human sensed data from wearable sensors have shown better results in previous research studies than classical machine learning techniques for detecting specific human movements and activities. However, when using deep learning techniques, the degrees of freedom to be trained in the system tend to increase, and so does the number of training samples required to avoid overfitting problems.

This paper proposes a movement detection approach using a deep learning method based on stacked auto-encoders applied to one of the axes of a single accelerometer placed on the wrist of the dominant hand.

In order to simplify the data gathering for the training phase, a novel generative model based on the time elasticity of human movements when performed at different speeds has been defined and validated. A simple pre-detection algorithm based on the statistical properties of some points of interest for each activity is added in order to pre-select candidate time windows, and a time-offset cancellation mechanism (sequence alignment) is used by centering the selected data window on the detected candidate points of interest.

The results show that it is possible to achieve optimal results for detecting particular movements when using a database of different users following the same instructions and using the same sensors and procedures. Using the leave-one-out validation method on a database of five users performing six different movements, we achieve F = 1 when detecting the “cutting with a knife” movement. The results also generalize to a second database obtained from two different people using different hardware and a different recording procedure, executing other movements (cutting being one of them). The optimal results are obtained when using a two-layer architecture, with a value of F = 0.77, which improves the best case in the original study using the same database (F = 0.75). In the future, we plan to apply the proposed generative model to other human movements to confirm that the results generalize to different human movements executed at different speeds.

Acknowledgments: The research leading to these results received funding from the “HERMES-SMART DRIVER” project TIN2013-46801-C4-2-R funded by the Spanish MINECO, and from the “ANALYTICS USING SENSOR DATA FOR FLATCITY” project TIN2016-77158-C4-1-R, also funded by the Spanish MINECO.

Author Contributions: Mario Munoz-Organero was responsible for the design, implementation, validation of results, and writing of the paper. Ramona Ruiz-Blazquez contributed to the data gathering process and system design.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Wang, A.; Chen, G.; Yang, J.; Zhao, S.; Chang, C.H. A Comparative Study on Human Activity Recognition Using Inertial Sensors in a Smartphone. IEEE Sens. J. 2016, 16, 4566–4578. [CrossRef]
2. Hassanalieragh, M.; Page, A.; Soyata, T.; Sharma, G.; Aktas, M.; Mateos, G.; Kantarci, B.; Andreescu, S. Health Monitoring and Management Using Internet-of-Things (IoT) Sensing with Cloud-based Processing: Opportunities and Challenges. In Proceedings of the 2015 IEEE International Conference on Services Computing (SCC), New York, NY, USA, 27 June–2 July 2015.
3. Avci, A.; Bosch, S.; Marin-Perianu, M.; Marin-Perianu, R.; Havinga, P. Activity recognition using inertial sensing for healthcare, wellbeing and sports applications: A survey. In Proceedings of the 2010 23rd International Conference on Architecture of Computing Systems (ARCS), Hannover, Germany, 22–25 February 2010.
4. Lara, O.D.; Labrador, M.A. A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 2013, 15, 1192–1209. [CrossRef]
5. Munoz-Organero, M.; Lotfi, A. Human movement recognition based on the stochastic characterisation of acceleration data. Sensors 2016, 16, 1464. [CrossRef] [PubMed]
6. Yang, J.B.; Nguyen, M.N.; San, P.P.; Li, X.L.; Krishnaswamy, S. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 25 July–1 August 2015; pp. 25–31.
7. Inoue, M.; Inoue, S.; Nishida, T. Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput. arXiv 2016, arXiv:1611.03607.
8. Foerster, F.; Smeja, M.; Fahrenberg, J. Detection of posture and motion by accelerometry: A validation study in ambulatory monitoring. Comput. Hum. Behav. 1999, 15, 571–583. [CrossRef]
9. Poppe, R. A survey on vision-based human action recognition. Image Vis. Comput. 2010, 28, 976–990. [CrossRef]
10. Blasco, R.; Marco, A.; Casas, R.; Cirujano, D.; Picking, R. A smart kitchen for ambient assisted living. Sensors 2014, 14, 1629–1653. [CrossRef] [PubMed]
11. Varkey, J.P.; Pompili, D.; Walls, T.A. Human Motion Recognition Using a Wireless Sensor-based Wearable System. Pers. Ubiquitous Comput. 2012, 16, 897–910. [CrossRef]
12. Zhang, M.; Sawchuk, A.A. A feature selection-based framework for human activity recognition using wearable multimodal sensors. In Proceedings of the 6th International Conference on Body Area Networks, Beijing, China, 7–10 November 2011.
13. Long, X.; Yin, B.; Aarts, R.M. Single-accelerometer-based daily physical activity classification. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3–6 September 2009.
14. Längkvist, M.; Karlsson, L.; Loutfi, A. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognit. Lett. 2014, 42, 11–24. [CrossRef]
15. Gjoreski, M.; Gjoreski, H.; Luštrek, M.; Gams, M. How accurately can your wrist device recognize daily activities and detect falls? Sensors 2016, 16, 800. [CrossRef] [PubMed]
16. Plötz, T.; Hammerla, N.Y.; Olivier, P. Feature learning for activity recognition in ubiquitous computing. In Proceedings of the International Joint Conference on Artificial Intelligence, Barcelona, Spain, 19–22 July 2011; Volume 22, p. 1729.
17. Gjoreski, H.; Bizjak, J.; Gjoreski, M.; Gams, M. Comparing Deep and Classical Machine Learning Methods for Human Activity Recognition Using Wrist Accelerometer. Available online: http://www.cc.gatech.edu/~alanwags/DLAI2016/2.%20(Gjoreski+)%20Comparing%20Deep%20and%20Classical%20Machine%20Learning%20Methods%20for%20Human%20Activity%20Recognition%20using%20Wrist%20Accelerometer.pdf (accessed on 7 February 2017).
18. Ronao, C.A.; Cho, S.B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016, 59, 235–244. [CrossRef]
19. Wang, A.; Chen, G.; Shang, C.; Zhang, M.; Liu, L. Human Activity Recognition in a Smart Home Environment with Stacked Denoising Autoencoders. In Proceedings of the International Conference on Web-Age Information Management, Nanchang, China, 3–5 June 2016.
20. Bulling, A.; Blanke, U.; Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 2014, 46, 33. [CrossRef]
21. Hammerla, N.Y.; Halloran, S.; Ploetz, T. Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables. arXiv 2016, arXiv:1604.08880.
22. Chavarriaga, R.; Sagha, H.; Calatroni, A.; Digumarti, S.T.; Troster, G.; Millan, J.R.; Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 2013, 34, 2033–2042. [CrossRef]
23. Reiss, A.; Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the 2012 16th International Symposium on Wearable Computers (ISWC), Newcastle, UK, 18–22 June 2012.
24. Bachlin, M.; Roggen, D.; Troster, G.; Plotnik, M.; Inbar, N.; Meidan, I.; Herman, T.; Brozgol, M.; Shaviv, E.; Giladi, N.; et al. Potentials of enhanced context awareness in wearable assistants for Parkinson’s disease patients with the freezing of gait syndrome. In Proceedings of the 2009 13th International Symposium on Wearable Computers (ISWC), Linz, Austria, 4–7 September 2009.
25. Ordonez, F.J.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115. [CrossRef] [PubMed]
26. Kawaguchi, N.; Ogawa, N.; Iwasaki, Y.; Kaji, K.; Terada, T.; Murao, K.; Inoue, S.; Kawahara, Y.; Sumi, Y.; Nishio, N. HASC Challenge: Gathering large scale human activity corpus for the real-world activity understandings. In Proceedings of the 2nd Augmented Human International Conference, Tokyo, Japan, 13 March 2011.

© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).