Robust Keystroke Biometric Anomaly Detection
John V. Monaco
U.S. Army Research Laboratory, Aberdeen, MD

arXiv:1606.09075v1 [cs.CR] 29 Jun 2016

[email protected]

Abstract—This paper describes the fifteen keystroke biometric anomaly detection systems submitted by the U.S. Army Research Laboratory (ARL) to the Keystroke Biometrics Ongoing Competition (KBOC), an official competition of the IEEE Eighth International Conference on Biometrics: Theory, Applications, and Systems (BTAS). Submissions to the KBOC were evaluated on an unlabeled test dataset of 300 users typing their first and last name. For each user in the test dataset, there were 4 template samples available for training and 20 query samples for system evaluation. In some cases, query sample keystrokes only approximately matched the template due to the presence of modifier keys. The number of genuine and impostor samples for each user remained unknown to competition participants. The fifteen systems submitted by ARL ranged from relatively complex models with hundreds of parameters to a very simple distance measure with no free parameters. In addition, a keystroke alignment preprocessing algorithm was developed to establish a semantic correspondence between keystrokes of the template sample and an approximately-matching query sample. The method is robust in the sense that the query keystroke sequence need only approximately match the template sequence. All fifteen systems submitted by ARL achieved a lower EER than any other submission, and Manhattan distance achieved the lowest EER of 5.32%. Source code is available at https://github.com/vmonaco/kboc.

Index Terms—Keystroke dynamics, anomaly detection, cybersecurity

I. INTRODUCTION

The Keystroke Biometrics Ongoing Competition (KBOC) presented a keystroke anomaly detection challenge in the form of a biometric competition¹. Participants were tasked with designing keystroke anomaly detection algorithms that achieve a low equal error rate on a set of unlabeled query samples, given only several template samples for each user in the dataset. Unlike previous keystroke biometric competitions [12], the KBOC utilized a platform that allows ongoing participation [13].

This paper describes the fifteen anomaly detection systems submitted to the KBOC by the U.S. Army Research Laboratory (ARL). The fifteen systems range from relatively complex models with hundreds of parameters to a very simple distance measure with no free parameters. A novel preprocessing algorithm was developed to establish a semantic correspondence between keystrokes of the template sample and an approximately-matching query sample. In addition, an outlier-tolerant score normalization technique was employed. The systems are robust in the sense that the query keystroke sequence need only approximately match the template sequence.

The rest of this paper is organized as follows. Section II contains an overview of the KBOC dataset and evaluation metrics. Section III describes the fifteen anomaly detection systems. Section IV describes the validation procedure and competition results. Sections V and VI contain a discussion and conclusions, respectively.

¹ https://sites.google.com/site/btas16kboc/home

II. BACKGROUND

A. Dataset

The KBOC dataset contains keystroke samples from users typing their first and last name. Each sample contains a sequence of timestamped key-press and key-release events. Each event is a 3-tuple containing the key scancode, the action (press or release), and the millisecond-precision time interval since the previous event (beginning with 0 for the first event). The samples were required to be a case-insensitive match to the template user's name. Although mistakes were not permitted, modifier keys (such as Shift) were recorded, so the sequences for a given user could vary in length.

Two separate datasets were made available to competition participants: a development set for the purpose of system development, and a test set for the purpose of system evaluation. The development set contains 10 users with 24 labeled samples for each user, including 14 genuine and 10 impostor samples. The test set contains 300 users, each with 4 labeled genuine samples and 20 unlabeled query samples. The ratio of genuine to impostor samples in the test set remained hidden from competition participants.

The time between data collection sessions for each user ranged from less than a day to several months, with an average interval of 1 month. This captured a wide range of variability in typing behavior. Additionally, since the input sequence (first and last name) is unique to each user, it is not straightforward for an anomaly detector to utilize negative data (i.e., the labeled templates from other users). This is in contrast to other benchmark keystroke datasets, such as [7], in which every user typed the same input sequence.

B. System evaluation

Keystroke biometric authentication systems are often evaluated by their equal error rate (EER), the point on the receiver operating characteristic (ROC) curve at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The EER can be derived by varying either a global threshold or user-dependent thresholds. To compute the global EER, the ROC curve is derived by varying a global threshold. In this case, the false positives and false negatives from all users are tallied to compute the FAR and FRR at each threshold value.
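As an illustration (a minimal sketch, not the competition's official scoring code), the global EER can be computed from a pooled set of scores and labels as follows:

```python
import numpy as np

def global_eer(scores, labels):
    """Compute the global EER by sweeping a single threshold over
    pooled scores. labels: 1 for genuine, 0 for impostor."""
    thresholds = np.unique(scores)
    genuine = scores[labels == 1]
    impostor = scores[labels == 0]
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # The EER is approximated at the threshold where |FAR - FRR| is smallest
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Example usage with synthetic scores
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1, 1, 100), rng.normal(-1, 1, 100)])
labels = np.concatenate([np.ones(100), np.zeros(100)])
print(global_eer(scores, labels))
```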


The user EER is the EER obtained by deriving an ROC curve for each user. As a user-dependent threshold is varied, the FAR and FRR are computed from the false positive and false negative classifications of a single user. Since this produces an EER for each user in the dataset, system performance is typically characterized by the mean and standard deviation (SD) of the distribution of user EERs.

Anomaly detectors submitted to the KBOC were evaluated by the global EER derived from the scores assigned to the 300 × 20 = 6000 query samples. In some cases, the user EER is also reported. In the rest of this paper, EER refers to the global EER unless otherwise noted.

III. METHODS

The anomaly detection systems in this work utilize a processing pipeline that consists of three main steps: feature extraction, anomaly detection, and score normalization. During feature extraction, a correspondence is established between keystrokes of the template and the query, which may differ slightly as described below, and the query sample keystrokes are aligned to match the template. Keystroke timing features are then extracted from the corresponding keystrokes of the query sample. Following this, an anomaly detector trained on the template assigns a similarity score to the query sample. Finally, the score is normalized using one of several normalization methods.

A. Feature extraction

1) Preprocessing: While the raw data samples each contain a sequence of timestamped key-press and key-release events, the methods in this work utilize a sequence of keystroke events. Each keystroke event contains the key name, press time, and duration (the time interval from press to release). The data is first preprocessed to convert the sequence of key-press and key-release events to a sequence of keystroke events. The keystroke events in each sample are ordered by press time, as they would appear on screen to the user as each key is pressed. Note that the keystroke sequence contains one event per key pressed, while the raw key-press/key-release sequence contains two events per key pressed.

2) Keystroke alignment: The KBOC data collection procedure required template samples to be a case-insensitive match to the user's first and last name. Modifier keys were also recorded, and the template samples for each user could vary in length (e.g., if one sample began with the Shift key and the others did not). The presence of differences between keystroke sequences is verified by computing the Damerau–Levenshtein distance between template and query samples, using the key name as the discrete symbol. Over every possible combination of template and query sample for each user (a total of 300 × 20 × 4 = 24000 distances), the average Damerau–Levenshtein distance is 0.074 ± 0.323, with a maximum distance of 7. An examination of the query samples reveals insertions, deletions, and transpositions with respect to the template due to the presence of modifier keys.

To compare slightly-differing keystroke sequences, a sequence-alignment algorithm was developed to establish a correspondence between keystrokes of two samples.

This allows a differently-typed query sample to be directly compared to the respective template using an anomaly detector that operates on feature vectors of fixed length, such as Manhattan distance.

Let the target sequence be the shortest keystroke sequence among all the template samples for a particular user. In case of ties, the target sequence may be chosen randomly among the shortest sequences. Let the given sequence be a keystroke sequence from another sample (either another template sample or a query sample). The correspondence should ensure an element-wise semantic similarity between the given sequence and the target sequence. In other words, considering the keystrokes of the target sequence, the corresponding keystrokes in the given sequence should have been pressed with similar intention by the user.

Keystrokes from the given sequence are aligned to the target sequence as follows. For each key in the target sequence, find the same key in the given sequence with the closest position. If the key does not exist in the given sequence, then use as a substitute the key in the corresponding position. If a key appears in the given sequence, but not the target sequence, it is simply ignored. The keystrokes in the given sequence are then reordered by their mapping to the target sequence. A minimal sketch of this procedure is shown below.

To illustrate the utility of keystroke alignment, consider an alternative method in which the keystrokes of the given sequence are simply truncated to the same length as the target sequence. This approach is shown in Figure 1a. Unless the keystroke sequences match exactly, semantically different features will be compared by the anomaly detector. Figure 1b shows a more sensible alignment between keystrokes using the proposed approach. In this example, the alignment ensures a semantic similarity between the keystrokes of each sequence. For example, the 6th keystroke of the given sequence is mapped to the 4th keystroke of the target sequence. In both sequences, this key is likely what the user intended as the 2nd "N" in "VINNIE". Note that while this algorithm can account for case-insensitive differences, such as in the example in Figure 1, differences in the KBOC dataset are only case-sensitive, i.e., due to modifier keys.

3) Timing features: Features are extracted from the template and query samples, which are reordered to match the target sequence as described above. While a variety of keystroke timings can be extracted, this work uses the press-press latencies and key-hold durations. The press-press latency is the time interval between two successive key presses. The key-hold duration is the time interval from the press to the release of each key. If the target sequence contains 11 keystrokes, then the feature vectors for every given sequence aligned to the target will contain 21 features (10 latencies and 11 durations).

The press-press latency is typically positive, since keystrokes are normally ordered by their press time (as they would appear on screen to the user). Since the query sequences have been reordered to match the target sequence, the press-press latency can be negative when, e.g., there is a transposition between the query and target sequence. Using Figure 1b as an example, the keystrokes in the given sequence are reordered to the positions of the corresponding keys in the target sequence. In the given sequence, the press-press latency from the 2nd "N" to the 2nd "I" will be negative since the 2nd "I" is pressed before the 2nd "N". Reordering the keystrokes of the given sequence in this way, such that the press-press latency becomes negative for transposed keys, will not have an adverse effect on anomaly detection as long as this behavior is also captured in the template samples. Consider a typist who presses the keys "T" and "H" in very quick succession. The press-press latency between these keys will be close to 0, and if normal behavior is to transpose these keys 50% of the time, the distribution of press-press latencies will be centered on 0. If this behavior occurs in the template samples, then it does not matter whether the target sequence contains ["T", "H"] or ["H", "T"], since the correspondence to the target sequence preserves the semantic meaning of that feature (the press-press latency between "T" and "H", which are transposed 50% of the time).
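The following is a minimal sketch of the key-match alignment; it assumes each given keystroke is matched at most once, consistent with the example in Figure 1b, though the submitted implementation may handle ties and edge cases differently:

```python
def align(target, given):
    """Align keystrokes of `given` to `target` by key name.
    Returns one index into `given` per target position."""
    aligned, used = [], set()
    for i, key in enumerate(target):
        # Unused positions in the given sequence holding the same key
        matches = [j for j, k in enumerate(given) if k == key and j not in used]
        if matches:
            # Choose the occurrence closest to the target position
            j = min(matches, key=lambda j: abs(j - i))
        else:
            # Key absent from the given sequence: substitute by position
            j = min(i, len(given) - 1)
        used.add(j)
        aligned.append(j)
    return aligned

target = ['V', 'I', 'N', 'N', 'I', 'E']
given = ['V', 'Shift', 'I', 'N', 'I', 'N', 'E']
print(align(target, given))  # [0, 2, 3, 5, 4, 6]; 'Shift' (index 1) is ignored
```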

[Figure 1 appears here, depicting two alignments between the target sequence "V I N N I E" and the given sequence "V Shift I N I N E".]

Figure 1: Keystroke alignment between differing sequences. (a) Truncate the given sequence to the length of the target. (b) Align the given sequence using key names. This example shows an insertion ("Shift" key) and transposition ("I" and "N" keys) in the given sequence. The given sequence is aligned to the target sequence in order to use an anomaly detector that operates on fixed-length feature vectors. The alignment ensures a semantic similarity between the keystrokes of each sequence, while truncation ignores the key names.

4) Feature normalization: Finally, the latency and duration features are normalized to the range [0, 1]. The $i$-th normalized press-press latency, $\dot{p}_i$, is given by

$$\dot{p}_i = \max\left(0, \min\left(1, \frac{p_i - \lfloor p \rfloor}{\lceil p \rceil - \lfloor p \rfloor}\right)\right) \tag{1}$$

where $p_i$ is the raw feature and $\lfloor p \rfloor$ and $\lceil p \rceil$ provide the lower and upper bounds for normalization, respectively. Let $\mu_p$ be the mean latency and $\sigma_p$ be the latency standard deviation. The bounds are determined by

$$\lfloor p \rfloor = \mu_p - H_f \sigma_p, \qquad \lceil p \rceil = \mu_p + H_f \sigma_p \tag{2}$$

where $H_f$ is a free parameter. The results in this work were obtained using $H_f = 1$, i.e., normalizing features to within one standard deviation of the mean. The duration features are normalized similarly. Query samples are normalized using the lower and upper bounds determined from the template samples.
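A minimal sketch of equations (1) and (2), applied per feature (the helper name is hypothetical, not taken from the competition code):

```python
import numpy as np

def normalize_features(template, query, h_f=1.0):
    """Normalize features to [0, 1] per equations (1) and (2).
    template: (n_samples, n_features) template feature vectors.
    query: (m_samples, n_features), normalized with template bounds."""
    mu = template.mean(axis=0)
    sigma = template.std(axis=0)  # a small epsilon could guard constant features
    lower, upper = mu - h_f * sigma, mu + h_f * sigma
    def scale(x):
        return np.clip((x - lower) / (upper - lower), 0.0, 1.0)
    return scale(template), scale(query)
```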

B. Anomaly detection

A variety of anomaly detectors were evaluated. All of the anomaly detectors were implemented in Python and can be found in the source code repository, which contains a script to reproduce the results². Neural network models were implemented using the tensorflow library [1]. The seven anomaly detection models used by the fifteen systems are described in this section. Model parameters were selected from a small range that performed well on the development set.

² https://github.com/vmonaco/kboc

1) Autoencoder: The basic autoencoder is a neural network that aims to encode and then decode its input [2]. The network topology consists of equally-sized input and output layers separated by at least one hidden layer. Successive layers are fully connected through weight matrices, a bias vector, and a nonlinear function. It is common to use tied weights, i.e., the weight matrix connecting the hidden and output layers is simply the transpose of the weight matrix connecting the input and hidden layers. This limits the number of model parameters and acts as a kind of regularization.

Using feature vector $x$ as input, the hidden layer is calculated as

$$h = f(x) = \tanh(W x + b_h) \tag{3}$$

where $W$ is the weight matrix and $b_h$ is the hidden layer bias vector. Similarly, the output layer $y$ is given by

$$y = g(h) = \tanh(W^\top h + b_y) \tag{4}$$

where $b_y$ is the output layer bias vector. Model parameters for a 3-layer topology (1 hidden layer) consist of $\theta = \{W, b_h, b_y\}$. The only free parameter is the dimension of the hidden layer, $d_h$. Typically, the hidden layer should be smaller than the input layer to achieve a compressed representation and avoid learning the identity function. Parameters are determined by back-propagating gradients from a squared error loss function,

$$L(x, y) = \|x - y\|^2 \tag{5}$$

where $y$ is the reconstructed output. The objective function of the basic autoencoder is given by

$$J_{AE}(\theta) = \sum_{x \in D} L(x, g(f(x))) \tag{6}$$

where $D$ is the set of training examples. Bias parameters are initialized to 0, and weights are initialized from a random uniform distribution such that the scale of the gradients in each layer is roughly the same [5]. Models are trained for 5000 epochs using gradient descent with a learning rate of 0.5. During testing, the score of a query sample is given by the negative reconstruction error, $-L(x, y)$. The topology of system 1 consists of three hidden layers with dimensions 5, 4, and 3. System 7 uses a single hidden layer of dimension 5.
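To make the model concrete, here is a minimal NumPy sketch of the tied-weight autoencoder in equations (3)-(6). It is an illustration, not the submitted tensorflow implementation; details such as gradient averaging and the exact weight initialization are assumptions:

```python
import numpy as np

def train_autoencoder(X, d_h=5, lr=0.5, epochs=5000, seed=0):
    """Train a tied-weight tanh autoencoder by batch gradient descent.
    X: (n_samples, d_x) matrix of normalized template feature vectors."""
    n, d_x = X.shape
    rng = np.random.default_rng(seed)
    scale = np.sqrt(6.0 / (d_x + d_h))        # Xavier-style uniform init [5]
    W = rng.uniform(-scale, scale, (d_h, d_x))
    b_h, b_y = np.zeros(d_h), np.zeros(d_x)
    for _ in range(epochs):
        gW = np.zeros_like(W); gbh = np.zeros_like(b_h); gby = np.zeros_like(b_y)
        for x in X:
            h = np.tanh(W @ x + b_h)          # eq. (3)
            y = np.tanh(W.T @ h + b_y)        # eq. (4)
            d2 = 2 * (y - x) * (1 - y**2)     # backprop through output layer
            d1 = (W @ d2) * (1 - h**2)        # backprop through hidden layer
            gW += np.outer(h, d2) + np.outer(d1, x)  # tied weights: two terms
            gbh += d1; gby += d2
        W -= lr * gW / n; b_h -= lr * gbh / n; b_y -= lr * gby / n
    return W, b_h, b_y

def score(x, W, b_h, b_y):
    """Query score: negative reconstruction error, -L(x, y)."""
    y = np.tanh(W.T @ np.tanh(W @ x + b_h) + b_y)
    return -np.sum((x - y) ** 2)
```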


2) Contractive autoencoder: The contractive autoencoder is an autoencoder that uses the Frobenius norm of the hidden layer Jacobian as a regularization term [16]. This enables a sparse representation of the input, whereby the dimension of the hidden layer is much larger than the input and output layer dimensions. The contractive autoencoder is closely related to the denoising autoencoder, in which the goal is to reconstruct an input vector that has been corrupted by noise [18]. The contractive autoencoder in this work uses a sigmoid activation function. The hidden layer is given by

$$h = f(x) = \sigma(W x + b_h) \tag{7}$$

where $\sigma(z) = (1 + e^{-z})^{-1}$. Similarly, the output layer is given by

$$y = g(h) = \sigma(W^\top h + b_y). \tag{8}$$

The objective function of the contractive autoencoder includes a regularization term that penalizes the Jacobian of $f$,

$$J_{CAE}(\theta) = \sum_{x \in D} \left[ L(x, g(f(x))) + \lambda \|J_f(x)\|_F^2 \right]$$

where $\lambda$ is a free parameter and $\|\cdot\|_F^2$ is the squared Frobenius norm. Similar to the basic autoencoder, the loss function $L$ is the squared reconstruction error. Using a sigmoid activation function allows for efficient computation of the Jacobian, given by

$$\|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 = \sum_{i=1}^{d_h} \left[ h_i (1 - h_i) \right]^2 \sum_{j=1}^{d_x} W_{ij}^2 \tag{9}$$

where $d_h$ and $d_x$ are the hidden and input layer dimensions, respectively. Model parameters are learned by gradient descent with a 0.01 learning rate and 1000 epochs. Similar to the basic autoencoder, bias vectors are initialized to 0 and weight vectors use Xavier initialization [5]. Systems 5 and 13 have hidden layer dimension $d_h = 400$ with $\lambda = 1.5$, while systems 8 and 14 have hidden layer dimension $d_h = 200$ with $\lambda = 0.5$.
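A minimal sketch of the Jacobian penalty in equation (9), exploiting the sigmoid closed form (assuming $W$ has shape $(d_h, d_x)$):

```python
import numpy as np

def contractive_penalty(x, W, b_h):
    """Squared Frobenius norm of the hidden layer Jacobian, eq. (9)."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b_h)))  # eq. (7)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
```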

3) Variational autoencoder: The variational autoencoder is a probabilistic autoencoder with continuous latent variables [9]. Parameters are learned efficiently by backpropagation through a reparametrization trick, which allows the gradient of the loss function to propagate through the sampling process. The objective function of the variational autoencoder is composed of both a reconstruction loss and a latent loss. System 2 uses a variational autoencoder with 2 hidden layers of dimension 5, a spherical Gaussian latent space of dimension 3, and softplus nonlinearities between layers. Parameters are learned by Adam, a stochastic gradient-based optimization algorithm [8], using a learning rate of 0.001, a mini-batch size of 2, and 700 epochs. Query sample scores are given by the negative reconstruction loss, which is the negative log probability of the input given the reconstructed latent distribution.

4) One-class support vector machine: The one-class support vector machine (SVM) is an unsupervised model that learns a separating hyperplane between the origin and the feature vector points³ [17]. The free parameters of the model include η, the fraction of training errors (samples that lie outside the separating plane), in addition to any parameters of the kernel function. This work uses the one-class SVM implemented by scikit-learn [15], which internally uses libsvm [4]. Query sample scores are given by the negative distance to the separating hyperplane. Systems 4 and 12 both use the one-class SVM with η = 0.5, a radial basis function (RBF) kernel, and RBF kernel parameter γ = 0.9.

³ Alternatively, one may learn a minimal-volume hypersphere that encapsulates most of the points in feature space.

5) Manhattan distance: The Manhattan distance between two vectors $x$ and $y$ is computed as

$$D(x, y) = \sum_{i=1}^{d} |x_i - y_i| \tag{10}$$

where $d$ is the dimension of the vectors. Query sample scores are given by the negative Manhattan distance to the mean template vector. In this case, training a model consists of simply computing the mean template vector. Scaled Manhattan distance was previously shown to achieve state-of-the-art accuracy in keystroke biometric anomaly detection [7], with Manhattan distance following closely. Scaled Manhattan distance was not used in this work since the number of template samples (4) is not large enough to obtain a reliable estimate of the SD of each feature. System 6 uses the Manhattan distance.
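The two detectors just described are small enough to sketch directly. The helper names are hypothetical; in particular, whether the submission uses scikit-learn's decision_function output directly as the score is an assumption here:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Manhattan distance detector (system 6): train = mean template vector,
# score = negative L1 distance to it, per eq. (10).
def manhattan_scores(templates, queries):
    mean_template = templates.mean(axis=0)
    return -np.abs(queries - mean_template).sum(axis=1)

# One-class SVM (systems 4 and 12): nu is scikit-learn's name for the
# fraction-of-training-errors parameter (η above).
def ocsvm_scores(templates, queries):
    model = OneClassSVM(nu=0.5, kernel='rbf', gamma=0.9).fit(templates)
    return model.decision_function(queries)  # signed distance to hyperplane
```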

6) Partially observable hidden Markov model: The partially observable hidden Markov model (POHMM) is a generalization of the hidden Markov model in which hidden states are partially observable through event metadata at each time step [11]. In a two-state model of typing behavior, the user can be in either an active state, during which relatively short time intervals are observed, or a passive state, during which relatively long time intervals are observed. The key name of each keystroke partially reveals the underlying state, since certain keys, such as "Space" or "Shift", indicate a greater probability of being in a passive state. Despite an explosion in the number of model parameters, parameter estimation can still be performed in linear time using a modified Baum-Welch algorithm. Keystroke timing features are described by a log-normal distribution, and parameters are initialized such that there is a correspondence between the states of any two models being compared. The POHMM is a free-text model, since it operates on arbitrary keystroke sequences and does not require fixed-length input. Therefore, the POHMM does not use the keystroke alignment algorithm described in Section III-A2. Query sample scores are given by the model loglikelihood. Systems 3 and 11 use a POHMM with a two-dimensional log-normal emission (for the press-press latency and duration), two hidden states, and a 0.01 reduction in the loglikelihood as the convergence criterion. A parameter smoothing technique is employed to avoid overfitting the model to short input sequences [11].

7) Ensemble: An ensemble of anomaly detector scores is implemented with the goal of achieving higher accuracy than any individual member. The ensemble score is simply the mean score from each system in the ensemble. The sum rule (which is equivalent to a scaled mean) has been shown to be a robust, albeit simple, method of combining classifier output scores [10]. Systems 9, 10, and 15 utilize ensemble anomaly detectors. System 9 is an ensemble of systems 3, 4, and 5, which have relatively low score correlations; the correlation between the test set scores of systems 3 and 4 is 0.665, the lowest of any pair that utilizes SD score normalization. System 10 is the mean score from systems 1-8, which all use SD score normalization, and system 15 is the mean score of systems 11-14, which use min/max score normalization.
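A minimal sketch of the mean-score fusion, assuming the member score arrays have already been normalized as described in the next section:

```python
import numpy as np

def ensemble_scores(score_lists):
    """Mean of already-normalized score arrays, one per member system.
    Equivalent to the sum rule up to a constant factor [10]."""
    return np.mean(np.stack(score_lists), axis=0)
```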

C. Score normalization

Score normalization can have a significant effect on system performance and remains an ongoing area of research [14]. System performance, as measured by either the global or user EER, can be improved by score normalization in three different ways.

First, score normalization places the scores from multiple anomaly detectors on the same scale for the purpose of score-level fusion [6]. The anomaly detectors may correspond to members of an ensemble, in which each anomaly detector uses a different underlying model or a different set of features. The anomaly detectors may also correspond to different modalities, such as one for keystroke behavior and another for face recognition. Scores from the set of anomaly detectors can either be equally weighted or combined in a way that reflects the relative performance of each underlying model.

Second, score normalization places the scores from different users on the same scale. This is only relevant for the globally-derived ROC curve and global EER, in which a single global threshold is used to make an authentication decision. As the score distributions of different users vary, a global threshold will correspond to a different FAR and FRR for each user. To achieve similar FAR and FRR for each user with a single global threshold, the user scores must be normalized. Note that this effect is not apparent for user-dependent thresholds, in which a separate threshold is chosen for each user.

Finally, score normalization can reshape the score distribution within a single user. This can have the effect of being robust to sample outliers and lead to a better separation of genuine and impostor scores. While some score normalization techniques are sensitive to sample outliers, such as the min/max normalization described below, other techniques, such as tanh normalization, are robust [6].

Let $s_u$ be the set of scores from user $u$ and $s_{ui}$ be the $i$-th score from user $u$. The normalized score, $\dot{s}_{ui}$, is given by

$$\dot{s}_{ui} = \max\left(0, \min\left(1, \frac{s_{ui} - \lfloor s_u \rfloor}{\lceil s_u \rceil - \lfloor s_u \rfloor}\right)\right) \tag{11}$$

where $\lfloor s_u \rfloor$ and $\lceil s_u \rceil$ are the lower and upper bounds used for score normalization. Note that the clamping function $\max(0, \min(1, \cdot))$ ensures the resulting score is in the range [0, 1]. Two methods of score normalization are compared, which differ in the way $\lfloor s_u \rfloor$ and $\lceil s_u \rceil$ are defined.

1) Min/max normalization: Min/max normalization is achieved by letting

$$\lfloor s_u \rfloor = \min s_u, \qquad \lceil s_u \rceil = \max s_u \tag{12}$$

where $\min s_u$ and $\max s_u$ denote the minimum and maximum scores from user $u$, respectively. Min/max normalization preserves the shape of each user's score distribution, and thus does not affect the user EER, since the score distribution is only scaled. It does, however, have an effect on the global EER, since scores from different users are compared to a single threshold. Systems 11-15 use min/max score normalization.

2) SD normalization: The resulting scores are normalized using the SD of the scores within each user. Let $\mu_{s_u}$ be the mean score for user $u$ and $\sigma_{s_u}$ be the score standard deviation for user $u$. The lower and upper bounds for normalization are given by

$$\lfloor s_u \rfloor = \mu_{s_u} - H_s \sigma_{s_u}, \qquad \lceil s_u \rceil = \mu_{s_u} + H_s \sigma_{s_u} \tag{13}$$

where $H_s$ is a free parameter. This method affects both the user EER and global EER. It is tolerant to outlier samples, since a minimum or maximum outlier score will be clamped to 0 or 1, respectively. In this work, $H_s = 2$ encapsulates scores within 2 SD of the mean. Systems 1-10 use SD score normalization.
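Both methods reduce to the same clamped rescaling with different bounds, as in this minimal sketch of equations (11)-(13):

```python
import numpy as np

def normalize_scores(scores, method='sd', h_s=2.0):
    """Normalize one user's scores to [0, 1] per equations (11)-(13).
    method: 'minmax' (eq. 12) or 'sd' (eq. 13)."""
    scores = np.asarray(scores, dtype=float)
    if method == 'minmax':
        lower, upper = scores.min(), scores.max()
    else:
        mu, sigma = scores.mean(), scores.std()
        lower, upper = mu - h_s * sigma, mu + h_s * sigma
    return np.clip((scores - lower) / (upper - lower), 0.0, 1.0)
```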

IV. RESULTS

A validation EER is determined for each system through a Monte Carlo validation procedure using the development dataset. In each repetition, 4 template samples are randomly selected from the genuine samples. The remaining 20 samples (10 genuine and 10 impostor) are used as the query samples. These conditions mimic the test dataset, in which there are 4 labeled template samples and 20 unlabeled query samples for each user. This process is repeated 10 times to obtain a confidence interval on the EER. Note that the confidence interval is obtained on the global EER and not the user EER as in [7].

The resulting validation EER and test EER of the 15 systems submitted to the KBOC are shown in Table I. The test EERs were determined by the KBOC organizers, as the ground truth labels of the query samples remained hidden from competition participants. Systems 1-10 use SD score normalization and systems 11-15 use min/max score normalization. Systems 9 and 15, both ensembles, achieve a lower EER on the test set than any individual member of the ensemble. This is not the case for system 10, which is an ensemble of systems 1-8. The systems that utilize SD score normalization each achieve lower EERs than their min/max score normalization counterparts.
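As an illustration of the Monte Carlo validation procedure, the following sketch estimates a validation EER for a single development user. It is a simplification: the actual procedure pools scores from all 10 development users before computing the global EER, and `detector` is a hypothetical scoring function (e.g., `manhattan_scores` above), with `global_eer` as defined in the Section II sketch:

```python
import numpy as np

def monte_carlo_validation(genuine, impostor, detector, repetitions=10, seed=0):
    """In each repetition, 4 of the 14 genuine samples become templates;
    the remaining 10 genuine and 10 impostor samples are queries."""
    rng = np.random.default_rng(seed)
    eers = []
    for _ in range(repetitions):
        idx = rng.permutation(len(genuine))
        templates, queries = genuine[idx[:4]], genuine[idx[4:]]
        scores = np.concatenate([detector(templates, queries),
                                 detector(templates, impostor)])
        labels = np.concatenate([np.ones(len(queries)),
                                 np.zeros(len(impostor))])
        eers.append(global_eer(scores, labels))
    return np.mean(eers), np.std(eers)
```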



[Figure 2 appears here: score density plots (Density vs. Score), with validation scores (genuine and impostor) and test scores (labels unknown) shown for each normalization method.]

Figure 2: The effects of score normalization using a Manhattan distance anomaly detector: unnormalized (left), min/max normalization (middle), SD normalization (right). Min/max normalization is sensitive to outliers; SD normalization is not.

| System | Description | Validation | Test |
|---|---|---|---|
| *SD score normalization* | | | |
| 1 | Autoencoder {5,4,3} | 7.90 (1.91) | 7.82 |
| 2 | Variational autoencoder | 6.50 (0.75) | 6.46 |
| 3 | POHMM | 7.75 (1.36) | 7.32 |
| 4 | One-class SVM | 7.40 (2.20) | 7.35 |
| 5 | Contractive autoencoder {400} | 6.20 (1.27) | 8.02 |
| 6 | Manhattan distance | 6.95 (1.17) | 5.32 |
| 7 | Autoencoder {5} | 5.75 (1.57) | 7.95 |
| 8 | Contractive autoencoder {200} | 6.50 (1.47) | 8.08 |
| 9 | Mean ensemble, systems 3,4,5 | 5.40 (1.45) | 5.68 |
| 10 | Mean ensemble, systems 1-8 | 5.85 (1.31) | 5.91 |
| *Min/max score normalization* | | | |
| 11 | POHMM | 8.75 (1.38) | 10.35 |
| 12 | One-class SVM | 12.10 (1.63) | 10.89 |
| 13 | Contractive autoencoder {400} | 7.85 (1.23) | 11.20 |
| 14 | Contractive autoencoder {200} | 8.00 (1.25) | 11.23 |
| 15 | Mean ensemble, systems 11-14 | 5.75 (1.30) | 6.26 |

Table I: Summary of validation and test EER (%). The validation EER was obtained by a Monte Carlo procedure that mimics test set conditions (SD shown in parentheses). The test EER was determined by the KBOC organizers using score submission files. Hidden layer sizes of neural network models are shown in braces.

V. DISCUSSION

Manhattan distance is arguably the simplest anomaly detector of the fifteen systems evaluated in this work, since it has no free parameters, yet it achieved the lowest EER on the test dataset. This indicates a certain amount of overfitting to the development set due to hyperparameter (and perhaps even model) selection.

Score normalization has a significant effect on the global EER. Score distributions of the best system (Manhattan distance) are shown in Figure 2 using SD and min/max normalization, compared to the unnormalized scores.

| Score norm. | Discard | Truncate | Key-match |
|---|---|---|---|
| None | 19.20 (1.14) / 21.73 | 18.40 (1.90) / 20.17 | 18.40 (1.90) / 20.17 |
| Min/max | 11.30 (1.72) / 10.67 | 7.55 (0.96) / 10.05 | 7.55 (0.96) / 9.52 |
| SD | 10.05 (1.44) / 8.02 | 6.95 (1.17) / 6.13 | 6.95 (1.17) / 5.32 |

Table II: Manhattan distance validation (with SD in parentheses) and test EER (%), written validation/test in each cell, using different methods of score normalization and keystroke alignment. Discard = discard any modifier keys, comparing only character keys; Truncate = truncate the template and query keystroke sequences to the length of the shortest template; Key-match = the keystroke alignment method described in Section III-A2.

The unnormalized scores of the test set appear to have a unimodal distribution. The min/max normalized scores appear bimodal at 0 and 1, due to the presence of 0 and 1 scores for each user. This effect is not present in the SD normalized scores, which appear to have a trimodal distribution, with peaks at 0 and above the (presumably) expected genuine and impostor scores. Since the lower and upper bounds for normalization are determined by the standard deviation, and not the minimum and maximum values, there is no guarantee of the presence of 0 and 1 in the normalized scores for each user. However, there is still a peak at 0, which indicates the presence of (presumably impostor) sample outliers.

Table II shows the effects of score normalization and keystroke alignment on the Manhattan distance validation and test EER. The best performance is achieved using the key-match keystroke alignment with SD score normalization. In all cases, the key-match alignment between keystrokes of the query sequences and the target sequence improves performance. Simply discarding the modifier keys hurts performance, suggesting that the use of modifier keys plays an important role in keystroke biometrics. Note that the validation EERs of the truncation and key-match alignment methods are equal, since the lengths of the samples in the development set did not differ within any of the 10 users.


Interestingly, the choice of keystroke alignment method has the greatest impact on system performance when SD score normalization is used. In the same way that SD score normalization is tolerant to outlier samples, feature SD normalization is tolerant to outlier features. Taking the best submission (system 6) and using min/max feature normalization instead yields a validation EER of 8.55 (1.21)% and a test EER of 8.60%.

VI. CONCLUSIONS

Like most previous works, the results in this work reflect asymptotic system performance, i.e., after there are many samples in the system. Both the min/max and SD score normalization methods utilize statistics of the entire empirical score distribution. Future work should investigate online score normalization, which is subject to the order in which genuine and impostor samples appear to the system.

There have been several approaches to keystroke dynamics in which the query sequences only approximately match the template [3]. This scenario can be viewed as a compromise between fixed-text input, which requires the query keystrokes to exactly match the template, and free-text input, which places no restriction on the keystroke sequence. The keystroke alignment algorithm introduced in this work reflects this compromise. Future work could improve both the way the keystrokes are aligned, such as by being more tolerant to typing mistakes, and the way the target sequence is chosen.

ACKNOWLEDGMENTS

This research was supported in part by an appointment to the Postgraduate Research Participation Program at the U.S. Army Research Laboratory administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and USARL.

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[3] Patrick Bours and Vathsala Komanpally. Performance of keystroke dynamics when allowing typing corrections. In Biometrics and Forensics (IWBF), 2014 International Workshop on, pages 1–6. IEEE, 2014.
[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[6] Anil Jain, Karthik Nandakumar, and Arun Ross. Score normalization in multimodal biometric systems. Pattern Recognition, 38(12):2270–2285, 2005.
[7] Kevin S. Killourhy and Roy A. Maxion. Comparing anomaly-detection algorithms for keystroke dynamics. In Dependable Systems & Networks, 2009. DSN'09. IEEE/IFIP International Conference on, pages 125–134. IEEE, 2009.
[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[10] Josef Kittler, Mohamad Hatef, Robert P.W. Duin, and Jiri Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3):226–239, 1998.
[11] John V. Monaco. Time Intervals as a Behavioral Biometric. PhD thesis, Pace University, 2015.
[12] John V. Monaco, Gonzalo Perez, Charles C. Tappert, Patrick Bours, Soumik Mondal, Sudalai Rajkumar, Aythami Morales, Julian Fierrez, and Javier Ortega-Garcia. One-handed keystroke biometric identification competition. In Biometrics (ICB), 2015 International Conference on, pages 58–64. IEEE, 2015.
[13] Aythami Morales, Mario Falanga, Julian Fierrez, Carlo Sansone, and Javier Ortega-Garcia. Keystroke dynamics recognition based on personal data: A comparative experimental evaluation implementing reproducible research. In Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, pages 1–6. IEEE, 2015.
[14] Aythami Morales, Elena Luna-Garcia, Julian Fierrez, and Javier Ortega-Garcia. Score normalization for keystroke dynamics biometrics. In Security Technology (ICCST), 2015 International Carnahan Conference on. IEEE, 2015.
[15] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011.
[16] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
[17] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[18] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.