Discriminative conditional restricted Boltzmann machine for discrete choice and latent variable modelling∗

Melvin Wong†



Bilal Farooq‡

Guillaume-Alexandre Bilodeau§

arXiv:1706.00505v1 [cs.LG] 1 Jun 2017

June 5, 2017

Abstract

Conventional methods of estimating latent behaviour generally use attitudinal questions, which are subjective and may not always be available in surveys. We hypothesize that an alternative approach can be used for latent variable estimation through undirected graphical models, for instance, non-parametric artificial neural networks. In this study, we explore the use of generative non-parametric modelling methods to estimate latent variables from the prior choice distribution without the conventional use of measurement indicators. A restricted Boltzmann machine is used to represent the latent behaviour factors by analyzing the relationship information between the observed choices and the explanatory variables. The algorithm is adapted for latent behaviour analysis in a discrete choice scenario, and we use a graphical approach to evaluate and understand the semantic meaning of the estimated parameter vector values. We illustrate our methodology on a financial instrument choice dataset and perform statistical analyses of parameter sensitivity and stability. Our findings show that, through non-parametric statistical tests, we can extract useful latent information on the behaviour of latent constructs with machine learning methods, and that these latent constructs present a strong and significant influence on the choice process. Furthermore, our modelling framework shows robustness to input variability through sampling and validation.



∗ Paper presented at the International Choice Modelling Conference 2017.
† Laboratory of Innovations in Transportation (LITrans), Department of Civil Engineering, Ryerson University, Toronto, Canada. Email: [email protected]
‡ Laboratory of Innovations in Transportation (LITrans), Department of Civil Engineering, Ryerson University, Toronto, Canada. Email: [email protected]
§ Laboratoire d’Interprétation et de Traitement d’Images et Vidéo (LITIV), Department of Computer and Software Engineering, Polytechnique Montréal, Montréal, Canada. Email: [email protected]

1 Introduction

Complex theories of decision-making processes provide the basis for latent behaviour representation in statistical models, focusing on the use of psychometric data such as choice perceptions and attitudinal questions. Although these models can provide important insights into choice processes and underlying heterogeneity, studies have shown the limited flexibility and benefits of statistical latent behaviour models, i.e. Integrated Choice and Latent Variable (ICLV) models (Chorus and Kroesen, 2014; Vij and Walker, 2016). Two disadvantages are known in ICLV models: first, datasets are required to have attitudinal responses, for instance, Likert-scale questions in product choice surveys. Second, model mis-specification may occur when the latent variable model equations are poorly defined; attitudinal questions are subjective and may change over time.

The objective of this study is to use machine learning (ML) methods to analyze the underlying latent behaviour in choice models, based on a set of synthetic ML considerations and hyperparameters, without explicitly using attitudinal or perception attributes. A growing body of behavioural research focuses on patterns and clusters of behaviour characteristics, including latent attitudes and choice perceptions. Yet, compared with advanced choice modelling strategies such as ICLV models, our knowledge of the prevalence and consequences of latent behaviour in choice models remains limited (Vij and Walker, 2016). Studies of hidden representations using neural network models may give us more nuanced and potentially new perspectives on latent variables in discrete choice experiments and choice behaviour theory (Rungie et al., 2012).

Given the many possible latent variable combinations, it is necessary to use advanced ML techniques to segment the population into groups with similar attitudinal profiles. For this study, we have chosen to use restricted Boltzmann machines (RBM). The RBM is a non-parametric generative modelling approach that seeks to find latent representations within a homogeneous group by hypothesizing that posterior outputs can be explained with a reduced number of hidden units (Le Roux and Bengio, 2008). In addition, identifying common latent representations may enable policy makers to better understand the sensitivity and stability of latent behaviour models in surveyed and revealed preference data.

We decouple the latent behaviour model underlying the data distribution by estimation on a financial instrument choice behaviour dataset, without the need for subjective measurement indicators. The proposed method does not predefine a semantic meaning for each latent variable. Instead, we define a restricted Boltzmann machine to learn the latent relationships and approximate the posterior probability. We show in our findings that an RBM modelling approach is able to characterize latent variables with semantic meaning without additional psychometric data. The parameters estimated through our RBM model present a strong and significant influence in the choice process. Furthermore, sensitivity analysis has shown that this approach is robust to input data variance and that the use of generated latent variables improves sampling stability.

The remainder of the paper is organized as follows: in Section 2, we provide a background literature review on latent behaviour models. Section 3 describes the conditional RBM modelling approach and the model training methodology, given only observed variables without attitudinal questions.
Section 4 explains the data and the experimental procedure. Section 5 presents the results and performance tests. Section 6 analyzes the model sensitivity and stability. Finally, Section 7 discusses the conclusions and future research directions.

2 Background

Current practice in choice modelling is targeted at drawing conclusions about the mechanism of the stochastic model and not so much about the nature of the data itself. This leads to simple assumptions about data relevance and the statistical properties of explanatory variables (Burnham and Anderson, 2003). A number of parametric and non-parametric modelling methods are available. Parametric models include regression-based and random utility maximization structural models. Examples of non-parametric methods include latent class and latent variable models; other statistical methods include k-means and hierarchical clustering. These non-parametric methods are often criticized for being too descriptive and theoretical, possibly yielding inconsistent estimates, and often not allowing generalizations (Ben-Akiva and Bierlaire, 1999; Atasoy et al., 2013; Bhat and Dubey, 2014).

Analyzing data through their statistical properties is generally aimed at extracting information about the evolution of the responses associated with stochastic input variables, rather than at good prediction capabilities. On the other hand, algorithmic modelling approaches such as artificial neural networks (ANN), decision trees, clustering and factor analysis are built around the ability to predict future responses accurately given future input variables, within a ‘black-box’ framework (Breiman et al., 2001). Econometric choice models can be estimated by using both parametric and non-parametric methods that incorporate machine learning algorithms into discrete choice analysis to learn mappings from latent variables to the posterior distribution (Eric et al., 2008).

A number of different approaches that implement the use of attitudinal variables have been used in the existing literature (Ashok et al., 2002; Morey et al., 2006; Hackbarth and Madlener, 2013). The first approach relies on a top-down modelling framework which makes the prior assumption that individuals are divided into multiple market segments, and that each segment has its own utility function over the underlying attributes. In the most generic form, these assumptions are based on multiple sources of unobserved heterogeneity influencing decisions, e.g. inter- and intra-class variance and the ‘agent effect’ (Yazdizadeh et al., 2017). Fig. 1 illustrates the latent class and ICLV model frameworks, showing the process of deriving latent classes or variables and how they integrate into the structural choice model.

The Latent Class model (LCM) is one such form, which assumes a discrete distribution among market segments (Hess and Daly, 2014). The LCM derives clusters using a probabilistic model that describes the distribution of the data. Based on this assumption, similarities within a heterogeneous population are identified through the assignment of latent class probabilities. Individuals in the same class share a common joint probability distribution among the observed variables. Under the assumption of class independence, the utility is generated with a prior hypothesis from several sub-populations, and each sub-population is modelled separately. The resulting classes are often meaningful and easily interpretable. The unobserved heterogeneity in the population is captured by the latent classes, each of which is associated with a different utility vector in the sub-model (Fig. 1a). Another similar class of top-down models are finite mixture models, e.g. the Mixed Logit, which allows the parameters to vary with a variance component, so that behaviour depends on the observable attributes and on the latent heterogeneity which varies with the unobserved factors (Hensher and Greene, 2003).

The use of attitude and perception latent variables has also been particularly popular in past work (Glerum et al., 2014; Atasoy et al., 2013). Choice models with measurement indicator functions map correlated indicators into multiple latent variables. This factor analysis method is similar to principal component analysis, where the latent variables are used as principal components (Glerum et al., 2014). This approach involves the analysis of the relationship between the indicators and the choice model. Within this domain, there are sequential and simultaneous estimation processes. The sequential approach first estimates a measurement model which derives the relationship between latent variables and indicators.
Then, a choice model is estimated, integrating over the distribution of the latent variables. The main disadvantage of this approach is that the parameters may contain measurement errors from the indicator function that are not taken into account in the choice model. To solve this issue, another approach uses simultaneous estimation of the structural and measurement models, which includes the latent variable in the choice model framework. This is the so-called Integrated Choice and Latent Variable (ICLV) model (Fig. 1b). The ICLV model explicitly uses information from measurement indicators and explanatory variables to derive latent constructs. This combined structural model framework has led to many interesting results, e.g. environmental attitudes in rail travel (Hess et al., 2013), image, stress and safety attitudes towards cycling (Maldonado-Hinarejos et al., 2014), and social attitudes towards electric cars (Kim et al., 2014). However, the simultaneous approach still relies on a separate measurement model (latent variable model) that describes the relationship to the indicators. Despite the direct benefits of the ICLV model combining factor analysis with traditional discrete choice models, such an approach is only advantageous when attitudinal measurement indicators can be expected to be available to the modeller and the observed explanatory variables are weak predictors of the choice model (Vij and Walker, 2016). Even when measurement indicators are available, they may not provide any more information that directly influences the choice than the explanatory variables do (Chorus and Kroesen, 2014).

Consequently, mis-specification and other measurement errors may occur when the criteria are not associated with the choice model.

Without measurement indicators to guide the selection of latent variables, we can alternatively derive latent variables through data mining. This can be implemented through generative modelling methods used in ML. Generative modelling in ML is a class of models which uses unlabelled data to generate latent features. Generative models learn the underlying choice distribution $p(y)$ and the latent inference model $p(h|y)$, where $h$ is the latent variable. A Bayesian network representing the probabilistic conditional relationships and dependencies between the random variables is then used to derive the posterior distribution of $y$ given $h$:

$$p(y|h) = \frac{p(h|y)\,p(y)}{p(h)},$$

where the denominator is given by $p(h) = \sum_y p(h|y=1)\,p(y=1)$, with $y=1$ indicating that choice $y$ is chosen. Efficient algorithms which perform ML and inference, such as RBMs, can be used in this method. The rapid advancement of machine learning research has led to the development of efficient semi-supervised training algorithms such as the conditional restricted Boltzmann machine (C-RBM) (Salakhutdinov et al., 2007; Larochelle and Bengio, 2008), a hybrid discriminative-generative model capable of simultaneously estimating a latent variable model from the a priori choice distribution together with a latent inference model (see Fig. 2).

To date, econometric and machine learning models have often been studied for their contrasting purposes in decision forecasting by behavioural researchers (Breiman et al., 2001). Econometric models are based on the classical decision theory that an individual’s decisions can be modelled rationally based on utility maximization. These models assume that the population will adhere to the strict formulation of the choice model, which may not always represent the true decisions. The generative modelling approach uses clustering and factor analysis developed through algorithmic modelling of the data. Associations between decision factors can be classified in this method, obtaining latent information without an explicit definition of latent constructs (Poucin et al., 2016). Thus, machine learning algorithms such as ANNs that decouple latent information from the ‘true’ distribution generally outperform traditional regression-based models in multidimensional problems (Ahmed et al., 2010).

Recent works on latent behaviour modelling in choice analysis agree on the potential of improving behaviour models with machine learning. Examples include combining machine learning to improve complex psychological models (Rosenfeld et al., 2012), representing the phenomena of similarity, attraction and compromise in choice models (Osogami and Otsuka, 2014), and inference of priorities and attitudinal characteristics (Aggarwal, 2016). Despite the many benefits, the interpretation of results is still extremely difficult due to the complexity and number of parameters in ML analysis. As a result, ML models are not often used for general-purpose behaviour understanding, but are created exclusively for prediction accuracy on a specific task. Still, machine learning research is a rapidly growing field at the intersection of statistical analysis and information science to find patterns in complex data (Donoho, 2015).
Furthermore, with the emphasis on applications and theoretical studies in today’s massive data-driven industry, improving analytical techniques with ML is very relevant, although structural modelling, statistical and probability theory will remain the cornerstone of discrete choice analysis.
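As a toy numerical illustration of the Bayesian inversion described above (the dimensions and random distributions below are illustrative assumptions, not taken from the paper), the posterior $p(y|h)$ can be recovered from the prior choice distribution $p(y)$ and the latent inference model $p(h|y)$:

```python
import numpy as np

# Toy discrete example: y ranges over I alternatives, h over J latent states.
I, J = 13, 2
rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(I))                  # prior choice distribution p(y)
p_h_given_y = rng.dirichlet(np.ones(J), size=I)  # latent inference p(h|y), shape (I, J)

p_h = (p_h_given_y * p_y[:, None]).sum(axis=0)   # denominator p(h) = sum_y p(h|y) p(y)
p_y_given_h = p_h_given_y * p_y[:, None] / p_h   # posterior p(y|h) = p(h|y) p(y) / p(h)
assert np.allclose(p_y_given_h.sum(axis=0), 1.0) # each column is a valid distribution over y
```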

2.1 The basis of latent class and latent variable models

The latent class model shown in Figure 1 is a simple top-down model that imparts generalization properties to the choice model; it predefines a discrete number of classes, allowing the parameters to vary with a fixed distribution. Formally, the LCM choice probability can be expressed as:

$$P(y) = \sum_n P(s_n)\, P(y|x, s_n) \tag{1}$$

where $S = [s_1, s_2, \ldots, s_n]$ is the set of classes and $P(s_n)$ is the probability that an individual belongs to class $s_n$. $P(y|x, s_n)$ is the conditional probability of choice $y$ being selected given the class $s_n$ and the input variables $x$.

The ICLV model extends the choice model by describing how perceptions and attitudes affect real choices, using separate indicators to estimate the latent variables (Ben-Akiva et al., 2002).

Latent variables can be classified as either attitudinal (individual characteristics) or perceptual (personal beliefs towards responses) (Ben-Akiva and Bierlaire, 1999). The latent variable model (measurement model) forms a sub-part of the structural framework which captures the relationship between the latent variables and the indicators, and the observed explanatory variables which influence the latent variables. This specification can be used to identify more useful parameters and predict accurate decision outcomes when there is a lack of strong significant correlation between explanatory variables and choice outcomes. The functions of the structural and measurement models can be explained in four equations (Vij and Walker, 2016):

$$x^* = Ax + \nu \tag{2}$$

$$I^* = Dx^* + \eta \tag{3}$$

$$u = Bx + Gx^* + \varepsilon \tag{4}$$

$$y_i = \begin{cases} 1 & \text{if } u_i > u_{i'} \text{ for } i' \in \{1, \ldots, I\} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $u_i$ is the utility of selecting alternative $i$. $A$ represents the relationship between the input explanatory variables $x$ and the latent variables $x^*$, and $D$ represents the relationship between $x^*$ and the indicator output $I^*$. $B$ and $G$ represent the model parameters with respect to the observed and latent variables. $\nu$, $\eta$ and $\varepsilon$ are the stochastic error terms of the model, assumed to be mutually independent and Gumbel distributed. In a generative model, parameters are shared between $G$ and $D$, which simply defines the joint distribution $p(y, h)$, i.e. $G = D^\top$ (Fig. 2). The re-use of a shared parameter vector differentiates the RBM model from the structural equation formulation of the ICLV model.
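To illustrate, below is a minimal numpy sketch of the LCM choice probability of Eq. (1); the logit forms chosen for the class-membership and within-class choice models are common assumptions, not prescribed by the equation itself, and all dimensions and weights are hypothetical:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: K explanatory variables, N latent classes, I alternatives.
K, N, I = 20, 2, 13
rng = np.random.default_rng(0)
x = rng.normal(size=K)            # one individual's explanatory variables

gamma = rng.normal(size=(N, K))   # class-membership parameters (assumed logit form)
beta = rng.normal(size=(N, I, K)) # class-specific choice parameters (assumed logit form)

P_s = softmax(gamma @ x)          # P(s_n): probability of belonging to class n
P_y_given_s = softmax(beta @ x)   # P(y|x, s_n): within-class choice probabilities, (N, I)
P_y = P_s @ P_y_given_s           # Eq. (1): sum_n P(s_n) P(y|x, s_n)
print(P_y.sum())                  # the resulting choice probabilities sum to 1
```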

2.2 Modelling through generative machine learning methods

In generative machine learning models, hidden units $h$ are the learned features (see Fig. 2) which perform a non-redundant generalization of the data to reduce high-dimensional input data (Hinton and Salakhutdinov, 2006). Intuitively, in terms of econometric analysis, hidden units are latent variables that depend on some observed data, for instance, socio-economic attributes, contextual information such as weather or price, or direct choices such as location and choice of purchase. We can construct a generative model as a function of these dependent and independent variables.

In the factor analysis approach, a common process is to perform feature extraction based on statistical hypothesis testing to determine whether the values of two classes are distinct, for example, using Support Vector Machines (SVMs) or Principal Component Analysis (PCA) to learn low-dimensional classes by capturing only the significant statistical variances in the data (Poucin et al., 2016; Wong et al., 2016). The learned classes (or clusters) can then be introduced directly into the model via parameterization. In the generative modelling approach, we use the priors directly to learn the distribution of the hidden units. In this process we extract latent information directly from the observed choice data instead of using measurement functions which may be prone to errors.

Figure 1: Classical structural framework for (a) latent class model and (b) integrated choice and latent variables model

Figure 2: Framework for a C-RBM choice model conditional on explanatory variables and choice distribution.

2.3 Balancing model inference and accuracy

One common problem that researchers face when constructing latent behaviour models is specifying the optimal number of latent factors (Vermunt and Magidson, 2002). Since a hypothesis on the number of latent factors cannot be tested directly, typical statistical evaluation criteria such as AIC and BIC are used to guide class selection (Vermunt and Magidson, 2002); in the case of ICLV models, selection works through the predefinition of measurement functions (Rungie et al., 2012). However, since the number of latent factors determines the ability of the model to represent the various heterogeneity in the data, it is likely that as we increase $h$, the choice model becomes more efficient at capturing complex behaviour effects from individual and latent attributes. On the other hand, as we increase the number of latent segments, the number of parameters also increases at an exponential rate (Vermunt and Magidson, 2002). Therefore, we may gain model accuracy, but we would lose model interpretability. The trade-off between inference and accuracy is a challenge when dealing with complex data (Breiman et al., 2001). If the goal of latent behaviour modelling is to leverage data to understand underlying statistical problems, we have to incorporate implicit modelling methods in addition to describing explicit structural utility formulations.
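As a concrete sketch of the BIC criterion referred to here (and reported later in Table 3), assuming the criterion is evaluated on the 70% training split described in Section 4.1, which reproduces the table's values:

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2*LL + k*ln(n)."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# C-RBM with J = 4 from Table 3: LL = -202066 with 409 parameters.
n_train = int(0.7 * 253803)        # assumed 70% training split of the full dataset
print(bic(-202066, 409, n_train))  # ~409076, matching Table 3's 409075
```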

3 Methodology

In this section, we provide a brief overview of restricted Boltzmann machines and how they can be used to generate priors over the choice distribution. We refer readers to Goodfellow et al. (2016) for background and details on generative models and deep learning.


3.1 Restricted Boltzmann machines

A restricted Boltzmann machine (RBM) is an energy-based undirected graphical model that extends a Markov Random Field distribution by including hidden variables (Salakhutdinov et al., 2007). It is a single-layer artificial neural network with no internal layer connections. The model has stochastic visible variables $y \in \{0,1\}^I$ and stochastic hidden variables $h \in \{0,1\}^J$. The joint configuration $(y, h)$ of visible and hidden variables is given by the Hopfield energy (Hinton et al., 1984):

$$\text{Energy}(y, h) = -\sum_{i \in \text{vis}} y_i c_i - \sum_{j \in \text{hid}} h_j d_j - \sum_{i,j} h_j D_{ij} y_i, \tag{6}$$

where $d_j$ and $c_i$ represent the bias vectors (constants) for the hidden and visible vectors respectively, and $D_{ij}$ is the matrix of parameters representing an undirected connection between the hidden and visible variables. We can express the Boltzmann distribution as an energy model with energy function $F(y)$:

$$p(y, h) = \frac{1}{Z} \exp(-F(y)), \tag{7}$$

where the partition function $Z = \sum_{y,h} \exp(-\text{Energy}(y, h))$ is the normalization constant over all possible vector combinations. $F(y)$ is the free energy, defined as $F(y) = -\ln \sum_h \exp(-\text{Energy}(y, h))$ and further simplified to

$$F(y) = -\sum_{i \in \text{vis}} y_i c_i - \sum_{j \in \text{hid}} \ln\big(1 + \exp(D_{\cdot j}\, y + d_j)\big). \tag{8}$$

The probability of assigning a visible vector $y$ is given by the sum over all possible hidden vector states:

$$p(y) = \frac{1}{Z} \sum_h \exp(-F(y)). \tag{9}$$

The RBM model is used to learn aspects of an unknown probability distribution based on samples from that distribution. Given some observations, the RBM updates the model weights such that the model best represents the distribution of the observations. To generate data with this method, it is necessary to compute the log likelihood gradient for all visible and hidden units. Hinton introduced a fast greedy algorithm to learn model parameters efficiently using the Contrastive Divergence (CD) method, which starts a sampling chain (Gibbs sampling) from real data points instead of random initialization (Hinton, 2010).
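A minimal numpy sketch of Eqs. (6)-(9) follows; the dimensions and random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def free_energy(y, D, c, d):
    """Free energy F(y) of Eq. (8): -sum_i y_i c_i - sum_j ln(1 + exp(D_.j y + d_j))."""
    return -(y @ c) - np.log1p(np.exp(y @ D + d)).sum()

# Hypothetical small RBM: I visible (choice) units and J hidden units.
I, J = 13, 2
rng = np.random.default_rng(1)
D = 0.1 * rng.normal(size=(I, J))  # visible-hidden weights D_ij
c, d = np.zeros(I), np.zeros(J)    # visible and hidden biases

y = rng.integers(0, 2, size=I).astype(float)
# exp(-F(y)) is the unnormalized probability of y; the partition function Z of
# Eq. (9) would sum this quantity over all 2^I visible configurations.
print(np.exp(-free_energy(y, D, c, d)))
```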

3.2 Model estimation and inference

The probability that the RBM network assigns to a training sample can be raised by adjusting the weights to lower the energy of that training sample and raise the energy of other, non-training samples. In order to minimize the negative log likelihood of the probability distribution $p(y)$, we take the gradient of the log probability of a training vector with respect to the model parameters as follows:

$$\frac{\partial \log p(y)}{\partial \theta} = \langle y_i h_j \rangle_{\text{train}} - \langle y_i h_j \rangle_{\text{model}} = \phi^+ - \phi^-, \tag{10}$$

where the components in the angle brackets correspond to expectations under the specified distribution. The first and second terms are the positive $\phi^+$ and negative $\phi^-$ phases respectively. This function updates the model parameters using a simple learning rule with a learning rate $\Phi$:

$$\Delta\theta = \Phi\,\big(\langle y_i h_j \rangle_{\text{train}} - \langle y_i h_j \rangle_{\text{model}}\big). \tag{11}$$

The updates for the parameters $\theta = \{D_{ij}, d_j, c_i\}$ can be performed using simple stochastic gradient descent at each iteration $t$:

$$\theta_t = \theta_{t-1} - \Delta\theta. \tag{12}$$

To obtain a sample of a hidden unit from $\langle y_i h_j \rangle_{\text{train}}$, we take a random training sample $y$, and the state of the hidden layer is sampled from the following function:

$$p(h_j = 1|y) = \frac{e^{d_j + \sum_i D_{ij} y_i}}{1 + e^{d_j + \sum_i D_{ij} y_i}} = \sigma\Big(d_j + \sum_i D_{ij} y_i\Big), \tag{13}$$

where $\sigma(x) = e^x/(1 + e^x)$. Similarly, we can obtain a visible state, given a vector of sampled hidden units, via a logistic function:

$$p(y_i|h) = \frac{e^{c_i + \sum_j D_{ij} h_j}}{\sum_{i'} e^{c_{i'} + \sum_j D_{i'j} h_j}}. \tag{14}$$

Since the weights are shared between $D$ and $G$, and they define the distributions $p(y)$, $p(h)$, $p(y, h)$, $p(y|h)$ and $p(h|y)$, we can express the posterior distribution as $p(y) = \sum_h p(h)\,p(y|h)$ (Ng and Jordan, 2002). Due to its bidirectional structure, this framework possesses good generalization capabilities. The visible layer represents the data (in the case of choice modelling, the data represent the selected choices), and the hidden layer represents the capacity of the model as class distributions.

Inference of $\langle y_i h_j \rangle_{\text{model}}$ can be done by setting the states of the visible variables to a training sample and then computing the states of the hidden variables using Eq. 13. Once a “state” is chosen for the hidden variables, a “reconstruction” phase produces a new vector $\tilde{y}$ with a probability given by Eq. 14, and the gradient update rule becomes:

$$\Delta\theta = \Phi\,\big(\langle y_i h_j \rangle_{\text{train}} - \langle y_i h_j \rangle_{\text{reconstruction}}\big). \tag{15}$$

We approximate the gradient function by using a CD Gibbs sampler, minimizing the divergence between the expected and estimated probability distributions, known as the Kullback-Leibler (KL) divergence (Hinton, 2002). A divergence of 0 indicates that the estimated distribution is identical to the expected one. The training algorithm runs for a total of $N$ chain steps, initialized from a fixed point of the data distribution, and is then averaged across all examples (Carreira-Perpinan and Hinton, 2005).
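The following is a minimal numpy sketch of one CD-1 update (Eqs. 10-15), assuming one-hot choice rows Y and a softmax visible layer as in Eq. (14); it is an illustration under those assumptions, not the authors' Theano implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cd1_update(Y, D, c, d, lr=0.01, rng=None):
    """One CD-1 update on a mini-batch of one-hot choice rows Y (shape n x I).
    The gradient <y h>_train - <y h>_model of Eq. (10) is approximated with a
    single Gibbs step started from the data, as in Eq. (15)."""
    if rng is None:
        rng = np.random.default_rng(3)
    H_prob = sigmoid(Y @ D + d)                    # Eq. (13): p(h_j = 1 | y)
    H = (rng.random(H_prob.shape) < H_prob) * 1.0  # sampled hidden states
    Y_recon = softmax(H @ D.T + c)                 # Eq. (14): reconstruction over choices
    H_recon = sigmoid(Y_recon @ D + d)
    n = Y.shape[0]
    D += lr * (Y.T @ H_prob - Y_recon.T @ H_recon) / n  # weight update, Eq. (15)
    c += lr * (Y - Y_recon).mean(axis=0)                # visible bias update
    d += lr * (H_prob - H_recon).mean(axis=0)           # hidden bias update
    return D, c, d
```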

3.3 Modelling approach

In this paper, the proposed method uses a conditional RBM (C-RBM) training algorithm to include input-output connections that allow for discriminative learning (Mnih et al., 2012). The C-RBM expands the model to include “context input variables”, i.e. $p(y|x, h)$. $k$ input explanatory variables are introduced as context variables so that they can influence the latent variables, even though Eq. 14 does not reconstruct these explanatory variables. This influence is represented by a weight matrix $B_{ik}$. The intuition is that each latent variable acts as a function of the observed choice $y$, conditional on $x$ (see Fig. 2). In the choice prediction stage, a vector of new input samples $x$ generates the latent variables $h$. Conditional on the explanatory and latent variables, a probability function describing the choice behaviour is given by:

$$p(y_i|h, x) = \frac{e^{\sum_k B_{ik} x_k + \sum_j D_{ij} h_j + c_i}}{\sum_{i'} e^{\sum_k B_{i'k} x_k + \sum_j D_{i'j} h_j + c_{i'}}}. \tag{16}$$

Likewise, sampling of the hidden state is extended to incorporate $x$:

$$p(h_j = 1|y, x) = \sigma\Big(d_j + \sum_i D_{ij} y_i + \sum_k A_{jk} x_k\Big), \tag{17}$$

where the update parameters are $\theta = \{D_{ij}, B_{ik}, A_{jk}, d_j, c_i\}$. During the reconstruction phase, the conditional probability (Eq. 16) is equivalent to an MNL model with latent variables (where $h$ and $x$ represent the latent and observed variables respectively). Good latent variables $h$ best capture information along the orthogonal directions where the choices $y$ and the observed inputs $x$ vary the most.

The training and choice estimation phases are illustrated in Figs. 3 and 4. In the positive phase, the parameter vectors are adjusted, governed by the learning rate $\Phi$, to learn the transformed latent representation of the training set. In the negative phase, the latent variables are “clamped”, or realized, and the parameter vectors are adjusted again by reconstructing the observed variables. Referring to Fig. 2, the multinomial logit (MNL) model estimates the conditional parameter vector $B$ and bias vector $c$, while the C-RBM model additionally includes the vectors $D$, $A$ and $d$.

Figure 3: C-RBM (a) positive $\phi^+$ and (b) negative $\phi^-$ phases during semi-supervised discriminative training. Weights (connections) are learned to reduce the reconstruction error of $\tilde{y}$.

Figure 4: During the choice prediction phase, (a) latent variables are sampled using the explanatory variables, and (b) the choice model is estimated with variables $x$ and $h$.
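A minimal sketch of the prediction phase in Fig. 4 follows; the dimensions and random weights are illustrative, and we assume the $y$-term of Eq. (17) is dropped when activating the latent variables, since the choice is unknown at prediction time:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def crbm_predict(x, A, B, D, c, d):
    """Choice prediction (Fig. 4): latent variables are activated from the
    explanatory variables alone, then choice probabilities follow Eq. (16)."""
    h = sigmoid(x @ A.T + d)         # latent activation from x, cf. Eq. (17) without y
    logits = x @ B.T + h @ D.T + c   # sum_k B_ik x_k + sum_j D_ij h_j + c_i
    return softmax(logits)

# Hypothetical dimensions: I = 13 alternatives, J = 2 latent variables, K = 20 inputs.
I, J, K = 13, 2, 20
rng = np.random.default_rng(4)
A, B, D = rng.normal(size=(J, K)), rng.normal(size=(I, K)), rng.normal(size=(I, J))
c, d = np.zeros(I), np.zeros(J)
print(crbm_predict(rng.normal(size=K), A, B, D, c, d).sum())  # probabilities sum to 1
```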

4 Data

In this section, we develop a financial product choice scenario with explanatory variables using the C-RBM model. The latent variables, representing the latent attitudinal constructs, are estimated simultaneously with the choice model and its interactions. First, we construct a structured choice subset from a financial product transaction dataset from the Kaggle database (https://www.kaggle.com/c/santander-product-recommendation/data). The data record, on a monthly basis, each financial product purchase by customers of Santander. The time span of the data is from January 2015 to June 2016. Next, we reduced the complexity of the dataset by removing transaction data which contain

multiple product choices. To ensure consistency, inputs were scaled and normalized. Overall, the constructed dataset has a total of 13 alternatives (product choices) and 20 explanatory variables. Table 1 lists the alternatives and their distribution across the dataset. Given the above conditions, a total of 253,803 valid responses were recorded, representing the total population sample with 13 available choices. A descriptive list of the mean and standard deviation values of the explanatory variables is shown in Table 2.

The experimental question is straightforward: “Given a set of examples with explanatory variables, what product is the individual most likely to purchase in the given month?” In a typical situation, the decision maker chooses the alternative that yields the maximum utility, and the predictive model makes an inference about the behaviour of the decision maker.
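A hedged sketch of this preprocessing step is shown below; the file name and column names are hypothetical placeholders (the raw Kaggle fields are named differently), so it illustrates the filtering logic only:

```python
import pandas as pd

df = pd.read_csv("santander_train.csv")  # hypothetical file name

# Keep only months in which a customer purchased exactly one product.
product_cols = [col for col in df.columns if col.startswith("product_")]  # assumed prefix
df = df[df[product_cols].sum(axis=1) == 1].copy()
df["choice"] = df[product_cols].to_numpy().argmax(axis=1)  # single choice index per row

# Scale and normalize the continuous inputs to zero mean and unit variance.
num_cols = ["age", "loyalty", "income"]  # names follow Table 2, assumed already renamed
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
```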

Table 1: Choices (y)

Choice index   Name                   Total sample distrib.
1              Guarantees             0.002%
2              Short-term deposits    0.83%
3              Medium-term deposits   0.07%
4              Long-term deposits     3.79%
5              Funds                  0.98%
6              Mortgage               0.02%
7              Pensions               0.15%
8              Loans                  0.035%
9              Taxes                  2.68%
10             Cards                  21.93%
11             Securities             1.42%
12             Payroll                22.04%
13             Direct debit           46.05%

4.1 Method for assessing C-RBM model performance

We can estimate the weights $B_{ik}$ and $D_{ij}$ of the latent inference model by optimizing the lower bound of the KL-divergence using gradient backpropagation. Intuitively, $B_{ik}$ represents the parameters for the explanatory variables and $D_{ij}$ represents the parameters for the latent variables. We selected models with 2, 4, 8 and 16 latent variables to observe the effects of increasing model complexity. One disadvantage of this step is that it results in a large number of estimated parameters: $N_{\text{params}} = (I \times J) + (K \times I) + (K \times J) + J + I$. With $J = 4$, we ended up with 409 parameters. To counteract overfitting due to this problem, we trained on 70% of our data and validated the model on the other 30%, with a 2-fold cross-validation to verify generalization. When the validation error stops decreasing, the optimal estimate is reached (Goodfellow et al., 2016).

A baseline comparison is set up using a standard multinomial logistic regression model with all explanatory variables, against which the discriminative C-RBM modelling approach is compared using the log-likelihood, the $\rho^2$ model fit and the predictive accuracy across all data models. The criteria for measuring the performance of a categorical model are the $\rho^2$ model fit and the prediction error. The $\rho^2$ fit denotes the predictive ability of the trained model relative to a model without covariates. In the prediction error evaluation, the sum of the diagonal cells of the confusion matrix over the total number of examples denotes the accuracy of the model in predicting the correct choice, and the error is

$$\text{Error}_{\text{valid}} = 1 - \sum_i P(y_{\text{pred}} = 1 \,|\, x, h, y_i = 1). \tag{18}$$

$y_i$ is the actual choice and $\text{Error}_{\text{valid}}$ is the sum of the error probabilities of a correct assessment for each choice. We fit the model on the training set and evaluate it on the validation set.
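A minimal sketch of the confusion-matrix accuracy behind Eq. (18); the toy inputs are illustrative assumptions:

```python
import numpy as np

def validation_error(P_pred, y_true):
    """Eq. (18) via the confusion-matrix diagonal: one minus the share of validation
    examples whose highest-probability alternative matches the actual choice."""
    y_hat = P_pred.argmax(axis=1)        # predicted alternative per example
    accuracy = (y_hat == y_true).mean()  # diagonal of the confusion matrix / total
    return 1.0 - accuracy

# Toy usage: P_pred is (n_examples, 13) choice probabilities, y_true the chosen indices.
rng = np.random.default_rng(5)
P_pred = rng.dirichlet(np.ones(13), size=100)
y_true = rng.integers(0, 13, size=100)
print(validation_error(P_pred, y_true))
```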

Table 2: Explanatory variable descriptive statistics (x)

Explanatory variable   Description                        mean      std. dev.
age                    Customer age                       42.9      13.0
loyalty                Customer seniority (in years)      8.03      6.0
income                 Customer income (€)                141,838   262,748
sex                    Customer sex (1=male)              0.387     0.487
employee               Employee index, 1 if employee      0.0006    0.024
active                 Active customer index              0.95      0.199
new_cust               1 if customer loyalty < 6 mo.      0.045     0.207
resident               Resident index (Spain)             0.999     0.007
foreigner              Foreign citizen index              0.045     0.21
european               EU citizen index                   0.995     0.006
vip                    VIP customer index                 0.116     0.32
savings                Savings Account type               0.0002    0.012
current                Current Account type               0.572     0.495
derivada               Derivada Account type              0.0009    0.03
payroll_acc            Payroll Account type               0.416     0.493
junior                 Junior Account type                0.0001    0.0098
masparti               Mas Particular Account type        0.017     0.128
particular             Particular Account type            0.168     0.373
partiplus              Particular Plus Account type       0.113     0.316
e_acc                  e-Account type                     0.255     0.436

5 Results

We compare the different models based on their generalization performance on the test set. A total of 76,141 observations were used in the test. For the purpose of this study, we tested both normalized and non-normalized data and found that both produce similar results. Model estimation and validation were performed with the Theano ML Python libraries (http://github.com/Theano/Theano). The optimization used stochastic gradient descent (SGD) on mini-batches of 64 samples for 400 epochs with input normalization, with an adaptive momentum-based learning rate with an initial rate of $10^{-3}$ (Hinton et al., 2006). Training time was approximately 30 minutes for each model, including validation, running on an Intel Core i5 workstation. At present, the computational demand is not significant given the small number of hidden units; however, speed could become a more important consideration when model estimation and validation grow in data size, or when very large parameter vectors with higher dimensionality are used.

The statistical results of the model comparison across the same validation set are shown in Table 3. We found that the additional latent information about the relationship between explanatory variables and observed decisions was useful and increased model accuracy. Bayesian Information Criterion (BIC) values indicate that 8 hidden units may be the optimal number of latent variables, and higher BIC values above 8 hidden units might suggest overfitting. However, when generating semantic class meanings, a smaller number of latent variables may be simpler; therefore, in our example, we use only 2 latent variables for the analysis.

To evaluate the efficiency of the models, we used a Hinton diagram (Bremner et al., 1994) to analyze the parameter strengths between independent and dependent variables. We plot the parameter values and significance with the choices on the y-axis and the independent variables on the x-axis (Bremner et al., 1994). A Hinton diagram is often used in model analysis where the dimensionality of the model is high, and it provides a simple visual way of analyzing each vector. Figs. 5 through 9 show the parameter estimates at the completed training stage of the different models.

Table 3: Model training results

Model    Latent variables   Validation error   Log-likelihood   ρ2      No. of params   BIC
MNL      (baseline)         0.4454             -206808          0.546   273             416915
C-RBM    J = 2              0.4360             -203558          0.553   341             411237
C-RBM    J = 4              0.4338             -202066          0.556   409             409075
C-RBM    J = 8              0.4323             -200846          0.559   545             408279
C-RBM    J = 16             0.4318             -200223          0.560   817             410321

The Hinton matrices show the influence of each independent variable on each alternative or latent variable. Statistically significant (>95% confidence bound) parameters are highlighted in blue. The values along the x-axis are normalized with zero mean and unit variance. The 13 financial product choices are listed on the y-axis. The estimated parameters and biases of the C-RBM prediction model, $B$, $D$ and $c$, are projected onto the Hinton diagrams (Figs. 6a, 7a, 8a and 9a), while the parameters $A$ and $d$, representing the parameters and bias of the latent variables with respect to the alternatives, are shown in Figs. 6b, 7b, 8b and 9b. $c$ and $d$ are the constants with respect to the observed and hidden layers respectively. The sign and value of each parameter correspond to the colour and size of the patches in the matrix, with white and black representing positive and negative signs respectively. The statistical significance (t-test) of each parameter is calculated as $\theta/\sqrt{\sigma}$, where $\sigma$ is the inverse of the Hessian of the log likelihood, with a sample size adjustment with respect to the parameters.
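A hedged sketch of this significance test, assuming the parameter variances are taken from the diagonal of the inverse Hessian of the negative log likelihood; the paper's additional sample-size adjustment is not reproduced here:

```python
import numpy as np

def t_statistics(theta, hessian):
    """t-statistics theta / sqrt(sigma), with sigma the parameter variances from the
    inverse Hessian of the negative log likelihood at the optimum."""
    sigma = np.diag(np.linalg.inv(hessian))
    return theta / np.sqrt(sigma)

# Parameters with |t| > 1.96 are significant at the 95% confidence bound
# (the blue patches in the Hinton diagrams).
```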

6 Analysis

6.1 Characteristics of latent variables

We can characterize each hidden unit by the significance and strengths represented by the weights $D^\top$. $D^\top$ is a parameter vector that indicates the linear contribution of each latent variable, together with a constant $d$, such that each alternative can be described as a utility function of the latent variables: $y = Dh + d$. For example, C-RBM-2 latent variable hidden1 is characterized by individuals who are of working age, non-EU foreign citizens, with non-VIP status and who do not own any special accounts. We can therefore infer that this latent variable indicates a ‘savings driven attitude’ (see Fig. 6b). From the model results, the population with such characteristics has a positive preference for purchasing a payroll product and a low motivation for purchasing a (credit/debit) card product, as indicated in Fig. 6b. Likewise, latent variable hidden2 is represented by older, loyal customers who are VIPs and have held various account types over their lifetime. This latent variable can be inferred as a ‘self-reliance attitude’ and indicates the population who are less likely to purchase long-term deposits, funds, securities and card products.

The C-RBM with latent variables outperforms the MNL model; however, the performance increase from raising the number of latent variables past 4 is small. This suggests that the upper bound of latent representation capacity is reached with just a small number of latent variables. Using 2 or 4 latent variables would be sufficient for a significant improvement over an MNL structure. From the presented results, it is clear that the C-RBM models differ significantly from the MNL model in terms of parameters which are strong and significant. This result seems to be broad-based, in the sense that it is not dictated by the number of hidden units, and it signifies that the observed distribution has some latent factors that can be explored.

However, we should mention that the training parameter initialization may have a small random effect on the model. Note that in the parameter plots, the signs and strength contributions to the choice model differ from model to model, which may indicate that model training gets stuck at a local optimum. This also suggests that the hidden and observed layers have different scales (Glorot and Bengio, 2010). He et al. (2015) suggest increasing the learning rate to improve convergence, but that would result in overgeneralization and a loss of expressive power in the hidden units. We posit that a middle-of-the-road solution should have adequate model accuracy and generalization over a large population.

Table 4: Parameter sensitivity rank and standard error difference for estimated parameters B for sampling-based sensitivity analysis

                 C-RBM 2 LV                C-RBM 4 LV             C-RBM 8 LV             C-RBM 16 LV
parameter      n   nS  std.err.%diff     n   nS  std.err.%diff  n   nS  std.err.%diff  n   nS  std.err.%diff
βage           15  15  49.30             15  12  0.52           11  11  0.99           11  12  0.64
βloyalty       18  14  59.36             14  15  0.38           15  15  0.82           15  17  0.48
βincome        3   3   3712.99           3   3   26.67          3   3   43.00          3   2   35.82
βsex           12  13  67.51             13  14  0.41           14  13  0.91           14  15  0.52
βemployee      5   2   4267.79           2   4   13.74          5   5   21.27          4   4   33.74
βactive        21  16  47.92             16  19  0.20           19  19  0.34           19  19  0.26
βnew_cust      6   12  53.93             12  7   1.49           8   8   1.34           8   9   0.91
βresident      16  20  16.61             20  20  0.19           20  20  0.31           20  20  0.23
βforeigner     8   17  29.15             17  8   1.43           9   10  0.76           7   7   1.35
βeuropean      17  20  16.62             20  20  0.19           21  20  0.31           21  20  0.23
βvip           20  10  122.66            10  16  0.33           16  12  0.99           16  13  0.68
βsavings       1   1   34177.13          1   1   258.41         2   1   255.12         2   1   181.81
βcurrent       7   11  64.19             11  13  0.41           12  18  0.38           12  16  0.39
βderivada      4   4   3112.38           4   5   4.70           4   4   19.67          5   5   2.82
βpayroll_acc   9   18  24.91             18  18  0.29           18  17  0.52           18  18  0.41
βjunior        2   5   1759.26           5   2   58.29          1   2   45.32          1   3   22.43
βmasparti      11  7   185.94            7   9   1.41           7   6   4.99           9   6   2.29
βparticular    14  8   166.56            8   11  0.61           13  14  0.83           13  14  0.53
βpartiplus     10  6   189.75            6   10  0.65           10  9   1.51           10  10  0.86
βe_acc         19  9   159.38            9   17  0.33           17  16  0.82           17  11  0.91
bias           13  19  19.07             19  6   3.17           6   7   3.35           6   8   0.48

(n: sensitivity rank on the full sample; nS: rank on the 10% subsample; std. err. % diff.: percentage difference in standard errors between the two.)

We performed a 2-fold cross-validation analysis and determined that the residual from the model fit is not significant; therefore the model is robust to changes in the input data. This is further confirmed by the sensitivity analysis presented in the following section. In the parameter plots, we can see that the values and signs correspond to the strength of each variable. For instance, the parameters for the Guarantees choice are not significant, since its share in the distribution is very low (0.002%). The latent models show similar results. For the C-RBM with 2 and 4 hidden units, almost all of the parameters are significant, except for the income, employee, savings, derivada and junior variables. This can be attributed to their small mean values (and high deviations).

6.2 Sensitivity of parameter estimates

The versatility and effectiveness of the parameter estimates are determined by a sensitivity analysis of the model output. Methods of sensitivity analysis include variance-based estimators, sampling-based approaches and differential analysis (Helton et al., 1991; Saltelli et al., 2000, 2010). “Sensitive” parameters are those whose uncertainty contributes substantially to the test results (Helton et al., 1991; Hamby, 1994). The model is sensitive to an input parameter when the variability associated with that input results in a large output variability. Sensitivity ranking sorts the input parameters by the amount of influence each has on the model output, and the disagreement between rankings measures the parameter sensitivity to changes in the input (Hamby, 1994).

We first define a list of parameters used in the model by their standard errors calculated over the full dataset. In large-dataset sensitivity analysis, a key concern is the computational cost needed to complete the analysis; hence we use a sampling-based approach as a cheap estimator of the percentage difference between the minimum and maximum parameter values. Random sampling (e.g. simple random sampling, Monte Carlo, etc.) generates distributions of inputs and outputs to assess model uncertainties (Helton et al., 1991). Analyzing the sampling effects can provide information on the overall model performance, since parameter sensitivity depends on all the parameters to which the model is sensitive and therefore on the importance of each parameter (Hamby, 1994).

Consider that the C-RBM model is represented by $y = f(x, h)$, where $x$ and $h$ are the input vectors of observed and latent variables respectively and $y$ is the model output. We suppose that the model $f(\cdot)$ is a complex, highly non-linear function, such that we cannot completely define the way the C-RBM model responds to changes in the input variables. Also, $h$ is dependent on $x$ through a sub-model, as previously shown in Fig. 2. The analysis involves an independently and randomly generated sample of size $n_S = 0.1n$ (a 10% random sample draw), where $n = 76{,}141$ is the total number of observations. The model performance is considered via the sampling stability of the variable parameters. Sensitivities are also assessed for the number of hidden units used in generating the C-RBM models, which indicates what number of latent variables (a hyperparameter) is required for model identifiability. Since the model was applied using a multinomial logit approach instead of a conditional logit, this resulted in a very large number of parameters. Thus, the effect of relative changes on the number of distributed parameters gives the range of variance across each explanatory variable and each number of hidden units used.

Table 4 shows the effects of sampling on the sensitivity and stability of the observed model parameters with respect to the theoretical values and the number of latent variables. Notice that the relative difference in standard errors between the full and sampled models decreases as the number of latent variables increases. This shows that C-RBM models with many synthetic latent variables are robust to changes in the input values through sampling. Additionally, the parameter sensitivity rank across variables also becomes more consistent; therefore, we show that RBM models are efficient in obtaining good latent variables with low generalization error. The significant decrease in the standard error difference from 2 LV to 4 LV may indicate that the number of latent variables used in the models has a lower bound on the generalization error, which implies that we need careful consideration of $h$ to obtain efficient yet accurate values of $\beta$ without losing model interpretability.
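The sampling procedure can be sketched as follows; `estimate_fn` is a hypothetical routine (not from the paper) that fits the model and returns parameter estimates together with their standard errors:

```python
import numpy as np

def sampling_sensitivity(estimate_fn, X, y, frac=0.1, seed=0):
    """Sketch of the sampling-based analysis above: re-estimate the model on a
    random subsample of size n_S = frac * n and compare ranks and standard errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_n, se_n = estimate_fn(X, y)                    # full sample of size n
    idx = rng.choice(n, size=int(frac * n), replace=False)
    beta_s, se_s = estimate_fn(X[idx], y[idx])          # random draw of size n_S = 0.1n

    def rank(b):                                        # 1 = most influential parameter
        return np.argsort(np.argsort(-np.abs(b))) + 1

    se_pct_diff = 100.0 * np.abs(se_s - se_n) / se_n    # std. err. % difference (Table 4)
    return rank(beta_n), rank(beta_s), se_pct_diff
```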

7 Conclusion

This study analyzes alternative means of latent behaviour modelling in the absence of attitudinal indicators. In ICLV models, specialized surveys have to be constructed with attitudinal questions to model the latent effects on decisions. While this has been one of the more popular methods in discrete choice analysis, it has several disadvantages. First, attitudinal questions are subjective, and the underlying behaviours are subject to change over time. Second, existing datasets that have no attitudinal questions cannot leverage the ICLV model, and thus the latent effects cannot be utilized.

We explore generative modelling of the choice distribution to uncover latent variables using machine learning methods, without measurement indicators. We hypothesized that latent effects can be obtained not only from attitudinal questions, but also from the posterior choice distribution. In effect, we are modelling latent components that fit the real choice distribution, rather than achieving good statistics on subjective models. For example, there could be some mean behaviour that dictates a more probable influence on purchases given some latent variables. For this method to be effective, certain conditions have to be present. First, it should be difficult to obtain a good discriminative prediction result using only the provided explanatory variables; in this scenario, the C-RBM models were able to learn good latent variable representations and improve the model fit and prediction accuracy while keeping the latent variables inferrable. Second, when the data lack attitudinal survey responses, this method can find latent effects without the use of subjective measurement indicators.

The current limitations of this study are the absence of choice or explanatory variable dynamics, i.e. changes over time or multiple choices for the same individual were not considered, but they can be brought in.

The underlying RBM is capable of handling dynamics. We hypothesize that this may improve the model significantly, but we are still looking for ways to incorporate dynamics into our C-RBM model. In recent studies, we have seen dynamic frameworks such as recurrent neural networks used in modelling temporal data (Taylor et al., 2007; Mnih et al., 2012). Finally, it is worth noting that as the number of latent variables increases, the number of estimated parameters increases exponentially. This will pose problems in large datasets, and the ability to reduce dimensionality would give a significant benefit for the efficient use of model parameters. In our observation, performing cross-validation or model selection with the lowest validation error is a justifiable method to prevent overfitting when using all the parameters. In the future, we would also look at the possibility of introducing deep learning architectures to choice modelling by stacking RBMs (Otsuka and Osogami, 2016).

While the ICLV model is optimized to predict the effects of latent constructs on the choice model, using measurement indicators to guide the selection of latent parameters, our method uses the observed decisions as the source of influence for optimizing the latent variables through machine learning. This is not to say that we disagree with using measurement indicators: although they may often be subjective and may raise mis-specification problems, ICLV models can improve the latent effects on choice models when the explanatory variables are poor predictors (Vij and Walker, 2016). However, latent effects may be present not only in attitudes and perceptions, but also in the direct observation of choices. Our current work explores the use of the posterior choice distribution for latent behaviour modelling. Generative modelling in discrete choice analysis is inspired by state-of-the-art machine learning algorithms that perform unsupervised feature extraction from the unlabelled data used in classification problems (Hinton et al., 1984).

In circumstances where attitudinal variables are not available, we have a strong reason to believe that the generation of latent factors is important and effective in building a discrete choice model. A future study of interest would be to extend this method to datasets with attitudinal questions and surveys, for example, an inter-city rail survey (Sobhani and Farooq, 2017), and to perform an analysis with both RBM and ICLV methods to obtain the generalization error of attitudinal survey models. Such a comparative study would provide a foundation for the analysis of various latent behaviour models through graphical and algorithmic methods, and would provide guidance not only in selecting the appropriate latent variables, but also in directing research effort to more promising directions.

Acknowledgements

This research is funded in part by the Ryerson University PhD Fellowship and by the Fonds de recherche du Québec - Nature et technologies (FRQ-NT) team grant No. 2016-PR-189250.


Figure 5: MNL model parameters. White: +ve values, Black: -ve values, Blue: >95% significant

Figure 6: (a) C-RBM model with 2 latent variables. (b) Latent variable relationship parameters. White: +ve values, Black: -ve values, Blue: >95% significant

Figure 7: (a) C-RBM model with 4 latent variables. (b) Latent variable relationship parameters. White: +ve values, Black: -ve values, Blue: >95% significant

Figure 8: (a) C-RBM model with 8 latent variables. (b) Latent variable relationship parameters. White: +ve values, Black: -ve values, Blue: >95% significant

Figure 9: (a) C-RBM model with 16 latent variables. (b) Latent variable relationship parameters. White: +ve values, Black: -ve values, Blue: >95% significant

References

C. G. Chorus, M. Kroesen, On the (im-)possibility of deriving transport policy implications from hybrid choice models, Transport Policy 36 (2014) 217–222.
A. Vij, J. L. Walker, How, when and why integrated choice and latent variable models are latently useful, Transportation Research Part B: Methodological 90 (2016) 192–217.
C. M. Rungie, L. V. Coote, J. J. Louviere, Latent variables in discrete choice experiments, Journal of Choice Modelling 5 (2012) 145–156.
N. Le Roux, Y. Bengio, Representational power of restricted Boltzmann machines and deep belief networks, Neural Computation 20 (2008) 1631–1649.
K. P. Burnham, D. R. Anderson, Model selection and multimodel inference: a practical information-theoretic approach, Springer Science & Business Media, 2003.
M. Ben-Akiva, M. Bierlaire, Discrete choice methods and their applications to short term travel decisions, in: Handbook of Transportation Science, Springer, 1999, pp. 5–33.
B. Atasoy, A. Glerum, M. Bierlaire, Attitudes towards mode choice in Switzerland, disP-The Planning Review 49 (2013) 101–117.
C. R. Bhat, S. K. Dubey, A new estimation approach to integrate latent psychological constructs in choice modeling, Transportation Research Part B: Methodological 67 (2014) 68–85.
L. Breiman, et al., Statistical modeling: The two cultures, Statistical Science 16 (2001) 199–231.
B. Eric, N. D. Freitas, A. Ghosh, Active preference learning with discrete choice data, in: Advances in Neural Information Processing Systems, pp. 409–416.
K. Ashok, W. R. Dillon, S. Yuan, Extending discrete choice models to incorporate attitudinal and other latent variables, Journal of Marketing Research 39 (2002) 31–46.
E. Morey, J. Thacher, W. Breffle, Using angler characteristics and attitudinal data to identify environmental preference classes: a latent-class model, Environmental and Resource Economics 34 (2006) 91–115.
A. Hackbarth, R. Madlener, Consumer preferences for alternative fuel vehicles: A discrete choice analysis, Transportation Research Part D: Transport and Environment 25 (2013) 5–17.
A. Yazdizadeh, B. Farooq, Z. Patterson, A. Rezaei, A generic form for capturing unobserved heterogeneity in discrete choice modeling: Application to neighborhood location choice, in: Transportation Research Board 96th Annual Meeting, 17-05144.
S. Hess, A. Daly, Handbook of Choice Modelling, Edward Elgar Publishing, 2014.
D. A. Hensher, W. H. Greene, The mixed logit model: the state of practice, Transportation 30 (2003) 133–176.
A. Glerum, L. Stankovikj, M. Thémans, M. Bierlaire, Forecasting the demand for electric vehicles: Accounting for attitudes and perceptions, Transportation Science 48 (2014).
S. Hess, J. Shires, A. Jopson, Accommodating underlying pro-environmental attitudes in a rail travel context: application of a latent variable latent class specification, Transportation Research Part D: Transport and Environment 25 (2013) 42–48.


R. Maldonado-Hinarejos, A. Sivakumar, J. W. Polak, Exploring the role of individual attitudes and perceptions in predicting the demand for cycling: a hybrid choice modelling approach, Transportation 41 (2014) 1287–1304.
J. Kim, S. Rasouli, H. Timmermans, Expanding scope of hybrid choice models allowing for mixture of social influences and latent attitudes: Application to intended purchase of electric cars, Transportation Research Part A: Policy and Practice 69 (2014) 71–85.
R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, ACM, pp. 791–798.
H. Larochelle, Y. Bengio, Classification using discriminative restricted Boltzmann machines (2008) 536–543.
G. Poucin, B. Farooq, Z. Patterson, Pedestrian activity pattern mining in WiFi-network connection data, in: Transportation Research Board 95th Annual Meeting, 16-5846.
N. K. Ahmed, A. F. Atiya, N. E. Gayar, H. El-Shishiny, An empirical comparison of machine learning models for time series forecasting, Econometric Reviews 29 (2010) 594–621.
A. Rosenfeld, I. Zuckerman, A. Azaria, S. Kraus, Combining psychological models with machine learning to better predict people’s decisions, Synthese 189 (2012) 81–93.
T. Osogami, M. Otsuka, Restricted Boltzmann machines modeling human choice, Advances in Neural Information Processing Systems (2014) 73–81.
M. Aggarwal, On learning of choice models with interactive attributes, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 2697–2708.
D. Donoho, 50 years of data science, Tukey Centennial Workshop (2015).
M. Ben-Akiva, D. McFadden, K. Train, J. Walker, C. Bhat, M. Bierlaire, D. Bolduc, A. Boersch-Supan, D. Brownstone, D. S. Bunch, et al., Hybrid choice models: progress and challenges, Marketing Letters 13 (2002) 163–175.
G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507.
M. Wong, B. Farooq, G.-A. Bilodeau, Next direction route choice model for cyclist using panel data, in: 51st Annual Conference of Canadian Transportation Research Forum.
J. K. Vermunt, J. Magidson, Latent class cluster analysis, Applied Latent Class Analysis 11 (2002) 89–106.
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
G. E. Hinton, T. J. Sejnowski, D. H. Ackley, Boltzmann machines: Constraint satisfaction networks that learn, Carnegie-Mellon University, Department of Computer Science, Pittsburgh, PA, 1984.
G. Hinton, A practical guide to training restricted Boltzmann machines, Momentum 9 (2010) 926.
A. Y. Ng, M. I. Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes, Advances in Neural Information Processing Systems 2 (2002) 841–848.
G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (2002) 1771–1800.
M. A. Carreira-Perpinan, G. E. Hinton, On contrastive divergence learning, in: AISTATS, volume 10, 2005, pp. 33–40.

V. Mnih, H. Larochelle, G. E. Hinton, Conditional restricted Boltzmann machines for structured output prediction, arXiv preprint arXiv:1202.3748 (2012).
G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006) 1527–1554.
F. J. Bremner, S. J. Gotts, D. L. Denham, Hinton diagrams: Viewing connection strengths in neural networks, Behavior Research Methods, Instruments, & Computers 26 (1994) 215–218.
X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: AISTATS, volume 9, pp. 249–256.
K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
J. C. Helton, J. W. Garner, R. D. McCurley, D. K. Rudeen, Sensitivity analysis techniques and results for performance assessment at the Waste Isolation Pilot Plant, Technical Report, Sandia National Labs., Albuquerque, NM (USA); Arizona State Univ., Tempe, AZ (USA), 1991.
A. Saltelli, K. Chan, E. M. Scott, et al., Sensitivity Analysis, volume 1, Wiley, New York, 2000.
A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, S. Tarantola, Variance based sensitivity analysis of model output: design and estimator for the total sensitivity index, Computer Physics Communications 181 (2010) 259–270.
D. Hamby, A review of techniques for parameter sensitivity analysis of environmental models, Environmental Monitoring and Assessment 32 (1994) 135–154.
G. W. Taylor, G. E. Hinton, S. T. Roweis, Modeling human motion using binary latent variables, Advances in Neural Information Processing Systems 19 (2007) 1345.
M. Otsuka, T. Osogami, A deep choice model, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI Press, pp. 850–856.
A. Sobhani, B. Farooq, Innovative intercity transport mode: Application of choice preference integrated with attributes nonattendance and value learning, in: 21st International Federation of Operational Research Societies, Québec City.
