When crowds hold privileges: Bayesian unsupervised representation learning with oracle constraints

arXiv:1506.05011v1 [stat.ML] 16 Jun 2015

Theofanis Karaletsos
Computational Biology Program, Sloan Kettering Institute
1275 York Avenue, New York, USA
[email protected]

Serge Belongie
Cornell Tech
111 Eighth Avenue #302, New York, USA
[email protected]

Gunnar Rätsch
Computational Biology Program, Sloan Kettering Institute
1275 York Avenue, New York, USA
[email protected]

Abstract

Representation learning systems typically rely on massive amounts of labeled data in order to be trained effectively. Recently, high-dimensional parametric models like convolutional neural networks have succeeded in building rich representations using compressive, reconstructive, or supervised criteria. However, the semantic structure inherent in observations is oftentimes lost in the process. Human perception excels at understanding semantics, but this understanding cannot always be expressed in terms of labels. Human-in-the-loop systems like crowdsourcing are often employed to generate similarity constraints using an implicit similarity function encoded in human perception. We propose to combine generative unsupervised feature learning with learning from similarity orderings in order to learn models which take advantage of privileged information coming from the crowd. We use a fast variational algorithm to learn the model on standard datasets and demonstrate applicability to two image datasets, where classification is drastically improved. We show how triplet samples from the crowd can supplement labels as a source of information to shape latent spaces with rich semantic information.

1 Introduction

In recent years, computer vision has made tremendous strides spurred by two major developments. On one hand, the ability to model large quantities of data with layered non-linear feature-learning systems has facilitated learning rich visual systems that can be used for purposes such as classification and understanding of images, scenes, and videos. On the other hand, crowdsourcing has progressed in its practical application as a tool that uses representations of crowd decisions, typically employing kernel or multiview systems, to encode subtle human knowledge into automated reasoning systems.

Oftentimes, especially in the case of perception, the structure of the data generating process is unknown, and large amounts of accurate labels are hard to come by or may even be an inadequate form of knowledge representation. A common approach in such cases is to use crowdsourcing to gather cheap information about the data from human annotators. These annotations can take the form of labels, which are frequently noisy and unreliable when interrogating crowds of non-experts, or can alternatively yield similarities between percepts. In the latter case, human annotators are tasked with deciding which one of a set of objects is most similar to a target object. We can understand this process as a series of steps. Initially, an annotator infers lower-dimensional representations of observed datapoints coming from an unspecified generative process using human perception. Subsequently, question-specific similarity functions are applied depending on the task. The comparison then results in a decision regarding which object is more similar to the target with respect to the question. Throughout this process, the perceiving system itself and the similarity functions are not known directly, but downstream samples using these functions can be observed in the form of decisions. We thus consider this a process of sampling from an oracle distribution with unknown structure.

In this paper, we propose to learn a flexible graphical model with latent variables that mimics this process by treating the generated similarity objects as observations generated from the model. We show that the parameters of a rich statistical model can be learned from human observations and help train models to produce interpretable results which exceed the performance of purely unsupervised learning, and that the approach can be applied in cases where labels are sparse or suboptimal as a representation of prior knowledge. In order to provide speed and robustness, we use a fast variational inference algorithm.

1.1 Relationship to other work

Similarity-based learning, for instance via crowdsourcing, has been tackled in various ways in the community before. Notably, crowd kernels are inferred and used for various vision tasks in [1], which assumes a fixed Student-t structure to produce an embedding using similarity constraints from a crowd. We assume that an embedding containing the percepts z is given by a conditional probability density p(z|x). Rather than fixing the shape of this density, for instance using a Student-t distribution, we attempt to learn it from data. In [2], a metric respecting the particular distances in similarity is learned. This differs from the case we are studying, as it assumes that certain distances or similarities are observed, which is harder to ask of a weak oracle than orderings of similarities. In [3], a probabilistic treatment for triplets is introduced and an adaptive crowd kernel is learned without specific visual features in mind. While we also adopt a probabilistic treatment of triplets, we additionally learn an adaptive feature representation for the images compared by the crowd.

Flexible nonlinear models have been employed in a variety of situations to learn representations for data. A key result in relation to this work is the Siamese network [4], which uses discriminatively learned features and refines them using a loss attached to the encodings of multiple network branches with shared weights over the compared images. Similar approaches have been used in [5, 6], where the use of supervised features with crowd-inferred similarities boosts performance in face classification and more generic fine-grained visual categorization tasks. The key difference to our work is twofold: first, we focus on an unsupervised approach, where features are learned from images without labels and feature learning is additionally guided by the crowd; second, we introduce a probabilistic generative model which provides a joint model of all these components and their interactions.

Generative models in representation learning have recently made rapid progress using variational inference [7, 8, 9]. These techniques allow fast learning of directed graphical models and have been a major stepping stone in combining deep learning with graphical models. However, not much work has been done to use them as components in larger models of data rather than for density estimation. Notably, in [10] these approaches are used to achieve state-of-the-art results in semi-supervised learning. We identify that as a setting related to ours: using the crowd we can obtain weak supervision in the form of similarity constraints over a sparse subset of the data, while most of the data is not subject to observed constraints. In [11], deep generative models are used with constraints on the latent space to increase the specificity of latent variables, which is a goal we share but tackle using the crowd information as a regularizer. Constraints on latent variable models in an otherwise unsupervised setting also found early usage in the context of Gaussian Processes [12] using back-constraints, but those differ in nature from the constraints we consider here.

[Figure 1: (a) Triplet Generation From the Crowd; (b) Our model: Oracle-Prioritized Belief Network]

Figure 1: Figure 1a shows a sketch of the process of sampling triplets from the crowd given a question Q. Figure 1b shows the model we introduce, which treats triplets as given (shaded) and uses the specified likelihood to generate them. The key difference is that the unobserved generative and inference models of Figure 1a are approximated by explicit parametric models θ and φ.

Finally, a connection also exists with Vapnik's privileged learning framework [13], where in a supervised setting improved classifiers can be learned if privileged information in the form of additional features is present at training time. Borrowing this terminology, we consider the similarity constraints to be a sparse privilege conveyed by the crowd oracle and aim to learn a student model with an improved understanding of the data.

2 Methods

We tackle the problem of learning from an oracle, such as a crowd, that provides weak supervision in the form of similarity constraints, and of embedding the observed data into an informative space. Our goal is to infer a model which learns jointly from triplets and observed data and transfers the implicit perceptual biases encoded in the triplets into an explicit latent space which captures the semantics of the triplet-generating process (see Figure 1a) better than simple density estimation. We proceed to present the two key contributions needed to perform this task: first, a novel probabilistic treatment of crowd triplets, and second, a principled way to combine the probabilistic model of crowd triplets with a graphical model performing nonlinear feature learning in order to transfer the implicit triplet knowledge into an explicit parametric model. For the remainder of this section, let x ∈ R^{N×D} denote N observations with D dimensions. We define percepts z ∈ R^{N×J} corresponding to datapoints x to be low-dimensional representations of the datapoints.

2.1 Representing Oracle Triplets Probabilistically

Typically, the (dis)similarity function s(·) giving the distance between the percepts of two objects, s(x_j, x_i), is not explicitly available, but an oracle such as the human visual system can be used to infer an ordering over (dis)similarities of groups of objects by comparing the associated observed data x_j and x_i. Using these relationships, a crowd annotator can select the most similar objects to a given target. Internally, the oracle evaluates s(x_j, x_i) and s(x_l, x_i) and then reports which percept is closer to the target x_i from a small selection of candidates. Repeated application of this procedure yields crowd triplets, formalizable as:

T = {(i, j, l) | x_j is more similar to x_i than x_l is}.    (1)

We treat this data as a privileged source of information arriving from an oracle distribution p_oracle(t|x) which we can sample from, but whose internal structure we do not know. In order to learn a function mimicking the oracle, we will introduce a flexible latent variable model which approximates the oracle using a learned mapping p_φ(t|x) = ∫_z p(t|z) p_φ(z|x) dz.

We assume we are given a conditional probability density p(z|x) over percepts z, which denotes an arbitrary function that generates lower-dimensional representations with some uncertainty given an input stimulus x. Since our goal is to model crowd triplets, we observe that s(x_j, x_i) = g(p(z_j|x_j), p(z_i|x_i)) for an appropriate function g if z_i captures the relevant information about x_i, and we can thus replace x with z when defining triplets. A motivation for doing so is that when humans judge perceptual similarity they rely heavily on internal representations rather than purely on raw low-level image statistics. Other forms of oracle data could stem from model structure, such as temporal or spatial orderings or other known semantic structure, which can also be used to provide weak natural constraints on similarity without explicitly quantifying it.

Intuitively, we wish to model the distances in the latent space between the percepts in order to define a likelihood over percepts. We proceed to express the likelihood in terms of an information-theoretic measure of divergence which respects the distribution of the latent variables z capturing coordinates in latent space. This not only adds flexibility by allowing triplet embedding in the context of a variety of latent variable models with arbitrary latent distributions apart from the Gaussian, but additionally presents a principled way of respecting uncertainty and the shape of the latent manifold when comparing data. In previous work, typically normalized Euclidean norms or fixed learned distance metrics were used to model triplets. In our formulation, the location and associated uncertainty of the latent variable for each datapoint can be different depending on the distribution of p(z|x).

One way to measure belief in triplets in a generative model is to consider the question whether the distance between the target and the given datapoint x_j is smaller than the distance between the target and the point x_l. Formalizing this as a draw from a Bernoulli distribution over the states True and False, parametrized using a softmax function, gives the following likelihood, denoted DIST in the following:

p(t_{i,j,l} = True | i, j, l) = ∫_z p(t_{i,j,l} | z_i, z_j, z_l) p(z) dz = e^{KL(p(z_l)||p(z_i))} / (e^{KL(p(z_j)||p(z_i))} + e^{KL(p(z_l)||p(z_i))})    (2)

with

KL(q||p) = ∫_z q(z) log(q(z)/p(z)) dz

for continuous distributions over z. While this can model the full distribution over a triplet accurately and can serve as a loss, we need to consider the assumptions being made. Here, we assume that a triplet t has a higher probability of being true if relative distances are maximized between percepts. However, a subtlety of the acquisition process for the triplets is that annotators are not asked to provide distances, but just binary statements over similarity rankings based on the prompted question. We encode this relaxation into a similar alternative likelihood which truncates the probability space as follows:

p(t_{i,j,l} = True | i, j, l) = { 1,               if D_{i,j,l} ≤ 0
                                { e^{−D_{i,j,l}},  if D_{i,j,l} > 0    (3)

with D_{i,j,l} = KL(p(z_j)||p(z_i)) − KL(p(z_l)||p(z_i)). In order to make this more amenable to gradient-based learning, we smooth out the thresholded probability distribution using a softplus function f(x) = ln(1 + e^x); we denote this variant SOFTPLUS. The benefit of this formulation is that it smoothly penalizes unmet constraints and provides flexibility in modeling the relative orderings of z as long as the triplet constraints hold.
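To make the two likelihoods concrete, here is a minimal sketch (our illustration under the assumption of diagonal-Gaussian percepts, not the authors' code; all function names and the (mean, variance) tuple convention are our own) that evaluates DIST and SOFTPLUS given the parameters of p(z_i), p(z_j), and p(z_l), using the closed-form KL divergence between diagonal Gaussians:

```python
# Sketch: triplet likelihoods of Eqs. 2-3 for diagonal-Gaussian percepts.
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)))."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def dist_probability(gauss_i, gauss_j, gauss_l):
    """DIST (Eq. 2): probability that x_j is closer to target x_i than x_l is.

    Each gauss_* argument is a (mean, variance) pair of arrays.
    """
    kl_j = kl_diag_gauss(*gauss_j, *gauss_i)  # KL(p(z_j) || p(z_i))
    kl_l = kl_diag_gauss(*gauss_l, *gauss_i)  # KL(p(z_l) || p(z_i))
    # Equal to the softmax e^{kl_l} / (e^{kl_j} + e^{kl_l}) of Eq. 2,
    # rewritten as a sigmoid for numerical stability.
    return 1.0 / (1.0 + np.exp(kl_j - kl_l))

def softplus_log_likelihood(gauss_i, gauss_j, gauss_l):
    """SOFTPLUS (smoothed Eq. 3): log p(t) = -softplus(D_{i,j,l})."""
    d = kl_diag_gauss(*gauss_j, *gauss_i) - kl_diag_gauss(*gauss_l, *gauss_i)
    return -np.logaddexp(0.0, d)  # -ln(1 + e^D); only D > 0 is penalized strongly

# Example: percept j sits much closer to the target i than percept l does.
mu = np.zeros(2)
gi, gj, gl = (mu, np.ones(2)), (mu + 0.1, np.ones(2)), (mu + 3.0, np.ones(2))
print(dist_probability(gi, gj, gl))  # close to 1: the triplet is likely True
```

Note that the log-likelihood of the smoothed Eq. 3 reduces to −softplus(D_{i,j,l}), so unviolated constraints (D ≤ 0) contribute almost no penalty, as intended.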

2.2 Oracle-Prioritized Belief Network

After having established an observation model for triplets, we can proceed to introduce the full generative process. Instead of relying on a supervised model taking observations x as input, we prefer a generative approach that models the joint density of both data and triplets, providing an unsupervised model which requires only input data and samples from an oracle to be trained. We use a belief network with exponential-family latent variables and a generative mapping given by a multi-layer perceptron (MLP) transforming them into parameters for exponential-family observation models, similar to the models used in [7, 8, 9]. However, we introduce a second observation term apart from the datapoints x, namely the triplets t. Triplets require multiple samples from the prior to be drawn, as they are defined over multiple objects jointly. Similar to the inference model in the Siamese network [4], this necessitates multiple instances of the model with shared parameters working in coordination to generate a triplet. We sketch the generative model in Figure 1b. We denote by θ the parameters of the MLP of the generative model. While we can use any exponential family in the context of this model, we focus on a Gaussian prior p(z) with J dimensions, taking the shape of a J-dimensional diagonal Gaussian as a latent variable. The observation likelihood depends on the dataset in question. For N datapoints and K triplets defined over them, the probabilistic model is given by:

p(x, t; θ) = ∫_z [ ∏_{n=1}^{N} p(z_n) p_θ(x_n|z_n) ] [ ∏_{k=1}^{K} p(t_k | z_{k_i}, z_{k_j}, z_{k_l}) ] dz    (4)
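As a concrete illustration of this generative process, the short sketch below draws percepts from the prior and decodes them with a stand-in MLP; the weight matrices and layer sizes are hypothetical placeholders for the learned parameters θ, and a triplet observation would additionally be emitted by applying one of the triplet likelihoods above:

```python
# Minimal sketch of ancestral sampling from Eq. 4 (illustrative stand-in for
# the learned generative MLP; W1, W2 are hypothetical parameters theta).
import numpy as np

rng = np.random.default_rng(0)
J, H, D = 50, 400, 1024           # latent, hidden, and observed dimensions
W1 = rng.normal(0.0, 0.01, (J, H))
W2 = rng.normal(0.0, 0.01, (H, D))

def decode_mean(z):
    """MLP mapping a latent sample to the mean of a Gaussian over observations."""
    return np.tanh(z @ W1) @ W2

# Draw the percepts of a triplet from the prior and decode them into datapoints.
z_i, z_j, z_l = rng.standard_normal((3, J))
x_i, x_j, x_l = (decode_mean(z) for z in (z_i, z_j, z_l))
# t ~ p(t | z_i, z_j, z_l) would then be drawn via the DIST likelihood (Eq. 2).
```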

Triplets tie together multiple datapoints and capture their dependencies through the latent representations. This has the effect of attaching potentials to these triplets of latent observations, which the model can use for regularization and guidance. It is noteworthy that the model learns the marginal likelihood p(x, t) by integrating out the latent variables z in between, with the exception of the parameters θ and φ. This directly maximizes the evidence coming from the crowd and the observations, while maintaining flexibility for the model in between, which needs to respect the reconstructive cost for the datapoints, the generative cost for the triplets, and the prior on the latent variables when generating samples.

2.3 Learning Using Fast Variational Inference

Our goal is to maximize the marginal likelihood of the evidence, log p_θ(x, t), in order to learn a good mapping capturing the dependencies between observations x and triplets t. This involves integrating out the latent variables, which is in general analytically intractable in highly flexible model classes. In order to perform efficient learning and inference in the model given by Equation 4, we resort to approximate inference methods and employ doubly stochastic variational inference [7, 8, 14], which is highly efficient and provably maximizes a lower bound to the evidence. Variational inference [15, 16] has become a standard tool in Bayesian modeling, as the speed benefits frequently greatly outweigh the loss in precision compared to Monte Carlo methods. In order to perform variational inference, we need to define an approximate distribution q(z) for the posterior distribution of the latent variables. We resort to amortized inference, using an inference network to learn a conditional variational distribution q_φ(z|x) given by an MLP parametrized by φ. This acts as the inference model, predicting the variational approximation to the posterior latent variables per input datapoint. Writing out the evidence lower bound (ELBO) yields the following equation:

log p_θ(x, t) = L(θ, φ; x, t) + KL(q(z) || p(z|x, t)) ≥ L(θ, φ; x, t)
  = −E_n[ KL(q_φ(z_n|x_n) || p_θ(z)) ] + E_n[ E_{q_φ(z|x)}[ log p_θ(x_n|z_n) ] ] + E_k[ E_{q_φ(z|x)}[ log p(t_k|z_{kijl}) ] ]    (5)

where kijl acts as an index on the matrix z selecting the corresponding datapoints. In principle, performing coordinate ascent on this lower bound suffices to infer the parameters of the model θ and of the inference network φ. However, the expectations over latent variables present in the ELBO are intractable. We resort to the reparametrization trick [7, 8, 14] and perform doubly stochastic variational inference by drawing L unbiased samples z^l from these expectations using the identity z^l = μ_φ + λ_φ · ε^l, where {μ_φ, λ_φ} are variational parameters predicted by the inference network and ε^l ∼ N(0, 1) are samples from a unit Gaussian. The new, differentiable bound then takes the shape:

L(θ, φ; x, t) = −E_n[ KL(q_φ(z_n|x_n) || p_θ(z)) ] + E_n[ (1/L) ∑_{l=1}^{L} log p_θ(x_n|z^l) ] + E_k[ E_{q_φ(z|x)}[ log p(t_k|z_{kijl}) ] ]    (6)
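The following is a minimal sketch (under our own naming conventions, not the paper's released code) of how a single stochastic estimate of the per-datapoint terms in Equation 6 can be formed; in practice, gradients with respect to θ and φ would come from automatic differentiation, as in the Theano implementation used here:

```python
# Sketch: one reparametrized Monte Carlo estimate of the per-datapoint terms
# in Eq. 6 (encoder outputs are stand-ins; gradients would come from autodiff).
import numpy as np

rng = np.random.default_rng(0)

def reparametrized_samples(mu, log_var, L=1):
    """z^l = mu_phi + lambda_phi * eps^l, with eps^l ~ N(0, I)."""
    eps = rng.standard_normal((L, mu.size))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_unit_prior(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

# Hypothetical inference-network outputs for one datapoint x_n:
mu_n = rng.standard_normal(50)
log_var_n = 0.1 * rng.standard_normal(50)

z_samples = reparametrized_samples(mu_n, log_var_n, L=5)
# The remaining terms plug in the decoder and a triplet likelihood, e.g.:
#   recon = np.mean([log_p_x_given_z(x_n, z) for z in z_samples])
#   bound_n = -kl_to_unit_prior(mu_n, log_var_n) + recon + triplet_term
```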

On this new objective we can now perform gradient-based learning by drawing minibatches with N_b datapoints and K_b triplets each time. It becomes evident that this approach has similarities with semi-supervised learning, as the sparse triplets carry extra information for some groups of samples, which carries over to the other datapoints and their combinations for which we have no oracle information. Upon closer inspection, we see that the components q_φ(z|x) and p_θ(x|z) form a variational autoencoder whose parameters have distilled the triplet information. This also clarifies where the transfer of information from the triplets to the learned parametric model happens. In simple terms, the formulation of the model forces the inference network to learn encodings respecting the triplets, and the model p_θ(x|z) to learn decodings which account for that shared information. We can thus consider the triplets to be an implicit teacher conveying privileged information to the model, which in turn distills that information into a reusable parametric inference network and generative model. The inference model with parameters φ can also act as a learned fine-grained metric over observations.

3 Results

We ran the oracle-prioritized belief network (OPBN) on multiple datasets, as well as an analogous, purely unsupervised variational autoencoder, to determine the performance of the model. Interestingly, the parameters learned by the OPBN can be used in the absence of triplets, as they essentially form an instance of a belief network that has seen privileged information. We use the encodings of both models to perform classification on held-out test data as well as to predict held-out triplets on unseen data. Finally, we also compare to a crowd-informed model with a Euclidean loss similar to [6], which can act as a symmetric proxy to our KL-based loss. In all experiments we used diagonal normal distributions as priors for the latent space and RMSProp with momentum [17] as the optimizer. We detail how the crowd triplets were generated in each corresponding section. All experiments were run on Graphics Processing Units (GPUs) using a Theano [18] implementation and did not take more than a few hours each. All classification experiments are performed using logistic regression, which is not part of the learning objective of the models, and for all triplet predictions DIST was used.

3.1 SVHN

The Street View House Numbers dataset [19] is a natural image dataset showing house numbers in a natural setting, comprised of 73257 train and 26032 test images. It can be seen as a complex counterpart to other digit datasets. We use the cropped version, which focuses on the digits, and perform density estimation using unsupervised learning. We learn fully unsupervised models of these images using a Gaussian observation model, 2 deterministic layers of 400 units each, and 50 latent variables. The deterministic layers use tanh nonlinearities. We call this model VAE, for variational autoencoder. We also train oracle-prioritized belief networks on the same data using the same architecture, with a crowd simulation based on the identity of the digits. We generate 70000 randomly chosen triplet constraints by requiring identically labeled images to be closer than differently labeled ones (a minimal sketch of this simulation is shown below). We run the models for 1000 iterations to obtain the reported solutions. For visual inspection see Figure 2a. In Table 1 we show that the oracle-informed models show clear benefits in classification accuracy, which is the desired result.
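A minimal sketch of this label-based crowd simulation (our reading of the described setup; the function name and sampling details are our own):

```python
# Sketch: label-based crowd simulation for SVHN (our reading of the setup).
import numpy as np

def simulate_label_triplets(labels, num_triplets, seed=0):
    """Sample (i, j, l) with labels[i] == labels[j] and labels[i] != labels[l]."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    triplets = []
    while len(triplets) < num_triplets:
        i, j, l = rng.integers(0, len(labels), size=3)
        if i != j and labels[i] == labels[j] and labels[i] != labels[l]:
            triplets.append((i, j, l))
    return np.array(triplets)

# e.g. the 70000 training constraints used above:
# T = simulate_label_triplets(svhn_train_labels, num_triplets=70000)
```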

[Figure 2: (a) SVHN Data; (b) Yale Faces Data]

Table 1: Comparison of metrics in SVHN

Data set                    VAE    OPBN-SOFTPLUS   OPBN-DIST
SVHN Classification %       23.3   29              30
SVHN Triplet Prediction %   50     37              36

3.2 Yale Faces

The Yale Faces dataset [20] version we used is comprised of 2414 images of 38 individuals under different lighting conditions. We split it into 300 test images and 2114 training images. The images were taken under controlled conditions using a lighting rig which allows the light sources to be varied in specific ways. The azimuth and elevation of the light relative to the depicted face were varied between -130 and +130 degrees and between -40 and 90 degrees, respectively. The resulting images exhibit dramatic variability in appearance due to shading, in addition to the variability in the identity of the depicted person. As was the case for SVHN, we proceed to learn fully unsupervised models of these images using precisely the same architecture as before. For this dataset we conducted a series of crowd simulations. In particular, we ask the simulation 3 different questions upon presenting it with random triplets of images, enforcing complex constraints on the image representations. The questions are the following:

1. Who has the most similar identity?
2. Where is the light condition most similar in terms of azimuth?
3. Where is the light condition most similar in terms of elevation?

While the first question is similar to a typical classification setting, answering it accurately requires an ability to understand light variation well. The other two concern more complex qualities of the images than identity and are more related to visual physics, which we do not tackle here in detail.

Table 2: Comparison of model metrics on Yale Faces with Identity crowd

Task                   VAE    OPBN-EUC   OPBN-SOFTPLUS   OPBN-DIST
Triplet Prediction %   36.8   35.8       20              20
Azimuth RMSD           22     22         22.5            23
Elevation RMSD         14.5   11.1       11              11
Classification %       59     78.3       78.3            75.4

Table 3: Comparison of model metrics on Yale Faces with other crowds

Task                   OPBN-DIST-EL   OPBN-DIST-AZ
Triplet Prediction %   40             39
Azimuth RMSD           27             24
Elevation RMSD         8              14
Classification %       68             64

We can simulate a crowd for each question by using the data provided with the dataset. In all simulations, we generate 2000 random triplets. For question one, we sample from the label distribution, checking for a match, to produce answers to the generated triplets. For questions two and three, we resort to sampling from the relative distances of the target angles to the given angles to produce the triplet information (see the sketch below). The experiments allow us to test the influence of these questions on the latent representation we can learn. We proceed to learn oracle-prioritized belief networks using the same architecture as in the unsupervised model, to provide a fair comparison, and use different triplet-likelihood models. We run the models for 10000 iterations each, until convergence. For visual inspection see Figure 2b. We evaluate by comparing the ability of the learned representation to be used for classification of labels on held-out test data, for triplet prediction on triplets defined over held-out data, and for linear regression of the light conditions. The results are surprising in that all three crowd simulations drastically improve classification accuracy using the inferred representation, even the ones which are not informed about the underlying face identities. Table 2 clearly shows that the oracle-informed models outperform pure density estimation in terms of classification, without labels being part of the learning criterion. Interestingly, label-informed oracles appear to lose flexibility in predicting unseen triplets on unseen data, while crowds based on visual properties of the images perform better at that task, as seen in Table 3. This suggests that the model, when informed by a crowd questioned about light conditions, learns to factorize lighting from image content better than in the purely unsupervised case, resulting in better test-set classification and triplet prediction and in slight benefits when regressing the light conditions from the latent spaces. Questioning crowds regarding visual physics might be a more beneficial route to informing models than label knowledge, as the model automatically improves on a series of tasks. We also experimented with generating varying numbers of triplets and found that more triplets shape the latent spaces more strongly, resulting in better classification results.
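As with the SVHN crowd, a minimal sketch of the angle-based simulation for questions two and three (our interpretation; the exact answer probabilities used are not specified beyond sampling from relative angle distances):

```python
# Sketch: angle-based crowd simulation for questions two and three. The
# answer probability below, proportional to relative angle distances, is our
# assumption; the text only states that answers are sampled from them.
import numpy as np

def simulate_angle_triplet(angles, i, j, l, rng):
    """Answer 'which candidate is most similar to target i?' from angles."""
    d_j = abs(angles[j] - angles[i])  # e.g. azimuth distance to the target
    d_l = abs(angles[l] - angles[i])
    p_j = (d_l + 1e-8) / (d_j + d_l + 2e-8)  # closer candidate is more likely
    # Return the triplet with the chosen candidate in the 'more similar' slot.
    return (i, j, l) if rng.random() < p_j else (i, l, j)

rng = np.random.default_rng(0)
# azimuths = ...  # per-image light azimuth in degrees, from the dataset
# t = simulate_angle_triplet(azimuths, 4, 17, 23, rng)
```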

4 Discussion

We have introduced an unsupervised generative model over observed datapoints and triplet constraints as given by an oracle. Our contributions are, first, a fully probabilistic treatment of triplets and image models in a joint unsupervised setting using variational belief networks. We show how this joint learning allows knowledge from the crowd to be transferred to the rich parametric model, resulting in improved classification scores and an improved ability to predict triplets. This can be a useful framework for encoding expert knowledge in probabilistic reasoning systems when the exact model is unknown or labels are hard to obtain, as is the case for medical data. Second, we introduce information-theoretic distance measures for triplets, unlike the commonly used Euclidean distance. Our approach using variational inference and a triplet likelihood is not limited to belief networks; it will thus be interesting to use the framework in conjunction with other flexible probabilistic models such as Gaussian Processes and infinite partition models from the realm of Bayesian nonparametrics. In terms of applications, it will be interesting to combine the learning approach with more structured models, such as temporally and spatially constrained models, and to encode other relationships like topological or unobserved constraints, such as the taste of food in images. In terms of vision, segmentation and shape learning appear to be perennially difficult and promising avenues for using oracle priors. On the oracle side, future work on more accurate crowd modeling using multiple tasks and different noise regimes is promising in conjunction with use cases such as Amazon Mechanical Turk. Finally, we wish to mention the potential for this framework to assist perceptual applications where biases of the human visual system can be studied with the help of generative models.

References

[1] Laurens van der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.

[2] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. The Journal of Machine Learning Research, 11:1109–1135, 2010.

[3] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033, 2011.

[4] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.

[5] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1386–1393. IEEE, 2014.

[6] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.

[7] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[8] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[9] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[10] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[11] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.

[12] Neil D Lawrence and Joaquin Quiñonero-Candela. Local distance preservation in the GP-LVM through back constraints. In Proceedings of the 23rd International Conference on Machine Learning, pages 513–520. ACM, 2006.

[13] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.

[14] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.

[15] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[16] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[17] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[18] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

[19] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[20] Kuang-Chih Lee, Jeffrey Ho, and David Kriegman. Acquiring linear subspaces for face recognition under variable lighting. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):684–698, 2005.
