Convolutional Neural Networks for Object Recognition on Mobile Devices: a Case Study

Luis Tobías∗, Aurélien Ducournau∗, François Rousseau†, Grégoire Mercier†, Ronan Fablet∗

∗ Institut Mines-Télécom, Télécom Bretagne; UMR 6285 LabSTICC, Brest, France
† Institut Mines-Télécom, Télécom Bretagne; UMR 1101 LATIM, Brest, France
Email: [email protected], {aurelien.ducournau, francois.rousseau, gregoire.mercier, ronan.fablet}@telecom-bretagne.eu

Abstract—Deep Learning (DL), especially Convolutional Neural Networks (CNN), has become the state-of-the-art for a variety of pattern recognition issues. Technological developments have allowed the use of high-end General Purpose Graphic Processor Units (GPGPU) for accelerating numerical problem solving. They not only lead to lower computational times, but also allow much larger networks to be considered. Hence, today's computers are able to drive deeper, wider and more powerful models. State-of-the-art CNNs have achieved human-like performance in several recognition tasks such as handwritten character recognition, face recognition, scene labelling, object detection and image classification, among others. Meanwhile, mobile devices have become powerful enough to handle the computations required for deploying CNN models in near real-time. Here, we investigate the implementation of light-weight CNN schemes on mobile devices for domain-specific object recognition tasks.
Index Terms—Machine Learning, Deep Learning, Convolutional Neural Networks, Object Detection, Mobile Devices.

I. INTRODUCTION

In this study, we address object detection on mobile devices (tablets, smartphones, ...) for domain-specific case-studies. We aim at developing interactive interfaces on mobile platforms based on object recognition. As a case-study, we develop a collaboration with a museum (the "Musée National de la Marine de Brest"). The general objective is to automatically trigger or suggest interactive contents (i.e. 3D models, video, audio, etc.) on a mobile device; the detection of an object of interest with the camera of the mobile device is a key step. Here, detection comprises both the identification and the localisation of an object in an image, whereas classification consists in predicting which class the elements in an image belong to. From a methodological point of view, as briefly reviewed below, we investigate deep learning models, more precisely Fully Convolutional Neural Networks (FCNN), and their deployment on mobile devices for domain-specific real-world applications.
During the last decade, Machine Learning (ML) techniques have been commonly employed to address such tasks, making the choice of visual features a crucial factor. The introduction of the Scale Invariant Feature Transform (SIFT) [1] opened up multiple opportunities for vocabulary learning techniques, including for instance Bag of Features (BoF) [2] and Improved Fisher Vector Encoding (IFV) [3]. These techniques are simple yet effective and can be summarised in a few well-defined steps:

dense sampling of local descriptors, encoding into a high-dimensional representation and, finally, pooling to create a single descriptor per image [4]. Despite their simplicity, these methods are hand-crafted and require a certain amount of engineering. They are known as shallow techniques, where the learning is done only at mid-level by training classifiers such as Support Vector Machines (SVM), Random Forests or Naive Bayes classifiers.
Deep learning models, and CNNs in particular, have become in the last few years the state-of-the-art for a variety of large-scale pattern recognition problems [5]. CNNs are regarded as deep architectures as they involve a hierarchy of layers, such that the outputs of a layer are connected to the inputs of the next layer. The exploitation of a large number of layers, for instance up to 22 for the GoogLeNet model [6], has led to very significant gains in visual recognition tasks [7] compared to shallow strategies. The use of such models for domain-specific (and small-scale) case-studies is an active topic, as deep architectures typically require large-scale datasets for their learning.
Although the gain in computational power on mobile devices has been tremendous in recent years, there still exists an important gap when compared to personal computers: resources are limited and must be exploited efficiently. Shallow techniques imply the detection, extraction and projection of features into a high-dimensional space, which may be computationally expensive and memory-demanding. Consequently, embedded devices generally rely on client-server applications such as in [8], [9] and [10]. By contrast, CNN-based architectures seem computationally appealing as they exploit convolutional filters, but their memory requirement may be an issue.
In this paper, we investigate the deployment of state-of-the-art object recognition schemes embedded in mobile devices for domain-specific case-studies. We consider both model learning and model deployment (GPU, CPU, mobile). We compare different shallow and deep architectures on two representative datasets: a benchmark dataset (Caltech-101) [11] and a real-world case-study (the "Musée National de la Marine de Brest"). We demonstrate the relevance of lightweight deep architectures, especially NIN and GoogLeNet [12], [6], to reach high classification performance on mobile devices in near real-time for a real application setting.
This paper is organised as follows. Section II provides an introduction to CNNs, Section III presents three important network architectures, Section IV introduces the learning transfer strategies.

Section V describes the methodology, Section VI presents the evaluation results, and Section VII concludes with a short discussion and future work.

II. CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks (CNNs) refer to a family of statistical learning models which use the convolution operator as a basis to abstract, encapsulate and learn information [5]. CNNs are used to estimate or approximate functions that may depend on a large number of inputs. Compared to earlier neural networks, deep CNNs involve a large number of layers (typically varying from 8 to 22 [7], [6]). Whereas the first layers are composed of convolution, normalisation, pooling and activation (nonlinear) layers, the top layers generally involve fully-connected layers. For classification purposes, the last layer is typically a SoftMax function that acts as a classifier. We briefly review below the different layers involved in CNNs as well as learning strategies for CNNs. For an in-depth description of CNNs, we refer the reader to [5].

A. CNN Layers

A layer of a CNN is composed of nodes (or neurons) connected to nodes of the previous layer, such that the output at a given node of layer L is a function of the outputs of nodes in layer L − 1. CNNs involve four main types of layers:
• Convolution layers: convolution layers are characterised by their weights (filter values). There are multiple convolution kernels per layer, each with a fixed size, and each kernel is applied over the entire image with a fixed step (stride). The first convolution layers learn low-level features such as edges, lines and corners; the next layers learn more complex representations (e.g., parts and models). The deeper the network, the higher-level the learnt features.
• Pooling layers: pooling layers perform a nonlinear downsampling.
• Activation layers: activation functions mimic the behaviour of a neuron's axon, which fires a signal when a specific stimulus is presented. Some of the most common activation functions are the Hyperbolic Tangent, the Sigmoid and the Rectified Linear Unit (ReLU). ReLU has emerged as a key feature of CNNs; it is defined as f(x) = max(0, x).
• Fully-connected layers: a fully-connected (FC) layer differs from the layers mentioned above in that all outputs of the previous layer are connected to all inputs of the FC layer. These layers can be mathematically represented by inner products.
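To make these building blocks concrete, the toy Python/numpy sketch below chains a single-channel convolution (with stride), a ReLU activation and a max pooling step. It is purely illustrative: real convolution layers operate on multi-channel inputs with many learned filters, whereas the kernel used here is a hand-picked edge filter.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation of a 2-D image with a single kernel,
    as performed (per channel and per filter) inside a convolution layer."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Nonlinear downsampling: keep the maximum of each size x size window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

# Toy forward pass through one convolution + activation + pooling stage.
image = np.random.rand(8, 8)                 # single-channel toy input
edge_filter = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])      # hand-picked 3x3 edge kernel
feature_map = max_pool(relu(conv2d(image, edge_filter, stride=1)))
print(feature_map.shape)                     # (3, 3)
```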

B. Learning Convolutional Neural Networks

The learning stage for CNNs comes down to the estimation of the weights of the different layers. In a supervised setting, it relies on the backpropagation technique, which provides a gradient-based algorithm for some predefined cost function, in our case the misclassification rate. Backpropagation exploits the structure of the CNN to compute the gradient of the cost function with respect to the CNN weights, as an error propagation process from the final layer of the CNN back to the input layer. The learning stage then proceeds iteratively: a feedforward pass computes the filter responses, pooling and nonlinear activations, followed by the backpropagation of the cost error. The backpropagation technique benefits from activation functions that are smooth and differentiable.
Dropout layers randomly neglect the outputs of hidden or visible units. Typically, a unit is kept with probability p = 0.5; by setting its output to zero, the neuron contributes neither to the forward pass nor to the backpropagation of the error [13]. These layers are active only during the learning stage and are ignored at inference time.
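As a minimal sketch of this behaviour, the snippet below implements the unscaled dropout variant described above (units kept with probability p = 0.5 during training, left untouched at inference); practical implementations additionally rescale the activations so that their expected magnitude matches between the two modes.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.5, training=True):
    """Randomly suppress unit outputs during training; identity at inference.
    (Frameworks usually also rescale activations to keep expectations equal.)"""
    if not training:
        return activations
    mask = rng.random(activations.shape) < p_keep   # each unit kept with probability p_keep
    return activations * mask

hidden = np.ones(8)
print(dropout(hidden, training=True))    # roughly half of the units are zeroed
print(dropout(hidden, training=False))   # unchanged at inference time
```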

III. NETWORK ARCHITECTURES

Different architectures have been proposed recently. Here, we focus on three state-of-the-art architectures, namely the AlexNet network [7] and two fully convolutional networks, Network In Network (NIN) [12] and GoogLeNet [6].

A. AlexNet

A widely used CNN architecture is the award-winning AlexNet network presented in [7]. It has been selected as a starting point for multiple applications, for instance [14] and [15]. The network consists of a combination of 5 convolutional and 3 fully-connected layers. The final FC layer is connected to a SoftMax classifier which produces a 1000-class distribution output. Input colour images are size-normalised to a square of 256x256 pixels. In the first convolutional layer, 96 filters of size 11x11x3 are applied to the input image with a stride of 4 pixels; the output of this layer is then normalised and max-pooled. The resulting outputs are fed to 256 kernels of size 5x5x48, then normalised and pooled again. The third, fourth and fifth layers are connected without any normalisation or pooling layers: the third convolutional layer has 384 kernels of size 3x3x256 connected to the outputs of the second layer, the fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth has 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each. A SoftMax layer is connected on top to obtain probability estimates for classification purposes. The network maximises the multinomial logistic regression objective, which is equivalent to maximising the average, across training cases, of the log-probability of the correct label under the prediction distribution [7]. This network has been trained using the ImageNet challenge database, which contains about 1.3 million images [16]; the accuracy obtained by the model in Top-1 prediction is 57.4%. Over 60 million parameters have to be estimated in this network. Despite the significant number of labels (1000) and the very large amount of training data, the model is prone to overfitting. To avoid this, dropout strategies and data augmentation techniques are applied: left-to-right mirroring and top-to-bottom reflections, and 224x224 image crops taken from the centre and the corners of the images.

B. Network In Network

Network In Network is a FCNN that uses micro neural networks with complex structures to abstract the data within the receptive field [12]. In this architecture no fully-connected layers are used, which results in a considerable reduction of the number of parameters to estimate; overall, this network involves about 8 million parameters. A layer that mimics a Multilayer Perceptron (MLP) is employed in this network. An MLP layer is regarded as a "micro network" and is approximated by stacking two convolution layers with 1x1 convolution kernels. These short convolutions reproduce the effect of a Cross Channel Parametric (CCP) pooling layer, which outputs a weighted linear recombination of the input feature maps [12]. The final architecture of this network is obtained by stacking three MLP layers. Instead of using FC + SoftMax layers for classification, the network obtains class confidence scores by globally averaging the output of the final MLP layer. The convolution layers respectively contain 96, 256 and 384 kernels of sizes 11x11, 5x5 and 3x3, while the pairs of CCP layers are coupled to have the same number of filters but with a fixed kernel size of 1x1. Model training also involves an aggressive dropout strategy to avoid overfitting. When trained on the ImageNet dataset, the network obtains a 59.36% Top-1 prediction accuracy.

C. GoogLeNet

This network, called Inception and widely known as GoogLeNet [6], is inspired by the tight convolutions of size 1x1 introduced in [12]. The use of these short convolutions is twofold: 1) reduction of dimensionality and 2) removal of computational bottlenecks. The network decomposes convolutional filters with a wide receptive field into groups of parallel small convolutions (1x1, 3x3 and 5x5) and pooling layers; the outputs of these parallel blocks are then concatenated. Each arrangement of this kind is called an inception module, and the final architecture is composed by stacking multiple modules. The final output is obtained by using average pooling and an extra linear layer. Overall, GoogLeNet involves about 7 million parameters. Dropout remains a key factor for regularisation, and additional intermediate outputs are used for classification. It was shown to reach state-of-the-art performance on the ImageNet dataset [6].
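To illustrate how NIN-style fully convolutional networks produce class scores without fully-connected layers, the sketch below applies a 1x1 convolution (i.e. a per-pixel linear recombination of channels, as in CCP pooling) followed by global average pooling. Weights and feature-map sizes are arbitrary; this is a schematic of the mechanism, not of the actual NIN layer configuration.

```python
import numpy as np

def conv1x1(feature_maps, weights, bias):
    """Cross Channel Parametric pooling: a 1x1 convolution is a per-pixel
    linear recombination of the input channels.
    feature_maps: (C_in, H, W), weights: (C_out, C_in), bias: (C_out,)."""
    c_in, h, w = feature_maps.shape
    flat = feature_maps.reshape(c_in, h * w)            # (C_in, H*W)
    out = weights @ flat + bias[:, None]                # (C_out, H*W)
    return out.reshape(weights.shape[0], h, w)

def global_average_pooling(feature_maps):
    """One confidence score per class: spatial average of each final feature map."""
    return feature_maps.mean(axis=(1, 2))

n_classes, h, w = 50, 6, 6                              # e.g. the 50 museum classes
maps_in = np.random.rand(384, h, w)                     # output of a previous convolution stage
w1 = np.random.randn(n_classes, 384) * 0.01             # random 1x1 filter weights
b1 = np.zeros(n_classes)
class_maps = conv1x1(maps_in, w1, b1)                   # (50, 6, 6): one map per class
scores = global_average_pooling(class_maps)             # (50,) class-confidence scores
print(int(scores.argmax()))                             # predicted class index
```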

IV. LEARNING TRANSFER

For real-world and domain-specific scenarios, it may not be possible to train a model from scratch, especially when dealing with relatively small training datasets. However, it has been shown that features learned across the layers of a deep architecture can be transferred to domain-specific applications [17]. Two main transfer scenarios can be investigated: the fine-tuning of a previously trained network, and the use of a deep network as a feature extraction scheme. We review these two strategies below.

A. Fine-tuning of a pre-trained network

Fine-tuning [17] consists in using a previously trained network as initialisation of the training step for a given task and training dataset. Compared to a purely random initialisation of the network weights, fine-tuning reduces the learning time, as it converges in fewer iterations, and makes it feasible to train a network from a relatively small dataset. We apply here a common fine-tuning strategy [17]: a pre-trained network is used as initialisation for all layers except the top fully-connected layer, whose weights are randomly initialised. Besides, the learning rate of this layer is set higher than the learning rates of the other layers. Note that the top fully-connected layer may contain a number of outputs different from the original network, which allows the model to be adapted to new classification tasks.

B. CNNs as feature extractors

Another strategy for application to domain-specific classification tasks is to combine classical machine learning techniques, e.g. SVMs or random forests, with a feature space derived from a trained CNN [18]. The output of each layer of a CNN may be regarded as a description of the input image; the deeper in the hierarchy, the higher-level the associated image information. These outputs of the different layers of a CNN are referred to as CNN codes [19], which may provide a relevant feature space for classification tasks. The dimension of each code is the number of nodes of the associated layer. For instance, with the AlexNet CNN, considering the input of the fully-connected layers as a feature vector yields a 4096-dimensional feature vector. Given the feature space defined from the selected CNN codes, any supervised machine learning model may be relevant; here, we consider linear SVMs. CNN codes may be further compressed through a principal component analysis without a significant accuracy loss. Whereas the training of a CNN typically requires very large image databases, the training of linear SVMs is efficient on small image datasets, which makes them appealing for domain-specific applications.
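A minimal scikit-learn sketch of this second strategy is given below, assuming the CNN codes have already been extracted; random placeholders stand in for the 4096-dimensional FC activations, and the toy sample count only allows 256 principal components instead of the 1024 kept in the reported experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Placeholder CNN codes: in practice each row would be the 4096-dimensional
# activation extracted from a fully-connected layer (e.g. FC7) for one image.
codes_train = rng.standard_normal((500, 4096))
labels_train = rng.integers(0, 50, size=500)       # e.g. the 50 museum classes
codes_test = rng.standard_normal((20, 4096))

# PCA compression of the CNN codes followed by a multiclass linear SVM
# (C = 1.0, as in the reported experiments).
classifier = make_pipeline(PCA(n_components=256), LinearSVC(C=1.0))
classifier.fit(codes_train, labels_train)
predictions = classifier.predict(codes_test)
print(predictions[:5])
```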

V. EXPERIMENTAL FRAMEWORK

In this study, we report an experimental evaluation of CNN-based strategies for classification applications on mobile platforms (i.e., smartphones and tablets), with a specific emphasis on domain-specific case-studies. The reported experiments compare the considered CNN models against the shallow techniques mentioned in Section I. In addition to classification performance, we also evaluate the computational complexity of the considered models for three types of implementation: CPU, GPU and mobile devices. All CNN experiments have been conducted using the Caffe library [20] and its associated C++ and Python interfaces; a custom Objective-C version of the library was used to deploy models on mobile devices [21]. For the shallow methods, we used the Matlab interface of the VLFeat library [22].
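For reference, the snippet below sketches how a trained model can be loaded and applied to a single image with pycaffe. The file names, the mean file and the blob names ('data', 'prob') are placeholders following common Caffe conventions, not the exact artefacts used in this study.

```python
import numpy as np
import caffe   # pycaffe, the Python interface of the Caffe library [20]

caffe.set_mode_cpu()   # use caffe.set_mode_gpu() on a CUDA-capable machine

# Hypothetical file names for a trained model; blob names follow the usual
# Caffe conventions but depend on the deploy definition actually used.
net = caffe.Net('deploy.prototxt', 'museum_model.caffemodel', caffe.TEST)

# Standard pycaffe preprocessing: HxWxC float image -> CxHxW,
# mean subtraction and BGR channel order.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.load('museum_mean.npy').mean(1).mean(1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

image = caffe.io.load_image('exhibit.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)
probabilities = net.forward()['prob'][0]   # class-confidence scores
print(int(probabilities.argmax()))         # predicted class index
```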

Figure 1. Examples of images from the Marine Museum database.

A. Image databases

We consider two different datasets:



• Caltech-101 database [11]: contains 101 object categories, with about 50 images per category on average. While this benchmark dataset is large enough to train shallow classification algorithms, it is too small for the direct application of deep learning strategies.
• Marine Museum database: this database, described below, is part of a collaborative project with the Marine Museum in Brest for the development of new interactive services on mobile devices for visitors. It provides a real-world case study for the considered models.

The Marine Museum database involves 50 different classes (including a background class), corresponding to a variety of objects (e.g., statues, mock-ups, small objects) and materials (e.g., wooden and stone objects). For each object, the database contains at least 90 images taken from different viewpoints and under different lighting conditions. Different cameras were also used to acquire images of different qualities. Representative examples of the database are reported in Figure 1. The database is randomly split into training, testing and validation subsets, which respectively contain 45, 25 and 20 images of each object. All images in the database are subject to a preprocessing step, which includes mean subtraction and size normalisation. Besides, the data augmentation techniques (mirroring and cropping) mentioned in Section III are applied to the training and validation subsets.

B. Shallow Methods

As baseline approaches, we implement and compare different shallow models in our experiments. These models exploit dense SIFT features extracted on a regular grid with a step of 4 pixels, at 7 scales separated by a factor of √2, with bins 8 pixels wide. As encodings, we consider two classical methods, namely Bag of Visual Words (BOVW) [2] and Improved Fisher Vectors (IFV) [3]. BOVW uses 4096 vector-quantised visual words, with histogram square rooting followed by L2 normalisation; IFV uses a 256-visual-word Gaussian Mixture Model (GMM). Both techniques use a linear SVM classifier with parameter C = 10. Additional spatial information is aggregated to the BOVW features by appending the coordinates of the descriptors, while a spatial pyramid with 1x1 and 3x1 subdivisions is used for IFV, as described in [23]. These two models are referred to as BOVW + aug and IFV + sp in the reported results.
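As a rough illustration of such a pipeline, the sketch below builds a small visual vocabulary, encodes images as square-rooted and L2-normalised histograms, and trains a linear SVM with C = 10. It substitutes scikit-learn for the VLFeat implementation actually used, uses random placeholder descriptors instead of dense SIFT, a much smaller vocabulary, and omits the spatial augmentation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Placeholder local descriptors: in the paper these are 128-D dense SIFT
# descriptors sampled on a regular grid at several scales for each image.
def dense_descriptors(image_id, n=200, dim=128):
    return rng.standard_normal((n, dim))

train_images = list(range(50))
train_labels = rng.integers(0, 5, size=len(train_images))

# 1) Learn the visual vocabulary (4096 words in the paper; far fewer here).
vocabulary = MiniBatchKMeans(n_clusters=64, random_state=0)
vocabulary.fit(np.vstack([dense_descriptors(i) for i in train_images]))

# 2) Encode each image as a visual-word histogram, then apply square
#    rooting and L2 normalisation as described above.
def encode(image_id):
    words = vocabulary.predict(dense_descriptors(image_id))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    hist = np.sqrt(hist)
    return hist / (np.linalg.norm(hist) + 1e-12)

features = np.array([encode(i) for i in train_images])

# 3) Train the linear SVM (C = 10 in the reported experiments).
classifier = LinearSVC(C=10.0).fit(features, train_labels)
```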

C. Deep models

We investigate three categories of deep models:
• Fully trained deep models: deep models with the architectures described in Sections III-A and III-B are first fully trained on each database. Optimisation is carried out by Batch Stochastic Gradient Descent (BSGD). The parameters are set as follows: momentum m = 0.9, weight decay of 0.0005, a high starting learning rate lr = 0.01 and "step" as the learning rate update policy. Optimisation is performed over 20 epochs. The resulting trained models are referred to hereafter as AlexNet and NIN.
• Fine-tuned deep models: as mentioned in Section IV, we use pre-trained models as initialisation to obtain a more discriminative model through a fine-tuning specific to each case-study database (a training sketch is given after this list). Our fine-tuning strategy consists in learning the last two FC layers in the case of AlexNet, the last two CCP layers for the NIN model, and the FC layers of the multiple outputs in GoogLeNet. Learning rates in these final layers are set ten times higher than in the rest of the network. The optimisation is carried out over fewer iterations than in the full training, since faster convergence is expected; the learning rate (BSGD step) is set to 0.001, which is ten times smaller than the learning rate used for full training. The resulting deep models are referred to as AlexNet+F, NIN+F and GoogLeNet+F. For the GoogLeNet model, we report the output of the deepest classification layer; intermediate output layers are omitted, as we noticed that they lead to similar classification performance.
• Deep models as feature extraction schemes: we explore the use of AlexNet as a feature extractor. We consider input features from layers FC5, FC6 and FC7. These features are used to train a multiclass linear SVM with a fixed parameter C = 1.0. From cross-validation experiments, the best classification performances were obtained with features extracted from layer FC7; we only report these results, referred to as AlexNet FC. A combination with Principal Component Analysis (PCA) dimension reduction keeping the first 1024 principal components is also investigated and referred to as AlexNet+PCA.
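As indicated in the fine-tuned bullet above, the sketch below shows how such a fine-tuning run could be launched with pycaffe. The solver and weight files are hypothetical names, and the solver is assumed to encode the hyper-parameters listed above (momentum 0.9, weight decay 0.0005, learning rate 0.001 with a step policy) together with the 10x learning-rate multipliers on the re-initialised top layers.

```python
import caffe

caffe.set_mode_gpu()

# Hypothetical solver file; it is assumed to point to a network definition
# whose re-initialised top layers carry a 10x learning-rate multiplier and
# to encode the SGD hyper-parameters reported above.
solver = caffe.SGDSolver('finetune_solver.prototxt')

# Copy weights from the pre-trained ImageNet model into all layers whose
# names match; renamed top layers keep their random initialisation.
solver.net.copy_from('imagenet_pretrained.caffemodel')

solver.solve()   # run the fine-tuning iterations defined in the solver
```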

VI. EXPERIMENTAL RESULTS

A. Classification performance

A synthesis of the classification performance of the considered models is reported in Table I. For the Caltech-101 database, classification performances range from 43% to 87%. Whereas the baseline approaches, BOVW and IFV encodings combined with linear SVMs, reach 73% of correct classification, fully trained deep models behave poorly due to the limited size of the training dataset, both for the AlexNet and NIN architectures (below 50%). The best results are obtained by the fine-tuned models (respectively, 87.63% of correct classification for AlexNet and 87.22% for NIN); the gain appears significant compared to the use of AlexNet as a feature extraction scheme. Similar conclusions can be drawn for the Marine Museum database: whereas fully trained models reach up to 86% of correct classification, the fine-tuned version of the NIN model reaches more than 98%. The significantly higher correct classification rates reported for fully trained models compared to the Caltech-101 case-study may relate to the greater visual separation between object classes in the Marine Museum database.

Table I. Object recognition performance on the Caltech-101 and Marine Museum databases: Bag of Visual Words + augmented spatial information (BOVW + aug), Fisher Vector encoding + spatial pyramid (FV + s.p.), AlexNet, AlexNet + fine-tuning (F), AlexNet as feature extractor (FC), AlexNet as feature extractor + PCA, NIN, NIN + fine-tuning (F) and GoogLeNet + fine-tuning (F).

Method           Caltech-101   Marine Museum
BOVW + aug       73.33%        —
FV + s.p.        73.33%        —
AlexNet          43.22%        84.35%
AlexNet + F      87.63%        94.07%
AlexNet + FC     82.23%        88.31%
AlexNet + PCA    81.37%        82.23%
NIN              46.19%        86.14%
NIN + F          87.22%        98.16%
GoogLeNet + F    86.22%        97.64%

One of the main advantages of CNNs is that, once the models are trained, they can be deployed on almost any device. The training time depends on the amount of data to process and the number of iterations. For example, the full training on the Caltech-101 dataset is executed in roughly 2 to 3 days using the GPU version of Caffe, while fine-tuning takes only a fraction of that time: the entire process can be done in about 10 hours. We also measure the computational load of the classification step for different devices: CPU, GPU and mobile device. Evaluation is performed by averaging 100 single-image classifications (no batch processing), excluding I/O operations. The selected devices are: an Intel i7-6820HQ CPU at 2.70 GHz, an Nvidia Tesla K80 GPU and an iPad with a 1.3 GHz dual-core Apple Cyclone processor. Results are presented in Table II, along with the RAM required to load each model. AlexNet requires a large memory storage, making it less relevant for an implementation on mobile devices. By contrast, the GoogLeNet and NIN models are associated with a significantly lower memory requirement. However, large differences exist in terms of computational time: whereas GoogLeNet leads to a processing time close to 1 s per frame on the mobile device, the NIN model requires less than 300 ms per frame.
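The sketch below mirrors this evaluation protocol with pycaffe (100 single-image forward passes, no batching, no I/O); the model files are the same hypothetical placeholders as in the deployment sketch above, and the deploy definition is assumed to use a batch size of 1.

```python
import time
import numpy as np
import caffe

caffe.set_mode_cpu()   # CPU timing; switch to caffe.set_mode_gpu() for GPU numbers

# Hypothetical model files, with a deploy definition using a batch size of 1.
net = caffe.Net('deploy.prototxt', 'museum_model.caffemodel', caffe.TEST)
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)

net.forward()          # warm-up pass, excluded from the measurement

runs = 100
start = time.perf_counter()
for _ in range(runs):
    net.forward()      # single-image classification, no batching, no I/O
elapsed_ms = 1000.0 * (time.perf_counter() - start) / runs
print('average forward pass: %.1f ms' % elapsed_ms)
```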

These differences may be explained by the large number of convolution operations involved in GoogLeNet; such operations are not optimised for a CPU-only implementation. Overall, NIN appears as the best choice for the considered domain-specific real-world application, with: 1) a smaller memory footprint, 2) slightly better accuracy in our application, 3) a lower computational time, and 4) a potential use for localisation, as described in Section VI-B.

Table II. Average computational time (per frame) and memory requirement of the considered deep models.

            AlexNet     GoogLeNet   NIN
GPU         22.6 ms     17.6 ms     4.6 ms
CPU         339.0 ms    452.1 ms    210.7 ms
Mobile      382 ms      992 ms      289 ms
Memory      230 MB      50 MB       29 MB

B. Object Localisation Task

In this section, we report object localisation experiments using NIN. FCNNs retain a strong spatial input-output relationship. Based on this property, we propose to roughly localise objects using the Feature Maps (FMs) produced by the final MLP layer. The class assigned to the image directly depends on the output of the average pooling, which encourages the network to learn correspondences between feature maps and categories [12]. Thus, the class index attributed to the image is used to choose the FM employed to perform the detection. In this way, localisation comes at no additional computational cost, because the FMs are already built at inference time. Feature maps are considerably smaller than the input image, due to the multiple nonlinear downsampling steps performed by the pooling operations; as a consequence, only a rough estimate of the object's position can be retrieved. We first normalise the selected FM, then upscale it by bilinear interpolation to match the size of the original image; finally, a threshold th = mean(FM) is applied to segment the object. Examples of the detection results are shown in Figure 2.
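A numpy/SciPy sketch of these three steps (normalisation, bilinear upscaling, thresholding at the mean) is given below on a toy 7x7 map; the actual feature-map size depends on the network and input resolution.

```python
import numpy as np
from scipy.ndimage import zoom

def localise(feature_map, image_shape):
    """Rough localisation from the feature map selected by the predicted class:
    normalise, upscale by bilinear interpolation to the input image size, and
    threshold at the mean value of the map."""
    fm = feature_map.astype(float)
    fm = (fm - fm.min()) / (fm.max() - fm.min() + 1e-12)              # normalise to [0, 1]
    scale = (image_shape[0] / fm.shape[0], image_shape[1] / fm.shape[1])
    heatmap = zoom(fm, scale, order=1)                                # bilinear upscaling
    mask = heatmap > heatmap.mean()                                   # th = mean(FM)
    return heatmap, mask

# Toy example: a coarse 7x7 class map projected back onto a 224x224 input.
fm = np.zeros((7, 7))
fm[2:5, 3:6] = 1.0
heatmap, mask = localise(fm, (224, 224))
print(mask.shape, int(mask.sum()))
```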

Figure 2. Example of object localisation for the Marine Museum case-study: from left to right, original image, top three activations produced by the last MLP layer of the NIN model, segmentation mask and contour of the object localisation. Best viewed in colour.

VII. DISCUSSION

Deep learning has led to significant improvements of object recognition and image classification performance over shallow techniques for complex datasets such as ImageNet. Deep models typically require very large datasets to learn their very large number of parameters (up to 60 million parameters for AlexNet). This raises scientific questions for their application to real case-studies, where the amount of training data remains relatively small. Similarly, the memory storage required by deep models may question their applicability on mobile devices. Within the context of interactive services on mobile devices in museums, we investigated here object recognition on mobile platforms using deep models. Following [4], we demonstrated that a highly accurate domain-specific object recognition pipeline can be run in near real-time on mobile devices. Deep models, more specifically fine-tuning strategies, were shown to significantly outperform classical shallow strategies, with a computational time lower than 300 ms per frame for the NIN model. This model is superior in terms of memory footprint, computational complexity and recognition performance compared with AlexNet and GoogLeNet; the computational time and memory requirements of the other two models may be prohibitive for real-world applications or require additional optimisations. We also illustrated that object recognition can be complemented by object localisation at no additional computational cost for the considered model. Future work will further investigate implementation issues on mobile devices. Our current solution uses a CPU-only implementation, which achieves near real-time classification; computational time could be further reduced by adopting a mobile GPU implementation. Ongoing research also includes the extension of the proposed model to object recognition on mobile devices from RGB-D images, as depth sensors will be embedded in the next generation of mobile devices.

REFERENCES

[1] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[2] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, ECCV, 2004, pp. 1–22.
[3] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in Proceedings of the 11th European Conference on Computer Vision: Part IV, ser. ECCV'10, 2010, pp. 143–156.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in British Machine Vision Conference, 2014.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[8] S. Gammeter, A. Gassmann, L. Bossard, T. Quack, and L. V. Gool, "Server-side object recognition and client-side object tracking for mobile augmented reality," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Workshops 2010, 2010, pp. 1–8.
[9] S. S. Kumar, M. Sun, and S. Savarese, "Mobile object detection through client-server based vote transfer," in CVPR, 2012.
[10] L. Czúni, P. J. Kiss, Á. Lipovits, and M. Gál, "Lightweight mobile object recognition," in 2014 IEEE International Conference on Image Processing, ICIP, 2014, pp. 3426–3428.
[11] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 594–611, April 2006.
[12] M. Lin, Q. Chen, and S. Yan, "Network in network," CoRR, vol. abs/1312.4400, 2013.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[14] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Měch, "Salient object subitizing," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[15] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[17] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" CoRR, vol. abs/1411.1792, 2014.
[18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," CoRR, vol. abs/1403.6382, 2014.
[19] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky, "Neural codes for image retrieval," CoRR, vol. abs/1404.1777, 2014.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," pp. 675–678, 2014.
[21] A. Isaza, "Caffe for iOS: A wrapper for the Caffe library," https://github.com/aleph7/caffe, 2015.
[22] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," http://www.vlfeat.org/, 2008.
[23] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, "The devil is in the details: An evaluation of recent feature encoding methods," in British Machine Vision Conference, 2011.