T.R. TURKISH NAVAL ACADEMY NAVAL SCIENCE AND ENGINEERING INSTITUTE DEPARTMENT OF COMPUTER ENGINEERING

Scene Classification using Spatial Pyramid of Latent Topics MASTER THESIS

EMRAH ERGÜL

Advisor: Asst.Prof.Dr. Nafiz ARICA

İstanbul, 2009

 Copyright by Naval Science and Engineering Institute, 2009


Scene Classification using Spatial Pyramid of Latent Topics

Submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN COMPUTER ENGINEERING

Turkish Naval Academy Naval Science and Engineering Institute

Author: --------------------------------------------------------------------------Emrah ERGÜL

Defence Date:        /        /

Approved by:

--------------------------------------------------------------------------Asst. Prof. Dr. Nafiz ARICA (Thesis Advisor)

--------------------------------------------------------------------------Prof. Dr. Yahya KARSLIGİL (Defense Committee Member)

--------------------------------------------------------------------------Asst. Prof. Dr. Songül ALBAYRAK (Defense Committee Member)


DISCLAIMER STATEMENT

The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Turkish Naval Forces, the Turkish Naval Academy, or the Naval Science and Engineering Institute.


DEDICATION

To my wife, Serpil ERGÜL


ACKNOWLEDGEMENT

I would like to render my special thanks to

My dear parents, Aynur and İlyas, for their amity and generous support,
My wife, Serpil, for her everlasting moral support and for giving me enthusiasm,
My advisor, Asst. Prof. Nafiz Arıca, for giving me his ultimate guidance and spirit,
My colleague, Murat Küçükbayrak, for his help and unique patience,
The Institute director, Mustafa Karadeniz, for his active collaboration,
My anonymous tutors for their priceless knowledge.

This thesis would not have been possible without their trust in me.


TABLE OF CONTENTS

1. INTRODUCTION ............................................................................................ 1
   1.1 Motivation ............................................................................................... 3
   1.2 Objectives ............................................................................................... 5
   1.3 Challenges .............................................................................................. 6
       1.3.1 Variance in Visual Domain ............................................................. 7
       1.3.2 Perception Ambiguities ................................................................... 8
       1.3.3 Variability in a Single Class ............................................................ 9
       1.3.4 Variability among Classes ............................................................ 10
       1.3.5 Representation Issues .................................................................. 10
   1.4 Contributions ........................................................................................ 11
   1.5 Structure of Thesis ............................................................................... 13
2. LITERATURE REVIEW ON SCENE CLASSIFICATION ............................. 14
   2.1 Low-level Features for Scene Classification ........................................ 14
       2.1.1 Global Low-level Representation ................................................. 16
       2.1.2 Local Low-level Representation ................................................... 19
       2.1.3 Discussion ..................................................................................... 23
   2.2 Bag-of-Words Modeling ........................................................................ 24
   2.3 Spatial Pyramid Matching Methods ...................................................... 28
   2.4 Latent Semantic Analyzing Methods .................................................... 32
3. SCENE CLASSIFICATION USING PYRAMID OF LATENT TOPICS ......... 44
   3.1 Feature Extraction and Visual Vocabulary Generation ........................ 44
       3.1.1 Scale Invariant Feature Transform (SIFT) .................................... 45
       3.1.2 Dense SIFT ................................................................................... 49
       3.1.3 Visual Words and Vocabulary Generation .................................... 50
   3.2 Semantic Image Representation Based on pLSA ................................ 52
       3.2.1 Discussion About pLSA ................................................................ 56
   3.3 Building Spatial Pyramid of Latent Topics ............................................ 58
       3.3.1 Spatial Pyramid of Latent Topics by Cascaded pLSA .................. 58
       3.3.2 Spatial Pyramid of Latent Topics by Semantic Segmentation ...... 62
   3.4 Classification ........................................................................................ 64
       3.4.1 K-Nearest Neighborhood (KNN) ................................................... 64
       3.4.2 Support Vector Machines (SVM) .................................................. 65
4. PERFORMANCE EVALUATION ................................................................. 67
   4.1 Dataset ................................................................................................. 67
   4.2 Experimental Setup .............................................................................. 69
       4.2.1 Pre-processing .............................................................................. 69
       4.2.2 Feature Extraction ......................................................................... 69
       4.2.3 Input for pLSA: Co-occurrence Table ........................................... 70
       4.2.4 Spatial Pyramid of pLSA Modeling ............................................... 71
   4.3 Experimental Results ........................................................................... 72
5. CONCLUSIONS AND FUTURE WORKS ................................................... 78
   5.1 Future Works ........................................................................................ 80
BIBLIOGRAPHY .............................................................................................. 82


LIST OF TABLES

Table-4.1. Classification results of Cascaded pLSA based method, using SVM…..72

Table-4.2. Classification results of Cascaded pLSA based method, using KNN…..73

Table-4.3. Classification Results of Semantic Segmentation based Method, using SVM…………………………………………………………………………………74

Table-4.4. Classification Results of Semantic Segmentation based Method, using KNN…………………………………………………………………………………74

Table-4.5. Comparative results of the Spatial Pyramid Method with visual words = 400, and the Cascaded pLSA based and Semantic Segmentation based methods with T=100…..75

Table-4.6. Compact comparison of our algorithms with other methods using the same experimental setup…………………………………………………………….76

Table-4.7. Confusion Matrix achieved by the Cascaded pLSA based method with T=100 and SVM, accuracy 80.90%…………………………............77


LIST OF FIGURES

Figure-1.1. Example images from different scene categories in classification framework…………………………………………………………………………….3

Figure-1.2. Example images that introduce scale, viewpoint and lighting changes, respectively…………………………………………………………………………...8

Figure-1.3. Example images that contain ambiguities of manual annotation in a classification dataset………………………………………………………………….9

Figure-1.4. Example images that contain intra-class variability……………………..9

Figure-1.5. Example images that contain inter-class variability……………………10

Figure-1.6. Example images that contain same objects but carry different semantic concepts……………………………………………………………………………...11

Figure-2.1. An illustration of low-level and semantic feature modeling methods…16

Figure-2.2. Two-stage classification scheme of indoor/outdoor images…………...20

Figure-2.3. Coefficient labelling for wavelet features extraction…………………..23

Figure-2.4. Steps for producing the BoW representation for images………………25

Figure-2.5. Overview of Spatial Pyramid creation and histogram calculation……..30

Figure-2.6. Example images from scene category databases of Lazebnik et al.……30

Figure-2.7. The structure of linear spatial pyramid algorithm based on sparse coding……………………………………………………31

Figure-2.8. Overview of visual vocabulary formation, learning and classification stages of generative/discriminative model of scene classification……………….…33

Figure-2.9. Schematic representation of Spatial Envelope model……………….…34

Figure-2.10. Flow chart of the algorithm of Hierarchical Bayesian………………..37

Figure-2.11. 13 categories of scene images, 9 out of 13 are outdoor, and the rest are indoor images………………………………………………………………………..38

Figure-2.12. Illustration of hLDA model…………………………………………...39

Figure-2.13. The highest probable words shown in rows of 5 different examples and 4 learnt topics via pLSA…………………………………………………………….41

Figure-2.14. An example of algorithm described by Li as a generative model…….42

Figure-3.1. Illustration of SIFT keypoint detection algorithm……………………...46

Figure-3.2. Illustration of orientation assignment and descriptor formation……….48

Figure-3.3. Illustration of dense SIFT descriptors over an image………………….49

Figure-3.4. Illustration of pLSA model…………………………………………….54

Figure-3.5. Illustration of Cascaded pLSA based method………………………….60

Figure-3.6. Illustration of Semantic Segmentation based method………………….63

Figure-3.7. Illustration of SVM algorithm………………………………………….65

Figure-4.1. Some example images from the scene classification dataset…………...68


LIST OF ACRONYMS AND ABBREVIATIONS

BoT   : Bag-of-Topics
BoW   : Bag-of-Words
CPT   : Conditional Probability Table
DCT   : Discrete Cosine Transform
DFT   : Discrete Fourier Transform
DoG   : Difference-of-Gaussians
EM    : Expectation Maximization
GMM   : Gaussian Mixture Model
HI    : Histogram Intersection
hLDA  : hierarchical Latent Dirichlet Allocation
HSV   : Hue-Saturation-Value
KL    : Kullback-Leibler
KNN   : K-Nearest Neighbours
LDA   : Latent Dirichlet Allocation
LoG   : Laplacian-of-Gaussian
LSA   : Latent Semantic Analysis
LST   : Luminance Saturation Transform
MAP   : Maximum a Posteriori
MLE   : Maximum Likelihood Estimation
MSAR  : Multiresolution Simultaneous AutoRegressive
MSER  : Maximally Stable Extremal Regions
PCA   : Principal Component Analysis
pLSA  : probabilistic Latent Semantic Analysis
QP    : Quadratic Programming
RBF   : Radial Basis Function
RGB   : Red-Green-Blue
SIFT  : Scale Invariant Feature Transform
SP    : Spatial Pyramid
SVD   : Singular Value Decomposition
SVM   : Support Vector Machine
TF    : Term Frequency
TDV   : Topic Distribution Value
TL    : Topic Label
VIZ   : Visual word - Image - Topic
WWW   : World Wide Web


ÖZET

Gizli Temalardan Uzaysal Piramit Oluşturularak Sahne Sınıflandırılması

Emrah ERGÜL

Bilgisayar Mühendisliği M.S. Tezi, 2009

Danışman: Yrd. Doç. Dr. Nafiz ARICA

Anahtar Kelimeler: Sahne Sınıflandırılması, Uzaysal Piramit, Olasılıksal Gizli Anlam Analizi, Görsel Kelimeler Kümesi.

Bu tezde imgelerin analiz edilmesi ve neticesinde içerik bilgilerine göre imgelerin taşıdıkları anlamlara uygun olarak sınıflandırılmaları hedeflenmiştir. İmgenin betimlenmesi sahne sınıflandırma problemindeki en önemli kısmı oluşturmaktadır. Zira literatürdeki mevcut sahne sınıflandırma algoritmaları göz önüne alındığında, bu yöntemlerin çoğunlukla imge betimleme yaklaşımlarındaki farklılıktan ötürü birbirlerinden ayrıştıkları görülmektedir. Bu kapsamda daha etkin bir imge betimlemesi elde etmek ve sınıflandırma performansını arttırmak maksadıyla; zayıf denetimle sahne sınıflandırması sağlayan ve literatürde son zamanlarda sıkça başvurulan Görsel Kelimeler Kümesi ve Olasılıksal Gizli Anlam Analizi yöntemlerinin birleştirildiği iki yeni yaklaşım önerilmektedir.

İmgenin betimlenmesi amacıyla ilk yöntem olarak Olasılıksal Gizli Anlam Analizi algoritmasının hiyerarşik bir yapıda imgeye uygulanması önerilmektedir. Yöntemin temelinde SIFT özniteliklerine dayalı Görsel Kelimeler Kümesinin elde edilmesini müteakip, Olasılıksal Gizli Anlam Analizi modellemesinin piramit basamaklandırma şeklinde tüm alt bölgelere ayrı ayrı uygulanması yatmaktadır. Bu düşünceden yola çıkarak, imge alt bölgelere ayrılır ve Olasılıksal Gizli Anlam Analizi alt bölgelere uygulanır. Tüm seviyelerden elde edilen gizli tema dağılımları birleştirilerek imge betimlemesi gerçekleştirilir.


Önerilen ikinci yöntemde, imge betimlemesi Olasılıksal Gizli Anlam Analizi sonuçlarına göre tema bölütleme esasına dayalıdır. Görsel Kelimeler Kümesi modelinin aksine, imge görsel kelime histogramı yerine tema histogramı kullanılarak betimlenir. Verilen imgedeki her görsel kelime sahip olduğu maksimum tema olasılığına bağlı olarak bir temaya atanır. İlk yöntemde olduğu gibi uzaysal bilginin eklenmesi, imgenin alt bölgelere bölünmesi ve her bölgede tema histogramının elde edilmesiyle sağlanır.

Önerilen her iki yöntemin performansı, aynı veri seti kullanılarak eşit şartlarda literatürde mevcut en başarılı diğer yöntemler ile karşılaştırılmış ve önerilen yöntemlerin diğerlerinden daha iyi neticeler elde ettiği görülmüştür.


ABSTRACT

Scene Classification using Spatial Pyramid of Latent Topics

Emrah ERGÜL

Computer Engineering M.S. Thesis, 2009

Advisor: Asst. Prof. Dr. Nafiz ARICA

Keywords: Scene Classification, Spatial Pyramid, probabilistic Latent Semantic Analysis, Bag-of-Visual words

In this thesis, we aim to classify natural and man-made images from a challenging dataset into semantically meaningful categories. This involves analyzing an image using robust algorithms and assigning it a category label. In this context, image representation is the most important part of the scene classification problem; indeed, scene classification systems in the literature differ mostly in their representation schemes. We propose two novel image representation approaches for weakly supervised scene classification that combine two popular methods in the literature: Bag-of-Words (BoW) modeling and probabilistic Latent Semantic Analysis (pLSA) modeling.

Firstly, a new image representation scheme based on Cascaded pLSA is proposed. After the BoW representation based on SIFT features is obtained, pLSA analysis is performed in a hierarchical sense. We associate location information with the conventional BoW/pLSA algorithm by subdividing each image into sub-regions iteratively at different resolution levels and fitting a pLSA model for each sub-region individually. Finally, an image is represented by the concatenated topic distributions of the sub-regions.

In the second method, a topic-based segmentation is obtained using the results of the pLSA analysis. The image is represented by its topic counts, rather than the visual word counts used in BoW modeling. We assign each visual word to the topic label with the maximum posterior probability conditioned on that word in the given image. As in the first method, spatial information is added to the image representation by subdividing the image into finer resolutions; topic histograms are then calculated for each region individually.

The performances of our two methods are compared with the most successful methods in the literature, using the same dataset and the same number of training and testing images. In the experiments, the proposed methods outperform the others on that particular dataset.


Chapter 1

Introduction

In the last decade, digital imagery has grown at an incredible speed in many areas, resulting in an explosion in the number and quality of image archives that must be managed automatically, or at least with minimal supervision. In particular, with the widespread use of high-resolution digital cameras, camcorders and mobile phones with built-in cameras, together with the storage of personal computers and cheap, high-capacity flash memories, people can now easily produce thousands of personal images and share them on social networking and photo sharing web sites on the WWW without any practical limitation in capacity, speed or connectivity. Although this situation seems very encouraging for the advance of human civilization, there are serious problems to be solved urgently: if we keep producing data this quickly without any management, we can lose our way in such a large data collection, and the data turns into waste instead of flowing into useful information. For instance, by June 2005, the Internet photo-sharing website Flickr had almost one million registered users and hosted 19.5 million photos, with a growth of about 30 percent per month (Li, 2006). By April 2007, only two years later, Flickr had over 5 million registered users and over 250 million images (Ames, 2007). Although people increasingly attach annotations to their images for semantic filtering and searching, the vast majority of images on the Internet are barely documented, making it very difficult to find one of interest. When using search engines like Google or Yahoo for image retrieval, you mostly have to start a new search with new keywords after the first search fails to meet your needs, because the information is not documented by its semantic content. In order to handle this overload and exploit the massive amount of image information, we need to develop techniques to document and search images. Describing images by their semantic content will help us organize, access and classify huge amounts of data in a reasonable way.

A computer system that could automatically classify objects and scenes from images would be of great importance, since online resources holding huge amounts of digital information now cover most of our daily life. Applications exist at large scale: surveillance, environment description, robots with visual interaction, and smart instruments such as camcorders or cameras that could sense the environment and configure themselves automatically to capture the best snapshots. Knowledge about the scene enables smarter image processing. For instance, when film is developed and prints are made from the negatives, the exposure and color are adjusted automatically. Unfortunately, automatic correction does not take into account the context of the photograph. If the computer could distinguish the scene categories of images, it could adjust each class differently, rather than adjusting everything towards one "ideal" exposure and color. The same observation applies to image scanners, photocopiers, fax machines, image processing software, and so on. Scene classification is a challenging, multi-faceted task for computer vision and a popular research area nowadays. It comprises many sub-problems, including segmenting relevant components to identify objects in an image, clustering the data extracted from dataset images into semantic exemplars to reduce storage and computation, training the classification system to generate representative models in a data-driven fashion, and matching observed and unobserved images in statistical/probabilistic ways. The release of many challenging multi-class datasets in recently published papers (Fei-Fei, 2005A; Lazebnik, 2006A; Sivic, 2005; Oliva, 2001; Vogel, 2007; Deng, 2009) shows how hard and interesting a research sub-area this is in computer vision. As the datasets span a large diversity and carry ambiguity, with multiple classes that contain many objects, a distinctive image representation and efficient training and testing algorithms are needed to cope with such complexity. Some example images that we deal with can be seen in Figure-1.1. We also have to combine different modules, interface them with optimized parameters and eventually make them run together in a robust way. Besides that, we have to consider trade-offs among efficiency in time and storage, repeatability of the experimental environment, scalability and complexity of the algorithmic framework, and evaluation performance. At this point, supervision is our concern: as the dataset used in scene classification grows, all the issues mentioned above may become too great a burden to carry. High classification accuracy is, of course, our ultimate goal, but we also take this kind of optimization into consideration for the scene classification problem.

Figure-1.1. Example images from different scene categories in the classification framework (open country, highway, kitchen, living room).

1.1 Motivation

As the digital imagery industry evolves at an incredible rate, vision systems and their applications are finding a place in almost every branch of business. This growth has generated another large market for automatically organizing, searching and accessing the huge amount of image data produced every day. The main idea in this market is to provide automatic tags to images, allowing for easier searching and sorting of photos, which forces us to represent images by their semantic contents. In the context of a scene classification system, content-based image retrieval means analyzing a scene and assigning it a semantic class label (i.e. tags like "bedroom", "forest", "office"). If we could manage to label an image with a semantic category in a reasonably supervised manner, we would be a step closer to better object detection and recognition, since the content of an image directly refers to object categories while indicating its semantic class. We can illustrate the benefit of scene classification, and understand why it is important for many applications in the market, by giving some examples. Searching for images at an online search engine or in a local database is very important, especially in social networking and photo sharing web sites on the WWW. Currently, search engines use the manual annotations surrounding an image in a web page to provide clues to its contents. With a scene classification system, rather than relying on manual annotation which might contain irrelevant tags, the contents of an image can be determined automatically from the image itself, which provides more accurate search capability. This also allows the search to be based on another image instead of keywords; for example, rather than searching for images by providing the term "coast," a user can provide an image which is classified, and then images of the same class are returned. The same procedure can be applied in medical applications, video searching and so on, to decide which category model in the database best fits a newly acquired patient image or film shot. Another example concerns photography. Imagine that the photographer has a database of scene category models built from the best snapshots taken in the past. Rather than manually adjusting the pose parameters or depending on the pre-determined set-ups of the camera, we first take a random snapshot of the existing scene, and the scene classification system determines which pose parameters dominantly fit the environment. It is then only a decision of the classification system to match the random snapshot with one of the models from the assigned scene category. For object detection and recognition supported indirectly by scene classification, we can give examples from robotics or surveillance systems. If a robot or surveillance system could perceive the semantic category of its environment, it could infer which objects, or more specifically which types of a single object category, might occur in it. The search is then conducted only among the objects in the database which have previously been attached to that type of scene category.


Unfortunately, there is no commonly accepted scene classification solution for these applications, because of the challenges, some of which we mention below.

1.2 Objectives

Our aim in this thesis is to classify natural and man-made images from a challenging dataset into semantically meaningful categories. This means analyzing an image using computer algorithms and assigning it a category label (i.e. suburb, forest, street). While scene classification systems in the literature vary considerably, we can generally place them into three categories according to their image representation schemes: low-level, semantic and Bag-of-Words (BoW). We discuss them in detail in Chapter 2. After the representation scheme is determined and applied, a labeled set of images is used to train the system to discriminate between image classes; after training is complete, a test image is classified by the system and compared to the ground-truth classes of the test set. Among these representation schemes, BoW and semantic modeling, also called content-based methods, are the most promising in previous work. Generally, in semantic modeling, training images are partitioned into local patches and each patch is hand-labeled with one of several classes (i.e. sky, water, wall). Then a classifier such as an SVM, a neural network or a Bayesian classifier is used to model the semantic classes. Eventually, low-level feature descriptors are assigned to "intermediate semantic classes" (i.e. textons or materials), and a histogram is created for the image which stores a count of occurrences of each of these semantic classes in the image (Vogel, 2004B; Luo, 2003; Boutell, 2006). Although such methods use semantic concepts for image representation as the human vision system does, they are mostly supervised and extra work (i.e. manual annotation of each patch in the training step) is needed. On the other hand, in BoW modeling, low-level feature descriptors are assigned to "visual words" by running a clustering algorithm like k-means on the feature descriptors extracted from a set of training images; each image is then represented by a frequency vector of visual words. Although it needs no supervision for image representation, unlike semantic modeling, there is no semantic concept in this method either, as it uses only the counts of visual words, which correspond to cluster centers computed in a Euclidean metric (Quelhas, 2007; Fei-Fei, 2005A). Also note that location information is ignored as long as we represent an image by a histogram of occurrences, in both cases. In this thesis, we propose to combine these models to achieve better results. After the BoW representation of an image is obtained as described briefly above, we utilize an intermediate semantic representation by using the probabilistic Latent Semantic Analysis (pLSA) algorithm, first introduced by Hofmann in the text analysis literature (Hofmann, 1998). To add location information to the classification system, we divide the image into multiple parts using a pyramidal division scheme as proposed by Lazebnik et al. (Lazebnik, 2006A), then run pLSA for each sub-region to generate a new probabilistic model which refers to the mixture of topics (i.e. objects) in that sub-region. In the end, we achieve an intermediate semantic representation for each image, with location information, in an unsupervised manner, which is more robust to the failures in scene classification caused by geometric and photometric changes.
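To make the bag-of-words step described above concrete, the following is a minimal sketch of vocabulary construction and word-frequency representation; scikit-learn's k-means is used purely for illustration, and the function names and the vocabulary size are assumptions for this example only, not the exact configuration used later in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=400, seed=0):
    """Cluster all training descriptors (e.g. 128-D SIFT) into 'visual words'."""
    all_descriptors = np.vstack(descriptor_sets)
    kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=seed)
    kmeans.fit(all_descriptors)
    return kmeans                                    # the centroids are the vocabulary

def bow_histogram(descriptors, kmeans):
    """Represent one image as a normalized frequency vector of visual words."""
    words = kmeans.predict(descriptors)              # nearest centroid for each patch
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```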

1.3 Challenges

As mentioned before, although much promising work on scene classification has been published, the problem has not been fully solved yet. The human vision system can recognize the class of a scene image at a single glance; it requires little attention, and classification is an easy task for humans to learn and perform. But when machines are used to emulate the generic human vision system, which automatically makes inferences according to the relationships between the contents of a scene, it remains a challenge for computer vision. The human visual system is generic because it can generate a specific classification decision in a short time even when the environment has highly varying complexity, or objects in the environment are transformed into irregular entities. How should we model scene categories with the kind of generalization a human achieves? Below we explain the major challenges which we have to take into account to build a robust scene classification system.

1.3.1 Variance in Visual Domain

In the visual domain, there are many types of transformations such as background clutter, viewpoint, scale, rotation, illumination, blurring, occlusion, etc. There also exist many methods invariant or covariant to such geometric and photometric changes, both for detecting interest points, such as Laplacian of Gaussian (LoG) (Lindeberg, 1998), Difference of Gaussians (DoG) (Lowe, 2004), Harris Laplace/Affine (Mikolajczyk, 2004) and Hessian Laplace/Affine (Mikolajczyk, 2005A), and for describing the appearance and shape information of support regions around keypoints, such as SIFT (Lowe, 2004), Local Jet (Koenderink, 1987), Gabor filters (Manjunath, 1996) and steerable filters (Freeman, 1991). However, the available detectors and descriptors all emphasize different kinds of invariance, which leads to varying properties and sample sizes. In conclusion, we need to select a proper feature descriptor for discriminating scene classes under these challenging conditions. Some scene examples with different scales, viewpoints and lighting conditions are displayed in Figure-1.2. As we can see from the images, the houses which are the main objects of the "suburb" class are at different distances (i.e. scales), the roads and buildings of the "street" class are at different viewpoints, and the mountains of the "mountain" class are under different illumination conditions.


Figure-1.2. Example images that introduce scale, viewpoint and lighting changes, respectively (suburb, street, mountain).

1.3.2 Perception Ambiguities

Scene classification accuracy is measured mainly against manual annotations, which reflect human perception. This unavoidably leads to ambiguities and viewer subjectivity. Although designers try to make optimal arrangements for an ideal dataset, some percentage of the set contains questionable ground-truth annotations, and this statistic is commonly reported with the dataset set-up. Besides, human perception uses sophisticated prior knowledge when making rational inferences, which means that humans draw on a huge amount of learnt information in a classification problem. In contrast, we have a very limited dataset for classifying challenging scenes, compared to the experience available to our generic visual system. Vogel also points out that there is a crucial semantic vagueness between categories in scene classification, so many images cannot be clearly classified into one category (Vogel, 2004B). For instance, if we look carefully at Figure-1.3, we cannot clearly separate images (a) and (b) into highway and street classes, or images (c) and (d) into mountain and coast classes, respectively.



Figure-1.3. Example images that contain ambiguities of manual annotation in a classification dataset. (a) Highway (b) Street (c) Mountain (d) Coast.

1.3.3 Variability in a Single Class

In a scene image, we cannot limit the types and number of object instances, just as the human visual system does not. Although we focus on the main objects that signal the class category of a scene image, identification is also a problem, because variations of shape, location and appearance within each main object category make it hard to model a specific scene category. We show three images from the bedroom category in Figure-1.4. In these scene images the main object is, of course, the bed. But as we notice, the bed instances have different attributes, and varying numbers of other objects such as bed stands, pictures, bed lights, chairs and mirrors are observed in the scenes. Let us put two people into these images. Does this addition change the class category? Of course it does not. So we need to generalize over all main instances of specific categories, while disregarding the effects of secondary instances.

Figure-1.4. Example images that contain intra-class variability.


1.3.4 Variability among Classes

In addition to within-class variance, there is also a critical confusion between scenes of different categories which are quite similar in shape and appearance. For example, inside city and street scenes are labeled as different categories, and we can see in Figure-1.5 that they might easily be confused with each other. So we need to increase the correlation within the images of a class while increasing the divergence between classes, in a supervised way.

Figure-1.5. Example images that contain inter-class variability. (a) Inside city (b) Street (c) Street (d) Inside city.

1.3.5 Representation Issues

The final challenge that we need to note concerns image representation. At the low-level step, we only have the pixel values of an image, and we try to describe the image with feature properties that capture observed pattern changes. This is usually done either by sampling the image locally at regular grid points and describing the local patches around those keypoints individually, or by processing the whole image at once using gradient, moment and frequency changes of intensity and color values. Today's state-of-the-art algorithms can describe an image robustly by using image enhancement, filtering and statistical analysis methods. But as we deal with multiple images that have varying contents from different categories, we need to fill the gap between the observed data and the unobserved gist of the image in order to classify images in a semantic way. Semantic modeling is a new term in computer vision, and there is no common agreement on which method should be used for recognizing objects and, in turn, recognizing the category of the image semantically. For instance, we see in Figure-1.6 that trees, bushes and grass cover most of both images, and there is a house in each image as a small detail. At this point, we need to develop a robust method that would infer that the green part in the left image belongs to the garden of the house, so it is a suburb scene, while in the right image the house is located in a forest, so it is a forest scene.


Figure-1.6. Example images that contain same objects but carry different semantic concepts. (a) Suburb scene (b) Forest scene.

1.4 Contributions

As noted, a general scene classification algorithm consists of four steps: low-level feature extraction, image representation, training models for scene categories, and testing the system. Among them, image representation is the most important part of the scene classification problem; indeed, scene classification systems in the literature differ mostly in their representation schemes. We introduce two novel image representation approaches for weakly supervised scene classification that mainly combine two popular methods in the literature: BoW visual word-frequency modeling and pLSA mixture-of-topics modeling.

Spatial Pyramid of Latent Topics by Cascaded pLSA Method: After finding visual words in the dataset with a clustering algorithm like k-means, we first subdivide each image into sub-regions iteratively at each level L (i.e. L=0 the whole image, L=1 1/4 of the image, L=2 1/8 of the image) and generate probabilistic models for each sub-region individually by using its BoW features in the pLSA algorithm, discovering topics (i.e. objects) in an unsupervised way. Eventually, we represent each image by a new vector obtained by concatenating the probabilistic topic distribution vectors (i.e. the products of the pLSA models) of each sub-region in a weighted scheme; we then train a classifier (i.e. KNN and SVM) on these vectors in a supervised way by supplying only the class labels of the training set. As we will see in Chapter 2, the BoW representation does not use location information, and neither does pLSA. Lazebnik et al. extended BoW modeling for scene classification to include location information by creating "Spatial Pyramid" histograms of visual words from each sub-region (Lazebnik, 2006A), without any intermediate semantic concept. Bosch et al. classified scenes as a combination of pLSA topics by using a similar pyramid division scheme (Bosch, 2008). Their work differs from our method in that they implemented pLSA modeling in a holistic way at each level, using a level-specific concatenated histogram of BoW representations for the whole image, rather than fitting a different pLSA model on each sub-region individually.

Spatial Pyramid of Latent Topics by Semantic Segmentation Method: In addition to the aforementioned method, we establish a transition between the BoW representation and a Bag-of-Topics (BoT) representation of an image. In pLSA modeling, a visual word has a probability for each topic. This means that each visual word in a given image has a set of probabilities with respect to the topic distributions over the image. So we can assign each visual word to the topic label with the maximum posterior probability conditioned on that word in the given image. By doing so, we can represent each image as a topic-frequency vector in a weighted scheme, instead of a visual word-frequency vector. This also leads to a rough segmentation of the image based on topic distributions, as regions of visual words assigned to a common topic tend to form a meaningful connected component. Bosch et al. discussed this issue as a segmentation application but did not use it for scene classification specifically (Bosch, 2008). Differently from our first method, after fitting individual pLSA models in the pyramid division scheme, we now represent each sub-region as a topic-frequency vector weighted by the maximum posterior probabilities, rather than a probabilistic mixture of topics. We have compared our two approaches with other methods (Lazebnik, 2006A; Bosch, 2008; Fei-Fei, 2005A) in the scene classification literature by using the same dataset (Fei-Fei, 2005A), and have achieved better results, which are discussed in Chapter 4.
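Purely as an illustration of the two representations, the sketch below assembles them from hypothetical per-sub-region pLSA outputs; the array names, shapes and level weights are assumptions made for this example, and the actual weighting scheme is defined in Chapter 3.

```python
import numpy as np

def cascaded_plsa_vector(region_topic_dists, level_weights):
    """Cascaded pLSA representation: concatenate the P(topic | sub-region) vectors
    of all sub-regions, weighting each pyramid level.
    region_topic_dists: list over levels; the entry for level l is an array of
    shape (num_regions_at_level_l, T) produced by pLSA models fitted per region."""
    parts = [w * dists.ravel() for w, dists in zip(level_weights, region_topic_dists)]
    return np.concatenate(parts)

def bag_of_topics_histogram(word_topic_posteriors, num_topics):
    """Bag-of-Topics representation for one sub-region: assign every visual word
    occurrence to the topic with maximum posterior P(topic | word, image),
    then count topic occurrences.
    word_topic_posteriors: array of shape (num_word_occurrences, T)."""
    labels = word_topic_posteriors.argmax(axis=1)
    hist = np.bincount(labels, minlength=num_topics).astype(float)
    return hist / max(hist.sum(), 1.0)
```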

1.5 Structure of Thesis

The outline of the thesis is described briefly below:

Literature Review: We review the most commonly used methods for the scene classification problem, focusing on the low-level image representation models which use global or local feature properties to describe the whole image or support regions. We then turn to bag-of-words modeling using local image patches, which forms the basis for this thesis work. Finally, state-of-the-art spatial pyramid matching and latent semantic analysis methods, which use bag-of-words modeling from different points of view to extract semantic models for scene classification, are explained briefly.

Scene Classification Using Pyramid of Latent Topics: We propose two novel approaches which introduce a spatial concept by subdividing an image iteratively to produce a new cumulative feature vector for the extraction of local latent topics found by the probabilistic Latent Semantic Analysis (pLSA) method. We explain the elements of the scene classification method in detail for both the training and testing steps.

Performance Evaluation: We explain the properties of the dataset used during the experiments, and describe the initializations made in each step of the classification algorithm. Thereafter, we discuss in depth the results of our proposed methods and the effects of the different parameters used during the experiments, and compare the performance of the algorithm with previously published methods that give promising results.

Conclusions and Future Work: Finally, we draw a conclusion about this thesis and discuss what we could add to improve the proposed methods as a future work.


Chapter 2

Literature Review on Scene Classification

In this chapter, we focus on a number of approaches to the scene classification problem. We start by discussing low-level features for representing images in Section 2.1, as they are the basis of the intermediate semantic image modeling methods that help overcome representation complexity. Afterwards, in Section 2.2 we give a general idea of Bag-of-Words (BoW) modeling, which represents images basically as histograms of feature exemplars while exploiting their invariance properties. This approach has recently attracted much attention because of its simplicity and the ease with which it can be combined with other advanced machine learning and classification methods. We then go into Spatial Pyramid Matching methods in Section 2.3, which use BoW modeling in a hierarchical, spatially informative way, whereas basic BoW has no geometric information. This way of thinking has empowered BoW modeling to overcome the limitations of an orderless representation. Finally, in Section 2.4 we briefly mention some latent semantic analysis methods, which in fact generate a new, dimensionally reduced feature space to represent images discriminatively by using the observed data in a statistical way.

2.1 Low-level Features for Scene Classification

The first question to answer in scene classification is how we should represent an image to achieve stable performance in this multi-step operation. To do so, "features" are used to describe image properties for modeling a prototype, as humans do in visual perception. A feature can be described as an image pattern that is extracted because of its noticeable variance (Tuytelaars, 2007). Generally, the image properties considered are intensity, texture, color and power spectrum, and features are produced from changes in these properties, such as moments, gradients, orientations, derivatives, Fourier and wavelet transformations. For mathematical formulation, some measurements are calculated from a patch around a feature and transformed into a descriptor to model the image in a distinctive way. An ideal feature should have properties corresponding to repeatability, distinctiveness, accuracy, efficiency and invariance to transformations such as illumination, scale, rotation, occlusion and affinity. Although these are considered the main properties of general features in the literature, the selection of the features to be used depends strongly on the usage scenario of a specific application. In scene classification, the matter is not what a feature actually represents or precisely where it is in an image. The main goal is to create new exemplar features while modeling each class in the dataset, and then to match them statistically by using similarity or dissimilarity functions, without any burdensome segmentation. Initialization is conducted by using low-level features to represent an image in vision algorithms, which leads to intermediate semantic representation modeling approaches (Bosch, 2007A). As they do not refer to any external knowledge, low-level features, like color histograms, power spectrum, textures and edges, are extracted directly from the image with mathematical formulations, also considered as transformation functions, such as derivation, moments, normalization and quantization (Boutell, 2006). Furthermore, these low-level features are used in classification and learning algorithms to infer high-level semantic information, called "topics" or "themes". We can generally separate low-level feature extraction methods into two groups: one which computes low-level image properties over the entire image in a holistic fashion, and the other which first subdivides the image into several regular sub-blocks or irregularly partitioned segments and then computes features in each part separately. In Figure-2.1, we see the two main approaches for image representation. As we notice from this illustration, the interest regions are taken into consideration in low-level modeling, while the content of the image is considered in semantic modeling.


Figure-2.1. An illustration of low-level and semantic feature modeling methods (Bosch, 2007A).

2.1.1 Global Low-level Representation

Vailaya et al. argue that high-level scene categories can be extracted from global low-level image features in large databases by using a hierarchical link between scene categories. They separate categories into indoor and outdoor classes at the highest level, then outdoor images are classified as city and landscape images, and eventually the landscape class is divided into sunset, forest and mountain subclasses. They observe that each class has different qualitative attributes; for example, outdoor images have uniform color distributions while indoor images have non-uniform color but uniform lighting distributions (Vailaya, 1999). Hence, they propose to use global discriminative features of color moments, color histograms, color coherence vectors, edge orientation histograms with weighted magnitudes, and edge coherence vectors. They use a binary Bayesian classifier in which the class-conditional probability density functions are obtained through a Vector Quantization framework that uses a Gaussian Mixture Model (GMM) to estimate codebook vectors. Considering n training samples from a class w, the vector quantization framework is used to extract q codebook vectors v_j from the n training samples. The class-conditional density of a feature vector y given the class w, f(y | w), is then calculated by a mixture of Gaussians, each centered at a codebook vector, as in:

f(y \mid w) = \sum_{i=1}^{q} m_i \, e^{-\frac{\|y - v_i\|^2}{2}}     (2.1)

where m_i is the proportion of the n training samples assigned to codebook vector v_i. According to this probabilistic setting, Bayesian decision theory is based on the maximum a posteriori (MAP) criterion:

\hat{w} = \arg\max_{w \in \Omega} \{ p(w \mid y) \} = \arg\max_{w \in \Omega} \{ f(y \mid w) \, p(w) \}     (2.2)

where \Omega is the set of scene categories and p(w) represents the prior class probability. They extract different numbers of codebook vectors for the classes at different hierarchy levels, with feature vectors of varying dimensionality, and report very promising results with these low-level features.

Szummer and Picard address scene classification as the problem of separating images into indoor and outdoor classes by using global color, texture and frequency feature vectors. For the color features, they use the Ohta color space, which statistically yields near-optimal performance in a Principal Component Analysis (PCA) of natural images (Szummer, 1998). This new space is derived from RGB as:

I_1 = R + G + B
I_2 = R - B                                               (2.3)
I_3 = R - 2G + B

The color histogram consists of 32 bins per channel in Ohta space. Moreover, they use the histogram intersection norm, instead of the Euclidean norm, to compute the distance between the histograms of training and test images, defined as:

dist(h^1, h^2) = \sum_{i=1}^{n} \left( h_i^1 - \min(h_i^1, h_i^2) \right)     (2.4)
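A small sketch of Eqs. (2.3) and (2.4), assuming an 8-bit RGB image held in a NumPy array; the bin count and helper names are illustrative only.

```python
import numpy as np

def ohta_channels(rgb):
    """Eq. (2.3): Ohta color channels from an RGB image."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    return r + g + b, r - b, r - 2 * g + b

def channel_histogram(channel, bins=32):
    """32-bin histogram of one color channel."""
    hist, _ = np.histogram(channel, bins=bins)
    return hist.astype(float)

def hist_intersection_distance(h1, h2):
    """Eq. (2.4): dist(h1, h2) = sum_i (h1_i - min(h1_i, h2_i))."""
    return float(np.sum(h1 - np.minimum(h1, h2)))
```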

For texture classification, the Multiresolution Simultaneous Autoregressive (MSAR) model proposed in (Mao, 1992) is used. They extract feature vectors from intensity images at two resolutions, i.e. half and quarter sizes, at three neighborhood scales of 2, 3 and 4, which yields a 15-dimensional vector. To calculate the similarity between feature vectors, the Mahalanobis distance with a covariance matrix is used. As for the frequency features, a 2D Discrete Fourier Transform (DFT) is performed over the images to take the magnitudes as weights, and after taking a Discrete Cosine Transform (DCT), all weights are multiplied by the coefficients to create the elements of the feature vectors. Eventually, they compare the classification results of K-nearest neighborhood and three-layer neural network algorithms, and conclude that K-nearest neighborhood is better than neural network classification, especially for color features, in accuracy and time efficiency.

Swain and Ballard use color histograms to describe the whole image in a different color space:

rg = r - g
by = 2b - r - g                                           (2.5)
wb = r + g + b

where r, g and b are the main color channels of RGB. They use these new axes as opponent channels of the human visual system, and quantize the wb axis into 8 bins and the other two into 16 bins each, for a total of 2048 bins. To compare color histograms when measuring similarity, they propose the normalized histogram intersection formula, because it penalizes pixels matched from the background (Swain, 1991). The formula is given as:

H(I, M) = \frac{\sum_{i=1}^{n} \min(I_i, M_i)}{\sum_{j=1}^{n} M_j}     (2.6)
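Eq. (2.6) translates directly into a short function (NumPy assumed); I and M are the image and model histograms.

```python
import numpy as np

def normalized_histogram_intersection(image_hist, model_hist):
    """Eq. (2.6): match score H(I, M), normalized by the model histogram total."""
    return float(np.sum(np.minimum(image_hist, model_hist)) / np.sum(model_hist))
```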

While this representation takes every pixel into consideration, they get very good results on test images with distinctive colors, but have problems with images that have uniform color distributions; indexing errors occur because foreground and background information is mixed. Besides the specific methods mentioned above, combined low-level feature vectors of color, texture and shape have also been proposed to overcome the complexity of visual content in the scene classification problem, and satisfactory classification results have been achieved using Support Vector Machine (SVM), K-nearest neighborhood and Gaussian Mixture Model (GMM) classifiers (Shen, 2005). However, global features cannot give a distinctive representation when clutter, occlusion and content complexity increase within images and the number of scene categories rises from binary to multi-class. As a result, local features have become popular and have opened a new gateway towards semantic modeling concepts, which mostly use dense local feature descriptors over the entire image to associate links between the contents of an image in a spatial, geometric and appearance manner.

2.1.2 Local Low-level Representation

As it has been shown that global features are very limited in the presence of within/inter-class variations, occlusion, background clutter, pose and lighting changes, local low-level features have attracted more attention in recent work. The main approach to local representation is to divide the entire image into regular sub-regions in a grid fashion, or to segment the image roughly in a semantic way, and then describe those sub-regions by their own low-level properties. Afterwards, classification is carried out within each part of the whole image, and eventually the image is classified from the combined results of the individual regions. Szummer and Picard note the importance of local feature computation while investigating the power of global features in their work, and pay attention to locality because of its additional spatial information, which improves scene categorization (Szummer, 1998). They propose to divide each image into 4x4, i.e. 16, sub-regions and compute low-level color and texture features individually in those parts. Instead of accumulating the feature vectors from all sub-regions into one vector to describe the whole image, where high dimensionality becomes a problem when computing the covariance, they use a two-step classification method: classifying subdivisions independently with a k-nearest neighborhood classifier by comparing each sub-region to all sub-regions in the database, then performing another classification on the results of the first step. In the second step, they test majority voting, a one-layer neural network and Mixture of Experts classifiers for combining the results from the sub-regions, and conclude that the majority voting scheme is preferable in terms of training time, although the other two perform slightly better.

Figure-2.2. Two-stage classification scheme of indoor/outdoor images (Szummer, 1998).
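A minimal sketch of the two-stage idea, assuming each image is already described by 16 sub-region feature vectors; the 1-nearest-neighbour choice and the names are illustrative simplifications, not Szummer and Picard's exact configuration.

```python
import numpy as np
from collections import Counter

def classify_two_stage(query_regions, train_regions, train_labels):
    """Stage 1: label each query sub-region by its nearest training sub-region.
    Stage 2: combine the per-region labels by majority voting.
    query_regions: (16, D); train_regions: (N, D); train_labels: length-N list."""
    region_votes = []
    for q in query_regions:
        dists = np.linalg.norm(train_regions - q, axis=1)   # Euclidean, for brevity
        region_votes.append(train_labels[int(np.argmin(dists))])
    return Counter(region_votes).most_common(1)[0][0]
```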

Gorkani and Picard focus on texture in the scene classification problem by using low-level orientation histograms of image sub-blocks. They use directional filters to convolve the images and compute orientations and magnitudes at each pixel, extracting orientation histograms over multiple scales using a steerable pyramid (Gorkani, 1994). They repeatedly filter and subsample the previous lower level to obtain the higher levels, where level 0 is the whole image, and then continue applying directional filters to obtain a cascaded histogram of orientations. They calculate the weights of the orientations in a 158-bin histogram with the "number histogram" formula:

H_n(k) = \frac{N_n^{\theta}(k)}{N_t}, \quad k = 0, 1, 2, \ldots, b - 1     (2.7)

where N_n^{\theta}(k) is the number of pixels in the region of interest that fall into orientation bin k with magnitudes larger than a threshold S, b is the total number of orientation bins in the histogram, and N_t is the total number of pixels in the sub-region. The orientation finding algorithm is applied to regular rectangular sub-regions over the entire image at three levels, giving 16 sub-regions in total, to classify dataset images into city and suburb categories. To model an orientation histogram for each scene class, they calculate saliency measure thresholds for each sub-region, derived from the training set, and use those thresholds for classification.

Paek and Chang approach the scene classification problem with low-level local features in a different way, by applying a probabilistic reasoning classifier called a belief network. The probabilistic approach is used to specify the joint probability distribution of a set of random variables for a given domain (Paek, 2000). In the case of scene classification, they try to classify images as indoor/outdoor, sky/no sky and vegetation/no vegetation by using Boolean Bayesian conditional independence relationships between the random variables in the given domain. We can think of a belief network as an acyclic graph whose nodes are classifiers that work on clustered feature vectors extracted from sub-regions. Each entry in the joint probability distribution can be computed as:

P(X_1 = x_1 \wedge X_2 = x_2 \wedge \ldots \wedge X_n = x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i))     (2.8)

where x_i is the value of the i-th node in the belief network and n is the total number of nodes. They build indoor/outdoor, sky/no-sky and vegetation/no-vegetation block-matching classifiers which evaluate all images in the dataset. The joint probability in the equation is the product of the linked nodes' conditional probabilities, so each node has a Conditional Probability Table (CPT) from which the cumulative probabilities of linked nodes can be estimated. In the classification problem, the nodes represent image classes and block-matching classifiers. In the experiments, they divide all images into 8x8, i.e. 64, sub-regions, calculating HSV color features and edge direction histograms for the sub-regions individually. For testing, each sub-block of a query image is compared to all sub-blocks in the database using K-nearest neighbor classification, which measures similarity with histogram intersection. Each of the nearest sub-blocks from the training data is labeled, since the labels of the training patches are already known; then a majority voting scheme is applied among the sub-regions to estimate the class category of the query. They conclude that multi-class categorization problems can be solved with high accuracy by probabilistic multiple classifiers.
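Eq. (2.8) in code form over a toy network; the structure and the probability values below are invented solely to show how the product of conditional probabilities is evaluated, and do not come from Paek and Chang's system.

```python
# Toy belief network: each CPT maps an assignment of the node's parents to
# P(node = True | parents). All values are illustrative only.
cpts = {
    "outdoor":    {(): 0.6},
    "sky":        {(True,): 0.8, (False,): 0.1},     # parent: outdoor
    "vegetation": {(True,): 0.5, (False,): 0.05},    # parent: outdoor
}
parents = {"outdoor": [], "sky": ["outdoor"], "vegetation": ["outdoor"]}

def joint_probability(assignment):
    """Eq. (2.8): P(X1=x1, ..., Xn=xn) = prod_i P(xi | Parents(Xi))."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[par] for par in parents[node])
        p_true = cpts[node][parent_values]
        p *= p_true if value else (1.0 - p_true)
    return p

# 0.6 * 0.8 * (1 - 0.5) = 0.24
print(joint_probability({"outdoor": True, "sky": True, "vegetation": False}))
```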


Serrano and Savakis argue that low-level features can be used to predict scene content semantically. They try to classify images into indoor and outdoor classes by using local color and wavelet texture features calculated at each sub-region in combination (Serrano, 2004). For the color features, they work in LST space, which they consider more suitable than RGB space because the illumination difference between indoor and outdoor images is captured more accurately in LST space. The LST transformation from RGB is given by:

I_{LST}(x, y) = \begin{bmatrix} \alpha/\sqrt{3} & \alpha/\sqrt{3} & \alpha/\sqrt{3} \\ \alpha/\sqrt{2} & 0 & -\alpha/\sqrt{2} \\ \alpha/\sqrt{6} & -2\alpha/\sqrt{6} & \alpha/\sqrt{6} \end{bmatrix} \begin{bmatrix} I_R(x, y) \\ I_G(x, y) \\ I_B(x, y) \end{bmatrix}        (2.9)

\alpha = 255 / \max\{I_R(x, y), I_G(x, y), I_B(x, y)\} \quad \text{for an 8-bit image}

They compute a color histogram of 16 bins per color channel, 48 in total, for each sub-region of a 4x4 grid over the entire image. Thus the color feature is a 48-element quantized vector in LST space. For local texture representation, they calculate wavelet coefficients, instead of Maximally Stable Extremal Regions (MSER) features (Matas, 2002), at two levels with the biorthogonal 5/3 low- and high-pass filter pair h_0(k) = [-0.125, 0.25, 0.75, 0.25, -0.125] and h_1(k) = [0.5, -1, 0.5], respectively. The wavelet coefficients are computed with the formula:

c_{ij}^{l}(x, y) = \begin{cases} \sum_{\alpha} \sum_{\beta} I_L(\alpha, \beta)\, h_i(2x - \alpha)\, h_j(2y - \beta), & l = 1 \\ \sum_{\alpha} \sum_{\beta} c_{00}^{l-1}(\alpha, \beta)\, h_i(2x - \alpha)\, h_j(2y - \beta), & l = 2 \end{cases}        (2.10)

where I_L(x, y) is the image luminance, c_{ij}^{l}(x, y) are the wavelet coefficients at levels l = 1 and l = 2, and h_i and h_j are the biorthogonal low/high-pass filters used for convolution. The wavelet coefficient extraction for each sub-block of the 4x4 grid is illustrated in Figure-2.3.

Figure-2.3. Coefficient labelling for wavelet features extraction (Serrano, 2004).
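As a concrete illustration of eq. (2.9), the sketch below applies the LST matrix to an RGB image with NumPy. The matrix entries follow the reconstruction above; the helper name and the toy image are ours, and this is only a minimal sketch rather than Serrano and Savakis's implementation.

```python
import numpy as np

def rgb_to_lst(rgb):
    """Convert an 8-bit RGB image (H x W x 3) to LST space, following eq. (2.9)."""
    rgb = rgb.astype(np.float64)
    alpha = 255.0 / max(rgb.max(), 1.0)           # per-image normalization factor
    m = np.array([
        [alpha / np.sqrt(3),  alpha / np.sqrt(3),      alpha / np.sqrt(3)],
        [alpha / np.sqrt(2),  0.0,                    -alpha / np.sqrt(2)],
        [alpha / np.sqrt(6), -2 * alpha / np.sqrt(6),  alpha / np.sqrt(6)],
    ])
    return rgb @ m.T                               # (H, W, 3) in L, S, T order

image = np.random.randint(0, 256, (4, 4, 3))       # toy 4x4 RGB image
print(rgb_to_lst(image).shape)                     # (4, 4, 3)
```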

Eventually, they obtain 7-element local texture feature vectors from the 16 cells. Unlike the others, they run an SVM for classification instead of the K-nearest neighborhood algorithm. After normalizing the color and texture feature vectors to zero mean and unit variance, SVM classification is performed for each sub-region and a global indoor/outdoor belief measure is obtained probabilistically as:

P(outdoor \mid color) = \frac{1}{1 + e^{-d_c}}, \quad d_c = \sum_{i=1}^{B} f_c(x_{c,i})

P(outdoor \mid texture) = \frac{1}{1 + e^{-d_t}}, \quad d_t = \sum_{i=1}^{B} f_t(x_{t,i})        (2.11)

where f_c(x_{c,i}) and f_t(x_{t,i}) are the SVM classification results for the color x_{c,i} and texture x_{t,i} feature vectors extracted from the i-th sub-region of the query image, and B is the total number of sub-regions, 16 here. They report that SVM outperforms the K-nearest neighborhood and majority voting classification schemes in the indoor/outdoor classification problem, as it minimizes wrong labeling of sub-regions. They also use these low-dimensional color/texture features in combination with a Bayesian network to infer semantic concepts in images such as grass, sky and cloud; promising results have been reported in experiments, which shows that locality combined with low-level features enhances classification accuracy.
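A minimal sketch of the belief computation in eq. (2.11): per-block SVM decision values are summed and passed through a sigmoid to obtain a global outdoor belief. The decision values below are made up; in the original work they would come from SVMs trained on the color and texture features described above.

```python
import numpy as np

def outdoor_belief(block_scores):
    """P(outdoor | feature) = 1 / (1 + exp(-d)), with d the sum of per-block SVM outputs."""
    d = np.sum(block_scores)
    return 1.0 / (1.0 + np.exp(-d))

color_scores   = np.array([0.4, -0.2, 0.7, 0.1])   # hypothetical f_c(x_{c,i}) for 4 blocks
texture_scores = np.array([0.3,  0.5, -0.1, 0.2])  # hypothetical f_t(x_{t,i})

print(outdoor_belief(color_scores), outdoor_belief(texture_scores))
```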

2.1.3 Discussion

In conclusion, texture, color, spatial frequency, dominant orientation and edge direction features are generally used for low-level feature extraction in previous works. We can also see examples of features which combine several low-level cues to achieve better accuracy. Features are extracted globally or locally over images, quantized into varying numbers of histogram bins, and label assignments are given to query images by a classifier such as K-nearest neighborhood, SVM, or probabilistic Bayesian/neural/belief networks. In addition, we can note that extracting and processing low-level features in sub-regions individually, followed by a holistic classification step, provides better results than features extracted from the entire image at once. It is notable that all the above-mentioned methods use some statistical classification for one ultimate purpose: to map relationships between low-level features and semantic concepts. Although low-level features are good starting points for semantic inference in binary settings such as indoor/outdoor or city/suburb, they are weak when used alone in multi-class problems, as they lack an intermediate semantic description of image content. At this point, a novel approach to intermediate semantic modeling, which has been widely used in recent works, comes into play: Bag-of-Words (BoW). From now on, we turn our attention to the BoW modeling concept and to systems that run this model in a probabilistic/statistical manner.

2.2 Bag-of-Words Modeling

The Bag-of-Words (BoW) model has been derived from natural language processing and information retrieval research and has proved itself one of the most promising methods in the literature for its simplicity and modularity. The BoW algorithm, also called bag-of-keypoints or bag-of-features in computer vision, was first developed in text corpus analysis for indexing electronic documents statistically by their lexical content. The idea of BoW in computer vision is to depict each image as an orderless collection of local features. The algorithm computes the visual feature distribution of each document within the dataset to model each category by using histograms of cluster centers, also called visual words, and simply compares this distribution to those of previously observed models for recognition. After giving robust results in text/natural language analysis, it was adapted to computer vision applications as a generative model, applied to images by using a visual analogue of a word, i.e. a visterm, formed by quantizing visual features such as


region descriptors (Sivic, 2003). Because local features have proved to be more robust to visual transformations, bag-of-words modeling represents these features rather than global baseline visual features like color, texture or intensity. Several works have achieved high performance in scene classification using the BoW modeling approach (Fei-Fei, 2005A; Lazebnik, 2006A; Quelhas, 2005). In detail, constructing the bag-of-words model from images involves detecting regions/interest points as local patches, computing local descriptors over these patches, quantizing all descriptors found in the dataset images into visual words to form a visual vocabulary, and counting the occurrences of each vocabulary word over images. By mapping the descriptors in an image to the visual vocabulary, we obtain a novel representation of the image as a feature vector of visual word counts. Figure-2.4 illustrates the four general steps involved in the BoW model, and a small code sketch of the last three steps follows the figure.

Figure-2.4. Steps for producing the BoW representation for images (Bosch, 2008B).
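The last three steps of the pipeline can be summarized in a few lines of code. The sketch below assumes local descriptors have already been extracted for every image (for example, dense SIFT vectors) and uses scikit-learn's KMeans for vocabulary construction; the vocabulary size and toy data are arbitrary and only illustrate the mechanics, not any particular paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(descriptors_per_image, vocab_size=200):
    """Cluster descriptors into visual words and represent each image as a word-count histogram."""
    all_desc = np.vstack(descriptors_per_image)                    # pool descriptors from all images
    kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(all_desc)
    histograms = []
    for desc in descriptors_per_image:
        words = kmeans.predict(desc)                               # quantize each descriptor
        histograms.append(np.bincount(words, minlength=vocab_size))
    return np.array(histograms), kmeans.cluster_centers_

# Toy data: 3 "images", each with 50 random 128-D descriptors.
toy = [np.random.rand(50, 128) for _ in range(3)]
hists, vocabulary = bow_histograms(toy, vocab_size=10)
print(hists.shape)   # (3, 10)
```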

The main advantage of the BoW method is its simplicity. Under a supervised classification method such as SVM or K-nearest neighborhood, the BoW vector forms the basic visual appearance representation for scene classification. Category labels are the only requisite in the learning step, and despite its simplicity this method has been widely used to classify images into a large number of categories. Another advantage of BoW is that it gives us general information about images, which helps to identify object/scene relations without any complicated segmentation or other low-level

processes. Supervision is a big problem when there are many items to cope with, especially in vision research, so BoW brings a new point of view to scene classification and object recognition algorithms. But, besides its simplicity, BoW has some problems in computer vision. The first question is how to describe features over a local region of an image, which forms the basis of BoW; different descriptors emphasize different invariance properties. Secondly, the spatial information of the feature words is lost when we describe an image as a histogram of visual words. This orderless representation unfortunately prevents us from making geometric correspondences between objects and from inferring a relation between the environment and the objects. Another problem is that we have to cope with polysemes, words with multiple meanings, and synonyms, words with identical meanings. We should know the true meaning of a visual word in a specific scene category, and combine or eliminate visual words which have similar meanings in that category. On the other side of the problem, we must distinguish the correct usage of a visual word in some way, analogous to lexical correctness measures in the text literature. Another aspect is the size of the visual vocabulary. The vocabulary is created by clustering local features in the pre-defined feature space and acknowledging each cluster center as a unique word in the vocabulary. Different from the text literature, the size of the vocabulary is determined by the number of cluster centers. A small vocabulary decreases discrimination, since two feature vectors may be assigned to the same visual word even if they are not similar. On the other hand, a large vocabulary restricts generalization and leads to noise words which have no semantic meaning. The other aspect is the keyword weighting scheme. Term weighting has an important effect on text information retrieval. A fundamental difference of visual terms is that text words are sampled according to lexical context while visual words are outputs of clustering algorithms; the text word carries semantic sense naturally, but the visual word refers to statistical information only. Generally, all the weighting schemes perform a nearest neighbor search in the visual vocabulary, where each feature vector is mapped to the most similar visual word. But this may not be a good choice, because two similar features might be assigned to different visual words as the size of the vocabulary increases. Besides, simply counting the frequencies may lead to misclassification, because two features which are assigned to the same visual word may not be equally similar to

that visual word, as their distances to the associated word are not the same in the metric domain. To conclude this discussion, state-of-the-art descriptors let us extract shape and appearance information over an image patch that is invariant/covariant to transformations such as viewpoint, illumination, scale, rotation and affinity in a discriminative way. However, we lose much of this information in the generalization of indexing each descriptor into a visual word during clustering, classification and similarity measuring, unless these modules are made robust. Many works in the literature use BoW modeling to classify scene categories by exploiting its advantages while overcoming its disadvantages. The first experiments using the BoW representation in the computer vision literature are related to texture classification. Leung and Malik use BoW modeling to recognize textures taken from different viewpoints and under varying illumination conditions by quantizing the convolution responses of a filter bank of 1st and 2nd Gaussian derivatives at different scales and orientations, Laplacian of Gaussian and Gaussian filters, 48 in total (Leung, 2001). Exemplar filter responses are chosen as "textons" via K-means clustering, and the texture images are represented by distributions of textons. Varma and Zisserman handle texture classification by either quantizing filter bank responses (Varma, 2002, 2005) or using affine covariant local image patches on a regular grid (Varma, 2003), and use the chi-squared statistic to measure distances between histograms in a nearest neighbor classification algorithm. Zhang et al. give a detailed survey of BoW systems by comparing the Harris-Laplace (Mikolajczyk, 2004) and Laplacian (Lindeberg, 1998) detectors and the SIFT (Lowe, 2004) and SPIN (Lazebnik, 2005) descriptors with different SVM kernel functions (Zhang, 2007). Perronin et al. create a vocabulary that covers the contents of all image categories in the dataset, and category-specific visual vocabularies are obtained by adapting this universal vocabulary with class-specific data (Perronin, 2006). While other approaches characterize an image with a single histogram, here an image is represented by a set of histograms, one per class. Each histogram describes whether an image is more suitably modeled by the universal vocabulary or by the corresponding adapted vocabulary. They represent a vocabulary of visual words by means of a GMM, where each Gaussian represents a unique word of the visual vocabulary. The universal vocabulary is trained using maximum likelihood estimation (MLE) and the class vocabularies are adapted using the maximum a posteriori (MAP)


criterion. This method is tested on classifying images such as sunset, underwater, cars and bikes. Since the introduction of BoW modeling, many different approaches have been used to classify scene categories. Although they show a large diversity in applications, we can basically divide BoW modeling into two lines of work: one which exploits spatial information by producing hierarchically cumulative BoW feature vectors to describe image/scene models, and the other which embeds probabilistic mechanisms into BoW modeling for training and classification.

2.3 Spatial Pyramid Matching Methods

Instead of quantizing feature vectors to visual words in a single clustering scheme, Grauman and Darrell develop a novel technique for computing an approximate distance directly between two collections of feature vectors (Grauman, 2007). In detail, their approach is to accumulate the feature vectors into a multi-resolution pyramid defined in feature space and then to count the number of features that fall into the associated bins at each level. The distance between the two sets of feature vectors is computed using histogram intersection between corresponding bins. These per-level counts are then summed in a weighted way that weights matches found at finer resolutions more heavily and, at coarser levels, discounts matches already found at finer levels. Hashing algorithms are also used in constructing the pyramid to reduce computation time. Lazebnik et al. introduce a new insight into bag-of-features methods, which represent an image as an orderless group of locally described features. They argue that since these methods disregard all information about the spatial layout of the local features, they have a very limited descriptive ability in scene classification, so one needs to integrate geometric correspondences based on histogram approximation using local features (Lazebnik, 2006A). To capture some rough global shape information and to separate an object from its background, they propose a new method, known as the "Spatial Pyramid", which accumulates statistical histograms of bag-of-features over increasingly fine sub-regions and finds the pairwise relations between scene categories with the pyramid matching


scheme of Grauman and Darrell (Grauman, 2005). As Oliva and Torralba indicate that a holistic representation of images is effective for overall scene analysis and for categorizing images with their contextual objects (Oliva, 2001), Lazebnik et al. suggest that these cumulative histograms, repeatedly computed over finer-resolution sub-areas, would capture the semantic "gist" of an image (Torralba, 2003). In detail, Lazebnik et al. have implemented two kinds of features. The first is edge points at two scales and eight orientations, for a total of 16 channels, whose magnitudes exceed a minimum threshold in pre-specified directions. Besides these "weak features", they have experimented with SIFT descriptors of 16 by 16 patches computed over a grid with 8-pixel spacing on the whole image surface. After extracting features from the dataset images, they use the K-means algorithm to cluster a visual vocabulary, typically of size 200 or 400. Given the descriptors and the visual vocabulary, the algorithm computes histograms of visual words per image over a sequence of increasingly finer grids over the image (i.e. histograms over the whole image, over its quarters, and so on) and amasses all histograms for each image. For instance, when 200 words are used over 3 levels (levels 0, 1 and 2), the total length of the histogram vector for each image is 4200. From then on, we may think of images as integrated histograms at different resolutions. As Grauman and Darrell propose, Lazebnik et al. apply multiplicative weighting factors to each resolution level, where finer levels are weighted more highly than coarser ones, and calculate the total scores of matches in the cells of each level (Lazebnik, 2006B). They use a histogram intersection kernel to find the number of matches at a given level ℓ:

I(H_x^{\ell}, H_y^{\ell}) = \sum_{i=1}^{D} \min\left(H_x^{\ell}(i), H_y^{\ell}(i)\right)        (2.12)

where H_x^{\ell} and H_y^{\ell} denote the histograms of the X and Y sets of vectors in a d-dimensional feature space at resolution level ℓ, so that H_x^{\ell}(i) and H_y^{\ell}(i) are the numbers of points from X and Y that fall into the i-th cell of the grid. Lazebnik et al. have tested their spatial pyramid method on the thirteen-scene-category dataset of Fei-Fei and Perona (Fei-Fei, 2005A) and on their own fifteen-scene-category dataset. They have compared pLSA and Latent Dirichlet Allocation (LDA), which are known as unsupervised dimensionality-reduction aspect models, and have used an SVM to classify the test images in accordance with the trained dataset. In their results, they have achieved very good performance in scene classification and have concluded that aggregating global scene statistics with

related weights achieves improvements over orderless bag-of-words methods, even though it is very simple and efficient to compute. The pyramid matching and histogram computation are illustrated in Figure-2.5, and some experimental scene image samples are given in Figure-2.6.

Figure-2.5. Overview of Spatial Pyramid creation and histogram calculation. At the top, the image is subdivided iteratively at 3 different levels of resolution; at the bottom, the features corresponding to each word of the pre-determined visual vocabulary are counted, the histograms of the different levels are weighted by their associated multiplicative factors, and all histograms are concatenated (Lazebnik, 2006B).

Figure-2.6. Example images from scene category databases of Lazebnik et. al (Lazebnik, 2006A), Fei-Fei and Perona (Fei-Fei, 2005A), and Oliva and Torralba (Oliva 2001).
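To make the spatial pyramid representation and the intersection kernel in eq. (2.12) concrete, the following sketch builds per-level histograms from a grid of visual-word labels and matches two images with a weighted histogram intersection. The weighting here is a simplified version of the pyramid-match weights and, unlike the original formulation, does not subtract matches already counted at finer levels; the toy grids and vocabulary size are arbitrary.

```python
import numpy as np

def pyramid_histogram(word_map, vocab_size, levels=3):
    """Per-level BoW histograms from a 2-D map of visual-word labels on a patch grid."""
    h, w = word_map.shape
    pyramid = []
    for level in range(levels):                      # level 0 = whole image
        cells = 2 ** level
        level_hists = []
        for ci in range(cells):
            for cj in range(cells):
                block = word_map[ci * h // cells:(ci + 1) * h // cells,
                                 cj * w // cells:(cj + 1) * w // cells]
                level_hists.append(np.bincount(block.ravel(), minlength=vocab_size))
        pyramid.append(np.concatenate(level_hists))
    return pyramid

def pyramid_match(px, py, levels=3):
    """Weighted histogram intersection across levels; finer levels get larger weights."""
    score = 0.0
    for level in range(levels):
        weight = 1.0 / 2 ** (levels - 1 - level)     # simplified weighting scheme
        score += weight * np.minimum(px[level], py[level]).sum()
    return score

wm1 = np.random.randint(0, 20, (16, 16))             # toy 16x16 grids of word labels
wm2 = np.random.randint(0, 20, (16, 16))
print(pyramid_match(pyramid_histogram(wm1, 20), pyramid_histogram(wm2, 20)))
```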


Yang and Yu propose a spatial pyramid matching approach based on SIFT sparse codes for image classification. The method uses selective sparse coding instead of traditional vector quantization to extract salient properties of the appearance descriptors of local image patches (Yang, 2009). In detail, given a whole image, they calculate all SIFT descriptors over a regular grid on which the descriptors overlap by half their size. After computing the descriptors, they cluster them with the K-means algorithm, as in the BoW model, to create the visual vocabulary. Afterwards, they encode all SIFT descriptors of each image into visual word vectors by using a max-pooling algorithm with the linear feature-sign search method (Honglak, 2006) in the dictionary learning step. For instance, if 1000 SIFT descriptors of an image vote for a visual vocabulary of 512 vectors by max pooling, then we can eventually describe the whole image by a single vector of length 512 which carries the weighted sums of votes as a histogram. They repeatedly divide each image and run sparse coding with max pooling for each sub-region, and then create a pyramid vector which describes the image cumulatively. In contrast to other spatial pyramids, because this histogram vector does not contain count data, they do not apply different weighting functions to its sub-parts. In the classification part of the algorithm, they test this linear spatial pyramid histogram with large-scale SVM kernels. They use the Caltech-256 dataset to justify their proposal and conclude that linear spatial pyramid histograms perform well with linear SVMs in scene classification, and also that the sparse codes of SIFT features may serve as a better local appearance descriptor for general image processing tasks. An illustration of the sparse coding algorithm is depicted in Figure-2.7.

Figure-2.7. The structure of the linear spatial pyramid algorithm based on sparse coding. Sparse coding measures the responses of each local descriptor to the dictionary's "visual elements" with a max-pooling algorithm in association with the feature-sign search algorithm. These responses are pooled across different spatial locations over different spatial scales (Yang, 2009).
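The pooling step of this approach can be illustrated with a small sketch: given sparse codes (one coefficient vector per local descriptor over the dictionary), max pooling keeps, for every dictionary element, the largest absolute response in the region. The random codes below merely stand in for the output of the feature-sign sparse coding step, which is not reproduced here.

```python
import numpy as np

def max_pool(sparse_codes):
    """sparse_codes: (num_descriptors, dict_size) sparse coefficients.
    Returns a single dict_size-dimensional region descriptor via max pooling."""
    return np.abs(sparse_codes).max(axis=0)

# Toy stand-in for sparse codes: mostly zero coefficients over a 512-word dictionary.
codes = np.random.randn(1000, 512) * (np.random.rand(1000, 512) < 0.02)
image_vector = max_pool(codes)
print(image_vector.shape)        # (512,)
```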


2.4 Latent Semantic Analyzing Methods

Bosch and Zisserman argue that Probabilistic Latent Semantic Analysis, pLSA, is very effective for the task of scene classification in a weakly supervised setting. Their algorithm first constructs a visual term (visterm)-image co-occurrence table by clustering a subset of the feature descriptors extracted from training images with the K-means algorithm into histogram vectors; it then discovers latent "object-based topics" by using pLSA, a generative dimensionality-reduction model from the statistical language processing literature first introduced by Hofmann (Hofmann, 1998; 2001), applied to a bag-of-visterms representation of each image in the dataset; and it eventually trains a classifier such as K-nearest neighborhood (KNN) or Support Vector Machine (SVM) on the topic distribution vectors conditioned over visterms and images (Bosch, 2008). They have produced a promising hybrid technique which comprises an unsupervised pLSA training and information retrieval part and a weakly supervised multi-way classifier, in a generative/discriminative manner. Besides that, they have used different sparse and dense feature descriptors with varying patch sizes and spacings between interest region centers, such as low-level gray/color patches and sparse/dense color/gray SIFT descriptors (Lowe, 2004; Weijer, 2006), to compute a co-occurrence table where each image is transformed into a collection of visterms drawn from the visual vocabulary. This vocabulary is obtained by quantizing features from training images using k-means (Bosch, 2006). The whole algorithm is illustrated in Figure-2.8. They have compared their classification performance to previous methods (Fei-Fei, 2005A; Lazebnik, 2006A; Oliva, 2001; Vogel, 2007) using the respective authors' own datasets, examining supervised/unsupervised behavior, different features/descriptors for vocabulary creation, normalized/unnormalized images, discriminative classification via KNN and SVM, representation of images via probabilistic topic-image distributions versus Bag of Words (BoW), and spatial information embedded into the descriptors. They have concluded that their new descriptors combined with this hybrid model are very effective on very challenging scene classification datasets that have highly varying inter/intra-class complication


problems (Bosch, 2008A). We will discuss pLSA and SIFT descriptors in detail, as we use these promising methods in our classification structure in Chapter 3.

Figure-2.8. Overview of the visual vocabulary formation, learning and classification stages of the generative/discriminative scene classification model (Bosch, 2008B).

Oliva and Torralba argue that, since humans can easily categorize a new image semantically at a single glance, the general structure of a scene image can be estimated by using the spatial layout properties of scene images with global features; this is called the "Spatial Envelope Representation" (Oliva, 2001). This work does not involve segmentation or grouping, but global spatial scale calculations. According to this idea, global features carry local feature properties and encode the spatial relationships between parts of a scene, so that different spatial scales of resolution provide different clusters of information for recognition purposes. In detail, they calculate the global feature values of an image as the weighted summation of the outputs of a bank of multi-scale oriented filters. They use Principal Components Analysis (PCA) for the weight calculations and downsample the output of each filter to N by N, where N is between 2 and 6, for computational efficiency. As a result, each image can be summarized by the NxNxK global feature vector produced by the filter bank. Here, K is the total number of orientations and scales of the filters, and N is the spatial size of the weighted and downsampled magnitudes of the filter outputs. After applying


PCA to all the images, they obtain global templates which represent all categories in the dataset. The model of a scene gist, which usually includes the semantic label of a scene (e.g. living room), should go beyond finding curvatures, edges, corners or even objects in an image, and should instead describe the semantic information of the scene. With the help of the global feature templates, they manually rank the training images along 6 different properties and can then describe the test images within these new spatial-layout property axes. The properties are naturalness, openness, expansion, depth, roughness and homogeneity. For instance, a forest scene can be depicted by some degree of roughness and homogeneity of its textural components, and these components may help us to distinguish between different forest scenes, as well as from other classes of scenes. This new holistic scene-centered description is depicted in Figure-2.9.

Figure-2.9. Schematic representation of Spatial Envelope model. A-Spatial envelope properties are classified mainly into boundary and content properties and these are classified into openness, expansion, naturalness and roughness. B- Roughness is depicted as object recognition in whole image. C- Projection of 1200 urban scenes into 3 spatial envelope properties axes (Oliva, 2001).
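A heavily simplified, single-scale stand-in for this kind of global representation is sketched below: an image is convolved with a small bank of oriented Gabor-like filters and the response energy is averaged over a coarse spatial grid. The filter design, parameters and grid size are our own toy choices, and the PCA weighting step of the original work is omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(size, theta, freq=0.25, sigma=4.0):
    """A single real Gabor filter; a toy stand-in for the multi-scale oriented filters."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * freq * xr)

def gist_like(image, orientations=4, grid=4):
    """Average filter-response energy over a grid x grid layout for each orientation."""
    feats = []
    h, w = image.shape
    for k in range(orientations):
        resp = np.abs(fftconvolve(image, gabor(15, np.pi * k / orientations), mode="same"))
        for i in range(grid):
            for j in range(grid):
                cell = resp[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
                feats.append(cell.mean())
    return np.array(feats)

print(gist_like(np.random.rand(64, 64)).shape)   # (64,)
```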

Csurka and Dance present another novel method for natural image categorization based on bags of keypoints (Csurka, 2004). Again, their first problem is to detect and describe salient image patches, and they use the Harris affine detector described by Mikolajczyk et al. (Mikolajczyk, 2002). This choice of detector benefits from scale and affine invariance, resulting in local descriptions of elliptical regions which are resistant to geometric, lighting and viewpoint changes. In detail, Harris affine points are captured as local maxima positions of

the scale-adapted Harris function and as local peaks in scale of the Laplacian, iteratively. This detector has a region given by the selected local maximum scale and a shape given by the eigenvalues, along the dominant orientation axes, of the image's second moment matrix, also called the auto-correlation matrix. This matrix describes the gradient distribution in a local neighborhood (Tuytelaars, 2007):

M = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} I_x^2(x, \sigma_D) & I_x(x, \sigma_D) I_y(x, \sigma_D) \\ I_x(x, \sigma_D) I_y(x, \sigma_D) & I_y^2(x, \sigma_D) \end{bmatrix}        (2.13)

I_x(x, \sigma_D) = \frac{\partial}{\partial x} g(\sigma_D) * I(x)        (2.14)

g(\sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}        (2.15)

where the local image derivatives are computed with a Gaussian of scale σ_D (the derivative scale), and the derivatives are then averaged in the neighborhood of the local maxima by smoothing with a Gaussian window of scale σ_I (the integration scale). The affine elliptical region is then transformed into a circular region with an interpolated normalization, and SIFT descriptors (Lowe, 1999) are computed on that normalized patch. SIFT descriptors are, in short, Gaussian derivative representations of an image at different scales, computed on 8 orientation planes rotated relative to the dominant orientation over a 4x4 grid of spatially symmetric positions, giving a 128-dimensional vector. They also note that SIFT descriptors perform best in a comparison of several descriptors surveyed by Mikolajczyk et al. (Mikolajczyk, 2003). After creating descriptors over the dataset images, they run the k-means clustering algorithm with different numbers of quantized vectors (k) and select the initial cluster centers giving the lowest experimental risk in categorization. They find that the best vocabulary size is 1000 for 640,000 interest point descriptors from seven natural scene categories. For categorization, they experiment with a Naive Bayes classifier, which can be briefly described as a maximum a posteriori probability classifier in which a scene category is selected according to the class prior probabilities and each visual word in the document is chosen independently from a multinomial distribution over visterms specific

to that scene category. The other classifier, SVM, may be described as a maximum-margin hyperplane separator between two-class data. While many SVM variants exist in the literature, they choose, like many other researchers in scene classification, the one-against-others approach: given m categories of scene images, they train m SVMs, each separating images of one category from images of all other categories, and the query image is assigned to the class label which gives the largest SVM output (Csurka, 2004). They use 1776 images from seven classes (faces, buildings, cars, phones, bikes, books, trees) and conclude that the SVM is superior to the Naïve Bayes classifier as a training and testing algorithm. Fei-Fei and Perona introduce a novel approach to scene categorization, a "Hierarchical Bayesian Model" used in both learning and recognition (Fei-Fei, 2005A). They model each image as a collection of local patches which are described by a large vocabulary of codewords. The goal of learning is to obtain the model that best fits the distribution of these codewords for each category of scenes in an orderless manner. For recognition, they first identify the codeword distribution in the new query image, as in training, and then find the category model that best fits that of the particular image. They use Latent Dirichlet Allocation (LDA) to create hierarchical patterns in classification. In detail, for each class label (like mountain or forest), they choose a probability vector that indicates which intermediate themes (i.e. objects) to select while labeling each patch of the scene. To create each patch in the image, they first determine a specific theme out of the mixture of possible themes. For instance, if "house" is selected for a suburb scene, priority is given to codewords that occur more frequently in "house" themes, which favor horizontal and vertical lines. The process of drawing a theme and related codewords is repeated many times until the entire bag of patches that constitutes a "suburb" scene is formed. A flow chart of the algorithm is summarized in Figure-2.10.


Figure-2.10. Flow chart of the Hierarchical Bayesian algorithm (Fei-Fei, 2005B).

For region detection and representation, they experiment with four different ways of extracting local image regions: 1) an evenly sampled grid with 10-pixel spacing and randomly sampled scales between 10 and 30 pixels; 2) random sampling of 500 patches per image with scales between 10 and 30 pixels; 3) the Kadir and Brady saliency detector (Kadir, 2001), giving roughly 100-200 patches with the above-mentioned varying scales; and 4) Lowe's DoG detector (Lowe, 1999), which detects roughly 100-500 rotation-invariant regions over the image at different scales. They use two different descriptor representations: 11-by-11 gray patch values and SIFT descriptors. After finding patches in the training images of all categories, they learn the codebook by running the K-means algorithm. Clusters with fewer members than a predetermined threshold are pruned out, and codewords are then defined as the centers of the learnt clusters. They test on 13 scene categories with a total of 3589 images, and conclude that a regular grid combined with SIFT descriptors achieves the best performance. Some examples of the scene categories are shown in Figure-2.11.


Figure-2.11. 13 categories of scene images, 9 out of 13 are outdoor, and the rest are indoor images (Fei-Fei, 2005B).

Sivic et al. investigate how to discover a hierarchical structure from a collection of unlabeled images automatically by using the Hierarchical Latent Dirichlet Allocation (hLDA) model, which has previously been used for unsupervised discovery of topic hierarchies in the text document literature (Sivic, 2008). While previous unsupervised approaches focus on segmenting the image or partitioning the visual data into non-overlapping classes of equal weight, they propose to group visual objects/scenes in a multi-layer hierarchy tree based on a common vocabulary generated at the different layer levels. Like LDA (Blei, 2003), hLDA generates image models as compositions of topic distributions, but the topics are produced along a path through a tree which separates them more specifically. Thus the structure of the tree and the topic hierarchy from root to leaves can be learnt during training in an unsupervised manner. While the flat LDA model describes each document as a superposition of K topics with estimated mixture weights, hLDA organizes the topics in a tree of depth L: each node in the tree has a topic, and each image is generated by the sequence of topics on a single path from root to leaf through the learnt tree. They also note that hLDA can be viewed as


a collection of standard LDA mixture models, each situated along one full path of the tree, so that topics associated with internal nodes are shared by two or more LDAs and the root node is shared by all LDAs. For image representation they, like many others, use quantized SIFT descriptors, but they create different visual words to obtain a "coarse-to-fine" description of the image, with varying degrees of appearance and spatial localization grids, to achieve a hierarchical representation suitable for hLDA. They argue that dense representations perform better than sparse ones, so they place a regular rectangular grid over the image. SIFT descriptors are computed over the dense patches extracted at three different scales, as in Bosch's implementation (Bosch, 2006), and assigned to the nearest visual word of a vocabulary pre-calculated using the K-means algorithm on a subset of the training dataset. They eliminate some descriptors by thresholding the sum of gradient magnitudes within the patch, and create vocabularies of 11 and 101 visual words (one word in each reserved for empty patches) over grids of varying cell counts: 1x1, 3x3 and 5x5. Eventually, with the help of the 2 different vocabularies and the three spatial grid sizes, they establish 3446 visual words as hLDA input. They test 4- and 5-level hLDAs which have different visual vocabularies in each layer, created in the aforementioned way, and conclude that meaningful object hierarchies can be learnt from the training data in an unsupervised manner and that hLDA outperforms flat LDA in the experiments, since this hierarchical topic discovery model exploits both appearance and spatial layout. An example of a 3-level hLDA model is depicted in Figure-2.12.

Figure-2.12. Illustration of hLDA model. (a) Three level bar topic hierarchy. Each node of tree represents a topic containing 5 different words from a 25 word-vocabulary. Each topic is represented as a 5x5 pixel image. (b) A bar hierarchy automatically trained by hLDA from a collection of 100 images sampled from model (a) (Sivic, 2008).


Sivic and Russell, in another work, explore the power of pLSA and model images as mixtures of topics by using the posterior probabilities of visual words over images (Sivic, 2005). They compare Latent Dirichlet Allocation (LDA) and probabilistic Latent Semantic Analysis (pLSA), which have similar attributes in unsupervised topic discovery in the text literature, except that pLSA brings a probabilistic insight to the Singular Value Decomposition (SVD) process, where two observed variables are associated with a third unobserved variable to reduce the dimensionality of the search space (Hofmann, 1998). In detail, they first extract SIFT descriptors on affine covariant regions around interest points, as described by Mikolajczyk (Mikolajczyk, 2002), and the vector quantization is carried out by the K-means algorithm computed on about 300K regions, resulting in a total visual vocabulary of 2237 words which are insensitive to some degree of viewpoint, illumination, rotation and scale changes. They implement the pLSA algorithm with the Expectation-Maximization (EM) methodology and eventually compare it with a K-means baseline method in the model learning step. This baseline method uses the same word frequency vectors for each image, but the statistical pLSA clustering based on Kullback-Leibler (KL) divergence is replaced by Euclidean distance and each image is assigned to one cluster. In topic discovery with pLSA, they seek 4 topics (objects) for about 4000 images of 5 categories from the Caltech 101 dataset previously used by Fergus et al. (Fergus, 2003). At the classification level, they carry out the recognition process of pLSA with the "fold-in" heuristic described by Hofmann (Hofmann, 2001). Another novel contribution of this work is that they evaluate the image's spatial segmentation discovered by model fitting. In this case, they extract the maximum posteriors of topics conditioned on a specific visual word and image, and thus assign a topic label to each word over the patch regions which surpass some threshold. They note that there is a significant alignment of visual words with the corresponding object areas of the image. They conclude that using these learnt topics for classification improves segmentation and that these local appearances add flexibility to the task of visual interpretation. Similar experiments on segmentation using pLSA topic posteriors have been conducted by Bosch, and impressive improvements have been reported (Bosch, 2008). Some examples of topics found by pLSA are displayed in Figure-2.13.


Figure-2.13. The highest probable words shown in rows of 5 different examples and 4 learnt topics via pLSA. (a) Faces (b) Motorbikes (c) Airplanes, and (d) Cars (Sivic, 2005).

Fei-Fei and Li bring a new point of view to event recognition by combining scene and object recognition. An event in an image, a human activity taking place in a specific environment, is classified while semantic labels are provided for the objects and the scene environment in the image (Fei-Fei, 2007A). As an example, they try to extract a snowboarding event from a scene image by classifying the environment as a snowy mountain/hill and recognizing objects in the image such as a snowboard, a cable railway and humans. They achieve this integrated object and scene recognition with a generative graphical model and test their algorithm on a database of 8 challenging sport events. As noted in a short course on object recognition and classification (Fei-Fei, 2007B), humans can recognize many individual objects in a scene and thus perceive interactions at a single glance, so they propose to identify the scene environment in a holistic fashion. In this context, they try to answer the questions of where (the scene environment label), who (a list of object categories), and finally what (the event label) from static images. They note that the scene and the objects are independent of each other given an event, but the presence of both affects the probability of an event in an image. In the scene part of their method, they use the algorithm described by Fei-Fei et al. (Fei-Fei, 2005A), the hierarchical Bayesian model for natural scene categorization, where learning is done with global statistics of scene categories through local patch frequency distributions in a hierarchical way. In their integrative model, local image patches sampled over a grid of size 10x10 and described by SIFT descriptors are the building blocks of their holistic scene interpretation task, because dense uniform sampling over


the image has proved to be more effective than using specific sparse interest point detectors. They also add appearance and shape information to the object part of their method and calculate maximum likelihood estimates over scene, object and event. They have tested their algorithm with an LDA supervised learning method, in a scene-only event recognition setting with 8 different event classes and 1040 images, and have achieved highly precise results for labeling the scenes. Besides, they have published another work based on this generative model: a fully automated learning framework that is able to learn robust scene models from noisy web data and ambiguous user tags from flickr.com. While classifying, annotating and segmenting images, the algorithm captures the co-occurrences of objects and high-level scene classes (Li, 2009). They support their idea by noting that when humans observe a scene, three simultaneous actions happen: 1) high-level recognition (classification), 2) identification of specific objects in the scene (annotation), and 3) localization of scene components (segmentation). This is illustrated in Figure-2.14.

Figure-2.14. An example of the generative model described by Li (Li, 2009). At the scene level, the whole image is categorized as a "polo" scene. A number of objects are extracted and segmented by visual information, hierarchically represented by object regions and feature patches. Tags are predicted by these scene-object correspondences.

Intuitively, this model can be thought of as 4 parallel LDA models sharing the same scene class. Another contribution of this work is to pick as tags words that are conceptually related to an image. They have tested their algorithm on a dataset of 9 scene categories, each with 800 images, and a total of 1256 tags, from which the authors

chose the 30 most frequently appearing tags for the segmentation experiments. They have used precision, recall and F-measure rates for comparison with other works. In their standard form these are:

precision = \frac{TP}{TP + FP}        (2.16)

recall = \frac{TP}{TP + FN}        (2.17)

F = \frac{2 \cdot precision \cdot recall}{precision + recall}        (2.18)

where TP, FP and FN are the numbers of true positive, false positive and false negative decisions; these measures are commonly used in retrieval evaluation. They finally conclude that their work is closely related to image understanding, machine translation between words and images, simultaneous object recognition and segmentation, and learning semantic visual models from Internet data. Owing to its generative form, the algorithm can provide a single label or multiple labels to an image without localization, and can separate background clutter from foreground objects.


Chapter 3
Scene Classification Using Pyramid of Latent Topics

In this chapter, we give a detailed description of the algorithm we have developed for scene classification. In short, our task is to classify a query image into one of a given set of scene labels (e.g. suburb, kitchen, bedroom, inside city, etc.). To achieve this challenging goal, we need to establish a chain of procedures, namely feature extraction, image representation, training and classification. To this end, we first explain the feature extraction process in Section 3.1, followed by the probabilistic Latent Semantic Analysis (pLSA) model in Section 3.2; pLSA is a generative model from the probabilistic natural language processing literature, which we use to represent each image as a topic distribution feature vector instead of the conventional bag-of-visual-words representation. In Section 3.3, we describe how we construct our scene classification algorithm using two approaches: one based on "Cascaded pLSA" applied to each sub-region of an image, and the other based on "Semantic Segmentation" of an image using the maximum posterior of topics over its sub-regions. Finally, in Section 3.4, we introduce the two classification methods which we have used in our algorithm to obtain comparative results for scene classification: K-nearest neighborhood (KNN) and Support Vector Machine (SVM).

3.1 Feature Extraction and Visual Vocabulary Generation

The first part of the feature extraction process is the detection of interest regions over an image. The second part is to obtain a representation of those regions that allows us to find correct matches between specific objects/scenes. Although we mostly deal with semantic description in scene classification problems, the basis of semantic modeling is the low-level description of interest regions based on the appearance information of an image. Thus, a descriptor should be sufficiently discriminative to provide correct matches with a low probability of mismatches in a large database, reasonably invariant to illumination changes and other appearance

transformations such as scaling, rotation and minor changes in viewpoint, and ultimately easy to extract. In scene classification tasks, most recent methods (Lazebnik, 2006A; Bosch, 2008; Jiang, 2007) use the SIFT descriptor, proposed by Lowe (Lowe, 2004), for local appearance representation, because of its robustness to these deformations and transformations. Besides, it has been concluded experimentally that local SIFT features achieve better results for classification tasks (Mikolajczyk, 2005A). Although we only use the descriptor part of the SIFT method, we first give a brief explanation of the whole method for the sake of completeness. We then describe the dense SIFT variant which we have used in our experiments and the construction of the visual vocabulary, respectively.

3.1.1 Scale Invariant Feature Transform (SIFT)

In the original formulation of SIFT, there are four stages to extract local features over an image: scale-space local extrema detection, accurate keypoint localization, orientation alignment and generation of local image descriptors. The first two steps detect keypoints with reasonable accuracy in spatial and scale localization, and filter out unstable keypoints with low contrast or high edge responses. The latter two steps assign a dominant orientation to each keypoint at the scale-normalized image, to achieve invariance to rotation and scale, and create descriptors for the local regions around keypoints by using gradient histograms of quantized orientations at bilinearly interpolated Cartesian grid centers. First, to obtain the scale-normalized patches, an interest point detection procedure is established by the scale-space Difference-of-Gaussians (DoG) function given as:

D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)        (3.1)

L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)        (3.2)

G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}        (3.3)

where D(x, y, σ) is the difference of Gaussian-blurred images at scales kσ and σ, k is the multiplicative factor between scale intervals, and L(x, y, σ) is the convolution of the image I(x, y) with the Gaussian function G(x, y, σ), which has been shown to be the proper kernel for building a scale-space of an image (Lindeberg, 1994). In detail, an input image is repeatedly smoothed by Gaussian kernels with different scales and downsampled by bilinear interpolation, and then scale-adjacent Gaussian-blurred images are subtracted to create a pyramid of octave-based DoG images. The keypoints are selected in DoG scale-space as local extrema over a 26-neighborhood, followed by location and scale tuning, low-contrast filtering and edge response elimination with different thresholds and formulas described in the paper (Lowe, 2004). An illustration of the SIFT keypoint detection algorithm is depicted in Figure-3.1.

[Figure-3.1 content: (1) DoG pyramid initialization, (2) local extrema detection in DoG scale-space, (3) keypoint localization in DoG scale-space, (4) keypoint elimination by low-contrast and edge response thresholds.]

Figure-3.1. Illustration of SIFT keypoint detection algorithm (Lowe, 2004).
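Equations (3.1)-(3.3) can be sketched directly before moving to the dense variant used in this thesis: a stack of Gaussian-blurred images with increasing scale is built and adjacent pairs are subtracted to obtain DoG images, in which 26-neighborhood extrema are sought. SciPy's gaussian_filter stands in for the explicit convolution with G(x, y, σ); octave handling, subpixel refinement and the contrast/edge tests of the full algorithm are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, num_scales=5):
    """Eqs. (3.1)-(3.2): L(x, y, sigma) = G(sigma) * I, then D = L(k*sigma) - L(sigma)."""
    blurred = [gaussian_filter(image, sigma0 * k ** i) for i in range(num_scales)]
    return [blurred[i + 1] - blurred[i] for i in range(num_scales - 1)]

def is_local_extremum(dog, s, y, x):
    """Check the 26-neighborhood of D(x, y, sigma_s) across space and scale."""
    patch = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dog[s - 1:s + 2]])
    center = dog[s][y, x]
    return center == patch.max() or center == patch.min()

image = np.random.rand(32, 32)                     # toy image
dogs = dog_stack(image)
print(len(dogs), is_local_extremum(dogs, 1, 10, 10))
```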

After accurate keypoint localization, the next step is to determine the keypoint dominant orientation by using a gradient orientation histogram which is computed in the neighborhood around the keypoint location on the Gaussian-blurred image (i.e. the scale-normalized image) with the closest scale to the keypoint's scale.

The interest region is a rectangular window around the keypoint with a width of 1.5 times the keypoint's scale. The contribution of each pixel to the histogram is calculated by multiplying its gradient magnitude with a Gaussian kernel whose standard deviation σ equals the rectangular region width. The scale-normalized Gaussian image gradient magnitude and orientation are computed for every pixel in the region by:

M(x, y) = \sqrt{I_x(x, y)^2 + I_y(x, y)^2}, \qquad \theta(x, y) = \tan^{-1}\left(I_y(x, y) / I_x(x, y)\right)        (3.4)

where I_x(x, y) and I_y(x, y) are simple derivatives of the Gaussian image I(x, y), computed by pixel differences at location (x, y) in the x and y directions, respectively:

I_x(x, y) = I(x + 1, y) - I(x - 1, y), \qquad I_y(x, y) = I(x, y + 1) - I(x, y - 1)        (3.5)

The orientation histogram is quantized into 36 bins (i.e. 10 degrees per bin) and peaks in the histogram correspond to dominant orientations. All the properties of the keypoint are calculated relative to the dominant orientation at the scale-normalized image, which provides invariance to scale and rotation. Once a keypoint orientation has been selected, the last step is the descriptor computation. The feature descriptor is computed as a set of orientation histograms over a 4 x 4 Cartesian grid of neighborhoods around the keypoint in the Gaussian image. In detail, the orientation histograms, quantized into 8 bins (i.e. 45 degrees per bin), are calculated individually for the corresponding 16 grid sub-regions (4 per dimension in x and y), and are rotated relative to the pre-calculated keypoint dominant orientation. As before, the orientation data comes from the Gaussian image closest in scale to the keypoint's scale, and the weighting scheme is the same as in the dominant orientation calculation. Eventually, each histogram contains 8 bins, and the SIFT local descriptor is the concatenation of the 16 orientation histograms of the grid sub-regions. This leads to a 4 x 4 x 8 = 128 dimensional feature vector derived from the sums of orientation


magnitudes at each grid cell, relatively rotated to the dominant orientation of the keypoint:

SIFT = (h_{r(1,1)}, h_{r(1,2)}, h_{r(1,3)}, h_{r(1,4)}, \ldots, h_{r(4,1)}, h_{r(4,2)}, h_{r(4,3)}, h_{r(4,4)})

h_{r(l,m)}(k) = \sum_{(x, y) \in r(l,m),\; \theta(x, y) \in bin\,k} M(x, y)\left(1 - |\theta(x, y) - c_k| / \Delta_k\right)        (3.6)

where c_k is the center of orientation bin k, Δ_k is the orientation bin width, (x, y) are the pixel coordinates in sub-region r(l, m), and M(x, y) is the gradient magnitude weighted by a Gaussian kernel with scale σ equal to 1.5 times the scale of the keypoint. The final step is to normalize the descriptor in eq. (3.6) to unit L2 norm in order to reduce the effects of uniform illumination changes. The orientation assignment and descriptor formation are illustrated in Figure-3.2.

Figure-3.2. Illustration of orientation assignment and descriptor formation.
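A stripped-down version of the descriptor computation in eqs. (3.4)-(3.6) is sketched below for a single 16x16 patch: pixel-difference gradients, an 8-bin orientation histogram per 4x4 sub-region, concatenation and L2 normalization. Dominant-orientation rotation, Gaussian weighting and bilinear interpolation are deliberately left out, so this is an illustration of the structure rather than a faithful SIFT implementation.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Simplified 128-D descriptor for a 16x16 patch: 4x4 grid of 8-bin orientation histograms."""
    ix = np.zeros_like(patch); iy = np.zeros_like(patch)
    ix[:, 1:-1] = patch[:, 2:] - patch[:, :-2]            # pixel differences, as in eq. (3.5)
    iy[1:-1, :] = patch[2:, :] - patch[:-2, :]
    mag = np.sqrt(ix ** 2 + iy ** 2)                      # gradient magnitude, eq. (3.4)
    ori = np.mod(np.arctan2(iy, ix), 2 * np.pi)           # orientation in [0, 2*pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    desc = []
    for gy in range(4):
        for gx in range(4):
            sl = (slice(gy * 4, gy * 4 + 4), slice(gx * 4, gx * 4 + 4))
            hist = np.bincount(bins[sl].ravel(), weights=mag[sl].ravel(), minlength=8)
            desc.append(hist)
    desc = np.concatenate(desc)
    return desc / (np.linalg.norm(desc) + 1e-12)          # L2 normalization

print(sift_like_descriptor(np.random.rand(16, 16)).shape)  # (128,)
```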


3.1.2 Dense SIFT

The keypoint detection in the SIFT algorithm has many free parameters, such as the initial scale of the first Gaussian-blurred image of the first octave, the multiplicative factor k, the number of blurred images in each octave, the number of octaves, the offset thresholds for tuned localization, and the low-contrast and edge response elimination thresholds. Thus many implementation details of keypoint detection are determined with respect to the application purpose. In this thesis, we extract SIFT descriptors at regular grid positions all over the image for the scene classification problem. Comparative results show that dense keypoints over the whole image surface work better than sparsely detected keypoints for scene classification (Fei-Fei, 2005A; Bosch, 2008). Since a dense image description is necessary to capture uniform semantic regions such as sea, sky or forest, we have used a dense SIFT representation for each image, with 16 by 16 pixel patches (i.e. scale 8 pixels) computed over a regular rectangular grid with 8-pixel spacing. Lazebnik et al. used the same technique in the "Spatial Pyramid" method and achieved very promising results in scene classification (Lazebnik, 2006A). Dense SIFT descriptors over an image are illustrated in Figure-3.3.

Figure-3.3. Illustration of dense SIFT descriptors over an image.
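The dense layout itself is simple to express; the sketch below enumerates the top-left corners of 16x16 patches on a regular grid with 8-pixel spacing and applies an arbitrary patch descriptor at each position. The function names are ours and the descriptor passed in is just a placeholder.

```python
import numpy as np

def dense_patch_grid(height, width, patch=16, step=8):
    """Top-left corners of 16x16 patches on a regular grid with 8-pixel spacing."""
    ys = np.arange(0, height - patch + 1, step)
    xs = np.arange(0, width - patch + 1, step)
    return [(y, x) for y in ys for x in xs]

def dense_descriptors(image, describe, patch=16, step=8):
    """Apply a patch descriptor (e.g. the SIFT-like sketch above) at every grid position."""
    return np.array([describe(image[y:y + patch, x:x + patch])
                     for (y, x) in dense_patch_grid(*image.shape, patch, step)])

image = np.random.rand(64, 96)
descs = dense_descriptors(image, lambda p: p.ravel())   # placeholder descriptor
print(len(dense_patch_grid(64, 96)), descs.shape)       # 77 grid positions, (77, 256)
```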


3.1.3 Visual Words and Vocabulary Generation

Our scene classification algorithm is based largely on the BoW / pLSA implementation. In this implementation, we analyze and compare images by tracking the number of occurrences of every visual word. At this point, we encounter two new concepts: visual words and the word-image co-occurrence table. For a set of images I = {i_1, i_2, i_3, i_4, ..., i_n}, a table of counts can be created where each row represents a word from the visual vocabulary (i.e. a cluster center) V = {w_1, w_2, w_3, w_4, ..., w_m} and each column represents an image. This two-mode table is the BoW representation of the images, and it is required for initializing pLSA modeling, for which the co-occurrence table is the only input. After the features of the dataset have been described using the dense SIFT algorithm of Section 3.1.2, we need to group them into visual words to create the aforementioned co-occurrence table, which contains a BoW representation of each image column-wise. In order to create the visual words of the co-occurrence table, the large set of descriptors that has been extracted densely from the dataset must be reduced to a smaller set of repeated terms by clustering the SIFT descriptors. One of the simplest and most efficient clustering methods in the literature is k-means, which is widely used since it is relatively quick and allows the user to choose the desired number of cluster centers, i.e. visual words.

In the k-means algorithm, also referred to as Lloyd's algorithm, we are given a set of n observations {x_1, x_2, x_3, x_4, ..., x_n}, the SIFT descriptors here, where each observation is a d-dimensional real vector. The main idea is to partition the n observations into k (k < n) sets S = {s_1, s_2, s_3, ..., s_k} so as to minimize the within-cluster sum of squares (Kardi, 2009):

\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in s_i} \| x_j - \mu_i \|^2        (3.7)


where \mu_i is the mean vector of set s_i. Although many distance functions exist in the literature, the Euclidean distance is commonly used to compute the sum-of-squares distance in the algorithm. In detail, given an initial set of k means \{\mu_1^{(1)}, \mu_2^{(1)}, \ldots, \mu_k^{(1)}\}, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:

1) Assignment step:

s_i^{(t)} = \{ x_j : \| x_j - \mu_i^{(t)} \| \le \| x_j - \mu_{i^*}^{(t)} \|, \; i^* = 1, 2, 3, \ldots, k \}        (3.8)

where each observation x_j is assigned to the cluster s_i with the closest mean \mu_i at iteration (t).

2) Update step:

\mu_i^{(t+1)} = \frac{1}{|s_i^{(t)}|} \sum_{x_j \in s_i^{(t)}} x_j        (3.9)

where the new means \mu_i for the next iteration (t + 1) are calculated as the centroids of the observations x_j assigned to the clusters s_i. This iterative refinement continues until no observation moves to a new cluster. After the visual words (i.e. cluster centers) have been found using a subset of the images, the entire set of images must be processed to match each descriptor with its nearest visual word (known as quantizing). Since the descriptors have already been computed and stored, the actual images are not processed in this step, just the descriptors. Each descriptor is compared to each of the visual words and is assigned the label of the closest word, using Euclidean distance. At this point, each image has been converted into a list of visual words and their counts. To compare images, however, we need a co-occurrence table of counts N = (n(w_i, d_j))_{ij}, where n(w_i, d_j) is the number of times the visual word w_i occurs in image d_j. This two-mode matrix will be used as the input to pLSA modeling, which is explained in detail below.
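The two alternating steps in eqs. (3.8)-(3.9) and the construction of the word-image co-occurrence table can be written compactly in NumPy, as in the sketch below. Initialization, convergence checking and empty-cluster handling are deliberately minimal, and the toy descriptor data is random.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate the assignment step (3.8) and the update step (3.9)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
        labels = d.argmin(axis=1)                                     # assignment step
        for i in range(k):                                            # update step
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers

def cooccurrence_table(descriptors_per_image, centers):
    """N[i, j] = n(w_i, d_j): count of visual word w_i in image d_j."""
    k = len(centers)
    table = np.zeros((k, len(descriptors_per_image)), dtype=int)
    for j, desc in enumerate(descriptors_per_image):
        d = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        table[:, j] = np.bincount(d.argmin(axis=1), minlength=k)
    return table

images = [np.random.rand(100, 128) for _ in range(5)]     # toy dense SIFT descriptors
centers = kmeans(np.vstack(images), k=8)
print(cooccurrence_table(images, centers).shape)          # (8, 5)
```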


3.2 Semantic Image Representation Based on pLSA

Probabilistic Latent Semantic Analysis (pLSA) is a statistical factor analysis method for binary or two-mode count data, used to generate latent class models. It was first proposed by Hofmann in the text literature, building on Latent Semantic Analysis (LSA), which is derived from the Frobenius-norm (i.e. L2-matrix) Singular Value Decomposition (SVD) of co-occurrence tables (Hofmann, 1998, 2001). Instead of the linear algebra of LSA, pLSA performs a probabilistic mixture decomposition in a statistical way to discover the topics in a document while using the Bag-of-Words (BoW) representation. In the text/natural language analysis field, documents are often analyzed by counting the number of occurrences of every word. For a set of documents, a table of counts can be produced where each row represents the counts of a word from a specific vocabulary V = {w_1, w_2, w_3, w_4, ..., w_m}, i.e. a cluster of keywords, and each column represents a document from D = {d_1, d_2, d_3, d_4, ..., d_n}. This co-occurrence table of counts, also called the term-document matrix, is denoted as the MxN matrix N = (n(w_i, d_j))_{ij}, where n(w_i, d_j) is the number of times the word w_i occurs in document d_j. The main assumption of this BoW representation is that it reasonably preserves most of the information needed for data retrieval/indexing/filtering tasks. Although this assumption is partially true, the features of human language make the situation harder than expected. For instance, due to the existence of synonyms, words with identical or almost identical meaning, two documents could be about the same topic in concept but share few words in lexical content. On the other hand, the existence of polysemes, words with multiple meanings, may result in indexing two documents under the same category although they refer to entirely different topics, just because they share common words lexically. Besides, since the term-document matrix counts every word used even once in the set of documents, many documents result in a very sparse BoW vector. This leads to incorrect similarity estimation, as the divergence between the word distributions of different documents becomes uninformative simply because the documents do not use lexically identical words. The basic idea of pLSA is to map the high-dimensional count vectors of documents to a lower-dimensional representation, the so-called "Latent Semantic Space"


(Hofmann, 1998). Representing the semantic relations between words and documents by generating a new variable space provides, in his terms, information beyond lexical word co-occurrences. The main difference between pLSA and LSA appears in the way the SVD process is conducted to generate the unobserved topic variables Z = {z_1, z_2, z_3, ..., z_k} from the observed word (V) and document (D) variables. Briefly, the SVD process can be described as decomposing a two-variable matrix into matrices in which each variable of the original matrix (i.e. V and D) is associated with a common new variable Z. In formulation, the SVD is given by:

$$N = V \Sigma D, \qquad N(m,n),\; V(m,k),\; \Sigma(k,k),\; D(k,n) \qquad (3.10)$$

Where V and D are orthonormal matrices, $VV^T = DD^T = I$, and the $\Sigma$ matrix contains the singular values of the term-document matrix N on its diagonal and zeros elsewhere. LSA approximates N with linear algebra (optimally in the Frobenius norm) by setting all but the largest K singular values on the diagonal of $\Sigma$ to zero, giving a dimensionally reduced representation in a new space called the "Latent Semantic" space (Hofmann, 2001). pLSA, in contrast, takes a probabilistic approach to the same decomposition problem by running the Expectation Maximization (EM) algorithm, which is generally used for model fitting, cross validation or complexity control in an independent multi-variable setting (Dempster, 1977). Intuitively, in pLSA there is a multinomial probabilistic distribution in which each word has K probabilities in each document: the probability of a word appearing in a document is related to the probability of it appearing in each topic, and to the probability of that topic being relevant to that document. Since LSA relies on linear algebra, it effectively allows each word to belong to only one topic, which leads to an inability to handle polysemy. The improvement provided by pLSA, on the other hand, is the ability for a word to belong to multiple topics, with the probabilities over topics normalized to sum to one (Rubel, 2008). Having explained the basis of pLSA, it is necessary to define the probabilistic notation used in the algorithm: P(d_i) denotes the probability that a randomly selected word belongs to document d_i, P(w_j | z_k) denotes the conditional probability of word w_j given the unobserved topic variable z_k, and P(z_k | d_i) indicates the probability that a random word from document d_i belongs to the unobserved topic variable z_k. Note that while the word (V) and document (D) variables are observed from the co-occurrence data N, the topic variable Z is an unobserved "hidden" variable created in the pLSA process. This is known as a generative model, because a new occurrence matrix can be generated by the following process:

a) Pick a document d_i with probability P(d_i)
b) Select a topic z_k with probability P(z_k | d_i)
c) Select a word w_j with probability P(w_j | z_k)

This process is depicted with the probabilistic SVD notation at Figure-3.4.

Figure-3.4. Illustration of the pLSA model. (a) Graphical model of the probabilistic relationship between the variables: N is the number of documents and Wd is the number of words per document; the document variable d and the word variable w are observed, while the topic variable is unobserved. (b) Matrix notation used in pLSA. (c) The SVD-like decomposition in pLSA. (Bosch, 2007B).


As we notice at Figure-3.4 (c), the joint probability is parameterized as:

$$P(d_i, w_j) = \sum_{k=1}^{K} P(z_k)\, P(d_i \mid z_k)\, P(w_j \mid z_k) = \sum_{k=1}^{K} P(d_i, w_j, z_k) \qquad (3.11)$$

In this respect, we obtain an observation pair (d, w) in which the observed variables d and w are conditionally independent given the topic, while the latent topic variable z is marginalized out. On the other hand, we can rewrite the joint probability as P(w, d) = P(d, w) = P(d) P(w | d) in the Naïve Bayes manner, and thus obtain the conditional probability distribution of words given a document as:

$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d) \qquad (3.12)$$

This matrix decomposition resembles the SVD used in LSA, but unlike LSA, the topic vectors P(w | z) and P(z | d) in the pLSA model are normalized to sum to one and contain no negative entries, so they form proper probability distributions. In order to compare documents in the latent space, however, we must work this process in reverse: starting with a term-document matrix, we want to generate a table that represents P(z | d), in order to describe each document as a mixture of topics. The standard way to do this is an Expectation Maximization (EM) process, which alternates an expectation step, where posterior probabilities are calculated for the latent variables Z based on the current estimates of the parameters, and a maximization step, where the parameters are updated based on those posterior probabilities, until the model is fitted (i.e. convergence conditions based on successive likelihood measurements are met). The Expectation (E) step uses the following equation:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k'=1}^{K} P(w_j \mid z_{k'})\, P(z_{k'} \mid d_i)} \qquad (3.13)$$

And the Maximization (M) step is formulated as:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{j=1}^{M} \sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)} \qquad (3.14)$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$

Where n(d_i, w_j) is the number of occurrences of word w_j in document d_i, and n(d_i) = Σ_j n(d_i, w_j) is the total number of words in document d_i. The E-step and M-step equations are applied alternately until a termination condition is satisfied; the iteration is usually stopped either by an iteration limit on the E/M alternation or by a threshold on the difference between two successive likelihood values. The log-likelihood of the complete data is computed from P(w_j | z_k) and P(z_k | d_i) as:

$$\mathcal{L} = \log P(D, W) = \sum_{d \in D} \sum_{w \in W} n(w, d) \log P(w, d) = \sum_{d \in D} \sum_{w \in W} n(w, d) \log \Big( P(d) \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d) \Big) \qquad (3.15)$$

To summarize, the posterior probabilities P(z | d, w) are computed in the E-step, and the parameters P(w | z) and P(z | d) are updated in the M-step so as to maximize the log-likelihood, which is equivalent to minimizing the Kullback-Leibler divergence between the observed data distribution and the fitted model distribution (Hofmann, 2001).
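As a concrete illustration of the E/M alternation in equations (3.13)-(3.15), the following is a minimal sketch in Python/NumPy (the thesis implementation is in MATLAB); the random initialisation and the small constants added for numerical stability are assumptions of this sketch, not part of the original formulation:

import numpy as np

def plsa_em(n, K, n_iter=100, seed=0):
    # n: (M, N) co-occurrence table n(w_i, d_j); K: number of latent topics
    rng = np.random.default_rng(seed)
    M, N = n.shape
    p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0, keepdims=True)   # P(w|z)
    p_z_d = rng.random((K, N)); p_z_d /= p_z_d.sum(axis=0, keepdims=True)   # P(z|d)
    for _ in range(n_iter):
        # E-step, Eq. (3.13): posterior P(z|d,w), normalised over the topics
        p_z_dw = p_w_z[:, :, None] * p_z_d[None, :, :]            # (M, K, N)
        p_z_dw /= p_z_dw.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (3.14)
        weighted = n[:, None, :] * p_z_dw                         # n(d_i, w_j) P(z_k|d_i, w_j)
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= n.sum(axis=0, keepdims=True) + 1e-12             # divide by n(d_i)
        # Log-likelihood of Eq. (3.15), up to the constant P(d); monitored for convergence
        loglik = (n * np.log(p_w_z @ p_z_d + 1e-12)).sum()
    return p_w_z, p_z_d, loglik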

3.2.1 Discussion About pLSA

Even though this framework was specifically designed for information retrieval in a word-document setting, it can be applied to similar problems such as scene classification. By analogy, we have images as the document corpus, quantized SIFT feature descriptors corresponding to "visual words", and object categories such as


sky, car, sea etc. playing the role of topics. Using pLSA, an image is fitted to a model of a mixture of objects, while the SIFT descriptors over local patches are clustered (i.e. quantized) into visual words to produce the co-occurrence matrix of the BoW representation, the two-mode word-image count data given to pLSA as input. Because pLSA gives a reasonable probabilistic model for clustering multiple object categories per image, one can model an image by the mixing coefficients of a topic vector P(z | d) in the latent semantic space; classification algorithms can then be applied to these dimensionally reduced but semantically meaningful vectors with higher matching probabilities. Another point is that pLSA is known for its capability of handling polysemy. If a visual word w is observed in two images d_i and d_j, then the object category z associated with that word can differ in d_i and d_j: arg max_z P(z | d_i, w) can be different from arg max_z P(z | d_j, w). Hence, pLSA allows a visual word to have different meanings in different images (Liu, 2008). There is, however, an important drawback of pLSA: it is based on the BoW representation, which completely ignores the positions of the visual words. This means that, for example, if we randomly shuffle the patches in the image, pLSA would still infer the same hidden topic for each patch. Since the visual words found in the scene are used to construct a BoW histogram as input to pLSA, the content of the scene (what is in it) is preserved, while the spatial information (where the objects are) is lost. Although this may not seem an important problem in the text domain, it is undesirable in scene classification, since the spatial configuration (i.e. location and shape information) of patches can give a clue about object identities. We propose a simple method of including spatial information: rather than building one BoW histogram of words for an image and conducting pLSA, we split the image into multiple sub-regions at each level and conduct pLSA individually to extract P(z | d) vectors for each sub-region, using the fold-in heuristic described by Hofmann (Hofmann, 1998). Thereafter, we concatenate the P(z | d) vectors produced at the different levels to obtain a new histogram representation for the image, and finally run a classification algorithm such as Support Vector Machines (SVM) or K-Nearest Neighborhood (KNN) after applying a weighting scheme that depends on the resolution level over the elements of the concatenated vector. We discuss the proposed method with two different approaches in detail below.


3.3 Building Spatial Pyramid of Latent Topics

As mentioned before, the original pLSA model lacks spatial location information, because it uses the BoW representation for each image. In other words, it is not possible to answer where objects are in the scene with the plain pLSA implementation. Lazebnik et al. introduce a novel extension of the BoW representation model by adding spatial information, called the "Spatial Pyramid" (Lazebnik, 2006A). Briefly, the Spatial Pyramid Matching algorithm works by placing a sequence of increasingly finer grids over an image and calculating BoW histograms for each grid cell at each split level. It then concatenates all the BoW histograms, with finer resolutions weighted more highly than coarser ones, to represent the image in a spatial manner. This approach improves scene classification performance but lacks an intermediate topic representation. In this thesis, we propose two novel approaches for building spatial pyramids of latent topics. In the first method, the image representation is a vector of probabilistic topic distributions derived from sub-region specific pLSA models. In the second method, we represent an image by a topic counts vector derived from the maximum posterior probability of the topics in each sub-region individually. We call the first one "Spatial Pyramid of Latent Topics by Cascaded pLSA" and the latter "Spatial Pyramid of Latent Topics by Semantic Segmentation"; they are explained in detail below.

3.3.1 Spatial Pyramid of Latent Topics by Cascaded pLSA

The key idea in our method is to fit pLSA models to sub-regions individually at different resolution levels in order to obtain a new representation for each image. The categorization of an unobserved image is performed on this semantic representation using classification algorithms such as KNN and SVM. In the training stage, the first probabilistic distributions we must find in pLSA are the topic-conditional visual word distributions P(w | z), which will be used at the other resolution levels of the training images and throughout the testing stage.


To make this happen, a pLSA model is fit to the entire set of training images at resolution level (L) 0 (i.e. the base level), where the images are intact. One further point to mention is that model fitting via pLSA is an unsupervised procedure, because the only input is the co-occurrence matrix (i.e. the count data of visual words in each document) and no other information such as the category labels of the images. The products of the model fitting are P(w | z) and P(z | d_train_level0); the document-specific topic distribution vector P(z | d) will be used for image representation later.

At L=1, we split the training images into four sub-regions and generate a new co-occurrence matrix for each sub-region as an input to pLSA, using the same visual vocabulary created by k-means clustering. The P(z | d) coefficients are computed individually by means of the fold-in heuristic proposed by Hofmann for information retrieval (Hofmann, 2001). Specifically, each sub-region is projected onto the triple-axis space VIZ (Visual word-Image-Topic) through the P(w | z) learnt at L=0. This is achieved by updating only the topic distribution vector P(z | d_train_level1) in each M-step while keeping the learnt P(w | z) fixed during the EM iterations of pLSA, until the Kullback-Leibler divergence between the observed distribution and the calculated $P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)$ is minimized.

We follow the same procedure at the finer resolution levels, except that we divide the training images into a different number of sub-regions at each level (a 2^ℓ x 2^ℓ grid at level ℓ). For instance, we get 16 sub-regions at L=2, 64 at L=3, and so on. We use L=3 as the maximum resolution level in our experiments, because if the image is subdivided too finely, the number of visual words per sub-region drops dramatically, the co-occurrence table becomes sparser, and the count vectors of images begin to resemble each other, so the discriminative power diminishes. This is also noted by Lazebnik et al., and our experimental results confirm it (Lazebnik, 2006A). After calculating all the P(z | d) coefficients of each sub-region at each resolution level, we concatenate them with a level-dependent weight factor to form a new representation for each image. The weight at level ℓ is set to $1/2^{L-\ell}$, where L is the number of levels. This weight is inversely proportional to the sub-region width at that level, meaning that a finer resolution level is weighted more highly than a coarser one. The reason is that the coarser level already includes all the visual words found at the finer levels and is therefore less discriminative, so its P(z | d) coefficients are down-weighted to keep the representation balanced. When all the P(z | d) coefficients of the individual sub-regions are concatenated, one obtains a new feature vector of dimensionality $D = \frac{1}{3} Z (4^{L+1} - 1)$, where Z is the number of topics used in pLSA. For example, with 25 topics we obtain a 25-dimensional feature vector for the whole image at L=0, a 125-dimensional vector at L=1, and so on. Scene classification based on the Cascaded pLSA method is illustrated at Figure-3.5.

Figure-3.5. Illustration of the Cascaded pLSA based method. BoW = Bag-of-Words histogram, P(z|d) = image-specific probability distribution over the latent variable space, P(w|z) = probability of a specific word conditioned on a latent variable. Note that the operator shown in the figure indicates concatenation.
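The construction of the concatenated pyramid vector can be sketched as follows (an illustrative Python/NumPy outline, not the thesis's MATLAB code); the helper plsa_foldin is hypothetical and stands for the fold-in EM described above, i.e. re-estimating P(z|d) for one sub-region while the P(w|z) learnt at L=0 is kept fixed:

import numpy as np

def pyramid_of_topics(word_labels, positions, img_shape, p_w_z, plsa_foldin, L=3):
    # word_labels: visual-word index of each dense SIFT point in the image
    # positions:   (x, y) coordinates of those points
    # p_w_z:       P(w|z) learnt once on whole images at level 0
    # plsa_foldin: hypothetical helper running EM with P(w|z) kept fixed and
    #              returning the K-dimensional P(z|d) of one (sub-)region
    n_words = p_w_z.shape[0]
    h, w = img_shape
    pyramid = []
    for level in range(L + 1):
        cells = 2 ** level                        # cells x cells grid at this level
        weight = 1.0 / 2 ** (L - level)           # finer levels weighted more highly
        for cy in range(cells):
            for cx in range(cells):
                counts = np.zeros(n_words)
                for word, (x, y) in zip(word_labels, positions):
                    if int(x * cells / w) == cx and int(y * cells / h) == cy:
                        counts[word] += 1         # BoW histogram of this sub-region
                p_z_d = plsa_foldin(counts, p_w_z)    # K-dim topic mixture
                pyramid.append(weight * p_z_d)
    return np.concatenate(pyramid)                # length Z*(4**(L+1)-1)/3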

So far, we have obtained a cumulative feature vector of topic mixture coefficients for each image, derived from its sub-regions individually using pLSA. The last part of the training process is to train a multi-class classifier that takes the concatenated P(z | d_train_concatenated) vector and the class label of each training image. The key idea in this part is to capture the discrimination observed between the scene classes for later use at the testing stage. We use KNN and SVM as classifiers for comparison. Briefly, KNN gives an unobserved image the class label most represented within its K nearest neighbors, found by calculating its Euclidean distance to the training images. SVM, as a binary classifier, evaluates an image and assigns it to one of two classes by constructing a hyperplane as a separator, using the decision function $g(x) = \sum_i \alpha_i y_i K(x_i, x) - b$, where K(x_i, x) is the response of a kernel function, x_i and x are training and test samples respectively, y_i is the class label of x_i, α_i is the learnt weight and b is the learnt threshold parameter. In our experiments, we use Rakotomamonjy's SVM toolbox with a Radial Basis Function (RBF) kernel based on the Histogram Intersection (HI) distance (Rakotomamonjy, 2008). The SVM classification is done with 10-fold cross validation, using the one-versus-all rule: a classifier is learnt to separate each class from the others, and a test image is given the label of the classifier with the highest response.

The testing procedure is very similar to the training process. The only difference in pLSA model fitting is that the P(z | d_test) coefficients are calculated at all resolution levels while keeping fixed the P(w | z) already computed at L=0 on the training images. The concatenated P(z | d_test_concatenated) coefficients are then used to classify the test images with the discriminative classifier described for the training process.

One more thing to mention is the normalization issue. Normalization is generally applied to feature vectors or histograms to compensate for illumination variations and for the effect of variable image size when keypoints are detected sparsely. Since we do not deal with sparse SIFT features directly but with intermediate semantic features, the concatenated P(z | d) coefficients, we do not need any normalization. The P(z | d) coefficients are already normalized individually in pLSA modeling, as they represent probability distributions, and because we weight each P(z | d) vector according to its resolution level before concatenation, the relative scale of the concatenated components is already balanced, so again no extra normalization is applied.


3.3.2 Spatial Pyramid of Latent Topics by Semantic Segmentation

The coefficient vectors P(z | d) derived from the pLSA models fitted to the individual sub-regions indicate the distribution of topics with their probabilities. This idea opens a new gateway for representing an image by the count data of its topics, weighted in some way, rather than by the count data of visual words as in BoW modeling. As we recall from the E-step of the EM log-likelihood maximization process, posterior probabilities are calculated for the latent topic variables Z based on the current estimates of the parameters P(z | d) and P(w | z) as in:

$$P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)} \qquad (3.16)$$

This formulation leads us to another novel approach for the scene classification problem: segmenting an image coarsely according to its object contents without any extra cost. The semantic meaning we extract from the formula is that, given an image and a visual word in it, we can calculate a probability for each topic accordingly. Having obtained the topic probabilities for a specific visual word, we may then assign a topic label to that word by taking the maximum posterior of P(z | d, w) (i.e. arg max_z P(z | d, w)). As long as we have already obtained the P(z | d) and P(w | z) coefficients for the whole image as described in section 3.3.1, there is no extra work in assigning a topic label to each visual word of an image using arg max_z P(z | d, w). Training and testing follow the same procedure for this new representation. The assignment of a topic label to each visual word in an image is illustrated at Figure-3.6.


Figure-3.6. Illustration of the Semantic Segmentation based method. BoW = Bag-of-Words histogram, P(z|d) = image-specific probability distribution over the latent variable space, P(w|z) = probability of a specific word conditioned on a latent variable, arg max_z P(z | w, d) = maximum posterior probability of a specific topic given word and document. Note that the operator shown in the figure indicates concatenation.

In more detail, we find the maximum posterior values and the topic labels attaining them for the training/testing images at L=0 (i.e. the whole image), and then calculate a new P(z | d) coefficient vector as in:

$$P\big(\arg\max(P(z \mid d_i, w))_{TL} \mid d_i\big)_{new} = \sum_{w_\ell \in W} \frac{N(w_\ell, d_i)}{\sum_{j=1}^{M} N(w_j, d_i)}\; \arg\max(P(z \mid d_i, w_\ell))_{TDV} \qquad (3.17)$$

Where TL indicates the Topic Label and TDV refers to the Topic Distribution Value. For a given image, we first assign a topic label to each visual word in the image by arg max(P(z | d_i, w_ℓ))_TL. As we notice, several visual words may vote for the same topic with different maximum posterior probabilities arg max(P(z | d_i, w_ℓ))_TDV. So the final value for each topic z_k in the new P(z | d) vector is calculated as the sum of the maximum posterior probabilities of the visual words assigned to that topic z_k. These maximum posterior values are weighted by the ratio of the counts of those visual words to the total number of visual words in the image, for normalization. We repeat this procedure at the finer resolution levels (i.e. L=1, L=2 and L=3) in the same manner, except that we keep the maximum posterior values arg max(P(z | d_i, w_ℓ))_TDV computed at L=0 fixed for the finer resolutions; since we deal with the same image, we should not change the posteriors in the sub-regions, to avoid instability.
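A minimal sketch of this topic-count representation for one (sub-)region, following equations (3.16) and (3.17), is given below in Python/NumPy; the variable names are illustrative, and it assumes the word labels of the region and the whole-image P(w|z) and P(z|d) are already available:

import numpy as np

def topic_count_vector(word_labels, p_w_z, p_z_d):
    # word_labels: visual-word index of each dense SIFT point in one (sub-)region
    # p_w_z: (M, K) P(w|z) of the whole image, p_z_d: (K,) P(z|d) of the whole image
    K = p_z_d.shape[0]
    bow = np.bincount(word_labels, minlength=p_w_z.shape[0]).astype(float)
    total = bow.sum()
    new_p_z_d = np.zeros(K)
    for w, count in enumerate(bow):
        if count == 0:
            continue
        post = p_w_z[w] * p_z_d                   # numerator of Eq. (3.16)
        post /= post.sum() + 1e-12                # posterior P(z | d, w)
        label = post.argmax()                     # topic label (TL) of word w
        value = post.max()                        # topic distribution value (TDV)
        new_p_z_d[label] += (count / total) * value   # weighted accumulation, Eq. (3.17)
    return new_p_z_d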

3.4 Classification

At the final step, we use a discriminative classifier, KNN or SVM, to assign to an unobserved image from the test set a category label learnt from the training set. Brief details of these classifiers are given below.

3.4.1 K-Nearest Neighborhood (KNN)

The KNN algorithm is amongst the simplest of all machine learning algorithms. It is an instance-based learning algorithm in which the density function is only approximated locally. The key idea is to classify a query point in the d-dimensional feature space by a majority vote of its k neighbors, where k is a positive integer. Majority voting refers to taking the most common class label among the k points of the training set that are most similar to the query point. In detail, determining the class of a query point consists of calculating the Euclidean distance between the query point and all points in the training set (whose class labels are already known), selecting the K nearest points to the query point in the training set, and finally assigning the query point to the most common class among its K nearest neighbors. The algorithm needs no explicit training process to create models or to reduce the feature dimension, so all training points must be taken into account in the computation. Besides that, there will be a tie if more than one class occurs most often among the K neighbors of a query point. In our experiments, we break such ties by taking the class label of the nearest point among the tied classes. Additionally, we experiment with the parameter k over a range (i.e. 1-20), rather than a fixed integer.
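A compact sketch of this KNN rule, including the tie-breaking step described above, could look as follows (illustrative Python/NumPy, not the MATLAB implementation used in the experiments):

import numpy as np

def knn_classify(query, train_vectors, train_labels, k=5):
    # train_vectors: (n_train, d) feature vectors; train_labels: (n_train,) class indices
    dists = np.linalg.norm(train_vectors - query, axis=1)   # Euclidean distances
    order = np.argsort(dists)[:k]                           # indices of the k nearest points
    neighbour_labels = train_labels[order]
    labels, votes = np.unique(neighbour_labels, return_counts=True)
    best = labels[votes == votes.max()]                     # classes tied for the majority
    if len(best) == 1:
        return best[0]
    # tie-break: label of the single nearest neighbour belonging to a tied class
    for idx in order:
        if train_labels[idx] in best:
            return train_labels[idx]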

3.4.2 Support Vector Machines (SVM)

A Support Vector Machine (SVM) is a binary classifier, meaning that it can evaluate data points and assign them to one of two classes. In order to perform this classification, SVMs are trained with training data and their class labels, and then later assign a class to unobserved data. In the training process, the SVM takes a set of d-length vectors (i.e. points in a d-dimensional space) and their associated classes, while testing requires only a set of vectors. SVMs have been very popular in many classification problems, such as scene classification (Cristianini, 2001). The decision function for a test sample x has the form:

$$g(x) = \sum_i \alpha_i y_i K(x_i, x) - b \qquad (3.18)$$

Where K(x_i, x) is the response of the kernel function for training point x_i and test point x, y_i is the class label (i.e. 1 or -1) of x_i, α_i is the learnt weight of x_i, and b is the learnt threshold parameter. A general illustration of the SVM algorithm is displayed at Figure-3.7.

Figure-3.7. Illustration of SVM algorithm.


Training the SVM involves solving a quadratic programming (QP) problem that has as many variables as there are training points. For example, there are 13 training points belonging to 2 classes, square and cross, at Figure-3.7. Training an SVM over these points means finding the support vectors, the points that lie on the boundary between the classes +1 (i.e. squares) and -1 (i.e. crosses). In the figure, the actual decision boundary that separates the positive from the negative points is plotted in black; the contour lines plotted in blue and green are the lines at distance +1 and -1 from the decision boundary, which pass through the support vectors (points 3, 6 and 13 in this example). All points that fall inside the margin (between the +1 and -1 lines) are treated as misclassifications. The choice of a good kernel function K(x_i, x) is very important for the SVM learning algorithm. There are many general-purpose kernels in the literature, such as the linear, polynomial and Radial Basis Function (RBF) kernels. The RBF kernel is one of the most popular; it uses a Gaussian-like mapping of the data to a higher-dimensional space, where a separating hyperplane can be found. The generalized form of the RBF kernel is:

$$K_{d\_RBF}(x, y) = e^{-\rho\, d(x, y)} \qquad (3.19)$$

Where d(x, y) is the distance function between two points in the feature space and ρ is a constant set to the inverse of the average distance between data points (Bennett, 2000). In our scene classification problem, we have 13 scene categories, so instead of solving a single binary classification problem, a one-vs-all method is employed to train the SVMs for this multi-class problem. In this scheme, each SVM is trained to discriminate one class (+1) from all others (-1); in testing, the mixture-of-topics vectors P(z | d) are run through each SVM, and the class of the classifier with the strongest response is selected as the label of the unobserved test data.
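The decision rule of equations (3.18) and (3.19), combined with the one-vs-all selection, can be sketched as below (illustrative Python/NumPy). The structure of the per-class models and the particular histogram-intersection distance shown are assumptions of this sketch; in the experiments the trained α_i, y_i and b come from Rakotomamonjy's SVM toolbox:

import numpy as np

def hist_intersection_distance(x, y):
    # one plausible definition (assumption): 1 minus the histogram intersection of
    # L1-normalised vectors, so identical histograms give distance 0
    x = x / (x.sum() + 1e-12)
    y = y / (y.sum() + 1e-12)
    return 1.0 - np.minimum(x, y).sum()

def rbf_kernel(x, y, rho):
    # generalised RBF kernel of Eq. (3.19)
    return np.exp(-rho * hist_intersection_distance(x, y))

def one_vs_all_predict(x, models, rho):
    # models: per-class dicts with keys "alpha", "y", "support_vectors", "b"
    # (hypothetical structure, e.g. exported from an external SVM trainer)
    scores = []
    for m in models:
        g = sum(a * y * rbf_kernel(sv, x, rho)
                for a, y, sv in zip(m["alpha"], m["y"], m["support_vectors"])) - m["b"]
        scores.append(g)                          # decision value of Eq. (3.18)
    return int(np.argmax(scores))                 # class with the strongest response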


Chapter 4
Performance Evaluation

In this chapter we evaluate our classification methods on a dataset that contains 13 natural scene categories and compare our results with the most successful methods in the scene classification literature. We first describe the dataset in section 4.1; the implementation details and experimental setup are given in section 4.2; and in section 4.3 we present our test results and compare them with those of methods that have used the same dataset.

4.1 Dataset

The dataset that we have used in the scene classification experiments was introduced by Fei-Fei et al. in their semi-supervised application of Latent Dirichlet Allocation (LDA) (Fei-Fei, 2005A). It contains 13 scene categories with a total of 3859 grayscale images. The distribution of images per scene category is: 216 Bedroom, 241 Suburb, 210 Kitchen, 289 Livingroom, 360 Coast, 328 Forest, 260 Highway, 308 Inside city, 374 Mountain, 410 Open country, 292 Street, 356 Tall building and 215 Office. Some example images from this dataset are displayed at Figure-4.1. Most of the scenes display large intra-class variability, meaning that the object contents within a scene category are very different. Also note that the indoor scenes (i.e. kitchen, bedroom, office and livingroom) have very similar structure, indicating low inter-class variability. These issues make the scene classification problem hard when working with this dataset. The size of the images varies both within and between categories, with an average of 250 x 300 pixels.


Figure-4.1 Some example images from the scene classification dataset.



4.2 Experimental Setup

Our goal in this thesis is to assign each unobserved image in the test set to one of the 13 scene categories learnt from the training set. The system is based on a BoW/pLSA implementation that uses spatial information for the topics over an image. The classification performance is reported as a confusion matrix, displaying the counts of test images per category, and the overall accuracy rate is measured as the average of the diagonal values of the confusion matrix. Note that the system is implemented in MATLAB on an Intel Core 2 Duo 2.00 GHz machine with 3.00 GB of RAM.
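A small sketch of how the confusion matrix and the overall accuracy (the mean of its diagonal) can be computed is given below (illustrative Python/NumPy; the thesis implementation is in MATLAB):

import numpy as np

def confusion_and_accuracy(true_labels, predicted_labels, n_classes=13):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1                             # rows: true class, columns: predicted class
    per_class = cm.diagonal() / cm.sum(axis=1)    # within-class accuracy per category
    return cm, per_class.mean()                   # overall rate = mean of the diagonal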

4.2.1 Pre-processing

The first pre-processing step in our algorithm is to resize the images to a fixed size, as they come in varying sizes. We have used the MATLAB built-in function "imresize" to obtain a size of 256 x 256 pixels for each image in the dataset, so that we get the same number of features per image and a computational gain in the dense feature extraction process. In addition, we normalize the gray-scale images of the dataset to have intensities with zero mean and unit standard deviation. It is noted specifically in (Bosch, 2008) that classification performance on normalized images is nearly 1 percent better than on unnormalized ones due to noise elimination. We apply these pre-processing steps as part of the feature extraction step.
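An equivalent of this pre-processing step, which the thesis performs with MATLAB's imresize, might look as follows in Python (using Pillow and NumPy as stand-ins; the function and file handling are illustrative assumptions):

import numpy as np
from PIL import Image

def preprocess(path, size=(256, 256)):
    img = Image.open(path).convert("L")             # load as grayscale
    img = img.resize(size)                          # fixed 256 x 256 resolution
    arr = np.asarray(img, dtype=float)
    arr = (arr - arr.mean()) / (arr.std() + 1e-12)  # zero mean, unit standard deviation
    return arr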

4.2.2 Feature Extraction

Secondly, we want features spread evenly across the entire image, because in the scene classification problem it is necessary not only to find salient regions but also to capture texture regions in order to represent the whole image. To do this, we implement dense SIFT. Rather than finding keypoints at extrema in the DoG scale space, the dense SIFT detector generates keypoints at even intervals throughout the image. Bosch et al. experimented with grid spacings of 5, 10 and 15 pixels and found that 10-pixel spacing produced the best results (Bosch, 2008), while Lazebnik et al. tested 8-pixel spacing (Lazebnik, 2006A). Our implementation uses 8-pixel spacing in the Cartesian grid, since this is close to the optimal value and works well with datasets whose image sizes are powers of two (i.e. 256 x 256 pixels). For each feature, a dominant orientation is assigned and a SIFT descriptor is computed according to the standard SIFT implementation described in section 3.1.1. Instead of generating a scale space with DoG images, the SIFT descriptors are computed over circles of radius 8 pixels, i.e. 16-pixel-wide patches that overlap each other by half their size over the 8-pixel grid. The dense SIFT layout is illustrated at Figure 3-3. Dense gray-SIFT extraction takes about 15 hours for the whole dataset of 3859 images, each with 961 SIFT descriptors. Although it is time consuming, we do this once, and later steps with different parameters can be conducted without repeating it.
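The placement of the dense grid can be sketched as follows (illustrative Python/NumPy); the standard SIFT descriptor would then be computed on the 16 x 16 patch centred at each of these points, which is not shown here:

import numpy as np

def dense_grid_centres(img_size=256, step=8, patch=16):
    # centres of overlapping 16x16 patches placed every 8 pixels, with a
    # half-patch margin so that every patch lies inside the image
    margin = patch // 2
    coords = np.arange(margin, img_size - margin + 1, step)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)   # 31 x 31 = 961 centres for 256 x 256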

4.2.3 Input for pLSA: Co-occurrence Table

In order to obtain the co-occurrence table that is the input of the pLSA modeling algorithm, we first need to create a visual vocabulary and represent each image by a visual word frequency vector, the BoW. We run the k-means algorithm to cluster the SIFT descriptors that have previously been computed densely for each image. Since this algorithm terminates only when no point moves to a new cluster, it is very time and memory consuming for large amounts of high-dimensional (128-d) data. We therefore randomly select 40 images from each scene category, 520 images in total, and input the SIFT descriptors of these images, about 500,000 descriptors, into the k-means clustering algorithm at once. Finally, since the number of clusters, which determines the number of visual words, must be known beforehand, we also have to give the desired vocabulary size to the k-means algorithm. Lazebnik et al. experiment with 200 and 400 visual words (Lazebnik, 2006A) on the same dataset and with the same dense SIFT extraction method we use, and conclude that 400 clusters perform better in their Spatial Pyramid Matching algorithm. Because we mainly compare our results with theirs, we use k=400 in our k-means clustering. In total, it takes about 3 hours to converge the 500,000 SIFT descriptors into 400 visual word cluster centers.
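A sketch of this vocabulary-building step is shown below, using scikit-learn's KMeans as a stand-in for the thesis's own clustering code; the data structure holding the per-category descriptors is a hypothetical assumption of the sketch:

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(per_category_descriptors, images_per_category=40, n_words=400, seed=0):
    # per_category_descriptors: {category: list of (n_desc, 128) arrays, one per image}
    rng = np.random.default_rng(seed)
    sample = []
    for descriptor_lists in per_category_descriptors.values():
        chosen = rng.choice(len(descriptor_lists), size=images_per_category, replace=False)
        sample.extend(descriptor_lists[i] for i in chosen)
    data = np.vstack(sample)                          # roughly 500,000 x 128 descriptors
    kmeans = KMeans(n_clusters=n_words, random_state=seed).fit(data)
    return kmeans.cluster_centers_                    # the 400 visual words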


After the visual words are found using a subset of the dataset, the descriptors of every image in the dataset are compared to each of the 400 visual words using the Euclidean distance. The goal of this quantization is to assign the label of the closest visual word to every descriptor, so that from now on each image can be identified by its word distribution. At this point, we create a BoW representation for each image by counting the occurrences of the visual words and placing them into a matrix N = (n(w_i, d_j))_ij, where n(w_i, d_j) refers to the number of times the visual word w_i occurred in the image d_j. This two-mode matrix is used as the input to pLSA modeling. Note that since we know the locations of the SIFT descriptors over an image, we also know where the visual words lie in it, which is used in the pLSA models of each sub-region.

4.2.4 Spatial Pyramid of pLSA Modeling

So far, we have obtained a BoW representation for each image in the dataset and can carry out the rest of our pLSA-based algorithm. We split the whole dataset into two separate sets of images, training and testing. For the training set, we randomly select 100 images per scene category, 1300 images in total. The rest of the dataset is used as the testing set, with a varying number of images per category. We implement our two approaches, described in detail in sections 3.3.1 and 3.3.2, with this set-up. In the pLSA modeling we experiment with varying numbers of topics (i.e. 25, 50, 75 and 100), while in the classification process we use the discriminative K-nearest neighbors and Support Vector Machine classifiers with varying parameters; the results are reported below.
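The train/test split described above can be sketched as follows (illustrative Python/NumPy):

import numpy as np

def split_dataset(labels, n_train_per_class=100, seed=0):
    # labels: array of category indices, one per image in the dataset
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        train_idx.extend(rng.choice(members, size=n_train_per_class, replace=False))
    train_idx = np.array(sorted(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)   # remaining images
    return train_idx, test_idx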


4.3 Experimental Results

As mentioned before, we mainly use a vocabulary of 400 visual words for the BoW representation, pLSA modeling for the semantic representation, and KNN/SVM classifiers to achieve a discriminative classification in our experiments. The classification results of the Cascaded pLSA based method with SVM are displayed at Table-4.1, and with KNN at Table-4.2. The tables show the performance achieved using only the given level of the pyramid in the "Single Level" columns and using multiple levels in the "Pyramid Level" columns; Lazebnik et al. also use this format in their experiments (Lazebnik, 2006A). We notice that the performance increases as we go from single level (L) 0 to the finer resolution levels L=1, 2 and 3, but drops slightly (between 1 and 2 percent) as we go from L=2 to L=3. This is due to the impact of subdividing too finely at L=3: although spatial information emerges and improves the performance when we divide an image into sub-regions, at L=3 the sub-regions become so small that only a few topics exist in each of them. We also notice the improvement brought by the multi-level representations. In Table-4.1, the mean classification accuracy is 73.75 at L=0, while at the highest pyramid level (i.e. L=3) it increases to 80.29. Although it seems that the single finer resolution levels contribute most of the improvement, using all levels together makes the system more robust.

Level    | T = 25              | T = 50              | T = 75              | T = 100
         | Single    Pyramid   | Single    Pyramid   | Single    Pyramid   | Single    Pyramid
0 (1x1)  | 73.12     ---       | 72.73     ---       | 75.09     ---       | 74.06     ---
1 (2x2)  | 76.33     76.70     | 77.07     77.27     | 77.11     77.44     | 78.22     79.34
2 (4x4)  | 79.66     79.95     | 79.33     80.07     | 79.74     80.20     | 79.62     80.77
3 (8x8)  | 78.59     79.87     | 78.22     79.95     | 77.93     80.44     | 77.93     80.90

Table-4.1. Classification results of the Cascaded pLSA based method, using SVM. T indicates the number of topics used in pLSA; "Single" and "Pyramid" denote the single-level and pyramid (multi-level) performance.


As we see from Table-4.1 and Table-4.2, the performance drops between 1 and 5 percent when going from single level L=2 to L=3, but it stays almost the same, with at most a 1 percent decrease, when using multi-level representations. Specifically, in Table-4.1 we can see that the pyramid accuracy actually increases by 0.24 percent at T=75 and by 0.13 percent at T=100 from L=2 to L=3, even though the single-level accuracy generally decreases. So we can infer that, despite the decrease at the highest single level L=3, combining multiple resolution levels makes the system effectively robust to mismatches at single levels.

Level    | T = 25              | T = 50              | T = 75              | T = 100
         | Single    Pyramid   | Single    Pyramid   | Single    Pyramid   | Single    Pyramid
0 (1x1)  | 66.73     ---       | 67.22     ---       | 65.95     ---       | 67.02     ---
1 (2x2)  | 70.60     70.68     | 71.01     70.60     | 70.85     70.81     | 70.52     71.14
2 (4x4)  | 71.75     72.62     | 70.89     72.70     | 69.78     71.30     | 68.67     71.75
3 (8x8)  | 67.27     72.16     | 67.80     72.00     | 64.84     70.93     | 63.56     70.76

Table-4.2. Classification results of the Cascaded pLSA based method, using KNN. T indicates the number of topics used in pLSA.

When we compare KNN with SVM, the performance of SVM is much better. This is mainly because the KNN classifier simply compares the feature vector of an image with all vectors in the training set using the Euclidean distance, identifies the nearest K vectors, and returns the class that occurs most often among these K neighbors by majority voting. The accuracy of the KNN algorithm is severely degraded by the presence of noisy or irrelevant features; since we do not weight the visual words by their importance (taking only the frequencies into account), the more frequent visual words dominate the prediction in the majority voting scheme. SVM, on the other hand, analyzes each feature through a kernel function (i.e. RBF) to estimate a separating hyperplane among the class categories, learning a weight for each training image and a threshold for each class by optimizing a cost function. In particular, the one-vs-all scheme for SVMs has recently been very popular for multi-class classification problems due to its robustness and higher performance compared to other classifiers. Our experimental results justify this conclusion.


Level    | T = 25              | T = 50              | T = 75              | T = 100
         | Single    Pyramid   | Single    Pyramid   | Single    Pyramid   | Single    Pyramid
0 (1x1)  | 71.43     ---       | 68.71     ---       | 71.34     ---       | 72.46     ---
1 (2x2)  | 75.17     75.55     | 74.68     75.50     | 76.64     77.12     | 77.77     78.02
2 (4x4)  | 77.11     77.56     | 77.44     78.63     | 77.92     79.45     | 78.51     80.40
3 (8x8)  | 76.00     77.14     | 74.43     78.30     | 76.36     79.26     | 77.19     80.28

Table-4.3. Classification results of the Semantic Segmentation based method, using SVM. T indicates the number of topics used in pLSA.

Table-4.3 and Table-4.4 show the performance of the Semantic Segmentation based method using SVM and KNN, respectively. As we notice from the results, SVM consistently outperforms KNN by 6-13 percent. Also note that we achieve 80.90 percent with the Cascaded pLSA based method and 80.40 percent with the Semantic Segmentation based method using SVM, while with KNN we obtain 72.70 and 69.20 percent as the maximum performances, respectively.

Level    | T = 25              | T = 50              | T = 75              | T = 100
         | Single    Pyramid   | Single    Pyramid   | Single    Pyramid   | Single    Pyramid
0 (1x1)  | 64.84     ---       | 63.19     ---       | 63.45     ---       | 63.72     ---
1 (2x2)  | 67.64     67.76     | 66.69     66.48     | 67.38     68.56     | 68.71     67.72
2 (4x4)  | 67.55     69.12     | 66.11     68.13     | 68.14     69.06     | 67.10     69.20
3 (8x8)  | 63.23     67.97     | 61.01     67.84     | 61.36     67.52     | 59.32     68.67

Table-4.4. Classification results of the Semantic Segmentation based method, using KNN. T indicates the number of topics used in pLSA.

As for the effect of the number of topics used in pLSA modeling, increasing it from T=25 to T=100 generally yields only a small performance increase (at most 1 percent), with some fluctuations. We can conclude that the appropriate number of topics depends strongly on the number of categories in the dataset (13 here), while the number of visual words depends on the size of the feature vectors (128 here). So increasing the number of topics does not improve the performance dramatically, since the visual words in the system have an almost linear distribution over the topics, meaning that the discriminative power of the topic mixture vectors P(z | d) used in the classifiers stays almost the same and even fluctuates in some cases.

         | Spatial Pyramid                     | Cascaded pLSA                       | Semantic Segmentation
         | Single Level      Pyramid Level     | Single Level      Pyramid Level     | Single Level      Pyramid Level
Level    | KNN      SVM      KNN      SVM      | KNN      SVM      KNN      SVM      | KNN      SVM      KNN      SVM
0 (1x1)  | 66.40    72.66    ---      ---      | 67.02    74.06    ---      ---      | 63.72    72.46    ---      ---
1 (2x2)  | 68.79    75.83    70.23    75.87    | 70.52    78.22    71.14    79.34    | 68.71    77.77    67.72    78.02
2 (4x4)  | 66.07    76.86    70.52    77.76    | 68.67    79.62    71.75    80.77    | 67.10    78.51    69.20    80.40
3 (8x8)  | 57.38    72.66    68.34    76.08    | 63.56    77.93    70.76    80.90    | 59.32    77.19    68.67    80.28

Table-4.5. Comparative results of the Spatial Pyramid method with 400 visual words and of the Cascaded pLSA based and Semantic Segmentation based methods with T=100.

Table-4.5 shows the comparative results of the three methods. The Spatial Pyramid method was introduced by Lazebnik et al. and is accepted as a very promising algorithm, as it combines the BoW representation with spatial location information (Lazebnik, 2006A). They report a best performance of 81.4 percent (mean accuracy with 0.5 percent standard deviation) with 400 visual words at pyramid level L=2. For comparison with our results, their own Spatial Pyramid MATLAB code (Lazebnik, 2009) is used, except for the SVM toolbox they employed: because their SVM toolbox is not available on the website, we used our own SVM toolbox in the comparative experiments. Their algorithm has been applied exactly as described in their papers and compared with the results of our two approaches, the Cascaded pLSA based and Semantic Segmentation based methods. In all cases, the Cascaded pLSA based method outperforms both the Spatial Pyramid method and the Semantic Segmentation based method. The Spatial Pyramid is better than the Semantic Segmentation based method at single levels L=0 and L=1 with the KNN classifier, but the Semantic Segmentation based method performs better at the finer resolution levels L=2 and L=3. For the multi-resolution (pyramid) levels with the SVM classifier, the Semantic Segmentation based method consistently obtains better performance than the Spatial Pyramid, with an average difference of 3 percent.

We compare the performance of our methods to the semi-supervised LDA of Fei-Fei et al. (Fei-Fei, 2005A), the weakly supervised Spatial Pyramid of Lazebnik et al. (Lazebnik, 2006A), and the Spatial Pyramid pLSA of Bosch et al. (Bosch, 2008), using the same dataset and the same number of training and testing images. We use the SVM classifier, 400 visual words and 100 topics in this comparison. pLSA indicates computation only at L=0 (i.e. handling the whole image without dividing it into sub-regions), while BoW refers to the Spatial Pyramid of Lazebnik et al. at L=0. We have implemented their algorithms (except the SVM toolbox) as described in their papers, except for the LDA of Fei-Fei et al., for which we use the maximum accuracy reported in (Fei-Fei, 2005A). The comparison results are displayed at Table-4.6. We conclude that both the Cascaded pLSA based and the Semantic Segmentation based methods outperform the other methods.

Method                                               Accuracy
BoW                                                  72.66
Bayesian Hierarchical Model (LDA) (Fei-Fei, 2005A)   65.2
pLSA (L=0)                                           74.06
Spatial Pyramid (Lazebnik, 2006A)                    77.76
SP pLSA (Bosch, 2008)                                79.17
Semantic Segmentation                                80.40
Cascaded pLSA                                        80.90

Table-4.6. Compact comparison of our algorithms with other methods using the same experimental setup.

Table-4.7 shows the confusion matrix achieved by the Cascaded pLSA based method with T=100 using SVM. The overall performance is 80.23 percent. The best classified scenes are Suburb and Forest, with accuracies of 96.18 percent and 94.95 percent, respectively. The most difficult scenes are Open country, Bedroom, Kitchen and Living room. There is confusion between the Open country and Coast scenes, and also between the Open country and Mountain scenes. The indoor scenes Bedroom, Kitchen and Living room are confused with each other, as they have low inter-class variability. The confusion is mainly caused by similar structures of shape and appearance, besides potential ambiguities in the subjective manual annotations.


Table-4.7. Confusion matrix achieved by the Cascaded pLSA based method with T=100 and SVM (accuracy 80.90%). Diagonal entries indicate the within-class accuracy while the others indicate inter-class confusion. Sb = Suburb, Cs = Coast, Fr = Forest, Hw = Highway, Ic = Inside city, Mn = Mountain, Oc = Open country, St = Street, Tb = Tall building, Of = Office, Br = Bedroom, Kt = Kitchen, Lr = Livingroom.


Chapter 5
Conclusions and Future Works

In this thesis, we focus on the scene classification problem with two new methods built on BoW/pLSA modeling: Cascaded pLSA based and Semantic Segmentation based. Firstly, a new image representation scheme based on Cascaded pLSA is proposed. After dense SIFT feature extraction is performed on a set of images, the SIFT descriptors are clustered into visual words to obtain a BoW representation for each image as the input to pLSA modeling. We associate location information with the conventional BoW/pLSA algorithm, where spatial information is otherwise lost. This is achieved by subdividing each image into sub-regions at different grid levels and fitting a pLSA model to each sub-region individually. Hence each sub-region produces its own mixture of topics (i.e. P(z | d)) while staying coherent with the whole image it belongs to, since the word-topic distributions (i.e. P(w | z)) of the whole image are kept fixed. Eventually, we concatenate these sub-region specific topic distributions with a weighting scheme to obtain a new semantic image representation. One of the important contributions of pLSA modeling is that it reduces the dimension of the representative feature vector from a large number of visual words to a smaller number of semantic topics, while improving classification performance considerably. We benefit from this reduction when we represent an image as a concatenated mixture of topics rather than as concatenated BoW histograms as described in (Lazebnik, 2006A). For example, we achieve a maximum accuracy of 77.76 percent using the Spatial Pyramid at L=2 (an 8400-dimensional feature vector with 400 visual words in BoW modeling), 77.56 percent using Semantic Segmentation at L=2 (a 525-dimensional feature vector with 25 topics), and 79.95 percent using Cascaded pLSA at L=2 (with the same dimensionality as Semantic Segmentation).

Secondly, we approach the classification problem in a new way by segmenting an image semantically based on its topic distributions, and then representing the image with topic counts rather than the visual word counts used in BoW modeling.


This method is called Spatial Pyramid of Latent Topics by Semantic Segmentation, a transition between the BoW representation and a Bag-of-Topics (BoT) representation of an image. In pLSA modeling, a visual word has a probability (i.e. P(w_i | z)) for each topic. This means that each visual word in a given image has a number of probabilities with respect to the topic distributions over the image, so we can assign each visual word the topic label with the maximum posterior probability (i.e. arg max_z P(z | w_i, d_j)) conditioned on that word in the given image. By doing so, we represent each image as a weighted vector of topic frequencies instead of visual word frequencies. This also leads to a rough segmentation of the image based on the topic distributions, because regions of visual words assigned to a common topic tend to form meaningful connected components. As in the first method, Spatial Pyramid of Latent Topics by Cascaded pLSA, we again add spatial information to the image representation by subdividing the image into finer resolutions at each level. Ultimately, we describe each image by the concatenated Bag-of-Topics histograms extracted from its sub-regions. Also note that this requires no extra work over the aforementioned method, since both use the same quantities, differing only in the semantic image representation. In both cases, we learn the topics and their distributions on the training images in a completely unsupervised fashion, unlike (Vogel, 2007; Fei-Fei, 2005A). This is due to the nature of pLSA modeling, where the BoW histograms are the only input to the system. We then test the system with supervised classifiers (i.e. KNN and SVM), where the category labels of the training images are known. The performances of our two methods are compared with the most successful methods in the literature (i.e. the Bayesian Hierarchical model (LDA), BoW with/without spatial information, and pLSA without spatial information) using the same dataset and the same number of training/testing images. As seen at Table 4.6, our methods outperform the others by roughly 3 to 15 percent. We conclude that semantic image representation with spatial information is very promising and will be a dominant issue in scene classification problems. However, aside from its higher accuracy and simplicity, there are some drawbacks in BoW/pLSA modeling with spatial information. First of all, there is no explicit weighting scheme for visual words when creating the co-occurrence table of the general BoW representation model.


Instead, we only take into account the nearest Euclidean distance when assigning each SIFT feature to a visual word, so the word-frequency table simply counts the nearest visual word of each SIFT descriptor. We thus neglect the possibility that a SIFT descriptor might relate to more than one visual word, just as a visual word surely relates to more than one topic in the latent topic space of pLSA. Another issue is how to partition an image into sub-regions when constructing a pLSA model for each sub-region. The common method in the literature, and what we have implemented in our experiments, is to subdivide the image into a regular grid of non-overlapping sections, each producing its own mixture of topics in pLSA modeling. This approach weights each sub-region equally within a single resolution level, although the sub-regions might carry different degrees of importance for classifying that image semantically. Additionally, we do not establish any relationship among the visual words in the BoW/pLSA modeling: we assign each SIFT descriptor to the nearest visual word in BoW modeling and establish probabilistic relations (i.e. P(w | z)) between visual words and semantic topics in pLSA modeling. As a consequence, irrelevant (i.e. noisy) visual words in an image might mislead the scene classification.

5.1 Future Works

Future work on the methods proposed in this thesis can be summarized under the following headlines:

• Weighting Scheme for Visual Words: As noted before, visual words are the outputs of a clustering algorithm such as k-means; in other words, they carry statistical information only. Generally, all word weighting schemes perform a nearest-neighbor search in the visual vocabulary, where each feature vector is mapped to the most similar visual word. This may not be a good choice, because two similar features might be assigned to different visual words as the vocabulary size increases. Besides, simply counting frequencies may lead to misclassification, because two features assigned to the same visual word may not be equally similar to it, as their distances to the associated word differ in the chosen metric. Further work would be to design a new weighting scheme for visual words, under the assumption that this would have an important effect on BoW modeling.

• Partitioning in Spatial pLSA Modeling: When incorporating pLSA with spatial information in our experiments, we split each image into a regular grid of sub-regions. Further work would be to use a different partitioning scheme that allows overlapping sub-regions (i.e. windows) rather than the non-overlapping partitions we experimented with. We expect that the intersecting parts of overlapping sub-regions might improve the performance, since they carry additional spatial information. Besides weighting the grid levels in the concatenation stage, we also consider weighting each sub-region individually with an importance function within a single split level.

• Hierarchical Clustering of Visual Words: We quantize the SIFT descriptors extracted from the dataset into visual words to generate the input of pLSA, the co-occurrence table. Since clustering algorithms use only statistical information to quantize the feature vectors, we underestimate the semantic relations between visual words while trying to represent images semantically with these words. Further work would be to re-organize the root visual words (i.e. the cluster centers found in the first round of clustering) hierarchically. By doing so, we expect to obtain a more robust co-occurrence table that is better suited to pLSA modeling in a semantic manner.


Bibliography

Ames M. and Naaman M. "Why We Tag: Motivations for Annotation in Mobile and Online Media," Proc. SIGCHI Conf. on Human Factors in Computing Systems, pp. 971-980, 2007.

Bennett K. and Campbell C. "Support Vector Machines: Hype or Hallelujah?," SIGKDD Explorations, vol. 2, 2000.

Blei D., Ng A. and Jordan M. "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.

Bosch A., Zisserman A. and Munoz X. “Scene Classification via pLSA,” Proc. European Conf. Computer Vision, vol. 4, pp. 517-530, 2006.

Bosch A., Munoz X. and Marti R. “Which is the Best Way to Organize/Classify Images by Content?,” Image and Vision Computing, vol. 25, issue 6, pp. 778-791, 2007A.

Bosch A. “Image Classification for Large Number of Object Categories,” PhD thesis, Department of Electronics, Informatics and Automation, Univ. of Girona, 2007B.

Bosch A., Zisserman A. and Munoz X. "Scene Classification Using a Hybrid Generative/Discriminative Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, 2008.

Boutell M., Choudhury A., Luo J. and Brown M.C. “Using Semantic Features for Scene Classification: How Good do They Need to Be?,” IEEE Int’l Conf. Multimedia and Expo, pp. 785-788, 2006.


Chien J.T and Liao C.P. “Maximum Confidence Hidden Markov Modeling for Face Recognition,” IEEE Pattern Analysis and Machine Intelligence, vol. 30, issue 4, pp. 606-616, 2008.

Cristianini N. “Support Vector and Kernel Machines,” BIOwulf Technologies, ICML, http://www.support-vector.net/tutorial.html , 2001.

Csurka G., Dance C.R., Fan L., Willamowski J. and Bray C. "Visual Categorization with Bags of Keypoints," ECCV Int'l Workshop on Statistical Learning in Computer Vision, 2004.

Dempster A.P., Laird N.M. and Rubin D.B. “Maximum Likelihood from Incomplete data via the EM Algorithm,” J. Royal Statist. Soc. B., vol. 39, pp. 1-38, 1977.

Deng J., Dong W., Socher R., Li L., Li K. and Fei-Fei L. “ImageNet: A Large-scale Hierarchical Image Database,” CVPR, http://www.image-net.org, 2009.

Dorado A., Djordjevic D. and Izquierdo E. "Supervised Semantic Scene Classification Based on Low-Level Clustering and Relevance Feedback," Proc. European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, EWIMPT, pp. 181-188, 2004.

Fei-Fei L. and Perona P. “A Bayesian Hierarchical Model for Learning Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 524-531, 2005A.

Fei-Fei L. "Visual Recognition: Computational Models and Human Psychophysics," PhD thesis, Department of Computer Vision, California Institute of Technology, 2005B.

Fei-Fei L. and Li L.J. “What, Where and Who? Classifying Events by Scene and Object Recognition,” ICCV, pp. 1-8, 2007A.


Fei-Fei L., Fergus R. and Torralba A. "Recognizing and Learning Object Categories," CVPR Short Course, http://www.people.csail.mit.edu/torralba/shortCourseRLOC/index.html/, 2007B.

Fergus R. , Perona P. and Zisserman A. “Object Class Recognition by Unsupervised Scale-invariant Learning,” CVPR, pp. 264-271, 2003.

Freeman W. and Adelson E. “The Design and Use of Steerable Filters,” IEEE Pattern Analysis and Machine Intelligence, vol. 13, pp. 891-906, 1991.

Gorkani M. and Picard R.W. “Texture Orientation for Sorting Photos at a Glance,” Int’l Conf. Pattern Recognition, vol. 1, pp. 459-464, 1994.

Grauman K. and Darrell T. “Pyramid Match Kernels : Discriminative Classification with Sets of Image Features, ” Proc. ICCV, 2005.

Grauman, K. and Darrell, T. “The Pyramid Match Kernel: Efficient Learning with Sets of Features” J. Machine Learning Research, vol. 8, pp. 725–760, 2007.

Hofmann T. “Probabilistic Latent Semantic Indexing,” Proc. SIGIR Conf. Research and Development in Information Retrieval, 1998.

Hofmann T. “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, vol. 41, no. 2, pp. 177-196, 2001.

Lee H., Battle A., Raina R. and Ng A.Y. "Efficient Sparse Coding Algorithms," NIPS, 2006.

Jiang J., Ngo C.W. and Yang J. “Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval,” ACM Int’l Conf. Image and Video Retrieval (CIVR), 2007.

Kadir T. and Brady M. “Scale, Saliency and Image Description,” Int’l J. Computer Vision, vol. 45(2), pp. 83-105, 2001.

Kardi T. “K-means Clustering Tutorial,” Pattern Recognition and Neural Network Algorithms, http://people.revoledu.com/kardi/tutorial/KMean, 2009.

Koenderink J. and Doorn V.A. “Representation of Local Geometry in the Visual System,” Biological Cybernetics, vol. 55, pp. 367-375, 1987.

Lazebnik S., Schmid C. and Ponce J. “A Sparse Texture Representation Using Local Affine Regions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27(8), pp. 1265–1278, 2005.

Lazebnik S., Schmid C. and Ponce J. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178, 2006A.

Lazebnik S. “Local, Semi-local and Global Models for Texture, Object and Scene Recognition,” PhD thesis, Department of Computer Science, University of Illinois, 2006B.

Lazebnik S. “Spatial Pyramid MATLAB code,” http://www.cs.unc.edu/~lazebnik, July 2009.

Leung T. and Malik J. “Representing and Recognizing the Visual Appearance of Materials Using Three-Dimensional Textons,” Int’l J. Computer Vision, vol. 43, pp. 29-44, 2001.

Li L.J., Socher R. and Fei-Fei L. “Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework,” CVPR, 2009.

Li J. and Wang J.Z. “Real-time Computerized Annotation of Pictures,” Proc. ACM Multimedia Conference, pp. 911–920, 2006.

Lindeberg T. “Scale-space Theory: A Basic Tool for Analyzing Structures at Different Scales,” J. Applied Statistics, vol. 21(2), pp. 224-270, 1994.

Lindeberg T. “Feature Detection with Automatic Scale Selection,” Int’l J. Computer Vision, vol. 30(2), pp. 79–116, 1998.

Liu D. “Discovering Objects in Images and Videos,” PhD thesis, Department of Electrical and Computer Engineering, Carnegie Mellon Univ., 2008.

Lowe D. “Object Recognition from Local Scale-invariant Features,” ICCV, pp. 1150-1157, 1999.

Lowe D. “Distinctive Image Features from Scale Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

Luo J., Singhal A. and Zhu W. “Natural Object Detection in Outdoor Scenes Based on Probabilistic Spatial Context Models,” Proc. IEEE Int’l Conf. on Multimedia and Expo, 2003.

Manjunath B.S. and Ma W.Y. “Texture Features for Browsing and Retrieval of Image Data,” IEEE Pattern Analysis and Machine Intelligence, vol. 18, pp. 837-842, 1996.

Mao J. and Jain A.K. “Texture Classification and Segmentation Using Multiresolution Simultaneous Autoregressive Models,” Pattern Recognition, vol. 25(2), pp. 173-188, 1992.

Matas J., Chum O., Urban M. and Pajdla T. “Robust Wide Baseline Stereo from Maximally Stable Extremal Regions,” 13th BMVC, pp. 384-393, 2002.

Mikolajczyk K. and Schmid C. “An Affine Invariant Interest Point Detector,” ECCV, vol. 2350, pp. 128-142, 2002.

Mikolajczyk K. and Schmid C. “Scale and Affine Invariant Interest Point Detectors,” Int’l J. of Computer Vision, vol. 60(1), pp. 63–86, 2004.

Mikolajczyk K. and Schmid C. “A Performance Evaluation of Local Descriptors,” IEEE Pattern Analysis and Machine Intelligence, vol. 27(10), pp. 1615-1630, 2005A.

Mikolajczyk K. “A Comparison of Affine Region Detectors,” Int’l J. Computer Vision, vol. 65, pp. 43-72, 2005B.

Oliva A. and Torralba A. “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope,” Int’l J. Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.

Oliva A. and Torralba A. “Building the Gist of a Scene: The Role of Global Image Features in Recognition,” Progress in Brain Research, vol. 155, pp. 23-36, 2006.

Paek S. and Chang S.F. “A Knowledge Engineering Approach for Image Classification Based on Probabilistic Reasoning Systems,” Int’l Conf. Multimedia and Expo, vol. 2, pp. 1133–1136, 2000.

Perronnin F., Dance C., Csurka G. and Bressan M. “Adapted Vocabularies for Generic Visual Categorization,” European Conference on Computer Vision, vol. 4, pp. 464-475, 2006.

Quelhas P., Monay F., Odobez J.M., Perez D., Tuytelaars T. and Van Gool L. “Modeling Scenes with Local Descriptors and Latent Aspects,” Int’l Conference on Computer Vision, pp. 883-890, 2005.

Quelhas P., Monay F., Odobez J.M., Perez D. and Tuytelaars T. “A Thousand Words in a Scene,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, 2007.

Rakotomamonjy A. “SVM and Kernel Methods Matlab Toolbox,” http://asi.insarouen.fr/enseignants/~arakotom/toolbox/index.html, 2008.

Rubel D. “Scene Classification Using pLSA and Spatial Information,” MS Project, Department of Computer Science, Rochester Institute of Technology, 2008.

Serrano N., Savakis A. and Luo J. “Improved Scene Classification Using Efficient Low-level Features and Semantic Cues,” Pattern Recognition, vol. 37, pp. 1773–1784, 2004.

Shen J., Shepherd J. and Ngu A.A. “Semantic-sensitive Classification for Large Image Libraries,” Int’l Conf. Multimedia Modeling, pp. 340-345, 2005.

Shi J. and Tomasi C. “Good Features to Track,” Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.

Sivic J. and Zisserman A. “Video Google: A Text Retrieval Approach to Object Matching in Videos,” Int’l Conference on Computer Vision, vol. 2, pp. 1470-1477, 2003.

Sivic J., Russell B.C., Efros A., Zisserman A. and Freeman W. “Discovering Objects and Their Location in Images,” IEEE ICCV, vol. 1, pp. 370-377, 2005.

Sivic J., Russell B.C., Zisserman A., Freeman W.T. and Efros A. “Unsupervised Discovery of Visual Object Class Hierarchies,” CVPR, pp. 1-8, 2008.

Swain M. and Ballard D. “Color Indexing,” Int’l J. Computer Vision, vol. 7, no. 1, pp. 11-32, 1991.

Szummer M. and Picard R. “Indoor-outdoor Image Classification,” Int’l Workshop on Content-based Access of Image and Video Databases, 1998.

Torralba A., Murphy K.P., Freeman W.T. and Rubin M.A. “Context-based Vision System for Place and Object Recognition,” Proc. ICCV, 2003.

Torralba A. “Classifier-based Methods in Recognition and Learning Object Categories,” CVPR Short Course on Recognition and Learning Object Categories, http://people.csail.mit.edu/torralba/shortCourseRLOC, 2007.

Tuytelaars T. and Mikolajczyk K. “Local Invariant Feature Detectors: A Survey,” Int’l J. Computer Graphics and Vision, vol. 3, no. 3, pp. 177-280, 2007.

Vailaya A., Figueiredo M., Jain A. and Zhang H. “Content-based Hierarchical Classification of Vacation Images,” IEEE Int’l Conf. Multimedia Computing and Systems, vol. 1, pp. 518–523, 1999.

Varma M. and Zisserman A. “Classifying Images of Materials: Achieving Viewpoint and Illumination Independence,” Proc. 7th European Conference on Computer Vision, 2002.

Varma M. and Zisserman A. “Texture Classification: Are Filter Banks Necessary?,” CVPR, vol. 2, pp. 691-698, 2003.

Varma M. and Zisserman A. “A Statistical Approach to Texture Classification from Single Images,” Int’l J. Computer Vision, 2005.

Vogel J. and Schiele B. “A Semantic Typicality Measure for Natural Scene Categorization,” DAGM Annual Pattern Recognition Symposium, 2004A.

Vogel J. and Schiele B. “Natural Scene Retrieval Based on a Semantic Modeling Step,” Int’l Conf. Image and Video Retrieval, vol. 3155, pp. 207-215, 2004B.

Vogel J. and Schiele B. “Semantic Modeling of Natural Scenes for Content-Based Image Retrieval,” Int’l J. Computer Vision, vol. 72, no. 2, pp. 133-157, 2007.

Wang G., Zhang Y. and Fei-Fei L. “Using Dependent Regions for Object Categorization in a Generative Framework,” IEEE CVPR, vol. 2, pp. 1597-1604, 2006.

Wang Q. and Gao Z. “Study on a Realtime Image Object Tracking System,” Int’l Symposium on Computer Science and Computational Technology, vol. 2, pp. 788-791, 2008.

Weijer J. and Schmid C. “Coloring Local Feature Extraction,” Proc. European Conf. Computer Vision, vol. 2, pp. 332-348, 2006.

Wu X., Ngo C.W. and Hauptmann A.G. “Real-time Near-duplicate Elimination for Web Video Search with Content and Context,” IEEE Trans. on Image Processing, 2009.

Yang J., Yu K., Gong Y. and Huang T. “Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification,” CVPR, 2009.

Zhang Z. and He L.W. “Whiteboard Scanning and Image Enhancement,” Digital Signal Processing, vol. 17(2), pp. 414–432, 2007.
