
Prediction Reweighting for Domain Adaptation

Shuang Li, Shiji Song, and Gao Huang

Abstract— There are plenty of classification methods that perform well when the training and testing data are drawn from the same distribution. However, in real applications this condition may be violated, which causes a degradation of classification accuracy. Domain adaptation is an effective approach to address this problem. In this paper, we propose a general domain adaptation framework from the perspective of prediction reweighting, from which a novel approach is derived. Different from most domain adaptation methods, our idea is to reweight the predictions of the training classifier on the testing data according to their signed distance to the domain separator, a classifier that distinguishes training data (from the source domain) from testing data (from the target domain). We then propagate the labels of target instances with larger weights to those with smaller weights by introducing a manifold regularization method. It can be proved that our reweighting scheme effectively brings the source and target domains closer to each other in an appropriate sense, such that classification in the target domain becomes easier. The proposed method can be implemented efficiently by a simple two-stage algorithm, and the target classifier has a closed-form solution. The effectiveness of our approach is verified by experiments on artificial datasets and two standard benchmarks, a visual object recognition task and a cross-domain sentiment analysis of text. Experimental results demonstrate that our method is competitive with state-of-the-art domain adaptation algorithms.

Index Terms— Domain adaptation, domain separator, manifold regularization, prediction reweighting.

Manuscript received October 24, 2014; revised December 14, 2015; accepted February 28, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 41427806 and Grant 61273233 and in part by the Research Fund for the Doctoral Program of Higher Education under Grant 20130002130010. The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2016.2538282

I. INTRODUCTION

Many machine learning algorithms are based on the assumption that the training and testing data lie in the same feature space and are drawn from the same distribution. However, in many real-world applications, we are confronted with situations in which the distributions of the training and testing data do not match. For example, in the sentiment analysis of product reviews, it is expensive and time-consuming to annotate reviews for various products [1]. Therefore, we may need to learn a classifier on one type of product review with its corresponding labels, such as reviews of books, and make predictions on another type of product review, e.g., user reviews of Digital Versatile Discs (DVDs) [2]. However, the distributions of different types of reviews may not be identical, because the terms used in the reviews of different types of products are distinct. As another example, in the field of computer vision, images of the same object in different data sources can vary significantly due to various factors. In these cases, a classifier learned from the training data may generalize poorly if directly applied to the testing data. Domain adaptation is an approach that tackles this problem by cleverly adapting a classifier trained in the source domain, which has plentiful labeled data, for use in the target domain, which is drawn from a different but related distribution [2]–[8].

Existing methods of domain adaptation can be divided into three major categories [3]. The first type is the instance-transfer approach, which is motivated by importance sampling. It reweights the source instances by their significance and trains a particular classifier for the unlabeled target instances. As a representative example, the kernel-mean matching (KMM) method proposed in [5] reweights the importance of source data in a reproducing kernel Hilbert space (RKHS). The second category is the feature-representation-transfer approach. It focuses on reducing the difference between the source and target domain distributions and on discovering new feature representations. A popular feature representation algorithm is structural correspondence learning (SCL) [9], which relies on pivot feature selection; thus, the design of pivot features is crucial. Blitzer et al. [2] use mutual information to select pivot features rather than a heuristic technique. Pan et al. [10] propose an efficient algorithm known as transfer component analysis (TCA) to find representations of both domains from a different perspective. It utilizes transfer components between the source and target domains to reduce the distance across the domains in an RKHS. The last category is the parameter-transfer approach. It assumes that the source and target domains share some parameters or prior distributions [3].

It is noteworthy that domain adaptation can be seen as a subfield of transfer learning [3], and it is closely related to multitask learning, which tries to learn the source and target tasks simultaneously [11], [12]. However, domain adaptation only aims at achieving high performance in the target domain.

Essentially, there are two key issues involved in domain adaptation: 1) to make full use of the information from the source domain and 2) to fully explore the unlabeled testing data in the target domain. The first issue results from the basic assumption that the source and target domains are related, which implies that the source data should provide useful information for constructing the target classifier. Otherwise, the source data are redundant, and domain adaptation is useless. The second issue corresponds to another basic assumption that the source and target domains are different. Due to the distributional difference between the two domains, the classifier learned from the source data cannot be directly applied to the target domain, but should be adapted properly by leveraging the unlabeled data in the target domain.


Based on the above analysis, one can imagine that an ideal domain adaptation method should simultaneously explore the labeled source training data and the unlabeled target testing data, and give a proper tradeoff between them. This paper addresses the aforementioned two issues explicitly.

With respect to the first issue, we first construct a classifier using the labeled source data and make predictions on the target domain. We believe that there exist some target instances that are similar to those from the source domain, and they should be predicted correctly with high confidence. Others may be predicted accurately only with low confidence, since they are less likely to be drawn from the source data distribution. Therefore, we propose to assign each predicted label a weight to indicate this confidence. Our reweighting strategy is based on the following intuition: if two domains are related, they should overlap to some extent, i.e., there are some regions in the feature space that contain samples from both domains. Hence, if we train a binary classifier, referred to as the domain separator [13], to distinguish the instances of the two domains, some instances may not be correctly classified, or may lie close to the decision boundary even though correctly classified. Naturally, the signed distance from a target instance to the domain separator measures how close it is to the source domain, and how confident we are if it is predicted by a source classifier. Therefore, our algorithm reweights target predictions based on these signed distances. In comparison with instance-transfer approaches that weight individual source instances, our method weights individual predictions in the target domain in a novel and intuitive way. Moreover, we show that this reweighting scheme provably brings the target domain closer to the source domain in a proper sense. Through this method, we can effectively incorporate the information from the source domain into the construction of the target classifier.

With respect to the second issue, to make our prediction reweighting approach compatible with the target data distribution, we introduce a manifold regularization framework to explore the structure of the target data, which enables us to smoothly propagate labels from the most confident target instances (those with large weights) to less confident target instances (those with small weights).

Overall, to address domain adaptation problems, we consider the above two parts simultaneously, such that the proposed method can be solved efficiently by a two-stage algorithm. To be specific, in the first stage, we train a source domain classifier to classify the target instances and reweight the predictions. In the second stage, we learn the final target classifier by minimizing a quadratic objective function, which consists of a penalization term that enforces the predictions of the target classifier to match those of the source classifier, and a manifold regularization term that enforces smoothness on the target data.

The major contributions of the proposed method for domain adaptation problems are summarized as follows.

1) This paper contributes a novel and general framework for domain adaptation, which explicitly deals with the two essential issues in domain adaptation problems (i.e., how to fully incorporate the information from the source domain and the target domain to construct the final target classifier).


Under this framework, we propose a novel algorithm, which not only has theoretical guarantees, but also gives impressive performance on two real-world domain adaptation datasets. Moreover, the proposed framework can be combined with any standard supervised classification method or domain adaptation method, which can potentially boost the classification performance.

2) In our approach, we focus on reweighting the predictions on the target data. This is rather different from most existing domain adaptation algorithms. To the best of our knowledge, the majority of existing instance-transfer approaches train the final classifier using weighted source data. We are not aware of other algorithms that use weighted predictions as our approach does.

3) We reweight each target instance based on its signed distance to the domain separator, which is consistent with our intuition. Under this reweighting strategy, the source and target domains provably become closer in a proper sense, no matter what the distributions of the two domains are (see Section IV). Obviously, this benefits the classification of the target data.

In the rest of this paper, we first review some related works in Section II. Section III describes the motivation and formulation of our prediction reweighting approach. The details of our reweighting scheme based on the domain separator and the theoretical justification are presented in Section IV. In Section V, we give experimental results on both artificial and benchmark datasets; some extensions are also discussed in this part. Finally, Section VI draws conclusions and outlines directions for future work.

II. RELATED WORKS

Among existing domain adaptation approaches, the instance-transfer approach, which reweights the source instances by their importance, is most related to the proposed prediction reweighting method. As a representative instance-transfer approach, KMM [5] directly learns instance importance in the source domain by matching the means of both domains in some RKHS based on the maximum mean discrepancy (MMD) theory [14]. Xia et al. [15] propose to reweight the training instances using the in-target-domain probability, obtained via positive and unlabeled learning [16]. In general, all these approaches focus on reweighting instances from the source domain. In comparison, our approach assigns different weights (confidences) to the predictions of the target data given by the source classifier. Furthermore, we introduce an intuitive and efficient reweighting scheme to learn these weights.

Recently, the landmark-based approach proposed in [17] aims to identify and select landmark samples to construct easier auxiliary domain adaptation tasks. Similarly, Baktashmotlagh et al. [18] propose a statistically invariant sample selection method that chooses landmarks using the Hellinger distance instead of MMD. However, in their approaches, the landmarks are selected from the source domain, and are not necessarily samples close to the target domain, while our approach identifies the target instances that are more likely to be correctly predicted by the source classifier.


Moreover, our approach could be more efficient, since we directly use the weighted predictions to obtain a closed-form solution for the target classifier, instead of creating auxiliary tasks and constructing domain-invariant features. In [19], a two-step approach for domain adaptation is proposed: the first step constructs a reliable target dataset, and the second step develops an Expectation-Maximization algorithm to refine the model trained on the reliable target data. This is different from our approach, since we use a manifold regularization framework to stay compatible with the target domain using all the target data.

The domain separator, which is used to assign weights to target predictions in our approach, was recently introduced for domain adaptation [13]. In the active learning domain adaptation method of [13], a domain separator is employed to rule out, up front, acquiring labels of those target instances that are close to the source domain, and then to query the most informative points in the target domain, which are far from the source domain. Different from these active learning methods, we consider unsupervised domain adaptation problems instead. Moreover, in our approach, the signed distances between the target instances and the domain separator are used to weight their predicted labels given by the source classifier. We also show that such a reweighting scheme provably brings the two domains closer in a proper sense.

To effectively exploit useful information from the target data, our prediction reweighting approach adopts manifold regularization to propagate labels from the target instances with higher confidences to those with lower confidences. Manifold regularization has been widely used for semisupervised and unsupervised learning in the past decade, and a comprehensive study is given in [20]. The recently proposed adaptation regularization-based transfer learning (ARTL) [21] also uses manifold regularization for domain adaptation. However, different from our approach, ARTL is not an instance reweighting-based method.

III. METHOD

In this paper, we focus on domain adaptation problems. The task of our approach is to utilize labeled instances from the source domain, together with unlabeled target data, to predict the unknown labels of the target data. In this section, we describe our prediction reweighting approach in detail and give a quadratic objective function with a closed-form solution. Before stating the formulation, we first introduce some notation.

A. Notation and Background

In the training process, labeled data $D_S$ in the source domain are available, while we can only use unlabeled data $D_T$ in the target domain. We denote the feature space by $\mathcal{X} \subseteq \mathbb{R}^m$ and the label space by $\mathcal{Y}$. Define the $N_s$ labeled instances from the source domain as $D_S = \{(x_{S_i}, y_{S_i})\}_{i=1}^{N_s}$,


where $x_{S_i} \in \mathcal{X}$ and $y_{S_i} \in \mathcal{Y}$ is the corresponding label of $x_{S_i}$. Similarly, $D_T = \{x_{T_i}\}_{i=1}^{N_t}$ represents the unlabeled data in the target domain, and our goal is to predict their labels.

In domain adaptation problems, we assume that the source domain is related to but different from the target domain; thus, their corresponding marginal distributions are different. Formally, let $P(X_S)$ and $P(X_T)$ be the marginal distributions of $X_S$ and $X_T$, respectively. In the common domain adaptation setting, we need Assumption 1 [10], [22].

Assumption 1: The marginal distributions of the source and target domains are different: $P(X_S) \neq P(X_T)$, while the conditional distributions are the same: $P(Y_S|X_S) = P(Y_T|X_T)$.

B. Motivation

Our prediction reweighting approach considers domain adaptation problems from two aspects. One is to use the source classifier to predict the target data; we believe that the source classifier will give more accurate predictions on target instances that are closer to the source domain. Therefore, we give these target instances larger weights (confidences), and their target labels will be more similar to the labels predicted by the source classifier. This constitutes the first part of our formulation. On the other side, to conform with the target data distribution, we utilize manifold regularization for label propagation in the target domain. This part guarantees that nearby target instances share the same labels, and that the labels of instances with higher weights are smoothly propagated to those with lower weights. The two parts are then integrated in a balanced way.

C. Formulation

The objective function for the target instances consists of two parts: 1) the accordance with the source domain classifier and 2) the accordance with the target domain data

$$\min_{f_T}\ \sum_{i=1}^{N_t} \alpha_i \big(f_T(x_{T_i}) - f_S(x_{T_i})\big)^2 + \lambda \sum_{i,j=1}^{N_t} \beta_{ij} \big(f_T(x_{T_i}) - f_T(x_{T_j})\big)^2 \tag{1}$$

where $N_t$ is the number of data points in the target domain, and $f_S(x_{T_i})$ and $f_T(x_{T_i})$ represent the predictions for $x_{T_i}$ given by the source and target domain classifiers, respectively. Let $\mathbf{f}_T = [f_T(x_{T_1}), \ldots, f_T(x_{T_{N_t}})]^T$, and denote by $\alpha_i$ the prediction weight measuring our confidence that the label of target instance $x_{T_i}$ is predicted correctly by the source classifier. We will introduce our reweighting scheme for calculating $\alpha_i$ in detail in Section IV. $\lambda$ is a tradeoff parameter, and $\beta_{ij}$ is the similarity between instances $x_{T_i}$ and $x_{T_j}$:

$$\beta_{ij} = \begin{cases} 1, & \text{if } x_{T_i} \in N_p(x_{T_j}) \text{ or } x_{T_j} \in N_p(x_{T_i}) \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $N_p(x_{T_i})$ is the set of $p$-nearest neighbors of point $x_{T_i}$.
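To make the second term of (1) concrete, the following sketch builds the $p$-nearest-neighbor similarity matrix $B = [\beta_{ij}]$ of (2) and the corresponding graph Laplacian $L = D - B$ used later in Section III-D. This is a minimal NumPy illustration under our own function names, not the authors' code.

```python
import numpy as np

def knn_similarity(X_t, p=5):
    """Binary p-nearest-neighbor similarity matrix B = [beta_ij] as in (2).

    X_t : (N_t, m) array of target instances (one row per instance).
    """
    N_t = X_t.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X_t ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X_t @ X_t.T
    np.fill_diagonal(dist, np.inf)        # exclude self-neighbors
    B = np.zeros((N_t, N_t))
    for i in range(N_t):
        nn = np.argsort(dist[i])[:p]      # indices of the p nearest neighbors of x_Ti
        B[i, nn] = 1.0
    # symmetrize: beta_ij = 1 if either instance is a p-NN of the other
    return np.maximum(B, B.T)

def graph_laplacian(B):
    """Unnormalized graph Laplacian L = D - B with D_ii = sum_j beta_ij."""
    D = np.diag(B.sum(axis=1))
    return D - B
```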


First, we note that Assumption 1 implies that if a target instance lies close to the source domain, it is very likely to be classified correctly by the source classifier, and this constitutes the first term of (1). The prediction weight $\alpha_i$ indicates the probability of $x_{T_i}$ being classified correctly by the source classifier. If we have high confidence that $x_{T_i}$ can be predicted correctly by the source classifier, we put a large value on $\alpha_i$; in this case, to minimize (1), $f_T(x_{T_i})$ should be close to $f_S(x_{T_i})$. If $\alpha_i$ is small, then the target label of $x_{T_i}$ is not restricted to the prediction of the source classifier.

It is worth mentioning the difference between the weights defined in our prediction reweighting method and the domain adaptation machine (DAM) [23], which mainly focuses on multiple-source domain adaptation problems. In DAM (a domainwise method), the regularizer weight $\gamma_i$ measures the distribution relevance between the $i$th source domain and the target domain; in other words, for each individual domain, all target instances share the same weight. In our prediction reweighting method (a pointwise method), however, the weight $\alpha_i$ indicates the probability of an individual target instance being classified correctly by the source classifier, and the weights of different instances are unequal.

Second, to be compatible with the target data, the target classifier should give smooth predictions within the target domain and should not cut through high-density regions. More specifically [20], if two instances $x_{T_i}$ and $x_{T_j}$ in the target domain are close with respect to the marginal distribution $P(X_T)$, the conditional distributions $P(y|x_{T_i})$ and $P(y|x_{T_j})$ should be similar. This introduces the second part of our formulation (1) (manifold regularization). We want the conditional probability distribution $P(y|x_T)$ to change smoothly according to the intrinsic geometry of $P(X_T)$. From the manifold regularization standpoint, this also allows us to train a more adaptable model for label propagation of future data in the target domain, as shown by our empirical results.

The two parts of our prediction reweighting approach can be seen as mutually complementary. The first part gives precise predictions on high-confidence target data, and the second part helps propagate labels smoothly from high-confidence regions of the target domain to low-confidence regions. Obviously, the tradeoff parameter $\lambda$ is critical for balancing the two parts. Later, we will introduce a model selection method to choose this parameter automatically.

D. Solution

To solve (1) compactly, we can write the formulation in matrix form:

$$\min_{\mathbf{f}_T}\ (\mathbf{f}_T - \mathbf{f}_S)^T S (\mathbf{f}_T - \mathbf{f}_S) + \lambda \mathbf{f}_T^T L \mathbf{f}_T. \tag{3}$$

Denote by $S$ the diagonal matrix with diagonal $(\alpha_1, \ldots, \alpha_{N_t})$, and let $\mathbf{f}_S = [f_S(x_{T_1}), \ldots, f_S(x_{T_{N_t}})]^T$. $L$ is the graph Laplacian constructed within the target domain, given by $L = D - B$, where $B = [\beta_{ij}]_{N_t \times N_t}$ and the diagonal matrix $D$ is given by $D_{ii} = \sum_{j=1}^{N_t} \beta_{ij}$. To solve the problem in (3), we define $X_T = [x_{T_1}, \ldots, x_{T_{N_t}}] \in \mathbb{R}^{m \times N_t}$.

From the perspective of the classifier form, we derive two versions to make the proposed algorithm more adaptable.

1) Linear Case: Given the target classifier $f_T(x) = \mathbf{w}^T x$, setting the derivative of (3) with respect to $\mathbf{w}$ to zero yields

$$X_T S X_T^T \mathbf{w} - X_T S \mathbf{f}_S + \lambda X_T L X_T^T \mathbf{w} = 0. \tag{4}$$

According to (4), we can get a closed form for the optimal $\mathbf{w}$:

$$\mathbf{w}^* = \big(X_T (S + \lambda L) X_T^T\big)^{-1} X_T S \mathbf{f}_S. \tag{5}$$

For efficient computation, when $N_t > m$, the optimal solution $\mathbf{w}^*$ is computed as in (5), but when $N_t \ll m$,

$$\mathbf{w}^* = X_T \big((S + \lambda L) X_T^T X_T\big)^{-1} S \mathbf{f}_S. \tag{6}$$

Besides the linear case, we also construct a nonlinear form of the domain adaptation classifier based on our prediction reweighting approach, which makes the method more adaptable.

2) Kernel Case: Let the desired nonlinear transformation of both domains' data be $\phi: \mathcal{X} \rightarrow \mathcal{Z}$, and assume that $\phi$ is a universal nonlinear feature mapping into an appropriately chosen RKHS. Then, formulation (3) can be kernelized, such that $\mathbf{f}_T = [\mathbf{w}^T \phi(x_{T_1}), \ldots, \mathbf{w}^T \phi(x_{T_{N_t}})]^T$ is the target data prediction of our nonlinear classifier. According to the classical representer theorem [20], the optimal solution of the minimization problem (3) can be written as

$$\mathbf{w}^* = \sum_{i=1}^{N_t} \theta_i \phi(x_{T_i}) = \phi(X_T)\boldsymbol{\theta} \tag{7}$$

where $\phi(X_T) = [\phi(x_{T_1}), \ldots, \phi(x_{T_{N_t}})]$ and $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_{N_t}]^T$ is a coefficient vector. Therefore, problem (3) reduces to optimizing the coefficient vector $\boldsymbol{\theta}$, and this transformation allows the minimization problem to be solved more efficiently. We write $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for the inner product of $\phi(x_i)$ and $\phi(x_j)$, and let $K = [k(x_i, x_j)] \in \mathbb{R}^{N_t \times N_t}$ be the kernel matrix defined on the target domain. Assuming that the target classifier in the RKHS is given by $f_T(x) = \mathbf{w}^T \phi(x)$, the optimization problem (3) is equivalent to

$$\min_{\boldsymbol{\theta}}\ (K\boldsymbol{\theta} - \mathbf{f}_S)^T S (K\boldsymbol{\theta} - \mathbf{f}_S) + \lambda (K\boldsymbol{\theta})^T L K\boldsymbol{\theta}. \tag{8}$$

Setting the derivative of (8) with respect to $\boldsymbol{\theta}$ to zero, we obtain the closed-form optimal solution

$$\boldsymbol{\theta}^* = \big(K(S + \lambda L)K\big)^{-1} K S \mathbf{f}_S. \tag{9}$$

Based on the kernel version, we obtain a nonlinear target classifier, which complements the linear one. Empirical results show that the two versions of the target classifier make our algorithm more adaptable to the target data.
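As a concrete illustration, the closed-form solutions (5) and (9) amount to solving small linear systems. The sketch below is our own illustrative NumPy code (not the released implementation), assuming that $S$, $L$, $X_T$, $\mathbf{f}_S$, and $K$ have already been built as described above.

```python
import numpy as np

def linear_target_classifier(X_t, f_s, alpha, L, lam):
    """Closed-form linear solution (5): w* = (X_T (S + lam L) X_T^T)^{-1} X_T S f_S.

    X_t   : (m, N_t) target data, one column per instance
    f_s   : (N_t,)  source-classifier predictions on the target data
    alpha : (N_t,)  prediction weights (diagonal of S)
    L     : (N_t, N_t) graph Laplacian of the target data
    """
    S = np.diag(alpha)
    A = X_t @ (S + lam * L) @ X_t.T        # m x m system matrix
    b = X_t @ (alpha * f_s)                # X_T S f_S
    # in practice a small ridge, A + eps*I, may be needed if A is near-singular
    w = np.linalg.solve(A, b)
    return w                               # predict with w @ x

def kernel_target_classifier(K, f_s, alpha, L, lam):
    """Closed-form kernel solution (9): theta* = (K (S + lam L) K)^{-1} K S f_S."""
    S = np.diag(alpha)
    A = K @ (S + lam * L) @ K
    b = K @ (alpha * f_s)
    theta = np.linalg.solve(A, b)
    return theta                           # predict with k(x, X_T) @ theta
```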


E. Model Selection Method

In our prediction reweighting method, $\lambda$ plays an important role by balancing the effects of the two parts. Because we only have unlabeled target instances, it is difficult to choose $\lambda$ by measuring the performance of the target classifier on the target data. Therefore, we resort to the performance of the target classifier on the source data, with different weights. We assume that the source domain classifier is accurate for those target data close to the source domain.¹ Alternatively, we also expect the target domain classifier to be accurate for those source data that are close to the target domain. Thus, we first evaluate how close each source instance is to the target domain based on our reweighting scheme (denoted by $\alpha_j^{(s)}$; the closer a source instance is to the target domain, the higher $\alpha_j^{(s)}$ is),² and use the weighted source domain data for automatic model selection:

$$E_v = \sum_{j=1}^{N_s} \alpha_j^{(s)} \big(f_T(x_{S_j}) - y_j\big)^2 \tag{10}$$

where $f_T(x_{S_j})$ is the prediction of the target classifier on a source domain instance, and $y_j$ is its corresponding true label. According to the analysis above, for each experiment, the evaluation of the target classifier's performance on the target data is replaced by its performance on the weighted source data. As a result, we can automatically choose the parameter $\lambda$ according to the values of $E_v$. This model selection process reduces the arbitrariness in choosing $\lambda$; depending on the properties of the source domain data, it selects the more adaptable classifier. Thus, our approach can be easily implemented in real applications.

F. Summary of Prediction Reweighting Method

Based on the above discussion, our prediction reweighting algorithm is summarized in Algorithm 1.

Algorithm 1 Prediction Reweighting Algorithm
Input: $N_s$ labeled source instances $\{X_S, Y_S\} = \{(x_{S_m}, y_{S_m})\}_{m=1}^{N_s}$; $N_t$ unlabeled target instances $\{X_T\} = \{x_{T_n}\}_{n=1}^{N_t}$.
Step 1: Use $\{(x_{S_m}, y_{S_m})\}_{m=1}^{N_s}$ to train the source classifier $f_S(x)$, and predict the labels $\mathbf{f}_S$ of $\{x_{T_n}\}_{n=1}^{N_t}$.
Step 2: Train the domain separator $\mathbf{w}_{ds}$ and construct the graph Laplacian $L$ of $X_T$.
Step 3: Compute the prediction weights $\alpha_m^{(s)}$ for $x_{S_m}$ and $\alpha_n$ for $x_{T_n}$ using Eq. (11).
Step 4:
  • If the classifier is linear, compute $\mathbf{w}$ using Eq. (5) or Eq. (6).
  • Else if the classifier is kernel, compute the coefficient vector $\boldsymbol{\theta}$ using Eq. (9).
Step 5: For each parameter $\lambda_i$, compute the evaluation value $E_v$, and select the best $\lambda$ corresponding to the minimum of Eq. (10).
Output: The classifier $f_T(x) = \mathbf{w}^T x$ or $f_T(x) = \mathbf{w}^T \phi(x)$.

¹ This will be verified in Section V-B.
² The specific form of $\alpha_j^{(s)}$ will be introduced in Section IV-B.
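A compact end-to-end sketch of Algorithm 1 (linear version) is given below. It is our own illustrative code, not the released implementation: logistic regression and a linear SVM stand in for the source classifier and the MPM domain separator used in the paper, the helper functions from the earlier sketches are assumed, and the names `prda_linear`, `gamma`, and `lam_grid` are ours. Binary labels in {−1, +1} are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def prda_linear(X_s, y_s, X_t, lam_grid=(1e-2, 1e-1, 1.0, 10.0), gamma=1.0, p=5):
    """Two-stage prediction reweighting (linear version), following Algorithm 1."""
    # Step 1: source classifier and its real-valued predictions on the target data
    src = LogisticRegression(max_iter=1000).fit(X_s, y_s)
    f_s = src.decision_function(X_t)                        # stand-in for f_S(x_T)

    # Step 2: domain separator (source = +1, target = -1) and target-graph Laplacian
    X_all = np.vstack([X_s, X_t])
    d_all = np.hstack([np.ones(len(X_s)), -np.ones(len(X_t))])
    sep = LinearSVC(max_iter=5000).fit(X_all, d_all)
    nrm = np.linalg.norm(sep.coef_)
    z_t = -sep.decision_function(X_t) / nrm   # >0 on the target side, <0 if it crosses to the source
    z_s = sep.decision_function(X_s) / nrm    # >0 on the source side, <0 if it crosses to the target
    L = graph_laplacian(knn_similarity(X_t, p))             # earlier sketch

    # Step 3: prediction weights, Eq. (11); Proposition 1 suggests 0 <= gamma <= 2*mu_z/sigma_z^2
    alpha_t = np.exp(-gamma * z_t); alpha_t /= alpha_t.sum()
    alpha_s = np.exp(-gamma * z_s); alpha_s /= alpha_s.sum()

    # Steps 4-5: closed-form target classifier per lambda; keep the one minimizing E_v, Eq. (10)
    best = None
    for lam in lam_grid:
        w = linear_target_classifier(X_t.T, f_s, alpha_t, L, lam)   # Eq. (5)
        E_v = np.sum(alpha_s * (X_s @ w - y_s) ** 2)                # Eq. (10)
        if best is None or E_v < best[0]:
            best = (E_v, w)
    return best[1]                                           # predict with sign(x @ w)
```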

Fig. 1. Artificial example of a domain separator and the corresponding instance weights. All instances are reweighted based on our reweighting scheme (best viewed in color).

IV. REWEIGHTING SCHEME

In our prediction reweighting approach, we believe that the target instances that are close to the source domain can be predicted correctly by the source classifier with high probability, and we assign these data high weights. We leverage the domain separator to measure this closeness and therefore reweight the target data based on their signed distances to the domain separator in an intuitive way. Moreover, we provide a theoretical justification that the weighted target domain is closer to the source domain in an appropriate sense. As a result, this increases the adaptivity of the classifier to the target domain and improves the classification accuracy. Before introducing our reweighting scheme based on the domain separator, we present the definition of the domain separator.

A. Domain Separator

We consider the process of training the domain separator as a binary classification problem. If we treat all the source data as one class and all the target data as another, we can train a classifier (the domain separator) to maximally separate them. In this process, we make no assumption about the specific distributions $P(X_S)$ and $P(X_T)$. Let the hyperplane $H(\mathbf{w}_{ds}, b) = \{x \mid \mathbf{w}_{ds}^T x = b\}$ be a domain separator. If the distributions of the source and target domains are far apart, the two domains can be separated perfectly by the domain separator. However, if they have a reasonable overlap, those target instances close to, or even violating, the domain separator may follow the source distribution. Thus, the closer a target instance is to the domain separator (we regard the signed distance of a target instance violating the domain separator as negative), the more accurately the source classifier can predict it. Therefore, we reweight each instance based on its signed distance to the domain separator.

As an example, Fig. 1 schematically shows the domain separator separating the source data from the target data. Squares represent instances drawn from the target domain, and dots are instances from the source domain. The straight line between the two domains is the domain separator $\mathbf{w}_{ds}$. The size of an instance represents its weight value: the closer it is to the domain separator, the larger it is. We believe that the larger instances can influence the training model more than the smaller ones.


The specific form of the reweighting scheme is introduced in Section IV-B.

B. Reweighting Based on the Domain Separator

For each target instance $x_{T_i}$, we denote by $z_i = D(x_{T_i})$ the signed distance from $x_{T_i}$ to the domain separator. Its reweighting factor is denoted by $\alpha(x_{T_i})$, which is calculated from its signed distance $z_i$. The specific form can be expressed as

$$\alpha(x_{T_i}) = \frac{1}{\delta} e^{-\gamma z_i} \tag{11}$$

where $\gamma$ is a scale factor, and guidance for selecting an appropriate $\gamma$ will be given in Proposition 1. $\delta$ is a normalization factor, which guarantees that the distribution of the weighted target domain is a valid probability density function (pdf). According to (11), the factor $\alpha(x_{T_i})$ is a monotonically decreasing function of the signed distance $z_i$. Similarly, in the model selection process, (11) is also applied to calculate $\alpha_j^{(s)}$, where $z_i$ in (11) then denotes the signed distance of a source instance to the domain separator. In Section IV-C, we will demonstrate why we choose the form of weights in (11) and how our approach makes the source and target domains become closer.
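In code, once the domain separator has been trained and the signed distances are available, the weights in (11) reduce to a normalized exponential. The following small sketch is our own NumPy illustration; the default choice of $\gamma$ is our heuristic reading of Proposition 1 (the $\xi = \mu_z$ case), not a prescription from the paper.

```python
import numpy as np

def prediction_weights(z, gamma=None):
    """Weights alpha_i proportional to exp(-gamma * z_i), Eq. (11).

    z : (N,) signed distances to the domain separator (negative for
        instances that fall on the other side of the hyperplane).
    gamma : scale factor; if None, use gamma = mean(z) / var(z), i.e. the
        xi = mu_z choice admitted by Proposition 1 (assumes mean(z) > 0).
    """
    z = np.asarray(z, dtype=float)
    if gamma is None:
        gamma = z.mean() / z.var()
    w = np.exp(-gamma * z)
    return w / w.sum()   # simple stand-in for the normalization factor delta
```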

C. Justification

Our reweighting scheme is based on each instance's signed distance to the domain separator. It is not only intuitive; we can also prove that the probability of the weighted target data crossing over the domain separator becomes larger. In domain adaptation problems, we cannot apply the source domain classifier to the target domain directly, due to the difference between their distributions. Therefore, the objective of our reweighting scheme is to bring the two distributions closer, which benefits the adaptivity of our target classifier. Moreover, in order to keep our results general, we do not assume any specific distributions for the source and target domains. As a result, it is useful to control the probability of the target data lying on the opposite side of the domain separator, where most of the source data are. We define $P$ as the probability that the original target data lie on the other side of the domain separator; correspondingly, $P_\alpha$ is this probability for the target data weighted by our reweighting scheme. If $P_\alpha$ is larger than $P$, the source and target domains become closer.

Our reweighting scheme rests on the following theorem from [24], which is further discussed using convex optimization techniques in [25]:

$$\sup_{\mathbf{x} \sim (\boldsymbol{\mu}_x, \Sigma_x)} \Pr\{\mathbf{x} \in M\} = \frac{1}{1 + s^2}, \quad \text{with } s^2 = \inf_{\mathbf{x} \in M} (\mathbf{x} - \boldsymbol{\mu}_x)^T \Sigma_x^{-1} (\mathbf{x} - \boldsymbol{\mu}_x) \tag{12}$$

where $\mathbf{x}$ is a random vector and $M$ is a given convex set. In this theorem, the supremum is taken over all distributions for $\mathbf{x}$ with mean $\boldsymbol{\mu}_x$ and covariance matrix $\Sigma_x \succ 0$.³ Hence, even without making any specific distributional assumptions about the domains, we are able to bound the probability of the target data lying on the opposite side of the domain separator.

³ In this paper, we write $Q \succ 0$ (resp. $Q \succeq 0$) to denote that $Q$ is symmetric and positive definite (resp. positive semidefinite).

We begin with a lemma [26], which constrains the convex set $M$ to one side of the domain separator. This is convenient to handle and will be used to demonstrate our results.

Lemma 1 [26]: Consider any source and target domains, and denote by $\boldsymbol{\mu}_T$ and $\Sigma_T \succ 0$ the mean and the covariance matrix of the target domain, respectively. The domain separator $\mathbf{w}_{ds} \neq 0$ and $b$ are given, such that $\mathbf{w}_{ds}^T \boldsymbol{\mu}_T \leq b$ and $\rho \in (0, 1]$. The condition

$$\sup_{\mathbf{x} \sim (\boldsymbol{\mu}_T, \Sigma_T)} \Pr\big\{\mathbf{w}_{ds}^T \mathbf{x} \geq b\big\} \leq \rho \tag{13}$$

holds if and only if

$$b - \mathbf{w}_{ds}^T \boldsymbol{\mu}_T \geq \kappa(\rho)\, \sqrt{\mathbf{w}_{ds}^T \Sigma_T \mathbf{w}_{ds}} \tag{14}$$

where $\kappa(\rho) = \sqrt{(1 - \rho)/\rho}$.

Lemma 1 provides the upper bound $\rho$ on the probability of the target data lying on the opposite side of the domain separator. In other words, when we cannot calculate the exact value of $P$, we resort to computing $\rho$ instead. It is easy to see that the smaller $\rho$ is, the smaller $P$ is, and the farther the target domain is from the source domain. Thus, if our reweighting scheme makes $\rho$ larger, we can bring the source and target domains closer.

Observe that (14) can be expressed as

$$\frac{\big(\mathbf{w}_{ds}^T \boldsymbol{\mu}_T - b\big)^2}{\mathbf{w}_{ds}^T \Sigma_T \mathbf{w}_{ds}} \geq \kappa^2. \tag{15}$$

When we consider $\boldsymbol{\mu}_T = E[X_T]$ and $\Sigma_T = E[(X_T - \boldsymbol{\mu}_T)(X_T - \boldsymbol{\mu}_T)^T]$, and define $Z$ as the random variable representing the signed distance from a target instance to the domain separator, we get

$$\big(\mathbf{w}_{ds}^T \boldsymbol{\mu}_T - b\big)^2 = \|\mathbf{w}_{ds}\|^2 (E[Z])^2, \qquad \mathbf{w}_{ds}^T \Sigma_T \mathbf{w}_{ds} = \|\mathbf{w}_{ds}\|^2\, \mathrm{Var}(Z). \tag{16}$$

Thus, (15) reduces to

$$\frac{(E[Z])^2}{\mathrm{Var}(Z)} \geq \kappa^2. \tag{17}$$

Combined with the relation $\kappa(\rho) = \sqrt{(1 - \rho)/\rho}$, (17) can be expressed as

$$\rho \geq \frac{\mathrm{Var}(Z)}{E[Z^2]}. \tag{18}$$

To utilize (18), it is necessary to know the distribution of $Z$ in order to calculate $\rho$. According to its definition, $Z$ is essentially a linear projection from a high-dimensional space to a one-dimensional space. From the observations in [27], when the target data are high dimensional ($m \rightarrow \infty$), the quantities

$$h_{\boldsymbol{\omega}}(\mathbf{x}) \stackrel{\mathrm{def}}{=} \sum_{j=1}^{m} \omega_j x_j \tag{19}$$


are normally distributed, where $\mathbf{x} = [x_1, x_2, \ldots, x_m]$ and $\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_m]$ is a linear classifier. This phenomenon can be verified in two ways.

One way to support the claim that (19) is normally distributed is through empirical verification. As we can see in Fig. 2, we contrast the histogram with a fitted normal pdf on the visual object recognition dataset (AMAZON and CALTECH have the most instances), which will be introduced in detail in Section V. In both cases, $h_{\boldsymbol{\omega}}(\mathbf{x})$ is approximately normal for domain separators obtained by the minimax probability machine (MPM).

Fig. 2. Centered histograms of $h_{\boldsymbol{\omega}}(\mathbf{x})$ overlaid with the pdf of a fitted Gaussian for multiple $\boldsymbol{\omega}$ vectors (the $\boldsymbol{\omega}$ vectors are the same as the domain separators calculated by MPM); the dataset is object recognition. (a) Test for AMAZON. (b) Test for CALTECH.

Besides the experimental justification of the normality of $Z$, another perspective is a theoretical one using a central limit theorem (CLT). The original CLT shows that $\sum_{j=1}^{m} x_j$ is approximately normal for large $m$ if the $x_j$ are independent and identically distributed (i.i.d.). However, this is a relatively restrictive theorem, which is impractical for most cases, as instance dimensions are often dependent rather than i.i.d. For this problem, the more general Lindberg CLT [27], [28] can be helpful, since it does not require the $x_j$ to be identically distributed. The Lindberg CLT states that, as long as the data dimensions are independent, under some conditions $\sum_{j=1}^{m} (x_j - \mu_j)/\tau$ is approximately normal for large $m$, where $\mu_j$ and $\sigma_j^2$ are the mean and variance of $x_j$, respectively, and $\tau^2 = \sum_{j=1}^{m} \sigma_j^2$. However, in real applications, the different features of an instance are not always independent, so even the Lindberg CLT, although a mild one, may not apply. The Berk CLT [28] and the Rinott CLT [29] state stronger results, which are more general and practical. Therefore, whether the above CLTs apply in practice is a delicate question. However, Balasubramanian et al. [27] state that the normality of (19) holds for text and image data for some values of $\boldsymbol{\omega}$, as long as it is not sparse.

In our justification, we assume that the random variable $Z \sim N(\mu_z, \sigma_z^2)$ with $\mu_z > 0$ (the case $\mu_z < 0$ is analogous), and denote by $Z_\alpha$ the weighted random variable obtained by our reweighting scheme. Following the notation of Lemma 1, $\rho_\alpha$ represents the value of $\rho$ calculated for the weighted target domain. Our reweighting scheme aims to bring the source and target domains closer, which means we should prove $\rho_\alpha \geq \rho$. Considering the form of (18), we can calculate the minimum value of $\rho$, which indicates the relation between $\rho_\alpha$ and $\rho$. We define

$$\rho_{\min} = \frac{\mathrm{Var}(Z)}{E[Z^2]}. \tag{20}$$

Then, we provide a proposition to state the effectiveness of our algorithm.

Proposition 1: Suppose that $Z \sim N(\mu_z, \sigma_z^2)$, $\mu_z > 0$, and for each point $z$ the weight is selected as $\alpha(z) = (1/\delta)\,e^{-\xi z/\sigma_z^2}$, where $\xi$ is a scale factor and $\delta$ is a normalization term.⁴ If the condition $0 \leq \xi \leq 2\mu_z$ holds, then $\rho_{\alpha\min} \geq \rho_{\min}$.

Proof: Since $Z \sim N(\mu_z, \sigma_z^2)$, we obtain

$$\rho_{\min} = \frac{\mathrm{Var}(Z)}{E[Z^2]} = \frac{\sigma_z^2}{\mu_z^2 + \sigma_z^2}. \tag{21}$$

Plugging in the parameters and the normalization factor $\delta$, we can calculate

$$E[Z_\alpha] = \int_{-\infty}^{+\infty} f(z)\,\alpha(z)\,z\,dz = \frac{1}{\delta}\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma_z}\, e^{-\frac{(z-\mu_z)^2}{2\sigma_z^2}}\, e^{-\frac{\xi z}{\sigma_z^2}}\, z\,dz = \mu_z - \xi \tag{22}$$

$$E\big[Z_\alpha^2\big] = \int_{-\infty}^{+\infty} f(z)\,\alpha(z)\,z^2\,dz = \frac{1}{\delta}\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma_z}\, e^{-\frac{(z-\mu_z)^2}{2\sigma_z^2}}\, e^{-\frac{\xi z}{\sigma_z^2}}\, z^2\,dz = (\mu_z - \xi)^2 + \sigma_z^2. \tag{23}$$

Thus, we obtain

$$\rho_{\alpha\min} = \frac{\mathrm{Var}(Z_\alpha)}{E\big[Z_\alpha^2\big]} = \frac{\sigma_z^2}{(\mu_z - \xi)^2 + \sigma_z^2}. \tag{24}$$

If the condition $0 \leq \xi \leq 2\mu_z$ holds, then $(\mu_z - \xi)^2 \leq \mu_z^2$, and hence $\rho_{\alpha\min} \geq \rho_{\min}$. This ends the proof of Proposition 1.

Consider the form of the instance weight (11): according to Proposition 1, as long as we choose $0 \leq \gamma \leq 2\mu_z/\sigma_z^2$, $\rho_{\alpha\min}$ will be larger than $\rho_{\min}$. This indicates that $\rho_\alpha$ tends to be larger than $\rho$ no matter what the distribution of the target data is, and $P_\alpha$ can be larger than $P$. In particular, when we select $\xi = \mu_z$, we have $\rho_{\alpha\min} = 1$ and $\rho_\alpha = 1$.

⁴ In order to guarantee that $f(z)\,\alpha(z)$ is a valid pdf, we take $\delta = \int_{-\infty}^{+\infty} f(z)\, e^{-\xi z/\sigma_z^2}\,dz = e^{(\xi^2 - 2\mu_z\xi)/2\sigma_z^2}$.


Fig. 3. Linear version of the proposed approach on an artificial dataset of Gaussian distribution (best viewed in color). (a) Source linear classifier by SVM. (b) Target linear classifier by the proposed approach.

Fig. 4. Kernel version of the proposed approach on an artificial dataset of uniform distribution in a semiarc-shaped region (best viewed in color). (a) Source nonlinear classifier by SVM. (b) Target nonlinear classifier by the proposed approach.

This means that our reweighting scheme can raise the upper bound of the probability of the target data lying on the opposite side of the domain separator to 1, if we choose an appropriate $\xi$.

Our prediction reweighting scheme based on the domain separator helps us discover the target instances closest to the source domain, which are also the most confident ones. This intuitive scheme enlarges the upper bound of the probability of the target data lying on the opposite side of the domain separator, which brings the source and target domains closer and greatly helps the design of the target classifier.

V. EXPERIMENTS

In this section, in order to evaluate our approach extensively, we first apply it to two artificial datasets, which visualize our algorithm. Then, two real-world problems, sentiment analysis [2] and object recognition [30]–[32], are discussed.

A. Artificial Datasets

To illustrate our prediction reweighting approach, Fig. 3 shows a linear version of the target classifier for a binary classification problem. In the usual domain adaptation setting, we have plenty of labeled instances drawn from a Gaussian distribution in the source domain (denoted by crosses, with labels "+" and "−" shown as red and green crosses, respectively), and unlabeled instances drawn from a different Gaussian distribution in the target domain (denoted by squares, with blue and pink squares representing labels "+" and "−," respectively). Following Algorithm 1, we utilize a support vector machine (SVM) [33], [34] to train the source classifier $f_S(x)$; due to the distribution difference between the source and target domains, the target classification accuracy of $f_S(x)$ is only 87.6%.

By contrast, as can be seen in Fig. 3, our approach performs better, and its classification accuracy in the target domain is 99.1%.

Similarly, Fig. 4 shows a kernel version of our approach, with the same notation as in Fig. 3. Here, the data in both domains are uniformly distributed, but over different, overlapping semiarc-shaped regions. Intuitively, we should take the manifold property of the target data into account; however, a nonadapted classifier trained on the source data by SVM cannot do so. Although its classification accuracy is 100% in the source domain, its performance on the target data is only 85.2%, and part of the positive data are misclassified. In contrast, our approach classifies the target data entirely correctly.

Observing Figs. 3 and 4, we note that our prediction reweighting approach improves the classification accuracy considerably on both artificial datasets, with both classifier versions. It is easy to see that the target data closer to the source domain have higher probabilities of being classified correctly by the source classifier, which confirms our confidence in the closer target instances. Moreover, because we pay attention to the manifold property of the target data, we can prevent the target classifier from crossing high-density regions of the target instances. Hence, our target classifier is more adaptable to the target domain.

B. Sentiment Classification

In this section, we first describe the Sentiment dataset used in our experiments, which is from [2] and widely used in the field of domain adaptation [35]. This dataset contains a collection of product reviews taken from Amazon.com in four product domains: Book (B), DVD (D), Electronics (E), and Kitchen (K) [8], [36].


Fig. 5. Performance comparison of each 20% portion of the target data, ordered by weight from large to small, when classified by the source classifier. The five bars in each pair represent the five folds of target data.

TABLE I: SUMMARY OF DATASETS USED FOR SENTIMENT CLASSIFICATION ON 12 PAIRS [SOURCE/TARGET DOMAINS ARE BOOK (B), DVD (D), ELECTRONICS (E), AND KITCHEN (K)]

In each domain, there are plenty of positive and negative reviews, respectively, based on the rating scores given by reviewers. Similar to [37], we reduce the feature dimension of the Sentiment dataset to 5000 using a cutoff document-frequency method. In the experiment, we construct 12 cross-domain sentiment classification tasks denoted by the term source → target. The tasks are: B → D, B → E, B → K, D → B, D → E, D → K, E → B, E → D, E → K, K → B, K → D, and K → E [17], [38], as shown in Table I.

We conduct a series of experiments to compare our approach with several baselines. A vital step in our approach is to calculate the prediction weight of each target instance based on its signed distance to the domain separator. In order to avoid too many data points lying near the domain separator, we reduce the dimensionality of the source and target data to 1000 using the principal component analysis (PCA) method. We have found that this preprocessing improves the discrimination of the prediction weights, does not reduce performance significantly, and also makes the computation more efficient. On the other hand, the performance of the source classifier on the target data affects our approach to some extent; thus, for sentiment classification, we leverage a multinomial model [39], [40] to train the source classifier.

Before showing the sentiment classification results, we first verify whether our reweighting scheme helps discriminate the confident target instances. According to our approach, target data that are closer to the source domain are more trustworthy; correspondingly, they should have larger weights and be classified correctly with higher probability by the source classifier.

Fig. 5 confirms our intuition. For each (source/target) pair, the target instances are ranked by the new weights calculated by our reweighting scheme and divided into five folds, from the largest weights to the smallest. We test them with the source classifier, and almost all pairs show the same phenomenon: the instances with larger weights are classified with better performance.

To fully evaluate the effectiveness of our approach, in Table II we denote our prediction reweighting approach for domain adaptation as "PRDA" and compare its linear and kernel versions with a large number of state-of-the-art algorithms as baselines. Since no labeled target data are available, it is infeasible to select the optimal parameters through standard cross-validation; we resort to grid-searching the parameter space to obtain the respective optimal settings, and the best results are reported. We describe all the baselines as follows.

1) SVM⁵ [34]: An SVM classifier with linear kernel is trained on the labeled source domain only. The tradeoff parameter C is searched over C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}, and k-fold cross-validation is used to select the best value.

2) Multinomial Model [39]: The multinomial model is traditional in statistical language modeling for speech recognition and text classification [41], [42]. We use the labeled source data to train a multinomial model and then apply it to the unlabeled target data.

3) LANDMARK⁶ [17]: LANDMARK aims at finding the source data most similar to the target domain, from which auxiliary domain adaptation tasks are constructed. All parameters are chosen as in [17], and the optimal dimension of the domain-invariant features is set by searching k ∈ {10, 20, . . . , 100, 500} for all pairs.

4) TCA [10]: We utilize the TCA method to learn the subspace spanned by the transfer components.

⁵ http://www.csie.ntu.edu.tw/~cjlin/libsvm.
⁶ http://www-scf.usc.edu/~boqinggo/.


TABLE II: ACCURACY RESULTS FOR SENTIMENT CLASSIFICATION ON ALL PAIRS [SOURCE/TARGET DOMAINS ARE BOOK (B), DVD (D), ELECTRONICS (E), AND KITCHEN (K)]

We set the tradeoff parameter μ = 0.1 and search k ∈ {10, 20, . . . , 100, 500} to decide the optimal dimension of the subspace.

5) SCL [9]: SCL learns a common feature representation across domains based on pivot features. Following [9], we select the pivot features, and their optimal number is set by searching k ∈ {10, 50, 100, 200, 500}.

6) KMM [5]: KMM reweights the source instances by matching the distributions of the source and target domains in an RKHS. The weight constraints on each source instance follow [5], and a target classifier is then trained on the weighted source data using a weighted SVM.⁷

7) Transfer Joint Matching (TJM) [43]: TJM learns a new feature representation that reduces the domain difference by jointly matching the features and reweighting the instances. The regularization parameter is set as λ = 1, and we search k ∈ {10, 50, 100, 200, 500} to select the optimal subspace dimension.

8) DAM [23]: The domain adaptation machine (DAM) is designed to address multiple-source domain adaptation problems. DAM leverages a set of prelearned source classifiers and satisfies a smoothness assumption. We tune the setting of DAM to suit our experiments, and the best result of DAM is reported.

9) Transfer Ordinal Label Learning (TOLL) [8]: TOLL leverages multiple relevant ordinal source classifiers to predict the ordinal labels of the target data by spanning the feasible solution space. Under our experimental setting, there is only one source domain, and we set K = 2 in TOLL for the binary problem. The optimal hyperparameter β restricting the imbalanced class sizes is set by searching β ∈ {0.3, 0.4, 0.5}.

As stated in the model selection method of our algorithm, we set λ = 10^q with q ∈ {−5, −4, . . . , 4, 5}, and according to the value of (10), the hyperparameter λ is picked automatically.

Table II shows the sentiment classification accuracy of the different methods. The evaluation metric that we use is the accuracy on the unlabeled target data, which is widely adopted by domain adaptation methods [3], [10], [44], [45].

⁷ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances.

For a fairer comparison, besides listing the accuracy on all pairs, we also rank all the methods on every task according to their accuracies, and the average rank of each method is reported. From Table II, our prediction reweighting method, in both its linear and kernel (cosine kernel for the Sentiment dataset) versions, outperforms almost all the other algorithms. In practice, our prediction reweighting approach achieves the best results on 8 out of 12 pairs. The average ranks are 2 and 2.08 for the linear and kernel versions, respectively, and the average prediction accuracies are 79.9% and 79.8%, which outperform all the other competitive methods.

It is worth noting that DAM performs better than the other baselines. However, the reweighting scheme of DAM is domainwise, whereas our method is pointwise: through our reweighting scheme, every target instance can be given a different weight representing our confidence in it. Moreover, the geometric structure of the target instances is also considered, which makes our method more adaptable to the target domain. In contrast to TOLL, which needs multiple source classifiers, our approach can leverage one source and one target domain to achieve better performance. KMM focuses on reweighting the source data by minimizing the MMD divergence between the source and target domains, whereas our prediction reweighting approach considers and exploits the information of both domains in a balanced way; in addition, the reweighting scheme of KMM is time-consuming for large-scale datasets.

Although our prediction reweighting method does not perform the best on all tasks, its effectiveness is still demonstrated. In particular, for the tasks D → K, E → K, and K → D, DAM, TOLL, and LANDMARK perform the best, respectively. As shown in Fig. 5, the weights calculated for these three tasks cannot clearly discriminate the most confident target data, which results in the degradation of classification accuracy. Moreover, our approach performs only slightly worse than the best baseline on the task K → E. Analyzing the results in Table II, we can see that the two forms of our method generally outperform the other methods, and the linear version is superior to the kernel version.


TABLE III: SUMMARY OF DATASETS USED FOR OBJECT RECOGNITION ON NINE PAIRS [SOURCE/TARGET DOMAINS ARE AMAZON (A), CALTECH (C), DSLR (D), AND WEBCAM (W)]

The Sentiment data have a relatively high dimensionality (1000), and in a high-dimensional space a linear classifier has some advantages. In Section V-C, we will illustrate that the two forms of our method are complementary to each other.

C. Object Recognition

Next, we report the experimental results on the task of object recognition. Four datasets of images from ten object classes, released in [46], are used in our experiment. These four datasets come, respectively, from the Web, downloaded from online merchants (www.amazon.com, denoted by AMAZON); from a digital single-lens reflex camera, captured in realistic environments (denoted by DSLR); from a webcam, recorded with a simple webcam with high and low resolutions (denoted by WEBCAM) [32]; and from Google images, screened to remove all images that do not fit the category (denoted by CALTECH) [43]. Following [31], we take advantage of its method to preprocess the data, which helps describe the features of the data efficiently. The datasets are summarized in Table III; the number of images per class differs only slightly within each dataset. As in the sentiment analysis, we reduce the dimensionality of both the source and target instances to 100 in order to reweight them efficiently. In Table III, we can notice that the DSLR dataset has only 157 samples; due to its small size, we do not use it as a source domain. As a consequence, there are nine pairs (source/target) in this experiment.

Table IV shows the object recognition accuracies of the different methods. Similar to the experimental setting for sentiment classification, we compare our prediction reweighting approach with SVM [34], LANDMARK [17], TCA [10], SCL [9], KMM [5], TJM [43], and DAM [23]. Since the multinomial model generalizes poorly for object recognition and the original implementation of TOLL cannot deal with multiclass problems, we do not use them as baselines for this dataset. Here, we use the original data for the other compared baselines instead of the PCA-reduced data.

As can be seen from Table IV, all the domain adaptation methods outperform the standard SVM. The average accuracy of our prediction reweighting approach is 49.1%, an improvement of 2.7% over the best baseline, LANDMARK. The average ranks of our prediction reweighting method are 3.22 and 2.67, and our approach outperforms all the other methods on this dataset, which manifests its effectiveness. However, KMM and TJM perform better than our method on the pairs C → D and W → D, respectively. On the other pairs, our approach outperforms KMM and TJM, because our prediction reweighting method pays more attention to exploiting the unlabeled target data.

Fig. 6. Classification accuracy of PRDA, SCL, KMM, DAM, LANDMARK, TCA, and TJM on reduced target data. (a) Book and DVD. (b) CALTECH and WEBCAM.

TCA and KMM underperform LANDMARK this time, because for image data, resorting to decreasing the MMD distance between the source and target domains is not very suitable [47]. LANDMARK is better than our approach on three pairs, but it is less prominent on the other pairs, which affects its average rank. For W → C and A → C, LANDMARK is better than our method. We attribute this to two factors. First, the source domain is smaller than the target domain; since one part of our approach uses the source classifier to predict the target data and then reweights the predictions, too few source data affect the training of the source classifier. Second, the number of images per class is not the same and sometimes differs considerably, which leads to an imbalance in the prediction weight computation. This time, different from the experiments on sentiment classification, the kernel version (radial basis function (RBF) kernel for object recognition) of our method performs better than the linear case, which indicates that the two versions are complementary to each other.

D. Extensions

We verify the effectiveness of our prediction reweighting method by inspecting the impacts of scarce target data, an iterative approach, and the combination with existing domain adaptation methods.



TABLE IV
ACCURACY RESULTS FOR OBJECT RECOGNITION ON NINE PAIRS [SOURCE/TARGET DOMAINS ARE AMAZON (A), CALTECH (C), DSLR (D), AND WEBCAM (W)]

Fig. 7. Relative change in object recognition accuracy of multiple runs of the prediction reweighting approach (kernel version) over a single run.

1) Scarce Target Data: Our prediction reweighting method computes the weights of the source classifier predictions on the target data, and thus, the training of the domain separator is of vital importance. Utilizing the MPM helps us obtain a robust domain separator even when the target data become scarce. To inspect the impact of scarce target data, we first run all the DA methods on some randomly selected pairs from the different datasets, e.g., B → D and D → B for sentiment classification and CALTECH → WEBCAM and WEBCAM → CALTECH for object recognition. Then, the target data are randomly reduced by 20% each time to form a new target domain. In Fig. 6, the horizontal axis represents the ratio of the size of the reduced target data to that of the original target domain. We notice that our prediction reweighting approach achieves much better performance than the other DA methods on most pairs. Even when the target data become scarce (e.g., 20% of WEBCAM contains only 59 samples), our method performs steadily without much fluctuation. This shows that our reweighting framework does not depend on the number of target data and can also handle the scarce target data problem. Moreover, our method applies manifold regularization to the target domain in a balanced way, which benefits the prediction accuracy. It is also worth highlighting that we reduce the target data randomly, which makes the tasks more challenging and leads to fluctuations in the results of the DA methods. However, unlike KMM, LANDMARK, and TJM, which are affected by such changes in the target data, our method and TOLL yield relatively robust results.
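To make the weight computation and the subsequent label propagation concrete, the following is a minimal Python sketch under several stated assumptions: binary labels in {-1, +1}, a logistic-regression domain separator standing in for the MPM, a sigmoid mapping from the signed distance to a weight, and a generic manifold-regularized objective with a closed-form solution. None of these stand-ins should be read as the paper's exact formulation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import kneighbors_graph
from sklearn.svm import LinearSVC


def domain_weights(Xs, Xt):
    # Train a source-vs-target domain separator (logistic regression as a
    # stand-in for the MPM) and turn the signed distance of each target point
    # into a confidence weight for the source classifier's prediction on it.
    X = np.vstack([Xs, Xt])
    d = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])    # 0 = source, 1 = target
    separator = LogisticRegression(max_iter=1000).fit(X, d)
    signed_dist = separator.decision_function(Xt)           # > 0: far on the target side
    return 1.0 / (1.0 + np.exp(signed_dist))                # assumed sigmoid weight mapping


def propagate(Xt, y_hat, w, lam=1.0, k=10):
    # Closed-form manifold-regularized propagation on the target data:
    #   min_f  sum_i w_i (f_i - y_hat_i)^2 + lam * f^T L f
    #   =>     f = (diag(w) + lam * L)^{-1} diag(w) y_hat
    A = kneighbors_graph(Xt, k, mode="connectivity", include_self=False)
    A = 0.5 * (A + A.T).toarray()                           # symmetric k-NN adjacency
    L = np.diag(A.sum(axis=1)) - A                           # unnormalized graph Laplacian
    W = np.diag(w)
    return np.linalg.solve(W + lam * L, W @ y_hat)


def prda_sketch(Xs, ys, Xt, lam=1.0, k=10):
    # Stage 1: source-classifier predictions on the target data, reweighted by
    # the domain separator; Stage 2: propagation from high- to low-weight points.
    y_hat = LinearSVC().fit(Xs, ys).decision_function(Xt)
    f_t = propagate(Xt, y_hat, domain_weights(Xs, Xt), lam=lam, k=k)
    return np.sign(f_t)                                      # predicted target labels

In practice, the weight mapping and the graph construction (a symmetric k-NN graph here) would follow the definitions given earlier in the paper rather than these generic choices.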

2) Iterative Approach: Another important aspect of our prediction reweighting method is that it depends on the source classifier $f_S(\mathbf{x})$. In the absence of prior knowledge, a default choice is a standard learning method, such as SVM, although the most suitable learning method may vary with the data and thus affect the results. This uncertainty suggests an iterative approach, in which the target labels obtained in one application (or run) of the proposed method serve as the source predictions for the next run. More formally, let $f_T^{(i)}$ be the target labels obtained from the $i$th run of the proposed method. For the $(i+1)$th run, we use the $i$th-run results in place of the source-classifier predictions, i.e., $f_T^{(i)} \rightarrow f_S^{(i+1)}$. This iterative process can enhance the classification accuracy through a positive feedback cycle. To evaluate this iterative scheme, we perform multiple runs of the proposed method with the kernel version, which achieves the best results on the object recognition dataset. Fig. 7 shows the results of the iterative method (the accuracy of multiple runs is reported in bold) and the relative classification accuracy over a single application; for each pair, this value is the ratio of the iterative result to the standard single-run result. The black dashed line in Fig. 7 is the reference line. As can be seen, the ratio is larger than 1 for six pairs and equal to 1 for one pair, which means that the iterative method is generally helpful and sometimes significantly improves the results.
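As a rough illustration of the $f_T^{(i)} \rightarrow f_S^{(i+1)}$ loop, the fragment below reuses the hypothetical domain_weights and propagate helpers from the earlier sketch (themselves only stand-ins); the fixed number of runs is an assumption, since the stopping rule is not specified here.

import numpy as np
from sklearn.svm import LinearSVC


def iterative_prda(Xs, ys, Xt, n_runs=3, lam=1.0, k=10):
    # Run 1 uses the ordinary source classifier; in later runs the previous
    # target decision values play the role of the source predictions.
    y_hat = LinearSVC().fit(Xs, ys).decision_function(Xt)   # f_S^(1)
    w = domain_weights(Xs, Xt)                               # weights depend only on the data
    f_t = y_hat
    for _ in range(n_runs):
        f_t = propagate(Xt, y_hat, w, lam=lam, k=k)          # f_T^(i)
        y_hat = f_t                                          # f_T^(i) -> f_S^(i+1)
    return np.sign(f_t)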



TABLE V
ACCURACY RESULTS FOR OBJECT RECOGNITION OF TCA, PRDA-TCA, KMM, AND PRDA-KMM

In particular, the performance on the pairs A → D and W → A improves by 4%, and on C → W by 6%, compared with the standard prediction reweighting method. The other two pairs are A → C (0.99) and C → A (0.99), whose results decrease by only 1%. For these two pairs, the multirun strategy seems prone to overfitting, and we will explore how to avoid this in the future. 3) Combination With Existing Domain Adaptation Methods: Our general prediction reweighting framework can not only employ popular classification methods to train the source classifier but can also be combined with existing domain adaptation methods. Using an existing domain adaptation method, e.g., TCA or KMM, as the source classifier yields more accurate predictions on the target data, and our approach can then further boost its performance. We apply TCA and KMM as source classifiers on the object recognition dataset and denote the resulting methods as PRDA-TCA and PRDA-KMM, respectively. The classification accuracies of the original TCA, KMM, and both versions of our approach are shown in Table V. Similar results for the other domain adaptation methods are omitted due to space limitations. We note that both versions of PRDA significantly outperform the original TCA and KMM. We improve the classification accuracy on all tasks except W → D for TCA and C → D for KMM. This verifies that our prediction reweighting framework can potentially enhance the performance of existing domain adaptation methods, which broadens the applicability of our framework.
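As a sketch of this plug-in use, the fragment below again reuses the hypothetical domain_weights and propagate helpers from the earlier sketch; da_decision_values is a placeholder for any baseline that returns real-valued predictions on the target data (e.g., an SVM trained on TCA-transformed features, or a KMM-weighted SVM), not an actual library API.

import numpy as np


def prda_on_top(Xs, ys, Xt, da_decision_values, lam=1.0, k=10):
    # Treat an existing domain adaptation baseline as the source classifier:
    # its target predictions are reweighted and propagated exactly as before.
    y_hat = da_decision_values(Xs, ys, Xt)
    f_t = propagate(Xt, y_hat, domain_weights(Xs, Xt), lam=lam, k=k)
    return np.sign(f_t)

# Hypothetical usage: prda_on_top(Xs, ys, Xt, tca_svm_decisions) would play the
# role of PRDA-TCA, and a KMM-based callable would give PRDA-KMM.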

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel prediction reweighting approach that learns a closed-form target classifier for domain adaptation based on two basic ideas: one is to reweight the predictions of the source classifier on the target data, and the other is to use manifold regularization of the target domain to propagate labels from high-confidence instances to low-confidence ones. The proposed method, in both its linear and kernel forms, learns a more adaptable target classifier for predicting the unlabeled target data. Our reweighting scheme uses the signed distance from the target data to the domain separator to compute their weights in an effective and convenient way, and it can be proved that our method brings the source and target domains closer. The experimental results on two artificial datasets and two real-world applications, one in text and one in vision, demonstrate the necessity and effectiveness of our prediction reweighting approach. As extensions, we investigated: 1) the effect of scarce target data, where our method is superior to all the other DA methods on most reduced-target-data pairs; 2) multiple consecutive applications of the proposed approach, which improve the results of the standard method; and 3) the combination with existing domain adaptation methods, whose performance our framework can significantly boost when they are used as the source classifier. Our method therefore has broad application potential for other tasks and domains. In the future, we plan to advance in this direction; for example, we will optimize the sample reweighting process and the model selection method so that the best result can be picked automatically.

REFERENCES

[1] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proc. ACL Conf. Empirical Methods Natural Lang. Process., vol. 10, 2002, pp. 79–86.
[2] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification,” in Proc. ACL, vol. 7, 2007, pp. 440–447.
[3] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[4] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, “Heterogeneous domain adaptation for multiple classes,” in Proc. 17th Int. Conf. Artif. Intell. Statist., 2014, pp. 1095–1103.
[5] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, “Correcting sample selection bias by unlabeled data,” in Proc. Adv. Neural Inf. Process. Syst., vol. 19, 2007, pp. 601–608.
[6] J. Hoffman, T. Darrell, and K. Saenko, “Continuous manifold based adaptation for evolving visual domains,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 867–874.
[7] F. Zhuang et al., “Mining distinction and commonality across multiple domains using generative model for text classification,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 11, pp. 2025–2039, Nov. 2012.
[8] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, “Transfer ordinal label learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 11, pp. 1863–1876, Nov. 2013.
[9] J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2006, pp. 120–128.
[10] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[11] C. Li, M. Georgiopoulos, and G. C. Anagnostopoulos, “A unifying framework for typical multitask multiple kernel learning problems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 7, pp. 1287–1297, Jul. 2014.
[12] C. Du, F. Zhuang, Q. He, and Z. Shi, “Multi-task semi-supervised semantic feature learning for classification,” in Proc. 12th ICDM, Dec. 2012, pp. 191–200.
[13] A. Saha, P. Rai, H. Daumé, III, S. Venkatasubramanian, and S. L. DuVall, “Active supervised domain adaptation,” in Machine Learning and Knowledge Discovery in Databases. Berlin, Germany: Springer, 2011, pp. 97–112.


[14] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[15] R. Xia, X. Hu, J. Lu, J. Yang, and C. Zong, “Instance selection and instance weighting for cross-domain sentiment classification via PU learning,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 2176–2182.
[16] B. Liu, W. S. Lee, P. S. Yu, and X. Li, “Partially supervised classification of text documents,” in Proc. 19th ICML, vol. 2, 2002, pp. 387–394.
[17] B. Gong, K. Grauman, and F. Sha, “Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation,” in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 222–230.
[18] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann, “Domain adaptation on the statistical manifold,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 2481–2488.
[19] F. Zhuang, Q. He, and Z. Shi, “Effectively constructing reliable data for cross-domain text classification,” in Intelligent Information Processing VI. Berlin, Germany: Springer, 2012, pp. 16–27.
[20] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Nov. 2006.
[21] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu, “Adaptation regularization: A general framework for transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 5, pp. 1076–1089, May 2014.
[22] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in Proc. 30th Int. Conf. Mach. Learn. (ICML), 2013, pp. 819–827.
[23] L. Duan, D. Xu, and I. W. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 504–518, Mar. 2012.
[24] A. W. Marshall and I. Olkin, “Multivariate Chebyshev inequalities,” Ann. Math. Statist., vol. 31, no. 4, pp. 1001–1014, 1960.
[25] D. Bertsimas and I. Popescu, “Optimal inequalities in probability theory: A convex optimization approach,” SIAM J. Optim., vol. 15, no. 3, pp. 780–804, 2005.
[26] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, “A robust minimax approach to classification,” J. Mach. Learn. Res., vol. 3, pp. 555–582, Mar. 2003.
[27] K. Balasubramanian, P. Donmez, and G. Lebanon, “Unsupervised supervised learning II: Margin-based classification without labels,” J. Mach. Learn. Res., vol. 12, pp. 3119–3145, Feb. 2011.
[28] R. B. Ash and C. Doléans-Dade, Probability & Measure Theory. San Diego, CA, USA: Academic, 2000.
[29] Y. Rinott, “On normal approximation rates for certain sums of dependent random variables,” J. Comput. Appl. Math., vol. 55, no. 2, pp. 135–143, 1994.
[30] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 999–1006.
[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Computer Vision. Berlin, Germany: Springer, 2010, pp. 213–226.
[32] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1785–1792.
[33] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[34] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.
[35] R. Xia and C. Zong, “A POS-based ensemble model for cross-domain sentiment classification,” in Proc. 5th IJCNLP, 2011, pp. 614–622.
[36] C.-W. Seah, I. W. Tsang, Y.-S. Ong, and Q. Mao, “Learning target predictive function without target labels,” in Proc. IEEE 12th Int. Conf. Data Mining (ICDM), Dec. 2012, pp. 1098–1103.
[37] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye, “A two-stage weighting framework for multi-source domain adaptation,” in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 505–513.
[38] C.-W. Seah, I. W.-H. Tsang, and Y.-S. Ong, “Healing sample selection bias by source classifier selection,” in Proc. IEEE 11th Int. Conf. Data Mining (ICDM), Dec. 2011, pp. 577–586.
[39] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in Proc. AAAI Workshop Learn. Text Categorization, vol. 752, 1998, pp. 41–48.


[40] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in Proc. 17th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 1994, pp. 3–12.
[41] T. Kalt and W. B. Croft, “A new probabilistic model of text classification and retrieval,” Univ. Massachusetts, Center Intell. Inf. Retr., Tech. Rep. IR-78, 1996.
[42] T. Joachims, “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization,” Dept. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-96-118, 1996.
[43] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1410–1417.
[44] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, “Cross-domain sentiment classification via spectral feature alignment,” in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 751–760.
[45] K. Zhang, V. W. Zheng, Q. Wang, J. T. Kwok, Q. Yang, and I. Marsic, “Covariate shift in Hilbert space: A solution via surrogate kernels,” in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 388–395.
[46] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2066–2073.
[47] S. Si, D. Tao, and B. Geng, “Bregman divergence-based regularization for transfer subspace learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 7, pp. 929–942, Jul. 2010.

Shuang Li received the B.S. degree from the Department of Automation, Northeastern University, Shenyang, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Automation, Institute of System Integration, Tsinghua University, Beijing, China. He is a Visiting Research Scholar with the Department of Computer Science, Cornell University, Ithaca, NY, USA. His current research interests include machine learning and pattern recognition, especially in transfer learning and domain adaptation.

Shiji Song received the Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, Harbin, China, in 1996. He is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include system modeling, control and optimization, computational intelligence, and pattern recognition.

Gao Huang received the B.S. degree from the School of Automation Science and Electrical Engineering, Beihang University, Beijing, China, in 2009, and the Ph.D. degree from the Department of Automation, Tsinghua University, Beijing, in 2015. He was a Visiting Research Scholar with the Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA, in 2013. He is currently a Post-Doctoral Researcher with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include machine learning and statistical pattern recognition.