It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collaborative Filtering

Shaghayegh Sahebi
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260
[email protected]

Peter Brusilovsky
School of Information Sciences, University of Pittsburgh, Pittsburgh, PA, 15260
[email protected]

ABSTRACT

As the heterogeneity of data sources on the web increases, and because the data in each of these sources is sparse, cross-domain recommendation has become an emerging research topic in recent years. Cross-domain collaborative filtering aims to transfer the user rating pattern from source (auxiliary) domains to a target domain for the purpose of alleviating the sparsity problem and providing better target recommendations. However, the studies so far have either focused on a limited number of domains that are assumed to be related to each other (such as books and movies), or on a division of the same dataset (such as movies) into different domains based on an item characteristic (such as genre). In this paper, we study a broad set of domains and their characteristics to understand the factors that affect the success or failure of cross-domain collaborative filtering, the amount of improvement in cross-domain approaches, and the selection of the best source domains for a specific target domain. We propose to use Canonical Correlation Analysis (CCA) as a significant factor in finding the most promising source domains for a target domain, and suggest a cross-domain collaborative filtering approach based on CCA (CD-CCA) that proves to be successful in using the shared information between domains in the target recommendations.

1. INTRODUCTION

In the last five years, cross-domain recommendation has emerged as a hot topic in the field of recommender systems. The idea of cross-domain recommendation is to use rating information accumulated in one domain (known as a source or auxiliary domain) to improve the quality of recommendations in another domain (known as a target domain). The proponents of cross-domain recommendation argue that this technology might be especially helpful when the user has few or no ratings in the target domain or if the quality of recommendation in the target domain is low because of a lack of information. Modern users may have a solid user profile in a system that they have previously used, but may be required to start using a new domain or system, which would call for more research on cross-domain recommendations.

One of the important problems in cross-domain recommendation is the selection of source domains appropriate for a target domain. Previous work either assumes that the best domain pairs can be decided by the similarity of their nature (such as books and movies) [23, 17], or splits the same dataset into multiple domains [1]. While the majority of early works have typically focused on one or two pairs of related domains (such as books and movies) and reported quite positive results, confirming the hopes of cross-domain enthusiasts [2, 17, 23], we argue that the success of cross-domain recommendations depends on domain characteristics, and specifically on shared (latent) information among domains. A recent work by Shapira et al. [19] has also delivered mixed results for cross-domain recommendations. Just as it takes two to tango, it takes two matching domains that are ready to work together (and not just a specific recommendation approach) to succeed in the context of cross-domain recommendation. This poses new questions: What makes a good auxiliary domain? How should we choose the best auxiliary domain for a specific target domain?

In this paper, we present our attempts to examine the success and failure of cross-domain collaborative filtering across a large number of domain pairs. Our goals are to broadly explore the added value of cross-domain recommendations in comparison with traditional within-domain recommendations by using two different cross-domain collaborative filtering approaches, and to achieve some progress in uncovering the main mystery of cross-domain recommendation: how can we determine whether a pair of domains is a good candidate for applying cross-domain recommendation techniques? In order to address our latter goal, we pilot a canonical correlation approach as a possible predictor of successful domain pairs and examine a range of features of a single domain and domain pairs in order to see how they could be used to improve predictions. In summary, our contributions in this paper are as follows:

• To the best of our knowledge, we present the first study that analyzes a large number of domain pairs (158) in cross-domain recommender systems.
• We propose a cross-domain recommendation approach based on Canonical Correlation Analysis (CD-CCA).
• We conduct a detailed study on both dataset characteristics and collaborative filtering approaches to find out
  – if the cross-domain recommendation results only improve because of added information, or if the recommendation algorithm also matters;
  – the data characteristics that affect the prediction error of approaches;
  – the domain-pair characteristics that affect the amount of recommendation improvements;
  – and the nature of suitable domain pairs.

The rest of the paper is structured as follows. In Section 2, we overview related work in the area of cross-domain recommendations. We propose a CCA-based cross-domain collaborative filtering approach in Section 3. Then, we conduct a detailed analysis of important factors in cross-domain collaborative filtering based on an extensive multi-domain dataset in Section 4. Finally, we summarize our findings and discuss future work in Section 5.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. RecSys’15, September 16–20, 2015, Vienna, Austria. © 2015 ACM. ISBN 978-1-4503-3692-5/15/09 ...$15.00.

DOI: http://dx.doi.org/10.1145/2792838.2800188.

2. RELATED WORK

Although cross-domain recommendation has recently emerged, it has gained increasing attention and is a promising way to develop new methods to improve recommendations, especially in a cold-start setting [17]. Cross-domain recommender systems aim to take advantage of information among related source (auxiliary) domains to recommend items in a target domain [6]. According to [13], domains can be categorized as system, data, and temporal domains. These categories represent, respectively, different datasets that a recommender system is built upon, various representations of user preferences (explicit or implicit), and various time points at which the data is gathered.

Recent work on cross-domain recommendations has focused on both collaborative filtering [7, 9, 10, 15, 17] and content-based approaches [2, 4, 11, 18, 21]. In this paper, we focus on cross-domain collaborative filtering approaches. Many cross-domain algorithms assume that the source and target domains share some of the users or items. However, some cross-domain approaches assume that no shared users or items are available in the domains [3, 14]. In this paper, we assume the existence of shared users among domains.

In this paper, we propose a cross-domain collaborative filtering method based on CCA. CCA has been used in different sources of the same domain recommenders or to find the correlation between the content (such as text or image) of the resources in cross-domain recommender systems. To the best of our knowledge, it has not yet been used in a pure, rating-based, cross-domain collaborative filtering setting. For example, in the area of recommender systems, Faridani has used CCA to predict hotel ratings from textual comments of the hotels and their sentiment analysis [5]. In [16], Ohkushi has used Kernel CCA to find the relationship between music pieces and human motion to recommend music to users. Yang et al. [24] have proposed a cross-domain feature learning algorithm that uses CCA for inferring features of semantic information in the data. However, Yang et al. have not yet used their model in recommender systems.

3. RECOMMENDATION BASED ON CCA

3.1 Regularized CCA

Canonical correlation analysis (CCA) is a multivariate statistical model that studies the interrelationships among sets of multiple dependent variables and multiple independent variables. It is the most generalized member of the family of multivariate statistical techniques [8]. It is related to factor analysis in the sense that it creates composites of variables, and is related to discriminant analysis in finding independent dimensions for each variable set. The goal of this analysis is to produce the maximum correlation between the dimensions. As a result, canonical correlation finds the optimum structure or dimensionality of each variable set that maximizes the relationship between independent and dependent variable sets. In other words, if we have $X \in R^{m \times n}$ and $Y \in R^{p \times n}$, CCA finds two projection vectors $w_x \in R^m$ and $w_y \in R^p$ that maximize the correlation coefficient:

\[ \rho = \frac{w_x^T X Y^T w_y}{\sqrt{(w_x^T X X^T w_x)(w_y^T Y Y^T w_y)}} \tag{1} \]

Since Equation 1 is not affected by re-scaling of $w_x$ and $w_y$ (namely, the multiplication of these vectors by a constant $\alpha$ does not change the value of $\rho$), we can maximize $\rho$ as follows.

\[ \max_{w_x, w_y} \; w_x^T X Y^T w_y \quad \text{subject to} \quad w_x^T X X^T w_x = 1, \; w_y^T Y Y^T w_y = 1 \tag{2} \]

It can be shown that solving Equation 2 is equivalent to finding the eigenvectors of the top eigenvalues of the generalized eigenvalue problem in Equation 3, in which $\eta$ is the eigenvalue that corresponds to the eigenvector $w_x$.

\[ X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta X X^T w_x \tag{3} \]

To compute multiple projection vectors, we can solve the optimization problem in Equation 4, in which the matrix $W$ consists of multiple projection vectors.

\[ \max_{W} \; \mathrm{Trace}(W^T X Y^T (Y Y^T)^{-1} Y X^T W) \quad \text{subject to} \quad W^T X X^T W = I \tag{4} \]

To avoid the over-fitting of $\rho$ and the singularity of $X X^T$, a term $\lambda I$ is added to Equation 3. We have the constraint $\lambda > 0$ in this regularization term. The regularized CCA then attempts to solve the generalized eigenvalue problem in Equation 5.

\[ X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta (X X^T + \lambda I) w_x \tag{5} \]

Sun et al. solve the regularized CCA problem, using a least squares formulation of it, with the Least Angle Regression algorithm [20].

3.2 CCA-based Cross-Domain Recommender

As explained in Section 3.1, CCA evaluates the latent linear correlations between two sets of variables. To draw an analogy between CCA and a cross-domain recommender, we suppose that there are $n$ common users between the source and target domains. We consider the source (auxiliary) domain in the cross-domain recommender as the independent variable set $X$ (with $n$ users and $m$ items), and the target domain as the dependent variable set $Y$ (with $n$ users and $p$ items). Note that here we are working on $m \times n$ and $p \times n$ item-user matrices, as opposed to the usual user-item matrices in collaborative filtering. The value $\rho$ in Equation 1 shows the maximum canonical correlation that can be achieved by rotating the $X$ and $Y$ spaces in the direction of

Table 1: Summary of domain Min Max User Size 9 11013 Item Size 8 4435 Rating Density 0.0017 0.1581

$w_x$ and $w_y$, respectively. In other words, CCA calculates the components of each domain, which consist of sets of items from each of the domains, that are most similar to each other based on user rating behavior. It also determines how much the two components are correlated with one another. As a result, if we know the ratings in the source domain $X$ and the ratings in the target domain $Y$, we can find the $w_x$ and $w_y$ that maximize the canonical correlation between $X$ and $Y$. In other words, with the projection vectors $w_x$ and $w_y$, we know how the ratings of a combination of items in the source domain affect the ratings of an item in the target domain. Consequently, after adding the user ratings of the source domain $X$, we can understand how all of the ratings of a user in the source domain affect the same user’s ratings in the target domain. Finally, we can estimate the ratings of users in the target domain, $\hat{Y}$, by using the projection vectors, the source domain ratings, and the canonical correlation value [22]. The calculation of the estimated rating matrix ($\hat{Y}$) is shown in Equation 6.

\[ \hat{Y} = w_y \rho w_x^T X \tag{6} \]

Thus far, this approach only focuses on the first canonical component (pair of projection vectors), which maximizes the correlation ($\rho$, or R-statistic). There are other components between the domains, each with its own projection vectors and correlation (R-statistic). In this case of multiple projections, the estimated rating matrix $\hat{Y}$ is calculated as in Equation 7. Here, if we assume that $c$ pairs of projection vectors are calculated, $P$ is a diagonal $c \times c$ matrix, in which the diagonal elements are the $\rho$ values of the canonical components; $W_x$ is an $m \times c$ matrix consisting of $c$ projection vectors of size $m \times 1$; and $W_y$ is a $p \times c$ matrix of $c$ projection vectors of size $p \times 1$.

\[ \hat{Y} = W_y P W_x^T X \tag{7} \]
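The regularized CCA of Equation 5 and the multi-component estimate of Equation 7 can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the default value of λ, the Cholesky-based reduction of the generalized eigenvalue problem, and the small ridge added to $Y Y^T$ (so that it can be inverted) do not come from the text, and no mean-centering or handling of missing ratings is attempted.

```python
import numpy as np

def regularized_cca(X, Y, n_components, lam=0.1):
    """Sketch of Equation 5 for item-user matrices X (m x n) and Y (p x n).

    Hypothetical helper: the name, the default lam, and the ridge on the
    Y side are assumptions made for this illustration.
    """
    m, p = X.shape[0], Y.shape[0]
    Cxx = X @ X.T + lam * np.eye(m)        # X X^T + lambda I, as in Equation 5
    Cyy = Y @ Y.T + lam * np.eye(p)        # assumption: same ridge so Y Y^T is invertible
    Cxy = X @ Y.T
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)  # X Y^T (Y Y^T)^-1 Y X^T

    # Reduce A w = eta * Cxx w to a symmetric eigenproblem with Cxx = L L^T.
    L = np.linalg.cholesky(Cxx)
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T
    eta, V = np.linalg.eigh((M + M.T) / 2.0)
    top = np.argsort(eta)[::-1][:n_components]

    Wx = L_inv.T @ V[:, top]               # satisfies Wx^T Cxx Wx = I
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)  # matching target directions
    Wy = Wy / np.sqrt(np.sum(Wy * (Cyy @ Wy), axis=0))  # w_y^T Cyy w_y = 1
    rho = np.sum(Wx * (Cxy @ Wy), axis=0)  # one canonical correlation per component
    return Wx, Wy, rho

def predict_target(X, Wx, Wy, rho):
    """Equation 7: Y_hat = W_y P W_x^T X, with P = diag(rho)."""
    return Wy @ np.diag(rho) @ Wx.T @ X
```

Calling `regularized_cca(X, Y, n_components=c)` and passing the result to `predict_target` returns a $p \times n$ matrix of estimated target ratings; choosing `n_components` no larger than `min(m, p)` keeps the selected eigenvalues away from zero.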

Finally, if the target rating matrix ($\tilde{Y}$) is incomplete and has some missing values, we can estimate $W_x$ and $W_y$ (as $\hat{W}_x$ and $\hat{W}_y$) by calculating the canonical correlations between the source rating matrix $X$ and the incomplete target matrix $\tilde{Y}$. Then, we can use the estimated projection vectors $\hat{W}_x$ and $\hat{W}_y$ to estimate a complete rating matrix $\hat{Y}$. More specifically, if we want to predict the unknown rating of user $i$ on item $j$ in the incomplete target domain ($\hat{y}_{j,i}$), we follow Equation 8 after finding $\hat{W}_x$ and $\hat{W}_y$ on matrices $X$ and $\tilde{Y}$. Here, $X_{k,i}$ is the rating of user $i$ on item $k$; $\hat{W}_{y_{j,l}}$ refers to the target projection element for item $j$ and component $l$; and $\hat{W}_{x_{k,l}}$ is the source projection element for item $k$ and component $l$.

\[ \hat{y}_{j,i} = \sum_{l=1}^{c} \hat{W}_{y_{j,l}} P_{l,l} \sum_{k=1}^{m} \hat{W}_{x_{k,l}} X_{k,i} \tag{8} \]

As an abbreviation, we use the name CD-CCA for this CCA-based cross-domain recommender.

4. ANALYSIS

In this section, we briefly describe our dataset, explain the approach of our analysis, and present our detailed analysis for each of our research questions.

4.1 Dataset

We used the Yelp Academic dataset¹ to run our experiments. The dataset contains user reviews on items from various categories and subcategories. To obtain a better distinction between domains, we choose the 21 parent categories in the dataset. We also use the star ratings of reviews within each category. Each rating can be between 1 and 5. For each pair of categories, we find the common users (the users who have rated reviews in both of the selected domains). To obtain more reliable results, we exclude the category pairs in which the number of common users is smaller than the number of items in either of the two categories. For each pair of categories, we run the experiments twice: once with the first category as the source and the second category as the target domain, and once the other way around. We end up with 158 category (domain) pairs. A summary of these data statistics is shown in Table 1.

Table 1: Summary of domain pair statistics

                 Min      Max      Mean      Median
User Size        9        11013    1064.09   424
Item Size        8        4435     406.89    252.5
Rating Density   0.0017   0.1581   0.017     0.0084

We use a 5-fold user-stratified cross-validation on the target domain, in which 80% of the users are used in the training set and the rest are used as the test user set for each fold. We add 20% of each test user’s target reviews to the training set, to obtain a partial profile for each user, and the rest of the reviews to the test set. We use 15% of the data as a validation set for finding the best parameters. We use all of the source domain ratings of common users in the source training dataset for cross-domain recommendations.

¹ http://www.yelp.com/academic_dataset

4.2 Overview of Analysis

In our analysis, we aim to gain a more detailed understanding of cross-domain collaborative filtering and the factors that affect its performance. We use three algorithms: a cross-domain collaborative approach, a single-domain approach applied to the cross-domain data, and a single-domain recommendation only on the target dataset. Our dataset consists of various pairs with different data characteristics, such as number of users, items, data density, and others. The goal of our analysis is to understand the importance and scale of each of these factors on cross-domain recommendations, including the approach and dataset characteristics.

Our first questions are: how does each of the algorithms perform on the data, and how do they compare to each other? Are the RMSEs of these approaches correlated? Is there any possibility that cross-domain recommendations harm the results that can be achieved by single-domain recommendation? Does the better performance of cross-domain algorithms occur only because of merging the source and target datasets, or does it also depend on the ability of the cross-domain recommender to understand the common information among these datasets’ ratings?

After analyzing the algorithm aspect of cross-domain collaborative filtering, we analyze its data aspect. In other words, we want to know the reason behind the different results that we get for each of the approaches. The second question that we have in mind is about the data characteristics that affect the performance (RMSE) of each approach. Is there a data characteristic that significantly affects the results?

After finding the data characteristics that correlate with the results from each approach, we analyze the characteristics that affect the improvement of predictions in cross-domain collaborative filtering. Considering the domain pairs,

in which we have significantly better cross-domain recommendations, what can better explain this improvement? We examine the domain-pair characteristics in addition to the characteristics of each domain, and we study the correlation of the CCA’s results with the improvement of cross-domain recommenders over the single-domain recommender.

In the final part of our analysis, we look at specific samples of domain pairs to get a deeper understanding of the results of the previous analyses. More specifically, we look at the domain pairs with a high CCA to see if they can obtain a higher improvement in cross-domain recommenders versus the single-domain recommender. We take a closer look at the domain pairs with a high CCA and low improvement to understand the reason behind this behavior. As a reverse look at these results, we look at the domain pairs with a high improvement ratio and their characteristics. More specifically, we look at the domain pairs with a high improvement in cross-domain recommenders versus the single-domain recommender, and a low CCA, to understand the other factors that affect this result.

As baselines for CD-CCA, we run the SVD++ algorithm [12]² in two modes: single-domain and cross-domain. In the single-domain SVD++ (SD-SVD), we only use the target domain ratings to predict the test user ratings in the target domain. For the cross-domain SVD++ (CD-SVD), we add the source domain item space to the target domain item space, as if the two categories are coming from the same (single) domain. This is done by concatenating the source and target domain rating matrices. We run the SVD++ algorithm on the joined domain matrices.

4.3 What is the Approach Effect on Recommendation Results?

We run CD-CCA, CD-SVD, and SD-SVD on the 158 domain pairs in the data. We evaluate the algorithms based on Root Mean Squared Error (RMSE). Figure 1 shows the RMSE of all three algorithms on the 158 domain pairs, including the 95% confidence interval. To better comprehend the difference between algorithms, we order the domain pairs based on the RMSE of the CD-CCA algorithm on them. Due to visualization limitations, we cannot show the names of all domain pairs in the picture. However, it can be seen that in most of the domain pairs, CD-CCA has performed better than both CD-SVD and SD-SVD. Also, in most of the domain pairs, CD-SVD has performed better than SD-SVD.

CD-CCA performs significantly³ differently from SD-SVD in 77 domain pairs, and differently from CD-SVD in 74 domain pairs. CD-SVD performs significantly differently from SD-SVD in 9 domain pairs. In some cases, the RMSE of the cross-domain recommenders is higher than the RMSE of the single-domain recommender. More specifically, the average RMSE of CD-CCA is higher than that of SD-SVD in 8 domain pairs; the average RMSE of CD-CCA is higher than that of CD-SVD in 14 pairs; and the average RMSE of CD-SVD is higher than that of SD-SVD in 51 domain pairs. However, these differences are not statistically significant at the 0.05 level in any of these domain pairs. As a result, the cross-domain recommenders work at least as well as the single-domain recommender in our dataset.

We can conclude that CD-CCA works significantly better than SD-SVD in 77 domain pairs; CD-CCA works significantly better than CD-SVD in 74 domain pairs; and CD-SVD works significantly better than SD-SVD in 9 domain pairs. In the rest of the domain pairs, CD-CCA and CD-SVD work similarly to SD-SVD. As a result, we can see that cross-domain collaborative filtering approaches either improve the recommendation results, or do not significantly change them in our dataset. However, the CD-CCA method works better than the CD-SVD method in many of these domain pairs. This means that there is some common information between the mentioned domain pairs that is only being captured by CD-CCA, although CD-SVD has the same information in its input data. Consequently, in addition to the improvement that we can obtain by adding the source data to the target domain, the CD-CCA approach shows an additional improvement. In other words, the performance of CD-CCA is not only due to the added data, but also to its ability to comprehend this additional data and use it to achieve better target recommendations.

We calculate the correlation between the error rates of all algorithms in all of the domain pairs (Table 2). Based on these results, the RMSEs of the algorithms are highly and significantly correlated (p-value < 0.01). We can conclude that if the RMSE of the single-domain recommender is low in the target domain, it is also most likely low for the cross-domain recommenders, and vice versa.

Figure 1: RMSE of algorithms on 158 domain pairs ordered by the RMSE of the CCA-based cross-domain algorithm

Table 2: Correlation of RMSE of algorithms with each other. **: significant with p-value < 0.01

Correlation (R-Values)   CD-CCA     CD-SVD     SD-SVD
CD-CCA                   –          0.7896**   0.7779**
CD-SVD                   0.7896**   –          0.9550**
SD-SVD                   0.7779**   0.9550**   –
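The evaluation measure and the pairwise correlations reported in Table 2 can be illustrated with a short NumPy sketch. The helper names and the dictionary layout (one list of per-domain-pair RMSE values per algorithm) are assumptions made for this example, and any numbers passed in are placeholders rather than the paper's results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between known and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_correlations(rmse_by_algorithm):
    """Pearson correlation matrix between the RMSE series of the algorithms.

    rmse_by_algorithm: dict mapping an algorithm name to a list of RMSE
    values, one per domain pair, in the same domain-pair order for every
    algorithm (hypothetical layout chosen for this sketch).
    """
    names = list(rmse_by_algorithm)
    series = np.array([rmse_by_algorithm[name] for name in names], dtype=float)
    return names, np.corrcoef(series)  # rows and columns follow `names`
```

This sketch returns only the correlation coefficients. The significance levels shown in Table 2 would need a separate test, for example `scipy.stats.pearsonr`, which returns a p-value alongside the coefficient.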

² Using GraphChi Software (http://graphchi.org)
³ 0.05 p-value

4.4 What Data Characteristics Affect Prediction Error?

We investigate the domain characteristics to evaluate their effect on the experiment results. These characteristics include the user space and item space sizes in each domain, as well as the domain densities. We define domain density as the ratio of known ratings to all possible ratings (number of items multiplied by number of users).

Figure 2 shows the RMSE results of each pair of domains, sorted by the number of items in the target domain. The lines represent the least squares lines of each set of data. Based on the picture, it appears that as the target domain’s item space size (represented in logarithm to the base 10) decreases, the error rate increases slightly in all three algorithms. However, the correlation between the target domain’s item space size and the RMSE results is not statistically significant. In Figure 3, we see the errors of each algorithm, sorted by the item space size of the source domain. There is no significant correlation between this factor and the errors in any of the algorithms.

Considering the number of common users between the two domains, we sort the results based on this factor in Figure 4. Here, as the number of users increases, we see a decrease in the RMSE of both CD-CCA and CD-SVD. This correlation between the number of common users and the RMSE of both algorithms on the 158 domain pairs is significant, as shown in Table 3. Based on this finding, we can conclude that the more common users we have in the source and target domains, the better the cross-domain recommendations will be.

Looking at the error rates of the algorithms ordered by target domain density (Figure 5), we see a slight increase of RMSE when the target domain density decreases. However, this correlation is not significant in any of the methods. The effect of source domain density on prediction error (Figure 6) is also insignificant.

Overall, we can see that the number of common users is the most significant factor among the dataset characteristics in prediction error measured by RMSE. The more common users we have in the source and target domains, the better our prediction.

Figure 2: Results ordered by item space size of target domain

Figure 3: Results ordered by item space size of source domain

Figure 4: Results ordered by number of common users in pairs of domains

Figure 5: Results ordered by target domain’s density

Figure 6: Results ordered by source domain’s density
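The domain characteristics examined in this section can be computed directly from the rating data, as in the sketch below. It assumes, purely for illustration, that a domain is stored as an item-user array with `np.nan` marking unknown ratings and that users are identified by hashable IDs; the function names are hypothetical. The last helper fits a least squares line of RMSE against the base-10 logarithm of item space size, in the spirit of the trend lines described for Figures 2 and 3.

```python
import numpy as np

def domain_density(ratings):
    """Known ratings divided by (number of items x number of users).

    Assumption: `ratings` is an item-user array with np.nan for unknowns.
    """
    ratings = np.asarray(ratings, dtype=float)
    return np.count_nonzero(~np.isnan(ratings)) / ratings.size

def common_users(source_users, target_users):
    """Users who appear in both the source and the target domain."""
    return set(source_users) & set(target_users)

def least_squares_line(item_sizes, rmse_values):
    """Fit rmse = slope * log10(item_size) + intercept by least squares."""
    slope, intercept = np.polyfit(np.log10(item_sizes), rmse_values, deg=1)
    return float(slope), float(intercept)
```

A negative slope from `least_squares_line` corresponds to the pattern described above, where error rises as the item space shrinks. Whether such a trend is statistically significant has to be tested separately.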

Table 3: Correlation of RMSE of algorithms with data characteristics. **: significant with p-value < 0.01, *: significant with p-value < 0.05

Correlation (R-Values)   User Size             Target Item Size     Source Item Size      Target Density        Source Density
CD-CCA                   -0.1782* (p=0.026)    -0.1250 (p=0.121)    -0.1239 (p=0.1245)    -0.0502 (p=0.5352)    0.0515 (p=0.5246)
CD-SVD                   -0.1745* (p=0.028)    -0.1445 (p=0.070)    -0.1274 (p=0.111)     -0.1346 (p=0.092)     -0.1161 (p=0.146)
SD-SVD                   -0.1455 (p=0.068)     -0.1225 (p=0.125)    –                     -0.1525 (p=0.056)     –

Figure 7: RMSE of algorithms on 79 significant domain pairs, ordered by the RMSE of the CCA-based cross-domain algorithm

4.5 What Data Characteristics Affect Cross-Domain Recommendation Improvement?

In this section, we evaluate the improvement ratio of each pair of algorithms in each pair of domains and its correlation with each of the domain characteristics to understand what makes a good domain pair for a cross-domain recommender. Figure 7 shows the RMSE of all 79 domain pairs in which either CD-CCA works significantly better than CD-SVD, or any of the cross-domain methods works significantly better than the single-domain method. As we can see in this picture, the RMSE improvement is different for different domain pairs. To understand this difference, we use the Improvement Ratio (IR) measure, which is defined as follows. The improvement ratio of algorithm $a_1$ over algorithm $a_2$ with the source domain $s_i$ and target domain $d_j$ ($IR_{a_1,a_2}(s_i, d_j)$) is equivalent to the improvement of RMSE of algorithm $a_1$ over algorithm $a_2$, normalized by the RMSE of algorithm $a_2$ in the source domain $s_i$ and target domain $d_j$ (Equation 9).

\[ IR_{a_1,a_2}(s_i, d_j) = \frac{RMSE_{a_1}(s_i, d_j) - RMSE_{a_2}(s_i, d_j)}{RMSE_{a_2}(s_i, d_j)} \tag{9} \]

In addition to the domain characteristics that we evaluated in the previous section, we evaluate the correlation of the improvement ratio with “user size to target item size ratio”, “user size to source item size ratio”, “source item size to target item size ratio”, “source density to target density ratio”, and “percentage of CCA correlation coefficients greater than 0.8, 0.9, and 0.95”⁴. The results are reported in Table 4. As we can see in these results, “user size to target item size ratio” is the only factor that has no significant correlation with the improvement ratio in any of the three algorithm pairs.

The two most important and significant factors in all the algorithm pairs are “source density” and “percentage of CCA correlation coefficients greater than 0.95”. The source-domain density has a negative correlation with the improvement ratio: the denser the auxiliary data, the smaller the advantage of the cross-domain algorithms over the single-domain algorithm. This is an interesting finding because, based on our previous findings, the source domain density does not have any significant effect on the RMSE of each algorithm. As a result, it only affects the RMSE improvement of cross-domain algorithms if they are significantly better than the single-domain algorithm.

On the other hand, the more between-domain components of CCA have a canonical correlation of more than 0.95, the higher the IR. We can see that this factor is significant in all of the algorithm pairs. However, the two other CCA-related factors (“percentage of CCA correlation coefficients greater than 0.9” and “percentage of CCA correlation coefficients greater than 0.8”) are only significant for the improvement of CD-CCA versus the CD-SVD and SD-SVD algorithms, and do not significantly affect the improvement of CD-SVD over SD-SVD. While this result might be due to the lower number of domain pairs (9) in which we have a significant improvement of RMSE in CD-SVD over SD-SVD, it might also be because the CD-CCA algorithm captures more relationships between the two domains that can be explained by CCA components and cannot be captured by the other two algorithms.

The next important factor presented in Table 4 is “target density”. As with “source density”, this factor has a significant negative correlation with the improvement ratios for CD-CCA over both CD-SVD and SD-SVD. The correlation is not significant for CD-SVD over SD-SVD.

Among other factors, “target item size”, “user size”, and “source item size” have a positive correlation with IR for CD-CCA over both CD-SVD and SD-SVD. However, they do not have a significant correlation with IR for CD-SVD over SD-SVD. The reason might be a better performance of CD-CCA over CD-SVD (and SD-SVD) in the cases where we have more items in the target or source domain. It is interesting to see that even though “user size” is a significant factor in decreasing RMSE in both cross-domain algorithms in all domain pairs, it does not contribute to the improvement of RMSE of CD-SVD over SD-SVD, for the domain pairs in which cross-domain algorithms perform significantly better than the single-domain one.

Correlations for the “user to source item ratio” and “source to target density ratio” factors are only significant for the IR of CD-CCA over CD-SVD. This means that the more users we have, as compared to source items, and the denser the source domain is, as compared to the target domain, the more improvement we can achieve by CD-CCA compared to

⁴ The factor “percentage of CCA correlation coefficients greater than λ” is equivalent to the number of CCA canonical components that are correlated with each other by R-statistic > λ, divided by the total number of components found by CCA.
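Two of the quantities used in this section, the improvement ratio of Equation 9 and the “percentage of CCA correlation coefficients greater than λ” factor, are simple enough to sketch in a few lines. The function names and the inputs are hypothetical and serve only as an illustration. Note that Equation 9, taken exactly as printed, yields a negative value when algorithm $a_1$ has the lower RMSE.

```python
import numpy as np

def improvement_ratio(rmse_a1, rmse_a2):
    """Equation 9 as printed: (RMSE_a1 - RMSE_a2) / RMSE_a2."""
    return (rmse_a1 - rmse_a2) / rmse_a2

def fraction_of_correlations_above(rho, threshold):
    """Share of canonical components whose correlation exceeds `threshold`.

    rho: canonical correlations of all components found by CCA for one
    domain pair (hypothetical input for this sketch).
    """
    rho = np.asarray(rho, dtype=float)
    return float(np.mean(rho > threshold))
```

For example, `fraction_of_correlations_above(rho, 0.95)` gives the value of the factor “percentage of CCA correlation coefficients greater than 0.95” as a fraction between 0 and 1.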


Table 4: Correlation of significant RMSE improvement (IR) with data characteristics. ***: significant with p − value < 0.001, **: significant with p − value < 0.01, *: significant with p − value < 0.05 Correlation (R-Values)

User Size

Source Item Size

Target Item Size

Source Density

Target Density

User to Target Item Ratio

User to Source Item Ratio

Percentage of CCA Correlation Co-efficients greater than 0.8

Percentage of CCA Correlation Co-efficients greater than 0.9

Percentage of CCA Correlation Co-efficients greater than 0.95

Source to Target Density Ratio

Source to Target Item Size Ratio

CD-CCA vs. CD-SVD

0.3924*** (p=0.0005)

0.3292** (p=0.0031)

0.4332*** (p