Social Semantic Collaborative Filtering for Digital Libraries


Sebastian Ryszard Kruk, Stefan Decker, Adam Gzella, Slawomir Grzonkowski

Contents

Abstract
1. Introduction
  1.1. Contributions
  1.2. Outline of the Paper
2. Social Semantic Collaborative Filtering
  2.1. How does the Social Collaboration Work
    2.1.1. Distributed Collections
    2.1.2. Resources Annotations
  2.2. Social Semantic Collaborative Filtering Scenarios
    2.2.1. Simple Social Collaborative Filtering
    2.2.2. Secured Social Collaborative Filtering
  2.3. The Benefits
    2.3.1. Backward Referral Chaining
    2.3.2. Connection to the Established Classification Schemata
    2.3.3. Extrapolated User Profile
  2.4. Security and Privacy Issues
3. Evaluation of Social Semantic Collaborative Filtering
  3.1. Simulation Model
  3.2. Underlying Assumptions
  3.3. Definition of the Experiment
  3.4. Results of Simulation
  3.5. Conclusions on Results of Simulation
4. FOAFRealm - the Reference Implementation of Social Semantic Collaborative Filtering
  4.1. FOAFRealm Component Architecture
  4.2. Architecture of the Distributed FOAFRealm (D-FOAF)
    4.2.1. User Authentication in D-FOAF
    4.2.2. Distributed User Profile Management
    4.2.3. Distance and Quantisation Level Computing in D-FOAF
    4.2.4. Security Issues in Distributed FOAFRealm
  4.3. FOAFRealm Use Case Studies
5. Related work
  5.1. Collaborative filtering
  5.2. On-line Social Communities
  5.3. User Profile Management
6. Future Work and Conclusions
Acknowledgements
Author details

Keywords Collaborative Filtering, Semantic Web, Social Networks, User Profile Management

Abstract

The most popular collaborative filtering implementations require either a critical mass of referenced resources or a large number of active users. Other solutions are based on finding a referral with expertise in the given domain of discourse. In this article we present the social semantic collaborative filtering approach to information retrieval. We describe how the concept of user-managed collections can be exploited to provide a collaborative filtering system based on a social network maintained by the users themselves. We present FOAFRealm, a user profile management system based on social networking and the FOAF metadata. Additionally, FOAFRealm enables distributed collaboration between parties in the social semantic collaborative filtering way.

1. Introduction

The contemporary Internet contains a lot of information. In the unorganised structure of the Web, the information we are looking for always seems to be just around the corner, yet still beyond our reach. When we fail to find that information, it turns out to be useless. Search engines and on-line catalogues tend to return a lot of resources in answer to our queries, and very often some of the results are unrelated to the given query. No wonder we end up asking our friends and acquaintances for interesting references on the exact topic. Collaborative filtering is the idea of automating the process of asking around when looking for information on the Internet (Goldberg et al., 1992). Since the early implementations of collaborative filtering, like the one introduced in (Resnik et al., 1994), a number of methods have been developed for collaborative filtering and social filtering (Resnik et al., 1994)(Breese et al., 1995)(Shardanand and Maes, 1995).

1.1. Contributions

The paper makes the following contributions to the field of collaborative filtering and user profile management systems:

• We introduce a new approach to collaborative filtering (CF) - the social semantic collaborative filtering - that covers both active and passive kinds of CF and additionally solves some privacy/security issues.

• The reference implementation library (FOAFRealm) can be embedded into web applications, additionally providing a unified, distributed user management system based on FOAF.

• Our solution introduces goals like: distributed user profile management, privacy of the profile information, security of the provided knowledge, utilisation of social networks.

1.2. Outline of the Paper

The next section describes the architecture of the social semantic collaborative filtering in the context of other similar solutions. In section 3 we present the evaluation of the underlying model of social interactions in the social semantic collaborative filtering. In section 4 we describe the FOAFRealm system, which implements a distributed user profile management system and delivers social semantic collaborative filtering features.

2. Social Semantic Collaborative Filtering

The social semantic collaborative filtering (SSCF) presented in this article is based on two concepts: distributed collections and annotations of resources. Each user classifies only a small subset of the knowledge, based on the level of expertise he/she has on the specific topic. This knowledge is later shared across the social network.

2.1. How does the Social Collaboration Work

The trade-off between accuracy and scalability is often found in search engine applications. The information gathered in on-line collections is very precise, as the human factor is involved in the indexing process. But since the Internet is growing so fast, the process of creating the catalogue does not scale. Search engines, on the other hand, do the indexing work without involving human activity, and the results of queries are not always satisfactory.

A social network is a set of people or groups of people with some pattern of interactions or "ties" between them (Trevor et al., 2002)(Fukui, 2003)(Newman et al., 2002)(Hoadley et al., 2002). A social network is modeled by a digraph where nodes represent individuals, and a directed edge between nodes indicates a direct relationship between two individuals. Each person is interested in a couple of topics. Each person also has a general idea about the interests of his/her friends, and considers some of them to have more knowledge on some topics. It is possible to construct a subgraph, on top of a social network, that represents the flow of expertise in a certain domain. The idea of the social semantic collaborative filtering is based on this observation.

Each person in the social network gathers the interesting information in collections he/she has created. Collections maintained by some users can be easily linked into collections created by other users. By linking other people's categories, users receive bookmarks that are managed and filled in by other people, possibly with a higher level of expertise on particular subjects than their own. As we show later, the information is easily disseminated through the network of linked collections.

2.1.1. Distributed Collections

To overcome the problem of managing a vast and fast growing amount of information, the divide-and-conquer approach can be adopted. The information is gathered in collections by a number of people. Each of them handles specific domains of discourse within the collections information space he/she has created. To provide scalability of this solution, the catalogue system should operate in a distributed environment. The quality of the information gathered across the collections can be assured by confirming the expertise in the given domain of discourse. Each user maintains his/her own collections (private bookshelf (Kruk et al., 2005)) and renders them accessible to his/her friends (Maltz et al., 1995). In the private bookshelf all resources are collected according to the user's point of view, which he/she expresses through his/her taxonomy of categories. We can assume that some topics are better explored by some people, and that on some topics the user is considered by other people to be an expert. So each collection can have a quality metric assigned to it, based on the expertise the owner has on the related topic. Each user is also aware of the expertise level of other people on given topics.

2.1.2. Resources Annotations

Apart from managing collections by providing the categorisation description of resources, the social semantic collaborative filtering utilises comments and annotations provided by users. Annotations are represented as fora with some additional semantic content, so each user can extend the annotations provided by others by replying to them. Annotations can be used by other people as a shorthand to quickly explore:

• the content or meaning of the resource;

• the context of resources;

• the general opinion of other users.

2.2. Social Semantic Collaborative Filtering Scenarios

Figure 1. The scenario of a simple social semantic collaborative filtering model

In our example scenario (see Fig. 1), Alice writes a thesis on "Mediation in Bibliographic Ontologies". She registers with the digital library run by the University. She discovers that some of her friends are already registered with the library as well. With features known from on-line communities, she connects her profile to her friends' profiles. Later on, Alice starts to gather the information required for her thesis topic. She keeps links to resources she has found in collections managed by the on-line bookmarks system. Soon she discovers that the resources she has bookmarked do not cover the topic of the thesis at a satisfactory level. With the features provided by social semantic collaborative filtering she tries to find other people within her neighbourhood with higher expertise on the related topic. The following sections (simple social collaborative filtering, secured social collaborative filtering) describe different algorithms Alice uses to find the desired information with the help of the social semantic collaborative filtering.

2.2.1. Simple Social Collaborative Filtering

To find the desired information Alice signs up to the university digital library. The system used by the library is based on the simple social semantic collaborative filtering implementation (see Fig. 1). Alice uses the searching features provided by the digital library web application (see Fig. 2(a)) to find interesting resources. We introduce a solution to the problem stated in the previous section based on the simple social semantic collaborative filtering model (see Fig. 2(b)). Each collection is categorised by the owner. The collaborative filtering feature in the digital library lists all the collections, within the given range of friendship neighbourhood, with topics related to the ones defined by Alice.

procedure ListCollectionsSM(p, t): collections []
  for p' ∈ P with PeerDistance(p, p') < knowsRange
    C' ← C' ∪ PeerCollection(p')
  end for
  sort C' according to FinalRankingSM
end procedure

Figure 2(a). The simple model of social collaborative filtering: Algorithm retrieving list of collections
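The procedure of Figure 2(a) can be sketched in Python as follows. This is a minimal illustration, not the FOAFRealm code: the function names mirror the names used in the model of Figure 2(b), but the dictionary-based data layout and, in particular, the formula inside `final_ranking_sm` are assumptions, since the text only states that the ranking depends on distance, similarity and expertise.

```python
from collections import deque


def peer_distance(knows, source, target):
    """Hop count from source to target over directed 'knows' edges.

    With unit edge weights Dijkstra's algorithm reduces to breadth-first
    search. Returns None when target cannot be reached.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        peer, dist = queue.popleft()
        for friend in knows.get(peer, ()):
            if friend == target:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


def final_ranking_sm(distance, similarity, expertise):
    # Assumed formula: closer owners, more similar topics and higher
    # expertise all raise the rank. The paper does not give the formula.
    return similarity * expertise / (1 + distance)


def list_collections_sm(p, t, peers, knows, peer_collection,
                        categorisation, expertise, similarity, knows_range):
    """Collections owned by peers closer than knows_range, best rank first."""
    ranked = []
    for other in peers:
        d = peer_distance(knows, p, other)
        if d is None or d >= knows_range:
            continue
        for c in peer_collection.get(other, ()):
            score = final_ranking_sm(d, similarity(t, categorisation[c]),
                                     expertise[c])
            ranked.append((score, c))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked]


# Hypothetical data loosely following the Alice scenario.
peers = ["alice", "bob", "caroline", "eric"]
knows = {"alice": ["bob", "caroline"], "caroline": ["eric"]}
peer_collection = {"bob": ["ai@bob"],
                   "caroline": ["digital-libraries@caroline"],
                   "eric": ["semantic-web@eric"]}
categorisation = {"ai@bob": "Artificial Intelligence",
                  "digital-libraries@caroline": "Digital Libraries",
                  "semantic-web@eric": "Semantic Web"}
expertise = {"ai@bob": 0.6,
             "digital-libraries@caroline": 0.9,
             "semantic-web@eric": 0.95}


def similarity(a, b):
    # Toy similarity: exact match scores 1.0, anything else 0.2.
    return 1.0 if a == b else 0.2


result = list_collections_sm("alice", "Semantic Web", peers, knows,
                             peer_collection, categorisation, expertise,
                             similarity, knows_range=3)
```

With `knows_range=3` Eric's collection, two hops away, is included and ranks first for the topic "Semantic Web"; with `knows_range=2` it falls outside the neighbourhood and only the collections of Alice's direct friends are listed.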

P is a set of peers
C is a set of collections
FoafKnows is a set of directed connections between peers
Gpeers(P, FoafKnows) is a digraph of friendship relations
T is a lattice of categorisation topics
We assume that each collection c ∈ C has exactly one owner p ∈ P
PeerCollection: P → 2^C - returns all collections owned by the peer
OwnedBy: C → P - returns the owner of the collection
Expertise: (P,C) → [0,1] - computes the quality of the collection based on the peer's expertise on the related topic
Categorisation: C → T - returns the list of topics describing the collection
PeerDistance: (P,P) → N - computes the distance between two peers in the social network graph using the Dijkstra algorithm
Similarity: (T,T) → [0,1] - computes the similarity level between two topics
FinalRankingSM: (PeerDistance,Similarity,Expertise) → [0,1] - computes the ranking value for a collection based on the distance to the owner, the similarity level and the quality measure
knowsRange - defines a maximal distance between two people when traversing the graph of friendship relations.

Figure 2(b). The simple model of social collaborative filtering: The model definition

One of the possible ways to introduce the expertise level explicitly would be by utilising the collections inclusion graph. The more people include a given collection in their collections, the more important it is. Additionally, the more collections provided by other people are included in a given collection, the less valuable it is. Each collection has a quality level assigned to it. The quality of the collection corresponds to the expertise level of the owner on the related topic. The expertise level can be computed with the PageRank algorithm (Brin and Page, 1998) applied to the graph of collections' inclusions and to the social network graph. Both graphs represent the rank value each person and each collection receives from other people. The rank values are assigned directly (by people to people) and indirectly (by including someone's collection in one's own collection).

Alice finds out that one of her friends, Caroline, gathers information on digital libraries and that her expertise level on that topic is very high. Though her direct friend Bob is interested in Artificial Intelligence, she finally decides to link resources provided by Eric, who has a highly ranked "Semantic Web" collection. From now on, Alice takes advantage of the information gathered by Caroline and Eric in their collections.

2.2.2. Secured Social Collaborative Filtering

Alice is still looking for more information required for her thesis. She decides to register in an open, heterogeneous network of digital libraries. Some people protect their collections with access control restrictions (see Fig. 3). The restrictions applied to a collection are based on the maximal distance and the minimal trust level between two people in the social network graph. Apart from defining friendship relations, users express the quality (trust level) of every outgoing social connection.

Since not all information should be accessible by everyone, some of it needs to be protected from people outside the given community. This is why access control lists (ACL) have been introduced (see Fig. 3). In the social semantic collaborative filtering environment based on ACL, each collection has its own ACL that defines the maximal distance and the minimal friendship quantisation level from a specific person to the person willing to access that collection. Note that this specific person does not have to be the owner of the collection, though the owner is the one who manages the ACL. Only when these constraints are satisfied can the user access and include this collection in his/her collections.

Figure 3. The scenario in the secured social semantic collaborative filtering model

Alice wants to make use of the knowledge provided by Damian. But the algorithm for retrieving a list of collections in the secured environment (see Fig. 4(a)) omits some of the collections. With ACL applied, Alice is out of the range defined in Damian's ACL constraints. The collection managed by Damian is not presented to her.

procedure ListCollectionsACL(p, t): collections []
  cp ← PeerCollection(p)
  for p' ∈ P with PeerDistance(p, p') < knowsRange
    for c' ∈ PeerCollection(p')
        with aclPD ∈ DistanceACL(c'), PeerDistance(Peer(aclPD), p) < Distance(aclPD)
        with aclFQ ∈ QuantizationACL(c'), FriendshipQuantization(Peer(aclFQ), p) > Quantisation(aclFQ)
        with CollectionDistance(cp, c') < inclusionRange
      C' ← C' ∪ { c' }
    end for
    for c ∈ C with FriendshipQuantization(p, OwnedBy(c)) > quantisationLevel
      if aclPD ∈ DistanceACL(c'), PeerDistance(Peer(aclPD), p) < Distance(aclPD)
        and aclFQ ∈ QuantizationACL(c')
        and c' ∈ C', CollectionDistance(cp, c') + CollectionDistance(c', c) < inclusionRange
      then C" ← C" ∪ { c }
    end for
    sort (C' ∪ C") according to FinalRankingCI
  end for
end procedure

Figure 4(a). The secure model of social collaborative filtering: Algorithm retrieving list of collections

ACLPD is an access control constraint, defining the maximal distance D (in degrees of separation) from user P
ACLFQ is an access control constraint, defining the minimal FriendshipQuantization value (calculated across the graph) from user P
DistanceACL: (C) → 2^ACLPD - defines a list of allowed maximal distances to the user
QuantizationACL: (C) → 2^ACLFQ - defines a list of allowed minimal FriendshipQuantization values
Peer: (ACL) → P - returns the peer from which the computation of the ACL distance/level is to be performed
Distance: (ACLPD) → N - returns the maximal distance defined in the ACL
Quantisation: (ACLFQ) → [0,1] - returns the minimal FriendshipQuantization level

Figure 4(b). The secure model of social collaborative filtering: The model definition
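The access test at the heart of the secured model can be sketched in Python as below. Two points are assumptions rather than statements from the text: the friendship quantisation between two peers is taken to be the best product of trust levels along any path, and a collection is treated as accessible only when every constraint in its ACL holds. The trust values in the example are hypothetical as well.

```python
import heapq
import math


def hop_distances(trust, source):
    """Dijkstra with unit weights: hops from source to each reachable peer."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, peer = heapq.heappop(heap)
        if d > dist[peer]:
            continue  # stale heap entry
        for friend in trust.get(peer, {}):
            if d + 1 < dist.get(friend, math.inf):
                dist[friend] = d + 1
                heapq.heappush(heap, (d + 1, friend))
    return dist


def friendship_quantisation(trust, source):
    """Best multiplicative trust from source to each reachable peer.

    Assumed rule: trust along a path is the product of its edge values,
    and the strongest path wins. Trust values must lie in [0, 1].
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        negative, peer = heapq.heappop(heap)
        level = -negative
        if level < best[peer]:
            continue  # stale heap entry
        for friend, value in trust.get(peer, {}).items():
            candidate = level * value
            if candidate > best.get(friend, 0.0):
                best[friend] = candidate
                heapq.heappush(heap, (-candidate, friend))
    return best


def can_access(trust, requester, distance_acl, quantisation_acl):
    """Check a requester against one collection's ACL.

    distance_acl:     list of (peer, maximal distance) pairs
    quantisation_acl: list of (peer, minimal trust level) pairs
    Assumed semantics: all listed constraints must be satisfied.
    """
    for peer, max_distance in distance_acl:
        d = hop_distances(trust, peer).get(requester)
        if d is None or d >= max_distance:
            return False
    for peer, min_level in quantisation_acl:
        level = friendship_quantisation(trust, peer).get(requester, 0.0)
        if level <= min_level:
            return False
    return True


# Hypothetical trust levels on directed 'knows' edges.
trust = {
    "eric": {"caroline": 0.9},
    "caroline": {"alice": 0.7},
    "alice": {"caroline": 0.8, "eric": 0.6},
}

semantic_web_ok = can_access(trust, "alice", [("eric", 3)], [("eric", 0.5)])
distributed_ok = can_access(trust, "alice", [("eric", 3)], [("eric", 0.7)])
p2p_ok = can_access(trust, "alice", [("eric", 2)], [])
```

In this example Alice is two hops from Eric with a path trust of 0.9 × 0.7 = 0.63, so a collection demanding more than 0.7 is refused, and so is one restricted to Eric's direct friends (distance below 2).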

2.3. The Benefits

The main bottleneck of existing passive collaborative filtering systems is the process of gathering users' preferences (Shardanand et al., 1995). A reliable system requires a very large number of people to express their opinion about a large number of topics. This requires users to either fill out a survey or perform a lot of activities (e.g. buying a product, reading a book) over a certain time. Active collaborative filtering solutions depend on the social network being maintained by the users themselves. Outdated information in the list of friends can mislead a person in his/her quest for an answer.

2.3.1. Backward Referral Chaining

Maintaining a list of friends, posting a question and gathering the answers may be time consuming. That is why the social collaborative filtering (a new approach to active collaborative filtering) tends to ease some of these hardships by introducing the concept of backward referral chaining, reusing existing classification schemata and extrapolating user profile information from the interests of his/her friends.

Usually, a user is not aware of the whole social network. To gather knowledge from outside his/her direct friendship neighbourhood a user has to rely on references provided by his/her friends. Because an expert in a specific domain can be quite distant from the user in terms of relationship links, the access to the answer he/she provides depends on the path from the user to the expert himself/herself. As introduced in the secure social collaborative filtering, an expert can restrict access to some parts of the information by applying access control lists. The referral chaining (Kautz et al., 1997b) has two strong dependencies: the accuracy of finding the right path to an expert, and the responsiveness factor of the found expert.

The backward referral chaining introduced in the social collaborative filtering inverts the process of finding an expert. The answers provided by different people (including experts) are assembled into a hierarchical knowledge base. Users link information provided by other people into their collections. In many cases the expertise of the latter on the given topic is higher. This approach improves the accuracy when a person is looking for an answer, as it is possible to extract the answer along the tree of categories maintained by different peers (see Fig. 2).

The responsiveness factor is utilised in the social collaborative filtering to implement security features. Each person (issuer) can apply ACLs to categories created and maintained by himself/herself. The ACL is based on the distance and the trust factor between the issuer and the requester (the person who seeks information). It is possible to prevent unwanted people from accessing the information. The protected information may, however, still be exposed to socio-technical attacks (Mitnick and Simon, 2002): a malicious person can persuade someone who has access to confidential information to share that information.

2.3.2. Connection to the Established Classification Schemata

In social collaborative filtering each person can create his/her own categories according to the local understanding of the world. The definition of a category might be hard for other peers to understand because of the use of ambiguous descriptions or a native language. The social semantic collaborative filtering (SSCF) is a social collaborative filtering enriched with an existing thesaurus or classification taxonomy, like WordNet (Fellbaum, 1998), Dewey Decimal Classification (Dewey, 2004)(Kruk et al., 2005) or DMoz. This description can help both people and machines to understand the meaning of the category. The latter can then utilise this knowledge, e.g. in recommending related categories created by other users or during the query expansion process (Kruk et al., 2005).

2.3.3. Extrapolated User Profile

When information about a user's activities (personal bookshelf, resources' annotations) is gathered over a longer time, it can be re-used during the search process. The query expansion process (Kruk et al., 2005) takes into account semantically rich descriptions of users' preferences reflecting their activities. The result set becomes more user oriented than with a generic search. Users newly registered with the system very often suffer from the lack of rich profile information. This may have a strong influence on the quality of search results. To overcome this problem the social collaborative filtering paradigm introduces the concept of an extrapolated user's profile. The profile of the new user can be represented, with some probability depending on the trust level, as a combination of the profiles of his/her friends (see Fig. 5).

Figure 5. The extrapolated user's profile

Alice is a newly registered user of the system. She states Bob and Caroline as her friends. Caroline knows Damian and Eric. The extrapolated user's profile of Alice consists of information from her friends' profiles. It is then assumed that Alice is interested in Artificial Intelligence, Digital Libraries and (with smaller probability) Semantic Web and general librarian topics. The information is verified and updated by Alice's later activities in the system.
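One way to read the extrapolated profile of Figure 5 is sketched below in Python. The weighting rule is an assumption: each friend's interests are scaled by the product of trust levels on the path from the new user, so that friends of friends contribute with a smaller weight. The trust values and the assignment of topics to Bob, Caroline, Damian and Eric are hypothetical.

```python
def extrapolated_profile(user, trust, profiles, max_depth=2):
    """Blend the interest profiles of peers within max_depth hops.

    trust:    peer -> {friend: trust level in [0, 1]}
    profiles: peer -> {topic: interest in [0, 1]}
    Returns topic -> estimated interest for the (new) user.
    """
    weights = {}             # peer -> strongest path trust found so far
    frontier = {user: 1.0}
    visited = {user}
    for _ in range(max_depth):
        next_frontier = {}
        for peer, level in frontier.items():
            for friend, value in trust.get(peer, {}).items():
                if friend in visited:
                    continue
                candidate = level * value
                if candidate > next_frontier.get(friend, 0.0):
                    next_frontier[friend] = candidate
        visited.update(next_frontier)
        weights.update(next_frontier)
        frontier = next_frontier

    combined = {}
    for peer, weight in weights.items():
        for topic, interest in profiles.get(peer, {}).items():
            estimate = weight * interest
            if estimate > combined.get(topic, 0.0):
                combined[topic] = estimate
    return combined


# Hypothetical trust levels and profiles for the scenario of Figure 5.
trust = {
    "alice": {"bob": 0.8, "caroline": 0.9},
    "caroline": {"damian": 0.5, "eric": 0.5},
}
profiles = {
    "bob": {"Artificial Intelligence": 1.0},
    "caroline": {"Digital Libraries": 1.0},
    "damian": {"Librarianship": 1.0},
    "eric": {"Semantic Web": 1.0},
}

alice_profile = extrapolated_profile("alice", trust, profiles)
```

Topics held by direct friends end up with higher estimates than topics reached only through Caroline, which matches the "with smaller probability" remark in the scenario. Once Alice becomes active, her own activity would replace these estimates.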

2.4. Security and Privacy Issues

Collaborative filtering implementations suffer in most cases from very weak security features or frequent privacy abuse. In passive collaborative filtering systems the information about the user is very often gathered without his/her knowledge. In active collaborative filtering the user very often has no means of preventing information about him/her from being gathered. To implement the security and privacy features, the concept of a digraph of interpersonal connections has been utilised. Each user defines a list of his/her friends and states the level of trust in each of them. The user can then define the maximal distance and the minimal trust level required from a person who wants to view the information gathered in a specific category (see Fig. 6).

Figure 6. Protecting the information in social collaborative filtering with trust level

Alice knows Caroline and Eric. Eric knows Caroline, who knows Alice. Trust levels of each relationship have been provided. Alice has access to the "Digital Libraries" and "Semantic Web" categories. Access to the "Distributed Systems" category has been denied because the required trust level (70%) has not been satisfied. The category "P2P Systems" may not be accessed by Alice, as it is only accessible to people from the direct friendship neighbourhood of Eric. As all the information about the user is provided by himself/herself and he/she manages the access control lists for each piece of information, the privacy of the user is preserved. Since SSCF information can be distributed among many peer services (digital libraries, blogs, wikis, etc.), the underlying social network has to be protected from being altered without the user's permission (see 4.2.4).

3. Evaluation of Social Semantic Collaborative Filtering

Social semantic collaborative filtering utilises existing social networks instead of creating artificial connections between people. That is why, contrary to other collaborative filtering solutions, there is no need to evaluate an algorithm for creating a social network, as the social network is given explicitly. On the other hand, since the social semantic collaborative filtering is based on friendship connections, the actual similarity of interests between connected users might differ. That is why the evaluation of this collaborative filtering approach should show that the dissemination of knowledge is possible within a graph of semantically annotated friendship connections.

3.1. Simulation model

In this section we present the implementation of the simple social semantic collaborative filtering model. We show that the average level of expertise in the subgraph of the social network is almost maximal within 6 degrees of separation (see 3.4). The definition of the simulation model has been based on similar ideas defined in the Referral Web project (Kautz, 1997a). The main difference between social semantic collaborative filtering and the Referral Web is that in the Referral Web project the process of finding an expert on a certain topic is performed manually by the user. In social semantic collaborative filtering, semantic annotations on the knowledge provided in the social network are used to automate the process of finding high quality information. The simulation model itself might be similar to the one presented in (Kautz, 1997a), so we just need to show that it is possible to find an expert within the given maximal degree of separation.

3.2. Underlying assumptions

In the model of the social network for the social semantic collaborative filtering, each user manages collections with information on selected topics. Different users represent different levels of expertise on a given topic. We assume that:

• The quality of the information provided by a user in a certain collection is proportional to the expertise level of the user on the topic of the collection.

• It is possible to find a user with a high expertise on a given topic within the network of social connections.

According to the simple social collaborative filtering model (see Fig. 2), the simulation environment is modeled by a set of users and a set of collections managed by those users. Each collection is owned by exactly one user. The quality of the collection is defined on the basis of the user's expertise on the related topic. Each user has a predefined set of other users he/she knows (this relation should not be considered as implicitly symmetric). Although according to the small world phenomenon (Milgram, 1967)(Barabasi, 2002) the distribution of the degree of the friendship connections is power-law based (Zipf's distribution, see Eq. 1), we have decided to perform a second set of experiments where the degree of friendship connections is bell-curve shaped (normal random variable, see Eq. 2).

[Equation 1: Zipf (power-law) distribution of the degree of friendship connections]

[Equation 2: normal distribution of the degree of friendship connections]

The distribution of expertise on a certain topic within the social network can be based on Lotka's Law (Lotka, 1926), stating that the number of authors making n contributions is about 1/n^a of those making one contribution, where a is often nearly 2. Since the expertise on a certain topic is proportional to the number of high quality publications, the distribution of the level of expertise (the level of expertise over the number of users that have it) is Zipfian shaped as well. Each collection has a quality value assigned to it that represents the expertise the owner has on the related topic. In order to make sure that there is at least one absolute expert (Expertise(T) = 1) in each topic T, we have normalised the associated expertise values, dividing each by the value of the highest expertise in the given topic.

The list of topics used to describe the content of a collection has been based on the Dewey Decimal Classification (Dewey, 2004). This simplifies the model with respect to computing the similarity between topics. Each category has an associated three-digit number (100 - 999). Categories are structured as a three-level taxonomy tree:

1. top level, general categories, with numbers divisible by 100;
2. middle level, with numbers divisible by 10 and not by 100;
3. bottom level, precise categories, with numbers that are not divisible by 10.

Although in the real world implementation categories are additionally described with vectors of WordNet words, DDC seems to be enough for the modeling purpose.
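The two modelling steps described above, the three-level reading of DDC numbers and the per-topic normalisation of expertise, can be sketched in Python as follows. The function names are hypothetical, and `ddc_similarity` is only one plausible way of comparing topics (the share of the taxonomy path two numbers have in common); the text does not specify the similarity measure used.

```python
def ddc_level(code):
    """Taxonomy level of a three-digit DDC number (100 - 999)."""
    if not 100 <= code <= 999:
        raise ValueError("DDC category numbers range from 100 to 999")
    if code % 100 == 0:
        return "top"
    if code % 10 == 0:
        return "middle"
    return "bottom"


def ddc_similarity(a, b):
    """Assumed similarity: fraction of the three-level path that is shared."""
    if a == b:
        return 1.0
    if a // 10 == b // 10:      # same middle-level category
        return 2 / 3
    if a // 100 == b // 100:    # same top-level category
        return 1 / 3
    return 0.0


def normalise_expertise(raw_by_topic):
    """Divide every value by the topic maximum so each topic has an expert at 1.

    raw_by_topic: topic -> {user: positive raw expertise}
    """
    normalised = {}
    for topic, values in raw_by_topic.items():
        highest = max(values.values())
        normalised[topic] = {user: value / highest
                             for user, value in values.items()}
    return normalised


# Hypothetical raw expertise values for two DDC topics.
raw = {
    20: {"alice": 2.0, "caroline": 8.0},
    510: {"bob": 1.5, "eric": 3.0, "damian": 0.5},
}
expertise = normalise_expertise(raw)
```

After normalisation Caroline and Eric become the absolute experts (value 1) in their topics, and all other values are expressed relative to them.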

3.3. Definition of the experiment

During the experiment each user (p ∈ P, sizeOf(P) = N_P) tries to find in the social network, within a given range R, the collection that provides the information on the topic t ∈ T_p. The topic is randomly selected from the list of topics associated with the collections owned by the user. The average value of the highest expertise level found within the given range, AverageMaximalExpertise(R), is computed (see Fig. 7).

procedure AverageMaximalExpertise(R):
  for p ∈ P
    select t ∈ T_p
    find c such that
      t = Categorisation(c)
      PeerDistance(p, OwnedBy(c)) < R
      e = Expertise(OwnedBy(c), c) is maximal
    AverageMaximalExpertise += e / N_P
  end for
end procedure

Figure 7. Algorithm calculating average maximal expertise in the social semantic collaborative filtering model in the given range

We have performed four experiments. Each time the social network model consisted of N_P = 1000 users. Each user in our social collaborative filtering environment had only one collection associated. This simplification is valid since during the experiment we are looking only for collections with exactly the same topic as the selected one. The collections associated with each topic therefore create a subgraph that is independent of the actual number of collections owned by each user. The expertise level for each collection has been randomly selected according to the power law distribution. In the first two experiments the degree of friendship connections has been randomly selected according to the normal distribution (μ = 25, σ = 12.5). In the last two experiments the power law distribution (α ≈ 1.9) has been applied. During each experiment the average maximal expertise values AverageMaximalExpertise(R) have been calculated for the maximal degree of separation R ∈ [1,8].
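The experiment can be approximated with the short Python simulation below. It is a sketch under several assumptions that go beyond the text: friends are picked uniformly at random, raw expertise is drawn with `random.paretovariate(1.0)` before per-topic normalisation, the number of topics is a free parameter, and the power-law degree is drawn with a Pareto shape of 0.9 to approximate an exponent of 1.9. The numbers it produces are therefore not expected to reproduce the published results exactly.

```python
import random
from collections import deque


def build_network(rng, n_users, degree_sampler):
    """Directed 'knows' graph; each user picks a sampled number of friends."""
    knows = {}
    for user in range(n_users):
        degree = max(0, min(n_users - 1, degree_sampler(rng)))
        others = [v for v in range(n_users) if v != user]
        knows[user] = rng.sample(others, degree)
    return knows


def average_maximal_expertise(knows, topic, expertise, max_range):
    """Mean over users of the best same-topic expertise closer than max_range."""
    total = 0.0
    for p in knows:
        best = 0.0
        seen = {p}
        queue = deque([(p, 0)])
        while queue:
            peer, dist = queue.popleft()
            if topic[peer] == topic[p]:
                best = max(best, expertise[peer])
            if dist + 1 < max_range:          # PeerDistance must stay < R
                for friend in knows[peer]:
                    if friend not in seen:
                        seen.add(friend)
                        queue.append((friend, dist + 1))
        total += best
    return total / len(knows)


def simulate(n_users, degree_sampler, rng, n_topics=10, max_r=8):
    """Return the average maximal expertise for R = 1 .. max_r."""
    knows = build_network(rng, n_users, degree_sampler)
    topic = {u: rng.randrange(n_topics) for u in knows}
    raw = {u: rng.paretovariate(1.0) for u in knows}   # assumed power law
    highest = {}
    for u, value in raw.items():
        highest[topic[u]] = max(highest.get(topic[u], 0.0), value)
    expertise = {u: raw[u] / highest[topic[u]] for u in knows}
    return [average_maximal_expertise(knows, topic, expertise, r)
            for r in range(1, max_r + 1)]


def bell_degree(rng):
    return round(rng.gauss(25, 12.5))        # mu = 25, sigma = 12.5


def zipf_degree(rng):
    return int(rng.paretovariate(0.9))       # tail exponent of about 1.9


if __name__ == "__main__":
    rng = random.Random(2005)
    for name, sampler in (("Bell", bell_degree), ("Zipf", zipf_degree)):
        curve = simulate(1000, sampler, rng)
        print(name, [round(v, 5) for v in curve])
```

For R = 1 only the user's own collection qualifies, so the first value is simply the mean normalised expertise; each larger range explores a superset of peers, so the curve cannot decrease.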

3.4. Results of simulation

Table 1 presents the results of all four experiments.

Table 1. Results of the experiment - average maximal expertise

R    σ = 12.5 (Bell)        α = 1.9 (Zipf)
1    0.07072   0.06427      0.01793   0.01595
2    0.69098   0.69192      0.10557   0.09042
3    0.96399   0.96183      0.33044   0.29836
4    0.96796   0.96782      0.62892   0.61653
5    0.96796   0.96782      0.82896   0.82980
6    0.96796   0.96782      0.90953   0.91751

It is interesting that even for the power law based distribution a user is able to find information of almost the highest possible quality within 6 degrees of separation (see Fig. 8).

3.5. Conclusions on results of simulation

Figure 8. Average expertise level in the neighbourhood of the given size

Following the experiments by Kautz (Kautz, 1997) we have constructed a similar social collaborative filtering model. The results revealed that each user is able to find (on average) the best quality of information provided by other users within the subgraph of the social network bounded by 6 degrees of separation. These experimental results indicate that the constructed social network model corresponds to the small world phenomenon (Milgram, 1967). Hence, the assumptions underlying the social collaborative filtering have been fulfilled. It is possible to find an expert (with an average expertise level above 90%) within a small social network neighbourhood.

4. FOAFRealm - the Reference Implementation of Social Semantic Collaborative Filtering

FOAFRealm is a library for distributed user profile management based on the FOAF vocabulary. It enables users to control their profile information, as the information can be accessed in the open FOAF metadata. Users can sign in automatically across the P2P network (called D-FOAF) of FOAFRealm-enabled systems (Grzonkowski, 2005). FOAFRealm provides a basic implementation of the social semantic collaborative filtering concept. The knowledge (annotations and private collections) can be shared among registered users. Security constraints can be applied to each piece of information separately.

4.1. FOAFRealm Component Architecture

The current implementation of FOAFRealm consists of four layers:

• The distributed communication layer, providing access to the highly scalable HyperCuP Lightweight Implementation of P2P infrastructure to communicate and share the information with other FOAFRealm implementations.

• FOAF and collaborative filtering ontology management. It wraps the actual RDF storage being used and provides the upper layers with simple access to the semantic information. The Dijkstra algorithm for calculating distance and friendship quantisation is implemented in this layer.

• Implementation of the org.apache.catalina.{Realm,Valve} interfaces to easily plug FOAFRealm into Tomcat-based web applications. It provides authentication features including autologin based on cookies.

• A set of Java classes, Tagfiles and JSP files plus a list of guidelines that can be used while developing the user interface in one's own web applications.

4.2. Distributed FOAFRealm (D-FOAF)

D-FOAF, the Distributed FOAFRealm, utilises the HyperCuP P2P infrastructure to connect FOAFRealm instances and exchange information between them. Four major features are supported by D-FOAF to provide the distributed social semantic collaborative filtering implementation:

• User authentication (see 4.2.1.)
• User profile management (see 4.2.2.)
• Computing distance and quantisation level between users (see 4.2.3.)
• Security of distributed computing (see 4.2.4.)

4.2.1. User authentication in D-FOAF

FOAFRealm provides a one-time registration feature: a user needs to register with one service only and can then log into a different one. Both services have to use FOAFRealm components joined in the D-FOAF network. When a user wants to log into a service, the application tries to find his/her profile locally (see Fig. 9). If such local authentication fails, the user has to be authenticated on another server.

Figure 9. User authentication algorithm in D-FOAF

If the user logs into the system for the first time, a broadcast message is sent to the other services in the D-FOAF network to find the user's registration server, and each service attempts to authenticate the user against its local database. The HyperCuP infrastructure provides an efficient and scalable broadcast algorithm that smooths the process of distributed authentication. One of the services that successfully authenticates the user sends back a message with its URI as the registration service. This information is later stored as an RDF triple in the local database of the service that initiated the distributed authentication process (see Fig. 10).
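The authentication flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FOAFRealm code: the `Service` class, its method names and the plain-text password check are hypothetical, a simple loop over `peers` stands in for the HyperCuP broadcast, and a dictionary stands in for the RDF triple that caches the registration service.

```python
from typing import Dict, List, Optional


class Service:
    """Hypothetical stand-in for one FOAFRealm-enabled service."""

    def __init__(self, uri: str, accounts: Dict[str, str]) -> None:
        self.uri = uri
        # user id -> password; plain text only to keep the sketch short
        self.accounts = accounts
        # user id -> URI of the registration service (an RDF triple in D-FOAF)
        self.registration_cache: Dict[str, str] = {}
        self.peers: List["Service"] = []

    def authenticate_locally(self, user: str, password: str) -> bool:
        return self.accounts.get(user) == password

    def login(self, user: str, password: str) -> Optional[str]:
        """Return the URI of the service that authenticated the user, or None."""
        # 1. Try the local database first.
        if self.authenticate_locally(user, password):
            return self.uri
        # 2. If the registration service is already cached, ask only that one;
        #    otherwise ask every peer (stand-in for the HyperCuP broadcast).
        cached = self.registration_cache.get(user)
        if cached is not None:
            candidates = [p for p in self.peers if p.uri == cached]
        else:
            candidates = self.peers
        for peer in candidates:
            if peer.authenticate_locally(user, password):
                # 3. Remember where the user is registered for later log-ins.
                self.registration_cache[user] = peer.uri
                return peer.uri
        return None
```

In this sketch the first remote log-in costs one round over all peers, while later log-ins for the same user contact only the cached registration service.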

Figure 10. Information on user's registration service cached in the local database

4.2.2. Distributed User Profile Management

D-FOAF allows users to have their profile information distributed across various servers, and it provides mechanisms to manage and read the distributed user profile. The ability to gather all distributed information in one place is very important for social semantic collaborative filtering, because it makes it possible to browse through all of the user's bookmarks and friends. The HyperCuP infrastructure has been utilised to provide an efficient broadcast algorithm for D-FOAF. On user demand, D-FOAF sends a specific query to the P2P network and receives data, which is then concatenated into the final user profile snapshot.

4.2.3. Distance and Quantisation Level Computing in D-FOAF

Distance and quantisation level computing is probably the most important part of D-FOAF from the SSCF perspective. It is executed each time a user performs a keyword search with query expansion or wants to browse through the network of bookmarks and resources managed by him/her and his/her friends. Computing measures over a distributed RDF graph is probably one of the most complex tasks in D-FOAF, and the system has to cope with a variety of problems. The distance between two people in the FOAF graph has to be computed efficiently, and the problem gets less trivial when the FOAF graph is distributed among the many services constituting the D-FOAF network. The distance computation is performed in three steps (see Fig. 11):

Step 1

A single instance of FOAFRealm implements the Dijkstra algorithm (Dijkstra, 1959) to compute the distance (or quantisation level, see Fig. 4(b)) between users. In the first step of the distributed computation we try to find the answer in the local FOAF database. If boundaries such as the maximal distance or the minimal quantisation level (see Fig. 4(b)) are given and the local information conforms to them, the algorithm can terminate with success; otherwise it continues to the next step.

Step 2

In the second step the distance algorithm is performed on each node of the D-FOAF P2P network independently. If the boundaries are given and any of the services can provide a positive answer, then the result is sent back to the service that initiated the process and the algorithm terminates.

Step 3

Although we might assume that close friends will be known within one community handled by a single FOAFRealm instance, it is also quite possible that the two people whose distance we need to compute are connected through other people with profiles on other FOAFRealm instances. In this case the system has to gather all the information required to compute the distance in one place: the FOAFRealm instance which invoked the query. First, the complete information about the first user's profile is retrieved. Next, all triples describing direct friends of this person are gathered with a HyperCuP broadcast. The local server builds a temporary database and performs the standard local computation (step 1), retrieving missing foaf:knows profile information on demand.
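The three steps above can be sketched as follows. This is an illustrative sketch under several assumptions rather than the D-FOAF implementation: every foaf:knows edge is given unit weight, each service's FOAF data is modelled as a plain dictionary, the function names are hypothetical, step 3 merges all remote graphs at once instead of fetching profiles on demand, and the quantisation level is not modelled.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Set

Graph = Dict[str, Set[str]]  # person -> people he/she foaf:knows


def bounded_distance(graph: Graph, source: str, target: str,
                     max_distance: Optional[int] = None) -> Optional[int]:
    """Dijkstra over unit-weight foaf:knows edges, cut off at max_distance."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, person = heapq.heappop(queue)
        if person == target:
            return d
        if d > dist.get(person, d):
            continue  # stale queue entry
        if max_distance is not None and d >= max_distance:
            continue  # do not expand beyond the boundary
        for friend in graph.get(person, ()):
            if d + 1 < dist.get(friend, float("inf")):
                dist[friend] = d + 1
                heapq.heappush(queue, (d + 1, friend))
    return None


def merge(graphs: Iterable[Graph]) -> Graph:
    """Build the temporary database of step 3 from several FOAF graphs."""
    merged: Graph = {}
    for graph in graphs:
        for person, friends in graph.items():
            merged.setdefault(person, set()).update(friends)
    return merged


def distributed_distance(local: Graph, remotes: List[Graph], source: str,
                         target: str,
                         max_distance: Optional[int] = None) -> Optional[int]:
    # Step 1: the local FOAF database; stop early only if a boundary is given.
    found = bounded_distance(local, source, target, max_distance)
    if found is not None and max_distance is not None:
        return found
    # Step 2: every other node runs the same computation independently.
    for remote in remotes:
        found = bounded_distance(remote, source, target, max_distance)
        if found is not None and max_distance is not None:
            return found
    # Step 3: gather the information in one place and compute there.
    return bounded_distance(merge([local, *remotes]), source, target,
                            max_distance)
```

Note that an answer returned in step 1 or 2 only guarantees that the boundary is satisfied; a shorter path may exist across services, which is why the sketch falls through to step 3 when no boundary is given.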

Caching

Since step 3 might generate a huge RDF graph and overload the network with expensive broadcast messages just to compute one path between two users (or two paths, one for distance and one for quantisation level), caching has been introduced so that the third step is performed as rarely as possible. The goal is to remember the result of the complex distance (knows level) computation. Of course, remembering all the information gathered from other servers would introduce a lot of redundancy and could cause data inconsistency. That is why we have decided to keep only the paths between the users for further use in step 1 or 2. A path consists of a sequence of foaf:knows triples, and each registration server is notified of the fact that information on one of its users is cached. The next time this user alters his/her foaf:knows information at any of the D-FOAF nodes, the registration server is required to sign the changed information. Hence it can notify all the FOAFRealm nodes caching information on that user that the foaf:knows information has changed. If the change affects cached paths, each such path is invalidated starting from the triple with the user that has changed his/her information.
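The path cache and its invalidation rule can be sketched as follows. The `PathCache` class and its method names are hypothetical, and the sketch encodes one possible reading of the rule above: a cached path is cut at the first triple whose subject is the user that changed, and the still-valid prefix (if any) is kept.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, "foaf:knows", object)


class PathCache:
    """Hypothetical cache of foaf:knows paths found in step 3."""

    def __init__(self) -> None:
        self._paths: Dict[Tuple[str, str], List[Triple]] = {}

    def store(self, path: List[Triple]) -> None:
        # A path is indexed by its first subject and its last object.
        if path:
            self._paths[(path[0][0], path[-1][2])] = list(path)

    def lookup(self, source: str, target: str) -> Optional[List[Triple]]:
        return self._paths.get((source, target))

    def cached_users(self) -> Set[str]:
        """Users whose registration servers should be told about the cache."""
        return {t[0] for path in self._paths.values() for t in path}

    def invalidate(self, changed_user: str) -> None:
        """Cut every cached path at the first triple of changed_user."""
        updated: Dict[Tuple[str, str], List[Triple]] = {}
        for path in self._paths.values():
            cut = next((i for i, t in enumerate(path)
                        if t[0] == changed_user), None)
            kept = path if cut is None else path[:cut]
            if kept:
                updated[(kept[0][0], kept[-1][2])] = kept
        self._paths = updated
```

Keeping only paths, rather than whole remote profiles, limits the amount of remote data that can become stale, which matches the redundancy and consistency concern stated above.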

Figure 11. Distance computing in D-FOAF

4.2.4. Security Issues in Distributed FOAFRealm

The credibility of social semantic collaborative filtering depends on several aspects. Secure SSCF depends on access control lists for sharing information across the social network, so the underlying user profile management system, such as D-FOAF, has to ensure the consistency and the security of the social network information. In order to protect the foaf:knows list, the standard FOAF metadata has been extended with the Digital Signature Algorithm (DSA). As a result, the FOAF ontology has been enriched with three properties:

• the signature on the foaf:knows list;
• the user's public key;
• the user's private key.

The signature has to be computed each time the foaf:knows information is changed at one of the FOAFRealm nodes. The registration server is responsible for generating the signature from the list of foaf:knows triples, since the private key cannot be revealed outside the registration server. Each time D-FOAF performs an operation that requires foaf:knows information, it checks the integrity against the signature attached to each list of foaf:knows triples originating from one of the FOAFRealm instances. The public key is kept at the registration server and provided on demand. Since the foaf:knows chains (see 4.2.3.) cached at the FOAFRealm nodes are volatile and would require a number of additional signature computations, they are kept in an RDF storage other than the main FOAFRealm graph.
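The sign-and-verify cycle described above can be illustrated with a toy DSA in pure Python (3.8 or later, because of `pow(k, -1, Q)`). Everything here is an assumption made for illustration: the group parameters are derived from the small Mersenne prime 2^61 - 1 and are far too weak for real use, the canonical form of the foaf:knows list (sorted identifiers joined by newlines) is invented for the sketch, and the function names are hypothetical. A real deployment would rely on a vetted cryptographic library and standard parameter sizes.

```python
import hashlib
import secrets
from typing import List, Tuple


def is_prime(n: int) -> bool:
    """Miller-Rabin with fixed bases; reliable for the toy sizes used here."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True


# Toy domain parameters: Q is prime, P = K*Q + 1 is prime, G has order Q.
Q = 2**61 - 1
K = next(k for k in range(2, 10**6, 2) if is_prime(k * Q + 1))
P = K * Q + 1
G = next(g for g in (pow(h, K, P) for h in range(2, 100)) if g != 1)


def digest(knows: List[str]) -> int:
    """Hash an order-independent serialisation of the foaf:knows list."""
    canonical = "\n".join(sorted(knows)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(canonical).digest(), "big") % Q


def generate_keys() -> Tuple[int, int]:
    private_key = secrets.randbelow(Q - 1) + 1   # stays at the registration server
    return private_key, pow(G, private_key, P)   # public key is given out on demand


def sign(knows: List[str], private_key: int) -> Tuple[int, int]:
    h = digest(knows)
    while True:
        k = secrets.randbelow(Q - 1) + 1
        r = pow(G, k, P) % Q
        s = pow(k, -1, Q) * (h + private_key * r) % Q
        if r and s:
            return r, s


def verify(knows: List[str], signature: Tuple[int, int], public_key: int) -> bool:
    r, s = signature
    if not (0 < r < Q and 0 < s < Q):
        return False
    w = pow(s, -1, Q)
    v = pow(G, digest(knows) * w % Q, P) * pow(public_key, r * w % Q, P) % P % Q
    return v == r
```

In this sketch only the registration server calls `sign`, while any FOAFRealm node holding the public key can call `verify` before it uses a foaf:knows list.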

4.3. FOAFRealm Use Case Studies

The library has been successfully deployed as a user management system in JeromeDL - e-Library with Semantics. It is used to handle private bookshelves of readers and provides additional semantic annotations to the resources. The concept of the extrapolated user profile has been adapted in the semantically enhanced search engine in JeromeDL, so that even users new to the system can benefit from the full-fledged semantic search process.

The FOAFRealm system has also become a part of the MarcOnt Initiative collaboration portal for ontology management based on negotiations. The portal will utilise the social network based features of FOAFRealm to:

• isolate the outside world from the ontology management community: registered users will be allowed to take part in the ontology management process when they are defined as a friend of at least one of the community members;
• differentiate evaluations of ontology change suggestions provided by different members of the community: we will explore whether evaluations provided by close friends of the person who posted the suggestion should be ranked lower than evaluations provided by people with a higher degree of separation from the suggestion owner.

5. Related work

5.1. Collaborative filtering

Figure 12. Types of Collaborative Filtering

The most popular types of collaborative filtering systems (see Fig. 12) are Active Collaborative Filtering and Passive Collaborative Filtering. The distinction between them is based on the activeness of the user that receives information. In active collaborative filtering, the user takes an active part in creating a network of friendship connections. Passive collaborative filtering, very often known as Automated Collaborative Filtering, creates and maintains the network of friends for each user automatically. With passive collaborative filtering, information about the user, such as:

• mailing-list posts,
• links on home pages,
• citations in publications,
• co-authors of articles,

is utilised to create the social network from scratch. Since the user does not actively take part in maintaining his/her network of friends, he/she has no direct impact on the information he/she receives.

Active collaborative filtering implements two models of information retrieval: a user pull model, where a user generates a query to the network of other users, and a user push model, where the answers to previously stated questions or information filters are fed to the user. Among the most popular implementations of the pull model of active collaborative filtering are fora and e-mail based queries. The e-mail based queries can also be classified as a push model implementation. Another kind of push model implementation is based on common workspaces, like wiki pages or Lotus Notes.

Though the problem tends to be more manageable after shifting from a central (search engine) to a distributed method of recommendation, particular collaborative filtering implementations suffer from various difficulties:

• heterophilous diffusion (exchange of information across different socio-economic groups) is neglected in favour of homophilous diffusion (exchange of information within socio-economic groups);
• security and privacy issues are weakly supported;
• the meaning (semantics) of shared concepts is lost;
• when the network of friends is created automatically by harvesting various databases with advanced algorithms:
  • a critical mass of registered users is required to provide a satisfactory level of correlation with the user's interests;
  • it is impossible to create a digraph of social connections from most of the commonly used sources;
  • the privacy of individuals is violated;
  • monopolies are supported (Polat and Du, 2003) because a service provider has to gather a lot of information to become accurate (critical mass);
• when the user actively uses fora or mailing lists:
  • there is no guarantee that there will be an answer to the posted question, or that the answer will be thorough;
  • there might be no expert on the specific field of discourse in the direct friendship neighbourhood of the user;
• some systems require users to answer long questionnaires (Shardanand et al., 1995) in order to find similarities in users' interests.

To overcome the presented drawbacks, we propose social collaborative filtering, which covers both major types of collaborative filtering. It provides a distributed catalogue maintained by the users (active collaborative filtering, push model) and resource annotation features (passive collaborative filtering). Social collaborative filtering is based on ideas from on-line social communities, like the Friend-Of-A-Friend (FOAF) metadata for RDF:

• users have control over their profile information;
• the profiles (FOAF files) can be distributed: both the files themselves and the content of the files can be distributed among different servers;
• connections between users create a digraph of the social network;
• it is possible (as described later) to address security and privacy issues on top of the network of trust (see 4).

Hybrid filtering (Basu et al., 1998), the combination of content filtering and social filtering, is used to maximise precision while keeping recall above a specified limit.

Active collaborative filtering solutions concentrate on utilising the existing social connections provided explicitly. One of the approaches (Maltz and Ehrlich, 1995) is built on the common practice where people tell their friends or colleagues about interesting documents. Users collect bookmarks to the interesting World Wide Web pages that they have found. (Kazunari et al., 2004) describes a social collaborative filtering system where users have a direct impact on the filtering process. The changes in the users' interests are exploited to provide thorough relevance feedback to the system.

To format and distribute collections of bookmarks, several simple systems have been developed. With the Simon system (Simons, 1995) users can create subject spaces, which are lists of hypertext links to WWW pages with annotations on them. Individual people can either use the bookmarks for keeping track of their own explorations or share their knowledge by sending it to the Simon server. Another system, called Pointer (Maltz and Ehrlich, 1995), has been modelled after what people do informally when sharing information. The distribution of pointers can be done by:

• saving in a private database (bookmarking favourite documents);
• saving in public databases;
• emailing (to one user, a group of users or distribution lists);
• editing predestined documents called Information Digests.

Another possible solution is to find a personal referral who can answer the given query. The quality and reliability of the answer depend on the distance to the referral. Several attempts have been made to estimate the average distance between people (Milgram, 1967). The network of relationships can also help in exploring the hidden web, the part of the Internet that is not indexed by search engines (Kautz et al., 1997a), as some of the information is deliberately not accessible outside intranets (Kautz et al., 1997b)(Maltz and Ehrlich, 1995). Numerous studies have revealed that one of the most effectively used channels of dissemination of knowledge (Kautz et al., 1997b), especially in an organisation, is an informal network of collaborators (Kautz et al., 1997a). In this approach, searching for information becomes a matter of searching the social network for an expert on the topic, as well as providing a chain of personal referrals from the searcher to the expert. In this article we described the similarities and improvements that social collaborative filtering introduces in comparison to the Pointer (Maltz and Ehrlich, 1995) and referral chaining (Kautz et al., 1997a) solutions.

5.2. On-line social communities

On-line social communities are the key concept underlying the social semantic collaborative filtering presented in this article. In the last few years this field has been widely explored by several implementations. Some of them, like the Orkut on-line community portal, provide forum-like channels of dissemination of knowledge, where community members can ask questions of their friends or of other members of a specific thematic group. In the Semantic Web field, the FOAF metadata (a vocabulary for RDF) has been introduced to describe interpersonal connections.

5.3. User Profile Management

The existing implementations of user profile management lack:

• fine granularity of security constraints;
• scalability;
• openness and privacy, both of which play important roles in social semantic collaborative filtering.

Some implementations, like realms in J2EE, provide only simple, flat, unrelated groups. This forces the web application developer to implement more sophisticated user profile management himself/herself. When the security constraints are implemented in the web application logic layer, they are hard to maintain. One of the features that is becoming more and more important in a social P2P environment is single sign-on (Pearlman et al., 2002). Each time a user uses a new web system, he/she would rather not provide all the same information about himself/herself over and over again. And after a successful log-in to one web application, he/she would like to be automatically authenticated in others. Solutions like Microsoft Passport or Sxip provide such features but have not been widely accepted yet.

6. Future Work and Conclusions

The social semantic collaborative filtering presented in this article opens new possibilities for exchanging and managing knowledge. Users can share their bookmarks (collections and their content) with their friends. Everyone can organise knowledge by gathering collections that other people are maintaining. Since collections can be linked, it is possible to find more relevant information in categories provided by distant people. Annotations are also a key part of social semantic collaborative filtering. Together with private collections (private bookshelves) they are utilised in semantically enhanced information retrieval in systems like digital libraries.

FOAFRealm is a reference implementation of social semantic collaborative filtering. It builds on social networks and open standards like FOAF. FOAFRealm allows J2EE-based web applications to be quickly extended with user management and social collaborative filtering features. Since the social network is represented as a digraph, FOAFRealm utilises information about the distance between two people and the trust level to provide security and privacy features. The current implementation of FOAFRealm, D-FOAF, provides a distributed user profile management system and hence social semantic collaborative filtering across different systems. The future step, DigiMe, will deliver these features to mobile devices and will explore the ad-hoc social networks paradigm.

Acknowledgements

This material is based upon works supported by the Science Foundation Ireland under Grant No. SFI/02/CE1/I131. The authors thank all members of the JeromeDL and the FOAFRealm working groups for fruitful discussions on this document.

References

Barabasi, A.L. (2002) Linked: The new science of Networks. Cambridge Perseus Press



Basu, C., Hirsh, H., Cohen, W.W. (1998) Recommendation as Classification: Using Social and Content-Based Information in Recommendation. In: AAAI/IAAI. pp. 714-720 http://www2.cs.cmu.edu/~wcohen/postscript/aaai-98-collab.ps



Breese, J.S., Heckerman, D., Kadie, C. (1995) Empirical Analysis of Predictive Algorithms for Collaborative Filtering. pp. 43-52 ftp://ftp.research.microsoft.com/pub/tr/tr-98-12.pdf



Brin, S., Page, L. (1998) The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, pp. 107-117 http://kulturinformatik.unilueneburg.de/veranst/zeitpfeil/material_suchmaschinen/anatomy.pdf



Dewey, M. (2004) A Classification and Subject Index for Cataloguing and Arranging the Books and Pamphlets of a Library - Dewey Decimal Classification. guternberg.net, http://www.gutenberg.org/files/12513/12513-h/12513-h.htm



Dijkstra E. W. (1959) A note on two problems in connexion with graphs. In: Numerische Mathematik. 1, pp. 269-271



Fellbaum, C. (1998) WordNet An Electronic Lexical Database, Cambridge, Mass : MIT Press



Fukui, H.O. (2003) SocialPathFinder: Computer Supported Exploration of Social Networks on WWW. http://www-yano.is.tokushima-u.ac.jp/ogata/icce99/fukui.pdf



Goldberg, D., Nichols, D., Oki, B.M., Terry, D. (1992) Using collaborative filtering to weave an information tapestry. Commun. ACM 35, pp. 61-70 http://portal.acm.org/citation.cfm?doid=138859.138867



Grzonkowski, S., Gzella, A., Krawczyk, H., Kruk, S.R., Moyano, F.J.M.R., Wroniecki, T. (2005) D-FOAF - Security Aspects in Distributed User Management System. In: TEHOSS'2005



Hoadley, C., Pea, R. (2002) Finding the ties that bind: Tools in support of a knowledge-building community http://www.tophe.net/papers/Hoadley-Pea-2001.pdf



Kautz, H.A., Selman, B., Shah, M.A. (1997) The Hidden Web. AI Magazine 18, pp. 27-36 http://users.cs.cf.ac.uk/O.F.Rana/distsys/papers/aimag.pdf



Kautz, H., Selman, B., Shah, M. (1997) Referral Web: Combining Social Networks and Collaborative Filtering. Communications of the ACM 40, pp. 63-65 http://www.cs.washington.edu/homes/kautz/papers/refwebCACM.ps



Kazunari Sugiyama and Kenji Hatano and Masatoshi Yoshikawa: Adaptive web search based on user profile constructed without any effort from users. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pp. 675-684, http://www.www2004.org/proceedings/docs/1p675.pdf



Kruk, S.R., Decker, S., Zieborak, L. (2005) JeromeDL - Reconnecting Digital Libraries and the Semantic Web. In: DEXA'2005. http://www.marcont.org/marcont/pdf/www2005_jeromedl.pdf



Lotka, A.J. (1926) The Frequency Distribution of Scientific Productivity. Journal of the Washington Academy of Sciences 16, pp. 317-323



Maltz, D., Ehrlich, K. (1995) Pointing the way: active collaborative filtering. In: Proceedings of the Conference on Computer-Human Interaction. 202-209 http://www.ischool.utexas.edu/~i385q/readings/Maltz_Ehrlich-1995-Pointing.pdf



Milgram, S. (1967) The small world problem. Psychology Today 67



Mitnick K., Simon W. L. (2002) The Art of Deception.



Newman, M., Watts, D., Strogatz, S. (2002) Random graph models of social networks. In: Proc. Natl. Acad. Sci., to appear. http://www.pnas.org/cgi/content/full/99/suppl_1/2566



Pearlman L., Welch V., Foster I., Kesselman C., Tuecke S. (2002) A Community Authorization Service for Group Collaboration, Globus Project http://arxiv.org/ftp/cs/papers/0306/0306053.pdf.



Polat, H. and Du, W. (2003) Privacy-Preserving Collaborative Filtering using Randomized Perturbation Techniques. In The Third IEEE International Conference on Data Mining (ICDM'03), Melbourne, FL, November 2003. http://www.cis.syr.edu/~wedu/Research/paper/icdm2003.pdf



Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J. (1994) GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In: Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina, ACM, pp. 175-186 http://user.it.uu.se/~sverdlik/MovieLens.pdf



Simons, J. (1995) Using a Semantic User Model to Filter the "World Wide Web" Proactively. pp. 455-456 http://www.cs.uni-sb.de/UM97/ps/SimonsJ.ps



Shardanand, U., Maes, P. (1995) Social Information Filtering: Algorithms for Automating "Word of Mouth". In: Proceedings of ACM CHI'95 Conference on Human Factors in Computing Systems. Volume 1, pp. 210-217 http://jolomo.net/ringo/chi-95-paper.ps



Sugiyama, K., Hatano, K., Yoshikawa, M. (2004) Adaptive web search based on user profile constructed without any effort from users. In: WWW '04: Proceedings of the 13th international conference on World Wide Web, New York, NY, USA, ACM Press, pp. 675-684 http://wwwconf.ecs.soton.ac.uk/archive/00000586/01/p675-sugiyama.pdf



Trevor, J., Hilbert, D.M., Billsus, D., Vaughan, J., Tran, Q.T. (2002) Contextual Contact Retrieval http://www.cc.gatech.edu/fce/ecl/publications/fxpal-iui04-contextualContactRetrieval.pdf

Author details

Sebastian Ryszard Kruk
PhD Researcher in Digital Enterprise Research Institute, Galway, Ireland

Dr. Stefan Decker
Executive Director of Digital Enterprise Research Institute, Galway, Ireland

Adam Gzella
MSc Student in Gdansk University of Technology, Poland

Slawomir Grzonkowski
MSc Student in Gdansk University of Technology, Poland