Semantic Wrappers
Using Web Agents to Extract Knowledge from the Web

J.L. Arjona, R. Corchuelo and M. Toro
Escuela Técnica Superior de Ingeniería Informática de la Universidad de Sevilla
Departamento de Lenguajes y Sistemas Informáticos
Avda. de la Reina Mercedes, s/n, Sevilla (SPAIN)
{arjona}@lsi.us.es

Abstract. The Web is the largest existing information repository. Extracting information from this enormous repository has attracted the interest of many researchers, who have devised algorithms (wrappers) able to extract structured syntactic information automatically. The results we present in this paper are novel in that we extend the current capabilities of wrappers so that they can extract knowledge from the Web. This task is carried out by means of semantic translators, which use induction to automatically add meaning to the information extracted by a wrapper. Our proposal, which we have named semantic wrappers, helps achieve semantic interoperability between agents and lets us define new mechanisms, based on semantic checking, to verify the validity of the extracted information.

Keywords: Web agents, knowledge extraction, wrappers, and ontologies.

1 Introduction

In recent years, the web has consolidated itself as one of the most important knowledge repositories. Furthermore, the technology has evolved to a point where sophisticated new-generation web agents proliferate. They enable efficient, precise, and comprehensive retrieval and extraction of information from the vast web information repository. They can also circumvent some problems related to slow Internet access and free up prohibitively expensive surf time by operating in the background. It is thus not surprising that researchers are so interested in agents [45]. A major challenge for them has become sifting through an unwieldy amount of data to extract meaningful information. This process is difficult for the following reasons: first, the information on the web is mostly available in human-readable form, lacking the formalised semantics that would help agents use it [4]; second, the information sources are likely to change their structure, which usually has an impact on their presentation but not on their semantics [5, 34, 44].

There is a new proposal called the semantic web that aims to solve these problems. [4] defines it as “an extension to the current web in which information is given well–defined meaning, better enabling computers and people to work in cooperation.” It should simplify and tremendously improve the accuracy of current information extraction techniques [7, 30, 40, 39, 43]. Nevertheless, this extension requires a great deal of effort to annotate current web pages with semantics, which suggests that it is not likely to be adopted in the immediate future [14]. Several authors have worked on techniques for extracting information from today’s non-semantic web, and inductive wrappers are the most popular ones [7, 30, 40, 39, 33, 35]. They are components that use automated learning techniques to extract information from similar pages automatically; furthermore, they deal with changes, so that the extraction process is not invalidated if the layout of a web page changes. Although inductive wrappers are well suited to extracting information from the web, they do not associate semantics with the data extracted, which is their major drawback.

In this article, we present a new solution for extracting semantically-meaningful information from today’s non-semantic web. It is novel in that it associates semantics with the information extracted, which improves agent interoperability; it can also deal with changes to the structure of a web page, which improves adaptability; furthermore, it achieves a complete separation between the data extraction procedure and the logic or base functionality an agent encapsulates. We also report on several experiments showing that our framework is effective and enables the design and implementation of clean, reusable, understandable agents with a clear separation of concerns.
The way to reach this goal is to extend current inductive wrappers so that they can extract knowledge from the web; we call these new wrappers, able to deal with web knowledge, semantic wrappers. Semantic wrappers give meaning to the information extracted by current wrappers using semantic translators, which are developed by means of inductive techniques. The result of applying a specific semantic translator is a piece of information annotated with ontologies. Extracting information with meaning lets us define new possibilities for verifying the correct functioning of wrappers. Under the name semantic verification, we briefly present a new mechanism that applies semantic tests to the extracted information. It validates the semantic relations between the concepts that give meaning to the information and thus complements current syntactic checking.

The rest of the paper is organized as follows: the next section presents our motivation and glances at other proposals; Section 3 presents a model for current wrappers; Section 4 extends the model to address the problem of knowledge extraction; Section 5 gives the reader an insight into our framework; finally, Section 6 summarizes our main conclusions and future research directions.

Fig. 1. A web page that shows information about golfers’ scores.

2 Motivation and Related Work

The incredible success of the Internet has paved the way for technologies whose goal is to enhance the way humans and computers interact on the web. Unfortunately, information that a human user can easily interpret is usually difficult for a web agent to extract and interpret. This is the reason why such enhancements are usually viewed as problems from an agent programmer’s point of view. Figure 1 shows two views of a web page picked from scores.golfweb.com. If we were interested in extracting the information automatically, the following issues would arise immediately:

– The implied meaning of the terms that appear in this page can be easily interpreted by humans, but there is no reference to an ontology that describes them precisely, which complicates communication and interoperability amongst agents [4, 3].
– The layout and the appearance of a web page may change unexpectedly. For instance, web sites often incorporate Christmas banners in December, which does not change the meaning of the information they provide, but may unexpectedly invalidate the automatic extraction methods used so far [5, 19, 34, 44].
– Access to the page that contains the information in which we are interested may involve navigating through a series of intermediate pages, e.g., login or index pages. Furthermore, this information may be spread over a set of pages.
– We cannot use known natural language processing techniques because the information in a web page is not usually in sentential form [12]. Furthermore, it is not easy to delimit the scope of a piece of data because of the HTML tags used to specify how to render it, which implies these techniques are not appropriate in general [24].

In view of these issues, several researchers began working on proposals whose goal is to achieve a clear separation between presentation concerns and data. XML [21] is one of the most popular languages for representing structured data, but, unfortunately, it lacks a standardised way to link data with an abstract, formal description of their semantics. There are many proposals that aim at solving this problem, and they usually rely on annotating web pages with instances of ontologies written in languages such as DAML+OIL [8, 27], SHOE [36] or RDF-Schema [6]. Most authors agree that a web in which pages are annotated with semantics would be desirable because this would help web agents extract information and understand their contents, and would enhance semantic interoperability [3]. Unfortunately, there are very few annotated pages compared with the total number of pages. As of the time of writing this article, the DAML crawler (www.daml.org/crawler) reports 18,288 annotated web pages, which is a negligible figure compared with 2.1 billion, the estimated number of pages as of July 2000 [28]. Furthermore, about 7 million web pages are created every day, which suggests that the semantic web is not likely to be adopted in the near future [26]. This argues for an automatic solution to extract information in the meanwhile. Several authors have worked on techniques for extracting information from today’s non-annotated web, and inductive wrappers are amongst the most popular ones [7, 30, 43, 40, 39, 33].
They are components that use a number of extraction rules generated by means of automated learning techniques such as inductive logic programming, statistical methods, and inductive grammars. These techniques use a number of web pages as samples to feed an algorithm that uses induction to generalise a set of rules that make it possible to extract information from similar pages automatically. Recently, researchers have put a great deal of effort into dealing with changes, so that extraction rules can be regenerated on the fly if the layout of a web page changes [33]. Although inductive wrappers are well suited to extracting information from the web, they do not associate semantics with the extracted data, which is their major drawback. The information thus extracted is a piece of text, and it does not allow for semantic interoperability amongst agents. For instance, when a wrapper extracts information about a golfer from a web page, the result is a piece of text that can later be converted into a record, but it has no relation to an ontology describing golfers properly. Therefore, confusion may arise if the agent that uses such a wrapper passes this information on to another agent.

There are also some related proposals in the field of databases, e.g., TSIMMIS [20] and ARANEUS [37]. Their goal is to integrate heterogeneous information sources such as traditional databases and web pages so that the user can work on them as if they were a homogeneous information source. However, these proposals lack a systematic way to extract information from the web because extraction rules need to be implemented manually, which makes them neither scalable nor able to recover from unexpected changes on the web. Summing up, there are a number of proposals based on inductive wrappers that make it possible to extract information from the web but do not attempt to associate semantics with it; other proposals rely on user–defined extraction rules, which do not allow them to adapt to the ever-changing web repository. Thus, a solution to the problem of extracting semantically-meaningful information in which these problems are clearly solved would be desirable.

3 Background

In this section we introduce a model to formalize current wrappers. (In the nomenclature used to present the model, we use capital letters to represent sets and small letters for the elements of a set.) The main aim of this basic model is to show that semantic wrappers can be seen as an extension of current wrappers. This lets us take advantage of all the work done by researchers in the wrapper arena. The first three definitions establish the scope and reach of our proposal: the web page, web and query concepts. The web page concept defines the input to our model, whereas the concept of query defines how we can arrive at that page through the web.

Definition 1. From a basic point of view, a Web page is a sequence of symbols of an alphabet Σ (usually the ASCII alphabet). We denote with P the set of all the existing web pages (P ⊆ Σ*).

Working at a higher abstraction level makes it possible to deal with all the possible formats that we can find using the HTTP protocol, and this is an important feature for future model extensions. Figure 1 shows a sample web page from a site that offers information about scores in a PGA golf championship.

Definition 2. The Web can be seen as a graph G = (N, E), where each node n ∈ N is labelled with a URL and refers to a web page p ∈ P, and each arc e ∈ E is a link labelled with certain information and refers to a link between two web pages p1 and p2 in P.

The arcs are labelled with the information needed to pass from one web page to another using the HTTP protocol methods POST and GET. In the simplest case, a direct link between two web pages, no information is necessary; if we arrive at a web page after filling in a web form, the information is a set of pairs ‘‘attribute = value’’; and if we arrive after invoking a

SOAP web service (http://www.w3.org/TR/SOAP/), the information is an XML file. The vision of the web as a directed labelled graph is adopted by many researchers [29, 31], and it provides the mechanisms necessary to model some problems related to the web, such as web querying.

Definition 3. A Query on the Web is a finite set of paths (sequences of arcs) that leave from the same origin node. The result of executing a Query is the set of nodes obtained by applying each one of the routes specified in the Query.

Example 1. According to the previous definitions, the following query can be defined:

c0 = {⟨(n0, n1, e1), (n1, n2, e2)⟩, ⟨(n0, n3, e3), (n3, n4, e4)⟩}

This query is composed of two paths: the first one allows us to arrive from node n0 at node n2, and the second one goes from node n0 to node n4. The first sequence indicates that from a node n0 (a web page identified by a URL) we can arrive at node n1 using the arc e1, and from n1 at n2 by means of e2. The result of executing the query c0 is E(c0) = {n2, n4}.

The previous definition allows us to solve the problem of navigating through the web and to model navigation as a state machine [2]. Since the problem we address here is knowledge extraction, the input to our system is a web page from the result of executing a query, and we will not go into details about how queries are modelled and executed. Next, we specify how to represent the information from a web page that we are interested in. The notation used in the next definition is adopted from the RDF data model defined by the W3C. In this model, Resources are used to describe anything (web pages, people, flowers, etc.) and Properties express aspects, characteristics, attributes, or relations used to describe a resource.

Definition 4. An Information Tuple (T ∈ Mn×m) is a matrix that represents the information of interest in a structured way [18].
A row contains information about several related resources and each column represents a property of some of the resources. The matrix elements belong to the Kleene closure of the alphabet in which the web page has been codified.

Example 2. If we analyze the information about golfers’ results provided in the web page of Figure 1, we identify different players, along with the score and position of each one. A tuple that represents each player’s name, total number of points and position is the following one:

t = ( “Rich Beem”       278  1
      “Tiger Woods”     279  2
      “Chris Riley”     283  3
      “Justin Leonard”  284  4
      ...               ...  ... )
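As a minimal sketch of Definition 4, the information tuple of Example 2 can be rendered as a matrix in code (a list of rows whose cells are strings over the page's alphabet); the 1-based access helper is our own illustration of the paper's t[i][j] notation.

```python
# An information tuple as an n-by-m matrix: each row describes related
# resources; each column is a property (name, total points, position).
t = [
    ["Rich Beem",      "278", "1"],
    ["Tiger Woods",    "279", "2"],
    ["Chris Riley",    "283", "3"],
    ["Justin Leonard", "284", "4"],
]

def cell(tup, row, col):
    """1-based access mirroring the paper's t[row][col] notation."""
    return tup[row - 1][col - 1]

print(cell(t, 2, 1))  # "Tiger Woods"
```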

In this tuple (t ∈ T), we have information about two resources. The first is the golfer, and the second is the score of this golfer. The golfer has a property name that identifies him, and the score has two properties: the number of points obtained and the position in the ranking. Representing the information by means of a matrix allows us to have a structured vision of the extracted data and to establish a location for each one of the properties. For instance, we can identify the golfer ‘‘Tiger Woods’’ using the expression t[2][1].

Next, we define the concept of wrapper that was presented in an intuitive way in the introduction. We model wrappers as black boxes specialised in extracting structured information from a certain web site. This allows us to extend a greater number of current wrappers in order to deal with knowledge.

Definition 5. A Syntactic wrapper is a function W : P → T. Given a web page, it gives back a tuple with the information of interest.

We qualify the wrappers existing at present as syntactic wrappers, since they extract syntactic information devoid of any semantic formalisation expressing its meaning. Although the implied meaning of some of the properties extracted from a web page can easily be interpreted by humans, understanding it is impossible for software applications, which complicates communication and interoperability amongst them. Finally, another problem to consider is caused by the dynamism of web sites: the layout and appearance of a web page may change unexpectedly. For instance, web sites often incorporate Christmas banners in December, which does not change the meaning of the information they provide, but may unexpectedly invalidate the automatic extraction methods used so far. In this sense, we need mechanisms that help us decide whether a wrapper works correctly or, on the contrary, has been invalidated by changes in the structure of the web page.
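A toy syntactic wrapper in the sense of Definition 5 can be sketched as a function from a page (a string) to an information tuple; the HTML fragment and the regular expression below are illustrative assumptions standing in for the induced extraction rules of real wrappers.

```python
import re

def wrapper(page):
    """A function W: P -> T, here hard-wired to a three-column table row."""
    pattern = r"<td>([^<]+)</td><td>(\d+)</td><td>(\d+)</td>"
    return [list(m) for m in re.findall(pattern, page)]

page = ("<tr><td>Rich Beem</td><td>278</td><td>1</td></tr>"
        "<tr><td>Tiger Woods</td><td>279</td><td>2</td></tr>")

t = wrapper(page)
print(t)  # [['Rich Beem', '278', '1'], ['Tiger Woods', '279', '2']]
```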
Definition 6. A syntactic wrapper is valid (syntactic verification) if the information extracted and structured in a tuple by the wrapper is the information we are interested in. The decision about the wrapper’s validity is based only on syntactic properties of the extracted information.

In order to verify the validity of a wrapper, we define a predicate SyntVerification(T, Ω). This predicate is satisfied if the tuple extracted by the wrapper is correct in the context of Ω, where Ω refers to some parameters; if we give a concrete definition for Ω, we obtain a concrete syntactic checker of wrappers.

Example 3. The syntactic verification algorithm RAPTURE [33] defined by Kushmerick can be seen as such a predicate. It takes as input the tuple of information extracted and a set of tuples (Ω) referring to extractions already validated in the past, and it is satisfied according to some statistical relations.
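The idea behind a SyntVerification(T, Ω) predicate can be sketched as follows; here Ω is assumed to be a set of previously validated tuples, and the check compares only crude syntactic features (the digit/text shape of each column). This is an illustration of the intuition behind RAPTURE-style checking, not the algorithm of [33].

```python
import re

def shape(row):
    """Classify each cell as numeric or textual."""
    return tuple("num" if re.fullmatch(r"\d+", c) else "text" for c in row)

def synt_verification(t, omega):
    """Satisfied if every row of t matches a shape seen in past extractions."""
    expected = {shape(row) for past in omega for row in past}
    return all(shape(row) in expected for row in t)

omega = [[["Rich Beem", "278", "1"], ["Tiger Woods", "279", "2"]]]
good = [["Chris Riley", "283", "3"]]
bad = [["<td>283</td>", "Chris Riley"]]  # a layout change garbled the columns

print(synt_verification(good, omega), synt_verification(bad, omega))  # True False
```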

4 Extending the possibilities of current wrappers

Now we extend the model presented for syntactic wrappers to include the semantic aspects related to the extracted information. Our goal is to develop a framework, based on the model, that helps us address the problem of automatic knowledge extraction from the Web. We use ontologies [9] to specify the meaning of the concepts about which we are extracting information. The ontologies describe the concepts that define the semantics associated with the information we extract from a high-level, abstract point of view. An ontology allows us to define a common vocabulary by means of which different applications may interoperate semantically. Some authors [23, 25] have specified a formal model for ontologies; our vision of the formalisation of the term ontology is in agreement with the one presented by J. Heflin in his PhD dissertation. We begin with a simplified formalisation of the term ontology that does not consider relations among different ontologies, because we do not want to complicate the study presented in this paper.

Definition 7. An ontology is a tuple ⟨V, A⟩ defined in some first-order logical language L, where V is the vocabulary of predicate symbols of L and A is a set of axioms, that is, a set of well-formed formulas of L.

If we analyze the previous definition, an ontology can be seen as a new logical language, a subset of L, that has a set of predicate symbols and axioms defining the relations between these symbols. We use a first-order language to describe knowledge because it offers the power and flexibility needed; in this sense, many knowledge representation languages and structures, such as semantic networks [42] and frame systems [38], can be formulated in first-order logic.

Example 4. The following set of axioms defines a possible ontology o that specifies the concepts in which we are interested from the web page in Figure 1:

o = ⟨ {Person, Golfer, Score},
      { ∀x (Golfer(x) ⇒ Person(x)),
        ∀x ∃y, z (Golfer(x) ⇒ Score(x, y, z)) } ⟩

In this ontology (o ∈ O) we have three predicate symbols: Person, Golfer and Score. The first formula asserts, naturally, that a Golfer is a Person, and the second one says that a Golfer has a Score, where y represents the total number of points obtained and z is the position in the championship. Given the last definition, we are able to express ontologies, but how can we express the instances of an ontology? That is to say, how can we state that Tiger Woods is a golfer? In order to answer these questions, it is necessary to define what we understand by information with meaning. The following definition addresses this subject.

Definition 8. Let Lo be an ontology; a semantic tuple is the result of properly associating the information in a tuple with the concepts defined using Lo. In other words, if F is the set of all the well-formed formulas that can be obtained from the logical language (ontology) Lo, a semantic tuple is a subset Ts ⊂ F in which concepts are associated with the extracted information.

Example 5. The semantic tuple for the information we are interested in, making use of the ontology o, is:

ts = { Golfer(“Rich Beem”), Score(“Rich Beem”, 278, 1),
       Golfer(“Tiger Woods”), Score(“Tiger Woods”, 279, 2),
       Golfer(“Chris Riley”), Score(“Chris Riley”, 283, 3),
       Golfer(“Justin Leonard”), Score(“Justin Leonard”, 284, 4) }
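A semantic tuple like the one in Example 5 can be sketched as a set of ground atoms, each pairing a predicate symbol from the ontology's vocabulary with a tuple of constants; this set-of-pairs encoding is our own illustration, not the paper's formal machinery.

```python
# The semantic tuple ts of Example 5 as (predicate, arguments) atoms.
ts = {
    ("Golfer", ("Rich Beem",)),      ("Score", ("Rich Beem", 278, 1)),
    ("Golfer", ("Tiger Woods",)),    ("Score", ("Tiger Woods", 279, 2)),
    ("Golfer", ("Chris Riley",)),    ("Score", ("Chris Riley", 283, 3)),
    ("Golfer", ("Justin Leonard",)), ("Score", ("Justin Leonard", 284, 4)),
}

# Applying the axiom "every Golfer is a Person" infers new atoms.
persons = {("Person", args) for pred, args in ts if pred == "Golfer"}
print(("Person", ("Tiger Woods",)) in persons)  # True
```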

Based on the concepts defined previously, it would be interesting to have a mechanism that automatically gives meaning to the information tuple obtained by means of a syntactic wrapper. The following definition goes in this direction and establishes the way of extending syntactic wrappers into semantic wrappers.

Definition 9. A semantic translator is a function ∆ : T × Ξ → Ts that receives as input the tuple obtained using a syntactic wrapper, together with some information Ξ, and outputs a semantic tuple.

The objective of this function is to assign semantics to the structured information of a tuple. The semantic translator can be seen as a function that maps the structured information extracted by a wrapper to a set of well-formed formulas of some ontology language Lo. Ξ refers to some parameters; if we give a concrete definition for Ξ, we have a concrete semantic translator. In the next section, we present a semantic translator in which Ξ has a concrete meaning.

Example 6. The result of applying the appropriate semantic translator δ ∈ ∆ to the previous tuple would be δ(t, ξ) = ts. In this way, we obtain knowledge from structured information.

Now, we are ready to give a definition for the core element of our proposal, the Semantic Wrapper.

Definition 10. A Semantic Wrapper is a function Ws : P → Ts. Given a web page, it returns a semantic tuple with the information of interest.
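The translator of Definition 9 can be sketched as follows. The paper instantiates Ξ as a parameterized semantic graph; here, purely for illustration, we assume a simpler shape for Ξ: a list of (predicate, column-indices) pairs telling the translator which columns of each row instantiate each predicate.

```python
def translate(t, xi):
    """A sketch of Delta: T x Xi -> Ts over a column-mapping parameter xi."""
    ts = set()
    for row in t:
        for predicate, columns in xi:
            ts.add((predicate, tuple(row[c] for c in columns)))
    return ts

t = [["Rich Beem", 278, 1], ["Tiger Woods", 279, 2]]
xi = [("Golfer", [0]), ("Score", [0, 1, 2])]

print(("Score", ("Tiger Woods", 279, 2)) in translate(t, xi))  # True
```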

A Semantic Wrapper can be seen as the natural extension of current syntactic wrappers with the aim of adding semantics to the extracted information, in this way:

ws(p) = δ(w(p), ξ) = ts, where ws ∈ Ws, p ∈ P, δ ∈ ∆, ξ ∈ Ξ, ts ∈ Ts and w ∈ W.

The information with meaning extracted by a semantic wrapper can be used to infer new knowledge. Besides, it enables reuse; in this sense we are favouring semantic interoperability. For instance, from the semantic tuple of our example we can infer that the golfers ‘‘Rich Beem’’ and ‘‘Tiger Woods’’ are humans, according to the axiom ∀x(Golfer(x) ⇒ Person(x)).

Definition 11. A semantic wrapper is valid (semantic verification) if the information extracted with meaning in a semantic tuple is the information that we are interested in. The decision about whether the wrapper is valid is based on semantic properties of the extracted information.

Semantic verification allows us to check the relations existing amongst the different concepts about which the information is extracted. In this way, we become aware of non-syntactic errors in the information provided by a web site; for example, we can check that the same golf player is not in more than one position at the same time. In order to verify the semantic validity of a wrapper, a predicate SemVerification(Ts, Lo, R) is defined. This predicate is satisfied if the semantic tuple fulfils all the relations and properties defined by means of axioms in the ontological language Lo. Furthermore, Ts must fulfil the constraints expressed in the set of well-formed formulas R. Formulas in R are defined using the predicate vocabulary of Lo and represent additional constraints that we require of the concepts about which we are extracting information.

Example 7. In this example, we show a possible definition of R. The constraint r ∈ R states that two different golfers cannot have the same position in the ranking.
This constraint can help us avoid inconsistencies such as Rich Beem and Tiger Woods both being in the first position:

r = { ∀x, y, z, v, w, t (Score(x, z, v) ∧ Score(y, w, t) ∧ x ≠ y ⇒ v ≠ t) }

Notice that this constraint could be modelled in the ontology itself, but it is important to consider that, according to the Knowledge Engineering community [22], ontologies are designed to model domain knowledge so as to enable knowledge sharing and reuse in a flexible, easy way. Therefore, not separating the ontology from the constraints required of the extracted information could hinder the degree of reuse and complicate the ontology maintenance task. To conclude, Figure 2 summarises the different concepts presented in this model, as well as the relationships between them, by means of a conceptual map.
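The constraint r of Example 7 can be turned into an executable check, restricting SemVerification to this single formula; the set-of-atoms encoding of the semantic tuple is an illustrative assumption of this sketch.

```python
def sem_verify(ts):
    """Satisfied if no two distinct golfers hold the same ranking position."""
    scores = [args for pred, args in ts if pred == "Score"]
    for name1, _, pos1 in scores:
        for name2, _, pos2 in scores:
            if name1 != name2 and pos1 == pos2:
                return False  # two golfers share a position: inconsistent
    return True

ok = {("Score", ("Rich Beem", 278, 1)), ("Score", ("Tiger Woods", 279, 2))}
bad = {("Score", ("Rich Beem", 278, 1)), ("Score", ("Tiger Woods", 279, 1))}
print(sem_verify(ok), sem_verify(bad))  # True False
```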

[Figure: a conceptual map relating a Syntactic Wrapper (returns an Information Tuple, i.e., structured information; its validity is checked by Syntactic Verification) and a Semantic Wrapper (extends the syntactic wrapper, uses a Semantic Translator and ontologies, returns a Semantic Tuple, i.e., structured information with meaning; its validity is checked by Semantic Verification).]

Fig. 2. A conceptual map of the model.

5 Semantic wrappers for knowledge extraction

Having presented the necessary background, we now go into further detail to show how semantic wrappers are dealt with. The concepts presented in the model are instantiated at a lower abstraction level in order to answer the following questions:

– How is a semantic wrapper built? (Semantic wrapper generation.)
– How is a semantic wrapper used? (Knowledge extraction.)
– How is the correct functioning of a semantic wrapper tested? (Semantic verification.)

We illustrate these ideas by means of our simple example, in which we were interested in extracting information about the scores of golfers in a PGA Championship. This information was given at http://www.golfweb.com. Figure 1 shows a web page from this site. Specifically, we were interested in getting the golfer’s name, his position in the classification and the total number of points obtained.

5.1 Semantic wrapper generation

According to the model presented in the previous section, we need to generate a syntactic inductive wrapper and a semantic translator that extends it. Current inductive wrappers, as shown previously, can be seen as functions that take as input a web page and give back the structured information of interest. It is the responsibility of the semantic translator to give meaning to the information extracted. Figure 3 illustrates the process of semantic wrapper generation.

[Figure: a Wrapper Generator, fed with test web pages and data, produces a Syntactic Wrapper; a Semantic Translator Generator, fed through a GUI with ontologies, semantic annotations and data, produces a Parameterized Semantic RDF Graph.]

Fig. 3. Semantic wrapper generation.

Syntactic wrapper generation. In our experiment, we make use of the HLRT syntactic wrappers defined by Kushmerick. Basically, the operation of these wrappers is based on the identification of delimiter strings that locate the information of interest. The algorithm that generates the extraction rules (the wrapper generator), which define how to access the information in which we are interested, and the algorithm that extracts information using the obtained rules (the syntactic wrapper) appear in [32]. To generate the extraction rules, we have to feed the syntactic wrapper generator with a set of pairs of the following form:

{(p1, t1), (p2, t2), ..., (pk, tk)}; k ≥ 1

(The value of k depends on the structure of the web page; characteristic values can be found in the description given by the author of this kind of wrapper.)

where pi denotes a web page containing sample data, and ti is the tuple of data that must be extracted from this web page. With this information, we can apply the induction algorithm defined by Kushmerick to generate the set of extraction rules R1, R2, ..., Rm. This set of rules makes up the syntactic wrapper. In our case study, if we feed some sample data to the HLRT wrapper generator, the following rules are generated:

– A piece of HTML source indicating the beginning of the data (Head): r1 = “Total</b></font></td></tr>”.
– A piece of HTML source indicating the end of the data (Tail): r2 = “</table>”.
– A pair of delimiters for each property (Left and Right). For the name property of the resource golfer these would be: r3 = “">” (left), r4 = “</td>” (right).

Semantic translator generator. At a lower abstraction level, we represent knowledge using some web ontology language such as DAML+OIL [8, 27], SHOE [36], RDF-Schema [6] or OWL [13]. These languages give us the possibility of defining metadata that express the meaning of information resources. Annotating information with semantics using web ontology languages is the mechanism proposed by most researchers in this arena in order to establish the Semantic Web. This idea is not inconsistent with the model presented, and the justification for using one of these languages rather than working directly with a logical language is given by axiomatisation. Axiomatisation [15] is a mapping of a set of descriptions in any one of these web ontology languages into a logical theory, expressed in first-order predicate calculus, that is logically equivalent to the intended meaning of that set of descriptions. This mapping consists of a simple rule for translating statements from these languages into first-order relational sentences, and a set of first-order logic axioms that restrict the allowable interpretations of the non-logical symbols.
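The translation rule at the heart of axiomatisation can be illustrated with a toy sketch: each statement (subject, property, value) is rendered as a first-order relational sentence. The rendering Property(subject, value) and the triple data below are our own illustrative assumptions, not the mapping defined in [15].

```python
def axiomatise(statements):
    """Render each (subject, property, value) statement as a relational atom."""
    return ["%s(%s, %s)" % (p, s, v) for s, p, v in statements]

# Hypothetical statements describing the golfer resource of our example.
statements = [
    ("golf_00007", "type", "Golfer"),
    ("golf_00007", "name", "Tiger Woods"),
]

print(axiomatise(statements))
```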
In our proposal we make use of RDF-S, defined by the W3C, to represent the ontologies and the instances of these ontologies. (We describe our ontologies using RDF because it is a standard and an important contribution to the semantic web, it is supported by many researchers, and the repository of available tools is quite rich. Using DAML+OIL would not cause any problems, since DAML+OIL is an extension of RDF; therefore, the mapping by means of axiomatisation of RDF statements is sufficient for translating DAML+OIL.) The RDF data model consists of three object types: resources, properties and statements. Resources and properties were presented in Section 4. Statements are composed of a resource together with a property and the value of that property for that resource. RDF models consist of a bag of statements that can be represented using directed labelled graphs. Figure 4 shows the RDF-S ontology [46] that we have used to annotate the information from the web page of Figure 1, together with the corresponding annotated information.

[Figure: two XML listings, the RDF-S golfer ontology and a golfer instance annotated with it.]

Fig. 4. Golfer Ontology and golfer instance.

According to the RDF data model, the golfer instance can be represented using the directed labelled graph of Figure 5. In order to improve its legibility, the URIs where the resources and properties are defined have been omitted.
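The statement bag underlying such a graph can be sketched as a plain list of (resource, property, value) triples. This is an illustrative encoding, not an RDF library; the resource identifiers golf#golf_00007 and golf#golf_00009 follow the figure, and the shortened property names are our own.

```python
# A minimal in-memory statement bag sketching the RDF data model:
# each statement is a (resource, property, value) triple.
TRIPLES = [
    ("golf#golf_00007", "rdf#type",      "golf#Golfer"),
    ("golf#golf_00007", "golf#name",     "Tiger Woods"),
    ("golf#golf_00007", "golf#score",    "golf#golf_00009"),
    ("golf#golf_00009", "rdf#type",      "golf#Score"),
    ("golf#golf_00009", "golf#total",    279),
    ("golf#golf_00009", "golf#position", 2),
]

def values(store, resource, prop):
    """All values of `prop` for `resource`: one edge lookup in the graph."""
    return [v for (r, p, v) in store if r == resource and p == prop]
```

Traversing the graph is then a chain of such lookups, e.g. from the golfer to its score resource and on to the total.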

[Figure: the directed labelled RDF graph of the golfer instance for the tuple ["Tiger Woods", 279, 2]: a golf#Golfer resource (golf#golf_00007) with its golf#name, linked through golf#score to a golf#Score resource (golf#golf_00009) carrying golf#total and golf#position; the tuple row groups the set of related resources.]

Fig. 5. A graph representing a golfer instance.

The semantic translator generator has been defined by means of an inductive algorithm7. It takes as input an example of the data extracted by the syntactic wrapper (a tuple) together with a semantic annotation of those data. With this information, a parameterised semantic graph is generated: we look for the extracted information in the nodes and, if it is found, we mark the node with labels that indicate the position of the information in the tuple. This way, we can give meaning to any other tuple by replacing each position label in the semantic graph with the datum at that position of the tuple. Figure 6 shows the semantic graph for our example. The parameterised semantic graph is the information the semantic translator requires (specified in the model with Ξ) to give meaning to the information tuples.
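The labelling step just described can be sketched as follows. This is a simplified stand-in for the inductive algorithm of Appendix A: it works on a flat triple list rather than a graph, and the ("Pos", positions) marker encoding is our own. A node with more than one candidate position signals the ambiguity that further samples must resolve.

```python
def label_graph(triples, sample_tuple):
    """Induce a parameterised graph from one annotated sample: wherever a
    statement's value equals a datum of the sample tuple, replace it with
    the list of candidate positions (1-based) it occupies in the tuple."""
    param = []
    for resource, prop, value in triples:
        positions = [i + 1 for i, datum in enumerate(sample_tuple)
                     if datum == value]
        param.append((resource, prop,
                      ("Pos", positions) if positions else value))
    return param

# One annotated sample: the golfer instance and the extracted tuple.
SAMPLE_TRIPLES = [
    ("golf#golf_00007", "rdf#type",      "golf#Golfer"),
    ("golf#golf_00007", "golf#name",     "Tiger Woods"),
    ("golf#golf_00009", "golf#total",    279),
    ("golf#golf_00009", "golf#position", 2),
]
PARAM = label_graph(SAMPLE_TRIPLES, ("Tiger Woods", 279, 2))
```

Type statements are left untouched, while every literal that matches a tuple datum becomes a position marker.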

5.2 Knowledge extraction: using the semantic wrapper

Now we present a framework that agent developers can use to extract information with semantics from non–annotated, changing web pages, so that this concern can be clearly separated from the rest in an attempt to reduce development costs and improve maintainability. Before going into details, it is important to note that our notion of agent was drawn from [45]: “Agents have their own will (autonomy), they are able to interact with each other (social ability), they respond to stimuli (reactivity), and they take initiative (proactivity).” We refer to agents that need to interact with the web to retrieve, extract or manage information as web agents.

7 This algorithm can be found in Appendix A.

[Figure: the parameterised version of the graph of Figure 5 for the tuple ["Tiger Woods", 279, 2]: the literal value nodes are replaced by the position markers Pos 1 (golf#name), Pos 2 (golf#total) and Pos 3 (golf#position), and the resource identifiers are generalised to golf#golf_ID; the tuple row groups the set of related resources.]

Fig. 6. A parameterised semantic graph.

Figure 7 sketches the architecture of our proposal. Semantic wrappers (SW) are web agents that make it possible to separate the extraction of information from the logic of an agent; they are able to react to information inquiries (reactivity) from other agents (social ability), and they act in the background to keep the extraction rules they use up to date (autonomy and proactivity). In order to allow for semantic interoperability, the information they extract references a number of concepts in a given application domain that are described by means of ontologies.

The SW extracts the knowledge by accessing the web page that contains the information. It is then the syntactic wrapper's responsibility to extract the structured tuple of information from that web page. At this point, we can use a syntactic verifier such as RAPTURE [33] to test the information in the tuple. The next step consists of giving semantics to the extracted information; this task is carried out by the semantic translator. It takes the parameterised semantic graph and the information tuple, and returns the information in RDF with a precise semantics defined by means of ontologies. This is achieved by replacing the position labels in the graph with the values that appear at those positions in the tuple.

There is also an agent broker [16] for information extraction that acts as a trader between the agents that need knowledge from the web and the set of available semantic wrappers. When an agent needs some information, it contacts the broker, which redirects the request to the appropriate semantic wrapper, if possible. This way, agents need not be aware of the existence of the different semantic wrappers, which can thus be adapted, created or removed from the system transparently. However, every time a semantic wrapper is created or destroyed, it must be registered or unregistered so that the broker knows about it. The broker therefore keeps a catalogue with the description of every SW in the system (yellow pages).
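The semantic translation step described above, substituting tuple values for the position labels of the parameterised graph, can be sketched as follows. This is an illustrative encoding, assuming the ("Pos", i) marker convention and the hypothetical golf#golf_ID identifiers.

```python
def instantiate(param_graph, row):
    """Semantic translation: produce concrete RDF statements for a freshly
    extracted tuple by substituting each ('Pos', i) marker with the datum
    found at position i (1-based) of the tuple."""
    return [(r, p, row[v[1] - 1] if isinstance(v, tuple) and v[0] == "Pos"
             else v)
            for r, p, v in param_graph]

# A parameterised graph with position markers in place of literal values.
PARAM = [
    ("golf#golf_ID",  "rdf#type",      "golf#Golfer"),
    ("golf#golf_ID",  "golf#name",     ("Pos", 1)),
    ("golf#score_ID", "golf#total",    ("Pos", 2)),
    ("golf#score_ID", "golf#position", ("Pos", 3)),
]
STATEMENTS = instantiate(PARAM, ("Sergio Garcia", 281, 4))
```

The same parameterised graph thus gives meaning to any tuple extracted from the same source, which is what makes the wrapper resilient to presentation changes.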
We use ACL [17] as a transport language to send messages from one agent to another. Their contents describe how an agent wants to interact with another,

[Figure: the architecture. A semantic wrapper (web agent) is composed of a syntactic wrapper, which extracts information from the web, and a semantic translator driven by the parameterised semantic RDF graph; user agents send queries through a broker that dispatches them to the registered semantic wrappers (Sem Wrapper 1 ... Sem Wrapper n), all running on a FIPA agent platform.]

Fig. 7. Knowledge Extraction.

and it is written in RDF. Figure 8 shows the brokering protocol [16] used to communicate user agents with the semantic wrappers, using the AUML notation [41]. When an initiator agent sends a message with the performative proxy to the broker, the broker replies with one of the following standard messages: not-understood, refuse or agree. If the broker agrees on the inquiry, it then searches for an adequate SW to serve it; if none is found, it sends a failure-no-match message to the initiator; otherwise, it tries to contact the SW and passes the inquiry on to it. If the broker succeeds in communicating with the SW, the latter shall later send the requested information to the initiator; otherwise, a failure-com-SW message is sent back to the initiator, which indicates that an appropriate SW exists but cannot respond. Once we have set up a semantic wrapper, we can send messages to it, by means of the broker, in order to extract information about a given golfer. The content of the messages in DAML is based on an ontology that defines the communication [11]. This ontology is illustrated in Figure 9.
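The broker's dispatching behaviour can be sketched with a small registry. This is a minimal stand-in for the FIPA brokering protocol: wrappers are plain callables, the catalogue keys are ontology concepts, and the failure performatives of the protocol are emulated as tagged replies.

```python
class Broker:
    """Yellow-pages broker: wrappers register under the ontology concept
    they serve; a proxy inquiry is redirected to the matching wrapper, and
    the protocol's failure performatives are emulated as tagged replies."""

    def __init__(self):
        self.catalogue = {}                 # concept -> wrapper callable

    def register(self, concept, wrapper):
        self.catalogue[concept] = wrapper

    def unregister(self, concept):
        self.catalogue.pop(concept, None)

    def proxy(self, concept, query):
        wrapper = self.catalogue.get(concept)
        if wrapper is None:
            return ("failure-no-match", None)
        try:
            return ("inform-result-SW", wrapper(query))
        except Exception:                   # SW exists but cannot respond
            return ("failure-com-SW", None)
```

User agents thus see a single entry point, and wrappers can be registered or unregistered without the agents noticing.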

[Figure: AUML sequence diagram among Initiator, Broker and Semantic Wrapper, showing the performatives proxy, not-understood, refuse, agree, failure-no-match (when the broker cannot find any semantic wrapper for the request), request, failure-com-SW and inform-result-SW (when the broker finds a semantic wrapper for the request).]

Fig. 8. Broker interaction protocol in AUML.

[Figure: UML class diagram of the content ontology. ExtractInfo (label: String, comment: String, about: String) has attributeRestrictions to Attribute (name: String, value: String); ExtractInfoResult is specialised into Ok (NumberRecords: Integer, Records), Warning (code: Integer, description: String) and Error (code: Integer, description: String).]

Fig. 9. Ontology for the content language.

Information requests are expressed as instances of class ExtractInfo. The reply from the semantic wrapper is an instance of ExtractInfoResult: an error message (for instance, the wrapper is not able to access the web page that contains the information), a warning message (for instance, 0 records have been found), or the information requested by the agent (as instances of the ontology class that defines the wrapper). Figure 10 shows two sample messages. The first is a request for information about "Tiger Woods" from an agent called Agent-1 to the broker agent; the second one is the reply from the SW to Agent-1.
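The shape of these messages can be sketched with two small constructors. The class and attribute names follow Figure 9; the dictionary encoding, warning code and wording are our own illustrative choices, not the paper's DAML serialisation.

```python
def extract_info(about, restrictions):
    """Build an ExtractInfo inquiry: the ontology class whose instances
    are wanted, plus attribute restrictions filtering them."""
    return {"type": "ExtractInfo",
            "about": about,
            "attributeRestrictions": [{"name": n, "value": v}
                                      for n, v in restrictions]}

def extract_info_result(records):
    """Build the reply: Ok carries the records; an empty result is
    reported as a Warning, mirroring the '0 records found' case."""
    if records:
        return {"type": "Ok",
                "numberRecords": len(records),
                "records": records}
    return {"type": "Warning", "code": 1,
            "description": "0 records have been found"}
```

A request for a golfer named "Tiger Woods" and its reply are then two such dictionaries travelling inside ACL messages.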

[Figure: two ACL messages whose content is written in RDF, a request from Agent-1 asking the broker for information about the golfer "Tiger Woods", and the inform reply from the semantic wrapper carrying the extracted golfer instance.]

Fig. 10. Example of messages.

5.3 Semantic verification

Once the knowledge has been extracted, we can use axiomatization to translate the semantic markup obtained by the knowledge extraction process into a first-order language; normally, this language is expressed using the ANSI standard KIF (Knowledge Interchange Format)8. This knowledge can then be checked if we previously feed automatic theorem provers and problem solvers with the axioms that represent the ontologies used to mark up the information, together with the set of constraints defined in addition.

8 http://logic.stanford.edu/kif/dpans.html.
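The verification idea can be illustrated without a theorem prover. In this sketch, plain Python predicates stand in for the KIF axioms and additional constraints; the constraint name and the golf#position rule are hypothetical examples of domain restrictions.

```python
def verify(statements, constraints):
    """Check the extracted statements against a set of named constraints
    (predicates standing in for axioms fed to a theorem prover) and
    return the names of the violated ones."""
    return [name for name, holds in constraints if not holds(statements)]

# Hypothetical domain constraint: every golf#position value must be an
# integer greater than or equal to 1.
CONSTRAINTS = [
    ("valid-position",
     lambda sts: all(isinstance(v, int) and v >= 1
                     for _, p, v in sts if p == "golf#position")),
]
```

An empty violation list means the extracted knowledge is consistent with the constraints; a non-empty one signals that the wrapper is likely extracting the wrong data.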

6 Conclusions and future work

The rapid evolution of the Internet demands an ever–increasing ability to adapt to a medium that is in continual change and evolution. Web agents enable efficient searches on the Internet, but they face several important problems [1, 2, 10] when they need to extract information, because the current web is mostly user–oriented. The Semantic Web shall help extract information with well–defined semantics, regardless of the way it is rendered, but it does not seem that it is going to be adopted in the immediate future, which argues for another solution to the problem in the meanwhile.

In this article, we have presented a new approach to knowledge extraction from web sites based on semantic wrappers. It improves on other proposals in that it associates semantics with the extracted information, and it can also deal with changes because the information is extracted by means of current wrappers. Furthermore, a new mechanism based on semantic verification has been defined in order to check the validity of a wrapper.

In the future, we are going to work on an implementation of a framework in which data sources can be more heterogeneous (databases, news servers, mail servers, and so on). The extraction of knowledge from multimedia sources such as videos, images, or sound files will also be paid much attention.

Acknowledgments The work reported in this article was supported by the Spanish Interministerial Commission on Science and Technology under grant TIC2000-1106-C02-01.

References

[1] J. L. Arjona, R. Corchuelo, A. Ruiz, and M. Toro. Automatic extraction of semantically-meaningful information from the web. In Adaptive Hypermedia and Adaptive Web-Based Systems, Second International Conference, AH 2002, volume 2347 of Lecture Notes in Computer Science, pages 24–35. Springer, 2002.
[2] J. L. Arjona, R. Corchuelo, A. Ruiz, and M. Toro. A practical agent-based method to extract semantic information from the web. In Advanced Information Systems Engineering, 14th International Conference, CAiSE 2002, volume 2348 of Lecture Notes in Computer Science, pages 697–700. Springer, 2002.
[3] T. J. Berners-Lee, R. Cailliau, and J.-F. Groff. The World-Wide Web. Computer Networks and ISDN Systems, 25(4–5):454–459, November 1992.
[4] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001.
[5] B. E. Brewington and G. Cybenko. Keeping up with the changing web. Computer, 33(5):52–58, May 2000.
[6] D. Brickley and R. V. Guha. Resource Description Framework schema specification 1.0. Technical Report http://www.w3.org/TR/2000/CR-rdf-schema-20000327, W3C Consortium, March 2000.
[7] W. W. Cohen and L. S. Jensen. A structured wrapper induction system for extracting information from semi-structured documents. In Proceedings of the Workshop on Adaptive Text Extraction and Mining (IJCAI'01), 2001.
[8] D. Connolly, F. van Harmelen, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. DAML+OIL: Reference description. Technical Report http://www.daml.org, Defense Advanced Research Projects Agency, October 2000.
[9] O. Corcho and A. Gómez-Pérez. A road map on ontology specification languages. In Proceedings of the Workshop on Applications of Ontologies and Problem Solving Methods, 14th European Conference on Artificial Intelligence (ECAI'00), pages 80–96, 2000.
[10] R. Corchuelo, J. S. Aguilar, and J. L. Arjona. A framework for extracting information with semantics from the web: An application to knowledge discovery for web agents. The International Journal of Computers, Systems and Signals, October 2002.
[11] S. Cranefield and M. Purvis. Generating ontology-specific content languages. In Proceedings of the Ontologies in Agent Systems Workshop (Agents'01), pages 29–35, 2000.
[12] R. Dale, H. Moisl, and H. Somers, editors. A Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text. Marcel Dekker, New York, 2000.
[13] M. Dean, D. Connolly, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. OWL Web Ontology Language 1.0 reference (working draft 29 July 2002). Technical report, World Wide Web Consortium, July 2002.
[14] D. Fensel, editor. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. The MIT Press, 2002.
[15] R. Fikes and D. McGuinness. An axiomatic semantics for RDF, RDF-S, and DAML+OIL (March 2001). Technical report, World Wide Web Consortium, 2001.
[16] T. Finin, Y. Labrou, and J. Mayfield. KQML as an agent communication language. In J. M. Bradshaw, editor, Software Agents, chapter 14, pages 291–316. AAAI Press/The MIT Press, 1997.
[17] FIPA. FIPA specifications. Technical Report http://www.fipa.org/specifications, The Foundation for Intelligent Physical Agents, 2000.
[18] D. Florescu, A. Y. Levy, and A. Mendelzon. Database techniques for the World-Wide Web: A survey. ACM SIGMOD Record, 27(3):59–74, September 1998.
[19] L. Francisco-Revilla, F. Shipman, R. Furuta, U. Karadkar, and A. Arora. Managing change on the web. In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), pages 67–76, 2001.
[20] H. García-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. Integrating and accessing heterogeneous information sources in TSIMMIS. In Proceedings of the AAAI Symposium on Information Gathering, pages 61–64, March 1995.
[21] C. F. Goldfarb and P. Prescod. The XML Handbook. Prentice-Hall, 2nd edition, 2000.
[22] T. R. Gruber. Towards principles for the design of ontologies used for knowledge sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Publishers, Deventer, The Netherlands, 1993.
[23] N. Guarino. Formal ontology and information systems, 1998.
[24] F. van Harmelen and D. Fensel. Practical knowledge representation for the web. In Proceedings of the IJCAI Workshop on Intelligent Information Integration, July 1999.
[25] J. Heflin. Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment. PhD thesis, University of Maryland, College Park, 2001.
[26] J. Hendler. Agents and the semantic web. IEEE Intelligent Systems Journal, 16(2):30–37, March/April 2001.
[27] I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. Reviewing the design of DAML+OIL: An ontology language for the semantic web. Technical Report http://www.daml.org, Defense Advanced Research Projects Agency, 2002.
[28] J. Kennon and A. Johnson. Sizing the Internet. Technical Report http://www.cyveillance.com/web/newsroom/releases/2000/2000-07-10.htm, Cyveillance, Inc., July 2000.
[29] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The web as a graph: Measurements, models, and methods. In T. Asano, H. Imai, D. T. Lee, S. Nakano, and T. Tokuyama, editors, Proceedings of the 5th Annual International Conference on Computing and Combinatorics (COCOON), number 1627 in Lecture Notes in Computer Science. Springer-Verlag, 1999.
[30] C. A. Knoblock, K. Lerman, S. Minton, and I. Muslea. Accurately and reliably extracting data from the web: A machine learning approach. IEEE Data Engineering Bulletin, 23(4):33–41, 2000.
[31] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In The VLDB Journal, pages 639–650, 1999.
[32] N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, 1997.
[33] N. Kushmerick. Wrapper verification. World Wide Web Journal, 3(2):79–94, 2000.
[34] L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Agarwal. Characterizing web document change. Lecture Notes in Computer Science, 2118:133–144, 2001.
[35] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, pages 611–621, 2000.
[36] S. Luke, L. Spector, D. Rager, and J. Hendler. Ontology-based web agents. In W. L. Johnson and B. Hayes-Roth, editors, Proceedings of the First International Conference on Autonomous Agents (Agents'97), pages 59–68, Marina del Rey, CA, USA, 1997. ACM Press.
[37] G. Mecca, P. Merialdo, and P. Atzeni. ARANEUS in the era of XML. Data Engineering Bulletin, 22(3):19–26, September 1999.
[38] M. Minsky. A Framework for Representing Knowledge. McGraw-Hill, New York, 1975.
[39] I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for semistructured, web-based information sources. In Proceedings of the AAAI-98 Workshop on AI and Information Integration, 1998.
[40] I. Muslea, S. Minton, and C. Knoblock. Wrapper induction for semistructured, web-based information sources. In Proceedings of the Conference on Automated Learning and Discovery (CONALD'98), 1998.
[41] J. Odell, H. Van Dyke, and B. Bauer. Extending UML for agents. In G. Wagner, Y. Lesperance, and E. Yu, editors, Proceedings of the Agent-Oriented Information Systems Workshop at the 17th National Conference on Artificial Intelligence, pages 3–17, 2000.
[42] M. R. Quillian. Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science, 12:410–430, 1967.
[43] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
[44] B. Starr, M. S. Ackerman, and M. Pazzani. Do-I-Care: Tell me what's changed on the web. In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, March 1996.
[45] M. J. Wooldridge and N. R. Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995.
[46] World Wide Web Consortium. Resource Description Framework (RDF) model and syntax specification. Technical report, World Wide Web Consortium, February 1999.

A Some algorithms

The following algorithm shows the knowledge extraction process. It takes as input a web page with the information we are interested in and a parameterised semantic graph, and it returns an RDF file that represents the extracted knowledge. The algorithm passes the web page to the syntactic wrapper so that it extracts the information; then the semantic translator assigns meaning to the extracted information.

KnowledgeExtraction(WebPage p, RDFParamGraph g) : RDFFile
  Tuple t ← syntacticWrapper(p)
  return semanticTranslator(t, g)

The semanticTranslator algorithm uses the parameterised semantic graph to give meaning to the tuple. It visits all the tuple rows and replaces, in each labelled node, the tuple position indicated by the label with the data item that the tuple contains at that position. For each row (an instance of related resources), a semantic graph is created; all these graphs are stored in a forest (Wood). The algorithm's output is the result of converting the graphs into an RDF file.

semanticTranslator(Tuple t, RDFParamGraph g) : RDFFile
  Wood w
  for i = 1 to numberOfRows(t) do
    RDFParamGraph gaux ← g
    for all labelled value node n of gaux do
      replace(label(n), t[i][label(n)]) {replace the tuple position with the information at that position}
    end for
    addGraphToWood(w, gaux)
  end for
  return transformToRDF(w)

The SemanticTranslatorGenerator inductive algorithm is responsible for creating the parameterised semantic graph that semanticTranslator uses to give meaning. It takes as input an array of sample data (each sample being a tuple and an RDF file specifying the meaning of the tuple's information) and generalises the way of giving meaning to other tuples extracted from the same web site. This task is carried out by labelling the graph nodes that carry information with the possible positions (candidates) that they occupy in the tuple. If there is more than one candidate for a node (because a data item is repeated within the same tuple row), the ambiguity is removed using more sample data.

SemanticTranslatorGenerator(array sampledata) : RDFParamGraph
  RDFParamGraph output
  for all (t, f) in sampledata do
    Wood of RDFParamGraph w ← transformToRDFGraph(f)
    for i = 1 to numberOfRows(t) do
      g ← w[i] {take the graph corresponding to the i-th row}
      for j = 1 to numberOfColumns(t) do
        labelNodes(g, findNodesGraph(g, t[i][j]), {j}) {add label j to the nodes of g that contain the data item t[i][j]}
      end for
    end for
    Boolean b ← TRUE
    for all labelled node n in w[1] do
      labelNodes(output, n, ⋂g∈w labels(g, n)) {label node n in output with the intersection of the labels of the same node across all the graphs of the wood}
      if numberLabels(n) > 1 then
        b ← FALSE
      end if
    end for
    if b = TRUE then
      return output {no ambiguity: each value node of output is labelled with exactly one tuple position}
    end if
  end for
  Message("More sample data is needed")
  return output {ambiguity remains; the output is erroneous}