Ontology-based Information Extraction
Cite as:
hdl:2099.1/12662
Document type: Official Master's Final Project
Date: 2011-06-23
Access conditions: Open access
Except where otherwise noted, the contents of this work are subject to the Creative Commons license:
Attribution-NonCommercial-NoDerivs 3.0 Spain
Abstract
Since the World Wide Web (WWW) was presented by
Tim Berners-Lee in 1989, its structure and architecture have been in constant growth
and development. Today the Web has evolved into what is known as the Social
Web or Web 2.0, in which the role of users changes: they become producers as well as
consumers of information, so all users can add and modify content. The
result is an exponential growth of the available content. Although this
increase in information may seem a very interesting feature, the lack of structure
brings some problems: it complicates access to the content, whether manual or
automatic, and prevents IT applications from interpreting it semantically
(Fensel, Bussler et al. 2002). To solve these problems, the Semantic Web
(Berners-Lee, Hendler et al. 2001) has been proposed as a new global initiative.
The Semantic Web is an evolving extension of the World Wide Web in which
the semantics of information and services on the Web is defined, making it possible
for the Web to understand and satisfy the requests of people and machines to use its
content. One of the basic pillars of the Semantic Web concept is the idea of having
explicit semantic information on the Web pages that can be used by intelligent
agents in order to solve complex problems of Information Retrieval and Question
Answering. Consequently, the final objective of the Semantic Web is to be able to
semantically analyze and catalog Web content. This requires a set of structures
to model the knowledge, and a link between that knowledge and the content. The
Semantic Web therefore relies on two basic components: ontologies and semantic
annotations.
Ontologies are formal, explicit specifications of a shared conceptualization. This
means that ontologies are useful for modelling knowledge in a formal, abstract way that
computers can read. With ontologies it is possible to represent concepts,
relations among concepts and even constraints on their use.
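To make the three ingredients named above concrete (concepts, relations among concepts, and constraints on their use), here is a minimal sketch in Python. It is an illustration under stated assumptions, not the representation used in the thesis: the class names, the `locatedIn` relation, and the example concepts are all hypothetical, and real ontologies are normally written in a dedicated language such as OWL.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a toy ontology (hypothetical structure, for illustration only)."""
    name: str
    parents: set = field(default_factory=set)  # names of direct is-a superconcepts


class Ontology:
    def __init__(self):
        self.concepts = {}    # concept name -> Concept
        self.signatures = {}  # relation name -> (domain concept, range concept)
        self.facts = set()    # asserted (subject, relation, object) triples

    def add_concept(self, name, parents=()):
        self.concepts[name] = Concept(name, set(parents))

    def is_a(self, name, other):
        """True if `name` equals `other` or reaches it through is-a links."""
        if name == other:
            return True
        return any(self.is_a(p, other) for p in self.concepts[name].parents)

    def declare_relation(self, predicate, domain, range_):
        """Constrain a relation to a domain concept and a range concept."""
        self.signatures[predicate] = (domain, range_)

    def relate(self, subject, predicate, obj):
        """Assert a triple, rejecting it if it violates the declared constraint."""
        domain, range_ = self.signatures[predicate]
        if not (self.is_a(subject, domain) and self.is_a(obj, range_)):
            raise ValueError(f"{predicate} expects ({domain}, {range_})")
        self.facts.add((subject, predicate, obj))


# Assumed example domain: places and buildings.
onto = Ontology()
onto.add_concept("Place")
onto.add_concept("City", parents=["Place"])
onto.add_concept("Building")
onto.add_concept("Cathedral", parents=["Building"])
onto.declare_relation("locatedIn", domain="Building", range_="Place")
onto.relate("Cathedral", "locatedIn", "City")
```

In this sketch, asserting `onto.relate("City", "locatedIn", "Cathedral")` raises a `ValueError`, because a city is not a building. That is the kind of constraint on use that a formal specification lets a machine check.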
Annotations link knowledge to content. On the one hand,
knowledge is represented by means of ontologies. On the other hand, contents are
pieces of raw text that acquire meaning when they are linked to ontological
concepts.
In recent years, the demand for automated analysis of all this information has led to
growing interest in the research community in developing data mining techniques, such as
knowledge-based data mining and classification algorithms (Batet, Valls et al. 2010),
that can exploit this kind of information. These methods rely on predefined knowledge (such as
ontologies (Guarino 1998)) to semantically interpret textual data and extract more
accurate conclusions from their analyses. They are typically applied over structured
textual attributes which correspond to features of the analysed entities. In these
cases, attribute labels (i.e., words or noun phrases) are interpreted by mapping them
to concepts and analysing the background knowledge structure to which these
concepts belong. However, these methods are rarely able to deal with raw text, from
which relevant features must first be extracted and matched to ontological entities
before the data analysis. As a result, textual documents (which represent most
available Web resources) describing a particular entity (e.g. questionnaires,
Wikipedia entries, etc.) are difficult to process in order to extract the relevant features
that semantically focused data mining algorithms (Hotho, Maedche et al. 2002)
could exploit.
The main problem of the Semantic Web is that it assumes that all Web
content is semantically annotated, which is not yet the case. In response to
this limitation, semantic-based information extraction has emerged. It relies on
ontologies to interpret the textual content of a resource regardless of its
format. Although many conceptual approaches in the field of the
Semantic Web assume that resources have been semantically
annotated, a massive amount of annotated Web resources cannot be expected
in the short term. Therefore, to take advantage of the Web
resources that are currently available, the extraction of features from plain text
proposed in this work relies on the syntactic analysis of the content and its
association with the concepts modelled in one or more input ontologies.
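The idea of associating plain text with ontology concepts can be illustrated with a deliberately simplified sketch. It is not the method of this work: the syntactic analysis mentioned above is replaced here by naive lowercase tokenization and longest-match lookup of word sequences, and the label dictionary `CONCEPT_LABELS` and the function names are hypothetical.

```python
import re

# Hypothetical mapping from surface labels to ontology concept identifiers.
CONCEPT_LABELS = {
    "cathedral": "Cathedral",
    "gothic architecture": "GothicArchitecture",
    "city": "City",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a crude stand-in
    for real syntactic analysis such as noun-phrase detection)."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(text, labels, max_len=3):
    """Return (matched phrase, concept) pairs, preferring the longest match
    that starts at each token position."""
    tokens = tokenize(text)
    annotations = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in labels:
                annotations.append((candidate, labels[candidate]))
                matched = n
                break
        # Skip past a match, or advance by one token if nothing matched.
        i += matched if matched else 1
    return annotations


result = annotate(
    "The cathedral is a landmark of Gothic architecture in the city.",
    CONCEPT_LABELS,
)
```

Here `result` contains the three phrases found in the sentence, each paired with its concept identifier. A realistic system would also need to handle inflection, synonyms and ambiguity, none of which this sketch attempts.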
To sum up, the Semantic Web has brought about growing interest in the research
community in developing semantic data mining techniques. These techniques
can exploit semantic information efficiently, but they depend on structured
input. Unfortunately, most available Web resources are currently raw
text. For all these reasons, it is important to have mechanisms able to take advantage of
raw text.
This work aims to ease the application of semantically-grounded data-mining
algorithms on textual data and semi-structured resources.
Description
Thesis carried out in collaboration with Universitat Rovira i Virgili
Degree: Master's Degree in Artificial Intelligence (2009 curriculum)
File | Description | Size | Format | View |
---|---|---|---|---|
CarlosVicient.pdf | | 5,130Mb | | View/Open |