Ontology-based Information Extraction
Document type: Master thesis
Rights access: Open Access
Since the World Wide Web (WWW) was presented by Tim Berners-Lee in 1989, its structure and architecture have been in constant growth and development. The Web has now evolved into what is known as the Social Web or Web 2.0, in which the role of users changes: they become both consumers and producers of information, so all users are able to add and modify content. The result is an exponential growth of the available content. Although this increase in information may seem a very attractive feature, the lack of structure brings some problems: it complicates access to the information, whether manual or automatic, and the content cannot be interpreted semantically by IT applications (Fensel, Bussler et al. 2002). To solve these problems, the Semantic Web (Berners-Lee, Hendler et al. 2001) has been proposed as a new global initiative. The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information and services on the Web are defined, making it possible for the Web to understand and satisfy the requests of people and machines to use its content. One of the basic pillars of the Semantic Web is the idea of having explicit semantic information on Web pages that intelligent agents can use to solve complex Information Retrieval and Question Answering problems. Consequently, the final objective of the Semantic Web is to be able to semantically analyse and catalogue Web content. This requires a set of structures to model the knowledge, and a linkage between the knowledge and the content. The Semantic Web therefore relies on two basic components: ontologies and semantic annotations. Ontologies are formal, explicit specifications of a shared conceptualization. This means that ontologies model knowledge in a formal, abstract way that can be read by computers.
With ontologies it is possible to represent concepts, relations among concepts and even constraints on their use. Annotations link knowledge to content: knowledge is represented by means of ontologies, whereas content consists of pieces of raw text that need a meaning and that are linked to ontological concepts. Because of the interest in the automated analysis of all this information, the research community has in recent years shown a growing interest in developing data mining techniques, such as knowledge-based data mining and classification algorithms (Batet, Valls et al. 2010), that are able to exploit this kind of information. These methods rely on predefined knowledge, such as ontologies (Guarino 1998), to semantically interpret textual data and draw more accurate conclusions from their analyses. They are typically applied to structured textual attributes that correspond to features of the analysed entities. In these cases, attribute labels (i.e., words or noun phrases) are interpreted by mapping them to concepts and analysing the background knowledge structure to which these concepts belong. However, these methods are rarely able to deal with raw text, from which relevant features must be extracted and matched to ontological entities before the data analysis. Textual documents describing a particular entity (e.g. questionnaires, Wikipedia entries, etc.), which represent most of the available Web resources, are therefore difficult to process in order to extract the relevant features that semantically focused data mining algorithms could exploit (Hotho, Maedche et al. 2002). The main problem of the Semantic Web is that it assumes that all Web content is semantically annotated, which is not yet the case. Semantic-based information extraction has emerged in response to these limitations.
Semantic-based information extraction relies on ontologies to interpret the textual content of a resource regardless of its format. Even though many conceptual approaches in the field of the Semantic Web assume that resources have been semantically annotated, a massive amount of annotated Web resources cannot be expected to become available in the short term. So, in order to take advantage of the Web resources that are currently available, the extraction of features from plain text proposed in this work proceeds through the syntactic analysis of the text and the association of its content with the concepts modelled in one or more input ontologies. To sum up, the Semantic Web has brought about a growing interest in the research community in developing semantic data mining techniques. These techniques are able to exploit semantic information efficiently, but they depend on structured input. Unfortunately, most of the available Web resources are currently raw text. For all these reasons it is important to have mechanisms that are able to take advantage of raw text. This work aims to ease the application of semantically grounded data mining algorithms to textual data and semi-structured resources.
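The idea of analysing plain text and associating its terms with the concepts of an input ontology might be sketched as follows. This is only a minimal illustration and not the method developed in the thesis: the toy ontology, the names `ONTOLOGY`, `build_label_index`, `ancestors` and `annotate`, and the sample sentence are all hypothetical, and a longest-match lookup over word n-grams stands in for real syntactic analysis such as part-of-speech tagging or noun-phrase chunking.

```python
import re

# Hypothetical toy ontology: concept -> (parent concept, lexical labels).
# A real system would load this from an ontology file instead.
ONTOLOGY = {
    "Place": (None, ["place"]),
    "City": ("Place", ["city", "town"]),
    "Building": ("Place", ["building"]),
    "Cathedral": ("Building", ["cathedral"]),
    "RomanAmphitheatre": ("Building", ["roman amphitheatre", "amphitheatre"]),
}


def build_label_index(ontology):
    """Map each lower-cased lexical label to the concept it denotes."""
    return {
        label: concept
        for concept, (_parent, labels) in ontology.items()
        for label in labels
    }


def ancestors(concept, ontology):
    """Follow the is-a links from a concept up to the root."""
    chain = []
    parent = ontology[concept][0]
    while parent is not None:
        chain.append(parent)
        parent = ontology[parent][0]
    return chain


def annotate(text, ontology, max_words=3):
    """Return (phrase, concept) pairs found in raw text.

    Assumption: scanning word n-grams and keeping the longest label that
    matches is a crude substitute for proper noun-phrase detection.
    """
    index = build_label_index(ontology)
    tokens = re.findall(r"[a-z]+", text.lower())
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in index:
                annotations.append((phrase, index[phrase]))
                i += n
                break
        else:
            # No label starts at this token; move on.
            i += 1
    return annotations
```

For the sentence "Tarragona is a city with a Roman amphitheatre and a cathedral.", `annotate` links "city", "roman amphitheatre" and "cathedral" to the concepts `City`, `RomanAmphitheatre` and `Cathedral`, and `ancestors("Cathedral", ONTOLOGY)` then exposes the background structure (`Building`, `Place`) that a semantic data mining algorithm could exploit. Exact string matching misses inflected forms such as plurals, which is one reason a real system needs linguistic preprocessing.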
Thesis carried out in collaboration with Universitat Rovira i Virgili