Approximating the DTD of a set of XML documents

Abelló Gamazo, Alberto; Palol Arregui, Xavier de; Hacid, Mohand-Saïd

R05-7.pdf (323,4Kb)

Document type: Research report

Publication date: 2005-03

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, its reproduction, distribution, public communication or transformation without the authorization of the rights holder is prohibited.

Abstract

The WWW contains a huge number of documents. Some of them share a subject but are produced by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which makes it possible to share documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, no schema is available). In this paper, we offer a characterization of, and an algorithm to obtain, the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of a midpoint would be the elements common to all documents. Nevertheless, in cases with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. Once we have such a midpoint, the algorithm is generalized to obtain repetitions and optional elements. Thus, an exact schema can always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (many optional elements), which would make it useless.
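
The idea that elements present in a given number of documents belong to the midpoint can be illustrated with a minimal sketch. This is not the algorithm from the report: the abstract does not define the resemblance function, so the example below simply counts in how many documents each root-to-element tag path occurs and keeps those that reach a threshold. The names `element_paths`, `frequent_paths` and `min_fraction`, the path representation, and the sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(documents, min_fraction):
    """Keep the paths that occur in at least min_fraction of the documents.

    Hypothetical stand-in for the report's midpoint: a plain frequency
    threshold, not the resemblance-based characterization.
    """
    counts = Counter()
    for doc in documents:
        counts.update(element_paths(doc))  # each path counts once per document
    needed = min_fraction * len(documents)
    return {path for path, n in counts.items() if n >= needed}


# Illustrative documents (invented for this sketch).
docs = [
    "<book><title/><author/><year/></book>",
    "<book><title/><author/></book>",
    "<book><title/><publisher/></book>",
]

common = frequent_paths(docs, 1.0)    # elements shared by every document
majority = frequent_paths(docs, 0.5)  # elements present in at least half
```

With a threshold of 1.0 only the paths shared by every document survive, which is the trivial case mentioned in the abstract; lowering the threshold admits elements such as `/book/author` that appear in most but not all documents. Repetitions and optional elements are not modeled here.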

Citation: Abelló, A., De Palol, X., Hacid, M. "Approximating the DTD of a set of XML documents". 2005.

Part of: LSI-05-7-R

URI: http://hdl.handle.net/2117/84116

UPCommons. Open knowledge portal of the UPC
