dc.contributor.author: Abelló Gamazo, Alberto
dc.contributor.author: Palol, Xavier de
dc.contributor.author: Hacid, Mohand-Saïd
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.date.accessioned: 2018-07-12T08:03:57Z
dc.date.available: 2019-06-04T02:31:08Z
dc.date.issued: 2018-06-02
dc.identifier.citation: Abelló, A., Palol, X. de, Hacid, M.-S. Approximating the schema of a set of documents by means of resemblance. "Journal on data semantics", 2 June 2018, vol. 7, no. 2, p. 87-105.
dc.identifier.issn: 1861-2032
dc.identifier.uri: http://hdl.handle.net/2117/119271
dc.description.abstract: The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model allows sharing documents that do not have exactly the same structure. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization and an algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, which even compensates for the loss in the approximation.
dc.format.extent: 19 p.
dc.language.iso: eng
dc.publisher: Springer
dc.subject: Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació
dc.subject.lcsh: Data mining
dc.subject.lcsh: Automatic data collection systems
dc.subject.other: Document
dc.subject.other: Design
dc.subject.other: XML
dc.title: Approximating the schema of a set of documents by means of resemblance
dc.type: Article
dc.subject.lemac: Mineria de dades
dc.subject.lemac: Classificació automàtica
dc.contributor.group: Universitat Politècnica de Catalunya. GESSI - Grup d'Enginyeria del Software i dels Serveis
dc.identifier.doi: 10.1007/s13740-018-0088-0
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: https://link.springer.com/article/10.1007/s13740-018-0088-0
dc.rights.access: Open Access
drac.iddocument: 23241700
dc.description.version: Postprint (author's final draft)
upcommons.citation.author: Abelló, A., Palol, X. de, Hacid, M.-S.
upcommons.citation.published: true
upcommons.citation.publicationName: Journal on data semantics
upcommons.citation.volume: 7
upcommons.citation.number: 2
upcommons.citation.startingPage: 87
upcommons.citation.endingPage: 105



All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder.