dc.contributor.author: Munir, Rana Faisal
dc.contributor.author: Abelló Gamazo, Alberto
dc.contributor.author: Romero Moral, Óscar
dc.contributor.author: Thiele, Maik
dc.contributor.author: Lehner, Wolfgang
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.identifier.citation: Munir, R. [et al.]. A cost-based storage format selector for materialized results in big data frameworks. "Distributed and parallel databases", 8 May 2019, p. 1-30.
dc.description.abstract: Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analysis simultaneously by deploying data-intensive workflows (DIWs). The DIWs of different users share many common tasks (i.e., 50–80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on distributed file systems using a fixed storage format. However, a fixed choice is not optimal for every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a large impact on execution, depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. We first present a generic cost-based framework that selects the best format by considering the three main layouts. Then, we use our framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. Our solution gives on average a 1.33× speedup over fixed SequenceFile, a 1.11× speedup over fixed Avro, and a 1.32× speedup over fixed Parquet; overall, it provides a 1.25× speedup.
dc.format.extent: 30 p.
dc.subject: Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació::Emmagatzematge i recuperació de la informació
dc.subject.lcsh: File organization (Computer science)
dc.subject.lcsh: Big data
dc.subject.other: Data-intensive workflows
dc.subject.other: Materialized results
dc.subject.other: Storage format
dc.subject.other: Cost model
dc.title: A cost-based storage format selector for materialized results in big data frameworks
dc.subject.lemac: Fitxers informàtics -- Organització
dc.contributor.group: Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
dc.contributor.group: Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing
dc.description.peerreviewed: Peer Reviewed
dc.rights.access: Open Access
dc.description.version: Postprint (author's final draft)
local.citation.author: Munir, R.; Abelló, A.; Romero, O.; Thiele, M.; Lehner, W.
local.citation.publicationName: Distributed and parallel databases

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without permission of the copyright holder.