A cost-based storage format selector for materialized results in big data frameworks

Munir, Rana Faisal; Abelló Gamazo, Alberto; Romero Moral, Óscar; Thiele, Maik; Lehner, Wolfgang

doi:10.1007/s10619-019-07271-0

dc.contributor.author	Munir, Rana Faisal
dc.contributor.author	Abelló Gamazo, Alberto
dc.contributor.author	Romero Moral, Óscar
dc.contributor.author	Thiele, Maik
dc.contributor.author	Lehner, Wolfgang
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.date.accessioned	2019-06-20T09:01:52Z
dc.date.available	2020-05-08T00:25:56Z
dc.date.issued	2019-05-08
dc.identifier.citation	Munir, R. [et al.]. A cost-based storage format selector for materialized results in big data frameworks. "Distributed and parallel databases", 8 May 2019, p. 1-30.
dc.identifier.issn	0926-8782
dc.identifier.uri	http://hdl.handle.net/2117/134838
dc.description.abstract	Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously, by deploying data-intensive workflows (DIWs). These DIWs of different users share many common tasks (i.e., 50–80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems by using a fixed storage format. However, a fixed choice is not the optimal one for every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a huge impact on execution, according to the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. A generic cost-based framework that selects the best format by considering the three main layouts is presented. Then, we use our framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. Our solution gives on average 1.33× speedup over fixed SequenceFile, 1.11× speedup over fixed Avro, 1.32× speedup over fixed Parquet, and overall, it provides 1.25× speedup.
dc.format.extent	30 p.
dc.language.iso	eng
dc.subject	UPC subject areas::Computer science::Information systems::Information storage and retrieval
dc.subject.lcsh	File organization (Computer science)
dc.subject.lcsh	Big data
dc.subject.other	Data-intensive workflows
dc.subject.other	Materialized results
dc.subject.other	Storage format
dc.subject.other	HDFS
dc.subject.other	Cost model
dc.title	A cost-based storage format selector for materialized results in big data frameworks
dc.type	Article
dc.subject.lemac	Computer files -- Organization
dc.subject.lemac	Big data
dc.contributor.group	Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
dc.contributor.group	Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing
dc.identifier.doi	10.1007/s10619-019-07271-0
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	https://link.springer.com/article/10.1007/s10619-019-07271-0
dc.rights.access	Open Access
local.identifier.drac	25153928
dc.description.version	Postprint (author's final draft)
local.citation.author	Munir, R.; Abelló, A.; Romero, O.; Thiele, M.; Lehner, W.
local.citation.publicationName	Distributed and parallel databases
local.citation.startingPage	1
local.citation.endingPage	30

Files in this item

Name: journal_2019.pdf
Size: 2.838 MB
Format: PDF


UPCommons. The UPC open knowledge portal
