H-WorD: Supporting job scheduling in Hadoop with workload-driven data redistribution
Document type: Conference report
Rights access: Open Access
Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality to reduce network traffic. In such systems, the distribution of data over the cluster resources plays a significant role, and when skewed, it can harm the performance of executing applications. In this paper, we address the challenge of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios for tasks and hence identifies required transfers of input data a priori, bringing data close to the execution in a timely manner. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution.
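To illustrate the general idea described in the abstract, the following is a minimal toy sketch of workload-driven scheduling with a priori data redistribution. All names, the fixed transfer cost, and the greedy cost model are assumptions made for illustration only; they are not the paper's actual H-WorD algorithm.

```python
# Toy sketch (illustrative assumptions, not the H-WorD algorithm itself).
# Each task prefers the node holding its input (data locality). When the
# preferred node's estimated load is high, we redistribute: transfer the
# input to the least-loaded node ahead of time, trading a transfer cost
# for a shorter queue.

TRANSFER_COST = 2  # assumed fixed cost (time units) to move a task's input


def schedule(tasks, node_load):
    """tasks: list of (task_id, preferred_node, duration);
    node_load: dict node -> current estimated load (time units).
    Returns a plan of (task_id, chosen_node, 'local' | 'transfer')."""
    plan = []
    for task_id, pref, duration in tasks:
        # Scenario 1: run on the preferred node (data already local).
        local_finish = node_load[pref] + duration
        # Scenario 2: ship the input to the currently least-loaded node.
        alt = min(node_load, key=node_load.get)
        remote_finish = node_load[alt] + TRANSFER_COST + duration
        if alt != pref and remote_finish < local_finish:
            node_load[alt] = remote_finish
            plan.append((task_id, alt, "transfer"))
        else:
            node_load[pref] = local_finish
            plan.append((task_id, pref, "local"))
    return plan


# Three tasks whose input all sits on node n1; the sketch offloads one.
tasks = [("t1", "n1", 5), ("t2", "n1", 5), ("t3", "n1", 5)]
load = {"n1": 0, "n2": 0}
print(schedule(tasks, load))
```

Here the second task is moved to the idle node n2 because paying the transfer cost finishes earlier than queuing behind the first task on n1, which is the intuition behind redistributing skewed input data before execution.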
The final publication is available at http://link.springer.com/chapter/10.1007/978-3-319-44039-2_21
Citation: Jovanovic, P., Romero, O., Calders, T., Abello, A. H-WorD: Supporting job scheduling in Hadoop with workload-driven data redistribution. In: Conference on Advances in Databases and Information Systems. "Advances in Databases and Information Systems - 20th East European Conference, ADBIS 2016, Proceedings". Prague: 2016, p. 306-320.