ALOJA-ML: a framework for automating characterization and knowledge discovery in Hadoop deployments
Document type: Conference report
Publisher: Association for Computing Machinery (ACM)
Rights access: Open Access
European Commission's project: Hi-EST - Holistic Integration of Emerging Supercomputing Technologies (EC-H2020-639595)
This article presents ALOJA-Machine Learning (ALOJA-ML), an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and inform performance tuning; here we detail the approach, the efficacy of the model, and initial results. The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop. Hadoop presents a complex execution environment, where costs and performance depend on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices. Recently the ALOJA project presented an open, vendor-neutral repository featuring over 16,000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameter tunings, and Cloud services. Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of the modeling procedures to allow a systematic analysis of large and resource-constrained search spaces. ALOJA-ML provides such an automated system, allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters. The resulting empirically derived performance models can be used to forecast the execution behavior of various workloads; they allow a priori prediction of execution times for new configurations and HW choices, and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently by automatically prioritizing candidate future benchmark tests. Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed up the data acquisition and knowledge discovery process, and, importantly, reduce running costs.
In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.
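As a rough illustration of the three uses named in the abstract (predicting execution time for an unseen configuration, model-based anomaly detection, and prioritizing future benchmark runs), the following minimal sketch may help. It is not the method of the paper: a simple nearest-neighbour average stands in for the learned performance model, the "least similar first" ranking rule is one possible heuristic, and every feature name, value, and threshold is a hypothetical placeholder rather than ALOJA data.

```python
from statistics import mean

# Hypothetical benchmark records: (configuration, observed execution time in seconds).
# Feature names and values are invented for illustration; they are not ALOJA data.
RUNS = [
    ({"mappers": 4, "io_buffer_kb": 64, "disk": "HDD"}, 520.0),
    ({"mappers": 8, "io_buffer_kb": 64, "disk": "HDD"}, 410.0),
    ({"mappers": 4, "io_buffer_kb": 128, "disk": "SSD"}, 300.0),
    ({"mappers": 8, "io_buffer_kb": 128, "disk": "SSD"}, 240.0),
]


def distance(a, b):
    """Count the features on which two configurations differ."""
    return sum(1 for key in a if a[key] != b.get(key))


def predict(config, runs, k=2):
    """Predict execution time as the mean of the k most similar observed runs.

    Stand-in for a learned model; any regressor could take its place.
    """
    nearest = sorted(runs, key=lambda run: distance(config, run[0]))[:k]
    return mean(time for _, time in nearest)


def is_anomalous(config, observed, runs, tolerance=0.5):
    """Flag a run whose observed time deviates from the prediction
    by more than `tolerance` (relative error; threshold is an assumption)."""
    expected = predict(config, runs)
    return abs(observed - expected) / expected > tolerance


def rank_candidates(candidates, runs):
    """Order untested configurations so that those least similar
    to any observed run are benchmarked first (assumed heuristic)."""
    return sorted(
        candidates,
        key=lambda c: min(distance(c, run[0]) for run in runs),
        reverse=True,
    )


if __name__ == "__main__":
    new_config = {"mappers": 8, "io_buffer_kb": 128, "disk": "HDD"}
    print(predict(new_config, RUNS))
    print(is_anomalous(new_config, 900.0, RUNS))
```

Under these assumptions, a run that takes far longer than the model expects for its configuration is flagged for inspection, and unexplored regions of the configuration space are scheduled ahead of near-duplicates of existing runs.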
Citation: Berral, J., Poggi, N., Carrera, D., Call, A., Reinauer, R., Green, D. ALOJA-ML: a framework for automating characterization and knowledge discovery in Hadoop deployments. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. "KDD '15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: August 10-13, 2015: Sydney, NSW, Australia". Sydney: Association for Computing Machinery (ACM), 2015, p. 1701-1710.
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication, or transformation of this work is prohibited without permission of the copyright holder.