Cardinality Estimation in Shared-Nothing Parallel Dataflow Engines

Cite as:
hdl:2117/77883
Document type: Master thesis
Date: 2015-07-31
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public
communication or transformation of this work is prohibited without permission of the copyright holder.
Abstract
Shared-nothing parallel dataflow systems aim to bridge the
gap between MapReduce and RDBMSs by combining parallel execution
of second-order functions with operator-based optimizations. In parallel
systems, job latency is strongly affected by data shuffling and unbalanced
data across nodes, thus the degree of parallelism and the data partitioning
functions must be carefully considered when choosing optimization
strategies. However, it is hard to make good optimization choices without
any information about the distribution of the data. We attempt to
overcome this challenge in shared-nothing parallel dataflows by tracking
statistics of data sets during query runtime. We use data streaming algorithms
to track statistics so as to affect job latency as little as possible.
We discuss how collected statistics can potentially be used to improve
execution plans during runtime.
Degree: MÀSTER UNIVERSITARI ERASMUS MUNDUS EN TECNOLOGIES DE LA INFORMACIÓ PER A LA INTEL·LIGÈNCIA EMPRESARIAL (Pla 2012)
Files | Size |
---|---|
108938.pdf | 709.0 KB |