The increasing need for application benchmarking and testing requires large amounts of data. However, obtaining realistic data from industry for testing purposes is often impossible due to confidentiality concerns and the cost of transferring large data sets over the network, i.e., the Internet. Hence, there is a gap between the need to benchmark and the lack of a common testing environment in which to do so. The scope of this thesis is to contribute to narrowing this gap by introducing a theoretical framework for generating data for the simulation of data processes. We thus aim at generating input data and, in doing so, providing a common environment for testing and evaluating data processes. Specifically, we focus on generating data for ETL data processes by analyzing the semantics of the flow. The motivation comes from the fact that ETL processes are often time-consuming and error-prone; it is therefore of high importance to evaluate and benchmark them in order to identify bottlenecks and continuously improve their performance. Moreover, we introduce a layered architecture design for developing a prototype of the ETL data generation framework. In addition, we present a pilot tool that implements the ETL data generation framework following the proposed architecture and the ETL semantics principle. In conclusion, we introduce the data generation approach and demonstrate its feasibility for generating workload scenarios useful for testing and benchmarking ETL processes.
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, the reproduction, distribution, public communication, or transformation of this work is prohibited without the permission of the copyright holder. If you wish to make any use of the work not provided for by law, please contact: email@example.com