Data access layer optimization of the Gaia data processing in Barcelona for spatially arranged data
Document type: Master thesis
Rights access: Open Access
Gaia is an ambitious astrometric space mission adopted within the scientific programme of the European Space Agency (ESA) in October 2000. It measures with very high accuracy the positions and velocities of a large number of stars and other astronomical objects. At the end of the mission, a detailed three-dimensional map of more than one billion stars will be obtained. The spacecraft is currently orbiting the L2 Lagrangian point, 1.5 million kilometers from the Earth, and is providing a complete survey down to the 20th magnitude. The two telescopes of Gaia will observe each object 85 times on average during the 5 years of the mission, each time recording its brightness, color and, most importantly, its position. This leads to an enormous quantity of complex, extremely precise data, representing the multiple observations of a billion different objects by an instrument that is spinning and precessing. The Gaia data challenge, processing raw satellite telemetry to produce valuable science products, is a huge task in terms of expertise, effort and computing power. To handle the reduction of the data, an iterative process between several systems has been designed, each solving a different aspect of the mission. The Data Processing and Analysis Consortium (DPAC), a large team of scientists and software developers, is in charge of processing the Gaia data with the aim of producing the Gaia Catalogue. It is organized in Coordination Units (CUs), responsible for science and software development and validation, and Data Processing Centers (DPCs), which actually operate and execute the software systems developed by the CUs. This project has been developed within the frame of the Core Processing Unit (CU3) and the Data Processing Center of Barcelona (DPCB). One of the most important DPAC systems is the Intermediate Data Updating (IDU), executed at the MareNostrum supercomputer hosted by the Barcelona Supercomputing Center (BSC), which is the core of the DPCB hardware framework.
IDU must reprocess, once every few months, all raw data accumulated up to that moment, giving a higher coherence to the scientific results and correcting any possible errors or wrong approximations from previous iterations. It has two main objectives: to refine the image parameters from the astrometric images acquired by the instrument, and to refine the Cross Match (XM) for all the detections. In particular, the XM will have to handle an enormous number of detections by the end of the mission, so it will not be possible to handle them in a single process. Moreover, one should also consider the limitations and constraints imposed by the execution environment (the MareNostrum supercomputer). Therefore, it is necessary to optimize the Data Access Layer (DAL) in order to store efficiently the huge amount of data coming from the spacecraft, and to access it in a smart manner. This is the main scope of this project. We have developed and implemented an efficient and flexible file format based on Hierarchical Data Format version 5 (HDF5), arranging the detections by a spatial index, the Hierarchical Equal Area isoLatitude Pixelization (HEALPix), which tessellates the sphere. In this way it is possible to distribute and process the detections separately and in parallel, according to their distribution on the sky. Moreover, the HEALPix library and the framework implemented here allow the data to be considered at different resolution levels according to the desired precision. In this project we consider up to level 12, that is, 201 million pixels on the sphere. Two different alternatives have been designed and developed, namely, a Flat solution and a Hierarchical solution. The names refer to the distribution of the data within the file. In the Flat solution, all the datasets are contained inside a single group. The Hierarchical solution, on the other hand, stores the groups of data in a hierarchical way according to the HEALPix hierarchy.
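The pixel count quoted above and the difference between the two layouts can be illustrated with a short sketch. It is not the thesis implementation (which is written in C and Java). It assumes the HEALPix NESTED numbering scheme, in which the parent of a pixel is obtained by dropping its two lowest bits, and the group names (`/flat`, one nested group per level) are hypothetical choices made only for this example.

```python
def healpix_pixel_count(level: int) -> int:
    """Number of HEALPix pixels at a given level: 12 base pixels, each split in 4 per level."""
    if level < 0:
        raise ValueError("level must be non-negative")
    return 12 * 4 ** level


def _check_pixel(level: int, pixel: int) -> None:
    if not 0 <= pixel < healpix_pixel_count(level):
        raise ValueError("pixel index out of range for this level")


def flat_path(level: int, pixel: int) -> str:
    """Flat layout: every pixel's dataset sits in one single group (hypothetical name '/flat')."""
    _check_pixel(level, pixel)
    return f"/flat/{pixel}"


def hierarchical_path(level: int, pixel: int) -> str:
    """Hierarchical layout: one nested group per HEALPix level, from the base pixel down.

    Assumes NESTED numbering, where the ancestor of `pixel` at level k
    is `pixel >> 2 * (level - k)`.
    """
    _check_pixel(level, pixel)
    ancestors = [pixel >> (2 * (level - k)) for k in range(level + 1)]
    return "/" + "/".join(str(a) for a in ancestors)


if __name__ == "__main__":
    print(healpix_pixel_count(12))    # pixels at the finest level considered
    print(flat_path(2, 37))           # single-group layout
    print(hierarchical_path(2, 37))   # one group per level
```

In the flat layout a single group holds up to one entry per pixel, whereas in the hierarchical layout each group has at most four children below the 12 base pixels.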
The Gaia DPAC software is implemented in Java, for which HDF5 Application Programming Interface (API) support is quite limited. Thus, it has also been necessary to use the Java Native Interface (JNI) to adapt the software developed in this project, which is written in C and follows the HDF5 C API. On the Java side, two main classes have been implemented to read and write the data: FileHdf5Archiver and FileArchiveHdf5FileReader. The Java part of this project has been integrated into an existing operational software library, DpcbTools, in coordination with the Barcelona IDU/DPCB team. This has made it possible to integrate the work done in this project into the existing DAL architecture in the most efficient way. Prior to testing the operational code, we first evaluated the time required to create the whole empty structure of the file. This was done with a simple program written in C which, depending on the HEALPix level requested, creates the skeleton of the file. It has been implemented for both of the alternatives mentioned above. Up to HEALPix level 6 there is no noticeable difference. From level 7 onwards the difference becomes more and more important, especially from level 9, where the creation time for the Flat solution becomes unmanageable. In any case, creating the whole file structure in advance is not practical in real operation. Therefore, in order to determine the most suitable alternative, we have simply considered the Input/Output performance. Finally, we have run the performance tests in order to evaluate how the two solutions perform when actually dealing with data contents. The TAR and ZIP solutions have also been tested in order to compare and assess the speedup and the efficiency of the two new alternatives. The analysis of the results has been based on the time to write and read data, the compression ratio and the read/write rate.
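The three quantities used in the analysis can be written down as simple ratios. The definitions below are the usual ones and are an assumption here, since the abstract does not state the exact formulas used in the thesis. The function names are hypothetical and no measured values are implied.

```python
def compression_ratio(raw_bytes: int, stored_bytes: int) -> float:
    """Uncompressed size divided by size on disk (higher means better compression)."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return raw_bytes / stored_bytes


def io_rate_mb_per_s(n_bytes: int, seconds: float) -> float:
    """Read or write rate in megabytes (10**6 bytes) per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_bytes / 1e6 / seconds


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new format is than the baseline (e.g. TAR or ZIP)."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds
```

With these definitions, a speedup above 1 means the new format reads or writes faster than the baseline for the same input data.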
Moreover, the different alternatives have been evaluated on two systems with different sets of data as input. The speedup and the compression ratio improvement compared to the previously adopted solutions are considerable for both HDF5-based alternatives, whereas the difference between the two alternatives. The integration of one of these two solutions will allow the Gaia IDU software to handle the data more efficiently, remarkably increasing the final I/O performance.