dc.contributor | Romero Moral, Óscar |
dc.contributor | Jovanovic, Petar |
dc.contributor.author | Kaluzka, Justyna |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació |
dc.date.accessioned | 2017-05-15T14:10:35Z |
dc.date.available | 2017-05-15T14:10:35Z |
dc.date.issued | 2016-07 |
dc.identifier.uri | http://hdl.handle.net/2117/104450 |
dc.description.abstract | Current market tendencies show the need to store and process rapidly growing amounts of data, which in turn creates demand for distributed storage and data processing systems. Apache Hadoop is an open-source framework for managing such computing clusters in an effective, fault-tolerant way.

When dealing with large volumes of data, Hadoop and its storage system, HDFS (Hadoop Distributed File System), face the challenge of keeping efficiency high while completing computations in reasonable time. A typical Hadoop deployment transfers computation to the data rather than shipping data across the cluster, because moving large quantities of data through the network could significantly delay processing tasks. Accordingly, while a task is running, Hadoop favours local data access and chooses blocks from the nearest nodes; the necessary blocks are moved only at the moment they are needed by the given task.

To support Hadoop's data locality preferences, this thesis proposes adding a new functionality to its distributed file system (HDFS) that enables moving data blocks on request. Shipping data in advance makes it possible to deliberately redistribute blocks between nodes and thus adapt their placement to the given processing tasks. The new functionality enables the instructed movement of data blocks within the cluster: data can be shifted either by a user running the corresponding HDFS shell command or programmatically by another module, such as an appropriate scheduler.

To develop this functionality, a detailed analysis of the Apache Hadoop source code and its components (specifically HDFS) was conducted. This research resulted in a deep understanding of the internal architecture, which made it possible to compare candidate approaches and to implement the chosen solution. |
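The block-movement functionality described in the abstract is the thesis' own extension and is not part of stock HDFS, so the Java sketch below is only illustrative: it uses the standard Hadoop FileSystem API to inspect where a file's blocks currently reside (the placement that locality-aware scheduling relies on), and marks the proposed on-request block shift as a hypothetical, commented-out call. The file path, host name, and method name are assumptions; the record does not name the actual command or API the thesis introduces.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockPlacementExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // DistributedFileSystem on an HDFS cluster

            // Hypothetical input file; any existing HDFS path works here.
            Path file = new Path("/data/input/part-00000");
            FileStatus status = fs.getFileStatus(file);

            // Stock HDFS API: report which DataNodes currently hold each block
            // of the file, i.e. the placement the scheduler tries to exploit.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block); // prints offset, length and hosting DataNodes
            }

            // The thesis' proposed extension would be invoked here to ship a
            // chosen block to a target DataNode *before* the task that reads it
            // is scheduled there. Purely illustrative; name and signature are
            // not given in this record:
            // fs.moveBlock(file, /* block index */ 0, "datanode-07.example.com");
        }
    }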
dc.language.iso | eng |
dc.publisher | Universitat Politècnica de Catalunya |
dc.subject | Àrees temàtiques de la UPC::Informàtica |
dc.subject.lcsh | Management information systems |
dc.title | Data locality in Hadoop |
dc.type | Master thesis |
dc.subject.lemac | Sistemes d'informació per a la gestió |
dc.identifier.slug | 116331 |
dc.rights.access | Open Access |
dc.date.updated | 2016-07-06T06:27:14Z |
dc.audience.educationlevel | Màster |
dc.audience.mediator | Facultat d'Informàtica de Barcelona |
dc.audience.degree | MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012) |