Show simple item record

dc.contributor.author	Meister, Dirk
dc.contributor.author	Kaiser, Jürgen
dc.contributor.author	Brinkmann, Andre
dc.contributor.author	Cortés, Toni
dc.contributor.author	Kuhn, Michael
dc.contributor.author	Kunkel, Julian
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.date.accessioned	2013-10-03T08:58:09Z
dc.date.created	2012
dc.date.issued	2012
dc.identifier.citation	Meister, D. [et al.]. A study on data deduplication in HPC storage systems. A: International Conference for High Performance Computing, Networking, Storage and Analysis. "2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)". Salt Lake City, Utah: Institute of Electrical and Electronics Engineers (IEEE), 2012, p. 1-11.
dc.identifier.isbn	978-1-4673-0806-9
dc.identifier.uri	http://hdl.handle.net/2117/20270
dc.description.abstract	Deduplication is a storage-saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files; for example, multiple versions of a file might exist that are mostly identical. With deduplication, this data replication is localized and the redundancy is removed: by storing data just once, all files that use identical regions refer to the same unique data. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking whether the fingerprint has already been stored, a chunk is classified as redundant or unique, and only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which are among the most demanding producers of storage data. We have quantitatively assessed this potential for capacity reduction for four data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies, which focus mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, reaching up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach; approaches based on whole-file comparisons lead only to small capacity savings.
dc.format.extent	11 p.
dc.language.iso	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.subject	Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
dc.subject.lcsh	Information storage and retrieval systems
dc.subject.other	Back-up procedures
dc.subject.other	Computer centres
dc.subject.other	Cryptography
dc.subject.other	Internet
dc.subject.other	Replicated databases
dc.subject.other	Storage management
dc.title	A study on data deduplication in HPC storage systems
dc.type	Conference report
dc.subject.lemac	Informació -- Sistemes d'emmagatzematge i recuperació
dc.contributor.group	Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi	10.1109/SC.2012.14
dc.description.peerreviewed	Peer Reviewed
dc.rights.access	Restricted access - publisher's policy
local.identifier.drac	11637271
dc.description.version	Postprint (published version)
dc.date.lift	10000-01-01
local.citation.author	Meister, D.; Kaiser, J.; Brinkmann, A.; Cortes, T.; Kuhn, M.; Kunkel, J.
local.citation.contributor	International Conference for High Performance Computing, Networking, Storage and Analysis
local.citation.pubplace	Salt Lake City, Utah
local.citation.publicationName	2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
local.citation.startingPage	1
local.citation.endingPage	11
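
The abstract above describes the core mechanism the study evaluates: file data is split into chunks, each chunk is identified by a cryptographic fingerprint, and a chunk is stored only if its fingerprint has not been seen before. The following Python sketch illustrates how such an analysis can estimate sub-file deduplication savings. It is a minimal illustration only, assuming fixed-size chunking and SHA-256 fingerprints (the record does not specify the paper's chunking method or hash function), and the file paths in the usage example are hypothetical.

import hashlib
import os

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size; a real study may also use content-defined chunking


def deduplication_estimate(paths, chunk_size=CHUNK_SIZE):
    """Estimate sub-file deduplication savings for a set of files.

    Each file is split into fixed-size chunks and each chunk is
    fingerprinted with SHA-256. Bytes whose fingerprint was already
    seen count as redundant; everything else counts as unique.
    """
    seen = set()        # fingerprints of chunks already "stored"
    total_bytes = 0
    unique_bytes = 0

    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
                fingerprint = hashlib.sha256(chunk).digest()
                if fingerprint not in seen:
                    seen.add(fingerprint)
                    unique_bytes += len(chunk)

    saved = total_bytes - unique_bytes
    ratio = saved / total_bytes if total_bytes else 0.0
    return total_bytes, unique_bytes, ratio


if __name__ == "__main__":
    # Hypothetical file list; in practice this would come from walking a file system.
    files = [p for p in ("data/a.bin", "data/b.bin") if os.path.exists(p)]
    total, unique, ratio = deduplication_estimate(files)
    print(f"scanned {total} bytes, {unique} unique, {ratio:.1%} removable by deduplication")

Whole-file deduplication, which the abstract reports as yielding only small savings on this data, would correspond to fingerprinting each file in its entirety instead of per chunk.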

