A study on data deduplication in HPC storage systems
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Restricted access - publisher's policy
Deduplication is a storage-saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files; for example, multiple versions of a file might exist that are mostly identical. Deduplication detects this replication and removes the redundancy: by storing the data just once, all files that contain identical regions refer to the same unique data. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking whether the fingerprint has already been stored, a chunk is classified as redundant or unique, and only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which are among the most demanding producers of storage data. We have quantitatively assessed this potential for capacity reduction for four data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies, which focused mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, peaking at up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach; approaches based on whole-file comparisons yield only small capacity savings.
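To make the chunk-and-fingerprint approach described in the abstract concrete, the following is a minimal Python sketch, not the paper's actual tooling: the fixed 8 KiB chunk size and SHA-256 fingerprint are illustrative assumptions, and production deduplication systems typically use content-defined chunking rather than the fixed-size chunks shown here.

```python
import hashlib
import sys

CHUNK_SIZE = 8 * 1024  # assumed fixed 8 KiB chunks (real systems often chunk by content)

def deduplicate(paths):
    """Index chunks by cryptographic fingerprint; return (logical, stored) byte counts."""
    store = {}            # fingerprint -> chunk data (stands in for the chunk store)
    logical = stored = 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                logical += len(chunk)
                fp = hashlib.sha256(chunk).hexdigest()  # cryptographic fingerprint
                if fp not in store:
                    # Unique chunk: store it once.
                    store[fp] = chunk
                    stored += len(chunk)
                # Redundant chunk: only a reference to fp would be recorded.
    return logical, stored

if __name__ == "__main__":
    logical, stored = deduplicate(sys.argv[1:])
    if logical:
        print(f"deduplication savings: {1 - stored / logical:.1%}")
```

Hashing entire files instead of chunks would correspond to the whole-file comparison the abstract mentions, which is why that approach misses the redundancy between files that are mostly, but not entirely, identical.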
Citation: Meister, D. [et al.]. A study on data deduplication in HPC storage systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis. "2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)". Salt Lake City, Utah: Institute of Electrical and Electronics Engineers (IEEE), 2012, p. 1-11.