Data Deduplication vs. Compression

Understand how dedupe and compression operate, which application types will benefit from their use, and how the combination can provide unmatched storage savings across the broadest set of use cases.


In many ways, data deduplication and compression are a lot like salt and pepper. Both seasonings enhance the taste of food, each has a distinct flavor, and each is used in varying quantities depending on the dish being prepared; most of the time, however, food tastes better when the two are used together. Similarly, dedupe and compression both reduce the storage capacity a data set consumes, yet they are rarely used together in most storage arrays.

While similar in purpose, these technologies provide data reduction for dissimilar data sets. It is critical to understand how these two technologies operate and which application types will benefit from their use, but most importantly, how the combination can provide unmatched storage savings across the broadest set of use cases.

A Word on Thin Provisioning

Surely at this point in the post someone is asking, ‘What about Thin Provisioning? It reduces storage capacity.’

That’s simply incorrect: Thin Provisioning is not a data reduction technology. It is a provisioning model that allows one to consume storage on demand by eliminating the preallocation of storage capacity. It increases the utilization of storage media, but it does not reduce the capacity of the data written to that media. I’ll cover thin provisioning in my next post.
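
For a rough feel for the model, here is a minimal sketch that uses a sparse file as a stand-in for a thin volume: a large apparent size is ‘provisioned’ up front, while the filesystem allocates space only for the data actually written. This is an analogy under stated assumptions (a POSIX system with sparse-file support, such as ext4 or XFS), not any array’s implementation.

    # Thin provisioning analogy: a sparse file reports a large apparent
    # size, but the filesystem allocates blocks only for written data.
    # Assumes a POSIX filesystem with sparse-file support (ext4, XFS, ...).
    import os

    path = "thin_volume.img"           # hypothetical volume name
    with open(path, "wb") as f:
        f.seek(100 * 1024 * 1024 - 1)  # "provision" 100 MB up front
        f.write(b"\0")                 # ...but write only a single byte

    st = os.stat(path)
    print(f"apparent size : {st.st_size / 1024**2:8.1f} MB")          # ~100 MB
    print(f"space consumed: {st.st_blocks * 512 / 1024**2:8.1f} MB")  # ~0 MB
    os.remove(path)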

Data Deduplication (A Primer in Layman’s Terms)

Data compression provides savings by eliminating redundancy at the binary level within a block. Data Deduplication (aka dedupe) provides storage savings by eliminating redundant blocks of data.

Storage capacity reduction is accomplished only when there is redundancy in the data set. This means the data set must be composed of multiple identical files, or of files that contain portions of data identical to content found in other files.

Examples of where one will find file redundancy include home directories and cloud file-sharing applications like Citrix ShareFile and VMware Horizon. Block redundancy is rampant in data sets like test and development, QA, virtual machines, and virtual desktops. Just think of the number of copies of operating system and application binaries that exist in these virtualized environments.
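
To make the mechanics concrete, below is a minimal sketch of fixed-block deduplication in Python. The 4 KB block size, the SHA-256 fingerprints, and every name in it are illustrative assumptions, not any vendor’s implementation.

    # Fixed-block dedupe sketch: store each unique block once and keep a
    # per-file "recipe" of fingerprints to rebuild the original data.
    import hashlib, os

    def dedupe(data: bytes, block_size: int = 4096):
        store = {}   # fingerprint -> block contents (kept only once)
        recipe = []  # ordered fingerprints needed to rebuild the data
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha256(block).hexdigest()
            store.setdefault(fp, block)  # duplicates are not stored again
            recipe.append(fp)
        return store, recipe

    # Two "files" sharing a common 8 KB prefix (think OS binaries in VMs):
    common = os.urandom(8192)
    file_a = common + b"A" * 4096
    file_b = common + b"B" * 4096
    store, recipe = dedupe(file_a + file_b)
    print(f"logical blocks: {len(recipe)}, unique blocks stored: {len(store)}")
    # -> logical blocks: 6, unique blocks stored: 4 (the shared prefix dedupes)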

Tech Tip: The smaller the storage block size, the greater the ability to identify and dedupe redundant data. For example, misaligned VMs can be deduplicated at a 512-byte block size but not at a 4 KB block size.
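
The toy example below illustrates why: the same guest data shifted by a 512-byte header (a classic misaligned VM) shares no 4 KB block fingerprints with its aligned twin, yet lines up completely at 512 bytes. Sizes and data are invented for illustration.

    # Block size vs. misalignment: the identical payload only dedupes
    # when the block size is small enough to line the copies back up.
    import hashlib, os

    def fingerprints(data: bytes, block_size: int) -> set:
        return {hashlib.sha256(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)}

    guest_data = os.urandom(64 * 1024)           # identical VM contents
    aligned    = guest_data
    misaligned = os.urandom(512) + guest_data    # shifted by a 512 B header

    for bs in (4096, 512):
        a, m = fingerprints(aligned, bs), fingerprints(misaligned, bs)
        print(f"{bs:>4} B blocks: {len(a & m)} of {len(a)} blocks dedupe")
    # -> 4096 B blocks: 0 of 16 blocks dedupe
    # ->  512 B blocks: 128 of 128 blocks dedupe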

Data Compression

Data compression provides storage savings by eliminating binary-level redundancy within a block of data. Unlike dedupe, compression is not concerned with whether a second copy of the same block exists; it simply wants to store the most efficient block on flash. By storing data in a form denser than the native one, compression algorithms “deflate” data as it is written and “inflate” it as it is read. Examples of common file-level compression that we use in our day-to-day lives include MP3 audio and JPEG image files.
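
As a small illustration, the snippet below deflates a redundant block with zlib (Python’s standard DEFLATE implementation) as it is ‘written’ and inflates it on ‘read’; the block contents and the resulting ratio are artificial.

    # Deflate on write, inflate on read: lossless round trip with zlib.
    import zlib

    block = b"the quick brown fox " * 200      # 4,000 B with redundancy
    stored = zlib.compress(block, level=6)     # deflate before writing
    print(f"native: {len(block)} B, stored: {len(stored)} B "
          f"({len(block) / len(stored):.1f}:1)")
    assert zlib.decompress(stored) == block    # inflate on read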

Compression at the application layer, such as in a SQL Server or Oracle database, is somewhat of a balancing act: faster compression and decompression speeds usually come at the expense of smaller space savings. To cite a lesser-known example, Hadoop commonly offers the following five compression formats (a rough benchmark sketch follows the list):

  • DEFLATE
  • gzip
  • bzip2
  • LZO
  • Snappy
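
Below is a rough sketch of that speed-versus-savings tradeoff using codecs from Python’s standard library: zlib stands in for DEFLATE/gzip and bz2 for bzip2, while Snappy and LZO would require third-party packages. The synthetic payload is an assumption; real ratios and timings vary with the data.

    # Speed vs. savings: faster codecs usually give up some ratio.
    import bz2, time, zlib

    payload = b"user_id,event,timestamp\n" + b"1001,click,1700000000\n" * 50000

    codecs = {
        "zlib level 1 (fast)": lambda d: zlib.compress(d, 1),
        "zlib level 9 (slow)": lambda d: zlib.compress(d, 9),
        "bzip2 (slowest)":     lambda d: bz2.compress(d),
    }
    for name, fn in codecs.items():
        t0 = time.perf_counter()
        out = fn(payload)
        ms = (time.perf_counter() - t0) * 1000
        print(f"{name:<20} {len(payload) / len(out):6.1f}:1 in {ms:6.1f} ms")

On redundant data like this, the fast setting typically finishes in a fraction of the time while giving up some of the ratio, which is the same tradeoff the Hadoop codecs above expose.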

Tech Tip: Data sets compressed at the application layer can often be compressed further on a storage array. This is possible because most admins tend to select fast, lightweight compression for optimal application performance rather than maximum storage savings, leaving redundancy for the array to reclaim.
