Data Deduplication vs. Compression

Understand how dedupe and compression operate, which application types will benefit from their use, and how the combination can provide unmatched storage savings across the broadest set of use cases.

In many ways, data deduplication and compression are a lot like salt and pepper. Both seasonings enhance the taste of food, each has a distinct flavor, and each is used in varying quantities depending on the dish being prepared; most of the time, though, food tastes better when the two are used together. Similarly, dedupe and compression both reduce storage capacity, yet the two are rarely used together in most storage arrays.

While similar in purpose, these technologies provide data reduction for dissimilar data sets. It is critical to understand how these two technologies operate and which application types will benefit from each, but most importantly, how the combination can provide unmatched storage savings across the broadest set of use cases.

A Word on Thin Provisioning

Surely at this point in the post someone is asking, ‘What about Thin Provisioning? It reduces storage capacity.’

That’s simply incorrect: thin provisioning is not a data reduction technology. It is a provisioning model that allows one to consume storage on demand by eliminating the preallocation of storage capacity. It increases the utilization of storage media, but it does not reduce the capacity of the data written to that media. I’ll cover thin provisioning in my next post.

Data Deduplication (A Primer in Layman’s Terms)

Data compression provides savings by eliminating redundancy at the binary level within a block. Data Deduplication (aka dedupe) provides storage savings by eliminating redundant blocks of data.

Storage capacity reduction is accomplished only when there is redundancy in the data set. This means the data set must consist of multiple identical files, or files that contain portions of data identical to content found in other files.
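
To make this concrete, here is a minimal sketch of fixed-block deduplication in Python. The 4 KB block size, SHA-256 fingerprints, and in-memory dictionaries are illustrative assumptions; a real array’s metadata is far more sophisticated, but the principle is the same: store each unique block once and keep an ordered list of references.

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep only one copy of each."""
    store = {}    # fingerprint -> unique block actually stored
    recipe = []   # ordered fingerprints needed to reconstruct the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # store the first copy only
        recipe.append(digest)
    return store, recipe

# A data set with heavy block redundancy dedupes well:
data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192
store, recipe = dedupe(data)
print(f"logical blocks: {len(recipe)}, unique blocks stored: {len(store)}")
# logical blocks: 5, unique blocks stored: 2
```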

Examples of where one will find file redundancy include home directories and cloud file-sharing applications like Citrix ShareFile and VMware Horizon. Block redundancy is rampant in data sets like test and development, QA, virtual machines, and virtual desktops. Just think of how many copies of operating system and application binaries exist in these virtualized environments.

Tech Tip: The smaller the storage block size, the greater the ability to identify and dedupe data. For example, misaligned VMs can be deduplicated with a 512-byte block size, but not with a 4 KB block size.
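
To illustrate the tip, the hypothetical snippet below counts unique blocks in a data set containing two copies of the same 16 KB payload, with the second copy shifted by 512 bytes to mimic a misaligned VM. At a 512-byte block size the copies line up and dedupe; at 4 KB, no block boundaries match and nothing dedupes.

```python
import hashlib
import os

def unique_blocks(data: bytes, block_size: int) -> int:
    """Count distinct fixed-size blocks by fingerprinting each one."""
    return len({hashlib.sha256(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)})

payload = os.urandom(16 * 1024)        # 16 KB of unique "VM" data
shifted = b"\x00" * 512 + payload      # identical copy, misaligned by 512 bytes
combined = payload + shifted

for bs in (512, 4096):
    total = -(-len(combined) // bs)    # ceiling division: logical block count
    print(f"{bs:>4}-byte blocks: {unique_blocks(combined, bs)} unique of {total}")
# 512-byte blocks: 33 unique of 65  -> duplicates found
# 4096-byte blocks: 9 unique of 9   -> no duplicates found
```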

Data Compression

Data compression provides storage savings by eliminating the binary-level redundancy within a block of data. Unlike dedupe, compression is not concerned with whether a second copy of the same block exists; it simply wants to store the most efficient block on flash. By storing data in a format that is denser than the native form, compression algorithms “deflate” data as it is written and “inflate” it as it is read. Examples of common file-level compression that we use in our day-to-day lives include MP3 audio and JPG image files.
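
As a simple illustration of the deflate/inflate cycle, Python’s built-in zlib module (an implementation of DEFLATE) can stand in for an array’s compression engine; the sample block below is an assumption chosen for its redundancy.

```python
import zlib

# A highly redundant block compresses well; real-world ratios vary by data type.
block = b"SELECT id, name FROM users WHERE active = 1;\n" * 90

deflated = zlib.compress(block)        # "deflate" as the block is written
inflated = zlib.decompress(deflated)   # "inflate" as the block is read back

assert inflated == block               # compression is lossless
print(f"{len(block)} B stored as {len(deflated)} B "
      f"({len(deflated) / len(block):.1%} of original)")
```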

Compression at the application layer, as in a SQL or Oracle database, is somewhat of a balancing act: faster compression and decompression speeds usually come at the expense of space savings. To cite a lesser-known example, Hadoop commonly offers the following five compression formats (the sketch after this list shows the trade-off in action):

  • DEFLATE
  • gzip
  • bzip2
  • LZO
  • Snappy
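
Not all of Hadoop’s codecs ship with Python, but two of the formats above do: zlib implements DEFLATE and bz2 implements bzip2. The sketch below uses them to show the speed-versus-savings trade-off; the synthetic payload and timing method are my own assumptions, and results will vary widely with real data.

```python
import bz2
import time
import zlib

# A synthetic, somewhat redundant payload; real workloads behave differently.
data = b"".join(f"row-{i % 1000},value-{i % 37}\n".encode()
                for i in range(200_000))

for name, compress in [("DEFLATE (zlib)", zlib.compress),
                       ("bzip2 (bz2)   ", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out) / len(data):5.1%} of original in {elapsed:.2f}s")
# bzip2 typically produces a smaller result but takes noticeably longer,
# which is exactly the balance application owners are weighing.
```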

Tech Tip: Data sets already compressed by an application can often be compressed further on a storage array. This is possible because most admins tend to select a fast, lightweight compression setting in the application, favoring optimal application performance over optimal storage savings.
