4 Data Deduplication Challenges (and How to Solve Them)

Why doesn’t every array just have data reduction built in? Are deduplication and compression really that hard? Yes. But Pure FlashArray deduplication and compression solve the challenges.

Why doesn’t every storage array just have data reduction built in? Are deduplication and compression really that hard? In a word: yes, they are.

Doing data reduction itself isn’t hard, but doing it reliably at large scale without sacrificing performance turns out to be really, really hard. Let’s explore why data reduction is beneficial from a sustainability standpoint, why it can be difficult, and how Pure Storage® data deduplication and compression solve these challenges.

Data Deduplication Helps Reduce Energy Usage in the Data Center

Data reduction can greatly increase the efficiency of a storage system and directly impact your total spend on capacity by helping you to:

  • Save energy
  • Reduce your physical storage costs
  • Decrease your data center footprint

Why is Primary Data Deduplication So Difficult?

1. Deduplication Breeds Randomization, Slowing Reads

The single biggest challenge around deduplication is dealing with the massive increase in random I/O. Traditional disk array architectures do a lot of work to try to serialize I/O streams, because random I/O is the Achilles’ heel of a rotating hard drive. The very process of deduplication takes that well-ordered I/O stream and picks out duplicate pieces and stores pointers to those instead.

The result is that any given dataset ends up spread across the storage array. To read that data, the array has to seek across 10s or potentially even 100s of disks to retrieve the constituent pieces, reassemble them, and only then serve the I/O: a latency nightmare.
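The scattering effect can be illustrated with a toy model. This is a minimal sketch, not how any real array lays out data: it assumes physical slots are numbered in write order and counts a "seek" whenever a read does not land on the slot right after the previous one. The function names `dedupe_layout` and `seeks` are made up for this illustration.

```python
def dedupe_layout(logical_blocks):
    """Map each logical block to a physical slot, reusing slots for duplicates."""
    slot_of = {}   # block content -> physical slot index
    layout = []    # physical slot of each logical block, in logical order
    for block in logical_blocks:
        if block not in slot_of:
            slot_of[block] = len(slot_of)   # first sighting: take the next free slot
        layout.append(slot_of[block])       # duplicates become pointers to old slots
    return layout


def seeks(layout):
    """Count reads that do not land on the slot right after the previous one."""
    return sum(1 for prev, cur in zip(layout, layout[1:]) if cur != prev + 1)


# A later dataset that shares some blocks with an earlier one.
earlier = [b"A", b"B", b"C", b"D", b"E", b"F"]
later = [b"E", b"X", b"B", b"Y", b"A", b"Z"]

layout = dedupe_layout(earlier + later)
later_layout = layout[len(earlier):]

print(later_layout)         # [4, 6, 1, 7, 0, 8]
print(seeks(later_layout))  # 5
```

Without deduplication the later dataset would occupy six consecutive slots and could be read in one sequential pass. With it, every read in this example jumps to a different region.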

2. Data Deduplication and Compression Slow Writes

While the “reading” side of the equation is hampered by the randomization introduced, the “writing” side of the equation is just as painful, for a simple reason: deduplication and compression take time and CPU cycles to process.

While most modern arrays have a write cache that makes writes somewhat asynchronous, that cache is limited in size, and the rate at which data drains from it to back-end disk is a key limiter of the array’s write bandwidth. Deduplication and compression add steps to this process: data must be fingerprinted, fingerprints must be checked and verified (the most secure approaches don’t trust fingerprints and compare actual data bit-for-bit), and data must be compressed. Each of these operations adds precious 10s or 100s of microseconds to the write. For traditional disk arrays this has simply been too slow, so most primary deduplication has been done “post process”: the data is first landed on disk, and an overnight or occasional job deduplicates and compresses it later.

The operational reality of these solutions has left a lot to be desired though: the array must be kept 20-30% empty to keep room to “land” data, the deduplication process has an overhead that must be accounted for, and all this must be scheduled and managed.
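The inline write steps described in this section (fingerprint, verify, compress) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any vendor's implementation: SHA-256 and zlib are stand-ins chosen because they ship with Python, and the class name `InlineDedupeStore` is hypothetical.

```python
import hashlib
import zlib


class InlineDedupeStore:
    """Toy inline write path: fingerprint, verify byte-for-byte, then compress."""

    def __init__(self):
        self.by_fingerprint = {}   # fingerprint -> ids of blocks with that fingerprint
        self.blocks = {}           # block id -> compressed payload
        self.next_id = 0

    def write(self, data: bytes) -> int:
        fingerprint = hashlib.sha256(data).digest()
        # Do not trust the fingerprint alone: compare the actual bytes.
        for block_id in self.by_fingerprint.get(fingerprint, []):
            if zlib.decompress(self.blocks[block_id]) == data:
                return block_id    # duplicate: the caller stores only a pointer
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = zlib.compress(data)
        self.by_fingerprint.setdefault(fingerprint, []).append(block_id)
        return block_id

    def read(self, block_id: int) -> bytes:
        return zlib.decompress(self.blocks[block_id])


store = InlineDedupeStore()
first = store.write(b"x" * 4096)
second = store.write(b"x" * 4096)   # same content, so same block id
third = store.write(b"y" * 4096)    # new content, so a new block
```

Every call to `write` pays for a hash, a lookup, possibly a decompress-and-compare, and a compress before it can return, which is the extra latency the text describes.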

3. Data Deduplication and Compression Require Massive Virtualization

Perhaps a simpler and more fundamental challenge for most legacy disk arrays is that data reduction requires complete virtualization of the array. Traditional RAID arrays were designed with a fairly rigid architecture, where RAID structures were constructed on back-end disk, and then logical block addresses from the back of the array mapped fairly directly to block addresses on the host.

Over the past decade more virtualization has entered the typical storage device, where there is a layer of metadata and indirection between host addresses and back-end blocks. But this level of virtualization varies significantly depending on the vintage of the architecture of one’s array. Deduplication and compression necessarily break the 1:1 block linkage, requiring fine-grained virtualization, and many arrays simply weren’t meant to handle that. Think about it: petabytes of data virtualized into 512-byte chunks yields a metadata structure that has to keep track of trillions of objects. If an array’s controllers and metadata structure weren’t designed for this level of virtualization, it quite simply isn’t possible to retrofit it. This goes a long way towards explaining why 5 years later most arrays still don’t have data reduction.
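The "trillions of objects" figure is easy to check with a little arithmetic. The sketch below assumes a decimal petabyte (the text does not say which unit is meant), and the 16 bytes per metadata entry is a purely hypothetical figure used only to show the order of magnitude.

```python
CHUNK_BYTES = 512
PETABYTE = 10**15        # decimal petabyte; an assumption, not stated in the text
BYTES_PER_ENTRY = 16     # hypothetical metadata cost per chunk, for illustration only


def chunk_count(capacity_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of chunks needed to cover the capacity (ceiling division)."""
    return -(-capacity_bytes // chunk_bytes)


objects_per_pb = chunk_count(PETABYTE)
metadata_bytes = objects_per_pb * BYTES_PER_ENTRY

print(f"{objects_per_pb:,} chunks per petabyte")        # 1,953,125,000,000
print(f"{metadata_bytes / 10**12:.2f} TB of metadata")  # 31.25
```

A single petabyte already yields roughly two trillion 512-byte chunks, so even a very small per-chunk entry adds up to far more metadata than fits in controller memory.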

4. Compression Makes Modifying Data Expensive

Once you incur the performance hit of deduplicating and compressing data, the performance pain has just begun. Now you don’t only have to deal with handling reads effectively (per #1), but you also have to deal with modifying data. Why is modifying data hard? Well, because compression is non-deterministic in size. If I write a piece of data and it compresses down to “x” amount of data in my array, when I then potentially over-write or modify that data, it may compress to a completely different size, “x+y” or “x-z.” In the former case it no longer fits in the “x” slot so I have to find some place to store “y,” and in the latter case it fits in the “x” slot, but I waste “z” space (remember, we were trying to avoid wasting space in the first place here!).

In most cases, data modification ends up triggering a dreaded “read-modify-write” loop, where instead of just landing new data in the old slot, I have to read in the old data, decompress it, figure out what has changed, re-compress it, and then rewrite it to some entirely new spot of the array, while figuring out how to reuse the old spot. And meanwhile, the microseconds of latency tick away…
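The size problem is easy to reproduce. In this minimal sketch zlib stands in for whatever compressor an array might use, and `noisy_bytes` and `fits_in_place` are hypothetical helpers written for the example. It overwrites a highly compressible 4 KB block with poorly compressible data of the same logical size and checks whether the result still fits the original slot.

```python
import hashlib
import zlib

BLOCK = 4096


def noisy_bytes(n: int, seed: bytes = b"seed") -> bytes:
    """Deterministic, poorly compressible bytes built from a SHA-256 chain."""
    out = bytearray()
    state = seed
    while len(out) < n:
        state = hashlib.sha256(state).digest()
        out += state
    return bytes(out[:n])


def fits_in_place(slot_size: int, new_data: bytes) -> bool:
    """True if the recompressed data still fits the slot the old data used."""
    return len(zlib.compress(new_data)) <= slot_size


original = b"\x00" * BLOCK                  # compresses to a handful of bytes
slot_size = len(zlib.compress(original))    # the "x" slot from the text

overwrite = noisy_bytes(BLOCK)              # same logical size, far larger once compressed
new_size = len(zlib.compress(overwrite))

print(slot_size, new_size, fits_in_place(slot_size, overwrite))
```

The host wrote 4 KB both times, yet the second write cannot reuse the first write's slot, so the array has to place it somewhere else and track the space it left behind.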

Adding it up: A Performance Train-Wreck

Add up these four main challenges, and the performance of every single operation a disk array does seems to be horribly impacted by data reduction. It isn’t at all surprising that traditional disk arrays haven’t been able to make this work. The ones who have made it work have had to make significant compromises along the way: slower performance, post-processing, limited data volume size, limited scalability, large chunk sizes.

The results for end-users are mixed at best—until now.

Pure Storage FlashArray Data Deduplication is Different

FlashArray has two key advantages: it is 100% flash, and it was designed from the ground up for data reduction. By coupling these approaches we were able to overcome every challenge of primary storage data reduction. Pure Storage® Purity//FA Reduce uses five different data-reduction technologies to save space, with up to 10:1 deduplication in its all-flash arrays:

    1. Pattern removal: Purity Reduce identifies and removes repetitive binary patterns to reduce the volume of data to be processed by the dedupe scanner and compression engine. 
    2. 512B aligned variable dedupe: Most primary dedupe solutions today chunk data and look for duplicates at the 4K block size, or larger. Their metadata structures have limited scale: the smaller the block size, the more objects to manage and the more metadata one needs (many legacy arrays keep the majority of virtualization metadata in memory for performance). A high-performance inline deduplication process with a variable block-size range of 4-32KB ensures only unique blocks of data are saved on flash. Simply put, the smaller the block size, the better the data reduction ratios your array is capable of.
    3. Inline compression: Purity Reduce uses an append-only write layout and variable addressing to remove the wasted space fixed-block architectures introduce.
    4. Deep reduction: Inline compression is followed by heavier-weight compression algorithms post-process to further increase space savings. 
    5. Copy reduction: Copies made on FlashArray only use metadata. Purity provides instant pre-deduplicated copies of data for xCopy commands, snapshots, replication, and clones.

Two further properties set the FlashArray approach apart:

  • Data reduction is always on. In other arrays deduplication has to be enabled, managed, and thought about, particularly the “performance hit” that comes along with it. In the FlashArray, it’s just how the array works: it can’t be turned off, and all our performance stats include deduplication and compression. We call this ‘Always On’ FlashReduce.
  • Zero-impact snapshots. Inefficient clone and snapshot features can consume significant storage capacity. Zero-impact snapshots do not, and allow for faster recovery and restore times.
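
The append-only write layout with variable addressing mentioned under inline compression can be sketched as a log plus an index. This is a toy model under stated assumptions, not Purity's actual on-flash format: zlib stands in for the compressor, a `bytearray` stands in for the physical medium, and the class name `AppendOnlyLog` is hypothetical.

```python
import zlib


class AppendOnlyLog:
    """Toy append-only layout: overwrites append a new record, never rewrite in place."""

    def __init__(self):
        self.log = bytearray()   # stands in for the physical medium
        self.index = {}          # logical address -> (offset, length) of the live record

    def write(self, address: int, data: bytes) -> None:
        record = zlib.compress(data)
        self.index[address] = (len(self.log), len(record))
        self.log += record       # variable-size record, packed with no padding

    def read(self, address: int) -> bytes:
        offset, length = self.index[address]
        return zlib.decompress(bytes(self.log[offset:offset + length]))

    def live_bytes(self) -> int:
        """Bytes held by records the index still points at."""
        return sum(length for _, length in self.index.values())

    def garbage_bytes(self) -> int:
        """Bytes held by superseded records, to be reclaimed by a later cleanup pass."""
        return len(self.log) - self.live_bytes()


log = AppendOnlyLog()
log.write(0, b"a" * 4096)
log.write(1, b"b" * 4096)
log.write(0, b"new contents " * 100)   # overwrite: appended, index repointed

print(log.live_bytes(), log.garbage_bytes())
```

Because every record is written at the end of the log at whatever size it compresses to, an overwrite never has to fit an old slot and no fixed-block padding is wasted. The cost moves to reclaiming superseded records later, which this sketch only measures and does not perform.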

Not all data reduction technologies and implementations are the same. We agree with the base premise that traditional disk arrays just aren’t ready for data reduction, but a 100% flash array that was designed from the ground up for data reduction is a whole different animal.

Learn more about how IT can reduce power consumption in the data center and explore more specs of the FlashArray//XL.
