Why doesn’t every storage array just have data reduction built-in? Are deduplication and compression really that hard?
In a word: yes… they are.
Doing data reduction itself isn't hard, but doing it reliably at large scale, without sacrificing performance, turns out to be really, really hard. Let's explore why data reduction is so difficult, and then how the Pure Storage FlashArray's deduplication and compression solve these challenges.
Why Is Primary Data Deduplication So Difficult?
1. Deduplication Breeds Randomization, Slowing Reads
The single biggest challenge around deduplication is dealing with the massive increase in random I/O. Traditional disk array architectures do a lot of work to try to serialize I/O streams, because random I/O is the Achilles’ heel of a rotating hard drive. The very process of deduplication takes that well-ordered I/O stream and picks out duplicate pieces and stores pointers to those instead.
The result is that any given dataset ends up spread out across the storage array. When you go to read that data back, you may have to seek across tens or even hundreds of disks to retrieve the constituent pieces, reassemble them, and only then serve the I/O… a latency nightmare.
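To make the fragmentation concrete, here is a toy Python model (my own sketch, not any array's actual layout) of a content-addressed chunk store: a logically sequential volume ends up pointing at scattered physical slots, so a sequential logical read becomes a series of seeks.

```python
import hashlib
import random

store = {}   # fingerprint -> physical slot number (first-come order)
volume = []  # logical block address -> fingerprint

def write_chunk(data: bytes) -> None:
    """Dedupe on write: only unique chunks get a new physical slot."""
    fp = hashlib.sha256(data).hexdigest()
    if fp not in store:
        store[fp] = len(store)  # next free physical slot
    volume.append(fp)           # logical address just stores a pointer

# Write 100 chunks drawn from only 16 distinct patterns, in random order.
chunks = [bytes([i % 16]) * 4096 for i in range(100)]
random.shuffle(chunks)
for c in chunks:
    write_chunk(c)

# A sequential logical read now visits non-sequential physical slots.
physical_path = [store[fp] for fp in volume]
jumps = sum(1 for a, b in zip(physical_path, physical_path[1:])
            if b != a + 1)
print(f"{len(store)} unique chunks; {jumps} non-sequential seeks "
      f"out of {len(physical_path) - 1} logical reads")
```

On spinning disk, each of those non-sequential hops is a head seek; that is the randomization penalty the text describes.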
2. Data Deduplication and Compression Slows Writes
While the “reading” side of the equation is hampered by the randomization introduced, the “writing” side of the equation is just as painful, for a simple reason: deduplication and compression take time and CPU cycles to process.
While most modern arrays have a write cache that makes writes somewhat asynchronous, at the end of the day that write cache is limited in size, and the rate at which data drains out of it down to back-end disk is a key limiter of the array's write bandwidth. Deduplication and compression add layers and steps to this process: data must be fingerprinted, fingerprints must be checked and verified (the most secure approaches don't trust fingerprints and compare the actual data bit-for-bit), and data must be compressed… all operations that add precious tens or hundreds of microseconds each to the write operation. For traditional disk arrays this process has simply been too slow, so most primary deduplication has been done "post-process": the data is first landed on disk, and then some overnight or occasional job does the deduplication and compression later.
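The inline steps just described can be sketched in a few lines (a simplified illustration using SHA-256 and zlib; the function and table names are hypothetical, not Pure's actual pipeline):

```python
import hashlib
import zlib

fingerprints = {}  # fingerprint -> original chunk (for bit-for-bit verify)
log = []           # compressed unique chunks awaiting the back end

def inline_write(chunk: bytes) -> str:
    """Fingerprint, verify any hit byte-for-byte, compress new data."""
    fp = hashlib.sha256(chunk).hexdigest()
    match = fingerprints.get(fp)
    if match is not None and match == chunk:  # don't trust the hash alone
        return "deduplicated"                 # store only a pointer
    fingerprints[fp] = chunk
    log.append(zlib.compress(chunk))          # compress before landing
    return "compressed+stored"

print(inline_write(b"A" * 4096))  # first copy: compressed+stored
print(inline_write(b"A" * 4096))  # duplicate: deduplicated
```

Every call pays for a hash, a table lookup, a possible full comparison, and a compression pass before the write can complete; those are the extra microseconds the text is counting.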
The operational reality of these solutions has left a lot to be desired though: the array must be kept 20-30% empty to keep room to “land” data, the deduplication process has an overhead that must be accounted for, and all this must be scheduled and managed.
3. Data Deduplication and Compression Require Massive Virtualization
Perhaps a simpler and more fundamental challenge for most legacy disk arrays is that data reduction requires complete virtualization of the array. Traditional RAID arrays were designed with a fairly rigid architecture: RAID structures were constructed on back-end disk, and logical block addresses presented to the host mapped fairly directly to block addresses on those back-end disks.
Over the past decade more virtualization has entered the typical storage device, with a layer of metadata and indirection between host addresses and back-end blocks, but the degree of virtualization varies significantly with the vintage of an array's architecture. Deduplication and compression necessarily break the 1:1 block linkage, requiring fine-grained virtualization, and many arrays simply weren't built to handle that. Think about it: petabytes of data virtualized into 512-byte chunks yields a metadata structure that has to keep track of trillions of objects. If an array's controllers and metadata structures weren't designed for this level of virtualization, it quite simply isn't possible to retrofit it. This goes a long way towards explaining why, five years later, most arrays still don't have data reduction.
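A quick back-of-the-envelope check on that "trillions of objects" claim (the 16-bytes-per-entry figure is an illustrative assumption, not any vendor's actual metadata size):

```python
# Virtualizing one pebibyte at 512-byte granularity:
PB = 2**50                # one pebibyte in bytes
chunks = PB // 512        # map entries needed for that one PiB
print(f"{chunks:,} entries per PiB")  # ~2.2 trillion

# Even at a lean (assumed) 16 bytes of metadata per entry, that is
# far more than any controller could hold in memory:
metadata_tib = chunks * 16 / 2**40
print(f"{metadata_tib:.0f} TiB of metadata per PiB at 16 B/entry")
```

An architecture whose virtualization map was sized for millions of objects, and kept in controller RAM, has no incremental path to numbers like these.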
4. Compression Makes Modifying Data Expensive
Once you incur the performance hit of deduplicating and compressing data, the performance pain has only begun. Now you not only have to handle reads effectively (per #1), you also have to handle modifications to data. Why is modifying data hard? Because compression is non-deterministic in size. If I write a piece of data and it compresses down to size "x" in my array, then when I later overwrite or modify that data, it may compress to a completely different size, "x+y" or "x-z." In the former case it no longer fits in the "x" slot, so I have to find somewhere to store "y"; in the latter case it fits in the "x" slot, but I waste "z" space (remember, we were trying to avoid wasting space in the first place!).
In most cases, data modification ends up triggering the dreaded "read-modify-write" loop: instead of just landing the new data in the old slot, I have to read the old data, decompress it, figure out what has changed, re-compress it, and re-write it to an entirely new spot on the array, while figuring out how to reuse the old one. And meanwhile, the microseconds of latency tick away…
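You can see the size problem directly with zlib (a toy demonstration of the principle, not how any array allocates slots): overwrite a compressible chunk with less compressible data of the same logical size, and the new version no longer fits where the old one lived.

```python
import zlib

old = b"ABCD" * 1024          # 4 KiB, highly repetitive: compresses well
new = bytes(range(256)) * 16  # 4 KiB logical, much less compressible

slot_size = len(zlib.compress(old))  # space originally allocated
new_size = len(zlib.compress(new))   # space the overwrite now needs

print(f"old slot: {slot_size} B, overwrite needs: {new_size} B")
if new_size > slot_size:
    # Same logical block, different physical footprint: the array must
    # relocate the new version and garbage-collect the old slot.
    print("does not fit: relocate and reclaim the old slot")
```

The same 4 KiB of logical data can occupy wildly different physical footprints before and after a modification, which is exactly why in-place updates of compressed data don't work.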
Adding it up: A Performance Train-Wreck
Add up these four challenges, and it seems that the performance of every single operation a disk array does is horribly impacted by data reduction. It isn't at all surprising that traditional disk arrays haven't been able to make this work. The vendors who have made it work have accepted significant compromises along the way: slower performance, post-processing, limited volume sizes, limited scalability, large chunk sizes.
The results for end-users are mixed at best.
How Is Pure Storage Data Deduplication Different?
Pure Storage has two key advantages: being 100% flash, and designing our array from the ground up for data reduction. By coupling these two we were able to overcome every challenge of primary storage data reduction. Let's look at what makes the Pure Storage FlashArray different:
- Data reduction is always on; it's just how the array works. In other arrays deduplication has to be enabled, managed, and thought about… particularly the "performance hit" that comes along with it. In the Pure Storage FlashArray it can't be turned off, and all our performance stats include deduplication and compression.
- No performance penalty on reads. Because the FlashArray is 100% flash, we don't suffer the read performance hit from randomization; in fact, flash thrives on random I/O. Because of the wear leveling and deletion management we do on flash, we'd spread data out even if we weren't deduplicating, so the performance difference between reading original vs. reduced data is only the decompression. Decompression is fast enough that it is a wash with read speed (i.e., we have to read less data, so we can afford the extra time to decompress).
- Inline, with no performance penalty on writes. In flash, writing is the hard operation… so, somewhat counterintuitively, data reduction is a write accelerator. Yes, compression and deduplication take CPU cycles and time, but if we can eliminate 70%–95%+ of the writes that would eventually land on flash, we can actually increase the overall write bandwidth of the device. Doing this, of course, takes massive CPU and non-volatile caching capacity (to hold the data safely while all this computation happens), which we designed the FlashArray for.
- Global. Other solutions have limited scalability of their metadata, leading to large dedupe chunks and limits to the amount of data in a dedupe pool or volume or disk shelf/controller. Bifurcating the dedupe pool just leads to a large chunk of duplicate data being stored redundantly in many pools, wasting the precious space you are trying to save. Pure Storage is scalable enough to deduplicate globally across the array.
- 512-byte block size. Most primary dedupe solutions today chunk data and look for duplicates at a 4K block size or larger, because their metadata structures have limited scale: the smaller the block size, the more objects there are to manage and the more metadata is required (many legacy arrays keep the majority of their virtualization metadata in memory for performance). Simply put, the smaller the block size, the better the data reduction ratios your array is capable of.
- Transactional performance. It's one thing to maintain performance during writes, but quite another to handle updates well: multiple reads, writes, and modifications to the same block. As discussed above, compression makes that very difficult (leading array vendors suggest not enabling compression on transactional data sets: "Compression is best suited for data that is largely inactive"). The virtualization introduced by the Purity Operating Environment lets all write operations, new writes and over-writes alike, be handled the same way with great performance.
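The block-size point above is easy to demonstrate with a toy example (my own construction, not a real workload): the same data chunked at 512 bytes vs. 4 KiB can expose very different amounts of duplication.

```python
import hashlib
import random

def unique_chunks(data: bytes, size: int) -> int:
    """Count distinct chunks when data is split at the given size."""
    return len({hashlib.sha256(data[i:i + size]).digest()
                for i in range(0, len(data), size)})

# 64 KiB where every 512-byte sector repeats one of only 8 patterns,
# but the sequence of patterns rarely repeats at 4 KiB alignment.
random.seed(1)
sectors = [bytes([random.randrange(8)]) * 512 for _ in range(128)]
data = b"".join(sectors)

for size in (512, 4096):
    n = unique_chunks(data, size)
    print(f"{size:>5}-byte chunks: {n:3} unique -> store {n * size} bytes")
```

At 512-byte granularity the dedupe engine sees at most 8 unique chunks; at 4 KiB, nearly every chunk looks unique and almost nothing deduplicates, even though the underlying data is massively redundant.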
Hopefully this post has given you a sense of why all data reduction technologies and implementations are not the same. We agree with the basic premise that traditional disk arrays just aren't ready for data reduction… but a 100% flash array that was designed from the ground up for data reduction is a whole different animal.