Pure Storage’s Take on Server Flash Caching

Yesterday EMC announced its long-awaited entry into the server-local PCIe flash caching market: VFCache (aka Project Lightning). When the #1 storage vendor jumps on a technology bandwagon, it’s always interesting. Even before this announcement many customers have expressed confusion to us about when they should consider using server-based flash caching vs. array-based flash, so we wanted to take a minute to explain our thinking on server-based flash, and how the server, application, and storage architecture will evolve around ubiquitous server flash. We summarized our thoughts right up-front, and you can read the details if we strike a chord.

The Shot of Espresso Version (60 seconds or less):

  • Flash will be as ubiquitous at the server tier as DRAM, just a standard feature of every server you order. As such, the application architecture of major enterprise software applications (Oracle, SQL, VMware, SAP, Exchange, etc.) will morph to be able to natively take best advantage of server-local flash.
  • For most workloads, server flash caching of the storage IO doesn’t address the core of the performance pain, which is waiting for hard drives to seek and rotate. The Achilles’ heel of flash caching is lack of predictability. For any workload that’s not 100% cacheable or that requires HA (write-through cache), sometimes you get flash performance (10s of microseconds) and sometimes you get disk performance (5s-10s of milliseconds). The situation is reminiscent of Hierarchical Storage Management (HSM) in that the application must be prepared for a latency spike of 2-3 orders of magnitude when reads or writes fall through the fast flash tier and have to wait for mechanical disk. Generally, it’s the longer IOs that cause the most performance pain, and for most workloads server caching will do nothing to accelerate those long IOs.
  • With HA and/or clustered applications, server flash caching often can’t be used at all, because doing so could lead to data corruption in the event of a server failover. That’s right: the most important applications, the ones that need flash the most, can’t take advantage of products like VFCache today.
  • The application and systems software is the appropriate place to implement flash caching logic, not the storage stack. This server code (application, operating system, virtualization or database container) better understands the user workload as well as load balancing and failover (VMware vMotion, Oracle RAC). The storage layer just sees reads and writes and tries to guess what is coming next. Just look at how the largest users of server flash today (Facebook, Google, Amazon, etc.) use flash: they build caching logic and resiliency right into their systems infrastructure and applications.
  • Host-based flash caching isn’t particularly economic. It generally requires expensive SLC flash, which is captive in each of your application servers, leading to poor utilization. Technologies for lowering the cost of flash (deduplication, compression, and MLC flash) aren’t particularly well-suited for host-side deployment.

Net net: Our belief is that until enterprise applications update their architectures to take better advantage of server-local flash caching, most enterprises will be best served by leaving the server-local flash caching to the Facebooks of the world, and instead looking to deploy flash within the shared storage architecture. Storage-networked flash is more economic, fits better into today’s prevailing application and virtualization architectures, allows its performance benefit to impact a greater number of applications, and fixes the real root of your performance challenges: your legacy disk array. Face it: no amount of lightning-fast flash in your servers will help you if 50% of your I/O still requires a pokey hard drive to spin. That’s our belief in a nutshell…if you’d like to understand more of the “why,” please read on.

The Cappuccino Version (5 minutes):
Say ‘No Thanks’ to HSM 2.0

For those willing to invest more time in a deeper look, here’s a bit more of our thinking on why server-local flash caching isn’t the right architecture for most enterprise applications today.


Server Flash Will Commoditize and Become Ubiquitous…Eventually

Make no mistake: although we’re big fans of networked flash, our belief is that flash will become ubiquitous in the server architecture (flash won’t be one-size-one-place-fits-all). Likely every server shipped in a few short years will have server-local flash attached in some manner (either directly on the motherboard, or via server networking). Advances in PCIe make attaching flash straightforward, and Intel and others will work to commoditize as many aspects of the flash connectivity and controller architecture as they can to make flash ubiquitous. Server form factors will evolve to make hot-swapping and servicing that flash straightforward as well. The result: application and systems software developers will have copious amounts of volatile (DRAM) and non-volatile (server flash) memory available at their fingertips to make use of, and application architectures will evolve to do just that.

Who is the Smartest Flash Cache of Them All?

EMC’s strategy is to try to add higher and higher levels of sophistication around the caching software, enabling integration between the storage array and the flash cache to improve cache hit rate. The problem is that the storage array is about the least intelligent layer in the stack with regard to understanding the application and systems software. Server infrastructure software has been successfully managing the DRAM caches for years, and all the logic that can help improve cache-hit rates resides therein. It’s this application and server software that understands what transaction is happening, what the load is, where that VM will be moved next, or what failure has occurred. The reality of server-local flash is that the ideal deployment is one in which the software up the stack makes use of it (after all, that’s what’s going on at Google, Facebook, Apple, and Amazon). Consider that despite the rapid growth of the server-flash leader, Fusion-IO, two tech customers, Facebook and Apple, account for an astonishing 57% of its total revenue. Major gains have come to companies that can custom-tailor their systems and applications to handle flash caching.

Flash Caching: Magic Performance Pixie Dust or HSM 2.0?

The potential appeal of server-based flash is clear: the latency for accessing a PCIe card in a server is measured in 10s of microseconds, while the latency for accessing a storage array across a network (Fibre Channel, Ethernet) is typically 100s of microseconds in the best case (you hit array cache) or 5s-10s of milliseconds in the worst case (you have to spin disk drives). And therein lies the rub. Provided you hit cache, server-side caching is very fast, but if you fall through cache to a backing storage array of hard drives, you will experience at least two orders of magnitude worse latency.

As you may recall, this was exactly how Hierarchical Storage Management (HSM) worked: the system attempts to put the hottest data on disk and the colder data on tape, and when you miss cache, you might as well get a cup of coffee given the increase in latency. As the workload evolves, some reads will miss cache, and then the latency spikes. If multiple IOs fall through, contention on the backend mechanical disk heads can lead to even higher latency spikes (and the situation is much worse with more economical, slower RPM disk drives). The trouble with disparate latencies is that the application must generally be designed to handle the worst case IOs, not the best case ones. Plus, HSM was really complicated.

The result is that for most applications, all server-based flash caching will do is lower the latency of your fastest operations (say from a millisecond to 10s of microseconds), but it won’t help much at all in terms of lowering the latency for your longer IOs (the ones that stall a transaction while waiting on mechanical disk to read or write). What this means for your average latency depends upon the workload, but unless your workload is very highly cacheable (almost all of the working set fits in the server flash cache), it’s unlikely to do anything for the average latency of your slower reads or writes. In a world where workloads are increasingly virtualized and consolidated, storage I/O is getting increasingly random and unpredictable…and you simply can’t cache unpredictable workloads.
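To put rough numbers on that, here is a minimal sketch of the average read latency for a flash cache in front of disk. The two latency constants are illustrative values picked from the ranges quoted above, and the function name and hit rates are assumptions made for the example, not measurements of any product.

```python
# Illustrative latencies picked from the ranges quoted in the text
# (assumptions for this sketch, not measurements).
FLASH_HIT_US = 50.0     # server-local flash hit: 10s of microseconds
DISK_MISS_US = 5000.0   # miss that waits on a hard drive: 5 milliseconds


def expected_latency_us(hit_rate, hit_us=FLASH_HIT_US, miss_us=DISK_MISS_US):
    """Average read latency, in microseconds, for a cache hit rate in [0, 1]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_us + (1.0 - hit_rate) * miss_us


if __name__ == "__main__":
    for hit_rate in (0.50, 0.90, 0.99):
        avg = expected_latency_us(hit_rate)
        print(f"hit rate {hit_rate:.0%}: average {avg:.0f} us, "
              f"worst case still {DISK_MISS_US:.0f} us")
```

Under these assumed numbers, a 90% hit rate still averages about 545 microseconds, roughly eleven times the flash hit latency, and the worst-case IO is no faster than it was without the cache.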

Highly-Available & Clustered Applications: Not So Fast!

The reality for server-based flash caching is not just that it suffers from complexity and latency disparity a la HSM 2.0. The reality is that, in its current form, it is incompatible with workloads that require high availability (HA) or clustering from the underlying storage, such as most business-critical deployments of Oracle, VMware, MS SQL Server, MS Exchange, etc.!

HA means that when a server fails, there is zero loss of data. For HA to work, writes cannot stop in a PCIe server flash card. Rather, write-through caching must be employed, meaning no acknowledgement can be returned to the application until highly available shared storage has captured the change. This means that there is zero reduction in write latency with server-local flash on HA storage workloads: the latency the application perceives is the same as if there were no server-local flash on the server.
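The write-through behavior can be sketched as a toy model. Everything here is hypothetical: the class names, the single `latency_us` figure per device, and the idea of returning a simulated latency instead of actually waiting. It is meant only to show where the acknowledgement comes from, not how VFCache or any other product is implemented.

```python
class SharedStorage:
    """Stand-in for highly available shared storage (hypothetical model)."""

    def __init__(self, latency_us):
        self.latency_us = latency_us  # assumed time for the array to complete an IO
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data
        return self.latency_us

    def read(self, lba):
        return self.blocks[lba], self.latency_us


class WriteThroughCache:
    """Server-local flash cache that never acknowledges before shared storage does."""

    def __init__(self, backing, flash_latency_us):
        self.backing = backing
        self.flash_latency_us = flash_latency_us
        self.cache = {}

    def write(self, lba, data):
        # The ack waits on shared storage, so the application-visible write
        # latency is at least the array's latency, flash card or not.
        backing_us = self.backing.write(lba, data)
        self.cache[lba] = data  # keep a copy to serve later reads
        return max(backing_us, self.flash_latency_us)

    def read(self, lba):
        if lba in self.cache:
            return self.cache[lba], self.flash_latency_us  # hit: flash speed
        data, backing_us = self.backing.read(lba)          # miss: array speed
        self.cache[lba] = data
        return data, backing_us


if __name__ == "__main__":
    array = SharedStorage(latency_us=1000.0)
    cache = WriteThroughCache(array, flash_latency_us=50.0)
    print("write latency (us):", cache.write(7, b"payload"))
    print("read-hit latency (us):", cache.read(7)[1])
```

In this model only read hits run at flash speed. Every write returns at the array's latency, which is the point of the paragraph above.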

What about clustering? If a LUN can be pinned to a particular server (i.e., the data can be cleanly partitioned and failover is not required from the backing shared storage), then any data cached locally on that server’s PCIe card is the current data of record. For a clustered workload, however, there is the possibility that data will be accessed across multiple server blades. The problem, of course, is that if data is written from one server, and then subsequently read from another, you might get the wrong answer. In order to avoid such data corruption (the serving of incorrect data results), one must either (1) use some sort of distributed caching with concurrency management and revocation (this approach has sufficiently high overhead that it has been repudiated; think distributed two-phase commit); or (2) make the cache write-through (see HA discussion above), with any access to a LUN across servers invalidating existing cache contents.
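The stale-read hazard can be shown with a small sketch in which two servers cache blocks from the same LUN. All names here are hypothetical, and per-block invalidation on write is a simplification chosen for the example. A real driver would also have to handle failover, ordering, and in-flight IO.

```python
class SharedLun:
    """Stand-in for a LUN on shared storage that every server can reach."""

    def __init__(self):
        self.blocks = {}
        self.caches = []  # server caches attached to this LUN


class ServerCache:
    """Write-through server-local cache; optionally invalidates peers on write."""

    def __init__(self, lun, invalidate_peers):
        self.lun = lun
        self.invalidate_peers = invalidate_peers
        self.local = {}
        lun.caches.append(self)

    def write(self, lba, data):
        self.lun.blocks[lba] = data  # write through to shared storage
        self.local[lba] = data
        if self.invalidate_peers:
            for peer in self.lun.caches:
                if peer is not self:
                    peer.local.pop(lba, None)  # drop a possibly stale copy

    def read(self, lba):
        if lba not in self.local:
            self.local[lba] = self.lun.blocks[lba]  # miss: fetch from the LUN
        return self.local[lba]


def read_after_remote_write(invalidate_peers):
    """Server B reads a block that server A has just rewritten."""
    lun = SharedLun()
    server_a = ServerCache(lun, invalidate_peers)
    server_b = ServerCache(lun, invalidate_peers)
    server_a.write(0, "v1")
    server_b.read(0)           # server B now caches "v1"
    server_a.write(0, "v2")    # server A updates the same block
    return server_b.read(0)


if __name__ == "__main__":
    print("no invalidation:  ", read_after_remote_write(False))
    print("with invalidation:", read_after_remote_write(True))
```

Without invalidation, server B keeps serving its old copy even though shared storage holds the new data. With invalidation, B misses and refetches, which is correct but gives up the cache hit.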

The result is that until the industry ships software drivers for host-based flash caching that solve the write-through and distributed cache coherency problems, flash caching is a no-go for the applications which need it most.

Where is the Most Economic Place to Adopt Flash?

Shared flash storage offers consistent latency at far lower cost, because writes can be more broadly amortized across less expensive MLC flash, enabling much lower-cost flash architectures. Moreover, dedupe and compression dramatically lower the write cost: why write the data that you’ve already written? Deduplication in particular is far less effective in the server tier because the data set is too small to highly dedupe (distributed dedupe across server instances is potentially feasible but would be expensive in terms of server CPU, network chatter, and latency to reassemble results), and compression is difficult in the server tier because of the required CPU overhead.
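A toy example shows why the size of the dedupe domain matters. The workload is invented purely for illustration (four servers that each hold the same eight "OS image" blocks plus two blocks of their own), and the function names are hypothetical. Real dedupe ratios depend entirely on the data.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Content hash used to detect duplicate blocks."""
    return hashlib.sha256(block).hexdigest()


def stored_blocks_per_server(servers):
    """Blocks stored when each server dedupes only its own data."""
    return sum(len({fingerprint(b) for b in blocks}) for blocks in servers)


def stored_blocks_shared(servers):
    """Blocks stored when one shared array dedupes across all servers."""
    return len({fingerprint(b) for blocks in servers for b in blocks})


def toy_servers(count=4, common=8, unique=2):
    """Invented workload: identical common blocks plus a few unique ones per server."""
    common_blocks = [f"os-image-{i}".encode() for i in range(common)]
    return [
        common_blocks + [f"server-{s}-data-{i}".encode() for i in range(unique)]
        for s in range(count)
    ]


if __name__ == "__main__":
    servers = toy_servers()
    print("logical blocks:       ", sum(len(blocks) for blocks in servers))
    print("stored, per-server:   ", stored_blocks_per_server(servers))
    print("stored, shared array: ", stored_blocks_shared(servers))
```

In this contrived case per-server dedupe finds nothing to remove and stores all 40 blocks, while the shared pool stores 16, because the duplicates only show up once the data from all servers is viewed together.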


Our Net Net is Simple: Most Enterprises Should Pass on Widespread Host Flash Caching for Now

The future of server-based flash is a low-cost, ubiquitous flash layer that’s virtually standard in enterprise servers, and applications which are written from the ground-up to best harness the power of server flash caches. Until that happens, end users will be best served treating solutions like EMC’s VFCache as bridging products—ones that provide excellent acceleration for non-clustered, non-HA storage workloads, but only under a very specific set of constraints.

Our belief is that flash in the networked storage tier is the better answer. An all-flash array can give you consistent sub-millisecond latency for a far better price point and far easier adoption than server-local flash. Server flash can give you really low latency (10s of microseconds) for unclustered, non-HA workloads, but when you miss cache and must wait for hard drives to seek and rotate, you can expect 2-3 orders of magnitude additional latency. Hence, VFCache’s likely sweet spot is competing with Fusion-IO for server storage for those applications (Google, Facebook) that have been designed to take advantage of it. (The irony, of course, is that those are the sort of workloads that don’t employ EMC products today.) Rather, VFCache in the server tier with a traditional hard drive array behind it is only going to make the customer realize that their real performance bottleneck is the back end of their HSM 2.0 configuration: the traditional “performance” disk array behind that super fast flash cache.

If you’d like to fix the latter problem, that pokey disk-based array, then give us a call, or take all-flash networked storage for a spin in our Early Access Program.