We had a GREAT week out at VMworld last week. On the heels of our killer launch, VMworld was an opportunity to show the world that the Pure Storage FlashArray is real, and give folks a chance to see what the FlashArray is capable of achieving.  Funny thing we noticed at VMworld: most of our competitors weren’t showing much.  They maybe brought an array and powered it on to show the GUI, but they weren’t running any workload through it.  We took the opposite approach, and brought over $500,000 of gear to VMworld, stuffed it in both our booth and the Samsung booth, and set out to prove to the world what all-flash storage was capable of.  And show it we did….

 

The Goal

Our goals for the demos were simple:

  • Show that a high level of consolidation is possible with VMware on all-flash storage: 1,000 VMs in a highly consolidated server / LUN / datastore environment
  • Show the data reduction possible in VMware workloads, often 10-20x, and show that this data reduction can be achieved without sacrificing performance
  • Show that consistent low latency can be achieved (<1ms) at high IOPS (>100,000), and show it can be achieved with a variety of workloads running that don’t interfere with one another
  • Show that these performance and consolidation benefits can be applied to virtual server workloads, virtual desktop (VDI) workloads, and virtualized database environments

The Setup

As I mentioned, we brought over $500,000 worth of gear to bear across our three separate demos (virtual server, VDI, and virtualized Oracle).  In this post, we’re describing the centerpiece demo on VMware server consolidation.  Here was the hardware rig:
Server & Storage Hardware
  • Two HP DL580 servers
  • Intel E7-4800 series processors (4 processors x 10 cores each x 2 servers = 80 cores total)
  • 1 TB of Samsung Green DDR3 Memory per server (2TB total)
  • Two dual-port 8Gb/s FC adapters per server
  • Pure Storage FlashArray (2 controllers, 2 storage shelves, 11TB of raw physical capacity, 8 total 8Gb/s FC target ports), presented to the servers as two 64TB LUNs/datastores
  • 58TB of total storage consumed in ESXi by 1,000 VMs (38TB in one datastore, 20TB in the other)
  • 16U of active rack space (8U for the servers, 8U for the storage, plus additional for networking and power)
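To put those specs in perspective, here’s a quick back-of-the-envelope on consolidation density, written as a minimal Python sketch.  Every input is a figure from the hardware list above; the derived ratios are simple arithmetic, not separately measured numbers.

```python
# Consolidation density implied by the demo hardware above.
total_vms = 1_000
servers = 2
cores_per_server = 4 * 10      # 4 sockets x 10 cores per server
datastores = 2
active_rack_units = 16         # 8U of servers + 8U of storage

print(f"{total_vms / servers:.0f} VMs per server")                          # 500
print(f"{total_vms / (servers * cores_per_server):.1f} VMs per core")       # 12.5
print(f"{total_vms / datastores:.0f} VMs per 64TB datastore")               # 500
print(f"{total_vms / active_rack_units:.1f} VMs per active rack unit")      # 62.5
```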
Software & Workload:
  • vSphere 5.0 (beta), Virtual Center 5.0, configured with two 64TB datastores
  • Server workload simulation, mixture of Linux and Windows VMs
  • 500 Linux VMs running PureLoad (proprietary load generation tool, simulates OLTP workload, 90/10 read/write mix, 100% random, continuous load, throttled to max 120 IOPS/VM)
  • 500 Windows Server 2008 R2 VMs running IOmeter (80/20 read/write mix, 80/20 random/sequential, 32 outstanding IOs, 16K block size, continuous load, throttled to max 120 IOPS/VM)
All of the above ran live at the Samsung booth continuously for 7 days, and was demonstrated in both the Samsung and Pure Storage booths hundreds of times throughout the show.
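For a rough sense of the aggregate load those two VM populations could offer, here’s a minimal Python sketch using only the workload parameters listed above.  Keep in mind that 120 IOPS/VM is a throttle ceiling, so the load actually seen at the array ran below this number.

```python
# Aggregate offered load implied by the workload configuration above.
# The 120 IOPS/VM figure is a throttle ceiling, not a measured rate.
linux_vms = 500        # PureLoad: 90/10 read/write, 100% random
windows_vms = 500      # IOmeter: 80/20 read/write, 80/20 random/sequential, 16K blocks
iops_cap_per_vm = 120

max_offered_iops = (linux_vms + windows_vms) * iops_cap_per_vm
print(f"Ceiling if every VM hit its cap: {max_offered_iops:,} IOPS")        # 120,000

# Blended read fraction at the cap, weighting each population by VM count.
read_fraction = (linux_vms * 0.90 + windows_vms * 0.80) / (linux_vms + windows_vms)
print(f"Blended read fraction: {read_fraction:.0%}")                        # 85%
```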

The Results

The results were…let’s just say, fun.  I knew we were impressing people when I had five different people react with an immediate “Holy Sh!t”…and that was only in the first two hours :).  It was fun to shatter people’s mental picture of how much storage hardware is required to run 1,000 VMs: we’d show the performance, then point to the 8U storage device running in the booth.  Let’s look at the result: consistent sub-millisecond latency.  Below you’ll see a vCenter screenshot of the setup on one of the servers (50% of the VMs):

Here’s what you see:

  • The top two wavy lines (blue, red) are read and write IOPS, respectively (plotted on the left axis).  At this moment this server was doing about 45,000 IOPS total; it fluctuated between 40K and 65K throughout the week (both servers together consistently achieved over 100K).
  • The bottom two lines (orange, purple) are read and write latency, respectively (plotted on the right axis).  As you can see, they both look like flat lines…it turns out that vCenter isn’t used to latency this low and rounds to the nearest millisecond!  (In the Pure Storage GUI below you can see that the average latency across both reads and writes was 0.46ms.)  Time to work with VMware to improve the resolution of their latency measurements!
Now let’s look at the same view from the Pure Storage GUI below (note this is an early beta version of the GUI; the final GUI wasn’t quite ready in time to prepare this particular demo, though we did show it in other demos in the booth):
Here’s what you see:
  • First off, data reduction.  This particular 64TB LUN/datastore was filled with a total of 38TB of VMs.  Those 38TB of VMs represented 6.5TB of what the Pure Storage GUI calls “host written” data, i.e. data actually written from VMware to the array (after removing zeros, patterns, etc.).  The FlashArray then reduced that 6.5TB of data to 417.2GB physically stored on flash.  That’s greater than 15-to-1 data reduction (quite similar to what customers see, in our experience); the sketch after this list works through the arithmetic.
  • Second, performance.  This particular server (again, one server, 50% of the environment) was generating 60,000 IOPS at this moment, at 0.458ms of latency (and you can see from the graph that this latency was sustained for the entire week).
  • Third, power consumption.  We used an APC power controller on the rack with a constant power readout, and the Pure Storage FlashArray was drawing a total of 10 amps, which is 1,200 watts at 120V.
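To make the arithmetic behind the data-reduction and power bullets explicit, here’s a minimal Python sketch using only the figures reported above.  Note I’m using decimal units for the TB-to-GB conversion; binary units would put the ratio closer to 16-to-1.

```python
# Data reduction and power draw, from the figures above.
host_written_tb = 6.5           # data VMware actually wrote, after zero/pattern removal
physically_stored_gb = 417.2    # data physically stored on flash after reduction

reduction_ratio = (host_written_tb * 1_000) / physically_stored_gb   # decimal TB -> GB
print(f"Data reduction: {reduction_ratio:.1f}-to-1")                 # ~15.6-to-1

amps, volts = 10, 120
print(f"FlashArray power draw: {amps * volts:,} W")                  # 1,200 W
```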

The Comparison with Inefficient Spinning Disk

This level of consolidation and performance is no doubt impressive.  But what’s more impressive is to consider what it would have looked like if we had run this exact same demo on spinning disk.  Let’s make some basic assumptions on how much spinning disk would be required to drive this load:
  • Arrays differ in cache efficiency depending on the workload and the array itself.  This particular workload was almost completely random (as many VM workloads are), but let’s be kind and say the array can cache 30% of the IOPS (a VERY generous assumption)…that leaves 70K IOPS that must be serviced by disk.
  • Another generous assumption is that a 15K FC/SAS spindle can sustain 180 IOPS consistently under a random workload; that means we’d need 389 spindles.  Protect them with RAID-5 at a conservative 20% overhead (performance disk is often mirrored, for 100% overhead), and we’d need 486 spindles.  (Note that with 300GB drives we’d have far more capacity than we need, about 116TB usable, but that is of course the problem with delivering performance on spinning disk…you can’t use all the capacity.)  The sketch after this list walks through the math.
  • According to a leading array vendor, each drive shelf filled with 15K drives consumes 235 watts, and the array’s dual-controller complex itself consumes 500 watts; across the 32 shelves we’d need, that’s a total of 8,020 watts.  The Pure Storage FlashArray consumed 1,200 watts, a 6.7x advantage!
  • And for physical size, let’s watch the “U”s add up: 2U for the controllers plus 3U for each of the 32 drive enclosures (15 drives each) = 98U, or over two full racks.  The Pure Storage FlashArray consumed 8U, a 12x advantage!
  • Finally, it’s worth noting that it is also quite generous to assume a typical dual-controller storage array could even reach 100K IOPS at all, since doing so would require a massive cache-hit rate that this highly randomized workload simply wouldn’t allow.
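To make this comparison easy to check, here’s the same back-of-the-envelope math as a minimal Python sketch.  All inputs are the assumptions stated in the list above; the rounding choices (ceiling on data spindles, nearest whole shelf) are mine.

```python
import math

# Spinning-disk sizing under the (generous) assumptions listed above.
target_iops = 100_000
cache_hit_fraction = 0.30                 # very generous for a nearly 100% random workload
disk_iops = target_iops * (1 - cache_hit_fraction)            # 70,000 IOPS from spindles

iops_per_15k_spindle = 180
data_spindles = math.ceil(disk_iops / iops_per_15k_spindle)   # 389

raid5_overhead = 0.20                     # 20% of spindles lost to parity
total_spindles = round(data_spindles / (1 - raid5_overhead))  # 486

usable_tb = total_spindles * 0.3 * (1 - raid5_overhead)       # ~116.6 TB with 300GB drives

# Power and rack space, per the shelf/controller figures quoted above.
drives_per_shelf, shelf_watts, controller_watts = 15, 235, 500
shelves = round(total_spindles / drives_per_shelf)            # 32
disk_watts = shelves * shelf_watts + controller_watts         # 8,020 W
disk_rack_units = 2 + shelves * 3                             # 98U

flash_watts, flash_rack_units = 1_200, 8
print(f"{total_spindles} spindles, ~{usable_tb:.1f} TB usable")
print(f"Power: {disk_watts:,} W vs {flash_watts:,} W ({disk_watts / flash_watts:.1f}x)")
print(f"Space: {disk_rack_units}U vs {flash_rack_units}U ({disk_rack_units / flash_rack_units:.1f}x)")
```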

And Finally Some Thanks…

Pulling off a demo like this wouldn’t be possible without a huge team effort.  A quick word of thanks to everyone who was part of that team:
  • Thanks to Samsung for donating the 2TB of Samsung Green DDR3 DRAM for the demo and for hosting us in their booth at VMworld
  • Thanks to Ravi Venkat, Bryan Wood, and Pratik Chavda, the Pure Storage Solutions Architects who created and ran this demo during VMworld
  • Thanks to the Pure Storage engineering team, who created the revolutionary FlashArray, and worked with us to polish the demo leading up to VMworld