
The Debate

If you've been researching flash storage for your Oracle database, you have undoubtedly come across plenty of blogs both touting and panning the strategy of placing redo logs on flash storage.  Oracle appears to be hedging its bets with Exadata Smart Flash Logging (introduced in Exadata Storage Server software 11.2.2.4), which writes redo simultaneously to flash and to disk; whichever completes first, wins.  The case against using flash for redo boils down to two main points:

  1. it’s expensive
  2. it’s no faster than disk since flash is horrible at sequential writes

The primary arguments in favor of using flash for redo are:

  1. it’s cheaper than disk
  2. it’s very fast

How can something be fast and cheap and slow and expensive at the same time?  Certainly not all disks are created equal, and definitely not all flash storage arrays.  Keep reading.

The Hardware

Our test bed consisted of the following:

Server

– HP DL580 G7
– Intel Xeon E7-4870 @ 2.40GHz
– 40 cores
– 512GB RAM

Disk Drives

– 12 Hitachi GST Ultrastar C15K147 HUC151414CSS600
– SAS
– 147GB
– 15K RPM
– 64MB cache
– 6Gb/s SAS interface
– LSI SAS2008 PCIe controller (2 ports, 6Gb/s)
– LSI 5350 Storage Enclosure (24 drive bays)

Pure Storage FlashArray


– 2 Controllers
– 2 Shelves
– 10TB Raw Flash
– ~50TB Usable
– 8Gb Fibre Attached

The Software

Operating System: Red Hat Enterprise Linux Server release 6.3 (Santiago) (2.6.32-279.el6.x86_64)

Database: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production

File System: None (Oracle ASM 11.2.0.3.0)

The Workload

I used the Hammerora tool to drive a TPC-C OLTP workload configured with 100 warehouses and 100 users performing 100,000 transactions each.  The highest transaction rate was ~2,300,000 transactions per minute (tpm), or roughly 38,000 transactions per second.  During the heaviest workload, the database, running in archivelog mode, generated about 160MB/s of redo.  Essentially all of the IOPS went to the redo logs.
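
If you're curious how your own system compares, the redo generation rate is easy to estimate by sampling the cumulative 'redo size' statistic; here's a minimal sketch, assuming you have execute privilege on DBMS_LOCK:

SQL> -- sample the cumulative redo byte count twice, 60 seconds apart;
SQL> -- the delta divided by 60 is the redo rate in bytes per second
SQL> select value from v$sysstat where name = 'redo size';
SQL> exec dbms_lock.sleep(60)
SQL> select value from v$sysstat where name = 'redo size';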

It should be clear that this transaction rate and redo volume are extremely high.  It's probably at least ten times higher than your production environment's sustained workload.  Also, since reads benefit from flash much more than writes, this workload doesn't exactly showcase flash's performance.

What I Saw

All tests used 20 logfile groups sized at 2GB.
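
For reference, adding one of those groups looks roughly like the statement below; the diskgroup name matches the redohdd01 diskgroup used later in this post, and in practice the statement would be repeated (or scripted) for all 20 groups:

SQL> -- one of the 20 2GB groups; repeat for groups 2 through 20
SQL> alter database add logfile group 1 ('+REDOHDD01') size 2g;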

For my first test, I put the redo logs on the hard disks configured for the best possible performance, as follows:

  • All 12 drives put into a single ASM diskgroup with EXTERNAL redundancy (i.e. no mirroring, essentially RAID-0); see the creation sketch just after this list
  • I tried both FINE and COARSE striping for the diskgroup.  It turned out that COARSE striping performed the workload faster, so I used those results for this report.
  • Surprisingly, using Oracle’s “Intelligent Data Placement”  to place the redo logs on the disks’ “HOT” region (basically short stroking) hindered rather than helped, so these tests reflect COLD data placement.

SQL> alter diskgroup redohdd01 modify template onlinelog attributes (coarse cold);

Diskgroup altered.

  •  The redo logs were not multiplexed.
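
For anyone recreating this layout, here's a minimal sketch of how such a diskgroup could be created; only two of the twelve drive paths are shown, and the paths themselves are hypothetical placeholders, not the ones used in these tests:

SQL> -- EXTERNAL redundancy means ASM does no mirroring (effectively RAID-0)
SQL> create diskgroup redohdd01 external redundancy
  2  disk '/dev/oracleasm/disks/HDD01',
  3       '/dev/oracleasm/disks/HDD02';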

Three runs averaged 641 seconds. Not too bad, but despite the optimizations, log file sync was still among the top 5 wait events:

And from a histogram point of view:

The drives were sustaining about 10MB/s of writes each:

So, as the above iostat display indicates, the database was writing about 120MB/s of redo with a service time under 2ms.

Great, but there’s one little problem

This performance isn't bad at all.  The TPC-C workload churned along at nearly 1 million transactions per minute.  The only trouble is that the disks weren't mirrored, the ASM disk group wasn't protected, and the redo logs weren't multiplexed; that's the sort of exposure that gets DBAs fired.

The options for providing redundancy are ASM mirroring (NORMAL or HIGH redundancy) or multiplexing the logs across two different disk groups.  I tried both, and found the multiplexed approach performed better.  When I went from 12 disks in a single disk group to 2 disk groups of six disks each, the average test duration leaped from 641 seconds to 955 seconds (a 49% increase in run time), and log file sync moved into pole position as the top wait event:

And the corresponding histogram:
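
For reference, multiplexing at the database level looks roughly like this; the second diskgroup name is a hypothetical stand-in for the second six-disk group:

SQL> -- one member of each group in each of the two six-disk diskgroups
SQL> alter database add logfile group 1 ('+REDOHDD01','+REDOHDD02') size 2g;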

Break out the Flash

Finally, I moved the redo logs to a disk group built on a single LUN from a Pure Storage FlashArray.  Since the Purity Operating Environment provides RAID-3D redundancy, the disk group used EXTERNAL redundancy, and the redo logs were not multiplexed.  This time, the performance was about 7% better than with the 12 unmirrored hard drives, and log file sync was not even in the top 5 wait events:

As the histogram shows, the log file sync wait event was insignificant compared to the test on the hard drives.
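
If you'd like to see the same distribution on your own system without pulling an AWR report, a query against v$event_histogram gives a quick view:

SQL> -- wait_time_milli is the upper bound (in ms) of each histogram bucket
SQL> select wait_time_milli, wait_count
  2  from v$event_histogram
  3  where event = 'log file sync'
  4  order by wait_time_milli;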

Test Results Compared

Let’s take a look at how the 3 different configurations fared.  The graphs below compare the performance in terms of:

  1. overall test duration
  2. redo write time
  3. log file sync wait time as percentage of DB time
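
The third metric is easy to reproduce on your own system; here's a minimal sketch using the cumulative counters, which approximates what AWR reports for a given interval:

SQL> -- both values are cumulative since instance startup, in microseconds
SQL> select round(e.time_waited_micro / t.value * 100, 2) as lfs_pct_of_db_time
  2  from v$system_event e, v$sys_time_model t
  3  where e.event = 'log file sync'
  4    and t.stat_name = 'DB time';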

Graph: Test Duration

Graph: Redo Write Time

Graph: Log File Sync Time (as a percentage of DB time)

It should be clear that no matter how you look at it, the best performance was on the Pure Storage array.

Revisiting the Debate

As mentioned at the beginning of this post, the debate over whether or not to place redo logs on flash storage centers on two main concerns: cost and performance.  Let's examine each.

Cost

The cost of the 12 spinning disks used in these tests was US$228 each, for a total of US$2,736, or about US$1.55/GB.  However, these disks weren't purchased based on capacity requirements; they were purchased based on IOPS requirements.  Even though I had over a terabyte of raw capacity, the disks held at most 80GB of data (in the multiplexed redo log tests).  Measured against the capacity actually used, the disks cost roughly US$34/GB.

The disk group I created on the Pure Storage FlashArray, on the other hand, was only 50GB; I provisioned what I actually needed (40GB) with a little headroom.  The Pure Storage cost per GB depends on your dedupe ratio (typically about 4.5:1 for redo logs), but no matter what, it's never anywhere near US$34/GB.  Costs in the US$5/GB range are typical, including the RAID-3D overhead.

How cheap does that hard disk look now?

Performance

The Pure Storage array outperformed disk in every regard.  The claim that flash is poor at sequential writes doesn't apply to a Pure Storage array because we don't do sequential writes in the traditional sense.  Before I/O hits the physical media, we dedupe it, compress it, and protect it across multiple SSDs.  Because of this processing, the amount of data that actually reaches the media is generally far less than the amount the application wrote.  Unlike some other players in the flash storage arena, the Pure Storage FlashArray is not "just a bunch of flash."

Conclusion

I don't expect to end the debate about flash vs. disk for redo logs in a single blog post.  But hopefully it's clear that, in the case of Pure Storage, redo on flash is a viable strategy in terms of cost and certainly in terms of performance.  We believe that you only need a single Tier 1 storage solution in your data center, and because the Purity Operating Environment leverages flash's advantages intelligently, it's suitable for any workload.  In addition, the Pure Storage FlashArray is very simple to manage: there is no need to isolate workloads to dedicated LUNs, and no need to sacrifice capacity for performance.
