Part 2: Deduplication and Compression with SQL Server Databases

This article looks at how built-in deduplication and compression mechanisms in FlashArray can further compress SQL Server databases.

Deduplication vs Compression

Summary

Pure Storage FlashArray can indeed help you reduce your data footprint on your already-natively compressed database structures. After all, those LOB data types in SQL Server (nvarchar(max), ntext, XML, etc.) are, unfortunately, more common than you might think.


In Part 1 of this series, I tried to put a SQL Server spin on answers to questions we often get from customers regarding dedupe and compression. In this installment, I’d like to show how our built-in dedupe and compression mechanisms can increase compression on already-compressed tables.

You might ask yourself whether you should look into deduping or compressing a SQL database at all. Deduplication pays off even for small databases, and compression greatly reduces the amount of space needed to store data. You could argue that compression only matters once a database reaches enterprise-scale record counts, but it doesn’t hurt for smaller businesses to think about saving money on storage as well.

Deduping is exactly as it sounds—eliminating duplicate copies of data from storage. Duplicate records can skew reporting and analysis, undermine predictions, and lead to poor future decision-making. They also exhaust data storage unnecessarily.

Compression uses an algorithm to slim a larger file down to a fraction of its original storage footprint. Some argue it slows down data processing, but a fast storage system like Pure Storage® FlashArray™ can mitigate any performance degradation. Minimizing storage space is especially important for backups. A large enterprise organization could easily have petabytes of data, so compression can make a huge impact on IT costs.

With that said, I performed tests on FlashArray with deduping and compression to evaluate performance and storage. Here is an explanation of my experiments. 

The Setup

The fine folks at StackExchange have made their Q&A site database publicly available through several means—you can even interactively query it at Data.StackExchange.com. Of particular interest for this post is their full database dump, available on the Internet Archive. A tool called SODDI (StackOverflow Data Dump Importer) can load these data dumps into a SQL Server database.

Armed with the StackOverflow database, I started asking myself questions about the effectiveness that FlashArray dedupe and compression would have on SQL Server compressed data.

The Tests

I wanted to start clean, so I rebuilt all the indexes for all tables in the StackOverflow database with DATA_COMPRESSION=NONE. Then, I ran the standard “Disk Usage by Top Tables” report bundled with SQL Server Management Studio. Since this database has only seven tables, I didn’t need anything fancier than that. I repeated that process for DATA_COMPRESSION=ROW, then DATA_COMPRESSION=PAGE. The results (merged) are below:
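The rebuild step looks roughly like this (a sketch using dbo.Posts as the example table—in practice I repeated this for every table in the database):

```sql
-- Baseline: rebuild all indexes with no SQL Server compression
ALTER INDEX ALL ON dbo.Posts REBUILD WITH (DATA_COMPRESSION = NONE);

-- Then row compression...
ALTER INDEX ALL ON dbo.Posts REBUILD WITH (DATA_COMPRESSION = ROW);

-- ...and finally page compression, measuring table sizes after each pass
ALTER INDEX ALL ON dbo.Posts REBUILD WITH (DATA_COMPRESSION = PAGE);
```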

A few things can be observed from these numbers: Some tables compressed extremely well, and others, not so much. Take dbo.Votes, for example. The base table went from 1.66GB down to 0.6GB. That’s pretty good. 

Then there is dbo.Posts. I took a look at the table definition for clues as to why it didn’t compress, and something popped up immediately:
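Here’s an abbreviated version of that definition (illustrative—only a few columns are shown, and the exact schema produced by the import tool may differ):

```sql
CREATE TABLE dbo.Posts (
    Id           int      NOT NULL PRIMARY KEY CLUSTERED,
    PostTypeId   tinyint  NOT NULL,
    CreationDate datetime NOT NULL,
    Score        int      NOT NULL,
    Body         ntext    NULL  -- LOB type: stored off-row, in LOB_DATA allocation units
    -- (remaining columns omitted)
);
```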

“Aha!” I said to myself, “That ntext column right there!” To confirm, I ran the following query (results included):
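The query was along these lines (a sketch—the original was shown as a screenshot; it sums pages per allocation unit type for dbo.Posts from the system catalog views):

```sql
-- Page counts per allocation unit type for dbo.Posts.
-- For IN_ROW_DATA and ROW_OVERFLOW_DATA, container_id maps to hobt_id;
-- for LOB_DATA, it maps to partition_id.
SELECT  au.type_desc        AS allocation_unit_type,
        SUM(au.total_pages) AS total_pages
FROM    sys.partitions AS p
JOIN    sys.allocation_units AS au
        ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2       AND au.container_id = p.partition_id)
WHERE   p.object_id = OBJECT_ID('dbo.Posts')
GROUP BY au.type_desc;
```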

Sure enough, that table has 3.78 million pages of LOB_DATA—the allocation unit type for data kept on a separate data structure from the clustered index/heap. SQL Server’s row and page compression methods do not affect data that is stored off-row. And thus, the majority of the data in this dbo.Posts table remains uncompressed. This sounds like an opportunity for our FlashArray system.

After I rebuilt the indexes with page compression, I went to look at the data reduction ratio on the array’s console, as shown in the screenshot below:

Mind you, the StackOverflow database file (.mdf) and transaction log (.ldf) are the only files in this volume. To give you an idea of what kind of compression ratio you would get from the uncompressed data, this is what I saw once I rebuilt all indexes with DATA_COMPRESSION=NONE:

Some of you might be thinking, “This is an edge case—the majority of the pages in the entire database are in LOB_DATA allocation units.” I agree with you, but this is just one example using a very well-known database. Here’s another data point: At my previous job, where we used a FlashArray system, the main database (~11TB in size) compressed at a very healthy 3:1 ratio.