How to Troubleshoot Oracle Databases with Storage Snapshots

With an ever-increasing dependency on data for all business functions and decision-making, the need for highly available application and database architectures has never been more critical.

database

4 minutes
image_pdfimage_print

Over recent years, we’ve seen an explosion in data growth, driven by the realisation of the business value of data generated and the insights we can gain from its analysis.

The need to store more data and for longer periods of time to support new AI/ML processes and regulatory and compliance requirements has resulted in the need for additional storage infrastructure, more efficient ways of working, and at the same time, ways to address data security, governance, and management challenges.

Business Data Loss and Corruption

With an ever-increasing dependency on data for all business functions and decision-making, the need for highly available application and database architectures has never been more critical.

Modern space-efficient storage infrastructure has helped reduce cost and complexity, increased availability, and provided greater capacity, allowing us to support these ever-growing data set demands.

However, even with highly resilient IT infrastructure, database professionals frequently find themselves needing to quickly respond to a loss or corruption of production data due to hardware issues, human error, or software bugs.

With many organisations delivering customer-facing applications, any outage can have a significant impact on service availability, revenue, and customer satisfaction.

Data Loss and Corruption

Writing this post made me recall a now infamous Reddit post about a junior software developer who destroyed a production database on their first day and lost their job.

Allegedly, what should have been a simple local development environment set-up went wrong with a copy/paste switch-up. By inputting the wrong values, tests that should have been run in a personal DB instance ran on a production database, clearing all existing data and replacing it with fake data. We can’t be sure if this actually happened as described, but let’s consider some of the data protection and recovery options we have available to us to recover from this scenario.

Storage Replication

What about storage replication? Many databases use storage replication for high availability (HA) and disaster recovery (DR). These can provide great protection from data centre failures. However, they don’t provide protection against a physical data block corruption or logical mistakes. As a result, these corruptions would be replicated intact.

Database Replication

What about database replication? Oracle database block corruptions can be identified by database replication technologies, for example, Oracle Data Guard. However, they’re more likely to be undetected and only identified much later by Oracle Recovery Manager (RMAN) block validation. Logical data corruptions will continue to be replicated without any issues being raised, removing the option of failing over to a standby database.

Storage Snapshots

If you have a modern all-flash storage array and have been taking regular recoverable, crash-consistent snapshots, the quickest and simplest approach will be to take the application and database offline and instantly roll back to the last-known good snapshot.

Tip: Before you perform any form of restore or recovery, take a storage snapshot of your current environment since you may need to reference or use it in the event this is the best you have.

Traditional Oracle Database Recovery

Database restores and recoveries using RMAN backups, file system copies, or database exports are slow compared to the use of storage snapshots. These are our belts and braces (or belts and suspenders, for US readers). Hopefully, we won’t need them, but if we do, we need to know they’re usable. Remember to keep checking and testing them.

Database

All of the above will result in some loss of data, an application outage, and possible coordination and corrective actions in other related applications or software-as-a-service (SaaS) solutions.

In-place Surgical Data Fix

An alternative approach you may want to consider first is whether you can perform an in-place surgical data fix to correct the logical corruption or missing data, for example.

Here’s how to perform an in-place surgical data fix:

  1. Perform a storage snapshot of database volume(s).
  2. If the majority of the application is unaffected and application architecture is able to support it, limit access to be read-only for the impacted module(s). Alternatively, shut down application servers and stop remote database access.
  3. Create a database clone using the storage snapshot for the support and development teams.
  4. Investigate issues, develop data fixes, and agree on resolutions with the application owner.
  5. Apply data fixes to the production database with confidence and confirm the issue has been resolved.
  6. Re-enable full access to the database and application.

An in-place surgical data fix can provide the following benefits over other methods:

  • Maintain service availability and reduce the impact to end users of the application.
  • Reduce impact on other related applications and SaaS solutions.
  • Remove the need for remedial action on replication targets.

For more information on how FlashArray™ volume snapshots can be used with database management systems, read the technical reference guide: “Using FlashArray Volume Snapshots with Databases.” If you want to try out managing FlashArray for yourself, take a test drive.

oracle