
The world of technology has changed dramatically as IT organizations now face, more than ever, intense scrutiny on how they deliver technology services to the business. Raw performance must be available at all times, as must resource delivery for IT consumers who span the globe. "Planned downtime" is almost a misnomer, as any downtime or slowdown, planned or unplanned, results in the same thing: lost productivity and lost revenue.

Traditional Storage Array Approaches

Storage arrays primarily take one of two approaches to address this challenge: active/active or scale-out. Each has its advantages and disadvantages.

Active/Active: Advantages and Disadvantages

Active/active, unlike active/passive in which only one controller accepts IO, allows both controllers to serve IO to hosts simultaneously. This has the advantage of high performance, since both controllers' CPU and memory are actively used. However, it also means that during a planned or unplanned outage of a single controller, the array loses 50% of its total performance profile. Storage administrators must either perform maintenance during off-peak hours, when a drop in performance is less impactful, or closely monitor controller resources so that the combined busyness of the two controllers never exceeds the capabilities of a single controller, guaranteeing no performance is lost when one controller is offline.
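As a rough illustration of that monitoring rule (a simplified sketch, not how any particular array measures load), the check amounts to keeping the sum of both controllers' utilization at or below what one controller can handle on its own:

```python
# Simplified headroom check for a traditional active/active pair:
# if one controller fails, the survivor must absorb both workloads.

def survives_single_controller_failure(util_a: float, util_b: float) -> bool:
    """util_a and util_b are each controller's utilization (0.0 - 1.0).

    The combined load must fit on a single controller, so the pair is
    only "safe" when util_a + util_b <= 1.0 -- i.e., each controller
    should stay at or below roughly 50% on average.
    """
    return util_a + util_b <= 1.0

print(survives_single_controller_failure(0.45, 0.40))  # True: survivor would run at 85%
print(survives_single_controller_failure(0.70, 0.55))  # False: 125% cannot fit on one controller
```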

Most companies prefer that this work be performed during off-peak hours; however, as businesses scale, the window for off-peak hours continually erodes. This means employees are burdened with performing delicate maintenance tasks in the evening, overnight, on weekends, or even on holidays.

Another consideration with active/active controllers is the concept of volume "ownership" by one controller at a time, often because each controller has its own independent write cache. Storage administrators must take into account which controller owns which volumes, and additional multipathing software or configuration is required on the host side, since paths to one controller are seen as "active" or "optimized" while paths to the other are "standby" or "unoptimized." During controller maintenance, the host must fail over the paths of any volume "owned" by the controller going offline to the surviving controller. It also means more thought must be given to which volumes belong to which controller so workloads are properly balanced across them.
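As a rough sketch of what that ownership model implies on the host side (hypothetical names and states; not the output of any particular multipathing tool), consider how path states flip when a volume's owning controller goes offline:

```python
# Hypothetical model of ALUA-style path states for volumes "owned" by
# one controller in a traditional active/active array.

OWNERSHIP = {"vol1": "ctrl_a", "vol2": "ctrl_b"}  # which controller owns which volume

def path_states(volume: str, online=("ctrl_a", "ctrl_b")):
    """Return the path state a host sees toward each controller for a volume."""
    owner = OWNERSHIP[volume]
    states = {}
    for ctrl in ("ctrl_a", "ctrl_b"):
        if ctrl not in online:
            states[ctrl] = "unavailable"
        elif ctrl == owner:
            states[ctrl] = "active/optimized"
        else:
            states[ctrl] = "standby/non-optimized"
    return states

print(path_states("vol1"))                      # ctrl_a optimized, ctrl_b standby
OWNERSHIP["vol1"] = "ctrl_b"                    # "trespass" vol1 before ctrl_a maintenance
print(path_states("vol1", online=("ctrl_b",)))  # surviving paths now point at ctrl_b
```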


Scale-out: Advantages and Disadvantages

Scale-out is another architecture possibility in which multiple controllers all provide performance and the more controllers you add, the more performance you get. This can be a huge advantage when a system requires extremely large amounts of IOPS or bandwidth (which is why we built FlashBlade®); however, this brings two challenges that can be problematic in traditional scale-out: data consistency and complexity.

By its very nature, scale-out means data and metadata must be consistent between the controllers. If not, a single controller failure could cause a data outage or corruption. All controllers must act in harmony as one. Great care must be taken not only to ensure consistency but also to keep intercontroller traffic as fast as possible so that client performance doesn't suffer. This is a major engineering feat to undertake and often involves external, specialty switches.

This often leads to complexity unless you design your own hardware platform to address it, with nodes that are stateless and Evergreen (see FlashBlade). The more you grow a scale-out system, the more complex it becomes. What happens when a node fails? What about a node and a drive on a second node? Two nodes? All of these scenarios must be accounted for, and since consistency is required across nodes, failures must be addressed rapidly so the data is at risk of corruption for as little time as possible. (Remember: consistency!)

The complexity also manifests financially. If you own a scale-out solution that is three years old, can you still add the original nodes to scale it further? Are there newer nodes, and are they compatible with the originals? If they are, how many generations of heterogeneous nodes can a single system support? If they aren't compatible, do you need a totally new cluster to replace the old one, despite it being only three years old?

For Pure Storage, we decided our scale-out platform, FlashBlade//S, should be designed to support our Evergreen Forever business model so these challenges don’t come back to haunt customers later on. See more here.

Pure Storage’s Controller Approach

When we built our first product, FlashArray™, we determined that neither of these approaches was satisfactory and set out to build something new.

Pure Storage® FlashArray™, released in 2012, was designed to deliver 100% performance, uptime, and, more importantly in today’s world, access to IT resources that can be dynamically and automatically created by users themselves via multiple avenues: API calls, scripts, automation tools, and plugins.
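As one hedged example of that self-service provisioning, the sketch below uses the legacy purestorage Python REST client; the management address, API token, and volume name are placeholders, and the exact calls should be treated as illustrative of the kind of request a script, automation tool, or plugin would make:

```python
# Illustrative self-service provisioning against a FlashArray management endpoint.
# Address, token, and names below are placeholders, not real credentials.
import purestorage

array = purestorage.FlashArray("flasharray.example.com", api_token="<api-token>")

# Create a 1 TB volume on a user's behalf, then read its details back.
array.create_volume("app-data-01", "1T")
print(array.get_volume("app-data-01"))
```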

Achieving these goals required a new, two-part design philosophy when it came to data storage:

  • Component failure cannot and should not compromise performance or access to provisioning. 
  • Data protection should be built in a way that provides maximum resilience but not at the cost of speed of data or access to it, even during data re-protection. 

It had to address data availability, protection, and performance while also ensuring users can continue to consume or alter resources being served by the platform. Maintenance and even hardware failures should have no noticeable impact on performance and API availability.

To tackle this, we decided to design our Purity OS as an abstraction layer that can decouple the hardware and software. The identity of the array (IP addresses, WWPNs, array configuration, etc.) should be defined in software and be portable. This means a Fibre Channel (FC) WWPN exists as a virtual address rather than the hardcoded address tied to the physical port. Even if hardware is changed out multiple times, the FC WWPNs can persist into perpetuity. Should a controller fail, its FC WWPNs can move to the surviving controller, ensuring paths are not lost to the clients.
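A minimal sketch of that decoupling (purely illustrative data structures, not Purity internals): the array's identity lives in a software map, and only the mapping of virtual WWPNs to physical ports changes when a controller fails.

```python
# Toy model of software-defined array identity: virtual WWPNs are mapped
# to physical ports, and the mapping (not the WWPN) changes on failover.

array_identity = {
    "mgmt_ip": "10.0.0.10",
    "wwpns": ["52:4a:93:7a:12:34:56:01", "52:4a:93:7a:12:34:56:02"],
}

# Current placement of virtual WWPNs onto physical controller ports.
placement = {
    "52:4a:93:7a:12:34:56:01": "ct0.fc0",
    "52:4a:93:7a:12:34:56:02": "ct1.fc0",
}

def fail_over(failed_controller: str) -> None:
    """Re-home any WWPN on the failed controller to the survivor's ports."""
    survivor = "ct1" if failed_controller == "ct0" else "ct0"
    for wwpn, port in placement.items():
        if port.startswith(failed_controller):
            placement[wwpn] = port.replace(failed_controller, survivor)

fail_over("ct0")
print(placement)  # both WWPNs now live on ct1 ports; hosts keep the same targets
```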

Combining the Best of Both Worlds – A/A and A/P

We knew the traditional active/active approach alone wasn't the right fit, and neither was active/passive. We wanted FlashArray systems to be as simple as possible to manage and monitor. This meant that administrators should be able to look at the performance of the system as a whole, not individual components, and not be concerned about a controller failure hurting performance or about volumes and paths having to quickly change ownership to the other controller.

Going entirely active/passive wasn’t the route we wanted to take either. While it does solve the problem of constantly monitoring individual controller CPU and RAM utilization, this would mean hosts would see paths to only a single controller at a time as active and the others as standby. 

So we combined the best of both worlds: 

  • Presenting the controllers as active/active to hosts while maintaining an active/standby relationship (we call it primary and secondary) to the backend media means hosts see both controllers as active data paths at all times.
  • When the secondary controller receives an IO, it simply forwards the IO to the primary controller over a passive PCIe bridge contained within the FlashArray chassis (a simplified model follows this list). Since only one controller is ever really servicing IO at a time, there is no need to monitor the performance of each controller, yet hosts can actively send IO to both. 
  • Should either controller go down, not only can our stateless design move WWPNs to the other controller, but the system can also still deliver 100% performance.
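Here is a simplified model of that frontend/backend split (assumed names and structure, not Purity code): both controllers accept IO, but writes received by the secondary are forwarded across the chassis bridge to the primary, which alone talks to the backend media.

```python
# Toy model: active/active frontend, active/passive backend.
# Hosts may send IO to either controller; only the primary commits it.

class Controller:
    def __init__(self, name: str, role: str):
        self.name = name
        self.role = role          # "primary" or "secondary"
        self.peer = None          # set after both controllers exist

    def handle_io(self, io: str) -> str:
        if self.role == "primary":
            return f"{self.name}: committed '{io}' to backend media"
        # Secondary forwards over the (modeled) intra-chassis PCIe bridge.
        return f"{self.name}: forwarded -> " + self.peer.handle_io(io)

ct0 = Controller("ct0", "primary")
ct1 = Controller("ct1", "secondary")
ct0.peer, ct1.peer = ct1, ct0

print(ct0.handle_io("write A"))  # committed directly
print(ct1.handle_io("write B"))  # forwarded to ct0, then committed
```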

How Pure Handles Metadata

We also designed our controllers not to rely on local write caching or local configuration, but instead to use mirrored, shared, persistent NVRAM devices to cache writes and to store all metadata and configuration on the flash media itself. This means we no longer need to worry about "ownership" of volumes, since the metadata associated with them isn't tied to a controller.

This has the added benefit of drastically reducing failover times when a controller needs to change roles from secondary to primary, as it already has access to the shared NVRAM and all the metadata contained therein. It also reduces complexity: we don't need to keep metadata in sync between the controllers because it exists on a shared resource.
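To make the failover-time point concrete, here is an assumed toy contrast (not Purity internals): when staged writes and metadata live on shared devices, promotion is little more than a role change, because there is no per-controller state to rebuild.

```python
# Toy contrast: per-controller write cache vs. shared NVRAM on failover.

def items_to_reconcile_local_cache(peer_cache: list) -> int:
    # Traditional design: the survivor must replay or re-sync the failed
    # peer's private cache before taking over -- work grows with cached state.
    return len(peer_cache)

def items_to_reconcile_shared_nvram(shared_nvram: dict) -> int:
    # Shared NVRAM: the new primary already sees every staged write and all
    # metadata, so nothing needs to be copied -- promotion is just a role flip.
    return 0

print(items_to_reconcile_local_cache(["w1", "w2", "w3"]))                      # 3
print(items_to_reconcile_shared_nvram({"staged_writes": ["w1", "w2", "w3"]}))  # 0
```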

Scale-out was considered early on, but, in the end, the complexity was not worth the cost for the product we wanted to build. When going to three or more controllers, we would have to reintroduce some of the challenges we managed to avoid with the above design. If a system had four controllers, would we design it to provide 100% performance in an “erasure coding” fashion where three controllers supplied IO while the fourth stayed passive? This would require some management overhead for customers and a lot of engineering work on our part. 

Keeping metadata consistency across the four nodes would have an overhead in and of itself, which would erode the advantage of having three or more controllers. If we did a simple mirror, two controllers active and two passive, we would still have to consider many of the same challenges of active/active and scale-out.

In the end, by focusing on a simple, stateless, two-controller architecture (active/active frontend, active/passive backend), we could provide unparalleled stability and simplicity for our customers. To address additional performance needs, we created multiple controller models, each with larger and more powerful CPUs, so customers can easily move to bigger controllers if needed. Since the controllers are stateless and contain no permanent metadata or configuration, swapping them out for more powerful models is a non-event.

This gives customers a reliable platform that can be upgraded and modified as needed into perpetuity:

  • Each hardware upgrade is non-disruptive to clients as there’s no need to “trespass” volumes (change volume ownership, which can cause small disruptions to performance) between controllers. 
  • All paths to both controllers are active and optimized for data flow. 
  • Write caching is handled by a persistent and redundant set of shared NVRAM devices (mirrored NVRAM in FlashArray//X or NVRAM built into the media using erasure coding in FlashArray//XL), and all metadata and configuration of the system is stored on the media, not the controllers. No matter how many times you change out the controllers over the years, you’re still talking to the same “array.”

Pure Storage’s Data Protection Approach

At the end of the day, if your data isn’t protected, always available, and free of corruption, your business is in trouble. 

We knew legacy RAID schemes and mirroring weren't going to cut it with flash. The more often you perform a state change on a cell of NAND (altering its voltage, like changing a 1 to a 0), the faster you wear it out. Imagine what an entire RAID 5 parity rebuild would do to a flash device: not only would there be excessive wear on the NAND, but the performance hit of the rebuild would also drag down system performance.

Addressing the unique challenges of flash itself enabled our engineers to approach data protection in a multitude of new ways.

First, the Purity OS handles wear leveling and garbage collection (cleaning up deleted or overwritten data in the flash) at a global level rather than in individual drive firmware. It has direct line of sight to the NAND itself, which is why we created our own flash media: DirectFlash® modules. We can perform these operations with the context of the entire pool of flash, which enhances its longevity. It also means those processes can run at all times rather than being performed ad hoc by device firmware. Since a cell of flash can only be read or written at any given time, we can't have device firmware deciding to perform garbage collection on data we need to access, as this would increase latency for clients. 

To overcome this, read requests on a FlashArray fall into two buckets: user reads and system reads. Since our DirectFlash Modules don't have firmware acting as a gatekeeper to the NAND, like an off-the-shelf drive does, and Purity itself performs system tasks like garbage collection, we can determine which reads are for host access to data and prioritize them as such. This means the processes necessary for managing flash, such as garbage collection, won't interfere with access to the data.
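One hedged way to picture that prioritization (an assumed scheduling sketch, not the actual Purity scheduler) is a priority queue in which host reads always dequeue ahead of system reads such as garbage-collection scans:

```python
import heapq

# Toy read scheduler: user (host) reads are always served before
# system reads such as garbage-collection scans.
USER, SYSTEM = 0, 1   # lower number = higher priority

queue = []
seq = 0               # tie-breaker keeps FIFO order within a class
for kind, req in [(SYSTEM, "gc scan segment 7"),
                  (USER, "host read vol1/lba 4096"),
                  (SYSTEM, "gc scan segment 8"),
                  (USER, "host read vol2/lba 8192")]:
    heapq.heappush(queue, (kind, seq, req))
    seq += 1

while queue:
    kind, _, req = heapq.heappop(queue)
    print(("USER" if kind == USER else "SYSTEM") + ": " + req)
# Host reads drain first, so background flash management never sits in
# front of latency-sensitive client IO.
```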

To protect against drive failures, our RAID-3D provides N+2 drive protection across each chassis or DirectFlash shelf. Not only can the system continue operating at 100% performance when drives are lost, but RAID-3D was also designed to continuously rebuild parity with free space available in the system using background processes already running on the array. This means no dedicated hot spares (wasted drives), no performance degradation while parity rebuilds, and no racing to the data center on a weekend to replace a failed drive. The system will self-heal back to N+2 protection on its own.
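As a rough illustration of rebuilding into free space rather than onto a hot spare (a simplified capacity-bookkeeping sketch, not RAID-3D's actual parity math), the point is that lost redundancy is re-created wherever free capacity already exists:

```python
# Toy capacity model: re-protecting after a drive loss by rebuilding
# into free space instead of onto a dedicated hot spare.

drives = {f"drive{i}": {"used_tb": 6.0, "free_tb": 4.0} for i in range(10)}

def fail_drive(name: str) -> None:
    lost = drives.pop(name)["used_tb"]
    # Spread the rebuilt data across the remaining drives' free space.
    share = lost / len(drives)
    for d in drives.values():
        d["used_tb"] += share
        d["free_tb"] -= share

fail_drive("drive3")
print(len(drives), "drives remain; redundancy rebuilt from free space")
print({k: round(v["free_tb"], 2) for k, v in list(drives.items())[:3]})
# No spare sat idle: the system healed back to full protection using
# existing free capacity, with no administrator intervention.
```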

On top of that, FlashArray always runs AES-256 data-at-rest encryption (DARE). Again, this doesn't run in the drive's firmware (no reliance on third-party self-encrypting drives) but globally at the Purity OS level. Keys cycle on their own, so there's no need to configure your own KMIP server.
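For readers unfamiliar with what AES-256 encryption and key cycling involve, here is a generic sketch using Python's cryptography package; it illustrates the concept only and is not Purity's implementation or its key-management scheme:

```python
# Generic illustration of AES-256 encryption with periodic key rotation.
# This is NOT Purity's implementation -- just the concept it automates.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def new_key() -> bytes:
    return AESGCM.generate_key(bit_length=256)   # 256-bit data-encryption key

key, nonce = new_key(), os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, b"block of user data", None)

# "Keys cycle on their own": rotate by decrypting under the old key and
# re-encrypting under a fresh one -- no external KMIP server involved.
plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)
key, nonce = new_key(), os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
```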

All of this was built with data protection in mind but without creating additional administrative headaches. Customers should not need to make sacrifices to data protection in the name of performance or vice versa. The system should be protected, resilient, and performant at all times without anyone needing to tune anything or make any decisions about how to configure it. As such, all of our data protection, reduction, and encryption services are enabled right out of the box with no need or ability to disable or tune them.

Why This Matters

These design decisions drive right to the core of our vision for data storage in the modern data center: an always-available, simple, reliable, fast platform to build your business on.

When facing competition like the public cloud, modern IT organizations should find ways to provide services in a cloud-like fashion. If you store any of your own personal data in the cloud, like photos, can you imagine how many times your data has moved to different hardware? Did you notice? Do you care? You want your data readily available and fast, no matter what technology challenges the company that houses that data may face. 

That’s the experience Pure Storage FlashArray delivers.