One of the core “gotta have it” features of modern storage arrays is non-disruptive upgrades, or NDU, as the feature is commonly called.  Why is NDU important? Storage arrays occasionally need upgrades to their controller firmware and core OSes to add new features and fix bugs and supportability issues.  If the storage array doesn’t support NDU, the storage team is forced to ask the application team for an outage window so that the application can be shut down and the upgrade performed.  The problem is, that’s painful.  In many environments such maintenance windows are scheduled only every few months or even once a year, and the last thing the storage admin wants to do is bother the application owner to schedule and manage downtime.

While NDUs have been commonplace in the storage industry for about a decade, most of the new all-flash appliance and array products don’t support them, let alone host-based PCIe flash architectures, which make NDU impossible without mirroring flash cards.  For some it is an issue of maturity: NDU is a difficult feature that requires a mature HA model.  For others it is an architectural challenge: highly-integrated single-appliance designs that rely on firmware at several layers of IO modules, RAID cards, and FPGA-based flash controllers are inherently difficult, if not impossible, to upgrade online.

Pure Storage has supported non-disruptive upgrades since our move to the Purity 2.x codebase, as in our minds NDU is a core feature of the broader high availability feature set.  The Pure Storage architecture features clustered active/active controllers, and either controller is individually capable of delivering the full performance of the array.  This means the array can continue at its maximum performance rate through either an unplanned outage, like a controller failure, or a non-disruptive upgrade event.

How does a Pure Storage non-disruptive software upgrade work?

Here is the process for upgrading the software on a Pure Storage FlashArray (a simplified host-side sketch of the failover sequence follows the list):

  • First, it is important to understand that although the FlashArray is active/active from an IO port perspective (all volumes are available on all four ports of each controller, eight ports in total), it is active/passive from an IO processing perspective (all IO operations are handled on the back end by one controller).  This was a design decision to ensure that the array can maintain full performance through a controller failure or an upgrade.
  • When the upgrade process is initiated, the new Purity Operating Environment software is first installed and staged on the secondary controller, the one that isn’t processing back-end IO; let’s call it CT0.
  • CT0 is then rebooted, at which time its ports become inactive; host multipathing detects the missing ports and automatically directs all IO to the remaining active controller, CT1.  CT1 handles all IO at the full performance of the array.
  • CT0 reboots into the upgraded software and rejoins the array.  All eight ports are active again, and host multipathing detects that the ports have returned and can now direct IO across all of them.
  • The new software is then staged onto CT1, and that controller is rebooted to complete the install.  CT0 detects that CT1 has gone offline and takes over all IO processing for the array, just as it would during a normal HA failure of CT1.  Host multipathing detects the offline ports on CT1 and directs all IO to CT0, which now becomes the primary controller.  The array experiences a short, multi-second pause in IO while the take-over completes, but since this pause is just a few seconds it is easily handled within the SCSI IO timeout limits and managed by host multipathing without disrupting the host or application.
  • After CT1 reboots, it rejoins the array, all eight ports are active again, and host multipathing detects them.
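
To make the host-side behavior concrete, below is a minimal Python sketch that simulates the sequence above from the multipathing layer’s point of view.  It is illustrative only, not Pure Storage code: the port counts mirror the FlashArray description, while the 3-second take-over pause and the 30-second SCSI timeout (a common Linux default) are assumptions.

```python
# Illustrative simulation of the host-side view of the rolling upgrade.
# NOT Pure Storage code; timing values are assumptions chosen to mirror
# the steps above (4 ports per controller, a multi-second pause on
# take-over, a 30-second SCSI IO timeout typical of Linux hosts).

SCSI_IO_TIMEOUT = 30.0   # seconds; assumed typical Linux sd driver default
TAKEOVER_PAUSE = 3.0     # seconds; the "multi-second pause" from the post

class Controller:
    def __init__(self, name, ports=4):
        self.name = name
        self.ports = [f"{name}.P{i}" for i in range(ports)]
        self.online = True

class Host:
    """Stand-in for host multipathing: send IO down any live path."""
    def __init__(self, controllers):
        self.controllers = controllers

    def live_paths(self):
        return [p for c in self.controllers if c.online for p in c.ports]

    def submit_io(self, pause=0.0):
        # Multipathing retries across live paths; IO is only disrupted if
        # the path-less pause exceeds the SCSI timeout -- it never does here.
        if not self.live_paths():
            raise IOError("no paths: would wait up to the SCSI timeout")
        if pause >= SCSI_IO_TIMEOUT:
            raise IOError("pause exceeded SCSI timeout -- app disrupted")
        print(f"IO ok via {len(self.live_paths())} paths "
              f"(pause {pause:.0f}s < timeout {SCSI_IO_TIMEOUT:.0f}s)")

ct0, ct1 = Controller("CT0"), Controller("CT1")   # CT1 starts as primary
host = Host([ct0, ct1])

# Stage software on CT0 and reboot it; IO continues via CT1's ports.
ct0.online = False
host.submit_io()

# CT0 rejoins with the new software; all 8 ports are active again.
ct0.online = True
host.submit_io()

# CT1 reboots; CT0 takes over back-end IO processing after a brief pause,
# well inside the SCSI timeout budget.
ct1.online = False
host.submit_io(pause=TAKEOVER_PAUSE)

# CT1 rejoins; the upgrade completes with all paths restored.
ct1.online = True
host.submit_io()
```

The key property the sketch demonstrates is that IO never waits longer than the SCSI timeout at any step, so no failure is ever surfaced to the application.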

Customers and vendors have different definitions of what “non-disruptive” means, so for the sake of clarity: our NDU model does not deliver port-level NDU (i.e., the SAN does see port outages as the controllers reboot, and relies on host multipathing to work around them).  However, there is never an array-level failure or downtime, and the brief IO pause during the controller role reversal falls well within SCSI IO timeout tolerances, so the application isn’t disrupted.
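
For readers who want to sanity-check this on a Linux host, here is a short Python sketch that reads each SCSI disk’s command timeout from sysfs and compares it against the expected pause.  The sysfs path is standard on Linux; the pause value is an assumption taken from the description above, and the 2x margin is an arbitrary illustrative threshold.

```python
# Sanity check (Linux): confirm each SCSI disk's command timeout comfortably
# exceeds the multi-second take-over pause, so the pause is absorbed by the
# host without an IO error. The pause value is an assumption, not a spec.

import glob

TAKEOVER_PAUSE = 3  # seconds; assumed "multi-second pause" from above

for path in sorted(glob.glob("/sys/block/sd*/device/timeout")):
    dev = path.split("/")[3]          # e.g. "sda"
    with open(path) as f:
        timeout = int(f.read().strip())
    verdict = "OK" if timeout > TAKEOVER_PAUSE * 2 else "CHECK"
    print(f"{dev}: SCSI timeout {timeout}s vs pause {TAKEOVER_PAUSE}s -> {verdict}")
```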

One other note: our current software update model is CLI-driven, but this function will be automated in the GUI as a single-click operation in a future release.