Many storage users are familiar with ALUA, or Asymmetric Logical Unit Access. This describes storage where some paths don’t work at all or give lower performance, because of standby controllers, volumes associated with a controller, or other architectural reasons. The Pure Storage FlashArray provides symmetric access to storage — any IO to any volume on any port always gets the same performance.
The FlashArray has two controllers for high availability, and providing fast access to the same storage through both controllers requires a high-performance connection between the controllers. Since the array can serve multiple gigabytes per second to initiators, the back-end interconnect similarly needs very high throughput to keep up. And because the array needs to handle hundreds of thousands of IOs per second, each with latency under 1 millisecond, the back-end also needs to handle high message rates with low latency.
In fact, this back-end interconnect is one area where we’ve made some interesting improvements in FlashArray//m. To understand this, let’s first look at the back-end interconnect in the familiar FA-400 series arrays. The controllers are separate boxes, which talk to each other over InfiniBand, something like this:
InfiniBand has three important characteristics that make it great as a back-end connection between storage controllers:
- It’s really fast: a 4X FDR connection runs at 56 Gbit/sec, with negligible latency (down in the nanoseconds).
- We don’t burn CPU copying data: all the data movement is done by offload hardware in the InfiniBand adapter.
- The software stack is lightweight: the InfiniBand stack gives our application direct access to the adapter’s hardware, so we don’t have the latency and overhead of system calls when moving data.
In the FlashArray//m, the controllers share an enclosure and talk to each other over a passive midplane. We’ve simplified the architecture to look like this:
We don’t use InfiniBand adapters, and instead connect the PCI Express ports on our processors directly, using a feature called Non-Transparent Bridging, or NTB for short. NTB lets each controller expose a subset of its memory to the other controller efficiently.
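To make the idea of exposing "a subset of its memory" concrete, here is a toy model in plain Python. It is only an analogy, not how the FlashArray software works: the `Controller` class, the `expose_window` method, and the sizes are all hypothetical names chosen for illustration. Real NTB hardware does the address translation through PCI Express apertures, whereas this sketch just hands the peer a writable view onto a slice of a local buffer.

```python
class Controller:
    """Toy stand-in for a controller with some local memory (hypothetical)."""

    def __init__(self, name: str, mem_size: int) -> None:
        self.name = name
        self.memory = bytearray(mem_size)  # all of this controller's "RAM"

    def expose_window(self, offset: int, length: int) -> memoryview:
        """Return a writable view of memory[offset:offset+length].

        The peer can read and write inside this window, and nothing else.
        """
        if offset < 0 or length <= 0 or offset + length > len(self.memory):
            raise ValueError("window must lie inside local memory")
        return memoryview(self.memory)[offset:offset + length]


if __name__ == "__main__":
    ct0 = Controller("ct0", mem_size=4096)

    # ct0 exposes 256 bytes starting at offset 1024; the peer holds the view.
    window = ct0.expose_window(offset=1024, length=256)

    # The peer writes through the window with no copy through an intermediary.
    window[0:5] = b"hello"

    # ct0 sees the data appear in its own memory at the translated address.
    print(bytes(ct0.memory[1024:1029]))
```

The point of the model is that the peer's writes land directly in the owner's memory, and that memory outside the window stays unreachable through the view.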
The hardware is simpler, but we maintained and improved the same three key characteristics of our interconnect:
- It’s still really fast: the path between controllers is now just a PCI Express gen3 x8 link, running at 64 Gbit/sec. This is the same as the PCI link to the InfiniBand adapter in FA-400, so clearly we have not introduced a new bottleneck. In fact, latency is even lower now, since data only has to travel over a single PCI link instead of going from PCI to IB and back to PCI.
- We still don’t burn CPU copying data: our processors include Intel’s “I/O Acceleration Technology” (the little boxes I labeled “DMA” in my diagram). We use these integrated engines to offload data movement without even the (admittedly small) overhead of talking to an InfiniBand adapter over PCI.
- Our software stack is still lightweight: we access the PCI Express NTB and DMA engine hardware registers directly from our application by using the Linux vfio driver. This driver was originally developed to support passing devices into virtual machines, but we’ve used it to build a driver stack matched exactly to our application’s needs, which gives it even lower latency than the Linux InfiniBand stack.
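The link speeds quoted for the two designs (56 Gbit/sec for 4X FDR InfiniBand, 64 Gbit/sec for PCI Express gen3 x8) are raw signaling rates. As a rough sanity check, the sketch below works out both the raw rate and the rate left after line encoding. The per-lane rates and encodings are the published figures for each standard (8 GT/s with 128b/130b for PCI Express gen3, 14.0625 Gbit/sec with 64b/66b for FDR). The `Link` class and its field names are made up for this example, and the calculation ignores packet headers and other protocol overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Illustrative description of a serial link (names are hypothetical)."""
    name: str
    lane_rate_gbps: float  # raw signaling rate per lane
    lanes: int
    payload_bits: int      # data bits per encoded block
    coded_bits: int        # total bits per encoded block

    @property
    def raw_gbps(self) -> float:
        return self.lane_rate_gbps * self.lanes

    @property
    def effective_gbps(self) -> float:
        # Rate after line-encoding overhead only; protocol overhead is ignored.
        return self.raw_gbps * self.payload_bits / self.coded_bits


PCIE_GEN3_X8 = Link("PCI Express gen3 x8", 8.0, 8, 128, 130)
IB_FDR_4X = Link("InfiniBand 4X FDR", 14.0625, 4, 64, 66)

if __name__ == "__main__":
    for link in (PCIE_GEN3_X8, IB_FDR_4X):
        print(f"{link.name}: raw {link.raw_gbps:.2f} Gbit/sec, "
              f"after encoding {link.effective_gbps:.1f} Gbit/sec")
```

Under these assumptions the PCI Express link comes out at about 63 Gbit/sec after encoding and the FDR link at about 54.5 Gbit/sec, which is consistent with the claim that the direct PCI Express path is not a new bottleneck.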
Connecting controllers natively via PCI Express is just one way that the FlashArray//m hardware was purpose-built to enhance and extend the architecture of the FlashArray. Of course, we support a complete non-disruptive upgrade from existing systems to FlashArray//m by temporarily using an InfiniBand connection to the old controllers and then seamlessly switching over to PCI Express between the new FlashArray//m controllers.