The legacy SCSI interface dates to an era when Moore’s law was understood to mean that CPUs simply got faster, and when symmetric multiprocessing (SMP) systems were exotic and limited in scale. The next-generation NVMe interface is designed for modern parallel, multicore architectures.
As an analogy, when SMP support was developed for the Linux kernel, it was developed on systems with two Intel Pentium CPUs: two single-core chips! To make Linux work on these systems, the kernel needed to be updated so that the two CPUs wouldn’t both modify the same data at the same time and corrupt the system state. With only two CPUs, simple locking is the right approach, and Linux SMP support was implemented by adding a single lock, the “Big Kernel Lock” (BKL), which allowed only one CPU at a time to run kernel code.
Old school two CPU motherboard
As Moore’s law continued but clock speeds hit a limit, CPU chips became more powerful by adding multiple cores. SMP servers moved from 2 CPUs to 4, 8, 16 and more. CPU architectures added simultaneous multithreading (SMT) so that a single physical core appears as two or more CPUs to system software. In 2017, it’s not unusual for a server to have 80, 100 or more CPUs.
As CPU counts increased, the BKL became a point of contention. With two CPUs, it’s not so bad if one CPU is stalled for a short time while the other CPU runs in the kernel; with eight CPUs, it’s very wasteful for a CPU to be stuck waiting for the seven other CPUs ahead of it in line to run in the kernel. The kernel architecture evolved, first by splitting the BKL into finer-grained locks so that more than one CPU can run the kernel, and then by adopting lock-free algorithms when even those locks became contended. Hardware evolved along with software, with techniques such as multiple queues and message-signaled interrupts developing to allow multiple CPUs to access devices simultaneously without locking or contention.
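To make the contrast between one big lock and finer-grained locks concrete, here is a minimal sketch in Python. It is only an illustration of the structure, not kernel code: the class names `BigLockCounters` and `ShardedCounters` and the helper `run_workers` are hypothetical, and because of Python's global interpreter lock the example shows where contention would arise rather than measuring any real speedup.

```python
import threading


class BigLockCounters:
    """All counters share one lock, in the spirit of the BKL:
    an update to any counter blocks updates to every other counter."""

    def __init__(self, n):
        self.lock = threading.Lock()
        self.values = [0] * n

    def add(self, idx, amount=1):
        with self.lock:  # every worker queues up here
            self.values[idx] += amount


class ShardedCounters:
    """One lock per counter: workers touching different counters
    never wait on each other."""

    def __init__(self, n):
        self.locks = [threading.Lock() for _ in range(n)]
        self.values = [0] * n

    def add(self, idx, amount=1):
        with self.locks[idx]:  # only contends with users of the same counter
            self.values[idx] += amount


def run_workers(counters, n_workers, n_updates):
    """Start one thread per counter, each doing n_updates increments."""

    def worker(idx):
        for _ in range(n_updates):
            counters.add(idx)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters.values
```

Both versions produce the same totals; the difference is that in the sharded version each worker only ever takes its own lock, so adding workers does not lengthen the line at a single shared lock.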
These hardware innovations came first to networking, where a fast NIC might handle millions of packets per second, while disk-era storage adapters only needed to handle hundreds or thousands of IOs per second. The NVMe interface standard was designed with the lessons of high-performance network and RDMA interfaces in mind (I know, because I participated in the discussions leading up to the NVMe 1.0 spec more than six years ago, even before I joined Pure). SCSI HBAs have a heritage dating to the time of two-CPU servers; hardware interfaces have been improved over time, but the single-threaded ancestral features are still there. In contrast, NVMe was designed from the beginning with the most advanced techniques for parallel access, even by hundreds of CPUs.
Some commentary about storage arrays suggests that SCSI and SAS are just fine inside AFAs. If we just count PCIe lanes and SAS throughput, it doesn’t look like NVMe makes a difference. However, AFA controllers have advanced along with CPUs – for example, the FlashArray//X70 has more than three times as many cores as the FA-320 we sold just four years ago. As we optimized Purity for new controller models, we found that contention on locks in the SCSI driver was wasting significant CPU time. While we’ve been able to improve locking in the SCSI driver, it’s clear that this has limits. With NVMe in //X70, every CPU has a direct queue to every DirectFlash Module, so we never have to worry about locking overhead. We get a simpler (and therefore even more reliable) system that performs better now and has room to scale to future CPU generations.
FlashArray//X prototype showing the midplane
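The "every CPU has a direct queue to every module" arrangement described above can be pictured as a matrix of queues. The following Python sketch is a toy model under stated assumptions, not Purity or NVMe driver code: the `QueueMatrix` class and its `submit` and `drain` methods are hypothetical names, and real NVMe uses paired submission and completion queues in memory shared with the device. The point it illustrates is that when each queue has exactly one submitter, the submit path needs no lock.

```python
from collections import deque


class QueueMatrix:
    """Toy model: one command queue per (CPU, module) pair."""

    def __init__(self, n_cpus, n_modules):
        # queues[cpu][module] is written only by that CPU.
        self.queues = [
            [deque() for _ in range(n_modules)] for _ in range(n_cpus)
        ]

    def submit(self, cpu, module, command):
        # No lock is taken: this queue has a single submitter (`cpu`),
        # so no other CPU can race on it.
        self.queues[cpu][module].append(command)

    def drain(self, module):
        # The module services each CPU's queue in CPU order and
        # returns the commands it picked up.
        serviced = []
        for per_cpu in self.queues:
            queue = per_cpu[module]
            while queue:
                serviced.append(queue.popleft())
        return serviced


matrix = QueueMatrix(n_cpus=2, n_modules=3)
matrix.submit(0, 1, "read A")
matrix.submit(1, 1, "read B")
matrix.submit(0, 2, "write C")
```

The cost of this design is the number of queues, which grows as CPUs times modules; the benefit is that adding CPUs adds queues instead of adding waiters on a shared lock.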
NVMe over Fabrics extends NVMe to allow multiple systems to connect to shared storage over cloud-era fabrics, while keeping the inherent parallelism of NVMe. NVMeF is not just a lower-latency SAN protocol. As we saw earlier, the NVMe advantage within //X is not just more throughput. In the same way, connecting multiple queues from applications directly through to storage using Pure’s DirectFlash technology makes it possible to build massively multithreaded systems and applications without wasting resources on excessive contention. The advantage of NVMeF goes beyond shaving a few microseconds from IO latencies – instead of burning huge amounts of CPU contending on locks between CPUs, applications can use that CPU to do useful work.
FlashArray//X and DirectFlash Module
As we extend //X to allow hosts to connect directly via NVMeF, we’ll bring shared storage to new applications. Today, these applications can’t use a SAN because of inefficiencies in the SCSI stack and HBA drivers, and are forced to use local storage – tomorrow, with NVMeF, they’ll be able to use the same efficient NVMe stack to talk to shared storage.
Storage features like snapshots and REST APIs will make these applications easier to build and manage. It’s going to be really cool, and the entire team and I are working hard to make that future happen.