NetApp recently published a reference architecture paper for deep learning. In AI, data is the fuel that drives accuracy, putting data storage front-and-center in the AI infrastructure. So it’s no surprise that legacy storage vendors are scrambling to get into the AI game, but can legacy storage architectures really keep up with modern AI? Let’s take a deeper look at NetApp’s first AI reference architecture, which seems rather “inspired” by our work with AIRI, but has some important differences to understand.
We unveiled AIRI™ in March, the industry’s first AI-ready infrastructure, brought to market with NVIDIA to bring AI-at-scale to all enterprises. AIRI was created at the behest of our mutual customers, trailblazers aspiring to change their own industries, who built their AI infrastructure with NVIDIA DGX and Pure Storage FlashBlade™. From working with these customers Pure learned a ton about AI infrastructure, packaged these best practices into AIRI and published a reference architecture white paper.
It’s been said that imitation is the sincerest form of flattery. While NetApp’s paper looks rather similar in some areas (and we take that as a compliment), on closer look, we were left with a lot of questions.
Will the Real Reference Architecture Please Stand Up?
NetApp’s paper proposes two reference architectures, yet shows measured results for a completely different design. The two proposed architectures are five NVIDIA DGX-1 systems with one NetApp A800 system, and four DGX-1 systems with one A700 system. But the paper shows performance results for one DGX system to one A700. No other benchmark is offered.
So readers are left scratching their heads with questions like:
- What actual performance will I get if I deploy the reference design?
- If A800 is their fastest system, why does NetApp only show results on A700?
- Why is the measured result even relevant? I can just use local SSDs in a DGX if a single DGX is what I need.
A plausible explanation is that, in the real world, building an AI infrastructure capable of delivering linear scaling for multi-node training is very difficult, and in our opinion particularly so when hamstrung by legacy architectures. Multiple DGX systems can put tremendous pressure on the shared storage system. While FlashBlade makes it look simple to deliver linear results with enough headroom for more DGX systems, the reality is that even the latest legacy storage systems may struggle. We show below how AIRI delivers linear performance with more DGXs and GPUs.
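As a rough illustration of why linear scaling matters, consider the aggregate read throughput a shared storage system must sustain as DGX systems are added. The per-GPU ingest rate below is an illustrative assumption of ours, not a measured figure from either paper:

```python
# Hypothetical sizing sketch: aggregate read throughput a shared storage
# system must sustain so every GPU stays fed as DGX-1 systems are added.
# PER_GPU_MB_S is an assumed illustrative rate, not a measured number.

GPUS_PER_DGX1 = 8       # a DGX-1 contains 8 GPUs
PER_GPU_MB_S = 200      # assumed training ingest rate per GPU (MB/s)

def required_throughput_mb_s(num_dgx: int) -> int:
    """Aggregate storage throughput needed to keep every GPU busy."""
    return num_dgx * GPUS_PER_DGX1 * PER_GPU_MB_S

# Linear scaling means doubling the DGX count doubles the demand on storage:
for n in (1, 2, 4):
    print(f"{n} x DGX-1 -> {required_throughput_mb_s(n)} MB/s")
```

The point of the sketch: a storage system that cannot grow its delivered throughput in step with this demand curve becomes the bottleneck, and GPU utilization falls off as nodes are added.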
Another oddity in the NetApp paper is how sensitive the A700 system appears to be to varying settings. When image distortion (crop, blur, etc.) is enabled as part of the benchmark, the A700 slows down DGX performance by up to 20% (figures 4 and 5 in their paper)! These benchmarks are complex, so configuration variations may affect legacy storage systems in unpredictable ways. We found no such performance variation with AIRI: the same performance is delivered regardless of the setting.
Litmus Test of Modern Scale-Out Architecture
Real-world AI requires a true scale-out storage architecture to support the entire data pipeline. AI is not just about training and ImageNet benchmarks. It is a real-world pipeline of workloads, from ingest and labeling to exploration and training. In aggregate, the AI data pipeline requires the storage system to be great at everything, from sequential to random access, and from small to large files. It also needs to handle many clients constantly requesting and modifying the dataset or its metadata. Here’s a Pure blog that describes the pipeline in more depth.
When given the litmus test of true scale-out architecture, in our opinion NetApp’s A700 and A800 appear to fall flat in two ways. First, the underlying design of the A700/A800 is a federation of individual controller-based appliances, where data volumes are physically tied to controllers and nodes, leading to performance hotspots and manual load balancing. A controller is a fixed resource and cannot share the load with other controllers the way a true scale-out design should. FlashBlade is true scale-out: data moves and scales seamlessly across blades, with intelligent load balancing delivering linear performance.
Second, NetApp’s A series, including the A700, uses a fixed 8KB block size for I/O. FlashBlade has a key technology, a variable-block metadata engine, that adapts to files and objects in real time. While small-file performance may be okay for NetApp’s system (performance may vary with the volume of small files due to metadata management), workloads with large files suffer.
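A back-of-the-envelope sketch shows why a fixed block size hurts large-file workloads: every large read decomposes into many block-sized operations. The 8KB figure comes from the discussion above; the 1 MB variable I/O size is our illustrative assumption:

```python
# Sketch (our illustration, not from either paper): number of I/O
# operations needed to read one large file at a fixed 8 KB block size
# versus an assumed 1 MB variable-size I/O.

FIXED_BLOCK = 8 * 1024        # fixed 8 KB block size
LARGE_IO = 1 * 1024 * 1024    # assumed 1 MB variable-size I/O

def ops_needed(file_size: int, io_size: int) -> int:
    """I/O operations required to read file_size bytes, io_size at a time."""
    return -(-file_size // io_size)  # ceiling division

file_size = 1 * 1024 ** 3  # a 1 GB training file
print(ops_needed(file_size, FIXED_BLOCK))  # 131072 operations at 8 KB
print(ops_needed(file_size, LARGE_IO))     # 1024 operations at 1 MB
```

Two orders of magnitude more operations for the same bytes means far more per-operation overhead, which is exactly where large-file throughput gets lost.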
The ultimate problem is that AI workloads cannot be limited to just small files. As described above, a real-world AI data pipeline pushes the limits of I/O in every direction. In this particular benchmark, ImageNet with TensorFlow, small images are packed into a single very large file, in our opinion posing an architectural challenge to these NetApp systems. This may be the reason why A700 appears to max out at only 300 MB/sec in the deep learning benchmark (figure 5 in NetApp paper).
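For readers unfamiliar with the packing pattern, TensorFlow’s TFRecord format concatenates many small images into one large file as length-prefixed records, turning training reads into large sequential I/O. The sketch below is a simplified, stdlib-only illustration of the idea; real TFRecords also carry CRC checksums:

```python
# Minimal sketch of length-prefixed record packing, the pattern behind
# TensorFlow's TFRecord files: many small records in one large file.
# Simplified illustration only; real TFRecords also include CRC checksums.
import struct
import io

def pack_records(records):
    """Concatenate records into one blob, each prefixed with its length."""
    buf = io.BytesIO()
    for rec in records:
        buf.write(struct.pack("<Q", len(rec)))  # 8-byte little-endian length
        buf.write(rec)
    return buf.getvalue()

def unpack_records(blob):
    """Stream the blob back into the original list of records."""
    out, view, off = [], memoryview(blob), 0
    while off < len(blob):
        (n,) = struct.unpack_from("<Q", view, off)
        out.append(bytes(view[off + 8 : off + 8 + n]))
        off += 8 + n
    return out

images = [b"jpeg-bytes-1", b"jpeg-bytes-22"]  # stand-ins for encoded images
blob = pack_records(images)
assert unpack_records(blob) == images
```

Reading such a file is one long sequential scan of a very large file rather than thousands of small-file opens, which is why a storage system tuned only for small, fixed-size blocks can struggle on this benchmark.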
AIRI Stands Alone Among Its Imitators
Once in a generation, a technological force emerges with enough potential to change industries and societies. AI is that defining force of our generation. The same data responsible for fueling the AI revolution is delivered by storage systems. Yet legacy systems are usually hampered by a 20-year-old software stack, and that age shows in unpredictable performance that is sensitive to configuration settings and I/O patterns.
FlashBlade is unique. It is the industry’s only scale-out storage system built from the ground up to deliver multi-dimensional performance for both file & object. Its software is modern, free from the historical baggage of legacy systems, massively parallel, and powerful enough to accelerate the entire data pipeline.
AIRI is built on FlashBlade, and represents the most advanced solution ever built for AI infrastructure. In working with customers to design AIRI, particularly those paving the way for their own industries, we learned any AI infrastructure must deliver on three essential elements:
- Delivers linear performance to keep GPUs busy at any scale. Many enterprises are exploring AI with a few servers, but are now looking to scale out their efforts into a full infrastructure. Multi-node environments pose many challenges, in both software and infrastructure. So ask your vendor if their solution can deliver linear performance, no matter the scale.
- Supports the AI data pipeline. Real-world workloads are different from benchmarks, often pushing legacy storage beyond its limits. Data is unpredictable. AI is constantly evolving. So ask your vendor if their solution is built to deliver performance for any data, small or large, and for any access pattern, sequential or random.
- Built on a modern scale-out architecture. AI represents a technological shift, from a serial world to a parallel world, from legacy designs to modern ones. Ask your vendor if their solution is truly scale-out, or whether it is built on legacy software that imposes odd limitations on how data is accessed.
If your organization is looking to start its AI journey, bypass imitations built on legacy architectures and get started today with the modern purpose-built AI solution, AIRI from Pure Storage and NVIDIA.