Pushing the Limits of AI and HPC Performance via FlashBlade

The proof is in the IO500 results: FlashBlade//S500 R2 shatters old HPC assumptions, proving all‑flash NFS over Ethernet can match parallel filesystems and keep AI/HPC GPUs and CPUs fully utilized.

Summary

IO500 benchmark results confirm FlashBlade//S500 R2 delivers exceptional all-flash NFS performance for AI and HPC, simplifying infrastructure and maximizing GPU/CPU utilization over standard Ethernet.

For years, HPC has clung to a familiar belief: truly “heroic” performance demands complex parallel filesystems running over specialized InfiniBand fabrics. In that worldview, NFS over Ethernet was fine for general-purpose workloads but not a contender for the massive concurrency and metadata intensity of modern AI and HPC workloads.

FlashBlade//S500 R2 upends that assumption by combining the standard NFS protocol with a purpose-built, natively scale-out stack, engineered with distributed metadata and locking as core platform services rather than as a Linux NFS daemon running on generic x86 servers. The combination of Purity//FB, a modern, purpose-built storage OS, and today’s Linux kernel advancements lifts many of the historical NFS constraints, making modern NFS a credible protocol of choice even for demanding HPC workloads. It enables high-concurrency access and keeps CPUs/GPUs fed over standard protocols, delivering write performance and metadata scale that previously required complex, proprietary parallel filesystems.

The FlashBlade//S500 R2 IO500 advantage

IO500 is a standard benchmark for production-grade deployments: it stresses whether the storage platform can keep expensive CPUs and GPUs fully utilized for both traditional HPC simulations and modern AI workloads. On FlashBlade®, strong IO500 results demonstrate not just raw bandwidth but also metadata scalability and concurrency, which directly influence time to checkpoint, time to first token, and end-to-end pipeline throughput.

Recent FlashBlade enhancements let a single Linux client batch and run non-overlapping NFS writes in parallel, while also reducing lock contention when multiple clients write to the same shared file. The result is much higher per-client write throughput at low latency, so modern AI and traditional HPC workloads can scale without getting bottlenecked by POSIX client-side serialization on high-performance all-flash NFS-based storage.
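As a minimal sketch of the access pattern this unlocks (not Everpure’s implementation; the mount path and sizes are hypothetical), consider a single client where several threads write non-overlapping byte ranges of one shared file. With the batching and parallel-write improvements, writes like these can be dispatched concurrently instead of being serialized by the client:

```c
/* Minimal sketch: threads issue pwrite() calls to non-overlapping
 * regions of one shared file on an NFS mount. Compile with: cc -pthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 8
#define CHUNK (1 << 20)               /* 1 MiB region per thread */

static int fd;

static void *writer(void *arg) {
    long id = (long)arg;
    char *buf = malloc(CHUNK);
    memset(buf, 'A' + (int)id, CHUNK);
    /* Non-overlapping offsets: no byte-range conflict between threads. */
    if (pwrite(fd, buf, CHUNK, (off_t)id * CHUNK) != CHUNK)
        perror("pwrite");
    free(buf);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    fd = open("/mnt/flashblade/shared.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    close(fd);
    return 0;
}
```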

The integration of these optimizations is not merely about achieving a higher benchmark number; it fundamentally changes the performance profile of FlashBlade//S500 R2 for the most demanding, I/O-intensive workloads.

Key technical outcomes:

| Feature/Optimization | IO500 Subtest Improvement | Real-World AI/HPC Impact |
| --- | --- | --- |
| Batching & parallel writes | Higher single-client IOR-easy write bandwidth | Faster loading of tensors to GPU memory and faster distributed checkpoint writes (time to checkpoint) |
| Multi-writer to a shared file | Higher IOR-hard operations to a single file with unaligned 47KB I/O size | Faster shared checkpoints and model updates, especially in distributed training environments |
| End-to-end scalability | Higher IO500 score (combined metrics) | Full GPU cluster utilization and higher throughput for complex pipelines |

Analyzing FlashBlade//S500 R2 IO500 results

By mitigating the serialization constraints of the standard NFS client, Everpure allows applications to fully leverage the concurrent architecture of FlashBlade. This significantly reduces the risk of storage becoming a constraint for high-demand AI and HPC workloads, ensuring that expensive compute (CPU/GPU) resources are kept active and productive.

Based on internal testing at Everpure, FlashBlade achieved exceptional IO500 results in a single uninterrupted run (no partial component measures), using three FlashBlade//S500 R2 chassis and 10 clients on a 400G Ethernet backend network. Performance tuning was limited to kernel parameters and NFSv3 mount options, with very few changes to the IO500 benchmark suite’s config.ini file. Performance is consistent across the hard, easy, and metadata tests, positioning FlashBlade as an optimal platform for mixed traditional high-performance computing (HPC) workloads that require shared, concurrent file access, as detailed in the figures below.
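The exact tuning from the submission is not reproduced here, but as an illustration of the kind of NFSv3 mount options involved (server name, export path, and values are assumptions, not the tested configuration):

```sh
# Illustrative NFSv3 mount for a high-throughput Linux client.
# nconnect opens multiple TCP connections per mount (kernel 5.3+);
# rsize/wsize set 1 MiB transfer sizes. Values are examples only.
mount -t nfs -o vers=3,proto=tcp,hard,nconnect=16,rsize=1048576,wsize=1048576 \
    flashblade-data:/io500 /mnt/flashblade
```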

Figure 1: Comparison of FlashBlade//S500 R2 performance to Vast, DDN, and WekaIO. 

The composite IO500 score, shown in yellow in the chart above, indicates that FlashBlade//S500 R2 is a well-balanced storage platform, with a high score of 142.32. Compared to IO500 submission 782 (DDN EXAScaler “Gautschi”, IO500 score 70.18), FlashBlade//S500 R2 delivered roughly 2x the score at a similar scale under the published configurations.

Out of the box, with no additional configuration, FlashBlade//S500 R2 handles both the high metadata demands of small files and directories and the substantial throughput required for data-intensive workloads, as comparisons against publicly available data from other storage platforms such as Vast, DDN, and WekaIO show.

Figure 2: Comparison of IOR-Easy benchmark results for FlashBlade//S500 R2 vs. Vast, DDN, and WekaIO. 

FlashBlade advantage: IOR-Easy (distributed throughput and compressibility)

The IOR-Easy benchmark is critical for contemporary AI/HPC workflows as it demonstrates the system’s ability to handle file-per-process read/write throughput during substantial sequential workloads. IOR-easy subtests represent large bandwidth-bound workloads at scale, where many clients stream large, contiguous reads/writes to their own files to drive maximum aggregate bandwidth. It maps well to AI/HPC pipelines like sharded checkpointing, per-rank simulation output, and fast data staging/ingest, where performance is primarily about throughput at scale rather than shared-file coordination.
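For reference, an IOR-easy-style invocation looks like the following sketch; the flags are standard IOR options, while the rank count, sizes, and path are illustrative rather than the submission’s config.ini values:

```sh
# File-per-process (-F) sequential I/O: each rank streams large,
# contiguous transfers to its own file to measure aggregate bandwidth.
mpirun -np 2304 ior -a POSIX -w -r -F -t 1m -b 8g -e \
    -o /mnt/flashblade/io500/ior_easy/testfile
```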

This phase highlights the FlashBlade//S500 R2 system’s efficiency in handling compressible data sets, a common trait in modern data pipelines. In testing, the system achieved a massive 104 GiB/s of write bandwidth, proving its capability to ingest heavy, sequential data streams without bottlenecking.

Figure 3: Comparison of MDTest-Easy benchmark results for FlashBlade//S500 R2 vs. Vast, DDN, and WekaIO. 

FlashBlade advantage: MDTest-Easy (concurrent metadata and small files)

The MDTest-Easy benchmark rigorously evaluates a storage system’s metadata performance, specifically its handling of the numerous small files prevalent in AI training sets. The subtest generates a high volume of create/stat/delete/open operations on small (often zero-byte) files, a pattern very common in AI and HPC workloads.
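A sketch of an MDTest-easy-style run is below; the flags are standard mdtest options, and the rank count, file counts, and path are illustrative:

```sh
# Each rank creates, stats, and removes its own zero-byte files in a
# private working directory (-u), stressing pure metadata throughput.
mpirun -np 2304 mdtest -F -u -L -n 10000 -d /mnt/flashblade/io500/mdt_easy
```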

Out-of-the-box scalable metadata has consistently been a core strength of the FlashBlade architecture. The S500 R2 validated this by delivering 7.2 million IOPS (summed across all metadata operations), covering essential operations such as create (write), stat, and delete, and remaining responsive even under the most intense directory-level stress.

Figure 4: Comparison of IOR-hard-write benchmark results for FlashBlade//S500 R2 vs. Vast, DDN, and WekaIO.

IOR-hard-write is intentionally designed to be the most brutal IO500 test. Unlike IOR-easy, it stresses distributed locking and forces many clients to issue small, random, unaligned 47KB writes into a single shared file over NFSv3. This shared-file contention and locking overhead exposes the worst-case write path—and simulates rogue behavior from parallel I/O libraries—so IOR-hard-write reflects minimum bandwidth under extreme concurrency rather than typical AI/HPC throughput. Modern AI/ML typically uses sharded, many-file writes, while IOR-hard-write is a better representation of scientific HPC apps that write many regions into a single shared file.

FlashBlade advantage: IOR-Hard (worst‑case shared-file bandwidth)

To validate raw storage performance without reliance on client-side caching, the benchmark was executed with a uniform --posix.odirect configuration. Under these rigorous, unbuffered conditions, FlashBlade//S500 R2 delivered 6 GiB/s on IOR-hard-write over standard NFSv3, outperforming some comparable parallel storage systems while leading the pack of scale-out, NFS-based all-flash platforms. This demonstrates that high throughput and strong consistency can be achieved without client-side software, proprietary agents, or selective tuning.
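An IOR-hard-style invocation looks roughly like this sketch: every rank writes unaligned 47,008-byte segments into one shared file (no -F), with client caching bypassed via --posix.odirect. The rank count, segment count, and path are illustrative:

```sh
# Shared-file, unaligned small writes: the worst-case write path.
mpirun -np 2304 ior -a POSIX -w -t 47008 -b 47008 -s 10000 \
    --posix.odirect -o /mnt/flashblade/io500/ior_hard/file
```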

Figure 5: Comparison of MDTest-hard benchmark results for FlashBlade//S500 R2 vs. Vast, DDN, and WekaIO.

In the MDTest-hard subtests on FlashBlade over NFSv3, all MPI ranks hammer a single shared directory, issuing CREATE, STAT, READ, and DELETE operations against files in the same namespace, a pattern commonly seen in HPC scratch directories and workflow pipelines that generate huge volumes of small files. It is also relevant to AI/ML environments with small-file data sets or shared checkpoint directories, where metadata, not bandwidth, often becomes the scaling bottleneck.
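A sketch of an MDTest-hard-style run follows; omitting -u places all ranks in one shared directory, and the 3,901-byte write/read sizes follow the IO500 mdtest-hard definition. The rank count, file count, and path are illustrative:

```sh
# All ranks create, stat, read, and remove small files in a single
# shared directory, concentrating contention on one namespace hot spot.
mpirun -np 2304 mdtest -F -n 1000 -w 3901 -e 3901 \
    -d /mnt/flashblade/io500/mdt_hard
```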

The NFSv3 metadata path of FlashBlade scales efficiently as concurrency ramps, driving rapid gains at moderate nproc. At extreme process counts, the test shifts into a shared-namespace “stress” regime where incremental ranks primarily amplify directory hot-spot contention—highlighting the benchmark’s contention limit more than the platform’s scaling.

FlashBlade advantage: MDTest-Hard (worst‑case metadata and small files)

FlashBlade demonstrates exceptional resilience in shared-directory workloads, maintaining 38k IOPS for writes and 24k IOPS for deletes even at high concurrency levels where other storage platforms often degrade. While stat operations are naturally optimized for moderate process counts to minimize locking overhead, the system’s architecture proves uniquely capable of sustaining heavy metadata pressure and shared access at scale.

While it is possible to inflate IOR-hard and MDTest-hard results by drastically reducing concurrency—a strategy that sacrifices IOR-easy performance—we chose a different path. The FlashBlade//S500 R2 was benchmarked with a global nproc of 2304 (48 cores/node x 48 multiplier), accepting the natural contention in the “hard” phases to preserve the massive throughput required for the “easy” phases. This ensures the results reflect a system running at full capacity for modern AI/HPC workloads, rather than one tuned solely for synthetic corner cases.

Conclusion: A new standard for efficiency and scale for AI/HPC workloads

The above IO500 benchmark results provide definitive proof that the tradeoff between simplicity and speed is a thing of the past. With a total IO500 score of 142.32, the FlashBlade//S500 R2 system delivers more than 2X the performance of parallel filesystems like DDN (70.18 IO500 score).

FlashBlade//S500 R2 achieves this higher score with superior efficiency, including:

  • Improved rack density: The benchmark achieved up to 30% better rack density with just three chassis (17U) and 10 clients, avoiding the multi-rack sprawl typically required by competitors.
  • Power savings and simplicity: By consolidating high-bandwidth and high-metadata workloads onto a single platform without proprietary clients or complex tuning, enterprises gain up to 25% power savings and more than 7x greater operational simplicity.

The IO500 benchmark serves as the definitive proof that high-performance enterprise AI and HPC workloads no longer require complex parallel filesystems or specialized InfiniBand (IB) networks. By delivering exceptional results across rigorous bandwidth and metadata phases, FlashBlade//S500 R2 demonstrates that an NFSv3-based architecture over Ethernet is fully capable of matching—and often exceeding—the performance of niche legacy stacks. This validates that enterprises can achieve the massive concurrency and throughput needed to keep CPUs/GPUs saturated on a single, unified namespace, effectively debunking the myth that “AI/HPC-class” speed is exclusive to complex, proprietary infrastructure.