Summary
Pure Key-Value Accelerator (KVA) is a protocol-agnostic key-value caching solution. Combining Pure KVA with FlashBlade delivers faster inference, higher GPU efficiency, and consistent performance across AI environments.
AI inference at scale is trapped in its own version of Groundhog Day, repeating the same work every time a prompt is reused.
Before a large language model (LLM) can generate a single new word, whether answering a question or resuming a saved conversation, it must retrace its entire thought process. This stage, known as prefill, involves reprocessing the full prompt from the user and recomputing attention KV matrices from the beginning. It’s like asking the model to reread the whole book before writing the next sentence. At scale, this makes LLM inference one of the most resource-hungry operations in modern AI systems.
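The prefill cost described above can be sketched with a toy example (plain Python, not a real model): for a prompt of n tokens, the key and value projections for all n tokens must be computed before the first new token is generated, so a repeated prompt repeats identical work.

```python
# Toy illustration (not the Pure KVA implementation): without caching,
# every request recomputes the same K/V projections for the same prompt.

def project(token_embedding, weight):
    """Multiply a 1-D embedding by a square weight matrix (plain Python)."""
    return [sum(e * w for e, w in zip(token_embedding, row)) for row in weight]

def prefill(prompt_embeddings, w_k, w_v):
    """Compute the K and V tensors for every prompt token from scratch."""
    keys = [project(x, w_k) for x in prompt_embeddings]
    values = [project(x, w_v) for x in prompt_embeddings]
    return keys, values

# Two requests with the same prompt: prefill runs twice and produces
# identical tensors both times -- the work a KV cache would eliminate.
prompt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 tokens, embedding dim 2
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[1.0, 1.0], [0.0, 1.0]]

kv_first = prefill(prompt, w_k, w_v)
kv_repeat = prefill(prompt, w_k, w_v)
assert kv_first == kv_repeat  # identical results, recomputed anyway
```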
To solve this, Pure Storage developed Pure Key-Value Accelerator (KVA), a protocol-agnostic key-value caching solution that allows models to persist and reuse these precomputed attention states across sessions. By avoiding unnecessary recomputation, Pure KVA unlocks significant performance gains without changing your model or infrastructure.
This challenge is especially acute as newer models allow larger context sizes and enterprises deploy more interactive, persistent, or multi-turn workloads such as retrieval-augmented generation (RAG) and reasoning agents. Without caching, inference latency grows, GPU inefficiency spikes, and cost efficiency plummets.
While most KV caching approaches have focused on in-memory acceleration for a single session or tenant, Pure KVA brings persistent, multi-session reuse to production-grade environments. It supports both NFS file and S3 object storage, enabling organizations to build flexible, high-throughput AI pipelines.
In benchmark testing, Pure KVA delivered up to twenty times faster inference with NFS and six times faster with S3, all over standard Ethernet. These performance gains help enterprises scale more efficiently, reduce GPU costs, and maintain fast, reliable inference at scale.
For technology leaders focused on operational efficiency and cost control, this innovation in AI architecture marks a significant step forward in making LLM inference truly enterprise-ready while reducing the investment needed for AI infrastructure.
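There is no code here; the section continues below.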
How It Works
Core Functionality
At the heart of Pure KVA is a simple yet powerful idea: persist and reuse the attention states generated during LLM inference so that the model doesn’t need to recompute them every time a prompt is reused. Instead of discarding the key and value tensors after each inference session, Pure KVA captures these intermediate states, compresses them, stores them on a high-performance Pure Storage NFS or S3 backend, and reloads them when needed, eliminating redundant computation.
When the same prompt is seen again, Pure KVA intelligently detects a cache hit and streams the precomputed data directly into the model’s memory, dramatically accelerating inference. This design is ideal for applications with repeated or templated queries, such as chatbots, search augmentation, multi-turn interactions, and RAG systems.
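The persist-and-reuse flow can be sketched in a few lines (hypothetical names and a local directory standing in for an NFS mount or S3 bucket; this is not the Pure KVA API): key the cache on a hash of the prompt, persist the serialized KV tensors, and reload them on a hit instead of rerunning prefill.

```python
# Minimal sketch of the caching idea (assumptions, not product code):
# prompt hash -> compressed KV blob on shared storage.
import hashlib
import pickle
import tempfile
import zlib
from pathlib import Path

class KVCacheStore:
    def __init__(self, root: Path):
        self.root = root  # e.g. an NFS mount or an S3-backed path

    def _path(self, prompt: str) -> Path:
        return self.root / hashlib.sha256(prompt.encode()).hexdigest()

    def put(self, prompt: str, kv_tensors) -> None:
        """Serialize, compress, and persist the KV tensors for a prompt."""
        self._path(prompt).write_bytes(zlib.compress(pickle.dumps(kv_tensors)))

    def get(self, prompt: str):
        """Return cached KV tensors on a hit, or None on a miss."""
        p = self._path(prompt)
        return pickle.loads(zlib.decompress(p.read_bytes())) if p.exists() else None

# First request misses and must run prefill; the repeat hits the cache.
store = KVCacheStore(Path(tempfile.mkdtemp()))
prompt = "Summarize our Q3 product catalog."
assert store.get(prompt) is None          # cold cache: prefill required
store.put(prompt, {"keys": [[0.1, 0.2]], "values": [[0.3, 0.4]]})
assert store.get(prompt) == {"keys": [[0.1, 0.2]], "values": [[0.3, 0.4]]}
```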
Pure KVA Benchmark Results at a Glance
- Up to twenty times faster inference using NFS-backed KV caching
- Up to six times faster inference using S3-backed KV caching
- Validated across both small and large LLMs (e.g., cache sizes from approximately 50MB to 10GB)
- Two to three times faster than other KV caching solutions
These gains enable enterprise teams to reduce GPU hours, scale more cost-effectively, and improve responsiveness without changing model architecture or deployment stack.
Technical Integration
Performance is maximized through efficient serialization, parallelized I/O, intelligent batching, and advanced compression methods, all abstracted away behind a simple interface. This version of Pure KVA integrates natively with vLLM, a widely used open source LLM inference engine in production environments (with support for other engines coming soon). Pure KVA automatically detects the runtime environment, adapting to single-GPU and multi-GPU setups with no manual configuration necessary.
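The I/O pattern described above can be illustrated with a small stand-in (stdlib only; shard names, sizes, and the thread-pool layout are assumptions for illustration, not Pure KVA internals): split the serialized KV state into shards, compress each, and write them concurrently so per-file latency overlaps.

```python
# Sketch of parallelized, compressed KV writes (illustrative assumptions).
import concurrent.futures
import tempfile
import zlib
from pathlib import Path

def write_shard(root: Path, idx: int, payload: bytes) -> int:
    """Compress one KV shard, write it, and return the compressed size."""
    data = zlib.compress(payload)
    (root / f"kv_shard_{idx:04d}.bin").write_bytes(data)
    return len(data)

root = Path(tempfile.mkdtemp())                  # stand-in for an NFS mount
shards = [bytes([i]) * 65536 for i in range(8)]  # stand-in KV byte shards

# Writing shards concurrently overlaps I/O latency across files.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(write_shard, [root] * 8, range(8), shards))

assert len(list(root.glob("kv_shard_*.bin"))) == 8
assert all(s < 65536 for s in sizes)  # highly compressible stand-in data
```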
Pure KVA has already delivered massive speedups on standard Ethernet networks in single-GPU environments, with performance gains increasing alongside token count and scaling further in multi-GPU deployments. This underscores its ability to drive meaningful efficiency gains using existing infrastructure. With this release, Pure KVA outperforms other KV cache systems by two to three times in time to first token (TTFT).
Figure 1. Performance of Pure KVA on S3 and NFS vs. vLLM on Qwen 7B model.
Looking forward, an upcoming release of the Pure KVA library introduces incremental caching, enabling even finer-grained reuse of KV states within partially overlapping prompts or dynamic inputs. This next phase of development will further optimize inference for a broader set of real-world workloads, delivering even greater efficiency for enterprise deployments.
Of course, KV caching is only as effective as the storage layer behind it… which is where Pure Storage® FlashBlade® comes in.
Fast Cache, Smarter Inference: The FlashBlade + Pure KVA Advantage
Large-scale LLM inference often stalls on a hidden bottleneck: loading thousands of KV tensor files in parallel. Traditional storage systems struggle here due to metadata overhead and concurrency limits. This slows response times, decreases GPU utilization, and undermines SLA guarantees. Pure KVA, paired with Pure Storage FlashBlade, delivers a high-performance caching solution that eliminates this barrier. The scale-out architecture of FlashBlade provides consistent throughput and microsecond latency, ensuring cached attention states are served without delay. This ensures your users experience fast response times, even under heavy load or peak concurrency.
Unlike siloed file or object storage, FlashBlade is a unified fast file and object (UFFO) platform that supports both NFS and S3 on a single system. Enterprises can build flexible AI pipelines using file access for performance-sensitive workloads and object storage for distributed or cloud-based inference. These configurations can coexist without requiring separate infrastructure or complex rearchitecture.
Stated simply: Pure KVA reduces redundant computation by caching attention states, and FlashBlade ensures fast, parallel access with low latency. Together, they deliver faster inference, higher GPU efficiency, and consistent performance across AI environments. This combination simplifies AI infrastructure, reduces operational overhead, and supports ultra-fast performance for production workloads even under the most demanding conditions.
The performance gains enabled by Pure KVA and FlashBlade set a new standard for enterprise-scale AI inference. In rigorous benchmarking across both small and large language models, Pure KVA consistently accelerated inference: up to twenty times faster on NFS and six times faster on S3 for larger, long-context, multi-GPU models with cache sizes from approximately 1GB to 10GB, and up to ten times faster on NFS and five times faster on S3 for smaller models with cache sizes from approximately 50MB to 1GB. These results were achieved under real-world conditions, and further improvements are expected as protocol-level optimizations continue.
Figure 2. Performance of Pure KVA on NFS and S3 vs. vLLM on Qwen 1.5B.
Figure 3. Performance of Pure KVA on NFS and S3 vs. vLLM on Llama 3.1 70B.
Enterprise Use Case Examples
The transformative impact of KV cache acceleration becomes clear when examining real enterprise scenarios:
- Retrieval-augmented generation (RAG): RAG pipelines often rely on frequently accessed “hot” enterprise data, e.g., internal documentation, product catalogs, or knowledge bases. Caching these hot documents accelerates inference on every retrieval.
- High-frequency inference workloads: Financial services and trading platforms can cache market data prompts or regulatory language used in real-time analysis.
- Multi-tenant SaaS platforms: Providers serving multiple clients with similar but segmented prompts can maintain per-tenant caches to optimize response times and reduce compute cost—without compromising isolation or performance.
- Conversational AI, reasoning agents, and chatbots: In multi-turn interactions, Pure KVA allows reuse of shared conversation history. This avoids reprocessing full transcripts every turn, improving responsiveness and enabling more natural, scalable user experiences.
- Cost optimization at scale: Enterprises running large LLMs across repetitive workloads can offload up to 90% of redundant computation by caching popular prompts or prompt templates. This drives meaningful reductions in GPU hours while increasing system capacity and throughput.
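The 90% figure above translates into overall GPU savings only in proportion to how much of total GPU time prefill consumes. A back-of-the-envelope calculation (the 60% prefill share is an illustrative assumption, not a measured result) shows how the numbers compose:

```python
# Illustrative cost arithmetic (assumed inputs, not benchmark data):
# if prefill is 60% of total GPU time and caching serves 90% of that
# prefill work, overall GPU time drops by 0.60 * 0.90 = 54%.
prefill_share = 0.60        # assumed fraction of GPU time spent in prefill
cache_hit_savings = 0.90    # fraction of prefill work served from cache
overall_savings = prefill_share * cache_hit_savings
assert round(overall_savings, 2) == 0.54

gpu_hours_before = 1000.0
gpu_hours_after = round(gpu_hours_before * (1 - overall_savings), 1)
assert gpu_hours_after == 460.0  # 540 GPU hours saved on this workload
```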
Bottom Line
Pure KVA is the first enterprise solution to bring persistent KV caching to vLLM with full support for both NFS and S3. It removes one of the most costly inefficiencies in LLM inference and delivers immediate performance gains without requiring changes to your model or infrastructure stack.
With up to twenty times faster inference, Pure KVA enables enterprise teams to scale LLM workloads efficiently, reduce GPU costs, and improve responsiveness across applications. These gains were achieved using standard Ethernet infrastructure, with even greater performance expected on RDMA-enabled deployments.
When combined with Pure Storage FlashBlade, Pure KVA delivers predictable, high-throughput caching at scale. The unified support FlashBlade provides for file and object protocols ensures flexibility across hybrid and cloud environments, with no performance degradation under concurrent access.
For AI leaders building production-grade generative AI platforms, Pure KVA turns the storage layer into a strategic advantage, improving throughput, reducing latency, and unlocking new levels of operational efficiency. The Pure KVA platform is available in targeted release; contact your account representative for more information.
The future of Enterprise AI inference is here, and it’s faster than ever on the Pure Storage Platform.