Summary
Architecting a global KV cache with Pure Storage eliminates redundant prefill, overcomes HBM limits, and delivers faster, more cost-efficient enterprise LLM inference at scale.
The bottleneck in modern LLM inference isn’t the GPU. It’s not even the network. It’s the fact that we’re treating multi-gigabyte tensors, computed at massive expense, as if they’re disposable scratch space.
Here’s what that looks like at scale. A single 128K context prompt on Llama 3.1-70B consumes about 40GB of high bandwidth memory (HBM) just for the key-value (KV) cache, the model’s working memory for the input tokens. With 1,000 concurrent users asking largely the same questions, your enterprise cluster spends most of its cycles recomputing identical tensors over and over. You’re burning petaFLOPs and megawatts to regenerate the same data thousands of times per hour instead of generating actual value for your users.
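That 40GB figure falls straight out of the model’s architecture. Here’s a back-of-envelope sketch, assuming Llama 3.1-70B’s published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 weights for the cache):

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-request KV cache size: keys and values (the leading 2) for
    every layer, KV head, and head dimension, at fp16 precision."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(f"{kv_cache_bytes(128 * 1024) / 1024**3:.0f} GB")  # 40 GB for a 128K-token context
```

At roughly 320KB of KV state per token, context length, not model weights, quickly becomes the dominant consumer of HBM.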
To scale inference, consider treating the KV cache as a first-class data citizen rather than the transient artifact it is today: one that demands lifecycle management, cross-request persistence, and the same storage infrastructure we would design for any essential large-scale data set a corporation depends on. In this post, we’ll explore the technical architecture of KV cache sharing, the mechanics of cross-node reuse, and the systems engineering required to make it real.
Why should you care? Get this wrong and your inference costs scale with user count, regardless of whether each user asks the same or similar questions. Compare that to the KV cache-enabled world where your inference cost scales with new questions, and existing or shared content gets looked up, not recomputed.
The Anatomy of the Prefill KV Cache Problem
When you submit a prompt with n tokens, the self-attention mechanism in the prefill phase scales as O(n²) computations. The model processes all n tokens in a single forward pass, and for each position in the sequence, computes attention across all previous positions. During this process, the keys and values for all n tokens are generated and cached in GPU memory. (See figure below.)
For a 100K token prompt, that’s 10 billion attention operations on a 70B model, taking 8-10 seconds of GPU time. If 500 users are asking about the same information, you’ve just burned 4,000-5,000 GPU-seconds recomputing identical KV tensors.
This is where storage-backed cache injection matters. If that prompt’s KV cache has already been computed and stored on a Pure Storage® FlashBlade® system, you can inject those 100K cached KV pairs directly into GPU memory in 500ms instead of recomputing them from scratch.
The prefill phase goes from O(n²) GPU compute to O(n) storage I/O, eliminating 95% of the time to first token (TTFT).
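The 500ms claim survives simple arithmetic. Assuming roughly 320KB of fp16 KV state per token on a 70B-class model (an illustrative figure, not a measured one):

```python
# Rough feasibility check for 500ms cache injection.
# All figures here are assumptions for illustration, not measurements.
tokens = 100_000
bytes_per_token = 320 * 1024          # ~fp16 KV state per token, 70B-class model
cache_gib = tokens * bytes_per_token / 1024**3
bandwidth_needed = cache_gib / 0.5    # GiB/s required to land it in 500 ms
print(f"{cache_gib:.1f} GiB cache -> {bandwidth_needed:.0f} GiB/s sustained read")
```

Tens of GiB/s of sustained read is within reach of a scale-out flash system over RDMA, while regenerating the same prefix costs seconds of GPU compute.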
Note: Once token generation begins in the decode phase, the cache grows incrementally with each new token generated. But that’s a different optimization. The storage injection benefit happens entirely during prefill.

Figure 1: Diagram highlighting the computational complexity of the input prefill on the left. On the right, the mechanism to store to GPU memory or to flash memory.
The Scale Challenge: Memory vs. Capacity
At the microscale (e.g., a single H100 or B200 GPU), the KV cache resides in HBM, the ultra-fast but expensive DRAM stacked directly on the GPU die. HBM is a scarce resource.
Consider a modest deployment using Llama-3.1-8B. With a 16K context prompt, you can run one or two requests simultaneously on a single H100 GPU without running out of memory. This works fine for a small team.
Now multiply this across a real enterprise AI deployment with 100 H100 GPUs serving 2,000 concurrent users, where hundreds of those users are querying the same quarterly earnings reports, legal documents, or HR policy FAQ. Without cache reuse, that scarce HBM is being consumed by hundreds of identical copies of the same KV cache across your cluster.
Frontier model providers like OpenAI and Anthropic handle users bouncing between React component debugging, Mandarin translation, relationship advice, linear algebra proofs, and fantasy football strategies, all within the same minute. Enterprise AI doesn’t work that way. Your users query the same internal docs, generate SQL against the same schemas, and analyze the same contract templates repeatedly.
With agentic AI, the waste becomes even more staggering. Every agent in your workflow shares the same system prompt (2,000 tokens defining role and behavior), the same tool definitions (5,000 tokens describing available APIs), and the same policy context (4,000 tokens of RAG-retrieved compliance rules). That’s 11,000 tokens of identical context per agent invocation. Scale that to 5,000 agent calls per day and you’re recomputing 55 million tokens of shared context. Without KV cache injection, every single one of those invocations pays the full prefill cost.
This isn’t just inefficient. It’s a waste of compute, a waste of power, and a waste of rack space. Your infrastructure burns cycles on prefill instead of decode, which means your expensive GPUs spend more of their time processing context tokens and less time generating output tokens. Same hardware cost, dramatically lower throughput.
Architectural Tiering: From HBM to ICMS
To solve the capacity wall, the AI inference industry is rapidly adopting a disaggregated cache architecture with three distinct tiers:
- L1 (local HBM): Active tokens for the current generation
- L2 (local host DRAM/NVMe): Recently used caches stored in the node’s system RAM or local NVMe storage
- L3 (distributed storage): The “Global KV Store,” where caches are stored and shared across the entire GPU cluster
This is where NVIDIA’s Inference Context Memory Storage (ICMS) architecture and its network-optimized BlueField-4 DPUs come into play. The vision is for DPUs to handle cache reuse logic and persistence directly from the storage layer. The ICMS standard is still being defined, and vendors are actively building products to support these capabilities. Consider this, though: the data path envisioned for DPU-integrated storage looks remarkably like one that exists today, a fast network that can speak to GPUs and offload KV cache to high-speed shared storage media, such as the flash arrays you probably already own (or should own, if you haven’t yet moved to an all-flash architecture).
Pure Key-Value Accelerator (KVA) on FlashBlade demonstrates exactly how this works in practice. When a cache miss occurs in HBM, the traditional approach is for the GPU to recompute the entire prefix from scratch. This is expensive because a 100K token prefill on a 70B model can take 8-10 seconds of GPU compute.
Pure KVA solves this by injecting cached tensors from FlashBlade storage instead. The system orchestrates remote direct memory access (RDMA) transfers, moving multi-gigabyte KV tensors directly from FlashBlade into GPU memory while bypassing the CPU and kernel networking stack entirely. The scale-out architecture of FlashBlade delivers both high read bandwidth and consistent low latency, meaning cache injection performance remains stable even as you scale to thousands of concurrent users and hundreds of models. Injecting a cached prefix from storage is dramatically faster than recomputing it on the GPU.
The result is a 20X improvement in TTFT, turning what would be a 10-second recomputation into a 500ms cache injection. More importantly, your GPUs spend more time on decode (generating tokens) and less time on prefill recomputation, which translates directly to higher token throughput across your infrastructure and less time your users have to wait for an output.
The Mechanics of Reuse: Prefix Hashing and Matching
How does a system know it can reuse a cache? The industry standard is prefix caching via hashing.
Each block of tokens is hashed (typically using BLAKE3 or SHA-256 for collision resistance). When a new request arrives, the orchestrator breaks the prompt into blocks and checks the Global KV Store for a hash match. If a match exists, the cached KV tensors are injected directly into GPU memory. If no match is found, the system computes the KV cache for those tokens and saves it to the Global KV Store for future reuse.
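The mechanics can be sketched in a few lines. This is a minimal illustration of chained prefix hashing in the style of vLLM’s prefix caching, here with SHA-256; the block size and key encoding are assumptions, not any particular engine’s format:

```python
import hashlib

BLOCK = 16  # tokens per block (illustrative granularity)

def prefix_hashes(token_ids):
    """Each block's hash folds in the previous block's hash, so one hash
    uniquely identifies the block *and* its entire prefix."""
    hashes, parent = [], b""
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        h = hashlib.sha256(
            parent + b"|" + ",".join(map(str, token_ids[i:i + BLOCK])).encode()
        ).hexdigest()
        hashes.append(h)
        parent = h.encode()
    return hashes

def reusable_blocks(hashes, store):
    """Count leading blocks already in the Global KV Store; those get
    injected, and only the divergent tail is prefilled on the GPU."""
    n = 0
    for h in hashes:
        if h not in store:
            break
        n += 1
    return n
```

Two prompts that share a system prompt and document prefix produce identical leading hashes, so the orchestrator injects those blocks from storage and computes only the tail.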
The Collision and Security Boundary
At enterprise scale, hash collisions are not just a technical failure; they are a security breach. If Hash(Prompt A) = Hash(Prompt B), User B could receive a completion influenced by User A’s private data. This means a business analyst in Finance might inadvertently see cached context from an HR investigation, or a junior employee could receive completions shaped by a senior executive’s confidential strategy document.
For organizations with stringent security and isolation requirements, one option is to implement multi-tenant namespacing where hashes are partitioned by Org ID, Department ID, or User ID. This prevents cross-contamination even in the event of a mathematical collision, though it does reduce cache reuse rates across the broader user population.
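A sketch of that namespacing, assuming a simple hierarchical key layout (hypothetical, not a Pure KVA format):

```python
def tenant_key(org_id, dept_id, content_hash):
    """Partition the hash keyspace by tenant: even if two tenants' prompts
    somehow collided on content_hash, their cache entries stay disjoint."""
    return f"{org_id}/{dept_id}/{content_hash}"

# Same content hash, isolated cache entries per department.
finance = tenant_key("acme", "finance", "ab12cd")
hr = tenant_key("acme", "hr", "ab12cd")
assert finance != hr
```

The trade-off is real, though: HR and Finance can no longer share a cache entry for the company-wide system prompt, so reuse rates drop as namespaces narrow.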
Having said that, as a storage company, we’ve used our dedup engine on hash keys for well over a decade. We actively monitor for hash collisions and verify data bit by bit to catch them. Despite hundreds of thousands of deployed systems generating trillions upon trillions of hashes over that time, we have never encountered a single collision. We think this risk, while mathematically real, is in practice nonexistent with modern hash functions.
The Tensor Parallelism Constraint
A significant technical hurdle in KV reuse is how to make the Tensor Layout consistent across different GPU configurations. In a distributed inference setup, the KV cache is partitioned across GPUs based on the Tensor Parallelism (TP) degree.
- The problem: A KV cache generated on a 2-GPU (TP2) setup is sliced differently than one generated on a 4-GPU (TP4) setup.
- The impact: You cannot inject a TP2 cache into a TP4 inference engine without a complex resharding operation. This operation often takes more time than simply recomputing the tokens from scratch.
The ability to dynamically reshard KV caches across different TP configurations is an active area of research, so expect this constraint to be resolved in the near future. Until then, practical deployments need to work around the limitation; we suggest keeping your deployments somewhat uniform. Pure KVA takes the constraint into account when creating hashes, so there are no failed injections due to Tensor Parallelism mismatches.
Today, achieving high reuse rates means your inference scheduler must be topology-aware. Requests for specific cached prefixes need to be routed to GPU clusters with matching TP configurations. This requires coordination between your orchestration layer (whether that’s vLLM, TensorRT-LLM, or a custom scheduler) and your cache storage backend to ensure that cache lookups account for both the content hash and the TP degree used to generate that cache.
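In practice, that means the lookup key carries the parallelism layout alongside the content hash, and the router only counts a hit when the TP degree matches. A hypothetical sketch (key format and pool structure are assumptions for illustration):

```python
def tp_cache_key(model_id, tp_degree, content_hash):
    """A TP2-sharded cache is laid out differently than a TP4 one, so the
    TP degree must be part of the cache entry's identity."""
    return f"{model_id}/tp{tp_degree}/{content_hash}"

def route(pools, content_hash, model_id):
    """Prefer a GPU pool that already holds the prefix under a matching
    TP layout; otherwise fall back to the first pool (full prefill)."""
    for pool in pools:
        if tp_cache_key(model_id, pool["tp"], content_hash) in pool["store"]:
            return pool
    return pools[0]
```

A request whose prefix was cached on a TP4 pool is routed back to a TP4 pool rather than forcing a recompute (or a costly reshard) on a TP2 pool.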
Ephemeral Data, Persistent Management
While the KV cache is “ephemeral” (recomputable), its management in a production environment mirrors traditional database engineering.
- KV Cache time to live (TTL): Cache lifetimes vary dramatically by use case. A shared system prompt might be valuable for weeks, while a user’s chat history expires after hours of inactivity. FlashBlade lifecycle policies give enterprises fine-grained control over KV cache footprint, automatically evicting stale data while preserving high-value shared contexts.
- Observability: Enterprise AI requires trust, and trust is established through audit trails. Because LLM inference is non-deterministic, LLM observability tools must track which cached context was injected into each inference, when it was generated, and by whom. This chain of custody allows you to trace hallucinated or biased outputs back to their source.
- Data services: Pure KVA on FlashBlade handles the storage backend for these caches. It provides compression to reduce the I/O footprint and supports high-concurrency access over NFS and S3, allowing heterogeneous clients (vLLM, TensorRT-LLM, custom inference engines) to read and write cached tensors without coordination overhead.
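The lifecycle side can be pictured as a small TTL index over cache entries. This is a toy sketch; in production these policies live in the storage layer (e.g., FlashBlade lifecycle rules), not in application code:

```python
import time

class CacheIndex:
    """Toy TTL index for cached prefixes, for illustration only."""

    def __init__(self):
        self.entries = {}  # key -> (expires_at, pinned)

    def put(self, key, ttl_s, pinned=False):
        # Pinned entries model high-value shared contexts (system prompts,
        # tool definitions) that should survive routine eviction.
        self.entries[key] = (time.time() + ttl_s, pinned)

    def evict_stale(self):
        now = time.time()
        stale = [k for k, (exp, pin) in self.entries.items()
                 if exp < now and not pin]
        for k in stale:
            del self.entries[k]
        return stale
```

A shared system prompt goes in pinned or with a weeks-long TTL, while a user’s chat history gets an hours-long TTL and ages out on its own.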
The Path Forward
The transition from stateless inference to persistent, context-aware AI systems requires a fundamental shift in how we view storage. Shareable KV caching layers like ICMS and Pure KVA are making the key-value store effectively infinite in size and globally available across the data center. The HBM bottleneck is gone. The KV cache is no longer constrained by the memory capacity of a single GPU but becomes a scalable, shared resource accessible to the entire cluster.
The result is simple: Your GPUs stop burning cycles on redundant prefill and start doing what you actually purchased them for—generating tokens. We invite you to take a look at Pure’s optimized KV Cache Accelerator. We use some really neat tricks to look up tokens from the KV cache with an efficiency and speed that naive implementations simply can’t match.
Architect Smarter AI Infrastructure
See how Pure Storage’s AI data platform helps you eliminate GPU bottlenecks, reuse context across workloads, and scale LLM inference efficiently.