Summary
Techniques like RAG, CAG, cRAG, and kvRAG can significantly improve LLM inference accuracy and reduce time to first token. By aligning the approach to your workload, you can optimize both LLM performance and latency.
Welcome back to our technical how-to series on AI. Today, we’ll explain and test several approaches to enhancing inference response accuracy and time:
- Retrieval-augmented generation (RAG)
- Cache-augmented generation (CAG)
- Cached retrieval-augmented generation (cRAG)
- KV prefill-assisted RAG (kvRAG)
When it comes to LLMs, context is everything. Without the right context, a model may hallucinate: lacking the knowledge it needs, it approximates and invents what it considers the statistically most likely response, producing factually incorrect answers. These poor responses usually stem from a combination of the model's limited learned knowledge and a lack of relevant context at query time.
A model is trained on a given corpus of data, where it has learned knowledge that is valid up to a certain cutoff. Anything that has occurred since the curation of that data set is effectively unknown. Therefore, to improve response accuracy, we need to provide our model with access to current and valid context at query time. Think of this as providing the model with access to the latest company process documentation to help answer internal employee queries, or feeding it the current version of a product guide to run an up-to-date support chatbot. Now this can be done by fine-tuning the model on updated or domain-specific data sets, but performing repetitive fine-tuning to maintain updated knowledge is a heavy lift when compared to RAG, CAG, or cRAG workflows.
Retrieval-augmented generation (RAG)
How it works
Documents are split into chunks, and each chunk is converted into a vector embedding, a dense numerical representation of the text’s semantic content, using an embedding model. These vectors are stored in a vector database (in our tests, Qdrant). Chunks are commonly defined by a size in tokens and an overlap. This is the ingestion step.
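Here's a minimal ingestion sketch, assuming a local Qdrant instance, the qdrant-client package, and the sentence-transformers library. The embedding model, collection name, document path, and whitespace-based chunker are illustrative stand-ins, not our exact test configuration.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks (whitespace 'tokens' for simplicity)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = chunk_text(open("document.txt").read())
vectors = embedder.encode(chunks)                     # one dense vector per chunk

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)
```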
At query time, the user’s question is passed through the same embedding model. The vector database performs an approximate nearest-neighbor search to find the top-k chunks whose embeddings are closest to the query embedding. This is the semantic search step.
The retrieved chunks are inserted into the prompt alongside the question, and the model generates an answer grounded in that context.
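Continuing the sketch above, the query path embeds the question, pulls the top-k chunks from Qdrant, and assembles a grounded prompt. The question text and prompt template are placeholders.

```python
# Query-time sketch: semantic search, then prompt assembly. Reuses the
# `embedder` and `client` objects from the ingestion sketch above.
question = "What does the product guide say about warranty coverage?"
query_vec = embedder.encode(question).tolist()

hits = client.search(collection_name="docs", query_vector=query_vec, limit=5)  # top-k = 5
context = "\n\n".join(hit.payload["text"] for hit in hits)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
# `prompt` is then sent to the LLM (any chat/completions endpoint will do).
```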
When to use it
RAG is appropriate when:
- Your knowledge base is large and dynamic, and frequent updates would invalidate cache quickly.
- Retrieval is sparse, and queries typically rely on a small subset of available information.
- You want lower per-query latency by limiting the available context length.
- Query distribution is unpredictable (no “hot” documents).
The trade-off is precision: If the retrieval step returns the wrong chunks, the model either generates a poor answer or hallucinates. Chunk size and top-k are the key tunable parameters.
Cache-augmented generation (CAG)
How it works
CAG takes the next simplest approach after stuffing the raw full text into every prompt: load a pre-processed key-value (KV) cache version of the entire source document into the model’s context window ahead of every query. The model has access to all available information and generates a response grounded in the latest cached version of the full text.
At query time, the preloaded cache is passed into the model, skipping the expensive prefill step.
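For illustration, here's a minimal sketch of that pattern using Hugging Face Transformers' DynamicCache. The document path and question are placeholders, and a production setup would persist and manage the cache rather than hold it in a Python variable.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-7B-Instruct"               # the model used in our tests
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One-time step: prefill the entire document and keep the resulting KV cache.
document = open("document.txt").read()
doc_inputs = tok(document, return_tensors="pt").to(model.device)
with torch.no_grad():
    doc_cache = model(**doc_inputs, past_key_values=DynamicCache()).past_key_values

# Per query: pass a copy of the cache so only the question tokens get prefilled.
question = "\n\nQuestion: What is the warranty period?\nAnswer:"
full_inputs = tok(document + question, return_tensors="pt").to(model.device)
out = model.generate(
    **full_inputs,
    past_key_values=copy.deepcopy(doc_cache),       # generation mutates the cache
    max_new_tokens=200,
)
print(tok.decode(out[0, full_inputs.input_ids.shape[1]:], skip_special_tokens=True))
```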
When to use it
CAG is a good fit when:
- You have static data sets.
- The data set(s) fit in the model’s context window (small to medium corpus).
- Latency is a top priority (there is no retrieval step and no query-time encoding step).
- You need the model to reason across the entire document rather than isolated fragments.
- You can afford the memory cost of caching the full KV state upfront.
CAG is more efficient for repeated tasks on the same data set and simpler than RAG (no complex stack to set up).
The main limitation is scale: Context windows have token limits, and caching the prefill of a large document can consume GPU memory that could otherwise be put to use. A second limitation is that the model can get “lost in the middle” of a long context, in contrast to the finer-grained approach of RAG chunks.
Cached retrieval-augmented generation (cRAG)
How it works
Cached RAG combines a RAG stack with an added cache layer. Caching can be applied to embeddings, retrieval, and generation. For simplicity, let’s focus on caching for generation, where we cache the final answer, keyed by the meaning of the question. Initial queries and their responses are saved in this cache layer, using the same embedding model to encode the user’s question. Subsequent queries are compared against previously cached queries to find semantically similar entries. If any cached entry exceeds a given similarity threshold, the stored answer is returned immediately: no context retrieval, no LLM call. On a miss, we fall through to standard RAG, and the query and its result are saved in the cache layer.
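As a rough sketch of that generation-level cache: it reuses the `embedder` from the RAG sketch, `run_standard_rag` stands in for the query path shown earlier, and the similarity threshold is illustrative.

```python
import numpy as np

SIM_THRESHOLD = 0.92                    # illustrative; tune against real traffic
cached_vecs: list[np.ndarray] = []      # normalized embeddings of answered questions
cached_answers: list[str] = []          # answers stored alongside them

def answer_with_semantic_cache(question: str) -> str:
    q = embedder.encode(question)
    q = q / np.linalg.norm(q)
    if cached_vecs:
        sims = np.stack(cached_vecs) @ q            # cosine similarity
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            return cached_answers[best]             # hit: no retrieval, no LLM call
    answer = run_standard_rag(question)             # miss: fall through to RAG
    cached_vecs.append(q)
    cached_answers.append(answer)
    return answer
```

In practice, the cached query vectors would live in a second Qdrant collection rather than an in-memory list, typically with a TTL so stale answers age out.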
When to use it
cRAG is a good fit when:
- The users ask the same questions repeatedly with slightly different phrasing, as in customer support/FAQ systems.
- Queries cluster around common topics, as in internal knowledge bases.
- Sub-10ms response times matter for common queries, and slightly stale answers are acceptable.
KV prefill-assisted RAG (kvRAG)
How it works
KV prefill-assisted RAG combines the retrieval granularity of RAG with KV-cache injection. Instead of caching final answers (like cRAG) or keeping entire documents GPU-resident (like CAG), it saves each chunk’s transformer key-value tensors to disk at ingest time and reloads them at query time. This skips the most expensive part of inference: the prefill forward pass over the context tokens.
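A rough ingest-time sketch of that idea, reusing the `chunks` list from the RAG sketch and the Qwen model, tokenizer, and DynamicCache from the CAG sketch; the on-disk layout is illustrative.

```python
import os
import torch

os.makedirs("kv_cache", exist_ok=True)

# Prefill each chunk once and persist its attention (KV) state to local storage.
for i, chunk in enumerate(chunks):
    ids = tok(chunk, return_tensors="pt").to(model.device)
    with torch.no_grad():
        cache = model(**ids, past_key_values=DynamicCache()).past_key_values
    # to_legacy_cache() yields plain per-layer key/value tensors that are easy to serialize.
    torch.save(cache.to_legacy_cache(), f"kv_cache/chunk_{i:05d}.pt")
```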
At query time, the top-k matching chunks are retrieved, and the first chunk’s saved attention state is loaded directly into the model. With top-k=1, the model only has to process the query tokens, reducing total time dramatically, but most RAG systems work with higher top-k values. At top-k values >1, after the first chunk’s KV cache is injected, the remaining chunks are passed as text alongside the query; simply injecting saved KV caches for every chunk would corrupt the answer, because token positions and cross-attention between chunks would no longer line up. Note: New research into KV cache chunk-blending techniques exists but is out of scope for this blog.
Because of this limitation, kvRAG is best treated as a variant of CAG: instead of preloading the cache ahead of every query, we load it on demand when a query matches a stored large chunk or full document.
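A corresponding query-time sketch, again reusing objects from the earlier sketches. One caveat: for the injected cache to be valid, the stored chunk must tokenize to an exact prefix of the new prompt, which a real implementation has to guarantee.

```python
import copy
import torch

question = "What is the warranty period?"
hits = client.search(
    collection_name="docs",
    query_vector=embedder.encode(question).tolist(),
    limit=5,
)

# Inject the top-1 chunk's saved KV cache; pass the remaining chunks as plain text.
top = hits[0]
chunk_cache = DynamicCache.from_legacy_cache(torch.load(f"kv_cache/chunk_{top.id:05d}.pt"))

tail = "\n\n".join(h.payload["text"] for h in hits[1:])
prompt = top.payload["text"] + "\n\n" + tail + f"\n\nQuestion: {question}\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    past_key_values=copy.deepcopy(chunk_cache),   # only tokens after the cached chunk are prefilled
    max_new_tokens=200,
)
print(tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```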
When to use it
KV prefill-assisted RAG occupies a specific niche in the RAG acceleration landscape. It’s a good fit when:
- One chunk provides sufficient context and is large relative to the query.
- The corpus is too large for CAG.
- You have high query volume where repeatedly re-encoding the same retrieved chunks becomes a measurable cost, even more so in edge deployments on constrained hardware.
- Latency matters, but full CAG isn’t feasible.
- Documents are static or update infrequently enough that recomputing their caches is not a burden.
Test Results
All tests were performed on a single DGX Spark, with Docker used to run a Qdrant instance and the KV cache saved to the local storage device. All queries ran against a Qwen2.5-7B-Instruct model, with two long-form English works serving as source documents.
We tested with chunk size set to 512 tokens and semantic search top-k set to 5, as these are commonly used values. We ran 10 queries, spread across the two documents, against each configuration and collected the time to first token (TTFT). Running everything on one unit avoided GPU processing variance between runs.
The chart below shows the average time to first token for each approach:

Figure 1: Average time to first token for RAG, CAG, cRAG, and kvRAG in our testing.
As a baseline, passing the full text took ~13 seconds: we read the document from the local drive and passed it, along with the query, to the model.
Conclusion
The right approach depends on your workload:
- Use RAG when your corpus is large and diverse, you need to search across many documents, and chunk sizes are kept small. It’s a well-understood approach with low overhead.
- Use CAG when your knowledge is a single document or a small, stable corpus that fits in the context window. The upfront cost of generating the KV cache is amortized over many queries, and you avoid the risk of retrieval errors.
- Use cRAG when you have high query volume against a stable chunked corpus, like in a customer support system, and questions have small variations, allowing for high semantic hit rates.
- kvRAG is a niche technique, best selected for specific use cases where a hybrid of CAG and RAG is the right approach.
Most often, the bottleneck in inference response times comes from prefilling the context or loading saved KV cache. Optimizing around that is what separates a 96ms response from a 15,000ms one, so making the right storage choice is critical. Check out Everpure offerings to remove the load time bottleneck from your AI inference workload.
Learn more about the Everpure approach to AI inference at scale, and view our industry-leading benchmark results.
Not covered in this blog are the complexities of preparing and presenting data sets in these “enhanced context pipelines.” For more information on an enterprise-grade offering in this space, check out Everpure Data Stream.
FAQ
What is KV cache?
KV (key-value) cache stores intermediate attention states from previous tokens during LLM inference, allowing the model to reuse prior computations instead of recalculating them. This significantly reduces latency and improves response speed, especially for long prompts or repeated queries.
What are embeddings?
Embeddings are numerical vector representations of data (such as text, images, or audio) that capture semantic meaning. They enable AI systems to compare, search, and retrieve information based on similarity rather than exact matches.
What is a vector database?
A vector database is a specialized system designed to store and query embeddings efficiently. It enables fast similarity search (e.g., nearest neighbor search), making it a core component of applications like retrieval-augmented generation (RAG), recommendation systems, and semantic search.
What is GPU Direct Storage (GDS)?
GPU Direct Storage (GDS) is a technology that enables data to be transferred directly between storage and GPU memory, bypassing the CPU and avoiding an extra copy through system memory. This reduces I/O latency and CPU overhead, helping inference workloads load data and respond faster.
Design Your AI Inference Factory for Real-time Answers
You’ve seen how RAG, CAG, cRAG, and kvRAG can slash time to first token. Now take the next step and design an AI factory that turns those gains into consistent, production-ready inference at scale with Everpure.






