Imagine you're writing a long document, and every time you type a new word, your word processor re-reads the entire document from page one before deciding what to suggest next. That sounds absurd. But that's almost exactly what an LLM does — or at least what it would do without a KV cache.
Language models generate text one token at a time. To decide what token comes next, they run a mechanism called self-attention, which lets every token look at every previous token and measure how relevant each one is. Without caching, every new token requires the model to reprocess the full growing sequence from scratch. For a 2,000-token response, that means running 2,000 full forward passes — each one more expensive than the last.
The Key-Value cache is the engineering solution to this problem. It's why LLMs can generate thousands of tokens quickly instead of grinding to a halt. It's also why longer context windows require dramatically more GPU memory, and why inference at scale is as much a memory problem as it is a compute problem.
KV caching transforms attention's quadratic compute scaling into linear scaling — at the cost of growing memory. Understanding this tradeoff is essential for anyone building or deploying LLMs in production. — Pierre Lienhart, LLM Inference Series (2025)
How Attention Works (The Part That Causes the Problem)
To understand KV caching, you first need to understand what it's caching and why. The attention mechanism — the core of every transformer LLM — works by computing three vectors for each token in the input sequence: a Query (Q), a Key (K), and a Value (V).
Think of it like a library lookup system. Each token sends out a Query — "what information do I need?" Each other token holds up a Key — "here's what I contain." The model computes a similarity score between the Query and all Keys to find which tokens are most relevant. It then uses those scores to take a weighted average of the Values — the actual content — to produce the output for that position.
This happens for every token, attending to every other token, across every layer of the model. For an LLM with 80 transformer layers, generating a response to a 1,000-token prompt means computing Q, K, and V vectors for each of those 1,000 tokens, across all 80 layers, then doing it again for token 1,001. And again for 1,002. And so on.
The redundancy is obvious: the K and V vectors for the prompt tokens don't change between generation steps. Token 1's key and value at layer 37 are identical whether you're generating token 500 or token 5,000. Without caching, you're recomputing them every single time.
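The lookup described above can be sketched in a few lines of NumPy. This is an illustrative single-query, single-head toy, not a production kernel, and every dimension here is made up for the example:

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for one query token.

    q: (d,)    the current token's question: "what do I need?"
    K: (n, d)  one key per previous token: "what do I contain?"
    V: (n, d)  one value per previous token: the actual content
    """
    scores = K @ q / np.sqrt(q.shape[0])   # how relevant is each previous token?
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: scores -> attention weights
    return weights @ V                     # weighted average of the values

rng = np.random.default_rng(0)
d, n = 8, 5                                # toy sizes, not real model dimensions
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = attention(q, K, V)
print(out.shape)                           # (8,)
```

The K and V matrices here are exactly what the cache will later store: they depend only on the tokens themselves, not on which step of generation you are in.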
The Redundancy Problem: Without KV Cache
Without a cache, every generation step recomputes the K and V vectors for the entire prefix, reproducing results identical to the previous step's. Compute cost scales as O(n²) with sequence length: generating a 1,000-token response performs roughly 500,000 redundant per-token computations.
The Solution: Store Once, Reuse Always
KV caching solves this with a simple insight: if the K and V vectors for a token won't change, compute them once and store them. On each generation step, the model only needs to compute Q, K, and V for the new token. For all previous tokens, it reads K and V directly from the cache.
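The insight is easy to verify with a toy single-head attention in NumPy (all weights and shapes here are illustrative): at each step only the new token's K and V are computed and appended, and the cached path produces exactly the same output as full recomputation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                    # toy model dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))     # fake embeddings for 6 tokens

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

# Without cache: recompute K and V for the whole prefix at every step.
def output_no_cache(t):
    prefix = tokens[: t + 1]
    return attend(tokens[t] @ Wq, prefix @ Wk, prefix @ Wv)

# With cache: compute K and V once per token, append, reuse forever.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(len(tokens)):
    K_cache = np.vstack([K_cache, (tokens[t] @ Wk)[None, :]])  # store once
    V_cache = np.vstack([V_cache, (tokens[t] @ Wv)[None, :]])
    cached = attend(tokens[t] @ Wq, K_cache, V_cache)          # reuse always
    assert np.allclose(cached, output_no_cache(t))             # identical outputs
print("cached decoding matches full recomputation")
```

The equivalence holds because K and V are pure functions of each token's embedding and fixed weights; nothing about a later step can change an earlier token's entries.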
With the cache, the K and V vectors of previous tokens are read back at zero compute cost; only the new token's K and V are freshly computed and appended. Attention compute across the generation phase drops from O(n²) to O(n): a 1,000-token response now requires ~1,000 per-token computations instead of ~500,000.
This is implemented concretely in frameworks like HuggingFace Transformers via the past_key_values argument. When use_cache=True (the default), the model's forward() method returns the K and V tensors for every layer alongside the output. On the next generation step, these are passed back in, and only the new token's Q, K, V are freshly computed. The new K and V are then appended to the cache tensors before they're returned again.
Two Phases: Prefill and Decode
KV caching creates two distinct phases during LLM inference. The prefill phase processes your entire prompt at once — this is parallelisable, like training, and is why processing long prompts is fast. K and V vectors for all prompt tokens are computed and stored in cache simultaneously.
The decode phase is where generation happens — strictly sequential, one token at a time. Each step reads the full cache (past_key_values), computes only the new token, appends its K and V, and passes the updated cache forward. This is inherently serial and cannot be parallelised across tokens.
This is why time-to-first-token (TTFT) scales with prompt length, while tokens-per-second (TPS) during generation stays relatively constant regardless of how much you've already generated.
The Cost: KV Cache Is a Memory Monster
KV caching trades compute for memory. And the memory appetite is significant. For each token in the sequence, the model must store K and V vectors for every attention head in every transformer layer. The formula is:
KV Cache Size = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

The leading factor of 2 accounts for storing both K and V. Under standard multi-head attention, num_kv_heads equals the number of query heads; the grouped-query variants covered below shrink it.
For a model with Llama 3 70B's shape under full multi-head attention (80 layers, 64 heads, head dimension 128, FP16 = 2 bytes), a single token requires 2 × 80 × 64 × 128 × 2 ≈ 2.6 MB. (Llama 3 70B actually ships with grouped-query attention and only 8 KV heads, which cuts this by 8×; the full-MHA figure is used here to illustrate the pressure.) A 4,096-token context window therefore needs roughly 10.7 GB of KV cache, just for one request. Serve 8 concurrent users and that's roughly 86 GB, more than the entire 80 GB HBM of a single H100.
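The formula is easy to turn into a back-of-the-envelope calculator. A sketch (parameter names are illustrative, and the 64-head figure assumes full multi-head attention):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 stores both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Llama-3-70B-like shapes under full multi-head attention (64 K/V heads, FP16).
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=1)
full_ctx = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
print(f"{per_token / 1e6:.1f} MB per token")       # 2.6 MB
print(f"{full_ctx / 1e9:.1f} GB at 4,096 tokens")  # 10.7 GB
```

Swapping in your own model's layer count, KV head count, and dtype width gives a quick first estimate of serving memory before you ever touch a GPU.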
These figures are for a single request. The KV cache grows linearly with context length: doubling the context doubles the cache. At 1M tokens, a full-attention KV cache of this shape would exceed the memory of an entire H100 node (8 × 80 GB = 640 GB), requiring specialised offloading strategies.
This memory pressure explains several phenomena developers encounter in practice. Why does serving longer conversations cost more? KV cache. Why can't you just set context length to 1M tokens for every model? KV cache. Why does the number of concurrent users you can serve drop sharply with longer inputs? KV cache. It is simultaneously the most important inference optimisation and the primary scaling bottleneck of modern LLMs.
KV Cache Variants: MHA, MQA, GQA
Reducing KV cache size has become an active area of architecture design. The standard attention mechanism — Multi-Head Attention (MHA) — maintains separate K and V caches for every attention head. Researchers have developed two major variants that dramatically reduce memory usage.
| Variant | K/V Heads | Cache Size | Quality | Used In |
|---|---|---|---|---|
| MHA — Multi-Head Attention | One per Q head | Full | Highest | GPT-2, GPT-3 |
| MQA — Multi-Query Attention | Single shared K/V | ~8× smaller | Reduced | Falcon, early Gemini |
| GQA — Grouped-Query Attention | Shared per group | 4–8× smaller | Near-MHA | Llama 2 70B, Llama 3, Mistral |
GQA (Ainslie et al., 2023) has become the industry standard. It groups Q heads to share K/V heads — reducing cache 4–8× with minimal quality loss. Llama 3, Mistral, and most frontier models released in 2024–25 use GQA.
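The memory impact of GQA falls straight out of the cache-size formula. A quick comparison with Llama-3-70B-like shapes (80 layers, head dimension 128, 64 query heads, 8 KV heads under GQA; an illustrative calculation, not vendor-published numbers):

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element / 1e9

mha = kv_cache_gb(80, num_kv_heads=64, head_dim=128, seq_len=8192)  # one K/V pair per query head
gqa = kv_cache_gb(80, num_kv_heads=8, head_dim=128, seq_len=8192)   # 8 query heads share each K/V head
print(f"MHA: {mha:.1f} GB  GQA: {gqa:.1f} GB  ({mha / gqa:.0f}x smaller)")
# MHA: 21.5 GB  GQA: 2.7 GB  (8x smaller)
```

Because only the KV head count changes, query-side expressiveness is largely preserved, which is why the quality loss is small relative to the 8× memory saving.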
2025 KV Cache Optimisations
Beyond architectural variants, a wave of engineering optimisations has made KV cache dramatically more efficient in production systems.
Prefix Caching: The Silent Cost Saver
Prefix caching deserves special attention because it has the most immediate practical impact for anyone building on LLM APIs. When you use a system prompt — "You are a helpful assistant..." — that same prompt is sent with every API call. Without prefix caching, the inference server recomputes the KV cache for that prompt on every single request.
With prefix caching enabled, the server computes K and V for the system prompt once, stores them, and reuses them for every subsequent request that shares that prefix. For a 2,000-token system prompt with 10,000 requests per day, this eliminates 20 million token computations worth of redundant work — directly reducing latency and cost.
Both the Anthropic and OpenAI APIs have supported prompt caching since 2024; Anthropic exposes explicit cache_control markers that let developers flag which parts of a prompt to cache, while OpenAI applies prefix caching automatically to long, repeated prompts. Cached portions are billed at a significantly lower per-token rate, making long system prompts dramatically more economical at scale.
Prefix caching also applies at the document level. If you are building a RAG system and repeatedly querying a large knowledge base document, keeping that document in a cached prefix slot means the model only processes it once. Subsequent queries that reference the same document hit the cache — reducing both latency and cost proportionally.
The limitation of prefix caching is that it only works for exact prefix matches. If any token in the shared prefix changes — even punctuation — the cache is invalidated and the prefill must run again from that point. This means careful prompt design matters: keep your static system prompt at the very beginning of every call, followed by dynamic content like user history. Never interleave static and dynamic content within the cached section.
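The exact-match rule can be made concrete with a toy matcher. This is a sketch; real inference servers typically match at the block or page level rather than token by token, and the token strings here are invented:

```python
def shared_prefix_len(cached_tokens, request_tokens):
    """Number of leading tokens servable from the prefix cache.

    Matching is exact and positional: the first differing token
    invalidates everything after it.
    """
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

cached = ["You", "are", "a", "helpful", "assistant", "."]
hit = ["You", "are", "a", "helpful", "assistant", ".", "User", ":"]
miss = ["You", "are", "a", "helpful", "Assistant", ".", "User", ":"]
print(shared_prefix_len(cached, hit))   # 6: whole system prompt served from cache
print(shared_prefix_len(cached, miss))  # 4: one changed token invalidates the rest
```

This is why "static content first, dynamic content last" is the golden rule of cache-friendly prompt design.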
So Why Is Inference Still Slow Sometimes?
KV caching solves the compute redundancy problem beautifully. What it can't fully solve is memory bandwidth. Modern GPUs have become so fast at arithmetic (an H100 delivers up to 3,958 TFLOPS in FP8, a figure that assumes structured sparsity) that the bottleneck has shifted from computation to simply moving data between GPU memory and the compute cores fast enough.
During the decode phase, generating each token requires loading the full KV cache — which grows with every step — from HBM into the GPU's L2 cache and registers. For a 10 GB KV cache, this means moving 10 GB of data for every single token. At the H100's 3.35 TB/s memory bandwidth, that's about 3 milliseconds per token, purely from memory reads — before any actual computation happens.
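The bandwidth arithmetic is worth internalising. A handy identity is that 1 TB/s equals 1 GB/ms, so the per-token lower bound falls out directly (a sketch using the H100 figures from the text; real kernels add compute and overlap, so this is a floor, not a prediction):

```python
def kv_read_time_ms(kv_cache_gb, bandwidth_tb_per_s):
    # 1 TB/s is exactly 1 GB/ms, so GB divided by TB/s gives milliseconds
    return kv_cache_gb / bandwidth_tb_per_s

t = kv_read_time_ms(kv_cache_gb=10, bandwidth_tb_per_s=3.35)   # H100 HBM3
print(f"{t:.1f} ms per token just to stream the cache")        # 3.0 ms
print(f"decode ceiling ~= {1000 / t:.0f} tokens/s from bandwidth alone")
```

No amount of extra TFLOPS raises that ceiling; only a smaller cache or faster memory does.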
This is why techniques like speculative decoding (using a smaller "draft" model to predict multiple tokens, then verifying them in parallel with the main model) and continuous batching (interleaving multiple requests in a single GPU pass) have become so critical in production. They're not about compute; they're about making better use of memory bandwidth and reducing idle time between generations.
KV Cache in Production: What the Numbers Look Like
Let's ground this in real deployment numbers. A team running Llama 3 70B on 2 H100 GPUs (160 GB combined HBM) for a customer support application with an average conversation length of 8,000 tokens faces the following reality:
KV cache per session ≈ 21 GB (8,000 tokens × ~2.6 MB per token, again using the full multi-head-attention figure; Llama 3's grouped-query attention would shrink this roughly 8×, but the budgeting logic is identical). With 160 GB total, minus ~140 GB for FP16 model weights, the available KV cache budget is roughly 20 GB. That means the system can serve approximately one concurrent long-context session at full precision, or handle more sessions using KV quantisation (dropping to INT8 halves the cache to ~10.5 GB, enabling roughly two concurrent sessions) or prefix caching (if conversations share a system prompt, those tokens are cached once across all sessions).
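The budgeting arithmetic in a few lines (numbers as in the text; a rough sketch that ignores activation memory and framework overhead):

```python
hbm_gb = 2 * 80        # two H100s
weights_gb = 140       # Llama 3 70B at FP16: 70B params x 2 bytes
kv_per_session = 21.0  # ~8,000-token conversation, full-MHA figure from the text

budget = hbm_gb - weights_gb
print(f"cache budget: {budget} GB")
print(f"FP16 cache:   {budget / kv_per_session:.2f} sessions")        # 0.95, roughly one
print(f"INT8 cache:   {budget / (kv_per_session / 2):.2f} sessions")  # 1.90, roughly two
```

The fractional session counts are the point: capacity planning for inference is division of a memory budget, and every cache optimisation acts on the denominator.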
This arithmetic is why inference infrastructure teams spend enormous effort on KV cache management — it is the primary lever controlling how many users you can serve per dollar of GPU spend.
Conclusion
The KV cache is one of the most consequential engineering decisions in the modern AI stack — simultaneously the reason LLMs can generate text quickly and the primary reason serving them at scale is expensive. Every frontier model architecture in 2025 — Llama 3, Mistral, GPT-5, Gemini 3, Claude Opus 4.6 — has been shaped by the memory constraints the KV cache imposes.
Understanding it changes how you think about LLM costs: longer context isn't just slower, it's fundamentally more memory-intensive in a way that doesn't get better with more compute. It explains why Grouped-Query Attention displaced Multi-Head Attention across the industry. It explains why prefix caching is one of the highest-ROI API features ever shipped. And it frames the genuine innovation happening in 2025 — MorphKV, Flash Attention 3, PagedAttention, KV quantisation — as what it really is: an ongoing engineering campaign to serve ever-longer contexts to ever-more users on the same constrained GPU hardware.
Practical Takeaways for Developers
If you're building on LLMs, here's what the KV cache means for you concretely. When choosing a model for a long-context application, memory bandwidth — not raw compute TFLOPS — is your primary throughput constraint during generation. An H100 with 3.35 TB/s bandwidth will serve long-context requests faster than a theoretically more powerful chip with lower bandwidth, because the decode phase is bandwidth-bound.
When architecting your system prompt, keep it consistent across requests. Both the Anthropic API and OpenAI API support prefix caching — a fixed system prompt computed once and reused across thousands of calls is dramatically cheaper than a variable one regenerated every time. This single change can reduce costs by 30–50% for chatbot applications with long system prompts.
When debugging slow inference, distinguish between time-to-first-token (TTFT) and tokens-per-second (TPS). If TTFT is high but TPS is fine, the bottleneck is prefill — your input prompt is too long or your GPU is under-provisioned for the batch size. If TPS is low, the bottleneck is memory bandwidth during decode — likely because your KV cache is too large relative to available HBM. These are different problems with different solutions.
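That triage rule can be written down as a toy helper. The thresholds here are invented for illustration, not canonical budgets; tune them to your own latency targets:

```python
def diagnose(ttft_ms, tps, ttft_budget_ms=500, tps_budget=30):
    """Toy triage rule for slow inference; thresholds are illustrative."""
    if ttft_ms > ttft_budget_ms and tps >= tps_budget:
        return "prefill-bound: shorten the prompt or add prefill compute"
    if tps < tps_budget:
        return "decode-bound: shrink the KV cache (quantise, GQA) or add bandwidth"
    return "within budget"

print(diagnose(ttft_ms=2000, tps=45))  # slow first token, healthy streaming -> prefill
print(diagnose(ttft_ms=300, tps=12))   # fast first token, slow streaming -> decode
```

Separating the two metrics before optimising prevents the common mistake of throwing compute at a bandwidth problem, or vice versa.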
When scaling to multiple users, the KV cache is what limits your concurrency. GQA, KV quantisation, and PagedAttention are not optional optimisations for a production system — they are the difference between serving 1 user and serving 20 on the same hardware. Implement them before you scale, not after.
If you're building on LLMs, KV cache is not a background implementation detail. It is the central resource you are managing — and the engineers who understand it deeply will build systems that are faster, cheaper, and more scalable than those who treat inference as a black box.