Imagine you're writing a long document, and every time you type a new word, your word processor re-reads the entire document from page one before deciding what to suggest next. That sounds absurd. But that's almost exactly what an LLM does — or at least what it would do without a KV cache.
Language models generate text one token at a time. To decide what token comes next, they run a mechanism called self-attention, which lets every token look at every previous token and measure how relevant each one is. Without caching, every new token requires the model to reprocess the full growing sequence from scratch. For a 2,000-token response, that means running 2,000 full forward passes — each one more expensive than the last.
The Key-Value cache is the engineering solution to this problem. It's why LLMs can generate thousands of tokens quickly instead of grinding to a halt. It's also why longer context windows require dramatically more GPU memory, and why inference at scale is as much a memory problem as it is a compute problem.
KV caching transforms attention's quadratic compute scaling into linear scaling — at the cost of growing memory. Understanding this tradeoff is essential for anyone building or deploying LLMs in production. — Pierre Lienhart, LLM Inference Series (2025)
How Attention Works (The Part That Causes the Problem)
To understand KV caching, you first need to understand what it's caching and why. The attention mechanism — the core of every transformer LLM — works by computing three vectors for each token in the input sequence: a Query (Q), a Key (K), and a Value (V).
Think of it like a library lookup system. Each token sends out a Query — "what information do I need?" Each other token holds up a Key — "here's what I contain." The model computes a similarity score between the Query and all Keys to find which tokens are most relevant. It then uses those scores to take a weighted average of the Values — the actual content — to produce the output for that position.
This happens for every token, attending to every other token, across every layer of the model. For an LLM with 80 transformer layers, generating a response to a 1,000-token prompt means computing Q, K, and V vectors for each of those 1,000 tokens, across all 80 layers, then doing it again for token 1,001. And again for 1,002. And so on.
The redundancy is obvious: the K and V vectors for the prompt tokens don't change between generation steps. Token 1's key and value at layer 37 are identical whether you're generating token 500 or token 5,000. Without caching, you're recomputing them every single time.
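The lookup described above can be sketched in a few lines of NumPy. This is an illustrative single-query, single-head toy, not a production kernel, and every dimension here is made up for the example:

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for one query token.

    q: (d,)    the current token's question: "what do I need?"
    K: (n, d)  one key per previous token: "what do I contain?"
    V: (n, d)  one value per previous token: the actual content
    """
    scores = K @ q / np.sqrt(q.shape[0])   # how relevant is each previous token?
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: scores -> attention weights
    return weights @ V                     # weighted average of the values

rng = np.random.default_rng(0)
d, n = 8, 5                                # toy sizes, not real model dimensions
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = attention(q, K, V)
print(out.shape)                           # (8,)
```

The K and V matrices here are exactly what the cache will later store: they depend only on the tokens themselves, not on which step of generation you are in.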
The Redundancy Problem: Without KV Cache
Without a cache, every generation step recomputes the K and V vectors for the entire prefix, reproducing results identical to the previous step's. Compute cost scales as O(n²) with sequence length: generating a 1,000-token response performs roughly 500,000 redundant per-token computations.
The Solution: Store Once, Reuse Always
KV caching solves this with a simple insight: if the K and V vectors for a token won't change, compute them once and store them. On each generation step, the model only needs to compute Q, K, and V for the new token. For all previous tokens, it reads K and V directly from the cache.
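The insight is easy to verify with a toy single-head attention in NumPy (all weights and shapes here are illustrative): at each step only the new token's K and V are computed and appended, and the cached path produces exactly the same output as full recomputation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                    # toy model dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))     # fake embeddings for 6 tokens

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

# Without cache: recompute K and V for the whole prefix at every step.
def output_no_cache(t):
    prefix = tokens[: t + 1]
    return attend(tokens[t] @ Wq, prefix @ Wk, prefix @ Wv)

# With cache: compute K and V once per token, append, reuse forever.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(len(tokens)):
    K_cache = np.vstack([K_cache, (tokens[t] @ Wk)[None, :]])  # store once
    V_cache = np.vstack([V_cache, (tokens[t] @ Wv)[None, :]])
    cached = attend(tokens[t] @ Wq, K_cache, V_cache)          # reuse always
    assert np.allclose(cached, output_no_cache(t))             # identical outputs
print("cached decoding matches full recomputation")
```

The equivalence holds because K and V are pure functions of each token's embedding and fixed weights; nothing about a later step can change an earlier token's entries.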
With the cache, the K and V vectors of previous tokens are read back at zero compute cost; only the new token's K and V are freshly computed and appended. Attention compute across the generation phase drops from O(n²) to O(n): a 1,000-token response now requires ~1,000 per-token computations instead of ~500,000.
This is implemented concretely in frameworks like HuggingFace Transformers via the past_key_values argument. When use_cache=True (the default), the model's forward() method returns the K and V tensors for every layer alongside the output. On the next generation step, these are passed back in, and only the new token's Q, K, V are freshly computed. The new K and V are then appended to the cache tensors before they're returned again.
Two Phases: Prefill and Decode
KV caching creates two distinct phases during LLM inference. The prefill phase processes your entire prompt at once — this is parallelisable, like training, and is why processing long prompts is fast. K and V vectors for all prompt tokens are computed and stored in cache simultaneously.
The decode phase is where generation happens — strictly sequential, one token at a time. Each step reads the full cache (past_key_values), computes only the new token, appends its K and V, and passes the updated cache forward. This is inherently serial and cannot be parallelised across tokens.
This is why time-to-first-token (TTFT) scales with prompt length, while tokens-per-second (TPS) during generation stays relatively constant regardless of how much you've already generated.
The Cost: KV Cache Is a Memory Monster
KV caching trades compute for memory. And the memory appetite is significant. For each token in the sequence, the model must store K and V vectors for every attention head in every transformer layer. The formula is:
KV Cache Size = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

The leading factor of 2 accounts for storing both K and V. Under standard multi-head attention, num_kv_heads equals the number of query heads; the grouped-query variants covered below shrink it.
For a model with Llama 3 70B's shape under full multi-head attention (80 layers, 64 heads, head dimension 128, FP16 = 2 bytes), a single token requires 2 × 80 × 64 × 128 × 2 ≈ 2.6 MB. (Llama 3 70B actually ships with grouped-query attention and only 8 KV heads, which cuts this by 8×; the full-MHA figure is used here to illustrate the pressure.) A 4,096-token context window therefore needs roughly 10.7 GB of KV cache, just for one request. Serve 8 concurrent users and that's roughly 86 GB, more than the entire 80 GB HBM of a single H100.
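The formula is easy to turn into a back-of-the-envelope calculator. A sketch (parameter names are illustrative, and the 64-head figure assumes full multi-head attention):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 stores both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Llama-3-70B-like shapes under full multi-head attention (64 K/V heads, FP16).
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=1)
full_ctx = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
print(f"{per_token / 1e6:.1f} MB per token")       # 2.6 MB
print(f"{full_ctx / 1e9:.1f} GB at 4,096 tokens")  # 10.7 GB
```

Swapping in your own model's layer count, KV head count, and dtype width gives a quick first estimate of serving memory before you ever touch a GPU.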
These figures are for a single request. The KV cache grows linearly with context length: doubling the context doubles the cache. At 1M tokens, a full-attention KV cache of this shape would exceed the memory of an entire H100 node (8 × 80 GB = 640 GB), requiring specialised offloading strategies.
This memory pressure explains several phenomena developers encounter in practice. Why does serving longer conversations cost more? KV cache. Why can't you just set context length to 1M tokens for every model? KV cache. Why does the number of concurrent users you can serve drop sharply with longer inputs? KV cache. It is simultaneously the most important inference optimisation and the primary scaling bottleneck of modern LLMs.
KV Cache Variants: MHA, MQA, GQA
Reducing KV cache size has become an active area of architecture design. The standard attention mechanism — Multi-Head Attention (MHA) — maintains separate K and V caches for every attention head. Researchers have developed two major variants that dramatically reduce memory usage.
| Variant | K/V Heads | Cache Size | Quality | Used In |
|---|---|---|---|---|
| MHA — Multi-Head Attention | One per Q head | Full | Highest | GPT-2, GPT-3 |
| MQA — Multi-Query Attention | Single shared K/V | ~8× smaller | Reduced | Falcon, early Gemini |
| GQA — Grouped-Query Attention | Shared per group | 4–8× smaller | Near-MHA | Llama 2 70B, Llama 3, Mistral |
GQA (Ainslie et al., 2023) has become the industry standard. It groups Q heads to share K/V heads — reducing cache 4–8× with minimal quality loss. Llama 3, Mistral, and most frontier models released in 2024–25 use GQA.
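The memory impact of GQA falls straight out of the cache-size formula. A quick comparison with Llama-3-70B-like shapes (80 layers, head dimension 128, 64 query heads, 8 KV heads under GQA; an illustrative calculation, not vendor-published numbers):

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element / 1e9

mha = kv_cache_gb(80, num_kv_heads=64, head_dim=128, seq_len=8192)  # one K/V pair per query head
gqa = kv_cache_gb(80, num_kv_heads=8, head_dim=128, seq_len=8192)   # 8 query heads share each K/V head
print(f"MHA: {mha:.1f} GB  GQA: {gqa:.1f} GB  ({mha / gqa:.0f}x smaller)")
# MHA: 21.5 GB  GQA: 2.7 GB  (8x smaller)
```

Because only the KV head count changes, query-side expressiveness is largely preserved, which is why the quality loss is small relative to the 8× memory saving.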
2025 KV Cache Optimisations
Beyond architectural variants, a wave of engineering optimisations has made KV cache dramatically more efficient in production systems.
Prefix Caching: The Silent Cost Saver
Prefix caching deserves special attention because it has the most immediate practical impact for anyone building on LLM APIs. When you use a system prompt — "You are a helpful assistant..." — that same prompt is sent with every API call. Without prefix caching, the inference server recomputes the KV cache for that prompt on every single request.
With prefix caching enabled, the server computes K and V for the system prompt once, stores them, and reuses them for every subsequent request that shares that prefix. For a 2,000-token system prompt with 10,000 requests per day, this eliminates 20 million token computations worth of redundant work — directly reducing latency and cost.
Both the Anthropic and OpenAI APIs have supported prompt caching since 2024; Anthropic exposes explicit cache_control markers that let developers flag which parts of a prompt to cache, while OpenAI applies prefix caching automatically to long, repeated prompts. Cached portions are billed at a significantly lower per-token rate, making long system prompts dramatically more economical at scale.
Prefix caching also applies at the document level. If you are building a RAG system and repeatedly querying a large knowledge base document, keeping that document in a cached prefix slot means the model only processes it once. Subsequent queries that reference the same document hit the cache — reducing both latency and cost proportionally.
The limitation of prefix caching is that it only works for exact prefix matches. If any token in the shared prefix changes — even punctuation — the cache is invalidated and the prefill must run again from that point. This means careful prompt design matters: keep your static system prompt at the very beginning of every call, followed by dynamic content like user history. Never interleave static and dynamic content within the cached section.
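The exact-match rule can be made concrete with a toy matcher. This is a sketch; real inference servers typically match at the block or page level rather than token by token, and the token strings here are invented:

```python
def shared_prefix_len(cached_tokens, request_tokens):
    """Number of leading tokens servable from the prefix cache.

    Matching is exact and positional: the first differing token
    invalidates everything after it.
    """
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

cached = ["You", "are", "a", "helpful", "assistant", "."]
hit = ["You", "are", "a", "helpful", "assistant", ".", "User", ":"]
miss = ["You", "are", "a", "helpful", "Assistant", ".", "User", ":"]
print(shared_prefix_len(cached, hit))   # 6: whole system prompt served from cache
print(shared_prefix_len(cached, miss))  # 4: one changed token invalidates the rest
```

This is why "static content first, dynamic content last" is the golden rule of cache-friendly prompt design.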
So Why Is Inference Still Slow Sometimes?
KV caching solves the compute redundancy problem beautifully. What it can't fully solve is memory bandwidth. Modern GPUs have become so fast at arithmetic (an H100 delivers up to 3,958 TFLOPS in FP8, a figure that assumes structured sparsity) that the bottleneck has shifted from computation to simply moving data between GPU memory and the compute cores fast enough.
During the decode phase, generating each token requires loading the full KV cache — which grows with every step — from HBM into the GPU's L2 cache and registers. For a 10 GB KV cache, this means moving 10 GB of data for every single token. At the H100's 3.35 TB/s memory bandwidth, that's about 3 milliseconds per token, purely from memory reads — before any actual computation happens.
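The bandwidth arithmetic is worth internalising. A handy identity is that 1 TB/s equals 1 GB/ms, so the per-token lower bound falls out directly (a sketch using the H100 figures from the text; real kernels add compute and overlap, so this is a floor, not a prediction):

```python
def kv_read_time_ms(kv_cache_gb, bandwidth_tb_per_s):
    # 1 TB/s is exactly 1 GB/ms, so GB divided by TB/s gives milliseconds
    return kv_cache_gb / bandwidth_tb_per_s

t = kv_read_time_ms(kv_cache_gb=10, bandwidth_tb_per_s=3.35)   # H100 HBM3
print(f"{t:.1f} ms per token just to stream the cache")        # 3.0 ms
print(f"decode ceiling ~= {1000 / t:.0f} tokens/s from bandwidth alone")
```

No amount of extra TFLOPS raises that ceiling; only a smaller cache or faster memory does.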
This is why techniques like speculative decoding (using a smaller "draft" model to predict multiple tokens, then verifying them in parallel with the main model) and continuous batching (interleaving multiple requests in a single GPU pass) have become so critical in production. They're not about compute; they're about making better use of memory bandwidth and reducing idle time between generations.
KV Cache in Production: What the Numbers Look Like
Let's ground this in real deployment numbers. A team running Llama 3 70B on 2 H100 GPUs (160 GB combined HBM) for a customer support application with an average conversation length of 8,000 tokens faces the following reality:
KV cache per session ≈ 21 GB (8,000 tokens × ~2.6 MB per token, again using the full multi-head-attention figure; Llama 3's grouped-query attention would shrink this roughly 8×, but the budgeting logic is identical). With 160 GB total, minus ~140 GB for FP16 model weights, the available KV cache budget is roughly 20 GB. That means the system can serve approximately one concurrent long-context session at full precision, or handle more sessions using KV quantisation (dropping to INT8 halves the cache to ~10.5 GB, enabling roughly two concurrent sessions) or prefix caching (if conversations share a system prompt, those tokens are cached once across all sessions).
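The budgeting arithmetic in a few lines (numbers as in the text; a rough sketch that ignores activation memory and framework overhead):

```python
hbm_gb = 2 * 80        # two H100s
weights_gb = 140       # Llama 3 70B at FP16: 70B params x 2 bytes
kv_per_session = 21.0  # ~8,000-token conversation, full-MHA figure from the text

budget = hbm_gb - weights_gb
print(f"cache budget: {budget} GB")
print(f"FP16 cache:   {budget / kv_per_session:.2f} sessions")        # 0.95, roughly one
print(f"INT8 cache:   {budget / (kv_per_session / 2):.2f} sessions")  # 1.90, roughly two
```

The fractional session counts are the point: capacity planning for inference is division of a memory budget, and every cache optimisation acts on the denominator.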
This arithmetic is why inference infrastructure teams spend enormous effort on KV cache management — it is the primary lever controlling how many users you can serve per dollar of GPU spend.
Conclusion
The KV cache is one of the most consequential engineering decisions in the modern AI stack — simultaneously the reason LLMs can generate text quickly and the primary reason serving them at scale is expensive. Every frontier model architecture in 2025 — Llama 3, Mistral, GPT-5, Gemini 3, Claude Opus 4.6 — has been shaped by the memory constraints the KV cache imposes.
Understanding it changes how you think about LLM costs: longer context isn't just slower, it's fundamentally more memory-intensive in a way that doesn't get better with more compute. It explains why Grouped-Query Attention displaced Multi-Head Attention across the industry. It explains why prefix caching is one of the highest-ROI API features ever shipped. And it frames the genuine innovation happening in 2025 — MorphKV, Flash Attention 3, PagedAttention, KV quantisation — as what it really is: an ongoing engineering campaign to serve ever-longer contexts to ever-more users on the same constrained GPU hardware.
Practical Takeaways for Developers
If you're building on LLMs, here's what the KV cache means for you concretely. When choosing a model for a long-context application, memory bandwidth — not raw compute TFLOPS — is your primary throughput constraint during generation. An H100 with 3.35 TB/s bandwidth will serve long-context requests faster than a theoretically more powerful chip with lower bandwidth, because the decode phase is bandwidth-bound.
When architecting your system prompt, keep it consistent across requests. Both the Anthropic API and OpenAI API support prefix caching — a fixed system prompt computed once and reused across thousands of calls is dramatically cheaper than a variable one regenerated every time. This single change can reduce costs by 30–50% for chatbot applications with long system prompts.
When debugging slow inference, distinguish between time-to-first-token (TTFT) and tokens-per-second (TPS). If TTFT is high but TPS is fine, the bottleneck is prefill — your input prompt is too long or your GPU is under-provisioned for the batch size. If TPS is low, the bottleneck is memory bandwidth during decode — likely because your KV cache is too large relative to available HBM. These are different problems with different solutions.
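That triage rule can be written down as a toy helper. The thresholds here are invented for illustration, not canonical budgets; tune them to your own latency targets:

```python
def diagnose(ttft_ms, tps, ttft_budget_ms=500, tps_budget=30):
    """Toy triage rule for slow inference; thresholds are illustrative."""
    if ttft_ms > ttft_budget_ms and tps >= tps_budget:
        return "prefill-bound: shorten the prompt or add prefill compute"
    if tps < tps_budget:
        return "decode-bound: shrink the KV cache (quantise, GQA) or add bandwidth"
    return "within budget"

print(diagnose(ttft_ms=2000, tps=45))  # slow first token, healthy streaming -> prefill
print(diagnose(ttft_ms=300, tps=12))   # fast first token, slow streaming -> decode
```

Separating the two metrics before optimising prevents the common mistake of throwing compute at a bandwidth problem, or vice versa.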
When scaling to multiple users, the KV cache is what limits your concurrency. GQA, KV quantisation, and PagedAttention are not optional optimisations for a production system — they are the difference between serving 1 user and serving 20 on the same hardware. Implement them before you scale, not after.
If you're building on LLMs, KV cache is not a background implementation detail. It is the central resource you are managing — and the engineers who understand it deeply will build systems that are faster, cheaper, and more scalable than those who treat inference as a black box.