Improve LLM Cache Hit Rate
Prompt caching helps repeated LLM requests run with lower input-token cost and often lower latency. In WaveSpeedAI LLM, caching behavior still comes from the selected upstream model provider. WaveSpeedAI gives you one OpenAI-compatible endpoint, while each model family decides how prompt prefixes are cached, how long caches live, and which usage fields are returned.
Use this guide when cache hit rate is low, coding-agent sessions feel expensive, or your app sends the same long instructions, tool definitions, documents, or examples many times.
The Short Version
| Model family | How caching usually works | What you should do |
|---|---|---|
| OpenAI-style models | Repeated long prefixes can be cached automatically | Keep the beginning of messages identical and optionally use a stable prompt_cache_key |
| Claude models | Cache behavior is more explicit | Mark the reusable prefix with cache_control when you call the API yourself |
| Coding agents | The tool may create or reuse stable context for you | Keep project rules, tool definitions, and startup context stable across turns |
| Other providers | Behavior depends on the upstream model | Keep a stable prefix and check the returned usage fields |
The most important rule is simple: put stable content first, put dynamic content last.
Good cache candidates:
- Long system instructions
- Coding rules and project conventions
- Tool definitions
- Shared examples
- Large documents reused across questions
- Repository summaries reused across coding-agent turns
Bad cache candidates:
- User-specific one-off questions
- Timestamps
- Request IDs
- Randomized examples
- Short prompts
- Frequently changing tool definitions
How Prompt Caching Works
Most LLM prompt caching is prefix caching. The provider checks whether the beginning of the current prompt matches a prefix it processed recently. If it matches, part of the prompt can be read from cache instead of processed from scratch.
That means these two requests are cache-friendly:
Request 1:
stable instructions
stable examples
stable tool definitions
question A
Request 2:
stable instructions
stable examples
stable tool definitions
question B
These two requests are not cache-friendly:
Request 1:
timestamp A
stable instructions
question A
Request 2:
timestamp B
stable instructions
question B
The instructions are the same, but the prefix changed at the very beginning, so the provider may not be able to reuse the cached prefix.
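The ordering rule can be sketched in a few lines of Python. The helper names and sample strings below are hypothetical, and comparing characters is only a rough stand-in for how a provider matches token prefixes:

```python
import os

# Hypothetical stable content; in practice this would be your long instructions.
STABLE_PARTS = ["stable instructions", "stable examples", "stable tool definitions"]


def cache_friendly(question: str) -> str:
    """Stable content first, dynamic content last."""
    return "\n".join(STABLE_PARTS + [question])


def cache_unfriendly(timestamp: str, question: str) -> str:
    """Dynamic content first: the prefix changes on every request."""
    return "\n".join([timestamp] + STABLE_PARTS + [question])


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


good = shared_prefix_chars(cache_friendly("question A"), cache_friendly("question B"))
bad = shared_prefix_chars(
    cache_unfriendly("2024-01-01T00:00:00Z", "question A"),
    cache_unfriendly("2024-01-01T00:00:05Z", "question B"),
)
print(f"shared prefix, stable first: {good} chars")
print(f"shared prefix, timestamp first: {bad} chars")
```

With the timestamp at the front, the shared prefix ends inside the timestamp itself, so none of the stable instructions fall within the reusable part.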
Minimum Size and Cache Lifetime
Very short prompts usually do not benefit from prompt caching. Providers apply their own minimum cacheable prompt sizes and cache lifetimes.
| Area | What to expect |
|---|---|
| OpenAI-style models | Cache hits usually start after the repeated prefix is at least about 1,024 tokens. Cached-token counts can increase in fixed token increments. |
| Claude models | Short prefixes may be ignored even when cache_control is present. Make the cached block large enough to be worth caching. |
| Cache lifetime | Ephemeral prompt caches are short-lived. Treat them as useful for related requests sent close together, not as long-term storage. |
| First request | The first matching request usually creates or warms the cache. A later request is where you should expect to see cache-read fields. |
If a test prompt is only a few sentences, cached_tokens may stay 0 even when the request shape is correct. Test with the real long instructions, tool definitions, documents, or repository context that your app sends in production.
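A quick local check can flag prefixes that are probably too short before you spend time debugging. This is a rough sketch: the four-characters-per-token ratio is an assumed rule of thumb for English text, not the tokenizer any provider uses, and the function names are hypothetical. The 1,024-token threshold is the approximate OpenAI-style figure from the table above.

```python
# Assumption: ~4 characters per token is a rough English-text heuristic,
# not the tokenizer any provider actually uses.
CHARS_PER_TOKEN = 4
# Approximate OpenAI-style minimum from the table above.
MIN_CACHEABLE_TOKENS = 1024


def rough_token_estimate(text: str) -> int:
    """Very rough token count based on character length."""
    return len(text) // CHARS_PER_TOKEN


def likely_too_short(stable_prefix: str, minimum: int = MIN_CACHEABLE_TOKENS) -> bool:
    """True when the stable prefix is probably below the cacheable minimum."""
    return rough_token_estimate(stable_prefix) < minimum


tiny_prompt = "Answer briefly."
long_prompt = "policy text " * 1000  # 12,000 characters of placeholder text

print(likely_too_short(tiny_prompt))
print(likely_too_short(long_prompt))
```

Treat the result as a hint only. The usage fields in the real response are the authoritative signal.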
WaveSpeedAI Endpoint
Use the normal WaveSpeedAI LLM endpoint:
https://llm.wavespeed.ai/v1/chat/completions
Use the same API key and OpenAI-compatible request format as the rest of WaveSpeedAI LLM:
curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "messages": [
      {
        "role": "system",
        "content": "Stable instructions that stay identical across requests."
      },
      {
        "role": "user",
        "content": "The current user question goes here."
      }
    ]
  }'
OpenAI-Style Models
For OpenAI-style models, start by making the repeated prefix stable. For long repeated prompts, compatible upstream models may cache repeated prefixes automatically. A stable prompt_cache_key can help route similar requests together when your workload has many requests sharing the same long prefix.
curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "prompt_cache_key": "docs-assistant-v1",
    "messages": [
      {
        "role": "system",
        "content": "Long stable instructions, examples, and policy text. Keep this prefix byte-for-byte consistent across related requests."
      },
      {
        "role": "user",
        "content": "Ask the current user question here."
      }
    ],
    "max_tokens": 200
  }'
Keep prompt_cache_key stable for requests that share the same long prefix. Do not use a unique key per user message, request ID, or timestamp, because that can split traffic into too many cache groups.
OpenAI-style cache hits depend on the repeated prefix, not only on the key. If the beginning of messages, tools, or other request fields changes, a request with the same prompt_cache_key may still miss the cache.
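One way to keep both the prefix and the key stable is to build every request body through a single function. The sketch below mirrors the curl example above. The constant names, the `build_payload` helper, and the sample questions are hypothetical.

```python
import json

# Hypothetical stable prefix and key; the model name is the one used in this guide.
SYSTEM_PROMPT = "Long stable instructions, examples, and policy text."
PROMPT_CACHE_KEY = "docs-assistant-v1"


def build_payload(question: str) -> dict:
    """Build a chat-completions body whose prefix and cache key never vary."""
    return {
        "model": "openai/gpt-5.5",
        "prompt_cache_key": PROMPT_CACHE_KEY,  # one key per prompt family
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # stable prefix
            {"role": "user", "content": question},  # dynamic suffix
        ],
        "max_tokens": 200,
    }


first = build_payload("How do I rotate an API key?")
second = build_payload("What is the rate limit?")
print(json.dumps(first, indent=2))
```

Only the final user message differs between `first` and `second`. The resulting dictionary can be sent as the JSON body to the endpoint shown above with any HTTP client.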
Claude Models
Claude prompt caching is more explicit. When you call Claude models through WaveSpeedAI LLM and want to cache a specific reusable prefix, mark the reusable content block with cache_control.
curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "anthropic/claude-opus-4.7",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Long stable project context, coding rules, examples, or document text. Keep this block identical across requests that should share the cache.",
            "cache_control": {
              "type": "ephemeral"
            }
          },
          {
            "type": "text",
            "text": "Ask the current question here."
          }
        ]
      }
    ],
    "max_tokens": 200
  }'
Place cache_control on the last block that should stay identical. Do not put it on the block that contains the user question, a timestamp, or any content that changes every request.
Claude caching is prefix-based across the request shape. If your request includes tools, system instructions, and messages, keep those stable and in the same order. Changing tool definitions, tool order, system text, image blocks, message order, or other early request fields can change the prefix and reduce cache hits.
When prompts contain many content blocks, a single cache breakpoint near the end may not be enough for every earlier block to be checked. Use additional stable breakpoints when you have a long prompt made of many separate blocks, while keeping the number of breakpoints small and intentional.
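The breakpoint placement can be sketched as a small builder that marks only the last stable block and appends the changing question after it. The function name and sample strings are hypothetical, and the request shape simply follows the curl example above. Additional breakpoints for very long prefixes could be added the same way on earlier stable blocks.

```python
def build_claude_payload(stable_blocks: list[str], question: str) -> dict:
    """Mark the last stable block with cache_control, then add the question."""
    content = [{"type": "text", "text": text} for text in stable_blocks]
    if content:
        # Breakpoint goes on the final block that stays identical across requests.
        content[-1]["cache_control"] = {"type": "ephemeral"}
    # The dynamic question comes last and carries no cache_control.
    content.append({"type": "text", "text": question})
    return {
        "model": "anthropic/claude-opus-4.7",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 200,
    }


payload = build_claude_payload(
    ["Project coding rules ...", "Repository summary ..."],
    "Which module handles retries?",
)
blocks = payload["messages"][0]["content"]
for block in blocks:
    print("cache_control" in block, block["text"][:30])
```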
Coding Tools and Agents
Coding tools often send a large repeated prefix: project rules, repository context, tool definitions, file summaries, and conversation history. Some tools may already manage prompt caching behind the scenes, especially when they use Claude or GPT-family models directly.
To help caching work well with Codex, Claude Code, Cursor, OpenClaw, or similar agents:
| Do | Why |
|---|---|
| Keep project instructions stable | Changing rules can invalidate the prefix |
| Put durable rules in files such as project docs or agent instructions | Tools can reuse them across turns |
| Avoid adding timestamps or random IDs to the beginning of prompts | Early changes break prefix matching |
| Ask follow-up questions in the same session when possible | The tool can reuse previous context more easily |
| Avoid changing tool definitions during a session | Tool definition changes often affect the cached prefix |
| Avoid editing agent config between related prompts | Model, tool, or provider changes can create a new request prefix |
| Start long sessions with a small read-only prompt | It can warm up context before heavier requests |
For example:
First prompt:
Read the project instructions and summarize the repo. Do not edit files.
Follow-up prompt:
Now update the LLM cache guide based on the same project rules.
This is usually more cache-friendly than repeatedly starting fresh sessions with different preambles.
How to Check Cache Usage
Log the usage object from the response.
OpenAI-style models may return:
{
  "usage": {
    "prompt_tokens": 3134,
    "completion_tokens": 5,
    "total_tokens": 3139,
    "prompt_tokens_details": {
      "cached_tokens": 2816
    }
  }
}
Claude models may return cache creation and cache read fields:
{
  "usage": {
    "prompt_tokens": 12127,
    "prompt_tokens_details": {
      "cached_tokens": 10549,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 10549
    }
  }
}
Some upstream responses may use slightly different field names for cache creation, such as cached_creation_tokens or cache_creation_input_tokens. When debugging, log the full usage.prompt_tokens_details object instead of checking only one field.
For a simple cache-hit signal:
cache hit tokens = usage.prompt_tokens_details.cached_tokens
For Claude-specific debugging, also check:
cache write tokens = usage.prompt_tokens_details.cache_creation_input_tokens
cache read tokens = usage.prompt_tokens_details.cache_read_input_tokens
If a cache creation field is positive, the provider wrote a cache entry. If cache_read_input_tokens is positive on a later request, the provider read from cache.
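The field lookups above can be wrapped in one tolerant helper. This is a minimal sketch: the `cache_metrics` name and the output keys are hypothetical, and it only checks the field names mentioned in this guide, so an upstream that uses other names would need extra cases.

```python
def cache_metrics(usage: dict) -> dict:
    """Pull cache-related counters out of a usage object, defaulting to 0."""
    details = usage.get("prompt_tokens_details") or {}
    # Cache-creation field names vary by upstream; try both spellings.
    written = details.get("cache_creation_input_tokens")
    if written is None:
        written = details.get("cached_creation_tokens")
    return {
        "prompt_tokens": usage.get("prompt_tokens") or 0,
        "cached_tokens": details.get("cached_tokens") or 0,
        "cache_write_tokens": written or 0,
        "cache_read_tokens": details.get("cache_read_input_tokens") or 0,
    }


# Sample usage objects copied from the examples above.
openai_usage = {
    "prompt_tokens": 3134,
    "completion_tokens": 5,
    "total_tokens": 3139,
    "prompt_tokens_details": {"cached_tokens": 2816},
}
claude_usage = {
    "prompt_tokens": 12127,
    "prompt_tokens_details": {
        "cached_tokens": 10549,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 10549,
    },
}

print(cache_metrics(openai_usage))
print(cache_metrics(claude_usage))
```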
Tested With WaveSpeedAI LLM
These checks were run against https://llm.wavespeed.ai/v1/chat/completions:
| Model | Test | Observed signal |
|---|---|---|
| openai/gpt-5.5 | Same long prefix, same prompt_cache_key, repeated request | Later response included prompt_tokens_details.cached_tokens |
| anthropic/claude-opus-4.7 | Long content block marked with cache_control, repeated request | Responses included cache creation fields, then cache_read_input_tokens on a later identical request |
The exact amount of cached tokens can vary by model, prompt size, cache lifetime, and upstream routing. Treat cache metrics as something to monitor in production, not as a fixed value.
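For production monitoring, a running aggregate is often more useful than a single response. The sketch below assumes that "hit rate" means cached prompt tokens divided by total prompt tokens, which is one possible definition. The `CacheStats` class is hypothetical, and the cold-request numbers are made up for illustration.

```python
class CacheStats:
    """Accumulate token counts across responses.

    Assumption: hit rate = cached prompt tokens / total prompt tokens.
    """

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, usage: dict) -> None:
        """Add one response's usage object to the running totals."""
        details = usage.get("prompt_tokens_details") or {}
        self.prompt_tokens += usage.get("prompt_tokens") or 0
        self.cached_tokens += details.get("cached_tokens") or 0

    @property
    def hit_rate(self) -> float:
        """Fraction of prompt tokens served from cache so far."""
        if not self.prompt_tokens:
            return 0.0
        return self.cached_tokens / self.prompt_tokens


stats = CacheStats()
# Illustrative cold request followed by a warm one.
stats.record({"prompt_tokens": 3134, "prompt_tokens_details": {"cached_tokens": 0}})
stats.record({"prompt_tokens": 3134, "prompt_tokens_details": {"cached_tokens": 2816}})
print(f"token-level hit rate: {stats.hit_rate:.1%}")
```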
Improve Cache Hit Rate
Use this checklist when cache hit rate is low:
- Put stable content before dynamic content.
- Keep system prompts, examples, tool definitions, and schemas byte-for-byte consistent.
- Keep user questions, timestamps, request IDs, and short-lived data at the end.
- Make the reusable prefix long enough to be worth caching.
- For OpenAI-style models, reuse prompt_cache_key for the same prompt family.
- For Claude models, place cache_control on the last stable content block.
- Keep tool definitions, tool order, system text, and message order stable.
- Send the second request after the first response starts or completes, especially when testing cache reads.
- Log usage.prompt_tokens_details so you can see cached tokens, cache writes, and cache reads.
- Test with your real prompt shape, not only a tiny prompt.
- If Claude shows cache creation but no cache read, wait a few seconds and retry the exact same stable prefix.
Common Reasons for Low Cache Hit Rate
| Symptom | Likely cause | What to change |
|---|---|---|
| cached_tokens stays 0 | Prompt is too short or prefix keeps changing | Use a longer stable prefix and move dynamic data later |
| Same prompt text still misses | Tools, system text, images, or message order changed before the text | Compare the full serialized request, not only the visible user prompt |
| Claude writes cache every time but does not read | cache_control is placed after changing content | Move cache_control to the end of the stable block |
| Claude shows cache creation first, then no read yet | Cache may not be readable immediately or the request may route differently | Retry the exact same stable prefix after a short delay and monitor usage.prompt_tokens_details |
| Long Claude prompt has many blocks | Cache lookup may not cover every earlier block from one late breakpoint | Add stable breakpoints earlier in the long reusable prefix |
| Agent sessions do not seem to reuse context | Each task starts with different instructions or tool setup | Keep project rules and tool definitions stable |
| Cache works once, then disappears later | Upstream cache expired | Send related requests close together or use provider-supported retention options when available |
| Similar prompts miss cache | The beginning is not identical | Compare the serialized request prefix, including tools and message order |
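Several rows above come down to comparing serialized requests. A small helper can report where two requests first diverge. This is a sketch under stated assumptions: the function names are hypothetical, it assumes tools precede messages in the effective prefix, and it compares JSON text locally rather than the tokens the provider actually sees.

```python
import json
import os


def prefix_view(request: dict) -> str:
    """Serialize the fields that usually form the prompt prefix, in a fixed order.

    Assumption: tools come before messages; adjust to match your provider.
    """
    ordered = [request.get("tools", []), request.get("messages", [])]
    return json.dumps(ordered, ensure_ascii=False)


def first_divergence(a: dict, b: dict, context: int = 30) -> tuple[int, str, str]:
    """Return the offset where two requests diverge, plus a snippet from each."""
    sa, sb = prefix_view(a), prefix_view(b)
    offset = len(os.path.commonprefix([sa, sb]))
    return offset, sa[offset:offset + context], sb[offset:offset + context]


base = {
    "tools": [{"name": "search"}],
    "messages": [
        {"role": "system", "content": "rules v1"},
        {"role": "user", "content": "question A"},
    ],
}
# Only the final question differs: divergence is late, so the prefix is reusable.
late = {**base, "messages": base["messages"][:1] + [{"role": "user", "content": "question B"}]}
# A tool was added: divergence is early, before any message text.
early = {**base, "tools": base["tools"] + [{"name": "fetch"}]}

print(first_divergence(base, late))
print(first_divergence(base, early))
```

A small offset means the requests split early, which points to tools, system text, or message order rather than the user question.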
What Prompt Caching Does Not Do
Prompt caching does not store or reuse the model’s final answer. The model still generates a fresh response for each request. Caching only helps the provider reuse processing for repeated prompt prefixes.
Prompt caching also does not replace good context management. You should still trim irrelevant history, summarize old turns, and avoid sending large content that the model no longer needs.