Improve LLM Cache Hit Rate

Prompt caching lowers input-token cost, and often latency, for repeated LLM requests. In WaveSpeedAI LLM, caching behavior is still determined by the upstream model provider you select. WaveSpeedAI gives you one OpenAI-compatible endpoint, while each model family decides how prompt prefixes are cached, how long caches live, and which usage fields are returned.

Use this guide when cache hit rate is low, coding-agent sessions feel expensive, or your app sends the same long instructions, tool definitions, documents, or examples many times.

The Short Version

Model family	How caching usually works	What you should do
OpenAI-style models	Repeated long prefixes can be cached automatically	Keep the beginning of `messages` identical and optionally use a stable `prompt_cache_key`
Claude models	Cache behavior is more explicit	Mark the reusable prefix with `cache_control` when you call the API yourself
Coding agents	The tool may create or reuse stable context for you	Keep project rules, tool definitions, and startup context stable across turns
Other providers	Behavior depends on the upstream model	Keep a stable prefix and check the returned `usage` fields

The most important rule is simple: put stable content first, put dynamic content last.

Good cache candidates:

Long system instructions
Coding rules and project conventions
Tool definitions
Shared examples
Large documents reused across questions
Repository summaries reused across coding-agent turns

Bad cache candidates:

User-specific one-off questions
Timestamps
Request IDs
Randomized examples
Short prompts
Frequently changing tool definitions

How Prompt Caching Works

Most LLM prompt caching is prefix caching. The provider checks whether the beginning of the current prompt matches a prefix it processed recently. If it matches, part of the prompt can be read from cache instead of processed from scratch.

That means these two requests are cache-friendly:

Request 1:
  stable instructions
  stable examples
  stable tool definitions
  question A

Request 2:
  stable instructions
  stable examples
  stable tool definitions
  question B

These two requests are not cache-friendly:

Request 1:
  timestamp A
  stable instructions
  question A

Request 2:
  timestamp B
  stable instructions
  question B

The instructions are the same, but the two prompts differ from the very first token, so the provider has no matching prefix to reuse.
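The ordering rule can be sketched in Python. This is a minimal illustration, not part of any WaveSpeedAI SDK: `build_messages`, `STABLE_INSTRUCTIONS`, and `STABLE_EXAMPLES` are hypothetical names, and the instruction text is a placeholder for your real long prefix.

```python
# Hypothetical stable content; in practice this is your real long prefix.
STABLE_INSTRUCTIONS = "You are a documentation assistant. Follow the style guide."
STABLE_EXAMPLES = "Example question: How do I rotate a key? Example answer: Use the dashboard."


def build_messages(question: str, timestamp: str) -> list[dict]:
    """Return OpenAI-style messages with stable content first, dynamic content last."""
    return [
        # Stable prefix: identical on every request, so it can match a cached prefix.
        {"role": "system", "content": STABLE_INSTRUCTIONS + "\n\n" + STABLE_EXAMPLES},
        # Dynamic suffix: the question and timestamp change on every request.
        {"role": "user", "content": f"{question}\n\n(Sent at {timestamp})"},
    ]
```

Two calls with different questions and timestamps produce an identical first message, which is the part a prefix cache can reuse. Putting the timestamp in the system message instead would change the prompt before the stable text begins.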

Minimum Size and Cache Lifetime

Very short prompts usually do not benefit from prompt caching. Providers apply their own minimum cacheable prompt sizes and cache lifetimes.

Area	What to expect
OpenAI-style models	Caching usually applies once the prompt reaches about 1,024 tokens. Cached-token counts then tend to grow in fixed steps (128 tokens on OpenAI's own API), so `cached_tokens` is often slightly smaller than the shared prefix.
Claude models	Short prefixes may be ignored even when `cache_control` is present. Make the cached block large enough to be worth caching.
Cache lifetime	Ephemeral prompt caches are short-lived, typically on the order of minutes since the last use. Treat them as useful for related requests sent close together, not as long-term storage.
First request	The first matching request usually creates or warms the cache. A later request is where you should expect to see cache-read fields.

If a test prompt is only a few sentences, cached_tokens may stay 0 even when the request shape is correct. Test with the real long instructions, tool definitions, documents, or repository context that your app sends in production.
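As a rough pre-flight check, you can estimate whether a prefix is long enough before sending it. The sketch below assumes about four characters per token, a common rule of thumb for English text rather than a real tokenizer, and uses the approximate 1,024-token threshold mentioned above as a default. Both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will give different counts."""
    return int(len(text) / chars_per_token)


def likely_cacheable(prefix: str, min_tokens: int = 1024) -> bool:
    """Guess whether a stable prefix is long enough to be cached at all."""
    return estimate_tokens(prefix) >= min_tokens
```

Treat a `False` result as a hint that `cached_tokens` will probably stay 0, and a `True` result only as a reason to go check the real `usage` fields.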

WaveSpeedAI Endpoint

Use the normal WaveSpeedAI LLM endpoint:

https://llm.wavespeed.ai/v1/chat/completions

Use the same API key and OpenAI-compatible request format as the rest of WaveSpeedAI LLM:

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "messages": [
      {
        "role": "system",
        "content": "Stable instructions that stay identical across requests."
      },
      {
        "role": "user",
        "content": "The current user question goes here."
      }
    ]
  }'

OpenAI-Style Models

For OpenAI-style models, start by making the repeated prefix stable. Compatible upstream models may then cache long prefixes automatically, with no extra request fields. A stable prompt_cache_key can help route similar requests together when many requests in your workload share the same long prefix.

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "prompt_cache_key": "docs-assistant-v1",
    "messages": [
      {
        "role": "system",
        "content": "Long stable instructions, examples, and policy text. Keep this prefix byte-for-byte consistent across related requests."
      },
      {
        "role": "user",
        "content": "Ask the current user question here."
      }
    ],
    "max_tokens": 200
  }'

Keep prompt_cache_key stable for requests that share the same long prefix. Do not use a unique key per user message, request ID, or timestamp, because that can split traffic into too many cache groups.

OpenAI-style cache hits depend on the repeated prefix, not only the key. If the beginning of messages, tools, or other request fields changes, the same prompt_cache_key may still miss cache.
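One way to keep the key stable per prompt family is to derive it from the family name and the stable prefix itself. The following is a sketch under that assumption: `cache_key_for` and `build_payload` are hypothetical helpers, and hashing the prefix is an illustrative choice, not something the API requires.

```python
import hashlib


def cache_key_for(prompt_family: str, stable_prefix: str) -> str:
    """Derive a key that changes only when the family or the stable prefix changes."""
    digest = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:12]
    return f"{prompt_family}-{digest}"


def build_payload(stable_prefix: str, question: str) -> dict:
    """Build a chat completions payload with a stable prefix and cache key."""
    return {
        "model": "openai/gpt-5.5",
        # Same prefix -> same key, regardless of the user question.
        "prompt_cache_key": cache_key_for("docs-assistant", stable_prefix),
        "messages": [
            {"role": "system", "content": stable_prefix},
            {"role": "user", "content": question},
        ],
        "max_tokens": 200,
    }
```

With this design, editing the stable instructions produces a new key automatically, while different user questions against the same instructions keep sharing one key.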

Claude Models

Claude prompt caching is more explicit. When you call Claude models through WaveSpeedAI LLM and want to cache a specific reusable prefix, mark the reusable content block with cache_control.

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "anthropic/claude-opus-4.7",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Long stable project context, coding rules, examples, or document text. Keep this block identical across requests that should share the cache.",
            "cache_control": {
              "type": "ephemeral"
            }
          },
          {
            "type": "text",
            "text": "Ask the current question here."
          }
        ]
      }
    ],
    "max_tokens": 200
  }'

Place cache_control on the last block that should stay identical. Do not put it on the block that contains the user question, a timestamp, or any content that changes every request.
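The placement rule can be sketched as a small Python helper that builds the `content` array for the request above. `build_claude_content` is a hypothetical name, and the block shape simply mirrors the curl example.

```python
def build_claude_content(stable_blocks: list[str], question: str) -> list[dict]:
    """Return content blocks with cache_control on the last stable block only."""
    if not stable_blocks:
        raise ValueError("need at least one stable block to cache")
    content = [{"type": "text", "text": text} for text in stable_blocks]
    # The breakpoint goes on the final stable block, so everything up to and
    # including it forms the reusable prefix.
    content[-1]["cache_control"] = {"type": "ephemeral"}
    # The changing question comes after the breakpoint and is never marked.
    content.append({"type": "text", "text": question})
    return content
```

Calling it with the same `stable_blocks` and different questions yields requests that differ only after the breakpoint.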

Claude caching is prefix-based across the request shape. If your request includes tools, system instructions, and messages, keep those stable and in the same order. Changing tool definitions, tool order, system text, image blocks, message order, or other early request fields can change the prefix and reduce cache hits.

When a prompt contains many content blocks, a single cache breakpoint near the end may not cover every earlier block, because the provider looks back only a limited number of blocks from each breakpoint when searching for a cached prefix. For a long prompt made of many separate blocks, add further breakpoints at stable positions, but keep them few and intentional (Anthropic's own API accepts at most four per request).

Coding Tools and Agents

Coding tools often send a large repeated prefix: project rules, repository context, tool definitions, file summaries, and conversation history. Some tools may already manage prompt caching behind the scenes, especially when they use Claude or GPT-family models directly.

To help caching work well with Codex, Claude Code, Cursor, OpenClaw, or similar agents:

Do	Why
Keep project instructions stable	Changing rules can invalidate the prefix
Put durable rules in files such as project docs or agent instructions	Tools can reuse them across turns
Avoid adding timestamps or random IDs to the beginning of prompts	Early changes break prefix matching
Ask follow-up questions in the same session when possible	The tool can reuse previous context more easily
Avoid changing tool definitions during a session	Tool definition changes often affect the cached prefix
Avoid editing agent config between related prompts	Model, tool, or provider changes can create a new request prefix
Start long sessions with a small read-only prompt	It can warm up context before heavier requests

For example:

First prompt:
Read the project instructions and summarize the repo. Do not edit files.

Follow-up prompt:
Now update the LLM cache guide based on the same project rules.

This is usually more cache-friendly than repeatedly starting fresh sessions with different preambles.

How to Check Cache Usage

Log the usage object from the response.

OpenAI-style models may return:

{
  "usage": {
    "prompt_tokens": 3134,
    "completion_tokens": 5,
    "total_tokens": 3139,
    "prompt_tokens_details": {
      "cached_tokens": 2816
    }
  }
}

Claude models may return cache creation and cache read fields:

{
  "usage": {
    "prompt_tokens": 12127,
    "prompt_tokens_details": {
      "cached_tokens": 10549,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 10549
    }
  }
}

Some upstream responses may use slightly different field names for cache creation, such as cached_creation_tokens or cache_creation_input_tokens. When debugging, log the full usage.prompt_tokens_details object instead of checking only one field.

For a simple cache-hit signal:

cache hit tokens = usage.prompt_tokens_details.cached_tokens

For Claude-specific debugging, also check:

cache write tokens = usage.prompt_tokens_details.cache_creation_input_tokens
cache read tokens  = usage.prompt_tokens_details.cache_read_input_tokens

If a cache creation field is positive, the provider wrote a cache entry. If cache_read_input_tokens is positive on a later request, the provider read from cache.
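The field checks above can be combined into one helper that tolerates missing fields and the alternate cache-creation names. This is a sketch based only on the response shapes shown in this guide; `cache_stats` is a hypothetical name, and other upstream models may return different fields.

```python
def cache_stats(usage: dict) -> dict:
    """Summarize cache hits, writes, and reads from a response's usage object."""
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    cached = details.get("cached_tokens") or 0

    # Cache-write field names vary by upstream; use the first one present.
    written = 0
    for name in ("cache_creation_input_tokens", "cached_creation_tokens"):
        if details.get(name) is not None:
            written = details[name]
            break

    return {
        "cached_tokens": cached,
        "cache_write_tokens": written,
        "cache_read_tokens": details.get("cache_read_input_tokens") or 0,
        # Share of the prompt served from cache; 0.0 when there is no prompt.
        "hit_ratio": cached / prompt_tokens if prompt_tokens else 0.0,
    }
```

Logging this summary next to the raw `usage.prompt_tokens_details` object makes it easier to track hit ratio over time without losing provider-specific fields.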

Tested With WaveSpeedAI LLM

These checks were run against https://llm.wavespeed.ai/v1/chat/completions:

Model	Test	Observed signal
`openai/gpt-5.5`	Same long prefix, same `prompt_cache_key`, repeated request	Later response included `prompt_tokens_details.cached_tokens`
`anthropic/claude-opus-4.7`	Long content block marked with `cache_control`, repeated request	Responses included cache creation fields, then `cache_read_input_tokens` on a later identical request

The exact amount of cached tokens can vary by model, prompt size, cache lifetime, and upstream routing. Treat cache metrics as something to monitor in production, not as a fixed value.
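A repeated-request check like the ones in the table can be scripted. The sketch below is an assumption about how you might structure such a test, not the procedure used for the results above. `send` is a hypothetical callable that POSTs the payload to the chat completions endpoint and returns the parsed JSON response, so you can plug in whichever HTTP client you already use.

```python
import time
from typing import Callable


def check_cache_read(
    send: Callable[[dict], dict], payload: dict, delay_s: float = 2.0
) -> tuple[int, int]:
    """Send the same payload twice and return cached_tokens from each response."""

    def cached(response: dict) -> int:
        details = (response.get("usage") or {}).get("prompt_tokens_details") or {}
        return details.get("cached_tokens") or 0

    first = cached(send(payload))   # usually creates or warms the cache
    time.sleep(delay_s)             # give the cache entry time to become readable
    second = cached(send(payload))  # a cache read, if any, shows up here
    return first, second
```

A result such as zero cached tokens on the first call and a positive count on the second suggests the prefix is being reused. Two zeros point back to the prompt length or prefix stability.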

Improve Cache Hit Rate

Use this checklist when cache hit rate is low:

Put stable content before dynamic content.
Keep system prompts, examples, tool definitions, and schemas byte-for-byte consistent.
Keep user questions, timestamps, request IDs, and short-lived data at the end.
Make the reusable prefix long enough to be worth caching.
For OpenAI-style models, reuse prompt_cache_key for the same prompt family.
For Claude models, place cache_control on the last stable content block.
Keep tool definitions, tool order, system text, and message order stable.
Send the second request after the first response starts or completes, especially when testing cache reads.
Log usage.prompt_tokens_details so you can see cached tokens, cache writes, and cache reads.
Test with your real prompt shape, not only a tiny prompt.
If Claude shows cache creation but no cache read, wait a few seconds and retry the exact same stable prefix.

Common Reasons for Low Cache Hit Rate

Symptom	Likely cause	What to change
`cached_tokens` stays `0`	Prompt is too short or prefix keeps changing	Use a longer stable prefix and move dynamic data later
Same prompt text still misses	Tools, system text, images, or message order changed before the text	Compare the full serialized request, not only the visible user prompt
Claude writes cache every time but does not read	`cache_control` is placed after changing content	Move `cache_control` to the end of the stable block
Claude shows cache creation first, then no read yet	Cache may not be readable immediately or the request may route differently	Retry the exact same stable prefix after a short delay and monitor `usage.prompt_tokens_details`
Long Claude prompt has many blocks	Cache lookup may not cover every earlier block from one late breakpoint	Add stable breakpoints earlier in the long reusable prefix
Agent sessions do not seem to reuse context	Each task starts with different instructions or tool setup	Keep project rules and tool definitions stable
Cache works once, then disappears later	Upstream cache expired	Send related requests close together or use provider-supported retention options when available
Similar prompts miss cache	The beginning is not identical	Compare the serialized request prefix, including tools and message order
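Several rows above recommend comparing the serialized request. A minimal way to do that in Python is to serialize two request bodies and report where they first differ. `first_divergence` is a hypothetical helper. It compares the JSON your client builds, which only approximates the prompt the provider assembles internally, and it counts characters rather than tokens.

```python
import json


def first_divergence(req_a: dict, req_b: dict, context: int = 40):
    """Return (index, snippet_a, snippet_b) at the first difference, or None if equal."""
    a = json.dumps(req_a, ensure_ascii=False)
    b = json.dumps(req_b, ensure_ascii=False)
    # Walk both strings until a character differs; if none does, one string
    # may still be a strict prefix of the other.
    limit = min(len(a), len(b))
    index = next((i for i in range(limit) if a[i] != b[i]), limit)
    if index == len(a) == len(b):
        return None
    start = max(0, index - context)
    return index, a[start:index + context], b[start:index + context]
```

If the reported index falls near the start of the request, for example inside a timestamp or a reordered tool list, the shared prefix is too short to cache. If it falls inside the final user message, the request shape is already cache-friendly.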

What Prompt Caching Does Not Do

Prompt caching does not store or reuse the model’s final answer. The model still generates a fresh response for each request. Caching only helps the provider reuse processing for repeated prompt prefixes.

Prompt caching also does not replace good context management. You should still trim irrelevant history, summarize old turns, and avoid sending large content that the model no longer needs.
