Prompt Caching

Improve LLM Cache Hit Rate

Prompt caching helps repeated LLM requests run with lower input-token cost and often lower latency. In WaveSpeedAI LLM, caching behavior still comes from the selected upstream model provider. WaveSpeedAI gives you one OpenAI-compatible endpoint, while each model family decides how prompt prefixes are cached, how long caches live, and which usage fields are returned.

Use this guide when cache hit rate is low, coding-agent sessions feel expensive, or your app sends the same long instructions, tool definitions, documents, or examples many times.

The Short Version

| Model family | How caching usually works | What you should do |
| --- | --- | --- |
| OpenAI-style models | Repeated long prefixes can be cached automatically | Keep the beginning of messages identical and optionally use a stable prompt_cache_key |
| Claude models | Cache behavior is more explicit | Mark the reusable prefix with cache_control when you call the API yourself |
| Coding agents | The tool may create or reuse stable context for you | Keep project rules, tool definitions, and startup context stable across turns |
| Other providers | Behavior depends on the upstream model | Keep a stable prefix and check the returned usage fields |

The most important rule is simple: put stable content first and dynamic content last.

Good cache candidates:

  • Long system instructions
  • Coding rules and project conventions
  • Tool definitions
  • Shared examples
  • Large documents reused across questions
  • Repository summaries reused across coding-agent turns

Bad cache candidates:

  • User-specific one-off questions
  • Timestamps
  • Request IDs
  • Randomized examples
  • Short prompts
  • Frequently changing tool definitions

How Prompt Caching Works

Most LLM prompt caching is prefix caching. The provider checks whether the beginning of the current prompt matches a prefix it processed recently. If it matches, part of the prompt can be read from cache instead of processed from scratch.

That means these two requests are cache-friendly:

Request 1:
  stable instructions
  stable examples
  stable tool definitions
  question A

Request 2:
  stable instructions
  stable examples
  stable tool definitions
  question B

These two requests are not cache-friendly:

Request 1:
  timestamp A
  stable instructions
  question A

Request 2:
  timestamp B
  stable instructions
  question B

The instructions are the same, but the prefix changed at the very beginning, so the provider may not be able to reuse the cached prefix.
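
The two request pairs above can be compared with a small sketch. It treats each request as a list of prompt parts and counts how many leading parts are identical. This is only an illustration: real providers match on tokens of the serialized request, and the helper name `shared_prefix_parts` is hypothetical.

```python
from itertools import takewhile


def shared_prefix_parts(first, second):
    """Count how many leading prompt parts are identical in both requests."""
    pairs = zip(first, second)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# The cache-friendly pair: only the final part differs.
friendly_1 = ["stable instructions", "stable examples", "stable tool definitions", "question A"]
friendly_2 = ["stable instructions", "stable examples", "stable tool definitions", "question B"]

# The cache-unfriendly pair: the very first part differs.
unfriendly_1 = ["timestamp A", "stable instructions", "question A"]
unfriendly_2 = ["timestamp B", "stable instructions", "question B"]

print(shared_prefix_parts(friendly_1, friendly_2))      # 3
print(shared_prefix_parts(unfriendly_1, unfriendly_2))  # 0
```

In the second pair the shared prefix is empty even though the instructions match, which is why moving the timestamp to the end matters.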

Minimum Size and Cache Lifetime

Very short prompts usually do not benefit from prompt caching. Providers apply their own minimum cacheable prompt sizes and cache lifetimes.

| Area | What to expect |
| --- | --- |
| OpenAI-style models | Cache hits usually start after the repeated prefix is at least about 1,024 tokens. Cached-token counts can increase in fixed token increments. |
| Claude models | Short prefixes may be ignored even when cache_control is present. Make the cached block large enough to be worth caching. |
| Cache lifetime | Ephemeral prompt caches are short-lived. Treat them as useful for related requests sent close together, not as long-term storage. |
| First request | The first matching request usually creates or warms the cache. A later request is where you should expect to see cache-read fields. |

If a test prompt is only a few sentences, cached_tokens may stay 0 even when the request shape is correct. Test with the real long instructions, tool definitions, documents, or repository context that your app sends in production.
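
A quick pre-flight check can flag prefixes that are probably too short to cache. The sketch below is a rough heuristic, not a tokenizer: the ratio of about 4 characters per token is an assumption that varies by model and language, and the helper names are hypothetical. The 1,024-token default mirrors the approximate OpenAI-style threshold mentioned above.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token)


def likely_cacheable(prefix_text, min_tokens=1024):
    """Return True when the stable prefix is probably long enough to be cached."""
    return estimate_tokens(prefix_text) >= min_tokens


short_prefix = "You are a helpful assistant."
long_prefix = "Follow the project conventions described below. " * 200

print(likely_cacheable(short_prefix))  # False
print(likely_cacheable(long_prefix))   # True
```

Treat the result as a hint only. The returned usage fields remain the reliable signal for whether a prefix was actually cached.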

WaveSpeedAI Endpoint

Use the normal WaveSpeedAI LLM endpoint:

https://llm.wavespeed.ai/v1/chat/completions

Use the same API key and OpenAI-compatible request format as the rest of WaveSpeedAI LLM:

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "messages": [
      {
        "role": "system",
        "content": "Stable instructions that stay identical across requests."
      },
      {
        "role": "user",
        "content": "The current user question goes here."
      }
    ]
  }'
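
The same request can be built from Python. This is a minimal sketch using only the standard library; the helper names `build_request` and `send` are hypothetical, and the network call is left commented out so the snippet runs without a real key.

```python
import json
import urllib.request

ENDPOINT = "https://llm.wavespeed.ai/v1/chat/completions"


def build_request(api_key, payload):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = {
    "model": "openai/gpt-5.5",
    "messages": [
        {"role": "system", "content": "Stable instructions that stay identical across requests."},
        {"role": "user", "content": "The current user question goes here."},
    ],
}

request = build_request("YOUR_WAVESPEED_API_KEY", payload)
# response = send(request)        # uncomment after setting a real API key
# print(response.get("usage"))    # cache fields appear under usage
```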

OpenAI-Style Models

For OpenAI-style models, start by making the repeated prefix stable. Compatible upstream models may then cache long repeated prefixes automatically. When many requests share the same long prefix, a stable prompt_cache_key can help route them together.

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "openai/gpt-5.5",
    "prompt_cache_key": "docs-assistant-v1",
    "messages": [
      {
        "role": "system",
        "content": "Long stable instructions, examples, and policy text. Keep this prefix byte-for-byte consistent across related requests."
      },
      {
        "role": "user",
        "content": "Ask the current user question here."
      }
    ],
    "max_tokens": 200
  }'

Keep prompt_cache_key stable for requests that share the same long prefix. Do not use a unique key per user message, request ID, or timestamp, because that can split traffic into too many cache groups.

OpenAI-style cache hits depend on the repeated prefix, not on the key alone. If the beginning of messages, tools, or other request fields changes, a request with the same prompt_cache_key may still miss the cache.
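
One way to keep the key and prefix consistent is to build every payload through a single helper. The sketch below assumes one constant system prompt and one constant key per prompt family; the function name and constants are hypothetical, and only the payload shape follows the curl example above.

```python
STABLE_PROMPT = "Long stable instructions, examples, and policy text."
CACHE_KEY = "docs-assistant-v1"  # one key per prompt family, not per request


def build_openai_style_payload(question, model="openai/gpt-5.5"):
    """Return a payload whose prefix and cache key are the same for every question."""
    return {
        "model": model,
        "prompt_cache_key": CACHE_KEY,
        "messages": [
            # Stable content first: identical across related requests.
            {"role": "system", "content": STABLE_PROMPT},
            # Dynamic content last: only this message changes.
            {"role": "user", "content": question},
        ],
        "max_tokens": 200,
    }


first = build_openai_style_payload("How do I rotate an API key?")
second = build_openai_style_payload("How do I check my usage?")

print(first["prompt_cache_key"] == second["prompt_cache_key"])  # True
print(first["messages"][0] == second["messages"][0])            # True
```

A possible convention, not something the API requires, is to bump the version suffix in the key whenever the stable prompt text changes, so old and new prefixes are not grouped together.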

Claude Models

Claude prompt caching is more explicit. When you call Claude models through WaveSpeedAI LLM and want to cache a specific reusable prefix, mark the reusable content block with cache_control.

curl https://llm.wavespeed.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_WAVESPEED_API_KEY" \
  -d '{
    "model": "anthropic/claude-opus-4.7",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Long stable project context, coding rules, examples, or document text. Keep this block identical across requests that should share the cache.",
            "cache_control": {
              "type": "ephemeral"
            }
          },
          {
            "type": "text",
            "text": "Ask the current question here."
          }
        ]
      }
    ],
    "max_tokens": 200
  }'

Place cache_control on the last block that should stay identical. Do not put it on the block that contains the user question, a timestamp, or any content that changes every request.

Claude caching is prefix-based across the whole request, not only the block you mark. If your request includes tools, system instructions, and messages, keep those stable and in the same order. Changing tool definitions, tool order, system text, image blocks, message order, or other early request fields can change the prefix and reduce cache hits.

When prompts contain many content blocks, a single cache breakpoint near the end may not be enough for every earlier block to be checked. Use additional stable breakpoints when you have a long prompt made of many separate blocks, while keeping the number of breakpoints small and intentional.
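
The placement rule can be enforced in code so the breakpoint never drifts onto changing content. This is a minimal sketch that assumes text-only blocks and a single breakpoint; the helper name `build_claude_messages` is hypothetical, and the block shape follows the curl example above.

```python
def build_claude_messages(stable_blocks, question):
    """Build one user message with cache_control on the last stable block only."""
    if not stable_blocks:
        raise ValueError("at least one stable block is needed to set a cache breakpoint")

    content = [{"type": "text", "text": text} for text in stable_blocks]
    # Mark the end of the reusable prefix.
    content[-1]["cache_control"] = {"type": "ephemeral"}
    # The changing question goes after the breakpoint and is never marked.
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_claude_messages(
    ["Project coding rules ...", "Repository summary ..."],
    "Which module handles retries?",
)

for block in messages[0]["content"]:
    print("cache_control" in block, block["text"][:24])
```

Only the second stable block carries cache_control here. Prompts with many blocks might need additional breakpoints, which this sketch does not attempt.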

Coding Tools and Agents

Coding tools often send a large repeated prefix: project rules, repository context, tool definitions, file summaries, and conversation history. Some tools may already manage prompt caching behind the scenes, especially when they use Claude or GPT-family models directly.

To help caching work well with Codex, Claude Code, Cursor, OpenClaw, or similar agents:

| Do | Why |
| --- | --- |
| Keep project instructions stable | Changing rules can invalidate the prefix |
| Put durable rules in files such as project docs or agent instructions | Tools can reuse them across turns |
| Avoid adding timestamps or random IDs to the beginning of prompts | Early changes break prefix matching |
| Ask follow-up questions in the same session when possible | The tool can reuse previous context more easily |
| Avoid changing tool definitions during a session | Tool definition changes often affect the cached prefix |
| Avoid editing agent config between related prompts | Model, tool, or provider changes can create a new request prefix |
| Start long sessions with a small read-only prompt | It can warm up context before heavier requests |

For example:

First prompt:
Read the project instructions and summarize the repo. Do not edit files.

Follow-up prompt:
Now update the LLM cache guide based on the same project rules.

This is usually more cache-friendly than repeatedly starting fresh sessions with different preambles.

How to Check Cache Usage

Log the usage object from the response.

OpenAI-style models may return:

{
  "usage": {
    "prompt_tokens": 3134,
    "completion_tokens": 5,
    "total_tokens": 3139,
    "prompt_tokens_details": {
      "cached_tokens": 2816
    }
  }
}

Claude models may return cache creation and cache read fields:

{
  "usage": {
    "prompt_tokens": 12127,
    "prompt_tokens_details": {
      "cached_tokens": 10549,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 10549
    }
  }
}

Some upstream responses may use slightly different field names for cache creation, such as cached_creation_tokens or cache_creation_input_tokens. When debugging, log the full usage.prompt_tokens_details object instead of checking only one field.

For a simple cache-hit signal:

cache hit tokens = usage.prompt_tokens_details.cached_tokens

For Claude-specific debugging, also check:

cache write tokens = usage.prompt_tokens_details.cache_creation_input_tokens
cache read tokens  = usage.prompt_tokens_details.cache_read_input_tokens

If a cache creation field is positive, the provider wrote a cache entry. If cache_read_input_tokens is positive on a later request, the provider read from cache.
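
The fields above can be collected into one summary per response. The sketch below is an illustration with a hypothetical helper name; it reads the field names shown in this section, falls back between the two cache-creation spellings, and treats missing fields as zero.

```python
def summarize_cache_usage(usage):
    """Collect cache-related counters from a response's usage object."""
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    cached = details.get("cached_tokens") or 0
    # Cache-write field names can differ between upstream responses.
    written = (
        details.get("cache_creation_input_tokens")
        or details.get("cached_creation_tokens")
        or 0
    )
    read = details.get("cache_read_input_tokens") or 0
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached,
        "cache_write_tokens": written,
        "cache_read_tokens": read,
        "cached_fraction": cached / prompt_tokens if prompt_tokens else 0.0,
    }


openai_usage = {
    "prompt_tokens": 3134,
    "completion_tokens": 5,
    "total_tokens": 3139,
    "prompt_tokens_details": {"cached_tokens": 2816},
}
claude_usage = {
    "prompt_tokens": 12127,
    "prompt_tokens_details": {
        "cached_tokens": 10549,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 10549,
    },
}

print(summarize_cache_usage(openai_usage))
print(summarize_cache_usage(claude_usage))
```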

Tested With WaveSpeedAI LLM

These checks were run against https://llm.wavespeed.ai/v1/chat/completions:

| Model | Test | Observed signal |
| --- | --- | --- |
| openai/gpt-5.5 | Same long prefix, same prompt_cache_key, repeated request | Later response included prompt_tokens_details.cached_tokens |
| anthropic/claude-opus-4.7 | Long content block marked with cache_control, repeated request | Responses included cache creation fields, then cache_read_input_tokens on a later identical request |

The exact amount of cached tokens can vary by model, prompt size, cache lifetime, and upstream routing. Treat cache metrics as something to monitor in production, not as a fixed value.
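
For production monitoring, a running token-weighted hit rate is often more useful than a single response. The sketch below is one possible accumulator, with a hypothetical class name. The sample numbers are illustrative: they reuse the prompt size from the OpenAI-style example and assume the first request was a cold miss.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Accumulate prompt and cached token counts across many responses."""

    requests: int = 0
    prompt_tokens: int = 0
    cached_tokens: int = 0

    def record(self, usage):
        """Add one response's usage object to the running totals."""
        details = usage.get("prompt_tokens_details") or {}
        self.requests += 1
        self.prompt_tokens += usage.get("prompt_tokens") or 0
        self.cached_tokens += details.get("cached_tokens") or 0

    @property
    def hit_rate(self):
        """Fraction of all prompt tokens that were served from cache."""
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0


stats = CacheStats()
# Assumed first request: warms the cache, so nothing is read from it yet.
stats.record({"prompt_tokens": 3134, "prompt_tokens_details": {"cached_tokens": 0}})
# Later request with the same prefix: most prompt tokens come from cache.
stats.record({"prompt_tokens": 3134, "prompt_tokens_details": {"cached_tokens": 2816}})

print(stats.requests)             # 2
print(f"{stats.hit_rate:.1%}")    # 44.9%
```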

Improve Cache Hit Rate

Use this checklist when cache hit rate is low:

  1. Put stable content before dynamic content.
  2. Keep system prompts, examples, tool definitions, and schemas byte-for-byte consistent.
  3. Keep user questions, timestamps, request IDs, and short-lived data at the end.
  4. Make the reusable prefix long enough to be worth caching.
  5. For OpenAI-style models, reuse prompt_cache_key for the same prompt family.
  6. For Claude models, place cache_control on the last stable content block.
  7. Keep tool definitions, tool order, system text, and message order stable.
  8. Send the second request after the first response starts or completes, especially when testing cache reads.
  9. Log usage.prompt_tokens_details so you can see cached tokens, cache writes, and cache reads.
  10. Test with your real prompt shape, not only a tiny prompt.
  11. If Claude shows cache creation but no cache read, wait a few seconds and retry the exact same stable prefix.

Common Reasons for Low Cache Hit Rate

| Symptom | Likely cause | What to change |
| --- | --- | --- |
| cached_tokens stays 0 | Prompt is too short or prefix keeps changing | Use a longer stable prefix and move dynamic data later |
| Same prompt text still misses | Tools, system text, images, or message order changed before the text | Compare the full serialized request, not only the visible user prompt |
| Claude writes cache every time but does not read | cache_control is placed after changing content | Move cache_control to the end of the stable block |
| Claude shows cache creation first, then no read yet | Cache may not be readable immediately or the request may route differently | Retry the exact same stable prefix after a short delay and monitor usage.prompt_tokens_details |
| Long Claude prompt has many blocks | Cache lookup may not cover every earlier block from one late breakpoint | Add stable breakpoints earlier in the long reusable prefix |
| Agent sessions do not seem to reuse context | Each task starts with different instructions or tool setup | Keep project rules and tool definitions stable |
| Cache works once, then disappears later | Upstream cache expired | Send related requests close together or use provider-supported retention options when available |
| Similar prompts miss cache | The beginning is not identical | Compare the serialized request prefix, including tools and message order |
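
Comparing two full requests by hand is error-prone, so a small helper can report the earliest component that differs. This is a debugging sketch under stated assumptions: it checks model, then tools, then messages, which is an assumed approximation of prefix order rather than a documented upstream rule, and the function name is hypothetical.

```python
import json


def first_divergence(first, second):
    """Return a label for the earliest request component that differs, or None."""
    if first.get("model") != second.get("model"):
        return "model"
    for field in ("tools", "messages"):
        a_items = first.get(field) or []
        b_items = second.get(field) or []
        for index in range(max(len(a_items), len(b_items))):
            a = a_items[index] if index < len(a_items) else None
            b = b_items[index] if index < len(b_items) else None
            # Serialize without sorting keys so key-order changes are also detected.
            if json.dumps(a, ensure_ascii=False) != json.dumps(b, ensure_ascii=False):
                return f"{field}[{index}]"
    return None


search = {"type": "function", "function": {"name": "search"}}
calc = {"type": "function", "function": {"name": "calc"}}
system = {"role": "system", "content": "Stable instructions."}

request_a = {"model": "m", "tools": [search, calc],
             "messages": [system, {"role": "user", "content": "question A"}]}
request_b = {"model": "m", "tools": [search, calc],
             "messages": [system, {"role": "user", "content": "question B"}]}
request_c = {"model": "m", "tools": [calc, search],
             "messages": [system, {"role": "user", "content": "question A"}]}

print(first_divergence(request_a, request_b))  # messages[1]
print(first_divergence(request_a, request_c))  # tools[0]
```

A divergence at the final user message is expected. A divergence in tools or in an early message points at the part of the prefix to stabilize.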

What Prompt Caching Does Not Do

Prompt caching does not store or reuse the model’s final answer. The model still generates a fresh response for each request. Caching only helps the provider reuse processing for repeated prompt prefixes.

Prompt caching also does not replace good context management. You should still trim irrelevant history, summarize old turns, and avoid sending large content that the model no longer needs.

© 2025 WaveSpeedAI. All rights reserved.