
How to Reduce Agent Token Costs in CLI Workflows

A practical guide to reducing agent token costs in CLI workflows using output compression, command hygiene, and context discipline.


I’m Dora. I ran a single npm test through Claude Code last month and watched the session burn through roughly 12,000 input tokens before the agent said a word back to me. The test output was 400 lines. Maybe 30 of them mattered. The rest — deprecation warnings, dependency noise, Jest’s progress dots — all of it went straight into the model’s context. I paid for every byte.

That was when I stopped treating tokens as “something the model handles” and started treating them as a budget I was actively leaking. If you’re running agentic CLI workflows on Claude Code, Gemini CLI, or anything similar, this is probably your biggest cost line — and the fix isn’t a better model, it’s better hygiene. Anthropic’s own cost management documentation puts it plainly: token costs scale with context size, and most optimization happens before the model ever sees the data. This piece is about how to reduce agent token costs in CLI workflows without losing the debugging signal you need.

Where CLI Workflows Waste Tokens

Before fixing anything, I had to figure out where the leaks were. Two patterns stood out, and they show up in almost every CLI workflow I’ve audited since.

Verbose Commands and Irrelevant Output

Terminal commands were designed for humans skimming a screen, not for LLMs reading byte by byte. git status prints ANSI codes the model doesn’t need. npm install dumps a thousand lines of dependency tree the model already knows about. next build echoes its own progress for fifteen seconds. None of it earns its keep in the context window.

The numbers are worse than they look on paper. A single cargo test run in a medium Rust project can produce 8,000–15,000 tokens of output. Most of that is compilation noise. When the agent reads it all to find one failing assertion, you’ve paid Opus-tier rates for the privilege of streaming a build log.
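If you want to check your own numbers before believing mine, you can get an order-of-magnitude estimate without any instrumentation. The usual rule of thumb is roughly four characters per token for English-ish text; real tokenizers vary by model and content, so treat this as a sketch, not billing math.

```bash
# Rough token estimate for a command's output (~4 chars per token heuristic).
# Order-of-magnitude only; actual tokenization varies by model and content.
estimate_tokens() {
  local chars
  chars=$("$@" 2>&1 | wc -c)
  echo "$(( chars / 4 )) tokens (approx) from: $*"
}

estimate_tokens cargo test
```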

This is also why community projects like rtk and tokf exist — they sit between the shell and the agent, filter out boilerplate before it hits context, and report savings in the 70–90% range on common commands. Whether you use a wrapper or not, the principle holds: raw terminal output is not LLM-ready data.

Context Carry-Over and Repeated Reads

The second leak is subtler. Each tool call the agent makes — file read, grep, bash command — sticks around in conversation history. By turn ten, the model is reprocessing nine turns of stale outputs on every request. Anthropic’s own April postmortem on Claude Code quality issues describes exactly this dynamic: a caching bug caused thinking history to compound across turns, and token usage inflated 10–20x before anyone noticed. Even without bugs, this is the default behavior. Long sessions are expensive sessions.

I checked one of my own week-old sessions. The agent had read the same package.json four times. None of those rereads added information — the file hadn’t changed. They were just artifacts of the agent not knowing what it already knew.

Step 1: Compress Noisy Outputs

The cheapest fix, by a wide margin, is preventing junk from entering context in the first place. Three rules, in this order:

Filter at the source, not after. Instead of npm test, the agent runs npm test --silent 2>&1 | grep -E "(FAIL|PASS|Error)". Instead of git status, it runs git status --short. Instead of cargo build, it runs cargo build --quiet 2>&1 | tail -20. None of this is clever. It’s just discipline. The agent gets the failing test, the modified files, the actual error — nothing else.
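Wrapped as shell functions, the pattern looks like this. The filters are the ones from the examples above; adjust the grep patterns to your test runner's actual output, since Jest, Vitest, and pytest all label failures differently.

```bash
# Source-level filtering: run the real command, keep only the signal lines.
qtest() {
  npm test --silent 2>&1 | grep -E "(FAIL|PASS|Error)"
}

qstatus() {
  git status --short
}

qbuild() {
  # Keep the tail, where cargo puts the errors that actually stopped the build.
  cargo build --quiet 2>&1 | tail -20
}
```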

Cap tool output at the harness level. Claude Code lets you set a maximum tool output size. I dropped mine to 8,000 characters per call. When a command exceeds it, the agent gets a truncation notice and decides whether to refine the query. This single setting saved me more tokens than every other change combined.
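If your harness doesn't expose an output cap, or you want the same guardrail for ad-hoc scripts, you can enforce one at the shell layer. This is a sketch of the idea rather than a replacement for the harness setting; the 8,000-character limit mirrors the one I use, and the truncation notice matters because it tells the agent to refine the query instead of assuming it saw everything.

```bash
# Cap any command's output at N characters, and say so when truncating,
# so the agent knows the picture is partial.
cap_output() {
  local limit=8000
  local out
  out=$("$@" 2>&1)
  if [ "${#out}" -gt "$limit" ]; then
    printf '%s\n[truncated: %d of %d chars shown]\n' "${out:0:$limit}" "$limit" "${#out}"
  else
    printf '%s\n' "$out"
  fi
}

cap_output npm install
```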

Use a CLI proxy when the upstream tool won’t shut up. Some commands have no quiet flag worth using — next build, webpack, anything Java-based. For these, a wrapper that strips known boilerplate is worth the setup time. Tools in the rtk/tokf family handle this generically; you can also write a 30-line bash function for the three commands that bother you most.
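A cut-down version of that bash function, for one stubborn command, might look like this. The grep patterns are illustrative; build yours from the boilerplate lines your own build actually emits.

```bash
# Proxy for a chatty build tool: strip known-noise lines before the agent sees them.
# The -v patterns below are examples; tune them against your real build output.
quiet_build() {
  next build 2>&1 \
    | grep -vE "^(info|wait|event)\s" \
    | grep -vE "Compiled (successfully|client and server)" \
    | grep -vE "^\s*$"
}
```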

There’s a real trade-off here. Aggressive compression can hide debugging signal. When a build fails for a reason the filter strips out — a deprecation warning that turned into an error, an obscure config issue buried in line 847 — the agent gets a shorter, less useful picture. I’ve had it twice. Both times the fix was loosening one filter rule, not abandoning the strategy.

Step 2: Limit Context Before It Hits the Model

Output filtering handles the new tokens entering each turn. Context discipline handles the accumulated tokens already inside the session. Different problems.

The two commands that matter, both straight from Anthropic’s Claude Code best practices, are /clear and /compact. /clear resets the session entirely — useful when switching to an unrelated task. /compact summarizes earlier history while preserving key decisions and current state — useful when the task continues but the early exploration is no longer load-bearing. Claude Code auto-compacts when approaching context limits, but waiting for that trigger is usually too late. By then you’ve already paid the inflated rate for several turns.

My current habit: I run /compact at every natural task boundary, with an instruction like /compact Focus on the failing test and the recent file edits. The instruction matters. Without it, compaction summarizes everything roughly. With it, the agent keeps the parts that matter for the next phase.

For API-based agents (not the CLI subscription), Anthropic’s context editing documentation describes a stricter mechanism: clear_tool_uses_20250919 automatically clears old tool results once context grows past a threshold. The agent retains the conversation but loses the raw outputs it already processed. For long-horizon agentic tasks, this is the right default.
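For reference, enabling it is a single field on the Messages API request. The field names below reflect my reading of the beta docs at the time of writing, and the model id is illustrative; the feature is in beta, so verify against the current reference before copying.

```bash
# Enable automatic clearing of old tool results (context editing, beta).
# Field names and beta header are from the docs at time of writing; verify before use.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: context-management-2025-06-27" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "context_management": {
      "edits": [{
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": 30000},
        "keep": {"type": "tool_uses", "value": 3}
      }]
    },
    "messages": [{"role": "user", "content": "Run the test suite and fix failures."}]
  }'
```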

One thing I’d flag: a bloated CLAUDE.md is a permanent tax. It loads every turn, every session, forever. I trimmed mine from ~280 lines to ~90. Per-turn token count dropped noticeably and the agent’s behavior didn’t change in any way I could measure.

Step 3: Redesign Agent Tooling for Lower Waste

The first two steps are tactical. This one is structural, and it’s where the durable savings live.

Design tools that emit LLM-friendly output. The community-driven CLI Spec makes this argument better than I can: commands meant for agents should support a --output flag, separate data (stdout) from diagnostics (stderr), and provide pagination instead of dumping unbounded JSON. If you’re building internal CLIs your agents will call, follow that spec. If you’re using external CLIs that don’t, wrap them.
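Here is the shape in miniature: data on stdout, diagnostics on stderr, machine-readable output behind a flag. The --output flag follows the spec's convention; everything else is my illustration.

```bash
#!/usr/bin/env bash
# Sketch of an agent-friendly command: data on stdout, diagnostics on stderr,
# machine-readable output behind --output. Details are illustrative.
set -euo pipefail

format="text"
if [ "${1:-}" = "--output" ]; then
  format="${2:-json}"
fi

echo "counting installed packages..." >&2   # progress chatter stays on stderr

count=$(find node_modules -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')

if [ "$format" = "json" ]; then
  printf '{"top_level_packages":%s}\n' "$count"   # data: clean, parseable stdout
else
  printf '%s top-level packages installed\n' "$count"
fi
```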

Prefer narrow tools over broad ones. A git_status_summary function that returns three structured fields beats letting the agent run raw git status and parse the output. Every layer of parsing the model has to do is a layer where tokens get burned on translation rather than reasoning. I converted four of my most-used commands to thin Python wrappers that return JSON. Round-trip token usage on those operations dropped by roughly 60%.
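My wrappers are Python, but the shape is the same in any language. Here is a bash sketch of the git one; the field names are my own choice, and the point is the fixed structure, not these particular keys.

```bash
# Narrow tool: three structured fields instead of raw `git status` prose.
git_status_summary() {
  local branch ahead modified
  branch=$(git rev-parse --abbrev-ref HEAD)
  ahead=$(git rev-list --count @{upstream}..HEAD 2>/dev/null || echo 0)
  modified=$(git status --porcelain | wc -l | tr -d ' ')
  printf '{"branch":"%s","ahead":%s,"modified":%s}\n' "$branch" "$ahead" "$modified"
}
```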

Use subagents for read-heavy work. Claude Code’s subagent feature runs a separate context for tasks like “scan the repo and summarize the auth flow.” The findings come back as a compact summary — not the 40 files the subagent actually read. The main conversation never sees the raw data. For exploration-heavy tasks this is the single biggest structural win available.
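Subagents are defined as markdown files with a frontmatter header under .claude/agents/. This sketch of a read-heavy scout is written from my reading of the docs; check the subagent reference for the current frontmatter fields before relying on it.

```bash
# Define a read-only exploration subagent. Its 40-file reads stay in its own
# context; only the summary returns to the main conversation.
mkdir -p .claude/agents
cat > .claude/agents/repo-scout.md <<'EOF'
---
name: repo-scout
description: Scans the repository and returns a compact summary of how a feature or flow is implemented. Use for read-heavy exploration.
tools: Read, Grep, Glob
---
You explore the codebase and report back. Read whatever you need, but your
reply must be a short structured summary: key files, entry points, and how
the pieces connect. Never paste large file contents into your answer.
EOF
```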

Match the model to the work. Opus 4.7 is impressive and expensive. Most CLI work — file edits, test fixes, routine refactors — runs fine on Sonnet, at roughly 40% of Opus’s per-token cost. Worth knowing: Opus 4.7’s new tokenizer can produce up to 35% more tokens for identical text compared to earlier models, which compounds the cost gap.
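You can pin this per session rather than relying on defaults. In Claude Code, the /model command switches mid-session, and the ANTHROPIC_MODEL environment variable sets the default at launch. The model id below is illustrative; use whatever identifiers your account exposes.

```bash
# Default routine sessions to Sonnet; reach for Opus deliberately, not by habit.
export ANTHROPIC_MODEL="claude-sonnet-4-5"   # model id is illustrative
claude
```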

The honest caveat: measure before you optimize, then measure after. I set a baseline with /cost (API) or /usage (subscription) for a week before changing anything, then re-measured after each change. Two of my “optimizations” turned out to do nothing measurable. Without a baseline, you’re guessing.

FAQ

Why do terminal workflows consume so many tokens?

Because terminal output was designed for humans, and agents pay byte by byte. A typical build command emits thousands of lines of progress, warnings, and boilerplate the model doesn’t need. Combine that with conversation history that never resets and tool results that accumulate across turns, and you get sessions that burn through context budgets before the actual work starts.

How much can output compression help?

In my measurements, command-level filtering plus output caps cut per-turn input tokens by 40–60% on test runs, builds, and git operations. Community wrappers like rtk report 80–90% reductions on specific commands, though those numbers assume worst-case verbose output. Realistic gains depend on which commands your agent runs most. Audit the top five, fix those, and most of the savings show up immediately.

What should teams optimize first?

In this order: tool output caps, /clear and /compact discipline, model selection. Output caps are a one-time configuration change with zero ongoing cost. Session hygiene is a habit, but it’s free once you have it. Model selection is the easiest win to overlook — running everything on Opus when most tasks run fine on Sonnet is a quiet, large leak.

When does token optimization hurt debugging quality?

When you compress past the point where the agent can see what broke. A truncated stack trace, a filtered-out deprecation warning, a --quiet flag that hides the real error — all of these have cost me real time. The pattern I follow: compress aggressively on routine commands (git status, npm install, successful test runs), keep verbose output for known-failing or unfamiliar operations. If you find yourself re-running a command without filters to debug, the filter was wrong, not the strategy.
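One way to encode that pattern so you never have to remember it: filter on success, fall back to full output on failure. A minimal sketch:

```bash
# Compress routine output, but show everything when the command fails.
# Green runs get the short version; red runs keep the whole trace.
run_filtered() {
  local filter="$1"; shift
  local out status
  out=$("$@" 2>&1); status=$?
  if [ "$status" -eq 0 ]; then
    printf '%s\n' "$out" | grep -E "$filter" || true
  else
    printf '%s\n' "$out"
  fi
  return "$status"
}

run_filtered "(FAIL|PASS|Error)" npm test
```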

Conclusion

Token costs in CLI workflows aren’t a model problem. They’re a plumbing problem. Most of the spend disappears into the gap between what terminal commands emit and what the model actually needs to reason about — and that gap is fixable with output filtering, context discipline, and tooling that respects the agent on the other end.

I’ve been running the setup above for about six weeks. Daily token consumption on Claude Code is down roughly 55%, agent latency improved as a side effect of smaller contexts, and the workflow feels less noisy to debug. None of those numbers are universal — your baseline and your top-five commands will look different. But the pattern holds: control what enters context, control what stays in context, and let the model spend its budget on reasoning instead of reading build logs.

That’s where my data ends. The compression layer keeps evolving, and Anthropic’s tokenizer changes mean these numbers have a shelf life. Worth re-baselining every quarter.
