GLM-5 Inference Speed on WaveSpeed: Latency & Throughput
I wanted a fast reply while drafting an onboarding email. GLM-5 felt smart, but the first token took long enough that I caught myself staring at the cursor. Not a crisis, just a small pause that kept happening. That’s usually my cue to check what’s actually going on.
I’m Dora. I ran a simple set of tests to see how much GLM-5 inference speed I could squeeze out without turning my workflow into a science project. Nothing fancy, just enough to feel the difference in real work, not a lab.
Test methodology (hardware, settings, prompt lengths)
I tried three access paths (the client setup is sketched after the list):
- WaveSpeed: a lightweight acceleration layer I’ve been testing that handles request shaping and caching on the client/gateway side.
- Direct Z.ai API: the straight route to GLM-5 via the provider.
- OpenRouter: a broker that forwards to the model provider with some extras.
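All three routes expose an OpenAI-compatible chat endpoint in my setup, so switching between them was mostly a base-URL change. Here’s a minimal sketch of how I kept them swappable; the Z.ai path, the local gateway address, and the env var names are assumptions from my own config, so verify them against the provider docs rather than copying them verbatim.

```python
import os

from openai import OpenAI  # all three routes speak an OpenAI-compatible chat API in my setup

# Base URLs: OpenRouter's is its documented endpoint; the Z.ai path is from my
# config and should be checked against the ZhipuAI API docs; the WaveSpeed entry
# is my local gateway, so host and port are specific to my machine. The env var
# names are just the ones I happen to use.
ROUTES = {
    "zai_direct": {
        "base_url": "https://api.z.ai/api/paas/v4",
        "api_key": os.environ.get("ZAI_API_KEY", ""),
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "api_key": os.environ.get("OPENROUTER_API_KEY", ""),
    },
    "wavespeed": {
        "base_url": "http://127.0.0.1:8787/v1",  # local WaveSpeed gateway; port is arbitrary
        "api_key": os.environ.get("WAVESPEED_API_KEY", "local"),
    },
}


def make_client(route: str) -> OpenAI:
    """Point the same client at whichever route is under test."""
    cfg = ROUTES[route]
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```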
Hardware and context
- Local client: MacBook Pro (M2 Max, 64 GB RAM). Network on stable fiber (≈500 Mbps down, ≈30 ms to common US endpoints).
- Server side: no custom server, just client calls, except WaveSpeed, which runs a small local gateway to manage caching and batch shaping.
Settings I kept steady unless noted
- Model: GLM-5 (chat/completions), temperature 0.2, top_p 0.9, max_tokens 512.
- Streaming on, because that’s how I actually work.
- Retries off except for network errors.
Prompts I used
- Short: 60–80 tokens (a tight instruction with 2–3 constraints).
- Medium: ~350 tokens (email draft with brand notes and examples).
- Long: ~1,500 tokens (small brief with product context, tone notes, and do/don’t lists).
I ran 25 iterations per condition (timing loop sketched below) and recorded:
- Time to first token (TTFT): from request send to first streamed token.
- Throughput (tokens/s): streamed output rate once tokens start.
- Whether “thinking mode” (the provider’s reasoning traces) was on for the run.
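Here’s a rough sketch of that timing loop, assuming an OpenAI-compatible streaming endpoint like the ones above. The "glm-5" model ID is a placeholder (it differs per route), and I count streamed chunks as a stand-in for tokens, which is close enough to compare conditions but is not the provider’s exact accounting.

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.z.ai/api/paas/v4", api_key="...")  # route under test


def timed_run(prompt: str, model: str = "glm-5") -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,  # placeholder model ID; it differs per route
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        top_p=0.9,
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if not delta:
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()  # TTFT clock stops here
        chunks += 1  # chunk count ~ token count; fine for comparing conditions

    end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("no tokens streamed")
    ttft = first_token_at - start
    throughput = chunks / (end - first_token_at) if end > first_token_at else 0.0
    return ttft, throughput


# 25 iterations per condition, as above
results = [timed_run("Draft a two-line onboarding email reminder.") for _ in range(25)]
```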
Key metrics
Time to first token (TTFT)
Short prompts landed in the 250–400 ms range on good routes. Medium prompts pushed TTFT closer to 450–700 ms. Long prompts crossed one second unless caching kicked in. The jump wasn’t linear: it felt like queueing and validation overhead mattered as much as prompt length.
My reaction: sub-400 ms feels snappy; anything over a second pulls me out of flow. When I’m editing live copy, that first token matters more than the final throughput.
Tokens per second (throughput)
Once streaming began, I saw 35–70 tok/s on non-thinking runs. That’s comfortably fast for copy, borderline for code diffs. Throughput dipped on longer outputs, which I suspect was server-side shaping and safety passes rather than pure model speed.
Thinking mode vs non-thinking mode
With “thinking” on, TTFT jumped 30–80%, and throughput halved in some cases. The prose was more coherent on tricky prompts, but for everyday drafting I didn’t need it. This didn’t save me time at first, but after a few runs, I noticed it reduced mental effort on complex edits. On simple tasks, I leave it off.
How WaveSpeed acceleration applies to GLM-5
WaveSpeed didn’t touch the model weights. It did two simple, useful things on my side of the wire: reduce redundant work and shape requests so the provider could move faster. On GLM-5, that translated to better TTFT on repeat prompts and small gains on medium prompts.
ParaAttention and caching techniques
- ParaAttention (my notes): Instead of batching unrelated requests, WaveSpeed parallelizes attention-friendly chunks inside a single long prompt when it detects repeated scaffolding. In practice, it reused the prelude (system + shared examples) and only pushed the delta. I can’t verify internals, but the effect looked like partial KV reuse at the gateway level.
- Caching: It memoized embeddings for the prompt prelude and short system templates. When I iterated on the same task with minor edits, TTFT dropped by ~120–180 ms. On cold prompts, the benefit was smaller (~40 ms) but still noticeable.
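I can’t verify what the gateway actually does internally, so here is only a client-side approximation of that prelude idea, under the assumption that any upstream prefix or KV cache keys on a byte-for-byte identical prompt prefix: freeze the system prompt and shared examples, append only the per-task delta, and fingerprint the prelude so an “innocent” edit that silently breaks the warm path is easy to spot.

```python
import hashlib

# Frozen prelude: system prompt plus shared examples, kept byte-for-byte identical
# across runs. The assumption (not a verified internal) is that an upstream
# prefix/KV cache can only reuse work when this prefix doesn't change at all.
SYSTEM_PROMPT = "You are a concise marketing copy editor."
SHARED_EXAMPLES = [
    {"role": "user", "content": "Example brief: ..."},
    {"role": "assistant", "content": "Example rewrite: ..."},
]


def prelude_fingerprint() -> str:
    """Hash the stable prefix so accidental edits to it show up in the run log."""
    blob = SYSTEM_PROMPT + "".join(m["content"] for m in SHARED_EXAMPLES)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def build_messages(task_delta: str) -> list[dict]:
    """Stable prelude first, per-task delta last; only the tail changes run to run."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *SHARED_EXAMPLES,
        {"role": "user", "content": task_delta},
    ]


# Log this with every run; if the fingerprint changes, "warm" numbers aren't comparable.
print(prelude_fingerprint())
```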
Limits I hit
- First cold run is still bound by the upstream. No miracles.
- If I changed more than ~20% of the prompt, the cache didn’t help.
- Thinking mode largely neutralized the gains: the reasoning trace behaved like a separate stream.
Comparison: WaveSpeed vs direct Z.ai API vs OpenRouter
This is where the small differences matter.
- Direct Z.ai API: Consistent. Lowest variance in TTFT. When I needed predictable starts for demos, this was my pick. On long prompts, the TTFT penalty was real but steady.
- OpenRouter: Slightly higher TTFT on average, a bit more variance, but easy setup and routing flexibility. Good when I’m juggling models. Docs are solid: see the OpenRouter guides.
- WaveSpeed layer: Best TTFT on warm prompts and medium inputs, likely from caching and request shaping. On truly cold, long prompts, it tied with direct.
None of these changed the core GLM-5 behavior. They changed how quickly I felt the model “wake up,” and how smooth iteration felt.
If you’re deciding:
- Need steady performance and fewer moving parts? Go direct via the provider. Reference: the ZhipuAI API docs.
- Need multi-model routing or shared keys across tools? OpenRouter is fine; accept a few extra milliseconds.
- Iterating on similar prompts all day? A light acceleration layer pays back in mental calm more than raw speed.
Optimization tips for production workloads
What actually helped in practice:
- Warm the prelude: Keep system prompts and shared examples stable. Cache them, even client-side. My TTFT savings: ~100–200 ms on repeats.
- Trim the tail: Cap max_tokens to what you truly need. Cutting a 1,000-token cap to 400 shaved 10–20% off total time for my drafts.
- Prefer structured retries: If you must retry, resume streams, don’t restart them. Blind retries killed TTFT.
- Control variability: temperature ≤0.3 for production. Lower entropy reduced server passes and made throughput more stable.
- Defer thinking mode: Enable it only on prompts that historically misfire. I map “hard prompts” and flip the flag per route.
- Chunk long context upstream: Summarize references once, store the summary, and reuse it. The second and third runs feel dramatically lighter.
- Watch tokenization: Different SDKs tokenize slightly differently. Lock the client and measure again: I saw phantom regressions from this alone.
- Measure p95, not just p50: Spiky tails cause the visible lag that users remember. A quick way to pull both numbers out of a run log is sketched below.
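For that last point, a minimal sketch of how I summarize a run log with the standard library. The TTFT values would come from a timing loop like the one in the methodology section, and the sample numbers below are made up to sit in the range I saw for medium prompts.

```python
import statistics


def summarize(ttfts_ms: list[float]) -> dict:
    """p50/p95 of time-to-first-token; the tail is what users actually remember."""
    # quantiles(n=20) returns 19 cut points: index 9 is the median, index 18 is p95.
    cuts = statistics.quantiles(ttfts_ms, n=20, method="inclusive")
    return {
        "p50_ms": round(cuts[9], 1),
        "p95_ms": round(cuts[18], 1),
        "mean_ms": round(statistics.fmean(ttfts_ms), 1),
    }


print(summarize([480, 510, 530, 560, 590, 610, 640, 680, 720, 950]))
```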
Raw benchmark data (table)
Here’s the snapshot from my runs (25 iterations per row). All numbers are approximate but representative of what I felt at the keyboard.

| Condition | TTFT | Streaming throughput |
| --- | --- | --- |
| Short prompt (60–80 tok), good route, no thinking | ~250–400 ms | 35–70 tok/s |
| Medium prompt (~350 tok), no thinking | ~450–700 ms | 35–70 tok/s |
| Long prompt (~1,500 tok), cold | 1 s+ | lower on long outputs |
| Repeat prompt, warm prelude (WaveSpeed) | ~120–180 ms faster than cold | roughly unchanged |
| Thinking mode on | +30–80% over non-thinking | about half |
Notes
- “Warm prelude” means the system + shared examples were cached by the gateway.
- Throughput is measured after the first token. Total time ≈ TTFT + tokens_out / throughput (e.g., 0.5 s TTFT plus 400 tokens at 50 tok/s ≈ 8.5 s total).
- Network was steady for these runs; when I tested on hotel Wi‑Fi, everything looked 10–20% worse.
One last observation: shaving 150 ms off TTFT mattered more to me than adding 5 tok/s, because it reduced the little wait that triggers context-switching. That’s not a universal rule, just how my brain handles streams.