Helios: A Real-Time Long Video Generation Model That Skips Every Shortcut
I keep a mental list of things I assume video generation models need: KV-cache for speed, sparse attention for memory, keyframe sampling to stop drift. Helios from PKU-YuanGroup throws out all of them — and still hits 19.5 FPS on a single H100. That contradiction is what made me stop scrolling.
I’m Dora. I spent the past couple of days reading through the Helios paper and repo, running what I could locally, and trying to understand why this approach works when the conventional wisdom says it shouldn’t. This isn’t a benchmark review. It’s more like a set of notes from someone who’s been burned enough times by “revolutionary” claims to want receipts.
What Helios Actually Is
Helios is an autoregressive video generation model that produces 33 frames per chunk, chaining chunks together to create minute-scale videos — up to 1,452 frames at 24 FPS, which works out to roughly 60 seconds of continuous footage.
That alone isn’t shocking. What’s unusual is the list of things it doesn’t use:
- No KV-cache
- No causal masking
- No sparse or linear attention
- No TinyVAE
- No progressive noise schedules
- No quantization
- No self-forcing, error-banks, or keyframe sampling (the standard anti-drifting toolkit)
Reading that list felt like someone describing a car that runs without an engine. Every one of those techniques exists because video generation is expensive, memory-hungry, and prone to quality degradation over long sequences. Helios sidesteps all of them and still manages real-time inference. The question isn’t whether it works — the demos are out there — but how.
The Three-Stage Training Pipeline
Helios ships three model variants, each corresponding to a training stage. Understanding the stages helps explain the design logic.
Stage 1: Helios-Base
The foundation. This is where the core architectural innovations land:
- Unified History Injection — the model conditions on previous chunks without the usual error-accumulation penalties
- Easy Anti-Drifting — a training-time strategy that replaces the inference-time hacks (self-forcing, error-banks) most autoregressive video models rely on
- Multi-Term Memory Patchification — a memory-efficient approach to handling long temporal context
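To make the chaining concrete, here is a toy sketch of what chunked autoregressive generation looks like in the abstract. The function names and shapes are my own invention, not Helios's API; the real model conditions on history through Unified History Injection rather than passing a raw previous-chunk tensor around.

```python
import numpy as np

def generate_long_video(model, n_chunks, frames_per_chunk=33, history=None):
    """Toy sketch of chunked autoregressive generation (not Helios's real API).

    Each chunk is generated conditioned on the previous chunk's output --
    the basic loop shape that Unified History Injection refines.
    """
    chunks = []
    for _ in range(n_chunks):
        chunk = model(history)      # denoise one 33-frame chunk
        chunks.append(chunk)
        history = chunk             # condition the next chunk on this one
    return np.concatenate(chunks, axis=0)  # (n_chunks * 33, H, W, C)

# Dummy "model": ignores its conditioning, returns a fixed-shape chunk.
dummy = lambda history: np.zeros((33, 8, 8, 3))
video = generate_long_video(dummy, n_chunks=4)  # video.shape == (132, 8, 8, 3)
```

The point of the sketch is the failure mode: whatever error lands in `chunk` is fed back in as `history`, which is exactly why anti-drifting machinery exists.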
Helios-Base uses v-prediction with standard classifier-free guidance. It produces the highest raw quality of the three variants, but it’s also the heaviest at inference time — 50 diffusion steps per chunk.
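For readers who haven't met v-prediction: instead of predicting the noise ε or the clean sample x0 directly, the model predicts v = α·ε − σ·x0, from which x0 is recoverable in closed form. A minimal numpy check of that identity under a standard variance-preserving schedule (α² + σ² = 1) — nothing here is Helios-specific:

```python
import numpy as np

np.random.seed(0)

# Variance-preserving schedule: alpha^2 + sigma^2 = 1 at every timestep.
t = 0.3
alpha, sigma = np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)

x0 = np.random.randn(4)           # clean sample
eps = np.random.randn(4)          # noise
x_t = alpha * x0 + sigma * eps    # noised sample the model sees

v = alpha * eps - sigma * x0      # what a v-prediction model is trained to output
x0_rec = alpha * x_t - sigma * v  # recover x0 from the v prediction
```

Expanding the last line gives (α² + σ²)·x0 = x0, so the recovery is exact, which is what lets Stage 3 later switch targets to x0-prediction without changing what the model fundamentally has to know.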
Stage 2: Helios-Mid
An intermediate checkpoint that introduces Pyramid Unified Predictor Corrector for token compression. This is where the model starts trading marginal quality for meaningful speed gains. It uses CFG-Zero*, which eliminates the need for unconditional model evaluations during inference.
If you’ve worked with diffusion models, you know that CFG typically doubles your compute because you run the model twice per step — once with the prompt, once without. Removing that requirement is a significant efficiency gain.
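Concretely, standard CFG combines the two forward passes like this — a generic sketch of the usual formulation, not Helios's code:

```python
import numpy as np

def cfg(pred_uncond, pred_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. Note it needs BOTH predictions,
    i.e. two model evaluations per diffusion step."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

u = np.array([0.0, 0.0])   # stand-in for the unconditional prediction
c = np.array([1.0, 2.0])   # stand-in for the prompt-conditioned prediction

neutral = cfg(u, c, 1.0)   # scale 1.0 just returns the conditional prediction
guided = cfg(u, c, 7.5)    # typical scales push well past it
```

Anything that lets you skip the unconditional pass halves the per-step cost, which is why the CFG-Zero* change in Helios-Mid matters for throughput.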
Stage 3: Helios-Distilled
The final variant uses Adversarial Hierarchical Distillation to collapse 50 diffusion steps down to 3. It switches from v-prediction to x0-prediction with a custom scheduler (HeliosDMDScheduler) and drops the CFG requirement entirely.
This is the variant that hits 19.5 FPS. Three steps, no CFG, no acceleration tricks — just a model that’s been trained to get it right the first time.
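A quick back-of-envelope on why 3 steps is the number that makes real time possible. This is my arithmetic, assuming denoising passes dominate per-chunk cost:

```python
frames_per_chunk = 33
target_fps = 19.5

# To sustain 19.5 FPS, each 33-frame chunk must be fully produced in:
chunk_budget_s = frames_per_chunk / target_fps   # ~1.69 s per chunk

# At 3 diffusion steps with no CFG, that leaves one forward pass every:
per_step_budget_s = chunk_budget_s / 3           # ~0.56 s per pass

# Helios-Base by comparison: 50 steps, each doubled by CFG.
base_passes = 50 * 2
distilled_passes = 3 * 1
eval_ratio = base_passes / distilled_passes      # ~33x fewer model evaluations
```

Half a second per forward pass on an H100 is a demanding but plausible budget for a large video DiT; fifty-plus passes per chunk clearly is not.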
Why the “No Shortcuts” Approach Matters
Most acceleration work in video generation is additive. You build a model, it’s too slow, so you bolt on KV-cache. Still too much memory, so you add sparse attention. Quality drifts on long sequences, so you add keyframe sampling. Each fix introduces its own failure modes and complexity.
Helios takes the opposite path: make the base model efficient enough that you don’t need the bolt-ons. The training pipeline is doing the heavy lifting that inference-time tricks usually handle.
There’s a practical consequence here that’s easy to miss. Fewer moving parts means fewer things to break. If you’ve ever debugged a KV-cache corruption issue or watched sparse attention create artifacts at specific frame boundaries, you know the tax those systems impose. Helios doesn’t pay that tax.
The memory story is equally striking. The paper claims they can fit four 14B-parameter models within 80 GB of GPU memory during training, using image-diffusion-scale batch sizes. That’s an aggressive compression of what’s usually a sprawling resource footprint.
What It Can Do
Helios supports four generation modes across all three variants:
- Text-to-Video — prompt in, video out
- Image-to-Video — first frame plus prompt
- Video-to-Video — style transfer, re-timing, modification
- Interactive mode — iterative refinement
The frame math is specific: you work in multiples of 33 frames per chunk. Want roughly 30 seconds? That’s 22 chunks = 726 frames. A full minute? 44 chunks = 1,452 frames. The chunk boundary is where autoregressive handoffs happen, and from the demos I’ve seen, the seams are remarkably clean.
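The chunk arithmetic is simple enough to sketch. This helper is mine, not part of the repo:

```python
import math

FRAMES_PER_CHUNK = 33
FPS = 24

def plan(seconds):
    """Round a target duration up to a whole number of 33-frame chunks."""
    chunks = math.ceil(seconds * FPS / FRAMES_PER_CHUNK)
    frames = chunks * FRAMES_PER_CHUNK
    return chunks, frames, frames / FPS

# ~30 s -> 22 chunks = 726 frames (30.25 s of footage)
half_minute = plan(30)   # (22, 726, 30.25)

# A full minute -> 44 chunks = 1,452 frames (60.5 s)
full_minute = plan(60)   # (44, 1452, 60.5)
```

Note that durations always round up to the next chunk boundary, so "60 seconds" actually lands at 60.5 seconds of footage.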
That last point deserves emphasis. Autoregressive video models usually show their worst behavior at chunk boundaries — motion stutters, color shifts, object drift. The “Easy Anti-Drifting” training strategy seems to genuinely address this, though I’d want to see more diverse test cases before declaring the problem solved.
Integration and Ecosystem
Helios already supports multiple inference backends:
- Hugging Face Diffusers — ModularPipeline integration
- vLLM-Omni — disaggregated serving with stage-based graph architecture
- SGLang-Diffusion — unified pipeline with optimized kernels
- Ascend NPU — Day-0 hardware support (~10 FPS on Ascend)
The Diffusers integration is the most accessible. The vLLM-Omni path is interesting for production deployments where you want to separate prefill and decode stages across different hardware. SGLang-Diffusion feels like the forward-looking option — it’s designed for the kind of batched, pipelined serving that makes real-time applications feasible.
The Ascend NPU support is a strategic signal. Day-0 support for non-NVIDIA hardware suggests this wasn’t an afterthought. At ~10 FPS on Ascend, it’s slower than the H100 path but still usable for many applications.
HeliosBench
The team built their own benchmark — HeliosBench — specifically designed for evaluating real-time long-video generation. This is worth noting because most existing video benchmarks focus on short clips (4–16 seconds) and don’t capture the failure modes that emerge at minute-scale lengths: temporal drift, motion degradation, object persistence failures.
Having a purpose-built benchmark doesn’t guarantee objectivity, but it does mean they’re at least measuring the right things. I’d like to see independent evaluations using HeliosBench to validate the methodology.
What I’m Still Thinking About
Quality at the extremes. The 33-frame chunk design is elegant, but 44 consecutive autoregressive steps is a lot of opportunities for accumulated error. The demos look clean, but demos always look clean. I want to see adversarial prompts — complex camera motion, many interacting objects, dramatic lighting changes across a full minute.
The distillation trade-off. Going from 50 steps to 3 is aggressive. Distilled models generally sacrifice diversity and fine detail for speed. The Helios-Base variant exists for a reason: when quality matters more than speed, you're paying roughly 17x the diffusion steps, and more than that in model evaluations once Base's CFG doubling is counted. That's a wide gap between the two operating points.
Ecosystem maturity. The model is open-source (Apache 2.0), which is great. But open-source video models need community tooling to become practical — ComfyUI nodes, training scripts for fine-tuning, LoRA support. That ecosystem takes time to develop, and right now Helios is brand new.
Hardware requirements. Real-time on an H100 is impressive. But H100s aren’t sitting idle on most people’s desks. The more relevant question for many users is: what’s the experience on a 4090? On an A100? The paper is clear about H100 and Ascend performance — less clear about the long tail of hardware.
Why This Stands Out
I’ve watched a lot of video generation announcements over the past year. Most of them are incremental: better FID scores, slightly longer clips, marginally faster inference. Helios feels different because it challenges an assumption I didn’t realize I’d internalized — that real-time long video generation requires a tower of inference optimizations stacked on top of each other.
The answer Helios proposes is: what if you just train the model better? Push the complexity into the training pipeline, not the inference stack. Make the model inherently efficient rather than bolting efficiency on after the fact.
Whether that approach scales, generalizes, and survives contact with production workloads is an open question. But the direction is compelling. Fewer moving parts, cleaner architecture, and performance numbers that speak for themselves.
The code and weights are on GitHub. Apache 2.0. If you have an H100 and an afternoon, it’s worth a look.