MiniMax M2.7: The Self-Evolving AI Model That Rivals Claude and GPT at a Fraction of the Cost
What happens when you let an AI model participate in its own evolution? MiniMax just answered that question with M2.7 — a next-generation flagship text model that doesn’t just execute tasks, but actively improves itself through real-world interaction. Built on the OpenClaw (Agent Harness) framework, M2.7 autonomously ran over 100 rounds of scaffold optimization during training, achieving a 30% performance improvement on internal evaluations — without human intervention.
The result is a model that matches or approaches Claude Opus 4.6 and GPT-5 on the hardest coding and agent benchmarks, runs 3x faster, and costs a fraction of the price. Here’s everything you need to know.
What Makes M2.7 Different: Self-Improvement
Most AI models are trained, evaluated, and deployed as static artifacts. M2.7 breaks that pattern. It is MiniMax's first model that deeply participates in its own evolution: updating its own memory, building its own training skills, and improving its own learning process.
During development, M2.7 autonomously:
- Executed 100+ iteration cycles optimizing its own scaffold performance
- Managed 30–50% of reinforcement learning research workflows independently
- Participated in 22 ML competitions, achieving 9 gold medals across its best trials
- Achieved a 66.6% medal rate on MLE-Bench Lite, tying with Google’s Gemini 3.1
This isn’t just a training technique — it’s a signal of where AI development is headed. Models that can evaluate and improve their own performance represent a fundamentally different paradigm from static train-and-deploy cycles.
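To make this concrete, here is a minimal sketch of what a propose-evaluate-keep scaffold-optimization loop looks like in principle. OpenClaw's internals are not public, so every name below (evaluate, propose_edit, tool_budget) is a placeholder, not the actual framework API.

```python
import random

def evaluate(scaffold: dict) -> float:
    """Stand-in for running an internal eval suite; returns a score in [0, 1]."""
    return random.random()  # replace with a real benchmark harness

def propose_edit(scaffold: dict) -> dict:
    """Stand-in for the model proposing a change to its own scaffold."""
    edited = dict(scaffold)
    edited["tool_budget"] = max(1, edited["tool_budget"] + random.choice([-1, 1]))
    return edited

scaffold = {"prompt_template": "v1", "tool_budget": 8, "max_retries": 2}
best_score = evaluate(scaffold)

# 100+ rounds of propose -> evaluate -> keep-if-better, as described above
for _ in range(100):
    candidate = propose_edit(scaffold)
    score = evaluate(candidate)
    if score > best_score:
        scaffold, best_score = candidate, score
```

The interesting part is not the hill-climbing itself but who runs it: in M2.7's case, the model both proposes the edits and judges the results.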
Benchmark Performance: Punching Way Above Its Weight
M2.7 activates only 10 billion parameters — making it the smallest model in the Tier-1 performance class. Despite this efficiency, it competes head-to-head with models orders of magnitude larger.
Software Engineering
| Benchmark | M2.7 | Claude Opus 4.6 | GPT-5.3 Codex |
|---|---|---|---|
| SWE-Pro | 56.22% | ~57% | 56.2% |
| SWE-bench Verified | 78% | 55% | — |
| VIBE-Pro (end-to-end delivery) | 55.6% | — | — |
| Terminal Bench 2 | 57.0% | — | — |
M2.7 nearly matches Opus on SWE-Pro and significantly outperforms it on SWE-bench Verified (78% vs 55%). On VIBE-Pro — which measures end-to-end project delivery rather than isolated patches — M2.7 scores 55.6%, demonstrating real-world engineering capability beyond benchmark-specific optimization.
Professional Productivity
| Benchmark | M2.7 | Notes |
|---|---|---|
| GDPval-AA (Office tasks) | ELO 1495 | Highest among open-source models |
| Skill Adherence (40+ complex tasks) | 97% | — |
| MM Claw (Agent evaluation) | 62.7% | Approaching Sonnet 4.6 |
M2.7’s ELO score of 1495 on GDPval-AA — which evaluates real-world office productivity tasks across Excel, PowerPoint, Word, and complex document editing — is the highest among all open-source models. The 97% skill adherence rate across 40+ complex tasks (each exceeding 2,000 tokens) demonstrates reliable execution on the kind of intricate, multi-step workflows that trip up most models.
Machine Learning Research
| Benchmark | M2.7 | Gemini 3.1 | GPT-5.4 |
|---|---|---|---|
| MLE-Bench Lite (medal rate) | 66.6% | 66.6% | 71.2% |
M2.7 ties with Google’s Gemini 3.1 and approaches GPT-5.4’s state-of-the-art on machine learning competition benchmarks — a remarkable result for a model with only 10B activated parameters.
Speed and Pricing: The Real Disruption
Raw benchmark scores tell one story. Cost-adjusted performance tells a completely different one.
| Metric | M2.7 | Claude Opus 4.6 | GPT-5 |
|---|---|---|---|
| Speed | 100 TPS | ~33 TPS | ~40 TPS |
| Input cost | $0.30/M tokens | $15/M tokens | $10/M tokens |
| Output cost | $1.20/M tokens | $75/M tokens | $30/M tokens |
| Blended cost (with cache) | $0.06/M tokens | — | — |
| Activated parameters | 10B | — | — |
M2.7 is 50x cheaper than Opus on input and 60x cheaper on output — while matching it on SWE-Pro. At 100 tokens per second, it’s also 3x faster. With automatic cache optimization, the effective blended cost drops to just $0.06 per million tokens.
For teams running high-volume agent workloads, coding assistants, or document processing pipelines, this cost structure changes the economics of what’s feasible.
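To see what the rate table means in practice, here is a back-of-envelope sketch pricing a hypothetical workload of 10 million input and 2 million output tokens at the list rates above (cache discounts excluded):

```python
# Rates from the table above, in $ per million tokens
PRICES = {
    "MiniMax M2.7":    {"input": 0.30,  "output": 1.20},
    "Claude Opus 4.6": {"input": 15.00, "output": 75.00},
    "GPT-5":           {"input": 10.00, "output": 30.00},
}

input_m, output_m = 10, 2  # millions of tokens in the hypothetical workload
for model, p in PRICES.items():
    total = input_m * p["input"] + output_m * p["output"]
    print(f"{model}: ${total:,.2f}")

# MiniMax M2.7: $5.40
# Claude Opus 4.6: $300.00
# GPT-5: $160.00
```

The same workload that costs $300 on Opus runs for under $6 on M2.7, before cache optimization brings it down further.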
Core Capabilities
Agent-Centric Workflows
M2.7 is built from the ground up for agentic use cases. The OpenClaw framework enables:
- Continuous self-improvement in real-world environments
- Multi-agent collaboration, with native support for role boundaries, adversarial reasoning, and protocol adherence (see the sketch after this list)
- Active participation in execution and decision-making rather than passive response generation
- Complex environment interaction with 97% skill adherence on intricate multi-step tasks
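As referenced in the list above, here is a minimal sketch of what a role-bounded multi-agent loop can look like. This is not the OpenClaw API; call_model is a stand-in for whatever chat-completions client you use, and the role prompts are illustrative.

```python
def call_model(system: str, messages: list[dict]) -> str:
    """Stand-in for a real chat-completions call; returns a canned reply here."""
    return f"[{system.split('.')[0]}] responding to: {messages[-1]['content'][:40]}"

PROPOSER = "You write code patches. You never review your own work."
CRITIC = "You review patches adversarially. You never write new code."

history: list[dict] = [{"role": "user", "content": "Fix the failing date parser."}]
for turn in range(4):
    # Alternate system prompts so each turn stays inside a hard role boundary
    system = PROPOSER if turn % 2 == 0 else CRITIC
    reply = call_model(system, history)
    history.append({"role": "assistant", "content": reply})
```

The point of the pattern is that role boundaries and turn protocol live in the harness, so the model's adversarial reasoning happens within explicit constraints.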
Software Engineering
Beyond benchmarks, M2.7 handles real-world engineering workflows:
- End-to-end project delivery (not just isolated code patches)
- Log analysis and debugging
- Code security review
- Machine learning pipeline development
Office Suite Excellence
Enhanced capabilities for professional productivity:
- Complex Excel operations and formula generation
- PowerPoint creation and editing
- Word document manipulation
- Multi-turn modification support: iterate on documents through conversation, as sketched below
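The multi-turn flow maps naturally onto a standard chat loop. This example assumes an OpenAI-compatible endpoint; the base URL and model identifier are placeholders, so check MiniMax's API docs for the real values.

```python
from openai import OpenAI

# Placeholder base URL and key; consult the provider's docs for real values
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

messages = [{"role": "user", "content": "Draft a one-page project status memo in Markdown."}]
draft = client.chat.completions.create(model="MiniMax-M2.7", messages=messages)
messages.append({"role": "assistant", "content": draft.choices[0].message.content})

# Multi-turn modification: refine the same document in a follow-up turn
messages.append({"role": "user", "content": "Tighten the risks section to three bullets."})
revised = client.chat.completions.create(model="MiniMax-M2.7", messages=messages)
print(revised.choices[0].message.content)
```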
Character and Emotional Intelligence
M2.7 includes enhanced identity preservation and emotional intelligence capabilities, providing a foundation for interactive entertainment, roleplay, and character-driven applications.
Two API Variants
| Variant | Speed | Quality | Use Case |
|---|---|---|---|
| M2.7 | Standard | Full quality | Production, complex tasks |
| M2.7-highspeed | Faster | Identical results | High-throughput, latency-sensitive |
Both variants produce identical results — the highspeed variant simply processes faster for latency-sensitive applications.
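In practice, switching between the two variants should be a one-line change. Here is a hedged sketch using the same OpenAI-compatible client as above; the model identifiers (MiniMax-M2.7, MiniMax-M2.7-highspeed) are assumptions, so verify them against the provider's model list.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # placeholders

def complete(prompt: str, latency_sensitive: bool = False) -> str:
    # Identical quality per the table above; only throughput differs
    model_id = "MiniMax-M2.7-highspeed" if latency_sensitive else "MiniMax-M2.7"
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```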
Developer Tool Compatibility
M2.7 integrates with the tools developers already use:
- AI Coding: Claude Code, Cursor, Cline, Codex CLI, Roo Code, Kilo Code
- Agents: OpenCode, Droid, TRAE, Grok CLI
- Platforms: MiniMax Agent, MiniMax API Platform
OpenRoom: Interactive Agent Demo
MiniMax also open-sourced OpenRoom — an interactive agent demonstration that moves AI interaction beyond plain text into graphical environments. Most of the code was AI-generated, demonstrating M2.7’s practical coding capabilities.
- Repository: github.com/MiniMax-AI/OpenRoom
- Live Demo: openroom.ai
M2.7 vs the Competition: Who Should Use What
| If you need… | Best choice |
|---|---|
| Maximum benchmark ceiling regardless of cost | Claude Opus 4.6 |
| Best cost-adjusted coding performance | MiniMax M2.7 |
| Fastest inference speed | MiniMax M2.7 (100 TPS) |
| High-volume agent workloads | MiniMax M2.7 (50x cheaper) |
| Office productivity automation | MiniMax M2.7 (highest GDPval-AA ELO) |
| Established ecosystem and integrations | Claude or GPT |
| Self-improving agent capabilities | MiniMax M2.7 (OpenClaw) |
Try M2.7 on WaveSpeedAI
WaveSpeedAI provides access to MiniMax M2.7 alongside hundreds of other AI models through a unified platform. Whether you’re building coding agents, document processing pipelines, or interactive applications, M2.7’s combination of Tier-1 performance and fraction-of-the-cost pricing makes it the most efficient choice for production workloads.
Try MiniMax M2.7 on WaveSpeedAI →
No subscriptions. No cold starts. Pay only for what you use.
The Bottom Line
MiniMax M2.7 isn’t just another model release — it’s a proof of concept for self-evolving AI. A model with only 10B activated parameters matching Opus and GPT-5 on the hardest engineering benchmarks, while running 3x faster at 50x lower cost, represents exactly the kind of disruption that reshapes how teams build with AI.
The question isn’t whether M2.7 is good enough. It’s whether you can justify paying 50x more for marginal gains.