What Is Google Gemma 4? Architecture, Benchmarks, and Why It Matters

Google Gemma 4 is Google DeepMind's most capable open model family yet, shipping in four sizes under the Apache 2.0 license with multimodal input, native reasoning, and on-device deployment down to a Raspberry Pi.

6 min read

On April 2, 2026, Google DeepMind released Gemma 4 — four open-weight models built from the same research behind Gemini 3, now available under the Apache 2.0 license. That license change alone makes this a significant moment for the open model ecosystem: no MAU caps, no acceptable-use restrictions, full commercial freedom.

But the models themselves are the real story. Let’s break down what shipped, how they perform, and who should care.

The Gemma 4 Model Family

Gemma 4 comes in four sizes, each available as both base and instruction-tuned variants:

| Model | Active Params | Total Params | Context | Modalities |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | Text, image, audio |
| E4B | 4.5B | 8B | 128K | Text, image, audio |
| 26B-A4B (MoE) | 3.8B | 25.2B | 256K | Text, image, video |
| 31B (Dense) | 30.7B | 30.7B | 256K | Text, image, video |

The “E” prefix stands for effective parameters — E2B and E4B use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer. The result is that a 2.3B-active model carries the representational depth of the full 5.1B parameter count while fitting in under 1.5 GB of memory with quantization.
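The PLE idea can be sketched in a few lines of numpy. This is an illustrative toy, not the actual Gemma 4 implementation — all dimensions, tables, and the identity "decoder layer" are made-up assumptions; the point is only where the secondary embedding signal enters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ple, n_layers = 1000, 64, 16, 4  # toy sizes

# Standard token embedding table, plus a small per-layer table:
# each token gets an extra d_ple-dim vector for every decoder layer.
tok_emb = rng.normal(size=(vocab, d_model))
ple_emb = rng.normal(size=(n_layers, vocab, d_ple))      # per-layer lookup
ple_proj = rng.normal(size=(n_layers, d_ple, d_model))   # up-projection

def forward(token_ids):
    h = tok_emb[token_ids]                    # (seq, d_model)
    for layer in range(n_layers):
        # The real decoder layer (attention + MLP) is omitted here;
        # we only show where the layer-specific signal is injected.
        extra = ple_emb[layer, token_ids] @ ple_proj[layer]
        h = h + extra
    return h

out = forward(np.array([1, 2, 3]))
print(out.shape)  # (3, 64)
```

Because the per-layer tables are plain lookups, they can live in cheap storage and be streamed in on demand, which is how the active-parameter count stays small.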

The 26B-A4B variant is a Mixture-of-Experts model with 128 small experts; each token activates 8 of them plus 1 always-on shared expert. Only 3.8B parameters fire per forward pass, so it achieves roughly 97% of the dense 31B model’s quality at a fraction of the compute.
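Top-k routing with a shared expert can be sketched as follows — a toy numpy version using the 128-expert/top-8 counts from above, with made-up dimensions and single-matrix "experts" standing in for real FFN blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 32, 128, 8   # expert counts from the article; d is a toy size

router_w = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d))   # one tiny "FFN" per expert
shared = rng.normal(size=(d, d))               # always-on shared expert

def moe_layer(x):                  # x: (d,) hidden state for one token
    logits = x @ router_w
    idx = np.argsort(logits)[-top_k:]                        # pick top-8 experts
    gates = np.exp(logits[idx]) / np.exp(logits[idx]).sum()  # softmax over the chosen 8
    out = x @ shared                                         # shared expert always fires
    for g, e in zip(gates, idx):
        out = out + g * (x @ experts[e])
    return out, idx

y, chosen = moe_layer(rng.normal(size=d))
print(len(chosen))  # 8 routed experts fired, plus the shared one
```

Only the selected experts' weights participate in the forward pass, which is why the active-parameter count stays near 3.8B while total capacity is 25.2B.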

Architecture Highlights

Gemma 4 introduces several design choices worth noting:

Alternating attention. Layers alternate between local sliding-window attention (512–1024 tokens) and global full-context attention. This balances efficiency with long-range understanding.
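The alternating pattern is easiest to see as attention masks. A minimal sketch — the window size, sequence length, and every-other-layer pattern here are arbitrary toy values, not Gemma 4's actual configuration:

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=4, local_every=2):
    """Causal mask; hypothetical schedule where every `local_every`-th
    layer is global full-context and the rest use a sliding window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if layer_idx % local_every == local_every - 1:
        return causal                     # global layer: full causal attention
    return causal & (i - j < window)      # local layer: only the last `window` tokens

local = attention_mask(8, layer_idx=0)    # sliding-window layer
glob = attention_mask(8, layer_idx=1)     # global layer
print(local[7].sum(), glob[7].sum())      # last token sees 4 tokens vs all 8
```

Local layers keep attention cost linear in sequence length; the interleaved global layers are what carry long-range dependencies.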

Dual RoPE. Standard rotary position embeddings for sliding-window layers, proportional RoPE for global layers — enabling the 256K context window on the larger models without the usual quality degradation at long distances.

Shared KV cache. The last N layers reuse key/value tensors from earlier layers, reducing both memory and compute during inference.
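The memory saving is easy to see in a toy sketch. Here a hypothetical 6-layer model computes fresh K/V for only the first 4 layers and lets the rest reuse the last fresh pair — the layer counts and shapes are illustrative, not Gemma 4's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_kv_layers, seq, d = 6, 4, 5, 8   # last 2 layers reuse layer 3's K/V

kv_cache = {}
for layer in range(n_layers):
    src = min(layer, n_kv_layers - 1)        # later layers map to the last fresh K/V
    if src not in kv_cache:
        # Stand-in for the real K/V projections of this layer.
        kv_cache[src] = (rng.normal(size=(seq, d)), rng.normal(size=(seq, d)))
    k, v = kv_cache[src]                     # attention would consume k, v here

print(len(kv_cache))  # 4 K/V pairs stored for 6 layers
```

At long context lengths the KV cache dominates inference memory, so dropping even a couple of layers' worth of K/V tensors is a meaningful saving.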

Vision encoder. A learned 2D position encoder with multidimensional RoPE that preserves original aspect ratios. Token budgets are configurable (70 to 1,120 tokens per image), so you can trade off detail for speed.

Audio encoder. A USM-style conformer (same architecture as Gemma-3n) that handles speech recognition and translation natively, with up to 30 seconds of audio input on the smaller models.

Benchmarks

The numbers are a generational leap over Gemma 3:

Reasoning and Knowledge

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% |

For context, Gemma 3’s BigBench Extra Hard score was 19.3%. The 31B hits 74.4%.

Coding

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 |

Vision

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |

On LMArena’s text-only leaderboard, the 31B ranks #3 globally among open models with an ELO of ~1452.

Multimodal and Agentic Capabilities

Every Gemma 4 model supports multimodal input out of the box:

  • Image understanding with variable aspect ratio and resolution
  • Video comprehension up to 60 seconds at 1 fps (26B and 31B)
  • Audio input for speech recognition and translation (E2B and E4B)

On the agentic side, Gemma 4 includes native function calling, structured JSON output, multi-step planning, and configurable extended thinking/reasoning mode. It can also output bounding boxes for UI element detection — useful for browser automation and screen-parsing agents.
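In practice, function calling with a model like this tends to look roughly like the following. The schema shape and the model's JSON reply below are illustrative assumptions in the common OpenAPI-style convention, not Gemma 4's documented wire format:

```python
import json

# Hypothetical tool definition you would pass to the model.
tools = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# With structured JSON output enabled, the model's reply is parseable directly.
# This string stands in for an actual model response.
raw = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'

call = json.loads(raw)
assert call["tool"] in {t["name"] for t in tools}   # validate against known tools
print(call["arguments"]["city"])  # Zurich
```

Structured output matters precisely because it makes this parse-and-validate step reliable instead of a regex exercise.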

On-Device Deployment

The smaller models are designed to run on edge hardware:

  • E2B fits in under 1.5 GB with 2-bit quantization
  • On a Raspberry Pi 5: 133 tokens/sec prefill, 7.6 tokens/sec decode
  • Runs on Android, iOS, Windows, Linux, macOS, WebGPU browsers, and Qualcomm IQ8 NPUs
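The quoted Raspberry Pi figures translate directly into end-to-end latency. Assuming a hypothetical workload of a 512-token prompt and a 128-token reply:

```python
prefill_tps = 133.0   # tokens/sec, from the Raspberry Pi 5 figures above
decode_tps = 7.6

prompt_tokens, reply_tokens = 512, 128   # assumed workload, not a benchmark

latency = prompt_tokens / prefill_tps + reply_tokens / decode_tps
print(round(latency, 1))  # 20.7 seconds end to end
```

Note that decode dominates: the 128-token reply costs about 16.8 of those seconds, so shorter generations are the main lever for responsiveness on this class of hardware.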

Google partnered with Pixel, Qualcomm, MediaTek, ARM, and NVIDIA to optimize deployment across these targets. NVIDIA is distributing Gemma 4 through their RTX AI Garage for local inference on RTX GPUs.

How to Access Gemma 4

Gemma 4 is available now across multiple platforms:

  • Hugging Face: google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it
  • Google AI Studio for API access (31B and 26B)
  • Ollama for local inference
  • Kaggle for model weights
  • Vertex AI, Cloud Run, GKE for production deployments

Day-one framework support includes Hugging Face Transformers, vLLM, llama.cpp, MLX (Apple Silicon), LM Studio, and transformers.js for in-browser inference.

Hardware Requirements

| Model | Minimum VRAM |
|---|---|
| E2B | 8 GB / Apple Silicon |
| E4B | 12–16 GB |
| 26B-A4B | 24 GB (A100) |
| 31B | 40+ GB (H100 for bf16) |

The Apache 2.0 License Shift

Previous Gemma releases used a custom license with restrictions on commercial use and content policies. Gemma 4 ships under Apache 2.0 — the same permissive license used by Qwen 3.5 and more open than Llama 4’s community license.

This means no monthly active user limits, no acceptable-use policy enforcement, and full freedom for sovereign and commercial AI deployments. For organizations building products on open models, the licensing clarity matters as much as the benchmark numbers.

Bottom Line

Gemma 4 represents a serious move from Google in the open model space. The 31B dense model competes with models many times its size on reasoning and coding benchmarks. The MoE variant delivers nearly the same quality at a fraction of the inference cost. And the E2B model brings genuine multimodal intelligence to devices with under 2 GB of available memory.

Combined with the Apache 2.0 license, Gemma 4 gives developers a compelling option whether they’re building cloud-scale agentic systems or shipping on-device AI to mobile and IoT hardware.