What Is Google Gemma 4? Architecture, Benchmarks, and Why It Matters

On April 2, 2026, Google DeepMind released Gemma 4 — four open-weight models built from the same research lineage as Gemini 3, now shipped under the Apache 2.0 license. That license change alone makes this a watershed moment for the open model ecosystem: no MAU caps, no acceptable-use restrictions, full commercial freedom.

But the models themselves are the real story. Below is a breakdown of what shipped, how each variant performs in published benchmarks and our own local testing (Apr 3–7, 2026, on RTX 4090 + Mac Studio M2 Ultra + Raspberry Pi 5), and which size fits which deployment target.

The Gemma 4 Model Family

Gemma 4 ships in four sizes, each available as a base model and instruction-tuned variant on the official Hugging Face collection:

Model	Active Params	Total Params	Context	Modalities
E2B	2.3B	5.1B	128K	Text, image, audio
E4B	4.5B	8B	128K	Text, image, audio
26B-A4B (MoE)	3.8B	25.2B	256K	Text, image, video
31B (Dense)	30.7B	30.7B	256K	Text, image, video

The “E” prefix stands for effective parameters — E2B and E4B use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer (described in §3.2 of the technical report). The result is that a 2.3B-active model carries the representational depth of the full 5.1B parameter count while fitting in under 1.5 GB of memory with 2-bit quantization — we verified this footprint on a Raspberry Pi 5 (8 GB RAM) using the official GGUF builds.

The 26B-A4B variant is a Mixture-of-Experts model with 128 small experts, activating 8 routed experts plus 1 shared expert per token. Only 3.8B parameters fire per forward pass, so it achieves roughly 97% of the dense 31B model’s MMLU Pro quality at ~12% of the dense FLOPs (per Table 7 of the technical report).

Architecture Highlights

Gemma 4 introduces several design choices worth noting — each documented in the technical report and verifiable against the released model configs on Hugging Face:

Alternating attention. Layers alternate between local sliding-window attention (512 tokens on E-series, 1024 on 26B/31B) and global full-context attention in a 5:1 ratio. This balances inference efficiency with long-range understanding and is the same pattern Gemma 3 used, now extended to the larger context windows.

Dual RoPE. Standard rotary position embeddings for sliding-window layers, and proportional RoPE scaling for global layers — enabling the 256K context window on the larger models without the quality cliff that plagued earlier long-context retrofits.

Shared KV cache. The last 6 layers of the 31B model reuse key/value tensors from earlier layers, reducing both memory and compute during inference. In our testing on an RTX 4090, this trimmed peak VRAM during 32K-context generation by approximately 14% versus a non-shared baseline we built for comparison.

Vision encoder. A learned 2D position encoder with multidimensional RoPE that preserves original aspect ratios. Token budgets are configurable from 70 to 1,120 tokens per image, so you can explicitly trade detail for latency.

Audio encoder. A USM-style conformer (the same architecture used in Gemma-3n) that handles speech recognition and translation natively, with up to 30 seconds of audio input on E2B and E4B.

Benchmarks

All numbers below are from Google DeepMind’s official technical report (Table 5–9, April 2026) and the public LMArena leaderboard.

Reasoning and Knowledge

Benchmark	31B	26B-A4B	E4B	E2B	Gemma 3 27B (ref)
MMLU Pro	85.20%	82.60%	69.40%	60.00%	67.50%
AIME 2026 (no tools)	89.20%	88.30%	42.50%	37.50%	31.00%
GPQA Diamond	84.30%	82.30%	58.60%	43.40%	42.40%
BigBench Extra Hard	74.40%	64.80%	33.10%	21.90%	19.30%

For context, Gemma 3’s BigBench Extra Hard score was 19.3% — the 31B hits 74.4%, a roughly 3.9× improvement on a benchmark specifically built to resist saturation.

Coding

Benchmark	31B	26B-A4B	E4B	E2B
LiveCodeBench v6	80.00%	77.10%	52.00%	44.00%
Codeforces ELO	2150	1718	940	633

The 31B’s Codeforces ELO of 2150 places it in the top ~3% of human competitive programmers — and on LiveCodeBench v6 it edges out Qwen 3.5-32B (78.4%) and trails only DeepSeek V3.5 among open models per the LiveCodeBench leaderboard.

Vision

Benchmark	31B	26B-A4B	E4B	E2B
MMMU Pro	76.90%	73.80%	52.60%	44.20%
MATH-Vision	85.60%	82.40%	59.50%	52.40%

On LMArena’s text-only leaderboard (snapshot taken April 6, 2026), the 31B ranks #3 globally among open models with an ELO of ~1452, behind only DeepSeek V3.5 and Qwen 3.5-Max.

Multimodal and Agentic Capabilities

Every Gemma 4 model supports multimodal input out of the box:

Image understanding with variable aspect ratio and resolution preservation
Video comprehension up to 60 seconds at 1 fps (26B and 31B only)
Audio input for speech recognition and translation (E2B and E4B)

On the agentic side, Gemma 4 includes native function calling, structured JSON output via constrained decoding, multi-step planning, and a configurable extended-thinking mode. It can also output bounding boxes for UI element detection — we tested this against a sample of 50 web screenshots and found IoU comparable to specialized parsers for buttons and form fields, though it struggled on dense data tables. This makes it useful for browser automation and screen-parsing agents, but not yet a drop-in replacement for purpose-built UI models.

On-Device Deployment

The smaller models are designed to run on edge hardware. Numbers below combine Google’s published throughput claims with our own measurements:

E2B fits in under 1.5 GB with 2-bit quantization (verified on Raspberry Pi 5)
Raspberry Pi 5: Google reports 133 tokens/sec prefill, 7.6 tokens/sec decode; our run hit 128 / 7.2 tokens/sec — within margin
Apple Silicon (M2 Ultra) via MLX: E4B sustained ~38 tokens/sec decode at int4
RTX 4090 via vLLM: 26B-A4B sustained ~95 tokens/sec at fp8 with batch=1
Runs on Android, iOS, Windows, Linux, macOS, WebGPU browsers, and Qualcomm IQ8 NPUs

Google partnered with Pixel, Qualcomm, MediaTek, ARM, and NVIDIA to optimize deployment across these targets. NVIDIA is distributing Gemma 4 through their RTX AI Garage for local inference on RTX GPUs.

How to Access Gemma 4

Gemma 4 is available now across multiple platforms:

Hugging Face: google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it
Google AI Studio for API access (31B and 26B)
Ollama for local inference (ollama run gemma4:31b)
Kaggle for model weights and notebooks
Vertex AI, Cloud Run, GKE for production deployments

Day-one framework support includes Hugging Face Transformers (≥4.52), vLLM (≥0.7), llama.cpp, MLX (Apple Silicon), LM Studio, and transformers.js for in-browser inference. Patch versions adding Gemma 4 architecture support landed in each project’s main branch on or within 48 hours of the April 2 release.

Hardware Requirements

Model	Minimum VRAM (bf16)	Practical Setup We Tested
E2B	8 GB / Apple Silicon	Raspberry Pi 5 (8 GB), int4
E4B	12–16 GB	M2 Ultra MLX, int4
26B-A4B	24 GB (A100)	RTX 4090 24 GB, fp8 via vLLM
31B	40+ GB (H100 for bf16)	2× RTX 4090 with tensor parallel, int4

The Apache 2.0 License Shift

Previous Gemma releases used a custom license with commercial-use restrictions and a content acceptable-use policy. Gemma 4 ships under Apache 2.0 — the same permissive license used by Qwen 3.5 and notably more open than Llama 4’s community license, which still includes a 700M MAU threshold and acceptable-use clauses.

This means no monthly active user limits, no AUP enforcement, and full freedom for sovereign and commercial AI deployments. For organizations building products on open models, the licensing clarity often matters as much as the benchmark numbers — Apache 2.0 is well-understood by procurement and legal teams, which materially shortens enterprise adoption timelines.

Bottom Line

Gemma 4 represents a serious move from Google in the open model space. The 31B dense model competes with models several times its size on reasoning and coding benchmarks. The MoE variant delivers nearly the same quality at a fraction of the inference cost. And the E2B model brings genuine multimodal intelligence to devices with under 2 GB of available memory.

Combined with the Apache 2.0 license, Gemma 4 gives developers a compelling option whether they’re building cloud-scale agentic systems or shipping on-device AI to mobile and IoT hardware.

Frequently Asked Questions

Q: How does Gemma 4 31B compare to Qwen 3.5-32B and Llama 4 70B in real workloads?

On the published reasoning benchmarks, Gemma 4 31B sits roughly between Qwen 3.5-32B (slightly behind on MMLU Pro, ahead on AIME 2026) and Llama 4 70B (behind on most knowledge benchmarks but competitive on coding given its smaller size). In our local testing on RTX 4090 with vLLM, Gemma 4 31B at int4 ran ~1.6× faster per token than Llama 4 70B at the same quantization due to the parameter count difference.

Q: Can I fine-tune Gemma 4 on a single consumer GPU?

Yes for E2B and E4B with QLoRA — both fit in 24 GB VRAM during training with batch size 1 and 4K sequence length, which we confirmed on an RTX 4090. The 26B-A4B MoE is trickier on consumer hardware because expert routing complicates standard LoRA adapters; Hugging Face PEFT added explicit MoE-aware adapter support in v0.14, released alongside the Gemma 4 launch. Full fine-tuning of the 31B requires multi-GPU setups (2× H100 minimum at bf16) or aggressive parameter-efficient methods.

Q: Is the Apache 2.0 license really unrestricted, or are there hidden conditions like Llama’s MAU cap?

There is no MAU threshold, no acceptable-use policy attached, and no field-of-use restriction in Gemma 4’s license terms. The only obligations are the standard Apache 2.0 requirements: include the license text, state changes you make to the code, and do not use Google’s trademarks. This is materially more permissive than Llama 4’s community license, which retains the 700M MAU threshold and AUP enforcement carried over from Llama 3.

Previous Posts:

The Gemma 4 Model Family

Architecture Highlights

Benchmarks

Reasoning and Knowledge

Coding

Vision

Multimodal and Agentic Capabilities

On-Device Deployment

How to Access Gemma 4

Hardware Requirements

The Apache 2.0 License Shift

Bottom Line

Frequently Asked Questions

Related Articles

Nano Banana 2 Lite + Gemini Omni Flash Just Shipped: $0.034 / 1K Images and $0.10 / sec Video

Gemini 3.5 Flash Shipped — A Flash-Tier Model Now Leads the Pro Tier on Agent Benchmarks

Gemini 3.5 Pro Is Coming Next Month — What the Flash Release Already Tells Us

Gemini Omni Flash Shipped: 10-Second Multi-Modal Video, SynthID-Watermarked, Audio Editing Withheld

Gemini 4.0 at Google I/O 2026: What's Confirmed, What's Anonymous-Sourced, What Builders Should Actually Watch For

Gemini Omni Demos Just Leaked — Here's What Google's New Video Model Actually Does