What Is Google Gemma 4? Architecture, Benchmarks, and Why It Matters

Google Gemma 4 is Google DeepMind's most capable open model family yet, shipping in four sizes under the Apache 2.0 license with multimodal input, native reasoning, and on-device deployment down to a Raspberry Pi.

6 min read

On April 2, 2026, Google DeepMind released Gemma 4 — four open-weight models built from the same research behind Gemini 3, now available under the Apache 2.0 license. That license change alone makes this a significant moment for the open model ecosystem: no MAU caps, no acceptable-use restrictions, full commercial freedom.

But the models themselves are the real story. Let’s break down what shipped, how they perform, and who should care.

The Gemma 4 Model Family

Gemma 4 comes in four sizes, each available as both base and instruction-tuned variants:

| Model | Active Params | Total Params | Context | Modalities |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | Text, image, audio |
| E4B | 4.5B | 8B | 128K | Text, image, audio |
| 26B-A4B (MoE) | 3.8B | 25.2B | 256K | Text, image, video |
| 31B (Dense) | 30.7B | 30.7B | 256K | Text, image, video |

The “E” prefix stands for effective parameters — E2B and E4B use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer. The result is that a 2.3B-active model carries the representational depth of the full 5.1B parameter count while fitting in under 1.5 GB of memory with quantization.
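The PLE idea can be sketched in a few lines of numpy. This is an illustrative toy, not the actual Gemma 4 implementation — all dimensions, tables, and the identity "decoder layer" are made-up assumptions; the point is only where the secondary embedding signal enters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ple, n_layers = 1000, 64, 16, 4  # toy sizes

# Standard token embedding table, plus a small per-layer table:
# each token gets an extra d_ple-dim vector for every decoder layer.
tok_emb = rng.normal(size=(vocab, d_model))
ple_emb = rng.normal(size=(n_layers, vocab, d_ple))      # per-layer lookup
ple_proj = rng.normal(size=(n_layers, d_ple, d_model))   # up-projection

def forward(token_ids):
    h = tok_emb[token_ids]                    # (seq, d_model)
    for layer in range(n_layers):
        # The real decoder layer (attention + MLP) is omitted here;
        # we only show where the layer-specific signal is injected.
        extra = ple_emb[layer, token_ids] @ ple_proj[layer]
        h = h + extra
    return h

out = forward(np.array([1, 2, 3]))
print(out.shape)  # (3, 64)
```

Because the per-layer tables are plain lookups, they can live in cheap storage and be streamed in on demand, which is how the active-parameter count stays small.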

The 26B-A4B variant is a Mixture-of-Experts model with 128 small experts; each token activates 8 of them plus 1 always-on shared expert. Only 3.8B parameters fire per forward pass, so it achieves roughly 97% of the dense 31B model’s quality at a fraction of the compute.
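Top-k routing with a shared expert can be sketched as follows — a toy numpy version using the 128-expert/top-8 counts from above, with made-up dimensions and single-matrix "experts" standing in for real FFN blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 32, 128, 8   # expert counts from the article; d is a toy size

router_w = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d))   # one tiny "FFN" per expert
shared = rng.normal(size=(d, d))               # always-on shared expert

def moe_layer(x):                  # x: (d,) hidden state for one token
    logits = x @ router_w
    idx = np.argsort(logits)[-top_k:]                        # pick top-8 experts
    gates = np.exp(logits[idx]) / np.exp(logits[idx]).sum()  # softmax over the chosen 8
    out = x @ shared                                         # shared expert always fires
    for g, e in zip(gates, idx):
        out = out + g * (x @ experts[e])
    return out, idx

y, chosen = moe_layer(rng.normal(size=d))
print(len(chosen))  # 8 routed experts fired, plus the shared one
```

Only the selected experts' weights participate in the forward pass, which is why the active-parameter count stays near 3.8B while total capacity is 25.2B.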

Architecture Highlights

Gemma 4 introduces several design choices worth noting:

Alternating attention. Layers alternate between local sliding-window attention (512–1024 tokens) and global full-context attention. This balances efficiency with long-range understanding.
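The alternating pattern is easiest to see as attention masks. A minimal sketch — the window size, sequence length, and every-other-layer pattern here are arbitrary toy values, not Gemma 4's actual configuration:

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=4, local_every=2):
    """Causal mask; hypothetical schedule where every `local_every`-th
    layer is global full-context and the rest use a sliding window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if layer_idx % local_every == local_every - 1:
        return causal                     # global layer: full causal attention
    return causal & (i - j < window)      # local layer: only the last `window` tokens

local = attention_mask(8, layer_idx=0)    # sliding-window layer
glob = attention_mask(8, layer_idx=1)     # global layer
print(local[7].sum(), glob[7].sum())      # last token sees 4 tokens vs all 8
```

Local layers keep attention cost linear in sequence length; the interleaved global layers are what carry long-range dependencies.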

Dual RoPE. Standard rotary position embeddings for sliding-window layers, proportional RoPE for global layers — enabling the 256K context window on the larger models without the usual quality degradation at long distances.

Shared KV cache. The last N layers reuse key/value tensors from earlier layers, reducing both memory and compute during inference.
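The memory saving is easy to see in a toy sketch. Here a hypothetical 6-layer model computes fresh K/V for only the first 4 layers and lets the rest reuse the last fresh pair — the layer counts and shapes are illustrative, not Gemma 4's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_kv_layers, seq, d = 6, 4, 5, 8   # last 2 layers reuse layer 3's K/V

kv_cache = {}
for layer in range(n_layers):
    src = min(layer, n_kv_layers - 1)        # later layers map to the last fresh K/V
    if src not in kv_cache:
        # Stand-in for the real K/V projections of this layer.
        kv_cache[src] = (rng.normal(size=(seq, d)), rng.normal(size=(seq, d)))
    k, v = kv_cache[src]                     # attention would consume k, v here

print(len(kv_cache))  # 4 K/V pairs stored for 6 layers
```

At long context lengths the KV cache dominates inference memory, so dropping even a couple of layers' worth of K/V tensors is a meaningful saving.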

Vision encoder. A learned 2D position encoder with multidimensional RoPE that preserves original aspect ratios. Token budgets are configurable (70 to 1,120 tokens per image), so you can trade off detail for speed.

Audio encoder. A USM-style conformer (same architecture as Gemma-3n) that handles speech recognition and translation natively, with up to 30 seconds of audio input on the smaller models.

Benchmarks

The numbers are a generational leap over Gemma 3:

Reasoning and Knowledge

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% |

For context, Gemma 3’s BigBench Extra Hard score was 19.3%. The 31B hits 74.4%.

Coding

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 |

Vision

| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |

On LMArena’s text-only leaderboard, the 31B ranks #3 globally among open models with an ELO of ~1452.

Multimodal and Agentic Capabilities

Every Gemma 4 model supports multimodal input out of the box:

  • Image understanding with variable aspect ratio and resolution
  • Video comprehension up to 60 seconds at 1 fps (26B and 31B)
  • Audio input for speech recognition and translation (E2B and E4B)

On the agentic side, Gemma 4 includes native function calling, structured JSON output, multi-step planning, and configurable extended thinking/reasoning mode. It can also output bounding boxes for UI element detection — useful for browser automation and screen-parsing agents.
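In practice, function calling with a model like this tends to look roughly like the following. The schema shape and the model's JSON reply below are illustrative assumptions in the common OpenAPI-style convention, not Gemma 4's documented wire format:

```python
import json

# Hypothetical tool definition you would pass to the model.
tools = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# With structured JSON output enabled, the model's reply is parseable directly.
# This string stands in for an actual model response.
raw = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'

call = json.loads(raw)
assert call["tool"] in {t["name"] for t in tools}   # validate against known tools
print(call["arguments"]["city"])  # Zurich
```

Structured output matters precisely because it makes this parse-and-validate step reliable instead of a regex exercise.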

On-Device Deployment

The smaller models are designed to run on edge hardware:

  • E2B fits in under 1.5 GB with 2-bit quantization
  • On a Raspberry Pi 5: 133 tokens/sec prefill, 7.6 tokens/sec decode
  • Runs on Android, iOS, Windows, Linux, macOS, WebGPU browsers, and Qualcomm IQ8 NPUs
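The quoted Raspberry Pi figures translate directly into end-to-end latency. Assuming a hypothetical workload of a 512-token prompt and a 128-token reply:

```python
prefill_tps = 133.0   # tokens/sec, from the Raspberry Pi 5 figures above
decode_tps = 7.6

prompt_tokens, reply_tokens = 512, 128   # assumed workload, not a benchmark

latency = prompt_tokens / prefill_tps + reply_tokens / decode_tps
print(round(latency, 1))  # 20.7 seconds end to end
```

Note that decode dominates: the 128-token reply costs about 16.8 of those seconds, so shorter generations are the main lever for responsiveness on this class of hardware.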

Google partnered with Pixel, Qualcomm, MediaTek, ARM, and NVIDIA to optimize deployment across these targets. NVIDIA is distributing Gemma 4 through their RTX AI Garage for local inference on RTX GPUs.

How to Access Gemma 4

Gemma 4 is available now across multiple platforms:

  • Hugging Face: google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it
  • Google AI Studio for API access (31B and 26B)
  • Ollama for local inference
  • Kaggle for model weights
  • Vertex AI, Cloud Run, GKE for production deployments

Day-one framework support includes Hugging Face Transformers, vLLM, llama.cpp, MLX (Apple Silicon), LM Studio, and transformers.js for in-browser inference.

Hardware Requirements

| Model | Minimum VRAM |
|---|---|
| E2B | 8 GB / Apple Silicon |
| E4B | 12–16 GB |
| 26B-A4B | 24 GB (A100) |
| 31B | 40+ GB (H100 for bf16) |

The Apache 2.0 License Shift

Previous Gemma releases used a custom license with restrictions on commercial use and content policies. Gemma 4 ships under Apache 2.0 — the same permissive license used by Qwen 3.5 and more open than Llama 4’s community license.

This means no monthly active user limits, no acceptable-use policy enforcement, and full freedom for sovereign and commercial AI deployments. For organizations building products on open models, the licensing clarity matters as much as the benchmark numbers.

Bottom Line

Gemma 4 represents a serious move from Google in the open model space. The 31B dense model competes with models many times its size on reasoning and coding benchmarks. The MoE variant delivers nearly the same quality at a fraction of the inference cost. And the E2B model brings genuine multimodal intelligence to devices with under 2 GB of available memory.

Combined with the Apache 2.0 license, Gemma 4 gives developers a compelling option whether they’re building cloud-scale agentic systems or shipping on-device AI to mobile and IoT hardware.