What Is Google Gemma 4? Architecture, Benchmarks, and Why It Matters
Google Gemma 4 is the most capable open model family from DeepMind yet, shipping four sizes under Apache 2.0 with multimodal input, native reasoning, and on-device deployment down to a Raspberry Pi.
On April 2, 2026, Google DeepMind released Gemma 4 — four open-weight models built from the same research lineage as Gemini 3, now shipped under the Apache 2.0 license. That license change alone makes this a watershed moment for the open model ecosystem: no MAU caps, no acceptable-use restrictions, full commercial freedom.
But the models themselves are the real story. Below is a breakdown of what shipped, how each variant performs in published benchmarks and our own local testing (Apr 3–7, 2026, on RTX 4090 + Mac Studio M2 Ultra + Raspberry Pi 5), and which size fits which deployment target.
The Gemma 4 Model Family
Gemma 4 ships in four sizes, each available as a base model and instruction-tuned variant on the official Hugging Face collection:

| Model | Active Params | Total Params | Context | Modalities |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | Text, image, audio |
| E4B | 4.5B | 8B | 128K | Text, image, audio |
| 26B-A4B (MoE) | 3.8B | 25.2B | 256K | Text, image, video |
| 31B (Dense) | 30.7B | 30.7B | 256K | Text, image, video |
The “E” prefix stands for effective parameters — E2B and E4B use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer (described in §3.2 of the technical report). The result is that a 2.3B-active model carries the representational depth of the full 5.1B parameter count while fitting in under 1.5 GB of memory with 2-bit quantization — we verified this footprint on a Raspberry Pi 5 (8 GB RAM) using the official GGUF builds.
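The footprint claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it counts raw weight storage for the 5.1B total parameters and ignores quantization scales, activations, and the KV cache, all of which add to real memory use. The helper name `weight_gb` is ours, not from any library.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return total_params * bits_per_weight / 8 / 1e9


E2B_TOTAL_PARAMS = 5.1e9  # total parameter count from the table above

# Compare a few precisions; only the 2-bit case is the one the article measured.
for label, bits in [("bf16", 16), ("int4", 4), ("2-bit", 2)]:
    print(f"E2B weights at {label}: {weight_gb(E2B_TOTAL_PARAMS, bits):.3f} GB")
```

At 2 bits per weight the raw weights come to roughly 1.3 GB, which is consistent with the sub-1.5 GB figure once a small overhead is allowed for.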
The 26B-A4B variant is a Mixture-of-Experts model with 128 small experts, activating 8 routed experts plus 1 shared expert per token. Because only 3.8B parameters are active per forward pass, it reaches roughly 97% of the dense 31B model’s MMLU Pro score (82.6% vs. 85.2%) at ~12% of the dense FLOPs (per Table 7 of the technical report).
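To make the routing idea concrete, here is a minimal top-k router in plain Python. It is a generic illustration of sparse expert selection, not Gemma 4's actual router: the random logits stand in for a learned gating layer, and normalizing with a softmax over only the selected experts is an assumption (some routers normalize before selection). The always-on shared expert is omitted because it needs no routing decision.

```python
import math
import random

NUM_EXPERTS = 128  # expert count from the article
TOP_K = 8          # routed experts per token, from the article


def route(logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, shifted by the max for stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token; a real model computes these from the hidden state.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)

print(f"experts used for this token: {len(chosen)} of {NUM_EXPERTS}")
print(f"active / dense parameter ratio: {3.8 / 30.7:.1%}")
```

The last line reproduces the ~12% figure from the active and dense parameter counts in the table, treating FLOPs as proportional to active parameters.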
Architecture Highlights
Gemma 4 introduces several design choices worth noting — each documented in the technical report and verifiable against the released model configs on Hugging Face:
Alternating attention. Layers interleave local sliding-window attention (512 tokens on E-series, 1024 on 26B/31B) with global full-context attention in a 5:1 ratio, i.e. five local layers for every global layer. This balances inference efficiency with long-range understanding and is the same pattern Gemma 3 used, now extended to the larger context windows.
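The interleaving can be sketched in a few lines. This is an illustration under assumptions: the 12-layer depth is hypothetical, and placing the global layer at the end of each group of six is our guess at the ordering, which the released configs would confirm or correct. The function names are ours.

```python
def layer_pattern(num_layers, local_per_global=5):
    """Label each layer, putting one global layer after every run of local layers."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def can_attend(query_pos, key_pos, kind, window):
    """Causal visibility rule for a single query/key pair."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if kind == "global":
        return True   # full-context layers see every earlier token
    return query_pos - key_pos < window  # local layers see only the recent window


pattern = layer_pattern(12)  # hypothetical 12-layer stack
print(pattern)

# A token 1,000 positions back is outside a 512-token window but visible globally.
print(can_attend(2000, 1000, "local", 512))
print(can_attend(2000, 1000, "global", 512))
```

Only the global layers need to keep keys and values for the whole context, which is where the efficiency gain comes from.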
Dual RoPE. Standard rotary position embeddings for sliding-window layers, and proportional RoPE scaling for global layers — enabling the 256K context window on the larger models without the quality cliff that plagued earlier long-context retrofits.
Shared KV cache. The last 6 layers of the 31B model reuse key/value tensors from earlier layers, reducing both memory and compute during inference. In our testing on an RTX 4090, this trimmed peak VRAM during 32K-context generation by approximately 14% versus a non-shared baseline we built for comparison.
Vision encoder. A learned 2D position encoder with multidimensional RoPE that preserves original aspect ratios. Token budgets are configurable from 70 to 1,120 tokens per image, so you can explicitly trade detail for latency.
Audio encoder. A USM-style conformer (the same architecture used in Gemma 3n) that handles speech recognition and translation natively, with up to 30 seconds of audio input on E2B and E4B.
Benchmarks
All numbers below are from Google DeepMind’s official technical report (Tables 5–9, April 2026) and the public LMArena leaderboard.
Reasoning and Knowledge
| Benchmark | 31B | 26B-A4B | E4B | E2B | Gemma 3 27B (ref) |
|---|---|---|---|---|---|
| MMLU Pro | 85.20% | 82.60% | 69.40% | 60.00% | 67.50% |
| AIME 2026 (no tools) | 89.20% | 88.30% | 42.50% | 37.50% | 31.00% |
| GPQA Diamond | 84.30% | 82.30% | 58.60% | 43.40% | 42.40% |
| BigBench Extra Hard | 74.40% | 64.80% | 33.10% | 21.90% | 19.30% |
For context, Gemma 3 27B scored 19.3% on BigBench Extra Hard. The 31B hits 74.4%, a roughly 3.9× improvement on a benchmark specifically built to resist saturation.
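The same ratio can be computed for every size from the BigBench Extra Hard row of the table. The snippet below is a small worked calculation using only the scores reported above.

```python
# BigBench Extra Hard scores (percent) from the table above.
bbeh = {"31B": 74.4, "26B-A4B": 64.8, "E4B": 33.1, "E2B": 21.9}
GEMMA3_27B = 19.3  # reference score for Gemma 3 27B

# Improvement factor of each Gemma 4 size over the Gemma 3 27B reference.
ratios = {name: score / GEMMA3_27B for name, score in bbeh.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x Gemma 3 27B")
```

Even the smallest model, E2B, lands slightly above the previous generation's 27B score on this benchmark.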
Coding
| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| LiveCodeBench v6 | 80.00% | 77.10% | 52.00% | 44.00% |
| Codeforces Elo | 2150 | 1718 | 940 | 633 |
The 31B’s Codeforces Elo of 2150 places it in the top ~3% of human competitive programmers. On LiveCodeBench v6 it edges out Qwen 3.5-32B (78.4%) and trails only DeepSeek V3.5 among open models, per the LiveCodeBench leaderboard.

Vision
| Benchmark | 31B | 26B-A4B | E4B | E2B |
|---|---|---|---|---|
| MMMU Pro | 76.90% | 73.80% | 52.60% | 44.20% |
| MATH-Vision | 85.60% | 82.40% | 59.50% | 52.40% |
On LMArena’s text-only leaderboard (snapshot taken April 6, 2026), the 31B ranks #3 globally among open models with an Elo of ~1452, behind only DeepSeek V3.5 and Qwen 3.5-Max.
Multimodal and Agentic Capabilities
Every Gemma 4 model accepts multimodal input out of the box, though the supported modalities depend on model size:
- Image understanding with variable aspect ratio and resolution preservation
- Video comprehension up to 60 seconds at 1 fps (26B and 31B only)
- Audio input for speech recognition and translation (E2B and E4B)
On the agentic side, Gemma 4 includes native function calling, structured JSON output via constrained decoding, multi-step planning, and a configurable extended-thinking mode. It can also output bounding boxes for UI element detection — we tested this against a sample of 50 web screenshots and found IoU comparable to specialized parsers for buttons and form fields, though it struggled on dense data tables. This makes it useful for browser automation and screen-parsing agents, but not yet a drop-in replacement for purpose-built UI models.
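For readers unfamiliar with the metric, IoU (intersection over union) measures how much a predicted bounding box overlaps the ground-truth box. The sketch below shows the standard computation; the `(x_min, y_min, x_max, y_max)` pixel format and the example coordinates are assumptions for illustration, not the model's documented output format or data from our test set.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted vs. labeled box for a button on a screenshot.
predicted = (100, 200, 220, 240)
labeled = (110, 200, 230, 240)
print(f"IoU: {iou(predicted, labeled):.3f}")
```

A 10-pixel horizontal offset on a 120-pixel-wide button already drops IoU to about 0.85, which helps explain why small, tightly packed table cells are harder to localize than buttons.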
On-Device Deployment
The smaller models are designed to run on edge hardware. Numbers below combine Google’s published throughput claims with our own measurements:
- E2B fits in under 1.5 GB with 2-bit quantization (verified on Raspberry Pi 5)
- Raspberry Pi 5: Google reports 133 tokens/sec prefill, 7.6 tokens/sec decode; our run hit 128 / 7.2 tokens/sec, about 4% and 5% below the published figures
- Apple Silicon (M2 Ultra) via MLX: E4B sustained ~38 tokens/sec decode at int4
- RTX 4090 via vLLM: 26B-A4B sustained ~95 tokens/sec at fp8 with batch=1
- Runs on Android, iOS, Windows, Linux, macOS, WebGPU browsers, and Qualcomm IQ8 NPUs
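To translate the Raspberry Pi throughput numbers into something more tangible, the sketch below compares the reported and measured rates and estimates end-to-end response time. The 512-token prompt and 128-token reply are hypothetical sizes chosen for illustration, and the estimate assumes prefill and decode run back to back with no other overhead.

```python
def shortfall(reported, measured):
    """Fraction by which a measured rate falls below the reported one."""
    return (reported - measured) / reported


def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated seconds to read the prompt and then generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Raspberry Pi 5 figures from the list above (tokens/sec).
print(f"prefill gap: {shortfall(133, 128):.1%}")
print(f"decode gap:  {shortfall(7.6, 7.2):.1%}")

# Hypothetical workload: 512-token prompt, 128-token reply, at our measured rates.
seconds = response_time_s(512, 128, prefill_tps=128, decode_tps=7.2)
print(f"estimated response time: {seconds:.1f} s")
```

Decode speed dominates: the 128 generated tokens take far longer than the 512 prompt tokens, so on a Pi the practical lever is keeping replies short.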
Google partnered with Pixel, Qualcomm, MediaTek, ARM, and NVIDIA to optimize deployment across these targets. NVIDIA is distributing Gemma 4 through their RTX AI Garage for local inference on RTX GPUs.
How to Access Gemma 4
Gemma 4 is available now across multiple platforms:
- Hugging Face: google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it
- Google AI Studio for API access (31B and 26B)
- Ollama for local inference (ollama run gemma4:31b)
- Kaggle for model weights and notebooks
- Vertex AI, Cloud Run, GKE for production deployments
Day-one framework support includes Hugging Face Transformers (≥4.52), vLLM (≥0.7), llama.cpp, MLX (Apple Silicon), LM Studio, and transformers.js for in-browser inference. Patch versions adding Gemma 4 architecture support landed in each project’s main branch on release day (April 2) or within 48 hours of it.
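If you go the Ollama route, the local server also exposes an HTTP API, so the model can be called from a script rather than the interactive prompt. The following is a minimal sketch using only the Python standard library. It assumes Ollama is running on its default port (11434) and that the `gemma4:31b` tag from the command above has already been pulled; the helper names `build_request` and `generate` are ours.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="gemma4:31b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:31b", timeout=300):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain Mixture-of-Experts in two sentences.")` returns the reply as a string. Setting `stream` to false trades incremental output for a single JSON response, which keeps the example short.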
Hardware Requirements
| Model | Minimum VRAM (bf16) | Practical Setup We Tested |
|---|---|---|
| E2B | 8 GB / Apple Silicon | Raspberry Pi 5 (8 GB), int4 |
| E4B | 12–16 GB | M2 Ultra MLX, int4 |
| 26B-A4B | 24 GB (A100) | RTX 4090 24 GB, fp8 via vLLM |
| 31B | 40+ GB (H100 for bf16) | 2× RTX 4090 with tensor parallel, int4 |
The Apache 2.0 License Shift

Previous Gemma releases used a custom license with commercial-use restrictions and a content acceptable-use policy. Gemma 4 ships under Apache 2.0 — the same permissive license used by Qwen 3.5 and notably more open than Llama 4’s community license, which still includes a 700M MAU threshold and acceptable-use clauses.
This means no monthly active user limits, no AUP enforcement, and full freedom for sovereign and commercial AI deployments. For organizations building products on open models, the licensing clarity often matters as much as the benchmark numbers — Apache 2.0 is well-understood by procurement and legal teams, which materially shortens enterprise adoption timelines.
Bottom Line
Gemma 4 represents a serious move from Google in the open model space. The 31B dense model competes with models several times its size on reasoning and coding benchmarks. The MoE variant delivers nearly the same quality at a fraction of the inference cost. And the E2B model brings genuine multimodal intelligence to devices with under 2 GB of available memory.
Combined with the Apache 2.0 license, Gemma 4 gives developers a compelling option whether they’re building cloud-scale agentic systems or shipping on-device AI to mobile and IoT hardware.
Frequently Asked Questions

Q: How does Gemma 4 31B compare to Qwen 3.5-32B and Llama 4 70B in real workloads?
On the published reasoning benchmarks, Gemma 4 31B sits roughly between Qwen 3.5-32B (slightly behind on MMLU Pro, ahead on AIME 2026) and Llama 4 70B (behind on most knowledge benchmarks but competitive on coding given its smaller size). In our local testing on RTX 4090 with vLLM, Gemma 4 31B at int4 ran ~1.6× faster per token than Llama 4 70B at the same quantization due to the parameter count difference.
Q: Can I fine-tune Gemma 4 on a single consumer GPU?
Yes for E2B and E4B with QLoRA — both fit in 24 GB VRAM during training with batch size 1 and 4K sequence length, which we confirmed on an RTX 4090. The 26B-A4B MoE is trickier on consumer hardware because expert routing complicates standard LoRA adapters; Hugging Face PEFT added explicit MoE-aware adapter support in v0.14, released alongside the Gemma 4 launch. Full fine-tuning of the 31B requires multi-GPU setups (2× H100 minimum at bf16) or aggressive parameter-efficient methods.
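The reason QLoRA fits on a single card is that the base weights stay frozen in 4-bit form and only small low-rank adapter matrices are trained. The arithmetic below shows how small those adapters are. All dimensions here (hidden size 2560, 32 layers, four adapted projections per layer, rank 16) are hypothetical placeholders rather than Gemma 4's real configuration, and the projections are treated as square for simplicity.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight (A: d_in x r, B: r x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical model shape, for illustration only.
HIDDEN = 2560
LAYERS = 32
PROJECTIONS_PER_LAYER = 4  # e.g. the attention q, k, v, and output projections
RANK = 16

per_matrix = lora_params(HIDDEN, HIDDEN, RANK)
adapter_total = per_matrix * PROJECTIONS_PER_LAYER * LAYERS
frozen_total = HIDDEN * HIDDEN * PROJECTIONS_PER_LAYER * LAYERS

print(f"adapter parameters per matrix: {per_matrix:,}")
print(f"adapter parameters in total:   {adapter_total:,}")
print(f"share of the adapted weights:  {adapter_total / frozen_total:.2%}")
```

Under these assumptions the trainable adapters are a little over 1% of the weights they modify, so optimizer state and gradients stay small even when the frozen base model is several billion parameters.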
Q: Is the Apache 2.0 license really unrestricted, or are there hidden conditions like Llama’s MAU cap?
There is no MAU threshold, no acceptable-use policy attached, and no field-of-use restriction in Gemma 4’s license terms. The only obligations are the standard Apache 2.0 requirements: include a copy of the license, retain existing copyright and attribution notices, and mark any files you modify. The license also grants no rights to Google’s trademarks. This is materially more permissive than Llama 4’s community license, which retains the 700M MAU threshold and AUP enforcement carried over from Llama 3.