DeepSeek V4 Flash

In the 1M-token context setting, V4 Flash needs only 10% of the single-token FLOPs and 7% of the KV cache of DeepSeek-V3.2, an efficiency leap that makes serving very long contexts economical.

A 284B-parameter Mixture-of-Experts model engineered for fast, affordable inference without sacrificing reasoning depth. Thirteen billion parameters active per forward pass. One million tokens of context.

What is DeepSeek V4 Flash?

DeepSeek V4 Flash is the efficiency-first member of DeepSeek's fourth-generation model family, released in preview on April 24, 2026. It sits alongside V4 Pro as a complementary option: where Pro optimizes for maximum intelligence, Flash optimizes for throughput, latency, and cost per token without falling far short on quality.

The model uses a sparse Mixture-of-Experts design: while it carries 284 billion parameters in total, only 13 billion are active during any single inference call. That translates directly into lower compute and lower cost while keeping outputs sharper than a dense 13B model would achieve on its own.
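To see why sparse activation matters, here is a back-of-the-envelope compute comparison using the common approximation of roughly 2 FLOPs per active parameter per generated token. The figures are illustrative, not an official cost model.

```python
# Rough per-token compute for a sparse MoE vs. an equivalently sized dense
# model, using the ~2 FLOPs per parameter per token approximation.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for one generated token."""
    return 2.0 * active_params

TOTAL_PARAMS = 284e9   # parameters held in memory
ACTIVE_PARAMS = 13e9   # parameters touched per forward pass

moe_flops = flops_per_token(ACTIVE_PARAMS)    # ~2.6e10 FLOPs/token
dense_flops = flops_per_token(TOTAL_PARAMS)   # ~5.7e11 FLOPs/token

print(f"MoE:   {moe_flops:.2e} FLOPs/token")
print(f"Dense: {dense_flops:.2e} FLOPs/token")
print(f"Compute ratio: {moe_flops / dense_flops:.1%}")  # ~4.6%
```

In other words, each token costs roughly the compute of a dense 13B model while drawing on a 284B parameter pool.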

  • Total params: 284B (Mixture-of-Experts)
  • Active params: 13B per forward pass
  • Context window: 1M tokens
  • Output speed: 84 t/s (median across models: 52 t/s)
  • TTFT: 1.00 s (vs. 2.03 s median)
  • Intelligence Index: 47/100 (open-weight average: 28)

API pricing

  • Input (cache miss): $0.182
  • Input (cache hit): $0.0364
  • Output: $0.364
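A quick cost sketch follows, assuming (as is conventional for API pricing) that the figures above are USD per 1M tokens; that unit is an assumption, since the listing does not state it.

```python
# Estimated per-request cost from the listed prices, assuming they are
# USD per 1M tokens (an assumption; the listing omits the unit).

PRICE_INPUT_MISS = 0.182 / 1_000_000   # $/token, input on cache miss
PRICE_INPUT_HIT = 0.0364 / 1_000_000   # $/token, input on cache hit
PRICE_OUTPUT = 0.364 / 1_000_000       # $/token, output

def request_cost(input_tokens: int, output_tokens: int,
                 hit_ratio: float = 0.0) -> float:
    """Cost in USD for one call; hit_ratio is the cached fraction of input."""
    hits = input_tokens * hit_ratio
    misses = input_tokens - hits
    return (hits * PRICE_INPUT_HIT
            + misses * PRICE_INPUT_MISS
            + output_tokens * PRICE_OUTPUT)

# A 100K-token document plus a 2K-token answer, with 90% of the prompt cached:
print(f"${request_cost(100_000, 2_000, hit_ratio=0.9):.4f}")  # $0.0058
```

Cache hits dominate the savings: the same call with no caching costs about 3× more on the input side.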

Architecture & key innovations

Several architectural decisions separate V4 Flash from earlier DeepSeek releases and from the broader open-source field.

Attention stack

Compressed Sparse Attention (CSA)

Compresses KV caches along the sequence dimension (compression rate 4 in Flash), then applies DeepSeek Sparse Attention. A lightning indexer picks the top 512 most relevant compressed KV entries per query, plus a 128-token sliding window so local context is never missed.
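The selection step can be sketched in a few lines. This is a toy illustration only: the pooling operator and dot-product scorer below are stand-ins, since the real lightning indexer is a learned module whose internals are not described here.

```python
import numpy as np

# Toy sketch of the CSA selection step: pool the KV cache 4x along the
# sequence axis, then keep the 512 highest-scoring compressed entries per
# query. Mean pooling and dot-product scoring are illustrative stand-ins.

RATE = 4      # sequence-dimension compression rate
TOP_K = 512   # compressed KV entries kept per query

def compress_kv(kv: np.ndarray, rate: int = RATE) -> np.ndarray:
    """Mean-pool entries along the sequence axis in blocks of `rate`."""
    seq, dim = kv.shape
    usable = seq - seq % rate
    return kv[:usable].reshape(-1, rate, dim).mean(axis=1)

def select_entries(query: np.ndarray, compressed: np.ndarray,
                   top_k: int = TOP_K) -> np.ndarray:
    """Indexer stand-in: rank compressed entries by score, keep the top_k."""
    scores = compressed @ query
    top_k = min(top_k, len(scores))
    return np.argsort(scores)[-top_k:]

rng = np.random.default_rng(0)
kv = rng.normal(size=(4096, 64))   # 4096 cached tokens, head dim 64
query = rng.normal(size=64)

compressed = compress_kv(kv)               # 4096 / 4 = 1024 entries
selected = select_entries(query, compressed)
print(compressed.shape, len(selected))     # (1024, 64) 512
```

The 128-token sliding window is attended densely in raw-token space on top of this selection, so recent context never depends on the indexer.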

Heavily Compressed Attention (HCA)

Applies a much more aggressive compression rate of 128, then performs dense attention over that compressed representation, giving the model a cheap global view of distant tokens in every layer. CSA and HCA layers are interleaved throughout the network.
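The effect of the rate-128 compression is easiest to see numerically: dense attention over the compressed cache touches 128× fewer entries. The mean pooling below is again an assumption for illustration.

```python
import numpy as np

# Toy sketch of HCA: pool the KV cache aggressively (rate 128), then run
# ordinary dense attention over the small compressed sequence. Mean
# pooling is an illustrative stand-in for the real compression operator.

def hca_attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray,
                  rate: int = 128) -> np.ndarray:
    seq, dim = keys.shape
    usable = seq - seq % rate
    k = keys[:usable].reshape(-1, rate, dim).mean(axis=1)
    v = values[:usable].reshape(-1, rate, dim).mean(axis=1)
    scores = k @ query / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v   # cheap global summary of distant context

rng = np.random.default_rng(1)
keys = rng.normal(size=(65_536, 64))    # long cache standing in for 1M tokens
values = rng.normal(size=(65_536, 64))
out = hca_attention(rng.normal(size=64), keys, values)
print(out.shape)   # (64,) -- attention ran over only 512 compressed entries
```

Interleaving these layers with CSA layers gives every block both a precise local/selected view and a coarse global one.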

Manifold-Constrained Hyper-Connections (mHC)

Strengthens conventional residual connections to enhance stability of signal propagation across layers while preserving model expressivity — a key factor in maintaining quality at high compression ratios.

Muon optimizer

Used during training for faster convergence and greater stability. Alongside FP4/FP8 mixed precision (expert weights in FP4, most other weights in FP8), this keeps training costs low while preserving model quality.

MoE routing

The model uses one shared expert plus a pool of routed experts. The first three MoE layers use Hash routing (expert assignment by a fixed hash of the token ID), while the remaining layers use standard DeepSeekMoE learned routing. Multi-Token Prediction is enabled at depth 1 — the same strategy used in V3.
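The hybrid routing rule can be sketched as below. The expert count is illustrative (the real figure is not stated), and the learned router is stubbed with a single argmax gate, a simplification of top-k DeepSeekMoE routing.

```python
import numpy as np

# Sketch of the hybrid routing described above: the first three MoE layers
# assign experts by a fixed hash of the token ID; deeper layers use a
# learned gate (stubbed here as a plain argmax over gate logits).

NUM_EXPERTS = 64   # illustrative; the real expert count is not stated
HASH_LAYERS = 3

def route(layer_idx: int, token_id: int, gate_logits: np.ndarray) -> int:
    if layer_idx < HASH_LAYERS:
        # Fixed hash routing: deterministic and load-friendly, no gate needed.
        return token_id % NUM_EXPERTS
    # Learned routing (simplified): pick the expert with the top gate score.
    return int(np.argmax(gate_logits))

rng = np.random.default_rng(2)
logits = rng.normal(size=NUM_EXPERTS)
print(route(0, token_id=9001, gate_logits=logits))   # 9001 % 64 = 41
print(route(10, token_id=9001, gate_logits=logits))  # whatever the gate prefers
```

Hash routing in the early layers sidesteps router instability during the first stages of training, while later layers retain the full expressivity of learned expert selection.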

Training data

Pre-trained on more than 32 trillion diverse, high-quality tokens. Post-training used a two-stage pipeline: first, independent cultivation of domain-specific experts via supervised fine-tuning and reinforcement learning with GRPO; second, unified model consolidation via on-policy distillation, integrating distinct proficiencies into a single model.

Reasoning modes

V4 Flash supports three configurable reasoning effort modes, giving developers direct control over the latency/quality trade-off without switching models entirely.

  • Non-thinking (fast): No reasoning chain generated. Fastest latency, lowest token count, best for simple queries, chat, and RAG retrieval steps.
  • Thinking (balanced): Internal chain-of-thought before answering. Standard mode for coding, structured reasoning, and multi-step agentic tasks.
  • Think Max (deep): Extended reasoning budget. Approaches V4 Pro quality on complex math, STEM, and formal proofs. Recommended context: 384K+ tokens.
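Selecting a mode presumably happens per request. The sketch below builds an OpenAI-style chat payload; the field name "reasoning_effort", its values, and the model string are all assumptions for illustration, so consult the provider's API reference for the real parameter.

```python
# Hypothetical request payload for choosing a reasoning mode. The
# "reasoning_effort" field, its values, and the model name are assumed
# for illustration, not taken from official API documentation.

def build_request(prompt: str, mode: str = "thinking") -> dict:
    assert mode in {"non-thinking", "thinking", "think-max"}
    return {
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": mode,
    }

payload = build_request("Prove that sqrt(2) is irrational.", mode="think-max")
print(payload["reasoning_effort"])  # think-max
```

Keeping the mode in the payload rather than the model name means a router service can escalate from non-thinking to Think Max without changing endpoints.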

Benchmark performance

On the Artificial Analysis Intelligence Index (v4.0 — covering GDPval-AA, GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench, and others), V4 Flash in reasoning mode scores 47 versus an open-weight median of 28. Selected highlights below.

Key scores

  • Intelligence Index: 47
  • Putnam-200 Pass@8: 81.0
  • HMMT 2026 Feb: 95.2
  • IMOAnswerBench: 89.8
  • Output speed: 84 t/s

Use cases

V4 Flash is positioned as the cost-effective default for most serving scenarios — the model you reach for first unless maximum frontier intelligence is explicitly required. Its combination of speed, long context, and low cost makes it a natural fit across a wide range of production workloads.

Coding assistants

Long-context repo understanding, diff review, autocomplete at high throughput.

RAG pipelines

High-volume retrieval synthesis where cache hits reduce input costs to fractions of a cent.

Agentic workflows

Multi-step tool-calling loops; performs on par with V4 Pro on simple agent tasks.

Document processing

1M-token context absorbs entire contracts, codebases, or report archives in a single call.

Math & STEM

Think Max mode produces frontier-level formal reasoning at a fraction of Pro pricing.

Chat & customer support

Sub-second TTFT and 84 t/s throughput keep conversational latency imperceptible.

How it compares

vs. DeepSeek V4 Pro

Pro carries 1.6T total / 49B active params. Flash is roughly 3–4× cheaper and faster, with reasoning that closely approaches Pro quality. Simple agent tasks: parity. Knowledge-intensive or highly complex agentic chains: Pro leads.

vs. DeepSeek V3.2

Flash uses 10% of V3.2's FLOPs and 7% of its KV cache at 1M-token context, a generational efficiency leap, while introducing hybrid attention and configurable reasoning modes that V3.2 lacked.

vs. GPT-5.4 Nano

V4 Flash is currently the cheapest among small capable models, undercutting GPT-5.4 Nano on price while offering open weights and a 1M-token context that most nano-class models do not provide.

