The DeepSeek Series: A Technical Overview

The appearance of DeepSeek's large language models has caused a lot of discussion and angst since their latest versions appeared at the beginning of 2025. But much of the value of DeepSeek's work comes from the papers it has published over the last year. This article provides an overview of these papers, highlighting three main arcs in this research: a focus on improving cost and memory efficiency, the use of HPC co-design to train large models on limited hardware, and the development of emergent reasoning from large-scale reinforcement learning.

06 February 2025


Photo of Shayan Mohanty

Shayan Mohanty is the Head of AI Research at Thoughtworks, where his group focuses on foundational research to bridge the gap between AI development and production. Previously, he was CEO and Co-Founder of Watchful, a startup that built software to automate the process of data labeling for AI. Shayan has spent a decade leading data engineering teams at various companies including Facebook, where he led the stream processing team responsible for processing 100% of the ads metrics data for all FB products. He is also a Guest Scientist at Los Alamos National Laboratory and has given talks on topics ranging from Automata Theory to Machine Teaching.


This article provides a cohesive overview of four technical reports from DeepSeek:

  1. DeepSeek-LLM (Jan '24): an early investigation of scaling laws and data-model tradeoffs.
  2. DeepSeek-V2 (Jun '24): introducing Multi-Head Latent Attention (MLA) and DeepSeekMoE to improve memory and training efficiency.
  3. DeepSeek-V3 (Dec '24): scaling sparse MoE networks to 671B parameters, with FP8 mixed precision training and intricate HPC co-design.
  4. DeepSeek-R1 (Jan '25): building upon the efficiency foundations of the previous papers and using large-scale reinforcement learning to incentivize emergent chain-of-thought capabilities, including a “zero-SFT” variant.

For additional context on DeepSeek itself and the market backdrop that has caused claims made by the DeepSeek team to be taken out of context and spread widely, please take a look at my colleague Prasanna Pendse's post: Demystifying Deepseek. For the purposes of this article, we'll be focusing analysis and commentary on the technical work itself, its merits, and what it may signal for the future.

Much of this article assumes significant knowledge of the terminology and concepts of building LLMs, more so than is typical for articles on this site. In future weeks we hope to expand this article to provide explanations of these concepts to make this article easier to follow for those not familiar with this world. We shall post any such updates on this site's usual channels.

All four papers revolve around a singular challenge: building ever-larger language models with minimal cost, memory overhead, and training instability. In each iteration, the authors refine both architecture and infrastructure - a strategy often referred to as HPC co-design.

Key arcs in this series include:

  • Cost and memory efficiency: techniques such as Multi-Head Latent Attention and sparse Mixture-of-Experts that squeeze more capability out of constrained hardware.
  • HPC co-design: treating model architecture, training frameworks, and hardware as a single design problem, from scaling laws through FP8 training and pipeline scheduling.
  • Emergent reasoning: large-scale reinforcement learning that incentivizes chain-of-thought behaviors on top of the efficiency foundations laid by the earlier papers.

DeepSeek-LLM: Laying the Foundation

Motivation & Overview

The authors set out to answer an important question: Given a fixed compute budget for pre-training, how do we choose the scale of the model and how much training data to use? Prior studies (e.g. Chinchilla vs. GPT-3) differed on the ratio between these two factors. DeepSeek-LLM addresses that by measuring scale in a different way. Earlier work measured scale in terms of how many parameters were in the model; DeepSeek-LLM instead measured scale as non-embedding FLOPs/token1. They then found they could predict the required compute with:


$$ C = M \times D $$

where $C$ is the compute budget, $M$ is non-embedding FLOPs/token, and $D$ is data size.

This more granular representation helps them predict how a 7B or 67B model might train on 2T tokens of bilingual data.
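
To make the budget arithmetic concrete, here is a minimal sketch that plugs illustrative numbers into $C = M \times D$. It assumes the common rough approximation of about 6 FLOPs per non-embedding parameter per training token; that multiplier and the round parameter counts are stand-ins, not the paper's exact non-embedding FLOPs/token formula.

```python
def training_compute(non_embedding_params: float, tokens: float) -> float:
    m = 6 * non_embedding_params   # assumed non-embedding FLOPs per token (M)
    return m * tokens              # C = M * D

for params in (7e9, 67e9):                 # illustrative 7B / 67B parameter counts
    c = training_compute(params, 2e12)     # D = 2T tokens, as in DeepSeek-LLM
    print(f"{params / 1e9:.0f}B params, 2T tokens -> C ~ {c:.2e} FLOPs")
```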

Training Instability

A central concern they grapple with is training instability (sudden irrecoverable divergences in the training process), which can often manifest in large-scale language models—especially those with mixture-of-experts or very long contexts.

By carefully tuning learning rates, batch sizes, and other hyperparameters2, DeepSeek-LLM demonstrates that stable large-scale training is achievable, but it requires meticulous design of the architecture of the transformer model together with the infrastructure of the High Performance Computing (HPC) data center used to train it. This interwoven design of both architecture and infrastructure is called HPC Co-Design.


Data Quality & Model Scale

A point the authors make is that data quality shifts the optimal ratio: higher-quality data can justify a bigger model for the same number of tokens. You can intuit this by imagining two scenarios:

  • Scenario A: You have a 100-billion-token corpus full of duplicates, spammy text, or incomplete sentences. The model might not glean much new knowledge because the data is partly redundant or low-value.
  • Scenario B: You have a carefully curated 100-billion-token corpus with broad coverage of code, math, multi-lingual dialogues, factual text, etc. Each token is more “information-rich,” so the model can “afford” to use more parameters without hitting diminishing returns prematurely.

In other words, when data is denser in useful information, scaling the model further pays off because each parameter can learn from richer signals.

Key Takeaways

  • Hyperparameter Scaling: They propose simple power-law fits to pick batch size and learning rate as compute $C$ grows (see the sketch after this list).
  • Bilingual Data: They train two base sizes (7B, 67B) on 2T tokens covering English/Chinese, then do Supervised Fine Tuning (SFT) and a simpler preference-based alignment called Direct Preference Optimization (DPO).
  • Results: The resulting DeepSeek-LLM 67B “Outperforms LLaMA-2 70B” on math/coding tasks, illustrating how HPC co-designed approaches can keep training stable while efficiently pushing scale.
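
As a rough illustration of what such power-law fits look like, the sketch below scales an assumed optimal batch size up, and an assumed optimal learning rate down, as the compute budget grows. The coefficients and exponents are placeholders for illustration, not the values fitted in the paper.

```python
def optimal_batch_size(C, a=0.29, alpha=0.33):        # placeholder coefficients
    return a * C ** alpha

def optimal_learning_rate(C, b=0.31, beta=-0.125):    # placeholder coefficients
    return b * C ** beta

for C in (1e20, 1e21, 1e22):                          # compute budgets in FLOPs
    print(f"C={C:.0e}  batch~{optimal_batch_size(C):.3e}  lr~{optimal_learning_rate(C):.3e}")
```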

The seeds planted here - scaling laws and infrastructure for extremely large training - will reappear in subsequent works.

DeepSeek-V2: Multi-Head Latent Attention & MoE

Expanding the Model While Reducing Memory

Where DeepSeek-LLM mostly explored high-level scale tradeoffs, DeepSeek-V2 dives into specifics of Transformer architecture overhead. Two big obstacles in large LLMs are:

  1. Attention KV Cache: Storing Key/Value vectors for thousands of tokens is memory-intensive.
  2. Feed-Forward Computation: Typically the largest consumer of FLOPs in a Transformer.

To tame both, they propose:

  1. Multi-Head Latent Attention (MLA): compresses Key/Value vectors to reduce memory.
  2. DeepSeekMoE: a sparse Mixture-of-Experts approach that activates a fraction of the feed-forward capacity per token.

Multi-Head Latent Attention (MLA)

In standard attention, the cached K and V for each token amount to the number of heads times the per-head dimension, for every layer. MLA folds them into smaller “latent” vectors:

$$ \mathbf{c}_{t}^{KV} = W^{DKV}\mathbf{h}_t, \quad \mathbf{k}_{t}^{C} = W^{UK}\mathbf{c}_t^{KV}, \quad \mathbf{v}_{t}^{C} = W^{UV}\mathbf{c}_t^{KV} $$

Where $c_{t}^{KV}$ is the compressed latent vector for keys and values. $W^{DKV}$ is the down-projection matrix, and $W^{UK}, W^{UV}$ are the up-projection matrices for keys and values, respectively. In simpler terms:

  1. Replace the standard QKV computation by using low-rank factorization to turn one matrix of dim (in, out) into two matrices of (in, rank) and (rank, out).
  2. Project the compressed KV latent vector for each head to get the full K and V head corresponding to each Q head.
  3. Cache the compressed KV latent vector instead of each of the KV heads in full, and compute the KV heads on the fly from the latent vector (a minimal sketch of this follows).
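
A minimal numerical sketch of that down-/up-projection, with made-up dimensions (the real models use a much larger $d_{model}$ and a carefully chosen latent size):

```python
import numpy as np

# Illustrative dimensions, not the actual model sizes.
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02          # down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection (keys)
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection (values)

h_t = rng.standard_normal(d_model)             # hidden state of one token

c_kv = h_t @ W_dkv                             # compressed latent: the only thing cached
k_t = (c_kv @ W_uk).reshape(n_heads, d_head)   # per-head keys, rebuilt on the fly
v_t = (c_kv @ W_uv).reshape(n_heads, d_head)   # per-head values, rebuilt on the fly

# Cache cost per token per layer: d_latent floats instead of 2 * n_heads * d_head.
print(d_latent, 2 * n_heads * d_head)          # 128 vs 1024 -> roughly 8x smaller cache
```

The point of the sketch is the cache arithmetic: only the small latent vector is stored per token, and the full per-head keys and values are recomputed from it when needed.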

DeepSeekMoE: Sparsely Activated FFNs

Next, they adopt a Mixture-of-Experts (MoE) in the feed-forward blocks:

  • Shared Experts handle universal patterns for every token.
  • Routed Experts handle specialized sub-problems, chosen dynamically via gating.
  • Auxiliary Loss ensures balanced usage so no expert collapses (i.e. is never used).

They further limit cross-device routing with a “device-limited routing” scheme - instead of allowing any token to access any expert, DeepSeekMoE selects a limited number of devices ($M$) per token, and performs expert selection only within these devices. The basic process is as follows (sketched in code after the list):

  • Identify top $M$ devices that contain experts with the highest affinity to the token
  • Perform top $K_r$ expert selection within these $M$ devices
  • Assign the selected experts to process the token
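
A hedged sketch of that selection logic, with illustrative expert and device counts (the real gating also handles shared experts, batching, and capacity limits):

```python
import numpy as np

n_experts, n_devices, M, K_r = 64, 8, 3, 6     # illustrative sizes
experts_per_device = n_experts // n_devices

rng = np.random.default_rng(1)
affinity = rng.random(n_experts)               # token-expert affinities s_{i,t}
device_of = np.arange(n_experts) // experts_per_device

# Step 1: top M devices, ranked by their best expert's affinity for this token.
device_affinity = np.array([affinity[device_of == d].max() for d in range(n_devices)])
top_devices = np.argsort(device_affinity)[-M:]

# Step 2: top K_r experts, restricted to experts living on those devices.
masked = np.where(np.isin(device_of, top_devices), affinity, -np.inf)
selected_experts = np.argsort(masked)[-K_r:]

# Step 3: the token is dispatched only to these experts.
print(sorted(int(e) for e in selected_experts))
```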

Without device-limited routing, MoE models can generate excessive communication overhead, which the hardware constraints imposed on the DeepSeek team could not accommodate. In addition, MoE models typically risk uneven expert utilization, where some experts are overused while others remain inactive. To prevent this, DeepSeekMoE introduces three balancing loss functions (a sketch of the expert-level term follows the list):

  • Expert-level Balance Loss ($L_{ExpBal}$):
    • Ensures uniform distribution of tokens across experts to prevent expert collapse
    • Uses a loss function based on softmax scores of token-expert affinity
  • Device-level Balance Loss ($L_{DevBal}$):
    • Ensures workload is evenly distributed across devices
  • Communication Balance Loss ($L_{CommBal}$):
    • Balances incoming and outgoing token routing to each device
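
As an illustration of the first of these, here is a sketch of an expert-level balance loss in the Switch-Transformer/DeepSeekMoE family, where the penalty is built from each expert's routed-token fraction and its mean softmax affinity. The exact scaling constants in the paper may differ.

```python
import numpy as np

def expert_balance_loss(affinity_logits, top_k, alpha=0.01):
    # Softmax token-expert affinities, pick the top_k experts per token,
    # then penalize the product of routed-token fraction f_i and mean
    # affinity P_i for each expert (constants here are illustrative).
    probs = np.exp(affinity_logits - affinity_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    n_tokens, n_experts = probs.shape
    routed = np.argsort(probs, axis=1)[:, -top_k:]
    f = np.bincount(routed.ravel(), minlength=n_experts) / (n_tokens * top_k)
    P = probs.mean(axis=0)
    return alpha * n_experts * float((f * P).sum())

logits = np.random.default_rng(2).standard_normal((512, 16))
print(expert_balance_loss(logits, top_k=2))
```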

Training & Outcomes

DeepSeek-V2, with ~236B total params (21B activated), is pre-trained on 8.1T tokens. They do Supervised Fine Tuning (SFT) on 1.5M instruction samples, then reinforcement learning (RL) for alignment. The end result:

  • Inference and training are both faster and cheaper (MLA + sparse experts)
  • They remain stable at scale

This paper is really where the gains from HPC Co-Design start to become apparent. By designing the model architecture with the training infrastructure in mind, and implementing a training regime that considers the realities of the hardware (e.g. low interconnect speeds on H800s), the team was able to lay the foundation for their most notable breakthrough.

DeepSeek-V3: HPC Co-Design

Scaling MoE to 671B While Preserving Efficiency

Building on V2, DeepSeek-V3 further extends sparse models to 671B parameters (37B activated), training on 14.8T tokens in under 2.8M H800 GPU hours. The authors credit extensive HPC co-design:

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware.

-- DeepSeek-V3 Tech. Report, p.5

The major novelties are:

  1. Refined MLA
  2. Refined DeepSeekMoE
  3. Co-Designed Training & Inference Frameworks

Refined MLA

Multi-Head Latent Attention was introduced in V2 to reduce KV cache overhead. In V3, it is further refined with several new features:

  • Dynamic Low-Rank Projection: Instead of a static compression dimension, MLA adjusts how strongly it compresses Key/Value vectors depending on sequence length. For shorter sequences, less compression preserves fidelity; for extremely long sequences (32K–128K tokens), deeper compression manages memory growth.
  • Adaptive Query Compression: Where V2 used a fixed $d_c$ dimension, V3 employs an adaptive scaling of the query up/down at different layer depths. Early layers use higher-dimensional queries for expressiveness; deeper layers more aggressively compress to save activation memory.
  • Improved RoPE Handling: V2 only partially decoupled keys, but V3 extends the concept for more stable 128K context. They track a “decoupled shared key” that reduces numerical drift in extremely long generations.
  • Joint KV Storage: V2 stored compressed keys and values separately. V3 merges them into a shared compressed representation to further reduce memory traffic during multi-node inference.
  • Layer-Wise Adaptive Cache: Instead of caching all past tokens for all layers, V3 prunes older KV entries at deeper layers. This helps keep memory usage in check when dealing with 128K context windows.

Together, these MLA refinements ensure that while DeepSeek-V3 can attend across very long sequences, the memory overhead remains manageable.

Refined DeepSeekMoE: Auxiliary-Loss-Free, Higher Capacity

On the MoE side, DeepSeek-V3 drops the auxiliary-loss approach from V2. Instead of an explicit penalty term, each expert acquires a dynamic bias $b_i$. If an expert is overloaded at a step, $b_i$ decreases; if underloaded, $b_i$ increases. The gating decision then adds $b_i$ to the token's affinity:

$$ s'_{i,t} = s_{i,t} + b_i $$
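
A sketch of how such a bias-based balancing loop might look; the sign-based update and the step size below are illustrative, not the paper's exact procedure:

```python
import numpy as np

n_experts, top_k, gamma = 16, 8, 0.001   # gamma is an assumed bias update speed
bias = np.zeros(n_experts)
rng = np.random.default_rng(3)

for step in range(100):
    s = rng.random((256, n_experts))                    # token-expert affinities s_{i,t}
    chosen = np.argsort(s + bias, axis=1)[:, -top_k:]   # gate on biased scores s'_{i,t}
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    bias -= gamma * np.sign(load - load.mean())         # overloaded -> bias down, underloaded -> up

print(np.round(bias, 3))   # biases drift to counteract persistent imbalance
```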

Key Improvements:

  • No Token Dropping: V2 occasionally dropped tokens if certain experts got overloaded, but the new bias-based method keeps everything.
  • More Activated Experts: They raise the number of routed experts from 6 to 8 per token, improving representational power.
  • Higher Stability: By removing auxiliary losses, they avoid potential interference with the main training objective, focusing purely on the intrinsic gating signals plus bias adjustments.

Hence, the final feed-forward module is a combination of a small set of shared experts plus up to 8 specialized experts chosen adaptively.

Co-Designed Frameworks: FP8, DualPipe, and PTX Optimizations

Scaling an MoE model to 671B demanded HPC-level solutions for training and inference. The authors emphasize:

Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap.

-- DeepSeek-V3 Tech. Report, p.5

FP8 Mixed Precision

They adopt an FP8 data format for General Matrix Multiplications (GEMMs), halving memory compared to 16-bit formats. The risk is reduced numeric range, so they offset it with (a toy illustration follows the list):

  • Block-wise scaling (e.g., 1x128 or 128x128 tiles).
  • Periodic “promotion” to FP32 after short accumulation intervals to avoid overflow/underflow.
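
The following toy illustration shows the block-wise scaling idea: each small tile gets its own scale so a single outlier doesn't wreck the dynamic range of the whole tensor. The "FP8 cast" is faked here by rounding and clipping; real E4M3/E5M2 conversion is handled by hardware and libraries.

```python
import numpy as np

FP8_MAX = 448.0   # max magnitude of the E4M3 format, used here only as a clip bound
BLOCK = 128       # tile size; the paper describes 1x128 and 128x128 tiles

def quantize_blockwise(x):
    tiles = x.reshape(-1, BLOCK)
    scale = np.maximum(np.abs(tiles).max(axis=1, keepdims=True) / FP8_MAX, 1e-12)
    q = np.clip(np.round(tiles / scale), -FP8_MAX, FP8_MAX)   # stand-in for the FP8 cast
    return q, scale

x = np.random.default_rng(4).standard_normal(1024) * 3
q, scale = quantize_blockwise(x)
x_hat = (q * scale).ravel()              # "promotion" back to full precision
print(float(np.abs(x - x_hat).max()))    # per-block reconstruction error stays small
```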

DualPipe Parallelism

They propose DualPipe to overlap forward/backward computation with the MoE all-to-all dispatch. It rearranges pipeline stages to ensure that network communication (particularly across InfiniBand) is hidden behind local matrix multiplications.

PTX-Level & Warp Specialization

To fully exploit InfiniBand (IB) and NVLink:

  • They tune warp-level instructions in PTX (a level lower than CUDA), auto-tuning the chunk size for all-to-all dispatch.
  • Dynamically partition Streaming Multiprocessors (SMs) into communication vs. compute tasks so that token dispatch never stalls local GEMM.

As a result, training costs were cut to 2.8M H800 GPU hours per run - low for a 14.8T token corpus.

Outcomes

The resulting DeepSeek-V3 excels at code, math, and some multilingual tasks, outperforming other open-source LLMs of similar scale. Deep HPC co-design (FP8, DualPipe, PTX-level optimization) plus refined MLA/MoE implementation achieve extreme scale with stable training.

DeepSeek-R1: Reinforcement Learning for Deeper Reasoning

It's worth noting that both DeepSeek R1 and DeepSeek R1-Zero are architecturally identical to DeepSeek V3 (but use the “only-pretrained” base version). The only difference between these models is how post-training is handled.

Emergent Reasoning Behaviors Through RL-Only

All prior DeepSeek releases used SFT (plus occasional RL). By contrast, DeepSeek-R1-Zero tries an extreme: no supervised warmup, just RL from the base model. They adopt Group Relative Policy Optimization (GRPO), which:

  1. Samples a group of old-policy outputs $\{o_1, \ldots, o_G\}$
  2. Scores each with a reward (in this case, rule-based)
  3. Normalizes the advantage $A_i$ by group mean/stdev
  4. Optimizes a clipped PPO-like objective

The reward function for the R1 models is rule-based - a simple weighted sum of two components (sketched in code after the list):

  • Accuracy Reward - if the task has an objective correct answer (e.g. a math problem, coding task, etc.), correctness is verified using mathematical equation solvers for step-by-step proof checking, and code execution & test cases for code correctness verification
  • Format Reward - the model is rewarded for following a structured reasoning process using explicit reasoning markers <think></think> and <answer></answer>
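
A hedged sketch of what such a rule-based reward could look like: a weighted sum of an accuracy check against a reference answer and a format check for the reasoning markers. The weights, tag requirements, and verification logic below are illustrative; the actual harness uses equation solvers and code execution rather than string comparison.

```python
import re

# Hypothetical helper names; weights, tags, and checks are illustrative.
THINK = re.compile(r"<think>.+?</think>", re.S)
ANSWER = re.compile(r"<answer>(.+?)</answer>", re.S)

def rule_based_reward(output: str, reference: str, w_acc=1.0, w_fmt=0.2) -> float:
    answer = ANSWER.search(output)
    format_ok = bool(THINK.search(output)) and answer is not None           # format reward
    accuracy_ok = answer is not None and answer.group(1).strip() == reference.strip()
    return w_acc * float(accuracy_ok) + w_fmt * float(format_ok)

sample = "<think>2 + 2 = 4, so the answer is 4.</think><answer>4</answer>"
print(rule_based_reward(sample, "4"))   # 1.2
```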

The relative advantage $A_i$ for a given output is calculated as:

$$ A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})} $$

where $r_i$ is the reward calculated for the given output. The model's policy is updated to favor responses with higher rewards while constraining changes using a clipping function which ensures that the new policy remains close to the old.
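
A minimal sketch of that group-normalized advantage computation, with made-up reward values; the clipped policy-gradient step that consumes these advantages is omitted:

```python
import numpy as np

def group_advantages(rewards):
    """Normalize rewards within one sampled group (the A_i above)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Made-up rewards for G = 8 sampled outputs of one prompt.
print(np.round(group_advantages([1.0, 0.0, 1.0, 0.2, 0.0, 1.0, 0.0, 0.5]), 3))
```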

In short: the authors created a testing/verification harness around the model, which they exercised using reinforcement learning, gently guiding the model with simple Accuracy and Format rewards. In doing so, emergent reasoning behaviors were observed:

  • Self-verification where the model double-checks its own answers
  • Extended chain-of-thought where the model learns to explain its reasoning more thoroughly
  • Exploratory reasoning - the model tries different approaches before converging on an answer
  • Reflection - the model starts questioning its own solutions and adjusting reasoning paths dynamically

R1-Zero is probably the most interesting outcome of the R1 paper for researchers because it learned complex chain-of-thought patterns from raw reward signals alone. However, the model exhibited notable issues:

  • Readability Problems: Because it never saw any human-curated language style, its outputs were sometimes jumbled or mixed multiple languages.
  • Instability in Non-Reasoning Tasks: Lacking SFT data for general conversation, R1-Zero would produce valid solutions for math or code but be awkward on simpler Q&A or safety prompts.
  • Limited Domain: Rule-based rewards worked well for verifiable tasks (math/coding), but handling creative/writing tasks demanded broader coverage.

Hence, the authors concluded that while “pure RL” yields strong reasoning in verifiable tasks, the model’s overall user-friendliness was lacking. This led them to DeepSeek-R1: an alignment pipeline combining small cold-start data, RL, rejection sampling, and more RL, to “fill in the gaps” from R1-Zero’s deficits.

Refined Reasoning Through SFT + RL

DeepSeek-R1 addresses R1-Zero's limitations by injecting a small amount of supervised data before RL and weaving in additional alignment steps.

Stage 1: “Cold-Start” SFT

They gather a small number (~thousands) of curated, “human-friendly” chain-of-thought examples covering common sense Q&A, basic math, standard instruction tasks, etc. Then, they do a short SFT pass on the base model. This ensures the model acquires:

  • Better readability: Polished language style and formatting.
  • Non-reasoning coverage: Some conversation, factual QA, or creative tasks not easily rewarded purely by rule-based checks.

In essence, the authors realized you can avoid the “brittleness” of a zero-SFT approach by giving the model a seed of user-friendly behaviors.

Stage 2: Reasoning-Oriented RL

Next, as in R1-Zero, they apply large-scale RL for tasks like math and code. The difference is that now the model starts from a “cold-start SFT” checkpoint—so it retains decent language style while still learning verifiable tasks from a rule-based or tool-based reward. This RL stage fosters the same emergent chain-of-thought expansions but without the random “language mixing” or bizarre structure.

Stage 3: Rejection Sampling + Additional SFT

Once that RL converges, they generate multiple completions per prompt from the RL checkpoint. Using a combination of automatic verifiers and some human checks, they pick the best outputs (“rejection sampling”) and build a new SFT dataset. They also incorporate standard writing/factual/safety data from DeepSeek-V3 to keep the model balanced in non-verifiable tasks. Finally, they re-fine-tune the base model on this curated set.

This step addresses the “spotty coverage” problem even further: The best RL answers become training targets, so the model improves at chain-of-thought and clarity.

Stage 4: RL for “All Scenarios”

Lastly, they do another RL pass on diverse prompts—not just math/code but general helpfulness, safety, or role-playing tasks. Rewards may come from a combination of rule-based checks and large “preference” models (trained from user preference pairs). The final result is a model that:

  • Retains strong chain-of-thought for verifiable tasks,
  • Aligns to broad user requests in everyday usage,
  • Maintains safer, more controlled outputs.

Connecting the Arcs: Efficiency & Emergence

Despite covering different angles - scaling laws, MoE, HPC scheduling, and large-scale RL - DeepSeek's work consistently follows these arcs:

  1. Cost and Memory Efficiency
    • They systematically design methods (MLA, MoE gating, device-limited routing, FP8 training, DualPipe) to maximize hardware utilization even in constrained environments
    • HPC-level scheduling (PTX instructions, warp specialization) hides communication overhead and overcomes the limitations imposed by limited interconnect speeds on H800s
  2. Sparse + HPC Co-Design
    • From V2 to V3, we see an evolving mixture-of-experts approach, culminating in a 671B-parameter model feasible on H800 clusters.
    • The authors repeatedly stress that HPC co-design is the only path to cheaply train multi-hundred-billion-parameter LLMs.
  3. Emergent Reasoning
    • R1 pushes beyond standard supervised training, letting RL signals shape deep chain-of-thought. The synergy between pre-trained scale and targeted post-training yields advanced reasoning patterns like reflection or multi-step verification.

Taken as a whole, the DeepSeek series highlights how architecture, algorithms, frameworks, and hardware must be co-designed to handle LLM training at trillion-token scales. Looking to the future, it indicates that toolchain builders may want to find ways to capture some of these HPC optimizations as part of the model compilation path or training apparatus, and AI research teams may want to work closely with HPC expertise even in the early days of architecture ideation.


Footnotes

1: Non-embedding FLOPs are the number of floating point operations used for pre-training certain layers of the transformer (the non-embedding layers). The authors found only some layers contributed to the scaling formula.

2: A model consists of billions of internal variables, which are called its parameters. These parameters gain their values (weights) during training. Before training, developers set a number of different variables that control the training process itself; these are called hyperparameters.

Significant Revisions

06 February 2025: First published