year: 2026/01
paper: https://arxiv.org/pdf/2601.19897
website: https://self-distillation.github.io/SDFT
code: https://github.com/idanshen/Self-Distillation
connections: ICL, on-policy, SFT, distillation, GPICL


  • On-policy, i.e. the same model generates both as student, $\pi_\theta(y \mid x)$, and as teacher, $\pi_\theta(y \mid x, d)$, where $d$ is a demonstration (the answer 1)
  • Premise: $\pi_\theta(y \mid x, d)$ predicts $y$ better than $\pi_\theta(y \mid x)$ alone. Knowing $d$ helps generate better $y$, distilling better CoTs into the weights. 1
  • Minimize reverse KL on student-generated rollouts: $\min_\theta \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{teacher}}(\cdot \mid x, d)\big)$, where the teacher is the same model with the demonstration in context (no gradient through the teacher)
  • This is equal to maximizing the expected reward $r(y) = \log \pi_{\text{teacher}}(y \mid x, d) - \log \pi_\theta(y \mid x)$ under the student's own rollouts
    • How much more does the teacher like $y$ than the student currently does?
  • Off-policy leads to catastrophic forgetting because it directly maximizes likelihood on expert outputs, implicitly reducing mass on prior capabilities. No anchor to old behavior.
  • Off-policy also leads to compounding errors/distribution shift: small prediction errors push you into states the expert never demonstrated, pushing you further OOD, with error growing as $O(\epsilon T^2)$, where $T$ is the sequence length and $\epsilon$ is the per-step error.
    • On-policy trains on your own mistakes, so you learn to recover from them.
  • On-policy leads to better generalization because it’s not memorizing the teacher’s outputs, but learning how the model itself can use context to improve its predictions.
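The reverse-KL ↔ reward equivalence above can be checked numerically on a toy categorical distribution over whole outputs (probabilities are illustrative, not from the paper):

```python
import math

# Toy vocabulary of whole outputs y
outputs = ["a", "b", "c"]
student = {"a": 0.5, "b": 0.3, "c": 0.2}  # pi_theta(y | x)
teacher = {"a": 0.2, "b": 0.5, "c": 0.3}  # pi_theta(y | x, d): same model, demo in context

# Reverse KL(student || teacher), the objective evaluated on student rollouts
reverse_kl = sum(student[y] * math.log(student[y] / teacher[y]) for y in outputs)

# Per-output reward: how much more the teacher likes y than the student currently does
reward = {y: math.log(teacher[y]) - math.log(student[y]) for y in outputs}

# Expected reward under the student's own distribution
expected_reward = sum(student[y] * reward[y] for y in outputs)

# Minimizing the reverse KL is exactly maximizing the expected reward
assert abs(reverse_kl + expected_reward) < 1e-12
print(f"reverse KL = {reverse_kl:.4f}, E[reward] = {expected_reward:.4f}")
```

The identity is just expanding the KL: $\mathrm{KL} = \mathbb{E}_{y \sim \pi_\theta}[\log \pi_\theta(y) - \log \pi_{\text{teacher}}(y)] = -\mathbb{E}_{y \sim \pi_\theta}[r(y)]$.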
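The $O(\epsilon T^2)$ compounding-error claim can be illustrated with the standard pessimistic behavior-cloning toy (my own sketch, not the paper's experiment): the policy errs with probability $\epsilon$ per step, and once it leaves the expert's state distribution it never recovers, so the expected number of off-distribution steps grows roughly quadratically in $T$:

```python
def expected_off_distribution_steps(T: int, eps: float) -> float:
    # P(off-distribution at step t) = 1 - (1 - eps)^t, summed over t = 1..T
    return sum(1 - (1 - eps) ** t for t in range(1, T + 1))

eps = 1e-3
for T in (100, 200, 400):
    exact = expected_off_distribution_steps(T, eps)
    approx = eps * T * (T + 1) / 2  # first-order approximation for small eps
    print(T, round(exact, 2), round(approx, 2))
```

Doubling $T$ roughly quadruples the expected off-distribution time, which is the quadratic blow-up that on-policy training avoids by letting the model practice recovering from its own states.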

Results:

  • Higher new-task accuracy than SFT
  • Substantially less forgetting
  • Sequential learning on 3 tasks: SDFT accumulates skills; SFT oscillates as each new task kills previous ones
  • Better ICL → Better SDFT
    • Failed at 3B scale for them
  • Can train reasoning models from answer-only data without collapsing their chain-of-thought behavior
    • Naively: If your target is just “42” but your model naturally produces “Let me work through this step by step… first I’ll consider… therefore the answer is 42,” every token that isn’t in “42” gets penalized. The model learns: long reasoning = bad. So it collapses to short outputs.
    • With SDFT: The student’s verbose reasoning gets rewarded if it reaches the right answer, because the teacher’s distribution favors correct-answer-reaching outputs. The demo provides what to achieve, not how to achieve it.
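The contrast above can be made concrete with a toy over three whole-sequence outcomes (numbers are illustrative, not from the paper): answer-only SFT treats every non-target rollout as equally wrong, while the SDFT reward separates correct verbose reasoning from incorrect verbose reasoning:

```python
import math

# Target demo is just "42"; the model can answer tersely or reason first.
student = {"42": 0.10, "reasoning->42": 0.50, "reasoning->17": 0.40}  # pi(y | x)
teacher = {"42": 0.30, "reasoning->42": 0.65, "reasoning->17": 0.05}  # pi(y | x, answer=42)

# SFT on the bare target: only the literal sequence "42" is rewarded, so both
# verbose rollouts look identically wrong -- long reasoning gets uniformly penalized.
sft_is_rewarded = {y: (y == "42") for y in student}
assert sft_is_rewarded["reasoning->42"] == sft_is_rewarded["reasoning->17"] == False

# SDFT reward on student rollouts: log pi_teacher(y) - log pi_student(y).
reward = {y: math.log(teacher[y]) - math.log(student[y]) for y in student}

# Correct verbose reasoning is reinforced; only wrong reasoning is pushed down.
assert reward["reasoning->42"] > 0 > reward["reasoning->17"]
for y, r in reward.items():
    print(f"{y:>15}: reward {r:+.2f}")
```

Because the teacher has seen the answer, it concentrates mass on answer-reaching outputs regardless of length, so the demo specifies what to achieve rather than how.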

Limitations:

  • Requires 2.5x compute vs SFT
  • Can inherit teacher artifacts like “Based on the text…” (teacher’s context leaks into student’s learned distribution)
![[Self-Distillation Enables Continual Learning-1770048166870.webp]]
![[Self-Distillation Enables Continual Learning-1770048177569.webp]]
![[Self-Distillation Enables Continual Learning-1770048203835.webp]]
![[Self-Distillation Enables Continual Learning-1770048217439.webp]]
![[Self-Distillation Enables Continual Learning-1770048227700.webp]]

https://claude.ai/chat/e2227fb5-ae94-4d7c-ae0d-22189f86df09

Footnotes

  1. I thought $d$ could be any extra context/info, but this paper seems to always supply the actual answer, whereas that more general approach would be known as context distillation. 2