year: 2026/01
paper: https://arxiv.org/pdf/2601.19897
website: https://self-distillation.github.io/SDFT
code: https://github.com/idanshen/Self-Distillation
connections: ICL, on-policy, SFT, distillation, GPICL


  • On-policy, i.e. the same model generates both as student, $\pi_\theta(y \mid x)$, and as teacher, $\pi_\theta(y \mid x, d)$, where $d$ is a demonstration (the answer 1)
  • Premise: $\pi_\theta(y \mid x, d)$ predicts $y$ better than $\pi_\theta(y \mid x)$ alone. Knowing $d$ helps generate better $y$, distilling better CoTs into the weights. 1
  • Minimize reverse KL on student-generated rollouts: $\min_\theta \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{teacher}}(\cdot \mid x, d)\big)$, where the teacher is the same model with the demonstration in context (no gradient through the teacher)
  • This is equal to maximizing the expected reward $r(y) = \log \pi_{\text{teacher}}(y \mid x, d) - \log \pi_\theta(y \mid x)$ under the student's own rollouts
    • How much more does the teacher like $y$ than the student currently does?
  • Off-policy leads to catastrophic forgetting because it directly maximizes likelihood on expert outputs, implicitly reducing mass on prior capabilities. No anchor to old behavior.
  • Off-policy also leads to compounding errors/distribution shift: small prediction errors push you into states the expert never demonstrated, pushing you further OOD, with error growing as $O(\epsilon T^2)$, where $T$ is the sequence length and $\epsilon$ is the per-step error.
    • On-policy trains on your own mistakes, so you learn to recover from them.
  • On-policy leads to better generalization because it’s not memorizing the teacher’s outputs, but learning how the model itself can use context to improve its predictions.
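The reverse-KL ↔ reward equivalence above can be checked numerically on a toy categorical distribution over whole outputs (probabilities are illustrative, not from the paper):

```python
import math

# Toy vocabulary of whole outputs y
outputs = ["a", "b", "c"]
student = {"a": 0.5, "b": 0.3, "c": 0.2}  # pi_theta(y | x)
teacher = {"a": 0.2, "b": 0.5, "c": 0.3}  # pi_theta(y | x, d): same model, demo in context

# Reverse KL(student || teacher), the objective evaluated on student rollouts
reverse_kl = sum(student[y] * math.log(student[y] / teacher[y]) for y in outputs)

# Per-output reward: how much more the teacher likes y than the student currently does
reward = {y: math.log(teacher[y]) - math.log(student[y]) for y in outputs}

# Expected reward under the student's own distribution
expected_reward = sum(student[y] * reward[y] for y in outputs)

# Minimizing the reverse KL is exactly maximizing the expected reward
assert abs(reverse_kl + expected_reward) < 1e-12
print(f"reverse KL = {reverse_kl:.4f}, E[reward] = {expected_reward:.4f}")
```

The identity is just expanding the KL: $\mathrm{KL} = \mathbb{E}_{y \sim \pi_\theta}[\log \pi_\theta(y) - \log \pi_{\text{teacher}}(y)] = -\mathbb{E}_{y \sim \pi_\theta}[r(y)]$.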
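The $O(\epsilon T^2)$ compounding-error claim can be illustrated with the standard pessimistic behavior-cloning toy (my own sketch, not the paper's experiment): the policy errs with probability $\epsilon$ per step, and once it leaves the expert's state distribution it never recovers, so the expected number of off-distribution steps grows roughly quadratically in $T$:

```python
def expected_off_distribution_steps(T: int, eps: float) -> float:
    # P(off-distribution at step t) = 1 - (1 - eps)^t, summed over t = 1..T
    return sum(1 - (1 - eps) ** t for t in range(1, T + 1))

eps = 1e-3
for T in (100, 200, 400):
    exact = expected_off_distribution_steps(T, eps)
    approx = eps * T * (T + 1) / 2  # first-order approximation for small eps
    print(T, round(exact, 2), round(approx, 2))
```

Doubling $T$ roughly quadruples the expected off-distribution time, which is the quadratic blow-up that on-policy training avoids by letting the model practice recovering from its own states.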

Results:

  • Higher new-task accuracy than SFT
  • Substantially less forgetting
  • Sequential learning on 3 tasks: SDFT accumulates skills; SFT oscillates as each new task kills previous ones
  • Better ICL → Better SDFT
    • Failed at 3B scale for them
  • Can train reasoning models from answer-only data without collapsing their chain-of-thought behavior
    • Naively: If your target is just “42” but your model naturally produces “Let me work through this step by step… first I’ll consider… therefore the answer is 42,” every token that isn’t in “42” gets penalized. The model learns: long reasoning = bad. So it collapses to short outputs.
    • With SDFT: The student’s verbose reasoning gets rewarded if it reaches the right answer, because the teacher’s distribution favors correct-answer-reaching outputs. The demo provides what to achieve, not how to achieve it.
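The contrast above can be made concrete with a toy over three whole-sequence outcomes (numbers are illustrative, not from the paper): answer-only SFT treats every non-target rollout as equally wrong, while the SDFT reward separates correct verbose reasoning from incorrect verbose reasoning:

```python
import math

# Target demo is just "42"; the model can answer tersely or reason first.
student = {"42": 0.10, "reasoning->42": 0.50, "reasoning->17": 0.40}  # pi(y | x)
teacher = {"42": 0.30, "reasoning->42": 0.65, "reasoning->17": 0.05}  # pi(y | x, answer=42)

# SFT on the bare target: only the literal sequence "42" is rewarded, so both
# verbose rollouts look identically wrong -- long reasoning gets uniformly penalized.
sft_is_rewarded = {y: (y == "42") for y in student}
assert sft_is_rewarded["reasoning->42"] == sft_is_rewarded["reasoning->17"] == False

# SDFT reward on student rollouts: log pi_teacher(y) - log pi_student(y).
reward = {y: math.log(teacher[y]) - math.log(student[y]) for y in student}

# Correct verbose reasoning is reinforced; only wrong reasoning is pushed down.
assert reward["reasoning->42"] > 0 > reward["reasoning->17"]
for y, r in reward.items():
    print(f"{y:>15}: reward {r:+.2f}")
```

Because the teacher has seen the answer, it concentrates mass on answer-reaching outputs regardless of length, so the demo specifies what to achieve rather than how.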

Limitations:

  • Requires 2.5x compute vs SFT
  • Can inherit teacher artifacts like “Based on the text…” (teacher’s context leaks into student’s learned distribution)
![[Self-Distillation Enables Continual Learning-1770048166870.webp]]
![[Self-Distillation Enables Continual Learning-1770048177569.webp]]
![[Self-Distillation Enables Continual Learning-1770048203835.webp]]
![[Self-Distillation Enables Continual Learning-1770048217439.webp]]
![[Self-Distillation Enables Continual Learning-1770048227700.webp]]

https://claude.ai/chat/e2227fb5-ae94-4d7c-ae0d-22189f86df09

Footnotes

  1. I thought $d$ could be any extra context/info, but this paper seems to always supply the actual answer, whereas that more general approach would be known as context distillation. 2