Entropy-Aware Fine-Tuning (EAFT) in Axolotl
Entropy-Aware Fine-Tuning (EAFT) is a variant of supervised fine-tuning (SFT) that utilises token-level entropy to mitigate the catastrophic forgetting that vanilla SFT can cause. Vanilla SFT uses the Cross-Entropy (CE) loss, which adapts a base model to a target dataset of prompts and the responses we want the model to learn. The CE loss achieves this by maximising the log-likelihood of the target sequences, but it reinforces the model's confidence uniformly across all tokens.
EAFT addresses this by scaling the standard CE loss with the normalised entropy, which suppresses gradients for samples the model is already confident about and instead introduces an inductive bias towards learning from low-confidence samples. You can learn more about how EAFT works in the original paper from Diao et al.
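As a rough illustration of the core idea only (not Axolotl's actual implementation, and leaving aside what the `eaft_k` and `eaft_alpha` options below control), a per-token version of the weighting might look like the sketch below; the function name and the choice to detach the entropy weight from the gradient are our own assumptions:

```python
import math

import torch
import torch.nn.functional as F


def entropy_weighted_ce(logits: torch.Tensor, labels: torch.Tensor,
                        ignore_index: int = -100) -> torch.Tensor:
    """Per-token cross-entropy scaled by normalised predictive entropy.

    Assumes logits and labels are already shifted/aligned, as in a standard
    causal-LM loss. Confident (low-entropy) tokens are down-weighted, so
    uncertain (high-entropy) tokens dominate the gradient signal.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Shannon entropy of the predictive distribution, per token,
    # normalised to [0, 1] by the maximum possible entropy log|V|.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1) / math.log(vocab_size)

    # Standard per-token CE, keeping the token dimension.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view_as(labels)

    mask = (labels != ignore_index).float()
    # Detach the weight so gradients flow only through the CE term.
    return (entropy.detach() * ce * mask).sum() / mask.sum().clamp(min=1.0)
```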
Configuring EAFT in Axolotl is as simple as adding the following to your SFT config:
```yaml
use_eaft: true
eaft_k: 20
eaft_alpha: 0.333
eaft_normalize: true
```
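With these options in place, launch training as you normally would for SFT (e.g. `axolotl train your_config.yml`).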
To validate our implementation, we attempted to replicate a scaled-down version of one of the experiments in the paper. We fine-tuned the gemma-3-1b-it base model with both SFT and EAFT on the NuminaMath-CoT dataset, using a 2x H100 NVL pod on RunPod.
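The metric names in the table below follow lm-evaluation-harness conventions; assuming that harness, a comparable evaluation could be run roughly as follows (the checkpoint path is a placeholder):

```python
import lm_eval

# Evaluate a local fine-tuned checkpoint on the benchmarks reported below.
# "outputs/gemma-3-1b-eaft-math" is a placeholder path.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=outputs/gemma-3-1b-eaft-math,dtype=bfloat16",
    tasks=["mmlu", "ifeval", "gsm8k", "arc_challenge", "hellaswag"],
    batch_size="auto",
)

# Print per-task metrics (acc, acc_norm, exact_match, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```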
| Task / Group | Metric | Gemma-3-1B-EAFT-Math | Gemma-3-1B-SFT-Math |
|---|---|---|---|
| MMLU (Total) | acc | 0.3153 | 0.3096 |
| ↳ Humanities | acc | 0.2959 | 0.2912 |
| ↳ Social Sciences | acc | 0.3578 | 0.3474 |
| ↳ STEM | acc | 0.2544 | 0.2490 |
| ↳ Other | acc | 0.3643 | 0.3614 |
| IFEval | prompt_level_strict_acc | 0.3900 | 0.4030 |
| IFEval | inst_level_strict_acc | 0.5372 | 0.5504 |
| GSM8K | exact_match (flex) | 0.1721 | 0.1736 |
| ARC Challenge | acc_norm | 0.3481 | 0.3515 |
| HellaSwag | acc_norm | 0.5780 | 0.5779 |
We found that our results lined up with the EAFT paper in some areas: the EAFT-tuned model yielded small improvements on the general knowledge and common-sense benchmarks (MMLU, HellaSwag), but struggled to win out on the math benchmarks.
We would have expected training on the NuminaMath dataset to produce clear improvements across both groups; we attribute the shortfall on the math side to our much weaker base model (Gemma-3-1B-IT versus Qwen3-4B-Instruct in the EAFT paper).
For instance, both our Gemma SFT and EAFT checkpoints failed to produce any correct answers on the AIME benchmarks, whereas the Qwen3-4B models in the paper reach significantly higher accuracy (up to 60% for the thinking variant). Even so, our checkpoints help show that EAFT can mitigate the impact of catastrophic forgetting: the EAFT-tuned model retains its capacity to perform well on the general knowledge benchmarks.