Entropy-Aware Fine-Tuning (EAFT) in Axolotl
Entropy-Aware Fine-Tuning (EAFT) is a variant of supervised fine-tuning (SFT) that utilises token-level entropy to mitigate the catastrophic forgetting that vanilla SFT can cause. Vanilla SFT uses the Cross-Entropy (CE) loss, which adapts a base model to a target dataset of prompts and the responses we want the model to learn. The CE loss achieves this by maximising the log-likelihood of the target sequences, but it reinforces the model's confidence uniformly across all tokens.
EAFT addresses this by scaling the standard CE loss with the normalised entropy, which suppresses gradients for samples the model is already confident about and instead introduces an inductive bias towards learning from low-confidence samples. You can learn more about how EAFT works in the original paper from Diao et al.
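As a rough illustration of the core idea only (not Axolotl's actual implementation, and leaving aside what the `eaft_k` and `eaft_alpha` options below control), a per-token version of the weighting might look like the sketch below; the function name and the choice to detach the entropy weight from the gradient are our own assumptions:

```python
import math

import torch
import torch.nn.functional as F


def entropy_weighted_ce(logits: torch.Tensor, labels: torch.Tensor,
                        ignore_index: int = -100) -> torch.Tensor:
    """Per-token cross-entropy scaled by normalised predictive entropy.

    Assumes logits and labels are already shifted/aligned, as in a standard
    causal-LM loss. Confident (low-entropy) tokens are down-weighted, so
    uncertain (high-entropy) tokens dominate the gradient signal.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Shannon entropy of the predictive distribution, per token,
    # normalised to [0, 1] by the maximum possible entropy log|V|.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1) / math.log(vocab_size)

    # Standard per-token CE, keeping the token dimension.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view_as(labels)

    mask = (labels != ignore_index).float()
    # Detach the weight so gradients flow only through the CE term.
    return (entropy.detach() * ce * mask).sum() / mask.sum().clamp(min=1.0)
```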
Configuring EAFT in Axolotl is as simple as adding the following to your SFT config:
```yaml
use_eaft: true
eaft_k: 20
eaft_alpha: 0.333
eaft_normalize: true
```
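With these options in place, launch training as you normally would for SFT (e.g. `axolotl train your_config.yml`).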
To validate our implementation, we attempted to replicate a scaled-down version of one of the experiments in the paper. We fine-tuned the gemma-3-1b-it base model with both SFT and EAFT on the NuminaMath-CoT dataset, using a 2x H100 NVL pod on RunPod.
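The metric names in the table below follow lm-evaluation-harness conventions; assuming that harness, a comparable evaluation could be run roughly as follows (the checkpoint path is a placeholder):

```python
import lm_eval

# Evaluate a local fine-tuned checkpoint on the benchmarks reported below.
# "outputs/gemma-3-1b-eaft-math" is a placeholder path.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=outputs/gemma-3-1b-eaft-math,dtype=bfloat16",
    tasks=["mmlu", "ifeval", "gsm8k", "arc_challenge", "hellaswag"],
    batch_size="auto",
)

# Print per-task metrics (acc, acc_norm, exact_match, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```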
| Task / Group | Metric | Gemma-3-1B-EAFT-Math | Gemma-3-1B-SFT-Math |
|---|---|---|---|
| MMLU (Total) | acc | 0.3153 | 0.3096 |
| ↳ Humanities | acc | 0.2959 | 0.2912 |
| ↳ Social Sciences | acc | 0.3578 | 0.3474 |
| ↳ STEM | acc | 0.2544 | 0.2490 |
| ↳ Other | acc | 0.3643 | 0.3614 |
| IFEval | prompt_level_strict_acc | 0.3900 | 0.4030 |
| IFEval | inst_level_strict_acc | 0.5372 | 0.5504 |
| GSM8K | exact_match (flex) | 0.1721 | 0.1736 |
| ARC Challenge | acc_norm | 0.3481 | 0.3515 |
| HellaSwag | acc_norm | 0.5780 | 0.5779 |
We found that our results lined up with the EAFT paper in some areas: the EAFT-tuned model yielded small improvements on the general knowledge and common-sense benchmarks (MMLU, HellaSwag), but struggled to win out on the math benchmarks.
We would have expected training on the NuminaMath dataset to produce clear improvements across both groups; we attribute the shortfall on the math side to our much weaker base model (Gemma-3-1B-IT versus Qwen3-4B-Instruct in the EAFT paper).
For instance, both our Gemma SFT and EAFT checkpoints failed to produce any correct answers on the AIME benchmarks, whereas the Qwen3-4B models in the paper reach significantly higher accuracy (up to 60% for the thinking variant). Even so, our checkpoints help show that EAFT can mitigate the impact of catastrophic forgetting: the EAFT-tuned model retains its capacity to perform well on the general knowledge benchmarks.