Self-Distillation Enables Continual Learning
Paper: arXiv:2601.19897
This model is a Self-Distillation Fine-Tuned (SDFT) version of LiquidAI/LFM2.5-1.2B-Instruct, trained using the methodology from the paper "Self-Distillation Enables Continual Learning".
Self-Distillation Fine-Tuning (SDFT) is a continual learning technique that:

- Prompts the model with a query plus an in-context demonstration (the teacher prompt) to generate on-policy training targets.
- Trains the same model on the query alone (the student prompt) to match the teacher's output distribution via an analytic, full-vocabulary KL divergence.
- Maintains the distillation teacher as a slow exponential moving average (EMA) of the model, keeping the training targets close to the model's own distribution.

Key advantages:

- New behavior is learned from the model's own demonstration-conditioned outputs rather than external labels.
- Because the training targets stay on-policy, catastrophic forgetting is strongly reduced (~80% retention in the evaluations below).
Install the dependencies:

```bash
pip install torch transformers peft accelerate bitsandbytes
```
Load the model and run a basic generation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

# Load model in float16 (8-bit quantization was used during training, not inference)
model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/lfm2.5-1.5b-sdft",
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# Build a chat-formatted prompt
prompt = """<|im_start|>user
Explain how photosynthesis works.
<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate with the official LiquidAI parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
prompt = """<|im_start|>user
Explain how databases work.
Here is an example response to guide you:
Example: Databases store data in tables. You can query them to get information back.
Now provide your own response following a similar approach:
<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
top_k=50,
top_p=0.1,
repetition_penalty=1.05
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
# Model Architecture
- Base: LiquidAI/LFM2.5-1.2B-Instruct
- Quantization: 8-bit with bitsandbytes
- LoRA: rank=8, alpha=16, dropout=0.05
- Target modules: q_proj, v_proj
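The training script itself is not included in this card; as a rough illustration, the settings above correspond to a `peft` setup along these lines (a sketch, not the authors' actual code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 8-bit quantized base model, as listed above
base_model = AutoModelForCausalLM.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# LoRA adapter on the attention query/value projections
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```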
# Training Parameters
- Learning rate: 5e-6
- Optimizer: AdamW (weight_decay=0.01)
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 16
- Epochs: 3
- Max sequence length: 512
- Max generation length: 128
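Mapped onto plain PyTorch (again a sketch; assumes the `model` from the LoRA sketch above):

```python
import torch

# AdamW with the listed learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=0.01)

# Batch size 1 with 16 accumulation steps gives an effective batch of 16
GRAD_ACCUM_STEPS = 16
MAX_SEQ_LEN = 512   # truncation length for training sequences
MAX_GEN_LEN = 128   # budget for on-policy generations
```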
# SDFT-Specific
- EMA alpha: 0.02
- Temperature: 1.0
- KL divergence: Analytic (full vocabulary)
- On-policy generation: Yes
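To make these concrete, here is a minimal sketch of the EMA teacher update and the analytic full-vocabulary KL loss. The KL direction and the exact EMA form are assumptions for illustration, not verbatim training code:

```python
import torch
import torch.nn.functional as F

EMA_ALPHA = 0.02
TEMPERATURE = 1.0

@torch.no_grad()
def ema_update(teacher, student, alpha=EMA_ALPHA):
    # teacher <- (1 - alpha) * teacher + alpha * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(1.0 - alpha).add_(s_param, alpha=alpha)

def analytic_kl_loss(student_logits, teacher_logits, temperature=TEMPERATURE):
    # KL(teacher || student) computed over the full vocabulary directly
    # from the logits ("analytic"), rather than estimated from samples
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")
```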
Student Prompt (query only):

```
<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant
```
Teacher Prompt (query + demonstration):

```
<|im_start|>user
{query}
Here is an example response to guide you:
<|im_end|>
<|im_start|>assistant
{demonstration}
<|im_end|>
<|im_start|>user
Now provide your own response following a similar approach and reasoning:
<|im_end|>
<|im_start|>assistant
```
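A small helper (hypothetical names, for illustration only) that renders these two templates:

```python
def build_student_prompt(query: str) -> str:
    # Query-only prompt used for the student forward pass
    return f"<|im_start|>user\n{query}\n<|im_end|>\n<|im_start|>assistant\n"

def build_teacher_prompt(query: str, demonstration: str) -> str:
    # Query + demonstration prompt used to generate on-policy targets
    return (
        f"<|im_start|>user\n{query}\n"
        "Here is an example response to guide you:\n<|im_end|>\n"
        f"<|im_start|>assistant\n{demonstration}\n<|im_end|>\n"
        "<|im_start|>user\n"
        "Now provide your own response following a similar approach and reasoning:\n"
        "<|im_end|>\n<|im_start|>assistant\n"
    )
```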
| Category | Description | Performance |
|---|---|---|
| ICL Adaptation | Following demonstration style | ✅ Good |
| Task Improvement | Learning from examples | ✅ Good |
| Retention | No catastrophic forgetting | ✅ ~80% |
| Polarity Control | Following demo viewpoint | ⚠️ Moderate |
⚠️ Important: Use the official LiquidAI parameters for best results:
```python
generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.1,         # Official LiquidAI recommendation
    "top_k": 50,                # Official LiquidAI recommendation
    "top_p": 0.1,               # Official LiquidAI recommendation
    "repetition_penalty": 1.05, # Official LiquidAI recommendation
}
```
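These can be unpacked straight into `generate`:

```python
outputs = model.generate(**inputs, **generation_config)
```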
These parameters are tuned specifically for LFM2.5.
If you use this model, please cite:
SDFT Paper:
```bibtex
@article{shenfeld2026sdft,
  title={Self-Distillation Enables Continual Learning},
  author={Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  journal={arXiv preprint arXiv:2601.19897},
  year={2026}
}
```
Base Model:
```bibtex
@misc{lfm25,
  title={LFM2.5: Liquid Foundation Models},
  author={LiquidAI},
  year={2024},
  url={https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct}
}
```
This model is released under the Apache 2.0 license, following the base model's licensing.
For questions or issues, please open an issue on the model repository.