LFM2.5-1.2B-SDFT: Self-Distillation Fine-Tuned Model

This model is a Self-Distillation Fine-Tuned (SDFT) version of LiquidAI/LFM2.5-1.2B-Instruct, trained using the methodology from the paper "Self-Distillation Enables Continual Learning".

Model Description

  • Base Model: LiquidAI/LFM2.5-1.2B-Instruct
  • Training Method: Self-Distillation Fine-Tuning (SDFT)
  • Training Data: ~5K samples from the OpenAssistant dataset
  • Training Hardware: Single NVIDIA A100 GPU
  • Parameters: LoRA rank=8, alpha=16, targeting q_proj and v_proj

What is SDFT?

Self-Distillation Fine-Tuning (SDFT) is a continual learning technique that:

  • Uses the model's in-context learning ability to create a demonstration-aware teacher
  • Generates training data on-policy from the student model
  • Minimizes KL divergence between student and demonstration-conditioned teacher
  • Enables learning new tasks while reducing catastrophic forgetting

Key advantages:

  • ✅ Learns from demonstrations without explicit reward functions
  • ✅ Maintains prior knowledge while acquiring new skills
  • ✅ On-policy learning improves generalization
  • ✅ Efficient training with EMA teacher updates (sketched below)
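
To make the loop concrete, here is a minimal sketch of one SDFT update, assuming Hugging Face-style student and teacher model objects. The function names and tensor slicing are illustrative reconstructions from the description above, not code from the actual training run:

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.02):
    # After each optimizer step, the teacher drifts slowly toward the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.lerp_(s, alpha)

def sdft_loss(student, teacher, student_ids, teacher_ids, temperature=1.0):
    # student_ids: tokenized query-only prompt; teacher_ids: tokenized
    # query + demonstration prompt (both formats appear under "Prompt Format").

    # 1. On-policy sampling: the student answers its own query-only prompt.
    with torch.no_grad():
        gen = student.generate(student_ids, max_new_tokens=128, do_sample=True)
    response = gen[:, student_ids.shape[1]:]
    n = response.shape[1]

    # 2. Score the same response under both models. Logits at position i
    #    predict token i+1, hence the shifted slice.
    s_logits = student(torch.cat([student_ids, response], dim=1)).logits
    s_logits = s_logits[:, -n - 1:-1, :] / temperature
    with torch.no_grad():
        t_logits = teacher(torch.cat([teacher_ids, response], dim=1)).logits
        t_logits = t_logits[:, -n - 1:-1, :] / temperature

    # 3. Analytic full-vocabulary KL(teacher || student) on the sampled tokens.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.log_softmax(t_logits, dim=-1),
                    log_target=True, reduction="batchmean")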

Quick Start

Installation

pip install torch transformers peft accelerate bitsandbytes

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

# Load the SDFT model in float16
model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/lfm2.5-1.5b-sdft",
    torch_dtype=torch.float16,
    device_map="auto"
)
model.eval()

# Generate
prompt = """<|im_start|>user
Explain how photosynthesis works.
<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Use official LiquidAI parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    pad_token_id=tokenizer.pad_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

With Demonstration (In-Context Learning)

prompt = """<|im_start|>user
Explain how databases work.

Here is an example response to guide you:

Example: Databases store data in tables. You can query them to get information back.

Now provide your own response following a similar approach:
<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Training Details

Dataset

  • Source: OpenAssistant conversations
  • Size: ~5,000 query-demonstration pairs
  • Preprocessing:
    • Filtered demonstrations: 20-2048 characters
    • Train/Val/Test split: 75%/10%/15%
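
A rough reconstruction of this preprocessing is sketched below, assuming the OpenAssistant/oasst1 dataset on the Hub; the field names and pairing logic are assumptions, not the actual preprocessing script:

from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")  # assumed source
by_id = {row["message_id"]: row["text"] for row in ds}

# Pair each assistant reply with its parent prompt, keeping replies
# between 20 and 2048 characters.
pairs = [
    {"query": by_id[row["parent_id"]], "demonstration": row["text"]}
    for row in ds
    if row["role"] == "assistant"
    and row["parent_id"] in by_id
    and 20 <= len(row["text"]) <= 2048
]

# 75% / 10% / 15% train / val / test split (shuffle first in practice).
n = len(pairs)
train = pairs[:int(0.75 * n)]
val = pairs[int(0.75 * n):int(0.85 * n)]
test = pairs[int(0.85 * n):]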

Training Configuration

# Model Architecture
- Base: LiquidAI/LFM2.5-1.2B-Instruct
- Quantization: 8-bit with bitsandbytes
- LoRA: rank=8, alpha=16, dropout=0.05
- Target modules: q_proj, v_proj

# Training Parameters
- Learning rate: 5e-6
- Optimizer: AdamW (weight_decay=0.01)
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 16
- Epochs: 3
- Max sequence length: 512
- Max generation length: 128

# SDFT-Specific
- EMA alpha: 0.02
- Temperature: 1.0
- KL divergence: Analytic (full vocabulary)
- On-policy generation: Yes
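
As a sketch, the configuration above might be instantiated with transformers and peft like this (not the exact training script):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 8-bit base model via bitsandbytes, as listed above.
model = AutoModelForCausalLM.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# LoRA adapters on the attention query/value projections.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()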

Prompt Format (Teacher vs Student)

Student Prompt (query only):

<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant

Teacher Prompt (query + demonstration):

<|im_start|>user
{query}

Here is an example response to guide you:

<|im_start|>assistant
{demonstration}
<|im_end|>

<|im_start|>user
Now provide your own response following a similar approach and reasoning:
<|im_end|>
<|im_start|>assistant
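
The two formats can be rendered with a small helper pair such as the following sketch (function names are illustrative):

def build_student_prompt(query: str) -> str:
    # Query only: the student never sees the demonstration.
    return f"<|im_start|>user\n{query}\n<|im_end|>\n<|im_start|>assistant\n"

def build_teacher_prompt(query: str, demonstration: str) -> str:
    # Query plus demonstration: the teacher conditions on the example.
    return (
        f"<|im_start|>user\n{query}\n\n"
        "Here is an example response to guide you:\n\n"
        f"<|im_start|>assistant\n{demonstration}\n<|im_end|>\n\n"
        "<|im_start|>user\n"
        "Now provide your own response following a similar approach and reasoning:\n"
        "<|im_end|>\n<|im_start|>assistant\n"
    )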

Evaluation Results

Tested on Multiple Dimensions:

Category           Description                      Performance
ICL Adaptation     Following demonstration style    ✅ Good
Task Improvement   Learning from examples           ✅ Good
Retention          No catastrophic forgetting       ✅ ~80%
Polarity Control   Following demo viewpoint         ⚠️ Moderate

Key Findings:

  1. ✅ Maintains Knowledge: No significant forgetting on general tasks
  2. ✅ Adapts to Demos: Successfully follows demonstration styles
  3. ✅ Improved Over Training: Epoch 3 shows stable, coherent outputs
  4. ⚠️ Model Size Limitation: 1.2B parameters limits complex reasoning

Comparison to Base Model:

  • With Demonstrations: SDFT shows better style matching and task following
  • Without Demonstrations: Maintains base model capabilities
  • Response Quality: More consistent and focused outputs

Generation Parameters

โš ๏ธ Important: Use official LiquidAI parameters for best results:

generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.1,        # Official LiquidAI recommendation
    "top_k": 50,               # Official LiquidAI recommendation
    "top_p": 0.1,              # Official LiquidAI recommendation
    "repetition_penalty": 1.05 # Official LiquidAI recommendation
}

These parameters are specifically tuned for LFM2.5 and provide:

  • Focused, factual responses
  • Minimal hallucinations
  • Consistent output quality
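
The dictionary above can be unpacked directly into generate(), for example:

# Reuse the tokenizer, model, and inputs from the Quick Start section.
outputs = model.generate(**inputs, **generation_config,
                         pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))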

Limitations

Model Constraints:

  • Size: 1.2B parameters (smaller capacity than 7B+ models)
  • Training Data: 5K samples (vs paper's 20K+)
  • Hardware: Single A100 (vs paper's multi-GPU setup)
  • Complexity: Limited reasoning on very complex tasks

Known Issues:

  • May require proper ChatML formatting for best results
  • Performance degrades on tasks requiring deep technical knowledge
  • Smaller model size limits polarity control effectiveness

Appropriate Use Cases:

  • ✅ Conversational AI with example-guided responses
  • ✅ Task learning from demonstrations
  • ✅ Style-adaptive text generation
  • ✅ Educational/research purposes

Not Recommended For:

  • โŒ Production systems requiring 100% reliability
  • โŒ Tasks requiring strong reasoning (use 7B+ models)
  • โŒ Safety-critical applications
  • โŒ Tasks outside training distribution without demonstrations

Bias and Ethical Considerations

  • Inherits biases from base LFM2.5 model and OpenAssistant dataset
  • May generate inconsistent responses on controversial topics
  • Should not be used for medical, legal, or financial advice
  • Outputs should be reviewed by humans for critical applications

Citation

If you use this model, please cite:

SDFT Paper:

@article{shenfeld2026sdft,
  title={Self-Distillation Enables Continual Learning},
  author={Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  journal={arXiv preprint arXiv:2601.19897},
  year={2026}
}

Base Model:

@misc{lfm25,
  title={LFM2.5: Liquid Foundation Models},
  author={LiquidAI},
  year={2024},
  url={https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct}
}

License

This model is released under the Apache 2.0 license, following the base model's licensing.

Model Card Authors

[Your Name/Organization]

Contact

For questions or issues, please open an issue on the model repository.


Version History

  • v1.0 (2024-XX-XX): Initial release
    • Trained on 5K OpenAssistant samples
    • 3 epochs with gradient accumulation
    • LoRA rank 8, alpha 16