# Llama 3.2 3B - Distilled from Llama 3.1 8B Instruct
This repository contains a 3B Llama model trained with knowledge distillation from the larger Llama-3.1-8B-Instruct teacher.
The goal of this project was to produce a smaller, faster model that preserves strong instruction-following capability.
## Overview
- Teacher Model: meta-llama/Llama-3.1-8B-Instruct
- Student Model: meta-llama/Llama-3.2-3B-Instruct
- Objective: Compress the teacher's behavior into a lighter 3B-parameter model for improved latency and deployment efficiency.
- Method: Logits-based distillation with temperature scaling and a combined KD + student loss.
This model is ideal for:
- Chat-based applications
- Lightweight on-device or serverless inference
- Fast prototyping with low compute resources
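A minimal inference sketch with the Hugging Face transformers library is shown below. It assumes the distilled weights are published under this repository's id (SwashBuckler001/Llama-3.2-3B-distill-GLoRE) and that the student's Llama chat template is retained; adjust the id and generation settings to your setup.

```python
# Minimal chat-inference sketch (assumes the weights live under this repo id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwashBuckler001/Llama-3.2-3B-distill-GLoRE"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain knowledge distillation in two sentences."}
]
# Apply the Llama chat template and move the prompt to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)

# Print only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```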
## Distillation Process
Distillation was performed using a single epoch over a curated instruction-tuning dataset.
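The exact objective inside distill.py is not reproduced here, but the `--temperature` and `--alpha` flags below correspond to the standard logits-based formulation sketched next; the function name and signature are illustrative, not the script's actual internals.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Illustrative combined KD + student loss (padding masking of the KD term omitted)."""
    # Shift so position t predicts token t+1, the usual causal-LM alignment.
    s = student_logits[..., :-1, :].contiguous()
    t = teacher_logits[..., :-1, :].contiguous()
    y = labels[..., 1:].contiguous()
    vocab = s.size(-1)

    # Soft-target term: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so its gradient magnitude matches the hard-label term.
    kd = F.kl_div(
        F.log_softmax(s.view(-1, vocab) / temperature, dim=-1),
        F.softmax(t.view(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # Hard-label term: cross-entropy against the ground-truth tokens (-100 = ignore).
    ce = F.cross_entropy(s.view(-1, vocab), y.view(-1), ignore_index=-100)

    # alpha balances soft and hard targets; alpha=0.5 weights them equally.
    return alpha * kd + (1.0 - alpha) * ce
```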
### Training Command
```bash
python distill.py \
  --teacher-model meta-llama/Llama-3.1-8B-Instruct \
  --student-model meta-llama/Llama-3.2-3B-Instruct \
  --train-file data/train.jsonl \
  --val-file data/test.jsonl \
  --output-dir llama3_3B_distilled \
  --batch-size 8 \
  --gradient-accumulation-steps 1 \
  --num-epochs 1 \
  --learning-rate 1e-5 \
  --warmup-steps 300 \
  --max-length 1024 \
  --temperature 2.0 \
  --alpha 0.5 \
  --save-steps 2000 \
  --eval-steps 200 \
  --logging-steps 5 \
  --bf16
```
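For context, a distillation step matching the command above typically keeps the teacher frozen and combines its logits with the student's. The sketch below reuses the illustrative `distillation_loss` function from earlier and is an assumption about the workflow, not the actual distill.py code; because both checkpoints use the Llama 3 tokenizer, the teacher and student logits have matching vocabulary dimensions.

```python
# Illustrative teacher/student step for one batch (assumed workflow, not distill.py itself).
import torch
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()  # teacher is frozen; it only supplies soft targets
for p in teacher.parameters():
    p.requires_grad_(False)

student = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

def kd_step(batch):
    # batch holds input_ids, attention_mask, and labels (-100 on prompt/padding tokens).
    with torch.no_grad():
        teacher_logits = teacher(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        ).logits
    student_logits = student(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    ).logits
    # distillation_loss is the combined KD + CE sketch shown above.
    return distillation_loss(
        student_logits, teacher_logits, batch["labels"], temperature=2.0, alpha=0.5
    )
```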