🦙 Llama 3.2 3B – Distilled from Llama 3.1 8B Instruct

This repository contains a distilled 3B-parameter Llama model, trained using knowledge distillation from the larger Llama-3.1-8B-Instruct teacher model.
The goal of this project was to create a smaller, faster model while preserving strong instruction-following capabilities.


🔥 Overview

  • Teacher Model: meta-llama/Llama-3.1-8B-Instruct
  • Student Model: meta-llama/Llama-3.2-3B-Instruct
  • Objective: Compress the teacher's behavior into a lighter 3B-parameter model for improved latency and deployment efficiency.
  • Method: Logits-based distillation with temperature scaling and a combined KD + student loss (a minimal sketch follows this list).
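
For reference, below is a minimal PyTorch sketch of this combined objective. It assumes the conventional temperature-scaled KL formulation; the exact implementation in distill.py may differ, although the default values here match the --temperature 2.0 and --alpha 0.5 flags in the training command further down.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-target term: standard next-token cross-entropy against the ground-truth labels.
    student_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # alpha balances the distillation term against the student's own loss.
    return alpha * kd_loss + (1.0 - alpha) * student_loss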

This model is ideal for:

  • Chat-based applications
  • Lightweight on-device or serverless inference
  • Fast prototyping with low compute resources

🧪 Distillation Process

Distillation was performed using a single epoch over a curated instruction-tuning dataset.

Training Command

python distill.py \
  --teacher-model meta-llama/Llama-3.1-8B-Instruct \
  --student-model meta-llama/Llama-3.2-3B-Instruct \
  --train-file data/train.jsonl \
  --val-file data/test.jsonl \
  --output-dir llama3_3B_distilled \
  --batch-size 8 \
  --gradient-accumulation-steps 1 \
  --num-epochs 1 \
  --learning-rate 1e-5 \
  --warmup-steps 300 \
  --max-length 1024 \
  --temperature 2.0 \
  --alpha 0.5 \
  --save-steps 2000 \
  --eval-steps 200 \
  --logging-steps 5 \
  --bf16
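
Inference Example

Once trained, the distilled checkpoint can be loaded like any other Llama instruct model. The snippet below is a minimal sketch using the Transformers library; the repo id is taken from this model card, and the prompt and generation settings are illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwashBuckler001/Llama-3.2-3B-distill-GLoRE"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain knowledge distillation in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))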