🦙 Llama 3.2 3B – Distilled from Llama 3.1 8B Instruct

This repository contains a distilled 3B-parameter Llama model, trained using knowledge distillation from the larger Llama-3.1-8B-Instruct teacher model.
The goal of this project was to create a smaller, faster model while preserving strong instruction-following capabilities.


🔥 Overview

  • Teacher Model: meta-llama/Llama-3.1-8B-Instruct
  • Student Model: meta-llama/Llama-3.2-3B-Instruct
  • Objective: Compress the teacher's behavior into a lighter 3B-parameter model for improved latency and deployment efficiency.
  • Method: Logits-based distillation with temperature scaling and a combined KD + student loss (a minimal sketch follows this list).
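
For reference, below is a minimal PyTorch sketch of this combined objective. It assumes the conventional temperature-scaled KL formulation; the exact implementation in distill.py may differ, although the default values here match the --temperature 2.0 and --alpha 0.5 flags in the training command further down.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-target term: standard next-token cross-entropy against the ground-truth labels.
    student_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # alpha balances the distillation term against the student's own loss.
    return alpha * kd_loss + (1.0 - alpha) * student_loss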

This model is ideal for:

  • Chat-based applications
  • Lightweight on-device or serverless inference
  • Fast prototyping with low compute resources

🧪 Distillation Process

Distillation was performed using a single epoch over a curated instruction-tuning dataset.

Training Command

python distill.py \
  --teacher-model meta-llama/Llama-3.1-8B-Instruct \
  --student-model meta-llama/Llama-3.2-3B-Instruct \
  --train-file data/train.jsonl \
  --val-file data/test.jsonl \
  --output-dir llama3_3B_distilled \
  --batch-size 8 \
  --gradient-accumulation-steps 1 \
  --num-epochs 1 \
  --learning-rate 1e-5 \
  --warmup-steps 300 \
  --max-length 1024 \
  --temperature 2.0 \
  --alpha 0.5 \
  --save-steps 2000 \
  --eval-steps 200 \
  --logging-steps 5 \
  --bf16
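
Inference Example

Once trained, the distilled checkpoint can be loaded like any other Llama instruct model. The snippet below is a minimal sketch using the Transformers library; the repo id is taken from this model card, and the prompt and generation settings are illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwashBuckler001/Llama-3.2-3B-distill-GLoRE"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain knowledge distillation in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))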