HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱
A 1.08 billion parameter Hebrew instruction-tuned language model, fine-tuned from HebrewGPT-1B using LoRA Phase 2 curriculum distillation on 65K Hebrew instruction examples.
This is the latest and best instruct variant, achieving perplexity 15.78 (47% below the pretrained base's 29.75) with 97.3% instruction following and no high-repetition outputs, trained for ~$12 on a single A10G GPU.
- 📄 Paper: Autonomous AI-Driven Hebrew Language Model Research
- 💻 GitHub: AgenticResearcher
- 🏗️ Base Model: HebrewGPT-1B
Model Details
| Property | Value |
|---|---|
| Parameters | 1.08B (44.7M trainable via LoRA, 4%) |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| Base Model | HebrewGPT-1B (pretrained with Muon optimizer + SWA) |
| Fine-Tuning | LoRA SFT (rank=64, alpha=128) |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| License | Apache 2.0 |
| Language | Hebrew (he) |
Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- Width: 1024, Depth: 8 layers, Heads: 8 (head_dim=128)
- Interleaved blocks: Alternating RoPE multi-head attention and Mamba SSM layers
- MLP: SwiGLU activation
- Positional encoding: Rotary Position Embeddings (RoPE)
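The interleaving described above can be sketched as a simple layer schedule. Note that the exact alternation order (attention first vs. Mamba first) is an assumption for illustration, not something the card specifies:

```python
# Sketch of the interleaved layer schedule for the 8-layer hybrid stack.
# The exact ordering (attention first) is an illustrative assumption.
DEPTH = 8

def layer_schedule(depth: int) -> list:
    """Alternate RoPE multi-head attention and Mamba SSM blocks."""
    return ["attention" if i % 2 == 0 else "mamba" for i in range(depth)]

schedule = layer_schedule(DEPTH)
print(schedule)
```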
Training: LoRA Phase 2
Method
- LoRA SFT with rank=64, alpha=128
- Target modules: qkv, proj, gate, up, down
- Trainable parameters: 44.7M / 1.08B (4%)
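The "4%" trainable fraction follows directly from the parameter counts above:

```python
# Check the trainable-parameter fraction reported for LoRA SFT.
total_params = 1.08e9   # 1.08B total parameters
trainable = 44.7e6      # 44.7M LoRA parameters
fraction = trainable / total_params
print(f"{fraction:.1%}")  # 4.1%, rounded to 4% in the card
```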
Data
- 65K examples combined from a two-phase curriculum:
  - Phase 1 (ELI5 simple): 28.5K examples of simple explanations for foundational instruction following
  - Phase 2 (Sonnet/Nemotron complex): 36.5K examples of advanced, diverse instruction data
Two-Phase Curriculum
The training uses a curriculum distillation approach: starting with simple ELI5-style examples to establish instruction-following behavior, then progressing to complex Sonnet/Nemotron-generated examples for advanced capabilities.
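A minimal sketch of that ordering, assuming the curriculum is a plain concatenation of the two phases (the dataset labels and record format here are illustrative, not the actual data schema):

```python
# Two-phase curriculum: simple ELI5 examples first, then complex
# Sonnet/Nemotron examples. Record format is an illustrative assumption.
phase1 = [{"source": "eli5", "difficulty": "simple"}] * 28_500
phase2 = [{"source": "sonnet_nemotron", "difficulty": "complex"}] * 36_500

# Curriculum = concatenation in order of increasing difficulty,
# with no shuffling across the phase boundary.
curriculum = phase1 + phase2
print(len(curriculum))  # 65000
```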
Training Details
| Property | Value |
|---|---|
| Hardware | NVIDIA A10G (AWS g5.2xlarge) |
| Training time | ~8 hours |
| Best validation loss | 2.4768 (BPB 3.57) |
| Early stopping | Step ~1000 (patience 5) |
| Total cost | ~$12 |
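The reported BPB figure is consistent with the validation loss, assuming the loss is measured in nats per byte (BPB = loss / ln 2):

```python
import math

# Bits-per-byte from the validation loss, assuming nats per byte.
val_loss = 2.4768
bpb = val_loss / math.log(2)
print(round(bpb, 2))  # 3.57, matching the table
```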
Evaluation Results
| Metric | Base Model | LoRA Phase 2 | Delta |
|---|---|---|---|
| Perplexity | 25.14 | 15.78 | -37% |
| Instruction Following | n/a | 97.3% | n/a |
| MCQA | n/a | 10% | n/a |
| Repetition Rate | 0.006 | 0.001 | -83% |
| High-rep Outputs | n/a | 0% | n/a |
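The Delta column is plain relative change against the base model:

```python
# Relative change versus the base model, as used in the Delta column.
def rel_change(base: float, new: float) -> float:
    return (new - base) / base

ppl_delta = rel_change(25.14, 15.78)   # perplexity
rep_delta = rel_change(0.006, 0.001)   # repetition rate
print(f"{ppl_delta:.0%}", f"{rep_delta:.0%}")  # -37% -83%
```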
Key Improvements
- Perplexity: 29.75 → 15.78, a 47% reduction from the pretrained base model
- Zero high-repetition outputs: Phase 1 distillation suffered severe repetition loops; LoRA Phase 2 eliminates them entirely
- Fluent Hebrew generation across diverse topics
- 97.3% instruction-following rate: the model reliably follows the instruction format
- Total post-training cost: ~$12 on a single NVIDIA A10G GPU
Usage
```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer (8,192-token Hebrew BPE vocabulary)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the merged LoRA Phase 2 weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model
# class definition), then load the weights:
# model.load_state_dict(state_dict)
# model.eval()
```
Prompt Format
The model was trained with a structured instruction format:
```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```

For inference, provide the instruction and input, then let the model generate after `### תשובה:`.
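A minimal prompt-builder sketch for this format (the helper name is illustrative, and dropping the input block when no input is given is an assumption about how input-free examples were formatted):

```python
# Build a prompt in the instruction format above.
# Helper name and the omit-empty-input behavior are assumptions.
def build_prompt(instruction: str, input_text: str = "") -> str:
    parts = [f"### הוראה:\n{instruction}\n"]
    if input_text:
        parts.append(f"### קלט:\n{input_text}\n")
    parts.append("### תשובה:\n")  # the model generates after this header
    return "\n".join(parts)

prompt = build_prompt("תרגם לעברית", "Hello")
print(prompt)
```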
Files
- `model.pt`: LoRA Phase 2 merged clean weights (2.1 GB)
- `tokenizer.model`: SentencePiece BPE tokenizer (8,192 vocab)
Limitations
- Limited factual accuracy, as expected for a 1B-parameter model
- HTML entity artifacts from training-data contamination occasionally appear in outputs
- MCQA remains weak (10%); MCQA-specific training data is needed to improve it
- The 2,048-token context window limits long-document tasks
- The small vocabulary (8,192 tokens) may limit performance on rare words
- Hebrew-specific model with limited multilingual capability
Base Model: HebrewGPT-1B
Built on HebrewGPT-1B, a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.
Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
| Dataset | Share | Description |
|---|---|---|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
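One way to read the Share column is as a sampling mixture over sources. A sketch of drawing documents by those weights (the sampling mechanism itself is an assumption; the card only states the shares):

```python
import random

# Sample pretraining sources by the mixture shares from the table above.
sources = ["wikipedia", "supreme_court", "ben_yehuda", "c4_he", "cc100_he", "tasks"]
shares  = [0.12, 0.22, 0.23, 0.20, 0.19, 0.04]  # sums to 100%

random.seed(0)
sample = random.choices(sources, weights=shares, k=100_000)
print(sample.count("ben_yehuda") / len(sample))  # roughly 0.23
```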
Pre-Training Details
- Tokens: 9.8B (3.9 epochs over 2.48B unique)
- Hardware: 8รH100 80GB (p5.48xlarge), 8 hours
- Optimizer: Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- Perplexity: 29.75 (SWA)
- Research: 200 autonomous experiments across 4 versions, 100% hit rate in v4
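The "3.9 epochs" figure follows from the token counts above:

```python
# Epochs = total training tokens over unique tokens.
total_tokens = 9.8e9
unique_tokens = 2.48e9
epochs = total_tokens / unique_tokens  # about 3.95, reported as ~3.9
print(f"{epochs:.2f}")
```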
Infrastructure
- Research Orchestration: Amazon Bedrock (Claude) via OpenClaw
- Training Compute: AWS EC2 g5.2xlarge (NVIDIA A10G)
- Data Pipeline: Automated dataset collection, translation, and balancing
Citation
```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
  note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
License
Apache 2.0