HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱
A 1.08 billion parameter Hebrew instruction-tuned language model, fine-tuned from HebrewGPT-1B using LoRA Phase 2 curriculum distillation on 65K Hebrew instruction examples.
This is the latest and best instruct variant, achieving perplexity 15.78 (47% below the pretrained base's 29.75) with 97.3% instruction following and no high-repetition outputs, trained for ~$12 on a single A10G GPU.
- 📄 Paper: Autonomous AI-Driven Hebrew Language Model Research
- 💻 GitHub: AgenticResearcher
- 🏗️ Base Model: HebrewGPT-1B
Model Details
| Property | Value |
|---|---|
| Parameters | 1.08B (44.7M trainable via LoRA, 4%) |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| Base Model | HebrewGPT-1B (pretrained with Muon optimizer + SWA) |
| Fine-Tuning | LoRA SFT (rank=64, alpha=128) |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| License | Apache 2.0 |
| Language | Hebrew (he) |
Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- Width: 1024, Depth: 8 layers, Heads: 8 (head_dim=128)
- Interleaved blocks: Alternating RoPE multi-head attention and Mamba SSM layers
- MLP: SwiGLU activation
- Positional encoding: Rotary Position Embeddings (RoPE)
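The interleaving described above can be sketched as a simple layer schedule. Note that the exact alternation order (attention first vs. Mamba first) is an assumption for illustration, not something the card specifies:

```python
# Sketch of the interleaved layer schedule for the 8-layer hybrid stack.
# The exact ordering (attention first) is an illustrative assumption.
DEPTH = 8

def layer_schedule(depth: int) -> list:
    """Alternate RoPE multi-head attention and Mamba SSM blocks."""
    return ["attention" if i % 2 == 0 else "mamba" for i in range(depth)]

schedule = layer_schedule(DEPTH)
print(schedule)
```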
Training: LoRA Phase 2
Method
- LoRA SFT with rank=64, alpha=128
- Target modules: qkv, proj, gate, up, down
- Trainable parameters: 44.7M / 1.08B (4%)
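The "4%" trainable fraction follows directly from the parameter counts above:

```python
# Check the trainable-parameter fraction reported for LoRA SFT.
total_params = 1.08e9   # 1.08B total parameters
trainable = 44.7e6      # 44.7M LoRA parameters
fraction = trainable / total_params
print(f"{fraction:.1%}")  # 4.1%, rounded to 4% in the card
```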
Data
- 65K examples combined from a two-phase curriculum:
  - Phase 1 (ELI5 simple): 28.5K examples of simple explanations for foundational instruction following
  - Phase 2 (Sonnet/Nemotron complex): 36.5K examples of advanced, diverse instruction data
Two-Phase Curriculum
The training uses a curriculum distillation approach: starting with simple ELI5-style examples to establish instruction-following behavior, then progressing to complex Sonnet/Nemotron-generated examples for advanced capabilities.
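A minimal sketch of that ordering, assuming the curriculum is a plain concatenation of the two phases (the dataset labels and record format here are illustrative, not the actual data schema):

```python
# Two-phase curriculum: simple ELI5 examples first, then complex
# Sonnet/Nemotron examples. Record format is an illustrative assumption.
phase1 = [{"source": "eli5", "difficulty": "simple"}] * 28_500
phase2 = [{"source": "sonnet_nemotron", "difficulty": "complex"}] * 36_500

# Curriculum = concatenation in order of increasing difficulty,
# with no shuffling across the phase boundary.
curriculum = phase1 + phase2
print(len(curriculum))  # 65000
```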
Training Details
| Property | Value |
|---|---|
| Hardware | NVIDIA A10G (AWS g5.2xlarge) |
| Training time | ~8 hours |
| Best validation loss | 2.4768 (BPB 3.57) |
| Early stopping | Step ~1000 (patience 5) |
| Total cost | ~$12 |
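The reported BPB figure is consistent with the validation loss, assuming the loss is measured in nats per byte (BPB = loss / ln 2):

```python
import math

# Bits-per-byte from the validation loss, assuming nats per byte.
val_loss = 2.4768
bpb = val_loss / math.log(2)
print(round(bpb, 2))  # 3.57, matching the table
```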
Evaluation Results
| Metric | Base Model | LoRA Phase 2 | Delta |
|---|---|---|---|
| Perplexity | 25.14 | 15.78 | -37% |
| Instruction Following | n/a | 97.3% | n/a |
| MCQA | n/a | 10% | n/a |
| Repetition Rate | 0.006 | 0.001 | -83% |
| High-rep Outputs | n/a | 0% | n/a |
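The Delta column is plain relative change against the base model:

```python
# Relative change versus the base model, as used in the Delta column.
def rel_change(base: float, new: float) -> float:
    return (new - base) / base

ppl_delta = rel_change(25.14, 15.78)   # perplexity
rep_delta = rel_change(0.006, 0.001)   # repetition rate
print(f"{ppl_delta:.0%}", f"{rep_delta:.0%}")  # -37% -83%
```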
Key Improvements
- Perplexity: 29.75 → 15.78, a 47% reduction from the pretrained base model
- Zero high-repetition outputs: Phase 1 distillation suffered severe repetition loops; LoRA Phase 2 eliminates them entirely
- Fluent Hebrew generation across diverse topics
- 97.3% instruction-following rate: the model reliably follows the instruction format
- Total post-training cost: ~$12 on a single NVIDIA A10G GPU
Usage
```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer (8,192-token Hebrew BPE vocabulary)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the merged LoRA Phase 2 weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model
# class definition), then load the weights:
# model.load_state_dict(state_dict)
# model.eval()
```
Prompt Format
The model was trained with a structured instruction format:
```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```

For inference, provide the instruction and input, then let the model generate after `### תשובה:`.
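A minimal prompt-builder sketch for this format (the helper name is illustrative, and dropping the input block when no input is given is an assumption about how input-free examples were formatted):

```python
# Build a prompt in the instruction format above.
# Helper name and the omit-empty-input behavior are assumptions.
def build_prompt(instruction: str, input_text: str = "") -> str:
    parts = [f"### הוראה:\n{instruction}\n"]
    if input_text:
        parts.append(f"### קלט:\n{input_text}\n")
    parts.append("### תשובה:\n")  # the model generates after this header
    return "\n".join(parts)

prompt = build_prompt("תרגם לעברית", "Hello")
print(prompt)
```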
Files
- `model.pt`: LoRA Phase 2 merged clean weights (2.1 GB)
- `tokenizer.model`: SentencePiece BPE tokenizer (8,192 vocab)
Limitations
- Limited factual accuracy, as expected for a 1B-parameter model
- HTML entity artifacts from training-data contamination occasionally appear in outputs
- MCQA remains weak (10%); MCQA-specific training data is needed to improve it
- The 2,048-token context window limits long-document tasks
- The small vocabulary (8,192 tokens) may limit performance on rare words
- Hebrew-specific model with limited multilingual capability
Base Model: HebrewGPT-1B
Built on HebrewGPT-1B, a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.
Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
| Dataset | Share | Description |
|---|---|---|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
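One way to read the Share column is as a sampling mixture over sources. A sketch of drawing documents by those weights (the sampling mechanism itself is an assumption; the card only states the shares):

```python
import random

# Sample pretraining sources by the mixture shares from the table above.
sources = ["wikipedia", "supreme_court", "ben_yehuda", "c4_he", "cc100_he", "tasks"]
shares  = [0.12, 0.22, 0.23, 0.20, 0.19, 0.04]  # sums to 100%

random.seed(0)
sample = random.choices(sources, weights=shares, k=100_000)
print(sample.count("ben_yehuda") / len(sample))  # roughly 0.23
```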
Pre-Training Details
- Tokens: 9.8B (3.9 epochs over 2.48B unique)
- Hardware: 8รH100 80GB (p5.48xlarge), 8 hours
- Optimizer: Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- Perplexity: 29.75 (SWA)
- Research: 200 autonomous experiments across 4 versions, 100% hit rate in v4
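The "3.9 epochs" figure follows from the token counts above:

```python
# Epochs = total training tokens over unique tokens.
total_tokens = 9.8e9
unique_tokens = 2.48e9
epochs = total_tokens / unique_tokens  # about 3.95, reported as ~3.9
print(f"{epochs:.2f}")
```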
Infrastructure
- Research Orchestration: Amazon Bedrock (Claude) via OpenClaw
- Training Compute: AWS EC2 g5.2xlarge (NVIDIA A10G)
- Data Pipeline: Automated dataset collection, translation, and balancing
Citation
```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
  note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
License
Apache 2.0