HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱

A 1.08 billion parameter Hebrew instruction-tuned language model, fine-tuned from HebrewGPT-1B using LoRA Phase 2 curriculum distillation on 65K Hebrew instruction examples.

This is the latest and best instruct variant, achieving PPL 15.78 (down 47% from base) with 97.3% instruction following and zero repetition, trained for ~$12 on a single A10G GPU.

Model Details

  • Parameters: 1.08B (44.7M trainable via LoRA, 4%)
  • Architecture: Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP)
  • Base Model: HebrewGPT-1B (pretrained with Muon optimizer + SWA)
  • Fine-Tuning: LoRA SFT (rank=64, alpha=128)
  • Context Length: 2,048 tokens
  • Tokenizer: SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting
  • License: Apache 2.0
  • Language: Hebrew (he)

Architecture

HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:

  • Width: 1024, Depth: 8 layers, Heads: 8 (head_dim=128)
  • Interleaved blocks: Alternating RoPE multi-head attention and Mamba SSM layers
  • MLP: SwiGLU activation
  • Positional encoding: Rotary Position Embeddings (RoPE)
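
As an illustration of the positional encoding, RoPE rotates each pair of query/key channels by a position-dependent angle. The sketch below is a generic split-half RoPE in plain torch, not the model's actual implementation; shapes follow the card's head_dim=128 and 2,048-token context:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (seq_len, head_dim) by position-dependent angles."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One frequency per channel pair, decaying geometrically with channel index
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1[i], x2[i]) pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2048, 128)  # (context length, head_dim) as in this model
q_rot = apply_rope(q)
```

Because RoPE is a pure rotation, it preserves vector norms while making attention scores depend on relative position.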

Training: LoRA Phase 2

Method

  • LoRA SFT with rank=64, alpha=128
  • Target modules: qkv, proj, gate, up, down
  • Trainable parameters: 44.7M / 1.08B (4%)
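
The mechanics can be sketched as a trainable low-rank update added to a frozen pretrained linear layer (a generic illustration, not the project's training code; the listed target modules qkv, proj, gate, up, down would each be wrapped this way):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r                      # 128 / 64 = 2.0

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024, bias=False))  # width 1024 as in this model
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# Only A and B train: 2 * 64 * 1024 = 131,072 parameters for this one projection
```

Zero-initializing B means the adapted layer starts out exactly equal to the frozen base layer, so fine-tuning begins from the pretrained behavior.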

Data

  • 65K examples combined from a two-phase curriculum:
    • Phase 1 (ELI5 simple): 28.5K examples of simple explanations for foundational instruction following
    • Phase 2 (Sonnet/Nemotron complex): 36.5K examples of advanced, diverse instruction data

Two-Phase Curriculum

The training uses a curriculum distillation approach: starting with simple ELI5-style examples to establish instruction-following behavior, then progressing to complex Sonnet/Nemotron-generated examples for advanced capabilities.
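
At the data-loader level, such phase ordering might look like the sketch below (illustrative only; the card does not publish the actual pipeline, and the string examples are toy stand-ins for the 28.5K ELI5 and 36.5K Sonnet/Nemotron sets):

```python
def curriculum_batches(phase1, phase2, batch_size=8):
    """Yield batches drawn from phase-1 (simple) examples first, then phase-2 (complex)."""
    ordered = list(phase1) + list(phase2)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Toy stand-ins for the two curriculum phases
phase1 = [f"simple-{i}" for i in range(20)]
phase2 = [f"complex-{i}" for i in range(30)]
batches = list(curriculum_batches(phase1, phase2, batch_size=10))
# Early batches contain only simple examples; late batches only complex ones
```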

Training Details

  • Hardware: NVIDIA A10G (AWS g5.2xlarge)
  • Training time: ~8 hours
  • Best validation loss: 2.4768 (BPB 3.57)
  • Early stopping: step ~1000 (patience 5)
  • Total cost: ~$12
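
The reported BPB is consistent with the validation loss, assuming the loss is measured in nats: converting to base 2 reproduces the table's 3.57.

```python
import math

val_loss_nats = 2.4768               # best validation loss from the table above
bits = val_loss_nats / math.log(2)   # nats -> bits
print(round(bits, 2))                # 3.57, matching the reported BPB
```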

Evaluation Results

  • Perplexity: 25.14 (base) → 15.78 (-37%)
  • Instruction Following: 97.3%
  • MCQA: 10%
  • Repetition Rate: 0.006 (base) → 0.001 (-83%)
  • High-rep Outputs: 0%
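
The card does not define how the repetition rate is computed; a common proxy, shown here purely as an assumption, is the fraction of repeated n-grams in a generated token sequence:

```python
def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams occurring more than once (0.0 = no repetition)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

clean = list(range(100))    # 100 distinct tokens, no repeats
loopy = [1, 2, 3, 4] * 25   # a degenerate repetition loop
```

Under this proxy, a healthy generation scores near 0.0 while a repetition loop scores near 1.0.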

Key Improvements

  • Perplexity: 29.75 → 15.78 (-47% from the base pretrained model)
  • Zero repetition: Phase 1 distillation suffered from severe repetition loops; LoRA Phase 2 eliminates them entirely
  • Fluent Hebrew generation across diverse topics
  • 97.3% instruction following rate: the model reliably follows the instruction format
  • Total post-training cost: ~$12 on a single NVIDIA A10G GPU

Usage

import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the merged LoRA Phase 2 weights onto CPU
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see the HebrewGPT-1B repository for
# the model class definition), then load the weights:
# model.load_state_dict(state_dict)

Prompt Format

The model was trained with a structured instruction format:

### ื”ื•ืจืื”:
{instruction}

### ืงืœื˜:
{input}

### ืชืฉื•ื‘ื”:
{response}

For inference, provide the instruction and input, then let the model generate after ### ืชืฉื•ื‘ื”:.
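
Assembling the format as a helper might look like this (a sketch; the exact whitespace the model expects should be verified against the training data, and the example instruction is hypothetical):

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble the structured Hebrew instruction prompt; the model generates after the answer marker."""
    prompt = f"### הוראה:\n{instruction}\n\n"
    if input_text:  # the input section is omitted for instruction-only prompts
        prompt += f"### קלט:\n{input_text}\n\n"
    prompt += "### תשובה:\n"
    return prompt

# Hypothetical example: "translate to Hebrew" with the input "Hello"
prompt = build_prompt("תרגם לעברית", "Hello")
```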

Files

  • model.pt: LoRA Phase 2 merged clean weights (2.1 GB)
  • tokenizer.model: SentencePiece BPE tokenizer (8,192 vocab)

Limitations

  • Factual accuracy is limited, as expected for a 1B parameter model
  • HTML entity artifacts from training data contamination (e.g., … appearing in outputs)
  • MCQA is still weak (10%) and needs MCQA-specific training data to improve
  • The 2,048-token context window limits long-document tasks
  • Small vocabulary (8,192 tokens) may limit performance on rare words
  • Hebrew-specific model with limited multilingual capability

Base Model: HebrewGPT-1B

Built on HebrewGPT-1B, a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.

Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)

  • Hebrew Wikipedia (12%): encyclopedia articles
  • Supreme Court Rulings (22%): Israeli legal corpus
  • Ben Yehuda Project (23%): classic Hebrew literature
  • C4 Hebrew (20%): web-crawled text (cleaned)
  • CC100 Hebrew (19%): CommonCrawl filtered
  • Task-specific (4%): QA, NLI, sentiment prompts
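
The mixture shares above can be realized by weighted sampling over the corpora. This is an illustrative sketch, not the project's actual data pipeline:

```python
import random

# Target mixture from the table above
shares = {
    "wikipedia": 0.12, "supreme_court": 0.22, "ben_yehuda": 0.23,
    "c4_he": 0.20, "cc100_he": 0.19, "task_specific": 0.04,
}

rng = random.Random(0)  # fixed seed for reproducibility
names, weights = zip(*shares.items())
# Draw 10,000 documents; empirical frequencies approximate the target shares
sample = rng.choices(names, weights=weights, k=10_000)
```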

Pre-Training Details

  • Tokens: 9.8B (3.9 epochs over 2.48B unique)
  • Hardware: 8×H100 80GB (p5.48xlarge), 8 hours
  • Optimizer: Muon + SWA (12.3% better BPB than AdamW at 1B scale)
  • Perplexity: 29.75 (SWA)
  • Research: 200 autonomous experiments across 4 versions, 100% hit rate in v4

Infrastructure

  • Research Orchestration: Amazon Bedrock (Claude) via OpenClaw
  • Training Compute: AWS EC2 g5.2xlarge (NVIDIA A10G)
  • Data Pipeline: Automated dataset collection, translation, and balancing

Citation

@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
  note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}

License

Apache 2.0
